
Lecture notes for

Advanced Data Analysis


ADA1 Stat 427/527 & ADA2 Stat 428/528
Department of Mathematics and Statistics
University of New Mexico

Erik B. Erhardt
Edward J. Bedrick
Ronald M. Schrader

Fall 2016 – Spring 2017




Contents

I ADA1: Software 1
0 Introduction to R, Rstudio, and ggplot 3
0.1 R building blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.2 Plotting with ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
0.2.1 Improving plots . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.3 Course Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

II ADA1: Summaries and displays, and one-, two-, and many-way tests of means 23
1 Summarizing and Displaying Data 25
1.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2 Numerical summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 Graphical summaries for one quantitative sample . . . . . . . . . . . 31
1.3.1 Dotplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.3.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.3.3 Stem-and-leaf plot . . . . . . . . . . . . . . . . . . . . . . . . 34
1.3.4 Boxplot or box-and-whiskers plot . . . . . . . . . . . . . . . . 36
1.4 Interpretation of Graphical Displays for Numerical Data . . . . . . . 40
1.5 Interpretations for examples . . . . . . . . . . . . . . . . . . . . . . . 54

2 Estimation in One-Sample Problems 55


2.1 Inference for a population mean . . . . . . . . . . . . . . . . . . . . . 56
2.1.1 Standard error, LLN, and CLT . . . . . . . . . . . . . . . . . 57
2.1.2 z-score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.1.3 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2 CI for µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2.1 Assumptions for procedures . . . . . . . . . . . . . . . . . . . 66
2.2.2 The effect of α on a two-sided CI . . . . . . . . . . . . . . . . 69
2.3 Hypothesis Testing for µ . . . . . . . . . . . . . . . . . . . . . . . . . 69


2.3.1 P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.3.2 Assumptions for procedures . . . . . . . . . . . . . . . . . . . 72
2.3.3 The mechanics of setting up hypothesis tests . . . . . . . . . . 79
2.3.4 The effect of α on the rejection region of a two-sided test . . . 81
2.4 Two-sided tests, CI and p-values . . . . . . . . . . . . . . . . . . . . . 82
2.5 Statistical versus practical significance . . . . . . . . . . . . . . . . . 83
2.6 Design issues and power . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.7 One-sided tests on µ . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.7.1 One-sided CIs . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3 Two-Sample Inferences 91
3.1 Comparing Two Sets of Measurements . . . . . . . . . . . . . . . . . 92
3.1.1 Plotting head breadth data: . . . . . . . . . . . . . . . . . . . 93
3.1.2 Salient Features to Notice . . . . . . . . . . . . . . . . . . . . 99
3.2 Two-Sample Methods: Paired Versus Independent Samples . . . . . . 99
3.3 Two Independent Samples: CI and Test Using Pooled Variance . . . . 100
3.4 Satterthwaite’s Method, unequal variances . . . . . . . . . . . . . . . 101
3.4.1 R Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.5 One-Sided Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.6 Paired Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.6.1 R Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.7 Should You Compare Means? . . . . . . . . . . . . . . . . . . . . . . 120

4 Checking Assumptions 123


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.2 Testing Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2.1 Normality tests on non-normal data . . . . . . . . . . . . . . . 127
4.3 Formal Tests of Normality . . . . . . . . . . . . . . . . . . . . . . . . 131
4.4 Testing Equal Population Variances . . . . . . . . . . . . . . . . . . . 139
4.5 Small sample sizes, a comment . . . . . . . . . . . . . . . . . . . . . . 141

5 One-Way Analysis of Variance 143


5.1 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2 Multiple Comparison Methods: Fisher’s Method . . . . . . . . . . . . 151
5.2.1 FSD Multiple Comparisons in R . . . . . . . . . . . . . . . . . 153
5.2.2 Bonferroni Comparisons . . . . . . . . . . . . . . . . . . . . . 154
5.3 Further Discussion of Multiple Comparisons . . . . . . . . . . . . . . 159
5.4 Checking Assumptions in ANOVA Problems . . . . . . . . . . . . . . 161
5.5 Example from the Child Health and Development Study (CHDS) . . 164


III ADA1: Nonparametric, categorical, and regression methods 173
6 Nonparametric Methods 175
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.2 The Sign Test and CI for a Population Median . . . . . . . . . . . . . 176
6.3 Wilcoxon Signed-Rank Procedures . . . . . . . . . . . . . . . . . . . . 183
6.3.1 Nonparametric Analyses of Paired Data . . . . . . . . . . . . 187
6.3.2 Comments on One-Sample Nonparametric Methods . . . . . . 189
6.4 (Wilcoxon-)Mann-Whitney Two-Sample Procedure . . . . . . . . . . 190
6.5 Alternatives for ANOVA and Planned Comparisons . . . . . . . . . . 198
6.5.1 Kruskal-Wallis ANOVA . . . . . . . . . . . . . . . . . . . . . . 199
6.5.2 Transforming Data . . . . . . . . . . . . . . . . . . . . . . . . 199
6.5.3 Planned Comparisons . . . . . . . . . . . . . . . . . . . . . . . 212
6.5.4 Two final ANOVA comments . . . . . . . . . . . . . . . . . . 215
6.6 Permutation tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.6.1 Linear model permutation tests in R . . . . . . . . . . . . . . 219
6.7 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

7 Categorical Data Analysis 227


7.1 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.2 Single Proportion Problems . . . . . . . . . . . . . . . . . . . . . . . 231
7.2.1 A CI for p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.2.2 Hypothesis Tests on Proportions . . . . . . . . . . . . . . . . 233
7.2.3 The p-value for a two-sided test . . . . . . . . . . . . . . . . . 235
7.2.4 Appropriateness of Test . . . . . . . . . . . . . . . . . . . . . 236
7.2.5 R Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.2.6 One-Sided Tests and One-Sided Confidence Bounds . . . . . . 237
7.2.7 Small Sample Procedures . . . . . . . . . . . . . . . . . . . . . 239
7.3 Analyzing Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.4 Goodness-of-Fit Tests (Multinomial) . . . . . . . . . . . . . . . . . . 244
7.4.1 Adequacy of the Goodness-of-Fit Test . . . . . . . . . . . . . 246
7.4.2 R Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.4.3 Multiple Comparisons in a Goodness-of-Fit Problem . . . . . 249
7.5 Comparing Two Proportions: Independent Samples . . . . . . . . . . 251
7.5.1 Large Sample CI and Tests for p1 − p2 . . . . . . . . . . . . . 251
7.6 Effect Measures in Two-by-Two Tables . . . . . . . . . . . . . . . . . 258
7.7 Analysis of Paired Samples: Dependent Proportions . . . . . . . . . . 260
7.8 Testing for Homogeneity of Proportions . . . . . . . . . . . . . . . . . 263
7.8.1 Adequacy of the Chi-Square Approximation . . . . . . . . . . 268
7.9 Testing for Homogeneity in Cross-Sectional and Stratified Studies . . 268


7.9.1 Testing for Independence in a Two-Way Contingency Table . . 270


7.9.2 Further Analyses in Two-Way Tables . . . . . . . . . . . . . . 271

8 Correlation and Regression 275


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
8.2 Logarithmic transformations . . . . . . . . . . . . . . . . . . . . . . . 278
8.2.1 Log-linear and log-log relationships: amoebas, squares, and cubes 278
8.3 Testing that ρ = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
8.3.1 The Spearman Correlation Coefficient . . . . . . . . . . . . . . 285
8.4 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 289
8.4.1 Linear Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8.4.2 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8.5 ANOVA Table for Regression . . . . . . . . . . . . . . . . . . . . . . 293
8.5.1 Brief discussion of the output for blood loss problem . . . . . 297
8.6 The regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
8.6.1 Back to the Data . . . . . . . . . . . . . . . . . . . . . . . . . 299
8.7 CI and tests for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.7.1 Testing β1 = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.8 A CI for the population regression line . . . . . . . . . . . . . . . . . 301
8.8.1 CI for predictions . . . . . . . . . . . . . . . . . . . . . . . . . 302
8.8.2 A further look at the blood loss data . . . . . . . . . . . . . . 303
8.9 Model Checking and Regression Diagnostics . . . . . . . . . . . . . . 304
8.9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
8.9.2 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 305
8.9.3 Checking Normality . . . . . . . . . . . . . . . . . . . . . . . . 310
8.9.4 Checking Independence . . . . . . . . . . . . . . . . . . . . . . 310
8.9.5 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.9.6 Influential Observations . . . . . . . . . . . . . . . . . . . . . 312
8.9.7 Summary of diagnostic measures . . . . . . . . . . . . . . . . 315
8.10 Regression analysis suggestion . . . . . . . . . . . . . . . . . . . . . . 315
8.10.1 Residual and Diagnostic Analysis of the Blood Loss Data . . . 316
8.10.2 Gesell data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.11 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . 326

IV ADA1: Additional topics 331


9 Introduction to the Bootstrap 333
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
9.2 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.2.1 Ideal versus Bootstrap world, sampling distributions . . . . . . 335


9.2.2 The accuracy of the sample mean . . . . . . . . . . . . . . . . 338


9.2.3 Comparing bootstrap sampling distribution from population
and sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344

10 Power and Sample size 349


10.1 Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.2 Effect size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.3 Sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.4 Power calculation via simulation . . . . . . . . . . . . . . . . . . . . . 359

11 Data Cleaning 365


11.1 The five steps of statistical analysis . . . . . . . . . . . . . . . . . . . 366
11.2 R background review . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.2.1 Variable types . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.2.2 Special values and value-checking functions . . . . . . . . . . . 368
11.3 From raw to technically correct data . . . . . . . . . . . . . . . . . . 369
11.3.1 Technically correct data . . . . . . . . . . . . . . . . . . . . . 369
11.3.2 Reading text data into an R data.frame . . . . . . . . . . . . 370
11.4 Type conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
11.4.1 Introduction to R’s typing system . . . . . . . . . . . . . . . . 377
11.4.2 Recoding factors . . . . . . . . . . . . . . . . . . . . . . . . . 378
11.4.3 Converting dates . . . . . . . . . . . . . . . . . . . . . . . . . 380
11.5 Character-type manipulation . . . . . . . . . . . . . . . . . . . . . . . 382
11.5.1 String normalization . . . . . . . . . . . . . . . . . . . . . . . 383
11.5.2 Approximate string matching . . . . . . . . . . . . . . . . . . 384
11.6 From technically correct data to consistent data . . . . . . . . . . . . 387
11.6.1 Detection and localization of errors . . . . . . . . . . . . . . . 388
11.6.2 Edit rules for detecting obvious inconsistencies . . . . . . . . . 393
11.6.3 Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
11.6.4 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403

V ADA2: Review of ADA1 405


1 R statistical software and review 407
1.1 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
1.2 ADA1 Ch 0: R warm-up . . . . . . . . . . . . . . . . . . . . . . . . . 408
1.3 ADA1 Chapters 2, 4, 6: Estimation in one-sample problems . . . . . 410
1.4 ADA1 Chapters 3, 4, 6: Two-sample inferences . . . . . . . . . . . . . 417
1.5 ADA1 Chapters 5, 4, 6: One-way ANOVA . . . . . . . . . . . . . . . 421
1.6 ADA1 Chapter 7: Categorical data analysis . . . . . . . . . . . . . . 432


1.7 ADA1 Chapter 8: Correlation and regression . . . . . . . . . . . . . . 436

VI ADA2: Introduction to multiple regression and model selection 445
2 Introduction to Multiple Linear Regression 447
2.1 Indian systolic blood pressure example . . . . . . . . . . . . . . . . . 447
2.1.1 Taking Weight Into Consideration . . . . . . . . . . . . . . . . 452
2.1.2 Important Points to Notice About the Regression Output . . . 453
2.1.3 Understanding the Model . . . . . . . . . . . . . . . . . . . . 455
2.2 GCE exam score example . . . . . . . . . . . . . . . . . . . . . . . . 458
2.2.1 Some Comments on GCE Analysis . . . . . . . . . . . . . . . 467

3 A Taste of Model Selection for Multiple Regression 473


3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
3.2 Backward Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 474
3.2.1 Maximum likelihood and AIC/BIC . . . . . . . . . . . . . . . 474
3.3 Example: Peru Indian blood pressure . . . . . . . . . . . . . . . . . . 475
3.3.1 Analysis for Selected Model . . . . . . . . . . . . . . . . . . . 484
3.4 Example: Dennis Cook’s Rat Data . . . . . . . . . . . . . . . . . . . 487

VII ADA2: Experimental design and observational studies 495
4 One Factor Designs and Extensions 497

5 Paired Experiments and Randomized Block Experiments 501


5.1 Analysis of a Randomized Block Design . . . . . . . . . . . . . . . . . 503
5.2 Extending the One-Factor Design to Multiple Factors . . . . . . . . . 514
5.2.1 Example: Beetle insecticide two-factor design . . . . . . . . . 515
5.2.2 The Interaction Model for a Two-Factor Experiment . . . . . 516
5.2.3 Example: Survival Times of Beetles . . . . . . . . . . . . . . . 522
5.2.4 Example: Output voltage for batteries . . . . . . . . . . . . . 530
5.2.5 Checking assumptions in a two-factor experiment . . . . . . . 535
5.2.6 A Remedy for Non-Constant Variance . . . . . . . . . . . . . 537
5.3 Multiple comparisons: balanced (means) vs unbalanced (lsmeans) . . 545
5.4 Unbalanced Two-Factor Designs and Analysis . . . . . . . . . . . . . 550
5.4.1 Example: Rat insulin . . . . . . . . . . . . . . . . . . . . . . . 550
5.5 Writing factor model equations and interpreting coefficients . . . . . 561


5.5.1 One-way ANOVA, 1 factor with 3 levels . . . . . . . . . . . . 561


5.5.2 Two-way ANOVA, 2 factors with 3 and 2 levels, additive model 561
5.5.3 Two-way ANOVA, 2 factors with 3 and 2 levels, interaction model 562

6 A Short Discussion of Observational Studies 563

VIII ADA2: ANCOVA and logistic regression 571


7 Analysis of Covariance: Comparing Regression Lines 573
7.1 ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
7.2 Generalizing the ANCOVA Model to Allow Unequal Slopes . . . . . . 580
7.2.1 Unequal slopes ANCOVA model . . . . . . . . . . . . . . . . . 581
7.2.2 Equal slopes ANCOVA model . . . . . . . . . . . . . . . . . . 586
7.2.3 Equal slopes and equal intercepts ANCOVA model . . . . . . 587
7.2.4 No slopes, but intercepts ANCOVA model . . . . . . . . . . . 588
7.3 Relating Models to Two-Factor ANOVA . . . . . . . . . . . . . . . . 590
7.4 Choosing Among Models . . . . . . . . . . . . . . . . . . . . . . . . . 591
7.4.1 Simultaneous testing of regression parameters . . . . . . . . . 593
7.5 Comments on Comparing Regression Lines . . . . . . . . . . . . . . . 596
7.6 Three-way interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 597

8 Polynomial Regression 601


8.1 Polynomial Models with One Predictor . . . . . . . . . . . . . . . . . 601
8.1.1 Example: Cloud point and percent I-8 . . . . . . . . . . . . . 604
8.2 Polynomial Models with Two Predictors . . . . . . . . . . . . . . . . 608
8.2.1 Example: Mooney viscosity . . . . . . . . . . . . . . . . . . . 608
8.2.2 Example: Mooney viscosity on log scale . . . . . . . . . . . . 611

9 Discussion of Response Models with Factors and Predictors 617


9.1 Some Comments on Building Models . . . . . . . . . . . . . . . . . . 622
9.2 Example: The Effect of Sex and Rank on Faculty Salary . . . . . . . 623
9.2.1 A Three-Way ANOVA on Salary Data . . . . . . . . . . . . . 626
9.2.2 Using Year and Year Since Degree to Predict Salary . . . . . . 633
9.2.3 Using Factors and Predictors to Model Salaries . . . . . . . . 635
9.2.4 Discussion of the Salary Analysis . . . . . . . . . . . . . . . . 641

10 Automated Model Selection for Multiple Regression 645


10.1 Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
10.2 Backward Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 647
10.3 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647


10.3.1 Example: Indian systolic blood pressure . . . . . . . . . . . . 648


10.4 Other Model Selection Procedures . . . . . . . . . . . . . . . . . . . . 656
10.4.1 R2 Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
10.4.2 Adjusted-R2 Criterion, maximize . . . . . . . . . . . . . . . . 656
10.4.3 Mallows’ Cp Criterion, minimize . . . . . . . . . . . . . . . . . 657
10.5 Illustration with Peru Indian data . . . . . . . . . . . . . . . . . . . . 657
10.5.1 R2 , R̄2 , and Cp Summary for Peru Indian Data . . . . . . . . 663
10.5.2 Peru Indian Data Summary . . . . . . . . . . . . . . . . . . . 663
10.6 Example: Oxygen Uptake . . . . . . . . . . . . . . . . . . . . . . . . 664
10.6.1 Redo analysis excluding first and last observations . . . . . . . 670

11 Logistic Regression 675


11.1 Generalized linear model variance and link families . . . . . . . . . . 675
11.2 Example: Age of Menarche in Warsaw . . . . . . . . . . . . . . . . . 676
11.3 Simple logistic regression model . . . . . . . . . . . . . . . . . . . . . 678
11.3.1 Estimating Regression Parameters via LS of empirical logits . 680
11.3.2 Maximum Likelihood Estimation for Logistic Regression Model 681
11.3.3 Fitting the Logistic Model by Maximum Likelihood, Menarche 682
11.4 Example: Leukemia white blood cell types . . . . . . . . . . . . . . . 686
11.5 Example: The UNM Trauma Data . . . . . . . . . . . . . . . . . . . 699
11.5.1 Selecting Predictors in the Trauma Data . . . . . . . . . . . . 702
11.5.2 Checking Predictions for the Trauma Model . . . . . . . . . . 705
11.6 Historical Example: O-Ring Data . . . . . . . . . . . . . . . . . . . . 707

IX ADA2: Multivariate Methods 715


12 An Introduction to Multivariate Methods 717
12.1 Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
12.2 Vector and Matrix Notation . . . . . . . . . . . . . . . . . . . . . . . 720
12.3 Matrix Notation to Summarize Linear Combinations . . . . . . . . . 723

13 Principal Component Analysis 725


13.1 Example: Temperature Data . . . . . . . . . . . . . . . . . . . . . . . 726
13.2 PCA on Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . 733
13.3 Interpreting Principal Components . . . . . . . . . . . . . . . . . . . 738
13.4 Example: Painted turtle shells . . . . . . . . . . . . . . . . . . . . . . 739
13.4.1 PCA on shells covariance matrix . . . . . . . . . . . . . . . . 740
13.4.2 PCA on shells correlation matrix . . . . . . . . . . . . . . . . 741
13.5 Why is PCA a Sensible Variable Reduction Technique? . . . . . . . . 742
13.5.1 A Warning on Using PCA as a Variable Reduction Technique 744


13.5.2 PCA is Used for Multivariate Outlier Detection . . . . . . . . 746


13.6 Example: Sparrows, for Class Discussion . . . . . . . . . . . . . . . . 746
13.7 PCA for Variable Reduction in Regression . . . . . . . . . . . . . . . 750

14 Cluster Analysis 761


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
14.1.1 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
14.1.2 Distance measures . . . . . . . . . . . . . . . . . . . . . . . . 766
14.2 Example: Mammal teeth . . . . . . . . . . . . . . . . . . . . . . . . . 766
14.3 Identifying the Number of Clusters . . . . . . . . . . . . . . . . . . . 769
14.4 Example: 1976 birth and death rates . . . . . . . . . . . . . . . . . . 775
14.4.1 Complete linkage . . . . . . . . . . . . . . . . . . . . . . . . . 777
14.4.2 Single linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . 784

15 Multivariate Analysis of Variance 791

16 Discriminant Analysis 807


16.1 Canonical Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . 808
16.2 Example: Owners of riding mowers . . . . . . . . . . . . . . . . . . . 809
16.3 Discriminant Analysis on Fisher’s Iris Data . . . . . . . . . . . . . . . 817

17 Classification 825
17.1 Classification using Mahalanobis distance . . . . . . . . . . . . . . . . 827
17.2 Evaluating the Accuracy of a Classification Rule . . . . . . . . . . . . 829
17.3 Example: Carapace classification and error . . . . . . . . . . . . . . . 830
17.4 Example: Fisher’s Iris Data cross-validation . . . . . . . . . . . . . . 835
17.4.1 Stepwise variable selection for classification . . . . . . . . . . . 843
17.5 Example: Analysis of Admissions Data . . . . . . . . . . . . . . . . . 846
17.5.1 Further Analysis of the Admissions Data . . . . . . . . . . . . 847
17.5.2 Classification Using Unequal Prior Probabilities . . . . . . . . 851
17.5.3 Classification With Unequal Covariance Matrices, QDA . . . . 855


Preface
UNM Stat 427/527: Advanced Data Analysis I (ADA1)
The course website (https://statacumen.com/teaching/ada1) includes these
lecture notes and R code, video lectures, helper videos, quizzes, in-class assignments,
homework assignments, datasets, and more course description including student
learning outcomes, rubrics, and other helpful course information.
These notes include clicker questions which I used to use when I lectured from
these notes. I found these, along with the think-pair-share strategy, very effective in
assessing comprehension during the lecture.

Description Statistical tools for scientific research, including parametric and non-
parametric methods for ANOVA and group comparisons, simple linear and multiple
linear regression and basic ideas of experimental design and analysis. Emphasis placed
on the use of statistical packages such as R. Course cannot be counted in the hours
needed for graduate degrees in Mathematics and Statistics.

Goal Learn to produce beautiful (markdown) and reproducible (knitr) reports with
informative plots (ggplot2) and tables (xtable) by writing code (R, Rstudio) to answer
questions using fundamental statistical methods (all one- and two-variable methods),
which you'll be proud to present (poster).

Course structure The course semester structure enables students to develop the
statistical, programming, visualization, and research skills to give a poster
presentation. This includes conducting a literature review, developing a dataset codebook,
cleaning data, performing univariate and bivariate statistical summaries, tests, and
visualizations, and presenting results and evaluating peer presentations as part of a
poster session. This course structure follows the GAISE recommendations.
Each week, the course structure is the following. Students prepare for class by
reading these notes, watching video lectures of the notes, and taking a quiz online
before class. In class, students download an Rmarkdown file and work through a
real-data problem or two in teams to reinforce the content from the reading; some-
times, this includes taking data in class into a google spreadsheet and analyzing it
immediately afterwards. Finally, students work on a homework assignment outside
of class that is due the following week.


UNM Stat 428/528: Advanced Data Analysis II (ADA2)


The course website (https://statacumen.com/teaching/ada2) includes these
lecture notes and R code, video lectures, helper videos, quizzes, in-class assignments,
homework assignments, datasets, and more course description including student
learning outcomes, rubrics, and other helpful course information.

Description A continuation of 427/527 that focuses on methods for analyzing
multivariate data and categorical data. Topics include MANOVA, principal components,
discriminant analysis, classification, factor analysis, analysis of contingency tables
including log-linear models for multidimensional tables and logistic regression.

Goal Learn to produce beautiful (markdown) and reproducible (knitr) reports with
informative plots (ggplot2) and tables (xtable) by writing code (R, Rstudio) to
answer questions using fundamental statistical methods (analysis of covariance, logistic
regression, and multivariate methods), which you'll be proud to present (poster).

Course structure The course semester structure builds on ADA1 and enables stu-
dents to develop the statistical, programming, visualization, and research skills to give
a poster presentation. This includes multiple regression (with an emphasis on assess-
ing model assumptions), logistic regression, multivariate methods, classification, and
combining these methods. Visualization remains an important topic. The semester
culminates in a poster session with many students using their own data. This course
structure follows the GAISE recommendations.
Each week, the course structure is the following. Students prepare for class by
reading these notes, watching video lectures of the notes, and taking a quiz online
before class. In class, students download an Rmarkdown file and work through a
real-data problem or two in teams to reinforce the content from the reading. We
have used student-suggested or student-collected datasets when possible. Homework
is started in class since the realistic data analysis problems can be quite involved and
we want to resolve most student questions in class so that they don’t get stuck on
a small detail. In my experience, the students in the second semester have become
quite mature in their attitudes toward data analysis, coding, and visualization; they
are eager to be challenged and make well-reasoned decisions in their work.



Part I

ADA1: Software



Chapter 0

Introduction to R, Rstudio, and ggplot

Contents
0.1 R building blocks . . . . . . . . . . . . . . . . . . . . . . . 3
0.2 Plotting with ggplot2 . . . . . . . . . . . . . . . . . . . . . 10
0.2.1 Improving plots . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.3 Course Overview . . . . . . . . . . . . . . . . . . . . . . . . 22

0.1 R building blocks


R as calculator In the following sections, look at the commands and output and
understand how the command was interpreted to give the output.
#### Calculator
# Arithmetic
2 * 10
## [1] 20
1 + 2
## [1] 3
# Order of operations is preserved
1 + 5 * 10
## [1] 51
(1 + 5) * 10
## [1] 60
# Exponents use the ^ symbol
2^5

## [1] 32
9^(1/2)
## [1] 3

Vectors A vector is a set of numbers, like a column of a spreadsheet. The sets of
numbers below are not technically “vectors”, since they are not in a row/column
structure, though they are ordered and can be referred to by index.
#### Vectors
# Create a vector with the c (short for combine) function
c(1, 4, 6, 7)
## [1] 1 4 6 7
c(1:5, 10)
## [1] 1 2 3 4 5 10
# or use a function
# (seq is short for sequence)
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
seq(0, 50, length = 11)
## [1] 0 5 10 15 20 25 30 35 40 45 50
seq(1, 50, length = 11)
## [1] 1.0 5.9 10.8 15.7 20.6 25.5 30.4 35.3 40.2 45.1 50.0
1:10 # short for seq(1, 10, by = 1), or just
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
5:1
## [1] 5 4 3 2 1
# non-integer sequences
# (Note: the [1] at the beginning of lines indicates
# the index of the first value in that row)
seq(0, 100*pi, by = pi)
## [1] 0.000000 3.141593 6.283185 9.424778 12.566371
## [6] 15.707963 18.849556 21.991149 25.132741 28.274334
## [11] 31.415927 34.557519 37.699112 40.840704 43.982297
## [16] 47.123890 50.265482 53.407075 56.548668 59.690260
## [21] 62.831853 65.973446 69.115038 72.256631 75.398224
## [26] 78.539816 81.681409 84.823002 87.964594 91.106187
## [31] 94.247780 97.389372 100.530965 103.672558 106.814150
## [36] 109.955743 113.097336 116.238928 119.380521 122.522113
## [41] 125.663706 128.805299 131.946891 135.088484 138.230077
## [46] 141.371669 144.513262 147.654855 150.796447 153.938040
## [51] 157.079633 160.221225 163.362818 166.504411 169.646003

## [56] 172.787596 175.929189 179.070781 182.212374 185.353967


## [61] 188.495559 191.637152 194.778745 197.920337 201.061930
## [66] 204.203522 207.345115 210.486708 213.628300 216.769893
## [71] 219.911486 223.053078 226.194671 229.336264 232.477856
## [76] 235.619449 238.761042 241.902634 245.044227 248.185820
## [81] 251.327412 254.469005 257.610598 260.752190 263.893783
## [86] 267.035376 270.176968 273.318561 276.460154 279.601746
## [91] 282.743339 285.884931 289.026524 292.168117 295.309709
## [96] 298.451302 301.592895 304.734487 307.876080 311.017673
## [101] 314.159265

Assign variables Assigning objects to variables.


#### Assign variables
# Assign a vector to a variable with <-
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(15, 3, length = 5)
b
## [1] 15 12 9 6 3
c <- a * b
c
## [1] 15 24 27 24 15

Basic functions There are lots of functions available in the base package.
Type ?base and click on Index at the bottom of the help page for a complete list
of functions. Other functions to look at are in the ?stats and ?datasets packages.
#### Basic functions
# Lots of familiar functions work
a
## [1] 1 2 3 4 5
sum(a)
## [1] 15
prod(a)
## [1] 120
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
var(a)
## [1] 2.5

min(a)
## [1] 1
median(a)
## [1] 3
max(a)
## [1] 5
range(a)
## [1] 1 5

Extracting subsets It will be an extremely important skill to understand the
structure of your variables and pull out the information you need from them.
#### Extracting subsets
# Specify the indices you want in the square brackets []
a <- seq(0, 100, by = 10)
# blank = include all
a
## [1] 0 10 20 30 40 50 60 70 80 90 100
a[]
## [1] 0 10 20 30 40 50 60 70 80 90 100
# integer +=include, 0=include none, -=exclude
a[5]
## [1] 40
a[c(2, 4, 6, 8)]
## [1] 10 30 50 70
a[0]
## numeric(0)
a[-c(2, 4, 6, 8)]
## [1] 0 20 40 60 80 90 100
a[c(1, 1, 1, 6, 6, 9)] # subsets can be bigger than the original set
## [1] 0 0 0 50 50 80
a[c(1,2)] <- c(333, 555) # update a subset
a
## [1] 333 555 20 30 40 50 60 70 80 90 100

Boolean: True/False Subsets can be selected based on which elements meet
specific conditions.
#### Boolean
a
## [1] 333 555 20 30 40 50 60 70 80 90 100
(a > 50) # TRUE/FALSE for each element

## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
which(a > 50) # which indices are TRUE
## [1] 1 2 7 8 9 10 11
a[(a > 50)]
## [1] 333 555 60 70 80 90 100
!(a > 50) # ! negates (flips) TRUE/FALSE values
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
a[!(a > 50)]
## [1] 20 30 40 50

Comparison functions All your favorite comparisons are available.


#### Comparison
# Here they are: < > <= >= != == %in%
a
## [1] 333 555 20 30 40 50 60 70 80 90 100
# equal to
a[(a == 50)]
## [1] 50
# equal to
a[(a == 55)]
## numeric(0)
# not equal to
a[(a != 50)]
## [1] 333 555 20 30 40 60 70 80 90 100
# greater than
a[(a > 50)]
## [1] 333 555 60 70 80 90 100
# less than
a[(a < 50)]
## [1] 20 30 40
# less than or equal to
a[(a <= 50)]
## [1] 20 30 40 50
# which values on left are in the vector on right
(c(10, 14, 40, 60, 99) %in% a)
## [1] FALSE FALSE TRUE TRUE FALSE

Boolean operators compare TRUE/FALSE values and return TRUE/FALSE values.


#### Boolean
# & and, | or, ! not
a
## [1] 333 555 20 30 40 50 60 70 80 90 100
a[(a >= 50) & (a <= 90)]
## [1] 50 60 70 80 90
a[(a < 50) | (a > 100)]
## [1] 333 555 20 30 40
a[(a < 50) | !(a > 100)]
## [1] 20 30 40 50 60 70 80 90 100
a[(a >= 50) & !(a <= 90)]
## [1] 333 555 100

Missing values The value NA (not available) means the value is missing. Any
calculation involving NA will return an NA by default.
#### Missing values
NA + 8
## [1] NA
3 * NA
## [1] NA
mean(c(1, 2, NA))
## [1] NA
# Many functions have an na.rm argument (NA remove)
mean(c(NA, 1, 2), na.rm = TRUE)
## [1] 1.5
sum(c(NA, 1, 2))
## [1] NA
sum(c(NA, 1, 2), na.rm = TRUE)
## [1] 3
# Or you can remove them yourself
a <- c(NA, 1:5, NA)
a
## [1] NA 1 2 3 4 5 NA
is.na(a) # which values are missing?
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
!is.na(a) # which values are NOT missing?
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
a[!is.na(a)] # return those which are NOT missing
## [1] 1 2 3 4 5
a # note, this did not change the variable a

## [1] NA 1 2 3 4 5 NA
# To save the results of removing the NAs,
# assign to another variable or reassign to the original variable
# Warning: if you write over variable a then the original version is gone forever!
a <- a[!is.na(a)]
a
## [1] 1 2 3 4 5

Your turn!

ClickerQ s — Ch 0, R building blocks, Subset

ClickerQ s — Ch 0, R building blocks, T/F selection 1

ClickerQ s — Ch 0, R building blocks, T/F selection 2

How’d you do?


Outstanding Understanding the operations and how to put them together, without
skipping steps.
Good Understanding most of the small steps, missed a couple details.
Hang in there Understanding some of the concepts but all the symbols make my
eyes spin.
Reading and writing in a new language takes work.
You’ll get better as you practice, practice, practice1 .
Having a buddy to work with will help.

R command review This is a summary of the commands we’ve seen so far.


#### Review
<-
+ - * / ^
c()
seq() # by=, length=
sum(), prod(), mean(), sd(), var(),
min(), median(), max(), range()
a[]
(a > 1), ==, !=, >, <, >=, <=, %in%
&, |, !
NA, mean(a, na.rm = TRUE), !is.na()

1
What’s your Carnegie Hall that you’re working towards?


0.2 Plotting with ggplot2


There are three primary strategies for plotting in R. The first is base graphics, using
functions such as plot(), points(), and par(). A second is use of the package lattice,
which creates very nice graphics quickly, though it can be more difficult to create
high-dimensional data displays. A third, and the one I will use throughout this course, is
using package ggplot2, which is an implementation of the “Grammar of Graphics”.
While creating a very simple plot requires an extra step or two, creating very complex
displays is relatively straightforward as you become comfortable with the syntax.
This section is intended as an introduction to ggplot() and some of its capabilities
(we will cover these plotting functions and others in detail in later chapters). As a
basic introduction, it requires a data.frame object as input (a table where each row
represents an observation or data point, and each column can have a different data
type), and then you define plot layers that stack on top of each other, and each layer
has visual/text elements that are mapped to aesthetics (colors, size, opacity). In this
way, a simple set of commands can be combined to produce extremely informative
displays.
In the example that follows, we consider a dataset mpg consisting of fuel economy
data from 1999 and 2008 for 38 popular models of car.
#### Installing
# only needed once after installing or upgrading R
install.packages("ggplot2")

#### Library
# each time you start R
# load package ggplot2 for its functions and datasets
library(ggplot2)

# ggplot2 includes a dataset "mpg"

# ? gives help on a function or dataset


?mpg

Let's get a look at the dataset we're using.
#### mpg dataset
# head() lists the first several rows of a data.frame
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p
## 3 audi a4 2 2008 4 manu~ f 20 31 p
## 4 audi a4 2 2008 4 auto~ f 21 30 p
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p

## 6 audi a4 2.8 1999 6 manu~ f 18 26 p


## # ... with 1 more variable: class <chr>
# str() gives the structure of the object
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
# summary() gives frequency tables for categorical variables
# and mean and five-number summaries for continuous variables
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv
## Min. :4.000 Length:234 Length:234
## 1st Qu.:4.000 Class :character Class :character
## Median :6.000 Mode :character Mode :character
## Mean :5.889
## 3rd Qu.:8.000
## Max. :8.000
## cty hwy fl
## Min. : 9.00 Min. :12.00 Length:234
## 1st Qu.:14.00 1st Qu.:18.00 Class :character
## Median :17.00 Median :24.00 Mode :character
## Mean :16.86 Mean :23.44
## 3rd Qu.:19.00 3rd Qu.:27.00
## Max. :35.00 Max. :44.00
## class
## Length:234
## Class :character
## Mode :character
##
##
##


#### ggplot_mpg_displ_hwy
# specify the dataset and variables
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point() # add a plot layer with points
print(p)

[Figure: scatterplot of hwy versus displ for the mpg data.]

Geoms, aesthetics, and facets are three concepts we’ll see in this section.

Geom: is the “type” of plot
Aesthetics: shape, colour, size, alpha
Faceting: “small multiples” displaying different subsets

Help is available. Try searching for examples, too.

- had.co.nz/ggplot2
- had.co.nz/ggplot2/geom_point.html

When certain aesthetics are defined, an appropriate legend is chosen and displayed
automatically.
#### ggplot_mpg_displ_hwy_colour_class
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class))
print(p)


[Figure: scatterplot of hwy versus displ, points coloured by class, with a class legend.]

I encourage you to experiment with aesthetics!

1. Assign variables to the aesthetics colour, size, and shape.
2. What's the difference between discrete and continuous variables?
3. What happens when you combine multiple aesthetics?

The behavior of the aesthetics is predictable and customizable.

Aesthetic   Discrete                   Continuous
colour      Rainbow of colors          Gradient from red to blue
size        Discrete size steps        Linear mapping between radius and value
shape       Different shape for each   Shouldn't work

Let’s see a couple examples.


#### ggplot_mpg_displ_hwy_colour_class_size_cyl_shape_drv
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv))
print(p)


[Figure: scatterplot of hwy versus displ with colour = class, size = cyl, and shape = drv, with legends for each aesthetic.]

#### ggplot_mpg_displ_hwy_colour_class_size_cyl_shape_drv_alpha
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point(aes(colour = class, size = cyl, shape = drv)
, alpha = 1/4) # alpha is the opacity
print(p)

[Figure: the same plot drawn with alpha = 1/4, so overlapping points appear darker.]


Faceting A small multiple2 (sometimes called faceting, trellis chart, lattice chart,
grid chart, or panel chart) is a series or grid of small similar graphics or charts,
allowing them to be easily compared.

- Typically, small multiples will display different subsets of the data.
- Useful strategy for exploring conditional relationships, especially for large data.

Experiment with faceting of different types. What relationships would you like to
see?
#### ggplot_mpg_displ_hwy_facet
# start by creating a basic scatterplot
p <- ggplot(mpg, aes(x = displ, y = hwy))
p <- p + geom_point()

## two methods
# facet_grid(rows ~ cols) for 2D grid, "." for no split.
# facet_wrap(~ var) for 1D ribbon wrapped into 2D.

# examples of subsetting the scatterplot in facets


p1 <- p + facet_grid(. ~ cyl) # columns are cyl categories
p2 <- p + facet_grid(drv ~ .) # rows are drv categories
p3 <- p + facet_grid(drv ~ cyl) # both
p4 <- p + facet_wrap(~ class) # wrap plots by class category

# plot all four in one arrangement


library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), ncol = 2, top="Facet examples")

2
According to Edward Tufte (Envisioning Information, p. 67): “At the heart of quantitative
reasoning is a single question: Compared to what? Small multiple designs, multivariate and data
bountiful, answer directly by visually enforcing comparisons of changes, of the differences among
objects, of the scope of alternatives. For a wide range of problems in data presentation, small
multiples are the best design solution.”


[Figure: "Facet examples", four versions of the hwy versus displ scatterplot: facet_grid(. ~ cyl), facet_grid(drv ~ .), facet_grid(drv ~ cyl), and facet_wrap(~ class).]

0.2.1 Improving plots

How can this plot be improved?


#### ggplot_mpg_cty_hwy
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point()
print(p)


[Figure: scatterplot of hwy versus cty.]

Problem: points lie on top of each other, so it’s impossible to tell how many
observations each point represents.

A solution: Jitter the points to reveal the individual points and reduce the opacity
to 1/2 to indicate when points overlap.
#### ggplot_mpg_cty_hwy_jitter
p <- ggplot(mpg, aes(x = cty, y = hwy))
p <- p + geom_point(position = "jitter", alpha = 1/2)
print(p)

[Figure: hwy versus cty with jittered, half-transparent points.]

How can this plot be improved?


#### ggplot_mpg_class_hwy
p <- ggplot(mpg, aes(x = class, y = hwy))
p <- p + geom_point()
print(p)


[Figure: hwy plotted against class, with classes in alphabetical order.]

Problem: The classes are in alphabetical order, which is somewhat arbitrary.

A solution: Reorder the class variable by the mean hwy for a meaningful ordering.
Get help with ?reorder to understand how this works.
#### ggplot_mpg_reorder_class_hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point()
print(p)

[Figure: hwy against class, with classes reordered by mean hwy.]

. . . add jitter
#### ggplot_mpg_reorder_class_hwy_jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_point(position = "jitter")
print(p)


[Figure: the reordered plot with jittered points.]

. . . a little less jitter


#### ggplot_mpg_reorder_class_hwy_jitter_less
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_jitter(position = position_jitter(width = 0.1))
print(p)



[Figure: the reordered plot with less jitter (width = 0.1).]

. . . or replace with boxplots


#### ggplot_mpg_reorder_class_hwy_boxplot
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_boxplot()
print(p)


[Figure: boxplots of hwy by reordered class.]

. . . or jitter those points with reduced-opacity boxplots on top


#### ggplot_mpg_reorder_class_hwy_jitter_boxplot
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_jitter(position = position_jitter(width = 0.1))
p <- p + geom_boxplot(alpha = 0.5)
print(p)

[Figure: jittered points with half-transparent boxplots drawn on top.]

. . . even better with the boxplots with jittered points on top


#### ggplot_mpg_reorder_class_hwy_boxplot_jitter
p <- ggplot(mpg, aes(x = reorder(class, hwy), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1))
print(p)


[Figure: half-transparent boxplots with jittered points drawn on top.]

. . . and can easily reorder by median() instead of mean() (mean is the default)
#### ggplot_mpg_reorder_class_hwy_boxplot_jitter_median
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1))
print(p)

[Figure: the same plot with classes reordered by median hwy instead of mean hwy.]

One-minute paper:
Muddy Any “muddy” points — anything that doesn’t make sense yet?
Thumbs up Anything you really enjoyed or feel excited about?


0.3 Course Overview


See ADA2 Chapter 1 notes for a brief overview of all we’ll cover in this semester of
ADA1.



Part II

ADA1: Summaries and displays, and one-, two-, and many-way tests of means



Chapter 1

Summarizing and Displaying Data

Contents
1.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . 26
1.2 Numerical summaries . . . . . . . . . . . . . . . . . . . . . 26
1.3 Graphical summaries for one quantitative sample . . . . 31
1.3.1 Dotplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.3.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.3.3 Stem-and-leaf plot . . . . . . . . . . . . . . . . . . . . . . . 34
1.3.4 Boxplot or box-and-whiskers plot . . . . . . . . . . . . . . . 36
1.4 Interpretation of Graphical Displays for Numerical Data 40
1.5 Interpretations for examples . . . . . . . . . . . . . . . . . 54

Learning objectives
After completing this topic, you should be able to:
use R’s functions to get help and numerically summarize data.
apply R’s base graphics and ggplot to visually summarize data in several ways.
explain what each plotting option does.
describe the characteristics of a data distribution.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.


1.1 Random variables


A random variable is a variable whose value is subject to variations due to chance.
Random variables fall into two broad categories: qualitative and quantitative.
Qualitative data includes categorical outcomes:
Nominal Outcome is one of several categories.
Ex: Blood group, hair color.
Ordinal Outcome is one of several ordered categories.
Ex: Likert data such as (strongly agree, agree, neutral, disagree, strongly disagree).
Quantitative data includes numeric outcomes:
Discrete Outcome is one of a fixed set of numerical values.
Ex: Number of children.
Continuous Outcome is any numerical value.
Ex: Birthweight.
It may not always be perfectly clear which type data belong to, and the type may
sometimes be assigned based on the question being asked of the data. Distinction between
nominal and ordinal variables can be subjective. For example, for vertebral fracture
types: (Wedge, Concavity, Biconcavity, Crush), one could argue that a crush is worse
than a biconcavity which is worse than a concavity . . . , but this is not self-evident.
Distinction between ordinal and discrete variables can be subjective. For example,
cancer staging (I, II, III, IV) sounds discrete, but better treated as ordinal because the
“distance” between stages may be hard to define and unlikely to be equal. Continuous
variables are generally measured to a fixed level of precision, which makes them discrete.
This “discreteness” of continuous variables is not a problem, providing there are
enough levels.
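
As a small illustration (an aside, not part of the original notes), these types map naturally onto R's data types: nominal data are usually stored as factors, ordinal data as ordered factors, discrete counts as integers, and continuous measurements as numeric vectors. The values below are made up for illustration.
#### Variable types in R (illustrative sketch)
# nominal: unordered categories
blood_group <- factor(c("A", "B", "O", "AB", "O"))
# ordinal: ordered categories (Likert scale)
agreement <- factor(c("agree", "neutral", "strongly agree")
                  , levels = c("strongly disagree", "disagree", "neutral"
                             , "agree", "strongly agree")
                  , ordered = TRUE)
# discrete: counts
n_children <- c(0L, 2L, 1L, 3L)
# continuous: measurements
birthweight <- c(7.2, 6.8, 8.1)
str(agreement) # an ordered factor with 5 levels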

ClickerQ s — Random variables

1.2 Numerical summaries


Suppose we have a collection of n individuals, and we measure each individual’s
response on one quantitative characteristic, say height, weight, or systolic blood
pressure. For notational simplicity, the collected measurements are denoted by Y1, Y2, . . . , Yn,
where n is the sample size. The order in which the measurements are assigned to
the place-holders (Y1, Y2, . . . , Yn) is irrelevant.
Among the numerical summary measures we’re interested in are the sample
mean Ȳ and the sample standard deviation s. The sample mean is a measure
of central location, or a measure of a “typical value” for the data set. The
standard deviation is a measure of spread in the data set. These summary statistics
should be familiar to you. Let us consider a simple example to refresh your memory
on how to compute them.
Suppose we have a sample of n = 8 children with weights (in pounds): 5, 9, 12,
30, 14, 18, 32, 40. Then
$$
\bar{Y} = \frac{\sum_i Y_i}{n} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
        = \frac{5 + 9 + 12 + 30 + 14 + 18 + 32 + 40}{8} = \frac{160}{8} = 20.
$$

#### Numerical summaries


#### mean
y <- c(5, 9, 12, 30, 14, 18, 32, 40)
mean(y)
## [1] 20
The sample standard deviation is the square root of the sample variance
$$
s^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{n-1}
    = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_n - \bar{Y})^2}{n-1}
    = \frac{(5 - 20)^2 + (9 - 20)^2 + \cdots + (40 - 20)^2}{7} = 156.3,
$$
$$
s = \sqrt{s^2} = 12.5.
$$

#### variance
var(y)
## [1] 156.2857
sd(y)
## [1] 12.50143
Summary statistics have well-defined units of measurement, for example, Ȳ = 20
lb, s² = 156.3 lb², and s = 12.5 lb. The standard deviation is often used instead of
s² as a measure of spread because s is measured in the same units as the data.

Remark If the divisor for s² were n instead of n − 1, then the variance would be the
average squared deviation of the observations from the center of the data, as measured
by the mean.
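
A quick numeric check of this remark (a sketch, not in the original notes), using the
weight data y from above:
#### divisor n-1 versus n (sketch)
sum((y - mean(y))^2) / (length(y) - 1)  # sample variance, matches var(y)
## [1] 156.2857
sum((y - mean(y))^2) / length(y)        # average squared deviation from the mean
## [1] 136.75
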
The following graphs should help you to see some physical meaning of the sample
mean and variance. If the data values were placed on a “massless” ruler, the balance
point would be the mean (20). The variance is basically the “average” (remember
n − 1 instead of n) of the total areas of all the squares obtained when squares are
formed by joining each value to the mean. In both cases think about the implication
of unusual values (outliers). What happens to the balance point if the 40 were a 400
instead of a 40? What happens to the squares?

The sample median M is an alternative measure of central location. The mea-
sure of spread reported along with M is the interquartile range, IQR = Q3 − Q1 ,
where Q1 and Q3 are the first and third quartiles of the data set, respectively. To
calculate the median and interquartile range, order the data from lowest to highest
values, all repeated values included. The ordered weights are
5 9 12 14 18 30 32 40.
#### sorting
sort(y)
## [1] 5 9 12 14 18 30 32 40
The median M is the value located at the half-way point of the ordered string. There
is an even number of observations, so M is defined to be half-way between the two
middle values, 14 and 18. That is, M = 0.5(14 + 18) = 16 lb. To get the quartiles,
break the data into the lower half: 5 9 12 14, and the upper half: 18 30 32 and 40.
Then
Q1 = first quartile = median of lower half of data = 0.5(9+12)=10.5 lb,
and
Q3 = third quartile = median of upper half of data = 0.5(30+32) = 31 lb.
The interquartile range is

IQR = Q3 − Q1 = 31 − 10.5 = 20.5 lb.

#### quartiles
median(y)
## [1] 16
fivenum(y)
## [1] 5.0 10.5 16.0 31.0 40.0
# The quantile() function can be useful, but doesn't calculate Q1 and Q3
# as defined above, regardless of the 9 types of calculations for them!

# summary() is a combination of mean() and quantile(y, c(0, 0.25, 0.5, 0.75, 1))
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 11.25 16.00 20.00 30.50 40.00
# IQR
fivenum(y)[c(2,4)]
## [1] 10.5 31.0
fivenum(y)[4] - fivenum(y)[2]
## [1] 20.5
diff(fivenum(y)[c(2,4)])
## [1] 20.5

The quartiles, with M being the second quartile, break the data set roughly into
fourths. The first quartile is also called the 25th percentile, whereas the median and
third quartile are the 50th and 75th percentiles, respectively. The IQR is the range
for the middle half of the data.

Suppose we omit the largest observation from the weight data:


5 9 12 14 18 30 32.
How do M and IQR change? With an odd number of observations, there is a unique
middle observation in the ordered string which is M . Here M = 14 lb. It is unclear
which half the median should fall into, so M is placed into both the lower and upper
halves of the data. The lower half is 5 9 12 14, and the upper half is 14 18 30 32.
With this convention, Q1 = 0.5(9 + 12) = 10.5 and Q3 = 0.5(18 + 30) = 24, giving
IQR = 24 − 10.5 = 13.5 (lb).
#### remove largest
# remove the largest observation by removing the last of the sorted values
y2 <- sort(y)[-length(y)]
y2
## [1] 5 9 12 14 18 30 32
median(y2)
## [1] 14
fivenum(y2)
## [1] 5.0 10.5 14.0 24.0 32.0
diff(fivenum(y2)[c(2,4)])
## [1] 13.5

If you look at the data set with all eight observations, there actually are many
numbers that split the data set in half, so the median is not uniquely defined1, al-
though “everybody” agrees to use the average of the two middle values. With quartiles
there is the same ambiguity but no universal agreement on what to do about it, so R
will give slightly different values for Q1 and Q3 with summary() and some other
commands than we just calculated by hand, and other packages will report yet other
values. This has no practical implication (all the values are “correct”), but it can
appear confusing.
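
To see this in R (a small sketch, not in the original notes): the type argument of
quantile() selects among its nine conventions; type 7 is R's default, used by summary(),
while type 2 averages the two relevant order statistics and, for these data, matches the
hand calculation and fivenum() above.
#### quartile conventions (sketch)
quantile(y, probs = c(0.25, 0.75), type = 7)
##   25%   75%
## 11.25 30.50
quantile(y, probs = c(0.25, 0.75), type = 2)
##  25%  75%
## 10.5 31.0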

Example The data given below are the head breadths in mm for a sample of 18
modern Englishmen, with numerical summaries generated by R.
#### Englishmen
hb <- c(141, 148, 132, 138, 154, 142, 150, 146, 155
, 158, 150, 140, 147, 148, 144, 150, 149, 145)

# see sorted values


sort(hb)
## [1] 132 138 140 141 142 144 145 146 147 148 148 149 150 150 150 154
## [17] 155 158
# number of observations is the length of the vector (when no missing values)
length(hb)
## [1] 18
# default quartiles
summary(hb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 132.0 142.5 147.5 146.5 150.0 158.0
# standard quartiles
fivenum(hb)
## [1] 132.0 142.0 147.5 150.0 158.0
# range() gives the min and max values
range(hb)
## [1] 132 158
# the range of the data is the (max - min), calculated using diff()
diff(range(hb))
## [1] 26
mean(hb)
## [1] 146.5

1
The technical definition of the median for an even set of values includes the entire range between
the two center values. Thus, selecting any single value in this center range is convenient and the
center of this center range is one sensible choice for the median, M .

# standard deviation
sd(hb)
## [1] 6.382421
# standard error of the mean
se <- sd(hb)/sqrt(length(hb))

Note that se is the standard error of the sample mean, SEȲ = s/√n, and is a
measure of the precision of the sample mean Ȳ .

ClickerQs — Numerical summaries

1.3 Graphical summaries for one quantitative sample

There are four graphical summaries of primary interest: the dotplot, the histogram,
the stem-and-leaf display, and the boxplot. There are many more possible, but
these will often be useful. The plots can be customized. Make liberal use of the help
for learning how to customize them. Plots can also be generated along with many
statistical analyses, a point that we will return to repeatedly.

1.3.1 Dotplots
The dotplot breaks the range of data into many small equal-width intervals, and
counts the number of observations in each interval. The interval count is superimposed
on the number line at the interval midpoint as a series of dots, usually one for each
observation. In the head breadth data, the intervals are centered at integer values, so
the display gives the number of observations at each distinct observed head breadth.
A dotplot of the head breadth data is given below. Of the examples below, the R
base graphics stripchart() with method="stack" resembles the traditional dotplot.
#### stripchart-ggplot
# stripchart (dotplot) using R base graphics
# 3 rows, 1 column
par(mfrow=c(3,1))
stripchart(hb, main="Modern Englishman", xlab="head breadth (mm)")
stripchart(hb, method="stack", cex=2
, main="larger points (cex=2), method is stack")
stripchart(hb, method="jitter", cex=2, frame.plot=FALSE
, main="no frame, method is jitter")

# dotplot using ggplot


library(ggplot2)

# first put hb vector into a data.frame


hb_df <- data.frame(hb)
p1 <- ggplot(hb_df, aes(x = hb))
p1 <- p1 + geom_dotplot(binwidth = 2)
p1 <- p1 + labs(title = "Modern Englishman head breadth")
p1 <- p1 + xlab("head breadth (mm)")

p2 <- ggplot(hb_df, aes(x = hb))


p2 <- p2 + geom_dotplot(binwidth = 2, stackdir = "center")
p2 <- p2 + labs(title = "Modern Englishman head breadth, stackdir=center")
p2 <- p2 + xlab("head breadth (mm)")

p3 <- ggplot(hb_df, aes(x = hb))


p3 <- p3 + geom_dotplot(binwidth = 2, stackdir = "centerwhole")
p3 <- p3 + labs(title = "Modern Englishman head breadth, stackdir=centerwhole")
p3 <- p3 + xlab("head breadth (mm)")

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)

[Figure: dotplots of the head breadth data (mm): base-graphics stripcharts (default; stacked with cex=2; jittered, no frame) and ggplot dotplots (default; stackdir=center; stackdir=centerwhole)]

1.3.2 Histogram

The histogram and stem-and-leaf displays are similar, breaking the range of data
into a smaller number of equal-width intervals. This produces graphical information
about the observed distribution by highlighting where data values cluster. The his-
togram can use arbitrary intervals, whereas the intervals for the stem-and-leaf display
use the base 10 number system. There is more arbitrariness to histograms than to
stem-and-leaf displays, so histograms can sometimes be regarded a bit suspiciously.
#### hist
# histogram using R base graphics
# par() gives graphical options
# mfrow = "multifigure by row" with 1 row and 3 columns
par(mfrow=c(1,3))
# main is the title, xlab is x-axis label (ylab also available)
hist(hb, main="Modern Englishman", xlab="head breadth (mm)")
# breaks suggests approximately how many bins to use (R treats it as a suggestion)
hist(hb, breaks = 15, main="Histogram, 15 breaks")
# freq=FALSE changes the vertical axis to density,
# so the total area of the bars is now equal to 1
hist(hb, breaks = 8, freq = FALSE, main="Histogram, density")

# histogram using ggplot


library(ggplot2)
# first put hb vector into a data.frame
hb_df <- data.frame(hb)
p1 <- ggplot(hb_df, aes(x = hb))
# always specify a binwidth for the histogram (default is range/30)
# try several binwidths
p1 <- p1 + geom_histogram(binwidth = 5)
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "Modern Englishman head breadth")

p2 <- ggplot(hb_df, aes(x = hb))


# always specify a binwidth for the histogram (default is range/30)
# try several binwidths
p2 <- p2 + geom_histogram(binwidth = 2)
p2 <- p2 + geom_rug()
p2 <- p2 + labs(title = "Modern Englishman head breadth")

library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1)

[Figure: histograms of the head breadth data: base graphics (default; 15 breaks; density scale) and ggplot (binwidth 5 and binwidth 2, with rug)]

R allows you to modify the graphical display. For example, with the histogram
you might wish to use different midpoints or interval widths. I will let you explore
the possibilities.

1.3.3 Stem-and-leaf plot


A stem-and-leaf plot is a character-based histogram that defines intervals for a grouped
frequency distribution using the base 10 number system. Intervals are generated by
selecting an appropriate number of lead digits for the data values to be the stem. The
remaining digits comprise the leaf. It is useful for small samples.
Character plots use typewriter characters to make graphs, and can be convenient
for some simple displays, but require use of fixed fonts (like Courier) when copied to
a word processing program or they get distorted.
The display almost looks upside down, since larger numbers are on the bottom
rather than the top. It is done this way so that if rotated 90 degrees counter-clockwise
it is a histogram.
The default stem-and-leaf display for the head breadth data is given below. The
two columns give the stems and leaves. The data have three digits. The first two
comprise the stem. The last digit is the leaf. Thus, a head breadth of 154 has a
stem of 15 and leaf of 4. The possible stems are 13, 14, and 15, whereas the possible
leaves are the integers from 0 to 9. In the first plot, each stem occurs once, while in
the second each stem occurs twice. In the second instance, the first (top) occurrence
of a stem value only holds leaves 0 through 4. The second occurrence holds leaves 5
through 9. The display is generated by placing the leaf value for each observation on
the appropriate stem line. For example, the top 14 stem holds data values between
140 and 144.99. The leaves on this line in the display tell us that four observations
fall in this range: 140, 141, 142 and 144. Note that this stem-and-leaf display is
an elaborate histogram with intervals of width 5. An advantage of the stem-and-leaf
display over the histogram is that the original data values can essentially be recovered
from the display.

#### stem-and-leaf
# stem-and-leaf plot
stem(hb)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 13 | 28
## 14 | 0124567889
## 15 | 000458
# scale=2 makes plot roughly twice as wide
stem(hb, scale=2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 13 | 2
## 13 | 8
## 14 | 0124
## 14 | 567889
## 15 | 0004
## 15 | 58
# scale=5 makes plot roughly five times as wide
stem(hb, scale=5)
##
## The decimal point is at the |
##
## 132 | 0
## 134 |
## 136 |
## 138 | 0
## 140 | 00
## 142 | 0
## 144 | 00
## 146 | 00
## 148 | 000
## 150 | 000
## 152 |
## 154 | 00
## 156 |
## 158 | 0
The data values are always truncated so that a leaf has one digit. The leaf unit
(location of the decimal point) tells us the degree of round-off. This will become
clearer in the next example.
Of the three displays, which is the most informative? I think the middle option
is best for seeing the clustering and shape of the distribution of these numbers.

ClickerQs — Stem-and-leaf plot

1.3.4 Boxplot or box-and-whiskers plot


The boxplot breaks up the range of data values into regions about the center of the
data, measured by the median. The boxplot highlights outliers and provides a visual
means to assess “normality”. The boxplot is constructed as follows, given the placement
of data values on the axis.

The endpoints of the box are placed at the locations of the first and third quartiles.
The location of the median is identified by the line in the box. The whiskers extend to
the data points closest to but not on or outside the outlier fences, which are 1.5IQR
from the quartiles. Outliers are any values on or outside the outlier fences.
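
As a small sketch of this construction (not in the original notes), using the head breadth
data and the fivenum() hinges as the quartiles (the same hinges R's boxplot() uses):
#### boxplot construction by hand (sketch)
q1  <- fivenum(hb)[2]
q3  <- fivenum(hb)[4]
iqr <- q3 - q1
# outlier fences sit 1.5*IQR beyond the quartiles
fences <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
fences
## [1] 130 162
# whiskers extend to the most extreme observations inside the fences
range(hb[hb > fences[1] & hb < fences[2]])
## [1] 132 158
# observations on or outside the fences would be flagged as outliers (none here)
hb[hb <= fences[1] | hb >= fences[2]]
## numeric(0)
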
The boxplot for the head breadth data is given below. There are a lot of options
that allow you to clutter the boxplot with additional information. Just use the default
settings. We want to see the relative location of data (the median line), have an idea
of the spread of data (IQR, the length of the box), see the shape of the data (relative
distances of components from each other – to be covered later), and identify outliers
(if present). The default boxplot has all these components.
Note that the boxplots below are horizontal to better fit on the page. The
horizontal=TRUE and coord_flip() commands do this.


#### boxplot
fivenum(hb)
## [1] 132.0 142.0 147.5 150.0 158.0
# boxplot using R base graphics
par(mfrow=c(1,1))
boxplot(hb, horizontal=TRUE
, main="Modern Englishman", xlab="head breadth (mm)")

# boxplot using ggplot


library(ggplot2)
# first put hb vector into a data.frame
hb_df <- data.frame(hb)
p <- ggplot(hb_df, aes(x = "hb", y = hb))
p <- p + geom_boxplot()
p <- p + coord_flip()
p <- p + labs(title = "Modern Englishman head breadth")
print(p)
[Figure: boxplots of head breadth (mm), base graphics and ggplot versions]

ClickerQs — Boxplots

Improvements to the boxplot


As a quick aside, a violin plot is a combination of a boxplot and a kernel density plot.
They can be created using the vioplot() function from the vioplot package.
#### vioplot
# vioplot using R base graphics
# 3 rows, 1 column
par(mfrow=c(3,1))

# histogram
hist(hb, freq = FALSE
, main="Histogram with kernel density plot, Modern Englishman")
# Histogram overlaid with kernel density curve
points(density(hb), type = "l")
# rug of points under histogram
rug(hb)

# violin plot
library(vioplot)
## Loading required package: sm
## Package ’sm’, version 2.2-5.5: type help(sm) for summary information
vioplot(hb, horizontal=TRUE, col="gray")
title("Violin plot, Modern Englishman")

# boxplot
boxplot(hb, horizontal=TRUE
, main="Boxplot, Modern Englishman", xlab="head breadth (mm)")
[Figure: histogram with kernel density curve, violin plot, and boxplot of the head breadth data]

Example: income The data below are incomes in $1000 units for a sample of
12 retired couples. Numerical and graphical summaries are given. There are two
stem-and-leaf displays provided. The first is the default display.
#### Income examples
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
# sort in decreasing order
income <- sort(income, decreasing = TRUE)
income
## [1] 1110 46 12 8 7 7 7 5 5 2 2 0
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.25 7.00 100.92 9.00 1110.00
# stem-and-leaf plot
stem(income)
##
## The decimal point is 3 digit(s) to the right of the |

##
## 0 | 00000000000
## 0 |
## 1 | 1

Because of the two large outliers, I trimmed them to get a sense of the shape of the
distribution where most of the observations are.
#### remove largest
# remove two largest values (the first two)
income2 <- income[-c(1,2)]
income2
## [1] 12 8 7 7 7 5 5 2 2 0
summary(income2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.75 6.00 5.50 7.00 12.00
# stem-and-leaf plot
stem(income2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 022
## 0 | 557778
## 1 | 2
# scale=2 makes plot roughly twice as wide
stem(income2, scale=2)
##
## The decimal point is at the |
##
## 0 | 0
## 2 | 00
## 4 | 00
## 6 | 000
## 8 | 0
## 10 |
## 12 | 0

Boxplots with full data, then incrementally removing the two largest outliers.
#### income-boxplot
# boxplot using R base graphics
# 1 row, 3 columns
par(mfrow=c(1,3))
boxplot(income, main="Income")
boxplot(income[-1], main="(remove largest)")
boxplot(income2, main="(remove 2 largest)")

[Figure: boxplots of the income data: full data, largest value removed, and two largest values removed]
1.4 Interpretation of Graphical Displays for Numerical Data
In many studies, the data are viewed as a subset or sample from a larger collection
of observations or individuals under study, called the population. A primary goal
of many statistical analyses is to generalize the information in the sample to infer
something about the population. For this generalization to be possible, the sample
must reflect the basic patterns of the population. There are several ways to collect
data to ensure that the sample reflects the basic properties of the population, but
the simplest approach, by far, is to take a random or “representative” sample from
the population. A random sample has the property that every possible sample of
a given size has the same chance of being the sample (eventually) selected (though
we often do this only once). Random sampling eliminates any systematic biases
associated with the selected observations, so the information in the sample should
accurately reflect features of the population. The process of sampling introduces
random variation or random errors associated with summaries. Statistical tools are
used to calibrate the size of the errors.
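
For example, a simple random sample can be drawn in R with sample() (a sketch; the
population IDs below are hypothetical):
#### simple random sample (sketch)
# randomly select 18 individuals, without replacement, from a population of 1000
sample(1:1000, size = 18)
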
Whether we are looking at a histogram (or stem-and-leaf, or dotplot) from a sam-
ple, or are conceptualizing the histogram generated by the population data, we can
imagine approximating the “envelope” around the display with a smooth curve. The
smooth curve that approximates the population histogram is called the population
frequency curve or population probability density function or population
distribution (“distribution function” often refers to the “cumulative distribution
function”, which is a different, but one-to-one related, function than what I mean here).
Statistical methods for inference about a population usually make assumptions about
the shape of the population frequency curve. A common assumption is that the
population has a normal frequency curve. In practice, the observed data are used to
assess the reasonableness of this assumption. In particular, a sample dis-
play should resemble a population display, provided the collected data are a random
or representative sample from the population. Several common shapes for frequency
distributions are given below, along with the statistical terms used to describe them.

Unimodal, symmetric, bell-shaped, and no outliers The first display is uni-
modal (one peak), symmetric, and bell-shaped with no outliers. This is the pro-
totypical normal curve. The boxplot (laid on its side for this display) shows strong
evidence of symmetry: the median is about halfway between the first and third quar-
tiles, and the tail lengths are roughly equal. The boxplot is calibrated in such a
way that 7 of every 1000 observations are outliers (more than 1.5(Q3 − Q1 ) from
the quartiles) in samples from a population with a normal frequency curve. Only
2 out of every 1 million observations are extreme outliers (more than 3(Q3 − Q1 )
from the quartiles). We do not have any outliers here out of 250 observations, but
we certainly could have some without indicating nonnormality. If a sample of 30
observations contains 4 outliers, two of which are extreme, would it be reasonable to
assume the population from which the data were collected has a normal frequency
curve? Probably not.
#### Unimodal, symmetric, bell-shaped, and no outliers (Normal distribution)
## base graphics
# sample from normal distribution
x1 <- rnorm(250, mean = 100, sd = 15)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)

# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")

# boxplot
boxplot(x1, horizontal=TRUE)

## ggplot
# Histogram overlaid with kernel density curve
x1_df <- data.frame(x1)
p1 <- ggplot(x1_df, aes(x = x1))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x1_df, aes(x = "x1", y = x1))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x1_df, aes(x = "x1", y = x1))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.

[Figure: histogram with density curve, violin plot, and boxplot of x1, a Normal(100, 15) sample (base graphics and ggplot)]

#### Central statistical moments


# moments package for 3rd and 4th moments: skewness() and kurtosis()
library(moments)
summary(x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.40 89.01 99.27 99.34 108.66 142.90
sd(x1)
## [1] 15.1007
skewness(x1)
## [1] 0.1432898
kurtosis(x1)
## [1] 2.883019
stem(x1)

##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 9
## 6 | 4
## 6 | 5889
## 7 | 3333344
## 7 | 578888899
## 8 | 01111122222223344444
## 8 | 55555666667777888889999999
## 9 | 000111111122222233333344
## 9 | 5555555556666666677777888888899999999
## 10 | 00000111222222233333344444
## 10 | 555555555666666667777777788888999999999
## 11 | 0000011111122233444
## 11 | 566677788999
## 12 | 00001123444
## 12 | 5679
## 13 | 00022234
## 13 | 6
## 14 | 3

Unimodal, symmetric, heavy-tailed The boxplot is better at highlighting out-
liers than are other displays. The histogram and stem-and-leaf displays below appear
to have the same basic shape as a normal curve (unimodal, symmetric). However,
the boxplot shows that we have a dozen outliers in a sample of 250 observations.
We would only expect about two outliers in 250 observations when sampling from a
population with a normal frequency curve. The frequency curve is best described as
unimodal, symmetric, and heavy-tailed.
#### Unimodal, symmetric, heavy-tailed
# sample from a normal distribution, then square (keeping the sign) to make the tails heavier
x2.temp <- rnorm(250, mean = 0, sd = 1)
x2 <- sign(x2.temp) * x2.temp^2 * 15 + 100

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)
points(density(x2), type = "l")
rug(x2)

# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")

# boxplot
boxplot(x2, horizontal=TRUE)

# Histogram overlaid with kernel density curve


x2_df <- data.frame(x2)
p1 <- ggplot(x2_df, aes(x = x2))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x2_df, aes(x = "x2", y = x2))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x2_df, aes(x = "x2", y = x2))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
[Figure: histogram with density curve, violin plot, and boxplot of x2, the heavy-tailed sample; the boxplots flag many outliers in both tails]

summary(x2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.186 93.748 100.446 102.150 108.950 306.868
sd(x2)
## [1] 29.4546

skewness(x2)
## [1] 1.124581
kurtosis(x2)
## [1] 13.88607
stem(x2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## -0 | 30
## 0 | 34
## 2 | 938
## 4 | 891799
## 6 | 4679991123334678899
## 8 | 001233345667888890012222222333444445555666667777788888899999999
## 10 | 00000000000000000000000000000000111111111222222222222233333333444445+33
## 12 | 000001222455601122344458
## 14 | 135668998
## 16 | 33786
## 18 | 514
## 20 |
## 22 |
## 24 |
## 26 |
## 28 |
## 30 | 7

Symmetric, (uniform,) short-tailed Not all symmetric distributions are mound-
shaped, as the display below suggests. The boxplot shows symmetry, but the tails of
the distribution are shorter (lighter) than in the normal distribution. Note that the
distance between quartiles is roughly constant here.
#### Symmetric, (uniform,) short-tailed
# sample from uniform distribution
x3 <- runif(250, min = 50, max = 150)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x3, freq = FALSE, breaks = 20)
points(density(x3), type = "l")
rug(x3)

# violin plot
library(vioplot)
vioplot(x3, horizontal=TRUE, col="gray")

# boxplot

boxplot(x3, horizontal=TRUE)

# Histogram overlaid with kernel density curve


x3_df <- data.frame(x3)
p1 <- ggplot(x3_df, aes(x = x3))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x3_df, aes(x = "x3", y = x3))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x3_df, aes(x = "x3", y = x3))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
[Figure: histogram with density curve, violin plot, and boxplot of x3, a Uniform(50, 150) sample]

summary(x3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.61 75.29 101.44 101.31 127.46 149.46
sd(x3)

## [1] 29.02638
skewness(x3)
## [1] -0.00953667
kurtosis(x3)
## [1] 1.778113
stem(x3)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 12234444
## 5 | 555577778889999
## 6 | 0111223334
## 6 | 556678899
## 7 | 0000011111122334444
## 7 | 5567778899
## 8 | 011111224444
## 8 | 5556666799999
## 9 | 0001112233
## 9 | 55667778999
## 10 | 00000111223334444
## 10 | 5555566777889
## 11 | 001123344444
## 11 | 55577888899999
## 12 | 001122444
## 12 | 5677778999
## 13 | 000011222333344
## 13 | 556667788889
## 14 | 01111222344444
## 14 | 55666666777788999
The mean and median are identical in a population with an (exactly) symmetric
frequency curve. The histogram and stem-and-leaf displays for a sample selected
from a symmetric population will tend to be fairly symmetric. Further, the sample
means and medians will likely be close.

Unimodal, skewed right The distribution below is unimodal, and asymmetric or
skewed. The distribution is said to be skewed to the right, or upper end, because
the right tail is much longer than the left tail. The boxplot also shows the skewness –
the region between the minimum observation and the median contains half the data in
less than 1/5 the range of values. In addition, the upper tail contains several outliers.
#### Unimodal, skewed right
# sample from exponential distribution
x4 <- rexp(250, rate = 1)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)

# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")

# boxplot
boxplot(x4, horizontal=TRUE)

# Histogram overlaid with kernel density curve


x4_df <- data.frame(x4)
p1 <- ggplot(x4_df, aes(x = x4))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x4_df, aes(x = "x4", y = x4))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x4_df, aes(x = "x4", y = x4))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.

[Figure: histogram with density curve, violin plot, and boxplot of x4, an Exponential(1) sample]

summary(x4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003949 0.261104 0.611573 0.908212 1.194486 9.769742
sd(x4)
## [1] 0.9963259
skewness(x4)
## [1] 3.596389
kurtosis(x4)
## [1] 27.45438
stem(x4)
##
## The decimal point is at the |
##
## 0 | 00000000001111111111111111111111122222222222222222222222222333333333+15
## 0 | 55555555555555555555566666666666666777777777777778888888889999999999
## 1 | 0000000111111111112222333333444
## 1 | 5555555566666777788999
## 2 | 0000123333444
## 2 | 556677799
## 3 | 122
## 3 | 68
## 4 | 00
## 4 |
## 5 |
## 5 |
## 6 |
## 6 |
## 7 |

## 7 |
## 8 |
## 8 |
## 9 |
## 9 | 8

Unimodal, skewed left The distribution below is unimodal and skewed to the
left. The two examples show that extremely skewed distributions often contain out-
liers in the longer tail of the distribution.
#### Unimodal, skewed left
# sample from a left-skewed distribution (15 minus an Exponential(0.5) draw)
x5 <- 15 - rexp(250, rate = 0.5)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)

# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")

# boxplot
boxplot(x5, horizontal=TRUE)

# Histogram overlaid with kernel density curve


x5_df <- data.frame(x5)
p1 <- ggplot(x5_df, aes(x = x5))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x5_df, aes(x = "x5", y = x5))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x5_df, aes(x = "x5", y = x5))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
[Figure: histogram with density curve, violin plot, and boxplot of x5, the left-skewed sample]

summary(x5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.224 12.391 13.795 13.183 14.506 14.994
sd(x5)
## [1] 1.870509
skewness(x5)
## [1] -1.961229
kurtosis(x5)
## [1] 7.888622
stem(x5)
##
## The decimal point is at the |
##
## 4 | 2
## 5 | 5678
## 6 |
## 7 | 9
## 8 | 2569
## 9 | 44889
## 10 | 11334566678
## 11 | 001124456677778899
## 12 | 0001111223334444444444455666667788889999
## 13 | 0000001111222222333344555666666667777778888889999999
## 14 | 00000000001111111111122222223333333333444444555555555555666667777777+25
## 15 | 000000000

Bimodal (multi-modal) Not all distributions are unimodal. The distribution
below has two modes or peaks, and is said to be bimodal. Distributions with three
or more peaks are called multi-modal.
#### Bimodal (multi-modal)
# sample from a mixture of two normal distributions
x6 <- c(rnorm(150, mean = 100, sd = 15), rnorm(150, mean = 150, sd = 15))

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x6, freq = FALSE, breaks = 20)
points(density(x6), type = "l")
rug(x6)

# violin plot
library(vioplot)
vioplot(x6, horizontal=TRUE, col="gray")

# boxplot
boxplot(x6, horizontal=TRUE)

# Histogram overlaid with kernel density curve


x6_df <- data.frame(x6)
p1 <- ggplot(x6_df, aes(x = x6))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(x6_df, aes(x = "x6", y = x6))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(x6_df, aes(x = "x6", y = x6))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.

[Figure: histogram with density curve, violin plot, and boxplot of x6, the bimodal sample]

summary(x6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.87 99.03 122.71 124.71 151.45 184.32
sd(x6)
## [1] 29.59037
skewness(x6)
## [1] -0.005817938
kurtosis(x6)
## [1] 1.85249
stem(x6)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 |
## 6 | 0368
## 7 | 002244688
## 8 | 0111233356677788889999
## 9 | 000001112222223333444455555666677789999999999999
## 10 | 0011111122444555666666678899
## 11 | 00011222333334555555666678899
## 12 | 00000001123345567889
## 13 | 00011122334456666777788889999
## 14 | 00122233455556677788888999
## 15 | 00000001111222223333344444455555666777888999
## 16 | 01111233344444455555666778889
## 17 | 0125678
## 18 | 00124
The boxplot and histogram or stem-and-leaf display (or dotplot) are used to-
gether to describe the distribution. The boxplot does not provide information about
modality – it only tells you about skewness and the presence of outliers.
As noted earlier, many statistical methods assume the population frequency curve
is normal. Small deviations from normality usually do not dramatically influence the
operating characteristics of these methods. We worry most when the deviations from
normality are severe, such as extreme skewness or heavy tails containing multiple
outliers.

ClickerQs — Graphical summaries

1.5 Interpretations for examples


The head breadth sample is slightly skewed to the left, unimodal, and has no outliers.
The distribution does not deviate substantially from normality. The various measures
of central location (Ȳ = 146.5, M = 147.5) are close, which is common with fairly
symmetric distributions containing no outliers.
The income sample is extremely skewed to the right due to the presence of two
extreme outliers at 46 and 1110. A normality assumption here is unrealistic.
It is important to recognize the influence that outliers can have on the values of
Ȳ and s. The median and interquartile range are more robust (less sensitive) to the
presence of outliers. For the income data Ȳ = 100.9 and s = 318, whereas M = 7 and
IQR = 8.3. If we omit the two outliers, then Ȳ = 5.5 and s = 3.5, whereas M = 6
and IQR = 5.25.
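
These summaries can be reproduced from the income and income2 vectors above (a
sketch; the quoted IQR values appear to follow a different quantile convention than R's
default, see Section 1.2, so type = 6 is used here as an assumption):
#### sensitive versus robust summaries for the income data (sketch)
mean(income); sd(income)        # pulled strongly toward the outliers
median(income); IQR(income, type = 6)
mean(income2); sd(income2)      # after removing the two outliers
median(income2); IQR(income2, type = 6)
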
The mean and median often have similar values in data sets without outliers,
so it does not matter much which one is used as the “typical value”. This issue is
important, however, in data sets with extreme outliers. In such instances, the median
is often more reasonable. For example, is Ȳ = 100.9 a reasonable measure for a typical
income in this sample, given that the second largest income is only 46?

R Discussion I have included basic pointers on how to use R in these notes. I
find that copying an R code example from the internet, then modifying the code to
apply to my data is a productive strategy. When you’re first learning, “pretty good”
is often “good enough”, especially when it comes to plots (in other words, spending
20 minutes vs 2 hours on a plot is fine). I will demonstrate most of what you need
and I will be happy to answer questions. You will learn a lot more by using and
experimenting with R than by watching.

Chapter 2

Estimation in One-Sample Problems

Contents
2.1 Inference for a population mean . . . . . . . . . . . . . . 56
2.1.1 Standard error, LLN, and CLT . . . . . . . . . . . . . . . . 57
2.1.2 z-score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.1.3 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2 CI for µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2.1 Assumptions for procedures . . . . . . . . . . . . . . . . . . 66
2.2.2 The effect of α on a two-sided CI . . . . . . . . . . . . . . . 69
2.3 Hypothesis Testing for µ . . . . . . . . . . . . . . . . . . . 69
2.3.1 P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.3.2 Assumptions for procedures . . . . . . . . . . . . . . . . . . 72
2.3.3 The mechanics of setting up hypothesis tests . . . . . . . . 79
2.3.4 The effect of α on the rejection region of a two-sided test . 81
2.4 Two-sided tests, CI and p-values . . . . . . . . . . . . . . 82
2.5 Statistical versus practical significance . . . . . . . . . . . 83
2.6 Design issues and power . . . . . . . . . . . . . . . . . . . 84
2.7 One-sided tests on µ . . . . . . . . . . . . . . . . . . . . . . 84
2.7.1 One-sided CIs . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Learning objectives

After completing this topic, you should be able to:

select graphical displays that meaningfully communicate properties of a sample.
assess the assumptions of the one-sample t-test visually.
decide whether the mean of a population is different from a hypothesized value.
recommend action based on a hypothesis test.

Achieving these goals contributes to mastery in these course learning outcomes:

1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.

2.1 Inference for a population mean

Suppose that you have identified a population of interest where individuals are mea-
sured on a single quantitative characteristic, say, weight, height or IQ. You select a
random or representative sample from the population with the goal of estimating the
(unknown) population mean value, identified by µ. You cannot see much of the
population, but you would like to know what is typical in the population (µ). The
only information you can see is that in the sample.

This is a standard problem in statistical inference, and the first inferential problem
that we will tackle. For notational convenience, identify the measurements on the
sample as Y1, Y2, . . . , Yn, where n is the sample size. Given the data, our best guess,
or estimate, of µ is the sample mean:

\[ \bar{Y} = \frac{\sum_i Y_i}{n} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}. \]

[Figure: schematic of inference: a population (a huge set of values with mean µ and standard deviation σ unknown; we can see very little of it), a sample Y1, Y2, ..., Yn drawn from it, and inference from the sample back to µ and σ]

There are two main methods that are used for inferences on µ: confidence in-
tervals (CI) and hypothesis tests. The standard CI and test procedures are based
on the sample mean and the sample standard deviation, denoted by s.

ClickerQs — Inference for a population mean, 2

2.1.1 Standard error, LLN, and CLT


The standard error (SE) is the standard deviation of the sampling distribution
of a statistic.
The sampling distribution of a statistic is the distribution of that statistic,
considered as a random variable, when derived from a random sample of size n.
The standard error of the mean (SEM) is the standard deviation of the
sample-mean’s estimate of a population mean. (It can also be viewed as the standard
deviation of the error in the sample mean relative to the true mean, since the sample
mean is an unbiased estimator.) SEM is usually estimated by the sample estimate of
the population standard deviation (sample standard deviation) divided by the square
root of the sample size (assuming statistical independence of the values in the sample):

SEȲ = s/√n
where s is the sample standard deviation (i.e., the sample-based estimate of the
standard deviation of the population), and n is the size (number of observations) of
the sample.
In probability theory, the law of large numbers (LLN) is a theorem that
describes the result of performing the same experiment a large number of times.
According to the law, the average of the results obtained from a large number of
trials (the sample mean, Ȳ ) should be close to the expected value (the population
mean, µ), and will tend to become closer as more trials are performed.
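
A one-line illustration of the LLN (a sketch, not in the original notes): the running mean
of Uniform(0,1) draws settles toward µ = 1/2 as the number of trials grows.
#### LLN illustration (sketch)
x <- runif(10000, 0, 1)
# running (cumulative) mean after 1, 2, ..., 10000 draws
running.mean <- cumsum(x) / seq_along(x)
plot(running.mean, type = "l", xlab = "number of trials", ylab = "running mean")
abline(h = 0.5, col = "red", lwd = 2)
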
In probability theory, the central limit theorem (CLT) states that, given
certain conditions, the mean of a sufficiently large number of independent random
variables, each with finite mean and variance, will be approximately normally distributed.
(The CLT has a number of variants. In its common form, the random variables must be
identically distributed. In variants, convergence of the mean to the normal distribution
also occurs for non-identical distributions, given that they comply with certain conditions.)
As a joint illustration of these concepts, consider drawing random variables fol-
lowing a Uniform(0,1) distribution, that is, any value in the interval [0, 1] is equally
likely. By definition, the mean of this distribution is µ = 1/2 and the variance is
σ² = 1/12 (so the standard deviation is σ = √(1/12) = 0.289). Therefore, if we draw
a sample of size n, then the standard error of the mean will be σ/√n, and as n gets
larger the distribution of the mean will increasingly follow a normal distribution. We
illustrate this by drawing N = 10000 samples of size n, plotting those N means,
computing the expected and observed SEM, and checking how well the histogram of
the sampled means follows a normal distribution. Notice, indeed, that even with
samples as small as 2 and 6, the properties of the SEM and the distribution are as
predicted.
#### Illustration of Central Limit Theorem, Uniform distribution
# demo.clt.unif(N, n)
# draws N samples of size n from Uniform(0,1)
# and plots the N means with a normal distribution overlay
demo.clt.unif <- function(N, n) {
# draw sample in a matrix with N columns and n rows
sam <- matrix(runif(N*n, 0, 1), ncol=N);
# calculate the mean of each column
sam.mean <- colMeans(sam)
# the sd of the mean is the SEM
sam.se <- sd(sam.mean)
# calculate the true SEM given the sample size n
true.se <- sqrt((1/12)/n)
# draw a histogram of the means
hist(sam.mean, freq = FALSE, breaks = 25
, main = paste("True SEM =", round(true.se, 4)
, ", Est SEM = ", round( sam.se, 4))
, xlab = paste("n =", n))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(0, 1, length = 1000)
points(x, dnorm(x, mean = 0.5, sd = true.se), type = "l", lwd = 2, col = "red")
# place a rug of points under the plot

rug(sam.mean)
}

par(mfrow=c(2,2));
demo.clt.unif(10000, 1);
demo.clt.unif(10000, 2);
demo.clt.unif(10000, 6);
demo.clt.unif(10000, 12);

[Figure: histograms of 10000 sample means from Uniform(0,1) for n = 1, 2, 6, 12 with normal overlays; estimated SEMs 0.2893, 0.2006, 0.1194, 0.0843 versus true SEMs 0.2887, 0.2041, 0.1179, 0.0833]

In a more extreme example, we draw samples from an Exponential(1) distribution


(µ = 1 and σ = 1), which is strongly skewed to the right. Notice that the normality
promised by the CLT requires larger sample sizes, about n ≥ 30, than for the previous
Uniform(0,1) example, which required about n ≥ 6.


#### Illustration of Central Limit Theorem, Exponential distribution
# demo.clt.exp(N, n) draws N samples of size n from Exponential(1)
# and plots the N means with a normal distribution overlay
demo.clt.exp <- function(N, n) {
# draw sample in a matrix with N columns and n rows
sam <- matrix(rexp(N*n, 1), ncol=N);
# calculate the mean of each column
sam.mean <- colMeans(sam)
# the sd of the mean is the SEM
sam.se <- sd(sam.mean)
# calculate the true SEM given the sample size n
true.se <- sqrt(1/n)
# draw a histogram of the means
hist(sam.mean, freq = FALSE, breaks = 25
, main = paste("True SEM =", round(true.se, 4), ", Est SEM = ", round(sam.se, 4))
, xlab = paste("n =", n))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(0, 5, length = 1000)
points(x, dnorm(x, mean = 1, sd = true.se), type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
}

par(mfrow=c(2,2));
demo.clt.exp(10000, 1);
demo.clt.exp(10000, 6);
demo.clt.exp(10000, 30);
demo.clt.exp(10000, 100);

[Figure: histograms of 10000 sample means from Exponential(1) for n = 1, 6, 30, 100 with normal overlays; estimated SEMs 1.006, 0.4028, 0.1827, 0.0995 versus true SEMs 1, 0.4082, 0.1826, 0.1]

Note well that the further the population distribution is from being normal, the
larger the sample size must be for the sampling distribution of the sample mean to
be approximately normal. If the population distribution is normal, what is the
minimum sample size for the sampling distribution of the mean to be normal?
For more examples, try:
#### More examples for Central Limit Theorem can be illustrated with this code
# install.packages("TeachingDemos")
library(TeachingDemos)
# look at examples at bottom of the help page
?clt.examp

2.1.2 z-score

Given a distribution with mean x̄ and standard deviation s, a location-scale trans-
formation known as a z-score will shift the distribution to have mean 0 and scale the
spread to have standard deviation 1:

\[ z = \frac{x - \bar{x}}{s}. \]

Below, the original variable x has a normal distribution with mean 100 and stan-
dard deviation 15, Normal(100, 15²), and z has a Normal(0, 1) distribution.
# sample from normal distribution
df <- data.frame(x = rnorm(100, mean = 100, sd = 15))
df$z <- scale(df$x) # by default, this performs a z-score transformation

summary(df)
## x z.V1
## Min. : 39.64 Min. :-3.446123
## 1st Qu.: 90.99 1st Qu.:-0.485300
## Median :100.00 Median : 0.033925
## Mean : 99.41 Mean : 0.000000
## 3rd Qu.:110.72 3rd Qu.: 0.652006
## Max. :132.70 Max. : 1.919736
## ggplot
library(ggplot2)
p1 <- ggplot(df, aes(x = x))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..))
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "X ~ Normal(100, 15)")

p2 <- ggplot(df, aes(x = z))


# Histogram with density instead of count on y-axis
p2 <- p2 + geom_histogram(aes(y=..density..))
p2 <- p2 + geom_density(alpha=0.1, fill="white")
p2 <- p2 + geom_rug()
p2 <- p2 + labs(title = "Z ~ Normal(0, 1)")

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.

[Figure: histograms with density curves and rugs of X ~ Normal(100, 15) and its z-score transform Z ~ Normal(0, 1)]

2.1.3 t-distribution
The Student’s t-distribution is a family of continuous probability distributions that
arises when estimating the mean of a normally distributed population in situations
where the sample size is small and population standard deviation is unknown. The
t-distribution is symmetric and bell-shaped, like the normal distribution, but has
heavier tails, meaning that it is more prone to producing values that fall far from its
mean. Effectively, the t-distribution is wider than the normal distribution because in
addition to estimating the mean µ with Ȳ , we also have to estimate σ² with s², so
there’s some additional uncertainty. The degrees-of-freedom (df) parameter of the t-
distribution is the sample size n minus the number of variance parameters estimated.
Thus, df = n−1 when we have one sample and df = n−2 when we have two samples.
As n increases, the t-distribution becomes close to the normal distribution, and when
n = ∞ the distributions are equivalent.
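
A quick numerical check of this convergence (a sketch, not in the original notes): the
upper 0.975 t-quantile approaches the corresponding normal quantile as the degrees of
freedom grow.
#### t critical values approach the normal value as df grows (sketch)
round(qt(0.975, df = c(1, 2, 6, 12, 30, 100)), 3)
## [1] 12.706  4.303  2.447  2.179  2.042  1.984
round(qnorm(0.975), 3)
## [1] 1.96
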
#### Normal vs t-distributions with a range of degrees-of-freedom
x <- seq(-8, 8, length = 1000)
par(mfrow=c(1,1))
plot(x, dnorm(x), type = "l", lwd = 2, col = "red"
, main = "Normal (red) vs t-dist with df=1, 2, 6, 12, 30, 100")
points(x, dt(x, 1), type = "l")
points(x, dt(x, 2), type = "l")
points(x, dt(x, 6), type = "l")
points(x, dt(x, 12), type = "l")
points(x, dt(x, 30), type = "l")
points(x, dt(x,100), type = "l")

[Figure: density of the Normal (red) versus t-distributions with df = 1, 2, 6, 12, 30, 100]

2.2 CI for µ
Statistical inference provides methods for drawing conclusions about a population
from sample data. In this chapter, we want to make a claim about population mean
µ given sample statistics Ȳ and s.
A CI for µ is a range of plausible values for the unknown population mean µ, based
on the observed data, of the form “Best Guess ± Reasonable Error of the Guess”. To
compute a CI for µ:
1. Define the population parameter, “Let µ = mean [characteristic] for popu-
lation of interest”.
2. Specify the confidence coefficient, which is a number between 0 and 100%,
in the form 100(1 − α)%. Solve for α. (For example, 95% has α = 0.05.)
3. Compute the t-critical value: tcrit = t0.5α such that the area under the t-curve
(df = n − 1) to the right of tcrit is 0.5α. See appendix or internet for a t-table.
4. Report the CI in the form Ȳ ± tcrit SEȲ or as an interval (L, U ). The desired CI
has lower and upper endpoints given by L = Ȳ − tcrit SEȲ and U = Ȳ + tcrit SEȲ,
respectively, where SEȲ = s/√n is the standard error of the sample mean.
5. Assess method assumptions (see below).

In practice, the confidence coefficient is large, say 95% or 99%, which correspond
to α = 0.05 and 0.01, respectively. The value of α expressed as a percent is known
as the error rate of the CI.
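
As a sketch of steps 2 through 4 in R (not the full procedure shown later; the head
breadth data hb from Chapter 1 is reused purely for illustration, and qt() supplies the
t-critical value so no t-table is needed):
#### hand-rolled 95% t-CI (sketch)
n      <- length(hb)
y.bar  <- mean(hb)
se     <- sd(hb) / sqrt(n)
alpha  <- 0.05
# t-critical value with df = n - 1 (area alpha/2 to its right)
t.crit <- qt(1 - alpha / 2, df = n - 1)
# CI endpoints (L, U)
y.bar + c(-1, 1) * t.crit * se
# compare with R's built-in interval: t.test(hb)$conf.int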

The CI is determined once the confidence coefficient is specified and the data
are collected. Prior to collecting the data, the interval is unknown and is viewed as
random because it will depend on the actual sample selected. Different samples give
different CIs. The “confidence” in, say, the 95% CI (which has a 5% error rate)
can be interpreted as follows. If you repeatedly sample the population and construct
95% CIs for µ, then 95% of the intervals will contain µ, whereas 5% will not. The
interval you construct from your data will either cover µ, or it will not.

The length of the CI

U − L = 2tcrit SEȲ

depends on the accuracy of our estimate Ȳ of µ, as measured by the standard error
of Ȳ, SEȲ = s/√n. Less precise estimates of µ lead to wider intervals for a given
level of confidence.
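
To make this concrete (a sketch with an assumed s = 15 and 95% confidence):
quadrupling the sample size roughly halves the width of the interval.
#### CI width shrinks as n grows (sketch)
n <- c(10, 25, 100, 400)
width <- 2 * qt(0.975, df = n - 1) * 15 / sqrt(n)
round(width, 2)
## [1] 21.46 12.38  5.95  2.95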

An example with 100 CIs Consider drawing a sample of 25 observations from a


normally distributed population with mean 10 and sd 2. Calculate the 95% t-CI. Now
do that 100 times. The plot below reflects the variability of that process. We expect
95 of the 100 CIs to contain the true population mean of 10; that is, on average 5
times out of 100 the interval will not contain the true value of 10, and we would draw
an incorrect inference about the population mean.
#### Illustration of Confidence Intervals (consistent with their interpretation)
library(TeachingDemos)
ci.examp(mean.sim = 10, sd = 2, n = 25
, reps = 100, conf.level = 0.95, method = "t")

Advanced Data Analysis 1 & 2, UNM Stat 427/527 - 428/528


66 Ch 2: Estimation in One-Sample Problems

Confidence intervals based on t distribution


100

| ||
| |
| |
||
||
| |
| |
| | |
| |
80

| |
|
| |
| |
| |
| |
|| |
| |
| |
| |
60

| |
| |
| |
| |
Index

| |
| |
|| |
| | |
| ||
40

| |
| | |
| | | |
| |
| |
| | |
| | |
20

| |
| |
| |
| |
||
| |
| |
| |
| | |
|
0

8 9 10 11 12

Confidence Interval

ClickerQs — CI for µ, 2

2.2.1 Assumptions for procedures


I described the classical CI. The procedure is based on the assumptions that the data
are a random sample from the population of interest, and that the population
frequency curve is normal. The population frequency curve can be viewed as a
“smoothed histogram” created from the population data.
The normality assumption can never be completely verified without having the


entire population data. You can assess the reasonableness of this assumption using
a stem-and-leaf display or a boxplot of the sample data. The stem-and-leaf display
from the data should resemble a normal curve.
In fact, the assumptions are slightly looser than this: the population frequency
curve can be anything provided the sample size is large enough that it is reasonable
to assume that the sampling distribution of the mean is normal.

Assessing assumptions using the bootstrap


We will cover the bootstrap at the end of this course, but a brief introduction here can
help us check model assumptions. Recall in Section 2.1.1 the sampling distribution
examples. Assume the sample is representative of the population. Let’s use our
sample as a proxy for the population and repeatedly draw samples (with replacement)
of size n and calculate the mean, then plot the bootstrap sampling distribution of
means. If this bootstrap sampling distribution strongly deviates from normal, then
that’s evidence from the data that inference using the t-distribution is not appropriate.
Otherwise, if roughly normal, then the t-distribution may be sensible.
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam.mean <- colMeans(sam);
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6
, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)

hist(sam.mean, freq = FALSE, breaks = 25
, main = "Bootstrap sampling distribution of the mean"
, xlab = paste("Data: n =", n
, ", mean =", signif(mean(dat), digits = 5)
, ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam.mean), max(sam.mean), length = 1000)


points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))


, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
# restore par() settings
par(old.par)
}

# example data, skewed --- try others datasets to develop your intuition
x <- rgamma(10, shape = .5, scale = 20)
bs.one.samp.dist(x)

[Figure: two panels from bs.one.samp.dist(x) — top: plot of the data with smoothed density curve; bottom: bootstrap sampling distribution of the mean (n = 10, mean = 4.1358, se = 1.27835).]


Example: Age at First Heart Transplant Let us go through a hand-calculation


of a CI, using R to generate summary data. I’ll show you later how to generate the
CI in R.
We are interested in the mean age at first heart transplant for a population of
patients.
1. Define the population parameter
Let µ = mean age at the time of first heart transplant for population of patients.
2. Calculate summary statistics from sample
The ages (in years) at first transplant for a sample of 11 heart transplant patients
are as follows:
54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49.
Summaries for the data are: n = 11, Ȳ = 51.27, and s = 8.26 so that SEȲ =
8.26/√11 = 2.4904. The degrees of freedom are df = 11 − 1 = 10.
3. Specify confidence level, find critical value, calculate limits
Let us calculate a 95% CI for µ. For a 95% CI α = 0.05, so we need to find
tcrit = t0.025 , which is 2.228. Now tcrit SEȲ = 2.228 × 2.4904 = 5.55. The lower limit
on the CI is L = 51.27 − 5.55 = 45.72. The upper limit is U = 51.27 + 5.55 = 56.82.
4. Summarize in words For example, I am 95% confident that the population
mean age at first transplant is 51.3 ± 5.55, that is, between 45.7 and 56.8 years
(rounding off to 1 decimal place).
5. Check assumptions We will see in a few pages that the sampling distribution
is reasonably normal.

2.2.2 The effect of α on a two-sided CI


A two-sided 100(1 − α)% CI for µ is given by Ȳ ± tcrit SEȲ . The CI is centered at
Ȳ and has length 2tcrit SEȲ . The confidence coefficient 100(1 − α)% is increased
by decreasing α, which increases tcrit . That is, increasing the confidence coefficient
makes the CI wider. This is sensible: to increase your confidence that the interval
captures µ you must pinpoint µ with less precision by making the CI wider. For
example, a 95% CI is wider than a 90% CI.
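A quick numerical check of this in R (df = 10 here is just for illustration, matching the transplant example):
# t-critical values grow as alpha shrinks (confidence increases)
qt(1 - 0.10 / 2, df = 10)  # 90% CI, about 1.81
qt(1 - 0.05 / 2, df = 10)  # 95% CI, about 2.23
qt(1 - 0.01 / 2, df = 10)  # 99% CI, about 3.17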

ClickerQ s — CI, quickie

2.3 Hypothesis Testing for µ


A hypothesis test is used to make a decision about a population parameter.
Suppose you are interested in checking whether the population mean µ is equal
to some prespecified value, say µ0 . This question can be formulated as a two-sided


hypothesis test, where you are trying to decide which of two contradictory claims
or hypotheses about µ is more reasonable given the observed data. The null hy-
pothesis, or the hypothesis under test, is H0 : µ = µ0 , whereas the alternative
hypothesis is HA : µ ≠ µ0 .

I will explore the ideas behind hypothesis testing later. At this point, I focus on
the mechanics behind the test. The steps in carrying out the test are:

1. Set up the null and alternative hypotheses in words and notation. In words:
“The population mean for [what is being studied] is different from [value of µ0 ].”
(Note that the statement in words is in terms of the alternative hypothesis.)
In notation: H0 : µ = µ0 versus HA : µ ≠ µ0 (where µ0 is specified by the
context of the problem).
2. Choose the size or significance level of the test, denoted by α. In practice,
α is set to a small value, say, 0.01 or 0.05, but theoretically can be any value
between 0 and 1.
3. Compute the test statistic

Ȳ − µ0
ts = ,
SEȲ


where SEȲ = s/√n is the standard error.
Note: I sometimes call the test statistic tobs to emphasize that the computed
value depends on the observed data.
4. Compute the critical value tcrit = t0.5α (or p-value from the test statistic) in
the direction of the alternative hypothesis from the t-distribution table with
degrees of freedom df = n − 1.
5. State the conclusion in terms of the problem.
Reject H0 in favor of HA (i.e., decide that H0 is false, based on the data) if
|ts | > tcrit or p-value < α, that is, reject if ts < −tcrit or if ts > tcrit . Otherwise,
Fail to reject H0 .
(Note: We DO NOT accept H0 — more on this later.)
6. Check assumptions of the test, when possible (could do earlier to save yourself
some effort if they are not met).
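Here is a minimal R sketch of steps 3–5; the data vector and µ0 below are placeholders taken from the age example that follows, where t.test() performs the same calculation:
# hand computation of the two-sided one-sample t-test
y     <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)  # placeholder data (the ages below)
mu0   <- 50                                             # hypothesized mean
alpha <- 0.05
n     <- length(y)
ts    <- (mean(y) - mu0) / (sd(y) / sqrt(n))            # test statistic
tcrit <- qt(1 - alpha / 2, df = n - 1)                  # critical value
abs(ts) > tcrit                                         # TRUE would mean reject H0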

The process is represented graphically below. The area under the t-probability
curve outside ±tcrit is the size of the test, α. One-half α is the area in each tail. You
reject H0 in favor of HA only if the test statistic is outside ±tcrit .


[Figure: t-density with central area 1 − α between −tcrit and tcrit, and "Reject H0" regions in each tail, each with area α/2.]

2.3.1 P-values
The p-value, or observed significance level for the test, provides a measure of
plausibility for H0 . Smaller values of the p-value imply that H0 is less plausible. To
compute the p-value for a two-sided test, you
1. Compute the test statistic ts as above.
2. Evaluate the area under the t-probability curve (with df = n − 1) outside ±|ts |.
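In R this area comes directly from the t cumulative distribution function; continuing the sketch above (ts and n as defined there):
# two-sided p-value: twice the tail area beyond |ts|
2 * pt(-abs(ts), df = n - 1)
# equivalently: 2 * pt(abs(ts), df = n - 1, lower.tail = FALSE)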

[Figure: t-density with shaded area p-value/2 in each tail, beyond −ts and ts.]

The p-value is the total shaded area, or twice the area in either tail. A useful in-
terpretation of the p-value is that it is the chance of obtaining data favoring HA by


this much or more if H0 actually is true. Another interpretation is that


the p-value is the probability of observing a sample mean at least as ex-
treme as the one observed assuming µ0 from H0 is the true population
mean.
If the p-value is small, then the sample we obtained would have been quite unusual if
H0 were true — but we actually did obtain this sample, so it probably is not that unusual,
and we conclude instead that H0 is false (the sample would not be unusual if HA were true).
Most, if not all, statistical packages summarize hypothesis tests with a p-value,
rather than a decision (i.e., reject or not reject at a given α level). You can make a
decision to reject or not reject H0 for a size α test based on the p-value as follows
— reject H0 if the p-value is less than α. This decision is identical to that obtained
following the formal rejection procedure given earlier. The reason for this is that the
p-value can be interpreted as the smallest value you can set the size of the test and
still reject H0 given the observed data.
There are a lot of terms to keep straight here. α and tcrit are constants we choose
(actually, one determines the other so we really only choose one, usually α) to set how
rigorous evidence against H0 needs to be. ts and the p-value (again, one determines
the other) are random variables because they are calculated from the random sample.
They are the evidence against H0 .

2.3.2 Assumptions for procedures


I described the classical t-test, which assumes that the data are a random sample
from the population and that the population frequency curve is normal. These are
the same assumptions as for the CI.

Example: Age at First Transplant (Revisited) The ages (in years) at first
transplant for a sample of 11 heart transplant patients are as follows: 54, 42, 51, 54,
49, 56, 33, 58, 54, 64, 49. Summaries for these data are: n = 11, Ȳ = 51.27, s = 8.26
and SEȲ = 2.4904. Test the hypothesis that the mean age at first transplant is 50.
Use α = 0.05.
As in the earlier analysis, define
µ = mean age at time of first transplant for population of patients.
We are interested in testing H0 : µ = 50 against HA : µ ≠ 50, so µ0 = 50.
The degrees of freedom are df = 11 − 1 = 10. The critical value for a 5% test is
tcrit = t0.025 = 2.228. (Note α/2 = 0.05/2 = 0.025). The same critical value was used
with the 95% CI.
For the test,
Ȳ − µ0 51.27 − 50
ts = = = 0.51.
SEȲ 2.4904


Since |ts | = 0.51 < tcrit = 2.228, we do not reject H0 using a 5% test. Notice the placement of ts
relative to tcrit in the picture below. Equivalently, the p-value for the test is 0.62,
thus we fail to reject H0 because 0.62 > 0.05 = α. The results of the hypothesis test
should not be surprising, since the CI tells you that 50 is a plausible value for the
population mean age at transplant. Note: All you can say is that the data could have
come from a distribution with a mean of 50 — this is not convincing evidence that µ
actually is 50.

[Figure: left panel — ts = 0.51 lies well inside the interval (−2.228, 2.228), in the middle of the distribution, so do not reject H0; right panel — the total shaded area outside ±0.51 is the p-value, 0.62.]

Example: Age at First Transplant R output for the heart transplant problem
is given below. Let us look at the output and find all of the summaries we com-
puted. Also, look at the graphical summaries to assess whether the t-test and CI are
reasonable here.
#### Example: Age at First Transplant
# enter data as a vector
age <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)
The age data is unimodal, skewed left, no extreme outliers.
par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(age, freq = FALSE, breaks = 6)
points(density(age), type = "l")
rug(age)

# violin plot
library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")


[Figure: histogram of age with overlaid density curve and rug (x from 30 to 65), above a horizontal violin plot of age.]

# stem-and-leaf plot
stem(age, scale=2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 3 | 3
## 3 |
## 4 | 2
## 4 | 99
## 5 | 1444
## 5 | 68
## 6 | 4
# t.crit
qt(1 - 0.05/2, df = length(age) - 1)
## [1] 2.228139
# look at help for t.test
?t.test
# defaults include: alternative = "two.sided", conf.level = 0.95
t.summary <- t.test(age, mu = 50)
t.summary
##
## One Sample t-test
##
## data: age
## t = 0.51107, df = 10, p-value = 0.6204
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 45.72397 56.82149
## sample estimates:


## mean of x
## 51.27273
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.00 49.00 54.00 51.27 55.00 64.00
The assumption of normality of the sampling distribution appears reasonably
close, using the bootstrap discussed earlier. Therefore, the results for the t-test above
can be trusted.
bs.one.samp.dist(age)

[Figure: two panels from bs.one.samp.dist(age) — top: plot of the data with smoothed density curve; bottom: bootstrap sampling distribution of the mean (n = 11, mean = 51.273, se = 2.49031).]

Aside: To print the shaded region for the p-value, you can use the result of
t.test() with the function t.dist.pval() defined here.
# Function to plot t-distribution with shaded p-value
t.dist.pval <- function(t.summary) {
par(mfrow=c(1,1))
lim.extreme <- max(4, abs(t.summary$statistic) + 0.5)
lim.lower <- -lim.extreme;
lim.upper <- lim.extreme;
x.curve <- seq(lim.lower, lim.upper, length=200)
y.curve <- dt(x.curve, df = t.summary$parameter)
plot(x.curve, y.curve, type = "n"
, ylab = paste("t-dist( df =", signif(t.summary$parameter, 3), ")")
, xlab = paste("t-stat =", signif(t.summary$statistic, 5)
, ", Shaded area is p-value =", signif(t.summary$p.value, 5)))
if ((t.summary$alternative == "less")
| (t.summary$alternative == "two.sided")) {
x.pval.l <- seq(lim.lower, -abs(t.summary$statistic), length=200)
y.pval.l <- dt(x.pval.l, df = t.summary$parameter)
polygon(c(lim.lower, x.pval.l, -abs(t.summary$statistic))


, c(0, y.pval.l, 0), col="gray")


}
if ((t.summary$alternative == "greater")
| (t.summary$alternative == "two.sided")) {
x.pval.u <- seq(abs(t.summary$statistic), lim.upper, length=200)
y.pval.u <- dt(x.pval.u, df = t.summary$parameter)
polygon(c(abs(t.summary$statistic), x.pval.u, lim.upper)
, c(0, y.pval.u, 0), col="gray")
}
points(x.curve, y.curve, type = "l", lwd = 2, col = "black")
}

# for the age example


t.dist.pval(t.summary)
[Figure: t-distribution (df = 10) with the two-sided p-value region shaded; t-stat = 0.51107, shaded area is p-value = 0.62039.]

Aside: Note that the t.summary object returned from t.test() includes a number
of quantities that might be useful for additional calculations.
names(t.summary)
## [1] "statistic" "parameter" "p.value" "conf.int"
## [5] "estimate" "null.value" "alternative" "method"
## [9] "data.name"
t.summary$statistic
## t
## 0.5110715
t.summary$parameter
## df
## 10
t.summary$p.value
## [1] 0.6203942
t.summary$conf.int
## [1] 45.72397 56.82149
## attr(,"conf.level")
## [1] 0.95
t.summary$estimate
## mean of x
## 51.27273


t.summary$null.value
## mean
## 50
t.summary$alternative
## [1] "two.sided"
t.summary$method
## [1] "One Sample t-test"
t.summary$data.name
## [1] "age"

Example: Meteorites One theory of the formation of the solar system states
that all solar system meteorites have the same evolutionary history and thus have the
same cooling rates. By a delicate analysis based on measurements of phosphide crystal
widths and phosphide-nickel content, the cooling rates, in degrees Celsius per million
years, were determined for samples taken from meteorites named in the accompanying
table after the places they were found. The Walker2 County (Alabama, US), Uwet3
(Cross River, Nigeria), and Tocopilla4 (Antofagasta, Chile) meteorite cooling rate
data are below.
Suppose that a hypothesis of solar evolution predicted a mean cooling rate of
µ = 0.54 degrees per million years for the Tocopilla meteorite. Do the observed
cooling rates support this hypothesis? Test at the 5% level. The boxplot and stem-
and-leaf display (given below) show good symmetry. The assumption of a normal
distribution of observations basic to the t-test appears to be realistic.
Meteorite Cooling rates
Walker County 0.69 0.23 0.10 0.03 0.56 0.10 0.01 0.02 0.04 0.22
Uwet 0.21 0.25 0.16 0.23 0.47 1.20 0.29 1.10 0.16
Tocopilla 5.60 2.70 6.20 2.90 1.50 4.00 4.30 3.00 3.60 2.40 6.70 3.80
Let
µ = mean cooling rate over all pieces of the Tocopilla meteorite.
To answer the question of interest, we consider the test of H0 : µ = 0.54 against
HA : µ ≠ 0.54. Let us carry out the test, compute the p-value, and calculate
a 95% CI for µ. The sample summaries are n = 12, Ȳ = 3.892, s = 1.583. The
standard error is SEȲ = s/√n = 0.457.
R output for this problem is given below. For a 5% test (i.e., α = 0.05), you
would reject H0 in favor of HA because the p-value ≤ 0.05. The data strongly suggest
that µ ≠ 0.54. The 95% CI says that you are 95% confident that the population
mean cooling rate for the Tocopilla meteorite is between 2.89 and 4.90 degrees per
million years. Note that the CI gives us a means to assess how different µ is from the
hypothesized value of 0.54.
2 http://www.lpi.usra.edu/meteor/metbull.php?code=24204
3 http://www.lpi.usra.edu/meteor/metbull.php?code=24138
4 http://www.lpi.usra.edu/meteor/metbull.php?code=17001


#### Example: Meteorites


# enter data as a vector
toco <- c(5.6, 2.7, 6.2, 2.9, 1.5, 4.0, 4.3, 3.0, 3.6, 2.4, 6.7, 3.8)
The Tocopilla data is unimodal, skewed right, no extreme outliers.
par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(toco, freq = FALSE, breaks = 6)
points(density(toco), type = "l")
rug(toco)

# violin plot
library(vioplot)
vioplot(toco, horizontal=TRUE, col="gray")

[Figure: histogram of toco with overlaid density curve and rug (x from 1 to 7), above a horizontal violin plot of toco.]

# stem-and-leaf plot
stem(toco, scale=2)
##
## The decimal point is at the |
##
## 1 | 5
## 2 | 479
## 3 | 068
## 4 | 03
## 5 | 6
## 6 | 27
# t.crit
qt(1 - 0.05/2, df = length(toco) - 1)
## [1] 2.200985
t.summary <- t.test(toco, mu = 0.54)


t.summary
##
## One Sample t-test
##
## data: toco
## t = 7.3366, df = 11, p-value = 1.473e-05
## alternative hypothesis: true mean is not equal to 0.54
## 95 percent confidence interval:
## 2.886161 4.897172
## sample estimates:
## mean of x
## 3.891667
summary(toco)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 2.850 3.700 3.892 4.625 6.700
The assumption of normality of the sampling distribution appears reasonable.
Therefore, the results for the t-test above can be trusted.
t.dist.pval(t.summary)
bs.one.samp.dist(toco)

[Figures: left — t-distribution (df = 11) with the p-value region shaded, t-stat = 7.3366, shaded area is p-value = 1.473e−05; right — bs.one.samp.dist(toco) showing the data with smoothed density curve and the bootstrap sampling distribution of the mean (n = 12, mean = 3.8917, se = 0.456843).]

2.3.3 The mechanics of setting up hypothesis tests


When setting up a test you should imagine you are the researcher conducting the
experiment. In many studies, the researcher wishes to establish that there has been
a change from the status quo, or that they have developed a method that produces
a change (possibly in a specified direction) in the typical response. The researcher
sets H0 to be the status quo and HA to be the research hypothesis — the claim


the researcher wishes to make. In some studies you define the hypotheses so that
HA is the take action hypothesis — rejecting H0 in favor of HA leads one to take a
radical action.
Some perspective on testing is gained by understanding the mechanics behind the
tests. A hypothesis test is a decision process in the face of uncertainty. You are given
data and asked which of two contradictory claims about a population parameter, say
µ, is more reasonable. Two decisions are possible, but whether you make the correct
decision depends on the true state of nature which is unknown to you.
                                        State of nature
Decision                          H0 true              HA true
Fail to reject [accept] H0        correct decision     Type-II error
Reject H0 in favor of HA          Type-I error         correct decision
For a given problem, only one of these errors is possible. For example, if H0 is
true you can make a Type-I error but not a Type-II error. Any reasonable decision
rule based on the data that tells us when to reject H0 and when to not reject H0 will
have a certain probability of making a Type-I error if H0 is true, and a corresponding
probability of making a Type-II error if H0 is false and HA is true. For a given
decision rule, define
α = Prob( Reject H0 given H0 is true ) = Prob( Type-I error )
and
β = Prob( Fail to reject H0 when HA true ) = Prob( Type-II error ).
The mathematics behind hypothesis tests allows you to prespecify or control α.
For a given α, the tests we use (typically) have the smallest possible value of β. Given
the researcher can control α, you set up the hypotheses so that committing a Type-I
error is more serious than committing a Type-II error. The magnitude of α, also
called the size or level of the test, should depend on the seriousness of a Type-I
error in the given problem. The more serious the consequences of a Type-I error, the
smaller α should be. In practice α is often set to 0.10, 0.05, or 0.01, with α = 0.05
being the scientific standard. By setting α to be a small value, you reject H0 in favor
of HA only if the data convincingly indicate that H0 is false.
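A small simulation sketch of what α means (the normal population, its mean and sd, and the number of replicates below are assumptions chosen only for illustration): when H0 is true, a level 0.05 test should reject in about 5% of repeated samples.
# empirical Type-I error rate when H0 is true
set.seed(1)
alpha <- 0.05
pvals <- replicate(10000, t.test(rnorm(12, mean = 0.54, sd = 1.6), mu = 0.54)$p.value)
mean(pvals < alpha)  # proportion of (incorrect) rejections, close to 0.05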
Let us piece together these ideas for the meteorite problem. Evolutionary history
predicts µ = 0.54. A scientist examining the validity of the theory is trying to
decide whether µ = 0.54 or µ ≠ 0.54. Good scientific practice dictates that rejecting
another’s claim when it is true is more serious than not being able to reject it when
it is false. This is consistent with defining H0 : µ = 0.54 (the status quo) and
HA : µ ≠ 0.54. To convince yourself, note that the implications of a Type-I error
would be to claim the evolutionary theory is false when it is true, whereas a Type-
II error would correspond to not being able to refute the evolutionary theory when
it is false. With this setup, the scientist will refute the theory only if the data
overwhelmingly suggest that it is false.


2.3.4 The effect of α on the rejection region of a two-sided test
For a size α test, you reject H0 : µ = µ0 if

Ȳ − µ0
ts =
SEȲ

satisfies |ts | > tcrit .

[Figure: t-density (df = 11) showing the rejection regions for 0.05 and 0.01 level tests, with critical values ±2.201 and ±3.106.]

The critical value is computed so that the area under the t-probability curve (with
df = n − 1) outside ±tcrit is α, with 0.5α in each tail. Reducing α makes tcrit larger.
That is, reducing the size of the test makes rejecting H0 harder because the rejection
region is smaller. A pictorial representation is given above for the Tocopilla data,
where µ0 = 0.54, n = 12, and df = 11. Note that tcrit = 2.201 and 3.106 for α = 0.05
and 0.01, respectively.
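These critical values can be checked in R:
qt(1 - 0.05 / 2, df = 11)  # 2.201
qt(1 - 0.01 / 2, df = 11)  # 3.106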
The mathematics behind the test presumes that H0 is true. Given the data, you
use
Ȳ − µ0
ts =
SEȲ


to measure how far Ȳ is from µ0 , relative to the spread in the data given by SEȲ . For
ts to be in the rejection region, Ȳ must be significantly above or below µ0 , relative
to the spread in the data. To see this, note that the rejection rule can be expressed as:
Reject H0 if

Ȳ < µ0 − tcrit SEȲ or Ȳ > µ0 + tcrit SEȲ .

The rejection rule is sensible because Ȳ is our best guess for µ. You would reject
H0 : µ = µ0 only if Ȳ is so far from µ0 that you would question the reasonableness
of assuming µ = µ0 . How far Ȳ must be from µ0 before you reject H0 depends on
α (i.e., how willing you are to reject H0 if it is true), and on the value of SEȲ . For
a given sample, reducing α forces Ȳ to be further from µ0 before you reject H0 . For
a given value of α and s, increasing n allows smaller differences between Ȳ and µ0
to be statistically significant (i.e., lead to rejecting H0 ). In problems where small
differences between Ȳ and µ0 lead to rejecting H0 , you need to consider whether the
observed differences are important.
In essence, the t-distribution provides an objective way to calibrate whether the
observed Ȳ is typical of what sample means look like when sampling from a normal
population where H0 is true. If all other assumptions are satisfied, and Ȳ is inordi-
nately far from µ0 , then our only recourse is to conclude that H0 must be incorrect.

2.4 Two-sided tests, CI and p-values

An important relationship among two-sided tests of H0 : µ = µ0 , CI, and p-values is


that

size α test rejects H0 ⇔ 100(1 − α)% CI does not contain µ0 ⇔ p-value ≤ α, and

size α test fails to reject H0 ⇔ 100(1 − α)% CI contains µ0 ⇔ p-value > α.

For example, an α = 0.05 test rejects H0 ⇔ 95% CI does not contain µ0 ⇔


p-value ≤ 0.05. The picture below illustrates the connection between p-values and
rejection regions.


[Figure: t-density with ±tcrit marked — if ts falls between −tcrit and tcrit then p-value > α; if ts falls beyond ±tcrit then p-value < α.]
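A quick check of this equivalence with the age example (a small sketch; the age vector and the 95% level are as defined earlier):
t.age <- t.test(age, mu = 50, conf.level = 0.95)
t.age$p.value <= 0.05                                  # FALSE: do not reject at alpha = 0.05
(t.age$conf.int[1] <= 50) & (50 <= t.age$conf.int[2])  # TRUE: the 95% CI contains mu0 = 50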

Either a CI or a test can be used to decide the plausibility of the claim that
µ = µ0 . Typically, you use the test to answer the question is there a difference?
If so, you use the CI to assess how much of a difference exists. I believe that
scientists place too much emphasis on hypothesis testing.

2.5 Statistical versus practical significance

Suppose in the Tocopilla meteorite example, you rejected H0 : µ = 0.54 at the 5%


level and found a 95% two-sided CI for µ to be 0.55 to 0.58. Although you have
sufficient evidence to conclude that the population mean cooling rate µ differs from
that suggested by evolutionary theory, the range of plausible values for µ is small and
contains only values close to 0.54. Although you have shown statistical significance
here, you need to ask yourself whether the actual difference between µ and 0.54 is large
enough to be important. The answer to such questions is always problem specific.


2.6 Design issues and power


An experiment may not be sensitive enough to pick up true differences. For example,
in the Tocopilla meteorite example, suppose the true mean cooling rate is µ = 1.00.
To have a 50% chance of correctly rejecting H0 : µ = 0.54, you would need about
n = 48 observations. If the true mean is µ = 0.75, you would need about 221
observations to have a 50% chance of correctly rejecting H0 . In general, the smaller
the difference between the true and hypothesized mean (relative to the spread in the
population), the more data that is needed to reject H0 . If you have prior information
on the expected difference between the true and hypothesized mean, you can design
an experiment appropriately by choosing the sample size required to likely reject H0 .
The power of a test is the probability of correctly rejecting H0 when it is false.
Equivalently,
power = 1 − Prob( failing to reject H0 when it is false ) = 1 − Prob( Type-II error ).
For a given sample size, the tests I have discussed have maximum power (or smallest
probability of a Type-II error) among all tests with fixed size α. However, the actual
power may be small, so sample size calculations, as briefly highlighted above, are
important prior to collecting data. See your local statistician.
#### Power calculations (that you can do on your own)
# install.packages("pwr")
library(pwr)
?power.t.test
power.t.test(n = NULL, delta = 1.00 - 0.54, sd = sd(toco)
, sig.level = 0.05, power = 0.50
, type = "one.sample", alternative = "two.sided")
power.t.test(n = NULL, delta = 0.75 - 0.54, sd = sd(toco)
, sig.level = 0.05, power = 0.50
, type = "one.sample", alternative = "two.sided")
For more examples, try:
# install.packages("TeachingDemos")
library(TeachingDemos)
# Demonstrate concepts of Power.
?power.examp

2.7 One-sided tests on µ


There are many studies where a one-sided test is appropriate. The two common
scenarios are the lower one-sided test H0 : µ = µ0 (or µ ≥ µ0 ) versus HA : µ < µ0
and the upper one-sided test H0 : µ = µ0 (or µ ≤ µ0 ) versus HA : µ > µ0 .


Regardless of the alternative hypothesis, the tests are based on the t-statistic:

Ȳ − µ0
ts = .
SEȲ

For the upper one-sided test

1. Compute the critical value tcrit such that the area under the t-curve to the right
of tcrit is the desired size α, that is tcrit = tα .
2. Reject H0 if and only if ts ≥ tcrit .
3. The p-value for the test is the area under the t-curve to the right of the test
statistic ts .

The upper one-sided test uses the upper tail of the t-distribution for a re-
jection region. The p-value calculation reflects the form of the rejection region. You
will reject H0 only for large positive values of ts which require Ȳ to be significantly
greater than µ0 . Does this make sense?

For the lower one-sided test

1. Compute the critical value tcrit such that the area under the t-curve to the right
of tcrit is the desired size α, that is tcrit = tα .
2. Reject H0 if and only if ts ≤ −tcrit .
3. The p-value for the test is the area under the t-curve to the left of the test
statistic ts .

The lower one-sided test uses the lower tail of the t-distribution for a rejection
region. The calculation of the rejection region in terms of −tcrit is awkward but is
necessary for hand calculations because many statistical tables only give upper tail
percentiles. Note that here you will reject H0 only for large negative values of ts
which require Ȳ to be significantly less than µ0 .
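A minimal R sketch of both recipes; the test statistic, degrees of freedom, and α below are placeholders to be replaced by the values from your own test:
# one-sided critical value and p-values
ts    <- 2.2    # placeholder test statistic
df    <- 9      # placeholder degrees of freedom
alpha <- 0.05
qt(1 - alpha, df)               # tcrit (upper-tail critical value)
pt(ts, df, lower.tail = FALSE)  # upper one-sided p-value (area to the right of ts)
pt(ts, df)                      # lower one-sided p-value (area to the left of ts)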

As with two-sided tests, the p-value can be used to decide between rejecting or
not rejecting H0 for a test with a given size α. A picture of the rejection region and
the p-value evaluation for one-sided tests is given below.


[Figure: four panels — upper one-sided rejection region (area α to the right of tcrit) and upper one-sided p-value (area to the right of ts); lower one-sided rejection region (area α to the left of −tcrit) and lower one-sided p-value (area to the left of ts).]

ClickerQ s — One-sided tests on µ

Example: Weights of canned tomatoes A consumer group suspects that the


average weight of canned tomatoes being produced by a large cannery is less than
the advertised weight of 20 ounces. To check their conjecture, the group purchases
14 cans of the canner’s tomatoes from various grocery stores. The weights of the
contents of the cans to the nearest half ounce were as follows: 20.5, 18.5, 20.0, 19.5,
19.5, 21.0, 17.5, 22.5, 20.0, 19.5, 18.5, 20.0, 18.0, 20.5. Do the data confirm the
group’s suspicions? Test at the 5% level.


Let µ = the population mean weight for advertised 20 ounce cans of tomatoes
produced by the cannery. The company claims that µ = 20, but the consumer group
believes that µ < 20. Hence the consumer group wishes to test H0 : µ = 20 (or
µ ≥ 20) against HA : µ < 20. The consumer group will reject H0 only if the data
overwhelmingly suggest that H0 is false.
You should assess the normality assumption prior to performing the t-test. The
stem-and-leaf display and the boxplot suggest that the distribution might be slightly
skewed to the left. However, the skewness is not severe and no outliers are present,
so the normality assumption is not unreasonable.
R output for the problem is given below. Let us do a hand calculation using the
summarized data. The sample size, mean, and standard √ deviation are 14, 19.679, and
1.295, respectively. The standard error is SEȲ = s/ n = 0.346. We see that the
sample mean is less than 20. But is it sufficiently less than 20 for us to be willing to
publicly refute the canner’s claim? Let us carry out the test, first using the rejection
region approach, and then by evaluating a p-value.
The test statistic is

Ȳ − µ0 19.679 − 20
ts = = = −0.93.
SEȲ 0.346

The critical value for a 5% one-sided test is t0.05 = 1.771, so we reject H0 if ts < −1.771
(you can get that value from R or from the table). The test statistic is not in the
rejection region. Using the t-table, the p-value is between 0.15 and 0.20. I will draw
a picture to illustrate the critical region and p-value calculation. The exact p-value
from R is 0.185, which exceeds 0.05.
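These hand calculations can be confirmed in R (the t.test() call below reports the same p-value):
qt(0.05, df = 13)        # lower-tail critical value, about -1.771
pt(-0.92866, df = 13)    # one-sided p-value, about 0.185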
Both approaches lead to the conclusion that we do not have sufficient evidence
to reject H0 . That is, we do not have sufficient evidence to question the accuracy of
the canner’s claim. If you did reject H0 , is there something about how the data were
recorded that might make you uncomfortable about your conclusions?
#### Example: Weights of canned tomatoes
tomato <- c(20.5, 18.5, 20.0, 19.5, 19.5, 21.0, 17.5
, 22.5, 20.0, 19.5, 18.5, 20.0, 18.0, 20.5)

par(mfrow=c(2,1))
# Histogram overlaid with kernel density curve
hist(tomato, freq = FALSE, breaks = 6)
points(density(tomato), type = "l")
rug(tomato)

# violin plot
library(vioplot)
vioplot(tomato, horizontal=TRUE, col="gray")


# t.crit for a two-sided test; for this one-sided test the critical value is qt(1 - 0.05, df) = 1.771
qt(1 - 0.05/2, df = length(tomato) - 1)
## [1] 2.160369
t.summary <- t.test(tomato, mu = 20, alternative = "less")
t.summary
##
## One Sample t-test
##
## data: tomato
## t = -0.92866, df = 13, p-value = 0.185
## alternative hypothesis: true mean is less than 20
## 95 percent confidence interval:
## -Inf 20.29153
## sample estimates:
## mean of x
## 19.67857
summary(tomato)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.50 18.75 19.75 19.68 20.38 22.50

[Figure: histogram of tomato with overlaid density curve and rug (x from 17 to 23), above a horizontal violin plot of tomato.]

The assumption of normality of the sampling distribution appears reasonable.


Therefore, the results for the t-test above can be trusted.
t.dist.pval(t.summary)
bs.one.samp.dist(tomato)


[Figures: left — t-distribution (df = 13) with the lower-tail p-value region shaded, t-stat = −0.92866, shaded area is p-value = 0.18499; right — bs.one.samp.dist(tomato) showing the data with smoothed density curve and the bootstrap sampling distribution of the mean (n = 14, mean = 19.679, se = 0.346121).]

2.7.1 One-sided CIs


How should you couple a one-sided test with a CI procedure? For a lower one-sided
test, you are interested only in an upper bound on µ. Similarly, with an upper
one-sided test you are interested in a lower bound on µ. Computing these types
of bounds maintains the consistency between tests and CI procedures. The general
formulas for lower and upper 100(1 − α)% confidence bounds on µ are given by

Ȳ − tcrit SEȲ and Ȳ + tcrit SEȲ

respectively, where tcrit = tα .


In the cannery problem, to get an upper 95% bound on µ, the critical value is the
same as we used for the one-sided 5% test: t0.05 = 1.771. The upper bound on µ is

Ȳ + t0.05 SEȲ = 19.679 + 1.771 × 0.346 = 19.679 + 0.613 = 20.292.

Thus, you are 95% confident that the population mean weight of the canner’s 20oz
cans of tomatoes is less than or equal to 20.29. As expected, this interval covers 20.
If you are doing a one-sided test in R, it will generate the correct one-sided bound.
That is, a lower one-sided test will generate an upper bound, whereas an upper one-
sided test generates a lower bound. If you only wish to compute a one-sided bound
without doing a test, you need to specify the direction of the alternative which gives
the type of bound you need. An upper bound was generated by R as part of the test
we performed earlier. The result agrees with the hand calculation.
Quite a few packages do not directly compute one-sided bounds, so you have to
fudge a bit. In the cannery problem, to get an upper 95% bound on µ, you take


the upper limit from a 90% two-sided CI for µ. The rationale for this
is that with the 90% two-sided CI, µ will fall above the upper limit 5% of the time
and fall below the lower limit 5% of the time. Thus, you are 95% confident that µ
falls below the upper limit of this interval, which gives us our one-sided bound. Here,
you are 95% confident that the population mean weight of the canner’s 20 oz cans of
tomatoes is less than or equal to 20.29, which agrees with our hand calculation.
One-Sample T: Cans
Variable N Mean StDev SE Mean 90% CI
Cans 14 19.6786 1.2951 0.3461 (19.0656, 20.2915)
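The same 90% two-sided interval can be reproduced in R from the tomato vector defined earlier; its upper limit is the one-sided 95% bound:
# 90% two-sided CI; the upper endpoint (20.29) is the upper 95% confidence bound on mu
t.test(tomato, conf.level = 0.90)$conf.int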
The same logic applies if you want to generalize the one-sided confidence bounds
to arbitrary confidence levels and to lower one-sided bounds — always double the
error rate of the desired one-sided bound to get the error rate of the required two-
sided interval! For example, if you want a lower 99% bound on µ (with a 1% error
rate), use the lower limit on the 98% two-sided CI for µ (which has a 2% error rate).

ClickerQ s — P-value



Chapter 3

Two-Sample Inferences

Contents
3.1 Comparing Two Sets of Measurements . . . . . . . . . . 92
3.1.1 Plotting head breadth data: . . . . . . . . . . . . . . . . . . 93
3.1.2 Salient Features to Notice . . . . . . . . . . . . . . . . . . . 99
3.2 Two-Sample Methods: Paired Versus Independent Sam-
ples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3 Two Independent Samples: CI and Test Using Pooled
Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.4 Satterthwaite’s Method, unequal variances . . . . . . . . 101
3.4.1 R Implementation . . . . . . . . . . . . . . . . . . . . . . . 102
3.5 One-Sided Tests . . . . . . . . . . . . . . . . . . . . . . . . 111
3.6 Paired Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.6.1 R Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.7 Should You Compare Means? . . . . . . . . . . . . . . . . 120

Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent populations.
assess the assumptions of the two-sample t-test visually.
decide whether the means between two populations are different.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.


5. define parameters of interest and hypotheses in words and notation.


6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.

3.1 Comparing Two Sets of Measurements


Suppose you have collected data on one variable from two (independent) samples and
you are interested in “comparing” the samples. What tools are good to use?

Example: Head Breadths In this analysis, we will compare a physical feature


of modern day Englishmen with the corresponding feature of some of their ancient
countrymen. The Celts were a vigorous race of people who once populated parts of
England. It is not entirely clear whether they simply died out or merged with other
people who were the ancestors of those who live in England today. A goal of this
study might be to shed some light on possible genetic links between the two groups.
The study is based on the comparison of maximum head breadths (in millimeters)
made on unearthed Celtic skulls and on a number of skulls of modern-day Englishmen.
The data are given below. We have a sample of 18 Englishmen and an independent
sample of 16 Celtic skulls.
#### Example: Head Breadths
# unstacked data as two vectors
english <- c(141, 148, 132, 138, 154, 142, 150, 146, 155, 158,
150, 140, 147, 148, 144, 150, 149, 145)
celts <- c(133, 138, 130, 138, 134, 127, 128, 138, 136, 131,
126, 120, 124, 132, 132, 125)
english
## [1] 141 148 132 138 154 142 150 146 155 158 150 140 147 148 144 150
## [17] 149 145
celts
## [1] 133 138 130 138 134 127 128 138 136 131 126 120 124 132 132 125
What features of these data would we likely be interested in comparing? The
centers of the distributions, the spreads within each distribution, the distributional
shapes, etc.
These data can be analyzed in R as either unstacked separate vectors or as
stacked data where one column contains both samples, with a second column of
labels or subscripts to distinguish the samples. It is easy to create stacked data
from unstacked data and vice-versa. Many comparisons require the plots for the two
groups to have the same scale, which is easiest to control when the data are stacked.


# stacked data as a vector of values and a vector of labels


HeadBreadth <- c(english, celts)
Group <- c(rep("English", length(english)), rep("Celts", length(celts)))
hb <- data.frame(HeadBreadth, Group)
hb
## HeadBreadth Group
## 1 141 English
## 2 148 English
## 3 132 English
## 4 138 English
## 5 154 English
## 6 142 English
## 7 150 English
## 8 146 English
## 9 155 English
## 10 158 English
## 11 150 English
## 12 140 English
## 13 147 English
## 14 148 English
## 15 144 English
## 16 150 English
## 17 149 English
## 18 145 English
## 19 133 Celts
## 20 138 Celts
## 21 130 Celts
## 22 138 Celts
## 23 134 Celts
## 24 127 Celts
## 25 128 Celts
## 26 138 Celts
## 27 136 Celts
## 28 131 Celts
## 29 126 Celts
## 30 120 Celts
## 31 124 Celts
## 32 132 Celts
## 33 132 Celts
## 34 125 Celts

3.1.1 Plotting head breadth data:


1. A dotplot with the same scale for both samples is obtained easily from the
stacked data.


#### Plotting head breadth data


# stripchart (dotplot) using R base graphics
stripchart(HeadBreadth ~ Group, method = "stack", data = hb,
main = "Head breadth comparison", xlab = "head breadth (mm)")

# stripchart (dotplot) using ggplot


library(ggplot2)
p <- ggplot(hb, aes(x = HeadBreadth))
p <- p + geom_dotplot(binwidth = 2)
p <- p + facet_grid(Group ~ .) # rows are Group categories
p <- p + labs(title = "Head breadth comparison") + xlab("head breadth (mm)")
print(p)

[Figures: "Head breadth comparison" dotplots (base stripchart and ggplot geom_dotplot), one row per group (English, Celts), with head breadth (mm) on the x-axis from about 120 to 160.]

2. Boxplots for comparison are most helpful when plotted in the same axes.
# boxplot using R base graphics
boxplot(HeadBreadth ~ Group, method = "stack", data = hb,
horizontal = TRUE,
main = "Head breadth comparison", xlab = "head breadth (mm)")

p <- ggplot(hb, aes(x = Group, y = HeadBreadth))


p <- p + geom_boxplot()
# add a "+" at the mean


p <- p + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)


p <- p + coord_flip()
p <- p + labs(title = "Head breadth comparison")
print(p)

[Figures: "Head breadth comparison" boxplots (base boxplot and ggplot geom_boxplot with a "+" at the mean), English and Celts on a common head breadth (mm) axis from about 120 to 158.]

3. Histograms are hard to compare unless you make the scale and actual bins the
same for both. Why is the pair on the right clearly preferable?

# histogram using R base graphics


par(mfcol=c(2,2))
hist(hb$HeadBreadth[(hb$Group == "Celts")],
main = "Head breadth, Celts", xlab = "head breadth (mm)")
hist(hb$HeadBreadth[(hb$Group == "English")],
main = "Head breadth, English", xlab = "head breadth (mm)")

# common x-axis limits based on the range of the entire data set
hist(hb$HeadBreadth[(hb$Group == "Celts")], xlim = range(hb$HeadBreadth),
main = "Head breadth, Celts", xlab = "head breadth (mm)")
hist(hb$HeadBreadth[(hb$Group == "English")], xlim = range(hb$HeadBreadth),
main = "Head breadth, English", xlab = "head breadth (mm)")


[Figure: four base-graphics histograms — "Head breadth, Celts" and "Head breadth, English", each drawn once with its own x-axis range and once with a common x-axis (120 to 158); frequency on the y-axis, head breadth (mm) on the x-axis.]

# histogram using ggplot


p <- ggplot(hb, aes(x = HeadBreadth))
p <- p + geom_histogram(binwidth = 4)
p <- p + geom_rug()
p <- p + facet_grid(Group ~ .)
p <- p + labs(title = "Head breadth comparison") + xlab("head breadth (mm)")
print(p)


[Figure: "Head breadth comparison" — ggplot histograms faceted by Group (Celts, English) on a common head breadth (mm) axis from 120 to 160, with counts on the y-axis and rug marks.]

p <- ggplot(hb, aes(x = HeadBreadth, fill=Group))


p <- p + geom_histogram(binwidth = 4, alpha = 0.5, position="identity")
p <- p + geom_rug(aes(colour=Group), alpha = 1/2)
p <- p + labs(title = "Head breadth comparison") + xlab("head breadth (mm)")
print(p)

p <- ggplot(hb, aes(x = HeadBreadth, fill=Group))


p <- p + geom_histogram(binwidth = 4, alpha = 1, position="dodge")
p <- p + geom_rug(aes(colour=Group), alpha = 1/2)
p <- p + labs(title = "Head breadth comparison") + xlab("head breadth (mm)")
print(p)
[Figures: "Head breadth comparison" — ggplot histograms of both groups on one set of axes, overlaid with transparency (position = "identity") and side by side (position = "dodge"), colored by Group, head breadth (mm) from 120 to 160.]

4. Stem-and-leaf displays for comparisons in R can be pretty useless. The stems


are not forced to match (just like with histograms). It is pretty hard to make
quick comparisons with the following:
stem(english, scale = 2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 13 | 2
## 13 | 8
## 14 | 0124
## 14 | 567889
## 15 | 0004
## 15 | 58


stem(celts, scale = 2)
##
## The decimal point is at the |
##
## 120 | 0
## 122 |
## 124 | 00
## 126 | 00
## 128 | 0
## 130 | 00
## 132 | 000
## 134 | 0
## 136 | 0
## 138 | 000

Using the by() function, you can get summaries by group.


#### summaries by group
# summary for separate vectors
summary(english)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 132.0 142.5 147.5 146.5 150.0 158.0
summary(celts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120.0 126.8 131.5 130.8 134.5 138.0
# comparing spreads, an assumption of equal variances seems reasonable
sd(english)
## [1] 6.382421
sd(celts)
## [1] 5.434458
IQR(english)
## [1] 7.5
IQR(celts)
## [1] 7.75
# numerical summary of each column in data.frame hb by Group
by(hb, Group, summary)
## Group: Celts
## HeadBreadth Group
## Min. :120.0 Celts :16
## 1st Qu.:126.8 English: 0
## Median :131.5
## Mean :130.8
## 3rd Qu.:134.5
## Max. :138.0
## ----------------------------------------------------


## Group: English
## HeadBreadth Group
## Min. :132.0 Celts : 0
## 1st Qu.:142.5 English:18
## Median :147.5
## Mean :146.5
## 3rd Qu.:150.0
## Max. :158.0

3.1.2 Salient Features to Notice


The dotplots, boxplots, and histograms indicate that the English and Celt samples
are slightly skewed to the left. There are no outliers in either sample. It is not
unreasonable to operationally assume that the population frequency curves (i.e., the
histograms for the populations from which the samples were selected) for the English
and Celtic head breadths are normal. Therefore, the sampling distribution of the
means will be reasonably normal.
The sample means and medians are close to each other in each sample, which is
not surprising given the near symmetry and the lack of outliers.
The data suggest that the typical modern English head breadth is greater than
that for Celts. The data sets have comparable spreads, as measured by either the
standard deviation or the IQR.

3.2 Two-Sample Methods: Paired Versus Independent Samples
Suppose you have two populations of interest, say populations 1 and 2, and you are
interested in comparing their (unknown) population means, µ1 and µ2 . Inferences
on the unknown population means are based on samples from each population. In
practice, most problems fall into one of two categories.
Independent samples where the sample taken from population 1 has no effect
on which observations are selected from population 2, and vice versa.
Paired or dependent samples where experimental units are paired based on factors
related or unrelated to the variable measured.
The distinction between paired and independent samples may be made clear
through a series of examples.

Example The English and Celt head breadth samples are independent.


Example Suppose you are interested in whether the CaCO3 (calcium carbonate)
level in the Atrisco well field, which is the water source for Albuquerque, is changing
over time. To answer this question, the CaCO3 level was recorded at each of 15 wells
at two time points. These data are paired. The two samples are the observations at
Times 1 and 2.

Example To compare state incomes, a random sample of New Mexico households


was selected, and an independent sample of Arizona households was obtained. It is
reasonable to assume independent samples.

Example Suppose you are interested in whether the husband or wife is typically
the heavier smoker among couples where both adults smoke. Data are collected on
households. You measure the average number of cigarettes smoked by each husband
and wife within the sample of households. These data are paired, i.e., you have
selected husband wife pairs as the basis for the samples. It is reasonable to believe
that the responses within a pair are related, or correlated.

Although the focus here will be on comparing population means, you should rec-
ognize that in paired samples you may also be interested, as in the problems above,
in how observations compare within a pair. That is, a paired comparison might
be interested in the difference between the two paired samples. These goals need
not agree, depending on the questions of interest. Note that with paired data, the
sample sizes are equal, and equal to the number of pairs.

ClickerQ s — Independent or paired 1, STT.08.02.030

ClickerQ s — Independent or paired 2, STT.08.02.040

3.3 Two Independent Samples: CI and Test Using Pooled Variance
These methods assume that the populations have normal frequency curves, with equal
population standard deviations, i.e., σ1 = σ2 . Let (n1 , Ȳ1 , s1 ) and (n2 , Ȳ2 , s2 ) be
the sample sizes, means and standard deviations from the two samples. The standard
CI for µ1 − µ2 is given by

CI = (Ȳ1 − Ȳ2 ) ± tcrit SEȲ1 −Ȳ2


The t-statistic for testing H0 : µ1 − µ2 = 0 (µ1 = µ2) against HA : µ1 − µ2 ≠ 0
(µ1 ≠ µ2) is given by

    ts = (Ȳ1 − Ȳ2) / SEȲ1−Ȳ2 .

The standard error of Ȳ1 − Ȳ2 used in both the CI and the test is given by

    SEȲ1−Ȳ2 = spooled √(1/n1 + 1/n2).

Here the pooled variance estimator,

    s²pooled = ((n1 − 1) s²1 + (n2 − 1) s²2) / (n1 + n2 − 2),

is our best estimate of the common population variance. The pooled estimator of
variance is a weighted average of the two sample variances, with more weight given to
the larger sample. If n1 = n2 then s²pooled is the average of s²1 and s²2. The critical value
tcrit for the CI and tests is obtained in the usual way from a t-table with df = n1 + n2 − 2.
For the test, follow the one-sample procedure, with the new ts and tcrit.
The pooled CI and tests are sensitive to the normality and equal standard devi-
ation assumptions. The observed data can be used to assess the reasonableness of
these assumptions. You should look at boxplots and histograms to assess normality,
and should check whether s1 ≈ s2 to assess the assumption σ1 = σ2. Formal tests of
these assumptions will be discussed later.

3.4 Satterthwaite’s Method, unequal variances


Satterthwaite’s method assumes normality, but does not require equal popula-
tion standard deviations. Satterthwaite’s procedures are somewhat conservative, and
adjust the SE and df to account for unequal population variances. Satterthwaite’s
method uses the same CI and test statistic formula, with a modified standard error:
    SEȲ1−Ȳ2 = √(s²1/n1 + s²2/n2),

and degrees of freedom:

    df = (s²1/n1 + s²2/n2)² / ( s⁴1/(n1²(n1 − 1)) + s⁴2/(n2²(n2 − 1)) ).

Note that df = n1 + n2 − 2 when n1 = n2 and s1 = s2. The Satterthwaite and pooled
variance procedures usually give similar results when s1 ≈ s2.

Advanced Data Analysis 1 & 2, UNM Stat 427/527 - 428/528


102 Ch 3: Two-Sample Inferences

The df formula for Satterthwaite’s method is fairly complex, so when done by


hand some use a conservative df formula that uses the minimum of n1 − 1 and n2 − 1
instead.
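For instance, with the head breadth samples used in the example below (n1 = 16 Celts and n2 = 18 Englishmen), the conservative choice gives:
# conservative df for a hand calculation (sample sizes from the head breadth example)
min(16 - 1, 18 - 1)  # 15, compared with the Satterthwaite df of about 32 computed below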

3.4.1 R Implementation
R does the pooled and Satterthwaite (Welch) analyses, either on stacked or unstacked
data. The output will contain a p-value for a two-sided test of equal population means
and a CI for the difference in population means. If you include var.equal = TRUE you
will get the pooled method, otherwise the output is for Satterthwaite’s method.

Example: Head Breadths The English and Celts are independent samples. We
looked at boxplots and histograms, which suggested that the normality assumption
for the t-test is reasonable. The R output shows the English and Celt sample standard
deviations and IQRs are fairly close, so the pooled and Satterthwaite results should
be comparable. The pooled analysis is preferable here, but either is appropriate.
We are interested in difference in mean head breadths between Celts and English.
1. Define the population parameters and hypotheses in words and
notation
Let µ1 and µ2 be the mean head breadth for the Celts and English, respectively.
In words: “The difference in population means between Celts and English is differ-
ent from zero mm.”
In notation: H0 : µ1 = µ2 versus HA : µ1 ≠ µ2 .
Alternatively: H0 : µ1 − µ2 = 0 versus HA : µ1 − µ2 ≠ 0.
2. Calculate summary statistics from sample
Mean, standard deviation, sample size:
#### Calculate summary statistics
m1 <- mean(celts)
s1 <- sd(celts)
n1 <- length(celts)
m2 <- mean(english)
s2 <- sd(english)
n2 <- length(english)
c(m1, s1, n1)
## [1] 130.750000 5.434458 16.000000
c(m2, s2, n2)
## [1] 146.500000 6.382421 18.000000
The pooled-standard devation, standard error, and degrees-of-freedom are:
sdpool <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
sdpool
## [1] 5.956876


SEpool <- sdpool * sqrt(1 / n1 + 1 / n2)


SEpool
## [1] 2.046736
dfpool <- n1 + n2 - 2
dfpool
## [1] 32
t_pool <- (m1 - m2) / SEpool
t_pool
## [1] -7.69518
The Satterthwaite SE and degrees-of-freedom are:
SE_Sat <- sqrt(s1^2 / n1 + s2^2 / n2)
SE_Sat
## [1] 2.027043
df_Sat <- (SE_Sat^2)^2 / (s1^4 / (n1^2 * (n1 - 1)) + s2^4 / (n2^2 * (n2 - 1)))
df_Sat
## [1] 31.9511
t_Sat <- (m1 - m2) / SE_Sat
t_Sat
## [1] -7.769937
3. Specify confidence level, calculate t-stat, CI limits, p-value
Let us calculate a 95% CI for µ1 − µ2 .
Assuming equal variances, using pooled-variance procedure:
## Equal variances
# var.equal = FALSE is the default

# two-sample t-test specifying two separate vectors


t.summary.eqvar <- t.test(celts, english, var.equal = TRUE)
t.summary.eqvar
##
## Two Sample t-test
##
## data: celts and english
## t = -7.6952, df = 32, p-value = 9.003e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.91906 -11.58094
## sample estimates:
## mean of x mean of y
## 130.75 146.50
Not assuming equal variances, Satterthwaite (Welch):


# two-sample t-test with unequal variances (Welch = Satterthwaite)


# specified using data.frame and a formula, HeadBreadth by Group
t.summary.uneqvar <- t.test(HeadBreadth ~ Group, data = hb, var.equal = FALSE)
t.summary.uneqvar
##
## Welch Two Sample t-test
##
## data: HeadBreadth by Group
## t = -7.7699, df = 31.951, p-value = 7.414e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.8792 -11.6208
## sample estimates:
## mean in group Celts mean in group English
## 130.75 146.50
The form of the output will tell you which sample corresponds to population 1 and
which corresponds to population 2.
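As a quick check, the pooled 95% CI and two-sided p-value in the output above can be
reproduced by hand from the step 2 summary statistics:
# reproduce the pooled 95% CI and two-sided p-value by hand
t_crit <- qt(1 - 0.05/2, df = dfpool)
(m1 - m2) + c(-1, 1) * t_crit * SEpool   # 95% CI limits, about -19.92 and -11.58
2 * pt(-abs(t_pool), df = dfpool)        # two-sided p-value, about 9e-09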
4. Summarize in words (Using the pooled-variance results.)
The pooled analysis strongly suggests that H0 : µ1 − µ2 = 0 is false, given the large
t-statistic of −7.7 and two-sided p-value of 9 × 10⁻⁹. Because the p-value < 0.05
we reject the Null hypothesis in favor of the Alternative hypothesis, concluding that
the population mean head breadths for the Celts and English are different.
We are 95% confident that the difference in population means, µ1 − µ2 , is between
−19.9 and −11.6 mm. That is, we are 95% confident that the population mean
head breadth for Englishmen (µ2 ) exceeds the population mean head breadth for
Celts (µ1 ) by between 11.6 and 19.9 mm.
The CI interpretation is made easier by recognizing that we concluded the popula-
tion means are different, so the direction of difference must be consistent with that
seen in the observed data, where the sample mean head breadth for Englishmen
exceeds that for the Celts. Thus, the limits on the CI for µ1 − µ2 tell us how much
smaller the mean is for the Celts (that is, between −19.9 and −11.6 mm).
5. Check assumptions
The assumption of equal population variances will be left to a later chapter.
We can test the assumption that the distribution of Ȳ1 − Ȳ2 is normal using the
bootstrap in the following function.
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
n1 <- length(dat1);
n2 <- length(dat2);
# resample from data
sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
# calculate the means and take difference between populations
sam1.mean <- colMeans(sam1);


sam2.mean <- colMeans(sam2);


diff.mean <- sam1.mean - sam2.mean;
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat1, freq = FALSE, breaks = 6
, main = paste("Sample 1", "\n"
, "n =", n1
, ", mean =", signif(mean(dat1), digits = 5)
, ", sd =", signif(sd(dat1), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat1), type = "l")
rug(dat1)

hist(dat2, freq = FALSE, breaks = 6


, main = paste("Sample 2", "\n"
, "n =", n2
, ", mean =", signif(mean(dat2), digits = 5)
, ", sd =", signif(sd(dat2), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat2), type = "l")
rug(dat2)

hist(diff.mean, freq = FALSE, breaks = 25


, main = paste("Bootstrap sampling distribution of the difference in means", "\n"
, "mean =", signif(mean(diff.mean), digits = 5)
, ", se =", signif(sd(diff.mean), digits = 5)))
# overlay a density curve for the sample means
points(density(diff.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(diff.mean), max(diff.mean), length = 1000)
points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(diff.mean)
# restore par() settings
par(old.par)
}

The distribution of difference in means in the third plot looks very close to normal.

bs.two.samp.diff.dist(celts, english)

[Figure: bootstrap comparison for the head-breadth samples. Panels: "Sample 1, n = 16, mean = 130.75, sd = 5.4345"; "Sample 2, n = 18, mean = 146.5, sd = 6.3824"; "Bootstrap sampling distribution of the difference in means, mean = -15.749, se = 1.9777".]

ClickerQ s — t-interval, STT.08.02.010

Example: Androstenedione levels in diabetics The data consist of independent
samples of diabetic men and women. For each individual, the scientist recorded
their androstenedione level (ng/dL) (a hormone, and Mark McGwire’s favorite dietary
supplement). Let µ1 = mean androstenedione level for the population of diabetic men,
and µ2 = mean androstenedione level for the population of diabetic women. We are
interested in comparing the population means given the observed data.
The raw data and R output are given below. The boxplots suggest that the
distributions are reasonably symmetric. However, the normality assumption for the
women is unreasonable due to the presence of outliers. The equal population standard
deviation assumption also appears unreasonable. The sample standard deviation for
men is noticeably larger than the women’s standard deviation, even with outliers in
the women’s sample.
#### Example: Androstenedione levels in diabetics
# Data and numerical summaries
men <- c(217, 123, 80, 140, 115, 135, 59, 126, 70, 63,


147, 122, 108, 70)


women <- c( 84, 87, 77, 84, 73, 66, 70, 35, 77, 73,
56, 112, 56, 84, 80, 101, 66, 84)
level <- c(men, women)
sex <- c(rep("men", length(men)), rep("women", length(women)))
andro <- data.frame(level, sex)
andro
## level sex
## 1 217 men
## 2 123 men
## 3 80 men
## 4 140 men
## 5 115 men
## 6 135 men
## 7 59 men
## 8 126 men
## 9 70 men
## 10 63 men
## 11 147 men
## 12 122 men
## 13 108 men
## 14 70 men
## 15 84 women
## 16 87 women
## 17 77 women
## 18 84 women
## 19 73 women
## 20 66 women
## 21 70 women
## 22 35 women
## 23 77 women
## 24 73 women
## 25 56 women
## 26 112 women
## 27 56 women
## 28 84 women
## 29 80 women
## 30 101 women
## 31 66 women
## 32 84 women
# numerical summaries
by(andro, sex, summary)
## sex: men
## level sex
## Min. : 59.0 men :14
## 1st Qu.: 72.5 women: 0
## Median :118.5
## Mean :112.5


## 3rd Qu.:132.8
## Max. :217.0
## ----------------------------------------------------
## sex: women
## level sex
## Min. : 35.00 men : 0
## 1st Qu.: 67.00 women:18
## Median : 77.00
## Mean : 75.83
## 3rd Qu.: 84.00
## Max. :112.00
c(sd(men), sd(women), IQR(men), IQR(women), length(men), length(women))
## [1] 42.75467 17.23625 60.25000 17.00000 14.00000 18.00000
p <- ggplot(andro, aes(x = sex, y = level, fill=sex))
p <- p + geom_boxplot()
# add a "+" at the mean
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
#p <- p + coord_flip()
p <- p + labs(title = "Androstenedione Levels in Diabetics")
print(p)

p <- ggplot(andro, aes(x = level, fill=sex))


p <- p + geom_histogram(binwidth = 20, alpha = 0.5, position="identity")
p <- p + geom_rug(aes(colour=sex), alpha = 1/2)
p <- p + labs(title = "Androstenedione Levels in Diabetics")
print(p)

[Figure: "Androstenedione Levels in Diabetics" — boxplots of level by sex (left) and overlaid histograms of level by sex (right).]

Because of the large difference in variances, I will be more comfortable with the
Satterthwaite analysis here than the pooled variance analysis. The normality assump-
tion of the difference in means appears to be met using the bootstrap assessment. The
distribution of difference in means in the third plot looks very close to normal.
bs.two.samp.diff.dist(men, women)

[Figure: bootstrap comparison for the androstenedione samples. Panels: "Sample 1, n = 14, mean = 112.5, sd = 42.755"; "Sample 2, n = 18, mean = 75.833, sd = 17.236"; "Bootstrap sampling distribution of the difference in means, mean = 36.509, se = 11.669".]

1. Define the population parameters and hypotheses in words and


notation
Let µ1 and µ2 be the mean androstenedione level for diabetic men and women,
respectively.
In words: “The difference in population mean androstenedione levels between dia-
betic men and women is different from zero.”
In notation: H0 : µ1 − µ2 = 0 versus HA : µ1 − µ2 ≠ 0.
2. Calculate summary statistics from sample
(see above)
3. Specify confidence level, calculate t-stat, CI limits, p-value
Not assuming equal variances, Satterthwaite (Welch):
# two-sample t-test with unequal variances (Welch = Satterthwaite)
# specified using data.frame and a formula, level by sex
t.summary <- t.test(level ~ sex, data = andro, var.equal = FALSE)
t.summary
##
## Welch Two Sample t-test
##
## data: level by sex


## t = 3.0235, df = 16.295, p-value = 0.007946


## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 10.99555 62.33778
## sample estimates:
## mean in group men mean in group women
## 112.50000 75.83333
# plot t-distribution with shaded p-value
t.dist.pval(t.summary)
[Figure: t-distribution with df = 16.3; t-stat = 3.0235, shaded area is p-value = 0.0079461.]

4. Summarize in words The unequal-variance analysis suggests that H0 : µ1 − µ2 = 0
is false, given the large t-statistic of 3.02 and two-sided p-value of 0.00795.
Because the p-value < 0.05 we reject the Null hypothesis in favor of the Alternative
hypothesis, concluding that the population mean androstenedione levels for diabetic
men and women are different.
We are 95% confident that the difference in population means, µ1 − µ2 ,
is between 11 and 62.3 ng/dL.
5. Check assumptions
As checked before, while the assumption of equal population variances is not met,
the assumption that the distribution of Ȳ1 − Ȳ2 is normal appeared reasonable based
on the bootstrap assessment.
As a comparison, let us examine the output for the pooled procedure (which is
inappropriate since variances are unequal). The p-value for the pooled t-test is 0.002,
and the 95% confidence limits are 14.1 and 59.2. That is, we are 95% confident
that the population mean andro level for men exceeds that for women by at least 14.1
but by no more than 59.2. These results are qualitatively similar to the Satterthwaite
conclusions.
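The pooled results quoted here come from a call like the following (output omitted):
# pooled-variance two-sample t-test, for comparison only
t.test(level ~ sex, data = andro, var.equal = TRUE)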


3.5 One-Sided Tests


One-sided tests for two-sample problems are where the null hypothesis is H0 : µ1 −
µ2 = 0 but the alternative is directional, either HA : µ1 − µ2 < 0 (i.e., µ1 < µ2 )
or HA : µ1 − µ2 > 0 (i.e., µ1 > µ2 ). Once you understand the general form of
rejection regions and p-values for one-sample tests, the one-sided two-sample tests
do not pose any new problems. Use the t-statistic, with the appropriate tail of the
t-distribution to define critical values and p-values. One-sided two-sample tests are
directly implemented in R, by specifying the type of test with alternative = "less"
or alternative = "greater". One-sided confidence bounds are given with the one-
sided tests.
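For example, a one-sided pooled test of HA : µ1 − µ2 < 0 for the head-breadth data could
be run as follows (output not shown):
# one-sided two-sample t-test, HA: mu1 - mu2 < 0
t.test(celts, english, alternative = "less", var.equal = TRUE)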

3.6 Paired Analysis


With paired data, inferences on µ1 − µ2 are based on the sample of differences within
pairs. By taking differences within pairs, two dependent samples are transformed into
one sample, which contains the relevant information for inferences on µd = µ1 − µ2 .
To see this, suppose the observations within a pair are Y1 and Y2 . Then within each
pair, compute the difference d = Y1 − Y2 :
d1 = Y11 − Y21
d2 = Y12 − Y22
...
dn = Y1n − Y2n
If the Y1 data are from a population with mean µ1 and the Y2 data are from a
population with mean µ2 , then the ds are a sample from a population with mean µd =
µ1 − µ2 . Furthermore, if the sample of differences comes from a normal population,
then we can use standard one-sample techniques on d1 , . . . , dn to test µd = 0 (that is,
µ1 = µ2 ), and to get a CI for µd = µ1 − µ2 .
Let d̄ = n⁻¹ Σi di = Ȳ1 − Ȳ2 be the sample mean of the differences (which is also
the mean difference), and let sd be the sample standard deviation of the differences.
The standard error of d̄ is SEd̄ = sd/√n, where n is the number of pairs. The paired
t-test (two-sided) CI for µd is given by d̄ ± tcrit SEd̄ . To test H0 : µd = 0 (µ1 = µ2)
against HA : µd ≠ 0 (µ1 ≠ µ2), use

ts = (d̄ − 0) / SEd̄
to compute a p-value as in a two-sided one-sample test. One-sided tests are evaluated
in the usual way for one-sample tests on means.
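A minimal sketch of these calculations, using small hypothetical paired measurements:
# paired analysis "by hand" on hypothetical data
y1 <- c(5.2, 6.1, 4.8, 5.9, 6.3)          # first measurement in each pair
y2 <- c(4.9, 5.5, 4.9, 5.1, 5.8)          # second measurement in each pair
d  <- y1 - y2                             # within-pair differences
n  <- length(d)
SE.dbar <- sd(d) / sqrt(n)                # standard error of the mean difference
t.s     <- mean(d) / SE.dbar              # paired t-statistic
2 * pt(-abs(t.s), df = n - 1)             # two-sided p-value
mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * SE.dbar   # 95% CI for mu_d
# t.test(y1, y2, paired = TRUE) gives the same t, df, p-value, and CI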


A graphical analysis of paired data focuses on the sample of differences, and


not on the original samples. In particular, the normality assumption is assessed on
the sample of differences.

3.6.1 R Analysis
The most natural way to enter paired data is as two columns, one for each treatment
group. You can then create a new column of differences, and do the usual one-sample
graphical and inferential analysis on this column of differences, or you can do the
paired analysis directly without this intermediate step.

Example: Paired Analysis of Data on Twins Burt (1966) presented data on


IQ scores for identical twins that were raised apart, one by foster parents and one by
the genetic parents. Assuming the data are a random sample of twin pairs, consider
comparing the population mean IQs for twins raised at home to those raised by
foster parents. Let µf =population mean IQ for twin raised by foster parents, and
µg =population mean IQ for twin raised by genetic parents.
I created the data in the worksheet (c1=foster; c2=genetic), and computed the
differences between the IQ scores for the children raised by the genetic and foster
parents (c3=diff=genetic-foster). I also made a scatter plot of the genetic versus
foster IQ scores.
#### Example: Paired Analysis of Data on Twins
# Data and numerical summaries
foster <- c(82, 80, 88, 108, 116, 117, 132, 71, 75, 93,
95, 88, 111, 63, 77, 86, 83, 93, 97, 87,
94, 96, 112, 113, 106, 107, 98)
genetic <- c(82, 90, 91, 115, 115, 129, 131, 78, 79, 82,
97, 100, 107, 68, 73, 81, 85, 87, 87, 93,
94, 95, 97, 97, 103, 106, 111)
diff <- genetic - foster

axis.lim <- range(c(foster, genetic))

iq <- data.frame(foster, genetic, diff)

# scatterplot of foster and genetic IQs, with 1:1 line


p <- ggplot(iq, aes(x = foster, y = genetic))
# draw a 1:1 line, dots above line indicate "genetic > foster"
p <- p + geom_abline(intercept=0, slope=1, alpha=0.2)
p <- p + geom_point()
p <- p + geom_rug()
# make the axes square so it's a fair visual comparison
p <- p + coord_equal()
p <- p + scale_x_continuous(limits=axis.lim)


p <- p + scale_y_continuous(limits=axis.lim)
p <- p + labs(title = "IQ of identical twins raised by genetic vs foster parents")
print(p)
[Figure: "IQ of identical twins raised by genetic vs foster parents" — scatterplot of genetic (y) vs foster (x) IQ scores with a 1:1 reference line.]

This plot of IQ scores shows that scores are related within pairs of twins. This is
consistent with the need for a paired analysis.
Given the sample of differences, I created a boxplot and a stem and leaf display,
neither of which showed marked deviation from normality. The boxplot is centered at
zero, so one would not be too surprised if the test result is insignificant.
p1 <- ggplot(iq, aes(x = diff))
p1 <- p1 + scale_x_continuous(limits=c(-20,+20))
# vertical line at 0
p1 <- p1 + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p1 <- p1 + geom_histogram(aes(y=..density..), binwidth=5)
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(iq, aes(x = "diff", y = diff))
p2 <- p2 + scale_y_continuous(limits=c(-20,+20))


p2 <- p2 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")


p2 <- p2 + geom_violin(fill = "gray50", alpha=1/2)
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(iq, aes(x = "diff", y = diff))
p3 <- p3 + scale_y_continuous(limits=c(-20,+20))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)
[Figure: histogram with density curve, violin plot, and boxplot of the IQ differences (diff), each with a dashed reference line at 0.]

The normality assumption of the sample mean for a one-sample test is satisfied
(below, left).
Given the sample of differences, I generated a one-sample CI and test. The hy-
pothesis under test is µd = µg − µf = 0. The p-value for this test is large. We do not
have sufficient evidence to claim that the population mean IQs for twins raised apart


are different. This is consistent with the CI for µd given below, which covers zero.
bs.one.samp.dist(iq$diff)

# one-sample t-test of differences (paired t-test)


t.summary <- t.test(iq$diff)
t.summary
##
## One Sample t-test
##
## data: iq$diff
## t = 0.12438, df = 26, p-value = 0.902
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.875159 3.245529
## sample estimates:
## mean of x
## 0.1851852
# plot t-distribution with shaded p-value
t.dist.pval(t.summary)

[Figure: left — bootstrap assessment of the differences ("Plot of data with smoothed density curve", "Bootstrap sampling distribution of the mean", Data: n = 27, mean = 0.18519, se = 1.48884); right — t-distribution with df = 26, t-stat = 0.12438, shaded area is p-value = 0.90197.]

Alternatively, I can generate the test and CI directly from the raw data in two
columns, specifying paired=TRUE. This gives the following output, which leads to
identical conclusions to the earlier analysis.
# two-sample paired t-test
t.summary <- t.test(iq$genetic, iq$foster, paired=TRUE)
t.summary
##
## Paired t-test
##


## data: iq$genetic and iq$foster


## t = 0.12438, df = 26, p-value = 0.902
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.875159 3.245529
## sample estimates:
## mean of the differences
## 0.1851852
You might ask why I tortured you by doing the first analysis, which required
creating and analyzing the sample of differences, when the alternative and equivalent
second analysis is so much easier. (A later topic deals with non-parametric analyses
of paired data for which the differences must be first computed.)

Remark: I could have defined the difference to be the foster IQ score minus the
genetic IQ score. How would this change the conclusions?

Example: Paired Comparisons of Two Sleep Remedies The following data


give the amount of sleep gained in hours from two sleep remedies, A and B, applied
to 10 individuals who have trouble sleeping an adequate amount. Negative values
imply sleep loss. In 9 of the 10 individuals, the sleep gain on B exceeded that on A.
Let µA = population mean sleep gain (among troubled sleepers) on remedy A, and
µB = population mean sleep gain (among troubled sleepers) on remedy B. Consider
testing H0 : µB − µA = 0 or equivalently µd = 0, where µd = µB − µA .
The observed distribution of differences between B and A is slightly skewed to
the right, with a single outlier in the upper tail. The normality assumption of the
standard one-sample t-test and CI are suspect here. I will continue with the analysis,
anyways.
#### Example: Paired Differences on Sleep Remedies
# Data
sleep <- read.table(text="
A B
0.7 1.9
-1.6 0.8
-0.2 1.1
-1.2 0.1
0.1 -0.1
3.4 4.4
3.7 5.5
0.8 1.6
0.0 4.6
2.0 3.0
", header = TRUE)


# calculate paired difference between two remedies, D = B - A


sleep$D <- sleep$B - sleep$A
sleep
## A B D
## 1 0.7 1.9 1.2
## 2 -1.6 0.8 2.4
## 3 -0.2 1.1 1.3
## 4 -1.2 0.1 1.3
## 5 0.1 -0.1 -0.2
## 6 3.4 4.4 1.0
## 7 3.7 5.5 1.8
## 8 0.8 1.6 0.8
## 9 0.0 4.6 4.6
## 10 2.0 3.0 1.0
# determine range of each axis to use most extreme of each for square plot below
axis.lim <- range(c(sleep$A, sleep$B))

library(ggplot2)
# scatterplot of A and B sleep times, with 1:1 line
p <- ggplot(sleep, aes(x = A, y = B))
# draw a 1:1 line, dots above line indicate "B > A"
p <- p + geom_abline(intercept=0, slope=1, alpha=0.2)
p <- p + geom_point()
p <- p + geom_rug()
# make the axes square so it's a fair visual comparison
p <- p + coord_equal()
p <- p + scale_x_continuous(limits=axis.lim)
p <- p + scale_y_continuous(limits=axis.lim)
p <- p + labs(title = "Sleep hours gained on two sleep remedies: A vs B")
print(p)

[Figure: "Sleep hours gained on two sleep remedies: A vs B" — scatterplot of B vs A with a 1:1 reference line.]

There is evidence here against the normality assumption of the sample mean.
We’ll continue anyway (in practice we’d use a nonparametric method, instead, in a
later chapter).
library(ggplot2)
p1 <- ggplot(sleep, aes(x = D))
p1 <- p1 + scale_x_continuous(limits=c(-5,+5))
p1 <- p1 + geom_histogram(aes(y=..density..), binwidth = 1)
p1 <- p1 + geom_density(alpha=0.1, fill="white", adjust = 2)
# vertical reference line at 0
p1 <- p1 + geom_vline(xintercept = 0, colour="red", linetype="dashed")
p1 <- p1 + geom_vline(xintercept = mean(sleep$D), colour="blue", alpha = 0.5)
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "Difference of sleep hours gained: D = B - A")
print(p1)

bs.one.samp.dist(sleep$D)

[Figure: left — histogram with density curve, "Difference of sleep hours gained: D = B − A"; right — bootstrap assessment ("Plot of data with smoothed density curve", "Bootstrap sampling distribution of the mean", Data: n = 10, mean = 1.52, se = 0.402161).]

The p-value for testing H0 is 0.004. We’d reject H0 at the 5% or 1% level, and con-
clude that the population mean sleep gains on the remedies are different. We are 95%
confident that µB exceeds µA by between 0.61 and 2.43 hours. Again, these results
must be reported with caution, because the normality assumption is unreasonable.
However, the presence of outliers tends to make the t-test and CI conservative, so we’d
expect to find similar conclusions if we used the nonparametric methods discussed
later in the semester.
# one-sample t-test of differences (paired t-test)
t.summary <- t.test(sleep$D)
t.summary
##
## One Sample t-test
##
## data: sleep$D
## t = 3.7796, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.610249 2.429751
## sample estimates:
## mean of x
## 1.52
# plot t-distribution with shaded p-value
t.dist.pval(t.summary)

[Figure: t-distribution with df = 9; t-stat = 3.7796, shaded area is p-value = 0.0043518.]

Question: In what order should the remedies be given to the patients?

ClickerQ s — Reporting results, STT.06.03.010

3.7 Should You Compare Means?


The mean is the most common feature on which two distributions are compared.
You should not, however, blindly apply the two-sample tests (paired or unpaired)
without asking yourself whether the means are the relevant feature to compare. This
issue is not a big concern when, as highlighted in the first graph below, the two
(normal) populations have equal spreads or standard deviations. In such cases the
difference between the two population means is equal to the difference between any
fixed percentile for the two distributions, so the mean difference is a natural measure
of difference.
Consider instead the hypothetical scenario depicted in the bottom pane below,
where the population mean lifetimes using two distinct drugs for a fatal disease are
µ1 = 16 months from time of diagnosis and µ2 = 22 months from time of diagnosis,
respectively. The standard deviations under the two drugs are σ1 = 1 and σ2 = 6,
respectively. The second drug has the higher mean lifetime, but at the expense of
greater risk. For example, the first drug gives you a 97.7% chance of living at least
14 months, whereas the second drug only gives you a 90.8% chance of living at least
14 months. Which drug is best? It depends on what is important to you, a higher
expected lifetime or a lower risk of dying early.
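These probabilities follow directly from the two normal distributions:
# P(lifetime >= 14 months) under each drug
1 - pnorm(14, mean = 16, sd = 1)   # drug 1: about 0.977
1 - pnorm(14, mean = 22, sd = 6)   # drug 2: about 0.908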

[Figure: two panels — "Normal Distributions with Identical Variances" and "Normal Distributions with Different Variances".]


Chapter 4

Checking Assumptions

Contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.2 Testing Normality . . . . . . . . . . . . . . . . . . . . . . . 124
4.2.1 Normality tests on non-normal data . . . . . . . . . . . . . 127
4.3 Formal Tests of Normality . . . . . . . . . . . . . . . . . . 131
4.4 Testing Equal Population Variances . . . . . . . . . . . . 139
4.5 Small sample sizes, a comment . . . . . . . . . . . . . . . 141

Learning objectives
After completing this topic, you should be able to:
assess the assumptions visually and via formal tests.
Achieving these goals contributes to mastery in these course learning outcomes:
10. Model assumptions.

4.1 Introduction
Almost all statistical methods make assumptions about the data collection process
and the shape of the population distribution. If you reject the null hypothesis in a
test, then a reasonable conclusion is that the null hypothesis is false, provided all
the distributional assumptions made by the test are satisfied. If the assumptions are
not satisfied, then that alone might be the cause of rejecting H0 . Additionally, a failure
to reject H0 could also be caused solely by a failure to satisfy the assumptions.
Hence, you should always check assumptions to the best of your abilities.


Two assumptions that underlie the tests and CI procedures that I have discussed
are that the data are a random sample, and that the population frequency curve is
normal. For the pooled variance two-sample test the population variances are also
required to be equal.
The random sample assumption can often be assessed from an understanding of
the data collection process. Unfortunately, there are few general tests for checking
this assumption. I have described exploratory (mostly visual) methods to assess the
normality and equal variance assumption. I will now discuss formal methods to assess
these assumptions.

4.2 Testing Normality

An informal test of normality can be based on a normal scores plot, sometimes


called a rankit plot or a normal probability plot or a normal QQ plot (QQ
= quantile-quantile). You plot the quantiles of the data against the quantiles of the
normal distribution, or expected normal order statistics (in a standard normal
distribution) for a sample with the given number of observations. The normality
assumption is plausible if the plot is fairly linear. I give below several plots often seen
with real data, and what they indicate about the underlying distribution.
There are multiple ways to produce QQ plots in R. The shape can depend upon
whether you plot the normal scores on the x-axis or the y-axis. It is conventional to
plot the data on the y-axis and the normal scores on the x-axis.
Let’s start with some data from a normal distribution.
#### sample from normal distribution
x1 <- rnorm(150, mean = 100, sd = 15)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x1, freq = FALSE, breaks = 20)
points(density(x1), type = "l")
rug(x1)

# violin plot
library(vioplot)
vioplot(x1, horizontal=TRUE, col="gray")

# boxplot
boxplot(x1, horizontal=TRUE)

[Figure: histogram with density curve ("Histogram of x1"), violin plot, and boxplot of x1.]

There are many ways to get adequate QQ plots. Consider how outliers show up
in the QQ plot. There may be isolated points on the ends of the QQ plot, but only on
the right side is there an outlier. How could you have identified that the right tail
looks longer than the left tail from the QQ plot?
#### QQ plots
# R base graphics
par(mfrow=c(1,1))
# plots the data vs their normal scores
qqnorm(x1)
# plots the reference line
qqline(x1)

# ggplot2 graphics
library(ggplot2)
# http://had.co.nz/ggplot2/stat_qq.html
df <- data.frame(x1)
# stat_qq() below requires "sample" to be assigned a data.frame column
p <- ggplot(df, aes(sample = x1))
# plots the data vs their normal scores
p <- p + stat_qq()
print(p)

[Figure: "Normal Q−Q Plot" of x1 from base graphics (Sample Quantiles vs Theoretical Quantiles) and the corresponding ggplot2 version (sample vs theoretical).]

If you lay a straightedge along the bulk of the plot (putting in a regression line is
not the right way to do it, even if it is easy), you see that the most extreme point on
the right is a little below the line, and the last few points on the left a little above the
line. What does this mean? The point on the right corresponds to a data value more
extreme than expected from a normal distribution (the straight line is where expected
and actual coincide). Extreme points on the right are above the line. What about
the left? Extreme points there should be above the line — since the deviations from
the line are above it on the left, those points are also more extreme than expected.
Even more useful is to add confidence intervals (point-wise, not family-wise —
you will learn the meaning of those terms in the ANOVA section). You don’t expect
a sample from a normally distributed population to have a normal scores plot that
falls exactly on the line, and the amount of deviation depends upon the sample size.
The best QQ plot I have found is qqPlot() in the car package. Note
that with the dist= option you can use this technique to see whether the data appear
to come from many possible distributions, not just the normal.
par(mfrow=c(1,1))
# Normality of Residuals
library(car)
## Loading required package: carData
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(x1, las = 1, id = list(n = 6, cex = 1), lwd = 1, main="QQ Plot")
## [1] 65 110 86 125 31 111

[Figure: car::qqPlot of x1 ("QQ Plot") with a point-wise confidence envelope; x-axis "norm quantiles"; the six most extreme observations are labelled.]

In this case the x-axis is labelled “norm quantiles”. You only see a couple of data
values outside the limits (in the tails, where it usually happens). You expect around
5% outside the limits, so there is no indication of non-normality here. I did sample
from a normal population.

4.2.1 Normality tests on non-normal data


Let’s turn to examples of sampling from other, non-normal distributions to see how
the normal QQ plot identifies important features.

Light-tailed symmetric (Uniform)


#### Light-tailed symmetric (Uniform)
# sample from uniform distribution
x2 <- runif(150, min = 50, max = 150)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x2, freq = FALSE, breaks = 20)


points(density(x2), type = "l")


rug(x2)

# violin plot
library(vioplot)
vioplot(x2, horizontal=TRUE, col="gray")

# boxplot
boxplot(x2, horizontal=TRUE)

par(mfrow=c(1,1))
qqPlot(x2, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram with density curve, violin plot, and boxplot of x2 ("Histogram of x2"), alongside its car::qqPlot ("QQ Plot", norm quantiles).]

Heavy-tailed (fairly) symmetric (Normal-squared)

#### Heavy-tailed (fairly) symmetric (Normal-squared)


# sample from normal distribution
x3.temp <- rnorm(150, mean = 0, sd = 1)
x3 <- sign(x3.temp)*x3.temp^2 * 15 + 100

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x3, freq = FALSE, breaks = 20)
points(density(x3), type = "l")
rug(x3)

# violin plot
library(vioplot)


vioplot(x3, horizontal=TRUE, col="gray")

# boxplot
boxplot(x3, horizontal=TRUE)

par(mfrow=c(1,1))
qqPlot(x3, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
[Figure: histogram with density curve, violin plot, and boxplot of x3 ("Histogram of x3"), alongside its car::qqPlot ("QQ Plot", norm quantiles).]

Right-skewed (Exponential)

#### Right-skewed (Exponential)


# sample from exponential distribution
x4 <- rexp(150, rate = 1)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x4, freq = FALSE, breaks = 20)
points(density(x4), type = "l")
rug(x4)

# violin plot
library(vioplot)
vioplot(x4, horizontal=TRUE, col="gray")

# boxplot
boxplot(x4, horizontal=TRUE)


par(mfrow=c(1,1))
qqPlot(x4, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")

[Figure: histogram with density curve, violin plot, and boxplot of x4 ("Histogram of x4"), alongside its car::qqPlot ("QQ Plot", norm quantiles).]

Left-skewed (Exponential, reversed)

#### Left-skewed (Exponential, reversed)


# sample from exponential distribution
x5 <- 15 - rexp(150, rate = 0.5)

par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(x5, freq = FALSE, breaks = 20)
points(density(x5), type = "l")
rug(x5)

# violin plot
library(vioplot)
vioplot(x5, horizontal=TRUE, col="gray")

# boxplot
boxplot(x5, horizontal=TRUE)

par(mfrow=c(1,1))
qqPlot(x5, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")

[Figure: histogram with density curve, violin plot, and boxplot of x5 ("Histogram of x5"), alongside its car::qqPlot ("QQ Plot", norm quantiles).]

Notice how striking the lack of linearity is in the QQ plot for all the non-normal
distributions, particularly the symmetric light-tailed distribution where the boxplot
looks fairly good. The QQ plot is a sensitive measure of normality. Let us summarize
the patterns we see regarding tails in the plots:
                                  Tail
Tail Weight   Left                               Right
Light         Left side of plot points left      Right side of plot points right
Heavy         Left side of plot points down      Right side of plot points up

4.3 Formal Tests of Normality


A formal test of normality is based on the correlation between the data and the
normal scores. The correlation is a measure of the strength of a linear relationship,
with the sign of the correlation indicating the direction of the relationship (that is, +
for increasing relationship, and − for decreasing). The correlation varies from −1 to
+1. In a normal scores plot, you are looking for a correlation close to +1. Normality
is rejected if the correlation is too small. Critical values for the correlation test of
normality, which is commonly called the Shapiro-Wilk test, can be found in many
texts.
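The idea can be sketched directly: compute the correlation between the ordered data and
their expected normal scores (this captures the flavor of the test, not the exact
Shapiro-Wilk weighting), here for the x1 sample from the previous section:
# correlation between ordered data and expected normal scores
cor(sort(x1), qnorm(ppoints(length(x1))))   # values near +1 support normality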
R has several tests of normality. The Shapiro-Wilk test shapiro.test() is a
base function. The R package nortest has five others: the Anderson-Darling test
ad.test() is useful, related to the Kolmogorov-Smirnov (Lilliefors) test lillie.test()
which is commonly used in many scientific disciplines but is essentially useless, the
Cramer-von Mises test cvm.test(), and two more. Some packages also have the
Ryan-Joiner test (closely related to the Shapiro-Wilk test).


Extreme outliers and skewness have the biggest effects on standard methods based
on normality. The Shapiro-Wilk (SW) test is better at picking up these problems
than the Kolmogorov-Smirnov (KS) test. The KS test tends to highlight deviations
from normality in the center of the distribution. These types of deviations are rarely
important because they do not have a noticeable effect on the operating characteristics
of the standard methods. The AD and RJ tests are modifications designed to handle
some of these objections.
Tests for normality may have low power in small to moderate sized samples. Visual
assessment of normality is often more valuable than a formal test. The tests for the
distributions of data above are below and in Figure 4.1.
Normal distribution
#### Formal Tests of Normality
shapiro.test(x1)
##
## Shapiro-Wilk normality test
##
## data: x1
## W = 0.98584, p-value = 0.1289
library(nortest)
ad.test(x1)
##
## Anderson-Darling normality test
##
## data: x1
## A = 0.40732, p-value = 0.3446
# lillie.test(x1)
cvm.test(x1)
##
## Cramer-von Mises normality test
##
## data: x1
## W = 0.05669, p-value = 0.4159


Light-tailed symmetric
shapiro.test(x2)
##
## Shapiro-Wilk normality test
##
## data: x2
## W = 0.95252, p-value = 5.336e-05
library(nortest)
ad.test(x2)
##
## Anderson-Darling normality test
##
## data: x2
## A = 1.9426, p-value = 5.644e-05
# lillie.test(x2)
cvm.test(x2)
##
## Cramer-von Mises normality test
##
## data: x2
## W = 0.29567, p-value = 0.0003642

Heavy-tailed (fairly) symmetric
shapiro.test(x3)
##
## Shapiro-Wilk normality test
##
## data: x3
## W = 0.79633, p-value = 3.587e-13
library(nortest)
ad.test(x3)
##
## Anderson-Darling normality test
##
## data: x3
## A = 9.1433, p-value < 2.2e-16
# lillie.test(x3)
cvm.test(x3)
## Warning in cvm.test(x3): p-value is smaller than 7.37e-10, cannot be computed more accurately
##
## Cramer-von Mises normality test
##
## data: x3
## W = 1.8248, p-value = 7.37e-10

Right-skewed
shapiro.test(x4)
##
## Shapiro-Wilk normality test
##
## data: x4
## W = 0.74125, p-value = 5.872e-15
library(nortest)
ad.test(x4)
##
## Anderson-Darling normality test
##
## data: x4
## A = 9.3715, p-value < 2.2e-16
# lillie.test(x4)
cvm.test(x4)
## Warning in cvm.test(x4): p-value is smaller than 7.37e-10, cannot be computed more accurately
##
## Cramer-von Mises normality test
##
## data: x4
## W = 1.6537, p-value = 7.37e-10

Left-skewed
shapiro.test(x5)
##
## Shapiro-Wilk normality test
##
## data: x5
## W = 0.8743, p-value = 5.933e-10
library(nortest)
ad.test(x5)
##
## Anderson-Darling normality test
##
## data: x5
## A = 6.0016, p-value = 7.938e-15
# lillie.test(x5)
cvm.test(x5)
##
## Cramer-von Mises normality test
##
## data: x5
## W = 1.0553, p-value = 9.648e-10

Figure 4.1: Normality tests for non-normal distributions


Example: Paired Differences on Sleep Remedies The following boxplot and


normal scores plots suggest that the underlying distribution of differences (for the
paired sleep data taken from the previous chapter) is reasonably symmetric, but
heavy tailed. The p-value for the SW test of normality is 0.042, and for the AD test
is 0.029, both of which call into question a normality assumption. A non-parametric
test comparing the sleep remedies (one that does not assume normality) is probably
more appropriate here. We will return to these data later.
# Normality tests
shapiro.test(sleep$D)
##
## Shapiro-Wilk normality test
##
## data: sleep$D
## W = 0.83798, p-value = 0.04173
library(nortest)
ad.test(sleep$D)
##
## Anderson-Darling normality test
##
## data: sleep$D
## A = 0.77378, p-value = 0.02898
# lillie.test(sleep$D)
cvm.test(sleep$D)
##
## Cramer-von Mises normality test
##
## data: sleep$D
## W = 0.13817, p-value = 0.02769

# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(sleep$D, freq = FALSE, breaks = 20)
points(density(sleep$D), type = "l")
rug(sleep$D)

# violin plot
library(vioplot)
vioplot(sleep$D, horizontal=TRUE, col="gray")

# boxplot
boxplot(sleep$D, horizontal=TRUE)

# QQ plot
par(mfrow=c(1,1))

qqPlot(sleep$D, las = 1, id = list(n = 4, cex = 1), lwd = 1, main="QQ Plot")


## [1] 9 5 2 8
[Figure: histogram with density curve, violin plot, and boxplot of the sleep differences, alongside the car::qqPlot ("QQ Plot", norm quantiles) with the four most extreme observations labelled.]

Example: Androstenedione Levels This is an independent two-sample problem,


so you must look at normal scores plots for males and females.
The AD test p-value and the SW test p-value for testing normality exceeds 0.10
in each sample. Thus, given the sample sizes (14 for men, 18 for women), we have
insufficient evidence (at α = 0.05) to reject normality in either population.
The women’s boxplot contains two mild outliers, which is highly unusual when
sampling from a normal distribution. The tests are possibly not powerful enough to
pick up this type of deviation from normality in such a small sample. In practice,
this may not be a big concern. The two mild outliers probably have a small effect on
inferences in the sense that non-parametric methods would probably lead to similar
conclusions here.
library(ggplot2)
p1 <- ggplot(andro, aes(x = sex, y = level, fill=sex))
p1 <- p1 + geom_boxplot()
# add a "+" at the mean
p1 <- p1 + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
#p1 <- p1 + coord_flip()
p1 <- p1 + labs(title = "Androstenedione Levels in Diabetics")
#print(p1)

p2 <- ggplot(andro, aes(x = level, fill=sex))


p2 <- p2 + geom_histogram(binwidth = 20, alpha = 0.5, position="identity")
p2 <- p2 + geom_rug(aes(colour=sex))


Men
shapiro.test(men)
##
## Shapiro-Wilk normality test
##
## data: men
## W = 0.90595, p-value = 0.1376
library(nortest)
ad.test(men)
##
## Anderson-Darling normality test
##
## data: men
## A = 0.4718, p-value = 0.2058
# lillie.test(men)
cvm.test(men)
##
## Cramer-von Mises normality test
##
## data: men
## W = 0.063063, p-value = 0.3221

Women
shapiro.test(women)
##
## Shapiro-Wilk normality test
##
## data: women
## W = 0.95975, p-value = 0.5969
library(nortest)
ad.test(women)
##
## Anderson-Darling normality test
##
## data: women
## A = 0.39468, p-value = 0.3364
# lillie.test(women)
cvm.test(women)
##
## Cramer-von Mises normality test
##
## data: women
## W = 0.065242, p-value = 0.3057

p2 <- p2 + labs(title = "Androstenedione Levels in Diabetics")


#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1)

# QQ plot
par(mfrow=c(2,1))
qqPlot(men, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Men")
qqPlot(women, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Women")

[Figure: "Androstenedione Levels in Diabetics" — boxplots and histograms of level by sex, alongside "QQ Plot, Men" and "QQ Plot, Women" (norm quantiles).]

Most statisticians use graphical methods (boxplot, normal scores plot) to assess
normality, and do not carry out formal tests.

You may be surprised at how variable a sample of 10 observations from a Normal(0,1)
distribution can look; here are 25 samples.
n = 10
r = 5
norm.many <- data.frame(id = rep(seq(1:r^2), n)
, x = rnorm(r^2 * n)
)
library(ggplot2)
p <- ggplot(norm.many, aes(x = x))
p <- p + geom_histogram(binwidth = 0.4)
p <- p + geom_rug()
p <- p + facet_wrap(~ id, ncol = r)
p <- p + labs(title = "Twenty-five samples of size n=10 from a Normal(0,1) distribution")
print(p)

[Figure: "Twenty-five samples of size n=10 from a Normal(0,1) distribution" — a 5×5 grid of histograms.]

. . . and here are samples of size n = 30.


n = 30
r = 5
norm.many <- data.frame(id = rep(seq(1:r^2), n)
, x = rnorm(r^2 * n)
)
library(ggplot2)
p <- ggplot(norm.many, aes(x = x))
p <- p + geom_histogram(binwidth = 0.4)
p <- p + geom_rug()
p <- p + facet_wrap(~ id, ncol = r)
p <- p + labs(title = "Twenty-five samples of size n=30 from a Normal(0,1) distribution")


print(p)
[Figure: "Twenty-five samples of size n=30 from a Normal(0,1) distribution" — a 5×5 grid of histograms.]

By viewing many versions of these plots at varying sample sizes you'll develop your
intuition about what a normal sample looks like.

4.4 Testing Equal Population Variances


In the independent two sample t-test, some researchers test H0 : σ1² = σ2² as a means
to decide between using the pooled-variance procedure or Satterthwaite’s methods.
They suggest the pooled t-test and CI if H0 is not rejected, and Satterthwaite’s


methods otherwise.
There are a number of well-known tests for equal population variances, of which
Bartlett’s test and Levene’s test are probably the best known. Bartlett’s test assumes
the population distributions are normal, and is the best test when this is true. In
practice, unequal variances and non-normality often go hand-in-hand, so you should
check normality prior to using Bartlett's test. Bartlett's test is sensitive to
non-normality, and is thus more likely to return a "false positive" (reject
H0 of equal variances) when the data are non-normal. Levene's test is more robust
to departures from normality than Bartlett’s test; it is in the car package. Fligner-
Killeen test is a non-parametric test which is very robust against departures from
normality.
I will now define Bartlett’s test, which assumes normally distributed data. As
above, let n∗ = n1 + n2 + · · · + nk , where the ni s are the sample sizes from the k
groups, and define

v = 1 + [ 1 / (3(k − 1)) ] ( Σi 1/(ni − 1) − 1/(n∗ − k) ).

Bartlett's statistic for testing H0 : σ1² = · · · = σk² is given by

Bobs = (2.303 / v) { (n∗ − k) log(s²pooled) − Σi (ni − 1) log(si²) },

where s²pooled is the pooled estimator of variance and si² is the estimated variance
based on the ith sample.

Large values of Bobs suggest that the population variances are unequal. For a size
α test, we reject H0 if Bobs ≥ χ²k−1,crit , where χ²k−1,crit is the upper-α percentile for
the χ²k−1 (chi-squared) probability distribution with k − 1 degrees of freedom. A generic
plot of the χ² distribution is given below. A p-value for the test is given by the area
under the chi-squared curve to the right of Bobs .
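As a sketch of the arithmetic, Bobs can be computed by hand for k = 2 groups; using the
men and women androstenedione samples from Chapter 3, the result should agree with the
bartlett.test() output shown in the example below:
# Bartlett's statistic by hand (2.303 = ln(10), so base-10 logs are used)
k     <- 2
ni    <- c(length(men), length(women))
si2   <- c(var(men), var(women))
nstar <- sum(ni)
s2.pooled <- sum((ni - 1) * si2) / (nstar - k)
v     <- 1 + (1 / (3 * (k - 1))) * (sum(1 / (ni - 1)) - 1 / (nstar - k))
B.obs <- (2.303 / v) * ((nstar - k) * log10(s2.pooled) - sum((ni - 1) * log10(si2)))
1 - pchisq(B.obs, df = k - 1)    # upper-tail chi-squared p-value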

Example: Androstenedione Levels The sample standard deviations and sample
sizes are: s1 = 42.8 and n1 = 14 for men and s2 = 17.2 and n2 = 18 for women.
The sample standard deviations appear to be very different, so I would not be sur-
prised if the test of equal population variances is highly significant. The output below
confirms this: the p-values for Bartlett’s F-test, Levene’s Test, and Fligner-Killeen
test are all much smaller than 0.05. An implication is that the standard pooled-CI
and test on the population means is inappropriate.
#### Testing Equal Population Variances
# numerical summaries
c(mean(men), mean(women), sd(men), sd(women))


## [1] 112.50000 75.83333 42.75467 17.23625


c(IQR(men), IQR(women), length(men), length(women))
## [1] 60.25 17.00 14.00 18.00
## Test equal variance
# assumes populations are normal
bartlett.test(level ~ sex, data = andro)
##
## Bartlett test of homogeneity of variances
##
## data: level by sex
## Bartlett's K-squared = 11.199, df = 1, p-value = 0.0008183
# does not assume normality, requires car package
library(car)
leveneTest(level ~ sex, data = andro)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 7.2015 0.01174 *
## 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# nonparametric test
fligner.test(level ~ sex, data = andro)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: level by sex
## Fligner-Killeen:med chi-squared = 5.8917, df = 1, p-value =
## 0.01521

4.5 Small sample sizes, a comment


In Daniel Kahneman’s “Thinking, Fast and Slow” (Ch 10), he discusses “The Law
of Small Numbers” (in contrast to the Law of Large Numbers). As an example from
statisticians Howard Wainer and Harris Zwerling, he makes this observation about the
incidence of kidney cancer in the 3,141 counties of the United States. “The counties in
which the incidence of kidney cancer is lowest are mostly rural, sparsely populated,
and located in traditionally Republican states in the Midwest, the South, and the
West. What do you make of this?” The statisticians comment: “It is both easy and
tempting to infer that their low cancer rates are directly due to the clean living of
the rural lifestyle — no air pollution, no water pollution, access to fresh food without
additives.” This makes perfect sense.
“Now consider the counties in which the incidence of kidney cancer is highest.


These ailing counties tend to be mostly rural, sparsely populated, and located in
traditionally Republican states in the Midwest, the South, and the West.” Tongue-
in-cheek, Wainer and Zwerling comment: “It is easy to infer that their high cancer
rates might be directly due to the poverty of the rural lifestyle — no access to good
medical care, a high-fat diet, and too much alcohol, too much tobacco.” Something
is wrong, of course. The rural lifestyle cannot explain both very high and very low
incidence of kidney cancer.
The key factor is not that the counties were rural or predominantly Republican. It
is that rural counties have small populations. The law of large numbers says that as
sample sizes increase, the sample statistic converges to the population proportion;
that is, large samples are more precise than small samples. What Kahneman is calling
the law of small numbers warns that small samples yield extreme results more often
than large samples do.



Chapter 5

One-Way Analysis of Variance

Contents
5.1 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2 Multiple Comparison Methods: Fisher’s Method . . . . 151
5.2.1 FSD Multiple Comparisons in R . . . . . . . . . . . . . . . 153
5.2.2 Bonferroni Comparisons . . . . . . . . . . . . . . . . . . . . 154
5.3 Further Discussion of Multiple Comparisons . . . . . . . 159
5.4 Checking Assumptions in ANOVA Problems . . . . . . . 161
5.5 Example from the Child Health and Development Study
(CHDS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Learning objectives
After completing this topic, you should be able to:
select graphical displays that meaningfully compare independent populations.
assess the assumptions of the ANOVA visually and by formal tests.
decide whether the means between populations are different, and how.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.


5.1 ANOVA
The one-way analysis of variance (ANOVA) is a generalization of the two sample
t-test to k ≥ 2 groups. Assume that the populations of interest have the following
(unknown) population means and standard deviations:
population 1 population 2 · · · population k
mean µ1 µ2 ··· µk
std dev σ1 σ2 ··· σk
A usual interest in ANOVA is whether µ1 = µ2 = · · · = µk . If not, then we wish
to know which means differ, and by how much. To answer these questions we select
samples from each of the k populations, leading to the following data summary:
sample 1 sample 2 · · · sample k
size n1 n2 ··· nk
mean Ȳ1 Ȳ2 ··· Ȳk
std dev s1 s2 ··· sk
A little more notation is needed for the discussion. Let Yij denote the j th observation
in the ith sample and define the total sample size n∗ = n1 + n2 + · · · + nk . Finally, let
Ȳ¯ be the average response over all samples (combined), that is
Ȳ¯ = (Σij Yij) / n∗ = (Σi ni Ȳi) / n∗ .
Note that Ȳ¯ is not the average of the sample means, unless the sample sizes ni are
equal.
An F -statistic is used to test H0 : µ1 = µ2 = · · · = µk against HA : not H0 (that is,
at least two means are different). The assumptions needed for the standard ANOVA
F -test are analogous to the independent pooled two-sample t-test assumptions: (1)
Independent random samples from each population. (2) The population frequency
curves are normal. (3) The populations have equal standard deviations, σ1 = σ2 =
· · · = σk .
The F -test is computed from the ANOVA table, which breaks the spread in the
combined data set into two components, or Sums of Squares (SS). The Within
SS, often called the Residual SS or the Error SS, is the portion of the total spread
due to variability within samples:
SS(Within) = (n1 − 1)s1² + (n2 − 1)s2² + · · · + (nk − 1)sk² = Σij (Yij − Ȳi)².
The Between SS, often called the Model SS, measures the spread between the sample
means,
SS(Between) = n1(Ȳ1 − Ȳ¯)² + n2(Ȳ2 − Ȳ¯)² + · · · + nk(Ȳk − Ȳ¯)² = Σi ni(Ȳi − Ȳ¯)²,
weighted by the sample sizes. These two SS add to give
SS(Total) = SS(Between) + SS(Within) = Σij (Yij − Ȳ¯)².
Each SS has its own degrees of freedom (df ). The df (Between) is the number of
groups minus one, k − 1. The df (Within) is the total number of observations minus


the number of groups: (n1 − 1) + (n2 − 1) + · · · + (nk − 1) = n∗ − k. These two df


add to give df (Total) = (k − 1) + (n∗ − k) = n∗ − 1.
The Sums of Squares and df are neatly arranged in a table, called the ANOVA
table:
Source                    df             SS                          MS               F
Between Groups (Model)    dfM = k − 1    SSM = Σi ni(Ȳi − Ȳ¯)²      MSM = SSM/dfM    MSM/MSE
Within Groups (Error)     dfE = n∗ − k   SSE = Σi (ni − 1)si²        MSE = SSE/dfE
Total                     dfT = n∗ − 1   SST = Σij (Yij − Ȳ¯)²       MST = SST/dfT
The Mean Square for each source of variation is the corresponding SS divided by
its df . The Mean Squares can be easily interpreted.
The MS(Within)
$$ \text{MS(Within)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{n^* - k} = s_{\text{pooled}}^2 $$
is a weighted average of the sample variances. The MS(Within) is known as the pooled
estimator of variance, and estimates the assumed common population variance. If
all the sample sizes are equal, the MS(Within) is the average sample variance. The
MS(Within) is identical to the pooled variance estimator in a two-sample problem
when k = 2.
The MS(Between)
$$ \text{MS(Between)} = \frac{\sum_i n_i (\bar{Y}_i - \bar{\bar{Y}})^2}{k - 1} $$
is a measure of variability among the sample means. This MS is a multiple of the
sample variance of Ȳ1 , Ȳ2 , . . . , Ȳk when all the sample sizes are equal.
The MS(Total)
$$ \text{MS(Total)} = \frac{\sum_{ij} (Y_{ij} - \bar{\bar{Y}})^2}{n^* - 1} $$
is the variance in the combined data set.
The decision on whether to reject H0 : µ1 = µ2 = · · · = µk is based on the ratio
of the MS(Between) and the MS(Within):

$$ F_s = \frac{\text{MS(Between)}}{\text{MS(Within)}}. $$

Large values of Fs indicate large variability among the sample means Ȳ1 , Ȳ2 , . . . , Ȳk
relative to the spread of the data within samples. That is, large values of Fs suggest
that H0 is false.
Formally, for a size α test, reject H0 if Fs ≥ Fcrit , where Fcrit is the upper-α
percentile from an F distribution with numerator degrees of freedom k − 1 and de-
nominator degrees of freedom n∗ −k (i.e., the df for the numerators and denominators


in the F -ratio). The p-value for the test is the area under the F -probability curve to
the right of Fs :

[Figure: two F(4, 20) density curves. Left panel: the fixed α = 0.05 rejection region lies to the right of F_crit (reject H0 for F_S there). Right panel: the p-value (random) is the area to the right of the observed F_S.]
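As a small sketch (not part of the original notes), the critical value and p-value can be computed directly from the F-distribution functions in R; the numbers below assume the F(4, 20) reference distribution pictured above and a made-up observed statistic.
# sketch: critical value and p-value for an F-test with 4 and 20 df
f.df1 <- 4    # numerator df, k - 1
f.df2 <- 20   # denominator df, n* - k
qf(1 - 0.05, f.df1, f.df2)   # F_crit for a size alpha = 0.05 test, about 2.87
Fs <- 2.5                    # hypothetical observed F-statistic (for illustration only)
1 - pf(Fs, f.df1, f.df2)     # p-value = area under the F(4, 20) curve to the right of Fs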

For k = 2 the ANOVA F -test is equivalent to the pooled two-sample t-test.


The specification of a one-way analysis of variance is analogous to a regression
analysis (discussed later, though we’ve seen the specification in some plotting func-
tions). The only difference is that the descriptive (x) variable needs to be a factor
and not a numeric variable. We calculate a model object using lm() and extract the
analysis of variance table with anova().

Example: Comparison of Fats During cooking, doughnuts absorb fat in various
amounts. A scientist wished to learn whether the amount absorbed depends on the
type of fat. For each of 4 fats, 6 batches of 24 doughnuts were prepared. The data
are grams of fat absorbed per batch.
Let
µi = pop mean grams of fat i absorbed per batch of 24 doughnuts (−100).
The scientist wishes to test H0 : µ1 = µ2 = µ3 = µ4 against HA : not H0 . There is no
strong evidence against normality here. Furthermore the sample standard deviations
(see output below) are close. The standard ANOVA appears to be appropriate here.
Row fat1 fat2 fat3 fat4
1 164 178 175 155
2 172 191 186 166
3 168 197 178 149
4 177 182 171 164
5 190 185 163 170
6 176 177 176 168
Let’s take a short detour to read the wide table and convert it into long format.


R skills: wide to long table format Many functions in R expect data to be
in a long format rather than a wide format. Let’s use the fat data as an example of
how to read a table as text, convert the wide format to long, and then back to wide
format.
#### Example: Comparison of Fats
fat <- read.table(text="
Row fat1 fat2 fat3 fat4
1 164 178 175 155
2 172 191 186 166
3 168 197 178 149
4 177 182 171 164
5 190 185 163 170
6 176 177 176 168
", header=TRUE)
fat
## Row fat1 fat2 fat3 fat4
## 1 1 164 178 175 155
## 2 2 172 191 186 166
## 3 3 168 197 178 149
## 4 4 177 182 171 164
## 5 5 190 185 163 170
## 6 6 176 177 176 168
From wide to long: Use melt() from the reshape2 package.
#### From wide to long format
library(reshape2)
fat.long <- melt(fat,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars=c("Row"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("fat1", "fat2", "fat3", "fat4"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "type",
# value.name: column name for values in table
value.name = "amount"
)
## naming variables manually, the variable.name and value.name not working 11/2012
#names(fat.long) <- c("Row", "type", "amount")
fat.long
## Row type amount
## 1 1 fat1 164
## 2 2 fat1 172
## 3 3 fat1 168
## 4 4 fat1 177
## 5 5 fat1 190


## 6 6 fat1 176
## 7 1 fat2 178
## 8 2 fat2 191
## 9 3 fat2 197
## 10 4 fat2 182
## 11 5 fat2 185
## 12 6 fat2 177
## 13 1 fat3 175
## 14 2 fat3 186
## 15 3 fat3 178
## 16 4 fat3 171
## 17 5 fat3 163
## 18 6 fat3 176
## 19 1 fat4 155
## 20 2 fat4 166
## 21 3 fat4 149
## 22 4 fat4 164
## 23 5 fat4 170
## 24 6 fat4 168
# or as simple as:
# melt(fat, "Row")

If you don’t specify variable.name, it will name that column “variable”, and if you
leave out value.name, it will name that column “value”.
From long to wide: Use dcast() from the reshape2 package.
#### From long to wide format
fat.wide <- dcast(fat.long, Row ~ type, value.var = "amount")
fat.wide
## Row fat1 fat2 fat3 fat4
## 1 1 164 178 175 155
## 2 2 172 191 186 166
## 3 3 168 197 178 149
## 4 4 177 182 171 164
## 5 5 190 185 163 170
## 6 6 176 177 176 168
Now that we’ve got our data in the long format, let’s return to the ANOVA.

Back to ANOVA: Let’s look at the numerical summaries. We’ve seen other ways
of computing these so I’ll show you another way.
#### Back to ANOVA
# Calculate the mean, sd, n, and se for the four fats

# The plyr package is an advanced way to apply a function to subsets of data


# "Tools for splitting, applying and combining data"
library(plyr)
# ddply "dd" means the input and output are both data.frames


fat.summary <- ddply(fat.long,


"type",
function(X) {
data.frame( m = mean(X$amount),
s = sd(X$amount),
n = length(X$amount)
)
}
)
# standard errors
fat.summary$se <- fat.summary$s/sqrt(fat.summary$n)
# individual confidence limits
fat.summary$ci.l <- fat.summary$m - qt(1-.05/2, df=fat.summary$n-1) * fat.summary$se
fat.summary$ci.u <- fat.summary$m + qt(1-.05/2, df=fat.summary$n-1) * fat.summary$se
fat.summary
## type m s n se ci.l ci.u
## 1 fat1 174.5000 9.027735 6 3.685557 165.0260 183.9740
## 2 fat2 185.0000 7.771744 6 3.172801 176.8441 193.1559
## 3 fat3 174.8333 7.626707 6 3.113590 166.8296 182.8371
## 4 fat4 162.0000 8.221922 6 3.356586 153.3716 170.6284

Let’s plot the data with boxplots, individual points, mean, and CI by fat type.
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(fat.long, aes(x = type, y = amount))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(fat.long$amount),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Doughnut fat absorption") + ylab("amount absorbed (g)")
print(p)


[Figure: "Doughnut fat absorption" — boxplots of amount absorbed (g) by fat type (fat1–fat4) with jittered observations, red diamonds at the group means, red error bars for the normal-based 95% CIs, and a dashed line at the overall mean.]

The p-value for the F -test is 0.001. The scientist would reject H0 at any of the
usual test levels (such as, 0.05 or 0.01). The data suggest that the population mean
absorption rates differ across fats in some way. The F -test does not say how they
differ. The pooled standard deviation spooled = 8.18 is the “Residual standard error”.
We’ll ignore the rest of this output for now.
fit.f <- aov(amount ~ type, data = fat.long)
summary(fit.f)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 3 1596 531.8 7.948 0.0011 **
## Residuals 20 1338 66.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.f
## Call:
## aov(formula = amount ~ type, data = fat.long)
##
## Terms:
## type Residuals
## Sum of Squares 1595.500 1338.333
## Deg. of Freedom 3 20
##


## Residual standard error: 8.180261


## Estimated effects may be unbalanced
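As a quick check (a sketch, not part of the original output), the residual standard error reported above is simply the square root of MS(Within), and it can be recovered from the components stored in the fitted aov object:
# sketch: s_pooled = sqrt(MSE) = sqrt(SS(Within) / df(Within))
sqrt(sum(fit.f$residuals^2) / fit.f$df.residual)   # should reproduce 8.180261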

ClickerQ s — ANOVA, Fat 1/2


5.2 Multiple Comparison Methods: Fisher’s Method


The ANOVA F -test checks whether all the population means are equal. Multiple
comparisons are often used as a follow-up to a significant ANOVA F -test to deter-
mine which population means are different. I will discuss Fisher’s, Bonferroni’s, and
Tukey’s methods for comparing all pairs of means.

Fisher’s least significant difference method (LSD or FSD) is a two-step process:


1. Carry out the ANOVA F -test of H0 : µ1 = µ2 = · · · = µk at the α level. If
H0 is not rejected, stop and conclude that there is insufficient evidence to claim
differences among population means. If H0 is rejected, go to step 2.
2. Compare each pair of means using a pooled two sample t-test at the α level.
Use spooled from the ANOVA table and df = df E (Residual).
To see where the name LSD originated, consider the t-test of H0 : µi = µj (i.e.,
populations i and j have same mean). The t-statistic is

$$ t_s = \frac{\bar{Y}_i - \bar{Y}_j}{s_{\text{pooled}} \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}. $$

You reject H0 if |ts | ≥ tcrit , or equivalently, if


$$ |\bar{Y}_i - \bar{Y}_j| \ge t_{\text{crit}} \, s_{\text{pooled}} \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}. $$

The minimum absolute difference between Ȳi and Ȳj needed to reject H0 is the LSD, the quantity on the right hand side of this inequality. If all the sample sizes are equal, n1 = n2 = · · · = nk, then the LSD is the same for each comparison:

$$ \text{LSD} = t_{\text{crit}} \, s_{\text{pooled}} \sqrt{\frac{2}{n_1}}, $$
where n1 is the common sample size.


I will illustrate Fisher’s method on the doughnut data, using α = 0.05. At the
first step, you reject the hypothesis that the population mean absorptions are equal
because p-value= 0.001. At the second step, compare all pairs of fats at the 5% level.
Here, spooled = 8.18 and tcrit = 2.086 for a two-sided test based on 20 df (the df E for
Residual SS). Each sample has six observations, so the LSD for each comparison is
$$ \text{LSD} = 2.086 \times 8.18 \times \sqrt{\frac{2}{6}} = 9.85. $$

Any two sample means that differ by at least 9.85 in magnitude are significantly
different at the 5% level.
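A minimal R sketch of this calculation, using spooled and the error df from the ANOVA above:
# sketch: Fisher LSD for the doughnut data (equal group sizes n = 6)
s.pooled <- 8.18                     # sqrt(MSE) from the ANOVA table
t.crit   <- qt(1 - 0.05/2, df = 20)  # two-sided 5% critical value on dfE = 20
t.crit * s.pooled * sqrt(2/6)        # LSD, approximately 9.85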
An easy way to compare all pairs of fats is to order the samples by their sample
means. The samples can then be grouped easily, noting that two fats are in the same
group if the absolute difference between their sample means is smaller than the LSD.
Fats Sample Mean
2 185.00
3 174.83
1 174.50
4 162.00
There are six comparisons of two fats. From this table, you can visually assess
which sample means differ by at least the LSD=9.85, and which ones do not. For
completeness, the table below summarizes each comparison:
Comparison Absolute difference in means Exceeds LSD?
Fats 2 and 3 10.17 Yes
2 and 1 10.50 Yes
2 and 4 23.00 Yes
Fats 3 and 1 0.33 No
3 and 4 12.83 Yes
Fats 1 and 4 12.50 Yes
The end product of the multiple comparisons is usually presented as a collection of
groups, where a group is defined to be a set of populations with sample means that
are not significantly different from each other. Overlap among groups is common, and
occurs when one or more populations appears in two or more groups. Any overlap
requires a more careful interpretation of the analysis.
There are three groups for the doughnut data, with no overlap. Fat 2 is in a group
by itself, and so is Fat 4. Fats 3 and 1 are in a group together. This information
can be summarized by ordering the samples from lowest to highest average, and then
connecting the fats in the same group using an underscore:
FAT 4 FAT 1 FAT 3 FAT 2
----- -------------- -----


The results of a multiple comparisons analysis must be interpreted carefully. At the 5%
level, you have sufficient evidence to conclude that the population mean absorption
for Fat 2 and Fat 4 are each different than the other population means. However,
there is insufficient evidence to conclude that the population mean absorptions for
Fats 1 and 3 differ.

Be Careful with Interpreting Groups in Multiple Comparisons!


To see why you must be careful when interpreting groupings, suppose you obtain two
groups in a three sample problem. One group has samples 1 and 3. The other group
has samples 3 and 2:
1      3      2
--------------
       --------------
This occurs, for example, when |Ȳ1 − Ȳ2 | ≥ LSD, but both |Ȳ1 − Ȳ3 | and |Ȳ3 − Ȳ2 |
are less than the LSD. There is a tendency to conclude, and please try to avoid this
line of attack, that populations 1 and 3 have the same mean, populations 2 and 3
have the same mean, but populations 1 and 2 have different means. This conclusion
is illogical. The groupings imply that we have sufficient evidence to conclude that
population means 1 and 2 are different, but insufficient evidence to conclude that
population mean 3 differs from either of the other population means.

5.2.1 FSD Multiple Comparisons in R


One way to get Fisher comparisons in R uses pairwise.t.test() with p.adjust.method = "none".
The resulting summary of the multiple comparisons is in terms of p-values for all pair-
wise two-sample t-tests using the pooled standard deviation from the ANOVA using
pool.sd = TRUE. This output can be used to generate groupings. A summary of the
p-values is given below. Let us see that we can recover the groups from this output.
#### Multiple Comparisons
# all pairwise comparisons among levels of fat
# Fisher's LSD (FSD) uses "none"
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## fat1 fat2 fat3
## fat2 0.038 - -
## fat3 0.944 0.044 -
## fat4 0.015 9.3e-05 0.013
##
## P value adjustment method: none


Discussion of the FSD Method: family error rate


There are c = k(k − 1)/2 pairs of means to compare in the second step of the FSD
method. Each comparison is done at the α level, where for a generic comparison of
the ith and j th populations
α = probability of rejecting H0 : µi = µj when H0 is true.
The individual error rate is not the only error rate that is important in multiple
comparisons. The family error rate (FER), or the experimentwise error rate,
is defined to be the probability of at least one false rejection of a true hypothesis
H0 : µi = µj over all comparisons. When many comparisons are made, you may have
a large probability of making one or more false rejections of true null hypotheses. In
particular, when all c comparisons of two population means are performed, each at
the α level, then
α < F ER < cα.
For example, in the doughnut problem where k = 4, there are c = 4(3)/2 = 6
possible comparisons of pairs of fats. If each comparison is carried out at the 5% level,
then 0.05 < F ER < 0.30. At the second step of the FSD method, you could have
up to a 30% chance of claiming one or more pairs of population means are different
if no differences existed between population means. Most statistical packages do not
evaluate the FER, so the upper bound is used.
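A small sketch of these bounds in R; the exact FER is unknown, and the 1 − (1 − α)^c value shown is only what the FER would be if the c tests were independent (they are not), so treat it as a rough reference point.
# sketch: number of pairwise comparisons and FER bounds for the doughnut problem
k      <- 4
c.comp <- k * (k - 1) / 2    # 6 pairwise comparisons
alpha  <- 0.05
c.comp * alpha               # upper (Bonferroni-type) bound on the FER: 0.30
1 - (1 - alpha)^c.comp       # FER if the 6 tests were independent: about 0.26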
The first step of the FSD method is the ANOVA “screening” test. The multiple
comparisons are carried out only if the F -test suggests that not all population means
are equal. This screening test tends to deflate the FER for the two-step FSD proce-
dure. However, the FSD method is commonly criticized for being extremely liberal
(too many false rejections of true null hypotheses) when some, but not many, differ-
ences exist — especially when the number of comparisons is large. This conclusion is
fairly intuitive. When you do a large number of tests, each, say, at the 5% level, then
sampling variation alone will suggest differences in 5% of the comparisons where the
H0 is true. The number of false rejections could be enormous with a large number
of comparisons. For example, chance variation alone would account for an average of
50 significant differences in 1000 comparisons (about 45 populations) each at the 5%
level.

5.2.2 Bonferroni Comparisons


The Bonferroni method controls the FER by reducing the individual comparison error
rate. The FER is guaranteed to be no larger than a prespecified amount, say α, by
setting the individual error rate for each of the c comparisons of interest to α/c.
Therefore, α/c < F ER < cα/c = α, thus the upper bound for FER is α. Larger
differences in the sample means are needed before declaring statistical significance
using the Bonferroni adjustment than when using the FSD method at the α level.


ClickerQ s — ANOVA, Bonferroni

Assuming all comparisons are of interest, you can implement the Bonferroni ad-
justment in R by specifying p.adjust.method = "bonf". A by-product of the Bonfer-
roni adjustment is that we have at least 100(1 − α)% confidence that all pairwise
t-test statements hold simultaneously!
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##
## fat1 fat2 fat3
## fat2 0.22733 - -
## fat3 1.00000 0.26241 -
## fat4 0.09286 0.00056 0.07960
##
## P value adjustment method: bonferroni
Looking at the output, can you create the groups? You should get the groups given
below, which implies you have sufficient evidence to conclude that the population
mean absorption for Fat 2 is different than that for Fat 4.
FAT 4   FAT 1   FAT 3   FAT 2
---------------------
        ---------------------
The Bonferroni method tends to produce “coarser” groups than the FSD method,
because the individual comparisons are conducted at a lower error level (α/c rather than α).
Equivalently, the minimum significant difference is inflated for the Bonferroni method.
For example, in the doughnut problem with F ER ≤ 0.05, the critical value for the
individual comparisons at the 0.0083 level is tcrit = 2.929 with df = 20. The minimum
significant difference for the Bonferroni comparisons is
$$ \text{LSD} = 2.929 \times 8.18 \times \sqrt{\frac{2}{6}} = 13.824 $$
versus an LSD=9.85 for the FSD method. Referring back to our table of sample
means on page 152, we see that the sole comparison where the absolute difference
between sample means exceeds 13.824 involves Fats 2 and 4.
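A sketch of the Bonferroni critical value and minimum significant difference in R, reusing spooled = 8.18 and dfE = 20 from the doughnut ANOVA:
# sketch: Bonferroni minimum significant difference, individual level alpha/c = 0.05/6
t.crit.bonf <- qt(1 - (0.05/6)/2, df = 20)   # approximately 2.929
t.crit.bonf * 8.18 * sqrt(2/6)               # approximately 13.8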

Example from Koopmans: glabella facial tissue thickness In an anthropo-
logical study of facial tissue thickness for different racial groups, data were taken
during autopsy at several points on the faces of deceased individuals. The Glabella
measurements taken at the bony ridge for samples of individuals from three racial


groups (cauc = Caucasian, afam = African American, and naaa = Native American
and Asian) follow. The data values are in mm.
[Figure: diagram of the skull (lateral and posterior views) labeling facial and cranial landmarks, including the Glabella, the bony ridge where the tissue-thickness measurements were taken.]

#### Example from Koopmans: glabella facial tissue thickness


glabella <- read.table(text="
Row cauc afam naaa
1 5.75 6.00 8.00
2 5.50 6.25 7.00
3 6.75 6.75 6.00
4 5.75 7.00 6.25
5 5.00 7.25 5.50
6 5.75 6.75 4.00
7 5.75 8.00 5.00
8 7.75 6.50 6.00
9 5.75 7.50 7.25
10 5.25 6.25 6.00
11 4.50 5.00 6.00
12 6.25 5.75 4.25
13 NA 5.00 4.75
14 NA NA 6.00
", header=TRUE)

glabella.long <- melt(glabella,


id.vars=c("Row"),
variable.name = "pop",
value.name = "thickness",
# remove NAs
na.rm = TRUE
)
# naming variables manually, the variable.name and value.name not working 11/2012
names(glabella.long) <- c("Row", "pop", "thickness")
# another way to remove NAs:
#glabella.long <- subset(glabella.long, !is.na(thickness))


# Plot the data using ggplot


library(ggplot2)
p <- ggplot(glabella.long, aes(x = pop, y = thickness))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(glabella.long$thickness),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Glabella thickness") + ylab("thickness (mm)")
print(p)

[Figure: "Glabella thickness" — boxplots of thickness (mm) by population (cauc, afam, naaa) with jittered observations, red diamonds at the group means, red error bars for the normal-based 95% CIs, and a dashed line at the overall mean.]

There are 3 groups, so there are 3 possible pairwise comparisons. If you want a
Bonferroni analysis with FER of no greater than 0.05, you should do the individual
comparisons at the 0.05/3 = 0.0167 level. Except for the mild outlier in the Caucasian
sample, the observed distributions are fairly symmetric, with similar spreads. I would
expect the standard ANOVA to perform well here.
Let µc = population mean Glabella measurement for Caucasians, µa = popula-
tion mean Glabella measurement for African Americans, and µn = population mean
Glabella measurement for Native Americans and Asians.
glabella.summary <- ddply(glabella.long, "pop",
function(X) { data.frame( m = mean(X$thickness),
s = sd(X$thickness),
n = length(X$thickness) ) } )
glabella.summary
## pop m s n
## 1 cauc 5.812500 0.8334280 12
## 2 afam 6.461538 0.8946959 13
## 3 naaa 5.857143 1.1168047 14
fit.g <- aov(thickness ~ pop, data = glabella.long)
summary(fit.g)
## Df Sum Sq Mean Sq F value Pr(>F)
## pop 2 3.40 1.6991 1.828 0.175
## Residuals 36 33.46 0.9295
fit.g
## Call:
## aov(formula = thickness ~ pop, data = glabella.long)
##
## Terms:
## pop Residuals
## Sum of Squares 3.39829 33.46068
## Deg. of Freedom 2 36
##
## Residual standard error: 0.9640868
## Estimated effects may be unbalanced

At the 5% level, you would not reject the hypothesis that the population mean
Glabella measurements are identical. That is, you do not have sufficient evidence
to conclude that these racial groups differ with respect to their average Glabella
measurement. This is the end of the analysis!
The Bonferroni intervals reinforce this conclusion: all the p-values are greater than
0.05. If you were to calculate CIs for the difference in population means, each would
contain zero. You can think of the Bonferroni intervals as simultaneous CI. We’re
(at least) 95% confident that all of the following statements hold simultaneously:
−1.62 ≤ µc − µa ≤ 0.32, −0.91 ≤ µn − µc ≤ 1.00, and −1.54 ≤ µn − µa ≤ 0.33. The
individual CIs have level 100(1 − 0.0167)% = 98.33%.
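As a check (a sketch, not part of the original notes), the first of these simultaneous intervals can be reproduced by hand from the group summaries and the pooled standard deviation reported above:
# sketch: Bonferroni 98.33% CI for mu_c - mu_a (Caucasian minus African American)
m.diff <- 5.8125 - 6.461538                # difference in sample means
se     <- 0.9640868 * sqrt(1/12 + 1/13)    # pooled SE of the difference
t.crit <- qt(1 - (0.05/3)/2, df = 36)      # Bonferroni critical value
m.diff + c(-1, 1) * t.crit * se            # approximately (-1.62, 0.32)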


# Bonferroni 95% Individual p-values


# All Pairwise Comparisons among Levels of glabella
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: glabella.long$thickness and glabella.long$pop
##
## cauc afam
## afam 0.30 -
## naaa 1.00 0.34
##
## P value adjustment method: bonferroni

5.3 Further Discussion of Multiple Comparisons


The FSD and Bonferroni methods comprise the ends of the spectrum of multiple
comparisons methods. Among multiple comparisons procedures, the FSD method is
most likely to find differences, whether real or due to sampling variation, whereas
Bonferroni is often the most conservative method. You can be reasonably sure that
differences suggested by the Bonferroni method will be suggested by almost all other
methods, whereas differences not significant under FSD will not be picked up using
other approaches.
The Bonferroni method is conservative, but tends to work well when the number
of comparisons is small, say 4 or less. A smart way to use the Bonferroni adjustment
is to focus attention only on the comparisons of interest (generated independently of
looking at the data!), and ignore the rest. I will return to this point later.
A commonly-used alternative is Tukey’s honest significant difference method (HSD), which rejects the equality of a pair of means based not on the t-distribution but on the studentized range distribution. To implement Tukey’s method with a FER of α, reject H0 : µi = µj when
$$ |\bar{Y}_i - \bar{Y}_j| \ge \frac{q_{\text{crit}}}{\sqrt{2}} \, s_{\text{pooled}} \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}, $$
where qcrit is the α level critical value of the studentized range distribution. For
the doughnut fats, the groupings based on Tukey and Bonferroni comparisons are
identical.
#### Tukey's honest significant difference method (HSD)
## Fat
# Tukey 95% Individual p-values


# All Pairwise Comparisons among Levels of fat


TukeyHSD(fit.f)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = amount ~ type, data = fat.long)
##
## $type
## diff lwr upr p adj
## fat2-fat1 10.5000000 -2.719028 23.7190277 0.1510591
## fat3-fat1 0.3333333 -12.885694 13.5523611 0.9998693
## fat4-fat1 -12.5000000 -25.719028 0.7190277 0.0679493
## fat3-fat2 -10.1666667 -23.385694 3.0523611 0.1709831
## fat4-fat2 -23.0000000 -36.219028 -9.7809723 0.0004978
## fat4-fat3 -12.8333333 -26.052361 0.3856944 0.0590077
## Glabella
# Tukey 95% Individual p-values
# All Pairwise Comparisons among Levels of pop
TukeyHSD(fit.g)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = thickness ~ pop, data = glabella.long)
##
## $pop
## diff lwr upr p adj
## afam-cauc 0.64903846 -0.2943223 1.5923993 0.2259806
## naaa-cauc 0.04464286 -0.8824050 0.9716907 0.9923923
## naaa-afam -0.60439560 -1.5120412 0.3032500 0.2472838
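As a sketch of the arithmetic behind this output (not part of the original notes), the half-width of the equal-n doughnut intervals can be reproduced from the studentized range quantile with qtukey():
# sketch: Tukey HSD margin for the doughnut data (k = 4 groups, n = 6 each, dfE = 20)
q.crit <- qtukey(0.95, nmeans = 4, df = 20)
q.crit / sqrt(2) * 8.180261 * sqrt(1/6 + 1/6)   # about 13.2, the half-width of each
                                                # doughnut interval above (e.g., fat2-fat1: 10.5 +/- 13.2)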

Another popular method controls the false discovery rate (FDR) instead of the
FER. The FDR is the expected proportion of false discoveries amongst the rejected
hypotheses. The false discovery rate is a less stringent condition than the family-wise
error rate, so these methods are more powerful than the others, though with a higher
FER. I encourage you to learn more about the methods by Benjamini, Hochberg, and
Yekutieli.
#### false discovery rate (FDR)
## Fat
# FDR
pairwise.t.test(fat.long$amount, fat.long$type,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fat.long$amount and fat.long$type
##


## fat1 fat2 fat3


## fat2 0.05248 - -
## fat3 0.94443 0.05248 -
## fat4 0.03095 0.00056 0.03095
##
## P value adjustment method: BH

## Glabella
# FDR
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
pool.sd = TRUE, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: glabella.long$thickness and glabella.long$pop
##
## cauc afam
## afam 0.17 -
## naaa 0.91 0.17
##
## P value adjustment method: BH
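The same adjustment can be applied by hand with p.adjust(); here is a sketch using the rounded unadjusted Fisher p-values printed earlier for the doughnut comparisons, so the results agree only approximately with the BH output above.
# sketch: BH (FDR) adjustment applied to the rounded unadjusted doughnut p-values
p.raw <- c(0.038, 0.944, 0.044, 0.015, 9.3e-05, 0.013)  # from the "none" output
p.adjust(p.raw, method = "BH")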

ClickerQ s — ANOVA, multiple comparisons

5.4 Checking Assumptions in ANOVA Problems


The classical ANOVA assumes that the populations have normal frequency curves
and the populations have equal variances (or spreads). You can test the normal-
ity assumption using multiple normal QQ-plots and normal scores tests, which we
discussed in Chapter 4. An alternative approach that is useful with three or more
samples is to make a single normal scores plot for the entire data set. The samples
must be centered at the same location for this to be meaningful. (WHY?) This is
done by subtracting the sample mean from each observation in the sample, giving
the so-called residuals. A normal scores plot or histogram of the residuals should
resemble a sample from a normal population. These two plots can be generated from the $residuals component of the fitted model object returned by aov() (or lm()).
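A minimal sketch of the centering step, using the fat.long data from earlier: subtracting each group's sample mean gives the same residuals that the fitted model stores.
# sketch: residuals by hand = each observation minus its own group mean
res.by.hand <- fat.long$amount - ave(fat.long$amount, fat.long$type)
all.equal(as.numeric(fit.f$residuals), res.by.hand)   # should be TRUE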
For the glabella residuals, there are a few observations outside the confidence
bands, but the formal normality tests each have p-values > 0.2, so there’s weak but
unconvincing evidence of nonnormality.
#### Checking Assumptions in ANOVA Problems
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve


hist(fit.g$residuals, freq = FALSE, breaks = 20)


points(density(fit.g$residuals), type = "l")
rug(fit.g$residuals)

# violin plot
library(vioplot)
vioplot(fit.g$residuals, horizontal=TRUE, col="gray")

# boxplot
boxplot(fit.g$residuals, horizontal=TRUE)

# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.g$residuals, las = 1, id = list(n = 8, cex = 1), lwd = 1, main="QQ Plot")
## 29 8 34 40 21 25 27 37
## 26 8 31 37 19 23 25 34
[Figure: histogram with kernel density curve, violin plot, boxplot, and normal QQ-plot of the glabella ANOVA residuals (fit.g$residuals), with a few points flagged outside the QQ confidence bands.]

shapiro.test(fit.g$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.g$residuals
## W = 0.97693, p-value = 0.5927
library(nortest)
ad.test(fit.g$residuals)
##
## Anderson-Darling normality test
##


## data: fit.g$residuals
## A = 0.37731, p-value = 0.3926
# lillie.test(fit.g$residuals)
cvm.test(fit.g$residuals)
##
## Cramer-von Mises normality test
##
## data: fit.g$residuals
## W = 0.070918, p-value = 0.2648
In Chapter 4, I illustrated the use of Bartlett’s test and Levene’s test for equal
population variances, and showed how to evaluate these tests in R.

[Figure: χ² distribution with 3 degrees of freedom, showing the fixed α = 0.05 rejection region to the right of χ²_crit (reject H0 for χ²_S there).]

R does the calculation for us, as illustrated below. Because the p-value > 0.5, we
fail to reject the null hypothesis that the population variances are equal. This result
is not surprising given how close the sample variances are to each other.
## Test equal variance
# Bartlett assumes populations are normal
bartlett.test(thickness ~ pop, data = glabella.long)
##
## Bartlett test of homogeneity of variances


##
## data: thickness by pop
## Bartlett's K-squared = 1.1314, df = 2, p-value = 0.568
Levene’s and Fligner’s tests are consistent with Bartlett’s.
# Levene does not assume normality, requires car package
library(car)
leveneTest(thickness ~ pop, data = glabella.long)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5286 0.5939
## 36
# Fligner is a nonparametric test
fligner.test(thickness ~ pop, data = glabella.long)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: thickness by pop
## Fligner-Killeen:med chi-squared = 1.0311, df = 2, p-value =
## 0.5972

5.5 Example from the Child Health and Development Study (CHDS)
We consider data from the birth records of 680 live-born white male infants. The
infants were born to mothers who reported for pre-natal care to three clinics of
the Kaiser hospitals in northern California. As an initial analysis, we will exam-
ine whether maternal smoking has an effect on the birth weights of these children.
To answer this question, we define 3 groups based on mother’s smoking history: (1)
mother does not currently smoke or never smoked, (2) mother smoked less than one
pack of cigarettes a day during pregnancy, and (3) mother smoked at least one pack
of cigarettes a day during pregnancy.
Let µi = pop mean birth weight (lb) for children in group i, (i = 1, 2, 3). We wish
to test H0 : µ1 = µ2 = µ3 against HA : not H0 .
We read in the data, create a smoke factor variable, and plot the data by smoking
group.
#### Example from the Child Health and Development Study (CHDS)
# description at http://statacumen.com/teach/ADA1/ADA1_notes_05-CHDS_desc.txt
# read data from website
chds <- read.csv("http://statacumen.com/teach/ADA1/ADA1_notes_05-CHDS.csv")

chds$smoke <- rep(NA, nrow(chds));


# no cigs


chds[(chds$m_smok == 0), "smoke"] <- "0 cigs";


# less than 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok > 0) & (chds$m_smok < 20),"smoke"] <- "1-19 cigs";
# at least 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok >= 20),"smoke"] <- "20+ cigs";
chds$smoke <- factor(chds$smoke)
# histogram using ggplot
p1 <- ggplot(chds, aes(x = c_bwt))
p1 <- p1 + geom_histogram(binwidth = .4)
p1 <- p1 + geom_rug()
p1 <- p1 + facet_grid(smoke ~ .)
p1 <- p1 + labs(title = "Child birthweight vs maternal smoking") +
xlab("child birthweight (lb)")
#print(p1)

p2 <- ggplot(chds, aes(x = c_bwt, fill=smoke))
p2 <- p2 + geom_histogram(binwidth = .4, alpha = 1/3, position="identity")
p2 <- p2 + geom_rug(aes(colour = smoke), alpha = 1/3)
p2 <- p2 + labs(title = "Child birthweight vs maternal smoking") +
xlab("child birthweight (lb)")
#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1)

# Plot the data using ggplot


library(ggplot2)
p <- ggplot(chds, aes(x = smoke, y = c_bwt))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(chds$c_bwt),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 4,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Child birthweight vs maternal smoking") +
ylab("child birthweight (lb)")
print(p)


[Figure: "Child birthweight vs maternal smoking" — faceted and overlaid histograms of child birthweight (lb) by smoking group (0 cigs, 1-19 cigs, 20+ cigs), and boxplots with jittered observations, group means (red diamonds), and normal-based 95% CIs.]

Looking at the boxplots, there is some evidence of non-normality here. Although
there are outliers in the no smoking group, we need to recognize that the sample
size for this group is fairly large (381). Given that boxplots are calibrated in such a
way that 7 outliers per 1000 observations are expected when sampling from a normal
population, 5 outliers (only 4 are visible!) out of 381 seems a bit excessive. A formal
test rejects the hypothesis of normality in the no and low smoker groups. The normal
probability plot and the histogram of the residuals also suggests that the population
distributions are heavy tailed.
library(car)
par(mfrow=c(1,3))
qqPlot(subset(chds, smoke == "0 cigs" )$c_bwt, las = 1, id = list(n = 0, cex = 1),
lwd = 1, main="QQ Plot, 0 cigs")
qqPlot(subset(chds, smoke == "1-19 cigs")$c_bwt, las = 1, id = list(n = 0, cex = 1),
lwd = 1, main="QQ Plot, 1-19 cigs")
qqPlot(subset(chds, smoke == "20+ cigs" )$c_bwt, las = 1, id = list(n = 0, cex = 1),
lwd = 1, main="QQ Plot, 20+ cigs")


[Figure: normal QQ-plots of child birthweight for the 0 cigs, 1-19 cigs, and 20+ cigs groups.]

library(nortest)
# 0 cigs --------------------
shapiro.test(subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.98724, p-value = 0.00199
ad.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Anderson-Darling normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## A = 0.92825, p-value = 0.01831
cvm.test( subset(chds, smoke == "0 cigs" )$c_bwt)
##
## Cramer-von Mises normality test
##
## data: subset(chds, smoke == "0 cigs")$c_bwt
## W = 0.13844, p-value = 0.03374
# 1-19 cigs --------------------
shapiro.test(subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "1-19 cigs")$c_bwt
## W = 0.97847, p-value = 0.009926
ad.test( subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Anderson-Darling normality test
##


## data: subset(chds, smoke == "1-19 cigs")$c_bwt


## A = 0.83085, p-value = 0.03149
cvm.test( subset(chds, smoke == "1-19 cigs")$c_bwt)
##
## Cramer-von Mises normality test
##
## data: subset(chds, smoke == "1-19 cigs")$c_bwt
## W = 0.11332, p-value = 0.07317
# 20+ cigs --------------------
shapiro.test(subset(chds, smoke == "20+ cigs" )$c_bwt)
##
## Shapiro-Wilk normality test
##
## data: subset(chds, smoke == "20+ cigs")$c_bwt
## W = 0.98127, p-value = 0.06962
ad.test( subset(chds, smoke == "20+ cigs" )$c_bwt)
##
## Anderson-Darling normality test
##
## data: subset(chds, smoke == "20+ cigs")$c_bwt
## A = 0.40008, p-value = 0.3578
cvm.test( subset(chds, smoke == "20+ cigs" )$c_bwt)
##
## Cramer-von Mises normality test
##
## data: subset(chds, smoke == "20+ cigs")$c_bwt
## W = 0.040522, p-value = 0.6694

Fit the ANOVA (we’ll look at the table below).


fit.c <- aov(c_bwt ~ smoke, data = chds)

A formal test of normality on the residuals of the combined sample is marginally
significant (SW p-value= 0.047, others > 0.10). However, I am not overly concerned
about this for the following reasons: in large samples, small deviations from normality
are often statistically significant and in my experience, the small deviations we are
seeing here are not likely to impact our conclusions, in the sense that non-parametric
methods that do not require normality will lead to the same conclusions.
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(fit.c$residuals, freq = FALSE, breaks = 20)
points(density(fit.c$residuals), type = "l")
rug(fit.c$residuals)

# violin plot


library(vioplot)
vioplot(fit.c$residuals, horizontal=TRUE, col="gray")

# boxplot
boxplot(fit.c$residuals, horizontal=TRUE)

# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.c$residuals, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")

[Figure: histogram with kernel density curve, violin plot, boxplot, and normal QQ-plot of the CHDS ANOVA residuals (fit.c$residuals), showing somewhat heavy tails.]

shapiro.test(fit.c$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.c$residuals
## W = 0.99553, p-value = 0.04758
library(nortest)
ad.test(fit.c$residuals)
##
## Anderson-Darling normality test
##
## data: fit.c$residuals
## A = 0.62184, p-value = 0.1051
cvm.test(fit.c$residuals)
##
## Cramer-von Mises normality test
##


## data: fit.c$residuals
## W = 0.091963, p-value = 0.1449
Looking at the summaries, we see that the sample standard deviations are close.
Formal tests of equal population variances are far from significant. The p-values for
Bartlett’s test and Levene’s test are greater than 0.4. Thus, the standard ANOVA
appears to be appropriate here.
# calculate summaries
chds.summary <- ddply(chds, "smoke",
function(X) { data.frame( m = mean(X$c_bwt),
s = sd(X$c_bwt),
n = length(X$c_bwt) ) } )
chds.summary
## smoke m s n
## 1 0 cigs 7.732808 1.052341 381
## 2 1-19 cigs 7.221302 1.077760 169
## 3 20+ cigs 7.266154 1.090946 130
## Test equal variance
# assumes populations are normal
bartlett.test(c_bwt ~ smoke, data = chds)
##
## Bartlett test of homogeneity of variances
##
## data: c_bwt by smoke
## Bartlett's K-squared = 0.3055, df = 2, p-value = 0.8583
# does not assume normality, requires car package
library(car)
leveneTest(c_bwt ~ smoke, data = chds)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.7591 0.4685
## 677
# nonparametric test
fligner.test(c_bwt ~ smoke, data = chds)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: c_bwt by smoke
## Fligner-Killeen:med chi-squared = 2.0927, df = 2, p-value =
## 0.3512
The p-value for the F -test is less than 0.0001. We would reject H0 at any of the
usual test levels (such as 0.05 or 0.01). The data suggest that the population mean
birth weights differ across smoking status groups.
summary(fit.c)
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 2 40.7 20.351 17.9 2.65e-08 ***


## Residuals 677 769.5 1.137


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.c
## Call:
## aov(formula = c_bwt ~ smoke, data = chds)
##
## Terms:
## smoke Residuals
## Sum of Squares 40.7012 769.4943
## Deg. of Freedom 2 677
##
## Residual standard error: 1.066126
## Estimated effects may be unbalanced
The Tukey multiple comparisons suggest that the mean birth weights are different
(higher) for children born to mothers that did not smoke during pregnancy.
## CHDS
# Tukey 95% Individual p-values
TukeyHSD(fit.c)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = c_bwt ~ smoke, data = chds)
##
## $smoke
## diff lwr upr p adj
## 1-19 cigs-0 cigs -0.51150662 -0.7429495 -0.2800637 0.0000008
## 20+ cigs-0 cigs -0.46665455 -0.7210121 -0.2122970 0.0000558
## 20+ cigs-1-19 cigs 0.04485207 -0.2472865 0.3369907 0.9308357



Part III

ADA1: Nonparametric, categorical, and regression methods



Chapter 6

Nonparametric Methods

Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.2 The Sign Test and CI for a Population Median . . . . . 176
6.3 Wilcoxon Signed-Rank Procedures . . . . . . . . . . . . . 183
6.3.1 Nonparametric Analyses of Paired Data . . . . . . . . . . . 187
6.3.2 Comments on One-Sample Nonparametric Methods . . . . 189
6.4 (Wilcoxon-)Mann-Whitney Two-Sample Procedure . . . 190
6.5 Alternatives for ANOVA and Planned Comparisons . . 198
6.5.1 Kruskal-Wallis ANOVA . . . . . . . . . . . . . . . . . . . . 199
6.5.2 Transforming Data . . . . . . . . . . . . . . . . . . . . . . . 199
6.5.3 Planned Comparisons . . . . . . . . . . . . . . . . . . . . . 212
6.5.4 Two final ANOVA comments . . . . . . . . . . . . . . . . . 215
6.6 Permutation tests . . . . . . . . . . . . . . . . . . . . . . . 215
6.6.1 Linear model permutation tests in R . . . . . . . . . . . . . 219
6.7 Density estimation . . . . . . . . . . . . . . . . . . . . . . . 222

Learning objectives
After completing this topic, you should be able to:
select the appropriate procedure based on assumptions.
explain reason for using one procedure over another.
decide whether the medians between multiple populations are different.
Achieving these goals contributes to mastery in these course learning outcomes:


3. select correct statistical procedure.


5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
10. identify and explain statistical methods, assumptions, and limitations.
12. make evidence-based decisions.

6.1 Introduction

Nonparametric methods do not require the normality assumption of classical tech-
niques. When the normality assumption is met, the ANOVA and t-test are most
powerful, in that if the alternative is true these methods will make the correct de-
cision with highest probability. However, if the normality assumption is not met,
results from the ANOVA and t-test can be misleading and too liberal. I will describe
and illustrate selected non-parametric methods, and compare them with classi-
cal methods. Some motivation and discussion of the strengths and weaknesses of
non-parametric methods is given.

6.2 The Sign Test and CI for a Population Median

The sign test assumes that you have a random sample from a population, but
makes no assumption about the population shape. The standard t-test provides
inferences on a population mean. The sign test, in contrast, provides inferences about
a population median.
If the population frequency curve is symmetric (see below), then the population
median, identified by η, and the population mean µ are identical. In this case the
sign procedures provide inferences for the population mean, though less powerfully
than the t-test.
The idea behind the sign test is straightforward. Suppose you have a sample of
size m from the population, and you wish to test H0 : η = η0 (a given value). Let S
be the number of sampled observations above η0 . If H0 is true, you expect S to be
approximately one-half the sample size, 0.5m. If S is much greater than 0.5m, the
data suggests that η > η0 . If S is much less than 0.5m, the data suggests that η < η0 .


[Figure: two density curves — for a skewed distribution the median η and mean µ differ (the median splits the area 50/50), while for a symmetric distribution the mean and median coincide.]

S has a Binomial distribution when H0 is true. The Binomial distribution is
used to construct a test with size α (approximately). For a two-sided alternative HA :
η ≠ η0 , the test rejects H0 when S is significantly different from 0.5m, as determined
from the reference Binomial distribution. One-sided tests use the corresponding lower
or upper tail of the distribution. To generate a CI for η, you can exploit the duality
between CI and tests. A 100(1 − α)% CI for η consists of all values η0 not rejected
by a two-sided size α test of H0 : η = η0 .
Not all test sizes and confidence levels are possible because the test statistic S
is discrete valued. R’s SIGN.test() in the BSDA package gives an exact p-value for
the test, and approximates the desired confidence level using a linear interpolation
algorithm.

Example: Income Data Recall that the income distribution is extremely skewed,
with two extreme outliers at 46 and 1110.
#### Example: Income Data
income <- c(7, 1110, 7, 5, 8, 12, 0, 5, 2, 2, 46, 7)
# sort in decreasing order
income <- sort(income, decreasing = TRUE)
income
## [1] 1110 46 12 8 7 7 7 5 5 2 2 0
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.25 7.00 100.92 9.00 1110.00
sd(income)
## [1] 318.0078
The income data is unimodal, skewed right, with two extreme outliers.


par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(income, freq = FALSE, breaks = 1000)
points(density(income), type = "l")
rug(income)

# violin plot
library(vioplot)
vioplot(income, horizontal=TRUE, col="gray")

# boxplot
boxplot(income, horizontal=TRUE)

[Figure: histogram with kernel density curve, violin plot, and boxplot of income (0 to 1100), showing the extreme right skew and the two outliers.]

The normal QQ-plot of the sample data indicates strong deviation from normality,
and the CLT can’t save us: even the bootstrap sampling distribution of the mean
indicates strong deviation from normality.
library(car)
qqPlot(income, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Income")

bs.one.samp.dist(income)


[Figure: normal QQ-plot of income and bootstrap sampling distribution of the mean (n = 12, mean = 100.92, se = 91.80), both strongly non-normal.]

The presence of the outliers has a dramatic effect on the 95% CI for the population
mean income µ, which goes from −101 to 303 (in 1000 dollar units). This t-CI is
suspect because the normality assumption is unreasonable. A CI for the population
median income η is more sensible because the median is likely to be a more reasonable
measure of typical value. Using the sign procedure, you are 95% confident that the
population median income is between 2.32 and 11.57 (times $1000).
library(BSDA)
## Loading required package: lattice
##
## Attaching package: ’BSDA’
## The following objects are masked from ’package:carData’:
##
## Vocab, Wool
## The following object is masked from ’package:TeachingDemos’:
##
## z.test
## The following object is masked from ’package:datasets’:
##
## Orange
t.test(income)
##
## One Sample t-test
##
## data: income
## t = 1.0993, df = 11, p-value = 0.2951
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -101.1359 302.9692
## sample estimates:
## mean of x


## 100.9167
SIGN.test(income)
##
## One-sample Sign-Test
##
## data: income
## s = 11, p-value = 0.0009766
## alternative hypothesis: true median is not equal to 0
## 95 percent confidence interval:
## 2.319091 11.574545
## sample estimates:
## median of x
## 7
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8540 5.0000 8.0000
## Interpolated CI 0.9500 2.3191 11.5745
## Upper Achieved CI 0.9614 2.0000 12.0000
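As a check (a sketch, not part of the original notes), the exact sign-test p-value can be reproduced directly from the Binomial distribution: the one observation equal to 0 is discarded and we count how many of the remaining 11 lie above the hypothesized median of 0.
# sketch: sign test of H0: median = 0 via an exact binomial test
s <- sum(income > 0)        # 11 observations above 0
m <- sum(income != 0)       # 11 usable observations (one value equals 0 and is dropped)
binom.test(s, m, p = 0.5)   # two-sided p-value 0.0009766, matching SIGN.test()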

Example: Age at First Heart Transplant Recall that the distribution of ages is
skewed to the left with a lower outlier. A question of interest is whether the “typical
age” at first transplant is 50. This can be formulated as a test about the population
median η or as a test about the population mean µ, depending on the interpretation.

#### Example: Age at First Heart Transplant


age <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49)
# sort in decreasing order
age <- sort(age, decreasing = TRUE)
age
## [1] 64 58 56 54 54 54 51 49 49 42 33
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.00 49.00 54.00 51.27 55.00 64.00
sd(age)
## [1] 8.25943
The age data is unimodal and skewed left, with no extreme outliers.
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(age, freq = FALSE, breaks = 10)
points(density(age), type = "l")
rug(age)

# violin plot


library(vioplot)
vioplot(age, horizontal=TRUE, col="gray")

# boxplot
boxplot(age, horizontal=TRUE)

[Figure: histogram with kernel density curve, violin plot, and boxplot of age (roughly 33 to 64), showing mild left skew.]

The normal QQ-plot of the sample data indicates mild deviation from normality in
the left tail (2 points of 11 outside the bands), and the bootstrap sampling distribution
of the mean indicates weak deviation from normality. It is good practice in this case
to use the nonparametric test as a double-check of the t-test, with the nonparametric
test being the more conservative test.
library(car)
qqPlot(age, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Income")

bs.one.samp.dist(age)


[Figure: normal QQ-plot of age and bootstrap sampling distribution of the mean (n = 11, mean = 51.27, se = 2.49); only mild deviation from normality.]

The sign test for H0 : η = 50 against HA : η ≠ 50 has a p-value of 0.549, which is
not sufficient to reject H0 . A 95% CI for η is 47.0 to 56.6 years, which includes the
hypothesized median age of 50. Similar conclusions are reached with the t-CI and the
test on µ, but you should have less confidence in these results because the normality
assumption is tenuous.
library(BSDA)
t.test(age, mu=50)
##
## One Sample t-test
##
## data: age
## t = 0.51107, df = 10, p-value = 0.6204
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 45.72397 56.82149
## sample estimates:
## mean of x
## 51.27273
SIGN.test(age, md=50)
##
## One-sample Sign-Test
##
## data: age
## s = 7, p-value = 0.5488
## alternative hypothesis: true median is not equal to 50
## 95 percent confidence interval:
## 46.98909 56.57455
## sample estimates:
## median of x
## 54


##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9346 49.0000 56.0000
## Interpolated CI 0.9500 46.9891 56.5745
## Upper Achieved CI 0.9883 42.0000 58.0000

6.3 Wilcoxon Signed-Rank Procedures


The Wilcoxon procedure assumes you have a random sample from a population with
a symmetric frequency curve. The curve need not be normal. The test and CI can
be viewed as procedures for either the population median or mean.
To illustrate the computation of the Wilcoxon statistic W , suppose you wish to
test H0 : µ = µ0 = 10 on the made-up data below. The test statistic requires us to
compute the signs of Xi − µ0 and the ranks of |Xi − µ0 |. Ties in |Xi − µ0 | get the
average rank and observations at µ0 (here 10) are always discarded. The Wilcoxon
statistic is the sum of the signed ranks for observations above µ0 = 10. For us

W = 6 + 4.5 + 8 + 2 + 4.5 + 7 = 32.


Xi Xi − 10 sign |Xi − 10| rank sign× rank
20 10 + 10 6 6
18 8 + 8 4.5 4.5
23 13 + 13 8 8
5 −5 − 5 3 −3
14 4 + 4 2 2
8 −2 − 2 1 −1
18 8 + 8 4.5 4.5
22 12 + 12 7 7
The sum of all ranks is always 0.5m(m + 1), where m is the sample size. If H0
is true, you expect W to be approximately 0.5 × 0.5m(m + 1) = 0.25m(m + 1).
Why? Recall that W adds up the ranks for observations above µ0 . If H0 is true, you
expect 1/2 of all observations to be above µ0 , assuming the population distribution
is symmetric. The ranks of observations above µ0 should add to approximately 1/2
times the sum of all ranks. You reject H0 in favor of HA : µ ≠ µ0 if W is much larger
than, or much smaller than 0.25m(m + 1). One sided tests can also be constructed.
The Wilcoxon CI for µ is computed in a manner analogous to that described for the
sign CI.
Here, m = 8 so the sum of all ranks is 0.5 × 8 × 9 = 36 (check yourself). The
expected value of W is 0.5 × 0.5 × 8 × 9 = 18. Is the observed value of W = 32 far


from the expected value of 18? To formally answer this question, we need to use the
Wilcoxon procedures, which are implemented in R with wilcox.test().

Example: Made-up Data The boxplot indicates that the distribution is fairly
symmetric, so the Wilcoxon method is reasonable (so is a t-CI and test).
#### Example: Made-up Data
dat <- c(20, 18, 23, 5, 14, 8, 18, 22)
# sort in decreasing order
dat <- sort(dat, decreasing = TRUE)
dat
## [1] 23 22 20 18 18 14 8 5
summary(dat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 12.5 18.0 16.0 20.5 23.0
sd(dat)
## [1] 6.524678

The dat data are unimodal and skewed left, with no extreme outliers.


par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 10)
points(density(dat), type = "l")
rug(dat)

# violin plot
library(vioplot)
vioplot(dat, horizontal=TRUE, col="gray")

# boxplot
boxplot(dat, horizontal=TRUE)


[Figure: histogram with density curve, violin plot, and boxplot of dat]

The normal QQ-plot of the sample data indicates insufficient evidence of deviation
from normality, though both the QQ-plot and the bootstrap sampling distribution of
the mean indicate weak left-skewness. Either the Wilcoxon or the t-test is appropriate.

par(mfrow=c(1,1))
library(car)
qqPlot(dat, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, dat")

bs.one.samp.dist(dat)

[Figure: normal QQ-plot of dat, plot of the data with smoothed density curve, and bootstrap sampling distribution of the mean. Data: n = 8, mean = 16, se = 2.30682]

The Wilcoxon p-value with continuity correction for testing H0 : µ = 10 against
a two-sided alternative is 0.058. This would not lead to rejecting H0 at the 5% level.


t.test(dat, mu=10)
##
## One Sample t-test
##
## data: dat
## t = 2.601, df = 7, p-value = 0.03537
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 10.54523 21.45477
## sample estimates:
## mean of x
## 16
# with continuity correction in the normal approximation for the p-value
wilcox.test(dat, mu=10, conf.int=TRUE)
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE): cannot compute exact
p-value with ties
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE): cannot compute exact
confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: dat
## V = 32, p-value = 0.0584
## alternative hypothesis: true location is not equal to 10
## 95 percent confidence interval:
## 9.500002 21.499942
## sample estimates:
## (pseudo)median
## 16.0056
# without continuity correction
wilcox.test(dat, mu=10, conf.int=TRUE, correct=FALSE)
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE, correct = FALSE): cannot
compute exact p-value with ties
## Warning in wilcox.test.default(dat, mu = 10, conf.int = TRUE, correct = FALSE): cannot
compute exact confidence interval with ties
##
## Wilcoxon signed rank test
##
## data: dat
## V = 32, p-value = 0.04967
## alternative hypothesis: true location is not equal to 10
## 95 percent confidence interval:
## 10.99996 21.00005
## sample estimates:
## (pseudo)median
## 16.0056


6.3.1 Nonparametric Analyses of Paired Data


Nonparametric methods for single samples can be used to analyze paired data because
the difference between responses within pairs is the unit of analysis.

Example: Sleep Remedies I will illustrate Wilcoxon methods on the paired com-
parison of two remedies A and B for insomnia. The number of hours of sleep gained
on each method was recorded.
#### Example: Sleep Remedies
# Data and numerical summaries
a <- c( 0.7, -1.6, -0.2, -1.2, 0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
b <- c( 1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.0)
d <- b - a;
sleep <- data.frame(a, b, d)
summary(sleep$d)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.200 1.000 1.250 1.520 1.675 4.600
shapiro.test(sleep$d)
##
## Shapiro-Wilk normality test
##
## data: sleep$d
## W = 0.83798, p-value = 0.04173
# boxplot
library(ggplot2)
p3 <- ggplot(sleep, aes(x = "d", y = d))
p3 <- p3 + geom_hline(yintercept=0, colour="#BB0000", linetype="dashed")
p3 <- p3 + geom_boxplot()
p3 <- p3 + geom_point()
p3 <- p3 + stat_summary(fun.y = mean, geom = "point", shape = 18,
size = 4, alpha = 0.3)
p3 <- p3 + coord_flip()
print(p3)

[Figure: boxplot of the differences d with the individual points, the mean, and a dashed reference line at 0]

The boxplot shows that the distribution of differences is reasonably symmetric but
not normal. Recall that the Shapiro-Wilk test of normality was significant at the 5%
level (p-value = 0.042). It is sensible to use the Wilcoxon procedure on the differences.


Let µB be the population mean sleep gain on remedy B, and µA be the population
mean sleep gain on remedy A. You are 95% confident that µB − µA is between 0.8
and 2.8 hours. Putting this another way, you are 95% confident that µB exceeds µA
by between 0.8 and 2.8 hours. The p-value for testing H0 : µB − µA = 0 against a
two-sided alternative is 0.008, which strongly suggests that µB ≠ µA. This agrees
with the CI. Note that the t-CI and test give qualitatively similar conclusions as the
Wilcoxon methods, but the t-test p-value is about half as large.

If you are uncomfortable with the symmetry assumption, you could use the sign
CI for the population median difference between B and A. I will note that a 95% CI
for the median difference goes from 0.86 to 2.2 hours.
t.test(sleep$d, mu=0)
##
## One Sample t-test
##
## data: sleep$d
## t = 3.7796, df = 9, p-value = 0.004352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.610249 2.429751
## sample estimates:
## mean of x
## 1.52
# with continuity correction in the normal approximation for the p-value
wilcox.test(sleep$d, mu=0, conf.int=TRUE)
## Warning in wilcox.test.default(sleep$d, mu = 0, conf.int = TRUE): cannot compute
exact p-value with ties
## Warning in wilcox.test.default(sleep$d, mu = 0, conf.int = TRUE): cannot compute
exact confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: sleep$d
## V = 54, p-value = 0.008004
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 0.7999339 2.7999620
## sample estimates:
## (pseudo)median
## 1.299983
# can use the paired= option
#wilcox.test(sleep$b, sleep$a, paired=TRUE, mu=0, conf.int=TRUE)
# if don't assume symmetry, can use sign test
#SIGN.test(sleep$d)


6.3.2 Comments on One-Sample Nonparametric Methods


For this discussion, I will assume that the underlying population distribution is (ap-
proximately) symmetric, which implies that population means and medians are equal
(approximately). For symmetric distributions the t, sign, and Wilcoxon procedures
are all appropriate.
If the underlying population distribution is extremely skewed, you can use the
sign procedure to get a CI for the population median. Alternatively, as illustrated
on HW 2, you can transform the data to a scale where the underlying distribution is
nearly normal, and then use the classical t-methods. Moderate degrees of skewness
will not likely have a big impact on the standard t-test and CI.
The one-sample t-test and CI are optimal when the underlying population fre-
quency curve is normal. Essentially this means that the t-CI is, on average, nar-
rowest among all CI procedures with given level, or that the t-test has the highest
power among all tests with a given size. The width of a CI provides a measure of
the sensitivity of the estimation method. For a given level CI, the narrower CI better
pinpoints the unknown population mean.
With heavy-tailed symmetric distributions, the t-test and CI tend to be conser-
vative. Thus, for example, a nominal 95% t-CI has actual coverage rates higher than
95%, and the nominal 5% t-test has an actual size smaller than 5%. The t-test and
CI possess a property that is commonly called robustness of validity. However,
data from heavy-tailed distributions can have a profound effect on the sensitivity of
the t-test and CI. Outliers can dramatically inflate the standard error of the mean,
causing the CI to be needlessly wide, and tests to have diminished power (outliers
typically inflate p-values for the t-test). The sign and Wilcoxon procedures down-
weight the influence of outliers by looking at sign or signed-ranks instead of the actual
data values. These two nonparametric methods are somewhat less efficient than the
t-methods when the population is normal (efficiency is about 0.64 and 0.96 for the
sign and Wilcoxon methods relative to the normal t-methods, where efficiency is the
ratio of sample sizes needed for equal power), but can be infinitely more efficient with
heavier than normal tailed distributions. In essence, the t-methods do not have a
robustness of sensitivity.
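A small simulation sketch (made-up settings: samples of size 25 from a t distribution
with 3 degrees of freedom, shifted by 0.5) illustrates the sensitivity point: for this
heavy-tailed population the Wilcoxon test typically rejects the false null more often
than the t-test.
# simulation sketch: power of t vs. Wilcoxon for a heavy-tailed population
set.seed(1)
R <- 2000; n <- 25; shift <- 0.5
p.t <- p.w <- rep(NA, R)
for (i in 1:R) {
  x <- rt(n, df = 3) + shift        # heavy-tailed sample, true center 0.5
  p.t[i] <- t.test(x, mu = 0)$p.value
  p.w[i] <- wilcox.test(x, mu = 0)$p.value
}
mean(p.t < 0.05)   # rejection rate (power) of the t-test
mean(p.w < 0.05)   # rejection rate of the Wilcoxon test, typically higher here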
Nonparametric methods have gained widespread acceptance in many scientific
disciplines, but not all. Scientists in some disciplines continue to use classical t-
methods because they believe that the methods are robust to non-normality. As
noted above, this is a robustness of validity, not sensitivity. This misconception is
unfortunate, and results in the routine use of methods that are less powerful than the
non-parametric techniques. Scientists need to be flexible and adapt their tools
to the problem at hand, rather than use the same tool indiscriminately! I
have run into suspicion that use of nonparametric methods was an attempt to “cheat”
in some way — properly applied, they are excellent tools that should be used.


A minor weakness of nonparametric methods is that they do not easily generalize


to complex modelling problems. A great deal of progress has been made in this area,
but most software packages have not included the more advanced techniques (R is
among the forerunners).

Nonparametric statistics used to refer almost exclusively to the set of methods


such as we have been discussing that provided analogs like tests and CIs to the
normal theory methods without requiring the assumption of sampling from normal
distributions. There is now a large area of statistics also called nonparametric meth-
ods not focused on these goals at all. In our department we (used to) have a course
titled “Nonparametric Curve Estimation & Image Reconstruction”, where the focus
is much more general than relaxing an assumption of normality. In that sense, what
we are covering in this course could be considered “classical” nonparametrics.

6.4 (Wilcoxon-)Mann-Whitney Two-Sample Procedure

The WMW procedure assumes you have independent random samples from the two
populations, and assumes that the populations have the same shapes and spreads
(the frequency curves for the two populations are “shifted” versions of each other
— see below). The frequency curves are not required to be symmetric. The WMW
procedures give a CI and tests on the difference η1 − η2 between the two population
medians. If the populations are symmetric, then the methods apply to µ1 − µ2 .


[Figure: two frequency curves with the same shape and spread, one a shifted version of the other]
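As a small sketch of the shift model with made-up data, two skewed samples with the
same shape and spread, the second shifted up by 2 units, can be compared with
wilcox.test(), which reports a CI for the shift:
# shift-model sketch with made-up skewed data
set.seed(2)
x1 <- rgamma(30, shape = 2)           # skewed sample 1
x2 <- rgamma(30, shape = 2) + 2       # same shape and spread, shifted up by 2
wilcox.test(x1, x2, conf.int = TRUE)  # CI for the shift eta1 - eta2, about -2 here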

The R help on ?wilcox.test gives references to how the exact WMW procedure
is actually calculated; here is a good approximation to the exact method that is
easier to understand. The WMW procedure is based on ranks. The two samples are
combined, ranked from smallest to largest (1=smallest) and separated back into the
original samples. If the two populations have equal medians, you expect the average
rank in the two samples to be roughly equal. The WMW test computes a classical
two sample t-test using the pooled variance on the ranks to assess whether the sample
mean ranks are significantly different.

Example: Comparison of Cooling Rates of Uwet and Walker Co. Meteorites
The Uwet1 (Cross River, Nigeria) and Walker2 County (Alabama, US) meteorite
cooling rate data are below. A primary interest is comparing the population
“typical” cooling rate measurements.
#### Example: Comparison of Cooling Rates of Uwet and Walker Co. Meteorites
Uwet <- c(0.21, 0.25, 0.16, 0.23, 0.47, 1.20, 0.29, 1.10, 0.16)
Walker <- c(0.69, 0.23, 0.10, 0.03, 0.56, 0.10, 0.01, 0.02, 0.04, 0.22)
The boxplots and normal QQ-plots show that the distributions are rather skewed
to the right. The AD tests of normality indicate that a normality assumption is
unreasonable for each population.

1
http://www.lpi.usra.edu/meteor/metbull.php?code=24138
2
http://www.lpi.usra.edu/meteor/metbull.php?code=24204


met <- data.frame(Uwet=c(Uwet,NA), Walker)


library(reshape2)
met.long <- melt(met, variable.name = "site", value.name = "cool", na.rm=TRUE)
## No id variables; using all as measure variables
# naming variables manually, the variable.name and value.name not working 11/2012
names(met.long) <- c("site", "cool")
library(ggplot2)
p <- ggplot(met.long, aes(x = site, y = cool, fill=site))
p <- p + geom_boxplot()
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
p <- p + coord_flip()
p <- p + labs(title = "Cooling rates for samples of meteorites at two locations")
p <- p + theme(legend.position="none")
print(p)

par(mfrow=c(1,2))
library(car)
qqPlot(Walker, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Walker")
qqPlot(Uwet, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Uwet")
[Figure: boxplots of cooling rates for samples of meteorites at the two locations (Uwet, Walker), and normal QQ-plots for Walker and Uwet]

I carried out the standard two-sample procedures to see what happens. The
pooled-variance and Satterthwaite results are comparable, which is expected be-
cause the sample standard deviations and sample sizes are roughly equal. Both tests
indicate that the mean cooling rates for Uwet and Walker Co. meteorites are not
significantly different at the 10% level. You are 95% confident that the mean cooling
rate for Uwet is at most 0.1 less, and no more than 0.6 greater than that for Walker
Co. (in degrees per million years).


# numerical summaries
summary(Uwet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.2100 0.2500 0.4522 0.4700 1.2000
c(sd(Uwet), IQR(Uwet), length(Uwet))
## [1] 0.4069944 0.2600000 9.0000000
summary(Walker)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0325 0.1000 0.2000 0.2275 0.6900
c(sd(Walker), IQR(Walker), length(Walker))
## [1] 0.2389793 0.1950000 10.0000000
t.test(Uwet, Walker, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Uwet and Walker
## t = 1.6689, df = 17, p-value = 0.1134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0666266 0.5710710
## sample estimates:
## mean of x mean of y
## 0.4522222 0.2000000
t.test(Uwet, Walker)
##
## Welch Two Sample t-test
##
## data: Uwet and Walker
## t = 1.6242, df = 12.652, p-value = 0.129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08420858 0.58865302
## sample estimates:
## mean of x mean of y
## 0.4522222 0.2000000
Given the marked skewness, a nonparametric procedure is more appropriate. The
Wilcoxon-Mann-Whitney comparison of population medians is reasonable. Why?
The WMW test of equal population medians is significant (barely) at the 5% level.
You are 95% confident that the median cooling rate for Uwet exceeds that for Walker by
between 0+ and 0.45 degrees per million years.
wilcox.test(Uwet, Walker, conf.int = TRUE)
## Warning in wilcox.test.default(Uwet, Walker, conf.int = TRUE): cannot compute exact
p-value with ties
## Warning in wilcox.test.default(Uwet, Walker, conf.int = TRUE): cannot compute exact


confidence intervals with ties


##
## Wilcoxon rank sum test with continuity correction
##
## data: Uwet and Walker
## W = 69.5, p-value = 0.04974
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 0.0000449737 0.4499654518
## sample estimates:
## difference in location
## 0.1702657
The difference between the WMW and t-test p-values and CI lengths (i.e. the
WMW CI is narrower and the p-value smaller) reflects the effect of the outliers on
the sensitivity of the standard tests and CI.
I conducted a pooled-variance two-sample t-test on ranks to show you that the
p-value is close to the WMW p-value, as expected.
rank(met.long$cool)
## [1] 9.0 13.0 7.5 11.5 15.0 19.0 14.0 18.0 7.5 17.0 11.5 5.5 3.0
## [14] 16.0 5.5 1.0 2.0 4.0 10.0
by(rank(met.long$cool), met.long$site, summary)
## met.long$site: Uwet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.50 9.00 13.00 12.72 15.00 19.00
## ----------------------------------------------------
## met.long$site: Walker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 7.55 11.12 17.00
# note: the CI for ranks is not interpretable
t.test(rank(met.long$cool) ~ met.long$site, var.equal = TRUE)
##
## Two Sample t-test
##
## data: rank(met.long$cool) by met.long$site
## t = 2.2082, df = 17, p-value = 0.04125
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2304938 10.1139507
## sample estimates:
## mean in group Uwet mean in group Walker
## 12.72222 7.55000

Example: Newcombe’s Data Experiments of historical importance were performed
beginning in the eighteenth century to determine physical constants, such


as the mean density of the earth, the distance from the earth to the sun, and the
velocity of light. An interesting series of experiments to determine the velocity of
light was begun in 1875. The first method used, and reused with refinements several
times thereafter, was the rotating mirror method3 . In this method a beam of light
is reflected off a rapidly rotating mirror to a fixed mirror at a carefully measured
distance from the source. The returning light is re-reflected from the rotating mirror
at a different angle, because the mirror has turned slightly during the passage of the
corresponding light pulses. From the speed of rotation of the mirror and from careful
measurements of the angular difference between the outward-bound and returning
light beams, the passage time of light can be calculated for the given distance. After
averaging several calculations and applying various corrections, the experimenter can
combine mean passage time and distance for a determination of the velocity of light.
Simon Newcombe, a distinguished American scientist, used this method during the
year 1882 to generate the passage time measurements given below, in microseconds.
The travel path for this experiment was 3721 meters in length, extending from Ft.
Meyer, on the west bank of the Potomac River in Washington, D.C., to a fixed mirror
at the base of the Washington Monument.

The problem is to determine a 95% CI for the “true” passage time, which is taken
to be the typical time (mean or median) of the population of measurements that were
or could have been taken by this experiment.
#### Example: Newcombe's Data
time <- c(24.828, 24.833, 24.834, 24.826, 24.824, 24.756
, 24.827, 24.840, 24.829, 24.816, 24.798, 24.822
, 24.824, 24.825, 24.823, 24.821, 24.830, 24.829
, 24.831, 24.824, 24.836, 24.819, 24.820, 24.832
, 24.836, 24.825, 24.828, 24.828, 24.821, 24.829
, 24.837, 24.828, 24.830, 24.825, 24.826, 24.832
, 24.836, 24.830, 24.836, 24.826, 24.822, 24.823
, 24.827, 24.828, 24.831, 24.827, 24.827, 24.827
, 24.826, 24.826, 24.832, 24.833, 24.832, 24.824
, 24.839, 24.824, 24.832, 24.828, 24.825, 24.825
, 24.829, 24.828, 24.816, 24.827, 24.829, 24.823)

3
http://en.wikipedia.org/wiki/File:Speed_of_light_(foucault).PNG
library(nortest)
ad.test(time)
##
## Anderson-Darling normality test
##
## data: time
## A = 5.8843, p-value = 1.217e-14

# Histogram overlaid with kernel density curve


Passage_df <- data.frame(time)
p1 <- ggplot(Passage_df, aes(x = time))
# Histogram with density instead of count on y-axis
p1 <- p1 + geom_histogram(aes(y=..density..), binwidth=0.001)
p1 <- p1 + geom_density(alpha=0.1, fill="white")
p1 <- p1 + geom_rug()

# violin plot
p2 <- ggplot(Passage_df, aes(x = "t", y = time))
p2 <- p2 + geom_violin(fill = "gray50")
p2 <- p2 + geom_boxplot(width = 0.2, alpha = 3/4)
p2 <- p2 + coord_flip()

# boxplot
p3 <- ggplot(Passage_df, aes(x = "t", y = time))
p3 <- p3 + geom_boxplot()
p3 <- p3 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1)


[Figure: histogram with density curve, violin plot, and boxplot of the passage times]

par(mfrow=c(1,1))
library(car)
qqPlot(time, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot, Time")

bs.one.samp.dist(time)

[Figure: normal QQ-plot of time, plot of the data with smoothed density curve, and bootstrap sampling distribution of the mean. Data: n = 66, mean = 24.826, se = 0.00132266]

The data set is skewed to the left, due to the presence of two extreme outliers
that could potentially be misrecorded observations. Without additional information I
would be hesitant to apply normal theory methods (the t-test), even though the sam-
ple size is “large” (bootstrap sampling distribution is still left-skewed). Furthermore,
the t-test still suffers from a lack of robustness of sensitivity, even in large samples. A


formal normality test (and the QQ-plot) rejects, at the 0.01 level, the normality assumption
needed for the standard methods.
The table below gives 95% t, sign, and Wilcoxon CIs. I am more comfortable with
the sign CI for the population median than the Wilcoxon method, which assumes
symmetry.
t.sum <- t.test(time)
t.sum$conf.int
## [1] 24.82357 24.82885
## attr(,"conf.level")
## [1] 0.95
diff(t.test(time)$conf.int)
## [1] 0.005283061
s.sum <- SIGN.test(time)
s.sum$conf.int
## [1] 24.82600 24.82849
## attr(,"conf.level")
## [1] 0.95
diff(s.sum$conf.int)
## [1] 0.00249297
w.sum <- wilcox.test(time, conf.int=TRUE)
w.sum$conf.int
## [1] 24.82604 24.82853
## attr(,"conf.level")
## [1] 0.95
diff(w.sum$conf.int)
## [1] 0.002487969
parameter   Method     CI Limits            Width
mean        t          (24.8236, 24.8289)   0.0053
median      sign       (24.8260, 24.8285)   0.0025
median      Wilcoxon   (24.8260, 24.8285)   0.0025
Note the big difference between the nonparametric and the t-CI. The nonparametric
CIs are about 1/2 as wide as the t-CI. This reflects the impact that outliers have on
the standard deviation, which directly influences the CI width.
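A quick sketch shows how much the two low outliers drive the t-CI width: set aside
the two smallest passage times and recompute.
# effect of the two low outliers on the width of the t-CI
time.trim <- sort(time)[-(1:2)]   # set aside the two smallest values (24.756, 24.798)
diff(t.test(time)$conf.int)       # full data: about 0.0053
diff(t.test(time.trim)$conf.int)  # outliers set aside: noticeably narrower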

6.5 Alternatives for ANOVA and Planned Comparisons
The classical ANOVA assumes that the populations have normal frequency curves
and the populations have equal variances (or spreads). You learned formal tests for
these assumptions in Chapter 5. When the assumptions do not hold, you can try
one of the following two approaches. Before describing alternative methods, I will


note that deviations from normality in one or more samples might be expected in
a comparison involving many samples. You should downplay small deviations from
normality in problems involving many samples.

6.5.1 Kruskal-Wallis ANOVA


The Kruskal-Wallis (KW) test is a non-parametric method for testing the hypoth-
esis of equal population medians against the alternative that not all population me-
dians are equal. The procedure assumes you have independent random samples from
populations with frequency curves having identical shapes and spreads. The KW
ANOVA is essentially the standard ANOVA based on ranked data. That is, we com-
bine the samples, rank the observations from smallest to largest, and then return the
ranks to the original samples and do the standard ANOVA using the ranks. The KW
ANOVA is a multiple sample analog of the Wilcoxon-Mann-Whitney two sample pro-
cedure. Hence, multiple comparisons for a KW analysis, be they FSD or Bonferroni
comparisons, are based on the two sample WMW procedure.

6.5.2 Transforming Data


The distributions in many data sets are skewed to the right with outliers. If the
sample spreads, say s and IQR, increase with an increasing mean or median, you
can often transform data to a scale where the normality and the constant spread
assumption are more nearly satisfied. The transformed data are analyzed using the
standard ANOVA. The two most commonly used transforms for this problem are the
square root and natural logarithm, provided the data are non-negative4 .
If the original distributions are nearly symmetric, but heavy-tailed, non-linear
transformations will tend to destroy the symmetry. Many statisticians recommend
methods based on trimmed means for such data. These methods are not commonly
used by other researchers.
4
The aim behind the choice of a variance-stabilizing transformation is to find a simple function
f to apply to values y in a data set to create new values y′ = f(y) such that the variability of the
values y′ is not related to their mean value. For example, suppose that the values y are realizations
from a Poisson distribution. Because for the Poisson distribution the variance is identical to the
mean, the variance varies with the mean. However, if the simple variance-stabilizing transformation
y′ = √y is applied, the sampling variance will be independent of the mean. A few distributional
examples are provided in the table below.

Distribution   Variance = g(mean)   Transformation y′ = f(y)
Poisson        σ² = µ               y′ = √y
binomial       σ² = µ(1 − µ)        y′ = arcsin(√y)
lognormal      σ² = µ²              y′ = log(y)
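A short simulation sketch of the footnote's point (arbitrary Poisson means): on the raw
scale the standard deviation grows with the mean, while after a square-root
transformation it is much more nearly constant.
# variance-stabilizing sketch for Poisson counts
set.seed(3)
means <- c(2, 10, 50)
sapply(means, function(m) sd(rpois(1e4, m)))        # sd grows with the mean, about sqrt(mean)
sapply(means, function(m) sd(sqrt(rpois(1e4, m))))  # much more nearly constant, near 0.5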


Example: Hydrocarbon (HC) Emissions Data These data are the HC emis-
sions at idling speed, in ppm, for automobiles of different years of manufacture. The
data are a random sample of all automobiles tested at an Albuquerque shopping
center. (It looks like we need to find some newer cars!)
#### Example: Hydrocarbon (HC) Emissions Data
emis <- read.table(text="
Pre-y63 y63-7 y68-9 y70-1 y72-4
2351 620 1088 141 140
1293 940 388 359 160
541 350 111 247 20
1058 700 558 940 20
411 1150 294 882 223
570 2000 211 494 60
800 823 460 306 20
630 1058 470 200 95
905 423 353 100 360
347 900 71 300 70
NA 405 241 223 220
NA 780 2999 190 400
NA 270 199 140 217
NA NA 188 880 58
NA NA 353 200 235
NA NA 117 223 1880
NA NA NA 188 200
NA NA NA 435 175
NA NA NA 940 85
NA NA NA 241 NA
", header=TRUE)
#emis

# convert to long format


emis.long <- melt(emis,
variable.name = "year",
value.name = "hc",
na.rm = TRUE
)
## No id variables; using all as measure variables
# naming variables manually, the variable.name and value.name not working 11/2012
names(emis.long) <- c("year", "hc")
# summary of each year
by(emis.long$hc, emis.long$year, summary)
## emis.long$year: Pre.y63
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 347.0 548.2 715.0 890.6 1019.8 2351.0
## ----------------------------------------------------
## emis.long$year: y63.7
## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 270.0 423.0 780.0 801.5 940.0 2000.0


## ----------------------------------------------------
## emis.long$year: y68.9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.0 196.2 323.5 506.3 462.5 2999.0
## ----------------------------------------------------
## emis.long$year: y70.1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100.0 197.5 244.0 381.4 449.8 940.0
## ----------------------------------------------------
## emis.long$year: y72.4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 65.0 160.0 244.1 221.5 1880.0
# IQR and sd of each year
by(emis.long$hc, emis.long$year, function(X) { c(IQR(X), sd(X), length(X)) })
## emis.long$year: Pre.y63
## [1] 471.5000 591.5673 10.0000
## ----------------------------------------------------
## emis.long$year: y63.7
## [1] 517.0000 454.9285 13.0000
## ----------------------------------------------------
## emis.long$year: y68.9
## [1] 266.2500 707.8026 16.0000
## ----------------------------------------------------
## emis.long$year: y70.1
## [1] 252.2500 287.8864 20.0000
## ----------------------------------------------------
## emis.long$year: y72.4
## [1] 156.5000 410.7866 19.0000
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(emis.long, aes(x = year, y = hc))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(emis.long$hc),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Albuquerque automobile hydrocarbon emissions data") + ylab("hc (ppm)")
# to reverse order that years print, so oldest is first on top


p <- p + scale_x_discrete(limits = rev(levels(emis.long$year)) )


p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: “Albuquerque automobile hydrocarbon emissions data”: boxplots of hc (ppm) by year group]

The standard ANOVA shows significant differences among the mean HC emis-
sions. However, the standard ANOVA is inappropriate because the distributions are
extremely skewed to the right due to the presence of outliers in each sample.
fit.e <- aov(hc ~ year, data = emis.long)
summary(fit.e)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 4226834 1056709 4.343 0.00331 **
## Residuals 73 17759968 243287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.e
## Call:
## aov(formula = hc ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 4226834 17759968
## Deg. of Freedom 4 73
##
## Residual standard error: 493.2416
## Estimated effects may be unbalanced


The boxplots show that the typical HC emissions appear to decrease as the age
of car increases (the simplest description). Although the spread in the samples, as
measured by the IQR, also decreases as age increases, I am more comfortable with the
KW ANOVA, in part because the KW analysis is not too sensitive to differences in
spreads among samples. This point is elaborated upon later. As described earlier, the
KW ANOVA is essentially an ANOVA based on the ranks. I give below the ANOVA
based on ranks and the output from the KW procedure. They give similar p-values,
and lead to the conclusion that there are significant differences among the population
median HC emissions. A simple description is that the population median emission
tends to decrease with the age of the car. You should follow up this analysis with
Mann-Whitney multiple comparisons.
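One convenient way to do that follow-up (a sketch; Section 6.5.3 carries out planned
comparisons with wilcox.test() directly) is pairwise.wilcox.test() with a Bonferroni
adjustment; with ties present it warns and uses the normal approximation.
# all pairwise WMW comparisons of hc by year, Bonferroni-adjusted p-values
pairwise.wilcox.test(emis.long$hc, emis.long$year, p.adjust.method = "bonferroni")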
# ANOVA of rank, for illustration that this is similar to what KW is doing
fit.er <- aov(rank(hc) ~ year, data = emis.long)
summary(fit.er)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 16329 4082 12.85 5.74e-08 ***
## Residuals 73 23200 318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.er
## Call:
## aov(formula = rank(hc) ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 16329.32 23199.68
## Deg. of Freedom 4 73
##
## Residual standard error: 17.82705
## Estimated effects may be unbalanced
# KW ANOVA
fit.ek <- kruskal.test(hc ~ year, data = emis.long)
fit.ek
##
## Kruskal-Wallis rank sum test
##
## data: hc by year
## Kruskal-Wallis chi-squared = 31.808, df = 4, p-value =
## 2.093e-06

It is common to transform the data to a log scale when the spread increases as
the median or mean increases.
# log scale
emis.long$loghc <- log(emis.long$hc)
# summary of each year


by(emis.long$loghc, emis.long$year, summary)


## emis.long$year: Pre.y63
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.849 6.306 6.565 6.634 6.925 7.763
## ----------------------------------------------------
## emis.long$year: y63.7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.598 6.047 6.659 6.548 6.846 7.601
## ----------------------------------------------------
## emis.long$year: y68.9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.263 5.279 5.775 5.755 6.137 8.006
## ----------------------------------------------------
## emis.long$year: y70.1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.605 5.285 5.497 5.711 6.107 6.846
## ----------------------------------------------------
## emis.long$year: y72.4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.996 4.171 5.075 4.838 5.400 7.539
# IQR and sd of each year
by(emis.long$loghc, emis.long$year, function(X) { c(IQR(X), sd(X), length(X)) })
## emis.long$year: Pre.y63
## [1] 0.6186119 0.5702081 10.0000000
## ----------------------------------------------------
## emis.long$year: y63.7
## [1] 0.7985077 0.5524878 13.0000000
## ----------------------------------------------------
## emis.long$year: y68.9
## [1] 0.8575139 0.9061709 16.0000000
## ----------------------------------------------------
## emis.long$year: y70.1
## [1] 0.8216494 0.6775933 20.0000000
## ----------------------------------------------------
## emis.long$year: y72.4
## [1] 1.228980 1.138882 19.000000
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(emis.long, aes(x = year, y = loghc))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(emis.long$loghc),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group


p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,


colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Albuquerque automobile hydrocarbon emissions data (log scale)")
p <- p + ylab("log(hc) (log(ppm))")
# to reverse order that years print, so oldest is first on top
p <- p + scale_x_discrete(limits = rev(levels(emis.long$year)) )
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: “Albuquerque automobile hydrocarbon emissions data (log scale)”: boxplots of log(hc) by year group]

After transformation, the samples have roughly the same spread (IQR and s) and
shape. The transformation does not completely eliminate the outliers. However, I
am more comfortable with a standard ANOVA on this scale than with the original
data. A difficulty here is that the ANOVA is comparing population mean log HC
emission (so interpretations are on the log ppm scale, instead of the natural ppm
scale). Summaries for the ANOVA on the log hydrocarbon emissions levels are given
below.


# standard ANOVA of the log HC emissions


fit.le <- aov(loghc ~ year, data = emis.long)
summary(fit.le)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 31.90 7.974 11.42 2.98e-07 ***
## Residuals 73 50.98 0.698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.le
## Call:
## aov(formula = loghc ~ year, data = emis.long)
##
## Terms:
## year Residuals
## Sum of Squares 31.89510 50.97679
## Deg. of Freedom 4 73
##
## Residual standard error: 0.8356508
## Estimated effects may be unbalanced
# KW ANOVA -- same conclusions as original scale, since based on ranks
fit.lek <- kruskal.test(loghc ~ year, data = emis.long)
fit.lek
##
## Kruskal-Wallis rank sum test
##
## data: loghc by year
## Kruskal-Wallis chi-squared = 31.808, df = 4, p-value =
## 2.093e-06

The boxplot of the log-transformed data reinforces the reasonableness of the orig-
inal KW analysis. Why? The log-transformed distributions have fairly similar shapes
and spreads, so a KW analysis on these data is sensible. The ranks for the original
and log-transformed data are identical, so the KW analyses on the log-transformed
data and the original data must lead to the same conclusions. This suggests that the
KW ANOVA is not overly sensitive to differences in spreads among the samples.
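A one-line check of that claim: the ranks are unchanged by the (monotone) log
transformation.
all(rank(emis.long$hc) == rank(emis.long$loghc))   # TRUE: ranks are identical on both scales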
There are two reasonable analyses here: the standard ANOVA using log HC emis-
sions, and the KW analysis of the original data. The first analysis gives a comparison
of mean log HC emissions. The second involves a comparison of median HC emis-
sions. A statistician would present both analyses to the scientist who collected the
data to make a decision on which was more meaningful (independently of the re-
sults5 !). Multiple comparisons would be performed relative to the selected analysis
(t-tests for ANOVA or WMW-tests for KW ANOVA).

5
It is unethical to choose a method based on the results it gives.


Example: Hodgkin’s Disease Study Plasma bradykininogen levels were measured
in normal subjects, in patients with active Hodgkin’s disease, and in patients
with inactive Hodgkin’s disease. The globulin bradykininogen is the precursor sub-
stance for bradykinin, which is thought to be a chemical mediator of inflammation.
The data (in micrograms of bradykininogen per milliliter of plasma) are displayed
below. The three samples are denoted by nc for normal controls, ahd for active
Hodgkin’s disease patients, and ihd for inactive Hodgkin’s disease patients.
The medical investigators wanted to know if the three samples differed in their
bradykininogen levels. Carry out the statistical analysis you consider to be most
appropriate, and state your conclusions to this question.
Read in the data, look at summaries on the original scale, and create a plot. Also,
look at summaries on the log scale and create a plot.
#### Example: Hodgkin's Disease Study
hd <- read.table(text="
nc ahd ihd
5.37 3.96 5.37
5.80 3.04 10.60
4.70 5.28 5.02
5.70 3.40 14.30
3.40 4.10 9.90
8.60 3.61 4.27
7.48 6.16 5.75
5.77 3.22 5.03
7.15 7.48 5.74
6.49 3.87 7.85
4.09 4.27 6.82
5.94 4.05 7.90
6.38 2.40 8.36
9.24 5.81 5.72
5.66 4.29 6.00
4.53 2.77 4.75
6.51 4.40 5.83
7.00 NA 7.30
6.20 NA 7.52
7.04 NA 5.32
4.82 NA 6.05
6.73 NA 5.68
5.26 NA 7.57
NA NA 5.68
NA NA 8.91
NA NA 5.39
NA NA 4.40
NA NA 7.13
", header=TRUE)
#hd


# convert to long format


hd.long <- melt(hd,
variable.name = "patient",
value.name = "level",
na.rm = TRUE
)
## No id variables; using all as measure variables
# naming variables manually, the variable.name and value.name not working 11/2012
names(hd.long) <- c("patient", "level")
# summary of each patient
by(hd.long$level, hd.long$patient, summary)
## hd.long$patient: nc
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.400 5.315 5.940 6.081 6.865 9.240
## ----------------------------------------------------
## hd.long$patient: ahd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.400 3.400 4.050 4.242 4.400 7.480
## ----------------------------------------------------
## hd.long$patient: ihd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.270 5.385 5.915 6.791 7.640 14.300
# IQR and sd of each patient
by(hd.long$level, hd.long$patient, function(X) { c(IQR(X), sd(X), length(X)) })
## hd.long$patient: nc
## [1] 1.550000 1.362104 23.000000
## ----------------------------------------------------
## hd.long$patient: ahd
## [1] 1.000000 1.302878 17.000000
## ----------------------------------------------------
## hd.long$patient: ihd
## [1] 2.25500 2.17647 28.00000
# log scale
hd.long$loglevel <- log(hd.long$level)
# summary of each patient
by(hd.long$loglevel, hd.long$patient, summary)
## hd.long$patient: nc
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.224 1.670 1.782 1.780 1.926 2.224
## ----------------------------------------------------
## hd.long$patient: ahd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8755 1.2238 1.3987 1.4039 1.4816 2.0122
## ----------------------------------------------------
## hd.long$patient: ihd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.452 1.684 1.777 1.875 2.033 2.660
# IQR and sd of each patient
by(hd.long$loglevel, hd.long$patient, function(X) { c(IQR(X), sd(X), length(X)) })
## hd.long$patient: nc


## [1] 0.2557632 0.2303249 23.0000000


## ----------------------------------------------------
## hd.long$patient: ahd
## [1] 0.2578291 0.2920705 17.0000000
## ----------------------------------------------------
## hd.long$patient: ihd
## [1] 0.3496572 0.2802656 28.0000000

# Plot the data using ggplot


library(ggplot2)
p <- ggplot(hd.long, aes(x = patient, y = level))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(hd.long$level),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Plasma bradykininogen levels for three patient groups")
p <- p + ylab("level (mg/ml)")
# to reverse order that years print, so oldest is first on top
p <- p + scale_x_discrete(limits = rev(levels(hd.long$patient)) )
p <- p + ylim(c(0,max(hd.long$level)))
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)

## log scale
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(hd.long, aes(x = patient, y = loglevel))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(hd.long$loglevel),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)


p <- p + labs(title = "Plasma bradykininogen levels for three patient groups (log scale)")
p <- p + ylab("log(level) (log(mg/ml))")
# to reverse order that years print, so oldest is first on top
p <- p + scale_x_discrete(limits = rev(levels(hd.long$patient)) )
p <- p + ylim(c(0,max(hd.long$loglevel)))
p <- p + coord_flip()
p <- p + theme(legend.position="none")
print(p)
[Figure: “Plasma bradykininogen levels for three patient groups”, original and log scales: boxplots of level by patient group (nc, ahd, ihd)]

Although the spread (IQR, s) in the ihd sample is somewhat greater than the
spread in the other samples, the presence of skewness and outliers in the boxplots is a
greater concern regarding the use of the classical ANOVA. The shapes and spreads in
the three samples are roughly identical, so a Kruskal-Wallis nonparametric ANOVA
appears ideal. As a sidelight, I transformed plasma levels to a log scale to reduce
the skewness and eliminate the outliers. The boxplots of the transformed data show
reasonable symmetry across groups, but outliers are still present. I will stick with
the Kruskal-Wallis ANOVA (although it would not be much of a problem to use the
classical ANOVA on transformed data).
Let ηnc = population median plasma level for normal controls, ηahd = popula-
tion median plasma level for active Hodgkin’s disease patients, and ηihd = popula-
tion median plasma level for inactive Hodgkin’s disease patients. The KW test of
H0 : ηnc = ηahd = ηihd versus HA : not H0 is highly significant (p-value= 0.00003),
suggesting differences among the population median plasma levels. The Kruskal-
Wallis ANOVA summary is given below.
# KW ANOVA
fit.h <- kruskal.test(level ~ patient, data = hd.long)
fit.h


##
## Kruskal-Wallis rank sum test
##
## data: level by patient
## Kruskal-Wallis chi-squared = 20.566, df = 2, p-value =
## 3.421e-05
I followed up the KW ANOVA with Bonferroni comparisons of the samples, using
the Mann-Whitney two sample procedure. There are three comparisons, so an overall
FER of 0.05 is achieved by doing the individual tests at the 0.05/3=0.0167 level.
Alternatively, you can use 98.33% CI for differences in population medians.
# with continuity correction in the normal approximation for the p-value
wilcox.test(hd$nc , hd$ahd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$nc, hd$ahd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties
## Warning in wilcox.test.default(hd$nc, hd$ahd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: hd$nc and hd$ahd
## W = 329, p-value = 0.0002735
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## 0.8599458 2.9000789
## sample estimates:
## difference in location
## 1.910067
wilcox.test(hd$nc , hd$ihd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$nc, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties
## Warning in wilcox.test.default(hd$nc, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: hd$nc and hd$ihd
## W = 276.5, p-value = 0.3943
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## -1.5600478 0.6800262
## sample estimates:
## difference in location
## -0.3413932
wilcox.test(hd$ahd, hd$ihd, conf.int=TRUE, conf.level = 0.9833)
## Warning in wilcox.test.default(hd$ahd, hd$ihd, conf.int = TRUE, conf.level = 0.9833):
cannot compute exact p-value with ties


## Warning in wilcox.test.default(hd$ahd, hd$ihd, conf.int = TRUE, conf.level = 0.9833):


cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: hd$ahd and hd$ihd
## W = 56, p-value = 2.143e-05
## alternative hypothesis: true location shift is not equal to 0
## 98.33 percent confidence interval:
## -3.500059 -1.319957
## sample estimates:
## difference in location
## -2.146666
The only comparison with a p-value greater than 0.0167 involved the nc and ihd
samples. The comparison leads to two groups, and is consistent with what we see in
the boxplots.
ahd nc ihd
--- --------
You have sufficient evidence to conclude that the plasma bradykininogen levels for
active Hodgkin’s disease patients (ahd) is lower than the population median levels for
normal controls (nc) and for patients with inactive Hodgkin’s disease (ihd). You do
not have sufficient evidence to conclude that the population median levels for normal
controls (nc) and for patients with inactive Hodgkin’s disease (ihd) are different. The
CIs give an indication of size of differences in the population medians.

6.5.3 Planned Comparisons


Bonferroni multiple comparisons are generally preferred to Fisher’s least significant
difference approach. Fisher’s method does not control the familywise error rate and
produces too many spurious significant differences (claims of significant differences
that are due solely to chance variation and not to actual differences in population
means). However, Bonferroni’s method is usually very conservative when a large
number of comparisons is performed — large differences in sample means are needed
to claim significance. A way to reduce this conservatism is to avoid doing all possible
comparisons. Instead, one should, when possible, decide a priori (before looking at
the data) which comparisons are of primary interest, and then perform only those
comparisons.
For example, suppose a medical study compares five new treatments with a control
(a six group problem). The medical investigator may not be interested in all 15
possible comparisons, but only in which of the five treatments differ on average from
the control. Rather than performing the 15 comparisons, each at the say 0.05/15 =
0.0033 level, she could examine the five comparisons of interest at the 0.05/5 = 0.01
level. By deciding beforehand which comparisons are of interest, she can justify using


a 0.01 level for the comparisons, instead of the more conservative 0.0033 level needed
when doing all possible comparisons.
To illustrate this idea, consider the KW analysis of HC emissions. We saw that
there are significant differences among the population median HC emissions. Given
that the samples have a natural ordering
Sample Year of manufacture
1 Pre-1963
2 63 – 67
3 68 – 69
4 70 – 71
5 72 – 74
you may primarily be interested in whether the population medians for cars manufac-
tured in consecutive samples are identical. That is, you may be primarily interested
in the following 4 comparisons:
Pre-1963 vs 63 – 67
63 – 67 vs 68 – 69
68 – 69 vs 70 – 71
70 – 71 vs 72 – 74
A Bonferroni analysis would carry out each comparison at the 0.05/4 = 0.0125 level
versus the 0.05/10 = 0.005 level when all comparisons are done.
The following output was obtained for doing these four comparisons, based on
Wilcoxon-Mann-Whitney two-sample tests (why?6 ). Two-year groups are claimed to
be different if the p-value is 0.0125 or below, or equivalently, if a 98.75% CI for the
difference in population medians does not contain zero.
#### Planned Comparisons
# with continuity correction in the normal approximation for the p-value
wilcox.test(emis$y63.7, emis$Pre.y63, conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y63.7, emis$Pre.y63, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y63.7, emis$Pre.y63, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y63.7 and emis$Pre.y63
## W = 61.5, p-value = 0.8524
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:

6
The ANOVA is the multi-sample analog to the two-sample t-test for the mean, and the KW
ANOVA is the multi-sample analog to the WMW two-sample test for the median. Thus, we follow
up a KW ANOVA with WMW two-sample tests at the chosen multiple comparison adjusted error
rate.


## -530.0001 428.0000
## sample estimates:
## difference in location
## -15.4763
wilcox.test(emis$y68.9, emis$y63.7 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y68.9, emis$y63.7, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y68.9, emis$y63.7, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y68.9 and emis$y63.7
## W = 43, p-value = 0.007968
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -708.99999 -51.99998
## sample estimates:
## difference in location
## -397.4227
wilcox.test(emis$y70.1, emis$y68.9 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y70.1, emis$y68.9, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y70.1, emis$y68.9, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y70.1 and emis$y68.9
## W = 156, p-value = 0.9112
## alternative hypothesis: true location shift is not equal to 0
## 98.75 percent confidence interval:
## -206.0001 171.0000
## sample estimates:
## difference in location
## -10.99997
wilcox.test(emis$y72.4, emis$y70.1 , conf.int=TRUE, conf.level = 0.9875)
## Warning in wilcox.test.default(emis$y72.4, emis$y70.1, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(emis$y72.4, emis$y70.1, conf.int = TRUE, : cannot
compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: emis$y72.4 and emis$y70.1
## W = 92.5, p-value = 0.006384
## alternative hypothesis: true location shift is not equal to 0


## 98.75 percent confidence interval:


## -285.999962 -6.000058
## sample estimates:
## difference in location
## -130
There are significant differences between the 1963-67 and 1968-69 samples, and
between the 1970-71 and 1972-74 samples. You are 98.75% confident that the popula-
tion median HC emissions for 1963-67 year cars is between 52 and 708.8 ppm greater
than the population median for 1968-69 cars. Similarly, you are 98.75% confident
that the population median HC emissions for 1970-71 year cars is between 6.1 and
285.9 ppm greater than the population median for 1972-74 cars. Overall, you are
95% confident among the four pairwise comparisons that you have not declared a
difference significant when it isn’t.
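Equivalently, a Bonferroni adjustment can be applied to the four reported p-values
with p.adjust(); comparing the adjusted values to 0.05 matches comparing the raw
p-values to 0.0125.
p.raw <- c(0.8524, 0.007968, 0.9112, 0.006384)   # the four WMW p-values above
p.adjust(p.raw, method = "bonferroni")           # approximately 1.000 0.032 1.000 0.026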

6.5.4 Two final ANOVA comments


It is not uncommon for researchers to combine data from groups not found to be
significantly different. This is not, in general, a good practice. Just because you do
not have sufficient evidence to show differences does not imply that you should treat
the groups as if they are the same!
If the data distributions do not substantially deviate from normality, but the
spreads are different across samples, you might consider the standard ANOVA fol-
lowed with multiple comparisons using two-sample tests based on Satterthwaite’s
approximation.
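For instance, a minimal sketch of that follow-up (assuming the emis.long data frame with hc and
year columns from the emissions example in this chapter) uses Welch-type pairwise t-tests with a
Bonferroni adjustment:
# Sketch: pairwise Welch (Satterthwaite) t-tests, Bonferroni-adjusted
pairwise.t.test(emis.long$hc, emis.long$year,
                pool.sd = FALSE,                  # separate variances per group
                p.adjust.method = "bonferroni")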

6.6 Permutation tests


Permutation tests7 are a subset of non-parametric statistics. The basic premise is
to use only the assumption that all of the treatment groups could be equivalent, so that
every member is interchangeable before sampling begins (i.e., the position in the group
to which an observation belongs is not distinguishable from the other positions before
the positions are filled). From this, one can calculate a statistic and then see to what
extent this statistic is special by seeing how likely such a value would be if the group
assignments had been shuffled.
A permutation test (also called a randomization test, re-randomization test, or
an exact test) is a type of statistical significance test in which the distribution of the
test statistic under the null hypothesis is obtained by calculating all possible values
of the test statistic under rearrangements of the labels on the observed data
points. In other words, the method by which treatments are allocated to subjects in
7
http://en.wikipedia.org/wiki/Resampling_(statistics)


an experimental design is mirrored in the analysis of that design. If the labels are
exchangeable under the null hypothesis, then the resulting tests yield exact signifi-
cance levels. Confidence intervals can then be derived from the tests. The theory has
evolved from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.
Let’s illustrate the basic idea of a permutation test using the Meteorites example.
Suppose we have two groups Uwet and Walker whose sample means are ȲU and
ȲW , and that we want to test, at 5% significance level, whether they come from the
same distribution. Let nU = 9 and nW = 10 be the sample size corresponding to
each group. The permutation test is designed to determine whether the observed
difference between the sample means is large enough to reject the null hypothesis
H0 : µU = µW , that the two groups have identical means.
The test proceeds as follows. First, the difference in means between the two
samples is calculated: this is the observed value of the test statistic, T(obs) . Then the
observations of groups Uwet and Walker are pooled.
#### Permutation tests
# Calculated the observed difference in means
# met.long includes both Uwet and Walker groups
Tobs <- mean(met.long[(met.long$site == "Uwet" ), 2]) -
mean(met.long[(met.long$site == "Walker"), 2])
Tobs
## [1] 0.2522222
Next, the difference in sample means is calculated and recorded for every possible
way of dividing these pooled values into two groups of size nU = 9 and nW = 10
(i.e., for every permutation of the group labels Uwet and Walker). The set of these
calculated differences is the exact distribution of possible differences under the null
hypothesis that group label does not matter. This exact distribution can be approx-
imated by drawing a large number of random permutations.
# Plan:
# Initialize a vector in which to store the R number of difference of means.
# Calculate R differences in means for R permutations, storing the results.
# Note that there are prod(1:19) = 10^17 total permutations,
# but the R repetitions will serve as a good approximation.
# Plot the permutation null distribution with an indication of the Tobs.

# R = a large number of repetitions


R <- 1e4
# initialize the vector of difference of means from the permutations
Tperm <- rep(NA, R)
# For each of R repetitions, permute the Uwet and Walker labels,
# calculate the difference of means with the permuted labels,
# and store the result in the i.R'th position of Tperm.
for (i.R in 1:R) {
# permutation of 19 = 9+10 integers 1, 2, ..., 19
ind.perm <- sample.int(nrow(met.long))
# identify as "TRUE" numbers 1, ..., 9 (the number of Uwet labels)

lab.U <- (ind.perm <= sum(met.long$site == "Uwet"))


# identify as "TRUE" numbers 10, ..., 19 (the number of Walker labels)
# that is, all the non-Uwet labels
lab.W <- !lab.U

# calculate the difference in means and store in Tperm at index i.R


Tperm[i.R] <- mean(met.long[lab.U, 2]) - mean(met.long[lab.W, 2])
}

# Plot the permutation null distribution with an indication of the Tobs.


dat <- data.frame(Tperm)

library(ggplot2)
p <- ggplot(dat, aes(x = Tperm))
#p <- p + scale_x_continuous(limits=c(-20,+20))
p <- p + geom_histogram(aes(y=..density..), binwidth=0.01)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at Tobs
p <- p + geom_vline(aes(xintercept=Tobs), colour="#BB0000", linetype="dashed")
p <- p + labs(title = "Permutation distribution of difference in means, Uwet and Walker Meteorites")
p <- p + xlab("difference in means (red line = observed difference in means)")
print(p)


[Figure: histogram and density estimate of the permutation null distribution of the difference in means, Uwet and Walker Meteorites; x-axis: difference in means (red dashed line = observed difference in means).]

Notice the contrast between this permutation distribution of the difference in means and a normal distribution.
The one-sided p-value of the test is calculated as the proportion of sampled permu-
tations where the difference in means was at least as extreme as T(obs) . The two-sided
p-value of the test is calculated as the proportion of sampled permutations where the
absolute difference was at least as extreme as |T(obs) |.
# Calculate a two-sided p-value.
p.upper <- sum((Tperm >= abs(Tobs))) / R
p.upper
## [1] 0.0608
p.lower <- sum((Tperm <= -abs(Tobs))) / R
p.lower
## [1] 0.0569
p.twosided <- p.lower + p.upper
p.twosided
## [1] 0.1177

Note that the two-sided p-value of 0.1177 is consistent, in this case, with the two-sample t-test
p-values of 0.1134 (pooled) and 0.1290 (Satterthwaite), but different from 0.0497 (WMW). The
permutation test is a comparison of means without the normality assumption, though it requires
that the observations are exchangeable between populations under H0.
If the only purpose of the test is to reject or not reject the null hypothesis, we can, as an
alternative, sort the recorded differences and then observe whether T(obs) is contained within the
middle 95% of them. If it is not, we reject the hypothesis of equal means at the 5% significance
level.
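A minimal sketch of that alternative, using the Tperm and Tobs objects computed above:
# Sketch: reject H0 at the 5% level if Tobs falls outside the middle 95%
# of the permutation distribution
perm.lims <- quantile(Tperm, c(0.025, 0.975))
perm.lims
(Tobs < perm.lims[1]) | (Tobs > perm.lims[2])
# FALSE here (do not reject), consistent with the two-sided p-value above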

6.6.1 Linear model permutation tests in R


The coin package provides an implementation of a general framework for condi-
tional inference procedures commonly known as permutation tests. In the help
on ?"coin-package" search for location to find tests for the means or medians
of populations (such as oneway_test()). Other packages of note include perm and
exactRankTests (lmPerm is defunct).
Below I calculate the standard t-test for the Meteorite data using t.test() and lm(), then compare
those results with oneway_test() and with our manual permutation test above.
# standard two-sample t-test with equal variances
t.summary <- t.test(cool ~ site, data = met.long, var.equal = TRUE)
t.summary
##
## Two Sample t-test
##
## data: cool by site
## t = 1.6689, df = 17, p-value = 0.1134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0666266 0.5710710
## sample estimates:
## mean in group Uwet mean in group Walker
## 0.4522222 0.2000000
# linear model form of t-test, "siteWalker" has estimate, se, t-stat, and p-value
lm.summary <- lm(cool ~ site, data = met.long)
summary(lm.summary)
##
## Call:
## lm(formula = cool ~ site, data = met.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2922 -0.1961 -0.1600 0.0250 0.7478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.4522 0.1096 4.125 0.000708 ***
## siteWalker -0.2522 0.1511 -1.669 0.113438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3289 on 17 degrees of freedom
## Multiple R-squared: 0.1408,Adjusted R-squared: 0.09024
## F-statistic: 2.785 on 1 and 17 DF, p-value: 0.1134
# permutation test version
library(coin)
## Loading required package: survival
# Fisher-Pitman permutation test
oneway.summary <- oneway_test(cool ~ site, data = met.long)
oneway.summary
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: cool by site (Uwet, Walker)
## Z = 1.5919, p-value = 0.1114
## alternative hypothesis: true mu is not equal to 0
# examples of extracting values from coin's S4 class objects
coin::expectation(oneway.summary)
## Uwet
## 2.875263
coin::covariance(oneway.summary)
## Uwet
## Uwet 0.5632881
coin::pvalue(oneway.summary)
## [1] 0.1114144
confint(oneway.summary)
## Error: $ operator not defined for this S4 class
The permutation test gives a p-value of 0.1114, which is close to our manually calculated
permutation p-value of 0.1177.
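As a side note, the same test can be run in a Monte Carlo (resampling) fashion rather than with
the asymptotic approximation; a minimal sketch (the exact interface may differ slightly across
coin versions):
# Sketch: Monte Carlo permutation version of the Fisher-Pitman test
oneway.mc <- oneway_test(cool ~ site, data = met.long, distribution = "approximate")
coin::pvalue(oneway.mc)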
For the emissions data, below we compare the ANOVA results (assuming normality) with a
permutation test that makes no distributional assumptions.
fit.e <- aov(hc ~ year, data = emis.long)
summary(fit.e)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 4 4226834 1056709 4.343 0.00331 **
## Residuals 73 17759968 243287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(coin)
# Fisher-Pitman permutation test


oneway.summary <- oneway_test(hc ~ year, data = emis.long)


oneway.summary
##
## Asymptotic K-Sample Fisher-Pitman Permutation Test
##
## data: hc by
## year (Pre.y63, y63.7, y68.9, y70.1, y72.4)
## chi-squared = 14.803, df = 4, p-value = 0.005128
Thus the permutation test of the ANOVA hypothesis on means rejects the null
hypothesis of all equal means. A followup set of pairwise tests can be done by looping
over pairs of factors.
First we list the factor levels ordered by their medians; this ordering is helpful at the end,
when the results of the pairwise comparisons are given.
# these are the levels of the factor, ordered by their medians
fac.lev <- levels(reorder(levels(emis.long$year)
, -as.numeric(by(emis.long$loghc, emis.long$year, median)))
)
fac.lev
## [1] "y63.7" "Pre.y63" "y68.9" "y70.1" "y72.4"
Create a matrix to store pairwise comparison p-values, then loop over all pairs of
groups and perform a two-sample permutation test. Store the p-value for each test
in the matrix.
# create a matrix to store pairwise comparison p-values
mc.pval <- matrix(NA
, nrow = length(fac.lev)
, ncol = length(fac.lev)
, dimnames = list(fac.lev, fac.lev))
# diag is always 1, no group differs from itself
diag(mc.pval) <- 1
mc.pval
## y63.7 Pre.y63 y68.9 y70.1 y72.4
## y63.7 1 NA NA NA NA
## Pre.y63 NA 1 NA NA NA
## y68.9 NA NA 1 NA NA
## y70.1 NA NA NA 1 NA
## y72.4 NA NA NA NA 1
# loop over all pairs of factor levels, perform two-sample test,
# and store p-value in matrix
for (i1 in 1:(length(fac.lev) - 1)) {
for (i2 in (i1 + 1):length(fac.lev)) {
## DEBUG - to make sure the indexing is working, you can print them:
# print(cat(i1, i2))

library(coin)
# Fisher-Pitman permutation test

oneway.summary <- oneway_test(hc ~ year, data = subset(emis.long, (year == fac.lev[i1] | year == fac.lev[i2])))

# put p-value in matrix


mc.pval[i1, i2] <- coin::pvalue(oneway.summary)
mc.pval[i2, i1] <- mc.pval[i1, i2]
}
}

# p-values
mc.pval
## y63.7 Pre.y63 y68.9 y70.1 y72.4
## y63.7 1.000000000 0.676572596 0.1993877 0.004273746 0.002185513
## Pre.y63 0.676572596 1.000000000 0.1611790 0.005319987 0.003379156
## y68.9 0.199387725 0.161179041 1.0000000 0.468455149 0.177187250
## y70.1 0.004273746 0.005319987 0.4684551 1.000000000 0.227517382
## y72.4 0.002185513 0.003379156 0.1771873 0.227517382 1.000000000
Summarize the results of the pairwise comparisons. Groups with a common letter
are not statistically different.
# summary of pairwise comparisons
# threshold is Bonferroni-corrected alpha=0.05 / 10
library(multcompView)
multcompLetters( mc.pval
, compare = "<"
, threshold = 0.05 / choose(length(fac.lev), 2)
, Letters = letters
, reversed = FALSE)
## y63.7 Pre.y63 y68.9 y70.1 y72.4
## "a" "ab" "abc" "bc" "c"

6.7 Density estimation


Density estimation is like a histogram: It is a method for visualizing the shape of a
univariate distribution (there are methods for doing multivariate density estimation
as well, but we will ignore those for the time being). In fact, I snuck in density
estimation in the first chapter and have been using it all along! Let’s experiment
with Newcombe’s speed-of-light data (excluding the two outliers).
Consider the shape of the histogram for different numbers of bins.
#### Density estimation
# include time ranks 3 and above, that is, remove the lowest two values
time2 <- time[(rank(time) >= 3)]

old.par <- par(no.readonly = TRUE)


# make smaller margins
par(mfrow=c(5,1), mar=c(3,2,2,1), oma=c(1,1,1,1))


hist(time2, breaks=1 , main="1 break" , xlim=c(24.80,24.84), xlab=""); rug(time2)


hist(time2, main="default" , xlim=c(24.80,24.84), xlab=""); rug(time2)
hist(time2, breaks=10 , main="10 breaks" , xlim=c(24.80,24.84), xlab=""); rug(time2)
hist(time2, breaks=20 , main="20 breaks" , xlim=c(24.80,24.84), xlab=""); rug(time2)
hist(time2, breaks=100 , main="100 breaks", xlim=c(24.80,24.84), xlab=""); rug(time2)

# restore par() settings


par(old.par)

[Figure: histograms of time2 using 1 break, the default number of breaks, 10, 20, and 100 breaks; x-axis from 24.80 to 24.84, each panel with a rug of the observations.]

Notice that we are starting to see more and more bins that include only a single
observation (or multiple observations at the precision of measurement). Taken to its
extreme, this type of exercise gives in some sense a “perfect” fit to the data but is
useless as an estimator of shape.
On the other hand, it is obvious that a single bin would also be completely use-
less. So we try in some sense to find a middle ground between these two extremes:
"oversmoothing" by using only one bin and "undersmoothing" by using too many.
This same paradigm occurs for density estimation, in which the amount of smoothing is determined
by a quantity called the bandwidth. By default, R uses an optimal (in some sense) choice of
bandwidth.

We’ve already used the density() function to provide a smooth curve to our
histograms. So far, we’ve taken the default “bandwidth”. Let’s see what happens
when we use different bandwidths.
par(mfrow=c(3,1))

# prob=TRUE scales the y-axis like a density function, total area = 1


hist(time2, prob=TRUE, main="")
# apply a density function, store the result
den = density(time2)
# plot density line over histogram
lines(den, col=2, lty=2, lwd=2)
# extract the bandwidth (bw) from the density line
b = round(den$bw, 4)
title(main=paste("Default =", b), col.main=2)

# undersmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.0004), col=3, lwd=2)
text(17.5, .35, "", col=3, cex=1.4)
title(main=paste("Undersmooth, BW = 0.0004"), col.main=3)

# oversmooth
hist(time2, prob=TRUE, main="")
lines(density(time2, bw=0.008), col=4, lwd=2)
title(main=paste("Oversmooth, BW = 0.008"), col.main=4)


[Figure: histograms of time2 with density estimates overlaid using the default bandwidth (0.0018), an undersmoothed bandwidth (0.0004), and an oversmoothed bandwidth (0.008).]

The other determining factor is the kernel, which is the shape each individual
point takes before all the shapes are added up for a final density line. While the
choice of bandwidth is very important, the choice of kernel is not. Choosing a kernel
with hard edges (such as "rect") will result in jagged artifacts, so smoother kernels
are often preferred.
par(mfrow=c(1,1))

hist(time2, prob=TRUE, main="")

# default kernel is Gaussian ("Normal")


lines(density(time2) , col=2, lty=1, lwd=2)
lines(density(time2, ker="epan"), col=3, lty=1, lwd=2)
lines(density(time2, ker="rect"), col=4, lty=1, lwd=2)
title(main="Gaussian, Epanechnikov, Rectangular")

# other kernels include: "triangular", "biweight", "cosine", "optcosine"


[Figure: histogram of time2 with Gaussian, Epanechnikov, and Rectangular kernel density estimates overlaid.]



Chapter 7

Categorical Data Analysis

Contents
7.1 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . 228
7.2 Single Proportion Problems . . . . . . . . . . . . . . . . . 231
7.2.1 A CI for p . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.2.2 Hypothesis Tests on Proportions . . . . . . . . . . . . . . . 233
7.2.3 The p-value for a two-sided test . . . . . . . . . . . . . . . . 235
7.2.4 Appropriateness of Test . . . . . . . . . . . . . . . . . . . . 236
7.2.5 R Implementation . . . . . . . . . . . . . . . . . . . . . . . 237
7.2.6 One-Sided Tests and One-Sided Confidence Bounds . . . . 237
7.2.7 Small Sample Procedures . . . . . . . . . . . . . . . . . . . 239
7.3 Analyzing Raw Data . . . . . . . . . . . . . . . . . . . . . 241
7.4 Goodness-of-Fit Tests (Multinomial) . . . . . . . . . . . . 244
7.4.1 Adequacy of the Goodness-of-Fit Test . . . . . . . . . . . . 246
7.4.2 R Implementation . . . . . . . . . . . . . . . . . . . . . . . 246
7.4.3 Multiple Comparisons in a Goodness-of-Fit Problem . . . . 249
7.5 Comparing Two Proportions: Independent Samples . . 251
7.5.1 Large Sample CI and Tests for p1 − p2 . . . . . . . . . . . . 251
7.6 Effect Measures in Two-by-Two Tables . . . . . . . . . . 258
7.7 Analysis of Paired Samples: Dependent Proportions . . 260
7.8 Testing for Homogeneity of Proportions . . . . . . . . . . 263
7.8.1 Adequacy of the Chi-Square Approximation . . . . . . . . . 268

7.9 Testing for Homogeneity in Cross-Sectional and Stratified Studies . . . 268
7.9.1 Testing for Independence in a Two-Way Contingency Table 270
7.9.2 Further Analyses in Two-Way Tables . . . . . . . . . . . . . 271

Learning objectives
After completing this topic, you should be able to:
select the appropriate statistical method to compare summaries from categorical
variables.
assess the assumptions of one-way and two-way tests of proportions and indepen-
dence.
decide whether the proportions between populations are different, including in
stratified and cross-sectional studies.
recommend action based on a hypothesis test.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.

7.1 Categorical data


When the response variable is categorical, the interesting questions are often about the
probability of one possible outcome versus another, and whether these probabilities
depend on other variables (continuous or categorical).

Example: Titanic The sinking of the Titanic is a famous event, and new books
are still being published about it. Many well-known facts — from the proportions of
first-class passengers to the "women and children first" policy, and the fact that the policy
was not entirely successful in saving the women and children in the third class — are
reflected in the survival rates for various classes of passenger. The source provides a
data set recording class, sex, age, and survival status for each person on board the
Titanic, and is based on data originally collected by the British Board of Trade1.
1
British Board of Trade (1990), Report on the Loss of the “Titanic” (S.S.). British Board of
Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing. Note that there is not
complete agreement among primary sources as to the exact numbers on board, rescued, or lost.


# The Titanic dataset is a 4-dimensional table: Class, Sex, Age, Survived


library(datasets)
data(Titanic)

Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
# reshape into long data.frame
library(reshape2)
df.titanic <- melt(Titanic, value.name = "Freq")
df.titanic
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35

## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
## 16 Crew Female Adult No 3
## 17 1st Male Child Yes 5
## 18 2nd Male Child Yes 11
## 19 3rd Male Child Yes 13
## 20 Crew Male Child Yes 0
## 21 1st Female Child Yes 1
## 22 2nd Female Child Yes 13
## 23 3rd Female Child Yes 14
## 24 Crew Female Child Yes 0
## 25 1st Male Adult Yes 57
## 26 2nd Male Adult Yes 14
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
# Total number of people
sum(df.titanic$Freq)
## [1] 2201
# create colors based on survival
df.titanic$Color <- ifelse(df.titanic$Survived == "Yes", "#008888", "#330066")

# subset only the adults (since there were so few children)


df.titanic.adult <- subset(df.titanic, Age == "Adult")

# see R code on website for function parallelset()


# see help for with(), it allows temporary direct reference to columns in a data.frame
# otherwise, we'd need to specify df.titanic.adult$Survived, ...
with(df.titanic.adult
, parallelset(Survived, Sex, Class, freq = Freq, col = Color, alpha=0.2)
)


[Figure: parallel sets plot of adult Titanic passengers linking Survived (No, Yes), Sex (Male, Female), and Class (1st, 2nd, 3rd, Crew), colored by survival.]

There are many questions that can be asked of this dataset. How likely were people
to survive such a ship sinking in cold water? Is the survival proportion dependent on
sex, class, or age, or a combination of these? How different are the survival proportions
for 1st class females versus 3rd class males?
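As a quick numerical look at that last question, a minimal sketch using the df.titanic.adult data
frame created above:
# Sketch: survival proportions among adults by Class and Sex
tab.adult <- xtabs(Freq ~ Class + Sex + Survived, data = df.titanic.adult)
round(prop.table(tab.adult, margin = c(1, 2)), 3)
From the frequencies listed above, roughly 97% of 1st class adult females survived (140 of 144),
compared to about 16% of 3rd class adult males (75 of 462).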

7.2 Single Proportion Problems

Assume that you are interested in estimating the proportion p of individuals in a population with
a certain characteristic or attribute based on a random or representative sample of size n from
the population. The sample proportion p̂ = (# with attribute in the sample)/n is the best guess
for p based on the data.
This is the simplest categorical data problem. Each response falls into one of
two exclusive and exhaustive categories, called “success” and “failure”. Individuals
with the attribute of interest are in the success category. The rest fall into the failure
category. Knowledge of the population proportion p of successes characterizes the
distribution across both categories because the population proportion of failures is
1 − p.
As an aside, note that the probability that a randomly selected individual has
the attribute of interest is the population proportion p with the attribute, so the
terms population proportion and probability can be used interchangeably with ran-
dom sampling.


7.2.1 A CI for p
A two-sided CI for p is a range of plausible values for the unknown population pro-
portion p, based on the observed data. To compute a two-sided CI for p:
1. Specify the confidence level as the percent 100(1 − α)% and solve for the error
rate α of the CI.
2. Compute zcrit = z0.5α (i.e., the areas under the standard normal curve to the left and
to the right of zcrit are 1 − 0.5α and 0.5α, respectively); in R, qnorm(1-0.05/2) = 1.96.
3. The 100(1 − α)% CI for p has endpoints L = p̂ − zcrit SE and U = p̂ + zcrit SE,
respectively, where the "CI standard error" is

    SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.

The CI is often written as p̂ ± zcrit SE.

Reminder of CI interpretation. The CI is determined once the confidence level
is specified and the data are collected. Prior to collecting data, the CI is unknown
and can be viewed as random because it will depend on the actual sample selected.
Different samples give different CIs. The “confidence” in, say, the 95% CI (which
has a 0.05 or 5% error rate) can be interpreted as follows. If you repeatedly sample
the population and construct 95% CIs for p, then 95% of the intervals will contain p,
whereas 5% (the error rate) will not. The CI you get from your data either covers p,
or it does not.
The length of the CI
U − L = 2zcrit SE
depends on the accuracy of the estimate p̂, as measured by the standard error SE.
For a given p̂, this standard error decreases as the sample size n increases, yielding
a narrower CI. For a fixed sample size, this standard error is maximized at p̂ = 0.5,
and decreases as p̂ moves towards either 0 or 1. In essence, sample proportions near
0 or 1 give narrower CIs for p. However, the normal approximation used in the CI
construction is less reliable for extreme values of p̂.

ClickerQ s — CI for proportions STT.08.01.010

Example: Tamper resistant packaging The 1983 Tylenol poisoning episode highlighted the
desirability of using tamper-resistant packaging. The article "Tamper
Resistant Packaging: Is it Really?” (Packaging Engineering, June 1983) reported the
results of a survey on consumer attitudes towards tamper-resistant packaging. A
sample of 270 consumers was asked the question: “Would you be willing to pay extra


for tamper resistant packaging?” The number of yes respondents was 189. Construct
a 95% CI for the proportion p of all consumers who were willing in 1983 to pay extra
for such packaging.
Here n = 270 and p̂ = 189/270 = 0.700. The critical value for a 95% CI for p is
z0.025 = 1.96. The CI standard error is given by
    SE = \sqrt{\frac{0.7 \times 0.3}{270}} = 0.028,

so zcrit SE = 1.96 × 0.028 = 0.055. The 95% CI for p is 0.700 ± 0.055. You are 95%
confident that the proportion of consumers willing to pay extra for better packaging
is between 0.645 and 0.755. (Willing to pay how much extra?)
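A minimal R sketch of this hand calculation (the same numbers as above):
# Sketch: 95% CI for p by hand, packaging survey
p.hat <- 189/270
se.ci <- sqrt(p.hat * (1 - p.hat) / 270)
p.hat + c(-1, 1) * qnorm(1 - 0.05/2) * se.ci   # approximately (0.645, 0.755)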

Appropriateness of the CI
The standard CI is based on a large-sample standard normal approximation to

    z = \frac{\hat{p} - p}{SE}.

A simple rule of thumb requires np̂ ≥ 5 and n(1− p̂) ≥ 5 for the method to be suitable.
Given that np̂ and n(1 − p̂) are the observed numbers of successes and failures, you
should have at least 5 of each to apply the large-sample CI.
In the packaging example, np̂ = 270 × (0.700) = 189 (the number who support
the new packaging) and n(1 − p̂) = 270 × (0.300) = 81 (the number who oppose) both
exceed 5. The normal approximation is appropriate here.

7.2.2 Hypothesis Tests on Proportions


The following example is typical of questions that can be answered using a hypothesis
test for a population proportion.

Example Environmental problems associated with leaded gasolines are well-known. Many motorists
have tampered with the emission control devices on their cars to save
money by purchasing leaded rather than unleaded gasoline. A Los Angeles Times
article on March 17, 1984 reported that 15% of all California motorists have engaged
in emissions tampering. A random sample of 200 cars from L.A. county was obtained,
and the emissions devices on 21 are found to be tampered with. Does this suggest
that the proportion of cars in L.A. county with tampered devices differs from the
statewide proportion?


Two-Sided Hypothesis Test for p


Suppose you are interested in whether the population proportion p is equal to a
prespecified value, say p0 . This question can be formulated as a two-sided test. To
carry out the test:
1. Define the null hypothesis H0 : p = p0 and alternative hypothesis HA : p ≠ p0.
2. Choose the size or significance level of the test, denoted by α.
3. Using the standard normal probability table, find the critical value zcrit such
that the areas under the normal curve to the left and right of zcrit are 1 − 0.5α
and 0.5α, respectively. That is, zcrit = z0.5α .
4. Compute the test statistic (often to be labelled zobs )

    z_s = \frac{\hat{p} - p_0}{SE},

where the "test standard error" (based on the hypothesized value) is

    SE = \sqrt{\frac{p_0(1 - p_0)}{n}}.
5. Reject H0 in favor of HA if |zobs | ≥ zcrit . Otherwise, do not reject H0 .
The rejection rule is easily understood visually. The area under the normal curve
outside ±zcrit is the size α of the test. One-half of α is the area in each tail. You
reject H0 in favor of HA if the test statistic falls outside ±zcrit. This occurs when p̂ is
significantly different from p0, as measured by the standardized distance zobs between
p̂ and p0.
[Figure: standard normal (Z) distribution showing, on the left, the fixed two-sided size α = 0.05 critical region beyond ±zcrit and, on the right, the two-sided p-value as the total (random, not fixed) area beyond ±zs.]


ClickerQ s — Test statistic STT.07.01.057

7.2.3 The p-value for a two-sided test

To compute the p-value (not to be confused with the value of the proportion p) for a
two-sided test:
1. Compute the test statistic zs = zobs .
2. Evaluate the area under the normal probability curve outside ±|zs |.
Recall that the null hypothesis for a size α test is rejected if and only if the p-value
is less than or equal to α.

Example: Emissions data Each car in the target population (L.A. county) either
has been tampered with (a success) or has not been tampered with (a failure). Let
p = the proportion of cars in L.A. county with tampered emissions control devices.
You want to test H0 : p = 0.15 against HA : p ≠ 0.15 (here p0 = 0.15). The critical
value for a two-sided test of size α = 0.05 is zcrit = 1.96.
The data are a sample of n = 200 cars. The sample proportion of cars that have
been tampered with is p̂ = 21/200 = 0.105. The test statistic is

    z_s = \frac{0.105 - 0.15}{0.02525} = -1.78,

where the test standard error satisfies

    SE = \sqrt{\frac{0.15 \times 0.85}{200}} = 0.02525.

Given that |zs | = 1.78 < 1.96, you have insufficient evidence to reject H0 at the 5%
level. That is, you have insufficient evidence to conclude that the proportion of cars
in L.A. county that have been tampered with differs from the statewide proportion.
This decision is reinforced by the p-value calculation. The p-value is the area
under the standard normal curve outside ±1.78. This is 2 × 0.0375 = 0.075, which
exceeds the test size of 0.05.
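A minimal R sketch of this hand calculation:
# Sketch: test statistic and two-sided p-value, emissions example
p.hat <- 21/200
p0 <- 0.15
se.test <- sqrt(p0 * (1 - p0) / 200)
z.s <- (p.hat - p0) / se.test
z.s                    # about -1.78
2 * pnorm(-abs(z.s))   # about 0.075
Note that z.s^2 is about 3.18, which matches the X-squared statistic reported by prop.test() in
the R implementation below.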


[Figure: standard normal curve for the emissions data; the two-sided p-value is the total area 0.0375 + 0.0375 = 0.075 beyond ±1.78.]

Remark The SE used in the test and CI are different. This implies that a hypothesis
test and CI could potentially lead to different decisions. That is, a 95% CI for a
population proportion might cover p0 when the p-value for testing H0 : p = p0 is
smaller than 0.05. This will happen, typically, only in cases where the decision is
“borderline.”

7.2.4 Appropriateness of Test


The z-test is based on a large-sample normal approximation, which works better for
a given sample size when p0 is closer to 0.5. The sample size needed for an accurate
approximation increases dramatically the closer p0 gets to 0 or to 1. A simple rule
of thumb is that the test is appropriate when (the expected number of successes)
np0 ≥ 5 and (the expected number of failures) n(1 − p0 ) ≥ 5.
In the emissions example, np0 = 200 × (0.15) = 30 and n(1 − p0 ) = 200 × (0.85) =
170 exceed 5, so the normal approximation is appropriate.


7.2.5 R Implementation
#### Single Proportion Problems
# Approximate normal test for proportion, without Yates' continuity correction
prop.test(21, 200, p = 0.15, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 21 out of 200, null probability 0.15
## X-squared = 3.1765, df = 1, p-value = 0.07471
## alternative hypothesis: true p is not equal to 0.15
## 95 percent confidence interval:
## 0.06970749 0.15518032
## sample estimates:
## p
## 0.105
# Approximate normal test for proportion, with Yates' continuity correction
#prop.test(21, 200, p = 0.15)

ClickerQ s — Parachute null hypothesis STT.08.01.040

ClickerQ s — Parachute conclusion STT.08.01.050

ClickerQ s — Parachute p-value STT.08.01.060

7.2.6 One-Sided Tests and One-Sided Confidence Bounds


The mechanics of tests on proportions are similar to tests on means, except we use
a different test statistic and a different probability distribution for critical values.
This applies to one-sided and two-sided procedures. The example below illustrates a
one-sided test and bound.

Example: brain hemispheres An article in the April 6, 1983 edition of The Los Angeles Times
reported on a study of 53 learning-impaired youngsters at the
Massachusetts General Hospital. The right side of the brain was found to be larger
than the left side in 22 of the children. The proportion of the general population
with brains having larger right sides is known to be 0.25. Does the data provide
strong evidence for concluding, as the article claims, that the proportion of learning
impaired youngsters with brains having larger right sides exceeds the proportion in
the general population?


I will answer this question by computing a p-value for a one-sided test. Let p
be the population proportion of learning disabled children with brains having larger
right sides. I am interested in testing H0 : p = 0.25 against HA : p > 0.25 (here
p0 = 0.25).
The proportion of children sampled with brains having larger right sides is p̂ =
22/53 = 0.415. The test statistic is

    z_s = \frac{0.415 - 0.25}{0.0595} = 2.78,

where the test standard error satisfies

    SE = \sqrt{\frac{0.25 \times 0.75}{53}} = 0.0595.
The p-value for an upper one-sided test is the area under the standard normal curve
to the right of 2.78, which is approximately .003; see the picture below. I would
reject H0 in favor of HA using any of the standard test levels, say 0.05 or 0.01. The
newspaper’s claim is reasonable.

[Figure: standard normal curve for the right-brain example; the upper one-sided p-value is the area 0.003 in the right tail beyond zs = 2.78.]


A sensible next step in the analysis would be to compute a lower confidence bound p̂ − zcrit SE
for p. For illustration, consider a 95% bound. The CI standard error is

    SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.415 \times 0.585}{53}} = 0.0677.
The critical value for a one-sided 5% test is zcrit = 1.645, so a lower 95% bound on p
is 0.415 − 1.645 × 0.0677 = 0.304. I am 95% confident that the population proportion
of learning disabled children with brains having larger right sides is at least 0.304.
Values of p smaller than 0.304 are not supported by the data.
You should verify that the sample size is sufficiently large to use the approximate
methods in this example.
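A minimal R sketch of the rule-of-thumb check and the hand-calculated lower bound:
# Sketch: sample-size check and lower 95% confidence bound by hand
p.hat <- 22/53
c(53 * p.hat, 53 * (1 - p.hat))   # both at least 5
se.ci <- sqrt(p.hat * (1 - p.hat) / 53)
p.hat - qnorm(0.95) * se.ci       # about 0.304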
#### Example: brain hemispheres
# Approximate normal test for proportion, without Yates' continuity correction
prop.test(22, 53, p = 0.25, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 22 out of 53, null probability 0.25
## X-squared = 7.7044, df = 1, p-value = 0.002754
## alternative hypothesis: true p is greater than 0.25
## 95 percent confidence interval:
## 0.3105487 1.0000000
## sample estimates:
## p
## 0.4150943

7.2.7 Small Sample Procedures


Large sample tests and CIs for p should be interpreted with caution in small sized
samples because the true error rate usually exceeds the assumed (nominal) value. For
example, an assumed 95% CI, with a nominal error rate of 5%, may be only an 80%
CI, with a 20% error rate. The large-sample CIs are usually overly optimistic (i.e.,
too narrow) when the sample size is too small to use the normal approximation.
Alan Agresti suggests the following method for a 95% CI. The standard method
computes the sample proportion as p̂ = x/n where x is the number of successes in the
sample and n is the sample size. Agresti suggested using the estimated proportion
p̃ = (x + 2)/(n + 4) with the standard error
    SE = \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}},
in the “usual 95% interval” formula: p̃ ± 1.96SE. This appears odd, but amounts to
adding two successes and two failures to the observed data, and then computing the standard CI.
This adjustment has little effect when n is large and p̂ is not near either 0 or 1,
as in the Tylenol example.
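A minimal sketch of Agresti's adjustment for the packaging example (using the same counts as the
standard CI earlier):
# Sketch: Agresti's adjusted 95% CI, packaging example
x <- 189; n <- 270
p.tilde <- (x + 2) / (n + 4)
se.tilde <- sqrt(p.tilde * (1 - p.tilde) / (n + 4))
p.tilde + c(-1, 1) * 1.96 * se.tilde
# roughly (0.64, 0.75), nearly the same as the standard interval (0.645, 0.755)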

Example: swimming segregation This example is based on a case heard before the U.S. Supreme
Court. A racially segregated swimming club was ordered to admit
minority members. However, it is unclear whether the club has been following the
spirit of the mandate. Historically, 85% of the white applicants were approved. Since
the mandate, only 1 of 6 minority applicants has been approved. Is there evidence of
continued discrimination?
I examine this issue by constructing a CI and a test for the probability p (or
population proportion) that a minority applicant is approved. Before examining the
question of primary interest, let me show that the two approximate CIs are very
different, due to the small sample size. One minority applicant (x = 1) was approved
out of n = 6 candidates, giving p̂ = 1/6 = 0.167.
A 95% large-sample CI for p is (−0.14, 0.46). Since a negative proportion is
not possible, the CI should be reported as (0.00, 0.46). Agresti’s 95% CI (based
on 3 successes and 7 failures) is (0.02, 0.58). The big difference between the two
intervals coupled with the negative lower endpoint on the standard CI suggests that
the normal approximation used with the standard method is inappropriate. This
view is reinforced by the rule-of-thumb calculation for using the standard interval.
Agresti’s CI is wider, which is consistent with my comment that the standard CI is
too narrow in small samples. As a comparison, the exact 95% CI is (0.004, 0.64),
which agrees more closely with Agresti’s interval.
I should emphasize that the exact CI is best to use, but is not available in all
statistical packages, so methods based on approximations may be required, and if so,
then Agresti’s method is clearly better than the standard normal approximation in
small sized samples.
Recall that the results of the asymptotic 95% CIs may disagree with the hypothesis
test results. Exact methods will not contradict each other this way (neither do these
asymptotic methods, usually).
#### Example: swimming segregation
## The prop.test() does an additional adjustment, so does not match precisely
## the results in the above paragraphs

# Approximate normal test for proportion, without Yates' continuity correction


prop.test(1, 6, p = 0.85, correct = FALSE)$conf.int
## Warning in prop.test(1, 6, p = 0.85, correct = FALSE): Chi-squared approximation
may be incorrect
## [1] 0.03005337 0.56350282
## attr(,"conf.level")
## [1] 0.95


# Agresti's method
prop.test(1+2, 6+4, p = 0.85, correct = FALSE)$conf.int
## Warning in prop.test(1 + 2, 6 + 4, p = 0.85, correct = FALSE): Chi-squared approximation
may be incorrect
## [1] 0.1077913 0.6032219
## attr(,"conf.level")
## [1] 0.95
# Exact binomial test for proportion
binom.test(1, 6, p = 0.85)$conf.int
## [1] 0.004210745 0.641234579
## attr(,"conf.level")
## [1] 0.95
Returning to the problem, you might check for discrimination by testing H0 : p =
0.85 against HA : p < 0.85 using an exact test. The exact test p-value is 0.000 to
three decimal places, and an exact upper bound for p is 0.582. What does this suggest
to you?
# Exact binomial test for proportion
binom.test(1, 6, alternative = "less", p = 0.85)
##
## Exact binomial test
##
## data: 1 and 6
## number of successes = 1, number of trials = 6, p-value =
## 0.0003987
## alternative hypothesis: true probability of success is less than 0.85
## 95 percent confidence interval:
## 0.0000000 0.5818034
## sample estimates:
## probability of success
## 0.1666667

7.3 Analyzing Raw Data


In most studies, your data will be stored in a spreadsheet with one observation per
case or individual. For example, the data below give the individual responses to the
applicants of the swim club.
#### Example: swimming segregation, raw data
## read.table options
# sep = default is any white space, but our strings contain a space,
# so I changed this to a comma
# header = there are no column headers
# stringsAsFactors = default converts strings to factors, but I want them
# to just be the plain character text
swim <- read.table(text="


not approved
not approved
not approved
approved
not approved
not approved
", sep = ",", header=FALSE, stringsAsFactors=FALSE)

# name the column


names(swim) <- c("application")
# show the structure of the data.frame
str(swim)
## 'data.frame': 6 obs. of 1 variable:
## $ application: chr "not approved" "not approved" "not approved" "approved" ...
# display the data.frame
swim
## application
## 1 not approved
## 2 not approved
## 3 not approved
## 4 approved
## 5 not approved
## 6 not approved
The data were entered as alphabetic strings. We can use table() to count fre-
quencies of categorical variables.
# count the frequency of each categorical variable
table(swim)
## swim
## approved not approved
## 1 5
You can compute a CI and test for a proportion using raw data, provided the data
column includes only two distinct values. The levels can be numerical or alphanu-
meric.
# use the counts from table() for input in binom.test()
# the help for binom.test() says x can be a vector of length 2
# giving the numbers of successes and failures, respectively
# that's exactly what table(swim) gave us
binom.test(table(swim), p = 0.85, alternative = "less")
##
## Exact binomial test
##
## data: table(swim)
## number of successes = 1, number of trials = 6, p-value =
## 0.0003987
## alternative hypothesis: true probability of success is less than 0.85
## 95 percent confidence interval:


## 0.0000000 0.5818034
## sample estimates:
## probability of success
## 0.1666667

It is possible that the order (alphabetically) is the wrong order, failures and suc-
cesses, in which case we’d need to reorder the input to binom.test().
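As a minimal sketch, rev() reverses the two counts when the wrong level would otherwise be treated
as the success:
# Sketch: reverse the table if the success category is listed second;
# here "approved" (the success) is already first alphabetically, so no reordering is needed
rev(table(swim))
# binom.test(rev(table(swim)), p = 0.85, alternative = "less")   # only if the order were wrong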
In Chapter 6 we looked at the binomial distribution to obtain an exact Sign Test
confidence interval for the median. Examine the following to see where the exact
p-value for this test comes from.
n <- 6
x <- 0:n
p0 <- 0.85
bincdf <- pbinom(x, n, p0)
cdf <- data.frame(x, bincdf)
cdf
## x bincdf
## 1 0 1.139063e-05
## 2 1 3.986719e-04
## 3 2 5.885156e-03
## 4 3 4.733859e-02
## 5 4 2.235157e-01
## 6 5 6.228505e-01
## 7 6 1.000000e+00


ClickerQ s — Excess successes STT.05.01.030

7.4 Goodness-of-Fit Tests (Multinomial)


Example: jury pool The following data set was used as evidence in a court case.
The data represent a sample of 1336 individuals from the jury pool of a large municipal
court district for the years 1975–1977. The fairness of the representation of various
age groups on juries was being contested. The strategy for doing this was to challenge
the representativeness of the pool of individuals from which the juries are drawn. This
was done by comparing the age group distribution within the jury pool against the
age distribution in the district as a whole, which was available from census figures.
Age group (yrs) Obs. Counts Obs. Prop. Census Prop.
18-19 23 0.017 0.061
20-24 96 0.072 0.150
25-29 134 0.100 0.135
30-39 293 0.219 0.217
40-49 297 0.222 0.153
50-64 380 0.284 0.182
65-99 113 0.085 0.102
Total: 1336 1.000 1.000
A statistical question here is whether the jury pool population proportions are
equal to the census proportions across the age categories. This comparison can be
formulated as a goodness-of-fit test, which generalizes the large-sample test on a
single proportion to a categorical variable (here age) with r > 2 levels. For r =
2 categories, the goodness-of-fit test and large-sample test on a single proportion
are identical. Although this problem compares two populations, only one sample is
involved because the census data is a population summary!
In general, suppose each individual in a population is categorized into one and
only one of r levels or categories. Let p1 , p2 , . . ., pr , be the population proportions
in the r categories, where each pi ≥ 0 and p1 + p2 + · · · + pr = 1. The hypotheses
of interest in a goodness-of-fit problem are H0 : p1 = p01 , p2 = p02 , . . ., pr = p0r and
HA : not H0 , where p01 , p02 , . . ., p0r are given category proportions.
The plausibility of H0 is evaluated by comparing the hypothesized category pro-
portions to estimated (i.e., observed) category proportions p̂1 , p̂2 , . . ., p̂r from a
random or representative sample of n individuals selected from the population. The
discrepancy between the hypothesized and observed proportions is measured by the
Pearson chi-squared statistic:
    \chi^2_s = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i},


where Oi is the observed number in the sample that fall into the ith category (Oi =
np̂i ), and Ei = np0i is the number of individuals expected to be in the ith category
when H0 is true.
The Pearson statistic can also be computed as the sum of the squared residuals:
    \chi^2_s = \sum_{i=1}^{r} Z_i^2,

where Z_i = (O_i - E_i)/\sqrt{E_i}, or in terms of the observed and hypothesized category proportions

    \chi^2_s = n \sum_{i=1}^{r} \frac{(\hat{p}_i - p_{0i})^2}{p_{0i}}.

The Pearson statistic χ2s is “small” when all of the observed counts (proportions)
are close to the expected counts (proportions). The Pearson χ2 is “large” when one
or more observed counts (proportions) differs noticeably from what is expected when
H0 is true. Put another way, large values of χ2s suggest that H0 is false.
The critical value χ2crit for the test is obtained from a chi-squared probability
table with r −1 degrees of freedom. The picture below shows the form of the rejection
region. For example, if r = 5 and α = 0.05, then you reject H0 when χ2s ≥ χ2crit = 9.49
(qchisq(0.95, 5-1)). The p-value for the test is the area under the chi-squared curve
with df = r − 1 to the right of the observed χ2s value.

[Figure: chi-squared distribution with 4 degrees of freedom showing, on the left, the size α = 0.05 rejection region to the right of χ2crit and, on the right, the p-value as the area to the right of the observed χ2s.]

Example: jury pool Let p18 be the proportion in the jury pool population between
ages 18 and 19. Define p20 , p25 , p30 , p40 , p50 , and p65 analogously. You are interested in


testing that the true jury proportions equal the census proportions, H0 : p18 = 0.061,
p20 = 0.150, p25 = 0.135, p30 = 0.217, p40 = 0.153, p50 = 0.182, and p65 = 0.102
against HA : not H0 , using the sample of 1336 from the jury pool.
The observed counts, the expected counts, and the category residuals are given
in the table below. For example, E18 = 1336 × (0.061) = 81.5 and Z18 = (23 − 81.5)/\sqrt{81.5} = −6.48
in the 18-19 year category.
The Pearson statistic is

    \chi^2_s = (-6.48)^2 + (-7.38)^2 + (-3.45)^2 + 0.18^2 + 6.48^2 + 8.78^2 + (-1.99)^2 = 231.26

on r − 1 = 7 − 1 = 6 degrees of freedom. Here χ2crit = 12.59 at α = 0.05. The p-value
for the goodness-of-fit test is less than 0.001, which suggests that H0 is false.
Age group (yrs) Obs. Counts Exp. Counts Residual
18-19 23 81.5 −6.48
20-24 96 200.4 −7.38
25-29 134 180.4 −3.45
30-39 293 289.9 0.18
40-49 297 204.4 6.48
50-64 380 243.2 8.78
65-99 113 136.3 −1.99
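
A minimal R sketch of this hand calculation (the same values that chisq.test() produces below):
# Sketch: chi-squared goodness-of-fit statistic by hand, jury data
O <- c(23, 96, 134, 293, 297, 380, 113)
p0 <- c(0.061, 0.150, 0.135, 0.217, 0.153, 0.182, 0.102)
E <- sum(O) * p0
chisq.s <- sum((O - E)^2 / E)
chisq.s                                                  # 231.26
qchisq(0.95, df = length(O) - 1)                         # critical value 12.59
pchisq(chisq.s, df = length(O) - 1, lower.tail = FALSE)  # p-value, far below 0.001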

7.4.1 Adequacy of the Goodness-of-Fit Test


The chi-squared goodness-of-fit test is a large-sample test. A conservative rule of
thumb is that the test is suitable when each expected count is at least five. This
holds in the jury pool example. There is no widely available alternative method
for testing goodness-of-fit with smaller sample sizes. There is evidence, however, that
the chi-squared test is slightly conservative (the p-values are too large, on average)
when the expected counts are smaller. Some statisticians recommend that the chi-
squared approximation be used when the minimum expected count is at least one,
provided the expected counts are not too variable.

7.4.2 R Implementation
#### Example: jury pool
jury <- read.table(text="
Age Count CensusProp
18-19 23 0.061
20-24 96 0.150
25-29 134 0.135
30-39 293 0.217
40-49 297 0.153

50-64 380 0.182
65-99 113 0.102
", header=TRUE)

# show the structure of the data.frame


str(jury)
## 'data.frame': 7 obs. of 3 variables:
## $ Age : Factor w/ 7 levels "18-19","20-24",..: 1 2 3 4 5 6 7
## $ Count : int 23 96 134 293 297 380 113
## $ CensusProp: num 0.061 0.15 0.135 0.217 0.153 0.182 0.102
# display the data.frame
jury
## Age Count CensusProp
## 1 18-19 23 0.061
## 2 20-24 96 0.150
## 3 25-29 134 0.135
## 4 30-39 293 0.217
## 5 40-49 297 0.153
## 6 50-64 380 0.182
## 7 65-99 113 0.102
# calculate chi-square goodness-of-fit test
x.summary <- chisq.test(jury$Count, correct = FALSE, p = jury$CensusProp)
# print result of test
x.summary
##
## Chi-squared test for given probabilities
##
## data: jury$Count
## X-squared = 231.26, df = 6, p-value < 2.2e-16
# use output in x.summary and create table
x.table <- data.frame(age = jury$Age
, obs = x.summary$observed
, exp = x.summary$expected
, res = x.summary$residuals
, chisq = x.summary$residuals^2
, stdres = x.summary$stdres)
x.table
## age obs exp res chisq stdres
## 1 18-19 23 81.496 -6.4797466 41.98711613 -6.6869061
## 2 20-24 96 200.400 -7.3748237 54.38802395 -7.9991194
## 3 25-29 134 180.360 -3.4520201 11.91644267 -3.7116350
## 4 30-39 293 289.912 0.1813611 0.03289186 0.2049573
## 5 40-49 297 204.408 6.4762636 41.94199084 7.0369233
## 6 50-64 380 243.152 8.7760589 77.01921063 9.7033764
## 7 65-99 113 136.272 -1.9935650 3.97430128 -2.1037408

Plot the observed vs expected values, and the contributions to the chi-square statistic, to help
identify the age groups that deviate the most. The term "Contribution to Chi-Square" (chisq)
refers to the values of (O_i - E_i)^2 / E_i for each category; χ2s is the sum of those contributions.
library(reshape2)
x.table.obsexp <- melt(x.table,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars = c("age"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("obs", "exp"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "stat",
# value.name: column name for values in table
value.name = "value"
)

# naming variables manually, the variable.name and value.name not working 11/2012
names(x.table.obsexp) <- c("age", "stat", "value")

# Observed vs Expected counts


library(ggplot2)
p <- ggplot(x.table.obsexp, aes(x = age, fill = stat, weight=value))
p <- p + geom_bar(position="dodge")
p <- p + labs(title = "Observed and Expected frequencies")
p <- p + xlab("Age category (years)")
print(p)

# Contribution to chi-sq
# pull out only the age and chisq columns
x.table.chisq <- x.table[, c("age","chisq")]
# reorder the age categories to be descending relative to the chisq statistic
x.table.chisq$age <- with(x.table, reorder(age, -chisq))

p <- ggplot(x.table.chisq, aes(x = age, weight = chisq))


p <- p + geom_bar()
p <- p + labs(title = "Contribution to Chi-sq statistic")
p <- p + xlab("Sorted age category (years)")
p <- p + ylab("Chi-sq")
print(p)


[Figure: bar charts of observed and expected frequencies by age category, and of the contribution to the chi-squared statistic by age category, sorted so that the largest contributions (50-64, 20-24, 18-19, 40-49) appear first.]

7.4.3 Multiple Comparisons in a Goodness-of-Fit Problem


The goodness-of-fit test suggests that at least one of the age category proportions for
the jury pool population differs from the census figures. A reasonable next step in
the analysis would be to separately test the seven hypotheses: H0 : p18 = 0.061,
H0 : p20 = 0.150, H0 : p25 = 0.135, H0 : p30 = 0.217, H0 : p40 = 0.153, H0 : p50 =
0.182, and H0 : p65 = 0.102 to see which age categories led to this conclusion.
A Bonferroni comparison with a Family Error Rate ≤ 0.05 will be considered for
this multiple comparisons problem. The error rates for the seven individual tests
are set to α = 0.05/7 = 0.0071, which corresponds to 99.29% two-sided CIs for the
individual jury pool proportions. The area under the standard normal curve to the
right of 2.69 is 0.05/2/7 = 0.00357, about one-half the error rate for the individual
CIs, so the critical value for the CIs, or for the tests, is zcrit ≈ 2.69. The next table
gives individual 99.29% CIs based on the large sample approximation. You can get
the individual CIs in R using the binom.test() or prop.test() function. For example,
the CI for Age Group 18-19 is obtained by specifying 23 successes in 1336 trials.
Below I perform exact binomial tests of proportion for each of the seven age cate-
gories at the Bonferroni-adjusted significance level. I save the p-values and confidence
intervals in a table along with the observed and census proportions in order to display
the table below.
b.sum1 <- binom.test(jury$Count[1], sum(jury$Count), p = jury$CensusProp[1], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum2 <- binom.test(jury$Count[2], sum(jury$Count), p = jury$CensusProp[2], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum3 <- binom.test(jury$Count[3], sum(jury$Count), p = jury$CensusProp[3], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum4 <- binom.test(jury$Count[4], sum(jury$Count), p = jury$CensusProp[4], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum5 <- binom.test(jury$Count[5], sum(jury$Count), p = jury$CensusProp[5], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum6 <- binom.test(jury$Count[6], sum(jury$Count), p = jury$CensusProp[6], alternative = "two.sided", conf.level = 1-0.05/7)
b.sum7 <- binom.test(jury$Count[7], sum(jury$Count), p = jury$CensusProp[7], alternative = "two.sided", conf.level = 1-0.05/7)
# put the p-value and CI into a data.frame
b.sum <- data.frame(
rbind( c(b.sum1$p.value, b.sum1$conf.int)
, c(b.sum2$p.value, b.sum2$conf.int)


, c(b.sum3$p.value, b.sum3$conf.int)
, c(b.sum4$p.value, b.sum4$conf.int)
, c(b.sum5$p.value, b.sum5$conf.int)
, c(b.sum6$p.value, b.sum6$conf.int)
, c(b.sum7$p.value, b.sum7$conf.int)
)
)
names(b.sum) <- c("p.value", "CI.lower", "CI.upper")
b.sum$Age <- jury$Age
b.sum$Observed <- x.table$obs/sum(x.table$obs)
b.sum$CensusProp <- jury$CensusProp
b.sum
## p.value CI.lower CI.upper Age Observed CensusProp
## 1 8.814860e-15 0.00913726 0.02920184 18-19 0.01721557 0.061
## 2 2.694633e-18 0.05415977 0.09294037 20-24 0.07185629 0.150
## 3 1.394274e-04 0.07939758 0.12435272 25-29 0.10029940 0.135
## 4 8.421685e-01 0.18962122 0.25120144 30-39 0.21931138 0.217
## 5 2.383058e-11 0.19245560 0.25433144 40-49 0.22230539 0.153
## 6 5.915839e-20 0.25174398 0.31880556 50-64 0.28443114 0.182
## 7 3.742335e-02 0.06536589 0.10707682 65-99 0.08458084 0.102

The CIs for the 30-39 and 65-99 year categories contain the census proportions.
In the other five age categories, there are significant differences between the jury
pool proportions and the census proportions. In general, young adults appear to be
underrepresented in the jury pool whereas older age groups are overrepresented.

Age p.value CI.lower CI.upper Observed CensusProp
1 18-19 0.000 0.009 0.029 0.017 0.061
2 20-24 0.000 0.054 0.093 0.072 0.150
3 25-29 0.000 0.079 0.124 0.100 0.135
4 30-39 0.842 0.190 0.251 0.219 0.217
5 40-49 0.000 0.192 0.254 0.222 0.153
6 50-64 0.000 0.252 0.319 0.284 0.182
7 65-99 0.037 0.065 0.107 0.085 0.102

The residuals also highlight significant differences because the largest residuals
correspond to the categories that contribute most to the value of χ2s . Some researchers
use the residuals for the multiple comparisons, treating the Zi s as standard normal
variables. Following this approach, you would conclude that the jury pool proportions
differ from the proportions in the general population in every age category where
|Zi | ≥ 2.70 (using the same Bonferroni correction). This gives the same conclusion
as before.
The two multiple comparison methods are similar, but not identical. The residuals

$$ Z_i = \frac{O_i - E_i}{\sqrt{E_i}} = \frac{\hat{p}_i - p_{0i}}{\sqrt{p_{0i}/n}} $$

Prof. Erik B. Erhardt


7.5: Comparing Two Proportions: Independent Samples 251

agree with the large-sample statistic for testing H0 : pi = p0i , except that the divisor
in Zi omits a 1 − p0i term. The Zi s are not standard normal random variables as as-
sumed, and the value of Zi underestimates the significance of the observed differences.
Multiple comparisons using the Zi s will find, on average, fewer significant differences
than the preferred method based on the large sample tests. However, the differences
between the two methods are usually minor when all of the hypothesized proportions
are small.
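
To see the difference between the two methods numerically, here is a minimal sketch (assuming the jury data frame with Count and CensusProp columns used above) that computes both versions of the statistic:
# Pearson residuals Z_i versus the one-sample z statistics that keep the
# (1 - p0i) term in the denominator.
n    <- sum(jury$Count)
phat <- jury$Count / n
p0   <- jury$CensusProp
Z    <- (phat - p0) / sqrt(p0 / n)            # Pearson residuals
z.ls <- (phat - p0) / sqrt(p0 * (1 - p0) / n) # large-sample test statistics
data.frame(Age = jury$Age, Z = round(Z, 2), z.ls = round(z.ls, 2))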

7.5 Comparing Two Proportions: Independent Samples
The New Mexico state legislature is interested in how the proportion of registered vot-
ers that support Indian gaming differs between New Mexico and Colorado. Assuming
neither population proportion is known, the state’s statistician might recommend
that the state conduct a survey of registered voters sampled independently from the
two states, followed by a comparison of the sample proportions in favor of Indian
gaming.
Statistical methods for comparing two proportions using independent samples can
be formulated as follows. Let p1 and p2 be the proportion of populations 1 and 2,
respectively, with the attribute of interest. Let p̂1 and p̂2 be the corresponding sample
proportions, based on independent random or representative samples of size n1 and
n2 from the two populations.

7.5.1 Large Sample CI and Tests for p1 − p2


A large-sample CI for p1 − p2 is (p̂1 − p̂2) ± zcrit SECI(p̂1 − p̂2), where zcrit is the standard
normal critical value for the desired confidence level, and
$$ SE_{CI}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} $$
is the CI standard error.


A large-sample p-value for a test of the null hypothesis H0 : p1 − p2 = 0 against
the two-sided alternative HA : p1 − p2 ≠ 0 is evaluated using tail areas of the standard
normal distribution (identical to the one-sample evaluation) in conjunction with the test
statistic
$$ z_s = \frac{\hat{p}_1 - \hat{p}_2}{SE_{test}(\hat{p}_1 - \hat{p}_2)}, $$


where
$$ SE_{test}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}} = \sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} $$
is the test standard error for p̂1 − p̂2. The pooled proportion
$$ \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} $$
is the proportion of successes in the two samples combined. The test standard error
has the same functional form as the CI standard error, with p̄ replacing the individual
sample proportions.
The pooled proportion is the best guess at the common population proportion
when H0 : p1 = p2 is true. The test standard error estimates the standard deviation
of p̂1 − p̂2 assuming H0 is true.

Remark: As in the one-sample proportion problem, the test and CI SE’s are dif-
ferent. This can (but usually does not) lead to some contradiction between the test
and CI.
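
These formulas are easy to code directly. The sketch below (the function two.prop is my own shorthand, not a standard R function) computes the test statistic, p-value, and CI from the counts; applied to the vitamin C data in the next example, it should reproduce the hand calculations and the prop.test() output shown there.
# Large-sample test and CI for p1 - p2 from the formulas above.
two.prop <- function(x1, n1, x2, n2, conf.level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  pbar <- (x1 + x2) / (n1 + n2)                  # pooled proportion
  se.ci   <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  se.test <- sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))
  z  <- (p1 - p2) / se.test
  zc <- qnorm(1 - (1 - conf.level) / 2)
  c(diff = p1 - p2, z = z, p.value = 2 * pnorm(-abs(z)),
    CI.lower = p1 - p2 - zc * se.ci, CI.upper = p1 - p2 + zc * se.ci)
}
# two.prop(17, 139, 31, 140)   # vitamin C example below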

Example, vitamin C Two hundred and seventy-nine (279) French skiers were
studied during two one-week periods in 1961. One group of 140 skiers received a
placebo each day, and the other 139 received 1 gram of ascorbic acid (Vitamin C) per
day. The study was double blind — neither the subjects nor the researchers knew who
received which treatment. Let p1 be the probability that a member of the ascorbic
acid group contracts a cold during the study period, and p2 be the corresponding
probability for the placebo group. Linus Pauling (Chemistry and Peace Nobel prize
winner) and I are interested in testing whether H0 : p1 = p2 . The data are summarized
below as a two-by-two table of counts (a contingency table)
Outcome Ascorbic Acid Placebo
# with cold 17 31
# with no cold 122 109
Totals 139 140
The sample sizes are n1 = 139 and n2 = 140. The sample proportion of skiers
developing colds in the placebo and treatment groups are p̂2 = 31/140 = 0.221 and
p̂1 = 17/139 = 0.122, respectively. The difference is p̂1 − p̂2 = 0.122−0.221 = −0.099.
The pooled proportion is the number of skiers that developed colds divided by the
number of skiers in the study: p̄ = 48/279 = 0.172.
The test standard error is
$$ SE_{test}(\hat{p}_1 - \hat{p}_2) = \sqrt{0.172 \times (1 - 0.172)\left(\frac{1}{139} + \frac{1}{140}\right)} = 0.0452. $$


The test statistic is
$$ z_s = \frac{0.122 - 0.221}{0.0452} = -2.19. $$
The p-value for a two-sided test is twice the area under the standard normal curve to
the right of 2.19 (or twice the area to the left of −2.19), which is 2 × (0.014) = 0.028.
At the 5% level, we reject the hypothesis that the probability of contracting a cold is
the same whether you are given a placebo or Vitamin C.
A CI for p1 − p2 provides a measure of the size of the treatment effect. For a 95% CI
$$ z_{crit} SE_{CI}(\hat{p}_1 - \hat{p}_2) = 1.96\sqrt{\frac{0.221 \times (1 - 0.221)}{140} + \frac{0.122 \times (1 - 0.122)}{139}} = 1.96 \times (0.04472) = 0.088. $$

The 95% CI for p1 − p2 is −0.099 ± 0.088, or (−0.187, −0.011). We are 95% confident
that p2 exceeds p1 by at least 0.011 but not by more than 0.187.
On the surface, we would conclude that a daily dose of Vitamin C decreases
a French skier’s chance of developing a cold by between 0.011 and 0.187 (with 95%
confidence). This conclusion was somewhat controversial. Several reviews of the study
felt that the experimenter’s evaluations of cold symptoms were unreliable. Many other
studies refute the benefit of Vitamin C as a treatment for the common cold.
#### Example, vitamin C
# Approximate normal test for two-proportions, without Yates' continuity correction
prop.test(c(17, 31), c(139, 140), correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(17, 31) out of c(139, 140)
## X-squared = 4.8114, df = 1, p-value = 0.02827
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.18685917 -0.01139366
## sample estimates:
## prop 1 prop 2
## 0.1223022 0.2214286

Conditional probability
In probability theory, a conditional probability is the probability that an event will
occur, when another event is known to occur or to have occurred. If the events are A
and B respectively, this is said to be “the probability of A given B”. It is commonly
denoted by Pr(A|B). Pr(A|B) may or may not be equal to Pr(A), the probability of
A. If they are equal, A and B are said to be independent. For example, if a coin is


flipped twice, “the outcome of the second flip” is independent of “the outcome of the
first flip”.
In the Vitamin C example above, the unconditional observed probability of con-
tracting a cold is Pr(cold) = (17 + 31)/(139 + 140) = 0.172. The conditional observed
probabilities are Pr(cold|ascorbic acid) = 17/139 = 0.1223 and Pr(cold|placebo) =
31/140 = 0.2214. The two-sample test of H0 : p1 = p2 where p1 = Pr(cold|ascorbic acid)
and p2 = Pr(cold|placebo) is effectively testing whether Pr(cold) = Pr(cold|ascorbic acid) =
Pr(cold|placebo). This tests whether contracting a cold is independent of the vitamin
C treatment.
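
As a small arithmetic sketch of these quantities (counts taken from the table above):
# Unconditional and conditional observed probabilities of a cold.
colds  <- c(AA = 17, placebo = 31)
totals <- c(AA = 139, placebo = 140)
sum(colds) / sum(totals)   # Pr(cold), about 0.172
colds / totals             # Pr(cold | treatment group)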

Example, cervical dysplasia A case-control study was designed to examine risk


factors for cervical dysplasia2 (Becker et al. 1994). All the women in the study were
patients at UNM clinics. The 175 cases were women, aged 18-40, who had cervical
dysplasia. The 308 controls were women aged 18-40 who did not have cervical dys-
plasia. Each woman was classified as positive or negative, depending on the presence
of HPV (human papilloma virus). The data are summarized below.
HPV Outcome Cases Controls
Positive 164 130
Negative 11 178
Sample size 175 308
We’ll take a short detour to create this table in R, calculate the column propor-
tions, and plot the frequencies and proportions.
We first create a table with labelled rows and columns.
#### Example: cervical dysplasia
# Create the labelled table
hpv <-
matrix(c(164, 130, 11, 178),
nrow = 2, byrow = TRUE,
dimnames = list("HPV.Outcome" = c("Positive", "Negative"),
"Group" = c("Cases", "Controls")))
hpv
## Group
## HPV.Outcome Cases Controls
## Positive 164 130
## Negative 11 178
Next, we create column proportions.
# calculate the column (margin = 2) proportions
hpv.col.prop <- prop.table(hpv, margin = 2)
hpv.col.prop
## Group

2
http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0002461/


## HPV.Outcome Cases Controls


## Positive 0.93714286 0.4220779
## Negative 0.06285714 0.5779221
Here, we reshape the data from wide format to long format. This will allow us
to make plots later, and also shows how to create these tables from a dataset in long
format (which is the typical format).
# OR, convert to long format and use xtabs to produce the table
library(reshape2)
hpv.long <- melt(hpv, value.name = "Frequency")
hpv.long
## HPV.Outcome Group Frequency
## 1 Positive Cases 164
## 2 Negative Cases 11
## 3 Positive Controls 130
## 4 Negative Controls 178
T1 <- xtabs(Frequency ~ HPV.Outcome + Group, data = hpv.long)
T1
## Group
## HPV.Outcome Cases Controls
## Positive 164 130
## Negative 11 178
hpv.col.prop <- prop.table(T1, margin = 2)
hpv.col.prop
## Group
## HPV.Outcome Cases Controls
## Positive 0.93714286 0.42207792
## Negative 0.06285714 0.57792208
Add a column of proportions to our long-formatted data to plot these proportion
values.
library(reshape2)
hpv.col.prop.long <- melt(hpv.col.prop, value.name = "Proportion")
hpv.col.prop.long
## HPV.Outcome Group Proportion
## 1 Positive Cases 0.93714286
## 2 Negative Cases 0.06285714
## 3 Positive Controls 0.42207792
## 4 Negative Controls 0.57792208
# join these two datasets to have both Freq and Prop columns
library(plyr)
hpv.long <- join(hpv.long, hpv.col.prop.long)
## Joining by: HPV.Outcome, Group
hpv.long
## HPV.Outcome Group Frequency Proportion
## 1 Positive Cases 164 0.93714286
## 2 Negative Cases 11 0.06285714


## 3 Positive Controls 130 0.42207792


## 4 Negative Controls 178 0.57792208

Finally, plot the frequencies, and the proportions in three ways (the frequencies
can obviously be plotted in many ways, too).
# plots are easier now that data are in long format.
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Frequency, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity", position = "dodge")
p <- p + theme_bw()
p <- p + labs(title = "Frequency of HPV by Case/Control group")
print(p)

# bars, stacked
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
print(p)

# bars, dodged
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, fill = HPV.Outcome))
p <- p + geom_bar(stat="identity", position = "dodge")
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
p <- p + scale_y_continuous(limits = c(0, 1))
print(p)

# lines are sometimes easier, especially when many categories along the x-axis
library(ggplot2)
p <- ggplot(data = hpv.long, aes(x = Group, y = Proportion, colour = HPV.Outcome))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_point(aes(shape = HPV.Outcome))
p <- p + geom_line(aes(linetype = HPV.Outcome, group = HPV.Outcome))
p <- p + theme_bw()
p <- p + labs(title = "Proportion of HPV by Case/Control group")
p <- p + scale_y_continuous(limits = c(0, 1))
print(p)


[Figure: "Frequency of HPV by Case/Control group" (dodged bars) and "Proportion of HPV by Case/Control group" shown three ways (stacked bars, dodged bars, and points connected by lines), each by Group (Cases, Controls) and split by HPV.Outcome (Positive, Negative).]

Returning to the hypothesis test, let p1 be the probability that a case is HPV
positive and let p2 be the probability that a control is HPV positive. The sample
sizes are n1 = 175 and n2 = 308. The sample proportions of positive cases and
controls are p̂1 = 164/175 = 0.937 and p̂2 = 130/308 = 0.422.
For a 95% CI
$$ z_{crit} SE_{CI}(\hat{p}_1 - \hat{p}_2) = 1.96\sqrt{\frac{0.937 \times (1 - 0.937)}{175} + \frac{0.422 \times (1 - 0.422)}{308}} = 1.96 \times (0.0336) = 0.0659. $$

A 95% CI for p1 − p2 is (0.937 − 0.422) ± 0.066, or 0.515 ± 0.066, or (0.449, 0.581). I


am 95% confident that p1 exceeds p2 by at least 0.45 but not by more than 0.58.
# Approximate normal test for two-proportions, without Yates' continuity correction
prop.test(c(164, 130), c(175, 308), correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(164, 130) out of c(175, 308)
## X-squared = 124.29, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:


## 0.4492212 0.5809087
## sample estimates:
## prop 1 prop 2
## 0.9371429 0.4220779
Not surprisingly, a two-sided test at the 5% level would reject H0 : p1 = p2 . In
this problem one might wish to do a one-sided test, instead of a two-sided test. Let
us carry out this test, as a refresher on how to conduct one-sided tests.
# one-sided test, are cases more likely to be HPV positive?
prop.test(c(164, 130), c(175, 308), correct = FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(164, 130) out of c(175, 308)
## X-squared = 124.29, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.4598071 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.9371429 0.4220779

Appropriateness of Large Sample Test and CI


The standard two-sample CI and test used above are appropriate when each sample
is large. A rule of thumb suggests a minimum of at least five successes (i.e., obser-
vations with the characteristic of interest) and failures (i.e., observations without the
characteristic of interest) in each sample before using these methods. This condition
is satisfied in our two examples.

ClickerQ s — Comparing two proportions STT.08.02.010

7.6 Effect Measures in Two-by-Two Tables


Consider a study of a particular disease, where each individual is either exposed or
not-exposed to a risk factor. Let p1 be the proportion diseased among the individuals
in the exposed population, and p2 be the proportion diseased among the non-exposed
population. This population information can be summarized as a two-by-two table
of population proportions:
Outcome Exposed population Non-Exposed population
Diseased p1 p2
Non-Diseased 1 − p1 1 − p2


A standard measure of the difference between the exposed and non-exposed pop-
ulations is the absolute difference: p1 − p2 . We have discussed statistical methods
for assessing this difference.
In many epidemiological and biostatistical settings, other measures of the differ-
ence between populations are considered. For example, the relative risk
$$ RR = \frac{p_1}{p_2} $$
is commonly reported when the individual risks p1 and p2 are small. The odds ratio
$$ OR = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} $$

is another standard measure. Here p1 /(1 − p1 ) is the odds of being diseased in the
exposed group, whereas p2 /(1 − p2 ) is the odds of being diseased in the non-exposed
group.
I mention these measures because you may see them or hear about them. Note
that each of these measures can be easily estimated from data, using the sample
proportions as estimates of the unknown population proportions. For example, in the
vitamin C study:
Outcome Ascorbic Acid Placebo
# with cold 17 31
# with no cold 122 109
Totals 139 140
the proportion with colds in the placebo group is p̂2 = 31/140 = 0.221. The propor-
tion with colds in the vitamin C group is p̂1 = 17/139 = 0.122.
The estimated absolute difference in risk is p̂1 − p̂2 = 0.122 − 0.221 = −0.099. The
estimated risk ratio and odds ratio are
$$ \widehat{RR} = \frac{0.122}{0.221} = 0.55 $$
and
$$ \widehat{OR} = \frac{0.122/(1 - 0.122)}{0.221/(1 - 0.221)} = 0.49, $$
respectively.
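
A minimal sketch of these estimates in R (using the vitamin C counts above):
# Estimated risk ratio and odds ratio for the vitamin C example.
p1 <- 17 / 139   # ascorbic acid group
p2 <- 31 / 140   # placebo group
c(RR = p1 / p2,
  OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))   # about 0.55 and 0.49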

Interpreting odds ratios, two examples Let’s begin with probability3 . Let’s
say that the probability of success is 0.8, thus p = 0.8. Then the probability of
3
Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/
stat/sas/faq/oratio.htm


failure is q = 1 − p = 0.2. The odds of success are defined as odds(success) =


p/q = 0.8/0.2 = 4, that is, the odds of success are 4 to 1. The odds of failure would
be odds(failure) = q/p = 0.2/0.8 = 0.25, that is, the odds of failure are 1 to 4.
Next, let’s compute the odds ratio by OR = odds(success)/odds(failure) = 4/0.25 =
16. The interpretation of this odds ratio would be that the odds of success are 16
times greater than for failure. Now if we had formed the odds ratio the other way
around with odds of failure in the numerator, we would have gotten something like
this, OR = odds(failure)/odds(success) = 0.25/4 = 0.0625. Interestingly enough,
the interpretation of this odds ratio is nearly the same as the one above. Here the
interpretation is that the odds of failure are one-sixteenth the odds of success. In
fact, if you take the reciprocal of the first odds ratio you get 1/16 = 0.0625.

Another example This example is adapted from Pedhazur (1997). Suppose


that seven out of 10 males are admitted to an engineering school while three of 10
females are admitted. The probabilities for admitting a male are, p = 7/10 = 0.7 and
q = 1−0.7 = 0.3. Here are the same probabilities for females, p = 3/10 = 0.3 and q =
1−0.3 = 0.7. Now we can use the probabilities to compute the admission odds for both
males and females, odds(male) = 0.7/0.3 = 2.33333 and odds(female) = 0.3/0.7 =
0.42857. Next, we compute the odds ratio for admission, OR = 2.3333/0.42857 =
5.44. Thus, the odds of a male being admitted are 5.44 times greater than for a
female.
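
The same arithmetic in R (a sketch of the admissions example above):
# Odds of admission for males and females, and their ratio.
p.male <- 7 / 10
p.female <- 3 / 10
odds.male   <- p.male / (1 - p.male)       # about 2.33
odds.female <- p.female / (1 - p.female)   # about 0.43
odds.male / odds.female                    # odds ratio, about 5.44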

7.7 Analysis of Paired Samples: Dependent Proportions
Paired and more general block analyses are appropriate with longitudinal data
collected over time and in medical studies where several treatments are given to the
same patient over time. A key feature of these designs that invalidates the two-sample
method discussed earlier is that repeated observations within a unit or individual are
likely to be correlated, and not independent.

Example, President performance For example, in a random sample of n = 1600


voter-age Americans, 944 said that they approved of the President’s performance. One
month later, only 880 of the original 1600 sampled approved. The following two-by-
two table gives the numbers of individuals with each of the four possible sequences
of responses over time. Thus, 150 voter-age Americans approved of the President’s
performance when initially asked but then disapproved one month later. The row
and column totals are the numbers of approvals and disapprovals for the two surveys
(Agresti, 1990, p. 350).


(Obs Counts) Second survey


First Survey Approve Disapprove Total
Approve 794 150 944
Disapprove 86 570 656
Total 880 720 1600
Let pAA , pAD , pDA , pDD be the population proportion of voter-age Americans
that fall into the four categories, where the subscripts preserve the time ordering and
indicate Approval or Disapproval. For example, pAD is the population proportion that
approved of the President’s performance initially and disapproved one month later.
The population proportion that initially approved is pA+ = pAA +pAD . The population
proportion that approved at the time of the second survey is p+A = pAA + pDA . The
“+” sign used as a subscript means that the replaced subscript has been summed
over.
Similarly, let p̂AA , p̂AD , p̂DA , p̂DD be the sample proportion of voter-age Ameri-
cans that fall into the four categories, and let p̂A+ = p̂AA + p̂AD and p̂+A = p̂AA + p̂DA
be the sample proportion that approves the first month and the sample proportion
that approves the second month, respectively. The table below summarizes the ob-
served proportions. For example, p̂AA = 794/1600 = 0.496 and p̂A+ = 944/1600 =
0.496 + 0.094 = 0.590. The sample proportion of voting-age Americans that approve
of the President’s performance decreased from one month to the next.
(Obs Proportions) Second survey
First Survey Approve Disapprove Total
Approve 0.496 0.094 0.590
Disapprove 0.054 0.356 0.410
Total 0.550 0.450 1.000
The difference in the population proportions from one month to the next can be
assessed by a large-sample CI for pA+ − p+A, given by (p̂A+ − p̂+A) ± zcrit SECI(p̂A+ − p̂+A),
where the CI standard error satisfies
$$ SE_{CI}(\hat{p}_{A+} - \hat{p}_{+A}) = \sqrt{\frac{\hat{p}_{A+}(1-\hat{p}_{A+}) + \hat{p}_{+A}(1-\hat{p}_{+A}) - 2(\hat{p}_{AA}\hat{p}_{DD} - \hat{p}_{AD}\hat{p}_{DA})}{n}}. $$
One-sided bounds are constructed in the usual way.
The −2(·) term in the standard error accounts for the dependence between the
samples at the two time points. If independent samples of size n had been selected
for the two surveys, then this term would be omitted from the standard error, giving
the usual two-sample CI.
For a 95% CI in the Presidential survey,
$$ z_{crit} SE_{CI}(\hat{p}_{A+} - \hat{p}_{+A}) = 1.96\sqrt{\frac{0.590 \times 0.410 + 0.550 \times 0.450 - 2(0.496 \times 0.356 - 0.094 \times 0.054)}{1600}} = 1.96 \times (0.0095) = 0.0186. $$


A 95% CI for pA+ − p+A is (0.590 − 0.550) ± 0.019, or (0.021, 0.059). You are 95%
confident that the population proportion of voter-age Americans that approved of the
President’s performance the first month was between 0.021 and 0.059 larger than the
proportion that approved one month later. This gives evidence of a decrease in the
President’s approval rating.
A test of H0 : pA+ = p+A can be based on the CI for pA+ − p+A, or on a standard
normal approximation to the test statistic
$$ z_s = \frac{\hat{p}_{A+} - \hat{p}_{+A}}{SE_{test}(\hat{p}_{A+} - \hat{p}_{+A})}, $$
where the test standard error is given by
$$ SE_{test}(\hat{p}_{A+} - \hat{p}_{+A}) = \sqrt{\frac{\hat{p}_{A+} + \hat{p}_{+A} - 2\hat{p}_{AA}}{n}}. $$
The test statistic is often written in the simplified form
$$ z_s = \frac{n_{AD} - n_{DA}}{\sqrt{n_{AD} + n_{DA}}}, $$
where the nij s are the observed cell counts. An equivalent form of this test, based
on comparing the square of zs to a chi-squared distribution with 1 degree of freedom,
is the well-known McNemar’s test for marginal homogeneity (or symmetry) in the
two-by-two table.
For example, in the Presidential survey
$$ z_s = \frac{150 - 86}{\sqrt{150 + 86}} = 4.17. $$
The p-value for a two-sided test is, as usual, the area under the standard normal
curve outside ±4.17. The p-value is less than 0.001, suggesting that H0 is false.
R can perform this test as McNemar’s test.
#### Example, President performance
# McNemar's test needs data as a matrix

# Presidential Approval Ratings.


# Approval of the President's performance in office in two surveys,
# one month apart, for a random sample of 1600 voting-age Americans.
pres <-
matrix(c(794, 150, 86, 570),
nrow = 2, byrow = TRUE,
dimnames = list("1st Survey" = c("Approve", "Disapprove"),
"2nd Survey" = c("Approve", "Disapprove")))
pres


## 2nd Survey
## 1st Survey Approve Disapprove
## Approve 794 150
## Disapprove 86 570
mcnemar.test(pres, correct=FALSE)
##
## McNemar's Chi-squared test
##
## data: pres
## McNemar's chi-squared = 17.356, df = 1, p-value = 3.099e-05
# => significant change (in fact, drop) in approval ratings
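
mcnemar.test() reports the test but not the CI for pA+ − p+A. The following sketch (using the pres matrix above) computes the CI from the dependent-samples standard error given earlier in this section:
# Large-sample CI for pA+ - p+A with the dependent-samples SE.
n   <- sum(pres)
pAA <- pres[1, 1] / n; pAD <- pres[1, 2] / n
pDA <- pres[2, 1] / n; pDD <- pres[2, 2] / n
pA. <- pAA + pAD   # approve at first survey
p.A <- pAA + pDA   # approve at second survey
se.ci <- sqrt((pA. * (1 - pA.) + p.A * (1 - p.A)
               - 2 * (pAA * pDD - pAD * pDA)) / n)
(pA. - p.A) + c(-1, 1) * qnorm(0.975) * se.ci   # about (0.021, 0.059)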

7.8 Testing for Homogeneity of Proportions


Example, cancer deaths The following two-way table of counts summarizes the
location of death and age at death from a study of 2989 cancer deaths (Public Health
Reports, 1983).
(Obs Counts) Location of death
Age Home Acute Care Chronic care Row Total
15-54 94 418 23 535
55-64 116 524 34 674
65-74 156 581 109 846
75+ 138 558 238 934
Col Total 504 2081 404 2989
The researchers want to compare the age distributions across locations. A one-way
ANOVA would be ideal if the actual ages were given. Because the ages are grouped,
the data should be treated as categorical. Given the differences in numbers that died
at the three types of facilities, a comparison of proportions or percentages in the age
groups is appropriate. A comparison of counts is not.
The table below summarizes the proportion in the four age groups by location.
For example, in the acute care facility 418/2081 = 0.201 and 558/2081 = 0.268.
The pooled proportions are the Row Totals divided by the total sample size of
2989. The pooled summary gives the proportions in the four age categories, ignoring
location of death.
The age distributions for home and for the acute care facilities are similar, but
are very different from the age distribution at chronic care facilities.
To formally compare the observed proportions, one might view the data as a
representative sample of ages at death from the three locations. Assuming independent
samples from the three locations (populations), a chi-squared statistic is used to test
whether the population proportions of ages at death are identical (homogeneous)


across locations. The chi-squared test for homogeneity of population propor-


tions can be defined in terms of proportions, but is traditionally defined in terms of
counts.
(Proportions) Location of death
Age Home Acute Care Chronic care Pooled
15-54 0.187 0.201 0.057 0.179
55-64 0.230 0.252 0.084 0.226
65-74 0.310 0.279 0.270 0.283
75+ 0.273 0.268 0.589 0.312
Total 1.000 1.000 1.000 1.000
In general, assume that the data are independent samples from c populations
(strata, groups, sub-populations), and that each individual is placed into one of r
levels of a categorical variable. The raw data will be summarized as a r × c contin-
gency table of counts, where the columns correspond to the samples, and the rows
are the levels of the categorical variable. In the age distribution problem, r = 4 and
c = 3.
To implement the test:
1. Compute the (estimated) expected count for each cell in the table as follows:
   $$ E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Total Sample Size}}. $$
2. Compute the Pearson test statistic
   $$ \chi^2_s = \sum_{\text{all cells}} \frac{(O - E)^2}{E}, $$
   where O is the observed count.


3. For a size α test, reject the hypothesis of homogeneity if χ2s ≥ χ2crit , where
χ2crit is the upper α critical value from the chi-squared distribution with df =
(r − 1)(c − 1).
The p-value for the chi-squared test of homogeneity is equal to the area under the
chi-squared curve to the right of χ2s .
For a two-by-two table of counts, the chi-squared test of homogeneity
of proportions is identical to the two-sample proportion test we discussed
earlier.
The (estimated) expected counts for the (15-54, Home) cell and for the (75+,
Acute Care) cell in the age distribution data are E = 535 × 504/2989 = 90.21
and E = 934 × 2081/2989 = 650.27, respectively. The other expected counts were
computed similarly, and are summarized below. The row and column sums on the
tables of observed and expected counts always agree.


(Exp Counts) Location of death


Age Home Acute Care Chronic care Row Total
15-54 90.21 372.48 72.31 535
55-64 113.65 469.25 91.10 674
65-74 142.65 589 114.35 846
75+ 157.49 650.27 126.24 934
Col Total 504 2081 404 2989
Why is a comparison of the observed and expected counts relevant for testing
homogeneity? To answer this question, first note that the expected cell count can be
expressed
E = Col Total × Pooled proportion for category.
For example, E = 504 × 0.179 = 90.21 in the (15-54, Home) cell. A comparison
of the observed and expected counts is a comparison of the observed category pro-
portions in a location with the pooled proportions, taking the size of each sample
into consideration. Thinking of the pooled proportion as a weighted average of the
sample proportions for a category, the Pearson χ2s statistic is an aggregate measure of
variation in the observed proportions across samples. If the category proportions are
similar across samples then the category and pooled proportions are similar, resulting
in a “small” value of χ2s . Large values of χ2s occur when there is substantial variation
in the observed proportions across samples, in one or more categories. In this regard,
the Pearson statistic is similar to the F -statistic in a one-way ANOVA (where large
differences between groups result in a large F -statistic).
#### Example, cancer deaths
candeath <-
matrix(c( 94, 418, 23,
116, 524, 34,
156, 581, 109,
138, 558, 238),
nrow = 4, byrow = TRUE,
dimnames = list("Age" = c("15-54", "55-64", "65-74", "75+"),
"Location of death" = c("Home", "Acute Care", "Chronic care")))
candeath
## Location of death
## Age Home Acute Care Chronic care
## 15-54 94 418 23
## 55-64 116 524 34
## 65-74 156 581 109
## 75+ 138 558 238
chisq.summary <- chisq.test(candeath, correct=FALSE)
chisq.summary
##
## Pearson's Chi-squared test
##


## data: candeath
## X-squared = 197.62, df = 6, p-value < 2.2e-16
# The Pearson residuals
chisq.summary$residuals
## Location of death
## Age Home Acute Care Chronic care
## 15-54 0.3989527 2.3587229 -5.798909
## 55-64 0.2205584 2.5273526 -5.982375
## 65-74 1.1176594 -0.3297027 -0.500057
## 75+ -1.5530094 -3.6183388 9.946704
# The sum of the squared residuals is the chi-squared statistic:
chisq.summary$residuals^2
## Location of death
## Age Home Acute Care Chronic care
## 15-54 0.1591633 5.5635737 33.627351
## 55-64 0.0486460 6.3875111 35.788805
## 65-74 1.2491626 0.1087039 0.250057
## 75+ 2.4118382 13.0923756 98.936922
sum(chisq.summary$residuals^2)
## [1] 197.6241
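
The same quantities can be computed directly from the formulas above; a minimal sketch using the candeath matrix:
# Expected counts E = (row total x column total) / (total sample size),
# and the Pearson chi-squared statistic and p-value by hand.
E <- outer(rowSums(candeath), colSums(candeath)) / sum(candeath)
round(E, 2)                                # matches the table of expected counts
chisq.stat <- sum((candeath - E)^2 / E)    # about 197.6
pchisq(chisq.stat, df = (nrow(candeath) - 1) * (ncol(candeath) - 1),
       lower.tail = FALSE)                 # p-value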

A visualization of the Pearson residuals is available with a mosaic() plot in the vcd
package. Extended mosaic and association plots are each helpful methods of visualizing
complex data and evaluating deviations from a specified independence model. For
extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula,
condvar= is an optional conditioning variable, and data= specifies a data frame or a
table. Include shade=TRUE to color the figure, and legend=TRUE to display a legend
for the Pearson residuals.
# mosaic plot
library(vcd)
## Loading required package: grid
##
## Attaching package: ’vcd’
## The following object is masked from ’package:BSDA’:
##
## Trucks
mosaic(candeath, shade=TRUE, legend=TRUE)

# association plot
library(vcd)
assoc(candeath, shade=TRUE)


[Figure: extended mosaic plot (left) and association plot (right) of the candeath table, Age by Location of death, shaded by Pearson residuals (legend roughly from −6 to 9.9); p-value < 2.22e-16.]

The vcd package provides a variety of methods for visualizing multivariate categor-
ical data, inspired by Michael Friendly’s wonderful “Visualizing Categorical Data”.
For more details, see The Strucplot Framework4 .

For example, a sieve plot for an n-way contingency table plots rectangles with
areas proportional to the expected cell frequencies and filled with a number of squares
equal to the observed frequencies. Thus, the densities visualize the deviations of the
observed from the expected values.
# sieve plot
library(vcd)

# plot expected values


sieve(candeath, sievetype = "expected", shade = TRUE)

# plot observed table, then label cells with observed values in the cells
sieve(candeath, pop = FALSE, shade = TRUE)
labeling_cells(text = candeath, gp_text = gpar(fontface = 2))(as.table(candeath))

4
http://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdf


[Figure: sieve plots of the candeath table, Age by Location of death: expected frequencies (left) and observed frequencies with the observed counts labelled in the cells (right).]

7.8.1 Adequacy of the Chi-Square Approximation


The chi-squared tests for homogeneity and independence are large-sample tests. As
with the goodness-of-fit test, a simple rule-of-thumb is that the approximation is
adequate when the expected (not observed) cell counts are 5 or more. This rule is
conservative, and some statisticians argue that the approximation is valid for expected
counts as small as one.
In practice, the chi-squared approximation to χ2s tends to be a bit conservative,
meaning that statistically significant results would likely retain significance had a
more accurate approximation been used.
R may not print out a warning message whenever a noticeable percentage of
cells have expected counts less than 5. Ideally, one would use Fisher’s exact test
(fisher.test()) for tables with small counts; it can also be used for tables larger than
2-by-2 when the frequencies are not too large.
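
For illustration, a small sketch with a hypothetical 2-by-3 table of small counts, where the rule of thumb fails and the exact test is a safer choice:
# Hypothetical table with small expected counts (not from the text).
tab.small <- matrix(c(3, 1, 2,
                      4, 6, 9), nrow = 2, byrow = TRUE)
chisq.test(tab.small)   # warns that the chi-squared approximation may be incorrect
fisher.test(tab.small)  # exact p-value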

7.9 Testing for Homogeneity in Cross-Sectional and Stratified Studies
Two-way tables of counts are often collected using either stratified sampling or
cross-sectional sampling.
In a stratified design, distinct groups, strata, or sub-populations are identified.
Independent samples are selected from each group, and the sampled individuals are
classified into categories. The Indian gaming example is an illustration of a stratified


design (where the two strata were NM and CO voters). Stratified designs provide
estimates for the strata (population) proportion in each of the categories. A test for
homogeneity of proportions is used to compare the strata.
In a cross-sectional design, individuals are randomly selected from a population
and classified by the levels of two categorical variables. With cross-sectional samples
you can test homogeneity of proportions by comparing either the row proportions or
by comparing the column proportions.

Example, antismoking adverts The following data (The Journal of Advertising,


1983, pp. 34–42) are from a cross-sectional study that involved soliciting opinions
on anti-smoking advertisements. Each subject was asked whether they smoked and
their reaction (on a five-point ordinal scale) to the ad. The data are summarized as
a two-way table of counts, given below:
Str. Dislike Dislike Neutral Like Str. Like Row Tot
Smoker 8 14 35 21 19 97
Non-smoker 31 42 78 61 69 281
Col Total 39 56 113 82 88 378
The row proportions (i.e., fix a row and compute the proportions for the column
categories) are
(Row Prop) Str. Dislike Dislike Neutral Like Str. Like Row Tot
Smoker 0.082 0.144 0.361 0.216 0.196 1.000
Non-smoker 0.110 0.149 0.278 0.217 0.245 1.000
For example, the entry for the (Smoker, Str. Dislike) cell is: 8/97 = 0.082. Similarly,
the column proportions are
(Col Prop) Str. Dislike Dislike Neutral Like Str. Like
Smoker 0.205 0.250 0.310 0.256 0.216
Non-smoker 0.795 0.750 0.690 0.744 0.784
Total 1.000 1.000 1.000 1.000 1.000
Although it may be more natural to compare the smoker and non-smoker row
proportions, the column proportions can be compared across ad responses. There is no
advantage to comparing “rows” instead of “columns” in a formal test of homogeneity
of proportions with cross-sectional data. The Pearson chi-squared test treats the
rows and columns interchangeably, so you get the same result regardless of how you
view the comparison. However, one of the two comparisons may be more natural to
interpret.
Note that checking for homogeneity of proportions is meaningful in
stratified studies only when the comparison is between strata! Further,
if the strata correspond to columns of the table, then the column proportions or
percentages are meaningful whereas the row proportions are not.


7.9.1 Testing for Independence in a Two-Way Contingency Table
The row and column classifications for a population where each individual is cross-
classified by two categorical variables are said to be independent if each population
cell proportion in the two-way table is the product of the proportion in a given row
and the proportion in a given column. One can show that independence is equivalent
to homogeneity of proportions. In particular, the two-way table of population cell
proportions satisfies independence if and only if the population column proportions
are homogeneous. If the population column proportions are homogeneous then so are
the population row proportions.
This suggests that a test for independence or no association between two vari-
ables based on a cross-sectional study can be implemented using the chi-squared test
for homogeneity of proportions. This suggestion is correct. If independence is not
plausible, I interpret the dependence as a deviation from homogeneity, using the
classification for which the interpretation is most natural.
The Pearson chi-squared test of independence is not significant (p-value = 0.56).
The observed association between smoking status and the ad reaction is not sig-
nificant. This suggests, for example, that the smoker’s reactions to the ad were not
statistically significantly different from the non-smoker’s reactions, which is consistent
with the smokers and non-smokers attitudes being fairly similar.
#### Example, antismoking adverts
antismokead <-
matrix(c( 8, 14, 35, 21, 19,
31, 42, 78, 61, 69),
nrow = 2, byrow = TRUE,
dimnames = list(
"Status" = c("Smoker", "Non-smoker"),
"Reaction" = c("Str. Dislike", "Dislike", "Neutral", "Like", "Str. Like")))
antismokead
## Reaction
## Status Str. Dislike Dislike Neutral Like Str. Like
## Smoker 8 14 35 21 19
## Non-smoker 31 42 78 61 69
chisq.summary <- chisq.test(antismokead, correct=FALSE)
chisq.summary
##
## Pearson's Chi-squared test
##
## data: antismokead
## X-squared = 2.9907, df = 4, p-value = 0.5594
# All expected frequencies are at least 5
chisq.summary$expected


## Reaction
## Status Str. Dislike Dislike Neutral Like Str. Like
## Smoker 10.00794 14.37037 28.99735 21.04233 22.58201
## Non-smoker 28.99206 41.62963 84.00265 60.95767 65.41799
# Contribution to chi-squared statistic
chisq.summary$residuals^2
## Reaction
## Status Str. Dislike Dislike Neutral Like
## Smoker 0.4028612 0.009545628 1.2425876 8.514567e-05
## Non-smoker 0.1390660 0.003295110 0.4289359 2.939192e-05
## Reaction
## Status Str. Like
## Smoker 0.5681868
## Non-smoker 0.1961356
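
The row and column proportions shown at the start of this example can be reproduced with prop.table(); a short sketch:
# Row and column proportions for the antismokead table.
round(prop.table(antismokead, margin = 1), 3)   # row proportions
round(prop.table(antismokead, margin = 2), 3)   # column proportions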

7.9.2 Further Analyses in Two-Way Tables


The χ2s statistic is a summary measure of independence or homogeneity. A careful
look at the data usually reveals the nature of the association or heterogeneity
when the test is significant. There are numerous meaningful ways to explore two-
way tables to identify sources of association or heterogeneity. For example, in the
comparison of age distributions across locations, you might consider the 4 × 2 tables
comparing all possible pairs of locations. Another possibility would be to compare
the proportion in the 75+ age category across locations. For the second comparison
you need a 2×3 table of counts, where the two rows correspond to the individuals less
than 75 years old and those 75+ years old, respectively (i.e., collapse the first three
rows of the original 4 × 3 table). The possibilities are almost limitless in large tables.
Of course, theoretically generated comparisons are preferred to data dredging.

Example: drugs and nausea, testing for homogeneity A randomized double-


blind experiment compared the effectiveness of several drugs in reducing postoperative
nausea. All patients were anesthetized with nitrous oxide and ether. The following
table shows the incidence of nausea during the first four postoperative hours for four
drugs and a placebo. Compare the drugs to each other and to the placebo.
Drug # with Nausea # without Nausea Sample Size
Placebo 96 70 166
Chlorpromazine 52 100 152
Dimenhydrinate 52 33 85
Pentobarbitol (100mg) 35 32 67
Pentobarbitol (150mg) 37 48 85
Let pPL be the probability of developing nausea given a placebo, and define pCH ,
pDI , pPE100 , and pPE150 analogously. A simple initial analysis would be to test homo-


geneity of proportions: H0 : pPL = pCH = pDI = pPE100 = pPE150 against HA : not


H0 .
The data were entered as frequencies. The output shows that the proportion of
patients exhibiting nausea (see the column percents — the cell and row percentages
are not interpretable, so they are omitted) is noticeably different across drugs. In
particular, Chlorpromazine is the most effective treatment with p̂CH = 0.34 and
Dimenhydrinate is the least effective with p̂DI = 0.61.
The p-value for the chi-squared test is 0.00005, which leads to rejecting H0 at the
0.05 or 0.01 levels. The data strongly suggest there are differences in the effectiveness
of the various treatments for postoperative nausea.
#### Example: drugs and nausea, testing for homogeneity
nausea <-
matrix(c(96, 70, 52, 100, 52, 33, 35, 32, 37, 48),
nrow = 5, byrow = TRUE,
dimnames = list("Drug" = c("PL", "CH", "DI", "PE100", "PE150"),
"Result" = c("Nausea", "No Nausea")))
nausea
## Result
## Drug Nausea No Nausea
## PL 96 70
## CH 52 100
## DI 52 33
## PE100 35 32
## PE150 37 48
# Sorted proportions of nausea by drug
nausea.prop <- sort(nausea[,1]/rowSums(nausea))
nausea.prop
## CH PE150 PE100 PL DI
## 0.3421053 0.4352941 0.5223881 0.5783133 0.6117647
# chi-sq test of association
chisq.summary <- chisq.test(nausea, correct=FALSE)
chisq.summary
##
## Pearson's Chi-squared test
##
## data: nausea
## X-squared = 24.827, df = 4, p-value = 5.451e-05
# All expected frequencies are at least 5
chisq.summary$expected
## Result
## Drug Nausea No Nausea
## PL 81.35495 84.64505
## CH 74.49369 77.50631
## DI 41.65766 43.34234
## PE100 32.83604 34.16396
## PE150 41.65766 43.34234


A sensible follow-up analysis is to identify which treatments were responsible


for the significant differences. For example, the placebo and chlorpromazine can be
compared using a test of pPL = pCH or with a CI for pPL − pCH .

In certain experiments, specific comparisons are of interest, for example a com-


parison of the drugs with the placebo. Alternatively, all possible comparisons might
be deemed relevant. The second case is suggested here based on the problem descrip-
tion. I will use a Bonferroni adjustment to account for the multiple comparisons.
The Bonferroni adjustment accounts for data dredging, but at a cost of less sensitive
comparisons.

There are 10 possible comparisons here. The Bonferroni analysis with an over-
all Family Error Rate of 0.05 (or less) tests the 10 individual hypotheses at the
0.05/10=0.005 level.
nausea.table <- data.frame(Interval = rep(NA,10)
, CI.lower = rep(NA,10)
, CI.upper = rep(NA,10)
, Z = rep(NA,10)
, p.value = rep(NA,10)
, sig.temp = rep(NA,10)
, sig = rep(NA,10))
# row names for table
nausea.table[,1] <- c("p_PL - p_CH"
, "p_PL - p_DI"
, "p_PL - p_PE100"
, "p_PL - p_PE150"
, "p_CH - p_DI"
, "p_CH - p_PE100"
, "p_CH - p_PE150"
, "p_DI - p_PE100"
, "p_DI - p_PE150"
, "p_PE100 - p_PE150")
# test results together in a table
i.tab <- 0
for (i in 1:4) {
for (j in (i+1):5) {
i.tab <- i.tab + 1
nausea.summary <- prop.test(nausea[c(i,j),], correct = FALSE, conf.level = 1-0.05/10)
nausea.table[i.tab, 2:6] <- c(nausea.summary$conf.int[1]
, nausea.summary$conf.int[2]
, sign(-diff(nausea.summary$estimate)) * nausea.summary$statistic^0.5
, nausea.summary$p.value
, (nausea.summary$p.value < 0.05/10))
if (nausea.table$sig.temp[i.tab] == 1) { nausea.table$sig[i.tab] <- "*" }
else { nausea.table$sig[i.tab] <- " " }
}
}

# remove temporary sig indicator


nausea.table <- subset(nausea.table, select = -sig.temp)
#nausea.table

The following table gives two-sample tests of proportions with nausea and 99.5%
CIs for the differences between the ten pairs of proportions. The only two p-values
less than 0.005 correspond to pPL − pCH and pCH − pDI . I am 99.5% confident
that pCH is between 0.084 and 0.389 less than pPL , and I am 99.5% confident that pCH
is between 0.086 and 0.453 less than pDI . The other differences are not significant.


Interval CI.lower CI.upper Z p.value sig


1 p PL - p CH 0.0838 0.3887 4.2182 0.0000 *
2 p PL - p DI −0.2167 0.1498 −0.5099 0.6101
3 p PL - p PE100 −0.1464 0.2582 0.7788 0.4361
4 p PL - p PE150 −0.0424 0.3284 2.1485 0.0317
5 p CH - p DI −0.4532 −0.0861 −4.0122 0.0001 *
6 p CH - p PE100 −0.3828 0.0222 −2.5124 0.0120
7 p CH - p PE150 −0.2788 0.0924 −1.4208 0.1554
8 p DI - p PE100 −0.1372 0.3160 1.1058 0.2688
9 p DI - p PE150 −0.0352 0.3881 2.3034 0.0213
10 p PE100 - p PE150 −0.1412 0.3154 1.0677 0.2857

Using ANOVA-type groupings, and arranging the treatments from most to least
effective (low proportions to high), we get:
CH (0.34)  PE150 (0.44)  PE100 (0.52)  PL (0.58)  DI (0.61)
---------------------------------------
           ---------------------------------------------------
Treatments joined by a common underline are not significantly different: the first
line groups CH, PE150, and PE100, and the second groups PE150, PE100, PL, and DI
(the only significant pairwise differences are PL vs. CH and CH vs. DI).



Chapter 8

Correlation and Regression

Contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 276
8.2 Logarithmic transformations . . . . . . . . . . . . . . . . . 278
8.2.1 Log-linear and log-log relationships: amoebas, squares, and
cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
8.3 Testing that ρ = 0 . . . . . . . . . . . . . . . . . . . . . . . 285
8.3.1 The Spearman Correlation Coefficient . . . . . . . . . . . . 285
8.4 Simple Linear Regression . . . . . . . . . . . . . . . . . . . 289
8.4.1 Linear Equation . . . . . . . . . . . . . . . . . . . . . . . . 290
8.4.2 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8.5 ANOVA Table for Regression . . . . . . . . . . . . . . . . 293
8.5.1 Brief discussion of the output for blood loss problem . . . . 297
8.6 The regression model . . . . . . . . . . . . . . . . . . . . . 297
8.6.1 Back to the Data . . . . . . . . . . . . . . . . . . . . . . . . 299
8.7 CI and tests for β1 . . . . . . . . . . . . . . . . . . . . . . . 300
8.7.1 Testing β1 = 0 . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.8 A CI for the population regression line . . . . . . . . . . 301
8.8.1 CI for predictions . . . . . . . . . . . . . . . . . . . . . . . . 302
8.8.2 A further look at the blood loss data . . . . . . . . . . . . . 303
8.9 Model Checking and Regression Diagnostics . . . . . . . 304
8.9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 304
8.9.2 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . 305


8.9.3 Checking Normality . . . . . . . . . . . . . . . . . . . . . . 310


8.9.4 Checking Independence . . . . . . . . . . . . . . . . . . . . 310
8.9.5 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.9.6 Influential Observations . . . . . . . . . . . . . . . . . . . . 312
8.9.7 Summary of diagnostic measures . . . . . . . . . . . . . . . 315
8.10 Regression analysis suggestion . . . . . . . . . . . . . . . . 315
8.10.1 Residual and Diagnostic Analysis of the Blood Loss Data . 316
8.10.2 Gesell data . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.11 Weighted Least Squares . . . . . . . . . . . . . . . . . . . 326

Learning objectives
After completing this topic, you should be able to:
select graphical displays that reveal the relationship between two continuous vari-
ables.
summarize model fit.
interpret model parameters, such as slope and R2 .
assess the model assumptions visually and numerically.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
6. summarize data visually, numerically, and descriptively.
8. use statistical software.
12. make evidence-based decisions.

8.1 Introduction
Suppose we select n = 10 people from the population of college seniors who plan
to take the medical college admission test (MCAT) exam. Each takes the test, is
coached, and then retakes the exam. Let Xi be the pre-coaching score and let Yi
be the post-coaching score for the ith individual, i = 1, 2, . . . , n. There are several
questions of potential interest here, for example: Are Y and X related (associated),
and how? Does coaching improve your MCAT score? Can we use the data to develop
a mathematical model (formula) for predicting post-coaching scores from the pre-
coaching scores? These questions can be addressed using correlation and regression
models.


The correlation coefficient is a standard measure of association or relationship


between two features Y and X. Most scientists equate Y and X being correlated
to mean that Y and X are associated, related, or dependent upon each other.
However, correlation is only a measure of the strength of a linear relationship. For
later reference, let ρ be the correlation between Y and X in the population and let
r be the sample correlation. I define r below. The population correlation is defined
analogously from population data.
Suppose each of n sampled individuals is measured on two quantitative charac-
teristics called Y and X. The data are pairs of observations (X1 , Y1 ), (X2 , Y2 ), . . .,
(Xn , Yn ), where (Xi , Yi ) is the (X, Y ) pair for the ith individual in the sample. The
sample correlation between Y and X, also called the Pearson product moment
correlation coefficient, is
$$ r = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}, $$
where
$$ S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$
is the sample covariance between Y and X, and $S_Y = \sqrt{\sum_i (Y_i - \bar{Y})^2/(n-1)}$ and
$S_X = \sqrt{\sum_i (X_i - \bar{X})^2/(n-1)}$ are the standard deviations for the Y and X samples.

Important properties of r:
1. −1 ≤ r ≤ 1.
2. If Yi tends to increase linearly with Xi then r > 0.
3. If Yi tends to decrease linearly with Xi then r < 0.
4. If there is a perfect linear relationship between Yi and Xi with a positive slope
then r = +1.
5. If there is a perfect linear relationship between Yi and Xi with a negative slope
then r = −1.
6. The closer the points (Xi , Yi ) come to forming a straight line, the closer r is to
±1.
7. The magnitude of r is unchanged if either the X or Y sample is transformed
linearly (such as feet to inches, pounds to kilograms, Celsius to Fahrenheit).
8. The correlation does not depend on which variable is called Y and which is
called X.
If r is near ±1, then there is a strong linear relationship between Y and X in the
sample. This suggests we might be able to accurately predict Y from X with a linear
equation (i.e., linear regression). If r is near 0, there is a weak linear relationship
between Y and X, which suggests that a linear equation provides little help for


predicting Y from X. The pictures below should help you develop a sense about the
size of r.
Note that r = 0 does not imply that Y and X are not related in the sample. It
only implies they are not linearly related. For example, in the last plot r = 0 yet
Yi = Xi^2, exactly.
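
A minimal sketch (hypothetical data) computing r from the definition and with cor(), and illustrating the last point:
# Hand calculation of the sample correlation versus cor(),
# for data where y is an exact (quadratic) function of x.
x <- seq(-10, 10, by = 1)
y <- x^2
r.hand <- sum((x - mean(x)) * (y - mean(y))) /
          sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(r.hand = r.hand, cor = cor(x, y))   # both essentially 0
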
[Figure: eight example scatterplots with sample correlations 1, −1, 0.3, −0.3, 0.7, −0.7, 0, and 0; the last panel shows an exact curved (quadratic) relationship with zero correlation.]

ClickerQ s — Correlation coefficients STT.02.02.010

ClickerQ s — Strong correlation STT.02.02.020

8.2 Logarithmic transformations


Logarithms1 are useful for understanding data, partly because they allow numbers
that vary by several orders of magnitude to be viewed on a common scale, and more
importantly because they allow exponential and power-law relations to be transformed
into linearity.

8.2.1 Log-linear and log-log relationships: amoebas, squares, and cubes
Suppose you have an amoeba that takes one hour to divide, and then the two amoebas
each divide in one more hour, and so forth. What is the equation of the number of
amoebas, y, as a function of time, x (in hours)? It can be written as y = 2^x or, on
the logarithmic scale, log(y) = (log(2))x = 0.30x.
1
From: Gelman, Andrew and Deborah Nolan (2002). Teaching statistics: A bag of tricks. Oxford
University Press.


Suppose you have the same example, but the amoeba takes three hours to divide
at each step. Then the number of amoebas y after time x has the equation y = 2^(x/3) =
(2^(1/3))^x = 1.26^x or, on the logarithmic scale, log(y) = (log(1.26))x = 0.10x. The slope
of 0.10 is one-third the earlier slope of 0.30 because the population is growing at
one-third the rate.
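
A one-line check of the two slopes quoted above:
# Slopes on the log10 scale are log10 of the hourly growth factor.
log10(2)         # about 0.30: doubling every hour
log10(2^(1/3))   # about 0.10: doubling every three hours
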
In the example of exponential growth of amoebas, y is logged while x remains the
same. For power-law relations, it makes sense to log both x and y. How does the
area of a square relate to its circumference (perimeter)? If the side of the square has
length L, then the area is L^2 and the circumference is 4L; thus

area = (circumference/4)^2.

Taking the logarithm of both sides yields,

log(area) = 2(log(circumference) − log(4))


log(area) = −1.20 + 2 log(circumference),

a linear relation on the log-log scale.


What is the relation between the surface area and volume of a cube? In terms of
the side length L, they are 6L^2 and L^3, respectively. On the original scale, this is

surface area = 6(volume)^(2/3),

or, on the logarithmic scale,

log(surface area) = log(6) + (2/3) log(volume).
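
A quick numerical check of the square relation (a sketch, with arbitrary side lengths):
# log10(area) should equal -2*log10(4) + 2*log10(circumference) exactly.
L <- c(1, 2, 5, 10)
area <- L^2
circ <- 4 * L
cbind(log10(area), -2 * log10(4) + 2 * log10(circ))   # identical columns
-2 * log10(4)                                         # about -1.20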

Example: Log-linear transformation: world population Consider the world
population from the year 0 to 2000. Compare the data in the table below with the
plots on the original and logarithmic scales; again, the plots reveal the pattern much
more clearly than the table.
On the raw scale, all you can see is that the population has increased very fast
recently. On the log scale, convex curvature is apparent — that is, the rate of in-
crease has itself increased. On the logarithmic graph, the least-squares regression line
is drawn. The estimated world population has been growing even faster than expo-
nentially! What would you guess the population to be for the year 1400? Would you be
surprised that it was 350 million? It is actually lower than the year 1200 population,
because of plague and other factors. This is an illustration that even interpolation
can sometimes go awry.
#### Example: Log-linear population growth
pop <- read.table(text="
Year Pop_M
1 170
400 190


800 220
1200 360
1600 545
1800 900
1850 1200
1900 1625
1950 2500
1975 3900
2000 6080
2012 7000
", header=TRUE)
pop$Pop <- 1e6 * pop$Pop_M # convert from millions to number of people
pop$PopL10 <- log10(pop$Pop)

# calculate the residuals from a simple linear regression


lm.fit <- lm(PopL10 ~ Year, data = pop)
# include those residuals in the pop table
pop$Res <- residuals(lm.fit)
pop$R10 <- 10^residuals(lm.fit) # residuals on the original scale

Year Pop M Pop PopL10 Res R10


1 1 170 170000000 8.230 0.293 1.964
2 400 190 190000000 8.279 0.050 1.122
3 800 220 220000000 8.342 −0.178 0.663
4 1200 360 360000000 8.556 −0.256 0.554
5 1600 545 545000000 8.736 −0.368 0.428
6 1800 900 900000000 8.954 −0.296 0.505
7 1850 1200 1200000000 9.079 −0.208 0.619
8 1900 1625 1625000000 9.211 −0.113 0.771
9 1950 2500 2500000000 9.398 0.038 1.091
10 1975 3900 3900000000 9.591 0.213 1.631
11 2000 6080 6080000000 9.784 0.387 2.439
12 2012 7000 7000000000 9.845 0.440 2.752
library(ggplot2)

p1 <- ggplot(pop, aes(x = Year, y = Pop))


p1 <- p1 + geom_point()
#print(p1)

p2 <- ggplot(pop, aes(x = Year, y = PopL10))


p2 <- p2 + geom_point()
p2 <- p2 + geom_smooth(method = lm, se = FALSE)
#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1
, top = "Log-linear transformation: world population")


[Figure: Log-linear transformation: world population. Two panels: Pop versus Year on the original scale, and PopL10 versus Year on the log10 scale with the least-squares line.]

When using data of this nature, consider the source of the population numbers.
How would you estimate the population of the world in the year 1?

Example: Log-log transformation: metabolic rates A rich source of examples
when covering log-log transformations is biological scaling relations². The relationship³
between body mass (M, g) and basal metabolic rate (BMR, ml of O2 per h
(similar to Watts?)) for mammalian orders is summarized for selected data in the plots
below, both on the original and log-log scales. The linear regression summarizes the
dark points, the mean for each of the species groups, and the colored points are
individual species. The curved regression is found by inverting the linear regression onto
the original scale. The third plot displays the log axes with values on the original
scale.
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC153045/
# Supp:
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC153045/bin/pnas_0436428100_index.html
# Supporting information for White and Seymour (2003)
# Proc. Natl. Acad. Sci. USA, 10.1073/pnas.0436428100

library(gdata)
## gdata: read.xls support for ’XLS’ (Excel 97-2004) files
## gdata: ENABLED.
##
## gdata: read.xls support for ’XLSX’ (Excel 2007+) files
## gdata: ENABLED.
##
## Attaching package: ’gdata’
## The following object is masked from ’package:gridExtra’:
##
## combine

² One of the world experts in allometric scaling is Prof. Jim Brown, UNM Biology, http://biology.unm.edu/jhbrown
³ White and Seymour (2003) PNAS, 10.1073/pnas.0436428100


## The following object is masked from ’package:stats’:


##
## nobs
## The following object is masked from ’package:utils’:
##
## object.size
## The following object is masked from ’package:base’:
##
## startsWith
fn <- "data/ADA1_notes_08_data_log-logScaling_BodyMassMetabolicRate_2003_WhiteSeymour.xlsx"
bm.bmr <- read.xls(fn, skip = 4, stringsAsFactors = TRUE)
bm.bmr$Log10BodyMass <- log10(bm.bmr$BodyMass)
bm.bmr$Log10BaseMetRate <- log10(bm.bmr$BaseMetRate)

# remove a very strange group


bm.bmr <- subset(bm.bmr, !(Group == "Artiodactyla 7"))
str(bm.bmr)
## 'data.frame': 634 obs. of 9 variables:
## $ Group : Factor w/ 18 levels "Artiodactyla 7",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Genus : Factor w/ 88 levels "","Acrobatidae",..: 1 12 12 12 12 12 12 12 12 31 ...
## $ Species : Factor w/ 621 levels "","2n = 52","2n = 54",..: 1 22 67 68 79 206 614 615 616 8 ...
## $ BodyMass : num 4452 3600 10000 7720 5444 ...
## $ T : num 37.5 38.6 37 38 38.2 38.8 38 38.7 NA 39 ...
## $ BaseMetRate : num 1244 1374 2687 3860 1524 ...
## $ Ref : Factor w/ 239 levels "","1","10","100",..: 1 230 3 15 24 37 3 50 61 72 ...
## $ Log10BodyMass : num 3.65 3.56 4 3.89 3.74 ...
## $ Log10BaseMetRate: num 3.09 3.14 3.43 3.59 3.18 ...
# log-log scale linear regression
lm.fit <- lm(Log10BaseMetRate ~ Log10BodyMass, data = bm.bmr)
# coefficients for regression line
coef(lm.fit)
## (Intercept) Log10BodyMass
## 0.6775600 0.6575572
library(ggplot2)

p1 <- ggplot(subset(bm.bmr, (Genus == "")), aes(x = BodyMass, y = BaseMetRate))


p1 <- p1 + geom_point(data = subset(bm.bmr, !(Genus == "")), aes(colour = Group), alpha = 0.2)
p1 <- p1 + geom_point(size = 3)
# Using a custom function
f.org.scale <- function(BodyMass) { 10^coef(lm.fit)[1] * BodyMass ^ coef(lm.fit)[2]}
p1 <- p1 + stat_function(fun = f.org.scale, size = 1)
p1 <- p1 + labs(title = paste("BaseMetRate = ", signif(10^coef(lm.fit)[1], 3), " * ", "BodyMass ^ ", signif(coef(lm.fit)[2], 3), sep = ""))
p1 <- p1 + scale_y_continuous(limits=c(0, 1300))
p1 <- p1 + scale_x_continuous(limits=c(0, 5000))
#print(p1)

p2 <- ggplot(subset(bm.bmr, (Genus == "")), aes(x = Log10BodyMass, y = Log10BaseMetRate))


p2 <- p2 + geom_point(data = subset(bm.bmr, !(Genus == "")), aes(colour = Group), alpha = 0.3)
p2 <- p2 + geom_point(size = 3)
p2 <- p2 + geom_smooth(method = lm, se = FALSE, fullrange = TRUE, size = 1, colour = "black")
p2 <- p2 + labs(title = paste("log10(BaseMetRate) = ", signif(coef(lm.fit)[1], 3), " + ", signif(coef(lm.fit)[2], 3), " log10(BodyMass)", sep = ""))
p2 <- p2 + scale_y_continuous(limits=c(NA, log10(1300)))
p2 <- p2 + scale_x_continuous(limits=c(NA, log10(5000)))
#print(p2)

p3 <- ggplot(subset(bm.bmr, (Genus == "")), aes(x = BodyMass, y = BaseMetRate))


p3 <- p3 + geom_point(data = subset(bm.bmr, !(Genus == "")), aes(colour = Group), alpha = 0.3)
p3 <- p3 + geom_point(size = 3)
p3 <- p3 + geom_smooth(method = lm, se = FALSE, fullrange = TRUE, size = 1, colour = "black")
p3 <- p3 + labs(title = paste("log10(BaseMetRate) = ", signif(coef(lm.fit)[1], 3), " + ", signif(coef(lm.fit)[2], 3), " log10(BodyMass)", sep = ""))
p3 <- p3 + scale_y_log10()#limits=c(NA, log10(1300)))
p3 <- p3 + scale_x_log10()#limits=c(NA, log10(5000)))
#print(p3)

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), ncol=1
, top = "Log-log transformation: metabolic rates")
## Warning: Removed 56 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing non-finite values (stat smooth).
## Warning: Removed 56 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing non-finite values (stat smooth).
## Warning: Removed 1 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).


[Figure: Log-log transformation: metabolic rates. Three panels: BaseMetRate versus BodyMass on the original scale with the fitted power curve BaseMetRate = 4.76 * BodyMass^0.658; Log10BaseMetRate versus Log10BodyMass with the fitted line log10(BaseMetRate) = 0.678 + 0.658 log10(BodyMass); and the same data on log-scaled axes labeled in original units. Points are colored by mammalian order (Group), with group means in black.]

The table below provides predictions over a range of scales. Note that the smallest


mammal in this dataset is about 2.4 grams. A 5-gram mammal uses about 13.7 Watts,
so 1000 5-gram mammals use about 13714 Watts, whereas one 5000-gram mammal
uses about 1288 Watts. Thus, larger mammals give off less heat than the equivalent weight
of many smaller mammals.
pred.bm.bmr <- data.frame(BodyMass = 5 * c(1, 10, 100, 1000))
pred.bm.bmr$Log10BodyMass <- log10(pred.bm.bmr$BodyMass)
pred.bm.bmr$Log10BaseMetRate <- predict(lm.fit, pred.bm.bmr)
pred.bm.bmr$BaseMetRate <- 10^pred.bm.bmr$Log10BaseMetRate

tab.pred <- subset(pred.bm.bmr, select = c("BodyMass", "BaseMetRate"))


BodyMass BaseMetRate
1 5 13.71
2 50 62.33
3 500 283.33
4 5000 1287.79

We want to focus on the slope in the log-log plot, which is the exponent in the
original-scale plot. On the log scale we have

log(BaseMetRate) = 0.678 + 0.658 log(BodyMass).

To interpret the slope, for each unit increase in (predictor, x-variable) log(BodyMass),
the expected increase in (response, y-variable) log(BaseMetRate) is 0.658. By expo-
nentiating on both sides, the expression on the original scale is

BaseMetRate = 10^0.678 × BodyMass^0.658 = 4.76 × BodyMass^0.658.

For example, if you multiply body mass by 10, then you multiply metabolic rate by
10^0.658 = 4.55. If you multiply body mass by 100, then you multiply metabolic rate
by 100^0.658 = 20.7, and so forth. The relation between metabolic rate and body mass
is less than linear (that is, the exponent 0.658 is less than 1.0, and the line in the
original-scale plot curves downward, not upward), which implies that the equivalent
mass of small mammals gives off more heat, and the equivalent mass of large mammals
gives off less heat.
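As a quick numeric check (a sketch; it assumes the lm.fit object from the log-log regression above is still in the workspace):
# Multiplicative effect implied by the log-log slope (sketch)
b1 <- coef(lm.fit)["Log10BodyMass"]   # about 0.658
10^b1     # multiply body mass by 10:  metabolic rate multiplied by about 4.55
100^b1    # multiply body mass by 100: metabolic rate multiplied by about 20.7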
This seems related to the general geometrical relation that surface area and volume
are proportional to linear dimension to the second and third power, respectively, and
thus surface area should be proportional to volume to the 2/3 power. Heat produced
by a mammal is emitted from its surface, and it would thus be reasonable to suspect
metabolic rate to be proportional to the 2/3 power of body mass. Biologists have
considered whether the empirical slope is closer to 3/4 or 2/3; the important thing
here is to think about log transformations and power laws (and have a chat with Jim


Brown or someone from his lab at UNM for the contextual details). As an aside,
something not seen from this plot is that males tend to be above the line and females
below the line.

8.3 Testing that ρ = 0


Suppose you want to test H0 : ρ = 0 against HA : ρ ≠ 0, where ρ is the population
correlation between Y and X. This test is usually interpreted as a test of no associ-
ation, or relationship, between Y and X in the population. Keep in mind, however,
that ρ measures the strength of a linear relationship.
The standard test of H0 : ρ = 0 is based on the magnitude of r. If we let
$$ t_s = r \sqrt{\frac{n-2}{1-r^2}}, $$
then the test rejects H0 in favor of HA if |ts| ≥ tcrit, where tcrit is the two-sided test
critical value from a t-distribution with df = n − 2. The p-value for the test is the
area under the t-curve outside ±ts (i.e., two-tailed test p-value).
This test assumes that the data are a random sample from a bivariate normal
population for (X, Y ). This assumption implies that all linear combinations of X
and Y , say aX + bY , are normal. In particular, the (marginal) population frequency
curves for X and Y are normal. At a minimum, you should make boxplots of the
X and Y samples to check marginal normality. For large-sized samples, a plot of
Y against X should be roughly an elliptical cloud, with the density of the points
decreasing as the points move away from the center of the cloud.
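As a check of the formula (a sketch with simulated data, not from the original notes), the t-statistic and two-sided p-value can be computed by hand and compared with cor.test():
# Sketch: hand-computed test of H0: rho = 0 versus cor.test() (simulated data)
set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
r  <- cor(x, y)
n  <- length(x)
ts <- r * sqrt((n - 2) / (1 - r^2))        # test statistic
c(t = ts, p.value = 2 * pt(-abs(ts), df = n - 2))
cor.test(x, y)                             # same t and p-value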

8.3.1 The Spearman Correlation Coefficient

The Pearson correlation r can be highly influenced by outliers in one or both samples.
For example, r ≈ −1 in the plot below. If you delete the one extreme case with the
largest X and smallest Y value then r ≈ 0. The two analyses are contradictory. The
first analysis (ignoring the plot) suggests a strong linear relationship, whereas the
second suggests the lack of a linear relationship. I will not strongly argue that you
should (must?) delete the extreme case, but I am concerned about any conclusion
that depends heavily on the presence of a single observation in the data set.


[Figure: scatterplot of Y versus X. Most of the points show no linear trend, but a single extreme case with the largest X and smallest Y value makes r ≈ −1.]

Spearman’s rank correlation coefficient rS is a sensible alternative to r when


normality is unreasonable or outliers are present. Most books give a computational
formula for rS . I will verbally describe how to compute rS . First, order the Xi s and
assign them ranks. Then do the same for the Yi s and replace the original data pairs by
the pairs of ranked values. The Spearman rank correlation is the Pearson correlation
computed from the pairs of ranks.
The Spearman correlation rS estimates the population rank correlation coef-
ficient, which is a measure of the strength of linear relationship between population
ranks. The Spearman correlation, as with other rank-based methods, is not sensitive
to the presence of outliers in the data (or any information about the marginal distri-
bution of X or Y ). In the plot above, rS ≈ 0 whether the unusual point is included
or excluded from the analysis. In samples without unusual observations and a linear
trend, you often find that the Spearman and Pearson correlations are similar, rS ≈ r.
An important point to note is that the magnitude of the Spearman correlation
does not change if either X or Y or both are transformed (monotonically). Thus, if
rS is noticeably greater than r, a transformation of the data might provide a stronger
linear relationship.
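A small sketch (simulated data, not from the notes) makes the definition concrete: the Spearman correlation equals the Pearson correlation of the ranks, and it is unchanged by a monotone transformation of either variable.
# Sketch: Spearman correlation = Pearson correlation of the ranks
set.seed(2)
x <- rexp(25)
y <- x^2 + rnorm(25, sd = 0.5)
cor(x, y, method = "spearman")
cor(rank(x), rank(y))                      # identical (no ties here)
cor(log(x), y, method = "spearman")        # unchanged by the monotone log transform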

Example: Blood loss Eight patients underwent a thyroid operation. Three vari-
ables were measured on each patient: weight in kg, time of operation in minutes, and
blood loss in ml. The scientists were interested in the factors that influence blood
loss.


#### Example: Blood loss


thyroid <- read.table(text="
weight time blood_loss
44.3 105 503
40.6 80 490
69.0 86 471
43.7 112 505
50.3 109 482
50.2 100 490
35.4 96 513
52.2 120 464
", header=TRUE)

# show the structure of the data.frame


str(thyroid)
## 'data.frame': 8 obs. of 3 variables:
## $ weight : num 44.3 40.6 69 43.7 50.3 50.2 35.4 52.2
## $ time : int 105 80 86 112 109 100 96 120
## $ blood_loss: int 503 490 471 505 482 490 513 464
# display the data.frame
#thyroid

weight time blood loss


1 44.3 105 503
2 40.6 80 490
3 69.0 86 471
4 43.7 112 505
5 50.3 109 482
6 50.2 100 490
7 35.4 96 513
8 52.2 120 464

Below, we calculate the Pearson correlations between all pairs of variables (left),
as well as the p-values (right) for testing whether the correlation is equal to zero.
p.corr <- cor(thyroid);
#p.corr

# initialize pvalue table with dim names


p.corr.pval <- p.corr;
for (i1 in 1:ncol(thyroid)) {
for (i2 in 1:ncol(thyroid)) {
p.corr.pval[i1,i2] <- cor.test(thyroid[, i1], thyroid[, i2])$p.value
}
}
#p.corr.pval

             Pearson correlation                          p-value
            weight    time  blood_loss            weight    time  blood_loss
weight       1.000  −0.066      −0.772            0.0000  0.8761      0.0247
time        −0.066   1.000      −0.107            0.8761  0.0000      0.8003
blood_loss  −0.772  −0.107       1.000            0.0247  0.8003      0.0000

Similarly, we calculate the Spearman (rank) correlation table (left), as well as the
p-values (right) for testing whether the correlation is equal to zero.


s.corr <- cor(thyroid, method = "spearman");


#s.corr

# initialize pvalue table with dim names


s.corr.pval <- p.corr;
for (i1 in 1:ncol(thyroid)) {
for (i2 in 1:ncol(thyroid)) {
s.corr.pval[i1,i2] <- cor.test(thyroid[, i1], thyroid[, i2],
method = "spearman")$p.value
}
}
## Warning in cor.test.default(thyroid[, i1], thyroid[, i2], method = "spearman"): Cannot
compute exact p-value with ties
## Warning in cor.test.default(thyroid[, i1], thyroid[, i2], method = "spearman"): Cannot
compute exact p-value with ties
## Warning in cor.test.default(thyroid[, i1], thyroid[, i2], method = "spearman"): Cannot
compute exact p-value with ties
## Warning in cor.test.default(thyroid[, i1], thyroid[, i2], method = "spearman"): Cannot
compute exact p-value with ties
## Warning in cor.test.default(thyroid[, i1], thyroid[, i2], method = "spearman"): Cannot
compute exact p-value with ties
#s.corr.pval

             Spearman correlation                         p-value
            weight    time  blood_loss            weight    time  blood_loss
weight       1.000   0.286      −0.874            0.0000  0.5008      0.0045
time         0.286   1.000      −0.156            0.5008  0.0000      0.7128
blood_loss  −0.874  −0.156       1.000            0.0045  0.7128      0.0000

Here are scatterplots for the original data and the ranks of the data using ggpairs()
from the GGally package with ggplot2.
# Plot the data using ggplot
library(ggplot2)
library(GGally)
p1 <- ggpairs(thyroid[,1:3], progress=FALSE)
print(p1)

thyroid$rank_weight <- rank(thyroid$weight )


thyroid$rank_time <- rank(thyroid$time )
thyroid$rank_blood_loss <- rank(thyroid$blood_loss)

p2 <- ggpairs(thyroid[,4:6], progress=FALSE)


print(p2)


[Figure: two ggpairs scatterplot matrices. Left: weight, time, and blood_loss, with Pearson correlations −0.0663 (weight, time), −0.772 (weight, blood_loss), and −0.107 (time, blood_loss). Right: the ranked variables, with correlations 0.286, −0.874, and −0.156.]

Comments:
1. (Pearson correlations). Blood loss tends to decrease linearly as weight increases,
so r should be negative. The output gives r = −0.77. There is not much of
a linear relationship between blood loss and time, so r should be close to 0.
The output gives r = −0.11. Similarly, weight and time have a weak negative
correlation, r = −0.07.
2. The Pearson and Spearman correlations are fairly consistent here. Only the
correlation between blood loss and weight is significantly different from zero at
the α = 0.05 level (the p-values are given below the correlations).
3. (Spearman p-values) R gives the correct p-values. Calculating the p-value using
the Pearson correlation on the ranks is not correct, strictly speaking.

ClickerQ s — Coney Island STT.02.02.050

8.4 Simple Linear Regression


In linear regression, we are interested in developing a linear equation that best sum-
marizes the relationship in a sample between the response variable Y and the
predictor variable (or independent variable) X. The equation is also used to
predict Y from X. The variables are not treated symmetrically in regression, but the
appropriate choice for the response and predictor is usually apparent.


8.4.1 Linear Equation


If there is a perfect linear relationship between Y and X then Y = β0 + β1 X for some
β0 and β1 , where β0 is the Y-intercept and β1 is the slope of the line. Two plots of
linear relationships are given below. The left plot has β0 = 5 and β1 = 3. The slope
is positive, which indicates that Y increases linearly when X increases. The right
plot has β0 = 7 and β1 = −2. The slope is negative, which indicates that Y decreases
linearly when X increases.

[Figure: two line plots over X from −1 to 4: the line Y = 5 + 3X (positive slope) and the line Y = 7 − 2X (negative slope).]

ClickerQ s — Equation STT.02.03.010

8.4.2 Least Squares


Data rarely, if ever, fall on a straight line. However, a straight line will often describe
the trend for a set of data. Given a data set, (Xi , Yi ), i = 1, . . . , n, with a linear
trend, what linear equation “best” summarizes the observed relationship between Y
and X? There is no universally accepted definition of “best”, but many researchers
accept the Least Squares line (LS line) as a reasonable summary.
Mathematically, the LS line chooses the values of β0 and β1 that minimize
$$ \sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 X_i)\}^2 $$


over all possible choices of β0 and β1 . These values can be obtained using calculus.
Rather than worry about this calculation, note that the LS line makes the sum of
squared (vertical) deviations between the responses Yi and the line as small as possible,
over all possible lines. The LS line goes through the mean point, (X̄, Ȳ ), which is
typically in the “the heart” of the data, and is often closely approximated by an
eye-ball fit to the data.


[Figure: scatterplot of Y versus X with the LS line; the line passes through the mean point (X̄, Ȳ) in the heart of the data.]

The equation of the LS line is
$$ \hat{Y} = b_0 + b_1 X $$
where the intercept b0 satisfies
$$ b_0 = \bar{Y} - b_1 \bar{X} $$
and the slope is
$$ b_1 = \frac{\sum_i (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_i (X_i - \bar{X})^2} = r \frac{S_Y}{S_X}. $$
As before, r is the Pearson correlation between Y and X, whereas SY and SX are the
sample standard deviations for the Y and X samples, respectively. The sign of the
slope and the sign of the correlation are identical (i.e., + correlation implies +
slope).
Special symbols b0 and b1 identify the LS intercept and slope to distinguish the
LS line from the generic line Y = β0 + β1 X. You should think of Ŷ as the fitted
value at X, or the value of the LS line at X.
Fit a regression for the equation estimates from summary(). Note that we’ll reuse
the output of lm() over and over again.


lm.blood.wt <- lm(blood_loss ~ weight, data = thyroid)


lm.blood.wt
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Coefficients:
## (Intercept) weight
## 552.4 -1.3
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.blood.wt)
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.565 -6.189 4.712 8.192 9.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.4420 21.4409 25.77 2.25e-07 ***
## weight -1.3003 0.4364 -2.98 0.0247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.66 on 6 degrees of freedom
## Multiple R-squared: 0.5967,Adjusted R-squared: 0.5295
## F-statistic: 8.878 on 1 and 6 DF, p-value: 0.02465
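As a check (a sketch, not in the original notes), the slope and intercept can be computed directly from the formulas b1 = r SY/SX and b0 = Ȳ − b1 X̄ and compared with the lm() output:
# Check the LS formulas against coef(lm.blood.wt)
r  <- cor(thyroid$blood_loss, thyroid$weight)
b1 <- r * sd(thyroid$blood_loss) / sd(thyroid$weight)
b0 <- mean(thyroid$blood_loss) - b1 * mean(thyroid$weight)
c(b0 = b0, b1 = b1)
coef(lm.blood.wt)                          # same values: 552.44 and -1.30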

Create a scatterplot with regression fit.


# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid, aes(x = weight, y = blood_loss))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = FALSE)
print(p)

# Base graphics: Plot the data with linear regression fit and confidence bands
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)


[Figure: blood_loss versus weight with the fitted regression line, drawn with ggplot2 (left) and base graphics (right).]

For the thyroid operation data with Y = Blood loss in ml and X = Weight
in kg, the LS line is Ŷ = 552.44 − 1.30X, or Predicted Blood Loss = 552.44 −
1.30 Weight. For an 86kg individual, the Predicted Blood Loss = 552.44 − 1.30 × 86 =
440.64ml.
The LS regression coefficients for this model are interpreted as follows. The inter-
cept b0 is the predicted blood loss for a 0 kg individual. The intercept has no meaning
here. The slope b1 is the predicted increase in blood loss for each additional kg of
weight. The slope is −1.30, so the predicted decrease in blood loss is 1.30 ml for each
increase of 1 kg in weight.
Any fitted linear relationship holds only approximately and does not necessarily
extend outside the range of the data. In particular, nonsensical predicted blood losses
of less than zero are obtained at very large weights outside the range of data.

8.5 ANOVA Table for Regression


The LS line minimizes
$$ \sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 X_i)\}^2 $$
over all choices for β0 and β1. Inserting the LS estimates b0 and b1 into this expression gives
$$ \text{Residual Sums of Squares} = \sum_{i=1}^{n} \{Y_i - (b_0 + b_1 X_i)\}^2. $$
Several bits of notation are needed. Let
Ŷi = b0 + b1 Xi


be the predicted or fitted Y-value for an X-value of Xi and let ei = Yi − Ŷi. The
fitted value Ŷi is the value of the LS line at Xi whereas the residual ei is the distance
that the observed response Yi is from the LS line. Given this notation,
$$ \text{Residual Sums of Squares} = \text{Res SS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2. $$

Here is a picture to clarify matters:


[Figure: scatterplot with the LS line; at one X-value the observed response, the fitted value on the line, and the residual (their vertical difference) are marked.]

The Residual SS, or sum of squared residuals, is small if each Ŷi is close to Yi (i.e., the line closely fits the data). It can be shown that
$$ \text{Total SS in } Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \ge \text{Res SS} \ge 0. $$
Also define
$$ \text{Regression SS} = \text{Reg SS} = \text{Total SS} - \text{Res SS} = b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}). $$
The Total SS measures the variability in the Y-sample. Note that
$$ 0 \le \text{Regression SS} \le \text{Total SS}. $$


The percentage of the variability in the Y-sample that is explained by the linear relationship between Y and X is
$$ R^2 = \text{coefficient of determination} = \frac{\text{Reg SS}}{\text{Total SS}}. $$
Given the definitions of the Sums of Squares, we can show 0 ≤ R² ≤ 1 and
$$ R^2 = \text{square of Pearson correlation coefficient} = r^2. $$
To understand the interpretation of R², at least in two extreme cases, note that

Reg SS = Total SS ⇔ Res SS = 0
⇔ all the data points fall on a straight line
⇔ all the variability in Y is explained by the linear relationship with X (which has variation)
⇔ R² = 1 (see the picture below).


[Figure: scatterplot in which every point lies exactly on a straight line with positive slope, illustrating R² = 1.]


Furthermore,

Reg SS = 0 ⇔ Total SS = Res SS
⇔ b1 = 0
⇔ LS line is Ŷ = Ȳ
⇔ none of the variability in Y is explained by a linear relationship
⇔ R² = 0.


[Figure: scatterplot with no linear relationship between Y and X, illustrating R² = 0.]

Each Sum of Squares has a corresponding df (degrees of freedom). The Sums of Squares and df are arranged in an analysis of variance (ANOVA) table:

Source            df    SS   MS
Regression         1   SSR  MSR
Residual (Error)  n−2  SSE  MSE
Total             n−1
The Total df is n − 1. The Residual df is n minus the number of parameters
(2) estimated by the LS line. The Regression df is the number of predictor variables
(1) in the model. A Mean Square is always equal to the Sum of Squares divided
by the df. Sometimes the following notation is used for the Residual MS: $s^2_{Y|X} = \text{Res SS}/(n-2)$.
# ANOVA table of the simple linear regression fit
anova(lm.blood.wt)


## Analysis of Variance Table


##
## Response: blood_loss
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 1207.45 1207.45 8.8778 0.02465 *
## Residuals 6 816.05 136.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
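As a check (a sketch, not in the original notes), R² can be recovered from the ANOVA sums of squares:
# R^2 = Reg SS / Total SS, computed from the ANOVA table
ss <- anova(lm.blood.wt)[, "Sum Sq"]
ss[1] / sum(ss)                            # 1207.45 / (1207.45 + 816.05) = 0.5967
summary(lm.blood.wt)$r.squared             # same value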

8.5.1 Brief discussion of the output for blood loss problem


1. Identify fitted line: Blood Loss = 552.44 − 1.30 Weight (i.e., b0 = 552.44 and
b1 = −1.30).
What is the line of best fit? What is the direction of the relationship?
2. Locate Analysis of Variance Table.
This tests the hypothesis H0 : βj = 0, j = 1, . . . , p (all p slope coefficients equal zero),
against HA : not H0, i.e., at least one βj ≠ 0. More on this later.
3. Locate Parameter Estimates Table.
Given that not all the betas are zero from the ANOVA, which parameter betas
are different from zero, and what are their estimated values and standard errors?
More on this later.
4. Note that R² = 0.5967 = (−0.77247)² = r².
R² indicates the proportion of the variance explained by the regression model.
This indicates the predictive ability of the model, and is not an indication of
model fit.

ClickerQ s — Salaries STT.02.02.060

8.6 The regression model


The following statistical model is assumed as a means to provide error estimates for
the LS line, regression coefficients, and predictions. Assume that the data (Xi , Yi ),
i = 1, . . . , n, are a sample of (X, Y ) values from the population of interest, and
1. The mean in the population of all responses Y at a given X value (sometimes
called $\mu_{Y|X}$) falls on a straight line, β0 + β1 X, called the population regression
line.
2. The variation among responses Y at a given X value is the same for each X,
and is denoted by $\sigma^2_{Y|X}$.
3. The population of responses Y at a given X is normally distributed.
4. The pairs (Xi , Yi ) are a random sample from the population. Alternatively,


we can think that the Xi s were fixed by the experimenter, and that the Yi are
random responses at the selected predictor values.

The model is usually written in the form
$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$
(i.e., Response = Mean Response + Residual), where the εi s are, by virtue of assumptions
2, 3, and 4, independent normal random variables with mean 0 and variance
$\sigma^2_{Y|X}$. The following picture might help see this. Note that the population regression
line is unknown, and is estimated from the data using the LS line.
[Figure: two panels illustrating the model. Left: the population regression line through the (X, Y) scatter. Right: a single point (X_i, Y_i) and its error ε_i, the vertical deviation from the population line.]

In the plot below, data are simulated where yi = 4 − 2xi + ei, where xi ∼
Gamma(3, 0.5) and ei ∼ Normal(0, 3²). The data are plotted, a linear regression
is fit, and the mean regression line is overlaid. Selected normal distributions with
variance estimated from the linear model fit are overlaid, one of which indicates limits
at two standard deviations. See the R code to create this image.
##
## Attaching package: ’scales’
## The following object is masked from ’package:coin’:
##
## pvalue


[Figure: simulated (x, y) data with the fitted regression line; normal density curves centered on the line at several x values illustrate the assumed error distribution, with limits marked at two standard deviations.]

Model assumptions In decreasing order of importance, the model assumptions are:

1. Validity. Most importantly, the data you are analyzing should map to the
research question you are trying to answer. This sounds obvious but is often
overlooked or ignored because it can be inconvenient.

2. Additivity and linearity. The most important mathematical assumption of


the regression model is that its deterministic component is a linear function of
the separate predictors.

3. Independence of errors. This assumption depends on how the data were


collected.

4. Equal variance of errors.

5. Normality of errors.

Normality and equal variance are typically minor concerns, unless you’re using the
model to make predictions for individual data points.

8.6.1 Back to the Data


There are three unknown population parameters in the model: β0 , β1 and σY2 |X .
Given the data, the LS line
Ŷ = b0 + b1 X


estimates the population regression line Y = β0 + β1 X. The LS line is our best


guess about the unknown population regression line. Here b0 estimates the intercept
β0 of the population regression line and b1 estimates the slope β1 of the population
regression line.
The ith observed residual ei = Yi − Ŷi, where Ŷi = b0 + b1 Xi is the ith fitted
value, estimates the unobservable residual εi (εi is unobservable because β0 and
β1 are unknown). The Residual MS from the ANOVA table is used to estimate $\sigma^2_{Y|X}$:
$$ s^2_{Y|X} = \text{Res MS} = \frac{\text{Res SS}}{\text{Res df}} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n-2}. $$
The denominator df = n − 2 is the number of observations minus the number of beta
parameters in the model, i.e., β0 and β1.

8.7 CI and tests for β1


A CI for β1 is given by b1 ± tcrit SE_{b1}, where the standard error of b1 under the model is
$$ SE_{b_1} = \frac{s_{Y|X}}{\sqrt{\sum_i (X_i - \bar{X})^2}}, $$
and where tcrit is the appropriate critical value for the desired CI level from a t-distribution with df = Res df.
To test H0 : β1 = β10 (a given value) against HA : β1 ≠ β10, reject H0 if |ts| ≥ tcrit, where
$$ t_s = \frac{b_1 - \beta_{10}}{SE_{b_1}}, $$
and tcrit is the t-critical value for a two-sided test, with the desired size and df =Res
df . Alternatively, you can evaluate a p-value in the usual manner to make a decision
about H0 .
# CI for beta1
sum.lm.blood.wt <- summary(lm.blood.wt)
sum.lm.blood.wt$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.442023 21.4408832 25.765824 2.253105e-07
## weight -1.300327 0.4364156 -2.979562 2.465060e-02
est.beta1 <- sum.lm.blood.wt$coefficients[2,1]
se.beta1 <- sum.lm.blood.wt$coefficients[2,2]
sum.lm.blood.wt$fstatistic
## value numdf dendf
## 8.877788 1.000000 6.000000


df.beta1 <- sum.lm.blood.wt$fstatistic[3]


t.crit <- qt(1-0.05/2, df.beta1)
t.crit
## [1] 2.446912
CI.lower <- est.beta1 - t.crit * se.beta1
CI.upper <- est.beta1 + t.crit * se.beta1
c(CI.lower, CI.upper)
## [1] -2.3681976 -0.2324567
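The same interval is returned directly by confint(), which performs this calculation for you:
# 95% CI for the slope from confint()
confint(lm.blood.wt, "weight", level = 0.95)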
The parameter estimates table gives the standard error, t-statistic, and p-value
for testing H0 : β1 = 0. Analogous summaries are given for the intercept, β0 , but
these are typically of less interest.

8.7.1 Testing β1 = 0
Assuming the mean relationship is linear, consider testing H0 : β1 = 0 against HA :
β1 ≠ 0. This test can be conducted using a t-statistic, as outlined above, or with an
ANOVA F -test, as outlined below.
For the analysis of variance (ANOVA) F-test, compute
$$ F_s = \frac{\text{Reg MS}}{\text{Res MS}} $$
and reject H0 when Fs exceeds the critical value (for the desired size test) from an F-table
with numerator df = 1 and denominator df = n − 2 (see qf()). The hypothesis
of zero slope (or no relationship) is rejected when Fs is large, which happens when a
significant portion of the variation in Y is explained by the linear relationship with X.
The p-values from the t-test and the F -test are always equal. Furthermore this
p-value is equal to the p-value for testing no correlation between Y and X, using the
t-test described earlier. Is this important, obvious, or disconcerting?
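A quick check for the blood loss fit (a sketch, not from the original notes): the squared t-statistic for the slope equals the F-statistic, and qf() gives the F critical value.
# F = t^2 for the slope in simple linear regression
ts <- summary(lm.blood.wt)$coefficients["weight", "t value"]
Fs <- anova(lm.blood.wt)["weight", "F value"]
c(t.squared = ts^2, F = Fs)                # both 8.878
qf(1 - 0.05, 1, 6)                         # F critical value, df = 1 and n - 2 = 6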

8.8 A CI for the population regression line


I cannot overemphasize the power of the regression model. The model allows you
to estimate the mean response at any X value in the range for which the model is
reasonable, even if little or no data are observed at that location.
We estimate the mean population response among individuals with X = Xp
µp = β0 + β1 Xp ,
with the fitted value, or the value of the least squares line at Xp :
Ŷp = b0 + b1 Xp .


Xp is not necessarily one of the observed Xi s in the data. To get a CI for µp, use
Ŷp ± tcrit SE(Ŷp), where the standard error of Ŷp is
$$ SE(\hat{Y}_p) = s_{Y|X} \sqrt{\frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}}. $$
The t-critical value is identical to that used in the subsection on CI for β1.

8.8.1 CI for predictions


Suppose a future individual (i.e., someone not used to compute the LS line) has
X = Xp. The best prediction for the response Y of this individual is the value of the
least squares line at Xp:
$$ \hat{Y}_p = b_0 + b_1 X_p. $$
To get a CI (prediction interval) for an individual response, use Ŷp ± tcrit SEpred(Ŷp), where
$$ SE_{pred}(\hat{Y}_p) = s_{Y|X} \sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}}, $$
and tcrit is identical to the critical value used for a CI on β1. The prediction variance
has two parts: (1) the 1 indicates the variability associated with the data around the
mean (regression line), and (2) the rest is the variability associated with estimating
the mean.
For example, in the blood loss problem you may want to estimate the blood loss
for a 50 kg individual, and to get a CI for this prediction. This problem is different
from computing a CI for the mean blood loss of all 50 kg individuals!
# CI for the mean and PI for a new observation at weight=50
predict(lm.blood.wt, data.frame(weight=50), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 487.4257 477.1575 497.6938
predict(lm.blood.wt, data.frame(weight=50), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 487.4257 457.098 517.7533

Comments
1. The prediction interval is wider than the CI for the mean response. This is
reasonable because you are less confident in predicting an individual response
than the mean response for all individuals.
2. The CI for the mean response and the prediction interval for an individual
response become wider as Xp moves away from X̄. That is, you get a narrower
(more precise) CI and prediction interval for Xp s near the center of the data.


3. The plots below include confidence bands along with the fitted LS line.
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid, aes(x = weight, y = blood_loss))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = TRUE)
print(p)

# Base graphics: Plot the data with linear regression fit and confidence bands
# scatterplot
plot(thyroid$weight, thyroid$blood_loss)
# regression line from lm() fit
abline(lm.blood.wt)
# x values of weight for predictions of confidence bands
x.pred <- data.frame(weight = seq(min(thyroid$weight), max(thyroid$weight),
length = 20))
# draw upper and lower confidence bands
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "upr"], col = "blue")
lines(x.pred$weight, predict(lm.blood.wt, x.pred,
interval = "confidence")[, "lwr"], col = "blue")
[Figure: blood_loss versus weight with the fitted LS line and 95% confidence bands, drawn with ggplot2 (left) and base graphics (right).]

8.8.2 A further look at the blood loss data


• The LS line is: Predicted Blood Loss = 552.442 − 1.30 Weight.
• The R² is 0.597 (i.e., 59.7%).
• The F-statistic for testing H0 : β1 = 0 is Fobs = 8.88 with a p-value = 0.025. The Error MS is $s^2_{Y|X}$ = 136.0; see the ANOVA table.


• The Parameter Estimates table gives b0 and b1, their standard errors, and t-statistics and p-values for testing H0 : β0 = 0 and H0 : β1 = 0. The t-test and F-test p-values for testing that the slope is zero are identical. We could calculate a 95% CI for β0 and β1. If we did so (using the t critical value) we find we are 95% confident that the slope of the population regression line is between −2.37 and −0.23.
• Suppose we are interested in estimating the average blood loss among all 50 kg individuals. The estimated mean blood loss is 552.442 − 1.30033 × 50 = 487.43. Reading off the plot, we are 95% confident that the mean blood loss of all 50 kg individuals is between (approximately) 477 and 498 ml. A 95% prediction interval for the blood loss of a single 50 kg person is less precise (about 457 to 518 ml).
As a summary we might say that weight is important for explaining the variation
in blood loss. In particular, the estimated slope of the least squares line (Predicted
Blood loss = 552.442 - 1.30 Weight) is significantly different from zero (p-value =
0.0247), with weight explaining approximately 60% (59.7%) of the variation in blood
loss for this sample of 8 thyroid operation patients.

8.9 Model Checking and Regression Diagnostics

8.9.1 Introduction
The simple linear regression model is usually written as

Yi = β0 + β1 Xi + εi

where the εi s are independent normal random variables with mean 0 and variance
σ². The model implies (1) The average Y-value at a given X-value is linearly related
to X. (2) The variation in responses Y at a given X value is constant. (3) The
population of responses Y at a given X is normally distributed. (4) The observed
data are a random sample.
A regression analysis is never complete until these assumptions have been checked.
In addition, you need to evaluate whether individual observations, or groups of ob-
servations, are unduly influencing the analysis. A first step in any analysis is to
plot the data. The plot provides information on the linearity and constant variance
assumption.


[Figure: four scatterplots of Y versus X labeled (a)-(d): (a) linear relationship with constant variance; (b) linear relationship with variance increasing with the mean; (c) nonlinear relationship with constant variance; (d) nonlinear relationship with non-constant variance.]

Figure (a) is the only plot that is consistent with the assumptions. The plot
shows a linear relationship with constant variance. The other figures show one or
more deviations. Figure (b) shows a linear relationship but the variability increases
as the mean level increases. In Figure (c) we see a nonlinear relationship with constant
variance, whereas (d) exhibits a nonlinear relationship with non-constant variance.
In many examples, nonlinearity or non-constant variability can be addressed by
transforming Y or X (or both), or by fitting polynomial models. These issues
will be addressed later.

8.9.2 Residual Analysis


A variety of methods for assessing model adequacy are based on the observed resid-
uals,
ei = Yi − Ŷi i.e., Observed − Fitted values.
The residual is the difference between the observed values and predicted or fitted
values. The residual is the part of the observation that is not explained by the fitted
model. You can analyze residuals to determine the adequacy of the model. A large
residual identifies an observation poorly fit by the model.
The residuals are usually plotted in various ways to assess potential inadequacies.
The observed residuals have different variances, depending on Xi. Recall that the
standard error of Ŷi (and therefore ei) is
$$ SE(\hat{Y}_i) = SE(e_i) = s_{Y|X} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}}. $$
So many statisticians prefer to plot the studentized residuals (sometimes called
the standardized residuals or internally Studentized residuals)
$$ r_i = \frac{e_i}{SE(e_i)}. $$


The standardized residual is the residual, ei , divided by an estimate of its standard


deviation. This form of the residual takes into account that the residuals may have
different variances, which can make it easier to detect outliers. The studentized resid-
uals have a constant variance of 1 (approximately). Standardized residuals greater
than 2 and less than −2 are usually considered large. I will focus on diagnostic
methods using the studentized residuals.
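In R, the studentized (internally standardized) residuals come from rstandard(); for example, for the blood loss fit:
# Studentized residuals for the blood loss fit; flag any beyond +/- 2
r.stand <- rstandard(lm.blood.wt)
round(r.stand, 2)
which(abs(r.stand) > 2)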
A plot of the studentized residuals ri against the fitted values Ŷi often reveals
inadequacies with the model. The real power of this plot is with multiple predictor
problems (multiple regression). The information contained in this plot with simple
linear regression is similar to the information contained in the original data plot,
except it is scaled better and eliminates the effect of the trend on your perceptions of
model adequacy. The residual plot should exhibit no systematic dependence of the
sign or the magnitude of the residuals on the fitted values:
[Figure: a data plot of Y versus X with the fitted line (left) and the corresponding plot of studentized residuals against fitted values (right); the residuals form a horizontal band with no systematic pattern.]

The following sequence of plots show how inadequacies in the data plot appear
in a residual plot. The first plot shows a roughly linear relationship between Y and
X with non-constant variance. The residual plot shows a megaphone shape rather
than the ideal horizontal band. A possible remedy is a weighted least squares
analysis to handle the non-constant variance (see end of chapter for an example),
or to transform Y to stabilize the variance. Transforming the data may destroy the
linearity.


[Figure: two data plots with their residual plots. A linear trend with increasing variance gives a megaphone-shaped residual plot; a nonlinear trend gives residuals whose sign depends systematically on the fitted value.]

The plot above shows a nonlinear relationship between Y and X. The residual
plot shows a systematic dependence of the sign of the residual on the fitted value.
Possible remedies were mentioned earlier.
The plot below shows an outlier. This case has a large residual and large stu-
dentized residual. A sensible approach here is to refit the model after holding out the
case to see if any conclusions change.

[Figure: a data plot containing one outlier far below the trend (left) and the corresponding residual plot (right); the outlying case has a large negative studentized residual.]

A third type of residual is called an externally Studentized residual (or Stu-


dentized deleted residual or deleted t-residual). The problem with residuals is that a
highly influential value can force the residual to have a very small value. This mea-
sure tries to correct for that by looking at how well the model fits this observation


without using this observation to construct the fit. It is quite possible for the deleted
residual to be huge when the raw residual is tiny.
The studentized deleted residual for the ith observation is calculated by fitting the
regression based on all of the cases except the ith one. The residual is then divided
by its estimated standard deviation. Since the Studentized deleted residual for the
ith observation estimates all quantities with this observation deleted from the data
set, the ith observation cannot influence these estimates. Therefore, unusual Y values
clearly stand out. Studentized deleted residuals with large absolute values are con-
sidered large. If the regression model is appropriate, with no outlying observations,
each Studentized deleted residual follows the t-distribution with n − 1 − p degrees of
freedom.
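In R these are available from rstudent(); a small sketch comparing them with the internally studentized residuals for the blood loss fit:
# Internally vs externally Studentized (deleted) residuals
round(cbind(internal = rstandard(lm.blood.wt),
            external = rstudent(lm.blood.wt)), 2)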

Nonconstant variance vs sample size Because more extreme observations are
more likely to occur with larger sample sizes, when the sample size depends on X
it can sometimes appear as nonconstant variance. Below, the sample sizes are either
all 25 or (3, 5, 10, 25, 125), and the standard deviations are either all 1 or
(0.1, 0.5, 1, 1.5, 3). Note that constant variance with different sample sizes appears
as though it has nonconstant variance.
#### Nonconstant variance vs sample size
dat.var.sam <- function(n, s) {
dat <- data.frame(
x = c(rep(0, n[1]),
rep(1, n[2]),
rep(2, n[3]),
rep(3, n[4]),
rep(4, n[5])),
y = c(rnorm(n[1], mean = 0, sd = s[1]),
rnorm(n[2], mean = 0, sd = s[2]),
rnorm(n[3], mean = 0, sd = s[3]),
rnorm(n[4], mean = 0, sd = s[4]),
rnorm(n[5], mean = 0, sd = s[5]))
)
return(dat)
}

library(ggplot2)

n <- c(25, 25, 25, 25, 25)


s <- c(1, 1, 1, 1, 1)
dat <- dat.var.sam(n, s)
p1 <- ggplot(dat, aes(x = x, y = y))
p1 <- p1 + geom_point(position = position_jitter(width = 0.1))
p1 <- p1 + labs(title = "Constant variance, constant sample size")
#print(p1)


n <- c(3, 5, 10, 25, 125)


s <- c(1, 1, 1, 1, 1)
dat <- dat.var.sam(n, s)
p2 <- ggplot(dat, aes(x = x, y = y))
p2 <- p2 + geom_point(position = position_jitter(width = 0.1))
p2 <- p2 + labs(title = "Constant variance, different sample sizes")
#print(p2)

n <- c(25, 25, 25, 25, 25)


s <- c(0.1, 0.5, 1, 1.5, 3)
dat <- dat.var.sam(n, s)
p3 <- ggplot(dat, aes(x = x, y = y))
p3 <- p3 + geom_point(position = position_jitter(width = 0.1))
p3 <- p3 + labs(title = "Different variance, constant sample size")
#print(p3)

n <- c(3, 5, 10, 25, 125)


s <- c(0.1, 0.5, 1, 1.5, 3)
dat <- dat.var.sam(n, s)
p4 <- ggplot(dat, aes(x = x, y = y))
p4 <- p4 + geom_point(position = position_jitter(width = 0.1))
p4 <- p4 + labs(title = "Different variance, different sample sizes")
#print(p4)

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, ncol=2
, top = "Nonconstant variance vs sample size")


[Figure: Nonconstant variance vs sample size. Four panels of jittered y versus x: constant variance with constant sample size; constant variance with different sample sizes; different variance with constant sample size; different variance with different sample sizes.]

8.9.3 Checking Normality


The normality assumption for the εi s can be evaluated visually with a boxplot or a
normal probability plot (rankit plot) of the ri , or formally with a Shapiro-Wilk test.
The normal probability plot often highlights outliers, or poorly fitted cases. If an
outlier is held out of the data and a new analysis is performed, the resulting normal
scores plot may be roughly linear, but often will show a short-tailed distribution.
(Why?).
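One way to carry this out for the blood loss fit (a sketch, not part of the original notes): a normal probability plot of the studentized residuals plus a Shapiro-Wilk test.
# Normal probability plot and Shapiro-Wilk test of the studentized residuals
r.stand <- rstandard(lm.blood.wt)
qqnorm(r.stand)
qqline(r.stand)
shapiro.test(r.stand)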
You must interpret regression tests and CI with caution with non-normal data.
Statisticians developed robust regression methods for non-normal data which are
available in R packages.

8.9.4 Checking Independence


Diagnosing dependence among observations requires an understanding of the data
collection process. There are a variety of graphical and inferential tools for checking
independence for data collected over time (called a time series). The easiest check is
to plot the ri against time index and look for any suggestive patterns.
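A minimal sketch of that check (it assumes the rows are in collection order, which may not hold for the thyroid data):
# Plot studentized residuals against observation order to look for patterns
plot(rstandard(lm.blood.wt), type = "b",
     xlab = "observation order", ylab = "studentized residual")
abline(h = 0, lty = 2)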


8.9.5 Outliers

Outliers are observations that are poorly fitted by the regression model. The response
for an outlier is far from the fitted line, so outliers have large positive or negative values
of the studentized residual ri . Usually, |ri | > 2 is considered large. Outliers are often
highlighted in residual plots.
What do you do with outliers? Outliers may be due to incorrect recordings of
the data or failure of the measuring device, or indications of a change in the mean
or variance structure for one or more cases. Incorrect recordings should be fixed if
possible, but otherwise deleted from the analysis.
Routine deletion of outliers from the analysis is not recommended. This practice
can have a dramatic effect on the fit of the model and the perceived precision of
parameter estimates and predictions. Analysts who routinely omit outliers without
cause tend to overstate the significance of their findings and get a false sense of
precision in their estimates and predictions. To assess effects of outliers, a data analyst
should repeat the analysis holding out the outliers to see whether any substantive
conclusions are changed. Very often the only real effect of an outlier is to inflate MSE
and hence make p-values a little larger and CIs a little wider than necessary, but
without substantively changing conclusions. They can completely mask underlying
patterns, however.


ClickerQ s — Outliers STT.02.04.040

8.9.6 Influential Observations


Certain data points can play an important role in determining the position of the LS
line. These data points may or may not be outliers. In the plots below, the solid
line is the LS line from the full data set, whereas the dashed line is the LS line after
omitting the unusual point. For example, the observation with Y > 45 in the first
plot is an outlier relative to the LS fit. The extreme observation in the second plot
has a very small ri . Both points are highly influential observations — the LS line
changes dramatically when these observations are held out.

[Figure: two scatterplots showing the LS line from the full data (solid) and after omitting the unusual point (dashed). In the first, the point with Y > 45 is an outlier; in the second, the extreme point has an extreme X value but a very small studentized residual. Both points are highly influential.]

In the second plot, the extreme value is a high leverage value, which is basically
an outlier among the X values; Y does not enter in this calculation. This influential
observation is not an outlier because its presence in the analysis determines that the
LS line will essentially pass through it! These are values with the potential of greatly
distorting the fitted model. They may or may not actually have distorted it.
The hat variable from the influence() function on the object returned from lm()
fit will give the leverages: influence(lm.output)$hat. Leverage values fall between
0 and 1. Experts consider a leverage value greater than 2p/n or 3p/n, where p is the
number of predictors or factors plus the constant and n is the number of observations,
large and suggest you examine the corresponding observation. A rule-of-thumb is to
identify observations with leverage over 3p/n or 0.99, whichever is smaller.
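For the blood loss fit, a sketch of this calculation (here p = 2 and n = 8):
# Leverages and the 3p/n (or 0.99) rule of thumb
h <- influence(lm.blood.wt)$hat            # same as hatvalues(lm.blood.wt)
p <- length(coef(lm.blood.wt))             # 2: intercept and slope
n <- nrow(thyroid)                         # 8 observations
round(h, 3)
which(h > min(3 * p / n, 0.99))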


Dennis Cook developed a measure of the impact that individual cases have on the
placement of the LS line. His measure, called Cook’s distance or Cook’s D, provides
a summary of how far the LS line changes when each individual point is held out (one
at a time) from the analysis. While high leverage values indicate observations that
have the potential of causing trouble, those with high Cook’s D values actually do
disproportionately affect the overall fit. The case with the largest D has the greatest
impact on the placement of the LS line. However, the actual influence of this case
may be small. In the plots above, the observations I focussed on have the largest
Cook’s Ds.
A simple, but not unique, expression for Cook's distance for the jth case is

Dj ∝ Σi (Ŷi − Ŷi[−j])²,

where Ŷi[−j] is the fitted value for the ith case when the LS line is computed from all
the data except case j. Here ∝ means that Dj is a multiple of Σi (Ŷi − Ŷi[−j])², where
the multiplier does not depend on the case. This expression implies that Dj is also
an overall measure of how much the fitted values change when case j is deleted.
Observations with large D values may be outliers. Because D is calculated us-
ing leverage values and standardized residuals, it considers whether an observation
is unusual with respect to both x- and y-values. To interpret D, compare it to the
F -distribution with (p, n − p) degrees-of-freedom to determine the corresponding per-
centile. If the percentile value is less than 10% or 20%, the observation has little
influence on the fitted values. If the percentile value is greater than 50%, the obser-
vation has a major influence on the fitted values and should be examined.
Many statisticians make this much simpler and use 1 as a cutoff value
for a large Cook's D (when D is on the appropriate scale). This cutoff can simplify an
analysis, since frequently one or two cases have noticeably larger D values than the
rest without actually having much effect on the fit; even so, it can be important
to explore any observations that stand out. Cook's distance values for each
observation from a linear regression fit are given by cooks.distance(lm.output).
Given a regression problem, you should locate the points with the largest Dj s and
see whether holding these cases out has a decisive influence on the fit of the model or
the conclusions of the analysis. You can examine the relative magnitudes of the Dj s
across cases without paying much attention to the actual value of Dj , but there are
guidelines (see below) on how large Dj needs to be before you worry about it.
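A minimal sketch of these checks, again assuming a generic fitted model object named lm.output:
# Cook's D, the D > 1 rule, and the F-based percentile guideline
D <- cooks.distance(lm.output)
p <- length(coef(lm.output))
n <- nrow(model.frame(lm.output))
pct <- pf(D, df1 = p, df2 = n - p)    # percentile of each D in the F(p, n - p) distribution
which(D > 1)                          # simple cutoff for large Cook's D
which(pct > 0.5)                      # major influence by the percentile guideline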
It is difficult to define a good strategy for dealing with outliers and influential
observations. Experience is the best guide. I will show you a few examples that
highlight some standard phenomena. One difficulty you will find is that certain
observations may be outliers because other observations are influential, or vice-versa.
If an influential observation is held out, an outlier may remain an outlier, may become
influential, or both, or neither. Observations of moderate influence may become more,
or less influential, when the most influential observation is held out.
Thus, any sequential refitting of models holding out observations should not
be based on the original (full-data) summaries, but rather on the summaries that
result as individual observations are omitted. I tend to focus more on influential
observations than outliers.
In the plots below, which cases do you think are most influential, and which are
outliers? What happens in each analysis if I delete the most influential case? Are any
of the remaining cases influential or poorly fitted?

[Four scatter plots of Y vs. X, each containing one unusual case; consider which cases are influential and which are outliers.]

Many researchers are hesitant to delete points from an analysis. I think this view
is myopic, and in certain instances, such as the Gesell example to be discussed, cannot
be empirically supported. Being rigid about this can lead to some silly analyses
of data, but one needs a very good reason and full disclosure if any points are deleted.


8.9.7 Summary of diagnostic measures


The various measures discussed above often flag the same observations as unusual, but
they certainly can flag different observations. At the very least I examine standardized
residuals and Cook’s D values. They are invaluable diagnostic measures, but nothing
is perfect. Observations can be unusual in groups — a pair of unusual high leverage
values close to each other will not necessarily be flagged by Cook’s D since removing
just one may not affect the fit very much. Any analysis takes some careful thought.
These measures and techniques really are designed for multiple regression prob-
lems where several predictor variables are used. We are using them in simple linear
regression to learn what they do and see what they have to say about data, but in
truth it is fairly simple with one variable to see what may be an outlier in the x-
direction, to see if the data are poorly fit, etc. With more variables all that becomes
quite difficult and these diagnostics are essential parts of those analyses.

8.10 Regression analysis suggestion


There are a lot of options available in R. I will make a few suggestions here on how to
start an analysis. What you find once you get started determines what more you
might wish to explore.
1. Plot the data. With lots of variables the matrix plot is valuable as a quick
screen. If you want to see better resolution on an individual scatter plot, do the
individual scatter plot.
2. Do any obvious transformations of the data. We will discuss this in a lot
more detail later. Re-plot with transformations.
3. Fit the least squares equation.
4. Examine the residual plots and results. Check for the patterns discussed
earlier.
(a) Plotting several diagnostic plots together seems convenient to me. This
gives you the essential plot (residuals vs. fitted values) plus a quick check
on normality and possible violations of independence and influence. If you
see something that needs closer investigation, you may need to recreate
one of the plots larger by itself.
(b) Do you see curvature in the standardized residual plot? If the sign of the
residuals has a distinct pattern vs. the fitted values, the linear fit is not
adequate and you need some remedy, such as transformations. Note that
standardized residuals are more conventional and show you what actually
happened, while deleted residuals are probably the better diagnostic tool
for identifying problem cases.
(c) Does it appear σY |X depends upon X (we are assuming it does not)? A


megaphone pattern in residuals vs. fits is the classic (not the only) pattern
to look for. Weighted least squares or transformations may be called for.
(d) Do you see obvious outliers? Make sure you do not have a misrecorded
data value. It might be worth refitting the equation without the outlier to
see if it affects conclusions substantially.
(e) Is the normality assumption reasonable? This can be very closely related
to the preceding points.
(f) Is there a striking pattern in residuals vs. order of the data? This can be
an indication that the independence assumption is not valid.
5. Check the Cook’s D values. The Cook’s distance plot and Residuals vs.
Leverage (with Cook’s D) plot are both helpful.
6. If you found problem observations, omit them from the analysis and see if any
conclusions change substantially. There are two good ways to do this.
(a) Subset the data.frame using subset().
(b) Use lm() with the weights= option with weights of 0 for the excluded
observations, weights of 1 for those included.
You may need to repeat all these steps many times for a complete analysis.

8.10.1 Residual and Diagnostic Analysis of the Blood Loss Data
We looked at much of this before, but let us go through the above steps systematically.
Recall the data set (we want to predict blood loss from weight):

weight time blood loss


1 44.3 105 503
2 40.6 80 490
3 69.0 86 471
4 43.7 112 505
5 50.3 109 482
6 50.2 100 490
7 35.4 96 513
8 52.2 120 464

1. Plot the data. Plot blood loss vs. weight.


# create data ids
thyroid$id <- 1:nrow(thyroid)
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points


p <- p + geom_text(hjust = 0.5, vjust = -0.5)


# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)

[Scatter plot of blood_loss vs. weight with labeled points, LS line, and confidence band.]

Clearly the heaviest individual is an unusual value that warrants a closer look
(maybe data recording error). I might be inclined to try a transformation here
(such as log(weight)) to make that point a little less influential.
2. Do any obvious transformations of the data. We will look at transformations
later.
3. Fit the least squares equation. Blood Loss appears significantly negatively as-
sociated with weight.
lm.blood.wt <- lm(blood_loss ~ weight, data = thyroid)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.blood.wt)
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.565 -6.189 4.712 8.192 9.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 552.4420 21.4409 25.77 2.25e-07 ***
## weight -1.3003 0.4364 -2.98 0.0247 *
## ---


## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.66 on 6 degrees of freedom
## Multiple R-squared: 0.5967,Adjusted R-squared: 0.5295
## F-statistic: 8.878 on 1 and 6 DF, p-value: 0.02465

(a) Graphs: Check Standardized Residuals (or the Deleted Residuals). The
residual plots:

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt, which = c(1,4,6))

# residuals vs weight
plot(thyroid$weight, lm.blood.wt$residuals, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(lm.blood.wt$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 8 2 4
# residuals vs order of data
plot(lm.blood.wt$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")


[Diagnostic plots: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs weight, QQ Plot, and Residuals vs Order of data.]

4. Examine the residual plots and results.


(a) Do you see curvature? There does not appear to be curvature (and it could
be hard to detect with so few points).
(b) Does it appear σY |X depends upon X? Not much evidence for this.
(c) Do you see obvious outliers? Observation 3 is an outlier in the x direction,
and therefore possibly a high leverage point and influential on the model
fit.
(d) Is the normality assumption reasonable? There appears to be some skew-
ness, but with so few points normality may be reasonable.
(e) Is there a striking pattern in residuals vs. order of the data? No striking
pattern.
5. Check the Cook's D values. We anticipated that the 3rd observation affects the
fit much more than any other observation. Its D value is much larger than
1. Note that the residual is not large for this observation.
6. Omit problem observations from the analysis and see if any conclusions change
substantially. Let us refit the equation without observation 3 to see if anything
changes drastically. I will use the weighted least squares approach discussed
earlier on this example. Define a variable wt that is 1 for all observations
except obs. 3, and make it 0 for that one.
# wt = 1 for all except obs 3 where wt = 0
thyroid$wt <- as.numeric(!(thyroid$id == 3))
thyroid$wt
## [1] 1 1 0 1 1 1 1 1


What changes by deleting case 3? The fitted line gets steeper (slope changes
from −1.30 to −2.19), adjusted R2 gets larger (up to 58% from 53%), and S
changes from 11.7 to 10.6. Because the Weight values are much less spread
out, SE(β̂1) becomes quite a bit larger (to 0.714, up from 0.436) and we lose a
degree of freedom for MS Error (which will penalize us on tests and CIs). Just
about any quantitative statement we would want to make using CIs would be
about the same either way since CIs will overlap a great deal, and our quali-
tative interpretations are unchanged (Blood Loss drops with Weight). Unless
something shows up in the plots, I don’t see any very important changes here.

lm.blood.wt.no3 <- lm(blood_loss ~ weight, data = thyroid, weights = wt)


# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.blood.wt.no3)
##
## Call:
## lm(formula = blood_loss ~ weight, data = thyroid, weights = wt)
##
## Weighted Residuals:
## 1 2 3 4 5 6 7
## 8.5033 -12.6126 0.0000 9.1872 0.6641 8.4448 -1.0186
## 8
## -13.1683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 591.6677 32.5668 18.168 9.29e-06 ***
## weight -2.1935 0.7144 -3.071 0.0278 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.6 on 5 degrees of freedom
## Multiple R-squared: 0.6535,Adjusted R-squared: 0.5842
## F-statistic: 9.428 on 1 and 5 DF, p-value: 0.02777

# exclude obs 3
thyroid.no3 <- subset(thyroid, wt == 1)
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(thyroid.no3, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)


[Scatter plot of blood_loss vs. weight (observation 3 excluded) with labeled points, LS line, and confidence band.]

Nothing very striking shows up in the residual plots, and no Cook’s D values
are very large among the remaining observations.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.blood.wt.no3, which = c(1,4,6))

# residuals vs weight
plot(thyroid.no3$weight, lm.blood.wt.no3$residuals[(thyroid$wt == 1)]
, main="Residuals vs weight")
# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
# qq plot for studentized resid
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(lm.blood.wt.no3$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 3 8 2
# residuals vs order of data
plot(lm.blood.wt.no3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")


[Diagnostic plots for the fit without observation 3: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs weight, QQ Plot, and Residuals vs Order of data.]

How much difference is there in a practical sense? Examine the 95% prediction
interval for a new observation at Weight = 50kg. Previously we saw that interval
based on all 8 observations was from 457.1 to 517.8 ml of Blood Loss. Based on just
the 7 observations the prediction interval is 451.6 to 512.4 ml. There really is no
practical difference here.
# CI for the mean and PI for a new observation at weight=50
predict(lm.blood.wt , data.frame(weight=50), interval = "prediction")
## fit lwr upr
## 1 487.4257 457.098 517.7533
predict(lm.blood.wt.no3, data.frame(weight=50), interval = "prediction")
## Warning in predict.lm(lm.blood.wt.no3, data.frame(weight = 50), interval = "prediction"):
Assuming constant prediction variance even though model fit is weighted
## fit lwr upr
## 1 481.9939 451.5782 512.4096
Therefore, while obs. 3 was potentially influential, whether the value is included
or not makes very little difference in the model fit or relationship between Weight
and BloodLoss.

8.10.2 Gesell data


These data are from a UCLA study of cyanotic heart disease in children. The predictor
is the age of the child in months at first word and the response variable is the Gesell
adaptive score, for each of 21 children.


id age score
1 1 15 95
2 2 26 71
3 3 10 83
4 4 9 91
5 5 15 102
6 6 20 87
7 7 18 93
8 8 11 100
9 9 8 104
10 10 20 94
11 11 7 113
12 12 9 96
13 13 10 83
14 14 11 84
15 15 11 102
16 16 10 100
17 17 12 105
18 18 42 57
19 19 17 121
20 20 11 86
21 21 10 100

Let us go through the same steps as before.

1. Plot Score versus Age. Comment on the relationship between Score and Age.
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(gesell, aes(x = age, y = score, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)


[Scatter plot of score vs. age with labeled points, LS line, and confidence band.]

2. There are no obvious transformations to try here.


3. Fit a simple linear regression model. Provide an equation for the LS line. Does
age at first word appear to be an “important predictor” of Gesell adaptive score?
(i.e., is the estimated slope significantly different from zero?)
lm.score.age <- lm(score ~ age, data = gesell)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.score.age)
##
## Call:
## lm(formula = score ~ age, data = gesell)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.604 -8.731 1.396 4.523 30.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.8738 5.0678 21.681 7.31e-15 ***
## age -1.1270 0.3102 -3.633 0.00177 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 19 degrees of freedom
## Multiple R-squared: 0.41,Adjusted R-squared: 0.3789
## F-statistic: 13.2 on 1 and 19 DF, p-value: 0.001769
4. Do these plots suggest any inadequacies with the model?


# plot diagnostics
par(mfrow=c(2,3))
plot(lm.score.age, which = c(1,4,6))

# residuals vs weight
plot(gesell$age, lm.score.age$residuals, main="Residuals vs age")
# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.score.age$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 19 3 13
# residuals vs order of data
plot(lm.score.age$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Diagnostic plots: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs age, QQ Plot, and Residuals vs Order of data.]

5. Observations 18 and 19 stand out with relatively high Cook’s D. The cutoff
line is only a rough guideline. Those two were also flagged for high influence and a
large standardized residual, respectively. Be sure to examine the scatter plot
carefully to see why 18 and 19 stand out.
6. Consider doing two additional analyses: Analyze the data after omitting case
18 only and analyze the data after omitting case 19 only. Refit the regression
model for each of these two scenarios. Provide a summary table such as the
following, giving the relevant summary statistics for the three analyses. Discuss
the impact that observations 18 and 19 have individually on the fit of the model.
When observation 18 is omitted, the estimated slope is not significantly different
from zero (p-value = 0.1489), indicating that age is not an important predictor
of Gesell score. This suggests that the significance of age as a predictor in the
original analysis was due solely to the presence of observation 18. Note the
dramatic decrease in R2 after deleting observation 18.
The fit of the model appears to improve when observation 19 is omitted. For
example, R2 increases noticeably and the p-value for testing the significance of
the slope decreases dramatically (in a relative sense). These tendencies would be
expected based on the original plot. However, this improvement is misleading.
Once observation 19 is omitted, observation 18 is much more influential. Again
the significance of the slope is due to the presence of observation 18.
Feature                  Full data   Omit 18   Omit 19
b0                          109.87    105.63    109.30
b1                           -1.13     -0.78     -1.19
SE(b0)                        5.07      7.16      3.97
SE(b1)                        0.31      0.52      0.24
R2                            0.41      0.11      0.57
p-val for H0: β1 = 0         0.002     0.149     0.000

Can you think of any reasons to justify doing the analysis without observation 18?
If you include observation 18 in the analysis, you are assuming that the mean
Gesell score is linearly related to age over the entire range of observed ages. Obser-
vation 18 is far from the other observations on age (age for observation 18 is 42; the
second highest age is 26; the lowest age is 7). There are no children with ages between
27 and 41, so we have no information on whether the relationship is roughly linear
over a significant portion of the range of ages. I am comfortable deleting observation
18 from the analysis because its inclusion forces me to make an assumption that I
cannot check using these data. I am only willing to make predictions of Gesell score
for children with ages roughly between 7 and 26. However, once this point is omitted,
age does not appear to be an important predictor.
A more complete analysis would delete observation 18 and 19 together. What
would you expect to see if you did this?

8.11 Weighted Least Squares


Earlier I indicated that nonconstant error variance can be addressed (sometimes)
with weighted least squares. The scedastic function is the conditional variance of Y
given X. Y is said to be heteroscedastic if its variance depends on the value of
X (the variance changes), and homoscedastic if its variance does not depend on X
(constant variance).
Recall the LS (OLS or ordinary LS) line chooses the values of β0 and β1 that
minimize

Σi=1..n {Yi − (β0 + β1 Xi)}²

over all possible choices of β0 and β1. The weighted LS (WLS) line chooses the values
of β0 and β1 that minimize

Σi=1..n wi {Yi − (β0 + β1 Xi)}²

over all possible choices of β0 and β1. If σY|X depends upon X, then the correct choice
of weights is inversely proportional to the variance, wi ∝ 1/σ²Y|X.
Consider the following data and plot of y vs. x and standardized OLS residuals
vs x. It is very clear that variability increases with x.
#### Weighted Least Squares
# R code to generate data
set.seed(7)
n <- 100
# 1s, Xs uniform 0 to 100
X <- matrix(c(rep(1,n),runif(n,0,100)), ncol=2)
# intercept and slope (5, 5)
beta <- matrix(c(5,5),ncol=1)
# errors are X*norm(0,1), so variance increases with X
e <- X[,2]*rnorm(n,0,1)
# response variables
y <- X %*% beta + e

# put data into data.frame


wlsdat <- data.frame(y, x = X[,2])

# fit regression
lm.y.x <- lm(y ~ x, data = wlsdat)

# put residuals in data.frame


wlsdat$res <- lm.y.x$residuals

# ggplot: Plot the data with linear regression fit


library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = y))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = FALSE)
print(p)

# ggplot: Plot the residuals


library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = res))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = FALSE)
print(p)


[Two plots: y vs. x with the OLS fit, and OLS residuals vs. x. The spread of the residuals increases with x.]

In order to use WLS to solve this problem, we need some form for σ²Y|X. Finding
that form is a real problem with WLS. It can be useful to plot the absolute value of
the standardized residual vs. x to see if the top boundary seems to follow a general
pattern.
# ggplot: Plot the absolute value of the residuals
library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = abs(res)))
p <- p + geom_point()
print(p)

[Plot of |residual| vs. x; the upper boundary of the points increases roughly linearly with x.]

It is plausible the upper boundary is linear, so let us try wi = 1/x². Standardized
residuals from this WLS fit look very good. Note that raw (nonstandardized) residuals
will still have the same pattern — it is essential to use standardized residuals here.


# fit regression
lm.y.x.wt <- lm(y ~ x, data = wlsdat, weights = x^(-2))

# put residuals in data.frame


wlsdat$reswt <- lm.y.x.wt$residuals
wlsdat$wt <- lm.y.x.wt$weights^(1/2)

# ggplot: Plot the absolute value of the residuals


library(ggplot2)
p <- ggplot(wlsdat, aes(x = x, y = reswt*wt))
p <- p + geom_point()
print(p)



[Plot of the weighted (standardized) residuals vs. x from the WLS fit; the spread is now roughly constant.]

Compare also the OLS fitted equation:


summary(lm.y.x)
##
## Call:
## lm(formula = y ~ x, data = wlsdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.175 -24.939 2.542 24.973 125.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6308 10.4256 0.732 0.466
## x 5.0348 0.1791 28.116 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.53 on 98 degrees of freedom
## Multiple R-squared: 0.8897,Adjusted R-squared: 0.8886
## F-statistic: 790.5 on 1 and 98 DF, p-value: < 2.2e-16


to the WLS fitted equation:


summary(lm.y.x.wt)
##
## Call:
## lm(formula = y ~ x, data = wlsdat, weights = x^(-2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.8939 -0.6707 0.1777 0.5963 2.3132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.09309 0.54931 9.272 4.61e-15 ***
## x 5.10690 0.09388 54.399 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8902 on 98 degrees of freedom
## Multiple R-squared: 0.9679,Adjusted R-squared: 0.9676
## F-statistic: 2959 on 1 and 98 DF, p-value: < 2.2e-16
Clearly the weighted fit looks better, although note that everything is based on
the weighted SS. In practice it can be pretty difficult to determine the correct set of
weights, but WLS works much better than OLS if appropriate. I actually simulated
this data set using β0 = β1 = 5. Which fit actually did better?



Part IV

ADA1: Additional topics



Chapter 9

Introduction to the Bootstrap

Contents
9.1 Introduction
9.2 Bootstrap
    9.2.1 Ideal versus Bootstrap world, sampling distributions
    9.2.2 The accuracy of the sample mean
    9.2.3 Comparing bootstrap sampling distribution from population and sample

Learning objectives
After completing this topic, you should be able to:
explain the bootstrap principle for hypothesis tests and inference.
decide (for simple problems) how to construct a bootstrap procedure.
Achieving these goals contributes to mastery in these course learning outcomes:
1. organize knowledge.
5. define parameters of interest and hypotheses in words and notation.
8. use statistical software.
12. make evidence-based decisions.

9.1 Introduction
Statistical theory attempts to answer three basic questions:
1. How should I collect my data?
2. How should I analyze and summarize the data that I’ve collected?

3. How accurate are my data summaries?


Question 3 constitutes part of the process known as statistical inference. The bootstrap
is a method for making certain kinds of statistical inference¹. Let's look at an example.

Example: Aspirin and heart attacks, large-sample theory Does aspirin pre-
vent heart attacks in healthy middle-aged men? A controlled, randomized, double-
blind study was conducted and gathered the following data.
                 heart attacks (fatal plus non-fatal)   subjects
aspirin group:            104                             11037
placebo group:            189                             11034
A good experimental design, such as this one, simplifies the results! The ratio of the
two rates (the risk ratio) is

θ̂ = (104/11037) / (189/11034) = 0.55.

Because of the solid experimental design, we can believe that the aspirin-takers only
have 55% as many heart attacks as the placebo-takers.
We are not really interested in the estimated ratio θ̂, but the true ratio, θ. That
is the ratio if we could treat all possible subjects, not just a sample of them. Large
sample theory tells us that the log risk ratio has an approximate Normal distribution.
The standard error of the log risk ratio is estimated simply by the square root of the
sum of the reciprocals of the four frequencies:

SE(log(RR)) = √(1/104 + 1/189 + 1/11037 + 1/11034) = 0.1228

The 95% CI for log(θ) is

log(θ̂) ± 1.96 × SE(log(RR)), (−0.839, −0.357),

and exponentiating gives the CI on the ratio scale,

exp{log(θ̂) ± 1.96 × SE(log(RR))}, (0.432, 0.700).

The same data that allowed us to estimate the ratio θ with θ̂ = 0.55 also allowed us
to get an idea of the estimate’s accuracy.
¹ Efron (1979), “Bootstrap methods: another look at the jackknife.” Ann. Statist. 7, 1–26.
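A minimal R sketch of this large-sample calculation, using the counts from the table above:
# large-sample 95% CI for the heart-attack risk ratio (aspirin vs placebo)
s <- c(104, 189)                      # heart attacks in the aspirin and placebo groups
n <- c(11037, 11034)                  # subjects in each group
rr <- (s[1] / n[1]) / (s[2] / n[2])   # estimated risk ratio, about 0.55
se.log.rr <- sqrt(sum(1 / c(s, n)))   # SE of log(RR), about 0.1228
exp(log(rr) + c(-1, 1) * 1.96 * se.log.rr)  # CI on the ratio scale, about (0.43, 0.70)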


Example: Aspirin and strokes, large-sample theory The aspirin study tracked
strokes as well as heart attacks.
strokes subjects
aspirin group: 119 11037
placebo group: 98 11034
The ratio of the two rates (the risk ratio) is

θ̂ = (119/11037) / (98/11034) = 1.21.

Now it looks like aspirin is actually harmful; however, the 95% interval for the true
stroke ratio θ is (0.925, 1.583). This includes the neutral value θ = 1, at which aspirin
would be no better or worse than placebo for strokes.

9.2 Bootstrap
The bootstrap is a data-based simulation method for statistical inference, which can
be used to produce inferences like those in the previous slides. The term “bootstrap”
comes from literature. In “The Adventures of Baron Munchausen”, by Rudolph Erich
Raspe, the Baron had fallen to the bottom of a deep lake, and he thought to get out
by pulling himself up by his own bootstraps.

9.2.1 Ideal versus Bootstrap world, sampling distributions


Ideal world
1. Population of interest
2. Obtain many simple random samples (SRSs) of size n
3. For each SRS, calculate statistic of interest (θ)
4. Sampling distribution is the distribution of the calculated statistic
Bootstrap world
1. Population of interest; One empirical distribution based on a sample of size n
2. Obtain many bootstrap resamples of size n
3. For each resample, calculate statistic of interest (θ∗ )
4. Bootstrap distribution is the distribution of the calculated statistic
5. Bootstrap distribution estimates the sampling distribution centered at the statis-
tic (not the parameter).

Cartoon: Estimating the proportion of Republican votes Imagine a two-candidate
election. The following illustrates how to use exit polls to estimate (with
uncertainty) the probability of a Republican win.


How can the sampling distribution of the proportion of Republican votes be estimated?
[Cartoon, in four panels:
The ideal case: draw repeated (infinite) samples from the population; the repeated samples provide the true sampling distribution of the proportion of Republican votes.
The traditional case: a single sample is observed; the sampling distribution is estimated using statistical theory (such as the Central Limit Theorem).
The bootstrap case: a single sample is resampled; repeated resamples provide the bootstrap estimate of the sampling distribution. Note the similar structure between the ideal and bootstrap cases.
How is the bootstrap used in this scenario? From the observed sample of counted votes, many elections are simulated by resampling; the area of the bootstrap sampling distribution above one half of the votes gives the probability of a Republican win.]


Example: Aspirin and strokes, bootstrap Here’s how the bootstrap works in
the stroke example. We create two populations:
• the first consisting of 119 ones and 11037 − 119 = 10918 zeros,
• the second consisting of 98 ones and 11034 − 98 = 10936 zeros.
We draw with replacement a sample of 11037 items from the first population, and a
sample of 11034 items from the second population. Each is called a bootstrap sample.
From these we derive the bootstrap replicate of θ̂:

θ̂∗ = (Proportion of ones in bootstrap sample 1) / (Proportion of ones in bootstrap sample 2).

Repeat this process a large number of times, say 10000 times, and obtain 10000
bootstrap replicates θ̂∗ . The summaries are in the code, followed by a histogram of
bootstrap replicates, θ̂∗ .
#### Example: Aspirin and strokes, bootstrap
# sample size (n) and successes (s) for sample 1 (aspirin) and 2 (placebo)
n <- c(11037, 11034)
s <- c( 119, 98)
# data for samples 1 and 2, where 1 = success (stroke), 0 = failure (no stroke)
dat1 <- c(rep(1, s[1]), rep(0, n[1] - s[1]))
dat2 <- c(rep(1, s[2]), rep(0, n[2] - s[2]))
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of proportions
for (i in 1:R) {
# proportion of successes in bootstrap samples 1 and 2
# (as individual steps for group 1:)
resam1 <- sample(dat1, n[1], replace = TRUE)
success1 <- sum(resam1)
bs1[i] <- success1 / n[1]
# (as one line for group 2:)
bs2[i] <- sum(sample(dat2, n[2], replace = TRUE)) / n[2]
}
# bootstrap replicates of ratio estimates
rat <- bs1 / bs2
# sort the ratio estimates to obtain bootstrap CI
rat.sorted <- sort(rat)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(rat.sorted[round(0.025*R)], rat.sorted[round(0.975*R+1)])
CI.bs
## [1] 0.9399154 1.5878036


## Plot the bootstrap distribution with CI


# First put data in data.frame for ggplot()
dat.rat <- data.frame(rat)

library(ggplot2)
p <- ggplot(dat.rat, aes(x = rat))
p <- p + geom_histogram(aes(y=..density..), binwidth=0.02)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 1 and CI
p <- p + geom_vline(xintercept=1, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of relative risk ratio, strokes")
p <- p + xlab("ratio (red = 1, green = bootstrap CI)")
print(p)
[Histogram: bootstrap distribution of the relative risk ratio for strokes, with vertical lines at 1 and at the bootstrap CI.]

In this simple case, the confidence interval derived from the bootstrap (0.94, 1.588)
agrees very closely with the one derived from statistical theory (0.925, 1.583). Boot-
strap methods are intended to simplify the calculation of inferences like those using
large-sample theory, producing them in an automatic way even in situations much
more complicated than the risk ratio in the aspirin example.

9.2.2 The accuracy of the sample mean


For sample means, and essentially only for sample means, an accuracy formula (for
the standard error of the parameter) is easy to obtain (using the delta method). We’ll
see how to use the bootstrap for the sample mean, then for the more complicated
situation of assessing the accuracy of the median.
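As a warm-up, here is a minimal sketch of the bootstrap standard error of a single sample mean, compared with the usual formula s/√n; the data are the control-group survival times from the mouse example below.
# bootstrap SE of a sample mean vs the formula s / sqrt(n)
set.seed(76)
x <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)   # control mouse survival times (see below)
R <- 10000
bs.mean <- rep(NA, R)
for (i in 1:R) {
  bs.mean[i] <- mean(sample(x, replace = TRUE))
}
sd(bs.mean)              # bootstrap SE of the mean
sd(x) / sqrt(length(x))  # formula SE, about 14.2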


Bootstrap Principle The plug-in principle is used when the underlying distri-
bution is unknown and you substitute your best guess for what that distribution is.
What to substitute?
Empirical distribution ordinary bootstrap
Smoothed distribution (kernel) smoothed bootstrap
Parametric distribution parametric bootstrap
Satisfy assumptions such as the null hypothesis
This substitution works in many cases, but not always. Keep in mind that the
bootstrap distribution is centered at the statistic, not the parameter. Implementation
is done by Monte Carlo sampling.
The bootstrap is commonly implemented in one of two ways, nonparametrically
or parametrically. An exact nonparametric bootstrap requires nⁿ samples! That's
one for every possible combination of each of n observation positions taking the value
of each of n observations. This is sensibly approximated by using the Monte Carlo
strategy of drawing a large number (1000 or 10000) of random resamples. On the
other hand, a parametric bootstrap first assumes a distribution for the population
(such as a normal distribution) and estimates the distributional parameters (such as
the mean and variance) from the observed sample. Then, the Monte Carlo strategy is
used to draw a large number (1000 or 10000) of samples from the estimated parametric
distribution.
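For illustration, a minimal sketch of the parametric version for the standard error of a mean; the normal model and the made-up sample here are assumptions for illustration only.
# parametric bootstrap SE of the mean: estimate a normal model, then resample from it
set.seed(76)
x <- rnorm(25, mean = 100, sd = 15)   # made-up sample
m <- mean(x); s <- sd(x); n <- length(x)
R <- 10000
bs.p <- rep(NA, R)
for (i in 1:R) {
  bs.p[i] <- mean(rnorm(n, mean = m, sd = s))   # draw from the fitted N(m, s) model
}
sd(bs.p)   # parametric bootstrap SE of the mean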

Example: Mouse survival, two-sample t-test, mean Sixteen mice were ran-
domly assigned to a treatment group or a control group. Shown are their survival
times, in days, following a test surgery. Did the treatment prolong survival?
Group        Data                                   n   Mean    SE
Control:     52, 104, 146, 10, 51, 30, 40, 27, 46   9   56.22   14.14
Treatment:   94, 197, 16, 38, 99, 141, 23           7   86.86   25.24
Difference:                                             30.63   28.93
Numerical and graphical summaries of the data are below. There seems to be a
slight difference in variability between the two treatment groups.
#### Example: Mouse survival, two-sample t-test, mean
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
survive <- c(treatment, control)
group <- c(rep("Treatment", length(treatment)), rep("Control", length(control)))
mice <- data.frame(survive, group)

library(plyr)
# ddply "dd" means the input and output are both data.frames
mice.summary <- ddply(mice,
"group",
function(X) {
data.frame( m = mean(X$survive),
s = sd(X$survive),
n = length(X$survive)
)


}
)
# standard errors
mice.summary$se <- mice.summary$s/sqrt(mice.summary$n)
# individual confidence limits
mice.summary$ci.l <- mice.summary$m - qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se
mice.summary$ci.u <- mice.summary$m + qt(1-.05/2, df=mice.summary$n-1) * mice.summary$se

mice.summary
## group m s n se ci.l ci.u
## 1 Control 56.22222 42.47581 9 14.15860 23.57242 88.87202
## 2 Treatment 86.85714 66.76683 7 25.23549 25.10812 148.60616
diff(mice.summary$m)
## [1] 30.63492
# histogram using ggplot
p <- ggplot(mice, aes(x = survive))
p <- p + geom_histogram(binwidth = 20)
p <- p + geom_rug()
p <- p + facet_grid(group ~ .)
p <- p + labs(title = "Mouse survival following a test surgery") + xlab("Survival (days)")
print(p)
[Histograms of mouse survival (days) following a test surgery, faceted by Control and Treatment groups.]

The standard error for the difference is 28.93 = √(25.24² + 14.14²), so the observed
difference of 30.63 is only 30.63/28.93 = 1.05 estimated standard errors greater than
zero, an insignificant result.
The two-sample t-test of the difference in means confirms the lack of statistically
significant difference between these two treatment groups with a p-value=0.3155.
t.test(survive ~ group, data = mice)
##
## Welch Two Sample t-test
##


## data: survive by group


## t = -1.0587, df = 9.6545, p-value = 0.3155
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -95.42263 34.15279
## sample estimates:
## mean in group Control mean in group Treatment
## 56.22222 86.85714
But these are small samples, and the control sample does not look normal. We
could do a nonparametric two-sample test of difference of medians. Or, we could use
the bootstrap to make our inference.

Example: Mouse survival, two-sample bootstrap, mean Here’s how the


bootstrap works in the two-sample mouse example. We draw with replacement from
each sample, calculate the mean for each sample, then take the difference in means.
Each is called a bootstrap sample of the difference in means. From these we derive
the bootstrap replicate of µ̂:
µ̂∗ = x̄∗ − ȳ ∗ .

Repeat this process a large number of times, say 10000 times, and obtain 10000
bootstrap replicates µ̂∗ . The summaries are in the code, followed by a histogram of
bootstrap replicates, µ̂∗ .
#### Example: Mouse survival, two-sample bootstrap, mean
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of means
for (i in 1:R) {
bs2[i] <- mean(sample(control, replace = TRUE))
bs1[i] <- mean(sample(treatment, replace = TRUE))
}
# bootstrap replicates of difference estimates
bs.diff <- bs1 - bs2
sd(bs.diff)
## [1] 27.00087
# sort the difference estimates to obtain bootstrap CI
diff.sorted <- sort(bs.diff)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
CI.bs
## [1] -21.96825 83.09524


## Plot the bootstrap distribution with CI


# First put data in data.frame for ggplot()
dat.diff <- data.frame(bs.diff)

library(ggplot2)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..), binwidth=5)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, mean")
p <- p + xlab("difference (red = 0, green = bootstrap CI)")
print(p)

[Histogram: bootstrap distribution of the difference in mean survival time, with vertical lines at 0 and at the bootstrap CI.]

Example: Mouse survival, two-sample bootstrap, median For most statis-


tics (such as the median) we don’t have a formula for the limiting value of the stan-
dard error, but in fact no formula is needed. Instead, we use the numerical output of
the bootstrap program. The summaries are in the code, followed by a histogram of
bootstrap replicates, η̂ ∗ .


Group        Data                                  (n)   Median est.   SE
Control:     52, 104, 146, 10, 51, 30, 40, 27, 46   (9)   46            ?
Treatment:   94, 197, 16, 38, 99, 141, 23           (7)   94            ?
Difference:                                               48            ?
#### Example: Mouse survival, two-sample bootstrap, median
sort(control)
## [1] 10 27 30 40 46 51 52 104 146
sort(treatment)
## [1] 16 23 38 94 99 141 197
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs1 <- rep(NA, R)
bs2 <- rep(NA, R)
# draw R bootstrap resamples of medians
for (i in 1:R) {
bs2[i] <- median(sample(control, replace = TRUE))
bs1[i] <- median(sample(treatment, replace = TRUE))
}
# bootstrap replicates of difference estimates
bs.diff <- bs1 - bs2
sd(bs.diff)
## [1] 40.43024
# sort the difference estimates to obtain bootstrap CI
diff.sorted <- sort(bs.diff)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
CI.bs
## [1] -29 111
## Plot the bootstrap distribution with CI
# First put data in data.frame for ggplot()
dat.diff <- data.frame(bs.diff)

library(ggplot2)
p <- ggplot(dat.diff, aes(x = bs.diff))
p <- p + geom_histogram(aes(y=..density..), binwidth=5)
p <- p + geom_density(alpha=0.1, fill="white")
p <- p + geom_rug()
# vertical line at 0 and CI
p <- p + geom_vline(xintercept=0, colour="#BB0000", linetype="dashed")
p <- p + geom_vline(xintercept=CI.bs[1], colour="#00AA00", linetype="longdash")
p <- p + geom_vline(xintercept=CI.bs[2], colour="#00AA00", linetype="longdash")
p <- p + labs(title = "Bootstrap distribution of difference in survival time, median")


p <- p + xlab("difference (red = 0, green = bootstrap CI)")


print(p)

[Histogram: bootstrap distribution of the difference in median survival time, with vertical lines at 0 and at the bootstrap CI.]

9.2.3 Comparing bootstrap sampling distribution from population and sample

Example: Law School, correlation of (LSAT, GPA) The population of average
student measurements of (LSAT, GPA) for the universe of 82 law schools is in
the table below. Imagine that we don't have all 82 schools worth of data. Consider
taking a random sample of 15 schools, indicated by the +'s.


School LSAT GPA School LSAT GPA School LSAT GPA


1 622 3.23 28 632 3.29 56 641 3.28
2 542 2.83 29 587 3.16 57 512 3.01
3 579 3.24 30 581 3.17 58 631 3.21
4+ 653 3.12 31+ 605 3.13 59 597 3.32
5 606 3.09 32 704 3.36 60 621 3.24
6+ 576 3.39 33 477 2.57 61 617 3.03
7 620 3.10 34 591 3.02 62 637 3.33
8 615 3.40 35+ 578 3.03 63 572 3.08
9 553 2.97 36+ 572 2.88 64 610 3.13
10 607 2.91 37 615 3.37 65 562 3.01
11 558 3.11 38 606 3.20 66 635 3.30
12 596 3.24 39 603 3.23 67 614 3.15
13+ 635 3.30 40 535 2.98 68 546 2.82
14 581 3.22 41 595 3.11 69 598 3.20
15+ 661 3.43 42 575 2.92 70+ 666 3.44
16 547 2.91 43 573 2.85 71 570 3.01
17 599 3.23 44 644 3.38 72 570 2.92
18 646 3.47 45+ 545 2.76 73 605 3.45
19 622 3.15 46 645 3.27 74 565 3.15
20 611 3.33 47+ 651 3.36 75 686 3.50
21 546 2.99 48 562 3.19 76 608 3.16
22 614 3.19 49 609 3.17 77 595 3.19
23 628 3.03 50+ 555 3.00 78 590 3.15
24 575 3.01 51 586 3.11 79+ 558 2.81
25 662 3.39 52+ 580 3.07 80 611 3.16
26 627 3.41 53+ 594 2.96 81 564 3.02
27 608 3.04 54 594 3.05 82+ 575 2.74
55 560 2.93

#### Example: Law School, correlation of (LSAT, GPA)


School <- 1:82

LSAT <- c(622, 542, 579, 653, 606, 576, 620, 615, 553, 607, 558, 596, 635,
581, 661, 547, 599, 646, 622, 611, 546, 614, 628, 575, 662, 627,
608, 632, 587, 581, 605, 704, 477, 591, 578, 572, 615, 606, 603,
535, 595, 575, 573, 644, 545, 645, 651, 562, 609, 555, 586, 580,
594, 594, 560, 641, 512, 631, 597, 621, 617, 637, 572, 610, 562,
635, 614, 546, 598, 666, 570, 570, 605, 565, 686, 608, 595, 590,
558, 611, 564, 575)

GPA <- c(3.23, 2.83, 3.24, 3.12, 3.09, 3.39, 3.10, 3.40, 2.97, 2.91, 3.11,
3.24, 3.30, 3.22, 3.43, 2.91, 3.23, 3.47, 3.15, 3.33, 2.99, 3.19,
3.03, 3.01, 3.39, 3.41, 3.04, 3.29, 3.16, 3.17, 3.13, 3.36, 2.57,
3.02, 3.03, 2.88, 3.37, 3.20, 3.23, 2.98, 3.11, 2.92, 2.85, 3.38,


2.76, 3.27, 3.36, 3.19, 3.17, 3.00, 3.11, 3.07, 2.96, 3.05, 2.93,
3.28, 3.01, 3.21, 3.32, 3.24, 3.03, 3.33, 3.08, 3.13, 3.01, 3.30,
3.15, 2.82, 3.20, 3.44, 3.01, 2.92, 3.45, 3.15, 3.50, 3.16, 3.19,
3.15, 2.81, 3.16, 3.02, 2.74)

Sampled <- c(0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,


0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

# law = population
law <- data.frame(School, LSAT, GPA, Sampled)
law$Sampled <- factor(law$Sampled)
# law.sam = sample
law.sam <- subset(law, Sampled == 1)

library(ggplot2)
p <- ggplot(law, aes(x = LSAT, y = GPA))
p <- p + geom_point(aes(colour = Sampled, shape = Sampled), alpha = 0.8, size = 2)
p <- p + labs(title = "Law School average scores of LSAT and GPA")
print(p)

[Scatter plot of GPA vs. LSAT (Law School average scores) for all 82 schools, with the 15 sampled schools marked.]

Let’s bootstrap the sample of 15 observations to get the bootstrap sampling dis-
tribution of correlation (for sampling 15 from the population). From the bootstrap
sampling distribution we’ll calculate a bootstrap confidence interval for the true pop-
ulation correlation, as well as a bootstrap standard deviation for the correlation. But
how well does this work? Let’s compare it against the true sampling distribution


by drawing 15 random schools from the population of 82 schools and calculating the
correlation. If the bootstrap works well (from our hopefully representative sample of
15), then the bootstrap sampling distribution from the 15 schools will be close to the
true sampling distribution.
The code below does that, followed by two histograms. In this case, the histograms
are noticeably non-normal, having a long tail toward the left. Inferences based on
the normal curve are suspect when the bootstrap histogram is markedly non-normal.
The histogram on the left is the nonparametric bootstrap sampling distribution, using
only the n = 15 sampled schools, with 10000 bootstrap replicates of the estimated
correlation corr(x∗). The histogram on the right is the true sampling distribution,
using 10000 replicates of corr(x∗) from the population of law school data, repeatedly
drawing n = 15 without replacement from the N = 82 points. Impressively, the bootstrap histogram on the
left strongly resembles the population histogram on the right. Remember, in a real
problem we would only have the information on the left, from which we would be
trying to infer the situation on the right.
# draw R bootstrap replicates
R <- 10000
# init location for bootstrap samples
bs.pop <- rep(NA, R)
bs.sam <- rep(NA, R)
# draw R bootstrap resamples of medians
for (i in 1:R) {
# sample() draws indices, then bootstrap correlation of LSAT and GPA
# population
bs.pop[i] = cor(law [sample(seq(1,nrow(law )), nrow(law.sam)
, replace = TRUE), 2:3])[1, 2]
# sample
bs.sam[i] = cor(law.sam[sample(seq(1,nrow(law.sam)), nrow(law.sam)
, replace = TRUE), 2:3])[1, 2]
}

# sort the difference estimates to obtain bootstrap CI


diff.sorted <- sort(bs.pop)
# 0.025th and 0.975th quantile gives equal-tail bootstrap CI
CI.bs.pop <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
# population correlation
cor(law [, c(2,3)])[1,2]
## [1] 0.7599979
CI.bs.pop
## [1] 0.4296745 0.9271040
sd(bs.pop)
## [1] 0.1295076
# sort the difference estimates to obtain bootstrap CI
diff.sorted <- sort(bs.sam)


# 0.025th and 0.975th quantile gives equal-tail bootstrap CI


CI.bs.sam <- c(diff.sorted[round(0.025*R)], diff.sorted[round(0.975*R+1)])
# sample correlation
cor(law.sam[, c(2,3)])[1,2]
## [1] 0.7763745
CI.bs.sam
## [1] 0.4637826 0.9637982
sd(bs.sam)
## [1] 0.1334595
law.bs.df <- data.frame(corr = c(bs.pop, bs.sam), group = c(rep("Pop",R),rep("Sam",R)))
# histogram using ggplot
library(ggplot2)
p <- ggplot(law.bs.df, aes(x = corr, fill=group))
p <- p + geom_histogram(binwidth = .01, alpha = 0.5, position="identity")
p <- p + geom_rug(aes(colour=group))
p <- p + labs(title = "Sampling distribution of 15 observation from 82 (Pop) vs 15 (Sam, BS)") +
xlab("Correlation")
print(p)
[Overlaid histograms of the correlation: true sampling distribution from the population of 82 schools (Pop) vs the bootstrap distribution from the sample of 15 (Sam); both are left-skewed.]



Chapter 10

Power and Sample size

Contents
10.1 Power Analysis
10.2 Effect size
10.3 Sample size
10.4 Power calculation via simulation

Learning objectives
After completing this topic, you should be able to:
assess the power of a test or
determine the required sample size for a study.
Achieving these goals contributes to mastery in these course learning outcomes:
7. Distinguish between statistical significance and scientific relevance.
10. Identify and explain the statistical methods, assumptions, and limitations.
12. Make evidence-based decisions by constructing and deciding between testable
hypotheses using appropriate data and methods.

10.1 Power Analysis


The meaning of statistical power Power is the probability (1 − β) of detecting
an effect, given that the effect is really there. In other words, it is the probability of
correctly rejecting the null hypothesis when it is in fact false. For example, let’s say
that we have a simple study with drug A and a placebo group, and that the drug
truly is effective; the power is the probability of finding a difference between the two
groups. So, imagine that we had a power of 1−β = 0.8 and that this simple study was

conducted many times. Having power of 0.8 means that 80% of the time, we would
get a statistically significant difference between the drug A and placebo groups. This
also means that 20% of the times that we run this experiment, we will not obtain a
statistically significant effect between the two groups, even though there really is an
effect in reality. That is, the probability of a Type-II error is β = 0.2.
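A minimal sketch of this kind of calculation using R's built-in power.t.test(); the effect size, standard deviation, and group size below are made up for illustration.
# power of a two-sample t-test for a hypothetical drug A vs placebo comparison
power.t.test(n = 50, delta = 5, sd = 10, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# or solve for the per-group sample size needed to reach power 0.8
power.t.test(power = 0.8, delta = 5, sd = 10, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")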

One-sample power figure Consider the plot below for a one-sample one-tailed
greater-than t-test. If the null hypothesis, H0 : µ = µ0, is true, then the test statistic
t follows the null distribution indicated by the hashed area. Under a specific alter-
native hypothesis, H1 : µ = µ1 , the test statistic t follows the distribution indicated
by the solid area. If α is the probability of making a Type-I error (rejecting H0 when
it is true), then “crit. val.” indicates the location of the tcrit value associated with H0
on the scale of the data. The rejection region is the area under H0 that is at least
as far as “crit. val.” is from µ0 . The power (1 − β) of the test is the green area, the
area under H1 in the rejection region. A Type-II error is made when H1 is true, but
we fail to reject H0 in the red region. (Note that for a two-tailed test the rejection
regions in both tails under the H1 curve contribute to the power.)
#### One-sample power
# Power plot with two normal distributions
# http://stats.stackexchange.com/questions/14140/how-to-best-display-graphically-type-ii-beta-error-powe

x <- seq(-4, 4, length=1000)


hx <- dnorm(x, mean=0, sd=1)

plot(x, hx, type="n", xlim=c(-4, 8), ylim=c(0, 0.5),


ylab = "",
xlab = "",
main= expression(paste("Type-II Error (", beta, ") and Power (", 1-beta, ")")), axes=FALSE)

#shift = qnorm(1-0.025, mean=0, sd=1)*1.7


shift = qnorm(1-0.05, mean=0, sd=1)*1.7 # one-tailed
xfit2 <- x + shift
yfit2 <- dnorm(xfit2, mean=shift, sd=1)

#axis(1, at = c(-qnorm(.025), 0, shift, -4),


# labels = expression("p-value", 0, mu, -infinity ))
#axis(1, at = c(-qnorm(.025), 0, shift),
# labels = expression((t[alpha/2]), mu[0], mu[1]))
axis(1, at = c(-qnorm(.05), 0, shift),
labels = expression("crit. val.", mu[0], mu[1]))
axis(1, at = c(-4, 4+shift),
labels = expression(-infinity, infinity ), lwd=1, lwd.tick=FALSE)

## The alternative hypothesis area


# The red - underpowered area


lb <- min(xfit2)
#ub <- round(qnorm(.975),2)
ub <- round(qnorm(.95),2)
col1 = "#CC2222"

i <- xfit2 >= lb & xfit2 <= ub


polygon(c(lb,xfit2[i],ub), c(0,yfit2[i],0), col=col1)

# The green area where the power is


col2 = "#22CC22"
i <- xfit2 >= ub
polygon(c(ub,xfit2[i],max(xfit2)), c(0,yfit2[i],0), col=col2)

# Outline the alternative hypothesis


lines(xfit2, yfit2, lwd=2)

## Print null hypothesis area


#col_null = "#DDDDDD"
#polygon(c(min(x), x,max(x)), c(0,hx,0), col=col_null)
#lines(x, hx, lwd=2)
col_null = "#AAAAAA"
polygon(c(min(x), x,max(x)), c(0,hx,0), col=col_null, lwd=2, density=c(10, 40), angle=-45, border=col_null)
lines(x, hx, lwd=2, lty="dashed", col=col_null)

axis(1, at = (c(ub, max(xfit2))), labels=c("", expression(infinity)),


col=col2, lwd=1, lwd.tick=FALSE)

#legend("topright", inset=.05, title="Color",


# c("Null hypoteses","Type II error", "True"), fill=c(col_null, col1, col2), horiz=FALSE)
legend("topright", inset=.015, title="Color",
c("Null hypothesis","Type-II error", "Power"), fill=c(col_null, col1, col2),
angle=-45,
density=c(20, 1000, 1000), horiz=FALSE)

abline(v=ub, lwd=2, col="#000088", lty="dashed")

arrows(ub, 0.45, ub+1, 0.45, lwd=3, col="#008800")


arrows(ub, 0.45, ub-1, 0.45, lwd=3, col="#880000")


[Figure: "Type-II Error (β) and Power (1 − β)", showing the hashed null-hypothesis curve, the red Type-II error region, and the green power region along an axis marked −∞, µ0, crit. val., µ1, ∞.]

Example: IQ drug Imagine that we are evaluating the effect of a putative memory
enhancing drug. We have randomly sampled 25 people from a population known to
be normally distributed with a µ of 100 and a σ of 15. We administer the drug, wait
a reasonable time for it to take effect, and then test our subjects’ IQ. Assume that
we were so confident in our belief that the drug would either increase IQ or have no
effect that we entertained one-sided (directional) hypotheses. Our null hypothesis is
that after administering the drug µ ≤ 100 and our alternative hypothesis is µ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the null
is easy: it becomes µ = 100. The alternative is more troublesome. If we knew that
the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis
would be µ = 115, and we could compute power, the probability of correctly rejecting
the false null hypothesis given that µ is really equal to 115 after drug treatment, not
100 (normal IQ). But if we already knew how large the effect of the drug was, we
would not need to do inferential statistics. . .
One solution is to decide on a minimum nontrivial effect size. What is the
smallest effect that you would consider to be nontrivial? Suppose that you decide
that if the drug increases µIQ by 2 or more points, then that is a nontrivial effect, but
if the mean increase is less than 2 then the effect is trivial.
Now we can test the null of µ = 100 versus the alternative of µ = 102. Consider
the previous plot. Let the left curve represent the distribution of sample means if the


null hypothesis were true, µ = 100. This sampling distribution has a µ = 100 and
a σȲ = 15/√25 = 3. Let the right curve represent the sampling distribution if the
exact alternative hypothesis is true, µ = 102. Its µ is 102 and, assuming the drug has
no effect on the variance in IQ scores, also has σȲ = 3.
The area in the upper tail of the null distribution (the gray hatched curve) beyond the
critical value is α. Assume we are using a one-tailed α of 0.05. How large would a sample mean need
be for us to reject the null? Since the upper 5% of a normal distribution extends
from 1.645σ above the µ up to positive infinity, the sample mean IQ would need to be
100 + 1.645(3) = 104.935 or more to reject the null. What are the chances of getting
a sample mean of 104.935 or more if the alternative hypothesis is correct, if the drug
increases IQ by 2 points? The area under the alternative curve from 104.935 up to
positive infinity represents that probability, which is power. Assuming the alternative
hypothesis is true, that µ = 102, the probability of rejecting the null hypothesis is
the probability of getting a sample mean of 104.935 or more in a normal distribution
with µ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P (Z > 0.98) = 0.1635.
That is, power is about 16%. If the drug really does increase IQ by an average of 2
points, we have a 16% chance of rejecting the null. If its effect is even larger, we have
a greater than 16% chance.
Suppose we consider 5 (rather than 2) the minimum nontrivial effect size. This
will separate the null and alternative distributions more, decreasing their overlap and
increasing power. Now, Z = (104.935 − 105)/3 = −0.02, P (Z > −0.02) = 0.5080 or
about 51%. It is easier to detect large effects than small effects.
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ;
α is now split into both tails of the null distribution, 0.025 in each tail. We shall
reject the null if the sample mean is 1.96 or more standard errors away from the
µ of the null distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more
(or if it is 100 − 1.96(3) = 94.12 or less) we reject the null. The probability of that
happening if the alternative is correct (µ = 105) is: Z = (105.88 − 105)/3 = 0.29,
P (Z > 0.29) = 0.3859, and P (Z < (94.12 − 105)/3) = P (Z < −3.63) = 0.00014, for
a total power = (1 − β) = 0.3859 + 0.00014, or about 39%. Note that our power is less
than it was with a one-tailed test. If you can correctly predict the direction of
effect, a one-tailed test is more powerful than a two-tailed test.
Consider what would happen if you increased sample size to 100. Now σȲ =
15/√100 = 1.5. The null and alternative distributions are narrower and should
overlap less, increasing power. With σȲ = 1.5 the sample mean will need to be 100 +
(1.96)(1.5) = 102.94 (rather than 105.88 from before) or more to reject the null.
If the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37,
P (Z > −1.37) = 0.9147, or between 91 and 92%. Anything that decreases the
standard error will increase power. This may be achieved by increasing
the sample size N or by reducing the σ of the dependent variable. The σ


of the dependent variable may be reduced by reducing the influence of extraneous


variables upon the dependent variable (eliminating “noise” in the dependent variable
makes it easier to detect the signal).
Now consider what happens if you change the significance level, α. Let us reduce
α to 0.01. Now the sample mean must be 2.58 or more standard errors from the
null µ before we reject the null. That is, 100 + 2.58(1.5) = 103.87 (rather than
102.94 with α = 0.05). Under the alternative, Z = (103.87 − 105)/1.5 = −0.75,
P (Z > −0.75) = 0.7734 or about 77%, less than it was with α = 0.05. Reducing α
reduces power.
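As a quick check, these normal-approximation power values can be reproduced directly in R with qnorm() and pnorm(); for example:
# reproduce the normal-approximation power calculations above
mu0 <- 100; sigma <- 15
se25 <- sigma / sqrt(25)                     # 3
# one-tailed, alpha = 0.05, alternative mu = 102
crit <- mu0 + qnorm(1 - 0.05) * se25         # 104.935
1 - pnorm((crit - 102) / se25)               # about 0.16
# same critical value, alternative mu = 105
1 - pnorm((crit - 105) / se25)               # about 0.51
# two-tailed, alpha = 0.05, alternative mu = 105
crit.hi <- mu0 + qnorm(1 - 0.025) * se25     # 105.88
crit.lo <- mu0 - qnorm(1 - 0.025) * se25     #  94.12
(1 - pnorm((crit.hi - 105) / se25)) + pnorm((crit.lo - 105) / se25)  # about 0.39
# two-tailed, alpha = 0.05, n = 100
se100 <- sigma / sqrt(100)                   # 1.5
1 - pnorm((mu0 + qnorm(1 - 0.025) * se100 - 105) / se100)            # about 0.91
# two-tailed, alpha = 0.01, n = 100
1 - pnorm((mu0 + qnorm(1 - 0.005) * se100 - 105) / se100)            # about 0.77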
Please note that all of the above analyses have assumed that we have used a
normally distributed test statistic, as Z = (Ȳ − µ0 )/σȲ will be if the dependent
variable is normally distributed in the population or if sample size is large enough to
invoke the central limit theorem (CLT). Remember that using Z also requires that
you know the population σ rather than estimating it from the sample data. We more
often estimate the population σ, using Student’s t as the test statistic. If N is fairly
large, Student’s t is nearly normal, so this is no problem. For example, with a two-
tailed α = 0.05 and N = 25, we went out ±1.96 standard errors to mark off the
rejection region. With Student’s t on N − 1 = 24 df we should have gone out ±2.064
standard errors. But 1.96 versus 2.06 is a relatively trivial difference, so we should
feel comfortable with the normal approximation. If, however, we had N = 5, df = 4,
critical t = ±2.776, then the normal approximation would not do. A more complex
analysis would be needed.
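When σ must be estimated and N is small, the noncentral t distribution gives the exact answer; base R's power.t.test() performs that calculation. For example, for the IQ problem with a minimum nontrivial effect of 5 IQ points:
# exact one-sample power based on the noncentral t distribution (base R)
power.t.test(n = 25, delta = 5, sd = 15, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
# with a much smaller sample the small-sample t correction matters more
power.t.test(n = 5, delta = 5, sd = 15, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")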

10.2 Effect size


For the one-sample test, the effect size in σ units is d = (µ1 − µ0 )/σ. For our IQ
problem with minimum nontrivial effect size at 5 IQ points, d = (105−100)/15 = 1/3.
Cohen’s1 conventions for small, medium, and large effects for a two-sample difference
test between two means is in the table below.
One- or two-sample difference of means
Size of effect d % variance
small 0.2 1
medium 0.5 6
large 0.8 16
Cohen has conventions for other tests (correlation, contingency tables, etc.), but they
should be used with caution.
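For the IQ example, the effect size and the corresponding power can be computed directly; a small sketch using the pwr package:
# effect size in sigma units for the IQ example
d <- (105 - 100) / 15   # 1/3, between Cohen's "small" and "medium"
library(pwr)
# power of the one-sided one-sample test with n = 25 at this effect size
pwr.t.test(n = 25, d = d, sig.level = 0.05,
           type = "one.sample", alternative = "greater")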
What is a small or even trivial effect in one context may be a large effect in another
context. For example, Rosnow and Rosenthal (1989) discussed a 1988 biomedical
1
Cohen, J. (1988). Statistical power analysis for the behavior sciences. (2nd ed.). Hillsdale, NJ:
Erlbaum.


research study on the effects of taking a small, daily dose of aspirin. Each participant
was instructed to take one pill a day. For about half of the participants the pill
was aspirin, for the others it was a placebo. The dependent variable was whether or
not the participant had a heart attack during the study. In terms of a correlation
coefficient, the size of the observed effect was r = 0.034. In terms of percentage of
variance explained, that is 0.12%. In other contexts this might be considered a trivial
effect, but in this context it was so large an effect that the researchers decided it was
unethical to continue the study, and they contacted all of the participants who were
taking the placebo and told them to start taking aspirin every day.

10.3 Sample size


Before you can answer the question “how many subjects do I need?”, you will have
to answer several other questions, such as:
• How much power do I want?
• What is the likely size (in the population) of the effect I am trying to detect,
or, what is the smallest effect size that I would consider of importance?
• What criterion of statistical significance will I employ?
• What test statistic will I employ?
• What is the standard deviation (in the population) of the criterion variable?
• For correlated samples designs, what is the correlation (in the population) be-
tween groups?
If one considers Type I and Type II errors equally serious, then one should have
enough power to make α = β. If employing the traditional 0.05 criterion of statistical
significance, that would mean you should have 95% power. However, getting 95%
power usually involves expenses too great – that is, too large a sample size.
A common convention is to try to get at least enough data to have 80% power.
So, how do you figure out how many subjects you need to have the desired amount
of power? There are several methods, including the following (a short sketch follows
this list):
• You could buy an expensive, professional-quality software package to do the
power analysis.
• You could buy an expensive, professional-quality book on power analysis and
learn to do the calculations yourself and/or to use power tables and figures to
estimate power.
• You could try to find an interactive web page on the Internet that will do the
power analysis for you. This is probably fine, but be cautious.
• You could download and use the G*Power program, which is free, not too
difficult to use, and generally reliable (this is not to say that it is error free).
• You could use the simple guidelines provided in Jacob Cohen’s “A Power Primer”
(Psychological Bulletin, 1992, 112, 155-159).
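In R, the pwr package (loaded below for the power curves) offers a convenient route: it solves for whichever of n, d, power, and sig.level is left unspecified. For example, the sample size needed for 80% power to detect a medium effect (d = 0.5) with a two-sided one-sample t-test:
library(pwr)
# leaving n unspecified asks pwr.t.test() to solve for the required sample size
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")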


The plots below indicate the amount of power for a given effect size and sample
size for a one-sample t-test and ANOVA test. This graph makes clear the diminishing
returns you get for adding more and more subjects if you already have moderate to
high power. For example, let’s say we’re doing a one-sample test and we have an effect
size of 0.2 and only 10 subjects. We can see that we have a power of about 0.15,
which is really, really low. Going to 25 subjects increases our power to about 0.25,
and to 100 subjects increases our power to about 0.6. But if we had a large effect size
of 0.8, 10 subjects would already give us a power of about 0.8, and using 25 or 100
subjects would both give a power at least 0.98. So each additional subject gives you
less additional power. This curve also illustrates the “cost” of increasing your desired
power from 0.8 to 0.98.
# Power curve plot for one-sample t-test with range of sample sizes
# http://stackoverflow.com/questions/4680163/power-vs-effect-size-plot/4680786#4680786

P <- 3 # number of groups for ANOVA


fVals <- seq(0, 1.2, length.out=100) # effect sizes f for ANOVA
dVals <- seq(0, 3, length.out=100) # effect sizes d for t-Test
#nn <- seq(10, 25, by=5) # group sizes
nn <- c(5,10,25,100) # group sizes
alpha <- 0.05 # test for level alpha

# function to calculate one-way ANOVA power for given group size


getFPow <- function(n) {
critF <- qf(1-alpha, P-1, P*n - P) # critical F-value

# probabilities of exceeding this F-value given the effect sizes f


# P*n*fVals^2 is the non-centrality parameter
1-pf(critF, P-1, P*n - P, P*n * fVals^2)
}

# function to calculate one-sample t-Test power for given group size


getTPow <- function(n) {
critT <- qt(1-alpha, n-1) # critical t-value

# probabilities of exceeding this t-value given the effect sizes d


# sqrt(n)*d is the non-centrality parameter
1-pt(critT, n-1, sqrt(n)*dVals)
}

powsF <- sapply(nn, getFPow) # ANOVA power for for all group sizes
powsT <- sapply(nn, getTPow) # t-Test power for for all group sizes

#dev.new(width=10, fig.height=5)
par(mfrow=c(1, 2))
matplot(dVals, powsT, type="l", lty=1, lwd=2, xlab="effect size d",
ylab="Power", main="Power one-sample t-test", xaxs="i",


xlim=c(-0.05, 1.1), col=c("blue", "red", "darkgreen", "green"))


#legend(x="bottomright", legend=paste("N =", c(5,10,25,100)), lwd=2,
# col=c("blue", "red", "darkgreen", "green"))
legend(x="bottomright", legend=paste("N =", nn), lwd=2,
col=c("blue", "red", "darkgreen", "green"))
#matplot(fVals, powsF, type="l", lty=1, lwd=2, xlab="effect size f",
# ylab="Power", main=paste("Power one-way ANOVA, ", P, " groups", sep=""), xaxs="i",
# xlim=c(-0.05, 1.1), col=c("blue", "red", "darkgreen", "green"))
##legend(x="bottomright", legend=paste("Nj =", c(10, 15, 20, 25)), lwd=2,
## col=c("blue", "red", "darkgreen", "green"))
#legend(x="bottomright", legend=paste("Nj =", nn), lwd=2,
# col=c("blue", "red", "darkgreen", "green"))

library(pwr)
pwrt2 <- pwr.t.test(d=.2,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt3 <- pwr.t.test(d=.3,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt5 <- pwr.t.test(d=.5,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
pwrt8 <- pwr.t.test(d=.8,n=seq(2,100,1),
sig.level=.05,type="one.sample", alternative="two.sided")
#plot(pwrt$n, pwrt$power, type="b", xlab="sample size", ylab="power")
matplot(matrix(c(pwrt2$n,pwrt3$n,pwrt5$n,pwrt8$n),ncol=4),
matrix(c(pwrt2$power,pwrt3$power,pwrt5$power,pwrt8$power),ncol=4),
type="l", lty=1, lwd=2, xlab="sample size",
ylab="Power", main="Power one-sample t-test", xaxs="i",
xlim=c(0, 100), ylim=c(0,1), col=c("blue", "red", "darkgreen", "green"))
legend(x="bottomright", legend=paste("d =", c(0.2, 0.3, 0.5, 0.8)), lwd=2,
col=c("blue", "red", "darkgreen", "green"))


[Figure: two panels titled "Power one-sample t-test": power versus effect size d for N = 5, 10, 25, 100 (left) and power versus sample size for d = 0.2, 0.3, 0.5, 0.8 (right).]

Reasons to do a power analysis There are several reasons why one might do
a power analysis. (1) Perhaps the most common use is to determine the necessary
number of subjects needed to detect an effect of a given size. Note that trying to find
the absolute, bare minimum number of subjects needed in the study is often not a
good idea. (2) Additionally, power analysis can be used to determine power, given an
effect size and the number of subjects available. You might do this when you know,
for example, that only 75 subjects are available (or that you only have the budget for
75 subjects), and you want to know if you will have enough power to justify actually
doing the study. In most cases, there is really no point to conducting a study that is
seriously underpowered. Besides the issue of the number of necessary subjects, there
are other good reasons for doing a power analysis. (3) For example, a power analysis
is often required as part of a grant proposal. (4) And finally, doing a power analysis is
often just part of doing good research. A power analysis is a good way of making sure
that you have thought through every aspect of the study and the statistical analysis
before you start collecting data.

Limitations Despite these advantages of power analyses, there are some limita-
tions. (1) One limitation is that power analyses do not typically generalize very well.
If you change the methodology used to collect the data or change the statistical pro-
cedure used to analyze the data, you will most likely have to redo the power analysis.
(2) In some cases, a power analysis might suggest a number of subjects that is inad-
equate for the statistical procedure. For example (beyond the scope of this class), a


power analysis might suggest that you need 30 subjects for your logistic regression,
but logistic regression, like all maximum likelihood procedures, requires much larger
sample sizes. (3) Perhaps the most important limitation is that a standard power
analysis gives you a “best case scenario” estimate of the necessary number of sub-
jects needed to detect the effect. In most cases, this “best case scenario” is based
on assumptions and educated guesses. If any of these assumptions or guesses are
incorrect, you may have less power than you need to detect the effect. (4) Finally,
because power analyses are based on assumptions and educated guesses, you often
get a range of the number of subjects needed, not a precise number. For example, if
you do not know what the standard deviation of your outcome measure will be, you
guess at this value, run the power analysis and get X number of subjects. Then you
guess a slightly larger value, rerun the power analysis and get a slightly larger number
of necessary subjects. You repeat this process over the plausible range of values of
the standard deviation, which gives you a range of the number of subjects that you
will need.
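That "repeat over a plausible range" idea is easy to automate. For example, using base R's power.t.test() with illustrative numbers (a 5-unit difference, 80% power, and a grid of guesses for the standard deviation):
# required sample size per group over a range of plausible standard deviations
sd.grid <- seq(10, 20, by = 2.5)   # illustrative guesses for sigma
n.req <- sapply(sd.grid, function(s) {
  power.t.test(delta = 5, sd = s, power = 0.80, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$n
})
data.frame(sd = sd.grid, n.per.group = ceiling(n.req))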

Other considerations After all of this discussion of power analyses and the nec-
essary number of subjects, we need to stress that power is not the only consideration
when determining the necessary sample size. For example, different researchers might
have different reasons for conducting a regression analysis. (1) One might want to see
if the regression coefficient is different from zero, (2) while the other wants to get a
very precise estimate of the regression coefficient with a very small confidence interval
around it. This second purpose requires a larger sample size than does merely seeing
if the regression coefficient is different from zero. (3) Another consideration when
determining the necessary sample size is the assumptions of the statistical procedure
that is going to be used (e.g., parametric vs nonparametric procedure). (4) The num-
ber of statistical tests that you intend to conduct will also influence your necessary
sample size: the more tests that you want to run, the more subjects that you will need
(multiple comparisons). (5) You will also want to consider the representativeness of
the sample, which, of course, influences the generalizability of the results. Unless you
have a really sophisticated sampling plan, the greater the desired generalizability, the
larger the necessary sample size.

10.4 Power calculation via simulation


Using the principles of the bootstrap (introduced in Chapter 9) we can estimate statistical
power through simulation.

Example: IQ drug, revisited Recall that we sample N = 25 people from a


population known to be normally distributed with a µ of 100 and a σ of 15. Consider


the first one-sided alternative H0 : µ = 100 and H1 : µ > 100. Assume the minimum
nontrivial effect size was that the drug increases µIQ by 2 or more points, so that the
specific alternative to consider is H1 : µ = 102. What is the power of this test?
We already saw how to calculate this analytically. To solve this computationally,
we need to simulate samples of N = 25 from the alternative distribution (µ = 102
and σ = 15) and see what proportion of the time we correctly reject H0 .
#### Example: IQ drug, revisited
# R code to simulate one-sample one-sided power

# Strategy:
# Do this R times:
# draw a sample of size N from the distribution specified by the alternative hypothesis
# That is, 25 subjects from a normal distribution with mean 102 and sigma 15
# Calculate the mean of our sample
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu=100
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power

n <- 25; # sample size of 25


mu0 <- 100; # null hypothesis mean of 100
mu1 <- 102; # alternative mean of 102
#mu1 <- 105; # alternative mean of 105
sigma <- 15; # standard deviation of normal population

alpha <- 0.05; # significance level

R <- 10000; # Repetitions to draw sample and see whether we reject H0


# The proportion of these that reject H0 is the power

reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
# to fill with 0 (fail to reject H0) or 1 (reject H0)
for (i in 1:R) {
sam <- rnorm(n, mean=mu1, sd=sigma); # sam is a vector with 25 values

ybar <- mean(sam); # Calculate the mean of our sample sam

z <- (ybar - mu0) / (sigma / sqrt(n)); # z-statistic (assumes we know sigma)


# we could also have calculated the t-statistic, here

pval <- 1-pnorm(z); # one-sided right-tail p-value


# pnorm gives the area to the left of z
# therefore, the right-tail area is 1-pnorm(z)

if (pval < 0.05) {


reject[i] <- 1; # 1 for correctly rejecting H0
} else {


reject[i] <- 0; # 0 for incorrectly fail to reject H0


}

power <- mean(reject); # the average reject (proportion of rejects) is the power
power
## [1] 0.166
# 0.1655 for mu1=102
# 0.5082 for mu1=105

Our simulation (this time) with µ1 = 102 gave a power of 0.166 (exact answer
is P (Z > 0.98) = 0.1635). Rerunning with µ1 = 105 gave a power of 0.5082 (exact
answer is P (Z > −0.02) = 0.5080). Our simulation well-approximates the true value,
and the power can be made more precise by increasing the number of repetitions
calculated. However, two to three decimal precision is quite sufficient.
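The precision of a simulated power estimate is easy to quantify: each repetition is a Bernoulli trial, so the estimate's Monte Carlo standard error is sqrt(power(1 − power)/R). For example:
# Monte Carlo standard error of the simulated power estimate
p.hat <- 0.166                 # simulated power from above
R     <- 10000                 # number of repetitions
sqrt(p.hat * (1 - p.hat) / R)  # roughly 0.004, so 2-3 decimal places is realistic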

Example: Head breadth Recall the head breadth example in Chapter 3 compar-
ing maximum head breadths (in millimeters) of modern day Englishmen with ancient
Celts. The data are summarized below.
Descriptive Statistics: ENGLISH, CELTS
Variable N Mean SE Mean StDev Minimum Q1 Median Q3 Maximum
ENGLISH 18 146.50 1.50 6.38 132.00 141.75 147.50 150.00 158.00
CELTS 16 130.75 1.36 5.43 120.00 126.25 131.50 135.50 138.00

Imagine that we don’t have the information above. Imagine we have been invited
to a UK university to take skull measurements for 18 modern day Englishmen, and
16 ancient Celts. We have some information about modern day skulls to use as prior
information for measurement mean and standard deviation. What is the power to
observe a difference between the populations? Let’s make some reasonable assump-
tions that allow us to be a bit conservative. Let’s assume the sampled skulls from
each of our populations is a random sample with common standard deviation 7mm,
and let’s assume we can’t get the full sample but can only measure 15 skulls from
each population. At a significance level of α = 0.05, what is the power for detecting
a difference of 5, 10, 15, 20, or 25 mm?
The theoretical two-sample power result is not too hard to derive (and is avail-
able in text books), but let’s simply compare the power calculated exactly and by
simulation.
For the exact result we use R library pwr. Below is the function call as well as
the result. Note that we specified multiple effect sizes (diff/SD) in one call of the
function.
# R code to compute exact two-sample two-sided power
library(pwr) # load the power calculation library


pwr.t.test(n = 15,
d = c(5,10,15,20,25)/7,
sig.level = 0.05,
power = NULL,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 15
## d = 0.7142857, 1.4285714, 2.1428571, 2.8571429, 3.5714286
## sig.level = 0.05
## power = 0.4717438, 0.9652339, 0.9998914, 1.0000000, 1.0000000
## alternative = two.sided
##
## NOTE: n is number in *each* group

To simulate the power under the same circumstances, we follow a similar strategy
as in the one-sample example.
# R code to simulate two-sample two-sided power

# Strategy:
# Do this R times:
# draw a sample of size N from the two hypothesized distributions
# That is, 15 subjects from a normal distribution with specified means and sigma=7
# Calculate the mean of the two samples
# Calculate the associated z-statistic
# See whether that z-statistic has a p-value < 0.05 under H0: mu_diff=0
# If we reject H0, then set reject = 1, else reject = 0.
# Finally, the proportion of rejects we observe is the approximate power

n <- 15; # sample size of 15 in each group


mu1 <- 147; # null hypothesis English mean
mu2 <- c(142, 137, 132, 127, 122); # Celt means
sigma <- 7; # standard deviation of normal population

alpha <- 0.05; # significance level

R <- 2e4; # Repetitions to draw sample and see whether we reject H0


# The proportion of these that reject H0 is the power

power <- rep(NA,length(mu2)); # allocate a vector to store the calculated power in

for (j in 1:length(mu2)) { # do for each value of mu2

reject <- rep(NA, R); # allocate a vector of length R with missing values (NA)
# to fill with 0 (fail to reject H0) or 1 (reject H0)


for (i in 1:R) {
sam1 <- rnorm(n, mean=mu1 , sd=sigma); # English sample
sam2 <- rnorm(n, mean=mu2[j], sd=sigma); # Celt sample

ybar1 <- mean(sam1); # Calculate the mean of the English sample
ybar2 <- mean(sam2); # Calculate the mean of the Celt sample

# z-statistic (assumes we know sigma)


# we could also have calculated the t-statistic, here
z <- (ybar2 - ybar1) / (sigma * sqrt(1/n+1/n));

pval.Left <- pnorm(z); # area under left tail


pval.Right <- 1-pnorm(z); # area under right tail
# p-value is twice the smaller tail area
pval <- 2 * min(pval.Left, pval.Right);

if (pval < 0.05) {


reject[i] <- 1; # 1 for correctly rejecting H0
} else {
reject[i] <- 0; # 0 for incorrectly fail to reject H0
}

# the average reject (proportion of rejects) is the power


power[j] <- mean(reject);
}

power
## [1] 0.49275 0.97650 1.00000 1.00000 1.00000
Note the similarity between power calculated using both the exact and simulation
methods. If there is a power calculator for your specific problem, it is best to use
that because it is faster and there is no programming. However, using the simula-
tion method is better if we wanted to entertain different sample sizes with different
standard deviations, etc. There may not be a standard calculator for our specific
problem, so knowing how to simulate the power can be valuable.
  µE    µC  diff  SD  nE  nC   exact power   simulated power
 147   142     5   7  15  15        0.4717            0.4928
 147   137    10   7  15  15        0.9652            0.9765
 147   132    15   7  15  15        0.9999            1
 147   127    20   7  15  15        1.0000            1
 147   122    25   7  15  15        1.0000            1



Chapter 11

Data Cleaning

Contents
11.1 The five steps of statistical analysis . . . . . . . . . . . . 366
11.2 R background review . . . . . . . . . . . . . . . . . . . . . 367
11.2.1 Variable types . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.2.2 Special values and value-checking functions . . . . . . . . . 368
11.3 From raw to technically correct data . . . . . . . . . . . . 369
11.3.1 Technically correct data . . . . . . . . . . . . . . . . . . . . 369
11.3.2 Reading text data into an R data.frame . . . . . . . . . . . 370
11.4 Type conversion . . . . . . . . . . . . . . . . . . . . . . . . 377
11.4.1 Introduction to R’s typing system . . . . . . . . . . . . . . 377
11.4.2 Recoding factors . . . . . . . . . . . . . . . . . . . . . . . . 378
11.4.3 Converting dates . . . . . . . . . . . . . . . . . . . . . . . . 380
11.5 Character-type manipulation . . . . . . . . . . . . . . . . 382
11.5.1 String normalization . . . . . . . . . . . . . . . . . . . . . . 383
11.5.2 Approximate string matching . . . . . . . . . . . . . . . . . 384
11.6 From technically correct data to consistent data . . . . . 387
11.6.1 Detection and localization of errors . . . . . . . . . . . . . . 388
11.6.2 Edit rules for detecting obvious inconsistencies . . . . . . . 393
11.6.3 Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
11.6.4 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 403


Data cleaning1, or data preparation, is an essential part of statistical analysis. In
fact, in practice it is often more time-consuming than the statistical analysis itself.
Data cleaning may profoundly influence the statistical statements based on the data.
Typical actions like imputation or outlier handling obviously influence the results of
a statistical analysis. For this reason, data cleaning should be considered a statistical
operation, to be performed in a reproducible manner. The R statistical environment
provides a good environment for reproducible data cleaning since all cleaning actions
can be scripted and therefore reproduced.

11.1 The five steps of statistical analysis


Statistical analysis can be viewed as the result of a number of value-increasing data
processing steps.

1. Raw data
      |  type checking, normalizing
      v
2. Technically correct data
      |  fix and impute
      v
3. Consistent data
      |  estimate, analyze, derive, etc.
      v
4. Statistical results
      |  tabulate, plot
      v
5. Formatted output
The first two transitions, from raw data to consistent data, are the data cleaning steps.
Each box represents data in a certain state while each arrow represents the activ-
ities needed to get from one state to the other.
1. Raw Data The data “as is” may lack headers, contain wrong data types (e.g.,
numbers stored as strings), wrong category labels, unknown or unexpected charac-
ter encoding and so on. Reading such files into an R data.frame directly is either
difficult or impossible without some sort of preprocessing.
1 Content in this chapter is derived with permission from Statistics Netherlands at
http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf


2. Technically correct data The data can be read into an R data.frame, with
correct names, types and labels, without further trouble. However, that does not
mean that the values are error-free or complete.
For example, an age variable may be reported negative, an under-aged person may
be registered to possess a driver’s license, or data may simply be missing. Such
inconsistencies obviously depend on the subject matter that the data pertains to,
and they should be ironed out before valid statistical inference from such data can
be produced.
3. Consistent data The data is ready for statistical inference. It is the data that
most statistical theories use as a starting point. Ideally, such theories can still
be applied without taking previous data cleaning steps into account. In practice
however, data cleaning methods like imputation of missing values will influence
statistical results and so must be accounted for in the following analyses or inter-
pretation thereof.
4. Statistical results The results of the analysis have been produced and can be
stored for reuse.
5. Formatted output The results in tables and figures ready to include in sta-
tistical reports or publications.

Best practice Store the input data for each stage (raw, technically correct, consis-
tent, results, and formatted) separately for reuse. Each step between the stages may
be performed by a separate R script for reproducibility.
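One way to organize this in practice (the file and script names below are only illustrative) is a short script per stage, each reading the previous stage's saved data and saving its own result:
# 01_read.R: raw text file -> technically correct data
raw.df <- read.csv("data_raw.csv", stringsAsFactors = FALSE)  # illustrative file name
saveRDS(raw.df, "data_01_technically_correct.rds")

# 02_clean.R: technically correct -> consistent data
tc.df <- readRDS("data_01_technically_correct.rds")
# ... corrections and imputations go here ...
saveRDS(tc.df, "data_02_consistent.rds")

# 03_analyze.R: consistent data -> statistical results and formatted output
dat.df <- readRDS("data_02_consistent.rds")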

11.2 R background review


11.2.1 Variable types
The most basic variable in R is a vector. An R vector is a sequence of values of
the same type. All basic operations in R act on vectors (think of the element-wise
arithmetic, for example). The basic types in R are as follows.
numeric Numeric data (approximations of the real numbers)
integer Integer data (whole numbers)
factor Categorical data (simple classifications, like gender)
ordered Ordinal data (ordered classifications, like educational level)
character Character data (strings)
raw Binary data (rarely used)
All basic operations in R work element-wise on vectors where the shortest argument
is recycled if necessary. Why does the following code work the way it does?
# vectors have variables of _one_ type
c(1, 2, "three")


## [1] "1" "2" "three"


# shorter arguments are recycled
(1:3) * 2
## [1] 2 4 6
(1:4) * c(1, 2)
## [1] 1 4 3 8
# warning! (why?)
(1:4) * (1:3)
## Warning in (1:4) * (1:3): longer object length is not a multiple of shorter object
length
## [1] 1 4 9 4

11.2.2 Special values and value-checking functions


Below are the definitions and some illustrations of the special values NA, NULL, ±Inf,
and NaN.
• NA Stands for “not available”. NA is a placeholder for a missing value. All basic
operations in R handle NA without crashing and mostly return NA as an answer
whenever one of the input arguments is NA. If you understand NA, you should
be able to predict the result of the following R statements.
NA + 1
sum(c(NA, 1, 2))
median(c(NA, 1, 2, 3), na.rm = TRUE)
length(c(NA, 2, 3, 4))
3 == NA
NA == NA
TRUE | NA
# use is.na() to detect NAs
is.na(c(1, NA, 3))
• NULL Think of NULL as the empty set from mathematics; it has no class (its class
is NULL) and has length 0 so it does not take up any space in a vector.
length(c(1, 2, NULL, 4))
sum(c(1, 2, NULL, 4))
x <- NULL
length(x)
c(x, 2)
# use is.null() to detect NULL variables
is.null(x)
• Inf Stands for “infinity” and only applies to vectors of class numeric (not in-
teger). Technically, Inf is a valid numeric that results from calculations like
division of a number by zero. Since Inf is a numeric, operations between Inf
and a finite numeric are well-defined and comparison operators work as ex-
pected.


pi/0
2 * Inf
Inf - 1e+10
Inf + Inf
3 < -Inf
Inf == Inf
# use is.infinite() to detect Inf variables
is.infinite(-Inf)
• NaN Stands for “not a number”. This is generally the result of a calculation
of which the result is unknown, but it is surely not a number. In particular
operations like 0/0, Inf − Inf and Inf/Inf result in NaN. Technically, NaN is of
class numeric, which may seem odd since it is used to indicate that something
is not numeric. Computations involving numbers and NaN always result in NaN.

NaN + 1
exp(NaN)
# use is.nan() to detect NaN values
is.nan(0/0)
Note that is.finite() checks a numeric vector for the occurrence of any non-
numerical or special values.
is.finite(c(1, NA, 2, Inf, 3, -Inf, 4, NULL, 5, NaN, 6))
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE

11.3 From raw to technically correct data


11.3.1 Technically correct data
Limiting ourselves to “rectangular” data sets read from a text-based format, techni-
cally correct data in R
1. is stored in a data.frame with suitable column names, and
2. each column of the data.frame is of the R type that adequately represents the
value domain.
The second demand implies that numeric data should be stored as numeric or integer,
textual data should be stored as character and categorical data should be stored as
a factor or ordered vector, with the appropriate levels.

Best practice Whenever you need to read data from a foreign file format, like
a spreadsheet or proprietary statistical software that uses undisclosed file formats,
make that software responsible for exporting the data to an open format that can be
read by R.


11.3.2 Reading text data into an R data.frame


In the following, we assume that the text-files we are reading contain data of at most
one unit per line. The number of attributes, their format and separation symbols in
lines containing data may differ over the lines. This includes files in fixed-width or
csv-like format, but excludes XML-like storage formats.

Reading text
read.table() and similar functions below will read a text file and return a data.frame.

Best practice. A freshly read data.frame should always be inspected with func-
tions like head(), str(), and summary().
The read.table() function is the most flexible function to read tabular data
that is stored in a textual format. The other read-functions below all eventually
use read.table() with some fixed parameters and possibly after some preprocessing.
Specifically
• read.csv() for comma separated values with period as decimal separator.
• read.csv2() for semicolon separated values with comma as decimal separator.
• read.delim() tab-delimited files with period as decimal separator.
• read.delim2() tab-delimited files with comma as decimal separator.
• read.fwf() data with a predetermined number of bytes per column.
Additional optional arguments include:

Argument Description
header Does the first line contain column names?
col.names character vector with column names.
na.strings Which strings should be considered NA?
colClasses character vector with the types of columns. Will coerce the columns
to the specified types.
stringsAsFactors If TRUE, converts all character vectors into factor vectors.
sep Field separator.
Except for read.table() and read.fwf(), each of the above functions assumes by
default that the first line in the text file contains column headers. We demonstrate
this with the following text file.
21,6.0
42,5.9
18,5.7*
21,NA
Read the file with defaults, then specifying necessary options.
#### Example: unnamed person text
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_unnamed.txt"


# first line is erroneously interpreted as column names


person <- read.csv(fn.data)
person
## X21 X6.0
## 1 42 5.9
## 2 18 5.7*
## 3 21 <NA>
# instead, use header = FALSE and specify the column names
person <- read.csv(file = fn.data
, header = FALSE
, col.names = c("age", "height")
)
person
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 5.7*
## 4 21 <NA>
If colClasses is not specified by the user, read.table() will try to determine the
column types. Although this may seem convenient, it is noticeably slower for larger
files (say, larger than a few MiB) and it may yield unexpected results. For example,
in the above script, one of the rows contains a malformed numerical variable (5.7*),
causing R to interpret the whole column as a text variable. Moreover, by default text
variables are converted to factor, so we are now stuck with a height variable expressed
as levels in a categorical variable:
str(person)
## 'data.frame': 4 obs. of 2 variables:
## $ age : int 21 42 18 21
## $ height: Factor w/ 3 levels "5.7*","5.9","6.0": 3 2 1 NA
As an alternative, columns can be read in as character by setting stringsAsFactors=FALSE.
Next, one of the as.-functions can be applied to convert to the desired type, as shown
below.
person <- read.csv(file = fn.data
, header = FALSE
, col.names = c("age", "height")
, stringsAsFactors = FALSE)
person
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 5.7*
## 4 21 <NA>
person$height <- as.numeric(person$height)
## Warning: NAs introduced by coercion
person


## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
Now, everything is read in and the height column is translated to numeric, with
the exception of the row containing 5.7*. Moreover, since we now get a warning
instead of an error, a script containing this statement will continue to run, albeit with
less data to analyse than it was supposed to. It is of course up to the programmer to
check for these extra NA’s and handle them appropriately.

Reading data with readLines


When the rows in a data file are not uniformly formatted you can consider reading
in the text line-by-line and transforming the data to a rectangular set yourself. With
readLines() you can exercise precise control over how each line is interpreted and
transformed into fields in a rectangular data set. We use the following data as an
example.
%% Data on the Dalton Brothers
Gratt ,1861,1892
Bob,1892
1871,Emmet ,1937
% Names, birth and death dates
And this is the table we want.
Name Birth Death
Gratt 1861 1892
Bob NA 1892
Emmet 1871 1937
The file has comments on several lines (starting with a % sign) and a missing value
in the second row. Moreover, in the third row the name and birth date have been
swapped. We want a general strategy so that if we had a file with 10,000 records we
could process them all. The table below outlines one such strategy.
Step  Action                                     Result
1     Read the data with readLines               character
2     Select lines containing data               character
3     Split lines into separate fields           list of character vectors
4     Standardize rows                           list of equivalent vectors
5     Transform to data.frame                    data.frame
6     Normalize and coerce to correct type       data.frame
Step 1. Reading data. The readLines() function accepts filename as argument
and returns a character vector containing one element for each line in the file.
readLines() detects both the end-of-line and carriage return characters so lines are


detected regardless of whether the file was created under DOS, UNIX, or MAC
(each OS has traditionally had different ways of marking an end-of-line). Reading
in the Daltons file yields the following.
#### Example: Dalton data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_dalton.txt"
dalton.txt <- readLines(fn.data)
dalton.txt
## [1] "%% Data on the Dalton Brothers" "Gratt ,1861,1892"
## [3] "Bob,1892" "1871,Emmet ,1937"
## [5] "% Names, birth and death dates"
str(dalton.txt)
## chr [1:5] "%% Data on the Dalton Brothers" "Gratt ,1861,1892" ...
The variable dalton.txt has 5 character elements, equal to the number of lines in
the textfile.
Step 2. Selecting lines containing data. This is generally done by throwing
out lines containing comments or otherwise lines that do not contain any data
fields. You can use grep() or grepl() to detect such lines. Regular expressions2 ,
though challenging to learn, can be used to specify what you’re searching for. I
usually search for an example and modify it to meet my needs.
# detect lines starting (^) with a percentage sign (%)
ind.nodata <- grepl("^%", dalton.txt)
ind.nodata
## [1] TRUE FALSE FALSE FALSE TRUE
# and throw them out
!ind.nodata
## [1] FALSE TRUE TRUE TRUE FALSE
dalton.dat <- dalton.txt[!ind.nodata]
dalton.dat
## [1] "Gratt ,1861,1892" "Bob,1892" "1871,Emmet ,1937"
Here, the first argument of grepl() is a search pattern, where the caret (^) indicates
a start-of-line. The result of grepl() is a logical vector that indicates which ele-
ments of dalton.txt contain the pattern ’start-of-line’ followed by a percent-sign.
The functionality of grep() and grepl() will be discussed in more detail later.
Step 3. Split lines into separate fields. This can be done with strsplit().
This function accepts a character vector and a split argument which tells strsplit()
how to split a string into substrings. The result is a list of character vectors.
# remove whitespace by substituting nothing where spaces appear
dalton.dat2 <- gsub(" ", "", dalton.dat)
# split strings by comma

2 http://en.wikipedia.org/wiki/Regular_expression


dalton.fieldList <- strsplit(dalton.dat2, split = ",")


dalton.fieldList
## [[1]]
## [1] "Gratt" "1861" "1892"
##
## [[2]]
## [1] "Bob" "1892"
##
## [[3]]
## [1] "1871" "Emmet" "1937"
Here, split= is a single character or sequence of characters that are to be interpreted
as field separators. By default, split is interpreted as a regular expression, and
the meaning of special characters can be ignored by passing fixed=TRUE as an extra
parameter.
Step 4. Standardize rows. The goal of this step is to make sure that (a) every
row has the same number of fields and (b) the fields are in the right order. In
read.table(), lines that contain fewer fields than the maximum number of fields
detected are appended with NA. One advantage of the do-it-yourself approach
shown here is that we do not have to make this assumption. The easiest way to
standardize rows is to write a function that takes a single character vector as input
and assigns the values in the right order.
The function below accepts a character vector and assigns three values to an out-
put vector of class character. The grepl() statement detects fields containing
alphabetical values a-z or A-Z. To assign year of birth and year of death, we use
the knowledge that all Dalton brothers were born before and died after 1890. To
retrieve the fields for each row in the example, we need to apply this function to
every element of dalton.fieldList.
# function to correct column order for Dalton data
f.assignFields <- function(x) {
# create a blank character vector of length 3
out <- character(3)
# get name and put into first position
ind.alpha <- grepl("[[:alpha:]]", x)
out[1] <- x[ind.alpha]
# get birth date (if any) and put into second position
ind.num.birth <- which(as.numeric(x) < 1890)
# if there are more than 0 years <1890,
# then return that value to second position,
# else return NA to second position
out[2] <- ifelse(length(ind.num.birth) > 0, x[ind.num.birth], NA)
# get death date (if any) and put into third position (same strategy as birth)
ind.num.death <- which(as.numeric(x) > 1890)
out[3] <- ifelse(length(ind.num.death) > 0, x[ind.num.death], NA)
out
}


The function lapply() will apply the function f.assignFields() to each list ele-
ment in dalton.fieldList.
dalton.standardFields <- lapply(dalton.fieldList, f.assignFields)
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) < 1890): NAs introduced by coercion
## Warning in which(as.numeric(x) > 1890): NAs introduced by coercion
dalton.standardFields
## [[1]]
## [1] "Gratt" "1861" "1892"
##
## [[2]]
## [1] "Bob" NA "1892"
##
## [[3]]
## [1] "Emmet" "1871" "1937"
The advantage of this approach is having greater flexibility than read.table offers.
However, since we are interpreting the value of fields here, it is unavoidable to
know about the contents of the dataset, which makes it hard to generalize the
field assigner function. Furthermore, the f.assignFields() function we wrote is still
relatively fragile. That is, it crashes, for example, when the input vector contains
two or more text-fields or when it contains more than one numeric value larger
than 1890. Again, the data analyst is probably in the best position to
choose how safe and general the field assigner should be.
Step 5. Transform to data.frame. There are several ways to transform a list
to a data.frame object. Here, first all elements are copied into a matrix which is
then coerced into a data.frame.
# unlist() returns each value in a list in a single object
unlist(dalton.standardFields)
## [1] "Gratt" "1861" "1892" "Bob" NA "1892" "Emmet" "1871"
## [9] "1937"
# there are three list elements in dalton.standardFields
length(dalton.standardFields)
## [1] 3
# fill a matrix with the character values
dalton.mat <- matrix(unlist(dalton.standardFields)
, nrow = length(dalton.standardFields)
, byrow = TRUE
)
dalton.mat


## [,1] [,2] [,3]


## [1,] "Gratt" "1861" "1892"
## [2,] "Bob" NA "1892"
## [3,] "Emmet" "1871" "1937"
# name the columns
colnames(dalton.mat) <- c("name", "birth", "death")
dalton.mat
## name birth death
## [1,] "Gratt" "1861" "1892"
## [2,] "Bob" NA "1892"
## [3,] "Emmet" "1871" "1937"
# convert to a data.frame but don't turn character variables into factors
dalton.df <- as.data.frame(dalton.mat, stringsAsFactors=FALSE)
str(dalton.df)
## 'data.frame': 3 obs. of 3 variables:
## $ name : chr "Gratt" "Bob" "Emmet"
## $ birth: chr "1861" NA "1871"
## $ death: chr "1892" "1892" "1937"
dalton.df
## name birth death
## 1 Gratt 1861 1892
## 2 Bob <NA> 1892
## 3 Emmet 1871 1937

The function unlist() concatenates all vectors in a list into one large character
vector. We then use that vector to fill a matrix of class character. However,
the matrix function usually fills up a matrix column by column. Here, our data
is stored with rows concatenated, so we need to add the argument byrow=TRUE.
Finally, we add column names and coerce the matrix to a data.frame. We use
stringsAsFactors=FALSE since we have not started interpreting the values yet.
Step 6. Normalize and coerce to correct types. This step consists of prepar-
ing the character columns of our data.frame for coercion and translating numbers
into numeric vectors and possibly character vectors to factor variables. String nor-
malization and type conversion are discussed later. In this example the following
statements suffice.
dalton.df$birth <- as.numeric(dalton.df$birth)
dalton.df$death <- as.numeric(dalton.df$death)
str(dalton.df)
## 'data.frame': 3 obs. of 3 variables:
## $ name : chr "Gratt" "Bob" "Emmet"
## $ birth: num 1861 NA 1871
## $ death: num 1892 1892 1937
dalton.df


## name birth death


## 1 Gratt 1861 1892
## 2 Bob NA 1892
## 3 Emmet 1871 1937

11.4 Type conversion


Converting a variable from one type to another is called coercion. The reader is
probably familiar with R’s basic coercion functions, but as a reference they are listed
here.
as.numeric
as.integer
as.character
as.logical
as.factor
as.ordered
Each of these functions takes an R object and tries to convert it to the class spec-
ified behind the “as.”. By default, values that cannot be converted to the specified
type will be converted to a NA value while a warning is issued.
as.numeric(c("7", "7*", "7.0", "7,0"))
## Warning: NAs introduced by coercion
## [1] 7 NA 7 NA
In the remainder of this section we introduce R’s typing and storage system and
explain the difference between R types and classes. After that we discuss date con-
version.

11.4.1 Introduction to R’s typing system


Everything in R is an object. An object is a container of data endowed with a label
describing the data. Objects can be created, destroyed, or overwritten on-the-fly by
the user. The function class returns the class label of an R object.
class(c("abc", "def"))
## [1] "character"
class(1:10)
## [1] "integer"
class(c(pi, exp(1)))
## [1] "numeric"
class(factor(c("abc", "def")))
## [1] "factor"
# all columns in a data.frame
sapply(dalton.df, class)
## name birth death
## "character" "numeric" "numeric"


For the user of R these class labels are usually enough to handle R objects in R
scripts. Under the hood, the basic R objects are stored as C structures as C is the
language in which R itself has been written. The type of C structure that is used to
store a basic type can be found with the typeof function. Compare the results below
with those in the previous code snippet.
typeof(c("abc", "def"))
## [1] "character"
typeof(1:10)
## [1] "integer"
typeof(c(pi, exp(1)))
## [1] "double"
typeof(factor(c("abc", "def")))
## [1] "integer"

Note that the type of an R object of class numeric is double. The term double
refers to double precision, which is a standard way for lower-level computer languages
such as C to store approximations of real numbers. Also, the type of an object of
class factor is integer. The reason is that R saves memory (and computational
time!) by storing factor values as integers, while a translation table between factor
and integers are kept in memory. Normally, a user should not have to worry about
these subtleties, but there are exceptions (the homework includes an example of the
subtleties).
In short, one may regard the class of an object as the object’s type from the
user’s point of view while the type of an object is the way R looks at the object. It
is important to realize that R’s coercion functions are fundamentally functions that
change the underlying type of an object and that class changes are a consequence of
the type changes.

11.4.2 Recoding factors


In R, the value of categorical variables is stored in factor variables. A factor is an
integer vector endowed with a table specifying what integer value corresponds to what
level. The values in this translation table can be requested with the levels function.
f <- factor(c("a", "b", "a", "a", "c"))
f
## [1] a b a a c
## Levels: a b c
levels(f)
## [1] "a" "b" "c"
as.numeric(f)
## [1] 1 2 1 1 3
You may need to create a translation table by hand. For example, suppose we


read in a vector where 1 stands for male, 2 stands for female and 0 stands for unknown.
Conversion to a factor variable can be done as in the example below.
# example:
gender <- c(2, 1, 1, 2, 0, 1, 1)
gender
## [1] 2 1 1 2 0 1 1
# recoding table, stored in a simple vector
recode <- c(male = 1, female = 2)
recode
## male female
## 1 2
gender <- factor(gender, levels = recode, labels = names(recode))
gender
## [1] female male male female <NA> male male
## Levels: male female

Note that we do not explicitly need to set NA as a label. Every integer value that
is encountered in the first argument, but not in the levels argument, will be regarded
as missing.
Levels in a factor variable have no natural ordering. However in multivariate
(regression) analyses it can be beneficial to fix one of the levels as the reference level.
R’s standard multivariate routines (lm, glm) use the first level as reference level. The
relevel function allows you to determine which level comes first.
gender <- relevel(gender, ref = "female")
gender
## [1] female male male female <NA> male male
## Levels: female male

Levels can also be reordered, depending on the mean value of another variable,
for example:
age <- c(27, 52, 65, 34, 89, 45, 68)
gender <- reorder(gender, age)
gender
## [1] female male male female <NA> male male
## attr(,"scores")
## female male
## 30.5 57.5
## Levels: female male

Here, the means are added as a named vector attribute to gender. It can be
removed by setting that attribute to NULL.
attr(gender, "scores") <- NULL
gender
## [1] female male male female <NA> male male
## Levels: female male


11.4.3 Converting dates


The base R installation has three types of objects to store a time instance: Date,
POSIXlt, and POSIXct. The Date object can only be used to store dates, the other two
store date and/or time. Here, we focus on converting text to POSIXct objects since
this is the most portable way to store such information.
Under the hood, a POSIXct object stores the number of seconds that have passed
since January 1, 1970 00:00. Such a storage format facilitates the calculation of
durations by subtraction of two POSIXct objects.
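For example (a minimal sketch with made-up time stamps), subtracting two POSIXct objects returns a difftime object, and difftime() lets you choose the units.
# sketch: durations by subtracting POSIXct objects
t1 <- as.POSIXct("2016-08-22 09:00:00")
t2 <- as.POSIXct("2016-12-16 17:00:00")
t2 - t1                            # a difftime, reported in days
difftime(t2, t1, units = "hours")  # the same duration in hours
as.numeric(t2) - as.numeric(t1)    # the raw difference in seconds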
When a POSIXct object is printed, R shows it in a human-readable calendar format.
For example, the command Sys.time() returns the system time provided by the
operating system in POSIXct format.
current_time <- Sys.time()
class(current_time)
## [1] "POSIXct" "POSIXt"
current_time
## [1] "2018-10-10 00:15:10 MDT"
Here, Sys.time() uses the time zone that is stored in the locale settings of the
machine running R.
Converting from a calendar time to POSIXct and back is not entirely trivial, since
there are many idiosyncrasies to handle in calendar systems. These include leap days,
leap seconds, daylight saving times, time zones and so on. Converting from text to
POSIXct is further complicated by the many textual conventions of time/date denota-
tion. For example, both 28 September 1976 and 1976/09/28 indicate the same day of
the same year. Moreover, the name of the month (or weekday) is language-dependent,
where the language is again defined in the operating system’s locale settings.
The lubridate package contains a number of functions facilitating the conversion
of text to POSIXct dates. As an example, consider the following code.
library(lubridate)
##
## Attaching package: ’lubridate’
## The following object is masked from ’package:plyr’:
##
## here
## The following object is masked from ’package:base’:
##
## date
dates <- c("15/02/2013"
, "15 Feb 13"
, "It happened on 15 02 '13")
dmy(dates)
## [1] "2013-02-15" "2013-02-15" "2013-02-15"
Here, the function dmy assumes that dates are denoted in the order day-month-


year and tries to extract valid dates. Note that the code above will only work properly
in locale settings where the name of the second month is abbreviated to Feb. This
holds for English or Dutch locales, but fails for example in a French locale (février).
There are similar functions for all permutations of d, m, and y. Explicitly, all of
the following functions exist.
dmy()
dym()
mdy()
myd()
ydm()
ymd()

So once it is known in what order days, months and years are denoted, extraction
is very easy.
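For example (a sketch, assuming the English locale used throughout these notes), the same day written in three different orders is parsed by the matching function.
# sketch: the same day in three different orders
ymd("2013-02-15")
mdy("02/15/2013")
dmy("15 February 2013")
# each is expected to return "2013-02-15"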

Note It is not uncommon to indicate years with only two digits, leaving out the
century. R maps two-digit years below a cutoff to 20xx and the rest to 19xx; as the
output below shows, 00-68 currently become 2000-2068 and 69-99 become 1969-1999.
This behaviour follows the 2008 POSIX standard, but one should expect the
interpretation to change over time.
dmy("01 01 68")
## [1] "2068-01-01"
dmy("01 01 69")
## [1] "1969-01-01"
dmy("01 01 90")
## [1] "1990-01-01"
dmy("01 01 00")
## [1] "2000-01-01"

It should be noted that lubridate (as well as R’s base functionality) is only capable
of converting certain standard notations. For example, the following notation does
not convert.
dmy("15 Febr. 2013")
## Warning: All formats failed to parse. No formats found.
## [1] NA

The standard notations that can be recognized by R, either using lubridate or
R's built-in functionality, are shown below. The complete list can be found by typing
?strptime in the R console. These are the day, month, and year formats recognized
by R.


Code Description Example


%a Abbreviated weekday name in the current locale. Mon
%A Full weekday name in the current locale. Monday
%b Abbreviated month name in the current locale. Sep
%B Full month name in the current locale. September
%m Month number (01-12) 09
%d Day of the month as decimal number (01-31). 28
%y Year without century (00-99) 13
%Y Year including century. 2013
Here, the names of (abbreviated) week or month names that are sought for in the
text depend on the locale settings of the machine that is running R.
If you know the textual format that is used to describe a date in the input, you
may want to use R’s core functionality to convert from text to POSIXct. This can be
done with the as.POSIXct function. It takes as arguments a character vector with
time/date strings and a string describing the format.
dates <- c("15-9-2009", "16-07-2008", "17 12-2007", "29-02-2011")
as.POSIXct(dates, format = "%d-%m-%Y")
## [1] "2009-09-15 MDT" "2008-07-16 MDT" NA
## [4] NA
In the format string, date and time fields are indicated by a letter preceded by
a percent sign (%). Basically, such a %-code tells R to look for a range of substrings.
For example, the %d indicator makes R look for numbers 1-31 where leading zeros
are allowed, so 01, 02, . . . , 31 are recognized as well. Strings that are not in the exact
format specified by the format argument (like the third string in the above example)
will not be converted by as.POSIXct. Impossible dates, such as the nonexistent leap
day (29 February 2011) in the fourth date above, are also not converted.
Finally, to convert dates from POSIXct back to character, one may use the format
function that comes with base R. It accepts a POSIXct date/time object and an output
format string.
mybirth <- dmy("28 Sep 1976")
format(mybirth, format = "I was born on %B %d, %Y")
## [1] "I was born on September 28, 1976"

11.5 Character-type manipulation


Because of the many ways people can write the same things down, character data can
be difficult to process. For example, consider the following excerpt of a data set with
a gender variable.
gender
M


male
Female
fem.
If this were treated as a factor variable without any preprocessing, four classes
would be stored instead of the two intended. The job at hand is therefore to automatically
recognize from the above data whether each element pertains to male or female.
In statistical contexts, classifying such “messy” text strings into a number of fixed
categories is often referred to as coding.
Below we discuss two complementary approaches to string coding: string nor-
malization and approximate text matching. In particular, the following topics are
discussed.
- Remove leading or trailing white spaces.
- Pad strings to a certain width.
- Transform to upper/lower case.
- Search for strings containing simple patterns (substrings).
- Approximate matching procedures based on string distances.

11.5.1 String normalization


String normalization techniques are aimed at transforming a variety of strings to a
smaller set of string values which are more easily processed. By default, R comes
with extensive string manipulation functionality that is based on the two basic string
operations: finding a pattern in a string and replacing one pattern with another. We
will deal with R’s generic functions below but start by pointing out some common
string cleaning operations.
The stringr package offers a number of functions that make some string
manipulation tasks a lot easier than they would be with R’s base functions. For
example, extra white spaces at the beginning or end of a string can be removed using
str_trim().
library(stringr)
str_trim(" hello world ")
## [1] "hello world"
str_trim(" hello world ", side = "left")
## [1] "hello world "
str_trim(" hello world ", side = "right")
## [1] " hello world"
Conversely, strings can be padded with spaces or other characters with str_pad()
to a certain width. For example, numerical codes are often represented with leading
zeros.
str_pad(112, width = 6, side = "left", pad = 0)
## [1] "000112"


Both str_trim() and str_pad() accept a side argument to indicate whether trim-
ming or padding should occur at the beginning (left), end (right), or both sides of
the string.
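A small sketch of padding on both sides (the expected result is an assumption, not output generated for these notes):
# sketch: pad on both sides with a custom pad character
str_pad("abc", width = 7, side = "both", pad = "*")
# expected: "**abc**"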
Converting strings to complete upper or lower case can be done with R’s built-in
toupper() and tolower() functions.
toupper("Hello world")
## [1] "HELLO WORLD"
tolower("Hello World")
## [1] "hello world"

11.5.2 Approximate string matching


There are two forms of string matching. The first consists of determining whether a
(range of) substring(s) occurs within another string. In this case one needs to specify
a range of substrings (called a pattern) to search for in another string. In the second
form one defines a distance metric between strings that measures how “different” two
strings are. Below we will give a short introduction to pattern matching and string
distances with R.
There are several pattern matching functions that come with base R. The most
used are probably grep() and grepl(). Both functions take a pattern and a character
vector as input. The output only differs in that grepl() returns a logical index,
indicating which element of the input character vector contains the pattern, while
grep() returns a numerical index. You may think of grep(...) as which(grepl(...)).
In the most simple case, the pattern to look for is a simple substring. For example,
from the previous example, we get the following.
gender <- c("M", "male ", "Female", "fem.")
grepl("m", gender)
## [1] FALSE TRUE TRUE TRUE
grep("m", gender)
## [1] 2 3 4
Note that the result is case sensitive: the capital M in the first element of gender
does not match the lower case m. There are several ways to circumvent this case
sensitivity. Either by case normalization or by the optional argument ignore.case.
grepl("m", gender, ignore.case = TRUE)
## [1] TRUE TRUE TRUE TRUE
grepl("m", tolower(gender))
## [1] TRUE TRUE TRUE TRUE
Obviously, looking for the occurrence of m or M in the gender vector does not allow
us to determine which strings pertain to male and which not. Preferably we would
like to search for strings that start with an m or M. Fortunately, the search patterns
that grep() accepts allow for such searches. The beginning of a string is indicated


with a caret (^).


grepl("^m", gender, ignore.case = TRUE)
## [1] TRUE TRUE FALSE FALSE
Indeed, the grepl() function now finds only the first two elements of gender.
The caret is an example of a so-called meta-character. That is, it does not indicate
the caret itself but something else, namely the beginning of a string. The search
patterns that grep(), grepl() (and sub() and gsub()) understand have more of these
meta-characters, namely:
. \ | ( ) [ { ^ $ * + ?
If you need to search a string for any of these characters, you can use the option
fixed=TRUE.
grepl("^", gender, fixed = TRUE)
## [1] FALSE FALSE FALSE FALSE
This will make grepl() or grep() ignore any meta-characters in the search string
(and thereby search for the “^” character).
Search patterns using meta-characters are called regular expressions. Regular
expressions3 offer powerful and flexible ways to search (and alter) text. A concise
description of regular expressions allowed by R’s built-in string processing functions
can be found by typing ?regex at the R command line. If you frequently have to deal
with “messy” text variables, learning to work with regular expressions is a worthwhile
investment. Moreover, since many popular programming languages support some
dialect of regexps, it is an investment that could pay off several times.
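As a small sketch (reusing the gender vector from above), the replacement functions sub() and gsub() substitute the first or all matches of a pattern, which can be combined with the normalization steps above to code messy strings.
# sketch: normalize the gender strings with gsub() and regular expressions
g <- tolower(str_trim(gender))   # "m" "male" "female" "fem."
g <- gsub("^f.*$", "female", g)  # anything starting with f becomes "female"
g <- gsub("^m.*$", "male", g)    # anything starting with m becomes "male"
g
# expected: "male" "male" "female" "female"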
We now turn our attention to the second method of approximate matching, namely
string distances. A string distance is an algorithm or equation that indicates how
much two strings differ from each other. An important distance measure is imple-
mented by the R’s native adist() function. This function counts how many basic
operations are needed to turn one string into another. These operations include
insertion, deletion, or substitution of a single character. For example
adist("abc", "bac")
## [,1]
## [1,] 2
The result equals two since turning “abc” into “bac” involves two character
substitutions: abc → bbc → bac.
Using adist(), we can compare fuzzy text strings to a list of known codes. For
example:
codes <- c("male", "female")
# calculate pairwise distances between the gender strings and codes strings
dist.g.c <- adist(gender, codes)
# add column and row names
colnames(dist.g.c) <- codes

3
http://en.wikipedia.org/wiki/Regular_expression


rownames(dist.g.c) <- gender


dist.g.c
## male female
## M 4 6
## male 1 3
## Female 2 1
## fem. 4 3
Here, adist() returns the distance matrix between our vector of fixed codes and
the input data. For readability we added row and column names accordingly. Now,
to find out which code matches best with our raw data, we need to find the index of
the smallest distance for each row of dist.g.c. This can be done as follows.
ind.min <- apply(dist.g.c, 1, which.min)
data.frame(rawtext = gender, coded = codes[ind.min])
## rawtext coded
## 1 M male
## 2 male male
## 3 Female female
## 4 fem. female
We use apply() to apply which.min() to every row of dist.g.c. Note that in
the case of multiple minima, the first match will be returned. At the end of this
subsection we show how this code can be simplified with the stringdist package.
Finally, we mention three more functions based on string distances. First, the R
built-in function agrep() is similar to grep(), but it allows one to specify a maxi-
mum Levenshtein distance4 between the input pattern and the found substring. The
agrep() function allows for searching for regular expression patterns, which makes it
very flexible.
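A small sketch of agrep(), adapted from its help page (the expected indices are an assumption):
# sketch: allow at most one edit between the pattern and a substring of x
agrep("lasy", c("1 lazy 2", "lass", "1 LAZY 2"), max.distance = 1)
# expected: 1 2
agrep("lasy", c("1 lazy 2", "lass", "1 LAZY 2"), max.distance = 1,
      ignore.case = TRUE)
# expected: 1 2 3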
Secondly, the stringdist package offers a function called stringdist() which can
compute a variety of string distance metrics, some of which are likely to provide re-
sults that are better than adist()’s. Most importantly, the distance function used
by adist() does not allow for character transpositions, which is a common typo-
graphical error. Using the optimal string alignment distance (the default choice for
stringdist()) we get
library(stringdist)
stringdist("abc", "bac")
## [1] 1
The answer is now 1 (not 2 as with adist()), since the optimal string alignment
distance allows for transpositions of adjacent characters: abc → bac.
Thirdly, the stringdist package provides a function called amatch(), which mim-
ics the behaviour of R’s match() function: it returns an index to the closest match
within a maximum distance. Recall the earlier gender and code example.
4
Informally, the Levenshtein distance between two words is the minimum number of single-
character edits (i.e., insertions, deletions, or substitutions) required to change one word into the
other: https://en.wikipedia.org/wiki/Levenshtein_distance.


# this yields the closest match of 'gender' in 'codes' (within a distance of 4)


ind <- amatch(gender, codes, maxDist = 4)
ind
## [1] 1 1 2 2
# store results in a data.frame
data.frame(rawtext = gender, code = codes[ind])
## rawtext code
## 1 M male
## 2 male male
## 3 Female female
## 4 fem. female

11.6 From technically correct data to consistent data

Consistent data are technically correct data that are fit for statistical analysis. They
are data in which missing values, special values, (obvious) errors and outliers are either
removed, corrected, or imputed. The data are consistent with constraints based on
real-world knowledge about the subject that the data describe.
Consistency can be understood to include in-record consistency, meaning that
no contradictory information is stored in a single record, and cross-record con-
sistency, meaning that statistical summaries of different variables do not conflict
with each other. Finally, one can include cross-dataset consistency, meaning that
the dataset that is currently analyzed is consistent with other datasets pertaining to
the same subject matter. In this chapter we mainly focus on methods dealing with
in-record consistency, with the exception of outlier handling which can be considered
a cross-record consistency issue.
The process towards consistent data always involves the following three steps.

- Detection of an inconsistency. That is, one establishes which constraints are
violated. For example, an age variable is constrained to non-negative values.

- Selection of the field or fields causing the inconsistency. This is trivial in the
case of a univariate constraint, as in the previous step, but may be more
cumbersome when cross-variable relations are expected to hold. For example,
the marital status of a child must be unmarried. In the case of a violation it is
not immediately clear whether age, marital status, or both are wrong.

- Correction of the fields that are deemed erroneous by the selection method.
This may be done through deterministic (model-based) or stochastic methods.


For many data correction methods these steps are not necessarily neatly separated.
First, we introduce a number of techniques dedicated to the detection of errors
and the selection of erroneous fields. If the field selection procedure is performed
separately from the error detection procedure, it is generally referred to as error
localization. Next, we describe techniques that implement correction methods based
on “direct rules” or “deductive correction”. In these techniques, erroneous values are
replaced by better ones by directly deriving them from other values in the same
record. Finally, we give an overview of some commonly used imputation techniques
that are available in R.

11.6.1 Detection and localization of errors


This section details a number of techniques to detect univariate and multivariate
constraint violations.

Missing values
A missing value, represented by NA in R, is a placeholder for a datum of which the
type is known but its value isn’t. Therefore, it is impossible to perform statistical
analysis on data where one or more values in the data are missing. One may choose
to either omit elements from a dataset that contain missing values or to impute a
value, but missingness is something to be dealt with prior to any analysis.
In practice, analysts, but also commonly used numerical software, may confuse
a missing value with a default value or category. For instance, in Excel 2010, the
result of adding the contents of a field containing the number 1 with an empty field
results in 1. This behaviour is most definitely unwanted since Excel silently imputes
“0” where it should have said something along the lines of “unable to compute”. It
should be up to the analyst to decide how empty values are handled, since a default
imputation may yield unexpected or erroneous results for reasons that are hard to
trace.
Another commonly encountered mistake is to confuse an NA in categorical data
with the category unknown. If unknown is indeed a category, it should be added as a
factor level so it can be appropriately analyzed. Consider as an example a categorical
variable representing place of birth. Here, the category unknown means that we have
no knowledge about where a person is born. In contrast, NA indicates that we have
no information to determine whether the birth place is known or not.
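A small sketch with made-up data: adding unknown as an explicit factor level keeps it distinct from NA in summaries.
# sketch: 'unknown' as an explicit category versus a true NA
birthplace <- c("NM", "TX", "unknown", NA)
bp <- factor(birthplace, levels = c("NM", "TX", "unknown"))
table(bp, useNA = "ifany")
# expected: one count each for NM, TX, unknown, and <NA>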
The behaviour of R’s core functionality is completely consistent with the idea that
the analyst must decide what to do with missing data. A common choice, namely
“leave out records with missing data” is supported by many base functions through
the na.rm option.


age <- c(23, 16, NA)


mean(age)
## [1] NA
mean(age, na.rm = TRUE)
## [1] 19.5

Functions such as sum(), prod(), quantile(), sd(), and so on all have this option.
Functions implementing bivariate statistics such as cor() and cov() offer options to
include complete or pairwise complete values.
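For example (a sketch with made-up vectors):
# sketch: na.rm for univariate summaries, 'use' for bivariate ones
x <- c(1, 2, NA, 4, 5)
y <- c(2, NA, 3, 5, 6)
sum(x)                                    # expected: NA
sum(x, na.rm = TRUE)                      # expected: 12
cor(x, y, use = "pairwise.complete.obs")  # uses only pairs with no NA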
Besides the is.na() function, that was already mentioned previously, R comes
with a few other functions facilitating NA handling. The complete.cases() function
detects rows in a data.frame that do not contain any missing value. Recall the person
data set example from earlier.
print(person)
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
complete.cases(person)
## [1] TRUE TRUE FALSE FALSE

The resulting logical can be used to remove incomplete records from the data.frame.
Alternatively the na.omit() function, does the same.
persons_complete <- na.omit(person)
persons_complete
## age height
## 1 21 6.0
## 2 42 5.9
na.action(persons_complete)
## 3 4
## 3 4
## attr(,"class")
## [1] "omit"

The result of the na.omit() function is a data.frame where incomplete rows have
been deleted. The row.names of the removed records are stored in an attribute called
na.action.

Note. It may happen that a missing value in a data set means 0 or Not applicable.
If that is the case, it should be explicitly imputed with that value, because it is not
unknown, but was coded as empty.


Special values
As explained previously, numeric variables are endowed with several formalized special
values including ±Inf, NA, and NaN. Calculations involving special values often result
in special values, and since a statistical statement about a real-world phenomenon
should never include a special value, it is desirable to handle special values prior to
analysis. For numeric variables, special values indicate values that are not an element
of the mathematical set of real numbers. The function is.finite() determines which
values are “regular” values.
is.finite(c(1, Inf, NaN, NA))
## [1] TRUE FALSE FALSE FALSE
This function accepts vectorial input. With little effort we can write a function
that may be used to check every numerical column in a data.frame.
f.is.special <- function(x) {
if (is.numeric(x)) {
return(!is.finite(x))
} else {
return(is.na(x))
}
}
person
## age height
## 1 21 6.0
## 2 42 5.9
## 3 18 NA
## 4 21 NA
sapply(person, f.is.special)
## age height
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE TRUE
## [4,] FALSE TRUE
Here, the f.is.special() function is applied to each column of person using
sapply(). f.is.special() checks its input vector for numerical special values if the
type is numeric, otherwise it only checks for NA.

Outliers
There is a vast body of literature on outlier detection, and several definitions of
outlier exist. A general definition by Barnett and Lewis defines an outlier in a data
set as an observation (or set of observations) which appear to be inconsistent with that
set of data. Although more precise definitions exist (see e.g., the book by Hawkins),
this definition is sufficient for the current chapter. Below we mention a few fairly


common graphical and computational techniques for outlier detection in univariate


numerical data. In a previous chapter, we’ve discussed using PCA as a graphical
technique to help detect multivariate outliers.

Note. Outliers do not equal errors. They should be detected, but not necessarily
removed. Their inclusion in the analysis is a statistical decision.
For more or less unimodal and symmetrically distributed data, Tukey’s box-and-
whisker method for outlier detection is often appropriate. In this method, an
observation is an outlier when it falls outside the so-called “whiskers” of the set of
observations. The upper whisker is computed by adding 1.5 times the interquartile
range to the third quartile and rounding down to the nearest observation. The
lower whisker is computed likewise. The base R installation comes with the function
boxplot.stats(), which, amongst other things, lists the outliers.
x <- c(1:10, 20, 30)
boxplot.stats(x)
## $stats
## [1] 1.0 3.5 6.5 9.5 10.0
##
## $n
## [1] 12
##
## $conf
## [1] 3.76336 9.23664
##
## $out
## [1] 20 30
Here, 20 and 30 are detected as outliers since they are above the upper whisker
of the observations in x. The factor 1.5 used to compute the whisker is to an extent
arbitrary and it can be altered by setting the coef option of boxplot.stats(). A
higher coefficient means a higher outlier detection limit (so for the same dataset,
generally fewer upper or lower outliers will be detected).
boxplot.stats(x, coef = 2)$out
## [1] 30
The box-and-whisker method can be visualized with a box-and-whisker plot: the box
indicates the interquartile range and the median, the whiskers are drawn at the
whisker values described above, and outliers are indicated as separate points
above or below the whiskers.
op <- par(no.readonly = TRUE) # save plot settings
par(mfrow=c(1,3))
boxplot(x, main="default")
boxplot(x, range = 1.5, main="range = 1.5")
boxplot(x, range = 2, main="range = 2")


par(op) # restore plot settings

[Figure: three box-and-whisker plots of x, titled “default”, “range = 1.5”, and “range = 2”.]
The box-and-whisker method fails when the data distribution is skewed, as in an
exponential or log-normal distribution. In that case one can attempt to transform
the data, for example with a logarithm or square root transformation. Another option
is to use a method that takes the skewness into account.
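As a quick sketch with the example data above (the expected result is an assumption), the large values are no longer flagged after a log transformation.
# sketch: re-check the Tukey rule on the log scale
boxplot.stats(log(x))$out
# expected: numeric(0), i.e., no outliers after the transformation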
A particularly easy-to-implement method for outlier detection with positive ob-
servations is by Hiridoglou and Berthelot. In this method, an observation is an outlier
when
h(x) = max(x/x∗, x∗/x) ≥ r, and x > 0.
Here, r is a user-defined reference value and x∗ is usually the median observation,
although other measures of centrality may be chosen. The score function h(x)
grows as 1/x as x approaches zero and grows linearly with x when it is larger than
x∗. It is therefore appropriate for finding outliers on both sides of the distribution.
Moreover, because of the different behaviour for small and large x-values, it is
appropriate for skewed (long-tailed) distributions. An implementation of this method
does not seem to be available in R, but it is simple enough to implement as follows.
#### Example: Hiridoglou and Berthelot outlier detection function
f.hb.outlier <- function(x,r) {
x <- x[is.finite(x)]
stopifnot(length(x) > 0 , all(x>0)) # if empty vector or non-positive values, quit
xref <- median(x)
if (xref <= sqrt(.Machine$double.eps)) {
warning("Reference value close to zero: results may be inaccurate")
}
pmax(x/xref, xref/x) > r
}
f.hb.outlier(x, r = 4)
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] TRUE


The above function returns a logical vector indicating which elements of x are
outliers.

11.6.2 Edit rules for detecting obvious inconsistencies


An obvious inconsistency occurs when a record contains a value or combination of
values that cannot correspond to a real-world situation. For example, a person’s age
cannot be negative, a man cannot be pregnant, and an underage person cannot
possess a driver's license.
Such knowledge can be expressed as rules or constraints. In data editing litera-
ture these rules are referred to as edit rules or edits, in short. Checking for obvious
inconsistencies can be done straightforwardly in R using logical indices and recycling.
For example, to check which elements of x obey the rule ‘x must be non-negative’ one
simply uses the following.
x_nonnegative <- (x >= 0)

However, as the number of variables increases, the number of rules may increase
rapidly and it may be beneficial to manage the rules separately from the data. Moreover,
since multivariate rules may be interconnected by common variables, deciding which
variable or variables in a record cause an inconsistency may not be straightforward.
The editrules package allows one to define rules on categorical, numerical or
mixed-type data sets which each record must obey. Furthermore, editrules can check
which rules are obeyed or not and allows one to find the minimal set of variables to
adapt so that all rules can be obeyed. The package also implements a number of
basic rule operations allowing users to test rule sets for contradictions and certain
redundancies.
As an example, we will work with a small file containing the following data.
age,agegroup,height,status,yearsmarried
21,adult,6.0,single,-1
2,child,3,married, 0
18,adult,5.7,married, 20
221,elderly, 5,widowed, 2
34,child, -7,married, 3
We read this data into a variable called people and define some restrictions on age
using editset().
#### Example: people data for cleaning
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_people.txt"
people <- read.csv(fn.data)

people
## age agegroup height status yearsmarried
## 1 21 adult 6.0 single -1


## 2 2 child 3.0 married 0


## 3 18 adult 5.7 married 20
## 4 221 elderly 5.0 widowed 2
## 5 34 child -7.0 married 3
library(editrules)
## Loading required package: igraph
##
## Attaching package: ’igraph’
## The following objects are masked from ’package:lubridate’:
##
## %--%, union
## The following objects are masked from ’package:stats’:
##
## decompose, spectrum
## The following object is masked from ’package:base’:
##
## union
##
## Attaching package: ’editrules’
## The following objects are masked from ’package:igraph’:
##
## blocks, normalize
E <- editset(c("age >=0", "age <= 150"))
E
##
## Edit set:
## num1 : 0 <= age
## num2 : age <= 150

The editset() function parses the textual rules and stores them in an editset
object. Each rule is assigned a name according to its type (numeric, categorical,
or mixed) and a number. The data can be checked against these rules with the
violatedEdits() function. Record 4 contains an error according to one of the rules:
an age of 221 is not allowed.
violatedEdits(E, people)
## edit
## record num1 num2
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE TRUE
## 5 FALSE FALSE

violatedEdits() returns a logical array indicating for each row of the data,
which rules are violated. The number and type of rules applying to a data set usually
quickly grow with the number of variables. With editrules, users may read rules,


specified in a limited R-syntax, directly from a text file using the editfile() function.
As an example consider the contents of the following text file (note, you can’t include
braces in your if() statement).
# numerical rules
age >= 0
height > 0
age <= 150
age > yearsmarried

# categorical rules
status %in% c("married", "single", "widowed")
agegroup %in% c("child", "adult", "elderly")
if ( status == "married" ) agegroup %in% c("adult","elderly")

# mixed rules
if ( status %in% c("married","widowed")) age - yearsmarried >= 17
if ( age < 18 ) agegroup == "child"
if ( age >= 18 && age <65 ) agegroup == "adult"
if ( age >= 65 ) agegroup == "elderly"

There are rules pertaining to purely numerical data, to purely categorical data, and
to both data types. Moreover, there are univariate as well as multivariate
rules. Comments are written after the usual # character. The rule set can be read
as follows.
#### Edit rules for people data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_edits.txt"
E <- editfile(fn.data)
E
##
## Data model:
## dat6 : agegroup %in% c('adult', 'child', 'elderly')
## dat7 : status %in% c('married', 'single', 'widowed')
##
## Edit set:
## num1 : 0 <= age
## num2 : 0 < height
## num3 : age <= 150
## num4 : yearsmarried < age
## cat5 : if( agegroup == 'child' ) status != 'married'
## mix6 : if( age < yearsmarried + 17 ) !( status %in% c('married', 'widowed') )
## mix7 : if( age < 18 ) !( agegroup %in% c('adult', 'elderly') )
## mix8 : if( 18 <= age & age < 65 ) !( agegroup %in% c('child', 'elderly') )
## mix9 : if( 65 <= age ) !( agegroup %in% c('adult', 'child') )

Since rules may pertain to multiple variables, and variables may occur in several
rules (e.g., the age variable in the current example), there is a dependency between
rules and variables. It can be informative to show these dependencies in a graph using


the plot function. The graph below shows the interconnection of restrictions: blue
circles represent variables, yellow boxes represent restrictions, and the lines indicate
which restrictions pertain to which variables.
op <- par(no.readonly = TRUE) # save plot settings
par(mfrow=c(1,1), mar = c(0,0,0,0))
plot(E)
par(op) # restore plot settings

[Figure: graph of the edit set E; blue circles are the variables (age, height, yearsmarried, agegroup, status) and yellow boxes are the edits (num1-num4, cat5, mix6-mix9), with lines connecting each edit to the variables it involves.]

As the number of rules grows, looking at the full array produced by violatedEdits()
becomes cumbersome. For this reason, editrules offers methods to summarize or vi-
sualize the result.
ve <- violatedEdits(E, people)
summary(ve)
## Edit violations, 5 observations, 0 completely missing (0%):
##
## editname freq rel
## cat5 2 40%
## mix6 2 40%
## num2 1 20%
## num3 1 20%
## num4 1 20%
## mix8 1 20%
##
## Edit violations per record:
##
## errors freq rel
## 0 1 20%
## 1 1 20%
## 2 2 40%


## 3 1 20%
plot(ve)

[Figure: “Edit violation frequency of top 10 edits” (frequency per edit) and “Edit violations per record” (count versus number of violations; 1 record has no violations).]

Here, the edit labeled cat5 is violated by two records (40% of all records, as the
summary above shows). Violated edits are sorted from most to least often violated.
The plot visualizes the same information.

Error localization
The interconnectivity of edits is what makes error localization difficult. For example,
the graph above shows that a record violating edit num4 may contain an error in age
and/or yrsmr (years married). Suppose that we alter age so that num4 is not violated
anymore. We then run the risk of violating up to six other edits containing age.
If we have no other information available but the edit violations, it makes sense
to minimize the number of fields being altered. This principle, commonly referred to
as the principle of Fellegi and Holt, is based on the idea that errors occur relatively
few times and when they do, they occur randomly across variables. Over the years
several algorithms have been developed to solve this minimization problem of which
two have been implemented in editrules. The localizeErrors() function provides


access to this functionality.


As an example we take two records from the people dataset from the previous
subsection.
id <- c(2, 5)
people[id, ]
## age agegroup height status yearsmarried
## 2 2 child 3 married 0
## 5 34 child -7 married 3
violatedEdits(E, people[id, ])
## edit
## record num1 num2 num3 num4 dat6 dat7 cat5 mix6 mix7 mix8
## 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 5 FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## edit
## record mix9
## 2 FALSE
## 5 FALSE
Record 2 violates mix6 while record 5 violates edits num2, cat5, and mix8. We use
localizeErrors() with a mixed-integer programming (MIP) approach to find the
minimal set of variables to adapt.
le <- localizeErrors(E, people[id, ], method = "mip")
le$adapt
## age agegroup height status yearsmarried
## 2 FALSE FALSE FALSE TRUE FALSE
## 5 FALSE TRUE TRUE FALSE FALSE
Here, the le object contains some processing metadata and a logical array labeled
adapt which indicates the minimal set of variables to be altered in each record. It can
be used in correction and imputation procedures for filling in valid values. Such pro-
cedures are not part of editrules, but for demonstration purposes we will manually
fill in new values showing that the solution computed by localizeErrors() indeed
allows one to repair records to full compliance with all edit rules.
people[2, "status"] <- "single"
people[5, "height"] <- 7
people[5, "agegroup"] <- "adult"
summary(violatedEdits(E, people[id, ]))
## No violations detected, 0 checks evaluated to NA
## NULL
The behaviour of localizeErrors() can be tuned with various options. It is
possible to supply a confidence weight for each variable, allowing for fine-grained
control over which values should be adapted. It is also possible to choose a branch-
and-bound based solver (instead of the MIP solver used here), which is typically
slower but allows for more control.


11.6.3 Correction
Correction methods aim to fix inconsistent observations by altering invalid values in
a record based on information from valid values. Depending on the method this is
either a single-step procedure or a two-step procedure where first, an error localization
method is used to empty certain fields, followed by an imputation step.
In some cases, the cause of errors in data can be determined with enough certainty
so that the solution is almost automatically known. In recent years, several such
methods have been developed and implemented in the deducorrect package.
For the purposes of ADA1, we will manually correct errors, either by replacing
values or by excluding observations.
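For instance, a minimal sketch of such manual corrections on the people data from earlier (the specific choices of what to blank out or drop are illustrative, not prescribed):
# sketch: manual correction by replacing a value and excluding a record
people.fix <- people
people.fix[people.fix$age > 150, "age"] <- NA       # impossible age -> missing
people.fix <- subset(people.fix, yearsmarried >= 0) # drop impossible records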

Simple transformation rules


In practice, data cleaning procedures involve a lot of ad-hoc transformations. This
may lead to long scripts where one selects parts of the data, changes some variables,
selects another part, changes some more variables, etc. When such scripts are neatly
written and commented, they can almost be treated as a log of the actions performed
by the analyst. However, as scripts get longer it is better to store the transformation
rules separately and log which rule is executed on what record. The deducorrect
package offers functionality for this. Consider as an example the following (fictitious)
dataset listing the body length of some brothers.
#### Example: marx brothers data
marx <- read.table(text = "
name height unit
Groucho 170.00 cm
Zeppo 1.74 m
Chico 70.00 inch
Gummo 168.00 cm
Harpo 5.91 ft
", header=TRUE, stringsAsFactors = FALSE)
marx
## name height unit
## 1 Groucho 170.00 cm
## 2 Zeppo 1.74 m
## 3 Chico 70.00 inch
## 4 Gummo 168.00 cm
## 5 Harpo 5.91 ft

The task here is to standardize the lengths and express all of them in meters. The
obvious way would be to use indexing techniques, which would look something like
this.
marx_m <- marx
ind <- (marx$unit == "cm") # indexes for cm


marx_m[ind, "height"] <- marx$height[ind] / 100


marx_m[ind, "unit"] <- "m"
ind <- (marx$unit == "inch") # indexes for inch
marx_m[ind, "height"] <- marx$height[ind] / 39.37
marx_m[ind, "unit"] <- "m"
ind <- (marx$unit == "ft") # indexes for ft
marx_m[ind, "height"] <- marx$height[ind] / 3.28
marx_m[ind, "unit"] <- "m"
marx_m
## name height unit
## 1 Groucho 1.700000 m
## 2 Zeppo 1.740000 m
## 3 Chico 1.778004 m
## 4 Gummo 1.680000 m
## 5 Harpo 1.801829 m
Such operations quickly become cumbersome. Of course, in this case one could
write a for-loop but that would hardly save any code. Moreover, if you want to check
afterwards which values have been converted and for what reason, there will be a
significant administrative overhead. The deducorrect package takes all this overhead
off your hands with the correctionRules() functionality. For example, to perform
the above task, one first specifies a file with correction rules as follows.
# convert centimeters
if ( unit == "cm" ){
height <- height / 100
unit <- "m" # set all units to meter
}
# convert inches
if (unit == "inch" ){
height <- height / 39.37
unit <- "m" # set all units to meter
}
# convert feet
if (unit == "ft" ){
height <- height / 3.28
unit <- "m" # set all units to meter
}
With deducorrect we can read these rules, apply them to the data and obtain a
log of all actual changes as follows.
#### Conversions for marx bros data
library(deducorrect)
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_conversions.txt"
# read the conversion rules.
R <- correctionRules(fn.data)
R
## Object of class 'correctionRules'


## ## 1-------
## if (unit == "cm") {
## height <- height/100
## unit <- "m"
## }
## ## 2-------
## if (unit == "inch") {
## height <- height/39.37
## unit <- "m"
## }
## ## 3-------
## if (unit == "ft") {
## height <- height/3.28
## unit <- "m"
## }
correctionRules() has parsed the rules and stored them in a correctionRules
object. We may now apply them to the data.
cor <- correctWithRules(R, marx)
The returned value, cor, is a list containing the corrected data
cor$corrected
## name height unit
## 1 Groucho 1.700000 m
## 2 Zeppo 1.740000 m
## 3 Chico 1.778004 m
## 4 Gummo 1.680000 m
## 5 Harpo 1.801829 m
as well as a log of applied corrections.
cor$corrections[1:4]
## row variable old new
## 1 1 height 170 1.7
## 2 1 unit cm m
## 3 3 height 70 1.77800355600711
## 4 3 unit inch m
## 5 4 height 168 1.68
## 6 4 unit cm m
## 7 5 height 5.91 1.80182926829268
## 8 5 unit ft m
The log lists, for each row, which variable was changed, what the old value was, and
what the new value is. Furthermore, the fifth column of cor$corrections shows the
corrections that were applied (not shown above for formatting reasons).
cor$corrections[5]
## how
## 1 if (unit == "cm") { height <- height/100 unit <- "m" }
## 2 if (unit == "cm") { height <- height/100 unit <- "m" }


## 3 if (unit == "inch") { height <- height/39.37 unit <- "m" }


## 4 if (unit == "inch") { height <- height/39.37 unit <- "m" }
## 5 if (unit == "cm") { height <- height/100 unit <- "m" }
## 6 if (unit == "cm") { height <- height/100 unit <- "m" }
## 7 if (unit == "ft") { height <- height/3.28 unit <- "m" }
## 8 if (unit == "ft") { height <- height/3.28 unit <- "m" }

So here, with just two commands, the data is processed and all actions logged in
a data.frame which may be stored or analyzed. The rules that may be applied with
deducorrect are rules that can be executed record-by-record.
By design, there are some limitations to which rules can be applied with correctWithRules().
The processing rules should be executable record-by-record. That is, it is not permit-
ted to use functions like mean() or sd(). The symbols that may be used can be listed
as follows.
getOption("allowedSymbols")
## [1] "if" "else" "is.na" "is.finite" "=="
## [6] "<" "<=" "=" ">=" ">"
## [11] "!=" "!" "%in%" "identical" "sign"
## [16] "abs" "||" "|" "&&" "&"
## [21] "(" "{" "<-" "=" "+"
## [26] "-" "*" "^" "/" "%%"
## [31] "%/%"
When the rules are read by correctionRules(), it checks whether any symbol
occurs that is not in the list of allowed symbols and returns an error message when
such a symbol is found as in the following example.
correctionRules(expression(x <- mean(x)))
##
## Forbidden symbols found:
## ## ERR 1 ------
## Forbidden symbols: mean
## x <- mean(x)
## Error in correctionRules.expression(expression(x <- mean(x))): Forbidden symbols
found
Finally, it is currently not possible to add new variables using correctionRules()
although such a feature will likely be added in the future.

Deductive correction
When the data you are analyzing is generated by people rather than machines or mea-
surement devices, certain typical human-generated errors are likely to occur. Given
that data has to obey certain edit rules, the occurrence of such errors can sometimes
be detected from raw data with (almost) certainty. Examples of errors that can be
detected are typing errors in numbers (under linear restrictions), rounding errors in


numbers and sign errors or variable swaps. The deducorrect package has a number
of functions available that can correct such errors. Below we give some examples,
every time with just a single edit rule. The functions can handle larger sets of edits
however.
[I will complete this section if we need it for our Spring semester.]

Deterministic imputation
In some cases a missing value can be determined because the observed values combined
with their constraints force a unique solution.
[I will complete this section if we need it for our Spring semester.]

11.6.4 Imputation
Imputation is the process of estimating or deriving values for fields where data is
missing. There is a vast body of literature on imputation methods and it goes beyond
the scope of this chapter to discuss all of them.
There is no one single best imputation method that works in all cases. The
imputation model of choice depends on what auxiliary information is available and
whether there are (multivariate) edit restrictions on the data to be imputed. The
availability of R software for imputation under edit restrictions is limited. However,
a viable strategy for imputing numerical data is to first impute missing values without
restrictions, and then minimally adjust the imputed values so that the restrictions
are obeyed. Separately, these methods are available in R.
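As a minimal sketch (unconditional mean imputation in base R, shown only to illustrate the idea; dedicated packages such as mice offer model-based alternatives), using the person data from earlier:
# sketch: simple mean imputation of the missing heights
height.imp <- person$height
height.imp[is.na(height.imp)] <- mean(height.imp, na.rm = TRUE)
height.imp
# expected: 6.00 5.90 5.95 5.95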



Part V

ADA2: Review of ADA1



Chapter 1

R statistical software and review

The purpose of this chapter is to discuss R in the context of a quick review of the
topics we covered last semester in ADA11 .

1.1 R
R is a programming language and environment for data management and statistical
analysis. So many people have written “An Introduction to R” that I refer you to
the course website2 for links to tutorials. I encourage you to learn R by (1) running
the commands in the tutorials, (2) looking at the help for the commands (e.g., ?mean),
and (3) trying things on your own as you become curious. Make mistakes, figure out
why some things don’t work the way you expect, and keep trying. Persistence wins
the day with programming (as does asking and searching for help).
R is more difficult to master (though more rewarding) than some statistical pack-
ages (such as Minitab) for the following reasons: (1) R does not, in general, provide
a point-and-click environment for statistical analysis. Rather, R uses syntax-based
programs (i.e., code) to define, transform, and read data, and to define the procedures
for analyzing data. (2) R does not really have a spreadsheet environment for data
management. Rather, data are entered directly within an R program, read from a
file, or imported from a spreadsheet. All manipulation, transformation, and selection
of data is coded in the R program. Done well, this means that all the steps of the
analysis are repeatable and can be understood.
Take a minute to install the packages we’ll need this semester by executing the
following commands in R.
#### Install packages needed this semester
ADA2.package.list <- c("BSDA", "Hmisc", "MASS", "NbClust",

1
http://statacumen.com/teaching/ada1/
2
http://statacumen.com/teaching/ada2/


"aod", "candisc", "car", "cluster",


"ellipse", "ggplot2", "gridExtra", "klaR",
"leaps", "lsmeans", "moments", "multcomp",
"mvnormtest", "nortest", "plyr", "popbio",
"reshape", "reshape2", "scatterplot3d", "vcd",
"vioplot", "xtable")
install.packages(ADA2.package.list)

1.2 ADA1 Ch 0: R warm-up


This example illustrates several strategies for data summarization and analysis. The
data for this example are from 15 turkeys raised on farms in either Virginia or Wis-
consin.
You should use help to get more information on the functions demonstrated here.
Many of the lines in the program have comments, which helps anyone reading the
program to more easily follow the logic. I strongly recommend commenting any
line of code that isn’t absolutely obvious what is being done and why it is being done.
I also recommend placing comments before blocks of code in order to describe what
the code below is meant to do.
#### Example: Turkey, R warm-up
# Read the data file from the website and learn some functions

# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch01_turkey.csv"
# read file and assign data to turkey variable
turkey <- read.csv(fn.data)
# examine the structure of the dataset, is it what you expected?
# a data.frame containing integers, numbers, and factors
str(turkey)
## 'data.frame': 15 obs. of 3 variables:
## $ age : int 28 20 32 25 23 22 29 27 28 26 ...
## $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
## $ orig : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
# print dataset to screen
turkey
## age weight orig
## 1 28 13.3 va
## 2 20 8.9 va
## 3 32 15.1 va
## 4 25 13.8 wi
## 5 23 13.1 wi
## 6 22 10.4 va
## 7 29 13.1 va
## 8 27 12.4 va


## 9 28 13.2 va
## 10 26 11.8 va
## 11 21 11.5 wi
## 12 31 16.6 wi
## 13 27 14.2 wi
## 14 29 15.4 wi
## 15 30 15.9 wi
# Note: to view the age variable (column), there's a few ways to do that
turkey$age # name the variable
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
turkey[, 1] # give the column number
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
turkey[, "age"] # give the column name
## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
# and the structure is a vector
str(turkey$age)
## int [1:15] 28 20 32 25 23 22 29 27 28 26 ...
# let's create an additional variable for later
# gt25mo will be a variable indicating whether the age is greater than 25 months
turkey$gt25mo <- (turkey$age > 25)
# now we also have a Boolean (logical) column
str(turkey)
## 'data.frame': 15 obs. of 4 variables:
## $ age : int 28 20 32 25 23 22 29 27 28 26 ...
## $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
## $ orig : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
## $ gt25mo: logi TRUE FALSE TRUE FALSE FALSE FALSE ...
# there are a couple ways of subsetting the rows
turkey[(turkey$gt25mo == TRUE),] # specify the rows
## age weight orig gt25mo
## 1 28 13.3 va TRUE
## 3 32 15.1 va TRUE
## 7 29 13.1 va TRUE
## 8 27 12.4 va TRUE
## 9 28 13.2 va TRUE
## 10 26 11.8 va TRUE
## 12 31 16.6 wi TRUE
## 13 27 14.2 wi TRUE
## 14 29 15.4 wi TRUE
## 15 30 15.9 wi TRUE
subset(turkey, gt25mo == FALSE) # use subset() to select the data.frame records
## age weight orig gt25mo
## 2 20 8.9 va FALSE
## 4 25 13.8 wi FALSE
## 5 23 13.1 wi FALSE
## 6 22 10.4 va FALSE
## 11 21 11.5 wi FALSE


Analyses can be then done on the entire dataset, or repeated for all subsets of a
variable in the dataset.
# summaries of each variable in the entire dataset,
summary(turkey)
## age weight orig gt25mo
## Min. :20.00 Min. : 8.90 va:8 Mode :logical
## 1st Qu.:24.00 1st Qu.:12.10 wi:7 FALSE:5
## Median :27.00 Median :13.20 TRUE :10
## Mean :26.53 Mean :13.25
## 3rd Qu.:29.00 3rd Qu.:14.65
## Max. :32.00 Max. :16.60
# or summarize by a variable in the dataset.
by(turkey, turkey$orig, summary)
## turkey$orig: va
## age weight orig gt25mo
## Min. :20.00 Min. : 8.90 va:8 Mode :logical
## 1st Qu.:25.00 1st Qu.:11.45 wi:0 FALSE:2
## Median :27.50 Median :12.75 TRUE :6
## Mean :26.50 Mean :12.28
## 3rd Qu.:28.25 3rd Qu.:13.22
## Max. :32.00 Max. :15.10
## ----------------------------------------------------
## turkey$orig: wi
## age weight orig gt25mo
## Min. :21.00 Min. :11.50 va:0 Mode :logical
## 1st Qu.:24.00 1st Qu.:13.45 wi:7 FALSE:3
## Median :27.00 Median :14.20 TRUE :4
## Mean :26.57 Mean :14.36
## 3rd Qu.:29.50 3rd Qu.:15.65
## Max. :31.00 Max. :16.60

1.3 ADA1 Chapters 2, 4, 6: Estimation in one-sample problems

Plot the weights by origin.
#### Example: Turkey, Chapters 2, 4, 6
# subset the data for convenience
turkeyva <- subset(turkey, orig == "va")
turkeywi <- subset(turkey, orig == "wi")

library(ggplot2)
# Histogram overlaid with kernel density curve
p11 <- ggplot(turkeyva, aes(x = weight))
# Histogram with density instead of count on y-axis
p11 <- p11 + geom_histogram(aes(y=..density..)
, binwidth=2
, colour="black", fill="white")
# Overlay with transparent density plot
p11 <- p11 + geom_density(alpha=0.1, fill="#FF6666")
p11 <- p11 + geom_rug()

# violin plot


p12 <- ggplot(turkeyva, aes(x = "weight", y = weight))


p12 <- p12 + geom_violin(fill = "gray50")
p12 <- p12 + geom_boxplot(width = 0.2, alpha = 3/4)
p12 <- p12 + coord_flip()

# boxplot
p13 <- ggplot(turkeyva, aes(x = "weight", y = weight))
p13 <- p13 + geom_boxplot()
p13 <- p13 + coord_flip()

library(gridExtra)
#grid.arrange(p11, p12, p13, ncol=1, main="Turkey weights for origin va")
## add grobs = list(), and main= becomes top=
grid.arrange(grobs = list(p11, p12, p13), ncol=1, top="Turkey weights for origin va")

# Histogram overlaid with kernel density curve


p21 <- ggplot(turkeywi, aes(x = weight))
# Histogram with density instead of count on y-axis
p21 <- p21 + geom_histogram(aes(y=..density..)
, binwidth=2
, colour="black", fill="white")
# Overlay with transparent density plot
p21 <- p21 + geom_density(alpha=0.1, fill="#FF6666")
p21 <- p21 + geom_rug()

# violin plot
p22 <- ggplot(turkeywi, aes(x = "weight", y = weight))
p22 <- p22 + geom_violin(fill = "gray50")
p22 <- p22 + geom_boxplot(width = 0.2, alpha = 3/4)
p22 <- p22 + coord_flip()

# boxplot
p23 <- ggplot(turkeywi, aes(x = "weight", y = weight))
p23 <- p23 + geom_boxplot()
p23 <- p23 + coord_flip()

library(gridExtra)
grid.arrange(grobs = list(p21, p22, p23), ncol=1, top="Turkey weights for origin wi")

[Figure: “Turkey weights for origin va” and “Turkey weights for origin wi”; each set of panels shows a histogram with overlaid density curve, a violin plot with boxplot, and a boxplot of weight.]

Check normality of each sample graphically with the bootstrap sampling distribution
and normal quantile plot, and formally with normality tests.
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam.mean <- colMeans(sam);
# save par() settings


old.par <- par(no.readonly = TRUE)


# make smaller margins
par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6
, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)

hist(colMeans(sam), freq = FALSE, breaks = 25


, main = "Bootstrap sampling distribution of the mean"
, xlab = paste("Data: n =", n
, ", mean =", signif(mean(dat), digits = 5)
, ", se =", signif(sd(dat)/sqrt(n)), digits = 5))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam.mean), max(sam.mean), length = 1000)
points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
# restore par() settings
par(old.par)
}

# Bootstrap sampling distribution


bs.one.samp.dist(turkeyva$weight)
bs.one.samp.dist(turkeywi$weight)

[Figure: for each origin, “Plot of data with smoothed density curve” and “Bootstrap sampling distribution of the mean” with a normal curve overlaid. va: n = 8, mean = 12.275, se = 0.67764; wi: n = 7, mean = 14.357, se = 0.66507.]

# normal quantile-quantile (QQ) plot of each orig sample


library(car)
# qq plot
# las = 1 : turns labels on y-axis to read horizontally
# id.n = n : labels n most extreme observations, and outputs to console
# id.cex = 1 : is the size of those labels
# lwd = 1 : line width
qqPlot(turkeyva$weight, las = 1, id = list(n = 0, cex = 1), lwd = 1
, main="QQ Plot, turkey origin va")
qqPlot(turkeywi$weight, las = 1, id = list(n = 0, cex = 1), lwd = 1
, main="QQ Plot, turkey origin wi")


[Figure: normal quantile-quantile plots of weight — "QQ Plot, turkey origin va" and "QQ Plot, turkey origin wi".]

# Normality tests

# VA
shapiro.test(turkeyva$weight)
##
## Shapiro-Wilk normality test
##
## data: turkeyva$weight
## W = 0.95414, p-value = 0.7528
library(nortest)
ad.test(turkeyva$weight)
##
## Anderson-Darling normality test
##
## data: turkeyva$weight
## A = 0.283, p-value = 0.5339
# lillie.test(turkeyva£weight)
cvm.test(turkeyva$weight)
##
## Cramer-von Mises normality test
##
## data: turkeyva$weight
## W = 0.050135, p-value = 0.4642
# WI
shapiro.test(turkeywi$weight)
##
## Shapiro-Wilk normality test
##
## data: turkeywi$weight


## W = 0.97326, p-value = 0.9209


library(nortest)
ad.test(turkeywi$weight)
## Error in ad.test(turkeywi$weight): sample size must be greater than 7
# lillie.test(turkeywi£weight)
cvm.test(turkeywi$weight)
## Error in cvm.test(turkeywi$weight): sample size must be greater than 7
## Note: The errors above are expected.
Because we do not have any serious departures from normality (the data are consistent with being normal, as is the bootstrap sampling distribution of the mean), the t-test is appropriate. We will also look at a couple of nonparametric methods.
# Is the average turkey weight 12 lbs?

# t-tests of the mean

# VA
t.summary <- t.test(turkeyva$weight, mu = 12)
t.summary
##
## One Sample t-test
##
## data: turkeyva$weight
## t = 0.40582, df = 7, p-value = 0.697
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 10.67264 13.87736
## sample estimates:
## mean of x
## 12.275
# WI
t.summary <- t.test(turkeywi$weight, mu = 12)
t.summary
##
## One Sample t-test
##
## data: turkeywi$weight
## t = 3.5442, df = 6, p-value = 0.01216
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 12.72978 15.98450
## sample estimates:
## mean of x
## 14.35714
# Sign test for the median

# VA


library(BSDA)
SIGN.test(turkeyva$weight, md=12)
##
## One-sample Sign-Test
##
## data: turkeyva$weight
## s = 5, p-value = 0.7266
## alternative hypothesis: true median is not equal to 12
## 95 percent confidence interval:
## 9.9125 13.8850
## sample estimates:
## median of x
## 12.75
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9297 10.4000 13.300
## Interpolated CI 0.9500 9.9125 13.885
## Upper Achieved CI 0.9922 8.9000 15.100
# WI
SIGN.test(turkeywi$weight, md=12)
##
## One-sample Sign-Test
##
## data: turkeywi$weight
## s = 6, p-value = 0.125
## alternative hypothesis: true median is not equal to 12
## 95 percent confidence interval:
## 12.00286 16.38000
## sample estimates:
## median of x
## 14.2
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8750 13.1000 15.90
## Interpolated CI 0.9500 12.0029 16.38
## Upper Achieved CI 0.9844 11.5000 16.60
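The sign test is equivalent to a binomial test on the number of observations above the hypothesized median; a minimal sketch for the va sample (n.above and n.trial are names used only in this sketch; weights exactly equal to 12, if any, are dropped):
# Sketch: the sign test is a binomial test on the count of weights above the
# hypothesized median 12 (weights exactly equal to 12, if any, are dropped).
n.above <- sum(turkeyva$weight > 12)
n.trial <- sum(turkeyva$weight != 12)
binom.test(n.above, n.trial, p = 0.5)
# The two-sided p-value should match the SIGN.test() p-value above.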
# Wilcoxon sign-rank test for the median (or mean, since symmetric assumption)

# VA
# with continuity correction in the normal approximation for the p-value
wilcox.test(turkeyva$weight, mu=12, conf.int=TRUE)
## Warning in wilcox.test.default(turkeyva$weight, mu = 12, conf.int = TRUE): cannot
compute exact p-value with ties


## Warning in wilcox.test.default(turkeyva$weight, mu = 12, conf.int = TRUE): cannot


compute exact confidence interval with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: turkeyva$weight
## V = 21.5, p-value = 0.674
## alternative hypothesis: true location is not equal to 12
## 95 percent confidence interval:
## 10.40005 14.09997
## sample estimates:
## (pseudo)median
## 12.47331
# without continuity correction
wilcox.test(turkeyva$weight, mu=12, conf.int=TRUE, correct=FALSE)
## Warning in wilcox.test.default(turkeyva$weight, mu = 12, conf.int = TRUE, : cannot
compute exact p-value with ties
## Warning in wilcox.test.default(turkeyva$weight, mu = 12, conf.int = TRUE, : cannot
compute exact confidence interval with ties
##
## Wilcoxon signed rank test
##
## data: turkeyva$weight
## V = 21.5, p-value = 0.6236
## alternative hypothesis: true location is not equal to 12
## 95 percent confidence interval:
## 10.64999 13.75002
## sample estimates:
## (pseudo)median
## 12.47331
# WI
# with continuity correction in the normal approximation for the p-value
wilcox.test(turkeywi$weight, mu=12, conf.int=TRUE)
##
## Wilcoxon signed rank test
##
## data: turkeywi$weight
## V = 27, p-value = 0.03125
## alternative hypothesis: true location is not equal to 12
## 95 percent confidence interval:
## 12.65 16.00
## sample estimates:
## (pseudo)median
## 14.375
# without continuity correction
wilcox.test(turkeywi$weight, mu=12, conf.int=TRUE, correct=FALSE)


##
## Wilcoxon signed rank test
##
## data: turkeywi$weight
## V = 27, p-value = 0.03125
## alternative hypothesis: true location is not equal to 12
## 95 percent confidence interval:
## 12.65 16.00
## sample estimates:
## (pseudo)median
## 14.375

1.4 ADA1 Chapters 3, 4, 6: Two-sample inferences
Presume it is of interest to compare the center of the weight distributions between
the origins. There are many ways to plot the data for visual comparisons.
#### Example: Turkey, Chapters 3, 4, 6
# stripchart (dotplot) using ggplot
library(ggplot2)
p1 <- ggplot(turkey, aes(x = weight, y = orig))
p1 <- p1 + geom_point(position = position_jitter(h=0.1))
p1 <- p1 + labs(title = "Dotplot with position jitter")

# boxplot
p2 <- ggplot(turkey, aes(x = orig, y = weight))
p2 <- p2 + geom_boxplot()
# add a "+" at the mean
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
p2 <- p2 + geom_point()
p2 <- p2 + coord_flip()
p2 <- p2 + labs(title = "Boxplot with mean (+) and points")

# histogram using ggplot


p3 <- ggplot(turkey, aes(x = weight))
p3 <- p3 + geom_histogram(binwidth = 2)
p3 <- p3 + geom_rug()
p3 <- p3 + facet_grid(orig ~ .)
p3 <- p3 + labs(title = "Histogram with facets")

p4 <- ggplot(turkey, aes(x = weight, fill=orig))


p4 <- p4 + geom_histogram(binwidth = 2, alpha = 0.5, position="identity")
p4 <- p4 + geom_rug(aes(colour = orig))
p4 <- p4 + labs(title = "Histogram with opacity (alpha)")


p5 <- ggplot(turkey, aes(x = weight, fill=orig))


p5 <- p5 + geom_histogram(binwidth = 2, alpha = 1, position="dodge")
p5 <- p5 + geom_rug(aes(colour = orig))
p5 <- p5 + labs(title = "Histogram with dodge")

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), ncol=2, nrow=3
, top="Turkey weights compared by origin")
[Figure: "Turkey weights compared by origin" — dotplot with position jitter, boxplot with mean (+) and points, histogram with facets, histogram with opacity (alpha), and histogram with dodge.]

Before using the two-sample t-test, first check the normality assumption on the sampling distribution of the difference in means between the two populations.
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data


bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {


n1 <- length(dat1);
n2 <- length(dat2);
# resample from data
sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
# calculate the means and take difference between populations
sam1.mean <- colMeans(sam1);
sam2.mean <- colMeans(sam2);
diff.mean <- sam1.mean - sam2.mean;
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat1, freq = FALSE, breaks = 6
, main = paste("Sample 1", "\n"
, "n =", n1
, ", mean =", signif(mean(dat1), digits = 5)
, ", sd =", signif(sd(dat1), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat1), type = "l")
rug(dat1)

hist(dat2, freq = FALSE, breaks = 6


, main = paste("Sample 2", "\n"
, "n =", n2
, ", mean =", signif(mean(dat2), digits = 5)
, ", sd =", signif(sd(dat2), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat2), type = "l")
rug(dat2)

hist(diff.mean, freq = FALSE, breaks = 25


, main = paste("Bootstrap sampling distribution of the difference in means", "\n"
, "mean =", signif(mean(diff.mean), digits = 5)
, ", se =", signif(sd(diff.mean), digits = 5)))
# overlay a density curve for the sample means
points(density(diff.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(diff.mean), max(diff.mean), length = 1000)
points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(diff.mean)
# restore par() settings
par(old.par)
}

bs.two.samp.diff.dist(turkeyva$weight, turkeywi$weight)
[Figure: bs.two.samp.diff.dist() output — Sample 1 (va): n = 8, mean = 12.275, sd = 1.9167; Sample 2 (wi): n = 7, mean = 14.357, sd = 1.7596; bootstrap sampling distribution of the difference in means: mean = −2.0836, se = 0.88297, with a normal curve overlaid.]


The two-sample t-test is appropriate since the bootstrap sampling distribution of the difference in means is approximately normal. This is the most powerful of the tests considered here, and it detects a difference at the 0.05 significance level.
# Two-sample t-test
## Equal variances
# var.equal = FALSE is the default
# two-sample t-test specifying two separate vectors
t.summary.eqvar <- t.test(turkeyva$weight, turkeywi$weight, var.equal = TRUE)
t.summary.eqvar
##
## Two Sample t-test
##
## data: turkeyva$weight and turkeywi$weight
## t = -2.1796, df = 13, p-value = 0.04827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.14595971 -0.01832601
## sample estimates:
## mean of x mean of y
## 12.27500 14.35714
# two-sample t-test with unequal variances (Welch = Satterthwaite)
# specified using a data.frame and a formula, weight by orig
t.summary.uneqvar <- t.test(weight ~ orig, data = turkey, var.equal = FALSE)
t.summary.uneqvar
##
## Welch Two Sample t-test
##
## data: weight by orig
## t = -2.1929, df = 12.956, p-value = 0.04717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.13407696 -0.03020875
## sample estimates:
## mean in group va mean in group wi
## 12.27500 14.35714
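The t.test() result is an object of class "htest", so the quantities in the output above can be extracted by name rather than retyped; a small sketch:
# Sketch: extract pieces of the Welch t-test result by name ("htest" object).
t.summary.uneqvar$statistic  # t statistic
t.summary.uneqvar$p.value    # p-value
t.summary.uneqvar$conf.int   # 95% CI for the difference in means
t.summary.uneqvar$estimate   # the two group means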
The (Wilcoxon-)Mann-Whitney two-sample test is also appropriate because the shapes of the two distributions are similar, though their locations are different. This is a less powerful test that does not require normality, and it fails to detect a difference at the 0.05 significance level.
# with continuity correction in the normal approximation for the p-value
wilcox.test(turkeyva$weight, turkeywi$weight, conf.int=TRUE)
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE):
cannot compute exact p-value with ties
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE):
cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test with continuity correction


##
## data: turkeyva$weight and turkeywi$weight
## W = 11.5, p-value = 0.06384
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.19994493 0.09993686
## sample estimates:
## difference in location
## -2.152771
# without continuity correction
wilcox.test(turkeyva$weight, turkeywi$weight, conf.int=TRUE, correct=FALSE)
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE,
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(turkeyva$weight, turkeywi$weight, conf.int = TRUE,
: cannot compute exact confidence intervals with ties
##
## Wilcoxon rank sum test
##
## data: turkeyva$weight and turkeywi$weight
## W = 11.5, p-value = 0.05598
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.100049e+00 1.445586e-05
## sample estimates:
## difference in location
## -2.152771

1.5 ADA1 Chapters 5, 4, 6: One-way ANOVA


The Waste Run-up data refer to five suppliers of the Levi-Strauss clothing manufacturing plant in Albuquerque. The firm's quality control department collects weekly data on percentage waste (run-up) relative to what can be achieved by computer layouts of patterns on cloth. It is possible to have negative values, which indicate that the plant employees beat the computer in controlling waste. Under question are differences among the five supplier plants (PT1, ..., PT5).
(Source: http://lib.stat.cmu.edu/DASL/Stories/wasterunup.html, the Data and Story Library (DASL, pronounced "dazzle"), an online library of datafiles and stories that illustrate the use of basic statistics methods. The "Waste Run-up" dataset is from L. Koopmans, Introduction to Contemporary Statistical Methods, Duxbury Press, 1987, p. 86.)
#### Example: Waste Run-up, Chapters 5, 4, 6
# convert to a data.frame by reading the text table
waste <- read.table(text = "
PT1 PT2 PT3 PT4 PT5
1.2 16.4 12.1 11.5 24.0


10.1 -6.0 9.7 10.2 -3.7


-2.0 -11.6 7.4 3.8 8.2
1.5 -1.3 -2.1 8.3 9.2
-3.0 4.0 10.1 6.6 -9.3
-0.7 17.0 4.7 10.2 8.0
3.2 3.8 4.6 8.8 15.8
2.7 4.3 3.9 2.7 22.3
-3.2 10.4 3.6 5.1 3.1
-1.7 4.2 9.6 11.2 16.8
2.4 8.5 9.8 5.9 11.3
0.3 6.3 6.5 13.0 12.3
3.5 9.0 5.7 6.8 16.9
-0.8 7.1 5.1 14.5 NA
19.4 4.3 3.4 5.2 NA
2.8 19.7 -0.8 7.3 NA
13.0 3.0 -3.9 7.1 NA
42.7 7.6 0.9 3.4 NA
1.4 70.2 1.5 0.7 NA
3.0 8.5 NA NA NA
2.4 6.0 NA NA NA
1.3 2.9 NA NA NA
", header=TRUE)
waste
## PT1 PT2 PT3 PT4 PT5
## 1 1.2 16.4 12.1 11.5 24.0
## 2 10.1 -6.0 9.7 10.2 -3.7
## 3 -2.0 -11.6 7.4 3.8 8.2
## 4 1.5 -1.3 -2.1 8.3 9.2
## 5 -3.0 4.0 10.1 6.6 -9.3
## 6 -0.7 17.0 4.7 10.2 8.0
## 7 3.2 3.8 4.6 8.8 15.8
## 8 2.7 4.3 3.9 2.7 22.3
## 9 -3.2 10.4 3.6 5.1 3.1
## 10 -1.7 4.2 9.6 11.2 16.8
## 11 2.4 8.5 9.8 5.9 11.3
## 12 0.3 6.3 6.5 13.0 12.3
## 13 3.5 9.0 5.7 6.8 16.9
## 14 -0.8 7.1 5.1 14.5 NA
## 15 19.4 4.3 3.4 5.2 NA
## 16 2.8 19.7 -0.8 7.3 NA
## 17 13.0 3.0 -3.9 7.1 NA
## 18 42.7 7.6 0.9 3.4 NA
## 19 1.4 70.2 1.5 0.7 NA
## 20 3.0 8.5 NA NA NA
## 21 2.4 6.0 NA NA NA
## 22 1.3 2.9 NA NA NA
library(reshape2)
waste.long <- melt(waste,


# id.vars: ID variables
# all variables to keep but not split apart on
# id.vars=NULL,
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
# measure.vars = c("PT1","PT2","PT3","PT4","PT5"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "plant",
# value.name: column name for values in table
value.name = "runup",
# remove the NA values
na.rm = TRUE
)
## No id variables; using all as measure variables
str(waste.long)
## 'data.frame': 95 obs. of 2 variables:
## $ plant: Factor w/ 5 levels "PT1","PT2","PT3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ runup: num 1.2 10.1 -2 1.5 -3 -0.7 3.2 2.7 -3.2 -1.7 ...
head(waste.long)
## plant runup
## 1 PT1 1.2
## 2 PT1 10.1
## 3 PT1 -2.0
## 4 PT1 1.5
## 5 PT1 -3.0
## 6 PT1 -0.7
tail(waste.long)
## plant runup
## 96 PT5 22.3
## 97 PT5 3.1
## 98 PT5 16.8
## 99 PT5 11.3
## 100 PT5 12.3
## 101 PT5 16.9
# Calculate the mean, sd, n, and se for the plants

# The plyr package is an advanced way to apply a function to subsets of data


# "Tools for splitting, applying and combining data"
library(plyr)
# ddply "dd" means the input and output are both data.frames
waste.summary <- ddply(waste.long,
"plant",
function(X) {
data.frame( m = mean(X$runup),
s = sd(X$runup),
n = length(X$runup)


)
}
)
# standard errors
waste.summary$se <- waste.summary$s/sqrt(waste.summary$n)
waste.summary$moe <- qt(1 - 0.05 / 2, df = waste.summary$n - 1) * waste.summary$se
# individual confidence limits
waste.summary$ci.l <- waste.summary$m - waste.summary$moe

waste.summary$ci.u <- waste.summary$m + waste.summary$moe


waste.summary
## plant m s n se moe ci.l
## 1 PT1 4.522727 10.032041 22 2.1388383 4.447958 0.07476963
## 2 PT2 8.831818 15.353467 22 3.2733701 6.807346 2.02447241
## 3 PT3 4.831579 4.403162 19 1.0101547 2.122256 2.70932276
## 4 PT4 7.489474 3.657093 19 0.8389946 1.762662 5.72681139
## 5 PT5 10.376923 9.555030 13 2.6500884 5.774047 4.60287651
## ci.u
## 1 8.970685
## 2 15.639164
## 3 6.953835
## 4 9.252136
## 5 16.150970
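The same group summaries can also be computed with dplyr's group_by()/summarise(); a sketch, assuming the dplyr package is installed (the notes use plyr above, so this is only an alternative):
# Sketch (alternative to ddply above): group summaries with dplyr.
library(dplyr)
waste.long %>%
  group_by(plant) %>%
  summarise(m = mean(runup), s = sd(runup), n = n(), se = s / sqrt(n))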
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(waste.long, aes(x = plant, y = runup))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = 0),
colour = "black", linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_hline(aes(yintercept = mean(runup)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Plant Run-up waste relative to computer layout")
p <- p + ylab("Run-up waste")
print(p)


[Figure: "Plant Run-up waste relative to computer layout" — boxplots of run-up waste by plant (PT1–PT5) with jittered points, group mean diamonds, and normal-theory 95% error bars.]

The outliers here suggest the ANOVA is not an appropriate model. The normality
tests below suggest the distributions for the first two plants are not normal.
by(waste.long$runup, waste.long$plant, ad.test)
## waste.long$plant: PT1
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 2.8685, p-value = 1.761e-07
##
## ----------------------------------------------------
## waste.long$plant: PT2


##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 2.5207, p-value = 1.334e-06
##
## ----------------------------------------------------
## waste.long$plant: PT3
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.23385, p-value = 0.7624
##
## ----------------------------------------------------
## waste.long$plant: PT4
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.12363, p-value = 0.9834
##
## ----------------------------------------------------
## waste.long$plant: PT5
##
## Anderson-Darling normality test
##
## data: dd[x, ]
## A = 0.27445, p-value = 0.6004
For review purposes, I’ll fit the ANOVA, but we would count on the following
nonparametric method for inference.
fit.w <- aov(runup ~ plant, data = waste.long)
summary(fit.w)
## Df Sum Sq Mean Sq F value Pr(>F)
## plant 4 451 112.73 1.16 0.334
## Residuals 90 8749 97.21
fit.w
## Call:
## aov(formula = runup ~ plant, data = waste.long)
##
## Terms:
## plant Residuals
## Sum of Squares 450.921 8749.088
## Deg. of Freedom 4 90
##
## Residual standard error: 9.859619
## Estimated effects may be unbalanced


# all pairwise comparisons among plants


# Fisher's LSD (FSD) uses "none"
pairwise.t.test(waste.long$runup, waste.long$plant,
pool.sd = TRUE, p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: waste.long$runup and waste.long$plant
##
## PT1 PT2 PT3 PT4
## PT2 0.151 - - -
## PT3 0.921 0.198 - -
## PT4 0.339 0.665 0.408 -
## PT5 0.093 0.655 0.122 0.418
##
## P value adjustment method: none
# Tukey 95% Individual p-values
TukeyHSD(fit.w)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = runup ~ plant, data = waste.long)
##
## $plant
## diff lwr upr p adj
## PT2-PT1 4.3090909 -3.966713 12.584895 0.5976181
## PT3-PT1 0.3088517 -8.287424 8.905127 0.9999769
## PT4-PT1 2.9667464 -5.629529 11.563022 0.8718682
## PT5-PT1 5.8541958 -3.747712 15.456104 0.4408168
## PT3-PT2 -4.0002392 -12.596515 4.596036 0.6946720
## PT4-PT2 -1.3423445 -9.938620 7.253931 0.9924515
## PT5-PT2 1.5451049 -8.056803 11.147013 0.9915352
## PT4-PT3 2.6578947 -6.247327 11.563116 0.9203538
## PT5-PT3 5.5453441 -4.334112 15.424800 0.5251000
## PT5-PT4 2.8874494 -6.992007 12.766906 0.9258057
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of waste
pairwise.t.test(waste.long$runup, waste.long$plant,
pool.sd = TRUE, p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: waste.long$runup and waste.long$plant
##
## PT1 PT2 PT3 PT4
## PT2 1.00 - - -
## PT3 1.00 1.00 - -


## PT4 1.00 1.00 1.00 -


## PT5 0.93 1.00 1.00 1.00
##
## P value adjustment method: bonferroni
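The Bonferroni-adjusted p-values above are just the unadjusted (Fisher's LSD) p-values multiplied by the number of comparisons and capped at 1; a short sketch verifying this (pw.none is a name used only in this sketch):
# Sketch: Bonferroni adjustment by hand -- multiply the unadjusted pairwise
# p-values by the number of comparisons, choose(5, 2) = 10, and cap at 1.
pw.none <- pairwise.t.test(waste.long$runup, waste.long$plant,
                           pool.sd = TRUE, p.adjust.method = "none")
round(pmin(1, choose(5, 2) * pw.none$p.value), 2)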
The residuals show many outliers.
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.w$residuals, las = 1, id = list(n = 10, cex = 1), lwd = 1
, main="QQ Plot of residuals")
## 41 18 25 93 15 24 90 89 96 38
## 41 18 25 87 15 24 84 83 90 38

[Figure: "QQ Plot of residuals" from the ANOVA fit, with the 10 most extreme residuals labeled by observation number.]

The Kruskal-Wallis ANOVA is a nonparametric method for testing the hypothesis of equal population medians against the alternative that not all population medians are equal. It is still not perfect here because the distributional shapes are not all the same, but it is a better alternative than the ANOVA.
# KW ANOVA
fit.wk <- kruskal.test(runup ~ plant, data = waste.long)
fit.wk


##
## Kruskal-Wallis rank sum test
##
## data: runup by plant
## Kruskal-Wallis chi-squared = 15.319, df = 4, p-value =
## 0.004084
# Bonferroni 95% pairwise comparisons with continuity correction
# in the normal approximation for the p-value
for (i1.pt in 1:4) {
for (i2.pt in (i1.pt+1):5) {
wt <- wilcox.test(waste[,names(waste)[i1.pt]], waste[,names(waste)[i2.pt]]
, conf.int=TRUE, conf.level = 1 - 0.05/choose(5,2))
cat(names(waste)[i1.pt], names(waste)[i2.pt])
print(wt)
}
}
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT1 PT2
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 131.5, p-value = 0.009813
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -8.299958 1.599947
## sample estimates:
## difference in location
## -4.399951
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT1 PT3
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 141.5, p-value = 0.07978
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -6.900028 2.700029
## sample estimates:
## difference in location
## -2.500047
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties


## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],


: cannot compute exact confidence intervals with ties
## PT1 PT4
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 85, p-value = 0.001241
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -8.900056 -1.099910
## sample estimates:
## difference in location
## -5.300051
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT1 PT5
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 76, p-value = 0.02318
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -15.50007 3.60001
## sample estimates:
## difference in location
## -8.703538
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT2 PT3
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 238, p-value = 0.4562
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -3.50007 7.20000
## sample estimates:
## difference in location
## 1.352784
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT2 PT4


## Wilcoxon rank sum test with continuity correction


##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 186, p-value = 0.5563
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -5.900014 3.800023
## sample estimates:
## difference in location
## -1.099914
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT2 PT5
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 99, p-value = 0.1375
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -13.100001 7.400037
## sample estimates:
## difference in location
## -4.630905
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT3 PT4
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 117, p-value = 0.06583
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -6.800001 1.699976
## sample estimates:
## difference in location
## -2.400038
##
## PT3 PT5
## Wilcoxon rank sum test
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 67, p-value = 0.03018
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:


## -13.4 1.7
## sample estimates:
## difference in location
## -6.6
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact p-value with ties
## Warning in wilcox.test.default(waste[, names(waste)[i1.pt]], waste[, names(waste)[i2.pt]],
: cannot compute exact confidence intervals with ties
## PT4 PT5
## Wilcoxon rank sum test with continuity correction
##
## data: waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]]
## W = 82, p-value = 0.1157
## alternative hypothesis: true location shift is not equal to 0
## 99.5 percent confidence interval:
## -11.099965 4.799945
## sample estimates:
## difference in location
## -4.000035
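The explicit double loop above gives Bonferroni-adjusted confidence intervals with unadjusted p-values. If only adjusted p-values are needed, pairwise.wilcox.test() performs the same comparisons in one call; a sketch (it applies the adjustment to the p-values rather than to the confidence level, and reports no confidence intervals):
# Sketch: compact Bonferroni-adjusted pairwise Wilcoxon rank-sum comparisons.
pairwise.wilcox.test(waste.long$runup, waste.long$plant, p.adjust.method = "bonf")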

1.6 ADA1 Chapter 7: Categorical data analysis


Returning to the turkey dataset, below is the cross-classification of orig by gt25mo.
#### Example: Turkey, Chapter 7
# create a frequency table from two columns of categorical data
xt <- xtabs( ~ orig + gt25mo, data = turkey)
# display the table
xt
## gt25mo
## orig FALSE TRUE
## va 2 6
## wi 3 4
# summary from xtabs() is the same as chisq.test() without continuity correction
summary(xt)
## Call: xtabs(formula = ~orig + gt25mo, data = turkey)
## Number of cases in table: 15
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 0.5357, df = 1, p-value = 0.4642
## Chi-squared approximation may be incorrect
# same as xtabs()
x.summary <- chisq.test(xt, correct=FALSE)
## Warning in chisq.test(xt, correct = FALSE): Chi-squared approximation may be incorrect
x.summary


##
## Pearson's Chi-squared test
##
## data: xt
## X-squared = 0.53571, df = 1, p-value = 0.4642
# the default is to perform Yates' continuity correction
chisq.test(xt)
## Warning in chisq.test(xt): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xt
## X-squared = 0.033482, df = 1, p-value = 0.8548
# Fisher's exact test
fisher.test(xt)
##
## Fisher's Exact Test for Count Data
##
## data: xt
## p-value = 0.6084
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.02687938 6.23767632
## sample estimates:
## odds ratio
## 0.4698172
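For comparison, the simple cross-product (sample) odds ratio can be computed directly from the table; a sketch (fisher.test() above reports the conditional maximum-likelihood estimate, so the two values differ slightly):
# Sketch: sample (cross-product) odds ratio from the 2-by-2 table xt.
(xt[1, 1] * xt[2, 2]) / (xt[1, 2] * xt[2, 1])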

A mosaic plot displays categorical data, with area representing frequency. The default shading is a good start, since colors only appear when there is evidence of association related to those cell values. In our example, there is insufficient evidence for association, so the default shading is all gray.
library(vcd) # for mosaic()
# shading based on significance relative to appropriate chi-square distribution
mosaic(xt, shade=TRUE)
# you can define your own interpolated shading
mosaic(xt, shade=TRUE, gp_args = list(interpolate = seq(.1,.2,.05)))


[Figure: mosaic plots of orig by gt25mo with Pearson-residual shading (residuals range from −0.41 to 0.44, p-value = 0.46421); all cells are gray, indicating no evidence of association.]

From the chisq.test() above, we make a table to summarize important values from that analysis and compare the observed and expected frequencies in plots.
# use output in x.summary and create table
x.table <- data.frame(obs = x.summary$observed
, exp = x.summary$expected
, res = x.summary$residuals
, chisq = x.summary$residuals^2
, stdres = x.summary$stdres)
## Warning in data.frame(obs = x.summary$observed, exp = x.summary$expected, : row
names were found from a short variable and have been discarded
# There are duplicate row and col identifiers in this x.table
# because we're creating vectors from a two-way table
# and columns identifying row and col names are automatically added.
# Can you figure out the naming scheme?
x.table
## obs.orig obs.gt25mo obs.Freq exp.FALSE exp.TRUE res.orig res.gt25mo
## 1 va FALSE 2 2.666667 5.333333 va FALSE
## 2 wi FALSE 3 2.333333 4.666667 wi FALSE
## 3 va TRUE 6 2.666667 5.333333 va TRUE
## 4 wi TRUE 4 2.333333 4.666667 wi TRUE
## res.Freq chisq.orig chisq.gt25mo chisq.Freq stdres.orig
## 1 -0.4082483 va FALSE 0.16666667 va
## 2 0.4364358 wi FALSE 0.19047619 wi
## 3 0.2886751 va TRUE 0.08333333 va
## 4 -0.3086067 wi TRUE 0.09523810 wi
## stdres.gt25mo stdres.Freq
## 1 FALSE -0.7319251
## 2 FALSE 0.7319251
## 3 TRUE 0.7319251
## 4 TRUE -0.7319251


# create a single column with a joint cell name


x.table$cellname <- paste(as.character(x.table$obs.orig)
, as.character(x.table$obs.gt25mo)
, sep="_")
# expected frequencies in a single column
x.table$exp <- c(x.table$exp.FALSE[1:2], x.table$exp.TRUE[3:4])
# create a simpler name for the obs, res, chisq, stdres
x.table$obs <- x.table$obs.Freq
x.table$res <- x.table$res.Freq
x.table$chisq <- x.table$chisq.Freq
x.table$stdres <- x.table$stdres.Freq

# include only the "cleaned" columns


x.table <- subset(x.table, select = c(cellname, obs, exp, res, chisq, stdres))
x.table
## cellname obs exp res chisq stdres
## 1 va_FALSE 2 2.666667 -0.4082483 0.16666667 -0.7319251
## 2 wi_FALSE 3 2.333333 0.4364358 0.19047619 0.7319251
## 3 va_TRUE 6 5.333333 0.2886751 0.08333333 0.7319251
## 4 wi_TRUE 4 4.666667 -0.3086067 0.09523810 -0.7319251
# reshape the data for plotting
library(reshape2)
x.table.obsexp <- melt(x.table,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars=c("cellname"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("obs", "exp"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "stat",
# value.name: column name for values in table
value.name = "value"
)
x.table.obsexp
## cellname stat value
## 1 va_FALSE obs 2.000000
## 2 wi_FALSE obs 3.000000
## 3 va_TRUE obs 6.000000
## 4 wi_TRUE obs 4.000000
## 5 va_FALSE exp 2.666667
## 6 wi_FALSE exp 2.333333
## 7 va_TRUE exp 5.333333
## 8 wi_TRUE exp 4.666667
Plot the observed vs expected frequencies, and the contributions to the chi-square statistic sorted in descending order.


# Observed vs Expected counts


library(ggplot2)
p <- ggplot(x.table.obsexp, aes(x = cellname, fill = stat, weight=value))
p <- p + geom_bar(position="dodge")
p <- p + labs(title = "Observed and Expected frequencies")
p <- p + xlab("Age category (years)")
print(p)

# Contribution to chi-sq
# pull out only the cellname and chisq columns
x.table.chisq <- x.table[, c("cellname","chisq")]
# reorder the cellname categories to be descending relative to the chisq statistic
x.table.chisq$cellname <- with(x.table, reorder(cellname, -chisq))

p <- ggplot(x.table.chisq, aes(x = cellname, weight = chisq))


p <- p + geom_bar()
p <- p + labs(title = "Contribution to Chi-sq statistic")
p <- p + xlab("Sorted cellname category (years)")
p <- p + ylab("Contribution")
print(p)
[Figure: "Observed and Expected frequencies" bar chart for each cell (dodged obs/exp bars), and "Contribution to Chi-sq statistic" bar chart with cells sorted in descending order of contribution.]

1.7 ADA1 Chapter 8: Correlation and regression


Rocket Propellant Data A rocket motor is manufactured by bonding an igniter
propellant and a sustainer propellant together inside a metal housing. The shear
strength of the bond between the two types of propellant is an important quality
characteristic. It is suspected that shear strength is related to the age in weeks of the


batch of sustainer propellant. Twenty observations on these two characteristics are


given below. The first column is shear strength in psi, the second is age of propellant
in weeks.
#### Example: Rocket, Chapter 8
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch01_rocket.dat"
# this file uses spaces as delimiters, so use read.table()
rocket <- read.table(fn.data, header = TRUE)
rocket$id <- 1:nrow(rocket) # add an id variable to identify observations
str(rocket)
## 'data.frame': 20 obs. of 3 variables:
## $ shearpsi: num 2159 1678 2316 2061 2208 ...
## $ agewks : num 15.5 23.8 8 17 5.5 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
head(rocket)
## shearpsi agewks id
## 1 2158.70 15.50 1
## 2 1678.15 23.75 2
## 3 2316.00 8.00 3
## 4 2061.30 17.00 4
## 5 2207.50 5.50 5
## 6 1708.30 19.00 6
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(rocket, aes(x = agewks, y = shearpsi, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)


[Figure: scatterplot of shearpsi versus agewks with observation id labels and the fitted least-squares line with confidence band.]

The data are reasonably linear, so fit the regression.


# fit the simple linear regression model
lm.shearpsi.agewks <- lm(shearpsi ~ agewks, data = rocket)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.shearpsi.agewks)
##
## Call:
## lm(formula = shearpsi ~ agewks, data = rocket)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.98 -50.68 28.74 66.61 106.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2627.822 44.184 59.48 < 2e-16 ***
## agewks -37.154 2.889 -12.86 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared: 0.9018,Adjusted R-squared: 0.8964
## F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10

Plot diagnostics.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))

# residuals vs agewks
plot(rocket$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")


# horizontal line at zero


abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 5 6 1
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: regression diagnostics for the full-data fit — Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs agewks, QQ Plot of residuals, and Residuals vs Order of data.]

The relationship between shear strength and age is fairly linear with predicted
shear strength decreasing as the age of the propellant increases. The fitted LS line is

Predicted shear strength = 2627.8 − 37.2 Age.

The test for H0 : β1 = 0 (zero slope for the population regression line) is highly significant: p-value < 0.0001. Also note that R² = 0.9018, so the linear relationship between shear strength and age explains about 90% of the variation in shear strength.
The data plot and residual information identify observations 5 and 6 as potential
outliers (r5 = −2.38, r6 = −2.32). The predicted values for these observations are
much greater than the observed shear strengths. These same observations appear as
potential outliers in the normal scores plot and the plot of ri against Ŷi . Observations
5 and 6 also have the largest influence on the analysis; see the Cook’s distance values.
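The studentized residuals and Cook's distances quoted above can be pulled directly from the fitted model; a small sketch (rstandard() and rstudent() give internally and externally studentized residuals, either of which should be close to the values quoted):
# Sketch: case diagnostics for observations 5 and 6 from the full-data fit.
round(rstandard(lm.shearpsi.agewks)[c(5, 6)], 2)       # studentized residuals
round(rstudent(lm.shearpsi.agewks)[c(5, 6)], 2)        # externally studentized
round(cooks.distance(lm.shearpsi.agewks)[c(5, 6)], 3)  # Cook's distances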


A sensible next step would be to repeat the analysis holding out the most influential case, observation 5. It should be somewhat clear that the influence of case
6 would increase dramatically once case 5 is omitted from the analysis. Since both
cases have essentially the same effect on the positioning of the LS line, I will assess
the impact of omitting both simultaneously.

Before we hold out these cases, how do you think the LS line will change? My
guess is these cases are pulling the LS line down, so the intercept of the LS line should
increase once these cases are omitted. Holding out either case 5 or 6 would probably
also affect the slope, but my guess is that when they are both omitted the slope will
change little. (Is this my experience speaking, or have I already seen the output?
Both.) What will happen to R² when we delete these points?

Exclude observations 5 and 6 and redo the analysis.


# exclude observations 5 and 6
rocket56 <- rocket[-c(5,6), ]
head(rocket56)
## shearpsi agewks id
## 1 2158.70 15.50 1
## 2 1678.15 23.75 2
## 3 2316.00 8.00 3
## 4 2061.30 17.00 4
## 7 1784.70 24.00 7
## 8 2575.00 2.50 8
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(rocket56, aes(x = agewks, y = shearpsi, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)


[Figure: scatterplot of shearpsi versus agewks for rocket56 (observations 5 and 6 excluded), with observation id labels and the fitted least-squares line with confidence band.]

The data are reasonably linear, so fit the regression.


# fit the simple linear regression model
lm.shearpsi.agewks <- lm(shearpsi ~ agewks, data = rocket56)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.shearpsi.agewks)
##
## Call:
## lm(formula = shearpsi ~ agewks, data = rocket56)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.07 -35.67 11.31 44.75 83.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2658.973 30.533 87.08 < 2e-16 ***
## agewks -37.694 1.979 -19.05 2.02e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.97 on 16 degrees of freedom
## Multiple R-squared: 0.9578,Adjusted R-squared: 0.9551
## F-statistic: 362.9 on 1 and 16 DF, p-value: 2.023e-12

Plot diagnostics.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))

# residuals vs agewks
plot(rocket56$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")


# horizontal line at zero


abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 12 20 2
## 10 18 2
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: regression diagnostics for the fit excluding observations 5 and 6 — Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs agewks, QQ Plot of residuals, and Residuals vs Order of data.]

Some summaries for the complete analysis, and when cases 5 and 6 are held out,
are given below. The summaries lead to the following conclusions:
1. Holding out cases 5 and 6 has little effect on the estimated LS line. Predictions
of shear strength are slightly larger after holding out these two cases (recall that
the intercept increased, but the slope was roughly the same!).
2. Holding out these two cases decreases σ̂ considerably and leads to a modest
increase in R². The complete data set will give wider confidence and prediction
intervals than the analysis that deletes cases 5 and 6, because σ̂ decreases when
these points are omitted.
3. Once these cases are held out, the normal scores plot and the plot of the studentized
residuals against fitted values show no significant problems. One observation


has a large Cook’s D but does not appear to be extremely influential.


Without any substantive reason to explain the low shear strengths for cases 5 and
6, I am hesitant to delete either case from the analysis. I feel relatively confident that
including these cases will not seriously limit the use of the model.
Feature                   Full data    Omit 5 and 6
b0                          2627.82         2658.97
b1                           −37.15          −37.69
R²                           0.9018          0.9578
σ̂                             96.10           62.96
p-val for H0: β1 = 0         0.0001          0.0001
Here is a comparison of the predicted or fitted values for selected observations in
the data set, based on the two fits.
Observation Actual Shear Strength Pred, full Pred, omit 5,6
1 2159 2052 2075
2 1678 1745 1764
4 2061 1996 2018
8 2575 2535 2565
10 2257 2219 2244
15 1765 1810 1830
18 2201 2163 2188
20 1754 1829 1849
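These fitted values can be reproduced with predict(); a sketch that refits the two models under separate names (lm.full, lm.omit56, and sel are names chosen only for this sketch):
# Sketch: reproduce the fitted-value comparison for the selected observations.
lm.full   <- lm(shearpsi ~ agewks, data = rocket)    # full data
lm.omit56 <- lm(shearpsi ~ agewks, data = rocket56)  # cases 5 and 6 omitted
sel <- c(1, 2, 4, 8, 10, 15, 18, 20)                 # rows shown in the table
data.frame(id        = rocket$id[sel],
           actual    = rocket$shearpsi[sel],
           pred.full = round(predict(lm.full,   newdata = rocket[sel, ])),
           pred.omit = round(predict(lm.omit56, newdata = rocket[sel, ])))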

Review complete
Now that we’re warmed up, let’s dive into new material!



Part VI

ADA2: Introduction to multiple regression and model selection



Chapter 2

Introduction to Multiple Linear Regression

In multiple linear regression, a linear combination of two or more predictor variables (xs) is used to explain the variation in a response. In essence, the additional predictors are used to explain the variation in the response not explained by a simple linear regression fit.

2.1 Indian systolic blood pressure example


Anthropologists conducted a study to determine the long-term effects of an environmental change on systolic blood pressure. They measured the blood pressure and several other characteristics of 39 Indians who migrated from a very primitive environment high in the Andes into the mainstream of Peruvian society at a lower altitude. All of the Indians were males at least 21 years of age, and were born at a high altitude. (This problem is from the Minitab handbook.)
#### Example: Indian
# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)
# examine the structure of the dataset, is it what you expected?
# a data.frame containing integers, numbers, and factors
str(indian)
## 'data.frame': 39 obs. of 11 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 21 22 24 24 25 27 28 28 31 32 ...
## $ yrmig: int 1 6 5 1 1 19 5 25 6 13 ...
## $ wt : num 71 56.5 56 61 65 62 53 53 65 57 ...



## $ ht : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
## $ chin : num 8 3.3 3.3 3.7 9 3 7.3 3.7 10.3 5.7 ...
## $ fore : num 7 5 1.3 3 12.7 3.3 4.7 4.3 9 4 ...
## $ calf : num 12.7 8 4.3 4.3 20.7 5.7 8 0 10 6 ...
## $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
## $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
## $ diabp: int 76 60 75 120 78 72 76 62 70 64 ...
# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp

## print dataset to screen


#indian


id age yrmig wt ht chin fore calf pulse sysbp diabp


1 1 21 1 71.00 1629 8.00 7.00 12.70 88 170 76
2 2 22 6 56.50 1569 3.30 5.00 8.00 64 120 60
3 3 24 5 56.00 1561 3.30 1.30 4.30 68 125 75
4 4 24 1 61.00 1619 3.70 3.00 4.30 52 148 120
5 5 25 1 65.00 1566 9.00 12.70 20.70 72 140 78
6 6 27 19 62.00 1639 3.00 3.30 5.70 72 106 72
7 7 28 5 53.00 1494 7.30 4.70 8.00 64 120 76
8 8 28 25 53.00 1568 3.70 4.30 0.00 80 108 62
9 9 31 6 65.00 1540 10.30 9.00 10.00 76 124 70
10 10 32 13 57.00 1530 5.70 4.00 6.00 60 134 64
11 11 33 13 66.50 1622 6.00 5.70 8.30 68 116 76
12 12 33 10 59.10 1486 6.70 5.30 10.30 73 114 74
13 13 34 15 64.00 1578 3.30 5.30 7.00 88 130 80
14 14 35 18 69.50 1645 9.30 5.00 7.00 60 118 68
15 15 35 2 64.00 1648 3.00 3.70 6.70 60 138 78
16 16 36 12 56.50 1521 3.30 5.00 11.70 72 134 86
17 17 36 15 57.00 1547 3.00 3.00 6.00 84 120 70
18 18 37 16 55.00 1505 4.30 5.00 7.00 64 120 76
19 19 37 17 57.00 1473 6.00 5.30 11.70 72 114 80
20 20 38 10 58.00 1538 8.70 6.00 13.00 64 124 64
21 21 38 18 59.50 1513 5.30 4.00 7.70 80 114 66
22 22 38 11 61.00 1653 4.00 3.30 4.00 76 136 78
23 23 38 11 57.00 1566 3.00 3.00 3.00 60 126 72
24 24 39 21 57.50 1580 4.00 3.00 5.00 64 124 62
25 25 39 24 74.00 1647 7.30 6.30 15.70 64 128 84
26 26 39 14 72.00 1620 6.30 7.70 13.30 68 134 92
27 27 41 25 62.50 1637 6.00 5.30 8.00 76 112 80
28 28 41 32 68.00 1528 10.00 5.00 11.30 60 128 82
29 29 41 5 63.40 1647 5.30 4.30 13.70 76 134 92
30 30 42 12 68.00 1605 11.00 7.00 10.70 88 128 90
31 31 43 25 69.00 1625 5.00 3.00 6.00 72 140 72
32 32 43 26 73.00 1615 12.00 4.00 5.70 68 138 74
33 33 43 10 64.00 1640 5.70 3.00 7.00 60 118 66
34 34 44 19 65.00 1610 8.00 6.70 7.70 74 110 70
35 35 44 18 71.00 1572 3.00 4.70 4.30 72 142 84
36 36 45 10 60.20 1534 3.00 3.00 3.30 56 134 70
37 37 47 1 55.00 1536 3.00 3.00 4.00 64 116 54
38 38 50 43 70.00 1630 4.00 6.00 11.70 72 132 90
39 39 54 40 87.00 1542 11.30 11.70 11.30 92 152 88
A question we consider concerns the long-term effects of an environmental change on systolic blood pressure. In particular, is there a relationship between systolic blood pressure and how long the Indians lived in their new environment, as measured by the fraction of their life spent in the new environment?
# Create the "fraction of their life" variable
# yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age

# continuous color for wt


# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(indian, aes(x = yrage, y = sysbp, label = id))
p <- p + geom_point(aes(colour=wt), size=2)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
# plot regression line and confidence band


p <- p + geom_smooth(method = lm)


p <- p + labs(title="Indian sysbp by yrage with continuous wt")
print(p)

# categorical color for wt


indian$wtcat <- rep(NA, nrow(indian))
indian$wtcat <- "M" # init as medium
indian$wtcat[(indian$wt < 60)] <- "L" # update low
indian$wtcat[(indian$wt >= 70)] <- "H" # update high
# define as a factor variable with a specific order
indian$wtcat <- ordered(indian$wtcat, levels=c("L", "M", "H"))
#
library(ggplot2)
p <- ggplot(indian, aes(x = yrage, y = sysbp, label = id))
p <- p + geom_point(aes(colour=wtcat, shape=wtcat), size=2)
library(R.oo) # for ascii code lookup
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.
##
## Attaching package: ’R.oo’
## The following object is masked from ’package:igraph’:
##
## hierarchy
## The following objects are masked from ’package:gdata’:
##
## ll, trim
## The following objects are masked from ’package:methods’:
##
## getClasses, getMethods
## The following objects are masked from ’package:base’:
##
## attach, detach, gc, load, save
p <- p + scale_shape_manual(values=charToInt(sort(unique(indian$wtcat))))
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + labs(title="Indian sysbp by yrage with categorical wt")
print(p)


[Figure: "Indian sysbp by yrage with continuous wt" and "Indian sysbp by yrage with categorical wt" — scatterplots of sysbp versus yrage with id labels and the least-squares line with confidence band; points coloured by weight (continuous) or by weight category (L/M/H).]

Fit the simple linear regression model reporting the ANOVA table (“Terms”) and
parameter estimate table (“Coefficients”).
# fit the simple linear regression model
lm.sysbp.yrage <- lm(sysbp ~ yrage, data = indian)
# use Anova() from library(car) to get ANOVA table (Type 3 SS, df)
library(car)
Anova(lm.sysbp.yrage, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 178221 1 1092.9484 < 2e-16 ***
## yrage 498 1 3.0544 0.08881 .
## Residuals 6033 37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.sysbp.yrage)
##
## Call:
## lm(formula = sysbp ~ yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.161 -10.987 -1.014 6.851 37.254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 133.496 4.038 33.060 <2e-16 ***
## yrage -15.752 9.013 -1.748 0.0888 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.77 on 37 degrees of freedom


## Multiple R-squared: 0.07626,Adjusted R-squared: 0.05129


## F-statistic: 3.054 on 1 and 37 DF, p-value: 0.08881
The plot above of systolic blood pressure against yrage fraction suggests a weak
linear relationship. Nonetheless, consider fitting the regression model

sysbp = β0 + β1 yrage + ε.

The least squares line (already in the plot) is given by

sysbp̂ = 133.5 − 15.75 yrage,

and suggests that average systolic blood pressure decreases as the fraction of life spent
in modern society increases. However, the t-test of H0 : β1 = 0 is not significant at
the 5% level (p-value=0.08881). That is, the weak linear relationship observed in
the data is not atypical of a population where there is no linear relationship between
systolic blood pressure and the fraction of life spent in a modern society.
Even if this test were significant, the small value of R2 = 0.07626 suggests that
yrage fraction does not explain a substantial amount of the variation in the systolic
blood pressures. If we omitted the individual with the highest blood pressure, the
relationship would be even weaker.
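
Two quick checks along these lines (a sketch of mine, not part of the original output; it assumes the indian data frame and the lm.sysbp.yrage fit from above):
# 95% CI for the yrage slope -- it straddles 0, consistent with the t-test above
confint(lm.sysbp.yrage, level = 0.95)
# refit after dropping the individual with the highest systolic blood pressure
ind.max <- which.max(indian$sysbp)
summary(lm(sysbp ~ yrage, data = indian[-ind.max, ]))$coefficients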

2.1.1 Taking Weight Into Consideration


At best, there is a weak relationship between systolic blood pressure and the yrage
fraction. However, it is usually accepted that systolic blood pressure and weight are
related. A natural way to take weight into consideration is to include wt (weight)
and yrage fraction as predictors of systolic blood pressure in the multiple regression
model:
sysbp = β0 + β1 yrage + β2 wt + ε.
As in simple linear regression, the model is written in the form:

Response = Mean of Response + Residual,

so the model implies that the average systolic blood pressure is a linear combination
of yrage fraction and weight. As in simple linear regression, the standard multiple re-
gression analysis assumes that the responses are normally distributed with a constant
variance σ 2 . The parameters of the regression model β0 , β1 , β2 , and σ 2 are estimated
by least squares (LS).
Here is the multiple regression model with yrage and wt (weight) as predictors.
Add wt to the right hand side of the previous formula statement.


# fit the multiple linear regression model, (" + wt" added)


lm.sysbp.yrage.wt <- lm(sysbp ~ yrage + wt, data = indian)
library(car)
Anova(lm.sysbp.yrage.wt, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 1738.2 1 18.183 0.0001385 ***
## yrage 1314.7 1 13.753 0.0006991 ***
## wt 2592.0 1 27.115 7.966e-06 ***
## Residuals 3441.4 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.sysbp.yrage.wt)
##
## Call:
## lm(formula = sysbp ~ yrage + wt, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

2.1.2 Important Points to Notice About the Regression Output
1. The LS estimates of the intercept and the regression coefficient for yrage frac-
tion, and their standard errors, change from the simple linear model to the
multiple regression model. For the simple linear regression:

sysbp̂ = 133.50 − 15.75 yrage.

For the multiple regression model:

sysbp̂ = 60.89 − 26.76 yrage + 1.21 wt.

2. Looking at the ANOVA tables for the simple linear and the multiple regres-
sion models we see that the Regression (model) df has increased from 1 to 2
(2=number of predictor variables) and the Residual (error) df has decreased
from 37 to 36 (=n − 1− number of predictors). Adding a predictor increases
the Regression df by 1 and decreases the Residual df by 1.
3. The Residual SS decreases by 6033.37 − 3441.36 = 2592.01 upon adding the
weight term. The Regression SS increases by 2592.01 upon adding the weight
term to the model. The Total SS does not depend on the number of
predictors so it stays the same. The Residual SS, or the part of the variation
in the response unexplained by the regression model, never increases when new
predictors are added. (You can’t add a predictor and explain less variation.)
4. The proportion of variation in the response explained by the regression model:

R2 = Regression SS/Total SS

never decreases when new predictors are added to a model. The R2 for the
simple linear regression was 0.076, whereas

R2 = 3090.08/6531.44 = 0.473

for the multiple regression model. Adding the weight variable to the model
increases R2 by about 0.40 (from 0.076 to 0.473). That is, weight explains roughly
40% of the variation in systolic blood pressure not already explained by yrage fraction.
5. The estimated variability about the regression line

Residual MS = σ̂ 2

decreased dramatically after adding the weight effect. For the simple linear
regression model σ̂ 2 = 163.06, whereas σ̂ 2 = 95.59 for the multiple regression
model. This suggests that an important predictor has been added to the model.
6. The F -statistic for the multiple regression model

Fobs = Regression MS/Residual MS = 1545.04/95.59 = 16.163

(which is compared to an F -table with 2 and 36 df ) tests H0 : β1 = β2 = 0


against HA : not H0 . This is a test of no relationship between the average
systolic blood pressure and fraction and weight, assuming the relationship is
linear. If this test is significant, then either fraction or weight, or both, are
important for explaining the variation in systolic blood pressure.


7. Given the model


sysbp = β0 + β1 yrage + β2 wt + ε,
we are interested in testing H0 : β2 = 0 against HA : β2 ≠ 0. The t-statistic for
this test,

tobs = (b2 − 0)/SE(b2) = 1.217/0.234 = 5.207,

is compared to a t-critical value with Residual df = 36. The test gives a p-value
of < 0.0001, which suggests β2 6= 0. The t-test of H0 : β2 = 0 in the multiple
regression model tests whether adding weight to the simple linear regression
model explains a significant part of the variation in systolic blood pressure
not explained by yrage fraction. In some sense, the t-test of H0 : β2 = 0
will be significant if the increase in R2 (or decrease in Residual SS) obtained by
adding weight to this simple linear regression model is substantial. We saw a big
increase in R2 , which is deemed significant by the t-test. A similar interpretation
is given to the t-test for H0 : β1 = 0.
8. The t-tests for β0 = 0 and β1 = 0 are conducted, assessed, and interpreted in
the same manner. The p-value for testing H0 : β0 = 0 is 0.0001, whereas the
p-value for testing H0 : β1 = 0 is 0.0007. This implies that fraction is important
in explaining the variation in systolic blood pressure after weight is taken into
consideration (by including weight in the model as a predictor).
9. We compute CIs for the regression parameters in the usual way: bi ± tcrit SE(bi ),
where tcrit is the t-critical value for the corresponding CI level with df =
Residual df. (A short code sketch illustrating points 3, 4, and 9 follows this list.)
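
The following short sketch (mine, not from the original notes) verifies points 3, 4, and 9 directly in R; it assumes the fits lm.sysbp.yrage and lm.sysbp.yrage.wt from above.
# F-test for adding wt to the simple model; shows the 2592.01 drop in Residual SS (point 3)
anova(lm.sysbp.yrage, lm.sysbp.yrage.wt)
# R^2 for the simple and multiple regression models (point 4)
c(simple   = summary(lm.sysbp.yrage)$r.squared
  , multiple = summary(lm.sysbp.yrage.wt)$r.squared)
# 95% CIs for the regression parameters, b_i +/- t_crit * SE(b_i) (point 9)
confint(lm.sysbp.yrage.wt, level = 0.95)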

2.1.3 Understanding the Model


The t-test for H0 : β1 = 0 is highly significant (p-value=0.0007), which implies
that fraction is important in explaining the variation in systolic blood pressure af-
ter weight is taken into consideration (by including weight in the model as a
predictor). Weight is called a suppressor variable. Ignoring weight suppresses the
relationship between systolic blood pressure and yrage fraction.
The implications of this analysis are enormous! Essentially, the correlation
between a predictor and a response says very little about the importance of the pre-
dictor in a regression model with one or more additional predictors. This conclusion
also holds in situations where the correlation is high, in the sense that a predictor
that is highly correlated with the response may be unimportant in a multiple regres-
sion model once other predictors are included in the model. In multiple regression
“everything depends on everything else.”
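
A small numerical illustration of this point (a sketch, not part of the original notes; it uses the objects already defined above):
# marginal view: weak correlation and a non-significant simple-regression slope
cor(indian$sysbp, indian$yrage)
coef(summary(lm.sysbp.yrage))["yrage", ]
# partial view: a strongly significant yrage slope once wt is in the model
coef(summary(lm.sysbp.yrage.wt))["yrage", ]
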
I will try to convince you that this was expected, given the plot of systolic blood
pressure against fraction. This plot used a weight category variable wtcat L, M, or H

as a plotting symbol. The relationship between systolic blood pressure and fraction
is fairly linear within each weight category, and stronger than when we ignore weight.
The slopes in the three groups are negative and roughly constant.
To see why yrage fraction is an important predictor after taking weight into con-
sideration, let us return to the multiple regression model. The model implies that the
average systolic blood pressure is a linear combination of yrage fraction and weight:
sysbp̂ = β0 + β1 yrage + β2 wt.
For each fixed weight, the average systolic blood pressure is linearly related to yrage
fraction with a constant slope β1 , independent of weight. A similar interpretation
holds if we switch the roles of yrage fraction and weight. That is, if we fix the value
of fraction, then the average systolic blood pressure is linearly related to weight with
a constant slope β2 , independent of yrage fraction.
To see this point, suppose that the LS estimates of the regression parameters are
the true values
sysbp̂ = 60.89 − 26.76 yrage + 1.21 wt.

If we restrict our attention to 50kg Indians, the average systolic blood pressure as a
function of fraction is

sysbp̂ = 60.89 − 26.76 yrage + 1.21(50) = 121.39 − 26.76 yrage.

For 60kg Indians,

sysbp̂ = 60.89 − 26.76 yrage + 1.21(60) = 133.49 − 26.76 yrage.


Hopefully the pattern is clear: the average systolic blood pressure decreases by
26.76 for each increase of 1 on fraction, regardless of one’s weight. If we vary weight
over its range of values, we get a set of parallel lines (i.e., equal slopes) when we
plot average systolic blood pressure as a function of yrage fraction. The intercept
increases by 1.21 for each increase of 1kg in weight.
Similarly, if we plot the average systolic blood pressure as a function of weight,
for several fixed values of fraction, we see a set of parallel lines with slope 1.21, and
intercepts decreasing by 26.76 for each increase of 1 in fraction.
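
To see the parallel-lines structure numerically, here is a small sketch (not part of the original code) that evaluates the fitted surface on a grid, using the lm.sysbp.yrage.wt fit from above; the weights 50, 60, and 70 are illustrative values.
new.dat <- expand.grid(yrage = c(0, 0.25, 0.50, 0.75), wt = c(50, 60, 70))
new.dat$pred <- predict(lm.sysbp.yrage.wt, newdata = new.dat)
new.dat  # within each wt, the predictions fall on a line with slope -26.76 in yrage
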
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(indian, aes(x = wt, y = sysbp, label = id))
p <- p + geom_point(aes(colour=yrage), size=2)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + labs(title="Indian sysbp by wt with continuous yrage")
print(p)
[Figure: scatterplot of sysbp versus wt with the LS line and confidence band, points colored by the continuous yrage fraction and labeled by id.]

If we had more data we could check the model by plotting systolic blood pressure
against fraction, broken down by individual weights. The plot should show a fairly
linear relationship between systolic blood pressure and fraction, with a constant slope
across weights. I grouped the weights into categories because of the limited number
of observations. The same phenomenon should approximately hold, and it does. If
the slopes for the different weight groups changed drastically with weight, but the
relationships were linear, we would need to include an interaction or product variable

wt × yrage in the model, in addition to weight and yrage fraction. This is probably
not warranted here.
A final issue that I wish to address concerns the interpretation of the estimates of
the regression coefficients in a multiple regression model. For the fitted model

sysbp̂ = 60.89 − 26.76 yrage + 1.21 wt

our interpretation is consistent with the explanation of the regression model given
above. For example, focus on the yrage fraction coefficient. The negative coeffi-
cient indicates that the predicted systolic blood pressure decreases as yrage fraction
increases holding weight constant. In particular, the predicted systolic blood pres-
sure decreases by 26.76 for each unit increase in fraction, holding weight constant at
any value. Similarly, the predicted systolic blood pressure increases by 1.21 for each
unit increase in weight, holding yrage fraction constant at any level.
This example was meant to illustrate multiple regression. A more complete anal-
ysis of the data, including diagnostics, will be given later.

2.2 GCE exam score example

The data below are selected from a larger collection of data referring to candidates for
the General Certificate of Education (GCE) who were being considered for a special
award. Here, Y denotes the candidate’s total mark, out of 1000, in the GCE exam,
while X1 is the candidate’s score in the compulsory part of the exam, which has a
maximum score of 200 of the 1000 points on the exam. X2 denotes the candidate’s
score, out of 100, in a School Certificate English Language paper taken on a previous
occasion.
#### Example: GCE
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_gce.dat"
gce <- read.table(fn.data, header=TRUE)
str(gce)
## 'data.frame': 15 obs. of 3 variables:
## $ y : int 476 457 540 551 575 698 545 574 645 690 ...
## $ x1: int 111 92 90 107 98 150 118 110 117 114 ...
## $ x2: int 68 46 50 59 50 66 54 51 59 80 ...
## print dataset to screen
#gce


y x1 x2
1 476 111 68
2 457 92 46
3 540 90 50
4 551 107 59
5 575 98 50
6 698 150 66
7 545 118 54
8 574 110 51
9 645 117 59
10 690 114 80
11 634 130 57
12 637 118 51
13 390 91 44
14 562 118 61
15 560 109 66

A goal here is to compute a multiple regression of Y on X1 and X2 , and make


the necessary tests to enable you to comment intelligently on the extent to which
current performance in the compulsory test (X1 ) may be used to predict aggregate
performance on the GCE exam (Y ), and on whether previous performance in the
School Certificate English Language (X2 ) has any predictive value independently of
what has already emerged from the current performance in the compulsory papers.

I will lead you through a number of steps to help you answer this question. Let
us answer the following straightforward questions.

1. Plot Y against X1 and X2 individually, and comment on the form (i.e., linear,
non-linear, logarithmic, etc.), strength, and direction of the relationships.
2. Plot X1 against X2 and comment on the form, strength, and direction of the
relationship.
3. Compute the correlation between all pairs of variables. Do the correlations
appear sensible, given the plots?
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
#p <- ggpairs(gce, progress=FALSE)
# put scatterplots on top so y axis is vertical
p <- ggpairs(gce, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)

# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of y, x1, and x2; pairwise correlations are cor(x1, y) = 0.731, cor(x2, y) = 0.548, and cor(x2, x1) = 0.509.]

# correlation matrix and associated p-values testing "H0: rho == 0"


library(Hmisc)
## Loading required package: Formula
##
## Attaching package: ’Hmisc’
## The following objects are masked from ’package:xtable’:
##
## label, label<-
## The following objects are masked from ’package:plyr’:
##
## is.discrete, summarize
## The following objects are masked from ’package:TeachingDemos’:
##
## cnvrt.coords, subplot
## The following objects are masked from ’package:base’:
##
## format.pval, units
rcorr(as.matrix(gce))
## y x1 x2
## y 1.00 0.73 0.55

## x1 0.73 1.00 0.51


## x2 0.55 0.51 1.00
##
## n= 15
##
##
## P
## y x1 x2
## y 0.0020 0.0346
## x1 0.0020 0.0527
## x2 0.0346 0.0527
In parts 4 through 9, ignore the possibility that Y , X1 or X2 might ideally need
to be transformed.
4. Which of X1 and X2 explains a larger proportion of the variation in Y ? Which
would appear to be a better predictor of Y ? (Explain).
Model Y = β0 + β1 X1 + ε:
# y ~ x1
lm.y.x1 <- lm(y ~ x1, data = gce)
library(car)
Anova(lm.y.x1, type=3)
## Anova Table (Type III tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## (Intercept) 4515 1 1.246 0.284523
## x1 53970 1 14.895 0.001972 **
## Residuals 47103 13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y.x1)
##
## Call:
## lm(formula = y ~ x1, data = gce)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.858 -33.637 -0.034 48.507 111.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.548 115.160 1.116 0.28452
## x1 3.948 1.023 3.859 0.00197 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.19 on 13 degrees of freedom

## Multiple R-squared: 0.534,Adjusted R-squared: 0.4981


## F-statistic: 14.9 on 1 and 13 DF, p-value: 0.001972
Plot diagnostics.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x1, which = c(1,4,6))

plot(gce$x1, lm.y.x1$residuals, main="Residuals vs x1")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.y.x1$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 10 13 1
# residuals vs order of data
plot(lm.y.x1$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x1: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs x1, QQ Plot, and Residuals vs Order of data.]

Model Y = β0 + β1 X2 + ε:
# y ~ x2
lm.y.x2 <- lm(y ~ x2, data = gce)
library(car)
Anova(lm.y.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y

## Sum Sq Df F value Pr(>F)


## (Intercept) 32656 1 6.0001 0.02924 *
## x2 30321 1 5.5711 0.03455 *
## Residuals 70752 13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y.x2)
##
## Call:
## lm(formula = y ~ x2, data = gce)
##
## Residuals:
## Min 1Q Median 3Q Max
## -143.770 -37.725 7.103 54.711 99.276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.586 119.038 2.45 0.0292 *
## x2 4.826 2.045 2.36 0.0346 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.77 on 13 degrees of freedom
## Multiple R-squared: 0.3,Adjusted R-squared: 0.2461
## F-statistic: 5.571 on 1 and 13 DF, p-value: 0.03455

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x2, which = c(1,4,6))

plot(gce$x2, lm.y.x2$residuals, main="Residuals vs x2")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.y.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")

## [1] 1 13 12

# residuals vs order of data


plot(lm.y.x2$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x2: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs x2, QQ Plot, and Residuals vs Order of data.]

Answer: R2 is 0.53 for the model with X1 and 0.30 with X2 . Equivalently, the
Model SS is larger for X1 (53970) than for X2 (30321). Thus, X1 appears to be
a better predictor of Y than X2 .
5. Consider 2 simple linear regression models for predicting Y , one with X1 as a
predictor, and the other with X2 as the predictor. Do X1 and X2 individually
appear to be important for explaining the variation in Y ? (i.e., test that the
slopes of the regression lines are zero). Which parts of the output, if any, support
or contradict your answer to the previous question?
Answer: The model with X1 has a t-statistic of 3.86 with an associated p-
value of 0.0020, while X2 has a t-statistic of 2.36 with an associated p-value of
0.0346. Both predictors explain a significant amount of variability in Y . This
is consistent with part (4).
6. Fit the multiple regression model

Y = β0 + β1 X1 + β2 X2 + ε.

Test H0 : β1 = β2 = 0 at the 5% level. Describe in words what this test is


doing, and what the results mean here.
Model Y = β0 + β1 X1 + β2 X2 + ε:
# y ~ x1 + x2
lm.y.x1.x2 <- lm(y ~ x1 + x2, data = gce)
library(car)
Anova(lm.y.x1.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y

## Sum Sq Df F value Pr(>F)


## (Intercept) 1571 1 0.4396 0.51983
## x1 27867 1 7.7976 0.01627 *
## x2 4218 1 1.1802 0.29866
## Residuals 42885 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y.x1.x2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = gce)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113.201 -29.605 -6.198 56.247 66.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.161 122.406 0.663 0.5198
## x1 3.296 1.180 2.792 0.0163 *
## x2 2.091 1.925 1.086 0.2987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.78 on 12 degrees of freedom
## Multiple R-squared: 0.5757,Adjusted R-squared: 0.505
## F-statistic: 8.141 on 2 and 12 DF, p-value: 0.005835

Diagnostic plots suggest the residuals are roughly normal with no substantial
outliers, though the Cook’s distance is substantially larger for observation 10.
We may wish to fit the model without observation 10 to see whether conclusions
change.
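
A quick numeric check of that observation (a sketch, not in the original notes), using the lm.y.x1.x2 fit just computed:
round(cooks.distance(lm.y.x1.x2), 2)    # Cook's distance for each observation
which.max(cooks.distance(lm.y.x1.x2))   # observation 10 has the largest value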

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x1.x2, which = c(1,4,6))

plot(gce$x1, lm.y.x1.x2$residuals, main="Residuals vs x1")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(gce$x2, lm.y.x1.x2$residuals, main="Residuals vs x2")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.y.x1.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")

## [1] 1 13 5

## residuals vs order of data


#plot(lm.y.x1.x2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x1.x2: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs x1, Residuals vs x2, and QQ Plot.]

Answer: The ANOVA table reports an F-statistic of 8.14 with an associated p-value
of 0.0058, indicating that the regression model with both X1 and X2 explains
significantly more variability in Y than a model with the intercept alone. That is,
X1 and X2 explain variability in Y together. This does not tell us which of X1 or
X2, if either, is individually important (recall the results of the Indian systolic
blood pressure example).
7. In the multiple regression model, test H0 : β1 = 0 and H0 : β2 = 0 individually.
Describe in words what these tests are doing, and what the results mean here.
Answer: Each hypothesis is testing, conditional on all other predictors being in
the model, whether the addition of the predictor being tested explains signifi-
cantly more variability in Y than without it.
For H0 : β1 = 0, the t-statistic is 2.79 with an associated p-value of 0.0163.
Thus, we reject H0 in favor of the alternative that the slope is statistically
significantly different from 0 conditional on X2 being in the model. That is,
X1 explains significantly more variability in Y given that X2 is already in the
model.
For H0 : β2 = 0, the t-statistic is 1.09 with an associated p-value of 0.2987.
Thus, we fail to reject H0 concluding that there is insufficient evidence that the
slope is different from 0 conditional on X1 being in the model. That is, X2 does
not explain significantly more variability in Y given that X1 is already in the
model.
8. How does the R2 from the multiple regression model compare to the R2 from
the individual simple linear regressions? Is what you are seeing here appear
reasonable, given the tests on the individual coefficients?
Answer: The R2 for the model with only X1 is 0.5340, only X2 is 0.3000, and

both X1 and X2 is 0.5757. There is only a very small increase in R2 from the
model with only X1 when X2 is added, which is consistent with X2 not being
important given that X1 is already in the model.
9. Do your best to answer the question posed above, in the paragraph after the
data “A goal . . . ”. Provide an equation (LS) for predicting Y .
Answer: Yes, we’ve seen that X1 may be used to predict Y , and that X2 does not
explain significantly more variability in the model with X1 . Thus, the preferred
model has only X1 :

ŷ = 128.55 + 3.95X1 .
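
If predictions are wanted from this preferred model, predict() supplies fitted values and intervals; a minimal sketch, assuming the lm.y.x1 fit from part 4 and a few illustrative compulsory-paper scores:
predict(lm.y.x1, newdata = data.frame(x1 = c(100, 120, 140)),
        interval = "prediction")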

2.2.1 Some Comments on GCE Analysis

I will give you my thoughts on these data, and how I would attack this problem,
keeping the ultimate goal in mind. I will examine whether transformations of the data
are appropriate, and whether any important conclusions are dramatically influenced
by individual observations. I will use some new tools to attack this problem, and will
outline how they are used.
The plot of GCE (Y ) against COMP (X1 ) is fairly linear, but the trend in the
plot of GCE (Y ) against SCEL (X2 ) is less clear. You might see a non-linear trend
here, but the relationship is not very strong. When I assess plots I try to not allow
a few observations affect my perception of trend, and with this in mind, I do not see
any strong evidence at this point to transform any of the variables.
One difficulty that we must face when building a multiple regression model is that
these two-dimensional (2D) plots of a response against individual predictors may have
little information about the appropriate scales for a multiple regression analysis. In
particular, the 2D plots only tell us whether we need to transform the data in a simple
linear regression analysis. If a 2D plot shows a strong non-linear trend, I would do
an analysis using the suggested transformations, including any other effects that are
important. However, it might be that no variables need to be transformed in the
multiple regression model.
The partial regression residual plot, or added variable plot, is a graphical tool
that provides information about the need for transformations in a multiple regression
model. The following code generates diagnostics and the partial residual
plots for each predictor in the multiple regression model that has COMP and SCEL
as predictors of GCE.
library(car)
avPlots(lm.y.x1.x2, id = list(n = 3))

[Figure: added-variable plots from avPlots() for x1 | others and x2 | others in the model y ~ x1 + x2.]

The partial regression residual plot compares the residuals from two model fits.
First, we “adjust” Y for all the other predictors in the model except the selected
one. Then, we “adjust” the selected variable Xsel for all the other predictors in the
model. Lastly, plot the residuals from these two models against each other to see what
relationship still exists between Y and Xsel after accounting for their relationships with
the other predictors.
# function to create partial regression plot
partial.regression.plot <- function (y, x, sel, ...) {
m <- as.matrix(x[, -sel])
# residuals of y regressed on all x's except "sel"
y1 <- lm(y ~ m)$res
# residuals of x regressed on all other x's
x1 <- lm(x[, sel] ~ m)$res
# plot residuals of y vs residuals of x
plot( y1 ~ x1, main="Partial regression plot", ylab="y | others", ...)
# add grid
grid(lty = "solid")
# add red regression line
abline(lm(y1 ~ x1), col = "red", lwd = 2)
}

par(mfrow=c(1, 2))
partial.regression.plot(gce$y, cbind(gce$x1, gce$x2), 1, xlab="x1 | others")
partial.regression.plot(gce$y, cbind(gce$x1, gce$x2), 2, xlab="x2 | others")

[Figure: partial regression plots of y | others versus x1 | others and versus x2 | others, produced by the function above.]

The first partial regression residual plot for COMP, given below, “adjusts” GCE
(Y ) and COMP (X1 ) for their common dependence on all the other predictors in
the model (only SCEL (X2 ) here). This plot tells us whether we need to transform
COMP in the multiple regression model, and whether any observations are influencing
the significance of COMP in the fitted model. A roughly linear trend suggests that
no transformation of COMP is warranted. The positive relationship seen here is
consistent with the coefficient of COMP being positive in the multiple regression
model. The partial residual plot for COMP shows little evidence of curvilinearity,
and much less so than the original 2D plot of GCE against COMP. This indicates
that there is no strong evidence for transforming COMP in a multiple regression
model that includes SCEL.
Although SCEL appears to be somewhat useful as a predictor of GCE on its own, the
multiple regression output indicates that SCEL does not explain a significant amount
of the variation in GCE, once the effect of COMP has been taken into account. Put
another way, previous performance in the School Certificate English Language (X2 )
has little predictive value independently of what has already emerged from the current
performance in the compulsory papers (X1 or COMP). This conclusion is consistent
with the fairly weak linear relationship between GCE against SCEL seen in the second
partial residual plot.
Do diagnostics suggest any deficiencies associated with this conclusion? The par-
tial residual plot of SCEL highlights observation 10, which has the largest value of
Cook’s distance in the multiple regression model. If we visually hold observation 10
out from this partial residual plot, it would appear that the relationship observed in
this plot would weaken. This suggests that observation 10 is actually enhancing the
significance of SCEL in the multiple regression model. That is, the p-value for testing

the importance of SCEL in the multiple regression model would be inflated by holding
out observation 10. The following output confirms this conjecture. The studentized
residuals, Cook’s distances and partial residual plots show no serious deficiencies.
Model Y = β0 + β1 X1 + β2 X2 + ε, excluding observation 10:
gce10 <- gce[-10,]
# y ~ x1 + x2
lm.y10.x1.x2 <- lm(y ~ x1 + x2, data = gce10)
library(car)
Anova(lm.y10.x1.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## (Intercept) 5280 1 1.7572 0.211849
## x1 37421 1 12.4540 0.004723 **
## x2 747 1 0.2486 0.627870
## Residuals 33052 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y10.x1.x2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = gce10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.117 -30.319 4.661 37.416 64.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.461 120.295 1.326 0.21185
## x1 4.241 1.202 3.529 0.00472 **
## x2 -1.280 2.566 -0.499 0.62787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.82 on 11 degrees of freedom
## Multiple R-squared: 0.6128,Adjusted R-squared: 0.5424
## F-statistic: 8.706 on 2 and 11 DF, p-value: 0.005413
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y10.x1.x2, which = c(1,4,6))

plot(gce10$x1, lm.y10.x1.x2$residuals, main="Residuals vs x1")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(gce10$x2, lm.y10.x1.x2$residuals, main="Residuals vs x2")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals

Prof. Erik B. Erhardt


2.2: GCE exam score example 471

library(car)
qqPlot(lm.y10.x1.x2$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 13 1 9
## 12 1 9
## residuals vs order of data
#plot(lm.y10.x1.x2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y10.x1.x2 (observation 10 removed): Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs x1, Residuals vs x2, and QQ Plot.]

library(car)
avPlots(lm.y10.x1.x2, id = list(n = 3))

[Figure: added-variable plots for x1 | others and x2 | others in the model fit without observation 10.]

What are my conclusions? It would appear that SCEL (X2 ) is not a useful pre-
dictor in the multiple regression model. For simplicity, I would likely use a simple
linear regression model to predict GCE (Y ) from COMP (X1 ) only. The diagnostic
analysis of the model showed no serious deficiencies.



Chapter 3

A Taste of Model Selection for Multiple Regression

3.1 Model
Given data on a response variable Y and k predictor variables X1 , X2 , . . . , Xk , we
wish to develop a regression model to predict Y . Assuming that the collection of
variables is measured on the correct scale, and that the candidate list of predictors
includes all the important predictors, the most general model is

Y = β0 + β1 X1 + · · · + βk Xk + ε.

In most problems one or more of the predictors can be eliminated from this general
or full model without (much) loss of information. We want to identify the impor-
tant predictors, or equivalently, eliminate the predictors that are not very useful for
explaining the variation in Y (conditional on the other predictors in the model).
We will study several automated methods for model selection, which, given a
specific criterion for selecting a model, gives the best predictors. Before applying
any of the methods, you should plot Y against each predictor X1 , X2 , . . . , Xk to see
whether transformations are needed. If a transformation of Xi is suggested, include
the transformation along with the original Xi in the candidate list. Note that you
can transform the predictors differently, for example, log(X1 ) and √X2 . However,
if several transformations are suggested for the response, then you should consider
doing one analysis for each suggested response scale before deciding on the final scale.
At this point, I will only consider the backward elimination method. Other
approaches will be addressed later this semester.


3.2 Backward Elimination


The backward elimination procedure deletes unimportant variables, one at a time,
starting from the full model. The steps in the procedure are:
1. Fit the full model

Y = β0 + β1 X1 + · · · + βk Xk + ε.

2. Find the variable which when omitted from the full model (1) reduces R2 the
least, or equivalently, increases the Residual SS the least. This is the variable
that gives the largest p-value for testing an individual regression coefficient
H0 : βi = 0 for i > 0. Suppose this variable is Xk . If you reject H0 , stop and
conclude that the full model is best. If you do not reject H0 , delete Xk from
the full model, giving the new full model

Y = β0 + β1 X1 + · · · + βk−1 Xk−1 + ε.

Repeat steps 1 and 2 sequentially until no further predictors can be deleted.


In backward elimination we isolate the least important predictor left in the model,
and check whether it is important. If not, delete it and repeat the process. Otherwise,
stop. A 0.10 significance level is common to use for this strategy.
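
A compact way to carry out one pass of this procedure is drop1() with F-tests. The sketch below is mine (not part of the notes) and uses R's built-in stackloss data purely as an illustration; the class example follows in Section 3.3.
lm.full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
drop1(lm.full, test = "F")      # Acid.Conc. has the largest p-value, well above 0.10
lm.red <- update(lm.full, . ~ . - Acid.Conc.)
drop1(lm.red, test = "F")       # both remaining predictors are significant; stop
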
Epidemiologists use a slightly different approach to building models. They argue
strongly for the need to always include confounding variables in a model, regard-
less of their statistical significance. I will briefly discuss this issue, but you should
recognize that there is no universally accepted best approach to building models. A
related issue is that several sets of predictors might give nearly identical fits and
predictions to those obtained using any model selection method. This should not be
too surprising because predictors are often correlated with each other. However, this
should make us question whether one could ever completely unravel which variables
are important (and which are not) for predicting a response.

3.2.1 Maximum likelihood and AIC/BIC


The Akaike information criterion (AIC) and Bayesian information criterion
(BIC) are related penalized-likelihood criteria of the relative goodness-of-fit of a
statistical model to the observed data. For model selection, a parsimonious model
minimizes (one of) these quantities, where the penalty term is larger in BIC (k ln(n))
than in AIC (2k). They are defined as

AIC = −2 ln(L) + 2k and


BIC = −2 ln(L) + k ln(n)

where n is the number of observations, k is the number of model parameters, and L


is the maximized value of the likelihood function for the estimated model.
Maximum-likelihood estimation (MLE) applied to a data set and given a statistical
model, estimates the model’s parameters (βs and σ 2 in regression). MLE finds the
particular parametric values that make the observed data the most probable given the
model. That is, it selects the set of values of the model parameters that maximizes
the likelihood function.
In practice, start with a set of candidate models, and then find the models’ corre-
sponding AIC/BIC values. There will almost always be information lost due to using
one of the candidate models to represent the “true” (unknown) model. We choose
the model that minimizes the (estimated) information loss (the Kullback-Leibler di-
vergence of the “true” unknown model represented with a candidate model).
The penalty discourages overfitting. Increasing the number of free parameters in
the model will always improve the goodness-of-fit, regardless of the number of free
parameters in the data-generating process. In the spirit of Occam’s razor, the princi-
ple of parsimony, economy, or succinctness, the penalty helps balance the complexity
of the model (low) with its ability to describe the data (high).
There are many methods for model selection. AIC or BIC are good tools for
helping to choose among candidate models. A model selected by BIC will tend to
have fewer parameters than one selected by AIC. Ultimately, you have to choose a
model. I think of automated model selection as a starting point among the models
I ultimately consider, and I may decide upon a different model than AIC, BIC, or
another method.
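
For a fitted lm these quantities are available directly; a short sketch (mine, continuing the stackloss illustration from Section 3.2, not part of the notes):
fit <- lm(stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
as.numeric(logLik(fit))   # maximized log-likelihood ln(L)
AIC(fit)                  # -2 ln(L) + 2k, where k counts all parameters including sigma^2
BIC(fit)                  # -2 ln(L) + k ln(n)
# note: step() reports extractAIC(), which differs from AIC() by an additive
# constant for a fixed data set, so only differences between models matter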

3.3 Example: Peru Indian blood pressure


I will illustrate backward elimination on the Peru Indian data, using systolic blood
pressure (sysbp) as the response, and seven candidate predictors: wt = weight in
kilos; ht = height in mm; chin = chin skin fold in mm; fore = forearm skin fold
in mm; calf = calf skin fold in mm; pulse = pulse rate-beats/min, and yrage =
fraction.
The program given below generates simple summary statistics and plots. The
plots do not suggest any apparent transformations of the response or the predictors,
so we will analyze the data using the given scales. The correlation matrix below gives
the correlation between the response and each potential predictor, and shows that the
predictors are generally not highly correlated with each other (a few are).
#### Example: Indian
# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)

# Create the "fraction of their life" variable


# yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age

# subset of variables we want in our model


indian2 <- subset(indian, select=c("sysbp", "wt", "ht", "chin"
, "fore", "calf", "pulse", "yrage")
)

str(indian2)
## 'data.frame': 39 obs. of 8 variables:
## $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
## $ wt : num 71 56.5 56 61 65 62 53 53 65 57 ...
## $ ht : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
## $ chin : num 8 3.3 3.3 3.7 9 3 7.3 3.7 10.3 5.7 ...
## $ fore : num 7 5 1.3 3 12.7 3.3 4.7 4.3 9 4 ...
## $ calf : num 12.7 8 4.3 4.3 20.7 5.7 8 0 10 6 ...
## $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
## $ yrage: num 0.0476 0.2727 0.2083 0.0417 0.04 ...
# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp

## print dataset to screen


#indian2
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
#p <- ggpairs(indian2, progress=FALSE)
# put scatterplots on top so y axis is vertical
p <- ggpairs(indian2, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)

# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Prof. Erik B. Erhardt


3.3: Example: Peru Indian blood pressure 477

sysbp wt ht chin fore calf pulse yrage


● ● ● ● ● ● ●

0.02 ●
● ●
● ●








sysbp
● ●● ● ●●●● ●● ● ●
●● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●
● ●●
● ●● ●
● ● ●●● ●● ● ●● ● ●● ● ●●●●● ● ● ●
● ● ●●● ● ● ● ●
● ● ● ● ●●●●
0.01 ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●
●● ● ● ● ● ●
●●
●● ● ●● ●● ● ●● ● ● ● ● ●
● ● ● ● ●● ● ● ● ●
● ● ●



● ●● ●
● ●
●●●● ● ● ●● ● ● ●● ● ● ● ● ●● ●●● ● ● ● ●●● ● ● ●
●●● ●● ● ●● ● ● ● ●● ● ●●●●

● ●
● ● ●●● ● ●●● ● ● ● ● ●● ● ● ● ●●● ●
● ● ● ● ● ● ● ● ●
● ●● ● ●
0.00 ● ● ●● ●● ● ● ● ● ●

● ● ● ● ● ●

80
Corr: ●
●● ●
● ● ● ●


● ● ●
● ● ●● ●
● ●
● ● ●
● ● ●●


●● ● ●

wt
70 ● ● ●
●● ● ●
● ● ● ●
● ● ●● ●●



● ● ●
●●


0.521 ● ●●

● ●● ●●

●●●
● ● ● ●●● ● ●

● ● ●

●● ● ● ●

●●● ● ● ●
●● ●



●● ● ● ● ●
● ● ●
● ●● ● ●● ●●
● ●●● ● ● ● ● ● ● ●●
60 ● ● ●

● ● ●

● ● ●

● ● ● ● ● ●
● ●●
●● ● ●● ● ● ●● ● ● ●● ● ●●● ● ●● ●
● ● ● ● ● ●● ●● ●
● ●● ●
● ● ●● ● ● ● ● ● ● ● ● ●
● ● ● ● ●● ● ● ● ● ● ●

1650 ● ● ● ● ●
●●● ● ● ● ● ● ● ● ●
● ●● ●
● ● ●
● ●● ●● ●● ●●●● ●
● ● ● ● ●
● ● ●● ● ● ●●● ● ● ● ●
● ● ●● ●

● ●
● ● ●● ● ● ●
● ● ●● ●

● ● ●● ● ● ● ● ● ●
1600
Corr: Corr: ●●●
●●● ●

● ●●●

● ● ●●
● ●
● ●

● ● ● ●
● ●

● ● ●●


● ●

ht
● ● ●
1550 0.219 0.45 ●
● ● ●

● ● ●
●● ● ●
● ● ●●●
● ●
● ● ● ●●●● ●● ● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ●
● ● ● ● ●
1500 ● ● ● ● ●
● ● ● ● ●
● ● ● ● ●

12.5 ● ● ● ●
● ● ●● ● ● ● ●
● ●● ● ●
10.0 ●
● ●

● ●

● ● ● ● ● ● ● ●
Corr: Corr: Corr:

chin
●● ● ● ● ● ● ●
7.5 ● ● ● ● ● ● ●
● ● ● ●
0.17 0.562 −0.0079 ● ● ●●

●●●● ●● ●

● ● ● ●
●● ● ●

5.0 ● ●● ●● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
●● ● ● ● ●●● ● ● ● ● ● ● ● ●●
● ●●● ●●
● ●
●●●
●● ●
●● ● ● ● ● ●
● ● ●
● ● ● ●● ● ●
●● ●● ●● ●
2.5
● ● ●
12.5 ● ● ●

10.0
● ● ●
Corr: Corr: Corr: Corr:

fore
7.5 ● ● ●
● ● ● ● ● ● ● ●
●● ● ●
● ● ● ● ● ● ●
5.0 0.272 0.544 −0.0689 0.638 ● ●●

●●
● ●●

● ● ●
● ●● ●


● ●●● ●●
●●● ● ●
● ●
●●● ● ● ● ● ●
● ●● ●● ● ●
● ● ●

●●●●●
●● ● ● ● ● ●
● ● ● ●● ●●
● ● ● ●● ●
2.5
● ● ●

● ●
20
● ●
15 ● ●
● ● ● ●
Corr: Corr: Corr: Corr: Corr: ●
● ●
● ● ●● ●

calf
● ● ●
10 ●● ● ●●
0.251 0.392 −0.00285 0.516 0.736 ●
● ● ●● ●
● ●
● ●
● ●
● ● ●●● ●



●●
● ●● ●

5 ● ● ● ●
●● ● ● ● ●
● ● ● ● ●●

0 ● ●


90 ● ● ●

80 ● ●
Corr: Corr: Corr: Corr: Corr: Corr:

pulse
●● ● ●
● ●● ●●● ● ● ●
70
0.133 0.31 0.00294 0.224 0.422 0.21 ●
● ●●
● ●


● ●●
60 ● ●● ● ● ●


50

0.75
Corr: Corr: Corr: Corr: Corr: Corr: Corr:

yrage
0.50
−0.276 0.293 0.0512 0.12 0.028 −0.113 0.213
0.25

0.00
120 140 160 60 70 80 1500155016001650
2.5 5.0 7.5 10.0 12.5 2.5 5.0 7.510.012.5 0 5 10 15 20 50 60 70 80 90 0.00 0.25 0.50 0.75

# correlation matrix and associated p-values testing "H0: rho == 0"


library(Hmisc)
rcorr(as.matrix(indian2))
## sysbp wt ht chin fore calf pulse yrage
## sysbp 1.00 0.52 0.22 0.17 0.27 0.25 0.13 -0.28
## wt 0.52 1.00 0.45 0.56 0.54 0.39 0.31 0.29
## ht 0.22 0.45 1.00 -0.01 -0.07 0.00 0.00 0.05
## chin 0.17 0.56 -0.01 1.00 0.64 0.52 0.22 0.12
## fore 0.27 0.54 -0.07 0.64 1.00 0.74 0.42 0.03
## calf 0.25 0.39 0.00 0.52 0.74 1.00 0.21 -0.11
## pulse 0.13 0.31 0.00 0.22 0.42 0.21 1.00 0.21
## yrage -0.28 0.29 0.05 0.12 0.03 -0.11 0.21 1.00

##
## n= 39
##
##
## P
## sysbp wt ht chin fore calf pulse yrage
## sysbp 0.0007 0.1802 0.3003 0.0936 0.1236 0.4211 0.0888
## wt 0.0007 0.0040 0.0002 0.0003 0.0136 0.0548 0.0702
## ht 0.1802 0.0040 0.9619 0.6767 0.9863 0.9858 0.7570
## chin 0.3003 0.0002 0.9619 0.0000 0.0008 0.1708 0.4665
## fore 0.0936 0.0003 0.6767 0.0000 0.0000 0.0075 0.8656
## calf 0.1236 0.0136 0.9863 0.0008 0.0000 0.1995 0.4933
## pulse 0.4211 0.0548 0.9858 0.1708 0.0075 0.1995 0.1928
## yrage 0.0888 0.0702 0.7570 0.4665 0.8656 0.4933 0.1928
Below I fit the linear model with all the selected main effects.
# fit full model
lm.indian2.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, data = indian2)

library(car)
Anova(lm.indian2.full, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 389.46 1 3.8991 0.0572767 .
## wt 1956.49 1 19.5874 0.0001105 ***
## ht 131.88 1 1.3203 0.2593289
## chin 186.85 1 1.8706 0.1812390
## fore 27.00 1 0.2703 0.6068061
## calf 2.86 1 0.0287 0.8666427
## pulse 14.61 1 0.1463 0.7046990
## yrage 1386.76 1 13.8835 0.0007773 ***
## Residuals 3096.45 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian2.full)
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage,
## data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3993 -5.7916 -0.6907 6.9453 23.5771
##
## Coefficients:

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 106.45766 53.91303 1.975 0.057277 .
## wt 1.71095 0.38659 4.426 0.000111 ***
## ht -0.04533 0.03945 -1.149 0.259329
## chin -1.15725 0.84612 -1.368 0.181239
## fore -0.70183 1.34986 -0.520 0.606806
## calf 0.10357 0.61170 0.169 0.866643
## pulse 0.07485 0.19570 0.383 0.704699
## yrage -29.31810 7.86839 -3.726 0.000777 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.994 on 31 degrees of freedom
## Multiple R-squared: 0.5259,Adjusted R-squared: 0.4189
## F-statistic: 4.913 on 7 and 31 DF, p-value: 0.0008079

Remarks on Step 0: The full model has 7 predictors so REG df = 7. The F -test
in the full model ANOVA table (F = 4.91 with p-value = 0.0008) tests the hypothesis
that the regression coefficient for each predictor variable is zero. This test is highly
significant, indicating that one or more of the predictors is important in the model.
In the ANOVA table, the F -value column gives the square of the t-statistic (from
the parameter [Coefficients] estimate table) for testing the significance of the indi-
vidual predictors in the full model (conditional on all other predictors being in the
model). The p-value is the same whether the t-statistic or F -value is shown.
The least important variable in the full model, as judged by the p-value, is calf skin
fold. This variable, upon omission, reduces R2 the least, or equivalently, increases the
Residual SS the least. The p-value of 0.87 exceeds the default 0.10 cut-off, so calf
will be the first to be omitted from the model.
Below, we will continue in this way. After deleting calf, the six predictor model
can be fitted. Manually, you can find that at least one of the predictors left is
important, as judged by the overall F -test p-value. The least important predictor left
is pulse. This variable is omitted from the model because the p-value for including
it exceeds the 0.10 threshold.
This is repeated until all predictors remain significant at a 0.10 significance level.
# model reduction using update() and subtracting (removing) model terms
lm.indian2.red <- lm.indian2.full;

# remove calf
lm.indian2.red <- update(lm.indian2.red, ~ . - calf ); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + pulse + yrage, data = indian2)
##

## Residuals:
## Min 1Q Median 3Q Max
## -14.6993 -5.3152 -0.7725 7.2966 23.7240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.13739 53.05581 2.000 0.053993 .
## wt 1.70900 0.38051 4.491 8.65e-05 ***
## ht -0.04478 0.03871 -1.157 0.256008
## chin -1.14165 0.82823 -1.378 0.177635
## fore -0.56731 1.07462 -0.528 0.601197
## pulse 0.07103 0.19142 0.371 0.713018
## yrage -29.54000 7.63983 -3.867 0.000509 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.841 on 32 degrees of freedom
## Multiple R-squared: 0.5255,Adjusted R-squared: 0.4365
## F-statistic: 5.906 on 6 and 32 DF, p-value: 0.0003103
# remove pulse
lm.indian2.red <- update(lm.indian2.red, ~ . - pulse); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + yrage, data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6147 -5.9803 -0.2065 6.6755 24.9269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.27872 51.18665 2.154 0.038601 *
## wt 1.71825 0.37470 4.586 6.22e-05 ***
## ht -0.04504 0.03820 -1.179 0.246810
## chin -1.17716 0.81187 -1.450 0.156514
## fore -0.43385 0.99933 -0.434 0.667013
## yrage -28.98171 7.39172 -3.921 0.000421 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.712 on 33 degrees of freedom
## Multiple R-squared: 0.5234,Adjusted R-squared: 0.4512
## F-statistic: 7.249 on 5 and 33 DF, p-value: 0.0001124
# remove fore
lm.indian2.red <- update(lm.indian2.red, ~ . - fore ); summary(lm.indian2.red);
##
## Call:

## lm(formula = sysbp ~ wt + ht + chin + yrage, data = indian2)


##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1030 -6.3484 0.2834 6.7766 24.8883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.52292 48.84627 2.140 0.039629 *
## wt 1.64631 0.33203 4.958 1.94e-05 ***
## ht -0.03957 0.03563 -1.111 0.274530
## chin -1.31083 0.74220 -1.766 0.086348 .
## yrage -28.32580 7.14879 -3.962 0.000361 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.595 on 34 degrees of freedom
## Multiple R-squared: 0.5207,Adjusted R-squared: 0.4643
## F-statistic: 9.235 on 4 and 34 DF, p-value: 3.661e-05
# remove ht
lm.indian2.red <- update(lm.indian2.red, ~ . - ht ); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + chin + yrage, data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6382 -6.6316 0.4521 6.3593 24.2086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.9092 15.0895 3.506 0.001266 **
## wt 1.4407 0.2766 5.209 8.51e-06 ***
## chin -1.0135 0.6945 -1.459 0.153407
## yrage -27.3522 7.1185 -3.842 0.000491 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.627 on 35 degrees of freedom
## Multiple R-squared: 0.5033,Adjusted R-squared: 0.4608
## F-statistic: 11.82 on 3 and 35 DF, p-value: 1.684e-05
# remove chin
lm.indian2.red <- update(lm.indian2.red, ~ . - chin ); summary(lm.indian2.red);
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian2)
##

## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06
# all are significant, stop.
# final model: sysbp ~ wt + yrage
lm.indian2.final <- lm.indian2.red

AIC/BIC automated model selection   The AIC/BIC strategy is more commonly
used for model selection, though resulting models are usually the same as the
method described above. I use the step() function to perform backward selection
using the AIC criterion (and give code for the BIC) then make some last-step deci-
sions. Note that because the BIC has a larger penalty, it arrives at my chosen model
directly.
At each step, the predictors are ranked (least significant to most significant) and
then a decision of whether to keep the top predictor is made. <none> represents the
current model.
## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.indian2.red.AIC <- step(lm.indian2.full, direction="backward", test="F")
## Start: AIC=186.6
## sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - calf 1 2.86 3099.3 184.64 0.0287 0.8666427
## - pulse 1 14.61 3111.1 184.79 0.1463 0.7046990
## - fore 1 27.00 3123.4 184.94 0.2703 0.6068061
## - ht 1 131.88 3228.3 186.23 1.3203 0.2593289
## <none> 3096.4 186.60
## - chin 1 186.85 3283.3 186.89 1.8706 0.1812390
## - yrage 1 1386.76 4483.2 199.04 13.8835 0.0007773 ***
## - wt 1 1956.49 5052.9 203.70 19.5874 0.0001105 ***
## ---


## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=184.64
## sysbp ~ wt + ht + chin + fore + pulse + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - pulse 1 13.34 3112.6 182.81 0.1377 0.7130185
## - fore 1 26.99 3126.3 182.98 0.2787 0.6011969
## - ht 1 129.56 3228.9 184.24 1.3377 0.2560083
## <none> 3099.3 184.64
## - chin 1 184.03 3283.3 184.89 1.9000 0.1776352
## - yrage 1 1448.00 4547.3 197.59 14.9504 0.0005087 ***
## - wt 1 1953.77 5053.1 201.70 20.1724 8.655e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=182.81
## sysbp ~ wt + ht + chin + fore + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - fore 1 17.78 3130.4 181.03 0.1885 0.667013
## - ht 1 131.12 3243.8 182.42 1.3902 0.246810
## <none> 3112.6 182.81
## - chin 1 198.30 3310.9 183.22 2.1023 0.156514
## - yrage 1 1450.02 4562.7 195.72 15.3730 0.000421 ***
## - wt 1 1983.51 5096.2 200.03 21.0290 6.219e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=181.03
## sysbp ~ wt + ht + chin + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - ht 1 113.57 3244.0 180.42 1.2334 0.2745301
## <none> 3130.4 181.03
## - chin 1 287.20 3417.6 182.45 3.1193 0.0863479 .
## - yrage 1 1445.52 4575.9 193.84 15.7000 0.0003607 ***
## - wt 1 2263.64 5394.1 200.25 24.5857 1.945e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=180.42
## sysbp ~ wt + chin + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3244.0 180.42
## - chin 1 197.37 3441.4 180.72 2.1295 0.1534065
## - yrage 1 1368.44 4612.4 192.15 14.7643 0.0004912 ***


## - wt 1 2515.33 5759.3 200.81 27.1384 8.512e-06 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# BIC (not shown)
# step(lm.indian2.full, direction="backward", test="F", k=log(nrow(indian2)))

Remark on Summary Table: The partial R2 is the reduction in R2 achieved by


omitting variables sequentially.
The backward elimination procedure eliminates five variables from the full model,
in the following order: calf skin fold calf, pulse rate pulse, forearm skin fold fore,
height ht, and chin skin fold chin. The model selected by backward elimination
includes two predictors: weight wt and fraction yrage. As we progress from the full
model to the selected model, R2 decreases as follows: 0.53, 0.53, 0.52, 0.52, 0.50, and
0.47. The decrease is slight across this spectrum of models.
Using a mechanical approach, we are led to a model with weight and years by
age fraction as predictors of systolic blood pressure. At this point we should closely
examine this model.

3.3.1 Analysis for Selected Model


The summaries and diagnostics for the selected model follow.
Model sysbp = β0 + β1 wt + β2 yrage + ε:
library(car)
Anova(lm.indian2.final, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
## Sum Sq Df F value Pr(>F)
## (Intercept) 1738.2 1 18.183 0.0001385 ***
## wt 2592.0 1 27.115 7.966e-06 ***
## yrage 1314.7 1 13.753 0.0006991 ***
## Residuals 3441.4 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian2.final)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##


## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

Comments on the diagnostic plots below.

1. The individual with the highest systolic blood pressure (case 1) has a large
studentized residual ri and the largest Cook’s Di .
2. Except for case 1, the rankit plot and the plot of the studentized residuals
against the fitted values show no gross abnormalities.
3. The plots of studentized residuals against the individual predictors show no
patterns. The partial residual plots show roughly linear trends. These plots
collectively do not suggest the need to transform either of the predictors. Al-
though case 1 is prominent in the partial residual plots, it does not appear to
be influencing the significance of these predictors.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.indian2.final, which = c(1,4,6))

plot(indian2$wt, lm.indian2.final$residuals, main="Residuals vs wt")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(indian2$yrage, lm.indian2.final$residuals, main="Residuals vs yrage")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.indian2.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 34 11
## residuals vs order of data
#plot(lm.indian2.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for lm.indian2.final: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs wt, Residuals vs yrage, and QQ Plot (cases 1, 11, and 34 labeled).]

Recall that the partial regression residual plot for weight, given below, adjusts
systolic blood pressure and weight for their common dependence on all the other
predictors in the model (only years by age fraction here). This plot tells us whether
we need to transform weight in the multiple regression model, and whether any ob-
servations are influencing the significance of weight in the fitted model. A roughly
linear trend, as seen here, suggests that no transformation of weight is warranted.
The positive relationship seen here is consistent with the coefficient of weight being
positive in the multiple regression model.

The partial residual plot for fraction exhibits a stronger relationship than is seen
in the earlier 2D plot of systolic blood pressure against year by age fraction. This
means that fraction is more useful as a predictor after taking an individual’s weight
into consideration.
library(car)
avPlots(lm.indian2.final, id = list(n = 3))


[Figure: added-variable plots for lm.indian2.final: sysbp | others versus wt | others and versus yrage | others; case 1 is prominent in both panels.]

Model selection methods can be highly influenced by outliers and influential cases.
We should hold out case 1, and rerun the backward procedure to see whether case 1
unduly influenced the selection of the two predictor model. If we hold out case 1, we
find that the model with weight and fraction as predictors is suggested again. After
holding out case 1, there are no large residuals, no extremely influential points, or
any gross abnormalities in plots. The R2 for the selected model is now R2 = 0.408.
This decrease in R2 should have been anticipated. Why? (Obs 1 increases the SST, but
greatly increases the model relationship and so greatly increases the SSR.)
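A minimal sketch of this hold-out re-run, assuming indian2 and the full-model formula used
above are available (the .no1 object names are illustrative only):

# hold out the influential case 1 and repeat backward selection
indian2.no1 <- indian2[-1, ]

lm.indian2.no1.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
                        , data = indian2.no1)
# backward selection by AIC, as before (output not shown)
lm.indian2.no1.red <- step(lm.indian2.no1.full, direction = "backward", test = "F")

# the notes report that wt and yrage are selected again; check that model's R^2
lm.indian2.no1.final <- lm(sysbp ~ wt + yrage, data = indian2.no1)
summary(lm.indian2.no1.final)$r.squared  # about 0.408, as reported above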
The two analyses suggest that the “best model” for predicting systolic blood
pressure is

sysbp = β0 + β1 wt + β2 yrage + ε.

Should case 1 be deleted? I have not fully explored this issue, but I will note that
eliminating this case does have a significant impact on the estimates of the regression
coefficients, and on predicted values. What do you think?

3.4 Example: Dennis Cook’s Rat Data


This example illustrates the importance of a careful diagnostic analysis.

An experiment was conducted to investigate the amount of a particular drug
present in the liver of a rat. Nineteen (19) rats were randomly selected, weighed,
placed under light ether anesthesia and given an oral dose of the drug. Because it
was felt that large livers would absorb more of a given dose than small livers, the
actual dose an animal received was approximately determined as 40mg of the drug
per kilogram of body weight. (Liver weight is known to be strongly related to body
weight.) After a fixed length of time, each rat was sacrificed, the liver weighed, and
the percent of the dose in the liver determined.
The experimental hypothesis was that, for the method of determining the dose,
there is no relationship between the percentage of dose in the liver (Y ) and the body
weight, liver weight, and relative dose.
#### Example: Rat liver
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch03_ratliver.csv"
ratliver <- read.csv(fn.data)

ratliver <- ratliver[,c(4,1,2,3)] # reorder columns so response is the first

str(ratliver)
## 'data.frame': 19 obs. of 4 variables:
## $ y : num 0.42 0.25 0.56 0.23 0.23 0.32 0.37 0.41 0.33 0.38 ...
## $ bodywt : int 176 176 190 176 200 167 188 195 176 165 ...
## $ liverwt: num 6.5 9.5 9 8.9 7.2 8.9 8 10 8 7.9 ...
## $ dose : num 0.88 0.88 1 0.88 1 0.83 0.94 0.98 0.88 0.84 ...
y bodywt liverwt dose
1 0.42 176 6.50 0.88
2 0.25 176 9.50 0.88
3 0.56 190 9.00 1.00
4 0.23 176 8.90 0.88
5 0.23 200 7.20 1.00
6 0.32 167 8.90 0.83
7 0.37 188 8.00 0.94
8 0.41 195 10.00 0.98
9 0.33 176 8.00 0.88
10 0.38 165 7.90 0.84
11 0.27 158 6.90 0.80
12 0.36 148 7.30 0.74
13 0.21 149 5.20 0.75
14 0.28 163 8.40 0.81
15 0.34 170 7.20 0.85
16 0.28 186 6.80 0.94
17 0.30 146 7.30 0.73
18 0.37 181 9.00 0.90
19 0.46 149 6.40 0.75
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
#p <- ggpairs(ratliver, progress=FALSE)
# put scatterplots on top so y axis is vertical
p <- ggpairs(ratliver, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)


# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix (ggpairs) of y, bodywt, liverwt, and dose, with pairwise correlations in the lower panels; see the rcorr() output below for the values.]

The correlation between Y and each predictor is small, as expected.


# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(ratliver))
## y bodywt liverwt dose
## y 1.00 0.15 0.20 0.23
## bodywt 0.15 1.00 0.50 0.99
## liverwt 0.20 0.50 1.00 0.49
## dose 0.23 0.99 0.49 1.00
##
## n= 19
##
##
## P
## y bodywt liverwt dose
## y 0.5370 0.4038 0.3488
## bodywt 0.5370 0.0293 0.0000


## liverwt 0.4038 0.0293 0.0332


## dose 0.3488 0.0000 0.0332
Below I fit the linear model with all the selected main effects.
# fit full model
lm.ratliver.full <- lm(y ~ bodywt + liverwt + dose, data = ratliver)

library(car)
Anova(lm.ratliver.full, type=3)
## Anova Table (Type III tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## (Intercept) 0.011157 1 1.8676 0.19188
## bodywt 0.042408 1 7.0988 0.01768 *
## liverwt 0.004120 1 0.6897 0.41930
## dose 0.044982 1 7.5296 0.01507 *
## Residuals 0.089609 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.ratliver.full)
##
## Call:
## lm(formula = y ~ bodywt + liverwt + dose, data = ratliver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.100557 -0.063233 0.007131 0.045971 0.134691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.265922 0.194585 1.367 0.1919
## bodywt -0.021246 0.007974 -2.664 0.0177 *
## liverwt 0.014298 0.017217 0.830 0.4193
## dose 4.178111 1.522625 2.744 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07729 on 15 degrees of freedom
## Multiple R-squared: 0.3639,Adjusted R-squared: 0.2367
## F-statistic: 2.86 on 3 and 15 DF, p-value: 0.07197
The backward elimination procedure selects weight and dose as predictors. The
p-values for testing the importance of these variables, when added last to this two
predictor model, are small, 0.019 and 0.015.
lm.ratliver.red.AIC <- step(lm.ratliver.full, direction="backward", test="F")
## Start: AIC=-93.78
## y ~ bodywt + liverwt + dose


##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - liverwt 1 0.004120 0.093729 -94.924 0.6897 0.41930
## <none> 0.089609 -93.778
## - bodywt 1 0.042408 0.132017 -88.416 7.0988 0.01768 *
## - dose 1 0.044982 0.134591 -88.049 7.5296 0.01507 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-94.92
## y ~ bodywt + dose
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 0.093729 -94.924
## - bodywt 1 0.039851 0.133580 -90.192 6.8027 0.01902 *
## - dose 1 0.043929 0.137658 -89.621 7.4989 0.01458 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.ratliver.final <- lm.ratliver.red.AIC

This cursory analysis leads to a conclusion that a combination of dose and body
weight is associated with Y, but that neither of these predictors is important on its own
(low correlations with Y ). Although this commonly happens in regression problems,
it is somewhat paradoxical here because dose was approximately a multiple of body
weight, so to a first approximation, these predictors are linearly related and so only
one of them should be needed in a linear regression model. Note that the correlation
between dose and body weight is 0.99.

The apparent paradox can be resolved only with a careful diagnostic analysis! For
the model with dose and body weight as predictors, there are no cases with large |ri |
values, but case 3 has a relatively large Cook’s D value.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.ratliver.final, which = c(1,4,6))

plot(ratliver$bodywt, lm.ratliver.final$residuals, main="Residuals vs bodywt")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(ratliver$dose, lm.ratliver.final$residuals, main="Residuals vs dose")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.ratliver.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 19 13 1
## residuals vs order of data
#plot(lm.ratliver.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for lm.ratliver.final: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs bodywt, Residuals vs dose, and QQ Plot; case 3 has the largest Cook's distance.]

Further, the partial residual plot for bodywt clearly highlights case 3. Without this
case we would see roughly a random scatter of points, suggesting that body weight is
unimportant after taking dose into consideration. The importance of body weight as
a predictor in the multiple regression model is due solely to the placement of case 3.
The partial residual plot for dose gives the same message.
library(car)
avPlots(lm.ratliver.final, id = list(n = 3))

[Figure: added-variable plots for lm.ratliver.final: y | others versus bodywt | others and versus dose | others; case 3 stands apart in both panels.]


Removing case 3 If we delete this case and redo the analysis we find, as expected,
no important predictors of Y . The output below shows that the backward elimination
removes each predictor from the model. Thus, the apparent relationship between Y
and body weight and dose in the initial analysis can be ascribed to Case 3 alone. Can
you see this case in the plots?
# remove case 3
ratliver3 <- ratliver[-3,]

# fit full model


lm.ratliver3.full <- lm(y ~ bodywt + liverwt + dose, data = ratliver3)

lm.ratliver3.red.AIC <- step(lm.ratliver3.full, direction="backward", test="F")


## Start: AIC=-88.25
## y ~ bodywt + liverwt + dose
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - dose 1 0.00097916 0.086696 -90.043 0.1599 0.6953
## - bodywt 1 0.00105871 0.086776 -90.026 0.1729 0.6838
## - liverwt 1 0.00142114 0.087138 -89.951 0.2321 0.6374
## <none> 0.085717 -88.247
##
## Step: AIC=-90.04
## y ~ bodywt + liverwt
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - bodywt 1 0.00035562 0.087052 -91.969 0.0615 0.8075
## - liverwt 1 0.00082681 0.087523 -91.872 0.1431 0.7106
## <none> 0.086696 -90.043
##
## Step: AIC=-91.97
## y ~ liverwt
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - liverwt 1 0.00050917 0.087561 -93.864 0.0936 0.7636
## <none> 0.087052 -91.969
##
## Step: AIC=-93.86
## y ~ 1

All variables are omitted!


In his text (Applied Linear Regression, 3rd Ed., by Sanford Weisberg, Wiley/Interscience,
2005, ISBN 0-471-66379-4), Weisberg writes: The careful analyst must now try to understand
exactly why the third case is so influential. Inspection of the data indicates that this
rat, with weight 190g, was reported to have received a full dose of 1.000, which was a
larger dose than it should have received according to the rule for assigning doses; see the


scatterplot below (e.g., rat 8 with a weight of 195g got a lower dose of 0.98).
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(ratliver, aes(x = bodywt, y = dose, label = 1:nrow(ratliver)))
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + geom_point(alpha=1/3)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
p <- p + labs(title="Rat liver dose by bodywt: rat 3 overdosed")
print(p)
[Figure: "Rat liver dose by bodywt: rat 3 overdosed": scatterplot of dose versus bodywt with the linear regression fit and confidence band, points labeled by rat number.]

A number of causes for the result found in the first analysis are possible: (1)
the dose or weight recorded for case 3 was in error, so the case should probably
be deleted from the analysis, or (2) the regression fit in the second analysis is not
appropriate except in the region defined by the 18 points excluding case 3. It is
possible that the combination of dose and rat weight chosen was fortuitous, and that
the lack of relationship found would not persist for any other combinations of them,
since inclusion of a data point apparently taken under different conditions leads to
a different conclusion. This suggests the need for collection of additional data, with
dose determined by some rule other than a constant proportion of weight.
I hope the point of this analysis is clear! What have we learned from this analysis?



Part VII

ADA2: Experimental design and observational studies



Chapter 4

One Factor Designs and Extensions

This section describes an experimental design to compare the effectiveness of four


insecticides to eradicate beetles. The primary interest is determining which treatment
is most effective, in the sense of providing the lowest typical survival time.
In a completely randomized design (CRD), the scientist might select a sample
of genetically identical beetles for the experiment, and then randomly assign a prede-
termined number of beetles to the treatment groups (insecticides). The sample sizes
for the groups need not be equal. A power analysis is often conducted to determine
sample sizes for the treatments. For simplicity, assume that 48 beetles will be used
in the experiment, with 12 beetles assigned to each group.
After assigning the beetles to the four groups, the insecticide is applied (uniformly
to all experimental units or beetles), and the individual survival times recorded. A
natural analysis of the data collected from this one factor design would be to compare
the survival times using a one-way ANOVA.
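A minimal sketch of how this randomization might be generated in R (the treatment labels
and object names are illustrative only):

# randomly assign 48 beetles to the four insecticides, 12 beetles per treatment
set.seed(427)                                    # arbitrary seed, for reproducibility
insecticide <- rep(c("A", "B", "C", "D"), each = 12)
crd <- data.frame(beetle    = 1:48
                , treatment = sample(insecticide))  # random permutation of labels
table(crd$treatment)                             # 12 beetles per group, as planned
# after the experiment, the survival times would be compared with a one-way ANOVA,
# e.g., anova(lm(time ~ treatment, data = crd))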
There are several important controls that should be built into this experiment.
The same strain of beetles should be used to ensure that the four treatment groups
are alike as possible, so that differences in survival times are attributable to the
insecticides, and not due to genetic differences among beetles. Other factors that
may influence the survival time, say the concentration of the insecticide or the age of
the beetles, would be held constant, or fixed by the experimenter, if possible. Thus,
the same concentration would be used with the four insecticides.
In complex experiments, there are always potential influences that are not re-
alized or are thought to be unimportant that you do not or can not control. The
randomization of beetles to groups ensures that there is no systematic dependence
of the observed treatment differences on the uncontrolled influences. This is ex-
tremely important in studies where genetic and environmental influences can not be
easily controlled (as in humans, more so than in bugs or mice). The randomization of
beetles to insecticides tends to diffuse or greatly reduce the effect of the uncontrolled
influences on the comparison of insecticides, in the sense that these effects become

Advanced Data Analysis 1 & 2, UNM Stat 427/527 - 428/528


498 Ch 4: One Factor Designs and Extensions

part of the uncontrolled or error variation of the experiment.


In summary, an experiment is to impose a treatment on experimental units to
observe a response. Randomization and carefully controlling factors are important
considerations.
Suppose yij is the response for the j th experimental unit in the ith treatment
group, where i = 1, 2, . . . , I. The statistical model for a completely randomized
one-factor design that leads to a one-way ANOVA is given by:
yij = µi + eij ,
where µi is the (unknown) population mean for all potential responses to the ith
treatment, and eij is the residual or deviation of the response from the population
mean. The responses within and across treatments are assumed to be independent,
normal random variables with constant variance.
For the insecticide experiment, yij is the survival time for the j th beetle given
the ith insecticide, where i = 1, 2, 3, 4 and j = 1, 2, . . . , 12. The random selection
of beetles coupled with the randomization of beetles to groups ensures the indepen-
dence assumptions. The assumed population distributions of responses for the I = 4
insecticides can be represented as follows:

[Figure: the four assumed normal population distributions of survival times, one for each of Insecticide 1 through Insecticide 4.]

Let
µ = (1/I) Σi µi
be the grand mean, or average of the population means. Let
αi = µi − µ


be the ith group treatment effect. The treatment effects are constrained to add
to zero, α1 + α2 + · · · + αI = 0, and measure the difference between the treatment
population means and the grand mean. Given this notation, the one-way ANOVA
model is
yij = µ + αi + eij .
The model specifies that the
Response = Grand Mean + Treatment Effect + Residual.
An hypothesis of interest is whether the population means are equal: H0 : µ1 =
· · · = µI , which is equivalent to the hypothesis of no treatment effects: H0 : α1 =
· · · = αI = 0. If H0 is true, then the one-way model is

yij = µ + eij ,

where µ is the common population mean. You know how to test H0 and do multiple
comparisons of the treatments, so I will not review this material.
Most texts use treatment effects to specify ANOVA models, a convention that I
will also follow. A difficulty with this approach is that the treatment effects must
be constrained to be uniquely estimable from the data (because the I population
means µi are modeled in terms of I + 1 parameters: µi = µ + αi ). An infinite
number of constraints can be considered each of which gives the same structure on
the population means. The standard constraint where the treatment effects sum to
zero was used above, but many statistical packages, impose the constraint αI = 0
(or sometimes α1 = 0). Although estimates of treatment effects depend on which
constraint is chosen, the null and alternative models used with the ANOVA F -test,
and pairwise comparisons of treatment effects, do not. I will downplay the discussion
of estimating treatment effects to minimize problems.
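For reference, a small simulated sketch of how the constraint appears in R: the default
contr.treatment sets the first treatment effect to zero, while contr.sum imposes the
sum-to-zero constraint used above; the ANOVA F-test is the same under either choice
(the data and object names below are illustrative only).

# simulated one-factor data
set.seed(527)
d <- data.frame(trt = factor(rep(c("A", "B", "C"), each = 5))
              , y   = rnorm(15, mean = rep(c(10, 12, 15), each = 5)))

lm.ref <- lm(y ~ trt, data = d)                                       # alpha_1 = 0
lm.sum <- lm(y ~ trt, data = d, contrasts = list(trt = "contr.sum"))  # sum-to-zero

coef(lm.ref)   # intercept is the mean of A; effects are differences from A
coef(lm.sum)   # intercept is the grand mean; effects are alpha_1 and alpha_2
anova(lm.ref)  # the F-test for treatment is identical under either constraint
anova(lm.sum)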



Chapter 5

Paired Experiments and Randomized Block Experiments

A randomized block design is often used instead of a completely randomized


design in studies where there is extraneous variation among the experimental units
that may influence the response. A significant amount of the extraneous variation
may be removed from the comparison of treatments by partitioning the experimental
units into fairly homogeneous subgroups or blocks.
For example, suppose you are interested in comparing the effectiveness of four
antibiotics for a bacterial infection. The recovery time after administering an antibi-
otic may be influenced by a patient’s general health, the extent of their infection, or
their age. Randomly allocating experimental subjects to the treatments (and then
comparing them using a one-way ANOVA) may produce one treatment having a “fa-
vorable” sample of patients with features that naturally lead to a speedy recovery.
Alternatively, if the characteristics that affect the recovery time are spread across
treatments, then the variation within samples due to these uncontrolled features can
dominate the effects of the treatment, leading to an inconclusive result.
A better way to design this experiment would be to block the subjects into groups
of four patients who are alike as possible on factors other than the treatment that
influence the recovery time. The four treatments are then randomly assigned to the
patients (one per patient) within a block, and the recovery time measured. The
blocking of patients usually produces a more sensitive comparison of treatments than
does a completely randomized design because the variation in recovery times due to
the blocks is eliminated from the comparison of treatments.
A randomized block design is a paired experiment when two treatments are
compared. The usual analysis for a paired experiment is a parametric or non-
parametric paired comparison.
Randomized block (RB) designs were developed to account for soil fertility gradients
in agricultural experiments. The experimental field would be separated into
strips (blocks) of fairly constant fertility. Each strip is partitioned into equal size
plots. The treatments, say varieties of corn, are randomly assigned to the plots, with
each treatment occurring the same number of times (usually once) per block. All
other factors that are known to influence the response would be controlled or fixed
by the experimenter. For example, when comparing the mean yields, each plot would
receive the same type and amount of fertilizer and the same irrigation plan.

The discussion will be limited to randomized block experiments with one factor.
Two or more factors can be used with a randomized block design. For example,
the agricultural experiment could be modified to compare four combinations of two
corn varieties and two levels of fertilizer in each block instead of the original four
varieties. In certain experiments, each experimental unit receives each treatment.
The experimental units are “natural” blocks for the analysis.

Example: Comparison of Treatments for Itching   Ten male volunteers (Beecher, 1959)
between 20 and 30 years old were used as a study group to compare seven treatments (5
drugs, a placebo, and no drug) to relieve itching. Each subject was given a different
treatment on seven study days. The time ordering of the treatments was random-
ized across days. Except on the no-drug day, the subjects were given the treatment
intravenously, and then itching was induced on their forearms using an effective itch
stimulus called cowage. The subjects recorded the duration of itching, in seconds.
The data are given in the table below. From left to right the drugs are: papaverine,
morphine, aminophylline, pentobarbitol, tripelenamine.

The volunteers in the study were treated as blocks in the analysis. At best, the
volunteers might be considered a representative sample of males between the ages of
20 and 30. This limits the extent of inferences from the experiment. The scientists
can not, without sound medical justification, extrapolate the results to children or to
senior citizens.
#### Example: Itching
itch <- read.csv("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_itch.csv")


Patient Nodrug Placebo Papv Morp Amino Pento Tripel


1 1 174 263 105 199 141 108 141
2 2 224 213 103 143 168 341 184
3 3 260 231 145 113 78 159 125
4 4 255 291 103 225 164 135 227
5 5 165 168 144 176 127 239 194
6 6 237 121 94 144 114 136 155
7 7 191 137 35 87 96 140 121
8 8 100 102 133 120 222 134 129
9 9 115 89 83 100 165 185 79
10 10 189 433 237 173 168 188 317

5.1 Analysis of a Randomized Block Design


Assume that you designed a randomized block experiment with I blocks and J treat-
ments, where each treatment occurs once in each block. Let yij be the response for
the j th treatment within the ith block. The model for the experiment is

yij = µij + eij ,

where µij is the population mean response for the j th treatment in the ith block and
eij is the deviation of the response from the mean. The population means are assumed
to satisfy the additive model
µij = µ + αi + βj

where µ is a grand mean, αi is the effect for the ith block, and βj is the effect
for the j th treatment. The responses are assumed to be independent across blocks,
normally distributed and with constant variance. The randomized block model does
not require the observations within a block to be independent, but does assume
that the correlation between responses within a block is identical for each pair of
treatments.
The model is sometimes written as
Response = Grand Mean + Treatment Effect + Block Effect + Residual.
Given the data, let ȳi· be the ith block sample mean (the average of the responses
in the ith block), ȳ·j be the j th treatment sample mean (the average of the responses
on the j th treatment), and ȳ·· be the average response of all IJ observations in the
experiment.
An ANOVA table for the randomized block experiment partitions the Model SS
into SS for Blocks and Treatments.


Source    df                SS
Blocks    I − 1             J Σi (ȳi· − ȳ··)²
Treats    J − 1             I Σj (ȳ·j − ȳ··)²
Error     (I − 1)(J − 1)    Σij (yij − ȳi· − ȳ·j + ȳ··)²
Total     IJ − 1            Σij (yij − ȳ··)²
with MS = SS/df for each source.
A primary interest is testing whether the treatment effects are zero: H0 : β1 =
· · · = βJ = 0. The treatment effects are zero if in each block the population mean
responses are identical for each treatment. A formal test of no treatment effects is
based on the p-value from the F-statistic Fobs = MS Treat/MS Error. The p-value
is evaluated in the usual way (i.e., as an upper tail area from an F-distribution with
J − 1 and (I − 1)(J − 1) df.) This H0 is rejected when the treatment averages ȳ·j
vary significantly relative to the error variation.
A test for no block effects (H0 : α1 = · · · = αI = 0) is often a secondary inter-
est, because, if the experiment is designed well, the blocks will be, by construction,
noticeably different. There are no block effects if the population mean response for
an arbitrary treatment is identical across blocks. A formal test of no block effects is
based on the p-value from the F -statistic Fobs = MS Blocks/MS Error. This H0 is
rejected when the block averages ȳi· vary significantly relative to the error variation.
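A minimal sketch that computes these sums of squares and the treatment F-test directly
from the formulas above, using a small simulated block design (the data and object names
are illustrative only):

# simulated RB design: I = 4 blocks, J = 3 treatments, one observation per cell
set.seed(428)
I <- 4; J <- 3
d <- expand.grid(block = factor(1:I), treat = factor(LETTERS[1:J]))
d$y <- 10 + as.numeric(d$block) + 2 * as.numeric(d$treat) + rnorm(I * J)

ybar.. <- mean(d$y)                    # grand mean
ybar.i <- tapply(d$y, d$block, mean)   # block means
ybar.j <- tapply(d$y, d$treat, mean)   # treatment means

SS.block <- J * sum((ybar.i - ybar..)^2)
SS.treat <- I * sum((ybar.j - ybar..)^2)
SS.error <- sum((d$y - ave(d$y, d$block) - ave(d$y, d$treat) + ybar..)^2)

F.treat <- (SS.treat / (J - 1)) / (SS.error / ((I - 1) * (J - 1)))
pf(F.treat, J - 1, (I - 1) * (J - 1), lower.tail = FALSE)  # p-value for treatments

anova(lm(y ~ block + treat, data = d))  # reproduces the same SS decomposition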
The randomized block model is easily fitted in the lm() function. Before illus-
trating the analysis on the itching data, let me mention five important points about
randomized block analyses:
1. The F -test p-value for comparing J = 2 treatments is identical to the p-value
for comparing the two treatments using a paired t-test (see the sketch after this list).
2. The Block SS plus the Error SS is the Error SS from a one-way ANOVA com-
paring the J treatments. If the Block SS is large relative to the Error SS from
the two-factor model, then the experimenter has eliminated a substantial por-
tion of the variation that is used to assess the differences among the treatments.
This leads to a more sensitive comparison of treatments than would have been
obtained using a one-way ANOVA.
3. The RB model is equivalent to an additive or no interaction model for a two-
factor experiment, where the blocks are levels of one of the factors. The analysis
of a randomized block experiment under this model is the same analysis used
for a two-factor experiment with no replication (one observation per cell). We
will discuss the two-factor design soon.
4. Under the sum constraint on the parameters (i.e., Σi αi = Σj βj = 0), the
estimates of the grand mean, block effects, and treatment effects are µ̂ = ȳ·· ,
α̂i = ȳi· − ȳ·· , and β̂j = ȳ·j − ȳ·· , respectively. The estimated mean response for
the (i, j)th cell is µ̂ij = µ̂ + α̂i + β̂j = ȳi· + ȳ·j − ȳ·· .
5. The F -test for comparing treatments is appropriate when the responses within
a block have the same correlation. This is a reasonable working assumption


in many analyses. A multivariate repeated measures model can be used to


compare treatments when the constant correlation assumption is unrealistic,
for example when the same treatment is given to an individual over time.
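A small simulated sketch checking point 1 above: with J = 2 treatments, the RB F-test for
treatment and the paired t-test give the same p-value (here F = t^2); the data and names
below are illustrative only.

# paired data: I = 8 blocks (subjects), J = 2 treatments
set.seed(528)
I <- 8
d2 <- data.frame(block = factor(rep(1:I, times = 2))
               , treat = factor(rep(c("A", "B"), each = I))
               , y     = c(rnorm(I, mean = 10), rnorm(I, mean = 12)))

# F-test p-value for treatment from the randomized block model
anova(lm(y ~ block + treat, data = d2))["treat", "Pr(>F)"]

# paired t-test on the same data gives an identical p-value
t.test(d2$y[d2$treat == "A"], d2$y[d2$treat == "B"], paired = TRUE)$p.value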

RB Analysis of the Itching Data First we reshape the data to long format so
each observation is its own row in the data.frame and indexed by the Patient and
Treatment variables.
library(reshape2)
itch.long <- melt(itch
, id.vars = "Patient"
, variable.name = "Treatment"
, value.name = "Seconds"
)
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds : int 174 224 260 255 165 237 191 100 115 189 ...
head(itch.long, 3)
## Patient Treatment Seconds
## 1 1 Nodrug 174
## 2 2 Nodrug 224
## 3 3 Nodrug 260
tail(itch.long, 3)
## Patient Treatment Seconds
## 68 8 Tripel 129
## 69 9 Tripel 79
## 70 10 Tripel 317
# make Patient a factor variable
itch.long$Patient <- factor(itch.long$Patient)
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds : int 174 224 260 255 165 237 191 100 115 189 ...
As a first step, I made side-by-side boxplots of the itching durations across treat-
ments. The boxplots are helpful for informally comparing treatments and visualizing
the data. The differences in the level of the boxplots will usually be magnified by the
F -test for comparing treatments because the variability within the boxplots includes
block differences which are moved from the Error SS to the Block SS. The plot also
includes the 10 Patients with lines connecting their measurements to see how common
the treatment differences were over patients. I admit, this plot is a little too busy.
Each of the five drugs appears to have an effect, compared to the placebo and
to no drug. Papaverine appears to be the most effective drug. The placebo and no

Advanced Data Analysis 1 & 2, UNM Stat 427/527 - 428/528


506 Ch 5: Paired Experiments and Randomized Block Experiments

drug have similar medians. The relatively large spread in the placebo group suggests
that some patients responded adversely to the placebo compared to no drug, whereas
others responded positively.
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(itch.long, aes(x = Treatment, y = Seconds))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = 0),
colour = "black", linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_hline(aes(yintercept = mean(Seconds)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# colored line for each patient
p <- p + geom_line(aes(group = Patient, colour = Patient), alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(aes(colour = Patient))
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.5)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, aes(colour=Treatment), alpha = 0.8)
p <- p + labs(title = "Comparison of Treatments for Itching, Treatment means")
p <- p + ylab("Duration of itching (seconds)")
# removes legend
p <- p + theme(legend.position="none")
print(p)


[Figure: "Comparison of Treatments for Itching, Treatment means": boxplots of itching duration (seconds) for Nodrug, Placebo, Papv, Morp, Amino, Pento, and Tripel, with each Patient's measurements connected by a line, group mean diamonds, and normal-based confidence limits.]

To fit the RB model in lm(), you need to specify blocks (Patient) and treatments
(Treatment) as factor variables, and include each to the right of the tilde symbol in
the formula statement. The response variable Seconds appears to the left of the
tilde.
lm.s.t.p <- lm(Seconds ~ Treatment + Patient, data = itch.long)
library(car)
Anova(lm.s.t.p, type=3)
## Anova Table (Type III tests)
##
## Response: Seconds
## Sum Sq Df F value Pr(>F)
## (Intercept) 155100 1 50.1133 3.065e-09 ***
## Treatment 53013 6 2.8548 0.017303 *
## Patient 103280 9 3.7078 0.001124 **
## Residuals 167130 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.s.t.p)
##
## Call:


## lm(formula = Seconds ~ Treatment + Patient, data = itch.long)


##
## Residuals:
## Min 1Q Median 3Q Max
## -81.286 -34.800 -8.393 30.900 148.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 188.286 26.598 7.079 3.07e-09 ***
## TreatmentPlacebo 13.800 24.880 0.555 0.58141
## TreatmentPapv -72.800 24.880 -2.926 0.00501 **
## TreatmentMorp -43.000 24.880 -1.728 0.08965 .
## TreatmentAmino -46.700 24.880 -1.877 0.06592 .
## TreatmentPento -14.500 24.880 -0.583 0.56245
## TreatmentTripel -23.800 24.880 -0.957 0.34303
## Patient2 35.000 29.737 1.177 0.24436
## Patient3 -2.857 29.737 -0.096 0.92381
## Patient4 38.429 29.737 1.292 0.20176
## Patient5 11.714 29.737 0.394 0.69518
## Patient6 -18.571 29.737 -0.625 0.53491
## Patient7 -46.286 29.737 -1.557 0.12543
## Patient8 -27.286 29.737 -0.918 0.36292
## Patient9 -45.000 29.737 -1.513 0.13604
## Patient10 82.000 29.737 2.758 0.00793 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.63 on 54 degrees of freedom
## Multiple R-squared: 0.4832,Adjusted R-squared: 0.3397
## F-statistic: 3.367 on 15 and 54 DF, p-value: 0.00052

The order to look at output follows the hierarchy of multi-parameter tests down
to single-parameter tests.
1. The F-test at the bottom of the summary() tests for both no block effects and
no treatment effects. If there are no block effects and no treatment effects then
the mean itching time is independent of treatment and patients. The p-value
of 0.0005 strongly suggests that the population mean itching times are not all
equal.
2. The ANOVA table at top from Anova() partitions the Model SS into the SS
for Blocks (Patients) and Treatments. The Mean Squares, F-statistics, and
p-values for testing these effects are given. For a RB design with the same
number of responses per block (i.e., no missing data), the Type I and Type III
SS are identical, and correspond to the formulas given earlier. The distinction
between Type I and Type III SS is important for unbalanced problems, an issue
we discuss later. The F -tests show significant differences among the treatments
(p-value=0.017) and among patients (p-value=0.001).


3. The individual parameter (coefficient) estimates in the summary() are likely


of less interest since they test differences from the baseline group, only. The
multiple comparisons in the next section will indicate which factor levels are
different from others.

Multiple comparisons   Multiple comparisons and contrasts are not typically
straightforward in R, though some newer packages are helping to make them easier. Below I
show one way that I think is relatively easy.
The package multcomp is used to specify which factor to perform multiple com-
parisons over and which p-value adjustment method to use. Below I use Tukey ad-
justments, first.
# multcomp has functions for multiple comparisons
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: ’MASS’
## The following object is masked from ’package:sm’:
##
## muscle
##
## Attaching package: ’TH.data’
## The following object is masked from ’package:MASS’:
##
## geyser
## The following object is masked from ’package:sm’:
##
## geyser
##
## Attaching package: ’multcomp’
## The following object is masked by ’.GlobalEnv’:
##
## waste
# Use the ANOVA object and run a "General Linear Hypothesis Test"
# specifying a linfct (linear function) to be tested.
# The mpc (multiple comparison) specifies the factor and method.
# Here: correcting over Treatment using Tukey contrast corrections.
glht.itch.t <- glht(aov(lm.s.t.p), linfct = mcp(Treatment = "Tukey"))
summary(glht.itch.t)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##


##
## Fit: aov(formula = lm.s.t.p)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Placebo - Nodrug == 0 13.80 24.88 0.555 0.9978
## Papv - Nodrug == 0 -72.80 24.88 -2.926 0.0699 .
## Morp - Nodrug == 0 -43.00 24.88 -1.728 0.6003
## Amino - Nodrug == 0 -46.70 24.88 -1.877 0.5038
## Pento - Nodrug == 0 -14.50 24.88 -0.583 0.9971
## Tripel - Nodrug == 0 -23.80 24.88 -0.957 0.9610
## Papv - Placebo == 0 -86.60 24.88 -3.481 0.0162 *
## Morp - Placebo == 0 -56.80 24.88 -2.283 0.2710
## Amino - Placebo == 0 -60.50 24.88 -2.432 0.2054
## Pento - Placebo == 0 -28.30 24.88 -1.137 0.9135
## Tripel - Placebo == 0 -37.60 24.88 -1.511 0.7370
## Morp - Papv == 0 29.80 24.88 1.198 0.8920
## Amino - Papv == 0 26.10 24.88 1.049 0.9398
## Pento - Papv == 0 58.30 24.88 2.343 0.2434
## Tripel - Papv == 0 49.00 24.88 1.969 0.4456
## Amino - Morp == 0 -3.70 24.88 -0.149 1.0000
## Pento - Morp == 0 28.50 24.88 1.146 0.9108
## Tripel - Morp == 0 19.20 24.88 0.772 0.9867
## Pento - Amino == 0 32.20 24.88 1.294 0.8516
## Tripel - Amino == 0 22.90 24.88 0.920 0.9676
## Tripel - Pento == 0 -9.30 24.88 -0.374 0.9998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

With summary(), the p-value adjustment can be coerced into one of several popular
methods, such as Bonferroni. Notice that the significance is lower (larger p-value)
for Bonferroni below than Tukey above. Note comment at bottom of output that
“(Adjusted p values reported -- bonferroni method)”. Passing the summary
to plot() will create a plot of the pairwise intervals for difference between factor
levels.
Recall how the Bonferroni correction works. A comparison of c pairs of levels
from one factor having a family error rate of 0.05 or less is attained by comparing
pairs of treatments at the 0.05/c level. Using this criteria, the population mean
response for factor levels (averaged over the other factor) are significantly different if
the p-value for the test is 0.05/c or less. The output actually adjusts the p-values by
reporting p-value×c, so that the reported adjusted p-value can be compared to the
0.05 significance level.
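As a quick arithmetic check of this adjustment, take the Papv - Placebo comparison
(t = -3.481 on 54 error df from the output above) and the c = choose(7, 2) = 21 pairwise
treatment comparisons (object names below are illustrative only):

# hand calculation of the Bonferroni-adjusted p-value
n.comp <- choose(7, 2)                  # 21 pairwise treatment comparisons
t.pp   <- -3.481                        # t value for Papv - Placebo, 54 error df
p.raw  <- 2 * pt(-abs(t.pp), df = 54)   # unadjusted two-sided p-value, about 0.001
min(1, n.comp * p.raw)                  # about 0.021, matching the output below
# p.adjust(p.raw, method = "bonferroni", n = n.comp) gives the same value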
summary(glht.itch.t, test = adjusted("bonferroni"))
##


## Simultaneous Tests for General Linear Hypotheses


##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = lm.s.t.p)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Placebo - Nodrug == 0 13.80 24.88 0.555 1.000
## Papv - Nodrug == 0 -72.80 24.88 -2.926 0.105
## Morp - Nodrug == 0 -43.00 24.88 -1.728 1.000
## Amino - Nodrug == 0 -46.70 24.88 -1.877 1.000
## Pento - Nodrug == 0 -14.50 24.88 -0.583 1.000
## Tripel - Nodrug == 0 -23.80 24.88 -0.957 1.000
## Papv - Placebo == 0 -86.60 24.88 -3.481 0.021 *
## Morp - Placebo == 0 -56.80 24.88 -2.283 0.554
## Amino - Placebo == 0 -60.50 24.88 -2.432 0.386
## Pento - Placebo == 0 -28.30 24.88 -1.137 1.000
## Tripel - Placebo == 0 -37.60 24.88 -1.511 1.000
## Morp - Papv == 0 29.80 24.88 1.198 1.000
## Amino - Papv == 0 26.10 24.88 1.049 1.000
## Pento - Papv == 0 58.30 24.88 2.343 0.479
## Tripel - Papv == 0 49.00 24.88 1.969 1.000
## Amino - Morp == 0 -3.70 24.88 -0.149 1.000
## Pento - Morp == 0 28.50 24.88 1.146 1.000
## Tripel - Morp == 0 19.20 24.88 0.772 1.000
## Pento - Amino == 0 32.20 24.88 1.294 1.000
## Tripel - Amino == 0 22.90 24.88 0.920 1.000
## Tripel - Pento == 0 -9.30 24.88 -0.374 1.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
# plot the summary
op <- par(no.readonly = TRUE) # the whole list of settable par's.
# make wider left margin to fit contrast labels
par(mar = c(5, 10, 4, 2) + 0.1) # order is c(bottom, left, top, right)
# plot bonferroni-corrected difference intervals
plot(summary(glht.itch.t, test = adjusted("bonferroni"))
, sub="Bonferroni-adjusted Treatment contrasts")
par(op) # reset plotting options


[Figure: 95% family-wise confidence intervals for all pairwise Treatment contrasts, Bonferroni-adjusted.]
The Bonferroni comparisons for Treatment suggest that papaverine induces a lower
mean itching time than placebo. All the other comparisons of treatments are insignif-
icant. The comparison of Patient blocks is of less interest.
### Code for the less interesting contrasts.
### Testing multiple factors may be of interest in other problems.
### Note that the first block of code below corrects the p-values
### for all the tests done for both factors together,
### that is, the Bonferroni-corrected significance level is (alpha / (t + p))
### where t = number of treatment comparisons
### and p = number of patient comparisons.

# # correcting over Treatment and Patient


# glht.itch.tp <- glht(aov(lm.s.t.p), linfct = mcp(Treatment = "Tukey"
# , Patient = "Tukey"))
# summary(glht.itch.tp, test = adjusted("bonferroni"))
# plot(summary(glht.itch.tp, test = adjusted("bonferroni")))

# # correcting over Patient, only


# glht.itch.p <- glht(aov(lm.s.t.p), linfct = mcp(Patient = "Tukey"))
# summary(glht.itch.p, test = adjusted("bonferroni"))
# plot(summary(glht.itch.p, test = adjusted("bonferroni")))

Diagnostic Analysis for the RB Analysis A diagnostic analysis of ANOVA-


type models, including the RB model, is easily performed using the lm() output. The
normal quantile (or QQ-plot) shows the residual distribution is slightly skewed to the
right, in part, due to three cases that are not fitted well by the model (the outliers


in the boxplots). Except for these cases, which are also the most influential cases
(Cook’s distance), the plot of the studentized residuals against fitted values shows no
gross abnormalities.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.s.t.p, which = c(1,4,6))

plot(itch.long$Treatment, lm.s.t.p$residuals, main="Residuals vs Treatment")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(itch.long$Patient, lm.s.t.p$residuals, main="Residuals vs Patient")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.s.t.p$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 20 52 48
## residuals vs order of data
#plot(lm.s.t.p$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.s.t.p: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs Treatment, Residuals vs Patient, and QQ Plot; cases 20, 52, and 48 stand out.]

Although the F -test for comparing treatments is not overly sensitive to modest
deviations from normality, I will present a non-parametric analysis as a backup, to
see whether similar conclusions are reached about the treatments.

Non-parametric Analysis of a RB Experiment Milton Friedman developed a


non-parametric test for comparing treatments in an unreplicated randomized block
design where the normality assumption may be violated. An unreplicated complete
block design has exactly one observation in y for each combination of levels of groups
and blocks. The null hypothesis is that apart from an effect of blocks, the location


parameter of y is the same in each of the groups.


The output suggests significant differences among treatments, which supports the
earlier conclusion.
# Friedman test for differences between groups conditional on blocks.
# The formula is of the form a ~ b | c,
# where a, b and c give the data values (a)
# and corresponding groups (b) and blocks (c), respectively.
friedman.test(Seconds ~ Treatment | Patient, data = itch.long)
##
## Friedman rank sum test
##
## data: Seconds and Treatment and Patient
## Friedman chi-squared = 14.887, df = 6, p-value = 0.02115
# Quade test is very similar to the Friedman test (compare the help pages).
quade.test(Seconds ~ Treatment | Patient, data = itch.long)
##
## Quade test
##
## data: Seconds and Treatment and Patient
## Quade F = 3.7321, num df = 6, denom df = 54, p-value =
## 0.003542

5.2 Extending the One-Factor Design to Multiple Factors
The CRD (completely randomized design) for comparing insecticides below varies
the levels of one factor (insecticide), while controlling other factors that influence
survival time. The inferences from the one-way ANOVA apply to beetles with a
given age from the selected strain that might be given the selected concentration of
the insecticides. Any generalization of the conclusions to other situations must be
justified scientifically, typically through further experimentation.
There are several ways to broaden the scope of the study. For example, several
strains of beetles or several concentrations of the insecticide might be used. For
simplicity, consider a simple two-factor experiment where three concentrations (Low,
Medium, and High) are applied with each of the four insecticides. This is a completely
crossed two-factor experiment where each of the 4 × 3 = 12 combinations of the
two factors (insecticide and dose) are included in the comparison of survival times.
With this experiment, the scientist can compare insecticides, compare concentrations,
and check for an interaction between dose and insecticide.
Assuming that 48 beetles are available, the scientist would randomly assign them
to the 12 experimental groups, giving prespecified numbers of beetles to the 12 groups.


For simplicity, assume that the experiment is balanced, that is, the same number of
beetles (4) is assigned to each group (12 × 4 = 48). This is a CRD with two factors.
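A minimal sketch of how this balanced two-factor layout and randomization might be
generated in R (the labels and object names are illustrative only):

# 4 insecticides x 3 doses, 4 beetles per treatment combination (48 in all)
set.seed(4803)                                     # arbitrary seed
design <- expand.grid(insecticide = c("A", "B", "C", "D")
                    , dose        = c("low", "medium", "high"))
design <- design[rep(1:nrow(design), each = 4), ]  # 4 beetles per combination
design$beetle <- sample(1:48)                      # random assignment of beetles
table(design$insecticide, design$dose)             # 4 per cell: balanced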

5.2.1 Example: Beetle insecticide two-factor design


The data below were collected using the experimental design just described. The table
gives survival times of groups of four beetles randomly allocated to twelve treatment
groups obtained by crossing the levels of four insecticides (A, B, C, D) at each of three
concentrations of the insecticides (1=Low, 2=Medium, 3=High). This is a balanced
4-by-3 factorial design (two-factor design) that is replicated four times. The unit of
measure for the survival times is 10 hours, that is, 0.3 is a survival time of 3 hours.
#### Example: Beetles
beetles <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_beetles.dat"
, header = TRUE)
# make dose a factor variable and label the levels
beetles$dose <- factor(beetles$dose, labels = c("low","medium","high"))

dose insecticide t1 t2 t3 t4
1 low A 0.3100 0.4500 0.4600 0.4300
2 low B 0.8200 1.1000 0.8800 0.7200
3 low C 0.4300 0.4500 0.6300 0.7600
4 low D 0.4500 0.7100 0.6600 0.6200
5 medium A 0.3600 0.2900 0.4000 0.2300
6 medium B 0.9200 0.6100 0.4900 1.2400
7 medium C 0.4400 0.3500 0.3100 0.4000
8 medium D 0.5600 1.0200 0.7100 0.3800
9 high A 0.2200 0.2100 0.1800 0.2300
10 high B 0.3000 0.3700 0.3800 0.2900
11 high C 0.2300 0.2500 0.2400 0.2200
12 high D 0.3000 0.3600 0.3100 0.3300
First we reshape the data to long format so each observation is its own row in the
data.frame and indexed by the dose and insecticide variables.
library(reshape2)
beetles.long <- melt(beetles
, id.vars = c("dose", "insecticide")
, variable.name = "number"
, value.name = "hours10"
)
str(beetles.long)
## 'data.frame': 48 obs. of 4 variables:
## $ dose : Factor w/ 3 levels "low","medium",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ insecticide: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
## $ number : Factor w/ 4 levels "t1","t2","t3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hours10 : num 0.31 0.82 0.43 0.45 0.36 0.92 0.44 0.56 0.22 0.3 ...


head(beetles.long)
## dose insecticide number hours10
## 1 low A t1 0.31
## 2 low B t1 0.82
## 3 low C t1 0.43
## 4 low D t1 0.45
## 5 medium A t1 0.36
## 6 medium B t1 0.92
The basic units of analysis are the cell means, which are the averages of the 4
observations in each of the 12 treatment combinations. For example, in the table below,
the sample mean survival for the 4 beetles given a low dose (dose=1) of insecticide A
is 0.413. From the cell means we obtain the dose and insecticide marginal means
by averaging over the levels of the other factor. For example, the marginal mean
for insecticide A is the average of the cell means for the 3 treatment combinations
involving insecticide A: 0.314 = (0.413 + 0.320 + 0.210)/3.
Cell Means                  Dose
Insecticide       1       2       3   Insect marg
A             0.413   0.320   0.210         0.314
B             0.880   0.815   0.335         0.677
C             0.568   0.375   0.235         0.393
D             0.610   0.668   0.325         0.534
Dose marg     0.618   0.544   0.277         0.480
Because the experiment is balanced, a marginal mean is the average of all observa-
tions that receive a given treatment. For example, the marginal mean for insecticide
A is the average survival time for the 16 beetles given insecticide A.
Looking at the table of means, the insecticides have noticeably different mean
survival times averaged over doses, with insecticide A having the lowest mean survival
time averaged over doses. Similarly, higher doses tend to produce lower survival times.
A more formal approach to analyzing the table of means is given in the next section.
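
The table of cell and marginal means above can be reproduced directly from
beetles.long; the base-R sketch below is just a quick check (the ddply() versions
used later in this chapter give the same numbers).
# Sketch: cell means, marginal means, and grand mean from the long-format data.
with(beetles.long, round(tapply(hours10, list(insecticide, dose), mean), 3)) # cell means
with(beetles.long, round(tapply(hours10, insecticide, mean), 3))             # insecticide marginal means
with(beetles.long, round(tapply(hours10, dose, mean), 3))                    # dose marginal means
round(mean(beetles.long$hours10), 3)                                         # grand mean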

5.2.2 The Interaction Model for a Two-Factor Experiment


Assume that you designed a balanced two-factor experiment with K responses at
each combination of the I levels of factor 1 (F1) with the J levels of factor 2 (F2).
The total number of responses is KIJ, or K times the IJ treatment combinations.
Let yijk be the k th response at the ith level of F1 and the j th level of F2. A generic
model for the experiment expresses yijk as a mean response plus a residual:
yijk = µij + eijk ,
where µij is the population mean response for the treatment defined by the ith level of
F1 combined with the j th level of F2. As in a one-way ANOVA, the responses within
and across treatment groups are assumed to be independent, normally distributed,
and have constant variance.
The interaction model expresses the population means as

µij = µ + αi + βj + (αβ)ij ,

where µ is a grand mean, αi is the effect for the ith level of F1, βj is the effect for the
j th level of F2, and (αβ)ij is the interaction between the ith level of F1 and the j th
level of F2. (Note that (αβ) is a single term distinct from α and β; it is not
their product.) The model is often written

yijk = µ + αi + βj + (αβ)ij + eijk ,

meaning
Response = Grand Mean + F1 effect + F2 effect + F1-by-F2 interaction +
Residual.
The additive model having only main effects, no interaction terms, is yijk =
µ + αi + βj + eijk , meaning
Response = Grand Mean + F1 effect + F2 effect + Residual.
The effects of F1 and F2 on the mean are additive.

Defining effects from cell means


The effects that define the population means and the usual hypotheses of interest can
be formulated from the table of population means, given here.
                            Level of F2
Level of F1       1       2     ···       J    F1 marg
      1         µ11     µ12     ···     µ1J       µ̄1·
      2         µ21     µ22     ···     µ2J       µ̄2·
     ...        ...     ...     ···     ...       ...
      I         µI1     µI2     ···     µIJ       µ̄I·
F2 marg         µ̄·1     µ̄·2     ···     µ̄·J       µ̄··
The F1 marginal population means are averages within rows (over columns) of
the table:

                    µ̄i· = (1/J) Σc µic .

The F2 marginal population means are averages within columns (over rows):

                    µ̄·j = (1/I) Σr µrj .


The overall or grand population mean is the average of the cell means:

                    µ̄·· = (1/IJ) Σrc µrc = (1/I) Σi µ̄i· = (1/J) Σj µ̄·j .

Using this notation, the effects in the interaction model are µ = µ̄·· , αi = µ̄i· − µ̄·· ,
βj = µ̄·j − µ̄·· , and (αβ)ij = µij − µ̄i· − µ̄·j + µ̄·· . The effects sum to zero:

                    Σi αi = Σj βj = Σij (αβ)ij = 0,

and satisfy µij = µ + αi + βj + (αβ)ij (i.e., each cell mean is the sum of its effects), as
required under the model.
The F1 and F2 effects are analogous to treatment effects in a one-factor experi-
ment, except that here the treatment means are averaged over the levels of the other
factor. The interaction effect will be interpreted later.

Estimating effects from the data


Let

                    ȳij = (1/K) Σk yijk    and    s2ij
be the sample mean and variance, respectively, for the K responses at the ith level of
F1 and the j th level of F2. Inferences about the population means are based on the
table of sample means:
                            Level of F2
Level of F1       1       2     ···       J    F1 marg
      1         ȳ11     ȳ12     ···     ȳ1J       ȳ1·
      2         ȳ21     ȳ22     ···     ȳ2J       ȳ2·
     ...        ...     ...     ···     ...       ...
      I         ȳI1     ȳI2     ···     ȳIJ       ȳI·
F2 marg         ȳ·1     ȳ·2     ···     ȳ·J       ȳ··
The F1 marginal sample means are averages within rows of the table:

                    ȳi· = (1/J) Σc ȳic .

The F2 marginal sample means are averages within columns:

                    ȳ·j = (1/I) Σr ȳrj .


Finally, ȳ·· is the average of the cell sample means:

                    ȳ·· = (1/IJ) Σij ȳij = (1/I) Σi ȳi· = (1/J) Σj ȳ·j .

The sample sizes in each of the IJ treatment groups are equal (K), so ȳi· is the
sample average of all responses at the ith level of F1, ȳ·j is the sample average of all
responses at the j th level of F2, and ȳ·· is the average response in the experiment.
Under the interaction model, the estimated population mean for the (i, j)th cell is
the observed cell mean: µ̂ij = ȳij . This can be partitioned into estimated effects

    µ̂       = ȳ··                       the estimated grand mean
    α̂i      = ȳi· − ȳ··                 the estimated F1 effect,  i = 1, 2, . . . , I
    β̂j      = ȳ·j − ȳ··                 the estimated F2 effect,  j = 1, 2, . . . , J
    (αβ)̂ij  = ȳij − ȳi· − ȳ·j + ȳ··     the estimated cell interaction            (5.1)

that satisfy

    µ̂ij = µ̂ + α̂i + β̂j + (αβ)̂ij .
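
These estimates are easy to compute directly from the cell means, and model.tables()
reports the same decomposition from an aov() fit. The sketch below is illustrative
(it is not part of the original notes) and uses the beetle data already stored in
beetles.long.
# Sketch: the estimated effects in (5.1), by hand and via model.tables().
ybar   <- with(beetles.long, tapply(hours10, list(dose, insecticide), mean)) # cell means
ybar.i <- rowMeans(ybar)                         # dose (row) marginal means
ybar.j <- colMeans(ybar)                         # insecticide (column) marginal means
ybar.. <- mean(ybar)                             # grand mean
alpha.hat <- ybar.i - ybar..                     # estimated dose effects
beta.hat  <- ybar.j - ybar..                     # estimated insecticide effects
ab.hat <- sweep(sweep(ybar, 1, ybar.i), 2, ybar.j) + ybar..  # estimated interaction effects
model.tables(aov(hours10 ~ dose * insecticide, data = beetles.long), type = "effects")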

The ANOVA table


The ANOVA table for a balanced two-factor design decomposes the total variation in
the data, as measured by the Total SS, into components that measure the variation
of marginal sample means for F1 and F2 (the F1 SS and F2 SS), a component that
measures the degree to which the factors interact (the F1-by-F2 Interaction SS), and
a component that pools the sample variances across the IJ samples (the Error SS).
Each SS has a df, given in the following ANOVA table. As usual, the MS for each
source of variation is the corresponding SS divided by the df. The MS Error estimates
the common population variance for the IJ treatments.
Source        df                 SS                                       MS
F1            I − 1              KJ Σi (ȳi· − ȳ·· )2                      MS F1 = SS/df
F2            J − 1              KI Σj (ȳ·j − ȳ·· )2                      MS F2 = SS/df
Interaction   (I − 1)(J − 1)     K Σij (ȳij − ȳi· − ȳ·j + ȳ·· )2          MS Inter = SS/df
Error         IJ(K − 1)          (K − 1) Σij s2ij                         MSE = SS/df
Total         IJK − 1            Σijk (yijk − ȳ·· )2
The standard tests in the two-factor analysis, and their interpretations, are:
The test of no F1 effect: H0 : α1 = · · · = αI = 0 is equivalent to testing H0 :
µ̄1· = µ̄2· = · · · = µ̄I· . The absence of an F1 effect implies that each level of F1 has
the same population mean response when the means are averaged over levels
of F2. The test for no F1 effect is based on MS F1/MS Error, which is compared
to the upper tail of an F-distribution with numerator and denominator df of I − 1
and IJ(K − 1), respectively. H0 is rejected when the F1 marginal means ȳi· vary
significantly relative to the within sample variation. Equivalently, H0 is rejected
when the sum of squared F1 effects (between sample variation) is large relative to
the within sample variation.
The test of no F2 effect: H0 : β1 = · · · = βJ = 0 is equivalent to testing H0 :
µ̄·1 = µ̄·2 = · · · = µ̄·J . The absence of an F2 effect implies that each level of F2
has the same population mean response when the means are averaged over
levels of F1. The test for no F2 effect is based on MS F2/MS Error, which is
compared to an F-distribution with numerator and denominator df of J − 1 and
IJ(K − 1), respectively. H0 is rejected when the F2 marginal means ȳ·j vary
significantly relative to the within sample variation. Equivalently, H0 is rejected
when the sum of squared F2 effects (between sample variation) is large relative to
the within sample variation.
The test of no interaction: H0 : (αβ)ij = 0 for all i and j is based on MS Interact/MS Error,
which is compared to an F-distribution with numerator and denominator df of
(I − 1)(J − 1) and IJ(K − 1), respectively.
The interaction model places no restrictions on the population means µij . Since
the population means can be arbitrary, the interaction model can be viewed as a one
factor model with IJ treatments. One connection between the two ways of viewing the
two-factor analysis is that the F1, F2, and Interaction SS for the two-way interaction
model sum to the Treatment or Model SS for comparing the IJ treatments. The
Error SS for the two-way interaction model is identical to the Error SS for a one-way
ANOVA of the IJ treatments. An overall test of no differences in the IJ population
means is part of the two-way analysis.
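
For the beetle data this connection is easy to verify numerically. The sketch below
is illustrative only: it compares the sequential ANOVA for the two-factor model with
a one-way ANOVA on the 12 dose-by-insecticide cells (the design is balanced, so the
order of terms does not matter).
# Sketch: dose SS + insecticide SS + interaction SS equals the one-way treatment SS.
aov.2way <- anova(lm(hours10 ~ dose * insecticide, data = beetles.long))
aov.1way <- anova(lm(hours10 ~ interaction(dose, insecticide), data = beetles.long))
sum(aov.2way[c("dose", "insecticide", "dose:insecticide"), "Sum Sq"]) # two-factor pieces
aov.1way[1, "Sum Sq"]                                                 # one-way treatment SS
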
I always summarize the data using the cell and marginal means instead of the
estimated effects, primarily because means are the basic building blocks for the anal-
ysis. My discussion of the model and tests emphasizes both approaches to help you
make the connection with the two ways this material is often presented in texts.

Understanding interaction
To understand interaction, suppose you (conceptually) plot the means in each row
of the population table, giving what is known as the population mean profile
plot. The F1 marginal population means average the population means within the
F1 profiles. At each F2 level, the F2 marginal mean averages the population cell
means across F1 profiles.

No interaction is present if the plot has perfectly parallel F1 profiles, as in the
plot below for a 3 × 5 experiment. The levels of F1 and F2 do not interact. That is,

parallel profiles
  ⇔ µij − µhj is independent of j for each i and h
      (difference between levels of F1 does not depend on level of F2)
  ⇔ µij − µ̄i· = µhj − µ̄h· for all i, j, h
      (difference between level of F2 j and F2 mean does not depend on level of F1)
  ⇔ µij − µ̄i· = µ̄·j − µ̄·· for all i, j
      (difference between level of F2 j and F2 mean is the same for all levels of F1)
  ⇔ µij − µ̄i· − µ̄·j + µ̄·· = 0 for all i, j
      (interaction effect is 0)
  ⇔ (αβ)ij = 0 for all i, j
      (interaction effect is 0)
  ⇔ no interaction term in the model.
[Figure: population mean profile plot for a 3 × 5 experiment with parallel F1 profiles
(no interaction); x-axis: Level of Factor 2, y-axis: Population Mean, one profile each
for F1 = 1, 2, 3.]

Interaction is present if the profiles are not perfectly parallel. An example of a
profile plot for a two-factor experiment (3 × 5) with interaction is given below.


[Figure: population mean profile plot for a 3 × 5 two-factor experiment with
interaction (non-parallel F1 profiles); x-axis: Level of Factor 2, y-axis: Population
Mean, one profile each for F1 = 1, 2, 3.]

The roles of F1 and F2 can be reversed in these plots without changing the assess-
ment of a presence or absence of interaction. It is often helpful to view the interaction
plot from both perspectives.
A qualitative check for interaction can be based on the sample means profile
plot, but keep in mind that profiles of sample means are never perfectly parallel even
when the factors do not interact in the population. The Interaction SS measures the
extent of non-parallelism in the sample mean profiles. In particular, the Interaction
SS is zero when the sample mean profiles are perfectly parallel because (αβ)̂ij = 0 for
all i and j.
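
A tiny simulation makes this concrete. The sketch below is made up for illustration
(the factor levels and means are arbitrary): the cell means are exactly additive, so
the fitted interaction SS is zero up to rounding (the F columns are not meaningful
here because there is no residual variation).
# Sketch: exactly parallel profiles give an interaction SS of (numerically) zero.
dat <- expand.grid(F1 = factor(1:3), F2 = factor(1:5), rep = 1:2)
dat$y <- c(2, 6, 4)[dat$F1] + c(1, 3, 5, 2, 4)[dat$F2]  # additive cell means, no noise
anova(lm(y ~ F1 * F2, data = dat))["F1:F2", "Sum Sq"]   # essentially 0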

5.2.3 Example: Survival Times of Beetles


First we generate cell means and a sample means profile plot (interaction plot) for
the beetle experiment. The ddply() function was used to obtain the 12 treatment
cell means. Three variables were needed to represent each response in the data set:
dose (1-3, categorical), insecticide (A-D, categorical), and time (the survival time).
As noted earlier, the insecticides have noticeably different mean survival times
averaged over doses, with insecticide A having the lowest mean. Similarly, higher
doses tend to produce lower survival times. The sample means profile plot shows
some evidence of interaction.
library(plyr)
# Calculate the cell means for each (dose, insecticide) combination


mean(beetles.long[, "hours10"])
## [1] 0.479375
beetles.mean <- ddply(beetles.long, .(), summarise, m = mean(hours10))
beetles.mean
## .id m
## 1 <NA> 0.479375
beetles.mean.d <- ddply(beetles.long, .(dose), summarise, m = mean(hours10))
beetles.mean.d
## dose m
## 1 low 0.617500
## 2 medium 0.544375
## 3 high 0.276250
beetles.mean.i <- ddply(beetles.long, .(insecticide), summarise, m = mean(hours10))
beetles.mean.i
## insecticide m
## 1 A 0.3141667
## 2 B 0.6766667
## 3 C 0.3925000
## 4 D 0.5341667
beetles.mean.di <- ddply(beetles.long, .(dose,insecticide), summarise, m = mean(hours10))
beetles.mean.di
## dose insecticide m
## 1 low A 0.4125
## 2 low B 0.8800
## 3 low C 0.5675
## 4 low D 0.6100
## 5 medium A 0.3200
## 6 medium B 0.8150
## 7 medium C 0.3750
## 8 medium D 0.6675
## 9 high A 0.2100
## 10 high B 0.3350
## 11 high C 0.2350
## 12 high D 0.3250
# Interaction plots, ggplot

p <- ggplot(beetles.long, aes(x = dose, y = hours10, colour = insecticide, shape = insecticide))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = insecticide), size = 1.5)
p <- p + labs(title = "Beetles interaction plot, insecticide by dose")
print(p)


p <- ggplot(beetles.long, aes(x = insecticide, y = hours10, colour = dose, shape = dose))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = dose), size = 1.5)
p <- p + labs(title = "Beetles interaction plot, dose by insecticide")
print(p)

[Figure: ggplot interaction plots with boxplots and cell means overlaid. Left: "Beetles
interaction plot, insecticide by dose" (hours10 vs dose, coloured by insecticide).
Right: "Beetles interaction plot, dose by insecticide" (hours10 vs insecticide,
coloured by dose).]

# Interaction plots, base graphics

interaction.plot(beetles.long$dose, beetles.long$insecticide, beetles.long$hours10


, main = "Beetles interaction plot, insecticide by dose")
interaction.plot(beetles.long$insecticide, beetles.long$dose, beetles.long$hours10
, main = "Beetles interaction plot, dose by insecticide")


[Figure: base-graphics interaction plots of mean hours10. Left: "Beetles interaction
plot, insecticide by dose" (x-axis: dose, one line per insecticide). Right: "Beetles
interaction plot, dose by insecticide" (x-axis: insecticide, one line per dose).]

In the lm() function below we specify a first-order model with interactions, in-
cluding the main effects and two-way interactions. The interaction between dose and
insecticide is indicated with dose:insecticide. The shorthand dose*insecticide
expands to “dose + insecticide + dose:insecticide” for this first-order model.
The F -test at the bottom of the summary() tests for no differences among the
population mean survival times for the 12 dose and insecticide combinations. The
p-value of < 0.0001 strongly suggests that the population mean survival times are
not all equal.
The next summary at the top gives two partitionings of the one-way ANOVA
Treatment SS into the SS for Dose, Insecticide, and the Dose by Insecticide interac-
tion. The Mean Squares, F-statistics and p-values for testing these effects are given.
The p-values for the F-statistics indicate that the dose and insecticide effects are
significant at the 0.01 level. The F-test for no dose by insecticide interaction is not
significant at the 0.10 level (p-value=0.112). Thus, the interaction seen in the profile
plot of the sample means might be due solely to chance or sampling variability.
lm.h.d.i.di <- lm(hours10 ~ dose + insecticide + dose:insecticide
, data = beetles.long)
# lm.h.d.i.di <- lm(hours10 ~ dose*insecticide, data = beetles.long) # equivalent
library(car)
Anova(lm.h.d.i.di, type=3)
## Anova Table (Type III tests)
##
## Response: hours10
## Sum Sq Df F value Pr(>F)
## (Intercept) 0.68063 1 30.6004 2.937e-06 ***
## dose 0.08222 2 1.8482 0.1721570
## insecticide 0.45395 3 6.8031 0.0009469 ***


## dose:insecticide 0.25014 6 1.8743 0.1122506


## Residuals 0.80072 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.h.d.i.di)
##
## Call:
## lm(formula = hours10 ~ dose + insecticide + dose:insecticide,
## data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32500 -0.04875 0.00500 0.04312 0.42500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41250 0.07457 5.532 2.94e-06 ***
## dosemedium -0.09250 0.10546 -0.877 0.3862
## dosehigh -0.20250 0.10546 -1.920 0.0628 .
## insecticideB 0.46750 0.10546 4.433 8.37e-05 ***
## insecticideC 0.15500 0.10546 1.470 0.1503
## insecticideD 0.19750 0.10546 1.873 0.0692 .
## dosemedium:insecticideB 0.02750 0.14914 0.184 0.8547
## dosehigh:insecticideB -0.34250 0.14914 -2.297 0.0276 *
## dosemedium:insecticideC -0.10000 0.14914 -0.671 0.5068
## dosehigh:insecticideC -0.13000 0.14914 -0.872 0.3892
## dosemedium:insecticideD 0.15000 0.14914 1.006 0.3212
## dosehigh:insecticideD -0.08250 0.14914 -0.553 0.5836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1491 on 36 degrees of freedom
## Multiple R-squared: 0.7335,Adjusted R-squared: 0.6521
## F-statistic: 9.01 on 11 and 36 DF, p-value: 1.986e-07

Since the interaction is not significant, I’ll drop the interaction term and fit the
additive model with main effects only. I update the model by removing the interaction
term.
lm.h.d.i <- update(lm.h.d.i.di, ~ . - dose:insecticide )
library(car)
Anova(lm.h.d.i, type=3)
## Anova Table (Type III tests)
##
## Response: hours10
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.63654 1 65.408 4.224e-10 ***
## dose 1.03301 2 20.643 5.704e-07 ***


## insecticide 0.92121 3 12.273 6.697e-06 ***


## Residuals 1.05086 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.h.d.i)
##
## Call:
## lm(formula = hours10 ~ dose + insecticide, data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25167 -0.09625 -0.01490 0.06177 0.49833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45229 0.05592 8.088 4.22e-10 ***
## dosemedium -0.07313 0.05592 -1.308 0.19813
## dosehigh -0.34125 0.05592 -6.102 2.83e-07 ***
## insecticideB 0.36250 0.06458 5.614 1.43e-06 ***
## insecticideC 0.07833 0.06458 1.213 0.23189
## insecticideD 0.22000 0.06458 3.407 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 42 degrees of freedom
## Multiple R-squared: 0.6503,Adjusted R-squared: 0.6087
## F-statistic: 15.62 on 5 and 42 DF, p-value: 1.123e-08

The Bonferroni multiple comparisons indicate which treatment effects are differ-
ent.
# Testing multiple factors is of interest here.
# Note that the code below corrects the p-values
# for all the tests done for both factors together,
# that is, the Bonferroni-corrected significance level is (alpha / (d + i))
# where d = number of dose comparisons
# and i = number of insecticide comparisons.

# correcting over dose and insecticide
library(multcomp)  # provides glht(), mcp(), adjusted()


glht.beetle.di <- glht(aov(lm.h.d.i), linfct = mcp(dose = "Tukey"
, insecticide = "Tukey"))
summary(glht.beetle.di, test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##


## Fit: aov(formula = lm.h.d.i)


##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## dose: medium - low == 0 -0.07313 0.05592 -1.308 1.000000
## dose: high - low == 0 -0.34125 0.05592 -6.102 2.55e-06 ***
## dose: high - medium == 0 -0.26812 0.05592 -4.794 0.000186 ***
## insecticide: B - A == 0 0.36250 0.06458 5.614 1.28e-05 ***
## insecticide: C - A == 0 0.07833 0.06458 1.213 1.000000
## insecticide: D - A == 0 0.22000 0.06458 3.407 0.013134 *
## insecticide: C - B == 0 -0.28417 0.06458 -4.400 0.000653 ***
## insecticide: D - B == 0 -0.14250 0.06458 -2.207 0.295702
## insecticide: D - C == 0 0.14167 0.06458 2.194 0.304527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
# plot the summary
op <- par(no.readonly = TRUE) # the whole list of settable par's.
# make wider left margin to fit contrast labels
par(mar = c(5, 10, 4, 2) + 0.1) # order is c(bottom, left, top, right)
# plot bonferroni-corrected difference intervals
plot(summary(glht.beetle.di, test = adjusted("bonferroni"))
, sub="Bonferroni-adjusted Treatment contrasts")
par(op) # reset plotting options

[Figure: 95% family-wise confidence intervals for the Bonferroni-adjusted treatment
contrasts (dose: medium − low, high − low, high − medium; insecticide: B − A, C − A,
D − A, C − B, D − B, D − C).]


Interpretation of the Dose and Insecticide Effects


The interpretation of the dose and insecticide main effects depends on whether
interaction is present. The distinction is important, so I will give both interpretations
to emphasize the differences. Given the test for interaction, I would likely summarize
the main effects assuming no interaction.
The average survival time decreases as the dose increases, with estimated mean
survival times of 0.618, 0.544, and 0.276, respectively. A Bonferroni comparison shows
that the population mean survival time for the high dose (averaged over insecticides)
is significantly less than the population mean survival times for the low and medium
doses (averaged over insecticides). The two lower doses are not significantly different
from each other. This leads to two dose groups:
Dose:       1=Low  2=Med  3=High
Marg Mean:  0.618  0.544  0.276
Groups:     --------------  ------

If dose and insecticide interact, you can conclude that beetles given a high dose
of the insecticide typically survive for shorter periods of time averaged over insec-
ticides. You can not, in general, conclude that the highest dose yields the lowest
survival time regardless of insecticide. For example, the difference in the medium
and high dose marginal means (0.544 - 0.276 = 0.268) estimates the typical decrease
in survival time achieved by using the high dose instead of the medium dose, averaged
over insecticides. If the two factors interact, then the difference in mean times between
the medium and high doses on a given insecticide may be significantly greater than
0.268, significantly less than 0.268, or even negative. In the latter case the medium
dose would be better than the high dose for the given insecticide, even though the
high dose gives better performance averaged over insecticides. An interaction forces
you to use the cell means to decide which combination of dose and insecticide gives
the best results (and the multiple comparisons as they were done above do not give
multiple comparisons of cell means; a single factor variable combining both factors
would need to be created). Of course, our profile plot tells us that this hypothetical
situation is probably not tenable here, but it could be so when a significant interaction
is present.
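
For completeness, here is a sketch (not part of the original analysis) of the
combined-factor approach just mentioned: interaction() builds a single 12-level
factor, and glht() can then compare all pairs of cell means.
# Sketch: pairwise comparisons of the 12 cell means via a combined factor.
beetles.long$cell <- interaction(beetles.long$dose, beetles.long$insecticide)
lm.h.cell <- lm(hours10 ~ cell, data = beetles.long)
glht.cell <- glht(aov(lm.h.cell), linfct = mcp(cell = "Tukey"))
summary(glht.cell, test = adjusted("bonferroni"))  # choose(12, 2) = 66 comparisons
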
If dose and insecticide do not interact, then the difference in marginal dose
means averaged over insecticides also estimates the difference in population mean
survival times between two doses, regardless of the insecticide. This follows from
the parallel profiles definition of no interaction. Thus, the difference in the medium
and high dose marginal means (0.544 - 0.276 = 0.268) estimates the expected decrease
in survival time anticipated from using the high dose instead of the medium dose,
regardless of the insecticide (and hence also when averaged over insecticides). A
practical implication of no interaction is that you can conclude that the high dose is
best, regardless of the insecticide used. The difference in marginal means for two doses
estimates the difference in average survival expected, regardless of the insecticide.


An ordering of the mean survival times on the four insecticides (averaged over
the three doses) is given below. Three groups are obtained from the Bonferroni
comparisons, with any two insecticides separated by one or more other insecticides
in the ordered string having significantly different mean survival times averaged over
doses.
If interaction is present, you can conclude that insecticide A is no better than C,
but significantly better than B or D, when performance is averaged over doses. If
the interaction is absent, then A is not significantly better than C, but is significantly
better than B or D, regardless of the dose. Furthermore, for example, the difference
in marginal means for insecticides B and A of 0.677 - 0.314 = 0.363 is the expected
decrease in survival time from using A instead of B, regardless of dose. This is also
the expected decrease in survival times when averaged over doses.
Insect:     B      D      C      A
Marg Mean:  0.677  0.534  0.393  0.314
Groups:     -------------
                   -------------
                          -------------
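
As an aside, the multcomp package can produce this kind of grouping automatically
with a compact letter display. The sketch below is illustrative only; it uses
multcomp's default single-step adjustment rather than the Bonferroni adjustment used
above, so the grouping could in principle differ.
# Sketch: compact letter display for the insecticide marginal means;
# levels sharing a letter are not significantly different (averaged over dose).
cld(glht(aov(lm.h.d.i), linfct = mcp(insecticide = "Tukey")))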

5.2.4 Example: Output voltage for batteries


The maximum output voltage for storage batteries is thought to be influenced by the
temperature in the location at which the battery is operated and the material used
in the plates. A scientist designed a two-factor study to examine this hypothesis,
using three temperatures (50, 65, 80), and three materials for the plates (1, 2, 3).
Four batteries were tested at each of the 9 combinations of temperature and material
type. The maximum output voltage was recorded for each battery. This is a balanced
3-by-3 factorial experiment with four observations per treatment.
#### Example: Output voltage for batteries
battery <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_battery.dat"
, header = TRUE)
battery$material <- factor(battery$material)
battery$temp <- factor(battery$temp)

material temp v1 v2 v3 v4
1 1 50 130 155 74 180
2 1 65 34 40 80 75
3 1 80 20 70 82 58
4 2 50 150 188 159 126
5 2 65 136 122 106 115
6 2 80 25 70 58 45
7 3 50 138 110 168 160
8 3 65 174 120 150 139
9 3 80 96 104 82 60


library(reshape2)
battery.long <- melt(battery
, id.vars = c("material", "temp")
, variable.name = "battery"
, value.name = "maxvolt"
)
str(battery.long)
## 'data.frame': 36 obs. of 4 variables:
## $ material: Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ temp : Factor w/ 3 levels "50","65","80": 1 2 3 1 2 3 1 2 3 1 ...
## $ battery : Factor w/ 4 levels "v1","v2","v3",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ maxvolt : int 130 34 20 150 136 25 138 174 96 155 ...
The overall F -test at the bottom indicates at least one parameter in the model is
significant. The two-way ANOVA table indicates that the main effect of temperature
and the interaction are significant at the 0.05 level; the main effect of material is not.
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
library(car)
Anova(lm.m.m.t.mt, type=3)
## Anova Table (Type III tests)
##
## Response: maxvolt
## Sum Sq Df F value Pr(>F)
## (Intercept) 72630 1 107.5664 6.456e-11 ***
## material 886 2 0.6562 0.5268904
## temp 15965 2 11.8223 0.0002052 ***
## material:temp 9614 4 3.5595 0.0186112 *
## Residuals 18231 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.m.m.t.mt)
##
## Call:
## lm(formula = maxvolt ~ material * temp, data = battery.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.750 -14.625 1.375 17.938 45.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.75 12.99 10.371 6.46e-11 ***
## material2 21.00 18.37 1.143 0.263107
## material3 9.25 18.37 0.503 0.618747
## temp65 -77.50 18.37 -4.218 0.000248 ***
## temp80 -77.25 18.37 -4.204 0.000257 ***
## material2:temp65 41.50 25.98 1.597 0.121886


## material3:temp65 79.25 25.98 3.050 0.005083 **


## material2:temp80 -29.00 25.98 -1.116 0.274242
## material3:temp80 18.75 25.98 0.722 0.476759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.98 on 27 degrees of freedom
## Multiple R-squared: 0.7652,Adjusted R-squared: 0.6956
## F-statistic: 11 on 8 and 27 DF, p-value: 9.426e-07

The cell means plots of the material profiles have different slopes, which is consis-
tent with the presence of a temperature-by-material interaction.
library(plyr)
# Calculate the cell means for each (material, temp) combination
battery.mean <- ddply(battery.long, .(), summarise, m = mean(maxvolt))
battery.mean
## .id m
## 1 <NA> 105.5278
battery.mean.m <- ddply(battery.long, .(material), summarise, m = mean(maxvolt))
battery.mean.m
## material m
## 1 1 83.16667
## 2 2 108.33333
## 3 3 125.08333
battery.mean.t <- ddply(battery.long, .(temp), summarise, m = mean(maxvolt))
battery.mean.t
## temp m
## 1 50 144.83333
## 2 65 107.58333
## 3 80 64.16667
battery.mean.mt <- ddply(battery.long, .(material,temp), summarise, m = mean(maxvolt))
battery.mean.mt
## material temp m
## 1 1 50 134.75
## 2 1 65 57.25
## 3 1 80 57.50
## 4 2 50 155.75
## 5 2 65 119.75
## 6 2 80 49.50
## 7 3 50 144.00
## 8 3 65 145.75
## 9 3 80 85.50
# Interaction plots, ggplot
p <- ggplot(battery.long, aes(x = material, y = maxvolt, colour = temp, shape = temp))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)


p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)


p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = battery.mean.mt, aes(y = m), size = 4)
p <- p + geom_line(data = battery.mean.mt, aes(y = m, group = temp), size = 1.5)
p <- p + labs(title = "Battery interaction plot, temp by material")
print(p)

p <- ggplot(battery.long, aes(x = temp, y = maxvolt, colour = material, shape = material))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = battery.mean.mt, aes(y = m), size = 4)
p <- p + geom_line(data = battery.mean.mt, aes(y = m, group = material), size = 1.5)
p <- p + labs(title = "Battery interaction plot, material by temp")
print(p)
[Figure: ggplot interaction plots with boxplots and cell means overlaid. Left: "Battery
interaction plot, temp by material" (maxvolt vs material, coloured by temp). Right:
"Battery interaction plot, material by temp" (maxvolt vs temp, coloured by material).]

The Bonferroni multiple comparisons may be inappropriate because of covariate
interactions. That is, interactions make the main effects less meaningful (or their in-
terpretation unclear) since the change in response when one factor is changed depends
on what the second factor is.
The significant interaction between temperature and material implies that you
can not directly conclude that batteries stored at 50 degrees have the highest average
output regardless of the material. Nor can you directly conclude that material 3 has
a higher average output than material 1 regardless of the temperature. You can only
conclude that the differences are significant when averaged over the levels of the other
factor.


# correcting over temp


glht.battery.t <- glht(aov(lm.m.m.t.mt), linfct = mcp(temp = "Tukey"))
## Warning in mcp2matrix(model, linfct = linfct): covariate interactions found -- default
contrast might be inappropriate
summary(glht.battery.t, test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = lm.m.m.t.mt)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0 -77.50 18.37 -4.218 0.000744 ***
## 80 - 50 == 0 -77.25 18.37 -4.204 0.000772 ***
## 80 - 65 == 0 0.25 18.37 0.014 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
# plot bonferroni-corrected difference intervals
plot(summary(glht.battery.t, test = adjusted("bonferroni"))
, sub="Bonferroni-adjusted Treatment contrasts")

[Figure: 95% family-wise confidence intervals for the Bonferroni-adjusted temperature
contrasts (65 − 50, 80 − 50, 80 − 65).]
The Bonferroni comparisons indicate that the population mean max voltage for
the three temperatures averaged over material types decreases as the temperature
increases:
Temp: 80 65 50
Marg mean: 64.17 107.58 144.83
Group: ------------- ------

However, you can compare materials at each temperature, and you can compare
temperatures for each material. At individual temperatures, material 2 and 3 (or 1
and 2) might be significantly different even though they are not significantly different


when averaged over temperatures. For example, material 2 might produce a signifi-
cantly higher average output than the other two material types at 50 degrees. This
comparison of cell means is relevant if you are interested in using the batteries at 50
degrees! Comparing cell means is possible using “lsmeans”, a point I will return to
later.
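
As a preview of that approach, a sketch of the within-temperature comparison might
look like the following (it assumes the lsmeans package, which is introduced in
Section 5.3); the materials are compared separately at each temperature under the
interaction model.
# Sketch: compare materials separately at each temperature (cell-mean comparisons).
library(lsmeans)
lsmeans(lm.m.m.t.mt, list(pairwise ~ material | temp), adjust = "bonferroni")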

5.2.5 Checking assumptions in a two-factor experiment


The normality and constant variance assumptions for a two-factor design can be visu-
ally checked using side-by-side boxplots (as was produced in the ggplot() interaction
plots) and residual plots. Another useful tool for checking constant variances is to
plot the sample deviations for each group against the group means.
Let us check the distributional assumptions for the insecticide experiment. The
sampling design suggests that the independence assumptions are reasonable. The
group sample sizes are small, so the residual plots are likely to be more informative
than the side-by-side boxplots and the plot of the standard deviations.
The code below generates plots and summary statistics for the survival times. I
used ddply() to store the means ȳij and standard deviations sij for the 12 treatment
combinations. The diagnostic plots we’ve been using for lm() displays residual plots.
Only the relevant output is presented.
The set of box plots (each representing 4 points) for each insecticide/dose combination
indicates that both the means and the standard deviations of the treatments seem to differ.
Also, there appears to be less variability for dose=3 (high) than for doses 1 and 2
in the table; the model assumes that variability is the same and does not depend on
treatment. The plot of the standard deviation vs mean shows an increasing trend.
#### Example: Beetles, checking assumptions
# boxplots, ggplot
p <- ggplot(beetles.long, aes(x = dose, y = hours10, colour = insecticide))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/interaction cell
beetles.meansd.di <- ddply(beetles.long, .(dose,insecticide), summarise
, m = mean(hours10), s = sd(hours10))
beetles.meansd.di
## dose insecticide m s
## 1 low A 0.4125 0.06946222
## 2 low B 0.8800 0.16083117
## 3 low C 0.5675 0.15671099
## 4 low D 0.6100 0.11284207
## 5 medium A 0.3200 0.07527727


## 6 medium B 0.8150 0.33630343


## 7 medium C 0.3750 0.05686241
## 8 medium D 0.6675 0.27097048
## 9 high A 0.2100 0.02160247
## 10 high B 0.3350 0.04654747
## 11 high C 0.2350 0.01290994
## 12 high D 0.3250 0.02645751
p <- ggplot(beetles.meansd.di, aes(x = m, y = s, shape = dose, colour = insecticide))
p <- p + geom_point(size=4)
p <- p + labs(title = "Beetles standard deviation vs mean")
print(p)
[Figure: left, side-by-side boxplots of hours10 by dose, coloured by insecticide;
right, "Beetles standard deviation vs mean" plot of the cell standard deviation s
against the cell mean m, by dose and insecticide.]

Diagnostic plots show the following features. The normal quantile plot shows
an “S” shape rather than a straight line, suggesting the residuals are not normal,
but have higher kurtosis (more peaky) than a normal distribution. The residuals
vs the fitted (predicted) values show that the higher the predicted value the more
variability (horn shaped). The plot of the Cook’s distances indicates a few influential
observations.
# interaction model
lm.h.d.i.di <- lm(hours10 ~ dose*insecticide, data = beetles.long)

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.h.d.i.di, which = c(1,4,6))

plot(beetles.long$dose, lm.h.d.i.di$residuals, main="Residuals vs dose")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(beetles.long$insecticide, lm.h.d.i.di$residuals, main="Residuals vs insecticide")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.h.d.i.di$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 42 20 30


## residuals vs order of data


#plot(lm.h.d.i.di$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the interaction model lm.h.d.i.di. Top row: Residuals vs
Fitted, Cook's distance, and Cook's distance vs Leverage (observations 42, 20, and 30
flagged). Bottom row: Residuals vs dose, Residuals vs insecticide, and a QQ plot of the
residuals against normal quantiles.]

Survival times are usually right skewed, with the spread or variability in the
distribution increasing as the mean or median increases. Ideally, the distributions
should be symmetric, normal, and the standard deviation should be fairly constant
across groups.
The boxplots (note the ordering) and the plot of the sij against ȳij show the
tendency for the spread to increase with the mean. This is reinforced by the residual
plot, where the variability increases as the predicted values (the cell means under the
two-factor interaction model) increase.
As noted earlier, the QQ-plot of the studentized residuals is better suited to
examine normality here than the boxplots which are constructed from 4 observations.
Not surprisingly, the boxplots do not suggest non-normality. Looking at the QQ-plot
we clearly see evidence of non-normality.

5.2.6 A Remedy for Non-Constant Variance


A plot of cell standard deviations against the cell means is sometimes used as a diag-
nostic tool for suggesting transformations of the data. Here are some suggestions for
transforming non-negative measurements to make the variability independent of the
mean (i.e., stabilize the variance). The transformations also tend to reduce skewness,
if present (and may induce skewness if absent!). As an aside, some statisticians prefer


to plot the IQR against the median to get a more robust view of the dependence of
spread on typical level because sij and ȳij are sensitive to outliers.
1. If sij increases linearly with ȳij , use a log transformation of the response.
2. If sij increases as a quadratic function of ȳij , use a reciprocal (inverse) trans-
formation of the response.
3. If sij increases as a square root function of ȳij , use a square root transformation
of the response.
4. If sij is roughly independent of ȳij , do not transform the response. This idea
does not require the response to be non-negative!
A logarithmic transformation or a reciprocal (inverse) transformation of the sur-
vival times might help to stabilize the variance. The survival time distributions are
fairly symmetric, so these nonlinear transformations may destroy the symmetry. As
a first pass, I will consider the reciprocal transformation because the inverse survival
time has a natural interpretation as the dying rate. For example, if you survive 2
hours, then 1/2 is the proportion of your remaining lifetime expired in the next hour.
The unit of time is actually 10 hours, so 0.1/time is the actual rate. The 0.1 scaling
factor has no effect on the analysis provided you appropriately rescale the results on
the mean responses.
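
One informal way to choose among these rules is to regress log(sij) on log(ȳij): a
slope near 1 points to the log transformation, near 2 to the reciprocal, and near 0.5
to the square root. The sketch below is not part of the original analysis; it uses the
cell summaries already stored in beetles.meansd.di.
# Sketch: estimate how the cell standard deviation grows with the cell mean.
lm.sd.mean <- lm(log(s) ~ log(m), data = beetles.meansd.di)
coef(lm.sd.mean)["log(m)"]  # slope near 1 -> log, near 2 -> reciprocal, near 0.5 -> sqrt
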
Create the rate variable.
#### Example: Beetles, non-constant variance
# create the rate variable (1/hours10)
beetles.long$rate <- 1/beetles.long$hours10

Redo the analysis replacing hours10 by rate.


The standard deviations of rate appear much more similar than those of time did.
# boxplots, ggplot
p <- ggplot(beetles.long, aes(x = dose, y = rate, colour = insecticide))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/interaction cell
beetles.meansd.di.rate <- ddply(beetles.long, .(dose,insecticide), summarise
, m = mean(rate), s = sd(rate))
beetles.meansd.di.rate
## dose insecticide m s
## 1 low A 2.486881 0.4966627
## 2 low B 1.163464 0.1994976
## 3 low C 1.862724 0.4893774
## 4 low D 1.689682 0.3647127
## 5 medium A 3.268470 0.8223269
## 6 medium B 1.393392 0.5531885
## 7 medium C 2.713919 0.4175138


## 8 medium D 1.701534 0.7019053


## 9 high A 4.802685 0.5296355
## 10 high B 3.028973 0.4214358
## 11 high C 4.264987 0.2348115
## 12 high D 3.091805 0.2440546
p <- ggplot(beetles.meansd.di.rate, aes(x = m, y = s, shape = dose
, colour = insecticide))
p <- p + geom_point(size=4)
p <- p + labs(title = "Beetles standard deviation vs mean")
print(p)
[Figure: left, side-by-side boxplots of rate by dose, coloured by insecticide; right,
"Beetles standard deviation vs mean" plot of the cell standard deviation s against the
cell mean m for the rates, by dose and insecticide.]

The profile plots and ANOVA table indicate that the main effects are significant
but the interaction is not.
library(plyr)
# Calculate the cell means for each (dose, insecticide) combination
beetles.mean <- ddply(beetles.long, .(), summarise, m = mean(rate))
beetles.mean
## .id m
## 1 <NA> 2.622376
beetles.mean.d <- ddply(beetles.long, .(dose), summarise, m = mean(rate))
beetles.mean.d
## dose m
## 1 low 1.800688
## 2 medium 2.269329
## 3 high 3.797112
beetles.mean.i <- ddply(beetles.long, .(insecticide), summarise, m = mean(rate))
beetles.mean.i
## insecticide m
## 1 A 3.519345
## 2 B 1.861943
## 3 C 2.947210
## 4 D 2.161007
beetles.mean.di <- ddply(beetles.long, .(dose,insecticide), summarise, m = mean(rate))
beetles.mean.di
## dose insecticide m
## 1 low A 2.486881
## 2 low B 1.163464
## 3 low C 1.862724
## 4 low D 1.689682
## 5 medium A 3.268470


## 6 medium B 1.393392
## 7 medium C 2.713919
## 8 medium D 1.701534
## 9 high A 4.802685
## 10 high B 3.028973
## 11 high C 4.264987
## 12 high D 3.091805
# Interaction plots, ggplot

p <- ggplot(beetles.long, aes(x = dose, y = rate, colour = insecticide, shape = insecticide))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = insecticide), size = 1.5)
p <- p + labs(title = "Beetles interaction plot, insecticide by dose")
print(p)

p <- ggplot(beetles.long, aes(x = insecticide, y = rate, colour = dose, shape = dose))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = dose), size = 1.5)
p <- p + labs(title = "Beetles interaction plot, dose by insecticide")
print(p)

[Figure: ggplot interaction plots for the rates with boxplots and cell means overlaid.
Left: "Beetles interaction plot, insecticide by dose" (rate vs dose, coloured by
insecticide). Right: "Beetles interaction plot, dose by insecticide" (rate vs
insecticide, coloured by dose).]

lm.r.d.i.di <- lm(rate ~ dose*insecticide, data = beetles.long)


library(car)
Anova(lm.r.d.i.di, type=3)
## Anova Table (Type III tests)
##
## Response: rate
## Sum Sq Df F value Pr(>F)
## (Intercept) 24.7383 1 103.0395 4.158e-12 ***
## dose 11.1035 2 23.1241 3.477e-07 ***
## insecticide 3.5723 3 4.9598 0.005535 **
## dose:insecticide 1.5708 6 1.0904 0.386733
## Residuals 8.6431 36


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.r.d.i.di)
##
## Call:
## lm(formula = rate ~ dose * insecticide, data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76847 -0.29642 -0.06914 0.25458 1.07936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.48688 0.24499 10.151 4.16e-12 ***
## dosemedium 0.78159 0.34647 2.256 0.030252 *
## dosehigh 2.31580 0.34647 6.684 8.56e-08 ***
## insecticideB -1.32342 0.34647 -3.820 0.000508 ***
## insecticideC -0.62416 0.34647 -1.801 0.080010 .
## insecticideD -0.79720 0.34647 -2.301 0.027297 *
## dosemedium:insecticideB -0.55166 0.48999 -1.126 0.267669
## dosehigh:insecticideB -0.45030 0.48999 -0.919 0.364213
## dosemedium:insecticideC 0.06961 0.48999 0.142 0.887826
## dosehigh:insecticideC 0.08646 0.48999 0.176 0.860928
## dosemedium:insecticideD -0.76974 0.48999 -1.571 0.124946
## dosehigh:insecticideD -0.91368 0.48999 -1.865 0.070391 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 36 degrees of freedom
## Multiple R-squared: 0.8681,Adjusted R-squared: 0.8277
## F-statistic: 21.53 on 11 and 36 DF, p-value: 1.289e-12

Drop the nonsignificant interaction term.


lm.r.d.i <- update(lm.r.d.i.di, ~ . - dose:insecticide)
library(car)
Anova(lm.r.d.i, type=3)
## Anova Table (Type III tests)
##
## Response: rate
## Sum Sq Df F value Pr(>F)
## (Intercept) 58.219 1 239.399 < 2.2e-16 ***
## dose 34.877 2 71.708 2.865e-14 ***
## insecticide 20.414 3 27.982 4.192e-10 ***
## Residuals 10.214 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.r.d.i)


##
## Call:
## lm(formula = rate ~ dose + insecticide, data = beetles.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82757 -0.37619 0.02116 0.27568 1.18153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6977 0.1744 15.473 < 2e-16 ***
## dosemedium 0.4686 0.1744 2.688 0.01026 *
## dosehigh 1.9964 0.1744 11.451 1.69e-14 ***
## insecticideB -1.6574 0.2013 -8.233 2.66e-10 ***
## insecticideC -0.5721 0.2013 -2.842 0.00689 **
## insecticideD -1.3583 0.2013 -6.747 3.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4931 on 42 degrees of freedom
## Multiple R-squared: 0.8441,Adjusted R-squared: 0.8255
## F-statistic: 45.47 on 5 and 42 DF, p-value: 6.974e-16

Unlike the original analysis, the residual plots do not show any gross deviations
from assumptions. Also, no case seems relatively influential.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.r.d.i, which = c(1,4,6))

plot(beetles.long$dose, lm.r.d.i$residuals, main="Residuals vs dose")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(beetles.long$insecticide, lm.r.d.i$residuals, main="Residuals vs insecticide")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.r.d.i$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 41 4 33
## residuals vs order of data
#plot(lm.r.d.i$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for the additive model lm.r.d.i on the rates. Top row:
Residuals vs Fitted, Cook's distance, and Cook's distance vs Leverage (observations 41,
4, and 33 flagged). Bottom row: Residuals vs dose, Residuals vs insecticide, and a QQ
plot of the residuals against normal quantiles.]

The Bonferroni multiple comparisons indicate which dose and insecticide mean rates differ.


# Testing multiple factors is of interest here.
# Note that the code below corrects the p-values
# for all the tests done for both factors together,
# that is, the Bonferroni-corrected significance level is (alpha / (d + i))
# where d = number of dose comparisons
# and i = number of insecticide comparisons.

# correcting over dose and insecticide


glht.beetle.di.rate <- glht(aov(lm.r.d.i), linfct = mcp(dose = "Tukey"
, insecticide = "Tukey"))
summary(glht.beetle.di.rate, test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = lm.r.d.i)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## dose: medium - low == 0 0.4686 0.1744 2.688 0.09236 .
## dose: high - low == 0 1.9964 0.1744 11.451 1.52e-13 ***
## dose: high - medium == 0 1.5278 0.1744 8.763 4.46e-10 ***
## insecticide: B - A == 0 -1.6574 0.2013 -8.233 2.39e-09 ***


## insecticide: C - A == 0 -0.5721 0.2013 -2.842 0.06203 .


## insecticide: D - A == 0 -1.3583 0.2013 -6.747 3.01e-07 ***
## insecticide: C - B == 0 1.0853 0.2013 5.391 2.67e-05 ***
## insecticide: D - B == 0 0.2991 0.2013 1.485 1.00000
## insecticide: D - C == 0 -0.7862 0.2013 -3.905 0.00302 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
par(mfrow=c(1,1))
# plot the summary
op <- par(no.readonly = TRUE) # the whole list of settable par's.
# make wider left margin to fit contrast labels
par(mar = c(5, 10, 4, 2) + 0.1) # order is c(bottom, left, top, right)
# plot bonferroni-corrected difference intervals
plot(summary(glht.beetle.di.rate, test = adjusted("bonferroni"))
, sub="Bonferroni-adjusted Treatment contrasts")
par(op) # reset plotting options

[Figure: 95% family-wise confidence intervals for the Bonferroni-adjusted treatment
contrasts on the rate scale (dose: medium − low, high − low, high − medium;
insecticide: B − A, C − A, D − A, C − B, D − B, D − C).]

Comments on the Two Analyses of Survival Times


The effects of the transformation are noticeable. For example, the comparisons among
doses and insecticides are less sensitive (differences harder to distinguish) on the origi-
nal scale (look at the Bonferroni groupings). A comparison of the interaction p-values
and profile plots for the two analyses suggests that the transformation eliminates much


of the observed interaction between the main effects. Although the interaction in the
original analysis was not significant at the 10% level (p-value=0.112), the small sam-
ple sizes suggest that power for detecting interaction might be low. To be on the safe
side, one might interpret the main effects in the original analysis as if an interaction
were present. This need appears to be less pressing with the rates.
The statistical assumptions are reasonable for an analysis of the rates. I think
that the simplicity of the main effects interpretation is a strong motivating factor for
preferring the analysis of the transformed data to the original analysis. You might
disagree, especially if you believe that the original time scale is most relevant for
analysis.
Given the suitability of the inverse transformation, I did not consider the loga-
rithmic transformation.

5.3 Multiple comparisons: balanced (means) vs unbalanced (lsmeans)

The lsmeans package (written by Russell V. Lenth, PhD of UNM 1975, and well-known for
his online power calculators) provides a way to compare cell means (combinations of
factors), something that is not possible directly with glht(), which compares marginal
means. Using the battery example, we compare the multiple comparison methods using
means (glht()) and lsmeans (lsmeans()). When there are only main effects, the two
methods agree.
#### Multiple comparisons
#### Example: Battery
# fit additive (main effects) model (same as before)
lm.m.m.t <- lm(maxvolt ~ material + temp, data = battery.long)

### comparing means (must be balanced or have only one factor)


# correcting over temp
glht.battery.t <- glht(aov(lm.m.m.t), linfct = mcp(temp = "Tukey"))
summary(glht.battery.t, test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = lm.m.m.t)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0 -37.25 12.24 -3.044 0.01417 *


## 80 - 50 == 0 -80.67 12.24 -6.593 6.89e-07 ***
## 80 - 65 == 0 -43.42 12.24 -3.548 0.00377 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## The ’lsmeans’ package is being deprecated.
## Users are encouraged to switch to ’emmeans’.
## See help(’transition’) for more information, including how
## to convert ’lsmeans’ objects and scripts to work with ’emmeans’.
## compare levels of main effects
# temp
lsmeans(lm.m.m.t, list(pairwise ~ temp), adjust = "bonferroni")
## $`lsmeans of temp`
## temp lsmean SE df lower.CL upper.CL
## 50 144.83333 8.65164 31 127.18820 162.4785
## 65 107.58333 8.65164 31 89.93820 125.2285
## 80 64.16667 8.65164 31 46.52153 81.8118
##
## Results are averaged over the levels of: material
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 50 - 65 37.25000 12.23527 31 3.044 0.0142
## 50 - 80 80.66667 12.23527 31 6.593 <.0001
## 65 - 80 43.41667 12.23527 31 3.548 0.0038
##
## Results are averaged over the levels of: material
## P value adjustment: bonferroni method for 3 tests
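
Since the startup message above notes that lsmeans is being deprecated in favor of
emmeans, here is a minimal equivalent sketch (an aside, assuming the emmeans package
is installed). The estimates and Bonferroni-adjusted contrasts should match the
lsmeans output above.
library(emmeans)
# same comparison of temp levels, averaged over material
emmeans(lm.m.m.t, pairwise ~ temp, adjust = "bonferroni")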

When there are model interactions, the comparisons of the main effects are inap-
propriate, and give different results depending on the method of comparison.
# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)

### comparing means (must be balanced or have only one factor)


# correcting over temp
glht.battery.t <- glht(aov(lm.m.m.t.mt), linfct = mcp(temp = "Tukey"))
## Warning in mcp2matrix(model, linfct = linfct): covariate interactions found -- default
contrast might be inappropriate
summary(glht.battery.t, test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##

## Multiple Comparisons of Means: Tukey Contrasts


##
##
## Fit: aov(formula = lm.m.m.t.mt)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0 -77.50 18.37 -4.218 0.000744 ***
## 80 - 50 == 0 -77.25 18.37 -4.204 0.000772 ***
## 80 - 65 == 0 0.25 18.37 0.014 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
# temp
lsmeans(lm.m.m.t.mt, list(pairwise ~ temp), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of temp`
## temp lsmean SE df lower.CL upper.CL
## 50 144.83333 7.501183 27 129.44218 160.22449
## 65 107.58333 7.501183 27 92.19218 122.97449
## 80 64.16667 7.501183 27 48.77551 79.55782
##
## Results are averaged over the levels of: material
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 50 - 65 37.25000 10.60827 27 3.511 0.0048
## 50 - 80 80.66667 10.60827 27 7.604 <.0001
## 65 - 80 43.41667 10.60827 27 4.093 0.0010
##
## Results are averaged over the levels of: material
## P value adjustment: bonferroni method for 3 tests

When there are model interactions and you want to compare cell means, that is,
levels of one factor at each level of another factor separately, then you must use lsmeans().
# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)

### comparing lsmeans (may be unbalanced)


library(lsmeans)
## compare levels of one factor at each level of another factor separately
# material at levels of temp
lsmeans(lm.m.m.t.mt, list(pairwise ~ material | temp), adjust = "bonferroni")

## $`lsmeans of material | temp`


## temp = 50:
## material lsmean SE df lower.CL upper.CL
## 1 134.75 12.99243 27 108.09174 161.40826
## 2 155.75 12.99243 27 129.09174 182.40826
## 3 144.00 12.99243 27 117.34174 170.65826
##
## temp = 65:
## material lsmean SE df lower.CL upper.CL
## 1 57.25 12.99243 27 30.59174 83.90826
## 2 119.75 12.99243 27 93.09174 146.40826
## 3 145.75 12.99243 27 119.09174 172.40826
##
## temp = 80:
## material lsmean SE df lower.CL upper.CL
## 1 57.50 12.99243 27 30.84174 84.15826
## 2 49.50 12.99243 27 22.84174 76.15826
## 3 85.50 12.99243 27 58.84174 112.15826
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, temp | temp`
## temp = 50:
## contrast estimate SE df t.ratio p.value
## 1 - 2 -21.00 18.37407 27 -1.143 0.7893
## 1 - 3 -9.25 18.37407 27 -0.503 1.0000
## 2 - 3 11.75 18.37407 27 0.639 1.0000
##
## temp = 65:
## contrast estimate SE df t.ratio p.value
## 1 - 2 -62.50 18.37407 27 -3.402 0.0063
## 1 - 3 -88.50 18.37407 27 -4.817 0.0001
## 2 - 3 -26.00 18.37407 27 -1.415 0.5055
##
## temp = 80:
## contrast estimate SE df t.ratio p.value
## 1 - 2 8.00 18.37407 27 0.435 1.0000
## 1 - 3 -28.00 18.37407 27 -1.524 0.4175
## 2 - 3 -36.00 18.37407 27 -1.959 0.1814
##
## P value adjustment: bonferroni method for 3 tests
# temp at levels of material
lsmeans(lm.m.m.t.mt, list(pairwise ~ temp | material), adjust = "bonferroni")
## $`lsmeans of temp | material`
## material = 1:
## temp lsmean SE df lower.CL upper.CL
## 50 134.75 12.99243 27 108.09174 161.40826
## 65 57.25 12.99243 27 30.59174 83.90826

## 80 57.50 12.99243 27 30.84174 84.15826


##
## material = 2:
## temp lsmean SE df lower.CL upper.CL
## 50 155.75 12.99243 27 129.09174 182.40826
## 65 119.75 12.99243 27 93.09174 146.40826
## 80 49.50 12.99243 27 22.84174 76.15826
##
## material = 3:
## temp lsmean SE df lower.CL upper.CL
## 50 144.00 12.99243 27 117.34174 170.65826
## 65 145.75 12.99243 27 119.09174 172.40826
## 80 85.50 12.99243 27 58.84174 112.15826
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, material | material`
## material = 1:
## contrast estimate SE df t.ratio p.value
## 50 - 65 77.50 18.37407 27 4.218 0.0007
## 50 - 80 77.25 18.37407 27 4.204 0.0008
## 65 - 80 -0.25 18.37407 27 -0.014 1.0000
##
## material = 2:
## contrast estimate SE df t.ratio p.value
## 50 - 65 36.00 18.37407 27 1.959 0.1814
## 50 - 80 106.25 18.37407 27 5.783 <.0001
## 65 - 80 70.25 18.37407 27 3.823 0.0021
##
## material = 3:
## contrast estimate SE df t.ratio p.value
## 50 - 65 -1.75 18.37407 27 -0.095 1.0000
## 50 - 80 58.50 18.37407 27 3.184 0.0109
## 65 - 80 60.25 18.37407 27 3.279 0.0086
##
## P value adjustment: bonferroni method for 3 tests

Finally, an important point demonstrated in the next section is that the cell and
marginal averages given by the means and lsmeans methods agree here for the
main effects model because the design is balanced. For unbalanced designs with two
or more factors, lsmeans and means compute different averages. I will argue that
lsmeans are the appropriate averages for unbalanced analyses. You should use the
means statement with caution — it is OK for balanced or unbalanced one-factor
designs, and for the balanced two-factor designs (including the RB) that we have
discussed.


5.4 Unbalanced Two-Factor Designs and Analysis


Sample sizes are usually unequal, or unbalanced, for the different treatment groups
in an experiment. Although this has no consequence on the specification of a model,
you have to be more careful with the analysis of unbalanced experiments that
have two or more factors. With unbalanced designs, the Type I and Type III SS
differ, as do the main effect averages given by means and lsmeans. Inferences might
critically depend on which summaries are used. I will use the following example to
emphasize the differences between the types of SS and averages.

5.4.1 Example: Rat insulin


The experiment consists of measuring insulin levels in rats a certain length of time
after a fixed dose of insulin was injected into their jugular or portal veins. This is a
two-factor study with two vein types (jugular, portal) and three time levels (0, 30, and
60). An unusual feature of this experiment is that the rats used in the six vein and time
combinations are distinct. I will fit a two-factor interaction model, which assumes
that the responses are independent within and across treatments (other models may
be preferred). The design is unbalanced, with sample sizes varying from 3 to 12.
An alternative experimental design might randomly assign rats to the two vein
groups, and then measure the insulin levels of each rat at the three time points.
Depending on the questions of interest, you could compare veins using a one-way
MANOVA, or use other multivariate techniques that allow correlated responses within
rats.
#### Example: Rat insulin
rat <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_ratinsulin.dat"
, header = TRUE)
# make time a factor variable and label the levels
rat$time<- factor(rat$time)
str(rat)
## 'data.frame': 48 obs. of 3 variables:
## $ vein : Factor w/ 2 levels "j","p": 1 1 1 1 1 1 1 1 1 1 ...
## $ time : Factor w/ 3 levels "0","30","60": 1 1 1 1 1 2 2 2 2 2 ...
## $ insulin: int 18 36 12 24 43 61 116 63 132 68 ...
head(rat, 3)
## vein time insulin
## 1 j 0 18
## 2 j 0 36
## 3 j 0 12
tail(rat, 3)
## vein time insulin
## 46 p 60 105
## 47 p 60 71

## 48 p 60 83

It appears the standard deviation increases with the mean.


# boxplots, ggplot
p <- ggplot(rat, aes(x = time, y = insulin, colour = vein))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each time/interaction cell
rat.meansd.tv <- ddply(rat, .(time,vein), summarise
, m = mean(insulin), s = sd(insulin), n = length(insulin))
rat.meansd.tv
## time vein m s n
## 1 0 j 26.60000 12.75931 5
## 2 0 p 81.91667 27.74710 12
## 3 30 j 79.50000 36.44585 6
## 4 30 p 172.90000 76.11753 10
## 5 60 j 61.33333 62.51666 3
## 6 60 p 128.50000 49.71830 12
p <- ggplot(rat.meansd.tv, aes(x = m, y = s, shape = time, colour = vein, label=n))
p <- p + geom_point(size=4)
# labels are sample sizes
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
p <- p + labs(title = "Rats standard deviation vs mean")
print(p)
[Figures: boxplots of insulin by time, coloured by vein, and "Rats standard deviation vs mean",
the cell standard deviations plotted against the cell means with points labeled by sample size.]

We take the log of insulin to correct the problem. The variances are more constant
now, except for one sample with only 3 observations which has a larger standard
deviation than the others, but because this is based on such a small sample size, it's
not of much concern.
rat$loginsulin <- log(rat$insulin)

# boxplots, ggplot
p <- ggplot(rat, aes(x = time, y = loginsulin, colour = vein))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each time/interaction cell
rat.meansd.tv <- ddply(rat, .(time,vein), summarise
, m = mean(loginsulin)
, s = sd(loginsulin)
, n = length(loginsulin))
rat.meansd.tv
## time vein m s n
## 1 0 j 3.179610 0.5166390 5
## 2 0 p 4.338230 0.4096427 12
## 3 30 j 4.286804 0.4660571 6
## 4 30 p 5.072433 0.4185221 10
## 5 60 j 3.759076 1.0255165 3
## 6 60 p 4.785463 0.3953252 12
p <- ggplot(rat.meansd.tv, aes(x = m, y = s, shape = time, colour = vein, label=n))
p <- p + geom_point(size=4)
# labels are sample sizes
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
p <- p + labs(title = "Rats standard deviation vs mean")
print(p)


[Figures: boxplots of loginsulin by time, coloured by vein, and "Rats standard deviation vs mean"
for the log scale, cell standard deviations vs cell means with points labeled by sample size.]

Type I and Type III SS


We can request ANOVA tables including Type I or Type III SS (for the ugly details,
see http://goanna.cs.rmit.edu.au/~fscholer/anova.php) for each effect in
a model. The Type I SS is the sequential reduction in Error SS achieved when
an effect is added to a model that includes only the prior effects listed in the model
statement. The Type III SS are more difficult to define explicitly, but they roughly
correspond to the reduction in Error SS achieved when an effect is added last to the
model.
Type I SS and Type III SS are equal for balanced designs and for one-way ANOVA,
but are typically different for unbalanced designs, where there is no unique way to
define the SS for an effect. The problem here is similar to multiple regression, where
the SS for a predictor X is the decrease in Residual SS when X is added to a model.
This SS is not unique because the change in the Residual SS depends on which
predictors are included in the model prior to X. In a regression analysis, the standard
tests for effects in a model are based on Type III SS and not on the Type I SS.
For the insulin analysis, the Type I and Type III interaction SS are identical
because this effect was added last to the model statement. The Type I and III SS
for the main effects are not equal. Also note that the Type I SS for the main effects
and interaction add to the model SS, but the Type III SS do not.
Looking at the output, we see the Type I and Type III SS are different, except
for the interaction term.
lm.i.t.v.tv <- lm(loginsulin ~ time*vein, data = rat
, contrasts = list(time = contr.sum, vein = contr.sum))

## CRITICAL!!! Unbalanced design warning.


## The contrast statement above must be included identifying
## each main effect with "contr.sum" in order for the correct
## Type III SS to be computed.
## See http://goanna.cs.rmit.edu.au/~fscholer/anova.php
library(car)
# type I SS (intercept SS not shown)
summary(aov(lm.i.t.v.tv))
## Df Sum Sq Mean Sq F value Pr(>F)
## time 2 5.450 2.725 12.18 6.74e-05 ***
## vein 1 9.321 9.321 41.66 8.82e-08 ***
## time:vein 2 0.259 0.130 0.58 0.565
## Residuals 42 9.399 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# type III SS
Anova(lm.i.t.v.tv, type=3)
## Anova Table (Type III tests)
##
## Response: loginsulin
## Sum Sq Df F value Pr(>F)
## (Intercept) 668.54 1 2987.5842 < 2.2e-16 ***
## time 6.18 2 13.7996 2.475e-05 ***
## vein 9.13 1 40.7955 1.101e-07 ***
## time:vein 0.26 2 0.5797 0.5645
## Residuals 9.40 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the profile plot lines all seem parallel, and because of the interaction
Type III SS p-value above, it appears there is not sufficient evidence for a vein-by-
time interaction. For now we’ll keep the interaction in the model for the purpose of
discussing differences between means and lsmeans and Type I and Type III SS.
# calculate means for plot
library(plyr)
rat.mean.tv <- ddply(rat, .(time,vein), summarise, m = mean(loginsulin))

# Interaction plots, ggplot


p <- ggplot(rat, aes(x = time, y = loginsulin, colour = vein, shape = vein))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = rat.mean.tv, aes(y = m), size = 4)
p <- p + geom_line(data = rat.mean.tv, aes(y = m, group = vein), size = 1.5)
p <- p + labs(title = "Rats interaction plot, vein by time")
print(p)


p <- ggplot(rat, aes(x = vein, y = loginsulin, colour = time, shape = time))


p <- p + geom_hline(aes(yintercept = 0), colour = "black"
, linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = rat.mean.tv, aes(y = m), size = 4)
p <- p + geom_line(data = rat.mean.tv, aes(y = m, group = time), size = 1.5)
p <- p + labs(title = "Rats interaction plot, time by vein")
print(p)
[Figures: "Rats interaction plot, vein by time" and "Rats interaction plot, time by vein",
boxplots of loginsulin with cell means connected by lines.]

Means versus lsmeans


The lsmeans (sometimes called adjusted means) for a single factor is an arithmetic
average of cell means. For example, the mean responses in the jugular vein at times
0, 30, and 60 are 3.18, 4.29, and 3.76, respectively. The lsmeans for the jugular vein
is thus 3.74 = (3.18 + 4.29 + 3.76)/3. This average gives equal weight to the 3 times
even though the sample sizes at these times differ (5, 6, and 3). The means of 3.78
for the jugular is the average of the 14 jugular responses, ignoring time. If the cell
sample sizes were equal, the lsmeans and means averages would agree.
The means and lsmeans for individual cells (i.e., for the 6 vein*time combina-
tions) are identical, and equal to cell means.
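
As a quick arithmetic check (a minimal sketch using the jugular cell means and sample
sizes from rat.meansd.tv above), the sample-size-weighted average of the cell means
reproduces the marginal mean, while the unweighted average reproduces the lsmean.
# jugular (vein = j) cell means at times 0, 30, 60 and their sample sizes
m.j <- c(3.179610, 4.286804, 3.759076)
n.j <- c(5, 6, 3)
weighted.mean(m.j, n.j) # 3.778, the mean of all 14 jugular responses
mean(m.j)               # 3.742, the jugular lsmean
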
#### lsmeans
library(plyr)
# unbalanced, don't match
rat.mean.t <- ddply(rat, .(time), summarise, m = mean(loginsulin))
rat.mean.t
## time m

## 1 0 3.997460
## 2 30 4.777822
## 3 60 4.580186
library(lsmeans)
lsmeans(lm.i.t.v.tv, list(pairwise ~ time), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of time`
## time lsmean SE df lower.CL upper.CL
## 0 3.758920 0.1258994 42 3.504845 4.012996
## 30 4.679619 0.1221403 42 4.433130 4.926108
## 60 4.272270 0.1526754 42 3.964158 4.580381
##
## Results are averaged over the levels of: vein
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 0 - 30 -0.9206985 0.1754107 42 -5.249 <.0001
## 0 - 60 -0.5133494 0.1978900 42 -2.594 0.0390
## 30 - 60 0.4073491 0.1955199 42 2.083 0.1300
##
## Results are averaged over the levels of: vein
## P value adjustment: bonferroni method for 3 tests

# unbalanced, don't match


rat.mean.v <- ddply(rat, .(vein), summarise, m = mean(loginsulin))
rat.mean.v
## vein m
## 1 j 3.778293
## 2 p 4.712019
# compare jugular mean above (3.778) with the lsmeans average below (3.742)
(3.179610 + 4.286804 + 3.759076)/3
## [1] 3.74183
lsmeans(lm.i.t.v.tv, list(pairwise ~ vein), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of vein`
## vein lsmean SE df lower.CL upper.CL
## j 3.741830 0.13192664 42 3.475592 4.008069
## p 4.732042 0.08142689 42 4.567716 4.896369
##
## Results are averaged over the levels of: time
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## j - p -0.9902121 0.1550322 42 -6.387 <.0001
##

## Results are averaged over the levels of: time

# unbalanced, but highest-order interaction cell means will match


rat.mean.tv <- ddply(rat, .(time,vein), summarise, m = mean(loginsulin))
rat.mean.tv
## time vein m
## 1 0 j 3.179610
## 2 0 p 4.338230
## 3 30 j 4.286804
## 4 30 p 5.072433
## 5 60 j 3.759076
## 6 60 p 4.785463
lsmeans(lm.i.t.v.tv, list(pairwise ~ time | vein), adjust = "bonferroni")
## $`lsmeans of time | vein`
## vein = j:
## time lsmean SE df lower.CL upper.CL
## 0 3.179610 0.2115533 42 2.752678 3.606542
## 30 4.286804 0.1931208 42 3.897071 4.676538
## 60 3.759076 0.2731141 42 3.207910 4.310243
##
## vein = p:
## time lsmean SE df lower.CL upper.CL
## 0 4.338230 0.1365570 42 4.062647 4.613814
## 30 5.072433 0.1495908 42 4.770547 5.374320
## 60 4.785463 0.1365570 42 4.509880 5.061047
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, vein | vein`
## vein = j:
## contrast estimate SE df t.ratio p.value
## 0 - 30 -1.1071941 0.2864445 42 -3.865 0.0011
## 0 - 60 -0.5794659 0.3454650 42 -1.677 0.3027
## 30 - 60 0.5277282 0.3344951 42 1.578 0.3664
##
## vein = p:
## contrast estimate SE df t.ratio p.value
## 0 - 30 -0.7342029 0.2025468 42 -3.625 0.0023
## 0 - 60 -0.4472330 0.1931208 42 -2.316 0.0766
## 30 - 60 0.2869699 0.2025468 42 1.417 0.4917
##
## P value adjustment: bonferroni method for 3 tests
lsmeans(lm.i.t.v.tv, list(pairwise ~ vein | time), adjust = "bonferroni")
## $`lsmeans of vein | time`
## time = 0:
## vein lsmean SE df lower.CL upper.CL
## j 3.179610 0.2115533 42 2.752678 3.606542

## p 4.338230 0.1365570 42 4.062647 4.613814


##
## time = 30:
## vein lsmean SE df lower.CL upper.CL
## j 4.286804 0.1931208 42 3.897071 4.676538
## p 5.072433 0.1495908 42 4.770547 5.374320
##
## time = 60:
## vein lsmean SE df lower.CL upper.CL
## j 3.759076 0.2731141 42 3.207910 4.310243
## p 4.785463 0.1365570 42 4.509880 5.061047
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, time | time`
## time = 0:
## contrast estimate SE df t.ratio p.value
## j - p -1.1586202 0.2517988 42 -4.601 <.0001
##
## time = 30:
## contrast estimate SE df t.ratio p.value
## j - p -0.7856289 0.2442807 42 -3.216 0.0025
##
## time = 60:
## contrast estimate SE df t.ratio p.value
## j - p -1.0263873 0.3053508 42 -3.361 0.0017

For completeness, these diagnostic plots are mostly fine, though the plot of the
Cook's distances indicates a couple of influential observations.
# interaction model
lm.i.t.v.tv <- lm(loginsulin ~ time*vein, data = rat
, contrasts = list(time = contr.sum, vein = contr.sum))

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.i.t.v.tv, which = c(1,4,6))

plot(rat$time, lm.i.t.v.tv$residuals, main="Residuals vs time")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(rat$vein, lm.i.t.v.tv$residuals, main="Residuals vs vein")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.i.t.v.tv$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 13 12 17
## residuals vs order of data
#plot(lm.i.t.v.tv$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the interaction model: residuals vs fitted values, Cook's distance,
Cook's distance vs leverage, residuals vs time, residuals vs vein, and a normal QQ plot of the
residuals (observations 12, 13, and 17 flagged).]

Should I use means or lsmeans, Type I or Type III SS? Use lsmeans and
Type III SS.
Regardless of whether the design is balanced, the basic building blocks for a two-
factor analysis are cell means, and the marginal means, defined as the average of the
cell means over the levels of the other factor.
The F -statistics based on Type III SSs are appropriate for unbalanced two-factor
designs because they test the same hypotheses that were considered in balanced de-
signs. That is, the Type III F -tests on the main effects check for equality in population
means averaged over levels of the other factor. The Type III F -test for no interaction
checks for parallel profiles. Given that the Type III F -tests for the main effects check
for equal population cell means averaged over the levels of the other factor, multiple
comparisons for main effects should be based on lsmeans.
The Type I SS and F -tests and the multiple comparisons based on means should
be ignored because they do not, in general, test meaningful hypotheses. The problem
with using the means output is that the experimenter has fixed the sample sizes for
a two-factor experiment, so comparisons of means, which ignore the second factor,
introduce a potential bias due to the choice of sample sizes. Put another way, any
differences seen in the means in the jugular and portal could be solely due to the
sample sizes used in the experiment and not due to differences in the veins.
Focusing on the Type III SS, the F -tests indicate that the vein and time effects
are significant, but that the interaction is not significant. The jugular and portal
profiles are reasonably parallel, which is consistent with a lack of interaction. What
can you conclude from the lsmeans comparisons of veins and times?
Answer: significant differences between veins, and between times 0 and 30; the time
0 vs 60 comparison is also significant at the 5% level (p = 0.039), while 30 vs 60 is not.


5.5 Writing factor model equations and interpreting coefficients

This section is an exercise in writing statistical factor models, plotting predicted
values, and interpreting model coefficients. You'll need a pen (preferably
multi-colored) and paper.
In class I'll discuss indicator variables and writing these models. Together, we'll
plot the model predicted values, write the model for each factor combination, and
indicate the β coefficients on the plot (βs are vertical differences). Please attempt
each exercise by hand before looking at the solutions at
http://statacumen.com/teach/ADA2/ADA2_05_PairedAndBlockDesigns_CoefScan.pdf.
From the exercise, I hope that coefficient interpretation will become clear. Assume
a balanced design with ni observations for each treatment combination.

5.5.1 One-way ANOVA, 1 factor with 3 levels


1. Write ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot
Level 1 2 3
ŷ 5 4 6
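
After attempting the exercise by hand, you can check your coefficients with a small
R sketch (my own illustration; the factor name f, response y, and 2 observations per
level are made up). The fitted coefficients reproduce the predicted values in the table.
# hypothetical balanced one-way data with cell means 5, 4, 6 (2 obs per level)
d <- data.frame(f = factor(rep(1:3, each = 2))
              , y = rep(c(5, 4, 6), each = 2))
model.matrix(~ f, data = d) # indicator-variable coding, level 1 as reference
coef(lm(y ~ f, data = d))   # (Intercept) = 5, f2 = 4 - 5 = -1, f3 = 6 - 5 = 1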

5.5.2 Two-way ANOVA, 2 factors with 3 and 2 levels, additive model

1. Write two-way ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot
              Factor 1
ŷ              1  2  3
Factor 2   1   5  4  6
           2   8  7  9


5.5.3 Two-way ANOVA, 2 factors with 3 and 2 levels, interaction model

1. Write two-way ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot
              Factor 1
ŷ              1  2  3
Factor 2   1   5  4  6
           2   8 10  3
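
As with the one-way exercise, a sketch (again with made-up names f1, f2, and y) for
checking the interaction-model coefficients against the table above after working them
out by hand:
# hypothetical balanced two-way data with the cell means above (2 obs per cell)
d <- expand.grid(rep = 1:2, f1 = factor(1:3), f2 = factor(1:2))
d$y <- rep(c(5, 4, 6, 8, 10, 3), each = 2)
coef(lm(y ~ f1 * f2, data = d))
# (Intercept) = 5 is the (f1 = 1, f2 = 1) cell mean; f12 = -1 and f13 = 1 are the
# Factor 1 differences at f2 = 1; f22 = 3 is the Factor 2 difference at f1 = 1;
# f12:f22 = 3 and f13:f22 = -6 are how the Factor 1 differences change at f2 = 2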

Chapter 6

A Short Discussion of
Observational Studies

“Thou shall adjust for what thou can not control.”

In most scientific studies, the groups being compared do not consist of identical
experimental units that have been randomly assigned to receive a treatment. Instead,
the groups might be extremely heterogeneous on factors that might be related to a
specific response on which you wish to compare the groups. Inferences about the
nature of differences among groups in such observational studies can be flawed if
this heterogeneity is ignored in the statistical analysis.
The following problem emphasizes the care that is needed when analyzing obser-
vational studies, and highlights the distinction between the means and lsmeans
output for a two-way table. The data are artificial, but the conclusions are consis-
tent with an interesting analysis conducted by researchers at Sandia National Labo-
ratories.
A representative sample of 550 high school seniors was selected in 1970. A similar
sample of 550 was selected in 1990. The final SAT scores (on a 1600 point scale) were
obtained for each student. (The fake-data example in this chapter is similar to a
real-world SAT example illustrated in this paper: "Minority Contributions to the
SAT Score Turnaround: An Example of Simpson's Paradox" by Howard Wainer,
Journal of Educational Statistics, Vol. 11, No. 4 (Winter, 1986), pp. 239–244,
http://www.jstor.org/stable/1164696.)
The boxplots for the two samples show heavy-tailed distributions with similar
spreads. Given the large sample sizes, the F -test comparing populations is approxi-
mately valid even though the population distributions are non-normal.
#### Example: SAT
sat <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch06_sat.dat", header = TRUE)
sat$year <- factor(sat$year)

sat$eth <- factor(sat$eth )

# calculate means by year (also calculated below to illustrate lsmeans())


library(plyr)
sat.mean.y <- ddply(sat, .(year), summarise, m = mean(grade))
# Interaction plots, ggplot
p <- ggplot(sat, aes(x = year, y = grade))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_point(position = position_jitter(w = 0.1, h = 0), colour="gray25", size=1, alpha = 0.2)
p <- p + geom_point(data = sat.mean.y, aes(y = m), size = 4)
#p <- p + geom_line(data = sat.mean.y, aes(y = m), size = 1.5)
p <- p + labs(title = "SAT scores by year")
print(p)

[Figure: "SAT scores by year", boxplots of grade for 1970 and 1990 with jittered points and
group means marked.]

A simple analysis might compare the average SAT scores for the two years, to
see whether students are scoring higher, lower, or about the same, over time. The
one-way lsmeans and means breakdowns of the SAT scores are identical; the av-
erage SAT scores for 1970 and 1990 are 892.8 and 882.2, respectively. The one-way
ANOVA, combined with the observed averages, indicates that the typical SAT score
has decreased significantly (10.7 points) over the 20 year period.
lm.g.y <- lm(grade ~ year, data = sat
, contrasts = list(year = contr.sum))
library(car)
# type III SS
Anova(lm.g.y, type=3)
## Anova Table (Type III tests)
##
## Response: grade

## Sum Sq Df F value Pr(>F)


## (Intercept) 866418325 1 1.7076e+06 < 2.2e-16 ***
## year 31410 1 6.1904e+01 8.591e-15 ***
## Residuals 557117 1098
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(plyr)
# balanced with respect to year, so means and lsmeans match
sat.mean.y <- ddply(sat, .(year), summarise, m = mean(grade))
sat.mean.y
## year m
## 1 1970 892.8418
## 2 1990 882.1545
library(lsmeans)
lsmeans(lm.g.y, list(pairwise ~ year), adjust = "bonferroni")
## $`lsmeans of year`
## year lsmean SE df lower.CL upper.CL
## 1970 892.8418 0.9604853 1098 890.9572 894.7264
## 1990 882.1545 0.9604853 1098 880.2700 884.0391
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 1970 - 1990 10.68727 1.358331 1098 7.868 <.0001
Should we be alarmed? Should we be concerned that students entering college have
fewer skills than students 20 years ago? Should we be pumping billions of dollars into
the bloated bureaucracies of our public school systems with the hope that a few of
these dollars might be put to good use in programs to enhance performance? This
is the consensus among some people in the know, all of whom wax eloquently about
the impending inability of the U.S. to compete in the new global economy.
The SAT study is not a well-designed experiment, where a scientist has controlled
all the factors that might affect the response (the SAT score) other than the treatment
(the year). Even without these controls, there is no randomization of treatments to
students selected from a target population.
The SAT study is an observational study of two distinct populations. The ob-
served differences in SAT scores may indeed be due to a decrease in performance. The
differences might also be due to factors that make the two populations incomparable
for assessing changes in performance over time.
My hypothetical populations have students from two ethnic groups (1 and 2). If
you construct box-plots of the SAT scores for the four combinations of ethnicity and
year, you see that the typical SAT score within each ethnic group has increased
over time, whereas the typical SAT score ignoring ethnicity decreased over time. Is
this a paradox, and what are appropriate conclusions in the analysis?


sat.mean.ye <- ddply(sat, .(year,eth), summarise, m = mean(grade))
sat.mean.ye
## year eth m
## 1 1970 1 899.712
## 2 1970 2 824.140
## 3 1990 1 948.560
## 4 1990 2 875.514
# Interaction plots, ggplot
library(ggplot2)
p <- ggplot(sat, aes(x = year, y = grade, colour = eth, shape = eth))
p <- p + geom_boxplot(alpha = 0.5, outlier.size=0.5)
p <- p + geom_point(data = sat.mean.ye, aes(y = m), size = 4)
p <- p + geom_line(data = sat.mean.ye, aes(y = m, group = eth), size = 1.5)
p <- p + labs(title = "SAT interaction plot, eth by year")
print(p)

#p <- ggplot(sat, aes(x = eth, y = grade, colour = year, shape = year))


#p <- p + geom_boxplot(alpha = 0.5, outlier.size=0.5)
#p <- p + geom_point(data = sat.mean.ye, aes(y = m), size = 4)
#p <- p + geom_line(data = sat.mean.ye, aes(y = m, group = year), size = 1.5)
#p <- p + labs(title = "SAT interaction plot, year by eth")
#print(p)

[Figure: "SAT interaction plot, eth by year", boxplots of grade by year for each ethnic group
(1, 2), with cell means connected by lines.]

I fit a two-factor model with year and ethnicity effects plus an interaction. The
two-factor model gives a method to compare the SAT scores over time, after adjust-
ing for the effect of ethnicity on performance. The F -test for comparing years adjusts
for ethnicity because it is based on comparing the average SAT scores across years
after averaging the cell means over ethnicities, thereby eliminating from the compar-
ison of years any effects due to changes in the ethnic composition of the populations.
The two-way analysis is preferable to the unadjusted one-way analysis which ignores
ethnicity.
lm.g.y.e.ye <- lm(grade ~ year * eth, data = sat
, contrasts = list(year = contr.sum, eth = contr.sum))

## CRITICAL!!! Unbalanced design warning.


## The contrast statement above must be included identifying
## each main effect with "contr.sum" in order for the correct
## Type III SS to be computed.
## See http://goanna.cs.rmit.edu.au/~fscholer/anova.php
library(car)
# type III SS
Anova(lm.g.y.e.ye, type=3)
## Anova Table (Type III tests)
##
## Response: grade
## Sum Sq Df F value Pr(>F)
## (Intercept) 286085884 1 5.7022e+06 < 2e-16 ***
## year 228283 1 4.5501e+03 < 2e-16 ***
## eth 501984 1 1.0005e+04 < 2e-16 ***
## year:eth 145 1 2.8904e+00 0.08939 .
## Residuals 54988 1096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The year and ethnicity main effects are significant in the two factor model, but
the interaction is not. The marginal lsmeans indicate that the average SAT score
increased significantly over time when averaged over ethnicities. This is consistent
with the cell mean SAT scores increasing over time within each ethnic group. Given
the lack of a significant interaction, the expected increase in SAT scores from 1970 to
1990 within each ethnic group is the difference in marginal averages: 912.0 - 861.9
= 50.1.
library(plyr)
# unbalanced, don't match (lsmeans is correct)
sat.mean.y <- ddply(sat, .(year), summarise, m = mean(grade))
sat.mean.y
## year m
## 1 1970 892.8418
## 2 1990 882.1545
library(lsmeans)
lsmeans(lm.g.y.e.ye, list(pairwise ~ year), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of year`
## year lsmean SE df lower.CL upper.CL
## 1970 861.926 0.5253021 1096 860.8953 862.9567
## 1990 912.037 0.5253021 1096 911.0063 913.0677
##
## Results are averaged over the levels of: eth
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`

## contrast estimate SE df t.ratio p.value


## 1970 - 1990 -50.111 0.7428893 1096 -67.454 <.0001
##
## Results are averaged over the levels of: eth
# unbalanced, don't match (lsmeans is correct)
sat.mean.e <- ddply(sat, .(eth), summarise, m = mean(grade))
sat.mean.e
## eth m
## 1 1 904.1527
## 2 2 870.8436
lsmeans(lm.g.y.e.ye, list(pairwise ~ eth), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of eth`
## eth lsmean SE df lower.CL upper.CL
## 1 924.136 0.5253021 1096 923.1053 925.1667
## 2 849.827 0.5253021 1096 848.7963 850.8577
##
## Results are averaged over the levels of: year
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## 1 - 2 74.309 0.7428893 1096 100.027 <.0001
##
## Results are averaged over the levels of: year
# unbalanced, but highest-order interaction cell means will match
sat.mean.ye <- ddply(sat, .(year,eth), summarise, m = mean(grade))
sat.mean.ye
## year eth m
## 1 1970 1 899.712
## 2 1970 2 824.140
## 3 1990 1 948.560
## 4 1990 2 875.514
lsmeans(lm.g.y.e.ye, list(pairwise ~ year | eth), adjust = "bonferroni")
## $`lsmeans of year | eth`
## eth = 1:
## year lsmean SE df lower.CL upper.CL
## 1970 899.712 0.3167691 1096 899.0905 900.3335
## 1990 948.560 1.0017118 1096 946.5945 950.5255
##
## eth = 2:
## year lsmean SE df lower.CL upper.CL
## 1970 824.140 1.0017118 1096 822.1745 826.1055
## 1990 875.514 0.3167691 1096 874.8925 876.1355
##
## Confidence level used: 0.95

##
## $`pairwise differences of contrast, eth | eth`
## eth = 1:
## contrast estimate SE df t.ratio p.value
## 1970 - 1990 -48.848 1.050604 1096 -46.495 <.0001
##
## eth = 2:
## contrast estimate SE df t.ratio p.value
## 1970 - 1990 -51.374 1.050604 1096 -48.899 <.0001
lsmeans(lm.g.y.e.ye, list(pairwise ~ eth | year), adjust = "bonferroni")
## $`lsmeans of eth | year`
## year = 1970:
## eth lsmean SE df lower.CL upper.CL
## 1 899.712 0.3167691 1096 899.0905 900.3335
## 2 824.140 1.0017118 1096 822.1745 826.1055
##
## year = 1990:
## eth lsmean SE df lower.CL upper.CL
## 1 948.560 1.0017118 1096 946.5945 950.5255
## 2 875.514 0.3167691 1096 874.8925 876.1355
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast, year | year`
## year = 1970:
## contrast estimate SE df t.ratio p.value
## 1 - 2 75.572 1.050604 1096 71.932 <.0001
##
## year = 1990:
## contrast estimate SE df t.ratio p.value
## 1 - 2 73.046 1.050604 1096 69.528 <.0001
As noted in the insulin analysis, the marginal lsmeans and means are different
for unbalanced two-factor analyses. The marginal means ignore the levels of the
other factors when averaging responses. The marginal lsmeans are averages of cell
means over the levels of the other factor. Thus, for example, the 1970 mean SAT
score of 892.8 is the average of the 550 scores selected that year. The 1970 lsmeans
SAT score of 861.9 is midway between the average 1970 SAT scores for the two ethnic
groups: 861.9 = (899.7 + 824.1)/2. Hopefully, this discussion also clarifies why the
year marginal means are identical in the one and two-factor analyses, but the year
lsmeans are not.
The 1970 and 1990 marginal means estimate the typical SAT score ignoring all
factors that may influence performance. These marginal averages are not relevant for
understanding any trends in performance over time because they do not account
for changes in the composition of the population that may be related to performance.
The average SAT scores (ignoring ethnicity) decreased from 1970 to 1990 because
the ethnic composition of the student population changed. Ten out of every eleven
students sampled in 1970 were from the first ethnic group. Only one out of eleven
students sampled in 1990 was from this group. Students in the second ethnic group
are underachievers, but they are becoming a larger portion of the population over
time. The decrease in average (means) performance inferred from comparing 1970
to 1990 is confounded with the increased representation of the underachievers over
time. Once ethnicity was taken into consideration, the typical SAT scores were shown
to have increased, rather than decreased.
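
A quick arithmetic sketch of this confounding, using the cell means above and assuming
the stated composition corresponds to 500 and 50 students per ethnic group within each
year's sample of 550:
# marginal (means) averages weight the cell means by the changing sample sizes
weighted.mean(c(899.712, 824.140), w = c(500, 50)) # 1970 mean, about 892.8
weighted.mean(c(948.560, 875.514), w = c(50, 500)) # 1990 mean, about 882.2 (lower)
# lsmeans weight the two ethnic groups equally, so the year comparison is not confounded
mean(c(899.712, 824.140)) # 1970 lsmean, about 861.9
mean(c(948.560, 875.514)) # 1990 lsmean, about 912.0 (higher)
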
In summary, the one-way analysis ignoring ethnicity is valid, and allows you to
conclude that the typical SAT score has decreased over time, but it does not provide
any insight into the nature of the changes that have occurred. A two-factor anal-
ysis backed up with a comparison of the marginal lsmeans is needed to compare
performances over time, adjusting for the changes in ethnic composition.
The Sandia study reached the same conclusion. The Sandia team showed that
the widely reported decreases in SAT scores over time are due to changes in the
ethnic distribution of the student population over time, with individuals in historically
underachieving ethnic groups becoming a larger portion of the student population over
time.
A more complete analysis of the SAT study would adjust the SAT scores to account
for other potential confounding factors, such as sex, and differences due to the number
of times the exam was taken. These confounding effects are taken into consideration
by including them as effects in the model.
The interpretation of the results from an observational study with several effects
of interest, and several confounding variables, is greatly simplified by eliminating the
insignificant effects from the model. For example, the year by ethnicity interaction
in the SAT study might be omitted from the model to simplify interpretation. The
year effects would then be estimated after fitting a two-way additive model with
year and ethnicity effects only. The same approach is sometimes used with designed
experiments, say the insulin study that we analyzed earlier.

An important caveat The ideas that we discussed on the design and analysis of
experiments and observational studies are universal. They apply regardless of whether
you are analyzing categorical data, counts, or measurements.

Part VIII

ADA2: ANCOVA and logistic regression


Chapter 7

Analysis of Covariance: Comparing Regression Lines

Suppose that you are interested in comparing the typical lifetime (hours) of two tool
types (A and B). A simple analysis of the data given below would consist of making
side-by-side boxplots followed by a two-sample test of equal means (or medians). The
standard two-sample test using the pooled variance estimator is a special case of the
one-way ANOVA with two groups. The summaries suggest that the distribution of
lifetimes for the tool types are different. In the output below, µi is population mean
lifetime for tool type i (i = A, B).
#### Example: Tool lifetime
tools <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_tools.dat"
, header = TRUE)
str(tools)
## 'data.frame': 20 obs. of 3 variables:
## $ lifetime: num 18.7 14.5 17.4 14.5 13.4 ...
## $ rpm : int 610 950 720 840 980 530 680 540 890 730 ...
## $ type : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...

lifetime rpm type lifetime rpm type


1 18.7300 610 A 11 30.1600 670 B
2 14.5200 950 A 12 27.0900 770 B
3 17.4300 720 A 13 25.4000 880 B
4 14.5400 840 A 14 26.0500 1000 B
5 13.4400 980 A 15 33.4900 760 B
6 24.3900 530 A 16 35.6200 590 B
7 13.3400 680 A 17 26.0700 910 B
8 22.7100 540 A 18 36.7800 650 B
9 12.6800 890 A 19 34.9500 810 B
10 19.3200 730 A 20 43.6700 500 B


library(ggplot2)
p <- ggplot(tools, aes(x = type, y = lifetime))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = mean(lifetime)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour="red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour="red", alpha = 0.8)
p <- p + labs(title = "Tool type lifetime") + ylab("lifetime (hours)")
p <- p + coord_flip()
print(p)

[Figure: "Tool type lifetime", horizontal boxplots of lifetime (hours) for tool types A and B
with group means and confidence limits.]

A two sample t-test comparing mean lifetimes of tool types indicates a difference
between means.
t.summary <- t.test(lifetime ~ type, data = tools)
t.summary
##
## Welch Two Sample t-test
##
## data: lifetime by type
## t = -6.435, df = 15.93, p-value = 8.422e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.70128 -9.93472
## sample estimates:
## mean in group A mean in group B
## 17.110 31.928

This comparison is potentially misleading because the samples are not compara-
ble. A one-way ANOVA is most appropriate for designed experiments where all the
factors influencing the response, other than the treatment (tool type), are fixed by
the experimenter. The tools were operated at different speeds. If speed influences
lifetime, then the observed differences in lifetimes could be due to differences in speeds
at which the two tool types were operated.

Fake example For example, suppose speed is inversely related to lifetime of the
tool. Then, the differences seen in the boxplots above could be due to tool type
B being operated at lower speeds than tool type A. To see how this is possible,
consider the data plot given below, where the relationship between lifetime and speed
is identical in each sample. A simple linear regression model relating hours to speed,
ignoring tool type, fits the data exactly, yet the lifetime distributions for the tool
types, ignoring speed, differ dramatically. (The data were generated to fall exactly
on a straight line). The regression model indicates that you would expect identical
mean lifetimes for tool types A and B, if they were, or could be, operated at identical
speeds. This is not exactly what happens in the actual data. However, I hope the
point is clear.
#### Example: Tools, fake
toolsfake <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_toolsfake.dat"
, header = TRUE)
library(ggplot2)
p <- ggplot(toolsfake, aes(x = speed, y = hours, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(toolsfake$type))))
p <- p + labs(title="Fake tools data, hours by speed with categorical type")
print(p)

[Figure: "Fake tools data, hours by speed with categorical type", scatterplot of hours vs speed
with points labeled A and B, all falling on a single straight line.]

As noted in the Chapter 6 SAT example, you should be wary of group comparisons
where important factors that influence the response have not been accounted for or
controlled. In the SAT example, the differences in scores were affected by a change
in the ethnic composition over time. A two-way ANOVA with two factors, time and
ethnicity, gave the most sensible analysis.
For the tool lifetime problem, you should compare groups (tools) after adjusting
the lifetimes to account for the influence of a measurement variable, speed. The
appropriate statistical technique for handling this problem is called analysis of co-
variance (ANCOVA).

7.1 ANCOVA
A natural way to account for the effect of speed is through a multiple regression
model with lifetime as the response and two predictors, speed and tool type. A
binary categorical variable, here tool type, is included in the model as a dummy
variable or indicator variable (a {0, 1} variable).
Consider the model

Tool lifetime = β0 + β1 typeB + β2 rpm + e,

where typeB is 0 for type A tools, and 1 for type B tools. For type A tools, the model
simplifies to:

Tool lifetime = β0 + β1 (0) + β2 rpm + e


= β0 + β2 rpm + e.

For type B tools, the model simplifies to:

Tool lifetime = β0 + β1 (1) + β2 rpm + e


= (β0 + β1 ) + β2 rpm + e.

This ANCOVA model fits two regression lines, one for each tool type, but restricts
the slopes of the regression lines to be identical. To see this, let us focus on the
interpretation of the regression coefficients. For the ANCOVA model,
β2 = slope of population regression lines for tool types A and B.
and
β0 = intercept of population regression line for tool A (called the reference group).
Given that
β0 + β1 = intercept of population regression line for tool B,
it follows that
β1 = difference between tool B and tool A intercepts.
A picture of the population regression lines for one version of the model is given
below.

[Figure: population mean life vs speed, two parallel regression lines with the Tool B line
above the Tool A line.]

An important feature of the ANCOVA model is that β1 measures the difference


in mean response for the tool types, regardless of the speed. A test of H0 : β1 = 0 is
the primary interest, and is interpreted as a comparison of the tool types, after
adjusting or allowing for the speeds at which the tools were operated.

The ANCOVA model is plausible. The relationship between lifetime and speed
is roughly linear within tool types, with similar slopes but unequal intercepts across
groups. The plot of the studentized residuals against the fitted values shows no gross
abnormalities, but suggests that the variability about the regression line for tool type
A is somewhat smaller than the variability for tool type B. The model assumes that
the variability of the responses is the same for each group. The QQ-plot does not
show any gross deviations from a straight line.
#### Example: Tool lifetime
library(ggplot2)
p <- ggplot(tools, aes(x = rpm, y = lifetime, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(tools$type))))
p <- p + geom_smooth(method = lm, se = FALSE)
p <- p + labs(title="Tools data, lifetime by rpm with categorical type")
print(p)

[Figure: "Tools data, lifetime by rpm with categorical type", scatterplot of lifetime vs rpm
with points labeled A and B and a fitted line for each type.]

lm.l.r.t <- lm(lifetime ~ rpm + type, data = tools)


#library(car)
#Anova(aov(lm.l.r.t), type=3)
summary(lm.l.r.t)
##
## Call:
## lm(formula = lifetime ~ rpm + type, data = tools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5527 -1.7868 -0.0016 1.8395 4.9838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.98560 3.51038 10.536 7.16e-09 ***
## rpm -0.02661 0.00452 -5.887 1.79e-05 ***
## typeB 15.00425 1.35967 11.035 3.59e-09 ***
## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared: 0.9003,Adjusted R-squared: 0.8886
## F-statistic: 76.75 on 2 and 17 DF, p-value: 3.086e-09
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.l.r.t, which = c(1,4,6), pch=as.character(tools$type))

plot(tools$rpm, lm.l.r.t$residuals, main="Residuals vs rpm", pch=as.character(tools$type))


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(tools$type, lm.l.r.t$residuals, main="Residuals vs type")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.l.r.t$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(tools$type))
## [1] 7 20 19
## residuals vs order of data
#plot(lm.l.r.t$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the ANCOVA model: residuals vs fitted values, Cook's distance,
Cook's distance vs leverage, residuals vs rpm, residuals vs type, and a normal QQ plot of the
residuals (observations 7, 19, and 20 flagged), with points labeled by tool type.]

The fitted relationship for the combined data set is


Predicted Lifetime = 36.99 + 15.00 typeB − 0.0266 rpm.
Assigning the LS estimates to the appropriate parameters, the fitted relationships for
the two tool types must be, for tool type B:
Predicted Lifetime = (36.99 + 15.00) − 0.0266 rpm
= 51.99 − 0.0266 rpm,

and for tool type A:

Predicted Lifetime = 36.99 − 0.0266 rpm.

The t-test of H0 : β1 = 0 checks whether the intercepts for the population regres-
sion lines are equal, assuming equal slopes. The t-test p-value < 0.0001 suggests that
the population regression lines for tools A and B have unequal intercepts. The LS
lines indicate that the average lifetime of either type tool decreases by 0.0266 hours
for each increase in 1 RPM. Regardless of the lathe speed, the model predicts that
type B tools will last 15 hours longer (i.e., the regression coefficient for the typeB
predictor) than type A tools. Summarizing this result another way, the t-test suggests
that there is a significant difference between the lifetimes of the two tool types, after
adjusting for the effect of the speeds at which the tools were operated. The estimated
difference in average lifetime is 15 hours, regardless of the lathe speed.
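
A short sketch showing how these two fitted lines come directly from the coefficients
of lm.l.r.t above (the values match the equations up to rounding):
b <- coef(lm.l.r.t)
b["(Intercept)"]              # type A intercept, about 36.99
b["(Intercept)"] + b["typeB"] # type B intercept, about 36.99 + 15.00 = 51.99
b["rpm"]                      # common slope, about -0.0266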

7.2 Generalizing the ANCOVA Model to Allow Unequal Slopes

I will present a flexible approach for checking equal slopes and equal intercepts in
ANCOVA-type models. The algorithm also provides a way to build regression models
in studies where the primary interest is comparing the regression lines across groups
rather than comparing groups after adjusting for a regression effect. The approach
can be applied to an arbitrary number of groups and predictors. For simplicity, I will
consider a problem with three groups and a single regression effect.
The data below (originally analyzed by Sir Cyril Burt) are the IQ scores of identical twins, one raised in a foster home
(IQF) and the other raised by natural parents (IQN). The 27 pairs are divided into
three groups by social status of the natural parents (H=high, M=medium, L=low).
I will examine the regression of IQF on IQN for each of the three social classes.
There is no a priori reason to assume that the regression lines for the three groups
have equal slopes or equal intercepts. These are, however, reasonable hypotheses to
examine. The easiest way to check these hypotheses is to fit a multiple regression
model to the combined data set, and check whether certain carefully defined regression
effects are zero. The most general model has six parameters, and corresponds to fitting
a simple linear regression model to the three groups separately (3 × 2 = 6).
Two indicator variables are needed to uniquely identify each observation by social
class. For example, let I1 = 1 for H status families and I1 = 0 otherwise, and let
I2 = 1 for M status families and I2 = 0 otherwise. The indicators I1 and I2 jointly
assume 3 values:
1 The data were originally analyzed by Sir Cyril Burt.


Status I1 I2
L 0 0
M 0 1
H 1 0
Given the indicators I1 and I2 and the predictor IQN, define two interaction or
product effects: I1 × IQN and I2 × IQN.

7.2.1 Unequal slopes ANCOVA model


The most general model allows separate slopes and intercepts for each group:

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e. (7.1)

This model is best understood by considering the three status classes separately.
If status = L, then I1 = I2 = 0. For these families

IQF = β0 + β3 IQN + e.

If status = M, then I1 = 0 and I2 = 1. For these families

IQF = β0 + β2 (1) + β3 IQN + β5 IQN + e


= (β0 + β2 ) + (β3 + β5 ) IQN + e.

Finally, if status = H, then I1 = 1 and I2 = 0. For these families

IQF = β0 + β1 (1) + β3 IQN + β4 IQN + e


= (β0 + β1 ) + (β3 + β4 ) IQN + e.

The regression coefficients β0 and β3 are the intercept and slope for the L status
population regression line. The other parameters measure differences in intercepts and
slopes across the three groups, using L status families as a baseline or reference
group. In particular:
β1 = difference between the intercepts of the H and L population regression lines.
β2 = difference between the intercepts of the M and L population regression lines.
β4 = difference between the slopes of the H and L population regression lines.
β5 = difference between the slopes of the M and L population regression lines.
The plot gives a possible picture of the population regression lines corresponding
to the general model (7.1).


[Figure: a possible configuration of population regression lines under the general model (7.1); Population Mean IQ Foster Twin (IQF) versus Home Twin IQ (IQN), with separate lines for L, M, and H status.]

We fit the general model to the twins data.


#### Example: Twins
twins <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_twins.dat"
, header = TRUE)
# set "L" as baseline level
twins$status <- relevel(twins$status, "L")
str(twins)
## 'data.frame': 27 obs. of 3 variables:
## $ IQF : int 82 80 88 108 116 117 132 71 75 93 ...
## $ IQN : int 82 90 91 115 115 129 131 78 79 82 ...
## $ status: Factor w/ 3 levels "L","H","M": 2 2 2 2 2 2 2 3 3 3 ...
library(ggplot2)
p <- ggplot(twins, aes(x = IQN, y = IQF, colour = status, shape = status))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(twins$status))))
p <- p + geom_smooth(method = lm, se = FALSE)
p <- p + labs(title="Twins data, IQF by IQN with categorical status")
# equal axes since x- and y-variables are same quantity
dat.range <- range(twins[,c("IQF","IQN")])
p <- p + xlim(dat.range) + ylim(dat.range) + coord_equal(ratio=1)
print(p)


[Figure: "Twins data, IQF by IQN with categorical status"; scatterplot of IQF versus IQN with points plotted as the status letters (L, M, H) and a separate LS line for each status group.]

lm.f.n.s.ns <- lm(IQF ~ IQN*status, data = twins)


library(car)
Anova(aov(lm.f.n.s.ns), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
## Sum Sq Df F value Pr(>F)
## (Intercept) 11.61 1 0.1850 0.6715
## IQN 1700.39 1 27.1035 3.69e-05 ***
## status 8.99 2 0.0716 0.9311
## IQN:status 0.93 2 0.0074 0.9926
## Residuals 1317.47 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.f.n.s.ns)
##
## Call:
## lm(formula = IQF ~ IQN * status, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max


## -14.479 -5.248 -0.155 4.582 13.798


##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.20461 16.75126 0.430 0.672
## IQN 0.94842 0.18218 5.206 3.69e-05 ***
## statusH -9.07665 24.44870 -0.371 0.714
## statusM -6.38859 31.02087 -0.206 0.839
## IQN:statusH 0.02914 0.24458 0.119 0.906
## IQN:statusM 0.02414 0.33933 0.071 0.944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.921 on 21 degrees of freedom
## Multiple R-squared: 0.8041,Adjusted R-squared: 0.7574
## F-statistic: 17.24 on 5 and 21 DF, p-value: 8.31e-07
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.f.n.s.ns, which = c(1,4,6), pch=as.character(twins$status))

plot(twins$IQN, lm.f.n.s.ns$residuals, main="Residuals vs IQN", pch=as.character(twins$status))


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(twins$status, lm.f.n.s.ns$residuals, main="Residuals vs status")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.f.n.s.ns$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(twins$status))
## [1] 27 24 23
## residuals vs order of data
#plot(lm.f.n.s.ns$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for lm.f.n.s.ns, points labeled by status (L/M/H). Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs IQN, Residuals vs status, QQ Plot of residuals against norm quantiles.]

The natural way to express the fitted model is to give separate prediction equations
for the three status groups. Here is an easy way to get the separate fits. For the
general model (7.1), the predicted IQF satisfies
Predicted IQF = (Intercept + Coeff for Status Indicator)
+ (Coeff for Status Product Effect + Coeff for IQN) × IQN.
For the baseline group, use 0 as the coefficients for the status indicator and product
effect.
Thus, for the baseline group with status = L,
Predicted IQF = 7.20 + 0 + (0.948 + 0) IQN
= 7.20 + 0.948 IQN.
For the M status group with indicator I2 and product effect I2 × IQN:
Predicted IQF = 7.20 − 6.39 + (0.948 + 0.024) IQN
= 0.81 + 0.972 IQN.
For the H status group with indicator I1 and product effect I1 × IQN:
Predicted IQF = 7.20 − 9.08 + (0.948 + 0.029) IQN
= −1.88 + 0.977 IQN.
The LS lines are identical to separately fitting simple linear regressions to the three
groups.
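
To see the indicator-variable representation concretely, the same fit can be reproduced by building I1, I2, and the product terms by hand. This is a sketch of mine (the copy twins.ind and the name lm.byhand are not part of the original code); R's factor coding in lm() does all of this automatically.

# build the indicators and product effects by hand on a copy of the data
twins.ind <- twins
twins.ind$I1 <- as.numeric(twins.ind$status == "H")  # 1 for H status, 0 otherwise
twins.ind$I2 <- as.numeric(twins.ind$status == "M")  # 1 for M status, 0 otherwise
lm.byhand <- lm(IQF ~ I1 + I2 + IQN + I(I1 * IQN) + I(I2 * IQN), data = twins.ind)
coef(lm.byhand)  # matches coef(lm.f.n.s.ns) up to term order and names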


7.2.2 Equal slopes ANCOVA model


There are three other models of potential interest besides the general model. The
equal slopes ANCOVA model

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + e

is a special case of (7.1) with β4 = β5 = 0 (no interaction). In the ANCOVA model, β3 is the slope for all three regression lines. The other parameters have the same interpretation as in the general model (7.1); see the plot above. Output from the ANCOVA model is given below.
[Figure: population regression lines under the equal-slopes ANCOVA model; Population Mean IQ Foster Twin (IQF) versus Home Twin IQ (IQN), with parallel lines for L, M, and H status.]

lm.f.n.s <- lm(IQF ~ IQN + status, data = twins)


library(car)
Anova(aov(lm.f.n.s), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
## Sum Sq Df F value Pr(>F)
## (Intercept) 18.2 1 0.3181 0.5782
## IQN 4674.7 1 81.5521 5.047e-09 ***
## status 175.1 2 1.5276 0.2383
## Residuals 1318.4 23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.f.n.s)


##
## Call:
## lm(formula = IQF ~ IQN + status, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8235 -5.2366 -0.1111 4.4755 13.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6188 9.9628 0.564 0.578
## IQN 0.9658 0.1069 9.031 5.05e-09 ***
## statusH -6.2264 3.9171 -1.590 0.126
## statusM -4.1911 3.6951 -1.134 0.268
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.571 on 23 degrees of freedom
## Multiple R-squared: 0.8039,Adjusted R-squared: 0.7784
## F-statistic: 31.44 on 3 and 23 DF, p-value: 2.604e-08
For the ANCOVA model, the predicted IQF for the three groups satisfies
Predicted IQF = (Intercept + Coeff for Status Indicator)
+(Coeff for IQN) × IQN.
As with the general model, use 0 as the coefficient for the status indicator for the baseline group.
For L status families:
Predicted IQF = 5.62 + 0.966 IQN,
for M status:
Predicted IQF = 5.62 − 4.19 + 0.966 IQN
= 1.43 + 0.966 IQN,
and for H status:
Predicted IQF = 5.62 − 6.23 + 0.966 IQN
= −0.61 + 0.966 IQN.
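
A small sketch for extracting the three intercepts and the common slope directly from the coefficient vector of the ANCOVA fit (coefficient names as printed in the summary above):

# group-specific intercepts and common slope from the equal-slopes fit
b <- coef(lm.f.n.s)
c(int.L = unname(b["(Intercept)"])
, int.M = unname(b["(Intercept)"] + b["statusM"])
, int.H = unname(b["(Intercept)"] + b["statusH"])
, slope = unname(b["IQN"]))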

7.2.3 Equal slopes and equal intercepts ANCOVA model

The model with equal slopes and equal intercepts


IQF = β0 + β3 IQN + e


is a special case of the ANCOVA model with β1 = β2 = 0. This model does not
distinguish among social classes. The common intercept and slope for the social
classes are β0 and β3 , respectively.
The predicted IQF for this model is

IQF = 9.21 + 0.901 IQN

for each social class.


lm.f.n <- lm(IQF ~ IQN, data = twins)
#library(car)
#Anova(aov(lm.f.n), type=3)
summary(lm.f.n)
##
## Call:
## lm(formula = IQF ~ IQN, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3512 -5.7311 0.0574 4.3244 16.3531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.20760 9.29990 0.990 0.332
## IQN 0.90144 0.09633 9.358 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.729 on 25 degrees of freedom
## Multiple R-squared: 0.7779,Adjusted R-squared: 0.769
## F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09

7.2.4 No slopes, but intercepts ANCOVA model


The model with no predictor (IQN) effects

IQF = β0 + β1 I1 + β2 I2 + e

is a special case of the ANCOVA model with β3 = 0. In this model, social status
has an effect on IQF but IQN does not. This model of parallel regression lines
with zero slopes is identical to a one-way ANOVA model for the three social classes,
where the intercepts play the role of the population means; see the plot below.


[Figure: population means under the zero-slope (one-way ANOVA) model; Population Mean IQ Foster Twin (IQF) versus Home Twin IQ (IQN), with horizontal lines for H, M, and L status.]

For the ANOVA model, the predicted IQF for the three groups satisfies

Predicted IQF = Intercept + Coeff for Status Indicator

Again, use 0 as the coefficients for the baseline status indicator.


For L status families:

Predicted IQF = 93.71,

for M status:

Predicted IQF = 93.71 − 4.88 = 88.83,

and for H status:

Predicted IQF = 93.71 + 9.57 = 103.28.

The predicted IQFs are the mean IQFs for the three groups.
lm.f.s <- lm(IQF ~ status, data = twins)
library(car)
Anova(aov(lm.f.s), type=3)


## Anova Table (Type III tests)


##
## Response: IQF
## Sum Sq Df F value Pr(>F)
## (Intercept) 122953 1 492.3772 <2e-16 ***
## status 732 2 1.4648 0.2511
## Residuals 5993 24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.f.s)
##
## Call:
## lm(formula = IQF ~ status, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.714 -12.274 2.286 12.500 28.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.714 4.223 22.190 <2e-16 ***
## statusH 9.571 7.315 1.308 0.203
## statusM -4.881 7.711 -0.633 0.533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 24 degrees of freedom
## Multiple R-squared: 0.1088,Adjusted R-squared: 0.03452
## F-statistic: 1.465 on 2 and 24 DF, p-value: 0.2511

7.3 Relating Models to Two-Factor ANOVA


Recall the multiple regression formulation of the general model (7.1):
IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e. (7.2)
If you think of β0 as a grand mean, β1 I1 + β2 I2 as the status effect (i.e., the two
indicators I1 and I2 allow you to differentiate among social classes), β3 IQN as the
IQN effect and β4 I1 IQN + β5 I2 IQN as the status by IQN interaction, then you can
represent the model as
IQF = Grand Mean + Status Effect + IQN effect
          + Status×IQN interaction + Residual.     (7.3)
This representation has the same form as a two-factor ANOVA model with interaction,
except that IQN is a quantitative effect rather than a qualitative (i.e., categorical)


effect. The general model has the same structure as a two-factor interaction ANOVA
model because the plot of the population means allows non-parallel profiles. However,
the general model is a special case of the two-factor interaction ANOVA model because
it restricts the means to change linearly with IQN.
The ANCOVA model has main effects for status and IQN but no interaction:

IQF = Grand Mean + Status Effect + IQN effect + Residual. (7.4)

The ANCOVA model is a special case of the additive two-factor ANOVA model
because the plot of the population means has parallel profiles, but is not equivalent
to the additive two-factor ANOVA model.
The model with equal slopes and intercepts has no main effect for status nor an
interaction between status and IQN:

IQF = Grand Mean + IQN effect + Residual. (7.5)

The one-way ANOVA model has no main effect for IQN nor an interaction between
status and IQN:

IQF = Grand Mean + Status Effect + Residual. (7.6)

I will expand on these ideas later, as they are useful for understanding the con-
nections between regression and ANOVA models.

7.4 Choosing Among Models


I will suggest a backward sequential method to select which of models (7.1), (7.4),
and (7.5) fits best. You would typically be interested in the one-way ANOVA model
(7.6) only when the effect of IQN was negligible.
Step 1: Fit the full model (7.1) and test the hypothesis of equal slopes H0 : β4 =
β5 = 0. (aside: t-tests are used to test either β4 = 0 or β5 = 0.) To test H0 ,
eliminate the predictor variables I1 IQN and I2 IQN associated with β4 and β5 from
the full model (7.1). Then fit the reduced model (7.4) with equal slopes. Reject
H0 : β4 = β5 = 0 if the increase in the Residual SS obtained by deleting I1 IQN
and I2 IQN from the full model is significant. Formally, compute the F -statistic:

Fobs = [(ERROR SS for reduced model − ERROR SS for full model)/2] / [ERROR MS for full model]
and compare it to an upper-tail critical value for an F -distribution with 2 and df
degrees of freedom, where df is the Residual df for the full model. The F -test is a di-
rect extension of the single degree-of-freedom F -tests in the stepwise fits. A p-value


for the F-test is obtained from library(car) with Anova(aov(LMOBJECT), type=3) for the interaction. If H0 is rejected, stop and conclude that the population regression lines have different slopes (and then I do not care whether the intercepts are equal). Otherwise, proceed to step 2.
Step 2: Fit the equal slopes or ANCOVA model (7.4) and test for equal intercepts H0 : β1 = β2 = 0. Follow the procedure outlined in Step 1, treating the ANCOVA model as the full model and the model IQF = β0 + β3 IQN + e with equal slopes and intercepts as the reduced model. This test corresponds to the status term in library(car)'s Anova(aov(LMOBJECT), type=3). If H0 is rejected, conclude that the population regression lines are parallel with unequal intercepts. Otherwise, conclude that the regression lines are identical.
Step 3: Estimate the parameters under the appropriate model, and conduct a
diagnostic analysis. Summarize the fitted model by status class.
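
The F-tests described in Steps 1 and 2 can equivalently be computed with anova() on nested model fits; a sketch of mine using the twins model objects fit earlier in this chapter:

# Step 1: equal slopes (reduced) vs separate slopes (full)
anova(lm.f.n.s, lm.f.n.s.ns)
# Step 2: equal slopes and intercepts (reduced) vs ANCOVA (full)
anova(lm.f.n, lm.f.n.s)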

A comparison of regression lines across k ≥ 2 groups requires k − 1 indicator variables to define the groups, and k − 1 interaction variables, assuming the model has a single predictor. The comparison of models mimics the discussion above, except that the numerator of the F-statistic is divided by k − 1 instead of 2, and the numerator df for the F-test is k − 1 instead of 2. If k = 2, the F-tests for comparing the three models are equivalent to the t-tests given with the parameter estimates summary. For example, recall how you tested for equal intercepts in the tools problem.
The plot of the twins data shows fairly linear relationships within each social class.
The linear relationships appear to have similar slopes and similar intercepts. The p-
value for testing the hypothesis that the slopes of the population regression lines are
equal is essentially 1. The observed data are consistent with the reduced model of
equal slopes.
The p-value for comparing the model of equal slopes and equal intercepts to the
ANCOVA model is 0.238, so there is insufficient evidence to reject the reduced model
with equal slopes and intercepts. The estimated regression line, regardless of social
class, is:
Predicted IQF = 9.21 + 0.901*IQN.
There are no serious inadequacies with this model, based on a diagnostic analysis
(not shown).
An interpretation of this analysis is that the natural parents’ social class has no
impact on the relationship between the IQ scores of identical twins raised apart. What other features of the data would be interesting to explore? For example,
what values of the intercept and slope of the population regression line are of intrinsic
interest?


7.4.1 Simultaneous testing of regression parameters


In the twins example, we have this full interaction model,

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e, (7.7)

where I1 = 1 indicates H, and I2 = 1 indicates M, and L is the baseline status.


Consider these two specific hypotheses:
1. H0 : equal regression lines for status M and L
2. H0 : equal regression lines for status M and H
That is, the intercept and slope for the regression lines are equal for the pairs of status
groups.
First, it is necessary to formulate these hypotheses in terms of testable parameters.
That is, find the β values that make the null hypothesis true in terms of the model
equation.
1. H0 : β2 = 0 and β5 = 0
2. H0 : β1 = β2 and β4 = β5
Using linear model theory, there are methods for testing these multiple-parameter
hypothesis tests.
One strategy is to use the Wald test of the null hypothesis Rβ = r, where R is a matrix of contrast coefficients (typically +1 or −1), β is our vector of regression coefficients, and r is a hypothesized vector of values for the linear combinations Rβ. For our first hypothesis test, the linear system we are testing is, in matrix notation,

  [ 0 0 1 0 0 0 ]
  [ 0 0 0 0 0 1 ]  (β0, β1, β2, β3, β4, β5)'  =  (0, 0)',

that is, Rβ = (β2, β5)' = (0, 0)'.

Let's first test another hypothesis using the Wald test, then we'll test our two simultaneous hypotheses above.
- H0 : equal slopes for all status groups
- H0 : β4 = β5 = 0
lm.f.n.s.ns <- lm(IQF ~ IQN*status, data = twins)
library(car)
Anova(aov(lm.f.n.s.ns), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
## Sum Sq Df F value Pr(>F)


## (Intercept) 11.61 1 0.1850 0.6715


## IQN 1700.39 1 27.1035 3.69e-05 ***
## status 8.99 2 0.0716 0.9311
## IQN:status 0.93 2 0.0074 0.9926
## Residuals 1317.47 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# beta coefficients (term positions: 1, 2, 3, 4, 5, 6)
coef(lm.f.n.s.ns)
## (Intercept) IQN statusH statusM IQN:statusH
## 7.20460986 0.94842244 -9.07665352 -6.38858548 0.02913971
## IQN:statusM
## 0.02414450
The test for the interaction above (IQN:status) has a p-value = 0.9926, which indicates that a common slope is reasonable. In the Wald test notation, we want to test whether those last two coefficients (term positions 5 and 6) both equal 0. Here we get the same result as the ANOVA table.
library(aod) # for wald.test()
##
## Attaching package: ’aod’
## The following object is masked by ’.GlobalEnv’:
##
## mice
## The following object is masked from ’package:survival’:
##
## rats
# Typically, we are interested in testing whether individual parameters or
# set of parameters are all simultaneously equal to 0s
# However, any null hypothesis values can be included in the vector coef.test.values.
coef.test.values <- rep(0, length(coef(lm.f.n.s.ns)))
wald.test(b = coef(lm.f.n.s.ns) - coef.test.values
, Sigma = vcov(lm.f.n.s.ns)
, Terms = c(5,6))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.015, df = 2, P(> X2) = 0.99
Now to our two simultaneous hypotheses. In hypothesis 1 we are testing β2 =
0 and β5 = 0, which are the 3rd and 6th position for coefficients in our original
equation (7.7). However, we need to choose the correct positions based on the coef()
order, and these are positions 4 and 6. The large p-value=0.55 suggests that M and
L can be described by the same regression line, same slope and intercept.
library(aod) # for wald.test()
coef.test.values <- rep(0, length(coef(lm.f.n.s.ns)))
wald.test(b = coef(lm.f.n.s.ns) - coef.test.values


, Sigma = vcov(lm.f.n.s.ns)
, Terms = c(4,6))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55
# Another way to do this is to define the matrix R and the vector r manually.
mR <- as.matrix(rbind(c(0, 0, 0, 1, 0, 0), c(0, 0, 0, 0, 0, 1)))
mR
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 1 0 0
## [2,] 0 0 0 0 0 1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns)
, Sigma = vcov(lm.f.n.s.ns)
, L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55
In hypothesis 2 we are testing β1 − β2 = 0 and β4 − β5 = 0 which are the difference
of the 2nd and 3rd coefficients and the difference of the 5th and 6th coefficients.
However, we need to choose the correct positions based on the coef() order, and these
are positions 3 and 4, and 5 and 6. The large p-value=0.91 suggests that M and H
can be described by the same regression line, same slope and intercept.
mR <- as.matrix(rbind(c(0, 0, 1, -1, 0, 0), c(0, 0, 0, 0, 1, -1)))
mR
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 1 -1 0 0
## [2,] 0 0 0 0 1 -1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns)
, Sigma = vcov(lm.f.n.s.ns)
, L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.19, df = 2, P(> X2) = 0.91


The results of these tests are not surprising, given our previous analysis where we
found that the status effect is not significant for all three groups.
Any simultaneous linear combination of parameters can be tested in this way.
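
For completeness, the Wald statistic that wald.test() reports can be computed by hand from W = (R b − r)' [R V R']^(-1) (R b − r), where b is the estimated coefficient vector and V its estimated covariance matrix. A sketch of mine, reusing mR, vR, and the fit above:

# hand computation of the Wald chi-square for hypothesis 2
# (should reproduce the X2 = 0.19 on 2 df reported by wald.test above)
b <- coef(lm.f.n.s.ns)
V <- vcov(lm.f.n.s.ns)
d <- mR %*% b - vR                                   # contrasts minus hypothesized values
W <- as.numeric(t(d) %*% solve(mR %*% V %*% t(mR)) %*% d)
c(X2 = W, df = nrow(mR), p.value = pchisq(W, df = nrow(mR), lower.tail = FALSE))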

7.5 Comments on Comparing Regression Lines

In the twins example, I defined two indicator variables (plus two interaction variables)
from an ordinal categorical variable: status (H, M, L). Many researchers would assign
numerical codes to the status groups and use the coding as a predictor in a regression
model. For status, a “natural” coding might be to define NSTAT=0 for L, 1 for M,
and 2 for H status families. This suggests building a multiple regression model with
a single status variable (i.e., single df):

IQF = β0 + β1 IQN + β2 NSTAT + e.

If you consider the status classes separately, the model implies that

IQF = β0 + β1 IQN + β2 (0) + e = β0 + β1 IQN + e for L status,


IQF = β0 + β1 IQN + β2 (1) + e = (β0 + β2 ) + β1 IQN + e for M status,
IQF = β0 + β1 IQN + β2 (2) + e = (β0 + 2β2 ) + β1 IQN + e for H status.

The model assumes that the IQF by IQN regression lines are parallel for the three
groups, and are separated by a constant β2. This model is more restrictive (and less reasonable) than the ANCOVA model with equal slopes but arbitrary intercepts. Of course, this model is easier to work with because it requires keeping track of only one status variable instead of two status indicators.
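
A sketch of this coding in R (the variable name NSTAT is mine, created here only for illustration):

# numeric status coding: 0 = L, 1 = M, 2 = H
twins$NSTAT <- as.numeric(factor(twins$status, levels = c("L", "M", "H"))) - 1
lm.f.n.nstat <- lm(IQF ~ IQN + NSTAT, data = twins)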
A plot of the population regression lines under this model is given below, assuming β2 < 0.


[Figure: population regression lines under the single numeric status coding; Population Mean IQ Foster Twin (IQF) versus Home Twin IQ (IQN), with parallel, equally spaced lines for L, M, and H status.]

7.6 Three-way interaction

In this example, a three-way interaction is illustrated with two categorical variables


and one continuous variable. Let a take values 0 or 1 (it’s an indicator variable), b
take values 0 or 1, and c be a continuous variable taking any value.
Below are five models:
(1) Interactions: ab. All lines parallel, different intercepts for each (a, b) combina-
tion.
(2) Interactions: ab, ac. (a, c) combinations have parallel lines, different intercepts
for each (a, b) combination.
(3) Interactions: ab, bc. (b, c) combinations have parallel lines, different intercepts
for each (a, b) combination.
(4) Interactions: ab, ac, bc. All combinations may have different slope lines with
different intercepts, but difference in slope between b = 0 and b = 1 is similar for
each a group (and vice versa).
(5) Interactions: ab, ac, bc, abc. All combinations may have different slope lines
with different intercepts.


Model Intercepts Slopes for c


(1) y = β0 +β1 a +β2 b +β3 ab +β4 c
(2) y = β0 +β1 a +β2 b +β3 ab +β4 c +β5 ac
(3) y = β0 +β1 a +β2 b +β3 ab +β4 c +β6 bc
(4) y = β0 +β1 a +β2 b +β3 ab +β4 c +β5 ac +β6 bc
(5) y = β0 +β1 a +β2 b +β3 ab +β4 c +β5 ac +β6 bc +β7 abc
X <- expand.grid(c(0,1),c(0,1),c(0,1))
X <- cbind(1, X)
colnames(X) <- c("one", "a", "b", "c")
X$ab <- X$a * X$b
X$ac <- X$a * X$c
X$bc <- X$b * X$c
X$abc <- X$a * X$b * X$c
X <- as.matrix(X)
X <- X[,c(1,2,3,5,4,6,7,8)] # reorder columns to be consistent with table above

vbeta <- matrix(c(3, -1, 2, 2, 5, -4, -2, 8), ncol = 1)


rownames(vbeta) <- paste("beta", 0:7, sep="")

Beta <- matrix(vbeta, nrow = dim(vbeta)[1], ncol = 5)


rownames(Beta) <- rownames(vbeta)

# Beta vector for each model


Beta[c(6,7,8), 1] <- 0
Beta[c( 7,8), 2] <- 0
Beta[c(6, 8), 3] <- 0
Beta[c( 8), 4] <- 0

colnames(Beta) <- 1:5 #paste("model", 1:5, sep="")

# Calculate response values


Y <- X %*% Beta

library(reshape2)
YX <- data.frame(cbind(melt(Y), X[,"a"], X[,"b"], X[,"c"]))
colnames(YX) <- c("obs", "Model", "Y", "a", "b", "c")
YX$a <- factor(YX$a)
YX$b <- factor(YX$b)

These are the β values used for this example.

beta0 beta1 beta2 beta3 beta4 beta5 beta6 beta7
    3    −1     2     2     5    −4    −2     8
library(ggplot2)
p <- ggplot(YX, aes(x = c, y = Y, group = a))
#p <- p + geom_point()
p <- p + geom_line(aes(linetype = a))
p <- p + labs(title = "Three-way Interaction")
p <- p + facet_grid(Model ~ b, labeller = "label_both")
print(p)


[Figure: "Three-way Interaction"; Y versus c, faceted by Model (1-5) in rows and b (0, 1) in columns, with line type distinguishing a = 0 from a = 1.]



Chapter 8

Polynomial Regression

8.1 Polynomial Models with One Predictor

A pth order polynomial model relating a dependent variable Y to a predictor X is


given by

Y = β0 + β1 X + β2 X^2 + · · · + βp X^p + ε.

This is a multiple regression model with predictors X, X^2, . . ., X^p. For p = 2, 3, 4, and 5 we have quadratic, cubic, quartic, and quintic relationships, respectively.
A second order polynomial (quadratic) allows at most one local maximum or
minimum (i.e., a point where trend changes direction from increasing to decreasing,
or from decreasing to increasing). A third order polynomial (cubic) allows at most
two local maxima or minima. In general, a pth order polynomial allows at most p − 1
local maxima or minima. The two panels below illustrate different quadratic and
cubic relationships.
#### Creating polynomial plots
# R code for quadratic and cubic plots
x <- seq(-3,3,0.01);
y21 <- x^2-5;
y22 <- -(x+1)^2+3;
y31 <- (x+1)^2*(x-3);
y32 <- -(x-.2)^2*(x+.5)-10;

plot( x, y21, type="l", main="Quadratics", ylab="y")
points(x, y22, type="l", lty=2)
plot( x, y31, type="l", main="Cubics", ylab="y")
points(x, y32, type="l", lty=2)


[Figure: example quadratic curves ("Quadratics", left panel) and cubic curves ("Cubics", right panel), y versus x, with two curves per panel (solid and dashed).]

It is important to recognize that not all, or perhaps even any, of the “turning points” in a polynomial may be observed if the range of X is suitably restricted.
Although polynomial models allow a rich class of non-linear relationships between
Y and X (by virtue of Taylor’s Theorem in calculus), some caution is needed when
fitting polynomials. In particular, the extreme X-values can be highly influential,
numerical instabilities occur when fitting high order models, and predictions based
on high order polynomial models can be woeful.
To illustrate the third concern, consider a data set (Yi , Xi ) for i = 1, 2, . . . , n
where the Xi s are distinct. One can show mathematically that an (n − 1)st degree
polynomial will fit the observed data exactly. However, for a high order polynomial to
fit exactly, the fitted curve must oscillate wildly between data points. In the picture
below, I show the 10th degree polynomial that fits exactly the 11 distinct data points.
Although R2 = 1, I would not use the fitted model to make predictions with new
data. (If possible, models should always be validated using new data.) Intuitively, a
quadratic or a lower order polynomial would likely be significantly better. In essence,
the 10th degree polynomial is modelling the variability in the data, rather than the
trend.
# R code for quadratic and cubic plots

X <- rnorm(11); Y <- rnorm(11); # observed


X1 <- X^1 ;
X2 <- X^2 ;
X3 <- X^3 ;
X4 <- X^4 ;
X5 <- X^5 ;
X6 <- X^6 ;
X7 <- X^7 ;
X8 <- X^8 ;
X9 <- X^9 ;
X10 <- X^10;

fit <- lm(Y~X+X2+X3+X4+X5+X6+X7+X8+X9+X10)


fit$coefficients
## (Intercept) X X2 X3 X4


## 36.70206 -461.55109 -620.55094 13030.85848 29149.14342


## X5 X6 X7 X8 X9
## -26416.29553 -81282.20211 15955.10270 70539.53467 -3396.10960
## X10
## -18290.46769
x <- seq(-2.5,2.5,0.01);
x1 <- x^1 ;
x2 <- x^2 ;
x3 <- x^3 ;
x4 <- x^4 ;
x5 <- x^5 ;
x6 <- x^6 ;
x7 <- x^7 ;
x8 <- x^8 ;
x9 <- x^9 ;
x10 <- x^10;

xx <- matrix(c(rep(1,length(x)),x1,x2,x3,x4,x5,x6,x7,x8,x9,x10),ncol=11)

y <- xx %*% fit$coefficients;

plot( X, Y, main="High-order polynomial", pch=20, cex=2)
points(x, y, type="l", lty=1)
plot( X, Y, main="(same, longer y-axis)", pch=20, cex=1, ylim=c(-10000,10000))
points(x, y, type="l", lty=1)

[Figure: the 10th-degree polynomial that fits the 11 points exactly; left panel "High-order polynomial" (Y versus X), right panel "(same, longer y-axis)" showing the wild oscillation between data points.]

Another concern is that the importance of lower order terms (i.e., X, X^2, . . ., X^(p−1)) depends on the scale in which we measure X. For example, suppose for some
chemical reaction,

Time to reaction = β0 + β1 Temp + β2 Temp^2 + ε.

The significance level for the estimate of the Temp coefficient depends on whether we
measure temperature in degrees Celsius or Fahrenheit.
To avoid these problems, I recommend the following:
1. Center the X data at X̄ and fit the model

Y = β0 + β1 (X − X̄) + β2 (X − X̄)^2 + · · · + βp (X − X̄)^p + ε.

This is usually important only for cubic and higher order models.


2. Restrict attention to low order models, say quartic or less. If a fourth-order


polynomial does not fit, a transformation may provide a more succinct summary.
3. Pay careful attention to diagnostics.
4. Add or delete variables using the natural hierarchy among powers of X and
include all lower order terms if a higher order term is needed. For example, in
a forward-selection type algorithm, add terms X, X 2 , . . ., sequentially until no
additional term is significant, but do not delete powers that were entered earlier.
Similarly, with a backward-elimination type algorithm, start with the model of
maximum acceptable order (for example a fourth or third order polynomial)
and consider deleting terms in the order X p , X p−1 , . . ., until no further terms
can be omitted. Note that automated backward-elimination routines typically do not enforce this hierarchy for you (for example, the select=backward option in SAS's REG procedure does not); such routines sequentially eliminate the least significant effect in the model, regardless of what other effects are included.
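
A generic sketch of recommendations 1 and 4 (the data frame dat and columns y and x are placeholders of mine, not data from these notes; poly() uses an orthogonal basis by default, which avoids most of the numerical problems and makes it easy to test each power added last):

# recommendation 1: center the predictor, then fit a low-order polynomial
dat$xc <- dat$x - mean(dat$x)
fit.cubic <- lm(y ~ xc + I(xc^2) + I(xc^3), data = dat)

# recommendation 4: compare fits of increasing order, keeping all lower-order terms
fit.quad <- lm(y ~ poly(x, 2), data = dat)
anova(fit.quad, lm(y ~ poly(x, 3), data = dat))  # tests the cubic term added last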

8.1.1 Example: Cloud point and percent I-8

The cloud point of a liquid is a measure of the degree of crystallization in a stock,


and is measured by the refractive index 1 . It has been suggested that the percent of
I-8 (variable “i8”) in the base stock is an excellent predictor of the cloud point using
a second order (quadratic) model:

Cloud point = β0 + β1 I8 + β2 I8^2 + ε.

Data were collected to examine this model.


#### Example: Cloud point
cloudpoint <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch08_cloudpoint.dat"
, header = TRUE)
# center i8 by subtracting the mean
cloudpoint$i8 <- cloudpoint$i8 - mean(cloudpoint$i8)
The plot of the data suggests a departure from a linear relationship.
library(ggplot2)
p <- ggplot(cloudpoint, aes(x = i8, y = cloud))
p <- p + geom_point()
p <- p + labs(title="Cloudpoint data, cloud by centered i8")
print(p)

1 Draper and Smith 1966, p. 162


[Figure: "Cloudpoint data, cloud by centered i8"; scatterplot of cloud versus centered i8 showing a curved relationship.]

Fit the simple linear regression model and plot the residuals.
lm.c.i <- lm(cloud ~ i8, data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i), type=3)
#summary(lm.c.i)

The data plot is clearly nonlinear, suggesting that a simple linear regression model is inadequate. This is confirmed by a plot of the studentized residuals against the fitted values from a simple linear regression of cloud point on i8, and by a plot of the residuals against the i8 values. We do not see any local maxima or minima, so a second order model is likely to be adequate. To be sure, we will first fit a cubic model and see whether the third order term is important.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.c.i, which = c(1,4,6))

plot(cloudpoint$i8, lm.c.i$residuals, main="Residuals vs i8")

# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.c.i$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 11 17
# residuals vs order of data
plot(lm.c.i$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")


[Figure: diagnostic plots for the simple linear regression lm.c.i. Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs i8, QQ Plot of residuals, Residuals vs Order of data.]

The output below shows that the cubic term improves the fit of the quadratic
model (i.e., the cubic term is important when added last to the model). The plot
of the studentized residuals against the fitted values does not show any extreme
abnormalities. Furthermore, no individual point is poorly fitted by the model. Case
1 has the largest studentized residual: r1 = −1.997.
# I() is used to create an interpreted object treated "as is"
# so we can include quadratic and cubic terms in the formula
# without creating separate columns in the dataset of these terms
lm.c.i3 <- lm(cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i3), type=3)
summary(lm.c.i3)
##
## Call:
## lm(formula = cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42890 -0.18658 0.07355 0.13536 0.39328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.870451 0.088364 326.723 < 2e-16 ***
## i8 0.847889 0.048536 17.469 6.67e-11 ***
## I(i8^2) -0.065998 0.007323 -9.012 3.33e-07 ***


## I(i8^3) 0.009735 0.002588 3.762 0.0021 **


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2599 on 14 degrees of freedom
## Multiple R-squared: 0.9943,Adjusted R-squared: 0.9931
## F-statistic: 812.9 on 3 and 14 DF, p-value: 6.189e-16
Below are plots of the data and the studentized residuals.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.c.i3, which = c(1,4,6))

plot(cloudpoint$i8, lm.c.i3$residuals, main="Residuals vs i8")

# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.c.i3$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 4 12 1
# residuals vs order of data
plot(lm.c.i3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the cubic fit lm.c.i3. Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs i8, QQ Plot of residuals, Residuals vs Order of data.]

The first and last observations have the lowest and highest values of I8, given by
0 and 10, respectively. These cases are also the most influential points in the data
set (largest Cook’s D). If we delete these cases and redo the analysis we find that
the cubic term is no longer important (p-value=0.55) when added after the quadratic
term. One may reasonably conclude that the significance of the cubic term in the
original analysis is solely due to the two extreme I8 values, and that the quadratic


model appears to fit well over the smaller range of 1 ≤ I8 ≤ 9.


# remove points for minimum and maximum i8 values
cloudpoint2 <- cloudpoint[!(cloudpoint$i8 == min(cloudpoint$i8) |
cloudpoint$i8 == max(cloudpoint$i8)), ]
lm.c.i2 <- lm(cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint2)
#library(car)
#Anova(aov(lm.c.i2), type=3)
summary(lm.c.i2)
##
## Call:
## lm(formula = cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36620 -0.12845 0.03737 0.14031 0.33737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.857039 0.089465 322.551 < 2e-16 ***
## i8 0.904515 0.058338 15.505 8.04e-09 ***
## I(i8^2) -0.060714 0.012692 -4.784 0.000568 ***
## I(i8^3) 0.003168 0.005166 0.613 0.552200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2313 on 11 degrees of freedom
## Multiple R-squared: 0.9917,Adjusted R-squared: 0.9894
## F-statistic: 436.3 on 3 and 11 DF, p-value: 1.032e-11
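
The same conclusion follows from a nested-model F-test of the cubic term added after the quadratic on the reduced data (a sketch of mine; lm.c.i2.quad is my own name for the quadratic fit):

# quadratic vs cubic fit on the reduced data; the F-test for the extra cubic
# term matches the t-test p-value of about 0.55 reported above
lm.c.i2.quad <- lm(cloud ~ i8 + I(i8^2), data = cloudpoint2)
anova(lm.c.i2.quad, lm.c.i2)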

8.2 Polynomial Models with Two Predictors


Polynomial models are sometimes fit to data collected from experiments with two
or more predictors. For simplicity, consider the general quadratic model, with two
predictors:
Y = β0 + β1 X1 + β2 X2 + β3 X1^2 + β4 X2^2 + β5 X1 X2 + ε.
The model, which can be justified as a second order approximation to a smooth trend,
includes quadratic terms in X1 and X2 and the product or interaction of X1 and X2 .

8.2.1 Example: Mooney viscosity


The data below give the Mooney viscosity at 100 degrees Celsius (Y ) as a function
of the filler level (X1 ) and the naphthenic oil (X2 ) level for an experiment involving
filled and plasticized elastomer compounds.


#### Example: Mooney viscosity


mooney <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch08_mooney.dat"
, header = TRUE)
library(ggplot2)
p <- ggplot(mooney, aes(x = oil, y = mooney, label = filler))
p <- p + geom_text()
p <- p + scale_y_continuous(limits = c(0, max(mooney$mooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, mooney by oil with filler labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
library(ggplot2)
p <- ggplot(mooney, aes(x = filler, y = mooney, label = oil))
p <- p + geom_text()
p <- p + scale_y_continuous(limits = c(0, max(mooney$mooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, mooney by filler with oil labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).

[Figure: "Mooney data, mooney by oil with filler labels" (left) and "Mooney data, mooney by filler with oil labels" (right); mooney versus oil with points labeled by filler level, and mooney versus filler with points labeled by oil level.]

At each of the 4 oil levels, the relationship between the Mooney viscosity and filler
level (with 6 levels) appears to be quadratic. Similarly, the relationship between the
Mooney viscosity and oil level appears quadratic for each filler level (with 4 levels).
This supports fitting the general quadratic model as a first step in the analysis.
The output below shows that each term is needed in the model. Although there
are potentially influential points (cases 6 and 20), deleting either or both cases does
not change the significance of the effects in the model (not shown).
# I create each term separately
lm.m.o2.f2 <- lm(mooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler),
data = mooney)
summary(lm.m.o2.f2)
##
## Call:
## lm(formula = mooney ~ oil + filler + I(oil^2) + I(filler^2) +
## I(oil * filler), data = mooney)
##


## Residuals:
## Min 1Q Median 3Q Max
## -6.3497 -2.2231 -0.1615 2.5424 5.2749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.144582 2.616779 10.373 9.02e-09 ***
## oil -1.271442 0.213533 -5.954 1.57e-05 ***
## filler 0.436984 0.152658 2.862 0.0108 *
## I(oil^2) 0.033611 0.004663 7.208 1.46e-06 ***
## I(filler^2) 0.027323 0.002410 11.339 2.38e-09 ***
## I(oil * filler) -0.038659 0.003187 -12.131 8.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.937 on 17 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9917,Adjusted R-squared: 0.9892
## F-statistic: 405.2 on 5 and 17 DF, p-value: < 2.2e-16
## poly() will evaluate variables and give joint polynomial values
## which is helpful when you have many predictors
#head(mooney, 10)
#head(poly(mooney$oil, mooney$filler, degree = 2, raw = TRUE), 10)
## This model is equivalent to the one above
#lm.m.o2.f2 <- lm(mooney ~ poly(oil, filler, degree = 2, raw = TRUE), data = mooney)
#summary(lm.m.o2.f2)
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.m.o2.f2, which = c(1,4,6), pch=as.character(mooney$oil))

# because of one missing value, get the indices of non-missing


ind <- as.numeric(names(lm.m.o2.f2$residuals))

plot(mooney$oil[ind], lm.m.o2.f2$residuals, main="Residuals vs oil with filler labels", pch=as.character(mooney$filler[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(mooney$filler[ind], lm.m.o2.f2$residuals, main="Residuals vs filler with oil labels", pch=as.character(mooney$oil[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.m.o2.f2$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))
## 18 20 6
## 18 19 6
## residuals vs order of data
#plot(lm.m.o2.f2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for lm.m.o2.f2, points labeled by oil or filler level. Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs oil (filler labels), Residuals vs filler (oil labels), QQ Plot of residuals.]

8.2.2 Example: Mooney viscosity on log scale

As noted earlier, transformations can often be used instead of polynomials. For


example, the original data plots suggest transforming the Mooney viscosity to a log
scale. If we make this transformation and replot the data, we see that the log Mooney
viscosity is roughly linearly related to the filler level at each oil level, but is a quadratic
function of oil at each filler level. The plots of the transformed data suggest that a
simpler model will be appropriate.
# log transform the response
mooney$logmooney <- log(mooney$mooney)

library(ggplot2)
p <- ggplot(mooney, aes(x = oil, y = logmooney, label = filler))
p <- p + geom_text()
#p <- p + scale_y_continuous(limits = c(0, max(mooney$logmooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, log(mooney) by oil with filler labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
library(ggplot2)
p <- ggplot(mooney, aes(x = filler, y = logmooney, label = oil))
p <- p + geom_text()
#p <- p + scale_y_continuous(limits = c(0, max(mooney$logmooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, log(mooney) by filler with oil labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).


[Figure: "Mooney data, log(mooney) by oil with filler labels" (left) and "Mooney data, log(mooney) by filler with oil labels" (right); logmooney versus oil with filler labels, and logmooney versus filler with oil labels.]

To see that a simpler model is appropriate, we fit the full quadratic model. The
interaction term can be omitted here, without much loss of predictive ability (R-
squared is similar). The p-value for the interaction term in the quadratic model is
0.34.
# I create each term separately
lm.lm.o2.f2 <- lm(logmooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler),
data = mooney)
summary(lm.lm.o2.f2)
##
## Call:
## lm(formula = logmooney ~ oil + filler + I(oil^2) + I(filler^2) +
## I(oil * filler), data = mooney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.077261 -0.035795 0.009193 0.030641 0.075640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.236e+00 3.557e-02 90.970 < 2e-16 ***
## oil -3.921e-02 2.903e-03 -13.507 1.61e-10 ***
## filler 2.860e-02 2.075e-03 13.781 1.18e-10 ***
## I(oil^2) 4.227e-04 6.339e-05 6.668 3.96e-06 ***
## I(filler^2) 4.657e-05 3.276e-05 1.421 0.173
## I(oil * filler) -4.231e-05 4.332e-05 -0.977 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05352 on 17 degrees of freedom


## (1 observation deleted due to missingness)


## Multiple R-squared: 0.9954,Adjusted R-squared: 0.9941
## F-statistic: 737 on 5 and 17 DF, p-value: < 2.2e-16
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.lm.o2.f2, which = c(1,4,6), pch=as.character(mooney$oil))

# because of one missing value, get the indices of non-missing


ind <- as.numeric(names(lm.lm.o2.f2$residuals))

plot(mooney$oil[ind], lm.lm.o2.f2$residuals, main="Residuals vs oil with filler labels", pch=as.character(mooney$filler[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(mooney$filler[ind], lm.lm.o2.f2$residuals, main="Residuals vs filler with oil labels", pch=as.character(mooney$oil[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.lm.o2.f2$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))
## 22 12 21
## 21 12 20
## residuals vs order of data
#plot(lm.lm.o2.f2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.lm.o2.f2. Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs oil (filler labels), Residuals vs filler (oil labels), QQ Plot of residuals.]

After omitting the interaction term, the quadratic effect in filler is not needed
in the model (output not given). Once these two effects are removed, each of the
remaining effects is significant.
# I create each term separately
lm.lm.o2.f <- lm(logmooney ~ oil + filler + I(oil^2),
data = mooney)
summary(lm.lm.o2.f)


##
## Call:
## lm(formula = logmooney ~ oil + filler + I(oil^2), data = mooney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.090796 -0.031113 -0.008831 0.032533 0.100587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.230e+00 2.734e-02 118.139 < 2e-16 ***
## oil -4.024e-02 2.702e-03 -14.890 6.26e-12 ***
## filler 3.086e-02 5.716e-04 53.986 < 2e-16 ***
## I(oil^2) 4.097e-04 6.356e-05 6.446 3.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05423 on 19 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9947,Adjusted R-squared: 0.9939
## F-statistic: 1195 on 3 and 19 DF, p-value: < 2.2e-16
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.lm.o2.f, which = c(1,4,6), pch=as.character(mooney$oil))

# because of one missing value, get the indices of non-missing


ind <- as.numeric(names(lm.lm.o2.f$residuals))

plot(mooney$oil[ind], lm.lm.o2.f$residuals, main="Residuals vs oil with filler labels", pch=as.character(mooney$filler[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(mooney$filler[ind], lm.lm.o2.f$residuals, main="Residuals vs filler with oil labels", pch=as.character(mooney$oil[ind]))


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.lm.o2.f$residuals, las = 1, id = list(n = 3), main="QQ Plot", pch=as.character(mooney$oil[ind]))
## 12 22 16
## 12 21 16
## residuals vs order of data
#plot(lm.lm.o2.f$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")


[Figure: diagnostic plots for lm.lm.o2.f. Top row: Residuals vs Fitted, Cook's distance, Cook's distance vs Leverage h_ii/(1 − h_ii). Bottom row: Residuals vs oil (filler labels), Residuals vs filler (oil labels), QQ Plot of residuals.]

The model does not appear to have inadequacies. Assuming no difficulties, an important effect of the transformation is that the resulting model is simpler than the model selected on the original scale. This is satisfying to me, but you may not agree. Note that the selected model, with linear effects due to the oil and filler levels and a quadratic effect due to the oil level, agrees with our visual assessment of the data. Assuming no inadequacies, the predicted log Mooney viscosity is given by

predicted log(Mooney viscosity) = 3.2297 − 0.0402 Oil + 0.0004 Oil^2 + 0.0309 Filler.

Quadratic models with two or more predictors are often used in industrial ex-
periments to estimate the optimal combination of predictor values to maximize or
minimize the response, over the range of predictor variable values where the model is
reasonable. (This strategy is called “response surface methodology”.) For example,
we might wish to know what combination of oil level between 0 and 40 and filler level
between 0 and 60 provides the lowest predicted Mooney viscosity (on the original or
log scale). We can visually approximate the minimizer using the data plots, but one
can do a more careful job of analysis using standard tools from calculus.
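
A quick numerical stand-in for that calculus is to evaluate the fitted surface over a fine grid of (oil, filler) values and locate the minimizer. This is a sketch of mine using the log-scale fit lm.lm.o2.f from above; since log is monotone, the same combination minimizes the original-scale viscosity.

# predicted log Mooney viscosity over a grid of oil and filler levels
grid <- expand.grid(oil = seq(0, 40, by = 0.5), filler = seq(0, 60, by = 0.5))
grid$pred.logmooney <- predict(lm.lm.o2.f, newdata = grid)
grid[which.min(grid$pred.logmooney), ]  # approximate minimizing combination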



Chapter 9

Discussion of Response Models with Factors and Predictors

We have considered simple models for designed experiments and observational studies
where a response variable is modeled as a linear combination of effects due to factors
or predictors, or both. With designed experiments, where only qualitative factors
are considered, we get a “pure ANOVA” model. For example, in the experiment
comparing survival times of beetles, the potential effects of insecticide (with levels
A, B, C, and D) and dose (with levels 1=low, 2=medium, and 3=high) are included
in the model as factors because these variables are qualitative. The natural model
to consider is a two-way ANOVA with effects for dose and insecticide and a dose-by-
insecticide interaction. If, however, the dose given to each beetle was recorded on a
measurement scale, then the dosages can be used to define a predictor variable which
can be used as a “regression effect” in a model. That is, the dose or some function of
dose can be used as a (quantitative) predictor instead of as a qualitative effect.
For simplicity, assume that the doses are 10, 20, and 30, but the actual levels are
irrelevant to the discussion. The simple additive model, or ANCOVA model, assumes
that there is a linear relationship between mean survival time and dose, with different
intercepts for the four insecticides. If the data set includes the survival time (times) for
each beetle, the insecticide (insect: an alphanumeric variable, with values A, B, C,
and D), and dose, you would fit the ANCOVA model this way
beetles$insect <- factor(beetles$insect)
lm.t.i.d <- lm(times ~ insect + dose, data = beetles)
A more complex model that allows separate regression lines for each insecticide is
specified as follows:
beetles$insect <- factor(beetles$insect)
lm.t.i.d.id <- lm(times ~ insect + dose + insect:dose, data = beetles)
It is important to recognize that the factor() statement defines which variables
in the model are treated as factors. Each effect of Factor data type is treated as
a factor. Effects in the model statement that are numeric data types are treated as


predictors. To treat a measurement variable as a factor (with one level for each
distinct observed value of the variable) instead of a predictor, convert that variable
type to a factor using factor(). Thus, in the survival time experiment, these models

beetles$insect <- factor(beetles$insect)


beetles$dose <- factor(beetles$dose)
# call this (A) for additive
lm.t.i.d <- lm(times ~ insect + dose, data = beetles)
# call this (I) for interaction
lm.t.i.d.id <- lm(times ~ insect + dose + insect:dose, data = beetles)

give the analysis for a two-way ANOVA model without interaction and with interac-
tion, respectively, where both dose and insecticide are treated as factors (since dose
and insect are both converted to factors), even though we just defined dose on a
measurement scale!

Is there a basic connection between the ANCOVA and separate regression line
models for dose and two-way ANOVA models where dose and insecticide are treated
as factors? Yes — I mentioned a connection when discussing ANCOVA and I will try
now to make the connection more explicit.

For the moment, let us simplify the discussion and assume that only one insecticide
was used at three dose levels. The LS estimates of the mean responses from the
quadratic model

Times = β0 + β1 Dose + β2 Dose² + ε

are the observed average survival times at the three dose levels. The LS curve goes
through the mean survival time at each dose, as illustrated in the picture below.

If we treat dose as a factor, and fit the one-way ANOVA model

Times = Grand Mean + Dose Effect + Residual,

then the LS estimates of the population mean survival times are the observed mean
survival times. The two models are mathematically equivalent, but the parameters
have different interpretations. In essence, the one-way ANOVA model places no
restrictions on the values of the population means (no a priori relation between them)
at the three doses, and neither does the quadratic model! (WHY?)
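
A quick way to see this is with a small simulated example (a sketch, not the beetles data): with three distinct doses, the quadratic fit and the one-way ANOVA fit produce identical fitted values, namely the observed dose means.
# Sketch with simulated data: three doses, four observations per dose.
set.seed(1)
sim <- data.frame(dose = rep(c(10, 20, 30), each = 4))
sim$times <- 3 + 0.1 * sim$dose + rnorm(nrow(sim), sd = 0.5)
fit.quad  <- lm(times ~ dose + I(dose^2), data = sim)
fit.anova <- lm(times ~ factor(dose)    , data = sim)
# the fitted values agree with each other and with the dose means
cbind(quad = fitted(fit.quad), anova = fitted(fit.anova),
      mean = ave(sim$times, sim$dose))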



[Figure: survival time (times) plotted against dose (10 to 30), with points labeled by insecticide (A, B, C, D) and the dose means marked; the fitted quadratic passes through the mean survival time at each dose.]

In a one-way ANOVA, the standard hypothesis of interest is that the dose effects
are zero. This can be tested using the one-way ANOVA F-test, or by testing H0 :
β1 = β2 = 0 in the quadratic model. With three dosages, the absence of a linear or
quadratic effect implies that all the population mean survival times must be equal.
An advantage of the polynomial model over the one-way ANOVA is that it provides
an easy way to quantify how dose impacts the mean survival, and a convenient way
to check whether a simple description such as a simple linear regression model is
adequate to describe the effect.
More generally, if dose has p levels, then the one-way ANOVA model
Times = Grand Mean + Dose Effect + Residual,
is equivalent to the (p − 1)st degree polynomial

Times = β0 + β1 Dose + β2 Dose² + · · · + βp−1 Dose^(p−1) + ε

and the one-way ANOVA F-test for no treatment effects is equivalent to testing
H0 : β1 = β2 = · · · = βp−1 = 0 in this polynomial.
Returning to the original experiment with 4 insecticides and 3 doses, I can show
the following two equivalences. First, the two-way additive ANOVA model, with
insecticide and dose as factors, i.e., model (A), is mathematically equivalent to an
additive model with insecticide as a factor, and a quadratic effect in dose:
beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2 <- lm(times ~ insect + dose + I(dose^2), data = beetles)
Thinking of dose2 as a quadratic term in dose, rather than as an interaction,
this model has an additive insecticide effect, but the dose effect is not differentiated
across insecticides. That is, the model assumes that the quadratic curves for the
four insecticides differ only in level (i.e., different intercepts) and that the coefficients


for the dose and dose2 effects are identical across insecticides. This is an additive
model, because the population means plot has parallel profiles. A possible pictorial
representation of this model is given below.


[Figure: a pictorial representation of the additive model: parallel quadratic curves of mean survival time (times) versus dose, one for each insecticide (A, B, C, D), differing only in intercept.]

Second, the two-way ANOVA interaction model, with insecticide and dose as
factors, i.e., model (I), is mathematically equivalent to an interaction model with
insecticide as a factor, and a quadratic effect in dose.
beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2.id.id2 <- lm(times ~ insect + dose + I(dose^2)
+ insect:dose + insect:I(dose^2), data = beetles)
This model fits separate quadratic relationships for each of the four insecticides,
by including interactions between insecticides and the linear and quadratic terms in
dose. Because dose has three levels, this model places no restrictions on the mean
responses.
To summarize, we have established that
• The additive two-way ANOVA model with insecticide and dose as factors is
mathematically identical to an additive model with an insecticide factor and a
quadratic effect in dose. The ANCOVA model with a linear effect in dose is a
special case of these models, where the quadratic effect is omitted.
• The two-way ANOVA interaction model with insecticide and dose as factors is
mathematically identical to a model with an insecticide factor, a quadratic effect
in dose, and interactions between the insecticide and the linear and quadratic
dose effects. The separate regression lines model with a linear effect in dose is a
special case of these models, where the quadratic dose effect and the interaction
of the quadratic term with insecticide are omitted.
Recall that response models with factors and predictors as effects can be fit
using the lm() procedure, but each factor or interaction involving a factor must be


represented in the model using indicator variables or product terms. The number of
required indicators or product effects is one less than the number of distinct levels of
the factor. For example, to fit the model with “parallel” quadratic curves in dose, you
can define (in the data.frame()) three indicator variables for the insecticide effect,
say I1 , I2 , and I3 , and fit the model

Times = β0 + β1 I1 + β2 I2 + β3 I3 + β4 Dose + β5 Dose² + ε.

For the “quadratic interaction model”, you must define 6 interaction or product terms
between the 3 indicators and the 2 dose terms:

Times = β0 + β1 I1 + β2 I2 + β3 I3 + β4 Dose + β5 Dose²
          + β6 I1 Dose + β7 I2 Dose + β8 I3 Dose
          + β9 I1 Dose² + β10 I2 Dose² + β11 I3 Dose² + ε.

The (β6 I1 Dose + β7 I2 Dose + β8 I3 Dose) component in the model formally corresponds
to the insect ∗ dose interaction, whereas the (β9 I1 Dose² + β10 I2 Dose² + β11 I3 Dose²)
component is equivalent to the insect ∗ dose ∗ dose interaction (i.e., testing H0 : β9 =
β10 = β11 = 0).
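
For concreteness, here is a sketch (assuming a beetles data frame with a numeric dose and insecticide levels A, B, C, and D, with A as the baseline) that fits the quadratic interaction model through explicitly constructed indicator variables; it reproduces the factor-based fit lm.t.i.d.d2.id.id2 above.
# Sketch: indicator-variable version of the quadratic interaction model.
# Note that dose must be a numeric variable here, not a factor.
beetles$I1 <- as.numeric(beetles$insect == "B")
beetles$I2 <- as.numeric(beetles$insect == "C")
beetles$I3 <- as.numeric(beetles$insect == "D")
lm.t.ind <- lm(times ~ I1 + I2 + I3 + dose + I(dose^2)
                     + I1:dose + I2:dose + I3:dose
                     + I1:I(dose^2) + I2:I(dose^2) + I3:I(dose^2)
               , data = beetles)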
This discussion is not intended to confuse, but rather to impress upon you the inti-
mate connection between regression and ANOVA, and to convince you of the care that
is needed when modelling variation even in simple studies. Researchers are usually
faced with more complex modelling problems than we have examined, where many
variables might influence the response. If experimentation is possible, a scientist will
often control the levels of variables that influence the response but that are not of
primary interest. This can result in a manageable experiment with, say, four or fewer
qualitative or quantitative variables that are systematically varied in a scientifically
meaningful way. In observational studies, where experimentation is not possible, the
scientist builds models to assess the effects of interest on the response, adjusting the
response for all the uncontrolled variables that might be important. The uncontrolled
variables are usually a mixture of factors and predictors. Ideally, the scientist knows
what variables to control in an experiment and which to vary, and what variables are
important to collect in an observational study.
The level of complexity that I am describing here might be intimidating, but
certain basic principles can be applied to many of the studies you will see. Graduate
students in statistics often take several courses (5+) in experimental design, regression
analysis, and linear model theory to master the breadth of models, and the subtleties
of modelling, that are needed to be a good data analyst. I can only scratch the
surface here. I will discuss a reasonably complex study having multiple factors and
multiple predictors. The example focuses on strategies for building models, with little
attempt to do careful diagnostic analyses. Hopefully, the example will give you an


appreciation for statistical modelling, but please be careful — these tools are
dangerous!

9.1 Some Comments on Building Models


A primary goal in many statistical analyses is to build a model or models to under-
stand the variation in a response. Fortunately, or unfortunately, there is no consensus
on how this should be done. Ideally, theory would suggest models to compare, but in
many studies the goal is to provide an initial model that will be refined, validated, or
refuted, by further experimentation. An extreme view is that the selected model(s)
should only include effects that are “statistically important”, whereas another ex-
treme suggests that all effects that might be “scientifically important” should be
included.
A difficulty with implementing either approach is that importance is relative to
specific goals (i.e., Why are you building the model and what do you plan to use the
model for? Is the model a prescription or device to make predictions? Is the model
a tool to understand the effect that one or more variables have on a response, after
adjusting for uninteresting, but important effects that can not be controlled? etc.)
Madigan and Raftery, in a 1994 article in the Journal of the American Statistical
Association, comment that “Science is an iterative process in which competing models
of reality are compared on the basis of how well they predict what is observed; models
that predict much less well than their competitors are discarded.” They argue that
models should be selected using Occam’s razor, a widely accepted norm in scientific
investigations whereby the simplest plausible model among all reasonable models,
given the data, is preferred. Madigan and Raftery’s ideas are fairly consistent with
the first extreme, but can be implemented in a variety of ways, depending on how you
measure prediction adequacy. They propose a Bayesian approach, based on model
averaging and prior beliefs on the plausibility of different models. An alternative
method using Mallows’ Cp criterion will be discussed later.
A simple compromise between the two extremes might be to start the model
building process with the most complex model that is scientifically reasonable, but
still interpretable, and systematically eliminate effects using backward elimination.
The initial or maximal model might include polynomial effects for predictors, main
effects and interactions (2 factor, 3 factor, etc.) between factors, and products or
interactions between predictors and factors. This approach might appear to be less
than ideal because the importance of effects is assessed using hypothesis tests and
no attempt is made to assess the effect of changes on predictions. However, one can
show that the average squared error in predictions is essentially reduced by eliminating
insignificant regression effects from the model, so this approach seems tenable.
It might be sensible to only assess significance of effects specified in the model


statement. However, many of these effects consist of several degrees-of-freedom. That


is, the effect corresponds to several regression coefficients in the model. (Refer to the
discussion following the displayed equations on page 621). The individual regression
variables that comprise an effect could also be tested individually. However, if the
effect is a factor (with 3+ levels) or an interaction involving a factor, then the in-
terpretation of tests on individual regression coefficients depends on the level of the
factor that was selected to be the baseline category. The Type III F -test on the
entire effect does not depend on the baseline category. In essence, two researchers
can start with different representations of the same mathematical model (i.e., the
parameters are defined differently for different choices of baseline categories), use the
same algorithm for selecting a model, yet come to different final models for the data.
Statisticians often follow the hierarchy principle, which states that a lower order
term (be it a factor or a predictor) may be considered for exclusion from a model
only if no higher order effects that include the term are present in the model. For
example, given an initial model with effects A, B, C, and the A ∗ B interaction, the
only candidates for omission at the first step are C and A ∗ B. If you follow the
hierarchy principle, and test an entire effect rather than test the single degree-of-
freedom components that comprise an effect, then the difficulty described above can
not occur. The hierarchy principle is most appealing with pure ANOVA models (such
as the three-factor model in the example below), where all the regression variables
are indicators. In ANOVA models, the ANOVA effects are of interest because they
imply certain structure on the means. The individual regression variables that define
the effects are not usually a primary interest.
Backward elimination of whole model terms is implemented in the step() procedure;
because it is built on drop1(), it respects the hierarchy principle, offering a term for
removal only when no higher-order term containing it remains in the model. Recall
our discussion of backwards elimination from Chapter 3 earlier this semester.
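
The drop1() function that step() relies on makes this concrete; a small sketch with simulated data (hypothetical variables A, B, C, and response y) shows that, with A, B, C, and the A∗B interaction in the model, only C and A∗B are offered as candidates for removal.
# Sketch: drop1() respects the hierarchy of model terms.
set.seed(2)
sim <- data.frame(A = gl(2, 8), B = gl(4, 2, 16), C = rnorm(16), y = rnorm(16))
drop1(lm(y ~ A + B + C + A:B, data = sim), test = "F")
# only C and A:B appear in the drop table; A and B are protected by A:B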

9.2 Example: The Effect of Sex and Rank on Faculty Salary
The data in this example were collected from the personnel files of faculty at a small
college in the 1970s. The data were collected to assess whether women were being
discriminated against (consciously or unconsciously) in salary. The sample consists
of tenured and tenure-stream faculty only. Temporary faculty were excluded from
consideration (because they were already being discriminated against).
The variables below are id (individual identification numbers from 1 to 52), sex
(coded 1 for female and 0 for male), rank (coded 1 for Asst. Professor, 2 for Assoc.
Professor and 3 for Full Professor), year (number of years in current rank), degree


(coded 1 for Doctorate, 0 else), yd (number of years since highest degree was earned),
and salary (academic year salary in dollars).
#### Example: Faculty salary
faculty <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch09_faculty.dat"
, header = TRUE)
head(faculty)
## id sex rank year degree yd salary
## 1 1 0 3 25 1 35 36350
## 2 2 0 3 13 1 22 35350
## 3 3 0 3 10 1 23 28200
## 4 4 1 3 7 1 27 26775
## 5 5 0 3 19 0 30 33696
## 6 6 0 3 16 1 21 28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : int 0 0 0 1 0 0 1 0 0 0 ...
## $ rank : int 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 25 13 10 7 19 16 0 16 13 13 ...
## $ degree: int 1 1 1 1 0 1 0 1 0 0 ...
## $ yd : int 35 22 23 27 30 21 32 18 30 31 ...
## $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
faculty$sex <- factor(faculty$sex , labels=c("Male", "Female"))
# ordering the rank variable so Full is the baseline, then descending.
faculty$rank <- factor(faculty$rank , levels=c(3,2,1)
, labels=c("Full", "Assoc", "Asst"))
faculty$degree <- factor(faculty$degree, labels=c("Other", "Doctorate"))
head(faculty)
## id sex rank year degree yd salary
## 1 1 Male Full 25 Doctorate 35 36350
## 2 2 Male Full 13 Doctorate 22 35350
## 3 3 Male Full 10 Doctorate 23 28200
## 4 4 Female Full 7 Doctorate 27 26775
## 5 5 Male Full 19 Other 30 33696
## 6 6 Male Full 16 Doctorate 21 28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
## $ rank : Factor w/ 3 levels "Full","Assoc",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 25 13 10 7 19 16 0 16 13 13 ...
## $ degree: Factor w/ 2 levels "Other","Doctorate": 2 2 2 2 1 2 1 2 1 1 ...
## $ yd : int 35 22 23 27 30 21 32 18 30 31 ...
## $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...

The data includes two potential predictors of salary (year and yd), and three


factors (sex, rank, and degree). A primary statistical interest is whether males and
females are compensated equally, on average, after adjusting salary for rank, years in
rank, and the other given effects. Furthermore, we wish to know whether an effect
due to sex is the same for each rank, or not.

Before answering these questions, let us look at the data. I will initially focus
on the effect of the individual factors (sex, rank, and degree) on salary. A series of
box-plots is given below. Looking at the boxplots, notice that women tend to earn
less than men, that faculty with Doctorates tend to earn more than those without
Doctorates (median), and that salary tends to increase with rank.
# plot marginal boxplots

# Plot the data using ggplot


library(ggplot2)
p1 <- ggplot(faculty, aes(x = sex, y = salary, group = sex))
# plot a reference line for the global mean (assuming no groups)
p1 <- p1 + geom_hline(aes(yintercept = mean(salary)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p1 <- p1 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p1 <- p1 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p1 <- p1 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.5)
# confidence limits based on normal distribution
p1 <- p1 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, alpha = 0.8)
p1 <- p1 + labs(title = "Salary by sex")

# Plot the data using ggplot


library(ggplot2)
p2 <- ggplot(faculty, aes(x = degree, y = salary, group = degree))
# plot a reference line for the global mean (assuming no groups)
p2 <- p2 + geom_hline(aes(yintercept = mean(salary)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p2 <- p2 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p2 <- p2 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.5)
# confidence limits based on normal distribution
p2 <- p2 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, alpha = 0.8)
p2 <- p2 + labs(title = "Salary by degree")

# Plot the data using ggplot


library(ggplot2)
p3 <- ggplot(faculty, aes(x = rank, y = salary, group = rank))
# plot a reference line for the global mean (assuming no groups)
p3 <- p3 + geom_hline(aes(yintercept = mean(salary)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p3 <- p3 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p3 <- p3 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p3 <- p3 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.5)
# confidence limits based on normal distribution
p3 <- p3 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, alpha = 0.8)
p3 <- p3 + labs(title = "Salary by rank")

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), nrow = 1)


[Figure: side-by-side boxplots of salary by sex (Male, Female), by degree (Other, Doctorate), and by rank (Full, Assoc, Asst), each panel showing jittered points, the group means, normal-theory confidence limits, and a dashed line at the overall mean salary.]

9.2.1 A Three-Way ANOVA on Salary Data


Hopefully, our earlier analyses have cured you of the desire to claim that a sex effect
exists before considering whether the differences between male and female salaries
might be due to other factors. The output below gives the sample sizes, means, and
standard deviations for the 11 combinations of sex, rank, and degree observed in the
data. Side-by-side boxplots of the salaries for the 11 combinations are also provided.
One combination of the three factors was not observed: female Associate Professors
without Doctorates.
Looking at the summaries, the differences between sexes within each combination
of rank and degree appear to be fairly small. There is a big difference in the ranks of
men and women, with a higher percentage of men in the more advanced ranks. This
might explain the differences between male and female salaries, when other factors
are ignored.
library(plyr)
fac.sum <- ddply(faculty, .(sex, rank, degree), function(.df) {
data.frame(n = length(.df$salary)
, m = mean(.df$salary)
, s = sd(.df$salary)
)
})
fac.sum


## sex rank degree n m s


## 1 Male Full Other 4 30711.50 4241.9723
## 2 Male Full Doctorate 12 29592.75 3479.9566
## 3 Male Assoc Other 7 23584.57 1733.2365
## 4 Male Assoc Doctorate 5 23246.20 2120.2684
## 5 Male Asst Other 3 20296.00 3016.9642
## 6 Male Asst Doctorate 7 16901.14 728.9962
## 7 Female Full Other 1 24900.00 NA
## 8 Female Full Doctorate 3 30106.67 6904.2927
## 9 Female Assoc Other 2 21570.00 1244.5079
## 10 Female Asst Other 1 21600.00 NA
## 11 Female Asst Doctorate 7 17005.71 1834.6791
# plot marginal boxplots

library(ggplot2)
# create position dodge offset for plotting points
pd <- position_dodge(0.75) # 0.75 puts dots up center of boxplots

p <- ggplot(faculty, aes(x = degree, y = salary, fill = sex))


# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = mean(salary)),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.25 for thin lines
p <- p + geom_boxplot(size = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = pd, alpha = 0.5)
p <- p + facet_grid(. ~ rank)
p <- p + scale_y_continuous(limits = c(0, max(faculty$salary)))
p <- p + labs(title = "Salary by rank, degree, and sex")
print(p)

[Figure: “Salary by rank, degree, and sex”: boxplots of salary by degree (Other, Doctorate) and sex (Male, Female), faceted by rank (Full, Assoc, Asst), with a dashed line at the overall mean salary.]

I will consider two simple analyses of these data. The first analysis considers the
effect of the three factors on salary. The second analysis considers the effect of the
predictors. A complete analysis using both factors and predictors is then considered.


I am doing the three factor analysis because the most complex pure ANOVA problem
we considered this semester has two factors — the analysis is for illustration only!!
The full model for a three-factor study includes the three main effects, the three
possible two-factor interactions, plus the three-factor interaction. Identifying the
factors by S (sex), D (degree) and R (rank), we write the full model as
Salary = Grand mean + S effect + D effect + R effect
+S*D interaction + S*R interaction + R*D interaction
+S*D*R interaction + Residual.
You should understand what main effects and two-factor interactions measure, but
what about the three-factor term? If you look at the two levels of degree separately,
then a three-factor interaction is needed if the interaction between sex and rank is
different for the two degrees. (i.e., the profile plots are different for the two degrees).
Not surprisingly, three-factor interactions are hard to interpret.
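
One informal way to look for a possible three-factor interaction (a sketch, not in the original analysis) is to draw profile plots of the mean salary by sex and rank separately for each degree and compare their shapes.
# Sketch: profile (interaction) plots of mean salary, sex by rank, by degree.
library(ggplot2)
p <- ggplot(faculty, aes(x = rank, y = salary, colour = sex, group = sex))
p <- p + stat_summary(fun.y = mean, geom = "line")
p <- p + stat_summary(fun.y = mean, geom = "point", size = 3)
p <- p + facet_grid(. ~ degree)
p <- p + labs(title = "Mean salary profiles by sex and rank, for each degree")
print(p)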
I considered a hierarchical backward elimination of effects (see Chapter 3 for
details). Individual regression variables are not considered for deletion, unless they
correspond to an effect in the model statement. All tests were performed at the 0.10
level, but this hardly matters here.
The first step in the elimination is to fit the full model and check whether the
three-factor term is significant. The three-factor term was not significant (in fact, it
couldn’t be fit because one category had zero observations). After eliminating this
effect, I fit the model with all three two-factor terms, and then sequentially deleted the
least important effects, one at a time, while still adhering to the hierarchy principle
using the AIC criterion from the step() function. The final model includes only an
effect due to rank. Finally, I compute the lsmeans() to compare salary for all pairs
of rank.
# fit full model
lm.faculty.factor.full <- lm(salary ~ sex*rank*degree, data = faculty)

## Note that there are not enough degrees-of-freedom to estimate all these effects
## because we have 0 observations for Female/Assoc/Doctorate
library(car)
Anova(lm.faculty.factor.full, type=3)
## Error in Anova.III.lm(mod, error, singular.ok = singular.ok, ...): there are aliased
coefficients in the model
summary(lm.faculty.factor.full)
##
## Call:
## lm(formula = salary ~ sex * rank * degree, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max


## -6261.5 -1453.0 -225.9 1349.7 7938.3


##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30711.5 1485.0 20.681 < 2e-16
## sexFemale -5811.5 3320.7 -1.750 0.087581
## rankAssoc -7126.9 1861.6 -3.828 0.000433
## rankAsst -10415.5 2268.4 -4.591 4.13e-05
## degreeDoctorate -1118.8 1714.8 -0.652 0.517774
## sexFemale:rankAssoc 3796.9 4086.3 0.929 0.358229
## sexFemale:rankAsst 7115.5 4773.7 1.491 0.143734
## sexFemale:degreeDoctorate 6325.4 3834.4 1.650 0.106653
## rankAssoc:degreeDoctorate 780.4 2442.3 0.320 0.750952
## rankAsst:degreeDoctorate -2276.1 2672.3 -0.852 0.399304
## sexFemale:rankAssoc:degreeDoctorate NA NA NA NA
## sexFemale:rankAsst:degreeDoctorate -7524.8 5383.7 -1.398 0.169720
##
## (Intercept) ***
## sexFemale .
## rankAssoc ***
## rankAsst ***
## degreeDoctorate
## sexFemale:rankAssoc
## sexFemale:rankAsst
## sexFemale:degreeDoctorate
## rankAssoc:degreeDoctorate
## rankAsst:degreeDoctorate
## sexFemale:rankAssoc:degreeDoctorate
## sexFemale:rankAsst:degreeDoctorate
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2970 on 41 degrees of freedom
## Multiple R-squared: 0.7975,Adjusted R-squared: 0.7481
## F-statistic: 16.14 on 10 and 41 DF, p-value: 2.989e-11
## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.faculty.factor.red.AIC <- step(lm.faculty.factor.full, direction="backward", test="F")
## Start: AIC=841.26
## salary ~ sex * rank * degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 361677227 841.26
## - sex:rank:degree 1 17233177 378910404 841.68 1.9536 0.1697
## Because the full model can not be fit, the step() procedure does not work
## Below we remove the three-way interaction, then the step() procedure will


## do the rest of the work for us.

Remove the three-way interaction, then use step() to perform backward selection
based on AIC.
# model reduction using update() and subtracting (removing) model terms
lm.faculty.factor.red <- lm.faculty.factor.full;

# remove variable
lm.faculty.factor.red <- update(lm.faculty.factor.red, ~ . - sex:rank:degree );
Anova(lm.faculty.factor.red, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 3932650421 1 435.9113 < 2.2e-16 ***
## sex 11227674 1 1.2445 0.2709438
## rank 196652264 2 10.8989 0.0001539 ***
## degree 421614 1 0.0467 0.8298945
## sex:rank 2701493 2 0.1497 0.8614045
## sex:degree 7661926 1 0.8493 0.3620198
## rank:degree 33433415 2 1.8529 0.1693627
## Residuals 378910404 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AIC backward selection
lm.faculty.factor.red.AIC <- step(lm.faculty.factor.red, direction="backward", test="F")
## Start: AIC=841.68
## salary ~ sex + rank + degree + sex:rank + sex:degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:rank 2 2701493 381611896 838.05 0.1497 0.8614
## - sex:degree 1 7661926 386572329 840.72 0.8493 0.3620
## <none> 378910404 841.68
## - rank:degree 2 33433415 412343819 842.08 1.8529 0.1694
##
## Step: AIC=838.05
## salary ~ sex + rank + degree + sex:degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:degree 1 12335789 393947686 837.71 1.4223 0.2394
## <none> 381611896 838.05
## - rank:degree 2 32435968 414047864 838.29 1.8699 0.1662
##
## Step: AIC=837.71
## salary ~ sex + rank + degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)


## - sex 1 3009036 396956722 836.10 0.3437 0.5606


## - rank:degree 2 27067985 421015671 837.16 1.5460 0.2242
## <none> 393947686 837.71
##
## Step: AIC=836.1
## salary ~ rank + degree + rank:degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rank:degree 2 31019255 427975976 836.01 1.7973 0.1772
## <none> 396956722 836.10
##
## Step: AIC=836.01
## salary ~ rank + degree
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - degree 1 10970082 438946058 835.33 1.2304 0.2729
## <none> 427975976 836.01
## - rank 2 1349072233 1777048209 906.04 75.6532 1.45e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=835.33
## salary ~ rank
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 438946058 835.33
## - rank 2 1346783800 1785729858 904.30 75.171 1.174e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# all are significant, stop.
# final model: salary ~ rank
lm.faculty.factor.final <- lm.faculty.factor.red.AIC

library(car)
Anova(lm.faculty.factor.final, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.7593e+10 1 1963.932 < 2.2e-16 ***
## rank 1.3468e+09 2 75.171 1.174e-15 ***
## Residuals 4.3895e+08 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.factor.final)
##
## Call:


## lm(formula = salary ~ rank, data = faculty)


##
## Residuals:
## Min 1Q Median 3Q Max
## -5209.0 -1819.2 -417.8 1586.6 8386.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29659.0 669.3 44.316 < 2e-16 ***
## rankAssoc -6483.0 1043.0 -6.216 1.09e-07 ***
## rankAsst -11890.3 972.4 -12.228 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2993 on 49 degrees of freedom
## Multiple R-squared: 0.7542,Adjusted R-squared: 0.7442
## F-statistic: 75.17 on 2 and 49 DF, p-value: 1.174e-15

All ranks are different with salaries increasing with rank.


### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
lsmeans(lm.faculty.factor.final, list(pairwise ~ rank), adjust = "bonferroni")
## $`lsmeans of rank`
## rank lsmean SE df lower.CL upper.CL
## Full 29658.95 669.2564 49 28314.03 31003.87
## Assoc 23175.93 799.9144 49 21568.44 24783.42
## Asst 17768.67 705.4582 49 16351.00 19186.34
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Full - Assoc 6483.021 1042.961 49 6.216 <.0001
## Full - Asst 11890.283 972.407 49 12.228 <.0001
## Assoc - Asst 5407.262 1066.553 49 5.070 <.0001
##
## P value adjustment: bonferroni method for 3 tests

This analysis suggests that sex is not predictive of salary, once other factors are
taken into account. In particular, faculty rank appears to be the sole important
effect, in the sense that once salaries are adjusted for rank no other factors explain a
significant amount of the unexplained variation in salaries.
As noted earlier, the analysis was meant to illustrate a three-factor ANOVA and
backward selection. The analysis is likely flawed, because it ignores the effects of years
in rank (year) and years since degree (yd) on salary.


9.2.2 Using Year and Year Since Degree to Predict Salary


Plots of the salary against years in rank and years since degree show fairly strong
associations with salary. The variability in salaries appears to be increasing with year
and with year since degree, which might be expected. You might think to transform
the salaries to a log scale to eliminate this effect, but doing so has little impact on
the conclusions (not shown).
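For reference, a sketch of that log-scale check (results omitted, as in the notes):
# Sketch: refit the predictor-only model on the log scale for comparison.
lm.logsalary.y.yd <- lm(log(salary) ~ year + yd, data = faculty)
summary(lm.logsalary.y.yd)
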
library(ggplot2)

p1 <- ggplot(faculty, aes(x = year, y = salary, colour = rank, shape = sex, size = degree))
p1 <- p1 + scale_size_discrete(range=c(3,5))
## Warning: Using size for a discrete variable is not advised.
p1 <- p1 + geom_point(alpha = 0.5)
p1 <- p1 + labs(title = "Salary by year")
p1 <- p1 + theme(legend.position = "bottom")
#print(p1)

p2 <- ggplot(faculty, aes(x = yd, y = salary, colour = rank, shape = sex, size = degree))
p2 <- p2 + scale_size_discrete(range=c(3,5))
## Warning: Using size for a discrete variable is not advised.
p2 <- p2 + geom_point(alpha = 0.5)
p2 <- p2 + labs(title = "Salary by yd")
p2 <- p2 + theme(legend.position = "bottom")
#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow = 1)

[Figure: scatterplots of salary versus year (years in rank) and salary versus yd (years since degree), with points colored by rank, shaped by sex, and sized by degree.]

As a point of comparison with the three-factor ANOVA, I fit a multiple regression


model with year and years since degree as predictors of salary. These two predictors
are important for explaining the variation in salaries, but together they explain much
less of the variation (58%) than rank does on its own (75%).


# interaction model
lm.s.y.yd.yyd <- lm(salary ~ year*yd, data = faculty)
summary(lm.s.y.yd.yyd)
##
## Call:
## lm(formula = salary ~ year * yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10368.5 -2361.5 -505.7 2363.1 12211.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16287.391 1395.049 11.675 1.25e-15 ***
## year 561.155 275.243 2.039 0.04700 *
## yd 235.415 83.266 2.827 0.00683 **
## year:yd -3.089 10.412 -0.297 0.76796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3958 on 48 degrees of freedom
## Multiple R-squared: 0.579,Adjusted R-squared: 0.5527
## F-statistic: 22 on 3 and 48 DF, p-value: 4.17e-09
# interaction is not significant
lm.s.y.yd <- lm(salary ~ year + yd, data = faculty)
summary(lm.s.y.yd)
##
## Call:
## lm(formula = salary ~ year + yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10321.2 -2347.2 -332.7 2298.8 12240.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16555.7 1052.4 15.732 < 2e-16 ***
## year 489.3 129.6 3.777 0.000431 ***
## yd 222.2 69.8 3.184 0.002525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3921 on 49 degrees of freedom
## Multiple R-squared: 0.5782,Adjusted R-squared: 0.561
## F-statistic: 33.58 on 2 and 49 DF, p-value: 6.532e-10


9.2.3 Using Factors and Predictors to Model Salaries


The plots we looked at helped us to understand the data. In particular, the plot
of salary against years in rank, using rank as a plotting symbol, suggests that a
combination of predictors and factors will likely be better for modelling faculty salaries
than either of the two models that we proposed up to this point.
There is no evidence of non-linearity in the plots of salary against the predictors,
so I will not consider transforming years since degree, years in rank, or salary. Note
that the increasing variability in salaries for increasing years in rank and increasing
years since degree is partly due to differences in the relationships across ranks. The
non-constant variance should be less of a concern in any model that includes rank
and either years in rank or years since degree as effects.
I started the model building process with a maximal or full model with the
five main effects plus the 10 possible interactions between two effects, regardless of
whether the effects were factors or predictors. Notationally, this model is written as
follows:

Salary = Grand mean + S effect + D effect + R effect + YEAR effect + YD effect


+S*D interaction + S*R interaction + S*YEAR interaction + S*YD interaction
+D*R interaction + D*YEAR interaction + D*YD interaction
+R*YEAR interaction + R*YD interaction + YEAR*YD interaction + Residual,

where the year (YEAR) and years since degree (YD) effects are linear terms (as in the multiple
regression model we considered). To check whether any important effects might
have been omitted, I added individual three-factor terms to this model. All of the
three factor terms were insignificant (not shown), so I believe that my choice for the
“maximal” model is sensible.
The output below gives the fit to the maximal model, and subsequent fits, using
the hierarchy principle. Only selected summaries are provided.
# fit full model with two-way interactions
lm.faculty.full <- lm(salary ~ (sex + rank + degree + year + yd)^2, data = faculty)
library(car)
Anova(lm.faculty.full, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 22605087 1 3.6916 0.06392 .
## sex 4092995 1 0.6684 0.41984
## rank 5731837 2 0.4680 0.63059
## degree 4137628 1 0.6757 0.41735
## year 2022246 1 0.3302 0.56966
## yd 3190911 1 0.5211 0.47578


## sex:rank 932237 2 0.0761 0.92688


## sex:degree 7164815 1 1.1701 0.28773
## sex:year 7194388 1 1.1749 0.28676
## sex:yd 2024210 1 0.3306 0.56947
## rank:degree 13021265 2 1.0632 0.35759
## rank:year 1571933 2 0.1284 0.88001
## rank:yd 9822382 2 0.8020 0.45750
## degree:year 4510249 1 0.7366 0.39735
## degree:yd 6407880 1 1.0465 0.31424
## year:yd 50921 1 0.0083 0.92793
## Residuals 189825454 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# This time I use BIC for model reduction by specifying k=
# (compare this to model result using AIC --
# too many nonsignificant parameters left in model)

## BIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.faculty.red.BIC <- step(lm.faculty.full, direction="backward", test="F"
, k=log(nrow(faculty)))
## Start: AIC=868.72
## salary ~ (sex + rank + degree + year + yd)^2
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:rank 2 932237 190757690 861.07 0.0761 0.9269
## - rank:year 2 1571933 191397386 861.24 0.1284 0.8800
## - rank:yd 2 9822382 199647836 863.44 0.8020 0.4575
## - rank:degree 2 13021265 202846719 864.26 1.0632 0.3576
## - year:yd 1 50921 189876375 864.78 0.0083 0.9279
## - sex:yd 1 2024210 191849663 865.32 0.3306 0.5695
## - degree:year 1 4510249 194335703 865.99 0.7366 0.3974
## - degree:yd 1 6407880 196233334 866.49 1.0465 0.3142
## - sex:degree 1 7164815 196990268 866.69 1.1701 0.2877
## - sex:year 1 7194388 197019841 866.70 1.1749 0.2868
## <none> 189825454 868.72
##
## Step: AIC=861.07
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## sex:yd + rank:degree + rank:year + rank:yd + degree:year +
## degree:yd + year:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rank:year 2 4480611 195238301 854.37 0.3876 0.6818
## - rank:yd 2 14587933 205345624 857.00 1.2618 0.2964
## - year:yd 1 25889 190783580 857.12 0.0045 0.9470


## - rank:degree 2 16365099 207122790 857.45 1.4155 0.2572


## - sex:yd 1 3293276 194050966 858.01 0.5697 0.4557
## - degree:year 1 4428068 195185758 858.31 0.7660 0.3878
## - degree:yd 1 6525075 197282766 858.87 1.1288 0.2957
## - sex:year 1 10462381 201220071 859.89 1.8099 0.1877
## - sex:degree 1 10654937 201412628 859.94 1.8432 0.1838
## <none> 190757690 861.07
##
## Step: AIC=854.37
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## sex:yd + rank:degree + rank:yd + degree:year + degree:yd +
## year:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - year:yd 1 582367 195820669 850.58 0.1044 0.7485
## - rank:degree 2 18612514 213850816 851.21 1.6683 0.2032
## - sex:yd 1 3008739 198247041 851.22 0.5394 0.4676
## - rank:yd 2 20258184 215496486 851.60 1.8158 0.1777
## - degree:year 1 7497925 202736226 852.38 1.3441 0.2542
## - degree:yd 1 8179958 203418259 852.56 1.4664 0.2340
## - sex:degree 1 12500896 207739197 853.65 2.2410 0.1434
## - sex:year 1 12669105 207907406 853.69 2.2712 0.1408
## <none> 195238301 854.37
##
## Step: AIC=850.58
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## sex:yd + rank:degree + rank:yd + degree:year + degree:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:yd 1 2456466 198277134 847.27 0.4516 0.50587
## - rank:degree 2 21836322 217656990 848.17 2.0072 0.14912
## - degree:year 1 7414066 203234734 848.56 1.3630 0.25069
## - degree:yd 1 9232872 205053541 849.02 1.6974 0.20090
## - sex:degree 1 12831931 208652600 849.93 2.3590 0.13330
## - sex:year 1 13646799 209467467 850.13 2.5089 0.12196
## <none> 195820669 850.58
## - rank:yd 2 41051000 236871669 852.57 3.7734 0.03253 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=847.27
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## rank:degree + rank:yd + degree:year + degree:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rank:degree 2 21157939 219435073 844.64 1.9741 0.15324
## - degree:year 1 8497324 206774458 845.50 1.5857 0.21583
## - degree:yd 1 9463400 207740534 845.75 1.7659 0.19202


## - sex:degree 1 10394382 208671516 845.98 1.9397 0.17202


## <none> 198277134 847.27
## - sex:year 1 22789419 221066553 848.98 4.2527 0.04626 *
## - rank:yd 2 42516602 240793736 849.47 3.9670 0.02749 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=844.64
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## rank:yd + degree:year + degree:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - degree:yd 1 361929 219797002 840.78 0.0643 0.8011
## - degree:year 1 855102 220290175 840.89 0.1520 0.6988
## - sex:degree 1 1616150 221051223 841.07 0.2872 0.5950
## - rank:yd 2 24391011 243826084 842.22 2.1675 0.1281
## - sex:year 1 10569795 230004869 843.14 1.8786 0.1783
## <none> 219435073 844.64
##
## Step: AIC=840.78
## salary ~ sex + rank + degree + year + yd + sex:degree + sex:year +
## rank:yd + degree:year
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:degree 1 3112507 222909509 837.56 0.5664 0.45609
## - degree:year 1 4414318 224211320 837.86 0.8033 0.37546
## - rank:yd 2 24695126 244492128 838.41 2.2471 0.11889
## - sex:year 1 16645026 236442028 840.62 3.0292 0.08947 .
## <none> 219797002 840.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=837.56
## salary ~ sex + rank + degree + year + yd + sex:year + rank:yd +
## degree:year
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - degree:year 1 2585275 225494784 834.21 0.4755 0.4943
## - rank:yd 2 25367664 248277174 835.26 2.3330 0.1098
## - sex:year 1 14770974 237680484 836.94 2.7168 0.1069
## <none> 222909509 837.56
##
## Step: AIC=834.21
## salary ~ sex + rank + degree + year + yd + sex:year + rank:yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rank:yd 2 24905278 250400062 831.75 2.3194 0.1108
## - degree 1 8902098 234396882 832.27 1.6581 0.2049


## - sex:year 1 14134386 239629170 833.42 2.6326 0.1122


## <none> 225494784 834.21
##
## Step: AIC=831.75
## salary ~ sex + rank + degree + year + yd + sex:year
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex:year 1 8458303 258858365 829.53 1.4863 0.22929
## - degree 1 11217823 261617885 830.08 1.9712 0.16734
## - yd 1 16309342 266709404 831.08 2.8659 0.09755 .
## <none> 250400062 831.75
## - rank 2 406263292 656663354 873.98 35.6941 6.144e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=829.53
## salary ~ sex + rank + degree + year + yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - sex 1 9134971 267993336 827.38 1.5880 0.2141
## - degree 1 10687589 269545954 827.68 1.8579 0.1796
## - yd 1 14868158 273726523 828.48 2.5847 0.1149
## <none> 258858365 829.53
## - year 1 144867403 403725768 848.69 25.1838 8.654e-06 ***
## - rank 2 399790682 658649047 870.19 34.7499 7.485e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=827.38
## salary ~ rank + degree + year + yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - degree 1 6684984 274678320 824.71 1.1475 0.2897
## - yd 1 7871680 275865016 824.93 1.3511 0.2511
## <none> 267993336 827.38
## - year 1 147642871 415636208 846.25 25.3423 7.839e-06 ***
## - rank 2 404108665 672102002 867.29 34.6818 6.544e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=824.71
## salary ~ rank + year + yd
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - yd 1 2314414 276992734 821.19 0.396 0.5322
## <none> 274678320 824.71
## - year 1 141105647 415783967 842.32 24.145 1.126e-05 ***
## - rank 2 478539101 753217421 869.26 40.941 5.067e-11 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=821.19
## salary ~ rank + year
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 276992734 821.19
## - year 1 161953324 438946058 841.18 28.065 2.905e-06 ***
## - rank 2 632056217 909048951 875.09 54.764 4.103e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The add1() function will indicate whether a variable from the “full” model should be
added to the current model. In our case, our BIC-backward selected model appears
adequate.
add1(lm.faculty.red.BIC, . ~ (sex + rank + degree + year + yd)^2, test="F")
## Single term additions
##
## Model:
## salary ~ rank + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 276992734 813.39
## sex 1 2304648 274688086 814.95 0.3943 0.5331
## degree 1 1127718 275865016 815.18 0.1921 0.6632
## yd 1 2314414 274678320 814.95 0.3960 0.5322
## rank:year 2 15215454 261777280 814.45 1.3368 0.2727

Let’s look carefully at our resulting model.


# all are significant, stop.
# final model: salary ~ year + rank
lm.faculty.final <- lm.faculty.red.BIC

library(car)
Anova(lm.faculty.final, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 4422688839 1 766.407 < 2.2e-16 ***
## rank 632056217 2 54.764 4.103e-13 ***
## year 161953324 1 28.065 2.905e-06 ***
## Residuals 276992734 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.final)


##
## Call:
## lm(formula = salary ~ rank + year, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3462.0 -1302.8 -299.2 783.5 9381.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25657.79 926.81 27.684 < 2e-16 ***
## rankAssoc -5192.24 871.83 -5.956 2.93e-07 ***
## rankAsst -9454.52 905.83 -10.437 6.12e-14 ***
## year 375.70 70.92 5.298 2.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2402 on 48 degrees of freedom
## Multiple R-squared: 0.8449,Adjusted R-squared: 0.8352
## F-statistic: 87.15 on 3 and 48 DF, p-value: < 2.2e-16
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
lsmeans(lm.faculty.final, list(pairwise ~ rank), adjust = "bonferroni")
## $`lsmeans of rank`
## rank lsmean SE df lower.CL upper.CL
## Full 28468.28 582.2789 48 27297.53 29639.03
## Assoc 23276.05 642.2996 48 21984.62 24567.48
## Asst 19013.76 613.0513 48 17781.14 20246.38
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Full - Assoc 5192.239 871.8328 48 5.956 <.0001
## Full - Asst 9454.523 905.8301 48 10.437 <.0001
## Assoc - Asst 4262.285 882.8914 48 4.828 <.0001
##
## P value adjustment: bonferroni method for 3 tests

9.2.4 Discussion of the Salary Analysis


The selected model is a simple ANCOVA model with a rank effect and a linear effect
due to years in rank. Note that the maximal model has 20 single df effects with an
R2 = 0.89 while the selected model has 3 single df effects with R2 = 0.84.
Looking at the parameter estimates table, all of the single df effects in the selected


model are significant. The baseline group is Full Professors, with rank=3. Predicted
salaries for the different ranks are given by:

Full:  predicted salary = 25658 + 375.70 year
Assoc: predicted salary = 25658 − 5192 + 375.70 year = 20466 + 375.70 year
Asst:  predicted salary = 25658 − 9454 + 375.70 year = 16204 + 375.70 year
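
These equations can be checked with predict() on the final model; a short sketch at year = 10 for each rank:
# Sketch: predicted salaries at year = 10 for each rank from the final model.
newd <- data.frame(rank = factor(c("Full", "Assoc", "Asst")
                                 , levels = c("Full", "Assoc", "Asst"))
                   , year = 10)
predict(lm.faculty.final, newdata = newd)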

Do you remember how to interpret the lsmeans, and the p-values for comparing
lsmeans?
You might be tempted to conclude that rank and years in rank are the only effects
that are predictive of salaries, and that differences in salaries by sex are insignificant,
once these effects have been taken into account. However, you must be careful because
you have not done a diagnostic analysis. The following two issues are also important
to consider.
A sex effect may exist even though there is insufficient evidence to support it based
on these data. (Lack of power corrupts; and absolute lack of power corrupts
absolutely!) If we are interested in the possibility of a sex effect, I think that we
would do better by focusing on how large the effect might be, and whether it is
important. A simple way to check is by constructing a confidence interval for the sex
effect, based on a simple additive model that includes sex plus the effects that were
selected as statistically significant, rank and year in rank. I am choosing this model
because the omitted effects are hopefully small, and because the regression coefficient
for a sex indicator is easy to interpret in an additive model. Other models might be
considered for comparison. Summary output from this model is given below.
# add sex to the model
lm.faculty.final.sex <- update(lm.faculty.final, . ~ . + sex)
summary(lm.faculty.final.sex)
##
## Call:
## lm(formula = salary ~ rank + year + sex, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3286.3 -1311.8 -178.4 939.1 9002.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25390.65 1025.14 24.768 < 2e-16 ***
## rankAssoc -5109.93 887.12 -5.760 6.20e-07 ***
## rankAsst -9483.84 912.79 -10.390 9.19e-14 ***
## year 390.94 75.38 5.186 4.47e-06 ***
## sexFemale 524.15 834.69 0.628 0.533
## ---


## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2418 on 47 degrees of freedom
## Multiple R-squared: 0.8462,Adjusted R-squared: 0.8331
## F-statistic: 64.64 on 4 and 47 DF, p-value: < 2.2e-16
Men are the baseline group for the sex effect, so the predicted salaries for men
are 524 dollars less than those for women, adjusting for rank and year. A rough 95%
CI for the sex differential is the estimated sex coefficient plus or minus two standard
errors, or 524 ± 2 ∗ (835), or −1146 to 2194 dollars. The range of plausible values
for the sex effect would appear to contain values of practical importance, so further
analysis is warranted here.
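The exact t-based interval from confint() can replace the rough plus-or-minus two standard error calculation; a one-line sketch:
# Sketch: exact 95% CI for the sex coefficient in the additive model.
confint(lm.faculty.final.sex, "sexFemale", level = 0.95)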
Another concern, and potentially a more important issue, was raised by M. O.
Finkelstein in a 1980 discussion in the Columbia Law Review on the use of regres-
sion in discrimination cases: “. . . [a] variable may reflect a position or status
bestowed by the employer, in which case if there is discrimination in the
award of the position or status, the variable may be ‘tainted’.” Thus, if
women are unfairly held back from promotion through the faculty ranks, then using
faculty rank to adjust salary before comparing sexes may not be acceptable to the
courts. This suggests that an analysis comparing sexes but ignoring rank effects
might be justifiable. What happens if this is done?
lm.faculty.sex.yd <- lm(salary ~ sex + yd, data = faculty)
library(car)
Anova(lm.faculty.sex.yd, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 4275963832 1 231.4448 < 2.2e-16 ***
## sex 67178787 1 3.6362 0.06241 .
## yd 766344185 1 41.4799 4.883e-08 ***
## Residuals 905279453 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.sex.yd)
##
## Call:
## lm(formula = salary ~ sex + yd, data = faculty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9631.7 -2529.4 3.5 2298.0 13125.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 18355.23 1206.52 15.213 < 2e-16 ***
## sexFemale -2572.53 1349.08 -1.907 0.0624 .
## yd 380.69 59.11 6.440 4.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4298 on 49 degrees of freedom
## Multiple R-squared: 0.493,Adjusted R-squared: 0.4724
## F-statistic: 23.83 on 2 and 49 DF, p-value: 5.911e-08
We reach a similar result as before: there is insufficient evidence of a difference
between the sexes, largely because yd (which I am using in place of year, since year
is paired with rank) explains a large proportion of the variability in salary.
Furthermore (not shown), there is insufficient evidence for a sex:yd interaction.
However, rank and sex are (potentially) confounded, and these data cannot resolve
that question. Data on promotions would be needed to resolve this issue.



Chapter 10

Automated Model Selection for


Multiple Regression

Given data on a response variable Y and k predictors or binary variables X1 , X2 , . . . , Xk ,


we wish to develop a regression model to predict Y . Assuming that the collection
of variables is measured on the correct scale, and that the candidate list of effects
includes all the important predictors or binary variables, the most general model is
Y = β0 + β1 X1 + · · · + βk Xk + ε.
In most problems one or more of the effects can be eliminated from the full model
without loss of information. We want to identify the important effects, or equivalently,
eliminate the effects that are not very useful for explaining the variation in Y .
We will study several automated non-hierarchical methods for model selection.
Given a specific criterion for selecting a model, a method gives the best model. Before
applying these methods, plot Y against each predictor to see whether transformations
are needed. Although transformations of binary variables are not necessary, side-by-
side boxplots of the response across the levels of a factor give useful information on
the predictive ability of the factor. If a transformation of Xi is suggested, include
the transformation along with the original Xi in the candidate list. Note that we
can transform the predictors differently, for example, log(X1 ) and √X2 . However,
if several transformations are suggested for the response, then one should consider
doing one analysis for each suggested response scale before deciding on the final scale.
Different criteria for selecting models lead to different “best models.” Given a
collection of candidates for the best model, we make a choice of model on the basis
of (1) a direct comparison of models, if possible (2) examination of model adequacy
(residuals, influence, etc.) (3) simplicity — all things being equal, simpler models are
preferred, and (4) scientific plausibility.
I view the various criteria as a means to generate interesting models for further
consideration. I do not take any of them literally as best.

You should recognize that automated model selection methods should not replace
scientific theory when building models! Automated methods are best suited for ex-
ploratory analyses, in situations where the researcher has little scientific information
as a guide.
AIC/BIC were discussed in Section 3.2.1 for stepwise procedures and were used
in examples in Chapter 9. In those examples, I included the corresponding F -tests in
the ANOVA table as a criterion for dropping variables from a model. The next few
sections cover these methods in more detail, then discuss other criteria and selection
strategies, finishing with a few examples.

10.1 Forward Selection


In forward selection we add variables to the model one at a time. The steps in the
procedure are:
1. Find the variable in the candidate list with the largest correlation (ignoring
the sign) with Y . This variable gives a simple linear regression model with the
largest R2 . Suppose this is X1 . Then fit the model

Y = β0 + β1 X1 + ε (10.1)

and test H0 : β1 = 0. If we reject H0 , go to step 2. Otherwise stop and conclude


that no variables are important. A t-test can be used here, or the equivalent
ANOVA F -test.
2. Find the remaining variable which when added to model (10.1) increases R2 the
most (or equivalently decreases Residual SS the most). Suppose this is X2 . Fit
the model

Y = β0 + β1 X1 + β2 X2 + ε (10.2)

and test H0 : β2 = 0. If we do not reject H0 , stop and use model (10.1) to


predict Y . If we reject H0 , replace model (10.1) with (10.2) and repeat step 2
sequentially until no further variables are added to the model.
In forward selection we sequentially isolate the most important effect left in the
pool, and check whether it is needed in the model. If it is needed we continue the
process. Otherwise we stop.
The F -test default level for the tests on the individual effects is sometimes set as
high as α = 0.50 (SAS default). This may seem needlessly high. However, in many
problems certain variables may be important only in the presence of other variables.
If we force the forward selection to test at standard levels then the process will never
get “going” when none of the variables is important on its own.
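In R, a single forward step can be examined with add1(), which reports an F -test for
each candidate variable added to the current model. This is only a sketch; the data
frame dat and the variables y, x1, x2, x3 are hypothetical placeholders.
# start from the intercept-only model
lm.0 <- lm(y ~ 1, data = dat)
# F-test for adding each candidate variable to the current model;
# the variable with the smallest p-value is the forward-selection choice
add1(lm.0, scope = ~ x1 + x2 + x3, test = "F")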

10.2 Backward Elimination


The backward elimination procedure (discussed earlier this semester) deletes unim-
portant variables, one at a time, starting from the full model. The steps in the
procedure are:
1. Fit the full model
Y = β0 + β1 X1 + · · · + βk Xk + ε. (10.3)
2. Find the variable which when omitted from the full model (10.3) reduces R2 the
least, or equivalently, increases the Residual SS the least. This is the variable
that gives the largest p-value for testing an individual regression coefficient
H0 : βj = 0 for j > 0. Suppose this variable is Xk . If you reject H0 , stop and
conclude that the full model is best. If you do not reject H0 , delete Xk from
the full model, giving the new full model
Y = β0 + β1 X1 + · · · + βk−1 Xk−1 + ε
to replace (10.3). Repeat steps 1 and 2 sequentially until no further variables
can be deleted.
In backward elimination we isolate the least important effect left in the model,
and check whether it is important. If not, delete it and repeat the process. Otherwise,
stop. The default test level on the individual variables is sometimes set at α = 0.10
(SAS default).
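The corresponding single step in R is drop1(), which reports an F -test for removing
each term from the current model. Again a sketch with hypothetical names dat, y, x1,
x2, x3.
# start from the full model
lm.full <- lm(y ~ x1 + x2 + x3, data = dat)
# F-test for dropping each term; the term with the largest p-value is the
# backward-elimination candidate for removal
drop1(lm.full, test = "F")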

10.3 Stepwise Regression


Stepwise regression combines features of forward selection and backward elimination.
A deficiency of forward selection is that variables cannot be omitted from the model
once they are selected. This is problematic because many variables that are initially
important are not important once several other variables are included in the model.
In stepwise regression, we add variables to the model as in forward regression, but
include a backward elimination step every time a new variable is added. That is, every
time we add a variable to the model we ask whether any of the variables added earlier
can be omitted. If variables can be omitted, they are placed back into the candidate
pool for consideration at the next step of the process. The process continues until
no additional variables can be added, and none of the variables in the model can
be excluded. The procedure can start from an empty model, a full model, or an
intermediate model, depending on the software.
The p-values used for including and excluding variables in stepwise regression are
usually taken to be equal (why is this reasonable?), and sometimes set at α = 0.15
(SAS default).

10.3.1 Example: Indian systolic blood pressure


We revisit the example first introduced in Chapter 2. Anthropologists conducted a
study to determine the long-term effects of an environmental change on systolic blood
pressure. They measured the blood pressure and several other characteristics of 39
Indians who migrated from a very primitive environment high in the Andes into the
mainstream of Peruvian society at a lower altitude. All of the Indians were males at
least 21 years of age, and were born at a high altitude.
Let us illustrate the three model selection methods to build a regression model,
using systolic blood pressure (sysbp) as the response, and seven candidate predictors:
wt = weight in kilos; ht = height in mm; chin = chin skin fold in mm; fore =
forearm skin fold in mm; calf = calf skin fold in mm; pulse = pulse rate-beats/min,
and yrage = fraction, which is the proportion of each individual’s lifetime spent in
the new environment.
Below I generate simple summary statistics and plots. The plots do not suggest
any apparent transformations of the response or the predictors, so we will analyze
the data using the given scales.
#### Example: Indian
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)

# Description of variables
# id = individual id
# age = age in years yrmig = years since migration
# wt = weight in kilos ht = height in mm
# chin = chin skin fold in mm fore = forearm skin fold in mm
# calf = calf skin fold in mm pulse = pulse rate-beats/min
# sysbp = systolic bp diabp = diastolic bp

# Create the "fraction of their life" variable


# yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age

# correlation matrix and associated p-values testing "H0: rho == 0"


library(Hmisc)
i.cor <- rcorr(as.matrix(indian[,c("sysbp", "wt", "ht", "chin"
, "fore", "calf", "pulse", "yrage")]))
# print correlations with the response to 3 significant digits
signif(i.cor$r[1, ], 3)
## sysbp wt ht chin fore calf pulse yrage
## 1.000 0.521 0.219 0.170 0.272 0.251 0.133 -0.276
# scatterplots
library(ggplot2)
p1 <- ggplot(indian, aes(x = wt , y = sysbp)) + geom_point(size=2)
p2 <- ggplot(indian, aes(x = ht , y = sysbp)) + geom_point(size=2)

p3 <- ggplot(indian, aes(x = chin , y = sysbp)) + geom_point(size=2)


p4 <- ggplot(indian, aes(x = fore , y = sysbp)) + geom_point(size=2)
p5 <- ggplot(indian, aes(x = calf , y = sysbp)) + geom_point(size=2)
p6 <- ggplot(indian, aes(x = pulse, y = sysbp)) + geom_point(size=2)
p7 <- ggplot(indian, aes(x = yrage, y = sysbp)) + geom_point(size=2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6, p7), ncol=3
, top = "Scatterplots of response sysbp with each predictor variable")
[Figure: Scatterplots of response sysbp with each predictor variable (wt, ht, chin, fore, calf, pulse, yrage).]

The step() function provides the forward, backward, and stepwise procedures
based on AIC or BIC, and can report the corresponding F -tests.

## step() function specification


## The first two arguments of step(object, scope, ...) are
# object = a fitted model object.
# scope = a formula giving the terms to be considered for adding or dropping
## default is AIC
# for BIC, include k = log(nrow( [data.frame name] ))
# test="F" includes additional information
# for parameter estimate tests that we're familiar with

Forward selection output The output for the forward selection method is below.
BIC is our selection criterion, though similar decisions are made as if using F -tests.

Step 1 Variable wt =weight is entered first because it has the highest correlation
with sysbp =systolic blood pressure. The corresponding F -value is the square of the t-statistic for
testing the significance of the weight predictor in this simple linear regression model.

Step 2 Adding yrage =fraction to the simple linear regression model with
weight as a predictor increases R2 the most, or equivalently, decreases Residual SS
(RSS) the most.

Step 3 The last table has “<none>” as the first row indicating that the current
model (no change to current model) is the best under the current selection criterion.
# start with an empty model (just the intercept 1)
lm.indian.empty <- lm(sysbp ~ 1, data = indian)
# Forward selection, BIC with F-tests
lm.indian.forward.red.BIC <- step(lm.indian.empty
, sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, direction = "forward", test = "F", k = log(nrow(indian)))
## Start: AIC=203.38
## sysbp ~ 1
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + wt 1 1775.38 4756.1 194.67 13.8117 0.0006654 ***
## <none> 6531.4 203.38
## + yrage 1 498.06 6033.4 203.95 3.0544 0.0888139 .
## + fore 1 484.22 6047.2 204.03 2.9627 0.0935587 .
## + calf 1 410.80 6120.6 204.51 2.4833 0.1235725
## + ht 1 313.58 6217.9 205.12 1.8660 0.1801796
## + chin 1 189.19 6342.2 205.89 1.1037 0.3002710
## + pulse 1 114.77 6416.7 206.35 0.6618 0.4211339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##

## Step: AIC=194.67
## sysbp ~ wt
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + yrage 1 1314.69 3441.4 185.71 13.7530 0.0006991 ***
## <none> 4756.1 194.67
## + chin 1 143.63 4612.4 197.14 1.1210 0.2967490
## + calf 1 16.67 4739.4 198.19 0.1267 0.7240063
## + pulse 1 6.11 4749.9 198.28 0.0463 0.8308792
## + ht 1 2.01 4754.0 198.31 0.0152 0.9024460
## + fore 1 1.16 4754.9 198.32 0.0088 0.9257371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=185.71
## sysbp ~ wt + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3441.4 185.71
## + chin 1 197.372 3244.0 187.07 2.1295 0.1534
## + fore 1 50.548 3390.8 188.80 0.5218 0.4749
## + calf 1 30.218 3411.1 189.03 0.3101 0.5812
## + ht 1 23.738 3417.6 189.11 0.2431 0.6251
## + pulse 1 5.882 3435.5 189.31 0.0599 0.8081
summary(lm.indian.forward.red.BIC)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

Backward selection output The output for the backward elimination method is
below. BIC is our selection criterion, though similar decisions are made as if using F -tests.

Step 0 The full model has 7 predictors so REG df = 7. The F -test in the full
model ANOVA table (F = 4.91 with p-value=0.0008) tests the hypothesis that the
regression coefficient for each predictor variable is zero. This test is highly significant,
indicating that one or more of the predictors is important in the model.
The t-value column gives the t-statistic for testing the significance of the individual
predictors in the full model conditional on the other variables being in the model.
# start with a full model
lm.indian.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian)
summary(lm.indian.full)
##
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage,
## data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3993 -5.7916 -0.6907 6.9453 23.5771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.45766 53.91303 1.975 0.057277 .
## wt 1.71095 0.38659 4.426 0.000111 ***
## ht -0.04533 0.03945 -1.149 0.259329
## chin -1.15725 0.84612 -1.368 0.181239
## fore -0.70183 1.34986 -0.520 0.606806
## calf 0.10357 0.61170 0.169 0.866643
## pulse 0.07485 0.19570 0.383 0.704699
## yrage -29.31810 7.86839 -3.726 0.000777 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.994 on 31 degrees of freedom
## Multiple R-squared: 0.5259,Adjusted R-squared: 0.4189
## F-statistic: 4.913 on 7 and 31 DF, p-value: 0.0008079

The least important variable in the full model, as judged by the p-value, is
calf =calf skin fold. This variable, upon omission, reduces R2 the least, or equiva-
lently, increases the Residual SS the least. So calf is the first to be omitted from the
model.

Step 1 After deleting calf , the six predictor model is fitted. At least one of
the predictors left is important, as judged by the overall F -test p-value. The least
important predictor left is pulse =pulse rate.

# Backward selection, BIC with F-tests


lm.indian.backward.red.BIC <- step(lm.indian.full
, direction = "backward", test = "F", k = log(nrow(indian)))
## Start: AIC=199.91
## sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - calf 1 2.86 3099.3 196.28 0.0287 0.8666427
## - pulse 1 14.61 3111.1 196.43 0.1463 0.7046990
## - fore 1 27.00 3123.4 196.59 0.2703 0.6068061
## - ht 1 131.88 3228.3 197.88 1.3203 0.2593289
## - chin 1 186.85 3283.3 198.53 1.8706 0.1812390
## <none> 3096.4 199.91
## - yrage 1 1386.76 4483.2 210.68 13.8835 0.0007773 ***
## - wt 1 1956.49 5052.9 215.35 19.5874 0.0001105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=196.28
## sysbp ~ wt + ht + chin + fore + pulse + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - pulse 1 13.34 3112.6 192.79 0.1377 0.7130185
## - fore 1 26.99 3126.3 192.96 0.2787 0.6011969
## - ht 1 129.56 3228.9 194.22 1.3377 0.2560083
## - chin 1 184.03 3283.3 194.87 1.9000 0.1776352
## <none> 3099.3 196.28
## - yrage 1 1448.00 4547.3 207.57 14.9504 0.0005087 ***
## - wt 1 1953.77 5053.1 211.69 20.1724 8.655e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=192.79
## sysbp ~ wt + ht + chin + fore + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - fore 1 17.78 3130.4 189.35 0.1885 0.667013
## - ht 1 131.12 3243.8 190.73 1.3902 0.246810
## - chin 1 198.30 3310.9 191.53 2.1023 0.156514
## <none> 3112.6 192.79
## - yrage 1 1450.02 4562.7 204.04 15.3730 0.000421 ***
## - wt 1 1983.51 5096.2 208.35 21.0290 6.219e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=189.35
## sysbp ~ wt + ht + chin + yrage
##

## Df Sum of Sq RSS AIC F value Pr(>F)
## - ht 1 113.57 3244.0 187.07 1.2334 0.2745301
## - chin 1 287.20 3417.6 189.11 3.1193 0.0863479 .
## <none> 3130.4 189.35
## - yrage 1 1445.52 4575.9 200.49 15.7000 0.0003607 ***
## - wt 1 2263.64 5394.1 206.90 24.5857 1.945e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=187.07
## sysbp ~ wt + chin + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - chin 1 197.37 3441.4 185.71 2.1295 0.1534065
## <none> 3244.0 187.07
## - yrage 1 1368.44 4612.4 197.14 14.7643 0.0004912 ***
## - wt 1 2515.33 5759.3 205.80 27.1384 8.512e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=185.71
## sysbp ~ wt + yrage
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3441.4 185.71
## - yrage 1 1314.7 4756.1 194.67 13.753 0.0006991 ***
## - wt 1 2592.0 6033.4 203.95 27.115 7.966e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian.backward.red.BIC)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438

## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06


In the final table we are unable to drop yrage or wt from the model.

Stepwise selection output The output for the stepwise selection is given below.
Variables are listed in the output tables in order that best improves the AIC/BIC
criterion. In the stepwise case, BIC will decrease (improve) by considering variables
to drop or add (indicated in the first column by − and +). Rather than printing
a small table at each step of the step() procedure, we use lm.XXX$anova to print a
summary of the drop/add choices made.
# Stepwise (both) selection, BIC with F-tests, starting with intermediate model
# (this is a purposefully chosen "opposite" model,
# from the forward and backward methods this model
# includes all the variables dropped and none kept)
lm.indian.intermediate <- lm(sysbp ~ ht + fore + calf + pulse, data = indian)
# option: trace = 0 does not print each step of the selection
lm.indian.both.red.BIC <- step(lm.indian.intermediate
, sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
, direction = "both", test = "F", k = log(nrow(indian)), trace = 0)
# the anova object provides a summary of the selection steps in order
lm.indian.both.red.BIC$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 34 5651.131 212.3837
## 2 - pulse 1 2.874432 35 5654.005 208.7400
## 3 - calf 1 21.843631 36 5675.849 205.2268
## 4 + wt -1 925.198114 35 4750.651 201.9508
## 5 + yrage -1 1439.707117 34 3310.944 191.5335
## 6 - ht 1 79.870793 35 3390.815 188.7995
## 7 - fore 1 50.548149 36 3441.363 185.7131
summary(lm.indian.both.red.BIC)
##
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4330 -7.3070 0.8963 5.7275 23.9819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8959 14.2809 4.264 0.000138 ***
## wt 1.2169 0.2337 5.207 7.97e-06 ***
## yrage -26.7672 7.2178 -3.708 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##

## Residual standard error: 9.777 on 36 degrees of freedom
## Multiple R-squared: 0.4731,Adjusted R-squared: 0.4438
## F-statistic: 16.16 on 2 and 36 DF, p-value: 9.795e-06

Summary of three selection methods

All three methods using BIC choose the same final model, sysbp = β0 + β1 wt +
β2 yrage. Using the AIC criterion, you will find different results.

10.4 Other Model Selection Procedures


10.4.1 R2 Criterion
R2 is the proportion of variation explained by the model beyond the grand mean (intercept-only) model, and
we wish to maximize this. A substantial increase in R2 is usually observed when an
“important” effect is added to a regression model. With the R2 criterion, variables
are added to the model until further additions give inconsequential increases in R2 .
The R2 criterion is not well-defined in the sense of producing a single best model.
All other things being equal, I prefer the simpler of two models with similar values
of R2 . If several models with similar complexity have similar R2 s, then there may be
no good reason, at this stage of the analysis, to prefer one over the rest.

10.4.2 Adjusted-R2 Criterion, maximize


The adjusted-R2 criterion gives a way to compare R2 across models with different
numbers of variables, and we want to maximize this. This eliminates some of the
difficulty with calibrating R2 , which increases even when unimportant predictors are
added to a model. For a model with p variables and an intercept, the adjusted-R2 is
defined by
R̄2 = 1 − [(n − 1)/(n − p − 1)] (1 − R2 ),
where n is the sample size.
There are four properties of R̄2 worth mentioning:
1. R̄2 ≤ R2 ,
2. if two models have the same number of variables, then the model with the larger
R2 has the larger R̄2 ,
3. if two models have the same R2 , then the model with fewer variables has the
larger adjusted-R2 . Put another way, R̄2 penalizes complex models with many
variables. And

4. R̄2 can be less than zero for models that explain little of the variation in Y .
The adjusted R2 is easier to calibrate than R2 because it tends to decrease when
unimportant variables are added to a model. The model with the maximum R̄2 is
judged best by this criterion. As I noted before, I do not take any of the criteria
literally, and would also choose other models with R̄2 near the maximum value for
further consideration.
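As a quick check of the definition, R̄2 can be computed by hand and compared with the
value that summary() reports. A minimal sketch for a generic fitted model lm.fit (a
placeholder name):
# adjusted R^2 from the definition: 1 - ((n-1)/(n-p-1)) (1 - R^2)
n  <- nrow(model.frame(lm.fit))         # sample size
p  <- length(coef(lm.fit)) - 1          # number of predictors (excluding intercept)
r2 <- summary(lm.fit)$r.squared
1 - ((n - 1) / (n - p - 1)) * (1 - r2)  # should equal summary(lm.fit)$adj.r.squared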

10.4.3 Mallows’ Cp Criterion, minimize


Mallows’ Cp measures the adequacy of predictions from a model, relative to those
obtained from the full model, and we want to minimize Cp . Mallows’ Cp statistic is
defined for a given model with p variables by

Cp = (Residual SS)/σ̂2FULL − Residual df + (p + 1)
where σ̂2FULL is the Residual MS from the full model with k variables X1 , X2 , . . ., Xk .
If all the important effects from the candidate list are included in the model,
then the difference between the first two terms of Cp should be approximately zero.
Thus, if the model under consideration includes all the important variables from the
candidate list, then Cp should be approximately p + 1 (the number of variables in
model plus one), or less. If important variables from the candidate list are excluded,
Cp will tend to be much greater than p + 1.
Two important properties of Cp are
1. the full model has Cp = p + 1, where p = k, and
2. if two models have the same number of variables, then the model with the larger
R2 has the smaller Cp .
Models with Cp ≈ p + 1, or less, merit further consideration. As with R2 and R̄2 , I
prefer simpler models that satisfy this condition. The “best” model by this criterion
has the minimum Cp .
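Cp for a candidate model can be computed directly from the residual sums of squares of
the candidate and full fits. A minimal sketch, where lm.fit (the candidate model with p
predictors) and lm.full (the full model) are placeholder names:
# Mallows' Cp = RSS_p / sigma^2_FULL - Residual df + (p + 1)
rss.p       <- sum(residuals(lm.fit)^2)   # Residual SS of the candidate model
sigma2.full <- summary(lm.full)$sigma^2   # Residual MS of the full model
p           <- length(coef(lm.fit)) - 1
rss.p / sigma2.full - df.residual(lm.fit) + (p + 1)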

10.5 Illustration with Peru Indian data


R2 Criterion
# The leaps package provides best subsets with other selection criteria.
library(leaps)

# First, fit the full model


lm.indian.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian)

# Second, create the design matrix which leap uses as argument

# using model.matrix(lm.XXX) as input to leaps()

# R^2 -- for each model size, report best subset of size 5


leaps.r2 <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
, method = 'r2'
, int = FALSE, nbest = 5, names = colnames(model.matrix(lm.indian.full)))
str(leaps.r2)
## List of 4
## $ which: logi [1:36, 1:8] FALSE TRUE FALSE FALSE FALSE TRUE ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:36] "1" "1" "1" "1" ...
## .. ..$ : chr [1:8] "(Intercept)" "wt" "ht" "chin" ...
## $ label: chr [1:8] "(Intercept)" "wt" "ht" "chin" ...
## $ size : num [1:36] 1 1 1 1 1 2 2 2 2 2 ...
## $ r2 : num [1:36] 0.99 0.99 0.989 0.976 0.845 ...
# plot model R^2 vs size of model
plot(leaps.r2$size, leaps.r2$r2, main = "R2")
# report the best model (indicate which terms are in the model)
best.model.r2 <- leaps.r2$which[which((leaps.r2$r2 == max(leaps.r2$r2))),]
# these are the variable names for the best model
names(best.model.r2)[best.model.r2]
## [1] "(Intercept)" "wt" "ht" "chin"
## [5] "fore" "calf" "pulse" "yrage"

[Figure: R2 versus model size for the best subsets (leaps.r2$size vs leaps.r2$r2).]

Adjusted-R2 Criterion, maximize


# adj-R^2 -- for each model size, report best subset of size 5
leaps.adjr2 <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
, method = 'adjr2'
, int = FALSE, nbest = 5, names = colnames(model.matrix(lm.indian.full)))
# plot model R^2 vs size of model
plot(leaps.adjr2$size, leaps.adjr2$adjr2, main = "Adj-R2")
# report the best model (indicate which terms are in the model)
best.model.adjr2 <- leaps.adjr2$which[which((leaps.adjr2$adjr2 == max(leaps.adjr2$adjr2))),]
# these are the variable names for the best model
names(best.model.adjr2)[best.model.adjr2]
## [1] "(Intercept)" "wt" "ht" "chin"
## [5] "yrage"

[Figure: Adj-R2 versus model size for the best subsets (leaps.adjr2$size vs leaps.adjr2$adjr2).]

Mallows’ Cp Criterion, minimize


# Cp -- for each model size, report best subset of size 3
leaps.Cp <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
, method = 'Cp'
, int = FALSE, nbest = 3, names = colnames(model.matrix(lm.indian.full)))
# plot model R^2 vs size of model
plot(leaps.Cp$size, leaps.Cp$Cp, main = "Cp")
lines(leaps.Cp$size, leaps.Cp$size) # adds the line for Cp = p
# report the best model (indicate which terms are in the model)
best.model.Cp <- leaps.Cp$which[which((leaps.Cp$Cp == min(leaps.Cp$Cp))),]
# these are the variable names for the best model
names(best.model.Cp)[best.model.Cp]
## [1] "(Intercept)" "wt" "yrage"

[Figure: Cp versus model size for the best subsets (leaps.Cp$size vs leaps.Cp$Cp), with the Cp = p reference line.]

All together The function below takes regsubsets() output and formats it into a
table.
# best subset, returns results sorted by BIC
f.bestsubset <- function(form, dat, nbest = 5){
library(leaps)
bs <- regsubsets(form, data=dat, nvmax=30, nbest=nbest, method="exhaustive");
bs2 <- cbind(summary(bs)$which, (rowSums(summary(bs)$which)-1)
, summary(bs)$rss, summary(bs)$rsq
, summary(bs)$adjr2, summary(bs)$cp, summary(bs)$bic);
cn <- colnames(bs2);
cn[(dim(bs2)[2]-5):dim(bs2)[2]] <- c("SIZE", "rss", "r2", "adjr2", "cp", "bic");
colnames(bs2) <- cn;
ind <- sort.int(summary(bs)$bic, index.return=TRUE); bs2 <- bs2[ind$ix,];
return(bs2);
}
# perform on our model
i.best <- f.bestsubset(formula(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage)
, indian)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
i.best
## (Intercept) wt ht chin fore calf pulse yrage SIZE rss r2 adjr2
## 2 1 1 0 0 0 0 0 1 2 3441.363 0.47310778 0.44383599
## 3 1 1 0 1 0 0 0 1 3 3243.990 0.50332663 0.46075463

## 3 1 1 0 0 1 0 0 1 3 3390.815 0.48084699 0.43634816
## 3 1 1 0 0 0 1 0 1 3 3411.145 0.47773431 0.43296868
## 3 1 1 1 0 0 0 0 1 3 3417.624 0.47674226 0.43189159
## 3 1 1 0 0 0 0 1 1 3 3435.481 0.47400828 0.42892328
## 4 1 1 1 1 0 0 0 1 4 3130.425 0.52071413 0.46432755
## 4 1 1 0 1 0 0 1 1 4 3232.168 0.50513668 0.44691747
## 4 1 1 0 1 1 0 0 1 4 3243.771 0.50336023 0.44493203
## 4 1 1 0 1 0 1 0 1 4 3243.988 0.50332702 0.44489490
## 4 1 1 1 0 1 0 0 1 4 3310.944 0.49307566 0.43343750
## 5 1 1 1 1 1 0 0 1 5 3112.647 0.52343597 0.45122930
## 5 1 1 1 1 0 0 1 1 5 3126.303 0.52134520 0.44882174
## 5 1 1 1 1 0 1 0 1 5 3128.297 0.52103997 0.44847027
## 5 1 1 0 1 1 0 1 1 5 3228.867 0.50564205 0.43073933
## 5 1 1 0 1 0 1 1 1 5 3231.936 0.50517225 0.43019835
## 1 1 1 0 0 0 0 0 0 1 4756.056 0.27182072 0.25214020
## 6 1 1 1 1 1 0 1 1 6 3099.310 0.52547798 0.43650510
## 6 1 1 1 1 1 1 0 1 6 3111.060 0.52367894 0.43436875
## 6 1 1 1 1 0 1 1 1 6 3123.448 0.52178233 0.43211651
## 2 1 1 0 1 0 0 0 0 2 4612.426 0.29381129 0.25457859
## 6 1 1 0 1 1 1 1 1 6 3228.324 0.50572524 0.41304872
## 2 1 1 0 0 0 1 0 0 2 4739.383 0.27437355 0.23406097
## 2 1 1 0 0 0 0 1 0 2 4749.950 0.27275566 0.23235320
## 2 1 1 1 0 0 0 0 0 2 4754.044 0.27212880 0.23169151
## 6 1 1 1 0 1 1 1 1 6 3283.293 0.49730910 0.40305455
## 7 1 1 1 1 1 1 1 1 7 3096.446 0.52591643 0.41886531
## 1 1 0 0 0 0 0 0 1 1 6033.372 0.07625642 0.05129038
## 1 1 0 0 0 1 0 0 0 1 6047.218 0.07413652 0.04911319
## 1 1 0 0 0 0 1 0 0 1 6120.639 0.06289527 0.03756811
## 1 1 0 1 0 0 0 0 0 1 6217.854 0.04801119 0.02228176
## cp bic
## 2 1.453122 -13.9989263
## 3 1.477132 -12.6388375
## 3 2.947060 -10.9124614
## 3 3.150596 -10.6793279
## 3 3.215466 -10.6053171
## 3 3.394238 -10.4020763
## 4 2.340175 -10.3650555
## 4 3.358774 -9.1176654
## 4 3.474934 -8.9779148
## 4 3.477106 -8.9753065
## 4 4.147436 -8.1785389
## 5 4.162196 -6.9236046
## 5 4.298910 -6.7528787
## 5 4.318869 -6.7280169
## 5 5.325728 -5.4939516
## 5 5.356448 -5.4569068
## 1 12.615145 -5.0439888
## 6 6.028670 -3.4275113
## 6 6.146308 -3.2799319
## 6 6.270326 -3.1249499
## 2 13.177196 -2.5763538
## 6 7.320289 -1.8369533
## 2 14.448217 -1.5173924
## 2 14.554009 -1.4305331
## 2 14.595000 -1.3969306
## 6 7.870614 -1.1784808
## 7 8.000000 0.1999979
## 1 25.402961 4.2336138
## 1 25.541579 4.3230122
## 1 26.276637 4.7936744
## 1 27.249897 5.4082455

options(op); # reset (all) initial options

10.5.1 R2 , R̄2 , and Cp Summary for Peru Indian Data


Discussion of R2 results:
1. The single predictor model with the highest value of R2 has wt = weight as a
predictor: R2 = 0.272. All the other single predictor models have R2 < 0.10.
2. The two predictor model with the highest value of R2 has weight and yrage =
fraction as predictors: R2 = 0.473. No other two predictor model has R2 close
to this.
3. All of the best three predictor models include weight and fraction as predictors.
However, the increase in R2 achieved by adding a third predictor is minimal.
4. None of the more complex models with four or more predictors provides a
significant increase in R2 .
A good model using the R2 criterion has two predictors, weight and yrage. The
same conclusion is reached with R̄2 , albeit the model with maximum R̄2 includes wt
(weight), ht (height), chin (chin skin fold), and yrage (fraction) as predictors.

Discussion of Cp results:
1. None of the single predictor models is adequate. Each has Cp ≫ 1 + 1 = 2, the
target value.
2. The only adequate two predictor model has wt = weight and yrage = fraction
as predictors: Cp = 1.45 < 2 + 1 = 3. This is the minimum Cp model.
3. Every model with weight and fraction is adequate. Every model that excludes
either weight or fraction is inadequate: Cp ≫ p + 1.
According to Cp , any reasonable model must include both weight and fraction as
predictors. Based on simplicity, I would select the model with these two predictors
as a starting point. I can always add predictors if subsequent analysis suggests this
is necessary!

10.5.2 Peru Indian Data Summary


The model selection procedures suggest three models that warrant further consider-
ation.
Predictors Methods suggesting model
------------------- ------------------------------------------------------------
wt, yrage BIC via stepwise, forward, and backward elimination, and C_p
wt, yrage, chin AIC via stepwise and backward selection
wt, yrage, chin, ht AIC via forward selection, Adj-R2

I will give three reasons why I feel that the simpler model is preferable at this
point:

1. It was suggested by 4 of the 5 methods (ignoring R2 ).
2. Forward selection often chooses predictors that are not important, even when
the significance level for inclusion is reduced from the default α = 0.50 level.
3. The AIC/BIC forward and backward elimination outputs suggest that neither
chin skin fold nor height is significant at any of the standard levels of significance
used in practice. Look at the third and fourth steps of forward selection to see
this.
Using a mechanical approach, we are led to a model with weight and yrage as
predictors of systolic blood pressure. At this point we should closely examine this
model. We did this earlier this semester and found that observation 1 (the individual
with the largest systolic blood pressure) was fitted poorly by the model and potentially
influential.
As noted earlier this semester, model selection methods can be highly influenced
by outliers and influential cases. Thus, we should hold out case 1, and re-evaluate the
various procedures to see whether case 1 unduly influenced the models selected. I will
just note (not shown) that the selection methods point to the same model when case
1 is held out. After deleting case 1, there are no large residuals, extremely influential
points, or any gross abnormalities in plots.
Both analyses suggest that the “best model” for predicting systolic blood pressure
is
sysbp = β0 + β1 wt + β2 yrage + ε.
Should case 1 be deleted? I have not fully explored this issue, but I will note that
eliminating this case does have a significant impact on the least squares estimates of
the regression coefficients, and on predicted values. What do you think?

10.6 Example: Oxygen Uptake


An experiment was conducted to model oxygen uptake (o2up), in milligrams of oxy-
gen per minute, from five chemical measurements: biological oxygen demand (bod),
total Kjeldahl nitrogen (tkn), total solids (ts), total volatile solids (tvs), which is a
component of ts, and chemical oxygen demand (cod), each measured in milligrams
per liter. The data were collected on samples of dairy wastes kept in suspension in
water in a laboratory for 220 days. All observations were on the same sample over
time. We desire an equation relating o2up to the other variables. The goal is to
find variables that should be further studied with the eventual goal of developing a
prediction equation (day should not be considered as a predictor).
We are interested in developing a regression model with o2up, or some function
of o2up, as a response. The researchers believe that the predictor variables are more
likely to be linearly related to log10 (o2up) rather than o2up, so log10 (o2up) was in-
cluded in the data set. As a first step, we should plot o2up against the different

predictors, and see whether the relationship between o2up and the individual pre-
dictors is roughly linear. If not, we will consider appropriate transformations of the
response and/or predictors.
#### Example: Oxygen uptake
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch10_oxygen.dat"
oxygen <- read.table(fn.data, header=TRUE)

day bod tkn ts tvs cod o2up logup


1 0 1125 232 7160 85.90 8905 36.00 1.56
2 7 920 268 8804 86.50 7388 7.90 0.90
3 15 835 271 8108 85.20 5348 5.60 0.75
4 22 1000 237 6370 83.80 8056 5.20 0.72
5 29 1150 192 6441 82.10 6960 2.00 0.30
6 37 990 202 5154 79.20 5690 2.30 0.36
7 44 840 184 5896 81.20 6932 1.30 0.11
8 58 650 200 5336 80.60 5400 1.30 0.11
9 65 640 180 5041 78.40 3177 0.60 −0.22
10 72 583 165 5012 79.30 4461 0.70 −0.15
11 80 570 151 4825 78.70 3901 1.00 0.00
12 86 570 171 4391 78.00 5002 1.00 0.00
13 93 510 243 4320 72.30 4665 0.80 −0.10
14 100 555 147 3709 74.90 4642 0.60 −0.22
15 107 460 286 3969 74.40 4840 0.40 −0.40
16 122 275 198 3558 72.50 4479 0.70 −0.15
17 129 510 196 4361 57.70 4200 0.60 −0.22
18 151 165 210 3301 71.80 3410 0.40 −0.40
19 171 244 327 2964 72.50 3360 0.30 −0.52
20 220 79 334 2777 71.90 2599 0.90 −0.05
The plots showed an exponential relationship between o2up and the predictors.
To shorten the output, only one plot is given. An exponential relationship can often
be approximately linearized by transforming o2up to the log10 (o2up) scale suggested
by the researchers. The extreme skewness in the marginal distribution of o2up also
gives an indication that a transformation might be needed.
# scatterplots
library(ggplot2)
p1 <- ggplot(oxygen, aes(x = bod, y = o2up)) + geom_point(size=2)
p2 <- ggplot(oxygen, aes(x = tkn, y = o2up)) + geom_point(size=2)
p3 <- ggplot(oxygen, aes(x = ts , y = o2up)) + geom_point(size=2)
p4 <- ggplot(oxygen, aes(x = tvs, y = o2up)) + geom_point(size=2)
p5 <- ggplot(oxygen, aes(x = cod, y = o2up)) + geom_point(size=2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), nrow=2
, top = "Scatterplots of response o2up with each predictor variable")

[Figure: Scatterplots of response o2up with each predictor variable (bod, tkn, ts, tvs, cod).]

After transformation, several plots show a roughly linear relationship. A sensible
next step would be to build a regression model using log(o2up) as the response
variable.
# scatterplots
library(ggplot2)
p1 <- ggplot(oxygen, aes(x = bod, y = logup)) + geom_point(size=2)
p2 <- ggplot(oxygen, aes(x = tkn, y = logup)) + geom_point(size=2)
p3 <- ggplot(oxygen, aes(x = ts , y = logup)) + geom_point(size=2)
p4 <- ggplot(oxygen, aes(x = tvs, y = logup)) + geom_point(size=2)
p5 <- ggplot(oxygen, aes(x = cod, y = logup)) + geom_point(size=2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5), nrow=2
, top = "Scatterplots of response logup with each predictor variable")

[Figure: Scatterplots of response logup with each predictor variable (bod, tkn, ts, tvs, cod).]

Correlation between response and each predictor.


# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
o.cor <- rcorr(as.matrix(oxygen[,c("logup", "bod", "tkn", "ts", "tvs", "cod")]))
# print correlations with the response to 3 significant digits
signif(o.cor$r[1, ], 3)
## logup bod tkn ts tvs cod
## 1.0000 0.7740 0.0906 0.8350 0.7110 0.8320

I used several of the model selection procedures to screen the predictors. The
model selection criteria below point to a more careful analysis of the model with
ts and cod as predictors. This model has the minimum Cp and is selected by the
backward and stepwise procedures. Furthermore, no other model has a substantially
higher R2 or R̄2 . The fit of the model will not likely be improved substantially by
adding any of the remaining three effects to this model.
# perform on our model
o.best <- f.bestsubset(formula(logup ~ bod + tkn + ts + tvs + cod)
, oxygen, nbest = 3)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
o.best
## (Intercept) bod tkn ts tvs cod SIZE rss r2 adjr2 cp bic
## 2 1 0 0 1 0 1 2 1.0850469 0.7857080 0.7604972 1.738781 -21.82112
## 3 1 0 1 1 0 1 3 0.9871461 0.8050430 0.7684886 2.318714 -20.71660
## 3 1 0 0 1 1 1 3 1.0633521 0.7899926 0.7506163 3.424094 -19.22933

## 3 1 1 0 1 0 1 3 1.0643550 0.7897946 0.7503810 3.438642 -19.21047
## 4 1 0 1 1 1 1 4 0.9652627 0.8093649 0.7585288 4.001291 -18.16922
## 1 1 0 0 1 0 0 1 1.5369807 0.6964531 0.6795894 6.294153 -17.85292
## 2 1 0 0 0 1 1 2 1.3287305 0.7375816 0.7067088 5.273450 -17.76910
## 4 1 1 1 1 0 1 4 0.9871272 0.8050467 0.7530592 4.318440 -17.72124
## 1 1 0 0 0 0 1 1 1.5563027 0.6926371 0.6755614 6.574421 -17.60306
## 4 1 1 0 1 1 1 4 1.0388337 0.7948349 0.7401242 5.068450 -16.70015
## 2 1 0 1 0 0 1 2 1.4388035 0.7158426 0.6824124 6.870078 -16.17735
## 5 1 1 1 1 1 1 5 0.9651737 0.8093824 0.7413048 6.000000 -15.17533
## 1 1 1 0 0 0 0 1 2.0337926 0.5983349 0.5760202 13.500489 -12.25127
options(op); # reset (all) initial options

These comments must be taken with a grain of salt because we have not critically
assessed the underlying assumptions (linearity, normality, independence), nor have
we considered whether the data contain influential points or outliers.
lm.oxygen.final <- lm(logup ~ ts + cod, data = oxygen)
summary(lm.oxygen.final)
##
## Call:
## lm(formula = logup ~ ts + cod, data = oxygen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37640 -0.09238 -0.04229 0.06256 0.59827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.370e+00 1.969e-01 -6.960 2.3e-06 ***
## ts 1.492e-04 5.489e-05 2.717 0.0146 *
## cod 1.415e-04 5.318e-05 2.661 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2526 on 17 degrees of freedom
## Multiple R-squared: 0.7857,Adjusted R-squared: 0.7605
## F-statistic: 31.17 on 2 and 17 DF, p-value: 2.058e-06

The p-values for testing the importance of the individual predictors are small,
indicating that both predictors are important. However, two observations (1 and 20)
are poorly fitted by the model (both have ri > 2) and are individually most influential
(largest Di s). Recall that this experiment was conducted over 220 days, so these
observations were the first and last data points collected. We have little information
about the experiment, but it is reasonable to conjecture that the experiment may not
have reached a steady state until the second time point, and that the experiment was
ended when the experimental material dissipated. The end points of the experiment
may not be typical of conditions under which we are interested in modelling oxygen

uptake. A sensible strategy here is to delete these points and redo the entire analysis
to see whether our model changes noticeably.
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.oxygen.final, which = c(1,4,6))

plot(oxygen$ts, lm.oxygen.final$residuals, main="Residuals vs ts")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(oxygen$cod, lm.oxygen.final$residuals, main="Residuals vs cod")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.oxygen.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## [1] 1 20 7
## residuals vs order of data
#plot(lm.oxygen.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: Diagnostic plots for lm.oxygen.final: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs ts, Residuals vs cod, and QQ Plot.]

Further, the partial residual plot for both ts and cod clearly highlights outlying
cases 1 and 20.
library(car)
avPlots(lm.oxygen.final, id = list(n = 3))

[Figure: Added-Variable Plots for lm.oxygen.final: logup | others versus ts | others and cod | others.]

10.6.1 Redo analysis excluding first and last observations


For more completeness, we exclude the end observations and repeat the model selec-
tion steps. Summaries from the model selection are provided.
The model selection criteria again suggest ts and cod as predictors. After delet-
ing observations 1 and 20 the R2 for this two predictor model jumps from 0.786 to
0.892. Also note that the LS coefficients change noticeably after these observations
are deleted.
# exclude observations 1 and 20
oxygen2 <- oxygen[-c(1,20),]
Correlation between response and each predictor.
# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
o.cor <- rcorr(as.matrix(oxygen2[,c("logup", "bod", "tkn", "ts", "tvs", "cod")]))
# print correlations with the response to 3 significant digits
signif(o.cor$r[1, ], 3)
## logup bod tkn ts tvs cod
## 1.000 0.813 0.116 0.921 0.717 0.806
# perform on our model
o.best <- f.bestsubset(formula(logup ~ bod + tkn + ts + tvs + cod)
, oxygen2, nbest = 3)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
o.best
## (Intercept) bod tkn ts tvs cod SIZE rss r2 adjr2 cp bic
## 2 1 0 0 1 0 1 2 0.3100332 0.8923243 0.8779675 0.1755108 -31.44424

## 3 1 0 0 1 1 1 3 0.3060781 0.8936979 0.8709189 2.0201863 -28.78498
## 3 1 1 0 1 0 1 3 0.3092614 0.8925923 0.8695764 2.1452002 -28.59874
## 3 1 0 1 1 0 1 3 0.3095352 0.8924973 0.8694610 2.1559501 -28.58281
## 1 1 0 0 1 0 0 1 0.4346881 0.8490312 0.8395956 3.0709102 -28.25153
## 2 1 1 0 1 0 0 2 0.3832495 0.8668960 0.8491488 3.0508321 -27.62812
## 2 1 0 0 1 1 0 2 0.4182487 0.8547406 0.8353727 4.4253075 -26.05510
## 4 1 1 0 1 1 1 4 0.3056566 0.8938443 0.8611810 4.0036314 -25.91941
## 4 1 0 1 1 1 1 4 0.3057789 0.8938018 0.8611255 4.0084356 -25.91221
## 4 1 1 1 1 0 1 4 0.3091249 0.8926398 0.8596058 4.1398376 -25.71632
## 5 1 1 1 1 1 1 5 0.3055641 0.8938764 0.8496583 6.0000000 -23.03449
## 1 1 1 0 0 0 0 1 0.9766926 0.6607910 0.6395904 24.3563087 -13.67975
## 1 1 0 0 0 0 1 1 1.0100759 0.6491968 0.6272716 25.6673251 -13.07480
options(op); # reset (all) initial options

Below is the model with ts and cod as predictors, after omitting the end obser-
vations. Both predictors are significant at the 0.05 level. Furthermore, there do not
appear to be any extreme outliers. The QQ-plot, and the plot of studentized residuals
against predicted values do not show any extreme abnormalities.
lm.oxygen2.final <- lm(logup ~ ts + cod, data = oxygen2)
summary(lm.oxygen2.final)
##
## Call:
## lm(formula = logup ~ ts + cod, data = oxygen2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24157 -0.08517 0.01004 0.10102 0.25094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.335e+00 1.338e-01 -9.976 5.16e-08 ***
## ts 1.852e-04 3.182e-05 5.820 3.38e-05 ***
## cod 8.638e-05 3.517e-05 2.456 0.0267 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1438 on 15 degrees of freedom
## Multiple R-squared: 0.8923,Adjusted R-squared: 0.878
## F-statistic: 62.15 on 2 and 15 DF, p-value: 5.507e-08
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.oxygen2.final, which = c(1,4,6))

plot(oxygen2$ts, lm.oxygen2.final$residuals, main="Residuals vs ts")


# horizontal line at zero
abline(h = 0, col = "gray75")

plot(oxygen2$cod, lm.oxygen2.final$residuals, main="Residuals vs cod")


# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals

library(car)
qqPlot(lm.oxygen2.final$residuals, las = 1, id = list(n = 3), main="QQ Plot")
## 6 7 15
## 5 6 14
## residuals vs order of data
#plot(lm.oxygen2.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: Diagnostic plots for lm.oxygen2.final: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs ts, Residuals vs cod, and QQ Plot.]

library(car)
avPlots(lm.oxygen2.final, id = list(n = 3))

[Figure: Added-Variable Plots for lm.oxygen2.final: logup | others versus ts | others and cod | others.]

Let us recall that the researcher’s primary goal was to identify important predic-
tors of o2up. Regardless of whether we are inclined to include the end observations in
the analysis or not, it is reasonable to conclude that ts and cod are useful for explain-
ing the variation in log10 (o2up). If these data were the final experiment, I might be
inclined to eliminate the end observations and use the following equation to predict
oxygen uptake:

log10 (o2up) = −1.335302 + 0.000185 ts + 0.000086 cod.
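If this equation were adopted, predictions of o2up itself could be obtained from the
fitted model and back-transformed. A minimal sketch; the values ts = 5000 and
cod = 5000 are hypothetical and only illustrate the call:
# predict log10(o2up) at new values of ts and cod, then back-transform to o2up
newdat <- data.frame(ts = 5000, cod = 5000)
pred.log10 <- predict(lm.oxygen2.final, newdata = newdat, interval = "prediction")
10^pred.log10   # point prediction and prediction limits on the o2up scale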



Chapter 11

Logistic Regression

Logistic regression analysis is used for predicting the outcome of a categorical depen-
dent variable based on one or more predictor variables. The probabilities describing
the possible outcomes of a single trial are modeled, as a function of the explanatory
(predictor) variables, using a logistic function. Logistic regression is frequently used
to refer to the problem in which the dependent variable is binary — that is, the num-
ber of available categories is two — and problems with more than two categories are
referred to as multinomial logistic regression or, if the multiple categories are ordered,
as ordered logistic regression.
Logistic regression measures the relationship between a categorical dependent vari-
able and usually (but not necessarily) one or more continuous independent variables,
by converting the dependent variable to probability scores. As such it treats the same
set of problems as does probit regression using similar techniques.

11.1 Generalized linear model variance and link


families
The generalized linear model (GLM) is a flexible generalization of ordinary linear
regression that allows for response variables that have other than a normal distri-
bution. The GLM generalizes linear regression by allowing the linear model to be
related to the response variable via a link function and by allowing the magnitude of
the variance of each measurement to be a function of its predicted value.
In R, the basic tool for fitting generalized linear models is the glm() function,
which has the following general structure:
glm(formula, family, data, weights, subset, ...)

where “...” stands for additional options. The key parameter here is family, which
is a simple way of specifying a choice of variance and link functions. Some choices

of family are listed in the table. As can be seen, each of the first five choices has an
associated variance function (for binomial the binomial variance µ(1 − µ)), and one
or more choices of link functions (for binomial the logit, probit, or complementary
log-log).
Family Variance Link
gaussian gaussian identity
binomial binomial logit, probit, or cloglog
poisson poisson log, identity, or sqrt
Gamma Gamma inverse, identity, or log
inverse.gaussian inverse.gaussian 1/µ²
quasi user-defined user-defined
As long as you want the default link, all you have to specify is the family name.
If you want an alternative link, you must add a link argument. For example, to do
probits you use:
glm(formula, family = binomial(link = probit))

The last family on the list, quasi, is there to allow fitting user-defined models by
maximum quasi-likelihood.
The rest of this chapter concerns logistic regression with a binary response vari-
able.
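To make the family and link syntax concrete, here is a minimal sketch using simulated data (the names dat, x, and y are hypothetical and are not used elsewhere in these notes) that fits the same binomial model with the default logit link and with a probit link:
# minimal sketch: binomial family with the default logit link and a probit link
# (simulated data; dat, x, and y are hypothetical names)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))
glm.logit  <- glm(y ~ x, family = binomial,                  data = dat)
glm.probit <- glm(y ~ x, family = binomial(link = "probit"), data = dat)
coef(glm.logit)
coef(glm.probit)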

11.2 Example: Age of Menarche in Warsaw


The data1 below are from a study conducted by Milicer and Szczotka on pre-teen
and teenage girls in Warsaw, Poland in 1965. The subjects were classified into 25 age
categories. The number of girls in each group (Total) and the number that reached
menarche (Menarche) at the time of the study were recorded. The age for a group
corresponds to the midpoint for the age interval.
#### Example: Menarche
# menarche dataset is available in MASS package
# (remove previous instance if it exists, important if rerunning code)
rm(menarche)
## Warning in rm(menarche): object ’menarche’ not found
library(MASS)
# these frequencies look better in the table as integers
menarche$Total <- as.integer(menarche$Total)
menarche$Menarche <- as.integer(menarche$Menarche)
str(menarche)

1
Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology
38, 199–203.


## 'data.frame': 25 obs. of 3 variables:


## $ Age : num 9.21 10.21 10.58 10.83 11.08 ...
## $ Total : int 376 200 93 120 90 88 105 111 100 93 ...
## $ Menarche: int 0 0 0 2 2 5 10 17 16 29 ...
# create estimated proportion of girls reaching menarche for each age group
menarche$p.hat <- menarche$Menarche / menarche$Total
Age Total Menarche p.hat
1 9.21 376 0 0.00
2 10.21 200 0 0.00
3 10.58 93 0 0.00
4 10.83 120 2 0.02
5 11.08 90 2 0.02
6 11.33 88 5 0.06
7 11.58 105 10 0.10
8 11.83 111 17 0.15
9 12.08 100 16 0.16
10 12.33 93 29 0.31
11 12.58 100 39 0.39
12 12.83 108 51 0.47
13 13.08 99 47 0.47
14 13.33 106 67 0.63
15 13.58 105 81 0.77
16 13.83 117 88 0.75
17 14.08 98 79 0.81
18 14.33 97 90 0.93
19 14.58 120 113 0.94
20 14.83 102 95 0.93
21 15.08 122 117 0.96
22 15.33 111 107 0.96
23 15.58 94 92 0.98
24 15.83 114 112 0.98
25 17.58 1049 1049 1.00

The researchers were curious about how the proportion of girls that reached menar-
che (p̂ = Menarche/Total) varied with age. One could perform a test of homogeneity
(Multinomial goodness-of-fit test) by arranging the data as a 2-by-25 contingency
table with columns indexed by age and two rows: ROW1 = Menarche, and ROW2
= number that have not reached menarche = (Total − Menarche). A more power-
ful approach treats these as regression data, using the proportion of girls reaching
menarche as the response and age as a predictor.
A plot of the observed proportion p̂ of girls that have reached menarche shows
that the proportion increases as age increases, but that the relationship is nonlinear.
The observed proportions, which are bounded between zero and one, have a lazy S-
shape (a sigmoidal function) when plotted against age. The change in the observed
proportions for a given change in age is much smaller when the proportion is near 0 or
1 than when the proportion is near 1/2. This phenomenon is common with regression
data where the response is a proportion.
The trend is nonlinear so linear regression is inappropriate. A sensible alternative
might be to transform the response or the predictor to achieve near linearity. A
common transformation of response proportions following a sigmoidal curve is to the


logit scale µ̂ = loge {p̂/(1 − p̂)}. This transformation is the basis for the logistic
regression model. The natural logarithm (base e) is traditionally used in logistic
regression.
The logit transformation is undefined when p̂ = 0 or p̂ = 1. To overcome this
problem, researchers use the empirical logits, defined by log{(p̂ + 0.5/n)/(1 − p̂ +
0.5/n)}, where n is the sample size or the number of observations on which p̂ is based.
A plot of the empirical logits against age is roughly linear, which supports a logistic
transformation for the response.
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = p.hat))
p <- p + geom_point()
p <- p + labs(title = paste("Observed probability of girls reaching menarche,\n",
"Warsaw, Poland in 1965", sep=""))
print(p)

# empirical logits
menarche$emp.logit <- log(( menarche$p.hat + 0.5/menarche$Total) /
(1 - menarche$p.hat + 0.5/menarche$Total))
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = emp.logit))
p <- p + geom_point()
p <- p + labs(title = "Empirical logits")
print(p)
[Figure: "Observed probability of girls reaching menarche, Warsaw, Poland in 1965" (p.hat versus Age) and "Empirical logits" (emp.logit versus Age); the observed proportions follow a sigmoidal curve while the empirical logits are roughly linear in Age.]

11.3 Simple logistic regression model


The simple logistic regression model expresses the population proportion p of indi-
viduals with a given attribute (called the probability of success) as a function of a

single predictor variable X. The model assumes that p is related to X through

log{p/(1 − p)} = β0 + β1 X

or, equivalently, as

p = exp(β0 + β1 X)/{1 + exp(β0 + β1 X)}.

The logistic regression model is a binary response model, where the response for
each case falls into one of two exclusive and exhaustive categories, success (cases with
the attribute of interest) and failure (cases without the attribute of interest).
The odds of success are p/(1 − p). For example, when p = 1/2 the odds of success
are 1 (or 1 to 1). When p = 0.9 the odds of success are 9 (or 9 to 1). The logistic
model assumes that the log-odds of success is linearly related to X. Graphs of the
logistic model relating p to X are given below. The sign of the slope refers to the
sign of β1 .
I should write p = p(X) to emphasize that p is the proportion of all individuals
with score X that have the attribute of interest. In the menarche data, p = p(X) is
the population proportion of girls at age X that have reached menarche.
[Figure: the logistic model on the logit scale (log-odds linear in X with positive, zero, and negative slopes) and on the probability scale (increasing, flat, and decreasing sigmoidal curves).]
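These curves are easy to draw yourself with plogis(), the inverse logit (logistic) function in base R; a minimal sketch:
# sketch: probability-scale logistic curves for positive, zero, and negative slopes
# plogis(z) = exp(z) / (1 + exp(z)) is the inverse logit
curve(plogis( 1.0 * x), from = -5, to = 5, ylim = c(0, 1),
      xlab = "X", ylab = "Probability")       # + slope
curve(plogis( 0.0 * x), add = TRUE, lty = 2)  # 0 slope (constant 1/2)
curve(plogis(-1.0 * x), add = TRUE, lty = 3)  # - slope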

The data in a logistic regression problem are often given in summarized or ag-
gregate form:
X     n     y
X1    n1    y1
X2    n2    y2
...   ...   ...
Xm    nm    ym


where yi is the number of individuals with the attribute of interest among ni randomly
selected or representative individuals with predictor variable value Xi . For raw data
on individual cases, yi = 1 or 0, depending on whether the case at Xi is a success or
failure, and the sample size column n is omitted with raw data.
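Both layouts are handled by glm() with family = binomial; a minimal sketch with simulated data (the names dat.agg, dat.raw, x, y, and n are hypothetical) showing that the two layouts give the same estimates:
# sketch: summarized (successes/failures) versus raw 0/1 response layouts
set.seed(2)
dat.agg <- data.frame(x = 1:6, n = rep(20, 6))
dat.agg$y <- rbinom(6, size = dat.agg$n, prob = plogis(-2 + 0.6 * dat.agg$x))
# summarized form: cbind(successes, failures) on the left-hand side
glm.agg <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat.agg)
# raw form: one 0/1 response per individual, sample size column omitted
dat.raw <- data.frame(x = rep(dat.agg$x, dat.agg$n),
                      y = unlist(mapply(function(y, n) c(rep(1, y), rep(0, n - y)),
                                        dat.agg$y, dat.agg$n)))
glm.raw <- glm(y ~ x, family = binomial, data = dat.raw)
rbind(agg = coef(glm.agg), raw = coef(glm.raw))  # identical coefficient estimates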
For logistic regression, a plot of the sample proportions p̂i = yi /ni against Xi
should be roughly sigmoidal, and a plot of the empirical logits against Xi should be
roughly linear. If not, then some other model is probably appropriate. I find the
second plot easier to calibrate, but neither plot is very informative when the sample
sizes are small, say 1 or 2. (Why?).
There are a variety of other binary response models that are used in practice. The
probit regression model or the complementary log-log regression model might be
appropriate when the logistic model does not fit the data.
The following section describes the standard MLE strategy for estimating the
logistic regression parameters.

11.3.1 Estimating Regression Parameters via LS of empirical logits

(This is a naive method; we will discuss a better way in the next section.)
There are two unknown population parameters in the logistic regression model

log{p/(1 − p)} = β0 + β1 X.

A simple way to estimate β0 and β1 is by least squares (LS), using the empirical logits
as responses and the Xi s as the predictor values.
Below we use standard regression to calculate the LS fit between the empirical
logits and age.
lm.menarche.e.a <- lm(emp.logit ~ Age, data = menarche)
# LS coefficients
coef(lm.menarche.e.a)
## (Intercept) Age
## -22.027933 1.676395
The LS estimates for the menarche data are b0 = −22.03 and b1 = 1.68, which
gives the fitted relationship

log{p̃/(1 − p̃)} = −22.03 + 1.68 Age

or

p̃ = exp(−22.03 + 1.68 Age)/{1 + exp(−22.03 + 1.68 Age)},


where p̃ is the predicted proportion (under the model) of girls having reached menar-
che at the given age. I used p̃ to identify a predicted probability, in contrast to p̂
which is the observed proportion at a given age.
The power of the logistic model versus the contingency table analysis discussed
earlier is that the model gives estimates for the population proportion reaching menar-
che at all ages within the observed age range. The observed proportions allow you to
estimate only the population proportions at the observed ages.

11.3.2 Maximum Likelihood Estimation for Logistic Regression Model

There are better ways to fit the logistic regression model than LS, which assumes
that the responses are normally distributed with constant variance. A deficiency of
the LS fit to the logistic model is that the observed counts yi have a Binomial distri-
bution under random sampling. The Binomial distribution is a discrete probability
model associated with counting the number of successes in a fixed size sample, and
other equivalent experiments such as counting the number of heads in repeated flips
of a coin. The distribution of the empirical logits depends on the yi s, so they are not
normal (but are approximately normal in large samples), and are extremely skewed
when the sample sizes ni are small. The response variability depends on the popula-
tion proportions, and is not roughly constant when the observed proportions or the
sample sizes vary appreciably across groups.
The differences in variability among the empirical logits can be accounted for us-
ing weighted least squares (WLS) when the sample sizes are large. An alternative
approach called maximum likelihood uses the exact Binomial distribution of the re-
sponses yi to generate optimal estimates of the regression coefficients. Software for
maximum likelihood estimation is widely available, so LS and WLS methods are not
really needed.
In maximum likelihood estimation (MLE), the regression coefficients are es-
timated iteratively by minimizing the deviance function (also called the likelihood
ratio chi-squared statistic)

D = 2 Σi [ yi log{yi/(ni pi)} + (ni − yi) log{(ni − yi)/(ni − ni pi)} ]

(the sum runs over i = 1, . . . , m) over all possible values of β0 and β1, where the pi s
satisfy the logistic model

log{pi/(1 − pi)} = β0 + β1 Xi.
The ML method also gives standard errors and significance tests for the regression
estimates.


The deviance is an analog of the residual sums of squares in linear regression.


The choices for β0 and β1 that minimize the deviance are the parameter values that
make the observed and fitted proportions as close together as possible in a “likelihood
sense”.
Suppose that b0 and b1 are the MLEs of β0 and β1. The deviance evaluated at the
MLEs,

D = 2 Σi [ yi log{yi/(ni p̃i)} + (ni − yi) log{(ni − yi)/(ni − ni p̃i)} ],

where the fitted probabilities p̃i satisfy

log{p̃i/(1 − p̃i)} = b0 + b1 Xi,

is used to test the adequacy of the model. The deviance is small when the data fits
the model, that is, when the observed and fitted proportions are close together. Large
values of D occur when one or more of the observed and fitted proportions are far
apart, which suggests that the model is inappropriate.
If the logistic model holds, then D has a chi-squared distribution with m − r
degrees of freedom, where m is the number of groups and r (here 2) is the
number of estimated regression parameters. A p-value for the deviance is given by
the area under the chi-squared curve to the right of D. A small p-value indicates that
the data does not fit the model.
Alternatively, the fit of the model can be evaluated using the chi-squared approx-
imation to the Pearson X² statistic:

X² = Σi [ (yi − ni p̃i)²/(ni p̃i) + {(ni − yi) − ni(1 − p̃i)}²/{ni(1 − p̃i)} ]
   = Σi (yi − ni p̃i)²/{ni p̃i(1 − p̃i)}.

11.3.3 Fitting the Logistic Model by Maximum Likelihood, Menarche

# For our summarized data (with frequencies and totals for each age)
# The left-hand side of our formula binds two columns together with cbind():
# the columns are the number of "successes" and "failures".
# For logistic regression with logit link we specify family = binomial,
# where logit is the default link function for the binomial family.
glm.m.a <- glm(cbind(Menarche, Total - Menarche) ~ Age, family = binomial, menarche)
The glm() statement creates an object which we can use to create the fitted
probabilities and 95% CIs for the population proportions at the ages in menarche.
The fitted probabilities and the limits are stored in columns labeled fitted.values,
fit.lower, and fit.upper, respectively.


# put the fitted values in the data.frame
menarche$fitted.values <- glm.m.a$fitted.values
pred <- predict(glm.m.a, data.frame(Age = menarche$Age), type = "link", se.fit = TRUE)
menarche$fit <- pred$fit
menarche$se.fit <- pred$se.fit
# CI for fitted values
menarche <- within(menarche, {
fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})
#round(menarche, 3)
This printed summary information is easily interpreted. For example, the esti-
mated population proportion of girls aged 15.08 (more precisely, among girls in the
age interval with midpoint 15.08) that have reached menarche is 0.967. You are 95%
confident that the population proportion is between 0.958 and 0.975. A variety of
other summaries and diagnostics can be produced.

Age Total Menarche p.hat emp.logit fitted.values fit se.fit fit.upper fit.lower
21 15.08 122.00 117.00 0.96 3.06 0.97 3.38 0.14 0.97 0.96

Age Total Menarche p.hat emp.logit fitted.values fit se.fit fit.upper fit.lower
1 9.21 376.00 0.00 0.00 −6.62 0.00 −6.20 0.23 0.00 0.00
2 10.21 200.00 0.00 0.00 −5.99 0.01 −4.56 0.18 0.01 0.01
3 10.58 93.00 0.00 0.00 −5.23 0.02 −3.96 0.16 0.02 0.01
4 10.83 120.00 2.00 0.02 −3.86 0.03 −3.55 0.14 0.04 0.02
5 11.08 90.00 2.00 0.02 −3.57 0.04 −3.14 0.13 0.05 0.03
6 11.33 88.00 5.00 0.06 −2.72 0.06 −2.74 0.12 0.08 0.05
7 11.58 105.00 10.00 0.10 −2.21 0.09 −2.33 0.11 0.11 0.07
8 11.83 111.00 17.00 0.15 −1.69 0.13 −1.92 0.10 0.15 0.11
9 12.08 100.00 16.00 0.16 −1.63 0.18 −1.51 0.08 0.21 0.16
10 12.33 93.00 29.00 0.31 −0.78 0.25 −1.10 0.07 0.28 0.22
11 12.58 100.00 39.00 0.39 −0.44 0.33 −0.70 0.07 0.36 0.30
12 12.83 108.00 51.00 0.47 −0.11 0.43 −0.29 0.06 0.46 0.40
13 13.08 99.00 47.00 0.47 −0.10 0.53 0.12 0.06 0.56 0.50
14 13.33 106.00 67.00 0.63 0.54 0.63 0.53 0.07 0.66 0.60
15 13.58 105.00 81.00 0.77 1.20 0.72 0.94 0.07 0.75 0.69
16 13.83 117.00 88.00 0.75 1.10 0.79 1.34 0.08 0.82 0.77
17 14.08 98.00 79.00 0.81 1.40 0.85 1.75 0.09 0.87 0.83
18 14.33 97.00 90.00 0.93 2.49 0.90 2.16 0.10 0.91 0.88
19 14.58 120.00 113.00 0.94 2.72 0.93 2.57 0.11 0.94 0.91
20 14.83 102.00 95.00 0.93 2.54 0.95 2.98 0.12 0.96 0.94
21 15.08 122.00 117.00 0.96 3.06 0.97 3.38 0.14 0.97 0.96
22 15.33 111.00 107.00 0.96 3.17 0.98 3.79 0.15 0.98 0.97
23 15.58 94.00 92.00 0.98 3.61 0.98 4.20 0.16 0.99 0.98
24 15.83 114.00 112.00 0.98 3.81 0.99 4.61 0.18 0.99 0.99
25 17.58 1049.00 1049.00 1.00 7.65 1.00 7.46 0.28 1.00 1.00
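The fitted probability and 95% CI for any single age can also be obtained directly with predict() on the link scale and back-transformed; a minimal sketch for the group at age 15.08:
# sketch: fitted probability and 95% CI at Age = 15.08
pred.15 <- predict(glm.m.a, newdata = data.frame(Age = 15.08),
                   type = "link", se.fit = TRUE)
plogis(pred.15$fit)                                     # fitted probability
plogis(pred.15$fit + c(-1.96, 1.96) * pred.15$se.fit)   # 95% CI (lower, upper)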

The summary table gives MLEs and standard errors for the regression parameters.
The z-value column is the parameter estimate divided by its standard error. The p-
values are used to test whether the corresponding parameters of the logistic model
are zero.


summary(glm.m.a)
##
## Call:
## glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
## data = menarche)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0363 -0.9953 -0.4900 0.7780 1.3675
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
## Age 1.63197 0.05895 27.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3693.884 on 24 degrees of freedom
## Residual deviance: 26.703 on 23 degrees of freedom
## AIC: 114.76
##
## Number of Fisher Scoring iterations: 4
If the model is correct and when sample sizes are large, the residual deviance D
has an approximate chi-square distribution,

residual D ∼ χ²(residual df).

If D is too large, or the p-value is too small, then the model does not capture all the
features in the data.
The deviance statistic is D = 26.70 on 25 − 2 = 23 df. The large p-value for
D suggests no gross deficiencies with the logistic model. The observed and fitted
proportions (p.hat and fitted.values in the output table above) are reasonably close
at each observed age. Also, emp.logit and fit are close. This is consistent with D
being fairly small. The data fits the logistic regression model reasonably well.
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm.m.a$deviance
## [1] 26.70345
glm.m.a$df.residual
## [1] 23
dev.p.val <- 1 - pchisq(glm.m.a$deviance, glm.m.a$df.residual)
dev.p.val
## [1] 0.2687953
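The Pearson X² statistic from Section 11.3.2 can be used the same way; a minimal sketch using the Pearson residuals from the fit:
# sketch: Pearson X^2 goodness-of-fit statistic and its chi-squared p-value
X2 <- sum(residuals(glm.m.a, type = "pearson")^2)
X2
1 - pchisq(X2, glm.m.a$df.residual)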


The MLEs b0 = −21.23 and b1 = 1.63 for the intercept and slope are close to
the LS estimates b0,LS = −22.03 and b1,LS = 1.68, respectively, from page 680. The
two estimation methods give similar predicted probabilities here. The MLE of the
predicted probabilities satisfy

log{p̃/(1 − p̃)} = −21.23 + 1.63 Age

or

p̃ = exp(−21.23 + 1.63 Age)/{1 + exp(−21.23 + 1.63 Age)}.
library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = p.hat))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = Age, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = Age, y = fitted.values), color = "red")
# fitted values
p <- p + geom_point(aes(y = fitted.values), color = "red", size=2)
# observed values
p <- p + geom_point(size=2)
p <- p + labs(title = paste("Observed and predicted probability of girls reaching menarche,\n",
"Warsaw, Poland in 1965", sep=""))
print(p)
[Figure: "Observed and predicted probability of girls reaching menarche, Warsaw, Poland in 1965" — observed proportions (black points), fitted probabilities (red points and curve), and a point-wise 95% confidence band, plotted against Age.]

If the model holds, then a slope of β1 = 0 implies that p does not depend on AGE,
i.e., the proportion of girls that have reached menarche is identical across age groups.


The Wald p-value for the slope is < 0.0001, which leads to rejecting H0 : β1 = 0 at
any of the usual test levels. The proportion of girls that have reached menarche is
not constant across age groups. Again, the power of the model is that it gives you a
simple way to quantify the effect of age on the proportion reaching menarche. This
is more appealing than testing homogeneity across age groups followed by multiple
comparisons.
Wald tests can be performed to test the global null hypothesis, that all non-
intercept βs are equal to zero. This is the logistic regression analog of the overall
model F-test in ANOVA and regression. The only predictor is AGE, so the implied
test is that the slope of the regression line is zero. The Wald test statistic and p-value
reported here are identical to the Wald test and p-value for the AGE effect given
in the parameter estimates table. The Wald test can also be used to test specific
contrasts between parameters.
# Testing Global Null Hypothesis
library(aod)
coef(glm.m.a)
## (Intercept) Age
## -21.226395 1.631968
# specify which coefficients to test = 0 (Terms = 2:4 would be terms 2, 3, and 4)
wald.test(b = coef(glm.m.a), Sigma = vcov(glm.m.a), Terms = 2:2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 766.3, df = 1, P(> X2) = 0.0

11.4 Example: Leukemia white blood cell types


Feigl and Zelen2 reported the survival time in weeks and the white cell blood count
(WBC) at time of diagnosis for 33 patients who eventually died of acute leukemia.
Each person was classified as AG+ or AG−, indicating the presence or absence of a
certain morphological characteristic in the white cells. Four variables are given in the
data set: WBC, a binary factor or indicator variable AG (1 for AG+, 0 for AG−),
NTOTAL (the number of patients with the given combination of AG and WBC),
and NRES (the number of NTOTAL that survived at least one year from the time of
diagnosis).

2
Feigl, P., Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant
information. Biometrics 21, 826–838. Survival times are given for 33 patients who died from acute
myelogenous leukaemia. Also measured was the patient’s white blood cell count at the time of
diagnosis. The patients were also factored into 2 groups according to the presence or absence of a
morphologic characteristic of white blood cells. Patients termed AG positive were identified by the
presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at
the time of diagnosis.

The researchers are interested in modelling the probability p of surviving at least
one year as a function of WBC and AG. They believe that WBC should be trans-
formed to a log scale, given the skewness in the WBC values.
#### Example: Leukemia
## Leukemia white blood cell types example
# ntotal = number of patients with IAG and WBC combination
# nres = number surviving at least one year
# ag = 1 for AG+, 0 for AG-
# wbc = white cell blood count
# lwbc = log white cell blood count
# p.hat = empirical probability
leuk <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_leuk.dat"
, header = TRUE)
leuk$ag <- factor(leuk$ag)
leuk$lwbc <- log(leuk$wbc)
leuk$p.hat <- leuk$nres / leuk$ntotal
str(leuk)
## 'data.frame': 30 obs. of 6 variables:
## $ ntotal: int 1 1 1 1 1 1 1 1 3 1 ...
## $ nres : int 1 1 1 1 1 1 1 1 1 1 ...
## $ ag : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
## $ wbc : int 75 230 260 430 700 940 1000 1050 10000 300 ...
## $ lwbc : num 4.32 5.44 5.56 6.06 6.55 ...
## $ p.hat : num 1 1 1 1 1 ...


ntotal nres ag wbc lwbc p.hat
1 1 1 1 75 4.32 1.00
2 1 1 1 230 5.44 1.00
3 1 1 1 260 5.56 1.00
4 1 1 1 430 6.06 1.00
5 1 1 1 700 6.55 1.00
6 1 1 1 940 6.85 1.00
7 1 1 1 1000 6.91 1.00
8 1 1 1 1050 6.96 1.00
9 3 1 1 10000 9.21 0.33
10 1 1 0 300 5.70 1.00
11 1 1 0 440 6.09 1.00
12 1 0 1 540 6.29 0.00
13 1 0 1 600 6.40 0.00
14 1 0 1 1700 7.44 0.00
15 1 0 1 3200 8.07 0.00
16 1 0 1 3500 8.16 0.00
17 1 0 1 5200 8.56 0.00
18 1 0 0 150 5.01 0.00
19 1 0 0 400 5.99 0.00
20 1 0 0 530 6.27 0.00
21 1 0 0 900 6.80 0.00
22 1 0 0 1000 6.91 0.00
23 1 0 0 1900 7.55 0.00
24 1 0 0 2100 7.65 0.00
25 1 0 0 2600 7.86 0.00
26 1 0 0 2700 7.90 0.00
27 1 0 0 2800 7.94 0.00
28 1 0 0 3100 8.04 0.00
29 1 0 0 7900 8.97 0.00
30 2 0 0 10000 9.21 0.00

As an initial step in the analysis, consider the following model:

log{p/(1 − p)} = β0 + β1 LWBC + β2 AG,

where LWBC = log(WBC). The model is best understood by separating the AG+
and AG− cases. For AG− individuals, AG=0 so the model reduces to

log{p/(1 − p)} = β0 + β1 LWBC + β2 × 0 = β0 + β1 LWBC.


For AG+ individuals, AG=1 and the model implies

log{p/(1 − p)} = β0 + β1 LWBC + β2 × 1 = (β0 + β2) + β1 LWBC.

The model without AG (i.e., β2 = 0) is a simple logistic model where the log-odds
of surviving one year is linearly related to LWBC, and is independent of AG. The
reduced model with β2 = 0 implies that there is no effect of the AG level on the
survival probability once LWBC has been taken into account.
Including the binary predictor AG in the model implies that there is a linear
relationship between the log-odds of surviving one year and LWBC, with a constant
slope for the two AG levels. This model includes an effect for the AG morphological
factor, but more general models are possible. A natural extension would be to include
a product or interaction effect, a point that I will return to momentarily.
The parameters are easily interpreted: β0 and β0 + β2 are intercepts for the pop-
ulation logistic regression lines for AG− and AG+, respectively. The lines have a
common slope, β1 . The β2 coefficient for the AG indicator is the difference between
intercepts for the AG+ and AG− regression lines. A picture of the assumed rela-
tionship is given below for β1 < 0. The population regression lines are parallel on
the logit scale only, but the order between AG groups is preserved on the probability
scale.
[Figure: the assumed relationship on the logit scale (two parallel lines in LWBC with a common negative slope, IAG=1 above IAG=0) and on the probability scale (two decreasing sigmoidal curves, IAG=1 above IAG=0).]

Before looking at output for the equal slopes model, note that the data set has 30
distinct AG and LWBC combinations, or 30 “groups” or samples. Only two samples
have more than 1 observation. The majority of the observed proportions surviving
at least one year (number surviving ≥ 1 year/group sample size) are 0 (i.e., 0/1)
or 1 (i.e., 1/1). This sparseness of the data makes it difficult to graphically assess
the suitability of the logistic model (because the estimated proportions are almost
all 0 or 1). Although significance tests on the regression coefficients do not require
large group sizes, the chi-squared approximation to the deviance statistic is suspect
in sparse data settings. With small group sizes as we have here, most researchers
would not interpret the p-value for D literally. Instead, they would use the p-values
to informally check the fit of the model. Diagnostics would be used to highlight
problems with the model.
glm.i.l <- glm(cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, leuk)

# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.i.l$deviance, glm.i.l$df.residual)
dev.p.val
## [1] 0.6842804
The large p-value for D indicates that there are no gross deficiencies with the
model. Recall that the Testing Global Null Hypothesis gives p-values for testing the
hypothesis that the regression coefficients are zero for each predictor in the model.
The two predictors are LWBC and AG, so the small p-values indicate that LWBC or
AG, or both, are important predictors of survival. The p-values in the estimates table
suggest that LWBC and AG are both important. If either predictor was insignificant, I
would consider refitting the model omitting the least significant effect, as in regression.
# Testing Global Null Hypothesis
library(aod)
coef(glm.i.l)
## (Intercept) ag1 lwbc
## 5.543349 2.519562 -1.108759
# specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
wald.test(b = coef(glm.i.l), Sigma = vcov(glm.i.l), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 8.2, df = 2, P(> X2) = 0.017

Given that the model fits reasonably well, a test of H0 : β2 = 0 might be a primary
interest here. This checks whether the regression lines are identical for the two AG
levels, which is a test for whether AG affects the survival probability, after taking
LWBC into account. This test is rejected at any of the usual significance levels,
suggesting that the AG level affects the survival probability (assuming a very specific
model).


summary(glm.i.l)
##
## Call:
## glm(formula = cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial,
## data = leuk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6599 -0.6595 -0.2776 0.6438 1.7131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5433 3.0224 1.834 0.0666 .
## ag1 2.5196 1.0907 2.310 0.0209 *
## lwbc -1.1088 0.4609 -2.405 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.191 on 29 degrees of freedom
## Residual deviance: 23.014 on 27 degrees of freedom
## AIC: 30.635
##
## Number of Fisher Scoring iterations: 5

A plot of the predicted survival probabilities as a function of LWBC, using AG
as the plotting symbol, indicates that the probability of surviving at least one year
from the time of diagnosis is a decreasing function of LWBC. For a given LWBC
the survival probability is greater for AG+ patients than for AG− patients. This
tendency is consistent with the observed proportions, which show little information
about the exact form of the trend.
# put the fitted values in the data.frame
leuk$fitted.values <- glm.i.l$fitted.values
pred <- predict(glm.i.l, data.frame(lwbc = leuk$lwbc, ag = leuk$ag), type = "link"
, se.fit = TRUE)
leuk$fit <- pred$fit
leuk$se.fit <- pred$se.fit
# CI for fitted values
leuk <- within(leuk, {
fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})
#round(leuk, 3)

library(ggplot2)
p <- ggplot(leuk, aes(x = lwbc, y = p.hat, colour = ag, fill = ag))


# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = lwbc, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = lwbc, y = fitted.values))
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2)
# observed values
p <- p + geom_point(size = 2, alpha = 0.5)
p <- p + labs(title = "Observed and predicted probability of 1+ year survival")
print(p)

[Figure: "Observed and predicted probability of 1+ year survival" — observed proportions, fitted probabilities, and point-wise 95% confidence bands plotted against lwbc, separately for ag = 0 and ag = 1.]


ntotal nres ag wbc lwbc p.hat fitted.values fit se.fit fit.upper fit.lower
1 1 1 1 75 4.32 1.00 0.96 3.28 1.44 1.00 0.61
2 1 1 1 230 5.44 1.00 0.88 2.03 0.99 0.98 0.52
3 1 1 1 260 5.56 1.00 0.87 1.90 0.94 0.98 0.51
4 1 1 1 430 6.06 1.00 0.79 1.34 0.78 0.95 0.45
5 1 1 1 700 6.55 1.00 0.69 0.80 0.66 0.89 0.38
6 1 1 1 940 6.85 1.00 0.62 0.47 0.61 0.84 0.33
7 1 1 1 1000 6.91 1.00 0.60 0.40 0.61 0.83 0.31
8 1 1 1 1050 6.96 1.00 0.59 0.35 0.60 0.82 0.30
9 3 1 1 10000 9.21 0.33 0.10 −2.15 1.12 0.51 0.01
10 1 1 0 300 5.70 1.00 0.31 −0.78 0.87 0.72 0.08
11 1 1 0 440 6.09 1.00 0.23 −1.21 0.83 0.61 0.06
12 1 0 1 540 6.29 0.00 0.75 1.09 0.72 0.92 0.42
13 1 0 1 600 6.40 0.00 0.73 0.97 0.69 0.91 0.41
14 1 0 1 1700 7.44 0.00 0.45 −0.18 0.61 0.73 0.20
15 1 0 1 3200 8.07 0.00 0.29 −0.89 0.73 0.63 0.09
16 1 0 1 3500 8.16 0.00 0.27 −0.99 0.75 0.62 0.08
17 1 0 1 5200 8.56 0.00 0.19 −1.42 0.88 0.57 0.04
18 1 0 0 150 5.01 0.00 0.50 −0.01 1.02 0.88 0.12
19 1 0 0 400 5.99 0.00 0.25 −1.10 0.84 0.63 0.06
20 1 0 0 530 6.27 0.00 0.20 −1.41 0.83 0.55 0.05
21 1 0 0 900 6.80 0.00 0.12 −2.00 0.86 0.42 0.02
22 1 0 0 1000 6.91 0.00 0.11 −2.12 0.87 0.40 0.02
23 1 0 0 1900 7.55 0.00 0.06 −2.83 1.01 0.30 0.01
24 1 0 0 2100 7.65 0.00 0.05 −2.94 1.03 0.29 0.01
25 1 0 0 2600 7.86 0.00 0.04 −3.18 1.09 0.26 0.00
26 1 0 0 2700 7.90 0.00 0.04 −3.22 1.11 0.26 0.00
27 1 0 0 2800 7.94 0.00 0.04 −3.26 1.12 0.26 0.00
28 1 0 0 3100 8.04 0.00 0.03 −3.37 1.15 0.25 0.00
29 1 0 0 7900 8.97 0.00 0.01 −4.41 1.48 0.18 0.00
30 2 0 0 10000 9.21 0.00 0.01 −4.67 1.57 0.17 0.00

The estimated survival probabilities satisfy

log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC + 2.52 AG.

For AG− individuals with AG=0, this reduces to

log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC,

or equivalently,

p̃ = exp(5.54 − 1.11 LWBC)/{1 + exp(5.54 − 1.11 LWBC)}.

For AG+ individuals with AG=1,

log{p̃/(1 − p̃)} = 5.54 − 1.11 LWBC + 2.52(1) = 8.06 − 1.11 LWBC,

or

p̃ = exp(8.06 − 1.11 LWBC)/{1 + exp(8.06 − 1.11 LWBC)}.


Using the logit scale, the difference between AG+ and AG− individuals in the
estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is
the estimated AG regression coefficient

(8.06 − 1.11 LWBC) − (5.54 − 1.11 LWBC) = 2.52.

Using properties of exponential functions, the odds that an AG+ patient lives at least
one year is exp(2.52) = 12.42 times larger than the odds that an AG− patient lives
at least one year, regardless of LWBC.
This summary, and a CI for the AG odds ratio, is given in the Odds Ratio
table. Similarly, the estimated odds ratio of 0.33 for LWBC implies that the odds of
surviving at least one year is reduced by a factor of 3 for each unit increase of LWBC.
We can use the confint() function to obtain confidence intervals for the coefficient
estimates. Note that for logistic models, confidence intervals are based on the profiled
log-likelihood function.
## CIs using profiled log-likelihood
confint(glm.i.l)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.1596372 12.4524409
## ag1 0.5993391 5.0149271
## lwbc -2.2072275 -0.3319512
We can also get CIs based on just the standard errors by using the default method.

## CIs using standard errors


confint.default(glm.i.l)
## 2.5 % 97.5 %
## (Intercept) -0.3804137 11.467112
## ag1 0.3818885 4.657236
## lwbc -2.0121879 -0.205330
You can also exponentiate the coefficients and confidence interval bounds and
interpret them as odds-ratios.
## coefficients and 95% CI
cbind(OR = coef(glm.i.l), confint(glm.i.l))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 5.543349 0.1596372 12.4524409
## ag1 2.519562 0.5993391 5.0149271
## lwbc -1.108759 -2.2072275 -0.3319512
## odds ratios and 95% CI
exp(cbind(OR = coef(glm.i.l), confint(glm.i.l)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 255.5323676 1.1730851 2.558741e+05
## ag1 12.4231582 1.8209149 1.506452e+02
## lwbc 0.3299682 0.1100052 7.175224e-01
Although the equal slopes model appears to fit well, a more general model might
fit better. A natural generalization here would be to add an interaction, or product
term, AG × LWBC to the model. The logistic model with an AG effect and the
AG × LWBC interaction is equivalent to fitting separate logistic regression lines to
the two AG groups. This interaction model provides an easy way to test whether the
slopes are equal across AG levels. I will note that the interaction term is not needed
here.
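A minimal sketch of how that interaction model could be fit and compared to the equal-slopes model with a likelihood ratio (drop-in-deviance) test (the object name glm.i.l.x is hypothetical):
# sketch: separate slopes (ag x lwbc interaction) versus a common slope
glm.i.l.x <- glm(cbind(nres, ntotal - nres) ~ ag * lwbc, family = binomial, leuk)
anova(glm.i.l, glm.i.l.x, test = "Chisq")  # LR test of the interaction term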

Interpreting odds ratios in logistic regression Let’s begin with probability3.
Let’s say that the probability of success is 0.8, thus p = 0.8. Then the probability of
failure is q = 1 − p = 0.2. The odds of success are defined as odds(success) = p/q =
0.8/0.2 = 4, that is, the odds of success are 4 to 1. The odds of failure would be
odds(failure) = q/p = 0.2/0.8 = 0.25, that is, the odds of failure are 1 to 4. Next,
let’s compute the odds ratio by OR = odds(success)/odds(failure) = 4/0.25 = 16.
The interpretation of this odds ratio would be that the odds of success are 16 times
greater than for failure. Now if we had formed the odds ratio the other way around
with odds of failure in the numerator, we would have gotten something like this,
OR = odds(failure)/odds(success) = 0.25/4 = 0.0625.

Another example This example is adapted from Pedhazur (1997). Suppose
that seven out of 10 males are admitted to an engineering school while three of 10
females are admitted. The probabilities for admitting a male are, p = 7/10 = 0.7 and
q = 1−0.7 = 0.3. Here are the same probabilities for females, p = 3/10 = 0.3 and q =
1−0.3 = 0.7. Now we can use the probabilities to compute the admission odds for both
males and females, odds(male) = 0.7/0.3 = 2.33333 and odds(female) = 0.3/0.7 =
0.42857. Next, we compute the odds ratio for admission, OR = 2.3333/0.42857 =
5.44. Thus, the odds of a male being admitted are 5.44 times greater than for a
female.
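These calculations are easy to reproduce in R; a minimal sketch for the admissions example:
# sketch: probabilities -> odds -> odds ratio for the admissions example
p.male   <- 7 / 10
p.female <- 3 / 10
odds.male   <- p.male   / (1 - p.male)    # 2.33333
odds.female <- p.female / (1 - p.female)  # 0.42857
odds.male / odds.female                   # odds ratio, approximately 5.44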

Leukemia example In the example above, the odds of surviving at least one year
are 12.43 times greater for AG+ than for AG−, and are multiplied by 0.33 (a decrease)
for every unit increase in lwbc.

Example: Mortality of confused flour beetles This example illustrates a
quadratic logistic model.
3
Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/
stat/sas/faq/oratio.htm


The aim of an experiment originally reported by Strand (1930) and quoted by Bliss
(1935) was to assess the response of the confused flour beetle, Tribolium confusum,
to gaseous carbon disulphide (CS2 ). In the experiment, prescribed volumes of liquid
carbon disulphide were added to flasks in which a tubular cloth cage containing a
batch of about thirty beetles was suspended. Duplicate batches of beetles were used
for each concentration of CS2 . At the end of a five-hour period, the proportion killed
was recorded and the actual concentration of gaseous CS2 in the flask, measured in
mg/l, was determined by a volumetric analysis. The mortality data are given in the
table below.
#### Example: Beetles
## Beetles data set
# conc = CS2 concentration
# y = number of beetles killed
# n = number of beetles exposed
# rep = Replicate number (1 or 2)
beetles <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_beetles.dat", header = TRUE)
beetles$rep <- factor(beetles$rep)
conc y n rep conc y n rep
1 49.06 2 29 1 9 49.06 4 30 2
2 52.99 7 30 1 10 52.99 6 30 2
3 56.91 9 28 1 11 56.91 9 34 2
4 60.84 14 27 1 12 60.84 14 29 2
5 64.76 23 30 1 13 64.76 29 33 2
6 68.69 29 31 1 14 68.69 24 28 2
7 72.61 29 30 1 15 72.61 32 32 2
8 76.54 29 29 1 16 76.54 31 31 2

beetles$conc2 <- beetles$conc^2 # for quadratic term (making coding a little easier)
beetles$p.hat <- beetles$y / beetles$n # observed proportion of successes
# empirical logits
beetles$emp.logit <- log(( beetles$p.hat + 0.5/beetles$n) /
(1 - beetles$p.hat + 0.5/beetles$n))
#str(beetles)
Plot the observed probability of mortality and the empirical logits with linear and
quadratic LS fits (which are not the same as the logistic MLE fits).
library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = p.hat, shape = rep))
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed mortality, probability scale")
print(p)

library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = emp.logit))
p <- p + geom_smooth(method = "lm", colour = "red", se = FALSE)

p <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "blue", se = FALSE)


# observed values
p <- p + geom_point(aes(shape = rep), color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Empirical logit with `naive' LS fits (not MLE)")
print(p)
[Figure: "Observed mortality, probability scale" (p.hat versus conc, by replicate) and "Empirical logit with 'naive' LS fits (not MLE)" (emp.logit versus conc with linear and quadratic LS fits).]

In a number of articles that refer to these data, the responses from the first two
concentrations are omitted because of apparent non-linearity. Bliss himself remarks
that

. . . in comparison with the remaining observations, the two lowest concen-
trations gave an exceptionally high kill. Over the remaining concentra-
tions, the plotted values seemed to form a moderately straight line, so
that the data were handled as two separate sets, only the results at 56.91
mg of CS2 per litre being included in both sets.

However, there does not appear to be any biological motivation for this and so here
they are retained in the data set.
Combining the data from the two replicates and plotting the empirical logit of the
observed proportions against concentration gives a relationship that is better fit by a
quadratic than a linear relationship,

log{p/(1 − p)} = β0 + β1 X + β2 X².

The right plot below shows the linear and quadratic model fits to the observed values
with point-wise 95% confidence bands on the logit scale, and on the left is the same
on the proportion scale.


# fit logistic regression to create lines on plots below
# linear
glm.beetles1 <- glm(cbind(y, n - y) ~ conc, family = binomial, beetles)
# quadratic
glm.beetles2 <- glm(cbind(y, n - y) ~ conc + conc2, family = binomial, beetles)

## put model fits for two models together


beetles1 <- beetles
# put the fitted values in the data.frame
beetles1$fitted.values <- glm.beetles1$fitted.values
pred <- predict(glm.beetles1, data.frame(conc = beetles1$conc), type = "link", se.fit = TRUE)
beetles1$fit <- pred$fit
beetles1$se.fit <- pred$se.fit
# CI for fitted values
beetles1 <- within(beetles1, {
fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})
beetles1$modelorder <- "linear"

beetles2 <- beetles


# put the fitted values in the data.frame
beetles2$fitted.values <- glm.beetles2$fitted.values
pred <- predict(glm.beetles2, data.frame(conc = beetles2$conc, conc2 = beetles2$conc2), type = "link", se.fit = TRUE)
beetles2$fit <- pred$fit
beetles2$se.fit <- pred$se.fit
# CI for fitted values
beetles2 <- within(beetles2, {
fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})
beetles2$modelorder <- "quadratic"

beetles.all <- rbind(beetles1, beetles2)


beetles.all$modelorder <- factor(beetles.all$modelorder)

# plot on logit and probability scales

library(ggplot2)
p <- ggplot(beetles.all, aes(x = conc, y = p.hat, shape = rep, colour = modelorder, fill = modelorder))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit.lower, ymax = fit.upper), linetype = 0, alpha = 0.1)
p <- p + geom_line(aes(x = conc, y = fitted.values, linetype = modelorder), size = 1)
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2)
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed and predicted mortality, probability scale")

print(p)

library(ggplot2)
p <- ggplot(beetles.all, aes(x = conc, y = emp.logit, shape = rep, colour = modelorder, fill = modelorder))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit - 1.96 * se.fit, ymax = fit + 1.96 * se.fit), linetype = 0, alpha = 0.1)
p <- p + geom_line(aes(x = conc, y = fit, linetype = modelorder), size = 1)
# fitted values
p <- p + geom_point(aes(y = fit), size=2)
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed and predicted mortality, logit scale")
print(p)
[Figure: "Observed and predicted mortality, probability scale" and "Observed and predicted mortality, logit scale" — observed values by replicate with the linear and quadratic logistic fits and point-wise 95% confidence bands, plotted against conc.]

11.5 Example: The UNM Trauma Data


The data to be analyzed here were collected on 3132 patients admitted to The Uni-
versity of New Mexico Trauma Center between the years 1991 and 1994. For each
patient, the attending physician recorded their age, their revised trauma score (RTS),
their injury severity score (ISS), whether their injuries were blunt (i.e., the result of
a car crash: BP=0) or penetrating (i.e., gunshot/knife wounds: BP=1), and whether
they eventually survived their injuries (SURV=0 if not, SURV=1 if survived). Ap-
proximately 10% of patients admitted to the UNM Trauma Center eventually die
from their injuries.
The ISS is an overall index of a patient’s injuries, based on the approximately
1300 injuries cataloged in the Abbreviated Injury Scale. The ISS can take on values
from 0 for a patient with no injuries to 75 for a patient with 3 or more life threatening
injuries. The ISS is the standard injury index used by trauma centers throughout
the U.S. The RTS is an index of physiologic injury, and is constructed as a weighted
average of an incoming patient’s systolic blood pressure, respiratory rate, and Glasgow
Coma Scale. The RTS can take on values from 0 for a patient with no vital signs to
7.84 for a patient with normal vital signs.
Champion et al. (1981) proposed a logistic regression model to estimate the prob-
ability of a patient’s survival as a function of RTS, the injury severity score ISS, and
the patient’s age, which is used as a surrogate for physiologic reserve. Subsequent
survival models included the binary effect BP as a means to differentiate between
blunt and penetrating injuries.
We will develop a logistic model for predicting survival from ISS, AGE, BP, and
RTS, and nine body regions. Data on the number of severe injuries in each of the nine
body regions is also included in the database, so we will also assess whether these
features have any predictive power. The following labels were used to identify the
number of severe injuries in the nine regions: AS = head, BS = face, CS = neck,
DS = thorax, ES = abdomen, FS = spine, GS = upper extremities, HS = lower
extremities, and JS = skin.
#### Example: UNM Trauma Data
trauma <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_trauma.dat"
, header = TRUE)

## Variables
# surv = survival (1 if survived, 0 if died)
# rts = revised trauma score (range: 0 no vital signs to 7.84 normal vital signs)
# iss = injury severity score (0 no injuries to 75 for 3 or more life threatening injuries)
# bp = blunt or penetrating injuries (e.g., car crash BP=0 vs gunshot/knife wounds BP=1)
# Severe injuries: add the severe injuries 3--6 to make summary variables
trauma <- within(trauma, {
as = a3 + a4 + a5 + a6 # as = head
bs = b3 + b4 + b5 + b6 # bs = face
cs = c3 + c4 + c5 + c6 # cs = neck
ds = d3 + d4 + d5 + d6 # ds = thorax
es = e3 + e4 + e5 + e6 # es = abdomen
fs = f3 + f4 + f5 + f6 # fs = spine
gs = g3 + g4 + g5 + g6 # gs = upper extremities
hs = h3 + h4 + h5 + h6 # hs = lower extremities
js = j3 + j4 + j5 + j6 # js = skin
})
# keep only columns of interest
names(trauma)
## [1] "id" "surv" "a1" "a2" "a3" "a4" "a5" "a6" "b1" "b2"
## [11] "b3" "b4" "b5" "b6" "c1" "c2" "c3" "c4" "c5" "c6"
## [21] "d1" "d2" "d3" "d4" "d5" "d6" "e1" "e2" "e3" "e4"

## [31] "e5" "e6" "f1" "f2" "f3" "f4" "f5" "f6" "g1" "g2"
## [41] "g3" "g4" "g5" "g6" "h1" "h2" "h3" "h4" "h5" "h6"
## [51] "j1" "j2" "j3" "j4" "j5" "j6" "iss" "iciss" "bp" "rts"
## [61] "age" "prob" "js" "hs" "gs" "fs" "es" "ds" "cs" "bs"
## [71] "as"
trauma <- subset(trauma, select = c(id, surv, as:js, iss:prob))
head(trauma)
## id surv as bs cs ds es fs gs hs js iss iciss bp rts age prob
## 1 1238385 1 0 0 0 1 0 0 0 0 0 13 0.8612883 0 7.8408 13 0.9909890
## 2 1238393 1 0 0 0 0 0 0 0 0 0 5 0.9421876 0 7.8408 23 0.9947165
## 3 1238898 1 0 0 0 0 0 0 2 0 0 13 0.7251130 0 7.8408 43 0.9947165
## 4 1239516 1 1 0 0 0 0 0 0 0 0 16 1.0000000 0 5.9672 17 0.9615540
## 5 1239961 1 1 0 0 0 0 0 0 0 1 9 0.9346634 0 4.8040 20 0.9338096
## 6 1240266 1 0 0 0 0 0 0 0 1 0 13 0.9004691 0 7.8408 32 0.9947165
#str(trauma)

I made side-by-side boxplots of the distributions of ISS, AGE, RTS, and AS
through JS for the survivors and non-survivors. In several body regions the num-
ber of injuries is limited, so these boxplots are not very enlightening. Survivors
tend to have lower ISS scores, tend to be slightly younger, tend to have higher RTS
scores, and tend to have fewer severe head (AS) and abdomen injuries (ES) than
non-survivors. The importance of the effects individually towards predicting survival
is directly related to the separation between the survivors and non-survivors scores.
# Create boxplots for each variable by survival
library(reshape2)
trauma.long <- melt(trauma, id.vars = c("id", "surv", "prob"))

# Plot the data using ggplot


library(ggplot2)
p <- ggplot(trauma.long, aes(x = factor(surv), y = value))
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.1)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.75, colour = "red")
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, alpha = 0.8)
p <- p + facet_wrap( ~ variable, scales = "free_y", ncol = 4)
p <- p + labs(title = "Boxplots of variables by survival")
print(p)


[Figure: "Boxplots of variables by survival" — boxplots (with means and 95% CIs) of as, bs, cs, ds, es, fs, gs, hs, js, iss, iciss, bp, rts, and age, separately for surv = 0 and surv = 1.]

11.5.1 Selecting Predictors in the Trauma Data


The same automated methods for model building are available for the glm() pro-
cedure, including backward elimination, forward selection, and stepwise methods,
among others. Below we perform a stepwise selection using AIC, starting from the full
model with 13 effects: ISS, BP, RTS, AGE, and AS–JS.
Revisit Chapter 10 for more information.
In our previous logistic regression analyses, the cases in the data set were pre-
aggregated into groups of observations having identical levels of the predictor vari-
ables. The numbers of cases in the success category and the group sample sizes were
specified in the model statement, along with the names of the predictors. The trauma
data set, which is not reproduced here, is raw data consisting of one record per pa-
tient (i.e., 3132 lines). The logistic model is fitted to data on individual cases by
specifying the binary response as cbind(surv, 1 - surv) (successes and failures) on the left-hand
side, with the predictors on the right-hand side of the formula. Keep in mind that we are
defining the logistic model to model the success category, so we are modeling the
probability of surviving.
As an aside, there are two easy ways to model the probability of dying (which
we don’t do below). The first is to swap the order the response is specified in the
formula: cbind(1 - surv, surv). The second is to convert a model for the log-odds
of surviving to a model for the log-odds of dying by simply changing the sign of each
regression coefficient in the model.
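As an illustration of the first option, here is a minimal sketch (the object name glm.tr.die is hypothetical and this model is not used elsewhere in these notes):
# sketch: the same logistic model for the probability of dying, obtained by
# swapping the response columns; its coefficients are the negatives of those
# from the corresponding survival model
glm.tr.die <- glm(cbind(1 - surv, surv) ~ as + bs + cs + ds + es + fs + gs + hs + js
                  + iss + rts + age + bp
                  , family = binomial, trauma)
coef(glm.tr.die)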
I only included the summary table from the stepwise selection, and information
on the fit of the selected model.
glm.tr <- glm(cbind(surv, 1 - surv) ~ as + bs + cs + ds + es + fs + gs + hs + js
+ iss + rts + age + bp
, family = binomial, trauma)

# option: trace = 0 doesn't show each step of the automated selection


glm.tr.red.AIC <- step(glm.tr, direction="both", trace = 0)

# the anova object provides a summary of the selection steps in order


glm.tr.red.AIC$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 3118 869.4309 897.4309
## 2 - as 1 0.04911674 3119 869.4800 895.4800
## 3 - bs 1 0.12242766 3120 869.6024 893.6024
## 4 - fs 1 0.15243093 3121 869.7549 891.7549
## 5 - cs 1 0.78703127 3122 870.5419 890.5419
## 6 - ds 1 0.81690443 3123 871.3588 889.3588
## 7 - gs 1 1.18272567 3124 872.5415 888.5415
## 8 - hs 1 1.17941462 3125 873.7209 887.7209
## 9 - js 1 1.71204029 3126 875.4330 887.4330
The final model includes effects for ES (number of severe abdominal injuries), ISS,
RTS, AGE, and BP. All of the effects in the selected model are significant at the 5%
level.
summary(glm.tr.red.AIC)
##
## Call:
## glm(formula = cbind(surv, 1 - surv) ~ es + iss + rts + age +
## bp, family = binomial, data = trauma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1546 0.0992 0.1432 0.2316 3.4454
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.355845 0.442943 0.803 0.4218
## es -0.461317 0.110098 -4.190 2.79e-05 ***
## iss -0.056920 0.007411 -7.680 1.59e-14 ***
## rts 0.843143 0.055339 15.236 < 2e-16 ***
## age -0.049706 0.005291 -9.394 < 2e-16 ***
## bp -0.635137 0.249597 -2.545 0.0109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1825.37 on 3131 degrees of freedom
## Residual deviance: 875.43 on 3126 degrees of freedom
## AIC: 887.43
##
## Number of Fisher Scoring iterations: 7
The p-value for D is large, indicating no gross deficiencies with the selected model.

# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.tr.red.AIC$deviance, glm.tr.red.AIC$df.residual)
dev.p.val
## [1] 1
Letting p be the probability of survival, the estimated survival probability is given
by
 

log = 0.3558 − 0.4613 ES − 0.6351 BP − 0.0569 ISS
1 − p̃
+0.8431 RTS − 0.0497 AGE.

Let us interpret the sign of the coefficients, and the odds ratios, in terms of the impact
that individual predictors have on the survival probability.
## coefficients and 95% CI
cbind(OR = coef(glm.tr.red.AIC), confint(glm.tr.red.AIC))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.35584499 -0.51977300 1.21869015
## es -0.46131679 -0.67603693 -0.24307991
## iss -0.05691973 -0.07159539 -0.04249502
## rts 0.84314317 0.73817886 0.95531089
## age -0.04970641 -0.06020882 -0.03943822

## bp -0.63513735 -1.12051508 -0.14005288


## odds ratios and 95% CI
exp(cbind(OR = coef(glm.tr.red.AIC), confint(glm.tr.red.AIC)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.4273863 0.5946555 3.3827539
## es 0.6304529 0.5086287 0.7842088
## iss 0.9446699 0.9309075 0.9583952
## rts 2.3236592 2.0921220 2.5994786
## age 0.9515087 0.9415679 0.9613293
## bp 0.5298627 0.3261118 0.8693123

11.5.2 Checking Predictions for the Trauma Model


To assess the ability of the selected model to accurately predict survival, assume that
the two types of errors (a false positive prediction of survival, and a false negative
prediction) have equal costs. Under this assumption, the optimal classification rule
is to predict survival for an individual with an estimated survival probability of 0.50
or larger.
In the below script, a table is generated that gives classifications for cutoff proba-
bilities thresh= 0.0, 0.1, . . . , 1.0 based on the selected model. While I do not do this
below, it is common to use the jackknife method for assessing classification accuracy.
To implement the jackknife method, each observation is temporarily held out and the
selected model is fitted to the remaining (3131) cases. This leads to an estimated
survival probability for the case, which is compared to the actual result. The results
are summarized in terms of total number of correct and incorrect classifications for
each possible outcome. In the table below, the columns labeled Event (a survival)
and Non-Event (a death) refer to the classification of observations, and not to the
actual outcomes. The columns labeled Correct and Incorrect identify whether the
classifications are accurate.
# thresholds for classification given model proportion predictions for each observation
thresh <- seq(0,1,by=0.1)

# predicted probabilities
Yhat <- fitted(glm.tr.red.AIC)

# Name: lower (0) = NonEvent, higher (1) = Event


YObs <- cut(trauma$surv, breaks = c(-Inf, mean(trauma$surv), Inf)
, labels = c("NonEvent", "Event"))

classify.table <- data.frame(Thresh = rep(NA, length(thresh))


, Cor.Event = rep(NA, length(thresh))
, Cor.NonEv = rep(NA, length(thresh))


, Inc.Event = rep(NA, length(thresh))


, Inc.NonEv = rep(NA, length(thresh))
, Cor.All = rep(NA, length(thresh))
, Sens = rep(NA, length(thresh))
, Spec = rep(NA, length(thresh))
, Fal.P = rep(NA, length(thresh))
, Fal.N = rep(NA, length(thresh)))

for (i.thresh in 1:length(thresh)) {


# choose a threshold for dichotomizing according to predicted probability
YhatPred <- cut(Yhat, breaks = c(-Inf, thresh[i.thresh], Inf)
, labels = c("NonEvent", "Event"))

# contingency table and marginal sums


cTab <- table(YhatPred, YObs)
addmargins(cTab)

# Classification Table
classify.table$Thresh [i.thresh] <- thresh[i.thresh] # Prob.Level
classify.table$Cor.Event[i.thresh] <- cTab[2,2] # Correct.Event
classify.table$Cor.NonEv[i.thresh] <- cTab[1,1] # Correct.NonEvent
classify.table$Inc.Event[i.thresh] <- cTab[2,1] # Incorrect.Event
classify.table$Inc.NonEv[i.thresh] <- cTab[1,2] # Incorrect.NonEvent
classify.table$Cor.All [i.thresh] <- 100 * sum(diag(cTab)) / sum(cTab) # Correct.Overall
classify.table$Sens [i.thresh] <- 100 * cTab[2,2] / sum(cTab[,2]) # Sensitivity
classify.table$Spec [i.thresh] <- 100 * cTab[1,1] / sum(cTab[,1]) # Specificity
classify.table$Fal.P [i.thresh] <- 100 * cTab[2,1] / sum(cTab[2,]) # False.Pos
classify.table$Fal.N [i.thresh] <- 100 * cTab[1,2] / sum(cTab[1,]) # False.Neg
}
round(classify.table, 1)
## Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All Sens Spec Fal.P Fal.N
## 1 0.0 2865 0 267 0 91.5 100.0 0.0 8.5 NaN
## 2 0.1 2861 79 188 4 93.9 99.9 29.6 6.2 4.8
## 3 0.2 2856 105 162 9 94.5 99.7 39.3 5.4 7.9
## 4 0.3 2848 118 149 17 94.7 99.4 44.2 5.0 12.6
## 5 0.4 2837 125 142 28 94.6 99.0 46.8 4.8 18.3
## 6 0.5 2825 139 128 40 94.6 98.6 52.1 4.3 22.3
## 7 0.6 2805 157 110 60 94.6 97.9 58.8 3.8 27.6
## 8 0.7 2774 174 93 91 94.1 96.8 65.2 3.2 34.3
## 9 0.8 2727 196 71 138 93.3 95.2 73.4 2.5 41.3
## 10 0.9 2627 229 38 238 91.2 91.7 85.8 1.4 51.0
## 11 1.0 0 267 0 2865 8.5 0.0 100.0 NaN 91.5
The data set has 2865 survivors and 267 people that died. Using a 0.50 cutoff, 2825 of the survivors would be correctly identified, and 40 misclassified. Similarly, 139 of the patients that died would be correctly classified and 128 would not. The overall percentage of cases correctly classified is (2825 + 139)/3132 = 94.6%. The sensitivity is the percentage of survivors that are correctly classified, 2825/(2825 + 40) = 98.6%.


The specificity is the percentage of patients that died that are correctly classified, 139/(139 + 128) = 52.1%. The false positive rate, which is the % of those predicted to survive that did not, is 128/(128 + 2825) = 4.3%. The false negative rate, which is the % of those predicted to die that did not, is 40/(40 + 139) = 22.3%.
# Thresh = 0.5 classification table
YhatPred <- cut(Yhat, breaks=c(-Inf, 0.5, Inf), labels=c("NonEvent", "Event"))
# contingency table and marginal sums
cTab <- table(YhatPred, YObs)
addmargins(cTab)
## YObs
## YhatPred NonEvent Event Sum
## NonEvent 139 40 179
## Event 128 2825 2953
## Sum 267 2865 3132
round(subset(classify.table, Thresh == 0.5), 1)
## Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All Sens Spec Fal.P Fal.N
## 6 0.5 2825 139 128 40 94.6 98.6 52.1 4.3 22.3
The misclassification rate seems small, but you should remember that approximately 10% of patients admitted to UNM eventually die from their injuries. Given this historical information only, you could achieve a roughly 10% misclassification rate by completely ignoring the data and classifying each admitted patient as a survivor. Using the data reduces the misclassification rate by about half (from roughly 10% down to 5.4%), which is an important reduction in this problem.
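These two rates can be computed directly; the following is a sketch, assuming surv is coded 0/1 with 1 = survived, as the construction of YObs above suggests.
# Sketch: error rate of the "classify everyone as a survivor" rule vs. the fitted model
baseline.err <- mean(trauma$surv == 0)                                       # about 0.085
model.err    <- mean((fitted(glm.tr.red.AIC) >= 0.5) != (trauma$surv == 1))  # about 0.054
c(baseline = baseline.err, model = model.err)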

11.6 Historical Example: O-Ring Data


The table below presents data from Dalal, Fowlkes, and Hoadley (1989) on field O-
ring failures in the 23 pre-Challenger space shuttle launches. Temperature at lift-off
and O-ring joint pressure are predictors. The binary version of the data views each
flight as an independent trial. The result of a trial (y) is a 1 if at least one field
O-ring failure occurred on the flight and a 0 if all six O-rings functioned properly.
A regression for the binary data would model the probability that any O-ring failed
as a function of temperature and pressure. The binomial version of these data is
also presented. It views the six field O-rings on each flight as independent trials.
A regression for the binomial data would model the probability that a particular
O-ring on a given flight failed as a function of temperature and pressure. The two
regressions model different probabilities. The Challenger explosion occurred during
a takeoff at 31 degrees Fahrenheit.
#### Example: Shuttle O-ring data
shuttle <- read.csv("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_shuttle.csv")


case flight y six temp pressure


1 1 14 1 2 53 50
2 2 9 1 1 57 50
3 3 23 1 1 58 200
4 4 10 1 1 63 50
5 5 1 0 0 66 200
6 6 5 0 0 67 50
7 7 13 0 0 67 200
8 8 15 0 0 67 50
9 9 4 0 0 68 200
10 10 3 0 0 69 200
11 11 8 0 0 70 50
12 12 17 0 0 70 200
13 13 2 1 1 70 200
14 14 11 1 1 70 200
15 15 6 0 0 72 200
16 16 7 0 0 73 200
17 17 16 0 0 75 100
18 18 21 1 2 75 200
19 19 19 0 0 76 200
20 20 22 0 0 76 200
21 21 12 0 0 78 200
22 22 20 0 0 79 200
23 23 18 0 0 81 200
The regression model for the binomial data (i.e., six trials per launch) is suspect
on substantive grounds. The binomial model makes no allowance for an effect due
to having the six O-rings on the same flight. If these data were measurements rather
than counts, they would almost certainly be treated as dependent repeated measures.
If interest is focused on whether one or more O-rings failed, the simplest, most direct
data are the binary data.
Consider fitting a logistic regression model using temperature and pressure as
predictors. Let pi be the probability that any O-ring fails in case i and model this as
 
log( pi / (1 − pi) ) = β0 + β1 Tempi + β2 Pressurei.

Logistic histogram plots of the data show a clear marginal relationship of failure
with temp but not with pressure. We still need to assess the model with both variables
together.
# plot logistic plots of response to each predictor individually
library(popbio)
##
## Attaching package: ’popbio’
## The following object is masked from ’package:gdata’:
##
## resample
logi.hist.plot(shuttle$temp, shuttle$y, boxp=FALSE, type="hist"
, rug=TRUE, col="gray", ylabel = "Probability", xlabel = "Temp")
logi.hist.plot(shuttle$pressure, shuttle$y, boxp=FALSE, type="hist"
, rug=TRUE, col="gray", ylabel = "Probability", xlabel = "Pressure")


[Figure: logistic histogram plots of O-ring failure (y) versus Temp (left) and Pressure (right), with probability on the y-axis and frequency histograms of the 0/1 outcomes.]

We fit the logistic model below using Y = 1 if at least one O-ring failed, and 0
otherwise. We are modelling the chance of one or more O-ring failures as a function
of temperature and pressure.
The D goodness-of-fit statistic suggests no gross deviations from the model. Furthermore, the test of H0 : β1 = β2 = 0 (no regression effects) based on the Wald test has a p-value of 0.10, which provides only weak evidence that temperature, pressure, or both are useful predictors of the probability of O-ring failure. The z-test p-values for testing H0 : β1 = 0 and H0 : β2 = 0 individually are 0.037 and 0.576, respectively, which indicates pressure is not important (when added last to the model), but that temperature is important. This conclusion might be anticipated by looking at the data plots above.
glm.sh <- glm(cbind(y, 1 - y) ~ temp + pressure, family = binomial, shuttle)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.sh$deviance, glm.sh$df.residual)
dev.p.val
## [1] 0.4589415
# Testing Global Null Hypothesis
library(aod)
coef(glm.sh)
## (Intercept) temp pressure
## 16.385319489 -0.263404073 0.005177602
# specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
wald.test(b = coef(glm.sh), Sigma = vcov(glm.sh), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 4.6, df = 2, P(> X2) = 0.1
# Model summary
summary(glm.sh)


##
## Call:
## glm(formula = cbind(y, 1 - y) ~ temp + pressure, family = binomial,
## data = shuttle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1928 -0.7879 -0.3789 0.4172 2.2031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.385319 8.027474 2.041 0.0412 *
## temp -0.263404 0.126371 -2.084 0.0371 *
## pressure 0.005178 0.009257 0.559 0.5760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.984 on 20 degrees of freedom
## AIC: 25.984
##
## Number of Fisher Scoring iterations: 5

A reasonable next step would be to refit the model, after omitting pressure as a
predictor. The target model is now
 
log( pi / (1 − pi) ) = β0 + β1 Tempi.

glm.sh <- glm(cbind(y, 1 - y) ~ temp, family = binomial, shuttle)


# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.sh$deviance, glm.sh$df.residual)
dev.p.val
## [1] 0.5013827
# Model summary
summary(glm.sh)
##
## Call:
## glm(formula = cbind(y, 1 - y) ~ temp, family = binomial, data = shuttle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:


## Estimate Std. Error z value Pr(>|z|)


## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## temp -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
Our conclusions on the overall fit of the model and the significance of the effect of temperature on the probability of O-ring failure are consistent with the results for the two-predictor model. The model estimates the log-odds of (at least one) O-ring failure to be

log( p̃ / (1 − p̃) ) = 15.043 − 0.2322 Temp.

The estimated probability of (at least one) O-ring failure is

p̃ = exp(15.043 − 0.2322 Temp) / (1 + exp(15.043 − 0.2322 Temp)).

This is a decreasing function of temperature.
The Challenger was set to launch on a morning when the predicted temperature at lift-off was 31 degrees. Given that temperature appears to affect the probability
of O-ring failure (a point that NASA missed), what can/should we say about the
potential for O-ring failure?
Clearly, the launch temperature is outside the region for which data were available.
Thus, we really have no prior information about what is likely to occur. If we assume
the logistic model holds, and we can extrapolate back to 31 degrees, what is the
estimated probability of O-ring failure?
The following gives the answer to this question. I augmented the original data set to obtain predicted probabilities for temperatures not observed in the data set. Note that fitting the model to the data with the appended (missing-response) observations gives the same model fit because glm() excludes observations with missing values. Predictions are then made for all temperatures in the dataset and the resulting table and plot are reported.
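A simpler direct alternative (a sketch, using the temperature-only fit glm.sh from above) is to ask predict() for the response-scale estimate at 31 degrees:
# Sketch: direct prediction at 31 degrees F on the probability scale
predict(glm.sh, newdata = data.frame(temp = 31), type = "response")
# roughly 0.9996: essentially certain failure of at least one O-ring,
# if the logistic model can be extrapolated that far below the observed temperatures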
# append values to dataset for which we wish to make predictions
shuttle.pred <- data.frame( case = rep(NA, 5)
, flight = rep(NA, 5)
, y = rep(NA, 5)
, six = rep(NA, 5)


, temp = c(31, 35, 40, 45, 50) # temp values to predict


, pressure = rep(NA, 5)
)
shuttle <- rbind(shuttle.pred, shuttle)
# fit model
glm.sh <- glm(cbind(y, 1 - y) ~ temp, family = binomial, shuttle)
# Note: same model fit as before since glm() does not use missing values
round(summary(glm.sh)$coefficients, 3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.043 7.379 2.039 0.041
## temp -0.232 0.108 -2.145 0.032
# put the fitted values in the data.frame
shuttle$fitted.values <- c(rep(NA, 5), glm.sh$fitted.values)
# predict() uses all the temp values in dataset, including appended values
pred <- predict(glm.sh, data.frame(temp = shuttle$temp), type = "link", se.fit = TRUE)
shuttle$fit <- pred$fit
shuttle$se.fit <- pred$se.fit
# CI for fitted values
shuttle <- within(shuttle, {
# added "fitted" to make predictions at appended temp values
fitted = exp(fit) / (1 + exp(fit))
fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})

case flight y six temp pressure fitted.values fit se.fit fit.upper fit.lower fitted
1 31.00 7.85 4.04 1.00 0.48 1.00
2 35.00 6.92 3.61 1.00 0.46 1.00
3 40.00 5.76 3.08 1.00 0.43 1.00
4 45.00 4.60 2.55 1.00 0.40 0.99
5 50.00 3.43 2.02 1.00 0.37 0.97
6 1.00 14.00 1.00 2.00 53.00 50.00 0.94 2.74 1.71 1.00 0.35 0.94
7 2.00 9.00 1.00 1.00 57.00 50.00 0.86 1.81 1.31 0.99 0.32 0.86
8 3.00 23.00 1.00 1.00 58.00 200.00 0.83 1.58 1.21 0.98 0.31 0.83
9 4.00 10.00 1.00 1.00 63.00 50.00 0.60 0.42 0.77 0.87 0.25 0.60
10 5.00 1.00 0.00 0.00 66.00 200.00 0.43 −0.28 0.59 0.71 0.19 0.43
11 6.00 5.00 0.00 0.00 67.00 50.00 0.38 −0.51 0.56 0.64 0.17 0.38
12 7.00 13.00 0.00 0.00 67.00 200.00 0.38 −0.51 0.56 0.64 0.17 0.38
13 8.00 15.00 0.00 0.00 67.00 50.00 0.38 −0.51 0.56 0.64 0.17 0.38
14 9.00 4.00 0.00 0.00 68.00 200.00 0.32 −0.74 0.55 0.58 0.14 0.32
15 10.00 3.00 0.00 0.00 69.00 200.00 0.27 −0.98 0.56 0.53 0.11 0.27
16 11.00 8.00 0.00 0.00 70.00 50.00 0.23 −1.21 0.59 0.49 0.09 0.23
17 12.00 17.00 0.00 0.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
18 13.00 2.00 1.00 1.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
19 14.00 11.00 1.00 1.00 70.00 200.00 0.23 −1.21 0.59 0.49 0.09 0.23
20 15.00 6.00 0.00 0.00 72.00 200.00 0.16 −1.67 0.70 0.43 0.05 0.16
21 16.00 7.00 0.00 0.00 73.00 200.00 0.13 −1.90 0.78 0.40 0.03 0.13
22 17.00 16.00 0.00 0.00 75.00 100.00 0.09 −2.37 0.94 0.37 0.01 0.09
23 18.00 21.00 1.00 2.00 75.00 200.00 0.09 −2.37 0.94 0.37 0.01 0.09
24 19.00 19.00 0.00 0.00 76.00 200.00 0.07 −2.60 1.03 0.36 0.01 0.07
25 20.00 22.00 0.00 0.00 76.00 200.00 0.07 −2.60 1.03 0.36 0.01 0.07
26 21.00 12.00 0.00 0.00 78.00 200.00 0.04 −3.07 1.22 0.34 0.00 0.04
27 22.00 20.00 0.00 0.00 79.00 200.00 0.04 −3.30 1.32 0.33 0.00 0.04
28 23.00 18.00 0.00 0.00 81.00 200.00 0.02 −3.76 1.51 0.31 0.00 0.02


library(ggplot2)
p <- ggplot(shuttle, aes(x = temp, y = y))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = temp, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = temp, y = fitted), colour="red")
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2, colour="red")
# observed values
p <- p + geom_point(size = 2)
p <- p + ylab("Probability")
p <- p + labs(title = "Observed events and predicted probability of 1+ O-ring failures")
print(p)
## Warning: Removed 5 rows containing missing values (geom point).
## Warning: Removed 5 rows containing missing values (geom point).
[Figure: "Observed events and predicted probability of 1+ O-ring failures": predicted probability versus temp with a point-wise 95% CI band, the fitted values, and the observed 0/1 outcomes.]



Part IX

ADA2: Multivariate Methods



Chapter 12

An Introduction to Multivariate
Methods

Multivariate statistical methods are used to display, analyze, and describe data on
two or more features or variables simultaneously. I will discuss multivariate methods
for measurement data. Methods for multi-dimensional count data, or mixtures of
counts and measurements are available, but are beyond the scope of what I can do
here. I will give a brief overview of the type of problems where multivariate methods
are appropriate.

Example: Turtle shells Jolicouer and Mosimann provided data on the height,
length, and width of the carapace (shell) for a sample of female painted turtles.
Cluster analysis is used to identify which shells are similar on the three features.
Principal component analysis is used to identify the linear combinations of the
measurements that account for most of the variation in size and shape of the shells.
Cluster analysis and principal component analysis are primarily descriptive tech-
niques.

Example: Fisher’s Iris data Random samples of 50 flowers were selected from
three iris species: Setosa, Virginica, and Versicolor. Four measurements were made
on each flower: sepal length, sepal width, petal length, and petal width. Suppose the
sample means on each feature are computed within the three species. Are the means
on the four traits significantly different across species? This question can be answered
using four separate one-way ANOVAs. A more powerful MANOVA (multivariate
analysis of variance) method compares species on the four features simultaneously.
Discriminant analysis is a technique for comparing groups on multi-dimensional
data. Discriminant analysis can be used with Fisher’s Iris data to find the linear com-
binations of the flower features that best distinguish species. The linear combinations


are optimally selected, so insignificant differences on one or all features may be sig-
nificant (or better yet, important) when the features are considered simultaneously!
Furthermore, the discriminant analysis could be used to classify flowers into one of
these three species when their species is unknown.
MANOVA, discriminant analysis, and classification are primarily inferential
techniques.

12.1 Linear Combinations


Suppose data are collected on p measurements or features X1 , X2 , . . . , Xp . Most mul-
tivariate methods use linear combinations of the features as the basis for analysis.
A linear combination has the form
Y = a1 X 1 + a2 X 2 + · · · + ap X p ,
where the coefficients a1 , a2 , . . . , ap are known constants. Y is evaluated for each
observation in the data set, keeping the coefficients constant.
For example, three linear combinations of X1 , X2 , . . . , Xp are:
Y = 1X1 + 0X2 + 0X3 + · · · + 0Xp = X1 ,
Y = (1/p)(X1 + X2 + · · · + Xp), and
Y = 2X1 − 4X2 + 55X3 − 1954X4 + · · · + 44Xp .
Vector and matrix notation are useful for representing and summarizing multi-
variate data. Before introducing this notation, let us try to understand linear combi-
nations geometrically when p = 2.

Example: −45◦ rotation A plot of data on two features X1 and X2 is given below.
Also included is a plot for the two linear combinations
Y1 = (1/√2)(X1 + X2) and
Y2 = (1/√2)(X2 − X1).

This transformation creates two (roughly) uncorrelated linear combinations Y1 and


Y2 from two highly correlated features X1 and X2 . The transformation corresponds
to a rotation of the original coordinate axes by −45 degrees. Each data point is then
expressed relative to the new axes. The new features are uncorrelated!


[Figure: scatterplot of the original features X1 and X2 (left) and of the rotated linear combinations Y1 and Y2 (right); the −45° rotation produces (roughly) uncorrelated variables.]


The √2 divisor in Y1 and Y2 does not alter the interpretation of these linear combinations: Y1 is essentially the sum of X1 and X2, whereas Y2 is essentially the difference between X2 and X1.
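A small R sketch (with simulated data, since the plotted data are not provided in the text) shows the rotation removing the correlation:
# Sketch: rotate two correlated features by -45 degrees
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(100)
# give both features exactly unit sample variance, as in the picture
x1 <- as.vector(scale(x1)); x2 <- as.vector(scale(x2))
y1 <- (x1 + x2) / sqrt(2)   # "sum" direction
y2 <- (x2 - x1) / sqrt(2)   # "difference" direction
cor(x1, x2)   # strongly positive
cor(y1, y2)   # essentially 0, since x1 and x2 have equal sample variances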

Example: Two groups The plot below shows data on two features X1 and X2
from two distinct groups.

[Figure: two groups plotted on the original features X1 and X2 (left) and on the rotated axes Y1 and Y2 (right), obtained by rotating the coordinate axes by θ degrees.]

If you compare the groups on X1 and X2 separately, you may find no significant
differences because the groups overlap substantially on each feature. The plot on the
right was obtained by rotating the coordinate axes −θ degrees, and then plotting the
data relative to the new coordinate axes. The rotation corresponds to creating two
linear combinations:

Y1 = cos(θ)X1 + sin(θ)X2
Y2 = − sin(θ)X1 + cos(θ)X2 .

The two groups differ substantially on Y2 . This linear combination is used with
discriminant analysis and MANOVA to distinguish between the groups.
The linear combinations used in certain multivariate methods do not correspond to a rotation of the original coordinate axes. However, the pictures given above should provide some insight into the motivation for creating linear combinations of two features. The ideas extend to three or more features, but are more difficult to represent visually.

12.2 Vector and Matrix Notation


A vector is a string of numbers or variables that is stored in either a row or in a
column. For example, the collection X1 , X2 , . . . , Xp of features can be represented as


a column-vector with p rows, using the notation

X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}.

The entry in the jth row is Xj. The transpose of X, represented by X′, is a row-vector with p columns: X′ = (X1, X2, . . . , Xp). The jth column of X′ contains Xj.
Suppose you collect data on p features X1, X2, . . . , Xp for a sample of n individuals. The data for the ith individual can be represented as the column-vector

X_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ip} \end{pmatrix}

or as the row-vector Xi′ = (Xi1, Xi2, . . . , Xip). Here Xij is the value on the jth variable. Two subscripts are needed for the data values. One subscript identifies the individual and the other subscript identifies the feature.
A matrix is a rectangular array of numbers or variables. A data set can be viewed
as a matrix with n rows and p columns, where n is the sample size. Each row contains
data for a given individual:

\begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}.

Vector and matrix notation are used for summarizing multivariate data. For example,
the sample mean vector is

\bar{X} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_p \end{pmatrix},

where X̄j is the sample average on the jth feature. Using matrix algebra, X̄ is defined using a familiar formula:

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.


This mathematical operation is well-defined because vectors are added elementwise.


The sample variances and covariances on the p variables can be grouped together
in a p × p sample variance-covariance matrix S (i.e., p rows and p columns)
 
S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix},

where

s_{ii} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)^2

is the sample variance for the ith feature, and

s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j)

is the sample covariance between the ith and j th features. The subscripts on the
elements in S identify where the element is found in the matrix: sij is stored in the
ith row and the j th column. The variances are found on the main diagonal of the
matrix. The covariances are off-diagonal elements. S is symmetric, meaning that
the elements above the main diagonal are a reflection of the entries below the main
diagonal. More formally, sij = sji .
Matrix algebra allows you to express S using a formula analogous to the sample
variance for a single feature:
S = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})(X_k - \bar{X})'.

Here (Xk − X̄)(Xk − X̄)′ is the matrix product of a column vector with p entries
times a row vector with p entries. This matrix product is a p × p matrix with
(Xki − X̄i )(Xkj − X̄j ) in the ith row and j th column. The matrix products are added
up over all n observations and then divided by n − 1.
The interpretation of covariances is enhanced by standardizing them to give corre-
lations. The sample correlation matrix is denoted by the p × p symmetric matrix
 
R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}.


The ith row and j th column element of R is the correlation between the ith and j th
features. The diagonal elements are one: rii = 1. The off-diagonal elements satisfy
r_{ij} = r_{ji} = \frac{s_{ij}}{\sqrt{s_{ii} s_{jj}}}.

In many applications the data are standardized to have mean 0 and variance 1 on each feature. The data are standardized through the so-called Z-score transformation: (Xki − X̄i)/√sii, which, on each feature, subtracts the mean from each observation and divides by the corresponding standard deviation. The sample variance-covariance matrix for the standardized data is the correlation matrix R for the raw data.
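A brief sketch verifies this numerically (simulated data are used here; any numeric data set would do):
# Sketch: covariance of z-scored data equals the correlation of the raw data
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3) %*% matrix(c(1, 0.5, 0.2,
                                                  0, 1  , 0.4,
                                                  0, 0  , 1  ), ncol = 3, byrow = TRUE)
Z <- scale(X)                                        # z-score each column
all.equal(cov(Z), cor(X), check.attributes = FALSE)  # TRUE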

Example: Let X1 , X2 , and X3 be the reaction times for three visual stimuli named
A, B and C, respectively. Suppose you are given the following summaries based on a
sample of 30 individuals:
 
\bar{X} = \begin{pmatrix} 4 \\ 5 \\ 4.7 \end{pmatrix}, \qquad
S = \begin{pmatrix} 2.26 & 2.18 & 1.63 \\ 2.18 & 2.66 & 1.82 \\ 1.63 & 1.82 & 2.47 \end{pmatrix}, \qquad
R = \begin{pmatrix} 1.00 & 0.89 & 0.69 \\ 0.89 & 1.00 & 0.71 \\ 0.69 & 0.71 & 1.00 \end{pmatrix}.

The average response time on B is 5. The sample variance of response times on


A is 2.26. The sample covariance between response times on A and C is 1.63. The
sample correlation between response times on B and C is 0.71.
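These summaries are easy to check in R; the sketch below rebuilds R from S using only the numbers above.
# Sketch: reconstruct the correlation matrix R from the covariance matrix S
S <- matrix(c(2.26, 2.18, 1.63,
              2.18, 2.66, 1.82,
              1.63, 1.82, 2.47), nrow = 3, byrow = TRUE)
round(cov2cor(S), 2)       # reproduces R: off-diagonals 0.89, 0.69, 0.71
1.82 / sqrt(2.66 * 2.47)   # single entry: correlation between B and C, about 0.71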

12.3 Matrix Notation to Summarize Linear Com-


binations
Matrix algebra is useful for computing sample summaries for linear combinations of the features X′ = (X1, X2, . . . , Xp) from the sample summaries on these features. For example, suppose you define the linear combination

Y1 = a1X1 + a2X2 + · · · + apXp.


Using matrix algebra, Y1 is the matrix product Y1 = a′X, where a′ = (a1, a2, . . . , ap). The sample mean and variance of Y1 are

Ȳ = a1X̄1 + a2X̄2 + · · · + apX̄p = a′X̄

and

s²Y = Σij ai aj sij = a′Sa,

where X̄ and S are the sample mean vector and sample variance-covariance matrix for X′ = (X1, X2, . . . , Xp).
Similarly, the sample covariance between Y1 and

Y2 = b′X = b1X1 + b2X2 + · · · + bpXp

is

sY1,Y2 = Σij ai bj sij = a′Sb = b′Sa.

Example: In the stimuli example, the total reaction time per individual is

Y = [1 1 1] X = X1 + X2 + X3.

The mean reaction time is

Ȳ = [1 1 1] X̄ = 4 + 5 + 4.7 = 13.7.

The variance of Y is the sum of the elements in the variance-covariance matrix:

s²Y = [1 1 1] S [1 1 1]′ = Σij sij = 2.26 + 2.18 + · · · + 2.47 = 18.65.
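A short sketch reproduces these numbers with matrix algebra in R:
# Sketch: mean and variance of Y = X1 + X2 + X3 via a'xbar and a'Sa
a    <- c(1, 1, 1)
xbar <- c(4, 5, 4.7)
S    <- matrix(c(2.26, 2.18, 1.63,
                 2.18, 2.66, 1.82,
                 1.63, 1.82, 2.47), nrow = 3, byrow = TRUE)
sum(a * xbar)      # 13.7
t(a) %*% S %*% a   # 18.65, the sum of all entries of S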



Chapter 13

Principal Component Analysis

Principal component analysis (PCA) is a multivariate technique for understand-


ing variation, and for summarizing measurement data possibly through variable re-
duction. Principal components (the variables created in PCA) are sometimes used
in addition to, or in place of, the original variables in certain analyses. I will illustrate
the use and misuse of principal components in a series of examples.
Given data on p variables or features X1 , X2 , . . ., Xp , PCA uses a rotation of the
original coordinate axes to produce a new set of p uncorrelated variables, called
principal components, that are unit-length linear combinations of the original
variables. A unit-length linear combination a1X1 + a2X2 + · · · + apXp has a1² + a2² + · · · + ap² = 1.
The principal components have the following properties. The first principal
component
PRIN1 = a11 X1 + a12 X2 + · · · + a1p Xp
has the largest variability among all unit-length linear combinations of the original
variables. The second principal component

PRIN2 = a21 X1 + a22 X2 + · · · + a2p Xp

has the largest variability among all unit-length linear combinations of X1 , X2 , . . ., Xp


that are uncorrelated with PRIN1. In general, the j th principal component PRINj for
j = 1, 2, . . . , p, has the largest variability among all unit-length linear combinations
of the features that are uncorrelated with PRIN1, PRIN2, . . . , PRIN(j − 1). The
last or pth principal component PRINp has the smallest variability among all
unit-length linear combinations of the features.
I have described PCA on the raw or unstandardized data. This method is often
called PCA on the sample covariance matrix, because the principal components are
computed numerically using a singular value decomposition of the sample covari-
ance matrix for the Xi s. The variances of the PCs are eigenvalues of the sample


covariance matrix. The coefficients in the PCs are eigenvectors of the sample co-
variance matrix. The sum of the variances of the principal components is equal to
the sum of the variances in the original features. An alternative method for PCA
uses standardized data, which is often called PCA on the correlation matrix.
The ordered principal components are uncorrelated variables with progressively
less variation. Principal components are often viewed as separate dimensions corre-
sponding to the collection of features. The variability of each component divided by
the total variability of the components is the proportion of the total variation in the
data captured by each component. If data reduction is your goal, then you might
need only the first few principal components to capture most of the variability in the
data. This issue will be returned to later.
The unit-length constraint on the coefficients in PCA is needed to make the max-
imization well-defined. Without this constraint, there does not exist a linear combi-
nation with maximum variation. For example, the variability of an arbitrary linear combination a1X1 + a2X2 + · · · + apXp is increased by a factor of 100 when each coefficient is multiplied by 10!
The principal components are unique only up to a change of the sign for each
coefficient. For example,

PRIN1 = 0.2X1 − 0.4X2 + 0.4X3 + 0.8X4

and
PRIN1 = −0.2X1 + 0.4X2 − 0.4X3 − 0.8X4
have the same variability, so either could play the role of the first principal component.
This non-uniqueness does not have an important impact on the analysis.

13.1 Example: Temperature Data


The following temperature example includes mean monthly temperatures in January
and July for 64 U.S. cities.
#### Example: Temperature of cities
## The Temperature data file is in "fixed width format", an older data file format.
## Each field is specified by column ranges.
## Below I've provided numbers to help identify the column numbers
## as well as the first three observations in the dataset.
## 123456789012345678901234
## [ 14 char ][ 5 ][ 5 ]
# mobile 51.2 81.6
# phoenix 51.2 91.2
# little rock 39.5 81.4

fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_temperature.dat"


temp <- read.fwf(fn.data, widths = c(14, 5, 5))


# the city names have trailing white space (we fix this below)
str(temp)
## 'data.frame': 64 obs. of 3 variables:
## $ V1: Factor w/ 64 levels "albany ",..: 39 48 33 56 21 27 64 62 31 36 ...
## $ V2: num 51.2 51.2 39.5 45.1 29.9 24.8 32 35.6 54.6 67.2 ...
## $ V3: num 81.6 91.2 81.4 75.2 73 72.7 75.8 78.7 81 82.3 ...
head(temp)
## V1 V2 V3
## 1 mobile 51.2 81.6
## 2 phoenix 51.2 91.2
## 3 little rock 39.5 81.4
## 4 sacramento 45.1 75.2
## 5 denver 29.9 73.0
## 6 hartford 24.8 72.7
# remove that white space with strip.white=TRUE
temp <- read.fwf(fn.data, widths = c(14, 5, 5), strip.white = TRUE)
# name columns
colnames(temp) <- c("city", "january", "july")
temp$id <- 1:nrow(temp)
str(temp)
## 'data.frame': 64 obs. of 4 variables:
## $ city : Factor w/ 64 levels "albany","albuquerque",..: 39 48 33 56 21 27 64 62 31 36 ...
## $ january: num 51.2 51.2 39.5 45.1 29.9 24.8 32 35.6 54.6 67.2 ...
## $ july : num 81.6 91.2 81.4 75.2 73 72.7 75.8 78.7 81 82.3 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
head(temp)
## city january july id
## 1 mobile 51.2 81.6 1
## 2 phoenix 51.2 91.2 2
## 3 little rock 39.5 81.4 3
## 4 sacramento 45.1 75.2 4
## 5 denver 29.9 73.0 5
## 6 hartford 24.8 72.7 6
# plot original data
library(ggplot2)
p1 <- ggplot(temp, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
# good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = city), vjust = -0.5, alpha = 0.25) # city labels
p1 <- p1 + labs(title = "Mean temperature in Jan and July for selected cities")
print(p1)


[Figure: "Mean temperature in Jan and July for selected cities": scatterplot of july versus january mean temperatures with city labels.]

The princomp() procedure is used for PCA. By default the principal components
are computed based on the covariance matrix. The correlation matrix may also be
used (it effectively z-scores the data first) with the cor = TRUE option. The principal
component scores are the values of the principal components across cases. The
principal component scores PRIN1, PRIN2, . . . , PRINp are centered to have mean
zero.
Output from a PCA on the covariance matrix is given. Two principal components
are created because p = 2.
# perform PCA on covariance matrix
temp.pca <- princomp( ~ january + july, data = temp)
# standard deviation and proportion of variation for each component
summary(temp.pca)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 12.3217642 3.0004557
## Proportion of Variance 0.9440228 0.0559772
## Cumulative Proportion 0.9440228 1.0000000
# coefficients for PCs
loadings(temp.pca)
##
## Loadings:
## Comp.1 Comp.2
## january 0.939 0.343
## july 0.343 -0.939
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5


## Cumulative Var 0.5 1.0


# scores are coordinates of each observation on PC scale
head(temp.pca$scores)
## Comp.1 Comp.2
## 1 20.000106 0.9239612
## 2 23.291460 -8.0941867
## 3 8.940669 -2.8994977
## 4 12.075589 4.8446790
## 5 -2.957414 1.7000283
## 6 -7.851160 0.2333138

PCA is effectively doing a location shift (to the origin, zero) and a rotation of
the data. When the correlation is used for PCA (instead of the covariance), it also
performs a scaling so that the resulting PC scores have unit-variance in all directions.
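A quick sketch makes the shift-and-rotation explicit: the PC scores are the mean-centered data multiplied by the loading matrix.
# Sketch: PC scores = (data - column means) %*% loadings
scores.manual <- scale(temp[, c("january", "july")], center = TRUE, scale = FALSE) %*%
                 temp.pca$loadings
all.equal(as.vector(scores.manual), as.vector(temp.pca$scores))  # TRUE, up to rounding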
# create small data.frame with endpoints of PC lines through data
line.scale <- c(35, 15) # length of PCA lines to draw
# endpoints of lines to draw
temp.pca.line.endpoints <-
data.frame(PC = c(rep("PC1", 2), rep("PC2", 2))
, x = c(temp.pca$center[1] - line.scale[1] * temp.pca$loadings[1, 1]
, temp.pca$center[1] + line.scale[1] * temp.pca$loadings[1, 1]
, temp.pca$center[1] - line.scale[2] * temp.pca$loadings[1, 2]
, temp.pca$center[1] + line.scale[2] * temp.pca$loadings[1, 2])
, y = c(temp.pca$center[2] - line.scale[1] * temp.pca$loadings[2, 1]
, temp.pca$center[2] + line.scale[1] * temp.pca$loadings[2, 1]
, temp.pca$center[2] - line.scale[2] * temp.pca$loadings[2, 2]
, temp.pca$center[2] + line.scale[2] * temp.pca$loadings[2, 2])
)
temp.pca.line.endpoints
## PC x y
## 1 PC1 -0.7833519 63.61121
## 2 PC1 64.9739769 87.61066
## 3 PC2 26.9525727 89.70179
## 4 PC2 37.2380523 61.52008
# plot original data with PCA vectors overlayed
library(ggplot2)
p1 <- ggplot(temp, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
# good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = id), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC1"), aes(x=x, y=y)
, alpha=0.5)
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC2"), aes(x=x, y=y)
, alpha=0.5)
# label lines


p1 <- p1 + annotate("text"
, x = temp.pca.line.endpoints$x[1]
, y = temp.pca.line.endpoints$y[1]
, label = as.character(temp.pca.line.endpoints$PC[1])
, vjust = 0) #, size = 10)
p1 <- p1 + annotate("text"
, x = temp.pca.line.endpoints$x[3]
, y = temp.pca.line.endpoints$y[3]
, label = as.character(temp.pca.line.endpoints$PC[3])
, hjust = 1) #, size = 10)
p1 <- p1 + labs(title = "Mean temperature in Jan and July for selected cities")
print(p1)

[Figure: "Mean temperature in Jan and July for selected cities": the same scatterplot with the PC1 and PC2 directions overlaid.]

# plot PCA scores (data on PC-scale centered at 0)


library(ggplot2)
p2 <- ggplot(as.data.frame(temp.pca$scores), aes(x = Comp.1, y = Comp.2))
p2 <- p2 + geom_point() # points
p2 <- p2 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
# good idea since both are in the same units
p2 <- p2 + geom_text(aes(label = rownames(temp.pca$scores)), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p2 <- p2 + geom_vline(xintercept = 0, alpha=0.5)
p2 <- p2 + geom_hline(yintercept = 0, alpha=0.5)
p2 <- p2 + labs(title = "Same, PC scores")
#print(p2)

# plot PCA scores (data on (negative) PC-scale centered at 0)


library(ggplot2) # negative temp.pca£scores
p3 <- ggplot(as.data.frame(-temp.pca$scores), aes(x = Comp.1, y = Comp.2))
p3 <- p3 + geom_point() # points


p3 <- p3 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis


# good idea since both are in the same units
p3 <- p3 + geom_text(aes(label = rownames(temp.pca$scores)), vjust = -0.5, alpha = 0.25) # city l
# plot PC lines
p3 <- p3 + geom_vline(xintercept = 0, alpha=0.5)
p3 <- p3 + geom_hline(yintercept = 0, alpha=0.5)
p3 <- p3 + labs(title = "Same, but negative PC scores match orientation of original data")
#print(p3)

library(gridExtra)
grid.arrange(grobs = list(p2, p3), ncol=1, top="Temperature data and PC scores")
[Figure: "Temperature data and PC scores": two panels of PC scores plotted as Comp.2 versus Comp.1, titled "Same, PC scores" and "Same, but negative PC scores match orientation of original data".]

Some comments on the output:


1. You can visualize PCA when p = 2. In the temperature plot, the direction of
maximal variability corresponds to the first PC axis. The PRIN1 score for each
city is obtained by projecting the temperature pairs perpendicularly onto this
axis. The direction of minimum variation corresponds to the second PC axis,
which is perpendicular to the first PC axis. The PRIN2 score for each city is
obtained by projecting the temperature pairs onto this axis.
2. The total variance is the sum of variances for the monthly temperatures:


163.38 = 137.18 + 26.20.


# variance of data (on diagonals, covariance of off-diags)
var(temp[,c("january","july")])
## january july
## january 137.1811 46.72910
## july 46.7291 26.20035
# sum of variance
sum(diag(var(temp[,c("january","july")])))
## [1] 163.3814
# variance of PC scores
var(temp.pca$scores)
## Comp.1 Comp.2
## Comp.1 1.542358e+02 -1.831125e-15
## Comp.2 -1.831125e-15 9.145635e+00
# sum is same as original data
sum(diag(var(temp.pca$scores)))
## [1] 163.3814
3. The eigenvalues of the covariance matrix are variances for the PCs. The
variability of
PRIN1 = +0.939 JAN + 0.343 JULY

is 154.236. The variability of

PRIN2 = −0.343 JAN + 0.939 JULY

is 9.146. The proportion of the total variability due to PRIN1 is 0.944 =


154.23/163.38. The proportion of the total variability due to PRIN2 is 0.056 =
9.146/163.38.
# eigenvalues and eigenvectors of covariance matrix give PC variance and loadings
eigen(var(temp[,c("january","july")]))
## eigen() decomposition
## $values
## [1] 154.235808 9.145635
##
## $vectors
## [,1] [,2]
## [1,] -0.9393904 0.3428493
## [2,] -0.3428493 -0.9393904
4. Almost all of the variability (94.4%) in the original temperatures is captured
by the first PC. The second PC accounts for the remaining 5.6% of the total
variability.


5. PRIN1 weights the January temperature about three times the July tempera-
ture. This is sensible because PRIN1 maximizes variation among linear combi-
nations of the January and July temperatures. January temperatures are more
variable, so they are weighted heavier in this linear combination.
6. The PCs PRIN1 and PRIN2 are standardized to have mean zero. This explains
why some PRIN1 scores are negative, even though PRIN1 is a weighted average
of the January and July temperatures, each of which is non-negative.
The built-in plots show the scores and the original data directions (biplot), and the screeplot shows the relative variance proportion of all components in decreasing order.

# a couple built-in plots


par(mfrow=c(1,2))
biplot(temp.pca)
screeplot(temp.pca)

[Figure: biplot of temp.pca with the january and july directions (left) and screeplot of the component variances (right).]

13.2 PCA on Correlation Matrix


The coefficients in the first principal component reflect the relative sizes of the feature
variances. The features with large variances have larger coefficients or loadings. This
might be considered a problem, because variability is scale dependent and the
principal component analysis on the raw data does not take scale into account. For
example, if you measure height in meters but then change height to centimeters, the
variability increases by a factor of 100*100 = 10,000. This might have a dramatic
effect on the PCA.
You might prefer to standardize the features when they are measured on different


scales, or when the features have wildly different variances. The features are stan-
dardized to have mean zero and variance one by using the Z-score transformation:
(Obs − Mean)/Std Dev. The PCA is then performed on the standardized data.
temp.z <- temp
# manual z-score
temp.z$january <- (temp.z$january - mean(temp.z$january)) / sd(temp.z$january)
# z-score using R function scale()
temp.z$july <- scale(temp.z$july)

# the manual z-score and scale() match


all.equal(temp.z$january, as.vector(scale(temp.z$january)))
## [1] TRUE
# scale() includes attributes for the mean() and sd() used for z-scoring
str(temp.z)
## 'data.frame': 64 obs. of 4 variables:
## $ city : Factor w/ 64 levels "albany","albuquerque",..: 39 48 33 56 21 27 64 62 31 36 ...
## $ january: num 1.631 1.631 0.632 1.11 -0.187 ...
## $ july : num [1:64, 1] 1.1701 3.0456 1.131 -0.0803 -0.5101 ...
## ..- attr(*, "scaled:center")= num 75.6
## ..- attr(*, "scaled:scale")= num 5.12
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
head(temp.z)
## city january july id
## 1 mobile 1.6311459 1.17005226 1
## 2 phoenix 1.6311459 3.04555476 2
## 3 little rock 0.6322075 1.13097929 3
## 4 sacramento 1.1103319 -0.08028274 4
## 5 denver -0.1874344 -0.51008540 5
## 6 hartford -0.6228691 -0.56869485 6
# z-scored data has mean 0 and variance 1
colMeans(temp.z[,c("january","july")])
## january july
## 1.228943e-16 -1.214842e-15
var(temp.z[,c("january","july")])
## january july
## january 1.0000000 0.7794472
## july 0.7794472 1.0000000
# the correlation is used to contruct the PCs
# (same as covariance for z-scored data)
cor(temp.z[,c("january","july")])
## january july
## january 1.0000000 0.7794472
## july 0.7794472 1.0000000
## Plot z-scored data
temp.z.pca <- princomp( ~ january + july, data = temp.z)


# create small data.frame with endpoints of PC lines through data


line.scale <- c(3, 3) # length of PCA lines to draw
# endpoints of lines to draw
temp.z.pca.line.endpoints <-
data.frame(PC = c(rep("PC1", 2), rep("PC2", 2))
, x = c(temp.z.pca$center[1] - line.scale[1] * temp.z.pca$loadings[1, 1]
, temp.z.pca$center[1] + line.scale[1] * temp.z.pca$loadings[1, 1]
, temp.z.pca$center[1] - line.scale[2] * temp.z.pca$loadings[1, 2]
, temp.z.pca$center[1] + line.scale[2] * temp.z.pca$loadings[1, 2])
, y = c(temp.z.pca$center[2] - line.scale[1] * temp.z.pca$loadings[2, 1]
, temp.z.pca$center[2] + line.scale[1] * temp.z.pca$loadings[2, 1]
, temp.z.pca$center[2] - line.scale[2] * temp.z.pca$loadings[2, 2]
, temp.z.pca$center[2] + line.scale[2] * temp.z.pca$loadings[2, 2])
)
temp.z.pca.line.endpoints
## PC x y
## 1 PC1 -2.12132 -2.12132
## 2 PC1 2.12132 2.12132
## 3 PC2 -2.12132 2.12132
## 4 PC2 2.12132 -2.12132
# plot original data with PCA vectors overlayed
library(ggplot2)
p1 <- ggplot(temp.z, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
# good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = id), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p1 <- p1 + geom_path(data = subset(temp.z.pca.line.endpoints, PC=="PC1"), aes(x=x, y=y), alpha=0.5)
p1 <- p1 + geom_path(data = subset(temp.z.pca.line.endpoints, PC=="PC2"), aes(x=x, y=y), alpha=0.5)
# label lines
p1 <- p1 + annotate("text"
, x = temp.z.pca.line.endpoints$x[1]
, y = temp.z.pca.line.endpoints$y[1]
, label = as.character(temp.z.pca.line.endpoints$PC[1])
, vjust = 0) #, size = 10)
p1 <- p1 + annotate("text"
, x = temp.z.pca.line.endpoints$x[3]
, y = temp.z.pca.line.endpoints$y[3]
, label = as.character(temp.z.pca.line.endpoints$PC[3])
, hjust = 0) #, size = 10)
p1 <- p1 + labs(title = "Z-score temperature in Jan and July for selected cities")
print(p1)


[Figure: "Z-score temperature in Jan and July for selected cities": scatterplot of the z-scored temperatures with the PC1 and PC2 directions overlaid.]

The covariance matrix computed from the standardized data is the correlation
matrix. Thus, principal components based on the standardized data are computed
from the correlation matrix. This is implemented by adding the cor = TRUE option
on the princomp() procedure statement.
# perform PCA on correlation matrix
temp.pca2 <- princomp( ~ january + july, data = temp, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(temp.pca2)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.3339592 0.4696305
## Proportion of Variance 0.8897236 0.1102764
## Cumulative Proportion 0.8897236 1.0000000
# coefficients for PCs
loadings(temp.pca2)
##
## Loadings:
## Comp.1 Comp.2
## january 0.707 0.707
## july 0.707 -0.707


##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
# scores are coordinates of each observation on PC scale
head(temp.pca2$scores)
## Comp.1 Comp.2
## 1 1.9964045 0.3286199
## 2 3.3330689 -1.0080444
## 3 1.2566173 -0.3554730
## 4 0.7341125 0.8485470
## 5 -0.4971200 0.2299523
## 6 -0.8492236 -0.0386098
This plot is the same except for the top/right scale around the biplot and the
variance scale on the screeplot.
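A sketch confirming the equivalence: the loadings agree (up to sign, which is arbitrary), and the proportions of variance agree; only the reported standard deviations differ slightly because of the n versus n − 1 divisor used when z-scoring manually.
# Sketch: PCA on z-scored data vs. PCA with cor = TRUE
all.equal(abs(unclass(loadings(temp.z.pca))), abs(unclass(loadings(temp.pca2)))
          , check.attributes = FALSE)
summary(temp.z.pca)   # proportions of variance match summary(temp.pca2) above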
# a couple built-in plots
par(mfrow=c(1,2))
biplot(temp.z.pca)
screeplot(temp.z.pca)

[Figure: biplot of temp.z.pca (left) and screeplot of the component variances (right).]

The standardized features are dimensionless, so the PCs are not influenced by the
original units of measure, nor are they affected by the variability in the features. The
only important factor is the correlation between the features, which is not changed
by standardization.
The PCs from the correlation matrix are

PRIN1 = +0.707 JAN + 0.707 JULY


and
PRIN2 = −0.707 JAN + 0.707 JULY.
PCA is an exploratory tool, so neither a PCA on the covariance matrix nor a
PCA on the correlation matrix is always the “right” method. I often do both and see
which analysis is more informative.

13.3 Interpreting Principal Components


You should try to interpret the linear combinations created by multivariate tech-
niques. The interpretations are usually non-trivial, and some degree of creativity is
involved.
The coefficients or loadings in a principal component reflect the relative con-
tribution of the features to the linear combination. Most researchers focus more on
the signs of the coefficents than on the magnitude of the coefficients. The principal
components are then interpreted as weighted averages or comparisons of weighted
averages.
A weighted average is a linear combination of features with non-negative load-
ings. The simplest case of a weighted average is an arithmetic average, where each
feature has the same coefficient.
The difference Z = X − Y is a comparison of X and Y . The sign and magnitude
of Z indicates which of X and Y is larger, and by how much. To see this, note that
Z = 0 if and only if X = Y , whereas Z < 0 when X < Y and Z > 0 when X > Y .
In the temperature data, PRIN1 is a weighted average of January and July tem-
peratures (signs of the loadings: JAN is + and JULY is +):

PRIN1 = +0.94 JAN + 0.34 JULY.

PRIN2 is a comparison of January and July temperatures (signs of the loadings: JAN
is − and JULY is +):

PRIN2 = −0.34 JAN + 0.94 JULY.

Principal components often have positive and negative loadings when p ≥ 3. To


interpret the components, group the features with + and − signs together and then
interpret the linear combination as a comparison of weighted averages.
You can often simplify the interpretation of principal components by mentally
eliminating features from the linear combination that have relatively small (in mag-
nitude) loadings or coefficients. This strategy does not carry over to all multivariate
analyses, so I will be careful about this issue when necessary.


13.4 Example: Painted turtle shells


Jolicouer and Mosimann gave the length, width, and height in mm of the carapace
(shell) for a sample of 24 female painted turtles. I perform a PCA on the original
data and on the standardized data.
#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_shells.dat"
shells <- read.table(fn.data, header = TRUE)
str(shells)
## 'data.frame': 24 obs. of 3 variables:
## $ length: int 98 103 103 105 109 123 123 133 133 133 ...
## $ width : int 81 84 86 86 88 92 95 99 102 102 ...
## $ height: int 38 38 42 42 44 50 46 51 51 51 ...
head(shells)
## length width height
## 1 98 81 38
## 2 103 84 38
## 3 103 86 42
## 4 105 86 42
## 5 109 88 44
## 6 123 92 50
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# put scatterplots on top so y axis is vertical
p <- ggpairs(shells, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)

# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

## 3D scatterplot
library(scatterplot3d)
par(mfrow=c(1,1))
with(shells, {
scatterplot3d(x=length
, y=width
, z=height
, main="Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color="blue", pch=19, # filled blue circles
#, highlight.3d = TRUE # makes color change with z-axis value


)
})

#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height) })
[Figure: scatterplot matrix of length, width, and height (pairwise correlations: length-width 0.973, length-height 0.971, width-height 0.966) and a 3D scatterplot, "Shells 3D Scatterplot".]

13.4.1 PCA on shells covariance matrix


The plots show that the shell measurements are strongly positively correlated, which
is not surprising. Let us perform a PCA on the covariance matrix and interpret the
results.
# perform PCA on covariance matrix
shells.pca <- princomp( ~ length + width + height, data = shells)
# standard deviation and proportion of variation for each component
summary(shells.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 25.4970668 2.547081962 1.653745717
## Proportion of Variance 0.9860122 0.009839832 0.004148005
## Cumulative Proportion 0.9860122 0.995851995 1.000000000
# coefficients for PCs
loadings(shells.pca)
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## length 0.814 0.555 0.172


## width 0.496 -0.818 0.291


## height 0.302 -0.151 -0.941
##
## Comp.1 Comp.2 Comp.3
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.333 0.333 0.333
## Cumulative Var 0.333 0.667 1.000
The three principal components from the raw data are given below. Length and
width are grouped in PRIN3 because they have negative loadings.

PRIN1 = 0.81 Length + 0.50 Width + 0.30 Height


PRIN2 = −0.55 Length + (0.82 Width + 0.15 Height)
PRIN3 = −(0.17 Length + 0.29 Width) + 0.94 Height.

PRIN1 is a weighted average of the carapace measurements, and can be viewed as an overall measure of shell size. Jolicoeur and Mosimann interpreted the second and third principal components as measures of shape, for they appear to be a comparison of length with an average of width and height, and a comparison of height with length and width, respectively.

Jolicoeur and Mosimann argue that the size of female turtle shells can be characterized by PRIN1 with little loss of information because this linear combination accounts for 98.6% of the total variability in the measurements. The form of PRIN1 makes sense conceptually. The carapace measurements are positively correlated with each other, so larger lengths tend to occur with larger widths and heights. The primary way the shells vary is in their overall size, as measured by a weighted average of length, width, and height.

Question: Can PRIN2 and PRIN3, which have relatively little variation, be used in any meaningful way? To think about this, suppose the variability in PRIN2 and PRIN3 were zero.
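As a numerical aid to this question, here is a minimal sketch (not part of the original analysis) that pretends PRIN2 and PRIN3 have no variability: it reconstructs the three measurements from the PRIN1 scores alone, using the covariance-based shells.pca fit above.
# sketch: reconstruct the data from PRIN1 only
scores1 <- shells.pca$scores[, 1]        # PRIN1 scores (computed from centered data)
load1   <- shells.pca$loadings[, 1]      # PRIN1 coefficients
recon1  <- sweep(outer(scores1, load1), 2, shells.pca$center, "+")
head(round(recon1, 1))                   # compare with head(shells)
round(apply(shells - recon1, 2, sd), 2)  # spread of the reconstruction errors is small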

13.4.2 PCA on shells correlation matrix


For the analysis on the correlation matrix, add the cor = TRUE option.
# perform PCA on correlation matrix
shells.pca <- princomp( ~ length + width + height, data = shells, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(shells.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.714584 0.1853043 0.160820482
## Proportion of Variance 0.979933 0.0114459 0.008621076


## Cumulative Proportion 0.979933 0.9913789 1.000000000


# coefficients for PCs
loadings(shells.pca)
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## length 0.578 0.137 0.804
## width 0.577 0.628 -0.522
## height 0.577 -0.766 -0.284
##
## Comp.1 Comp.2 Comp.3
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.333 0.333 0.333
## Cumulative Var 0.333 0.667 1.000
The three principal components for the standardized data are

PRIN1 = 0.58 Length + 0.58 Width + 0.58 Height
PRIN2 = −(0.14 Length + 0.63 Width) + 0.77 Height
PRIN3 = −0.80 Length + (0.52 Width + 0.28 Height).

The first principal component accounts for 98% of the total variability in the standardized data. For a correlation-based PCA the total variability is always p, the number of features, because each standardized feature has variance 1 and the total is the sum of these variances; here p = 3. Little information is lost by summarizing the standardized data using PRIN1, which is essentially an average of length, width, and height. PRIN2 and PRIN3 are measures of shape. The loadings in the first principal component are approximately equal because the correlations between pairs of features are almost identical. The standardized features are essentially interchangeable in the construction of the first principal component, so they must be weighted similarly. True, but not obvious.
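As a check on these claims, here is a minimal sketch (not output shown in the notes): the correlation-matrix PCA is the covariance-matrix PCA of the standardized features, and the component variances from the cor = TRUE fit sum to p = 3.
# sketch: correlation PCA equals covariance PCA of the standardized data
shells.std <- as.data.frame(scale(shells))       # standardize each feature
pca.std    <- princomp( ~ length + width + height, data = shells.std)
round(pca.std$sdev^2 / sum(pca.std$sdev^2), 3)   # same proportions as the cor = TRUE fit
sum(shells.pca$sdev^2)                           # total variance equals 3, the number of features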

13.5 Why is PCA a Sensible Variable Reduction Technique?
My description of PCA provides little insight into why principal components are
reasonable summaries for variable reduction. Specifically, why does it make sense
for researchers to consider the linear combination of the original features with the
largest variance as the best single variable summary of the data?
There is an alternative description of PCA that provides more insight into this
issue. For simplicity, consider the following data plot of two features, and the implied
principal components. The PC scores for an observation are obtained by projecting


the feature scores onto the axes of maximal and minimal variation, and then rotating
the axes appropriately.
One can show mathematically that PRIN1 is the best (in some sense) linear com-
bination of the two features to predict the original two features simultaneously. In-
tuitively, this is plausible. In a PCA, you know the direction for the axis of maximal
variation. Given the value of PRIN1, you get a good prediction of the original feature
scores by moving PRIN1 units along the axis of maximal variation in the feature
space.
[Figure: left panel shows Feature1 vs Feature2 with the PRIN1 and PRIN2 axes overlaid; right panel shows the same observations plotted on the rotated PRIN1 and PRIN2 axes.]

The LS line from regressing feature 2 on feature 1 gives the best prediction for
feature 2 scores when the feature 1 score is known. Similarly, the LS line from
regressing feature 1 on feature 2 gives the best prediction for feature 1 scores when
the feature 2 score is known. PRIN1 is the best linear combination of features 1 and
2 to predict both features simultaneously. Note that feature 1 and feature 2 are linear
combinations as well!
This idea generalizes. The first k principal components give the best simultaneous
prediction of the original p features, among all possible choices of k uncorrelated
unit-length linear combinations of the features. Prediction of the original features
improves as additional components are added, but the improvement is slight when
the added principal components have little variability. Thus, summarizing the data
using the principal components with maximum variation is a sensible strategy for
data reduction.
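A small simulated sketch of the two-feature case (arbitrary data and seed, not from the notes): the slope of the PRIN1 axis is a compromise between the two least-squares regression slopes.
# sketch: compare the two LS lines with the PRIN1 direction for two correlated features
set.seed(1)
f1  <- rnorm(100)
f2  <- 0.8 * f1 + rnorm(100, sd = 0.5)
dat <- data.frame(f1, f2)
coef(lm(f2 ~ f1, data = dat))[2]      # slope of the LS line predicting f2 from f1
1 / coef(lm(f1 ~ f2, data = dat))[2]  # slope, in the (f1, f2) plane, of the LS line predicting f1 from f2
dat.pca <- princomp(dat)
dat.pca$loadings[2, 1] / dat.pca$loadings[1, 1]  # slope of the PRIN1 axis; here it falls between the two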


13.5.1 A Warning on Using PCA as a Variable Reduction Technique
Some researchers view PCA as a “catchall” technique for reducing the number of
variables that need to be considered in an analysis. They will replace the original
variables with a small number of principal components that explain most of the vari-
ation in the original data and proceed with an analysis on the principal components.
This strategy is not always sensible, especially if a primary interest in the anal-
ysis is a comparison of heterogeneous groups. If the group structure was ignored
in the PCA analysis, then the linear combinations retained by the researcher may
contain little information for distinguishing among groups. For example, consider
the following data plot on two features and two groups, and the implied principal
components.
[Figure: two features measured on two groups, with the PRIN1 and PRIN2 axes overlaid (left) and the same data plotted on the PRIN1 and PRIN2 axes (right); the groups separate along PRIN2, not PRIN1.]

Although PRIN1 explains most of the variability in the two features (ignoring
the groups), little of the total variation is due to group differences. If the researcher
reduced the two features to the first principal component, he would be throwing away
most of the information for distinguishing between the groups. PRIN2 accounts for
little of the total variation in the features, but most of the variation in PRIN2 is due
to group differences.
If a comparison of the two groups was the primary interest, then the researcher
should use discriminant analysis instead. Although there is little gained by reduc-
ing two variables to one, this principle always applies in multivariate problems. In


discriminant analysis, a stepwise selection of variables can be implemented to eliminate features that have no information for distinguishing among groups. This is data reduction as it should be practiced — with a final goal in mind.

Variable reduction using PCA followed by group comparisons might be fruitful if you are fortunate enough to have the directions with large variation correspond to
the directions of group differences. For example, in the plot below, the first principal
component is a linear combination of the features that distinguishes between the
groups. A comparison of the groups based on the first principal component will lead
to similar conclusions as a discriminant analysis.
[Figure: two features and two groups where the group difference lies along the direction of maximal variation; the PRIN1 axis separates the groups in both the original feature plot (left) and the rotated PRIN1 vs PRIN2 plot (right).]

There is nothing unreasonable about using principal components in a comparison of groups, provided you recognize that the principal component scores with the largest variability need not be informative for group comparisons!

In summary, PCA should be used to summarize the variation within a homogeneous group, and should not, in general, be used as a data reduction tool prior to
a comparison across groups. The same concern applies to using PCA for identifying
groups of similar objects (use cluster analysis instead), or when factor analysis is
used prior to a group comparison.
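A small simulated sketch of this warning (arbitrary numbers, not data from the notes): the groups differ only along a low-variance direction, so PC1 carries essentially no group information while PC2 separates the groups.
# sketch: groups separated along a direction with little total variability
set.seed(2)
n   <- 50
grp <- rep(c(1, 2), each = n)
x1  <- rnorm(2 * n, mean = 0, sd = 3)                              # large variation, common to both groups
x2  <- rnorm(2 * n, mean = 0, sd = 0.5) + ifelse(grp == 1, -1, 1)  # small variation plus a group shift
sim.pca <- princomp(data.frame(x1, x2))
summary(sim.pca)                        # PC1 explains most of the total variance
tapply(sim.pca$scores[, 1], grp, mean)  # group means nearly equal on PC1
tapply(sim.pca$scores[, 2], grp, mean)  # groups clearly separated on PC2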


13.5.2 PCA is Used for Multivariate Outlier Detection

An outlier in a multidimensional data set has atypical readings on one or on several features simultaneously. These observations can often be found in univariate plots of the leading principal component scores, even when they are not extreme in any individual feature.
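A simulated sketch of the idea (arbitrary data, not from the notes): a univariate summary of the PC scores flags an observation that is not extreme in either feature on its own; here the low-variance second component exposes it, because the point breaks the correlation structure.
# sketch: an outlier that is unusual only as a combination of features
set.seed(3)
z  <- rnorm(50)
f1 <- c(z + rnorm(50, sd = 0.2), 1.5)   # last value is the planted outlier
f2 <- c(z + rnorm(50, sd = 0.2), -1.5)  # not extreme marginally, but f1 and f2 disagree
out.pca <- princomp(data.frame(f1, f2))
round(tail(out.pca$scores, 3), 2)       # the last observation has an extreme Comp.2 score
boxplot(out.pca$scores[, 2], main = "Comp.2 scores")  # it stands out in a univariate plot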
[Figure: a two-feature data cloud with the PRIN1 and PRIN2 axes overlaid (left) and the same observations plotted on the PRIN1 and PRIN2 axes (right).]

13.6 Example: Sparrows, for Class Discussion

After a severe storm in 1898, a number of sparrows were taken to the biological laboratory at the University of Rhode Island. H. Bumpus1 measured several morphological characteristics on each bird. The data here correspond to five measurements on a sample of 49 females. The measurements are the total length, alar extent, beak-head length, humerus length, and length of keel of sternum.

1
Bumpus, Hermon C. 1898. Eleventh lecture. The elimination of the unfit as illustrated by the
introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol.
Lectures: Woods Hole Marine Biological Laboratory, 209–225.


http://media-2.web.britannica.com/eb-media/46/51946-004-D003BC49.gif
Let us look at the output, paying careful attention to the interpretations of the
principal components (zeroing out small loadings). How many components seem
sufficient to capture the total variation in the morphological measurements?
#### Example: Sparrows
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_sparrows.dat"
sparrows <- read.table(fn.data, header = TRUE)
str(sparrows)
## 'data.frame': 49 obs. of 5 variables:
## $ Total : int 156 153 155 157 164 158 161 157 158 155 ...
## $ Alar : int 245 240 243 238 248 240 246 235 244 236 ...
## $ BeakHead: num 31.6 31 31.5 30.9 32.7 31.3 32.3 31.5 31.4 30.3 ...
## $ Humerus : num 18.5 18.4 18.6 18.4 19.1 18.6 19.3 18.1 18.5 18.5 ...
## $ Keel : num 20.5 20.6 20.3 20.2 21.2 22 21.8 19.8 21.6 20.1 ...
head(sparrows)
## Total Alar BeakHead Humerus Keel
## 1 156 245 31.6 18.5 20.5
## 2 153 240 31.0 18.4 20.6
## 3 155 243 31.5 18.6 20.3
## 4 157 238 30.9 18.4 20.2
## 5 164 248 32.7 19.1 21.2
## 6 158 240 31.3 18.6 22.0
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)


# put scatterplots on top so y axis is vertical


p <- ggpairs(sparrows, upper = list(continuous = "points")
, lower = list(continuous = "cor")
, progress=FALSE
)
print(p)

# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of Total, Alar, BeakHead, Humerus, and Keel. Pairwise correlations: Total-Alar 0.735, Total-BeakHead 0.662, Total-Humerus 0.645, Total-Keel 0.608, Alar-BeakHead 0.674, Alar-Humerus 0.769, Alar-Keel 0.531, BeakHead-Humerus 0.763, BeakHead-Keel 0.529, Humerus-Keel 0.609.]


# perform PCA on covariance matrix


sparrows.pca <- princomp( ~ Total + Alar + BeakHead + Humerus + Keel
, data = sparrows)
# standard deviation and proportion of variation for each component
summary(sparrows.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 5.8828991 2.1280701 0.78468836 0.552957190
## Proportion of Variance 0.8623099 0.1128372 0.01534175 0.007618397
## Cumulative Proportion 0.8623099 0.9751472 0.99048891 0.998107311
## Comp.5
## Standard deviation 0.275612798
## Proportion of Variance 0.001892689
## Cumulative Proportion 1.000000000
# coefficients for PCs
# loadings(sparrows.pca) # print method for loadings() uses cutoff = 0.1 by default
print(loadings(sparrows.pca), cutoff = 0) # to show all values
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Total 0.536 0.828 0.157 0.039 0.018
## Alar 0.829 -0.551 0.058 0.069 -0.040
## BeakHead 0.096 0.034 -0.241 -0.897 -0.357
## Humerus 0.074 -0.015 -0.205 -0.306 0.927
## Keel 0.101 0.100 -0.934 0.310 -0.110
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.2 0.2 0.2 0.2 0.2
## Cumulative Var 0.2 0.4 0.6 0.8 1.0
# perform PCA on correlation matrix
sparrows.pca <- princomp( ~ Total + Alar + BeakHead + Humerus + Keel
, data = sparrows, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(sparrows.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.9025587 0.7267974 0.62139498 0.54902221
## Proportion of Variance 0.7239459 0.1056469 0.07722634 0.06028508
## Cumulative Proportion 0.7239459 0.8295928 0.90681917 0.96710425
## Comp.5
## Standard deviation 0.40555980
## Proportion of Variance 0.03289575
## Cumulative Proportion 1.00000000
# coefficients for PCs
#loadings(sparrows.pca)
print(loadings(sparrows.pca), cutoff = 0) # to show all values


##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Total 0.452 0.058 0.689 0.422 0.375
## Alar 0.461 -0.301 0.345 -0.545 -0.530
## BeakHead 0.450 -0.326 -0.453 0.607 -0.342
## Humerus 0.470 -0.189 -0.409 -0.390 0.651
## Keel 0.399 0.874 -0.184 -0.073 -0.194
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.2 0.2 0.2 0.2 0.2
## Cumulative Var 0.2 0.4 0.6 0.8 1.0
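To sharpen the interpretation, it can help to raise the print cutoff so that small loadings are displayed as blanks (effectively zeroed out); the 0.3 cutoff below is an arbitrary illustrative choice, not a value used in the notes.
# show only loadings larger than 0.3 in absolute value
print(loadings(sparrows.pca), cutoff = 0.3)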
# a couple built-in plots
par(mfrow=c(1,2))
biplot(sparrows.pca)
screeplot(sparrows.pca)

[Figure: biplot of the first two principal components (left) and scree plot of the component variances (right) for the correlation-based sparrows PCA.]

13.7 PCA for Variable Reduction in Regression


I will outline an analysis where PCA was used to create predictors in a regression
model. The data were selected from the Berkeley Guidance Study, a longitudinal
monitoring of children born in Berkeley, California, between 1928 and 1929. The
variables selected from the study are
ID an identification number,
WT2 weight at age 2 in kg,


HT2 height at age 2 in cm,


WT9 weight at age 9,
HT9 height at age 9,
LG9 leg circumference at age 9 in cm,
ST9 a composite measure of strength at age 9 (higher is stronger),
WT18 weight at age 18,
HT18 height at age 18,
LG18 leg circumference at age 18,
ST18 a composite measure of strength at age 18, and
SOMA somatotype on a 7-point scale, as a measure of fatness (1=slender to 7=fat)
determined using a photo taken at age 18.
Data on 26 boys are given below.
We are interested in building a regression model to predict somatotype (SOMA)
from the other variables.
#### Example: BGS (Berkeley Guidance Study)
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_bgs.dat"
bgs <- read.table(fn.data, header = TRUE)
str(bgs)
## 'data.frame': 26 obs. of 12 variables:
## $ ID : int 201 202 203 204 205 206 207 209 210 211 ...
## $ WT2 : num 13.6 12.7 12.6 14.8 12.7 11.9 11.5 13.2 16.9 12.7 ...
## $ HT2 : num 90.2 91.4 86.4 87.6 86.7 88.1 82.2 83.8 91 87.4 ...
## $ WT9 : num 41.5 31 30.1 34.1 24.5 29.8 26 30.1 37.9 27 ...
## $ HT9 : num 139 144 136 135 129 ...
## $ LG9 : num 31.6 26 26.6 28.2 24.2 26.7 26.5 27.6 29 26 ...
## $ ST9 : int 74 73 64 75 63 77 45 70 61 74 ...
## $ WT18: num 110.2 79.4 76.3 74.5 55.7 ...
## $ HT18: num 179 195 184 179 172 ...
## $ LG18: num 44.1 36.1 36.9 37.3 31 37 39.1 37.3 33.9 33.3 ...
## $ ST18: int 226 252 216 220 200 215 152 189 183 193 ...
## $ SOMA: num 7 4 6 3 1.5 3 6 4 3 3 ...
head(bgs)
## ID WT2 HT2 WT9 HT9 LG9 ST9 WT18 HT18 LG18 ST18 SOMA
## 1 201 13.6 90.2 41.5 139.4 31.6 74 110.2 179.0 44.1 226 7.0
## 2 202 12.7 91.4 31.0 144.3 26.0 73 79.4 195.1 36.1 252 4.0
## 3 203 12.6 86.4 30.1 136.5 26.6 64 76.3 183.7 36.9 216 6.0
## 4 204 14.8 87.6 34.1 135.4 28.2 75 74.5 178.7 37.3 220 3.0
## 5 205 12.7 86.7 24.5 128.9 24.2 63 55.7 171.5 31.0 200 1.5
## 6 206 11.9 88.1 29.8 136.0 26.7 77 68.2 181.8 37.0 215 3.0
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(bgs, progress=FALSE)
print(p)


# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of all 12 BGS variables with pairwise correlations printed in the lower panels; the same correlations appear in the correlation plot below.]

As an aside, there are other ways to visualize the linear relationships more quickly. The ellipse library has a function plotcorr(), though its output is less than ideal. An improvement has been made with an updated version2 of the plotcorr() function.
## my.plotcorr example, see associated R file for my.plotcorr() function code

2
http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/


# calculate correlations to plot


corr.bgs <- cor(bgs)

# Change the column and row names for clarity


colnames(corr.bgs) = colnames(bgs)
rownames(corr.bgs) = colnames(corr.bgs)

# set colors to use (blue positive, red negative)


colsc=c(rgb(241, 54, 23, maxColorValue=255), 'white', rgb(0, 61, 104, maxColorValue=255))
colramp = colorRampPalette(colsc, space='Lab')
colors = colramp(100)

# plot correlations, colored ellipses on lower diagonal, numerical correlations on upper


my.plotcorr(corr.bgs, col=colors[((corr.bgs + 1)/2) * 100]
, diag='ellipse', upper.panel="number", mar=c(0,2,0,0), main='Correlations')


[Figure: "Correlations" plot: colored ellipses below the diagonal, numerical correlations above. The correlations shown are:]

          WT2   HT2   WT9   HT9   LG9   ST9  WT18  HT18  LG18  ST18  SOMA
ID       0.09  0.29 -0.08  0.09 -0.07  0.12 -0.30  0.03 -0.24 -0.05 -0.36
WT2            0.50  0.58  0.38  0.58  0.35  0.22  0.34  0.20  0.26 -0.28
HT2                  0.53  0.78  0.28  0.36  0.32  0.68  0.11  0.23 -0.12
WT9                        0.62  0.91  0.54  0.71  0.38  0.58  0.36  0.16
HT9                              0.35  0.36  0.27  0.86  0.04  0.18 -0.14
LG9                                    0.52  0.64  0.11  0.66  0.34  0.17
ST9                                          0.19  0.16  0.21  0.66 -0.33
WT18                                               0.18  0.90  0.34  0.60
HT18                                                    -0.03  0.19 -0.10
LG18                                                           0.28  0.61
ST18                                                                -0.23

It is reasonable to expect that the characteristics measured over time, for example
HT2, HT9, and HT18 are strongly correlated. Evidence supporting this hypothesis
is given in the following output, which summarizes correlations within subsets of
the predictors. Two of the three subsets include measures over time on the same
characteristic.
cor(bgs[,c("WT2", "WT9", "WT18")])
## WT2 WT9 WT18
## WT2 1.0000000 0.5792217 0.2158735
## WT9 0.5792217 1.0000000 0.7089029
## WT18 0.2158735 0.7089029 1.0000000
cor(bgs[,c("HT2", "HT9", "HT18")])


## HT2 HT9 HT18


## HT2 1.0000000 0.7758332 0.6847144
## HT9 0.7758332 1.0000000 0.8644624
## HT18 0.6847144 0.8644624 1.0000000
cor(bgs[,c("LG9", "LG18", "ST9", "ST18")])
## LG9 LG18 ST9 ST18
## LG9 1.0000000 0.6644753 0.5239476 0.3440756
## LG18 0.6644753 1.0000000 0.2085289 0.2845414
## ST9 0.5239476 0.2085289 1.0000000 0.6595947
## ST18 0.3440756 0.2845414 0.6595947 1.0000000
library(ggplot2)
p1 <- ggplot(bgs, aes(x = WT2 , y = WT9 )) + geom_point()
p2 <- ggplot(bgs, aes(x = WT2 , y = WT18)) + geom_point()
p3 <- ggplot(bgs, aes(x = WT9 , y = WT18)) + geom_point()
p4 <- ggplot(bgs, aes(x = HT2 , y = HT9 )) + geom_point()
p5 <- ggplot(bgs, aes(x = HT2 , y = HT18)) + geom_point()
p6 <- ggplot(bgs, aes(x = HT9 , y = HT18)) + geom_point()
p7 <- ggplot(bgs, aes(x = LG9 , y = LG18)) + geom_point()
p8 <- ggplot(bgs, aes(x = ST9 , y = ST18)) + geom_point()

library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6, p7, p8), ncol=3, top="Selected BGS variables")


[Figure: "Selected BGS variables" scatterplots of WT9 vs WT2, WT18 vs WT2, WT18 vs WT9, HT9 vs HT2, HT18 vs HT2, HT18 vs HT9, LG18 vs LG9, and ST18 vs ST9.]

Strong correlation among predictors can cause collinearity problems in regression. The presence of collinearity makes the interpretation of regression effects more difficult, and can wreak havoc with the numerical stability of certain algorithms used to compute least squares summaries. A natural way to avoid collinearity and improve interpretation of regression effects is to use uncorrelated linear combinations of the original predictors, such as principal components, to build a model.
The interpretation of regression effects changes when linear combinations of pre-
dictors are used in place of the original predictors. However, two models will be
equivalent in the sense of giving identical fitted values when p linearly independent
combinations of p predictors are used. The fitted model can be expressed in terms of


the original predictors, or in terms of the linear combinations. Similarly, standardizing the original predictors does not change the significance of individual regression effects nor does it change the interpretation of the model.
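A brief sketch of the equivalence claim (not output from the notes): regressing SOMA on the three weight variables, or on their three principal component scores, gives identical fitted values, because the scores are a full-rank linear transformation of the centered predictors.
# sketch: identical fits from the original predictors and from their PC scores
wt.pca   <- princomp( ~ WT2 + WT9 + WT18, data = bgs)
fit.orig <- lm(SOMA ~ WT2 + WT9 + WT18, data = bgs)
fit.pc   <- lm(bgs$SOMA ~ wt.pca$scores)
all.equal(unname(fitted(fit.orig)), unname(fitted(fit.pc)))  # TRUE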
A reasonable strategy with the Berkeley data might be to find principal compo-
nents for the three height variables separately, the three weights separately, and so
on. It may not make sense to combine the four different types of measures (HT,
WT, ST, and LG) together because the resulting linear combinations would likely be
uninterpretable.
Output from a PCA on the four sets of standardized measures is given below. Two
sets (ST9, ST18) and (LG9, LG18) have two predictors. The PCs on the strength
and leg measures are essentially the sum and difference between the two standardized
features. The loadings have magnitude 0.707 = (1/2)1/2 to satisfy the unit-length
restriction.
Here are some comments on how I might use the PCA to build a regression model
to predict somatotype. First, I would not necessarily use the given PCs, but would instead use interpretable linear combinations that were nearly identical to the PCs. For example, the first principal component of the weights is roughly an unweighted average of the weights at ages 2, 9, and 18. I would use (WT2+WT9+WT18)/3
instead. The overall sum of squared loadings is not important in the regression
analysis. Only the relative sizes are important. Following this idea, what linear
combinations of the heights are reasonable?
Second, you should not assume that principal components with low variance are
unimportant for predicting somatotype. Recall our earlier discussion on potential
problems with ignoring low variability components when group comparisons are the
primary interest. The same problem is possible here.
Feel free to explore these ideas at your leisure.
# WT
bgsWT.pca <- princomp( ~ WT2 + WT9 + WT18, data = bgs, cor = TRUE)
summary(bgsWT.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.4241276 0.8882456 0.42764499
## Proportion of Variance 0.6760465 0.2629934 0.06096008
## Cumulative Proportion 0.6760465 0.9390399 1.00000000
print(loadings(bgsWT.pca), cutoff = 0)
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## WT2 0.492 0.781 0.384
## WT9 0.665 -0.053 -0.745
## WT18 0.562 -0.622 0.545
##


## Comp.1 Comp.2 Comp.3


## SS loadings 1.000 1.000 1.000
## Proportion Var 0.333 0.333 0.333
## Cumulative Var 0.333 0.667 1.000
# HT
bgsHT.pca <- princomp( ~ HT2 + HT9 + HT18, data = bgs, cor = TRUE)
summary(bgsHT.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.5975991 0.5723653 0.34651866
## Proportion of Variance 0.8507743 0.1092007 0.04002506
## Cumulative Proportion 0.8507743 0.9599749 1.00000000
print(loadings(bgsHT.pca), cutoff = 0)
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## HT2 0.554 0.800 0.231
## HT9 0.599 -0.190 -0.778
## HT18 0.578 -0.570 0.584
##
## Comp.1 Comp.2 Comp.3
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.333 0.333 0.333
## Cumulative Var 0.333 0.667 1.000
# LG
bgsLG.pca <- princomp( ~ LG9 + LG18, data = bgs, cor = TRUE)
summary(bgsLG.pca)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.2901455 0.5792449
## Proportion of Variance 0.8322377 0.1677623
## Cumulative Proportion 0.8322377 1.0000000
print(loadings(bgsLG.pca), cutoff = 0)
##
## Loadings:
## Comp.1 Comp.2
## LG9 0.707 0.707
## LG18 0.707 -0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
# ST
bgsST.pca <- princomp( ~ ST9 + ST18, data = bgs, cor = TRUE)
summary(bgsST.pca)


## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.2882526 0.5834426
## Proportion of Variance 0.8297974 0.1702026
## Cumulative Proportion 0.8297974 1.0000000
print(loadings(bgsST.pca), cutoff = 0)
##
## Loadings:
## Comp.1 Comp.2
## ST9 0.707 0.707
## ST18 0.707 -0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
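As a sketch of where this strategy could lead (the model below is illustrative, not the analysis in the notes, and uses simple raw averages in place of the standardized first principal components): regress SOMA on an interpretable summary of each set of measures.
# sketch: regression of somatotype on interpretable averages of each measurement set
bgs$WTave <- (bgs$WT2 + bgs$WT9 + bgs$WT18) / 3
bgs$HTave <- (bgs$HT2 + bgs$HT9 + bgs$HT18) / 3
bgs$LGave <- (bgs$LG9 + bgs$LG18) / 2
bgs$STave <- (bgs$ST9 + bgs$ST18) / 2
lm.soma <- lm(SOMA ~ WTave + HTave + LGave + STave, data = bgs)
summary(lm.soma)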



Chapter 14

Cluster Analysis

14.1 Introduction
Cluster analysis is an exploratory tool for locating and grouping observations that
are similar to each other across features. Cluster analysis can also be used to group
variables that are similar across observations.
Clustering or grouping is distinct from discriminant analysis and classification. In
discrimination problems there are a given number of known groups to compare or
distinguish. The aim in cluster analysis is to define groups based on similarities. The
clusters are then examined for underlying characteristics that might help explain the
grouping.
There are a variety of clustering algorithms1 . I will discuss a simple (agglom-
erative) hierarchical clustering method for grouping observations. The method
begins with each observation as an individual cluster or group. The two most similar
observations are then grouped, giving one cluster with two observations. The remain-
ing clusters have one observation. The clusters are then joined sequentially until one
cluster is left.

14.1.1 Illustration
To illustrate the steps, suppose eight observations are collected on two features X1
and X2 . A plot of the data is given below.
Step 1. Each observation is a cluster.
Step 2. Form a new cluster by grouping the two clusters that are most similar, or
closest to each other. This leaves seven clusters.
Step 3. Form a new cluster by grouping the two clusters that are most similar, or
closest to each other. This leaves six clusters.
1
http://cran.r-project.org/web/views/Cluster.html


Step 4–7. Continue the process of merging clusters one at a time.
Step 8. Merge (fuse or combine) the remaining two clusters.
Finally, use a tree or dendrogram to summarize the steps in the cluster formation.

#### Example: Fake data cluster illustration


# convert to a data.frame by reading the text table
intro <- read.table(text = "
x1 x2
4 8
6 6
10 11
11 8
17 5
19 3
20 11
21 2
", header=TRUE)
str(intro)
## 'data.frame': 8 obs. of 2 variables:
## $ x1: int 4 6 10 11 17 19 20 21
## $ x2: int 8 6 11 8 5 3 11 2
# perform PCA on covariance matrix
intro.pca <- princomp( ~ x1 + x2, data = intro)

# plot original data


library(ggplot2)
p1 <- ggplot(intro, aes(x = x1, y = x2))
p1 <- p1 + geom_point() # points
p1 <- p1 + geom_text(aes(label = 1:nrow(intro)), hjust = -0.5, alpha = 0.5) # labels
p1 <- p1 + labs(title = "Introductory example")
print(p1)

# plot PCA scores (data on PC-scale centered at 0)


library(ggplot2)
p2 <- ggplot(as.data.frame(intro.pca$scores), aes(x = Comp.1, y = Comp.2))
p2 <- p2 + geom_point() # points
p2 <- p2 + geom_text(aes(label = rownames(intro.pca$scores)), hjust = -0.5, alpha = 0.5) # labels
p2 <- p2 + labs(title = "Introductory example, PCA scores")
print(p2)


[Figure: "Introductory example" plot of the eight observations as x1 vs x2 with labels 1-8 (left), and "Introductory example, PCA scores" showing the same observations as Comp.1 vs Comp.2 (right).]

Here are the results of one distance measure, which will be discussed in more detail
after the plots. The clustering algorithm order for average linkage is plotted here.
# create distance matrix between points
intro.dist <- dist(intro)
intro.hc.average <- hclust(intro.dist, method = "average")

op <- par(no.readonly = TRUE) # save original plot options


par(mfrow = c(3,2), mar = c(2, 2, 2.5, 1)) # margins are c(bottom, left, top, right)

library(cluster)
for (i.clus in 7:2) {
clusplot(intro, cutree(intro.hc.average, k = i.clus)
, color = TRUE, labels = 2, lines = 0
, cex = 2, cex.txt = 1, col.txt = "gray20"
, main = paste(i.clus, "clusters"), sub = NULL)
}

par(op) # reset plot options


[Figure: clusplot panels showing the cluster memberships on the first two principal components as the average-linkage algorithm moves from 7 clusters down to 2 clusters.]

The order of clustering is summarized in the average linkage dendrogram on the right, reading the tree from the bottom upwards2.
# create distance matrix between points
intro.dist <- dist(intro)
intro.dist
## 1 2 3 4 5 6
## 2 2.828427
## 3 6.708204 6.403124
## 4 7.000000 5.385165 3.162278

2
There are many ways to create dendrograms in R, see http://gastonsanchez.com/blog/
how-to/2012/10/03/Dendrograms.html for several examples.


## 5 13.341664 11.045361 9.219544 6.708204


## 6 15.811388 13.341664 12.041595 9.433981 2.828427
## 7 16.278821 14.866069 10.000000 9.486833 6.708204 8.062258
## 8 18.027756 15.524175 14.212670 11.661904 5.000000 2.236068
## 7
## 2
## 3
## 4
## 5
## 6
## 7
## 8 9.055385
op <- par(no.readonly = TRUE) # save original plot options
par(mfrow = c(1,3)) # margins are c(bottom, left, top, right)

intro.hc.single <- hclust(intro.dist, method = "single")


# note: plotting used to use plclust()
plot(intro.hc.single, hang = -1, main = "single")

intro.hc.complete <- hclust(intro.dist, method = "complete")


plot(intro.hc.complete, hang = -1, main = "complete")

intro.hc.average <- hclust(intro.dist, method = "average")


plot(intro.hc.average, hang = -1, main = "average")

par(op) # reset plot options

[Figure: dendrograms for the introductory example from hclust on intro.dist using single, complete, and average linkage.]


14.1.2 Distance measures


There are several accepted measures of distance between clusters. The single linkage
distance is the minimum distance between points across two clusters. The complete
linkage distance is the maximum distance between points across two clusters. The
average linkage distance is the average distance between points across two clusters.
In these three cases the distance between points is the Euclidean or “ruler” distance.
The pictures below illustrate the measures.
Given a distance measure, the distance between each pair of clusters is evaluated
at each step. The two clusters that are closest to each other are merged. The
observations are usually standardized prior to clustering to eliminate the effect of
different variability on the distance measure.

[Figure: illustration of the single, complete, and average linkage distances between the points of two clusters, plotted on the first two components.]

Different distance measures can produce clusters with different shapes.
Single uses the length of the shortest line between points in the two clusters. Single linkage has the ability to produce and detect elongated and irregular clusters.
Complete uses the length of the longest line between points in the two clusters. Complete linkage is biased towards producing clusters with roughly equal diameters.
Average uses the average length of all lines between points in the two clusters. Average linkage tends to produce clusters with similar variability.
You should try different distances to decide the most sensible measure for your problem.
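As a minimal numerical sketch (not output shown in the notes), here are the three linkage distances between the clusters {1, 2} and {3, 4} that form early in the introductory example.
# sketch: the three linkage distances between clusters {1,2} and {3,4}
d.mat   <- as.matrix(dist(intro))  # Euclidean distances between all eight points
between <- d.mat[1:2, 3:4]         # distances across the two clusters
min(between)                       # single linkage distance
max(between)                       # complete linkage distance
mean(between)                      # average linkage distance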

14.2 Example: Mammal teeth


The table below gives the numbers of different types of teeth for 32 mammals. The
columns, from left to right, give the numbers of (v1) top incisors, (v2) bottom incisors,
(v3) top canines, (v4) bottom canines, (v5) top premolars, (v6) bottom premolars,
(v7) top molars, (v8) bottom molars, respectively. A cluster analysis will be used to
identify the mammals that have similar counts across the eight types of teeth.


#### Example: Mammal teeth


## Mammal teeth data
# mammal = name
# number of teeth
# v1 = top incisors
# v2 = bottom incisors
# v3 = top canines
# v4 = bottom canines
# v5 = top premolars
# v6 = bottom premolars
# v7 = top molars
# v8 = bottom molars

fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch14_teeth.dat"


teeth <- read.table(fn.data, header = TRUE)
str(teeth)
## 'data.frame': 32 obs. of 9 variables:
## $ mammal: Factor w/ 32 levels "Badger","Bear",..: 4 17 29 19 13 24 20 22 3 12 ...
## $ v1 : int 2 3 2 2 2 1 2 2 1 1 ...
## $ v2 : int 3 2 3 3 3 3 1 1 1 1 ...
## $ v3 : int 1 1 1 1 1 1 0 0 0 0 ...
## $ v4 : int 1 0 1 1 1 1 0 0 0 0 ...
## $ v5 : int 3 3 2 2 1 2 2 3 2 2 ...
## $ v6 : int 3 3 3 2 2 2 2 2 1 1 ...
## $ v7 : int 3 3 3 3 3 3 3 3 3 3 ...
## $ v8 : int 3 3 3 3 3 3 3 3 3 3 ...


mammal v1 v2 v3 v4 v5 v6 v7 v8
1 Brown Bat 2 3 1 1 3 3 3 3
2 Mole 3 2 1 0 3 3 3 3
3 Silver Hair Bat 2 3 1 1 2 3 3 3
4 Pigmy Bat 2 3 1 1 2 2 3 3
5 House Bat 2 3 1 1 1 2 3 3
6 Red Bat 1 3 1 1 2 2 3 3
7 Pika 2 1 0 0 2 2 3 3
8 Rabbit 2 1 0 0 3 2 3 3
9 Beaver 1 1 0 0 2 1 3 3
10 Groundhog 1 1 0 0 2 1 3 3
11 Gray Squirrel 1 1 0 0 1 1 3 3
12 House Mouse 1 1 0 0 0 0 3 3
13 Porcupine 1 1 0 0 1 1 3 3
14 Wolf 3 3 1 1 4 4 2 3
15 Bear 3 3 1 1 4 4 2 3
16 Raccoon 3 3 1 1 4 4 3 2
17 Marten 3 3 1 1 4 4 1 2
18 Weasel 3 3 1 1 3 3 1 2
19 Wolverine 3 3 1 1 4 4 1 2
20 Badger 3 3 1 1 3 3 1 2
21 River Otter 3 3 1 1 4 3 1 2
22 Sea Otter 3 2 1 1 3 3 1 2
23 Jaguar 3 3 1 1 3 2 1 1
24 Cougar 3 3 1 1 3 2 1 1
25 Fur Seal 3 2 1 1 4 4 1 1
26 Sea Lion 3 2 1 1 4 4 1 1
27 Grey Seal 3 2 1 1 3 3 2 2
28 Elephant Seal 2 1 1 1 4 4 1 1
29 Reindeer 0 4 1 0 3 3 3 3
30 Elk 0 4 1 0 3 3 3 3
31 Deer 0 4 0 0 3 3 3 3
32 Moose 0 4 0 0 3 3 3 3

The program below produces cluster analysis summaries for the mammal teeth
data.
# create distance matrix between points
teeth.dist <- dist(teeth[,-1])

# number of clusters to identify with red boxes and ellipses


# i.clus <- 8

# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
, main = paste("Teeth with average linkage") # and", i.clus, "clusters")
, labels = teeth[,1])
# rect.hclust(teeth.hc.average, k = i.clus)

# # create PCA scores plot with ellipses


# clusplot(teeth, cutree(teeth.hc.average, k = i.clus)
# , color = TRUE, labels = 2, lines = 0
# , cex = 2, cex.txt = 1, col.txt = "gray20"
# , main = paste("Teeth PCA with average linkage and", i.clus, "clusters")
# , sub = NULL)


[Figure: "Teeth with average linkage" dendrogram of the 32 mammals from hclust on teeth.dist with average linkage.]

14.3 Identifying the Number of Clusters


Cluster analysis can be used to produce an “optimal” splitting of the data into a
prespecified number of groups or clusters, with different algorithms3 usually giving
different clusters. However, the important issue in many analyses revolves around
identifying the number of clusters in the data. A simple empirical method is to
continue grouping until the clusters being fused are relatively dissimilar, as measured
by the normalized RMS between clusters. Experience with your data is needed to
provide a reasonable stopping rule.
# NbClust provides methods for determining the number of clusters
library(NbClust)
str(teeth)
## 'data.frame': 32 obs. of 9 variables:
## $ mammal: Factor w/ 32 levels "Badger","Bear",..: 4 17 29 19 13 24 20 22 3 12 ...

3
There are thirty in this package: http://cran.r-project.org/web/packages/NbClust/
NbClust.pdf


## $ v1 : int 2 3 2 2 2 1 2 2 1 1 ...
## $ v2 : int 3 2 3 3 3 3 1 1 1 1 ...
## $ v3 : int 1 1 1 1 1 1 0 0 0 0 ...
## $ v4 : int 1 0 1 1 1 1 0 0 0 0 ...
## $ v5 : int 3 3 2 2 1 2 2 3 2 2 ...
## $ v6 : int 3 3 3 2 2 2 2 2 1 1 ...
## $ v7 : int 3 3 3 3 3 3 3 3 3 3 ...
## $ v8 : int 3 3 3 3 3 3 3 3 3 3 ...
# Because the data type is "int" for integer, the routine fails (error expected)
NbClust(teeth[,-1], method = "average", index = "all")
## Error in solve.default(W): system is computationally singular: reciprocal condition
number = 1.51394e-16
# However, change the data type from integer to numeric and it works just fine!
teeth.num <- as.numeric(as.matrix(teeth[,-1]))
NC.out <- NbClust(teeth.num, method = "average", index = "all")
## Warning in max(DiffLev[, 5], na.rm = TRUE): no non-missing arguments to max; returning
-Inf
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## Warning in matrix(c(results), nrow = 2, ncol = 26): data length [51] is not a sub-multiple
or multiple of the number of rows [2]
## Warning in matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number clusters",
: data length [51] is not a sub-multiple or multiple of the number of rows [2]
## *******************************************************************
## * Among all indices:
## * 1 proposed 4 as the best number of clusters
## * 5 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 5
##
##
## *******************************************************************
# most of the methods suggest 4 or 5 clusters, as do the plots
NC.out$Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5 5 4 5.0000 5.000 5 -Inf
## Value_Index Inf Inf Inf 369.1341 7787.404 414 5


## TraceW Friedman Rubin Cindex DB


## Number_clusters 25.875 8.720698e+14 -9.810785e+14 0 0
## Value_Index 5.000 5.000000e+00 6.000000e+00 5 5
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 1 0.4663 168.2472 0.3789 0.4737 61.8333
## Value_Index 2 2.0000 2.0000 3.0000 3.0000 3.0000
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex
## Number_clusters 0.7713 NA 0 Inf 0 Inf 0
## Value_Index 1.0000 5 5 0 2 0 5
## SDbw
## Number_clusters 0
## Value_Index 5

[Figure: NbClust diagnostic plots for the teeth data: Hubert statistic values and their second differences, and D index values and their second differences, against the number of clusters.]

There are several statistical methods for selecting the number of clusters, and no method is best. Commonly used summaries are the cubic clustering criterion (ccc), a pseudo-F statistic, and a pseudo-t statistic. At a given step, the pseudo-t statistic is the distance between the centers of the two clusters to be merged, relative to the variability within these clusters. A large pseudo-t statistic implies that the clusters to be joined are relatively dissimilar (i.e., much more variability between the clusters to be merged than within these clusters). The pseudo-F statistic at a given step measures the variability among the centers of the current clusters relative to the variability within the clusters. A large pseudo-F value implies that the current clusters are well separated, each consisting of fairly similar observations. As clusters are joined, the pseudo-t statistic tends to increase, and the pseudo-F statistic tends to decrease. The ccc is more difficult to describe.

The RSQ summary is also useful for determining the number of clusters. RSQ is a pseudo-R^2 statistic that measures the proportion of the total variation explained by the differences among the existing clusters at a given step. RSQ will typically decrease as the pseudo-F statistic decreases.
A common recommendation on cluster selection is to choose a cluster size
where the values of ccc and the pseudo-F statistic are relatively high (compared to
what you observe with other numbers of clusters), and where the pseudo-t statistic
is relatively low and increases substantially at the next proposed merger. For the
mammal teeth data this corresponds to four clusters. Six clusters is a sensible second
choice.
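An RSQ-type summary can be computed directly from the data and a cut of the tree; the sketch below (not output from the notes) gives the proportion of the total variation explained by cluster membership for the teeth data, using the average-linkage tree.
# sketch: pseudo R-squared (proportion of total variation explained) for k = 2,...,8 clusters
tot.ss <- sum(scale(teeth[,-1], scale = FALSE)^2)  # total sum of squares about the variable means
rsq <- sapply(2:8, function(k) {
  cl  <- cutree(teeth.hc.average, k = k)           # cluster membership for k clusters
  wss <- sum(sapply(split(teeth[,-1], cl),
                    function(d) sum(scale(d, scale = FALSE)^2)))  # within-cluster sum of squares
  1 - wss / tot.ss
})
round(data.frame(k = 2:8, rsq), 2)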


Let’s look at the results of 5 clusters.


# create distance matrix between points
teeth.dist <- dist(teeth[,-1])

# number of clusters to identify with red boxes and ellipses


i.clus <- 5

# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
, main = paste("Teeth with average linkage and", i.clus, "clusters")
, labels = teeth[,1])
rect.hclust(teeth.hc.average, k = i.clus)

# create PCA scores plot with ellipses


clusplot(teeth, cutree(teeth.hc.average, k = i.clus)
, color = TRUE, labels = 2, lines = 0
, cex = 2, cex.txt = 1, col.txt = "gray20"
, main = paste("Teeth PCA with average linkage and", i.clus, "clusters")
, sub = NULL)

[Figure: "Teeth with average linkage and 5 clusters" dendrogram with the five clusters boxed, followed by "Teeth PCA with average linkage and 5 clusters", the clusplot of the cluster memberships on the first two principal components.]

# print the observations in each cluster


for (i.cut in 1:i.clus) {
print(paste("Cluster", i.cut, " ----------------------------- "))
print(teeth[(cutree(teeth.hc.average, k = i.clus) == i.cut),])
}
## [1] "Cluster 1 ----------------------------- "
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 1 Brown_Bat 2 3 1 1 3 3 3 3
## 3 Silver_Hair_Bat 2 3 1 1 2 3 3 3
## 4 Pigmy_Bat 2 3 1 1 2 2 3 3
## 5 House_Bat 2 3 1 1 1 2 3 3
## 6 Red_Bat 1 3 1 1 2 2 3 3
## [1] "Cluster 2 ----------------------------- "
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 2 Mole 3 2 1 0 3 3 3 3
## 7 Pika 2 1 0 0 2 2 3 3
## 8 Rabbit 2 1 0 0 3 2 3 3
## [1] "Cluster 3 ----------------------------- "
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 9 Beaver 1 1 0 0 2 1 3 3
## 10 Groundhog 1 1 0 0 2 1 3 3
## 11 Gray_Squirrel 1 1 0 0 1 1 3 3
## 12 House_Mouse 1 1 0 0 0 0 3 3
## 13 Porcupine 1 1 0 0 1 1 3 3
## [1] "Cluster 4 ----------------------------- "


## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 14 Wolf 3 3 1 1 4 4 2 3
## 15 Bear 3 3 1 1 4 4 2 3
## 16 Raccoon 3 3 1 1 4 4 3 2
## 17 Marten 3 3 1 1 4 4 1 2
## 18 Weasel 3 3 1 1 3 3 1 2
## 19 Wolverine 3 3 1 1 4 4 1 2
## 20 Badger 3 3 1 1 3 3 1 2
## 21 River_Otter 3 3 1 1 4 3 1 2
## 22 Sea_Otter 3 2 1 1 3 3 1 2
## 23 Jaguar 3 3 1 1 3 2 1 1
## 24 Cougar 3 3 1 1 3 2 1 1
## 25 Fur_Seal 3 2 1 1 4 4 1 1
## 26 Sea_Lion 3 2 1 1 4 4 1 1
## 27 Grey_Seal 3 2 1 1 3 3 2 2
## 28 Elephant_Seal 2 1 1 1 4 4 1 1
## [1] "Cluster 5 ----------------------------- "
## mammal v1 v2 v3 v4 v5 v6 v7 v8
## 29 Reindeer 0 4 1 0 3 3 3 3
## 30 Elk 0 4 1 0 3 3 3 3
## 31 Deer 0 4 0 0 3 3 3 3
## 32 Moose 0 4 0 0 3 3 3 3

14.4 Example: 1976 birth and death rates

Below are the 1976 crude birth and death rates in 74 countries. A data plot and output from complete and single linkage cluster analyses are given.
#### Example: Birth and death rates
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch14_birthdeath.dat"
bd <- read.table(fn.data, header = TRUE)
str(bd)
## 'data.frame': 74 obs. of 3 variables:
## $ country: Factor w/ 74 levels "afghan","algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ birth : int 52 50 47 22 16 12 47 12 36 17 ...
## $ death : int 30 16 23 10 8 13 19 12 10 10 ...


   country     birth death     country     birth death     country     birth death
 1 afghan        52    30   26 ghana         46    14   51 poland        20     9
 2 algeria       50    16   27 greece        16     9   52 portugal      19    10
 3 angola        47    23   28 guatamala     40    14   53 rhodesia      48    14
 4 argentina     22    10   29 hungary       18    12   54 romania       19    10
 5 australia     16     8   30 india         36    15   55 saudi ar      49    19
 6 austria       12    13   31 indonesia     38    16   56 sth africa    36    12
 7 banglades     47    19   32 iran          42    12   57 spain         18     8
 8 belguim       12    12   33 iraq          48    14   58 sri lanka     26     9
 9 brazil        36    10   34 italy         14    10   59 sudan         49    17
10 bulgaria      17    10   35 ivory cst     48    23   60 sweden        12    11
11 burma         38    15   36 japan         16     6   61 switzer       12     9
12 cameroon      42    22   37 kenya         50    14   62 syria         47    14
13 canada        17     7   38 nkorea        43    12   63 tanzania      47    17
14 chile         22     7   39 skorea        26     6   64 thailand      34    10
15 china         31    11   40 madagasca     47    22   65 turkey        34    12
16 taiwan        26     5   41 malaysia      30     6   66 ussr          18     9
17 columbia      34    10   42 mexico        40     7   67 uganda        48    17
18 cuba          20     6   43 morocco       47    16   68 uk            12    12
19 czechosla     19    11   44 mozambique    45    18   69 usa           15     9
20 ecuador       42    11   45 nepal         46    20   70 upp volta     50    28
21 egypt         39    13   46 netherlan     13     8   71 venez         36     6
22 ethiopia      48    23   47 nigeria       49    22   72 vietnam       42    17
23 france        14    11   48 pakistan      44    14   73 yugoslav      18     8
24 german dr     12    14   49 peru          40    13   74 zaire         45    18
25 german fr     10    12   50 phillip       34    10
# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death))
p1 <- p1 + geom_point(size = 2) # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates")
print(p1)


[Figure: "1976 crude birth and death rates" scatterplot of death rate versus birth rate with each point labeled by country.]

14.4.1 Complete linkage


library(NbClust)
# Change integer data type to numeric
bd.num <- as.numeric(as.matrix(bd[,-1]))
NC.out <- NbClust(bd.num, method = "complete", index = "all")
## Warning in max(DiffLev[, 5], na.rm = TRUE): no non-missing arguments to max; returning
-Inf
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## Warning in matrix(c(results), nrow = 2, ncol = 26): data length [51] is not a sub-multiple


or multiple of the number of rows [2]


## Warning in matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number clusters",
: data length [51] is not a sub-multiple or multiple of the number of rows [2]
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
# most of the methods suggest 2 to 6 clusters, as do the plots
NC.out$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 2.000 15.000 5.0000 2.0000 4.0000 6.000
## Value_Index 3.333 1780.714 209.2456 20.7606 86.7855 9041.261
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters -Inf 854.6162 395.428 -131.4723 0.2254 0.4292
## Value_Index 4 15.0000 13.000 2.0000 2.0000 2.0000
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 0.7468 0.2486 142.0413 0.9864 0.4628 5166.333
## Value_Index 2.0000 2.0000 2.0000 3.0000 3.0000 2.000
## PtBiserial Frey McClain Dunn Hubert SDindex
## Number_clusters 0.8512 3.5386 0.1705 0.3333 0 0.3167
## Value_Index 3.0000 2.0000 13.0000 0.0000 3 0.0000
## Dindex SDbw
## Number_clusters 0 0.0073
## Value_Index 15 2.0000

Figure: NbClust diagnostic plots — Hubert statistic values and second differences, and Dindex values and second differences, plotted against the number of clusters.

Let’s try 3 clusters based on the dendrogram plots below. First we’ll use complete
linkage.
# create distance matrix between points
bd.dist <- dist(bd[,-1])

# number of clusters to identify with red boxes and ellipses


i.clus <- 3

# create dendrogram
bd.hc.complete <- hclust(bd.dist, method = "complete")
plot(bd.hc.complete, hang = -1
     , main = paste("Birth/death data with complete linkage and", i.clus, "clusters")
     , labels = bd[,1])
rect.hclust(bd.hc.complete, k = i.clus)

# create PCA scores plot with ellipses


clusplot(bd, cutree(bd.hc.complete, k = i.clus)
, color = TRUE, labels = 2, lines = 0
, cex = 2, cex.txt = 1, col.txt = "gray20"
, main = paste("Birth/Death PCA with complete linkage and", i.clus, "clusters"), sub = NULL)

# create a column with group membership


bd$cut.comp <- factor(cutree(bd.hc.complete, k = i.clus))

Figure: dendrogram of the birth/death data from hclust(bd.dist, method = "complete"), countries as leaf labels, with the 3 clusters marked by rect.hclust().


Figure: "Birth/Death PCA with complete linkage and 3 clusters" — clusplot of the observations on the first two principal components, with the three clusters shown as ellipses.

# print the observations in each cluster


for (i.cut in 1:i.clus) {
print(paste("Cluster", i.cut, " ----------------------------- "))
print(bd[(cutree(bd.hc.complete, k = i.clus) == i.cut),])
}
## [1] "Cluster 1 ----------------------------- "
## country birth death cut.comp
## 1 afghan 52 30 1
## 2 algeria 50 16 1
## 3 angola 47 23 1
## 7 banglades 47 19 1
## 12 cameroon 42 22 1
## 22 ethiopia 48 23 1
## 26 ghana 46 14 1
## 33 iraq 48 14 1
## 35 ivory_cst 48 23 1
## 37 kenya 50 14 1
## 40 madagasca 47 22 1
## 43 morocco 47 16 1
## 44 mozambique 45 18 1
## 45 nepal 46 20 1
## 47 nigeria 49 22 1
## 53 rhodesia 48 14 1
## 55 saudi_ar 49 19 1


## 59 sudan 49 17 1
## 62 syria 47 14 1
## 63 tanzania 47 17 1
## 67 uganda 48 17 1
## 70 upp_volta 50 28 1
## 72 vietnam 42 17 1
## 74 zaire 45 18 1
## [1] "Cluster 2 ----------------------------- "
## country birth death cut.comp
## 4 argentina 22 10 2
## 5 australia 16 8 2
## 6 austria 12 13 2
## 8 belguim 12 12 2
## 10 bulgaria 17 10 2
## 13 canada 17 7 2
## 14 chile 22 7 2
## 18 cuba 20 6 2
## 19 czechosla 19 11 2
## 23 france 14 11 2
## 24 german_dr 12 14 2
## 25 german_fr 10 12 2
## 27 greece 16 9 2
## 29 hungary 18 12 2
## 34 italy 14 10 2
## 36 japan 16 6 2
## 46 netherlan 13 8 2
## 51 poland 20 9 2
## 52 portugal 19 10 2
## 54 romania 19 10 2
## 57 spain 18 8 2
## 60 sweden 12 11 2
## 61 switzer 12 9 2
## 66 ussr 18 9 2
## 68 uk 12 12 2
## 69 usa 15 9 2
## 73 yugoslav 18 8 2
## [1] "Cluster 3 ----------------------------- "
## country birth death cut.comp
## 9 brazil 36 10 3
## 11 burma 38 15 3
## 15 china 31 11 3
## 16 taiwan 26 5 3
## 17 columbia 34 10 3
## 20 ecuador 42 11 3
## 21 egypt 39 13 3
## 28 guatamala 40 14 3
## 30 india 36 15 3
## 31 indonesia 38 16 3


## 32 iran 42 12 3
## 38 nkorea 43 12 3
## 39 skorea 26 6 3
## 41 malaysia 30 6 3
## 42 mexico 40 7 3
## 48 pakistan 44 14 3
## 49 peru 40 13 3
## 50 phillip 34 10 3
## 56 sth_africa 36 12 3
## 58 sri_lanka 26 9 3
## 64 thailand 34 10 3
## 65 turkey 34 12 3
## 71 venez 36 6 3
# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death, colour = cut.comp, shape = cut.comp))
p1 <- p1 + geom_point(size = 2) # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates, complete linkage")
print(p1)

Figure: "1976 crude birth and death rates, complete linkage" — the birth/death scatterplot with points colored and shaped by the 3 complete-linkage clusters.

In very general/loose terms⁴, it appears that at least some members of the "Four Asian Tigers"⁵ are toward the bottom of the swoop, while the countries with more Euro-centric wealth are mostly clustered on the left side of the swoop, and many developing countries make up the steeper right side of the swoop. Perhaps the birth and death rates of a given country are influenced in part by the primary means by which the country has obtained wealth⁶ (if it is considered a wealthy country). For example, the Four Asian Tigers have supposedly developed wealth in more recent years through export-driven economies, and the Tiger Cub Economies⁷ are currently developing in a similar fashion⁸.

⁴ Thanks to Drew Enigk from Spring 2013 who provided this interpretation.
⁵ http://en.wikipedia.org/wiki/Four_Asian_Tigers
⁶ http://www.povertyeducation.org/the-rise-of-asia.html
⁷ http://en.wikipedia.org/wiki/Tiger_Cub_Economies
⁸ http://www.investopedia.com/terms/t/tiger-cub-economies.asp

14.4.2 Single linkage


Now we’ll use single linkage to compare.
library(NbClust)
# Change integer data type to numeric
bd.num <- as.numeric(as.matrix(bd[,-1]))
NC.out <- NbClust(bd.num, method = "single", index = "all")
## Warning in max(DiffLev[, 5], na.rm = TRUE): no non-missing arguments to max; returning -Inf
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## Warning in matrix(c(results), nrow = 2, ncol = 26): data length [51] is not a sub-multiple or multiple of the number of rows [2]
## Warning in matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number clusters", : data length [51] is not a sub-multiple or multiple of the number of rows [2]
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 6
##
##
## *******************************************************************
# most of the methods suggest 5 to 11 clusters, as do the plots
NC.out$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 7.0000 11.0000 5.0000 2.0000 6.0000 6.0
## Value_Index 8.5944 342.3491 473.8986 13.7816 221.8442 115129.2
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters -Inf 4990.876 20.3754 -12.0601 0.189 0.341
## Value_Index 6 6.000 6.0000 7.0000 15.000 2.000
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 0.7364 0.4667 53.7092 0.373 0.3521 9161.833
## Value_Index 2.0000 2.0000 2.0000 7.000 3.0000 2.000
## PtBiserial Frey McClain Dunn Hubert SDindex
## Number_clusters 0.8462 4.0846 0.1235 0.1364 0 0.5134
## Value_Index 2.0000 2.0000 3.0000 0.0000 3 0.0000
## Dindex SDbw
## Number_clusters 0 0.0442
## Value_Index 15 7.0000


Figure: NbClust diagnostic plots for single linkage — Hubert statistic values and second differences, and Dindex values and second differences, plotted against the number of clusters.

# create distance matrix between points


bd.dist <- dist(bd[,-1])

# number of clusters to identify with red boxes and ellipses


i.clus <- 3

# create dendrogram
bd.hc.single <- hclust(bd.dist, method = "single")
plot(bd.hc.single, hang = -1
, main = paste("Teeth with single linkage and", i.clus, "clusters")
, labels = bd[,1])
rect.hclust(bd.hc.single, k = i.clus)

# create PCA scores plot with ellipses


clusplot(bd, cutree(bd.hc.single, k = i.clus)
, color = TRUE, labels = 2, lines = 0
, cex = 2, cex.txt = 1, col.txt = "gray20"
, main = paste("Birth/Death PCA with single linkage and", i.clus, "clusters")
, sub = NULL)

# create a column with group membership

bd$cut.sing <- factor(cutree(bd.hc.single, k = i.clus))

Figure: dendrogram of the birth/death data from hclust(bd.dist, method = "single"), countries as leaf labels, with the 3 clusters marked by rect.hclust().

Figure: "Birth/Death PCA with single linkage and 3 clusters" — clusplot of the observations on the first two principal components, with the three clusters shown as ellipses.

# print the observations in each cluster


for (i.cut in 1:i.clus) {
print(paste("Cluster", i.cut, " ----------------------------- "))
print(bd[(cutree(bd.hc.single, k = i.clus) == i.cut),])
}
## [1] "Cluster 1 ----------------------------- "
## country birth death cut.comp cut.sing
## 1 afghan 52 30 1 1
## 70 upp_volta 50 28 1 1
## [1] "Cluster 2 ----------------------------- "
## country birth death cut.comp cut.sing
## 2 algeria 50 16 1 2
## 3 angola 47 23 1 2
## 7 banglades 47 19 1 2
## 9 brazil 36 10 3 2
## 11 burma 38 15 3 2
## 12 cameroon 42 22 1 2
## 15 china 31 11 3 2
## 17 columbia 34 10 3 2
## 20 ecuador 42 11 3 2
## 21 egypt 39 13 3 2
## 22 ethiopia 48 23 1 2
## 26 ghana 46 14 1 2
## 28 guatamala 40 14 3 2


## 30 india 36 15 3 2
## 31 indonesia 38 16 3 2
## 32 iran 42 12 3 2
## 33 iraq 48 14 1 2
## 35 ivory_cst 48 23 1 2
## 37 kenya 50 14 1 2
## 38 nkorea 43 12 3 2
## 40 madagasca 47 22 1 2
## 42 mexico 40 7 3 2
## 43 morocco 47 16 1 2
## 44 mozambique 45 18 1 2
## 45 nepal 46 20 1 2
## 47 nigeria 49 22 1 2
## 48 pakistan 44 14 3 2
## 49 peru 40 13 3 2
## 50 phillip 34 10 3 2
## 53 rhodesia 48 14 1 2
## 55 saudi_ar 49 19 1 2
## 56 sth_africa 36 12 3 2
## 59 sudan 49 17 1 2
## 62 syria 47 14 1 2
## 63 tanzania 47 17 1 2
## 64 thailand 34 10 3 2
## 65 turkey 34 12 3 2
## 67 uganda 48 17 1 2
## 71 venez 36 6 3 2
## 72 vietnam 42 17 1 2
## 74 zaire 45 18 1 2
## [1] "Cluster 3 ----------------------------- "
## country birth death cut.comp cut.sing
## 4 argentina 22 10 2 3
## 5 australia 16 8 2 3
## 6 austria 12 13 2 3
## 8 belguim 12 12 2 3
## 10 bulgaria 17 10 2 3
## 13 canada 17 7 2 3
## 14 chile 22 7 2 3
## 16 taiwan 26 5 3 3
## 18 cuba 20 6 2 3
## 19 czechosla 19 11 2 3
## 23 france 14 11 2 3
## 24 german_dr 12 14 2 3
## 25 german_fr 10 12 2 3
## 27 greece 16 9 2 3
## 29 hungary 18 12 2 3
## 34 italy 14 10 2 3
## 36 japan 16 6 2 3
## 39 skorea 26 6 3 3


## 41 malaysia 30 6 3 3
## 46 netherlan 13 8 2 3
## 51 poland 20 9 2 3
## 52 portugal 19 10 2 3
## 54 romania 19 10 2 3
## 57 spain 18 8 2 3
## 58 sri_lanka 26 9 3 3
## 60 sweden 12 11 2 3
## 61 switzer 12 9 2 3
## 66 ussr 18 9 2 3
## 68 uk 12 12 2 3
## 69 usa 15 9 2 3
## 73 yugoslav 18 8 2 3
# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death, colour = cut.sing, shape = cut.sing))
p1 <- p1 + geom_point(size = 2) # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates, single linkage")
print(p1)

Figure: "1976 crude birth and death rates, single linkage" — the birth/death scatterplot with points colored and shaped by the 3 single-linkage clusters.

The two methods suggest three clusters. Complete linkage also suggests 14 clus-
ters, but the clusters were unappealing so this analysis will not be presented here.
The three clusters generated by the two methods are very different. The same
tendency was observed using average linkage and Ward’s method.
An important point to recognize is that different clustering algorithms may agree
on the number of clusters, but they may not agree on the composition of the clusters.
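To see the disagreement numerically (a small sketch of my own, using the cut.comp and cut.sing columns created above), we can cross-tabulate the two sets of labels; off-diagonal counts are countries that the two linkages place in different clusters.
# cross-tabulate complete-linkage vs single-linkage cluster memberships
table(complete = bd$cut.comp, single = bd$cut.sing)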



Chapter 15

Multivariate Analysis of Variance

Jolicoeur and Mosimann studied the relationship between the size and shape of
painted turtles. The table below gives the length, width, and height (all in mm)
for 24 males and 24 females.
#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch15_shells_mf.dat"
shells <- read.table(fn.data, header = TRUE)
str(shells)
## 'data.frame': 48 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ length: int 98 103 103 105 109 123 123 133 133 133 ...
## $ width : int 81 84 86 86 88 92 95 99 102 102 ...
## $ height: int 38 38 42 42 44 50 46 51 51 51 ...
#head(shells)


sex length width height sex length width height


1 F 98 81 38 25 M 93 74 37
2 F 103 84 38 26 M 94 78 35
3 F 103 86 42 27 M 96 80 35
4 F 105 86 42 28 M 101 84 39
5 F 109 88 44 29 M 102 85 38
6 F 123 92 50 30 M 103 81 37
7 F 123 95 46 31 M 104 83 39
8 F 133 99 51 32 M 106 83 39
9 F 133 102 51 33 M 107 82 38
10 F 133 102 51 34 M 112 89 40
11 F 134 100 48 35 M 113 88 40
12 F 136 102 49 36 M 114 86 40
13 F 138 98 51 37 M 116 90 43
14 F 138 99 51 38 M 117 90 41
15 F 141 105 53 39 M 117 91 41
16 F 147 108 57 40 M 119 93 41
17 F 149 107 55 41 M 120 89 40
18 F 153 107 56 42 M 121 93 44
19 F 155 115 63 43 M 121 95 42
20 F 155 117 60 44 M 125 93 45
21 F 158 115 62 45 M 127 96 45
22 F 159 118 63 46 M 128 95 45
23 F 162 124 61 47 M 131 95 46
24 F 177 132 67 48 M 135 106 47

## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# color by sex
p <- ggpairs(shells
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, title = "Painted turtle shells"
, progress=FALSE
)
print(p)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

## 3D scatterplot
library(scatterplot3d)
with(shells, {
scatterplot3d(x = length
, y = width
, z = height
, main = "Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color = as.integer(sex) # color by group
, pch = as.integer(sex)+19 # plotting character by group


#, highlight.3d = TRUE # makes color change with z-axis value


, angle = 40 # viewing angle (seems hard to control)
)
})

#### Try this!


#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height, col = sex) })
Figure: "Painted turtle shells" — ggpairs scatterplot matrix of length, width, and height colored by sex, alongside the "Shells 3D Scatterplot" of the same three measurements.

MANOVA considers the following two questions:


• Are the population mean length, width, and height the same for males and females?
• If not, then what combination of features is most responsible for the differences?

To describe MANOVA, suppose you measure p features on independent random samples from k strata, groups, or populations. Let

µ′i = ( µi1  µi2  · · ·  µip )

be the vector of population means for the ith population, where µij is the ith population
mean on the j th feature. For the turtles, p = 3 features and k = 2 strata (sexes).
A one-way MANOVA tests the hypothesis that the population mean vectors are
identical: H0 : µ1 = µ2 = · · · = µk against HA : not H0 . For the carapace data, you
are simultaneously testing that the sexes have equal population mean lengths, equal
population mean widths, and equal population mean heights.
Assume that the sample sizes from the different groups are n1 , n2 , . . . , nk . The


total sample size is n = n1 + n2 + · · · + nk . Let


X′ij = ( Xij1  Xij2  · · ·  Xijp )

be the vector of responses for the j th individual from the ith sample. Let
X̄′i = ( X̄i1  X̄i2  · · ·  X̄ip )

and Si be the mean vector and variance-covariance matrix for the ith sample. Finally,
let
X̄′ = ( X̄1  X̄2  · · ·  X̄p )

be the vector of means ignoring samples (combine all the data across samples and
compute the average on each feature), and let
S = Σi (ni − 1) Si / (n − k)
be the pooled variance-covariance matrix. The pooled variance-covariance matrix is
a weighted average of the variance-covariance matrices from each group.
To test H0 , construct the following MANOVA table, which is the multivariate
analog of the ANOVA table:
Source     df       SS
Between    k − 1    Σi ni (X̄i − X̄)(X̄i − X̄)′
Within     n − k    Σi (ni − 1) Si
Total      n − 1    Σij (Xij − X̄)(Xij − X̄)′
where all the MSs are SS/df.
The expressions for the SS have the same form as SS in univariate analysis of
variance, except that each SS is a p × p symmetric matrix. The diagonal elements of
the SS matrices are the SS for one-way ANOVAs on the individual features. The
off-diagonal elements are SS between features. The Error MS matrix is the pooled
variance-covariance matrix S.
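As a concrete check (a sketch of my own, not part of the original analysis), the Within and Between SS matrices for the turtle shell data can be computed directly from these formulas; they should match the error and hypothesis SSP matrices that Manova() reports later in this chapter.
# sketch: Between and Within SS matrices by hand for the shells data
grp    <- shells$sex
n.i    <- table(grp)                                  # group sample sizes
k      <- length(n.i)
xbar   <- colMeans(shells[, 2:4])                     # grand mean vector
xbar.i <- lapply(split(shells[, 2:4], grp), colMeans) # group mean vectors
S.i    <- lapply(split(shells[, 2:4], grp), cov)      # group var-cov matrices

# Within SS = sum_i (n_i - 1) S_i, and pooled S = Within SS / (n - k)
SS.W     <- Reduce(`+`, Map(function(S, n) (n - 1) * S, S.i, as.list(n.i)))
S.pooled <- SS.W / (sum(n.i) - k)

# Between SS = sum_i n_i (xbar_i - xbar)(xbar_i - xbar)'
SS.B <- Reduce(`+`, Map(function(m, n) n * tcrossprod(m - xbar), xbar.i, as.list(n.i)))

SS.W      # should match the error SSP matrix from Manova() below
SS.B      # should match the hypothesis SSP matrix from Manova() below
S.pooled  # pooled variance-covariance matrix (the Error MS)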
The standard MANOVA assumes that you have independent samples from multi-
variate normal populations with identical variance-covariance matrices. This implies
that each feature is normally distributed in each population, that a feature has the
same variability across populations, and that the correlation (or covariance) between
two features is identical across populations. The Error MS matrix estimates the com-
mon population variance-covariance matrix when the population variance-covariance
matrices are identical.
The H0 of equal population mean vectors should be rejected when the difference
among mean vectors, as measured by the Between MS matrix, is large relative to the


variability within groups, as measured by the Error MS matrix. Equivalently, H0 is


implausible if a significant portion of the total variation in the data, as measured by
the Total SS matrix, is due to differences among the groups. The same idea is used
in a one-way ANOVA to motivate the F -test of no differences in population means.
However, some care is needed to quantify these ideas in a MANOVA because there
are several natural matrix definitions for comparing the Between MS matrix to the
Error MS matrix. As a result, several MANOVA tests of H0 have been proposed.
Graphical summaries for the carapace data are given above, with numerical sum-
maries below.
# summary statistics for each sex
by(shells, shells$sex, summary)
## shells$sex: F
## sex length width height
## F:24 Min. : 98.0 Min. : 81.00 Min. :38.00
## M: 0 1st Qu.:123.0 1st Qu.: 94.25 1st Qu.:47.50
## Median :137.0 Median :102.00 Median :51.00
## Mean :136.0 Mean :102.58 Mean :52.04
## 3rd Qu.:153.5 3rd Qu.:109.75 3rd Qu.:57.75
## Max. :177.0 Max. :132.00 Max. :67.00
## ----------------------------------------------------
## shells$sex: M
## sex length width height
## F: 0 Min. : 93.0 Min. : 74.00 Min. :35.00
## M:24 1st Qu.:103.8 1st Qu.: 83.00 1st Qu.:38.75
## Median :115.0 Median : 89.00 Median :40.00
## Mean :113.4 Mean : 88.29 Mean :40.71
## 3rd Qu.:121.0 3rd Qu.: 93.00 3rd Qu.:43.25
## Max. :135.0 Max. :106.00 Max. :47.00
# standard deviations
by(shells[, 2:4], shells$sex, apply, 2, sd)
## shells$sex: F
## length width height
## 21.24900 13.10465 8.04595
## ----------------------------------------------------
## shells$sex: M
## length width height
## 11.806103 7.074013 3.355452
# correlation matrix (excluding associated p-values testing "H0: rho == 0")
library(Hmisc)
rcorr(as.matrix(shells[, 2:4]))[[1]] # all
## length width height
## length 1.0000000 0.9778962 0.9628010
## width 0.9778962 1.0000000 0.9599055
## height 0.9628010 0.9599055 1.0000000
rcorr(as.matrix(shells[shells$sex == "F", 2:4]))[[1]] # females


## length width height


## length 1.0000000 0.9731162 0.9706748
## width 0.9731162 1.0000000 0.9659029
## height 0.9706748 0.9659029 1.0000000
rcorr(as.matrix(shells[shells$sex == "M", 2:4]))[[1]] # males
## length width height
## length 1.0000000 0.9501287 0.9470730
## width 0.9501287 1.0000000 0.9122648
## height 0.9470730 0.9122648 1.0000000

The features are positively correlated within each sex. The correlations between
pairs of features are similar for males and females. Females tend to be larger on
each feature. The distributions for length, width, and height are fairly symmetric
within sexes. No outliers are present. Although females are more variable on each
feature than males, the MANOVA assumptions do not appear to be grossly violated
here. (Additionally, you could consider transforming each dimension of the data in hopes of making the covariances between sexes more similar, though it may not be easy to find a good transformation to use.)
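As a quick, informal check of that suggestion (a sketch of my own, not part of the original analysis), one could compare the by-sex covariance matrices on the log scale; a log transformation often makes the spread more comparable when larger animals are also more variable.
# compare by-sex covariance matrices on the log scale (informal check only)
by(log(shells[, 2:4]), shells$sex, cov)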
pca.sh <- princomp(shells[, 2:4])
df.pca.sh <- data.frame(sex = shells$sex, pca.sh$scores)
str(df.pca.sh)
## 'data.frame': 48 obs. of 4 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Comp.1: num -31.4 -25.9 -23.6 -22 -17.1 ...
## $ Comp.2: num -2.27 -1.43 -4.44 -3.26 -3.11 ...
## $ Comp.3: num -0.943 0.73 -1.671 -1.618 -2.2 ...
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# put scatterplots on top so y axis is vertical
p <- ggpairs(df.pca.sh
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, title = "Principal components of Shells"
, progress=FALSE
)
print(p)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Figure: "Principal components of Shells" — ggpairs scatterplot matrix of the principal component scores (Comp.1, Comp.2, Comp.3) colored by sex.

For comparison with the MANOVA below, here are the univariate ANOVAs for
each feature. For the carapace data, the univariate ANOVAs indicate significant
differences between sexes on length, width, and height. Females are larger on average
than males on each feature.
# Univariate ANOVA tests, by each response variable
lm.sh <- lm(cbind(length, width, height) ~ sex, data = shells)
summary(lm.sh)
## Response length :
##
## Call:
## lm(formula = length ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.042 -10.667 1.271 11.927 40.958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.042 3.509 38.77 < 2e-16 ***
## sexM -22.625 4.962 -4.56 3.79e-05 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.19 on 46 degrees of freedom
## Multiple R-squared: 0.3113,Adjusted R-squared: 0.2963
## F-statistic: 20.79 on 1 and 46 DF, p-value: 3.788e-05
##
##
## Response width :
##
## Call:
## lm(formula = width ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5833 -5.5417 -0.4375 4.8854 29.4167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.583 2.149 47.725 < 2e-16 ***
## sexM -14.292 3.040 -4.701 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.53 on 46 degrees of freedom
## Multiple R-squared: 0.3246,Adjusted R-squared: 0.3099
## F-statistic: 22.1 on 1 and 46 DF, p-value: 2.376e-05
##
##
## Response height :
##
## Call:
## lm(formula = height ~ sex, data = shells)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0417 -2.7917 -0.7083 4.0417 14.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.042 1.258 41.360 < 2e-16 ***
## sexM -11.333 1.779 -6.369 8.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.164 on 46 degrees of freedom
## Multiple R-squared: 0.4686,Adjusted R-squared: 0.457
## F-statistic: 40.56 on 1 and 46 DF, p-value: 8.087e-08


# Alternatively, for many ANOVAs at once, it may be easier


# to select by column number, but you won't get the column names in the output.
# Also, the left-hand side needs to be a matrix data type.
# lm.sh <- lm(as.matrix(shells[, 2:4]) ~ shells[, 1])

A few procedures can be used for one-way MANOVA; two are manova() and the
car package’s Manova(). First we check the assumption of multivariate normality.
# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
# with observations as columns.

mshapiro.test(t(shells[shells$sex == "F", 2:4]))


##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.89324, p-value = 0.01551
mshapiro.test(t(shells[shells$sex == "M", 2:4]))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.93602, p-value = 0.1329
# Graphical Assessment of Multivariate Normality
f.mnv.norm.qqplot <- function(x, name = "") {
# creates a QQ-plot for assessing multivariate normality

x <- as.matrix(x) # n x p numeric matrix


center <- colMeans(x) # centroid
n <- nrow(x);
p <- ncol(x);
cov <- cov(x);
d <- mahalanobis(x, center, cov) # distances
qqplot(qchisq(ppoints(n), df=p), d
, main=paste("QQ Plot MV Normality:", name)
, ylab="Mahalanobis D2 distance"
, xlab="Chi-squared quantiles")
abline(a = 0, b = 1, col = "red")
}

f.mnv.norm.qqplot(shells[shells$sex == "F", 2:4], "Female")


f.mnv.norm.qqplot(shells[shells$sex == "M", 2:4], "Male")

Figure: "QQ Plot MV Normality: Female" and "QQ Plot MV Normality: Male" — Mahalanobis D² distances plotted against chi-squared quantiles for each sex, with the reference line y = x.

The curvature in the female sample causes us to reject normality, while the males do not deviate from normality. We'll proceed anyway, since this deviation from normality in the female sample will largely increase the variability of the sample rather than displace the mean greatly, and the sample sizes are somewhat large.

Multivariate test statistics. These four multivariate test statistics are among the most common for assessing differences across the levels of the categorical variables for a linear combination of responses. In general, Wilks' lambda is recommended unless there are problems with small total sample size, unequal sample sizes between groups, violations of assumptions, etc., in which case Pillai's trace is more robust.
Wilks' lambda (λ)
• Most commonly used statistic for overall significance
• Considers differences over all the characteristic roots
• The smaller the value of Wilks' lambda, the larger the between-groups dispersion
Pillai's trace
• Considers differences over all the characteristic roots
• More robust than Wilks'; should be used when the sample size is small, cell sizes are unequal, or homogeneity of covariances is violated
Hotelling's trace
• Considers differences over all the characteristic roots
Roy's greatest characteristic root
• Tests for differences on only the first discriminant function (Chapter 16)
• Most appropriate when responses are strongly interrelated on a single dimension
• Highly sensitive to violation of assumptions, but most powerful when all assumptions are met.
# Multivariate MANOVA test
# the specific test is specified in summary()
# test = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
man.sh <- manova(cbind(length, width, height) ~ sex, data = shells)
summary(man.sh, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## sex 1 0.38695 23.237 3 44 3.622e-09 ***
## Residuals 46
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# I prefer the output from the car package
library(car)
lm.man <- lm(cbind(length, width, height) ~ sex, data = shells)
man.sh <- Manova(lm.man)
summary(man.sh)
##
## Type II MANOVA Tests:
##
## Sum of squares and products for error:
## length width height
## length 13590.792 8057.500 4679.875
## width 8057.500 5100.792 2840.458
## height 4679.875 2840.458 1747.917
##
## ------------------------------------------
##
## Term: sex
##
## Sum of squares and products for the hypothesis:
## length width height
## length 6142.688 3880.188 3077.000
## width 3880.188 2451.021 1943.667
## height 3077.000 1943.667 1541.333
##
## Multivariate Tests: sex
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.6130506 23.23665 3 44 3.622e-09 ***
## Wilks 1 0.3869494 23.23665 3 44 3.622e-09 ***
## Hotelling-Lawley 1 1.5843173 23.23665 3 44 3.622e-09 ***
## Roy 1 1.5843173 23.23665 3 44 3.622e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The four MANOVA tests of no differences between sexes are all highly significant.
These tests reinforce the univariate analyses. Of the four tests, I prefer Roy’s test


because it has an intuitive interpretation. I will mostly ignore the other three tests
for discussion.

Roy’s test locates the linear combination of the features that produces the most
significant one-way ANOVA test for no differences among groups. If the groups are
not significantly different on the linear combination that best separates the groups in
a one-way ANOVA sense, then there is no evidence that the population mean vectors
are different. The critical value for Roy’s test accounts for the linear combination
being suggested by the data. That is, the critical value for Roy’s test is not the same
critical value that is used in a one-way ANOVA. The idea is similar to a Bonferroni-
type correction with multiple comparisons.

Roy’s method has the ability to locate linear combinations of the features on
which the groups differ, even when the differences across groups are not significant
on any feature. This is a reason for treating multivariate problems using multivariate
methods rather than through individual univariate analyses on each feature.
## For Roy's characteristic Root and vector
#str(man.sh)
H <- man.sh$SSP$sex # H = hypothesis matrix
# man.sh$df # hypothesis df
E <- man.sh$SSPE # E = error matrix
# man.sh$error.df # error df

# characteristic roots of (E inverse * H)


EinvH <- solve(E) %*% H # solve() computes the matrix inverse
ev <- eigen(EinvH) # eigenvalue/eigenvectors
ev
## eigen() decomposition
## $values
## [1] 1.584317e+00 1.401030e-15 -1.315838e-15
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.2151877 0.3563618 0.02975353
## [2,] 0.1014317 0.1651253 0.59173018
## [3,] -0.9712908 -0.9196412 -0.80558682

The characteristic vector (eigenvector) output gives one linear combination of


the features for each variable in the data set. By construction, the linear combinations
are uncorrelated (adjusting for groups). In general, the first a = minimum(p, k −
1) linear combinations contain information in decreasing amounts for distinguishing
among the groups. The first linear combination is used by Roy’s test.

The three linear combinations for the carapace data are (reading down the columns


in the matrix of eigenvectors)

D1 = 0.2152 Length + 0.1014 Width − 0.9713 Height
D2 = 0.3564 Length + 0.1651 Width − 0.9196 Height
D3 = 0.02975 Length + 0.5917 Width − 0.8056 Height.

Here p = 3 and k = 2 gives a = min(p, k − 1) = min(3, 2 − 1) = 1 so only D1 from


(D1, D2, and D3) contains information for distinguishing between male and female
painted turtles.
As in PCA, the linear combinations should be interpreted. However, do not discount the contribution of a feature with a small loading (though, in the shells example, only D1 matters, since D2 and D3 have essentially zero eigenvalues). In particular, the most important feature for distinguishing among the groups might have a small loading because of the measurement scale.
The following output shows the differences between sexes on D1. The separation
between sexes on D1 is greater than on any single feature. Females typically have
much larger D1 scores than males.
Since the matrix of eigenvectors is a rotation matrix¹, we can create the D linear combinations by matrix multiplication of the eigenvector (rotation) matrix with the original data (being careful about dimensions).
# linear combinations of features
D <- as.matrix(shells[,2:4]) %*% ev$vectors
colnames(D) <- c("D1", "D2", "D3")
df.D <- data.frame(sex = shells$sex, D)
str(df.D)
## 'data.frame': 48 obs. of 4 variables:
## $ sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ D1 : num -7.6 -6.22 -9.91 -9.48 -10.36 ...
## $ D2 : num 13.4 15.6 12.3 13 12.9 ...
## $ D3 : num 20.2 22.2 20.1 20.2 19.9 ...
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# put scatterplots on top so y axis is vertical
p <- ggpairs(df.D
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, title = "D1 is the linear combination that best distinguishes the sexes"
, progress=FALSE
)
print(p)

¹ http://en.wikipedia.org/wiki/Rotation_matrix

## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Figure: "D1 is the linear combination that best distinguishes the sexes" — ggpairs scatterplot matrix of the D1, D2, and D3 scores colored by sex.

# Univariate ANOVA tests, by D1 linear combination variable


lm.D.sh <- lm(D1 ~ sex, data = df.D)
summary(lm.D.sh)
##
## Call:
## lm(formula = D1 ~ sex, data = df.D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3047 -0.9791 0.3260 0.9570 4.6434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.8679 0.3884 -27.978 < 2e-16 ***


## sexM 4.6897 0.5493 8.537 4.84e-11 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.903 on 46 degrees of freedom
## Multiple R-squared: 0.6131,Adjusted R-squared: 0.6046
## F-statistic: 72.88 on 1 and 46 DF, p-value: 4.841e-11



Chapter 16

Discriminant Analysis

A researcher collected data on two external features for two (known) sub-species of
an insect. She can use discriminant analysis to find linear combinations of the
features that best distinguish the sub-species. The analysis can then be used to
classify insects with unknown sub-species origin into one of the two sub-species based
on their external features.

To see how this might be done, consider the following data plot. Can1 is the
linear combination of the two features that best distinguishes or discriminates the
two sub-species. The value of Can1 could be used to classify insects into one of the
two groups, as illustrated.


The method generalizes to more than two features and sub-species.

16.1 Canonical Discriminant Analysis


While there’s a connection between canonical discriminant analysis and canoni-
cal correlation, I prefer to emphasize the connection between canonical discriminant
analysis and MANOVA because these techniques are essentially identical.
Assume that you have representative samples from k groups, strata, or sub-
populations. Each selected individual is measured on p features (measurements)
X1 , X2 , . . . , Xp . As in MANOVA, canonical discriminant analysis assumes you have
independent samples from multivariate normal populations with identical variance-
covariance matrices.
Canonical discriminant analysis computes r = min(p, k − 1) linear combinations
of the features with the following properties. The first linear combination, called the
first linear discriminant function

Can1 = a11 X1 + a12 X2 + · · · + a1p Xp


gives the most significant F -test for a null hypothesis of no group differences in a
one-way ANOVA, among all linear combinations of the features. The second linear
combination or the second linear discriminant function:
Can2 = a21 X1 + a22 X2 + · · · + a2p Xp
gives the most significant F -test for no group differences in a one-way ANOVA, among
all linear combinations of the features that are uncorrelated (adjusting for groups)
with Can1. In general, the j th linear combination Canj (j = 1, 2, . . . , r) gives the
most significant F -test for no group differences in a one-way ANOVA, among all linear
combinations of the features that are uncorrelated with Can1, Can2, . . . , Can(j − 1).
The coefficients in the canonical discriminant functions can be multiplied by a
constant, or all the signs can be changed (that is, multiplied by the constant −1),
without changing their properties or interpretations.

16.2 Example: Owners of riding mowers


The manufacturer of a riding lawn mower wishes to identify the best prospects for
buying their product using data on the incomes (X1 ) and lot sizes (X2 ) of homeowners
(Johnson and Wichern, 1988). The data below are the incomes and lot sizes from
independent random samples of 12 current owners and 12 non-owners of the mowers.
#### Example: Riding mowers
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch16_mower.dat"
mower <- read.table(fn.data, header = TRUE)
# income = income in $1000
# lotsize = lot size in 1000 sq ft
# owner = nonowners or owners
str(mower)
## 'data.frame': 24 obs. of 3 variables:
## $ income : num 20 28.5 21.6 20.5 29 36.7 36 27.6 23 31 ...
## $ lotsize: num 9.2 8.4 10.8 10.4 11.8 9.6 8.8 11.2 10 10.4 ...
## $ owner : Factor w/ 2 levels "nonowner","owner": 2 2 2 2 2 2 2 2 2 2 ...

income lotsize owner income lotsize owner


1 20.00 9.20 owner 13 25.00 9.80 nonowner
2 28.50 8.40 owner 14 17.60 10.40 nonowner
3 21.60 10.80 owner 15 21.60 8.60 nonowner
4 20.50 10.40 owner 16 14.40 10.20 nonowner
5 29.00 11.80 owner 17 28.00 8.80 nonowner
6 36.70 9.60 owner 18 19.80 8.00 nonowner
7 36.00 8.80 owner 19 22.00 9.20 nonowner
8 27.60 11.20 owner 20 15.80 8.20 nonowner
9 23.00 10.00 owner 21 11.00 9.40 nonowner
10 31.00 10.40 owner 22 17.00 7.00 nonowner
11 17.00 11.00 owner 23 16.40 8.80 nonowner
12 27.00 10.00 owner 24 21.00 7.40 nonowner


library(ggplot2)

p <- ggplot(mower, aes(x = income, y = lotsize, shape = owner, colour = owner))


p <- p + geom_point(size = 3)
p <- p + scale_y_continuous(limits = c(0, 15))
p <- p + scale_x_continuous(limits = c(0, 40))
p <- p + coord_fixed(ratio = 1) # square axes (for perp lines)
p <- p + xlab("income in $1000")
p <- p + ylab("lot size in 1000 sq ft")
print(p)

Figure: scatterplot of lot size (1000 sq ft) versus income ($1000), with plotting symbol and color by ownership status.

#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(rev(mower)
, mapping = ggplot2::aes(colour = owner, alpha = 0.5)
, progress=FALSE
)
print(p)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Figure: ggpairs scatterplot matrix of owner, lotsize, and income, colored by ownership status.

Although the two groups overlap, the owners tend to have higher incomes and
larger lots than the non-owners. Income seems to distinguish owners and non-owners
better than lot size, but both variables seem to be useful for discriminating between
groups.
Qualitatively, one might classify prospects based on their location relative to a
roughly vertical line on the scatter plot. A discriminant analysis gives similar results
to this heuristic approach because the Can1 scores will roughly correspond to the
projection of the two features onto a line perpendicular to the hypothetical vertical
line. candisc() computes one discriminant function here because p = 2 and k = 2
gives r = min(p, k − 1) = min(2, 1) = 1.
Below we first fit a lm() and use that object to compare populations. First we
compare using univariate ANOVAs. The p-values are for one-way ANOVA comparing
owners to non-owners and both income and lotsize features are important individually
for distinguishing between the groups.
# first fit lm() with formula = continuous variables ~ factor variables
lm.mower <- lm(cbind(income, lotsize) ~ owner, data = mower)

# univariate ANOVA tests


summary(lm.mower)
## Response income :
##
## Call:
## lm(formula = income ~ owner, data = mower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4917 -3.8021 0.5875 2.5979 10.2083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.133 1.601 11.954 4.28e-11 ***
## ownerowner 7.358 2.264 3.251 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.545 on 22 degrees of freedom
## Multiple R-squared: 0.3245,Adjusted R-squared: 0.2938
## F-statistic: 10.57 on 1 and 22 DF, p-value: 0.003665
##
##
## Response lotsize :
##
## Call:
## lm(formula = lotsize ~ owner, data = mower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81667 -0.66667 -0.01667 0.71667 1.66667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8167 0.2984 29.55 < 2e-16 ***
## ownerowner 1.3167 0.4220 3.12 0.00498 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 22 degrees of freedom
## Multiple R-squared: 0.3068,Adjusted R-squared: 0.2753
## F-statistic: 9.736 on 1 and 22 DF, p-value: 0.004983

Second, the MANOVA indicates that the multivariate means are different, indicating that income and lotsize taken together are important for distinguishing between the groups.
# test whether the multivariate means of the two populations are different
library(car)


man.mo <- Manova(lm.mower)


summary(man.mo)
##
## Type II MANOVA Tests:
##
## Sum of squares and products for error:
## income lotsize
## income 676.31583 -26.41333
## lotsize -26.41333 23.50333
##
## ------------------------------------------
##
## Term: owner
##
## Sum of squares and products for the hypothesis:
## income lotsize
## income 324.87042 58.13083
## lotsize 58.13083 10.40167
##
## Multivariate Tests: owner
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.5386044 12.25704 2 21 0.00029701 ***
## Wilks 1 0.4613956 12.25704 2 21 0.00029701 ***
## Hotelling-Lawley 1 1.1673374 12.25704 2 21 0.00029701 ***
## Roy 1 1.1673374 12.25704 2 21 0.00029701 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we fit the canonical discriminant function with candisc(). The LR (likeli-
hood ratio) p-values below correspond to tests of no differences between groups on the
canonical discriminant functions. There is only one canonical discriminant function
here. The test of no differences based on the first canonical discriminant function is equivalent to Roy's MANOVA test.
# perform canonical discriminant analysis
library(candisc)
## Loading required package: heplots
##
## Attaching package: ’candisc’
## The following object is masked from ’package:stats’:
##
## cancor
can.mower <- candisc(lm.mower)
can.mower
##
## Canonical Discriminant Analysis for owner:
##
## CanRsq Eigenvalue Difference Percent Cumulative


## 1 0.5386 1.1673 100 100


##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.4614 2
The objects available from the candisc() object are named below, and we’ll soon
use a few. There are also a few plots available, but I’ll be creating other plots shortly.
names(can.mower) # list of objects in can.mower
## [1] "dfh" "dfe" "eigenvalues" "canrsq"
## [5] "pct" "rank" "ndim" "means"
## [9] "factors" "term" "terms" "coeffs.raw"
## [13] "coeffs.std" "structure" "scores"
# plot(can.mower) # this plot causes Rnw compile errors
# it would show box plots
# with proportional contribution of each variable to Can1

### can also plot 2D plots when have more than two groups (will use later)
## library(heplots)
#heplot(can.mower, scale=6, fill=TRUE)
#heplot3d(can.mower, scale=6, fill=TRUE)

The raw canonical coefficients define the canonical discriminant variables and are
identical to the feature loadings in a one-way MANOVA, except for an unimportant
multiplicative factor. Only Can1 is generated here.
can.mower$coeffs.raw
## Can1
## income -0.1453404
## lotsize -0.7590457
The means output gives the mean score on the canonical discriminant variables
by group, after centering the scores to have mean zero over all groups. These are in
order of the owner factor levels (nonowner, owner).
can.mower$means
## [1] 1.034437 -1.034437
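As a quick sanity check (a sketch, not from the original notes), the same two numbers can be recovered by averaging the Can1 scores within each level of owner.
# group means of the Can1 scores; should reproduce can.mower$means
aggregate(Can1 ~ owner, data = can.mower$scores, FUN = mean)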
The linear combination of income and lotsize that best distinguishes owners from non-owners,

Can1 = −0.1453 INCOME − 0.759 LOTSIZE,

is a weighted average of income and lotsize.
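A hands-on way to see where the scores come from (a sketch of my own, assuming candisc() centers each feature at its grand mean before applying the raw coefficients):
# reconstruct the Can1 scores by hand and compare with candisc()'s scores
X.c <- scale(mower[, c("income", "lotsize")], center = TRUE, scale = FALSE)
Can1.byhand <- X.c %*% can.mower$coeffs.raw
head(cbind(Can1.byhand, can.mower$scores$Can1))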
In the scatterplot below, Can1 is the direction indicated by the dashed line.
library(ggplot2)


# Scatterplots with Can1 line overlayed


p <- ggplot(mower, aes(x = income, y = lotsize, shape = owner, colour = owner))
p <- p + geom_point(size = 3)

# use a little algebra to determine the intercept and slopes of the


# Can1 line and a line perpendicular to it.

# dashed line of Can1


b1 <- can.mower$coeffs.raw[1]/can.mower$coeffs.raw[2] # slope
a1 <- mean(mower$lotsize) - b1 * mean(mower$income) - 3.5 # intercept
p <- p + geom_abline(intercept = a1, slope = b1, linetype = 2)
p <- p + annotate("text", x = 10, y = 6, label = "Can1"
, hjust = 0, vjust = 1, size = 4)

# solid line to separate groups (perpendicular to Can1)


b2 <- -can.mower$coeffs.raw[2]/can.mower$coeffs.raw[1] # slope
a2 <- mean(mower$lotsize) - b2 * mean(mower$income) - 4.5 # intercept
p <- p + geom_abline(intercept = a2, slope = b2, linetype = 1, alpha = 0.5)
p <- p + annotate("text", x = 22, y = 15, label = "Perp to Can1 for discrim"
, hjust = 0, vjust = 1, size = 4)

p <- p + scale_y_continuous(limits = c(0, 15))


p <- p + scale_x_continuous(limits = c(0, 40))
p <- p + coord_fixed(ratio = 1) # square axes (for perp lines)
p <- p + xlab("income in $1000")
p <- p + ylab("lot size in 1000 sq ft")
print(p)

Figure: the mower scatterplot (lot size vs. income, by ownership status) with the Can1 direction drawn as a dashed line ("Can1") and a solid line perpendicular to Can1 ("Perp to Can1 for discrim") separating the groups.

# Plots of Can1
p1 <- ggplot(can.mower$scores, aes(x = Can1, fill = owner))
p1 <- p1 + geom_histogram(binwidth = 2/3, alpha = 0.5, position="identity")
p1 <- p1 + scale_x_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
p1 <- p1 + geom_rug(aes(colour = owner))
#p1 <- p1 + labs(title = "Can1 for mower data")
#print(p1)

p2 <- ggplot(can.mower$scores, aes(y = Can1, x = owner, fill = owner))


p2 <- p2 + geom_boxplot(alpha = 0.5)
# add a "+" at the mean
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)


p2 <- p2 + geom_point()
p2 <- p2 + coord_flip()
p2 <- p2 + scale_y_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
#p2 <- p2 + labs(title = "Can1 for mower data")
#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=2, top = "Can1 for mower data")
## Warning: Removed 2 rows containing missing values (geom bar).

Figure: "Can1 for mower data" — histogram and boxplot of the Can1 scores by ownership status.

The standardized coefficients (use the pooled within-class coefficients) indicate the
relative contributions of the features to the discrimination. The standardized coeffi-
cients are roughly equal, which suggests that income and lotsize contribute similarly
to distinguishing the owners from non-owners.
can.mower$coeffs.std
## Can1
## income -0.8058419
## lotsize -0.7845512
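The standardized coefficients above are just the raw coefficients rescaled. A quick check (a sketch of my own, using the Manova() error matrix man.mo$SSPE computed earlier) multiplies each raw coefficient by the pooled within-group standard deviation of its feature.
# pooled within-group SDs from the error SSP matrix (df.residual = n - k)
sd.pooled <- sqrt(diag(man.mo$SSPE) / lm.mower$df.residual)
can.mower$coeffs.raw * sd.pooled   # should reproduce can.mower$coeffs.std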
The p-value of 0.0004 on the likelihood ratio test indicates that Can1 strongly
distinguishes between owners and non-owners. This is consistent with the separation
between owners and non-owners in the boxplot of Can1 scores.
I noted above that Can1 is essentially the same linear combination given in a
MANOVA comparison of owners to non-owners. Here is some Manova() output to
support this claim. The MANOVA test p-values agree with the candisc output (as
we saw earlier). The first characteristic vector from the MANOVA is given here.
## For Roy's characteristic Root and vector
H <- man.mo$SSP$owner # H = hypothesis matrix
E <- man.mo$SSPE # E = error matrix
# characteristic roots of (E inverse * H)
EinvH <- solve(E) %*% H # solve() computes the matrix inverse
ev <- eigen(EinvH) # eigenvalue/eigenvectors
ev
## eigen() decomposition
## $values
## [1] 1.167337 0.000000
##
## $vectors
## [,1] [,2]


## [1,] 0.1880613 -0.1761379


## [2,] 0.9821573 0.9843655
mult.char.can.disc <- can.mower$coeffs.raw[1] / ev$vectors[1,1]
mult.char.can.disc
## [1] -0.7728352
The first canonical discriminant function is obtained by multiplying the first char-
acteristic vector given in MANOVA by -0.7728:

Can1 = −0.1453 INCOME − 0.759 LOTSIZE
     = −0.7728 (0.1881 INCOME + 0.9822 LOTSIZE)
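This relationship is easy to verify numerically (a small sketch): rescaling the first characteristic vector by the constant computed above should reproduce the raw canonical coefficients.
# rescale the first eigenvector and compare with candisc()'s raw coefficients
mult.char.can.disc * ev$vectors[, 1]
can.mower$coeffs.raw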

16.3 Discriminant Analysis on Fisher’s Iris Data


Fisher’s iris data consists of samples of 50 flowers from each of three species of iris:
Setosa, Versicolor, and Virginica. Four measurements (in cm) were taken on each flower: sepal length, sepal width, petal length, and petal width.

The plots show big differences between Setosa and the other two species. The
differences between Versicolor and Virginica are smaller, and appear to be mostly
due to differences in the petal widths and lengths.
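A quick numerical companion to that observation (a sketch of my own) is the table of species means for the four features; it shows the large Setosa gaps and the smaller Versicolor-Virginica differences concentrated in the petal measurements.
# species means of the four iris features
aggregate(. ~ Species, data = iris, FUN = mean)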
#### Example: Fisher's iris data
# The "iris" dataset is included with R in the library(datasets)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))


library(GGally)
p <- ggpairs(iris[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Figure: scatterplot matrix (ggpairs) of the four iris features coloured by Species, with histograms/densities on the diagonal and overall and within-species correlations above the diagonal (e.g., Petal.Length and Petal.Width have overall correlation 0.963).

## parallel coordinate plot


library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)


# univariate min/max scaling


p1 <- ggparcoord(
data = iris
, columns = 1:4
, groupColumn = 5
, order = "anyClass"
, scale = "uniminmax" # "uniminmax". "globalminmax"
, showPoints = FALSE
, title = "uniminmax scaling"
, alphaLines = 1/3
#, shadeBox = "white"
, boxplot = TRUE
) #+ theme_bw()

# global min/max scaling


p2 <- ggparcoord(
data = iris
, columns = 1:4
, groupColumn = 5
, order = "anyClass"
, scale = "globalminmax" # "uniminmax". "globalminmax"
, showPoints = FALSE
, title = "globalminmax scaling"
, alphaLines = 1/3
#, shadeBox = "white"
, boxplot = TRUE
) #+ theme_bw()

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=2, top = "Parallel Coordinate Plots of Iris data")

# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)


Figure (Parallel Coordinate Plots of Iris data): uniminmax scaling (left) and globalminmax scaling (right), lines coloured by Species with boxplots overlaid.

candisc() was used to discriminate among species. There are k = 3 species and
p = 4 features, so the number of discriminant functions is min(p, k − 1) = min(4, 2) = 2.
# first fit lm() with formula = continuous variables ~ factor variables
lm.iris <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species
, data = iris)

## univariate ANOVA tests


#summary(lm.iris)
## test whether the multivariate means of the two populations are different
#library(car)
#man.mo <- Manova(lm.iris)
#summary(man.mo)
# perform canonical discriminant analysis

library(candisc)
can.iris <- candisc(lm.iris)
can.iris$coeffs.raw
## Can1 Can2
## Sepal.Length -0.8293776 0.02410215
## Sepal.Width -1.5344731 2.16452123
## Petal.Length 2.2012117 -0.93192121
## Petal.Width 2.8104603 2.83918785

Can1 is a comparison of petal and sepal measurements (from Raw Canonical


Coefficients):

Can1 = −0.8294 sepalL − 1.534 sepalW + 2.201 petalL + 2.81 petalW.

Can2 is not easily interpreted, though it is perhaps a comparison of lengths and widths
ignoring sepalL:

Can2 = 0.0241 sepalL + 2.165 sepalW − 0.9319 petalL + 2.839 petalW.

The canonical directions provide a maximal separation of the species. Two cutpoints
along Can1 provide a simple classification rule.
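As a rough illustration of such a rule (a minimal sketch, not part of the formal analysis below), one could classify each flower to the species whose mean Can1 score is closest, using the can.iris object fitted above:
# sketch: nearest-species-mean classification using Can1 alone
# (assumes the can.iris object from the candisc() fit above)
can1.means <- tapply(can.iris$scores$Can1, can.iris$scores$Species, mean)
can1.dist  <- abs(outer(can.iris$scores$Can1, can1.means, "-"))   # |score - species mean|
pred.can1  <- names(can1.means)[apply(can1.dist, 1, which.min)]
table(can.iris$scores$Species, pred.can1)   # resubstitution classification table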
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(can.iris$scores
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)


Figure: scatterplot matrix (ggpairs) of Species, Can1, and Can2; the three species separate almost entirely along Can1.

There are significant differences among species on both discriminant functions; see
the p-values under the likelihood ratio tests. Of course, Can1 produces the largest
differences — the overlap among species on Can1 is small. Setosa has the lowest Can1
scores because this species has the smallest petal measurements relative to its sepal
measurements. Virginica has the highest Can1 scores.
can.iris
##
## Canonical Discriminant Analysis for Species:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.96987 32.19193 31.907 99.12126 99.121
## 2 0.22203 0.28539 31.907 0.87874 100.000
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.02344 199.145 8 288 < 2.2e-16 ***


## 2 0.77797 13.794 3 145 5.794e-08 ***


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions:
1. What is the most striking feature of the plot of the Can1 scores?
2. Does the assumption of equal population covariance matrices across species
seem plausible?
3. How about multivariate normality?
# Covariance matrices by species
by(iris[,1:4], iris$Species, cov)
## iris$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.12424898 0.099216327 0.016355102 0.010330612
## Sepal.Width 0.09921633 0.143689796 0.011697959 0.009297959
## Petal.Length 0.01635510 0.011697959 0.030159184 0.006069388
## Petal.Width 0.01033061 0.009297959 0.006069388 0.011106122
## ----------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.26643265 0.08518367 0.18289796 0.05577959
## Sepal.Width 0.08518367 0.09846939 0.08265306 0.04120408
## Petal.Length 0.18289796 0.08265306 0.22081633 0.07310204
## Petal.Width 0.05577959 0.04120408 0.07310204 0.03910612
## ----------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.40434286 0.09376327 0.30328980 0.04909388
## Sepal.Width 0.09376327 0.10400408 0.07137959 0.04762857
## Petal.Length 0.30328980 0.07137959 0.30458776 0.04882449
## Petal.Width 0.04909388 0.04762857 0.04882449 0.07543265
# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
# with observations as columns.

mshapiro.test(t(iris[iris$Species == "setosa" , 1:4]))


##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.95878, p-value = 0.07906
mshapiro.test(t(iris[iris$Species == "versicolor", 1:4]))
##
## Shapiro-Wilk normality test


##
## data: Z
## W = 0.93043, p-value = 0.005739
mshapiro.test(t(iris[iris$Species == "virginica" , 1:4]))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.93414, p-value = 0.007955
# Graphical Assessment of Multivariate Normality
f.mnv.norm.qqplot <- function(x, name = "") {
# creates a QQ-plot for assessing multivariate normality

x <- as.matrix(x) # n x p numeric matrix


center <- colMeans(x) # centroid
n <- nrow(x);
p <- ncol(x);
cov <- cov(x);
d <- mahalanobis(x, center, cov) # distances
qqplot(qchisq(ppoints(n), df=p), d
, main=paste("QQ Plot MV Normality:", name)
, ylab="Mahalanobis D2 distance"
, xlab="Chi-squared quantiles")
abline(a = 0, b = 1, col = "red")
}

par(mfrow=c(1,3))
f.mnv.norm.qqplot(iris[iris$Species == "setosa" , 1:4], "setosa" )
f.mnv.norm.qqplot(iris[iris$Species == "versicolor", 1:4], "versicolor")
f.mnv.norm.qqplot(iris[iris$Species == "virginica" , 1:4], "virginica" )
par(mfrow=c(1,1))

Figure: chi-squared QQ-plots of the Mahalanobis D2 distances for assessing multivariate normality, one panel per species (setosa, versicolor, virginica).



Chapter 17

Classification

A goal with discriminant analysis might be to classify individuals of unknown origin
into one of several known groups. How should this be done? Recall our example where
two subspecies of an insect are compared on two external features. The discriminant
analysis gives one discriminant function (CAN1) for distinguishing the subspecies.
It makes sense to use CAN1 to classify insects because CAN1 is the best (linear)
combination of the features to distinguish between subspecies.

Given the score on CAN1 for each insect to be classified, assign insects to the sub-
species that they most resemble. Similarity is measured by the distance on CAN1 to
the average CAN1 scores for the two subspecies, identified by X’s on the plot.


To classify using r discriminant functions, compute the average response on CAN1,
. . . , CANr in each sample. Then compute the canonical discriminant function scores
for each individual to be classified. Each observation is classified into the group it is
closest to, as measured by the distance from the observation in r-space to the sample
mean vector on the canonical variables. How do you measure distance in r-space?

The plot below illustrates the idea with the r = 2 discriminant functions in Fisher’s
iris data: Obs 1 is classified as Versicolor and Obs 2 is classified as Setosa.


Figure: Can2 versus Can1 scores for Fisher's iris data, coloured by Species, with the two new observations marked: obs 1 falls among the versicolor scores and obs 2 among the setosa scores.

17.1 Classification using Mahalanobis distance


Classification using discrimination is based on the original features, not the
canonical discriminant function scores. Although the linear classification rule described
here is identical to the method I just outlined, the equivalence is not obvious. I will
discuss the method without justifying the equivalence.
Suppose p features X = (X1, X2, . . . , Xp)′ are used to discriminate among the
k groups. Let X̄i = (X̄i1, X̄i2, . . . , X̄ip)′ be the vector of mean responses for the ith
sample, and let Si be the p-by-p variance-covariance matrix for the ith sample. The
pooled variance-covariance matrix is given by

S = [(n1 − 1)S1 + (n2 − 1)S2 + · · · + (nk − 1)Sk] / (n − k),

where the ni are the group sample sizes and n = n1 + n2 + · · · + nk is the total sample
size.
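A minimal sketch of this pooled covariance computation in R, using the iris features and species purely for illustration:
# sketch: pooled within-group covariance matrix S (iris used for illustration)
X    <- iris[, 1:4]                   # features
grp  <- iris$Species                  # group labels
ns   <- table(grp)                    # group sample sizes n_i
covs <- lapply(levels(grp), function(g) cov(X[grp == g, ]))   # S_i, one per group
S.pool <- Reduce("+", mapply(function(Si, ni) (ni - 1) * Si, covs, ns,
                             SIMPLIFY = FALSE)) / (sum(ns) - length(ns))
S.pool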
The classification rules below can be defined using either the Mahalanobis gen-
eralized squared distance or in terms of a probability model. I will describe each,
starting with the Mahalanobis or M -distance.
The M -distance from an observation X to (the center of) the ith sample is

Di²(X) = (X − X̄i)′ S⁻¹ (X − X̄i),


where (X − X̄i)′ is the transpose of the column vector (X − X̄i), and S⁻¹ is the
matrix inverse of S. Note that if S is the identity matrix (a matrix with 1s on the
diagonal and 0s on the off-diagonals), then this is the Euclidean distance. Given the
M-distance from X to each sample, classify X into the group which has the minimum
M-distance.
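A minimal sketch of this rule using mahalanobis(), continuing the iris illustration and reusing X, grp, and S.pool from the sketch above (the worked examples later in this chapter use lda() instead):
# sketch: classify each observation into the group with the minimum M-distance
# (X, grp, and S.pool come from the pooled-covariance sketch above)
grp.means <- lapply(levels(grp), function(g) colMeans(X[grp == g, ]))   # group mean vectors
D2 <- sapply(grp.means, function(m) mahalanobis(X, m, S.pool))          # n x k matrix of D_i^2
colnames(D2) <- levels(grp)
pred <- colnames(D2)[apply(D2, 1, which.min)]
table(grp, pred)    # resubstitution classification table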
The M -distance is an elliptical distance measure that accounts for correlation
between features, and adjusts for different scales by standardizing the features to
have unit variance. The picture below (left) highlights the idea when p = 2. All of
the points on a given ellipse are the same M-distance to the center (X̄1, X̄2)′. As the
ellipse expands, the M-distance to the center increases.

Figure: left, ellipses of constant M-distance to the center (X̄1, X̄2)′ = (3, 5)′ when Corr(X1, X2) > 0; right, three groups in (X1, X2) space with observations 1, 2, and 3 to be classified.

To see how classification works, suppose you have three groups and two features,
as in the plot above (right). Observation 1 is closest in M-distance to the center
of group 3. Observation 2 is closest to group 1. Thus, classify observations 1 and
2 into groups 3 and 1, respectively. Observation 3 is closest to the center of group
2 in terms of the standard Euclidean (walking) distance. However, observation 3 is
more similar to data in group 1 than it is to either of the other groups. The M -
distance from observation 3 to group 1 is substantially smaller than the M -distances
to either group 2 or 3. The M -distance accounts for the elliptical cloud of data within
each group, which reflects the correlation between the two features. Thus, you would
classify observation 3 into group 1.
The M -distance from the ith group to the j th group is the M -distance between
the centers of the groups:

D²(i, j) = D²(j, i) = (X̄i − X̄j)′ S⁻¹ (X̄i − X̄j).

Larger values suggest relatively better potential for discrimination between groups.


In the plot above, D²(1, 2) < D²(1, 3), which implies that it should be easier to
distinguish between groups 1 and 3 than between groups 1 and 2.
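Continuing the same illustration, the between-group M-distances can be computed directly from the group centers and the pooled covariance (a minimal sketch reusing X, grp, and S.pool from above):
# sketch: pairwise M-distances between group centers
grps        <- levels(grp)
centers.mat <- t(sapply(grps, function(g) colMeans(X[grp == g, ])))   # k x p matrix of means
D2.between  <- outer(seq_along(grps), seq_along(grps),
                     Vectorize(function(i, j)
                       mahalanobis(centers.mat[i, ], centers.mat[j, ], S.pool)))
dimnames(D2.between) <- list(grps, grps)
round(D2.between, 2)    # larger values suggest better potential for discrimination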
M -distance classification is equivalent to classification based on a probability
model that assumes the samples are independently selected from multivariate nor-
mal populations with identical covariance matrices. This assumption is consistent
with the plot above where the data points form elliptical clouds with similar orienta-
tions and spreads across samples. Suppose you can assume a priori (without looking
at the data for the individual that you wish to classify) that a randomly selected in-
dividual from the combined population (i.e., merge all sub-populations) is equally
likely to be from any group:
PRIORj ≡ Pr(observation is from group j) = 1/k,
where k is the number of groups. Then, given the observed features X for an indi-
vidual

Pr(j|X) ≡ Pr(observation is from group j given X)
        = exp{−0.5 Dj²(X)} / Σk exp{−0.5 Dk²(X)}.

To be precise, I will note that Pr(j|X) is unknown, and the expression for Pr(j|X) is
an estimate based on the data.
The group with the largest posterior probability Pr(j|X) is the group into which
X is classified. Maximizing Pr(j|X) across groups is equivalent to minimizing the
M-distance Dj²(X) across groups, so the two classification rules are equivalent.
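Continuing the sketch from above, the estimated posterior probabilities under equal priors follow directly from the matrix of M-distances, and the equivalence of the two rules can be checked numerically:
# sketch: posterior probabilities with equal priors, from the D2 matrix above
post <- exp(-0.5 * D2)
post <- post / rowSums(post)     # normalize so each row sums to 1
head(round(post, 3))
# the two rules agree: maximum posterior = minimum M-distance
all.equal(unname(apply(post, 1, which.max)), unname(apply(D2, 1, which.min)))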

17.2 Evaluating the Accuracy of a Classification Rule
The misclassification rate, or the expected proportion of misclassified observations,
is a good yardstick to gauge a classification rule. Better rules have smaller misclassi-
fication rates, but there is no universal cutoff for what is considered good in a given
problem. You should judge a classification rule relative to the current standards in
your field for “good classification”.
Resubstitution evaluates the misclassification rate using the data from which
the classification rule is constructed. The resubstitution estimate of the error rate is
optimistic (too small). A greater percentage of misclassifications is expected when
the rule is used on new data, or on data from which the rule is not constructed.
Cross-validation is a better way to estimate the misclassification rate. In many
statistical packages, you can implement cross-validation by randomly splitting the


data into a training or calibration set from which the classification rule is con-
structed. The remaining data, called the test data set, is used with the classification
rule to estimate the error rate. In particular, the proportion of test cases misclassified
estimates the misclassification rate. This process is often repeated, say 10 times, and
the error rate estimated to be the average of the error rates from the individual splits.
With repeated random splitting, it is common to use 10% of each split as the test
data set (a 10-fold cross-validation).
Repeated random splitting is straightforward to code. As an alternative, you might consider
using one random 50-50 split (a 2-fold) to estimate the misclassification rate, provided
you have a reasonably large data set; a sketch of a single split is given below.
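A minimal sketch of one such random 50-50 split, using the iris data and lda() purely for illustration:
# sketch: one random 50-50 split (2-fold) to estimate the misclassification rate
library(MASS)
set.seed(1)    # for a reproducible split
ind   <- sample.int(nrow(iris), size = nrow(iris) / 2)   # indices of the training half
train <- iris[ ind, ]
test  <- iris[-ind, ]
lda.tr  <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
             , data = train)                        # rule built on the training half
pred.te <- predict(lda.tr, newdata = test)$class    # classify the held-out half
mean(pred.te != test$Species)                       # estimated misclassification rate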
Another form of cross-validation uses a jackknife method where single cases are
held out of the data (an n-fold), then classified after constructing the classification rule
from the remaining data. The process is repeated for each case, giving an estimated
misclassification rate as the proportion of cases misclassified.
The lda() function allows for jackknife cross-validation (CV) and cross-validation
using a single test data set (predict()). The jackknife method is necessary with small
data sets so that single observations don't greatly bias the classification. You can also
classify observations with unknown group membership, by treating the observations
to be classified as a test data set.
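A minimal sketch of that last idea, with two hypothetical flowers of unknown species and the lda.tr fit from the previous sketch:
# sketch: classify observations of unknown origin by treating them as test data
# (new.obs is hypothetical; lda.tr is the fit from the previous sketch)
new.obs <- data.frame(Sepal.Length = c(5.0, 6.5)
                    , Sepal.Width  = c(3.5, 3.0)
                    , Petal.Length = c(1.4, 5.5)
                    , Petal.Width  = c(0.2, 2.0))
pred.new <- predict(lda.tr, newdata = new.obs)
pred.new$class                    # predicted species
round(pred.new$posterior, 3)      # posterior probabilities for each species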

17.3 Example: Carapace classification and error


#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch15_shells_mf.dat"
shells <- read.table(fn.data, header = TRUE)
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
# put scatterplots on top so y axis is vertical
p <- ggpairs(shells
, mapping = ggplot2::aes(colour = sex, alpha = 0.5)
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)


## 3D scatterplot
library(scatterplot3d)
with(shells, {
scatterplot3d(x = length
, y = width
, z = height
, main = "Shells 3D Scatterplot"
, type = "h" # lines to the horizontal xy-plane
, color = as.integer(sex) # color by group
, pch = as.integer(sex)+19 # plotting character by group
#, highlight.3d = TRUE # makes color change with z-axis value
, angle = 100 # viewing angle (seems hard to control)
)
})

#### Try this!


#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height, col = sex) })
Figure: left, scatterplot matrix (ggpairs) of length, width, and height coloured by sex, with overall correlations of 0.96 to 0.978 and within-sex correlations of about 0.91 to 0.97; right, 3D scatterplot ("Shells 3D Scatterplot") of length, width, and height by sex.

As suggested in the partimat() plot below (and by an earlier analysis), classification
based on the length and height of the carapace appears best (considering only
pairs). For this example, we'll only consider those two features.
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
##
## Attaching package: ’klaR’
## The following object is masked from ’package:TeachingDemos’:


##
## triplot
partimat(sex ~ length + width + height, data = shells
, plot.matrix = TRUE)
Figure: partimat() pairwise classification plots for sex using length, width, and height; approximate error rates are 0.188 (length, width), 0.083 (length, height), and 0.146 (width, height).

The default linear discriminant analysis assumes equal prior probabilities for males
and females.
library(MASS)
lda.sh <- lda(sex ~ length + height, data = shells)
lda.sh
## Call:
## lda(sex ~ length + height, data = shells)
##
## Prior probabilities of groups:
## F M
## 0.5 0.5
##
## Group means:
## length height
## F 136.0417 52.04167
## M 113.4167 40.70833
##
## Coefficients of linear discriminants:
## LD1
## length 0.1370519
## height -0.4890769

The linear discriminant function is in the direction that best separates the sexes,

LD1 = 0.1371 length − 0.4891 height.
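As a check, the LD1 scores returned by predict() are this linear combination of the features, up to a constant centering shift; a minimal sketch, assuming the lda.sh fit and shells data from above:
# sketch: hand-computed LD1 scores match predict() up to a constant shift
ld1.hand <- drop(as.matrix(shells[, c("length", "height")]) %*% lda.sh$scaling)
ld1.pred <- drop(predict(lda.sh)$x)
head(data.frame(hand    = ld1.hand - mean(ld1.hand)
              , predict = ld1.pred - mean(ld1.pred)))   # columns should agree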


The plot of the lda object shows the groups across the linear discriminant func-
tion. From the klaR package we can get color-coded classification areas based on a
perpendicular line across the LD function.
plot(lda.sh, dimen = 1, type = "both", col = as.numeric(shells$sex))

Figure: plot(lda.sh, dimen = 1) output: histograms of the LD1 scores for group F (top) and group M (bottom).

The constructed table gives the jackknife-based classification and posterior prob-
abilities of being male or female for each observation in the data set. The misclassi-
fication rate follows.
# CV = TRUE does jackknife (leave-one-out) crossvalidation
lda.sh.cv <- lda(sex ~ length + height, data = shells, CV = TRUE)

# Create a table of classification and posterior probabilities for each observation


classify.sh <- data.frame(sex = shells$sex
, class = lda.sh.cv$class
, error = ""
, round(lda.sh.cv$posterior,3))
colnames(classify.sh) <- c("sex", "class", "error"
, paste("post", colnames(lda.sh.cv$posterior), sep=""))
# "postF" and "postM" column names

# error column
classify.sh$error <- as.character(classify.sh$error)
classify.agree <- as.character(as.numeric(shells$sex) - as.numeric(lda.sh.cv$class))
classify.sh$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

The classification summary table is constructed from the canonical discriminant
functions by comparing the predicted group membership for each observation
to its actual group label. To be precise, if you assume that the sexes of the 24 males
are unknown, then you would classify each of them correctly. Similarly, 20 of the


24 females are classified correctly, with the other four classified as males. The Total
Error of 0.0833 is the estimated misclassification rate, computed as the sum of
Rate × Prior over sexes: 0.0833 = 0.1667 × 0.5 + 0 × 0.5. Are the misclassification
results sensible, given the data plots that you saw earlier?

The listing of the posterior probabilities for each sex, by case, gives you an idea
of the clarity of classification, with larger differences between the male and female
posteriors corresponding to more definitive (but not necessarily correct!) classifica-
tions.
# print table
classify.sh
## sex class error postF postM
## 1 F M -1 0.166 0.834
## 2 F M -1 0.031 0.969
## 3 F F 0.847 0.153
## 4 F F 0.748 0.252
## 5 F F 0.900 0.100
## 6 F F 0.992 0.008
## 7 F F 0.517 0.483
## 8 F F 0.937 0.063
## 9 F F 0.937 0.063
## 10 F F 0.937 0.063
## 11 F M -1 0.184 0.816
## 12 F M -1 0.294 0.706
## 13 F F 0.733 0.267
## 14 F F 0.733 0.267
## 15 F F 0.917 0.083
## 16 F F 0.994 0.006
## 17 F F 0.886 0.114
## 18 F F 0.864 0.136
## 19 F F 1.000 0.000
## 20 F F 0.998 0.002
## 21 F F 1.000 0.000
## 22 F F 1.000 0.000
## 23 F F 0.993 0.007
## 24 F F 0.999 0.001
## 25 M M 0.481 0.519
## 26 M M 0.040 0.960
## 27 M M 0.020 0.980
## 28 M M 0.346 0.654
## 29 M M 0.092 0.908
## 30 M M 0.021 0.979
## 31 M M 0.146 0.854
## 32 M M 0.078 0.922
## 33 M M 0.018 0.982
## 34 M M 0.036 0.964
## 35 M M 0.026 0.974


## 36 M M 0.019 0.981
## 37 M M 0.256 0.744
## 38 M M 0.023 0.977
## 39 M M 0.023 0.977
## 40 M M 0.012 0.988
## 41 M M 0.002 0.998
## 42 M M 0.175 0.825
## 43 M M 0.020 0.980
## 44 M M 0.157 0.843
## 45 M M 0.090 0.910
## 46 M M 0.067 0.933
## 47 M M 0.081 0.919
## 48 M M 0.074 0.926
# Assess the accuracy of the prediction
pred.freq <- table(shells$sex, lda.sh.cv$class) # row = true sex, col = classified sex
pred.freq
##
## F M
## F 20 4
## M 0 24
prop.table(pred.freq, 1) # proportions by row
##
## F M
## F 0.8333333 0.1666667
## M 0.0000000 1.0000000
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## F M
## 0.8333333 1.0000000
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.9166667
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.08333333

17.4 Example: Fisher’s Iris Data cross-validation


I will illustrate cross-validation on Fisher’s iris data first using a test data set, and
then using the jackknife method. The 150 observations were randomly rearranged
and separated into two batches of 75. The 75 observations in the calibration set
were used to develop a classification rule. This rule was applied to the remaining 75
flowers, which form the test data set. There is no general rule about the relative sizes


of the test data and the training data. Many researchers use a 50-50 split. Regardless
of the split, you should combine the two data sets at the end of the cross-validation
to create the actual rule for classifying future data.
Below, half of the indices of the iris data set are randomly selected and assigned
the label “test”, while the rest are labelled “train”. A plot indicates the two sub-
samples are similar.
#### Example: Fisher's iris data
# The "iris" dataset is included with R in the library(datasets)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Randomly assign equal train/test by Species strata
library(plyr)
iris <- ddply(iris, .(Species), function(X) {
ind <- sample.int(nrow(X), size = round(nrow(X)/2))
sort(ind)
X$test <- "train"
X$test[ind] <- "test"
X$test <- factor(X$test)
X$test
return(X)
})
summary(iris$test)
## test train
## 75 75
table(iris$Species, iris$test)
##
## test train
## setosa 25 25
## versicolor 25 25
## virginica 25 25
## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(subset(iris, test == "train")[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, title = "train"
, progress=FALSE
)
print(p)


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p <- ggpairs(subset(iris, test == "test")[,c(5,1,2,3,4)]
, mapping = ggplot2::aes(colour = Species, alpha = 0.5)
, title = "test"
, progress=FALSE
)
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

Figure: scatterplot matrices (ggpairs) of the training subsample (left) and test subsample (right), coloured by Species; the two subsamples show similar patterns.

As suggested in the partimat() plot below, we should expect Sepal.Length to
contribute relatively little to the classification, since (pairwise) more errors are
introduced with that variable than with other pairs. (In fact, we'll see below that
the coefficient for Sepal.Length is the smallest.)
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = subset(iris, test == "train")
, plot.matrix = TRUE)


Figure: partimat() pairwise classification plots for Species on the training set, with approximate pairwise error rates between 0.04 and 0.227 (the largest for the Sepal.Length, Sepal.Width pair).

library(MASS)
lda.iris <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = subset(iris, test == "train"))
lda.iris
## Call:
## lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
## data = subset(iris, test == "train"))
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.960 3.400 1.448 0.240
## versicolor 5.968 2.792 4.304 1.344
## virginica 6.512 2.944 5.508 2.020
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.598122 -0.3320011


## Sepal.Width 1.684216 2.0882107


## Petal.Length -2.287078 -0.8035116
## Petal.Width -2.297160 2.8095820
##
## Proportion of trace:
## LD1 LD2
## 0.9934 0.0066
The linear discriminant functions that best classify the Species in the training set
are

LD1 = 0.5981 sepalL + 1.684 sepalW − 2.287 petalL − 2.297 petalW
LD2 = −0.332 sepalL + 2.088 sepalW − 0.8035 petalL + 2.81 petalW.

The plots of the lda object show the data on the LD scale.
plot(lda.iris, dimen = 1, col = as.numeric(iris$Species))
plot(lda.iris, dimen = 2, col = as.numeric(iris$Species))
#pairs(lda.iris, col = as.numeric(iris$Species))
Figure: plot(lda.iris) output: histograms of the LD1 scores by group (dimen = 1) and the LD1 versus LD2 scatterplot labelled by species (dimen = 2).

# CV = TRUE does jackknife (leave-one-out) crossvalidation


lda.iris.cv <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = subset(iris, test == "train"), CV = TRUE)

# Create a table of classification and posterior probabilities for each observation


classify.iris <- data.frame(Species = subset(iris, test == "train")$Species
, class = lda.iris.cv$class
, error = ""
, round(lda.iris.cv$posterior,3))
colnames(classify.iris) <- c("Species", "class", "error"
, paste("post", colnames(lda.iris.cv$posterior), sep=""))


# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(subset(iris, test == "train")$Species)
- as.numeric(lda.iris.cv$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

The misclassification error is low within the training set.


# print table
#classify.iris

# Assess the accuracy of the prediction


# row = true Species, col = classified Species
pred.freq <- table(subset(iris, test == "train")$Species, lda.iris.cv$class)
pred.freq
##
## setosa versicolor virginica
## setosa 25 0 0
## versicolor 0 23 2
## virginica 0 1 24
prop.table(pred.freq, 1) # proportions by row
##
## setosa versicolor virginica
## setosa 1.00 0.00 0.00
## versicolor 0.00 0.92 0.08
## virginica 0.00 0.04 0.96
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## setosa versicolor virginica
## 1.00 0.92 0.96
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.96
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.04

How well do the LD functions constructed on the training data predict the
Species in the independent test data?
# predict the test data from the training data LDFs
pred.iris <- predict(lda.iris, newdata = subset(iris, test == "test"))

# Create a table of classification and posterior probabilities for each observation


classify.iris <- data.frame(Species = subset(iris, test == "test")$Species
, class = pred.iris$class
, error = ""


, round(pred.iris$posterior,3))
colnames(classify.iris) <- c("Species", "class", "error"
, paste("P", colnames(lda.iris.cv$posterior), sep=""))

# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(subset(iris, test == "test")$Species)
- as.numeric(pred.iris$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

# print table
classify.iris
## Species class error Psetosa Pversicolor Pvirginica
## 2 setosa setosa 1 0.000 0.000
## 3 setosa setosa 1 0.000 0.000
## 6 setosa setosa 1 0.000 0.000
## 8 setosa setosa 1 0.000 0.000
## 9 setosa setosa 1 0.000 0.000
## 10 setosa setosa 1 0.000 0.000
## 11 setosa setosa 1 0.000 0.000
## 12 setosa setosa 1 0.000 0.000
## 15 setosa setosa 1 0.000 0.000
## 16 setosa setosa 1 0.000 0.000
## 23 setosa setosa 1 0.000 0.000
## 24 setosa setosa 1 0.000 0.000
## 26 setosa setosa 1 0.000 0.000
## 28 setosa setosa 1 0.000 0.000
## 29 setosa setosa 1 0.000 0.000
## 31 setosa setosa 1 0.000 0.000
## 35 setosa setosa 1 0.000 0.000
## 37 setosa setosa 1 0.000 0.000
## 40 setosa setosa 1 0.000 0.000
## 41 setosa setosa 1 0.000 0.000
## 43 setosa setosa 1 0.000 0.000
## 44 setosa setosa 1 0.000 0.000
## 45 setosa setosa 1 0.000 0.000
## 47 setosa setosa 1 0.000 0.000
## 49 setosa setosa 1 0.000 0.000
## 52 versicolor versicolor 0 0.999 0.001
## 53 versicolor versicolor 0 0.992 0.008
## 54 versicolor versicolor 0 0.999 0.001
## 55 versicolor versicolor 0 0.992 0.008
## 57 versicolor versicolor 0 0.988 0.012
## 59 versicolor versicolor 0 1.000 0.000
## 61 versicolor versicolor 0 1.000 0.000
## 62 versicolor versicolor 0 0.999 0.001
## 63 versicolor versicolor 0 1.000 0.000
## 69 versicolor versicolor 0 0.918 0.082


## 74 versicolor versicolor 0 0.999 0.001


## 76 versicolor versicolor 0 1.000 0.000
## 79 versicolor versicolor 0 0.991 0.009
## 80 versicolor versicolor 0 1.000 0.000
## 83 versicolor versicolor 0 1.000 0.000
## 85 versicolor versicolor 0 0.973 0.027
## 89 versicolor versicolor 0 1.000 0.000
## 90 versicolor versicolor 0 1.000 0.000
## 91 versicolor versicolor 0 0.998 0.002
## 93 versicolor versicolor 0 1.000 0.000
## 95 versicolor versicolor 0 0.999 0.001
## 96 versicolor versicolor 0 1.000 0.000
## 97 versicolor versicolor 0 1.000 0.000
## 98 versicolor versicolor 0 1.000 0.000
## 99 versicolor versicolor 0 1.000 0.000
## 103 virginica virginica 0 0.000 1.000
## 105 virginica virginica 0 0.000 1.000
## 106 virginica virginica 0 0.000 1.000
## 107 virginica virginica 0 0.114 0.886
## 108 virginica virginica 0 0.000 1.000
## 109 virginica virginica 0 0.000 1.000
## 113 virginica virginica 0 0.001 0.999
## 114 virginica virginica 0 0.001 0.999
## 116 virginica virginica 0 0.000 1.000
## 118 virginica virginica 0 0.000 1.000
## 119 virginica virginica 0 0.000 1.000
## 122 virginica virginica 0 0.004 0.996
## 124 virginica virginica 0 0.137 0.863
## 127 virginica virginica 0 0.283 0.717
## 130 virginica virginica 0 0.053 0.947
## 132 virginica virginica 0 0.001 0.999
## 136 virginica virginica 0 0.000 1.000
## 138 virginica virginica 0 0.010 0.990
## 139 virginica virginica 0 0.346 0.654
## 140 virginica virginica 0 0.003 0.997
## 141 virginica virginica 0 0.000 1.000
## 145 virginica virginica 0 0.000 1.000
## 148 virginica virginica 0 0.009 0.991
## 149 virginica virginica 0 0.000 1.000
## 150 virginica virginica 0 0.039 0.961
# Assess the accuracy of the prediction
# row = true Species, col = classified Species
pred.freq <- table(subset(iris, test == "test")$Species, pred.iris$class)
pred.freq
##
## setosa versicolor virginica
## setosa 25 0 0
## versicolor 0 25 0


## virginica 0 0 25
prop.table(pred.freq, 1) # proportions by row
##
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## setosa versicolor virginica
## 1 1 1
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 1
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0
The classification rule based on the training set works well with the test data. Do
not expect such nice results on all classification problems! Usually the error rate is
slightly higher on the test data than on the training data.
It is important to recognize that statistically significant differences (MANOVA)
among groups on linear discriminant function scores do not necessarily translate into
accurate classification rules! (WHY?)

17.4.1 Stepwise variable selection for classification


Stepwise variable selection for classification can be performed using the klaR package
function stepclass() with any specified classification method. Classification performance
is estimated by whichever of Uschi's classification performance measures you select.
The resulting model can be very sensitive to the starting model. Below, the first
model starts full and ends full. The second model starts empty and ends after one
variable is added. Note that running this repeatedly could result in slightly different
models because the k-fold crossvalidation partitions the data at random. The formula
object gives the selected model.
library(klaR)
# start with full model and do stepwise (direction = "backward")
step.iris.b <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = iris
, method = "lda"
, improvement = 0.01 # stop criterion: improvement less than 1%
# default of 5% is too coarse


, direction = "backward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method ‘lda’.
## 150 observations of 4 variables in 3 classes; direction: backward
## stop criterion: improvement less than 1%.
## correctness rate: 0.98; starting variables (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## hr.elapsed min.elapsed sec.elapsed
## 0.00 0.00 0.19
plot(step.iris.b, main = "Start = full model, backward selection")
step.iris.b$formula
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## <environment: 0x000000002045a3f0>
# start with empty model and do stepwise (direction = "both")
step.iris.f <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
, data = iris
, method = "lda"
, improvement = 0.01 # stop criterion: improvement less than 1%
# default of 5% is too coarse
, direction = "forward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method ‘lda’.
## 150 observations of 4 variables in 3 classes; direction: forward
## stop criterion: improvement less than 1%.
## correctness rate: 0.96; in: "Petal.Width"; variables (1): Petal.Width
##
## hr.elapsed min.elapsed sec.elapsed
## 0.0 0.0 0.2
plot(step.iris.f, main = "Start = empty model, forward selection")
step.iris.f$formula
## Species ~ Petal.Width
## <environment: 0x000000003222b898>


Figure: stepclass() selection paths: estimated correctness rate by step for backward selection from the full model (left) and forward selection from the empty model, which adds only Petal.Width (right).
Given your selected model, you can then go on to fit your classification model by
using the formula from the stepclass() object.
library(MASS)
lda.iris.step <- lda(step.iris.b$formula
, data = iris)
lda.iris.step
## Call:
## lda(step.iris.b$formula, data = iris)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.006 3.428 1.462 0.246
## versicolor 5.936 2.770 4.260 1.326
## virginica 6.588 2.974 5.552 2.026
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.8293776 0.02410215
## Sepal.Width 1.5344731 2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width -2.8104603 2.83918785
##
## Proportion of trace:
## LD1 LD2
## 0.9912 0.0088


Note that if you have many variables, you may wish to use the alternate syntax
below to specify your formula (see the help ?stepclass for this example).
iris.d <- iris[,1:4] # the data
iris.c <- iris[,5] # the classes
sc_obj <- stepclass(iris.d, iris.c, "lda", start.vars = "Sepal.Width")

17.5 Example: Analysis of Admissions Data

The admissions officer of a business school has used an index of undergraduate GPA
and management aptitude test scores (GMAT) to help decide which applicants should
be admitted to graduate school. The data below gives the GPA and GMAT scores for
recent applicants who are classified as admit (A), borderline (B), or not admit (N).
An equal number of A, B, and N’s (roughly) were selected from their corresponding
populations (Johnson and Wichern, 1988).
#### Example: Business school admissions data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch17_business.dat"
business <- read.table(fn.data, header = TRUE)

## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(business
, mapping = ggplot2::aes(colour = admit, alpha = 0.5)
, progress=FALSE
)
print(p)
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
## ‘stat bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
#detach("package:GGally", unload=TRUE)
#detach("package:reshape", unload=TRUE)

library(ggplot2)
p <- ggplot(business, aes(x = gpa, y = gmat, shape = admit, colour = admit))
p <- p + geom_point(size = 6)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(business$admit))))
p <- p + theme(legend.position="none") # remove legend with fill colours
print(p)


Figure: left, scatterplot matrix (ggpairs) of gpa and gmat coloured by admit group (overall correlation 0.442; within groups A: 0.147, B: −0.331, N: 0.507); right, scatterplot of gmat versus gpa with each point plotted as its admission letter (A, B, N).

The officer wishes to use these data to develop a more quantitative (i.e., less
subjective) approach to classify prospective students. Historically, about 20% of all
applicants have been admitted initially, 10% are classified as borderline, and the
remaining 70% are not admitted. The officer would like to keep these percentages
roughly the same in the future.
This is a natural place to use discriminant analysis. Let us do a more careful
analysis here, paying attention to underlying assumptions of normality and equal
covariance matrices.
The GPA and GMAT distributions are reasonably symmetric. Although a few
outliers are present, it does not appear that any transformation will eliminate the
outliers and preserve the symmetry. Given that the outliers are not very extreme,
I would analyze the data on this scale. Except for the outliers, the spreads (IQRs)
are roughly equal across groups within GPA and GMAT. I will look carefully at the
variance-covariance matrices later.
There is a fair amount of overlap between the borderline and other groups, but
this should not be too surprising. Otherwise these applicants would not be borderline!

17.5.1 Further Analysis of the Admissions Data


The assumption of constant variance-covariance matrices is suspect. The GPA and
GMAT variances are roughly constant across groups, but the correlation between
GPA and GMAT varies greatly over groups.
# Covariance matrices by admit
by(business[,2:3], business$admit, cov)
## business$admit: A
## gpa gmat


## gpa 0.05866734 2.857601


## gmat 2.85760081 6479.990927
## ----------------------------------------------------
## business$admit: B
## gpa gmat
## gpa 0.05422559 -4.87757
## gmat -4.87756989 4002.76129
## ----------------------------------------------------
## business$admit: N
## gpa gmat
## gpa 0.05602785 10.01171
## gmat 10.01170769 6973.16462
# Correlation matrices by admit
by(business[,2:3], business$admit, FUN=stats::cor)
## business$admit: A
## gpa gmat
## gpa 1.0000000 0.1465602
## gmat 0.1465602 1.0000000
## ----------------------------------------------------
## business$admit: B
## gpa gmat
## gpa 1.0000000 -0.3310713
## gmat -0.3310713 1.0000000
## ----------------------------------------------------
## business$admit: N
## gpa gmat
## gpa 1.0000000 0.5065137
## gmat 0.5065137 1.0000000

Assuming equal variance-covariance matrices, both GPA and GMAT are impor-
tant for discriminating among entrance groups. This is consistent with the original
data plots.
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE)


Figure (Partition Plot): partimat() classification regions for admit in the (gpa, gmat) plane; apparent error rate 0.157.

library(MASS)
lda.business <- lda(admit ~ gpa + gmat
, data = business)
lda.business
## Call:
## lda(admit ~ gpa + gmat, data = business)
##
## Prior probabilities of groups:
## A B N
## 0.3595506 0.3483146 0.2921348
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
##
## Coefficients of linear discriminants:
## LD1 LD2
## gpa -3.977912929 -1.48346456
## gmat -0.003057846 0.01292319


##
## Proportion of trace:
## LD1 LD2
## 0.9473 0.0527
The linear discriminant functions that best classify admit are

LD1 = −3.978 gpa − 0.003058 gmat
LD2 = −1.483 gpa + 0.01292 gmat,

interpreted as a weighted average of the scores and a contrast of the scores.


The plots of the lda() object show the data on the LD scale.
plot(lda.business, dimen = 1)
plot(lda.business, dimen = 2, col = as.numeric(business$admit))
#pairs(lda.business, col = as.numeric(business$admit))
Figure: plot(lda.business) output: histograms of the LD1 scores by group (dimen = 1) and the LD1 versus LD2 scatterplot labelled by admission group (dimen = 2).

# CV = TRUE does jackknife (leave-one-out) crossvalidation


lda.business.cv <- lda(admit ~ gpa + gmat
, data = business, CV = TRUE)

# Create a table of classification and posterior probabilities for each observation


classify.business <- data.frame(admit = business$admit
, class = lda.business.cv$class
, error = ""
, round(lda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
, paste("post", colnames(lda.business.cv$posterior), sep=""))

# error column
classify.business$error <- as.character(classify.business$error)


classify.agree <- as.character(as.numeric(business$admit)
                               - as.numeric(lda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
The misclassification error within the training set is reasonably low, given the
overlap between the B group and the others, and there is never a misclassification
between A and N.
# print table
#classify.business

# Assess the accuracy of the prediction


# row = true admit, col = classified admit
pred.freq <- table(business$admit, lda.business.cv$class)
pred.freq
##
## A B N
## A 26 6 0
## B 3 25 3
## N 0 3 23
prop.table(pred.freq, 1) # proportions by row
##
## A B N
## A 0.81250000 0.18750000 0.00000000
## B 0.09677419 0.80645161 0.09677419
## N 0.00000000 0.11538462 0.88461538
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## A B N
## 0.8125000 0.8064516 0.8846154
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.8314607
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.1685393

17.5.2 Classification Using Unequal Prior Probabilities


A new twist to this problem is that we have prior information on the relative sizes of
the populations. Historically 20% of applicants are admitted, 10% are borderline, and
70% are not admitted. This prior information is incorporated into the classification
rule by using a prior option with lda(). The prior probabilities are assumed to
be equal when this statement is omitted. The classification rule for unequal prior
probabilities uses both the M -distance and the prior probability for making decisions,


as outlined below.
When the prior probabilities are unequal, classification is based on the general-
ized distance to group j:

Dj²(X) = (X − X̄j)′ S⁻¹ (X − X̄j) − 2 log(PRIORj),

or on the estimated posterior probability of membership in group j:

Pr(j|X) = exp{−0.5 Dj²(X)} / Σk exp{−0.5 Dk²(X)}.

Here S is the pooled covariance matrix, and log(PRIORj ) is the (natural) log of the
prior probability of being in group j. As before, you classify observation X into the
group that it is closest to in terms of generalized distance, or equivalently, into the
group with the maximum posterior probability.
Note that −2 log(PRIORj ) exceeds zero, and is extremely large when PRIORj is
near zero. The generalized distance is the M -distance plus a penalty term that is
large when the prior probability for a given group is small. If the prior probabilities
are equal, the penalty terms are equal so the classification rule depends only on the
M -distance.
The penalty makes it harder (relative to equal probabilities) to classify into a
low probability group, and easier to classify into high probability groups. In the
admissions data, an observation has to be very close to the B or A groups to not be
classified as N.
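A minimal sketch of this generalized-distance calculation for a single hypothetical applicant, computing the pooled covariance and group means directly from the business data (the lda() call with a prior argument below effectively does this for us):
# sketch: generalized distance to each group (M-distance minus 2*log(prior))
x.new  <- c(gpa = 3.0, gmat = 500)                  # hypothetical applicant
priors <- c(A = 0.2, B = 0.1, N = 0.7)
grps   <- levels(factor(business$admit))            # "A" "B" "N"
Xb     <- business[, c("gpa", "gmat")]
ns     <- table(business$admit)
covs   <- lapply(grps, function(g) cov(Xb[business$admit == g, ]))
S.pool <- Reduce("+", mapply(function(Si, ni) (ni - 1) * Si, covs, ns,
                             SIMPLIFY = FALSE)) / (sum(ns) - length(ns))
D2.gen <- sapply(grps, function(g)
            mahalanobis(x.new, colMeans(Xb[business$admit == g, ]), S.pool)) -
          2 * log(priors)
D2.gen                          # classify into the group with the smallest value
names(which.min(D2.gen))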
Note that in the analysis below, we make the tenuous assumption that the popu-
lation covariance matrices are equal. We also have 6 new observations that we wish
to classify. These observations are entered as a test data set with missing class levels.
# new observations to classify
business.test <- read.table(text = "
admit gpa gmat
NA 2.7 630
NA 3.3 450
NA 3.4 540
NA 2.8 420
NA 3.5 340
NA 3.0 500
", header = TRUE)
With priors, the LDs are different.
library(MASS)
lda.business <- lda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business)
lda.business


## Call:
## lda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1,
## 0.7))
##
## Prior probabilities of groups:
## A B N
## 0.2 0.1 0.7
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
##
## Coefficients of linear discriminants:
## LD1 LD2
## gpa -4.014778092 -1.38058511
## gmat -0.002724201 0.01299761
##
## Proportion of trace:
## LD1 LD2
## 0.9808 0.0192
About 1/2 of the borderlines in the calibration set are misclassified. This is due
to the overlap of the B group with the other 2 groups, but also reflects the low prior
probability for the borderline group. The classification rule requires strong evidence
that an observation is borderline before it can be classified as such.
# CV = TRUE does jackknife (leave-one-out) crossvalidation
lda.business.cv <- lda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business, CV = TRUE)

# Create a table of classification and posterior probabilities for each observation


classify.business <- data.frame(admit = business$admit
, class = lda.business.cv$class
, error = ""
, round(lda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
, paste("post", colnames(lda.business.cv$posterior), sep=""))

# error column
classify.business$error <- as.character(classify.business$error)
classify.agree <- as.character(as.numeric(business$admit)
- as.numeric(lda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

# print table, errors only


classify.business[!(classify.business$error == ""), ]

## admit class error postA postB postN
## 29 A B -1 0.027 0.557 0.416
## 30 A N -2 0.114 0.383 0.503
## 31 A B -1 0.045 0.595 0.360
## 35 N A 2 0.442 0.273 0.285
## 36 N B 1 0.033 0.516 0.451
## 59 B N -1 0.001 0.044 0.955
## 60 B N -1 0.002 0.016 0.982
## 61 B N -1 0.126 0.329 0.545
## 66 B A 1 0.747 0.253 0.000
## 67 B N -1 0.161 0.412 0.428
## 68 B N -1 0.037 0.277 0.686
## 71 B N -1 0.059 0.227 0.714
## 72 B N -1 0.017 0.089 0.894
## 74 B N -1 0.070 0.129 0.801
## 75 B N -1 0.020 0.146 0.834
## 82 B N -1 0.107 0.344 0.549
## 85 B A 1 0.758 0.233 0.009
## 86 B A 1 0.979 0.021 0.001
## 87 B A 1 0.627 0.365 0.008
## 88 B A 1 0.633 0.367 0.000
# Assess the accuracy of the prediction
# row = true admit, col = classified admit
pred.freq <- table(business$admit, lda.business.cv$class)
pred.freq
##
## A B N
## A 29 2 1
## B 5 16 10
## N 1 1 24
prop.table(pred.freq, 1) # proportions by row
##
## A B N
## A 0.90625000 0.06250000 0.03125000
## B 0.16129032 0.51612903 0.32258065
## N 0.03846154 0.03846154 0.92307692
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## A B N
## 0.9062500 0.5161290 0.9230769
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.7752809
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.2247191

The test data cases were entered with missing group IDs, so the admit column in the
classification table below is NA. Because the true groups are unknown, no misclassification
summary is computed for these cases (the error column is commented out in the code).
Just focus on the predicted class and posterior probabilities for each case.
# predict the test data from the training data LDFs
pred.business <- predict(lda.business, newdata = business.test)

# Create a table of classification and posterior probabilities for each observation


classify.business.test <- data.frame(admit = business.test$admit
, class = pred.business$class
#, error = ""
, round(pred.business$posterior,3))
colnames(classify.business.test) <- c("admit", "class"#, "error"
, paste("post", colnames(pred.business$posterior), sep=""))

## error column
#classify.business.test$error <- as.character(classify.business.test$error)
#classify.agree <- as.character(as.numeric(business.test$admit)
#                  - as.numeric(pred.business$class))
#classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

# print table
classify.business.test
## admit class postA postB postN
## 1 NA N 0.102 0.074 0.824
## 2 NA A 0.629 0.367 0.004
## 3 NA A 0.919 0.081 0.000
## 4 NA N 0.026 0.297 0.676
## 5 NA B 0.461 0.538 0.001
## 6 NA B 0.385 0.467 0.148
Except for observation 5 (and, to a lesser extent, observation 6), the posterior
probabilities for the test cases give strong evidence in favor of classification into a
specific group.
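
One simple way to see this numerically is to look at the largest posterior probability
for each test case. The sketch below uses the pred.business object created above; the
0.60 cutoff is arbitrary and chosen only for illustration.
# flag test cases whose largest posterior probability falls below a cutoff
# (the 0.60 cutoff is arbitrary, for illustration only)
max.post <- apply(pred.business$posterior, 1, max)
data.frame(class     = pred.business$class
         , max.post  = round(max.post, 3)
         , uncertain = max.post < 0.60)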

17.5.3 Classification With Unequal Covariance Matrices, QDA


The assumption of multivariate normal populations with equal covariance matrices
is often unrealistic. Although there are no widely available procedures for MANOVA
or stepwise variable selection in discriminant analysis that relax these assumptions,
there are a variety of classification rules that weaken one or both of these assumptions.
The qda() function is a quadratic discriminant classification rule that assumes
normality but allows unequal covariance matrices.
The quadratic discriminant classification rule is based on the generalized distance
to group j:
$$ D_j^2(X) = (X - \bar{X}_j)' S_j^{-1} (X - \bar{X}_j) - 2 \log(\mathrm{PRIOR}_j) + \log |S_j|, $$
or equivalently, the posterior probability of membership in group j:
$$ \Pr(j|X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_k \exp\{-0.5\, D_k^2(X)\}}. $$

Here $S_j$ is the sample covariance matrix from group j and $\log|S_j|$ is the log of the
determinant of this covariance matrix. The determinant penalty term is large for groups
having large variability. The rule is not directly tied to the linear discriminant function
variables, so interpretation and insight into this method are less straightforward.
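
As a quick numerical illustration of the determinant penalty, the sketch below
(assuming the business data frame is loaded) computes log|Sj| from each group's sample
covariance matrix; groups with larger variability have larger values and hence a larger
penalty.
# log-determinant penalty log|S_j| for each admission group
# (a sketch; assumes the business data frame is loaded)
sapply(levels(business$admit), function(g) {
  Sj <- cov(business[business$admit == g, c("gpa", "gmat")])
  log(det(Sj))
})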
There is evidence that quadratic discrimination does not improve misclassification
rates in many problems with small to modest sample sizes, in part because the
quadratic rule requires an estimate of the covariance matrix for each population. A
modest to large number of observations is needed to estimate variances and correlations
accurately. I often compute both the linear and quadratic rules, but use the linear
discriminant analysis unless the quadratic rule noticeably reduces the misclassification
rate.
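
A compact way to carry out that side-by-side comparison is to compute the leave-one-out
(jackknife) error rate for each rule in one step; the sketch below assumes the business
data frame is loaded and uses the same unequal priors as above.
# leave-one-out (jackknife) error rates for the linear and quadratic rules
# under the unequal priors (a sketch; assumes business is loaded)
library(MASS)
cv.error <- function(classifier) {
  fit <- classifier(admit ~ gpa + gmat, data = business
                  , prior = c(0.2, 0.1, 0.7), CV = TRUE)
  mean(fit$class != business$admit)
}
c(lda = cv.error(lda), qda = cv.error(qda))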
Recall that the GPA and GMAT sample variances are roughly constant across
admission groups, but the correlation between GPA and GMAT varies widely across
groups.
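
This can be re-checked directly from the data; a short sketch (assuming the business
data frame is loaded) that prints each group's standard deviations and the GPA-GMAT
correlation:
# group-wise standard deviations and GPA-GMAT correlation
# (a sketch; assumes the business data frame is loaded)
by(business[, c("gpa", "gmat")], business$admit, function(d)
   round(c(sd.gpa = sd(d$gpa), sd.gmat = sd(d$gmat)
         , cor = cor(d$gpa, d$gmat)), 3))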
The quadratic rule does not classify the training data noticeably better than
the linear discriminant analysis. The individuals in the test data have the same
classifications under both approaches. Assuming that the optimistic error rates for
the two rules were “equally optimistic”, I would be satisfied with the standard linear
discriminant analysis, and would summarize my analysis based on this approach.
Additional data is needed to decide whether the quadratic rule might help reduce the
misclassification rates.
# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE
, method = "lda", main = "LDA partition")
partimat(admit ~ gmat + gpa
, data = business
, plot.matrix = FALSE
, method = "qda", main = "QDA partition")


[Figure: partimat() classification regions for the admissions training data, plotting gmat against gpa. Left panel: LDA partition, apparent error rate 0.157. Right panel: QDA partition, apparent error rate 0.146.]

library(MASS)
qda.business <- qda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business)
qda.business
## Call:
## qda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1,
## 0.7))
##
## Prior probabilities of groups:
## A B N
## 0.2 0.1 0.7
##
## Group means:
## gpa gmat
## A 3.321875 554.4062
## B 3.004516 454.1935
## N 2.400385 443.7308
# CV = TRUE does jackknife (leave-one-out) crossvalidation
qda.business.cv <- qda(admit ~ gpa + gmat
, prior = c(0.2, 0.1, 0.7)
, data = business, CV = TRUE)

# Create a table of classification and posterior probabilities for each observation


classify.business <- data.frame(admit = business$admit
, class = qda.business.cv$class
, error = ""
, round(qda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
, paste("post", colnames(qda.business.cv$posterior), sep=""))

# error column
classify.business$error <- as.character(classify.business$error)
classify.agree <- as.character(as.numeric(business$admit)
- as.numeric(qda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

# print table
#classify.business

# Assess the accuracy of the prediction


# row = true admit, col = classified admit
pred.freq <- table(business$admit, qda.business.cv$class)
pred.freq
##
## A B N
## A 27 3 2
## B 5 17 9
## N 0 1 25
prop.table(pred.freq, 1) # proportions by row
##
## A B N
## A 0.84375000 0.09375000 0.06250000
## B 0.16129032 0.54838710 0.29032258
## N 0.00000000 0.03846154 0.96153846
# proportion correct for each category
diag(prop.table(pred.freq, 1))
## A B N
## 0.8437500 0.5483871 0.9615385
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.7752809
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.2247191
# predict the test data from the training data LDFs
pred.business <- predict(qda.business, newdata = business.test)

# Create a table of classification and posterior probabilities for each observation


classify.business.test <- data.frame(admit = business.test$admit
, class = pred.business$class
#, error = ""
, round(pred.business$posterior,3))
colnames(classify.business.test) <- c("admit", "class"#, "error"
, paste("post", colnames(pred.business$posterior), sep=""))

## error column
#classify.business.test$error <- as.character(classify.business.test$error)
#classify.agree <- as.character(as.numeric(business.test$admit)
#                  - as.numeric(pred.business$class))
#classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

# print table
classify.business.test
## admit class postA postB postN
## 1 NA N 0.043 0.038 0.919
## 2 NA A 0.597 0.402 0.000
## 3 NA A 0.978 0.022 0.000
## 4 NA N 0.051 0.423 0.526
## 5 NA B 0.292 0.708 0.000
## 6 NA B 0.363 0.513 0.123
