Author(s): Claus Thorn Ekstrøm, Helle Sørensen
ISBN(s): 9781482238969, 1482238969
Edition: 2
Year: 2014
Language: english
INTRODUCTION TO
Statistical Data Analysis for the Life Sciences
Second Edition

CLAUS THORN EKSTRØM
HELLE SØRENSEN

Expanded with over 100 more pages, Introduction to Statistical Data
Analysis for the Life Sciences, Second Edition presents the right balance
of data examples, statistical theory, and computing to learn introductory
statistics. This popular book covers the mathematics underlying classical
statistical analysis, the modeling aspects of statistical analysis and the
biological interpretation of results, and the application of statistical
software in analyzing real-world problems and datasets.

New to the Second Edition
• A new chapter on non-linear regression models
• A new chapter that contains examples of complete data analyses,
  illustrating how a full-fledged statistical analysis is undertaken
• Additional exercises in most chapters
• A summary of statistical formulas related to the specific designs
  used to teach the statistical concepts

This text provides a computational toolbox that enables you to analyze
real datasets and gain the confidence and skills to undertake more
sophisticated analyses. Although accessible with any statistical software,
the text encourages a reliance on R. For those new to R, an introduction
to the software is available in an appendix. The book also includes
end-of-chapter exercises as well as an entire chapter of case exercises
that help you apply your knowledge to larger datasets and learn more
about approaches specific to the life sciences.

K23251

www.crcpress.com
INTRODUCTION TO

Statistical Data
Analysis for the
Life Sciences
Second Edition

CLAUS THORN EKSTRØM


Biostatistics, Department of Public Health
University of Copenhagen

HELLE SØRENSEN
Department of Mathematical Sciences
University of Copenhagen



CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20140910

International Standard Book Number-13: 978-1-4822-3894-5 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1 Description of samples and populations
    1.1 Data types
    1.2 Visualizing categorical data
    1.3 Visualizing quantitative data
    1.4 Statistical summaries
    1.5 What is a probability?
    1.6 R
    1.7 Exercises

2 Linear regression
    2.1 Fitting a regression line
    2.2 When is linear regression appropriate?
    2.3 The correlation coefficient
    2.4 Perspective
    2.5 R
    2.6 Exercises

3 Comparison of groups
    3.1 Graphical and simple numerical comparison
    3.2 Between-group variation and within-group variation
    3.3 Populations, samples, and expected values
    3.4 Least squares estimation and residuals
    3.5 Paired and unpaired samples
    3.6 Perspective
    3.7 R
    3.8 Exercises

4 The normal distribution
    4.1 Properties
    4.2 One sample
    4.3 Are the data (approximately) normally distributed?
    4.4 The central limit theorem
    4.5 R
    4.6 Exercises

5 Statistical models, estimation, and confidence intervals
    5.1 Statistical models
    5.2 Estimation
    5.3 Confidence intervals
    5.4 Unpaired samples with different standard deviations
    5.5 R
    5.6 Exercises

6 Hypothesis tests
    6.1 Null hypotheses
    6.2 t-tests
    6.3 Tests in a one-way ANOVA
    6.4 Hypothesis tests as comparison of nested models
    6.5 Type I and type II errors
    6.6 R
    6.7 Exercises

7 Model validation and prediction
    7.1 Model validation
    7.2 Prediction
    7.3 R
    7.4 Exercises

8 Linear normal models
    8.1 Multiple linear regression
    8.2 Additive two-way analysis of variance
    8.3 Linear models
    8.4 Interactions between variables
    8.5 R
    8.6 Exercises

9 Non-linear regression
    9.1 Non-linear regression models
    9.2 Estimation, confidence intervals, and hypothesis tests
    9.3 Model validation
    9.4 R
    9.5 Exercises

10 Probabilities
    10.1 Outcomes, events, and probabilities
    10.2 Conditional probabilities
    10.3 Independence
    10.4 Exercises

11 The binomial distribution
    11.1 The independent trials model
    11.2 The binomial distribution
    11.3 Estimation, confidence intervals, and hypothesis tests
    11.4 Differences between proportions
    11.5 R
    11.6 Exercises

12 Analysis of count data
    12.1 The chi-square test for goodness-of-fit
    12.2 2 × 2 contingency table
    12.3 Two-sided contingency tables
    12.4 R
    12.5 Exercises

13 Logistic regression
    13.1 Odds and odds ratios
    13.2 Logistic regression models
    13.3 Estimation and confidence intervals
    13.4 Hypothesis tests
    13.5 Model validation and prediction
    13.6 R
    13.7 Exercises

14 Statistical analysis examples
    14.1 Water temperature and frequency of electric signals from electric eels
    14.2 Association between listeria growth and RIP2 protein
    14.3 Degradation of dioxin
    14.4 Effect of an inhibitor on the chemical reaction rate
    14.5 Birthday bulge on the Danish soccer team
    14.6 Animal welfare
    14.7 Monitoring herbicide efficacy

15 Case exercises
    Case 1: Linear modeling
    Case 2: Data transformations
    Case 3: Two sample comparisons
    Case 4: Linear regression with and without intercept
    Case 5: Analysis of variance and test for linear trend
    Case 6: Regression modeling and transformations
    Case 7: Linear models
    Case 8: Binary variables
    Case 9: Agreement
    Case 10: Logistic regression
    Case 11: Non-linear regression
    Case 12: Power and sample size calculations

A Summary of inference methods
    A.1 Statistical concepts
    A.2 Statistical analysis
    A.3 Model selection
    A.4 Statistical formulas

B Introduction to R
    B.1 Working with R
    B.2 Data frames and reading data into R
    B.3 Manipulating data
    B.4 Graphics with R
    B.5 Reproducible research
    B.6 Installing R
    B.7 Exercises

C Statistical tables
    C.1 The χ² distribution
    C.2 The normal distribution
    C.3 The t distribution
    C.4 The F distribution

D List of examples used throughout the book

Bibliography

Index
Preface

The second edition of Introduction to Statistical Data Analysis for the Life
Sciences expands the content of the first edition based on the comments,
suggestions, and requests we have received from lecturers and students who
have adopted and used the book for teaching and learning introductory statistics.
We have kept the overall structure from the first edition with two excep-
tions: There is a new chapter on non-linear regression models that follows
more naturally after the chapter on linear models, so we have inserted that
as Chapter 9 immediately after the chapter on linear models. Consequently
the remaining chapter numbers have increased but they appear in the same
order as in the 1st edition. This edition also includes a new chapter that con-
tains examples of complete data analyses. These examples are intended as
inspiration or case studies of how a full-fledged statistical analysis might be
undertaken and the results presented. The chapter with examples has been
inserted before the case exercises, as Chapter 14.
Additional exercises have been included in most chapters and the new
exercises have been added at the end of each chapter to ensure that the ex-
ercise numbers in the 2nd edition match the exercise numbering in the 1st
edition (barring the change in chapter numbers due to the new chapters on
non-linear regression and statistical examples).
Finally, we have provided a summary of statistical formulas for the sim-
ple cases that are used throughout the book to teach new statistical concepts.
The book is written and intended as a textbook and as such we introduce and
discuss the formulas relevant for the various special cases whenever a new
concept is introduced. The summary of statistical formulas provides a com-
pilation of all the formulas related to the specific designs to make it easier to
use the book as a reference.
Thank you to the lecturers, students, and colleagues who have provided
useful suggestions, comments, and feedback. We are especially grateful to
Bo Markussen, who meticulously read through the book and provided detailed
feedback and suggestions while teaching a course that used the book.

Claus Thorn Ekstrøm


Helle Sørensen
April 2014


Preface to the first edition


We believe that a textbook in statistics for the life sciences must focus on
applications and computational statistics combined with a reasonable level
of mathematical rigor. In the spring of 2008 we were asked to revise and
teach the introductory statistics course taken by the majority of students at
the Faculty of Life Sciences at the University of Copenhagen. We searched for
a textbook that could replace the earlier textbook by Skovgaard et al. (1999)
but were unable to find one with the right combination of data examples,
statistical theory, and computing. We decided to make our own material, and
this book is the result of our efforts.
The book covers material seen in many textbooks on introductory statis-
tics but differs from these books in several ways. First and foremost we have
kept the emphasis on both data analysis and the mathematics underlying
classical statistical analysis. We have tried to give the reader a feeling of be-
ing able to model and analyze data very early on and then “sneak in” the
probability and statistics theory as we go along. Second, we put much em-
phasis on the modeling part of statistical analysis and on biological interpre-
tations of parameter estimates, hypotheses, etc. Third, the text focuses on the
use and application of statistical software to analyze problems and datasets.
Our students should not only be able to determine how to analyze a given
dataset but also have a computational toolbox that enables them to actually
do the analysis — for real datasets with numerous observations and several
variables.
We have used R as our choice of statistical software (R Core Team, 2013b).
R is the lingua franca of statistical computing; it is free statistical
programming software that can be downloaded from http://cran.r-project.org.
By introducing the students to R we hope to provide them with the neces-
sary skills to undertake more sophisticated analyses later on in their careers.
R commands and output are found at the end of each chapter so that they
will not steal too much attention from the statistics, and so the main text can
be used with any statistical software program. However, we believe that be-
ing able to use a software package for statistical analyses is essential for all
students. Appendix B provides a short introduction to R that can be used for
students with no previous experience of R.
All datasets used in the book are available in the R package isdals, which
can be downloaded directly from CRAN — the Comprehensive R Archive
Network. The datasets can also be found as plain text files from the support-
ing website of the book:

http://www.biostatistics.dk/isdals/

The book can be read sequentially from start to end. Some readers may
prefer to have a proper introduction to probability theory (Chapter 10)
before introducing statistics, inference, and modeling, and in that case
Chapter 10 can be read between Chapters 1 and 2. Chapters 2 and 3 cover
linear regression and one-way analysis of variance with emphasis on
modeling, interpretation, estimation, and the biological questions to be
answered, but without details about variation of estimates and hypothesis
tests. These two chapters are meant as appetizers: they should give
readers a feeling of what they will be able to accomplish in the
subsequent chapters and keep in mind that we essentially intend to apply
the theory to analyze data.
Chapters 4 to 7 cover the normal distribution and statistical inference:
estimation, confidence intervals, hypothesis testing, prediction, and model
validation with thorough discussions on one- and two-sample problems, lin-
ear regression, and analysis of variance. The different data types are treated
“in one go” since the analyses are similar from a statistical point of view, but
the different biological interpretations are also stressed.
Chapter 8 extends the theory to linear normal models (e.g., multiple lin-
ear regression and two-way analysis of variance models), shows that linear
regression and analysis of variance are essentially special cases of the same
class of models, and more complicated modeling terms such as interactions
are discussed.
Chapter 9 is a self-contained introduction to probability theory including
independence and conditional probabilities.
In Chapter 10 we present the binomial distribution and discuss statistical
inference for the binomial distribution. Chapter 11 is concerned with analy-
sis of count data and the use of chi-square test statistics to test hypotheses.
Emphasis is on the analysis of 2 × 2 tables as well as on general r × k tables.
Chapter 12 is about logistic regression and thus combines aspects from linear
models with the binomial distribution.
Each of these chapters contains a number of exercises related to the topic
and theory of that chapter. Roughly half the exercises are supposed to be
done by hand, whereas a computer should be used for the remaining ones
(marked with a symbol). A few of the exercises include R commands and
related output that can be used to answer the problems. These exercises are
supposed to give the reader a possibility to get familiar with the R language
and learn to read and interpret output from R without getting into trouble
with the actual programming. A small number of exercises are of a more
mathematical nature; e.g., derivation of formulas. Such exercises are marked
with an [M].
Chapter 13 contains ten larger case exercises where readers are encour-
aged to apply their knowledge to larger datasets and learn more about im-
portant topics. We consider these exercises an important part of the book.
They are suitable for self-study because the analyses are made in many small
steps and much help is provided in the questions.
The book ends with three appendices. Appendix A includes an overview
of inference methods. Appendix B contains an introduction to R that can be
used as a starting point for readers unfamiliar with R. Finally, Appendix C
contains a few statistical tables for those situations where a computer is not
available to calculate the relevant tail probabilities or quantiles.
We used the book for a 7.5 ECTS second/third-year bachelor course with
four lectures and four hours of exercises per week for eight weeks. In
addition, three hours were used with one of the case exercises from
Chapter 13: the students worked without instruction for two hours,
followed by a one-hour discussion.
We are grateful to our colleagues at the Faculty of Life Sciences at the Uni-
versity of Copenhagen — in particular to Ib Skovgaard and Mats Rudemo,
who authored an earlier textbook and who throughout the years have col-
lected data from life science studies at the University of Copenhagen. Many
thanks go to the students who participated in the “Statistical Data Analysis
1” course in 2008 and 2009 and helped improve the original manuscript with
their comments.

Claus Thorn Ekstrøm


Helle Sørensen
Chapter 1
Description of samples and populations

Statistics is about making statements about a population from data observed
from a representative sample of the population. A population is a collection
of subjects whose properties are to be analyzed. The population is the
complete collection to be studied; it contains all subjects of interest. A
sample is a part of the population of interest, a subset selected by some
means from the population. The concepts of population, sample, and
statistical inference are illustrated in Figure 1.1.

Figure 1.1: Population and sample. In statistics we sample subjects from a large popu-
lation and use the information obtained from the sample to infer characteristics about
the general population. Thus the upper arrow can be viewed as “sampling” while the
lower arrow is “statistical inference”.

A parameter is a numerical value that describes a characteristic of a
population, while a statistic is a numerical measurement that describes a
characteristic of a sample. We will use a statistic to infer something about
a parameter.
Imagine, for example, that we are interested in the average height of a
population of individuals. The average height of the population, µ, is a
parameter, but it would be too expensive and/or time-consuming to measure
the height of all individuals in the population. Instead we draw a random
sample of, say, 12 individuals and measure the height of each of them. The
average of those 12 individuals in the sample is our statistic, and if the
sample is representative of the population and the sample is sufficiently
large, we have confidence in using the statistic as an estimate or guess of
the true population parameter µ. The rest of this book is concerned with
methods for making inferences about population parameters based on sample
statistics.
The distinction between population and sample depends on the context
and the type of inference that you wish to perform. If we were to deduce the
average height of the total population, then the 12 individuals are indeed a
sample. If for some reason we were only interested in the height of these 12
individuals, and had no intention to make further inferences beyond the 12,
then the 12 individuals themselves would constitute the population.
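
The relation between a parameter and a statistic can be illustrated with a small simulation. The following is a minimal sketch in Python (the book's own examples use R); the population of 10,000 heights, its mean of 170 cm, and its standard deviation of 10 cm are hypothetical values chosen for illustration, not data from the book.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated heights in cm.
# The values 170 and 10 are illustrative assumptions.
rng = random.Random(1)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: the population mean, which is normally unknown in practice.
mu = statistics.mean(population)

# Statistic: the mean height of a random sample of 12 individuals.
sample = rng.sample(population, 12)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {mu:.1f}")
print(f"sample mean (statistic):     {sample_mean:.1f}")
```

Drawing a new sample would give a different sample mean, whereas the population mean stays fixed; this is why the statistic is only an estimate of the parameter.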

1.1 Data types


The type(s) of data collected in a study determine the type of statistical
analysis that can be used and determine which hypotheses can be tested and
which model we can use for prediction. Broadly speaking, we can classify
data into two major types: categorical and quantitative.

1.1.1 Categorical data


Categorical data can be grouped into categories based on some qualitative
trait. The resulting data are merely labels or categories, and examples include
gender (male and female) and ethnicity (e.g., Caucasian, Asian, African). We
can further sub-classify categorical data into two types: nominal and ordinal.

Nominal. When there is no natural ordering of the categories we call the
data nominal. Hair color is an example of nominal data. Observations
are distinguished by name only, and there is no agreed upon ordering.
It does not make sense to say “brown” comes before “blonde” or
“gray”. Other examples include gender, race, smoking status (smoker
or non-smoker), or disease status.

Ordinal. When the categories may be ordered, the data are called ordinal.
Categorical variables that judge pain (e.g., none, little, heavy)
or income (low-level income, middle-level income, or high-level income)
are examples of ordinal variables. We know that households
with low-level income earn less than households in the middle-level
bracket, which in turn earn less than the high-level households. Hence
there is an ordering to these categories.
It is worth emphasizing that the difference between two categories
cannot be measured even though there exists an ordering for ordinal
data. We know that high-income households earn more than low- and
medium-income households, but not how much more. Also we cannot
say that the difference between low- and medium-income households
is the same as the difference between medium- and high-income
households.
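
The distinction between order and distance for ordinal data can be sketched in code. This is a minimal Python illustration under stated assumptions: the level names are shortened from the income example above, the list of households is hypothetical, and the integer ranks are only positions in the ordering.

```python
# Income levels from the text, listed from lowest to highest.
INCOME_LEVELS = ["low", "middle", "high"]
rank = {level: position for position, level in enumerate(INCOME_LEVELS)}

# Hypothetical observations for four households.
households = ["high", "low", "middle", "low"]

# The ordering is meaningful, so ordinal values can be sorted and compared.
ordered = sorted(households, key=rank.__getitem__)
print(ordered)                      # ['low', 'low', 'middle', 'high']
print(rank["low"] < rank["high"])   # True

# The ranks 0, 1, 2 are positions only. That consecutive ranks differ by 1
# says nothing about how much more a "high" household earns than a "middle" one.
```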

1.1.2 Quantitative data


Quantitative data are numerical measurements where the numbers are as-
sociated with a scale measure rather than just being simple labels. Quantita-
tive data fall in two categories: discrete and continuous.
Discrete. Discrete quantitative data are numeric data variables that have
a finite or countable number of possible values. When data repre-
sent counts, they are discrete. Examples include household size or the
number of kittens in a litter. For discrete quantitative data there is a
proper quantitative interpretation of the values: the difference between
a household of size 9 and a household of size 7 is the same as the dif-
ference between a household of size 5 and a household of size 3.
Continuous. The real numbers are continuous with no gaps; physically mea-
surable quantities like length, volume, time, mass, etc., are generally
considered continuous. However, while the data in theory are continu-
ous, we often have some limitations in the level of detail that is feasi-
ble to measure. In some experiments, for example, we measure time in
days or weight in kilograms even though a finer resolution could have
been used: hours or seconds and grams. In practice, variables are never
measured with infinite precision, but regarding a variable as continu-
ous is still a valid assumption.
Categorical data are typically summarized using frequencies or proportions
of observations in each category, while quantitative data typically are sum-
marized using averages or means.
Example 1.1. Laminitis in cattle. Danscher et al. (2009) examined eight
heifers in a study to evaluate acute laminitis in cattle after oligofructose
overload. For logistical reasons, the 8 animals were examined at two different
locations. For each of the 8 animals the location, weight, lameness score, and
number of swelled joints were recorded 72 hours after oligofructose was
administered. A slightly modified version of the data is shown in Table 1.1, and
these data contain all four different types of data.
Location is a nominal variable as it has a finite set of categories with no
specific ordering. Although the location is labeled with Roman numerals,
they have no numeric meaning or ordering and might as well be renamed
“A” and “B”. Weight is a quantitative continuous variable even though it
is only reported in whole kilograms. The weight measurements are actual
measurements on the continuous scale and taking differences between the
values is meaningful. Lameness score is an ordinal variable where the order
is defined by the clinicians who investigate the animals: normal, mildly lame,
moderately lame, lame, and severely lame. The number of swelled joints is a
quantitative discrete variable: we can count the actual number of swelled
joints on each animal.

Table 1.1: Data on acute laminitis for eight heifers

Location   Weight (kg)   Lameness score    No. swelled joints
I          276           Mildly lame       2
I          395           Mildly lame       1
I          356           Normal            0
I          437           Lame              2
II         376           Lame              0
II         350           Moderately lame   0
II         331           Lame              1
II         331           Normal            0
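
The four variable types in Table 1.1 map naturally onto R data types. The following is a minimal sketch of one way to enter the table; the object name `laminitis` and the column names are chosen here for illustration and are not taken from the study.

```r
# Table 1.1 entered as a data frame; object and column names are illustrative.
lameness_levels <- c("Normal", "Mildly lame", "Moderately lame",
                     "Lame", "Severely lame")

laminitis <- data.frame(
  location = factor(c("I", "I", "I", "I", "II", "II", "II", "II")),   # nominal
  weight   = c(276, 395, 356, 437, 376, 350, 331, 331),               # continuous
  lameness = factor(c("Mildly lame", "Mildly lame", "Normal", "Lame",
                      "Lame", "Moderately lame", "Lame", "Normal"),
                    levels = lameness_levels, ordered = TRUE),         # ordinal
  joints   = c(2L, 1L, 0L, 2L, 0L, 0L, 1L, 0L)                         # discrete
)

# Categorical variables are summarized by frequencies ...
table(laminitis$location)
table(laminitis$lameness)

# ... and quantitative variables by averages.
mean(laminitis$weight)

# An ordered factor supports comparisons, but not differences.
laminitis$lameness[4] > laminitis$lameness[1]   # is "Lame" above "Mildly lame"?
```

Declaring the lameness score as an ordered factor keeps the clinicians' ordering while preventing arithmetic on the labels, which matches the point that differences between ordinal categories cannot be measured.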

1.2 Visualizing categorical data


Categorical data are often summarized using tables where the frequencies
of the different categories are listed. The frequency is defined as the number
of occurrences of each value in the dataset. If there are only a few categories
then tables are perfect for presenting the data, but if there are several cate-
gories or if we want to compare frequencies in different populations then the
information may be better presented in a graph. A bar chart is a simple plot
that shows the possible categories and the frequency of each category.
The relative frequency is useful if you want to compare datasets of different
sizes; i.e., where the number of observations in two samples differ. The rela-
tive frequency for a category is computed by dividing the frequency of that
category by the total number of observations for the sample, n,

$$\text{relative frequency} = \frac{\text{frequency}}{n}.$$
The advantage of the relative frequency is that it is unrelated to the sample
size, so it is possible to compare the relative frequencies of a category in two
different samples directly since we draw attention to the relative proportion
of observations that fall in each category.
A segmented bar chart presents the relative frequencies of the categories in
a sample as a single bar with a total height of 100% and where the relative
frequencies of the different categories are stacked on top of each other. The
information content of a segmented bar chart is the same as that of a
relative frequency plot, but it may be easier to identify differences in the
distribution of observations from different populations.
Example 1.2. Tibial dyschondroplasia. Tibial dyschondroplasia (TD) is a
disease that affects bone growth in young poultry and is the primary
cause of lameness and mortality in commercial poultry. In an experiment 120
broilers (chickens raised for meat) were split into four equal-sized groups,
each given different feeding strategies to investigate if the feeding strategy
influenced the prevalence of TD:
Group A: feeding ad libitum.
Group B: 8 hours fasting at age 6-12 days.
Group C: 8 hours fasting at age 6-19 days.
Group D: 8 hours fasting at age 6-26 days.
At the time of slaughter the presence of TD was registered for each chicken.
The following table lists the result:

             Group A   Group B   Group C   Group D   Total
TD present   21        7         6         12        46
TD absent    9         23        24        18        74

The difference between the relative frequencies of TD and non-TD chickens
is very clear when comparing the four groups in Figure 1.2.
[Figure: two panels, each with Grp A, Grp B, Grp C, and Grp D on the horizontal axis and relative frequency on the vertical axis.]

Figure 1.2: Relative frequency plot (left) for broiler chickens with and without pres-
ence of tibial dyschondroplasia (dark and light bars, respectively). The segmented bar
plot (right) shows stacked relative frequencies of broiler chickens with and without
tibial dyschondroplasia for the four groups.
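
The relative frequencies plotted in Figure 1.2 follow directly from the table in Example 1.2. The sketch below divides each count by its group total; the object names `td`, `n`, and `relfreq` and the row and column labels are illustrative choices.

```r
# Counts from Example 1.2; row and column labels are added for readability.
td <- matrix(c(21, 9, 7, 23, 6, 24, 12, 18), nrow = 2,
             dimnames = list(c("TD present", "TD absent"),
                             c("Grp A", "Grp B", "Grp C", "Grp D")))

n <- colSums(td)                   # number of chickens in each group
relfreq <- sweep(td, 2, n, "/")    # divide every column by its group total
round(relfreq, 3)
```

Each column of `relfreq` sums to 1, which is why the segmented bars in the right panel all have the same total height.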
1.3 Visualizing quantitative data


With categorical variables we can plot the frequency or relative frequency
for each of the categories to display the data. The same approach works for
discrete quantitative data when there are just a few different possible
values, but frequency plots of each observed value are not informative for
quantitative continuous variables since there will be too many different
“categories” and most of them will have a very low frequency. However, we
can do something that resembles the frequency plot for categorical data by
grouping the quantitative continuous data into bins that take the place of the
categories used for categorical data. We then count the number of
observations that fall into each bin; the bins and their relative
frequencies give the distribution of the quantitative continuous variable.
We can display the bin counts in a histogram, which is analogous to the
bar chart, and the histograms allow us to graphically summarize the dis-
tribution of the dataset; e.g., the center, spread, and number of modes in the
data. The relative frequency histogram can be used to compare the distributions
from different populations since the relative frequency histogram has the in-
herent feature that areas for each bar in the histogram are proportional to the
probability that an observation will fall in the range covered by the bin. The
shape of the relative frequency histogram will be identical to the shape of the
histogram and only the scale will differ. Note that if for some reason the bin
widths are not equal, then the areas of the histogram bars will no longer be
proportional to the frequencies of the corresponding bins, simply because
wider bins are more likely to contain more observations than narrower
bins.
The relationship between two quantitative variables can be illustrated
with a scatter plot, where the data points are plotted on a two-dimensional
graph. Scatter plots provide information about the relationship between the
variables, including the strength of the relationship, the shape (whether it is
linear, curved, or something else), and the direction (positive or negative),
and make it easy to spot extreme observations. If one of the variables can
be controlled by the experimenter then that variable might be considered an
explanatory variable and is usually plotted on the x-axis, whereas the other
variable is considered a response variable and is plotted on the y-axis. If nei-
ther one nor the other variable can be interpreted as an explanatory variable
then either variable can be plotted on either axis and the scatter plot will
illustrate only the relationship but not the causation between the two vari-
ables.
Example 1.3. Tenderness of pork. Two different cooling methods for pork
meat were compared in an experiment with 18 pigs from two different
groups: low or high pH content. After slaughter, each pig was split in two
and one side was exposed to rapid cooling while the other was put through
a cooling tunnel. After the experiment, the tenderness of the meat was
measured. Data are shown in Table 1.2 and are from a study by Møller et al.
(1987).

Table 1.2: Tenderness of pork from different cooling methods and pH levels

Pig no.   pH group   Tunnel cooling   Rapid cooling
73        high       8.44             8.44
74        high       7.11             6.00
75        high       6.00             5.78
76        high       7.56             7.67
77        low        7.22             5.56
78        high       5.11             4.56
79        low        3.11             3.33
80        high       8.67             8.00
81        low        7.44             7.00
82        low        4.33             4.89
83        low        6.78             6.56
84        low        5.56             5.67
85        low        7.33             6.33
86        low        4.22             5.67
87        high       5.78             7.67
94        low        5.78             5.56
95        low        6.44             5.67
96        low        8.00             5.33

Figure 1.3 shows the histograms and relative frequency histograms for
the high- and low-pH groups. Notice that the shapes for the low- and high-
pH groups do not change from the histograms to the relative frequency his-
tograms. The relative frequency histograms make it easier to compare the
distributions in the low- and high-pH groups since the two groups have dif-
ferent numbers of observations.
Figure 1.4 shows the relationship of tenderness between the rapid and
tunnel cooling methods for the combined data of low- and high-pH groups.
Scatter plots are extremely useful as tools to identify relationships between
two quantitative continuous variables.
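
The binning behind a histogram can also be carried out by hand. The sketch below groups the tunnel cooling values for the low-pH pigs into unit-width bins; the break points 3, 4, ..., 9 are an illustrative choice and need not be the ones used for Figure 1.3.

```r
# Tunnel cooling tenderness and pH group from Table 1.2.
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67, 7.44,
            4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)
ph <- c("hi", "hi", "hi", "hi", "lo", "hi", "lo", "hi", "lo",
        "lo", "lo", "lo", "lo", "lo", "hi", "lo", "lo", "lo")

low <- tunnel[ph == "lo"]          # observations from the low-pH group

# Bins (3,4], (4,5], ..., (8,9]; the break points are an assumption.
bins    <- cut(low, breaks = 3:9)
counts  <- table(bins)             # frequency in each bin
relfreq <- counts / length(low)    # relative frequency in each bin

counts
round(relfreq, 2)
```

Because every bin here has width 1, the relative frequencies coincide with the bar heights of a relative frequency histogram drawn with the same breaks.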

1.4 Statistical summaries


Categorical data are best summarized in tables like the one shown in
Example 1.2 on p. 5. Quantitative data do not have a fixed set of categories, so
representing those in a table is infeasible. One work-around for this problem
would be to make bins and present the frequency of each bin in the same way
we make the histograms. However, information is lost by grouping data in
bins and the number of bins and bin widths may have a huge influence on the
resulting table. Instead it may be advantageous to identify certain summary
statistics that capture the main characteristics of the distribution.

Figure 1.3: Histograms (top row) and relative frequency histograms (bottom row) for
tunnel cooling of pork for low- and high-pH groups. [Panels: Tenderness (low pH) and
Tenderness (high pH) on the horizontal axes; Frequency (top row) and Density (bottom
row) on the vertical axes.]

Often it is desirable to have a single number to describe the values in
a dataset, and this number should be representative of the data. It seems
reasonable that this representative number should be close to the “middle”
of the data such that it best describes all of the data, and we call any such
number a measure of central tendency. Thus, the central tendency represents
the value of a typical observation.
Very different sets of data can have the same central tendency. Thus a sin-
gle representative number is insufficient to describe the distribution of the
data, and we are also interested in how closely the central tendency repre-
sents the values in the dataset. The dispersion or variability represents how
much the observations in a dataset differ from the central tendency; i.e., how
widely the data are spread out.

Figure 1.4: Scatter plot of tenderness for rapid cooling and tunnel cooling.
[Axes: Tenderness (tunnel) on the horizontal axis, Tenderness (rapid) on the vertical axis.]

1.4.1 Median and inter-quartile range


In the following we let $y_1, \ldots, y_n$ represent independent, quantitative
observations in a sample of size $n$ from some population.∗ We can order the
observations $y_1, \ldots, y_n$ from lowest to highest and we use the following
notation to represent the set of ordered observations: $y_{(1)}, \ldots, y_{(n)}$.
Thus $y_{(1)}$ is the smallest value of $y_1, \ldots, y_n$, $y_{(2)}$ is the second
smallest, etc.
The median of n numbers is a measure of the central tendency and is de-
fined as the middle number when the numbers are ordered. If n is even then
the median is the average of the two middle numbers:
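$$\text{Median} = \begin{cases} y_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left[ y_{(n/2)} + y_{(n/2+1)} \right] & \text{if } n \text{ is even.} \end{cases} \tag{1.1}$$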

The median may be used for both quantitative and ordinal categorical data.
The range is one measure of dispersion and it is defined as the highest
value in the dataset minus the lowest value in the dataset:

$$\text{Range} = y_{(n)} - y_{(1)}. \tag{1.2}$$

One weakness of the range is that it uses only two values in its calculation
and disregards all other values. Two sets of data could have the same range
but be “spread out” in very different fashions. Consider the following three
datasets:
Dataset 1: 14, 14, 14, 14, 14, 14, 34
Dataset 2: 14, 16, 19, 22, 26, 30, 34
Dataset 3: 14, 14, 14, 34, 34, 34, 34

The medians for the three datasets are 14, 22, and 34, respectively. The range
for each set is 34 − 14 = 20, but the three sets are very different. The first
set is not very dispersed, with the exception of the single value of 34. The
second dataset has values that are more or less evenly dispersed between 14
and 34, while the last set has 3 values of 14 and 4 values of 34 with no values
in between. Clearly the range does not provide much information about
the spread of the observations in the dataset; i.e., how far away from the
center the typical values are located.

∗ Independence is discussed more closely on p. 80 and in Chapter 10. Roughly speaking,
independence means that the observations do not provide any information about each other;
e.g., even if the previous observation is larger than the mean, there is no reason to believe that
the next observation will be larger than the mean.

Another measure of dispersion that tries to capture some of the spread
of the values is the inter-quartile range (IQR). The inter-quartile range is cal-
culated as follows: We remove the top 25% and the bottom 25% of all obser-
vations and then calculate the range of the remaining values. We denote the
first and third quartile as Q1 and Q3, respectively.

$$\text{IQR} = \text{Q3} - \text{Q1}. \tag{1.3}$$

The advantage of the IQR over the range is that the IQR is not as sensitive to
extreme values because the IQR is based on the middle 50% of the observa-
tions.
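
The three small datasets above can be used to compare the median, the range, and the IQR side by side. This is a sketch only: the helper `summarize()` and the object names are invented here, and R's `IQR()` relies on R's default quantile definition, which interpolates between observations and can therefore differ slightly from other quartile rules.

```r
d1 <- c(14, 14, 14, 14, 14, 14, 34)
d2 <- c(14, 16, 19, 22, 26, 30, 34)
d3 <- c(14, 14, 14, 34, 34, 34, 34)

# Illustrative helper: the three summaries discussed in the text.
summarize <- function(y) {
  c(median = median(y),
    range  = max(y) - min(y),    # formula (1.2)
    iqr    = IQR(y))             # Q3 - Q1 with R's default quantile rule
}

res <- sapply(list(d1 = d1, d2 = d2, d3 = d3), summarize)
res
```

The range is 20 in every column, whereas the IQR differs between the three datasets and so picks up some of the differences in spread that the range misses.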
Generally, we can identify separate cut-off points taken at regular inter-
vals (called quantiles) if we order the data according to magnitude. In the
following we divide the ordered data into 100 essentially equal-sized subsets
such that the xth quantile is defined as the cut-off point where x% of the sam-
ple has a value equal to or less than the cut-off point. For example, the 40th
quantile splits the data into two groups containing, respectively, 40% and
60% of the data. It may be impossible to obtain the exact partition for a given
quantile in a finite dataset, but there are various ways to handle this. This
is not too important, though, as the various definitions only lead to slightly
different values for large datasets.† The important thing to understand is the
interpretation.
The first quartile is defined as the 25th quantile and the third quartile is
defined as the 75th quantile. The median (1.1) corresponds to the 50th quan-
tile, so the first quartile, the median, and the third quartile split the data into
4 groups of equal size.

† One possibility is to round up and define the xth quantile as the smallest ranked observation
such that at least x% of the data have values equal to or below it; another is to take the
average of the two closest observations. For the median in a sample with an even number of
observations, this corresponds to finding the two middle observations, and taking the higher
value or the average, respectively.

1.4.2 Boxplot
A boxplot (also called a box-and-whiskers plot) summarizes the data graph-
ically by plotting the following five summaries of the data: minimum, first
quartile, median, third quartile, and maximum, as shown below for dataset
2 listed above:

[Boxplot of dataset 2 drawn along a horizontal axis running from 10 to 35.]

The middle 50% of the data are represented by a box and the median is
shown as a fat line inside the box. Two whiskers extend from the box to the
minimum and the maximum value. The five summaries used in the boxplot
present a nice overview of the distribution of the observations in the dataset,
and the visualization makes it easy to determine if the distribution is sym-
metric or skewed.
The IQR is sometimes used to identify outliers, i.e., observations that differ
so much from the rest of the data that they appear extreme compared to the
remaining observations. As a rule-of-thumb, an outlier is an observation that
lies more than 1.5·IQR below the first quartile or more than 1.5·IQR above the
third quartile; i.e., anything outside the following interval:

$$[\,\text{Q1} - 1.5 \cdot \text{IQR};\ \text{Q3} + 1.5 \cdot \text{IQR}\,]. \tag{1.4}$$

It is often critical to identify outliers and extreme observations as they can
have an enormous impact on the conclusions from a statistical analysis. We
shall later see how the presence or absence of outliers is used to check the
validity of statistical models.
Sometimes data are presented in a modified boxplot, where outliers are plot-
ted as individual points and where the minimum and maximum summaries
are replaced by the smallest and largest observations that are still within the
interval [ Q1 − 1.5 · IQR; Q3 + 1.5 · IQR]. That enables the reader to determine
if there are any extreme values in the dataset (see Example 1.4).
Example 1.4. Tenderness of pork (continued from p. 6). If we order the 18
measurements for tunnel cooling from the pork tenderness data according to
size we get

3.11 4.22 4.33 5.11 5.56 5.78 5.78 6.00 6.44


6.78 7.11 7.22 7.33 7.44 7.56 8.00 8.44 8.67

such that $y_{(1)} = 3.11$, $y_{(2)} = 4.22$, etc. There is an even number of
observations in this sample, so we should take the average of the middle two
observations to calculate the median; i.e.,

$$\text{Median} = \frac{6.44 + 6.78}{2} = 6.61.$$

The range of the observations is 8.67 − 3.11 = 5.56.
Since there are 18 observations in the dataset, we have that the lower
quartile should be observation 18 · 1/4 = 4.5. We round that value up so the
lower quartile corresponds to observation 5; i.e., Q1 = 5.56. Likewise, the
upper quartile is observation 18 · 3/4 = 13.5, which we round up to 14, so
Q3 = 7.44. Thus, the inter-quartile range is 7.44 − 5.56 = 1.88. The modified
boxplots for both tunnel and rapid cooling are shown below.

[Modified boxplots of tenderness score (vertical axis, 3 to 8) for tunnel cooling and rapid cooling.]

From the modified boxplots we see that the distribution of values for tunnel
cooling is fairly symmetric whereas the distribution of the observations from
rapid cooling is highly skewed. By placing boxplots from two samples next to
each other we can also directly compare the two distributions: the tenderness
values from tunnel cooling generally appear to be higher than the values
from rapid cooling although there are a few very small values for tunnel
cooling. We can also see from the boxplot that there is a single outlier for
rapid cooling. It is worth checking the dataset to see if this is indeed a genuine
observation.
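
The calculations in Example 1.4 can be reproduced in R. The sketch below assumes that `quantile()` with `type = 1` is an adequate stand-in for the "round up" rule used in the example (it returns the smallest observation with at least the requested share of the data at or below it); object names such as `q_tunnel` and `fences` are illustrative.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67, 7.44,
            4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)
rapid  <- c(8.44, 6.00, 5.78, 7.67, 5.56, 4.56, 3.33, 8.00, 7.00,
            4.89, 6.56, 5.67, 6.33, 5.67, 7.67, 5.56, 5.67, 5.33)

sort(tunnel)[c(9, 10)]          # the two middle observations
median(tunnel)                  # their average, formula (1.1)
max(tunnel) - min(tunnel)       # range, formula (1.2)

# Quartiles by the "round up" rule (type = 1 is assumed to match it).
q_tunnel   <- quantile(tunnel, c(0.25, 0.75), type = 1)
iqr_tunnel <- unname(q_tunnel[2] - q_tunnel[1])
q_tunnel
iqr_tunnel

# Outlier interval (1.4) applied to the rapid cooling data.
q_rapid   <- quantile(rapid, c(0.25, 0.75), type = 1)
iqr_rapid <- unname(q_rapid[2] - q_rapid[1])
fences    <- c(q_rapid[1] - 1.5 * iqr_rapid, q_rapid[2] + 1.5 * iqr_rapid)
outliers  <- rapid[rapid < fences[1] | rapid > fences[2]]
outliers
```

The base R function `boxplot()` draws modified boxplots by default, although the quartiles it uses may differ slightly from the rounding rule applied here.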

1.4.3 The mean and standard deviation


The mean is another measure of the center for quantitative data. Let us
start by introducing some notation. Let $y_1, \ldots, y_n$ denote the quantitative
observations in a sample of size $n$ from some population. The sample mean is
defined as

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \tag{1.5}$$

and is calculated as a regular average: we add up all the observations and
divide by the number of observations. The sample standard deviation is a
measure of dispersion for quantitative data and is defined as

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}. \tag{1.6}$$

Loosely speaking, the standard deviation measures the “average deviation
from the mean” observed in the sample; i.e., the standard deviation measures
how far away from the center we can expect our observations to be on
average.
The sample variance is denoted $s^2$ and is simply the sample standard
deviation squared:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}. \tag{1.7}$$

The mean and standard deviation provide more information than the me-
dian and inter-quartile range because their values utilize information from
all the available observations. The mean, median, standard deviation, and
inter-quartile range have the same units as the values from which they are
calculated.
Example 1.5. Tenderness of pork (continued from p. 6). The mean of the
tunnel cooling data is

$$\bar{y} = \frac{3.11 + 4.22 + 4.33 + 5.11 + \cdots + 7.56 + 8.00 + 8.44 + 8.67}{18} = 6.382.$$

The standard deviation becomes

$$s = \sqrt{\frac{(3.11 - 6.382)^2 + (4.22 - 6.382)^2 + \cdots + (8.67 - 6.382)^2}{18 - 1}} = 1.527.$$

Thus the mean tenderness for tunnel cooling is 6.382 and the corresponding
standard deviation is 1.527 units on the tenderness scale.
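
As a check on Example 1.5, formulas (1.5) and (1.6) can be evaluated step by step and compared with R's built-in `mean()` and `sd()`. The names `ybar` and `s` are simply chosen to mirror the notation in the text.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67, 7.44,
            4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)

n    <- length(tunnel)
ybar <- sum(tunnel) / n                        # formula (1.5)
s    <- sqrt(sum((tunnel - ybar)^2) / (n - 1)) # formula (1.6)
round(c(mean = ybar, sd = s), 3)

# The built-in functions use the same definitions.
all.equal(ybar, mean(tunnel))
all.equal(s, sd(tunnel))
```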
Looking at formula (1.7) for the variance, we see that it is roughly the
average of the squared deviations. It would be the average if we divided the
sum in (1.7) by n instead of n − 1. The variance of the population (not the
sample, but the population) is $\sigma^2 = \sum_i (y_i - \mu)^2 / n$, which requires
knowledge about the true mean of the population, µ. We could calculate this
variance if full information about the total population was available, but in
practice we need to replace µ with our “best guess” of µ, which is ȳ. The
sample mean ȳ depends on the observations from the sample and will vary from
sample to sample, so it is not a perfectly precise estimate of µ. We divide by
n − 1 in (1.6) and (1.7) in order to take this uncertainty about the estimate of
µ into account. It can be shown that if we divide the sum of squared deviations
by n we tend to underestimate the true population variance. This is remedied by
dividing by n − 1 instead. The reason is that the sum of the deviations from ȳ
is always zero (by construction). Hence, when the first n − 1 deviations have
been calculated, the last deviation is given automatically, so we essentially
have only n − 1 “observations” to provide information about the deviations.
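
A small simulation can make the argument for dividing by n − 1 concrete. Everything in this sketch is an illustrative assumption rather than part of the text: normally distributed observations with true variance 4, samples of size 5, 10000 repetitions, and the helper name `one_sample()`.

```r
set.seed(1)        # arbitrary seed, only to make the run reproducible
n      <- 5        # small samples make the effect easy to see
sigma2 <- 4        # true population variance (assumed for the simulation)

# Draw one sample and return the sum of deviations and both variance versions.
one_sample <- function() {
  y   <- rnorm(n, mean = 10, sd = sqrt(sigma2))
  dev <- y - mean(y)
  c(sum_dev = sum(dev),               # zero apart from rounding error
    div_n   = sum(dev^2) / n,         # sum of squares divided by n
    div_n1  = sum(dev^2) / (n - 1))   # divided by n - 1, as in (1.7)
}

sims <- replicate(10000, one_sample())
rowMeans(sims)
```

Averaged over many samples, the version divided by n falls clearly below the true variance, while the version divided by n − 1 lands close to it; the first row confirms that the deviations always sum to zero.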
The sample mean and sample standard deviation have some nice proper-
ties if the data are transformed as shown in Infobox 1.1.

Infobox 1.1: Sample mean and standard deviation of linearly transformed data

Let ȳ and s be the sample mean and sample standard deviation from
observations $y_1, \ldots, y_n$ and let $y_i' = c \cdot y_i + b$ be a linear transformation
of the y’s with constants b and c. Then $\bar{y}' = c \cdot \bar{y} + b$ and $s' = |c| \cdot s$.

The results presented in Infobox 1.1 can be proved by inserting the
transformed values $y_i'$ in the formulas (1.5) and (1.6). The first part of the
result tells us that if we multiply each observation, $y_i$, by a constant value, c,
and add a constant, b, then the mean of the new observations is the original
mean multiplied by the factor c, plus b. Thus, if we measured, say,
height in centimeters instead of meters, the mean would be exactly 100 times
as big for the centimeters as it would be for the measurements in meters. In
addition, the standard deviation of the transformed variables, $y_1', \ldots, y_n'$, is
the standard deviation of the original sample multiplied by the
factor |c|. The standard deviation of our measurements in centimeters is going
to be exactly 100 times as large as the standard deviation in meters.
This is a nice property since it means that simple linear transformation
will not have any surprising effects on the mean and standard deviation.
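
The transformation rules can be checked numerically. The sketch below applies an arbitrary linear transformation to the tunnel cooling data; the constants `b = 3` and `slope = -2` are hypothetical values picked so that the absolute value in the rule for the standard deviation matters.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67, 7.44,
            4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)

b     <- 3      # hypothetical additive constant
slope <- -2     # hypothetical multiplicative constant (c in the text)

y_new <- slope * tunnel + b

# Mean: multiplied by the constant, plus b.
all.equal(mean(y_new), slope * mean(tunnel) + b)

# Standard deviation: multiplied by the absolute value; b plays no role.
all.equal(sd(y_new), abs(slope) * sd(tunnel))
```

With a negative multiplier the mean changes sign but the standard deviation stays positive, which is why the rule uses |c| rather than c.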

1.4.4 Mean or median?


The mean and median are both measures of the central tendency of a
dataset, but the two measures have different advantages and disadvantages.
The median partitions the data into two parts such that there is an equal
number of observations on either side of the median or that the two areas
under the histogram have the same size — regardless of how far away the
observations are from the center. The mean also partitions the data into two
parts, but it uses the observed values to decide where to split the data. In a
sense, a histogram balances when supported at the mean since both area size
and distance from center are taken into account. Just like a seesaw, a single
value far away from the center will balance several values closer to the center,
and hence the percentage of observations on either side of the mean can differ
from 50%.
One advantage of the median is that it is not influenced by extreme values
in the dataset. Only the middle observation (or the two middle observations
when n is even) is used in the calculation,
and the actual values of the remaining observations are not used. The mean
on the other hand is sensitive to all values in the dataset since every
observation in the data affects the mean, and extreme observations can have a
substantial influence on the mean value.
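
The different sensitivity of the two measures is easy to demonstrate with dataset 1 from Section 1.4.1. In this sketch the largest value is replaced by a much larger one; the replacement value 340 is a made-up example of an extreme observation, such as a typing error.

```r
d1 <- c(14, 14, 14, 14, 14, 14, 34)    # dataset 1 from Section 1.4.1
d1_extreme <- replace(d1, 7, 340)      # hypothetical: 34 entered as 340

c(median(d1), median(d1_extreme))      # the median does not move
c(mean(d1), mean(d1_extreme))          # the mean is pulled upwards
```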
The mean value has some very desirable mathematical properties that
make it possible to prove theorems, and useful results within statistics and
inference methods naturally give rise to the mean value as a parameter
estimate. It is much more problematic to prove mathematical results related to
the median even though it is more robust to extreme observations. Generally
the mean is used for symmetric quantitative data, except in situations with
extreme values, where the median is used. The mean and standard deviation
may appear to have limited use since they are only really meaningful for
symmetric distributions. However, the central limit theorem from probability
theory shows that the distribution of sample means and estimates can indeed
be considered approximately symmetric regardless of the distribution of the
original observations, provided that the sample size is large (see Section 4.4).
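
The effect of averaging on symmetry can be illustrated with a short simulation. The choices below are assumptions made for this sketch only: exponentially distributed observations as the skewed example, samples of size 50, 5000 repetitions, and a helper `skewness()` defined here because base R does not provide one.

```r
set.seed(2)   # arbitrary seed, only to make the run reproducible

# Illustrative helper: a simple moment-based measure of asymmetry
# (close to 0 for symmetric data, positive for a long right tail).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

raw   <- rexp(5000)                        # single, highly skewed observations
means <- replicate(5000, mean(rexp(50)))   # means of samples of size 50

c(raw = skewness(raw), means = skewness(means))
```

The individual observations are strongly right-skewed, whereas the sample means are much closer to symmetric, and they become more so as the sample size grows.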

Figure 1.5: Histograms, boxplots, and means (▲) for 4 different datasets. The lower
right dataset contains 100 observations; the remaining three datasets all contain
1000 observations.

Figure 1.5 shows histograms, modified boxplots, and means for four different
distributions. The top-left distribution has first and third quartiles that
are about the same distance from the median and the same is true for the
whiskers. This, in combination with the outline of the histogram, indicates
that the distribution is symmetric, and we also see that the median and the
mean are almost identical. The top-right distribution in Figure 1.5 is highly
skewed, and we see that there is a substantial difference between the mean
value and the median. Notice also from the modified boxplot that several
observations are outside the outlier interval (1.4) and are plotted as points.
This does not necessarily mean that we have so many extreme observations
in this case since the distribution is highly skewed. The outlier interval (1.4)
is defined by the IQR, and when the distribution is highly skewed the outlier
interval will have difficulty identifying outliers in one direction (the non-
skewed direction, or towards the left in Figure 1.5) while it may identify too
many outliers in the other direction (the skewed direction, or towards the
right in Figure 1.5). The bottom-left distribution is bimodal and is clearly
symmetric and the bottom-right distribution is also symmetric and resem-
bles the distribution in the upper-left panel but is a bit more ragged.

1.5 What is a probability?


Most people have an intuitive understanding of what a “probability” is,
and we shall briefly cover the concept of probabilities in this section. For a
more mathematical definition of probabilities and probability rules see Chap-
ter 10.
We think of the probability of a random event as the limit of its relative fre-
quency in an infinitely large number of experiments. We have already used
the relative frequency approach earlier, when we discussed presentation of
categorical and continuous data. When a random experiment is performed
multiple times we are not guaranteed to get the exact same result every time.
If we roll a die we do not get the same result every time, and similarly we
end up with different daily quantities of milk even if we treat and feed each
cow the same way every day.
In the simplest situation we can register whether or not a random event
occurs; for example, if a single die shows an even number. If we denote this
event (i.e., rolling an even number with a single die) A and we let $n_A$ be the
number of occurrences of A out of n rolls, then $n_A/n$ is the relative frequency
of the event A. The relative frequency stabilizes as n increases, and the
probability of A is then the limit of the relative frequency as n tends towards
infinity.

Example 1.6. Throwing thumbtacks. A brass thumbtack was thrown 100
times and it was registered whether the pin was pointing up or down towards
the table upon landing (Rudemo, 1979). The results are shown in Table 1.3,
where ‘1’ corresponds to “tip pointing down” and ‘0’ corresponds to
“tip pointing up”. The relative frequency of the event “pin points down” as
a function of the number of throws is shown in Figure 1.6.

Figure 1.6: Thumbtack throwing. Relative frequency of the event “pin points down”
as the number of throws increases. [Axes: n from 0 to 100 on the horizontal axis,
Relative frequency from 0.0 to 1.0 on the vertical axis.]

Table 1.3: Thumbtacks: 100 throws with a brass thumbtack. 1 = pin points down,
0 = pin points up

11001 10100 10110 01110 10011
00001 11010 11011 10011 10111
01011 11010 01001 00111 10011
11011 00111 10100 10011 11010

We see from Figure 1.6 that the relative frequency varies highly when n
is low but that it stabilizes on a value around 0.6 as the number of throws
increases. Hence we conclude that the probability of observing a pin pointing
down when throwing a thumbtack is around 0.6 or 60%.
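
The running relative frequency for the thumbtack data can be computed directly from Table 1.3. This sketch assumes that the table lists the throws in order, row by row from left to right; that ordering is not stated explicitly, and it affects the path of the curve but not the final value. The object names are illustrative.

```r
# Table 1.3, one string per row (1 = pin points down, 0 = pin points up).
rows <- c("11001 10100 10110 01110 10011",
          "00001 11010 11011 10011 10111",
          "01011 11010 01001 00111 10011",
          "11011 00111 10100 10011 11010")

# Remove the spaces and split into single digits (row-by-row order assumed).
digits <- strsplit(gsub(" ", "", paste(rows, collapse = "")), "")[[1]]
down   <- as.integer(digits)

n       <- seq_along(down)        # number of throws so far
relfreq <- cumsum(down) / n       # relative frequency of "pin points down"

relfreq[c(10, 50, 100)]
```

A call such as `plot(n, relfreq, type = "l", ylim = c(0, 1))` would then draw a curve of the kind shown in Figure 1.6.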

1.6 R
In Example 1.3 on p. 6 we had information on tunnel cooling for 18 pigs
for two different pH groups. We will use that dataset to illustrate various R
functions for visualizing data and calculating summary statistics for
quantitative data. Categorical data are illustrated with the tibial
dyschondroplasia data from Example 1.2.

We start by entering the two datasets into R:


> # First data on tibial dyschondroplasia
> m <- matrix(c(21, 9, 7, 23, 6, 24, 12, 18), ncol=4)
> # Then data on cooling methods
> tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67,
+ 7.44, 4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)
> rapid <- c(8.44, 6.00, 5.78, 7.67, 5.56, 4.56, 3.33, 8.00,
+ 7.00, 4.89, 6.56, 5.67, 6.33, 5.67, 7.67, 5.56, 5.67, 5.33)
> ph <- c("hi", "hi", "hi", "hi", "lo", "hi", "lo", "hi", "lo",
+ "lo", "lo", "lo", "lo", "lo", "hi", "lo", "lo", "lo")

1.6.1 Visualizing data


Categorical data are visualized as bar plots, which are produced by the
barplot() function in R. When the first argument to barplot() is a vector,
then the default plot consists of a sequence of bars with heights correspond-
ing to the elements of the vector. If the first argument is a matrix, then each
bar is a segmented bar plot where the values in each column of the matrix
correspond to the height of the elements of the stacked bar.
If we prefer the stacked bar plot to show relative frequencies then we need
to divide each column in the matrix by the column sum. The prop.table()
function converts the matrix to the relative frequencies given either the row
or column sums. The second argument to prop.table() determines if the
elements of the table are relative to the row sums (margin=1) or the column
sums (margin=2). The following lines produce the two plots shown in Figure
1.2 on p. 5 and use the options beside=TRUE and names= (short for names.arg=)
to barplot().
> relfrq <- prop.table(m, margin=2)
> relfrq
[,1] [,2] [,3] [,4]
[1,] 0.7 0.2333333 0.2 0.4
[2,] 0.3 0.7666667 0.8 0.6
> # Make juxtaposed barplot
> barplot(relfrq, beside=TRUE,
+ names=c("Grp A", "Grp B", "Grp C", "Grp D"))
> # Stacked relative barplot with labels added
> barplot(relfrq, names=c("Grp A", "Grp B", "Grp C", "Grp D"))
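What prop.table() does with the margin argument can be checked by hand. The following is a minimal sketch, assuming the same count matrix m as above; the names by_col, by_hand, and by_row are introduced here only for illustration.

```r
# Counts from the text: a 2 x 4 matrix, filled column by column
m <- matrix(c(21, 9, 7, 23, 6, 24, 12, 18), ncol = 4)

# Relative frequencies within each column (margin = 2) ...
by_col <- prop.table(m, margin = 2)
# ... should equal the counts divided by their column sums
by_hand <- sweep(m, 2, colSums(m), "/")

# With margin = 1 the frequencies are relative to the row sums instead
by_row <- prop.table(m, margin = 1)

print(all.equal(by_col, by_hand))  # do the two computations agree?
print(colSums(by_col))             # each column of by_col sums to 1
print(rowSums(by_row))             # each row of by_row sums to 1
```

Which margin to use depends on the comparison of interest: margin=2 makes the columns (here the four groups) comparable even if the group sizes differ.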
We use a simple scatter plot to illustrate the relationship between two
quantitative variables. The plot() function is used to produce a scatter plot,
and we can add additional information to the plot by specifying the labels
for the x-axis and the y-axis with the xlab and ylab options to plot(). The
following command will generate the scatter plot seen in Figure 1.4 on p. 9:
> plot(tunnel, rapid, xlab="Tenderness (tunnel)",
+ ylab="Tenderness (rapid)")
plot() is a generic function in R and the output depends on the number
and type of objects that are provided as arguments to the function. A scatter
plot is produced when two numeric vectors are used in plot(). If only a
single numeric vector is used as an argument, plot(tunnel), then all the
observations for that vector are plotted with the observation number on the
x-axis and the corresponding value on the y-axis.
Histograms and relative frequency histograms are both produced with
the hist() function. By default the hist() function automatically groups the
quantitative data vector into bins of equal width and produces a frequency
histogram. We can force the hist() function to make a frequency plot or a
relative frequency plot by specifying either the freq=TRUE or the freq=FALSE
option, respectively. The following two commands produce the upper left
histogram and lower left relative frequency histogram seen in Figure 1.3 on
p. 8 and use the main= option to include a title.
> hist(tunnel[ph=="lo"], xlab="Tenderness (low pH)",
+ main="Histogram") # Add title to plot
> hist(tunnel[ph=="lo"], xlab="Tenderness (low pH)",
+ freq=FALSE, main="Histogram") # Force relative frequency plot
The number of bins is controlled by the breaks= option to hist(). If the
breaks= option is not entered, then R will try to determine a reasonable
number of bins. If we include an integer value for the breaks= option, then R
aims for that number of bins, but it treats the value as a suggestion and may
adjust it so that the breakpoints fall on round values.
> # Use the breaks option to specify the number of bins
> # regardless of the size of the dataset
> hist(tunnel[ph=="lo"], xlab="Tenderness (low pH)",
+ breaks = 8, main="Histogram")
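If an exact binning is needed, one option is to pass a vector of breakpoints to breaks= rather than a single number. The following is a minimal sketch; the breakpoints 3, 4, ..., 9 and the names lo, h, and h8 are chosen here for illustration, and plot=FALSE only suppresses the drawing so that the resulting bins can be inspected.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67,
            7.44, 4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)
ph <- c("hi", "hi", "hi", "hi", "lo", "hi", "lo", "hi", "lo",
        "lo", "lo", "lo", "lo", "lo", "hi", "lo", "lo", "lo")
lo <- tunnel[ph == "lo"]   # observations from the low pH group

# A vector of breakpoints fixes the bins exactly: (3,4], (4,5], ..., (8,9]
h <- hist(lo, breaks = seq(3, 9, by = 1), plot = FALSE)
print(h$breaks)
print(h$counts)

# A single number is only a suggestion; R picks "pretty" breakpoints
h8 <- hist(lo, breaks = 8, plot = FALSE)
print(h8$breaks)
print(length(h8$counts))   # number of bins actually used
```

Dropping plot=FALSE draws the histograms as usual.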
Horizontal and vertical boxplots are produced by the boxplot() function.
By default, R creates the modified boxplot as described on p. 11.
> boxplot(tunnel)
The standard boxplot where the whiskers extend to the minimum and
maximum value can be obtained by setting the range=0 option to
boxplot(). In addition, the boxplot can be made horizontal by including the
horizontal=TRUE option.
> # Horiz. boxplot with whiskers from minimum to maximum value
> boxplot(tunnel, range=0, horizontal=TRUE)
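The numbers behind a boxplot can be inspected without drawing it, which shows whether the modified and the standard boxplot differ for a given dataset. This sketch uses boxplot.stats() from base R; note that R builds the box from hinges, which may differ slightly from the quartile definition used in the text. For the tunnel data no observation appears to lie more than 1.5 box lengths from the box, so the two versions should coincide here.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67,
            7.44, 4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)

bs <- boxplot.stats(tunnel)  # coef = 1.5 is the default, as in boxplot()
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # observations beyond the whiskers (drawn as separate points)
print(range(tunnel))
```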
boxplot() is a generic function just like plot() and changes the output
based on the type and number of arguments. If we provide more than a single
numeric vector as input to boxplot(), then parallel boxplots will be
produced. Often it is easier to compare the distributions of several vectors
if they are plotted next to each other. The command below will produce the
figure seen in Example 1.4 on p. 11.

> # Parallel boxplots
> boxplot(tunnel, rapid, names=c("Tunnel", "Rapid"))

Note that we specify the names of the different vectors. If we do not specify
the names then R will label each boxplot sequentially from 1 and upwards.

1.6.2 Statistical summaries

We can use the mean(), median(), range(), IQR(), sd(), and var() functions
to calculate the mean, median, range, inter-quartile range, standard
deviation, and variance, respectively, for the vector of measurements.

> mean(tunnel) # Calculate the mean value
[1] 6.382222
> median(tunnel) # Calculate the median
[1] 6.61
> range(tunnel) # Calculate the range. Low and high values
[1] 3.11 8.67
> IQR(tunnel) # Calculate the inter-quartile range
[1] 1.7975
> sd(tunnel) # Calculate the standard deviation (SD)
[1] 1.527075
> var(tunnel) # Calculate the variance
[1] 2.331959
> sd(tunnel)^2 # The variance equals the SD squared
[1] 2.331959
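The built-in functions can be checked against the definitions. The sketch below computes the sample variance as the sum of squared deviations from the mean divided by n - 1, which is the definition var() uses; the names dev, var_by_hand, and sd_by_hand are introduced here for illustration.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67,
            7.44, 4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)

n <- length(tunnel)
dev <- tunnel - mean(tunnel)          # deviations from the mean
var_by_hand <- sum(dev^2) / (n - 1)   # sample variance divides by n - 1
sd_by_hand <- sqrt(var_by_hand)       # SD is the square root of the variance

print(c(by_hand = var_by_hand, builtin = var(tunnel)))
print(c(by_hand = sd_by_hand, builtin = sd(tunnel)))
```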

Quantiles can be calculated using the quantile() function. By default, R uses
a slightly different method to calculate quantiles, and to get the definition we
have presented in the text we should use the type=1 option.

> quantile(tunnel, 0.25, type=1) # 25th percentile (0.25 quantile) of tunnel data
 25%
5.56
> quantile(tunnel, c(0.10, 0.25, 0.60, 0.95), type=1)
 10%  25%  60%  95%
4.22 5.56 7.11 8.67
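The difference between the two quantile definitions can be seen by computing the quartiles both ways. In this sketch, q1 uses type=1 as in the text, and q7 uses R's default (type=7), which interpolates between neighboring observations. IQR() relies on the default definition, which is why its result need not equal the difference between the type=1 quartiles.

```r
tunnel <- c(8.44, 7.11, 6.00, 7.56, 7.22, 5.11, 3.11, 8.67,
            7.44, 4.33, 6.78, 5.56, 7.33, 4.22, 5.78, 5.78, 6.44, 8.00)

q1 <- quantile(tunnel, c(0.25, 0.75), type = 1)  # definition used in the text
q7 <- quantile(tunnel, c(0.25, 0.75))            # R's default, type = 7

print(q1)
print(q7)
print(unname(diff(q1)))   # quartile distance under the text's definition
print(unname(diff(q7)))   # quartile distance under R's default
print(IQR(tunnel))        # uses the default definition
```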

1.7 Exercises
1.1 Data types. For each of the following experiments you should identify
the variable(s) in the study, the data type of each variable, and the
sample size.

1. For each of 10 beetles, a biologist counted the number of times
the beetle fed on a disease-resistant plant during a 4-hour period.
2. In a nutritional study, 40 healthy males were measured for
weight and height as well as the weight of their food intake over
a 24-hour period.
3. Seven horses were included in a 2-week study. After the first
week a veterinarian measured the heart rate of each of the horses
after an identification chip was inserted in its neck. At the end of
the second week the veterinarian again measured the heart rate
after branding the horses with a hot iron.
4. The birth weight, number of siblings, mother’s race, and
mother’s age were recorded for each of 85 babies.

1.2 Blood pressure. Consider the following data on diastolic blood
pressure (measured in mmHg) for 9 patients:

Patient          1    2    3    4    5    6    7    8    9
Blood pressure  96  119  119  108  126  128  110  105   94

1. Determine the median, the range, and the quartiles.
2. Determine the inter-quartile range.
3. Construct a boxplot of the data.
4. Calculate the mean, the standard deviation, and the variance.
5. What are the units for the mean, standard deviation, and variance?
6. How will the mean change if we add 10 mmHg to each of the
measurements? How will this change the standard deviation
and the variance?
7. Do you think the mean will increase, decrease, or stay roughly
the same if we measure the diastolic blood pressure of more
individuals? How do you think more individuals will influence
the standard deviation?

1.3 Distribution of mayflies. To study the spatial distribution of mayflies
(Baetis rhodani), researchers examined a total of 80 random 10 centimeter
square test areas. They counted the number of mayflies, Y, in
each square as shown:

Mayflies    4   5   6   7   8   9  10
Frequency   2   2   5   7  10   9  10
Mayflies   11  12  13  14  15  16  17
Frequency  10   8   6   4   4   2   1

1. The mean and standard deviation of Y are ȳ = 10.09 and s = 2.96.
What percentage of the observations are within
(a) 1 standard deviation of the mean?
(b) 2 standard deviations of the mean?
2. Determine the total number of mayflies in all 80 squares. How
is this number related to ȳ?
3. Determine the median of the distribution.

1.4 Distribution shapes. Different words are often used to characterize
the overall shape of a distribution. Determine which of the following
6 phrases best matches the histograms seen in the figure below:
symmetrical, bimodal and symmetrical, skewed right, skewed left,
bimodal and skewed right, and uniform and symmetrical.

[Figure: six histograms, labeled a to f.]

1.5 Design of experiments. Assume that it is of interest to compare the
milk yield from cows that have received two different feeding strategies
(A and B) to determine if the feeding strategies lead to systematic
differences in the yields. Discuss the advantages and disadvantages
of the following four design strategies and whether or not they can
be used to investigate the purpose of the experiment.
1. Feed one cow according to plan A and one cow according to plan B.
2. 100 cows from one farm are fed according to plan A while 88
cows from another farm are fed according to plan B.
3. Ten cows are selected at random from a group of 20 cows and
fed according to plan A while the remaining 10 cows are fed
according to plan B.
4. For each of 10 twin pairs, a cow is chosen at random and fed
from plan A while the other cow is fed according to plan B.
1.6 Comparison of boxplots. Consider the following comparison between
the calorie content data from 10 common sandwiches from
McDonald’s and 9 common sandwiches from Burger King found on
their respective web pages.

[Figure: parallel boxplots of calories (y-axis, roughly 300 to 700) for Burger King and McDonald's.]

Describe the distributions (i.e., the shape, center, and spread of each
distribution) and how they compare to one another.

1.7 Histograms and boxplots. Use the following data from Rudemo
(1979) on the lengths in millimeters of 20 cones from a conifer (Picea
abies).

125.1  114.6   99.3  119.1  109.6
102.0  104.9  109.6  134.0  108.6
120.3   98.7  104.2   91.4  115.3
107.7   97.8  126.4  104.8  118.8

1. Read the data into R.
2. Calculate the standard deviation and variance of the cone
lengths. What is the relationship between the two numbers?
3. Use hist() to plot a histogram of the cone lengths. Use
boxplot() to plot a modified boxplot. What can you discern
about the cone lengths from the two figures?
4. Construct a new vector with the name conelenm which contains
the same lengths but now measured in centimeters. How will
the mean and standard deviation change after we change the
units?
5. Plot a histogram and boxplot of the transformed data (use
hist()). You can choose if you want to change the intervals on
the x-axis or if you want frequencies or relative frequencies on
the y-axis. What can you say about the shape of the distribution?
Is it symmetric, skewed to the right, or skewed to the left?

1.8 Which distribution? Consider the following three boxplots (1, 2, and
3) and histograms (x, y, and z). Which histogram goes with each boxplot?
Explain your answer.

[Figure: three histograms (x, y, and z) and three boxplots (1, 2, and 3).]

Table 1.4: Characteristics of the study population when comparing German Shepherd
dogs following diets with and without chocolate. Values are mean ± SD.

                         Standard food   Chocolate rich food
n                        100             92
Age (years)              3.6 ± 0.4       3.4 ± 0.3
Gender (M/F)             48/52           43/49
Weight (kg)              26.1 ± 1.4      25.8 ± 1.4
Temperature (Celsius)    37.7 ± 0.4      37.7 ± 0.5

1.9 Impact of outliers. Outliers are unusual or extreme data values. In
this exercise we will reuse the blood pressure data from Exercise 1.2
to illustrate that outliers can have a substantial impact on some of the
estimates. The data are reproduced in the table below (sorted by
increasing blood pressure). Note that the first two questions
below are also part of Exercise 1.2.

Patient          9    1    8    4    7    2    3    5    6
Blood pressure  94   96  105  108  110  119  119  126  128
1. Calculate the median and the mean of the data to get an estimate
of a “typical” observation from the population.
2. Calculate the range, inter-quartile range, and the standard
deviation from the sample to get an idea of the spread of the values
in the population.
3. Replace the value 119 from patient 2 with 149. That makes the
observation markedly larger than the rest of the observations.
How does that change influence the measures of central
tendency (i.e., the mean and the median)?
4. With the same replacement, calculate the range, inter-quartile
range, and standard deviation. How do they change as a result
of the extreme observation?
5. Outliers can be due to data entry errors or rare events. Discuss
how it is reasonable to deal with outliers in those two situations.
For example, is it appropriate to simply remove extreme data
points in the analysis?

1.10 Descriptive tables. In many scientific papers there is a presentation
of the dataset in a large table (typically the first table in the paper).
An example of such a table is shown in Table 1.4 where two groups
of dogs of the same breed are compared. Chocolate is toxic to dogs, and
the two groups correspond to a group of dogs getting a standard feed
and a group of dogs that were fed a diet containing a large amount
of dark chocolate.

1. Why is it interesting for the reader to see the data presented in
this manner?
2. The mean and standard deviation are often used to convey the
central tendency and spread of the data. When is it reasonable
to use the mean and the standard deviation to summarize the
data?
3. How can you summarize the data in situations where it is not
reasonable to use the mean and the standard deviation?
Chapter 2
Linear regression

The purpose of a data analysis is often to describe one variable as a function
of another variable. The functional relationship between the two variables
may in some situations be based on a well-known theoretical hypothesis, and
in other situations we may have no prior knowledge about the relationship
but would like to use the observed data to identify a relationship empirically.
Simple linear regression attempts to model the relationship between two
quantitative variables, x and y, by fitting a linear equation to the observed
data. The linear equation can be written as

y = α + β · x

where α (also called the intercept) is the value of y when x = 0 and β is the
slope (i.e., the change in y for each unit change in x); see Figure 2.1.
Figure 2.1: The straight line. [Plot of y against x; the intercept α and the slope β = ∆y/∆x are marked.]

When we want to model the relationship between two variables we
assume that one variable is the dependent variable (y in the linear equation)
while the other is an explanatory variable (x in the regression formula). We
want to model y as a linear function of x in the hope that information about
x will give us some information about the value of y; i.e., it will “explain” the
value of y, at least partly. For example, a modeler might use a linear
regression model to relate the heart beat frequency of frogs to the body
temperature or to relate the tenderness of pig meat to the length of the meat
fibers.

Example 2.1. Stearic acid and digestibility of fat. Jørgensen and Hansen
(1973) examined the digestibility of fat with different levels of stearic acid.
The average digestibility percent was measured for nine different levels of
stearic acid proportion. Data are shown in the table below, where x represents
stearic acid and y is digestibility measured in percent.
x 29.8 30.3 22.6 18.7 14.8 4.1 4.4 2.8 3.8
y 67.5 70.6 72.0 78.2 87.0 89.9 91.2 93.1 96.7
The data are plotted in Figure 2.2 together with the straight line defined by
y = 96.5334 − 0.9337 · x. In Section 2.1 it will become clear why these values
are used for the parameters in the model.
Figure 2.2: Digestibility of fat for different proportions of stearic acid in the fat. The
line is y = −0.9337 · x + 96.5334. [Axes: stearic acid % (x), digestibility % (y).]

Figure 2.2 shows that the relationship between stearic acid and digestibility
appears to scatter around a straight line and that the line plotted in the
figure seems to capture the general trend of the data.
We now have a model (the straight line) for the data that enables us to
give statements about digestibility even for levels of stearic acid that were
not part of the original experiment as long as we assume that the relationship
between stearic acid and digestibility can indeed be modeled by a straight
line. Based on our “model” we would, for example, expect a digestibility of
around 87% if we examine fat with a stearic acid level of 10%.
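The prediction in the example follows directly from the equation of the line. Here is a minimal sketch using the parameter values quoted in Example 2.1; the helper predict_digestibility() is a hypothetical name introduced here, not a function from R or from the text.

```r
# Parameter values of the straight line quoted in Example 2.1
alpha <- 96.5334   # intercept
beta  <- -0.9337   # slope

# Hypothetical helper: expected digestibility (%) at stearic acid level x (%)
predict_digestibility <- function(x) alpha + beta * x

print(predict_digestibility(10))            # stearic acid level of 10%
print(predict_digestibility(c(5, 10, 20)))  # the function is vectorized
```

The first value is 96.5334 − 0.9337 · 10 = 87.1964, which is the "around 87%" mentioned in the example.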
For each value of x the linear equation gives us the corresponding y-value.
However, most real-life data will never show a perfect functional
relationship between the dependent and the explanatory variables, just as we
saw in Example 2.1. Despite the linear relationship between digestibility and
stearic acid, it is obvious that a straight line will never fit all the observations
perfectly. Some of the observations are above the line and some are below,
but the general trend matches a straight line as seen in Figure 2.2.

2.1 Fitting a regression line


Fitting a regression line means identifying the “best” line; i.e., the optimal
parameters to describe the observed data. What we mean by “best” will
become clear in this section.
Let (xi, yi), i = 1, . . . , n denote our n pairs of observations and assume
that we somehow have “guesstimates” of the two parameters, α̂ and β̂, from
a linear equation used to model the relationship between the x’s and the y’s.
Notice how we placed “hats” over α and β to indicate that the values are
not necessarily the true (but unknown) values of α and β but estimates. Our
model for the data is given by the line

y = α̂ + β̂ · x.

For any x, we can use this model to predict the corresponding y-value. In
particular, we can do so for each of our original observations, x1, . . . , xn, to
find the predicted values; i.e., the y-values that the model would expect to find:

ŷi = α̂ + β̂ · xi.

We can use these predicted values to evaluate how well the model fits the
actual observed values. This is achieved by looking at the residuals, which are
defined as follows:

ri = yi − ŷi.    (2.1)

The residuals measure how far away each of our actual observations (the yi’s) is
from the expected value given a specific model (the straight line in this case).
We can think of the residuals as the rest or remainder of the observed y’s that
are not explained by the model. Clearly, we would like to use a model that
provides small residuals because that means that the values predicted by the
model are close to our observations.
Example 2.2. Stearic acid and digestibility of fat (continued from p. 28). Let
us for now assume that we have eyeballed the data and have found that a
line defined by the parameters

α̂ = 96.5334 β̂ = −0.9337

provides a good straight line to describe the observed data. We can then cal-
culate the predicted value for each observed x; e.g.,

ŷ1 = 96.5334 − 0.9337 · 29.8 = 68.709.

This value is slightly higher than the observed value of 67.5, and the residual
for the first observation is

r1 = 67.5 − 68.709 = −1.209.


Figure 2.3: Residuals for the dataset on digestibility and stearic acid. The vertical lines
between the model (the straight line) and the observations are the residuals.
[Axes: stearic acid % (x), digestibility % (y).]

Figure 2.3 shows a graphical representation of the residuals for all nine levels
of stearic acid.
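The predicted values and residuals for all nine observations can be computed in one step. The sketch below assumes the data from Example 2.1 and the parameter values from Example 2.2; the variable names (alpha_hat, beta_hat, y_hat, r) are chosen here to mirror the notation α̂, β̂, ŷi, and ri.

```r
# Data from Example 2.1: stearic acid (x) and digestibility in percent (y)
x <- c(29.8, 30.3, 22.6, 18.7, 14.8, 4.1, 4.4, 2.8, 3.8)
y <- c(67.5, 70.6, 72.0, 78.2, 87.0, 89.9, 91.2, 93.1, 96.7)

alpha_hat <- 96.5334
beta_hat  <- -0.9337

y_hat <- alpha_hat + beta_hat * x   # predicted values
r     <- y - y_hat                  # residuals, as in equation (2.1)

print(round(data.frame(x, y, y_hat, r), 3))
```

The first row reproduces the hand calculation above: a predicted value of 68.709 and a residual of −1.209. Positive residuals belong to observations above the line, negative ones to observations below it.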
Note that the residuals, ri, measure the vertical distance from the
observation to the fitted line and that positive, negative, and zero residuals
correspond to observations that are above, below, and exactly on the regression
line, respectively. Until now we have just assumed that it was possible to
identify a straight line that would fit our observed data. Two researchers may,
however, have different opinions on which regression line should be used to
model a dataset; e.g., one researcher suggests that y = 1.8x + 2 best describes
the data while the other proposes y = 1.7x + 2.3. From our discussion so far
it should be clear that it would be desirable to have a regression line where

• the residuals are small. That indicates that the regression line is close to
the actual observations.

• the residual sum is zero. That means the observations are spread evenly
above and below the line. If the residual sum is non-zero we could
always change the intercept of the model such that the residual sum
would be zero.

Different lines can yield a residual sum of zero, as can be seen in Figure 2.4
where two different regression lines are plotted for the stearic acid dataset.
The solid line is defined by y = 96.5334 − 0.9337 · x while the dashed line is
defined by y = 0.6 · x + 74.15. Both regression lines have residual sum zero but
it is clear from the figure that the solid line is a much better model for the
observed data than the dashed line. Hence, the sum of the residuals is not an
adequate measure of how well a model fits the data simply because a large
positive residual can be canceled by a corresponding negative residual.
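The comparison of the two lines can be made concrete by computing both the plain and the squared residual sums. This is a minimal sketch assuming the data from Example 2.1; resid_line() is a hypothetical helper introduced here. Because the coefficients are quoted in rounded form, the residual sums are only approximately zero.

```r
# Data from Example 2.1
x <- c(29.8, 30.3, 22.6, 18.7, 14.8, 4.1, 4.4, 2.8, 3.8)
y <- c(67.5, 70.6, 72.0, 78.2, 87.0, 89.9, 91.2, 93.1, 96.7)

# Hypothetical helper: residuals for the line y = alpha + beta * x
resid_line <- function(alpha, beta) y - (alpha + beta * x)

r_solid  <- resid_line(96.5334, -0.9337)  # solid line in Figure 2.4
r_dashed <- resid_line(74.15, 0.6)        # dashed line in Figure 2.4

# Plain sums: positive and negative residuals cancel for both lines
print(c(solid = sum(r_solid), dashed = sum(r_dashed)))

# Sums of squares: no cancellation, so the poor fit of the dashed line shows
print(c(solid = sum(r_solid^2), dashed = sum(r_dashed^2)))
```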
Figure 2.4: Two regression lines for the digestibility data. The solid line is defined by
y = −0.9337 · x + 96.5334 while the dashed line is defined by y = 0.6 · x + 74.15. Both
regression lines have residual sum zero. [Axes: stearic acid % (x), digestibility % (y).]

What we need is a way to consider the magnitude of the residuals such
that positive and negative residuals will not cancel each other. The preferred
solution is the method of least squares, where the residuals are squared before
they are added, which prevents positive and negative residuals from
canceling each other.∗ One way to think about the squared residuals is that we
desire a model where we have as few observations as possible that are far
away from the model. Since we square the residuals we take a severe
“punishment” from observations that are far from the model and can more easily
accommodate observations that are close to the predicted values. Figure 2.5
shows a graphical representation of the squared residuals. The gray areas
correspond to the square of the residuals so each observation gives rise to a
square gray area. Instead of just looking at the sum of residuals and trying
to find a model that is as close to the observations as possible, we essentially
try to identify a model that minimizes the sum of the gray areas.
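The idea of minimizing the sum of the gray areas can be tried out numerically. The sketch below assumes the data from Example 2.1 and defines a hypothetical helper ssr() for the sum of squared residuals. It also uses R's lm() function, which fits a line by least squares; lm() has not been introduced at this point in the text, so treat its use here as an assumption made for illustration.

```r
# Data from Example 2.1
x <- c(29.8, 30.3, 22.6, 18.7, 14.8, 4.1, 4.4, 2.8, 3.8)
y <- c(67.5, 70.6, 72.0, 78.2, 87.0, 89.9, 91.2, 93.1, 96.7)

# Hypothetical helper: sum of squared residuals for the line alpha + beta * x
ssr <- function(alpha, beta) sum((y - (alpha + beta * x))^2)

# Least squares fit; coefficients come back as (intercept, slope)
fit <- lm(y ~ x)
est <- unname(coef(fit))
print(est)

# The fitted line should give a smaller sum of squares than nearby lines
print(ssr(est[1], est[2]))         # least squares line
print(ssr(est[1] + 1, est[2]))     # intercept shifted by 1
print(ssr(est[1], est[2] + 0.1))   # slope changed by 0.1
print(ssr(74.15, 0.6))             # dashed line from Figure 2.4
```

The estimates agree, up to rounding, with the values 96.5334 and −0.9337 used for the solid line throughout the examples.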

2.1.1 Least squares estimation


The least squares method estimates the unknown parameters of a model
by minimizing the sum of the squared deviations between the data and the
model. Thus for a linear regression model we seek to identify the parameters

∗ An alternative would be to use the absolute residuals. This approach also prevents the
cancellation of positive and negative residuals, but the calculus of minimizing the sum of
absolute residuals (see Section 2.1.1) can be rather tricky. Another reason why the sum of
squared residuals is preferred is that the corresponding estimates are identical to the
estimates found by the more general maximum likelihood approach. Maximum likelihood will be
discussed briefly in Section 5.2.7.
Exploring the Variety of Random
Documents with Different Content
Johnny Hope rubbed the stubble of beard on his face and frowned at
Westler. "I'm not sure, but I think I know this place. We should
reach the New York River this afternoon."
They stood in a forest glade not a hundred yards from one of the
overgrown concrete highways upon which the Robots were known to
tread. A path paralleled the highway through the woods, and upon
this they made their way.
"Sometimes I wonder if you know what you're letting yourself in for,"
Westler mused.
"I want to find Diane. I'll take whatever goes with it."
"Do you mind if I ask why?"
"I'm not sure I know myself. All I know is I think of her all the time.
Nothing matters as much as finding her—and freeing her."
"We could be wrong. Perhaps she is not with the Robots at all."
"What do you think?"
"I think she is. Everything points to it. I was only pointing out that
we're not sure. Johnny, not many years ago I met a man, another
Shining One, who had fled from New York. He was old and he didn't
last long, but he told me things which—"
"About the Robots, you mean?"
"Yes. You know, of course, they can help cure the Plague. Instead,
they spread it."
"I never could figure out why."
"Who knows what sort of thinking the Robots can do? We're not
even sure if they possess sentience at all, although I suspect they
do. But in the last days of the War, man made a frantic mistake. The
Robots were conceived as fighters, were constructed as fighters,
were built to hate man and to kill man. When we gave the Robots a
different mission entirely, it failed. They've simply strengthened the
Plague toxoid and made it lethal. I don't think they'll rest until every
man on Earth is destroyed.
"We're weak now, disorganized. We've left civilization behind us.
You'd think the Robots could do the job overnight, but the only thing
that prevents them, actually, is their lack of numbers."
"Most of my people—I mean the villagers, not my people any longer
—most of them believe the Robots somehow will cure the Plague."
"And most of my people," said Westler, "believe their destiny is hand
in glove with the destiny of the Robots. They put it this way: we are
hated by the rest of mankind, we are apparently not hated by the
Robots. Why not cooperate with them, then? Actually, a free band of
Shining Ones as large as Keleher's is the exception, not the rule.
Every day, more and more Shining Ones go to the Citadel in New
York or elsewhere to work for the Robots. Not a pretty picture, is it?"
"What can we do about it?"
"At present, I don't have the slightest notion. We've got to do
something, though. Someone's got to do something, unless nature's
ready to write off mankind as a bad experiment. Perhaps I am a
pedant, Johnny. I do not know. But I will tell you this: when all the
great strides in human history were made, the pedants, the scholars
paved the way. I want to see the Citadel not only to learn but to see
if there is something, some way, to end the reign of the Robots. It
seems incredible that men, their makers, lacked the foresight to
equip them with an Achilles Heel, if the need ever arose."

Abruptly, Johnny motioned Westler down with a wave of his hand.


"It looks like you're going to find out soon enough. Take a look."
Johnny parted the bushes in front of them. Here the dirt path had
angled sharply toward the highway so that not more than thirty
yards separated them. Marching silently along the concrete in the
direction of New York, quiet but for the clanking of their joints, was
a long file of Robots.
"Spongey metal foot-pads," whispered Westler, staring eagerly at the
Robots. "We built fine fighting machines, Johnny, and now find we
have to suffer the consequences."
Johnny nodded impatiently, hardly feeling philosophical. "This is
what we came here for, Amos," he said. "Afraid?"
"To tell you the truth, I'm not sure yet."
Johnny was not sure, either, but did not want to brood about it. He
stood up recklessly, forcing his way through the undergrowth toward
the highway. By the time he reached it, Westler trailing uncertainly
at his heels, he was shouting. It worked magically. The long line of
Robots, extending as far as they could see to the left and several
hundred yards to the right, stopped its steady advance. The great
metal heads, each bigger than a man, swiveled on the sockets which
joined them with the tiny bodies. The unblinking eyes which now
faced them—another set for each Robot surveyed the rear, Johnny
knew—were lined up row on row.
"We want to join you," Johnny called out. "We want employment in
the Citadel." Did a human ask a Robot for employment? Johnny
hardly knew, for nothing had been further from his mind until
recently.
The leading Robot came back down the line toward them. Johnny
could read nothing in the artificial eyes and had to check a wild
impulse to run.
"Sometimes I prefer the uncomplicated life of an unimaginative man
of action," Westler moaned softly.
It was, Johnny knew, a good point. He did not bother telling Westler
that both traits had merged in him, which might have been better or
worse, depending upon the circumstances.
Then the Robot was upon them.
"63-17-B?"
"Yes, sir?" All Robots, even those with a primary level of thought as
high as 63-17-B and an existing secondary level, addressed Central
Intelligence as sir.
"After exhaustive tests, it has been adjudged that an over-estimation
has been made regarding your mental ability. Since that is the case,
it will mechanically be necessary to change your position."
Sullenly, plotting shapeless revenge at a Central Intelligence which
would never consider the possibility of an outside factor intervening
unexpectedly and hence altering or spoiling what had been planned,
63-17-B listened to his fate.
"A position currently is vacant as supervisor of the Shining Ones in a
section of the repair bays. Do you have any objections to assuming
this new duty in place of the old?"
To object was disastrous. To object was to admit you needed not
merely a lesser job commensurate with your lesser skill but also
complete readjustment of your thinking process. "No objections at
all, sir," thought 63-17-B, all the while smouldering with resentment.
His time would come. What was the old human expression about
every dog having his day?
"Then you will report at once to repair bay 151. Do you know its
location?"
"I will find it." That was the prescribed answer. One rarely asked
questions. One found out for oneself from Central Information. 63-
17-B half thought he was still being tested in some less-obvious and
hence all the more deadly fashion. But to be placed in charge of a
gang of humans! It was degrading.
"In time, 63-17-B, you shall be tested again. If it is our opinion you
have gained back what we thought you once possessed, you will
again be elevated to a higher station."
63-17-B cursed Central Intelligence on a private wavelength. Central
Intelligence was the creator of perfect plans. If a plan misfired,
Central Intelligence could not be held responsible. Since accidents of
nature had never been considered valid excuses, blame always fell
on the executing Robot. Until recently, 63-17-B had managed to beat
the system, largely through luck. Now while he realized it was the
most mechanical thing in the world to do as you were told, he could
not hide his bitter disappointment. But he pushed it from his mind all
at once when he felt another mind nibbling at his private
wavelength. No one could be trusted, not when each Robot tried to
outdo every other Robot in the eyes of Central Intelligence, not
when private thoughts could be intercepted by monitors, not when
communal thinking was considered preferable to individual
thinking.... That thought made 63-17-B shudder, his joints clanking
as a sudden surge of power, the electrical equivalent of adrenal
secretions, coursed through his frame. He was indeed thinking not
along the prescribed lines. Probably something was wrong with him.

"This is ironical," said Amos Westler as the first inert Robot came
sliding down the conveyor belt to stop, a rusted man-shaped
creature twice man's size with huge conical head and withdrawn
antenna, in front of his bench. "We'll never learn anything this way.
You won't learn the whereabouts of Diane at this bench, and I won't
learn what I've come to find out."
"We're not on duty twenty-four hours a day," Johnny reminded him,
unfastening leg-joints with a large, wrench-like instrument and
wiping the parts with an oily rag before he reassembled them. "If
Diane is here, I'll find her."
"Well, we've learned nothing so far. They took us into the Citadel
through a tile-walled tunnel—"
"Surely one of the wonders of the world!" Johnny cried,
remembering.
"The world has many wonders, natural and man-made, if we could
but see them. Anyway, they then deposited us in those underground
quarters where all the humans seem to live here. The old hag
interviewed us—"
"Yes. She wouldn't say if she'd seen Starbuck and Diane or not when
I described them, but it sure made her smile. I think they're here in
the Citadel, Amos."
"—then assigned us to this repair bay for work. Do you realize that
except for the brief time it took to go from the tunnel exit to the
underground quarters, we haven't seen the light of day. Try learning
something in these, these caves!"
Without warning, the conveyor belts were stilled. Hidden lighting in
the walls flared brighter as a group of Robots entered the large
vault.
"ATTENTION!" A voice blared at them, oddly metallic. Johnny could
not tell where it came from. "Robot 63-17-B is now entering the
vault. As your supervisor, 63-17-B is to be obeyed as if he were
Central Intelligence itself. He is to be addressed not directly, but
through your human supervisor."
The Robot numbered 63-17-B (but the numbers were hidden under
the central face plate and you hardly could tell the machines apart)
made a brief inspection of the vault, then climbed to his niche in the
wall, where he sat completely without motion while the other Robots
filed from the chamber.
"Although we can't address the Robot, our supervisor can," Westler
said eagerly. "That means, at least, communication of some sort is
possible."
"I guess so. Why don't you get to know the supervisor?"
"You're much better at that sort of thing than I am, Johnny."
"We came here for different reasons, don't forget. There's an old
hag I'd like to answer more questions when I find her."
"Here comes our supervisor now," Westler whispered. Then, aloud:
"My name is Amos Westler."
"I don't care what it is. It's recorded. Keep working, friend." The
supervisor was a brutal-faced man who snarled out his words. His
jaw, cheekbones and forehead were silver-sheened with Plague scar,
with the Plague silver remaining there as well as on his limbs. His
face seemed metallic as a Robot's.
"See?" Westler whispered in despair as another damaged Robot slid
to a stop in front of them.
Johnny offered a wan grin. "Take it easy," he said, but hardly felt
more than the last remaining shreds of patience within himself. If
the old hag wouldn't talk when he saw her tonight....

"Don't bother calling me names, young man," cackled the hag. "I'm
virtually immune. It is against existing regulations to give you that
information since it is felt all ties with the past and the outside world
must be broken, not gradually but at once."
"Listen," Johnny said desperately, "you must remember your own
youth." He had tried every other verbal assault he could think of.
Now he hardly thought flattery would work on the ancient bag of
bones in front of him, but it seemed his last hope. "You must have
had your lovers in your day, were you as attractive for your years as
a younger woman...."
Something melted in the hag's eyes. She scrubbed her breastbone
with the knuckles of one parchment hand, as if preening. "Why,
yes," she admitted.
"I'm in love with the girl. You must know how I feel. He—he took
her." At least in part, it was the truth. In love with Diane? He'd never
thought of it, yet what had impelled him to battle Keleher in an
uneven fight, to set out for New York when he could have ruled the
encampment instead, to surrender himself to the Robots of the
Citadel? Johnny smiled. Trying to awaken something in the hag, he
had succeeded in awakening something, all right, but in himself.
"Such information I cannot give you, young man—"
"And I thought you remembered your youth!"
"—but they say the view from the corridor 13 exit is magnificent. To
reach it, one travels along corridor 14, which is a dormitory for some
of our young, unmarried women." The hag cackled. "Don't get
caught."
"I won't. Thank you."
"Good luck, my boy." The hag patted his shoulder, crowed something
which he failed to hear, disappeared from the room.
Outside at a forking of four corridors, Johnny found a map and
studied it. Lights recessed high on the walls showed him his
direction, and soon he was pounding down the corridors and praying
silently that the hag knew what she was talking about. By the time
he reached corridor 14 he was breathless.
Several young women stood in the corridor talking. Their chatter
was stilled when they saw Johnny, and those who had been in
various stages of undress hastened to cover themselves. Clearly, it
was not common for a man to venture this way, particularly at night.
"Are you lost, man?"
"No. I'm looking for someone. A girl named Diane."
They were smiling, and Johnny began to wonder. He suspected that
corridor trysts were not particularly uncommon.
"Is she expecting you?" demanded the boldest of the women, who
had stepped to the fore while her more timid companions drew
back, ready to dart into the surrounding cubicles.
"I cannot truthfully say," Johnny admitted. "If she knew I was in the
Citadel, I think she would be expecting me." But even that was with
tongue in cheek, for ever since he had refused to fight with
Starbuck, Diane had said not a word to him.
"This Diane, what does she look like?"
Johnny described her. When he finished, the woman chuckled.
"Could you perhaps be trysting? From your description, I would say
you love the girl, for no woman could be so beautiful. I think I know
who you mean, though."
Still chuckling, the tall woman entered one of the cubicles while her
companions melted away into the others. Soon Johnny stood alone
in the corridor, waiting as nervously as a youth in Hamilton Village
might wait while the village matchmaker entered a house to fetch
him his bride. Someone appeared in the doorway. Not the tall
woman. Diane!
"Johnny.... Johnny Hope...."
"Diane, I never thought I would see you again. I thought
Starbuck...."
"I was so afraid for you, because you couldn't adjust to your new
life, because I thought you might do something desperate. I was a
fool, I should have known why you refused to fight with Starbuck.
Johnny, Johnny ... let me look at you."
"Look later," he said, his eyes suddenly, unexpectedly misty. He drew
her to him and for a long time stood there with her, feeling the beat
of her heart tight against him, the warmth of her body and long
smoothness of limbs. She was trembling, the warmth of her all a-
flutter against him. She was murmuring something softly against his
shoulder. He was whispering in her ear, "I love you. I love you,
Diane...."

Her lips were perfumed and yielding, her arms went behind him,
hands joining behind his neck, then playing with his hair. The
Plague, his exile from Hamilton Village, the fight with Keleher, the
long trek, even captivity in the Citadel—all were a small price to pay,
he thought dreamily, then abruptly drew back.
"We don't want to stay here all our lives," he said.
"I'll go anywhere with you, Johnny."
"Save that for later, darling—but I love to hear it. I don't think we'd
have much trouble leaving the Citadel."
"Not if we go tonight, we wouldn't. Every day I work with Starbuck,
but if we left at once, now, tonight!"
Her new-found enthusiasm not only matched his, but added wings
to it. He was on the point of saying yes, of leading her through the
corridors in a dash for freedom, when he remembered. "We can't,"
he said. "Not tonight. We've got to include Amos Westler in our
plans."
"Westler is here?"
Johnny explained the situation to her, then added, "Tonight Westler
went looking for some information about the Robots. He feels certain
they have an Achilles Heel someplace, if only he can find it. Actually,
it won't be easy dragging him away from the Citadel, even tomorrow
night."
"We can wait one night longer, sweetheart. You convince him
tomorrow."
"I don't like the thought of leaving you alone again until tomorrow
night."
Diane stilled his words by placing cool fingers to his lips. "We have
no choice. I can take care of myself one night more."
"Starbuck?"
"I can take care of myself in that respect, too. Go back to your
dormitory and get some sleep."
"Tomorrow night. Same time, same place. Westler will be with me."
They came close and drank of each other again. They parted,
Johnny edging down the corridor backwards until the last shaft of
light disappeared from the entrance to Diane's cubicle. His head was
whirling in a giddy new delight, in a rapture which clouded his mind
with a buoyant optimism which almost made him forget the Citadel,
the Robots, and men like Harry Starbuck....
Footsteps pounding down the hall, heavy, too heavy for a woman's.
Quickly, Johnny flattened himself in the darkness of a niche which
served some nameless purpose. With the light behind it, a shadow
loomed, reared up toward him.
It was Harry Starbuck.
Johnny held his breath until the big man with the smug boy's face
strode past. Heading for Diane? In all probability, yes. Follow him?
Stop him? Attack him? Wild thoughts ran their course through
Johnny's head. And lose everything, all they were looking forward to,
for his impulsiveness? Footsteps receded. The shadow vanished.
Even if he could follow Starbuck, overpower him and escape with
Diane, their secret would be secret no longer, which would leave
Amos Westler to fare for himself.
Wait for tomorrow, Johnny Hope. His course seemed clear, yet he
had to fight himself all the way back down the corridor until he had
reached the male dormitories.
For many hours—which seemed like days—he waited up for Amos
Westler, but his thoughts were all with Diane. If Starbuck so much as
touched her....

CHAPTER VII

"I found it, Johnny! It was so obvious, it seems incredible no one
has tried to end the Robot's reign before. We can do it. One man
could do it, alone. One man, with careful planning—"
"Diane is here, Amos. I saw her tonight. We're going to try to break
out tomorrow night, the three of us."
"You see," Westler went on, "there are two items of importance to
consider. The first is Central Intelligence, the mind, the elan vital, the
sentience which motivates the Robots. Did you know, could you ever
imagine, that there was but one Central Intelligence for the entire
western hemisphere, Johnny? It seems incredible, but it is not. That
was the Achilles Heel we sought, the seed of destruction which some
pessimistic scientist had sown into the Robots in case man had
created a Frankenstein."
"Can you believe it? Tomorrow night, the three of us will be on our
way out of here. I think we stand a good chance, Amos. If we—"
"The second item—why, what in the world are you talking about?
Escape? Now? Never! Within our grasp is the chance to free
humanity from a thraldom which it does not yet fully recognize.
Would you give up the chance to render the Robots harmless in
exchange for your own personal safety?"
"Not mine. Diane's. We love each other, Amos. I wouldn't expose her
to any danger. We're leaving tomorrow and we want you to come
with us."
Westler paced back and forth, caged in spirit more than in body.
"Look at you," he said bitterly. "You call yourself a man. But have
you the right to a woman's love when you think only of tomorrow, of
one day out of thousands, of one small life out of all that humanity
has to offer? You want to hold the girl and kiss her and show her
your virility, eh? While the rest of the race goes to pot."
"That's enough, Amos!" Johnny cried. "My motives are my own. We
leave here tomorrow."
"You're weak, Johnny Hope. You're a coward."
Johnny said, "Shut up, damn you." He couldn't deny all that Amos
was saying, but his parents had perished at the hands of a man-
made Plague, he had been driven from his home, rejected by the
Shining Ones, even, until he proved himself in battle. What did he
owe to humanity, to that big, sprawling concept which took in all
kinds of men and their women, children, good people, bad ones, big
and small, with every type of mind and every type of body...?
"All right, marry the girl. Will you raise a family? You're Shining
Ones, Johnny, both of you. The rest of humanity fears you, and
rightfully. Your children will be stoned away if they venture near
normal people. Perhaps life with the Robots would be best for them
after all.
"Here you have the chance to stop all that. Not only could we
negate the power of the Robots, but we could destroy the Plague as
well. Did you hear me, we could destroy the Plague? Before you give
me your final answer, let me tell you what I found."
"I'm listening. But—"
"But nothing. Only listen. This Central Intelligence is a vast
cybernetics machine occupying an entire building—ironically, it is the
United Nations building where once were housed the dreams of
mankind. Now, understand this, Johnny. Every Robot in North and
South America has its own particular wavelength, although the
master intelligence is in tune with all of them. Each individual Robot
sentience is dependent for its existence upon the great cybernetics
machines in Central Intelligence. In other words, if you were to
destroy them, at one blow you would 'kill' every Robot in the
hemisphere!"
"How did you find all that out?"

Westler smiled. "There was one thing the Robots did not bargain for
—an ex-college professor! The information was available in, of all
places, the main library for humans here in the city. It took some
finding, but as an old hand at research I had an edge even on the
Robots with their mechanical minds. Anyway, all you'd have to do is
destroy this Central Intelligence, and—"
"Might as well say destroy the moon, Amos. It's probably so well
guarded a whole Army of men couldn't break through, let alone two
of us."
"That's right," Westler said eagerly, "men could never hope to get
through, but Robots could."
"What are you talking about?"
"The second thing I learned tonight. Once again, it was so deeply
cross-referenced, so thoroughly hidden away that although it was
available if one knew where to look, the science of research is such
a dead thing that no one knew of its existence, probably not even
the Robots. Johnny, the earliest model Robots were built to function
in a double fashion. They were Robots, yes—but they are also
compartments in which a man can fit for manual control. They were
originally designed, you might say, as glorified suits of armor. While
the research material is naturally old, all I could gather seems to
indicate that no changes have ever been made structurally in those
early models. In other words, a man could climb inside a Robot
today, right now, and no one would know the difference."
"You're forgetting one thing," Johnny pointed out. "Are you going to
walk up to a Robot and tell him, 'Pardon me, old fellow, I'd like to
borrow you and use you for a disguise for a while'?"
"I'm not forgetting anything. We work in the repair bays, remember?
We have access to partially dismantled Robots. We could find
ourselves two dismantled old ones, somehow manage to get inside,
make our way to Central Intelligence...."
"I still haven't said I'm going to do it. I'd like to help you, Amos. I'll
take your word about the plan. It has possibilities. But that still has
nothing to do with my own problems. Right now Diane is the most
important thing."
"Diane's future, your future, all our futures ultimately depend on
this. What's the matter with you? You fail to see the forest for the
trees. Tomorrow, what's tomorrow, with all mankind's days ahead of
us—slave or free? Perhaps one man could do the job alone, although
two would have a better chance. But I think you know I'm not the
man for the job. I don't await your answer, Johnny Hope. I've no one
else to turn to. Humanity awaits your answer."
"Let me think," said Johnny, waving Westler away when he would
have continued talking. More quickly than he dared hope, he had
found Diane. With equal swiftness, Westler had discovered what he
sought. That left Johnny in the middle of a tug-of-war which
wouldn't wait indefinitely for his answer.

As the closing gong sounded, 63-17-B watched the Shining Ones
shuffle away from their benches and make their way down the
corridor toward the cafeteria which would serve them an
unimaginative but well-balanced evening meal. But two humans
remained behind, talking avidly over the gleaming bodies of two
stripped-down Robots. Strange, thought 63-17-B, who was now
confronted with the first even mildly unusual event since taking over
the dull routine of his new job, that they should continue working
after the closing gong had sounded. He could summon Hartness, the
scarred human supervisor, and have him talk with the two, or ...
Hartness, his metal-jointed foot! He would do no such thing. If
perhaps the humans were up to some mischief, and if it did not
endanger 63-17-B's own position still further, then let them play. If it
gave a few Robots and even Central Intelligence a hard time for a
while, it served them right. Of course, nothing really serious could
come from the tampering of two helpless humans....
"What about that guy up there?" Johnny raised an eyebrow in the
direction of the supervising Robot, motionless on his stone perch. "Is
he watching us?"
"It appears that he is. Unfortunately, we can't do a thing about it. At
least not until we find out if these gadgets will work with us inside
them. Here, Johnny—you see these tiny items? These are
transistors, using germanium instead of a vacuum grid to activate
electrons, smaller, more compact, more powerful, of longer life.
Without them the whole science of cybernetics which ultimately
made the Robots possible would never have advanced beyond the
rudimentary stage. For even with transistors replacing vacuum tubes, you
still need the entire U.N. building to house Central Intelligence.
Under the older system, all New York City would not have been
enough."
"Tell me later," Johnny pleaded. "I want to get started. The longer
we delay here the longer it will take until we're finished. And I still
have that appointment with Diane tonight. I couldn't contact her
during the day because she said she works with Starbuck. We've got
to hurry."
Westler's hands, guiding the complex tools, moved with swift
efficiency, as if, indeed, he had worked with the Robots all his life.
Wires were crossed, insulated, re-arranged. Gaps and relays were
tested and retested, gears changed, long-unused parts oiled,
cleaned, checked for defects. Surface plates were clamped into place
over layers of insulation. At last the two Robots lay there, supine but
—Westler hoped—ready for human use.
"He's still watching," said Johnny.
"Let him. We couldn't prevent him. Only hope he suddenly doesn't
decide to come down here for a closer look or send for help. It
seems amazing he's done neither so far."
"Maybe he's asleep."
"Robots do not sleep. I assure you. Well, it's ready." Westler reached
into the Robots' interior before clamping on the final head plates.
Each Robot stood up in ponderous silence.
"You first, Johnny. I can clamp my plate from the inside. Are you
sure my explanations on how to work this were satisfactory? Once
inside we'll have to contact each other by signals only."
"What about the radio sets inside? I don't know much about radio,
but you said they worked."
"They do, but the wavelength might be too close to a Robot
wavelength and we'd give ourselves away. Remember, we are to be
nothing more or less than two Robots once we climb inside. That
way, there shouldn't be any trouble. All ready? Up you go."
Johnny was boosted up, pulled himself within the cramped interior of
the Robot. There was barely room for him to stand upright, his
shoulders hunched, arms tight in front of him. A dizzying mass of
dials and levers confronted him suddenly, and although Westler had
explained them and diagrammed them and made Johnny memorize
them, he was still bewildered by direct contact. He was almost afraid
to try his first movement, lest the Robot remain immobile.
The face plate slammed home. Johnny could see through the one-
way plastic of the Robot's eyes as Westler climbed into his own
machine.
Johnny pulled the starting lever and felt his Robot lurch forward.
Must learn to control the motion ... so ... he was now aware of a
lumbering gait, of a steady advance toward the farther wall....
Something made him whirl and peer through the rear eyes. The
Robot supervisor was coming toward them at a rate of speed they
couldn't match.

"You see?" said Starbuck proudly. "I am no longer a servant. I
suppose you would call me a junior executive now. But I'm on the
way up. Definitely on the way up. In a while there is no telling how
far I can go."
"I'm sure of it," Diane nodded agreement. She didn't want to be
bothered by Starbuck today, not when her thoughts were all on the
night and Johnny. She was so nervous she couldn't keep from
looking anxious. If only Starbuck, all wrapped up in himself the way
he was, would fail to see it for a few hours longer.
"I suppose you wonder how I can advance so rapidly. It is quite
simple, Diane. I look around me. I make contacts. I miss nothing. As
an example, I even know of your meeting with Johnny Hope last
night."
"What!"
"I wouldn't really mind it, except that my informant said you are
considering escape from the Citadel. That, of course, is out of the
question."
In his short time at the Citadel, Diane realized, Starbuck had
affected a way of speaking which hardly fit his booming voice or
boyish face. It was as if he had decided to ape the Shining Ones
who stood highest in the Robots' confidence. To Diane it was
contemptible, although now her mind was awhirl with the thought
that she and Johnny had been discovered.
"What are you going to do?" she asked in a small, helpless voice.
"Hope will be arrested. Naturally, he will never be permitted to see
you again."
Diane stared at Starbuck in horror. Johnny must be found and
warned. There was still time. They could alter their plans, this time
in secrecy, without any women around who could spy on them for
Starbuck. But she had to find Johnny before it was too late.
In sudden despair, she realized she didn't even know where to look.

CHAPTER VIII

Stop! Stand perfectly still.
The thought was unexpected, peremptory, driving into Johnny's
brain with more authority than any words. He wanted to stop,
wanted to immobilize the Robot in which he hid—but where had the
thought come from?
Westler's Robot was pointing a many-jointed metal arm at the
supervising Robot which rushed toward them. Then, did the thought
originate there? Could the Robot somehow send a soundless
message to them?
Stop! Let me dismantle you.
The urge to render his own Robot motionless became stronger
within Johnny. It was as if the unbidden thought originated outside
his head but tried to direct his own muscles, as surely as his own
mind.
Something made soft beeping noises in his ear and it took a while
before he realized Westler wanted to break their radio silence, so
soon after they had started. The other Robot was almost upon them.
Awkward and uncomfortable in his cramped quarters, Johnny found
the radio switch and pulled it.
"We've got to destroy that Robot, Johnny. Now, at once, or we're
finished."
"But how—"
The Robot was upon them, its unbidden thoughts stronger.
Halt....
It was Johnny who struck the first blow—clumsily, lifting his great
right arm up and bringing it down stiffly on the other Robot's head.
Metal arms came up, swung blurringly. A clanging tumult deafened
Johnny as dents appeared inside the chamber of his own Robot's
head. He triggered the levers mechanically now, aware that they
were fighting under a tremendous disadvantage, for their fingers
were still stiff on the unfamiliar controls and their artificial reflexes
could not hope to match the Robot's.
"Look out, Johnny—"
Two metal shapes loomed, Westler and the real Robot. The three of
them came together, clashing, clanging, metal arms swinging and
wrecking metal bodies. It was Westler's Robot which went down
first, slowly, buckling at the knee joints and then collapsing. Metal
feet drove down upon it ponderously, crushing the head section.
Westler's Robot was still.
Johnny hammered with huge metal hands at the other Robot, hardly
knowing where he might strike a mortal blow. But the Robot slowed,
its reactions grew feeble, its blows denting Johnny's head-chamber
no longer. Finally, it sprawled across Westler's Robot, then rolled
away and was still.
Cursing to himself, Johnny climbed down from his Robot, found the
battered head plate of Westler's, forced it open.
He saw at once he could never hope to extricate the older man, for
the metal walls of his chamber had been crushed, knifing into bone
and flesh and trapping him.
"Amos, can you hear me?"

The eyelids fluttered open with pain. "I never will see the end,
Johnny...."
"What are you talking about?"
"Don't ... fool me. I'm all broken, inside. I—"
"We'll get you out of there in no time."
"You'd have to melt ... the metal down to ... do it, and you know it."
"We'll do it."
"Your only hope is that the Robot did not have time to broadcast a
warning. If ... he did ... you will have to hurry, but—"
"They still don't know our plans. Maybe they think we only want to
escape, using these Robot bodies for a disguise."
"Perhaps. I hadn't thought ... of that." Westler lapsed into silence,
his face twisted with pain. "If you can do it, if you can destroy their
cybernetics center ... new start for humanity. I was going to tell you
about the Plague, Johnny. The Robots ... have been using ... a
particularly virulent form of the ... toxin which does not exist
naturally. Spreading it in the air, all over the earth. That, combined
with the ... toxin carried by a Shining One, causes illness ... and
death." Westler's words were harder to hear now, low, the barest
whisper of sound. Johnny leaned close to the glazed eyes, the barely
opening lips. "When the Robots are ... gone ... the Plague will die
out almost at once. Shining Ones even will be harmless. You see
why it's so important? You see...."
"I could never do it without you. We'll hide away somewhere, nurse
you back to health—"
"Stop fooling ... an old man. We both know I'm dying."
"That's ridiculous."
"Please ... don't interrupt me. I want to finish telling you ... the
Robots communicate with humans by telepathy. You witnessed it
yourself, a few ... minutes ago. They can make it seem like your own
thoughts and ... who can say? Thought waves are electromagnetic,
like ... so many other things. There is nothing mysterious about ...
telepathy. Give humanity a chance to study what the ... Robots have
done and ... you'll have civilization flourishing again within a
generation. Give humanity the chance...." It was a whisper, a prayer.
On that final note of hope, Westler died.

"The human has emerged from the underground within his Robot
and is heading north-east across the city."
"I still think we ought to stop him now, while we know we can do it."
"Silence. Think on the primary level. In unity we will triumph. It is
our one weapon they cannot hope to match."
"But 63-17-B warned us before he perished—"
"Precisely. That the humans were attempting something other than
mere escape. We must find out what that is, what they have
learned. Don't you realize that if this man fails another might
succeed in his place? Whatever knowledge he has, perhaps it is
widely disseminated. We must find out before we kill him."
There was a silence among the conclave of motionless Robots, their
unblinking eyes intent upon a huge three-dimensional map of the
city, following a tiny pip of light in its slow progress.
"He seems to be heading straight for Central Intelligence."
"That's hardly possible, unless it is mere coincidence."
"I don't think so.... See? Not half a mile away, now."
"Have the supervisors discovered who is missing?"
"Yes. He was employed in the very repair bay where 63-17-B
perished—a defective Robot, incidentally, and no great loss. We have
given his name to the top-level Shining Ones in the hope that they
can help us."
"There is a Shining One, a human, here right now. He wants an
audience concerning the rebel."
"Very well, although we'll have to make it brief."
Starbuck entered the chamber cockily, then lost his poise when he
saw the solemn, unmoving conclave of Robots. "I have outside," he
began, moistening his lips and talking rapidly, "a woman who this
man, this Johnny Hope, loves. Can you understand me? Do you
know what love is? He won't do a thing that might harm her."
We can understand.
"I thought that—"
We can read your thoughts. Leave your name with the Robot
outside. Take this woman within the U.N. building and hold her there
until you hear from us.
"The U.N. building?"
No questions. Go.
Starbuck shuffled from the room, self-conscious and fearful under
the mental command.
"I doubt if we'll need the hostage, but you never can tell."
"It seems incredible that—"
"Does it? The man has almost reached the U.N. building. It will take
him perhaps half an hour, for the rubble is piled high there.
Underground he could reach it in a few moments, but apparently he
is unfamiliar with the passages."
"He has only recently arrived at the Citadel."
"Somehow, they have learned something. It is why we cannot kill
the man until we are sure. Have them alerted at Central Intelligence,
but let him enter. Watch him. If he blunders about as if he has
arrived there by accident, kill him. If he knows something, take him
alive."
"Someday we must learn the secret of Central Intelligence, if we are
to survive. We must learn how to duplicate it or face the possibility
of perishing in a single accident."
"Men built it once. Men could do it again."
"Defective! Silence. Man can do nothing we cannot do."
Then they were quiet, watching the tiny, darting pip on the three-
dimensional map as it struggled through the uncleared rubble
southwest of the U.N. building.

Even in ruin, the city held more wonders for Johnny Hope than he
had ever thought possible. In many ways, it was like a scar on the
face of the earth, pitted with bomb craters, strewn with the debris of
toppled towers, its streets choked with fallen, crumbling masonry
and blocked by the skeletons of buildings which once had stood,
bare and rusted now but not always so, as monuments to the
greatness of man. Yet it was a scar which could be healed, a broken,
dying city which could be made great again, with men and women
roving its streets, repairing the structures, making the living city
function once more.
That was Amos Westler's dream. It was the dream of all mankind,
Johnny thought philosophically, although they did not realize it as
they roved the earth in hunter-bands of Shining Ones or tilled its soil
in small communities fearful of the Plague.
Now, directly ahead of him, he could see the monolithic slab of the
U.N. building. Like one structure in five, it stood incredibly intact, a
remembrance of the past and a promise of the future. We can build
again, Johnny thought, without the Robots and the Plague. They
could build again or they would die. Natural world or artificial world
—men or Robots—they could not survive jointly.
Battered and broken but still functioning adequately, Johnny's Robot
pushed through the debris south of the U.N. building to the edge of
the river. He stood there a moment and stared upstream at the
gaunt ruins of a bridge, now tumbled down into the river and resting on
the river-bottom, thrusting its towers up beyond the surface of the
water and toward the sky. Men had used that bridge once, long ago
but within the memory of Johnny's father, to reach the country
beyond. The bridge might be rebuilt. Men might learn to use it
again. It was as if, in dying, Amos Westler had transferred his own
vision to Johnny, showing him a dream of the unborn tomorrow—its
birth or stillborn death depending entirely upon Johnny's success or
failure today.
Half a dozen Robots stood about the wide terrace leading to the
building, but Johnny ignored them, for he had passed many in the
broken streets of the city and grown accustomed to them. He
entered the building through a door of glass and metal and was not
aware of the Robots entering it behind him.
His impulse was to climb down from his Robot, to stretch his
cramped arms and legs and find something to eat, then explore the
wonders of this new place. Above his head, the ceiling was high and
vaulted. Ramps led away, curving and graceful, in all directions and
he longed to feel his feet, his own feet, upon them, and to explore
until he satiated himself with this wonder and sought another.
To leave the Robot would be suicide. Had the thought been his own
—or a metal-made thought, instilled in him some unknown way, an
unbidden suicide thought? It was less specific than the commands of
the Robot that had perished in the repair bay, but Johnny guessed it
came from outside nevertheless.
He advanced mechanically, for Westler had given him careful
directions. The ramps led up, higher and higher, past the rooms in
which men from many lands once, long ago, used to debate their
future—then higher still, climbing....
There was noise behind him. He whirled in cramped quarters,
peered from the Robot's second set of eyes. A dozen Robots climbed
the ramp behind him, gaining. He let his mind drift blankly, let their
thoughts reach him.
He is not wandering aimlessly. Somehow he learned. He learned.
Capture him.

He ran now, awkwardly, his own Robot not smooth and graceful, a
flawless piece of machinery like the others. He clomped and
clattered up the ramp and prayed for time.
The ramp soared upward, curved to the left. Once he looked down
at the floor of the rotunda so far below and became giddy with the
distance and the thought of falling. He leaned over the railing and
looked. His head whirled....
At the last moment, he drew his Robot back from the edge, stabbing
half-blindly at the controls which propelled it. They had almost
driven him to suicide. He must keep his mind a perfect blank—or,
better still, think of something which would keep them at bay. Diane,
his love for her—Diane....
A Robot waited for him at the top of the ramp. Those behind him
were gaining rapidly, driving death-wishes deep within his brain.
The Robot above him abruptly swung into motion, but Johnny
desperately sidestepped the lunge which would have sent him
hurtling to the floor of the rotunda. The other Robot checked its own
inertia and came for Johnny again, huge arms swinging, trying to
crush him within the metal chamber as Amos Westler had been
crushed. Johnny parried the blows with his own metal arms, then
reached out and heard machinery groan within his metal frame as
he lifted the other Robot and hurled it in the path of his pursuers.
There was a grinding, clattering crash of metal. Johnny saw three
forms detach themselves from the arcing ramp and tumble, swinging
and twisting in air grotesquely, to the floor, where they struck
resoundingly and broke apart, the metal arms and legs flying.
Then he was climbing again, the remaining Robots far below him
and disorganized now. But soon, he knew, they would be capable of
following.
It was as Amos Westler had predicted. After a time, the ramp grew
smaller. It no longer climbed now—it had soared high and now was
just below the girdered ceiling. It was hardly wide enough for
Johnny's Robot, it shook dangerously with the tread of metal feet.
Here, Johnny knew, was the sanctuary. This was the Achilles Heel.
This was the entrance, this ramp which no Robot could traverse.
Here the way led to self-functioning, self-repairing machinery, to
Central Intelligence. Here was man's final hope in the eyes of the
original inventor. Here was the guarantee that the Robots, if they
became some Frankenstein monster, could be met and conquered.
For no Robot could guard the final portal to Central Intelligence. No
Robot could even draw close enough to alter the thin ramp. Johnny
smiled grimly as comprehension grew. If Robots could become
neurotic, this was the place for it. They could have employed their
human servants, the Shining Ones, to alter the place, but would
have divulged their secret in the process.
Still smiling, Johnny halted his Robot, opened the face plate clumsily
from the inside, and climbed out. He sat on the ramp and flexed stiff
arms and legs, then stood up and heard the Robots below him. He
could see them now, no longer advancing, milling about in
confusion. Their weight would destroy the ramp, and they knew it.
They could never hope to reach him.
It was all so incredibly simple.
Was it?
One Robot had been above him.
Then they knew he was coming. What had they prepared for him
beyond the point where the Robots could not climb? Shrugging, he
advanced warily.
Soon he could see where the ramp reached a small doorway, much
too low and narrow to admit a Robot, even if one of the machines
could have climbed the ramp this far.
"Hold it, Johnny Hope. Don't come any closer."

Startled, he looked up. Harry Starbuck stood in the doorway, holding
Diane in front of him.
"I'm not fooling, Hope. If you come any closer I'll throw her off. It's
a long way down."
"You're crazy, Starbuck. You'll never leave this place alive." But even
as he spoke, he knew he could never reason with the man. "The
Robots can't let you carry their secret from here. Your only hope is
to cooperate with me."
"Is that so? They're sending some more men up to get you. All I
have to do is hold the fort until ... cut it out, Hope! Stay right there."
Starbuck edged out of the doorway, dragging Diane along with him
to the railing at one side of the ramp. "I'll do it if you make me."
"Don't listen to him, Johnny! I'm not afraid." Hair disheveled,
clothing torn, face bruised, she still looked beautiful to him. All at
once she stood for everything Westler had mentioned; for the future
of man, for the dreams of tomorrow, for a free world with no Plague
and no Robots. But for Westler the choice would have been easy.
The girl—or humanity.
Westler had not been in love.
Now Starbuck had forced Diane, back arched, breasts thrust
forward, out over the railing. She struggled in his grip, but futilely.
He could hurl her out over the edge and into space or not, as he
wished.
"Back up, Hope. I want you to go back down the ramp and
surrender to the Robots. You're only delaying things. More men will
be here soon. You're licked and you know it."
Wearily, Johnny retreated. "Don't hurt her," he said. "Promise me
that."
"You crazy? I want her for myself."
The thought numbed Johnny. He hadn't considered it that way. A live
Diane or a dead one was one thing. But a Diane forced to submit to
Starbuck....
He reached his own immobile Robot, saw the others, not twenty
yards below him, waiting, thought he heard shouts somewhere
behind them. He must do what he had come to do as if Diane did
not exist. It was Starbuck who had made the choice for him.
But there was a wild possibility....
Quickly, he climbed within his Robot, activated it, lumbered forward.
He could feel the ramp shaking with each step he took. At any
moment, its struts might collapse and send him hurtling to his death,
trapped in his man-shaped metal coffin, far below.
Soon he could see Starbuck again, on the ramp outside the doorway,
holding Diane. Starbuck's eyes went wide. Starbuck frowned, then
began to lick his lips anxiously.
"You can't come up here!" he cried. "It won't hold you. I sent the
man down to surrender, anyway. Do you have him? Is he dead?
What do you want, anyway? I can come down myself. Don't come
any closer, not unless you want the ramp to collapse. Keep away,
you hear me?"
Johnny advanced slowly, the ramp shaking with each stride no
longer, but dipping and rocking constantly now, almost ready to go.
Starbuck retreated, taking Diane with him. Through the doorway
they went—
Out fell the faceplate of Johnny's Robot. He tumbled after it as the
ramp shook, metal grinding against metal, then snapped. He leaped
forward as the ramp caved in. He felt his feet shoot out from under
him, saw metal dropping away, twisting, to his left. He clawed out
with his hands, gripped a jagged edge, pulled himself up slowly as
blood made his hands slip.
He stood in what was left of the doorway, trembling as reaction set
in, his heels on the brink of nothing, his bloodied hands aching.
Starbuck roared and charged at him, attempting to drive him back a
few inches to his death. But Johnny caught him, met him halfway
with no room to evade the charge, and they grappled there,
teetering on the edge.
"You tricked me," Starbuck moaned. "That Robot ... was you."

A knee blurred up at Johnny, exploding in violent pain. He felt himself
falling and managed to twist away from the edge of the sundered
ramp. He hit the floor with waves of nausea boiling up from his
stomach. He lay there, blinking his eyes.
Starbuck came for him.
He drew his legs up instinctively, the knees bent, then straightened
as Starbuck leaned over him. His feet caught the big man squarely on
