
INTRODUCTION TO
BIOSTATISTICS
SECOND EDITION

Robert R. Sokal and F. James Rohlf


State University of New York at Stony Brook

DOVER PUBLICATIONS, INC.
Mineola, New York
Copyright

Copyright © 1969, 1973, 1981, 1987 by Robert R. Sokal and F. James Rohlf


All rights reserved.

Bibliographical Note

This Dover edition, first published in 2009, is an unabridged republication of
the work originally published in 1969 by W. H. Freeman and Company, New
York. The authors have prepared a new Preface for this edition.

Library of Congress Cataloging-in-Publication Data

Sokal, Robert R.
Introduction to Biostatistics / Robert R. Sokal and F. James Rohlf.
Dover ed.
p. cm.
Originally published: 2nd ed. New York : W. H. Freeman, 1969.
Includes bibliographical references and index.
ISBN-13: 978-0-486-46961-4
ISBN-10: 0-486-46961-1
1. Biometry. I. Rohlf, F. James, 1936- II. Title.
QH323.5.S633 2009
570.1'5195 dc22
2008048052

Manufactured in the United States of America
Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501
to Julie and Janice
Contents

PREFACE TO THE DOVER EDITION xi

PREFACE xiii

1. INTRODUCTION 1
1.1 Some definitions 1
1.2 The development of biostatistics 2
1.3 The statistical frame of mind 4

2. DATA IN BIOSTATISTICS 6
2.1 Samples and populations 7
2.2 Variables in biostatistics 8
2.3 Accuracy and precision of data 10
2.4 Derived variables 13
2.5 Frequency distributions 14
2.6 The handling of data 24

3. DESCRIPTIVE STATISTICS 27
3.1 The arithmetic mean 28
3.2 Other means 31
3.3 The median 32
3.4 The mode 33
3.5 The range 34
3.6 The standard deviation 36
3.7 Sample statistics and parameters 37
3.8 Practical methods for computing mean and standard
deviation 39
3.9 The coefficient of variation 43

4. INTRODUCTION TO PROBABILITY DISTRIBUTIONS:
THE BINOMIAL AND POISSON DISTRIBUTIONS 46
4.1 Probability, random sampling, and hypothesis testing 48
4.2 The binomial distribution 54
4.3 The Poisson distribution 63

5. THE NORMAL PROBABILITY DISTRIBUTION 74


5.1 Frequency distributions of continuous variables 75
5.2 Derivation of the normal distribution 76
5.3 Properties of the normal distribution 78
5.4 Applications of the normal distribution 82
5.5 Departures from normality: Graphic methods 85

6. ESTIMATION AND HYPOTHESIS TESTING 93


6.1 Distribution and variance of means 94
6.2 Distribution and variance of other statistics 101
6.3 Introduction to confidence limits 103
6.4 Student's t distribution 106
6.5 Confidence limits based on sample statistics 109
6.6 The chi-square distribution 112
6.7 Confidence limits for variances 114
6.8 Introduction to hypothesis testing 115
6.9 Tests of simple hypotheses employing the t distribution 126
6.10 Testing the hypothesis H₀: σ² = σ₀² 129

7. INTRODUCTION TO ANALYSIS OF VARIANCE 133


7.1 The variances of samples and their means 134
7.2 The F distribution 138
7.3 The hypothesis H₀: σ₁² = σ₂² 143
7.4 Heterogeneity among sample means 143
7.5 Partitioning the total sum of squares and degrees of freedom 150
7.6 Model I anova 154
7.7 Model II anova 157

8. SINGLE-CLASSIFICATION ANALYSIS OF VARIANCE 160


8.1 Computational formulas 161
8.2 Equal n 162
8.3 Unequal n 165
8.4 Two groups 168
8.5 Comparisons among means: Planned comparisons 173
8.6 Comparisons among means: Unplanned comparisons 179

9. TWO-WAY ANALYSIS OF VARIANCE 185


9.1 Two-way anova with replication 186
9.2 Two-way anova: Significance testing 197
9.3 Two-way anova without replication 199

10. ASSUMPTIONS OF ANALYSIS OF VARIANCE 211


10.1 The assumptions of anova 212
10.2 Transformations 216
10.3 Nonparametric methods in lieu of anova 220

11. REGRESSION 230


11.1 Introduction to regression 231
11.2 Models in regression 233
11.3 The linear regression equation 235
11.4 More than one value of Y for each value of X 243
11.5 Tests of significance in regression 250
11.6 The uses of regression 257
11.7 Residuals and transformations in regression 259
11.8 A nonparametric test for regression 263

12. CORRELATION 267


12.1 Correlation and regression 268
12.2 The product-moment correlation coefficient 270
12.3 Significance tests in correlation 280
12.4 Applications of correlation 284
12.5 Kendall's coefficient of rank correlation 286

13. ANALYSIS OF FREQUENCIES 294


13.1 Tests for goodness of fit: Introduction 295
13.2 Single-classification goodness of fit tests 301
13.3 Tests of independence: Two-way tables 305

APPENDIXES 314
A1 Mathematical appendix 314
A2 Statistical tables 320

BIBLIOGRAPHY 349

INDEX 353
Preface to the Dover Edition

We are pleased and honored to see the re-issue of the second edition of our Introduc-
tion to Biostatistics by Dover Publications. On reviewing the copy, we find there
is little in it that needs changing for an introductory textbook of biostatistics for an
advanced undergraduate or beginning graduate student. The book furnishes an intro-
duction to most of the statistical topics such students are likely to encounter in their
courses and readings in the biological and biomedical sciences.

The reader may wonder what we would change if we were to write this book anew.
Because of the vast changes that have taken place in modalities of computation in the
last twenty years, we would deemphasize computational formulas that were designed
for pre-computer desk calculators (an age before spreadsheets and comprehensive
statistical computer programs) and refocus the reader's attention to structural for-
mulas that not only explain the nature of a given statistic, but are also less prone to
rounding error in calculations performed by computers. In this spirit, we would omit
the equation (3.8) on page 39 and draw the readers' attention to equation (3.7) instead.
Similarly, we would use structural formulas in Boxes 3.1 and 3.2 on pages 41 and 42,
respectively; on page 161 and in Box 8.1 on pages 163/164, as well as in Box 12.1
on pages 278/279.
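
The rounding-error point can be illustrated with a short Python sketch. Equations (3.7) and (3.8) are not reproduced in this preface, so the code assumes they correspond to the usual deviation-based (structural) form and the desk-calculator shortcut (computational) form of the sample variance. The function names and the data are hypothetical, chosen only for illustration.

```python
import math


def variance_structural(values):
    """Sample variance from squared deviations about the mean (structural form)."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((y - mean) ** 2 for y in values) / (n - 1)


def variance_computational(values):
    """Sample variance from the shortcut: sum of Y^2 minus (sum of Y)^2 / n."""
    n = len(values)
    total = sum(values)
    total_sq = sum(y * y for y in values)
    return (total_sq - total * total / n) / (n - 1)


# Made-up data: four readings, then the same readings shifted by a large constant.
# Shifting every value by a constant leaves the true variance unchanged (30 here).
small = [4.0, 7.0, 13.0, 16.0]
shifted = [y + 1e9 for y in small]

print(variance_structural(small), variance_computational(small))
print(variance_structural(shifted), variance_computational(shifted))
```

On the small values both forms agree. On the shifted values the shortcut subtracts two nearly equal, very large numbers, so its result can be far from 30 in double-precision arithmetic, while the deviation-based form is unaffected.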

Secondly, we would put more emphasis on permutation tests and resampling methods.
Permutation tests and bootstrap estimates are now quite practical. We have found this
approach to be not only easier for students to understand but in many cases preferable
to the traditional parametric methods that are emphasized in this book.
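
As a rough illustration of what a permutation test involves, the following Python sketch enumerates every way of splitting the pooled observations into two groups and asks how often the difference in means is at least as large as the one observed. It is a minimal sketch, not a method taken from this book: the function name, the choice of the absolute difference in means as the test statistic, and the data are all assumptions made for the example.

```python
from itertools import combinations


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the absolute difference in means.

    Returns the proportion of all possible regroupings of the pooled data
    whose difference in means is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    extreme = 0
    total = 0
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        # Small tolerance so that ties with the observed value are counted.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Made-up measurements for two small groups.
p_value = exact_permutation_test([1, 2, 3], [7, 8, 9])
print(p_value)
```

Full enumeration is practical only for small samples; with larger samples one would normally draw a random subset of the regroupings instead.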

Robert R. Sokal
F. James Rohlf
November 2008
Preface

The favorable reception that the first edition of this book received from teachers
and students encouraged us to prepare a second edition. In this revised edition,
we provide a thorough foundation in biological statistics for the undergraduate
student who has a minimal knowledge of mathematics. We intend Introduction
to Biostatistics to be used in comprehensive biostatistics courses, but it can also
be adapted for short courses in medical and professional schools; thus, we
include examples from the health-related sciences.
We have extracted most of this text from the more-inclusive second edition
of our own Biometry. We believe that the proven pedagogic features of that
book, such as its informal style, will be valuable here.
We have modified some of the features from Biometry; for example, in
Introduction to Biostatistics we provide detailed outlines for statistical
computations but we place less emphasis on the computations themselves. Why?
Students in many undergraduate courses are not motivated to and have few
opportunities to perform lengthy computations with biological research
material; also, such computations can easily be made on electronic calculators
and microcomputers. Thus, we rely on the course instructor to advise students
on the best computational procedures to follow.
We present material in a sequence that progresses from descriptive statistics
to fundamental distributions and the testing of elementary statistical hypotheses;
we then proceed immediately to the analysis of variance and the familiar t test
(which is treated as a special case of the analysis of variance and relegated to
several sections of the book). We do this deliberately for two reasons: (1) since
today's biologists all need a thorough foundation in the analysis of variance,
students should become acquainted with the subject early in the course; and (2)
if analysis of variance is understood early, the need to use the t distribution is
reduced. (One would still want to use it for the setting of confidence limits and
in a few other special situations.) All t tests can be carried out directly as
analyses of variance, and the amount of computation of these analyses of variance
is generally equivalent to that of t tests.
This larger second edition includes the Kolmogorov-Smirnov two-sample test,
nonparametric regression, stem-and-leaf diagrams, hanging histograms, and the
Bonferroni method of multiple comparisons. We have rewritten the chapter on
the analysis of frequencies in terms of the G statistic rather than χ², because the
former has been shown to have more desirable statistical properties. Also,
because of the availability of logarithm functions on calculators, the computation
of the G statistic is now easier than that of the earlier chi-square test. Thus, we
reorient the chapter to emphasize log-likelihood-ratio tests. We have also added
new homework exercises.
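
To make the comparison between the two statistics concrete, here is a minimal Python sketch. It assumes the standard textbook definitions, G = 2 Σ O ln(O/E) and χ² = Σ (O − E)²/E, where O and E are observed and expected frequencies; the function names and the frequencies are hypothetical and are not taken from the book's examples.

```python
import math


def g_statistic(observed, expected):
    """Log-likelihood-ratio statistic: G = 2 * sum(O * ln(O / E)).

    Classes with an observed frequency of zero contribute nothing to the sum.
    """
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up frequencies: 100 individuals in two classes, tested against a 1:1 ratio.
observed = [60, 40]
expected = [50, 50]

g_value = g_statistic(observed, expected)
chi_value = chi_square_statistic(observed, expected)
print(g_value, chi_value)
```

For these frequencies the two statistics are close in value; both would be compared with the same chi-square distribution.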
We call special, double-numbered tables "boxes." They can be used as
convenient guides for computation because they show the computational methods
for solving various types of biostatistical problems. They usually contain all
the steps necessary to solve a problem, from the initial setup to the final result.
Thus, students familiar with material in the book can use them as quick
summary reminders of a technique.
We found in teaching this course that we wanted students to be able to
refer to the material now in these boxes. We discovered that we could not cover
even half as much of our subject if we had to put this material on the
blackboard during the lecture, and so we made up and distributed boxes and asked
students to refer to them during the lecture. Instructors who use this book may
wish to use the boxes in a similar manner.
We emphasize the practical applications of statistics to biology in this book;
thus, we deliberately keep discussions of statistical theory to a minimum.
Derivations are given for some formulas, but these are consigned to Appendix A1,
where they should be studied and reworked by the student. Statistical tables
to which the reader can refer when working through the methods discussed in
this book are found in Appendix A2.
We are grateful to K. R. Gabriel, R. C. Lewontin, and M. Kabay for their
extensive comments on the second edition of Biometry and to M. D. Morgan,
E. Russek-Cohen, and M. Singh for comments on an early draft of this book.
We also appreciate the work of our secretaries, Resa Chapey and Cheryl Daly,
with preparing the manuscripts, and of Donna DiGiovanni, Patricia Rohlf, and
Barbara Thomson with proofreading.

Robert R. Sokal

F. James Rohlf
CHAPTER 1

Introduction

This chapter sets the stage for your study of biostatistics. In Section 1.1, we
define the field itself. We then cast a necessarily brief glance at its historical
development in Section 1.2. Then in Section 1.3 we conclude the chapter with
a discussion of the attitudes that the person trained in statistics brings to
biological research.

1.1 Some definitions

We shall define biostatistics as the application of statistical methods to the
solution of biological problems. The biological problems of this definition are those
arising in the basic biological sciences as well as in such applied areas as the
health-related sciences and the agricultural sciences. Biostatistics is also called
biological statistics or biometry.
The definition of biostatistics leaves us somewhat up in the air: "statistics"
has not been defined. Statistics is a science well known by name even to the
layman. The number of definitions you can find for it is limited only by the
number of books you wish to consult. We might define statistics in its modern
sense as the scientific study of numerical data based on natural phenomena. All
parts of this definition are important and deserve emphasis:
Scientific study: Statistics must meet the commonly accepted criteria of
validity of scientific evidence. We must always be objective in presentation and
evaluation of data and adhere to the general ethical code of scientific
methodology, or we may find that the old saying that "figures never lie, only statisticians
do" applies to us.
Data: Statistics generally deals with populations or groups of individuals;
hence it deals with quantities of information, not with a single datum. Thus, the
measurement of a single animal or the response from a single biochemical test
will generally not be of interest.
Numerical: Unless data of a study can be quantified in one way or another,
they will not be amenable to statistical analysis. Numerical data can be
measurements (the length or width of a structure or the amount of a chemical in
a body fluid, for example) or counts (such as the number of bristles or teeth).
Natural phenomena: We use this term in a wide sense to mean not only all
those events in animate and inanimate nature that take place outside the control
of human beings, but also those evoked by scientists and partly under their
control, as in experiments. Different biologists will concern themselves with
different levels of natural phenomena; other kinds of scientists, with yet different
ones. But all would agree that the chirping of crickets, the number of peas in
a pod, and the age of a woman at menopause are natural phenomena. The
heartbeat of rats in response to adrenalin, the mutation rate in maize after
irradiation, or the incidence or morbidity in patients treated with a vaccine
may still be considered natural, even though scientists have interfered with the
phenomenon through their intervention. The average biologist would not
consider the number of stereo sets bought by persons in different states in a given
year to be a natural phenomenon. Sociologists or human ecologists, however,
might so consider it and deem it worthy of study. The qualification "natural
phenomena" is included in the definition of statistics mostly to make certain
that the phenomena studied are not arbitrary ones that are entirely under the
will and control of the researcher, such as the number of animals employed in
an experiment.
The word "statistics" is also used in another, though related, way. It can
be the plural of the noun statistic, which refers to any one of many computed
or estimated statistical quantities, such as the mean, the standard deviation, or
the correlation coefficient. Each one of these is a statistic.

1.2 The development of biostatistics

Modern statistics appears to have developed from two sources as far back as
the seventeenth century. The first source was political science; a form of statistics
developed as a quantitative description of the various aspects of the affairs of
a government or state (hence the term "statistics"). This subject also became
known as political arithmetic. Taxes and insurance caused people to become
interested in problems of censuses, longevity, and mortality. Such considerations
assumed increasing importance, especially in England as the country prospered
during the development of its empire. John Graunt (1620-1674) and William
Petty (1623-1687) were early students of vital statistics, and others followed in
their footsteps.
At about the same time, the second source of modern statistics developed:
the mathematical theory of probability engendered by the interest in games
of chance among the leisure classes of the time. Important contributions to
this theory were made by Blaise Pascal (1623-1662) and Pierre de Fermat
(1601-1665), both Frenchmen. Jacques Bernoulli (1654-1705), a Swiss, laid the
foundation of modern probability theory in Ars Conjectandi. Abraham de
Moivre (1667-1754), a Frenchman living in England, was the first to combine
the statistics of his day with probability theory in working out annuity values
and to approximate the important normal distribution through the expansion
of the binomial.
A later stimulus for the development of statistics came from the science of
astronomy, in which many individual observations had to be digested into a
coherent theory. Many of the famous astronomers and mathematicians of the
eighteenth century, such as Pierre Simon Laplace (1749-1827) in France and
Karl Friedrich Gauss (1777-1855) in Germany, were among the leaders in this
field. The latter's lasting contribution to statistics is the development of the
method of least squares.
Perhaps the earliest important figure in biostatistic thought was Adolphe
Quetelet (1796-1874), a Belgian astronomer and mathematician, who in his
work combined the theory and practical methods of statistics and applied them
to problems of biology, medicine, and sociology. Francis Galton (1822-1911),
a cousin of Charles Darwin, has been called the father of biostatistics and
eugenics. The inadequacy of Darwin's genetic theories stimulated Galton to try
to solve the problems of heredity. Galton's major contribution to biology was
his application of statistical methodology to the analysis of biological variation,
particularly through the analysis of variability and through his study of
regression and correlation in biological measurements. His hope of unraveling the
laws of genetics through these procedures was in vain. He started with the most
difficult material and with the wrong assumptions. However, his methodology
has become the foundation for the application of statistics to biology.
Karl Pearson (1857-1936), at University College, London, became
interested in the application of statistical methods to biology, particularly in the
demonstration of natural selection. Pearson's interest came about through the
influence of W. F. R. Weldon (1860-1906), a zoologist at the same institution.
Weldon, incidentally, is credited with coining the term "biometry" for the type
of studies he and Pearson pursued. Pearson continued in the tradition of Galton
and laid the foundation for much of descriptive and correlational statistics.
The dominant figure in statistics and biometry in the twentieth century has
been Ronald A. Fisher (1890-1962). His many contributions to statistical theory
will become obvious even to the cursory reader of this book.
Statistics today is a broad and extremely active field whose applications
touch almost every science and even the humanities. New applications for
statistics are constantly being found, and no one can predict from what branch
of statistics new applications to biology will be made.

1.3 The statistical frame of mind

A brief perusal of almost any biological journal reveals how pervasive the use
of statistics has become in the biological sciences. Why has there been such a
marked increase in the use of statistics in biology? Apparently, because
biologists have found that the interplay of biological causal and response variables
does not fit the classic mold of nineteenth-century physical science. In that
century, biologists such as Robert Mayer, Hermann von Helmholtz, and others
tried to demonstrate that biological processes were nothing but physicochemical
phenomena. In so doing, they helped create the impression that the
experimental methods and natural philosophy that had led to such dramatic progress
in the physical sciences should be imitated fully in biology.
Many biologists, even to this day, have retained the tradition of strictly
mechanistic and deterministic concepts of thinking (while physicists,
interestingly enough, as their science has become more refined, have begun to resort
to statistical approaches). In biology, most phenomena are affected by many
causal factors, uncontrollable in their variation and often unidentifiable.
Statistics is needed to measure such variable phenomena, to determine the error
of measurement, and to ascertain the reality of minute but important differences.
A misunderstanding of these principles and relationships has given rise to
the attitude of some biologists that if differences induced by an experiment, or
observed by nature, are not clear on plain inspection (and therefore are in need
of statistical analysis), they are not worth investigating. There are few legitimate
fields of inquiry, however, in which, from the nature of the phenomena studied,
statistical investigation is unnecessary.
Statistical thinking is not really different from ordinary disciplined scientific
thinking, in which we try to quantify our observations. In statistics we express
our degree of belief or disbelief as a probability rather than as a vague, general
statement. For example, a statement that individuals of species A are larger
than those of species B or that women suffer more often from disease X than
do men is of a kind commonly made by biological and medical scientists. Such
statements can and should be more precisely expressed in quantitative form.
In many ways the human mind is a remarkable statistical machine,
absorbing many facts from the outside world, digesting these, and regurgitating them
in simple summary form. From our experience we know certain events to occur
frequently, others rarely. "Man smoking cigarette" is a frequently observed
event, "Man slipping on banana peel," rare. We know from experience that
Japanese are on the average shorter than Englishmen and that Egyptians are
on the average darker than Swedes. We associate thunder with lightning almost
always, flies with garbage cans in the summer frequently, but snow with the
southern Californian desert extremely rarely. All such knowledge comes to us
as a result of experience, both our own and that of others, which we learn
about by direct communication or through reading. All these facts have been
processed by that remarkable computer, the human brain, which furnishes an
abstract. This abstract is constantly under revision, and though occasionally
faulty and biased, it is on the whole astonishingly sound; it is our knowledge
of the moment.
Although statistics arose to satisfy the needs of scientific research, the
development of its methodology in turn affected the sciences in which statistics is
applied. Thus, through positive feedback, statistics, created to serve the needs
of natural science, has itself affected the content and methods of the biological
sciences. To cite an example: Analysis of variance has had a tremendous effect
in influencing the types of experiments researchers carry out. The whole field of
quantitative genetics, one of whose problems is the separation of environmental
from genetic effects, depends upon the analysis of variance for its realization,
and many of the concepts of quantitative genetics have been directly built
around the designs inherent in the analysis of variance.
CHAPTER 2

Data in Biostatistics

In Section 2.1 we explain the statistical meaning of the terms "sample" and
"population," which we shall be using throughout this book. Then, in Section
2.2, we come to the types of observations that we obtain from biological research
material; we shall see how these correspond to the different kinds of variables
upon which we perform the various computations in the rest of this book. In
Section 2.3 we discuss the degree of accuracy necessary for recording data and
the procedure for rounding off figures. We shall then be ready to consider in
Section 2.4 certain kinds of derived data frequently used in biological science,
among them ratios and indices, and the peculiar problems of accuracy and
distribution they present us. Knowing how to arrange data in frequency
distributions is important because such arrangements give an overall impression of
the general pattern of the variation present in a sample and also facilitate further
computational procedures. Frequency distributions, as well as the presentation
of numerical data, are discussed in Section 2.5. In Section 2.6 we briefly describe
the computational handling of data.

2.1 Samples and populations

We shall now define a number of important terms necessary for an
understanding of biological data. The data in biostatistics are generally based on
individual observations. They are observations or measurements taken on the
smallest sampling unit. These smallest sampling units frequently, but not
necessarily, are also individuals in the ordinary biological sense. If we measure weight
in 100 rats, then the weight of each rat is an individual observation; the hundred
rat weights together represent the sample of observations, defined as a collection
of individual observations selected by a specified procedure. In this instance, one
individual observation (an item) is based on one individual in a biological
sense, that is, one rat. However, if we had studied weight in a single rat over
a period of time, the sample of individual observations would be the weights
recorded on one rat at successive times. If we wish to measure temperature
in a study of ant colonies, where each colony is a basic sampling unit, each
temperature reading for one colony is an individual observation, and the sample
of observations is the temperatures for all the colonies considered. If we consider
an estimate of the DNA content of a single mammalian sperm cell to be an
individual observation, the sample of observations may be the estimates of DNA
content of all the sperm cells studied in one individual mammal.
We have carefully avoided so far specifying what particular variable was
being studied, because the terms "individual observation" and "sample of
observations" as used above define only the structure but not the nature of the
data in a study. The actual property measured by the individual observations
is the character, or variable. The more common term employed in general
statistics is "variable." However, in biology the word "character" is frequently used
synonymously. More than one variable can be measured on each smallest
sampling unit. Thus, in a group of 25 mice we might measure the blood pH
and the erythrocyte count. Each mouse (a biological individual) is the smallest
sampling unit, blood pH and red cell count would be the two variables studied,
the pH readings and cell counts are individual observations, and two samples
of 25 observations (on pH and on erythrocyte count) would result. Or we might
speak of a bivariate sample of 25 observations, each referring to a pH reading
paired with an erythrocyte count.
Next we define population. The biological definition of this term is well
known. It refers to all the individuals of a given species (perhaps of a given
life-history stage or sex) found in a circumscribed area at a given time. In
statistics, population always means the totality of individual observations about
which inferences are to be made, existing anywhere in the world or at least within
a definitely specified sampling area limited in space and time. If you take five
men and study the number of leucocytes in their peripheral blood and you
are prepared to draw conclusions about all men from this sample of five, then
the population from which the sample has been drawn represents the leucocyte
counts of all extant males of the species Homo sapiens. If, on the other hand,
you restrict yourself to a more narrowly specified sample, such as five male
Chinese, aged 20, and you are restricting your conclusions to this particular
group, then the population from which you are sampling will be leucocyte
numbers of all Chinese males of age 20.
A common misuse of statistical methods is to fail to define the statistical
population about which inferences can be made. A report on the analysis of
a sample from a restricted population should not imply that the results hold
in general. The population in this statistical sense is sometimes referred to as
the universe.
A population may represent variables of a concrete collection of objects or
creatures, such as the tail lengths of all the white mice in the world, the leucocyte
counts of all the Chinese men in the world of age 20, or the DNA content of
all the hamster sperm cells in existence; or it may represent the outcomes of
experiments, such as all the heartbeat frequencies produced in guinea pigs by
injections of adrenalin. In cases of the first kind the population is generally
finite. Although in practice it would be impossible to collect, count, and examine
all hamster sperm cells, all Chinese men of age 20, or all white mice in the world,
these populations are in fact finite. Certain smaller populations, such as all the
whooping cranes in North America or all the recorded cases of a rare but easily
diagnosed disease X, may well lie within reach of a total census. By contrast,
an experiment can be repeated an infinite number of times (at least in theory).
A given experiment, such as the administration of adrenalin to guinea pigs,
could be repeated as long as the experimenter could obtain material and his
or her health and patience held out. The sample of experiments actually
performed is a sample from an infinite number that could be performed.
Some of the statistical methods to be developed later make a distinction
between sampling from finite and from infinite populations. However, though
populations are theoretically finite in most applications in biology, they are
generally so much larger than samples drawn from them that they can be
considered de facto infinite-sized populations.

2.2 Variables in biostatistics

Each biological discipline has its own set of variables, which may include
conventional morphological measurements; concentrations of chemicals in body
fluids; rates of certain biological processes; frequencies of certain events, as in
genetics, epidemiology, and radiation biology; physical readings of optical or
electronic machinery used in biological research; and many more.
We have already referred to biological variables in a general way, but we
have not yet defined them. We shall define a variable as a property with respect
to which individuals in a sample differ in some ascertainable way. If the property
does not differ within a sample at hand or at least among the samples being
studied, it cannot be of statistical interest. Length, height, weight, number of
teeth, vitamin C content, and genotypes are examples of variables in ordinary,
genetically and phenotypically diverse groups of organisms. Warm-bloodedness
in a group of mammals is not, since mammals are all alike in this regard,
although body temperature of individual mammals would, of course, be a
variable.
We can divide variables as follows:

Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes

Measurement variables are those measurements and counts that are expressed
numerically. Measurement variables are of two kinds. The first kind consists of
continuous variables, which at least theoretically can assume an infinite number
of values between any two fixed points. For example, between the two length
measurements 1.5 and 1.6 cm there are an infinite number of lengths that could
be measured if one were so inclined and had a precise enough method of
calibration. Any given reading of a continuous variable, such as a length of
1.57 mm, is therefore an approximation to the exact reading, which in practice
is unknowable. Many of the variables studied in biology are continuous
variables. Examples are lengths, areas, volumes, weights, angles, temperatures,
periods of time, percentages, concentrations, and rates.
Contrasted with continuous variables are the discontinuous variables, also
known as meristic or discrete variables. These are variables that have only
certain fixed numerical values, with no intermediate values possible in between.
Thus the number of segments in a certain insect appendage may be 4 or 5 or
6 but never 5½ or 4.3. Examples of discontinuous variables are numbers of a
given structure (such as segments, bristles, teeth, or glands), numbers of offspring,
numbers of colonies of microorganisms or animals, or numbers of plants in a
given quadrat.
Some variables cannot be measured but at least can be ordered or ranked
by their magnitude. Thus, in an experiment one might record the rank order
of emergence of ten pupae without specifying the exact time at which each pupa
emerged. In such cases we code the data as a ranked variable, the order of
emergence. Special methods for dealing with such variables have been
developed, and several are furnished in this book. By expressing a variable as a series
of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude
between, say, ranks 1 and 2 is identical to or even proportional to the
difference between ranks 2 and 3.
Variables that cannot be measured but must be expressed qualitatively are
called attributes, or nominal variables. These are all properties, such as black
or white, pregnant or not pregnant, dead or alive, male or female. When such
attributes are combined with frequencies, they can be treated statistically. Of
80 mice, we may, for instance, state that four were black, two agouti, and the
rest gray. When attributes are combined with frequencies into tables suitable
for statistical analysis, they are referred to as enumeration data. Thus the
enumeration data on color in mice would be arranged as follows:

Color                   Frequency

Black                        4
Agouti                       2
Gray                        74

Total number of mice        80
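
The arrangement above can be sketched in a few lines of Python. This is only an illustration: the list `coat_colors` and the helper `format_table` are names assumed here, and the counts are the ones given in the mouse example.

```python
from collections import Counter

# One attribute value per mouse, using the counts from the example above:
# 4 black, 2 agouti, 74 gray (80 mice in all).
coat_colors = ["black"] * 4 + ["agouti"] * 2 + ["gray"] * 74

# Combining the attribute with its frequencies gives enumeration data.
enumeration = Counter(coat_colors)


def format_table(counts):
    """Lay out attribute frequencies as a plain-text table (hypothetical helper)."""
    lines = [f"{'Color':<24}{'Frequency':>9}"]
    for color, frequency in counts.items():
        lines.append(f"{color.capitalize():<24}{frequency:>9}")
    lines.append(f"{'Total number of mice':<24}{sum(counts.values()):>9}")
    return "\n".join(lines)


print(format_table(enumeration))
```

A `Counter` is used because it maps each attribute class directly to its frequency, which is all that enumeration data require.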

In some cases attributes can be changed into measurement variables if this is
desired. Thus colors can be changed into wavelengths or color-chart values.
Certain other attributes that can be ranked or ordered can be coded to
become ranked variables. For example, three attributes referring to a structure
as "poorly developed," "well developed," and "hypertrophied" could be coded
1, 2, and 3.
A term that has not yet been explained is variate. In this book we shall use
it as a single reading, score, or observation of a given variable. Thus, if we have
measurements of the length of the tails of five mice, tail length will be a
continuous variable, and each of the five readings of length will be a variate. In
this text we identify variables by capital letters, the most common symbol being
Y. Thus Y may stand for tail length of mice. A variate will refer to a given
length measurement; $Y_i$ is the measurement of tail length of the $i$th mouse, and
$Y_4$ is the measurement of tail length of the fourth mouse in our sample.

2.3 Accuracy and precision of data

"Accuracy" and "precision" are used synonymously in everyday speech, but in
statistics we define them more rigorously. Accuracy is the closeness of a measured
or computed value to its true value. Precision is the closeness of repeated
measurements. A biased but sensitive scale might yield inaccurate but precise weight. By
chance, an insensitive scale might result in an accurate reading, which would,
however, be imprecise, since a repeated weighing would be unlikely to yield an
equally accurate weight. Unless there is bias in a measuring instrument, precision
will lead to accuracy. We need therefore mainly be concerned with the former.
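
The distinction can be made concrete with a small sketch. The weighings below are invented for illustration only, as are the names `TRUE_WEIGHT`, `biased_sensitive`, `unbiased_insensitive`, and `describe`; the mean error stands in for (in)accuracy and the standard deviation of repeated readings for (im)precision.

```python
from statistics import mean, stdev

TRUE_WEIGHT = 50.0  # hypothetical true weight in grams (an assumption)

# Hypothetical repeated weighings of the same object on two scales.
biased_sensitive = [52.1, 52.0, 52.2, 52.1, 52.0]      # tightly clustered, about 2 g too high
unbiased_insensitive = [48.0, 52.5, 49.0, 51.5, 49.0]  # scattered around the true weight


def describe(readings, true_value):
    """Return (bias, spread) for a set of repeated readings.

    bias   = mean reading minus true value (small magnitude means accurate on average)
    spread = sample standard deviation (small value means precise)
    """
    return mean(readings) - true_value, stdev(readings)


for label, readings in [("biased, sensitive", biased_sensitive),
                        ("unbiased, insensitive", unbiased_insensitive)]:
    bias, spread = describe(readings, TRUE_WEIGHT)
    print(f"{label:<22} bias = {bias:+.2f} g, spread = {spread:.2f} g")
```

In this made-up example the first scale is precise but inaccurate, while the second is accurate on average but imprecise, so any single reading from it may be far from the true weight.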
Precise variates are usually, but not necessarily, whole numbers. Thus, when
we count four eggs in a nest, there is no doubt about the exact number of eggs
in the nest if we have counted correctly; it is 4, not 3 or 5, and clearly it could
not be 4 plus or minus a fractional part. Meristic, or discontinuous, variables are
generally measured as exact numbers. Seemingly continuous variables derived
from meristic ones can under certain conditions also be exact numbers. For
instance, ratios between exact numbers are themselves also exact. If in a colony
of animals there are 18 females and 12 males, the ratio of females to males (a
seemingly continuous variable) is 1.5, an exact number.

Most continuous variables, however, are approximate. We mean by this
that the exact value of the single measurement, the variate, is unknown and
probably unknowable. The last digit of the measurement stated should imply
precision; that is, it should indicate the limits on the measurement scale between
which we believe the true measurement to lie. Thus, a length measurement of
12.3 mm implies that the true length of the structure lies somewhere between
12.25 and 12.35 mm. Exactly where between these implied limits the real length
is we do not know. But where would a true measurement of 12.25 fall? Would
it not equally likely fall in either of the two classes 12.2 and 12.3, clearly an
unsatisfactory state of affairs? Such an argument is correct, but when we record
a number as either 12.2 or 12.3, we imply that the decision whether to put it
into the higher or lower class has already been taken. This decision was not
taken arbitrarily, but presumably was based on the best available measurement.
If the scale of measurement is so precise that a value of 12.25 would clearly
have been recognized, then the measurement should have been recorded
originally to four significant figures. Implied limits, therefore, always carry one
more figure beyond the last significant one measured by the observer.
Hence, it follows that if we record the measurement as 12.32, we are implying
that the true value lies between 12.315 and 12.325. Unless this is what we mean,
there would be no point in adding the last decimal figure to our original
measurements. If we do add another figure, we must imply an increase in precision.
We see, therefore, that accuracy and precision in numbers are not absolute
concepts, but are relative. Assuming there is no bias, a number becomes increasingly
more accurate as we are able to write more significant figures for it (increase its
precision). To illustrate this concept of the relativity of accuracy, consider the
following three numbers:

                  Implied limits

193             192.5       193.5
192.8           192.75      192.85
192.76          192.755     192.765

We may imagine these numbers to be recorded measurements of the same
structure. Let us assume that we had extramundane knowledge that the true length
of the given structure was 192.758 units. If that were so, the three measurements
would increase in accuracy from the top down, as the interval between their
implied limits decreased. You will note that the implied limits of the topmost
measurement are wider than those of the one below it, which in turn are wider
than those of the third measurement.
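
As a rough sketch of the rule, the implied limits of a recorded value can be computed by going half a unit of the last recorded digit below and above it. The function name `implied_limits` is an assumption made for this example; Python's `decimal` module is used, and readings are passed as strings, so that trailing zeros are preserved.

```python
from decimal import Decimal


def implied_limits(recorded: str):
    """Return (lower, upper) implied limits of a measurement given as a string.

    The limits lie half a unit of the last recorded digit on either side of
    the value. A string is required so that "7.80" and "7.8" stay distinct.
    """
    value = Decimal(recorded)
    exponent = value.as_tuple().exponent          # -1 for "12.3", 0 for "193"
    half_unit = Decimal(5).scaleb(exponent - 1)   # half of one unit in the last place
    return value - half_unit, value + half_unit


for reading in ["193", "192.8", "192.76"]:
    lower, upper = implied_limits(reading)
    print(f"{reading:<8} {lower}  {upper}")
```

The three readings reproduce the limits tabulated above, and each added significant figure narrows the interval by a factor of ten.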
Meristic variates, though ordinarily exact, may be recorded approximately
when large numbers are involved. Thus when counts are reported to the nearest
thousand, a count of 36,000 insects in a cubic meter of soil, for example, implies
that the true number varies somewhere from 35,500 to 36,500 insects.
To how many significant figures should we record measurements? If we array
the sample by order of magnitude from the smallest individual to the largest
one, an easy rule to remember is that the number of unit steps from the smallest
to the largest measurement in an array should usually be between 30 and 300.
Thus, if we are measuring a series of shells to the nearest millimeter and the
largest is 8 mm and the smallest is 4 mm wide, there are only four unit steps
between the largest and the smallest measurement. Hence, we should measure
our shells to one more significant decimal place. Then the two extreme
measurements might be 8.2 mm and 4.1 mm, with 41 unit steps between them (counting
the last significant digit as the unit); this would be an adequate number of unit
steps. The reason for such a rule is that an error of 1 in the last significant digit
of a reading of 4 mm would constitute an inadmissible error of 25%, but an error
of 1 in the last digit of 4.1 is less than 2.5%. Similarly, if we measured the height
of the tallest of a series of plants as 173.2 cm and that of the shortest of these
plants as 26.6 cm, the difference between these limits would comprise 1466 unit
steps (of 0.1 cm), which are far too many. It would therefore be advisable to
record the heights to the nearest centimeter, as follows: 173 cm for the tallest
and 27 cm for the shortest. This would yield 146 unit steps. Using the rule we
have stated for the number of unit steps, we shall record two or three digits for
most measurements.
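
The unit-step rule is easy to check mechanically. The following is a minimal sketch, assuming both extremes are recorded to the same number of decimal places; `unit_steps` and `within_rule` are names invented for this illustration.

```python
from decimal import Decimal


def unit_steps(smallest: str, largest: str) -> int:
    """Count unit steps between the smallest and largest recorded measurement.

    One unit step is one unit in the last recorded digit. Both values are
    assumed to be recorded to the same number of decimal places.
    """
    low, high = Decimal(smallest), Decimal(largest)
    unit = Decimal(1).scaleb(low.as_tuple().exponent)  # 1 for "4", 0.1 for "4.1"
    return int((high - low) / unit)


def within_rule(steps: int, minimum: int = 30, maximum: int = 300) -> bool:
    """Apply the rule of thumb of 30 to 300 unit steps."""
    return minimum <= steps <= maximum


for low, high in [("4", "8"), ("4.1", "8.2"), ("26.6", "173.2"), ("27", "173")]:
    steps = unit_steps(low, high)
    print(f"{low} to {high}: {steps} unit steps, adequate: {within_rule(steps)}")
```

The four pairs are the shell and plant examples from the text; only the second and fourth fall within the recommended range.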
The last digit should always be significant; that is, it should imply a range
for the true measurement of from half a "unit step" below to half a "unit step"
above the recorded score, as illustrated earlier. This applies to all digits, zero
included. Zeros should therefore not be written at the end of approximate
numbers to the right of the decimal point unless they are meant to be significant
digits. Thus 7.80 must imply the limits 7.795 to 7.805. If 7.75 to 7.85 is implied,
the measurement should be recorded as 7.8.
When the number of significant digits is to be reduced, we carry out the
process of rounding off numbers. The rules for rounding off are very simple. A
digit to be rounded off is not changed if it is followed by a digit less than 5. If
the digit to be rounded off is followed by a digit greater than 5 or by 5 followed
by other nonzero digits, it is increased by 1. When the digit to be rounded off
is followed by a 5 standing alone or a 5 followed by zeros, it is unchanged if it
is even but increased by 1 if it is odd. The reason for this last rule is that when
such numbers are summed in a long series, we should have as many digits
raised as are being lowered, on the average; these changes should therefore
balance out. Practice the above rules by rounding off the following numbers to
the indicated number of significant digits:

Number      Significant digits desired    Answer

26.58                   2                 27
133.7137                5                 133.71
0.03725                 3                 0.0372
0.03715                 3                 0.0372
18,316                  2                 18,000
17.3476                 3                 17.3
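
The answers in the table can be checked with a short sketch based on Python's `decimal` module, whose `ROUND_HALF_EVEN` mode matches the even-digit rule stated above. The function `round_significant` is a name assumed for this example, and numbers are passed as strings without thousands separators.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_significant(number: str, digits: int, rounding=ROUND_HALF_EVEN) -> Decimal:
    """Round a number (given as a string) to the requested significant digits."""
    value = Decimal(number)
    if value == 0:
        return value
    # adjusted() is the exponent of the leading digit, e.g. 1 for 26.58, -2 for 0.03725.
    last_place = value.adjusted() - digits + 1
    # quantize() rounds to the exponent of its argument using the chosen rounding mode.
    return value.quantize(Decimal(1).scaleb(last_place), rounding=rounding)


practice = [("26.58", 2), ("133.7137", 5), ("0.03725", 3),
            ("0.03715", 3), ("18316", 2), ("17.3476", 3)]
for number, digits in practice:
    print(number, digits, round_significant(number, digits))
```

Note that `decimal` prints the fifth result in scientific notation (1.8E+4), which is numerically equal to 18,000. Passing `rounding=ROUND_HALF_UP` (also exported by `decimal`) would instead always raise the digit before a lone 5.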

Most pocket calculators or larger computers round off their displays using
a different rule: they increase the preceding digit when the following digit is a
5 standing alone or with trailing zeros. However, since most of the machines
usable for statistics also retain eight or ten significant figures internally, the
accumulation of rounding errors is minimized. Incidentally, if two calculators
give answers with slight differences in the final (least significant) digits, suspect
a different number of significant digits in memory as a cause of the disagreement.

2.4 Derived variables

The majority of variables in biometric work are observations recorded as direct
measurements or counts of biological material or as readings that are the output
of various types of instruments. However, there is an important class of variables
in biological research that we may call the derived or computed variables. These
are generally based on two or more independently measured variables whose
relations are expressed in a certain way. We are referring to ratios, percentages,
concentrations, indices, rates, and the like.
A ratio expresses as a single value the relation that two variables have, one
to the other. In its simplest form, a ratio is expressed as in 64:24, which may
represent the number of wild-type versus mutant individuals, the number of
males versus females, a count of parasitized individuals versus those not
parasitized, and so on. These examples imply ratios based on counts. A ratio based
on a continuous variable might be similarly expressed as 1.2:1.8, which may
represent the ratio of width to length in a sclerite of an insect or the ratio
between the concentrations of two minerals contained in water or soil. Ratios
may also be expressed as fractions; thus, the two ratios above could be expressed
as $\frac{64}{24}$ and $\frac{1.2}{1.8}$. However, for computational purposes it is more useful to express
the ratio as a quotient. The two ratios cited would therefore be 2.666... and
0.666..., respectively. These are pure numbers, not expressed in measurement
units of any kind. It is this form for ratios that we shall consider further.
Percentages are also a type of ratio. Ratios, percentages, and concentrations
are basic quantities in much biological research, widely used and generally
familiar.
An index is the ratio of the value of one variable to the value of a so-called
standard one. A well-known example of an index in this sense is the cephalic
index in physical anthropology. Conceived in the wide sense, an index could
be the average of two measurements, either simply, such as
$\frac{1}{2}(\text{length of } A + \text{length of } B)$, or in weighted fashion, such as
$\frac{1}{3}[(2 \times \text{length of } A) + \text{length of } B]$.
Rates are important in many experimental fields of biology. The amount
of a substance liberated per unit weight or volume of biological material, weight
gain per unit time, reproductive rates per unit population size and time (birth
rates), and death rates would fall in this category.
The use of ratios and percentages is deeply ingrained in scientific thought.
Often ratios may be the only meaningful way to interpret and understand
certain types of biological problems. If the biological process being investigated
operates on the ratio of the variables studied, one must examine this ratio to
understand the process. Thus, Sinnott and Hammond (1935) found that
inheritance of the shapes of squashes of the species Cucurbita pepo could be
interpreted through a form index based on a length-width ratio, but not through
the independent dimensions of shape. By similar methods of investigation, we
should be able to find selection affecting body proportions to exist in the
evolution of almost any organism.
There are several disadvantages to using ratios. First, they are relatively
inaccurate. Let us return to the ratio $\frac{1.2}{1.8}$ mentioned above and recall from the
previous section that a measurement of 1.2 implies a true range of measurement
of the variable from 1.15 to 1.25; similarly, a measurement of 1.8 implies a range
from 1.75 to 1.85. We realize, therefore, that the true ratio may vary anywhere
from $\frac{1.15}{1.85}$ to $\frac{1.25}{1.75}$, or from 0.622 to 0.714. We note a possible maximal error of
4.2% if 1.2 is an original measurement: (1.25 - 1.2)/1.2; the corresponding
maximal error for the ratio is 7.0%: (0.714 - 0.667)/0.667. Furthermore, the best
estimate of a ratio is not usually the midpoint between its possible ranges. Thus,
in our example the midpoint between the implied limits is 0.668 and the ratio
based on $\frac{1.2}{1.8}$ is 0.666...; while this is only a slight difference, the discrepancy
may be greater in other instances.
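
The arithmetic of this example can be retraced in a few lines. This is only a sketch: `ratio_range` is a hypothetical helper, and the inputs are the implied limits of 1.2 and 1.8 quoted above.

```python
def ratio_range(num_low, num_high, den_low, den_high):
    """Smallest and largest ratio consistent with the implied limits.

    Assumes all limits are positive, so the extremes are low/high and high/low.
    """
    return num_low / den_high, num_high / den_low


low, high = ratio_range(1.15, 1.25, 1.75, 1.85)
recorded_ratio = 1.2 / 1.8

measurement_error = (1.25 - 1.2) / 1.2                  # maximal error of the measurement 1.2
ratio_error = (high - recorded_ratio) / recorded_ratio  # maximal error of the ratio
midpoint = (low + high) / 2                             # midpoint of the possible range

print(f"ratio range: {low:.3f} to {high:.3f}")
print(f"maximal error of 1.2: {measurement_error:.1%}")
print(f"maximal error of ratio: {ratio_error:.1%}")
print(f"midpoint {midpoint:.3f} versus recorded ratio {recorded_ratio:.3f}")
```

Computed without intermediate rounding, the maximal error of the ratio comes out near 7.1%, marginally above the 7.0% obtained from the three-decimal figures; either way it is clearly larger than the 4.2% error of the single measurement.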
A second disadvantage to ratios and percentages is that they may not be
approximately normally distributed (see Chapter 5) as required by many
statistical tests. This difficulty can frequently be overcome by transformation of the
variable (as discussed in Chapter 10). A third disadvantage of ratios is that
in using them one loses information about the relationships between the two
variables except for the information about the ratio itself.

2.5 Frequency distributions

If we were to sample a population of birth weights of infants, we could represent
each sampled measurement by a point along an axis denoting magnitude of
birth weight. This is illustrated in Figure 2.1A, for a sample of 25 birth weights.
If we sample repeatedly from the population and obtain 100 birth weights, we
shall probably have to place some of these points on top of other points in
order to record them all correctly (Figure 2.1B). As we continue sampling
additional hundreds and thousands of birth weights (Figure 2.1C and D), the
assemblage of points will continue to increase in size but will assume a fairly
definite shape. The outline of the mound of points approximates the distribution
of the variable. Remember that a continuous variable such as birth weight can
assume an infinity of values between any two points on the abscissa. The
refinement of our measurements will determine how fine the number of recorded
divisions between any two points along the axis will be.
The distribution of a variable is of considerable biological interest. If we
find that the distribution is asymmetrical and drawn out in one direction, it tells
us that there is, perhaps, selection that causes organisms to fall preferentially
in one of the tails of the distribution, or possibly that the scale of measurement
chosen is such as to bring about a distortion of the distribution. If, in a sample
of immature insects, we discover that the measurements are bimodally
distributed (with two peaks), this would indicate that the population is dimorphic.
This means that different species or races may have become intermingled in
our sample. Or the dimorphism could have arisen from the presence of both
sexes or of different instars.

FIGURE 2.1
Sampling from a population of birth weights of infants (a continuous variable).
A. A sample of 25. B. A sample of 100. C. A sample of 500. D. A sample of 2000.
Abscissa: birth weight (oz), 60 to 160.

FIGURE 2.2
Bar diagram. Frequency of the sedge Carex flacca in 500 quadrats. Data from
Table 2.2; originally from Archibald (1950). Abscissa: number of plants per
quadrat, 0 to 8.

There are several characteristic shapes of frequency distributions. The most
common is the symmetrical bell shape (approximated by the bottom graph in
Figure 2.1), which is the shape of the normal frequency distribution discussed
in Chapter 5. There are also skewed distributions (drawn out more at one tail
than the other), L-shaped distributions as in Figure 2.2, U-shaped distributions,
and others, all of which impart significant information about the relationships
they represent. We shall have more to say about the implications of various
types of distributions in later chapters and sections.
After researchers have obtained data in a given study, they must arrange
the data in a form suitable for computation and interpretation. We may assume
that variates are randomly ordered initially or are in the order in which the
measurements have been taken. A simple arrangement would be an array of
the data by order of magnitude. Thus, for example, the variates 7, 6, 5, 7, 8, 9,
6, 7, 4, 6, 7 could be arrayed in order of decreasing magnitude as follows: 9, 8,
7, 7, 7, 7, 6, 6, 6, 5, 4. Where there are some variates of the same value, such as
the 6's and 7's in this fictitious example, a time-saving device might immediately
have occurred to you, namely, to list a frequency for each of the recurring
variates; thus: 9, 8, 7(4×), 6(3×), 5, 4. Such a shorthand notation is one way to
represent a frequency distribution, which is simply an arrangement of the classes
of variates with the frequency of each class indicated. Conventionally, a
frequency distribution is stated in tabular form; for our example, this is done as
follows:

Variable    Frequency
   Y            f

   9            1
   8            1
   7            4
   6            3
   5            1
   4            1
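
The same array and frequency distribution can be produced programmatically. This is a minimal sketch using the fictitious variates from the text; the variable names are chosen here for illustration.

```python
from collections import Counter

# The fictitious variates from the example, in the order they were recorded.
variates = [7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7]

# An array: the data arranged by order of decreasing magnitude.
array = sorted(variates, reverse=True)
print(array)

# A frequency distribution: each class of variate with its frequency.
frequencies = Counter(variates)
print(f"{'Y':>3} {'f':>3}")
for value in sorted(frequencies, reverse=True):
    print(f"{value:>3} {frequencies[value]:>3}")
```

Sorting gives the array, and counting identical variates gives the two-column table shown above.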

The above is an example of a quantitative frequency distribution, since Y is
clearly a measurement variable. However, arrays and frequency distributions
need not be limited to such variables. We can make frequency distributions of
attributes, called qualitative frequency distributions. In these, the various classes
are listed in some logical or arbitrary order. For example, in genetics we might
have a qualitative frequency distribution as follows:

Phenotype     f

A-           86
aa           32

This tells us that there are two classes of individuals, those identified by the A-
phenotype, of which 86 were found, and those comprising the homozygote
recessive aa, of which 32 were seen in the sample.
An example of a more extensive qualitative frequency distribution is given
in Table 2.1, which shows the distribution of melanoma (a type of skin cancer)
over body regions in men and women. This table tells us that the trunk and
limbs are the most frequent sites for melanomas and that the buccal cavity, the
rest of the gastrointestinal tract, and the genital tract are rarely afflicted by this
type of cancer. We often encounter other examples of qualitative frequency
distributions in ecology in the form of tables, or species lists, of the inhabitants
of a sampled ecological area. Such tables catalog the inhabitants by species or
at a higher taxonomic level and record the number of specimens observed for
each. The arrangement of such tables is usually alphabetical, or it may follow
a special convention, as in some botanical species lists.

TABLE 2.1
Two qualitative frequency distributions. Number of cases of
skin cancer (melanoma) distributed over body regions of
4599 men and 4786 women.

                                    Observed frequency
                                    Men        Women
Anatomic site                        f           f

Head and neck                       949         645
Trunk and limbs                    3243        3645
Buccal cavity                         8          11
Rest of gastrointestinal tract        5          21
Genital tract                        12          93
Eye                                 382         371
Total cases                        4599        4786

Source: Data from Lee

TABLE 2.2
A meristic frequency distribution.
Number of plants of the sedge Carex
flacca found in 500 quadrats.

No. of plants      Observed
per quadrat        frequency
     Y                 f

     0                181
     1                118
     2                 97
     3                 54
     4                 32
     5                  9
     6                  5
     7                  3
     8                  1
Total                 500

Source: Data from Archibald (1950).

A q u a n t i t a t i v e frequency distribution based on meristic variates is s h o w n
in T a b l e 2.2. This is an example f r o m plant ecology: the n u m b e r of p l a n t s per
q u a d r a t sampled is listed at the left in the variable c o l u m n ; the observed fre-
quency is shown at the right.
Quantitative frequency distributions based on a continuous variable are the most commonly employed frequency distributions; you should become thoroughly familiar with them. An example is shown in Box 2.1. It is based on 25 femur lengths measured in an aphid population. The 25 readings are shown at the top of Box 2.1 in the order in which they were obtained as measurements. (They could have been arrayed according to their magnitude.) The data are next set up in a frequency distribution. The variates increase in magnitude by unit steps of 0.1. The frequency distribution is prepared by entering each variate in turn on the scale and indicating a count by a conventional tally mark. When all of the items have been tallied in the corresponding class, the tallies are converted into numerals indicating frequencies in the next column. Their sum is indicated by $\sum f$.
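The tallying procedure just described can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `frequency_distribution` and its parameters are assumptions made for the example, and the data are the 25 aphid femur lengths listed in Box 2.1.

```python
from collections import Counter

# The 25 aphid femur lengths of Box 2.1, in the order they were obtained.
femur_lengths = [
    3.8, 3.6, 4.3, 3.5, 4.3,
    3.3, 4.3, 3.9, 4.3, 3.8,
    3.9, 4.4, 3.8, 4.7, 3.6,
    4.1, 4.4, 4.5, 3.6, 3.8,
    4.4, 4.1, 3.6, 4.2, 3.9,
]

def frequency_distribution(variates, step=0.1, decimals=1):
    """Tally variates into classes rising by `step` from the smallest to the
    largest observed value. Classes with no variates are kept (frequency 0).
    (Hypothetical helper written for this illustration.)"""
    counts = Counter(round(y, decimals) for y in variates)
    low, high = min(counts), max(counts)
    n_classes = round((high - low) / step) + 1
    classes = [round(low + k * step, decimals) for k in range(n_classes)]
    return {c: counts.get(c, 0) for c in classes}

dist = frequency_distribution(femur_lengths)
for y, f in dist.items():
    # One '|' per tallied variate, followed by the frequency f.
    print(f"{y:.1f}  {'|' * f:<5} {f}")
print("sum of f =", sum(dist.values()))
```

Keeping the empty classes (such as 3.4 and 3.7) in the output is deliberate, because they are what gives the ungrouped distribution its scattered appearance.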
What have we achieved in summarizing our data? The original 25 variates are now represented by only 15 classes. We find that variates 3.6, 3.8, and 4.3 have the highest frequencies. However, we also note that there are several classes, such as 3.4 or 3.7, that are not represented by a single aphid. This gives the entire frequency distribution a drawn-out and scattered appearance. The reason for this is that we have only 25 aphids, too few to put into a frequency distribution with 15 classes. To obtain a more cohesive and smooth-looking distribution, we have to condense our data into fewer classes. This process is known as grouping of classes of frequency distributions; it is illustrated in Box 2.1 and described in the following paragraphs.
We should realize that grouping individual variates into classes of wider range is only an extension of the same process that took place when we obtained the initial measurement. Thus, as we have seen in Section 2.3, when we measure an aphid and record its femur length as 3.3 units, we imply thereby that the true measurement lies between 3.25 and 3.35 units, but that we were unable to measure to the second decimal place. In recording the measurement initially as 3.3 units, we estimated that it fell within this range. Had we estimated that it exceeded the value of 3.35, for example, we would have given it the next higher score, 3.4. Therefore, all the measurements between 3.25 and 3.35 were in fact grouped into the class identified by the class mark 3.3. Our class interval was 0.1 units. If we now wish to make wider class intervals, we are doing nothing but extending the range within which measurements are placed into one class.
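As a small illustration of the implied limits discussed above, the following sketch computes them from a measurement as recorded. The helper name `implied_limits` is an assumption made for this example; the rule it applies (half of the last recorded unit on either side) is the one stated in the text.

```python
from decimal import Decimal

def implied_limits(recorded: str):
    """Return the (lower, upper) implied limits of a recorded measurement.

    The measurement is passed as text so that the number of recorded
    decimal places is preserved, e.g. "3.3" has a last unit of 0.1.
    (Hypothetical helper written for this illustration.)
    """
    value = Decimal(recorded)
    # Size of the last recorded unit: 0.1 for "3.3", 1 for "107".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half = unit / 2
    return value - half, value + half

lower, upper = implied_limits("3.3")
print(lower, upper)  # implied limits of a femur length recorded as 3.3
```

Decimal arithmetic is used here only to keep values such as 3.25 exact; ordinary floating-point numbers would serve for most practical purposes.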
Reference to Box 2.1 will make this process clear. We group the data twice in order to impress upon the reader the flexibility of the process. In the first example of grouping, the class interval has been doubled in width; that is, it has been made to equal 0.2 units. If we start at the lower end, the implied class limits will now be from 3.25 to 3.45, the limits for the next class from 3.45 to 3.65, and so forth.
Our next task is to find the class marks. This was quite simple in the frequency distribution shown at the left side of Box 2.1, in which the original measurements were used as class marks. However, now we are using a class interval twice as wide as before, and the class marks are calculated by taking the midpoint of the new class intervals. Thus, to find the class mark of the first class, we take the midpoint between 3.25 and 3.45, which turns out to be 3.35. We note that the class mark has one more decimal place than the original measurements. We should not now be led to believe that we have suddenly achieved greater precision. Whenever we designate a class interval whose last significant digit is even (0.2 in this case), the class mark will carry one more decimal place than the original measurements. On the right side of the table in Box 2.1 the data are grouped once again, using a class interval of 0.3. Because of the odd last significant digit, the class mark now shows as many decimal places as the original variates, the midpoint between 3.25 and 3.55 being 3.4.
Once the implied class limits and the class mark for the first class have been correctly found, the others can be written down by inspection without any special computation. Simply add the class interval repeatedly to each of the values. Thus, starting with the lower limit 3.25, by adding 0.2 we obtain 3.45, 3.65, 3.85, and so forth; similarly, for the class marks, we obtain 3.35, 3.55, 3.75, and so forth. It should be obvious that the wider the class intervals, the more compact the data become but also the less precise. However, looking at the frequency distribution of aphid femur lengths in Box 2.1, we notice that the initial rather chaotic structure is being simplified by grouping. When we group the frequency distribution into five classes with a class interval of 0.3 units, it becomes notably bimodal (that is, it possesses two peaks of frequencies).

BOX 2.1
Preparation of frequency distribution and grouping into fewer classes with wider class intervals.
Twenty-five femur lengths of the aphid Pemphigus. Measurements are in mm × 10⁻¹.

Original measurements

3.8  3.6  4.3  3.5  4.3
3.3  4.3  3.9  4.3  3.8
3.9  4.4  3.8  4.7  3.6
4.1  4.4  4.5  3.6  3.8
4.4  4.1  3.6  4.2  3.9

Original frequency distribution

Implied limits   Y     Tally marks   f

3.25-3.35        3.3   |             1
3.35-3.45        3.4                 0
3.45-3.55        3.5   |             1
3.55-3.65        3.6   ||||          4
3.65-3.75        3.7                 0
3.75-3.85        3.8   ||||          4
3.85-3.95        3.9   |||           3
3.95-4.05        4.0                 0
4.05-4.15        4.1   ||            2
4.15-4.25        4.2   |             1
4.25-4.35        4.3   ||||          4
4.35-4.45        4.4   |||           3
4.45-4.55        4.5   |             1
4.55-4.65        4.6                 0
4.65-4.75        4.7   |             1
Σf                                  25

Grouping into 8 classes of interval 0.2

Implied limits   Class mark   Tally marks   f

3.25-3.45        3.35         |             1
3.45-3.65        3.55         |||||         5
3.65-3.85        3.75         ||||          4
3.85-4.05        3.95         |||           3
4.05-4.25        4.15         |||           3
4.25-4.45        4.35         |||||||       7
4.45-4.65        4.55         |             1
4.65-4.85        4.75         |             1
Σf                                         25

Grouping into 5 classes of interval 0.3

Implied limits   Class mark   Tally marks   f

3.25-3.55        3.4          ||            2
3.55-3.85        3.7          ||||||||      8
3.85-4.15        4.0          |||||         5
4.15-4.45        4.3          ||||||||      8
4.45-4.75        4.6          ||            2
Σf                                         25

Source: Data from R. R. Sokal.

Histogram of the original frequency distribution shown above and of the grouped distribution with 5 classes. Line below abscissa shows class marks for the grouped frequency distribution. Shaded bars represent original frequency distribution; hollow bars represent grouped distribution.

[Histogram. Abscissa: Y (femur length, in units of 0.1 mm), with class marks 3.4, 3.7, 4.0, 4.3, 4.6; ordinate: frequency.]

For a detailed account of the process of grouping, see Section 2.5.
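The grouping of the aphid data into wider classes can be checked with a short Python sketch. The function name `group` and its arguments are assumptions made for this illustration; the lower implied limit (3.25) and the class intervals (0.2 and 0.3) are those used in Box 2.1.

```python
from decimal import Decimal

femur_lengths = [
    3.8, 3.6, 4.3, 3.5, 4.3,
    3.3, 4.3, 3.9, 4.3, 3.8,
    3.9, 4.4, 3.8, 4.7, 3.6,
    4.1, 4.4, 4.5, 3.6, 3.8,
    4.4, 4.1, 3.6, 4.2, 3.9,
]

def group(variates, lower_limit, interval):
    """Group variates into contiguous classes of width `interval`, starting at
    the implied lower limit `lower_limit` (both given as text for exactness).

    Returns a list of (lower limit, upper limit, class mark, frequency).
    (Hypothetical helper written for this illustration.)
    """
    values = [Decimal(str(y)) for y in variates]
    largest = max(values)
    lower, width = Decimal(lower_limit), Decimal(interval)
    classes = []
    while lower < largest:
        upper = lower + width
        # Implied limits carry one more decimal place than the variates,
        # so no variate can fall exactly on a limit.
        f = sum(1 for y in values if lower < y < upper)
        class_mark = (lower + upper) / 2  # midpoint of the class interval
        classes.append((lower, upper, class_mark, f))
        lower = upper
    return classes

for interval in ("0.2", "0.3"):
    print(f"class interval {interval}")
    for lower, upper, mark, f in group(femur_lengths, "3.25", interval):
        print(f"  {lower}-{upper}  class mark {mark}  f = {f}")
```

Each class is obtained simply by adding the class interval to the previous limits, which mirrors the "by inspection" rule given in the text.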
In setting up frequency distributions, from 12 to 20 classes should be established. This rule need not be slavishly adhered to, but it should be employed with some of the common sense that comes from experience in handling statistical data. The number of classes depends largely on the size of the sample studied. Samples of less than 40 or 50 should rarely be given as many as 12 classes, since that would provide too few frequencies per class. On the other hand, samples of several thousand may profitably be grouped into more than 20 classes. If the aphid data of Box 2.1 need to be grouped, they should probably not be grouped into more than 6 classes.
If the original data provide us with fewer classes than we think we should have, then nothing can be done if the variable is meristic, since this is the nature of the data in question. However, with a continuous variable a scarcity of classes would indicate that we probably had not made our measurements with sufficient precision. If we had followed the rules on number of significant digits for measurements stated in Section 2.3, this could not have happened.
Whenever we come up with more than the desired number of classes, grouping should be undertaken. When the data are meristic, the implied limits of continuous variables are meaningless. Yet with many meristic variables, such as a bristle number varying from a low of 13 to a high of 81, it would probably be wise to group the variates into classes, each containing several counts. This can best be done by using an odd number as a class interval so that the class mark representing the data will be a whole rather than a fractional number. Thus, if we were to group the bristle numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5, a meaningless value in terms of bristle number. It would therefore be better to use a class ranging over 3 bristles or 5 bristles, giving the integral value 14 or 15 as a class mark.
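The effect of an odd versus an even class interval on the class mark of a meristic variable can be verified directly. The following sketch uses the bristle numbers of the example above; the function name `class_mark` is an assumption made for this illustration.

```python
def class_mark(first_count, interval):
    """Midpoint of a class that holds `interval` consecutive counts,
    beginning with `first_count`. (Hypothetical helper for illustration.)"""
    last_count = first_count + interval - 1
    return (first_count + last_count) / 2

# Classes starting at 13 bristles and ranging over 3, 4, and 5 counts.
for interval in (3, 4, 5):
    mark = class_mark(13, interval)
    kind = "whole" if mark.is_integer() else "fractional"
    print(f"interval of {interval} bristles: class mark {mark} ({kind})")
```

Only the even interval (4 bristles) yields a fractional class mark, which is why an odd interval is recommended for counts.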
Grouping data into frequency distributions was necessary when computations were done by pencil and paper. Nowadays even thousands of variates can be processed efficiently by computer without prior grouping. However, frequency distributions are still extremely useful as a tool for data analysis. This is especially true in an age in which it is all too easy for a researcher to obtain a numerical result from a computer program without ever really examining the data for outliers or for other ways in which the sample may not conform to the assumptions of the statistical methods.
Rather than using tally marks to set up a frequency distribution, as was done in Box 2.1, we can employ Tukey's stem-and-leaf display. This technique is an improvement, since it not only results in a frequency distribution of the variates of a sample but also permits easy checking of the variates and ordering them into an array (neither of which is possible with tally marks). This technique will therefore be useful in computing the median of a sample (see Section 3.3) and in computing various tests that require ordered arrays of the sample variates.

To learn how to construct a stem-and-leaf display, let us look ahead to Table 3.1 in the next chapter, which lists 15 blood neutrophil counts. The unordered measurements are as follows: 4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1, 2.3, 3.6, 18.0, 3.7, 7.3, 4.4, and 9.8. To prepare a stem-and-leaf display, we scan the variates in the sample to discover the lowest and highest leading digit or digits. Next, we write down the entire range of leading digits in unit increments to the left of a vertical line (the "stem"), as shown in the accompanying illustration. We then put the next digit of the first variate (a "leaf") at that level of the stem corresponding to its leading digit(s). The first observation in our sample is 4.9. We therefore place a 9 next to the 4. The next variate is 4.6. It is entered by finding the stem level for the leading digit 4 and recording a 6 next to the 9 that is already there. Similarly, for the third variate, 5.5, we record a 5 next to the leading digit 5. We continue in this way until all 15 variates have been entered (as "leaves") in sequence along the appropriate leading digits of the stem. The completed array is the equivalent of a frequency distribution and has the appearance of a histogram or bar diagram (see the illustration). Moreover, it permits the efficient ordering of the variates. Thus, from the completed array it becomes obvious that the appropriate ordering of the 15 variates is 2.3, 3.6, 3.7, 4.4, 4.6, 4.9, 5.5, 6.4, 7.1, 7.3, 9.1, 9.8, 12.7, 16.3, 18.0. The median can easily be read off the stem-and-leaf display. It is clearly 6.4. For very large samples, stem-and-leaf displays may become awkward. In such cases a conventional frequency distribution as in Box 2.1 would be preferable.

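                                            Completed array
Step 1      Step 2    ...    Step 7    ...    (Step 15)

 2 |         2 |              2 |              2 | 3
 3 |         3 |              3 |              3 | 67
 4 | 9       4 | 96           4 | 96           4 | 964
 5 |         5 |              5 | 5            5 | 5
 6 |         6 |              6 | 4            6 | 4
 7 |         7 |              7 |              7 | 13
 8 |         8 |              8 |              8 |
 9 |         9 |              9 | 1            9 | 18
10 |        10 |             10 |             10 |
11 |        11 |             11 |             11 |
12 |        12 |             12 | 7           12 | 7
13 |        13 |             13 |             13 |
14 |        14 |             14 |             14 |
15 |        15 |             15 |             15 |
16 |        16 |             16 | 3           16 | 3
17 |        17 |             17 |             17 |
18 |        18 |             18 |             18 | 0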
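The construction of the stem-and-leaf display can also be sketched in Python. This is a minimal illustration under stated assumptions: the function names `stem_and_leaf` and `ordered_array` are invented for the example, and the variates are assumed to be recorded to one decimal place, as the 15 neutrophil counts are.

```python
from statistics import median

# The 15 blood neutrophil counts, in the order in which they are listed.
neutrophil_counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1, 2.3,
                     3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

def stem_and_leaf(variates):
    """Build a stem-and-leaf display as a dict {stem: [leaves]}.

    The stem is the leading digit(s) and the leaf is the first decimal digit.
    Leaves are entered in the order of observation, and stems without any
    leaves are kept so that the display retains the shape of a histogram.
    (Hypothetical helper written for this illustration.)
    """
    tenths = [round(y * 10) for y in variates]  # 4.9 -> 49
    stems = [t // 10 for t in tenths]
    display = {stem: [] for stem in range(min(stems), max(stems) + 1)}
    for t in tenths:
        display[t // 10].append(t % 10)
    return display

def ordered_array(display):
    """Read the variates off the display in increasing order."""
    return [(10 * stem + leaf) / 10
            for stem in sorted(display)
            for leaf in sorted(display[stem])]

display = stem_and_leaf(neutrophil_counts)
for stem, leaves in display.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")

ordered = ordered_array(display)
print("ordered array:", ordered)
print("median:", median(ordered))
```

Sorting the leaves within each stem is all that is needed to turn the display into an ordered array, from which the median is the middle (eighth) variate.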

When the shape of a frequency distribution is of particular interest, we may wish to present the distribution in graphic form when discussing the results. This is generally done by means of frequency diagrams, of which there are two common types. For a distribution of meristic data we employ a bar diagram. In such a diagram the abscissa represents the variable (in our case, the number of plants per quadrat), and the ordinate represents the frequencies. The important point about such a diagram is that the bars do not touch each other, which indicates that the variable is not continuous. By contrast, the frequency distribution of a continuous variable, such as the femur lengths of aphid stem mothers, is graphed as a histogram. In a histogram the width of each bar along the abscissa represents a class interval of the frequency distribution and the bars touch each other to show that the actual limits of the classes are contiguous. The midpoint of the bar corresponds to the class mark. At the bottom of Box 2.1 are shown histograms of the frequency distribution of the aphid data, ungrouped and grouped. The height of each bar represents the frequency of the corresponding class.

FIGURE 2.3
Frequency polygon. Birth weights of 9465 male infants. Chinese third-class patients in Singapore, 1950 and 1951. Data from Millis and Seng (1954).
To illustrate that histograms are appropriate approximations to the continuous distributions found in nature, we may take a histogram and make the class intervals more narrow, producing more classes. The histogram would then clearly have a closer fit to a continuous distribution. We can continue this process until the class intervals become infinitesimal in width. At this point the histogram becomes the continuous distribution of the variable.
Occasionally the class intervals of a grouped continuous frequency distribution are unequal. For instance, in a frequency distribution of ages we might have more detail on the different ages of young individuals and less accurate identification of the ages of old individuals. In such cases, the class intervals for the older age groups would be wider, those for the younger age groups, narrower. In representing such data, the bars of the histogram are drawn with different widths.
Figure 2.3 shows another graphical mode of representation of a frequency distribution of a continuous variable (in this case, birth weight in infants). As we shall see later, the shapes of distributions seen in such frequency polygons can reveal much about the biological situations affecting the given variable.

2.6 The handling of data

Data must be handled skillfully and expeditiously so that statistics can be practiced successfully. Readers should therefore acquaint themselves with the var-

In this book we ignore "pencil-and-paper" short-cut methods for computations, found in earlier textbooks of statistics, since we assume that the student has access to a calculator or a computer. Some statistical methods are very easy to use because special tables exist that provide answers for standard statistical problems; thus, almost no computation is involved. An example is Finney's table, a 2-by-2 contingency table containing small frequencies that is used for the test of independence (Pearson and Hartley, 1958, Table 38). For small problems, Finney's table can be used in place of Fisher's method of finding exact probabilities, which is very tedious. Other statistical techniques are so easy to carry out that no mechanical aids are needed. Some are inherently simple, such as the sign test (Section 10.3). Other methods are only approximate but can often serve the purpose adequately; for example, we may sometimes substitute an easy-to-evaluate median (defined in Section 3.3) for the mean (described in Sections 3.1 and 3.2), which requires computation.
We can use many new types of equipment to perform statistical computations, many more than we could have when Introduction to Biostatistics was first published. The once-standard electrically driven mechanical desk calculator has completely disappeared. Many new electronic devices, from small pocket calculators to larger desk-top computers, have replaced it. Such devices are so diverse that we will not try to survey the field here. Even if we did, the rate of advance in this area would be so rapid that whatever we might say would soon become obsolete.
We cannot really draw the line between the more sophisticated electronic calculators, on the one hand, and digital computers. There is no abrupt increase in capabilities between the more versatile programmable calculators and the simpler microcomputers, just as there is none as we progress from microcomputers to minicomputers and so on up to the large computers that one associates with the central computation center of a large university or research laboratory. All can perform computations automatically and be controlled by a set of detailed instructions prepared by the user. Most of these devices, including programmable small calculators, are adequate for all of the computations described in this book, even for large sets of data.
The material in this book consists of relatively standard statistical computations that are available in many statistical programs. BIOMstat* is a statistical software package that includes most of the statistical methods covered in this book.
The use of modern data processing procedures has one inherent danger. One can all too easily either feed in erroneous data or choose an inappropriate program. Users must select programs carefully to ensure that those programs perform the desired computations, give numerically reliable results, and are as free from error as possible. When using a program for the first time, one should test it using data from textbooks with which one is familiar. Some programs are notorious because the programmer has failed to guard against excessive rounding errors or other problems. Users of a program should carefully check the data being analyzed so that typing errors are not present. In addition, programs should help users identify and remove bad data values and should provide them with transformations so that they can make sure that their data satisfy the assumptions of various analyses.

* For information or to order, contact Exeter Software. Website: http://www.exetersoftware.com. E-mail: sales@exetersoftware.com. These programs are compatible with Windows XP and Vista.

Exercises

2.1 Round the following numbers to three significant figures: 106.55, 0.06819, 3.0495, 7815.01, 2.9149, and 20.1500. What are the implied limits before and after rounding? Round these same numbers to one decimal place.
ANS. For the first value: 107; 106.545-106.555; 106.5-107.5; 106.6
2.2 Differentiate between the following pairs of terms and give an example of each. (a) Statistical and biological populations. (b) Variate and individual. (c) Accuracy and precision (repeatability). (d) Class interval and class mark. (e) Bar diagram and histogram. (f) Abscissa and ordinate.
2.3 Given 200 measurements ranging from 1.32 to 2.95 mm, how would you group them into a frequency distribution? Give class limits as well as class marks.
2.4 Group the following 40 measurements of interorbital width of a sample of domestic pigeons into a frequency distribution and draw its histogram (data from Olson and Miller, 1958). Measurements are in millimeters.

12.2 12.9 11.8 11.9 11.6 11.1 12.3 12.2 11.8 11.8
10.7 11.5 11.3 11.2 11.6 11.9 13.3 11.2 10.5 11.1
12.1 11.9 10.4 10.7 10.8 11.0 11.9 10.2 10.9 11.6
10.8 11.6 10.4 10.7 12.0 12.4 11.7 11.8 11.3 11.1

2.5 How precisely should you measure the wing length of a species of mosquitoes in a study of geographic variation if the smallest specimen has a length of about 2.8 mm and the largest a length of about 3.5 mm?
2.6 Transform the 40 measurements in Exercise 2.4 into common logarithms (use a table or calculator) and make a frequency distribution of these transformed variates. Comment on the resulting change in the pattern of the frequency distribution from that found before.
2.7 For the data of Tables 2.1 and 2.2 identify the individual observations, samples, populations, and variables.
2.8 Make a stem-and-leaf display of the data given in Exercise 2.4.
2.9 The distribution of ages of striped bass captured by hook and line from the East River and the Hudson River during 1980 was reported as follows (Young, 1981):

Age    f

1      13
2      49
3      96
4      28
5      16
6       8

Show this distribution in the form of a bar diagram.
CHAPTER 3

Descriptive Statistics

An early and fundamental stage in any science is the descriptive stage. Until phenomena can be accurately described, an analysis of their causes is premature. The question "What?" comes before "How?" Unless we know something about the usual distribution of the sugar content of blood in a population of guinea pigs, as well as its fluctuations from day to day and within days, we shall be unable to ascertain the effect of a given dose of a drug upon this variable. In a sizable sample it would be tedious to obtain our knowledge of the material by contemplating each individual observation. We need some form of summary to permit us to deal with the data in manageable form, as well as to be able to share our findings with others in scientific talks and publications. A histogram or bar diagram of the frequency distribution would be one type of summary. However, for most purposes, a numerical summary is needed to describe concisely, yet accurately, the properties of the observed frequency distribution. Quantities providing such a summary are called descriptive statistics. This chapter will introduce you to some of them and show how they are computed.
Two kinds of descriptive statistics will be discussed in this chapter: statistics of location and statistics of dispersion. The statistics of location (also known as measures of central tendency) describe the position of a sample along a given dimension representing a variable. For example, after we measure the length of the animals within a sample, we will then want to know whether the animals are closer, say, to 2 cm or to 20 cm. To express a representative value for the sample of observations (for the length of the animals), we use a statistic of location. But statistics of location will not describe the shape of a frequency distribution. The shape may be long or very narrow, may be humped or U-shaped, may contain two humps, or may be markedly asymmetrical. Quantitative measures of such aspects of frequency distributions are required. To this end we need to define and study the statistics of dispersion.
The arithmetic mean, described in Section 3.1, is undoubtedly the most important single statistic of location, but others (the geometric mean, the harmonic mean, the median, and the mode) are briefly mentioned in Sections 3.2, 3.3, and 3.4. A simple statistic of dispersion (the range) is briefly noted in Section 3.5, and the standard deviation, the most common statistic for describing dispersion, is explained in Section 3.6. Our first encounter with contrasts between sample statistics and population parameters occurs in Section 3.7, in connection with statistics of location and dispersion. In Section 3.8 there is a description of practical methods for computing the mean and standard deviation. The coefficient of variation (a statistic that permits us to compare the relative amount of dispersion in different samples) is explained in the last section (Section 3.9).
The techniques that will be at your disposal after you have mastered this chapter will not be very powerful in solving biological problems, but they will be indispensable tools for any further work in biostatistics. Other descriptive statistics, of both location and dispersion, will be taken up in later chapters.
An important note: We shall first encounter the use of logarithms in this chapter. To avoid confusion, common logarithms have been consistently abbreviated as log, and natural logarithms as ln. Thus, $\log x$ means $\log_{10} x$ and $\ln x$ means $\log_e x$.
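Readers who compute with software should check which logarithm a given function returns. As a brief illustration (the choice of Python and of the value 100 is an assumption made for this example), the standard `math` module provides both:

```python
import math

x = 100.0

common_log = math.log10(x)  # "log x" in the notation of this book (base 10)
natural_log = math.log(x)   # "ln x" in the notation of this book (base e)

print("log x =", common_log)
print("ln x  =", natural_log)

# The two differ only by a constant factor: ln x = (log x)(ln 10).
print(math.isclose(natural_log, common_log * math.log(10)))
```

Note that `math.log` with a single argument is the natural logarithm, which differs from the convention of this book, where the unadorned "log" denotes the common logarithm.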

3.1 The arithmetic mean

The most common statistic of location is familiar to everyone. It is the arithmetic mean, commonly called the mean or average. The mean is calculated by summing all the individual observations or items of a sample and dividing this sum by the number of items in the sample. For instance, as the result of a gas analysis in a respirometer an investigator obtains the following four readings of oxygen percentages and sums them:

       14.9
       10.8
       12.3
       23.3
Sum =  61.3
The investigator calculates the mean oxygen percentage as the sum of the four items divided by the number of items. Thus the average oxygen percentage is

Mean = 61.3/4 = 15.325%
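The same computation takes only a few lines on a computer. The sketch below uses the four oxygen readings given above; the variable names are chosen for this illustration only.

```python
from math import fsum

# Four readings of oxygen percentage from the respirometer example.
readings = [14.9, 10.8, 12.3, 23.3]

total = fsum(readings)         # sum of the items
n = len(readings)              # number of items in the sample
mean = total / n               # arithmetic mean

print(f"Sum  = {total:.1f}")
print(f"Mean = {mean:.3f}%")
```

Here `fsum` is used instead of the built-in `sum` merely to limit floating-point rounding error; for four readings either would do.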

Calculating a mean presents us with the opportunity for learning statistical symbolism. We have already seen (Section 2.2) that an individual observation is symbolized by $Y_i$, which stands for the $i$th observation in the sample. Four observations could be written symbolically as follows:

$$Y_1,\ Y_2,\ Y_3,\ Y_4$$

We shall define $n$, the sample size, as the number of items in a sample. In this particular instance, the sample size $n$ is 4. Thus, in a large sample, we can symbolize the array from the first to the $n$th item as follows:

$$Y_1,\ Y_2,\ \ldots,\ Y_n$$

When we wish to sum items, we use the following notation:

$$\sum_{i=1}^{i=n} Y_i = Y_1 + Y_2 + \cdots + Y_n$$

The capital Greek sigma, $\Sigma$, simply means the sum of the items indicated. The $i = 1$ means that the items should be summed, starting with the first one and ending with the $n$th one, as indicated by the $i = n$ above the $\Sigma$. The subscript and superscript are necessary to indicate how many items should be summed. The "$i =$" in the superscript is usually omitted as superfluous. For instance, if we had wished to sum only the first three items, we would have written $\sum_{i=1}^{3} Y_i$. On the other hand, had we wished to sum all of them except the first one, we would have written $\sum_{i=2}^{n} Y_i$. With some exceptions (which will appear in later chapters), it is desirable to omit subscripts and superscripts, which generally add to the apparent complexity of the formula and, when they are unnecessary, distract the student's attention from the important relations expressed by the formula. Below are seen increasing simplifications of the complete summation notation shown at the extreme left:

$$\sum_{i=1}^{i=n} Y_i = \sum_{i=1}^{n} Y_i = \sum_{i} Y_i = \sum^{n} Y = \sum Y$$

T h e third symbol might be interpreted as meaning, " S u m the Y t 's over all
available values of /." This is a frequently used n o t a t i o n , a l t h o u g h we shall
not employ it in this b o o k . T h e next, with η as a superscript, tells us to sum η
items of V; note (hat the i subscript of the Y has been d r o p p e d as unneces-
sary. Finally, the simplest n o t a t i o n is s h o w n at the right. It merely says sum
the Vs. This will be the form we shall use most frequently: if a s u m m a t i o n sign
precedes a variable, the s u m m a t i o n will be u n d e r s t o o d to be over η items (all
the items in the sample) unless subscripts or superscripts specifically tell us
otherwise.
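The summation conventions above can be sketched in a few lines of Python. The four values used here are an assumption: they are taken to be the oxygen percentages of Section 3.1 (not reproduced in this passage) and are consistent with the mean and range quoted later in the chapter.

```python
# Four observations Y_1, ..., Y_4 (assumed values, see the note above).
Y = [14.9, 10.8, 12.3, 23.3]
n = len(Y)                     # sample size n

sum_all = sum(Y)               # sum over i = 1..n, written simply as "sum of Y"
sum_first_three = sum(Y[:3])   # sum over i = 1..3
sum_all_but_first = sum(Y[1:]) # sum over i = 2..n

print(n, sum_all, sum_first_three, sum_all_but_first)
```

Python indexes from 0, so the item written $Y_1$ in the text is `Y[0]` in the sketch.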

We shall use the symbol $\bar{Y}$ for the arithmetic mean of the variable Y. Its
formula is written as follows:

$$\bar{Y} = \frac{\sum Y}{n} = \frac{1}{n}\sum Y \qquad (3.1)$$

This formula tells us, "Sum all the (n) items and divide the sum by n."
The mean of a sample is the center of gravity of the observations in the sample.
If you were to draw a histogram of an observed frequency distribution on a
sheet of cardboard and then cut out the histogram and lay it flat against a
blackboard, supporting it with a pencil beneath, chances are that it would be
out of balance, toppling to either the left or the right. If you moved the
supporting pencil point to a position about which the histogram would exactly
balance, this point of balance would correspond to the arithmetic mean.
We often must compute averages of means or of other statistics that may
differ in their reliabilities because they are based on different sample sizes. At
other times we may wish the individual items to be averaged to have different
weights or amounts of influence. In all such cases we compute a weighted
average. A general formula for calculating the weighted average of a set of
values $Y_i$ is as follows:

$$\bar{Y}_w = \frac{\sum^{n} w_i Y_i}{\sum^{n} w_i} \qquad (3.2)$$

where n variates, each weighted by a factor $w_i$, are being averaged. The values
of $Y_i$ in such cases are unlikely to represent variates. They are more likely to
be sample means $\bar{Y}_i$ or some other statistics of different reliabilities.
The simplest case in which this arises is when the $Y_i$ are not individual
variates but are means. Thus, if the following three means are based on differing
sample sizes, as shown,

    Mean      n_i

    3.85      12
    5.21      25
    4.70       8

their weighted average will be

$$\bar{Y}_w = \frac{(12)(3.85) + (25)(5.21) + (8)(4.70)}{12 + 25 + 8} = \frac{214.05}{45} = 4.76$$

Note that in this example, computation of the weighted mean is exactly
equivalent to adding up all the original measurements and dividing the sum by the
total number of the measurements. Thus, the sample with 25 observations,
having the highest mean, will influence the weighted average in proportion to
its size.
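As a check on Expression (3.2), the worked example can be reproduced with a short sketch. The function name `weighted_mean` is our own choice for illustration, not part of any library.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w_i * Y_i) / sum(w_i), as in Expression (3.2)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * y for w, y in zip(weights, values)) / total_weight


means = [3.85, 5.21, 4.70]   # the three sample means from the text
sizes = [12, 25, 8]          # their sample sizes, used as weights

weighted = weighted_mean(means, sizes)
unweighted = sum(means) / len(means)   # ignores the sample sizes

print(round(weighted, 2), round(unweighted, 2))
```

The unweighted average of the three means is lower than the weighted one, because it gives the large sample with the highest mean no more influence than the small samples.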

3.2 Other means

We shall see in Chapters 10 and 11 that variables are sometimes transformed
into their logarithms or reciprocals. If we calculate the means of such
transformed variables and then change the means back into the original scale, these
means will not be the same as if we had computed the arithmetic means of the
original variables. The resulting means have received special names in statistics.
The back-transformed mean of the logarithmically transformed variables is
called the geometric mean. It is computed as follows:

$$GM_Y = \operatorname{antilog} \frac{1}{n} \sum \log Y \qquad (3.3)$$

which indicates that the geometric mean $GM_Y$ is the antilogarithm of the mean
of the logarithms of variable Y. Since addition of logarithms is equivalent to
multiplication of their antilogarithms, there is another way of representing this
quantity; it is

$$GM_Y = \sqrt[n]{Y_1 Y_2 Y_3 \cdots Y_n} \qquad (3.4)$$

The geometric mean permits us to become familiar with another operator
symbol: capital pi, $\Pi$, which may be read as "product." Just as $\Sigma$ symbolizes
summation of the items that follow it, so $\Pi$ symbolizes the multiplication of
the items that follow it. The subscripts and superscripts have exactly the same
meaning as in the summation case. Thus, Expression (3.4) for the geometric
mean can be rewritten more compactly as follows:

$$GM_Y = \sqrt[n]{\prod_{i=1}^{n} Y_i} \qquad (3.4a)$$

The computation of the geometric mean by Expression (3.4a) is quite tedious.
In practice, the geometric mean has to be computed by transforming the variates
into logarithms.
The reciprocal of the arithmetic mean of reciprocals is called the harmonic
mean. If we symbolize it by $H_Y$, the formula for the harmonic mean can be
written in concise form (without subscripts and superscripts) as

$$\frac{1}{H_Y} = \frac{1}{n} \sum \frac{1}{Y} \qquad (3.5)$$

You may wish to convince yourself that the geometric mean and the harmonic
mean of the four oxygen percentages are 14.65% and 14.09%, respectively.
Unless the individual items do not vary, the geometric mean is always less than
the arithmetic mean, and the harmonic mean is always less than the geometric
mean.
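Readers who prefer to let a computer do the convincing may find the following sketch useful. It follows Expressions (3.3) and (3.5) directly. The four oxygen percentages are assumed to be the values listed in Section 3.1, which is not reproduced in this passage; the helper names are our own.

```python
import math


def geometric_mean(values):
    """Antilog of the mean of the logarithms, as in Expression (3.3)."""
    return math.exp(sum(math.log(y) for y in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals, as in Expression (3.5)."""
    return len(values) / sum(1.0 / y for y in values)


oxygen = [14.9, 10.8, 12.3, 23.3]   # assumed oxygen percentages (Section 3.1)

arithmetic = sum(oxygen) / len(oxygen)
geometric = geometric_mean(oxygen)
harmonic = harmonic_mean(oxygen)

print(round(arithmetic, 3), round(geometric, 2), round(harmonic, 2))
```

Natural logarithms are used here; any base gives the same geometric mean as long as the antilogarithm is taken in the same base. Both helpers assume strictly positive values.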
Some beginners in statistics have difficulty in accepting the fact that
measures of location or central tendency other than the arithmetic mean are
permissible or even desirable. They feel that the arithmetic mean is the "logical"
average, and that any other mean would be a distortion. This whole problem
relates to the proper scale of measurement for representing data; this scale is
not always the linear scale familiar to everyone, but is sometimes by preference
a logarithmic or reciprocal scale. If you have doubts about this question, we
shall try to allay them in Chapter 10, where we discuss the reasons for
transforming variables.

3.3 The median

The median M is a statistic of location occasionally useful in biological research.
It is defined as that value of the variable (in an ordered array) that has an equal
number of items on either side of it. Thus, the median divides a frequency
distribution into two halves. In the following sample of five measurements,

14, 15, 16, 19, 23

M = 16, since the third observation has an equal number of observations on
both sides of it. We can visualize the median easily if we think of an array
from largest to smallest, for example, a row of men lined up by their heights.
The median individual will then be that man having an equal number of men
on his right and left sides. His height will be the median height of the sample
considered. This quantity is easily evaluated from a sample array with
an odd number of individuals. When the number in the sample is even, the
median is conventionally calculated as the midpoint between the (n/2)th and
the [(n/2) + 1]th variate. Thus, for the sample of four measurements

14, 15, 16, 19

the median would be the midpoint between the second and third items, or 15.5.
Whenever any one value of a variate occurs more than once, problems may
develop in locating the median. Computation of the median item becomes more
involved because all the members of a given class in which the median item is
located will have the same class mark. The median then is the (n/2)th variate
in the frequency distribution. It is usually computed as that point between the
class limits of the median class where the median individual would be located
(assuming the individuals in the class were evenly distributed).
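The rule for ungrouped samples (middle item for odd n, midpoint of the two middle items for even n) can be written as a minimal sketch. It does not attempt the interpolation within a median class described for grouped data.

```python
def median(values):
    """Median of an ungrouped sample, following the odd/even rule in the text."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle item has equally many items on each side.
        return ordered[middle]
    # Even n: midpoint between the (n/2)th and [(n/2) + 1]th items.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([14, 15, 16, 19, 23]))   # sample of five measurements
print(median([14, 15, 16, 19]))       # sample of four measurements
```

Python's standard library offers the same computation as `statistics.median`.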
The median is just one of a family of statistics dividing a frequency
distribution into equal areas. It divides the distribution into two halves. The three
quartiles cut the distribution at the 25, 50, and 75% points, that is, at points
dividing the distribution into first, second, third, and fourth quarters by area
(and frequencies). The second quartile is, of course, the median. (There are also
quintiles, deciles, and percentiles, dividing the distribution into 5, 10, and 100
equal portions, respectively.)
Medians are most often used for distributions that do not conform to the
standard probability models, so that nonparametric methods (see Chapter 10)
must be used. Sometimes the median is a more representative measure of
location than the arithmetic mean. Such instances almost always involve asymmetric
distributions. An often quoted example from economics would be a suitable
measure of location for the "typical" salary of an employee of a corporation.
The very high salaries of the few senior executives would shift the arithmetic
mean, the center of gravity, toward a completely unrepresentative value. The
median, on the other hand, would be little affected by a few high salaries; it
would give the particular point on the salary scale above which lie 50% of the
salaries in the corporation, the other half being lower than this figure.
In biology an example of the preferred application of a median over the
arithmetic mean may be in populations showing skewed distribution, such as
weights. Thus a median weight of American males 50 years old may be a more
meaningful statistic than the average weight. The median is also of importance
in cases where it may be difficult or impossible to obtain and measure all the
items of a sample. For example, suppose an animal behaviorist is studying
the time it takes for a sample of animals to perform a certain behavioral step.
The variable he is measuring is the time from the beginning of the experiment
until each individual has performed. What he wants to obtain is an average
time of performance. Such an average time, however, can be calculated only
after records have been obtained on all the individuals. It may take a long time
for the slowest animals to complete their performance, longer than the observer
wishes to spend. (Some of them may never respond appropriately, making the
computation of a mean impossible.) Therefore, a convenient statistic of location
to describe these animals may be the median time of performance. Thus, so
long as the observer knows what the total sample size is, he need not have
measurements for the right-hand tail of his distribution. Similar examples would
be the responses to a drug or poison in a group of individuals (the median
lethal or effective dose, $LD_{50}$ or $ED_{50}$) or the median time for a mutation to
appear in a number of lines of a species.
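The behaviorist's situation can be sketched as follows. The function name, the response times, and the sample sizes are all hypothetical and serve only as illustration. The sketch assumes that every unrecorded animal would have taken longer than the slowest recorded one.

```python
def median_with_unobserved_tail(observed, n_total):
    """Median when only the fastest responses were recorded.

    Assumes the remaining n_total - len(observed) items all exceed
    max(observed), so they belong to the right-hand tail.
    """
    ordered = sorted(observed)
    needed = n_total // 2 + 1   # smallest items required to locate the median
    if len(ordered) < needed:
        raise ValueError("too few observations to locate the median")
    middle = n_total // 2
    if n_total % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical times (in seconds) for the first six of ten animals to respond.
times = [12, 15, 19, 22, 30, 41]
print(median_with_unobserved_tail(times, n_total=10))
```

A mean could not be computed from these six records, since the four missing times are unknown; the median needs only the total sample size and slightly more than half of the ordered observations.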

3.4 The mode

The mode refers to the value represented by the greatest number of individuals.
When seen on a frequency distribution, the mode is the value of the variable
at which the curve peaks. In grouped frequency distributions the mode as a
point has little meaning. It usually suffices to identify the modal class. In biology,
the mode does not have many applications.
Distributions having two peaks (equal or unequal in height) are called
bimodal; those with more than two peaks are multimodal. In those rare
distributions that are U-shaped, we refer to the low point at the middle of the
distribution as an antimode.
In evaluating the relative merits of the arithmetic mean, the median, and
the mode, a number of considerations have to be kept in mind. The mean is
generally preferred in statistics, since it has a smaller standard error than other
statistics of location (see Section 6.2), it is easier to work with mathematically,
and it has an additional desirable property (explained in Section 6.1): it will
tend to be normally distributed even if the original data are not. The mean is
markedly affected by outlying observations; the median and mode are not. The
mean is generally more sensitive to changes in the shape of a frequency
distribution, and if it is desired to have a statistic reflecting such changes, the mean
may be preferred.

FIGURE 3.1
An asymmetrical frequency distribution (skewed to the right) showing location of the mean, median,
and mode. Percent butterfat in 120 samples of milk (from a Canadian cattle breeders' record book).

In symmetrical, unimodal distributions the mean, the median, and the mode
are all identical. A prime example of this is the well-known normal distribution
of Chapter 5. In a typical asymmetrical distribution, such as the one shown in
Figure 3.1, the relative positions of the mode, median, and mean are generally
these: the mean is closest to the drawn-out tail of the distribution, the mode is
farthest, and the median is between these. An easy way to remember this
sequence is to recall that they occur in alphabetical order from the longer tail of
the distribution.

3.5 The range

We now turn to measures of dispersion. Figure 3.2 demonstrates that radically
different-looking distributions may possess the identical arithmetic mean.

FIGURE 3.2
Three frequency distributions having identical means and sample sizes but differing in dispersion
pattern.

One simple measure of dispersion is the range, which is defined as the
difference between the largest and the smallest items in a sample. Thus, the range
of the four oxygen percentages listed earlier (Section 3.1) is

Range = 23.3 - 10.8 = 12.5%

and the range of the aphid femur lengths (Box 2.1) is

Range = 4.7 - 3.3 = 1.4 units of 0.1 mm

Since the range is a measure of the span of the variates along the scale of the
variable, it is in the same units as the original measurements. The range is
clearly affected by even a single outlying value and for this reason is only a
rough estimate of the dispersion of all the items in the sample.
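The sensitivity of the range to a single outlying value is easy to see in a small sketch. The oxygen percentages are assumed to be those of Section 3.1 (not reproduced in this passage); the added outlier of 60.0 is purely hypothetical.

```python
def sample_range(values):
    """Range: largest item minus smallest item, in the units of the data."""
    return max(values) - min(values)


oxygen = [14.9, 10.8, 12.3, 23.3]   # assumed oxygen percentages (Section 3.1)
with_outlier = oxygen + [60.0]      # one hypothetical outlying reading

print(round(sample_range(oxygen), 1))
print(round(sample_range(with_outlier), 1))
```

One extra reading roughly quadruples the range, although the other four items are unchanged.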

3.6 The standard deviation

We desire that a measure of dispersion take all items of a distribution into
consideration, weighting each item by its distance from the center of the
distribution. We shall now try to construct such a statistic. In Table 3.1 we show a
sample of 15 blood neutrophil counts from patients with tumors. Column (1)
shows the variates in the order in which they were reported. The computation
of the mean is shown below the table. The mean neutrophil count turns out to
be 7.713.
The distance of each variate from the mean is computed as the following
deviation:

$$y = Y - \bar{Y}$$

Each individual deviation, or deviate, is by convention computed as the
individual observation minus the mean, $Y - \bar{Y}$, rather than the reverse, $\bar{Y} - Y$.
Deviates are symbolized by lowercase letters corresponding to the capital letters
of the variables. Column (2) in Table 3.1 gives the deviates computed in this
manner.
TABLE 3.1
The standard deviation. Long method, not recommended for
hand or calculator computations but shown here to
illustrate the meaning of the standard deviation. The data are
blood neutrophil counts (divided by 1000) per microliter, in
15 patients with nonhematological tumors.

          (1)        (2)           (3)
           Y      y = Y - Ȳ        y²

          4.9      -2.81         7.9148
          4.6      -3.11         9.6928
          5.5      -2.21         4.8988
          9.1       1.39         1.9228
         16.3       8.59        73.7308
         12.7       4.99        24.8668
          6.4      -1.31         1.7248
          7.1      -0.61         0.3762
          2.3      -5.41        29.3042
          3.6      -4.11        16.9195
         18.0      10.29       105.8155
          3.7      -4.01        16.1068
          7.3      -0.41         0.1708
          4.4      -3.31        10.9782
          9.8       2.09         4.3542
Total   115.7       0.05       308.7770

$$\text{Mean } \bar{Y} = \frac{\sum Y}{n} = \frac{115.7}{15} = 7.713$$

We now wish to calculate an average deviation that will sum all the deviates
and divide them by the number of deviates in the sample. But note that when
we sum our deviates, negative and positive deviates cancel out, as is shown
by the sum at the bottom of column (2); this sum appears to be unequal to
zero only because of a rounding error. Deviations from the arithmetic mean
always sum to zero because the mean is the center of gravity. Consequently,
an average based on the sum of deviations would also always equal zero. You
are urged to study Appendix A1.1, which demonstrates that the sum of deviations
around the mean of a sample is equal to zero.
Squaring the deviates gives us column (3) of Table 3.1 and enables us to
reach a result other than zero. (Squaring the deviates also holds other
mathematical advantages, which we shall take up in Sections 7.5 and 11.3.) The sum
of the squared deviates (in this case, 308.7770) is a very important quantity in
statistics. It is called the sum of squares and is identified symbolically as $\sum y^2$.
Another common symbol for the sum of squares is SS.
The next step is to obtain the average of the n squared deviations. The
resulting quantity is known as the variance, or the mean square:

$$\text{Variance} = \frac{\sum y^2}{n} = \frac{308.7770}{15} = 20.5851$$

The variance is a measure of fundamental importance in statistics, and we
shall employ it throughout this book. At the moment, we need only remember
that because of the squaring of the deviations, the variance is expressed in
squared units. To undo the effect of the squaring, we now take the positive
square root of the variance and obtain the standard deviation:

$$\text{Standard deviation} = +\sqrt{\frac{\sum y^2}{n}} = \sqrt{20.5851} = 4.5371$$

Thus, standard deviation is again expressed in the original units of
measurement, since it is a square root of the squared units of the variance.
An important note: The technique just learned and illustrated in Table 3.1
is not the simplest for direct computation of a variance and standard deviation.
However, it is often used in computer programs, where accuracy of
computations is an important consideration. Alternative and simpler computational
methods are given in Section 3.8.
The observant reader may have noticed that we have avoided assigning
any symbol to either the variance or the standard deviation. We shall explain
why in the next section.
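The long method of Table 3.1 translates almost line for line into a short program. The sketch below uses the 15 neutrophil counts from column (1); the variable names are our own, and the division is by n, as in this section.

```python
import math

# Column (1) of Table 3.1: neutrophil counts (divided by 1000) per microliter.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

n = len(counts)
mean = sum(counts) / n                        # arithmetic mean

deviates = [y - mean for y in counts]         # column (2): y = Y - mean
squared = [d * d for d in deviates]           # column (3): y squared

sum_of_deviates = sum(deviates)               # zero apart from rounding
sum_of_squares = sum(squared)                 # SS
variance = sum_of_squares / n                 # mean square, dividing by n
std_dev = math.sqrt(variance)                 # back to the original units

print(round(mean, 3), round(sum_of_squares, 3),
      round(variance, 4), round(std_dev, 3))
```

Because the program carries the mean at full precision rather than rounded to two decimals, the deviates sum to zero within floating-point error instead of the 0.05 shown at the foot of column (2).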

3.7 Sample statistics and parameters

Up to now we have calculated statistics from samples without giving too much
thought to what these statistics represent. When correctly calculated, a mean
and standard deviation will always be absolutely true measures of location and
dispersion for the samples on which they are based. Thus, the true mean of the
four oxygen percentage readings in Section 3.1 is 15.325%. The standard
deviation of the 15 neutrophil counts is 4.537. However, only rarely in biology (or
in statistics in general) are we interested in measures of location and dispersion
only as descriptive summaries of the samples we have studied. Almost always we
are interested in the populations from which the samples have been taken. What
we want to know is not the mean of the particular four oxygen percentages,
but rather the true oxygen percentage of the universe of readings from which
the four readings have been sampled. Similarly, we would like to know the true
mean neutrophil count of the population of patients with nonhematological
tumors, not merely the mean of the 15 individuals measured. When studying
dispersion we generally wish to learn the true standard deviations of the
populations and not those of the samples. These population statistics, however, are
unknown and (generally speaking) are unknowable. Who would be able to
collect all the patients with this particular disease and measure their neutrophil
counts? Thus we need to use sample statistics as estimators of population
statistics or parameters.
It is conventional in statistics to use Greek letters for population parameters
and Roman letters for sample statistics. Thus, the sample mean $\bar{Y}$ estimates $\mu$,
the parametric mean of the population. Similarly, a sample variance, symbolized
by $s^2$, estimates a parametric variance, symbolized by $\sigma^2$. Such estimators should
be unbiased. By this we mean that samples (regardless of the sample size) taken
from a population with a known parameter should give sample statistics that,
when averaged, will give the parametric value. An estimator that does not do
so is called biased.
The sample mean $\bar{Y}$ is an unbiased estimator of the parametric mean $\mu$.
However, the sample variance as computed in Section 3.6 is not unbiased. On
the average, it will underestimate the magnitude of the population variance $\sigma^2$.
To overcome this bias, mathematical statisticians have shown that when sums
of squares are divided by n - 1 rather than by n the resulting sample variances
will be unbiased estimators of the population variance. For this reason, it is
customary to compute variances by dividing the sum of squares by n - 1. The
formula for the standard deviation is therefore customarily given as follows:

$$s = +\sqrt{\frac{\sum y^2}{n - 1}} \qquad (3.6)$$

In the neutrophil-count data the standard deviation would thus be computed as

$$s = \sqrt{\frac{308.7770}{14}} = \sqrt{22.0555} = 4.696$$

We note that this value is slightly larger than our previous estimate of 4.537.
Of course, the greater the sample size, the less difference there will be between
division by n and by n - 1. However, regardless of sample size, it is good
practice to divide a sum of squares by n - 1 when computing a variance or
standard deviation. It may be assumed that when the symbol $s^2$ is encountered,
it refers to a variance obtained by division of the sum of squares by the degrees
of freedom, as the quantity n - 1 is generally referred to.
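The two divisors can be compared directly with Python's standard `statistics` module, in which `pstdev` divides the sum of squares by n and `stdev` divides it by n - 1. This is offered as a convenience check on the neutrophil counts of Table 3.1, not as the book's own computational method.

```python
import statistics

# Column (1) of Table 3.1: neutrophil counts (divided by 1000) per microliter.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

sd_divide_by_n = statistics.pstdev(counts)        # sum of squares / n
sd_divide_by_n_minus_1 = statistics.stdev(counts) # sum of squares / (n - 1)

print(round(sd_divide_by_n, 3), round(sd_divide_by_n_minus_1, 3))
```

With n = 15 the two results differ by about 3.5%; the gap shrinks as n grows, since the ratio of the two standard deviations is the square root of n / (n - 1).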
Division of the sum of squares by n is appropriate only when the interest
of the investigator is limited to the sample at hand and to its variance and
The Project Gutenberg eBook of Suuria pyrkimyksiä

This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.

Title: Suuria pyrkimyksiä

Author: Juho Hoikkanen

Release date: July 8, 2024 [eBook #73989]

Language: Finnish

Original publication: Hämeenlinna: Arvi A. Karisto Oy, 1915

Credits: Juhani Kärkkäinen and Tapio Riikonen

*** START OF THE PROJECT GUTENBERG EBOOK SUURIA PYRKIMYKSIÄ ***

SUURIA PYRKIMYKSIÄ
(Great Aspirations)

By

Juho Hoikkanen

Hämeenlinna, Arvi A. Karisto Oy, 1915.

I

"Well then…" Heikki broke the solemn silence that had reigned since
his father breathed his last breath to this world, in the bed in the
corner by the oven.

"Well then," he repeated, gave a dry cough, and went to fetch from
beside the drying barn a wide board that he had dropped a few days
earlier from the roof of the barn's lean-to, as if to wait for this moment.
When he returned, he stood it upright against the oven and said to his
mother and the others in the cabin:

"Will you wash the body, or…?"

The only answer was his mother's quiet sobbing.

"Or must a washer be fetched from the village?"

Quiet sobbing.

"Or shall the old man be taken to the barn grime and all?"

The tears in the mother's eyes were drawn back into their springs, and,
straightening her back, the old woman said:

"Heikki, your heart hardens as fast as your lungs soften. Look
there," and she pointed to the deathbed, "your father's eyes stayed
open: that means one of us will soon follow the same road."

Heikki gave a dry cough, as if to remark that by "one of us" his mother
probably meant him in particular. He said nothing of it, though, and
merely remarked:

"My lungs are my own, I suppose, and so is my heart. Hard or soft,
that should be no concern of Mother's, nor of anyone else's. Besides,
from this day on my place is at the head of the table, and Mother
would do well to keep that firmly in mind."

"At the head of the table?… Very well. Though you yourself have done
very little to earn that position. But a table has two ends, and at the
other end I intend to sit, as I have sat ever since smoke first rose from
the Töyrylä croft."

The tears welled up once more.

"That he should have to hear such things…

"The dead hear nothing," Heikki interrupted.

"… he, who built this cabin here in the pine forest, and hoed the
fields and cleared them of stones and wrenched out the stumps… and
cleared and ditched meadows out of the cold wilderness, without a
night's sleep… with those hands of his, which now lie powerless and…
cold…"

"That is enough; weeping won't mend matters. There is the board. You
do what has to be done here; I will do what falls to me."

Having said this, Heikki went to the main farm and told the master that
since his father had now at last died and the reins of government had
passed into his hands, it would surely be right and reasonable that he
be allowed to put his feet under his own table officially as well. So
would the master be so good as to draw up the papers, or contracts,
letter for letter the same as the old papers, except that these new ones
should bear his name and mark, that of Heikki Eerikinpoika.

The master began with a little speech on how everything here is
transient; we have no lasting dwelling places here, we are only seeking
those to come. He said that the tough old stumps were falling one after
another, but that they fell on their own land, as the pines of the North
fall at their roots. The roots of the present generation, however, sat
rather loosely in the soil of the fatherland; even a small gust of wind
would tear them out.

"As for this matter of the Töyrylä croft contract in particular, it is
a little awkward," the master reflected. "The deceased has now passed on
to a better life, and with him the contract too has…"

"Passed on to a better life," Heikki cut in.

"I would hope that Heikki will not joke about a serious matter, nor,
this time, about me. What I mean is: the contract died with the
deceased, for it was made only for his lifetime. And I have been half
thinking of taking the croft into my own hands, now that…"

"You!"

"Now, now… for reasonable compensation, of course."

The master sank into thought and asked, as if in passing:

"They say Heikki has been to the doctor. What did the doctor make of
the state of Heikki's lungs?"

"What did he make of it! After tapping and listening both front and
back, shaking his bald head all the while, he finally sat down in his
chair, slapped his palms on his knees, and stared for a while at his
listening tube on the table, just as I am staring now at that empty
glass." (The master hastened to fill the glass again.) "Ahh… thanks…
At last he asks: Do you have capital, a fairly generous amount, say?
Nothing to boast of, Doctor, I answered, but there is always enough
that the dogs don't come wetting my shins. The doctor laughed a little
then and said that since I had no capital, he would prescribe only a
few drops of cognac, to be taken in the mornings with warm milk. But I
told the doctor that such things are children's prescriptions. When a
grown man opens a bottle, he takes a dram, two, even three in one go,
and no milk or syrup is needed for that."

"Cognac, was it… And did Heikki say the doctor shook his head?"

"Aye… and his head was as clean and smooth as a trimmed turnip… or
like the si-side of that gl-glass," answered Heikki, by now slurring a
good deal.

Taking the hint somewhat reluctantly, the master asked again:

"How was it now, did Heikki say just now that the dead man's eyes
stayed open?"

"Why, w-wide open. Mother even said that o-one of us will follow soon,
but I-I think it is M-Mother's turn now."

The master drummed his fingers on the table, evidently satisfied, and,
things being so, agreed to draw up a lease in the name of Heikki
Eerikinpoika, though with the express addition that on his death the
croft would pass to the farm, should the master so wish, with the heirs
having no right to compensation.

Having put the contract in his pocket, Heikki mumbled:

"So now I get to put my feet under my own t-table quite officially."

"Heikki had better just take care that the whole man doesn't slide
under the table."

"Y-yes."
II

While Heikki was busy securing the croft contract for himself, his
mother was carrying out her heavy duty at home, helped by a few of the
dead man's old friends.

When the body had been washed, it was dressed in coarse tow-linen grave
clothes. Gray woolen socks were drawn onto the feet and mittens onto the
hands, and the mittens were sewn together so that the hands stayed
crossed over the chest. Then the dead man was lifted onto the board, and
a white shroud was spread over him.

Although the dead man was ready to depart, they put off, as if out of
pity, carrying him from the warm cabin to the cold drying barn just yet.
The guests sit silent on the cabin benches. No tear glistens in anyone's
eye, not the slightest tremor of feeling shows on a single face. With
their deep furrows the faces are as still as the surface of a lake
frozen in mid-wave, which no wind can stir. One single stern thought
seems to have turned to stone in them: Today he went. Whose turn is it
next, mine or yours, today or tomorrow?

"There is a traveler ready for the road," comes a woman's trembling
voice from the corner by the oven, as if shy of breaking the holy
silence.

"A man bound for a long journey," another adds after a moment.

"No one comes back from that journey by tomorrow."

"No, they do not."

"No."

"My own husband is in America," continues the one who began the
conversation; "he has been for fifteen years now. No voice carries that
far either, but I hope to see him with these eyes yet."

"The eyes of the living can always see, if your old man is still alive,
that is… he hasn't written, I suppose?"

"He has not written, nor sent a single penny in fifteen years. But
others write that he has been seen there in California."

"They say the fields there are fertile and a living is easy to come by;
whether it is true, I do not know. Here the work is hard and bread lies
behind much sweat and heat. This dead man, too, never ate an idler's
bread."

"He did not."

"No."

"How hard was this last struggle of his, I wonder?"

"Life does not depart with trinkets, nor does death travel with sleigh
bells."

"He is a blind guest, death, who often passes by the open gates of his
friends and arrives where he is least expected and least wanted."

"Yet one day he knocks at every one of our doors, and by no means too
early here, this time."

"Not too early."

"Eero surely knew the crossroads of this world by now."

"That he did."

"How well did he know the crossroads of the world to come, I wonder,
the road to the right and the road to the left."

No one could find any answer to this last remark.

At last the widow broke the oppressive silence:

"Eero did not know, sowing hemp in the spring, that he was sowing his last hemp; nor in the autumn, breaking it in the barn, that he was breaking his own grave linen. I did not know, treading the spinning wheel, that I was spinning thread for the land of the dead; nor did I guess, throwing the shuttle, that I was weaving cloth for the underworld."

"That's as much as a person knows."

"That much."

"And that is what Eero kept insisting on while he lay sick: don't you, Henna, buy shop linen for my grave clothes; it is strange and cold. Sew the clothes from the fibre of our own field, knit the socks from the wool of our own sheep. In those it will be warm for me to rest and more pleasant to sleep."
*****

The dead man has been taken to the threshing barn, trembling voices have sung the funeral hymn, and the guests have left in silence, heads bowed. The widow sits alone in the cottage, looking about her as if in a strange place. On the table there still stand the blue-striped coffee cups and the wooden sugar bowl, the late Eero's handiwork; at this time of winter there has not been milk even to drip into the coffee. Looking chilled, the coffee pot broods in the hearth over black coals that do not warm its copper belly. On nails along the walls hang the dead man's clothes and his leather belt with its knife and birch-bark sheath. On the chair beside the empty bed lie a tobacco pouch, a matchbox and a curly-birch pipe. The floor by the bed is still wet from the recent washing, and a strange smell of death stirs a faintly eerie sense of some unseen presence. Thinking the stillness of night has already come, a cricket in a crack of the stove begins to sing. A second joins it, then a third, and soon a whole choir of crickets is singing its monotonous tune as though nothing out of the ordinary had happened. But to the one sitting alone the song sounds like a funeral hymn and only deepens the feeling of loneliness and bereavement. In spite of herself, a tear wells in the corner of her eye.

Gradually the crickets' song seems to recede and grow quiet; the cramped room seems to widen and the darkening winter afternoon to fall silent. It is as if she were sitting alone in a hushed moving-picture theatre, watching the changing scenes of the long play of her own life.

‒ ‒ a bright late-summer morning. A boat glides from the shore with strong oar strokes toward the parsonage. The rower in shirtsleeves on the lower thwart is Eero, the bridegroom. She herself, the bride, rows on the upper thwart, and her old father steers. In the bow flutter the leaves of the birch set up as a sail, and beneath its branches lie a birch-bark provision pack and, in a white bundle, a borrowed silk headscarf and the black wedding clothes. On the jetty her mother wipes away a tear with one hand and waves goodbye with the other. God bless you! ‒ ‒

‒ ‒ a spacious smoke cabin at the farm where she served. Splints burn in their holders beside the stove and, in honour of the feast, a dim tallow candle on the table. The fiddler plays at the head of the table, beating time with a foot shod in birch bark, wets his throat with home-distilled spirits and plies the bow ever faster. On the floor scrubbed white the guests, gatecrashers at that, dance polka, purppuri and martinvappu until the soot shakes loose from the ceiling. Only toward morning does the dancing cease. As if in a fury the men seize the bridegroom and the girls the bride, and lift them almost to the rafters. Long may they live! ‒ ‒

‒ ‒ at the edge of a small clearing, on the bank of the rapids, a freshly hewn cabin. Near it sit a man and his wife, the man wiping the sweat from his brow with his shirtsleeve, the woman watching the sun set beyond the lake, her hands limp on her knees; around them lie great heaps of stones and piles of stumps for burning. "So is this what the happiness we imagined is like? Does it keep sinking lower and lower like that setting sun?" sighs the woman. "This is what it is, a poor man's happiness," the man answers gruffly. ‒ ‒

‒ ‒ they stand straight as white candles: the empty ears of grain. In the fields and burn-beaten clearings the mute bands of reapers, backs bent like the sickles in their hands from heavy labour, backs bent under the weight of empty ears. ‒ ‒

‒ ‒ like night and day: the pale face of the housewife holding the bread peel, and the pitch-black bark-bread loaves baking in the oven, hoops around them so they will not fall apart on the hearth. Black loaves, pale faces. ‒ ‒

‒ ‒ on the highways a ragged, weeping stream of people driven by want and distress, joined by swift side brooks from the paths of the backwoods. It swells and swells like a great river in flood, and in that flood many drown.

‒ ‒ the bells toll, the passing bells toll every day. The common grave opened on Sundays stretches from one edge of the churchyard to the other. Down there lie black coffins by the dozen, large and small, fathers and mothers side by side with their children next to them: poor folk. At the grave's edge the priest moves slowly from coffin to coffin, casting earth. "From earth you have come, to earth you shall return. Jesus Christ, our Saviour, shall raise you on the last day."

*****

So deeply had she sunk into the distant worlds of her memories that she had not noticed dusk turning to darkness, nor heard a drunkard's unsteady steps or the creak of the door opening. Only when Heikki dropped with a thud the armful of firewood he had brought in from outside into the corner by the stove did she spring to her feet with a cry to heaven and stand staring at the black shape looming by the door.

"Who is it?... Is it Heikki?"

Heikki felt he had struck a good bargain, and the glow of the drinks taken to seal it was at its height; so he wanted a little fun at his mother's expense. In answer to her question he bellowed like an angry bull, which was very easy to imitate in his present state.

"Good God!... Who is it?... Is it... is it...?"

"Urgh!... urgh!" came from the doorway, just like the late Eero's retching when he lay sick. Only someone whose stomach is truly turning heaves like that.

For a while the only sound in the room was the ticking of an unseen clock on the back wall.

"Hehehe!... I r-reckon Mother thought the old man had come from the barn to ha-haunt her," laughed Heikki. Having fumbled a matchbox from his pocket and lit the lamp, he saw his mother crouched on the back bench, her face as pale as bleached linen. Large beads of sweat glistened on her forehead, and she was so faint and weary that at first she could not get a word out.

"I was only jo-joking," Heikki mumbled, as if in apology.

"A poor joke. And a fine time this is for joking... and a pretty thing that is, going off to the village to drink when your father..."

"Let Father rest in pe-peace. Now they are," and Heikki stretched out his hands as if driving a horse, "now the re-reins of Töyrylä are in these hands. I-I am the master."

As if in token of his authority he squirted a long stream of spit onto the floor.

"Grub on the table, or the ta-table goes out in the yard!"

Once his mother had brought the food, Heikki sat down at the head of the table.

"Is there no milk or butter, not even on an e-evening like this?"

"There is not," his mother answered shortly, declining to get into any longer talk with a drunken man.

Without touching the food, Heikki let his gaze travel from the dry heel of bread to the bowl of smelt, from there to the tankard of small beer and the cold boiled potatoes. His cheerful mood of a moment ago was as if blown away. His eyelids beat against each other, and his breath rushed hard through his nostrils like that of a child holding back tears after its toys have been taken away.

"It ought to be killed," he said after a moment.

"What?"

"A c-cow like that."

"Mansikki?"

"Seeing as she gives no m-milk."

"One cow can hardly be expected to give milk the whole year round. And when have you ever had milk and butter set before you in winter! Be glad there is bread. I have lived through times when there was not even bread."

"That was then. But from now on there will be bread at Töyrylä, and something to go w-with the bread."

"I don't suppose the sky will start raining milk any more than it has up to now."

"Not the s-sky, but the earth. I'll build a farm here that the parsonage itself can't better. Even the cowshed has to be grander than a poor farm's main house: ce-cement floors, windows like a church's and ve-vent pipes on the roof."

"Oh dear, oh dear!" sighed his mother.

"Th-that's what I've been thinking."

He fairly glowed with enthusiasm as he explained his ideas to his mother, but when his eye happened to fall on the bowl of smelt, the talk broke off like a snapped twig. After sulking for a moment in silence, elbows propped on the head of the table, he suddenly jumped up, went to the dead man's bed and, after tossing the bedclothes on the floor, gathered the bed straw under his arm.

"What are you doing, you wretch?" asked his mother.

"Isn't it the custom that a dead man's bedding straw has to be b-burned?"

"You could just as well burn it tomorrow."

Heikki carried the straw out to the field, fetched a couple more sheaves from the hay barn and set it alight.

His mother stood at the cottage window but could see nothing except an eerie reddish glow through the pane frosted white.

*****

Heikki slept, snoring heavily, in the bed by the door, but his mother lay long awake on her pallet on the floor, and the choir of crickets in the cracks of the stove sang its monotonous night hymn until morning.
III

The small hours.

As the black door of the Töyrylä croft's threshing barn opens, the February frost shrieks its most piercing note in the iron hinges. Stooping, Henna and the man and wife from the neighbouring cottage push in through the low doorway. The man carries a hammer and iron nails in a birch-bark basket; the wife carries by its handle a round tin lantern whose sides and conical top are full of oblong holes. The light twinkling through them throws dim, restless, ghostly stars onto the barn's black walls and ceiling. The dead man, resting on his board across two flax brakes beside the stove, is lifted into the black coffin they have brought, its bottom spread with wood shavings for softness and a small pillow stuffed with shavings set as a headrest. The old couple from the neighbouring cottage sing the departure hymn. The wife's high, rasping, broken voice seems to weep the pain of a shattered life; the man's bass, echoing as if from the depths, is calm and soothing: no need to fear, down here in the deep one may rest from one's troubles. White columns of vapour puff from the singers' mouths, as from a factory whistle when it calls out in the evening the start of rest for the tired workmen. From the nostrils of the widow, who stands by the coffin with a hand before her mouth, a white breath of grief gusts onto the dead man's face, and from her eyes an occasional hot tear drops onto the white shroud, spreads, and freezes there into a gleaming star. Outside, Heikki harnesses the horse to the sledge, which was brought to the barn the evening before to stand ready for its load, and ties a strip of white cloth to the right side of the shaft bow.

When the hymn has been sung, they look for a moment in silence at the colourless face, as if wishing to imprint it indelibly on their memory. Meanwhile Heikki too has come into the barn.

"Well, goodbye now, Eero," sobs the widow. "Sleep in peace... So many... so many... forgive me, everything... Come, Heikki, you too, and say goodbye to your father."

The lid is lifted into place, and the man from the neighbouring cottage hammers it shut with three long forged nails at each end.

"I expect he'll stay in there," Heikki remarks.

"I dare say he will."

"Lord God!" Henna cries out at that moment and clutches the raised hand of the man with the hammer just as he is about to strike the last blow on the head of the last nail.

"What, what is it now?"

"Oh, dear God, Eero asked... he asked that the coffin lid be fastened with wooden pegs."

The man's hand rose slowly to scratch behind his ear.

"If only you had said so before."

"I just didn't, dear Ville, I just didn't remember."

"You ought to have remembered. It was not much for your husband to ask, a few wooden pegs. And a dead man's wish must be carried out, or else trouble may follow afterwards."

"It is not much to ask. Dear Ville, take those iron nails out, pull them with your teeth if you must," Henna kept fretting.

Ville examined the nails; even their heads had sunk some way into the wood.

"No, Henna dear, they will stay where they are now. If they were factory-rolled nails, they might come out. But forged nails out of resinous wood, not the dev... er, not anyone could pull them without breaking the coffin outright."

"Oh, mercy," Henna fairly wept. "What if it somehow lies heavier on him, seeing he asked for wooden pegs. He must have been thinking of his resurrection. And what if he takes it into his head to come complaining and blaming afterwards."

"It wouldn't be the first such marvel."

"No, it wouldn't be the first. Tell me, dear Ville, what is to be done now."

"What is to be done now is carry the coffin to the sledge and set off," said Heikki. "It's a long way and the going is stiff in this frost, and with a corpse one can't drive as if on the road to market. Before the resurrection comes, more than one coffin will have had time to turn to earth, coffin and coffin nails both, even if they were made of diamond. And I reckon He who can wake the very dead has pincers good enough to draw a forged nail even from resinous wood."

"Heikki, don't speak sin. It comes like a thief, a thief in the night," his mother rebuked him.

"A thief?... Thieves have opened stronger locks than that before now... Take hold of the foot end there, Ville; I'll try to carry it from the head end here."

And indeed no other way was found. The coffin was carried to the sledge and covered with a horse blanket so that only the black ends showed. Heikki and Ville sat down crosswise on the coffin lid.

"Gee up, mare!"

The mare leaned into her collar once, twice, but since the sledge seemed as if stuck fast to a stump, she looked back questioningly.

"Well, of all things," Heikki marvelled. "A load has followed that mare before, and a heavier one too. In the logging woods there have sometimes been as many as five ten-inch logs at once, and they have had to follow without any balking."

"Your passenger this time may not be too keen on leaving," said Ville. "But there are ways to lighten excess weight."

"What sort of ways?"

"The coffin has to be turned face down... and there is a little more to it besides."

"I reckon this is the quickest way," said Heikki, stood up in the sledge and lashed the mare across the rump with the rein ends from over his shoulder. And when the horse threw herself forward with real force, the sledge started so suddenly that the men nearly tumbled onto their backs on the ground. But the sledge ran so heavily that the mare had to haul with all her strength, as if drawing a harrow. It felt as though hundreds of invisible claws were digging in beneath it to hold it back.

All this left Henna so crushed that she could not even weep, only sigh.

"Oh, Eero, Eero!... That such wages had to be paid. And the way he prayed in his bed in his very last moments, clasped his hands crossed over his breast until the finger bones cracked, and moaned so that you would have thought the very stones would relent. God keep me from judging, but they may be too great, Eero's sins. Sing at least, Ville. You sing too, Serahviia, in case it brings some relief."

Ville sang, and Serahviia, walking beside Henna behind the sledge, sang all the way to the outer gate of the home field, where the women turned back.

"Remember at least to have the passing bells rung for him," Henna called after the departing men.

For the first stretch the men sat hunched on the coffin lid in their felt boots and belted fur coats, silent and lost in thought like crows on the ridge of a barn. Only when Heikki had fumbled from inside his coat a travelling flask, of the kind that had been an indispensable companion on such journeys since time immemorial, and it had passed from hand to hand and lip to lip several times, did their tongues gradually loosen.

"It could well have frozen fast there in the barn lane," Heikki began, on the matter he guessed Ville too was turning over. "Sometimes a sledge sticks to the ice so hard that it won't come loose without a lever."

"Hm... could be, it could well have stuck," Ville conceded, mellowed by the drinks he had had. "But what about that harrowing and clawing under the sledge?"

"There was ice under the runners, of course, which gradually wore and ground away. Ville can see for himself that the mare is stepping along now with the collar up at her ears. Gee up, Lipi!... hey!"

"Stop your shrieking," Ville warned sternly. "Or do you want ice under your runners again? Just now they were driven off by the power of song, but if they take it into their heads to come back, they may not leave at all."

"Who are they?"

"Them, that's who... Ice, is it? Well, children have a child's faith. I am an old man already; I have been through many a mill and know a thing or two about these matters. I have made the last dwellings for the dead of Korvenkylä, escorted them to the grave just like this one after another, and been chief guest and hymn singer at every burial feast for close on fifty years."

"A man has time to see one thing and another in that span."

"That he does. Once, it just came to mind, we were carrying a feather-light, disease-wasted little old man along a forest path to the boat landing; in those days, you see, bodies were taken to church by water in summer. We came to the gate of the shore meadow, and one of the bearers says: 'Ville, pull down a couple of the gate bars and we'll slip the coffin through the gap in the fence like a reed bag.' But before we got that reed bag through the gate, it had to be taken down to the very last bar. And over the few hundred fathoms from the gate to the shore we had to stop and rest three times, though there were four able-bodied men of us carrying, and when we got there we were wet with sweat like drenched dogs."

"It must surely have taken a ten-oared church boat to carry such a hulk."

"It did not, only an ordinary two-oared boat. You see, the earth devil's power does not reach out onto the water, which has its own masters, the water devils... I suppose there's still a drop at the bottom of that bottle?"

Heikki dug the flask out from inside his coat and handed it to Ville. Before taking a drink, Ville shook it and inspected the glinting liquid against the sliver of moon tilted on the horizon.

"An old man's eye grows dim, but it looks to be down by about a third... Ah!"

"Never mind that, let Ville take a man's drink," Heikki urged. "A poor man's offering is not to be scorned. When this runs out, there should be a drop more in the hay sack at the front of the sledge."

"In the hay sack?... Say no more then, my lad."

The bottom of the flask tipped upward again, and the sliver of moon mirrored its narrow face in the round bottom, this time for a good while.

"Aah!... My, oh my... where from?... Here, master, you have some too!... The host is always as good as his guest... Where did you... aah!... still find stuff brewed at the foot of a spruce?"

"Best left unsaid."

"Well, I suppose so. I haven't tasted the like of this stuff since the days of home distilling... Here, put some of this shag in your pipe. The air seems to have warmed a bit, so a man can risk filling his pipe without fear that the pi... Well, damn...! Breathe on the corner of my mouth there, Heikki, so I can get rid of this stub. The devil pulls my lips together like a magnet and stings like a hundred needles," Ville mumbled out of the other corner of his mouth. "The mouthpiece is made of tinplate, and it has frozen fast to my lips before in bitter frosts... That's the way, that's the way... strong stuff your breath smells of, too... Aha, there goes the upper lip. Keep breathing, so it comes off the lower lip as well... tha... that's the way. An old man's lips hold no more warmth than old sole leather."

From there they went on their way smoking. The smoke hung behind them in the still air, and to pass the time Ville told tales of adventures with the dead, his own and other people's.

"Two bold men once argued over which of them was the bolder. They laid a wager on it, and as proof of his fearlessness each was to fetch a dead man's bone from a burial vault, alone, on a dark autumn night. Lots were drawn to decide who should take his test first. The one on whom the lot fell goes to the churchyard on the appointed night, moves the cover boards aside, climbs down a ladder into the vault and, having found a bone, is about to leave. 'Leave it be, that is my grandfather's bone,' comes a voice as grim as death from a corner of the vault, where some white shape seems to loom. The searcher leaves his prize and finds another. 'Leave it be, that is my father's bone.' The man does as he is told and finds a third bone. 'Leave it be, that is my own bone.' 'Let it be whose bone it may, it is coming with me now,' the searcher answers and climbs the ladder with his prize, scurries quietly along the path toward the gate and hears pattering steps right behind him. He dare not look back, yet cannot keep from it either. Then he sees a white phantom at his heels, arms spread as if reaching to embrace him. At that instant the pursued man falls shrieking onto the path. There he was found dead the next morning, a dead man's shinbone in his hand."

"Was it the white phantom's shinbone?" asked Heikki.

"The white phantom was the other party to the wager, who had secretly hidden beforehand in a corner of the vault with a white sheet around him."

"I suppose the man in the sheet left his dead man's bone unfetched, too?"

"Unfetched. While the one was being carried to the grave in a coffin, the other was being taken in ropes to the madhouse."

Ville knocked his pipe out against the end board of the coffin as against a hollow, booming pine, scraped the dottle into his cheek and put the pipe in his pocket.

Again they sat hunched in silence like crows on the ridge of a black barn. But now something extraordinary happened, something much as if an unsuspecting person were sitting on a rock and a charge of dynamite suddenly went off inside it. The men, sunk deep in thought, shot up together as if flung by an unseen hand, landed on their feet well clear of the sledge, and stared wide-eyed over their shoulders at the coffin as if expecting something more dreadful still.

"That was some bang," Ville finally managed to stammer.

"It was... it was some crash, as if a cannon had been fired," Heikki chimed in.


From the coffin beneath them a mysterious creaking and snapping had been heard for some time, and then all at once a loud, crashing report. At the same moment it felt as though the coffin lid had been given a powerful shove upward.

"Did he strike with his fist, do you think... or might he have kicked?" asked Heikki.

"Likely struck with his fist and kicked at the same time. It felt as if the whole coffin had gone to splinters."

"A good thing they were forged nails," Heikki went on, wiping the sweat from his brow with the sleeve of his fur coat.

"And what if it was because of those iron nails that he did it," replied Ville... "You're feeling warm, I take it?"

"It did warm me up a bit."

"What was it you mentioned a while back about a hay sack? A small one wouldn't do any harm now. That knocked my head so clear it's as empty as the Oulu country."

"The hay sack?... ah, yes."

Heikki stopped the horse, and after a pull at the bottle the men went on again, at first walking beside the sledge; but when nothing more came from the coffin than the familiar mysterious snapping, they ventured to sit down in their old place again, only to wonder at a new marvel.

"Does Ville see anything on the road there?" asked Heikki, his voice trembling slightly.

Having peered in the direction indicated, Ville said:

"Mist tends to settle in an old man's eye, but it does look as if something black were moving there."

"A long tail keeps swinging from side to side. It looks like a fox, but it might be a dog too... Here, boy, here!"

"Don't go calling it; those tail-waggers are well known."

When the fox-like creature kept on going ahead of them along the road, sitting down now and then as if waiting for those coming behind, Heikki finally said:

"I don't suppose Ville could bear to sing in this cold? It has rather odd ways for a fox."

"That is my office on these journeys, I suppose," Ville replied.

The moment he raised his voice, the fox-like creature leapt to the roadside and showed itself no more. Instead there was an occasional rustling by the road, and white, darting shapes flickered among the trees. At times strange nasal sounds and the flapping of wings came from high in the treetops... Flicker, whine, flap your leathery wings; all the more powerfully does the Word roll from Ville's lips.

Kilometres from the road, a backwoods cottager goes out to the stable to prepare his horse's morning feed. On the yard path he stops and listens, calls his wife as well, and they listen together.

"What madman is shouting in the backwoods at night?" says the man.

"He isn't in some trouble, is he, the one shouting?"

"I doubt there's any trouble; he's probably just bawling to keep warm. It sounds rather like Korpeis-Ville's voice; could someone over in Korvenkylä have moved to his last lodging?"

"The singing seems to be moving off toward the church. But the Korvenkylä folk have never before sung for their dead anywhere except when passing villages."

After driving for league upon league across uninhabited heaths, bogs and the booming ice of lakes, the men at last reached the highway and settled country near the church. In the south-east a silver-bright dawn was already beginning to break. The spells of the night wilderness and the heart-gripping events along the way now seemed no more than a bad, distant dream.

"Might that hay sack of yours still yield a drop?" asked Ville. "That frost is mighty sharp; it bites hardest just now as day is breaking."

After taking good swigs from the bottle there at the crossroads, the men continued on the last leg of their journey to the church.

"Ville has still kept his hymn-singing voice, old man though he is," Heikki said admiringly. "When you let loose back there in the forest, the hoarfrost all but dropped from the trees, and I dare say it did drop from the lowest branches."

"An old cock's crow grows hoarse," replied Ville. "But you should have heard me back in my youth. The late Intreeni was dean here in those days, and God was still worshipped in the old church. Once at the bishop's visitation... well, when the bishop himself is listening, everyone tries
