Basic Statistics For Research
Kenneth M. Y. Leung
Why do we
need statistics?
Statistics
Derived from the Latin for state - governmental
data collection and analysis.
Study of data (branch of mathematics dealing
with numerical facts i.e. data).
Descriptive Stats
http://www.censtatd.gov.hk
Inferential Stats
A Hypothesis
A statement relating to an observation that
may be true but for which a proof (or
disproof) has not been found.
The results of a well-designed experiment
may lead to the proof or disproof of a
hypothesis (i.e. rejection or acceptance of the
corresponding null hypothesis).
Inferential Stats
Samples
Sub-samples
Population
Inferential Stats
[Figure: probability density (0.00 to 0.10) plotted against height (cm), 140 to 190 cm]
Inferential Stats
True mean
Sample size
*Assuming data follow the normal distribution
Inferential Stats
Zimmer 2001
Predictive Stats
b: Sullivan's method
c: A regression model
Predictive Stats
Measurement Theory
Environmental scientists routinely make measurements
in laboratory or field work by assigning
numbers or groups (classes).
Mathematical operations may be applied to
the data, e.g. predicting fish mass from
length through an established regression.
Different levels of measurement:
nominal, ordinal, interval, ratio
Nominal
Ordinal
[Figure: example scales, one axis marked 10, 100, 1000 and one marked 0.5 to 4.0 in steps of 0.5]
Dispersal pattern
Standard deviation for normally distributed
data
Ball-Balls Flowchart
Measurements (data)
  -> Descriptive statistics
  -> Normality check: frequency histogram (skewness & kurtosis), probability plot, K-S test
       YES -> Parametric tests
       NO  -> Data transformation, then check again
                YES -> Parametric tests
                NO  -> Median, range, Q1 and Q3; non-parametric test(s)
Parametric tests:
  Student's t tests for 2 samples
  ANOVA for 2 or more samples
  Post hoc tests for multiple comparison of means
Non-parametric test(s):
  For 2 samples: Mann-Whitney
  For 2 paired samples: Wilcoxon
  For >2 samples: Kruskal-Wallis, Scheirer-Ray-Hare
[Figures: positions of the mean and the median on a 0.5 to 4.0 mm axis, shown for several distributions]
Geometric mean: $GM = 10^{\frac{1}{n}\sum \log_{10}(x_i)}$, i.e. 10^[mean of log10(xi)]
Only for positive ratio scale data
If data are not all equal, geometric mean < arithmetic mean
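The geometric mean definition above can be sketched in a few lines of Python. This is only an illustration: the function name `geometric_mean` is hypothetical, and the data are the eight fish larvae lengths quoted in the Range example of this deck.

```python
import math
import statistics


def geometric_mean(values):
    """Return 10 ** (mean of log10(x)); valid only for positive ratio-scale data."""
    values = list(values)
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return 10 ** statistics.fmean([math.log10(v) for v in values])


# Lengths (mm) of 8 fish larvae, taken from the Range example
lengths = [0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5]
gm = geometric_mean(lengths)
am = statistics.fmean(lengths)

# The values are not all equal, so the geometric mean is below the arithmetic mean
print(f"geometric mean = {gm:.3f} mm, arithmetic mean = {am:.3f} mm")
```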
Measurements of Dispersion
Range
e.g. length of 8 fish larvae at day 3 after hatching:
0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
Range = 2.5 - 0.6 = 1.9 mm (or say from 0.6 to 2.5 mm)
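Deviations from the mean (five rainfall measurements, mean = 7):
xi      xi - mean       (xi - mean)^2
12      12 - 7 = 5      25
0       0 - 7 = -7      49
2       2 - 7 = -5      25
5       5 - 7 = -2      4
16      16 - 7 = 9      81
Sum of squared deviations = 184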
Sample SD (s)
$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
$s = \sqrt{\dfrac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}}$
Two modifications:
dividing $\sum (x_i - \bar{x})^2$ by (n - 1) rather than n gives a better (less biased) estimate of σ (however, when n increases, the
difference between s and σ declines rapidly)
the sum of squared (SS) deviations can be calculated as
$\sum x_i^2 - (\sum x_i)^2 / n$
Sample SD (s)
e.g. five rainfall measurements, whose mean is 7.0
Rainfall xi (mm)    xi^2
12                  144
0                   0
2                   4
5                   25
16                  256
Σxi = 35            Σ(xi^2) = 429
(Σxi)^2 = 1225
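As a check on the rainfall example, the following Python sketch computes the sum of squares both from the definition and from the shortcut formula, then the sample SD. The variable names are illustrative only; the five rainfall values are those given in the slides.

```python
import math

# Five rainfall measurements (mm) from the slide; their mean is 7.0
rainfall = [12, 0, 2, 5, 16]
n = len(rainfall)
mean = sum(rainfall) / n

# Sum of squares from the definition: sum of (xi - mean)^2
ss_definition = sum((x - mean) ** 2 for x in rainfall)

# Sum of squares from the shortcut: sum(xi^2) - (sum xi)^2 / n
ss_shortcut = sum(x * x for x in rainfall) - sum(rainfall) ** 2 / n

# Sample SD divides SS by (n - 1), not n
s = math.sqrt(ss_definition / (n - 1))

print(f"SS = {ss_definition}, shortcut SS = {ss_shortcut}, s = {s:.2f} mm")
```

Both routes give SS = 184, matching the deviation table, and the divisor n - 1 = 4 gives a variance of 46.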
Start here
Observations: patterns in space or time
Models: explanations or theories
Hypothesis: predictions based on model
  Carcinus maenas (Reference: Sneddon et al. 1997, in Behav. Ecol. Sociobiol. 41: 237-242)
Null Hypothesis: logical opposite to hypothesis
Experiment: critical test of null hypothesis
  Retain Ho -> refute hypothesis and model
  Reject Ho -> support hypothesis and model
Interpretation
Don't end here
Randomized Sampling
Lucky Draw Concept
To randomly select 30 out of 200 sampling stations in
Hong Kong waters, you may perform a lucky draw.
Each station then has a more or less equal chance of being
selected at each draw (unbiased).
It can be done with or without replacement.
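The lucky draw can be imitated with Python's standard `random` module. This is a minimal sketch: the station numbering 1 to 200 and the seed are assumptions made only so the example is reproducible.

```python
import random

# Hypothetical station IDs 1..200 standing in for the 200 sampling stations
stations = list(range(1, 201))

# A fixed seed is used here only to make the illustration repeatable
rng = random.Random(42)

# Without replacement: every drawn station is distinct
draw_without_replacement = rng.sample(stations, k=30)

# With replacement: the same station may be drawn more than once
draw_with_replacement = rng.choices(stations, k=30)

print(sorted(draw_without_replacement))
print(sorted(draw_with_replacement))
```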
Sampling with Transects and a Random Number Table
Randomly lay down the transects based on random numbers.
Randomly take samples along each transect.
Randomized Sampling
Spatial Comparison: Clustered Random Sampling
Randomly take e.g. 10 samples from each randomly selected site.
[Figure: sites S1 to S7]
Temporal Comparison
Wet Season vs. Dry Season
Randomly select sampling days within each
season (assuming each day is independent of
other days), covering both neap and spring tides.
The transitional period should not be sampled, to
ensure independence of the two seasons.
Spatial
Temporal
[Figure: three panels labelled "Moderately precise and accurate", "Highly precise but not accurate" and "Highly precise and accurate"]
Precision can be estimated using procedure replicates.
Accuracy can be checked with certified standard reference solutions.
[Figure: Abs plotted against Conc.]
QC & QA:
Control Chart
[Figure: Control, Treatment A and Treatment B groups, with Mean 1, Mean 2 and Mean 3]
A Bathing Beach
[Figure: bathing beach with a storm drainage outfall, a wave breaker and the sea; Site B, Site C; 3 sub-sites]
Three replicate sub-sites per site, each sub-site with three
replicate samples, and each sample with three procedure
replicates to ascertain the measurement precision.
Inferential Statistics
Frequency Distribution
Raw data (excerpt): 8.2, 5.3, 5.2, 5.5, 4.3, 4.2
Class       Frequency
3 to <4     2
4 to <5     12
5 to <6     10
6 to <7     7
7 to <8     4
8 to <9     2
Frequency Histogram
[Figure: histogram of frequency by class, 3 to <4 up to 8 to <9]
e.g. Why bimodal-like?
[Figure: frequency histogram of height (cm) in 4 cm classes, from >149-153 to >181-185]
[Figure: probability density against height (cm), 140 to 190 cm, with separate curves for male and female]
[Figure: normal probability density curves N(10,1), N(20,1), N(20,2) and N(10,3) plotted against X from 0 to 30 (Pentecost 1999)]
[Figure: court analogy. The court's decision (e.g. "Innocent") is set against accepting or rejecting Ho; each combination is marked either "OK" or "Wrong judgement"]
                     If Ho is true    If Ho is false
If Ho is rejected    Type I error     No error
If Ho is accepted    No error         Type II error
What is next?
1. Group Discussion on the Experimental
Design for a Case Study
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
[Figures: plots of the measured unit (label "B"); error bar = 95% C.I.]
$n \ge \dfrac{2 s_p^2}{\delta^2}\,\left(t_{\alpha,\nu} + t_{\beta(1),\nu}\right)^2$
where $\delta$ is the smallest population difference we wish
to detect: $\delta = \mu_1 - \mu_2$
Required sample size depends on $\delta$, population
variance ($\sigma^2$), $\alpha$, and power ($1 - \beta$)
If we want to detect a very small $\delta$, we need a larger
sample.
The above equation can be rearranged to
ask how small a population difference ($\delta$)
is detectable with a given sample size:
$\delta \ge \sqrt{\dfrac{2 s_p^2}{n}}\,\left(t_{\alpha,\nu} + t_{\beta(1),\nu}\right)$
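A rough sketch of the sample-size relation follows. Note the swap: it uses standard normal quantiles in place of the t quantiles of the slide formula, which avoids iterating on the degrees of freedom but slightly underestimates n for small samples. The function names, the two-tailed α and the default power of 0.80 are assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sp2, delta, alpha=0.05, power=0.80):
    """Approximate n per group to detect a difference delta.

    Assumption: normal quantiles replace the t quantiles of the slide formula.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)           # one-tailed beta
    return math.ceil((2 * sp2 / delta ** 2) * (z_alpha + z_beta) ** 2)


def detectable_difference(sp2, n, alpha=0.05, power=0.80):
    """Rearranged form: smallest difference detectable with n per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.sqrt(2 * sp2 / n) * (z_alpha + z_beta)


# Illustrative inputs: pooled variance 0.312 (as in Example 1), delta = 0.5
n_needed = sample_size_per_group(sp2=0.312, delta=0.5)
delta_at_12 = detectable_difference(sp2=0.312, n=12)
print(n_needed, round(delta_at_12, 3))
```

Halving the difference to be detected roughly quadruples the required sample size, which is the point made on the slide.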
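Example 1
Sample b (n = 12): 3.89, 3.19, 2.80, 4.31, 3.42, 3.41, 3.55, 2.40, 2.99, 3.08, 3.31, 4.52
          First sample    Sample b
n         12              12
mean      3.701           3.406
S2        0.257           0.366
SS = sum of squares = $S^2 \times \nu$
$s_p^2 = \dfrac{SS_1 + SS_2}{\nu_1 + \nu_2} = \dfrac{(0.257 \times 11) + (0.366 \times 11)}{11 + 11} = 0.312$
$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}} = \sqrt{\dfrac{0.312}{12} \times 2} = 0.228$
$t = \dfrac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} = \dfrac{3.701 - 3.406}{0.228} = 1.294$
df = 2n - 2 = 22
t critical (α = 0.05, df = 22, 2-tailed) = 2.074 > t observed = 1.294, so p > 0.05
Remember to always check the homogeneity of variance before running the t test.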
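The arithmetic of Example 1 can be retraced from the summary statistics alone. The sketch below assumes equal-variance (pooled) two-sample t as in the slide; the function name `pooled_t` is illustrative, and the critical value 2.074 is simply copied from the slide rather than computed.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic with a pooled variance, from summary statistics."""
    df1, df2 = n1 - 1, n2 - 1
    sp2 = (var1 * df1 + var2 * df2) / (df1 + df2)  # pooled variance
    se = math.sqrt(sp2 / n1 + sp2 / n2)            # SE of the difference in means
    t = (mean1 - mean2) / se
    return sp2, se, t, df1 + df2


sp2, se, t_obs, df = pooled_t(3.701, 0.257, 12, 3.406, 0.366, 12)

T_CRITICAL = 2.074  # two-tailed, alpha = 0.05, df = 22 (value quoted on the slide)
reject_h0 = abs(t_obs) > T_CRITICAL

print(f"sp2 = {sp2:.3f}, SE = {se:.3f}, t = {t_obs:.3f}, df = {df}, reject Ho: {reject_h0}")
```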
Example 1
N = 2 x 48 = 96
Example 2
mean    625.0     306.25
S2      5798.2    6028.6
N = 2 × 3 = 6
Example 3
(first column label lost in extraction; [P1] inferred from the sequence)
[P1]    P2      P3      P4      C5      C6
4.25    3.50    7.20    4.00    0.50    2.50
3.45    3.80    6.50    5.50    5.50    2.50
4.75    4.70    4.00    2.20    2.25    2.30
5.60    1.01    2.20    1.70    3.00    3.30
3.20    6.00    3.50    6.00    5.00    4.50
Example 3
ANOVA
Source of Variation    SS          df    MS          F           P-value
Between Groups         9.465417    5     1.893083    0.650981    0.663547
Within Groups          69.79308    24    2.908045
Total                  79.2585     29
common SD = √2.908 = 1.705299
[Figure: values (0.00 to 8.00) plotted by site]
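The one-way ANOVA table of Example 3 can be reproduced from the 30 data values. This sketch assumes the values were laid out as five replicate rows across six site columns (the layout that reproduces the reported sums of squares); the variable names are illustrative.

```python
# Five replicate rows by six site columns, read in slide order
rows = [
    [4.25, 3.50, 7.20, 4.00, 0.50, 2.50],
    [3.45, 3.80, 6.50, 5.50, 5.50, 2.50],
    [4.75, 4.70, 4.00, 2.20, 2.25, 2.30],
    [5.60, 1.01, 2.20, 1.70, 3.00, 3.30],
    [3.20, 6.00, 3.50, 6.00, 5.00, 4.50],
]
groups = [list(col) for col in zip(*rows)]  # one list per site

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-groups SS: group size times squared distance of group mean from grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-groups SS: squared deviations of each value from its own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within
common_sd = ms_within ** 0.5

print(f"SS between = {ss_between:.4f}, SS within = {ss_within:.4f}, "
      f"F = {f_ratio:.4f}, common SD = {common_sd:.4f}")
```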
Example 3
N = 6 x 21 = 126
Example 4
Source of variance      SS        DF    MS = SS/DF    F        F critical, 0.05(1), 1, 16    P
Total                   1827.7    19
Cells                   1461.3    3
PCB                     1386.1    1     1386.10       60.53    4.49                          < 0.001
Sex                     70.31     1     70.31         3.07     4.49                          > 0.05
PCB x Sex               4.900     1     4.90          0.21     4.49                          > 0.05
Within cells (error)    366.4     16    22.90
[Figure: bar chart (axis 0 to 45) for Female and Male in the Control and PCB treated groups]
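The F ratios in the two-way ANOVA of Example 4 follow directly from the sums of squares and degrees of freedom. A minimal sketch, with the critical value 4.49 copied from the slide rather than computed, and with illustrative variable names:

```python
# Sums of squares and degrees of freedom taken from the Example 4 table
effects = {
    "PCB": (1386.1, 1),
    "Sex": (70.31, 1),
    "PCB x Sex": (4.900, 1),
}
error_ss, error_df = 366.4, 16

F_CRITICAL = 4.49  # F critical, 0.05(1), 1, 16, as quoted on the slide

ms_error = error_ss / error_df  # within-cells mean square

f_ratios = {}
for name, (ss, df) in effects.items():
    ms = ss / df                      # MS = SS / DF
    f_ratios[name] = ms / ms_error    # F = MS effect / MS error

for name, f in f_ratios.items():
    verdict = "significant" if f > F_CRITICAL else "not significant"
    print(f"{name}: F = {f:.2f} ({verdict} at alpha = 0.05)")
```

Only the PCB effect exceeds the critical value, which agrees with the P column of the table.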
Ball-Balls Flowchart (shown again, from measurements (data) through to the parametric and non-parametric tests)
Due to shortcomings of inferential stats,
alternatives to hypothesis testing exist
A Simple Example
1000 People
  10 Exposed (P = 0.010): 8 Sick (P = 0.800), 2 Fine (P = 0.200)
  990 Non-Exposed (P = 0.990): 95 Sick (P = 0.096), 895 Fine (P = 0.904)
$P(\text{Exposed} \mid \text{Sick}) = \dfrac{P(\text{Exposed})\,P(\text{Sick} \mid \text{Exposed})}{P(\text{Sick})} = \dfrac{(0.010)(0.800)}{(0.010)(0.800) + (0.990)(0.096)} = 0.078$
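The Bayes calculation for the exposure example can be checked with a short Python sketch. The function name `posterior` is illustrative; the probabilities and counts are those on the slide.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem for a binary hypothesis: P(H | evidence)."""
    p_evidence = prior * p_evidence_given_h + (1 - prior) * p_evidence_given_not_h
    return prior * p_evidence_given_h / p_evidence


# P(Exposed) = 0.010, P(Sick | Exposed) = 0.800, P(Sick | Non-Exposed) = 0.096
p_exposed_given_sick = posterior(0.010, 0.800, 0.096)

# The same quantity straight from the counts: 8 exposed among 8 + 95 sick people
p_from_counts = 8 / (8 + 95)

print(f"from probabilities: {p_exposed_given_sick:.3f}, from counts: {p_from_counts:.3f}")
```

The two routes differ only in the fourth decimal place because 0.096 is a rounded version of 95/990.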
Example: Fishkills
Large Fish Kill: Yes 0.081 (810); No 0.919 (9190)

High Pfiesteria    Given Large Fish Kill = Yes                    Given Large Fish Kill = No
Yes                0.520 (421 cases of kills with Pfiesteria)     0.205 (1884 cases of no kills with Pfiesteria)
No                 0.480 (389 cases of kills without Pfiesteria)  0.795 (7306 cases of no kills without Pfiesteria)

Low Oxygen         Given Large Fish Kill = Yes                    Given Large Fish Kill = No
Yes                0.220 (178 cases of kills with low DO)         0.095 (873 cases of no kills with low DO)
No                 0.780 (632 cases of kills without low DO)      0.905 (8317 cases of no kills without low DO)

Odds of a kill with high Pfiesteria = 421/1884 = 0.22346
Odds of a kill with low DO = 178/873 = 0.20389
Likelihood Ratio = 0.22346 / 0.20389 = 1.096
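The two values around the likelihood ratio can be traced back to the case counts: each is the odds of a kill given the condition (kills divided by no-kills), and their quotient is the ratio shown. A minimal sketch with illustrative variable names:

```python
# Case counts from the fish kill example
kills_with_pfiesteria, no_kills_with_pfiesteria = 421, 1884
kills_with_low_do, no_kills_with_low_do = 178, 873

# Odds of a large fish kill when each condition is present
odds_pfiesteria = kills_with_pfiesteria / no_kills_with_pfiesteria
odds_low_do = kills_with_low_do / no_kills_with_low_do

# Ratio comparing the two candidate explanations
ratio = odds_pfiesteria / odds_low_do

# Conditional probabilities of each condition given a kill, for comparison with the slide
p_pfiesteria_given_kill = kills_with_pfiesteria / 810
p_low_do_given_kill = kills_with_low_do / 810

print(f"{odds_pfiesteria:.5f} / {odds_low_do:.5f} = {ratio:.3f}")
```

A ratio this close to 1 means the counts favour high Pfiesteria over low dissolved oxygen only marginally.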
[Figure: network diagram with nodes: Urbanization; Sediment concentrations (inorganics); Sediment concentrations (PAHs); Sediment concentrations (organochlorines: DDTs, chlordane); Stomach concentrations (inorganics); Stomach concentrations (organochlorines); Stomach concentrations (PAHs); Fish liver lesions; Fish sex; Fish mortality; Fish age]
Supplemental Readings
Aven, T. & J.T. Kvaløy, 2002. Implementing the Bayesian paradigm in risk analysis.
Reliability Engineering and System Safety 78: 195-201.
Bacon, P.J., J.D. Cain & D.C. Howard, 2002. Belief network models of land manager
decisions and land use change. Journal of Environmental Management 65: 1-23.
Belousek, D.W., 2004. Scientific consensus and public policy: the case of Pfiesteria.
Journal of Philosophy, Science & Law 4: 1-33.
Borsuk, M.E., 2004. Predictive assessment of fish health and fish kills in the Neuse
River estuary using elicited expert judgment. Human and Ecological Risk Assessment
10: 415-434.
Borsuk, M.E., D. Higdon, C.A. Stow & K.H. Reckhow, 2001. A Bayesian hierarchical
model to predict benthic oxygen demand from organic matter loading in estuaries
and coastal zones. Ecological Modelling 143: 165-181.
Garbolino, P. and F. Taroni. 2002. Evaluation of scientific evidence using Bayesian
networks. Forensic Sci Intern. 125:149-155.
Newman, M.C. and D. Evans. 2002. Causal inference in risk assessments: Cognitive
idols or Bayesian theory? In: Coastal and Estuarine Risk Assessment. CRC Press
LLC, Boca Raton, FL, pp. 73-96.
Newman, M.C., Zhao, Y., and J.F. Carriger. 2007. Coastal and estuarine ecological
risk assessment: the need for a more formal approach to stressor identification.
Hydrobiologia 577: 31-40.
Uusitalo, L. 2007. Advantages and challenges of Bayesian networks in environmental
modeling. Ecol. Modelling 203:312-318.