BIOSTATISTICS
The word statistics derives from the
Latin status, meaning
information useful to the state, e.g.,
the sizes of populations and
armed forces.
Statistics refers to the
numerical data relating to an
aggregate of facts.
Also used to refer to the
procedures and techniques
used to collect, process and
analyze data to make
inferences and to reach
IMPORTANT CHARACTERISTICS OF
BIOSTATISTICS:
It deals with uncertainties in
population groups and events.
It deals with data subjected to
random variations like height of
children etc.
The study design and data collection
procedures have to be correct to
obtain meaningful statistics.
Biostatistics can be divided into:
1. Descriptive: Deals with the concepts and
methods concerned with summarization and
description of the important aspects of the
numerical data.
2. Inferential: Deals with procedures for
making inferences about the characteristics
of large groups (populations) by using a
part of the data, called the sample.
Definitions
Population: the set of measurements of
interest to the sample collector.
Sample: any subset of
measurements selected from the
population.
Element/Unit: an entity on which
measurements are obtained.
Observation: the set of measurements
obtained for each element.
Data: facts and figures collected,
summarised and analysed.
Data set: a set of different
variables in a particular study.
Statistical analyses need
variability; otherwise
there is nothing to study
Statistics is concerned, mainly, with
variables
Variation is important!!!!
Any type of observation which can take
different values for different people, times,
places, species etc is called a
VARIABLE
E.g., height, weight, uric acid level, X-ray
findings, parity, social class etc.
A mathematical constant takes a fixed value,
e.g., the ratio of the circumference of a circle to
its diameter is a constant, approximately 3.141592654,
for circles of all sizes.
Types of variables
A QUALITATIVE variable is one which does
not take a numerical value. It may be
concerned with the characteristics eg., gender,
survival or death, place of birth, colour of eyes
etc.
A QUANTITATIVE variable takes
a numerical value. eg., height, blood pressure,
lung capacity, exact age, parity, number of
cases in a study, completed family size, age last
birthday etc.
TYPES OF VARIABLES
Variable
  Qualitative or categorical
    Nominal (not ordered), e.g. ethnic group
    Ordinal (ordered), e.g. response to treatment
  Quantitative measurement
    Discrete (count data), e.g. number of admissions
    Continuous (real-valued), e.g. height
CATEGORICAL VARIABLES
Cannot be measured numerically
Categories must not overlap and
must cover all possibilities
CATEGORICAL NOMINAL
VARIABLES
Named categories
No implied order among categories
Examples:
Gender: Male/Female
Blood Groups: O, A, B, AB
Ethnic Group: Chinese, Malay, Indian,
Jordanian
Eye color: brown/black/blue/green/mixed
CATEGORICAL ORDINAL VARIABLES
Same as nominal but ordered
categories
Differences between categories
may not be considered equal
Examples:
Grading: Excellent, satisfactory,
unsatisfactory
Pain severity: no pain, slight pain,
moderate pain, severe pain
QUANTITATIVE VARIABLES
Can be measured numerically
Examples:
weight
# of admissions to the hospital
concentration of chlorine
Can be discrete or continuous
DISCRETE NUMERICAL VARIABLES
Integers that correspond to a count
Can assume only whole numbers
Examples:
# of bacterial colonies on a plate
# of missing teeth
# of accidents in a time period
# of illnesses in a time period
CONTINUOUS DATA
Continuous data are measured
Can take any value within a defined
range
Limitations imposed by the measuring
stick
Examples: blood pressure, height, weight,
time
WHY DOES IT MATTER?
Categorical and quantitative variables are statistically
summarized and presented in different ways
Variable Type    Data Presentation
Quantitative     Graphs, Tables
Categorical      Charts, Tables
TYPES of DATA
Qualitative data = Categorical data
Quantitative data = Numerical data
Qualitative/Categorical Data
There are two types of categorical
data:
nominal
ordinal
NOMINAL DATA
In NOMINAL DATA, the variables are divided into
named categories. These categories however,
cannot be ordered one above another (as they
are not greater or less than each other).
Example:
NOMINAL DATA      CATEGORIES
Sex/Gender:       male, female
Marital status:   single, married, widowed, separated, divorced
ORDINAL DATA
In ORDINAL DATA, the variables are also
divided into a number of categories, but they
can be ordered one above another, from
lowest to highest or vice versa.
Example:
ORDINAL DATA              CATEGORIES
Level of knowledge:       good, average, poor
Opinion on a statement:   fully agree, agree, disagree, totally disagree
Numerical Data
We speak of NUMERICAL DATA if the
VARIABLES are expressed in numbers. They
can be examined through:
Frequency Distribution
Percentages, Proportions, Ratios and Rates
Figures ETC.
Numerical Data
May be:
Discrete or Continuous
Discrete numerical data considers counts
which can be expressed only as whole
numbers e.g., number of people, parity,
number of males/females in a family etc.
Continuous numerical data considers
measures which can take any value
between two whole numbers e.g., weight,
height, uric acid levels etc.
SCALES OF MEASUREMENT
There are four scales (or levels) at which we measure:
Level     Scale      Characteristic
Lowest    Nominal    naming
          Ordinal    ordering
          Interval   equal intervals without an absolute zero
Highest   Ratio      equal intervals with an absolute zero
DATA SUMMARIZATION
Measures of Central Location
Measures of Dispersion and
Measures of Shapes
[Figure: distribution of Number of people by Age, annotated with Central Location and Spread]
MEASURES OF CENTRAL LOCATION
Definition: a single value that
represents (is a good summary of) an
entire distribution of data
Also known as:
Measure of central tendency
Measure of central position
Common measures
Arithmetic mean
Median
Mode
Raw data set:
Ages of students in a class (years)
Age: 27, 30, 28, 31, 28, 36, 29, 37, 29, 34, 30, 30, 27, 30
Order the data set from the
lowest value to the highest value
Add observation numbers
Obs   Age
1     27
2     27
3     28
4     28
5     28
6     29
7     29
8     29
9     29
10    30
11    30
12    30
13    30
14    30
15    31
16    31
17    32
18    34
19    36
20    37
MODE
Definition: Mode is the value that occurs
most frequently
Method for identification
1. Arrange data into frequency
distribution or histogram,
showing the values of the
variable and the frequency with
which each value occurs
2. Identify the value that occurs
most often
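The two steps above can be sketched in a few lines of Python. This is only an illustration, assuming the 20 student ages used in this lecture; the names `ages`, `frequency` and `modes` are chosen for the example and are not part of the original slides.

```python
from collections import Counter

# Ages of the 20 students used in this lecture (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

# Step 1: build the frequency distribution (value -> number of occurrences).
frequency = Counter(ages)

# Step 2: keep every value whose count equals the highest count.
# A list is used because a data set may have more than one mode.
highest = max(frequency.values())
modes = sorted(value for value, count in frequency.items() if count == highest)

print(modes)
```

Returning a list rather than a single number keeps the sketch usable for bimodal data, where two values share the highest frequency.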
Mode
Ordered ages (Obs 1 to 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
Age   Frequency
27    2
28    3
29    4
30    5
31    2
32    1
33    0
34    1
35    0
36    1
37    1
Mode
The most frequent value of the variable
Mode = 30
[Figure: histogram of Frequency by Age (years); the tallest bar is at age 30]
FINDING MODE FROM LENGTH OF
STAY DATA
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Mode = 10
FINDING MODE FROM HISTOGRAM
MODE SENSITIVE TO OUTLIERS?
[Figure: two histograms with vertical axis Population, one titled Unimodal Distribution and one titled Bimodal Distribution]
MODE PROPERTIES / USES
Easiest measure to understand,
explain, identify
Always equals an original value
Insensitive to extreme values
(outliers)
Good descriptive measure, but poor
statistical properties
May be more than one mode
May be no mode
Does not use all the data
MEDIAN
Definition: Median is the middle
value; also, the value that splits the
distribution into two equal parts
50% of observations are below the median
50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value at the middle
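The method above can be written as a short Python function. This is a minimal sketch, assuming a non-empty list of numbers; the function name `median_by_position` and the variables `ages_19` and `ages_20` are illustrative and cover the odd and even cases with the student ages from this lecture.

```python
def median_by_position(values):
    """Return the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)                 # step 1: arrange in order
    n = len(ordered)
    position = (n + 1) / 2                   # step 2: middle position (1-based)
    lower = ordered[int(position) - 1]       # value at, or just below, the middle
    if position == int(position):            # odd n: a single middle value
        return lower
    upper = ordered[int(position)]           # even n: the next value up
    return (lower + upper) / 2               # average of the two middle values

ages_20 = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
           30, 30, 30, 30, 31, 31, 32, 34, 36, 37]
ages_19 = ages_20[:-1]                       # drop the last value to get odd n

median_odd = median_by_position(ages_19)     # position 10
median_even = median_by_position(ages_20)    # position 10.5

print(median_odd, median_even)
```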
Median:
Odd Number of Values
Ordered ages (Obs 1 to 19): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36
n = 19
Median observation = (n + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
Median age = 30 years
Median:
Even Number of Values
Ordered ages (Obs 1 to 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
n = 20
Median observation = (n + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
Median age = Average value between 10th and
11th observation = (30 + 30) / 2 = 30 years
FIND MEDIAN OF LENGTH OF STAY DATA;
IS MEDIAN SENSITIVE TO OUTLIERS?
Original data (n = 30):
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Median at 50% = 10
With the largest value changed from 49 to 149:
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 149
Median at 50% = 10 (unchanged)
MEDIAN PROPERTIES / USES
Does not use all the data
available
Insensitive to extreme values
(outliers)
Good descriptive measure but
poor statistical properties
Measure of choice for skewed
data
Equals an original value if n is
odd
Quartiles
Definition: Quartile is the value that splits
the distribution into four equal parts
25% of observations are below the first quartile (Q1)
25% of observations are between Q1 and Q2 (median)
25% of observations are between Q2 (median) and Q3
25% of observations are above Q3
Quartiles
Ordered ages (Obs 1 to 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
n = 20
Q1 observation = round((n + 1) / 4) = round((20 + 1) / 4) = round(21 / 4) = 5.25 ~ 5th obs
Q2 observation = 10.5 (median)
Q3 observation = round(3(n + 1) / 4) = round(3(20 + 1) / 4) = round(3(21) / 4) = 15.75 ~ 16th obs
Q1 age = 28
Q2 age = 30
Q3 age = 31
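The quartile positions for the 20 student ages can be reproduced with a short Python sketch. It assumes the rounding rule used on this slide, (n + 1) / 4 and 3(n + 1) / 4 rounded to the nearest observation; the helper names `round_half_up` and `quartile_positions` are illustrative. Statistical packages often interpolate between observations instead, so their quartiles may differ slightly.

```python
import math

def round_half_up(x):
    """Round to the nearest whole number, with .5 rounded upwards."""
    return math.floor(x + 0.5)

def quartile_positions(n):
    """1-based positions of Q1, Q2 and Q3 for n ordered observations."""
    q1 = round_half_up((n + 1) / 4)
    q2 = (n + 1) / 2                 # median position, may end in .5
    q3 = round_half_up(3 * (n + 1) / 4)
    return q1, q2, q3

# Ordered ages of the 20 students (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

q1_pos, q2_pos, q3_pos = quartile_positions(len(ages))
q1_age = ages[q1_pos - 1]            # convert 1-based position to list index
q3_age = ages[q3_pos - 1]

print(q1_pos, q2_pos, q3_pos, q1_age, q3_age)
```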
Percentiles
Value of the variable that splits the
distribution in 100 equal parts
35 % of observations are below the 35th percentile
65 % of observations are above 35th percentile
Percentiles
Ordered ages (Obs 1 to 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
Values (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27             2      10%                    10%
28             3      15%                    25%    <- 25th Percentile
29             4      20%                    45%
30             5      25%                    70%
31             2      10%                    80%
32             1      5%                     85%
34             1      5%                     90%    <- 90th Percentile
36             1      5%                     95%
37             1      5%                     100%
Total          20     100%
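A cumulative percent table like the one on this slide can be built as follows. This is a sketch under the assumption that a percentile is read off as the smallest age whose cumulative percent reaches the requested level; the names `rows` and `percentile_value` are illustrative.

```python
from collections import Counter

# Ordered ages of the 20 students (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

total = len(ages)
counts = Counter(ages)

# Each row holds (age, frequency, percent, cumulative percent).
rows = []
cumulative = 0
for age in sorted(counts):
    cumulative += counts[age]
    rows.append((age, counts[age],
                 100 * counts[age] / total,
                 100 * cumulative / total))

def percentile_value(p):
    """Smallest age whose cumulative percent is at least p."""
    for age, _freq, _percent, cumulative_percent in rows:
        if cumulative_percent >= p:
            return age

for row in rows:
    print("%d  %d  %.0f%%  %.0f%%" % row)
```

Reading percentiles from cumulative percentages is one of several conventions; interpolating methods can give values between observed ages.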
ARITHMETIC MEAN
Arithmetic mean = average value
Method for identification
1. Sum up all of the values
2. Divide the sum by the
number of observations
(n)
Arithmetic Mean
Ordered ages (Obs 1 to 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
$\bar{x} = \frac{\sum x_i}{n}$
n = 20
$\sum x_i = 605$
$\bar{x} = \frac{605}{20} = 30.25$
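The same calculation can be checked in Python. A minimal sketch, assuming the 20 student ages from this lecture; the variable names are illustrative, and `statistics.mean` from the standard library is used only as a cross-check.

```python
import statistics

# Ordered ages of the 20 students (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

total = sum(ages)          # step 1: sum up all of the values
n = len(ages)              # number of observations
mean_age = total / n       # step 2: divide the sum by n

# Cross-check against the standard library implementation.
library_mean = statistics.mean(ages)

print(total, n, mean_age, library_mean)
```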
FINDING THE MEAN LENGTH OF STAY DATA
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Sum = 360
n = 30
Mean = 360 / 30 = 12
CENTERING PROPERTY OF
MEAN
Each observation minus the mean (12):
 0 - 12 = -12      9 - 12 = -3     12 - 12 =  0
 2 - 12 = -10      9 - 12 = -3     13 - 12 =  1
 3 - 12 =  -9     10 - 12 = -2     14 - 12 =  2
 4 - 12 =  -8     10 - 12 = -2     16 - 12 =  4
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 5 - 12 =  -7     10 - 12 = -2     18 - 12 =  6
 6 - 12 =  -6     10 - 12 = -2     19 - 12 =  7
 7 - 12 =  -5     11 - 12 = -1     22 - 12 = 10
 8 - 12 =  -4     12 - 12 =  0     27 - 12 = 15
 9 - 12 =  -3     12 - 12 =  0     49 - 12 = 37
Column sums:  -71               -17                88
Sum of all deviations = -71 - 17 + 88 = 0
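The centering property, that deviations below the mean exactly balance deviations above it, can be verified with a few lines of Python. This sketch assumes the 30 length of stay values from this lecture; the variable names are illustrative.

```python
# Length of stay data (nights), n = 30.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mean_stay = sum(stays) / len(stays)

# Deviation of each observation from the mean.
deviations = [x - mean_stay for x in stays]

# Totals of the deviations below and above the mean.
negative_total = sum(d for d in deviations if d < 0)
positive_total = sum(d for d in deviations if d > 0)

print(mean_stay, negative_total, positive_total, sum(deviations))
```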
MEAN USES ALL DATA,
SO SENSITIVE TO OUTLIERS
[Figure: histogram of Nights of stay, 0 to 50]
Mean = 12.0 for the original data
Mean = 15.3 when the largest value is 149 instead of 49
When to use the arithmetic mean?
Centered distribution
Approximately
symmetrical
Few extreme values
(outliers)
OK!
ARITHMETIC MEAN PROPERTIES /
USES
Probably best known measure of
central location
Uses all of the data
Affected by extreme values (outliers)
Best for normally distributed data
Not usually equal to one of the
original values
Good statistical properties
Var A   Var B   Var C
0       0       0
0       4       1
1       4       2
1       4       3
1       5       4
5       5       5
9       5       6
9       6       7
9       6       8
10      6       9
10      10      10
For each variable,
find the:
Sum
Mean
Median
Mode
Minimum value
Maximum value
          Var A   Var B     Var C
Sum:      55      55        55
Mean:     5       5         5
Median:   5       5         5
Mode:     1, 9    4, 5, 6   none
Min:      0       0         0
Max:      10      10        10
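The exercise answers can be checked with a short Python sketch. It assumes the three variables as listed in the exercise (11 values each) and treats a variable in which every value occurs once as having no mode; the helper `modes` and the dictionary `summary` are illustrative names.

```python
import statistics
from collections import Counter

variables = {
    "A": [0, 0, 1, 1, 1, 5, 9, 9, 9, 10, 10],
    "B": [0, 4, 4, 4, 5, 5, 5, 6, 6, 6, 10],
    "C": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

summary = {}
for name, values in variables.items():
    summary[name] = {
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": modes(values),
        "min": min(values),
        "max": max(values),
    }

for name, stats in summary.items():
    print(name, stats)
```

All three variables share the same sum, mean, median, minimum and maximum, yet their modes and spread differ, which is the point of the exercise.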
Comparison of Mode, Median and Mean
Symmetrical:
Mode = Median = Mean
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
Measures of Central Location Summary
Measure of Central Location: single measure
that represents an entire distribution
Mode: most common value
Median: central value
Arithmetic mean: average value
Mean uses all data, so sensitive to outliers
Mean has best statistical properties
Mean preferred for normally distributed data
Median preferred for skewed data
Same center
but
different dispersions
MEASURES OF SPREAD
Definition: Measures that quantify
the variation or dispersion of a set
of data from its central location
Also known as:
Measure of dispersion
Measure of variation
Common measures
Range
Interquartile range
Variance / standard deviation
Standard error
95% confidence interval
RANGE
Definition: difference between largest and
smallest values
Properties / Uses
Greatly affected by outliers
Usually used with median
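A small Python sketch shows how strongly a single outlier affects the range. It assumes the length of stay data from this lecture and a second copy in which the largest value, 49, is replaced by 149; the function name `value_range` is illustrative.

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

# Length of stay data (nights), n = 30.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Same data with the largest value replaced by an extreme outlier.
stays_with_outlier = stays[:-1] + [149]

range_original = value_range(stays)
range_outlier = value_range(stays_with_outlier)

print(range_original, range_outlier)
```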
FINDING THE RANGE OF LENGTH OF
STAY DATA
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Range = 49 - 0 = 49
RANGE SENSITIVE TO OUTLIERS?
[Figure: histogram of Nights of stay, 0 to 50]
With the largest value changed from 49 to 149: Range = 149 - 0 = 149
INTERQUARTILE RANGE
Definition: the spread of the central 50% of a distribution, from the first quartile (Q1) to the third quartile (Q3)
Properties / Uses
Used with median
Five-number summary for box-and-whiskers diagram:
Maximum (100%, largest value)
Third quartile (75%)
Median (50%)
First quartile (25%)
Minimum (0%, smallest value)
INTERQUARTILE RANGE
LENGTH OF STAY DATA
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Q1 = 25th percentile @ (30 + 1) / 4 = 7.75, read here as the 7th observation = 6
Median = 50th percentile @ 15.5 = 10
Q3 = 75th percentile @ 3 (30 + 1) / 4 = 23.25, read here as the 23rd observation = 14
BOX-AND-WHISKERS DIAGRAM
LENGTH OF STAY DATA
BOX-AND-WHISKERS DIAGRAMS
VARIABLES A, B, C
VARIANCE AND STANDARD
DEVIATION
Definition: measures of variation that
quantify how closely clustered the
observed values are to the mean
Variance
= average of squared deviations
from mean
= Sum (x - mean)^2 / (n - 1)
Standard deviation
= square root of variance
EQUATIONS FOR VARIANCE AND
STANDARD DEVIATION
$\bar{x}$ : mean
$x_i$ : value
$n$ : number
$s^2$ : variance
$SD$ : standard deviation
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
STEPS TO CALCULATE VARIANCE AND
STANDARD DEVIATION
$\bar{x}$ : mean
$x_i$ : value
$n$ : number
$s^2$ : variance
$SD$ : standard deviation
1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n - 1: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
6. Take the square root of the variance: $SD = \sqrt{s^2}$
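The six steps can be followed one by one in Python. This is a sketch applied to the length of stay data from this lecture; the variable names are illustrative, and `statistics.variance` from the standard library (which also divides by n - 1) is used only as a cross-check.

```python
import math
import statistics

# Length of stay data (nights), n = 30.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stays)
mean_stay = sum(stays) / n                      # step 1: arithmetic mean
deviations = [x - mean_stay for x in stays]     # step 2: subtract the mean
squared = [d ** 2 for d in deviations]          # step 3: square each difference
sum_of_squares = sum(squared)                   # step 4: sum the squares
variance = sum_of_squares / (n - 1)             # step 5: divide by n - 1
sd = math.sqrt(variance)                        # step 6: square root

# Cross-check against the standard library sample variance.
library_variance = statistics.variance(stays)

print(sum_of_squares, round(variance, 2), round(sd, 2))
```

For these data the sum of squared deviations is 2448, giving a variance of about 84.4 and a standard deviation of about 9.2 nights.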
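LENGTH OF STAY DATA
(0 - 12)^2 = 144     (9 - 12)^2 = 9      (12 - 12)^2 = 0
(2 - 12)^2 = 100     (9 - 12)^2 = 9      (13 - 12)^2 = 1
(3 - 12)^2 = 81      (10 - 12)^2 = 4     (14 - 12)^2 = 4
(4 - 12)^2 = 64      (10 - 12)^2 = 4     (16 - 12)^2 = 16
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(6 - 12)^2 = 36      (10 - 12)^2 = 4     (19 - 12)^2 = 49
(7 - 12)^2 = 25      (11 - 12)^2 = 1     (22 - 12)^2 = 100
(8 - 12)^2 = 16      (12 - 12)^2 = 0     (27 - 12)^2 = 225
(9 - 12)^2 = 9       (12 - 12)^2 = 0     (49 - 12)^2 = 1369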
STANDARD DEVIATION PROPERTIES /
USES
Standard deviation usually
calculated only when data are more
or less normally distributed (bell
shaped curve)
For normally distributed data,
about 68% of the data fall within 1 SD of the mean
about 95% of the data fall within 2 SD of the mean
about 99.7% of the data fall within 3 SD of the mean
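The shares of a normal distribution lying within 1, 2 and 3 SD of the mean can be computed with Python's standard library. This is a minimal sketch using `statistics.NormalDist` (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)
within_2 = share_within(2)
within_3 = share_within(3)

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```

The results are approximately 0.683, 0.954 and 0.997, so the third figure is closer to 99.7% than to 99%.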
NORMAL DISTRIBUTION
[Figure: normal curve centered on the Mean, with the Standard deviation marked; 68% and 95% central areas, and 2.5% in each tail]
Match the Measures of Central Location & Spread
Mode
Standard deviation
Median
Arithmetic mean
Range
Interquartile range
NAME THE APPROPRIATE
MEASURES OF CENTRAL LOCATION AND SPREAD
Distribution                   Central Location   Spread
Single peak, symmetrical       Mean*              Standard deviation
Skewed or data with outliers   Median             Range or interquartile range
* Median and mode will be similar
Properties of
Measures of Central Location & Spread
For quantitative / continuous variables
Mode: simple, descriptive, not always useful
Median: best for skewed data
Arithmetic mean: best for normally distributed
data
Range: use with median
Standard deviation: use with mean
Standard error: used to construct confidence
intervals
[Figure: distribution of Population by Age, annotated with Mode, Median, Minimum, 1st quartile, 3rd quartile, Maximum, Interquartile interval and Range]
Measures of Shapes
THE NORMAL DISTRIBUTION
Many variables have a normal
distribution. This is a bell shaped curve
with most of the values clustered near the
mean and a few values out near the tails.
MEASURES OF VARIATION
Range is defined as the difference in value
between the highest (maximum) and the lowest
(minimum) observation
Variance is defined as the sum of the squares of
the deviations about the sample mean divided by
one less than the total number of items.
Standard deviation is the square root of the
variance
[Figure: histogram with vertical axis Fraction and horizontal axis Var]
The normal distribution is
symmetrical around the
mean. The mean, the median
and the mode of a normal
distribution have the same
value.
An important characteristic of
a normally distributed
variable is that about 95% of the
measurements have values
within approximately
2 standard deviations
(SD) of the mean.
ESTIMATIONS
The basic problems to which Statistics
are applied in practice arise when trying
to deduce something about a population
from the evidence provided by a sample
of observations taken from that
population.
The population
parameters do not change
and remain constant
whereas the sample
estimates can change and
take any random value.
Quantities described by both population
parameters and sample estimates:
Mean
Standard deviation (SD)
Proportion
Correlation coefficient
HOW TO DETERMINE THE
EXTENT TO WHICH THE
SAMPLE REPRESENTS THE
POPULATION AS A WHOLE.
To find out to what extent a
particular sample value
deviates from the population
value, a range or an interval
around the sample value can
be worked out which will most
probably contain the
population value.
This range or interval is called
the CONFIDENCE INTERVAL.
The calculation of a confidence
interval takes into account the
STANDARD ERROR. The standard
error gives an estimate of the
degree to which the sample mean
varies from the population mean. It
is computed on the basis of the
standard deviation.
The standard error for the mean is
calculated by dividing the standard
deviation by the square root of the
sample size:
$SE(\bar{x}) = \frac{\text{standard deviation}}{\sqrt{\text{sample size}}} = \frac{SD}{\sqrt{n}}$
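The standard error calculation can be sketched in Python. The example assumes the 20 student ages from this lecture and uses `statistics.stdev` (the sample standard deviation, dividing by n - 1); the function name `standard_error` is illustrative.

```python
import math
import statistics

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Ordered ages of the 20 students (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

sd = statistics.stdev(ages)              # sample standard deviation
se = standard_error(sd, len(ages))

# With the same SD, four times as many observations halve the standard error.
se_larger_sample = standard_error(sd, 4 * len(ages))

print(round(sd, 3), round(se, 3), round(se_larger_sample, 3))
```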
95% CONFIDENCE INTERVAL
When describing variables statistically
you usually present the calculated
sample mean plus and minus 1.96 times the standard error:
$\bar{x} \pm 1.96 \times SE(\bar{x})$
This is then called the 95%
CONFIDENCE INTERVAL. It means
that, if the sampling were repeated many
times, about 95% of the intervals calculated
in this way would contain the population mean.
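A 95% confidence interval for the mean can be computed as sketched below. The example assumes the 20 student ages from this lecture and the multiplier 1.96 given in the text; for a sample this small a multiplier from the t distribution is often used instead, which would give a slightly wider interval. Variable names are illustrative.

```python
import math
import statistics

# Ordered ages of the 20 students (years).
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

mean_age = statistics.mean(ages)
se = statistics.stdev(ages) / math.sqrt(len(ages))   # standard error of the mean

margin = 1.96 * se                                   # half-width of the interval
lower = mean_age - margin
upper = mean_age + margin

print(round(lower, 2), round(upper, 2))
```

For these data the interval runs from about 29.06 to 31.44 years around the sample mean of 30.25.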
Note that the larger the sample
size, the smaller the standard
error and the narrower the
confidence interval will be. Thus
the advantage of having a large
sample size is that the sample
mean will be a better estimate of
the population mean.
If the sample size is large, even small
differences can be statistically significant,
whereas with a small sample a large
difference may not achieve statistical
significance. This is why
Confidence Intervals are calculated.