Biostat Lecture 3-1
1
Methods of data organization and presentation
2
Precise methods of analysis can be decided
upon only when the characteristics of the
data are understood.
3
Generally, summarizing and organizing data can
be achieved through:
1. Frequency Distributions
2. Graphical Representations
3. Measures of central tendency
4. Measures of variability
4
Frequency Distributions
o For data to be more easily appreciated and to draw
quick comparisons, it is often useful to arrange the data
in the form of a table, or in one of a number of different
graphical forms.
o If this is not done, the raw data convey little
meaning and any pattern in them may go
undetected.
Array
Array (ordered array) is a serial arrangement of
numerical data in an ascending or descending order.
7
• The actual summarization and organization of data
starts from frequency distribution.
8
• For nominal and ordinal data, frequency distributions
are often used as a summary.
• Example:
10
a) Qualitative variable: Count the number of cases in
each category.
11
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
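A minimal sketch of how the ICU table can be produced by counting cases per category. The list `icu_types` is rebuilt from the counts in the table, and the helper name `frequency_table` is an illustrative assumption, not part of the lecture.

```python
from collections import Counter

# One entry per patient, rebuilt from the counts in the ICU table.
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}

table = frequency_table(icu_types)
for cat, (freq, rel) in table.items():
    print(f"{cat:<10}{freq:>4}{rel:>8.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the table.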
12
Example 2:
A study was conducted to assess the characteristics of a
group of 234 smokers by collecting data on gender and
other variables.
Gender, 1 = male, 2 = female
13
b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed in
one, and only one, of the intervals.
14
For a continuous variable (e.g.,
age), the frequency distribution
of the individual ages is not very
informative.
15
• We “see more” in
frequencies of age
values in “groupings”.
Here, 10 year groupings
make sense.
• Grouped data
frequency distribution
16
To determine the number of class intervals and the
corresponding width, we may use:
Sturges' rule:
$$K = 1 + 3.322 \log_{10} n$$
$$W = \frac{L - S}{K}$$
where
K = number of class intervals
n = no. of observations
W = width of the class interval
L = the largest value
S = the smallest value
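A short sketch of Sturges' rule, applied to the leisure-time example used in this lecture (n = 40, largest value 38, smallest value 10). Rounding K to the nearest whole number is an assumption made here; the function name `sturges` is illustrative.

```python
import math

def sturges(n, largest, smallest):
    """Return (K, W): number of class intervals and class width."""
    k = 1 + 3.322 * math.log10(n)
    k_rounded = round(k)                      # assumption: round K to nearest integer
    width = (largest - smallest) / k_rounded
    return k_rounded, width

k, w = sturges(n=40, largest=38, smallest=10)
print(k, round(w, 2))
```

In practice the width is rounded to a convenient value, which is why a width of 5 hours fits this example.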
17
Example:
Leisure time (hours) per week for 40 college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19
27 29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15
21 25 16
18
Time       Frequency   Relative    Cumulative Relative
(Hours)                Frequency   Frequency
10-14      5           0.125       0.125
15-19      11          0.275       0.400
20-24      12          0.300       0.700
25-29      7           0.175       0.875
30-34      3           0.075       0.950
35-39      2           0.050       1.00
Total      40          1.00
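The grouped table can be reproduced from the 40 raw values. The sketch below assumes integer-valued data and classes of width 5 starting at 10; `grouped_table` is an illustrative helper name.

```python
hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 10, 19,
         27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15,
         21, 25, 16]

def grouped_table(data, start, width, k):
    """Return rows of (lower, upper, frequency, relative, cumulative relative)."""
    n = len(data)
    rows, cumulative = [], 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width - 1             # integer class limits, e.g. 10-14
        freq = sum(lower <= x <= upper for x in data)
        cumulative += freq
        rows.append((lower, upper, freq, freq / n, cumulative / n))
    return rows

rows = grouped_table(hours, start=10, width=5, k=6)
for lower, upper, freq, rel, cum in rows:
    print(f"{lower}-{upper}  {freq:>3}  {rel:.3f}  {cum:.3f}")
```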
19
• Cumulative frequency: the sum of the frequencies of a
class and of all the classes before it.
20
• True limits: Are those limits that make an
interval of a continuous variable continuous in
both directions
21
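Time       True limit   Mid-point   Frequency
(Hours)
10-14      9.5-14.5     12          5
15-19      14.5-19.5    17          11
20-24      19.5-24.5    22          12
25-29      24.5-29.5    27          7
30-34      29.5-34.5    32          3
35-39      34.5-39.5    37          2
Total                               40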
22
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity by
age, 1989
Age group Cases
(years) Number Percent
26
Diagrammatic Representation
27
Importance of diagrammatic representation:
29
Limitations of Diagrammatic Representation
1. The technique of diagrammatic representation is
used only for purposes of comparison. It is not
to be used when comparison is either not possible
or not necessary.
2. Diagrammatic representation is not an alternative
to tabulation. It only strengthens the textual
exposition of a subject, and cannot serve as a
complete substitute for statistical data.
3. It can give only an approximate idea, so
where greater accuracy is needed diagrams will not
be suitable.
4. Diagrams fail to bring small differences to light.
30
Construction of graphs
The choice of the particular form among the
different possibilities will depend on personal
choices and/or the type of the data.
Bar charts and pie charts are commonly used
for qualitative or quantitative discrete data.
Histograms and frequency polygons are used for
quantitative continuous data.
31
There are, however, general rules that are commonly
accepted about construction of graphs:
1. Every graph should be self-explanatory and as simple as
possible.
2. Titles are usually placed below the graph and should
answer the questions: What? Where? When? How classified?
3. Legends or keys should be used to differentiate variables
if more than one is shown.
4. The axis labels should be placed to read from the left side
and from the bottom.
5. The units into which the scale is divided should be
clearly indicated.
6. The numerical scale representing frequency must start at
zero, or a break in the line should be shown.
32
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave space
between bars)
• The different bars should be separated by equal
distances
• All the bars should rest on the same line called
the base
• Label both axes clearly
33
Specific types of graphs include:
• Nominal, ordinal data:
– Bar graph
– Pie chart
• Quantitative data:
– Histogram
– Stem-and-leaf plot
– Box plot
– Scatter plot
– Line graph
• Others
34
1. Bar Chart
Bar diagrams are used to represent and compare the
frequency distribution of discrete variables and
attributes or categorical series
35
A. Simple bar chart:
• It is a one-dimensional diagram in which the bar
represents the whole of the magnitude.
36
[Figure: simple bar chart. Y-axis: number of children (0 to 90). X-axis: immunization status (not immunized, partially immunized, fully immunized).]
37
B. Multiple bar chart
In this type of chart the component figures are
shown as separate bars adjoining each
other.
The height of each bar represents the actual
value of the component figure.
It depicts the distributional pattern of more than
one variable.
– Example of a multiple bar diagram: consider
data on the immunization status of women by marital
status.
38
Fig. 2 TT Immunization status by marital status of women 15-49
years, Asendabo town, 1996
39
There’s no reason why the bar chart can’t be
plotted horizontally instead of vertically.
[Figure: horizontal bar chart. X-axis: percent (0 to 50). Y-axis: type of source (CHA, HC, Reading, Training, Campaign, Anti FGMC, CAT). Separate bars for female and male.]
40
Example: Construct a bar chart for the following data.
41
Distribution of patients in hospital X by source of referral, 1999
[Figure: bar chart. Y-axis: no. of patients (0 to 800). X-axis: source of referral (other hospital, GP, OPD, casualty, other). Bar labels shown: 769, 623, 256, 161, 97.]
42
C. Component ( sub-divided) Bar Diagram
Bars are sub-divided into component parts of the
figure.
These sorts of diagrams are constructed when each
total is built up from two or more component
figures.
They can be of two kinds:
I) Actual Component Bar Diagrams: When the overall
height of the bars and the individual component
lengths represent actual figures.
Example of actual component bar diagram: The
above data can also be presented as below.
43
44
II) Percentage Component Bar Diagram
45
46
2. Pie chart
• Shows the relative frequency for each category by
dividing a circle into sectors, the angles of which are
proportional to the relative frequency.
• Used for a single categorical variable
• Use percentage distributions
47
Steps to construct a pie-chart
• Construct a frequency table
49
Distribution of cause of death for females in England and Wales, 1989
[Figure: pie chart with the following sectors]
Cause of death          Percent
Circulatory system      42%
Neoplasms               30%
Respiratory system      13%
Others                  8%
Digestive system        4%
Injury and poisoning    3%
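Because each sector's angle is proportional to its relative frequency, the angles for the cause-of-death pie chart can be sketched as follows. The percentages are read from the chart labels; `sector_angles` is an illustrative helper name.

```python
percentages = {
    "Circulatory system": 42,
    "Neoplasms": 30,
    "Respiratory system": 13,
    "Others": 8,
    "Digestive system": 4,
    "Injury and poisoning": 3,
}

def sector_angles(percent_by_category):
    """Return the angle in degrees for each category (whole circle = 360)."""
    total = sum(percent_by_category.values())
    return {cat: 360 * p / total for cat, p in percent_by_category.items()}

angles = sector_angles(percentages)
for cat, angle in angles.items():
    print(f"{cat:<22}{angle:6.1f} degrees")
```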
50
3. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned into
graphs.
51
• Bars are drawn over the intervals in such a
way that the area of each bar is
proportional to the frequency of its
interval.
52
Example: Distribution of the age of women at the time of marriage
Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2
[Figure: histogram, "Age of women at the time of marriage". Y-axis: no. of women (0 to 40). X-axis: age group, using the true limits 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5.]
53
Histogram for the ages of 2087 mothers with <5 children,
Adami Tulu, 2003
[Figure: histogram. Y-axis scale: 0 to 700. X-axis: age of mother (variable N1AGEMOTH).]
54
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups
are lost and difficult to reconstruct
55
4. Stem-and-Leaf Plot
• A quick way to organize data to give a visual impression
similar to a histogram while retaining much more detail
on the data.
57
Steps …
3. Write the second stem (first stem +1) below the first
stem
4. Continue with the remaining stems until you reach
the largest stem in the data set
5. Draw a vertical bar to the right of the column of
stems
6. For each number in the data set, find the appropriate
stem and write the leaf to the right of the vertical
bar
58
Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248,
3323, 3314, 3484, 3541, 3649 (BWT in g)
60
• Similarly we could speak of other percentiles:
– P0: The minimum
– P25 (the 25th percentile): 25% of the sample values are
less than or equal to this value. This is the 1st quartile.
62
5. Scatter plot
• Most studies in medicine involve measuring more than
one characteristic, and graphs displaying the relationship
between two characteristics are common in literature.
63
• For two quantitative variables we use
bivariate plots (also called scatter plots or
scatter diagrams).
64
• A scatter diagram is constructed by drawing X- and Y-axes.
• Each point or dot (•) represents a pair of
values measured for a single study subject.
[Figure: scatter plot. Y-axis: saturation of bile (0 to 140). X-axis: age (0 to 80).]
65
• The graph suggests the possibility of a positive
relationship between age and percentage
saturation of bile in women.
66
6. Line graph
• Useful for assessing the trend of a particular situation over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the
horizontal axis, and
• Values of the quantity being studied are marked on the vertical
axis.
• Values for each category are connected by a continuous line.
• Sometimes two or more lines are drawn on the same graph
using the same scale so that the plotted lines are
comparable.
67
No. of microscopically confirmed malaria cases by species and
month at Zeway malaria control unit, 2003
[Figure: line graph. Y-axis: no. of confirmed malaria cases (0 to 2100). X-axis: months (Jan to Dec). Lines: Positive, P. falciparum, P. vivax.]
68
A line graph can also be used to depict the relationship
between two continuous variables, like a scatter
diagram.
69
Response to administration of zidovudine in two groups of AIDS
patients in hospital X, 1999
[Figure: line graph. Y-axis: blood zidovudine concentration (0 to 8). X-axis: time since administration in minutes (10, 20, 70, 80, 100, 120, 170, 190, 250, 300, 360).]
70
Exercise
• Evaluate whether the following graphs are
good or bad, and discuss the points that
make them good or bad
71
MMRatio per 100,000 live births by age of woman;
Giza, Egypt 1984
[Figure: graph. Y-axis: MMR per 100,000 LB (0 to 1200). X-axis: age (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49).]
72
1. The title of the graph tells the reader the
content of the graph. For example:
• the statistic presented (MMRatio);
• the second dimension of the graph (age of
woman on the x axis);
• the metric (per 100,000 live births);
• the source of the data (Giza, Egypt);
• The date (1984);
73
2. The Y axis is labeled (MMR per 100,000
LB);
3. The X axis is labeled (age of woman);
4. The legend is given (_______= MMR);
5. The source of the information is provided
(Kane et al)
74
Maternal Mortality:
Countries X, Y and Z since 1850
[Figure: graph with an unlabeled y-axis (0 to 900) and unlabeled x-axis categories: Sweden, UK, USA.]
75
• The Y axis is not labeled;
• The title does not give you the statistic presented in
the graph (Maternal Mortality is not a statistic). This
is particularly problematic when the Y axis is also not
labeled;
• Neither the title nor the Y axis identifies the metric (per
100,000 live births).
• The X axis is not labeled – but this is not so serious
when the categories are so obvious and when the
second dimension (year) has been identified in the
graph title.
76
Remember:
A graph is a tool.
It is not an artwork to
[Figure: bar chart of pre-eclampsia and eclampsia by period (antepartum, intrapartum, postpartum). Y-axis scale: 0 to 14.]
77
Numerical Summary Measures
Measures of dispersion
78
• A frequency distribution is a general picture of
the distribution of a variable
79
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is a certain
stage at which the largest number of items tend to
cluster.
81
[Figure: histogram illustrating position. Y-axis scale: 0 to 20. X-axis classes: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.]
82
Characteristics of a good MCT
An MCT is good or satisfactory if it possesses the
following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values
as possible
4. It should have a definite value
5. It should not be subjected to complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling
83
• The most common measures of central
tendency include:
– Arithmetic Mean
– Median
– Mode
– Others
84
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set
and by far the most widely used measure of central
location. It is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total
number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
85
The Summation Notation
86
87
The heart rates for n=10 patients were as follows (beats
per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these
patients?
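A minimal sketch of the calculation for the ten heart rates above; `arithmetic_mean` is an illustrative helper name.

```python
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

mean_hr = arithmetic_mean(heart_rates)   # 1298 / 10
print(f"Mean heart rate: {mean_hr:.1f} beats per minute")
```

Note how the single low value (40) pulls the mean below most of the observations.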
88
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a
particular class interval are located at the mid-point of the interval. It is calculated as
follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
89
Example. Compute the mean age of 169 subjects
from the grouped data.
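A sketch of the grouped-mean calculation. The class mid-points and frequencies are taken from the grouped age table (169 subjects) that appears in the variance example later in this lecture; `grouped_mean` is an illustrative helper name.

```python
# (mid-point, frequency) for the classes 10-19, 20-29, ..., 60-69
age_classes = [(14.5, 4), (24.5, 66), (34.5, 47), (44.5, 36), (54.5, 12), (64.5, 4)]

def grouped_mean(classes):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    total_mf = sum(m * f for m, f in classes)
    n = sum(f for _, f in classes)
    return total_mf, n, total_mf / n

total_mf, n, mean_age = grouped_mean(age_classes)
print(f"sum(m*f) = {total_mf}, n = {n}, mean = {mean_age:.2f} years")
```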
90
The mean can be thought of as a “balancing
point”, “center of gravity”
91
When the data are skewed, the mean is “dragged” in
the direction of the skewness
• It is possible in extreme cases for all but one of the sample points
to be on one side of the arithmetic mean & in this case, the mean is
a poor measure of central location or does not reflect the center of
the sample.
92
Properties of the Arithmetic Mean.
• For a given set of data there is one and only one
arithmetic mean (uniqueness).
98
b) Grouped data
• In calculating the median from grouped data, we assume
that the values within a class-interval are evenly
distributed through the interval.
101
• n/2 = 84.5, which falls in the 3rd class interval
• Lower limit = 29.5, Upper limit = 39.5
• Frequency of the class = 47
• (n/2 – fc) = 84.5-70 = 14.5
102
Properties of the median
• There is only one median for a given set of data
(uniqueness)
103
Quartiles
• Just as the median is the value above and below which
lie half the set of data, one can define measures above
or below which lie other fractional parts of the data.
104
a) The first quartile (Q1): 25% of all the ranked
observations are less than Q1.
106
3. Mode
• The mode is the most frequently occurring value among
all the observations in a set of data.
107
[Figure: frequency distributions with their modes marked. Y-axis: N (0 to 20). Credit: T. Ancelle, D. Coulombie]
a) Ungrouped data
• It is a value which occurs most frequently in a
set of values.
• If all the values are different there is no
mode, on the other hand, a set of values may
have more than one mode.
109
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
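The three examples above can be checked with a small counting routine. This is a sketch: the function name `modes` is illustrative, and it follows the slide's convention that a data set in which every value occurs once has no mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])
print(unimodal, bimodal, no_mode)
```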
110
b) Grouped data
• To find the mode of grouped data, we
usually refer to the modal class, where the
modal class is the class interval with the
highest frequency.
• If a single value for the mode of grouped
data must be specified, it is taken as the
mid-point of the modal class interval.
111
$$\hat{x} = L + w \cdot \frac{f_2}{f_0 + f_2}$$
where
L = lower boundary of the modal class
f0 = the frequency of the class next below the modal class
in value
f2 = the frequency of the class next above the modal class
in value
w = length of the interval of the modal class
112
113
Properties of mode
It is not affected by extreme values
It can be calculated for distributions with
open end classes
Often its value is not unique
The main drawback of mode is that often it
does not exist
114
4. Geometric mean (GM)
• Mainly used in many types of laboratory data,
specifically data in the form of concentrations of one
substance in another
• Example: the minimum inhibitory concentration of
penicillin in urine for N. gonorrhoeae in 74 patients
Concentration   Frequency
0.03125         21
0.0625          6
0.1250          8
0.250           19
0.50            17
1.0             3
115
If x_1, x_2, ..., x_n are n positive observed values, then
$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$
and
$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}.$$
The geometric mean is generally used with data measured on a logarithmic scale, such
as titers of anti-neutrophil immunoglobulin G.
116
Example:
logGM = [21log(0.03125) + 6log(0.0625) +
8log(0.125) + 19log(0.25) + 17log(0.5)
+ 3log(1.0)]/74 = -0.846
The GM = the antilogarithm of -0.846 = 0.143
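The same calculation can be sketched in code, using base-10 logarithms and the concentration and frequency pairs from the example (the frequencies sum to 74). The function name is illustrative.

```python
import math

# concentration -> number of patients
mic = {0.03125: 21, 0.0625: 6, 0.125: 8, 0.25: 19, 0.5: 17, 1.0: 3}

def geometric_mean(freq_by_value):
    """Return (log10 GM, GM) for values weighted by their frequencies."""
    n = sum(freq_by_value.values())
    log_gm = sum(f * math.log10(x) for x, f in freq_by_value.items()) / n
    return log_gm, 10 ** log_gm

log_gm, gm = geometric_mean(mic)
print(f"log GM = {log_gm:.3f}, GM = {gm:.3f}")
```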
117
5. Harmonic mean (HM)
• Just as the geometric mean is based on an
arithmetic mean of logarithms, so is the
harmonic mean based on arithmetic mean
of the reciprocals.
• Pertains to rates and time
• We define it as the reciprocal of the
arithmetic mean of the reciprocal of the
given numbers.
118
If the given numbers are x_1, x_2, ..., x_n, then
$$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$$
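A small sketch of the definition. The data are hypothetical and not from the lecture: two speeds (40 and 60 km/h) travelled over equal distances, a typical "rates and time" situation in which the harmonic mean gives the average speed.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return 1 / (sum(1 / x for x in values) / n)

speeds = [40, 60]          # hypothetical speeds in km/h over equal distances
hm = harmonic_mean(speeds)
am = sum(speeds) / len(speeds)
print(f"Harmonic mean = {hm:.1f}, arithmetic mean = {am:.1f}")
```

For positive values that are not all equal, the harmonic mean is smaller than the arithmetic mean.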
119
6. Weighted mean (WM)
• In a weighted mean, separate outcomes have
separate influences.
120
Example:
121
Which measure of central tendency is best with a given
set of data?
122
• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be used
for ordinal data
• The mode can be used for all types of data,
but may be especially useful for nominal and
ordinal measurements
• For discrete or continuous data, the “modal
class” can be used
123
• The geometric mean is used primarily for
observations measured on a logarithmic
scale.
• Harmonic mean is a suitable MCT when the
data pertains to rates and time.
• Weighted mean is commonly used in the
calculation of mean for different outcomes.
124
(a) Symmetric and unimodal distribution —
Mean, median, and mode should all be
approximately the same
125
(b) Bimodal — Mean and median should be
about the same, but may take a value that is
unlikely to occur; two modes might be best
126
(c) Skewed to the right (positively skewed):
the mean is sensitive to extreme values, so the median
might be more appropriate
[Figure: right-skewed distribution with mode < median < mean]
127
(d) Skewed to the left (negatively skewed):
same as (c)
[Figure: left-skewed distribution with mean < median < mode]
128
Measures of Dispersion
Consider the following two sets of data:
Two or more sets may have the same mean and/or median but they
may be quite different.
129
These two distributions have the same mean,
median, and mode
130
• MCT are not enough to give a clear
understanding about the distribution of the
data.
131
Measures of Dispersion
• Measures that quantify the variation or dispersion of
a set of data from its central location
132
Measures of Dispersion
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measures of Scatter”
133
• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others
134
1. Range (R)
• The difference between the largest and smallest
observations in a sample.
• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• Data sets with a higher range exhibit more
variability
135
Properties of range
It is the simplest crude measure and can be
easily understood
It takes into account only two values which
causes it to be a poor measure of dispersion
Very sensitive to extreme observations
The larger the sample size, the larger the
range tends to be
136
2. Interquartile range (IQR)
• Indicates the spread of the middle 50% of the
observations, and used with median
IQR = Q3 - Q1
137
The two quartiles (Q3 and Q1) form the basis of the
box-and-whisker plot.
[Figure: box-and-whisker plots for Variable A, Variable B and Variable C. Y-axis scale: 0 to 10.]
138
Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two
specific values
• It is important in selecting cut-off points in the
formulation of clinical standards
• Since it excludes the lowest and highest 25% values,
it is not affected by extreme values
• Less sensitive to the size of the sample
139
3. Quartile deviation (QD)
$$QD = \frac{Q_3 - Q_1}{2}$$
4. Coefficient of quartile deviation (CQD)
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
• CQD is a pure number (unitless) and is
useful to compare the variability among the
middle 50% of observations.
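A sketch of the quartile-based measures, using the eight values from the range example (5, 9, 12, 16, 23, 34, 37, 42). Quartiles depend on the convention used; the code assumes the default "exclusive" method of Python's `statistics.quantiles`, so other textbooks or packages may give slightly different Q1 and Q3.

```python
import statistics

data = [5, 9, 12, 16, 23, 34, 37, 42]

# Assumption: default "exclusive" quartile convention of statistics.quantiles
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                    # interquartile range
qd = (q3 - q1) / 2               # quartile deviation
cqd = (q3 - q1) / (q3 + q1)      # coefficient of quartile deviation (unitless)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, QD = {qd}, CQD = {cqd:.3f}")
```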
141
5. Mean deviation (MD)
• Mean deviation is the average of the absolute
deviations taken from a central value, generally
the mean or median.
• Consider a set of n observations x1, x2, ..., xn.
Then:
$$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$$
• ‘A’ is a central value (arithmetic mean or
median).
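A minimal sketch of the mean deviation, reusing the eight values from the range example and taking the deviations first about the mean and then about the median. The function name is illustrative.

```python
import statistics

data = [5, 9, 12, 16, 23, 34, 37, 42]

def mean_deviation(values, center):
    """Average of the absolute deviations from a central value A."""
    return sum(abs(x - center) for x in values) / len(values)

md_about_mean = mean_deviation(data, statistics.mean(data))       # A = 22.25
md_about_median = mean_deviation(data, statistics.median(data))   # A = 19.5
print(md_about_mean, md_about_median)
```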
142
Properties of mean deviation:
MD overcomes one main objection to the earlier
measures: it takes every value into account
143
6. Variance (σ², s²)
• The main objection of mean deviation, that
the negative signs are ignored, is removed by
taking the square of the deviations from the
mean.
144
• The deviations are squared because the sum of the
deviations of the individual observations of a
sample about the sample mean is always 0:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
• The variance can be thought of as an average
of squared deviations
145
• Variance is used to measure the dispersion of
values relative to the mean.
• When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
– Population variance = σ2
– Sample variance = S2
146
a) Ungrouped data
Let X1, X2, ..., XN be the measurements on N
population units, then:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
is the population mean.
147
A sample variance is calculated for a sample of individual values
(X1, X2, ..., Xn) and uses the sample
mean ($\bar{x}$) rather than the population mean µ:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$
148
Degrees of freedom
• In computing the variance there are (n-1)
degrees of freedom because only (n-1) of the
deviations are independent from each other
• The last one can always be calculated from the
others automatically.
• This is because the sum of the deviations from
their mean (Xi-Mean) must add to zero.
149
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals
150
Properties of Variance:
The main disadvantage of variance is that
its unit is the square of the unit of the
original measurement values
The variance gives more weight to the
extreme values as compared to those
which are near to mean value, because the
difference is squared in variance.
• The drawbacks of variance are overcome
by the standard deviation.
151
7. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the same
scale as that of the individual values.
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$
152
• Following are the survival times of n=11
patients after heart transplant surgery.
153
154
Example. Compute the variance and SD of the age of 169 subjects from
the grouped data.
Mean = 5810.5/169 = 34.38 years
S2 = 20197.64/(169-1) = 120.22
SD = √S2 = √120.22 = 10.96
Class
interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)2   (mi-Mean)2 fi
10-19      14.5   4      -19.88      395.21       1580.86
20-29      24.5   66     -9.88       97.61        6442.55
30-39      34.5   47     0.12        0.0144       0.68
40-49      44.5   36     10.12       102.41       3686.92
50-59      54.5   12     20.12       404.81       4857.77
60-69      64.5   4      30.12       907.21       3628.86
Total             169                1907.26      20197.64
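A sketch of the grouped variance and SD for the same age data, computed without rounding the mean along the way, so the last digits can differ slightly from a hand calculation. The function name is illustrative.

```python
import math

# (mid-point, frequency) for the classes 10-19, 20-29, ..., 60-69
age_classes = [(14.5, 4), (24.5, 66), (34.5, 47), (44.5, 36), (54.5, 12), (64.5, 4)]

def grouped_variance(classes):
    """Return (mean, sample variance, SD) for grouped data."""
    n = sum(f for _, f in classes)
    mean = sum(m * f for m, f in classes) / n
    sum_sq = sum((m - mean) ** 2 * f for m, f in classes)
    variance = sum_sq / (n - 1)          # n - 1 degrees of freedom
    return mean, variance, math.sqrt(variance)

mean_age, var_age, sd_age = grouped_variance(age_classes)
print(f"mean = {mean_age:.2f}, variance = {var_age:.2f}, SD = {sd_age:.2f}")
```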
155
Properties of SD
• The SD has the advantage of being expressed in the
same units of measurement as the mean
156
SD Vs Standard Error (SE)
• SD describes the variability among individual values
in a given data set
• SE is used to describe the variability among
separate sample means obtained from one sample
to another
• For example, imagine 5,000 samples, each of the same size n=11. This would
produce 5,000 sample means. This new collection has its own pattern of
variability. We describe this new pattern of variability using the SE, not the
SD.
158
Example: The heart transplant surgery
n=11, SD=168.89, Mean=161 days
• What happens if we repeat the study? What will our next mean
be? Will it be close? How different will it be? Focus here is on the
generalizability of the study findings.
• The behavior of mean from one replication of the study to the
next replication is referred to as the sampling distribution of
mean.
• We can also have sampling distribution of the median or the SD
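The slides do not give a formula for the SE. A common estimate of the standard error of the mean is SD/√n; the sketch below applies that assumption to the heart transplant figures (n = 11, SD = 168.89 days).

```python
import math

def standard_error(sd, n):
    """Estimated standard error of the mean, assuming SE = SD / sqrt(n)."""
    return sd / math.sqrt(n)

se = standard_error(sd=168.89, n=11)
print(f"SE of the mean = {se:.2f} days")
```

The SE shrinks as n grows, whereas the SD describes the spread of individual values and does not shrink systematically with sample size.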
159
8. Coefficient of variation (CV)
• When two data sets have different units of
measurements, or their means differ
sufficiently in size, the CV should be used as
a measure of dispersion.
• It is the best measure to compare the
variability of two series (sets) of
observations.
• Data with a smaller coefficient of variation are
considered more consistent.
160
• CV is the ratio of the SD to the mean multiplied by 100.
$$CV = \frac{S}{\bar{x}} \times 100$$
              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0
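The two CV values in the table can be reproduced with a one-line function; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (SD / mean) * 100."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(sd=15, mean=130)           # systolic blood pressure
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)   # cholesterol
print(f"SBP: {cv_sbp:.1f}%, cholesterol: {cv_cholesterol:.1f}%")
```

Although the two variables are measured in different units, the CVs are directly comparable: cholesterol is relatively more variable than SBP here.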
161
NOTE:
• The range often appears with the median as a
numerical summary measure
• The IQR is used with the median as well
• The SD is used with the mean
• For nominal and ordinal data, a table or graph
is often more effective than any numerical
summary measure
162