Biostatistics 3
Descriptive Statistics
Techniques used to organize and summarize a set
of data.
The best way to work with data is to summarize
and organize them.
Numbers that have not been summarized and
organized are called raw data.
Methods of data organization and
presentation
• The data collected in a survey is called raw data.
• Collected data need to be organized in such a way
as to condense the information they contain in a
way that will show patterns of variation clearly.
• Precise methods of analysis can be decided upon
only when the characteristics of the data are
understood.
• Quite often, the presentation of data in a
meaningful way is done by preparing a
frequency distribution.
• If this is not done, the raw data will not present
any meaning and any pattern in them (if any)
may not be detected.
Frequency Distributions
• Ordered array: A simple arrangement of
individual observations in order of magnitude.
• The actual summarization and organization of
data starts from frequency distribution.
a) Qualitative variable: Count the number of cases
in each category.
Example 1:
• 25 patients entering the ICU at a given hospital, classified by ICU type:
1. Medical
2. Surgical
3. Cardiac
4. Other
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
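As a minimal sketch of how such a table can be produced, the following Python snippet counts the cases in each category and divides by the total. The list `icu_types` is rebuilt from the counts in the ICU table (the order of individual patients is made up), and the helper name `frequency_table` is hypothetical.

```python
from collections import Counter

# One entry per patient; counts match the ICU table, the ordering is invented.
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

def frequency_table(values):
    """Return {category: (frequency, relative frequency)} for a qualitative variable."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, f / n) for category, f in counts.items()}

table = frequency_table(icu_types)
for category, (f, rel) in table.items():
    print(f"{category:<10}{f:>4}{rel:>8.2f}")
total_rel = sum(rel for _, rel in table.values())
print(f"{'Total':<10}{len(icu_types):>4}{total_rel:>8.2f}")
```

The relative frequencies should always add up to 1, which is a quick check that no case was dropped or counted twice.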
Example 2:
• To determine the number of class intervals and
the corresponding width, we may use:
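Sturges' rule:
$$K = 1 + 3.322 \log_{10} n, \qquad W = \frac{L - S}{K}$$
where K = number of class intervals, n = number of observations,
W = class width, L = largest value, S = smallest value.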
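A short Python sketch of Sturges' rule follows. It assumes the usual reading of the symbols (K = number of classes, n = number of observations, W = width, L and S = largest and smallest values). Rounding K to the nearest integer and rounding W up are conventions chosen here for illustration, and the values 39 and 10 are assumed, not taken from real data.

```python
import math

def sturges_classes(n):
    """Number of class intervals K = 1 + 3.322 * log10(n), before rounding."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Class width W = (L - S) / K."""
    return (largest - smallest) / k

n = 40                                   # number of observations
k = round(sturges_classes(n))            # rounding to nearest is an assumption
# largest = 39 and smallest = 10 are assumed values for illustration only
w = math.ceil(class_width(39, 10, k))    # rounding the width up is an assumption
print(k, w)
```

Rounding the width up rather than down helps ensure that K classes of width W cover the whole range of the data.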
Total 40 1.00
• Cumulative frequency: The running total obtained by
adding the frequency of a class to the frequencies of
all the classes before it.
• Cumulative relative frequency: The Cumulative
Relative Frequency is the sum of the relative
frequencies for all values that are less than or
equal to the given value.
• Mid-point: The value of the interval which lies
midway between the lower and the upper limits of
a class.
• True limits: Those limits that make an
interval of a continuous variable continuous in
both directions.
• Used for smoothing of the class intervals.
• Subtract 0.5 from the lower limit and add 0.5 to the
upper limit.
Time (Hours)   True limits    Mid-point   Frequency
10-14          9.5 - 14.5     12          5
15-19          14.5 - 19.5    17          11
20-24          19.5 - 24.5    22          12
25-29          24.5 - 29.5    27          7
30-34          29.5 - 34.5    32          3
35-39          34.5 - 39.5    37          2
Total                                     40
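The columns of this table can be derived mechanically from the class limits and frequencies. The sketch below is one way to do it in Python; the function name `summarize` and the dictionary keys are hypothetical, and the ±0.5 adjustment assumes the data were recorded to the nearest whole hour.

```python
# (lower limit, upper limit, frequency) for each class in the hours table
classes = [(10, 14, 5), (15, 19, 11), (20, 24, 12),
           (25, 29, 7), (30, 34, 3), (35, 39, 2)]

def summarize(classes):
    """Add true limits, mid-point, relative and cumulative frequencies to each class."""
    n = sum(f for _, _, f in classes)
    rows, cumulative = [], 0
    for lower, upper, f in classes:
        cumulative += f
        rows.append({
            "true_limits": (lower - 0.5, upper + 0.5),
            "mid_point": (lower + upper) / 2,
            "frequency": f,
            "rel_freq": f / n,
            "cum_freq": cumulative,
            "cum_rel_freq": cumulative / n,
        })
    return rows

rows = summarize(classes)
for r in rows:
    print(r["true_limits"], r["mid_point"], r["frequency"],
          round(r["rel_freq"], 3), round(r["cum_rel_freq"], 3))
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1.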
Exercise
Calculate
• Number of class intervals
• Width then
• Frequency, relative frequency and cumulative
relative frequency
Exercise
Ages of 35 mothers who delivered at Burao General
Hospital in June 2019.
24 30 40 18 19 26 20 32 19 22
21 26 23 18 19 27 20 18 28 30
32 38 29 41 26 27 39 23 31
25 33 34
35 18 37
Calculate
• Number of class intervals
• Width then
• Frequency, relative frequency and cumulative
relative frequency
Exercise
The following table shows the number of hours
45 hospital patients slept following the
administration of a certain anesthetic.
7 10 12 4 8 7 3 8 5 12 11 3 8
1 1
13 10 4 4 5 5 8 7 7 3 2 3 8
13 1 7 17 3 4 5 5 3 1 17 10 4
7 7 11 8
Calculate
• Number of class intervals
• Width then
• Frequency, relative frequency and cumulative
relative frequency
Exercise
Wages (in dollars) of 25 workers at company X in 2018.
24 33 18 19 26 19
21 23 19 27 20 25
28 30 29 32 29 31
35 27 22 28 26 34
24
Calculate
• Number of class intervals
• Width then
• Frequency, relative frequency and cumulative
relative frequency
Exercise
This data set shows the distribution of the
ages of 28 men at the time of marriage in district Y
in 2017.
19 33 18 19 26
24 21 23 19 27
20 25 28 30 29
32 29 31 35 27
22 28 26 34 24
30 22 18
Calculate
• Number of class intervals
• Width then
• Frequency, relative frequency and cumulative
relative frequency
Tables can also be used to present more than
one variable.
Guidelines for Constructing Tables
• Keep them simple,
• Limit the number of variables to three or fewer,
• All tables should be self-explanatory,
• Include a clear title telling what, when and where,
• Clearly label the rows and columns,
• Explain codes and abbreviations in a footnote,
• Show totals,
• If the data are not original, indicate the source in a footnote.
Graphs
• Well designed graphs can be powerful means
of communicating a great deal of information.
• When graphs are poorly designed, they not
only fail to convey the message but are
often misleading.
Specific types of graphs include:
• Bar graph, pie chart: for nominal, ordinal and discrete data
• Histogram, scatter plot, line graph: for continuous data
• Others
Bar charts (or graphs)
• Categories are listed on the horizontal axis (X-axis)
• Frequencies or relative frequencies are
represented on the Y-axis (ordinate).
• The height of each bar is proportional to the
frequency or relative frequency of observations
in that category.
[Figure: Bar chart of ICU type for the 25 patients]
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together
• The different bars should be separated by equal
distances
• All the bars should rest on the same line called
the base.
Example: Construct a bar chart for the following data
Distribution of patients in hospital by source of referral
Source of referral        No. of patients   Relative freq. (%)
Other hospital            97                5.1
General practitioner      769               40.3
Out-patient department    623               32.7
Casualty                  256               13.4
Other                     161               8.5
Total                     1 906             100.0
[Figure: Bar chart "Distribution of patients in hospital X by source of referral, 1999". X-axis: Source of referral; Y-axis: No. of patients. Bar heights: Other hospital 97, GP 769, OPD 623, Casualty 256, Other 161.]
Sub-divided Bar chart
• If there are different quantities forming the sub-divisions
of the totals, simple bars may be sub-divided in the ratio of
the various sub-divisions to exhibit the relationship of the
parts to the whole.
• The order in which the components are shown in
a “bar” is followed in all bars used in the
diagram.
• Example: Plasmodium species distribution for
confirmed malaria cases, Zeway, 2003.
[Figure: Sub-divided bar chart. X-axis: August, October, December 2003; Y-axis: Percent (0 to 100); components of each bar: P. falciparum, P. vivax, Mixed.]
Pie chart
• Shows the relative frequency for each
category by dividing a circle into sectors, the
angles of which are proportional to the
relative frequency.
• Use percentage distributions
• Used for a single categorical variable
Steps to construct a pie-chart
• Construct a frequency table
• Change the frequency into percentage (P)
• Change the percentages into degrees, where:
degrees = (P / 100) × 360°
• Draw a circle and divide it accordingly
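The conversion from frequencies to percentages and then to sector angles can be sketched in Python as below. The counts are those of the cause-of-death example that follows; the function name `pie_sectors` is hypothetical.

```python
# Number of deaths by cause (females, England and Wales, 1989)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasms": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

def pie_sectors(frequencies):
    """Return {category: (percentage, angle in degrees)} for a pie chart."""
    total = sum(frequencies.values())
    return {
        category: (100 * f / total, 360 * f / total)
        for category, f in frequencies.items()
    }

sectors = pie_sectors(deaths)
for category, (percent, degrees) in sectors.items():
    print(f"{category:<22}{percent:6.1f}%{degrees:8.1f} degrees")
```

The angles should add up to 360 degrees, just as the percentages add up to 100.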
Example: Distribution of deaths for females, in
England and Wales, 1989.
Cause of death              Number of deaths
Circulatory system (C)      100 000
Neoplasms (Cancer) (N)      70 000
Respiratory system (R)      30 000
Injury and poisoning (I)    6 000
Digestive system (D)        10 000
Others                      20 000
Total                       236 000
[Figure: Pie chart "Distribution of cause of death for females, in England and Wales, 1989". Sectors: Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%, Digestive system 4%, Injury and poisoning 3%.]
Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned
into graphs.
• To construct a histogram, we draw the interval
boundaries on a horizontal line and the
frequencies on a vertical line.
• Non-overlapping intervals that cover all of the
data values must be used.
• Example: Distribution of the age of women at
the time of marriage
[Figure: Histogram "Age of women at the time of marriage". X-axis: Age group (true limits 14.5-19.5 to 44.5-49.5); Y-axis: No. of women.]
• Histogram for the ages of 2087 mothers with
<5 children, Adami Tulu, 2003.
[Figure: Histogram. X-axis: age of mother (N1AGEMOTH), 15 to 55 years; Y-axis: frequency; N = 2087.]
Frequency Polygon
• A frequency distribution can be portrayed
graphically in yet another way by means of a
frequency polygon.
• To draw a frequency polygon we connect the
mid-points of the tops of the bars of the
histogram with straight lines.
• Frequency polygon for the ages of 2087
mothers with <5 children, Adami Tulu, 2003.
[Figure: Frequency polygon. X-axis: age of mother (N1AGEMOTH), 15 to 55 years; Y-axis: frequency; N = 2087.]
It can also be drawn without erecting rectangles, by
joining the points plotted above the class mid-points
at heights equal to the class frequencies, as follows:
[Figure: Frequency polygon "Age of women at the time of marriage". X-axis: Age (class mid-points 12, 17, 22, 27, 32, 37, 42, 47); Y-axis: No. of women.]
Percentiles (Quartiles)
• Suppose that 50% of a cohort survived at least 4
years.
• This also means that 50% survived at most 4
years.
• We say 4 years is the median.
• The median is also called the 50th percentile
• We write: P50 = 4 years
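The idea that the median splits the data in half can be checked numerically. The sketch below uses a small set of hypothetical survival times (invented for illustration, not from any study) and Python's `statistics.median`.

```python
import statistics

# Hypothetical survival times in years for a small cohort (illustrative only)
survival_years = [1, 2, 3, 4, 5, 7, 9]

p50 = statistics.median(survival_years)   # the 50th percentile
n = len(survival_years)
share_at_least = sum(x >= p50 for x in survival_years) / n
share_at_most = sum(x <= p50 for x in survival_years) / n

print(f"P50 = {p50} years")
print(f"survived at least P50: {share_at_least:.0%}")
print(f"survived at most P50:  {share_at_most:.0%}")
```

With a small sample and a value exactly equal to the median, both shares are slightly above 50%, so "at least half" is the safer wording.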
• It is possible to estimate the values of percentiles
from a cumulative frequency polygon.
Numerical Summary Measures
– Single numbers which quantify the
characteristics of a distribution of values:
  • Measures of central tendency or location
  • Measures of dispersion
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is a
certain point around which the largest number of items
tend to cluster.
• Since this point is usually in the centre of the
distribution, the tendency of statistical data
to concentrate around a certain value is called
“central tendency”.
• The various methods of determining the point
about which the observations tend to concentrate
are called Measures of Central Tendency
(MCT).
• The objective of calculating MCT is to determine
a single figure which may be used to represent
the whole data set.
• Since an MCT represents the entire data set, it
facilitates comparison within one group or
between groups of data.
Characteristics of a good MCT
An MCT is good or satisfactory if it possesses
the following characteristics.
1. MCT should be based on all the
observations
2. It should not be affected by the extreme
values
3. It should have a definite value
4. It should not require complicated
and tedious calculations
5. It should be stable with regard to sampling
• The most common measures of central
tendency include:
– Mean
– Median
– Mode
– Others
The Arithmetic Mean or simple Mean
a) Ungrouped Data
• The arithmetic Mean is the "average" which is
obtained by adding all the values in a sample or
population and dividing them by the number of
values.
The heart rates for 10 patients were as follows
(beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the average heart rate for these patients?
The sample mean:
$$\bar{X} = \frac{\sum X_i}{n}$$
$\bar{X}$ = (167 + 120 + 150 + 125 + 150 + 140 + 40 + 136 + 120 + 150)/10
= 1298/10 = 129.8 beats per minute
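The same calculation can be written as a few lines of Python. This is only a sketch; the function name `arithmetic_mean` is hypothetical.

```python
# Heart rates (beats per minute) of the 10 patients in the example
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

total = sum(heart_rates)
mean_hr = arithmetic_mean(heart_rates)
print(total, mean_hr)
```

Note that the single low value of 40 pulls the mean down; without it the mean of the remaining nine values would be higher.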
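b) Grouped data:
In calculating the mean from grouped data, we
assume that all values falling into a particular class
are located at the mid-point of the interval. It is
calculated as follows:
$$\bar{X} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where k = number of class intervals, m_i = mid-point of the
i-th class interval and f_i = frequency of the i-th class interval.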
• Example: Compute the mean age of 169 subjects from
the grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
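To finish this example, the mid-points are weighted by their frequencies and the total is divided by the number of subjects. A minimal Python sketch, with the hypothetical helper name `grouped_mean`:

```python
# (lower limit, upper limit, frequency) for each age class
classes = [(10, 19, 4), (20, 29, 66), (30, 39, 47),
           (40, 49, 36), (50, 59, 12), (60, 69, 4)]

def grouped_mean(classes):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i), with m_i the class mid-point."""
    n = sum(f for _, _, f in classes)
    weighted_sum = sum((lower + upper) / 2 * f for lower, upper, f in classes)
    return weighted_sum / n, weighted_sum, n

mean_age, weighted_sum, n = grouped_mean(classes)
print(n, weighted_sum, round(mean_age, 2))
```

Because every value is treated as if it sat at its class mid-point, the result is an approximation to the mean of the raw ages.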
When the data are skewed, the mean is “dragged”
in the direction of the skewness.
Properties of the Arithmetic Mean
Median
a) Ungrouped data
• The median is the value which divides the data
set into two equal parts.
• If the number of values is odd, the median will
be the middle value when all values are arranged
in order of magnitude.
• When the number of observations is even, there
is no single middle value but two middle
observations.
• In this case the median is the mean of these two
middle observations, when all observations have
been arranged in the order of their magnitude.
Example 1
• Calculate the median of these biostatistics exam
results.
65 50 85 46 70 75 60 80 90
Solution:
First arrange the sample in ascending order
46 50 60 65 70 75 80 85 90
Since n = 9, it is an odd number
Median position = (n + 1)/2 = (9 + 1)/2 = 5
So, the median is the 5th number, which is 70
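Example 2
• Calculate the median of these Pharmacology exam
results.
65 50 85 46 70 75 60 80 90 40
Solution:
• First arrange the sample in ascending order
40 46 50 60 65 70 75 80 85 90
• Since n = 10 is an even number, the median is the mean of the
two middle values (the 5th and 6th): (65 + 70)/2 = 67.5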
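Both cases (odd and even n) can be handled by one small function. The sketch below follows the rule stated for ungrouped data: sort the values, then take the middle value, or the mean of the two middle values. The function name `median` is a hypothetical helper.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                   # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2     # even n: mean of the two middle values

biostatistics = [65, 50, 85, 46, 70, 75, 60, 80, 90]          # n = 9
pharmacology = [65, 50, 85, 46, 70, 75, 60, 80, 90, 40]       # n = 10

print(median(biostatistics))
print(median(pharmacology))
```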
b) Grouped data
• Find n/2 and locate the first class interval whose
cumulative frequency is at least n/2; this is the
median class.
• Then, use the following formula.
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
where,
L_m = lower true class boundary of the interval containing the median
F_c = cumulative frequency of the interval just before (in the table, the row above) the median class interval
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations
Example: Compute the median age of 169 subjects from the grouped data.
n/2 = 169/2 = 84.5
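The remaining steps can be sketched in Python. This assumes the 169 subjects are the same ones tabulated in the grouped-mean example (classes 10-19 to 60-69), and that true class boundaries are obtained by subtracting and adding 0.5; the function name `grouped_median` is hypothetical.

```python
# (lower limit, upper limit, frequency); assumed to be the same table as the grouped-mean example
classes = [(10, 19, 4), (20, 29, 66), (30, 39, 47),
           (40, 49, 36), (50, 59, 12), (60, 69, 4)]

def grouped_median(classes):
    """Median = Lm + ((n/2 - Fc) / fm) * W, using true class boundaries."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0                       # Fc: cumulative frequency before the current class
    for lower, upper, f in classes:
        if cumulative + f >= half:       # first class whose cumulative frequency reaches n/2
            lm = lower - 0.5             # lower true boundary of the median class
            width = (upper + 0.5) - lm   # class width W
            return lm + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes given")

median_age = grouped_median(classes)
print(round(median_age, 2))
```

Here the median class is 30-39 (cumulative frequency 117, the first to reach 84.5), so Lm = 29.5, Fc = 70, fm = 47 and W = 10.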
Properties of the Median
• There is only one median for a given set of
data
• The median is easy to calculate
• Median is a positional average and hence it is
not drastically affected by extreme values
• It is not a good representative of data if the
number of items is small
The median is a better description (than the
mean) of the majority when the distribution
is skewed
Example:
• Data are: 14, 89, 93, 95, 96
• Skewness is reflected in the outlying
low value of 14
• The sample mean is 77.4
• The median is 93
Mode
• The mode is the most frequently occurring
value in a set of data.
• It is not influenced by extreme values.
• It is possible to have more than one mode or
no mode.
• It is not a good summary of the majority of the
data.
Example:
a) Ungrouped data
• Find the modal values for the following data
  i) 22, 66, 69, 70, 73 (no mode)
  ii) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (mode = 3.0 kg)
  iii) 9, 2, 10, 9, 5, 10, 8, 4, 12 (modes = 9 and 10)
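These three cases can be reproduced with a small Python sketch. It adopts the convention used in the example that a data set in which every value occurs only once has no mode; the function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                 # convention: no value repeats, so no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([22, 66, 69, 70, 73]))
print(modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))
print(modes([9, 2, 10, 9, 5, 10, 8, 4, 12]))
```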
b) Grouped data
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with open
end classes
• Often its value is not unique
• The main drawback of mode is that often it
does not exist
Exercise
• Calculate the mean, median and mode from the
following ungrouped data:
22, 10, 12, 23, 25, 20, 24, 26, 25, 12
Exercise
• Calculate the mean, median and mode from the
following ungrouped data:
18 17 17 18 18 19 9 34 18 10 22
19 10 15 14 10 10 16 21 11 1
Exercise
• Find the mean, median and mode of the following
grouped data.
Class interval   Mid-point   Frequency   Cumulative frequency
1-5              3           6           6
6-10             8           3           9
11-15            13          5           14
16-20            18          10          24
21-25            23          11          35
26-30            28          10          45
Total                        45
Measures of Dispersion
• Dispersion refers to the variability exhibited by the
values of the data.
• The amount of dispersion is small when the values are
close together and large when they are spread out.
• MCT are not enough to give a clear
understanding about the distribution of the data.
• Moreover, two or more sets may have the same
mean and/or median but they may be quite
different.
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70
Set 2: 50 49 49 51 48 50 53
• The two data sets given above have a mean of
50, but obviously set 1 is more “spread out”
than set 2.
• How do we express this numerically?
• We need to know something about the
variability or spread of the values — whether
they tend to be clustered close together, or
spread out over a broad range.
• Measures of dispersion include:
– Range
– Variance
– Standard deviation
– Others
1. Range (R)
• The range is the difference between the largest
and smallest values in the set of observations.
• These values are often called the maximum
and the minimum.
Example
Set 1: 60 40 30 50 60 40 70
Set 2: 50 49 49 51 48 50 53
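For these two sets the range is a one-line calculation, sketched here in Python (the function name `data_range` is a hypothetical helper):

```python
set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]

def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

range_1 = data_range(set_1)
range_2 = data_range(set_2)
print(range_1, range_2)
```

Although both sets have a mean of 50, the range of set 1 is much larger than that of set 2, which puts a number on the impression that set 1 is more spread out.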
Properties of range
• It is the simplest crude measure and can be
easily understood
• It takes into account only two values which
causes it to be a poor measure of dispersion
• Very sensitive to extreme values
2. Variance (σ², S²)
A sample variance is calculated for a sample of
individual values (X1, X2, … Xn) and uses the sample
mean $\bar{X}$ rather than the population mean µ:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
3. Standard deviation (σ, S)
• It is the positive square root of the variance:
$\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
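As an illustration, the sketch below computes the sample variance and standard deviation for the two data sets used earlier to introduce dispersion. It assumes the usual sample formula with divisor n − 1; the function names are hypothetical.

```python
import math

set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name, round(sample_variance(data), 2), round(sample_sd(data), 2))
```

Both sets have a mean of 50, but the variance and SD of set 1 are far larger, and the SD is expressed in the same units as the data.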
Properties of Variance
• The main demerit of variance is that its unit is
the square of the unit of the original
measurement values
• The variance gives more weight to the extreme
values as compared to those which are near to
mean value, because the difference is squared in
variance.
• The drawbacks of variance are overcome by the
standard deviation.
Example
• Following are the survival times of n=11
patients after heart transplant surgery.
• Patients are identified numerically, from 1 to
11.
• The survival time for the “ith” patient is
represented as Xi for i= 1, …, 11.
• Calculate the sample variance and SD.
Properties of SD
• The SD has the advantage of being expressed in
the same units of measurement as the mean.
• SD is considered to be the best measure of
dispersion and is used widely because of the
properties of the theoretical normal curve.
• However, if the units of measurement of the
variables in two data sets are not the same, then
their variability can’t be compared by comparing
the values of SD.
Exercise
• Find the mean, median, mode, range, variance
and standard deviation of the following exam
results, giving your answers to 2 decimal places:
74, 72, 83, 96, 64, 79, 88, 69
Exercise
• Find the mean, median, mode, range, variance
and standard deviation of the following exam
results, giving your answers to 2 decimal places:
81, 81, 88, 72, 79, 81, 85, 72, 89, 72, 80, 72, 90,
71, 90, 88, 81, 90
Exercise
• Find the mean, range, variance and standard
deviation of the following data set, giving your
answers to 2 decimal places:
16 24 18 14 20 36 26 23 16 15 19 20 22 14
19 10 19 27 29 22 38 34 32 23 19 21 31 16
28 19 12 27 15 21 25