Chapter 3: Descriptive Biostatistics
Descriptive Biostatistics
Oct, 2024
Descriptive Statistics
• Numbers that have not been summarized and organized are called
raw data.
• Before interpretation & communication of the findings, the raw
data must be organized, summarized and presented in a clear and
understandable way.
A. Describing categorical variables
• Table of frequency distributions
– Frequency
– Relative frequency
– Cumulative frequencies
• Charts
– Bar charts
– Pie charts
Frequency distributions
• Simple and effective way of summarizing categorical data
• The actual summarization and organization of data starts from
frequency distribution
• Done by counting the number of observations falling into each of
the categories or levels of the variables.
E.g. Birth weight with levels 'Very low', 'Low', 'Normal' and 'Big'.
• The frequency distribution for newborns is obtained simply by
counting the number of newborns in each birth weight category.
Relative Frequency
• It is the proportion or percentage of observations in each category of a
variable.
Cumulative frequency
• It is the number of observations in a given category of a variable plus the
observations in all categories below it.
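The three quantities defined above can be sketched in a few lines of Python. The ten observations below are hypothetical and serve only to illustrate the counting; they are not the hospital data.

```python
from collections import Counter
from itertools import accumulate

# Ordered levels of the variable (from the birth weight example).
levels = ["Very low", "Low", "Normal", "Big"]

# Hypothetical observations for 10 newborns (illustrative only).
observations = ["Normal", "Low", "Normal", "Big", "Normal",
                "Very low", "Low", "Normal", "Normal", "Big"]

counts = Counter(observations)
freq = [counts[level] for level in levels]   # frequency per category
n = sum(freq)
rel_freq = [100 * f / n for f in freq]       # relative frequency in percent
cum_freq = list(accumulate(freq))            # cumulative frequency

for level, f, r, c in zip(levels, freq, rel_freq, cum_freq):
    print(f"{level:<9} {f:>3} {r:>6.1f}% {c:>3}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the counts.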
Table 1. Distribution of birth weight of newborns between Sept-Oct, 2020 at 'X' Hospital.
B. Describing quantitative variables
– Frequency
– Relative frequency
– Cumulative frequencies
• Select a set of contiguous, non-overlapping intervals such that
each value can be placed in one and only one of the intervals.
To determine the number of class intervals and the corresponding
width, we may use:
Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \frac{L - S}{K}$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example:
Leisure time (hours) per week for 40 college students:
23 24 18 14 20 36 24 26 23 21
16 15 19 20 22 14 13 10 19 27
29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
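As a rough check, Sturges' rule can be applied to these 40 values in Python. Rounding K to the nearest integer and rounding the width up to a convenient whole number are assumptions made here; the rule itself only gives a guide.

```python
import math

leisure_hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21,
                 16, 15, 19, 20, 22, 14, 13, 10, 19, 27,
                 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
                 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

n = len(leisure_hours)
k_exact = 1 + 3.322 * math.log10(n)   # Sturges' rule
k = round(k_exact)                     # assumed: round to nearest integer

largest, smallest = max(leisure_hours), min(leisure_hours)
w_exact = (largest - smallest) / k     # W = (L - S) / K
w = math.ceil(w_exact)                 # assumed: round width up

print(f"n={n}, K={k_exact:.2f} -> {k}, W={w_exact:.2f} -> {w}")
```

With these choices the rule suggests about 6 classes of width 5.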
• Class limits: the range of values for each class
– Upper class limit
– Lower class limit
• True limits (class boundaries): subtract 0.5 from the lower class limit and add 0.5 to the upper class limit
Time (hours)   True limits (class boundaries)   Mid-point   Frequency
10-14          9.5 - 14.5                       12          5
15-19          14.5 - 19.5                      17          11
20-24          19.5 - 24.5                      22          12
25-29          24.5 - 29.5                      27          7
30-34          29.5 - 34.5                      32          3
35-39          34.5 - 39.5                      37          2
Total                                                       40
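The frequency table for the leisure-time example can be rebuilt from the raw values with a short sketch. The starting limit of 10 and the width of 5 are taken from the table; the variable names are illustrative.

```python
leisure_hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21,
                 16, 15, 19, 20, 22, 14, 13, 10, 19, 27,
                 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
                 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

width = 5
table = []
for lower in range(10, 40, width):          # lower class limits 10, 15, ..., 35
    upper = lower + width - 1               # upper class limit
    frequency = sum(lower <= x <= upper for x in leisure_hours)
    table.append({
        "class": f"{lower}-{upper}",
        "boundaries": (lower - 0.5, upper + 0.5),  # true limits
        "mid_point": (lower + upper) / 2,
        "frequency": frequency,
    })

for row in table:
    print(row)
print("Total", sum(row["frequency"] for row in table))
```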
Types of tables
Guidelines for constructing tables
• Keep them simple
• Show totals
Specific types of graphs include:
• Bar graph and pie chart: for nominal, ordinal and discrete data
1. Bar charts (Graphs)
• Categories are listed on the horizontal axis (X-axis)
A. Simple bar chart: a one-dimensional chart in which each bar
represents the whole of the magnitude (only one variable).
[Figure: simple bar chart; y-axis: number of children; x-axis: immunization status (not immunized, partially immunized, fully immunized)]
[Figure: bar chart; y-axis: number of women; x-axis: marital status (married, single, divorced, widowed)]
[Figure: bar chart; y-axis: number of women; x-axis: marital status (married, single, divorced, widowed)]
Fig. 3 TT Immunization status by marital status of women 15-49 years, Asendabo town, 1996
Subdivided bar chart (cont.)
Method of constructing bar chart
• All the bars should rest on the same line called the base
Steps to construct a pie-chart
• Construct a frequency table
Distribution of cause of death for females, in England and Wales, 1989
[Figure: pie chart with the following sectors]
• Circulatory system: 42%
• Neoplasms: 30%
• Respiratory system: 13%
• Others: 8%
• Digestive system: 4%
• Injury and poisoning: 3%
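A common way to draw a pie chart by hand is to convert each percentage into a sector angle out of 360 degrees. The sketch below assumes that approach and uses the percentages as read from the chart labels above; the pairing of labels and percentages is a reading of the figure, not a table from the source.

```python
# Percentages as read from the chart labels (assumed pairing).
cause_of_death_pct = {
    "Circulatory system": 42,
    "Neoplasms": 30,
    "Respiratory system": 13,
    "Others": 8,
    "Digestive system": 4,
    "Injury and poisoning": 3,
}

# Each sector angle is the category's share of the full circle.
angles = {cause: pct * 360 / 100 for cause, pct in cause_of_death_pct.items()}

for cause, angle in angles.items():
    print(f"{cause:<22} {angle:6.1f} degrees")
print("Total", sum(angles.values()))
```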
3. Histogram
• A histogram is a graph of a frequency distribution with continuous class
intervals.
• It is necessary that the class intervals be non-overlapping so that
each observation falls in one and only one interval.
Example: Distribution of the age of women at the time of marriage
[Figure: histogram; y-axis: number of women; x-axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]
4. Frequency polygon
[Figure: frequency polygon of the age of women at the time of marriage; y-axis: number of women; x-axis: age (marked at 12, 17, 22, 27, 32, 37, 42, 47)]
Frequency polygon of birth weight of 9975 newborns for males and females
[Figure: two frequency polygons, one per sex (males, females); y-axis: %; x-axis: birth weight (500 to 5000)]
5. Ogive Curve (Cumulative Frequency Polygon)
• Used to find the number of items whose values are more than or less than a
certain amount.
• E.g. to find the number of patients whose weight is < 50 kg or > 60 kg.
[Figure: cumulative frequency curve; y-axis: cumulative frequency; x-axis: class boundaries from 4.5 to 39.5]
Fig 4: Cumulative frequency curve for amount of time college students devoted to leisure activities
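The points of a "less than" ogive are the cumulative frequencies plotted against the upper class boundaries. A minimal sketch, using the leisure-time frequency table from earlier in this chapter (the helper name `count_below` is illustrative):

```python
from itertools import accumulate

upper_boundaries = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
frequencies = [5, 11, 12, 7, 3, 2]

cumulative = list(accumulate(frequencies))
# The curve starts at the lowest class boundary with a cumulative frequency of 0.
ogive_points = [(9.5, 0)] + list(zip(upper_boundaries, cumulative))

def count_below(boundary):
    """Number of observations below a given class boundary."""
    return dict(ogive_points)[boundary]

total = cumulative[-1]
below = count_below(24.5)
above = total - below
print(ogive_points)
print(f"below 24.5 hours: {below}, above 24.5 hours: {above}")
```

Reading the curve at a class boundary gives the count below that value; subtracting from the total gives the count above it.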
6. Line graph
[Figure: line graph; x-axis: year (1967 to 1979); y-axis scale from 0.0 to 2.5]
Mean
• The sum of the observations divided by the number of
observations.
Example
19 21 20 20 34 22 24 27 27 27
• Then, Mean = (19 + 21 + … + 27) / 10 = 24.1
• General formula
a) Ungrouped data
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
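b) Grouped data
• We assume that all values falling into a particular class interval
are located at the mid-point of the interval. It is calculated as
follows:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
• where,
k = number of class intervals, m_i = mid-point of the i-th class interval, f_i = frequency of the i-th class interval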
• Influenced by each and every value in the data set hence affected
by the extreme values.
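Both versions of the mean can be checked with a short sketch. The ungrouped values are the ten observations from the example; the grouped version reuses the mid-points and frequencies of the leisure-time table from earlier in this chapter.

```python
# Ungrouped data: sum of the observations divided by their number.
values = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
mean_ungrouped = sum(values) / len(values)

# Grouped data: every value in a class is treated as lying at the mid-point.
mid_points = [12, 17, 22, 27, 32, 37]
frequencies = [5, 11, 12, 7, 3, 2]
mean_grouped = (
    sum(m * f for m, f in zip(mid_points, frequencies)) / sum(frequencies)
)

print(f"ungrouped mean = {mean_ungrouped:.1f}")
print(f"grouped mean   = {mean_grouped:.2f}")
```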
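a) Ungrouped data
• If the number of observations is odd, the median is the [(n+1)/2]th
observation of the ordered data.
• If the number of observations is even, the median is the average of the two
middle values, the (n/2)th and [(n/2)+1]th observations.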
Example: Find the median for the following
• 20 20 19 22 24 27 27 27 34 21 20
The median is a better measure of central tendency (than the mean)
when the distribution is skewed.
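The median of the example values can be worked out by sorting and picking the middle position. This is a minimal sketch of the usual odd/even rule, cross-checked against Python's `statistics.median`.

```python
import statistics

values = [20, 20, 19, 22, 24, 27, 27, 27, 34, 21, 20]

ordered = sorted(values)
n = len(ordered)
if n % 2 == 1:
    # Odd n: the [(n+1)/2]th ordered observation (1-based position).
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: average of the (n/2)th and [(n/2)+1]th ordered observations.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)
print("median =", median)
print("check  =", statistics.median(values))
```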
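b) Grouped data
$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$
• where,
• Lm = lower true class boundary of the interval containing the median
• fm = frequency of the interval containing the median
• W = width of the class interval
• Fc = cumulative frequency of the interval just above the median class interval in the table (i.e., the interval immediately preceding it)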
• Fc = 70
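As an illustration, the grouped median can be computed for the leisure-time frequency table from earlier in this chapter. The sketch finds the first class whose cumulative frequency reaches n/2, then interpolates within it: lower boundary plus (n/2 minus the cumulative frequency before the class), divided by the class frequency, times the class width.

```python
lower_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]
frequencies = [5, 11, 12, 7, 3, 2]
width = 5

n = sum(frequencies)
cumulative_before = 0          # cumulative frequency before the current class
median = None
for lower, f in zip(lower_boundaries, frequencies):
    if cumulative_before + f >= n / 2:
        # This is the median class: interpolate within it.
        median = lower + (n / 2 - cumulative_before) / f * width
        break
    cumulative_before += f

print(f"median class starts at {lower}, Fc = {cumulative_before}, fm = {f}")
print(f"grouped median = {median:.2f} hours")
```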
c) The third quartile (Q3): 75% of all the ranked observations are
less than Q3 (the 75th percentile).
Percentiles
– P25: 25% of the sample values are less than or equal to this value.
It is the 1st quartile (25th percentile), given by the 0.25(n+1)th observation.
– P50: 50% of the sample values are less than or equal to this value.
It is the 2nd quartile (50th percentile), given by the 0.5(n+1)th observation.
– P75: 75% of the sample values are less than or equal to this value.
It is the 3rd quartile (75th percentile), given by the 0.75(n+1)th observation.
– P100: The maximum
Example: Birth weight in grams
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248,
3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146
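The p(n+1)th-observation rule can be applied to these 20 birth weights as follows. When the position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values; that interpolation step is an assumption, since the rule above does not say how to handle fractional positions.

```python
birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

def percentile(ordered, p):
    """Value at the p(n+1)th position of ordered data (1-based),
    with linear interpolation for fractional positions (assumed)."""
    position = p * (len(ordered) + 1)
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

ordered = sorted(birth_weights)
q1 = percentile(ordered, 0.25)
q2 = percentile(ordered, 0.50)
q3 = percentile(ordered, 0.75)
print(f"Q1 = {q1} g, Q2 = {q2} g, Q3 = {q3} g")
```

For n = 20 the positions are 5.25, 10.5 and 15.75, so each quartile falls between two ordered observations.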
• The mode of grouped data usually refers to the modal class, the class with
the highest frequency.
• If a single value for the mode of grouped data must be specified,
it is taken as the mid-point of the modal class interval.
Properties of mode
• Often its value is not unique (more than one mode is possible).
• The main drawback of the mode is that often it does not exist;
therefore it is not a good summary of the majority of the data.
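A short sketch of both cases: the mode of ungrouped values (using the ten observations from the mean example) and the modal class of grouped data (using the leisure-time table from earlier in this chapter). `statistics.multimode` returns every value tied for the highest count, which reflects the point that the mode need not be unique.

```python
import statistics

# Ungrouped data: the most frequently occurring value(s).
values = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
modes = statistics.multimode(values)

# Grouped data: the modal class is the class with the highest frequency.
class_labels = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39"]
mid_points = [12, 17, 22, 27, 32, 37]
frequencies = [5, 11, 12, 7, 3, 2]
modal_index = frequencies.index(max(frequencies))

print("mode(s):", modes)
print("modal class:", class_labels[modal_index],
      "with mid-point", mid_points[modal_index])
```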
Descriptive statistics: measures of dispersion
• Two or more data sets may have the same mean and/or median
but still be quite different.
• Measures of central tendency (MCT) do not describe the variability or spread of
the values.
Measures of Dispersion
• The amount may be small when the values are close together.
• Example: Range = 42 - 5 = 37
Properties of range
IQR = Q3 - Q1
Example: Suppose the first and third quartiles for weights of girls
12 months of age are 8.8 kg and 10.2 kg, respectively.
Then IQR = 10.2 - 8.8 = 1.4 kg, i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.
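Example 2
• Given the following data set (age of patients), ordered from smallest to largest:
18 21 23 24 24 32 42 59
• Solution: Q1 = (21 + 23)/2 = 22 and Q3 = (32 + 42)/2 = 37, taken as the medians of the lower and upper halves of the ordered data.
• Hence, IQR = 37 - 22 = 15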
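The quartiles 22 and 37 in this example are what one gets by taking the medians of the lower and upper halves of the ordered data. The sketch below assumes that method; the 0.25(n+1) position rule with interpolation would give slightly different quartiles for the same data.

```python
import statistics

ages = [18, 21, 23, 24, 24, 32, 42, 59]

ordered = sorted(ages)
half = len(ordered) // 2
lower_half = ordered[:half]     # smallest half of the data
upper_half = ordered[-half:]    # largest half (middle value excluded if n is odd)

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```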
Properties of IQR:
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Example. Compute the variance and SD of the age of 169 subjects from the
grouped data.
Mean = 5810.5/169 = 34.38 years
$S^2$ = 20197.64/(169 - 1) = 120.22
SD = $\sqrt{120.22}$ = 10.96 years
Class interval   Mid-point (mi)   Frequency (fi)   (mi - Mean)   (mi - Mean)^2   (mi - Mean)^2 fi
10-19            14.5             4                -19.88        395.21          1580.86
20-29            24.5             66               -9.88         97.61           6442.55
30-39            34.5             47               0.12          0.0144          0.68
40-49            44.5             36               10.12         102.41          3686.92
50-59            54.5             12               20.12         404.81          4857.77
60-69            64.5             4                30.12         907.21          3628.86
Total                             169                            1907.26         20197.64
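The grouped mean, variance and SD can be recomputed directly from the mid-points and frequencies in the table, which is a useful check on hand arithmetic. This is a minimal sketch using the n - 1 denominator from the variance formula; the variable names are illustrative.

```python
import math

mid_points = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)
sum_mf = sum(m * f for m, f in zip(mid_points, frequencies))
mean = sum_mf / n

# Sum of squared deviations from the mean, weighted by class frequency.
sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(mid_points, frequencies))
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(f"n = {n}, sum of mi*fi = {sum_mf}")
print(f"mean = {mean:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")
```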
Properties of SD
• Has the advantage of being expressed in the same units of
measurement as the mean
$CV = \frac{S}{\bar{x}} \times 100$
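Because the coefficient of variation is unit-free, it allows the spread of variables measured in different units to be compared. The figures below are hypothetical and chosen only to illustrate the calculation.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: standard deviation relative to the mean."""
    return 100 * sd / mean

# Hypothetical summary statistics (illustrative only).
cv_weight = coefficient_of_variation(sd=6, mean=60)     # weight in kg
cv_height = coefficient_of_variation(sd=8, mean=160)    # height in cm

print(f"CV of weight = {cv_weight:.1f}%")
print(f"CV of height = {cv_height:.1f}%")
```

In this made-up case weight is relatively more variable than height, even though its SD is the smaller number.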
• When the data are skewed, it is preferable to use the median and IQR as
summary statistics.
• Median and IQR are not easily influenced by extreme values in a skewed
distribution unlike means and standard deviations.
• Remark:
• The mean and median of a symmetric distribution coincide.
• When a distribution is skewed to the right, its mean is larger than its median.
• When it is skewed to the left, its mean is smaller than its median (see Fig. 2(a)-(c)).
Fig. 2(a). Symmetric distribution (median, mode and mean marked at the centre)
Fig. 2(b). Distribution skewed to the right (from left to right: mode, median, mean)
• Calculate the mean, median and standard deviation of the following distribution