
chapter 3 descriptive biostatistics

Chapter Three discusses descriptive biostatistics, focusing on techniques to organize, summarize, and present data. It covers categorical and quantitative variables, including frequency distributions, relative frequencies, and various graphical representations like bar charts, pie charts, and histograms. The chapter emphasizes the importance of clear data presentation for effective interpretation and communication of findings.

Uploaded by

michot felegu

Chapter Three

Descriptive Biostatistics

Samrawit .F (MSc. in Biostatistics)

Oct 2024
Descriptive Statistics

• Involves techniques used to organize, summarize, and present a set of data.

• Numbers that have not been summarized and organized are called
raw data.
• Before interpretation & communication of the findings, the raw
data must be organized, summarized and presented in a clear and
understandable way.

A. Describing categorical variables
• Table of frequency distributions

– Frequency

– Relative frequency

– Cumulative frequencies

– Relative cumulative frequency

• Charts

– Bar charts

– Pie charts
Frequency distributions
• Simple and effective way of summarizing categorical data
• The actual summarization and organization of data starts from
frequency distribution
• Done by counting the number of observations falling into each of
the categories or levels of the variables.

E.g. Birth weight with levels 'Very low', 'Low', 'Normal' and 'Big'.
• The frequency distribution for newborns is obtained simply by
counting the number of newborns in each birth weight category.

Relative Frequency
• It is the proportion or percentage of observations in each category of a variable.

• The distribution of proportions is called the relative frequency distribution of the variable.

• Given the total number of observations, the relative frequency distribution is easily derived from the frequency distribution.

• Conversion in the opposite direction is also possible, but it is often inexact because the proportions have been rounded.

Cumulative frequency
• It is the number of observations in a category of a variable plus the observations in all categories below it.

Cumulative relative frequency

• It is the proportion of observations in a category plus the proportions in all categories below it.

• It is obtained by dividing the cumulative frequency by the total number of observations.

Table 1. Distribution of birth weight of newborns between Sept-Oct, 2020 at 'X' Hospital.

BWT         Freq.   Cum. freq.   Rel. freq.   Cum. rel. freq.
Very low      25        25          0.1            0.1
Low           50        75          0.2            0.3
Normal       150       225          0.6            0.9
Big           25       250          0.1            1.0
Total        250                    1.0
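
The columns of Table 1 can be derived from the raw counts alone. The following is a minimal Python sketch, assuming the four category counts shown in the table; the variable names are illustrative and not part of the original slides.

```python
from itertools import accumulate

# Counts per birth-weight category, taken from Table 1.
categories = ["Very low", "Low", "Normal", "Big"]
freq = [25, 50, 150, 25]

total = sum(freq)                              # total number of newborns
cum_freq = list(accumulate(freq))              # running totals
rel_freq = [f / total for f in freq]           # proportions
cum_rel_freq = [c / total for c in cum_freq]   # cumulative proportions

for row in zip(categories, freq, cum_freq, rel_freq, cum_rel_freq):
    print("{:<9}{:>5}{:>6}{:>6.1f}{:>6.1f}".format(*row))
```

Dividing the cumulative counts by the total, rather than summing rounded proportions, avoids accumulating rounding error.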

B. Describing quantitative variables

• Table of frequency distributions

– Frequency

– Relative frequency

– Cumulative frequencies
- Select a set of contiguous, non-overlapping intervals such that each value can be placed in one and only one of the intervals.

- The first consideration is how many intervals to include.

To determine the number of class intervals and the corresponding width, we may use Sturges' rule:

$$K = 1 + 3.322\,\log_{10} n$$

$$W = \frac{L - S}{K}$$

where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value

Example:
Leisure time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14
13 10
19 27 29 22 38 28 34 32 23 19 21 31 16 28 19 18
12 27
15 21 25 16

K = 1 + 3.322 (log 40) = 6.32 ≈ 6

Maximum value = 38, Minimum value = 10

Width = (38 - 10)/6 = 4.67 ≈ 5
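
The same calculation can be sketched in Python. This is only an illustration of Sturges' rule for this example; rounding K to the nearest integer and rounding the width up are choices made here so that the classes cover the whole range, not rules stated in the slides.

```python
import math

n = 40                       # number of observations
largest, smallest = 38, 10   # L and S in the formula

k_exact = 1 + 3.322 * math.log10(n)      # Sturges' rule
k = round(k_exact)                       # number of class intervals
width_exact = (largest - smallest) / k   # W = (L - S) / K
width = math.ceil(width_exact)           # round up to a convenient width

print(round(k_exact, 2), k, round(width_exact, 2), width)
```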


• Ordered array: a simple arrangement of the individual observations in order of magnitude.
Time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14              5            0.125                  0.125
15-19             11            0.275                  0.400
20-24             12            0.300                  0.700
25-29              7            0.175                  0.875
30-34              3            0.075                  0.950
35-39              2            0.050                  1.000
Total             40            1.000
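
The grouping above can be reproduced by counting how many of the 40 observations fall within each pair of class limits. The sketch below assumes the first class starts at the minimum value (10) with a width of 5 and six classes, as in the example; the names are illustrative.

```python
from itertools import accumulate

# Leisure time (hours) per week for the 40 students in the example.
hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14,
         13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28,
         19, 18, 12, 27, 15, 21, 25, 16]

start, width, k = 10, 5, 6
# Class limits for integer-valued data: 10-14, 15-19, ..., 35-39.
limits = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]

n = len(hours)
freq = [sum(lo <= h <= hi for h in hours) for lo, hi in limits]
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in accumulate(freq)]

for (lo, hi), f, r, c in zip(limits, freq, rel_freq, cum_rel_freq):
    print(f"{lo}-{hi}  {f:>3}  {r:.3f}  {c:.3f}")
```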

• Class limit: The range for each class
– Upper class limit
– Lower class limit

• Mid-point (class mark): The value of the interval which lies midway between the lower and the upper limits of a class.

• Class boundary (true limit): The limits that make an interval of a continuous variable continuous in both directions
– Upper class boundary
– Lower class boundary

• To obtain the class boundaries, subtract 0.5 from the lower class limit and add 0.5 to the upper class limit.

Time (hours)   True limit (class boundary)   Mid-point   Frequency
10-14               9.5 - 14.5                  12           5
15-19              14.5 - 19.5                  17          11
20-24              19.5 - 24.5                  22          12
25-29              24.5 - 29.5                  27           7
30-34              29.5 - 34.5                  32           3
35-39              34.5 - 39.5                  37           2
Total                                                       40
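
The class boundaries and mid-points in this table follow mechanically from the class limits. A small sketch, assuming whole-number limits as in the example (so that half the gap between consecutive classes is 0.5):

```python
# Class limits from the leisure-time example.
limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

# Gap between one class's upper limit and the next class's lower limit.
gap = limits[1][0] - limits[0][1]   # 1 for whole-number limits
half = gap / 2                      # 0.5

boundaries = [(lo - half, hi + half) for lo, hi in limits]
midpoints = [(lo + hi) / 2 for lo, hi in limits]

for (lo, hi), (lb, ub), m in zip(limits, boundaries, midpoints):
    print(f"{lo}-{hi}: boundaries {lb}-{ub}, mid-point {m}")
```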

Types of tables

Guidelines for constructing tables
• Keep them simple

• Limit the number of variables to be included

• All tables should be self-explanatory

• Include a clear title stating what, when and where

• Clearly label the rows and columns

• State clearly the unit of measurement used

• Explain codes and abbreviations in a footnote

• Show totals

• If the data are not original, indicate the source in a footnote
Diagrammatic (Pictorial) representations of Statistical data

Importance of diagrammatic representation

1. Diagrams are more attractive than mere figures.

2. They give a quick overall impression of the data.

3. They are easier to remember than mere figures.

4. They facilitate comparison.

5. They help in understanding patterns and trends.
Specific types of graphs include:

For nominal, ordinal and discrete data:
• Bar graph
• Pie chart

For quantitative continuous data:
• Stem and leaf plot
• Histogram
• Frequency polygon
• Cumulative frequency polygon (ogive curve)
• Line graph
• Box plot
• Scatter plot
1. Bar charts (Graphs)
• Categories are listed on the horizontal axis (X-axis)

• Frequencies or relative frequencies are represented on the Y-axis (ordinate)

• The height of each bar is proportional to the frequency or relative frequency of observations in that category

• There are different types of bar graphs, the most important ones are:
A. Simple bar chart: A one-dimensional chart in which each bar represents the whole magnitude of a single variable.

Fig. 1. Immunization status of children in Adami Tulu Woreda, Feb. 2020. [Bar chart: immunization status (not immunized, partially immunized, fully immunized) on the x-axis, number of children on the y-axis.]
B. Multiple bar chart: The component figures are shown as separate bars adjoining each other. It depicts the distribution pattern of more than one variable.

Fig. 2. TT immunization status by marital status of women 15-49 years, Asendabo town, 2020. [Multiple bar chart: marital status (married, single, divorced, widowed) on the x-axis, number of women on the y-axis, separate bars for immunized and not immunized.]
C. Sub-divided bar chart: Each bar is sub-divided into its component parts. This sort of graph is constructed when each total is built up from two or more component figures.

Fig. 3. TT immunization status by marital status of women 15-49 years, Asendabo town, 1996. [Sub-divided bar chart: marital status (married, single, divorced, widowed) on the x-axis, number of women on the y-axis, each bar split into immunized and not immunized.]
Method of constructing bar chart

• All the bars must have equal width

• The bars are not joined together

• The different bars should be separated by equal distances

• All the bars should rest on the same line called the base

• Label both axes clearly


2. Pie chart
• Shows the relative frequency for each category by dividing a
circle into sectors

• The angles are proportional to the relative frequency.

• Used for a single categorical variable

• Use percentage distributions

Steps to construct a pie-chart
• Construct a frequency table

• Change the frequencies into percentages (P)

• Change the percentages into degrees, where: degrees = (P/100) × 360°

• Draw a circle and divide it accordingly
Example: Distribution of deaths for females, in England and Wales, 1989.

Cause of death            No. of deaths
Circulatory system           100 000
Neoplasm                      70 000
Respiratory system            30 000
Injury and poisoning           6 000
Digestive system              10 000
Others                        20 000

Total                        236 000
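
The pie-chart steps can be applied to this table as follows. This is a minimal sketch that only computes the percentage and the sector angle for each cause of death; drawing the circle is left out, and the dictionary name is illustrative.

```python
# Number of deaths by cause, from the table above.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
percent = {cause: 100 * count / total for cause, count in deaths.items()}
# Sector angle in degrees = (percentage / 100) x 360.
degrees = {cause: p / 100 * 360 for cause, p in percent.items()}

for cause in deaths:
    print(f"{cause:<22}{percent[cause]:6.1f}%{degrees[cause]:8.1f} deg")
```

The sector angles always sum to 360 degrees, which is a quick check on the arithmetic.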

[Pie chart: Distribution of cause of death for females, in England and Wales, 1989. Circulatory system 42%, neoplasms 30%, respiratory system 13%, others 8%, digestive system 4%, injury and poisoning 3%.]
3. Histogram
• Histograms are frequency distributions with continuous class intervals that have been turned into graphs.

• Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram.

• Constructed by choosing a set of non-overlapping class intervals and counting the number of observations that fall in each class.
• It is necessary that the class intervals be non-overlapping so that each observation falls in one and only one interval.

• Bars are drawn over the intervals

• The area of each bar is proportional to the frequency of observations in the interval
Example: Distribution of the age of women at the time of marriage

Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number        11      36      28      13       7       3       2

[Histogram: Age of women at the time of marriage. Age group by class boundaries (14.5-19.5 to 44.5-49.5) on the x-axis, number of women on the y-axis.]
4. Frequency polygon

• Instead of drawing bars for each class interval, sometimes a single point is drawn at the mid-point of each class interval and consecutive points are joined by straight lines.

• Graphs drawn in this way are called frequency polygons

• Frequency polygons are superior to histograms for comparing two or more sets of data.
[Frequency polygon: Age of women at the time of marriage. Age (12 to 47) on the x-axis, number of women on the y-axis.]

[Frequency polygon of birth weight of 9975 newborns for males and females. Birth weight (500 to 5000) on the x-axis, percentage on the y-axis, one line per sex.]
5. Ogive Curve (Cumulative Frequency Polygon)
• Used to find the number of items whose values are more or less than a certain amount.

• E.g. to know the number of patients whose weight is <50 or >60 kg.

• To get this information it is necessary to change the form of the frequency distribution from a 'simple' to a 'cumulative' distribution.

• An ogive curve turns a cumulative frequency distribution into a graph.

• Are much more common than frequency polygons
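
The points of an ogive are the cumulative frequencies paired with the upper class boundaries. As a hedged illustration, the sketch below reuses the frequencies of the 40-student leisure-time example from earlier in this chapter (an assumption about which data to use; the plotted figure may be based on a different data set) and reads off "less than" and "more than" counts.

```python
from itertools import accumulate

# Frequencies and upper class boundaries from the leisure-time example.
freq = [5, 11, 12, 7, 3, 2]
upper_boundaries = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5]

cum_freq = list(accumulate(freq))
n = cum_freq[-1]

# Ogive points: start at the first lower boundary with a count of zero.
points = [(9.5, 0)] + list(zip(upper_boundaries, cum_freq))

# Number of students with less than / more than 24.5 hours.
less_than = dict(points)[24.5]
more_than = n - less_than

print(points)
print(less_than, more_than)
```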


Example: time spent on leisure activities

Fig 4: Cumulative frequency curve for amount of time college students devoted to leisure activities. [Ogive: upper class boundary (4.5 to 39.5) on the x-axis, cumulative frequency on the y-axis.]
6. Line graph

• Useful for assessing the trend of a particular situation over time.

• Helps in monitoring the trend of epidemics.

• The time, in weeks, months or years, is marked along the horizontal axis, and

• Values of the quantity being studied are marked on the vertical axis.

• Values for each category are connected by a continuous line.

• Sometimes two or more lines are drawn on the same graph using the same scale so that they are comparable.
Example:

Fig 5: Malaria Parasite Prevalence Rates in Ethiopia, 1967 - 1979 Eth. C. [Line graph: year (1967 to 1979) on the x-axis, rate (%) on the y-axis.]

Scatter Plots
• The most useful graphical tool for displaying the relationship between two quantitative variables is a two-way scatter plot.
• Scatter plots present data on the x- and y-axes and are used to investigate an association between two variables.
• A point represents each individual or object, and an association between two variables can be studied by analyzing patterns across multiple points.
• A regression line is added to the graph to help determine whether the association between the two variables can be explained or not.

[Figure: two-way scatter plot of annual salary versus years of education.]
Box-and-Whisker Plots
• It is a useful visual device for communicating the information contained in a data set.
• The construction of a box-and-whisker plot makes use of the quartiles.
• Examination of a box-and-whisker plot for a set of data reveals information regarding the amount of spread, location of concentration, and symmetry of the data.

[Figure: example box plots.]
Numerical summary measures
• A single number that quantifies a characteristic of a distribution of values.

• Measures of central tendency (location)

• Measures of dispersion (variability)


A. Measures of Central location
• The objective of calculating a measure of central tendency (MCT) is to determine a single value which may be used to represent the whole data set.
• These measures summarize, in a single number, the point at which the data tend to cluster. Such statistics are called measures of location or measures of central tendency.

• The most common are the mean, median and mode.

Mean
• The sum of the observations divided by the number of
observations.
Example
19 21 20 20 34 22 24 27 27 27
• Then, Mean = (19 + 21 + … + 27)/10 = 24.1
• General formula
a) Ungrouped data

If $x_1, x_2, \ldots, x_n$ are n observed values, then

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
b) Grouped data
• We assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

• where,

k = the number of class intervals

$m_i$ = the mid-point of the ith class interval

$f_i$ = the frequency of the ith class interval


Example. Compute the mean age of 169 subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years

Class interval   Mid-point (mi)   Frequency (fi)    mifi
[10-19]              14.5               4            58.0
[20-29]              24.5              66          1617.0
[30-39]              34.5              47          1621.5
[40-49]              44.5              36          1602.0
[50-59]              54.5              12           654.0
[60-69]              64.5               4           258.0
Total                                 169          5810.5
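
The grouped mean can be checked directly from the mid-points and frequencies in the table. A minimal sketch (variable names are illustrative):

```python
# Mid-points and frequencies of the six age classes in the table.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)                                         # total frequency
sum_mf = sum(m * f for m, f in zip(midpoints, freqs))  # sum of m_i * f_i
mean = sum_mf / n                                      # grouped mean

print(n, sum_mf, round(mean, 2))
```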
Properties of the arithmetic mean
• For a given set of data there is one and only one arithmetic mean (uniqueness).

• It is easy to calculate and understand (simplicity).

• It is a poor measure of central location if the underlying distribution is not normal (not Gaussian).

• It is influenced by each and every value in the data set and is hence affected by extreme values.

• In grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
Median
• With the observations arranged in increasing or decreasing order,
the median is defined as the middle observation.

a) ungrouped data

If the number of observations is odd, the median is the [(n+1)/2]th observation.

• If the number of observations is even, the median is the average of the two middle values, the (n/2)th and [(n/2)+1]th, i.e.

$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$

Example: Find the median for the following
• 20 20 19 22 24 27 27 27 34 21 20
The median is a better measure of central tendency (than the mean)
when the distribution is skewed
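
The two rules for ungrouped data can be sketched as a small Python function. The odd case below uses the eleven values exactly as printed in the example above, and the even case uses the ten values from the earlier mean example; treating the printed list as eleven observations is an assumption.

```python
import statistics

def median(values):
    """Median of ungrouped data using the odd/even rules above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the [(n+1)/2]th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two middle values

odd_case = [20, 20, 19, 22, 24, 27, 27, 27, 34, 21, 20]   # n = 11
even_case = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]      # n = 10

print(median(odd_case), median(even_case))
# The standard library gives the same answers.
print(statistics.median(odd_case), statistics.median(even_case))
```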
b) Grouped data

• We assume that the values within a class interval are evenly distributed through the interval.
– The first step is to locate the class interval that contains the median.

– Find n/2 and identify the first class interval whose cumulative frequency is greater than or equal to n/2.
Median for grouped data
To find a unique median value, use the following formula:

$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

• where,
• $L_m$ = lower true class boundary of the interval containing the median

• $F_c$ = cumulative frequency of the interval immediately preceding the median class interval

• $f_m$ = frequency of the interval containing the median

• W = class interval width

• n = total number of observations


Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq.
[10-19]              14.5               4               4
[20-29]              24.5              66              70
[30-39]              34.5              47             117
[40-49]              44.5              36             153
[50-59]              54.5              12             165
[60-69]              64.5               4             169
Total                                 169

• n/2 = 84.5 falls in the 3rd class interval, [30-39]

• Lower true boundary = 29.5, upper true boundary = 39.5

• Frequency of the class (fm) = 47

• Fc = 70

• (n/2 - Fc) = 84.5 - 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33
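
The grouped-median formula can be sketched as a function that walks through the classes until the cumulative frequency reaches n/2. The function name and argument layout are illustrative assumptions; equal class widths are assumed, as in the example.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(freqs)
    half = n / 2
    cum_before = 0                      # F_c: cumulative frequency so far
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cum_before + f >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("frequency table is empty")

# Lower true boundaries and frequencies for the age example.
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, freqs, width=10)
print(round(median_age, 2))
```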


Properties of median

• There is only one median for a given set of data (uniqueness)

• The median is easy to calculate

• The median is a positional average and hence it is not sensitive to very large or very small values.

• The median is a better measure of central tendency (than the mean) when the distribution is skewed (not normal)

• Can be calculated even in the case of open-ended intervals


Quartiles
• If the data are divided into four equal parts, we speak of quartiles.

• The median divides the data into two equal parts

a) The first quartile (Q1): 25% of all the ranked observations are less than Q1. [25th percentile]

b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. [50th percentile] The second quartile is the median.

c) The third quartile (Q3): 75% of all the ranked observations are less than Q3. [75th percentile]
Percentiles

• Simply divide the data into 100 equal parts.

• Commonly used percentiles:
→ 10, 20, …, 90% (deciles)
→ 20, 40, …, 80% (quintiles)
→ 25, 50, 75% (quartiles)
→ 33.3, 66.7% (tertiles)
– P0: The minimum

– P25: 25% of the sample values are less than or equal to this value. P25 is the 1st quartile or 25th percentile and is given by the 0.25(n+1)th observation.

– P50: 50% of the sample values are less than or equal to this value. P50 is the 2nd quartile or 50th percentile and is given by the 0.5(n+1)th observation.

– P75: 75% of the sample values are less than or equal to this value. P75 is the 3rd quartile or 75th percentile and is given by the 0.75(n+1)th observation.

– P100: The maximum
Example: Birth weight in grams

2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248,
3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

• Find the 10th and 90th percentiles of the data set.

• 10th percentile = 0.1(20+1) = 2.1th value

• Taken as the average of the 2nd and 3rd values = (2581+2759)/2 = 2670 g

• 90th percentile = 0.9(20+1) = 18.9th value

• Taken as the average of the 18th and 19th values = (3609+3649)/2 = 3629 g
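
The percentile rule used in this example can be sketched as follows. The function implements the convention of these slides, in which a fractional position p(n+1) is resolved by averaging the two neighbouring ordered values; this is an assumption about the intended rule, and statistical libraries that interpolate linearly will return slightly different numbers.

```python
import math

def percentile(values, p):
    """Percentile by the p(n+1) position rule, averaging the two
    neighbouring ordered values when the position is fractional."""
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n + 1)                       # 1-based position
    lower = min(max(math.floor(pos), 1), n) # clamp to the valid range
    upper = min(max(math.ceil(pos), 1), n)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

p10 = percentile(birth_weights, 0.10)
p90 = percentile(birth_weights, 0.90)
print(p10, p90)
```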


Mode
• It is the value that occurs most often.

• Most distributions have one peak and are described as unimodal.

• E.g. 19 21 20 20 34 22 24 27 27 27
• The mode is 27, because the value 27 occurs three times (the most frequent).

• Some distributions have more than one mode

• Unimodal: A distribution with one mode.

• Bimodal: A distribution with two modes.

• Trimodal: A distribution with three modes.
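
Finding the mode is a matter of counting. A short sketch using Python's standard library; the second data set is a made-up illustration of a bimodal case, not data from the slides.

```python
import statistics
from collections import Counter

data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]

counts = Counter(data)               # value -> number of occurrences
modes = statistics.multimode(data)   # all values tied for the highest count

# Hypothetical bimodal example: 1 and 2 both occur twice.
bimodal = [1, 1, 2, 2, 3]
bimodal_modes = statistics.multimode(bimodal)

print(modes, counts[27], bimodal_modes)
```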


• The mode of grouped data usually refers to the modal class, the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
Properties of mode

• It is not affected by extreme values
• Often its value is not unique (more than one mode is possible)
• The main drawback of the mode is that it often does not exist; therefore it is not a good summary of the majority of the data.
Measures of dispersion

Consider the following two sets of data:


A: 177, 193, 195, 209, 226 Mean = 200
B: 192, 197, 200, 202, 209 Mean = 200

• Two or more data sets may have the same mean and/or median and yet be quite different.
• Measures of central tendency (MCT) do not describe the variability or spread of the values.
Measures of Dispersion

• Measures that quantify the variation or dispersion of a set of data from its central location.

• Dispersion refers to the variety exhibited by the values of the data.

• The amount of dispersion is small when the values are close together.

• If all the values are the same, there is no dispersion.


1. Range (R)
• The difference between the largest and smallest observations in a
data set.

• Range = Maximum value – Minimum value

• Example –

– Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42-5 = 37
Properties of range

• It is the simplest, crude measure of dispersion and is easily understood

• It takes into account only two values, which makes it a poor measure of dispersion

• Very sensitive to extreme observations


2. Inter-quartile range (IQR)
• Indicates the spread of the middle 50% of the observations, and is used with the median

IQR = Q3 - Q1
Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.

IQR = 10.2 kg - 8.8 kg = 1.4 kg

i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.
Example 2
• Given the following data set (age of patients):-

18, 59, 24, 42, 21, 23, 24, 32

• Find the inter-quartile range

• Solution: 18 21 23 24 24 32 42 59

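• 1st quartile = {(n+1)/4}th = 2.25th value, taken as the average of the 2nd and 3rd values = (21 + 23)/2 = 22

• 3rd quartile = {3(n+1)/4}th = 6.75th value, taken as the average of the 6th and 7th values = (32 + 42)/2 = 37

• Hence, IQR = 37 - 22 = 15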
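
The same steps can be sketched in Python. The helper below follows the convention used in this example (position p(n+1), with a fractional position resolved by averaging the two neighbouring ordered values); it is an illustrative assumption, and library routines that interpolate will give somewhat different quartiles.

```python
import math

def quantile(values, p):
    """Quantile by the p(n+1) position rule, averaging neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n + 1)                        # 1-based position
    lower = min(max(math.floor(pos), 1), n)
    upper = min(max(math.ceil(pos), 1), n)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

ages = [18, 59, 24, 42, 21, 23, 24, 32]

q1 = quantile(ages, 0.25)   # 2.25th position
q3 = quantile(ages, 0.75)   # 6.75th position
iqr = q3 - q1

print(q1, q3, iqr)
```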
Properties of IQR:

• It encloses the central 50% of the observations

• It is not based on all observations but only on two specific values

• It is important in selecting cut-off points in the formulation of clinical standards.

• Since it excludes the lowest and highest 25% of values, it is not affected by extreme values

• Less sensitive to the size of the sample


Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
S2 = 20197.63/(169 - 1) = 120.22

SD = √S2 = √120.22 = 10.96

Class interval   (mi)    (fi)   (mi-Mean)   (mi-Mean)2   (mi-Mean)2 fi
10-19            14.5      4     -19.88       395.21        1580.86
20-29            24.5     66      -9.88        97.61        6442.55
30-39            34.5     47       0.12         0.01           0.68
40-49            44.5     36      10.12       102.41        3686.92
50-59            54.5     12      20.12       404.81        4857.77
60-69            64.5      4      30.12       907.21        3628.86
Total                    169                  1907.29       20197.63
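
The grouped calculation can be verified with a few lines of Python. The sketch assumes the grouped form of the sample variance implied by the table columns, S² = Σ fᵢ(mᵢ − x̄)² / (n − 1), with every observation placed at its class mid-point.

```python
import math

# Mid-points and frequencies of the six age classes.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Weighted sum of squared deviations from the mean.
sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
variance = sum_sq / (n - 1)       # sample variance
sd = math.sqrt(variance)          # standard deviation

print(round(mean, 2), round(variance, 2), round(sd, 2))
```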
Properties of SD
• Has the advantage of being expressed in the same units of measurement as the mean

• The best measure of dispersion, and widely used because of the properties of the theoretical normal curve.

• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of the SD.
Coefficient of variation (CV)
• When two data sets have different units of measurement, the CV should be used as a measure of dispersion.

• It is the best measure to compare the variability of two series or sets of observations.

• Data with a smaller coefficient of variation are considered more consistent.
CV is the ratio of the SD to the mean multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0

"Cholesterol is more variable than systolic blood pressure"
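
Because the CV is unit-free, the two rows of the table can be compared directly. A minimal sketch, using the SD and mean values from the table (the function name is illustrative):

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: the SD expressed as a percentage of the mean."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(15, 130)    # systolic blood pressure
cv_chol = coefficient_of_variation(40, 200)   # cholesterol

print(round(cv_sbp, 1), round(cv_chol, 1))
print("Cholesterol more variable:", cv_chol > cv_sbp)
```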


Skewed distributions
• Skewness: If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores.

• Based on the type of skewness, distributions can be:

A. Positively skewed distribution: Occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
B. Negatively skewed distribution: Occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.

C. Symmetrical distribution: It is neither positively nor negatively skewed.

A curve is symmetrical if one half of the curve is the mirror image of the other half.
Mean, Median & Mode
Which measures to use?
• When the distribution is symmetric, summarize the data using means and
standard deviations.

• When the data are skewed, it is preferable to use the median and IQR as
summary statistics.

• Median and IQR are not easily influenced by extreme values in a skewed
distribution unlike means and standard deviations.

• Remark:
• The mean and median of symmetric distribution coincide.

• When skewed to the right, its mean is larger than its median.

• When skewed to the left, its mean is smaller than its median.(see fig. a-c)
Fig. 2(a). Symmetric distribution: Mean = Median = Mode

Fig. 2(b). Distribution skewed to the right: Mean > Median > Mode

Fig. 2(c). Distribution skewed to the left: Mean < Median < Mode


• Calculate the mean, median, and standard deviation of the following distribution

Class interval   Frequency
31-35                2
36-40                3
41-45                8
46-50               12
51-55               16
56-60                5
61-65                2
66-70                3
N = 51
