Summarizing Data
• LECTURE OUTLINE
• Methods of summarizing data
❖TABULAR
❖GRAPHICAL and
❖NUMERICAL methods:
- Simple frequencies,
- Measures of central tendency,
- Measures of spread
• Other methods
❖Rates and Ratios
❖Measures of morbidity
❖Measures of mortality
Tabular method
• Before data can be displayed graphically, they must be organized
into tables, which summarize the data in a compact and readily
comprehensible form,
e.g. a frequency distribution table.
Tabular method
Tables should
• Have well-labelled rows and columns
• Have a title
• Indicate the source of the data
Total 55
• Table 2: Disease pattern at an outpatient clinic.
Graphical presentation
• c. Other
• i Scatter diagram
• ii Spot diagram
Basic terms for frequency distribution
• In constructing a histogram, the area of each bar must correspond to
the frequency of its interval.
• For data with unequal interval widths, the heights on the y axis must
be adjusted accordingly.
• The y axis gives the frequency of individuals and the x axis gives the classes into
which the data have been grouped.
• The axes should be properly defined and clearly labelled, with the scale clearly shown.
[Figure: histogram of frequency (y axis, 0 to 80) by weight group (x axis, classes 1-5, 6-10, 11-15, ..., 56-60)]
• HISTOGRAM
• For data with unequal interval widths, the heights on the y axis
must be adjusted: divide each frequency by the class width expressed in standard widths.
Mass (g)                          10 – 19        20 – 24     25 – 34        35 – 50        51 – 55
Frequency                         6              4           12             18             8
Class width                       10             5           10             15             5
Width on the x axis               2 × standard   standard    2 × standard   3 × standard   standard
Rectangle's height in histogram   6 ÷ 2 = 3      4           12 ÷ 2 = 6     18 ÷ 3 = 6     8
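The height adjustment in the mass table can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the names `adjusted_heights` and `STANDARD_WIDTH` are made up for this example, and taking the narrowest class as the "standard" width is an assumption that happens to match the table.

```python
# Class data copied from the mass table above.
classes = ["10-19", "20-24", "25-34", "35-50", "51-55"]
frequencies = [6, 4, 12, 18, 8]
class_widths = [10, 5, 10, 15, 5]  # widths as stated in the table

# Assumption: the narrowest class width is treated as the standard width.
STANDARD_WIDTH = min(class_widths)


def adjusted_heights(freqs, widths, standard):
    """Divide each frequency by its class width measured in standard widths."""
    return [f / (w / standard) for f, w in zip(freqs, widths)]


heights = adjusted_heights(frequencies, class_widths, STANDARD_WIDTH)
for label, height in zip(classes, heights):
    print(f"{label}: height {height:g}")
```

With these heights, the area of each rectangle (height × width in standard units) equals the class frequency.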
Frequency polygons
Steps
• Create a histogram.
• Find the midpoint of the top of each bar
on the histogram.
• Place a point on the x axis at the start of the
histogram and another at its end.
• Connect the points with straight lines.
• For intervals of unequal widths, the heights
on the y axis must be adjusted.
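The steps above can be sketched as follows, reusing the mass classes and adjusted heights from the histogram table. The function name `polygon_points` is hypothetical, and placing the two end points at the lower limit of the first class and the upper limit of the last class is one reading of "the origin of the histogram and its end"; some texts instead use the midpoints of the empty neighbouring classes.

```python
# Mass classes and adjusted histogram heights from the table above.
class_limits = [(10, 19), (20, 24), (25, 34), (35, 50), (51, 55)]
heights = [3, 4, 6, 6, 8]


def polygon_points(limits, bar_heights):
    """Return the (x, y) points of the polygon: zero at each end, bar midpoints between."""
    midpoints = [(low + high) / 2 for low, high in limits]
    start = (limits[0][0], 0)   # point on the x axis where the histogram starts
    end = (limits[-1][1], 0)    # point on the x axis where the histogram ends
    return [start, *zip(midpoints, bar_heights), end]


points = polygon_points(class_limits, heights)
for x, y in points:
    print(x, y)
```

Joining consecutive points with straight lines gives the frequency polygon.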
[Figure: line graph of HIV prevalence (y axis) by year, 2000 to 2009 (x axis)]
Multiple line graphs
• BAR DIAGRAM
• The bars are separated and the widths are equal for the
respective categories. Numbers or frequencies or percentages
can be used.
COMPOSITE BAR CHART
Pie chart
[Figure: pie chart of arrivals at KIA by origin (Ghanaians, ECOWAS, Africans, Americas, EU, Asia), with slices of 7%, 12%, 14%, 19%, 21% and 27%]
Scatter Plots
Nature of Relationship – Linear?
• OTHER GRAPHICAL METHODS
[Figure: spot map showing the location of HIV sentinel sites in Ghana]
Source: NACP/GHS
Numerical or mathematical methods of data presentation
• Introduction
• It is often important to be able to describe the raw
data with one or two summary figures.
Numerical methods
• Proportions
• If N = the number of subjects in a sample and
n = the number within the same sample having an attribute, then the
proportion with the attribute is n/N.
E.g. in a survey of 150 medical students, 20 tested positive for
Hepatitis B infection.
The proportion of students with Hepatitis B infection is 20/150 =
0.13 or 13%.
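The Hepatitis B example can be checked with a short sketch. The helper name `proportion` is made up for illustration.

```python
def proportion(n_with_attribute, n_total):
    """Return n/N, the proportion of the sample having the attribute."""
    if n_total <= 0:
        raise ValueError("sample size must be positive")
    return n_with_attribute / n_total


# 20 of 150 medical students tested positive for Hepatitis B.
p = proportion(20, 150)
print(f"{p:.2f} or {p:.0%}")
```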
• MEASURES OF CENTRAL TENDENCY.
• For grouped data, the class midpoint is obtained by adding the two
class limits and dividing by two.
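• Mean
If X1, X2, X3, …, Xn are numeric observations made on n subjects,
then the mean is
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n}$$
For data in a frequency table,
$$\bar{X} = \frac{\sum fX}{\sum f}$$
where f is the frequency of observation X.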
AGE IN YEARS (X)   FREQUENCY (f)   X²   fX   fX²
21                 38
22                 35
23                 28
24                 24
25                 28
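As a sketch of how the remaining columns of the age table could be filled in and the mean ΣfX/Σf obtained, the following Python computes X², fX and fX² for each row. The variable names are illustrative only.

```python
# Ages (X) and frequencies (f) from the table above.
ages = [21, 22, 23, 24, 25]
freqs = [38, 35, 28, 24, 28]

x_squared = [x ** 2 for x in ages]
fx = [f * x for f, x in zip(freqs, ages)]
fx2 = [f * x ** 2 for f, x in zip(freqs, ages)]

total_f = sum(freqs)
mean_age = sum(fx) / total_f  # mean = sum(fX) / sum(f)

print("X   f   X^2  fX   fX^2")
for row in zip(ages, freqs, x_squared, fx, fx2):
    print(*row)
print(f"sum f = {total_f}, sum fX = {sum(fx)}, mean = {mean_age:.2f}")
```

The ΣfX² column is not needed for the mean but is the quantity used when the variance is computed from a frequency table.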
• Median
The median is less influenced by extreme values than the mean; however, it is not easily amenable to mathematical manipulation.
The median is the best measure of central tendency for skewed data.
• GEOMETRIC MEAN
• It is a useful summary statistic for antibody assays and
microbial counts, and for skewed data.
• Example: 4 8 16 16 64 (very skewed)
• GM = fifth root of (4 × 8 × 16 × 16 × 64)
• Taking logs (base 10) of both sides:
• 5 log GM = log 4 + log 8 + log 16 + log 16 + log 64
• = 5.72
• GM = antilog of (5.72/5)
• = 13.9
• By comparison, the arithmetic mean = 21.6, the median = 16
and the mode = 16.
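The geometric mean calculation can be reproduced with the log method described above. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
import math
import statistics

values = [4, 8, 16, 16, 64]

# Sum of base-10 logs, then antilog of the average log.
log_sum = sum(math.log10(x) for x in values)
gm = 10 ** (log_sum / len(values))

print(f"sum of logs = {log_sum:.2f}")
print(f"geometric mean = {gm:.1f}")
print(f"arithmetic mean = {statistics.mean(values)}")
print(f"median = {statistics.median(values)}, mode = {statistics.mode(values)}")
```

For skewed data like these, the geometric mean sits well below the arithmetic mean, which is pulled upward by the value 64.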
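• Mode
This is the most frequently occurring observation.
For grouped data,
$$\text{Mode} = L + \frac{f_z - f_l}{2f_z - (f_l + f_h)} \times i$$
where, in the usual form of this formula, $L$ is the lower boundary of the modal class, $f_z$ is the frequency of the modal class, $f_l$ and $f_h$ are the frequencies of the classes immediately below and above it, and $i$ is the class width.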
• I    70 29 48 90 92 61 30
• II   68 72 65 50 58 63 44
• III  59 59 58 60 60 61 63
• MEASURES OF DISPERSION OR SPREAD OR VARIATION
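• Groups I, II and III above all have the same mean (60), yet the observations are spread very differently about it,
so a measure of central tendency alone does not describe each group.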
Dispersion /variation
• These include
o Range
o Variance
o Standard deviation
o Coefficient of variation
o Standard error of mean
o Inter-quartile range etc.
• RANGE:
• It is the simplest measure of spread.
• It is defined as the difference between the highest and the lowest
observations.
Range = maximum observation − minimum observation
• It tends to increase as the number of observations increases.
• It is not easily used for statistical inference.
• It uses only 2 of the observations and neglects all the
information on variation contained in the rest.
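• Variance and standard deviation
• The population variance (σ²) is defined as the sum of the squared distances of
each term in the distribution from the mean (μ), divided by the
number of terms in the distribution (N):
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
• For a sample, the variance is the mean square deviation about the sample mean $\bar{X}$, with divisor $n - 1$:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
• The standard deviation is the square root of the variance.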
• Table 5. Example
X       (X − X̄)   (X − X̄)²
70      10        100
29      −31       961
48      −12       144
90      30        900
92      32        1024
61      1         1
30      −30       900
TOTAL   0         4030
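The Table 5 working can be reproduced and carried through to the range, sample variance and standard deviation with a short sketch. It assumes the sample form of the variance (divisor n − 1); the variable names are illustrative.

```python
import math

# Observations from Table 5 (group I).
group_1 = [70, 29, 48, 90, 92, 61, 30]

n = len(group_1)
mean = sum(group_1) / n
deviations = [x - mean for x in group_1]
sum_of_squares = sum(d ** 2 for d in deviations)

value_range = max(group_1) - min(group_1)
variance = sum_of_squares / (n - 1)   # sample variance, divisor n - 1
std_dev = math.sqrt(variance)

print(f"mean = {mean:g}, sum of deviations = {sum(deviations):g}")
print(f"sum of squared deviations = {sum_of_squares:g}")
print(f"range = {value_range}, variance = {variance:.2f}, SD = {std_dev:.2f}")
```

The deviations always sum to zero, which is why they are squared before being added.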
• Quartiles divide a ranked set of data into
four equal parts.
• Deciles divide a ranked set of data into 10
equal parts.
• Percentiles divide a ranked set of data into
100 equal parts.
Measures of Location/position
• Quartiles are the observations which divide a ranked set of data
into four equal parts.
• The value below which 1/4 of the ordered observations fall is called
the lower or first quartile, Q1; the value below which 3/4 fall is the upper or third quartile, Q3.
• The distance between the lower and the upper quartiles is called
the interquartile range: IQR = Q3 − Q1
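As an illustration, the quartiles and interquartile range of group I (70, 29, 48, 90, 92, 61, 30) can be obtained with Python's `statistics.quantiles`. Several quartile conventions exist and can give slightly different answers; this sketch uses the function's default "exclusive" method, which is an assumption rather than the method prescribed in this lecture.

```python
import statistics

# Group I observations, ranked before locating the quartiles.
group_1 = sorted([70, 29, 48, 90, 92, 61, 30])

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(group_1, n=4)
iqr = q3 - q1

print(f"Q1 = {q1:g}, median = {q2:g}, Q3 = {q3:g}, IQR = {iqr:g}")
```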
Recall
For grouped data,
$$\text{Median} = L_M + \frac{n/2 - F_{M-1}}{F_M} \times C_i$$
• Where
$L_M$ = lower class boundary of the median class
$n$ = total number of observations
$F_{M-1}$ = cumulative frequency below the median class
$F_M$ = median class frequency
$C_i$ = median class interval (width)
Q. The following data represent the number of correct responses made to the examination
in statistics by 50 medical students in the Medical School selected systematically from the
list of all students in the School.
72 72 93 70 59 78 74 65 73 80
57 67 72 57 83 76 74 56 68 67
74 76 79 72 61 72 73 76 67 49
71 53 67 65 100 83 69 61 72 68
65 51 75 68 75 66 77 61 64 74
a. Prepare the frequency distribution table and the frequency histogram for this data set.
b. Compute the sample mean, sample median, sample range R, and sample variance.
c. Does the data set represent a sample or a population?
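One possible way to approach parts a and b is sketched below. The class width of 10 (classes 40-49, 50-59, and so on) is an assumption, not something fixed by the question, and the helper names `grouped_median` and `grouped_mode` are made up for illustration; they apply the grouped-data formulas given earlier in this lecture.

```python
import statistics
from collections import Counter

scores = [
    72, 72, 93, 70, 59, 78, 74, 65, 73, 80,
    57, 67, 72, 57, 83, 76, 74, 56, 68, 67,
    74, 76, 79, 72, 61, 72, 73, 76, 67, 49,
    71, 53, 67, 65, 100, 83, 69, 61, 72, 68,
    65, 51, 75, 68, 75, 66, 77, 61, 64, 74,
]
WIDTH = 10  # assumed class width, giving classes 40-49, 50-59, ..., 100-109

# Frequency table as (lower class limit, frequency) pairs.
freq = Counter((x // WIDTH) * WIDTH for x in scores)
table = sorted(freq.items())


def grouped_median(table, width):
    """LM + (n/2 - cumulative frequency below) / median class frequency * width."""
    n = sum(f for _, f in table)
    cumulative = 0
    for lower, f in table:
        if cumulative + f >= n / 2:
            # Lower class boundary is 0.5 below the limit for integer scores.
            return (lower - 0.5) + (n / 2 - cumulative) / f * width
        cumulative += f


def grouped_mode(table, width):
    """L + (fz - fl) / (2*fz - (fl + fh)) * width; assumes no empty class in between."""
    freqs = [f for _, f in table]
    k = freqs.index(max(freqs))
    fz = freqs[k]
    fl = freqs[k - 1] if k > 0 else 0
    fh = freqs[k + 1] if k + 1 < len(freqs) else 0
    return (table[k][0] - 0.5) + (fz - fl) / (2 * fz - (fl + fh)) * width


for lower, f in table:
    print(f"{lower}-{lower + WIDTH - 1}: {f}")

mean = statistics.mean(scores)
median = statistics.median(scores)
score_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance, divisor n - 1
print(mean, median, score_range, round(variance, 2))
print(round(grouped_median(table, WIDTH), 2), round(grouped_mode(table, WIDTH), 2))
```

The median from the raw scores and the median from the grouped formula differ slightly because grouping discards the exact values within each class.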
• Incidence rate
$$\text{Incidence rate} = \frac{\text{Number of new cases of illness in a defined period}}{\text{Average number of persons exposed to risk}}$$
• Prevalence rate
$$\text{Prevalence rate} = \frac{\text{Number of persons who are sick at a given time}}{\text{Average number of persons exposed to risk}}$$
• THANK YOU