Summarizing Data

This document discusses various methods for summarizing data including tabular, graphical, and numerical methods. It covers frequency distribution tables, histograms, frequency polygons, bar charts, pie charts and scatter plots. Examples are provided for each method.

BIOSTATISTICS (BIO 03)

SUMMARIZING DATA

By Dr. NANA AYEGUA HAGAN SENEADZA


METHODS OF SUMMARIZING DATA

• LECTURE OUTLINE
• Methods of summarizing data
❖TABULAR
❖GRAPHICAL and
❖NUMERICAL methods-
- Simple frequencies,
- Measures of central tendency,
- Measures of spread
• Other methods
❖Rates and Ratios
❖Measures of morbidity
❖Measures of mortality
Tabular method

• Before data can be displayed graphically, they must be organized
into tables, which summarize the data in a compact and readily
comprehensible form,
e.g. a frequency distribution table.
Tabular method

Tables should
• Have clearly labelled rows and columns
• Have a title
• Indicate the source of the data

Cross-tabular presentation (two-dimensional tables) is used for two variables
• E.g. age and sex distribution of Level 400 students
Tabular method

• A frequency distribution table is a table showing the number of
observations at different values of the variable.

• Its purpose is to display a meaningful pattern. It can be used for
all types of data, discrete or continuous.

• The categories must be mutually exclusive and collectively
exhaustive: each observation (e.g. each disease) must belong to one,
and only one, category of the table.

• Avoid open-ended intervals.

• Limit the number of classes to between 10 and 20.
• Classes may be of equal or unequal widths.

Summarizing data – TALLY

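Table 1: NATIONALITY OF MINE WORKERS IN TOWN A

Nationality      Tally                  Frequency
GHANAIANS        //// //// //// ////    19
OTHER ECOWAS     //// //// ////         14
EUROPE           //// /                  6
AMERICAS         //// ////               9
AFRICA           ///                     3
ASIA             ////                    4
Total                                   55

In each row, every bundle except the last is a completed bundle of four
strokes crossed by a fifth, and so counts as 5.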

Table 2: Disease pattern at an outpatient clinic

Disease                Frequency    Relative frequency (%)
Malaria                  186            31.0
Pneumonia                132            22.0
Measles                   48             8.0
Diarrhoeal diseases       54             9.0
Malnutrition              60            10.0
Others                   120            20.0
TOTAL                    600           100.0
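
The relative frequency column in Table 2 is each count divided by the total, expressed as a percentage. A minimal Python sketch of that arithmetic, using the counts from the table (the function name `relative_frequencies` is illustrative, not from the lecture):

```python
# Counts taken from Table 2 (disease pattern at an outpatient clinic).
counts = {
    "Malaria": 186,
    "Pneumonia": 132,
    "Measles": 48,
    "Diarrhoeal diseases": 54,
    "Malnutrition": 60,
    "Others": 120,
}


def relative_frequencies(freqs):
    """Return each category's share of the total, as a percentage."""
    total = sum(freqs.values())
    return {name: 100 * f / total for name, f in freqs.items()}


rel = relative_frequencies(counts)
for name, pct in rel.items():
    # One row per category: name, frequency, relative frequency (%).
    print(f"{name:<22}{counts[name]:>5}{pct:>8.1f}")
```

The percentages should add up to 100, which is a quick check that no category has been left out or counted twice.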



Table 3. Age structure of patients

Age interval (years)    Males    Females
< 1                        36        57
1 – 4                     191       196
5 – 14                    369       367
15 – 34                   263       384
35 – 54                   180       204
55 – 64                    64        71
65 – 99                    45        28
TOTAL                    1148      1307

Graphical presentation

• a. Continuous data set


• i Histogram
• ii Line graph
• iii Frequency polygon

• b. Discrete data set


• i Pie chart
• ii Bar diagram

• c. Other
• i Scatter diagram
• ii Spot diagram
Basic terms for frequency distribution

• Class limit, boundary, interval, width and midpoint

For the first class (300-399)


• Lower class limit = ??????
• Upper class limit = ??????
• Lower class boundary = ??????
• Upper class boundary = ??????
• The class width = ??????
• Class midpoint = ???????
Basic terms for frequency distribution

• Class limit, boundary, interval, width and midpoint

For the first class (300-399)


• Lower class limit = 300
• Upper class limit = 399
• Lower class boundary = 299.5
• Upper class boundary = 399.5
• The class width = Upper class boundary – lower class boundary = 399.5 – 299.5= 100
• Class midpoint = (Upper class limit + lower class limit)/2 OR (Upper class boundary + lower class
boundary)/2 = (399 + 300)/2 = 349.5
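
The definitions above can be sketched in a few lines of Python. This is only an illustration: the helper name `class_terms` and the `unit` parameter (the measurement precision, assumed to be 1 for whole-number data) are not from the lecture.

```python
def class_terms(lower_limit, upper_limit, unit=1):
    """Boundaries, width and midpoint of a class with the given limits.

    `unit` is the measurement precision (assumed 1 for whole numbers);
    each boundary lies half a unit beyond the corresponding limit.
    """
    lower_boundary = lower_limit - unit / 2
    upper_boundary = upper_limit + unit / 2
    return {
        "lower_boundary": lower_boundary,
        "upper_boundary": upper_boundary,
        "width": upper_boundary - lower_boundary,
        "midpoint": (lower_limit + upper_limit) / 2,
    }


# First class of the example: 300-399.
terms = class_terms(300, 399)
print(terms)
```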
• HISTOGRAM

• In constructing a histogram, the area of each bar must correspond to
the frequency of its interval.

• For data with unequal interval widths, the heights on the y-axis must
be adjusted.

• It is important to avoid open-ended intervals.

• The bars are not separated, i.e. there are no spaces between the bars.

• The y-axis gives the frequency of individuals and the x-axis gives the classes into
which the data have been grouped.

• The axes should be properly defined and clearly labelled, with the scale clearly shown.

• A footnote must be provided if the data are from another source.






Histogram
[Figure: Weight of luggage. x-axis: Weight group (1–5, 6–10, 11–15, 16–20, 21–25, 26–30, 31–35, 36–40, 41–45, 46–50, 51–55, 56–60); y-axis: Frequency (0 to 80)]
• HISTOGRAM

• In the case of data with unequal interval widths, the heights on the y-axis
must be adjusted.

• Example: the frequency distribution below gives the masses of 48 objects.

Mass (g)                           10 – 19        20 – 24     25 – 34        35 – 50        51 – 55
Frequency                          6              4           12             18             8
Class width                        10             5           10             15             5
Width on the x-axis                2 × standard   standard    2 × standard   3 × standard   standard
Rectangle's height in histogram    6 ÷ 2 = 3      4           12 ÷ 2 = 6     18 ÷ 3 = 6     8
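
The height adjustment in the table can be sketched as follows. The class widths are taken exactly as stated in the table, the narrowest width (5) is assumed to be the "standard" width, and the function name `adjusted_heights` is illustrative.

```python
classes = ["10-19", "20-24", "25-34", "35-50", "51-55"]
frequencies = [6, 4, 12, 18, 8]
widths = [10, 5, 10, 15, 5]   # class widths as stated in the table
standard_width = 5            # assumption: the narrowest class is the standard


def adjusted_heights(freqs, class_widths, standard):
    """Divide each frequency by the number of standard widths its class spans."""
    return [f / (w / standard) for f, w in zip(freqs, class_widths)]


heights = adjusted_heights(frequencies, widths, standard_width)
for label, h in zip(classes, heights):
    print(f"{label}: height {h}")
```

With this adjustment, height × (width ÷ standard width) gives back the class frequency, so the area of each bar stays proportional to the number of observations.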


Frequency polygons

Steps
• Create a histogram.
• Find the midpoint of the top of each bar of the histogram.
• Add a point on the x-axis (zero frequency) one class width before the
first bar and one after the last bar, so that the polygon is closed.
• Connect the points with straight lines.
• For intervals of unequal widths, the heights
on the y-axis must be adjusted.

• You can also create the polygon without first creating the
histogram, by plotting the frequencies against the class midpoints.


Frequency polygons
LINE GRAPH:
[Figure: Median HIV Prevalence 2000 – 2009. x-axis: year (2000 to 2009); y-axis: HIV Prevalence (0 to 3.6)]
Multiple line graphs

• Figure 2. Cell phone use in Ghana, 1996 to 2002


Are there any advantages of the frequency polygon over the
histogram?
GRAPHICAL DISPLAYS FOR DISCRETE DATA

• BAR DIAGRAM
• The bars are separated and the widths are equal for the
respective categories. Numbers or frequencies or percentages
can be used.

• Bars can be vertical or horizontal


• Bars can be used to represent multiple categories of the
variable
Bar chart-types

COMPOSITE BAR CHART

• One bar is used for each group of the descriptive attribute, e.g.
– distribution of patients who
responded yes/no/unknown to
having comorbid conditions in
addition to their present diagnosis
Pie chart

• A circular diagram cut into several segments representing the
various groups of a descriptive attribute.
• It is drawn to show the percentage composition of a
descriptive attribute.
• Drawing one by hand may require a compass and a protractor.
• It is poorly suited to comparing two or more
distributions, i.e. it is visually difficult to relate the sizes of
segments.

Pie chart

[Figure: Pie chart of arrivals at KIA. Legend: Ghanaians, ECOWAS, Africans, Americas, EU, Asia. Segment labels: 27%, 21%, 19%, 14%, 12%, 7%]
Scatter Plots

A scatter plot is a graphical tool for exploring the
relationship between two variables
• The response/outcome/dependent variable, Y, is on
the vertical axis
• The predictor/covariate/independent variable, X, is on
the horizontal axis

• Question: What to look for in a scatter plot?

Nature of Relationship – Linear?

• OTHER GRAPHICAL METHODS

• Spot diagrams and area graphs are often used in epidemiology to display
the geographical distribution and the intensity of disease distribution,
respectively.
[Figure: Spot map showing location of HIV sentinel sites in Ghana. Source: NACP/GHS]
[Figure: Ebola outbreak in West Africa]
Numerical or mathematical methods of data presentation

• Introduction
• It is often important to be able to describe the raw
data with one or two summary figures.
Numerical methods

• The appropriate summary measure depends on the type of data.
✓ Numeric data, e.g. parity, systolic BP, are summarized using
measures of location/central tendency (mean, median, mode)
and dispersion (standard deviation, range, etc.)
✓ Non-numeric data, e.g. sex, tribe, etc., are summarized by
proportions or percentages
Numerical methods

Measures of central tendency
• Mean (arithmetic, geometric etc.)
• Median
• Mode

Measures of spread/dispersion
• Range
• Variance
• Standard deviation (square root of variance)
• Standard error of the mean
• Coefficient of variation [(SD / mean) × 100%]
• Percentiles, quintiles, quartiles, tertiles etc.
Numerical methods

• Proportions or percentages have no units.
• The measures of location and dispersion are in the same units
as the data, e.g. average age in years,
• except the variance, which is in squared units,
• and the coefficient of variation, which has no units.
Calculating summary measures

• Proportions
• If N = number of subjects in a sample and
n = number within the same sample having an attribute, then the
proportion with the attribute is n/N.
E.g. in a survey of 150 medical students, 20 tested positive for
Hepatitis B infection.
The proportion of students with Hepatitis B infection is 20/150 =
0.13 or 13%.
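
As a quick check of the arithmetic, the proportion can be computed in Python. The function name `proportion` is illustrative; the numbers are those of the Hepatitis B example.

```python
def proportion(n_with_attribute, n_total):
    """Proportion n/N of subjects in a sample having the attribute."""
    if n_total <= 0:
        raise ValueError("sample size must be positive")
    return n_with_attribute / n_total


# 20 of 150 medical students tested positive for Hepatitis B.
p = proportion(20, 150)
print(f"{p:.2f} or {p:.0%}")
```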
• MEASURES OF CENTRAL TENDENCY.

• The most common measures of central tendency are the mean,
median, mode and the geometric mean.
• Each has its advantages and disadvantages as a measure of
location.
• MEAN

• The sum of the individual items divided by the total number of items. It
is amenable to mathematical manipulation but is easily affected by
extreme observations.

• For grouped data, the mean is obtained by multiplying the value of
each item by its frequency and summing the products to obtain the
numerator; the denominator is the sum of all the frequencies.

• For data classified into intervals, the frequencies are
multiplied by the class midpoints, because it is not known exactly
where the observations are located within the classes.

• The class midpoint is obtained by adding the two class limits and
dividing by two.
• Mean
If X_1, X_2, X_3, …, X_n are numeric observations made on n subjects,
then the mean is

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n}$$

For grouped data,

$$\bar{X} = \frac{\sum fX}{\sum f}$$

where f is the frequency of observation X.
AGE IN YEARS (X)    FREQUENCY (f)    X²    fX    fX²
21                  38
22                  35
23                  28
24                  24
25                  28

Note: X can be the class midpoint when using classes (intervals)

• ΣfX = ?
• Σf = ?
• Mean = ?
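
One way to check the working for this table is a short Python sketch of the grouped-mean formula, mean = ΣfX / Σf. The variable and function names are illustrative; the ages and frequencies are those in the table.

```python
ages = [21, 22, 23, 24, 25]    # X, age in years
freqs = [38, 35, 28, 24, 28]   # f, frequency of each age


def grouped_mean(values, frequencies):
    """Return (sum of f*X, sum of f, mean) for a frequency table."""
    sum_fx = sum(f * x for x, f in zip(values, frequencies))
    sum_f = sum(frequencies)
    return sum_fx, sum_f, sum_fx / sum_f


sum_fx, sum_f, mean_age = grouped_mean(ages, freqs)
print(f"sum fX = {sum_fx}, sum f = {sum_f}, mean = {mean_age:.2f}")
```

For data grouped into intervals, the same function applies with the class midpoints passed as `values`.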
• Median
The item located at the midpoint when all the observations are arranged in ascending or descending order.

• It is the middle-ranked observation.

• It is less influenced by extreme values; however, it is not easily amenable to mathematical manipulation.

• It is the best measure of central tendency for skewed data.

First locate the middle position = (n+1)/2

Odd vs even number of observations: when n is odd, the median is the observation at position (n+1)/2; when n is even, (n+1)/2 falls between two observations and the median is the average of those two.

For grouped data,

$$\text{Median} = L_M + \frac{\frac{n}{2} - F_{M-1}}{F_M} \times C_i$$

• Where
L_M = lower class boundary of the median class
n = total number of observations
F_{M-1} = cumulative frequency below the median class
F_M = median class frequency
C_i = median class interval (width)
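
A minimal Python sketch of both cases follows: the median of raw observations (odd or even n) and the grouped-data formula. The function names and the small grouped table are hypothetical and serve only to illustrate the formula.

```python
def median_raw(values):
    """Median of raw observations: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped(boundaries, freqs):
    """Grouped median: L + (n/2 - F_below) / f_median * class width.

    boundaries: list of (lower, upper) class boundaries in ascending order.
    freqs: frequency of each class.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    cumulative = 0  # cumulative frequency below the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("median class not found")


# Raw data: an odd and an even number of observations.
print(median_raw([70, 29, 48, 90, 92, 61, 30]))
print(median_raw([1, 2, 3, 4]))

# Hypothetical grouped table (illustrative values only).
example_boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5)]
example_freqs = [5, 10, 5]
print(median_grouped(example_boundaries, example_freqs))
```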
• GEOMETRIC MEAN
• It is a useful summary statistic for antibody assays and
bacterial counts, and for skewed data.

• It is defined as the Nth root of the product of N observations.

• It cannot be used if any of the observations is zero or negative.

• Example: results of measles antibody measurements of 5
children.

• 4 8 16 16 64 (VERY SKEWED)
• GM = fifth root of (4 × 8 × 16 × 16 × 64)
• Taking logs (base 10) on both sides:
• 5 log GM = log 4 + log 8 + log 16 + log 16 + log 64
• = 5.72
• GM = antilog of (5.72/5)
• = 13.9
• On the other hand, the arithmetic mean = 21.6, the median = 16
and the mode = 16.
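
The log-based calculation above can be reproduced with a short Python sketch. The function name `geometric_mean` is illustrative; the five antibody measurements are those of the example.

```python
import math

titres = [4, 8, 16, 16, 64]  # measles antibody measurements of 5 children


def geometric_mean(values):
    """Antilog of the mean of the base-10 logs of the observations."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    log_sum = sum(math.log10(v) for v in values)
    return 10 ** (log_sum / len(values))


gm = geometric_mean(titres)
arithmetic_mean = sum(titres) / len(titres)
print(f"geometric mean = {gm:.1f}, arithmetic mean = {arithmetic_mean:.1f}")
```

The single large value (64) pulls the arithmetic mean well above the geometric mean, which is why the latter is preferred for skewed data of this kind.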
• Mode
This is the most frequently occurring observation.
For grouped data,

$$\text{Mode} = L + \frac{f_z - f_l}{2f_z - (f_l + f_h)} \times i$$

• Where
L = lower class boundary of the modal class
f_z = frequency of the modal class
f_l = frequency in the adjacent lower class
f_h = frequency in the adjacent higher class
i = modal class interval (width)
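
The grouped-mode formula can be sketched in Python as below. The function name and the example values are hypothetical, chosen only to show the arithmetic.

```python
def mode_grouped(lower_boundary, f_modal, f_lower, f_higher, width):
    """Mode = L + (fz - fl) / (2*fz - (fl + fh)) * i for grouped data."""
    denominator = 2 * f_modal - (f_lower + f_higher)
    if denominator <= 0:
        raise ValueError("modal class must have the highest frequency")
    return lower_boundary + (f_modal - f_lower) / denominator * width


# Hypothetical modal class 29.5-39.5 with frequency 12, neighbours 8 and 6.
example_mode = mode_grouped(29.5, f_modal=12, f_lower=8, f_higher=6, width=10)
print(example_mode)
```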
Distributions
MEASURES OF DISPERSION OR SPREAD OR
VARIATION

• The data below represent the post-evaluation
results of three groups of participants at a workshop,
supervised by different facilitators.

• Which of the three groups would you have liked to
have been assigned to?

Group I      70  29  48  90  92  61  30
Group II     68  72  65  50  58  63  44
Group III    59  59  58  60  60  61  63
• MEASURES OF DISPERSION OR SPREAD OR VARIATION

• The mean score per group is identical.

• However, it is important to know whether the observations are all
close to the mean or whether they scatter widely in each
direction.

• Information about the variation within groups provides
useful additional statistics, which could help to rate the strength
of each group.
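
A short Python sketch makes the point concrete for the three workshop groups: the means coincide while the spread differs markedly. It uses the standard-library `statistics` module; the helper name `summarize` is illustrative.

```python
import statistics

groups = {
    "I": [70, 29, 48, 90, 92, 61, 30],
    "II": [68, 72, 65, 50, 58, 63, 44],
    "III": [59, 59, 58, 60, 60, 61, 63],
}


def summarize(scores):
    """Mean, range and sample standard deviation of one group."""
    return {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "sd": statistics.stdev(scores),  # divides by n - 1
    }


summary = {name: summarize(scores) for name, scores in groups.items()}
for name, s in summary.items():
    print(f"Group {name}: mean={s['mean']}, range={s['range']}, sd={s['sd']:.1f}")
```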
Dispersion /variation

• The degree to which numerical data tend to spread about an
average value
Measures of dispersion/variation/spread

• These include
o Range
o Variance
o Standard deviation
o Coefficient of variation
o Standard error of mean
o Inter-quartile range etc.
• RANGE:
• It is the simplest measure of spread,
• defined as the difference between the highest and the lowest
observations.
Range = maximum observation − minimum observation
• It tends to increase as the number of observations increases.
• It is not easily used for statistical inference.
• It uses only 2 of the observations and ignores the
information about variation carried by the rest.
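• Variance and standard deviation
• The population variance (σ²) is defined as the sum of the squared distances of
each term in the distribution from the mean (μ), divided by the
number of terms in the distribution (N). For a sample, the sum is
divided by n − 1 instead. The standard deviation is the square root
of the variance.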


• VARIANCE: mean square deviation, $\sum (X - \bar{X})^2 / (n - 1)$

• Table 5. Example (Group I scores, mean = 60)

X        (X − X̄)    (X − X̄)²
70          10          100
29         −31          961
48         −12          144
90          30          900
92          32         1024
61           1            1
30         −30          900
TOTAL        0         4030

• VARIANCE = 4030/6 = 671.7
• STANDARD DEVIATION = √671.7 = 25.9
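
The table can be reproduced step by step in Python. This sketch uses the n − 1 divisor shown in the example; the function name `sample_variance` is illustrative.

```python
import math

scores = [70, 29, 48, 90, 92, 61, 30]  # Group I scores from Table 5


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


variance = sample_variance(scores)
sd = math.sqrt(variance)
print(f"variance = {variance:.1f}, standard deviation = {sd:.1f}")
```

The deviations themselves always sum to zero, which is why they are squared before being added.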









• COEFFICIENT OF VARIATION

• It is the standard deviation expressed as a percentage of
the mean.
• Useful in comparing the variation of different attributes, e.g.
variations in weight, height, and age of a study population.
• It can enable one to conclude, for example, that weights are more
spread than heights of preschool children.
• It is a dimensionless statistic.
• CV = (standard deviation / mean) × 100
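
As an illustration, the coefficient of variation can be computed for two of the workshop groups shown earlier (Groups I and III), which share a mean of 60 but differ greatly in spread. The function name is illustrative and the sample standard deviation (n − 1 divisor) is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


group_i = [70, 29, 48, 90, 92, 61, 30]
group_iii = [59, 59, 58, 60, 60, 61, 63]

cv_i = coefficient_of_variation(group_i)
cv_iii = coefficient_of_variation(group_iii)
print(f"CV Group I = {cv_i:.1f}%, CV Group III = {cv_iii:.1f}%")
```

Because the CV has no units, the same comparison works for attributes measured on different scales, such as weight in kg and height in cm.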
Measures of Location/position

• The median divides the ranked dataset into 2 equal parts

• Quartiles divide a ranked dataset into
four equal parts
• Deciles divide a ranked dataset into 10
equal parts
• Percentiles divide a ranked dataset into
100 equal parts
Measures of Location/position

Measure of position        Position in the ranked dataset
Q1 (lower quartile)        ¼ (n+1)
Q2 (median)                ½ (n+1)
Q3 (upper quartile)        ¾ (n+1)
D7, D9, Dk                 7/10 (n+1), 9/10 (n+1), k/10 (n+1)
Pk                         k/100 (n+1)
Example

• Dataset of ages of participants receiving the MMR vaccine

1, 27, 16, 7, 31, 7, 30, 3, 21, 15, 13, 11, 5

Find Q1, Q2,Q3, D5, D8, P80
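
One way to work this example is to rank the ages, compute the position k(n+1)/parts from the table of positions, and interpolate linearly when the position falls between two observations. The sketch below assumes that interpolation convention (others exist and can give slightly different answers); the function names are illustrative.

```python
ages = [1, 27, 16, 7, 31, 7, 30, 3, 21, 15, 13, 11, 5]


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower_index = int(position)          # whole part of the position
    fraction = position - lower_index    # fractional part
    if lower_index < 1:
        return ordered[0]
    if lower_index >= len(ordered):
        return ordered[-1]
    lower = ordered[lower_index - 1]
    upper = ordered[lower_index]
    return lower + fraction * (upper - lower)


def quantile(values, k, parts):
    """k-th of `parts` equal divisions, at position k * (n + 1) / parts."""
    ordered = sorted(values)
    position = k * (len(ordered) + 1) / parts
    return value_at_position(ordered, position)


q1, q2, q3 = (quantile(ages, k, 4) for k in (1, 2, 3))
d5, d8 = quantile(ages, 5, 10), quantile(ages, 8, 10)
p80 = quantile(ages, 80, 100)
print(f"Q1={q1}, Q2={q2}, Q3={q3}, D5={d5}, D8={d8:.1f}, P80={p80:.1f}")
```

Note that D5 coincides with the median, and D8 with P80, since they refer to the same position in the ranked data.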


• QUARTILES

• These are observations which divide a ranked set of data
into four equal parts.

• The value below which 1/4 of the ordered observations fall is called
the lower or first quartile, Q1.

• The value exceeded by 1/4 of the ordered observations is called the
third or upper quartile, Q3.

• The distance between the lower and the upper quartiles is called the
interquartile range (IQR) = Q3 − Q1

• The semi-interquartile range, (Q3 − Q1)/2, is also a measure of
variation and, unlike the range, is not influenced by extreme
observations.


Percentiles
• Finding percentiles in grouped data:

Recall that, for grouped data,

$$\text{Median} = L_M + \frac{\frac{n}{2} - F_{M-1}}{F_M} \times C_i$$

• Where
L_M = lower class boundary of the median class
n = total number of observations
F_{M-1} = cumulative frequency below the median class
F_M = median class frequency
C_i = median class interval (width)
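
The median formula extends to any percentile by replacing n/2 with kn/100 and using the class that contains that position. The lecture does not spell this generalization out, so the sketch below is an assumption based on the usual textbook form; the function name and the grouped table are hypothetical.

```python
def percentile_grouped(boundaries, freqs, k):
    """k-th percentile of grouped data: L + (k*n/100 - F_below) / f * width.

    boundaries: list of (lower, upper) class boundaries in ascending order.
    freqs: frequency of each class.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    target = k * n / 100     # position of the k-th percentile
    cumulative = 0           # cumulative frequency below the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("percentile class not found")


# Hypothetical grouped table (illustrative values only).
table_boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5)]
table_freqs = [5, 10, 5]
for k in (25, 50, 80):
    print(k, percentile_grouped(table_boundaries, table_freqs, k))
```

With k = 50 the function reduces to the grouped median formula recalled above.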
Q. The following data represent the number of correct responses in a statistics examination
by 50 medical students in the Medical School, selected systematically from the
list of all students in the School.

72 72 93 70 59 78 74 65 73 80
57 67 72 57 83 76 74 56 68 67
74 76 79 72 61 72 73 76 67 49
71 53 67 65 100 83 69 61 72 68
65 51 75 68 75 66 77 61 64 74
a. Prepare the frequency distribution table and the frequency histogram for this data set.
b. Compute the sample mean, sample median, sample range R, and sample variance.
c. Does the data set represent a sample or a population?
If it is a sample, describe the population from which it has been drawn.
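
A possible starting point for parts (a) and (b), sketched in Python with the standard-library `statistics` module. The class width of 5 is an assumption (it gives 12 classes here, within the 10 to 20 suggested earlier); other widths are equally defensible.

```python
import statistics
from collections import Counter

scores = [
    72, 72, 93, 70, 59, 78, 74, 65, 73, 80,
    57, 67, 72, 57, 83, 76, 74, 56, 68, 67,
    74, 76, 79, 72, 61, 72, 73, 76, 67, 49,
    71, 53, 67, 65, 100, 83, 69, 61, 72, 68,
    65, 51, 75, 68, 75, 66, 77, 61, 64, 74,
]

# Part (b): summary measures.
sample_mean = statistics.mean(scores)
sample_median = statistics.median(scores)
sample_range = max(scores) - min(scores)
sample_variance = statistics.variance(scores)  # divides by n - 1
print(sample_mean, sample_median, sample_range, round(sample_variance, 2))

# Part (a): frequency distribution with an assumed class width of 5.
width = 5
table = Counter((x // width) * width for x in scores)
for lower in range(min(table), max(table) + width, width):
    print(f"{lower}-{lower + width - 1}: {table.get(lower, 0)}")
```

The printed class counts can be used directly as the bar heights of the frequency histogram, since all classes have the same width.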


Other summary measures

Rates and Ratios can be used to summarize data

• What are the differences between rates, ratios and
proportions?

Rates
• Crude rates
❖ Crude birth rate, Crude death rate
• Specific rates-age or sex specific rates
❖ Infant mortality, Under 5 mortality, Maternal Mortality etc.
• Standardized rates
❖ Direct and Indirect standardized rates (for comparing two or more
different population groups)
Ratios
• Male: Female ratio
• Maternal Mortality ratio etc
Some Measures of morbidity

• Incidence rate

$$\text{Incidence rate} = \frac{\text{Number of new cases of illness in a defined period}}{\text{Average number of persons exposed to risk}}$$

• Prevalence rate

$$\text{Prevalence rate} = \frac{\text{Number of persons who are sick at a given time}}{\text{Average number of persons exposed to risk}}$$
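
The two rates share the same form, so a single helper covers both. In this sketch the function name, the choice to express rates per 1,000 persons, and all the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def rate(cases, persons_at_risk, per=1000):
    """Cases divided by persons exposed to risk, scaled to `per` persons."""
    if persons_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return cases * per / persons_at_risk


# Hypothetical figures for a population of 10,000 persons at risk.
incidence = rate(cases=50, persons_at_risk=10_000)    # new cases in the period
prevalence = rate(cases=200, persons_at_risk=10_000)  # all persons sick at one time
print(f"incidence = {incidence} per 1000, prevalence = {prevalence} per 1000")
```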
• THANK YOU
