Biostatistics and Demography - Lecture 2
Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy
Summarizing and presenting data
Four scales of measurement
[Figure: diagram classifying data as categorical (including binary or dichotomous) or quantitative (interval, ratio)]
Introduction…
• In public health and health research we are
interested in describing a group (of people, things,
etc.) rather than an individual person.
• We have noted the inherent variability in biological,
socio-economic, or behavioral processes.
• Knowledge and application of appropriate
statistical methods enable us to describe groups of
people with varying experiences.
Summarizing and presenting data: methods and tools
• Summarizing categorical data
– Counts
– Proportions
– Percentages
– Rates, ratios
• Summarizing quantitative data
– Measures of central tendency (mode, median, mean)
– Measures of dispersion (minimum and maximum, IQR, standard deviation)
• Graphical presentation of data
– Categorical data (bar chart, pie chart)
– Continuous data (histogram, box plot)
Summarising Categorical Data:
Frequency counts
• The frequency distribution tells us how many
times different values of a variable occur in a
given sample or population. Frequency
distribution of a random sample of 200 new
students:
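A minimal sketch of how counts, proportions and percentages can be tabulated for a categorical variable. The ten observations are hypothetical and are not the 200 students referred to above; the function name `frequency_table` is also just illustrative.

```python
from collections import Counter

# Hypothetical categorical variable (sex of 10 students); not data from the lecture.
sex = ["F", "M", "F", "F", "M", "F", "M", "F", "F", "M"]

def frequency_table(values):
    """Return {category: (count, proportion, percentage)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, k / n, 100 * k / n) for cat, k in counts.items()}

table = frequency_table(sex)
for cat, (count, prop, pct) in sorted(table.items()):
    print(f"{cat}: count={count}, proportion={prop:.2f}, percent={pct:.1f}%")
```

The proportions sum to 1 and the percentages to 100, which is a quick check that no observation was dropped.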
Other Basic Measures of Frequency:
Ratios
Examples
• Number of doctors to population size
• Number of nurses to number of patients in
a hospital
• Number of girls to number of boys
• Number of households with a bed net to
number of households without a bed net
Other Basic Measures of Frequency: Rates
For example
-Number of disease events per year
-Number of cases of dysentery per week
• Numerator is number of events
• Denominator is total time of observation
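The difference between a ratio (one count relative to another) and a rate (events per unit of observation time) can be sketched as below. All figures are hypothetical and chosen only to keep the arithmetic easy.

```python
# Hypothetical figures for illustration only; none come from the lecture.
nurses, patients = 25, 400        # ratio: two separate counts
cases, weeks_observed = 36, 12    # rate: events over time of observation

def ratio(a, b):
    """Ratio of count a to count b, expressed as 1 : (b / a)."""
    return b / a

def rate(events, time_observed):
    """Number of events per unit of observation time."""
    return events / time_observed

print(f"Nurse-to-patient ratio = 1 : {ratio(nurses, patients):.0f}")
print(f"Dysentery rate = {rate(cases, weeks_observed):.1f} cases per week")
```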
Mbarara University of Science and
Technology
Faculty of Medicine, Department of Pharmacy
SUMMARIZING CONTINUOUS
DATA
Example: Frequency distribution of
cholesterol in mg/dl (n=250)
totchol Freq. Percent Cum.
Frequency distribution of cholesterol data,
Class width is 20 mg/dl
totcholcat Freq. Percent Cum.
[Figure: plot of the cholesterol data; observe the centre and spread of the data]
Tools for Visualising distribution of continuous
data: Histogram
[Figure: histogram; vertical axis: percent of the sample. Slide prompt (partly lost): "... represents the Mean, Median and Mode?"]
Recall: Frequency distribution of cholesterol in
mg/dl (n=250)
totchol Freq. Percent Cum.
Creating a Histogram
1. Determine the minimum and maximum values
for the variable of interest, e.g. total cholesterol
• Minimum is 135 mg/dl, maximum is 464 mg/dl
• Determine the range (max - min = 329 mg/dl)
2. Determine the number of classes/groups you
want to have, e.g. 11.
3. Determine the class interval width ≈ range
divided by number of classes = 329/11 ≈ 30 mg/dl
4. Create mutually exclusive categories/classes of
the original (continuous) variable.
Classes are also called bins.
5. Start the first interval at a convenient value
just below the minimum, e.g. 134.5 mg/dl
• The first class will therefore be 134.5 to 164.5 mg/dl
• The second class will be 164.5 to 194.5 mg/dl
Creating a Histogram
Determine the frequency counts and relative
frequency for each category/class
To plot the graph:
• On the horizontal axis mark equally spaced
values of the lower boundary of each class
• On the vertical axis, the length represents
the frequency
• Plot the frequencies for each class as bars.
• The height of the bar will be proportional to
the frequency of that class.
• The width of the bars is the same.
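The steps for building the classes and their frequencies can be sketched as follows. The minimum, maximum, number of classes and starting value are those quoted on the slides; the twelve cholesterol readings are hypothetical, since the 250 original values are not listed. Each class is treated as [lower, upper), which is an assumption about how boundary values are handled.

```python
def class_edges(start, width, n_classes):
    """Boundaries of n_classes equal-width classes beginning at start."""
    return [start + i * width for i in range(n_classes + 1)]

def class_frequencies(values, edges):
    """Count values in each class [lower, upper) and convert to percentages."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = len(values)
    percents = [100 * c / n for c in counts]
    return counts, percents

# From the slides: min 135, max 464, 11 classes, first class starts at 134.5.
minimum, maximum, n_classes = 135, 464, 11
width = round((maximum - minimum) / n_classes)  # 329 / 11 is about 29.9, rounded to 30
edges = class_edges(134.5, width, n_classes)

# Hypothetical cholesterol readings (mg/dl), for illustration only.
sample = [135, 150, 170, 188, 201, 215, 236, 237, 240, 262, 291, 464]
counts, percents = class_frequencies(sample, edges)

# Text version of the histogram: one row per class, bar length = frequency.
for lo, hi, c, p in zip(edges, edges[1:], counts, percents):
    print(f"{lo:6.1f} to {hi:6.1f}: {c:2d} ({p:4.1f}%) " + "#" * c)
```

Starting at 134.5 with a width of 30 gives a last class ending at 464.5, so the observed maximum of 464 is covered.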
Is the shape of a histogram sensitive to the number of classes?
[Figure: histogram with class width ≈30 mg/dl; vertical axis: percent]
The shape of a histogram is sensitive to the number of classes.
[Figure: histogram; vertical axis: percent]
Distribution of continuous data
• The shape of the frequency distribution can
be symmetrical or asymmetrical
• A symmetric distribution has the same
shape on both sides of the mean (the
centre)
• If outlying values occur only in one direction,
the distribution is said to be skewed
• Normally distributed data has zero skewness
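Skewness can also be expressed as a number. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one of several definitions in use and is an assumption here, not a formula from the slides. Both data sets are hypothetical.

```python
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]             # hypothetical, mirror image about 3
right_skewed = [1, 2, 2, 3, 3, 3, 20]   # hypothetical, one high outlying value

print(skewness(symmetric))      # zero for a perfectly symmetric set
print(skewness(right_skewed))   # positive: the long tail is on the right
```

A value near zero suggests symmetry, a positive value a tail to the right, and a negative value a tail to the left.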
Shape of distribution of continuous data:
Symmetrical
[Figure: symmetrical distribution; vertical axis: frequency]
Shape of distribution of continuous data:
Skewed to the right
Shape of distribution of continuous data:
Skewed to the Left
MEASURES OF CENTRAL
TENDENCY
Median
• The median is the value found at the exact
middle of the ordered set of values
• Arrange the observations in order of magnitude;
the median is the middle observation
• When the number of observations is even, the
median is the average of the two middle observations
• The median divides the set of observations into
two equal parts, such that the number of
values greater than or equal to the median
equals the number of values less than or
equal to the median
Consider the following age data:
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19
20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20
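As a sketch of the procedure, the median of the 40 ages listed above can be found by sorting and taking the middle of the ordered list. With an even number of observations the two middle values are averaged, a standard convention assumed here.

```python
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(len(ages), median(ages))
```

Here the 20th and 21st ordered values are 20 and 21, so the median is 20.5 years.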
The Arithmetic Mean
• The arithmetic mean is the most commonly used
measure of central tendency
• Calculation of mean requires numerical data
• For a given variable, the mean is obtained by
1. Adding all values in the sample
2. Dividing the sum by the number of
observations in the sample (sample size)
• For a given set of data and variable, there is
only one mean
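Arithmetic mean - computation
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where $x_i$ are the observed values and $n$ is the sample size.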
Arithmetic mean -Example
For this set of observations in a sample, calculate the mean.
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19
26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20
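The calculation for this example can be sketched directly from the two steps given earlier: add all values, then divide by the sample size. The function name is illustrative.

```python
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

total = sum(ages)   # step 1: add all values in the sample
n = len(ages)       # sample size
print(total, n, arithmetic_mean(ages))
```

The sum is 830 over 40 observations, giving a mean of 20.75 years.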
Summary of the Age data of 40
individuals:
• Median=20.5 years,
• Modal age=21 years
• Mean=20.75 years
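The modal age can be checked by counting how often each age occurs. This is a minimal sketch; the helper `modes` is a hypothetical name and returns every value tied for the highest frequency, since a data set may have more than one mode.

```python
from collections import Counter

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top

modal_ages, frequency = modes(ages)
print(modal_ages, frequency)
```

Age 21 occurs five times, more often than any other value.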
Using a graph to see the distribution helps
to identify key features such as the presence of outliers.
[Figure: histogram with outliers marked; vertical axis: percent]
Outliers & Arithmetic Mean
• Outliers fall outside the general pattern of the
distribution.
• The value of the arithmetic mean is sensitive
to/affected by outliers
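To illustrate this sensitivity, the sketch below takes the 40 ages from this lecture and supposes, hypothetically, that the oldest age (45) had been mis-entered as 450. The mean shifts substantially while the median does not move.

```python
from statistics import mean, median

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

# Hypothetical data-entry error: 45 recorded as 450.
corrupted = [450 if a == 45 else a for a in ages]

print("original :", mean(ages), median(ages))
print("corrupted:", mean(corrupted), median(corrupted))
```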
The Geometric Mean
• For a variable X with observations xi, the
geometric mean of a set of n observations is
equal to the nth root of the product of the
n observations.
• For a given data set of positive values, the geometric
mean is less than or equal to the arithmetic mean.
Geometric mean = $\sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
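A minimal sketch of the geometric mean, computed through logarithms (equivalent to the nth root of the product, but less prone to overflow for long lists). The three values are hypothetical, and the restriction to positive observations is an assumption needed for the logarithm.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values, computed via logs."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive observations")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

titres = [2, 8, 32]   # hypothetical positive observations
print(geometric_mean(titres), arithmetic_mean(titres))
```

For these values the product is 512, whose cube root is 8, while the arithmetic mean is 14.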
MEASURES OF VARIATION
Tools for Visualising distribution of continuous
data: Histogram
[Figure: histogram; vertical axis: percent of the sample. Observe the spread of the data.]
Tools for Visualising variation:
the Box Plot
[Figure: box plot showing the distribution of total cholesterol; vertical axis: totchol (mg/dl), 100 to 500]
Box Plot - visualising variation in
age
[Figure: box plot showing the distribution of age (n=40); vertical axis: age (years), 10 to 50]
Box plot…
• Standardized way of displaying data.
• Based on five number summary;
1. Minimum
2. First quartile (Q1)
3. Median
4. Third quartile (Q3)
5. Maximum
How to draw a Box Plot
1. Sort the data from minimum to maximum
2. Determine the Min, Q1, Median, Q3, Maximum
3. Determine the IQR (i.e. Q3-Q1) and the value of IQR*1.5
4. Obtain the values of Q1-IQR*1.5, and Q3+IQR*1.5
5. Draw and Label a vertical line that includes the range of
the distribution
6. Draw a central box from Q1 to Q3
7. Draw a horizontal line for the median inside the box
8. Extend vertical lines (whiskers) from the box (at Q1 and
at Q3) out to the lowest and highest observations that fall
within the general distribution (i.e. not outliers). Each
whisker is therefore at most 1.5 times the IQR long; observations
beyond Q1-IQR*1.5 or Q3+IQR*1.5 are plotted individually as outliers.
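The numbers needed for the box plot can be sketched for the 40 ages used in this lecture. Quartile conventions differ between textbooks and software; the default "exclusive" method of Python's `statistics.quantiles` is used here as an assumption, so Q3 and the fences may differ slightly from other packages.

```python
from statistics import median, quantiles

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def box_plot_summary(values):
    """Five-number summary, IQR, fences, whisker ends and outliers."""
    ordered = sorted(values)                 # step 1
    q1, _, q3 = quantiles(ordered, n=4)      # step 2 (assumed quartile method)
    iqr = q3 - q1                            # step 3
    lower_fence = q1 - 1.5 * iqr             # step 4
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "min": ordered[0], "q1": q1, "median": median(ordered),
        "q3": q3, "max": ordered[-1], "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],   # step 8
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }

summary = box_plot_summary(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under this convention Q1 is 18 and Q3 is 23.75, the whiskers end at ages 10 and 30, and ages 8 and 45 are flagged as outliers.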
Determining Q1 & Q3
Box Plot: Location of fences
• When the calculated value of the upper
fence (Q3+IQR*1.5) is greater than the maximum
observation in the data, the upper whisker
ends at the observed maximum value.
[Figure: box plot of total cholesterol; vertical axis: totchol (mg/dl), 100 to 500. Observe the location of the median relative to Q1 and Q3.]
[Figure: symmetry plot showing the distribution of data around the median; horizontal axis: distance below median (0 to 100); vertical axis: distance above median (0 to 250)]
• Inter-quartile range (IQR)
– Is the difference between the 3rd quartile (75th
percentile) and the 1st quartile (25th percentile), i.e. Q3-Q1
– Contains the central 50% of the observations
Quantifying variation: Standard
Deviation
• Standard deviation is a measure of the
spread of observations about their mean
• It is a measure of how much on average each
of the values in the distribution deviates
from the mean
• Standard deviation is an essential part of
many statistical tests
• The value of the standard deviation is
affected by outliers
Calculating the Standard
Deviation
1. Calculate the arithmetic mean
2. Calculate the difference between each observation
in the data set and the mean, and square it
3. Obtain the sum of the squared deviations
4. Divide the sum of the squared deviations by
n-1 (number of observations in the sample
minus one); the result is the variance
5. Take the square root of the variance to obtain
the standard deviation
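The steps can be sketched for the 40 ages used in this lecture. Dividing by n-1 gives the sample variance, and its square root is the standard deviation. Function and variable names are illustrative.

```python
import math

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n                            # arithmetic mean
    squared = [(x - m) ** 2 for x in values]       # squared deviations
    total = sum(squared)                           # sum of squared deviations
    return total / (n - 1)                         # divide by n - 1

variance = sample_variance(ages)
sd = math.sqrt(variance)                           # square root of the variance
print(f"variance = {variance:.2f}, standard deviation = {sd:.2f}")
```

The squared deviations sum to 1385.5, so the variance is 1385.5/39, about 35.53, and the standard deviation is about 5.96 years.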
Computation of variance and standard deviation
Id_number  age: xi  (xi-mean)  (xi-mean)^2
UG001      45       24.25      588.0625
UG002      19       -1.75      3.0625
UG003      23
UG004      10
UG005      16
UG006      21
UG007      25
UG008      17
UG009      21
UG010      18
UG011      15
UG012      18
UG013      21
UG014      13
UG015      16
UG016      23
UG017      21
UG018      24
UG019      18
UG020      19
UG021      26
UG022      20
UG023      21       0.25       0.0625
UG024      19       -1.75      3.0625
UG025      20       -0.75      0.5625
UG026      25       4.25       18.0625
UG027      26
UG028      20
UG029      23
UG030      8
UG031      23
UG032      18
UG033      24
UG034      16
UG035      30
UG036      24
UG037      15
UG038      22
UG039      27
UG040      20
sum        830
mean       20.75
Sum of squared deviations = 1385.5
Variance = 1385.5 / (40 - 1) ≈ 35.53
Standard deviation = square root of variance ≈ 5.96 years
Median = 20.5, Mean = 20.75
Relationship between standard
deviation, the mean and
distribution of observations
• If the distribution of observations of a given
variable is approximately normal:
– Approximately 68% of the observations in
the sample fall within one standard deviation
of the mean (Mean±1SD)
– Approximately 95% of the observations
in the sample fall within two standard
deviations of the mean (Mean±2SD)
– Approximately 99.7% of the observations in
the sample fall within three standard
deviations of the mean (Mean±3SD)
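These percentages can be compared with an actual sample. The sketch below counts how many of the 40 ages from this lecture fall within 1, 2 and 3 sample standard deviations of the mean. The age data contain outliers and are not exactly normal, so the observed shares are only expected to resemble the 68%, 95% and 99.7% figures loosely.

```python
from statistics import mean, stdev

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21, 24, 18, 19,
        26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16, 30, 24, 15, 22, 27, 20]

def share_within(values, k):
    """Percentage of observations within k sample SDs of the mean."""
    m, s = mean(values), stdev(values)
    inside = [x for x in values if m - k * s <= x <= m + k * s]
    return 100 * len(inside) / len(values)

for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {share_within(ages, k):.1f}% of observations")
```

For these data the shares are 85%, 95% and 97.5%, which shows how a small sample with outliers can depart from the normal benchmarks.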
Summary of Cholesterol Data
• Sample size: 250 persons
• Minimum: 135 mg/dl
• Median: 237 mg/dl
• Mean (average): 236.3 mg/dl
• Maximum: 464 mg/dl
• Standard deviation: 42.6 mg/dl
• The formula is
Choice of measures of
dispersion
• Standard deviation is appropriate when the mean is
used to describe central tendency (symmetric data)
• The inter-quartile range is used to describe the
central 50% of a distribution, regardless of its shape
• The percentile may also be used when the mean is
used but the objective is to compare a set of
observations with the norm
• The range is used with numerical data when the
purpose is to emphasize extreme values
• Percentiles and inter-quartile range are used when
the median is used (skewed data)
• The coefficient of variation is used when the intent is
to compare distributions of variables measured on
different scales
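As an illustration of the last point, the coefficient of variation (standard deviation expressed as a percentage of the mean) can be sketched for the two variables used in this lecture. The cholesterol mean and SD are those quoted in the summary slide; the age SD of about 5.96 years is computed from the 40 listed ages and is not quoted on a slide.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean (unit-free)."""
    return 100 * sd / mean

# Cholesterol summary from the slides: mean 236.3, SD 42.6 (mg/dl).
cv_chol = coefficient_of_variation(42.6, 236.3)

# Age data (n=40): mean 20.75 years; SD of about 5.96 computed from the listed ages.
cv_age = coefficient_of_variation(5.96, 20.75)

print(f"CV cholesterol = {cv_chol:.1f}%")
print(f"CV age         = {cv_age:.1f}%")
```

Although the cholesterol SD is numerically much larger, age varies more relative to its mean (about 29% against about 18%), which is the kind of comparison the coefficient of variation is meant for.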
Take home assignment
• Using dummy data from the research questions that
were pitched in class, provide a summary of the data
collected as follows:
1. Summarize and present data on the socio-demographics of
the study population in percentages.
2. Present 4 separate variables of continuous data as:
a) Box plots
b) Symmetry lines.
3. Comment on the spread and symmetry of your data
findings
Please work in your groups to have a PowerPoint
presentation ready for a 5 minute presentation in
our next class