Biostatistics and Demography - Lecture 2

Uploaded by

Christine Ayamba
Copyright: © All Rights Reserved

Mbarara University of Science and Technology
Faculty of Medicine, Department of Pharmacy

Biostatistics & Demography
Summarizing and presenting data

Edward J. LUKYAMUZI (MPS)

P.O. Box 1410, Mbarara, Uganda, http://www.must.ac.ug


1
Objectives…
• Explain how proportions and percentages are
calculated
• Explain two methods for graphically
displaying the distribution of categorical data.
• Explain the method of construction of a
histogram; describe the shape, centre, and
spread
• Define a percentile, name the 3 important
percentiles and derive their values from a
frequency table
• Define and describe the characteristics of the
mean, median, and mode, and identify when
to use each.
• Define and describe the characteristics of
standard deviation, interquartile range and
identify when to use each
2
Recap…
Data types
• Categorical
   • Nominal
   • Ordinal
   • Binary or dichotomous
• Quantitative
   • Discrete (count data)
   • Continuous (measured on a continuum)
   • Scales: ratio, interval

3
Four scales of
measurement…
Introduction…
• In public health and health research we are
interested in describing a group (of people,
things, etc.) rather than an individual person.
• We have noted the inherent variability in biological,
socio-economic, or behavioral processes.
• Knowledge and application of appropriate
statistical methods enable us to describe groups of
people with varying experiences.

5
Summarizing and Presenting Data: Methods and Tools
• Summarizing categorical data
   • Counts
   • Proportions
   • Percentages
   • Rates, ratios
• Summarizing quantitative data
   • Measures of central tendency (mode, median, mean)
   • Measures of dispersion (minimum & maximum, IQR, standard deviation)
• Graphical presentation of data
   • Categorical data (bar chart, pie chart)
   • Continuous data (histogram, box plot)

6
Summarising Categorical Data:
Frequency counts
• The frequency distribution tells us how many
times different values of a variable occur in a
given sample or population. Frequency
distribution of a random sample of 200 new
students:

Example: Frequency distribution of PHA III students by sex

Sex        Number of students (counts)
Males      36
Females    14
Total      50
7
Summarizing Categorical Data:
Proportions
• Proportion
   • The numerator is included in the denominator:
     a / (a + b), where a + b = n
   • e.g. number of students who are males / total number of students

Frequency Distribution of Year 1 Students by Sex

Sex        Number of students   Proportion
Males      108                  0.54
Females    92                   0.46
Total      200                  1.00
8
Summarizing Categorical Data:
Percentages
• Proportions are often converted into
percentages for ease of interpretation.
• Percentages are relative frequencies
expressed per 100.
– Frequency % = (a / n) × 100

Percentage of Year 1 Students by Sex

Sex        Number of students   Proportion   Percentage
Males      108                  0.54          54.0
Females    92                   0.46          46.0
Total      200                  1.00         100.0
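
The proportion and percentage columns can be reproduced from the counts alone. The following is a minimal Python sketch using the Year 1 counts from the table (108 males, 92 females); the variable names are illustrative only.

```python
# Counts taken from the Year 1 students table.
counts = {"Males": 108, "Females": 92}

# Denominator n: every student is counted exactly once.
n = sum(counts.values())

# Proportion = a / n; percentage = proportion x 100.
proportions = {sex: a / n for sex, a in counts.items()}
percentages = {sex: 100 * p for sex, p in proportions.items()}

for sex in counts:
    print(f"{sex:8s} {counts[sex]:4d} {proportions[sex]:.2f} {percentages[sex]:5.1f}")
print(f"{'Total':8s} {n:4d} {sum(proportions.values()):.2f} {sum(percentages.values()):5.1f}")
```

Because each numerator is part of the denominator, the proportions always add up to 1 and the percentages to 100.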

9
Other Basic Measures of Frequency:
Ratios

Ratios show the relative sizes of two quantities.

Examples
• Number of doctors to population size
• Number of nurses to number of patients in
a hospital
• Number of girls to number of boys
• Number of households with a bed net to
number of households without a bed net

10
Other Basic Measures of Frequency: Rates

A rate is a measure of the frequency of occurrence of events per unit time.

For example
• Number of disease events per year
• Number of cases of dysentery per week

• Numerator is the number of events
• Denominator is the total time of observation
• Time may be hours, days, weeks, months, years, etc.
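
As a rough illustration of how a ratio differs from a rate, here is a short Python sketch. All numbers are hypothetical and chosen only to keep the arithmetic easy; they do not come from the lecture.

```python
# Hypothetical numbers, for illustration only (not from the lecture).
households_with_net = 150
households_without_net = 50
dysentery_cases = 24
weeks_observed = 8

# Ratio: relative size of two quantities; the numerator is NOT part of the denominator.
net_ratio = households_with_net / households_without_net

# Rate: number of events divided by the total time of observation.
cases_per_week = dysentery_cases / weeks_observed

print(f"Households with a bed net to households without: {net_ratio:.0f} to 1")
print(f"Dysentery rate: {cases_per_week:.1f} cases per week")
```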

11
SUMMARIZING CONTINUOUS
DATA

12
Distribution of continuous data

• Values of continuous data exist on a continuum.
• As the sample size increases, the number of
unique values becomes so large that it is difficult
to summarise such data by making frequency
counts of individual values.
• See the example of the frequency distribution of
total cholesterol of 250 people.

13
Example: Frequency distribution of cholesterol in mg/dl (n=250)

totchol   Freq.   Percent     Cum.
135       1       0.40        0.40
150       2       0.81        1.21
154       1       0.40        1.61
159       1       0.40        2.02
163       1       0.40        2.42
164       1       0.40        2.82
...       ...     ...          ...
320       2       0.81       98.39
325       1       0.40       98.79
326       1       0.40       99.19
332       1       0.40       99.60
464       1       0.40      100.00
14
Distribution of Continuous data
• Continuous data is first grouped into
classes comprising successive
ranges of values of the variable of
interest
• In each class/group, frequency counts
are determined
• Frequency distributions of continuous
data have a shape

15
Frequency distribution of cholesterol data; class width is 20 mg/dl

totcholcat   Freq.   Percent     Cum.
135-          4       1.61        1.61
155-         12       4.84        6.45
175-         26      10.48       16.94
195-         35      14.11       31.05
215-         40      16.13       47.18
235-         52      20.97       68.15
255-         39      15.73       83.87
275-         19       7.66       91.53
295-         15       6.05       97.58
315-          5       2.02       99.60
455-          1       0.40      100.00

Total       248     100.00


Frequency distributions of continuous data have a
shape; easier to see on a graph
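
The cumulative percentages in the grouped table, and the class in which a given percentile falls, can be derived from the frequencies alone. The sketch below copies the class limits and frequencies from that table; `percentile_class` is a hypothetical helper name, and it reports only the class containing the percentile, not an interpolated value.

```python
# Lower class limits (mg/dl) and frequencies copied from the grouped table.
lower_limits = [135, 155, 175, 195, 215, 235, 255, 275, 295, 315, 455]
freqs = [4, 12, 26, 35, 40, 52, 39, 19, 15, 5, 1]

total = sum(freqs)

# Running cumulative percentage for each class.
cum_percent = []
running = 0
for f in freqs:
    running += f
    cum_percent.append(100 * running / total)

def percentile_class(p):
    """Return the lower limit of the first class whose cumulative % reaches p."""
    for limit, cum in zip(lower_limits, cum_percent):
        if cum >= p:
            return limit
    return lower_limits[-1]

# The three important percentiles: 25th (Q1), 50th (median), 75th (Q3).
for p in (25, 50, 75):
    print(f"{p}th percentile lies in the class starting at {percentile_class(p)} mg/dl")
```

The 50th percentile falls in the class starting at 235 mg/dl, which agrees with the median of 237 mg/dl reported in the cholesterol summary later in the lecture.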
16
Frequency distribution (from the table above) presented on a graph (histogram)

[Figure: histogram of total cholesterol. Horizontal axis: total cholesterol, mg/dl (100 to 500). Vertical axis: percent of the sample (0 to 20).]

Observe the shape, centre and spread of the data.

17
Tools for Visualising distribution of continuous data: Histogram

[Figure: histogram of total cholesterol. Horizontal axis: total cholesterol, mg/dl (100 to 500). Vertical axis: percent of the sample (0 to 20).]

Does the centre represent the mean, median and mode?

18
Recall: Frequency distribution of cholesterol in mg/dl (n=250)

totchol   Freq.   Percent     Cum.
135       1       0.40        0.40
150       2       0.81        1.21
154       1       0.40        1.61
159       1       0.40        2.02
163       1       0.40        2.42
164       1       0.40        2.82
...       ...     ...          ...
320       2       0.81       98.39
325       1       0.40       98.79
326       1       0.40       99.19
332       1       0.40       99.60
464       1       0.40      100.00
19
Summarising the distribution of total
cholesterol data (Continuous data)

• Minimum is 135 mg/dl
• Maximum is 464 mg/dl
• Sample size is 250
• The complete frequency table is too
long, making it difficult to interpret
the frequency list.
• A histogram can help visualise the
distribution

20
Creating a Histogram
• Determine the minimum and maximum values
for the variable of interest, e.g. total cholesterol
   • Minimum is 135, maximum is 464
   • Determine the range (max − min = 329)
• Determine the number of classes/groups you
want to have, e.g. 11.
• Determine the class interval width ≈ range
divided by number of classes = 329 / 11 ≈ 30 mg/dl
• Create mutually exclusive categories/classes of
the original (continuous) variable.
• Classes are also called bins.
• Start the first interval at a convenient value
below the minimum, e.g. 134.5 mg/dl.
• Therefore the first class will be 134.5 to 164.5
mg/dl
• The second class will be 164.5 to 194.5 mg/dl
21
Creating a Histogram
• Determine the frequency counts and relative
frequency for each category/class
• To plot the graph:
   • On the horizontal axis mark equally spaced
   values of the lower boundary of each class
   • On the vertical axis, the length represents
   the frequency
   • Plot the frequencies for each class as bars.
   • The height of the bar will be proportional to
   the frequency of that class.
   • The width of the bars is the same.
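
The class-construction steps for the cholesterol example can be sketched in Python as follows. The minimum, maximum, number of classes and starting value come from the slides; `class_index` is a hypothetical helper added for illustration.

```python
minimum, maximum = 135, 464   # mg/dl, from the cholesterol data
n_classes = 11                # number of classes chosen by the analyst

data_range = maximum - minimum                 # 329
width = round(data_range / n_classes)          # 29.9..., rounded to 30 mg/dl
start = minimum - 0.5                          # convenient value just below the minimum

# Class boundaries: 134.5, 164.5, 194.5, ...
boundaries = [start + i * width for i in range(n_classes + 1)]
classes = list(zip(boundaries[:-1], boundaries[1:]))

def class_index(value):
    """Index (0-based) of the class that contains value."""
    return int((value - start) // width)

print(classes[0], classes[1])
print("464 mg/dl falls in class number", class_index(464) + 1)
```

Starting at 134.5 rather than 135 means no whole-number observation can sit exactly on a class boundary, which keeps the classes mutually exclusive.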

22
[Figure: histogram of total cholesterol. Horizontal axis: total cholesterol, mg/dl (100 to 500). Vertical axis: percent (0 to 30). Class width is ≈30 mg/dl.]

Is the shape of a histogram sensitive to the number of classes?

23
[Figure: histogram of total cholesterol. Horizontal axis: total cholesterol, mg/dl (100 to 500). Vertical axis: percent (0 to 20). Class width is 16.45 mg/dl.]

The shape of a histogram is sensitive to the number of classes.

24
Distribution of continuous data
• The shape of the frequency distribution can
be symmetrical or asymmetrical
• A symmetric distribution has the same
shape on both sides of the mean (the
centre)
• If outlying values occur only in one direction,
the distribution is said to be skewed
• Normally distributed data has zero skewness

25
Shape of distribution of continuous data: Symmetrical

[Figure: symmetrical frequency curve. Horizontal axis: values of a continuous variable (X). Vertical axis: frequency.]

• A symmetrical distribution has the same shape
on both sides of the mean.
• The horizontal axis represents the X-variable while
the vertical axis represents the frequencies.

26
Shape of distribution of continuous data:
Skewed to the right

27
Shape of distribution of continuous data:
Skewed to the Left

28
MEASURES OF CENTRAL
TENDENCY

29
Measures of central tendency
• The central tendency of a distribution is an
estimate of the "center" of a distribution of
values.
• Measures of central tendency enable us to describe
the characteristics of a typical member of a group of people.
• Three major types of estimates of central
tendency:
   • Mode
   • Median
   • Mean (arithmetic mean)
• Other measures of central tendency:
   • Geometric mean
30
Mode

• The mode is the most frequently
occurring value in the set of observations

Using this age data,
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21
24 18 19 26 20 21 19 20 25 26 20 23 8 23 18 24 16
30 24 15 22 27 20
what is the modal age?

• In a given dataset, a variable may have
more than one mode (bimodal
distribution)
– distribution with two different peaks
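
One way to find the mode of the age data is to count how often each value occurs. The following Python sketch uses `collections.Counter` from the standard library and returns every value tied for the highest count, so a bimodal data set would yield two values.

```python
from collections import Counter

ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]

# Frequency count of each distinct age.
counts = Counter(ages)

# The mode is every value that attains the highest frequency.
highest = max(counts.values())
modes = sorted(age for age, c in counts.items() if c == highest)

print("Modal age(s):", modes, "occurring", highest, "times")
```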

31
Median
• The median is the value found at the exact
middle of the set of values
• Arrange the observations in order of magnitude;
the median is the middle observation
• With an even number of observations, the median is
the average of the two middle observations
• The median divides the set of observations into
two equal parts such that the number of
values equal to or greater than the median is
equal to the number of values less than or
equal to the median
Consider the following age data:
45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19 26 20 21 19
20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20

• When sorted, the 20th and 21st values are 20 and 21, so the
median age of the 40 individuals is (20 + 21) / 2 = 20.5 years
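
The median calculation for the age data can be sketched in Python as below: sort the values, then take the middle one, or the average of the two middle ones when the sample size is even.

```python
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]

# Step 1: arrange the observations in order of magnitude.
ordered = sorted(ages)
n = len(ordered)
mid = n // 2

# Step 2: pick the middle value (odd n) or average the two middle values (even n).
if n % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

print("Median age:", median)
```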

32
The Arithmetic Mean
• The arithmetic mean – is the most popular
measure of central tendency
• Calculation of mean requires numerical data
• For a given variable, the mean is obtained by
1. Adding all values in the sample
2. Dividing the sum by the number of
observations in the sample (sample size)
• For a given set of data and variable, there is
only one mean

33
Arithmetic mean- computation

• Calculating the arithmetic mean
• Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
– where n is the sample size and $x_i$ is the i-th observed
value of the variable.
• The mean is affected by outliers (extreme but
legitimate values)

34
Arithmetic mean -Example
For this set of observations in a sample, calculate the mean.

45 19 23 10 16 21 25 17 21 18 15 18 21 13 16 23 21 24 18 19
26 20 21 19 20 25 26 20 23 8 23 18 24 16 30 24 15 22 27 20

Sample size (n) = 40

Step 1: First obtain the sum:
45 + 19 + 23 + 10 + 16 + 21 + 25 + 17 + 21 + 18 + 15 + 18 + 21 + 13 + 16 + 23 +
21 + 24 + 18 + 19 + 26 + 20 + 21 + 19 + 20 + 25 + 26 + 20 + 23 + 8 + 23 + 18 + 24 +
16 + 30 + 24 + 15 + 22 + 27 + 20 = 830

Step 2: Divide the sum by the sample size (n)

Thus, arithmetic mean = sum / n = 830 / 40 = 20.75

Note: The arithmetic mean is affected by outliers.

35
Summary of the Age data of 40
individuals:

• Median=20.5 years,
• Modal age=21 years
• Mean=20.75 years

• Based on these measures, what can
you conclude about the shape of the
distribution of age in this sample?

36
Using a graph to see the distribution helps
to identify key features such as the presence of
outliers

[Figure: histogram of age with the outliers marked. Horizontal axis: age (years), 8 to 44. Vertical axis: percent (0 to 25).]

37
Outliers & Arithmetic Mean
• Outliers fall outside the general pattern of the
distribution.
• The value of the arithmetic mean is sensitive
to/affected by outliers

• If the data contains outliers:
1. Find out if there is a problem with data entry.
2. Assess the extent to which the outlier
affects the results: for example calculate the
mean with and without the outlier
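
The second check, comparing the mean with and without the outlier, might look like this in Python for the age data. Treating the value 45 as the outlier is an assumption made for this sketch; which observations count as outliers is a judgement about the data.

```python
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]

suspected_outlier = 45  # assumption for illustration

# Mean of all 40 observations.
mean_all = sum(ages) / len(ages)

# Mean after setting the suspected outlier aside.
kept = [x for x in ages if x != suspected_outlier]
mean_without = sum(kept) / len(kept)

print(f"Mean with outlier:    {mean_all:.2f}")
print(f"Mean without outlier: {mean_without:.2f}")
print(f"Shift caused by one observation: {mean_all - mean_without:.2f} years")
```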

38
The Geometric Mean
• For a variable X with observations $x_i$, the
geometric mean of a set of n observations is
equal to the nth root of the product of the
n observations.
• For a given data set of positive values, the geometric mean is less
than or equal to the arithmetic mean.

Geometric mean = $\left( \prod_{i=1}^{n} x_i \right)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n}$

• Do not calculate the geometric mean if you
have a zero or negative value in your data
set.
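
A minimal Python sketch of the geometric mean follows. It averages the logarithms and exponentiates the result, which is mathematically the same as taking the nth root of the product but avoids very large intermediate numbers. The titre values are hypothetical and chosen only to make the result easy to check; `geometric_mean` is an illustrative helper name.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values, computed via logarithms."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean is undefined for zero or negative values")
    return math.exp(sum(math.log(x) for x in values) / len(values))

# Hypothetical antibody titres, for illustration only.
titres = [10, 100, 1000]

gm = geometric_mean(titres)
am = sum(titres) / len(titres)

print(f"Geometric mean:  {gm:.1f}")
print(f"Arithmetic mean: {am:.1f}")
```

For these skewed values the geometric mean (100) sits well below the arithmetic mean (370), which is pulled upward by the largest observation.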
39
Application of Geometric Mean
• The geometric mean is a useful measure of
central tendency for highly skewed data e.g.
gametocyte density, antibody titers;
• Geometric mean is used to describe the central
tendency of data which is basically skewed BUT
normally distributed on a log-scale.
• Compared to the arithmetic mean, the geometric
mean is less affected by extreme values.
• The GM is appropriate for describing
proportional growth, e.g. annual population
growth, average rate of return on investment
over a period of time.
40
Choice of appropriate measure
of central tendency
• The arithmetic mean is calculated for
numerical data and for symmetric or ≈
symmetric distributions
• The median is suitable for ordinal data or
numerical data if the distribution is skewed
• The mode is used to describe bimodal
distributions (esp. in disease-age/time
distributions) e.g. seasonal distribution
of malaria
• For a symmetric distribution, the mean ≈
median ≈ mode
41
MEASURES OF VARIATION

43
Assessing Variation/Dispersion in Data
• Variation of data is also commonly referred to as
dispersion.
• Dispersion refers to the spread of the values around the
central point.
• The starting point when assessing dispersion in a set of
data is to use visualisation tools such as the histogram,
the box plot, the symmetry plot, and the quantile plot.
• These tools enable the researcher to make a qualitative
description of the extent of variation observed in the
data.
• Data variation is often quantified using measures such
as the range, percentiles, inter-quartile range, the
standard deviation, coefficient of variation, and
standard error.

44
Tools for Visualising distribution of continuous data: Histogram

[Figure: histogram of total cholesterol. Horizontal axis: total cholesterol, mg/dl (100 to 500). Vertical axis: percent of the sample (0 to 20). Starts at 135 mg/dl; class width is 21.93 mg/dl.]

Observe the spread of the data. The graph (histogram) helps to see the spread of the data.


45
Alternative methods for viewing
the distribution of continuous
data
• Although the histogram is a popular tool for
visualising the distribution of
continuous data, histograms are sensitive to
the number of bins used in their construction.
• They can therefore give researchers a misleading
picture of the nature of the distribution
of data.
• Other approaches to understanding the
distribution of continuous data include the box
plot, symmetry plot, quantile plot, etc.

46
Tools for Visualising variation: the Box Plot

[Figure: box plot showing the distribution of total cholesterol. Vertical axis: totchol (mg/dl), 100 to 500.]

47
Box Plot: visualising variation in age

[Figure: box plot showing the distribution of age (n=40). Vertical axis: age (years), 10 to 50.]

48
Box plot…
• Standardized way of displaying data.
• Based on five number summary;
1. Minimum
2. First quartile (Q1)
3. Median
4. Third quartile (Q3)
5. Maximum

49
How to draw a Box Plot
1. Sort the data from minimum to maximum
2. Determine the minimum, Q1, median, Q3 and maximum
3. Determine the IQR (i.e. Q3 − Q1) and the value of 1.5 × IQR
4. Obtain the values of Q1 − 1.5 × IQR and Q3 + 1.5 × IQR
5. Draw and label a vertical line that includes the range of
the distribution
6. Draw a central box from Q1 to Q3
7. Draw a horizontal line for the median inside the box
8. Extend vertical lines (whiskers) from the box (at Q1 and
at Q3) out to the lowest and highest data values falling
within the general distribution (i.e. not outliers). Each
whisker is therefore at most 1.5 times the IQR long.
Determining Q1 & Q3

• These are known as the lower and upper quartiles
1. Arrange the data set in numerical order and find the median
2. Split the data into a lower half and an upper half; if the
data set has an odd number of observations, leave the
median out of both halves
3. Q1 is the median of the lower half and Q3 is the median of
the upper half; when a half has an even number of values,
take the average of its middle two values
• Consider the data set {1, 3, 4, 7, 8, 9, 10, 12, 14}.
• The data are already in numerical order; the median is 8 and
the lower half is {1, 3, 4, 7}.
• Q1 = (3+4)/2 = 3.5.
• Similarly, the upper half is {9, 10, 12, 14}, so Q3 = (10+12)/2 = 11.
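
The median-of-halves rule can be sketched in Python as follows. The function names are illustrative, and statistical packages may use slightly different quartile conventions, so their results can differ a little from this hand method.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median
    upper = ordered[half + n % 2:]    # values above it (skips the middle value when n is odd)
    return median(lower), median(ordered), median(upper)

q1, med, q3 = quartiles([1, 3, 4, 7, 8, 9, 10, 12, 14])
print("Q1 =", q1, " median =", med, " Q3 =", q3)
```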
51
The Box Plot
Box plot for displaying the distribution of quantitative data.

Upper inner fence = Q3 + (1.5 × IQR)
Lower inner fence = Q1 − (1.5 × IQR)

Any data values that fall outside the fences are outliers.
52
Source: https://en.wikipedia.org/wiki/Boxplot

53
Box Plot: Location of fences
• When the calculated value of the upper
fence is greater than the maximum
observation in the data, the fence will
be located at the observed maximum
value.
• When the calculated value of the lower
fence is less than the minimum value in
the data, the fence will be located at
the observed minimum value.
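
To make the fence rules concrete, the sketch below applies them to the 40 ages used throughout this lecture. It assumes the median-of-halves rule for Q1 and Q3; other quartile conventions would shift the fences slightly.

```python
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]

ordered = sorted(ages)
n = len(ordered)

def middle(values):
    """Median of an already sorted list."""
    k = len(values)
    return values[k // 2] if k % 2 else (values[k // 2 - 1] + values[k // 2]) / 2

q1 = middle(ordered[: n // 2])          # median of the lower half
q3 = middle(ordered[(n + 1) // 2:])     # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers.
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme observations that are still inside the fences.
inside = [x for x in ordered if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Whiskers end at {whisker_low} and {whisker_high}; outliers: {outliers}")
```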
Box Plot showing the distribution of cholesterol.

[Figure: box plot of total cholesterol. Vertical axis: totchol (mg/dl), 100 to 500.]

Observe the location of the median relative to Q1 and Q3.

Data values which fall outside the fences are called outliers.
55
Box Plot: distribution of age

[Figure: horizontal box plot of age. Axis: age (years), 8 to 44.]

Using this box plot, estimate the median, Q1 and Q3.

Data values which fall outside the fences are called outliers.
56
Tools for Visualising variation: the
Symmetry Plot
• A closely related tool is the normal probability plot
(also called a normal plot), which:
   • assesses whether a data set follows a normal
   distribution
   • is a scatter plot of the data against a theoretical
   normal distribution
• If the data is symmetric and approximately
bell-shaped, the points on the plot will form a
roughly straight line.
• If the data deviates significantly from
normality, the points on the plot will deviate
from a straight line.
57
Tools for Visualising variation: the Symmetry Plot

[Figure: symmetry plot showing the distribution of data around the median. Horizontal axis: distance below median (0 to 100). Vertical axis: distance above median (0 to 250).]

If data is symmetrically distributed, all plotted values will lie along the reference line.
58
Quantifying variation: The
Range
• Range
   • The range is the difference between the
   highest value (maximum) and the lowest
   value (minimum)
   • In practice the lowest and the highest values in the
   data are reported
• Inter-quartile range
   • Is the difference between the 3rd quartile
   (75th percentile) and the 1st quartile (25th
   percentile): IQR = Q3 − Q1
   • The inter-quartile range contains the central
   50% of the observations
59
Quantifying variation: Standard
Deviation
• Standard deviation is a measure of the
spread of observations about their mean
• It is a measure of how much on average each
of the values in the distribution deviates
from the mean
• Standard deviation is an essential part of
many statistical tests
• The value of the standard deviation is
affected by outliers

60
Calculating the Standard
Deviation
1. Calculate the arithmetic mean
2. Calculate the difference between each
observation in the data set and the mean,
and square it
3. Obtain the sum of the squared deviations
4. Divide the sum of the squared deviations by
n-1 (the number of observations in the sample
minus one); the result is the variance
5. Take the square root of the variance to obtain
the standard deviation

61
Computation of variance and standard
deviation
Id_number   age: xi   (xi-mean)   (xi-mean)^2
UG001       45        24.25       588.0625
UG002       19        -1.75       3.0625
UG003       23
UG004       10
UG005       16
UG006       21
UG007       25
UG008       17
UG009       21
UG010       18
UG011       15
UG012       18
UG013       21
UG014       13
UG015       16
UG016       23
UG017       21
UG018       24
UG019       18
UG020       19
UG021       26
UG022       20
UG023       21        0.25        0.0625
UG024       19        -1.75       3.0625
UG025       20        -0.75       0.5625
UG026       25        4.25        18.0625
UG027       26
UG028       20
UG029       23
UG030       8
UG031       23
UG032       18
UG033       24
UG034       16
UG035       30
UG036       24
UG037       15
UG038       22
UG039       27
UG040       20
sum         830
mean        20.75

Variance =
Standard deviation = square root of variance =
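
The remaining cells of this worksheet can be checked with a short Python sketch that follows the same steps: deviation from the mean, squared deviation, sum, division by n-1, and square root. The ages are listed in the order UG001 to UG040.

```python
import math

# Ages in the order UG001 ... UG040.
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]

n = len(ages)
mean = sum(ages) / n

# Columns (xi - mean) and (xi - mean)^2 of the worksheet.
deviations = [x - mean for x in ages]
squared = [d ** 2 for d in deviations]

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum(squared) / (n - 1)
sd = math.sqrt(variance)

print(f"mean = {mean}, sum of squared deviations = {sum(squared)}")
print(f"variance = {variance:.2f}, standard deviation = {sd:.2f}")
```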

62
Median= 20.5, Mean=20.75,
Relationship between standard
deviation, the mean and
distribution of observations
• If the distribution of observations of a given
variable is approximately normal:
– Approximately 68% of the observations in
the sample fall within one standard deviation
of the mean (Mean±1SD)
– Approximately 95% of the observations
in the sample fall within two standard
deviations of the mean (Mean±2SD)
– Approximately 99.7% of the observations in
the sample fall within three standard
deviations of the mean (Mean±3SD)
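
As an illustration, the three intervals can be computed for the cholesterol data using the mean (236.3 mg/dl) and standard deviation (42.6 mg/dl) reported in this lecture's cholesterol summary. The sketch only computes the interval limits; whether roughly 68%, 95% and 99.7% of the sample really fall inside them depends on how close the distribution is to normal.

```python
mean = 236.3   # mg/dl, from the cholesterol summary
sd = 42.6      # mg/dl, from the cholesterol summary

# Expected share of observations within k standard deviations (normal data).
expected_share = {1: 68.0, 2: 95.0, 3: 99.7}

# Interval limits mean - k*SD and mean + k*SD for k = 1, 2, 3.
intervals = {k: (mean - k * sd, mean + k * sd) for k in expected_share}

for k, (low, high) in intervals.items():
    print(f"Mean ± {k}SD: {low:.1f} to {high:.1f} mg/dl (about {expected_share[k]}% expected)")
```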

63
64
Summary of Cholesterol Data
• Sample size: 250 persons
• Minimum: 135 mg/dl
• Median: 237 mg/dl
• Mean (average): 236.3 mg/dl
• Maximum: 464 mg/dl
• Standard deviation: 42.6 mg/dl

• Question: Based on the values of the mean
and median cholesterol, what can you say
about the distribution of cholesterol in this
sample?
65
Coefficient of Variation (CV)
• It is the ratio of the standard deviation to the
mean
• The formula is $CV = \frac{SD}{\bar{x}}$, often multiplied by 100
and reported as a percentage
• where SD = standard deviation and $\bar{x}$ is the mean
• CV can be used to compare variation
between data sets
• Mainly applied in laboratory testing
and quality control procedures
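
A small Python sketch of the CV follows, comparing the two data sets used in this lecture. The cholesterol figures are the reported summary values; the age CV is computed from the raw ages with `statistics.stdev`, which uses the n-1 denominator. Expressing the CV as a percentage is a common convention assumed here.

```python
import statistics

def cv_percent(sd, mean):
    """Coefficient of variation: SD as a percentage of the mean."""
    return 100 * sd / mean

# Cholesterol: summary values reported in the lecture (mg/dl).
cv_chol = cv_percent(42.6, 236.3)

# Age: computed from the raw data (years).
ages = [45, 19, 23, 10, 16, 21, 25, 17, 21, 18, 15, 18, 21, 13, 16, 23, 21,
        24, 18, 19, 26, 20, 21, 19, 20, 25, 26, 20, 23, 8, 23, 18, 24, 16,
        30, 24, 15, 22, 27, 20]
cv_age = cv_percent(statistics.stdev(ages), statistics.mean(ages))

print(f"CV of cholesterol: {cv_chol:.1f}%")
print(f"CV of age:         {cv_age:.1f}%")
```

Because the CV has no units, the two values can be compared directly even though one variable is in mg/dl and the other in years.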
66
Standard Error (SE)
• SE is used to assess how closely sample
estimates (like the sample mean) relate to
the population parameter (population mean)

• Used in computation of confidence intervals
and testing statistical significance

• (more on standard error later)
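
As a brief preview, the standard error of a sample mean is commonly computed as the standard deviation divided by the square root of the sample size. The slides do not state this formula, so treat the sketch below as an assumption; the inputs are the cholesterol summary values from this lecture.

```python
import math

sd = 42.6   # mg/dl, standard deviation from the cholesterol summary
n = 250     # sample size from the cholesterol summary

# Standard error of the mean (assumed formula: SD / sqrt(n)).
se = sd / math.sqrt(n)

print(f"Standard error of the mean: {se:.2f} mg/dl")
```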

67
Choice of measures of
dispersion
• Standard deviation is appropriate when the mean is
used to describe central tendency (symmetric data)
• The inter-quartile range is used to describe the
central 50% of a distribution, regardless of its shape
• The percentile may also be used when the mean is
used but the objective is to compare a set of
observations with the norm
• The range is used with numerical data when the
purpose is to emphasize extreme values
• Percentiles and inter-quartile range are used when
the median is used (skewed data)
• The coefficient of variation is used when the intent is
to compare distributions of variables measured on
different scales
68
Take home assignment
• Using dummy data from the research questions that
were pitched in class, provide a summary of the data
collected as follows:
1. Summarize and present data on socio-demographics of
the study population in percentages.
2. Present 4 separate variables of continuous data as:
   a) Box plots
   b) Symmetry plots
3. Comment on the spread and symmetry of your data
findings
Please work in your groups to have a PowerPoint
presentation ready for a 5-minute presentation in
our next class

69
