Introduction (Data Presentation & Summarization)
Introduction to Biostatistics
Biostatistics: the discipline concerned with the treatment of numerical
data derived from groups of individuals (Armitage et
al., 2001).
Statistics: A field of study concerned with:
Collection,
Organization,
Analysis,
Summarization and
Interpretation of numerical data, & the drawing of
inferences about a body of data when only a small part
of the data is observed from the population.
Types of Biostatistics
1. Descriptive (exploratory) statistics: the aspect of statistics concerned with
collecting, organizing, presenting and
summarizing data.
2. Inferential statistics:
Consists of generalizing from samples to populations,
performing hypothesis tests, determining relationships
among variables, and making predictions.
Example:
Principles of probability,
Estimation,
Confidence interval,
Comparison of two or more means or proportions,
Hypothesis testing, etc.
Uses of Biostatistics
Provide methods of organizing information
Assessment of health status
Health program evaluation
Resource allocation
To formulate and test hypothesis
Magnitude of association
Assessing risk factors; Cause & effect relationship
Drawing of inferences (for prediction or projection)
To handle biological variations: Among individuals as
well as within same individual over time
» Example: height, weight, blood pressure, eye color ...
Collect reliable and unbiased data
Limitations of Biostatistics
It deals with only those subjects of inquiry that are
capable of being quantitatively measured and
numerically expressed.
What does biostatistics cover?
Research Planning
Presentation
Interpretation
Publication
Variables
Variable: a characteristic, attribute or quantity that can be
measured and that assumes different values for different
individuals (elements) under study.
Some examples of variables include:
Diastolic blood pressure,
Heart rate,
Height,
Weight, and
Stage of bladder cancer (mild, moderate, severe).
Data: the values, observations, measurements, facts or
figures that variables take on in describing an event in a given
survey, census, experiment or any other study.
Statistical data: the numerical description of things
(counts/measurements).
Statistical methods: methods that are used to collect,
organize, analyze and interpret data.
Variable: a characteristic that takes on different values
in different persons/places/things.
Data set: a collection of observations on a
variable.
Random variable: a variable whose value is
determined by chance.
Types of variables
Depending on the characteristic of the measurement,
variables can be: Qualitative and Quantitative
variables.
Qualitative (Categorical) variables
A variable or characteristic which cannot be measured
in quantitative form.
Types of variables: qualitative (categorical) or quantitative (measurement).
Measurement scales
Scales of measurement
Measurement: the assignment of numbers to objects or
events according to a set of rules.
Data comes in various sizes and shapes and it is important
to know about these so that the proper analysis can be used
on the data.
There are four levels at which we measure: nominal, ordinal, interval and ratio.
Nominal scales of measurement
It may be thought of as "naming" level.
This level of measurement does not put subjects in any
particular order.
There is no logical basis for saying one category is higher
or lower than another category.
In research activities a Yes/No scale is nominal.
The simplest data consist of unordered, dichotomous, or
"either-or" types of observations, i.e., either the patient
lives or the patient dies, either he has some particular
attribute or he does not.
The nominal level of measurement classifies data into
mutually exclusive (non-overlapping), exhaustive
categories in which no order or ranking can be imposed
on the data.
Examples are: Blood group, Gender, religious affiliation
Ordinal Scales of Measurement
An ordinal scale is next up the list in terms of power of
measurement.
The simplest ordinal scale is a ranking.
At this level we put subjects in order from lowest to
highest.
Ratio scales permit the researcher to compare both
differences in scores and the relative magnitude of
scores.
Individual Assignment 1
Assignment 2
Response and Explanatory variables
A variable can also be either:
a response (dependent, outcome) variable, or
an explanatory (independent, predictor) variable.
Response (dependent, outcome) variables: variables
that can be affected by the explanatory variables; the response is
the outcome of a study.
It is a variable you would be interested in predicting or
forecasting.
Explanatory variables: any variables that
explain the response variable.
Exercise
In a study to determine whether surgery or chemotherapy results
in higher survival rates for a certain type of cancer,
Which variable is the explanatory variable and
____________________
Independent variable__________________________________
Descriptive Statistics
Data collection methods
The choice of methods of data collection is based on:
The methods of collecting data may be broadly
classified as:
Self-administered questionnaires
Observation
Interviews
Tape recording
Filming, Photography
Disadvantages
5) Focus Group Discussion
Advantages
Is less expensive; permits anonymity & may result in
more honest responses; does not require research
assistants; eliminates bias due to phrasing questions
differently with different respondents
Disadvantages
Cannot be used with illiterates; there is often a low rate
of response; questions may be misunderstood
Problems in gathering data
Common problems might include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion (mistrust)
Bias (any systematic error)
Cultural norms (e.g. those which may preclude
men from interviewing women)
Types of Questions
Depending on how questions are asked and recorded
we can distinguish two major possibilities
Open-ended questions, and
Closed questions.
Open-ended questions
Closed Questions
Closed questions offer a list of possible options or
answers from which the respondents must choose.
For example
What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
Have you ever gone to the local village health worker
for treatment?
1. Yes
2. No
Methods of data organization and presentation
Frequency Distributions
A table which has a list of each of the possible values
that the data can assume along with the number of
times each value occurs.
Tables make it easier to see how the data are distributed
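As a minimal sketch of how such a table can be built, the Python snippet below tallies a small list of blood groups. The observations are hypothetical and chosen only for illustration; they are not taken from data 1.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (illustrative only).
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

# Count how many times each value occurs.
frequency = Counter(blood_groups)

# Print the frequency distribution, most frequent value first.
for value, count in frequency.most_common():
    print(f"{value:<4}{count}")
```

The counts always add up to the number of observations, which is a quick check that no value was missed.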
Ungrouped Frequency Distribution
Example: The following ungrouped data (an ordered array of individual
observations) are the current ages of women, collected from
240 women (data 1).
Example: Consider the data collected on age at first marriage of
240 women (data 1). One of the variables in this dataset is the religion
followed by the women. For this type of variable, we can
use an ungrouped frequency distribution to summarize the data as
follows:
How can an ungrouped frequency distribution be changed
into a grouped frequency distribution?
Grouped Frequency Distribution
Presenting data using a grouped frequency distribution
is not as simple as using an ungrouped one.
In this case we need to compute some values first. These values are
given below:
Number of classes (K): the number of categories the table will
have.
The number of classes can be estimated using Sturges'
rule, and the class width follows from it:
$$K = 1 + 3.322 \log_{10} n$$
$$W = \frac{L - S}{K}$$
where
K = number of class intervals, n = number of observations,
W = width of the class interval, L = the largest value,
S = the smallest value.
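The two formulas can be sketched in Python as follows. The function names are hypothetical, and rounding both results up to the next whole number is an assumption made here for convenience; the inputs (n = 240, largest value 49, smallest value 15) are those of the data 1 example.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(n), rounded up (assumed rounding)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(largest, smallest, k):
    """Class width W = (L - S) / K, rounded up (assumed rounding)."""
    return math.ceil((largest - smallest) / k)

k = sturges_classes(240)      # 240 observations in data 1
w = class_width(49, 15, k)    # largest = 49, smallest = 15
print(k, w)
```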
Class limit: The range for each class / The smallest and largest
values that can go into any class; they can be either lower or
upper class limits.
Class Boundaries/True Limits: limits that are
determined mathematically to make the intervals of a continuous
variable continuous in both directions, so that no gap exists between
classes. For data recorded as whole numbers, they are obtained by subtracting 0.5 from the lower
class limit and adding 0.5 to the upper class limit.
Each class has two boundaries:
Lower class boundary
Upper class boundary
Class mark/ Mid-point (Xc) of an interval: is the value of the
interval which lies mid-way between the lower true limit (LTL)
and the upper true limit (UTL) of a class.
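A short Python sketch of these two definitions is given below. The function names and the `unit` parameter (the measurement unit of the data, assumed to be 1 for whole-number ages) are illustrative assumptions; the class 27-30 is used as the example.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower, upper) class boundaries.

    Half of the measurement unit is subtracted from the lower limit
    and added to the upper limit.
    """
    half = unit / 2
    return lower_limit - half, upper_limit + half

def class_mark(lower_boundary, upper_boundary):
    """Mid-point between the lower and upper true limits."""
    return (lower_boundary + upper_boundary) / 2

lcb, ucb = class_boundaries(27, 30)   # class limits 27-30
mark = class_mark(lcb, ucb)
print(lcb, ucb, mark)
```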
Example for data 1
The number of classes (K) can be computed using Sturges' rule as:
$$K = 1 + 3.322 \log_{10}(240) \approx 8.9 \approx 9$$
$$W = \frac{49 - 15}{9} \approx 3.8 \approx 4$$
Thus the width of each class can be 4 and the lower class limit
for the first class will be the minimum observation from the
dataset.
Example for data 1 (RF = relative frequency, CF = cumulative frequency)
Class    Class        Class   Frequency   RF (%)   CF
limit    boundary     mark
15-18    14.5-18.5    16.5    15          6.25     15
19-22    18.5-22.5    20.5    49          20.42    64
23-26    22.5-26.5    24.5    51          21.25    115
27-30    26.5-30.5    28.5    40          16.67    155
31-34    30.5-34.5    32.5    21          8.75     176
35-38    34.5-38.5    36.5    22          9.17     198
39-42    38.5-42.5    40.5    18          7.50     216
43-46    42.5-46.5    44.5    15          6.25     231
47-50    46.5-50.5    48.5    9           3.75     240
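The relative frequency (RF) and cumulative frequency (CF) columns can be reproduced from the class frequencies alone. The sketch below uses the frequencies of the table above; rounding RF to two decimal places is an assumption about how the table was prepared.

```python
from itertools import accumulate

# Class frequencies from the grouped frequency table for data 1.
frequencies = [15, 49, 51, 40, 21, 22, 18, 15, 9]
n = sum(frequencies)

# Relative frequency in percent, rounded to two decimals (assumed rounding).
relative = [round(100 * f / n, 2) for f in frequencies]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

for f, rf, cf in zip(frequencies, relative, cumulative):
    print(f"{f:>4}{rf:>8.2f}{cf:>6}")
```

The last cumulative frequency must equal the total number of observations, which gives a simple check on the table.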
Note that the value to be added to or subtracted from the
class limits to get class boundaries depends on the
number of decimal places in the data that we want to
summarize.
The width of a class is found from the true class limit by
subtracting the true lower limit from the upper true limit
of any particular class.
For example, the width of the above distribution is (let's
take the fourth class) ( w = 30.5 - 26.5 = 4).
Statistical Tables
A statistical table is an orderly and systematic
presentation of data in rows and columns.
Based on the purpose for which the table is designed
and the complexity of the relationship,
a table can be either a simple frequency table or a cross
tabulation.
Construction of tables
Although there are no hard and fast rules to follow, the following
general principles should be addressed in constructing
tables.
Numerical entries of zero should be explicitly
written rather than indicated by a dash.
Dashes are reserved for missing or unobserved
data.
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
• Clinical symptoms among 54 patients with S. Typhimurium infection, Oslo, Norway, May 1998
Two variable table
Table 1. Cases of Salmonella Typhimurium infection by age group
and sex, Herøy, Norway, 1999
Three variable table
Composite/ Higher Order Table
It is a large table combining several separate variables/tables.
Age, sex and other demographic variables may be combined to
form a single table.
Common form of a two-by-two table
It is a special form of table that is a favorite among
epidemiologists.
Graphical Presentation
Graphs are often easier to interpret than tables,
perhaps at the expense of detail.
Construction of a Graph
Every graph should be self-explanatory and as
simple as possible.
Specific types of graphs for quantitative data include:
Histogram
Stem-and-leaf plot
Box plot
Scatter plot
Line graph
Others
Bar Charts
Categories are listed on the horizontal axis (X-axis)
Frequencies or relative frequencies are represented on the Y-axis
(ordinate)
B. Grouped bar chart
Data come from a table of two or more variables.
Distinct colours or shading are used to differentiate the groups; a legend is
necessary.
C. Stacked bar chart
It is used to show the same data as a
grouped bar chart using a single bar.
D. 100% component bar chart
Pie Chart
It is a circle divided into sectors so that the areas
of the sectors are proportional to the frequencies.
It is split into segments to show percentages or
the relative contributions of categories of data.
It is a good method of representation if you wish
to compare a part of a group with the whole group.
The number of categories should not be too large.
Used for a single categorical variable.
Use percentage distributions.
Constructed by converting frequencies to percentages
and then to degrees.
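The conversion from frequency to percentage to degrees can be sketched as below. The example reuses the ICU type frequencies from the statistical table shown earlier; the variable names are illustrative, and the sector angle is taken as the category's share of the full 360 degrees.

```python
# Frequencies by ICU type (from the ICU frequency table in this lecture).
frequencies = {"Medical": 12, "Surgical": 6, "Cardiac": 5, "Other": 2}
total = sum(frequencies.values())

percentages = {}
angles = {}
for category, f in frequencies.items():
    percentages[category] = 100 * f / total   # share of the whole, in percent
    angles[category] = 360 * f / total        # sector angle, in degrees

for category in frequencies:
    print(f"{category:<10}{percentages[category]:>6.1f}%{angles[category]:>8.1f} deg")
```

The sector angles should add up to 360 degrees, just as the percentages add up to 100.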
Example: Distribution of deaths for females, in England and
Wales, 1989.
Cause of death No. of death
[Pie chart: Distribution of deaths for females, in England and Wales, 1989.
Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%,
Digestive system 4%, Injury and poisoning 3%]
Histogram: the graph of the frequency distribution
of a continuous measurement variable.
It is constructed on the basis of the following principles:
The horizontal axis is a continuous scale running
from one extreme end of the distribution to the
other.
To construct a histogram, we draw the interval
boundaries on a horizontal line and the frequencies
on a vertical line.
The area of each bar is proportional to the
frequency of observations in the interval.
In constructing a histogram:
– Use equal class intervals
– Do not use scale breaks
Example: Distribution of the age of women at the time of marriage
[Histogram: x-axis, age group (14.5-19.5 to 44.5-49.5, in classes of width 5); y-axis, number of women]
Frequency polygon
If we join the midpoints of the tops of the adjacent
rectangles of the histogram with line segments a
frequency polygon is obtained.
Frequency polygon for the ages of 2087 mothers with <5
children, Adami Tulu, 2003
[Figure: frequency polygon; x-axis variable N1AGEMOTH; y-axis, frequency]
Ogive or cumulative frequency curve
To construct an ogive curve:
Compute the cumulative frequency of the
distribution, then plot it against the upper class boundaries.
Cumulative Frequency and Cum. Rel. Freq. of Age of 25 ICU Patients
Age Interval   Frequency   Relative        Cumulative   Cumulative
                           Frequency (%)   Frequency    Rel. Freq. (%)
10-19          3           12              3            12
20-29          1           4               4            16
30-39          3           12              7            28
40-49          0           0               7            28
50-59          6           24              13           52
60-69          1           4               14           56
70-79          9           36              23           92
80-89          2           8               25           100
Total          25          100
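The cumulative columns of this table, and the points that would be plotted for the ogive, can be computed as in the sketch below. The upper class boundaries (19.5, 29.5, ...) are an assumption that ages are recorded in whole years; the frequencies are those of the ICU table.

```python
from itertools import accumulate

# Frequencies of the age intervals 10-19, 20-29, ..., 80-89 (ICU table).
frequencies = [3, 1, 3, 0, 6, 1, 9, 2]
# Assumed upper class boundaries for whole-year ages.
upper_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]

n = sum(frequencies)
cum_freq = list(accumulate(frequencies))          # cumulative frequency
cum_rel = [100 * c / n for c in cum_freq]         # cumulative relative frequency (%)

# Points of the ogive: (upper class boundary, cumulative frequency).
ogive_points = list(zip(upper_boundaries, cum_freq))
print(ogive_points)
```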
Cumulative frequency of 25 ICU patients
Line graph
Useful for assessing the trend of a particular situation
over time.
Helps in monitoring the trend of epidemics.
The time, in weeks, months or years, is marked along
the horizontal axis, and
values of the quantity being studied are marked on the
vertical axis.
No. of microscopically confirmed malaria cases by species and month at
Zeway malaria control unit, 2003
[Line graph: x-axis, months (Jan to Dec); y-axis, no. of confirmed malaria cases; lines for Positive, P. falciparum and P. vivax]
Maternal mortality ratio (MMR) per 100,000 live births by age of woman;
Giza, Egypt 1984
[Figure: x-axis, age (15-19 to 45-49); y-axis, MMR per 100,000 live births]
Data Summarization
A measure of central tendency (MCT) is good or satisfactory if it possesses the following
characteristics:
It should be based on all the observations
Mean for Grouped data
For a given set of data there is one and only one
arithmetic mean (uniqueness).
Example 1
Consider the data on birth weight of 10 newborn
children in kilograms at Aksum Saint Marry public hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43. Then
the average birth weight can be computed as:
Solution:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{25.72}{10} = 2.572 \text{ kg}$$
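The arithmetic mean of the ten birth weights can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.mean`; the variable names are illustrative.

```python
from statistics import mean

# Birth weights (kg) of the 10 newborn children in Example 1.
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# Arithmetic mean: sum of the observations divided by their number.
average = mean(weights)
print(round(average, 3))
```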
Example: Consider the data on the weight of 10 newborn
children at Saint Marry hospital within a month: 2.51,
3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Find the median for the data.
Solution:
First arrange the data in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
As n = 10 is even, we take the middle two observations (the 5th and 6th),
and the median is the average of these two
observations: (2.43 + 2.51)/2 = 2.47.
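The same steps (sort, pick the middle two values, average them) can be sketched in Python and compared with the standard library's `statistics.median`. The variable names are illustrative.

```python
from statistics import median

weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# Step 1: arrange the data in ascending order.
ordered = sorted(weights)

# Step 2: n is even, so average the two middle observations.
n = len(ordered)
by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# The library function applies the same rule.
by_library = median(weights)
print(by_hand, by_library)
```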
Median for grouped data:
$$\tilde{x} = LCB + \left(\frac{\frac{n}{2} - F_C}{f_C}\right) W$$
Where:
LCB= Lower Class Boundary of the median class
FC = Cumulative Frequency just before the median class
fC = Frequency of the median class
W = Class Width and
n=number of observations.
Example: Median for grouped data
Consider the example on age of women that we
presented using a frequency distribution below.
Compute the median for the grouped data.
To compute the median for grouped data, we
first need to find the median class.
In this example half of the observations is 120, that
is, n/2 = 240/2 = 120.
Let us see the distribution with the cumulative
frequency:
As we can see from the distribution, the first class whose
cumulative frequency reaches 120 is the class
with cumulative frequency 155, since 115 < 120 ≤ 155. So,
the median class is the 4th class.
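Solution:
LCB = 26.5
FC = 115
fC = 40
W = 4
n = 240
$$\tilde{x} = LCB + \left(\frac{\frac{n}{2} - F_C}{f_C}\right) W = 26.5 + \left(\frac{120 - 115}{40}\right) 4 = 27.0$$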
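The substitution above can be verified with a small Python function. The function name and parameter names are hypothetical; the formula is the grouped-median formula whose symbols (LCB, FC, fC, W, n) are defined in this section.

```python
def grouped_median(lcb, n, cum_before, freq, width):
    """Median for grouped data: LCB + ((n/2 - FC) / fC) * W."""
    return lcb + ((n / 2 - cum_before) / freq) * width

# Values for the median class (4th class, 27-30) of the age data.
result = grouped_median(lcb=26.5, n=240, cum_before=115, freq=40, width=4)
print(result)
```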
Properties of the median
Extreme values do NOT affect the median, making
the median a good alternative to the mean to
measure central tendencies when such values occur.
There is only one median for a given set of data
(uniqueness)
NB: The mode for grouped data is modal class. Modal
class is the class with the largest frequency.
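As an illustration, the modal class of the grouped age distribution for data 1 can be found by locating the largest frequency. This is a minimal sketch with illustrative variable names; it returns the first class if several share the largest frequency.

```python
# Classes and frequencies from the grouped frequency table for data 1.
classes = ["15-18", "19-22", "23-26", "27-30", "31-34",
           "35-38", "39-42", "43-46", "47-50"]
frequencies = [15, 49, 51, 40, 21, 22, 18, 15, 9]

# The modal class is the class with the largest frequency.
largest = max(frequencies)
modal_class = classes[frequencies.index(largest)]
print(modal_class, largest)
```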
Group Assignment
Using arbitrary data, explain the statistical terms listed
below and present them with a PowerPoint presentation
to your group/classmates:
Geometric Mean(GM)
Harmonic Mean(HM)
Weighted Mean(WM)
Quantiles
Percentiles/Quartiles
Range
Interquartile Range(IQR)
Box and Whisker Plot
Outliers
Skewness
If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores.
Based on the type of skewness, distributions can be:
Negatively skewed distribution:
Occurs when majority of scores are at the right end of
the curve and a few small scores are scattered at the left
end.
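The pull of a few small scores on the mean can be seen in a short numerical sketch. The scores below are hypothetical and chosen only to mimic a negatively skewed shape: most values sit at the high end and two small values lie far to the left.

```python
from statistics import mean, median

# Hypothetical scores: most are high, a few small ones form the left tail.
scores = [2, 5, 18, 19, 19, 20, 20, 20]

avg = mean(scores)      # shifted toward the small scores
mid = median(scores)    # stays with the bulk of the data
print(avg, mid)
```

Here the mean falls below the median, which is the pattern usually seen in a negatively skewed distribution.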
Measurement of Variation
Used to determine the degree of variability between
data points relative to the MCT (e.g. SD, CV, mean deviation).
Group Discussion
The Variance
Variance measures how far, on average, scores
deviate or differ from the mean.
The mathematical formula for the sample variance is
defined as:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
Following are the survival times of n=11 patients
after heart transplant surgery.
Variance for grouped frequency distribution
Example of Variance for grouped frequency
distribution
Consider the following data on time spent by college
students on leisure activities. Compute the standard
deviation.
Standard Error (SE) : is used to describe the variability
among separate sample means obtained from one sample
to another.
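A common way to estimate the standard error of the mean from a single sample is to divide the sample standard deviation by the square root of the sample size. That formula is not stated on this slide, so the sketch below should be read as an assumption; it uses the DDT sprayable-surface sample from this lecture.

```python
import math
from statistics import stdev

# Sprayable surface areas (m^2) of 15 houses, from the DDT example.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133,
         135, 136, 137, 140, 145]

# Assumed estimate: SE = s / sqrt(n), with s the sample standard deviation.
standard_error = stdev(areas) / math.sqrt(len(areas))
print(round(standard_error, 2))
```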
Example 1: The areas of sprayable surfaces with DDT
from a sample of 15 houses are measured as follows (in
m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135,
136, 137, 140, 145
Find the variance and standard deviation of the
above distribution.
Solution:
The mean of the sample is 125 m².
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15 - 1} = \frac{2502}{14} = 178.71 \text{ m}^4$$
Hence, the standard deviation is
$$s = \sqrt{178.71} = 13.37 \text{ m}^2$$
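The worked solution can be reproduced step by step in Python and compared with the standard library's `statistics.variance` and `statistics.stdev`, which both use the n - 1 denominator. The variable names are illustrative.

```python
from statistics import mean, stdev, variance

# Sprayable surface areas (m^2) of the 15 sampled houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133,
         135, 136, 137, 140, 145]

x_bar = mean(areas)

# Sum of squared deviations from the mean, then divide by n - 1.
sum_sq = sum((x - x_bar) ** 2 for x in areas)
var_by_hand = sum_sq / (len(areas) - 1)

# Library equivalents of the sample variance and standard deviation.
var_lib = variance(areas)
sd_lib = stdev(areas)
print(x_bar, sum_sq, round(var_by_hand, 2), round(sd_lib, 2))
```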
Properties of SD
The SD has the advantage of being expressed in the
same units of measurement as the mean
Coefficient of variation (CV)
The standard deviation is an absolute measure of the
deviation of observations around their mean and is expressed in
the same unit as the data.
The coefficient of variation is most useful in
comparing the variability of several different samples,
each with different means.
CV is a relative measure free from unit of measurement.
When to use the coefficient of variation
When comparison groups have very different
means(CV is suitable as it expresses the standard
deviation relative to its corresponding mean)
When different units of measurements are involved,
e.g. group 1 unit is mm, and group 2 unit is gm (CV is
suitable for comparison as it is unit-free)
In such cases, standard deviation should not be used
for comparison.
It is the best measure to compare the variability of two
series or sets of observations.
Data with a smaller coefficient of variation are considered
more consistent.
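The coefficient of variation is usually computed as the standard deviation divided by the mean, expressed as a percentage. The sketch below assumes that definition and compares two samples from this lecture that are measured in different units: the sprayable surface areas (m²) and the birth weights (kg). The function name is hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV (%) = 100 * sample standard deviation / mean (assumed definition)."""
    return 100 * stdev(values) / mean(values)

# Sprayable surface areas in m^2 (DDT example).
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133,
         135, 136, 137, 140, 145]
# Birth weights in kg (birth weight example).
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

cv_area = coefficient_of_variation(areas)
cv_weight = coefficient_of_variation(weights)
print(round(cv_area, 2), round(cv_weight, 2))
```

Because the CV is unit-free, the two values can be compared directly; the sample with the smaller CV is the more consistent one.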
Thank You !!!