Bio-Statistics and RD Lecture Note
COMH 607
(3 Cr.Hrs.)
GIRUM TAYE
Course Description
INTRODUCTION
Why do you need to learn Bio-statistics?
Bio-statistics is fundamental to understanding disease distribution and contributes substantially to the majority of studies published in the medical literature.
Therefore, it is essential for anyone studying health-related sciences to have an understanding of the key statistical principles and methods.
Data and Variable
Suppose we want to study the characteristics of a group of students; we might ask about or measure their age, sex, place of residence, weight, height, SBP, DBP, or a particular disease status.
Each of these characteristics varies from student to student; such a characteristic is called a variable, and the values we collect from the students are called data.
Types of Data/variable
In general, we can classify data into two types:
1. Numerical or Quantitative data is data where the observations are
numbers. For example, age, height, weight, SBP, DBP etc.
Note: Numerical data is called discrete if the number of possible values
within every bounded range is finite. Examples: Cups of coffee consumed
per day, number of children etc. Otherwise, numerical data is called
continuous (the value of the variable is not restricted to an
integer). Example: height, weight, Body temperature, SBP, DBP etc.
2. Categorical or Qualitative data is data where the observations are non-
numerical. Example: vaccination status, smoking status, Disease severity,
disease status etc.
The simplest type of categorical variable is one that can take only two categories.
Such a variable is known as binary (or dichotomous).
Example:- Sex (M/F), Smoking status (Smoker/Non-smoker), Disease status
(Positive/Negative) etc
Some qualitative variables can take more than two values. Example:- Marital
status (Single, married, divorced, widowed ), Disease severity (Low, mild,
moderate, high) etc
Generally, qualitative variable can be:-
Unordered if the categories may be listed in any order such as marital status
(it does not involve ranking)
Ordered if the categories have a natural ordering to their categories such as
disease severity (it involves ranking)
When someone uses a qualitative variable, it is very important to check that the provided categories are exhaustive and mutually exclusive.
For example, the following categories for the variable "How many cigarettes do you smoke per day?" are neither exhaustive (there is no category above 10) nor mutually exclusive (5 falls in two categories):
0-2, 3-5, 5-7, 8-10
Scales of measurement
All scales of measurement belong to one of the following four basic type of scales
of measurement:
a. Nominal b. Ordinal c. Interval d. Ratio
Nominal scale is the most commonly used and basic scale of measurement. It
consists in forming a set of exhaustive and mutually exclusive classes or
categories of entities to measure the values of a trait. The objects are placed in
one of these classes.
Example: - Gender (Male or Female), Disease status (Positive or Negative),
Marital status (Single, Married, Widowed, Divorced, Separated)
In an Ordinal scale the entities are ranked with respect to the degree to which they possess a particular attribute. Thus, in measuring the efficacy of a specific medicine, it may be placed in one of the categories (fair, good, very good, excellent) depending on its performance. An ordinal scale does not, however, reflect in an absolute sense how far apart the entities in two different classes are with respect to the attribute under study. Here also the exhaustiveness and mutual exclusiveness conditions should be fulfilled.
In the case of an interval scale, entities are not only ranked with respect to the degree of the trait of interest, but the distance or difference between neighboring ranks or classes can also be measured, and this distance is constant between successive intervals or ranks.
We can design numerical rating scale beginning from an arbitrary zero point
representing the total absence of the trait or quality under study and
increasing the value in successive equal units on the scale up to the desired
limit.
The ratio scale is the most sophisticated of all four measurement scales.
Weight, length, and time all fall within the ratio scale. Here a natural zero point represents the absence of the attribute being measured.
Note: The only real difference between interval and ratio scale data is that the ratio scale has a natural zero point, while the zero point is defined arbitrarily for an interval scale (zero degrees temperature has a different meaning on the Fahrenheit and centigrade scales).
Exercise
1. Which of the following is a qualitative variable?
a. Severity of a disease b. Age c. HIV status d. All except “b”
2. Which of the following is a quantitative variable?
a. Weight b. Age group c. Number of children d. Age e. All except “b”
3. The variable “HIV status” based on Gold standard test has ____ type of
qualitative data.
a. Binary b. Ordinal c. Nominal d. Dichotomous e. All except "b"
4. Which one is the continuous quantitative variable?
a. SBP b. time c. body temperature d. number of cigarettes smoked
e. All except “d”
Measure of central tendency/Location
Mean
The sample mean is the average and is computed as the sum of all the observed
outcomes from the sample divided by the total number of events. In mathematical
terms, it can be given as:
X̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
where n is the sample size and the xi corresponds to the observed value.
The population mean is the average of the entire population and is usually
impossible to compute but can be estimated by sample mean. We use the Greek
letter μ for the population mean.
Median
One problem with using the mean is that it often does not depict the typical outcome:
If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome.
Such an outcome is called an outlier.
In this case an alternative measure is the median.
The median is the middle score. If we have an even number of events we take
the average of the two middles.
Note: After arranging the data in ascending order, the position of the median value can be obtained by the formula (n+1)/2. The median for a frequency distribution is simply the value at which the cumulative relative frequency reaches 50%.
Mode
The mode of a set of data is the number with the highest frequency. From this
definition therefore a distribution may have more than one mode.
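As a quick illustration (not part of the original note), the three measures can be computed directly with Python's standard library; the SBP values below are the ones used in a later exercise.

```python
# Illustrative sketch: mean, median and mode with Python's standard library.
from statistics import mean, median, mode

sbp = [100, 110, 110, 120, 120, 120, 130, 130, 140, 140]  # SBP values in mmHg

print(mean(sbp))    # 122   -> sum of the values divided by n
print(median(sbp))  # 120.0 -> average of the two middle values (n is even)
print(mode(sbp))    # 120   -> the most frequent value
```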
Properties of the Mean, Median & Mode
The mean is sensitive to outliers; the others are not.
The mode may be affected by small changes in the data; the others are not.
The mode and median may be found graphically.
All three measures of central tendency are equal for a symmetric distribution; in a
skewed distribution they differ.
If the mean is greater than both the median and the mode, then the distribution will be skewed to the right, and
if the mean is less than both the median and the mode, then the distribution will be skewed to the left.
Note:-For statistical analysis and inference, the mean is more often used.
However, if the data is considerably skewed, then statistical techniques based on the median should be employed.
Measure of dispersion/variation
These measure how far the data is spread apart
Range
The simplest way to describe the spread of a set of observations is to quote the
range, stating the lowest and highest values and hence the difference in
between.
The problem with this is that it reports the extreme values, while the actual
distribution of all the values in between will not be summarized in any way.
Inter Quartile Range
Inter Quartile Range is another measure of spread, and is defined as the
difference between the upper (third) and lower (first) quartiles. The lower
quartile is defined as that observation below which 25% of the sample lies and
above which 75% lies. The upper quartile is defined analogously, as the point
below which 75% of the sample lies, and above which 25% lies.
Therefore:- IQR= 3rd Quartile – 1st Quartile
Note: After arranging the data in ascending order the position of the first and
third quartile can be obtained by the formulas (n+1)/4 and 3(n+1)/4,
respectively.
Variance
It is a sort of average of all deviations of each observation from the mean.
However, simply calculating the mean deviation is not sufficient, since it gives an average deviation of zero: positive deviations from the mean will always exactly balance the negative deviations.
What we are interested in is the magnitude of the deviations. If we square the
deviations before summing them, we will always get a positive quantity.
Dividing this by the total number of observations then gives a measure of
average deviation from the mean, known as the variance, which is denoted by
S². Therefore, the variance is given by the formula:
S² = [1/(n − 1)] Σᵢ₌₁ⁿ (xᵢ − x̄)²
Standard Deviation
The problem with the variance is that it is squared, and so it is not in the same
units as the original data.
For example, if height is measured in metres, the variance is in square metres, which is a unit of area, not height.
Therefore, if we take the square root of the variance we get a measure of
variability in the same units as the raw data.
This quantity is called the standard deviation and tells us the average distance of
all the observations in a dataset from the mean.
Standard Deviation (S) = Square root of Variance (S2).
Steps to calculate Variance and Standard Deviation
1. Calculate the sample mean, x-bar
2. Write a column in a table that subtracts the mean from each
observed value.
3. Square each of the differences.
4. Add values in this column.
5. Divide by n -1 where n is the number of items in the
sample. This is the variance.
6. To get the standard deviation we take the square root of the
variance.
The sample standard deviation will be denoted by s and the
population standard deviation will be denoted by the Greek
letter σ.
The sample variance will be denoted by s2 and the population
variance will be denoted by σ2.
The variance and standard deviation describe how spread out
the data is.
If the data all lies close to the mean, then the standard deviation
will be small.
while if the data is spread out over a large range of values, s will be large. That is, having outliers will increase the standard deviation.
One of the flaws of the standard deviation is that it depends on the units that are used.
One way of handling this difficulty is the coefficient of variation (CV), which is the standard deviation divided by the mean, multiplied by 100%.
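A minimal sketch (not from the note) of the steps above, using the ten weight measurements from a later exercise:

```python
# Sketch of the variance/SD/CV calculation using the sample (n - 1) divisor.
from math import sqrt

x = [60, 55, 40, 50, 45, 45, 50, 55, 60, 50]    # weights in kg
n = len(x)
xbar = sum(x) / n                                # step 1: sample mean
sq_dev = [(xi - xbar) ** 2 for xi in x]          # steps 2-3: squared deviations
variance = sum(sq_dev) / (n - 1)                 # steps 4-5: divide by n - 1
sd = sqrt(variance)                              # step 6: square root
cv = sd / xbar * 100                             # coefficient of variation, in %

print(round(variance, 2), round(sd, 2), round(cv, 1))   # 43.33 6.58 12.9
```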
Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.
By skewed left, we mean that the left tail is long relative to the right tail.
Similarly, skewed right means that the right tail is long relative to the left tail.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal
distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean,
decline rather rapidly, and have heavy tails.
Data sets with low kurtosis tend to have a flat top near the mean rather than a
sharp peak. A uniform distribution would be the extreme case.
This definition is used so that the standard normal distribution has a kurtosis of
zero.
Positive kurtosis indicates a "peaked" distribution and negative kurtosis
indicates a "flat" distribution
A histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.
Exercise
Find the mean, median, mode, range, I.Q.R, Standard deviation and variance and tell
the most likely distribution for the following 10 S.B.P measurements in mmHg as
shown in the Table below:
SBP (mmHg)   Frequency
140             2
120             3
100             1
130             2
110             2
Total          10
Solution
Mean = 122 mmHg, Median = 120 mmHg, Mode = 120 mmHg, Range = 40 mmHg, I.Q.R = 22.5 mmHg, St. Devn = 13.17 mmHg, Variance = 173.33 mmHg²
Exercise
Find the mean, median, mode, range, I.Q.R, Standard deviation and variance for the
following ten weight measurements in K.g:
60, 55, 40, 50, 45, 45, 50, 55, 60, 50
Solution
Mean = 51 kg, Median = 50 kg, Mode = 50 kg, Range = 20 kg, I.Q.R = 11.25 kg, St. Devn = 6.58 kg, Variance = 43.33 kg²
Exercise
Select ten students randomly from your class, so that they represent the class, and ask their age in completed years, their weight in kg, or their height in cm, then
a. Describe as how you select the ten students from your class
b. Present the information you collected using an appropriate graph
and frequency distribution table
c. Find the mean, median, mode, range, I.Q.R, Standard deviation
and variance for your data
d. Tell the most likely distribution for your data
e. Find the skewness and kurtosis value for your data and interpret
the result.
Data presentation using Graphs
"A graph/picture is worth 1000 words"
A distribution presented as a graph or chart gives a more immediate message
than a frequency table does.
The type of graph/chart used depends on the type of data.
In general:-
If the data is categorical or discrete, we use a bar or a pie chart.
If the data is continuous, a histogram or frequency polygon is more appropriate.
Bar chart
For categorical variables, the frequency for each category is easily displayed in a
bar chart.
key points about Bar Chart
It is used to display qualitative (or discrete numerical) data
One bar represents one category, and the height of the bar equals its frequency
Each bar has the same width and equally spaced
Bars should have a space between them to stress that they represent categorical
data
The position of each category is arbitrary if the variable is unordered
It is important that the vertical axis of a bar chart starts at zero, to avoid
distortion of true differences between frequencies.
[Figure: bar chart of the number of respondents by sex (Male, Female)]
Pie chart
It is an alternative display for categorical data where the frequency of each
category is represented by the angle at the center of each slice of the circle.
Note: Angle = (Frequency/Total) × 360°
[Figure: pie chart of respondents' sex, with slices of 62% and 38%]
Histogram
For quantitative continuous variables we need a different type of plot from a bar
chart. Instead we use a histogram.
A histogram is like a bar chart but because we use it to display quantitative
continuous variables there are no spaces between adjacent bars.
Another important feature of a histogram is that it is the area of each bar, not the
height, which is proportional to the frequency in each group.
Key points about Histogram
The x-axis must be continuous, and there are no spaces between the bars.
The y-axis always begins at zero, this is important because relative comparisons
are being made.
The area of each bar represents the frequency in each group
The width of each bar is the size of the interval for each group
[Figure: histogram of respondents' age in six group intervals (20 to 45 years), N = 20]
[Figure: frequency plot of respondents' individual ages (18 to 45 years)]
[Figure: cumulative frequency polygon for respondents' age (20 to 45 years)]
Data presentation using Tables
1. Frequency Distribution Table
A frequency distribution table is more difficult to construct for numerical data than for categorical data because the scale of the observations must first be divided into classes.
The steps for constructing a frequency distribution table for numerical data are as
follows: -
i. Identify the largest and smallest observations
ii. Subtract the smallest observation from the largest to obtain the range of the data
iii. Determine the number of classes.
iv. Divide the range of observations by the number of classes to obtain the width of
the classes. Then tally the number of observations in each class.
Columns in frequency distribution table
Frequency- in a particular event it is the number of times that the event occurs.
Relative frequency- is the proportion of observed responses in the category.
cumulative relative frequency- is the running total of the relative frequencies by
reading from top to bottom.
Example: The data collected by asking 20 students their weight in kg is summarized in the following frequency distribution table (numerical data).
[Frequency distribution table not recovered from the slides; only the total row survives: Total 20, 1 (100%)]
Example: The data collected by asking 100 individuals their marital status is summarized in the following frequency distribution table (categorical data).
[Frequency distribution table not recovered from the slides]
2. Two by Two Table (2X2)
Measures of association between risk factor (exposure) and disease are often calculated
from data presented in 2X2 table.
The following is a 2X2 table showing association between exposure and disease
                         Disease
                    Yes (+)   No (-)   Total
Exposure  Yes (+)      A         B      A+B
          No (-)       C         D      C+D
Total                 A+C       B+D     A+B+C+D
Where
A = number exposed and have the disease
B = number exposed and do not have disease
C = number not exposed who have the disease
D = number not exposed who do not have the disease
A+B = total number of individuals exposed
C+D = total number of individuals not exposed
A+C = total number with the disease, B+D = total number without the disease
Example: The following 2X2 table shows the relationship between asthma and exposure to asbestos from a case control study. What main results do you observe from this table, and how can we measure the association between asbestos exposure and asthma?
                         Asthma
Asbestos            Yes (+)   No (-)   Total
Exposure  Yes (+)      90       270      360
          No (-)       60       360      420
Total                 150       630      780
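Since this is a case control study, one natural measure of association (not worked out in the note itself) is the odds ratio; a minimal sketch using the table above:

```python
# Hedged sketch: odds ratio for the asbestos/asthma case control table,
# OR = (A x D) / (B x C) for the 2x2 layout defined above.
A, B = 90, 270    # exposed:     asthma yes, asthma no
C, D = 60, 360    # not exposed: asthma yes, asthma no

odds_ratio = (A * D) / (B * C)
print(odds_ratio)   # 2.0 -> exposed subjects have twice the odds of asthma
```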
Exercise
From 10 lung cancer patients the following information regarding their previous
cigarette smoking practices per day are collected.
0, 4, 6, 6, 7, 8, 9, 10, 10, 12
Present this data in the form of table using interval scales of measurement
with two classes.
Solution
Cigarettes per day   Frequency   Relative frequency (%)
0 – 6                    4               40
7 – 13                   6               60
Total                   10              100
Elementary Probability Theory
Meaning of Probability
Assume that an experiment can be repeated many times, with each repetition
called a trial and assume that one or more outcomes can result from each trial,
then
The probability of a given outcome is the number of times that outcome occurs
divided by the total number of trials.
If the outcome is sure to occur, it has a probability of 1; if an outcome cannot occur, its probability is 0.
Example: The probability of flipping a fair coin and getting tails is 0.50, or 50%. If a coin is flipped 10 times, there is no guarantee that exactly 5 tails will be observed; the proportion of tails can range from 0 to 1.
Basic Definitions and Rules of probability
An experiment is defined as any planned process of data collection
For an experiment we define an event to be any collection of possible
outcomes.
A conditional probability is the probability of one event given that another event
has occurred.
A simple event is an event that consists of exactly one outcome.
In Probability OR: means the union i.e. either can occur
In probability AND: means intersection i.e. both must occur
Two events are mutually exclusive if they cannot occur simultaneously.
Characteristics of probability
P(E) is always between 0 and 1.
The sum of the probabilities of all simple events must be 1.
P(E) + P (not E) = 1
If A and B are mutually exclusive then
P(A or B) = P(A) + P(B)
If A and B are not mutually exclusive then
P(A or B) = P(A) + P(B) - P(A and B)
If A and B are independent then P(A and B) = P(A)P(B)
For conditional probability (dependent event)
P(A and B) = P(A|B)XP(B) or
P(B and A) = P(B|A)XP(A)
The multiplication rule for probabilities when events are not independent can be used to derive one form of an important formula called Bayes' theorem.
Since P(A and B) equals both P(A|B)XP(B) and P(B|A)XP(A), these latter two
expressions are equal.
Assuming that P(B) and P(A) are not equal to zero, we can solve for one in terms
of the other as follows:-
P(A|B)XP(B) = P(B|A)XP(A) Then;
P(A|B)= P(B|A)XP(A) ÷ P(B)
P(B|A)= P(A|B)XP(B) ÷ P(A)
In the above two formulas of Bayes' theorem, P(A) and P(B) are called the prior probabilities, because their values are known prior to the calculation, while P(A|B) and P(B|A) are called the posterior probabilities, because their values are known only after the calculation.
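As an illustration (the numbers below are made up for this sketch, not taken from the note), Bayes' theorem is what links a test's sensitivity and the disease prevalence to the probability of disease given a positive test:

```python
# Hedged sketch of Bayes' theorem: P(D | +) from prevalence, sensitivity
# and specificity (all values are illustrative assumptions).
p_d = 0.01                    # prior: prevalence, P(D)
p_pos_given_d = 0.95          # sensitivity, P(+ | D)
p_pos_given_not_d = 0.10      # 1 - specificity, P(+ | not D)

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)   # total probability P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos                   # posterior P(D | +)

print(round(p_d_given_pos, 3))   # about 0.088
```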
Exercise
The distribution of blood type by gender for 100 students is given in the following table. Based on this information, find the probability of a randomly selected student:
Not having type "O" blood
Having type "O" blood or being female
Having type "A" blood or type "AB" blood
Having type "A" blood and being male
Blood Type        Students         Total
B                   8        6        14
AB                 17        9        26
Total              50       50       100
[Rows for blood types O and A were not recovered from the slides]
Probability Distribution
The Poisson probability of observing x events is P(X = x) = e^(−λ) λ^x / x!,
where λ (lambda) = np is the value of both the mean and the variance of the Poisson distribution, and e is the base of natural logarithms, equal to 2.718.
Exercise
Suppose we are interested in the number of people who visit the clinic in city "X" in a given year, out of a total population of say 5,000,000, and let the probability that someone in the city visits the clinic be 0.00001. The mean number of people would then be np = 5,000,000 × 0.00001 = 50, which is also the variance.
For this example calculate:
The probability that no one in this population visits the clinic in a given year
The probability that fewer than 5 people visit the clinic in a given year
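A minimal sketch of how these two probabilities can be evaluated from the Poisson formula (the helper function below is written for this sketch, not taken from the note):

```python
# Sketch: Poisson probabilities for the clinic example, lambda = 50.
from math import exp, factorial

lam = 5_000_000 * 0.00001          # mean (and variance) = 50

def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

p_none = poisson_pmf(0, lam)                                 # P(X = 0)
p_fewer_than_5 = sum(poisson_pmf(x, lam) for x in range(5))  # P(X < 5)

print(p_none, p_fewer_than_5)   # roughly 2e-22 and 5e-17, i.e. essentially zero
```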
The Normal Distribution
It is a special distribution that we will use just about every day for a continuous
random variable, has the following properties:
It can take on any value (not just integers, as do the binomial and Poisson
distribution)
It is symmetric about the mean, μ
The standard deviation of the distribution is symbolized by σ, which is the horizontal distance between the mean and the point of inflection on the curve.
It approaches the horizontal axis on both the left and right side without
touching, that is the x-axis is an asymptote.
It is bell shaped with transition points one standard deviation from the mean.
Approximately 68% of the data points lie within one standard deviation of the
mean.
Approximately 95% of the data points lie within two standard deviations of the
mean.
Approximately 99.7% of the data points lie within three standard deviations of
the mean.
The area under the curve is equal to 1
Since it is a symmetrical distribution, half the area is to the left of the mean and half is to the right.
Given a random variable x that can take on any value between negative and positive infinity (−∞ to +∞), the normal distribution formula is as follows:
f(x) = [1/√(2πσ²)] e^(−(1/2)((x − μ)/σ)²)
Where e is the base of natural logarithms = 2.718 and π = 3.1416
[Figure: normal curve showing 68%, 95% and 99.7% of the area within 1, 2 and 3 standard deviations of the mean]
The number of standard deviations from the mean is called the z-score and can
be found by the formula :-
z = (x-μ)/σ
The Z Score and Area
Often we want to find the probability that a z-score will be less than a given
value, greater than a given value, or in between two values. To accomplish this,
we use the Table from the textbook and a few properties about the normal
distribution.
The pictures below have shaded regions corresponding to the area:
1. To the right of (above) a z-score of 2.30, which is 0.011 from the table, hence P(z > 2.30) = 0.011
2. To the left of (below) a z-score of 2.30, which is 0.989 from the table, hence P(z < 2.30) = 0.989
3. Between z-scores of −2.30 and 2.30, which is 0.979 from the table, hence P(−2.30 < z < 2.30) = 0.979
[Figures: standard normal curves with the area above z = 2.30 (shaded area = 0.011) and below z = 2.30 (shaded area = 0.989)]
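These table look-ups can also be reproduced with the normal CDF; a sketch using scipy (not part of the note):

```python
# Sketch: standard normal probabilities for z = 2.30; norm.cdf gives the
# area to the left of a z value.
from scipy.stats import norm

z = 2.30
print(round(1 - norm.cdf(z), 3))             # P(Z > 2.30)          = 0.011
print(round(norm.cdf(z), 3))                 # P(Z < 2.30)          = 0.989
print(round(norm.cdf(z) - norm.cdf(-z), 3))  # P(-2.30 < Z < 2.30)  = 0.979
```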
Chi square distribution
The chi-square test is used to compare frequencies or proportions in two or more groups, especially to test for their independence.
The logics in chi-square test are as follows:-
The total number of observations in each column and the total number of
observations in each row are considered to be given or fixed. (marginal
frequencies)
After assuming that the columns and rows are independent, we can calculate
the number of observations expected to occur by chance (expected
frequencies).
The expected frequency can be found by multiplying the column total by the row total and dividing by the grand total, i.e.
Expected Frequency = (Row total × Column total) / Grand total
Chi-square test compares the observed frequency in each cell with the
expected frequency.
If no relationship exists between the column and row variables, then
The observed frequencies will be very close to the expected frequencies, they
will differ only by small amounts
In this instance, the value of the chi-square statistic will be small.
On the other hand, if a relationship (dependency) does occur, then
The observed frequencies will vary quite a bit from the expected frequencies,
and
The value of the chi-square statistics will be large.
So the chi-square statistic is given as:
X²(d.f.) = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ
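As a hedged sketch (not part of the note), the statistic can be computed for the earlier asbestos/asthma 2×2 table, both by hand and with scipy for comparison:

```python
# Sketch: chi-square test of independence for the asbestos/asthma table.
from scipy.stats import chi2_contingency

observed = [[90, 270],    # exposed:     asthma yes, asthma no
            [60, 360]]    # not exposed: asthma yes, asthma no

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand   # expected frequency
        chi2 += (o - e) ** 2 / e

chi2_scipy, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), round(chi2_scipy, 2), dof)    # about 14.33 with 1 d.f.
```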
Student t-distribution
The t distribution is similar in shape to the z-distribution and
One of its major uses is to answer research questions about means.
The t-distribution is symmetric and has a mean of 0, but its standard deviation is
larger than 1.
The precise size of the standard deviation depends on the sample size, which is
called here degree of freedom (d.f)
The t-distribution has a larger standard deviation so it is wider and its tails are
higher than those for the Z-distribution.
As the sample size increases, the degree of freedom also increases, and the t-
distribution becomes almost the same as the standard normal distribution.
When the sample size is 30 or more, the t-distribution and z-distribution curves become so close that either the t or the z distribution can be used.
So, the t statistic is given by:
t = (x̄ − μ) / (s/√n)
Where
¯x= sample mean
μ= population mean
S= sample standard deviation
n= sample size
[Exercise data whose prompt was not recovered from the slides: 50, 50, 50, 55, 50, 60, 55, 50]
3. The following table shows the ages of nine male and nine female individuals.
Using a t-test, make a conclusion about the equality of the selected males' and females' ages. (Independent t-test)
(t tabulated with 16 d.f. at the 5% level = 2.12)
Note: Here S.E.(x̄1 − x̄2) = √(s1²/n1 + s2²/n2), and t has n1 + n2 − 2 d.f.
Age of Male Age of female
26 40
22 17
18 15
38 44
18 16
15 20
27 28
17 48
52 37
4. The following table shows SBP repeated measurements for nine individuals.
By using t-test, make a conclusion about the measurement difference. (paired t-
test)
(t tabulated with 8 d.f. at the 5% level = 2.306)
Note: Here you can use a one-sample t-test on the differences (dᵢ), comparing their mean to zero, with t on n − 1 d.f.
SBP1 SBP2
120 120
125 120
130 135
140 140
125 120
130 130
120 120
140 140
125 135
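Both exercises can be checked with scipy; this is a sketch (not the note's own solution) and it uses the pooled, equal-variance form of the independent test, which matches the n1 + n2 − 2 degrees of freedom quoted above, while the note's S.E. formula is the unpooled form, so hand calculations may differ slightly.

```python
# Sketch: independent and paired t-tests for the two exercises above.
from scipy import stats

male   = [26, 22, 18, 38, 18, 15, 27, 17, 52]
female = [40, 17, 15, 44, 16, 20, 28, 48, 37]
sbp1   = [120, 125, 130, 140, 125, 130, 120, 140, 125]
sbp2   = [120, 120, 135, 140, 120, 130, 120, 140, 135]

t_ind, p_ind = stats.ttest_ind(male, female, equal_var=True)   # d.f. = n1 + n2 - 2
t_pair, p_pair = stats.ttest_rel(sbp1, sbp2)                   # d.f. = n - 1

print(round(t_ind, 3), round(p_ind, 3))    # compare |t| with the tabulated 2.12
print(round(t_pair, 3), round(p_pair, 3))  # compare |t| with the tabulated 2.306
```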
Confidence interval for single mean
Confidence interval is a widely used tool to describe a population based on sample
data.
The idea here is to obtain an interval, based on sample statistics, that we can be
confident contains the population parameter of interest.
Applying the properties of the sampling distribution to the results of a single
sample leads us to the concept of confidence intervals.
This is an interval around the estimated mean which we can be confident contains
the true population mean.
A confidence interval extends either side of the sample mean by a multiple of the
standard error.
It is most common to calculate a 95% confidence interval; this extends 1.96 SE
either side of the mean.
Thus, a 95% confidence interval for a single population mean (μ) is calculated as follows:
x̄ ± 1.96 × S.E.(x̄)
Where
x̄ is the estimated mean and S.E.(x̄) = s/√n is its standard error.
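A minimal sketch using the ten weight measurements from the earlier exercise (the note keeps the 1.96 multiplier even for small samples, so the sketch does the same):

```python
# Sketch: 95% CI for a single mean, x-bar +/- 1.96 * SE(x-bar).
from math import sqrt

x = [60, 55, 40, 50, 45, 45, 50, 55, 60, 50]   # weights in kg
n = len(x)
xbar = sum(x) / n
s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
se = s / sqrt(n)

lower, upper = xbar - 1.96 * se, xbar + 1.96 * se
print(round(lower, 1), round(upper, 1))   # roughly 46.9 to 55.1 kg
```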
Confidence interval for two mean difference
Testing a hypothesis that a parameter equals some specified value (such as μ1 − μ2 = 0) can be done by determining whether or not 0 falls in the interval.
Therefore, similar to the confidence interval for a single mean, a 95% confidence interval for a population mean difference (μ1 − μ2) is calculated as follows:
(x̄1 − x̄2) ± 1.96 × S.E.(x̄1 − x̄2)
Where
S.E.(x̄1 − x̄2) = √(S1²/n1 + S2²/n2)
Exercise
The following data shows systolic blood pressure (SBP) measurements for two groups of the population, separated by whether or not they received appropriate treatment. Based on the information, find the 95%
confidence interval for the difference between the population means of SBP
measurements between the treated and untreated groups and interpret the
result
Confidence interval for single proportion
The 95% confidence interval for a single population proportion (P) is calculated
as follows:
p ± 1.96 × S.E.(p)
Where
p is the estimated proportion
1.96 is the multiplier
S.E.(p) is the standard error of the estimate = √(pq/n), with q = 1 − p
Exercise
Suppose that it is known that in a certain population of women, 90% entering
their 3rd trimester of pregnancy have had some prenatal care. A random sample
of size 200 is drawn from the population in an informal settlement and it is
found that 170 have had prenatal care at the beginning of their third trimester.
From this, find the 95% confidence interval proportion of women in the informal
settlement who have had some form of prenatal care by the third trimester and
interpret the result.
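A quick sketch of this exercise (not the note's own worked answer):

```python
# Sketch: 95% CI for the proportion with prenatal care, 170 out of 200.
from math import sqrt

n = 200
p = 170 / n                     # estimated proportion, 0.85
se = sqrt(p * (1 - p) / n)      # standard error = sqrt(pq/n)

lower, upper = p - 1.96 * se, p + 1.96 * se
print(round(lower, 3), round(upper, 3))   # about 0.801 to 0.899
```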
Confidence interval for two population proportions difference
Similarly, the 95% confidence interval for two population proportions difference
(P1 – P2) is calculated as follows:
(p1 − p2) ± 1.96 × S.E.(p1 − p2)
Where
p1 and p2 are the estimated proportions
1.96 is the multiplier
S.E.(p1 − p2) is the standard error of the estimated difference = √[(p1q1/n1) + (p2q2/n2)]
Example
TB patients were randomized to receive either medicine and bed rest or simply
bed rest. The outcome of interest will be whether or not the patient showed
improvement. Let p1 be the proportion of all TB patients who, if given medicine,
would show improvement at 6 months. Further we will define p2 as a similar
proportion among patients receiving only bed rest. Based on the information
given in the following table calculate the 95% confidence interval for population
proportion difference and interpret your result.
                          Improvement condition
Treatment group            Yes     No     Total
Medicine and bed rest       30     20       50
Bed rest only               15     35       50
Total                       45     55      100
Note
In general, the width of confidence interval depends on:
The confidence level (1-α):- As (1-α) increases, so does the width of the interval.
If we want to increase the confidence we have that the interval contains the
parameter, we must increase the width of the interval
The sample size: The larger the sample size, the smaller the standard error of the estimator, and thus the smaller the interval.
The standard deviation of the underlying distributions. If the standard
deviations are large, then the standard error of the estimator will also be large.
Hypothesis Testing
Whenever we have a decision to make about a population characteristic, we
make a hypothesis.
Suppose that we want to test the hypothesis that μ≠120 mmHg.
Then we can think of our opponent suggesting that μ=120 mmHg.
We call the opponent's hypothesis the null hypothesis and write:
H0: μ = 120 mmHg and our alternative hypothesis and write
H1: μ ≠ 120 mmHg
For the null hypothesis we always use equality, since we are comparing μ with a
previously determined mean.
For the alternative hypothesis, we have the choices: < , > , or ≠ .
Procedures in Hypothesis Testing
When we test a hypothesis we proceed as follows:
Formulate the null and alternative hypothesis.
Choose a level of significance.
Determine the sample size.
Collect data.
Calculate z ( t) or chi-square score.
Utilize the table to determine if the z score falls within the acceptance region.
Decide to:-
a. Reject the null hypothesis and therefore accept the alternative hypothesis, or
b. Fail to reject the null hypothesis and therefore state that there is not enough evidence to suggest the truth of the alternative hypothesis.
Errors in Hypothesis Tests
We define a type I error as the event of rejecting the null hypothesis when the
null hypothesis was true. The probability of a type I error (α) is called the
significance level.
We define a type II error (with probability ß) as the event of failing to reject the
null hypothesis when the null hypothesis was false.
Note: Larger α results in smaller ß, and smaller α results in a larger ß.
                              Null Hypothesis
Action                  True                    False
Reject H0               Type I error (α)        Correct decision
Fail to reject H0       Correct decision        Type II error (ß)
Example
In the study of the piglets being fed the supplemented diet, we know that the
mean weekly weight gain for our sample is 311.9, and that this is based on 16
observations. We also know that we have assumed the population standard
deviation, σ, to be 120 grams and that we want to test μ=200 grams.
Solution
Here
H0: μ = 200 gm
H1: μ ≠ 200 gm
Thus, Z = (x̄ − μ) / (σ/√n) becomes
Z = (311.9 gm − 200 gm) / (120 gm/√16) = 3.73
Now we know that 95% of the values from a standardized normal distribution can be expected to lie between −1.96 and 1.96, which implies that values above 1.96 or below −1.96 occur in only 5% of samples drawn from a standardized normal distribution. Since the value we have calculated as the test statistic lies above 1.96, this means that there is less than a 5% chance of getting a value this large by chance alone if the null hypothesis is true.
Thus we reject the null hypothesis, and accept the alternative hypothesis, at the 5% level.
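A short sketch of the same calculation, with a two-sided p-value added for completeness (the note itself only compares the statistic with ±1.96):

```python
# Sketch: one-sample z-test for the piglet example (sigma treated as known).
from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 311.9, 200, 120, 16
z = (xbar - mu0) / (sigma / sqrt(n))
p_two_sided = 2 * (1 - norm.cdf(abs(z)))

print(round(z, 2), p_two_sided)   # z is about 3.73, p well below 0.05 -> reject H0
```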
One sided versus two sided tests
The statement of the alternative hypothesis in above example was “not equal
to(≠ ) ” , that is, either higher than or lower than. This is called a two sided test.
If we were only interested in testing whether this diet would give a greater
weight gain, we would have used a one sided test, implying
Alternative hypothesis H1: μ > 200 gm
For a one sided test, we would not use the same cutoff as that of two sided test
For one sided test, we are only interested in a critical value or cutoff above which
5% of the distribution lies (and below which 95% of the distribution lies)
Sample size determination and sampling methods
Sample size calculation for single mean
The sample mean is used to estimate the population mean, and the confidence interval is used to determine how large or small the population mean is likely to be.
s², the variance, is required before the sample size calculation. This may be obtained from the literature, previous studies or a pilot study.
Therefore, for a 95% C.I. with desired precision (margin of error) d,
n = (1.96 s / d)²
Example
If we want to estimate the mean SBP of Rwandan males and the standard
deviation is around 20 mmHg and we wish to estimate the true mean with in 10
mmHg with 95% confidence, what will be the sample size.
Solution
We are given s = 20, d = 10 and z = 1.96.
Therefore n = [(1.96 × 20)/10]² = 15.37, which rounds up to 16.
Suppose the response rate is 80%; then we will need to sample 16/0.8 = 20 males.
If we decide to obtain a more precise estimate, for example d = 5 mmHg, we require n = [(1.96 × 20)/5]² = 62, and 62/0.8 = 78 with an 80% response rate.
Sample size calculation for single proportion
Research questions such as “What proportions are smokers?, What is the
prevalence of HSV-2 in rural area?, or What is the sensitivity (or specificity) of a
particular diagnostic test for disease x?)” lead to the estimation of a proportion.
To determine how big or how small the population proportion is likely to be a
confidence interval is calculated and the sample size for 95% confidence is given
by
n = (1.96/d)² p(1 − p)
Thus to determine the sample size required to estimate the proportion with the
desired level of precision, some idea is required before hand about the possible
magnitude of the proportion. If there is insufficient information to know this,
then the value of 0.5 can be used as this will give the largest possible sample size
that would be required
Example
We wish to estimate the proportion of males who smoke in a given country.
What sample size do we require to achieve a 95% confidence interval of width +
5% (that is to be with in 5% of the true value)?. A study some years ago found
approximately 30% were smokers
Solution
We take p = 0.30, d = 0.05 and z = 1.96.
n = (1.96/0.05)² × 0.3(1 − 0.3) = 322.69, rounded up to 323 men.
If we anticipate a 75% response rate, then we need to sample 323/0.75 = 431 men.
If we had no idea what the prevalence of smoking is likely to be, we would use p = 0.50 to give n = (1.96/0.05)² × 0.5(1 − 0.5) = 384.16.
So we need 385 men at the analysis stage and 385/0.75 = 514 to be sampled.
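Both the single-mean and single-proportion formulas are easy to wrap as small helper functions; a sketch (the function names are my own, not from the note):

```python
# Sketch: sample size for estimating a mean and a proportion with 95% confidence.
from math import ceil

def n_for_mean(s, d, z=1.96):
    """n = (z*s/d)^2 for estimating a mean to within +/- d."""
    return (z * s / d) ** 2

def n_for_proportion(p, d, z=1.96):
    """n = (z/d)^2 * p*(1-p) for estimating a proportion to within +/- d."""
    return (z / d) ** 2 * p * (1 - p)

print(ceil(n_for_mean(20, 10)))                    # 16 (SBP example, before response rate)
print(ceil(n_for_proportion(0.30, 0.05)))          # 323 men
print(ceil(n_for_proportion(0.30, 0.05) / 0.75))   # 431 men allowing for 75% response
```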
Sample size calculation for comparison of two independent means
The number per group needed to detect a difference in means of size d with power (1 − β) at significance level α is given by
n = 2s²/d² × (Zα + Zβ)²
Where Zα is the value of the standard normal distribution cutting off probability α in one tail for a one-sided alternative, or α/2 in each tail for a two-sided alternative, and Zβ is the value of the standard normal distribution cutting off probability β in the (right-hand) tail. Commonly used values for Zα and Zβ are Zα = 1.96 for α = 0.05 (two-tailed) and Zβ = 0.84 for 80% power or Zβ = 1.28 for 90% power.
Example
If you wish to carry out a study comparing serum catecholamine levels in
normotensive patients and patients with essential hypertension. And previous
studies have found mean serum catecholamine levels of 0.218 mg/ml (sd=0.14)
in normotensives. If the clinically important difference to be detected in
catecholamine levels in hypertensive patients is an increase by 0.1 mg/ml how
many subjects would you sample?
Solution
Mean1 = 0.218, mean2 = 0.318, sd1 = sd2 = 0.14, Zβ = 1.28 (for 90% power), Zα = 1.96
n = 2s²/d² × (Zα + Zβ)² = 2(0.14/0.1)² × (1.96 + 1.28)² = 41.15, rounded up to 42 per group
Sample size calculation for comparison of two independent proportions
In Pocock's formula for calculating the sample size required to compare two proportions, the number per group needed to detect a difference between two proportions p1 and p2, with power (1 − β) and significance level α, is given by
n = [p1(1 − p1) + p2(1 − p2)] / (p1 − p2)² × (Zα + Zβ)²
Note that to apply this formula, we need to know the expected proportion of
individuals in one group who will have the outcome of interest. Usually for a
randomized controlled trial the proportion for the control group is known (say
p2), the size difference that is important to detect (d) is decided and then the
proportion in the other group is calculated as p1= p2+d. For a case control study,
information about the proportion exposed to the factor of interest is obtained
for the control group (say p2), this can be approximated by the proportion
exposed in the population, the difference d is specified and the proportion
exposed in the case is calculated as p1= p2+d
Example
A new polio vaccine is thought to decrease polio cases. A decrease of 33% in a
population with approximately 30% polio prevalence rate is considered clinically
and economically of significance. How many treatment and placebo patients are
required to detect this difference at the =0.05 (two sided) level with 80%
power?
Solution
p1 = 0.30 and p2 = 0.30 − (33% of 0.30) = 0.20
Zα = 1.96 and Zβ = 0.84
Therefore n = (0.3×0.7 + 0.2×0.8)/(0.3 − 0.2)² × (1.96 + 0.84)² = 290.1
It implies at least 291 cases and 291 controls are required
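Wrapping Pocock's formula in a small function makes it easy to reproduce this example and to check the HIV-vaccine exercise that follows (a sketch; the function name is my own):

```python
# Sketch: Pocock's formula for comparing two proportions.
from math import ceil

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    return (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2 * (z_alpha + z_beta) ** 2

print(ceil(n_per_group(0.30, 0.20, z_beta=0.84)))   # 291 per arm (polio example, 80% power)
print(ceil(n_per_group(0.25, 0.15, z_beta=1.28)))   # 331 per arm (HIV exercise, 90% power)
```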
Exercise
In a trial for a promising new HIV vaccine, the vaccine is considered effective if
the proportions of HIV infected in the vaccine arm is 15% compared to 25% in
the control arm. What sample size is required if one has 5% level of significance
and 90% power.
ANS: 331
Sampling Methods
A population is the total set of individuals in which we are interested.
A sample is a subset of those individuals, selected in a prescribed manner, for study.
Reasons for sampling
Samples can be studied more quickly than populations.
A study of a sample is less expensive than studying an entire population,
because
a smaller number of items or subjects are examined.
A study of an entire population (census) is impossible in most situations.
Sample results are often more accurate than results based on a population.
If samples are properly selected, probability methods can be used to make estimates about the population.
Methods of sampling
1. Probability sampling
The best way to ensure that a sample will lead to reliable and valid inferences
is to use probability samples,
in which the probability of being included in the sample is known for each
subject in the population.
The four commonly used probability-sampling methods are simple random
sampling, systematic sampling, stratified sampling, and cluster sampling.
I. A simple random sample (lottery method) is one in which every subject
has an equal probability of being selected for the study.
The recommended way to select a simple random sample is to use a
table of random numbers or a computer-generated list of random
numbers.
II. A systematic random sample is one in which every kth item is selected;
k is determined by dividing the total number of items in the sampling
frame by the desired sample size.
Example. If a researcher want to consider only 200 students as a
sample from the total of 3400 students in his/her study using a
systematic random sample, 3400 divided by 200 is 17, so every 17th
student is sampled. In this approach we must select a number
randomly between 1 and 17 first, and we then select every 17th
student. Suppose we randomly select the number 12 from a random
number table. Then, the systematic sample consists of students with
ID numbers 12, 29, 46, 63, 80, and so on; each subsequent number is
determined by adding 17 to the last ID number.
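The selection logic in this example is simple to automate; a sketch (variable names are mine, not from the note):

```python
# Sketch: systematic random sampling of 200 students out of 3400 (k = 17).
import random

population_size, sample_size = 3400, 200
k = population_size // sample_size        # sampling interval = 17

random.seed(1)                            # fixed seed, only to make the sketch reproducible
start = random.randint(1, k)              # random start between 1 and k (12 in the slide example)
sample_ids = list(range(start, population_size + 1, k))

print(len(sample_ids), sample_ids[:5])    # 200 IDs: start, start + 17, start + 34, ...
```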
Exercise
For a given study select 10 students from a total of 59 students using systematic
random sampling
III. A stratified random sample is one in which the population is first divided into relevant strata (subgroups), and a random sample is then selected from each stratum proportionally. Characteristics used to stratify should be related to the measurement of interest, in which case stratified random sampling is the most efficient (the strata differ from each other in these characteristics).
IV. A cluster random sample results from a two-stage process in which the population is divided into clusters and a subset of the clusters is randomly selected. Clusters are commonly based on geographic areas or districts, so this approach is used more often in epidemiologic research than in clinical studies (units within a cluster share similar characteristics).
Exercise
To study the prevalence of vivax malaria, discuss the application of the stratified and cluster random sampling methods in two districts, where the first has three agro-ecological zones (hot, medium and cold) and the second has only one agro-ecological zone, which is hot.
2. Non-probability sampling
Non-probability samples are those in which the probability that a subject is
selected is unknown.
Non-probability samples often reflect selection biases of the person doing the
study and do not fulfill the requirements of randomness needed to estimate
sampling errors. Examples: convenience samples or quota samples.
An example of convenience sampling is taking 10 oranges from a big basket containing perhaps 1000.
An example of quota sampling is selecting clinics from the different provinces of the country in proportion to the number of clinics in each province.
Scatter Diagram, Regression Line and Simple linear regression
Scatter Diagram
If data is given in pairs then the scatter diagram of the data is just the points
plotted on the xy-plane.
The scatter plot is used to visually identify relationships between the first and
the second entries of paired data.
The scatter plot below represents the age vs. size of a plant.
It is clear from the scatter plot that as the plant aged, its size tends to increase.
If it seems to be the case that the points follow a linear pattern well, then we say
that there is a high linear correlation, while if it seems that the data do not follow
a linear pattern, we say that there is no linear correlation.
If the data somewhat follow a linear path, then we say that there is a moderate
linear correlation.
Simple linear Regression
Linear regression is applicable when one has collected data on two or more
variables and wants to quantify a relationship between the Response
(dependent) and Predictor (independent) variables.
Therefore, regression is used:-
To predict the value of one variable from the other variables
To examine the actual relationship between variables
To determine trends in data
The simplest form of regression is simple linear regression, where one has one
response and one predictor variable. Example:-
Predicting weight from height
Relating blood sugar content to daily amount of sugar consumption
Given a scatter plot, we can draw the line that best fits the data
• To find the equation of a line, we need the slope and the y-intercept. We will
write the equation of the line as
y = a + bx
Where a is the y-intercept and b is the slope.
x is the independent or predictor variable and
y is the dependent or response variable.
Least square estimation
Least squares can be interpreted as a method of fitting data.
The best fit in the least-squares sense is that instance of the model for which the
sum of squared residuals has its least value.
A residual is the difference between an observed value and the value given by the model. Let Yi = α + βXi + Єi.
The least squares method determines the line that minimizes the sum of squared vertical differences between the observed values of the outcome variable and the values predicted by the fitted line y' = a + bXi; that is,
Σ(y − y')² is minimized by taking partial derivatives of this sum with respect to α and β.
The two resulting equations are set equal to zero to locate the minimum; these two equations in two unknowns are solved simultaneously to obtain the formulas for α and β.
Then the formulas for α and β in terms of the sample estimates a and b will be:-
b = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
  = [N Σxᵢyᵢ − Σxᵢ Σyᵢ] / [N Σxᵢ² − (Σxᵢ)²]
  = [Σxᵢyᵢ − n x̄ȳ] / [Σxᵢ² − n x̄²]
  = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]
a = ȳ − b x̄
Therefore, to find a and b we follow the following steps:
List
The sum of the x values = Σx
The sum of the y values = Σy
The sum of the squares of x = Σx²
The sum of the products of x and y = Σxy
Interpretation
We can interpret a as the value of y when x is zero and
we can interpret b as the amount that y increases when x increases by
one.
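A short sketch of these steps on made-up height/weight data (the numbers are illustrative only, not from the note):

```python
# Sketch: least-squares estimates b and a for y = a + b*x.
x = [150, 160, 165, 170, 175, 180]   # heights in cm (illustrative)
y = [50, 56, 60, 63, 68, 72]         # weights in kg (illustrative)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

print(round(a, 2), round(b, 3))   # intercept and slope of the fitted line
print(round(a + b * 174, 1))      # predicted weight for a height of 174 cm
```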
Model Assumptions
For simple linear regression model Yi=α+ βXi+Єi, where the yi’s are the
measurements on the dependent variable, the xi’s are the measurements
on the independent or predictor variable, and α and β are the parameters
in the linear regression model that we want to estimate. The Єi’s or error
terms are used to make allowance for the random scatter about the line, in
other words, we are allowing for the fact that there is variability in our
sample. Most of the assumptions concern the error term Єi.
Assumptions
i. The Єi are symmetric, which means the deviations from the model are equally likely to be positive or negative
ii. The error for any particular case is not related to the value of the
predictor variable, this means that we are assuming that the variability
of the errors is the same over the whole range of the regression, which
we quantify by saying that the error variance is constant
iii. The error for one case is not affected by the error for another case; in other words, the errors are independent
iv. The distribution of the error terms is normal, or at least symmetric
v. The dependent variable yi is a continuous variable
vi. The only assumption made about the independent variable is that you have data for it. It can be continuous or discrete
For the above simple linear regression model (Yi=α+ βXi+Єi) the following
hypothesis can be tested:-
H0: α=0
H1: α≠0
H0: β=0
H1: β ≠0
Example
Suppose that a study was done to determine the weight loss after taking various
amounts of a diet pill in combination with exercise. If the regression line was
y = 3 + 2x
where x denotes the grams of the pill per day and y represents the weight loss,
then
we can say that with only the exercise and no pill the average weight loss is 3
pounds.
We can also say that if a person takes an additional gram of the pill, then that on
average the person should expect to lose an additional 2 pounds.
If a person takes 5 grams then that person can expect to lose an average of 13
pounds.
Coefficient of determination (R2) and Correlation coefficient (r)
From the linear regression line
Generally Residual is given by:- yi - y’ = yi - (a + bxi)
Coefficient of determination: R2
We define the coefficient of determination as an indication of how linear the
data is. R2 has the following properties:-
R2 is between 0 and 1.
If R2=1 then all points lie on a line. (perfectly linear)
If R2=0 then the regression line is a useless indicator for predicting y values. To
compute R2, do the following:
SS Total = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²
SS Residual = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²
SS Regression = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²
R² = SS Regression / SS Total = 1 − SS Residual / SS Total
[Figure: scatter plot of Y against X]
Correlation coefficient (r)
In many cases, more than one variable has been measured on each unit, such as an animal, plant or object.
So, if there are several variables of interest, one is frequently interested in
correlations between these variables.
Together with plots, this should form a starting point for any study of the joint
effect of a number of variables
In the previously discussed simple linear regression model, we may want to determine not just whether the two variables are linearly related, but also whether the relationship is positive or negative (b > 0 or b < 0).
Therefore, one method of examining the relationship between two continuous
variables (such as weight and height) is to look at the usual PEARSON’S
CORRELATION COEFFICIENT. For the sample this is defined as:- rxy= Sxy/SxSy
where Sx and Sy are the standard deviations of x and y, and Sxy is the COVARIANCE between x and y, defined as:
Sxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Therefore rxy is given by
rxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² Σᵢ₌₁ⁿ (yᵢ − ȳ)² ]
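A sketch checking the formula against numpy, on the same illustrative height/weight values used in the regression sketch above (not data from the note):

```python
# Sketch: Pearson correlation r = Sxy / (Sx * Sy), verified with numpy.
import numpy as np

x = np.array([150, 160, 165, 170, 175, 180], dtype=float)
y = np.array([50, 56, 60, 63, 68, 72], dtype=float)

sxy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # sample covariance
r = sxy / (x.std(ddof=1) * y.std(ddof=1))

print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))   # same value straight from numpy
print(round(r ** 2, 4))                    # equals R^2 for the simple linear regression
```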
Exercise
For the following data
a. Plot a scatter diagram
b. By fitting simple linear regression model, find the parameters of the model
using least square method and write the equation of the model.
c. Predict the weight of an individual whose height is 174 cm
d. Calculate and interpret R2 and r
e. Make a conclusion about the fitted model goodness of fit
In order to apply the test, we need to define the mean sum of squares for the
factor and for the error.
The mean sum of squares for factor A is then
MSSA = Σᵢ₌₁ᵏ nᵢ (X̄ᵢ − X̄)² / (k − 1)
                          Number of children
                      1       2       3       4
                     62      63      68      56
                     60      67      66      62
                     63      71      71      60
                     59      64      67      61
                             65      68      63
                             66      68      64
                                             63
                                             59
Mean (Xi)            61      66      68      61
No. of observations   4       6       6       8
The over all mean X=64
MSSA = [4(61 − 64)² + 6(66 − 64)² + 6(68 − 64)² + 8(61 − 64)²]/3 = 76
Since S1² = 3.3, S2² = 8.0, S3² = 2.8, S4² = 6.8, and n1 = 4, n2 = n3 = 6, n4 = 8,
MSSE = [(n1 − 1)S1² + (n2 − 1)S2² + (n3 − 1)S3² + (n4 − 1)S4²] / [(n1 − 1) + (n2 − 1) + (n3 − 1) + (n4 − 1)] = 5.6
For this example, the test statistic for the sample is 76/5.6=13.6, and the degrees
of freedom are k-1=4-1=3 and N-k=24-4=20. The p-value for an F statistic of 13.6
with 3 and 20 degrees of freedom is 0.000046.
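The same analysis can be checked with scipy's one-way ANOVA (a sketch; the four groups are the columns of the table above):

```python
# Sketch: one-way ANOVA for the four groups in the example above.
from scipy.stats import f_oneway

group1 = [62, 60, 63, 59]
group2 = [63, 67, 71, 64, 65, 66]
group3 = [68, 66, 71, 67, 68, 68]
group4 = [56, 62, 60, 61, 63, 64, 63, 59]

f_stat, p_value = f_oneway(group1, group2, group3, group4)
print(round(f_stat, 2), p_value)   # F of about 13.6 on (3, 20) d.f., p well below 0.05
```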
(X̄i − X̄j) ± G√[MSE(1/ni + 1/nj)]
differing only as to the value of G. The value of G is chosen in such a way that the overall chance of a type I error is not more than α.
In the case where we had two independent populations we used
G = t(1−α/2, n1+n2−2)
Thus for a factor with only two levels the formula becomes
(X̄1 − X̄2) ± t(1−α/2, n1+n2−2) √[MSE(1/n1 + 1/n2)]
This two sample t test is a special case of ANOVA, with only two independent
samples (i.e, two levels of the factor of interest).
For more than two groups, such as three one usually tests all pair wise
comparisons, that is, you test if level 1 differs from level 2, level 2 differs from
level 3, and whether level 1 differs from level 3.
The two most common multiple comparison tests are
The SCHEFFE procedure, for which G = √[(k − 1) F(k−1, N−k, 1−α)],
where F(k−1, N−k, 1−α) is the critical value of the F distribution with degrees of freedom k − 1 and N − k.
The BONFERRONI procedure, for which G = t(1−α/2k, N−k), which uses a critical value from the t distribution, but with α/2 divided by k, the number of factor levels to be compared.
Applying the two procedures to the above examples at the 5% level
For the SCHEFFE procedure, G = √(3 × F(3, 20; 0.95)) = 3.049
For the BONFERRONI procedure, G = t(0.99375, 20) = 2.927
Using these values for G in the confidence interval formulae, we obtain a table of pairwise comparisons with columns i, j, X̄i − X̄j, SCHEFFE interval and BONFERRONI interval.
[Table not recovered from the slides]
To interpret a confidence interval in the table: if it includes zero, the two treatments do not differ significantly. Here the two tests agree.
Treatments for which the confidence intervals do not include zero are marked by stars in the table.
Assumptions of the ANOVA test
The general one way ANOVA model may be written as:-
Yij=μi+Єij for i=1,2,…,k treatments (level of the factor), and j=1,2,…,ri replicates for
the ith treatment.
For example for the previous example r1=4, r2=6, r3=6 and r4=8. The observation
y2,5 is 65.
The yij are the values of the response variable for the jth replicate of the ith group
or factor level.
The Єij are the random error terms for the individual observations, while μi is the
population mean for the ith level of the factor.
Then, the ANOVA test assumes that the data for each factor level are an
independent random sample from the relevant population, and that they are
normally distributed around the mean for that factor level.
Independence of observations
The independence assumption means that, if we have a factor with three levels,
we have a separate sample for each level. In addition, the observations in each
sample are independent.
In terms of the ANOVA model, this means that the Єij are independent.
Normality of the error term
The assumption of normality of the population from which the data is drawn,
means the same as the assumption that the Єij are normally distributed. This
means the assumption of normality is not on the data itself.
This assumption holds for many data sets. Other data sets may often be transformed to approximate normality using, for example, a log transformation (appropriate for biological data) or a square root transformation (appropriate for count data).
Error variance the same for all groups (Homoscedasticity)
The other assumption for ANOVA test is on the error terms Єij namely that the
error variance is the same for all groups. To keep this assumption the appropriate
measure is to transform the data, thereby making the error variances for the
different groups more similar.
Note that one does not need the error variances to be identical, the rule of
thumb is that the error standard deviations should not differ by more than a
factor of 2, i.e
The largest standard deviation should not be more than twice the smallest
standard deviation, and the within-group variances should not differ by more
than a factor of 4.
The F test is not much affected by unequal variances when the sample sizes at the different factor levels are approximately equal. However, the multiple comparison tests may be greatly affected.
One way to check this assumption is to calculate the standard deviation for each
of the factor levels.
For our previous example, the variances were 2.8, 3.3, 6.8 and 8.0, so the ratio of the largest to the smallest is 8.0/2.8 = 2.86.
As this is less than 4, any departure from the assumption of equal population variances is unlikely to have too great an effect on the conclusions.
Some packages allow one to test the assumption of equality of variances, using
a test for “homogeneity of variance”, such as Hartley and modified Levene tests.
As always, the null hypothesis is the hypothesis of no difference, that is, that the
variances are equal.
Analysis of Variance for two factors (Two way ANOVA)
In the one way ANOVA, we have dealt only with comparing two or more group on a
single factor.
Here in two way ANOVA, we turn to the situation where we have measurements on
some response (such as weight or expenditure), as well as data on two or more
factors.
The model for such a general two way ANOVA is usually written as:
Yijk=μ+αi+βj+ Єijk
for i=1,2,…,nA levels of the first factor (factor A), j=1,2,…,nB levels of the second factor
(factor B), and k=1,2,…,rij replicates (observations) for the combination of the ith level
of factor A and jth level of factor B. The Єijk are random error terms, while μ is the
overall mean. Here the sum of the
αi ‘s and the sum of the βj’s are both taken to be zero.
EXAMPLE
Consider the following 2×2 experiment, also known as a 2² factorial design. The
factors here are age (under and over 30) and gender (Male, Female). The response
variable is amount of expenditure in US Dollar on health care.
Age        Gender    Expenditure (USD)
Under 30   Male      25  26  31  24  27  20  27  19  25  32
Under 30   Female    44  35  26  39  32  25  21  31  27  41
Over 30    Male      31  34  41  32  39  36  43  41  30  39
Over 30    Female    44  41  38  39  34  45  55  48  38  40
In this example there are two factors. Factor 1 (Gender)-taking 2 levels
(Male, Female) and Factor 2 (Age)-taking also 2 levels (under and over 30)
In this example there are 10 observations for males under 30, giving
r11=10, r12=10, r21=10, r22=10
To analyze this data, one needs to define a mean sum of squares for
factor B as well as for factor A, with their interaction and error.
The general two way ANOVA table is as follows:
Source of variation    SS                   d.f.            MS      F          P-value
Between levels of A    (nA−1)MSA            nA−1            MSA     MSA/MSE
Between levels of B    (nB−1)MSB            nB−1            MSB     MSB/MSE
Interaction A×B        (nA−1)(nB−1)MSAB     (nA−1)(nB−1)    MSAB    MSAB/MSE
Error                  (N−nAnB)MSE          N−nAnB          MSE
Total                  SS Total             N−1
The abbreviations in this table are:
SS=Sum of Squares
MS=Mean Square
The column of major interest is the last column, giving the p-values associated with the null hypothesis that the response means at all levels of a particular factor are equal, versus the alternative that at least two factor levels differ from each other.
The predictions for such model i.e for the combination of the ith level of factor A
and the jth level of factor B are given by the estimate of the overall mean, plus the
estimates for the ith level of factor A and the estimate for the jth level of factor B.
The estimate for the ith level of factor A is given by the mean of all values of the
response at the ith level of factor A, minus the overall mean value of the
response.
112
Two way analysis of variance decomposition
SSTotal=SSA + SSB + SSAxB + SSError
Where
$SS_{Total} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(Y_{ijk}-\bar{Y}_{...})^2$
$SS_A = rb\sum_{i=1}^{a}(\bar{Y}_{i..}-\bar{Y}_{...})^2$
$SS_B = ra\sum_{j=1}^{b}(\bar{Y}_{.j.}-\bar{Y}_{...})^2$
$SS_{A\times B} = r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{Y}_{ij.}-\bar{Y}_{i..}-\bar{Y}_{.j.}+\bar{Y}_{...})^2$
$SS_{Error} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(Y_{ijk}-\bar{Y}_{ij.})^2$
Similarly, the estimate for the jth level of factor B is given by the mean of all values of the response at the jth level of factor B, minus the overall mean value of the response.
Analyzing the previous example via a two factor analysis of variance (two way ANOVA)
gives the following table.
114
MEANS
Overall=34.125
R1=28.85
R2=39.4
$SS_A = rb\sum_{i=1}^{a}(\bar{Y}_{i..}-\bar{Y}_{...})^2$
= 10 × 2 × [(28.85 − 34.125)² + (39.4 − 34.125)²]
= 20 × [27.825625 + 27.825625]
= 20 × 55.65125
= 1113.025
115
MEANS
Overall=34.125
C1=31.1
C2=37.15
$SS_B = ra\sum_{j=1}^{b}(\bar{Y}_{.j.}-\bar{Y}_{...})^2$
= 10 × 2 × [(31.1 − 34.125)² + (37.15 − 34.125)²]
= 20 × [9.150625 + 9.150625]
= 20 × 18.30125
= 366.025
116
MEANS
Overall=34.125
R1C1=25.6 R2C1=36.6 R1=28.85 C1=31.1
R1C2=32.1 R2C2=42.2 R2=39.4 C2=37.15
$SS_{A\times B} = r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{Y}_{ij.}-\bar{Y}_{i..}-\bar{Y}_{.j.}+\bar{Y}_{...})^2$
= 10 × [(25.6 − 28.85 − 31.1 + 34.125)² + (32.1 − 28.85 − 37.15 + 34.125)² + (36.6 − 39.4 − 31.1 + 34.125)² + (42.2 − 39.4 − 37.15 + 34.125)²]
= 10 × [0.050625 + 0.050625 + 0.050625 + 0.050625]
= 10 × 0.2025
= 2.025
$SS_{Total} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(Y_{ijk}-\bar{Y}_{...})^2$
= (25 − 34.125)² + … + (40 − 34.125)²
= 83.27 + … + 34.52
= 2670.375
117
Source            SS          d.f.   MS          F        P-value
Age               1113.025    1      1113.025    33.69    < 0.05
Gender            366.025     1      366.025     11.08    < 0.05
Age × Gender      2.025       1      2.025       0.06     > 0.05
Error             1189.3      36     33.04
Total             2670.375    39
118
The hypotheses to be tested are:
H0: There is no difference in true mean health care expenditure between the age groups (and, similarly, between the genders)
H1: There is a difference in true mean health care expenditure between the age groups (between the genders), and
H0: There is no interaction between gender and age in true mean health care expenditure
H1: There is an interaction between gender and age in true mean health care expenditure
From the table, since the p-values for age and gender are less than 0.05, we reject the corresponding null hypotheses and conclude at the 5% level that the true mean health care expenditure differs over both age groups and genders; since the p-value for the interaction is greater than 0.05, there is no evidence of an interaction between age and gender in true mean health care expenditure.
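A minimal sketch (not part of the lecture) of how the same two way ANOVA could be reproduced with the statsmodels Python package, using the expenditure data from the table above; the sums of squares, F statistics and p-values should agree with the hand calculation:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

male_u30 = [25, 26, 31, 24, 27, 20, 27, 19, 25, 32]
female_u30 = [44, 35, 26, 39, 32, 25, 21, 31, 27, 41]
male_o30 = [31, 34, 41, 32, 39, 36, 43, 41, 30, 39]
female_o30 = [44, 41, 38, 39, 34, 45, 55, 48, 38, 40]

df = pd.DataFrame({
    "expenditure": male_u30 + female_u30 + male_o30 + female_o30,
    "gender": ["Male"] * 10 + ["Female"] * 10 + ["Male"] * 10 + ["Female"] * 10,
    "age": ["Under 30"] * 20 + ["Over 30"] * 20,
})

# two way ANOVA with interaction: expenditure ~ age + gender + age:gender
model = ols("expenditure ~ C(age) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, d.f., F and p-values for age, gender and the interaction
```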
119
Research Methodology
Research can be defined as the search for knowledge or any systematic
investigation to establish facts.
Applied research is research that accesses and uses some part of the research community's accumulated theories, knowledge, methods and techniques for a specific, often state-, commercial- or client-driven purpose.
Basic research or fundamental research (sometimes pure research) is research
carried out to increase understanding of fundamental principles.
Many times the end results have no direct or immediate commercial
benefits.
It can be thought of as arising out of curiosity.
However, in the long term it is the basis for many commercial products and
applied research.
Therefore, to do research, a research proposal/protocol should first be developed.
A research proposal is intended to convince others that you have a worthwhile
research project and that you have the competence and the work-plan to
complete it.
120
Generally, a research proposal should contain all the key elements involved in
the research process and include sufficient information for the readers to
evaluate the proposed study.
Regardless of your research area and the methodology you choose, all research
proposals must address the following questions:-
What you plan to accomplish
why you want to do it
how you are going to do it.
The proposal should have sufficient information to convince your readers that:-
You have an important research idea.
You have a good grasp of the relevant literature and the major issues.
Your methodology is sound.
The quality of your research proposal depends not only on the quality of your
proposed project, but also on the quality of your proposal writing.
A good research project may run the risk of rejection simply because the
proposal is poorly written. Therefore, it pays if your writing is coherent, clear and
compelling.
121
Main components of Research proposal
1. TITLE
It should be concise and descriptive. An effective title not only piques the reader's interest, but also predisposes him/her favorably towards the proposal.
The title should be in line with your general objective.
Make sure that it is specific enough to tell the reader what your study is about
and where it will be conducted.
2. Summary
It is a brief summary of approximately 500 words/a page.
It should include the research question, the rationale for the study , the
hypothesis (if any) and the method.
Descriptions of the method may include the design, procedures, the sample and
any instruments that will be used.
3. Introduction
The main purpose of the introduction is to provide the necessary background or
context for your research problem
The introduction typically begins with a general statement of the problem area, with a focus on a specific research problem, followed by the rationale or justification for the proposed study.
122
The introduction generally covers the following elements
1. State the research problem, which is often referred to as the purpose of the
study.
2. Provide the context and set the stage for your research question in such a way as
to show its necessity and importance.
3. Present the rationale of your proposed study and clearly indicate why it is worth
doing.
4. Briefly describe the major issues and sub-problems to be addressed by your
research.
5. Identify the key independent and dependent variables of your experiment.
Alternatively, specify the phenomenon you want to study.
6. State your hypothesis or theory, if any. For exploratory or phenomenological
research, you may not have any hypotheses.
7. Set the delimitation or boundaries of your proposed research in order to provide
a clear focus.
123
4. Literature Review
Sometimes the literature review is incorporated into the introduction section.
However, it is mostly preferred as a separate section, which allows a more thorough review of the literature.
The literature review serves several important functions such as:
1. Gives credits to those who have laid the groundwork for your research.
2. Demonstrates your knowledge of the research problem.
3. Demonstrates your understanding of the theoretical and research issues
related to your research question.
4. Shows your ability to critically evaluate relevant literature information.
5. Indicates your ability to integrate and synthesize the existing literature.
6. Provides new theoretical insights or develops a new model as the
conceptual framework for your research.
7. Convinces your reader that your proposed research will make a significant
and substantial contribution to the literature (i.e., resolving an important
theoretical issue or filling a major gap in the literature).
124
5. Research Objectives
The OBJECTIVES of a research project summarize what is to be achieved by the
study.
Objectives should be closely related to the statement of the problem.
For example, if the problem identified is low utilization of child welfare clinics, the
general objective of the study could be “to identify the reasons for this low
utilization”, in order to find solutions.
The general objective of a study states what researchers expect to achieve by the
study in general terms.
It is possible (and advisable) to break down a general objective into smaller,
logically connected parts.
These are normally referred to as specific objectives.
And they should specify what you will do in your study, where and for what
purpose.
125
6. Methods
The Method section is very important because it tells your reviewer how you
plan to tackle your research problem.
It will provide your work plan and describe the activities necessary for the
completion of your project.
The guiding principle for writing the method section is that it should contain sufficient information for the reader to determine whether the methodology is sound.
Some even argue that a good proposal should contain sufficient details for
another qualified researcher to implement the study.
You need to demonstrate your knowledge of alternative methods and make
the case that your approach is the most appropriate and most valid way to
address your research question.
For quantitative studies, the method section typically consists of the following
sections:
Design -Is it a questionnaire study or a laboratory experiment? What kind of
design do you choose?
Subjects or participants - Who will take part in your study ? What kind of
sampling procedure do you use?
126
Instruments - What kind of measuring instruments or questionnaires do you use?
Why do you choose them? Are they valid and reliable?
Procedure - How do you plan to carry out your study? What activities are
involved? How long does it take?
Note
Obviously you do not have results at the proposal stage. However, you need to
have some idea about what kind of data you will be collecting, and what statistical
procedures will be used in order to answer your research question or test your
hypothesis, which should be mentioned in methods section.
7. Discussion
It is important to convince your reader of the potential impact of your proposed
research.
You need to communicate a sense of enthusiasm and confidence without
exaggerating the merits of your proposal.
That is why you also need to mention the limitations and weaknesses of the
proposed research, which may be justified by time and financial constraints as
well as by the early developmental stage of your research area.
127
What is the difference between Research Proposal and Report?
Research Report Equals
All components of the research proposal plus
Result with discussion and interpretation
Conclusion
Recommendation
Referencing
A prime purpose of a citation is intellectual honesty (To avoid Plagiarism)
To attribute prior or unoriginal work and ideas to the correct sources, and
To allow the reader to determine independently whether the referenced material
supports the author's argument in the claimed way.
Citation content can vary depending on the type of source and may include:
• Book: author(s), book title, publisher, date of publication, and page number(s) if
appropriate.
• Journal: author(s), article title, journal title, date of publication, and page number(s).
• Newspaper: author(s), article title, name of newspaper, section title and page
number(s) if desired, date of publication.
• Web site: author(s), article and publication title where appropriate, the URL, and the date when the site was accessed.
128
Harvard referencing style (Parenthetical)
It is an example of author-date referencing.
The Harvard style is very common and is used across most subjects.
With the Harvard system, when you cite someone else's work, you need to include
the author's last name and the date of publication in brackets after the citation in
the body of your paper.
The full reference to the work is then included in an alphabetic reference list or
bibliography at the end of your paper.
Example: (Smith, 2001):- inside the text.
Smith SD, Jones, AD. Organ donation. Engl. J. Med. 2001;657:230-5:- at the end in
Alphabetical order .
Vancouver referencing style (Numbering)
Citation numbers are included in the text in square brackets, parentheses or as superscripts:
[1], (1) or ¹
All bibliographical information is exclusively included in the list of references at the
end of the document, next to the respective citation number.
Example: 1 :- inside the text.
1. Smith SD, Jones, AD. Organ donation. Engl. J. Med. 2001;657:230-5:- at the end.
129
What is Epidemiology
Epidemiology is a branch of health science which studies the frequency, distribution and determinants of diseases and other health-related states or events in specified populations.
The findings of such studies are applied to promote health and to prevent and control health problems.
COMPONENTS OF THE DEFINITION
Population: - the group of people and their environment. The focus of epidemiology is mainly on the population rather than on individuals.
Frequency: - expresses amount. This shows that epidemiology is mainly a quantitative science.
Frequency of disease (Morbidity)
Frequency of death (Mortality)
Health related conditions: - conditions which directly or indirectly affect or influence health.
These may be injury, vital events, health-related behavior, social factors, economic factors etc.
Distribution: - refers to the geographical distribution of disease. The distribution of disease can be described by time, place or the persons affected.
130
Determinants:- are factors, which determine whether or not a person will get a
disease.
Health: - A state of complete physical, mental, and social well-being and not merely
the absence of disease or infirmity.
SCOPE OF EPIDEMIOLOGY
Originally, epidemiology was concerned with epidemics of communicable diseases and epidemic investigation.
Later on it was extended to endemic communicable diseases and to non-communicable diseases.
At present epidemiological methods are being applied to:- Infectious and non-
infectious diseases, Injuries and accidents, Nutritional deficiencies
Maternal and child health, Cancer, Occupational health, Environmental health,
Violence etc
Hence Epidemiology can be applied to all disease conditions and other health
related events.
131
Epidemiological study design
Study Design – an arrangement of conditions for the collection and analysis of data to get the most accurate answer to the research question in an economical way.
Epidemiological study designs are broadly classified as:
Observational (Descriptive or Analytical)
Experimental/Interventional
133
Descriptive studies use information from various sources (e.g. census data, vital statistics records) and summarize these data in systematic ways.
Since the data used by these studies are usually routinely collected, they are less time consuming.
Analytical studies
In analytical study we can investigate risk factors for a disease or an outcome.
Here we ask the question "does the pattern of exposure to certain risk factors
among individuals with or without a specific disease help us to work out the
cause of the disease?"
We must be careful in how we interpret our findings; in an analytical study we measure associations between exposures and outcomes. If we demonstrate an association, that does not necessarily mean that the exposure caused the outcome.
The different Epidemiological study designs
1. Case report – a careful, detailed report by one or more clinicians on a single patient. It also documents unusual medical occurrences and can represent the first clue in the identification of new diseases or of the effects of certain exposures.
134
2. Case series – describes the characteristics of a number of patients with a given disease. It is very useful for hypothesis generation. Its limitations are that it is based on a single patient or a few patients, which can happen just by coincidence, and that there is no comparison group.
3. Ecological/Correlational – uses data from entire populations to describe disease in relation to some factors of interest such as age, time, utilization of health services, consumption of food etc. That is, an ecological study compares groups.
Thus it looks for an association between an exposure and an outcome at the group level, not at the individual level.
Example: Comparison of incidence of hypertension and average salt consumption in
African countries.
STRENGTHS
Ecological studies are the only studies that enable us to investigate differences between groups. This is extremely important in public health.
They are the only studies that enable us to investigate the effects of group properties or contextual properties.
They can often be carried out relatively quickly and cheaply, using routine or secondary data. Because of this, we often use them as a first step in the investigation of a possible exposure-outcome relationship.
135
We can often obtain group-level exposure data in circumstances in which it is
difficult or impossible to obtain individual-level exposure data.
For exposures with substantial within-person variability, group-level
exposure data may be more reliable than individual-level exposure-data.
WEAKNESSES
It can be difficult to control for confounding in ecological studies.
They are particularly susceptible to information bias.
They do not enable us to make inferences about the causes of individual risks.
4. Cross sectional – also called a prevalence study.
In cross sectional studies, exposure and disease status are assessed simultaneously among individuals in a well-defined population.
Information about the status of each individual with respect to the presence or absence of exposure and disease is thus assessed at a single point in time.
N.B. Since exposure and disease are assessed at the same time, in most cases it is not possible to determine whether the exposure preceded or resulted from the disease.
136
STRENGTHS
Cross-sectional studies are relatively easy and economical to conduct
Cross-sectional studies provide important information on the distribution
and burden of exposures and outcomes. This is extremely valuable for
health-service planning.
Cross-sectional studies can be used as the first step in the study of a possible
exposure-outcome relationship
WEAKNESSES
Cross-sectional studies measure prevalent rather than incident cases.
It can be difficult to establish the time-sequence of events in a cross-
sectional study.
5. Case-control – in a case-control study, subjects are selected with respect to the presence or absence of the disease (outcome), and inquiries are then made about past exposure to the factor of interest.
Here, to investigate the association between an exposure and an outcome, we first obtain information about one or more previous exposures from cases and controls, and then compare the two groups to see if each exposure is significantly more (or less) frequent in cases than in controls.
137
Example: Examining the association between cigarette smoking and lung cancer
Case-control study design
[Diagram: starting from the population, cases (with the outcome) and controls (without the outcome) are selected, and the inquiry looks back in time to classify each group as exposed or unexposed.]
138
STRENGTHS
Can be carried out rapidly and relatively cheaply
Are useful for studying rare diseases
Can be used to study diseases with long latent periods
Can study multiple exposures for a single outcome
WEAKNESSES
Are prone to selection bias, particularly in the selection of controls
Are prone to information bias, because exposure status is determined after
the outcome has occurred
Cannot establish the sequence of events: the exposure may be a
consequence rather than a cause of the outcome (reverse causality)
Are not suitable for studying rare exposures (except in nested case-control
studies)
Cannot usually be used to estimate disease incidence or prevalence
139
6. Cohort – in a cohort study, subjects are selected by the exposure or determinant of interest and followed to see the development of the disease or other outcome of interest. The starting point for cohort studies is exposure to a risk factor.
Cohort studies are particularly useful for rare exposures and in situations where we are interested in more than one outcome.
In addition, because exposure status is defined at the start of the study, before the outcome occurs, the temporal sequence of events can be investigated (i.e. exposure precedes outcome).
140
Cohort study design
[Diagram: starting from the total population without the disease, subjects are classified as exposed or unexposed and followed forward in time (the direction of inquiry); within each group, subjects are then recorded as disease/outcome positive or negative.]
141
STRENGTHS
Exposure is measured at the start of the study, before the outcome occurs,
and so measurement of exposure is not biased by the presence or absence
of the outcome
Cohort studies can provide data on the time course of the development of
the outcome (s), including late effects.
More than one outcome can be examined at once
Rare exposures can be investigated using appropriately selected
populations.
WEAKNESSES
Prospective cohorts are slow and potentially expensive if there is a long
period between exposure and outcome
They are inefficient for rare diseases
Retrospective cohort studies depend upon pre-existing records of exposure
being available, and being reliable
Exposure status may change during study (in which case it may need to be
determined again at intervals throughout the study)
142
Differential loss to follow-up may introduce bias: this is a particular problem
when follow-up is of long duration
In long term cohort studies, it may be hard to ensure that diagnostic criteria
remain consistent throughout the study, particularly if outcomes are ascertained
from routine data sources
Measures of association
Measures of association between a risk factor and a disease are often calculated from data presented in a 2×2 table. The following is a 2×2 table showing the association between an exposure/factor and a disease/outcome.

                         Outcome
Factor          Yes (+)      No (-)      Total
Yes (+)         A            B           A+B
No (-)          C            D           C+D
Total           A+C          B+D         A+B+C+D
144
RELATIVE RISK / RISK RATIO (RR)
The RR compares the risk (incidence rate) of the outcome among the exposed with that among the non-exposed. It can also be used to compare risks of death, accident or other possible outcomes of an exposure.
It is calculated as
RR = Incidence rate among exposed ÷ Incidence rate among non-exposed
   = (A/(A+B)) ÷ (C/(C+D))
Example: If, in a study of lung cancer mortality in relation to cigarette smoking, the mortality rate was 96/100000 among smokers and 24/100000 among non-smokers, find the RR.
Solution: RR = 96/100000 ÷ 24/100000 = 4.0. This can be interpreted as: cigarette smokers were 4 times more likely to die from lung cancer than non-smokers.
ODDS RATIO (OR)
In a case-control study, where study participants are selected on the basis of outcome status, it is not possible to determine incidence rates. Therefore direct computation of the RR is not possible, and an indirect estimate of the RR is given by the OR as follows:-
OR = odds of having the outcome if exposed ÷ odds of having the outcome if non-exposed
145
   = ratio of outcome to without outcome in the exposed ÷ ratio of outcome to without outcome in the non-exposed
   = (A/B) ÷ (C/D)
   = AD/BC (cross product)
Example: The following 2×2 table shows the relationship between asthma and asbestos exposure from a case-control study. From the given information, find the OR.
Solution: OR = AD ÷ BC = (90×360) ÷ (60×270) = 2. This can be interpreted as: the odds of developing asthma are 2 times higher among those exposed to asbestos compared to those without asbestos exposure.
Note: If A and C are small, so that one is looking at a very rare disease, then the formulas for RR and OR give almost the same answers.
146
                         Asthma
Asbestos exposure    Yes (+)     No (-)     Total
Yes (+)              90          270        360
No (-)               60          360        420
Total                150         630        780
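A minimal sketch (not from the lecture) computing the two measures of association described above: the RR from the smoking/lung-cancer mortality example and the OR (cross product AD/BC) from the asbestos-asthma 2×2 table:

```python
# Relative risk from the lung cancer mortality example
rate_exposed = 96 / 100_000      # mortality rate among smokers
rate_unexposed = 24 / 100_000    # mortality rate among non-smokers
rr = rate_exposed / rate_unexposed
print("RR =", rr)                # 4.0: smokers were 4 times more likely to die of lung cancer

# Odds ratio from the asbestos-asthma case-control table
A, B = 90, 270                   # asbestos exposed: asthma yes / no
C, D = 60, 360                   # not exposed:      asthma yes / no
odds_ratio = (A * D) / (B * C)   # cross product AD/BC
print("OR =", odds_ratio)        # 2.0: odds of asthma are twice as high among the exposed
```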
Therefore, these very important measures of association (RR and OR) can also be obtained from a multiple logistic regression model using the following formula:-
OR or RR = exp(bi) = 1/exp(-bi)
where i = 1, 2, … according to the number of independent variables, and whether the quantity is an OR or an RR depends on the type of study (cohort or case-control).
147
Generally OR/RR is > 0, and is interpreted as follows:-
= 1 indicates that the odds are even, so that the exposure has no effect on the probability of the disease.
Since one would seldom get an odds ratio of exactly 1, one needs a method to test whether an odds ratio is significantly different from 1. This can be done by calculating a standard error, and then using a confidence interval or doing a test.
> 1 means the outcome is more likely in those with the factor compared to those without the factor.
< 1 means the outcome is less likely in those with the factor compared to those without the factor.
Note:
For a cross sectional epidemiological study design we can use the OR as a measure of association between the exposure and the outcome variable.
148
Confidence Interval for OR
The OR cannot be negative, but can be very large if the divisor is very small.
This means it must lie between zero and infinity, which implies that it has a skewed distribution.
To develop a standard error for confidence interval calculation, it can be shown that the ln of the odds ratio has a symmetric, in fact approximately normal, distribution.
The standard error for the ln odds ratio is
S.E (loge OR)=√[(1/A)+(1/B)+(1/C)+(1/D)], where A,B,C and D are as defined in the
2x2 above.
Since the ln odds ratio is approximately normally distributed, one can calculate a
confidence interval for the ln odds ratio. One can then convert this to a
confidence interval for the odds ratio, by undoing the ln through taking
exponentials.
149
Therefore, the 95% confidence interval for the ln odds ratio is then given by
( loge(OR) - 1.96 × S.E.(loge OR) , loge(OR) + 1.96 × S.E.(loge OR) )
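A minimal sketch (reusing the asbestos-asthma table above) of the 95% confidence interval for the OR obtained via the standard error of ln(OR) and then exponentiating the limits:

```python
import math

A, B, C, D = 90, 270, 60, 360
or_hat = (A * D) / (B * C)                            # estimated odds ratio (2.0)
se_ln_or = math.sqrt(1/A + 1/B + 1/C + 1/D)           # S.E. of log_e(OR)

lower = math.exp(math.log(or_hat) - 1.96 * se_ln_or)  # exponentiate the ln-scale limits
upper = math.exp(math.log(or_hat) + 1.96 * se_ln_or)
print("OR =", or_hat, "95% CI: (", round(lower, 2), ",", round(upper, 2), ")")
```

If the interval excludes 1, the odds ratio is significantly different from 1 at the 5% level.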
            Breast Cancer
            Yes       No        Total
Yes         273       2641      2914
151
From the following table:-
1. Construct a 2×2 table
2. Calculate the OR and interpret it
Table
Lung Cancer Status    Cigarette Smoking
Positive Yes
Positive Yes
Positive Yes
Positive No
Positive No
Negative No
Negative No
Negative No
Negative Yes
Negative Yes
152
Exercise
a. From the following information, which was obtained from a cross sectional epidemiological study design, calculate and interpret an appropriate measure of association between the disease and the exposure variable.
HIV/AIDS status    Condom use
P No
P No
P No
P Yes
P Yes
N Yes
N Yes
N Yes
N No
N No
b. For a rare disease (outcome), show that the values of the Odds Ratio (OR) and the Risk Ratio (RR) are almost equal.
153
QUESTIONNAIRE DEVELOPMENT AND ADMINISTRATION
A questionnaire is a list of questions, to be answered by the respondent, which helps to fulfill the objectives of the study.
Steps in designing good questionnaire
1. Decide on the content according to the objectives of the study:-
Questions should be based on your study objectives; keep them within the scope of the study. Short-list the variables that you need. A short, well-conceived questionnaire elicits much better information than a long one.
2. Formulate Questions
Questions can be open-ended, and the respondent replies in whatever way she
or he chooses. The alternative is to have closed-ended questions where
predetermined possible answer categories are marked off and coded.
This ensures quicker, more standardized data collection. Closed question
categories should be mutually exclusive and exhaustive: i.e there should be no
overlap of categories, and all possibilities should be covered. Always allow for an
“other” category where the respondent can specify the answer.
154
Questions should be simple, concise and specific. Make sure that there are no
ambiguities.
Take special care in wording questions.
Assess, with potential respondents, which questions are meaningful to them and how to phrase the questions, to make sure that they are understandable and acceptable.
Avoid questions, which suggest to the respondent the answer that is expected
(this is called a leading question). Example: Do you believe that IUDs have
adverse health effects?
Develop one question at a time. Break up complex questions into simple ones.
Example: “Do you use a method of contraception? If not, why not?” Should be
broken up as follows:
1. Do you use a method of contraception? Yes/No
2. If no, what are your reasons?
3. If yes, which method are you currently using?
155
3. Sequencing of questions
Take special care in locating questions within the sequence when seeking
personal or sensitive information. i.e order the questions meaningfully to ensure
a smooth, logical flow.
Non-threatening items should be put first so that the respondents feel at ease.
4. Formatting the questionnaire
For the interviewer in interviews and the respondent in self-administered
questionnaires, provide visual markers to make the form easy to complete.
Examples of visual aids include putting related questions in boxes and using
arrows and flow diagrams. Besides, ensure good spacing and printing for easy
reading, and enough space for filling in the responses and computer codes.
5. Pre-testing the questionnaire
It is a test run before the main study and requires an in-depth look at the questionnaire, with the aim of improving its quality.
Select respondents who are as similar as possible to the target population.
156
Usually only a few subjects are chosen (5 to 20)
It helps to do trial runs of the questioning, leaving space for noting required changes
During the pilot, record words and sentences that are not understood and the questions that require explanation
It helps to assess logistical issues such as the time taken, wording, common responses which suggest categories for closed questions, and common misinterpretations of the questions
Feedback from the subjects should be welcomed, and as many of the criticisms as possible should be addressed
Finally, make the necessary changes after the pilot study
6. Translation
Ideally, it helps to interview study subjects in their own language
After translation, the questionnaire should be translated back, and a second person assigned to proofread it
Finally, the translation has to be standardized amongst interviewers
157
7. Training
It helps to reduce measurement errors and
Ensures that interviewers know how to administer the questions
8. Implementation of the actual field work
Questionnaire Administration
Questions developed for the study by considering the above steps can be asked by:
Self-administration – here no interviewer is required, so the clarity of the questions is essential
Interview – here the interviewer asks the questions face to face with the respondents
Telephone – here the interviewer asks the questions over the telephone
Discussion – mostly used to collect qualitative data by raising points for discussion through the discussion leader.
Example: Focus Group Discussion (FGD). In this case we can use a tape recorder or a notebook to record the discussion points.
158
Some rules for interviewing
1. Establish a social setting which is comfortable for the respondent; e.g. when you are interviewing children, it is especially important that you sit on the same level as they do.
2. Answer the following questions to the satisfaction of your respondents; this will determine the quality of the information you get:
Who you are, what you are doing?
Why have you picked me for this interview?
What will you do with the information?
Will giving you information benefit me? Or potentially harm me?
3. Be sensitive to local customs (e.g when interviewing women). Check with local
people beforehand what you should watch out for.
4. Encourage people to talk.
5. Be neutral by avoiding showing your feelings and opinions, and be patient.
6. Ask respondents to give examples since it is a good way to get people to
describe their ideas and opinions.
7. Avoid interrupting and contradicting the respondent.
159
8. Avoid thinking about the next question while the respondent is talking; instead, listen with full attention.
Examples of problematic questions which we have to avoid
1. Have you ever smoked? This question is vague. Smoked what: a pipe, cigarettes, etc.? And what do you answer if you had only one cigarette in your childhood?
2. What was your age of menarche? Here the word menarche is used which many
respondents will not understand
3. Have you had any infectious diseases such as measles? It is a leading question.
160
Exercise
The following are problematic questions, state the reasons
1. What is your religion?
a. Muslim
b. Orthodox
c. Protestant
2. Do you know signs and symptoms of malaria such as fever?
161
Logistic Regression model
Logistic Regression is commonly used when the independent variables include
both numerical and nominal measures and the outcome variable is binary
(dichotomous).
That is, logistic regression is a method which is useful when the response variable is dichotomous (has two levels) and at least one of the explanatory variables is continuous.
In this situation we are modeling the probability that the response variable takes
on the levels of interest (success) as a function of the explanatory variable.
In many experiments, the end point or outcome measurement, is dichotomous
with levels being the presence or absence of a characteristic (Example:- cure,
death; positive, negative etc.)
One key difference between the logistic regression model and the simple or multiple linear regression model is that, because the probabilities must lie between 0 and 1, we cannot fit a straight-line function as we did with linear regression.
Instead, we fit an “S”-shaped curve that is constrained to lie between 0 and 1.
162
[Figure: logistic regression curve for β > 0 — the probability rises from 0 through 0.5 to 1 as the independent variable (x) increases.]
163
For the case where we have one independent variable, we can have the following
logistic regression model.
$\pi(x) = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} = \frac{1}{1+e^{-(\alpha+\beta x)}}$
Here π(x) is the probability that the response variable takes on the characteristic of interest (success), and x is the level of the numeric explanatory variable.
The interest here is whether or not β=0. If β=0, then the probability of success is
independent of the level of x. If β>0, then the probability of success increases as x
increases, conversely, if β<0, then the probability of success decreases as x
increases.
To test this hypothesis, we will conduct the following test, based on estimates
obtained from a statistical computer package.
H0: β=0
H1: β≠0
In logistic regression, $e^{\hat{\beta}}$ is the change in the odds (the odds ratio) of a success at levels of the explanatory variable one unit apart.
164
The odds of an event occurring are defined as:-
$o = \pi / (1-\pi)$
It implies,
$o(x) = \pi(x) / (1-\pi(x))$
Then the ratio of the odds at x+1 to the odds at x (the odds ratio) can be written (independent of x) as:
$OR = \frac{o(x+1)}{o(x)} = \frac{e^{\alpha+\beta(x+1)}}{e^{\alpha+\beta x}} = e^{\beta}$
165
With some models, like the logistic curve, there is no mathematical solution that
will produce least squares estimates of the parameters.
For many of these models, we use the concept of maximum likelihood.
A likelihood is a conditional probability (e.g., P(Y|X), the probability of Y given X).
We can pick the parameters of the model (a and b of the logistic curve) at random
or by trial-and-error and then compute the likelihood of the data given those
parameters.
We choose as our parameter estimates those values that result in the greatest computed likelihood. The estimates are called maximum likelihood estimates because the parameters are chosen to maximize the likelihood (the conditional probability of the data given the parameter estimates) of the sample data.
The techniques actually employed to find the maximum likelihood estimates fall
under the general label numerical analysis.
There are several methods of numerical analysis, but they all follow a similar series
of steps.
First, the computer picks some initial estimates of the parameters. Then it will
compute the likelihood of the data given these parameter estimates.
Then it will improve the parameter estimates slightly and recalculate the likelihood
of the data.
166
It will do this repeatedly until we tell it to stop, which we usually do when the parameter estimates do not change much (usually a change of .01 or .001 is small enough to tell the computer to stop).
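A minimal sketch (with hypothetical data) of this iterative idea for a simple logistic regression, using one common numerical method (Newton-Raphson); the loop stops once the parameter estimates change by less than 0.001, as described above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical explanatory variable
y = np.array([0, 0, 1, 0, 1, 1])              # hypothetical binary outcomes

X = np.column_stack([np.ones_like(x), x])     # design matrix: intercept (alpha) and slope (beta)
beta = np.zeros(2)                            # initial parameter estimates

for step in range(100):
    p = 1.0 / (1.0 + np.exp(-X @ beta))       # current fitted probabilities pi(x)
    gradient = X.T @ (y - p)                  # derivative of the log-likelihood
    W = np.diag(p * (1 - p))                  # Newton-Raphson weights
    hessian = X.T @ W @ X
    new_beta = beta + np.linalg.solve(hessian, gradient)
    if np.max(np.abs(new_beta - beta)) < 0.001:   # stop when the estimates barely change
        beta = new_beta
        break
    beta = new_beta

print("alpha-hat, beta-hat:", beta)
print("odds ratio per unit increase in x:", np.exp(beta[1]))
```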
Example
A study was conducted to study the therapeutic effects of individual drugs in mice.
One part of this study was to determine toxicity of the drug individually. Mice were
given varying doses in a parallel groups fashion, and one primary outcome was
whether or not the mouse died from toxic causes during the 60 day study.
The observed numbers and proportions of toxic deaths are given in the following table by dose, as well as the fitted values from fitting the logistic regression model
$\pi(x) = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}$
where π(x) is the probability that a mouse that received a dose x dies from toxicity. Based on a computer analysis of the data we get the fitted equation $\hat{\pi}(x) = e^{\hat{\alpha}+\hat{\beta} x}/(1+e^{\hat{\alpha}+\hat{\beta} x})$, with the fitted probabilities shown in the last column of the table.
167
Dose    Number of mice    Toxic deaths    Observed proportion    Fitted proportion
8       87                1               1/87 = 0.012           0.077
12      77                38              38/77 = 0.494          0.372
16      69                54              54/69 = 0.783          0.806
20      49                45              45/49 = 0.918          0.967
24      41                41              41/41 = 1.000          0.995
To test whether or not P(Toxic death) is associated with dose, we will test
Ho:β=0 versus
H1: β≠0
Based on the computer analysis, we have $\hat{\beta} = 0.488$ and $\hat{\sigma}_{\hat{\beta}} = 0.0519$.
168
Now we can conduct the test for association at α=0.05 significance level.
1. Ho:β=0 (No association between dose and P(toxic death))
2. H1: β≠0 (Association exists)
3. X²-calculated = $(\hat{\beta}/\hat{\sigma}_{\hat{\beta}})^2 = (0.488/0.052)^2 = 88.071$
4. Since 88.071 is far larger than the critical value $\chi^2_{1,0.05} = 3.84$, we reject H0 and conclude that the probability of toxic death is associated with dose.
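A minimal sketch (not the analysis package used in the lecture) of how the mouse-toxicity model above could be fitted with statsmodels, using the dose, number of mice and number of toxic deaths from the table; the estimated slope should be close to the β̂ = 0.488 quoted above:

```python
import numpy as np
import statsmodels.api as sm

dose = np.array([8, 12, 16, 20, 24])
n_mice = np.array([87, 77, 69, 49, 41])
deaths = np.array([1, 38, 54, 45, 41])

X = sm.add_constant(dose)                      # intercept (alpha) plus dose (beta)
# binomial GLM with a logit link fitted to (deaths, survivors) counts
model = sm.GLM(np.column_stack([deaths, n_mice - deaths]), X,
               family=sm.families.Binomial())
result = model.fit()

print(result.summary())                        # beta-hat, its S.E. and the Wald test
print("Fitted P(toxic death) at each dose:", result.predict(X))
```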
171
Probit Data Analysis
This procedure measures the relationship between the strength of a dose and the
proportion of cases exhibiting a certain response to the dose.
It is useful for situations where you have a dichotomous output that is thought to
be influenced or caused by levels of some independent variable(s) and is
particularly well suited to experimental data.
This procedure will allow you to estimate the strength of a dose required to induce
a certain proportion of responses, such as the median effective dose.
Example.
How effective is a new pesticide at killing insects, and what is an appropriate
concentration to use?
You might perform an experiment in which you expose samples of insects to
different concentrations of the pesticide and then record the number of insects
killed and the number of insects exposed.
Then by applying probit analysis to these data, you can determine the strength of
the relationship between concentration and killing, and you can determine what
the appropriate concentration of pesticide would be if you wanted to be sure to kill,
say, 50% of exposed insects.
172
When biological responses are plotted against their causal dose (or logarithms of
them) they often form a sigmoid curve, as follows:-
173
Notes:-
Probit analysis is very similar to the logistic regression model, but is preferred when the underlying tolerances are assumed to be normally distributed.
The most common aim of a dose-response experiment analyzed by probit analysis is to estimate the dose corresponding to a 50% response (D50).
Probit analysis can be done by eye, through hand calculations, or by using a statistical program.
Probit (0.025)=-1.96=-probit (0.975)
Probit value versus ln (Dose) will be more or less linear.
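A minimal sketch (with hypothetical pesticide data, not from the lecture) of fitting a probit model against ln(dose) and estimating the median effective dose D50; since probit(0.5) = 0, the estimate is D50 = exp(-alpha/beta):

```python
import numpy as np
import statsmodels.api as sm

concentration = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # hypothetical doses
exposed = np.array([50, 50, 50, 50, 50])               # insects exposed at each dose (hypothetical)
killed = np.array([4, 10, 22, 38, 48])                 # insects killed (hypothetical)

X = sm.add_constant(np.log(concentration))             # probit versus ln(dose) is roughly linear
model = sm.GLM(np.column_stack([killed, exposed - killed]), X,
               family=sm.families.Binomial(link=sm.families.links.Probit()))
result = model.fit()

alpha, beta = result.params
d50 = np.exp(-alpha / beta)                            # dose giving a 50% response
print("Estimated D50:", d50)
```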
174
Exercise
Based on the following information from an experiment (dose of insecticide in dl, number of insects exposed, and number of insects killed, shown column by column below), apply the logit and probit models and find the amount of insecticide in dl required to kill 50% of the insects in the experiment.
Dose (dl)    Insects exposed    Insects killed
0 10 0
2 10 0
4 10 1
9 10 2
16 10 6
25 10 8
36 10 9
38 10 9
39 10 9
40 10 10
175
Exercise
1. Identify a research topic which can be done by applying cross sectional, cohort or
case–control analytical epidemiological research design.
2. Provide general objective in line with your topic.
3. Provide sample size and sampling methods in line with your objective.
4. Provide and discuss appropriate statistical data analysis method to achieve your
general objective.
5. List at least 2 questions that should be included in your questionnaire in line with
your general objective.
176