
Bio-Statistics and RD Lecture Note

Uploaded by

fayan

BIO-STATISTICS AND RESEARCH DESIGN

COMH 607
(3 Cr.Hrs.)

GIRUM TAYE

Course Description

• Definitions and scope of Biostatistics
• Collection, classification and tabulation of data
• Graphical and diagrammatic representation (histogram, frequency polygon, frequency curve)
• Descriptive statistics
• Measures of central tendency: mean (arithmetic, harmonic and geometric), median and mode
• Measures of dispersion: variance, standard deviation and standard errors
• Elements of probability theory
• Probability distributions: Binomial, Poisson and Normal distribution
• Simple linear regression
• Correlation coefficient
• Probit and logit analysis
• Basic idea of significance test
• Hypothesis
• Types of errors
• Level of significance
• Student's t test
• Chi-square test
• F test
• Types of research
• Problem identification and definition
• Prioritizing problems for research
• Formulating the problem statement
• Literature search and review
• Formulation of research objectives
• Formulation of hypothesis
• Study type and design
• Sample size determination
• Sampling method
• Designing research questionnaires
• Data collection techniques
• Monitoring the quality of data
• Data management and analysis
• Research proposal development
• Report writing
• Ethical consideration/Research ethics

INTRODUCTION

Bio-statistics is a branch of health science that studies the methods of collecting, presenting, analysing and interpreting health-related data.
Components of the definition:-
1. Methods of data collection:- observation, measurement, asking, counting etc.
2. Presentation:- Tables (such as frequency distribution, 2X2 table) and graphs (bar chart, histogram, pie chart, frequency polygon, cumulative frequency polygon)
3. Analysis:-
Descriptive methods summarize the data using measures of central tendency or location (such as mean, median and mode) and measures of variation or dispersion (such as range, interquartile range, variance and standard deviation).
Inferential methods use probability theory to make inferences about a population from the data of a sample; this is done through estimation (confidence intervals) and hypothesis testing.
4. Interpretation:- giving meaning to the results of the data analysis

Why do you need to learn Bio-statistics?
Bio-statistics is fundamental to understanding the distribution of disease, and statistical methods underlie the majority of studies published in the medical literature.
Therefore, it is essential for anyone studying health related sciences to have an understanding of the key statistical principles and methods.

Data and Variable
Suppose we want to study the characteristics of a group of students. We might ask about or measure their age, sex, place of residence, weight, height, systolic blood pressure (SBP), diastolic blood pressure (DBP) and status for a particular disease.
Each of these characteristics varies from student to student and is therefore called a variable; the values we collect from the students are called data.
Types of Data/variable
In general, we can classify data into two types:
1. Numerical or Quantitative data are data where the observations are numbers. For example, age, height, weight, SBP, DBP etc.
Note: Numerical data is called discrete if the number of possible values within every bounded range is finite. Examples: cups of coffee consumed per day, number of children etc. Otherwise, numerical data is called continuous (the value of the variable is not restricted to an integer). Examples: height, weight, body temperature, SBP, DBP etc.
2. Categorical or Qualitative data are data where the observations are non-numerical. Examples: vaccination status, smoking status, disease severity, disease status etc.

The simplest type of categorical variable is one that can take only two categories. Such a variable is known as binary (or dichotomous).
Example:- Sex (M/F), Smoking status (Smoker/Non-smoker), Disease status (Positive/Negative) etc.
Some qualitative variables can take more than two values. Example:- Marital status (single, married, divorced, widowed), Disease severity (low, mild, moderate, high) etc.
Generally, a qualitative variable can be:-
Unordered if the categories may be listed in any order, such as marital status (it does not involve ranking)
Ordered if the categories have a natural ordering, such as disease severity (it involves ranking)
When using a qualitative variable, it is very important to check that the categories provided are exhaustive and mutually exclusive.
For example, the following categories for the question "How many cigarettes do you smoke per day?" are neither exhaustive nor mutually exclusive:
0-2, 3-5, 5-7, 8-10
They are not exhaustive because a person who smokes more than 10 cigarettes per day fits no category, and they are not mutually exclusive because a person who smokes 5 cigarettes per day fits both 3-5 and 5-7.
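The check on the cigarette categories can be sketched in a few lines of Python. This is only an illustration: the function name `matching_classes` and the range of answers tested (0 to 12) are assumptions made for the example, not part of the lecture.

```python
# Class limits from the example, written as (lower, upper) pairs, both inclusive.
classes = [(0, 2), (3, 5), (5, 7), (8, 10)]


def matching_classes(value, classes):
    """Return every class whose limits contain the value."""
    return [(lo, hi) for lo, hi in classes if lo <= value <= hi]


# Try a hypothetical range of answers: 0 to 12 cigarettes per day.
for value in range(0, 13):
    hits = matching_classes(value, classes)
    if len(hits) == 0:
        print(value, "fits no class: the classes are not exhaustive")
    elif len(hits) > 1:
        print(value, "fits", hits, ": the classes are not mutually exclusive")
```

A well-built set of classes would print nothing, because every possible answer would fall in exactly one class.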

Scales of measurement
All measurements belong to one of the following four basic types of scale:
a. Nominal b. Ordinal c. Interval d. Ratio
Nominal scale is the most commonly used and most basic scale of measurement. It consists of forming a set of exhaustive and mutually exclusive classes or categories to measure the values of a trait. Each object is placed in exactly one of these classes.
Example:- Gender (Male or Female), Disease status (Positive or Negative), Marital status (Single, Married, Widowed, Divorced, Separated)
In an ordinal scale the entities are ranked with respect to the degree to which they possess a particular attribute. Thus, in measuring the efficacy of a specific medicine, the medicine may be placed in one of the categories fair, good, very good or excellent, depending on its performance. An ordinal scale does not, however, reflect in an absolute sense how far apart the entities in two different classes are with respect to the attribute under study. Here also the exhaustive and mutually exclusive conditions should be fulfilled.

In the case of an interval scale, entities are not only ranked with respect to the degree of the trait of interest, but the distance or difference between neighbouring ranks or classes can also be measured, and this distance is constant between each successive interval or rank.
We can design a numerical rating scale beginning from an arbitrary zero point, which does not represent the total absence of the trait or quality under study, and increasing the value in successive equal units on the scale up to the desired limit.
The ratio scale is the most sophisticated of the four measurement scales. Weight, length, and time all fall within the ratio scale. Here there is a natural zero point representing the absence of the attribute.

Note: The only real difference between interval and ratio scale data is that the ratio scale has a natural zero, while the zero point is defined arbitrarily for an interval scale (zero degrees of temperature has a different meaning on the Fahrenheit and centigrade scales).

Exercise
1. Which of the following is a qualitative variable?
a. Severity of a disease b. Age c. HIV status d. All except “b”
2. Which of the following is a quantitative variable?
a. Weight b. Age group c. Number of children d. Age e. All except “b”
3. The variable “HIV status” based on Gold standard test has ____ type of
qualitative data.
a. Binary b. Ordinal c. Nominal d. Dichotomous e. All except "b"
4. Which one is the continuous quantitative variable?
a. SBP b. time c. body temperature d. number of cigarettes smoked
e. All except “d”

Measure of central tendency/Location
Mean
The sample mean is the average, computed as the sum of all the observed values in the sample divided by the number of observations. In mathematical terms, it can be given as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where n is the sample size and x_i is the i-th observed value.
The population mean is the average of the entire population; it is usually impossible to compute but can be estimated by the sample mean. We use the Greek letter μ for the population mean.

Median
One problem with using the mean is that it often does not depict the typical outcome: if there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome.
Such an outcome is called an outlier.
In this case an alternative measure is the median.
The median is the middle value of the ordered data. If we have an even number of observations we take the average of the two middle values.
Note: After arranging the data in ascending order, the position of the median value is given by the formula (n+1)/2. The median of a frequency distribution is simply the value at which the cumulative relative frequency reaches 50%.

Mode
The mode of a set of data is the number with the highest frequency. From this
definition therefore a distribution may have more than one mode.

Properties of the Mean, Median & Mode
The mean is sensitive to outliers; the others are not.
The mode may be affected by small changes in the data; the others are not.
The mode and median may be found graphically.
All three measures of central tendency are equal for a symmetric distribution with a single peak; in a skewed distribution they differ.
If the mean is greater than both the median and the mode, then the distribution is skewed to the right.
If the mean is less than both the median and the mode, then the distribution is skewed to the left.

Note:- For statistical analysis and inference, the mean is more often used. However, if the data are considerably skewed then statistical techniques based on the median should be employed.

Measure of dispersion/variation
These measure how far the data is spread apart
Range
The simplest way to describe the spread of a set of observations is to quote the
range, stating the lowest and highest values and hence the difference in
between.
The problem with this is that it reports the extreme values, while the actual
distribution of all the values in between will not be summarized in any way.
Inter Quartile Range
Inter Quartile Range is another measure of spread, and is defined as the
difference between the upper (third) and lower (first) quartiles. The lower
quartile is defined as that observation below which 25% of the sample lies and
above which 75% lies. The upper quartile is defined analogously, as the point
below which 75% of the sample lies, and above which 25% lies.
Therefore:- IQR= 3rd Quartile – 1st Quartile
Note: After arranging the data in ascending order the position of the first and
third quartile can be obtained by the formulas (n+1)/4 and 3(n+1)/4,
respectively.

Variance
It is a sort of average of the squared deviations of the observations from the mean.
Simply calculating the mean deviation is not sufficient, since it always gives an average deviation of zero: the positive deviations from the mean exactly balance the negative deviations.
What we are interested in is the magnitude of the deviations. If we square the deviations before summing them, we always get a non-negative quantity.
Dividing this sum by n - 1, one less than the number of observations, gives a measure of the average squared deviation from the mean, known as the sample variance, which is denoted by S². Therefore, the variance is given by the formula

$$S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}$$

Standard Deviation
The problem with the variance is that it is squared, and so it is not in the same
units as the original data.
For Example: For height measurement the variance is in units of square meters,
which is a unit of area, not height.
Therefore, if we take the square root of the variance we get a measure of
variability in the same units as the raw data.
This quantity is called the standard deviation and tells us, roughly, the typical distance of the observations in a dataset from the mean.
Standard Deviation (S) = Square root of Variance (S²).

Steps to calculate Variance and Standard Deviation
1. Calculate the sample mean, x-bar
2. Write a column in a table that subtracts the mean from each
observed value.
3. Square each of the differences.
4. Add values in this column.
5. Divide by n -1 where n is the number of items in the
sample. This is the variance.
6. To get the standard deviation we take the square root of the variance.
• The sample standard deviation is denoted by s and the population standard deviation by the Greek letter σ.
• The sample variance is denoted by s² and the population variance by σ².
• The variance and standard deviation describe how spread out the data are.
• If the data all lie close to the mean, then the standard deviation will be small, while if the data are spread out over a large range of values, s will be large. That is, having outliers will increase the standard deviation.
One of the drawbacks of the standard deviation is that it depends on the units that are used.
One way of handling this difficulty is to use the coefficient of variation (CV), which is the standard deviation divided by the mean, times 100%.

Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

The skewness of a normal distribution is zero, and any symmetric data should have a skewness near zero.
Negative values of the skewness indicate data that are skewed left and positive values indicate data that are skewed right.
By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.

Kurtosis is usually reported as excess kurtosis, that is, the kurtosis minus 3. This definition is used so that the normal distribution has a kurtosis of zero.
Positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution.
• A histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.
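The notes do not give formulas for skewness and kurtosis, so the sketch below assumes the common moment-based definitions: skewness = m3 / m2^1.5 and excess kurtosis = m4 / m2² - 3, where m2, m3 and m4 are the average 2nd, 3rd and 4th powers of the deviations from the mean. Statistical packages often apply small-sample corrections, so their output may differ slightly. The two small data sets are hypothetical.

```python
def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3            # zero for a normal distribution
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]      # hypothetical data with a long right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

For the symmetric data the skewness is zero and the excess kurtosis is negative (a flat shape); for the second data set the skewness is positive, matching the long right tail.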

Exercise
Find the mean, median, mode, range, I.Q.R, standard deviation and variance, and tell the most likely distribution, for the following 10 S.B.P measurements in mmHg shown in the table below:

S.B.P in mmHg    Frequency
140              2
120              3
100              1
130              2
110              2
Total            10

Solution
Mean=122 mmHg, Median=120 mmHg, Mode=120 mmHg, Range=40 mmHg, I.Q.R=22.5 mmHg, St.Devn=13.17 mmHg, Variance=173.33 mmHg²
Exercise
Find the mean, median, mode, range, I.Q.R, standard deviation and variance for the following ten weight measurements in kg:
60, 55, 40, 50, 45, 45, 50, 55, 60, 50
Solution
Mean=51 kg, Median=50 kg, Mode=50 kg, Range=20 kg, I.Q.R=11.25 kg, St.Devn=6.58 kg, Variance=43.33 kg²
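The solutions to the two exercises above can be checked with a short Python sketch. It uses the standard `statistics` module for the mean, median, mode, variance and standard deviation, and a small helper (the name `quartile` is an assumption for this example) that follows the position rule (n+1)/4 and 3(n+1)/4 from these notes, with linear interpolation between neighbouring values.

```python
import math
import statistics


def quartile(sorted_data, q):
    """Quartile q (1, 2 or 3) at position q(n+1)/4, with linear interpolation."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4              # 1-based position in the ordered data
    lower = math.floor(pos)
    frac = pos - lower
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)


def summarize(data):
    data = sorted(data)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": data[-1] - data[0],
        "iqr": quartile(data, 3) - quartile(data, 1),
        "variance": statistics.variance(data),   # divides by n - 1
        "sd": statistics.stdev(data),
    }


# Raw values rebuilt from the frequency table of the first exercise.
sbp = [140] * 2 + [120] * 3 + [100] * 1 + [130] * 2 + [110] * 2
weights = [60, 55, 40, 50, 45, 45, 50, 55, 60, 50]

for name, data in (("SBP (mmHg)", sbp), ("Weight (kg)", weights)):
    print(name, {key: round(value, 2) for key, value in summarize(data).items()})
```

Other software may use a different quartile rule and therefore report a slightly different I.Q.R for the same data.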
Exercise
Select ten students randomly from those present in your class and ask their age in completed years, their weight in kg or their height in cm, then
a. Describe how you selected the ten students from your class
b. Present the information you collected using an appropriate graph and frequency distribution table
c. Find the mean, median, mode, range, I.Q.R, standard deviation and variance for your data
d. Tell the most likely distribution for your data
e. Find the skewness and kurtosis values for your data and interpret the result.

Data presentation using Graphs
"A graph/picture is worth 1000 words"
A distribution presented as a graph or chart gives a more immediate message
than a frequency table does.
The type of graph/chart used depends on the type of data.
In general:-
If the data is categorical or discrete, we use a bar or a pie chart.
If the data is continuous, a histogram or frequency polygon is more appropriate.
Bar chart
For categorical variables, the frequency for each category is easily displayed in a
bar chart.
Key points about bar charts
It is used to display qualitative (or discrete numerical) data
One bar represents one category, and the height of the bar equals its frequency
The bars have the same width and are equally spaced
Bars should have a space between them to stress that they represent categorical data
The position of each category is arbitrary if the variable is unordered
It is important that the vertical axis of a bar chart starts at zero, to avoid distortion of true differences between frequencies.

[Figure: bar chart "Number of respondents by sex"; x-axis: sex of respondent (Male, Female); y-axis: count]

Pie chart
It is an alternative display for categorical data where the frequency of each
category is represented by the angle at the center of each slice of the circle.
Note:- Angle = (Frequency/Total) × 360°

[Figure: pie chart "Respondents' percentage distribution by sex" (Male, Female); the two slices are 38% and 62%]

Histogram
For quantitative continuous variables we need a different type of plot from a bar
chart. Instead we use a histogram.
A histogram is like a bar chart but because we use it to display quantitative
continuous variables there are no spaces between adjacent bars.
Another important feature of a histogram is that it is the area of each bar, not the
height, which is proportional to the frequency in each group.
Key points about Histogram
The x-axis must be continuous, and there are no spaces between the bars.
The y-axis always begins at zero, this is important because relative comparisons
are being made.
The area of each bar represents the frequency in each group
The width of each bar is the size of the interval for each group

[Figure: histogram of respondents' age in six group intervals; x-axis: age (20.0 to 45.0); Mean = 32.0, Std. Dev = 7.85, N = 20]

The shape of a histogram can be unimodal if there is one hump, bimodal if there are two humps and multimodal if there are many humps.
A histogram that is not symmetric is called skewed. If the right tail is longer than the left tail then it is positively skewed. If the right tail is shorter then it is negatively skewed.
Frequency polygon
The relative frequency polygon can be constructed easily by joining the midpoints of the
tops of the vertical bars of a histogram.
The cumulative relative frequency polygon can be constructed by first getting the
cumulative relative frequency and then using line graphs

[Figure: frequency polygon for respondents' age; x-axis: age (18 to 45); y-axis: frequency]

[Figure: cumulative frequency polygon for respondents' age; x-axis: age (20 to 45); y-axis: cumulative frequency (0 to 30)]

Data presentation using Tables
1. Frequency Distribution Table
• A frequency distribution table is more difficult to construct for numerical data than for categorical data because the scale of the observations must first be divided into classes.
• The steps for constructing a frequency distribution table for numerical data are as follows:-
i. Identify the largest and smallest observations
ii. Subtract the smallest observation from the largest to obtain the range of the data
iii. Determine the number of classes.
iv. Divide the range of observations by the number of classes to obtain the width of the classes. Then tally the number of observations in each class.
Columns in a frequency distribution table
• Frequency: the number of times that a particular value or class occurs.
• Relative frequency: the proportion of observations in the category.
• Cumulative relative frequency: the running total of the relative frequencies, reading from top to bottom.

Example: The weights in kg of 20 students are summarized in the following frequency distribution table (numerical data)

Weight (kg)   Frequency   Relative frequency (%)   Cumulative relative frequency (%)
48            6           0.30 (30)                0.30 (30)
54            7           0.35 (35)                0.65 (65)
56            2           0.10 (10)                0.75 (75)
59            1           0.05 (5)                 0.80 (80)
64            4           0.20 (20)                1.00 (100)
Total         20          1.00 (100)
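As an illustration, the three columns of the weight table can be computed with a short Python sketch. The raw list of weights is rebuilt here from the frequencies in the table, and the function name `frequency_table` is an assumption made for this example.

```python
from collections import Counter

# Raw weights rebuilt from the frequencies in the table above.
weights = [48] * 6 + [54] * 7 + [56] * 2 + [59] * 1 + [64] * 4


def frequency_table(data):
    """Return rows of (value, frequency, relative frequency, cumulative relative frequency)."""
    counts = Counter(data)
    n = len(data)
    rows = []
    running_count = 0
    for value in sorted(counts):
        running_count += counts[value]
        rows.append((value, counts[value], counts[value] / n, running_count / n))
    return rows


print(f"{'Weight':>6} {'Freq':>5} {'Rel.':>6} {'Cum. rel.':>10}")
for value, freq, rel, cum in frequency_table(weights):
    print(f"{value:>6} {freq:>5} {rel:>6.2f} {cum:>10.2f}")
```

The cumulative column is built from the running count rather than by adding rounded relative frequencies, which avoids small rounding drift.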

Example: The marital status of 100 individuals is summarized in the following frequency distribution table (categorical data)

Marital status   Frequency   Relative frequency (%)
Single           45          0.45 (45)
Married          15          0.15 (15)
Widowed          8           0.08 (8)
Divorced         10          0.10 (10)
Separated        22          0.22 (22)
Total            100         1.00 (100)

2. Two by Two Table (2X2)
Measures of association between risk factor (exposure) and disease are often calculated
from data presented in 2X2 table.
The following is a 2X2 table showing association between exposure and disease

                       Disease
Exposure     Yes (+)   No (-)   Total
Yes (+)      A         B        A+B
No (-)       C         D        C+D
Total        A+C       B+D      A+B+C+D

Where
A = number exposed who have the disease
B = number exposed who do not have the disease
C = number not exposed who have the disease
D = number not exposed who do not have the disease
A+B = total number of individuals exposed
C+D = total number of individuals not exposed
A+C = total number with the disease, B+D = total number without the disease
Example: The following 2X2 table shows the relationship between asthma and asbestos exposure from a case-control study. What main results do you observe from this table, and how can we measure the association between asbestos exposure and asthma?

                      Asthma
Asbestos exposure   Yes (+)   No (-)   Total
Yes (+)             90        270      360
No (-)              60        360      420
Total               150       630      780
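The notes leave the measure of association as a question. For a case-control study the measure usually reported is the odds ratio, (A × D)/(B × C), so the sketch below assumes that choice; it is one possible answer, not necessarily the one intended by the lecturer. The function name `odds_ratio` is also an assumption for this example.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2X2 table laid out as in the notes (A, B in row 1; C, D in row 2)."""
    return (a * d) / (b * c)


# Counts from the asthma / asbestos table above.
a, b, c, d = 90, 270, 60, 360

print("Odds ratio:", odds_ratio(a, b, c, d))
print("Proportion exposed among cases (asthma):", a / (a + c))
print("Proportion exposed among controls (no asthma):", round(b / (b + d), 3))
```

An odds ratio above 1 means the odds of exposure are higher among people with the disease than among those without it.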
Exercise
From 10 lung cancer patients the following information on the number of cigarettes they previously smoked per day was collected.
0, 4, 6, 6, 7, 8, 9, 10, 10, 12
Present these data in a table using two class intervals.
Solution

Cigarettes smoked per day   Frequency   Relative frequency (%)
0 – 6                       4           40
7 – 13                      6           60
Total                       10          100

Elementary Probability Theory
Meaning of Probability
Assume that an experiment can be repeated many times, with each repetition called a trial, and assume that one or more outcomes can result from each trial. Then
The probability of a given outcome is the proportion of trials in which that outcome occurs when the number of trials is very large.
If the outcome is sure to occur, it has a probability of 1; if an outcome cannot occur, its probability is 0.
Example:- The probability of flipping a fair coin and getting tails is 0.50, or 50%. If a coin is flipped 10 times, there is no guarantee that exactly 5 tails will be observed; the proportion of tails can range from 0 to 1.
Basic Definitions and Rules of probability
An experiment is defined as any planned process of data collection.
For an experiment we define an event to be any collection of possible outcomes.
A conditional probability is the probability of one event given that another event has occurred.

A simple event is an event that consists of exactly one outcome.
In probability, OR means the union, i.e. either event (or both) can occur
In probability, AND means the intersection, i.e. both must occur
Two events are mutually exclusive if they cannot occur simultaneously.
Characteristics of probability
P(E) is always between 0 and 1.
The sum of the probabilities of all simple events must be 1.
P(E) + P(not E) = 1
If A and B are mutually exclusive then
P(A or B) = P(A) + P(B)
If A and B are not mutually exclusive then
P(A or B) = P(A) + P(B) - P(A and B)
If A and B are independent then P(A and B) = P(A) × P(B)
For events that are not independent (conditional probability)
P(A and B) = P(A|B) × P(B) or
P(B and A) = P(B|A) × P(A)
The multiplication rule for probabilities when events are not independent can be used to derive one form of an important formula called Bayes' theorem.

Since P(A and B) equals both P(A|B) × P(B) and P(B|A) × P(A), these two expressions are equal.
Assuming that P(A) and P(B) are not equal to zero, we can solve for one conditional probability in terms of the other as follows:-
P(A|B) × P(B) = P(B|A) × P(A). Then:
P(A|B) = P(B|A) × P(A) ÷ P(B)
P(B|A) = P(A|B) × P(B) ÷ P(A)
In these two forms of Bayes' theorem, P(A) and P(B) are called prior probabilities, because their values are known prior to the calculation, while P(A|B) and P(B|A) are called posterior probabilities, because their values are known only after the calculation.
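As a purely arithmetic illustration of Bayes' theorem, the sketch below reuses the counts from the asthma and asbestos 2X2 table earlier in these notes, treating the 780 people simply as one group of counts. Here A = "exposed to asbestos" and B = "has asthma"; this labelling and the variable names are assumptions made for the example.

```python
# Counts taken from the asthma / asbestos 2X2 table in these notes.
exposed_and_asthma = 90
exposed_total = 360
asthma_total = 150
grand_total = 780

p_exposed = exposed_total / grand_total                      # P(A)
p_asthma = asthma_total / grand_total                        # P(B)
p_asthma_given_exposed = exposed_and_asthma / exposed_total  # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) x P(A) / P(B)
p_exposed_given_asthma = p_asthma_given_exposed * p_exposed / p_asthma

# Direct calculation from the counts, as a check.
direct = exposed_and_asthma / asthma_total

print("P(exposed | asthma) by Bayes' theorem:", round(p_exposed_given_asthma, 4))
print("P(exposed | asthma) read directly from the table:", round(direct, 4))
```

Both routes give the same value, which is the point of the theorem: a conditional probability in one direction can be recovered from the conditional probability in the other direction and the two marginal probabilities.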
Exercise
The distribution of blood type by gender for 100 students is given in the following table. Based on this information, find the probability that a student selected at random from the total
a. does not have type "O" blood
b. has type "O" blood or is female
c. has type "A" or type "AB" blood
d. has type "A" blood and is male

Blood type   Male   Female   Total
O            10     15       25
A            15     20       35
B            8      6        14
AB           17     9        26
Total        50     50       100
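A minimal sketch of how the four probabilities in this exercise follow from the addition and complement rules is given below. The dictionary layout and variable names are assumptions made for the example.

```python
# Counts from the blood type table above.
table = {
    "O": {"Male": 10, "Female": 15},
    "A": {"Male": 15, "Female": 20},
    "B": {"Male": 8, "Female": 6},
    "AB": {"Male": 17, "Female": 9},
}
total = sum(sum(row.values()) for row in table.values())


def p_type(blood_type):
    return sum(table[blood_type].values()) / total


p_female = sum(row["Female"] for row in table.values()) / total
p_o_and_female = table["O"]["Female"] / total

# a. Complement rule: P(not O) = 1 - P(O)
p_not_o = 1 - p_type("O")
# b. Addition rule for events that are not mutually exclusive
p_o_or_female = p_type("O") + p_female - p_o_and_female
# c. Addition rule for mutually exclusive events
p_a_or_ab = p_type("A") + p_type("AB")
# d. Joint probability read directly from the cell
p_a_and_male = table["A"]["Male"] / total

for label, value in [("a", p_not_o), ("b", p_o_or_female),
                     ("c", p_a_or_ab), ("d", p_a_and_male)]:
    print(label, round(value, 2))
```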

Exercise

Race    Population    HIV/AIDS positive
Black   2,000,000     80,000
White   5,000,000     100,000
Other   8,000,000     400,000
Total   15,000,000    580,000

Based on the information in the table, find
a. The probability that an individual is HIV positive given that the individual is Black
b. The probability that an individual is HIV positive given that the individual is White
c. The probability that an individual is HIV positive given that the individual is of another race
d. The probability that an individual is HIV positive, using the direct and the indirect method.
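The sketch below works through this exercise. The notes do not define the "direct" and "indirect" methods, so the code assumes that "direct" means total positives divided by total population, and "indirect" means combining the race-specific conditional probabilities weighted by each group's share of the population. Treat that reading as an assumption.

```python
# Counts from the table above.
population = {"Black": 2_000_000, "White": 5_000_000, "Other": 8_000_000}
positive = {"Black": 80_000, "White": 100_000, "Other": 400_000}

total_population = sum(population.values())

# a-c. Conditional probabilities: P(HIV positive | race)
conditional = {race: positive[race] / population[race] for race in population}

# d. Direct method (assumed meaning): all positives over the whole population.
direct = sum(positive.values()) / total_population

# d. Indirect method (assumed meaning): sum of P(positive | race) x P(race).
indirect = sum(
    conditional[race] * population[race] / total_population for race in population
)

for race, value in conditional.items():
    print(f"P(positive | {race}) = {value:.2f}")
print(f"Direct:   {direct:.4f}")
print(f"Indirect: {indirect:.4f}")
```

Under these assumptions both methods must agree, since the weighted sum is only a rearrangement of the overall proportion.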

Probability Distribution

The Binomial Distribution

It is a discrete distribution which is applicable when the outcome is the number of times an event occurs.
A distribution is binomial if all of the following are true:
1. There is a fixed number of trials, n, which are all independent.
2. The outcomes are binary, such as true or false, yes or no, success or failure, positive or negative.
3. The probability of success is the same for each trial.
Therefore, for a binomial distribution with n trials, probability of success p and probability of failure q = 1 - p, we have

$$P(X = x) = C_{n,x}\, p^{x} q^{n-x}, \quad \text{where } C_{n,x} = \frac{n!}{x!\,(n-x)!}$$

The mean and variance of the binomial distribution are np and npq, respectively.
Exercise
Suppose the probability that a woman gets cervical cancer in her lifetime is 0.002. In a study of 50 women, find the probability that:-
a. 0 women get cervical cancer (0.904747)
b. 10 women get cervical cancer
c. 2 or more women get cervical cancer
d. 2 or fewer women get cervical cancer
e. What is the mean number of women who will get cancer?
f. Calculate the standard deviation for this group as well
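The cervical cancer exercise can be worked through with a short sketch that applies the binomial formula directly. The helper name `binomial_pmf` is an assumption for this example; `math.comb` is the standard library function for C(n, x).

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 50, 0.002

p_zero = binomial_pmf(0, n, p)
p_ten = binomial_pmf(10, n, p)
# "2 or more" is the complement of "0 or 1".
p_two_or_more = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)
p_two_or_fewer = sum(binomial_pmf(x, n, p) for x in range(3))

mean = n * p                           # np
sd = math.sqrt(n * p * (1 - p))        # square root of npq

print("P(X = 0)   =", round(p_zero, 6))
print("P(X = 10)  =", p_ten)
print("P(X >= 2)  =", round(p_two_or_more, 6))
print("P(X <= 2)  =", round(p_two_or_fewer, 6))
print("mean =", mean, " sd =", round(sd, 4))
```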

The Poisson Distribution


Like the binomial, the Poisson distribution is a discrete distribution which is applicable when the outcome is the number of times an event occurs.
It gives the probability that an outcome occurs a specified number of times.
The Poisson distribution can be used to determine the probability of rare events, that is, when the number of trials is large and the probability of any one occurrence is small.
Therefore, the probability of exactly x occurrences is given by the formula:-

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

where λ (lambda) = np is the value of both the mean and the variance of the Poisson distribution, and e is the base of natural logarithms, approximately equal to 2.718.
The Poisson distribution is positively skewed. The skew becomes more pronounced as λ becomes smaller.
Exercise
Suppose we are interested in the number of people who visit the clinic in city "X" in a given year, out of a total population of 5,000,000, and let the probability that any one person in the city visits the clinic be 0.00001. The mean number of visitors is then np = 5,000,000 × 0.00001 = 50, which is also the variance.
For this example calculate:-
a. The probability that no one in this population visits the clinic in a given year
b. The probability that fewer than 5 people visit the clinic in a given year
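A minimal sketch of the two Poisson calculations in this exercise, applying the formula from the notes with λ = np = 50. The helper name `poisson_pmf` is an assumption made for the example.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


lam = 5_000_000 * 0.00001      # lambda = np, about 50

p_none = poisson_pmf(0, lam)
# "Fewer than 5" means 0, 1, 2, 3 or 4 visits.
p_fewer_than_5 = sum(poisson_pmf(x, lam) for x in range(5))

print("lambda =", round(lam, 6))
print("P(X = 0) =", p_none)
print("P(X < 5) =", p_fewer_than_5)
```

Both probabilities come out extremely small, which is what one expects when the mean number of visits is 50.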

The Normal Distribution
It is a special distribution for a continuous random variable that we will use just about every day. It has the following properties:
It can take on any value (not just integers, as do the binomial and Poisson distributions)
It is symmetric about the mean, μ
The standard deviation of the distribution is symbolized by σ, which is the horizontal distance between the mean and the point of inflection on the curve.
It approaches the horizontal axis on both the left and right side without touching it, that is, the x-axis is an asymptote.
It is bell shaped, with transition (inflection) points one standard deviation from the mean.
Approximately 68% of the data points lie within one standard deviation of the mean.
Approximately 95% of the data points lie within two standard deviations of the mean.
Approximately 99.7% of the data points lie within three standard deviations of the mean.
The area under the curve is equal to 1
Since it is a symmetrical distribution, half the area is on the left of the mean and half is on the right
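Given a random variable x that can take on any value between negative and positive infinity (-∞ to +∞), the normal distribution formula is as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

where e is the base of natural logarithms (approximately 2.718) and π ≈ 3.1416.

[Figure: normal curve with the x-axis marked at μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ and μ+3σ, showing 68%, 95% and 99.7% of the area within one, two and three standard deviations of the mean]
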
The Standard Normal Distribution
The standard normal distribution is a normal distribution with mean 0 and standard deviation 1.
It is perfectly symmetric about 0.
If a distribution is normal but not standard, we can convert a value to a standard normal Z score by finding how many standard deviations the value is away from the mean.

The number of standard deviations from the mean is called the z-score and can
be found by the formula :-
z = (x-μ)/σ
The Z Score and Area
Often we want to find the probability that a z-score will be less than a given
value, greater than a given value, or in between two values. To accomplish this,
we use the Table from the textbook and a few properties about the normal
distribution.
The shaded regions in the pictures below correspond to the following areas:-
1. To the right of (above) the z-score 2.30: the area is 0.011 from the table, hence P (z > 2.30) = 0.011
2. To the left of (below) the z-score 2.30: the area is 0.989 from the table, hence P (z < 2.30) = 0.989
3. Between the z-scores -2.30 and 2.30: the area is 0.979 from the table, hence P (-2.30 < z < 2.30) = 0.979

[Figure: three standard normal curves with shaded areas: to the right of 2.3 (shaded area = 0.011), to the left of 2.3 (shaded area = 0.989), and between -2.3 and 2.3 (shaded area = 0.979)]

Exercise
Assuming systolic blood pressure (SBP) in normal healthy individuals is normally distributed with µ = 120 and σ = 10 mm Hg
a. What area of the curve is above 150 mm Hg?
b. What area of the curve is between 100 and 140 mm Hg?
c. What area of the curve is either below 90 mm Hg or above 150 mm Hg?
d. What is the value of the systolic blood pressure that divides the area under the curve into the lower 95% and the upper 5%?
e. What is the value of the systolic blood pressure that divides the area under the curve into the lower 97.5% and the upper 2.5%?
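Instead of a printed z table, the areas in this exercise can be obtained with `NormalDist` from Python's standard `statistics` module. The sketch below is one way to do it; the variable names are assumptions made for the example, and answers read from a printed table may differ in the last decimal place.

```python
from statistics import NormalDist

# SBP assumed normal with mean 120 and standard deviation 10, as in the exercise.
sbp = NormalDist(mu=120, sigma=10)

above_150 = 1 - sbp.cdf(150)                    # a. area above 150
between_100_140 = sbp.cdf(140) - sbp.cdf(100)   # b. area between 100 and 140
tails = sbp.cdf(90) + (1 - sbp.cdf(150))        # c. below 90 or above 150
cut_95 = sbp.inv_cdf(0.95)                      # d. value with 95% of the area below it
cut_975 = sbp.inv_cdf(0.975)                    # e. value with 97.5% of the area below it

# The same z-score formula as in the notes: z = (x - mu) / sigma
z_150 = (150 - sbp.mean) / sbp.stdev

print("z for 150 mm Hg:", z_150)
print("a.", round(above_150, 5))
print("b.", round(between_100_140, 4))
print("c.", round(tails, 5))
print("d.", round(cut_95, 2), "mm Hg")
print("e.", round(cut_975, 2), "mm Hg")
```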

Chi square distribution
The chi-square test is used to compare frequencies or proportions in two or more groups, especially to test their independence.
The logic of the chi-square test is as follows:-
The total number of observations in each column and the total number of observations in each row are considered to be given or fixed (marginal frequencies).
After assuming that the columns and rows are independent, we can calculate the number of observations expected to occur by chance (expected frequencies).
The expected frequency can be found by multiplying the column total by the row total and dividing by the grand total, i.e.
Expected Frequency = (Row total × Column total)/Grand total
The chi-square test compares the observed frequency in each cell with the expected frequency.
If no relationship exists between the column and row variables, then the observed frequencies will be very close to the expected frequencies; they will differ only by small amounts.
In this instance, the value of the chi-square statistic will be small.
On the other hand, if a relationship (dependency) does occur, then
The observed frequencies will vary quite a bit from the expected frequencies,
and
The value of the chi-square statistic will be large.
So chi-square is given as:-
$\chi^2_{(d.f)} = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$
Where d.f = degrees of freedom = (r-1)(c-1), r and c are the number of rows and columns, respectively, and k is the number of cells in the table
O= Observed frequency
E= Expected frequency
Here the hypothesis to be tested is
H0: The two variables are independent
H1: The two variables are dependent
Reject H0 when the calculated value of chi-square is greater than the tabulated chi-square (obtained from the chi-square table at a given significance level, such as 5%)
Exercise
The following table summarizes students' knowledge of statistical computer software and their Bio-statistics exam performance.
Calculate chi-square and draw a conclusion about the independence of the two variables (tabulated χ² with 1 d.f at the 5% level = 3.841)

Software      Bio-Statistics performance
knowledge     Good    Bad    Total
Yes           70      5      75
No            10      15     25
Total         80      20     100
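The calculation for a table like the one above can be sketched in a few lines. This is an illustrative sketch only: the function name chi_square is an assumption made here, and the expected frequencies are computed as (row total × column total)/grand total, as described earlier.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency under independence
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Observed counts from the software knowledge table (rows: Yes, No; columns: Good, Bad)
observed = [[70, 5], [10, 15]]
chi2, df = chi_square(observed)
reject_h0 = chi2 > 3.841  # tabulated value with 1 d.f at the 5% level

print(round(chi2, 2), df, reject_h0)
```

If the calculated value exceeds the tabulated one, H0 (independence) is rejected at the 5% level.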
Exercise
For the given HIV/AIDS related data, calculate the chi-square and draw a conclusion about the independence of HIV Result from Education, Place of Residence and Age (15-19, 20-24, 25-29, >30)

Student t-distribution
The t-distribution is similar in shape to the z-distribution, and one of its major uses is to answer research questions about means.
The t-distribution is symmetric and has a mean of 0, but its standard deviation is larger than 1.
The precise size of the standard deviation depends on the degrees of freedom (d.f), which are determined by the sample size (d.f = n-1 for a single sample).
Because the t-distribution has a larger standard deviation, it is wider and its tails are higher than those of the z-distribution.
As the sample size increases, the degrees of freedom also increase, and the t-distribution becomes almost the same as the standard normal distribution.
When the sample size is 30 or more, the t-distribution and z-distribution curves become so close that either the t or the z distribution can be used.
So, the t statistic is given by:-
t = (x̄ – μ)/(s/√n)

Where
x̄ = sample mean
μ = population mean
s = sample standard deviation
n = sample size

Here the hypothesis to be tested is
H0: Population means among groups are equal
H1: Population means among groups are not equal
Reject H0 when the absolute value of the calculated t is greater than the tabulated t-value (obtained from the t-distribution table with n-1 d.f at the 5% level)
Exercise
1. The following table shows the weights of nine female students. Using a t-test, draw a conclusion about whether the mean weight of the selected female students equals the female population mean weight of 55 kg (tabulated t with 8 d.f at the 5% level = 2.306)
2. From the previous HIV/AIDS data, test the hypothesis stated below about the age of respondents and draw a conclusion (tabulated t with >120 d.f at the 5% level = 1.96)
H0: µ = 29
H1: µ ≠ 29
Female Student weight in K.g

55
50
50
50
55
50
60
55
50
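
A one-sample t statistic for the nine weights above could be computed as in the following sketch. It assumes the two-sided test H0: μ = 55 against H1: μ ≠ 55 and uses the tabulated value 2.306 quoted in the exercise; the variable names are illustrative.

```python
import math
import statistics

weights = [55, 50, 50, 50, 55, 50, 60, 55, 50]  # female student weights in kg
mu0 = 55                                         # hypothesized population mean

n = len(weights)
mean = statistics.mean(weights)
s = statistics.stdev(weights)            # sample standard deviation (n-1 divisor)
t = (mean - mu0) / (s / math.sqrt(n))    # t = (x̄ - μ)/(s/√n)
reject_h0 = abs(t) > 2.306               # tabulated t, 8 d.f, 5% level

print(round(mean, 2), round(s, 2), round(t, 3), reject_h0)
```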
3. The following table shows the ages of nine male and nine female individuals. Using a t-test, draw a conclusion about the equality of the mean ages of the selected male and female individuals (independent t-test).
(tabulated t with 16 d.f at the 5% level = 2.12)
Note:- Here S.E(x̄1 - x̄2) = √(s1²/n1 + s2²/n2) and t has n1+n2-2 d.f

Age of male    Age of female
26             40
22             17
18             15
38             44
18             16
15             20
27             28
17             48
52             37
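
The independent t-test for these two columns can be sketched as follows. The sketch uses the standard error given in the note, √(s1²/n1 + s2²/n2), with n1+n2-2 degrees of freedom; variable names are illustrative assumptions.

```python
import math
import statistics

male = [26, 22, 18, 38, 18, 15, 27, 17, 52]
female = [40, 17, 15, 44, 16, 20, 28, 48, 37]

n1, n2 = len(male), len(female)
# Standard error of the difference between the two sample means
se = math.sqrt(statistics.variance(male) / n1 + statistics.variance(female) / n2)
t = (statistics.mean(male) - statistics.mean(female)) / se
df = n1 + n2 - 2
reject_h0 = abs(t) > 2.12   # tabulated t, 16 d.f, 5% level

print(round(se, 3), round(t, 3), df, reject_h0)
```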

4. The following table shows repeated SBP measurements for nine individuals. Using a t-test, draw a conclusion about the difference between the two measurements (paired t-test).
(tabulated t with 8 d.f at the 5% level = 2.306)
Note:- Here you can use a one-sample t-test on the differences (di), comparing their mean to zero, with n-1 d.f.

SBP1    SBP2
120     120
125     120
130     135
140     140
125     120
130     130
120     120
140     140
125     135
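
Following the note, the paired test reduces to a one-sample t-test on the differences. A minimal sketch, assuming the differences are taken as SBP1 - SBP2 (the sign convention is a choice made here):

```python
import math
import statistics

sbp1 = [120, 125, 130, 140, 125, 130, 120, 140, 125]
sbp2 = [120, 120, 135, 140, 120, 130, 120, 140, 135]

diffs = [a - b for a, b in zip(sbp1, sbp2)]   # di = SBP1 - SBP2
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
t = (mean_d - 0) / (sd_d / math.sqrt(n))      # compare the mean difference with zero
reject_h0 = abs(t) > 2.306                    # tabulated t, 8 d.f, 5% level

print(diffs, round(t, 3), reject_h0)
```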

Confidence interval for single mean
Confidence interval is a widely used tool to describe a population based on sample
data.
The idea here is to obtain an interval, based on sample statistics, that we can be
confident contains the population parameter of interest.
Applying the properties of the sampling distribution to the results of a single
sample leads us to the concept of confidence intervals.
This is an interval around the estimated mean which we can be confident contains
the true population mean.
A confidence interval extends either side of the sample mean by a multiple of the
standard error.
It is most common to calculate a 95% confidence interval; this extends 1.96 SE
either side of the mean.
Thus, a 95% confidence interval for a single population mean (μ) is calculated as follows:
$\bar{X} \pm 1.96\,[S.E(\bar{X})]$
Where
$\bar{X}$ is the estimated mean
1.96 is the value of Z for 95% confidence
$S.E(\bar{X})$ is the standard error of the estimate, s/√n

There is no reason to focus only on 95% confidence intervals. Sometimes we may wish to use other confidence levels, such as 90% or 99%.
For a 90% and a 99% confidence interval, the value 1.96 in the formula above becomes 1.65 and 2.58, respectively.
Exercise
Calculate and interpret 90% and 95% confidence intervals for the population mean height from a sample of 150 students with sample mean height (X̄) = 169.6 cm and standard error SE(X̄) = 0.70 cm
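As a rough sketch of the arithmetic, an interval of the form estimate ± z × S.E can be computed with a small helper. The function name confidence_interval is an assumption made for illustration; the multipliers 1.96 and 1.65 are the ones quoted in the text.

```python
def confidence_interval(estimate, standard_error, z):
    """Return (lower, upper) limits of estimate ± z * standard_error."""
    margin = z * standard_error
    return estimate - margin, estimate + margin

mean_height = 169.6   # cm
se_height = 0.70      # cm

ci_95 = confidence_interval(mean_height, se_height, 1.96)
ci_90 = confidence_interval(mean_height, se_height, 1.65)

print([round(v, 3) for v in ci_95], [round(v, 3) for v in ci_90])
```

The 90% interval is narrower than the 95% interval because a smaller multiplier is used.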

Confidence interval for two mean difference
Testing a hypothesis that a parameter equals some specified value (such as μ1 - μ2 = 0) can be done by determining whether or not 0 falls in the interval.
Therefore, similar to the confidence interval for a single mean, a 95% confidence interval for a population mean difference (μ1 - μ2) is calculated as follows:
$(\bar{X}_1 - \bar{X}_2) \pm 1.96\,[S.E(\bar{X}_1 - \bar{X}_2)]$
Where
$(\bar{X}_1 - \bar{X}_2)$ is the estimated mean difference
1.96 is the value of Z for 95% confidence
$S.E(\bar{X}_1 - \bar{X}_2)$ is the standard error of the estimate, which is given as:-
$S.E(\bar{X}_1 - \bar{X}_2) = \sqrt{(S_1^2/n_1) + (S_2^2/n_2)}$
Exercise
The following data show systolic blood pressure measurements in two groups of a population, separated by whether or not they received appropriate treatment. Based on this information, find the 95% confidence interval for the difference between the population mean SBP of the treated and untreated groups and interpret the result

                      With treatment    Without treatment
Mean                  120 mmHg          140 mmHg
Standard deviation    10                15
Sample size           144               144
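
The interval for the difference between two means can be sketched as below, using the summary values from the table and the standard error √(S1²/n1 + S2²/n2). Variable names are illustrative, and the difference is taken as treated minus untreated.

```python
import math

mean_treated, sd_treated, n_treated = 120.0, 10.0, 144
mean_untreated, sd_untreated, n_untreated = 140.0, 15.0, 144

diff = mean_treated - mean_untreated
se_diff = math.sqrt(sd_treated**2 / n_treated + sd_untreated**2 / n_untreated)
lower = diff - 1.96 * se_diff
upper = diff + 1.96 * se_diff
zero_in_interval = lower <= 0 <= upper   # if False, the means differ at the 5% level

print(round(se_diff, 3), round(lower, 2), round(upper, 2), zero_in_interval)
```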

Confidence interval for single proportion
The 95% confidence interval for a single population proportion (P) is calculated as follows:
p ± 1.96 (S.E(p))
Where
p is the estimated proportion
1.96 is the multiplier
S.E(p) is the standard error of the estimate = √(pq/n), with q = 1-p
Exercise
Suppose that it is known that in a certain population of women, 90% entering
their 3rd trimester of pregnancy have had some prenatal care. A random sample
of size 200 is drawn from the population in an informal settlement and it is
found that 170 have had prenatal care at the beginning of their third trimester.
From this, find the 95% confidence interval for the proportion of women in the informal settlement who have had some form of prenatal care by the third trimester and interpret the result.
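A minimal sketch of this calculation, using the sample counts from the exercise (170 out of 200) and the standard error √(pq/n); the variable names are illustrative.

```python
import math

with_care, n = 170, 200
p = with_care / n                 # estimated proportion
q = 1 - p
se_p = math.sqrt(p * q / n)       # standard error of the proportion
lower = p - 1.96 * se_p
upper = p + 1.96 * se_p

print(round(p, 3), round(se_p, 4), round(lower, 4), round(upper, 4))
```

The resulting interval can then be compared with the known population value of 90%.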

Confidence interval for two population proportions difference
Similarly, the 95% confidence interval for the difference between two population proportions (P1 – P2) is calculated as follows:
(p1-p2) ± 1.96 (S.E(p1-p2))
Where
p1 and p2 are the estimated proportions
1.96 is the multiplier
S.E(p1-p2) is the standard error of the estimate = √[(p1q1/n1) + (p2q2/n2)]
Example
TB patients were randomized to receive either medicine and bed rest or simply
bed rest. The outcome of interest will be whether or not the patient showed
improvement. Let p1 be the proportion of all TB patients who, if given medicine,
would show improvement at 6 months. Further we will define p2 as a similar
proportion among patients receiving only bed rest. Based on the information
given in the following table calculate the 95% confidence interval for population
proportion difference and interpret your result.

                         Improvement condition
Treatment group          Yes    No    Total
Medicine and bed rest    30     20    50
Bed rest only            15     35    50
Total                    45     55    100
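
The interval for the TB example can be sketched as follows, taking p1 and p2 from the table (30/50 and 15/50) and the standard error √(p1q1/n1 + p2q2/n2); the names used are illustrative.

```python
import math

improved_medicine, n1 = 30, 50   # medicine and bed rest
improved_rest, n2 = 15, 50       # bed rest only

p1 = improved_medicine / n1
p2 = improved_rest / n2
se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower = (p1 - p2) - 1.96 * se_diff
upper = (p1 - p2) + 1.96 * se_diff

print(round(p1 - p2, 2), round(lower, 4), round(upper, 4))
```

If the interval excludes 0, the two improvement proportions differ at the 5% level.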

Note
In general, the width of a confidence interval depends on:
The confidence level (1-α):- As (1-α) increases, so does the width of the interval. If we want to increase the confidence we have that the interval contains the parameter, we must increase the width of the interval.
The sample size:- The larger the sample size, the smaller the standard error of the estimator, and thus the narrower the interval.
The standard deviation of the underlying distributions:- If the standard deviations are large, then the standard error of the estimator will also be large, and the interval will be wider.

Hypothesis Testing
Whenever we have a decision to make about a population characteristic, we make a hypothesis.
Suppose that we want to test the hypothesis that μ≠120 mmHg.
Then we can think of our opponent suggesting that μ=120 mmHg.
We call the opponent's hypothesis the null hypothesis and write:
H0: μ = 120 mmHg
We call our own hypothesis the alternative hypothesis and write:
H1: μ ≠ 120 mmHg
For the null hypothesis we always use equality, since we are comparing μ with a previously determined mean.
For the alternative hypothesis, we have the choices: < , > , or ≠ .
Procedures in Hypothesis Testing
When we test a hypothesis we proceed as follows:
Formulate the null and alternative hypothesis.
Choose a level of significance.
Determine the sample size.
Collect data.
Calculate the z (or t) or chi-square score.
Use the table to determine whether the calculated score falls within the acceptance region.
Decide to:-
a.Reject the null hypothesis and therefore accept the alternative hypothesis or
b.Fail to reject the null hypothesis and therefore state that there is not enough
evidence to suggest the truth of the alternative hypothesis.
Errors in Hypothesis Tests
We define a type I error as the event of rejecting the null hypothesis when the
null hypothesis was true. The probability of a type I error (α) is called the
significance level.
We define a type II error (with probability ß) as the event of failing to reject the
null hypothesis when the null hypothesis was false.
Note: For a fixed sample size, a larger α results in a smaller ß, and a smaller α results in a larger ß.

                  Null Hypothesis
Action            True                False
Fail to Reject    Correct             Type II Error (ß)
Reject            Type I Error (α)    Correct
Rejection Regions
Suppose that α = .05. We can draw the appropriate picture and find the z-scores that cut off an area of .025 in each tail (-1.96 and 1.96). We call the outside regions the rejection regions.
We call the blue (tail) areas the rejection region since, if the value of z falls in these regions, the observed result would be very unlikely if the null hypothesis were true, so we can reject the null hypothesis.
Note:- Here our test statistic is Z = (X̄ - μ)/(σ/√n) [TEST STATISTIC]
At the 5% level, reject the null hypothesis H0 if |Z| > 1.96

Example
In the study of the piglets being fed the supplemented diet, we know that the mean weekly weight gain for our sample is 311.9 grams, and that this is based on 16 observations. We also know that we have assumed the population standard deviation, σ, to be 120 grams and that we want to test μ=200 grams.
Solution
Here
H0: μ = 200 gm
H1: μ ≠ 200 gm
Thus, Z = (X̄ - μ)/(σ/√n) becomes
Z = (311.9 gm – 200 gm)/(120 gm/√16) = 111.9/30 = 3.73
Now we know that 95% of the values from a standardized normal distribution can be expected to lie between -1.96 and 1.96, which implies that values above 1.96 or below -1.96 occur in only 5% of samples drawn from a standardized normal distribution. Since the value we have calculated as the test statistic lies above 1.96, we would have less than a 5% chance of getting a value this extreme by chance alone if the null hypothesis were true.
Thus we reject the null hypothesis, and accept the alternative hypothesis, at the 5% level.
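The piglet example can be checked with a short sketch. It assumes the sample mean of 311.9 grams stated in the example and a two-sided test at the 5% level; the function name z_statistic is illustrative.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = z_statistic(sample_mean=311.9, mu0=200, sigma=120, n=16)
reject_h0 = abs(z) > 1.96   # two-sided test at the 5% level

print(round(z, 2), reject_h0)
```

With these inputs the standard error is 120/4 = 30 grams, so the statistic evaluates to about 3.73.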
One sided versus two sided tests
The statement of the alternative hypothesis in the above example was “not equal to (≠)”, that is, either higher than or lower than. This is called a two sided test.
If we were only interested in testing whether this diet would give a greater weight gain, we would have used a one sided test, implying
Alternative hypothesis H1: μ > 200 gm
For a one sided test, we would not use the same cutoff as that of a two sided test.
For a one sided test, we are only interested in a critical value or cutoff above which 5% of the distribution lies (and below which 95% of the distribution lies); for the standard normal distribution this value is 1.645.

Sample size determination and sampling methods
Sample size calculation for single mean
The sample mean is used to estimate the population mean, and the confidence interval is used to determine how big or small the population mean is likely to be.
An estimate of the variance, s², is required before the sample size can be calculated. This may be obtained from the literature, previous studies or a pilot study.
Therefore, for a 95% C.I.,
n = (1.96 s/d)²
where d is the desired precision (the half-width of the confidence interval).
Example
If we want to estimate the mean SBP of Rwandan males, the standard deviation is around 20 mmHg, and we wish to estimate the true mean to within 10 mmHg with 95% confidence, what sample size is required?

Solution
We are given s=20, d=10 and z=1.96
Therefore n = [(1.96*20)/10]² = 15.37, which rounds up to 16
Suppose the response rate is 80%; then we will need to sample 16/0.8 = 20 males
If we decide to obtain a more precise estimate, for example d=5 mmHg, we require n = [(1.96*20)/5]² = 61.47, which rounds up to 62, and 62/0.8 = 77.5, which rounds up to 78, with an 80% response rate
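The steps of this solution can be sketched in code. The helper names below are assumptions made for illustration, and the response rate is passed as a whole-number percentage so that the division stays exact for the values used here.

```python
import math

def sample_size_mean(s, d, z=1.96):
    """n = (z*s/d)^2, rounded up to the next whole subject."""
    return math.ceil((z * s / d) ** 2)

def inflate_for_response(n, response_percent):
    """Number to approach so that about n respond, rounded up."""
    return math.ceil(n * 100 / response_percent)

n_d10 = sample_size_mean(s=20, d=10)
n_d5 = sample_size_mean(s=20, d=5)

print(n_d10, inflate_for_response(n_d10, 80))
print(n_d5, inflate_for_response(n_d5, 80))
```

Halving d from 10 to 5 mmHg roughly quadruples the required sample size, because d enters the formula squared.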
Sample size calculation for single proportion
Research questions such as “What proportion are smokers?”, “What is the prevalence of HSV-2 in a rural area?”, or “What is the sensitivity (or specificity) of a particular diagnostic test for disease x?” lead to the estimation of a proportion.
To determine how big or how small the population proportion is likely to be, a confidence interval is calculated, and the sample size for 95% confidence is given by
n = (1.96/d)² p(1-p)

Thus, to determine the sample size required to estimate the proportion with the desired level of precision, some idea is required beforehand about the possible magnitude of the proportion. If there is insufficient information to know this, then the value of 0.5 can be used, as this will give the largest possible sample size that would be required.
Example
We wish to estimate the proportion of males who smoke in a given country. What sample size do we require to achieve a 95% confidence interval of width ± 5% (that is, to be within 5% of the true value)? A study some years ago found that approximately 30% were smokers.
Solution
We take p=0.30, d=0.05 and z=1.96
n = (1.96/0.05)² × 0.3(1-0.3) = 322.69, rounded up to 323 men
If we anticipate a 75% response rate then we need to sample 323/0.75 = 431 men
If we had no idea what the prevalence of smoking is likely to be, we would use p=0.50 to give n = (1.96/0.05)² × 0.5(1-0.5) = 384.16
So we need 385 men at the analysis stage and 385/0.75 = 514 to be sampled.
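A short sketch of the same arithmetic, assuming the formula n = (1.96/d)² p(1-p) and rounding up at each step; the function name sample_size_proportion is illustrative.

```python
import math

def sample_size_proportion(p, d, z=1.96):
    """n = (z/d)^2 * p * (1 - p), rounded up to the next whole subject."""
    return math.ceil((z / d) ** 2 * p * (1 - p))

n_p30 = sample_size_proportion(p=0.30, d=0.05)
n_p50 = sample_size_proportion(p=0.50, d=0.05)   # most conservative choice of p

# Allow for an anticipated 75% response rate
to_sample_p30 = math.ceil(n_p30 * 100 / 75)
to_sample_p50 = math.ceil(n_p50 * 100 / 75)

print(n_p30, to_sample_p30, n_p50, to_sample_p50)
```

Because p(1-p) is largest at p = 0.5, that choice gives the largest sample size for a given d.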

Sample size calculation for comparison of two independent means
The number per group to detect a difference in means of size d with power (1 - β) at significance level α is given by
n = (2s²/d²) × (Zα + Zβ)²
Where Zα is the value of the standard normal distribution cutting off probability α in one tail for a one sided alternative, or α/2 in each tail for a two sided alternative, and Zβ is the value of the standard normal distribution cutting off probability β in the upper (right hand) tail. Commonly used values are Zα=1.96 for α=0.05 (two tailed) and Zβ=0.84 for 80% power or Zβ=1.28 for 90% power.
Example
Suppose you wish to carry out a study comparing serum catecholamine levels in normotensive patients and patients with essential hypertension. Previous studies have found mean serum catecholamine levels of 0.218 mg/ml (sd=0.14) in normotensives. If the clinically important difference to be detected in catecholamine levels in hypertensive patients is an increase of 0.1 mg/ml, how many subjects would you sample?

Solution
Mean1=0.218, mean2=0.318, sd1=0.14, sd2=0.14, Zβ=1.28 (90% power), Zα=1.96 (α=0.05, two sided)
n = (2s²/d²) × (Zα + Zβ)² = 2(0.14/0.1)² × (1.96+1.28)² = 41.15, rounded up to 42 per group
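The per-group calculation can be sketched as follows, assuming a common standard deviation in both groups and the Z values used in the solution (1.96 and 1.28); the function name is illustrative.

```python
import math

def sample_size_two_means(s, d, z_alpha=1.96, z_beta=1.28):
    """Per-group n = 2 * (s/d)^2 * (z_alpha + z_beta)^2, rounded up."""
    return math.ceil(2 * (s / d) ** 2 * (z_alpha + z_beta) ** 2)

n_per_group = sample_size_two_means(s=0.14, d=0.1)

print(n_per_group)
```

Passing z_beta=0.84 instead would give the per-group size for 80% power.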
Sample size calculation for comparison of two independent proportions
In Pocock’s formula for calculating the sample size required to compare two proportions, the number per group to detect a difference between two proportions p1 and p2, with power (1 - β) and significance level α, is given by
n = {[p1(1-p1) + p2(1-p2)]/(p1 – p2)²} × (Zα + Zβ)²
Note that to apply this formula, we need to know the expected proportion of individuals in one group who will have the outcome of interest. Usually for a randomized controlled trial the proportion for the control group is known (say p2), the size of the difference that is important to detect (d) is decided, and then the proportion in the other group is calculated as p1 = p2+d. For a case control study, information about the proportion exposed to the factor of interest is obtained for the control group (say p2); this can be approximated by the proportion exposed in the population. The difference d is specified and the proportion exposed among the cases is calculated as p1 = p2+d

Example
A new polio vaccine is thought to decrease polio cases. A decrease of 33% in a population with approximately 30% polio prevalence rate is considered clinically and economically significant. How many treatment and placebo patients are required to detect this difference at the α=0.05 (two sided) level with 80% power?
Solution
p1=0.30 and p2=0.30 – 33% of 0.30=0.20
Zα=1.96 and Zβ=0.84
Therefore n = (0.3*0.7+0.2*0.8)/(0.3 – 0.2)² × (1.96+0.84)² = 290.1
This implies that at least 291 treatment (vaccine) patients and 291 placebo patients are required
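A minimal sketch of the formula for two proportions, using the values from the polio example; the function name sample_size_two_proportions is an assumption made for illustration.

```python
import math

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group n = [p1(1-p1) + p2(1-p2)] / (p1-p2)^2 * (z_alpha + z_beta)^2, rounded up."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(variance_sum / (p1 - p2) ** 2 * (z_alpha + z_beta) ** 2)

n_per_group = sample_size_two_proportions(p1=0.30, p2=0.20)

print(n_per_group)
```

The required size grows quickly as the difference p1 - p2 shrinks, since that difference is squared in the denominator.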

Exercise
In a trial of a promising new HIV vaccine, the vaccine is considered effective if the proportion of HIV infected in the vaccine arm is 15% compared to 25% in the control arm. What sample size is required for a 5% level of significance and 90% power?

ANS: 331

Sampling Methods
Population is the total set of individuals in which we are interested.
Sample is a subset of the individuals, selected in a prescribed manner for study.
Reasons for sampling
Samples can be studied more quickly than populations.
A study of a sample is less expensive than a study of an entire population, because a smaller number of items or subjects are examined.
A study of an entire population (census) is impossible in most situations.
Sample results are often more accurate than results based on a population.
If samples are properly selected, probability methods can be used to make estimates about the population.
Methods of sampling
1. Probability sampling
The best way to ensure that a sample will lead to reliable and valid inferences is to use probability samples, in which the probability of being included in the sample is known for each subject in the population.
The four commonly used probability-sampling methods are simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

I. A simple random sample (lottery method) is one in which every subject has an equal probability of being selected for the study.
• The recommended way to select a simple random sample is to use a table of random numbers or a computer-generated list of random numbers.
II. A systematic random sample is one in which every kth item is selected;
• k is determined by dividing the total number of items in the sampling frame by the desired sample size.
• Example. If a researcher wants to take only 200 students as a sample from a total of 3400 students using a systematic random sample, 3400 divided by 200 is 17, so every 17th student is sampled. In this approach we must first select a number randomly between 1 and 17, and we then select every 17th student. Suppose we randomly select the number 12 from a random number table. Then the systematic sample consists of students with ID numbers 12, 29, 46, 63, 80, and so on; each subsequent number is determined by adding 17 to the last ID number.
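The selection rule in the example can be sketched as below. The function name is illustrative, and the starting number is fixed at 12 to match the example; in practice it would be drawn at random between 1 and k (for instance with random.randint(1, k)).

```python
def systematic_sample(population_size, sample_size, start):
    """IDs selected by taking every kth unit, beginning at `start` (1-based IDs)."""
    k = population_size // sample_size   # sampling interval
    return list(range(start, population_size + 1, k))[:sample_size]

ids = systematic_sample(population_size=3400, sample_size=200, start=12)

print(ids[:5], len(ids))
```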

Exercise
For a given study, select 10 students from a total of 59 students using systematic random sampling
III. A stratified random sample is one in which the population is first divided into relevant strata (subgroups), and a random sample is then selected from each stratum proportionally. Characteristics used to stratify should be related to the measurement of interest, in which case stratified random sampling is the most efficient (the strata differ in their characteristics)
IV. A cluster random sample results from a two-stage process in which the population is divided into clusters and a subset of the clusters is randomly selected. Clusters are commonly based on geographic areas or districts, so this approach is used more often in epidemiologic research than in clinical studies (the clusters have similar characteristics)
Exercise
To study the prevalence of vivax malaria, discuss the application of the stratified and cluster random sampling methods in two districts, the first of which has three agro-ecological zones (hot, medium and cold) and the second of which has only one agro-ecological zone, which is hot.

2. Non-probability sampling
Non-probability samples are those in which the probability that a subject is selected is unknown.
Non-probability samples often reflect selection biases of the person doing the study and do not fulfill the requirements of randomness needed to estimate sampling errors. Examples: convenience samples or quota samples.
An example of convenience sampling is taking 10 oranges from a big basket containing perhaps 1000.
An example of quota sampling is selecting clinics from different provinces of the country in proportion to the number of clinics around the hospital.

Scatter Diagram, Regression Line and Simple linear regression

Scatter Diagram
If data is given in pairs then the scatter diagram of the data is just the points
plotted on the xy-plane.
The scatter plot is used to visually identify relationships between the first and
the second entries of paired data.
The scatter plot below represents the age vs. size of a plant.
It is clear from the scatter plot that as the plant ages, its size tends to increase.
If it seems to be the case that the points follow a linear pattern well, then we say
that there is a high linear correlation, while if it seems that the data do not follow
a linear pattern, we say that there is no linear correlation.
If the data somewhat follow a linear path, then we say that there is a moderate
linear correlation.

Simple linear Regression
Linear regression is applicable when one has collected data on two or more
variables and wants to quantify a relationship between the Response
(dependent) and Predictor (independent) variables.
Therefore, regression is used:-
To predict the value of one variable from the other variables
To examine the actual relationship between variables
To determine trends in data
The simplest form of regression is simple linear regression, where one has one
response and one predictor variable. Example:-
Predicting weight from height
Relating blood sugar content to daily amount of sugar consumption
Given a scatter plot, we can draw the line that best fits the data
• To find the equation of a line, we need the slope and the y-intercept. We will
write the equation of the line as
y = a + bx
Where a is the y-intercept and b is the slope.
x is the independent or predictor variable and
y is the dependent or response variable.
Least square estimation
Least squares can be interpreted as a method of fitting data.
The best fit in the least-squares sense is that instance of the model for which the sum of squared residuals has its least value, a residual being the difference between an observed value and the value given by the model. Let Yi=α+ βXi+Єi
The least squares method determines the line that minimizes the sum of squared vertical differences between the actual values yi and the predicted values y’=a+bxi of the outcome variable, so that Σ(y-y’)² is minimized. This is done by taking partial derivatives of the sum with respect to α and β.
The two resulting equations are set equal to zero to locate the minimum; these two equations in two unknowns are solved simultaneously to obtain the formulas for α and β.
Then the formulas for α and β in terms of the sample estimates a and b will be:-
$b = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n} (x_i - \bar{x})^2$
$= \left[ n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \right] / \left[ n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 \right]$
$= \left[ \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} \right] / \left[ \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right]$
$= \{ \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i)/n \} / \{ \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2/n \}$
$a = \bar{y} - b \bar{x}$
Therefore, to find a and b we follow these steps:
List (compute) the following, then substitute them into the formulas for b and a:
The sum of the x = Σx
The sum of the y = Σy
The sum of the squares of x = Σx²
The sum of the products of x and y = Σxy
Interpretation
We can interpret a as the value of y when x is zero and
we can interpret b as the amount that y increases when x increases by
one.

Model Assumptions
For the simple linear regression model Yi=α+ βXi+Єi, the yi’s are the measurements on the dependent variable, the xi’s are the measurements on the independent or predictor variable, and α and β are the parameters in the linear regression model that we want to estimate. The Єi’s, or error terms, are used to make allowance for the random scatter about the line; in other words, we are allowing for the fact that there is variability in our sample. Most of the assumptions concern the error term Єi.
Assumptions
i. The Єi are symmetric, which means the deviations from the model are equally likely to be positive or negative
ii. The error for any particular case is not related to the value of the predictor variable; this means that we are assuming that the variability of the errors is the same over the whole range of the regression, which we quantify by saying that the error variance is constant
iii. The error for one case is not affected by the error for another case, or, in other words, the errors are independent
iv. The distribution of the error terms is normal or symmetric
v. The dependent variable yi is a continuous variable
vi. The only assumption made about the independent variable is that you have data for it. It can be continuous or discrete

For the above simple linear regression model (Yi=α+ βXi+Єi) the following hypotheses can be tested:-
• H0: α=0
  H1: α≠0

• H0: β=0
  H1: β≠0

Example
Suppose that a study was done to determine the weight loss after taking various
amounts of a diet pill in combination with exercise. If the regression line was
y = 3 + 2x
where x denotes the grams of the pill per day and y represents the weight loss,
then
we can say that with only the exercise and no pill the average weight loss is 3
pounds.
We can also say that if a person takes an additional gram of the pill, then on average that person should expect to lose an additional 2 pounds.
If a person takes 5 grams then that person can expect to lose an average of 13 pounds.

Coefficient of determination (R²) and Correlation coefficient (r)
From the linear regression line, the residual is generally given by:- yi - y’ = yi - (a + bxi)
Coefficient of determination: R²
We define the coefficient of determination as an indication of how linear the data are. R² has the following properties:-
R² is between 0 and 1.
If R²=1 then all points lie on a line (perfectly linear).
If R²=0 then the regression line is a useless indicator for predicting y values.
To compute R², do the following:
$SS_{Total} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
$SS_{Residual} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
$SS_{Regression} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
where $\hat{Y}_i = a + bX_i$ is the value predicted by the regression line (written y’ above), and
SSTot = SSRes + SSReg

Then R² = SSReg/SSTot = 1 - SSRes/SSTot
If we multiply R² by 100%, we arrive at the percent of the observed variation attributable to the linear relationship.
Note
1. Although a good regression will give a high R², a high R² does not necessarily mean a good fit. Consider the plot below: its R² is high (98.8%), but the plot shows one point far away from the others. Such a point is often termed an influential point, because it has a great effect on the estimation of the regression line
2. A low R² does not necessarily mean that there is no relation. The relation could be very strong but non-linear, such as a semi-circular relation, which gives an R² of zero.

[Figure: scatter plot of Y against X in which one point lies far away from the others]
Correlation coefficient (r)
In many cases, more than one variable has been measured on each unit, such as an animal, plant or object.
So, if there are several variables of interest, one is frequently interested in the correlations between these variables.
Together with plots, this should form a starting point for any study of the joint effect of a number of variables.
In the previously discussed simple linear regression model, we may want to determine not just whether the variables are linearly related, but also whether the relationship is positive or negative (b > 0 or b < 0).
Therefore, one method of examining the relationship between two continuous variables (such as weight and height) is to look at the usual PEARSON’S CORRELATION COEFFICIENT. For the sample this is defined as:- rxy = Sxy/(SxSy)
where Sx and Sy are the standard deviations of x and y, and Sxy is the COVARIANCE between x and y, defined as:-
$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)$
Therefore rxy is given by
$r_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$
The correlation coefficient measures the strength of the linear relation between x and y.
Note that:- if the relationship between x and y is positive (direct), then when the x value is higher the y value is also very likely to be higher.
For a negative (indirect) relationship, one would find that when y is higher, x will be lower and vice versa.
Note that:- a correlation coefficient of zero does not necessarily imply that there is no relationship between the variables. A zero correlation coefficient would also be obtained if the plot of x versus y showed a semi-circle, which indicates a curved relationship between the variables.
Similarly, a correlation coefficient near one does not necessarily imply a near perfect linear relation: if there is one point in the data set that is far away from the rest of the points, the correlation coefficient may be near 1 or -1.
Note also that correlation does not necessarily imply causation.
Generally, in simple linear regression the square of the correlation coefficient (r²) is equal to the coefficient of determination (R²).
If r < 0 then they are negatively correlated.
If r > 0 then they are positively correlated.
We say that the correlation is
Strong if |r| > 0.8
Moderate if 0.5 < |r| < 0.8 and
Weak otherwise.

Exercise
For the following data
a. Plot a scatter diagram
b. By fitting a simple linear regression model, find the parameters of the model using the least squares method and write the equation of the model.
c. Predict the weight of an individual whose height is 174 cm
d. Calculate and interpret R² and r
e. Draw a conclusion about the goodness of fit of the fitted model

Weight in kg    Height in cm
(Dependent)     (Independent)
50              160
70              172
60              165
65              170
55              165
68              168
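
The least squares steps described earlier (Σx, Σy, Σx², Σxy, then b and a) can be sketched for this data set as follows. This is an illustrative sketch rather than a prescribed solution: the variable names are assumptions, and R² is computed as 1 - SSRes/SSTot.

```python
import math

heights = [160, 172, 165, 170, 165, 168]   # x, independent (cm)
weights = [50, 70, 60, 65, 55, 68]         # y, dependent (kg)

n = len(heights)
sum_x, sum_y = sum(heights), sum(weights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_x2 = sum(x * x for x in heights)

# Slope and intercept from the sums
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n

# Coefficient of determination and correlation coefficient
mean_y = sum_y / n
predicted = [a + b * x for x in heights]
ss_total = sum((y - mean_y) ** 2 for y in weights)
ss_resid = sum((y - p) ** 2 for y, p in zip(weights, predicted))
r_squared = 1 - ss_resid / ss_total
r = math.copysign(math.sqrt(r_squared), b)   # r takes the sign of the slope

weight_at_174 = a + b * 174

print(round(a, 2), round(b, 4), round(r_squared, 4), round(r, 4), round(weight_at_174, 2))
```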
Analysis of Variance for one factor (One way ANOVA)
ANOVA is used to compare the response to a number of levels of a factor, regarding each of the groups as arising from the application of some treatment.
Examples to be compared
Comparing hemoglobin levels of pregnant women being given three different treatments for anemia (say pills, diet and injections).
Comparing DDT resistance in 3 colonies of mosquitoes
The main purpose of the ANOVA is to test
H0: μ1= μ2=… =μk versus the alternative that at least two of the μi are different from each other.
The overall test for this is called Analysis of Variance (ANOVA). It compares the factor variability (i.e. variability between the groups or levels of the factor) to the random or error variability (i.e. variability within the groups or samples), to see if there is a significant difference.
In other words, we again look to see if the variability between the means for the treatment groups is greater than the overall random variability within the treatment groups.

98
In order to apply the test, we need to define the mean sum of squares for the
factor and for the error.
The mean sum of squares for factor A is denoted by

$$\mathrm{MSSA} = \frac{\sum_{i=1}^{k} n_i\,(\bar{X}_i - \bar{X})^2}{k-1}$$

where we assume that we have k groups with n_i cases in the ith group, the
mean of the ith group is X̄_i, and the overall mean is X̄.
This is essentially the variance of the factor-level means about the overall mean,
weighted by the group sizes.
Note:- Although it is best to have the same number of observations at each
factor level, in practice something often goes wrong, so that the samples are of
different sizes.
The mean sum of squares for the error, MSSE (also written MSE), is defined as

$$\mathrm{MSSE} = \frac{\sum_{i=1}^{k} (n_i - 1)\,S_i^2}{\sum_{i=1}^{k} (n_i - 1)} = \frac{\sum_{i=1}^{k} (n_i - 1)\,S_i^2}{N - k}$$

where S_i² is the sample variance for the ith group and N is the total number of
observations.

Here the Null hypothesis to be tested is that, there is no difference between the
groups. The Alternative hypothesis is that there is a difference.
Note that the alternative hypothesis, in the situation of comparing several levels
of a factor is always a two sided alternative, as it does not make sense to try to
specify in advance where the differences lie.
The ratio of the above two mean sums of squares, F = MSSA/MSSE, has an F
distribution under the null hypothesis, with k-1 degrees of freedom associated with
MSSA and N-k degrees of freedom associated with MSSE.
Here the Null hypothesis is rejected if the p-value is smaller than the specified
significance level (or if the test statistic F is larger than the critical value).
EXAMPLE
The following table gives the weights of women who are 3 months pregnant, grouped
by their number of children. Apply ANOVA and draw a conclusion.

                      Number of children
                      1      2      3      4

                      62     63     68     56
                      60     67     66     62
                      63     71     71     60
                      59     64     67     61
                             65     68     63
                             66     68     64
                                           63
                                           59
Mean (X̄i)            61     66     68     61

No. of observations   4      6      6      8
The overall mean X̄ = 64
MSSA = [4(61-64)² + 6(66-64)² + 6(68-64)² + 8(61-64)²]/3 = 228/3 = 76
Since S1² = 3.3, S2² = 8.0, S3² = 2.8, S4² = 6.8, and n1 = 4, n2 = n3 = 6, n4 = 8
MSSE = [(n1-1)S1² + (n2-1)S2² + (n3-1)S3² + (n4-1)S4²]/[(n1-1) + (n2-1) + (n3-1) + (n4-1)]
= 5.6
For this example, the test statistic is F = 76/5.6 = 13.6, and the degrees
of freedom are k-1 = 4-1 = 3 and N-k = 24-4 = 20. The p-value for an F statistic of 13.6
with 3 and 20 degrees of freedom is 0.000046.

Source of          Sum of     d.f.   Mean      F-Ratio   Sig. level
variation          squares           square
Between groups     228.0      3      76.0      13.57     0.0000
Within groups      112.0      20     5.6
Total              340.0      23
The part of the above table of real interest is the p-value of the test statistic,
termed the "significance level". As it is less than 0.01, we can reject the null
hypothesis at the 1% level, implying that mean weight differs between the groups of
women with different numbers of children. We now need to find out which groups differ.
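The hand calculation above can be checked with a short Python sketch. It uses only the standard library and the data from the table; the function name `one_way_anova` is an assumption made for this illustration. The p-value is not computed here, because that needs the F distribution, which the standard library does not provide.

```python
from statistics import mean, variance

# Weights of women by number of children (columns of the data table)
groups = {
    1: [62, 60, 63, 59],
    2: [63, 67, 71, 64, 65, 66],
    3: [68, 66, 71, 67, 68, 68],
    4: [56, 62, 60, 61, 63, 64, 63, 59],
}


def one_way_anova(samples):
    """Return (MSSA, MSSE, F) for a list of independent samples."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    # Between-group sum of squares: group sizes times squared deviations of group means
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-group sum of squares: (n_i - 1) times each sample variance
    ss_within = sum((len(s) - 1) * variance(s) for s in samples)
    mssa = ss_between / (k - 1)
    msse = ss_within / (n_total - k)
    return mssa, msse, mssa / msse


mssa, msse, f_ratio = one_way_anova(list(groups.values()))
print(f"MSSA = {mssa:.1f}, MSSE = {msse:.1f}, F = {f_ratio:.2f}")
```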
Multiple comparison
The overall test above tells us whether there is a difference between the means
for the different groups or not, but does not tell us where the difference lies (if
there is one).
If the overall test has indicated that there is indeed a significant difference, then
multiple comparison tests can be used to show us where the difference lies.
In these procedures, each pairwise test is conducted at a more stringent level than
the original overall test.
There are many multiple comparison procedures that can be used to determine
where these differences lie. Here we will consider two of the most widely used
procedures.
In confidence interval terms, all have the form of the difference between the
means, plus and minus a constant times the estimate of the standard error:

$$(\bar{X}_i - \bar{X}_j) \pm G\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

The procedures differ only in the value of G. The value of G is chosen in such a way
that the overall chance of a type I error is not more than α.
In the case where we had two independent populations we used

$$G = t_{1-\alpha/2,\; n_1+n_2-2}$$

Thus for a factor with only two levels the formula becomes

$$(\bar{X}_1 - \bar{X}_2) \pm t_{1-\alpha/2,\; n_1+n_2-2}\sqrt{\mathrm{MSE}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

The two sample t test is thus a special case of ANOVA, with only two independent
samples (i.e. two levels of the factor of interest).
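One way to see that the two sample t test is a special case of ANOVA is to check numerically that, for two groups, the square of the pooled t statistic equals the ANOVA F ratio. The sketch below does this for the first two groups of the earlier example (women with 1 and 2 children); the variable names are chosen for this illustration only.

```python
import math
from statistics import mean, variance

group1 = [62, 60, 63, 59]
group2 = [63, 67, 71, 64, 65, 66]
n1, n2 = len(group1), len(group2)

# Pooled variance: this is the MSE of a one-way ANOVA with two groups
pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)

# Two sample t statistic with pooled variance
t_stat = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA F ratio for the same two groups (k - 1 = 1 degree of freedom)
grand_mean = (sum(group1) + sum(group2)) / (n1 + n2)
mssa = n1 * (mean(group1) - grand_mean) ** 2 + n2 * (mean(group2) - grand_mean) ** 2
f_ratio = mssa / pooled_var

print(f"t² = {t_stat ** 2:.3f}, F = {f_ratio:.3f}")
```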
For more than two groups, one usually tests all pairwise comparisons. With three
groups, for example, you test whether level 1 differs from level 2, whether level 2
differs from level 3, and whether level 1 differs from level 3.
The two most common multiple comparison tests are
The SCHEFFE procedure, for which

$$G = \sqrt{(k-1)\,F_{k-1,\,N-k,\,1-\alpha}}$$

where F_{k-1,N-k,1-α} is the critical value of the F distribution with degrees of
freedom k-1 and N-k.
The BONFERRONI procedure, for which

$$G = t_{1-\alpha/(2m),\,N-k}$$

which uses a critical value from the t distribution, but with α/2 divided by
m = k(k-1)/2, the number of pairwise comparisons among the k factor levels.
Applying the two procedures to the above example at the 5% level (k = 4, so m = 6):
For the SCHEFFE procedure G = √(3F_{3,20;0.95}) = 3.049
For the BONFERRONI procedure G = t_{0.99583,20} = 2.927
Using these values for G in the confidence interval formula, we obtain

i   j   X̄i - X̄j   SCHEFFE               BONFERRONI            Significant
1   2   -5         (-9.657 , -0.343)     (-9.471 , -0.529)     *
1   3   -7         (-11.657 , -2.343)    (-11.471 , -2.529)    *
1   4   0          (-4.418 , 4.418)      (-4.242 , 4.242)
2   3   -2         (-6.165 , 2.165)      (-5.999 , 1.999)
2   4   5          (1.104 , 8.896)       (1.259 , 8.741)       *
3   4   7          (3.104 , 10.896)      (3.259 , 10.741)      *

To interpret the confidence intervals in the above table: if an interval includes
zero, the two treatments do not differ significantly. Here the two procedures agree.
Pairs of treatments for which the confidence intervals do not include zero are marked
by stars in the above table.
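The intervals in the table can be reproduced with a short sketch. It takes the group means, group sizes, MSE = 5.6 and the two values of G (3.049 and 2.927) as given in the worked example rather than computing critical values, because the Python standard library has no F or t quantile functions. The names `interval` and `G` are illustrative assumptions.

```python
import math
from itertools import combinations

# Summary figures from the one-way ANOVA example
means = {1: 61, 2: 66, 3: 68, 4: 61}
sizes = {1: 4, 2: 6, 3: 6, 4: 8}
MSE = 5.6

# Critical constants quoted in the worked example (not computed here)
G = {"Scheffe": 3.049, "Bonferroni": 2.927}


def interval(i, j, g):
    """Confidence interval for mean_i - mean_j with critical constant g."""
    diff = means[i] - means[j]
    half_width = g * math.sqrt(MSE * (1 / sizes[i] + 1 / sizes[j]))
    return diff - half_width, diff + half_width


for i, j in combinations(sorted(means), 2):
    for name, g in G.items():
        low, high = interval(i, j, g)
        # A pair differs significantly when the interval excludes zero
        flag = "*" if not (low <= 0 <= high) else ""
        print(f"{i} vs {j}  {name:<10} ({low:8.3f}, {high:8.3f}) {flag}")
```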
Assumptions of the ANOVA test
The general one way ANOVA model may be written as:-
Yij = μi + Єij for i = 1,2,…,k treatments (levels of the factor), and j = 1,2,…,ri
replicates for the ith treatment.
In the previous example, r1 = 4, r2 = 6, r3 = 6 and r4 = 8, and the observation
y2,5 is 65.
The yij are the values of the response variable for the jth replicate of the ith group
or factor level.
The Єij are the random error terms for the individual observations, while μi is the
population mean for the ith level of the factor.
Then, the ANOVA test assumes that the data for each factor level are an
independent random sample from the relevant population, and that they are
normally distributed around the mean for that factor level.

Independence of observations
The independence assumption means that, if we have a factor with three levels,
we have a separate sample for each level. In addition, the observations in each
sample are independent.
In terms of the ANOVA model, this means that the Єij are independent.
Normality of the error term
The assumption of normality of the population from which the data is drawn,
means the same as the assumption that the Єij are normally distributed. This
means the assumption of normality is not on the data itself.
This assumption holds for many data sets. Others can often be
transformed to approximate normality using, for example, a log transformation
(appropriate for biological data) or a square root transformation (appropriate for
count data).
Error variance the same for all groups (Homoscedasticity)
The other assumption of the ANOVA test concerns the error terms Єij, namely that the
error variance is the same for all groups. When this assumption does not hold, the
appropriate remedy is to transform the data, thereby making the error variances for
the different groups more similar.

Note that one does not need the error variances to be identical. The rule of
thumb is that the error standard deviations should not differ by more than a
factor of 2, i.e. the largest standard deviation should not be more than twice the
smallest standard deviation, or equivalently the within-group variances should not
differ by more than a factor of 4.
The F test is not much affected by unequal variances when the sample sizes at
different factor levels are approximately equal. However, the multiple
comparison tests may be greatly affected.
One way to check this assumption is to calculate the standard deviation for each
of the factor levels.
For our previous example, the variances were 2.8, 3.3, 6.8 and 8.0, so that the ratio
of the largest to the smallest is 8.0/2.8 = 2.86.
As this is less than 4, any departure from equal population variances is unlikely
to have a great effect on the conclusions.
Some packages allow one to test the assumption of equality of variances, using
a test for “homogeneity of variance”, such as Hartley and modified Levene tests.
As always, the null hypothesis is the hypothesis of no difference, that is, that the
variances are equal.
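The rule of thumb above is easy to check directly. The sketch below computes the sample variance of each group in the earlier example and compares the largest with the smallest; it is an informal check, not one of the formal homogeneity of variance tests, and the variable names are illustrative.

```python
from statistics import variance

# Groups from the one-way ANOVA example (weights by number of children)
groups = [
    [62, 60, 63, 59],
    [63, 67, 71, 64, 65, 66],
    [68, 66, 71, 67, 68, 68],
    [56, 62, 60, 61, 63, 64, 63, 59],
]

variances = [variance(g) for g in groups]
ratio = max(variances) / min(variances)

# Rule of thumb: variances should not differ by more than a factor of 4
# (equivalently, standard deviations by more than a factor of 2)
roughly_equal = ratio < 4

print("variances:", [round(v, 1) for v in variances])
print(f"largest/smallest = {ratio:.2f}, within rule of thumb: {roughly_equal}")
```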

Analysis of Variance for two factors (Two way ANOVA)
In one way ANOVA, we dealt only with comparing two or more groups on a
single factor.
In two way ANOVA, we turn to the situation where we have measurements on
some response (such as weight or expenditure), as well as data on two factors.
The model for such a general two way ANOVA is usually written as:
Yijk = μ + αi + βj + Єijk
for i = 1,2,…,nA levels of the first factor (factor A), j = 1,2,…,nB levels of the second
factor (factor B), and k = 1,2,…,rij replicates (observations) for the combination of the
ith level of factor A and the jth level of factor B. The Єijk are random error terms,
while μ is the overall mean. Here the sum of the αi's and the sum of the βj's are both
taken to be zero.
EXAMPLE
Consider the following 2X2 experiment, also known as a 2² factorial design. The
factors here are age (under and over 30) and gender (Male, Female). The response
variable is health care expenditure in US dollars.

                      Gender
Age         Male                  Female

Under 30    25 26 31 24 27        44 35 26 39 32
            20 27 19 25 32        25 21 31 27 41

Over 30     31 34 41 32 39        44 41 38 39 34
            36 43 41 30 39        45 55 48 38 40

In this example there are two factors: factor A (Age), taking 2 levels
(under and over 30), and factor B (Gender), also taking 2 levels (Male, Female).
There are 10 observations for each combination of age group and gender
(for example, 10 observations for males under 30), giving
r11 = 10, r12 = 10, r21 = 10, r22 = 10
To analyze these data, one needs to define a mean sum of squares for
factor A, for factor B, for their interaction and for the error.
The general two way ANOVA table is as follows:-

Source of variation    SS                   d.f.            MS       F            P-value
Between levels of A    (nA-1)MSA            nA-1            MSA      MSA/MSE
Between levels of B    (nB-1)MSB            nB-1            MSB      MSB/MSE
Between A and B        (nA-1)(nB-1)MSAxB    (nA-1)(nB-1)    MSAxB    MSAxB/MSE
Error                  ESS                  (r-1)(nA)(nB)   MSE
Total                  TSS                  (nA)(nB)r-1

The abbreviations in this table are:
SS=Sum of Squares
MS=Mean Square
The column of major interest is the last one, giving the p-values
associated with the null hypothesis that all levels of a particular factor have equal
means, versus the alternative that at least two factor levels differ from each other.
The prediction from such a model for the combination of the ith level of factor A
and the jth level of factor B is given by the estimate of the overall mean, plus the
estimate for the ith level of factor A, plus the estimate for the jth level of factor B.
The estimate for the ith level of factor A is given by the mean of all values of the
response at the ith level of factor A, minus the overall mean value of the
response.

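Two way analysis of variance decomposition

$$SS_{Total} = SS_A + SS_B + SS_{A \times B} + SS_{Error}$$

where, with a levels of factor A, b levels of factor B and r replicates per combination,

$$SS_{Total} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r} (Y_{ijk} - \bar{Y}_{...})^2$$

$$SS_A = rb\sum_{i=1}^{a} (\bar{Y}_{i..} - \bar{Y}_{...})^2$$

$$SS_B = ra\sum_{j=1}^{b} (\bar{Y}_{.j.} - \bar{Y}_{...})^2$$

$$SS_{A \times B} = r\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2$$

$$SS_{Error} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r} (Y_{ijk} - \bar{Y}_{ij.})^2$$
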

Similarly, the estimate for the jth level of factor B is given by the mean of all values
of the response at the jth level of factor B, minus the overall mean value of the
response.
Analyzing the previous example via a two factor analysis of variance (two way ANOVA)
gives the following calculations and table.

MEANS
Overall = 34.125
R1 = 28.85 (under 30)
R2 = 39.4 (over 30)

$$SS_A = rb\sum_{i=1}^{a} (\bar{Y}_{i..} - \bar{Y}_{...})^2$$

= 10x2[(28.85-34.125)² + (39.4-34.125)²]
= 20[27.825625 + 27.825625]
= 20[55.65125]
= 1113.025

MEANS
Overall = 34.125
C1 = 31.1 (male)
C2 = 37.15 (female)

$$SS_B = ra\sum_{j=1}^{b} (\bar{Y}_{.j.} - \bar{Y}_{...})^2$$

= 10x2[(31.1-34.125)² + (37.15-34.125)²]
= 20[9.150625 + 9.150625]
= 20[18.30125]
= 366.025

MEANS
Overall = 34.125
R1C1 = 25.6    R2C1 = 36.6    R1 = 28.85    C1 = 31.1
R1C2 = 32.1    R2C2 = 42.2    R2 = 39.4     C2 = 37.15

$$SS_{A \times B} = r\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2$$

= 10[(25.6-28.85-31.1+34.125)² + (32.1-28.85-37.15+34.125)² +
(36.6-39.4-31.1+34.125)² + (42.2-39.4-37.15+34.125)²]
= 10[0.050625 + 0.050625 + 0.050625 + 0.050625]
= 10[0.2025]
= 2.025

$$SS_{Total} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r} (Y_{ijk} - \bar{Y}_{...})^2$$

= (25-34.125)² + … + (40-34.125)²
= 83.27 + … + 34.52
= 2670.375
Source          SS          d.f.   MS          F       P-value

Age             1113.025    1      1113.025    33.69   <0.01
Gender          366.025     1      366.025     11.08   <0.01
Age x Gender    2.025       1      2.025       0.06    >0.05
Error           1189.3      36     33.04
Total           2670.375    39     68.47

The hypotheses to be tested are:
H0: There is no difference in true mean health care expenditure between age groups
H1: There is a difference in true mean health care expenditure between age groups

H0: There is no difference in true mean health care expenditure between genders
H1: There is a difference in true mean health care expenditure between genders

H0: There is no interaction between gender and age in true mean health care
expenditure
H1: There is an interaction between gender and age in true mean health care
expenditure

From the table, the p-values for age and for gender are less than 0.05, so we reject
the corresponding null hypotheses and conclude at the 5% level that true mean health
care expenditure differs between age groups and between genders. The p-value for
the interaction is greater than 0.05, so there is no evidence of an interaction
between age and gender in true mean health care expenditure.
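The sums of squares and F ratios of the two way example can be recomputed with the sketch below, which follows the decomposition SSTotal = SSA + SSB + SSAxB + SSError for a balanced design with r = 10 observations per cell. It uses only the standard library; the dictionary layout and variable names are assumptions made for this illustration, and p-values are not computed.

```python
from statistics import mean

# Health care expenditure (US dollars) by age group and gender, 10 per cell
cells = {
    ("under 30", "male"):   [25, 26, 31, 24, 27, 20, 27, 19, 25, 32],
    ("under 30", "female"): [44, 35, 26, 39, 32, 25, 21, 31, 27, 41],
    ("over 30", "male"):    [31, 34, 41, 32, 39, 36, 43, 41, 30, 39],
    ("over 30", "female"):  [44, 41, 38, 39, 34, 45, 55, 48, 38, 40],
}
ages = ["under 30", "over 30"]      # factor A
genders = ["male", "female"]        # factor B
r = 10                              # replicates per cell (balanced design)

all_values = [y for obs in cells.values() for y in obs]
grand = mean(all_values)
age_mean = {a: mean([y for g in genders for y in cells[(a, g)]]) for a in ages}
gender_mean = {g: mean([y for a in ages for y in cells[(a, g)]]) for g in genders}
cell_mean = {key: mean(obs) for key, obs in cells.items()}

ss_age = r * len(genders) * sum((age_mean[a] - grand) ** 2 for a in ages)
ss_gender = r * len(ages) * sum((gender_mean[g] - grand) ** 2 for g in genders)
ss_inter = r * sum(
    (cell_mean[(a, g)] - age_mean[a] - gender_mean[g] + grand) ** 2
    for a in ages for g in genders
)
ss_error = sum((y - cell_mean[key]) ** 2 for key, obs in cells.items() for y in obs)
ss_total = sum((y - grand) ** 2 for y in all_values)

df_error = len(ages) * len(genders) * (r - 1)
mse = ss_error / df_error
f_age = (ss_age / (len(ages) - 1)) / mse
f_gender = (ss_gender / (len(genders) - 1)) / mse
f_inter = (ss_inter / ((len(ages) - 1) * (len(genders) - 1))) / mse

print(f"SS: age={ss_age:.3f} gender={ss_gender:.3f} interaction={ss_inter:.3f} "
      f"error={ss_error:.1f} total={ss_total:.3f}")
print(f"F:  age={f_age:.2f} gender={f_gender:.2f} interaction={f_inter:.2f}")
```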

Research Methodology
Research can be defined as the search for knowledge or any systematic
investigation to establish facts.
Applied research is research that accesses and uses some part of the research
community's accumulated theories, knowledge, methods and techniques for a
specific purpose, often one driven by the state, a business or a client.
Basic research or fundamental research (sometimes pure research) is research
carried out to increase understanding of fundamental principles.
Many times the end results have no direct or immediate commercial
benefits.
It can be thought of as arising out of curiosity.
However, in the long term it is the basis for many commercial products and
applied research.
Before research is carried out, a research proposal (protocol) should first be
developed.
A research proposal is intended to convince others that you have a worthwhile
research project and that you have the competence and the work-plan to
complete it.

Generally, a research proposal should contain all the key elements involved in
the research process and include sufficient information for the readers to
evaluate the proposed study.
Regardless of your research area and the methodology you choose, all research
proposals must address the following questions:-
What do you plan to accomplish?
Why do you want to do it?
How are you going to do it?
The proposal should have sufficient information to convince your readers that:-
You have an important research idea.
You have a good grasp of the relevant literature and the major issues.
Your methodology is sound.
The quality of your research proposal depends not only on the quality of your
proposed project, but also on the quality of your proposal writing.
A good research project may run the risk of rejection simply because the
proposal is poorly written. Therefore, it pays if your writing is coherent, clear and
compelling.

Main components of Research proposal
1. TITLE
It should be concise and descriptive. An effective title not only pricks the reader's
interest, but also predisposes him/her favorably towards the proposal.
The title should be in line with your general objective.
Make sure that it is specific enough to tell the reader what your study is about
and where it will be conducted.
2. Summary
It is a brief summary of approximately 500 words (about one page).
It should include the research question, the rationale for the study, the
hypothesis (if any) and the method.
Descriptions of the method may include the design, procedures, the sample and
any instruments that will be used.
3. Introduction
The main purpose of the introduction is to provide the necessary background or
context for your research problem.
The introduction typically begins with a general statement of the problem area,
then focuses on a specific research problem, followed by the rationale or
justification for the proposed study.
The introduction generally covers the following elements
1. State the research problem, which is often referred to as the purpose of the
study.
2. Provide the context and set the stage for your research question in such a way as
to show its necessity and importance.
3. Present the rationale of your proposed study and clearly indicate why it is worth
doing.
4. Briefly describe the major issues and sub-problems to be addressed by your
research.
5. Identify the key independent and dependent variables of your experiment.
Alternatively, specify the phenomenon you want to study.
6. State your hypothesis or theory, if any. For exploratory or phenomenological
research, you may not have any hypotheses.
7. Set the delimitation or boundaries of your proposed research in order to provide
a clear focus.

4. Literature Review
• Sometimes the literature review is incorporated into the introduction
section.
• However, a separate section is usually preferred, as it allows a more
thorough review of the literature.
The literature review serves several important functions such as:
1. Gives credit to those who have laid the groundwork for your research.
2. Demonstrates your knowledge of the research problem.
3. Demonstrates your understanding of the theoretical and research issues
related to your research question.
4. Shows your ability to critically evaluate the relevant literature.
5. Indicates your ability to integrate and synthesize the existing literature.
6. Provides new theoretical insights or develops a new model as the
conceptual framework for your research.
7. Convinces your reader that your proposed research will make a significant
and substantial contribution to the literature (i.e., resolving an important
theoretical issue or filling a major gap in the literature).

5. Research Objectives
The OBJECTIVES of a research project summarize what is to be achieved by the
study.
Objectives should be closely related to the statement of the problem.
For example, if the problem identified is low utilization of child welfare clinics, the
general objective of the study could be “to identify the reasons for this low
utilization”, in order to find solutions.
The general objective of a study states what researchers expect to achieve by the
study in general terms.
It is possible (and advisable) to break down a general objective into smaller,
logically connected parts.
These are normally referred to as specific objectives.
And they should specify what you will do in your study, where and for what
purpose.

6. Methods
• The Method section is very important because it tells your reviewer how you
plan to tackle your research problem.
• It will provide your work plan and describe the activities necessary for the
completion of your project.
• The guiding principle for writing the method section is that it should contain
sufficient information for the reader to determine whether the methodology is
sound.
• Some even argue that a good proposal should contain sufficient details for
another qualified researcher to implement the study.
• You need to demonstrate your knowledge of alternative methods and make
the case that your approach is the most appropriate and most valid way to
address your research question.
For quantitative studies, the method section typically consists of the following
sections:
• Design - Is it a questionnaire study or a laboratory experiment? What kind of
design do you choose?
• Subjects or participants - Who will take part in your study? What kind of
sampling procedure do you use?
Instruments - What kind of measuring instruments or questionnaires do you use?
Why do you choose them? Are they valid and reliable?
Procedure - How do you plan to carry out your study? What activities are
involved? How long does it take?
Note
Obviously you do not have results at the proposal stage. However, you need to
have some idea about what kind of data you will be collecting, and what statistical
procedures will be used to answer your research question or test your
hypothesis. These should be mentioned in the methods section.
7. Discussion
It is important to convince your reader of the potential impact of your proposed
research.
You need to communicate a sense of enthusiasm and confidence without
exaggerating the merits of your proposal.
That is why you also need to mention the limitations and weaknesses of the
proposed research, which may be justified by time and financial constraints as
well as by the early developmental stage of your research area.

What is the difference between Research Proposal and Report?
A research report contains all components of the research proposal, plus:
Results, with discussion and interpretation
Conclusion
Recommendations
Referencing
A prime purpose of citation is intellectual honesty (avoiding plagiarism):
To attribute prior or unoriginal work and ideas to the correct sources, and
To allow the reader to determine independently whether the referenced material
supports the author's argument in the claimed way.
Citation content can vary depending on the type of source and may include:
• Book: author(s), book title, publisher, date of publication, and page number(s) if
appropriate.
• Journal: author(s), article title, journal title, date of publication, and page number(s).
• Newspaper: author(s), article title, name of newspaper, section title and page
number(s) if desired, date of publication.
• Web site: author(s), article and publication title where appropriate, as well as a URL,
and the date when the site was accessed.
Harvard referencing style (Parenthetical)
It is an example of author-date referencing.
The Harvard style is very common and is used across most subjects.
With the Harvard system, when you cite someone else's work, you need to include
the author's last name and the date of publication in brackets after the citation in
the body of your paper.
The full reference to the work is then included in an alphabetic reference list or
bibliography at the end of your paper.
Example: (Smith, 2001) inside the text.
Smith SD, Jones AD. Organ donation. Engl. J. Med. 2001;657:230-5. (in the reference
list at the end, in alphabetical order)
Vancouver referencing style (Numbering)
Citation numbers are included in the text in square brackets, in parentheses or as
superscripts: [1], (1) or ¹
All bibliographical information is exclusively included in the list of references at the
end of the document, next to the respective citation number.
Example: [1] inside the text.
1. Smith SD, Jones AD. Organ donation. Engl. J. Med. 2001;657:230-5. (in the reference
list at the end)

What is Epidemiology
Epidemiology: - is the branch of health science which studies the frequency,
distribution and determinants of diseases and other health related states or events in
specified populations.
This study is applied to promote health and to prevent and control health problems.
COMPONENTS OF THE DEFINITION
Population: - is the group of people and their environment. The focus of
epidemiology is mainly on the population rather than on individuals.
Frequency: - expresses amount. This shows that epidemiology is mainly a quantitative
science.
Frequency of disease (Morbidity)
Frequency of death (Mortality)
Health related conditions: - are conditions which directly or indirectly affect or
influence health.
These may be injuries, vital events, health related behaviors, social factors, economic
factors etc.
Distribution: - refers to the pattern of occurrence of disease. The distribution of a
disease can be expressed in terms of time, place or affected persons.
Determinants:- are factors, which determine whether or not a person will get a
disease.
Health: - A state of complete physical, mental, and social well-being and not merely
the absence of disease or infirmity.
SCOPE OF EPIDEMIOLOGY
Originally epidemiology was concerned with epidemics of communicable diseases and
with epidemic investigation.
Later on it was extended to endemic communicable diseases and non-communicable
diseases.
At present epidemiological methods are being applied to:- infectious and
non-infectious diseases, injuries and accidents, nutritional deficiencies,
maternal and child health, cancer, occupational health, environmental health,
violence etc.
Hence Epidemiology can be applied to all disease conditions and other health
related events.

Epidemiological study design
Study design – is an arrangement of conditions for the collection and analysis of
data to get the most accurate answer to the research question in an economical way.

EPIDEMIOLOGICAL STUDY DESIGN

Observational
    Descriptive
        • Ecological
        • Case-report
        • Case-series
    Analytical
        • Cross-sectional
        • Case-control
        • Cohort
Experimental/Interventional
        • Randomized control trial
        • Field trial
        • Community trial
Observational study
Information is obtained by observation of events.
In an observational study the investigator does not take an active role in allocating
people into groups or in administering an exposure to one of the groups.
The investigator observes what is happening or what has happened to the groups
under study.
Experimental/Interventional studies
Individuals are allocated into experimental or control groups by the investigator. If
done properly, experimental studies can produce high quality data.
The experimental design is the gold standard compared to other study designs.
Descriptive studies
They are mainly concerned with the distribution of disease with respect to time, place
and person.
They answer the questions: who is affected, and where and when do the cases occur?

• Descriptive studies use information from various sources, e.g. census data and
vital statistics records, and summarize these data in systematic ways.
• Since the data used by these studies are usually routinely collected, they are less
time consuming.
Analytical studies
• In an analytical study we can investigate risk factors for a disease or an outcome.
• Here we ask the question "does the pattern of exposure to certain risk factors
among individuals with or without a specific disease help us to work out the
cause of the disease?"
• We must be careful in how we interpret our findings: in an analytical study we
measure associations between exposures and outcomes. If we demonstrate
an association, that does not necessarily mean that the exposure caused the
outcome.
The different Epidemiological study designs
1. Case report – a careful, detailed report by one or more clinicians on a single
patient. It documents unusual medical occurrences and can represent the
first clue in the identification of new diseases or of the effects of certain exposures.

2. Case series – describes the characteristics of a number of patients with a given
disease. It is very useful for hypothesis generation. Its limitations are that it is
based on a single or a few patients, so the findings may have arisen just by
coincidence, and that there is no comparison group.
3. Ecological/Correlational – uses data from entire populations to describe disease in
relation to some factor of interest such as age, time, utilization of health
services, consumption of food etc. That is, an ecological study compares groups.
Thus it looks for an association between an exposure and an outcome at the
group level, not at the individual level.
Example: Comparison of incidence of hypertension and average salt consumption in
African countries.
STRENGTHS
• They are the only studies that enable us to investigate the differences between
groups. This is extremely important in public health.
• They are the only studies that enable us to investigate the effects of group
properties or contextual properties.
• They can often be carried out relatively quickly and cheaply, using routine or
secondary data. Because of this, we often use them as a first step in the
investigation of a possible exposure-outcome relationship.

• We can often obtain group-level exposure data in circumstances in which it is
difficult or impossible to obtain individual-level exposure data.
• For exposures with substantial within-person variability, group-level
exposure data may be more reliable than individual-level exposure data.
WEAKNESSES
• It can be difficult to control for confounding in ecological studies.
• They are particularly susceptible to information bias.
• They do not enable us to make inferences about the causes of individual
risks.
4. Cross sectional – also called a prevalence study.
• In cross sectional studies exposure and disease status are assessed
simultaneously among individuals in a well-defined population.
• Information about the status of each individual with respect to the presence
or absence of exposure and disease is assessed at a point in time.
• N.B. Since exposure and disease are assessed at the same time, in most cases
it is not possible to determine whether the exposure preceded or resulted
from the disease.

STRENGTHS
• Cross-sectional studies are relatively easy and economical to conduct.
• Cross-sectional studies provide important information on the distribution
and burden of exposures and outcomes. This is extremely valuable for
health-service planning.
• Cross-sectional studies can be used as the first step in the study of a possible
exposure-outcome relationship.
WEAKNESSES
• Cross-sectional studies measure prevalent rather than incident cases.
• It can be difficult to establish the time-sequence of events in a
cross-sectional study.
5. Case-control – in a case-control study subjects are selected with respect to the
presence or absence of disease (outcome), and then inquiries are made
about past exposure to the factor of interest.
• Here, to investigate the association between an exposure and an outcome,
we first obtain information about one or more previous exposures from
cases and controls, and then compare the two groups to see if each exposure is
significantly more (or less) frequent in cases than in controls.

Example: Examining the association between cigarette smoking and lung cancer
Case-control study design

Population
    Cases (with disease)
        Exposed
        Unexposed
    Controls (without disease)
        Exposed
        Unexposed
(Labels on the figure: TIME, INQUIRY)
STRENGTHS
• Can be carried out rapidly and relatively cheaply
• Are useful for studying rare diseases
• Can be used to study diseases with long latent periods
• Can study multiple exposures for a single outcome
WEAKNESSES
• Are prone to selection bias, particularly in the selection of controls
• Are prone to information bias, because exposure status is determined after
the outcome has occurred
• Cannot establish the sequence of events: the exposure may be a
consequence rather than a cause of the outcome (reverse causality)
• Are not suitable for studying rare exposures (except in nested case-control
studies)
• Cannot usually be used to estimate disease incidence or prevalence

6. Cohort – in a cohort study subjects are selected by exposure or determinant of
interest and followed to see the development of the disease or other outcome
of interest. The starting point for cohort studies is exposure to a risk factor.
Cohort studies are particularly useful for rare exposures and in situations where
we are interested in more than one outcome.
In addition, because exposure status is defined at the start of the study, before
the outcome occurs, the temporal sequence of events can be investigated (i.e.
exposure precedes outcome).

Cohort study design

Population (total population without disease)
    Exposed
        Disease/Outcome positive
        Disease/Outcome negative
    Unexposed
        Disease/Outcome positive
        Disease/Outcome negative
(Labels on the figure: TIME, INQUIRY)
STRENGTHS
• Exposure is measured at the start of the study, before the outcome occurs,
and so measurement of exposure is not biased by the presence or absence
of the outcome.
• Cohort studies can provide data on the time course of the development of
the outcome(s), including late effects.
• More than one outcome can be examined at once.
• Rare exposures can be investigated using appropriately selected
populations.
WEAKNESSES
• Prospective cohorts are slow and potentially expensive if there is a long
period between exposure and outcome.
• They are inefficient for rare diseases.
• Retrospective cohort studies depend upon pre-existing records of exposure
being available, and being reliable.
• Exposure status may change during the study (in which case it may need to be
determined again at intervals throughout the study).

142
Differential loss to follow-up may introduce bias: this is a particular problem
when follow-up is of long duration
In long term cohort studies, it may be hard to ensure that diagnostic criteria
remain consistent throughout the study, particularly if outcomes are ascertained
from routine data sources
Measures of association
Measures of association between risk factor and disease are often calculated
from data presented in a 2X2 table. The following is a 2X2 table showing the
association between exposure/factor and disease/outcome.

                 Outcome
Factor       Yes (+)    No (-)     Total
Yes (+)      A          B          A+B
No (-)       C          D          C+D
Total        A+C        B+D        A+B+C+D
Where
A = number with the factor and with the outcome
B = number with the factor and without the outcome
C = number without the factor and with the outcome
D = number without the factor and without the outcome
A+B = total number of individuals with the factor
C+D = total number of individuals without the factor
A+C = total number of individuals with the outcome
B+D = total number of individuals without the outcome
RELATIVE RISK (RR)
• It can be calculated directly only in cohort and experimental studies.
• It expresses the risk of developing a certain outcome in people exposed to a certain factor compared with the risk of the outcome in people not exposed to the factor.
• It estimates the magnitude of the association between factor and disease.
• It can also be used to compare risks of death, accident or other possible outcomes of an exposure.
• It is calculated as
RR = incidence rate among exposed ÷ incidence rate among non-exposed
= [A/(A+B)] ÷ [C/(C+D)]
Example: In a study of lung cancer mortality in relation to cigarette smoking, the
mortality rate was 96/100000 among smokers and 24/100000 among non-smokers.
Find the RR.
Solution: RR = (96/100000) ÷ (24/100000) = 4.0. This can be interpreted as:
cigarette smokers were 4 times as likely to die from lung cancer as non-smokers.
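The RR formula can be checked with a few lines of Python. This is only a minimal sketch: the function name `relative_risk` is chosen here for illustration, and the rates of the lung cancer example are rewritten as counts per 100000 people so that they fit the A, B, C, D layout of the 2X2 table.

```python
def relative_risk(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)] for a 2x2 table laid out as in the notes."""
    risk_exposed = a / (a + b)      # incidence among those with the factor
    risk_unexposed = c / (c + d)    # incidence among those without the factor
    return risk_exposed / risk_unexposed

# Lung cancer mortality example: 96 vs 24 deaths per 100000 people.
rr = relative_risk(96, 100_000 - 96, 24, 100_000 - 24)
print(f"RR = {rr:.2f}")
```

The same function applies to any cohort or experimental 2X2 table; it should not be applied to case-control counts, where incidence cannot be estimated.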
ODDS RATIO (OR)
In a case-control study, where study participants are selected on the basis of
outcome status, it is not possible to determine the incidence rate, so the RR
cannot be computed directly. An indirect estimate of the RR is given by the OR
as follows:
OR = odds of having the outcome if exposed ÷ odds of having the outcome if
non-exposed
= (ratio of those with the outcome to those without the outcome among the exposed)
÷ (ratio of those with the outcome to those without the outcome among the non-exposed)
= (A/B) ÷ (C/D)
= AD/BC (cross product)
Example: The following 2X2 table shows the relationship between
asthma and exposure to asbestos from a case-control
study. From the given information find the OR.
Solution: OR = AD ÷ BC = (90x360) ÷ (60x270) = 2. This can be
interpreted as: the odds of asthma are 2 times as high among those
exposed to asbestos as among those without exposure to
asbestos.
Note: If A and C are small relative to B and D, so that one is looking at a
very rare disease, then the formulas for RR and OR give almost the same
answer.
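The cross-product formula and the note on rare diseases can be illustrated numerically. In this sketch the asthma counts come from the example, while the "rare outcome" counts (20 of 10000 exposed, 10 of 10000 unexposed) are hypothetical numbers chosen only to show OR and RR nearly coinciding.

```python
def odds_ratio(a, b, c, d):
    """OR = (A/B) / (C/D) = AD / BC."""
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)]."""
    return (a / (a + b)) / (c / (c + d))

# Asthma / asbestos example: A=90, B=270, C=60, D=360.
asthma_or = odds_ratio(90, 270, 60, 360)

# Hypothetical rare outcome: A and C are tiny compared with B and D.
rare_table = (20, 9980, 10, 9990)
rare_or = odds_ratio(*rare_table)
rare_rr = relative_risk(*rare_table)

print(f"Asthma example OR = {asthma_or:.2f}")
print(f"Rare outcome: OR = {rare_or:.4f}, RR = {rare_rr:.4f}")
```

When the outcome is rare, A+B is close to B and C+D is close to D, which is why the two ratios almost agree.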

                       Asthma
Asbestos exposure   Yes (+)    No (-)     Total
Yes (+)             90         270        360
No (-)              60         360        420
Total               150        630        780

These very important measures of association can also be obtained from a
multiple logistic regression model using the following formula:
OR = 1/exp(-bi) = exp(bi)
where bi is the coefficient of the i-th independent variable (i = 1, 2, ...,
according to the number of independent variables). Strictly, exp(bi) is an
odds ratio; in a cohort study it approximates the RR only when the outcome is rare.

OR and RR cannot be negative, and are interpreted as follows:
• A value of 1 indicates that the odds are even, so that the exposure has no effect on the probability of the disease.
• Since one would seldom get an odds ratio of exactly 1, one needs a method to test whether an odds ratio is significantly different from 1. This can be done by calculating a standard error, and then using a confidence interval or doing a test.
• A value >1 means the outcome is more likely in those with the factor compared to those without the factor.
• A value <1 means the outcome is less likely in those with the factor compared to those without the factor.
Note:
For a cross-sectional epidemiological study design we can use the OR as a measure of
association between the exposure and outcome variables.

Confidence Interval for OR
• The OR cannot be negative, but can be very large if the divisor is very small. This means it must lie between zero and infinity, which implies that it has a skewed distribution.
• To obtain a standard error for the confidence interval calculation, we use the fact that the natural logarithm (ln) of the odds ratio has a symmetric, in fact approximately normal, distribution.
• The standard error of the ln odds ratio is
S.E.(ln OR) = √[(1/A)+(1/B)+(1/C)+(1/D)], where A, B, C and D are as defined in the 2X2 table above.
• Since the ln odds ratio is approximately normally distributed, one can calculate a confidence interval for the ln odds ratio. One can then convert this to a confidence interval for the odds ratio by undoing the ln, i.e. by taking exponentials.
• Therefore, the 95% confidence interval for the ln odds ratio is given by
(ln(OR) - 1.96 x S.E.(ln OR), ln(OR) + 1.96 x S.E.(ln OR))

Note: loge = ln


Example
A group of 10890 women were examined as to their use of oral contraceptives, and the
presence or absence of breast cancer. The data were

                           Breast cancer
Oral contraceptives     Yes        No         Total
Yes                     273        2641       2914
No                      716        7260       7976

Therefore, the odds ratio is (273x7260)/(2641x716) = 1.04814. Thus, in this sample, the
odds of breast cancer among women using oral contraceptives are about 1.05 times the
odds among women not using oral contraceptives.
The ln of the odds ratio is then ln(1.04814) = 0.047017.
The standard error of the ln odds ratio is
√[(1/273)+(1/2641)+(1/716)+(1/7260)] = 0.074673
This gives a 95% confidence interval for the ln odds ratio of
ln(OR) ± 1.96 x S.E.,
or 0.047017 ± 0.146359, i.e.
(-0.09934, 0.19338)
Taking exponentials of both limits of this confidence interval gives the
confidence interval for the odds ratio as (0.9054, 1.2133). This includes 1, so
the odds ratio is not significantly different from 1 at the 5% level, i.e. there is
no significant evidence of an increased risk of breast cancer.
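The steps of this confidence interval calculation can be sketched in Python as follows. The function name `odds_ratio_ci` and its default `z=1.96` are illustrative choices, not part of the notes; the counts are those of the oral contraceptive example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, S.E. of ln OR, lower limit, upper limit) for a 2x2 table."""
    or_hat = (a * d) / (b * c)
    se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ln_or = math.log(or_hat)
    # Build the interval on the ln scale, then take exponentials.
    lower = math.exp(ln_or - z * se_ln_or)
    upper = math.exp(ln_or + z * se_ln_or)
    return or_hat, se_ln_or, lower, upper

or_hat, se_ln_or, lower, upper = odds_ratio_ci(273, 2641, 716, 7260)
print(f"OR = {or_hat:.5f}, S.E.(ln OR) = {se_ln_or:.5f}")
print(f"95% CI for OR: ({lower:.4f}, {upper:.4f})")
print("Includes 1:", lower < 1 < upper)
```

Working on the ln scale is what keeps the interval inside the allowed range: the limits on the OR scale are always positive, and they are not symmetric about the point estimate.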

From the following table:
1. Construct a 2X2 table
2. Calculate the OR and interpret it

Lung cancer status    Cigarette smoking
Positive              Yes
Positive              Yes
Positive              Yes
Positive              No
Positive              No
Negative              No
Negative              No
Negative              No
Negative              Yes
Negative              Yes
Exercise
a. From the following information, obtained from a cross-sectional
epidemiological study, calculate and interpret an appropriate measure of
association between the disease and the exposure variable.

HIV/AIDS status    Condom use
P                  No
P                  No
P                  No
P                  Yes
P                  Yes
N                  Yes
N                  Yes
N                  Yes
N                  No
N                  No

b. For a rare disease (outcome), show that the values of the odds ratio (OR) and risk ratio
(RR) are almost equal.
QUESTIONNAIRE DEVELOPMENT AND ADMINISTRATION
A questionnaire is a list of questions, to be answered by the respondent, which
helps to fulfil the objectives of the study.
Steps in designing a good questionnaire
1. Decide on the content according to the objectives of the study
Questions should be based on your study objectives; keep them within the scope
of the study. Short-list the variables that you need. A short, well-conceived
questionnaire elicits much better information than a long one.
2. Formulate questions
Questions can be open-ended, where the respondent replies in whatever way she
or he chooses. The alternative is to have closed-ended questions, where
predetermined possible answer categories are marked off and coded.
Closed-ended questions ensure quicker, more standardized data collection. Closed
question categories should be mutually exclusive and exhaustive, i.e. there should be no
overlap of categories, and all possibilities should be covered. Always allow for an
“other” category where the respondent can specify the answer.

• Questions should be simple, concise and specific. Make sure that there are no ambiguities.
• Take special care in wording questions.
• Ask potential respondents which questions are meaningful to them and how to phrase the questions so that they are understandable and acceptable.
• Avoid questions which suggest to the respondent the answer that is expected (this is called a leading question). Example: Do you believe that IUDs have adverse health effects?
• Develop one question at a time. Break up complex questions into simple ones.
Example: “Do you use a method of contraception? If not, why not?” should be
broken up as follows:
1. Do you use a method of contraception? Yes/No
2. If no, what are your reasons?
3. If yes, which method are you currently using?

3. Sequencing of questions
• Take special care in locating questions within the sequence when seeking personal or sensitive information, i.e. order the questions meaningfully to ensure a smooth, logical flow.
• Non-threatening items should be put first so that the respondents feel at ease.
4. Formatting the questionnaire
• For the interviewer in interviews and the respondent in self-administered questionnaires, provide visual markers to make the form easy to complete.
• Examples of visual aids include putting related questions in boxes and using arrows and flow diagrams. In addition, ensure good spacing and printing for easy reading, and enough space for filling in the responses and computer codes.
5. Pre-testing the questionnaire
• It is a test run before the main study and requires an in-depth look at the questionnaire with the aim of improving its quality.
• Select respondents who are as similar as possible to the target population.

• Usually only a few subjects are chosen (5 to 20).
• It helps to do trial runs of the questioning, leaving space for noting required changes.
• During the pilot, record words and sentences that are not understood and the questions that require explanation.
• It helps to assess logistical issues such as time taken, wording, common responses which suggest categories for closed questions, and common misinterpretations of the questions.
• Feedback from the subjects should be welcomed, and as many of the criticisms as possible should be addressed.
• Finally, make the necessary changes after the pilot study.
6. Translation
• Ideally, study subjects should be interviewed in their own language.
• After translation, the questionnaire should be translated back into the original language, and a second person should be assigned to proofread it.
• Finally, the translation has to be standardized amongst interviewers.

7. Training
• It helps to reduce measurement errors, and
• It ensures that the interviewers know how to use the questions.
8. Implementation of the actual field work

Questionnaire Administration
Questions developed for the study by following the above steps can be
asked by:
• Self-administration – no interviewer is required, so the clarity of the questions is essential.
• Interview – the interviewer asks the respondents the questions face to face.
• Telephone – the interviewer asks the questions by telephone.
• Discussion – mostly used to collect qualitative data, by raising points for discussion through the discussion leader.
Example: Focus Group Discussion (FGD). In this case we can use a tape recorder or
notebook to record the discussion points.

Some rules for interviewing
1. Establish a social setting which is comfortable for the respondent, e.g. when you
are interviewing children, it is especially important that you sit on the same level
as them.
2. Answer the following questions to the satisfaction of your respondents; this
will determine the quality of the information you get:
• Who are you, and what are you doing?
• Why have you picked me for this interview?
• What will you do with the information?
• Will giving you information benefit me? Or potentially harm me?
3. Be sensitive to local customs (e.g. when interviewing women). Check with local
people beforehand what you should watch out for.
4. Encourage people to talk.
5. Be neutral: avoid showing your feelings and opinions, and be patient.
6. Ask respondents to give examples, since it is a good way to get people to
describe their ideas and opinions.
7. Avoid interrupting and contradicting the respondent.
8. Listen with full attention rather than thinking about the next question while
the respondent is talking.
Examples of problematic questions which we have to avoid
1. Have you ever smoked? This question is vague. Smoked what: pipe, cigarettes,
etc.? What do you answer if you had one cigarette in your childhood?
2. What was your age at menarche? Here the word menarche is used, which many
respondents will not understand.
3. Have you had any infectious diseases such as measles? It is a leading question.

Exercise
The following are problematic questions, state the reasons
1. What is your religion?
a. Muslim
b. Orthodox
c. Protestant
2. Do you know signs and symptoms of malaria such as fever?

3. How many tablets do you have in the bottle ?


a. 0
b. 1 - 2
c. 2 – 3
d. > 3

4. Write your own one open -ended non-problematic question.

Logistic Regression model
• Logistic regression is commonly used when the independent variables include both numerical and nominal measures and the outcome variable is binary (dichotomous).
• That is, logistic regression is a method which is useful when the response variable is dichotomous (has two levels) and at least one of the explanatory variables is continuous.
• In this situation we are modeling the probability that the response variable takes on the level of interest (success) as a function of the explanatory variable.
• In many experiments, the end point or outcome measurement is dichotomous, with levels being the presence or absence of a characteristic (example: cure, death; positive, negative, etc.).
• One key difference between the logistic regression model and the simple or multiple linear regression model is that here the probabilities must lie between 0 and 1, so we cannot fit a straight-line function as we did with linear regression.
• We will fit an “S”-shaped curve that is constrained to lie between 0 and 1.

[Figure: logistic regression curve for β>0. Vertical axis: probability, from 0 to 1 with 0.5 marked; horizontal axis: independent variable (x).]
For the case where we have one independent variable, we can have the following
logistic regression model.

$$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{1}{1 + e^{-(\alpha + \beta x)}}$$

Here $\pi(x)$ is the probability that the response variable takes on the characteristic
of interest (success), and x is the level of the numeric explanatory variable.
The interest here is whether or not β=0. If β=0, then the probability of success is
independent of the level of x. If β>0, then the probability of success increases as x
increases; conversely, if β<0, then the probability of success decreases as x
increases.
To test this hypothesis, we will conduct the following test, based on estimates
obtained from a statistical computer package.
H0: β=0
H1: β≠0
• In logistic regression, $e^{\hat{\beta}}$ is the estimated odds ratio of a success for levels of the explanatory variable one unit apart, i.e. the factor by which the odds of success change per unit increase in x.

The odds of an event occurring is defined as:

$$o = \frac{\pi}{1 - \pi}$$

• It implies

$$o(x) = \frac{\pi(x)}{1 - \pi(x)}$$

Then the ratio of the odds at x+1 to the odds at x (the odds ratio) can be written
(independent of x) as:

$$OR(x+1, x) = \frac{o(x+1)}{o(x)} = \frac{e^{\alpha + \beta(x+1)}}{e^{\alpha + \beta x}} = e^{\beta}$$

An odds ratio greater than 1 implies that the probability of success is increasing as x
increases, and an odds ratio less than 1 implies that the probability of success is
decreasing as x increases.
Frequently, the odds ratio, rather than $\hat{\beta}$, is reported in studies.
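That the odds ratio between x+1 and x does not depend on x can be checked numerically. The sketch below uses hypothetical parameter values (alpha = -2.0, beta = 0.5) chosen purely for illustration; the helper names `logistic` and `odds` are likewise assumptions of this example.

```python
import math

def logistic(x, alpha, beta):
    """pi(x) = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

alpha, beta = -2.0, 0.5          # hypothetical values, not from the notes
xs = [0.0, 3.0, 7.5]

# Odds ratio between x+1 and x at several different starting points x.
ratios = [odds(logistic(x + 1, alpha, beta)) / odds(logistic(x, alpha, beta))
          for x in xs]

for x, ratio in zip(xs, ratios):
    print(f"x = {x}: OR(x+1, x) = {ratio:.6f}, exp(beta) = {math.exp(beta):.6f}")
```

The probabilities themselves change with x, but the ratio of the odds stays equal to exp(beta) at every starting point.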
We choose the parameters of our model to minimize the badness-of-fit or to
maximize the goodness-of-fit of the model to the data.
• With least squares, we minimize SSres, the residual sum of squares. This also happens to maximize SSreg, the sum of squares due to regression.
• With linear models, there is a mathematical solution to the problem that will minimize the sum of squares.

• With some models, like the logistic curve, there is no mathematical solution that will produce least squares estimates of the parameters.
• For many of these models, we use the concept of maximum likelihood.
• A likelihood is a conditional probability (e.g., P(Y|X), the probability of Y given X).
• We can pick the parameters of the model (α and β of the logistic curve) at random or by trial-and-error and then compute the likelihood of the data given those parameters.
• We will choose as our parameters those that result in the greatest likelihood computed. The estimates are called maximum likelihood estimates because the parameters are chosen to maximize the likelihood (the conditional probability of the sample data given the parameter estimates).
• The techniques actually employed to find the maximum likelihood estimates fall under the general label of numerical analysis.
• There are several methods of numerical analysis, but they all follow a similar series of steps.
• First, the computer picks some initial estimates of the parameters. Then it computes the likelihood of the data given these parameter estimates.
• Then it improves the parameter estimates slightly and recalculates the likelihood of the data.
• It repeats this until we tell it to stop, which we usually do when the parameter estimates no longer change much (usually a change of 0.01 or 0.001 is small enough to tell the computer to stop).
Example
A study was conducted to study the therapeutic effects of individual drugs in mice.
One part of this study was to determine the toxicity of each drug individually. Mice were
given varying doses in a parallel-groups fashion, and one primary outcome was
whether or not the mouse died from toxic causes during the 60-day study.
The observed numbers and proportions of toxic deaths are given in the following
table by dose, as well as the fitted values from fitting the logistic regression model

$$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

where $\pi(x)$ is the probability that a mouse that received a dose of x dies from toxicity.
Based on a computer analysis of the data we get the fitted equation

$$\hat{\pi}(x) = \frac{e^{-6.381 + 0.488x}}{1 + e^{-6.381 + 0.488x}}$$
Dose (mg/kg)   Total mice   Toxic deaths   Observed P(toxic death)   Fitted π̂(x)
8              87           1              1/87 = 0.011              0.077
12             77           38             38/77 = 0.494             0.372
16             69           54             54/69 = 0.783             0.806
20             49           45             45/49 = 0.918             0.967
24             41           41             41/41 = 1.000             0.995
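The iterative maximum likelihood procedure described earlier can be sketched for this data set. The notes do not say which algorithm the computer package used; the code below assumes a plain Newton-Raphson iteration on the grouped (binomial) log-likelihood, starting from α = β = 0, and the function name `fit_logistic` is illustrative. Under these assumptions the estimates should come out close to the fitted equation quoted above.

```python
import math

doses = [8, 12, 16, 20, 24]
totals = [87, 77, 69, 49, 41]
deaths = [1, 38, 54, 45, 41]

def fit_logistic(x, n, y, tol=1e-8, max_iter=100):
    """Newton-Raphson maximum likelihood fit of pi(x) = 1/(1+exp(-(a+b*x)))."""
    alpha, beta = 0.0, 0.0                    # initial estimates
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ni, yi in zip(x, n, y):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))
            resid = yi - ni * p               # observed minus expected deaths
            w = ni * p * (1.0 - p)            # binomial variance weight
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: inverse of the 2x2 information matrix times the score.
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) < tol and abs(d_beta) < tol:
            break                             # estimates no longer change
    se_beta = math.sqrt(h00 / det)            # from the inverse information
    return alpha, beta, se_beta

alpha_hat, beta_hat, se_beta = fit_logistic(doses, totals, deaths)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}, S.E.(beta) = {se_beta:.4f}")
for x in doses:
    fitted = 1.0 / (1.0 + math.exp(-(alpha_hat + beta_hat * x)))
    print(f"dose {x}: fitted P(toxic death) = {fitted:.3f}")
```

The stopping rule mirrors the description in the notes: the loop ends once an update changes the estimates by less than a small tolerance.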

To test whether or not P(toxic death) is associated with dose, we will test
H0: β=0 versus
H1: β≠0
Based on the computer analysis, we have $\hat{\beta} = 0.488$ and $\hat{\sigma}_{\hat{\beta}} = 0.0519$.
Now we can conduct the test for association at the α=0.05 significance level.
1. H0: β=0 (no association between dose and P(toxic death))
2. H1: β≠0 (association exists)
3. $X^2_{\text{calculated}} = (\hat{\beta} / \hat{\sigma}_{\hat{\beta}})^2 = (0.488/0.052)^2 = 88.071$
4. $X^2_{\text{tabulated}} = \chi^2_{0.05,1} = 3.84$; P-value < 0.0001
5. Since the calculated value exceeds the tabulated value, H0 is rejected: the association exists.
• A plot of the logistic regression and the observed proportions of toxic deaths is given in the following figure.
• The plot also depicts the dose at which the probability of death is 0.5 (50% of mice would die of toxicity at this dose). This is often referred to as the LD50 and is 13.087 mg/kg based on this fitted equation.
• Finally, the estimated odds ratio, the change in the odds of death for a unit increase in dose, is OR = $e^{\hat{\beta}} = e^{0.488} = 1.629$,
• which indicates that the odds of death increase by approximately 63% for each unit increase in dose.
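The test statistic, the LD50 and the odds ratio can be recomputed from the quoted estimates, as in the following minimal sketch. It uses the rounded values (-6.381, 0.488, 0.0519), so the results differ slightly from the figures in the notes: the chi-square comes out a little above 88 because the notes round the standard error to 0.052, and the LD50 comes out near 13.08 rather than 13.087, presumably because the package used unrounded coefficients. The P-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

alpha_hat, beta_hat, se_beta = -6.381, 0.488, 0.0519   # quoted estimates

# Chi-square statistic with 1 degree of freedom.
z = beta_hat / se_beta
chi2_calc = z ** 2

# Upper-tail chi-square(1) probability = two-sided normal tail probability.
p_value = math.erfc(abs(z) / math.sqrt(2))

# LD50: the dose x at which alpha + beta * x = 0, i.e. pi(x) = 0.5.
ld50 = -alpha_hat / beta_hat

# Odds ratio for a one-unit (1 mg/kg) increase in dose.
odds_ratio = math.exp(beta_hat)

print(f"chi-square = {chi2_calc:.2f} (tabulated value 3.84)")
print(f"P-value = {p_value:.3g}")
print(f"LD50 = {ld50:.2f} mg/kg")
print(f"OR per unit dose = {odds_ratio:.3f}")
```

`math.erfc` is used instead of subtracting a normal CDF from 1 because the tail probability here is far too small to survive that subtraction in floating point.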
Multiple logistic regression can be conducted by fitting a model with more than
one explanatory variable. It is similar to multiple linear regression in the sense that
we can test whether or not one explanatory variable is associated with the
dichotomous response variable after controlling for all
other explanatory variables.
[Figure: fitted logistic regression curve and observed proportions of toxic deaths by dose.]
Exercise
From the previous example (therapeutic effects of individual drugs in mice):
a. Construct four 2X2 tables, taking dose 8 as the reference (unexposed) group.
b. Calculate and interpret their respective ORs.

Probit Data Analysis
This procedure measures the relationship between the strength of a dose and the
proportion of cases exhibiting a certain response to the dose.
It is useful for situations where you have a dichotomous output that is thought to
be influenced or caused by levels of some independent variable(s) and is
particularly well suited to experimental data.
This procedure will allow you to estimate the strength of a dose required to induce
a certain proportion of responses, such as the median effective dose.
Example.
How effective is a new pesticide at killing insects, and what is an appropriate
concentration to use?
You might perform an experiment in which you expose samples of insects to
different concentrations of the pesticide and then record the number of insects
killed and the number of insects exposed.
Then by applying probit analysis to these data, you can determine the strength of
the relationship between concentration and killing, and you can determine what
the appropriate concentration of pesticide would be if you wanted to be sure to kill,
say, 50% of exposed insects.

When biological responses are plotted against their causal doses (or the logarithms of
the doses), they often form a sigmoid curve.
[Figure: sigmoid dose-response curve.]
Notes:
• Probit analysis is very similar to the logistic regression model but is preferred when the data are normally distributed.
• The most common aim of a dose-response experiment analysed by probit analysis is to estimate the dose that produces a response in 50% of subjects (D50).
• Probit analysis can be done by eye, through hand calculations, or by using a statistical program.
• Probit(0.025) = -1.96 = -probit(0.975)
• A plot of probit value versus ln(dose) will be more or less linear.
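The probit of a proportion is the corresponding quantile of the standard normal distribution, which can be computed with Python's standard library. This is a minimal sketch; the helper name `probit` is chosen here for illustration.

```python
from statistics import NormalDist

def probit(p):
    """Probit of a proportion p (0 < p < 1): the standard normal quantile."""
    return NormalDist().inv_cdf(p)

for p in (0.025, 0.5, 0.975):
    print(f"probit({p}) = {probit(p):.4f}")
```

A response proportion of 0.5 has probit 0, so on a plot of probit against ln(dose) the D50 is read off where the fitted line crosses zero.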

Exercise
Based on the following information from an experiment, apply the logit and probit models
and find the dose of insecticide (in dl) needed to kill 50% of the insects in the experiment.

Dose of insecticide in dl Total insects Number of death

0 10 0
2 10 0
4 10 1
9 10 2
16 10 6
25 10 8
36 10 9
38 10 9
39 10 9
40 10 10

Exercise

1. Identify a research topic which can be done by applying cross sectional, cohort or
case–control analytical epidemiological research design.
2. Provide general objective in line with your topic.
3. Provide sample size and sampling methods in line with your objective.
4. Provide and discuss appropriate statistical data analysis method to achieve your
general objective.
5. List at least 2 questions that should be included in your questionnaire in line with
your general objective.
