Biostatistics
Biostatistics combines two words, “Bio” and “Statistics”. The “Bio” part refers to biology, the study of living things, while the “Statistics” part refers to the accumulation, tracking, analysis and application of data. Biostatistics is the branch of statistics concerned with medical and health applications, and it underpins the methodology of epidemiological investigation and research. It is the use of statistical procedures and analysis in the study and practice of biology. In simple words, the branch of statistics that deals with data relating to living organisms is called biostatistics: statistical processes and methods applied to the collection, analysis, and interpretation of biological data, especially data relating to human biology, health and medicine.
Applications of Biostatistics
Biostatistics has applications in all life sciences. A few applications of biostatistics are summarized below.
b) In Demography
It is used in estimating the attributes of a population such as sex ratio, birth rates, density of population etc.
c) In Pharmacology
To find the action of a drug, the drug is given to animals or humans to see whether the observed changes are produced by the drug or by chance.
d) In Research
Research is incomplete without statistics. Every result needs to be statistically validated; designing the experiment, selecting the method of data collection, and deriving logical conclusions from the data all require adequate knowledge of statistics.
Variable
Types of Variable
a. Quantitative Variables
b. Qualitative Variables
Qualitative variables are also called categorical variables. Many characteristics cannot be measured numerically; some of them can be ordered (called ordinal) and some cannot be ordered (called nominal). Qualitative variables can be coded so that they appear numeric, but these numbers are meaningless. Examples are the classification of people into socio-economic groups, examination grades etc.
Discrete variables are characterized by gaps or interruptions in the values they can assume. These gaps indicate the absence of values between the particular values that the variable can take; such a variable takes only whole numbers. Examples are the number of admissions to general hospitals, the number of decayed or missing teeth per child in an elementary school, and the number of prescriptions an individual takes daily.
Continuous Variables
A continuous variable can assume any value within a specified relevant interval. Examples of continuous variables include the various measurements that can be made on individuals, such as height (inches), weight (pounds), skull circumference, heartbeat, blood pressure, and time to recovery (days).
Scale of Measurements
Not all characteristics can be measured on the same scale, nor is the same statistical procedure appropriate for every type of measurement. The psychologist Stanley Smith Stevens proposed four scales of measurement, which cover nearly all areas of learning.
They are:
Nominal Scale
As the name implies, it consists of “naming or labeling” observations, i.e. classifying them into mutually exclusive and collectively exhaustive categories, and the observations in each category are counted. Examples are gender (male, female), marital status (married, unmarried) etc. Data obtained on a nominal scale are called nominal data or qualitative data and are analyzed by the statistics of attributes. The summary statistic “mode” is computed from such data. Nominal data can be represented by a pie chart or bar chart.
Example: gender coded as Male=1, Female=2, or baseball uniform numbers; the numbers are labels only and provide no insight into the play.
Ordinal Scale
Qualitative observations can be ranked or ordered according to some criterion, e.g. with respect to some quality or performance, but the interval between categories is unknown or unequal. An ordinal scale possesses a natural ordering, for example qualification (Matric, Inter, BA/BSc, Master, M.Phil., Ph.D.) or feelings (very unhappy, unhappy, ok, happy, very happy). The defect of this scale is that it has unequal intervals, i.e. we do not know how much better one category is than another, nor can we say that the difference between ok and unhappy is the same as the difference between very happy and happy.
Data obtained on an ordinal scale are called ordinal or ranked data. Summary statistics such as the median, percentiles and Spearman’s rank correlation coefficient are computed from ordinal data. Ordinal data are best presented with a column or bar chart rather than a pie chart. Note: an ordinal scale implies a statement of “greater than” or “less than” without being able to state how much greater or less.
Interval Scale
The interval scale has ordered numeric values with fixed, equal intervals, and it can go below zero (i.e. it can take negative values) because “0” is not a true zero. On an interval scale the distance between adjacent values is the same, i.e. the distance between 5 and 6 degrees is the same as that between 7 and 8 degrees; it can also go below zero, for example the temperature of ice at -5 degrees. The interval scale tells us not only whether values are smaller or bigger but also how much bigger or smaller they are, unlike the ordinal scale. For example, if it is 45° on Sunday and 55° on Monday, we know not only that it was hotter on Monday but also that it was 10° hotter. Zero on this scale is an arbitrary point and does not mean the absence of the quantity, e.g. a temperature of zero degrees is hotter than -1 degree.
Statistics such as the mean, median and mode can easily be calculated from interval data.
Ratio Scale
Independent Variable
An independent variable is presumed to influence other variables. Independent variables are sometimes called manipulated variables or experimental variables. The independent variable is the presumed cause, whereas the dependent variable is the presumed effect.
Dependent Variables
A dependent variable is presumed to be affected by one or more independent variables. The dependent variable is often called an outcome variable.
For example: if we are interested in how stress affects heart rate in humans, stress is the independent variable and heart rate is the dependent variable.
Intervening/Mediating Variable
An intervening (mediating) variable is a variable whose existence is inferred but which cannot be measured directly. For example, when determining the effect of video clips on the learning ability of B.S. students, the association between video clips and learning ability needs to be explained.
Discrete Data
Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite), or it may go from 0, 1, 2 on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values 0-100 (finite case), but the number of flips needed to get 100 heads takes on values from 100 up to infinity.
Continuous Data
Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at a pump for cars with 20-gallon tanks would be continuous data, from zero gallons to 20 gallons, represented by the interval [0, 20] inclusive. You might pump 8.40 gallons, or 8.41, or 8.414 gallons, or any possible number from 0 to 20. In this way continuous data can be thought of as being uncountably infinite.
Qualitative/Categorical Data
Qualitative data are categorical measurements expressed not in terms of numbers but in terms of kinds or names. In statistics, “qualitative data” is often used interchangeably with “categorical data”. Categorical data represent characteristics such as a person’s gender, marital status, or the types of movies they like. Categorical data can take on numerical codes, such as “1” indicating male and “2” indicating female, but those numbers do not have mathematical meaning. A classic exercise in identifying categorical data is given below.
Amount of money earned last week, birth date, exercise, Favorite sports, horse steps per night,
Language mostly spoken at home, foot length, opinion on environment conservation etc.
Answer: Marital status is a qualitative/categorical variable. It can take on values such as “Married”, “Widowed”, and “Divorced”.
Answer: Song length is a quantitative variable. It can take on values such as “180 seconds”, “189.2 seconds”, and “210.0039 seconds”. It is a continuous quantitative variable because it can take on an infinite number of values within an interval.
Censored Data
Censoring occurs when we have some information about an individual’s survival time but do not know the survival time exactly.
For example
Leukemia Patients
As a simple example of censoring, consider leukemia patients followed until they go out of remission (the event, shown as “X”). If the study ends while a given patient is still in remission (i.e. the patient does not get the event), that patient’s survival time is considered censored. We know that for this person the survival time is at least as long as the period over which the person has been followed, but because the person goes out of remission, if at all, only after the study ends, we do not know the complete survival time.
Causes of Censoring
1) A person does not experience the event before the study ends.
2) A person is lost to follow-up during the study period.
3) A person withdraws from the study because of death (if death is not the event of interest) or for some other reason (e.g. an adverse drug reaction).
Person A is followed from the start of the study until getting the event at week 5. Therefore person A’s survival time is 5 weeks and is not censored.
Person B is also observed from the start of the study but is followed to the end of the 12-week study period without getting the event; the survival time here is censored because we can say only that it is at least 12 weeks.
Person C enters the study between the 2nd and 3rd week and is followed until he or she withdraws from the study at week 6; this person’s survival time is censored after 3.5 weeks.
In short, of the six persons, two were observed to get the event (persons A and F) and four were censored (B, C, D, and E).
A table of the survival time data for the six persons is presented as:
Types of Censoring
Right Censoring
When a person’s survival time becomes incomplete at the right side of the follow-up period, which occurs when the study ends or when the person is lost to follow-up or withdraws, it is called right censoring.
Left Censoring
When a person’s survival time becomes incomplete at the left side of the follow-up period for that person, it is called left censoring. For example, if we are following persons with HIV infection, we may start follow-up when a subject first tests positive for the HIV virus, but we may not know exactly the time of first exposure to the virus; thus the survival time is censored on the left side.
Sampled Population
Target Population
Suppose we want to know the opinion of GPGC Nowshera students about the examination system. The sampled population may consist of the students of the statistics department, the political science department, etc., and the target population will consist of all students in GPGC Nowshera.
Odds
The odds in favor of an event are the ratio of the probability that the event will happen to the probability that it will not happen.
For example: the odds that a randomly chosen day of the week is a Sunday are one to six, which is written 1/6 or 1:6.
Example: There are 5 pink marbles, 2 blue and 8 purple. What are the odds in favor of picking 1 blue marble?
Solution: The odds of picking one blue marble are odds = p/q, where “p” is the probability of picking one blue marble and q = 1 - p.
\[ p = \frac{2}{15}, \qquad q = 1 - p = 1 - \frac{2}{15} = \frac{13}{15} \]
\[ \text{Odds} = \frac{p}{q} = \frac{2/15}{13/15} = \frac{2}{13} \]
It means that the odds of picking a blue marble are 2 to 13, i.e. picking a blue marble is far less likely than picking a marble other than blue.
Example: The probability of diabetes in a patient is 5%. Find the odds of diabetes.
\[ \text{Odds} = \frac{0.05}{0.95} = \frac{1}{19} \]
Odds = 1:19
Odds Ratio
It is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group, i.e. the odds ratio compares the relative odds in each group.
         X-     X+     Total
Y-       a      b      a+b
Y+       c      d      c+d
Total    a+c    b+d    n
Since the odds ratio is the ratio of two odds,
\[ O.R = \frac{a/b}{c/d} = \frac{ad}{bc} \]
Odds can be computed from probability and probability can be computed from odds.
\[ \text{Odds in favor of } A = \frac{P(A)}{1 - P(A)} \]
\[ P(A) = \frac{\text{Odds in favor of } A}{1 + \text{Odds in favor of } A} \]
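The two conversions can be checked numerically. The following is a minimal sketch in Python; the function names `odds_in_favor` and `probability_from_odds` are illustrative choices, and exact fractions are used so that the marble and diabetes examples reproduce without rounding.

```python
from fractions import Fraction

def odds_in_favor(p):
    """Convert a probability p (0 <= p < 1) into odds in favor, p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Convert odds in favor back into a probability, odds / (1 + odds)."""
    return odds / (1 + odds)

blue = odds_in_favor(Fraction(2, 15))       # marble example: 2 blue out of 15
diabetes = odds_in_favor(Fraction(5, 100))  # diabetes example: p = 5%
print(blue, diabetes)                        # 2/13 1/19
print(probability_from_odds(blue))           # 2/15
```

Converting the odds 2/13 back gives the original probability 2/15, which confirms that the two formulas are inverses of each other.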
Note that if the odds are the same in each row, then the odds ratio is 1.
Interpretation
An odds ratio = 1 indicates that the condition or event under study is equally likely to occur in both groups.
If O.R > 1
An odds ratio > 1 indicates that the condition or event under study is more likely to occur in the first group.
If O.R < 1
An odds ratio < 1 indicates that the condition or event under study is less likely to occur in the first group.
The odds ratio must be non-negative, i.e. O.R ≥ 0. If the odds in the first group approach zero, the odds ratio approaches zero. When the odds in the second group approach zero, the odds ratio approaches ∞.
Example: Consider the following data on the survival of passengers on the Titanic. There were 851 male passengers, of whom 142 survived and 709 died (the solution below uses 154 deaths and 308 survivors among female passengers). Compute the odds ratio and interpret your result.
Solution:
\[ \text{Odds of death among males} = \frac{a}{b} = \frac{709}{142} \]
\[ \text{Odds of death among females} = \frac{c}{d} = \frac{154}{308} \]
\[ O.R = \frac{\text{Odds of death among males}}{\text{Odds of death among females}} = \frac{709/142}{154/308} \]
O.R = 9.99
The odds of dying on the Titanic were about 10 times higher for males than for females.
Example: Suppose that in a sample of 100 men, 60 drank wine in the previous week, while in a sample of 100 women, only 20 drank wine in the same period. Calculate the odds ratio and comment on your result.
\[ \text{Odds of men who drink wine} = \frac{a}{b} = \frac{60}{40} \]
\[ \text{Odds of women who drink wine} = \frac{c}{d} = \frac{20}{80} \]
\[ O.R = \frac{\text{Odds of men who drank wine}}{\text{Odds of women who drank wine}} = \frac{60/40}{20/80} \]
O.R = 6
Interpretation:
The odds of having drunk wine in the previous week are 6 times higher for males than for females.
Question: The prevalence of smoking among lung cancer patients is 95 per 100, and the prevalence of smoking among people without lung cancer is 25 per 100. Calculate the odds ratio and comment on your result.
Solution:
\[ \text{Odds of smoking among lung cancer patients} = \frac{a}{b} = \frac{95}{5} \]
\[ \text{Odds of smoking among people without lung cancer} = \frac{c}{d} = \frac{25}{75} \]
\[ O.R = \frac{95/5}{25/75} \]
O.R = 57
Interpretation:
The odds of smoking among lung cancer patients are 57 times the odds of smoking among people without lung cancer.
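The three worked examples all apply the same formula O.R = ad/bc. A minimal Python sketch, assuming the cell layout a, b (first group) and c, d (second group) used in the solutions above; the function name `odds_ratio` is an illustrative choice.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d) = ad / bc for a 2x2 table."""
    return (a * d) / (b * c)

titanic = odds_ratio(709, 142, 154, 308)  # deaths vs survivors, males vs females
wine = odds_ratio(60, 40, 20, 80)         # drinkers vs non-drinkers, men vs women
smoking = odds_ratio(95, 5, 25, 75)       # smokers vs non-smokers, cases vs non-cases

print(round(titanic, 2), wine, smoking)   # 9.99 6.0 57.0
```

Note that the function fails with a division by zero when b or c is 0; handling such tables is outside the scope of this sketch.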
Important Questions
Answer: An odds ratio of 0.5 means that the odds of the exposure being found in the case group are 50% lower than the odds of finding the exposure in the control group.
Answer: An odds ratio of 0.75 means that the odds of the outcome in the first group are 25% lower than in the second group, i.e. an odds ratio less than “1” means that the first group is less likely to experience the event. Equivalently, the reciprocal 1/0.75 = 1.33 means that the odds of the outcome in the second group are 33% higher than in the first group.
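\[ S.E(\ln\hat\theta) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]
Knowing this S.E, one can test the hypothesis H0: ln(θ) = 0, where θ denotes the odds ratio, and construct the confidence interval
\[ \ln\hat\theta \pm Z_{\alpha/2}\, S.E(\ln\hat\theta) \]
where “Zα/2” is the value of “Z” defining the confidence limits.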
Example:
Calculate (1) the odds ratio, (2) test the hypothesis ln(θ) = 0, (3) the C.I for ln(θ).
Solution:
\[ \text{Odds of men who drink wine} = \frac{a}{b} = \frac{60}{40} \]
\[ \text{Odds of women who drink wine} = \frac{c}{d} = \frac{30}{70} \]
\[ O.R = \frac{60/40}{30/70} = 3.5 \]
Interpretation:
The odds of drinking wine are 3.5 times higher for males than for females.
Solution:
2) Level of significance
We set α = 0.05
4) Computation
\[ \hat\theta = 3.5, \qquad \ln\hat\theta = 1.2527 \]
\[ Z = \frac{1.2527}{\sqrt{\frac{1}{60} + \frac{1}{40} + \frac{1}{30} + \frac{1}{70}}} = \frac{1.2527}{0.2988} \]
Z = 4.192
5) Critical Region
\[ |Z| \ge Z_{0.025} = 1.96 \]
Since Z = 4.19 falls in the critical region, we reject H0 and conclude that the association between sex and wine drinking is significant at the α = 0.05 level.
The 95% C.I for ln(θ)
\[ 1.2527 \pm 1.96\,(0.2988) \]
(1.2527 - 0.5856, 1.2527 + 0.5856)
(0.6671, 1.8383)
The 95% C.I for θ
\[ (e^{0.6671},\; e^{1.8383}) \]
(1.949, 6.286)
Since the C.I for θ does not include “1”, there is a significant association between gender and wine drinking.
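The test and the interval for this example can be reproduced in a few lines. This is a sketch only, using the counts 60, 40, 30, 70 from the solution and the critical value 1.96; the function name `log_odds_ratio_inference` is an illustrative choice.

```python
import math

def log_odds_ratio_inference(a, b, c, d, z_crit=1.96):
    """Z statistic and confidence interval for an odds ratio on the log scale."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # S.E of ln(O.R)
    z = log_or / se                                  # tests H0: ln(theta) = 0
    ci_log = (log_or - z_crit * se, log_or + z_crit * se)
    ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))
    return odds_ratio, z, ci_log, ci_or

odds_ratio, z, ci_log, ci_or = log_odds_ratio_inference(60, 40, 30, 70)
print(f"O.R = {odds_ratio:.2f}, Z = {z:.2f}")
print(f"95% C.I for ln(theta): ({ci_log[0]:.4f}, {ci_log[1]:.4f})")
print(f"95% C.I for theta:     ({ci_or[0]:.3f}, {ci_or[1]:.3f})")
```

Small differences in the last decimal place compared with a hand calculation come from rounding ln(3.5) and the standard error at intermediate steps.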
Question:
Incidence
\[ \text{Incidence} = \frac{\text{New Cases}}{\text{Total Population}} \]
OR
\[ \text{Incidence Rate} = \frac{\text{No. of New Cases}}{\text{No. of People at risk in given time Frame}} \]
Example: Over the course of 1 year, 5 women are diagnosed with breast cancer out of a total female study population of 200.
Solution: Five women are diagnosed with breast cancer out of the total female study population of 200 (who did not have breast cancer at the beginning of the study period). The incidence of breast cancer in this population is:
\[ \text{Incidence} = \frac{5}{200} = 0.025 \]
\[ \text{Incidence} = 0.025 \times 1000 = 25 \text{ per 1000 women per year} \]
Question: In a population of 1000 non-diseased persons, 28 were infected with HIV over two years of observation.
Solution:
\[ \text{Incidence} = \frac{\text{New Cases}}{\text{Total Population}} \times K \]
\[ \text{Incidence} = \frac{28}{1000} \times 100 \]
Incidence = 2.8% per two-year period
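Question: 100 new cases occurred in a population of 50000 in a year. Calculate the incidence rate.
Solution:
\[ \text{Incidence Proportion} = \frac{100}{50000} \times K = 0.002 \times K \]
For example, with K = 1000 this is 2 new cases per 1000 population per year.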
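The incidence calculations above all follow the pattern new cases / population at risk × K. A minimal Python sketch; the function name `incidence` and the default K = 1 are illustrative assumptions, and the multiplier K is chosen by the analyst (100 for a percentage, 1000 for a rate per 1000).

```python
def incidence(new_cases, population_at_risk, k=1):
    """New cases per k persons at risk over the observation period."""
    return new_cases * k / population_at_risk

print(incidence(5, 200, k=1000))      # breast cancer example: 25.0 per 1000
print(incidence(28, 1000, k=100))     # HIV example: 2.8 (% over two years)
print(incidence(100, 50000, k=1000))  # 2.0 per 1000 per year
```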
Prevalence
It refers to all cases, “old and new”, existing at a given point or period of time in a given population: the total number of individuals who have an attribute or disease at a particular time (or during a particular period) divided by the population at risk at that time (or the mid-year population). A prevalence rate is the total number of cases of a disease existing in a population divided by the total population.
Question: In a population of 40000 people, 1200 were recently diagnosed with cancer and 3500 are living with cancer. Find the prevalence rate.
Solution:
\[ \text{Prevalence Rate} = \frac{1200 + 3500}{40000} \times 1000 \]
Prevalence Rate = 117.5, i.e. about 118 cases per 1000 people
Types of Prevalence
Point Prevalence
The number of all current cases (old + new) of a disease at one point of time in relation to a defined population at that point of time. The point of time may be a day, several days, weeks etc., depending upon the time required to examine the entire population.
\[ \text{Point Prevalence} = \frac{\text{No. of all cases (old + new) of a specified disease existing at a given point of time}}{\text{Estimated Population at same point of time}} \]
Solution:
\[ \text{Point Prevalence} = \frac{100}{200} \times 100 \]
Point Prevalence = 50
So there were 50 trachoma patients per 100 students on 10th March 1997, which means that 50% of the students in that school were affected by trachoma.
Period Prevalence
The proportion of individuals in a specified population at risk who have the disease of interest over a specified period of time, e.g. annual prevalence or lifetime prevalence. (When the type of prevalence rate is not specified, it is usually point prevalence.)
Question: Between June 30 and August 30, 1999, the average population was 1600; there were 29 existing cases of hepatitis B on June 30 and 6 incident (new) cases of hepatitis B between July 1 and August 30. Find the period prevalence.
Solution:
\[ \text{Period Prevalence} = \frac{29 + 6}{1600} \]
Period Prevalence = 0.022
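Prevalence counts old and new cases together, which is the only difference from the incidence calculation. A short Python sketch of this; the function name `prevalence` and its argument names are illustrative assumptions, with K again chosen by the analyst.

```python
def prevalence(existing_cases, new_cases, population, k=1):
    """All cases (old + new) per k persons in the population."""
    return (existing_cases + new_cases) * k / population

# Cancer example: 3500 living with cancer, 1200 recently diagnosed, 40000 people
print(prevalence(3500, 1200, 40000, k=1000))   # 117.5 per 1000

# Hepatitis B example: 29 existing and 6 new cases, average population 1600
print(round(prevalence(29, 6, 1600), 3))       # 0.022
```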
Solution:
\[ \text{Point Prevalence} = \frac{\text{No. of all cases (old + new) of a specified disease existing at a given point of time}}{\text{Estimated Population at same point of time}} \]
\[ \text{Point Prevalence} = \frac{125}{1000} \]
Point Prevalence = 0.125
\[ \text{Incidence Proportion} = \frac{\text{No. of New Cases}}{\text{No. of People at risk in given time Frame}} \]
\[ \text{Incidence Proportion} = \frac{25}{900} \]
Incidence Proportion = 0.028
Relative Risk
A relative risk can only be calculated from prospective studies (cohort studies). It can be defined as the ratio of the incidence rate among the exposed to the incidence rate among the non-exposed.
Mathematically
\[ R.R = \frac{a/(a+b)}{c/(c+d)} \]
Consider the following 2×2 contingency table for the calculation of measures of association.
                Outcome
Exposure     Present   Absent   Total
Present      a         b        a+b
Absent       c         d        c+d
Total        a+c       b+d      N
Interpretation
If R.R=1
If R.R=1, then the incidence in the exposed is the same as the incidence in the non-exposed, i.e. there is no association between exposure and outcome.
If R.R>1
If R.R>1, then the incidence in the exposed is greater than the incidence in the non-exposed, i.e. there is an increased risk of the outcome among the exposed. It is a positive association: the exposure is harmful, so those who are exposed are at higher risk of suffering from the disease than those who are not exposed.
If R.R<1
If R.R<1, then the incidence in the exposed is lower than the incidence in the non-exposed, i.e. there is a decreased risk. It is a negative association: the exposure is protective.
For example: providing a vaccine to a group is the exposure, and not providing the vaccine is non-exposure. If R.R<1, providing the vaccine is protective.
Note: The further the R.R is from 1, the stronger the association.
Example: Suppose we are researching the effect of benzene exposure on cancer. We go to a work center where there is known potential for exposure to benzene. There are 483 people in the work center; however, only 212 (about 44% of the work center employees) are exposed to benzene in their work duties. Our investigation finds that 40 people with cancer were in the exposed group. Calculate the relative risk.
                          Cancer
                          Yes     No      Total
Benzene Exposure          40      172     212
Not Benzene Exposure      18      253     271
Total                     58      425     483
Solution:
\[ \text{Disease risk among exposed} = \frac{a}{a+b} = \frac{40}{212} = 0.1887 \]
\[ \text{Disease risk among not exposed} = \frac{c}{c+d} = \frac{18}{271} = 0.0664 \]
\[ R.R = \frac{0.1887}{0.0664} = 2.84 \]
We can say that those who are exposed to benzene are 2.84 times more likely to get cancer than those who are not exposed to benzene.
Question:
                 Outcome
                 Present   Absent   Total
Exposure         366       32       398
No Exposure      64        319      383
Total            430       351      781
Solution:
\[ \text{Disease risk among exposed} = \frac{a}{a+b} = \frac{366}{398} = 0.9196 \]
\[ \text{Disease risk among not exposed} = \frac{c}{c+d} = \frac{64}{383} = 0.1671 \]
R.R = 5.50
Interpretation:
We can say that the exposed group is 5.50 times more likely to experience the outcome than the non-exposed group.
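Both relative risk examples can be verified with a small helper. This is a sketch assuming the cell layout a, b (exposed: outcome present, absent) and c, d (non-exposed); the function name `relative_risk` is an illustrative choice.

```python
def relative_risk(a, b, c, d):
    """Risk among exposed, a/(a+b), divided by risk among non-exposed, c/(c+d)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

print(round(relative_risk(40, 172, 18, 253), 2))   # benzene example: 2.84
print(round(relative_risk(366, 32, 64, 319), 2))   # second table: 5.5
```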
Question:
\[ \text{Incidence of LBW among smokers} = \frac{a}{a+b} = 0.33 \]
\[ \text{Incidence of LBW among non-smokers} = \frac{c}{c+d} = 0.09375 \]
R.R = 3.6
Interpretation:
Based on the study, smokers are 3.6 times more likely to have a low birth weight (LBW) baby than non-smokers.
Question:
In a prospective study of pregnant women, information was collected on the exercise level of low-risk pregnant women. A group of 217 women did no voluntary exercise during the pregnancy, while a group of 238 women exercised extensively. The outcome variable of interest is experiencing preterm labor. The result is summarized as:
Solution:
\[ \text{Incidence of preterm labor with extreme exercise} = \frac{a}{a+b} = 0.092 \]
R.R = 1.12
Interpretation:
The result indicates that the risk of experiencing preterm labor when a woman exercises heavily is 1.12 times the risk for a woman who does not exercise at all.
To find the confidence interval, use the value 1.96 from the standard normal distribution for a 95% C.I.
\[ S.E_{\ln(R.R)} = \sqrt{\frac{b}{a(a+b)} + \frac{d}{c(c+d)}} \]
\[ \ln(R.R) \pm 1.96\, S.E_{\ln(R.R)} \]
If the 95% C.I for R.R does not contain the value “1”, the association is said to be statistically significant at the α = 0.05 level.
Question:
Physicians enrolled in the Physicians’ Health Study were randomly assigned to take daily aspirin or placebo. The table provides the number with M.I in each group.
Calculate (1) the R.R (2) the 95% C.I for R.R
Solution:
\[ \text{Incidence of M.I among aspirin} = \frac{a}{a+b} = \frac{139}{11037} = 0.0126 \]
\[ \text{Incidence of M.I among placebo} = \frac{c}{c+d} = \frac{239}{11034} = 0.0217 \]
\[ R.R = \frac{0.0126}{0.0217} = 0.58 \]
Interpretation:
The relative risk estimate is 0.58, which indicates that physicians in the aspirin group had a lower risk of M.I than physicians in the placebo group.
\[ S.E_{\ln(R.R)} = \sqrt{\frac{b}{a(a+b)} + \frac{d}{c(c+d)}} \]
\[ S.E_{\ln(R.R)} = \sqrt{\frac{10898}{139(11037)} + \frac{10795}{239(11034)}} \]
\[ S.E_{\ln(R.R)} = 0.1058 \]
The 95% C.I for ln(R.R)
\[ \ln(0.58) \pm 1.96\,(0.1058) \]
(-0.5447 - 0.207368, -0.5447 + 0.207368)
(-0.752068, -0.337332)
The 95% C.I for R.R
\[ (e^{-0.752068},\; e^{-0.337332}) \]
(0.4714, 0.7137)
The 95% C.I indicates that the decreased risk related to daily aspirin use is significant at the α = 0.05 level, since the interval does not contain “1”.
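The interval can be recomputed from the four cell counts. The sketch below assumes the counts that appear in the standard error calculation (a = 139, b = 10898, c = 239, d = 10795); the function name `relative_risk_ci` is an illustrative choice. Because it does not round the R.R to 0.58 before taking logarithms, its limits differ from the hand calculation in the third decimal place.

```python
import math

def relative_risk_ci(a, b, c, d, z_crit=1.96):
    """Relative risk with a confidence interval built on the log scale."""
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(b / (a * (a + b)) + d / (c * (c + d)))  # S.E of ln(R.R)
    log_rr = math.log(rr)
    lower = math.exp(log_rr - z_crit * se)
    upper = math.exp(log_rr + z_crit * se)
    return rr, se, lower, upper

rr, se, lower, upper = relative_risk_ci(139, 10898, 239, 10795)
print(f"R.R = {rr:.3f}, S.E = {se:.4f}, 95% C.I = ({lower:.3f}, {upper:.3f})")
significant = not (lower <= 1 <= upper)   # interval excludes 1
print(significant)                        # True
```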
Sensitivity P(T+ | D+)
It is the probability of a positive test result given that the individual has the disease, i.e. the likelihood of a diseased individual getting a positive test result. It is also called the true positive rate. Its complement is the false negative rate.
\[ P(T^+ \mid D^+) = \frac{a}{a+c} = \frac{TP}{TP+FN} \]
False Negative P(T- | D+)
The probability that the test result is negative when the person is actually suffering from the disease, i.e. the probability of a negative test result given that the person has the disease.
\[ P(T^- \mid D^+) = \frac{c}{a+c} = \frac{FN}{TP+FN} \]
Specificity P(T- | D-)
The probability of a negative test result given that the individual does not have the disease. It is also called the true negative rate.
\[ P(T^- \mid D^-) = \frac{d}{b+d} = \frac{TN}{TN+FP} \]
False Positive P(T+ | D-)
The probability that the test result is positive when the person is actually not suffering from the disease, i.e. the probability of a positive test result given that the person does not have the disease.
\[ P(T^+ \mid D^-) = \frac{b}{b+d} = \frac{FP}{FP+TN} \]
Positive Predictive Value (P.P.V) P(D+ | T+)
The probability that a person who tests positive has the disease, i.e. the probability that a subject has the disease given that the subject has a positive test result.
\[ P(D^+ \mid T^+) = \frac{a}{a+b} = \frac{TP}{TP+FP} \]
Negative Predictive Value (N.P.V) P(D- | T-)
The probability that a person who tests negative does not have the disease, i.e. the probability that a subject does not have the disease given that the subject has a negative test result.
\[ P(D^- \mid T^-) = \frac{d}{c+d} = \frac{TN}{TN+FN} \]
Question:
\[ \text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP+FN} = \frac{a}{a+c} = \frac{200}{400} = 0.5 \]
This means that there is a 50% chance that a person gets a positive test result when he actually has the disease.
\[ \text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN+FP} = \frac{d}{b+d} = \frac{450}{600} = 0.75 \]
This means that there is a 75% chance that a person gets a negative test result when he actually has no disease.
\[ \text{False Negative} = P(T^- \mid D^+) = \frac{FN}{TP+FN} = \frac{c}{a+c} = \frac{200}{400} = 0.5 \]
This means that there is a 50% chance that a person gets a negative test result when he actually has the disease.
\[ \text{False Positive} = P(T^+ \mid D^-) = \frac{FP}{FP+TN} = \frac{b}{b+d} = \frac{150}{600} = 0.25 \]
This means that there is a 25% chance that a person gets a positive test result when he actually has no disease.
\[ \text{Positive Predictive Value (P.P.V)} = P(D^+ \mid T^+) = \frac{TP}{TP+FP} = \frac{a}{a+b} = \frac{200}{350} = 0.57 \]
This means that there is a 57% chance that a person who tests positive actually has the disease.
\[ \text{Negative Predictive Value (N.P.V)} = P(D^- \mid T^-) = \frac{TN}{TN+FN} = \frac{d}{c+d} = \frac{450}{650} = 0.69 \]
This means that there is a 69% chance that a person who tests negative actually has no disease.
Question:
\[ \text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP+FN} = \frac{a}{a+c} = \frac{36}{45} = 0.8 \]
This means that there is an 80% chance that a person gets a positive test result when he actually has the disease.
\[ \text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN+FP} = \frac{d}{b+d} = \frac{230}{255} = 0.90 \]
This means that there is a 90% chance that a person gets a negative test result when he actually has no disease.
\[ \text{False Negative} = P(T^- \mid D^+) = \frac{FN}{TP+FN} = \frac{c}{a+c} = \frac{9}{45} = 0.2 \]
This means that there is a 20% chance that a person gets a negative test result when he actually has the disease.
\[ \text{False Positive} = P(T^+ \mid D^-) = \frac{FP}{FP+TN} = \frac{b}{b+d} = \frac{25}{255} = 0.098 \]
This means that there is a 9.8%, or about 10%, chance that a person gets a positive test result when he actually has no disease.
\[ \text{Positive Predictive Value (P.P.V)} = P(D^+ \mid T^+) = \frac{TP}{TP+FP} = \frac{a}{a+b} = \frac{36}{61} = 0.59 \]
This means that there is a 59% chance that a person who tests positive actually has the disease.
\[ \text{Negative Predictive Value (N.P.V)} = P(D^- \mid T^-) = \frac{TN}{TN+FN} = \frac{d}{c+d} = \frac{230}{239} = 0.96 \]
This means that there is a 96% chance that a person who tests negative actually has no disease.
Note: P.P.V and N.P.V are affected by prevalence; when prevalence increases, P.P.V increases and N.P.V decreases.
\[ P.P.V = \frac{TP}{\text{All Test Positive}} = \frac{TP}{TP+FP}, \qquad N.P.V = \frac{TN}{\text{All Test Negative}} = \frac{TN}{FN+TN} \]
P.P.V increases with increased specificity, so the higher the specificity, the higher the P.P.V. P.P.V also increases with prevalence. N.P.V increases with increased sensitivity and decreases with increased prevalence, so the higher the prevalence, the lower the N.P.V.
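All six measures come from the same four cells of the 2×2 table. The following Python sketch computes them together, assuming the counts implied by the two worked questions (TP = 200, FP = 150, FN = 200, TN = 450, and TP = 36, FP = 25, FN = 9, TN = 230); the function name `diagnostic_metrics` and the dictionary keys are illustrative choices.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic test measures from the cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),          # P(T+ | D+)
        "specificity": tn / (tn + fp),          # P(T- | D-)
        "false_negative_rate": fn / (tp + fn),  # P(T- | D+)
        "false_positive_rate": fp / (fp + tn),  # P(T+ | D-)
        "ppv": tp / (tp + fp),                  # P(D+ | T+)
        "npv": tn / (tn + fn),                  # P(D- | T-)
    }

first = diagnostic_metrics(tp=200, fp=150, fn=200, tn=450)
second = diagnostic_metrics(tp=36, fp=25, fn=9, tn=230)
for name, value in second.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and the false negative rate always sum to 1, as do specificity and the false positive rate, which gives a quick consistency check on any hand calculation.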
Observational Studies
There are two basic types of Observational Studies
Prospective study
A prospective study is an observational study in which two random samples of subjects are selected. One sample consists of subjects who possess the risk factor, and the other sample consists of subjects who do not possess the risk factor. The subjects are followed into the future, i.e. they are followed prospectively, and a record is kept of the number of subjects in each sample who at some point in time are classifiable into each of the categories of the outcome variable. The data resulting from a prospective study involving two dichotomous variables can be displayed in a 2×2 contingency table that usually provides information regarding the number of subjects with and without the risk factor and the number who did and did not succumb to the disease of interest, as well as the frequency for each combination of categories of the two variables.
                 Disease Status
Risk Factor      Present   Absent   Total
Present          a         b        a+b
Absent           c         d        c+d
Total            a+c       b+d      n
Retrospective Study
A retrospective study is a type of observational study in which two groups with different known outcomes are compared; that is, one group has the disease and the other does not. We compare the subjects who have the disease (the cases) with subjects who do not have that disease (the controls). We calculate the odds ratio (O.R) from such a case-control study.
Risk Factor
A risk factor is something that increases your chance of getting a disease; the risk may come from something you do. For example, smoking increases your chance of developing colon cancer; therefore smoking is a risk factor for colon cancer.
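\[ P.P.V = \frac{TP}{TP+FP}, \qquad N.P.V = \frac{TN}{TN+FN} \]
Here T denotes a positive test, \bar{T} a negative test, D disease present and \bar{D} disease absent.
\[ P(D \mid T) = \frac{P(T \cap D)}{P(T)} \]
\[ T = (T \cap D) \cup (T \cap \bar{D}) \]
\[ P(T) = P(T \cap D) + P(T \cap \bar{D}) \qquad \text{Equation A} \]
\[ P(T \cap D) = P(D)\, P(T \mid D) \]
\[ P(T \cap \bar{D}) = P(\bar{D})\, P(T \mid \bar{D}) \]
Put in Equation A
\[ P(T) = P(D)\, P(T \mid D) + P(\bar{D})\, P(T \mid \bar{D}) \]
\[ P(D \mid T) = \frac{P(D)\, P(T \mid D)}{P(D)\, P(T \mid D) + P(\bar{D})\, P(T \mid \bar{D})} \]
\[ P(\bar{D}) = 1 - P(D) \]
Similarly, for a negative test result:
\[ P(\bar{D} \mid \bar{T}) = \frac{P(\bar{D} \cap \bar{T})}{P(\bar{T})} \]
\[ \bar{T} = (\bar{T} \cap \bar{D}) \cup (\bar{T} \cap D) \]
\[ P(\bar{T}) = P(\bar{T} \cap \bar{D}) + P(\bar{T} \cap D) \qquad \text{Equation A} \]
\[ P(\bar{T} \cap \bar{D}) = P(\bar{D})\, P(\bar{T} \mid \bar{D}) \]
Put in Equation A
\[ P(\bar{T}) = P(\bar{D})\, P(\bar{T} \mid \bar{D}) + P(D)\, P(\bar{T} \mid D) \]
\[ P(\bar{D} \mid \bar{T}) = \frac{P(\bar{D})\, P(\bar{T} \mid \bar{D})}{P(\bar{D})\, P(\bar{T} \mid \bar{D}) + P(D)\, P(\bar{T} \mid D)} \]
\[ P(\bar{T} \mid D) = 1 - \text{Sensitivity} \]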
Question:
A medical research team wishes to evaluate a proposed screening test for Alzheimer’s disease. The test was given to a random sample of 450 patients with Alzheimer’s disease and an independent random sample of 500 patients without symptoms of the disease. The two samples were drawn from populations of subjects who were 65 years of age or older. The result is as follows.
              Disease Status
Test          Alzheimer’s Present   Alzheimer’s Absent   Total
T+            436                   5                    441
T-            14                    495                  509
Total         450                   500                  950
Based on another independent study, it is known that the percentage of patients with Alzheimer’s disease is 11.3% out of all subjects who were 65 years of age or older. First we calculate sensitivity and specificity as follows.
Solution:
\[ \text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP+FN} = \frac{a}{a+c} = \frac{436}{450} = 0.9689 \]
This means that there is about a 97% chance that a person gets a positive test result when he actually has the disease.
\[ \text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN+FP} = \frac{d}{b+d} = \frac{495}{500} = 0.99 \]
This means that there is a 99% chance that a person gets a negative test result when he actually has no disease.
For the positive predictive value of the test, we wish to estimate the probability that a subject who is positive on the test has Alzheimer’s disease.
\[ P(D \mid T) = \frac{(0.113)(0.9689)}{(0.113)(0.9689) + (0.887)(0.01)} \]
\[ P(D \mid T) = \frac{0.10949}{0.10949 + 0.00887} = \frac{0.10949}{0.11836} \]
\[ P(D \mid T) = 0.925 \]
This means that about 93% of the subjects who test positive have the disease.
For the negative predictive value:
\[ P(\bar{D} \mid \bar{T}) = \frac{P(\bar{D})\, P(\bar{T} \mid \bar{D})}{P(\bar{D})\, P(\bar{T} \mid \bar{D}) + P(D)\, P(\bar{T} \mid D)} \]
\[ P(\bar{D} \mid \bar{T}) = \frac{(0.887)(0.99)}{(0.887)(0.99) + (0.113)(0.0311)} \]
\[ P(\bar{D} \mid \bar{T}) = \frac{0.87813}{0.87813 + 0.00351} \]
\[ P(\bar{D} \mid \bar{T}) = 0.996 \]
This means that more than 99% of the subjects who test negative do not have the disease.
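The Bayes calculation can be sketched in Python as follows. The function name `predictive_values` is an illustrative choice; the inputs are the table counts (436/450 and 495/500) and the prevalence 11.3% stated in the question. With the unrounded sensitivity the results come out near 0.925 and 0.996, so a hand calculation that rounds the inputs first may differ slightly.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """P.P.V and N.P.V from sensitivity, specificity and prevalence (Bayes' theorem)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

ppv, npv = predictive_values(436 / 450, 495 / 500, 0.113)
print(f"P.P.V = {ppv:.3f}, N.P.V = {npv:.3f}")

# Same test at a lower, hypothetical prevalence of 1%: P.P.V falls, N.P.V rises.
ppv_low, npv_low = predictive_values(436 / 450, 495 / 500, 0.01)
print(f"P.P.V = {ppv_low:.3f}, N.P.V = {npv_low:.4f}")
```

The second call uses a made-up prevalence purely to illustrate how strongly the predictive values depend on how common the disease is.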
Likelihood Ratio
A likelihood ratio describes how many times more likely a person with the disease is to receive a particular test result than a person without the disease. The positive likelihood ratio (LR+) tells how much more likely a positive result is in a patient with the disease than in a patient without the disease. The negative likelihood ratio (LR-) tells how much more likely a negative result is in a patient with the disease than in a patient without the disease.
The LR+ of a positive test result is usually a number greater than “1”, and the LR- of a negative test result usually lies between 0 and 1. When LR=1 the test is useless, which means that the test result has very little influence on whether a patient does or does not have the disease.
\[ LR^+ = \frac{P(T \mid D)}{P(T \mid \bar{D})} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} \]
\[ LR^- = \frac{P(\bar{T} \mid D)}{P(\bar{T} \mid \bar{D})} = \frac{1 - \text{Sensitivity}}{\text{Specificity}} \]
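As an illustration, the two likelihood ratios can be computed from sensitivity and specificity. This sketch reuses the values 36/45 and 230/255 from the earlier worked question; applying likelihood ratios to that question is an addition made here for illustration, and the function name `likelihood_ratios` is an illustrative choice.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a diagnostic test."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

lr_pos, lr_neg = likelihood_ratios(36 / 45, 230 / 255)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Here LR+ is well above 1 and LR- lies between 0 and 1, matching the usual pattern described above. A specificity of exactly 1 would make LR+ undefined (division by zero), which this sketch does not handle.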
Test Accuracy
The accuracy of a test is its ability to differentiate the patients and the healthy cases correctly. To estimate the accuracy of the test we calculate the proportion of true positives and true negatives among all evaluated cases. Mathematically it can be stated as:
$\text{Test Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{Test Accuracy} = \frac{a + d}{a + b + c + d}$
Scenario-I
Test with 100% accuracy, 100% sensitivity and 100% specificity. Imagine we have a sample of 100 cases, 50 healthy and the other 50 patients. If a test is positive for all patients and negative for all healthy ones, it is 100% accurate. As the figure shows, the test has been able to differentiate the healthy cases and the patients exactly. In this example the accuracy, sensitivity and specificity of the test are:
$\text{Test Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{a + d}{a + b + c + d}$
$\text{Test Accuracy} = \frac{50 + 50}{50 + 0 + 0 + 50}$
$\text{Test Accuracy} = 1 \text{ or } 100\%$
$\text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP + FN} = \frac{a}{a + c} = \frac{50}{50 + 0} = 1 \text{ or } 100\%$
This means that there is a 100% chance that a person gets a positive test result when he actually has the disease.
$\text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN + FP} = \frac{d}{b + d} = \frac{50}{50 + 0} = 1 \text{ or } 100\%$
This means that there is a 100% chance that a person gets a negative test result when he actually has no disease.
Taking these statistical characteristics into account, this test is appropriate for both screening and final verification of a disease.
Scenario-II
Test with 75% accuracy, 50% sensitivity and 100% specificity. Suppose the test can only diagnose 25 of the 50 patients and has reported the others as healthy (as we see from Figure II). Of the 100 cases that have been tested, the test determined 25 patients and 50 healthy cases correctly, therefore the accuracy of the test is 75%. The accuracy, sensitivity and specificity are given below.
$\text{Test Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{a + d}{a + b + c + d}$
$\text{Test Accuracy} = \frac{25 + 50}{25 + 50 + 0 + 25}$
$\text{Test Accuracy} = 0.75 \text{ or } 75\%$
$\text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP + FN} = \frac{a}{a + c} = \frac{25}{50} = 0.5 \text{ or } 50\%$
This means that there is a 50% chance that a person gets a positive test result when he actually has the disease.
$\text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN + FP} = \frac{d}{b + d} = \frac{50}{0 + 50} = 1 \text{ or } 100\%$
This means that there is a 100% chance that a person gets a negative test result when he actually has no disease.
This test is not suitable for screening purposes but is suitable for final confirmation of a disease.
Scenario III
Test with 75% accuracy, 100% sensitivity and 50% specificity. Suppose the test has been able to identify only 25 of the 50 healthy cases and has reported the others as patients (as we see from Figure III). In this scenario the accuracy, sensitivity and specificity are as follows.
$\text{Test Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{Test Accuracy} = \frac{50 + 25}{50 + 25 + 25 + 0}$
$\text{Test Accuracy} = 0.75 \text{ or } 75\%$
$\text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP + FN} = \frac{a}{a + c} = \frac{50}{50} = 1 \text{ or } 100\%$
This means that there is a 100% chance that a person gets a positive test result when he actually has the disease.
$\text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN + FP} = \frac{d}{b + d} = \frac{25}{50} = 0.5 \text{ or } 50\%$
This means that there is a 50% chance that a person gets a negative test result when he actually has no disease.
This test is suitable for screening purposes but is not suitable for final confirmation of a disease.
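The three scenarios can be compared side by side with a short script. This is only a sketch: the helper `diagnostic_metrics` and the dictionary layout are illustrative choices, while the cell counts (TP, FP, FN, TN) are the ones used in the scenarios above.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from the 2x2 cell counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cell counts for 50 patients and 50 healthy cases in each scenario
scenarios = {
    "I": dict(tp=50, fp=0, fn=0, tn=50),
    "II": dict(tp=25, fp=0, fn=25, tn=50),
    "III": dict(tp=50, fp=25, fn=0, tn=25),
}

results = {name: diagnostic_metrics(**cells) for name, cells in scenarios.items()}
for name, metrics in results.items():
    print(name, metrics)
```

Scenarios II and III have the same accuracy but opposite strengths, which is why accuracy alone does not tell whether a test suits screening or confirmation.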
Diagnostic Test
A diagnostic test is a procedure performed to confirm or determine the presence or absence of disease in an individual suspected of having the disease, usually following the report of symptoms or based on the results of other medical tests. This procedure gives a rapid indication of whether a patient has a certain disease. A diagnostic test is any approach used to gather clinical information for the purpose of making a clinical decision (i.e. a diagnosis). Some examples of diagnostic tests are X-rays, biopsies, pregnancy tests, medical histories and results from physical examinations.
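The Diagnostic Odds Ratio (D.O.R) is the ratio of the positive likelihood ratio to the negative likelihood ratio:
$\text{D.O.R} = \frac{LR^+}{LR^-}$
The Diagnostic Odds Ratio may be expressed in terms of the sensitivity and specificity of the test.
$\text{D.O.R} = \frac{\text{Sensitivity}}{1 - \text{Sensitivity}} \times \frac{\text{Specificity}}{1 - \text{Specificity}}$
$\text{D.O.R} = \frac{\frac{a}{a + c}}{\frac{a + c - a}{a + c}} \times \frac{\frac{d}{b + d}}{\frac{b + d - d}{b + d}}$
$\text{D.O.R} = \frac{a}{c} \times \frac{d}{b}$
$\text{D.O.R} = \frac{ad}{bc} = \text{O.R}$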
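The Diagnostic Odds Ratio may also be expressed in terms of the positive predictive value (P.P.V) and the negative predictive value (N.P.V).
$\text{D.O.R} = \frac{\text{P.P.V}}{1 - \text{P.P.V}} \times \frac{\text{N.P.V}}{1 - \text{N.P.V}}$
$\text{D.O.R} = \frac{\frac{a}{a + b}}{1 - \frac{a}{a + b}} \times \frac{\frac{d}{c + d}}{1 - \frac{d}{c + d}}$
$\text{D.O.R} = \frac{\frac{a}{a + b}}{\frac{a + b - a}{a + b}} \times \frac{\frac{d}{c + d}}{\frac{c + d - d}{c + d}}$
$\text{D.O.R} = \frac{a}{b} \times \frac{d}{c}$
$\text{D.O.R} = \frac{ad}{bc} = \text{O.R}$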
Question:
Consider a test with the following 2×2 contingency table and calculate the Diagnostic O.R.
                 Disease Status
Test       D+       D-       Total
T+         26       12       38
T-         3        48       51
Total      29       60       89
Prepared By: Sir Zahawat Sahib
Bio Statistics Notes
$\text{D.O.R} = \frac{TP / FP}{FN / TN}$
$\text{D.O.R} = \frac{26 / 12}{3 / 48}$
$\text{D.O.R} = \frac{26}{12} \times \frac{48}{3}$
$\text{D.O.R} = 34.67$
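The arithmetic can be verified directly from the four cells of the table. The sketch below assumes the layout a = TP, b = FP, c = FN, d = TN used throughout these notes; the function name is illustrative.

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """D.O.R = (TP / FP) / (FN / TN) = (a * d) / (b * c)."""
    return (tp * tn) / (fp * fn)


# Cells from the 2x2 table in the question
dor = diagnostic_odds_ratio(tp=26, fp=12, fn=3, tn=48)
print(f"D.O.R = {dor:.2f}")
```

The formula is undefined when FP or FN is zero, so a real analysis would need a convention (such as a continuity correction) for those tables.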
Probability Distribution
The probability distribution of a random variable describes how the probabilities are distributed over the values of the random variable. A probability distribution is a listing of all the outcomes of an experiment and their associated probabilities. For a discrete random variable X, the probability distribution is defined by the probability mass function $f(x) = P[X = x]$, which gives the probability for each value of the random variable. Consider the example of tossing three coins, in which the variable of interest is the random variable X, the number of heads obtained when the three coins are tossed.
Where X = 0, 1, 2, 3
$P(X = 0) = P(\text{no heads}) = P(TTT) = \frac{1}{8}$
$P(X = 1) = P(\text{one head}) = P(THT, HTT, TTH) = \frac{3}{8}$
$P(X = 2) = P(\text{two heads}) = P(HHT, THH, HTH) = \frac{3}{8}$
$P(X = 3) = P(\text{three heads}) = P(HHH) = \frac{1}{8}$
X        P(X = x)
0        1/8
1        3/8
2        3/8
3        1/8
Total    1
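The distribution of the number of heads can also be obtained by listing all eight equally likely outcomes and counting. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of tossing three fair coins
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give each number of heads
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability mass function of X = number of heads
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}

for x, prob in pmf.items():
    print(x, prob)
print("Total:", sum(pmf.values()))
```

Using `Fraction` keeps the probabilities exact, so the total can be compared with 1 without rounding error.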
Discrete Probability Distribution
The probability distribution of a discrete random variable is a table, graph or formula that gives the probability associated with each possible value that the variable can assume. For the discrete random variable X, the probability mass function is denoted by $f(x) = P[X = x]$, which satisfies the following two conditions.
i. $f(x) \ge 0 \quad \forall x \in X$
ii. $\sum_{x} f(x) = 1$
Binomial Experiment
A binomial experiment is a statistical experiment that has the following properties.
Consider the statistical experiment in which we flip a coin two times and count the number of times that a head occurs. This is a binomial experiment because:
I. The experiment consists of repeated trials; we flip the coin two times.
II. Each trial can result in just two possible outcomes, i.e. head or tail.
III. The probability of success is constant, i.e. 1/2.
IV. The trials are independent, i.e. getting a head on one trial does not affect whether we get a head on the other trial.
Notations:
b(x; n, p) denotes the probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is p.
Suppose a binomial experiment consists of n trials and results in x successes, where the probability of success on an individual trial is p and the probability of failure is q = 1 − p. Then the probability mass function (P.M.F) of the binomial distribution is
$P[X = x] = f(x) = \binom{n}{x} p^x q^{n - x}, \quad x = 0, 1, 2, \ldots, n$
$P[X = x] = f(x) = 0 \quad \text{otherwise}$
Question:
Suppose a die is rolled 5 times. What is the probability of getting exactly 2 fours?
Solution:
This is a binomial experiment in which the number of successes is 2, the number of trials is 5 and the probability of success is p = 1/6 or 0.167. Therefore the binomial probability is
$b(x; n, p) = b(2; 5, 0.167)$
$P[X = 2] = \binom{5}{2} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^{3}$
$P[X = 2] = 0.1608$
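The binomial probability mass function is short enough to code directly. The sketch below uses `math.comb` from the standard library (Python 3.8 or later); the function name `binomial_pmf` is an illustrative choice, and the call reproduces the die question with the exact value p = 1/6.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P[X = x] for a binomial experiment with n trials and success probability p."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Exactly two fours in five rolls of a fair die
prob_two_fours = binomial_pmf(2, 5, 1 / 6)
print(f"P[X = 2] = {prob_two_fours:.4f}")
```

Using the rounded value 0.167 instead of 1/6 changes the fourth decimal place, so it is safer to keep p exact until the final step.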
Question:
The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school apply, what is the probability that at most 2 are accepted?
Solution:
Here n = 5, p = 0.3 and we need $P[X \le 2]$.
$P[X \le 2] = \sum_{x=0}^{2} b(x; n, p)$
$P[X \le 2] = \sum_{x=0}^{2} \binom{5}{x} (0.3)^x (0.7)^{5 - x}$
$P[X \le 2] = \binom{5}{0} (0.3)^0 (0.7)^5 + \binom{5}{1} (0.3)^1 (0.7)^4 + \binom{5}{2} (0.3)^2 (0.7)^3$
$P[X \le 2] = 0.16807 + 0.36015 + 0.30870$
$P[X \le 2] = 0.8369$
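"At most" questions add up several binomial terms, which is easy to do in a loop. This is a minimal sketch: `binomial_cdf` is a hypothetical helper name, and the inputs n = 5 and p = 0.3 come from the college question above.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P[X <= k]: sum of the binomial probabilities for x = 0, 1, ..., k."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


# At most 2 of 5 applicants accepted, each with probability 0.3
prob_at_most_two = binomial_cdf(2, 5, 0.3)
print(f"P[X <= 2] = {prob_at_most_two:.4f}")

# "At least" questions follow from the complement rule
prob_at_least_three = 1 - prob_at_most_two
print(f"P[X >= 3] = {prob_at_least_three:.4f}")
```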
Question:
60% of the people who purchase sports cars are male. If 10 sports car owners are randomly selected, find the probability that exactly 7 are men.
Solution:
$b(x; n, p) = b(7; 10, 0.6)$
$P[X = 7] = \binom{10}{7} (0.6)^7 (0.4)^{3}$
$P[X = 7] = 0.2150$
Question:
Suppose that 80% of adults with allergies report symptom relief with a specific medication. If the medication is given to 10 new patients with allergies, what is the probability that it is effective in exactly 7?
Solution:
$b(x; n, p) = b(7; 10, 0.8)$
$P[X = 7] = \binom{10}{7} (0.80)^7 (0.20)^{3}$
$P[X = 7] = 0.2013$
Question:
The likelihood that a patient with a heart attack dies of the attack is 0.04. Suppose we have 5 patients who suffer a heart attack. What is the probability that all survive?
Solution:
Let X be the number of patients who die, so that all survive when X = 0.
$b(x; n, p) = b(0; 5, 0.04)$
$P[X = 0] = \binom{5}{0} (0.04)^0 (0.96)^{5}$
$P[X = 0] = 0.8153$
Question:
In a class, 3% of the students are suffering from anxiety. A sample of students is selected (the solution below uses n = 5). Find the probability that, out of these, (i) none, (ii) at least one, (iii) at least three and (iv) all five are suffering from anxiety.
Solution:
Let X be a random variable denoting the number of students suffering from anxiety, with n = 5 and p = 0.03.
(i) None
$P[X = 0] = \binom{5}{0} (0.03)^0 (0.97)^{5}$
$P[X = 0] = 0.8587$
(ii) At least one
$P[X \ge 1] = 1 - P[X < 1] = 1 - P[X = 0]$
$P[X \ge 1] = 1 - 0.8587$
$P[X \ge 1] = 0.1413$
(iii) At least three
$P[X \ge 3] = 1 - \left( P[X = 0] + P[X = 1] + P[X = 2] \right)$
$P[X \ge 3] = 0.0003$
(iv) All five
$P[X = 5] = \binom{5}{5} (0.03)^5 (0.97)^{0}$
$P[X = 5] \approx 0$
Poisson Distribution
The Poisson distribution is a discrete distribution. It is named after Siméon-Denis Poisson (1781-1840), a French mathematician who published its essentials in a paper in 1837. The Poisson distribution and the binomial distribution have some similarities, but also several differences. The binomial distribution describes the distribution of two possible outcomes, designated as success and failure, from a given number of trials. The Poisson distribution focuses only on the number of discrete occurrences over an interval. A Poisson experiment does not have a given number of trials (n) as a binomial experiment does.
For example, a binomial experiment might be used to determine how many black cars are in a random sample of 50 cars, while a Poisson experiment might focus on the number of cars randomly arriving at a car wash during a 20-minute interval. The Poisson distribution has the following characteristics.
i) It is a discrete distribution.
ii) Each occurrence is independent of the other occurrences.
iii) It describes discrete occurrences over an interval.
iv) The number of occurrences in each interval can range from 0 to ∞.
v) The mean number of occurrences must be constant throughout the experiment.
If X is the number of occurrences in the interval, its probability mass function is
$P[X = x] = \frac{e^{-\mu} \mu^{x}}{x!}, \quad x = 0, 1, 2, \ldots$
Then the random variable X is said to have a Poisson distribution with parameter μ, where the symbol "!" denotes the factorial.
μ (the expected or mean number of occurrences) is sometimes written as λ and is sometimes called the event rate or rate parameter.
Question:
The average number of major storms in a city is 2 per year. What is the probability that exactly 3 storms will hit the city next year?
Solution:
$P[X = 3] = \frac{e^{-2} \, 2^{3}}{3!}$
$P[X = 3] = 0.180 \text{ or } 18\%$
Question:
The average number of homes sold by a certain company is 2 homes per day. What is the probability that exactly 3 homes will be sold tomorrow?
Solution:
$P[X = 3] = \frac{e^{-2} \, 2^{3}}{3!}$
$P[X = 3] = 0.180 \text{ or } 18\%$
Question:
Suppose the average number of lions seen in the jungle on a 1-day visit is 5. What is the probability that tourists will see fewer than 4 lions on the next 1-day visit?
Solution:
We want the likelihood that tourists will see fewer than four lions, i.e. the probability that they will see X = 0, 1, 2 or 3 lions (X < 4), with μ = 5.
$P[X < 4] = \sum_{x=0}^{3} P[X = x] = \sum_{x=0}^{3} \frac{e^{-5} \, 5^{x}}{x!}$
$P[X < 4] = 0.265$
That is, the probability that tourists will see no more than 3 lions is 0.265.
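Question:
Consider a computer system with Poisson job arrivals. Determine the probability of
I. Zero jobs.
II. Exactly 2 jobs.
III. At most three jobs.
Solution:
The answers below correspond to a mean of μ = 2 jobs per interval.
Zero jobs
$P[X = 0] = \frac{e^{-2} \, 2^{0}}{0!} = 0.135$
Exactly 2 jobs
$P[X = 2] = \frac{e^{-2} \, 2^{2}}{2!} = 0.27$
At most three jobs
$P[X < 4] = \sum_{x=0}^{3} \frac{e^{-2} \, 2^{x}}{x!} = 0.8571$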
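The Poisson questions above all use the same probability mass function, so they can be checked with one small helper. This is a sketch only: the function names are illustrative, and the mean of 2 for the computer-system question is an assumption inferred from its answers.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P[X = x] for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**x / factorial(x)


def poisson_cdf(k, mu):
    """P[X <= k]: sum of the Poisson probabilities for x = 0, 1, ..., k."""
    return sum(poisson_pmf(x, mu) for x in range(k + 1))


storms_exactly_three = poisson_pmf(3, 2)   # storms (or homes) example, mu = 2
lions_fewer_than_four = poisson_cdf(3, 5)  # lions example, mu = 5
jobs_at_most_three = poisson_cdf(3, 2)     # jobs example, assuming mu = 2

print(f"{storms_exactly_three:.4f} {lions_fewer_than_four:.4f} {jobs_at_most_three:.4f}")
```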
The function f(x) representing the normal distribution satisfies the properties of a proper p.d.f.
The mean, median and mode of the normal distribution are equal.
Mean = Median = Mode = μ
All central moments of odd order are zero: $\mu_{2n+1} = 0 \quad \forall n$
The normal curve extends indefinitely to the left and to the right, approaching the x-axis more closely as x increases in magnitude.
The curve is symmetric about its mean, and thus the area to the left of the mean and the area to the right of the mean are each equal to 0.5.
For the normal distribution about 68% of the area under the curve lies between μ−σ and μ+σ, about 95% lies between μ−2σ and μ+2σ, and about 99.7% lies between μ−3σ and μ+3σ.
The points of inflection of the curve are one standard deviation away from the mean, at x = μ−σ and x = μ+σ.
Question:
The average on a statistics test was 78, with an S.D of 8. If the test scores are normally distributed, find the probability that a student receives a test score less than 90.
Solution:
$P(X < 90) = P\left(\frac{X - \mu}{\sigma} < \frac{90 - \mu}{\sigma}\right)$
In standardized form
$P(X < 90) = P\left(Z < \frac{90 - 78}{8}\right)$
$P(X < 90) = P(Z < 1.50)$
$P(X < 90) = 0.5 + P(0 < Z < 1.50) = 0.5 + 0.4332$
$P(X < 90) = 0.9332$
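Instead of a printed Z table, the area can be computed with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is a minimal sketch using the mean 78 and S.D 8 of the test-score question; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=78, sigma=8)

# Standardize the cut-off, then look up the area to its left
z = (90 - scores.mean) / scores.stdev
prob_below_90 = scores.cdf(90)

print(f"Z = {z:.2f}, P(X < 90) = {prob_below_90:.4f}")
```

The same probability is obtained from the standard normal curve, since `NormalDist().cdf(z)` equals `scores.cdf(90)`.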
Question:
The pollen count for a species of flower varies randomly in a manner well represented by a normal distribution with μ = 1000 and σ = 80. Find the probability that an individual pollen count will be
I. Greater than 1200.
II. Less than 775.
III. Between 800 and 1100.
Solution:
I. Greater than 1200
$P(X > 1200) = P\left(\frac{X - \mu}{\sigma} > \frac{1200 - \mu}{\sigma}\right)$
In standardized form
$P(X > 1200) = P\left(Z > \frac{1200 - 1000}{80}\right)$
$P(X > 1200) = P(Z > 2.50)$
$P(X > 1200) = 0.5 - 0.4938$
$P(X > 1200) = 0.0062$
II. Less than 775
$P(X < 775) = P\left(\frac{X - \mu}{\sigma} < \frac{775 - \mu}{\sigma}\right)$
In standardized form
$P(X < 775) = P\left(Z < \frac{775 - 1000}{80}\right)$
$P(X < 775) = P(Z < -2.81)$
$P(X < 775) = 0.5 - 0.4975$
$P(X < 775) = 0.0025$
III. Between 800 and 1100
$P(800 \le X \le 1100) = P\left(\frac{800 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{1100 - \mu}{\sigma}\right)$
In standardized form
$P(800 \le X \le 1100) = P\left(\frac{800 - 1000}{80} < Z < \frac{1100 - 1000}{80}\right)$
$P(800 \le X \le 1100) = P(-2.50 < Z < 1.25)$
$P(800 \le X \le 1100) = 0.4938 + 0.3944$
$P(800 \le X \le 1100) = 0.8882$
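The three pollen-count probabilities can be cross-checked numerically. The sketch below again relies on `statistics.NormalDist` with μ = 1000 and σ = 80 from the question; small differences from table-based answers are expected because a Z table rounds Z to two decimals.

```python
from statistics import NormalDist

pollen = NormalDist(mu=1000, sigma=80)

# I. Upper tail: complement of the area to the left of 1200
prob_above_1200 = 1 - pollen.cdf(1200)

# II. Lower tail: area to the left of 775
prob_below_775 = pollen.cdf(775)

# III. Interval: difference of two cumulative areas
prob_between = pollen.cdf(1100) - pollen.cdf(800)

print(f"{prob_above_1200:.4f} {prob_below_775:.4f} {prob_between:.4f}")
```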
Definition:
The P-value (or probability value) is the probability of getting a sample statistic (such as the mean) or a more extreme sample statistic in the direction of the alternative hypothesis when the null hypothesis is true.
OR
"The P-value is the probability of getting the observed value of the test statistic, or a value with even greater evidence against Ho, if the null hypothesis is actually true."
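Step No. 1
State the hypotheses.
Step No. 2
Compute the test value.
Step No. 3
Find the P-value.
Step No. 4
Make the decision.
Step No. 5
Summarize the result.
P-value = 2P(Z > |Zo|) for a two-tailed test
P-value = P(Z > Zo) for a right-tailed test
P-value = P(Z < Zo) for a left-tailed test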
Case: II
If HA or H1 contains a greater-than alternative, find the probability that Z is greater than your test statistic (i.e. look up your test statistic on the Z table, find its corresponding probability and subtract it from 1). The result is your P-value.
Case: III
If HA or H1 contains a not-equal-to alternative, find the probability that Z is beyond your test statistic and double it.
If your test statistic is negative, first find the probability that Z is less than your test statistic (i.e. look up your test statistic on the Z table and find its corresponding probability), then double this probability to get the P-value.
If your test statistic is positive, first find the probability that Z is greater than your test statistic (i.e. look up your test statistic on the Z table, find its corresponding probability and subtract it from 1), then double this result to get the P-value.
Question: A researcher wishes to test the claim that the average cost of tuition and fees at a 2-year college is greater than $5550. She selects a random sample of 36 2-year colleges and finds the mean to be $5800. The population S.D is $600. Is there evidence to support the claim at α = 0.05? Use the P-value method.
Solution:
Ho: μ = 5550 vs H1: μ > 5550
2) Test Statistic
$Z = \frac{5800 - 5550}{600 / \sqrt{36}}$
Z = 2.50
P-value = P(Z > 2.50)
= 1 − 0.9938
= 0.0062
0.0062 < 0.05
I.e. P < α
Since P is less than α, there is enough evidence to support the claim that the tuition and fees at 2-year colleges are greater than $5550.
Question: A researcher wishes to test the claim that the average wind speed in a certain city is 9 miles per hour. A sample of 36 days has an average wind speed of 9.3 miles per hour, and the S.D of the population is 0.8 miles per hour. At α = 0.01, is there enough evidence to reject the claim? Use the P-value method.
Solution:
Ho: μ = 9 vs H1: μ ≠ 9
2) Test Statistic
$Z = \frac{9.3 - 9}{0.8 / \sqrt{36}}$
Z = 2.25
P(Z > 2.25) = 1 − 0.9878
P(Z > 2.25) = 0.0122
P-value = 2(0.0122)
P-value = 0.0244
0.0244 > 0.01
I.e. P > α
Since P > α, there is not enough evidence to reject the claim that the average wind speed is 9 miles per hour.
Question: Suppose the average number of Facebook friends is 150, with S.D = 40.3. A random sample of 64 high school students in a particular country revealed that the average number of Facebook friends was 160. At α = 0.01, is there sufficient evidence to conclude that the mean is greater than 150?
Solution:
Ho: μ = 150 vs H1: μ > 150
2) Test Statistic
$Z = \frac{160 - 150}{40.3 / \sqrt{64}}$
Z = 1.9851
P-value = P(Z > 1.99) = 1 − 0.9767
P-value = 0.0233
0.0233 > 0.01
I.e. P > α
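The P-value method used in these three questions follows the same steps each time, so it can be sketched as one function. The name `z_test_p_value` and the `tail` labels are illustrative assumptions, not part of the notes; the standard normal areas come from `statistics.NormalDist` rather than a Z table, so the last decimals may differ slightly from table-based answers.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, mu0, sigma, n, tail):
    """Return (Z, P-value) for a one-sample Z test with known population S.D."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "right":    # H1: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif tail == "left":   # H1: mu < mu0
        p_value = std_normal.cdf(z)
    elif tail == "two":    # H1: mu != mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z, p_value


tuition = z_test_p_value(5800, 5550, 600, 36, "right")
wind = z_test_p_value(9.3, 9, 0.8, 36, "two")
friends = z_test_p_value(160, 150, 40.3, 64, "right")

for name, (z, p) in [("tuition", tuition), ("wind", wind), ("friends", friends)]:
    print(f"{name}: Z = {z:.4f}, P-value = {p:.4f}")
```

The decision rule is then a single comparison: reject Ho when the P-value is less than or equal to α, otherwise do not reject it.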