Biostatistics
Basic Biostatistics
05/03/2024 2
Definition and classification of Biostatistics
Classification of Biostatistics
Descriptive biostatistics
A statistical method that is concerned with the collection,
organization, summarization, and analysis of data from a
sample of a population.
Inferential biostatistics
A statistical method that is concerned with drawing
conclusions (inferences) about a particular population by
selecting and measuring a random sample from that population.
Cont…
Biostatistics
Descriptive Biostatistics
Inferential Biostatistics
1.2 Stages in statistical investigation
There are five stages or steps in any statistical investigation.
1. Collection of data
The process of obtaining measurements or counts.
2. Organization of data
Includes editing, classifying, and tabulating the data
collected.
3. Presentation of data
Gives an overall view of what the data actually look like.
Facilitates further statistical analysis.
Can be done in the form of tables, graphs, or diagrams.
Cont…
4. Analysis of data
Digs out useful information for decision making.
It involves extracting relevant information from the data
(such as the mean, median, mode, range, variance…).
5. Interpretation of data
Concerned with drawing conclusions from the data
collected and analyzed; and giving meaning to analysis
results.
This is a difficult task that requires a high degree of skill and
experience.
1.3 Definition of Some Basic terms
Cont...
Sampling: The process or method of sample selection from the
population.
Sample size: The number of elements or observations to be
included in the sample.
Variable: A characteristic or attribute that can assume different
values in different persons, places, or things.
Some examples of variables include:
Diastolic blood pressure,
Heart rate,
Height,
Weight
Data: A collection of facts, values, observations, or
measurements that the variables can assume.
Uses of statistics:
Limitations of statistics
A. Depending on the characteristic of the measurement, a variable can be:
Qualitative (Categorical) variable
A variable or characteristic which cannot be measured in
quantitative form but can only be identified by name or category,
for instance place of birth, ethnic group, type of drug, stage of
breast cancer (I, II, III, or IV), degree of pain (minimal, moderate,
severe, or unbearable).
The categories should be clear cut, not overlapping, and cover all the
possibilities. For example, sex (male or female), vital status (alive or
dead), disease stage (depends on disease), ever smoked (yes or no).
Quantitative (Numerical) variable
A variable that can be measured and expressed numerically.
Example: survival time, systolic blood pressure, number of
children in a family, height, age, body mass index.
Quantitative variables can be of two types:
Discrete Variables
Have a set of possible values that is either finite or
countably infinite.
The values of a discrete variable are usually whole
numbers.
Numerical discrete data occur when the observations are
integers that correspond to a count of some sort.
Some common examples are:
Number of pregnancies,
Cont…
Continuous Variables
Observations are not restricted to take on certain numerical
values: Often measurements (e.g., height, weight, age).
Continuous data are used to report a measurement of the
individual that can take on any value within an acceptable
range.
Nominal Scale
Other Examples
Sex
Social status
Marital status
Days of the week (months)
Geographic location
Seasons
Ethnic group
Types of restaurants
Brand choice
Religion
Job type: executive, technical, clerical
Coded as “0”
Coded as “1”
Ordinal Scale
Level of measurement which classifies data into categories that can be
ranked. However, the differences between the ranks are not meaningful or measurable.
Ordinal Scales
Interval Scales
• Level of measurement which classifies data that can be ranked
and differences are meaningful. However, there is no meaningful
zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples:
IQ
Temperature in °F.
Interval Scale
Numerically equal distances on the scale represent equal values in
the characteristic being measured. An interval scale contains all the
information of an ordinal scale, but it also allows you to compare the
differences between objects.
It assumes that the measurements are made in equal units,
i.e. gaps between whole numbers on the scale are equal,
e.g. Fahrenheit and Celsius temperature scales.
An interval scale does not have a true zero,
e.g. a temperature of "zero" does not mean that there
is no temperature; it is just an arbitrary zero point.
Permissible statistics: count/frequencies, mode, median,
mean, standard deviation.
Ratio Scales
Primary Scales of Measurement
Nominal: numbers assigned to runners (4, 81, 9)
Chapter 2
Organization and Presentation of data
• When the range of the data is large, the data must be grouped
into classes that are more than one unit in width.
Definitions of some terms:
• Grouped Frequency Distribution: a frequency distribution in which
several numbers are grouped into one class.
• Class limits: Separate one class in a grouped frequency
distribution from another. The limits can actually appear in
the data, and there are gaps between the upper limit of one class
and the lower limit of the next.
• Units of measurement (U): the distance between two possible
consecutive measures. It is usually taken as 1, 0.1, 0.01, 0.001, …
Friday, May 3, 2024 Wullo S. 36
Cont…
• Class boundaries: Separate one class in a grouped frequency
distribution from another. The boundaries have one more
decimal place than the raw data and therefore do not appear
in the data. There is no gap between the upper boundary of
one class and the lower boundary of the next class. The lower
class boundary is found by subtracting U/2 from the
corresponding lower class limit, and the upper class boundary
is found by adding U/2 to the corresponding upper class limit.
• Class width: the difference between the upper and lower
class boundaries of any class. It is also the difference between
the lower limits of any two consecutive classes or the
difference between any two consecutive class marks.
Cont…
• Class mark (midpoint): the average of the lower and
upper class limits, or the average of the upper and lower class
boundaries.
• Cumulative frequency: the number of observations less
than/more than or equal to a specific value.
• Cumulative frequency above: the total frequency of all
values greater than or equal to the lower class boundary of a
given class.
• Cumulative frequency below: the total frequency of all
values less than or equal to the upper class boundary of a
given class.
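The definitions above can be tried out on a small table. The following is a minimal sketch, assuming a unit of measurement U = 1 and made-up class limits and frequencies (the function name `describe_classes` and the data are illustrative, not taken from the slides).

```python
def describe_classes(classes, unit):
    """For each (lower limit, upper limit, frequency) class, compute the
    class boundaries, class width, class mark and cumulative frequencies."""
    total = sum(freq for _, _, freq in classes)
    rows = []
    running = 0  # frequency accumulated over the classes seen so far
    for lower, upper, freq in classes:
        lcb = lower - unit / 2          # lower class boundary
        ucb = upper + unit / 2          # upper class boundary
        cum_above = total - running     # values >= lower class boundary
        running += freq                 # values <= upper class boundary
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lcb, ucb),
            "width": ucb - lcb,
            "mark": (lower + upper) / 2,
            "cum_below": running,
            "cum_above": cum_above,
        })
    return rows


# Hypothetical age classes with U = 1 (frequencies are invented).
classes = [(15, 19, 4), (20, 24, 10), (25, 29, 6)]
table = describe_classes(classes, unit=1)
for row in table:
    print(row)
```

With U = 1 the first class 15-19 gets the boundaries 14.5 and 19.5, so consecutive classes share a boundary and no gap remains.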
[Figure: histogram of the number of women (vertical axis, 0 to 40) by age group (horizontal axis), with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]
19 21 20 20 34 22 24 27 27 27
• Then, mean = (19 + 21 + … + 27) / 10 = 24.1
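• General formula
a) Ungrouped mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
b) Grouped mean
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
$x_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval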
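As a small illustration of the symbols k, xi and fi defined above, the sketch below computes an ungrouped mean and a grouped mean. It assumes the standard grouped-mean formula, sum(fi × xi) / sum(fi); the mid-points and frequencies are invented, while the ten raw values are the ones used in the mean example.

```python
def ungrouped_mean(values):
    """Arithmetic mean of raw observations."""
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i) over the k classes."""
    if len(midpoints) != len(frequencies):
        raise ValueError("one frequency is needed per class mid-point")
    total = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total


# The ten observations from the worked mean example.
raw = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
raw_mean = ungrouped_mean(raw)

# Hypothetical grouped data: k = 3 classes.
midpoints = [17, 22, 27]
frequencies = [4, 10, 6]
class_mean = grouped_mean(midpoints, frequencies)

print(raw_mean, class_mean)
```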
A. Weighted mean
B. Correct and wrong mean
C. Combined mean
D. Geometric mean
E. Harmonic mean
$$\text{Median} = LCB + \left(\frac{n/2 - F_c}{f_c}\right) W$$
• Where:
• LCB = lower class boundary of the median class
• $F_c$ = cumulative frequency just before the median class
• $f_c$ = frequency of the median class
• W = class width and n = number of observations.
Properties of median
• The median can be used as a summary measure for discrete
and continuous data; in general, however, it is not
appropriate for nominal data.
• The quartiles are values which divide the ordered distribution into
four parts such that there are an equal number of observations
in each part.
– Q1 = [(n+1)/4]th observation
– Q2 = [2(n+1)/4]th observation
– Q3 = [3(n+1)/4]th observation
• The inter-quartile range (IQR) is the difference between the third and
the first quartiles.
– IQR = Q3 − Q1
Example 1: We use the data set of 11 numbers:
19 21 20 20 34 22 24 27 27 27 28
– Sorted: 19 20 20 21 22 24 27 27 27 28 34
– The first quartile is the 3rd ordered value, 20, and the third quartile is the 9th ordered value, 27.
– The inter-quartile range = 27 − 20 = 7.
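The positional rule used in this example can be checked with a short sketch. The function name `quartile` is illustrative, and the linear interpolation used when the position k(n+1)/4 is not a whole number is an assumption (the slides only show a case where the positions are integers).

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) using the k(n+1)/4-th ordered observation.

    Positions are 1-based; a fractional position is handled by linear
    interpolation between the two neighbouring ordered values (an assumption).
    """
    values = sorted(data)
    n = len(values)
    position = k * (n + 1) / 4
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:
        return values[0]
    if lower >= n:
        return values[-1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])


data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27, 28]
q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other software may use slightly different quartile conventions, so results can differ from this rule when the positions are fractional.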
• Example: Find the sample space for the gender of the children if
a family has three children. Use B for boy and G for girl.
– Solution: There are two genders, male and female, and each
child could be either gender. Hence, there are eight
possibilities, as shown here.
S= {BBB, BBG, BGB, GBB, GGG, GGB, GBG, BGG}
• Note: there are two ways to find all possible outcomes of a probability
experiment (the sample space):
– by observation and reasoning;
– by using a tree diagram (a device consisting of line segments
emanating from a starting point and also from each outcome
point).
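The same sample space can also be enumerated mechanically, which mirrors what a tree diagram does branch by branch. This is a minimal sketch; the probability at the end assumes that all eight outcomes are equally likely, which the slides do not state explicitly.

```python
from fractions import Fraction
from itertools import product

# Each of the three children is either a boy (B) or a girl (G).
sample_space = ["".join(outcome) for outcome in product("BG", repeat=3)]
print(sample_space)

# Complement rule, assuming equally likely outcomes:
# P(at least one boy) = 1 - P(all girls)
p_all_girls = Fraction(sample_space.count("GGG"), len(sample_space))
p_at_least_one_boy = 1 - p_all_girls
print(p_at_least_one_boy)
```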
Solution
• P(At least one male) = 1- P(all females)
1. Binomial distribution
• A binomial experiment (a sequence of Bernoulli trials) is a statistical
experiment that has the following properties:
• The experiment consists of a fixed number, n, of repeated trials.
• Each trial can result in just two possible outcomes. We call one of these
outcomes a success and the other a failure.
• The probability of success, p, is the same on every trial.
• The trials are independent; that is, the outcome of one trial does not
affect the outcome of other trials.
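For an experiment with these properties, the probability of exactly x successes is usually computed with the standard binomial formula, C(n, x) p^x (1 − p)^(n − x). The sketch below assumes that formula; the values n = 10 and p = 0.3 are invented for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical experiment: n = 10 trials, success probability p = 0.3.
n, p = 10, 0.3
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]
for x, prob in enumerate(probabilities):
    print(x, round(prob, 4))
```

The probabilities over x = 0, …, n add up to 1, which is a quick sanity check on any binomial table.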
Where
• X = number of successes per unit time
The normal equation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where
• X is a normal random variable,
• μ is the mean
• σ is the standard deviation
• pi is approximately 3.14159, and e is approximately 2.71828.
• The random variable X in the normal equation is called the
normal random variable.
Characteristics of Normal Distribution
• It is Symmetric around the mean: Two halves of the curve are the
same (mirror images)
• We can transform all the observations of any normal random variable X with
mean μ and variance σ² to a new set of observations of another normal random
variable Z with mean 0 and variance 1 using the following transformation:
$$Z = \frac{X - \mu}{\sigma}$$
Then:
1) What area under the curve is above 80 beats/min?
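For this we need to draw the figure and find the area which corresponds to Z.
[Figure: normal curve with the horizontal axis marked at −3, −2, −1, μ, 1, 2, 3 standard deviations; area labels 0.15, 2.2%, 13.6%, 33.35%, and 0.159]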
3) 95.4% or 0.954
4) 0.15% or 0.0015
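Areas like these can be cross-checked numerically instead of being read from a table. The sketch below standardizes a value with Z = (X − μ)/σ and evaluates the standard normal curve with `math.erf`; the mean of 70 beats/min and standard deviation of 10 beats/min are assumed values for illustration only.

```python
from math import erf, sqrt


def z_score(x, mu, sigma):
    """Standardize x: Z = (X - mu) / sigma."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Assumed heart-rate distribution (illustrative values).
mu, sigma = 70.0, 10.0

z = z_score(80.0, mu, sigma)
area_above_80 = 1 - standard_normal_cdf(z)
area_within_2sd = standard_normal_cdf(2) - standard_normal_cdf(-2)
area_above_3sd = 1 - standard_normal_cdf(3)

print(round(area_above_80, 4), round(area_within_2sd, 4), round(area_above_3sd, 4))
```

Exact values differ slightly from the rounded percentages usually printed on the empirical-rule figure.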
I. Probability sampling
• Is any method of sampling that utilizes some form of random
selection.
• probability sampling is a procedure for sampling from a
population in which
– The selection of a sample unit is based on chance
– Every element of the population has a known and non-zero
probability of being selected
– Random sampling helps produce representative samples by
eliminating voluntary response bias and guarding against
undercoverage bias
Every individual of the target population has an equal chance of being
included in the sample.
1. Simple random sample (SRS)
• Objective: To select n units out of N
• Used when:
– the population is homogeneous
– a sampling frame is available
– the study area is not very wide
– Note: Homogeneity refers to the similarity of the population with regard
to the outcome variable.
• Procedure:
– Use a table of random numbers, in which each digit takes on the values 0, 1, 2, …, 9 with equal probability
– Use a computer random number generator
– Use a mechanical device to select the sample
– Use the RAND() function in an Excel sheet if a frame is available
– Use the lottery method
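The computer random number generator option can be sketched as follows. The sampling frame of 100 numbered units, the sample size, and the seed are all made up for illustration; `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Select n distinct units from the sampling frame, without replacement."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: units numbered 1..100 (N = 100).
frame = list(range(1, 101))
sample = simple_random_sample(frame, n=10, seed=42)
print(sorted(sample))
```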
2. Stratified random sampling
• Stratified Random Sampling involves dividing your population
into homogeneous subgroups and then taking a simple random
sample in each subgroup.
Example:
An agency has clients from three ethnic groups, and the agency
wants to assess clients' views of the quality of service for the last year.
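A setting like this could be sampled as sketched below: the client list is split into strata, and a simple random sample is drawn within each stratum. The stratum names, client identifiers, stratum sizes, and the 10% sampling fraction are hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, members in strata.items():
        size = max(1, round(len(members) * fraction))  # at least one per stratum
        chosen[name] = rng.sample(members, size)
    return chosen


# Hypothetical client lists for three groups of different sizes.
strata = {
    "group_A": [f"A{i}" for i in range(50)],
    "group_B": [f"B{i}" for i in range(30)],
    "group_C": [f"C{i}" for i in range(20)],
}
chosen = stratified_sample(strata, fraction=0.1, seed=1)
for name, members in chosen.items():
    print(name, members)
```

Using the same fraction in every stratum gives a proportional allocation; other allocations are possible and would change only the `size` line.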
Stratified random sampling
2. Expert Sampling
Expert sampling involves the assembling of a sample of persons
with known or demonstrable experience and expertise in some
area.
3. Quota Sampling
In quota sampling, you select people non-randomly according to
some fixed quota.
There are two types of quota sampling: proportional and
non-proportional.
4. Heterogeneity Sampling
We sample for heterogeneity when we want to include all
opinions or views, and we aren't concerned about representing
these views proportionately.
Another term for this is sampling for diversity.
Purposive sampling…
5. Snowball sampling
In snowball sampling, you begin by identifying someone who
meets the criteria for inclusion in your study
You then ask them to recommend others who they may know
who also meet the criteria
Snowball sampling is especially useful when you are trying to
reach populations that are inaccessible or hard to find
For instance, if you are studying the homeless, you are not likely
to be able to find good lists of homeless people within a specific
geographical area. However, if you go to that area and identify
one or two, you may find that they know very well who the
other homeless people in their vicinity are and how you can find
them
Summary
• Population to be studied
Size and geographic distributions
Heterogeneity with respect to the variable studied
• Resources available
• Level of precision required
• Importance of having a precise estimate of sampling error
Inferential statistics
• After completing this session you will be able to perform:
– Parameter estimation: point estimates and confidence intervals
– Hypothesis testing: Z-test and t-test
– Tests of association: Chi-square test
Introduction
• Proportion
Suppose we choose a random sample of size n; the sampling
distribution of the sample proportion p possesses the following
properties.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad p = \frac{x}{n}$$
• Where x is the total number of successes (events)
Three Properties of a Good Estimator
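$$\left[\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \ \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$
$$\left[p - z_{\alpha/2}\sqrt{p(1-p)/n},\ \ p + z_{\alpha/2}\sqrt{p(1-p)/n}\right]$$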
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0.295}{16} = 0.01844$$
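The point estimate above and the usual large-sample confidence intervals can be sketched in a few lines. The total 0.295 and n = 16 come from the example; the population standard deviation of 0.005, the proportion 0.5 with n = 100, and the 95% level (z = 1.96) are assumptions made only to have something to compute.

```python
from math import sqrt

Z_95 = 1.96  # z value for a 95% confidence level (alpha = 0.05)


def mean_ci(xbar, sigma, n, z=Z_95):
    """Confidence interval for a mean when sigma is known: xbar +/- z*sigma/sqrt(n)."""
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def proportion_ci(p, n, z=Z_95):
    """Large-sample confidence interval for a proportion: p +/- z*sqrt(p(1-p)/n)."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Point estimate from the example: sum of the 16 observations is 0.295.
xbar = 0.295 / 16

# Assumed population standard deviation (not given in the slides).
mean_interval = mean_ci(xbar, sigma=0.005, n=16)

# Hypothetical proportion: 50 successes out of 100.
prop_interval = proportion_ci(p=0.5, n=100)

print(round(xbar, 5), mean_interval, prop_interval)
```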
Introduction
– Researchers are interested in answering many types of questions. For example, a
physician might want to know whether a new medication will lower a person’s
blood pressure.
• Null hypothesis (represented by H0) is the statement about the value of the
population parameter. That is, the null hypothesis postulates that ‘there is no
difference between factor and outcome’ or ‘there is no intervention effect’.
• Alternative hypothesis (represented by HA) states the ‘opposing’ view that
‘there is a difference between factor and outcome’ or ‘there is an intervention
effect’.
1. Identify the null hypothesis H0 and the alternate hypothesis HA.
2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic. Remember that a t statistic is usually appropriate for a small number of samples; for a larger number of samples, a z statistic can work well if the data are normally distributed.
4. Compare the observed value of the statistic to the critical value obtained for the chosen α.
5. Make a decision.
6. Conclusion
Test Statistics
$$\text{Test statistic} = \frac{\text{Observed value} - \text{Hypothesized value}}{\text{Standard error}}$$
• The known distributions are the Normal distribution, Student’s t distribution, the Chi-square
distribution, …
Critical value
• The critical value separates the critical region from the noncritical region
for a given level of significance
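Rejection Regions
• H0: μ = μ0 versus H1: μ < μ0 (left-tailed test): rejection region of area α in the left tail
• H0: μ = μ0 versus H1: μ > μ0 (right-tailed test): rejection region of area α in the right tail
• H0: μ = μ0 versus H1: μ ≠ μ0 (two-tailed test): rejection regions of area α/2 in each tail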
Two tailed test
H0: μ = μ0 (μ − μ0 = 0)
HA: μ ≠ μ0 (μ − μ0 ≠ 0)
$$z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
$$z_{tabulated} = z_{\alpha/2} \ \text{for a two-tailed test}$$
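A two-tailed one-sample z test of this form can be sketched as below. The sample mean, hypothesized mean, standard deviation and sample size are invented for illustration, and 1.96 is the tabulated value for α = 0.05.

```python
from math import sqrt


def one_sample_z(xbar, mu0, sigma, n):
    """Calculated z: (xbar - mu0) / (sigma / sqrt(n))."""
    return (xbar - mu0) / (sigma / sqrt(n))


def reject_two_tailed(z_cal, z_tabulated=1.96):
    """Reject H0 when |z_cal| exceeds the tabulated z for alpha/2."""
    return abs(z_cal) > z_tabulated


# Hypothetical data: sample mean 52, hypothesized mean 50, sigma 8, n = 64.
z_cal = one_sample_z(xbar=52.0, mu0=50.0, sigma=8.0, n=64)
decision = reject_two_tailed(z_cal)
print(z_cal, "reject H0" if decision else "do not reject H0")
```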
• When the p-value is less than 0.05, we often say that the
result is statistically significant.
• If the sample size is small (if np < 5 or n(1−p) < 5), then use Student’s
t-statistic for the tabulated value of the test statistic.
• If σ1 and σ2 are unknown, then they can be estimated by s1 and s2.
Hypothesis testing for two sample means
$$z_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
Hypothesis …
$$z_{tabulated} = z_{\alpha/2} \ \text{for a two-tailed test}$$
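H0: μ1 − μ2 = 0
HA: μ1 − μ2 ≠ 0
$$z_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(4.3 - 3.4) - 0}{\sqrt{\dfrac{2.9^2}{12} + \dfrac{3.5^2}{15}}} = \frac{0.9}{\sqrt{1.5175}} = \frac{0.9}{1.23} \approx 0.73$$
$$z_{\alpha/2} = z_{0.025} = 1.96$$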
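The two-sample calculation can be reproduced with a short sketch. It uses the values substituted into the formula in this example (means 4.3 and 3.4, standard deviations 2.9 and 3.5, sample sizes 12 and 15) and the tabulated value 1.96; the function names are illustrative.

```python
from math import sqrt


def two_sample_z(xbar, ybar, sd1, sd2, n1, n2, hypothesized_diff=0.0):
    """Calculated z for the difference between two means."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return ((xbar - ybar) - hypothesized_diff) / standard_error


def reject_two_tailed(z_cal, z_tabulated=1.96):
    """Reject H0 when |z_cal| exceeds the tabulated z for alpha/2."""
    return abs(z_cal) > z_tabulated


z_cal = two_sample_z(xbar=4.3, ybar=3.4, sd1=2.9, sd2=3.5, n1=12, n2=15)
decision = reject_two_tailed(z_cal)
print(round(z_cal, 2), "reject H0" if decision else "do not reject H0")
```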
• Let n1 and n2 be the sample sizes from the two populations. If x and
y are the numbers of outcomes of interest, then the point estimate for each
population is given by p1 = x/n1 and p2 = y/n2 respectively.
• The point estimate of π1 − π2 is p1 − p2.
• If the sample size is large and n1p1 > 5, n1(1−p1) > 5, n2p2 > 5, n2(1−p2) > 5, then
the interval estimate for the difference of proportions is given by
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
$$z_{cal} = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$$
Small sample size
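$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
$$E_{ij} = \frac{(i\text{th row total}) \times (j\text{th column total})}{\text{grand total}} = \frac{R_i C_j}{n}$$
$O_{ij}$ = observed frequency, $E_{ij}$ = expected frequency of the cell at the
juncture of the ith row and the jth column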
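The chi-square statistic and the expected frequencies can be computed directly from a table of observed counts. The sketch below assumes a 2×2 table with invented counts (for instance smokers and non-smokers by presence of symptoms); 3.841 is the tabulated chi-square value for 1 degree of freedom at α = 0.05.

```python
def chi_square(observed):
    """Return (chi-square statistic, expected frequencies) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E_ij = (row total i * column total j) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Hypothetical 2x2 table: rows = exposure groups, columns = symptoms yes/no.
observed = [[20, 30],
            [30, 20]]
statistic, expected = chi_square(observed)

CRITICAL_1DF_05 = 3.841  # tabulated value, 1 degree of freedom, alpha = 0.05
print(statistic, expected, statistic > CRITICAL_1DF_05)
```

For an r × c table the degrees of freedom are (r − 1)(c − 1), so a different tabulated value is needed for larger tables.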
• Additionally, the chi-squared test should not be used when the expected value in a cell
is < 5. At times it is not inappropriate to pad an empty cell with a small value,
though, as one can only assume the result would be more significant with no value
there.
Hypothesis:
• H0: there is no association between smoking and symptoms of asthma
• HA: there is an association between smoking and symptoms of asthma