Bio Statistics
Data
Qualitative Quantitative
1. How to collect data?
SAMPLING DISTRIBUTION
• First, you need to identify the target
population of your research.
You are researching the experiences of drug use among youth in your city. There is no list of
all drug addicts, so other sampling methods are not possible.
One person agrees to participate in the research and puts you in contact
with other drug addicts he knows in the area (snowball sampling).
II. Probability sampling methods
Probability sampling means that every member of the population has a chance
of being selected.
Useful for drawing conclusions about the complete population using quantitative data
It reduces selection bias
Which one of the following options correctly matches sampling techniques with their features?
(1) A-(ii); B-(i); C-(iv); D-(iii) (2) A-(ii); B-(iv); C-(iii); D-(i)
(3) A-(i); B-(iv); C-(iii); D-(ii) (4) A-(i); B-(iv); C-(ii); D-(iii)
2. How to analyze the data?
For qualitative data: Thematic analysis to interpret patterns and meanings in the data.
Example Data: 13, 18, 13, 14, 13, 16, 14, 21, 13
Geometric Mean
• GM = n-th root of (a1 × a2 × a3 × … × an)
Harmonic Mean
• HM = n / [(1/a1) + (1/a2) + (1/a3) + … + (1/an)]
For two positive values: AM × HM = GM²
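The three means can be checked numerically. The sketch below uses Python's standard `statistics` module on the slide's example data; the pair (4, 9) used for the AM × HM = GM² check is an assumed illustration, not a value from the slides.

```python
import math
import statistics

# Example data from the slide
data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

am = statistics.mean(data)            # arithmetic mean
gm = statistics.geometric_mean(data)  # n-th root of the product
hm = statistics.harmonic_mean(data)   # n divided by the sum of reciprocals

# For positive values, AM >= GM >= HM
print(am, gm, hm)

# AM x HM = GM^2 is an exact identity only for two positive values.
# The pair below is an assumed illustration.
a, b = 4, 9
am2 = (a + b) / 2
gm2 = math.sqrt(a * b)
hm2 = 2 / (1 / a + 1 / b)
print(am2 * hm2, gm2 ** 2)
```

With more than two values the identity generally does not hold, which is why the check uses a pair.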
2. Median: The value separating the higher half of a sample or population
from the lower half
Found by arranging all the values from lowest to highest and taking the
middle one
If there is an even number of values, the median is the mean of the two
middle values.
Appropriate measure when data are contaminated by outliers (non-parametric)
Example: Data (odd number):
13, 13, 13, 13, 14, 14, 16, 18, 21 Total Data = 9
Median: 14
Example: Data (even number):
13, 13, 13, 13, 13, 14, 14, 16, 18, 21 Total Data = 10
Median: (13 + 14) / 2 = 13.5
1, 4, 4, 4, (5), 6, 6, 7, 8 = 45
The mean and median of a data set are 24 and 22, respectively.
The mode of the data set will be:
(1) 23
(2) 18
(3) 2
(4) -2
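Questions of this kind are usually solved with the empirical relation mode ≈ 3 × median - 2 × mean, which holds approximately for moderately skewed data. The relation is not stated on the slide, so treat the sketch below as an assumption about the intended method; the function name is hypothetical.

```python
def empirical_mode(mean, median):
    """Empirical relation for moderately skewed data: mode = 3*median - 2*mean."""
    return 3 * median - 2 * mean

# Values from the question: mean = 24, median = 22
mode = empirical_mode(24, 22)
print(mode)
```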
II. Measures of Dispersion
1. Range
• One simple measure of dispersion is the range, which is the difference
between the greatest and least data values.
1 2 3 4 5 6 7 8 9
Quartile: Meaning
• One of three points that divide a data set into four equal parts.
• Each group contains equal number of observations or data.
• Median acts as base for calculation of quartile.
2. Quartile deviation
• The difference Q3 - Q1 is known as the
interquartile range
• The interquartile range divided by two is known as the quartile
deviation or semi-interquartile range.
• Quartile Deviation = (Q3 - Q1) / 2
• Good for non-parametric data (outliers present)
To calculate the quartile deviation, first find Q1, then find Q3, take the difference
of the two, and finally divide by 2.
Example: Find the quartiles and quartile deviation of the following data:
17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28 = 16 data
Solution:
Ascending order of the given data is: 2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48
Number of data values = n = 16
Q2 = Median of the given data set
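A minimal sketch of the remaining steps, assuming the "median of halves" convention: Q1 is the median of the lower half and Q3 the median of the upper half. Other textbook conventions, such as the (n+1)/4-th term rule, give slightly different quartiles, so the slide's intended answer may differ.

```python
import statistics

# Data from the example
data = [17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28]

values = sorted(data)
n = len(values)
half = n // 2

# Split into lower and upper halves (the middle value is skipped when n is odd)
lower = values[:half]
upper = values[half:] if n % 2 == 0 else values[half + 1:]

q1 = statistics.median(lower)   # first quartile
q2 = statistics.median(values)  # second quartile = median
q3 = statistics.median(upper)   # third quartile

iqr = q3 - q1                   # interquartile range
quartile_deviation = iqr / 2    # semi-interquartile range

print(q1, q2, q3, iqr, quartile_deviation)
```

Under this convention the quartiles are 7.5, 12 and 21, giving a quartile deviation of 6.75.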
• Variance is
• the sum of squares of deviations from the mean divided by the degrees
of freedom (n-1) for a sample
• the sum of squares of deviations from the mean divided by the number of
observations (n) for a population
• Standard deviation (σ) is the square root of the variance.
Steps for the computation of standard deviation:
1. Calculate the mean
2. Find the difference of each observation from the mean
3. Square the differences of observations from the mean
Example (mean = 50/10 = 5), first rows of the table:
X    X - X̄    (X - X̄)²
2    -3       9
2    -3       9
4    -1       1
4    -1       1
5    0        0
5    0        0
6    +1       1
Coefficient of variation = Standard deviation / mean
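The difference between the sample (n-1) and population (n) formulas can be seen with Python's `statistics` module. The data set below is hypothetical, chosen so that the mean is 5; it is not the slide's full table.

```python
import statistics

# Hypothetical data with mean 5 (not the slide's table)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Sample formulas divide the sum of squared deviations by n - 1
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population formulas divide by n
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

# Coefficient of variation = standard deviation / mean
cv = sample_sd / mean

print(mean, sample_variance, population_variance, population_sd, cv)
```

The sum of squared deviations here is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7.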
• An outlier can be defined as a value in a data set that lies more than
three standard deviations from the mean
a. 54,55,54,96,60,58,55,56,06,62,68,62,55,72,69,44
The appropriate measure of dispersion of an open-end class data is (data are
contaminated by outliers ):
(1) Range
(2) Mean deviation
(3) Quartile deviation
(4) Standard deviation
Apply your mind
Following are statements related to statistical methods:
A. An outlier can be defined as a value in a data set that lies more than three standard deviations
from the mean.
B. Measures of central tendency and dispersion are independent of the presence of outliers in a
data set.
C. Standard deviation is a measure of dispersion.
D. Mean, median and mode are not measures of central tendency.
Which one of the following options is a combination with both INCORRECT statements?
1. A and B 2. B and C
3. A and C 4. B and D
Apply your mind:
1. The μ and σ of wing length (a normally distributed parameter) in a population of fruitflies are 4 and 0.2 mm,
respectively. In a random sample of 400 fruitflies, how many individuals are expected to have wing lengths
greater than 4.4 mm? (JUNE 2011)
(1) 20 (2) 64
(3) 10 (4) 336
Mean = 4, SD = 0.2
4.4 = mean + 2σ
± 2σ includes about 95% of individuals
Mean to +2σ includes 47.5% (190 of 400 individuals)
Greater than +2σ is 2.5% of 400 = 10 individuals
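The same reasoning can be sketched with `statistics.NormalDist`. The slide uses the rounded rule that ± 2σ covers 95%; the exact normal tail beyond +2σ is slightly smaller, so the exact expectation is about 9 rather than 10. Variable names are illustrative.

```python
from statistics import NormalDist

# Wing length: mean 4 mm, standard deviation 0.2 mm
wing = NormalDist(mu=4.0, sigma=0.2)
sample_size = 400

# Exact tail area above 4.4 mm (mean + 2 sigma)
fraction_exact = 1 - wing.cdf(4.4)
expected_exact = sample_size * fraction_exact

# Rounded rule used on the slide: 95% within +/- 2 sigma, so 2.5% in each tail
fraction_rule = (1 - 0.95) / 2
expected_rule = sample_size * fraction_rule

print(round(expected_exact, 1), round(expected_rule))
```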
2. The pH of a solution is 7.4 ± 0.02 where 0.02 is standard deviation obtained from eight
measurements. If more measurements were carried out, the % of samples whose pH would fall
between pH 7.38 and 7.42 is (JUNE 2017)
(1) 99.6 (2) 95.4
(3) 68.2 (4) 99.8
A scientist weighs each of 30 fishes. Their mean weight works out to 30
gm, with a standard deviation of 2 gm. Later, it was found that the measuring
scale was misaligned and always under-reported every fish weight by 2 gm.
The correct mean and standard deviation (in gm) of the fishes are, respectively:
(1) 28, 2 (2) 28, 4
(3) 32, 2 (4) 32, 4
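Adding a constant to every observation shifts the mean by that constant but leaves the standard deviation unchanged. The sketch below shows this on a small set of hypothetical readings, not the 30 fish weights from the question.

```python
import statistics

# Hypothetical readings (gm) from the misaligned scale; not data from the question
measured = [28, 30, 32, 29, 31]

# The scale under-reported every weight by 2 gm, so add 2 gm back
corrected = [w + 2 for w in measured]

mean_before = statistics.mean(measured)
mean_after = statistics.mean(corrected)

sd_before = statistics.stdev(measured)
sd_after = statistics.stdev(corrected)

print(mean_before, mean_after)  # the mean shifts by 2
print(sd_before, sd_after)      # the spread is unchanged
```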
THANKS
Biostatistics 2
Probability
Distribution
Kurtosis is high when most of the data lie in the middle and very few at the extremes
Kurtosis is low when data are more or less uniformly distributed along the range
No skewness: mean = median = mode
Right (positive) skewness: mean > median > mode
Left (negative) skewness: mean < median < mode
3. Poisson Distribution:
Gives the probability of a given number of events occurring in a fixed interval of time or space.
These events occur with a constant mean rate and independently of the time since the last
event
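A minimal sketch of the standard Poisson probability formula, P(k) = λ^k e^(-λ) / k!, where λ is the mean number of events per interval. The formula is not written on the slide, and the rate of 2 events per interval is an assumed illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate per interval is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Assumed illustration: on average 2 events per interval
rate = 2.0

p_zero = poisson_pmf(0, rate)   # probability of no events
p_two = poisson_pmf(2, rate)    # probability of exactly two events

# The probabilities over all possible counts add up to 1
total = sum(poisson_pmf(k, rate) for k in range(40))

print(p_zero, p_two, total)
```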
2. Which one of the following statements regarding normal distribution is NOT correct? (JUNE 2019)
(1) It is symmetric around the mean
(2) It is symmetric around the median
(3) It is symmetric around the variance.
(4) It is symmetric around the mode.
Apply your mind
3. Following are statements to depict relationship among measures of central tendency in a skewed
dataset
A. In positively skewed datasets, mean > median > mode
B. In positively skewed datasets, mode >median > mean
C. In negatively skewed datasets, mean>median > mode
D. In negatively skewed datasets, mode> median> mean
Which one of the following options correctly represents A, B, C and D in the above equation?
(1) A-Evidence, B-Posterior probability, C-Likelihood, D-Prior probability
(2) A-Likelihood, B-Prior probability, C-Posterior probability, D-Evidence
(3) A-Posterior probability, B-Prior probability, C-Likelihood, D-Evidence
(4) A-Prior probability, B-Evidence, C-Posterior probability, D-Likelihood
HYPOTHESIS TESTING
TYPES OF ERRORS
LEVELS OF SIGNIFICANCE
Null Hypothesis:
Example:
Null hypothesis, H0: Online classes are not effective for selection in
CSIR NET life sciences.
               Reject H0          Accept H0
H0 is True     Type I error       Correct Decision
H0 is False    Correct Decision   Type II error
Null Hypothesis
Reality
Beneficial Harmful
Beneficial
OK Type II error
Research on drug
and you suggest it
• The smaller the p-value, the stronger the evidence that you
should reject the null hypothesis.
“The mean temperature of this region now is significantly higher than the one 50 years ago (p<0.05, t-test)”
(1) Ratio of the mean temperature of the two times periods tested
(2) Probability of the error of rejecting a true null hypothesis
(3) Probability of the error of accepting a false null hypothesis
(4) Probability of the t-test being effective in detecting significant difference in the mean annual
temperatures of the two periods
CONFIDENCE
INTERVAL
• A Confidence Interval is a range of values that we are fairly sure
contains the true value.
EXAMPLE: Average height of humans
• Suppose the average (mean) height of a sample of 100 humans is 165 cm and the calculated standard
deviation is 20 cm (we can use the standard deviation of the population if it is known).
• This says the true mean of the whole population (if we could measure it) is likely to lie between 161 cm and 169 cm
in 95% of cases.
• However, there is a 1-in-20 chance (5%) that our Confidence Interval does NOT include the
true mean.
How to calculate CI
Step 1:
• Start with the number of observations (n=100)
• Calculate mean and standard deviation
Step 2:
Decide what Confidence Interval we want:
95% or 99% are common choices.
Then find the "Z" value for that Confidence Interval
(Z = 1.96 for 95%, Z = 2.576 for 99%).
• Step 3: Use that Z value in this formula for the
Confidence Interval: mean ± Z × s / √n
Suppose the average (mean) height of a sample of 100 humans is 165 cm and the calculated
standard deviation is 20 cm. What will be the confidence limit range at the 95% level?
CI = 165 ± 2 × 20 / √100 (Z = 1.96 rounded to 2)
= 165 ± 2 × 20 / 10
= 165 ± 2 × 2
= 165 ± 4, i.e. 161 cm to 169 cm
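The three steps can be sketched as a small function. The function name and its default Z value are illustrative choices; the slide rounds Z = 1.96 to 2, so both versions are shown.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper) limits of mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Height example: n = 100, mean = 165 cm, standard deviation = 20 cm
low_rounded, high_rounded = confidence_interval(165, 20, 100, z=2)  # Z rounded to 2
low_exact, high_exact = confidence_interval(165, 20, 100)           # Z = 1.96

print(low_rounded, high_rounded)
print(round(low_exact, 2), round(high_exact, 2))
```

With Z = 2 the limits are 161 and 169 cm; with Z = 1.96 they narrow slightly to about 161.08 and 168.92 cm.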
Apply your mind
The mean and standard deviation of serum cholesterol in a population of senior citizens are
assumed to be 200 and 24mg/dl, respectively. In a random sample of 36 senior citizens, what
values of cholesterol (to the nearest whole number) should lead to rejection of the null
hypothesis at 95% confidence level? (JUNE 2015)
(1) above 224
(2) above 248
(3) below 176 and above 224
(4) below 192 and above 208
Given:
Observations (n) = 36
Mean (μ) = 200 mg/dl
Standard deviation (σ) = 24 mg/dl
Z (for 95% CI) = 1.96 (please memorize), rounded to 2 below
Acceptance range = 200 ± 2 × 24 / √36
= 200 ± 2 × 24 / 6
= 200 ± 2 × 4
= 200 ± 8
With Z = 1.96 the limits are 192.16 and 207.84, so accepted whole-number values at the 95% CI range from 193 to 207.
Values outside this range are rejected.
CORRELATION
Correlation can tell you just how much of the variation in people's
weights is related to their heights.
Positive                         Negative                                              Zero
Age of husband and age of wife   Demand of product with rise in price                  Car color and car mileage
Increase in height and weight    Sale of woolen clothes with increase in temperature   Drinking tea and clearing CSIR exam
Correlation Coefficient (r): r is called the (Pearson) correlation coefficient
r = Covariance(x, y) / [Standard deviation (x) × Standard deviation (y)]
Correlation Coefficient (r)
True or False??
Name     Study Hours (X)   Marks obtained (Y)   (x - x̄)   (y - ȳ)
Varun    2                 58                   -2        -2
Karan    4                 32                   0         -28
Garv     5                 63                   1         3
Subham   7                 87                   3         27
Snehal   3                 67                   -1        7
Laksh    1                 45                   -3        -15
Ram      6                 68                   2         8
Step 1: Calculate mean for X: x̄ = 28 / 7 = 4
Step 2: Calculate mean for Y: ȳ = 420 / 7 = 60
Step 3: Calculate deviation from mean for ‘x’ (column x - x̄)
Step 4: Calculate deviation from mean for ‘y’ (column y - ȳ)
Step 5: Multiply deviation from mean of ‘x’ and ‘y’
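The steps can be carried through to the correlation coefficient. The slides stop at Step 5, so the final steps in this sketch are an assumption: they use the standard deviation-score form r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² × Σ(y - ȳ)²], which is equivalent to covariance divided by the product of the standard deviations.

```python
import math

# Study hours (X) and marks obtained (Y) from the table
hours = [2, 4, 5, 7, 3, 1, 6]
marks = [58, 32, 63, 87, 67, 45, 68]
n = len(hours)

# Steps 1 and 2: means
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Steps 3 and 4: deviations from the means
dx = [x - mean_x for x in hours]
dy = [y - mean_y for y in marks]

# Step 5: products of deviations, then the sums of squares
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

# Pearson correlation coefficient
r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

print(mean_x, mean_y, sum_dxdy, sum_dx2, sum_dy2, round(r, 2))
```

For this table the sum of products is 142 and r comes out at about 0.62, a moderate positive correlation between study hours and marks.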
Null hypothesis:
Y = bX + c
If β1 (slope) is positive (> 0), there is a positive relationship
Survey:
β1 = Slope
Salary = β0 + β1 × education
Let salary = 3 + 1 × education (when education is 0, salary = 3 K)
With every 1 year of study,
salary increases
by 1 K
M.Sc. Salary ??
Find the linear regression equation for the following data pairs (x, y) given in the
above table
(1) y = 4x + 0
(2) y = 3x + 2
(3) y = 6x + 2
(4) y = 0.33 + 2
Regression coefficient of y on x:
b_yx = (nΣxy - Σx·Σy) / (nΣx² - (Σx)²)
X     Y     X·Y    X²     Y²
0     2     0      0      4
2     8     16     4      64
4     14    56     16     196
6     20    120    36     400
8     26    208    64     676
10    32    320    100    1024
12    38    456    144    1444
14    44    616    196    1936
Σx = 56, Σy = 184, Σxy = 1792, Σx² = 560, Σy² = 5744
Here n = 8
b_yx = (8 × 1792 - 56 × 184) / (8 × 560 - (56)²)
= (14336 - 10304) / (4480 - 3136)
= 4032 / 1344
= 3
Mean of X = 56 / 8 = 7
Mean of Y = 184 / 8 = 23
The regression line of y on x is:
Y - Ȳ = b (X - X̄)
Y - 23 = 3 (x - 7)
Y - 23 = 3x - 21
Y = 3x - 21 + 23
Y = 3x + 2
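The same calculation in code, using the slope formula from the worked example and the fact that the regression line passes through the point of means. Variable names are illustrative.

```python
# Data pairs from the table
xs = [0, 2, 4, 6, 8, 10, 12, 14]
ys = [2, 8, 14, 20, 26, 32, 38, 44]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept from Y - mean_y = b * (X - mean_x)
mean_x = sum_x / n
mean_y = sum_y / n
c = mean_y - b * mean_x

print(f"y = {b}x + {c}")
```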
THANKS
• Student ‘t’ Test
• ANOVA
• Chi-square
• Parametric and Non-Parametric Tests
• Field Ecology
Student ‘t’-test
W. S. Gosset (pen name: “Student”)
Sample size
Degree of freedom = 30
Null hypothesis testing at p = 0.05
Critical t value = 2.04
Null hypothesis
H0 = The means of all populations (K groups) are similar; there is no significant difference
Alternate hypothesis:
64 16
16 9
36 36
49 16
165 77
ΣX and ΣX²
Correction Term
Sum of Squares of Total
Sum of Squares Among group
Mean of Sum of Squares Among group
Mean of Sum of Squares within group
Fisher Ratio (F distribution)
Source of variation   Degree of freedom   Sum of squares   Mean of sum of squares   F Ratio (Distribution)
Among Groups          (K-1) = 2           28.17            14.085                   5.394
Within Groups         (N-K) = 9           23.5             2.6111
Null hypothesis:
No significant effect of surrounding noise on number of question solved
Alternate Hypothesis:
Significant effect of noise on number of questions solved
F = MSSA / MSSW (mean sum of squares among groups / mean sum of squares within groups)
= 14.085 / 2.6111
= 5.394
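The F ratio in the ANOVA table can be reproduced from the sums of squares and degrees of freedom. K = 3 groups and N = 12 observations are inferred from the table entries (K-1) = 2 and (N-K) = 9. Variable names are illustrative.

```python
# Values from the ANOVA table
ss_among = 28.17    # sum of squares among groups
ss_within = 23.5    # sum of squares within groups

# Inferred from (K-1) = 2 and (N-K) = 9
k_groups = 3
n_total = 12

df_among = k_groups - 1
df_within = n_total - k_groups

# Mean sum of squares = sum of squares / degrees of freedom
mss_among = ss_among / df_among
mss_within = ss_within / df_within

# F ratio = MSS among groups / MSS within groups
f_ratio = mss_among / mss_within

print(round(mss_among, 3), round(mss_within, 4), round(f_ratio, 3))
```

The computed F is then compared with the tabulated critical F value for (2, 9) degrees of freedom at the chosen significance level.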
Non-Parametric Test
A chi-squared test (χ2 test): Goodness of fit
Null hypothesis:
1. If a coin is tossed 100 times, there will be 50 heads and 50 tails
2. The probability of a child being male or female is equal
3. The F2 monohybrid phenotypic ratio will be 3:1 under dominance
4. The F2 dihybrid phenotypic ratio will be 9:3:3:1 under dominance
and independent assortment
5. There is equal frequency of A, T, G or C in DNA (1/4 each)
To test these hypotheses we can use the Chi-square
goodness of fit test
Null hypothesis:
There is no significant difference between the observed and
expected frequency of heads and tails in coin tosses
H H T H H
T H H H H
T H H T T
H T H H H
T H H H T
T H T H H
Total Head: 20 Total Tail: 10
Calculation part:
        Observed   Expected   O - E           (O - E)²   (O - E)² / E
Head    20         15         20 - 15 = 5     25         25 / 15 = 1.67
Tail    10         15         10 - 15 = -5    25         25 / 15 = 1.67
χ² = ∑ (O - E)² / E = 3.33
n = number of possible outcomes (H and T) = 2
Degree of freedom = n - 1 = 1
At significance level α = 0.05, the critical value with 1 df is 3.84
Calculate the chi-square value; if it is less than 3.84, accept the null hypothesis, else
reject it.
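A minimal sketch of the goodness of fit calculation for the coin example: 20 heads and 10 tails observed in 30 tosses, against an expected 15 of each. The function name is illustrative, and the critical value is the tabulated 3.84 for 1 degree of freedom at α = 0.05.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all outcome classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Coin example: 30 tosses, 20 heads and 10 tails observed, 15 of each expected
observed = [20, 10]
expected = [15, 15]

chi_sq = chi_square(observed, expected)
degrees_of_freedom = len(observed) - 1

critical_value = 3.84  # tabulated value for 1 df at alpha = 0.05
reject_null = chi_sq >= critical_value

print(round(chi_sq, 2), degrees_of_freedom, reject_null)
```

The statistic is about 3.33, below 3.84, so the null hypothesis of a fair coin is not rejected.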
1. Given below are a few statistical terms in Column I and their related features
and terms in Column II.
Column I                       Column II
A. Standard deviation          (i) Measure of relative variability of given populations.
B. Coefficient of variation    (ii) Used to make inferences about population means.
C. Chi-square test             (iii) Positive square root of population variance.
D. t-Test                      (iv) Tests hypotheses related to categorical data from inheritance studies.
Which one of the options given below correctly
matches all items of Columns I and II?
(1) A-iv; B-iii; C-ii; D-i (2) A-ii; B-iv; C-i; D-iii
(3) A-iii; B-i; C-iv; D-ii (4) A-ii; B-i; C-iv; D-iii
(FEB 2022-I)
2. From the steps listed below, some are used to evaluate the goodness of fit
using the chi-square test.
A. The mean, variance and standard deviation are calculated
B. Variance calculated using Σ(xi - x̄)² / (n - 1)
C. Test statistic calculated using Σ(observed - expected)² / expected
D. The degree of freedom is calculated as n-1, where n is the number of ways in
which the expected classes are free to vary
E. The probability value is obtained
Which one of the following options provides the correct sequence of steps in
this statistical analysis?
(1) A, C, D (2) C, D, E
(3) B, A, D (4) A, D, E
(DEC 2013)
3. In the population of an insect species, 50% are known to be
female. How many females should turn up in a random sample of
40 insects to reject the null hypothesis (χ2 value for rejection is ≥ 3.84)?
(1) 6
(2) ≤26 or≤8
(3) 25
(4) <17 or >27
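Null hypothesis:
There is no significant difference between the observed
and expected frequency of females and males in the sample
Expected: 20 females and 20 males
Consider female = 26 (male = 14):
χ² = ∑ (O - E)² / E = 36/20 + 36/20 = 3.6 (less than 3.84, so the null hypothesis is not rejected)
Consider female = 27 (male = 13):
         Observed   Expected   O - E           (O - E)²   (O - E)² / E
Female   27         20         27 - 20 = 7     49         49 / 20 = 2.45
Male     13         20         13 - 20 = -7    49         49 / 20 = 2.45
χ² = ∑ (O - E)² / E = 4.90 (at least 3.84, so the null hypothesis is rejected)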
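Instead of testing counts one at a time, the rejection region can be found by scanning every possible number of females in a sample of 40. This is a sketch under the question's assumptions (expected 20 females and 20 males, rejection when χ² ≥ 3.84); the function and variable names are illustrative.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

sample_size = 40
expected = [20, 20]     # 50% females, 50% males
critical_value = 3.84   # rejection threshold given in the question

# Collect every female count for which the null hypothesis is rejected
rejecting_counts = [
    females
    for females in range(sample_size + 1)
    if chi_square([females, sample_size - females], expected) >= critical_value
]

smallest_high = min(f for f in rejecting_counts if f > 20)
largest_low = max(f for f in rejecting_counts if f < 20)

print(largest_low, smallest_high)
```

Under these assumptions the null hypothesis is rejected for 13 or fewer females and for 27 or more.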
Non-parametric tests:
• Tests don’t require that your data follow the normal distribution.
• They’re also known as distribution-free tests and can provide benefits in
certain situations (nominal/ordinal).
• Used when individual variability among the study groups is high
Parametric:
• Paired t-test is used to compare means in two dependent groups (same group studied twice)
• Unpaired t-test is used to compare means in two independent groups
Examples:
• Paired/unpaired t-test
• ANOVA
• Pearson correlation
Non-parametric:
• Wilcoxon signed-rank test is used to compare two dependent groups (not normally distributed)
• Mann-Whitney U test is used to compare two independent groups (not normally distributed)
Apply Your mind
Two groups (Control, Treated) are to be compared to test the effect of a treatment. Since individual
variability is high in both groups, the appropriate statistical test to use is (JUNE 2015)
(1) Analysis of variance.
(2) Kendall's test.
(3) Student's t-test.
(4) Mann-Whitney U-test.
Test to assess strength of association between two variables
The use of Kruskal Wallis test is most appropriate in which of these cases? (JUNE
2016)
(1) There are more than two groups and each group is normally distributed.
(2) There are more than two groups and the distribution in each group is not
normal.
(3) There are two groups and each group is normally distributed.
(4) There are two groups and the distribution in each group is not normal.
Population Field Ecology
Apply Your mind
(FEB 2022-II)
Select the option that represents the correct combination of non-parametric
tests and its equivalent parametric test respectively that
can be used to compare two or more groups.
(1) Wilcoxon Rank Sum Test and Paired t-test
(2) Wilcoxon Rank Sum Test and Spearman correlation
(3) Spearman correlation and Kruskal Wallis test
(4) Mann-Whitney U test and Pearson correlation
Quadrat Method
Line transect Method
'n' individuals are collected randomly from the study area in a defined
period of time.
The captured individuals are counted, marked and released at the site of
collection.
The next day, or after some time, individuals are again captured from the same site
for the same length of time.
Number of marked (nM) and unmarked (nU)