Bio Statistics

Uploaded by

Shashi Singh
Copyright
© © All Rights Reserved
Biostatistics

Statistical methods involve collecting and analyzing data.

For any research, two key decisions have to be made:

1. How will you collect data?
2. How will you analyze data?
TYPES OF DATA
1. Qualitative (non-scalar variables) vs. quantitative (scalar variables)

Qualitative data
• Nominal: no specified order. Examples: Gender: M/F; Hair color: B/G/W
• Ordinal: have a specified order. Examples: Shirt: S, M, L, XL; Pizza: S, M, L

Quantitative data
• Interval: data that has measurable distances but does not include a true zero. Examples: Celsius temperature, altitude
• Ratio: data that has measurable distances and includes a true zero. Example: Time: sec, min
2. Primary vs. secondary:
• Will you collect original data yourself (primary data), or will you use data that has already
been collected by someone else (secondary data)?
• Secondary data example: meta-research, which integrates the findings of prior research
studies using statistical procedures

3. Descriptive vs. experimental:


• Will you take measurements of something as it is, or will you perform an
experiment?
Apply Your Mind

1. Marital status of a group of people represents which variable:


(1) Discrete
(2) Continuous
(3) Nominal
(4) Ordinal
Apply Your Mind

2. Among the data collected for a clinical study in diabetic patients,


which of the following is NOT a scalar variable?
(1) Blood glucose level at time of enrolment into research study
(2) Daily units of insulin prescribed to the patient
(3) Age of the patient at time of disease diagnosis
(4) District where the patients currently live

1. How to collect data?

SAMPLING DISTRIBUTION
• First, you need to identify the target
population of your research.

• The population is the entire group


that you want to draw conclusions
about.

• The sample is the specific group of


individuals that you will collect data
from.
During research on a group, it’s rarely possible to collect data from
every person in that group. Instead, you select a sample.

There are two types of sampling methods:

Non-probability sampling involves non-random selection based on


convenience

Probability sampling involves random selection


I. Non-random sampling
• Individuals are selected based on non-random criteria
• Not every individual has a chance of being included.
• For any research, the first pilot survey is usually done by non-random sampling
• It is easier and cheaper to access
• But you can’t use it to make valid statistical inferences about the
whole population.

• Appropriate for exploratory and qualitative research.


• The aim is not to test a hypothesis about a broad population, but
to develop an initial understanding
1. Convenience sampling

• It includes the individuals who happen to be most accessible


to the researcher.
This is an easy and inexpensive way to gather initial data,
but the sample is not representative of the population.
Example:

You are researching opinions about the sports facilities in your
college, so after each of your classes you ask your fellow
students to complete a survey on the topic.
2. Opportunity/Voluntary response sampling:

Only participants available and willing to participate are used.


3. Purposive sampling
• This type of sampling involves the
researcher using their judgment to
select a sample that is most useful to
the purposes of the research.

Example: You want to know more about the opinions of students who are
not taking the test on Sunday, so you purposely select those students.
4. Snowball sampling
• If the population is hard to access, snowball
sampling can be used to recruit participants via
other participants.
• The number of people you have access to
“snowballs” as you get in contact with more
people.

Example: You are researching the experiences of drug use among youth in your city. Since there
is no list of all drug users, probability sampling isn’t possible.

One person agrees to participate in the research, and he puts you in contact
with other drug users that he knows in the area.
II. Probability sampling methods

Probability sampling means that every member of the population has a known, non-zero chance
of being selected.

Useful for drawing conclusions about the complete population using quantitative data
It reduces selection bias

Four main types

1. Simple random sample
2. Systematic random sample
3. Stratified random sample
4. Cluster random sample
1. Simple random sample:

Each item of the population has an equal chance of being included in the sample.

Example:
You want to select a simple random sample of 100 employees of Company X. You assign
a number to every employee in the company database from 1 to 1000, and use a
random number generator to select 100 numbers.
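
The Company X example can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the `seed` argument are assumptions added here so that the run is repeatable.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct employee numbers from 1..population_size."""
    rng = random.Random(seed)  # the seed only makes the run repeatable
    return rng.sample(range(1, population_size + 1), sample_size)

# 100 employee numbers out of 1000, each equally likely to be chosen.
chosen = simple_random_sample(1000, 100, seed=1)
print(sorted(chosen)[:10])
```

`random.sample` draws without replacement, so no employee can be selected twice.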
2. Systematic random sampling
• Population is large, scattered and not homogeneous.
• Samples are selected at regular intervals from the population.

Example:
All employees of the company are listed in alphabetical order. From the first 10 numbers,
you randomly select a starting point: number 2. From number 2 onwards, every 10th person
on the list is selected (2, 12, 22, 32, and so on), and you end up with a sample of 100 people.
3. Stratified random sampling:

Used when the population is small and not homogeneous

Population is divided into groups (strata) and, within each
stratum, a probability sample is selected.

Example:
The company has 800 male employees and 200 female
employees. You want to ensure that the sample reflects the
gender balance of the company, so you sort the population
into two strata based on gender.

Then you use random sampling on each group, selecting 80
men and 20 women, which gives you a representative
sample of 100 people.
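
A minimal sketch of proportional stratified sampling for the 800/200 example follows. The employee IDs, the function name `stratified_sample` and the `seed` argument are hypothetical and only serve the illustration.

```python
import random

def stratified_sample(strata, total_sample, seed=None):
    """Draw from each stratum in proportion to its size.

    strata maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_sample * len(members) / population)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical employee IDs: 800 men and 200 women.
strata = {"male": list(range(1, 801)), "female": list(range(801, 1001))}
picked = stratified_sample(strata, 100, seed=1)
print({name: len(ids) for name, ids in picked.items()})
```

With these sizes the sketch selects 80 men and 20 women, matching the gender balance of the company.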
4. Cluster sampling:

Used when the complete, large population cannot be investigated

Population is divided into groups (clusters) and
a sample of clusters is chosen using a
probability method.

Example:
The company has offices in 6 cities across the country (all
with roughly the same number of employees in similar
roles). You don’t have the capacity to travel to every office
to collect your data, so you use random sampling to select
2 offices – these are your clusters.
Apply your mind
1. Given below are sampling techniques and their features

Which one of the following options correctly matches sampling techniques with their features?
(1) A-(ii); B-(i); C-(iv); D-(iii) (2) A-(ii); B-(iv); C-(iii); D-(i)
(3) A-(i); B-(iv); C-(iii); D-(ii) (4) A-(i); B-(iv); C-(ii); D-(iii)
2. How to analyze the data?

For qualitative data: Thematic analysis to interpret patterns and meanings in the data.
For quantitative data:

Statistical analysis methods to test relationships between variables.

Two important types of statistics are


• measures of central tendency and
• measures of dispersion.
I. Measures of central tendency:

It is a number used to represent the center or middle of a set
of data values.

1. Arithmetic mean (μ): The sum of all the values in the sample or population divided by
the number of values

Example data: 13, 18, 13, 14, 13, 16, 14, 21, 13

Mean = (13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9
= 135 ÷ 9
= 15
Arithmetic Mean (AM)
• (a1 + a2 + a3 + … + an) / n

Geometric Mean (GM)
• ⁿ√(a1 × a2 × a3 × … × an), i.e. the nth root of the product

Harmonic Mean (HM)
• n / [(1/a1) + (1/a2) + (1/a3) + … + (1/an)]

For two positive values: AM × HM = GM²
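
The three means and the relation between them can be checked numerically. The sketch below uses Python's standard `statistics` module; the pair of values (4 and 9) is a hypothetical example, and the relation AM × HM = GM² is only expected to hold for a pair of positive values.

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [4, 9]  # hypothetical pair of positive values

am = mean(values)            # (4 + 9) / 2
gm = geometric_mean(values)  # square root of 4 x 9
hm = harmonic_mean(values)   # 2 / (1/4 + 1/9)

print(am, gm, hm)
print(am * hm, gm ** 2)  # both are close to 36 for a pair of values
```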
2. Median: The value separating the higher half of a sample or population
from the lower half
Found by arranging all the values from lowest to highest and taking the
middle one
If there is an even number of values, the median is the mean of the two
middle values.
Appropriate measure when data are contaminated by outliers (non-parametric)
Example :Data (Odd Number):
13, 13, 13, 13, 14, 14, 16, 18, 21 Total Data = 9

First arrange in ascending order


13, 13, 13, 13, (14), 14, 16, 18, 21

Median: 14

Data (Even Number):

13, 13, 13, 13, 13, 14, 14, 16, 18, 21 Total Data = 10

13, 13, 13, 13, (13, 14,) 14, 16, 18, 21

Median: (13+14)/2 = 13.5


Fractile: Division of the complete set of observations into equal parts

Median: two equal parts

Quartiles: four equal parts (divided by three points: Q1, Q2, Q3)

Quintiles: five equal parts (0-20, 21-40, 41-60, 61-80, 81-100)

Deciles: ten equal parts (T1, T2, …, T10)

Centile or percentile: 100 equal parts (divided by 99 points: P1, P2, P3, P4, …, P99)
3. Mode: The mode is the value that is repeated more often than any other.

Empirical relation (for moderately skewed data): Mode = 3 Median − 2 Mean


Apply Your Mind

Data set: 1, 6, 5, 7, 6, 8, 4, 4, 4 Total values (n) = 9

Arranged in ascending order: 1, 4, 4, 4, (5), 6, 6, 7, 8 Sum = 45

Mean = _____ Median = _____ Mode = ______


Apply Your Mind

The most appropriate measure of central tendency when data


are contaminated by outliers is:
(1) Mode
(2) Arithmetic mean
(3) Geometric mean
(4) Median
Apply Your Mind

The mean and median of a data set are 24 and 22, respectively.
The mode of the data set will be:
(1) 23
(2) 18
(3) 2
(4) -2
II. Measures of Dispersion
1. Range
• One simple measure of dispersion is the range, which is the difference
between the greatest and least data values.

Example: for the data 1 2 3 4 5 6 7 8 9, range = 9 − 1 = 8
Quartile: Meaning

• One of three points that divide a data set into four equal parts.
• Each group contains equal number of observations or data.
• Median acts as base for calculation of quartile.
2. Quartile deviation
• The difference Q3 – Q1 is known as the
interquartile range
• Interquartile range divided by two is known as quartile
deviation or semi-interquartile range.
• Quartile Deviation = (Q3 – Q1) / 2
• Good for non-parametric data (outliers present)
To calculate the quartile deviation, first find Q1, then find Q3, take the difference
between the two and finally divide by 2.
Example: Find the quartiles and quartile deviation of the following data:
17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28 = 16 data

Solution:
Ascending order of the given data is: 2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48
Number of data values = n = 16
Q2 = Median of the given data set

n is even, median =(1/2)[8th observation + 9th observation]


= (10 + 14)/2
= 24/2
= 12
Q2 = 12
2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48
Q2 = 12

Now, the lower half of the data is:
2, 5, 7, 7, 8, 8, 10, 10 (even number of observations = 8)
Q1 = Median of lower half of the data
Q1 = (1/2)[4th observation + 5th observation]
Q1 = (7 + 8)/2 = 15/2 = 7.5

Also, the upper half of the data is:
14, 15, 17, 18, 24, 27, 28, 48 (even number of observations = 8)
Q3 = Median of upper half of the data
= (1/2)[4th observation + 5th observation]
Q3 = (18 + 24)/2 = 42/2 = 21

Quartile deviation
= (Q3 – Q1)/2
= (21 – 7.5)/2
= 13.5/2
= 6.75
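
The worked example above can be reproduced with a short sketch. It assumes the same median-of-halves rule used in the solution (other software may use a different quartile convention), and the function name `quartiles` is introduced here only for illustration. It expects at least two observations.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]   # lower half of the ordered data
    upper = ordered[-half:]  # upper half (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

data = [17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28]
q1, q2, q3 = quartiles(data)
quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)  # 7.5 12.0 21.0 6.75
```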
3. Variance and Standard Deviation

• Standard deviation (SD) is a measure of the amount
of dispersion of a set of values around the mean.

• A low SD indicates that the values tend to be
close to the mean of the set, while a high SD
indicates that the values are spread out over a
wider range

• Variance is
• sum of squares of deviations from the mean divided by the degrees
of freedom (n − 1) for a sample
• sum of squares of deviations from the mean divided by the number of
observations (n) for a population
• Standard deviation (σ) is the square root of the variance.

Steps for the computation of standard deviation:
1. Calculate the mean
2. Find the difference of each observation from the mean
3. Square the differences of observations from the mean
4. Add the squared values to get the sum of squares.
5. Divide this sum by the number of observations to get the mean-squared
deviation, called variance.
6. Find the square root of this variance to get the root-mean-squared deviation,
called standard deviation.

Example:
X      X − X̄    (X − X̄)²
2      -3       9
2      -3       9
4      -1       1
4      -1       1
5       0       0
5       0       0
6      +1       1
6      +1       1
8      +3       9
8      +3       9
Total = 50      Sum of (X − X̄)² = 40

Mean = Total/n = 50/10 = 5
Variance = 40/10 = 4
SD = 2
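
The same ten values can be run through Python's standard `statistics` module as a check. A minimal sketch: `pvariance` and `pstdev` divide by n (population formulas, as in the table), while `variance` divides by n − 1 (sample formula).

```python
from statistics import mean, pstdev, pvariance, variance

data = [2, 2, 4, 4, 5, 5, 6, 6, 8, 8]

print(mean(data))       # mean of the ten values: 50 / 10
print(pvariance(data))  # population variance: 40 / 10
print(pstdev(data))     # population SD: square root of the variance
print(variance(data))   # sample variance: 40 / (10 - 1)
```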
• Coefficient of variation (COV)

= Standard deviation / mean (often expressed as a percentage)

• More useful for comparing the variability of
different data sets
Outliers

• An outlier is a value that is much greater than or much less than


most of the other values in a data set.

• Outlier can be defined as a value in data set that lies more than
three standard deviation from mean

• Measures of central tendency and dispersion can give misleading


impressions of a data set if the set contains one or more outliers.
Which of the following data is/are probably outlier?

a. 54,55,54,96,60,58,55,56,06,62,68,62,55,72,69,44
The appropriate measure of dispersion of an open-end class data is (data are
contaminated by outliers ):
(1) Range
(2) Mean deviation
(3) Quartile deviation
(4) Standard deviation
Apply your mind
Following are statements related to statistical methods:
A. An outlier can be defined as a value in a data set that lies more than three standard deviations
from the mean.
B. Measures of central tendency and dispersion are independent of the presence of outliers in a
data set.
C. Standard deviation is a measure of dispersion.
D. Mean, median and mode are not measure of central tendency.

Which one of the following options is a combination with both INCORRECT statements?
1. A and B 2. B and C
3. A and C 4. B and D
Apply your mind:

1. The μ and σ of wing length (a normally distributed parameter) in a population of fruitflies are 4 and 0.2 mm.
respectively. In a random sample of 400 fruitflies, how many individuals are expected to have wing lengths
greater than 4.4 mm? (JUNE 2011)
(1) 20 (2) 64
(3) 10 (4) 336

Solution:
Mean = 4 mm, SD = 0.2 mm
4.4 mm = mean + 2σ
Mean ± 2σ includes about 95% of individuals
Mean to +2σ includes 47.5% = 190 individuals
Greater than +2σ is 2.5% of 400 = 10 individuals
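
The fruit fly answer can be cross-checked with `statistics.NormalDist` from the Python standard library. This is a sketch: the exact upper tail beyond +2σ is slightly below the 2.5% given by the 95% rule of thumb, which is why the exact expected count comes out a little under 10.

```python
from statistics import NormalDist

wing_length = NormalDist(mu=4.0, sigma=0.2)

# Exact probability of a wing length greater than 4.4 mm (mean + 2 SD).
tail = 1 - wing_length.cdf(4.4)
print(round(tail, 4))
print(round(400 * tail, 1))  # expected count using the exact tail

# The 95% rule of thumb rounds the tail to 2.5%.
print(400 * 0.025)
```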
2. The pH of a solution is 7.4 ± 0.02 where 0.02 is standard deviation obtained from eight
measurements. If more measurements were carried out, the % of samples whose pH would fall
between pH 7.38 and 7.42 is (JUNE 2017)
(1) 99.6 (2) 95.4
(3) 68.2 (4) 99.8
A scientist is weighing each of 30 fishes. Their mean weight worked out is 30
gm and a standard deviation of 2 gm. Later, it was found that the measuring
scale was misaligned and always under reported every fish weight by 2 gm.
The correct mean and standard deviation (in gm) of fishes are Respectively:
(1) 28, 2 (2) 28, 4
(3) 32, 2 (4) 32, 4
THANKS
Biostatistics 2
Probability Distribution

Three types: Binomial, Normal, Poisson

1. Binomial Distribution:

Each observation represents one of two outcomes (success


or failure)

Yes or No, Head or tail , Success or failure, Male or female

It is applicable when

1. The number of trials is fixed.
2. Every trial is independent. None of your trials should
affect the outcome of the next trial.
3. The probability of success stays the same in every trial.
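
Under these three conditions the probability of exactly k successes in n trials is C(n, k) × p^k × (1 − p)^(n − k). A minimal sketch follows; the coin-toss numbers are a hypothetical example and the function name `binomial_pmf` is introduced only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical example: exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))  # 0.375
```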
2. Normal Distribution:
• Values vary continuously on both sides of the central value

• Probability distribution is symmetric about the mean, mode and median


Data of normal distribution presents a bell shaped symmetrical curve called "normal
distribution curve". This curve is also known as the "Gaussian curve".

Mean = Median = Mode


Examples of Normal distribution:

1. Measures of size of living tissue: Length, height, skin area, weight


2. The length of inert appendages: Hair, claws, nails, teeth
3. Certain physiological measurements, such as blood pressure of adult
humans.
Skewness and Kurtosis
Skewness is a measure of the extent to which a probability distribution of a
real-valued random variable ”leans” to one side of the mean.

The skewness value can be positive or negative, or zero.

Kurtosis is high when middle data are more and extremes are very low

Kurtosis is low when data are more or less uniformly distributed along range
No skewness: mean = median = mode
Right (positive) skewness: mean > median > mode
Left (negative) skewness: mean < median < mode
3. Poisson Distribution :

The probability of a given number of events occurring in a fixed interval of time or space.
These events occur with constant mean rate and independently of the time since the last
event

Eg. Telephone call per hour


Online order per day
Number of radioactive decay events per second
Number of plants per meter square
Number of mutation in DNA strand per unit length
The proportion of cells that will be infected at a given MOI
The number of deaths per year in a given age group.
The number of bacteria in a certain amount of liquid.
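
For a Poisson process with mean λ events per interval, the probability of exactly k events is e^(−λ) × λ^k / k!. The sketch below applies this to the MOI example from the list; the MOI value of 1 and the function name `poisson_pmf` are assumptions made for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical MOI of 1: particles per cell are assumed to follow Poisson(1).
moi = 1.0
uninfected = poisson_pmf(0, moi)  # cells that receive no particle
infected = 1 - uninfected         # cells that receive at least one
print(round(uninfected, 3), round(infected, 3))  # 0.368 0.632
```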
Bayes' theorem
• Bayesian methods are used to update probabilities, which are degrees of belief, after obtaining new
data.
• Given two events H and E, the conditional probability of H, given that E is true, is expressed
as follows:

P(H | E) = P(E | H) × P(H) / P(E)

P(H) is the prior probability of H, which expresses one's beliefs about H before
evidence is taken into account (Head = 0.5, Race won by Rohit = 0.75).

P(E | H) is the likelihood function, which can be interpreted as the
probability of the evidence E given that H is true (0.25).

P(E) represents the probability of the evidence, or new data that is to be taken
into account (Rain chance in next race = 0.75).

P(H | E) is the posterior probability, the probability of the proposition H
after taking the evidence E into account (Chance of Rohit winning, taking
rain into account).
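
Plugging the numbers quoted above into Bayes' theorem gives the posterior directly. In this sketch the reading of 0.25 as "chance of rain given that Rohit wins" is an assumption about the example, and the function name `posterior` is introduced only for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_win = 0.75             # P(H): prior chance that Rohit wins
p_rain_given_win = 0.25  # P(E|H): assumed chance of rain given that he wins
p_rain = 0.75            # P(E): chance of rain in the next race

print(posterior(p_win, p_rain_given_win, p_rain))  # 0.25
```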
Apply your mind
1. Given below are names of statistical distribution (Column I) and their characteristic
features (Column II). Which one of the following represents a correct match between
columns I and II? (DEC 2018)
Column I Column II
A Binomial distribution i Each observation represents one of two outcomes (success or
failure)
B Poisson distribution ii Probability that is symmetric about the mean
C Normal distribution iii Probability of a given number of events happening in a fixed
interval of time
(1) A - (ii) ; B - (i) ; C - (iii) (2) A - (i) ; B - (ii) ; C - (iii)
(3) A - (i); B - (iii) ; C - (ii) (4) A - (iii) ; B - (ii); C - (i)
Apply your mind

2. Which one of the following statements regarding normal distribution is NOT correct? (JUNE 2019)
(1) It is symmetric around the mean
(2) It is symmetric around the median
(3) It is symmetric around the variance.
(4) It is symmetric around the mode.
Apply your mind
3. Following are statements to depict relationship among measures of central tendency in a skewed
dataset
A. In positively skewed datasets, mean > median > mode
B. In positively skewed datasets, mode >median > mean
C. In negatively skewed datasets, mean>median > mode
D. In negatively skewed datasets, mode> median> mean

Which of the above statements are TRUE? (JUNE 2018)


(1) A and B (2) A and C
(3) B and D (4) A and D
Apply your mind

4. The following represents an equation for Bayesian statistics:

Which one of the following options correctly represents A, B, C and D in the above equation?
(1) A-Evidence, B-Posterior probability, C-Likelihood, D-Prior probability
(2) A-Likelihood, B-Prior probability, C-Posterior probability, D-Evidence
(3) A-Posterior probability, B-Prior probability, C-Likelihood, D-Evidence
(4) A-Prior probability, B-Evidence, C-Posterior probability, D-Likelihood
HYPOTHESIS TESTING
TYPES OF ERRORS
LEVELS OF SIGNIFICANCE
Null Hypothesis:

• The null hypothesis, H0, is the commonly accepted fact.
• It is the opposite of the alternate hypothesis (H1).
• Researchers work to reject, nullify or disprove the null hypothesis.

Example:

Null hypothesis, H0: Online classes are not effective for selection in
CSIR NET life sciences.

Alternate hypothesis, H1: Online classes are effective for selection in
CSIR NET life sciences.
Hypothesis Testing
• Hypothesis: explanation based on limited evidence
• No hypothesis is taken as 100% true unless proven
• There is always a chance of drawing incorrect conclusions (errors)
Two types of errors in testing of hypothesis

1. Type I error: Rejection of a null hypothesis which is true.

2. Type II error: Acceptance of a null hypothesis which is false.

               Reject H0          Accept H0
H0 is True     Type I error       Correct decision
H0 is False    Correct decision   Type II error
Example: research on a drug (null hypothesis: the drug is beneficial)

                          Reality: Beneficial   Reality: Harmful
You suggest: Beneficial   OK                    Type II error
You suggest: Harmful      Type I error          OK


• Sensitivity of a test refers to its ability to correctly identify those
with the disease.
• Minimize the chances of a type-II error

• Specificity of test is ability of a test to correctly identify patients


without the disease
• Minimize the chances of a type-I error
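
In terms of counts, sensitivity = true positives / (true positives + false negatives) and specificity = true negatives / (true negatives + false positives). A minimal sketch with hypothetical counts follows; the function names are introduced only for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Share of people with the disease that the test correctly identifies."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of people without the disease that the test correctly identifies."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 100 people with the disease, 100 without.
print(sensitivity(true_pos=90, false_neg=10))  # 0.9
print(specificity(true_neg=80, false_pos=20))  # 0.8
```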
Significance level and p-value:
• The p-value is used in hypothesis testing to help you decide
whether to reject the null hypothesis.

• The p-value measures the evidence against a null hypothesis: it is the
probability of obtaining a result at least as extreme as the one observed
if the null hypothesis is true.

• The smaller the p-value, the stronger the evidence that you
should reject the null hypothesis.

• The significance level (α) is the cut-off chosen for the p-value; it is the
probability of committing a type I error that you are willing to accept.

• The smaller the p-value, the lesser is the probability of the
error (type I) of rejecting a true null hypothesis.
Significance level (p): error for rejection of a true null hypothesis:
p ≤ 0.05: rejecting the null hypothesis at this level carries at most a 5%
chance of committing a type I error.

If p > 0.10 → “not significant”


If p ≤ 0.10 → “marginally significant”
If p ≤ 0.05 → “significant”
If p ≤ 0.01 → “highly significant.”
Apply your mind

A research laboratory examined their data of 100 patients proven to have


tuberculosis based on results from sputum culture. Only 40 of them had a positive
result on sputum microscopy, while 80 had a positive result from a novel
diagnostic test under evaluation. Based on this information, as compared to
sputum microscopy, the new test has better:
1. Sensitivity
2. Specificity
3. Positive predictive value
4. Negative predictive value
Apply your mind:
In the following statement taken from a research paper, what does p in the parenthesis stands for? (DEC
2011)

“The mean temperature of this region now is significantly higher than the one 50 years ago (p<0.05, t-
test)”

(1) Ratio of the mean temperature of the two times periods tested
(2) Probability of the error of rejecting a true null hypothesis
(3) Probability of the error of accepting a false null hypothesis
(4) Probability of the t-test being effective in detecting significant difference in the mean annual
temperatures of the two periods
CONFIDENCE
INTERVAL
• A confidence interval is a range of values we are fairly sure our
true value lies in.
EXAMPLE: Average height of humans
• Suppose the average (mean) height of a sample of 100 humans is 165 cm and the calculated standard
deviation is 20 cm (we can use the standard deviation of the population if it is known).

• What will be the confidence limits at the 95% level if we extend the sample to the complete
population? (The range in which you are 95% sure the population mean will fall.)

• At the 95% confidence level it will be: 165 cm ± 4 cm.

• This says the true mean of all humans (if we could measure them) is likely to be between 161 cm and 169 cm
in 95% of cases.

• Although there is a 1-in-20 chance (5%) that our confidence interval does NOT include the
true mean.
How to calculate CI
Step 1:
• Start with the number of observations (n = 100)
• Calculate the mean and standard deviation

Step 2:
• Decide what confidence interval we want: 95% or 99% are common choices.
• Then find the "Z" value for that confidence interval: Z = 1.96 for 95% and Z = 2.58 for 99%.

Step 3:
• Use that Z value in this formula for the confidence interval:
Confidence interval = Mean ± Z × (Standard deviation / √n)
Suppose if average (mean) height of 100 human sample is 165 cm and calculated
standard deviation is 20 cm. What will be confidence limit range at 95 % level ?

Observations (n) = 100
Mean (μ) = 165 cm
Standard deviation (σ) = 20 cm
Z (for 95% CI) = 1.96 ≈ 2

Confidence interval = 165 ± 2 × (20 / √100)
= 165 ± 2 × (20 / 10)
= 165 ± 2 × 2
= 165 ± 4
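
The calculation can be sketched in Python without rounding Z to 2. The function name `confidence_interval` is introduced only for illustration; the Z value is taken from the standard normal distribution via `statistics.NormalDist`, so the limits come out slightly narrower than 161 to 169.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(mean, sd, n, level=0.95):
    """Return (lower, upper) limits for the mean using a Z value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval(165, 20, 100)
print(round(low, 2), round(high, 2))
```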
Apply your mind
The mean and standard deviation of serum cholesterol in a population of senior citizens are
assumed to be 200 and 24mg/dl, respectively. In a random sample of 36 senior citizens, what
values of cholesterol (to the nearest whole number) should lead to rejection of the null
hypothesis at 95% confidence level? (JUNE 2015)
(1) above 224
(2) above 248
(3) below 176 and above 224
(4) below 192 and above 208

Given:
Observations (n) = 36
Mean (μ) = 200 mg/dl
Standard deviation (σ) = 24 mg/dl
Z for 95% confidence interval = 1.96 ≈ 2 (please memorize)

Confidence interval = 200 ± 2 × (24 / √36)
= 200 ± 2 × (24 / 6)
= 200 ± 2 × 4
= 200 ± 8

Accepted values at 95% CI range from 192 to 208. Values outside this range are rejected.
CORRELATION

Co = “together”, relation = “connection”

It shows whether and how strongly pairs of variables are related
(direction + strength).

For example, human height and weight are related.

Correlation can tell you just how much of the variation in peoples'
weights is related to their heights.
Examples of correlation:
• Positive: age of husband and age of wife; increase in height and weight
• Negative: demand for a product with rise in price; sale of woolen clothes with increase in temperature
• Zero: car color and car mileage; drinking tea and clearing the CSIR exam
Correlation Coefficient (r)

r is called the (Pearson) correlation coefficient

r = Covariance(x, y) / [Standard deviation (x) × Standard deviation (y)]

It ranges from -1.0 to +1.0.


Null hypothesis:

There is no correlation between study hour and marks obtained in exam.

True or False??
          Study Hours (X)   Marks obtained (Y)
Varun     2                 58
Karan     4                 32
Garv      5                 63
Subham    7                 87
Snehal    3                 67
Laksh     1                 45
Ram       6                 68

Step 1: Calculate mean for X: x̄ = 28/7 = 4
Step 2: Calculate mean for Y: ȳ = 420/7 = 60
Step 3: Calculate deviation from mean for ‘x’
Step 4: Calculate deviation from mean for ‘y’
Step 5: Multiply the deviations from mean of ‘x’ and ‘y’ and sum the products (this sum, divided by n − 1, is the covariance)
Step 6: Square the deviations of ‘x’ from mean and sum them
Step 7: Square the deviations of ‘y’ from mean and sum them

X    Y     (x-x̄)   (y-ȳ)   (x-x̄)(y-ȳ)   (x-x̄)²   (y-ȳ)²
2    58    -2      -2      4            4        4
4    32    0       -28     0            0        784
5    63    1       3       3            1        9
7    87    3       27      81           9        729
3    67    -1      7       -7           1        49
1    45    -3      -15     45           9        225
6    68    2       8       16           4        64
x̄ = 4  ȳ = 60              ∑ = 142      ∑ = 28   ∑ = 1864

Step 8: Divide the sum of products by the square root of the product of the two sums of squares
r = 142 / √(28 × 1864) = 142 / 228.46 = 0.62
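
The three sums in the table are all that is needed for r, because the (n − 1) terms in the covariance and the standard deviations cancel. A short sketch that recomputes them from the study hours and marks (the variable names are chosen here for illustration):

```python
from math import sqrt

hours = [2, 4, 5, 7, 3, 1, 6]
marks = [58, 32, 63, 87, 67, 45, 68]

mean_x = sum(hours) / len(hours)
mean_y = sum(marks) / len(marks)

# Sum of products of deviations and the two sums of squared deviations.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
sum_xx = sum((x - mean_x) ** 2 for x in hours)
sum_yy = sum((y - mean_y) ** 2 for y in marks)

r = sum_xy / sqrt(sum_xx * sum_yy)
print(sum_xy, sum_xx, sum_yy, round(r, 2))  # 142.0 28.0 1864.0 0.62
```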
Null hypothesis:

There is no correlation between study hour and marks obtained in exam.

Accept/Reject: compare to critical r values

1. Calculate degrees of freedom = (N − 2). The degrees of freedom for correlations are the
total number of score pairs (N) minus two.
= (7 − 2) = 5
2. Assume α or p = 0.05 for a one-tailed test
3. Find critical r = 0.669
4. Decision: Reject or accept null hypothesis
Calculated r > Critical r: Reject null hypothesis
Calculated r < Critical r: Accept null hypothesis

Null hypothesis:

There is no correlation between study hour and
marks obtained in exam.

Calculated r (0.62) < Critical r (0.669): accept the null hypothesis


Correlation is not causation

A high correlation coefficient does not by itself mean that the correlation is statistically
significant (it depends on sample size).


Regression
It is used to estimate relationships between a dependent variable (Y-axis) and one or
more independent variables (X-axis)

Y = bX + c, also written as Y = β1 X + β0, where b = β1 is the slope and c = β0 is the intercept

If β1 (slope) is positive (> 0), there is a positive relationship

If β1 (slope) is negative (< 0), there is a negative relationship

If β1 (slope) is 0, there is no relationship


You want to go for Ph. D

Your friend says:

Don’t waste your time

Anita did Ph.D and is unemployed

Vijay failed in X class, today he is Crorepati

What you will do?


Assess the relationship between salary and education

Does higher education mean higher salary?

Survey:

Sample of 100 individuals

100 data from our sample
     Name      Education   Salary
1    Amar      10          12000
2    Bhavna    M.Sc        20000
3    Charu     B.Sc        15000
4    Dinesh    5           10000
5    Mahima    Ph. D       25000
------
What is the best way to represent the data?
Scatter plot

Can we see any relationship?

Line of best fit = regression line
Y = bX + c
Y = β1 X + β0
Salary = β0 + β1 education

β1 = Slope
β0 = Intercept: salary when education is 0 = 3 K
Let salary = 3 + 1 education
With every 1 year of study, salary is going to be increased by 1 K

M.Sc salary = ??
Y = bX + c

Regression coefficient of Y on X:
byx = (nΣxy − Σx · Σy) / (nΣx² − (Σx)²)

The regression line of Y on X is:
Y − Ȳ = byx (X − X̄)

X̄ = mean of X
Ȳ = mean of Y
Solved example:

Find the linear regression equation for the following data pairs (x, y) given in the
table below
(1) y = 4x + 0
(2) y = 3x + 2
(3) y = 6x + 2
(4) y = 0.33x + 2
X     Y     X·Y    X²     Y²
0     2     0      0      4
2     8     16     4      64
4     14    56     16     196
6     20    120    36     400
8     26    208    64     676
10    32    320    100    1024
12    38    456    144    1444
14    44    616    196    1936
Σx = 56   Σy = 184   Σxy = 1792   Σx² = 560   Σy² = 5744

Here n = 8

byx = (nΣxy − Σx · Σy) / (nΣx² − (Σx)²)
= (8 × 1792 − 56 × 184) / (8 × 560 − (56)²)
= (14336 − 10304) / (4480 − 3136)
= 4032 / 1344
= 3

Mean of X = 56/8 = 7
Mean of Y = 184/8 = 23

The regression line of Y on X is:
Y − Ȳ = b (X − X̄)
Y − 23 = 3 (x − 7)
Y − 23 = 3x − 21
Y = 3x − 21 + 23
Y = 3x + 2
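
The slope formula and the line through the two means can be checked with a few lines of Python. This is a sketch of the same hand calculation, not a general regression routine; the variable names are chosen here for illustration.

```python
xs = [0, 2, 4, 6, 8, 10, 12, 14]
ys = [2, 8, 14, 20, 26, 32, 38, 44]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope from the byx formula, intercept from the line through the means.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n
print(slope, intercept)  # 3.0 2.0
```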
THANKS
• Student ‘t’ Test
• ANOVA
• Chi-square
• Parametric and Non-Parametric Test
• Field Ecology
Student ‘t’-test
Developed by W. S. Gosset (pen name = “Student”)

Used for comparison of two sample means.

Sample size

Less than 30 = t test
More than 30 = Z test
To test the hypothesis we have to calculate the value of t and compare it with the critical value.

Null Hypothesis (H0):
There is no significant difference between the samples

Alternative Hypothesis (H1):
There is significant difference between the samples

Degree of freedom:
= n1 + n2 – 2
= 16 + 16 – 2
= 30

Null hypothesis testing at p = 0.05

Degree of freedom = 30
Critical value = 2.04

Accept Null Hypothesis (H0) when the calculated t is lower than the critical value:
There is no significant difference between the samples

ANOVA (Analysis of Variance)

Comparison of means of two or more groups/samples by analysing their variance

Null hypothesis:
H0 = Means of all populations (K groups) are similar, no significant difference

Alternate hypothesis:
H1 = At least one of the means differs from the others

ANOVA
One way ANOVA
Effect of fertilizer on plant height (cm)

                    Low        Medium     High
                    Group 1    Group 2    Group 3
                    5          4          5
                    6          5          3
                    4          6          5
                    5          8          7
                    4          6          6
Sample mean (x̄)    4.8        5.8        5.2


Effect of surrounding noise/disturbance on the question-solving ability of students

Low noise (Group 1)     Medium noise (Group 2)    Loud noise (Group 3)
Student   Correct       Student   Correct         Student   Correct
          questions               questions                 questions
1         10            5         8               9         4
2         9             6         4               10        3
3         6             7         6               11        6
4         7             8         7               12        4
Sample mean: 8          Sample mean: 6.25         Sample mean: 4.25

Variation among the groups


Variation within group
Two-way ANOVA

Tests the effect of two factors at the same time, for example:

1. Effect of fertilizer on plant height (cm)

Low Medium High

2. Effect of photoperiod on plant height (cm)

10 hours 12 hours 14 hours


Null hypothesis:

No significant effect of surrounding noise on the number of questions solved.

Alternate hypothesis:

Significant effect of surrounding noise on the number of questions solved.

$$F = \frac{\text{Mean sum of squares among groups}}{\text{Mean sum of squares within groups}} = \frac{MSS_A}{MSS_W}$$
Calculation Part

Squares of the scores (X²) in each group:

         Group 1    Group 2    Group 3
         100        64         16
         81         16         9
         36         36         36
         49         49         16
ΣX²      266        165        77
Steps of the calculation:
1. ΣX and ΣX²
2. Correction term
3. Sum of squares of total
4. Sum of squares among groups
5. Sum of squares within groups
6. Mean sum of squares among groups
7. Mean sum of squares within groups
8. Fisher ratio (F distribution)
Source of        Degree of      Sum of      Mean sum of    F ratio
variation        freedom        squares     squares        (distribution)
Among groups     (K-1) = 2      28.17       14.085         5.394
Within groups    (N-K) = 9      23.5        2.6111


Calculated F (5.394) > critical F from the table (α = 0.05; df = 2 and 9)

Reject the null hypothesis: surrounding noise has a significant effect on the
number of questions solved
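The ANOVA table for the noise example can be reproduced from the raw scores. The sketch below follows the correction-term route used in the slides; the function and variable names are illustrative, and the critical value 4.26 is the standard F-table entry for α = 0.05 with 2 and 9 degrees of freedom, which the slides do not state explicitly.

```python
# Correct answers per student, from the noise example.
groups = {
    "low noise": [10, 9, 6, 7],
    "medium noise": [8, 4, 6, 7],
    "loud noise": [4, 3, 6, 4],
}


def one_way_anova(samples):
    """Return the sums of squares, mean squares and F ratio for a one-way ANOVA."""
    values = [x for sample in samples for x in sample]
    n_total = len(values)
    k = len(samples)
    correction = sum(values) ** 2 / n_total          # (sum of X)^2 / N
    ss_total = sum(x * x for x in values) - correction
    ss_among = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    ss_within = ss_total - ss_among
    ms_among = ss_among / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "df_among": k - 1,
        "df_within": n_total - k,
        "ss_among": ss_among,
        "ss_within": ss_within,
        "ms_among": ms_among,
        "ms_within": ms_within,
        "f_ratio": ms_among / ms_within,
    }


result = one_way_anova(list(groups.values()))

# Table value for alpha = 0.05 with (2, 9) degrees of freedom (assumed from an F table).
CRITICAL_F = 4.26
reject_null = result["f_ratio"] > CRITICAL_F
print(f"F = {result['f_ratio']:.3f}, reject H0: {reject_null}")
```

The mean square among groups comes out as 14.083 rather than 14.085 because the script does not round the sum of squares to 28.17 first; the F ratio still rounds to 5.394.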


Apply Your mind
To study the effect of temperature on seed germination, 16 seeds of a
plant species were selected for an experiment. A total of four
temperature treatments were provided to sets of four seeds to study
the onset of germination. What would be the within, between and
total degrees of freedom, respectively, in an analysis of variance?
(1) 3, 15 and 18 (2) 16, 4 and 20
(3) 4, 16 and 20 (4) 12, 3 and 15
Chi-squared test (χ² test): Goodness of fit

Used to determine whether there is a significant difference between the
expected frequencies and the observed frequencies of observations.

Non-parametric test
Examples of null hypotheses:

1. If a coin is tossed 100 times, there will be 50 heads and 50 tails
2. The probability of a child being male or female is equal
3. The F2 monohybrid phenotypic ratio will be 3:1 under dominance
4. The F2 dihybrid phenotypic ratio will be 9:3:3:1 under dominance
and independent assortment
5. There is an equal frequency of A, T, G and C in DNA (1/4 each)
To test the hypothesis we can use the chi-square goodness of fit test.

A coin was tossed 30 times and the outcomes are shown below:

H H T H H
T H H H H
T H H T T
H T H H H
T H H H T
T H T H H

Total heads: 20    Total tails: 10
Total observations: 30

Null hypothesis:
There is no significant difference between the observed and
expected frequency of heads and tails in the coin tosses

Expected heads: 15    Expected tails: 15

Observed heads: 20    Observed tails: 10

Accept or reject the null hypothesis? (significance level α = 0.05)
Calculation part:

         Observed   Expected   O-E            (O-E)²   (O-E)²/E
Head     20         15         20-15 = 5      25       25/15 = 1.67
Tail     10         15         10-15 = -5     25       25/15 = 1.67

$$\chi^2 = \sum \frac{(O-E)^2}{E} = 1.67 + 1.67 = 3.33$$
n = number of possible outcomes (H and T) = 2

Degree of freedom = n - 1 = 1
At significance level α = 0.05, the critical value with 1 df is 3.84

Calculate the chi-square value; if it is less than 3.84, accept the null hypothesis, else
reject it.
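The coin-toss calculation can be written as a small function. This is a minimal sketch using the observed and expected counts from the example and the table value 3.84 for 1 degree of freedom; the names are illustrative.

```python
def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [20, 10]   # heads, tails in 30 tosses
expected = [15, 15]   # a fair coin

chi2 = chi_square(observed, expected)
degrees_of_freedom = len(observed) - 1

# Critical value for alpha = 0.05 and 1 degree of freedom (chi-square table).
CRITICAL_VALUE = 3.84
decision = "reject H0" if chi2 >= CRITICAL_VALUE else "accept H0"
print(f"chi-square = {chi2:.2f}, df = {degrees_of_freedom}, {decision}")
```

With 20 heads and 10 tails the statistic is about 3.33, which is below 3.84, so the null hypothesis of a fair coin is accepted.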
1. Given below are a few statistical terms in Column I and their related features
and terms in Column II.
     Column I                        Column II
A.   Standard deviation         (i)   Measure of relative variability of given populations.
B.   Coefficient of variation   (ii)  Used to make inferences about population means.
C.   Chi-square test            (iii) Positive square root of population variance.
D.   t-test                     (iv)  Tests hypotheses related to categorical data
                                      from inheritance studies.
Which one of the options given below correctly
matches all items of Columns I and II?
(1) A-iv; B-iii; C-ii; D-i (2) A-ii; B-iv; C-i; D-iii
(3) A-iii; B-i; C-iv; D-ii (4) A-ii; B-i; C-iv; D-iii
(FEB 2022-I)
2. From the steps listed below, some are used to evaluate the goodness of fit
using the chi-square test.
A. The mean, variance and standard deviation are calculated
B. Variance calculated using $\frac{\sum (x_i - \bar{x})^2}{n-1}$
C. Test statistic calculated using $\sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
D. The degree of freedom is calculated as n-1, where n is the number of ways in
which the expected classes are free to vary
E. The probability value is obtained
Which one of the following options provides the correct sequence of steps in
this statistical analysis?
(1) A, C, D (2) C, D, E
(3) B, A, D (4) A, D, E
(DEC 2013)
3. In the population of an insect species, 50% are known to be
female. How many females should turn up in a random sample of
40 insects to reject the null hypothesis (χ² value for rejection is ≥ 3.84)?
(1) 6
(2) ≤26 or ≤8
(3) 25
(4) <17 or >27
Null hypothesis:
There is no significant difference between the observed
and expected frequency of heads and tails in the coin tosses

At significance level α = 0.05 (df = 1),

Critical value is 3.84
Calculated χ² = 3.33

Calculated χ² < critical value: accept the null hypothesis


PARAMETRIC AND NON-PARAMETRIC TEST
Consider female = 26

Chi sq [0.05] for df = 1 is 3.84

          Observed   Expected   O-E            (O-E)²   (O-E)²/E
Male      14         20         14-20 = -6     36       36/20 = 1.8
Female    26         20         26-20 = 6      36       36/20 = 1.8

$$\chi^2 = \sum \frac{(O-E)^2}{E} = 1.8 + 1.8 = 3.6$$

3.6 < 3.84, so the null hypothesis is not rejected.
Consider female = 27

Chi sq [0.05] for df = 1 is 3.84

          Observed   Expected   O-E            (O-E)²   (O-E)²/E
Male      13         20         13-20 = -7     49       49/20 = 2.45
Female    27         20         27-20 = 7      49       49/20 = 2.45

$$\chi^2 = \sum \frac{(O-E)^2}{E} = 2.45 + 2.45 = 4.9$$

4.9 > 3.84, so the null hypothesis is rejected.
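Rather than trying female counts one at a time, the same check can be run over every possible count in a sample of 40. This is a minimal sketch under the question's assumptions (expected 20 females and 20 males, rejection when χ² ≥ 3.84); the variable names are illustrative.

```python
SAMPLE_SIZE = 40
EXPECTED = SAMPLE_SIZE / 2      # 50% females expected
CRITICAL_VALUE = 3.84           # chi-square table, alpha = 0.05, df = 1


def chi_square_for(females):
    """Chi-square statistic for a given number of females in the sample."""
    males = SAMPLE_SIZE - females
    return ((females - EXPECTED) ** 2 + (males - EXPECTED) ** 2) / EXPECTED


# Every female count that leads to rejection of the null hypothesis.
rejecting = [f for f in range(SAMPLE_SIZE + 1) if chi_square_for(f) >= CRITICAL_VALUE]

lower_bound = max(f for f in rejecting if f < EXPECTED)
upper_bound = min(f for f in rejecting if f > EXPECTED)
print(f"Reject H0 when females <= {lower_bound} or females >= {upper_bound}")
```

The statistic depends only on how far the count is from 20, so the rejection region is symmetric: 13 or fewer females, or 27 or more.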
Non-parametric tests:

• Tests that don't require your data to follow the normal distribution.
• They're also known as distribution-free tests and can provide benefits in
certain situations (nominal/ordinal data).
• Used when individual variability among the study groups is high

Examples:
• Chi-square test
• Spearman correlation
• Kruskal-Wallis test
• Mann-Whitney U test
• Mann-Kendall test
Tests to assess differences in mean, SD or variance in two or more groups

Parametric:
• Paired t-test is used to compare the means of two dependent groups (same
group studied twice)
• Unpaired t-test is used to compare the means of two independent groups
• One-way ANOVA assesses differences between the means of two or more groups
by analysing variance

Non-parametric:
• Wilcoxon signed-rank test is used to compare two dependent groups (not normally
distributed)
• Mann-Whitney U test is used to compare two independent groups (not
normally distributed)
• Kruskal-Wallis test assesses differences between two or more groups (not
normally distributed)
Parametric tests

• Make assumptions about the parameters of the population distribution from
which the sample is drawn.
• These tests assume that the population data are normally distributed.
• Parametric tests are more powerful than non-parametric tests when their
assumptions are met.
• Results can be significantly affected by outliers in a parametric test.

Examples:
• Paired/unpaired t-test
• ANOVA
• Pearson correlation
Apply Your mind

Two groups (Control, Treated) are to be compared to test the effect of a treatment. Since individual
variability is high in both groups, the appropriate statistical test to use is (JUNE 2015)
(1) Analysis of variance.
(2) Kendall's test.
(3) Student's t-test.
(4) Mann-Whitney U-test.
Tests to assess the strength of association between two variables

Parametric test:
• Pearson correlation is used when assessing the relationship between two
continuous variables.

Non-parametric tests:
• Spearman correlation is appropriate when at least one of the variables is
measured on an ordinal scale.
• Kendall rank correlation is a non-parametric test that measures the
strength of dependence between two variables
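The difference between the parametric and the rank-based coefficient can be seen on a small example. The sketch below computes Spearman correlation as the Pearson correlation of the ranks (with average ranks for ties); the data are hypothetical and the helper names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(f"Pearson = {pearson(x, y):.3f}, Spearman = {spearman(x, y):.3f}")
```

Because y always increases with x, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is below 1 because the relationship is not linear.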
Apply Your mind

The use of Kruskal Wallis test is most appropriate in which of these cases? (JUNE
2016)
(1) There are more than two groups and each group is normally distributed.
(2) There are more than two groups and the distribution in each group is not
normal.
(3) There are two groups and each group is normally distributed.
(4) There are two groups and the distribution in each group is not normal.
Population Field Ecology
Apply Your mind
(FEB 2022-II)
Select the option that represents the correct combination of non-
parametric tests and its equivalent parametric test respectively that
can be used to compare two or more groups.
(1) Wilcoxon Rank Sum Test and Paired t-test
(2) Wilcoxon Rank Sum Test and Spearman correlation
(3) Spearman correlation and Kruskal Wallis test
(4) Mann-Whitney U test and Pearson correlation
Quadrat Method
Line transect method

Animal biologists generally use the line-transect method for estimating density.

It is based on the assumption that the animals do not move as a result of the
presence of the observer.

                                                        Animal 1   Animal 2
A  Transect length (km)                                 100        100
B  Mean perpendicular distance from transect line (m)   10         40
C  Number of animals recorded                           30         36
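The slides do not state which density estimator goes with this table. A common simple choice, assumed here, treats the surveyed area as a strip of length L and half-width equal to the mean perpendicular distance w, giving D = n / (2·L·w). The sketch below applies that assumed formula to the two animals; the function name is illustrative.

```python
def line_transect_density(count, transect_km, mean_distance_m):
    """Animals per square km, ASSUMING the estimator D = n / (2 * L * w).

    L is the transect length and w the mean perpendicular distance,
    so 2 * L * w is the effective strip area on both sides of the line.
    """
    mean_distance_km = mean_distance_m / 1000   # metres to kilometres
    return count / (2 * transect_km * mean_distance_km)


animal_1 = line_transect_density(count=30, transect_km=100, mean_distance_m=10)
animal_2 = line_transect_density(count=36, transect_km=100, mean_distance_m=40)

print(f"Animal 1: {animal_1:.1f} per sq km")
print(f"Animal 2: {animal_2:.1f} per sq km")
```

Under this assumption Animal 1 has the higher density even though fewer individuals were recorded, because it was detected within a much narrower strip.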
Mark-recapture method:

'n' individuals are collected randomly from the study area in a defined
period of time.
The captured individuals are counted, marked and released at the site of
collection.
The next day or after some time, individuals are captured from the same site
for the same length of time.
The numbers of marked (n_m) and unmarked (n_u) individuals in this second
sample are counted.

Total population / Marked during first capture
= Recaptured population / Number of marked among recaptured

$$\frac{N}{n} = \frac{n_m + n_u}{n_m}$$
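Rearranging the proportion gives the population estimate N = n (n_m + n_u) / n_m. The sketch below applies it to hypothetical counts (the slides give no numbers for this method); the function and argument names are illustrative.

```python
def estimate_population(marked_first, recaptured_marked, recaptured_unmarked):
    """Estimate N from N / n = (n_m + n_u) / n_m.

    marked_first        -- n, individuals marked and released in the first capture
    recaptured_marked   -- n_m, marked individuals in the second sample
    recaptured_unmarked -- n_u, unmarked individuals in the second sample
    """
    if recaptured_marked == 0:
        raise ValueError("no marked individuals recaptured; estimate is undefined")
    second_sample = recaptured_marked + recaptured_unmarked
    return marked_first * second_sample / recaptured_marked


# Hypothetical counts: 50 marked; second sample has 10 marked and 30 unmarked.
population = estimate_population(50, 10, 30)
print(population)  # 200.0
```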
Remote sensing has been increasingly used to monitor vegetation
globally.

Part of the EMR spectrum    Vegetation characteristics
Visible                     Plant photosynthetic pigments
Near infrared               Foliage density
Short-wave infrared         Plant water content

Remote sensing:

Normalized difference vegetation index (NDVI)

The NDVI is computed as the difference between near infrared (NIR) and red (RED)
reflectance divided by their sum:

$$NDVI = \frac{NIR - RED}{NIR + RED}$$

The NDVI ratio yields a measure of photosynthetic activity, with values between
−1 and 1.
Low NDVI values indicate moisture-stressed vegetation and higher values
indicate a higher density of green vegetation.
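The NDVI definition translates directly into code. This is a minimal sketch for a single pixel; the reflectance values are hypothetical and chosen only to show a high and a low NDVI.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - RED) / (NIR + RED)."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + RED is zero; NDVI is undefined for this pixel")
    return (nir - red) / total


# Hypothetical reflectance values (fractions between 0 and 1).
dense_vegetation = ndvi(nir=0.5, red=0.1)    # strong NIR reflectance, low red
stressed_vegetation = ndvi(nir=0.3, red=0.2)

print(f"dense: {dense_vegetation:.2f}, stressed: {stressed_vegetation:.2f}")
```

For non-negative reflectances the result always falls between −1 and 1, with values closer to 1 indicating denser green vegetation.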
THANKS
