Inferential Statistics
Umesh Pathak
December 5, 2024
1 Inferential Statistics
1.1 Sampling and Confidence Intervals
Sampling and confidence intervals are fundamental concepts in statistics used to estimate population
parameters based on a sample and to quantify the uncertainty of these estimates.
—
1. Sampling
What is Sampling? Sampling involves selecting a subset (sample) of individuals or observations
from a larger population to make inferences about the population as a whole.
Types of Sampling 1. Random Sampling : - Each individual in the population has an equal
chance of being selected. - Reduces bias and ensures representativeness.
2. Stratified Sampling : - The population is divided into subgroups (strata) based on
characteristics, and random samples are taken from each stratum. - Ensures representation of key
subgroups.
3. Systematic Sampling : - Select every kth individual from a list of the population. -
Simpler than random sampling but can introduce bias if the population has a pattern.
4. Cluster Sampling : - Divide the population into clusters (e.g., geographic regions), and
randomly select entire clusters for the sample. - Useful for large, dispersed populations.
5. Convenience Sampling : - Use individuals who are easiest to access. - May introduce
significant bias.
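The first three schemes above can be sketched in a few lines of Python. The population of 1,000 numbered individuals and the two-way strata split are hypothetical, for illustration only:

```python
import random

random.seed(42)                      # fixed seed so the illustration is reproducible
population = list(range(1000))       # hypothetical population of 1000 IDs

# Random sampling: every individual has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sampling: every k-th individual from an ordered list
k = 100
systematic = population[::k]         # IDs 0, 100, 200, ..., 900

# Stratified sampling: split into strata, then sample within each stratum
strata = {"group_a": population[:500], "group_b": population[500:]}
stratified = [x for members in strata.values()
              for x in random.sample(members, 5)]

print(len(simple), len(systematic), len(stratified))  # 10 10 10
```

Note how systematic sampling is deterministic once the starting point is fixed, which is why a patterned population list can bias it.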
—
2. Confidence Interval
What is a Confidence Interval? A confidence interval (CI) provides a range of values that likely contains the true population parameter (e.g., mean, proportion) with a specified level of confidence (e.g., 95%).
Key Components of a Confidence Interval 1. Point Estimate : - The sample statistic used to
estimate the population parameter (e.g., sample mean).
2. Margin of Error (MOE) : - Reflects the uncertainty in the estimate due to sampling
variability. - Larger sample sizes lead to smaller MOEs.
3. Confidence Level : - The probability that the interval contains the true parameter. -
Common levels: 90%, 95%, 99%.
Formula for Confidence Interval For the population mean (µ), when the population standard
deviation (σ) is known:
CI = X̄ ± Z · σ/√n
For the population mean (µ), when σ is unknown (using the sample standard deviation s):
CI = X̄ ± t · s/√n
Where: - X̄: Sample mean - Z: Critical value from the standard normal distribution (e.g., 1.96
for 95% confidence) - t: Critical value from the t-distribution - σ: Population standard deviation - s: Sample
standard deviation - n: Sample size
—
3. Example of Confidence Interval
Scenario : You conduct a study to estimate the average height of students in a university. A
random sample of 50 students has: - Sample mean (X̄) = 170 cm - Sample standard deviation (s)
= 8 cm
Objective : Construct a 95% confidence interval for the average height.
—
Solution :
1. Identify Parameters : - X̄ = 170, s = 8, n = 50, confidence level = 95%.
2. Find the Critical t-Value : - Degrees of freedom (df ) = n − 1 = 50 − 1 = 49. - For 95% confidence and df = 49, t ≈ 2.009.
3. Compute the Margin of Error (MOE) :
MOE = t · s/√n = 2.009 · 8/√50 = 2.009 · 1.131 ≈ 2.27
4. Calculate the Confidence Interval :
CI = X̄ ± MOE = 170 ± 2.27 = [167.73, 172.27]
Interpretation : With 95% confidence, the average height of students lies between 167.73 cm and 172.27 cm.
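The example above can be checked numerically with scipy.stats, using the same summary values:

```python
import math
from scipy import stats

xbar, s, n = 170, 8, 50
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% -> 0.975 quantile, df = 49
moe = t_crit * s / math.sqrt(n)

print(round(t_crit, 3))                            # ~2.01
print(round(moe, 2))                               # 2.27
print(round(xbar - moe, 2), round(xbar + moe, 2))  # 167.73 172.27
```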
1.2 Inference and Significance in Statistics
In statistics, inference and significance are closely related concepts used in analyzing data, making
predictions, and drawing conclusions about a population based on sample data.
—
1. Statistical Inference
Statistical inference refers to the process of using sample data to draw conclusions about a
population. It involves estimating parameters, testing hypotheses, and predicting future outcomes.
Types of Inference
1. Estimation : - Point Estimate : A single value used to estimate a population parameter (e.g.,
sample mean for population mean).
- Confidence Interval : A range of values that likely contains the population parameter with a
specified probability (e.g., 95%).
2. Hypothesis Testing : - Involves testing assumptions (null and alternative hypotheses)
about population parameters using sample data.
- Example: Testing if a new drug is more effective than a placebo.
3. Prediction : - Using models (e.g., regression) to predict future outcomes based on current
data.
—
2. Statistical Significance
Statistical significance is a measure of whether the results observed in a sample are unlikely to
have occurred by random chance. It helps determine if the observed effect is real or due to sampling
variability.
Key Concepts in Significance 1. Null Hypothesis (H0 ) : - The default assumption
that there is no effect or difference (e.g., no relationship between variables).
2. Alternative Hypothesis (Ha ) : - The assumption that there is an effect or difference
(e.g., a relationship exists).
3. p-Value : - The probability of observing results as extreme as the sample data, assuming
the null hypothesis is true. - Threshold : If p < α (e.g., 0.05), the result is considered statistically
significant, and H0 is rejected.
4. Significance Level (α) : - The pre-determined threshold for significance (e.g., 0.05, 0.01).
- It represents the probability of rejecting H0 when it is true (Type I error).
—
Example: Statistical Inference and Significance
Scenario : A company claims that the average weight of a bag of chips is 500 grams. A quality
control team takes a sample of 30 bags, which has: - Sample mean (X̄) = 495 grams. - Sample
standard deviation (s) = 10 grams.
Objective : Test whether the bags weigh less than 500 grams (Ha : µ < 500) at a significance
level of α = 0.05.
—
Step-by-Step Solution
1. State the Hypotheses : - H0 : µ = 500 vs. Ha : µ < 500.
2. Choose the Significance Level : - α = 0.05 (one-tailed test).
3. Calculate Test Statistic :
t = (X̄ − µ)/(s/√n) = (495 − 500)/(10/√30) = −5/1.83 ≈ −2.73
4. Find the Critical Value : - Degrees of Freedom (df ) = n − 1 = 30 − 1 = 29. - For α = 0.05
(one-tailed test), critical t-value from the t-table is approximately -1.699 .
5. Compare Test Statistic and Critical Value : - t = −2.73 is less than −1.699.
6. Decision : - Reject H0 : There is significant evidence that the average weight is less than
500 grams.
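With only summary statistics available, the t-statistic and one-tailed p-value can be computed directly. Note that carrying full precision gives t ≈ −2.74 rather than the −2.73 obtained above from the rounded standard error; the conclusion is unchanged:

```python
import math
from scipy import stats

xbar, mu0, s, n = 495, 500, 10, 30       # summary statistics from the example
se = s / math.sqrt(n)                    # standard error of the mean
t_stat = (xbar - mu0) / se
p_value = stats.t.cdf(t_stat, df=n - 1)  # one-tailed: Ha is mu < 500

reject = p_value <= 0.05
print(round(t_stat, 2))                  # -2.74
print("Reject H0:", reject)              # Reject H0: True
```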
—
Key Takeaways
1. Inference involves analyzing data to make decisions or predictions about the population.
2. Significance evaluates whether the results are due to chance or a true effect.
3. A statistically significant result (p < α) indicates strong evidence to reject the null hypothesis.
—
Steps in Hypothesis Testing
1. State the Hypotheses : - Null Hypothesis (H0 ) : Assumes no effect or no difference (e.g.,
µ = µ0 ). - Alternative Hypothesis (Ha ) : Represents the claim we aim to test (e.g., µ > µ0 ).
2. Choose the Significance Level (α) : - Typically, α = 0.05 (5%). - The probability of rejecting
H0 when H0 is true (Type I error).
3. Select the Test and Calculate the Test Statistic : - Examples of test statistics: - z-test: Used
when the population standard deviation (σ) is known or the sample size is large. - t-test: Used
when σ is unknown and the sample size is small.
- For population mean :
z = (X̄ − µ0)/(σ/√n)
t = (X̄ − µ0)/(s/√n)
4. Determine the Critical Value or p-Value : - Compare the test statistic to the critical value
from the z- or t-distribution. - Alternatively, calculate the p-value and compare it to α.
5. Make a Decision : - If p ≤ α or the test statistic exceeds the critical value, reject H0 . -
Otherwise, fail to reject H0 .
6. State the Conclusion : - Clearly interpret the results in the context of the problem.
—
Example: Combining Estimation and Hypothesis Testing
Scenario : A factory claims that the average weight of its packets of rice is 1 kg. A quality
control officer collects a random sample of 30 packets and finds: - Sample mean (X̄) = 0.98 kg -
Sample standard deviation (s) = 0.05 kg
Objective : 1. Estimate the true mean weight using a 95% confidence interval. 2. Test if the average weight is less
than 1 kg at α = 0.05.
—
Solution :
1. Confidence Interval (Estimation) : - n = 30, X̄ = 0.98, s = 0.05, t-value for df = 29 at 95% confidence ≈ 2.045.
CI = X̄ ± t · s/√n = 0.98 ± 2.045 · 0.05/√30 = 0.98 ± 0.0187 = [0.9613, 0.9987]
Interpretation : The true mean weight is likely between 0.9613 kg and 0.9987 kg.
—
2. Hypothesis Testing :
Step 1 : Hypotheses:
H0 : µ = 1 vs. Ha : µ < 1
Step 2 : Test Statistic:
t = (X̄ − µ0)/(s/√n) = (0.98 − 1)/(0.05/√30) = −0.02/0.0091 ≈ −2.20
Step 3 : Critical Value: From the t-table for df = 29 and one-tailed test at α = 0.05, the critical
t-value = -1.699.
Step 4 : Decision: Since t = −2.20 < −1.699, reject H0 .
Step 5 : Conclusion: There is significant evidence to conclude that the average weight of the
packets is less than 1 kg.
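Both parts of the rice-packet example can be reproduced from the summary statistics; scipy's `stats.t.interval` builds the t-interval directly. Full precision gives t ≈ −2.19, close to the −2.20 obtained from the rounded standard error:

```python
import math
from scipy import stats

xbar, s, n, mu0 = 0.98, 0.05, 30, 1.0
se = s / math.sqrt(n)

# 1. Estimation: 95% t-interval (loc = sample mean, scale = standard error)
ci = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=se)
print(tuple(round(v, 4) for v in ci))     # (0.9613, 0.9987)

# 2. Testing Ha: mu < 1 at alpha = 0.05
t_stat = (xbar - mu0) / se
p_value = stats.t.cdf(t_stat, df=n - 1)   # one-tailed p-value
print(round(t_stat, 2), p_value <= 0.05)
```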
—
Comparison of Estimation and Hypothesis Testing
Aspect    | Estimation                                  | Hypothesis Testing
Objective | Estimate the value of a population          | Test a claim or hypothesis about a
          | parameter.                                  | population parameter.
Output    | Confidence interval or point estimate.      | Decision: reject or fail to reject H0.
Focus     | Quantifying uncertainty of an estimate.     | Assessing evidence for/against a hypothesis.
Example   | Estimating the mean income of a population. | Testing if the mean income differs from $50,000.
1. Chi-Square Goodness-of-Fit Test - Compares observed category counts with the counts expected under a hypothesized distribution.
Example : A die is rolled 60 times to test whether it is fair. The expected frequency for each side is E = 60/6 = 10. A chi-square test determines if the die is fair.
—
2. Kolmogorov-Smirnov (K-S) Test - Used for continuous data to compare the empirical dis-
tribution function (EDF) of the sample to a theoretical distribution. - Suitable for small sample
sizes.
Test Statistic :
D = max |Fo (x) − Fe (x)|
Where: - Fo (x): Observed cumulative distribution function (CDF). - Fe (x): Expected CDF.
Hypotheses : - H0 : The sample comes from the specified distribution. - Ha : The sample does
not come from the specified distribution.
Example : Test if a sample of data follows a normal distribution using the K-S test.
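A sketch with `scipy.stats.kstest`, using simulated data (the sample and the Normal(170, 8) reference are hypothetical). One caveat worth knowing: if the reference distribution's parameters are estimated from the same sample, the standard K-S p-value is no longer exact and a correction such as Lilliefors' test is needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=8, size=50)   # simulated heights, cm

# D = max |Fo(x) - Fe(x)| against a Normal(170, 8) reference CDF
d_stat, p_value = stats.kstest(sample, "norm", args=(170, 8))
print(round(d_stat, 3), round(p_value, 3))
```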
—
3. Anderson-Darling Test - Similar to the K-S test but gives more weight to the tails of the
distribution. - Used for testing if data follow a specific distribution, especially normality.
Test Statistic :
A² = −n − (1/n) Σ_{i=1}^{n} (2i − 1) [ln F(x_i) + ln(1 − F(x_{n+1−i}))]
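`scipy.stats.anderson` implements this statistic for a handful of distributions. Unlike `kstest`, it estimates the parameters from the data and returns critical values at fixed significance levels instead of a p-value (the simulated sample below is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=100)  # simulated standard-normal data

result = stats.anderson(sample, dist="norm")
print(round(result.statistic, 3))     # A^2 statistic
print(result.critical_values)         # at the 15%, 10%, 5%, 2.5%, 1% levels
print(result.significance_level)
```

Reject normality at a given level when the statistic exceeds the corresponding critical value.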
3. Compare to Critical Value : - Degrees of freedom (df ) = Number of categories − 1 = 6 − 1 = 5.
- Critical value for df = 5 and α = 0.05 from the chi-square table is 11.07.
4. Conclusion : - Since the computed statistic χ² = 3.4 < 11.07: Fail to reject H0 . There is no significant evidence to suggest the
die is unfair.
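The die example can be run end-to-end with `scipy.stats.chisquare`. The observed counts below are hypothetical, chosen to sum to 60 and to reproduce χ² = 3.4, since the original tallies are not given:

```python
from scipy import stats

# Hypothetical counts for 60 rolls; expected count per face is 60/6 = 10
observed = [14, 6, 11, 9, 10, 10]

# With no f_exp argument, chisquare assumes uniform expected frequencies
chi2, p = stats.chisquare(observed)
print(round(chi2, 1), round(p, 3))   # chi2 = 3.4; p well above 0.05
```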
—
Applications of Goodness-of-Fit Tests
1. Quality Control : - Test if a manufacturing process produces items meeting a specified
distribution (e.g., defect rates).
2. Model Validation : - Check if data follow a theoretical model (e.g., normal distribution for
errors in regression).
3. Genetics : - Test if observed traits follow Mendelian inheritance ratios.
4. Market Research : - Compare observed consumer behaviour to expected proportions.
Step 3: Compute the Chi-Square Statistic The test statistic is calculated as:
χ² = Σ (Oij − Eij)² / Eij
Where: - Oij : Observed frequency in cell (i, j) - Eij : Expected frequency in cell (i, j)
Step 4: Determine Degrees of Freedom The degrees of freedom (df ) for a contingency table is
given by:
df = (r − 1) × (c − 1)
Where r is the number of rows and c is the number of columns.
Step 5: Compare to the Critical Value or Compute p-Value - Use a Chi-Square distribution
table to find the critical value at a given significance level (α). - Alternatively, calculate the p-value
and compare it to α.
Step 6: Decision - If χ2 exceeds the critical value or p ≤ α, reject H0 . - Otherwise, fail to reject
H0 .
—
3. Example
Scenario : A researcher wants to determine if there is an association between gender (Male/Female)
and preference for a product (Product A/Product B). The data collected is as follows:

         Product A   Product B   Total
Male        40          60        100
Female      50          50        100
Total       90         110        200
Step-by-Step Solution :
Step 1: Compute Expected Frequencies For each cell:
E = (Row Total × Column Total) / Grand Total
For example, expected (Male, Product A) = (100 × 90)/200 = 45.
Step 2: Construct the Expected Frequency Table

         Product A   Product B   Total
Male        45          55        100
Female      45          55        100
Total       90         110        200
Step 3: Compute χ²
χ² = Σ (Oij − Eij)² / Eij
For each cell:
- Male, Product A: (40 − 45)²/45 = 25/45 ≈ 0.56
- Male, Product B: (60 − 55)²/55 = 25/55 ≈ 0.45
- Female, Product A: (50 − 45)²/45 = 25/45 ≈ 0.56
- Female, Product B: (50 − 55)²/55 = 25/55 ≈ 0.45
Summing over all cells:
χ² = 0.56 + 0.45 + 0.56 + 0.45 = 2.02
Step 4: Determine Degrees of Freedom
df = (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1
Step 5: Compare to Critical Value From the Chi-Square table, for df = 1 and α = 0.05, the
critical value is 3.841 .
Since χ2 = 2.02 < 3.841, we fail to reject H0 .
—
Conclusion : There is no significant association between gender and product preference at the 5% significance level.
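The whole test can be reproduced with `scipy.stats.chi2_contingency`. Passing `correction=False` disables Yates' continuity correction, which scipy applies by default for 2×2 tables, so the result matches the hand calculation:

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Product A, Product B
observed = np.array([[40, 60],
                     [50, 50]])

chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 2), df)        # 2.02 with 1 degree of freedom
print(expected)                  # [[45. 55.] [45. 55.]]
print("Reject H0:", p <= 0.05)   # Reject H0: False
```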
—
Applications of Test of Independence
1. Marketing : Analyzing the relationship between customer demographics and product preference.
2. Healthcare : Evaluating the association between smoking and disease prevalence.
3. Education : Determining if there is a relationship between study methods and exam performance.