
Lecture 03. Statistical Inference

The document discusses statistical inference concepts including normal distribution, standard normal distribution, central limit theorem, point estimation, interval estimation, confidence intervals, hypothesis testing, type I and type II errors. Examples are provided to illustrate key concepts.


Fundamentals of Data Analytics

Lecture 03. Statistical Inference


Instructional Team
About this Course
- Probability
- Statistics
- Hands-on programming skills
- Meet your instructors

- Vinh Dang (PhD.), Data Scientist, Trusting Social
- Thuy Nguyen (M.Sc.), Data Analyst, VNG
- Sang Nguyen (M.Sc.), Data Scientist, FE Credit
- Huy Pham (PharmB), R&D Officer, OPC
Content
➔ Recall: Normal Distribution, the CLT & Sampling Distribution
➔ Point Estimation & Interval Estimation
➔ Hypothesis Testing
Normal Distribution
❖ A continuous probability distribution characterized by a symmetric, bell-shaped
curve.
X ∼ N(µ, σ²)
Standard Normal Distribution
❖ The Standard Normal Distribution Z, obtained by standardizing X with
Z = (X - µ) / σ, is the normal distribution with parameters µ = 0 and σ = 1
Z ∼ N(0, 1)
Normal Probabilities
Probability that X takes on values between a and b
P(a ≤ X ≤ b)
Steps to calculate Normal Probability:
- Calculate standardized Z-score by transforming X using Z = (X - μ) / σ
- Use the standard normal N(0,1) table or Z-table (www.z-table.com)

Example:
Let X equal the weight of a randomly selected infant. Assume X ~ N(3000, 1000²), i.e. µ = 3000 and σ = 1000 grams.
- What is the probability that a randomly selected infant has weight below 3500?
- What is the probability that a randomly selected infant has weight above 5000?
- What is the probability that a randomly selected infant has weight between
2500 and 4000?
Normal Probabilities
Example:
Let X equal the weight of a randomly selected infant. Assume X ~ N(3000, 1000²), i.e. µ = 3000 and σ = 1000 grams.
- What is the probability that a randomly selected infant has weight below 3500?
P(X ≤ 3500) = P(Z ≤ (3500-3000)/1000) = P(Z ≤ 0.5) = 0.6915

- What is the probability that a randomly selected infant has weight above 5000?
P(X ≥ 5000) = P(Z ≥ (5000-3000)/1000) = P(Z ≥ 2) = 1 - P(Z ≤ 2) = 0.0228

- What is the probability that a randomly selected infant has weight between
2500 and 4000?
P(2500 ≤ X ≤ 4000) = P(-0.5 ≤ Z ≤ 1) = P(Z ≤ 1) - P(Z ≤ -0.5) = 0.8413 - 0.3085 = 0.5328
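The three calculations above can be reproduced in a few lines of Python. Rather than reading a Z-table, this sketch computes the standard normal CDF from the error function in the standard library (`scipy.stats.norm.cdf` would work equally well):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 3000, 1000  # infant weights, X ~ N(3000, 1000^2)

p_below = phi((3500 - mu) / sigma)                             # P(X <= 3500)
p_above = 1 - phi((5000 - mu) / sigma)                         # P(X >= 5000)
p_between = phi((4000 - mu) / sigma) - phi((2500 - mu) / sigma)  # P(2500 <= X <= 4000)

print(round(p_below, 4))    # ~0.6915
print(round(p_above, 4))    # ~0.0228
print(round(p_between, 4))  # ~0.5328
```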
Population vs Sample
Sampling Distribution
The distribution of a statistic over all possible random samples of a given size
drawn from the same population.

Example:
- Consider a population that follows the normal distribution N(μ, σ²)
- Repeatedly take samples of a given size from this population
- Calculate the mean for each sample – this statistic is called the sample mean
- The distribution of these means is "sampling distribution of the sample mean"
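The procedure in the example can be sketched with a short simulation (the population parameters μ = 50, σ = 10 and sample size n = 30 are hypothetical choices for illustration):

```python
import random
from statistics import mean, stdev

random.seed(42)
mu, sigma, n = 50, 10, 30  # hypothetical population parameters and sample size

# Repeatedly draw samples of size n and record each sample mean
sample_means = [mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(5000)]

# The sampling distribution of the mean centres on mu,
# with spread close to sigma / sqrt(n) (the standard error)
print(round(mean(sample_means), 1))
print(round(stdev(sample_means), 2))
```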
The Central Limit Theorem
Central Limit Theorem
If a random sample of size n is drawn from any population with mean μ and
standard deviation σ, the distribution of the sample mean X̄ approaches a normal
distribution with mean μ and standard deviation σ_X̄ = σ/√n (the standard error)
as the sample size increases
X̄ ~ N(μ, σ²/n)

As the sample size n increases, the distribution of sample means concentrates
around the population mean μ (i.e., the standard error of the mean
σ_X̄ = σ/√n gets smaller).
The Central Limit Theorem
Range of Sample Means
If we know μ and σ, the CLT allows us to predict the range of sample means for
samples of size n
[ μ - z*σ/√n, μ + z*σ/√n]

Example:
Within what interval would we expect GMAT sample means to fall for samples of n =
5 applicants? The population is approximately normal with parameters μ = 520.78
and σ = 86.80, so the predicted range for 95 percent of the sample means is
[ 520.78 - 1.96*86.80/√5, 520.78 + 1.96*86.80/√5 ] ≈ [444.7, 596.9]
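The GMAT example can be computed directly (using n = 5 as stated in the example):

```python
from math import sqrt

mu, sigma, n, z = 520.78, 86.80, 5, 1.96  # n = 5 per the example; z for 95%

half_width = z * sigma / sqrt(n)  # z * standard error
lo, hi = mu - half_width, mu + half_width

# 95% of sample means should fall in roughly [444.7, 596.9]
print(round(lo, 2), round(hi, 2))
```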
Estimation
❖ Point Estimation
❖ Interval Estimation
❖ Mean (μ) vs Proportion (π)
■ With known σ
■ With unknown σ
➢ Difference in Mean (μ1 - μ2)
➢ Difference in Proportion (π1 - π2)
❖ Sample size
Point Estimation
A point estimate is a single statistic, determined from a sample, that is used to
estimate the corresponding population parameter.

Example:
A sample mean x̄ calculated from a random sample x1, x2, ..., xn is a point
estimate of the unknown population mean μ.
Interval Estimation
An interval estimate is a range of values for a parameter: a point estimate plus
and minus a margin that expresses the uncertainty or variability associated with
the estimate
estimate ± (critical value of z or t) × (standard error)

Example:
Given a data set, we might conclude that the population mean falls somewhere between 10 and 100 (10 < μ < 100).
Confidence Interval for Mean
A 100(1 − α)% confidence interval for µ, the population mean, is given by the
interval estimate
x̄ - zα/2*σ/√n ≤ μ ≤ x̄ + zα/2*σ/√n
when the population variance is known

Interpretation of CI
❖ In repeated sampling, a 100(1 − α)% confidence interval is a range of values that
you can be 100(1 − α)% certain contains the true mean of the population
❖ This is not the same as a range that contains 95% of the values
Derivation of Confidence Interval (CI) for Mean

The confidence level (1 - α) indicates how confident we are that the population
mean lies within the indicated confidence interval

P(μ - zα/2*σ/√n ≤ x̄ ≤ μ + zα/2*σ/√n) = 1 - α
P(x̄ - zα/2*σ/√n ≤ μ ≤ x̄ + zα/2*σ/√n) = 1 - α
P(L ≤ μ ≤ U) = 1 - α

Example:
If the confidence level is 0.95, then zα/2 = 1.96.
We can say that we are 95% confident that
the population mean lies within the interval
x̄ - 1.96*σ/√n ≤ μ ≤ x̄ + 1.96*σ/√n
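The "in repeated sampling" interpretation can be checked with a quick simulation: build many 95% intervals from fresh samples and count how often they cover the true mean (the population values μ = 100, σ = 15 and sample size n = 25 here are hypothetical):

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)
mu, sigma, n, z = 100, 15, 25, 1.96  # known-sigma case, 95% confidence

trials = 2000
covered = 0
for _ in range(trials):
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    lo = xbar - z * sigma / sqrt(n)
    hi = xbar + z * sigma / sqrt(n)
    covered += lo <= mu <= hi  # does this interval contain the true mean?

# In repeated sampling, about 95% of the intervals contain mu
print(round(covered / trials, 2))
```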
Summary of Confidence Interval (CI)
Summary of Confidence Interval (CI) for Mean

Summary of Confidence Interval (CI) for Difference of Mean


Estimating Proportion π
A 100(1 − α)% confidence interval for the population proportion π is
p ± zα/2 * √( p(1 - p) / n )
Where:
❖ p is the Sample proportion
❖ zα/2 is the Critical value for Confidence level (1 - α) in Standard normal table
❖ n is the Sample size
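A minimal sketch of the proportion interval, with hypothetical data (88 successes in n = 200, so p = 0.44, at 95% confidence):

```python
from math import sqrt

p, n, z = 0.44, 200, 1.96  # hypothetical sample proportion, size, and z for 95%

se = sqrt(p * (1 - p) / n)       # standard error of the sample proportion
lo, hi = p - z * se, p + z * se  # p ± z * se
print(round(lo, 3), round(hi, 3))
```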
Sample size determination for a mean
Suppose we wish to estimate a population mean with a maximum allowable margin
of error of ± E. Setting E = zα/2*σ/√n and solving for n gives
n = (zα/2*σ / E)²

What if we don't know σ? How to estimate σ:
❖ Take a Preliminary Sample
→ Take a small sample to estimate σ
❖ Assume Uniform Population
→ Estimate upper and lower limits a and b and set σ = √( (b - a)² / 12 )
❖ Assume Normal Population
→ Estimate upper and lower bounds a and b, and set σ = (b - a) / 6
❖ Poisson Arrivals
→ In the special case when λ is a Poisson arrival rate, σ = √λ
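A minimal sketch of the sample-size calculation, with hypothetical values σ = 15 and a desired margin of error E = 2 at 95% confidence:

```python
from math import ceil

z, sigma, E = 1.96, 15, 2  # hypothetical: sigma ~ 15, want margin of error ±2

# n = (z * sigma / E)^2, rounded UP so the margin is not exceeded
n = ceil((z * sigma / E) ** 2)
print(n)  # 217
```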
Sample size determination for a proportion
Suppose we wish to estimate a population proportion with a maximum allowable
margin of error of ± E. Setting E = zα/2*√(π(1 - π)/n) and solving for n gives
n = (zα/2 / E)² * π(1 - π)

What if we don't know π? How to estimate π:
❖ Assume that π = 0.5 (the most conservative choice)
❖ Take a Preliminary Sample
→ Take a small sample to estimate π
❖ Use a Prior Sample or Historical Data
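With the conservative guess π = 0.5 and a hypothetical margin of error of ±3 percentage points at 95% confidence:

```python
from math import ceil

z, E = 1.96, 0.03  # 95% confidence, hypothetical margin of error ±0.03
pi = 0.5           # conservative guess when pi is unknown

n = ceil((z / E) ** 2 * pi * (1 - pi))
print(n)  # 1068, the classic opinion-poll sample size
```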
Type I Error & Type II Error

                    | H0 is True         | H0 is False
Reject H0           | Type I error (α)   | Correct decision
                    | (false positive)   |
Fail to Reject H0   | Correct decision   | Type II error (β)
                    |                    | (false negative)
❖ If we choose α = .05, we expect to commit a Type I error about 5 times in 100


❖ Depending on the situation, one error is more important than the other
❖ There is trade-off between Type I and Type II error
❖ The larger critical value needed to reduce α makes it harder to reject H0, thereby
increasing β
Example:
• A doctor who is conservative about admitting patients with symptoms of heart attack to the ICU (reduced β) will
admit more patients with no heart attack (increased α).
• More sensitive airport weapons detectors (reduced β) will inconvenience more safe passengers (increased α).
Type I Error & Type II Error
The examples below come from Top 45 Data Scientist Interview Questions:
False Positive is more important than False Negative:
An e-commerce site runs a marketing campaign that gives a $1000 gift voucher to customers who purchase at least $10,000 worth
of items. The marketing team sends the voucher mail directly to 100 randomly chosen customers (without checking the minimum
purchase condition), assuming a profit of at least 20% on items sold above $10,000. The issue arises if we send $1000 gift
vouchers to customers who have not actually purchased anything but are marked as having made $10,000 worth of purchases.

False Positive is less important than False Negative:

Assume an airport has received high-security threats. The security team identifies whether a particular passenger may be a
threat based on certain characteristics. Due to a shortage of staff, they decide to scan only passengers predicted as
risk positives by their model. What will happen if a true threat is flagged as non-threat by the airport's model?

False Positive is equally important as False Negative:

In the banking industry, giving loans is the primary source of making money, but if the repayment rate is poor you will not
make a profit and may instead risk huge losses. Banks don't want to lose good customers, and at the same time they don't
want to acquire bad customers. In this scenario, both false positives and false negatives are very important to measure.
One-sample Hypothesis Test Steps for a Single Mean
Basic Steps
❖ Define the null hypothesis, H0
❖ Define the alternative hypothesis, Ha, where Ha is usually of the form “not H0”
❖ Define the Type I error (the probability of falsely rejecting the null)
❖ Calculate the test statistic
❖ Calculate the p-value (the probability of getting a result as extreme as, or more
extreme than, the observed one if the null hypothesis is true)
❖ If p-value ≤ α, reject H0. Otherwise, fail to reject H0

Hint for stating the null hypothesis:

- H0 should always contain a statement of equality. Another way of thinking of it is that the null
hypothesis is a statement of "no difference," while Ha is the claim we are trying to find evidence in
favor of
- The null hypothesis is sometimes a statement of what you expect to happen in the experiment
Approaches to two-sided hypothesis testing
Using Confidence Interval - CI
❖ Create a 100(1 − α)% CI for the population parameter
❖ If the CI does not contain the null-hypothesis value, reject the null hypothesis
❖ If the CI contains the null-hypothesis value, fail to reject the null hypothesis

Example:
Let x̄ equal the mean weight of a random sample of 10 infants, with observed sample mean 2500 grams. The population
follows a normal distribution with standard deviation 1000 grams.

Question: Is the mean birth weight in this population different from 3000 grams?
Answer: With 95% confidence, we have:
x̄ - 1.96*σ/√n ≤ μ ≤ x̄ + 1.96*σ/√n
Or 2500 - 1.96*1000/√10 ≤ μ ≤ 2500 + 1.96*1000/√10
1880 ≤ μ ≤ 3120 (approximately)
Since 3000 lies inside the interval, we cannot say that the true mean is different from 3000.
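The CI approach can be sketched directly with the numbers from the example:

```python
from math import sqrt

xbar, sigma, n, mu0, z = 2500, 1000, 10, 3000, 1.96  # example values

lo = xbar - z * sigma / sqrt(n)
hi = xbar + z * sigma / sqrt(n)
reject = not (lo <= mu0 <= hi)  # reject H0 only if mu0 falls outside the CI

print(round(lo), round(hi), reject)  # 1880 3120 False
```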
Approaches to two-sided hypothesis testing
Using Critical Value - CV
❖ Calculate the critical value zc (CV) for the specified α
❖ Compute the test statistic zobs (TS)
❖ Reject the null hypothesis if |TS| > |CV|; fail to reject the null if |TS| ≤ |CV|

Example:
Let x̄ equal the mean weight of a random sample of 10 infants, with observed sample mean 2500 grams. The population
follows a normal distribution with standard deviation 1000 grams.

Question: Is the mean birth weight in this population different from 3000 grams?
Answer: With significance level α = 0.05, we have
- zc = 1.96 (recall that 2 × P(Z > |zc|) = 0.05)
- zobs = (x̄ - μ0)/(σ/√n) = (2500 - 3000)/(1000/√10) ≈ -1.58
Because |zobs| < |zc|, we cannot say that the true mean is different from 3000.
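The critical-value approach can be sketched as follows, with the numbers from the example:

```python
from math import sqrt

xbar, sigma, n, mu0 = 2500, 1000, 10, 3000  # example values
z_crit = 1.96                               # two-sided critical value, alpha = 0.05

z_obs = (xbar - mu0) / (sigma / sqrt(n))    # standardized test statistic
reject = abs(z_obs) > z_crit

print(round(z_obs, 2), reject)  # -1.58 False
```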
p-value
The p-value for a hypothesis test is the probability of obtaining a value of the test
statistic as extreme as, or more extreme than, the observed test statistic when the
null hypothesis is true.

❖ The rejection region is determined by the desired level of significance α, the
probability of committing a Type I error (i.e., of falsely rejecting the null)
❖ Reporting the p-value associated with a test gives an indication of how common
or rare the computed value of the test statistic is, given that the null
hypothesis is true
One-sample Hypothesis Test for a Single Mean
Example:

Assume a chair manufacturing process with normally distributed chair heights:

- a known standard deviation of 5cm
- a sample of 10 chairs
- a sample mean of 37.5cm

Question: Is the mean chair height in this production line different from 40cm?
One-sample Hypothesis Test for a Single Mean
❖ Set up a two-sided test of
- H0: mean = 40cm
- Ha: mean ≠ 40cm
❖ Let the Type I error rate be α = 0.05
- Calculate the test statistic
z = (x̄ - μ0)/(σ/√n) = (37.5 - 40)/(5/√10) ≈ -1.58
- What does this mean? Our observed mean is 1.58 standard errors below the
hypothesized mean
- The test statistic is the standardized value of our data assuming the null
hypothesis is true
- Question: if the true mean is 40cm, is our observed sample mean of 37.5cm
“common” or is this value unlikely to occur?
One-sample Hypothesis Test for a Single Mean
- Calculate the p-value to answer our question:
p-value = 2 × P(Z ≤ -1.58) ≈ 0.11
- If the true mean is 40cm, our data, or data more extreme than ours, would
occur in about 11 out of 100 studies (of the same size, n=10)
- General guideline: if the p-value is less than or equal to α, reject
the null hypothesis
- Conclusion: since the p-value (0.11) exceeds our chosen α = 0.05, we fail to
reject the null hypothesis
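The full chair-height test can be sketched end to end, computing both the test statistic and the two-sided p-value from the standard normal CDF (via the error function, so no Z-table is needed):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

xbar, mu0, sigma, n, alpha = 37.5, 40, 5, 10, 0.05  # example values

z_obs = (xbar - mu0) / (sigma / sqrt(n))  # standardized test statistic
p_value = 2 * phi(-abs(z_obs))            # two-sided p-value

print(round(z_obs, 2), round(p_value, 2))  # -1.58 0.11
print("reject H0" if p_value <= alpha else "fail to reject H0")
```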
Summary of One-sample Hypothesis Testing
Summary of One-sample Hypothesis Testing for One Mean

Summary of One-sample Hypothesis Testing for One Proportion


Reference
1. Doane, David P., and Lori E. Seward - Applied statistics in business and economics
2. Wasserman, Larry - All of statistics: a concise course in statistical inference
3. http://www-hsc.usc.edu/~eckel/biostat2/slides/lecture4.pdf
4. http://dsearls.org/courses/M120Concepts/ClassNotes/Statistics/530G_Derivation.htm
5. https://www.graphpad.com/guides/prism/7/statistics/stat_more_about_confidence_interval.htm?toc=0&printWindow
