
Hypothesis Testing

- Hypothesis testing involves formulating a null hypothesis (H0) and an alternative hypothesis (Ha). The null hypothesis typically represents no effect or no difference, while the alternative represents an effect or difference.
- Type I errors occur when the null hypothesis is rejected when it is true. Type II errors occur when the null hypothesis is not rejected when it is false. Setting the significance level (α) controls the probability of a Type I error.
- A confidence interval represents the range of values that an estimate is expected to fall within a certain percentage of the time. The confidence level is the percentage at which the true value is expected to be within the confidence interval.

DATA SCIENCE

UNIT – 2
HYPOTHESIS TESTING
Hypothesis Testing
Introduction Hypothesis
• Hypothesis testing typically begins with some theory, claim, or
assertion about a particular parameter of a population
• These hypothetical statements are tested for their validity by the
information provided by random samples drawn from their
corresponding populations
Hypothesis
• The hypothesis provides a summary of what direction, if any, is
taken to investigate a theory.

• The purpose of including hypotheses is:


• To provide a summary of the research, how it will be investigated, and
what is expected to be found.

• To provide an answer to the research question

• The hypothesis is a predictive statement of what is expected to
happen when testing the research question.
Notation & Two Questions
• A Hypothesis is an approximation of f(x), the target function
Notation
• X: the space of instances
• D: the probability distribution of encountering instances from X
• f: the target function
• H: the hypothesis space
• h: a particular hypothesis in H
• (x, f(x)): a training instance
• S: all training instances
• Two Questions
• Given h constructed from n examples drawn randomly from D, what is the best estimate of the accuracy of h over
future instances drawn from D?
• What is the probable error in this accuracy estimate?
Types of Hypotheses
• The null hypothesis predicts that the results will show no or
little effect.
• The null hypothesis is a predictive statement that is used when
it is thought that the independent variable will not influence the
dependent variable.
• An alternative hypothesis is a predictive statement used when it
is thought that the independent variable will influence the
dependent variable.
• An alternative hypothesis that does not specify a direction is called a
non-directional, two-tailed hypothesis, as it predicts the results can go
either way, e.g. increase or decrease.
Types of Hypotheses
• Does independent variable affect dependent variable?
• Null hypothesis (H0): Independent variable does not affect dependent
variable.
• Alternative hypothesis (Ha): Independent variable affects dependent variable.

• The directional alternative hypothesis states how the independent variable (IV)
will influence the dependent variable (DV), identifying a specific direction,
such as whether there will be an increase or decrease in the observed results.
Null Hypothesis
• The hypothesis that the population parameter is equal to a specified value (for example, a company specification)
is referred to as the null hypothesis
• It is a statement of no difference (equality) or of no significant difference, and is denoted by 𝐻0 .
• It is a statement of a null or neutral attitude.
• 𝐻0 : The difference in quality of tube-lights among three brands is insignificant.
• 𝐻0 : The performance of the new drug is the same as that of the old drug, i.e., the new drug is not better
than the old one.
• Usually, a null hypothesis is expressed with an “=” sign and sometimes with “≥” or “≤”.
• The null hypothesis is that the filling process is working properly, and therefore the mean fill is the 368-gram
specification
𝐻0 : 𝜇 = 368
Alternative Hypothesis
• It is a rival, complementary, or opposite hypothesis to the null
hypothesis and is denoted by 𝐻1 or 𝐻𝑎 .
• 𝐻1 : The difference in quality of tube-lights among three brands is
significant.
• 𝐻1 : The performance of the new drug is better than that of the old one.
• The alternative hypothesis is that the filling process is not working
properly, and therefore the mean fill is not the 368-gram specification
𝐻1 : 𝜇 ≠ 368
Summary
• The following key points summarize the null and alternative hypotheses:
• The null hypothesis, 𝐻0 , represents the status quo or the current belief in a situation.
• The alternative hypothesis, 𝐻1 , is the opposite of the null hypothesis and represents
a research claim or specific inference you would like to prove.
• If you reject the null hypothesis, you have statistical evidence that the alternative
hypothesis is correct.
• If you do not reject the null hypothesis, you have failed to prove the alternative
hypothesis. The failure to prove the alternative hypothesis, however, does not mean
that you have proven the null hypothesis.
• The null hypothesis, 𝐻0 , always refers to a specified value of the population
parameter (such as 𝜇), not a sample statistic (such as X̄).
• The statement of the null hypothesis always contains an equal sign regarding the
specified value of the population parameter
• The statement of the alternative hypothesis never contains an equal sign regarding
the specified value of the population parameter
Hypothesis Formulation
THE NULL AND ALTERNATIVE HYPOTHESES
• You are the manager of a fast-food restaurant. You want to determine
whether the waiting time to place an order has changed in the past
month from its previous population mean value of 4.5 minutes. State
the null and alternative hypotheses
Answer:
• The null hypothesis is that the population mean has not changed
from its previous value of 4.5 minutes 𝐻0 : 𝜇 = 4.5
• The alternative hypothesis is that the population mean is not 4.5
minutes 𝐻1 : 𝜇 ≠ 4.5
Example 1
• A new manufacturing method is believed to be better than the
current method
• Alternative Hypothesis
• A new manufacturing method is better
• Null Hypothesis
• The method is no better than the old method
Example 2
• A new bonus plan is developed in an attempt to increase sales
• Alternative Hypothesis
• A new bonus plan increases the sales
• Null Hypothesis
• New bonus plan does not increase the sales
Example 3
• A new drug is developed with the goal of lowering Cholesterol level
more than the existing drug
• Alternative Hypothesis
• The new drug lowers Cholesterol-level more than the existing drug
• Null Hypothesis
• The new drug doesn’t lower Cholesterol-level more than the existing drug
True Error vs sample Error
Confusion Matrix
• P = TP + FN
• N = FP + TN
• PP = TP + FP
• PN = FN + TN
• Error cases: FP, FN
Example

P = 8, N = 4
TP = 6, TN = 3, FP = 1, FN = 2
PP = TP + FP = 7
PN = TN + FN = 5
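
The totals in this example can be checked with a short sketch. The sample error and accuracy lines are additions not shown on the slide: they assume the usual definition of sample error as the fraction of misclassified instances, (FP + FN) / (P + N).

```python
# Counts taken from the example above.
TP, TN, FP, FN = 6, 3, 1, 2

P = TP + FN    # actual positives
N = FP + TN    # actual negatives
PP = TP + FP   # predicted positives
PN = FN + TN   # predicted negatives

total = P + N
# Assumed definition: sample error = fraction of misclassified instances.
sample_error = (FP + FN) / total
accuracy = (TP + TN) / total

print(P, N, PP, PN)
print(sample_error, accuracy)
```

With these counts the sample error is 3/12 = 0.25 and the accuracy is 0.75.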
Estimating Error
Bias is the difference between the average prediction of the
hypothesis and the correct value being predicted.

A hypothesis with high bias oversimplifies the problem (the model is
not complex enough to fit the training data). It tends to have high
training error and high test error.

Variance: High-variance hypotheses show high variability between their
predictions when trained on different samples.

They make the model overly complex and do not generalize well to
unseen data.
Confidence Level & Interval
• When an estimate is made for a variable, there is always uncertainty
around that estimate because the number is based on a sample of
the population you are studying.
• The confidence interval is the range of values that the estimate is
expected to fall between a certain percentage of the time if the
experiment is run again or the population is re-sampled in the same way.
• The confidence level is the percentage of times that an estimate is
expected to fall between the upper and lower bounds of
the confidence interval; it is set by the alpha value as (1 - 𝛼) * 100%.
The Critical Value of the Test Statistic
• In the Oxford Cereal Company scenario, the null hypothesis is that the mean
amount of cereal per box in the entire filling process is 368 grams
• You select a sample of boxes from the filling process, weigh each box, and
compute the sample mean
• This statistic is an estimate of the corresponding parameter
• Even if the null hypothesis is true, the statistic (the sample mean, X̄) is likely to
differ from the value of the parameter
• For example, if the sample mean is 367.9, you would probably conclude that the population mean
has not changed, because 367.9 is very close to 368
• If the sample mean is 320, you would probably conclude that the population mean is not 368,
because 320 is very far from 368
• Determining what is very close and what is very different is arbitrary without
clear definitions
Regions of Rejection and Non-rejection
• The sampling distribution of the test statistic is divided into two regions, a
region of rejection (sometimes called the critical region) and a region of
non-rejection
• If a value of the test statistic falls into the rejection region, you reject the
null hypothesis
• If the test statistic falls into the region of non-rejection, you do not reject
the null hypothesis.
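
As a rough illustration of the two regions, the sketch below runs a two-tailed Z test for the cereal-filling scenario. The hypothesized mean of 368 grams comes from the slides; the standard deviation (15), sample size (25), significance level (0.05) and the sample mean of 372.5 are assumed values chosen only for the example, and `two_tailed_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z statistic, upper critical value, reject H0?) for H0: mu = mu0."""
    z_stat = (sample_mean - mu0) / (sigma / sqrt(n))
    # The rejection region is |z| > z_crit, with alpha/2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_stat, z_crit, abs(z_stat) > z_crit


# mu0 = 368 is from the text; sigma, n and the sample means are assumptions.
z_stat, z_crit, reject = two_tailed_z_test(372.5, mu0=368, sigma=15, n=25)
print(round(z_stat, 2), round(z_crit, 2), reject)

# A sample mean far from 368 lands in the rejection region.
z_far, _, reject_far = two_tailed_z_test(320, mu0=368, sigma=15, n=25)
print(round(z_far, 2), reject_far)
```

Under these assumptions a sample mean of 372.5 gives a test statistic of 1.5, which falls in the region of non-rejection, while a sample mean of 320 falls deep in the rejection region.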
Risks in Decision Making Using Hypothesis-Testing Methodology
• A Type I error occurs if you reject the null hypothesis, 𝐻0 , when it is
true and should not be rejected. The probability of a Type I error
occurring is 𝛼.
• A Type II error occurs if you do not reject the null hypothesis, 𝐻0 ,
when it is false and should be rejected. The probability of a Type II
error occurring is 𝛽.
Confusion Matrix
• P = TP + FN
• N = FP + TN
• PP = TP + FP
• PN = FN + TN
• Error cases: FP, FN
• Type I error: FP
• Type II error: FN
Errors in Hypothesis Testing
• P(Type I error)
• 𝜶 denotes the probability of making a Type I error
• $\alpha = P(\text{rejecting } H_0 \mid H_0 \text{ is true})$
• P(Type II error)
• 𝜷 denotes the probability of making a Type II error
• $\beta = P(\text{not rejecting } H_0 \mid H_0 \text{ is false})$
• Note:
• 𝜶 and 𝛃 are not independent of each other. For a fixed sample size, as one increases, the other
decreases.
• When the sample size increases, both can be reduced, since sampling error is
reduced.
• In general, we focus on Type I error, but Type II error is also important,
particularly when the sample size is small.
Type-1 & Type-2 Error
Level of Significance
• Level of significance (𝜶)
• You control the Type I error by deciding the risk level, 𝛼, that you are willing to
have in rejecting the null hypothesis when it is true
• Typically, you select levels of 0.01, 0.05, or 0.10
The Confidence Coefficient
• The complement of the probability of a Type I error, (1 - 𝛼), is called
the confidence coefficient
• The confidence coefficient, (1 - 𝛼), is the probability that you will not
reject the null hypothesis, 𝐻0 , when it is true and should not be
rejected. The confidence level of a hypothesis test is (1 - 𝛼) * 100%.
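
One way to see what 𝛼 means in practice is to simulate many samples from a population where the null hypothesis is true and count how often a two-tailed Z test rejects it. This is only a sketch: the population values (mean 368, standard deviation 15), the sample size, the number of trials and the fixed seed are all assumptions made for the illustration, and `type_i_error_rate` is a hypothetical helper.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated samples (drawn with H0 true) that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a normal population whose mean really is mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate(mu0=368, sigma=15, n=25, alpha=0.05, trials=4000)
print(rate)
```

The observed rejection rate should land close to 𝛼 = 0.05, and the share of non-rejections close to the confidence coefficient (1 - 𝛼).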
The 𝛽 Risk
• The probability of committing a Type II error is denoted by 𝛽
• The probability of making a Type II error depends on the difference
between the hypothesized and actual values of the population
parameter
• If the difference between the hypothesized and actual values of the
population parameter is large, 𝛽 is small
• If the difference between the hypothesized and actual values of the
parameter is small, 𝛽 is large
The Power of a Test
• The complement of the probability of a Type II error, (1 - 𝛽), is called
the power of a statistical test.
• The power of a statistical test, (1 - 𝛽), is the probability that you will
reject the null hypothesis when it is false and should be rejected.
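
The dependence of 𝛽 and power on the gap between the hypothesized and actual means can be sketched for a two-tailed Z test. The hypothesized mean of 368 follows the cereal example; the actual means (370 and 380), the standard deviation (15) and the sample size (25) are assumptions, and `beta_and_power` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def beta_and_power(mu0, mu1, sigma, n, alpha=0.05):
    """Two-tailed Z test of H0: mu = mu0 when the actual mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Distance between actual and hypothesized mean, in standard errors.
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    # beta = probability the test statistic stays inside the non-rejection region.
    beta = std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)
    return beta, 1 - beta


beta_small_gap, power_small_gap = beta_and_power(368, 370, sigma=15, n=25)
beta_large_gap, power_large_gap = beta_and_power(368, 380, sigma=15, n=25)
print(round(beta_small_gap, 3), round(power_small_gap, 3))
print(round(beta_large_gap, 3), round(power_large_gap, 3))
```

Under these assumptions the small gap leaves 𝛽 large (low power), while the large gap makes 𝛽 small (high power).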
Risk in Decision Making
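• The table summarizes the possible outcomes of a hypothesis-testing decision and their probabilities, using the definitions from the preceding slides:

Statistical decision   | H0 true                                | H0 false
Do not reject H0       | Correct decision, confidence = (1 - 𝛼) | Type II error, P = 𝛽
Reject H0              | Type I error, P = 𝛼                    | Correct decision, power = (1 - 𝛽)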
Confidence Interval
• Generally, the true error is complex and difficult to calculate. It can be
estimated with the help of a confidence interval. The confidence interval can
be estimated as a function of the sample error.
• Below are the steps for the confidence interval:
• Randomly draw a sample S of n examples (independently of each other) from
the population P, where n should be at least 30.
• Calculate the sample error of sample S.
• Here we assume that the sample error is an unbiased estimator of the true
error. The true error is then estimated as:

$error_D(h) \approx error_S(h) \pm z_s \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$

• where $z_s$ is the z-score corresponding to the s% confidence level of the
interval.
The sample S contains n examples drawn independently of one
another, and independently of h, according to probability
distribution D
n ≥ 30
Hypothesis h commits r errors over these n examples ($error_S(h) = r/n$)
With 95% probability, the true error $error_D(h)$ lies in the interval

$error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$

Example: n = 40
Hypothesis h commits r = 12 errors
Sample error $error_S(h) = \frac{12}{40} = 0.3$
95% confidence:
$0.3 \pm 1.96 \sqrt{\frac{0.3 \times 0.7}{40}} \approx 0.3 \pm 0.14$
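
The arithmetic of this example can be reproduced with a minimal sketch. The function name `error_confidence_interval` is hypothetical, and the default z = 1.96 corresponds to the 95% level used above.

```python
from math import sqrt


def error_confidence_interval(r, n, z=1.96):
    """Return (sample error, half-width) of the interval for the true error."""
    error_s = r / n
    half_width = z * sqrt(error_s * (1 - error_s) / n)
    return error_s, half_width


error_s, half_width = error_confidence_interval(r=12, n=40)
print(error_s, round(half_width, 2))
print(round(error_s - half_width, 2), round(error_s + half_width, 2))
```

For r = 12 and n = 40 the half-width is about 0.14, so the interval runs from roughly 0.16 to 0.44.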
Confidence Intervals
• With N% probability, $error_D(h)$ lies in the interval
$error_S(h) \pm z_N \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$

N%:    50%    68%    80%    90%    95%    98%    99%
z_N:   0.67   1.00   1.28   1.64   1.96   2.33   2.58
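
The z_N values in the table can be recovered from the standard normal distribution rather than looked up. The sketch below uses `statistics.NormalDist` from the Python standard library; `z_for_confidence` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided z value: the central `level` of a standard normal lies within +/- z."""
    return NormalDist().inv_cdf(0.5 + level / 2)


for level in (0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {z_for_confidence(level):.2f}")
```

The computed values agree with the table to within rounding. The 68% entry comes out slightly below 1.00, because one standard deviation corresponds to about 68.27% rather than exactly 68%.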
Concept of Confidence Interval (C.I)
Significance Values
Confidence Interval
• Statistical inference is the process of using sample results to draw
conclusions about the characteristics of a population
• Inferential statistics enables you to estimate unknown population
characteristics such as a population mean or a population
proportion.
• Two types of estimates are used to estimate population
parameters: point estimates and interval estimates.
– A point estimate is the value of a single sample statistic.
– A confidence interval estimate is a range of numbers, called an interval,
constructed around the point estimate.
Estimation
CONFIDENCE INTERVAL ESTIMATION
FOR THE MEAN (𝜎 KNOWN)
• How much uncertainty is associated with a point estimate of a
population parameter?

• An interval estimate provides more information about a population
characteristic than does a point estimate

• Such interval estimates are called confidence intervals


Confidence interval estimate:

$\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$

where $\bar{X}$ is the point estimate,
$Z_{\alpha/2}$ is the normal distribution critical value for a
probability of $\alpha/2$ in each tail, and
$\sigma/\sqrt{n}$ is the standard error.
The level of confidence is symbolized as $(1 - \alpha)$.
The value of Z needed for constructing a confidence interval is called the
critical value for the distribution.
Critical values
[Figure: Z value for 95% confidence and Z value for 99% confidence]
Common Levels of Confidence
• Commonly used confidence levels are 90%,
95%, and 99%

Confidence Level   Confidence Coefficient (1 - α)   Zα/2 value
80%                0.80                             1.28
90%                0.90                             1.645
95%                0.95                             1.96
98%                0.98                             2.33
99%                0.99                             2.58
99.8%              0.998                            3.09
99.9%              0.999                            3.29
Confidence Interval based on Population
Standard deviation
Example
Z-score values on different C.I
p-value
Practice question
Solution
Example
• A paper manufacturer has a production process that operates continuously throughout
an entire production shift. The paper is expected to have a mean length of 11 inches, and
the standard deviation of the length is 0.02 inch. At periodic intervals, a sample is
selected to determine whether the mean paper length is still equal to 11 inches or
whether something has gone wrong in the production process to change the length of
the paper produced. You select a random sample of 100 sheets, and the mean paper
length is 10.998 inches. Construct a 95% confidence interval estimate for the population
mean paper length.
Answer:
With 95% confidence, Z = 1.96
$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}} = 10.998 \pm 1.96 \times \frac{0.02}{\sqrt{100}}$
$= 10.998 \pm 0.00392$
$10.99408 \le \mu \le 11.00192$
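
The same calculation can be sketched in code, which also makes it easy to repeat at other confidence levels. The helper name `mean_ci_sigma_known` is hypothetical. The critical value is computed from the standard normal distribution rather than read from a table, so results can differ from the hand calculation in the last decimal places.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_sigma_known(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


lower, upper = mean_ci_sigma_known(x_bar=10.998, sigma=0.02, n=100)
print(round(lower, 5), round(upper, 5))
```

Because the expected mean of 11 inches lies inside the resulting interval, this sample gives no indication that the process mean has shifted.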
Example
• Try with 99% confidence
Example
• The quality control manager at a light bulb factory needs to estimate
the mean life of a large shipment of light bulbs. The standard
deviation is 100 hours. A random sample of 64 light bulbs indicated a
sample mean life of 350 hours.
a. Construct a 95% confidence interval estimate of the population mean life of
light bulbs in this shipment.
b. Do you think that the manufacturer has the right to state that the light bulbs
last an average of 400 hours? Explain
CONFIDENCE INTERVAL ESTIMATION
FOR THE MEAN (𝜎 UNKNOWN)
• If the population standard deviation σ is unknown, we can substitute
the sample standard deviation, S
• This introduces extra uncertainty, since S is variable from sample to
sample
• So we use the t distribution instead of the normal distribution
Student’s t Distribution
• At the beginning of the twentieth century, William S. Gosset, a
statistician for Guinness Breweries in Ireland, wanted to make
inferences about the mean when 𝜎 was unknown.
• Gosset adopted the pseudonym “Student.”
• The distribution that he developed is known as Student's t
distribution and is commonly referred to as the t distribution
Student’s t-distribution
• If the random variable X is normally distributed, then the following
statistic has a t distribution with n - 1 degrees of freedom:
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
Student’s t-distribution
Degrees of Freedom (df)
Idea: Number of observations that are free to vary
after the sample mean has been calculated

Example: Suppose the mean of 3 numbers is 8.0

Let X1 = 7
Let X2 = 8
What is X3?

If the mean of these three values is 8.0,
then X3 must be 9 (i.e., X3 is not free to vary)

Here, n = 3, so degrees of freedom = n - 1 = 3 - 1 = 2

(2 values can be any numbers, but the third is not free to vary for a
given mean)
The Confidence Interval
• The equation below defines the (1 - 𝛼) * 100% confidence interval estimate
for the mean with 𝜎 unknown:
$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}}$
where $t_{n-1}$ is the critical value of the t distribution with n - 1 degrees
of freedom for an area of $\alpha/2$ in the upper tail
Selected t distribution values
With comparison to the Z value

Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z (∞ d.f.)
0.80               1.372         1.325         1.310         1.28
0.90               1.812         1.725         1.697         1.645
0.95               2.228         2.086         2.042         1.96
0.99               3.169         2.845         2.750         2.58

Note: t → Z as n increases
Example of t distribution confidence interval

A random sample of n = 25 has $\bar{X} = 50$ and S = 8. Form a 95% confidence interval for μ

• d.f. = n - 1 = 24, so $t_{\alpha/2} = t_{0.025} = 2.0639$

The confidence interval is

$\bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} = 50 \pm (2.0639) \frac{8}{\sqrt{25}}$

46.698 ≤ μ ≤ 53.302
Example
• A manufacturing company produces electric insulators. If the insulators break when in use, a
short circuit is likely. To test the strength of the insulators, you carry out destructive testing to
determine how much force is required to break the insulators. You measure force by observing
how many pounds are applied to the insulator before it breaks. Table lists 30 values from this
experiment. Construct a 95% confidence interval estimate for the population mean force required
to break the insulator.
• 1,870 1,728 1,656 1,610 1,634 1,784 1,522 1,696 1,592 1,662
• 1,866 1,764 1,734 1,662 1,734 1,774 1,550 1,756 1,762 1,866
• 1,820 1,744 1,788 1,688 1,810 1,752 1,680 1,810 1,652 1,736
Answer: $\bar{X} = 1723.4$, S = 89.55, n = 30, $t_{29} = 2.0452$
$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}}$
$1723.4 \pm 2.0452 \times \frac{89.55}{\sqrt{30}}$
$1723.4 \pm 33.44$
$1689.96 \le \mu \le 1756.84$
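
The summary statistics and the interval for the insulator data can be reproduced with the standard library. In this sketch the t critical value is passed in from the table (2.0452 for 29 degrees of freedom, as given in the answer), because the Python standard library has no t-distribution quantile function. `t_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown (t_crit from a table)."""
    half_width = t_crit * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# The 30 breaking-force values listed in the example.
forces = [
    1870, 1728, 1656, 1610, 1634, 1784, 1522, 1696, 1592, 1662,
    1866, 1764, 1734, 1662, 1734, 1774, 1550, 1756, 1762, 1866,
    1820, 1744, 1788, 1688, 1810, 1752, 1680, 1810, 1652, 1736,
]

x_bar = mean(forces)   # sample mean
s = stdev(forces)      # sample standard deviation (n - 1 in the denominator)
lower, upper = t_confidence_interval(x_bar, s, len(forces), t_crit=2.0452)
print(round(x_bar, 1), round(s, 2), round(lower, 2), round(upper, 2))

# The earlier example: n = 25, sample mean 50, S = 8, t = 2.0639.
lower_25, upper_25 = t_confidence_interval(50, 8, 25, t_crit=2.0639)
print(round(lower_25, 3), round(upper_25, 3))
```

Passing the critical value explicitly keeps the sketch dependency-free; a statistics package with a t quantile function could supply it instead.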
CONFIDENCE INTERVAL ESTIMATION FOR THE
PROPORTION
• The concept of the confidence interval extends to categorical data
• The unknown population proportion is represented by the Greek
letter 𝜋
• point estimate for 𝜋 is the sample proportion, p = X/n, where n is the
sample size and X is the number of items in the sample having the
characteristic of interest
CONFIDENCE INTERVAL ESTIMATE FOR THE
PROPORTION
$p \pm Z \sqrt{\frac{p(1 - p)}{n}}$
• Z = critical value from the standardized normal distribution
• Example: estimate the proportion of sales invoices that contain errors. Suppose that in a
sample of 100 sales invoices, 10 contain errors
Answer: 95% confidence, Z = 1.96
$p = \frac{10}{100} = 0.1$
$0.1 \pm 1.96 \sqrt{\frac{0.1 \times 0.9}{100}}$
$0.1 \pm 1.96 \times 0.03$
$0.1 \pm 0.0588$
$0.0412 \le \pi \le 0.1588$
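
The invoice example can be checked with a short sketch of the proportion formula. The helper name `proportion_ci` is hypothetical, and Z = 1.96 is the 95% value used in the answer.

```python
from math import sqrt


def proportion_ci(x, n, z=1.96):
    """Confidence interval for a population proportion from x successes in n trials."""
    p = x / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


lower, upper = proportion_ci(x=10, n=100)
print(round(lower, 4), round(upper, 4))
```

The same function can be reused for other samples by changing `x`, `n` and the `z` value for the desired confidence level.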
Try
• A large newspaper wants to estimate the proportion of newspapers
printed that have a nonconforming attribute, such as excessive ruboff,
improper page setup, missing pages, or duplicate pages. A random
sample of 200 newspapers is selected from all the newspapers
printed during a single day. For this sample of 200, 35 contain some
type of nonconformance. Construct and interpret a 90% confidence
interval for the proportion of newspapers printed during the day that
have a nonconforming attribute.
Sample Size Determination for the Mean
• Confidence interval estimation
$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$
• The sampling error e is defined by
$e = Z \frac{\sigma}{\sqrt{n}}$
• The sample size, n, is equal to the product of the Z value squared and
the standard deviation, 𝜎, squared, divided by the square of the sampling error,
e
$n = \frac{Z^2 \sigma^2}{e^2}$
Example
• Suppose you want to estimate the population mean force required to break the
insulator to within 25 pounds with 95% confidence. On the basis of a study taken
the previous year, you believe that the standard deviation is 100 pounds. Find the
sample size needed.
Answer: e = 25, Z = 1.96 (95% confidence), 𝜎 = 100
$n = \frac{Z^2 \sigma^2}{e^2}$
$n = \frac{(1.96)^2 (100)^2}{(25)^2}$
$n = 61.47$, which is rounded up to a sample size of 62
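
A minimal sketch of the sample-size calculation follows. The helper name `sample_size_for_mean` is hypothetical. The result is rounded up because a fractional observation cannot be sampled and rounding down would miss the target sampling error.

```python
from math import ceil


def sample_size_for_mean(z, sigma, e):
    """Unrounded sample size n = Z^2 * sigma^2 / e^2."""
    return (z * sigma / e) ** 2


n_exact = sample_size_for_mean(z=1.96, sigma=100, e=25)
n_required = ceil(n_exact)  # always round up to the next whole observation
print(round(n_exact, 2), n_required)
```

Because e appears squared in the denominator, halving the acceptable sampling error roughly quadruples the required sample size.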
