Introduction To Data Analytics: Statistical Inference - II
DATA ANALYTICS
Class #11
Statistical Inference - II
Dr. Sreeja S R
Assistant Professor
Indian Institute of Information Technology
IIIT Sri City
IIITS: IDA - M2021 1
Example 6.6:
Suppose the two hypotheses in a statistical test are:
H0: µ = a
H1: µ ≠ a
Also assume that, for a given sample, the population follows a normal distribution. A threshold limit, say δ, is used to decide whether the sample mean is significantly different from a.
Thus the null hypothesis is to be rejected if the mean value is less than a − δ or greater than a + δ.
[Figure: sampling distribution centred at a, with the limits a − δ and a + δ (marked a′ and a″) bounding the rejection region for H0 for a given value of α.]
Thus, in a two-tailed test, there are two rejection regions (also known as critical regions), one in each tail of the sampling distribution curve.
[Figure: Acceptance and rejection regions in case of a two-tailed test with 5% significance level. The central 95% of the area around µH0 is the acceptance region; reject H0 if the sample mean falls in either of the two tail regions.]
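The two-tailed decision rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the numbers (hypothesized mean 100, standard error 2) are hypothetical, and only the standard-library `statistics.NormalDist` is used.

```python
from statistics import NormalDist


def two_tailed_limits(mu0, std_error, alpha=0.05):
    """Return the (lower, upper) limits of the acceptance region for the sample mean."""
    # Each tail holds alpha/2 of the area, so the critical z is about 1.96 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return mu0 - z_crit * std_error, mu0 + z_crit * std_error


def reject_h0(sample_mean, mu0, std_error, alpha=0.05):
    """Reject H0 if the sample mean falls in either tail rejection region."""
    lower, upper = two_tailed_limits(mu0, std_error, alpha)
    return sample_mean < lower or sample_mean > upper


# Hypothetical numbers, chosen only for illustration.
lower, upper = two_tailed_limits(mu0=100.0, std_error=2.0)
print(round(lower, 2), round(upper, 2))
print(reject_h0(105.0, mu0=100.0, std_error=2.0))
```

With these assumed values the acceptance region is roughly 96.08 to 103.92, so a sample mean of 105 leads to rejection.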
One-Tailed Test
• A one-tailed test is used when we want to test whether the population mean is either lower than or higher than the hypothesized value.
Symbolically,
H0: µ = µH0
H1: µ < µH0 (left-tailed test) or H1: µ > µH0 (right-tailed test)
In this case there is only one rejection region, on the left tail (or on the right tail).
[Figure: Left-tailed test and right-tailed test at the 5% significance level. In each panel the rejection region is the single tail holding .05 of the area; the remainder is the acceptance region.]
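A one-tailed decision can be sketched the same way. This is an illustrative sketch only: the function name and the `tail` argument are assumptions made for this example, and the standard normal distribution is taken as the sampling distribution.

```python
from statistics import NormalDist


def one_tailed_reject(z_observed, alpha=0.05, tail="right"):
    """Decide a one-tailed z-test; the whole of alpha sits in a single tail."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    if tail == "right":
        return z_observed > z_crit
    if tail == "left":
        return z_observed < -z_crit
    raise ValueError("tail must be 'left' or 'right'")


print(one_tailed_reject(1.8))                # right-tailed test
print(one_tailed_reject(-1.8, tail="left"))  # left-tailed test
```

Because all of α is placed in one tail, the critical value (about 1.645) is smaller in magnitude than the two-tailed value (about 1.96) at the same significance level.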
EXAMPLE 6.7: CALCULATING α
Consider the two hypotheses
H0: µ = 8
H1: µ ≠ 8
Assume a sample of size n = 16 with standard deviation σ = 0.2, drawn from a normal distribution.
Suppose the null hypothesis is to be rejected if the sample mean is less than 7.9 or greater than 8.1.
If x̄ is the sample mean, then the probability of Type I error is
α = P(x̄ < 7.9 when µ = 8) + P(x̄ > 8.1 when µ = 8)
Given that the standard deviation is 0.2 and the distribution is normal, the standard error of x̄ is σ/√n = 0.2/√16 = 0.05.
Thus,
z = (7.9 − 8)/0.05 = −2
and
z = (8.1 − 8)/0.05 = 2
Hence,
α = P(Z < −2) + P(Z > 2) = 0.0228 + 0.0228 = 0.0456
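The Type I error probability of this example can be checked numerically. This is a minimal sketch, assuming H0: µ = 8 (the midpoint of the rejection limits 7.9 and 8.1); the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 8.0                 # assumed mean under H0 (midpoint of 7.9 and 8.1)
sigma = 0.2               # standard deviation given in the example
n = 16                    # sample size given in the example
lower, upper = 7.9, 8.1   # rejection limits given in the example

std_error = sigma / sqrt(n)  # 0.2 / 4 = 0.05
sampling_dist = NormalDist(mu=mu0, sigma=std_error)

# alpha = P(sample mean < 7.9 | H0) + P(sample mean > 8.1 | H0)
alpha = sampling_dist.cdf(lower) + (1 - sampling_dist.cdf(upper))
print(round(alpha, 4))
```

The exact normal areas give about 0.0455; four-place tables give 0.0228 per tail, i.e. 0.0456.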
Note: Since the draws are independent of each other, we can assume the sampling distribution follows the binomial probability distribution.
Example 6.8: Calculating α
• Let us express the population parameter as p, the probability of drawing a red chocolate.
The hypotheses of the problem can be stated as:
H0: p = pB // Box B is on the table
H1: p = pA // Box A is on the table
where pB and pA denote the proportions of red chocolates in Box B and Box A, with pB < pA.
Calculating α
In this example, the null hypothesis specifies that the probability of drawing a red chocolate is pB.
This means that a lower proportion of red chocolates in the observations favors the null hypothesis.
In other words, drawing all red chocolates provides sufficient evidence to reject the null hypothesis. Then the probability of making a Type I (α) error is the probability of getting five red chocolates in a sample of five from Box B. That is,
α = P(X = 5 when p = pB) = pB^5
Thus, the probability of rejecting a true null hypothesis is α = pB^5. That is, α is the chance that Box B will be mislabeled as Box A.
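The α calculation can be sketched with the binomial formula. The proportion of red chocolates in Box B used below (0.4) is a hypothetical value assumed only for this illustration, as are the function and variable names.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n independent draws and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_box_b = 0.4   # hypothetical proportion of red chocolates in Box B (assumption)
n_draws = 5     # five draws with replacement, as in the example

# Type I error: all five draws are red although Box B (H0) is on the table.
alpha = binom_pmf(5, n_draws, p_box_b)
print(alpha)
```

Under this assumed proportion, α = 0.4^5, which is about 0.01.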
Example 6.8: Calculating β
• The Type II (β) error occurs if we fail to reject the null hypothesis when it is not true. For the current illustration, such a situation occurs if Box A is on the table but we do not get the five red chocolates required to reject the hypothesis that Box B is on the table.
The probability of a Type II error is then the probability of getting four or fewer red chocolates in a sample of five from Box A.
That is,
β = P(X ≤ 4 when p = pA), where pA is the proportion of red chocolates in Box A.
That is,
β = 1 − P(X = 5 when p = pA)
Now,
P(X = 5 when p = pA) = pA^5
Hence,
β = 1 − pA^5
That is, if Box A is on the table, the probability that we will be unable to detect it is β = 1 − pA^5.
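The β calculation can be sketched in the same way. The proportion of red chocolates in Box A used below (0.6) is a hypothetical value assumed only for this illustration; the helper name `binom_cdf` is also illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for a binomial variable with n independent draws and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_box_a = 0.6   # hypothetical proportion of red chocolates in Box A (assumption)
n_draws = 5     # five draws with replacement, as in the example

# Type II error: four or fewer red chocolates although Box A (H1) is on the table.
beta = binom_cdf(4, n_draws, p_box_a)
print(beta)
```

Under this assumed proportion, β = 1 − 0.6^5, which is above 0.9. A rejection rule as strict as "all five red" keeps α small at the cost of a very large β.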
CASE STUDY 1: COFFEE SALE
A coffee vendor near Kharagpur railway station has had average sales of 500 cups per day. Because of the development of a bus stand nearby, the vendor expects sales to increase. During the first 12 days after the inauguration of the bus stand, the daily sales were as under:
550 570 490 615 505 580 570 460 600 580 530 526
On the basis of this sample information, can we conclude that the sales of coffee
have increased?
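• The following five steps are followed when testing a hypothesis:
1. Specify H0 and H1, the null and alternate hypotheses, and an acceptable level of α.
2. Determine an appropriate sample-based test statistic and the rejection region for the specified H1 and α.
3. Collect the sample data and calculate the test statistic.
4. Make a decision to either reject or fail to reject H0.
5. Interpret the result in the language of the problem domain.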
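• Step 1: Specification of hypothesis and acceptable level of α
H0: µ = 500 cups per day (sales have not increased)
H1: µ > 500 cups per day (sales have increased)
Level of significance: α = 0.05
Sample data (n = 12):
550 570 490 615 505 580 570 460 600 580 530 526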
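Since the sample size is small and the population standard deviation is not known, we shall use the t-test, assuming a normal population. The test statistic is
t = (x̄ − µ)/(s/√n), where s = √(Σ(xᵢ − x̄)²/(n − 1))
Hence, with x̄ = 6576/12 = 548, s = √(23978/11) ≈ 46.69 and n = 12,
t = (548 − 500)/(46.69/√12) ≈ 3.56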
Note:
The statistical table for the t-distribution gives the t-value for given degrees of freedom and level of significance α, and vice versa.
Step 3: Collect the sample data and calculate the test statistics
As H1 is one-tailed, we determine the rejection region by applying a one-tailed test in the right tail (because H1 is of the more-than type, µ > 500) at the 5% level of significance. From the t-table for 12 − 1 = 11 degrees of freedom, the rejection region is t > 1.796.
The observed value of t is approximately 3.56, which is in the rejection region, and thus H0 is rejected at the 5% level of significance.
We can conclude that the sample data indicate that coffee sales have increased.
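The coffee-sale test can be reproduced in a few lines. This is a sketch using only the standard library. The critical value 1.796 is the t-table entry for 11 degrees of freedom in a one-tailed test, assuming a 5% level of significance; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

sales = [550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526]
mu0 = 500                      # hypothesized mean under H0

n = len(sales)
x_bar = mean(sales)            # sample mean
s = stdev(sales)               # sample standard deviation (divides by n - 1)
t_observed = (x_bar - mu0) / (s / sqrt(n))

# Assumed: alpha = 0.05, right-tailed test, n - 1 = 11 degrees of freedom (t-table value).
T_CRITICAL = 1.796
reject_h0 = t_observed > T_CRITICAL

print(x_bar, round(s, 2), round(t_observed, 2), reject_h0)
```

The observed t of about 3.56 exceeds 1.796, which agrees with the conclusion that sales have increased.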
• Step 1: Specification of hypothesis and acceptable level of α
The hypotheses are given in terms of the population mean µ of medicine per tube.
• Step 2: Sample-based test statistics and the rejection region for specified H1, α
Rejection region: given α, which gives |z| > z(α/2) (obtained from the standard normal distribution for a two-tailed test).
• Step 3: Collect the sample data and calculate the test statistics
Sample results: , ,
Hence,
Since the computed test statistic falls in the rejection region, we reject H0.
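A two-tailed z-test of this kind can be sketched as follows. The sample values used here (n = 16, sample mean 7.89, standard deviation 0.2), the hypothesized mean of 8 and α = 0.05 are hypothetical numbers assumed purely for illustration; they are not the sample results of the case study.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Return (z, z_crit, reject) for H0: mu = mu0 against H1: mu != mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z, z_crit, abs(z) > z_crit


# Hypothetical sample results (assumptions, not taken from the slides).
z, z_crit, reject = two_tailed_z_test(x_bar=7.89, mu0=8.0, sigma=0.2, n=16)
print(round(z, 2), round(z_crit, 2), reject)
```

With these assumed numbers z is about −2.2, which lies beyond the critical value of about 1.96 in magnitude, so H0 would be rejected.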
1. ,
2. Reject if
4. Since the test statistic does not fall in the rejection region, we fail to reject H0: µ = 8
Note:
All these tests are based on the assumption of normality (i.e., the source of data is
considered to be normally distributed).
t-test: It is based on the t-distribution and is appropriate in the case of
• small sample(s)
• population variance is not known (in this case, we use the variance of the sample as an estimate of the population variance)
F-test: It is based on the F-distribution.
• This test is also used in the context of analysis of variance (ANOVA) for judging the significance of more than two sample means.
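To make the ANOVA use of the F-test concrete, the F statistic for a one-way layout can be computed as the ratio of the between-group mean square to the within-group mean square. This is a minimal sketch with made-up data; the function name `one_way_f` and the three groups are assumptions for illustration only.

```python
from statistics import mean


def one_way_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)    # n_total - k degrees of freedom
    return ms_between / ms_within


# Hypothetical data: three samples whose means are to be compared.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = one_way_f(groups)
print(round(f_value, 2))
```

The computed F is then compared with the F-table value for (k − 1, N − k) degrees of freedom at the chosen α.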
Case 2: Population normal, population finite, sample size may be large or small, and variance of the population is known.
Case 3: Population normal, population infinite, sample size is small and variance of the population is unknown.
t = (x̄ − µ)/(s/√n) with (n − 1) degrees of freedom
and
s = √(Σ(xᵢ − x̄)²/(n − 1))
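The test statistics for the two cases above can be sketched as small helper functions. For Case 2 the sketch assumes the usual textbook finite population correction, √((N − n)/(N − 1)), which is not stated on the slide. The function names and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def z_statistic_finite(x_bar, mu0, sigma, n, population_size):
    """Case 2: known variance, finite population (assumed finite population correction)."""
    fpc = sqrt((population_size - n) / (population_size - 1))
    return (x_bar - mu0) / ((sigma / sqrt(n)) * fpc)


def t_statistic(sample, mu0):
    """Case 3: unknown variance, small sample; returns (t, degrees of freedom)."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n)), n - 1


# Hypothetical numbers, for illustration only.
z = z_statistic_finite(x_bar=52.0, mu0=50.0, sigma=10.0, n=25, population_size=500)
t, dof = t_statistic([51.0, 49.5, 52.3, 50.8, 53.1], mu0=50.0)
print(round(z, 3), round(t, 3), dof)
```

The z value is compared with the standard normal table, and the t value with the t-table for the returned degrees of freedom.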
• Non-Parametric tests
Do not depend on any assumption about the population distribution
Assume only nominal or ordinal data
Note: Non-parametric tests need the entire population (or a very large sample size)
Any questions?