06 Statistical Inference
(CS40003)
Lecture #7
Statistical Inference
Hypothesis in SI
“A hypothesis is used to define the relationship between two variables” (Oxford dictionary).
“The volume of a gas is directly proportional to the number of molecules of the gas.”
V = aN
Example 6.2:
1. To determine whether the wages of men and women are equal.
Example 6.3:
One hypothesis might claim that wages of men and women are equal, while the
alternative might claim that men make more than women.
Hypothesis testing starts by making a set of two statements about the parameter(s) in
question.
The hypothesis actually to be tested is usually given the symbol H0 and is commonly
referred to as the null hypothesis.
The other hypothesis, which is assumed to be true when the null hypothesis is false, is
referred to as the alternate hypothesis and is often symbolized by H1.
Examples 6.5:
I.
4. The two hypotheses should be chosen in such a way that they are mutually exclusive and
exhaustive.
One or other must be true, but they cannot both be true.
Example.
Two-tailed test
An alternative hypothesis that specifies that the parameter can lie on either
side of the value specified by H0 leads to a two-sided (or two-tailed) test.
Example.
is same as
1. Specify H0 and H1, the null and alternate hypotheses, and an acceptable level of α.
The procedure is based on probability theory, that is, there is a chance that
we can make errors.
Type II error: A Type II error occurs when we incorrectly fail to reject H0 (i.e.,
we accept H0 when it is not true).
Note:
α and β (the probabilities of Type I and Type II errors) are not independent of each other: as one increases, the other decreases.
When the sample size increases, both tend to decrease, since sampling error is reduced.
In general, we focus on Type I error, but Type II error is also important,
particularly when sample size is small.
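The trade-off between the two error probabilities can be illustrated numerically. The following is a minimal sketch, not part of the original lecture: it assumes a two-tailed test on a normal mean with known standard deviation, and all numeric values (null mean 8.0, true mean 8.2, standard deviation 0.2, the rejection limits and sample sizes) are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def error_rates(mu0, mu_true, sigma, n, lower, upper):
    """Return (alpha, beta) for a test that rejects H0 when the
    sample mean falls below `lower` or above `upper`.

    Assumes the sample mean is normal with known sigma.
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    # alpha: probability of rejecting H0 when the true mean is mu0
    alpha = Z.cdf((lower - mu0) / se) + (1 - Z.cdf((upper - mu0) / se))
    # beta: probability of failing to reject H0 when the true mean is mu_true
    beta = Z.cdf((upper - mu_true) / se) - Z.cdf((lower - mu_true) / se)
    return alpha, beta


# Hypothetical settings: H0 mean 8.0, actual mean 8.2, sigma 0.2
narrow = error_rates(8.0, 8.2, 0.2, 16, 7.90, 8.10)    # baseline limits
wide = error_rates(8.0, 8.2, 0.2, 16, 7.85, 8.15)      # wider acceptance region
larger_n = error_rates(8.0, 8.2, 0.2, 64, 7.90, 8.10)  # same limits, larger sample

for label, (a, b) in [("narrow", narrow), ("wide", wide), ("larger n", larger_n)]:
    print(f"{label:9s} alpha={a:.5f} beta={b:.5f}")
```

Under these assumptions, widening the acceptance region lowers α but raises β, while increasing the sample size with the limits held fixed lowers both.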
Example 6.6:
Suppose, two hypotheses in a statistical testing are:
Also, assume that, for a given sample, the population follows a normal distribution.
Threshold limits, say a′ and a″, are used to decide whether the sample mean is significantly different from a.
Thus the null hypothesis is to be rejected if the mean value is less than a′ or
greater than a″.
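[Figure: sampling distribution centered at a, with limits a′ and a″ on either side; the rejection region for H0, for a given value of α, lies outside these limits.]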
Thus, in a two-tailed test, there are two rejection regions (also known as critical
regions), one on each tail of the sampling distribution curve.
[Figure: Acceptance and rejection regions in case of a two-tailed test with 5% significance level. The acceptance region, centered at µH0, covers 95% of the area; reject H0 if the sample mean falls in either of the two rejection regions in the tails.]
Symbolically,
In a one-tailed test, there is only one rejection region, on the left tail (or the right tail).
[Figure: acceptance and rejection regions for the left-tailed and right-tailed tests; in each case the rejection region covers .05 of the area.]
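The boundaries of these rejection regions can be expressed as critical values of the standard normal distribution. The sketch below is an illustration added here, not taken from the lecture; it assumes a 5% significance level and a z-based test, and uses Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

alpha = 0.05      # assumed significance level (5%)
Z = NormalDist()  # standard normal distribution

# Two-tailed test: alpha is split equally between the two tails.
z_two_tailed = Z.inv_cdf(1 - alpha / 2)

# One-tailed tests: the whole of alpha sits in a single tail.
z_right_tailed = Z.inv_cdf(1 - alpha)
z_left_tailed = Z.inv_cdf(alpha)

print(f"two-tailed:   reject H0 if |z| > {z_two_tailed:.3f}")
print(f"right-tailed: reject H0 if z > {z_right_tailed:.3f}")
print(f"left-tailed:  reject H0 if z < {z_left_tailed:.3f}")
```

For the same α, the one-tailed critical value is closer to the centre than the two-tailed one, because the entire rejection area is placed in one tail.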
Assume a sample of size 16 with standard deviation 0.2, and that the sample
follows a normal distribution.
Suppose the null hypothesis is to be rejected if the mean value is less than 7.9 or greater
than 8.1. If X̄ is the sample mean, then the probability of a Type I error is
α = P(X̄ < 7.9 or X̄ > 8.1, given that H0 is true).
Given that the standard deviation of the sample is 0.2 and that the distribution is normal,
the standard error of the mean is 0.2/√16 = 0.05.
Thus,
and
Hence,
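The Type I error probability for this example can be checked numerically. This is a minimal sketch under one loud assumption: the hypothesized mean is taken to be 8.0, which is not stated in the surviving text and is inferred only from the rejection limits 7.9 and 8.1 being symmetric about 8. The sample size, standard deviation and limits come from the example.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 8.0                # ASSUMED null-hypothesis mean (inferred, not given in the text)
sigma, n = 0.2, 16       # standard deviation and sample size from the example
lower, upper = 7.9, 8.1  # rejection limits from the example

se = sigma / sqrt(n)     # standard error of the mean: 0.2 / 4 = 0.05

# Sampling distribution of the sample mean when H0 is true
sampling = NormalDist(mu0, se)

# Probability that the sample mean lands in either rejection region
alpha = sampling.cdf(lower) + (1 - sampling.cdf(upper))
print(f"standard error = {se:.3f}, alpha = {alpha:.4f}")
```

With these values the limits sit two standard errors from the assumed mean, so α is about 0.0455.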
Note: Since the draws are independent of each other, we can assume that the sampling distribution
follows a binomial probability distribution.
Thus, the probability of rejecting a true null hypothesis is That is, there is approximately
chance that the box B will be mislabeled as box A.
That is,
Now,
Hence,
That is, the probability of making error is over . This means that, if Box A is on the table, the
probability that we will be unable to detect it is .
CS 40003: Data Analytics
Case Study 1: Coffee Sale
A coffee vendor near Kharagpur railway station has had average
sales of 500 cups per day. Because a bus stand has been developed nearby, the vendor
expects sales to increase. During the first 12 days after the inauguration of
the bus stand, the daily sales were as follows:
550 570 490 615 505 580 570 460 600 580 530 526
On the basis of this sample information, can we conclude that the sales of coffee
have increased?
1. Specify H0 and H1, the null and alternate hypotheses, and an acceptable level of α.
550 570 490 615 505 580 570 460 600 580 530 526
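Since the sample size is small and the population standard deviation is not known, we
shall use the t-test, assuming a normal population. The test statistic is
t = (X̄ − µ) / (S/√n), with n − 1 degrees of freedom, where X̄ is the sample mean and S is the sample standard deviation.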
Hence,
Note:
A statistical table for the t-distribution gives a t-value given n, the degrees of freedom, and α,
the level of significance, and vice versa.
Step 3: Collect the sample data and calculate the test statistic
As H1 is one-tailed, we shall determine the rejection region by applying a one-tailed test in the right
tail (because H1 is of the "more than" type) at the chosen level of significance.
The observed value of t lies in the rejection region, and thus H0 is rejected at the chosen level of
significance.
We can conclude that the sample data indicate that coffee sales have increased.
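The calculation for this case study can be reproduced in a few lines. The sketch below uses the 12 daily sales figures given above and the one-sample t statistic. Two things are assumptions, since the surviving text does not state them: the hypotheses are taken as H0: µ = 500 against H1: µ > 500, and the significance level is taken as 5%, for which the standard t-table gives a right-tail critical value of 1.796 at 11 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Daily sales for the first 12 days after the bus stand opened
sales = [550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526]

mu0 = 500            # average daily sales under H0 (assumed H0: mu = 500)
n = len(sales)
x_bar = mean(sales)  # sample mean
s = stdev(sales)     # sample standard deviation (n - 1 in the denominator)

# One-sample t statistic with n - 1 degrees of freedom
t_stat = (x_bar - mu0) / (s / sqrt(n))

# Critical value from a t-table: right-tail area 0.05, 11 degrees of freedom.
# ASSUMPTION: alpha = 0.05 is not stated in the surviving text.
T_CRITICAL = 1.796

reject_h0 = t_stat > T_CRITICAL
print(f"mean = {x_bar}, s = {s:.2f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the sample mean is 548 and t is roughly 3.56, well inside the right-tail rejection region, which agrees with the conclusion that sales have increased.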
The hypotheses are given in terms of the population mean of medicine per tube.
Rejection region: G, which gives (obtained from standard normal calculation for two-
tailed test).
Step 3: Collect the sample data and calculate the test statistic
Sample results: , ,
Hence,
Since , we reject
1. ,
2. Reject if
4. , we fail to reject = 8
Note:
All these tests are based on the assumption of normality (i.e., the source of data is
considered to be normally distributed).
small sample(s)
population variance is not known (in this case, we use the variance of
the sample as an estimate of the population variance)
F-test: It is based on the F-distribution.
This test is also used in the context of analysis of variance (ANOVA) for
judging the significance of more than two sample means.
Case 2: Population normal, population finite, sample size may be large or small, and variance
is known.
Case 3: Population normal, population infinite, sample size is small and variance of the
population is unknown.
and
Non-Parametric tests
Do not make any assumptions
Assume only nominal or ordinal data
Note: Non-parametric tests need the entire population (or a very large sample size)
2. Give the expressions for z, t and in terms of population and sample parameters,
whichever is applicable to each. Explain the significance of these values in terms of the respective
distributions.
3. How can you obtain the value of, say, P(z = a)? What does this value signify?
4. On what occasions should you consider the z-distribution but not the t-distribution, and
vice versa?
5. Give a situation when you should consider distribution but neither z- nor t-
distribution.