
06 Statistical Inference

This document summarizes a lecture on statistical inference and hypothesis testing. It discusses the principles of statistical inference, the concept of hypotheses and how they are formulated, types of errors in hypothesis testing, and the basic procedures for conducting hypothesis testing. Examples are provided to illustrate key concepts like the null and alternative hypotheses, types of hypotheses tests, and the probabilities of making errors when testing hypotheses.

Uploaded by

Nirmala Shinde

Data Analytics

(CS40003)
Lecture #7

Statistical Inference

Dr. Debasis Samanta


Associate Professor
Department of Computer Science & Engineering
Quote of the day..

Live as if you were to die tomorrow. Learn as if you were to live forever.

MAHATMA GANDHI, Father of the Nation of India

CS 40003: Data Analytics 2


In this presentation…
Principle of Statistical Inference (SI)

Hypothesis in SI

Hypotheses testing procedures


Errors in hypothesis testing

Case Study 1: Coffee Sale

Case Study 2: Machine Testing

Summary of Sampling Distributions in Hypothesis Testing



Introduction
The primary objective of statistical analysis is to use data from a sample to make
inferences about the population from which the sample was drawn.

Population: what are the mean and variance ($\mu$, $\sigma^2$) of the GATE scores of students in the entire country?

Sample: the mean and variance ($\bar{X}$, $S^2$) of the GATE scores of all students of IIT-KGP.

This lecture introduces the basic procedures for making such inferences.


Basic Approaches
Approach 1: Hypothesis testing
• We conduct a test on a hypothesis.
• We hypothesize that one (or more) parameter(s) has (have) some specific value(s) or relationship.
• We make our decision about the parameter(s) based on one (or more) sample statistic(s).
• The accuracy of the decision is expressed as the probability that the decision is incorrect.

Approach 2: Confidence interval measurement
• We estimate one (or more) parameter(s) using sample statistics.
• This estimation is usually done in the form of an interval.
• The accuracy of the estimate is expressed as the level of confidence we have in the interval.
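
As a small illustration of Approach 2, the sketch below computes a two-sided confidence interval for a population mean when the population standard deviation is known. The function name `mean_confidence_interval` and all numbers are hypothetical, chosen only for illustration; they are not taken from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for a mean, assuming sigma is known."""
    # z value that leaves (1 - confidence) / 2 of the area in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical numbers, for illustration only
low, high = mean_confidence_interval(sample_mean=50.0, sigma=10.0, n=25)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```

A higher confidence level widens the interval, and a larger sample size narrows it.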


Hypothesis Testing

[Figure: statistical inference uses a sample to decide between the null hypothesis and the alternative hypothesis.]


Hypothesis Testing
What is a Hypothesis?
• “A hypothesis is an educated prediction that can be tested” (study.com).
• “A hypothesis is a proposed explanation for a phenomenon” (Wikipedia).
• “A hypothesis is used to define the relationship between two variables” (Oxford dictionary).
• “A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation” (Walpole).
• Example 6.1: Avogadro’s Hypothesis (1811)
  “The volume of a gas is directly proportional to the number of molecules of the gas.”
  $V = aN$


Statistical Hypothesis
• If the hypothesis is stated in terms of population parameters (such as mean and variance), the hypothesis is called a statistical hypothesis.
• Data from a sample (which may be an experiment) are used to test the validity of the hypothesis.
• A procedure that enables us to agree (or disagree) with the statistical hypothesis is called a test of the hypothesis.

Example 6.2:
1. Whether the wages of men and women are equal.

2. Whether a product in the market is of standard quality.

3. Whether a particular medicine is effective in curing a disease.


The Hypotheses
• The main purpose of statistical hypothesis testing is to choose between two competing hypotheses.

Example 6.3:
One hypothesis might claim that the wages of men and women are equal, while the alternative might claim that men make more than women.

• Hypothesis testing starts by making a set of two statements about the parameter(s) in question.
• The hypothesis actually to be tested is usually given the symbol $H_0$ and is commonly referred to as the null hypothesis.
• The other hypothesis, which is assumed to be true when the null hypothesis is false, is referred to as the alternative hypothesis and is often symbolized by $H_1$.
• The two hypotheses are exclusive and exhaustive.


The Hypotheses
Example 6.4:
The Ministry of Human Resource Development (MHRD), Government of India, took an initiative to improve the country’s human resources and hence set up 23 IITs in the country.

To measure the engineering aptitude of graduates, MHRD conducts the GATE examination, marked out of 1000, every year. A sample of 300 students who took the GATE examination in 2018 was collected, and the sample mean was observed to be 220.
In this context, statistical hypothesis testing is used to determine the mean mark of all GATE-2018 examinees.
The two hypotheses in this context are:


The Hypotheses
Note:
1. As null hypothesis, we could choose or

2. It is customary to always have the null hypothesis with an equal sign.

3. For the alternative hypothesis, there are many options available to us.

Examples 6.5:
I.

4. The two hypotheses should be chosen in such a way that they are exclusive and exhaustive.
• One or the other must be true, but they cannot both be true.


The Hypotheses
One-tailed test
• A statistical test in which the alternative hypothesis specifies that the population parameter lies entirely above or entirely below the value specified in $H_0$ is called a one-sided (or one-tailed) test.

Example.

Two-tailed test
• A test whose alternative hypothesis specifies that the parameter can lie on either side of the value specified by $H_0$ is called a two-sided (or two-tailed) test.

Example.
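
To make the difference between one-tailed and two-tailed tests concrete, the following sketch computes the critical values of a standard normal test statistic for each kind of test. It assumes a z-based test; the function name `critical_values` and the choice of significance level 0.05 are illustrative assumptions.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Critical z value(s) for a 'left', 'right' or 'two' tailed test."""
    std = NormalDist()  # standard normal distribution
    if tail == "two":
        z = std.inv_cdf(1 - alpha / 2)  # alpha is split between both tails
        return (-z, z)
    if tail == "right":
        return (std.inv_cdf(1 - alpha),)  # all of alpha in the right tail
    if tail == "left":
        return (std.inv_cdf(alpha),)  # all of alpha in the left tail
    raise ValueError("tail must be 'left', 'right' or 'two'")


for tail in ("left", "right", "two"):
    values = ", ".join(f"{z:.3f}" for z in critical_values(0.05, tail))
    print(f"{tail}-tailed: {values}")
```

At the same significance level, the two-tailed critical value (about 1.96) is farther from zero than the one-tailed value (about 1.645), because the rejection probability is split between two tails.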


The Hypotheses
Note:

In fact, a 1-tailed test such as:

is same as

In essence, it does not imply that , etc.



Hypothesis Testing Procedures
The following five steps are followed when testing a hypothesis:

1. Specify $H_0$ and $H_1$, the null and alternative hypotheses, and an acceptable level of $\alpha$.

2. Determine an appropriate sample-based test statistic and the rejection region for the specified $H_0$.

3. Collect the sample data and calculate the test statistic.

4. Make a decision to either reject or fail to reject $H_0$.

5. Interpret the result in common language suitable for practitioners.


Hypothesis Testing Procedure
• In summary, we have to choose between $H_0$ and $H_1$.
• The standard procedure is to assume $H_0$ is true. (Just as we presume a person innocent until proven guilty.)
• Using a statistical test, we try to determine whether there is sufficient evidence to declare $H_0$ false.
• We reject $H_0$ only when the observed sample would be very unlikely if $H_0$ were true.
• The procedure is based on probability theory; that is, there is a chance that we can make errors.


Errors in Hypothesis Testing

In hypothesis testing, there are two types of errors.

Type I error: A type I error occurs when we incorrectly reject $H_0$ (i.e., we reject the null hypothesis when $H_0$ is true).

Type II error: A type II error occurs when we incorrectly fail to reject $H_0$ (i.e., we accept $H_0$ when it is not true).


Probabilities of Making Errors
Type I error calculation
$\alpha$: denotes the probability of making a Type I error
$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$

Type II error calculation
$\beta$: denotes the probability of making a Type II error
$\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$

Note:
• $\alpha$ and $\beta$ are not independent of each other: for a fixed sample size, as one increases, the other decreases.
• When the sample size increases, both $\alpha$ and $\beta$ tend to decrease, since sampling error is reduced.
• In general, we focus on Type I error, but Type II error is also important, particularly when the sample size is small.
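
The trade-off between the two error probabilities can be checked numerically. The sketch below assumes a normal population with known standard deviation and a decision rule of the form "reject the null hypothesis if the sample mean falls outside mu0 ± delta". The function name `error_probabilities`, the true mean `mu1 = 8.15` and the other numbers are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def error_probabilities(mu0, mu1, sigma, n, delta):
    """Type I and Type II error probabilities for the rule:
    reject H0 (mean = mu0) if the sample mean is outside mu0 +/- delta.
    mu1 is the assumed true mean when H0 is false."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    under_h0 = NormalDist(mu0, se)
    under_h1 = NormalDist(mu1, se)
    # Type I: sample mean lands outside the band although H0 is true
    alpha = under_h0.cdf(mu0 - delta) + (1 - under_h0.cdf(mu0 + delta))
    # Type II: sample mean lands inside the band although the mean is mu1
    beta = under_h1.cdf(mu0 + delta) - under_h1.cdf(mu0 - delta)
    return alpha, beta


# Hypothetical settings: (n, delta)
for n, delta in ((16, 0.10), (16, 0.15), (64, 0.10)):
    alpha, beta = error_probabilities(8.0, 8.15, 0.2, n, delta)
    print(f"n={n:2d} delta={delta:.2f} alpha={alpha:.4f} beta={beta:.4f}")
```

With the sample size held at 16, widening the band lowers the Type I error probability and raises the Type II error probability. Raising the sample size to 64 with the same band lowers both.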


Calculating $\alpha$
Assume that we have the results of a random sample. We can then use the characteristics of the sampling distribution to calculate the probabilities of making either a Type I or a Type II error.

Example 6.6:
Suppose the two hypotheses in a statistical test are:
$H_0: \mu = a$
$H_1: \mu \neq a$

Also, assume that the population follows a normal distribution. A threshold limit, say $\delta$, is used to decide whether the sample mean is significantly different from $a$.


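Calculating $\alpha$

[Figure: normal sampling distribution centered at $a$, with the two tails beyond $a - \delta$ and $a + \delta$ shaded.]

Here, the shaded region represents the probability that $\bar{X} < a - \delta$ or $\bar{X} > a + \delta$.

Thus, the null hypothesis is to be rejected if the mean value is less than $a - \delta$ or greater than $a + \delta$.

If $\bar{X}$ denotes the sample mean, then the probability of a Type I error is
$\alpha = P(\bar{X} < a - \delta \text{ or } \bar{X} > a + \delta, \text{ when } \mu = a)$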


The Rejection Region
The rejection region comprises the values of the test statistic for which
1. The probability when the null hypothesis is true is less than or equal to the specified $\alpha$.
2. The probabilities when $H_1$ is true are greater than they are under $H_0$.

[Figure: rejection region for $H_0$ for a given value of $\alpha$. Reject $H_0$ ($\mu \neq a$) below $a'$ and above $a''$; do not reject $H_0$ ($\mu = a$) between $a'$ and $a''$.]


Two-Tailed Test
For a two-tailed hypothesis test, the hypotheses take the form
$H_0: \mu = \mu_{H_0}$
$H_1: \mu \neq \mu_{H_0}$

In other words, to reject the null hypothesis, the sample mean must be significantly lower or significantly higher than $\mu_{H_0}$ under a given $\alpha$.

Thus, in a two-tailed test, there are two rejection regions (also known as critical regions), one on each tail of the sampling distribution curve.


Two-Tailed Test
[Figure: Acceptance and rejection regions in case of a two-tailed test with 5% significance level. Acceptance region: the central 95% of the area around $\mu_{H_0}$; accept $H_0$ if the sample mean falls in this region. Rejection regions: 0.025 of the area in each tail; reject $H_0$ if the sample mean falls in either of these regions.]


One-Tailed Test
A one-tailed test is used when we want to test, say, whether the population mean is either lower or higher than the hypothesized value.

Symbolically,
$H_0: \mu = \mu_{H_0}$
$H_1: \mu < \mu_{H_0}$ (or $H_1: \mu > \mu_{H_0}$)

In this case there is only one rejection region, on the left tail (or the right tail).

[Figure: left-tailed test and right-tailed test, each with an acceptance region and a single rejection region of 0.05 of the area.]


Example 6.7: Calculating $\alpha$
Consider the following two hypotheses.

The null hypothesis is $H_0: \mu = 8$

The alternative hypothesis is $H_1: \mu \neq 8$

Assume a sample of size 16, a standard deviation of 0.2, and that the sample follows a normal distribution.


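Example 6.7: Calculating $\alpha$
We can decide the rejection region as follows.

Suppose the null hypothesis is to be rejected if the mean value is less than 7.9 or greater than 8.1. If $\bar{X}$ is the sample mean, then the probability of a Type I error is
$\alpha = P(\bar{X} < 7.9 \text{ when } \mu = 8) + P(\bar{X} > 8.1 \text{ when } \mu = 8)$

We are given that the standard deviation is 0.2 and that the distribution is normal.
Thus,
$P(\bar{X} < 7.9) = P\left(Z < \frac{7.9 - 8}{0.2/\sqrt{16}}\right) = P(Z < -2.0) = 0.0228$

and
$P(\bar{X} > 8.1) = P\left(Z > \frac{8.1 - 8}{0.2/\sqrt{16}}\right) = P(Z > 2.0) = 0.0228$

Hence,
$\alpha = 0.0228 + 0.0228 = 0.0456$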
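
The Type I error probability of Example 6.7 can be checked with a few lines of Python. This is a minimal sketch that assumes the null value of the mean is 8 (the midpoint of the cut-offs 7.9 and 8.1) and treats 0.2 as the known standard deviation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 8.0, 0.2, 16      # assumed null mean, standard deviation, sample size
se = sigma / sqrt(n)              # standard error of the sample mean

z_low = (7.9 - mu0) / se          # standardized lower cut-off, about -2
z_high = (8.1 - mu0) / se         # standardized upper cut-off, about +2

std = NormalDist()                # standard normal distribution
alpha = std.cdf(z_low) + (1 - std.cdf(z_high))
print(f"alpha = {alpha:.4f}")
```

The unrounded result is about 0.0455. Adding two four-digit table values, 0.0228 + 0.0228, gives 0.0456; the small difference is only rounding.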


Example 6.8: Calculating $\alpha$ and $\beta$
There are two identical-looking boxes of chocolates. Box A contains 60 red and 40 black chocolates, whereas Box B contains 40 red and 60 black chocolates. There is no label on either box. One box is placed on the table. We are to test the hypothesis that “Box B is on the table”.

To test the hypothesis, an experiment is planned as follows:
• Draw five chocolates at random from the box.
• Replace each chocolate before selecting a new one.
• The number of red chocolates drawn is taken as the sample statistic.

Note: Since each draw is independent of the others, we can assume the sample statistic follows a binomial probability distribution.


Example 6.8: Calculating $\alpha$
Let us express the population parameter as $p$ = the proportion of red chocolates in the box.
The hypotheses of the problem can be stated as:
$H_0: p = 0.4$ // Box B is on the table
$H_1: p = 0.6$ // Box A is on the table
Calculating $\alpha$
In this example, the null hypothesis specifies that the probability of drawing a red chocolate is 0.4. This means that a lower proportion of red chocolates in the observations favors the null hypothesis. In other words, we take drawing all red chocolates as sufficient evidence to reject the null hypothesis. Then, the probability of making a Type I error is the probability of getting five red chocolates in a sample of five from Box B. That is,
$\alpha = P(X = 5 \text{ when } p = 0.4)$

Using the binomial distribution,
$\alpha = \binom{5}{5}(0.4)^5(0.6)^0 = 0.01024$

Thus, the probability of rejecting a true null hypothesis is approximately 0.01. That is, there is approximately a 1% chance that Box B will be mislabeled as Box A.


Example 6.8: Calculating $\beta$
A Type II error occurs if we fail to reject the null hypothesis when it is not true. In the current illustration, such a situation occurs if Box A is on the table but we did not get the five red chocolates required to reject the hypothesis that Box B is on the table.
The probability of a Type II error is then the probability of getting four or fewer red chocolates in a sample of five from Box A.
That is,
$\beta = P(X \le 4 \text{ when } p = 0.6)$

Using the probability rule:
$P(X = 0) + P(X = 1) + \dots + P(X = 5) = 1$

That is, $P(X \le 4) = 1 - P(X = 5)$
Now, $P(X = 5 \text{ when } p = 0.6) = \binom{5}{5}(0.6)^5(0.4)^0 = 0.07776$
Hence, $\beta = 1 - 0.07776 = 0.92224$

That is, the probability of making a Type II error is over 92%. This means that, if Box A is on the table, the probability that we will be unable to detect it is 0.92.
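
Both error probabilities in the chocolate-box example follow from the binomial distribution and can be verified directly. The sketch below assumes the decision rule "reject the hypothesis that Box B is on the table only if all five drawn chocolates are red"; the helper name `binom_pmf` is illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_draws = 5
p_red_box_b = 0.4   # 40 red out of 100 chocolates in Box B
p_red_box_a = 0.6   # 60 red out of 100 chocolates in Box A

# Type I error: five reds although Box B is on the table
alpha = binom_pmf(5, n_draws, p_red_box_b)

# Type II error: four or fewer reds although Box A is on the table
beta = sum(binom_pmf(k, n_draws, p_red_box_a) for k in range(5))

print(f"alpha = {alpha:.5f}, beta = {beta:.5f}")
```

The very small Type I error probability is paid for with a very large Type II error probability, which is typical of a small sample combined with a strict rejection rule.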
Case Study 1: Coffee Sale
A coffee vendor near Kharagpur railway station has had average sales of 500 cups per day. Because of the development of a bus stand nearby, the vendor expects sales to increase. During the first 12 days after the inauguration of the bus stand, the daily sales were as follows:

550 570 490 615 505 580 570 460 600 580 530 526

On the basis of this sample information, can we conclude that the sales of coffee have increased?

Use a 5% level of significance.


Hypothesis Testing : 5 Steps
The following five steps are followed when testing a hypothesis:

1. Specify $H_0$ and $H_1$, the null and alternative hypotheses, and an acceptable level of $\alpha$.

2. Determine an appropriate sample-based test statistic and the rejection region for the specified $H_0$.

3. Collect the sample data and calculate the test statistic.

4. Make a decision to either reject or fail to reject $H_0$.

5. Interpret the result in common language suitable for practitioners.


Case Study 1: Step 1
Step 1: Specification of hypotheses and acceptable level of $\alpha$

Let us consider the hypotheses for the given problem as follows.

$H_0: \mu = 500$ cups per day
$H_1: \mu > 500$ cups per day

The null hypothesis is that sales average 500 cups per day and have not increased.

The alternative hypothesis is that the sales have increased.

The given significance level is $\alpha = 0.05$.


Case Study 1: Step 2
Step 2: Sample-based test statistic and the rejection region for the specified $H_0$

The sample is given as

550 570 490 615 505 580 570 460 600 580 530 526

Since the sample size is small and the population standard deviation is not known, we shall use the t-test, assuming a normal population. The test statistic is
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

To find $\bar{X}$ and $S$, we make the following computations.


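Case Study 1: Step 2
$\bar{X} = \frac{\sum X_i}{n} = \frac{6576}{12} = 548$

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{23978}{11}} \approx 46.69$

Hence,
$t = \frac{548 - 500}{46.69/\sqrt{12}} \approx 3.56$

Note:
A statistical table for the t-distribution gives a t-value for a given $n$ (the degrees of freedom) and $\alpha$ (the level of significance), and vice versa.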
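
The computations for the coffee-sale sample can be reproduced with Python's standard library. This sketch uses the twelve daily sales figures from the problem statement and the hypothesized mean of 500 cups per day; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Daily sales for the 12 days after the bus stand opened
sales = [550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526]
mu0 = 500                      # average daily sales before the bus stand

n = len(sales)
x_bar = mean(sales)            # sample mean
s = stdev(sales)               # sample standard deviation (n - 1 denominator)
t_stat = (x_bar - mu0) / (s / sqrt(n))

print(f"n = {n}, mean = {x_bar}, S = {s:.2f}, t = {t_stat:.2f}")
```

The statistic has n - 1 = 11 degrees of freedom, which is the row to look up in a t-table.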


Case Study 1: Step 3

Step 3: Collect the sample data and calculate the test statistic

As $H_1$ is one-tailed, we determine the rejection region in the right tail (because $H_1$ is of the "more than" type, $\mu > 500$) at the 5% level of significance.

Using the table of the t-distribution for 11 degrees of freedom at the 5% level of significance, the rejection region is
$R: t > 1.796$


Case Study 1: Step 4
Step 4: Make a decision to either reject or fail to reject $H_0$

The observed value of $t \approx 3.56$ lies in the rejection region ($t > 1.796$), and thus $H_0$ is rejected at the 5% level of significance.


Case Study 1: Step 5
Step 5: Final comment and interpret the result

We can conclude that the sample data indicate that coffee sales have increased.



Case Study 2: Machine Testing
A medicine production company packages medicine in tubes of 8 ml. To maintain control of the amount of medicine in the tubes, they use a machine. To monitor this control, a sample of 16 tubes is taken from the production line at random time intervals and their contents are measured precisely. The mean amount of medicine in these 16 tubes will be used to test the hypothesis that the machine is indeed working properly.


Case Study 2: Step 1
Step 1: Specification of hypotheses and acceptable level of $\alpha$

The hypotheses are given in terms of the population mean $\mu$ of medicine per tube.

The null hypothesis is $H_0: \mu = 8$

The alternative hypothesis is $H_1: \mu \neq 8$

We take the significance level of our hypothesis test to be $\alpha = 0.05$.

(This means that the probability of deciding that the machine needs to be adjusted when it is in fact working properly is at most 5%.)


Case Study 2: Step 2
Step 2: Sample-based test statistic and the rejection region for the specified $H_0$

Test statistic: $z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Rejection region: given $\alpha = 0.05$, we reject $H_0$ if $|z| > 1.96$ (obtained from the standard normal distribution for a two-tailed test).


Case Study 2: Step 3

Step 3: Collect the sample data and calculate the test statistic
Sample results: $n = 16$, $\bar{X} = 7.89$, $\sigma = 0.2$

With the sample, the test statistic is
$z = \frac{7.89 - 8}{0.2/\sqrt{16}} = -2.20$

Hence, $|z| = 2.20 > 1.96$


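Case Study 2: Step 4

Step 4: Make a decision to either reject or fail to reject $H_0$

[Figure: standard normal curve with critical values at -1.96 and 1.96; the observed values -2.20 and 2.20 lie in the rejection regions.]

Since $|z| = 2.20 > 1.96$, we reject $H_0$.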


Case Study 2: Step 5
Step 5: Final comment and interpret the result

We conclude and recommend that the machine be adjusted.



Case Study 2: Alternative Test
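Suppose that, in our initial setup of the hypothesis test, we choose $\alpha = 0.01$ instead of 0.05. Then the test can be summarized as:

1. $H_0: \mu = 8$, $H_1: \mu \neq 8$, $\alpha = 0.01$

2. Reject $H_0$ if $|z| > 2.576$

3. Sample result: $n = 16$, $\sigma = 0.2$, $\bar{X} = 7.89$, $z = \frac{7.89 - 8}{0.2/\sqrt{16}} = -2.20$

4. Since $|z| = 2.20 < 2.576$, we fail to reject $H_0: \mu = 8$

5. We do not recommend that the machine be readjusted.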
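
The effect of the significance level on the decision in the machine-testing case can be sketched as follows. The code assumes a two-tailed z-test with known standard deviation 0.2, sample size 16 and sample mean 7.89, and it takes 0.01 as the alternative significance level (an assumption here). The function name `two_tailed_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(x_bar, mu0, sigma, n, alpha):
    """Return the z statistic, the critical value and the reject decision."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
    return z, z_crit, abs(z) > z_crit


for alpha in (0.05, 0.01):
    z, z_crit, reject = two_tailed_z_test(7.89, 8.0, 0.2, 16, alpha)
    decision = "reject H0" if reject else "fail to reject H0"
    print(f"alpha={alpha}: z={z:.2f}, critical=±{z_crit:.3f} -> {decision}")
```

The same sample leads to opposite decisions at the two levels, which is why the significance level has to be fixed before the data are examined.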


Hypothesis Testing Strategies
• Hypothesis testing determines the validity of an assumption (technically described as the null hypothesis), with a view to choosing between two conflicting hypotheses about the value of a population parameter.
• There are two types of tests of hypotheses:
  • Non-parametric tests (also called distribution-free tests of hypotheses)
  • Parametric tests (also called standard tests of hypotheses)


Parametric Tests : Applications
• Parametric tests usually assume certain properties of the population from which we draw samples:
  • Observations come from a normal population.
  • Sample size is small.
  • Assumptions about population parameters like mean, variance, etc. hold good.
  • Measurement is equivalent to interval-scaled data.


Parametric Tests
Important Parametric Tests
The widely used sampling distributions for parametric tests are
• $z$-test
• $t$-test
• $\chi^2$-test
• $F$-test

Note:
All these tests are based on the assumption of normality (i.e., the source of data is considered to be normally distributed).


Parametric Tests : Z-test
$z$-test: This is the most frequently used test in statistical analysis.
• It is based on the normal probability distribution.
• It is used for judging the significance of several statistical measures, particularly the mean.
• It is used even when the binomial distribution or the t-distribution is applicable, on the condition that such a distribution tends to the normal distribution when n becomes large.
• Typically, it is used for comparing the mean of a sample to some hypothesized mean for the population in the case of a large sample, or when the population variance is known.


Parametric Tests : t-test
$t$-test: It is based on the t-distribution.
• It is considered an appropriate test for judging the significance of a sample mean, or for judging the significance of the difference between the means of two samples, in the case of
  • small sample(s)
  • population variance not known (in this case, we use the variance of the sample as an estimate of the population variance)


Parametric Tests : $\chi^2$-test
$\chi^2$-test: It is based on the Chi-squared distribution.
• It is used for comparing a sample variance to a theoretical population variance.
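
For comparing a sample variance with a hypothesized population variance, the usual statistic is (n - 1)·S² divided by the hypothesized variance, referred to a Chi-squared distribution with n - 1 degrees of freedom. The sketch below only computes the statistic; the function name `chi_square_statistic`, the sample and the hypothesized variance are hypothetical values for illustration.

```python
from statistics import variance


def chi_square_statistic(sample, sigma0_sq):
    """(n - 1) * S^2 / sigma0^2 for a hypothesized population variance."""
    n = len(sample)
    return (n - 1) * variance(sample) / sigma0_sq


# Hypothetical data and hypothesized variance, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
chi2 = chi_square_statistic(sample, sigma0_sq=4.0)
print(f"chi-square = {chi2:.2f} with {len(sample) - 1} degrees of freedom")
```

The computed value would then be compared with the critical value from a Chi-squared table for the chosen significance level.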


Parametric Tests : $F$-test
$F$-test: It is based on the F-distribution.
• It is used to compare the variances of two independent samples.
• This test is also used in the context of analysis of variance (ANOVA) for judging the significance of differences among more than two sample means.
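
The variance-ratio statistic behind the comparison of two sample variances can be sketched as below. It follows the common convention of placing the larger sample variance in the numerator; the function name `f_statistic` and both samples are hypothetical.

```python
from statistics import variance


def f_statistic(sample_1, sample_2):
    """Ratio of sample variances (larger over smaller) with its
    numerator and denominator degrees of freedom."""
    v1, v2 = variance(sample_1), variance(sample_2)
    if v1 >= v2:
        return v1 / v2, len(sample_1) - 1, len(sample_2) - 1
    return v2 / v1, len(sample_2) - 1, len(sample_1) - 1


# Hypothetical samples, for illustration only
f, df_num, df_den = f_statistic([1, 2, 3, 4, 5], [2, 4, 4, 4, 5, 5, 7, 9])
print(f"F = {f:.3f} with ({df_num}, {df_den}) degrees of freedom")
```

The ratio is then compared with the critical value from an F-table for these degrees of freedom.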


Hypothesis Testing : Assumptions
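Case 1: Population normal, population infinite, sample size may be large or small, variance of the population is known.
$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Case 2: Population normal, population finite, sample size may be large or small, variance of the population is known.
$z = \frac{\bar{X} - \mu}{(\sigma/\sqrt{n})\sqrt{(N - n)/(N - 1)}}$, where $N$ is the population size

Case 3: Population normal, population infinite, sample size is small and variance of the population is unknown.
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ with $n - 1$ degrees of freedom

and
$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$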


Hypothesis Testing
Case 4: Population normal, population finite, sample size is small and variance of the population is unknown.
$t = \frac{\bar{X} - \mu}{(S/\sqrt{n})\sqrt{(N - n)/(N - 1)}}$ with $n - 1$ degrees of freedom, where $N$ is the population size

Note: If the variance of the population is known, replace $S$ by $\sigma$.
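
When the population is finite, the standard error of the sample mean is usually multiplied by the finite population correction, the square root of (N - n)/(N - 1). The sketch below shows how this changes a test statistic. The function names, the population size of 100 and the other numbers are hypothetical.

```python
from math import sqrt


def standard_error(sd, n, population_size=None):
    """Standard error of the mean, with the finite population
    correction applied when a population size is given."""
    se = sd / sqrt(n)
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se


def mean_test_statistic(x_bar, mu0, sd, n, population_size=None):
    """(x_bar - mu0) / standard error; read as z if sd is the known
    population value, or as t if sd is estimated from the sample."""
    return (x_bar - mu0) / standard_error(sd, n, population_size)


# Hypothetical numbers, for illustration only
infinite = mean_test_statistic(7.89, 8.0, 0.2, 16)
finite = mean_test_statistic(7.89, 8.0, 0.2, 16, population_size=100)
print(f"infinite population: {infinite:.3f}, finite population: {finite:.3f}")
```

The correction shrinks the standard error, so the same sample gives a statistic farther from zero when it makes up a noticeable share of a finite population.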


Hypothesis Testing : Non-Parametric Test

Non-Parametric tests
• Do not depend on any assumption about the population distribution
• Assume only nominal or ordinal data

Note: Non-parametric tests need the entire population (or a very large sample size)


Reference

The detailed material related to this lecture can be found in

Probability and Statistics for Engineers and Scientists (8th Ed.) by Ronald E. Walpole, Sharon L. Myers, Keying Ye (Pearson), 2013.


Any question?

You may post your question(s) at the “Discussion Forum” maintained on the course Web page!


Questions of the day…
1. In hypothesis testing, suppose $H_0$ is rejected. Does it mean that $H_1$ is accepted? Justify your answer.

2. Give the expressions for $z$, $t$ and $\chi^2$ in terms of population and sample parameters, whichever is applicable to each. Explain what these values signify in terms of the respective distributions.

3. How can you obtain a value such as $P(z = a)$? What does this value signify?

4. On what occasions should you consider the z-distribution but not the t-distribution, and vice versa?

5. Give a situation in which you should consider the $\chi^2$ distribution but neither the z- nor the t-distribution.