Module 4 (301 SI-2)


RV Institute of Technology & Management ®

MODULE-IV

STATISTICAL INFERENCE-2

Topic Learning Objectives:

Upon Completion of this module, student will be able to:

• Solve problems on probability distribution functions of two variables.


• Use statistical methodology and tools in the engineering problem-solving process
• Compute the confidence intervals for the mean of the population.

Sampling Variables

Sampling variables refers to the process of selecting data points or observations from a larger
population or dataset for the purpose of analysis, experimentation, or research. It is a fundamental
concept in statistics and data analysis. When you sample data, you are essentially taking a subset of
the entire population to draw conclusions or make inferences about the entire population. Here are
some key points related to sampling variables:

Population: The population refers to the entire set of individuals, items, or data points that you are
interested in studying. It is often impractical or impossible to study an entire population, so you sample
from it.

Sample: A sample is a subset of the population. It consists of a smaller number of data points or
observations that are chosen in a way that they represent the larger population to some extent.

Sampling Methods: There are various methods for sampling data, including simple random
sampling (each data point has an equal chance of being selected), stratified sampling (dividing the
population into subgroups and then sampling from each subgroup), systematic sampling (selecting
every nth data point), and more.

Sampling Error: When you take a sample from a population, there is a chance that the sample may
not perfectly represent the population. This difference between the sample and the population is
called sampling error.

Parameter and Statistic: In statistical analysis, a parameter is a characteristic of the population,
while a statistic is a characteristic of the sample.

For example, the mean of a population is a parameter, while the mean of a sample is a statistic.

III-Semester, Mathematics for Computer Science(MCS) (BCS301)



Sample Size: The number of data points or observations you include in your sample is known as
the sample size. A larger sample size generally provides more accurate estimates of population
parameters.

Central limit theorem:

The theorem which explains this sort of relationship between the shape of the population
distribution and the sampling distribution of the mean is known as the central limit theorem.
This theorem is by far the most important theorem in statistical inference. It assures that the
sampling distribution of the mean approaches normal distribution as the sample size increases.
In formal terms, the central limit theorem states that the distribution of the means
of random samples taken from a population having mean $\mu$ and finite variance $\sigma^2$
approaches the normal distribution with mean $\mu$ and variance $\sigma^2/n$ as $n$ goes to infinity.

If $\bar{x}$ is the mean of a random sample of size $n$ taken from a population with mean $\mu$ and
finite variance $\sigma^2$, then the limiting form of the distribution of

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

as $n \to \infty$ is the standard normal distribution $N(Z; 0, 1)$.

The significance of the central limit theorem lies in the fact that it permits us to use sample
statistics to make inferences about population parameters without knowing anything about the
shape of the frequency distribution of that population other than what we can get from the
sample.
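
The statement of the theorem can be checked numerically. The sketch below is only an illustration under stated assumptions: the population is taken to be Uniform(0, 1) (mean 1/2, variance 1/12), and the sample size, number of trials and seed are arbitrary choices that do not come from these notes.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Return `trials` means of samples of size n drawn from Uniform(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean([rng.random() for _ in range(n)])
            for _ in range(trials)]

# Assumed population: Uniform(0, 1), so mu = 1/2 and sigma^2 = 1/12.
n = 30
means = sample_means(n, trials=5000)

print("mean of the sample means    :", statistics.fmean(means))
print("variance of the sample means:", statistics.pvariance(means))
print("sigma^2 / n                 :", (1 / 12) / n)
```

The mean of the sample means should come out close to $\mu$ and their variance close to $\sigma^2/n$, even though the parent population is not normal.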

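Confidence limits for an unknown mean

Let $\bar{x}$ be the sample mean, $s$ the sample standard deviation, and $n$ the size of the sample. Then the
interval estimate of the population mean $\mu$ is given by

$$\bar{x} \pm t_\alpha \frac{s}{\sqrt{n}}$$

where $t_\alpha$ is the tabulated value of $t$ for $n - 1$ degrees of freedom.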

Problems:
1. A sample of size 9 from a normal population gave $\bar{x} = 15.8$ and $s^2 = 10.3$. Find a 99%
confidence interval for the population mean.

Solution: Given $\bar{x} = 15.8$, $s^2 = 10.3$ and $n = 9$.

Degrees of freedom $= n - 1 = 8$.

Also $t_\alpha = t_{0.01}$ for 8 d.f. $= 3.36$.

The 99% confidence limits for the population mean $\mu$ are

$$\bar{x} \pm t_{0.01}\frac{s}{\sqrt{n}} = 15.8 \pm 3.36\sqrt{\frac{10.3}{9}} = 15.8 \pm 3.5945,$$

that is, 12.2055 and 19.3945.
Hence the 99% confidence interval is [12.2055, 19.3945].

2. A random sample of 15 observations has a mean of 20 and a standard deviation of 3.5.
Determine the confidence interval for the population mean at the 95% confidence level.
Solution: Given $\bar{x} = 20$, $s = 3.5$ and $n = 15$.
Degrees of freedom $= n - 1 = 14$.
Also $t_\alpha = t_{0.05}$ for 14 d.f. $= 2.145$.
The 95% confidence limits for the population mean $\mu$ are

$$\bar{x} \pm t_{0.05}\frac{s}{\sqrt{n}} = 20 \pm 2.145\sqrt{\frac{3.5^2}{15}} = 20 \pm 1.94,$$

that is, 18.06 and 21.94.
Hence the 95% confidence interval is [18.06, 21.94].
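
The two interval computations above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `t_confidence_interval` is a hypothetical helper introduced here, and the critical values 3.36 and 2.145 are simply the table values quoted in the problems, passed in by hand.

```python
import math

def t_confidence_interval(mean, s, n, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# Problem 1: s^2 = 10.3, n = 9, table value 3.36 for 8 d.f.
lo1, hi1 = t_confidence_interval(15.8, math.sqrt(10.3), 9, 3.36)

# Problem 2: s = 3.5, n = 15, table value 2.145 for 14 d.f.
lo2, hi2 = t_confidence_interval(20.0, 3.5, 15, 2.145)

print(round(lo1, 4), round(hi1, 4))
print(round(lo2, 2), round(hi2, 2))
```

Passing the critical value as an argument keeps the sketch free of any statistics library; in practice it would be read from a t table or computed by a statistics package.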

A discussion on tests of significance for small samples


So far, the problem of testing a hypothesis about a population parameter was based on the
assumption that the sample drawn from the population is large (more than 30 observations) and
that the sampling distribution is normal. However, when the sample is small (say, fewer than 30
observations), the tests considered above are not suitable, because the assumptions on which they
are based generally do not hold for small samples. In particular, one cannot assume that the
sampling distribution is normal, or that the values given by the sample data are sufficiently
close to the population values to be used in their place when calculating the standard error.
Alternative methods are therefore needed for problems with relatively small samples, which
arise frequently in practice. With this in view, we now discuss such methods in detail.
Here, too, the problem is about testing a statement about a population parameter, i.e.
ascertaining whether the observed values could have arisen by sampling fluctuations from some
value given in advance. For example, if a sample of 15 gives a correlation coefficient of +0.4,
we shall be interested not so much in the value of the correlation in the parent population as
in whether this value could have come from an uncorrelated population, i.e. whether it
is significant in the parent population. It is widely accepted that when we work with small
samples, estimates will vary from sample to sample.
Further, in the theory of small samples, we begin by assuming that the parent population
is normally distributed unless otherwise stated. Strictly, whatever decision one takes in a
hypothesis testing problem is valid only for normal populations. William Gosset and Sir R.
A. Fisher contributed a great deal to the theory of small samples. Gosset published his
findings in 1908 under the pen name “Student”. He gave a test popularly known as the “t
– test”, and Fisher gave another test known as the “z – test”. These tests are based on the
“t – distribution” and the “z – distribution”.

Test of Significance for means of two small samples by Student’s t - distribution


Procedure for testing the significance of the difference between the means of two small samples:
1. Null Hypothesis: $H_0: \mu_1 = \mu_2$, i.e. there is no significant difference in the means.
Alternate Hypothesis: $H_1: \mu_1 \neq \mu_2$
2. Calculation of Test Statistic.
Estimated standard deviation:

$$S = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}}$$

Test Statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

3. Level of significance: Take the level of significance $\alpha = 0.05$ if $\alpha$ is not known. The
degrees of freedom are $n_1 + n_2 - 2$.
4. Decision: Accept $H_0$ if computed $|t| \le$ tabled $t_\alpha$.
Reject $H_0$ if computed $|t| >$ tabled $t_\alpha$.
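
The procedure can be sketched in Python as follows. Two assumptions are made loudly here: the pooled estimate uses $n_1 + n_2 - 2$ in the denominator, which is the form that reproduces the values of $S$ in the worked problems of this section, and the function names `pooled_t` and `decide` are hypothetical helpers, not part of any library. The figures are those of the two-machines problem (n = 25 each, means 200 and 250, standard deviations 20 and 25, table value 2.58).

```python
import math

def pooled_t(n1, mean1, s1, n2, mean2, s2):
    """Return (S, t, dof) from the sizes, means and standard deviations."""
    dof = n1 + n2 - 2                      # assumed denominator (see lead-in)
    S = math.sqrt((n1 * s1**2 + n2 * s2**2) / dof)
    t = (mean1 - mean2) / (S * math.sqrt(1 / n1 + 1 / n2))
    return S, t, dof

def decide(t, t_tab):
    """Two-sided decision rule: compare |t| with the tabulated value."""
    return "reject H0" if abs(t) > t_tab else "accept H0"

S, t, dof = pooled_t(25, 200, 20, 25, 250, 25)
print(round(S, 2), round(t, 2), dof, decide(t, 2.58))
```

The tabulated value is supplied by hand, as in the notes, so the sketch stays within the standard library.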


Problems:
1. The average numbers of articles produced per day by two machines are 200 and 250, with
standard deviations 20 and 25 respectively, on the basis of records of 25 days’ production.
Can you regard both machines as equally efficient at the 1% level of significance?
Solution: Given $n_1 = 25$, $\bar{x}_1 = 200$, $s_1 = 20$
$n_2 = 25$, $\bar{x}_2 = 250$, $s_2 = 25$
Assume Null Hypothesis: $H_0: \mu_1 = \mu_2$, i.e. both machines are equally efficient.
Alternate Hypothesis: $H_1: \mu_1 \neq \mu_2$
Estimated standard deviation:

$$S = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{25(400) + 25(625)}{48}} = 23.1$$

Test Statistic:

$$t = \left|\frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\right| = \left|\frac{200 - 250}{23.1\sqrt{\dfrac{1}{25} + \dfrac{1}{25}}}\right| = |-7.65| = 7.65$$

The table value of t at the 1% level of significance, $t_{0.01,48}$, is 2.58.

Calculated value > Tabulated value.
Hence reject the Null hypothesis, i.e. the two machines are not equally efficient at the 1% level of
significance.
2. Two salesmen A and B are working in a certain district. From a sample survey conducted
by the Head Office, the following results were obtained. State whether there is any
significant difference in the average sales between the two salesmen.

                              A      B
No. of Sales                  20     18
Average sales (in Rs.)        170    205
Standard Deviation (in Rs.)   20     25

Solution:
Given $n_1 = 20$, $\bar{x}_1 = 170$, $s_1 = 20$
$n_2 = 18$, $\bar{x}_2 = 205$, $s_2 = 25$
Assume Null Hypothesis: $H_0: \mu_1 = \mu_2$,
i.e. there is no significant difference in the average sales between the two salesmen.
Alternate Hypothesis: $H_1: \mu_1 \neq \mu_2$
Estimated standard deviation:

$$S = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{20(400) + 18(625)}{36}} = 23.12$$

Test Statistic:

$$t = \left|\frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\right| = \left|\frac{170 - 205}{23.12\sqrt{\dfrac{1}{20} + \dfrac{1}{18}}}\right| = 4.66$$

The table value of t at the 5% level of significance, $t_{0.05,36}$, is 1.96.

Calculated value > Tabulated value.
Hence reject the Null hypothesis, i.e. there is a significant difference in the average sales between
the two salesmen.
3. The mean life of a sample of 10 electric bulbs was found to be 1456 hours with a standard
deviation of 423 hours. A second sample of 17 bulbs chosen from a different batch showed
a mean life of 1280 hours with a standard deviation of 398 hours. Is there a significant
difference between the means of the two batches?
Solution: Given $n_1 = 10$, $\bar{x}_1 = 1456$, $s_1 = 423$
$n_2 = 17$, $\bar{x}_2 = 1280$, $s_2 = 398$
Assume Null Hypothesis: $H_0: \mu_1 = \mu_2$, i.e. there is no significant difference in the
means of the two samples.
Alternate Hypothesis: $H_1: \mu_1 \neq \mu_2$
Estimated standard deviation:

$$S = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{10(423)^2 + 17(398)^2}{25}} = 423.42$$

Test Statistic:

$$t = \left|\frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\right| = \left|\frac{1456 - 1280}{423.42\sqrt{\dfrac{1}{10} + \dfrac{1}{17}}}\right| = 1.04$$

The table value of t at the 5% level of significance, $t_{0.05,25}$, is 2.06.

Calculated value < Tabulated value.
Hence accept the Null hypothesis, i.e. there is no significant difference in the means of the two
samples.
4. Two types of batteries are tested for their lengths of life and the following data are
obtained.

          No. of Samples   Mean Life    Variance
Type A    9                600 hours    121
Type B    8                640 hours    144

Is there a significant difference in the two means? The value of t for 15 degrees of freedom
at the 5% level is 2.131.
Solution: Given $n_1 = 9$, $\bar{x}_1 = 600$, $s_1^2 = 121$
$n_2 = 8$, $\bar{x}_2 = 640$, $s_2^2 = 144$
Assume Null Hypothesis: $H_0: \mu_1 = \mu_2$, i.e. there is no significant difference in the two
means.
Alternate Hypothesis: $H_1: \mu_1 \neq \mu_2$
Estimated standard deviation:

$$S = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{9(121) + 8(144)}{15}} = 12.22$$

Test Statistic:

$$t = \left|\frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\right| = \left|\frac{600 - 640}{12.22\sqrt{\dfrac{1}{9} + \dfrac{1}{8}}}\right| = 6.73$$

The table value of t at the 5% level of significance, $t_{0.05,15}$, is 2.131.

Calculated value > Tabulated value.
Hence reject the Null hypothesis, i.e. there is a significant difference in the two means.

Student’s t - distribution function


Gosset was employed by Guinness and Son, a Dublin brewery in Ireland, which did not permit
employees to publish research work under their own names. So Gosset adopted the pen name
“Student” and published his findings under this name. Thereafter, the t – distribution came to be
commonly called Student’s t – distribution or simply Student’s distribution.
The t – distribution is used when the sample drawn from a population is of
size lower than 30 and the population standard deviation is unknown. The t – statistic, $t_{cal}$, is
defined as

$$t_{cal} = \left(\frac{\bar{x} - \mu}{S}\right)\sqrt{n}, \qquad \text{where } S = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}},$$

$\bar{x}$ is the sample mean, $n$ is the sample size, and $x_i$ are the data items.

The t – distribution function has been derived mathematically under the assumption of a
normally distributed population; it has the following form:

$$f(t) = C\left(1 + \frac{t^2}{\gamma}\right)^{-\frac{\gamma + 1}{2}}$$

where $C$ is a constant term and $\gamma = n - 1$ denotes the number of degrees of freedom. As the
p.d.f. of a t – distribution is not suitable for analytical treatment, the function is evaluated
numerically for various values of $t$ and for particular values of $\gamma$. The t – distribution table
normally given in statistics textbooks gives, over a range of values of $\gamma$, the values of $t$ that
are exceeded by chance with given probabilities, i.e. at different levels of significance. The
t – distribution function has a different shape for each number of degrees of freedom, and when the
degrees of freedom become large, the t – distribution approaches the normal distribution.
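
To make the last remark concrete, the density can be evaluated numerically. The notes leave the constant $C$ unspecified; the sketch below assumes the standard normalising constant $C = \Gamma\left(\frac{\gamma+1}{2}\right) / \left(\sqrt{\gamma\pi}\,\Gamma\left(\frac{\gamma}{2}\right)\right)$, and the function names are hypothetical helpers introduced only for this illustration.

```python
import math

def t_pdf(t, dof):
    """Density of Student's t with `dof` degrees of freedom.

    Assumes C = Gamma((dof+1)/2) / (sqrt(dof*pi) * Gamma(dof/2)),
    evaluated through log-gamma to avoid overflow for large dof.
    """
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(t * t / dof))

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Compare the two densities at t = 2 as the degrees of freedom grow.
for dof in (2, 9, 30, 1000):
    print(dof, t_pdf(2.0, dof), normal_pdf(2.0))
```

The gap between the two columns shrinks as the degrees of freedom increase, which is the sense in which the t – distribution tends to the normal distribution.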
The applications of the t – distribution include (i) testing the significance of the mean of a random
sample, i.e. determining whether the mean of a sample drawn from a normal
population deviates significantly from a stated value (the hypothetical value of the population
mean); (ii) testing whether the difference between the means of two independent samples is
significant, i.e. ascertaining whether the two samples come from the same normal
population; (iii) testing whether the difference between the means of two dependent samples is
significant; and (iv) testing the significance of an observed correlation coefficient.
Procedure to be followed in testing a hypothesis about a population parameter
using Student’s t – distribution:

• First, set up the null hypothesis.
• Then, set up the alternate hypothesis.
• Choose a suitable level of significance.
• Note down the sample size $n$ and the number of degrees of freedom $\gamma = n - 1$.
• Obtain the theoretical value $t_{tab}$ from the t – distribution table as follows: if we set
$\alpha = 5\% = 0.05$ and suppose $\gamma = 9$, then $t_{tab}$ is read from the 9th row and the column
$\alpha = 0.025$ (i.e. half of $\alpha = 0.05$).
• Calculate the test criterion using the formula $t_{cal} = \left(\dfrac{\bar{x} - \mu}{S}\right)\sqrt{n}$.
• Compare the calculated value with the tabulated value. As long as the magnitude of the
calculated value does not exceed the tabulated value, we accept the null hypothesis;
when it exceeds the tabulated value, we reject the null hypothesis and accept the
alternate hypothesis.


Problems:
1. The manufacturer of a certain make of electric bulbs claims that his bulbs have a mean life
of 25 months with a standard deviation of 5 months. A random sample of 6 such bulbs has
the following values. Life of bulbs in months: 24, 26, 30, 20, 20, and 18. Can you regard the
producer’s claim to be valid at the 1% level of significance? (Given that $t_{tab} = 4.032$ corresponding
to $\gamma = 5$.)
Solution: We first set up the null hypothesis $H_0: \mu = 25$ months; the alternate hypothesis
may be taken as $H_1: \mu \neq 25$ months. Setting $\alpha = 1\%$, the tabulated value corresponding
to this level of significance is $t_{tab}|_{\alpha = 1\%,\ \gamma = 5} = 4.032$ (read from the 5th row of the
table). The test criterion is given by

$$t_{cal} = \left(\frac{\bar{x} - \mu}{S}\right)\sqrt{n}, \qquad \text{where } S = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}.$$

Consider

$x_i$          $\bar{x}$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
24             23           1                  1
26                          3                  9
30                          7                  49
20                          -3                 9
20                          -3                 9
18                          -5                 25
Total = 138                 -                  Total = 102

Thus, $S = \sqrt{\dfrac{102}{5}} = \sqrt{20.4} = 4.517$ and $|t_{cal}| = \dfrac{|23 - 25|}{4.517}\sqrt{6} = 1.084$. Since the
calculated value, 1.084, is lower than the tabulated value of 4.032, we accept the null
hypothesis: the mean life of the bulbs could be about 25 months.
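
The hand computation in the table above can be checked with a short script. This is a minimal sketch of the one-sample statistic $t_{cal} = \frac{\bar{x} - \mu}{S}\sqrt{n}$ with $S$ computed using the divisor $n - 1$; the function name `one_sample_t` is a hypothetical helper, and the data are the six bulb lives listed in the table (24, 26, 30, 20, 20, 18).

```python
import math

def one_sample_t(data, mu0):
    """Return (mean, S, t, dof) for testing H0: population mean = mu0."""
    n = len(data)
    mean = sum(data) / n
    # S uses the divisor n - 1, as in the definition of the t-statistic.
    S = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    t = (mean - mu0) / S * math.sqrt(n)
    return mean, S, t, n - 1

mean, S, t, dof = one_sample_t([24, 26, 30, 20, 20, 18], 25)
print(mean, round(S, 3), round(t, 3), dof)

# Two-sided decision against the table value 4.032 quoted in the problem.
print("reject H0" if abs(t) > 4.032 else "accept H0")
```

The sign of `t` is negative because the sample mean lies below the claimed mean; only its magnitude is compared with the tabulated value.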


2. A certain stimulus administered to each of 13 patients resulted in the following increases
in blood pressure: 5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6, 8. Can it be concluded that the stimulus will, in
general, be accompanied by an increase in blood pressure?
Solution: We set up $H_0: \mu_{before} = \mu_{after}$, i.e. there is no significant difference in the blood
pressure readings before and after the stimulus is administered, so that the mean increase is
$\mu = 0$. The alternate hypothesis is $H_1: \mu_{after} > \mu_{before}$, i.e. the stimulus resulted in an
increase in the blood pressure of the patients.
Taking $\alpha = 1\%$ and $\alpha = 5\%$, with $n = 13$ and $\gamma = n - 1 = 12$, the respective tabulated values are
$t_{tab}|_{\alpha = 1\%,\ \gamma = 12} = 3.055$ and $t_{tab}|_{\alpha = 5\%,\ \gamma = 12} = 2.179$. Now, we compute the value of the test
criterion. For this, consider

$x_i$          $\bar{x}$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
5              3            2                  4
2                           -1                 1
8                           5                  25
-1                          -4                 16
3                           0                  0
0                           -3                 9
-2                          -5                 25
1                           -2                 4
5                           2                  4
0                           -3                 9
4                           1                  1
6                           3                  9
8                           5                  25
Total = 39                  -                  Total = 132

Then $S = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{132}{12}} = \sqrt{11} = 3.317$. Therefore,
$t_{cal} = \dfrac{\bar{x} - \mu}{S}\sqrt{n}$ is obtained as $t_{cal} = \dfrac{3 - 0}{3.317}\sqrt{13} = 3.2614$. As the calculated value
3.2614 is more than the tabulated values of 3.055 and 2.179, we accept the alternate hypothesis
that after the stimulus is administered to the patients, there is an increase in the blood pressure level.
3. The lifetimes (in thousands of hours) of electric bulbs for a random sample of 10 from a large
consignment gave the following data: 4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6. Can we accept
the hypothesis that the average lifetime of the bulbs is 4,000 hours?
Solution: Set up $H_0: \mu = 4{,}000$ hours, $H_1: \mu \neq 4{,}000$ hours. Let us choose $\alpha = 5\%$. Then the
tabulated value is $t_{tab}|_{\alpha = 5\%,\ \gamma = 9} = 2.262$. To find the test criterion, consider

$x_i$          $\bar{x}$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
4.2            4.4          -0.2               0.04
4.6                         0.2                0.04
3.9                         -0.5               0.25
4.1                         -0.3               0.09
5.2                         0.8                0.64
3.8                         -0.6               0.36
3.9                         -0.5               0.25
4.3                         -0.1               0.01
4.4                         0.0                0.00
5.6                         1.2                1.44
Total = 44                  -                  Total = 3.12

Then $S = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{3.12}{9}} = 0.589$. Therefore, $t_{cal} = \dfrac{\bar{x} - \mu}{S}\sqrt{n}$ is computed as
$t_{cal} = \dfrac{4.4 - 4.0}{0.589}\sqrt{10} = 2.148$. As the computed value is lower than the tabulated value of
2.262, we conclude that the mean lifetime of the bulbs is about 4,000 hours.


4. Consider the sample consisting of nine numbers 45, 47, 50, 52, 48, 47, 49, 53 and 51. The sample
is drawn from a population whose mean is 47.5. Find whether the sample mean differs significantly
from the population mean at the 5% level of significance.

Solution: For the given sample, the size is $N = 9$. Therefore its mean is

$$\bar{X} = \frac{1}{9}(45 + 47 + 50 + 52 + 48 + 47 + 49 + 53 + 51) = 49.11$$

and the variance is

$$s^2 = \frac{1}{9}\left\{(45 - 49.11)^2 + (47 - 49.11)^2 + (50 - 49.11)^2 + (52 - 49.11)^2 + (48 - 49.11)^2 + (47 - 49.11)^2 + (49 - 49.11)^2 + (53 - 49.11)^2 + (51 - 49.11)^2\right\} = 6.0988$$

so that the standard deviation is $s = \sqrt{6.0988} = 2.47$.
Since $N = 9$, we have $\gamma = 8$, for which we find from the table that $t_{0.05} = 2.31$.

With $\mu = 47.5$, $\bar{X} = 49.11$ and $s = 2.47$, we have

$$t = \left(\frac{\bar{X} - \mu}{s}\right)\sqrt{\gamma} = \frac{49.11 - 47.5}{2.47} \times \sqrt{8} = 1.844.$$

Thus, here the t-score is less than $t_{0.05}(\gamma) = 2.31$. Accordingly, the difference between the
sample mean and the population mean is not significant at the 0.05 level of significance.

5. Eleven school boys were given a test in mathematics carrying a maximum of 25 marks. They
were given a month’s extra coaching and a second test of equal difficulty was held thereafter.
The following table gives the marks in the two tests.

Boy             1    2    3    4    5    6    7    8    9    10   11
I Test Marks    23   20   19   21   18   20   18   17   23   16   19
II Test Marks   24   19   22   18   20   22   20   20   23   20   17

Do the marks give evidence that the students have benefitted from the extra coaching? Use the 0.05
level of significance.
Solution: We first calculate the mean and the standard deviation of the differences in marks in
the two tests.
We note that the differences in marks (marks in II test minus marks in I test) are

1, -1, 3, -3, 2, 2, 2, 3, 0, 4, -2.

The mean of these differences is

$$\bar{X} = \frac{1}{11}(1 - 1 + 3 - 3 + 2 + 2 + 2 + 3 + 0 + 4 - 2) = 1$$

and the variance is

$$s^2 = \frac{1}{11}\left\{(1 - 1)^2 + (-1 - 1)^2 + (3 - 1)^2 + (-3 - 1)^2 + (2 - 1)^2 + (2 - 1)^2 + (2 - 1)^2 + (3 - 1)^2 + (0 - 1)^2 + (4 - 1)^2 + (-2 - 1)^2\right\}$$

$$= \frac{1}{11}(0 + 4 + 4 + 16 + 1 + 1 + 1 + 4 + 1 + 9 + 9) = \frac{50}{11} = 4.545,$$

so that the standard deviation is $s = \sqrt{4.545} = 2.13$.

Since $N = 11$, we have $\gamma = 10$, for which we find from the table that $t_{0.05} = 2.23$.
Now, we make the hypothesis that the students have not benefitted from the extra coaching. That
is, the mean difference in marks $\mu$ is zero. Under this hypothesis, the t-score is

$$t = \frac{\bar{X} - \mu}{s}\sqrt{\gamma} = \frac{1 - 0}{2.13}\sqrt{10} = 1.485.$$

We note that this t-score is less than $t_{0.05}(\gamma) = 2.23$. Hence, we do not reject the
hypothesis at the 0.05 level of significance. This means that it is likely that the students have not
benefitted from the extra coaching.
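
The paired computation above can be sketched in Python. The sketch follows the convention used in this solution, namely a variance with divisor $N$ and a t-score of the form $\frac{\bar{X} - \mu}{s}\sqrt{N - 1}$; the function name `paired_t` is a hypothetical helper. Because the script carries $s$ at full precision instead of rounding it to 2.13, the last digit of the t-score differs slightly from the hand value.

```python
import math

def paired_t(first, second):
    """Return (differences, mean, s, t) for paired marks, testing mu = 0."""
    d = [b - a for a, b in zip(first, second)]   # second test minus first
    n = len(d)
    mean_d = sum(d) / n
    s = math.sqrt(sum((x - mean_d) ** 2 for x in d) / n)   # divisor n
    t = mean_d / s * math.sqrt(n - 1)                      # gamma = n - 1
    return d, mean_d, s, t

test_1 = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
test_2 = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 17]

d, mean_d, s, t = paired_t(test_1, test_2)
print(d)
print(mean_d, round(s, 3), round(t, 3))
```

Working on the differences reduces the two dependent samples to a single sample, after which the one-sample t-score applies unchanged.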

6. Two horses A and B were tested according to the time (in seconds) to run a particular race
with the following results.

Horse A: 28 30 32 33 33 29 34
Horse B: 29 30 30 24 27 29

Test whether you can discriminate between the two horses. ($t_{0.05} = 2.2$ for 11 d.f.)
Solution: Let the variables x and y respectively correspond to horse A and horse B.

$$\bar{x} = \frac{\sum x}{n_1} = \frac{219}{7} = 31.3, \qquad \bar{y} = \frac{\sum y}{n_2} = \frac{169}{6} = 28.2$$

$$\sum(x - \bar{x})^2 = 31.43, \qquad \sum(y - \bar{y})^2 = 26.84$$

$$s = \sqrt{\frac{\sum(x - \bar{x})^2 + \sum(y - \bar{y})^2}{n_1 + n_2 - 2}} = 2.30$$

$$t = \frac{\bar{x} - \bar{y}}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = 2.42 > 2.2$$

Therefore, the hypothesis that the two horses do not differ is rejected at the 5% level of significance.
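
When the raw observations are available, as for the two horses, the pooled estimate can be built directly from the sums of squared deviations. The sketch below assumes exactly the formulas of this solution; the function name `two_sample_t` is a hypothetical helper. It keeps full precision throughout, so the t-value comes out near 2.44 instead of the 2.42 obtained from the rounded means 31.3 and 28.2.

```python
import math

def two_sample_t(x, y):
    """Return (s, t, dof) using the pooled sum of squared deviations."""
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    dof = n1 + n2 - 2
    s = math.sqrt((ss_x + ss_y) / dof)
    t = (mean_x - mean_y) / (s * math.sqrt(1 / n1 + 1 / n2))
    return s, t, dof

horse_a = [28, 30, 32, 33, 33, 29, 34]
horse_b = [29, 30, 30, 24, 27, 29]

s, t, dof = two_sample_t(horse_a, horse_b)
print(round(s, 2), round(t, 2), dof)
```

Either value of t exceeds the tabulated 2.2, so the conclusion is unaffected by the rounding.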


Discussion on the $\chi^2$ test and Goodness of Fit

In the above section, we discussed the t – distribution function (i.e. the t – test). The study was based
on the assumption that the samples were drawn from normally distributed populations or, more
accurately, that the sample means were normally distributed. Since the test requires such an
assumption about the population, a test of this kind is called a
parametric test. There are situations in which it may not be possible to make any rigid
assumption about the distribution of the population from which one has to draw a sample.
Thus, there is a need for non – parametric tests, which do not require any
assumptions about the population parameters.

With this in view, we now consider the $\chi^2$ distribution, which does not
require any assumption with regard to the population. The test criterion corresponding to this
distribution is

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are the observed values and $E_i$ are the expected values.

The calculated $\chi^2$ value (i.e. the test criterion value) is compared with the
tabular value of $\chi^2$ for the given degrees of freedom at a certain prefixed level of
significance. Whenever the calculated value is lower than the tabular value, we continue to
accept that there is no significant difference between the expected and observed
results. On the other hand, if the calculated value is found to be more than the value given
in the table, then we conclude that there is a significant difference between the observed
and expected frequencies.
As usual, the degrees of freedom are $\gamma = n - k$, where $k$ denotes the number of independent
constraints. Usually $k = 1$, the single constraint being that the sum of the expected values
must equal the sum of the observed values.
This is an approximate test for a relatively large population. The following
conditions must be checked before employing the test:
1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, must be linear, i.e. the sum of all the observed
values must match the sum of all the expected values.
3. N, the total frequency, should be reasonably large.
4. No theoretical frequency should be lower than 5.
It may be recalled that the $\chi^2$ test depends only on the set of observed and expected
frequencies and on the degrees of freedom; it does not make any assumptions regarding the
population.
Problems:
1. The following table gives the number of road accidents that occurred in a large city during
the various days of a week. Test the hypothesis that the accidents are uniformly distributed over
all the days of a week.

Day                Sun   Mon   Tue   Wed   Thu   Fri   Sat   Total
No. of accidents   14    16    8     12    11    9     14    84

Solution: Under the hypothesis that the accidents are uniformly distributed over
the week, the expected number of accidents on each day is 12 (because a total of N = 84
accidents have occurred in 7 days).
Thus, here, the expected frequencies are 12 each, and the observed frequencies are the numbers of
accidents shown in the given table.
Using these, we find that

$$\chi^2 = \frac{(14 - 12)^2}{12} + \frac{(16 - 12)^2}{12} + \frac{(8 - 12)^2}{12} + \frac{(12 - 12)^2}{12} + \frac{(11 - 12)^2}{12} + \frac{(9 - 12)^2}{12} + \frac{(14 - 12)^2}{12} = 4.17$$

We note that $n = 7$ frequency pairs are used in the computation of $\chi^2$. Further, $N = \sum f_i = 84$ is
the only quantity used in the computation of $e_i$. Therefore, the number of degrees of freedom
is $\nu = 7 - 1 = 6$. From the table we find that $\chi^2_{0.05}(6) = 12.59$ and $\chi^2_{0.01}(6) = 16.81$.
Since $\chi^2 = 4.17$ is much less than both $\chi^2_{0.05}(6)$ and $\chi^2_{0.01}(6)$, we do not reject the hypothesis.
This means that the accidents seem to be distributed uniformly over the week.
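
The test criterion $\chi^2 = \sum (O_i - E_i)^2 / E_i$ is easy to compute mechanically. The following is a minimal sketch applied to the accident counts above; the function name `chi_square` is a hypothetical helper, and the critical values are the table values quoted in the solution.

```python
def chi_square(observed, expected):
    """Return the sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

accidents = [14, 16, 8, 12, 11, 9, 14]           # Sun .. Sat
uniform = [sum(accidents) / len(accidents)] * 7  # 84 / 7 = 12 per day

chi2 = chi_square(accidents, uniform)
dof = len(accidents) - 1
print(round(chi2, 2), dof)

# Compare with the tabulated values quoted above for 6 d.f.
for level, critical in ((0.05, 12.59), (0.01, 16.81)):
    print(level, "reject" if chi2 > critical else "do not reject")
```

The expected frequencies are derived from the observed total, which is the single linear constraint that costs one degree of freedom.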
2. A set of five similar coins is tossed 320 times and the result is

No. of heads   0    1    2    3     4    5
Frequency      6    27   72   112   71   32

Test the hypothesis that the data follow a binomial distribution.
Solution: We set up the null hypothesis that the data follow a binomial distribution.
The alternate hypothesis is that the data do not follow a binomial distribution. We choose
the level of significance $\alpha = 5\%$; with $n = 6$ classes, the number of degrees of freedom is $\gamma = 5$.
Therefore, the tabulated value is $\chi^2|_{\alpha = 0.05,\ \gamma = 5} = 11.07$. Before finding the test
criterion, we first compute the expected frequencies. As the data are assumed to follow a
binomial distribution, the expected frequency of $k$ heads is

$$F(k) = N\binom{n}{k}p^k q^{n-k}.$$

Here $N = 320$, $n = 5$, $p = 0.5$, $q = 0.5$, and $k$ takes the values from 0 up to 5. Hence, the
expected frequencies of getting 0, 1, 2, 3, 4, 5 heads are the successive terms of the binomial
expansion $320\,(q + p)^5$.
Here, the observed values are $O_i$: 6, 27, 72, 112, 71, 32.

The expected values are $E_i$: 10, 50, 100, 100, 50, 10.

$$\chi^2_{cal} = \frac{(6 - 10)^2}{10} + \frac{(27 - 50)^2}{50} + \frac{(72 - 100)^2}{100} + \frac{(112 - 100)^2}{100} + \frac{(71 - 50)^2}{50} + \frac{(32 - 10)^2}{10} = 78.68.$$

As the calculated value is very much higher than the tabulated value of 11.07, we reject the
null hypothesis and accept the alternate hypothesis that the data do not follow the binomial
distribution.
3. A set of five identical coins is tossed 320 times and the result is shown in the following
table.

No. of heads   0    1    2    3     4    5
Frequency      6    27   72   112   71   32

Test the hypothesis that the data follow a binomial distribution associated with a fair coin.
Solution: The probability that $x$ of the 5 fair coins show a head in a single toss is
given by the binomial function

$$b(5, \tfrac{1}{2}, x) = {}^5C_x \left(\tfrac{1}{2}\right)^x \left(\tfrac{1}{2}\right)^{5 - x} = \frac{1}{2^5}\,{}^5C_x = \frac{1}{32}\,{}^5C_x = b(x), \text{ say.}$$

Accordingly, in 320 tosses the expected number of tosses in which $x$ coins show a
head is $320 \times b(x)$. Hence the expected frequencies (i.e. the numbers of tosses in which
0, 1, 2, 3, 4, 5 coins show a head) are, respectively,

$e_1 = 320 \times b(0) = 320 \times \frac{1}{32} \times {}^5C_0 = 10$,
$e_2 = 320 \times b(1) = 320 \times \frac{1}{32} \times {}^5C_1 = 50$,
$e_3 = 320 \times b(2) = 320 \times \frac{1}{32} \times {}^5C_2 = 100$,
$e_4 = 320 \times b(3) = 320 \times \frac{1}{32} \times {}^5C_3 = 100$,
$e_5 = 320 \times b(4) = 320 \times \frac{1}{32} \times {}^5C_4 = 50$,
$e_6 = 320 \times b(5) = 320 \times \frac{1}{32} \times {}^5C_5 = 10$.

The corresponding observed frequencies are

$f_1 = 6,\ f_2 = 27,\ f_3 = 72,\ f_4 = 112,\ f_5 = 71,\ f_6 = 32.$

We find that

$$\chi^2 = \frac{(6 - 10)^2}{10} + \frac{(27 - 50)^2}{50} + \frac{(72 - 100)^2}{100} + \frac{(112 - 100)^2}{100} + \frac{(71 - 50)^2}{50} + \frac{(32 - 10)^2}{10}$$

$$= \frac{16}{10} + \frac{529}{50} + \frac{784}{100} + \frac{144}{100} + \frac{441}{50} + \frac{484}{10} = 78.68$$

We note that the number of degrees of freedom is $6 - 1 = 5$. From the table we find that
$\chi^2_{0.05}(5) = 11.07$ and $\chi^2_{0.01}(5) = 15.09$. We observe that $\chi^2 = 78.68$ is very much greater than
both $\chi^2_{0.05}(5)$ and $\chi^2_{0.01}(5)$. Therefore, we reject the hypothesis that the observed data
follow a binomial distribution associated with a fair coin.
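
Fitting the binomial distribution amounts to multiplying the total frequency by the binomial probabilities. The sketch below reproduces the expected frequencies and the $\chi^2$ value for the five-coin data; `binomial_expected` and `chi_square` are hypothetical helper names, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_expected(total, n, p):
    """Expected frequencies total * C(n, x) * p^x * (1-p)^(n-x), x = 0..n."""
    q = 1 - p
    return [total * math.comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

def chi_square(observed, expected):
    """Return the sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [6, 27, 72, 112, 71, 32]          # tosses showing 0..5 heads
expected = binomial_expected(sum(observed), n=5, p=0.5)

chi2 = chi_square(observed, expected)
print(expected)
print(round(chi2, 2), "d.f. =", len(observed) - 1)
```

Changing `p` in the call gives the expected frequencies for a biased coin, which is a convenient way to explore how sensitive the test is to the assumed probability.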

4. Five dice were thrown 96 times and the number of dice showing 1, 2 or 3 follows the frequency
distribution below.

No. of dice showing 1, 2 or 3   5    4    3    2    1    0
Frequency                       7    19   35   24   8    3

Test the hypothesis that the data follow a binomial distribution. ($\chi^2_{0.05} = 11.07$ for 5 d.f.)

Solution:

$p = q = 0.5$, $N = 96$, $n = 5$

$$F(x) = N\,({}^nC_x)\,p^x q^{n-x}$$

By fitting the binomial distribution, we get

$O_i$   7    19   35   24   8    3
$E_i$   3    15   30   30   15   3

$$\chi^2 = \sum \frac{(E_i - O_i)^2}{E_i} = 11.7 > 11.07$$

Therefore, the hypothesis is rejected at the 5% level of significance.

5. Fit a Poisson distribution to the following data and test for its goodness of fit at a level of significance
0.05. ($\chi^2_{0.05}$ with 3 d.f. = 9.48)

X    0     1     2     3    4
f    419   352   154   56   19

Solution:

$$\bar{x} = \frac{\sum fx}{N} = \frac{904}{1000} = 0.904 = m, \text{ the mean of the Poisson distribution.}$$

Hence

$$P(x) = \frac{m^x e^{-m}}{x!} = \frac{(0.904)^x e^{-0.904}}{x!}, \qquad x = 0, 1, 2, 3, 4.$$

Hence the expected frequency for $x$ successes is

$$E_x = N \times P(x) = \frac{1000 \times (0.904)^x e^{-0.904}}{x!}, \qquad \text{where } x = 0, 1, 2, 3, 4.$$

Putting x = 0, 1, 2, 3, 4 we get

$E_0 = N \times P(0) = \dfrac{1000 \times (0.904)^0 e^{-0.904}}{0!} = 405$,
$E_1 = N \times P(1) = \dfrac{1000 \times (0.904)^1 e^{-0.904}}{1!} = 366$,
$E_2 = N \times P(2) = \dfrac{1000 \times (0.904)^2 e^{-0.904}}{2!} = 165.4$,
$E_3 = N \times P(3) = \dfrac{1000 \times (0.904)^3 e^{-0.904}}{3!} = 49.8$,
$E_4 = N \times P(4) = \dfrac{1000 \times (0.904)^4 e^{-0.904}}{4!} = 11.2$.

Hence the theoretical frequencies are

x:   0     1     2       3      4
f:   405   366   165.4   49.8   11.2

$$\chi^2 = \sum \frac{(E_i - O_i)^2}{E_i} = \frac{(419 - 405)^2}{405} + \frac{(352 - 366)^2}{366} + \frac{(154 - 165.4)^2}{165.4} + \frac{(56 - 49.8)^2}{49.8} + \frac{(19 - 11.2)^2}{11.2} = 8.01$$

Here the calculated $\chi^2 < 9.48$. So we accept $H_0$.
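
The Poisson fit can be reproduced numerically as a check on the arithmetic. This is a sketch under the same model as the solution (mean estimated from the data, expected frequency $N\,m^x e^{-m}/x!$); the function names are hypothetical helpers. The script keeps the expected frequencies at full precision, so its $\chi^2$ (about 7.88) differs a little from a hand computation based on rounded frequencies; the comparison with the quoted table value 9.48 is unchanged.

```python
import math

def poisson_expected(total, mean, max_x):
    """Expected frequencies total * mean^x * exp(-mean) / x!, x = 0..max_x."""
    return [total * mean**x * math.exp(-mean) / math.factorial(x)
            for x in range(max_x + 1)]

def chi_square(observed, expected):
    """Return the sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [419, 352, 154, 56, 19]            # frequencies for x = 0..4
total = sum(observed)
mean = sum(x * f for x, f in enumerate(observed)) / total   # m = 0.904

expected = poisson_expected(total, mean, max_x=4)
chi2 = chi_square(observed, expected)

print(mean)
print([round(e, 1) for e in expected])
print(round(chi2, 2))
```

Note that the expected frequencies do not add up to exactly 1000, because the Poisson distribution also assigns a small probability to values of x above 4.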


Exercises:

F – test or Fisher’s F-test


The F-test was first originated by the statistician R.A. Fisher. This test is also known as
Fisher’s F-test or simply F-test. It is based on the F-distribution, which is defined as the ratio
of two independent chi-square variates which is derived by dividing each variable by its
𝝍𝟐⁄
𝝂𝟏
corresponding degree of freedom 𝑭 = 𝝍𝟐⁄
𝝂𝟐

We use the F-test to test whether two samples have come from the same population, or
whether there is any significant difference between two estimates of the population variance.
F = greater variance / smaller variance
\[ F = \frac{S_1^2}{S_2^2}, \quad S_1^2 > S_2^2 \]
where
\[ S_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1}, \qquad S_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1}, \]
and $n_1$ is the first sample size and $n_2$ is the second sample size.
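If instead the sample variances $s_1^2$ and $s_2^2$ (computed with divisor $n$) are given, we can obtain the
estimates of the population variance by using the relations
\[ S_1^2 = \frac{n_1 s_1^2}{n_1 - 1} \quad \text{and} \quad S_2^2 = \frac{n_2 s_2^2}{n_2 - 1}. \]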
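
The two routes to the variance estimate (directly from the deviations, or by rescaling a variance computed with divisor n) give the same number. A minimal sketch in Python follows; the helper names and the four-value sample are hypothetical and chosen only for illustration.

```python
def unbiased_variance(values):
    """S^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def unbiased_from_sample_variance(n, s2):
    """Convert s^2 (divisor n) into S^2 (divisor n - 1): S^2 = n * s^2 / (n - 1)."""
    return n * s2 / (n - 1)


data = [2, 4, 4, 6]  # hypothetical sample, not from the notes
n = len(data)
mean = sum(data) / n
s2 = sum((x - mean) ** 2 for x in data) / n  # variance with divisor n

S2_direct = unbiased_variance(data)
S2_converted = unbiased_from_sample_variance(n, s2)

print(S2_direct, S2_converted)  # both routes agree
```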

Assumptions in F-test.
The F-Test is based on the following assumptions:
1. Normality: The values in each group should be normally distributed.
2. Independence of error: The variation of each value around its own group mean (the error) should be independent for each value.
3. Homogeneity: The variances within each group should be equal for all groups.
If, however, the sample sizes are large enough, we do not need the assumption of normality.

Test of hypothesis about the variance of two populations


We have the following steps:
1. Null Hypothesis: $H_0: S_1^2 = S_2^2$
Alternate Hypothesis: $H_1: S_1^2 \neq S_2^2$
2. Calculation of Test Statistic:
\[ F = \frac{S_1^2}{S_2^2} \ \text{if } S_1^2 > S_2^2, \quad \text{or} \quad F = \frac{S_2^2}{S_1^2} \ \text{if } S_2^2 > S_1^2, \quad \text{so that } F \geq 1. \]

3. Level of significance: Take the level of significance $\alpha = 0.05$ if $\alpha$ is not specified.


4. Decision: Accept $H_0$ if computed $F \leq$ tabled $F_\alpha$.
Reject $H_0$ if computed $F >$ tabled $F_\alpha$.
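
The four steps above can be sketched as a small Python routine. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names `f_statistic` and `decide` are hypothetical, the input numbers are made up, and the tabled value must still be read from an F table by the caller for the degrees of freedom the routine reports.

```python
def f_statistic(ss1, n1, ss2, n2):
    """Return (F, (df_numerator, df_denominator)).

    ss1, ss2 are the sums of squared deviations from the sample means.
    The larger variance estimate goes in the numerator so that F >= 1.
    """
    var1 = ss1 / (n1 - 1)
    var2 = ss2 / (n2 - 1)
    if var1 >= var2:
        return var1 / var2, (n1 - 1, n2 - 1)
    return var2 / var1, (n2 - 1, n1 - 1)


def decide(f_value, f_table):
    """Step 4: accept H0 if computed F <= tabled F, otherwise reject."""
    return "accept H0" if f_value <= f_table else "reject H0"


# Hypothetical inputs: sums of squares 40 (n = 6) and 27 (n = 10).
f_value, df = f_statistic(40, 6, 27, 10)

# Value read from a standard 5% F table for (5, 9) degrees of freedom.
F_TABLE_5_9 = 3.48

decision = decide(f_value, F_TABLE_5_9)
print(f"F = {f_value:.3f}, d.f. = {df}, {decision}")
```

Returning the degrees of freedom together with F makes it harder to look up the table value with the numerator and denominator swapped.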

Problems
1. In one sample of 8 observations, the sum of the squares of deviations of the sample
values from the sample mean was 84.4, and in another sample of 10 observations
it was 102.6. Test whether this difference is significant at the 5% level.
Solution: Null Hypothesis: $H_0: S_1^2 = S_2^2$ (there is no significant difference)
Alternate Hypothesis: $H_1: S_1^2 \neq S_2^2$
Given $\sum (x - \bar{x})^2 = 84.4$, $n_1 = 8$, $\sum (y - \bar{y})^2 = 102.6$, $n_2 = 10$,
\[ S_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1} = \frac{84.4}{8 - 1} = 12.057 \]
\[ S_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1} = \frac{102.6}{10 - 1} = 11.4 \]
\[ F = \frac{S_1^2}{S_2^2} = \frac{12.057}{11.4} = 1.058 \]
Calculated F value = 1.058
Tabulated value = 3.29 (at 5% level of significance with (7, 9) degrees of freedom)
Calculated value < Tabulated value,
hence accept $H_0$ (null hypothesis).
2. Two random samples gave the following results.
Sample   Size   Sample mean   Sum of squares of deviations from the mean
1        10     15            90
2        12     14            108
Test whether the samples come from the same normal population.
Solution: Null Hypothesis: $H_0: S_1^2 = S_2^2$ (the samples come from the same
normal population)
Alternate Hypothesis: $H_1: S_1^2 \neq S_2^2$
Given $\sum (x - \bar{x})^2 = 90$, $n_1 = 10$, $\sum (y - \bar{y})^2 = 108$, $n_2 = 12$,
\[ S_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1} = \frac{90}{9} = 10 \]
\[ S_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1} = \frac{108}{11} = 9.82 \]


\[ F = \frac{S_1^2}{S_2^2} = \frac{10}{9.82} = 1.018 \]
Calculated F value = 1.018
Tabulated value at 5% level of significance with (9, 11) degrees of freedom = 2.90
Calculated value < Tabulated value,
hence accept $H_0$ (null hypothesis).
3. The time taken by workers in performing a job by method I and method II is given
below.
Method I    20   16   26   27   23   22
Method II   27   33   42   35   32   34   38
Do the data show that the variances of the time distributions of the populations from which
these samples are drawn do not differ significantly?
Solution: Null Hypothesis: $H_0: S_1^2 = S_2^2$ (the two populations have the same variance)
Alternate Hypothesis: $H_1: S_1^2 \neq S_2^2$
x     x - x̄    (x - x̄)²     y     y - ȳ    (y - ȳ)²
20    -2       4            27    -8       64
16    -6       36           33    -2       4
26    4        16           42    7        49
27    5        25           35    0        0
23    1        1            32    -3       9
22    0        0            34    -1       1
                            38    3        9

∑x = 134    ∑(x - x̄)² = 82    ∑y = 241    ∑(y - ȳ)² = 136

Here $\bar{x} = \frac{134}{6} = 22.33$, taken as 22, and $\bar{y} = \frac{241}{7} = 34.428$, taken as 35, for the
deviations in the table.
\[ S_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1} = \frac{82}{5} = 16.4 \]
\[ S_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1} = \frac{136}{6} = 22.67 \]
Since $S_2^2 > S_1^2$, the larger variance goes in the numerator:
\[ F = \frac{S_2^2}{S_1^2} = \frac{22.67}{16.4} = 1.38 \]
Calculated F value = 1.38


Tabulated Value = 4.95 (at 5% level of significance with (6,5) degrees of freedom)
Calculated value < Tabulated value, Accept Ho (Null hypothesis)
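
The table in this solution takes deviations from the rounded means 22 and 35. As a cross-check, the sketch below recomputes the statistic in Python with unrounded means, using `statistics.variance` from the standard library (which divides by n - 1). It gives F of about 1.37, close to the hand-worked value, and the conclusion against the tabulated 4.95 is unchanged. The variable names are illustrative only.

```python
import statistics

method_1 = [20, 16, 26, 27, 23, 22]
method_2 = [27, 33, 42, 35, 32, 34, 38]

# statistics.variance uses divisor n - 1, matching S^2 in the notes.
var_1 = statistics.variance(method_1)
var_2 = statistics.variance(method_2)

# Larger variance in the numerator so that F >= 1.
if var_1 >= var_2:
    f_value = var_1 / var_2
    df = (len(method_1) - 1, len(method_2) - 1)
else:
    f_value = var_2 / var_1
    df = (len(method_2) - 1, len(method_1) - 1)

F_TABLE = 4.95  # tabulated value quoted in the solution for (6, 5) d.f.

print(f"S1^2 = {var_1:.3f}, S2^2 = {var_2:.3f}")
print(f"F = {f_value:.2f} with d.f. {df}")
print("accept H0" if f_value <= F_TABLE else "reject H0")
```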

4. In a test given to two groups of students drawn from two normal populations, the marks
obtained were as follows:
Group A   18   20   36   50   49   36   34   49   41
Group B   29   28   26   35   30   44   46

Examine at the 5% level whether the two populations have the same variance.

Solution: Null Hypothesis: $H_0: S_1^2 = S_2^2$ (the two populations have the same variance)
Alternate Hypothesis: $H_1: S_1^2 \neq S_2^2$

x     x - x̄    (x - x̄)²     y     y - ȳ    (y - ȳ)²
18    -19      361          29    -5       25
20    -17      289          28    -6       36
36    -1       1            26    -8       64
50    13       169          35    1        1
49    12       144          30    -4       16
36    -1       1            44    10       100
34    -3       9            46    12       144
49    12       144
41    4        16

∑x = 333    ∑(x - x̄)² = 1134    ∑y = 238    ∑(y - ȳ)² = 386

Here $\bar{x} = \frac{333}{9} = 37$ and $\bar{y} = \frac{238}{7} = 34$.
\[ S_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1} = \frac{1134}{8} = 141.75 \]
\[ S_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1} = \frac{386}{6} = 64.33 \]
\[ F = \frac{S_1^2}{S_2^2} = \frac{141.75}{64.33} = 2.203 \]
Calculated F value = 2.203
The table value of F at the 5% level for (8, 6) degrees of freedom is 4.15.
Calculated value < Tabulated value,
hence accept the null hypothesis.

Exercises


1. The nicotine content in milligrams of two samples of tobacco was found to be as
follows:
Sample A   24   27   26   21   25
Sample B   27   30   28   31   22   36

Can it be said that the two samples come from normal populations having the same variance?
2. The standard deviations calculated from two random samples of sizes 9 and 13 are 2 and
1.9 respectively. May the samples be regarded as drawn from normal populations with
the same standard deviation?

Video links:

1. Hypothesis Testing - Statistics - YouTube
2. Student's t-test - YouTube
3. Chi-square distribution introduction | Probability and Statistics | Khan Academy - YouTube
4. F-test - YouTube
