UNIT-III Curve Fitting & Sampling, App
Curve fitting
The general problem of finding equations of approximating curves which fit given data is called
curve fitting.
Consider 𝑛 pairs of values (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) of two variables 𝑥 and 𝑦. To get a rough
idea about their relationship if any, we plot the values of 𝑥 and 𝑦 on a suitable scale. The points
(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1,2,3, … , 𝑛 constitute a diagram called scatter diagram and the given data
(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1,2,3, … , 𝑛 is said to be bivariate. An exact relationship between the variables 𝑥 and
𝑦, i.e. of the form 𝑦 = 𝑓(𝑥), which fits the given set of data is called curve fitting. Generally it is not
possible to find a curve which passes through all the given points. We can obtain a relationship
between 𝑥 and 𝑦 in the form of a straight line, a curve of second degree, third degree, etc.,
which gives the best representation of the bivariate distribution. The method of least
squares can be used to obtain such a representation; it is probably the best method to fit a unique
curve to given data. Other methods are the graphical method, the method of group averages and
the method of moments. Here we will discuss the method of least squares only.
The method of least squares is probably the most systematic procedure to fit a unique curve
through the given points.
Let 𝑦 = 𝑓(𝑥) be the equation of curve to be fitted to the given data (observed or experimental
) points (𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3), …, (𝑥𝑛, 𝑦𝑛). At 𝑥 = 𝑥1, the observed (or experimental) value
of the ordinate is 𝑦1 and the corresponding value on the fitted curve is 𝑓(𝑥1). The
difference of the observed and the expected (theoretical) value is the error
𝑒1 = 𝑦1 − 𝑓(𝑥1)
Similarly, 𝑒2 = 𝑦2 − 𝑓(𝑥2 )
𝑒3 = 𝑦3 − 𝑓(𝑥3 )
…………………………………………
…………………………………………
𝑒𝑛 = 𝑦𝑛 − 𝑓(𝑥𝑛 )
Some of the errors 𝑒1, 𝑒2, 𝑒3, …, 𝑒𝑛 will be positive and others negative.
In finding the total error, the errors are added; some negative and some positive
errors may cancel, and in some cases the sum of all the errors may be zero, which leads to a false
result. To avoid such a situation, we make all the errors positive by squaring them.
The curve of best fit is that for which the sum of the squares of errors (S) is minimum. This is
called principle of least squares.
Let 𝑦 = 𝑎 + 𝑏𝑥 (1)
be the straight line to be fitted to the given data points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), (𝑥3 , 𝑦3 ), … , (𝑥𝑛 , 𝑦𝑛 ).
The sum of the squares of the errors is
𝑆 = ∑𝑛𝑖=1(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 )2
For 𝑆 to be minimum,
∂𝑆/∂𝑎 = ∑ 2(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖)(−1) = 0, which implies ∑(𝑦 − 𝑎 − 𝑏𝑥) = 0 (2)
∂𝑆/∂𝑏 = ∑ 2(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖)(−𝑥𝑖) = 0, which implies ∑(𝑥𝑦 − 𝑎𝑥 − 𝑏𝑥²) = 0 (3)
On putting the values of 𝑎 and 𝑏 in (1), we get the equation of required line.
Working rule:
1. Equation (4) is obtained by putting ∑ before all the terms on both sides of (1):
∑𝑦 = ∑𝑎 + ∑𝑏𝑥, i.e. ∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 (4)
2. Equation (5) is obtained on multiplying equation (1) by 𝑥 and then putting ∑ before each
term on both sides:
∑𝑥𝑦 = ∑𝑎𝑥 + ∑𝑏𝑥², i.e. ∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² (5)
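The working rule can be checked numerically. The following is a minimal Python sketch (using NumPy; the function name `fit_line` is ours) that builds and solves the two normal equations directly:

```python
import numpy as np

# Least-squares straight line y = a + b*x via the normal equations
#   sum(y)   = n*a + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x^2)
def fit_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    A = np.array([[n, x.sum()], [x.sum(), (x * x).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [14, 27, 40, 55, 68])
print(a, b)  # a = 0, b = 13.6 for the data of Example 1 below
```

The same coefficients are obtained by `np.polyfit(x, y, 1)`, which minimizes the same sum of squares.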
Example1. By the method of least squares, find the straight line that best fits the following
data:
𝑥 1 2 3 4 5
𝑦 14 27 40 55 68
Solution: Let the equation of the straight line best fit be 𝑦 = 𝑎 + 𝑏𝑥 (1)
𝑥   𝑦   𝑥𝑦   𝑥²
1   14   14   1
2   27   54   4
3   40   120   9
4   55   220   16
5   68   340   25
∑𝑥 = 15   ∑𝑦 = 204   ∑𝑥𝑦 = 748   ∑𝑥² = 55
Here 𝑛 = 5. The normal equations are
∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 (2)
∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² (3)
i.e. 204 = 5𝑎 + 15𝑏 and 748 = 15𝑎 + 55𝑏. Solving these, 𝑎 = 0 and 𝑏 = 13.6. Hence the required line is
𝑦 = 13.6𝑥.
Example2. Use least squares method to fit a curve of the form 𝑦 = 𝑎𝑒 𝑏𝑥 to the following data:
𝑥 1 2 3 4 5 6
𝑦 7.209 5.265 3.846 2.809 2.052 1.499
Solution: 𝑦 = 𝑎𝑒^(𝑏𝑥) (1)
Taking logarithms, log𝑒 𝑦 = log𝑒 𝑎 + 𝑏𝑥 (2)
i.e. 𝑌 = 𝐴 + 𝑏𝑥, where 𝑌 = log𝑒 𝑦 and 𝐴 = log𝑒 𝑎 (3)
𝑥 𝑦 𝑌 = 𝑙𝑜𝑔𝑒 𝑦 𝑥𝑌 𝑥2
1 7.209 1.97533 1.97533 1
2 5.265 1.66108 3.32216 4
3 3.846 1.34703 4.04109 9
4 2.809 1.03283 4.13132 16
5 2.052 0.71881 3.59405 25
6 1.499 0.40480 2.4288 36
∑𝑥 = 21   ∑𝑦 = 22.680   ∑𝑌 = 7.13988   ∑𝑥𝑌 = 19.49275   ∑𝑥² = 91
Here 𝑛 = 6. The normal equations for (3) are
∑𝑌 = 𝑛𝐴 + 𝑏∑𝑥 (4)
∑𝑥𝑌 = 𝐴∑𝑥 + 𝑏∑𝑥² (5)
i.e. 7.13988 = 6𝐴 + 21𝑏 and 19.49275 = 21𝐴 + 91𝑏. Solving,
𝑏 = −0.3141, 𝐴 = 2.28933, so 𝑎 = 𝑒^𝐴 = 9.86832. Hence
𝑦 = 9.86832𝑒^(−0.3141𝑥).
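The log-transform trick of Example 2 is easy to automate. A small Python sketch (NumPy assumed; the function name is ours) fits the line in (𝑥, log 𝑦) space and transforms back:

```python
import numpy as np

# Fit y = a*e^(b*x) by the log transform Y = ln(y) = A + b*x (A = ln a),
# then fit the straight line (A, b) by least squares.
def fit_exponential(x, y):
    x = np.asarray(x, float)
    Y = np.log(np.asarray(y, float))
    b, A = np.polyfit(x, Y, 1)          # polyfit returns [slope, intercept]
    return np.exp(A), b                  # a = e^A

a, b = fit_exponential([1, 2, 3, 4, 5, 6],
                       [7.209, 5.265, 3.846, 2.809, 2.052, 1.499])
print(a, b)  # a close to 9.868, b close to -0.3141, matching Example 2
```

Note this minimizes squared error in log 𝑦, exactly as the worked example does, not in 𝑦 itself.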
In some problems the magnitude of the variables in the given data is so large that the
calculation becomes very tedious. The size of the data can be reduced by assuming some origin
for the 𝑥, 𝑦 series. The problem is further simplified by taking a suitable scale for the values of
𝑥, particularly when these values are equally spaced.
Let ℎ be the width of the interval at which the values of 𝑥 are given and let the origins of 𝑥 and 𝑦
be taken at the points 𝑥0, 𝑦0 respectively. Then, putting
𝑢 = (𝑥 − 𝑥0)/ℎ and 𝑣 = 𝑦 − 𝑦0,
the line and the parabola take the forms 𝑣 = 𝐴 + 𝐵𝑢 and 𝑣 = 𝐴 + 𝐵𝑢 + 𝐶𝑢² respectively.
Example. Fit a straight line to the following data, using a suitable change of origin and scale:
𝑥 0 5 10 15 20 25
𝑦 12 15 17 22 24 30
Solution:
Let 𝑥0 = 12.5, 𝑦0 = 20 (because 𝑥0 = ∑𝑥/𝑁 = 12.5 and 𝑦0 = ∑𝑦/𝑁 = 20).
𝑥   𝑦   𝑢 = (𝑥 − 12.5)/2.5   𝑣 = 𝑦 − 20   𝑢𝑣   𝑢²
0 12 −5 −8 40 25
5 15 −3 −5 15 9
10 17 −1 −3 3 1
15 22 1 2 2 1
20 24 3 4 12 9
25 30 5 10 50 25
∑𝑢 = 0 ∑𝑣 = 0 ∑ 𝑢𝑣 = 122 ∑ 𝑢2 = 70
∑𝑣 = 𝑛𝐴 + 𝐵∑𝑢 (2)
∑𝑢𝑣 = 𝐴∑𝑢 + 𝐵∑𝑢² (3)
i.e. 0 = 6𝐴, so 𝐴 = 0, and 122 = 70𝐵, so 𝐵 = 1.7429. Hence 𝑣 = 1.7429𝑢, i.e. 𝑦 − 20 = 1.7429(𝑥 − 12.5)/2.5, which gives
𝑦 = 0.7𝑥 + 11.285 (approximately).
Fitting of a second degree parabola
Let 𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥² (1)
∑ 𝑦 = 𝑛𝑎 + 𝑏 ∑ 𝑥 + 𝑐 ∑ 𝑥 2 (2)
∑ 𝑥𝑦 = 𝑎 ∑ 𝑥 + 𝑏 ∑ 𝑥 2 + 𝑐 ∑ 𝑥 3 (3)
∑ 𝑥 2𝑦 = 𝑎 ∑ 𝑥 2 + 𝑏 ∑ 𝑥 3 + 𝑐 ∑ 𝑥 4 (4)
On putting the values of 𝑎, 𝑏 and 𝑐 in (1), we get the required equation of parabola.
Notes:-
1. Equation (2) is obtained by putting Σ before each term on both sides of (1).
2. Equation (3) is obtained on multiplying (1) by 𝑥 and putting Σ before each term on both sides
of obtained equation.
3. Equation (4) is obtained on multiplying (1) by 𝑥² and putting ∑ before each term on both
sides of the obtained equation.
Example1. Employ the method of least squares to fit a parabola 𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥 2 to the data:
𝑥 0 1 2 3 4
𝑦 −4 −1 4 11 20
Solution: Here 𝑛 = 5, and from the table of sums:
∑𝑥 = 10, ∑𝑦 = 30, ∑𝑥𝑦 = 120, ∑𝑥² = 30, ∑𝑥²𝑦 = 434, ∑𝑥³ = 100, ∑𝑥⁴ = 354.
∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 + 𝑐∑𝑥² (2)
∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² + 𝑐∑𝑥³ (3)
∑𝑥²𝑦 = 𝑎∑𝑥² + 𝑏∑𝑥³ + 𝑐∑𝑥⁴ (4)
i.e. 30 = 5𝑎 + 10𝑏 + 30𝑐, 120 = 10𝑎 + 30𝑏 + 100𝑐 and 434 = 30𝑎 + 100𝑏 + 354𝑐. Solving these,
𝑎 = −4, 𝑏 = 2, 𝑐 = 1.
Hence the required parabola is 𝑦 = −4 + 2𝑥 + 𝑥².
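The three normal equations of the parabola fit form a 3×3 linear system that can be solved mechanically. A minimal Python sketch (NumPy; `fit_parabola` is our own name):

```python
import numpy as np

# Fit y = a + b*x + c*x^2 by solving the three normal equations directly.
def fit_parabola(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = lambda p: (x ** p).sum()         # power sums sum(x^p)
    A = np.array([[n,    s(1), s(2)],
                  [s(1), s(2), s(3)],
                  [s(2), s(3), s(4)]])
    rhs = np.array([y.sum(), (x * y).sum(), (x * x * y).sum()])
    return np.linalg.solve(A, rhs)       # [a, b, c]

a, b, c = fit_parabola([0, 1, 2, 3, 4], [-4, -1, 4, 11, 20])
print(a, b, c)  # -4, 2, 1 as in Example 1
```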
Example2. Fit a second degree parabola to the data in the table below by the least squares method.
Solution: Taking the origin at 𝑥0 = 1933, 𝑦0 = 357, put 𝑢 = 𝑥 − 1933, 𝑣 = 𝑦 − 357 and fit 𝑣 = 𝐴 + 𝐵𝑢 + 𝐶𝑢².
𝑥   𝑦   𝑢 = 𝑥 − 1933   𝑣 = 𝑦 − 357   𝑢𝑣   𝑢²   𝑢²𝑣   𝑢³   𝑢⁴
1929 352 −4 −5 20 16 −80 −64 256
1930 356 −3 −1 3 9 −9 −27 81
1931 357 −2 0 0 4 0 −8 16
1932 358 −1 1 −1 1 1 −1 1
1933 360 0 3 0 0 0 0 0
1934 361 1 4 4 1 4 1 1
1935 361 2 4 8 4 16 8 16
1936 360 3 3 9 9 27 27 81
1937 359 4 2 8 16 32 64 256
∑𝑢 = 0   ∑𝑣 = 11   ∑𝑢𝑣 = 51   ∑𝑢² = 60   ∑𝑢²𝑣 = −9   ∑𝑢³ = 0   ∑𝑢⁴ = 708
∑ 𝑣 = 𝑛𝐴 + 𝐵 ∑ 𝑢 + 𝐶 ∑ 𝑢2 (2)
∑ 𝑢𝑣 = 𝐴 ∑ 𝑢 + 𝐵 ∑ 𝑢2 + 𝐶 ∑ 𝑢3 (3)
∑ 𝑢2 𝑣 = 𝐴 ∑ 𝑢2 + 𝐵 ∑ 𝑢3 + 𝐶 ∑ 𝑢4 (4)
With 𝑛 = 9, these give
11 = 9𝐴 + 0𝐵 + 60𝐶 (5)
51 = 0𝐴 + 60𝐵 + 0𝐶 (6)
−9 = 60𝐴 + 0𝐵 + 708𝐶 (7)
From (6), 𝐵 = 0.85. Solving (5) and (7), 𝐶 = −0.2673 and 𝐴 = 3.0043. Hence the fitted parabola is
𝑣 = 3.0043 + 0.85𝑢 − 0.2673𝑢², where 𝑢 = 𝑥 − 1933 and 𝑣 = 𝑦 − 357.
Correlation
In our day-to-day life there are situations where one variable depends on the other: for instance,
the heights and weights of a certain group of people, or the records of rainfall and the yields of crops in a
certain period. Such data form what is known as a bivariate distribution.
In a bivariate distribution our object is to discover whether there is any relationship between the
variables under study. The relationship may be of any type, but here we are concerned with the linear
relation only. Whenever two variables are so related that a change in one variable affects the
other, in such a way that an increase in one produces an increase or decrease in the other
variable and vice versa, the variables are said to be correlated. If the two variables move in the same
direction, i.e. if an increase (or decrease) in one variable is accompanied by an increase (or
decrease) in the other, the correlation is said to be positive or direct. On the other hand, if the
variables deviate oppositely, i.e. an increase in one is followed by a decrease in the other and a decrease in
one by an increase in the other, then the correlation is said to be negative or inverse. If the
variables do not exhibit any relationship, the correlation is said to be zero or null correlation.
Correlation may be studied by:
1. Graphic methods: (i) Scatter diagram (ii) Histogram.
2. Numerical methods:
Suppose we are given 𝑛 pairs of values (𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛) of two variables 𝑥 and
𝑦. These values, when plotted on a sheet of paper according to some convenient scale, give us 𝑛 dots, one
for each pair. This graphical representation of the dots is known as a dot diagram or
scatter diagram.
Karl Pearson’s coefficient of correlation
Karl Pearson’s correlation coefficient between two variables 𝑥 and 𝑦, usually denoted by 𝑟(𝑥, 𝑦)𝑜𝑟 𝑟𝑥𝑦 is
a numerical measure of linear relationship between them and is defined as
𝑟(𝑥, 𝑦) or 𝑟𝑥𝑦 = Covariance(𝑥, 𝑦)/(√variance 𝑥 √variance 𝑦)
= [(1/𝑛)∑(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ)] / [√((1/𝑛)∑(𝑥𝑖 − x̄)²) √((1/𝑛)∑(𝑦𝑖 − ȳ)²)]
= ∑(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) / [√∑(𝑥𝑖 − x̄)² √∑(𝑦𝑖 − ȳ)²]
OR 𝑟𝑥𝑦 = [(1/𝑛)∑𝑥𝑦 − ((1/𝑛)∑𝑥)((1/𝑛)∑𝑦)] / [√((1/𝑛)∑𝑥² − ((1/𝑛)∑𝑥)²) √((1/𝑛)∑𝑦² − ((1/𝑛)∑𝑦)²)]
OR 𝑟𝑥𝑦 = [(1/𝑛)∑𝑥𝑦 − x̄ȳ] / [√((1/𝑛)∑𝑥² − x̄²) √((1/𝑛)∑𝑦² − ȳ²)]
= [𝑛∑𝑥𝑦 − ∑𝑥∑𝑦] / [√(𝑛∑𝑥² − (∑𝑥)²) √(𝑛∑𝑦² − (∑𝑦)²)]
Example1. Calculate the coefficient of correlation between the marks obtained by 8 students in
Mathematics and Statistics.
𝑀𝑎𝑡ℎ𝑒𝑚𝑎𝑡𝑖𝑐𝑠 25 30 32 35 37 40 42 45
𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐𝑠 8 10 15 17 20 23 24 25
Solution:
Let the marks of two subjects Mathematics and Statistics be denoted by 𝑥 and 𝑦 respectively.
Let the assumed mean for 𝑥 marks be 35 and that for 𝑦 be 17.
𝑥   𝑦   𝑢 = 𝑥 − 35   𝑣 = 𝑦 − 17   𝑢𝑣   𝑢²   𝑣²
25 8 −10 −9 90 100 81
30 10 −5 −7 35 25 49
32 15 −3 −2 6 9 4
35 17 0 0 0 0 0
37 20 2 3 6 4 9
40 23 5 6 30 25 36
42 24 7 7 49 49 49
45 25 10 8 80 100 64
∑𝑢 = 6   ∑𝑣 = 6   ∑𝑢𝑣 = 296   ∑𝑢² = 312   ∑𝑣² = 292
𝑛 = 8
𝑟𝑥𝑦 = 𝑟𝑢𝑣 = [𝑛∑𝑢𝑣 − ∑𝑢∑𝑣] / [√(𝑛∑𝑢² − (∑𝑢)²) √(𝑛∑𝑣² − (∑𝑣)²)]
= (8×296 − 6×6) / (√(8×312 − 36) √(8×292 − 36)) = 2332/(√2460 √2300) = 0.9803.
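The product-moment formula translates directly into code. The following Python sketch (NumPy; `pearson_r` is our own helper name) computes 𝑟 from the raw marks of Example 1; `np.corrcoef` computes the same quantity:

```python
import numpy as np

# Karl Pearson's correlation coefficient from raw data, using
#   r = (n*sum(xy) - sum(x)*sum(y)) /
#       [sqrt(n*sum(x^2) - (sum x)^2) * sqrt(n*sum(y^2) - (sum y)^2)]
def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt(n * (x * x).sum() - x.sum() ** 2) * \
          np.sqrt(n * (y * y).sum() - y.sum() ** 2)
    return num / den

maths = [25, 30, 32, 35, 37, 40, 42, 45]
stats = [8, 10, 15, 17, 20, 23, 24, 25]
r = pearson_r(maths, stats)
print(r)  # close to 0.9803
```

Since 𝑟 is invariant under a change of origin, working with 𝑢 = 𝑥 − 35, 𝑣 = 𝑦 − 17 gives the same value.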
Example2. A computer, while calculating the correlation coefficient between two variables 𝑥 and 𝑦 from 25
pairs of observations, obtained the following results:
𝑛 = 25, ∑𝑥 = 125, ∑𝑦 = 100, ∑𝑥² = 650, ∑𝑦² = 460, ∑𝑥𝑦 = 508.
It was, however, later discovered at the time of checking that two pairs had been copied down as
(𝑥, 𝑦) = (6, 14) and (8, 6), while the correct values were (8, 12) and (6, 8).
Obtain the correct value of the correlation coefficient.
Solution: Corrected ∑𝑥 = 125 − (6 + 8) + (8 + 6) = 125, so x̄ = 5;
corrected ∑𝑦 = 100 − (14 + 6) + (12 + 8) = 100, so ȳ = 4;
corrected ∑𝑥² = 650 − (36 + 64) + (64 + 36) = 650;
corrected ∑𝑦² = 460 − (196 + 36) + (144 + 64) = 436;
corrected ∑𝑥𝑦 = 508 − (84 + 48) + (96 + 48) = 520.
Corrected 𝑟(𝑥, 𝑦) = [(1/𝑛)∑𝑥𝑦 − x̄ȳ] / [√((1/𝑛)∑𝑥² − x̄²) √((1/𝑛)∑𝑦² − ȳ²)]
= [(1/25)×520 − 5×4] / [√((1/25)×650 − 25) √((1/25)×436 − 16)] = 0.8/(1 × 1.2) = 0.67.
Spearman’s Rank Correlation
The coefficient of rank correlation is applied to problems in which the data cannot be measured
quantitatively but qualitative assessment is possible, such as beauty, honesty, etc. In this case the best
individual is given rank no. 1, the next rank no. 2, and so on.
Spearman's rank correlation coefficient 𝑟 = 1 − 6∑𝑑²/[𝑛(𝑛² − 1)], where 𝑑 = 𝑅𝑥 − 𝑅𝑦 is the difference of ranks.
Example1. Obtain the rank correlation coefficient for the following data:
𝑥 10 15 12 17 13 16 24 14 22
𝑦 30 42 45 46 33 34 40 35 39
Solution:
First we write ranks in each series, the item with the largest size is ranked 1, next largest 2 and so on.
𝑥 𝑦 𝑅𝑥 𝑅𝑦 𝑑 = 𝑅𝑥 − 𝑅𝑦 𝑑2
10 30 9 9 0 0
15 42 5 3 2 4
12 45 8 2 6 36
17 46 3 1 2 4
13 33 7 8 −1 1
16 34 4 7 −3 9
24 40 1 4 −3 9
14 35 6 6 0 0
22 39 2 5 −3 9
∑ 𝑑 2 = 72
Therefore, with 𝑛 = 9,
𝑟 = 1 − 6∑𝑑²/[𝑛(𝑛² − 1)] = 1 − (6×72)/(9×(81 − 1)) = 1 − 0.6 = 0.4.
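The ranking-and-differencing procedure of Example 1 can be sketched in Python (no ties here, so the simple formula applies; `spearman_r` and `ranks` are our own names):

```python
import numpy as np

# Spearman's rank correlation: rank each series (1 = largest, as in the
# text), then r = 1 - 6*sum(d^2) / (n*(n^2 - 1)).  This simple formula
# assumes no tied values; ties need averaged ranks and a correction term.
def spearman_r(x, y):
    def ranks(a):
        # argsort of argsort gives 0-based rank positions; negate so the
        # largest item receives rank 1
        return np.argsort(np.argsort(-np.asarray(a, float))) + 1
    d = ranks(x) - ranks(y)
    n = len(d)
    return 1 - 6 * (d ** 2).sum() / (n * (n * n - 1))

x = [10, 15, 12, 17, 13, 16, 24, 14, 22]
y = [30, 42, 45, 46, 33, 34, 40, 35, 39]
rs = spearman_r(x, y)
print(rs)  # 0.4, as in Example 1
```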
Example2. Obtain the rank correlation coefficient for the following data:
𝑥 68 64 75 50 64 80 75 40 55 64
𝑦 62 58 68 45 81 60 68 48 50 70
Solution:
𝑥 𝑦 𝑅𝑥 𝑅𝑦 𝑑 = 𝑅𝑥 − 𝑅𝑦 𝑑2
68 62 4 5 −1 1
64 58 6 7 −1 1
75 68 2.5 3.5 −1 1
50 45 9 10 −1 1
64 81 6 1 5 25
80 60 1 6 −5 25
75 68 2.5 3.5 −1 1
40 48 10 9 1 1
55 50 8 8 0 0
64 70 6 2 4 16
∑𝑑² = 72. Here 𝑛 = 10, and the series contain tied ranks: 64 occurs three times in 𝑥, 75 twice in 𝑥 and 68 twice in 𝑦. For each group of 𝑚 tied items, the factor 𝑚(𝑚² − 1)/12 is added to ∑𝑑²; the correction is therefore 2 + 0.5 + 0.5 = 3.
Therefore,
𝑟 = 1 − 6(∑𝑑² + 3)/[𝑛(𝑛² − 1)] = 1 − (6×75)/(10×99) = 1 − 450/990 = 0.545.
REGRESSION
If the scatter diagram indicates some relationship between two variables 𝑥 and 𝑦, then the dots
of the scatter diagram will be concentrated round a curve. This curve is called the curve of
regression.
Regression analysis is the method used for estimating the unknown values of one variable
corresponding to the known value of another variable.
Line of regression
When the curve is a straight line, it is called a line of regression. A line of regression is the
straight line which gives the best fit, in the least squares sense, to the given data.
Regression is called non-linear if there exists a relationship (a parabola, etc.) other than a
straight line between the variables under consideration.
The line of regression of 𝑦 on 𝑥 is 𝑦 − ȳ = 𝑏𝑦𝑥(𝑥 − x̄),
and the line of regression of 𝑥 on 𝑦 is 𝑥 − x̄ = 𝑏𝑥𝑦(𝑦 − ȳ),
where 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦 are the regression coefficients.
Note:-
1. Two lines of regression pass through the point (𝑥̅ , 𝑦̅) i.e., the means of 𝑥 and 𝑦 series. The
point of intersection of these two lines gives the two means 𝑥̅ and 𝑦̅ .
2. The coefficient of correlation is the geometric mean of the two regression coefficients:
𝑏𝑦𝑥 × 𝑏𝑥𝑦 = (𝑟𝜎𝑦/𝜎𝑥) × (𝑟𝜎𝑥/𝜎𝑦) = 𝑟²,
so that √(𝑏𝑦𝑥 × 𝑏𝑥𝑦) = √𝑟² = 𝑟 = coefficient of correlation.
3. If one of the regression coefficients is greater than unity, then the other must be less than unity.
The regression coefficients are 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦.
Let 𝑏𝑦𝑥 > 1; then 1/𝑏𝑦𝑥 < 1. (i)
Since 𝑏𝑦𝑥 × 𝑏𝑥𝑦 = (𝑟𝜎𝑦/𝜎𝑥) × (𝑟𝜎𝑥/𝜎𝑦) = 𝑟² ≤ 1 (because −1 ≤ 𝑟 ≤ 1),
it follows that 𝑏𝑥𝑦 ≤ 1/𝑏𝑦𝑥 < 1.
(Also, when 𝑟 > 0 the arithmetic mean of the regression coefficients is not less than 𝑟: (𝑏𝑦𝑥 + 𝑏𝑥𝑦)/2 ≥ 𝑟 reduces to (𝜎𝑥 − 𝜎𝑦)² ≥ 0, which is true.)
4. Effect of change of origin and scale: if 𝑢 = (𝑥 − 𝑎)/ℎ and 𝑣 = (𝑦 − 𝑏)/𝑘, then
𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 = 𝑟(𝑘𝜎𝑣)/(ℎ𝜎𝑢) = (𝑘/ℎ)(𝑟𝜎𝑣/𝜎𝑢) = (𝑘/ℎ)𝑏𝑣𝑢. Similarly 𝑏𝑥𝑦 = (ℎ/𝑘)𝑏𝑢𝑣.
Thus 𝑏𝑦𝑥 and 𝑏𝑥𝑦 are both independent of 𝑎 and 𝑏 but not of ℎ and 𝑘.
5. The correlation coefficient and the two regression coefficients have same sign.
Regression coefficient of 𝑦 on 𝑥 is 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥.
Regression coefficient of 𝑥 on 𝑦 is 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦.
Since 𝜎𝑥 and 𝜎𝑦 are both positive, 𝑏𝑦𝑥, 𝑏𝑥𝑦 and 𝑟 must have the same sign.
Example1. If 𝜃 be the acute angle between the two regression lines in the case of two variables
𝑥 and 𝑦, show that
tan𝜃 = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²), where 𝑟, 𝜎𝑥, 𝜎𝑦 have their usual meanings. Explain the significance when
𝑟 = 0 and 𝑟 = ±1.
Solution: We know that the angle 𝜃 between two straight lines with slopes 𝑚1 and 𝑚2 is given by
tan𝜃 = |(𝑚1 − 𝑚2)/(1 + 𝑚1𝑚2)|. Here the lines of regression are
𝑦 − ȳ = 𝑏𝑦𝑥(𝑥 − x̄) (1)
𝑥 − x̄ = 𝑏𝑥𝑦(𝑦 − ȳ) (2)
Slope of (1) is 𝑚1 = 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and
slope of (2) is 𝑚2 = 1/𝑏𝑥𝑦 = 𝜎𝑦/(𝑟𝜎𝑥). Therefore
tan𝜃 = [𝜎𝑦/(𝑟𝜎𝑥) − 𝑟𝜎𝑦/𝜎𝑥] / [1 + 𝜎𝑦²/𝜎𝑥²] = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²). (3)
(i) If 𝑟 = 0 then tan𝜃 = ∞, i.e. 𝜃 = π/2, so the lines of regression are at right angles. There is no
relationship between the two variables and they are independent or uncorrelated.
(ii) If 𝑟 = 1 𝑜𝑟 − 1, 𝑡𝑎𝑛𝜃 = 0 𝑂𝑅 𝜃 = 0. Therefore, the two lines of regression are coincident or parallel
and the correlation is perfect. Since the two lines pass through the common point (𝑥̅ , 𝑦̅), they cannot
be parallel. Hence they are coincident. Alternately the sum of the squares of deviation from
either line of regression is zero. Hence each deviation is zero and all the points lie on both the
lines of regression which coincide, and the correlation between the variables is perfect.
Example2. If the coefficient of correlation between two variables 𝑥 and 𝑦 is 0.5 and the acute angle
between their lines of regression is tan⁻¹(3/5), show that 𝜎𝑥 = (1/2)𝜎𝑦.
Solution: Here we have 𝑟 = 0.5 and 𝜃 = tan⁻¹(3/5), which implies tan𝜃 = 3/5. Using
tan𝜃 = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²), (1)
3/5 = [(1 − 1/4)/(1/2)] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²) = (3/2) · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²),
which implies 2𝜎𝑥² + 2𝜎𝑦² − 5𝜎𝑥𝜎𝑦 = 0, i.e. (2𝜎𝑥 − 𝜎𝑦)(𝜎𝑥 − 2𝜎𝑦) = 0.
Hence 2𝜎𝑥 − 𝜎𝑦 = 0 (2) or 𝜎𝑥 − 2𝜎𝑦 = 0 (3).
From (2), 𝜎𝑥 = (1/2)𝜎𝑦.
Example3. The two lines of regression of a bivariate distribution are
5𝑦 − 8𝑥 + 17 = 0 (1)
2𝑦 − 5𝑥 + 14 = 0 (2)
and 𝜎𝑦² = 16. Find (i) the mean values of 𝑥 and 𝑦, (ii) the coefficient of correlation, (iii) 𝜎𝑥².
Solution: (i) Since (x̄, ȳ) is a common point of the two lines of regression, we have
5ȳ − 8x̄ + 17 = 0 and 2ȳ − 5x̄ + 14 = 0. Solving these, x̄ = 4, ȳ = 3.
(ii) Taking (1) as the line of regression of 𝑦 on 𝑥 and (2) as that of 𝑥 on 𝑦,
𝑏𝑦𝑥 = 8/5 and 𝑏𝑥𝑦 = 2/5, i.e. 𝑟𝜎𝑦/𝜎𝑥 = 8/5 and 𝑟𝜎𝑥/𝜎𝑦 = 2/5.
On multiplying these, we get 𝑟² = 16/25 < 1, therefore 𝑟 = ±4/5.
Now we have to determine the sign of 𝑟; as 𝑏𝑦𝑥 and 𝑏𝑥𝑦 are both positive, 𝑟 is also
positive. Therefore 𝑟 = 4/5.
(iii) We are given 𝜎𝑦² = 16, therefore 𝜎𝑦 = 4, and 𝑟𝜎𝑦/𝜎𝑥 = 8/5 implies
(4/5) × (4/𝜎𝑥) = 8/5.
Therefore 𝜎𝑥 = 2, i.e. 𝜎𝑥² = 4.
Example4. Find the coefficient of correlation and lines of regression to the following data:
𝑥 5 7 8 10 11 13 16
𝑦 33 30 28 20 18 16 9
Solution:
Here 𝑛 = 7,
x̄ = ∑𝑥/𝑛 = 70/7 = 10 and ȳ = ∑𝑦/𝑛 = 154/7 = 22.
𝑥 𝑦 𝑋 = 𝑥 − 10 𝑌 = 𝑦 − 22 𝑋𝑌 𝑋2 𝑌2
5 33 −5 11 −55 25 121
7 30 −3 8 −24 9 64
8 28 −2 6 −12 4 36
10 20 0 −2 0 0 4
11 18 1 −4 −4 1 16
13 16 3 −6 −18 9 36
16 9 6 −13 −78 36 169
∑𝑥 = 70   ∑𝑦 = 154   ∑𝑋𝑌 = −191   ∑𝑋² = 84   ∑𝑌² = 446
Coefficient of correlation
𝑟 = ∑𝑋𝑌/(√∑𝑋² √∑𝑌²) = −191/(√84 √446) = −0.9868.
𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 = 𝑟√(∑𝑌²/∑𝑋²) = −0.9868 × √(446/84) = −2.2738 and
𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦 = 𝑟√(∑𝑋²/∑𝑌²) = −0.9868 × √(84/446) = −0.4283.
Therefore the lines of regression 𝑦 − 22 = −2.2738(𝑥 − 10) and 𝑥 − 10 = −0.4283(𝑦 − 22) give
𝑦 = −2.2738𝑥 + 44.738
𝑥 = −0.4283𝑦 + 19.4226.
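Both regression coefficients come from the same deviation sums, so the whole of Example 4 reduces to a few lines of Python (NumPy; `regression_lines` is our own helper name):

```python
import numpy as np

# Regression coefficients from deviations about the means:
#   b_yx = sum(XY)/sum(X^2),  b_xy = sum(XY)/sum(Y^2)
# (equivalent to r*sigma_y/sigma_x and r*sigma_x/sigma_y).
def regression_lines(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    X, Y = x - x.mean(), y - y.mean()
    byx = (X * Y).sum() / (X * X).sum()
    bxy = (X * Y).sum() / (Y * Y).sum()
    return byx, bxy, x.mean(), y.mean()

x = [5, 7, 8, 10, 11, 13, 16]
y = [33, 30, 28, 20, 18, 16, 9]
byx, bxy, xbar, ybar = regression_lines(x, y)
print(byx, bxy)                 # close to -2.2738 and -0.4283
intercept = ybar - byx * xbar   # intercept of the y-on-x line
print(intercept)                # close to 44.738
```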
Multiple Linear regression
There are a number of situations where the dependent variable is a function of two or more independent
variables, either linear or non-linear. Here we shall discuss an approach to fit experimental data where
the variable under consideration is a linear function of two independent variables.
Let the regression plane be
𝑦 = 𝑎1 + 𝑎2𝑥 + 𝑎3𝑧 (1)
and let
𝑆 = ∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)² (2)
For 𝑆 to be minimum,
∂𝑆/∂𝑎1 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖) = 0
∂𝑆/∂𝑎2 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)𝑥𝑖 = 0
∂𝑆/∂𝑎3 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)𝑧𝑖 = 0
which give the normal equations
∑𝑦𝑖 = 𝑛𝑎1 + 𝑎2∑𝑥𝑖 + 𝑎3∑𝑧𝑖
∑𝑥𝑖𝑦𝑖 = 𝑎1∑𝑥𝑖 + 𝑎2∑(𝑥𝑖)² + 𝑎3∑𝑥𝑖𝑧𝑖   (3)
∑𝑧𝑖𝑦𝑖 = 𝑎1∑𝑧𝑖 + 𝑎2∑𝑥𝑖𝑧𝑖 + 𝑎3∑(𝑧𝑖)²
In matrix form,
[ ∑𝑦𝑖   ]   [ 𝑛      ∑𝑥𝑖      ∑𝑧𝑖    ] [ 𝑎1 ]
[ ∑𝑥𝑖𝑦𝑖 ] = [ ∑𝑥𝑖    ∑(𝑥𝑖)²   ∑𝑥𝑖𝑧𝑖 ] [ 𝑎2 ]   (4)
[ ∑𝑧𝑖𝑦𝑖 ]   [ ∑𝑧𝑖    ∑𝑥𝑖𝑧𝑖    ∑(𝑧𝑖)² ] [ 𝑎3 ]
Solving for 𝑎1, 𝑎2, 𝑎3 gives the required surface 𝑦 = 𝑎1 + 𝑎2𝑥 + 𝑎3𝑧. This is a two-dimensional case, and therefore we obtain a regression plane rather than a regression line.
Example1. Obtain a regression plane by using multiple linear regression to fit the data given below:
𝑥 1 2 3 4
𝑧 0 1 2 3
𝑦 12 18 24 30
Solution: Let 𝑦 = 𝑎1 + 𝑎2 𝑥 + 𝑎3 𝑧 be the regression plane where 𝑎1 , 𝑎2 , 𝑎3 are determined by using the
following equations:
∑ 𝑦𝑖 = 𝑛𝑎1 + 𝑎2 ∑ 𝑥𝑖 + 𝑎3 ∑ 𝑧𝑖
∑ 𝑥𝑖 𝑦𝑖 = 𝑎1 ∑ 𝑥𝑖 + 𝑎2 ∑(𝑥𝑖 )2 + 𝑎3 ∑ 𝑥𝑖 𝑧𝑖 (1)
∑ 𝑧𝑖 𝑦𝑖 = 𝑎1 ∑ 𝑧𝑖 + 𝑎2 ∑ 𝑥𝑖 𝑧𝑖 + 𝑎3 ∑(𝑧𝑖 )2
where the summations run from 1 to 𝑛 (= 4) and the various sums are given in the table:
𝑥𝑖 𝑧𝑖 𝑦𝑖 (𝑥𝑖 )2 (𝑧𝑖 )2 𝑥𝑖 𝑦𝑖 𝑥𝑖 𝑧𝑖 𝑧𝑖 𝑦𝑖
1 0 12 1 0 12 0 0
2 1 18 4 1 36 2 18
3 2 24 9 4 72 6 48
4 3 30 16 9 120 12 90
∑ 𝑥𝑖 ∑ 𝑧𝑖 ∑ 𝑦𝑖 ∑(𝑥𝑖 )2 ∑(𝑧𝑖 )2 ∑ 𝑥𝑖 𝑦𝑖 ∑ 𝑥𝑖 𝑧𝑖 ∑ 𝑧𝑖 𝑦𝑖
= 10 =6 = 84 = 30 = 14 = 240 = 20 = 156
Substituting the various values in (1) and solving, we obtain
𝑎1 = 10, 𝑎2 = 2, 𝑎3 = 4.
Hence the regression plane is 𝑦 = 10 + 2𝑥 + 4𝑧.
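The same normal equations can be solved in Python with `numpy.linalg.lstsq` on the design matrix [1, 𝑥, 𝑧]. A caveat for this particular data set: since 𝑧 = 𝑥 − 1 here, the columns are collinear and the coefficients are not unique (the text's 𝑎1 = 10, 𝑎2 = 2, 𝑎3 = 4 is one valid set); `lstsq` returns one such set, and the fitted plane reproduces 𝑦 exactly either way:

```python
import numpy as np

# Regression plane y = a1 + a2*x + a3*z fitted by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([12.0, 18.0, 24.0, 30.0])

D = np.column_stack([np.ones_like(x), x, z])   # design matrix [1, x, z]
coef, *_ = np.linalg.lstsq(D, y, rcond=None)   # minimum-norm LS solution
fitted = D @ coef
print(fitted)  # fitted values reproduce y: 12, 18, 24, 30
```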
SAMPLING DISTRIBUTION
Population (Universe)
The group of individuals under study is called population or universe. It may be finite or infinite.
Sampling
A part selected from population is called a sample. The process of selection of a sample is called
sampling. A random sample is one in which each member of population has an equal chance of
being included in it. There are 𝐶 (𝑁, 𝑛) different samples of size 𝑛 that can be picked up from a
population of size 𝑁.
The statistical constants of the population, such as the mean (𝜇) and standard deviation (𝜎), are called
parameters.
The mean (x̄) and standard deviation (𝑠) of a sample are known as statistics.
Aims of a sample
The population parameters are generally not known. The sample characteristics are then
utilized to approximately determine, or estimate, those of the population. Thus, a statistic is an estimate of
a parameter. To what extent can we depend on the sample estimates?
The estimation of the mean and standard deviation of the population is a primary purpose of all
scientific experimentation. The logic of sampling theory is the logic of induction: we pass
from the particular (sample) to the general (population). This type of generalization is known
as statistical inference. The conclusions in sampling studies are based not on certainties but on
probabilities.
Types of sampling
1. Purposive sampling
2. Random sampling
3. Stratified sampling
4. Systematic sampling
Sampling distribution
From a population, a number of samples of equal size 𝑛 are drawn, and the mean of each
sample is found. These sample means are not all equal. The means, with their respective frequencies, are
grouped; the frequency distribution so formed is known as the sampling distribution of the mean.
Similarly, we can form the sampling distribution of the standard deviation.
Standard error
Standard error is the standard deviation of the sampling distribution of a statistic. For assessing
the difference between the expected value and observed value, standard error is used.
Reciprocal of the standard error is known as precision. It plays an important role in the theory
of large samples and it forms a basis of the testing of hypotheses. For large samples of size 𝑛 drawn from a population with standard deviation 𝜎, the standard errors of some well-known statistics are:
(i) Mean: 𝜎/√𝑛.
(ii) S.D.: 𝜎/√(2𝑛).
(iii) Variance: 𝜎²√(2/𝑛).
Let the population be infinitely large and having a population mean of 𝜇 and a population
variance of 𝜎 2 . If 𝑥 is a random variable denoting the measurement of the characteristic, then
Expected value of 𝑥, 𝐸 (𝑥 ) = 𝜇
Variance of 𝑥, 𝑉𝑎𝑟(𝑥) = 𝜎 2
The sample mean x̄ is the sum of 𝑛 random variables 𝑥1, 𝑥2, …, 𝑥𝑛, each divided by 𝑛,
where 𝑥1, 𝑥2, …, 𝑥𝑛 are independent observations from the infinitely large population. Hence
𝐸(x̄) = (1/𝑛)[𝐸(𝑥1) + 𝐸(𝑥2) + ⋯ + 𝐸(𝑥𝑛)] = (1/𝑛)(𝑛𝜇) = 𝜇, and
𝑉𝑎𝑟(x̄) = (1/𝑛²)𝑉𝑎𝑟(𝑥1) + (1/𝑛²)𝑉𝑎𝑟(𝑥2) + ⋯ + (1/𝑛²)𝑉𝑎𝑟(𝑥𝑛) = 𝑛𝜎²/𝑛² = 𝜎²/𝑛.
The expected value of the sample mean is the same as population mean. The variance of the
sample mean is the variance of the population divided by the sample size.
The average value of the sample tends to the true population mean. If the sample size 𝑛 is increased,
the variance 𝜎²/𝑛 of x̄ gets reduced; by taking 𝑛 large enough, the variance of x̄ can be
made as small as desired. The standard deviation 𝜎/√𝑛 of x̄ is also called the standard error of the
mean. It is denoted by 𝜎x̄.
Sampling from normal population
If 𝑥 ~ 𝑁(𝜇, 𝜎²) then it follows that x̄ ~ 𝑁(𝜇, 𝜎²/𝑛).
Example. The diameter of a component is normally distributed with mean 10 and variance 0.01. A random sample of 5 components is taken. Find the probability that the sample mean diameter lies between 9.95 and 10.05.
Solution: Let 𝑥 be a random variable representing the diameter of one component picked up at
random.
Here 𝑥 ~ 𝑁(10, 0.01), therefore x̄ ~ 𝑁(10, 0.01/5), because x̄ ~ 𝑁(𝜇, 𝜎²/𝑛).
With 𝑧 = (x̄ − 𝜇)/(𝜎/√𝑛), the limit x̄ = 10.05 gives 𝑧 = 0.05/√(0.01/5) = 1.118, so
Probability{9.95 ≤ x̄ ≤ 10.05} = 2 × Probability{10 ≤ x̄ ≤ 10.05} = 2 × Probability{0 ≤ 𝑧 ≤ 1.118} ≈ 2 × 0.368 = 0.736 (from normal tables).
We use a sample statistic called the sample variance to estimate the population variance. The
sample variance is usually denoted by 𝑠² and is given by
𝑠² = ∑(𝑥 − x̄)²/(𝑛 − 1).
The central limit theorem says that the sampling distribution of the mean will always be normally
distributed, as long as the sample size is large enough, regardless of whether the population
has a normal, Poisson, binomial or any other distribution.
OR
The central limit theorem states that if you take sufficiently large samples from a population,
the sample means will be normally distributed, even if the population is not normally
distributed.
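The theorem is easy to see empirically. The following Python sketch (NumPy; the sample size and trial count are our own illustrative choices) draws many samples from a clearly non-normal exponential population and checks that the sample means cluster around 𝜇 with spread close to 𝜎/√𝑛:

```python
import numpy as np

# Empirical check of the central limit theorem: means of samples drawn
# from a (non-normal) exponential population cluster around the
# population mean mu with spread close to sigma/sqrt(n).
rng = np.random.default_rng(0)
n, trials = 50, 20000
samples = rng.exponential(scale=1.0, size=(trials, n))  # mu = sigma = 1
means = samples.mean(axis=1)

print(means.mean())  # close to mu = 1
print(means.std())   # close to sigma/sqrt(n) = 1/sqrt(50), about 0.1414
```

A histogram of `means` would look approximately bell-shaped even though the population itself is heavily skewed.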
Testing a Hypothesis:
Very often it is required to make decisions about populations on the basis of sample
information. Such decisions are called statistical decisions, and the assumptions underlying them are
statistical hypotheses. These hypotheses are tested: assuming the hypothesis is correct, we calculate
the probability of getting the observed sample, and if this probability is less than a certain assigned
value, the hypothesis is rejected.
The analysis of a problem is based on a hypothesis. The null hypothesis is the hypothesis of no
difference: we assume that there is no significant difference between the observed
value and the expected value, and then test whether this assumption is satisfied by the data.
If the hypothesis is not supported, the difference is considered significant; if it is supported,
the difference is ascribed to sampling fluctuation.
The null hypothesis is denoted by 𝐻0.
Level of significance
Critical regions covering, for example, 5% and 1% of the area of the normal curve are chosen.
The probability of the value of the variate falling in the critical region is the level of
significance. If the variate falls in the critical region, the hypothesis is rejected.
Test of significance
The tests which enable us to decide whether to accept or to reject the null hypothesis are called
tests of significance. If the difference between the sample values and the population values
is so large that it lies in the critical region, the null hypothesis is rejected.
Confidence limits
𝜇 − 1.96𝜎, 𝜇 + 1.96𝜎 are 95% confidence limits, as the area between 𝜇 − 1.96𝜎 and 𝜇 + 1.96𝜎
is 95%. If a sample statistic lies in the interval 𝜇 − 1.96𝜎 to 𝜇 + 1.96𝜎, we call it a 95% confidence
interval.
Similarly, 𝜇 − 2.58𝜎 and 𝜇 + 2.58𝜎 is 99% confidence limits as the area between 𝜇 − 2.58𝜎
and 𝜇 + 2.58𝜎 is 99%. The numbers 1.96, 2.58 are called confidence coefficients.
Normal distribution is the limiting case of the binomial distribution when 𝑛 is large enough. For a
normal distribution, 5% of the items lie outside 𝜇 ± 1.96𝜎 while only 1% of the items lie outside
𝜇 ± 2.58𝜎.
𝑧 = (𝑥 − 𝜇)/𝜎,
where 𝑧 is the standard normal variate and 𝑥 is the observed number of successes (so that 𝜇 = 𝑛𝑝 and 𝜎 = √(𝑛𝑝𝑞)).
First we find the value of 𝑧; the test of significance depends upon it.
(i) If |𝑧| < 1.96, the difference between the observed and expected number of successes is not
significant at the 5% level of significance; if |𝑧| > 1.96, it is significant at that level.
(ii) If |𝑧| < 2.58, the difference between the observed and expected number of successes is not
significant at the 1% level of significance; if |𝑧| > 2.58, it is significant at that level.
Example2. A cubical die was thrown 9000 times and 1 or 6 was obtained 3120 times. Can the
deviation from the expected value be due to fluctuations of sampling?
Solution: Let us consider the hypothesis that the die is unbiased. Then the
probability of obtaining 1 or 6 is 𝑝 = 2/6 = 1/3, and 𝑞 = 2/3.
The expected number of successes = 𝑛𝑝 = 9000 × (1/3) = 3000.
Also 𝜎 = S.D. = √(𝑛𝑝𝑞) = √(9000 × (1/3) × (2/3)) = √2000 = 44.72, so that
3𝜎 = 3 × 44.72 = 134.16.
The difference between the actual and the expected number of successes = 3120 − 3000 = 120, which is less than 3𝜎 = 134.16.
Hence, the hypothesis is correct, and the deviation is due to fluctuations of sampling arising from
random causes.
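The standard-normal form of the same test can be sketched in Python (the helper name `z_successes` is ours):

```python
import math

# Large-sample test for the number of successes: under H0 the count is
# approximately N(n*p, n*p*q), so z = (observed - n*p)/sqrt(n*p*q).
def z_successes(observed, n, p):
    q = 1 - p
    return (observed - n * p) / math.sqrt(n * p * q)

# Die example: 1 or 6 in 9000 throws, observed 3120 times, p = 1/3.
z = z_successes(3120, 9000, 1 / 3)
print(z)  # about 2.68, i.e. a deviation of 120 against sigma = 44.72
```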
(a) Standard error of the number of successes = √(𝑛𝑝𝑞).
(b) Standard deviation (standard error) of the proportion of successes = √(𝑛𝑝𝑞)/𝑛 = √(𝑝𝑞/𝑛).
(c) Precision of the proportion of successes = 1/S.E. = √(𝑛/𝑝𝑞).
Example3. A group of scientific men reported 1705 sons and 1527 daughters. Do these figures
conform to the hypothesis that the sex ratio is 1/2?
Solution: On the given hypothesis, the male ratio 𝑝 = 1/2 = 0.5, and here 𝑛 = 1705 + 1527 = 3232.
The observed male proportion = 1705/3232 = 0.5275. Thus, the difference between the observed ratio
and the theoretical ratio = 0.5275 − 0.5 = 0.0275.
The standard deviation of the proportion = 𝑠 = √(𝑝𝑞/𝑛) = √((1/2 × 1/2)/3232) = 0.0088.
Since the difference 0.0275 exceeds 3𝑠 = 0.0264, it can definitely be said that the given figures do not
conform to the given hypothesis.
The mean, standard deviation, etc. of the population are known as parameters; they are
denoted by 𝜇 and 𝜎. Their estimates are based on the sample values. The mean and standard
deviation of a sample are denoted by x̄ and 𝑠 respectively. Thus, a statistic is an estimate of a
parameter. There are two types of estimates.
(i) Point estimation: an estimate of a population parameter given by a single number is called a
point estimate of the parameter. For example,
(S.D.)² = ∑(𝑥 − x̄)²/(𝑛 − 1) is a point estimate of the population variance.
(ii) Interval estimation: an interval in which the population parameter may be expected to lie with a
given degree of confidence. For instance,
x̄ ± 1.96𝜎𝑠 and x̄ ± 2.58𝜎𝑠 are 95% and 99% confidence limits for 𝜇.
x̄ ± 1.96𝜎/√𝑛 and x̄ ± 2.58𝜎/√𝑛 are the same intervals, as 𝜎𝑠 = 𝜎/√𝑛.
Test of significance of large samples
Let 𝑥̅1 be the mean of a sample of size 𝑛1 from a population with mean 𝜇1 , and variance 𝜎12 . Let
𝑥̅ 2 be the mean of an independent sample of size 𝑛2 from another population with mean 𝜇2
and variance 𝜎22 . The test statistic is given by
𝑧 = (x̄1 − x̄2)/√(𝜎1²/𝑛1 + 𝜎2²/𝑛2)
Under the null hypothesis that the samples are drawn from the same population, where 𝜎1 =
𝜎2 = 𝜎 and 𝜇1 = 𝜇2, the test statistic is given by
𝑧 = (x̄1 − x̄2)/[𝜎√(1/𝑛1 + 1/𝑛2)].
Note:- When the population standard deviations are not known, 𝜎² is estimated from the sample standard deviations and
𝑧 = (x̄1 − x̄2)/√[((𝑛1𝑠1² + 𝑛2𝑠2²)/(𝑛1 + 𝑛2))(1/𝑛1 + 1/𝑛2)].
Example1. The average income of persons was Rs. 210 with a S.D. of Rs. 10 in sample of 100
people of a city. For another sample of 150 persons, the average income was Rs. 220 with S.D.
of Rs. 12. The S.D. of incomes of the people of the city was Rs. 11. Test whether there is any
significant difference between the average incomes of the localities.
Solution: Null hypothesis: the difference is not significant, i.e. there is no difference between the
incomes of the localities:
𝐻0: x̄1 = x̄2, 𝐻1: x̄1 ≠ x̄2.
Under the null hypothesis 𝐻0,
𝑧 = (x̄1 − x̄2)/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2) = (210 − 220)/√(10²/100 + 12²/150) = −10/√1.96 = −7.1428.
Conclusion: As the calculated value of |𝑧| > 1.96, the significant value of 𝑧 at 5% level of
significance, 𝐻0 is rejected i.e. there is significant difference between the average incomes of
the localities.
Example2. Intelligence tests were given to two groups of boys and girls: one group had mean score 75 with S.D. 8 (𝑛 = 60), the other mean score 73 with S.D. 10 (𝑛 = 100). Examine whether the difference between the mean scores is significant.
Solution: Null hypothesis 𝐻0: there is no significant difference between the mean scores, i.e. x̄1 = x̄2;
𝐻1: x̄1 ≠ x̄2.
Under the null hypothesis 𝐻0,
𝑧 = (x̄1 − x̄2)/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2) = (75 − 73)/√(8²/60 + 10²/100) = 1.3912.
Conclusion: As the calculated value of |𝑧| < 1.96, the significant value of 𝑧 at the 5% level of
significance, 𝐻0 is accepted, i.e. there is no significant difference between the mean scores.
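The large-sample test for the difference of means reduces to a one-line computation. A Python sketch (the helper name `z_two_means` is ours), applied to the figures of Example 2:

```python
import math

# Large-sample z test for the difference of two means when only sample
# standard deviations are available:
#   z = (x1bar - x2bar) / sqrt(s1^2/n1 + s2^2/n2)
def z_two_means(x1bar, s1, n1, x2bar, s2, n2):
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1bar - x2bar) / se

# Example 2: mean 75, S.D. 8, n = 60 against mean 73, S.D. 10, n = 100.
z = z_two_means(75, 8, 60, 73, 10, 100)
print(z)  # about 1.39, below 1.96, so not significant at the 5% level
```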
If 𝑠1 and 𝑠2 are the standard deviations of two independent samples then under the null
hypothesis 𝐻0 ∶ 𝜎1 = 𝜎2 i.e. the sample standard deviations do not differ significantly, the
statistic
𝑧 = (𝑠1 − 𝑠2)/√(𝜎1²/2𝑛1 + 𝜎2²/2𝑛2), where 𝜎1 and 𝜎2 are the population standard deviations.
When the population standard deviations are not known, then
𝑧 = (𝑠1 − 𝑠2)/√(𝑠1²/2𝑛1 + 𝑠2²/2𝑛2).
Example1. Random samples drawn from two countries gave the following data relating to the
heights of adult males. Is the difference between the standard deviations significant?
                              Country A   Country B
Mean height (in inches)       67.42       67.25
Standard deviation            2.58        2.50
Number in samples             1000        1200
Solution: Null hypothesis 𝐻0: 𝜎1 = 𝜎2, i.e. the sample standard deviations do not differ significantly.
𝑧 = (𝑠1 − 𝑠2)/√(𝑠1²/2𝑛1 + 𝑠2²/2𝑛2) = (2.58 − 2.50)/√(2.58²/2000 + 2.50²/2400) = 0.08/0.077 = 1.04.
Since |𝑧| < 1.96, we accept the null hypothesis at the 5% level of significance: the standard deviations do not differ significantly.
Test of significance of small samples
When the size of sample is less than 30, then the sample is called small sample. For such sample
it will not be possible for us to assume that the random sampling distribution of a statistic is
approximately normal and the values given by the sample data are sufficiently close to the
population values and can be used in their place for the calculation of the standard error of the
estimate.
This 𝑡-distribution is used when the sample size is ≤ 30 and the population standard deviation is
unknown.
The 𝑡-statistic is defined as 𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where 𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)],
x̄ is the sample mean, 𝜇 is the population mean, 𝑆 is the sample estimate of the population standard
deviation and 𝑛 is the sample size.
If the S.D. of the sample, 𝑠, is given, then the 𝑡-statistic is defined as 𝑡 = (x̄ − 𝜇)/(𝑠/√(𝑛 − 1)).
The 𝒕 −Table
The 𝑡 −table given at the end is the probability integral of 𝑡 −distribution. The 𝑡 −distribution
has different values for each degrees of freedom and when the degrees of freedom are
infinitely large, the 𝑡 −distribution is equivalent to normal distribution and the probabilities
shown in the normal distribution tables are applicable.
Applications of 𝑡 −distribution
1. To test if the sample mean (𝑥̅ ) differs significantly from the hypothetical value 𝜇 of the
population mean.
The critical value or significant value of 𝑡 at level of significance𝛼, degrees of freedom 𝛾 for two
tailed test is given by
𝑃[|𝑡| ≤ 𝑡𝛾 (𝛼)] = 1 − 𝛼
The significant value of 𝑡 at level of significance 𝛼, for a single tailed test can be got from those
of two tailed test by referring to the values at 2𝛼.
To test whether the mean of a sample drawn from a normal population deviates significantly
from a stated value when variance of the population is unknown.
𝐻0 : There is no significant difference between the sample mean 𝑥̅ and the population mean 𝜇
i.e. we use the statistic
𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where 𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)], with 𝑛 − 1 degrees of freedom.
At given level of significance 𝛼 and degree of freedom (𝑛 − 1), we refer to 𝑡- table 𝑡𝛼 (two
tailed or one tailed). If calculated 𝑡-value is such that |𝑡| < 𝑡𝛼 , the null hypothesis is accepted. If
|𝑡| > 𝑡𝛼 , 𝐻0 is rejected.
|(x̄ − 𝜇)/(𝑆/√𝑛)| < 𝑡𝛼 for acceptance of 𝐻0.
Example1. A random sample of size 16 has 53 as mean. The sum of squares of the deviation
from mean is 135. Can this sample be regarded as taken from the population having 56 as
mean? Obtain 95% and 99% confidence limits of the mean of the population.
Solution: 𝐻0 : There is no significant difference between the sample mean and hypothetical
population mean i.e. 𝜇 = 56.
Alternative hypothesis, 𝐻1 : 𝜇 ≠ 56 (two tailed test).
Test statistic: Under 𝐻0, the test statistic is 𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where
𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)] = √(135/15) = 3,
so 𝑡 = (53 − 56)/(3/√16) = −4, i.e. |𝑡| = 4.
Conclusion: Since |𝑡| = 4 > 𝑡0.05 = 2.13 i.e. the calculated value of 𝑡 is more than the
tabulated value, the null hypothesis is rejected. Hence, the sample mean has not come from a
population having 56 as mean.
95% confidence limits of the population mean = 𝑥̅ ± 𝑡0.05 𝑆/√𝑛 = 53 ± (2.13)(3/√16) = 51.4025, 54.5975.
99% confidence limits of the population mean = 𝑥̅ ± 𝑡0.01 𝑆/√𝑛 = 53 ± (2.95)(3/√16) = 50.7875, 55.2125.
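The computation above can be checked with a short script (a sketch using only the standard library; the tabulated values 𝑡0.05 = 2.13 and 𝑡0.01 = 2.95 for 15 d.f. are taken from the 𝑡-table):

```python
import math

# One-sample t-test, Example 1: n = 16, sample mean 53, H0: mu = 56,
# sum of squared deviations from the mean = 135.
n, x_bar, mu = 16, 53, 56
sum_sq_dev = 135

S = math.sqrt(sum_sq_dev / (n - 1))    # S = sqrt(135/15) = 3
t = (x_bar - mu) / (S / math.sqrt(n))  # t = -3 / (3/4) = -4

t_05, t_01 = 2.13, 2.95                # tabulated t-values for 15 d.f.
print(abs(t) > t_05)                   # True -> reject H0 at 5% level

# Confidence limits for the population mean
lo95 = x_bar - t_05 * S / math.sqrt(n)
hi95 = x_bar + t_05 * S / math.sqrt(n)
lo99 = x_bar - t_01 * S / math.sqrt(n)
hi99 = x_bar + t_01 * S / math.sqrt(n)
print(round(lo95, 4), round(hi95, 4))  # 51.4025 54.5975
print(round(lo99, 4), round(hi99, 4))  # 50.7875 55.2125
```

Note that the 95% limits come out symmetric about the sample mean 53, as they must.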
Example2. The lifetime of electric bulbs for a random sample of 10 from a large consignment
gave the following data:
Item 1 2 3 4 5 6 7 8 9 10
Life in ‘000 hrs 4.2 4.6 3.9 4.1 5.2 3.8 3.9 4.3 4.4 5.6
Can we accept the hypothesis that the average lifetime of bulb is 4000 hrs?
Solution: Null hypothesis: 𝐻0 : There is no significant difference between the sample mean and
hypothetical population mean i.e. 𝜇 = 4000 hrs.
Mean 𝑥̅ = ∑𝑥/𝑛 = 44/10 = 4.4, ∑(𝑥 − 𝑥̅ )² = 3.12.
𝑆 = √(∑(𝑥 − 𝑥̅ )²/(𝑛 − 1)) = √(3.12/9) = 0.589
𝑡 = (𝑥̅ − 𝜇)/(𝑆/√𝑛) = (4.4 − 4)/(0.589/√10) = 2.148.
Conclusion: Since the calculated value of 𝑡 is less than the tabulated value of 𝑡 (2.262 for 9 d.f.) at the 5% level of significance, the null hypothesis 𝜇 = 4000 hrs is accepted, i.e. the average lifetime of bulbs could be 4000 hrs.
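The same calculation from the raw data can be sketched as follows (stdlib only; the d.f. and critical value are as above):

```python
import math
from statistics import mean

# Example 2 check: bulb lifetimes in '000 hrs, H0: mu = 4.0 ('000 hrs)
x = [4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]
n, mu = len(x), 4.0

x_bar = mean(x)                                              # 4.4
S = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))  # ≈ 0.589
t = (x_bar - mu) / (S / math.sqrt(n))                        # ≈ 2.148
print(round(S, 3), round(t, 3))

# t_0.05 = 2.262 for 9 d.f.; t < 2.262 -> accept H0
```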
Type-II 𝒕- test for difference of means of two small samples (from a normal population)
This test is used to test whether the two samples 𝑥1 , 𝑥2 , … ; 𝑦1 , 𝑦2 , … of sizes 𝑛1 , 𝑛2 have been drawn from two normal populations with means 𝜇1 and 𝜇2 respectively, under the assumption that the population variances are equal (𝜎1 = 𝜎2 = 𝜎).
𝐻0 : The samples have been drawn from normal populations with the same mean, i.e. 𝐻0 ∶ 𝜇1 = 𝜇2 .
The test statistic is 𝑡 = (𝑥̅ − 𝑦̅)/(𝑆√(1/𝑛1 + 1/𝑛2 )), with 𝑛1 + 𝑛2 − 2 degrees of freedom.
Note:-
1. If the two sample standard deviations 𝑠1 , 𝑠2 are given, then 𝑆 2 = (𝑛1 𝑠1 2 + 𝑛2 𝑠2 2 )/(𝑛1 + 𝑛2 − 2).
2. If 𝑠1 , 𝑠2 are not given, then 𝑆 2 = (∑(𝑥1 − 𝑥̅1 )2 + ∑(𝑥2 − 𝑥̅2 )2 )/(𝑛1 + 𝑛2 − 2).
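The two notes can be sketched as small helper functions (the demonstration values are those of the next example, where S ≈ 4.772):

```python
import math

# Pooled estimate of the common population variance, in the two forms above.
def pooled_S2_from_sd(n1, s1, n2, s2):
    """Note 1: the sample standard deviations s1, s2 are given."""
    return (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)

def pooled_S2_from_sums(ss1, ss2, n1, n2):
    """Note 2: the sums of squared deviations about each sample mean are given."""
    return (ss1 + ss2) / (n1 + n2 - 2)

# e.g. n1=10, s1=3.5, n2=14, s2=5.2 (data of the next example)
print(round(math.sqrt(pooled_S2_from_sd(10, 3.5, 14, 5.2)), 3))  # 4.772
```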
Example1. Samples of sizes 10 and 14 were taken from two normal populations with S.D. 3.5 and 5.2. The sample means were found to be 20.3 and 18.6. Test whether the means of the two populations are the same at the 5% level.
Solution: 𝑆 2 = (𝑛1 𝑠1 2 + 𝑛2 𝑠2 2 )/(𝑛1 + 𝑛2 − 2) = (10 × 3.5² + 14 × 5.2²)/22 = 22.775. Therefore, 𝑆 = 4.772.
Null hypothesis: 𝐻0 ∶ 𝜇1 = 𝜇2 i.e. the means of the two populations are the same.
Alternative hypothesis : 𝐻1 ∶ 𝜇1 ≠ 𝜇2 .
Test statistic: Under 𝐻0 , the test statistic is
𝑡 = (𝑥̅ − 𝑦̅)/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (20.3 − 18.6)/(4.772√(1/10 + 1/14)) = 0.8604.
The tabulated value of 𝑡 at the 5% level of significance for 22 degrees of freedom is 𝑡0.05 = 2.0739.
Conclusion: Since 𝑡 = 0.8604 < 𝑡0.05 , the null hypothesis 𝐻0 is accepted; i.e. there is no
significant difference between their means.
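This two-sample test from summary statistics can be reproduced as follows (a sketch; the critical value 2.0739 is taken from the 𝑡-table for 22 d.f.):

```python
import math

# Two-sample t-test from summary statistics (Example 1 above).
n1, n2 = 10, 14
s1, s2 = 3.5, 5.2          # sample standard deviations
x_bar, y_bar = 20.3, 18.6  # sample means

# Pooled estimate of the common S.D., then the t-statistic
S = math.sqrt((n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2))
t = (x_bar - y_bar) / (S * math.sqrt(1/n1 + 1/n2))
print(round(S, 3), round(t, 3))   # S ≈ 4.772, t ≈ 0.860

# t_0.05 = 2.0739 for 22 d.f.; |t| < 2.0739 -> accept H0
```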
Example2. The heights of 6 randomly chosen sailors in inches are 63, 65, 68, 69, 71 and 72. Those of 9 randomly chosen soldiers are 61, 62, 65, 66, 69, 70, 71, 72 and 73. Test whether the sailors are on the average taller than the soldiers.
Solution: Let 𝑋1 and 𝑋2 be the two samples denoting the heights of sailors and soldiers respectively.
𝑛1 = 6, 𝑛2 = 9
𝑋̅1 = ∑𝑋1 /𝑛1 = 408/6 = 68

𝑋1 :            63   65   68   69   71   72
𝑋1 − 𝑋̅1 :       −5   −3    0    1    3    4
(𝑋1 − 𝑋̅1 )² :   25    9    0    1    9   16

∑(𝑋1 − 𝑋̅1 )² = 60
𝑋̅2 = ∑𝑋2 /𝑛2 = 609/9 = 67.66

𝑋2 :            61      62      65      66      69     70     71      72      73
𝑋2 − 𝑋̅2 :      −6.66   −5.66   −2.66   −1.66    1.34   2.34   3.34    4.34    5.34
(𝑋2 − 𝑋̅2 )² :  44.36   32.04    7.08    2.76    1.80   5.48  11.16   18.84   28.52

∑(𝑋2 − 𝑋̅2 )² = 152 (approx.)
𝑆 = √((60 + 152)/(6 + 9 − 2)) = √(212/13) = 4.038
Null hypothesis 𝐻0 ∶ 𝜇1 = 𝜇2 ; alternative hypothesis 𝐻1 ∶ 𝜇1 > 𝜇2 (one tailed test).
Test statistic: Under 𝐻0 ,
𝑡 = (𝑋̅1 − 𝑋̅2 )/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (68 − 67.666)/(4.038√(1/6 + 1/9)) = 0.1569.
Conclusion: Since 𝑡𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝑑 = 0.1569 < 𝑡0.05 = 1.77 (one tailed, 13 d.f.), the null hypothesis 𝐻0 is accepted, i.e. there is no significant difference between their averages; the sailors are not on the average taller than the soldiers.
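Working from the raw heights, the whole calculation can be sketched as:

```python
import math
from statistics import mean

# Sailors vs soldiers (Example 2): two-sample t-test from the raw data.
sailors  = [63, 65, 68, 69, 71, 72]
soldiers = [61, 62, 65, 66, 69, 70, 71, 72, 73]
n1, n2 = len(sailors), len(soldiers)

m1, m2 = mean(sailors), mean(soldiers)              # 68 and 67.67
ss1 = sum((x - m1) ** 2 for x in sailors)           # 60
ss2 = sum((x - m2) ** 2 for x in soldiers)          # 152

S = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))          # ≈ 4.04
t = (m1 - m2) / (S * math.sqrt(1/n1 + 1/n2))        # ≈ 0.157
print(round(t, 3))

# One-tailed t_0.05 = 1.77 for 13 d.f.; t < 1.77 -> accept H0
```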
F-Test OR Snedecor’s variance ratio test
In testing the significance of the difference of the means of two samples, we assumed that the two samples came from the same population or from populations with equal variance. The object of the F-test is to discover whether two independent estimates of the population variance differ significantly, or whether the two samples may be regarded as drawn from normal populations having the same variance. Hence, before applying the t-test for the significance of the difference of two means, we have to test for the equality of the population variances by using the F-test.
Let 𝑛1 and 𝑛2 be the sizes of two samples with variances 𝑠1 2 and 𝑠2 2 . The estimates of the population variance based on these samples are 𝑆1 2 = 𝑛1 𝑠1 2 /(𝑛1 − 1) and 𝑆2 2 = 𝑛2 𝑠2 2 /(𝑛2 − 1). The degrees of freedom of these estimates are 𝑣1 = 𝑛1 − 1 and 𝑣2 = 𝑛2 − 1.
To test whether these estimates 𝑆1 2 and 𝑆2 2 are significantly different, or whether the samples may be regarded as drawn from the same population or from two populations with the same variance 𝜎 2 , we set up the null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 = 𝜎 2 , i.e. the independent estimates of the common population variance do not differ significantly.
To carry out the test of significance of the difference of the variances we calculate the test
statistic
𝐹 = 𝑆1 2 /𝑆2 2 if 𝑆1 2 > 𝑆2 2 , and
𝐹 = 𝑆2 2 /𝑆1 2 if 𝑆2 2 > 𝑆1 2 .
Conclusion: If the calculated value of 𝐹 exceeds 𝐹0.05 for (𝑛1 − 1), (𝑛2 − 1) degrees of freedom given in the table, we conclude that the ratio is significant at the 5% level and 𝐻0 is rejected; otherwise, we conclude that the samples could have come from two normal populations with the same variance.
Note:-
1. The ratio of 𝑆1 2 to 𝑆2 2 should be equal to 1 or greater than 1; that is why we take the larger variance in the numerator of the ratio.
2. The F-test thus tests whether the two independent estimates of the population variance are homogeneous or not.
Example1. In two independent samples of sizes 8 and 10, the sums of squares of deviations of the sample values from the respective sample means were 84.4 and 102.6. Test whether the difference between the population variances is significant.
Solution: Null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 = 𝜎 2 , i.e. there is no significant difference between the population variances.
Under 𝐻0 : 𝐹 = 𝑆1 2 /𝑆2 2 ~ 𝐹 with (𝑣1 , 𝑣2 ) degrees of freedom.
𝑆1 2 = 84.4/7 = 12.057 and 𝑆2 2 = 102.6/9 = 11.4.
𝐹 = 𝑆1 2 /𝑆2 2 , because 𝑆1 2 > 𝑆2 2 .
Therefore, 𝐹 = 12.057/11.4 = 1.0576.
Conclusion: The tabulated value of 𝐹 at the 5% level of significance for (7, 9) degrees of freedom is 3.29. Since the calculated value 1.0576 < 3.29, 𝐻0 is accepted, i.e. the difference between the population variances is not significant.
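The variance-ratio computation can be sketched as:

```python
# F-test (Example 1): sums of squared deviations 84.4 and 102.6,
# sample sizes 8 and 10.
n1, n2 = 8, 10
ss1, ss2 = 84.4, 102.6

S1_sq = ss1 / (n1 - 1)        # 12.057
S2_sq = ss2 / (n2 - 1)        # 11.4

# The larger estimate goes in the numerator
F = max(S1_sq, S2_sq) / min(S1_sq, S2_sq)
print(round(F, 4))            # 1.0576

# F_0.05(7, 9) = 3.29 from the table; F < 3.29 -> accept H0
```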
Example2. Two random samples are drawn from two normal populations are as follows:
𝐴 17 27 18 25 27 29 13 17
𝐵 16 16 20 27 26 25 21
Test whether the samples are drawn from the same normal population.
Solution: To test if two independent samples have been drawn from the same population, we have to test (i) the equality of the population variances and (ii) the equality of the population means.
Since the 𝑡-test assumes that the sample variances are equal, we shall first apply the 𝐹-test.
Null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 and 𝜇1 = 𝜇2 .
Test statistic: 𝐹 = 𝑆1 2 /𝑆2 2 , if 𝑆1 2 > 𝑆2 2 .
𝑋1 : 17, 27, 18, 25, 27, 29, 13, 17 ; 𝑋̅1 = 21.625 ; 𝑛1 = 8
𝑋2 : 16, 16, 20, 27, 26, 25, 21 ; 𝑋̅2 = 18.714 ; 𝑛2 = 7
Therefore, 𝐹 = 𝑆1 2 /𝑆2 2 = 36.267/30.47 = 1.19.
Conclusion: The table value of 𝐹 for 𝑣1 = 7 and 𝑣2 = 6 degrees of freedom at 5% level is 4.21.
The calculated value of 𝐹 is less than the tabulated value of 𝐹. Therefore, 𝐻0 is accepted. Hence
we conclude that the variability in two populations is same.
Since the variability of the two populations is the same, we can now apply the 𝑡-test.
Null hypothesis 𝐻0 : 𝜇1 = 𝜇2 ; alternative hypothesis 𝐻1 : 𝜇1 ≠ 𝜇2 .
Test statistic:
𝑡 = (𝑋̅1 − 𝑋̅2 )/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (21.625 − 18.714)/(5.796√(1/8 + 1/7)) = 0.9704 ~ 𝑡 with (𝑛1 + 𝑛2 − 2) degrees of freedom.
The calculated value of 𝑡 is less than the tabulated value, so 𝐻0 is accepted, i.e. there is no significant difference between the population means, i.e. 𝜇1 = 𝜇2 . Therefore, we conclude that the two samples have been drawn from the same normal population.
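The two-stage procedure (F-test, then t-test) can be sketched directly from the raw samples. Note that the mean of sample B recomputed from the values as printed is 151/7 ≈ 21.57 rather than the 18.714 used in the worked figures above, so the recomputed statistics differ somewhat from those in the text; both tests still accept 𝐻0 either way.

```python
import math
from statistics import mean

# Example 2: F-test for equality of variances, then pooled t-test for means.
A = [17, 27, 18, 25, 27, 29, 13, 17]
B = [16, 16, 20, 27, 26, 25, 21]
n1, n2 = len(A), len(B)

mA, mB = mean(A), mean(B)
ssA = sum((x - mA) ** 2 for x in A)
ssB = sum((x - mB) ** 2 for x in B)
S1_sq, S2_sq = ssA / (n1 - 1), ssB / (n2 - 1)

# Stage 1: F-test (larger variance estimate in the numerator)
F = max(S1_sq, S2_sq) / min(S1_sq, S2_sq)
print(round(F, 2))            # compare with F_0.05(7, 6) = 4.21

# Stage 2: pooled t-test, n1 + n2 - 2 = 13 degrees of freedom
S = math.sqrt((ssA + ssB) / (n1 + n2 - 2))
t = (mA - mB) / (S * math.sqrt(1/n1 + 1/n2))
print(round(t, 3))            # compare with the tabulated t for 13 d.f.
```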
Chi-square (χ²) test
When a coin is tossed 200 times, theoretical considerations lead us to expect 100 heads and 100 tails. But in practice, these results are rarely achieved. The quantity χ² describes the magnitude of the discrepancy between theory and observation. If χ² = 0, the observed and expected frequencies completely coincide. The greater the discrepancy between the observed and expected frequencies, the greater is the value of χ². Thus χ² affords a measure of the correspondence between theory and observation.
If 𝑂𝑖 (𝑖 = 1,2, … , 𝑛) is a set of observed (experimental) frequencies and 𝐸𝑖 (𝑖 = 1,2, … , 𝑛) is the corresponding set of expected (theoretical or hypothetical) frequencies, then χ² is defined as
χ² = ∑𝑛𝑖=1 [(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ]
Degrees of freedom
While comparing the calculated value of χ² with the tabulated value, we have to determine the degrees of freedom.
If we have to choose any four numbers whose sum is 50, we can exercise our independent
choice for any three numbers only, the fourth being 50 minus the total of the three numbers
selected. Thus, though we were to choose any four numbers, our choice was reduced to three
because of one condition imposed. There was only one restraint on our freedom and our
degrees of freedom were 4 − 1 = 3. If two restrictions are imposed, our freedom to choose
will be further curtailed and degrees of freedom will be 4 − 2 = 2.
In general, the number of degrees of freedom is the total number of observations less the
number of independent constraints imposed on the observations. Degrees of freedom (d.f.) are
usually denoted by 𝑣.
The χ²-test is one of the simplest and most general tests known. It is applicable to a very large number of problems in practice which can be summed up under the following heads:
The χ²-test is an approximate test for large values of 𝑛. For the validity of the χ²-test of goodness of fit between theory and experiment, the following conditions must be satisfied.
(i) The sample observations should be independent.
(ii) Constraints on the cell frequencies, if any, should be linear, e.g. ∑ 𝑛𝑖 = ∑ λ𝑖 or ∑ 𝑂𝑖 = ∑ 𝐸𝑖 .
(iii) 𝑁, the total number of frequencies, should be reasonably large. It is difficult to say what constitutes largeness, but as an arbitrary figure, we may say that 𝑁 should be at least 50, however few the cells.
(iv) No theoretical cell frequency should be small. Here again, it is difficult to say what constitutes smallness, but 5 should be regarded as the very minimum and 10 is better. If small theoretical frequencies occur (i.e. < 10), the difficulty is overcome by grouping two or more classes together before calculating (𝑂 − 𝐸). It is important to remember that the number of degrees of freedom is determined from the number of classes after regrouping.
Note:- It may be noted that the χ²-test depends only on the set of observed and expected frequencies and on the degrees of freedom (d.f.). It does not make any assumption regarding the parent population from which the observations are taken. Since χ² does not involve any population parameters, it is termed a statistic, and the test is known as a non-parametric or distribution-free test.
The χ² distribution
For large sample sizes, the sampling distribution of χ² can be closely approximated by a continuous curve known as the chi-square distribution. The probability function of the χ² distribution is given by
𝑓(χ²) = 𝑐 (χ²)^(𝑣/2 − 1) 𝑒^(−χ²/2)
Symbolically, the degrees of freedom are denoted by 𝑣 (or d.f.) and are obtained by the rule 𝑣 = 𝑛 − 𝑘, where 𝑘 refers to the number of independent constraints.
In general, when we fit a binomial distribution the number of degrees of freedom is one less than the number of classes; when we fit a Poisson distribution the degrees of freedom are 2 less than the number of classes, because we use the total frequency and the arithmetic mean to get the parameter of the Poisson distribution. When we fit a normal curve the number of degrees of freedom is 3 less than the number of classes, because in this fitting we use the total frequency, the mean and the standard deviation.
The χ²-test enables us to ascertain how well theoretical distributions such as the Binomial, Poisson or Normal fit empirical distributions, i.e. distributions obtained from sample data. If the calculated value of χ² is less than the tabulated value at a specified level of significance (generally 5%), the fit is considered to be good, i.e. the divergence between the actual and expected frequencies is attributed to fluctuations of simple sampling. If the calculated value of χ² is greater than the tabulated value, the fit is considered to be poor.
Example1. In experiments on pea breeding, the following frequencies of seeds were obtained:

Round & Yellow   Wrinkled & Yellow   Round & Green   Wrinkled & Green   Total
315              101                 108             32                 556

Theory predicts that the frequencies should be in the proportions 9:3:3:1. Examine the correspondence between theory and experiment.
𝐻0 : The experimental results support the theory, i.e. there is no significant difference between the observed and theoretical frequencies.
The expected frequencies in the ratio 9:3:3:1 are
𝐸1 = (556 × 9)/16 = 312.75 ; 𝐸2 = (556 × 3)/16 = 104.25 ; 𝐸3 = (556 × 3)/16 = 104.25 ; 𝐸4 = (556 × 1)/16 = 34.75
Calculation of χ²:
χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 0.470024.
Conclusion: The tabulated value of χ² at the 5% level for 3 degrees of freedom is 7.815. Since the calculated value of χ² is less than the tabulated value, 𝐻0 is accepted. Therefore, the experimental results support the theory.
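The goodness-of-fit computation can be sketched as:

```python
# Chi-square goodness of fit for the 9:3:3:1 pea-breeding example.
observed = [315, 101, 108, 32]
ratios   = [9, 3, 3, 1]
N = sum(observed)                                   # 556

expected = [N * r / 16 for r in ratios]             # 312.75, 104.25, 104.25, 34.75
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 6))                             # 0.470024

# chi^2_0.05 = 7.815 for 3 d.f.; 0.47 < 7.815 -> H0 accepted
```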
Example2. The following table gives the number of accidents that took place in an industry
during various days of the week. Test if accidents are uniformly distributed over the week.
Solution: Null hypothesis 𝐻0 : The accidents are uniformly distributed over the week.
Under 𝐻0 , the expected frequency of accidents on each of these days = 84/6 = 14.

Observed frequency (𝑂𝑖 )   14   18   12   11   15   14
Expected frequency (𝐸𝑖 )   14   14   14   14   14   14
(𝑂𝑖 − 𝐸𝑖 )²                 0   16    4    9    1    0

χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 30/14 = 2.14.
The tabulated value of χ² at the 5% level for 5 degrees of freedom is 11.07.
Conclusion: Since the calculated value of χ² is less than the tabulated value, 𝐻0 is accepted, i.e. the accidents are uniformly distributed over the week.
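The uniformity test above can be sketched as:

```python
# Chi-square test for uniformity of accidents over the week (Example 2).
observed = [14, 18, 12, 11, 15, 14]
E = sum(observed) / len(observed)                   # 84/6 = 14

chi_sq = sum((o - E) ** 2 / E for o in observed)    # 30/14 ≈ 2.14
print(round(chi_sq, 2))

# chi^2_0.05 = 11.07 for 5 d.f.; 2.14 < 11.07 -> H0 accepted
```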
Example3. A die is thrown 276 times and the results of these throws are given below:

Number appeared on the die   1    2    3    4    5    6
Frequency                   40   32   29   59   57   59

Test whether the die is unbiased.
Solution: Null hypothesis 𝐻0 : The die is unbiased, i.e. all numbers appear with equal frequency.
Under 𝐻0 , the expected frequency for each number = 276/6 = 46.

Observed frequency (𝑂𝑖 )   40    32    29    59    57    59
Expected frequency (𝐸𝑖 )   46    46    46    46    46    46
(𝑂𝑖 − 𝐸𝑖 )²                36   196   289   169   121   169

χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 980/46 = 21.30.
Conclusion: Since the calculated value χ² = 21.30 > 11.07, the tabulated value at the 5% level for 5 degrees of freedom, 𝐻0 is rejected, i.e. the die is not unbiased (the die is biased).
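The die example can be sketched the same way:

```python
# Chi-square test for the die (Example 3): under H0 the die is unbiased,
# so each face is expected 276/6 = 46 times.
observed = [40, 32, 29, 59, 57, 59]
E = sum(observed) / 6                               # 46.0

chi_sq = sum((o - E) ** 2 / E for o in observed)    # 980/46 ≈ 21.30
print(round(chi_sq, 2))

# chi^2_0.05 = 11.07 for 5 d.f.; 21.30 > 11.07 -> H0 rejected, die is biased
```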
Fisher’s Z-test
This test is used to test the significance of the correlation coefficient in small samples. If 𝑟 is the correlation coefficient of the sample and 𝜌 that of the population, then we calculate the value of
(𝑍 − ξ)/(1/√(𝑛 − 3)),
where 𝑍 = (1/2) 𝑡𝑎𝑛ℎ−1 𝑟 = (1/2) 𝑙𝑜𝑔𝑒 ((1 + 𝑟)/(1 − 𝑟)) OR 1.1513 𝑙𝑜𝑔10 ((1 + 𝑟)/(1 − 𝑟)),
ξ = (1/2) 𝑡𝑎𝑛ℎ−1 𝜌 = (1/2) 𝑙𝑜𝑔𝑒 ((1 + 𝜌)/(1 − 𝜌)) OR 1.1513 𝑙𝑜𝑔10 ((1 + 𝜌)/(1 − 𝜌)),
and 1/√(𝑛 − 3) = 𝑆. 𝐸.
If the absolute value of 𝐷𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒/𝑆. 𝐸. exceeds 1.96, the difference is significant at the 5% level.
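The transformation and the test can be sketched as a pair of small functions (the demonstration values are those of Example 1 below):

```python
import math

# Fisher's Z-test for a sample correlation r against a population value rho.
def fisher_z(r):
    # Z = (1/2) ln((1+r)/(1-r)), i.e. atanh(r)
    return 0.5 * math.log((1 + r) / (1 - r))

def z_statistic(r, rho, n):
    se = 1 / math.sqrt(n - 3)                      # standard error
    return abs(fisher_z(r) - fisher_z(rho)) / se

# Example 1 below: r = 0.5, rho = 0.7, n = 18
print(round(z_statistic(0.5, 0.7, 18), 2))         # ≈ 1.23, below 1.96 -> not significant
```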
Example1. Test the significance of the correlation 𝑟 = 0.5 from a sample of size 18 against
hypothetical correlation 𝜌 = 0.7.
Solution: We have to test the hypothesis that correlation in the population is 0.7.
𝑍 = 1.1513 𝑙𝑜𝑔10 ((1 + 0.5)/(1 − 0.5)) = 1.1513 𝑙𝑜𝑔10 3 = 1.1513 × 0.4771 = 0.549.
ξ = 1.1513 𝑙𝑜𝑔10 ((1 + 0.7)/(1 − 0.7)) = 1.1513 𝑙𝑜𝑔10 5.67 = 1.1513 × 0.7536 = 0.868.
𝑆. 𝐸. = 1/√(𝑛 − 3) = 1/√15 = 0.26.
Absolute value of (𝑍 − ξ)/𝑆. 𝐸. = 0.319/0.26 = 1.23, which is less than 1.96 (5% level of significance) and is, therefore, not significant. Hence the sample may be regarded as coming from a population with 𝜌 = 0.7.
Example2. From a sample of 19 pairs of observations, the correlation is 0.5 and the corresponding population value is 0.3. Is the difference significant?
Solution: 𝑍 = 1.1513 𝑙𝑜𝑔10 ((1 + 0.5)/(1 − 0.5)) = 0.549 ; ξ = 1.1513 𝑙𝑜𝑔10 ((1 + 0.3)/(1 − 0.3)) = 0.310 ; 𝑆. 𝐸. = 1/√16 = 0.25.
Therefore, (𝑍 − ξ)/𝑆. 𝐸. = 0.239/0.25 = 0.956, which is less than 1.96 (5% level of significance) and is, therefore, not significant. Hence the sample may be regarded as coming from a population with 𝜌 = 0.3.