UNIT-III Curve Fitting & Sampling, App
Curve fitting
The general problem of finding equations of approximating curves which fit given data is called
curve fitting.
Consider 𝑛 pairs of values (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) of two variables 𝑥 and 𝑦. To get a rough
idea about their relationship if any, we plot the values of 𝑥 and 𝑦 on a suitable scale. The points
(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1,2,3, … , 𝑛 constitute a diagram called scatter diagram and the given data
(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1,2,3, … , 𝑛 is said to be bivariate. An exact relationship between the variables 𝑥 and
𝑦, i.e. of the form 𝑦 = 𝑓(𝑥), which fits the given set of data is called curve fitting. Generally it is not
possible to find a curve which passes through all the given points. We can obtain a relationship
between 𝑥 and 𝑦 in the form of a straight line, a curve of second degree, third degree, etc.,
which gives the best representation of the bivariate distribution. The method of least
squares can be used to obtain such a representation; it is probably the best method to fit a unique
curve to given data. Other methods are the graphical method, the method of group averages and
the method of moments. Here we will discuss the method of least squares only.
The method of least squares is probably the most systematic procedure to fit a unique curve
through the given points.
Let 𝑦 = 𝑓(𝑥) be the equation of curve to be fitted to the given data (observed or experimental
) points (𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3), …, (𝑥𝑛, 𝑦𝑛). At 𝑥 = 𝑥1, the observed (or experimental) value
of the ordinate is 𝑦1 and the corresponding value on the fitted curve is 𝑓(𝑥1). The
difference of the observed and the expected (theoretical) value is the error
𝑒1 = 𝑦1 − 𝑓(𝑥1)
Similarly, 𝑒2 = 𝑦2 − 𝑓(𝑥2 )
𝑒3 = 𝑦3 − 𝑓(𝑥3 )
…………………………………………
…………………………………………
𝑒𝑛 = 𝑦𝑛 − 𝑓(𝑥𝑛 )
Some of the errors 𝑒1, 𝑒2, 𝑒3, …, 𝑒𝑛 will be positive and others negative.
In finding the total error, the errors are added; some negative and some positive
errors may cancel, and in some cases the sum of all the errors may be zero, which leads to a false
result. To avoid such a situation, we make all the errors positive by squaring them.
The curve of best fit is that for which the sum of the squares of errors (S) is minimum. This is
called principle of least squares.
Let 𝑦 = 𝑎 + 𝑏𝑥 (1)
be the straight line to be fitted to the given data points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), (𝑥3 , 𝑦3 ), … , (𝑥𝑛 , 𝑦𝑛 ).
The sum of the squares of the errors is
𝑆 = ∑𝑛𝑖=1(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖 )2
For 𝑆 to be minimum,
∂𝑆/∂𝑎 = ∑ 2(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖)(−1) = 0, which implies ∑(𝑦 − 𝑎 − 𝑏𝑥) = 0 (2)
∂𝑆/∂𝑏 = ∑ 2(𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖)(−𝑥𝑖) = 0, which implies ∑(𝑥𝑦 − 𝑎𝑥 − 𝑏𝑥²) = 0 (3)
On putting the values of 𝑎 and 𝑏 in (1), we get the equation of required line.
Working rule:
1. Equation (4) is obtained by putting ∑ before all the terms on both sides of (1):
∑𝑦 = ∑𝑎 + ∑𝑏𝑥, i.e. ∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 (4)
2. Equation (5) is obtained on multiplying equation (1) by 𝑥 and then putting ∑ before each
term on both sides:
∑𝑥𝑦 = ∑𝑎𝑥 + ∑𝑏𝑥², i.e. ∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² (5)
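The working rule can be checked numerically. The following is a minimal Python sketch (using NumPy; the function name `fit_line` is ours) that builds and solves the two normal equations directly:

```python
import numpy as np

# Least-squares straight line y = a + b*x via the normal equations
#   sum(y)   = n*a + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x^2)
def fit_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    A = np.array([[n, x.sum()], [x.sum(), (x * x).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [14, 27, 40, 55, 68])
print(a, b)  # a = 0, b = 13.6 for the data of Example 1 below
```

The same coefficients are obtained by `np.polyfit(x, y, 1)`, which minimizes the same sum of squares.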
Example1. By the method of least squares, find the straight line that best fits the following
data:
𝑥 1 2 3 4 5
𝑦 14 27 40 55 68
Solution: Let the equation of the straight line best fit be 𝑦 = 𝑎 + 𝑏𝑥 (1)
𝑥   𝑦   𝑥𝑦   𝑥²
1   14   14   1
2   27   54   4
3   40   120   9
4   55   220   16
5   68   340   25
∑𝑥 = 15   ∑𝑦 = 204   ∑𝑥𝑦 = 748   ∑𝑥² = 55
Here 𝑛 = 5. The normal equations are
∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 (2)
∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² (3)
i.e. 204 = 5𝑎 + 15𝑏 and 748 = 15𝑎 + 55𝑏. Solving these, 𝑎 = 0 and 𝑏 = 13.6. Hence the required line is
𝑦 = 13.6𝑥.
Example2. Use least squares method to fit a curve of the form 𝑦 = 𝑎𝑒 𝑏𝑥 to the following data:
𝑥 1 2 3 4 5 6
𝑦 7.209 5.265 3.846 2.809 2.052 1.499
Solution: 𝑦 = 𝑎𝑒^(𝑏𝑥) (1)
Taking logarithms, log𝑒 𝑦 = log𝑒 𝑎 + 𝑏𝑥 (2)
i.e. 𝑌 = 𝐴 + 𝑏𝑥, where 𝑌 = log𝑒 𝑦 and 𝐴 = log𝑒 𝑎 (3)
𝑥 𝑦 𝑌 = 𝑙𝑜𝑔𝑒 𝑦 𝑥𝑌 𝑥2
1 7.209 1.97533 1.97533 1
2 5.265 1.66108 3.32216 4
3 3.846 1.34703 4.04109 9
4 2.809 1.03283 4.13132 16
5 2.052 0.71881 3.59405 25
6 1.499 0.40480 2.4288 36
∑𝑥 = 21   ∑𝑦 = 22.680   ∑𝑌 = 7.13988   ∑𝑥𝑌 = 19.49275   ∑𝑥² = 91
Here 𝑛 = 6. The normal equations for (3) are
∑𝑌 = 𝑛𝐴 + 𝑏∑𝑥 (4)
∑𝑥𝑌 = 𝐴∑𝑥 + 𝑏∑𝑥² (5)
i.e. 7.13988 = 6𝐴 + 21𝑏 and 19.49275 = 21𝐴 + 91𝑏. Solving,
𝑏 = −0.3141, 𝐴 = 2.28933, so 𝑎 = 𝑒^𝐴 = 9.86832. Hence
𝑦 = 9.86832𝑒^(−0.3141𝑥).
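The log-transform trick of Example 2 is easy to automate. A small Python sketch (NumPy assumed; the function name is ours) fits the line in (𝑥, log 𝑦) space and transforms back:

```python
import numpy as np

# Fit y = a*e^(b*x) by the log transform Y = ln(y) = A + b*x (A = ln a),
# then fit the straight line (A, b) by least squares.
def fit_exponential(x, y):
    x = np.asarray(x, float)
    Y = np.log(np.asarray(y, float))
    b, A = np.polyfit(x, Y, 1)          # polyfit returns [slope, intercept]
    return np.exp(A), b                  # a = e^A

a, b = fit_exponential([1, 2, 3, 4, 5, 6],
                       [7.209, 5.265, 3.846, 2.809, 2.052, 1.499])
print(a, b)  # a close to 9.868, b close to -0.3141, matching Example 2
```

Note this minimizes squared error in log 𝑦, exactly as the worked example does, not in 𝑦 itself.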
In some problems the magnitude of the variables in the given data is so large that the
calculation becomes very tedious. The size of the data can be reduced by assuming some origin
for the 𝑥, 𝑦 series. The problem is further simplified by taking a suitable scale for the values of
𝑥, particularly when these values are equally spaced.
Let ℎ be the width of the interval at which the values of 𝑥 are given and let the origins of 𝑥 and 𝑦
be taken at the points 𝑥0, 𝑦0 respectively. Then, putting
𝑢 = (𝑥 − 𝑥0)/ℎ and 𝑣 = 𝑦 − 𝑦0,
the line and the parabola take the forms 𝑣 = 𝐴 + 𝐵𝑢 and 𝑣 = 𝐴 + 𝐵𝑢 + 𝐶𝑢² respectively.
Example. Fit a straight line to the following data, using a suitable change of origin and scale:
𝑥 0 5 10 15 20 25
𝑦 12 15 17 22 24 30
Solution:
Let 𝑥0 = 12.5, 𝑦0 = 20 (because 𝑥0 = ∑𝑥/𝑁 = 12.5 and 𝑦0 = ∑𝑦/𝑁 = 20).
𝑥   𝑦   𝑢 = (𝑥 − 12.5)/2.5   𝑣 = 𝑦 − 20   𝑢𝑣   𝑢²
0 12 −5 −8 40 25
5 15 −3 −5 15 9
10 17 −1 −3 3 1
15 22 1 2 2 1
20 24 3 4 12 9
25 30 5 10 50 25
∑𝑢 = 0 ∑𝑣 = 0 ∑ 𝑢𝑣 = 122 ∑ 𝑢2 = 70
∑𝑣 = 𝑛𝐴 + 𝐵∑𝑢 (2)
∑𝑢𝑣 = 𝐴∑𝑢 + 𝐵∑𝑢² (3)
i.e. 0 = 6𝐴, so 𝐴 = 0, and 122 = 70𝐵, so 𝐵 = 1.7429. Hence 𝑣 = 1.7429𝑢, i.e. 𝑦 − 20 = 1.7429(𝑥 − 12.5)/2.5, which gives
𝑦 = 0.7𝑥 + 11.285 (approximately).
Fitting of a second degree parabola
Let 𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥² (1)
∑ 𝑦 = 𝑛𝑎 + 𝑏 ∑ 𝑥 + 𝑐 ∑ 𝑥 2 (2)
∑ 𝑥𝑦 = 𝑎 ∑ 𝑥 + 𝑏 ∑ 𝑥 2 + 𝑐 ∑ 𝑥 3 (3)
∑ 𝑥 2𝑦 = 𝑎 ∑ 𝑥 2 + 𝑏 ∑ 𝑥 3 + 𝑐 ∑ 𝑥 4 (4)
On putting the values of 𝑎, 𝑏 and 𝑐 in (1), we get the required equation of parabola.
Notes:-
1. Equation (2) is obtained by putting Σ before each term on both sides of (1).
2. Equation (3) is obtained on multiplying (1) by 𝑥 and putting Σ before each term on both sides
of obtained equation.
3. Equation (4) is obtained on multiplying (1) by 𝑥² and putting ∑ before each term on both
sides of the obtained equation.
Example1. Employ the method of least squares to fit a parabola 𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥 2 to the data:
𝑥 0 1 2 3 4
𝑦 −4 −1 4 11 20
Solution: Here 𝑛 = 5, and from the table of sums:
∑𝑥 = 10, ∑𝑦 = 30, ∑𝑥𝑦 = 120, ∑𝑥² = 30, ∑𝑥²𝑦 = 434, ∑𝑥³ = 100, ∑𝑥⁴ = 354.
∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 + 𝑐∑𝑥² (2)
∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥² + 𝑐∑𝑥³ (3)
∑𝑥²𝑦 = 𝑎∑𝑥² + 𝑏∑𝑥³ + 𝑐∑𝑥⁴ (4)
i.e. 30 = 5𝑎 + 10𝑏 + 30𝑐, 120 = 10𝑎 + 30𝑏 + 100𝑐 and 434 = 30𝑎 + 100𝑏 + 354𝑐. Solving these,
𝑎 = −4, 𝑏 = 2, 𝑐 = 1.
Hence the required parabola is 𝑦 = −4 + 2𝑥 + 𝑥².
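The three normal equations of the parabola fit form a 3×3 linear system that can be solved mechanically. A minimal Python sketch (NumPy; `fit_parabola` is our own name):

```python
import numpy as np

# Fit y = a + b*x + c*x^2 by solving the three normal equations directly.
def fit_parabola(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = lambda p: (x ** p).sum()         # power sums sum(x^p)
    A = np.array([[n,    s(1), s(2)],
                  [s(1), s(2), s(3)],
                  [s(2), s(3), s(4)]])
    rhs = np.array([y.sum(), (x * y).sum(), (x * x * y).sum()])
    return np.linalg.solve(A, rhs)       # [a, b, c]

a, b, c = fit_parabola([0, 1, 2, 3, 4], [-4, -1, 4, 11, 20])
print(a, b, c)  # -4, 2, 1 as in Example 1
```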
Example2. Fit a second degree parabola to the data in the table below by the least squares method.
Solution: Taking the origin at 𝑥0 = 1933, 𝑦0 = 357, put 𝑢 = 𝑥 − 1933, 𝑣 = 𝑦 − 357 and fit 𝑣 = 𝐴 + 𝐵𝑢 + 𝐶𝑢².
𝑥   𝑦   𝑢 = 𝑥 − 1933   𝑣 = 𝑦 − 357   𝑢𝑣   𝑢²   𝑢²𝑣   𝑢³   𝑢⁴
1929 352 −4 −5 20 16 −80 −64 256
1930 356 −3 −1 3 9 −9 −27 81
1931 357 −2 0 0 4 0 −8 16
1932 358 −1 1 −1 1 1 −1 1
1933 360 0 3 0 0 0 0 0
1934 361 1 4 4 1 4 1 1
1935 361 2 4 8 4 16 8 16
1936 360 3 3 9 9 27 27 81
1937 359 4 2 8 16 32 64 256
∑𝑢 = 0   ∑𝑣 = 11   ∑𝑢𝑣 = 51   ∑𝑢² = 60   ∑𝑢²𝑣 = −9   ∑𝑢³ = 0   ∑𝑢⁴ = 708
∑ 𝑣 = 𝑛𝐴 + 𝐵 ∑ 𝑢 + 𝐶 ∑ 𝑢2 (2)
∑ 𝑢𝑣 = 𝐴 ∑ 𝑢 + 𝐵 ∑ 𝑢2 + 𝐶 ∑ 𝑢3 (3)
∑ 𝑢2 𝑣 = 𝐴 ∑ 𝑢2 + 𝐵 ∑ 𝑢3 + 𝐶 ∑ 𝑢4 (4)
With 𝑛 = 9, these give
11 = 9𝐴 + 0𝐵 + 60𝐶 (5)
51 = 0𝐴 + 60𝐵 + 0𝐶 (6)
−9 = 60𝐴 + 0𝐵 + 708𝐶 (7)
From (6), 𝐵 = 0.85. Solving (5) and (7), 𝐶 = −0.2673 and 𝐴 = 3.0043. Hence the fitted parabola is
𝑣 = 3.0043 + 0.85𝑢 − 0.2673𝑢², where 𝑢 = 𝑥 − 1933 and 𝑣 = 𝑦 − 357.
Correlation
In our day-to-day life there are situations where one variable depends on the other: for instance,
the heights and weights of a certain group of people, or the records of rainfall and the yields of crops in a
certain period. Such data form what is known as a bivariate distribution.
In a bivariate distribution our object is to discover whether there is any relationship between the
variables under study. The relationship may be of any type, but here we are concerned with the linear
relation only. Whenever two variables are so related that a change in one variable affects the
other, in such a way that an increase in one produces an increase or decrease in the other
variable and vice versa, the variables are said to be correlated. If the two variables move in the same
direction, i.e. if an increase (or decrease) in one variable is accompanied by an increase (or
decrease) in the other, the correlation is said to be positive or direct. On the other hand, if the
variables deviate oppositely, i.e. an increase in one is followed by a decrease in the other and a decrease in
one by an increase in the other, then the correlation is said to be negative or inverse. If the
variables do not exhibit any relationship, the correlation is said to be zero or null correlation.
Correlation may be studied by:
1. Graphic methods: (i) Scatter diagram (ii) Histogram.
2. Numerical methods:
Suppose we are given 𝑛 pairs of values (𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛) of two variables 𝑥 and
𝑦. These values, when plotted on a sheet of paper according to some convenient scale, give us 𝑛 dots, one
for each pair. This graphical representation of the dots is known as a dot diagram or
scatter diagram.
Karl Pearson’s coefficient of correlation
Karl Pearson’s correlation coefficient between two variables 𝑥 and 𝑦, usually denoted by 𝑟(𝑥, 𝑦)𝑜𝑟 𝑟𝑥𝑦 is
a numerical measure of linear relationship between them and is defined as
𝑟(𝑥, 𝑦) or 𝑟𝑥𝑦 = Covariance(𝑥, 𝑦)/(√variance 𝑥 √variance 𝑦)
= [(1/𝑛)∑(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ)] / [√((1/𝑛)∑(𝑥𝑖 − x̄)²) √((1/𝑛)∑(𝑦𝑖 − ȳ)²)]
= ∑(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) / [√∑(𝑥𝑖 − x̄)² √∑(𝑦𝑖 − ȳ)²]
OR 𝑟𝑥𝑦 = [(1/𝑛)∑𝑥𝑦 − ((1/𝑛)∑𝑥)((1/𝑛)∑𝑦)] / [√((1/𝑛)∑𝑥² − ((1/𝑛)∑𝑥)²) √((1/𝑛)∑𝑦² − ((1/𝑛)∑𝑦)²)]
OR 𝑟𝑥𝑦 = [(1/𝑛)∑𝑥𝑦 − x̄ȳ] / [√((1/𝑛)∑𝑥² − x̄²) √((1/𝑛)∑𝑦² − ȳ²)]
= [𝑛∑𝑥𝑦 − ∑𝑥∑𝑦] / [√(𝑛∑𝑥² − (∑𝑥)²) √(𝑛∑𝑦² − (∑𝑦)²)]
Example1. Calculate the coefficient of correlation between the marks obtained by 8 students in
Mathematics and Statistics.
𝑀𝑎𝑡ℎ𝑒𝑚𝑎𝑡𝑖𝑐𝑠 25 30 32 35 37 40 42 45
𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐𝑠 8 10 15 17 20 23 24 25
Solution:
Let the marks of two subjects Mathematics and Statistics be denoted by 𝑥 and 𝑦 respectively.
Let the assumed mean for 𝑥 marks be 35 and that for 𝑦 be 17.
𝑥   𝑦   𝑢 = 𝑥 − 35   𝑣 = 𝑦 − 17   𝑢𝑣   𝑢²   𝑣²
25 8 −10 −9 90 100 81
30 10 −5 −7 35 25 49
32 15 −3 −2 6 9 4
35 17 0 0 0 0 0
37 20 2 3 6 4 9
40 23 5 6 30 25 36
42 24 7 7 49 49 49
45 25 10 8 80 100 64
∑𝑢 = 6   ∑𝑣 = 6   ∑𝑢𝑣 = 296   ∑𝑢² = 312   ∑𝑣² = 292
𝑛 = 8
𝑟𝑥𝑦 = 𝑟𝑢𝑣 = [𝑛∑𝑢𝑣 − ∑𝑢∑𝑣] / [√(𝑛∑𝑢² − (∑𝑢)²) √(𝑛∑𝑣² − (∑𝑣)²)]
= (8×296 − 6×6) / (√(8×312 − 36) √(8×292 − 36)) = 2332/(√2460 √2300) = 0.9803.
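The product-moment formula translates directly into code. The following Python sketch (NumPy; `pearson_r` is our own helper name) computes 𝑟 from the raw marks of Example 1; `np.corrcoef` computes the same quantity:

```python
import numpy as np

# Karl Pearson's correlation coefficient from raw data, using
#   r = (n*sum(xy) - sum(x)*sum(y)) /
#       [sqrt(n*sum(x^2) - (sum x)^2) * sqrt(n*sum(y^2) - (sum y)^2)]
def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt(n * (x * x).sum() - x.sum() ** 2) * \
          np.sqrt(n * (y * y).sum() - y.sum() ** 2)
    return num / den

maths = [25, 30, 32, 35, 37, 40, 42, 45]
stats = [8, 10, 15, 17, 20, 23, 24, 25]
r = pearson_r(maths, stats)
print(r)  # close to 0.9803
```

Since 𝑟 is invariant under a change of origin, working with 𝑢 = 𝑥 − 35, 𝑣 = 𝑦 − 17 gives the same value.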
Example2. A computer, while calculating the correlation coefficient between two variables 𝑥 and 𝑦 from 25
pairs of observations, obtained the following results:
𝑛 = 25, ∑𝑥 = 125, ∑𝑦 = 100, ∑𝑥² = 650, ∑𝑦² = 460, ∑𝑥𝑦 = 508.
It was, however, later discovered at the time of checking that two pairs had been copied down as
(𝑥, 𝑦) = (6, 14) and (8, 6), while the correct values were (8, 12) and (6, 8).
Obtain the correct value of the correlation coefficient.
Solution: Corrected ∑𝑥 = 125 − (6 + 8) + (8 + 6) = 125, so x̄ = 5;
corrected ∑𝑦 = 100 − (14 + 6) + (12 + 8) = 100, so ȳ = 4;
corrected ∑𝑥² = 650 − (36 + 64) + (64 + 36) = 650;
corrected ∑𝑦² = 460 − (196 + 36) + (144 + 64) = 436;
corrected ∑𝑥𝑦 = 508 − (84 + 48) + (96 + 48) = 520.
Corrected 𝑟(𝑥, 𝑦) = [(1/𝑛)∑𝑥𝑦 − x̄ȳ] / [√((1/𝑛)∑𝑥² − x̄²) √((1/𝑛)∑𝑦² − ȳ²)]
= [(1/25)×520 − 5×4] / [√((1/25)×650 − 25) √((1/25)×436 − 16)] = 0.8/(1 × 1.2) = 0.67.
Spearman’s Rank Correlation
The coefficient of rank correlation is applied to problems in which the data cannot be measured
quantitatively but qualitative assessment is possible, such as beauty, honesty, etc. In this case the best
individual is given rank no. 1, the next rank no. 2, and so on.
Spearman's rank correlation coefficient 𝑟 = 1 − 6∑𝑑²/[𝑛(𝑛² − 1)], where 𝑑 = 𝑅𝑥 − 𝑅𝑦 is the difference of ranks.
Example1. Obtain the rank correlation coefficient for the following data:
𝑥 10 15 12 17 13 16 24 14 22
𝑦 30 42 45 46 33 34 40 35 39
Solution:
First we write ranks in each series, the item with the largest size is ranked 1, next largest 2 and so on.
𝑥 𝑦 𝑅𝑥 𝑅𝑦 𝑑 = 𝑅𝑥 − 𝑅𝑦 𝑑2
10 30 9 9 0 0
15 42 5 3 2 4
12 45 8 2 6 36
17 46 3 1 2 4
13 33 7 8 −1 1
16 34 4 7 −3 9
24 40 1 4 −3 9
14 35 6 6 0 0
22 39 2 5 −3 9
∑ 𝑑 2 = 72
Therefore, with 𝑛 = 9,
𝑟 = 1 − 6∑𝑑²/[𝑛(𝑛² − 1)] = 1 − (6×72)/(9×(81 − 1)) = 1 − 0.6 = 0.4.
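The ranking-and-differencing procedure of Example 1 can be sketched in Python (no ties here, so the simple formula applies; `spearman_r` and `ranks` are our own names):

```python
import numpy as np

# Spearman's rank correlation: rank each series (1 = largest, as in the
# text), then r = 1 - 6*sum(d^2) / (n*(n^2 - 1)).  This simple formula
# assumes no tied values; ties need averaged ranks and a correction term.
def spearman_r(x, y):
    def ranks(a):
        # argsort of argsort gives 0-based rank positions; negate so the
        # largest item receives rank 1
        return np.argsort(np.argsort(-np.asarray(a, float))) + 1
    d = ranks(x) - ranks(y)
    n = len(d)
    return 1 - 6 * (d ** 2).sum() / (n * (n * n - 1))

x = [10, 15, 12, 17, 13, 16, 24, 14, 22]
y = [30, 42, 45, 46, 33, 34, 40, 35, 39]
rs = spearman_r(x, y)
print(rs)  # 0.4, as in Example 1
```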
Example2. Obtain the rank correlation coefficient for the following data:
𝑥 68 64 75 50 64 80 75 40 55 64
𝑦 62 58 68 45 81 60 68 48 50 70
Solution:
𝑥 𝑦 𝑅𝑥 𝑅𝑦 𝑑 = 𝑅𝑥 − 𝑅𝑦 𝑑2
68 62 4 5 −1 1
64 58 6 7 −1 1
75 68 2.5 3.5 −1 1
50 45 9 10 −1 1
64 81 6 1 5 25
80 60 1 6 −5 25
75 68 2.5 3.5 −1 1
40 48 10 9 1 1
55 50 8 8 0 0
64 70 6 2 4 16
∑𝑑² = 72. Here 𝑛 = 10, and the series contain tied ranks: 64 occurs three times in 𝑥, 75 twice in 𝑥 and 68 twice in 𝑦. For each group of 𝑚 tied items, the factor 𝑚(𝑚² − 1)/12 is added to ∑𝑑²; the correction is therefore 2 + 0.5 + 0.5 = 3.
Therefore,
𝑟 = 1 − 6(∑𝑑² + 3)/[𝑛(𝑛² − 1)] = 1 − (6×75)/(10×99) = 1 − 450/990 = 0.545.
REGRESSION
If the scatter diagram indicates some relationship between two variables 𝑥 and 𝑦, then the dots
of the scatter diagram will be concentrated round a curve. This curve is called the curve of
regression.
Regression analysis is the method used for estimating the unknown values of one variable
corresponding to the known value of another variable.
Line of regression
When the curve is a straight line, it is called a line of regression. A line of regression is the
straight line which gives the best fit, in the least squares sense, to the given data.
Regression is called non-linear if there exists a relationship (a parabola, etc.) other than a
straight line between the variables under consideration.
The line of regression of 𝑦 on 𝑥 is 𝑦 − ȳ = 𝑏𝑦𝑥(𝑥 − x̄),
and the line of regression of 𝑥 on 𝑦 is 𝑥 − x̄ = 𝑏𝑥𝑦(𝑦 − ȳ),
where 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦 are the regression coefficients.
Note:-
1. Two lines of regression pass through the point (𝑥̅ , 𝑦̅) i.e., the means of 𝑥 and 𝑦 series. The
point of intersection of these two lines gives the two means 𝑥̅ and 𝑦̅ .
2. The coefficient of correlation is the geometric mean of the two regression coefficients:
𝑏𝑦𝑥 × 𝑏𝑥𝑦 = (𝑟𝜎𝑦/𝜎𝑥) × (𝑟𝜎𝑥/𝜎𝑦) = 𝑟²,
so that √(𝑏𝑦𝑥 × 𝑏𝑥𝑦) = √𝑟² = 𝑟 = coefficient of correlation.
3. If one of the regression coefficients is greater than unity, then the other must be less than unity.
The regression coefficients are 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦.
Let 𝑏𝑦𝑥 > 1; then 1/𝑏𝑦𝑥 < 1. (i)
Since 𝑏𝑦𝑥 × 𝑏𝑥𝑦 = (𝑟𝜎𝑦/𝜎𝑥) × (𝑟𝜎𝑥/𝜎𝑦) = 𝑟² ≤ 1 (because −1 ≤ 𝑟 ≤ 1),
it follows that 𝑏𝑥𝑦 ≤ 1/𝑏𝑦𝑥 < 1.
(Also, when 𝑟 > 0 the arithmetic mean of the regression coefficients is not less than 𝑟: (𝑏𝑦𝑥 + 𝑏𝑥𝑦)/2 ≥ 𝑟 reduces to (𝜎𝑥 − 𝜎𝑦)² ≥ 0, which is true.)
4. Effect of change of origin and scale: if 𝑢 = (𝑥 − 𝑎)/ℎ and 𝑣 = (𝑦 − 𝑏)/𝑘, then
𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 = 𝑟(𝑘𝜎𝑣)/(ℎ𝜎𝑢) = (𝑘/ℎ)(𝑟𝜎𝑣/𝜎𝑢) = (𝑘/ℎ)𝑏𝑣𝑢. Similarly 𝑏𝑥𝑦 = (ℎ/𝑘)𝑏𝑢𝑣.
Thus 𝑏𝑦𝑥 and 𝑏𝑥𝑦 are both independent of 𝑎 and 𝑏 but not of ℎ and 𝑘.
5. The correlation coefficient and the two regression coefficients have same sign.
Regression coefficient of 𝑦 on 𝑥 is 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥.
Regression coefficient of 𝑥 on 𝑦 is 𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦.
Since 𝜎𝑥 and 𝜎𝑦 are both positive, 𝑏𝑦𝑥, 𝑏𝑥𝑦 and 𝑟 must have the same sign.
Example1. If 𝜃 be the acute angle between the two regression lines in the case of two variables
𝑥 and 𝑦, show that
tan𝜃 = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²), where 𝑟, 𝜎𝑥, 𝜎𝑦 have their usual meanings. Explain the significance when
𝑟 = 0 and 𝑟 = ±1.
Solution: We know that the angle 𝜃 between two straight lines with slopes 𝑚1 and 𝑚2 is given by
tan𝜃 = |(𝑚1 − 𝑚2)/(1 + 𝑚1𝑚2)|. Here the lines of regression are
𝑦 − ȳ = 𝑏𝑦𝑥(𝑥 − x̄) (1)
𝑥 − x̄ = 𝑏𝑥𝑦(𝑦 − ȳ) (2)
Slope of (1) is 𝑚1 = 𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 and
slope of (2) is 𝑚2 = 1/𝑏𝑥𝑦 = 𝜎𝑦/(𝑟𝜎𝑥). Therefore
tan𝜃 = [𝜎𝑦/(𝑟𝜎𝑥) − 𝑟𝜎𝑦/𝜎𝑥] / [1 + 𝜎𝑦²/𝜎𝑥²] = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²). (3)
(i) If 𝑟 = 0 then tan𝜃 = ∞, i.e. 𝜃 = π/2, so the lines of regression are at right angles. There is no
relationship between the two variables and they are independent or uncorrelated.
(ii) If 𝑟 = 1 𝑜𝑟 − 1, 𝑡𝑎𝑛𝜃 = 0 𝑂𝑅 𝜃 = 0. Therefore, the two lines of regression are coincident or parallel
and the correlation is perfect. Since the two lines pass through the common point (𝑥̅ , 𝑦̅), they cannot
be parallel. Hence they are coincident. Alternately the sum of the squares of deviation from
either line of regression is zero. Hence each deviation is zero and all the points lie on both the
lines of regression which coincide, and the correlation between the variables is perfect.
Example2. If the coefficient of correlation between two variables 𝑥 and 𝑦 is 0.5 and the acute angle
between their lines of regression is tan⁻¹(3/5), show that 𝜎𝑥 = (1/2)𝜎𝑦.
Solution: Here we have 𝑟 = 0.5 and 𝜃 = tan⁻¹(3/5), which implies tan𝜃 = 3/5. Using
tan𝜃 = [(1 − 𝑟²)/𝑟] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²), (1)
3/5 = [(1 − 1/4)/(1/2)] · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²) = (3/2) · 𝜎𝑥𝜎𝑦/(𝜎𝑥² + 𝜎𝑦²),
which implies 2𝜎𝑥² + 2𝜎𝑦² − 5𝜎𝑥𝜎𝑦 = 0, i.e. (2𝜎𝑥 − 𝜎𝑦)(𝜎𝑥 − 2𝜎𝑦) = 0.
Hence 2𝜎𝑥 − 𝜎𝑦 = 0 (2) or 𝜎𝑥 − 2𝜎𝑦 = 0 (3).
From (2), 𝜎𝑥 = (1/2)𝜎𝑦.
Example3. The two lines of regression of a bivariate distribution are
5𝑦 − 8𝑥 + 17 = 0 (1)
2𝑦 − 5𝑥 + 14 = 0 (2)
and 𝜎𝑦² = 16. Find (i) the mean values of 𝑥 and 𝑦, (ii) the coefficient of correlation, (iii) 𝜎𝑥².
Solution: (i) Since (x̄, ȳ) is a common point of the two lines of regression, we have
5ȳ − 8x̄ + 17 = 0 and 2ȳ − 5x̄ + 14 = 0. Solving these, x̄ = 4, ȳ = 3.
(ii) Taking (1) as the line of regression of 𝑦 on 𝑥 and (2) as that of 𝑥 on 𝑦,
𝑏𝑦𝑥 = 8/5 and 𝑏𝑥𝑦 = 2/5, i.e. 𝑟𝜎𝑦/𝜎𝑥 = 8/5 and 𝑟𝜎𝑥/𝜎𝑦 = 2/5.
On multiplying these, we get 𝑟² = 16/25 < 1, therefore 𝑟 = ±4/5.
Now we have to determine the sign of 𝑟; as 𝑏𝑦𝑥 and 𝑏𝑥𝑦 are both positive, 𝑟 is also
positive. Therefore 𝑟 = 4/5.
(iii) We are given 𝜎𝑦² = 16, therefore 𝜎𝑦 = 4, and 𝑟𝜎𝑦/𝜎𝑥 = 8/5 implies
(4/5) × (4/𝜎𝑥) = 8/5.
Therefore 𝜎𝑥 = 2, i.e. 𝜎𝑥² = 4.
Example4. Find the coefficient of correlation and lines of regression to the following data:
𝑥 5 7 8 10 11 13 16
𝑦 33 30 28 20 18 16 9
Solution:
Here 𝑛 = 7,
x̄ = ∑𝑥/𝑛 = 70/7 = 10 and ȳ = ∑𝑦/𝑛 = 154/7 = 22.
𝑥 𝑦 𝑋 = 𝑥 − 10 𝑌 = 𝑦 − 22 𝑋𝑌 𝑋2 𝑌2
5 33 −5 11 −55 25 121
7 30 −3 8 −24 9 64
8 28 −2 6 −12 4 36
10 20 0 −2 0 0 4
11 18 1 −4 −4 1 16
13 16 3 −6 −18 9 36
16 9 6 −13 −78 36 169
∑𝑥 = 70   ∑𝑦 = 154   ∑𝑋𝑌 = −191   ∑𝑋² = 84   ∑𝑌² = 446
Coefficient of correlation
𝑟 = ∑𝑋𝑌/(√∑𝑋² √∑𝑌²) = −191/(√84 √446) = −0.9868.
𝑏𝑦𝑥 = 𝑟𝜎𝑦/𝜎𝑥 = 𝑟√(∑𝑌²/∑𝑋²) = −0.9868 × √(446/84) = −2.2738 and
𝑏𝑥𝑦 = 𝑟𝜎𝑥/𝜎𝑦 = 𝑟√(∑𝑋²/∑𝑌²) = −0.9868 × √(84/446) = −0.4283.
Therefore the lines of regression 𝑦 − 22 = −2.2738(𝑥 − 10) and 𝑥 − 10 = −0.4283(𝑦 − 22) give
𝑦 = −2.2738𝑥 + 44.738
𝑥 = −0.4283𝑦 + 19.4226.
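Both regression coefficients come from the same deviation sums, so the whole of Example 4 reduces to a few lines of Python (NumPy; `regression_lines` is our own helper name):

```python
import numpy as np

# Regression coefficients from deviations about the means:
#   b_yx = sum(XY)/sum(X^2),  b_xy = sum(XY)/sum(Y^2)
# (equivalent to r*sigma_y/sigma_x and r*sigma_x/sigma_y).
def regression_lines(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    X, Y = x - x.mean(), y - y.mean()
    byx = (X * Y).sum() / (X * X).sum()
    bxy = (X * Y).sum() / (Y * Y).sum()
    return byx, bxy, x.mean(), y.mean()

x = [5, 7, 8, 10, 11, 13, 16]
y = [33, 30, 28, 20, 18, 16, 9]
byx, bxy, xbar, ybar = regression_lines(x, y)
print(byx, bxy)                 # close to -2.2738 and -0.4283
intercept = ybar - byx * xbar   # intercept of the y-on-x line
print(intercept)                # close to 44.738
```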
Multiple Linear regression
There are a number of situations where the dependent variable is a function of two or more independent
variables, either linear or non-linear. Here we shall discuss an approach to fit experimental data where
the variable under consideration is a linear function of two independent variables.
Let the regression plane be
𝑦 = 𝑎1 + 𝑎2𝑥 + 𝑎3𝑧 (1)
and let
𝑆 = ∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)² (2)
For 𝑆 to be minimum,
∂𝑆/∂𝑎1 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖) = 0
∂𝑆/∂𝑎2 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)𝑥𝑖 = 0
∂𝑆/∂𝑎3 = −2∑(𝑦𝑖 − 𝑎1 − 𝑎2𝑥𝑖 − 𝑎3𝑧𝑖)𝑧𝑖 = 0
which give the normal equations
∑𝑦𝑖 = 𝑛𝑎1 + 𝑎2∑𝑥𝑖 + 𝑎3∑𝑧𝑖
∑𝑥𝑖𝑦𝑖 = 𝑎1∑𝑥𝑖 + 𝑎2∑(𝑥𝑖)² + 𝑎3∑𝑥𝑖𝑧𝑖   (3)
∑𝑧𝑖𝑦𝑖 = 𝑎1∑𝑧𝑖 + 𝑎2∑𝑥𝑖𝑧𝑖 + 𝑎3∑(𝑧𝑖)²
In matrix form,
[ ∑𝑦𝑖   ]   [ 𝑛      ∑𝑥𝑖      ∑𝑧𝑖    ] [ 𝑎1 ]
[ ∑𝑥𝑖𝑦𝑖 ] = [ ∑𝑥𝑖    ∑(𝑥𝑖)²   ∑𝑥𝑖𝑧𝑖 ] [ 𝑎2 ]   (4)
[ ∑𝑧𝑖𝑦𝑖 ]   [ ∑𝑧𝑖    ∑𝑥𝑖𝑧𝑖    ∑(𝑧𝑖)² ] [ 𝑎3 ]
Solving for 𝑎1, 𝑎2, 𝑎3 gives the required surface 𝑦 = 𝑎1 + 𝑎2𝑥 + 𝑎3𝑧. This is a two-dimensional case, and therefore we obtain a regression plane rather than a regression line.
Example1. Obtain a regression plane by using multiple linear regression to fit the data given below:
𝑥 1 2 3 4
𝑧 0 1 2 3
𝑦 12 18 24 30
Solution: Let 𝑦 = 𝑎1 + 𝑎2 𝑥 + 𝑎3 𝑧 be the regression plane where 𝑎1 , 𝑎2 , 𝑎3 are determined by using the
following equations:
∑ 𝑦𝑖 = 𝑛𝑎1 + 𝑎2 ∑ 𝑥𝑖 + 𝑎3 ∑ 𝑧𝑖
∑ 𝑥𝑖 𝑦𝑖 = 𝑎1 ∑ 𝑥𝑖 + 𝑎2 ∑(𝑥𝑖 )2 + 𝑎3 ∑ 𝑥𝑖 𝑧𝑖 (1)
∑ 𝑧𝑖 𝑦𝑖 = 𝑎1 ∑ 𝑧𝑖 + 𝑎2 ∑ 𝑥𝑖 𝑧𝑖 + 𝑎3 ∑(𝑧𝑖 )2
where the summations run from 1 to 𝑛 (= 4) and the various sums are given in the table:
𝑥𝑖 𝑧𝑖 𝑦𝑖 (𝑥𝑖 )2 (𝑧𝑖 )2 𝑥𝑖 𝑦𝑖 𝑥𝑖 𝑧𝑖 𝑧𝑖 𝑦𝑖
1 0 12 1 0 12 0 0
2 1 18 4 1 36 2 18
3 2 24 9 4 72 6 48
4 3 30 16 9 120 12 90
∑ 𝑥𝑖 ∑ 𝑧𝑖 ∑ 𝑦𝑖 ∑(𝑥𝑖 )2 ∑(𝑧𝑖 )2 ∑ 𝑥𝑖 𝑦𝑖 ∑ 𝑥𝑖 𝑧𝑖 ∑ 𝑧𝑖 𝑦𝑖
= 10 =6 = 84 = 30 = 14 = 240 = 20 = 156
Substituting the various values in (1) and solving, we obtain
𝑎1 = 10, 𝑎2 = 2, 𝑎3 = 4.
Hence the regression plane is 𝑦 = 10 + 2𝑥 + 4𝑧.
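The same normal equations can be solved in Python with `numpy.linalg.lstsq` on the design matrix [1, 𝑥, 𝑧]. A caveat for this particular data set: since 𝑧 = 𝑥 − 1 here, the columns are collinear and the coefficients are not unique (the text's 𝑎1 = 10, 𝑎2 = 2, 𝑎3 = 4 is one valid set); `lstsq` returns one such set, and the fitted plane reproduces 𝑦 exactly either way:

```python
import numpy as np

# Regression plane y = a1 + a2*x + a3*z fitted by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([12.0, 18.0, 24.0, 30.0])

D = np.column_stack([np.ones_like(x), x, z])   # design matrix [1, x, z]
coef, *_ = np.linalg.lstsq(D, y, rcond=None)   # minimum-norm LS solution
fitted = D @ coef
print(fitted)  # fitted values reproduce y: 12, 18, 24, 30
```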
SAMPLING DISTRIBUTION
Population (Universe)
The group of individuals under study is called population or universe. It may be finite or infinite.
Sampling
A part selected from population is called a sample. The process of selection of a sample is called
sampling. A random sample is one in which each member of population has an equal chance of
being included in it. There are 𝐶 (𝑁, 𝑛) different samples of size 𝑛 that can be picked up from a
population of size 𝑁.
The statistical constants of the population, such as the mean (𝜇) and standard deviation (𝜎), are called
parameters.
The mean (x̄) and standard deviation (𝑠) of a sample are known as statistics.
Aims of a sample
The population parameters are generally not known. The sample characteristics are then
utilized to approximately determine, or estimate, those of the population. Thus, a statistic is an estimate of
a parameter. To what extent can we depend on the sample estimates?
The estimation of the mean and standard deviation of the population is a primary purpose of all
scientific experimentation. The logic of sampling theory is the logic of induction: we pass
from the particular (sample) to the general (population). This type of generalization is known
as statistical inference. The conclusions in sampling studies are based not on certainties but on
probabilities.
Types of sampling
1. Purposive sampling
2. Random sampling
3. Stratified sampling
4. Systematic sampling
Sampling distribution
From a population, a number of samples of equal size 𝑛 are drawn, and the mean of each
sample is found. These sample means are not all equal. The means, with their respective frequencies, are
grouped; the frequency distribution so formed is known as the sampling distribution of the mean.
Similarly, we can form the sampling distribution of the standard deviation.
Standard error
Standard error is the standard deviation of the sampling distribution of a statistic. For assessing
the difference between the expected value and observed value, standard error is used.
Reciprocal of the standard error is known as precision. It plays an important role in the theory
of large samples and it forms a basis of the testing of hypotheses. For large samples of size 𝑛 drawn from a population with standard deviation 𝜎, the standard errors of some well-known statistics are:
(i) Mean: 𝜎/√𝑛.
(ii) S.D.: 𝜎/√(2𝑛).
(iii) Variance: 𝜎²√(2/𝑛).
Let the population be infinitely large and having a population mean of 𝜇 and a population
variance of 𝜎 2 . If 𝑥 is a random variable denoting the measurement of the characteristic, then
Expected value of 𝑥, 𝐸 (𝑥 ) = 𝜇
Variance of 𝑥, 𝑉𝑎𝑟(𝑥) = 𝜎 2
The sample mean x̄ is the sum of 𝑛 random variables 𝑥1, 𝑥2, …, 𝑥𝑛, each divided by 𝑛,
where 𝑥1, 𝑥2, …, 𝑥𝑛 are independent observations from the infinitely large population. Hence
𝐸(x̄) = (1/𝑛)[𝐸(𝑥1) + 𝐸(𝑥2) + ⋯ + 𝐸(𝑥𝑛)] = (1/𝑛)(𝑛𝜇) = 𝜇, and
𝑉𝑎𝑟(x̄) = (1/𝑛²)𝑉𝑎𝑟(𝑥1) + (1/𝑛²)𝑉𝑎𝑟(𝑥2) + ⋯ + (1/𝑛²)𝑉𝑎𝑟(𝑥𝑛) = 𝑛𝜎²/𝑛² = 𝜎²/𝑛.
The expected value of the sample mean is the same as population mean. The variance of the
sample mean is the variance of the population divided by the sample size.
The average value of the sample tends to the true population mean. If the sample size 𝑛 is increased,
the variance 𝜎²/𝑛 of x̄ gets reduced; by taking 𝑛 large enough, the variance of x̄ can be
made as small as desired. The standard deviation 𝜎/√𝑛 of x̄ is also called the standard error of the
mean. It is denoted by 𝜎x̄.
Sampling from normal population
If 𝑥 ~ 𝑁(𝜇, 𝜎²) then it follows that x̄ ~ 𝑁(𝜇, 𝜎²/𝑛).
Example. The diameter of a component is normally distributed with mean 10 and variance 0.01. A random sample of 5 components is taken. Find the probability that the sample mean diameter lies between 9.95 and 10.05.
Solution: Let 𝑥 be a random variable representing the diameter of one component picked up at
random.
Here 𝑥 ~ 𝑁(10, 0.01), therefore x̄ ~ 𝑁(10, 0.01/5), because x̄ ~ 𝑁(𝜇, 𝜎²/𝑛).
With 𝑧 = (x̄ − 𝜇)/(𝜎/√𝑛), the limit x̄ = 10.05 gives 𝑧 = 0.05/√(0.01/5) = 1.118, so
Probability{9.95 ≤ x̄ ≤ 10.05} = 2 × Probability{10 ≤ x̄ ≤ 10.05} = 2 × Probability{0 ≤ 𝑧 ≤ 1.118} ≈ 2 × 0.368 = 0.736 (from normal tables).
We use a sample statistic called the sample variance to estimate the population variance. The
sample variance is usually denoted by 𝑠² and is given by
𝑠² = ∑(𝑥 − x̄)²/(𝑛 − 1).
The central limit theorem says that the sampling distribution of the mean will always be normally
distributed, as long as the sample size is large enough, regardless of whether the population
has a normal, Poisson, binomial or any other distribution.
OR
The central limit theorem states that if you take sufficiently large samples from a population,
the sample means will be normally distributed, even if the population is not normally
distributed.
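The theorem is easy to see empirically. The following Python sketch (NumPy; the sample size and trial count are our own illustrative choices) draws many samples from a clearly non-normal exponential population and checks that the sample means cluster around 𝜇 with spread close to 𝜎/√𝑛:

```python
import numpy as np

# Empirical check of the central limit theorem: means of samples drawn
# from a (non-normal) exponential population cluster around the
# population mean mu with spread close to sigma/sqrt(n).
rng = np.random.default_rng(0)
n, trials = 50, 20000
samples = rng.exponential(scale=1.0, size=(trials, n))  # mu = sigma = 1
means = samples.mean(axis=1)

print(means.mean())  # close to mu = 1
print(means.std())   # close to sigma/sqrt(n) = 1/sqrt(50), about 0.1414
```

A histogram of `means` would look approximately bell-shaped even though the population itself is heavily skewed.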
Testing a Hypothesis:
Very often it is required to make decisions about populations on the basis of sample
information. Such decisions are called statistical decisions, and the assumptions underlying them are
statistical hypotheses. These hypotheses are tested: assuming the hypothesis is correct, we calculate
the probability of getting the observed sample, and if this probability is less than a certain assigned
value, the hypothesis is rejected.
The analysis of a problem is based on a hypothesis. The null hypothesis is the hypothesis of no
difference: we assume that there is no significant difference between the observed
value and the expected value, and then test whether this assumption is satisfied by the data.
If the hypothesis is not supported, the difference is considered significant; if it is supported,
the difference is ascribed to sampling fluctuation.
The null hypothesis is denoted by 𝐻0.
Level of significance
Critical regions covering, for example, 5% and 1% of the area of the normal curve are chosen.
The probability of the value of the variate falling in the critical region is the level of
significance. If the variate falls in the critical region, the hypothesis is rejected.
Test of significance
The tests which enable us to decide whether to accept or to reject the null hypothesis are called
tests of significance. If the difference between the sample values and the population values
is so large that it lies in the critical region, the null hypothesis is rejected.
Confidence limits
𝜇 − 1.96𝜎, 𝜇 + 1.96𝜎 are 95% confidence limits, as the area between 𝜇 − 1.96𝜎 and 𝜇 + 1.96𝜎
is 95%. If a sample statistic lies in the interval 𝜇 − 1.96𝜎 to 𝜇 + 1.96𝜎, we call it a 95% confidence
interval.
Similarly, 𝜇 − 2.58𝜎 and 𝜇 + 2.58𝜎 is 99% confidence limits as the area between 𝜇 − 2.58𝜎
and 𝜇 + 2.58𝜎 is 99%. The numbers 1.96, 2.58 are called confidence coefficients.
Normal distribution is the limiting case of the binomial distribution when 𝑛 is large enough. For a
normal distribution, 5% of the items lie outside 𝜇 ± 1.96𝜎 while only 1% of the items lie outside
𝜇 ± 2.58𝜎.
𝑧 = (𝑥 − 𝜇)/𝜎,
where 𝑧 is the standard normal variate and 𝑥 is the observed number of successes (so that 𝜇 = 𝑛𝑝 and 𝜎 = √(𝑛𝑝𝑞)).
First we find the value of 𝑧; the test of significance depends upon it.
(i) If |𝑧| < 1.96, the difference between the observed and expected number of successes is not
significant at the 5% level of significance; if |𝑧| > 1.96, it is significant at that level.
(ii) If |𝑧| < 2.58, the difference between the observed and expected number of successes is not
significant at the 1% level of significance; if |𝑧| > 2.58, it is significant at that level.
Example2. A cubical die was thrown 9000 times and 1 or 6 was obtained 3120 times. Can the
deviation from the expected value be due to fluctuations of sampling?
Solution: Let us consider the hypothesis that the die is unbiased. Then the
probability of obtaining 1 or 6 is 𝑝 = 2/6 = 1/3, and 𝑞 = 2/3.
The expected number of successes = 𝑛𝑝 = 9000 × (1/3) = 3000.
Also 𝜎 = S.D. = √(𝑛𝑝𝑞) = √(9000 × (1/3) × (2/3)) = √2000 = 44.72, so that
3𝜎 = 3 × 44.72 = 134.16.
The difference between the actual and the expected number of successes = 3120 − 3000 = 120, which is less than 3𝜎 = 134.16.
Hence, the hypothesis is correct, and the deviation is due to fluctuations of sampling arising from
random causes.
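The standard-normal form of the same test can be sketched in Python (the helper name `z_successes` is ours):

```python
import math

# Large-sample test for the number of successes: under H0 the count is
# approximately N(n*p, n*p*q), so z = (observed - n*p)/sqrt(n*p*q).
def z_successes(observed, n, p):
    q = 1 - p
    return (observed - n * p) / math.sqrt(n * p * q)

# Die example: 1 or 6 in 9000 throws, observed 3120 times, p = 1/3.
z = z_successes(3120, 9000, 1 / 3)
print(z)  # about 2.68, i.e. a deviation of 120 against sigma = 44.72
```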
(a) Standard error of the number of successes = √(𝑛𝑝𝑞).
(b) Standard deviation (standard error) of the proportion of successes = √(𝑛𝑝𝑞)/𝑛 = √(𝑝𝑞/𝑛).
(c) Precision of the proportion of successes = 1/S.E. = √(𝑛/𝑝𝑞).
Example3. A group of scientific men reported 1705 sons and 1527 daughters. Do these figures
conform to the hypothesis that the sex ratio is 1/2?
Solution: On the given hypothesis, the male ratio 𝑝 = 1/2 = 0.5, and here 𝑛 = 1705 + 1527 = 3232.
The observed male proportion = 1705/3232 = 0.5275. Thus, the difference between the observed ratio
and the theoretical ratio = 0.5275 − 0.5 = 0.0275.
The standard deviation of the proportion = 𝑠 = √(𝑝𝑞/𝑛) = √((1/2 × 1/2)/3232) = 0.0088.
Since the difference 0.0275 exceeds 3𝑠 = 0.0264, it can definitely be said that the given figures do not
conform to the given hypothesis.
The mean, standard deviation, etc. of the population are known as parameters; they are
denoted by 𝜇 and 𝜎. Their estimates are based on the sample values. The mean and standard
deviation of a sample are denoted by x̄ and 𝑠 respectively. Thus, a statistic is an estimate of a
parameter. There are two types of estimates.
(i) Point estimation: an estimate of a population parameter given by a single number is called a
point estimate of the parameter. For example,
(S.D.)² = ∑(𝑥 − x̄)²/(𝑛 − 1) is a point estimate of the population variance.
(ii) Interval estimation: an interval in which the population parameter may be expected to lie with a
given degree of confidence. For instance,
x̄ ± 1.96𝜎𝑠 and x̄ ± 2.58𝜎𝑠 are 95% and 99% confidence limits for 𝜇.
x̄ ± 1.96𝜎/√𝑛 and x̄ ± 2.58𝜎/√𝑛 are the same intervals, as 𝜎𝑠 = 𝜎/√𝑛.
Test of significance of large samples
Let 𝑥̅1 be the mean of a sample of size 𝑛1 from a population with mean 𝜇1 , and variance 𝜎12 . Let
𝑥̅ 2 be the mean of an independent sample of size 𝑛2 from another population with mean 𝜇2
and variance 𝜎22 . The test statistic is given by
𝑧 = (x̄1 − x̄2)/√(𝜎1²/𝑛1 + 𝜎2²/𝑛2)
Under the null hypothesis that the samples are drawn from the same population, where 𝜎1 =
𝜎2 = 𝜎 and 𝜇1 = 𝜇2, the test statistic is given by
𝑧 = (x̄1 − x̄2)/[𝜎√(1/𝑛1 + 1/𝑛2)].
Note:- When the population standard deviations are not known, 𝜎² is estimated from the sample standard deviations and
𝑧 = (x̄1 − x̄2)/√[((𝑛1𝑠1² + 𝑛2𝑠2²)/(𝑛1 + 𝑛2))(1/𝑛1 + 1/𝑛2)].
Example1. The average income of persons was Rs. 210 with a S.D. of Rs. 10 in sample of 100
people of a city. For another sample of 150 persons, the average income was Rs. 220 with S.D.
of Rs. 12. The S.D. of incomes of the people of the city was Rs. 11. Test whether there is any
significant difference between the average incomes of the localities.
Solution: Null hypothesis: the difference is not significant, i.e. there is no difference between the
incomes of the localities:
𝐻0: x̄1 = x̄2, 𝐻1: x̄1 ≠ x̄2.
Under the null hypothesis 𝐻0,
𝑧 = (x̄1 − x̄2)/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2) = (210 − 220)/√(10²/100 + 12²/150) = −10/√1.96 = −7.1428.
Conclusion: As the calculated value of |𝑧| > 1.96, the significant value of 𝑧 at 5% level of
significance, 𝐻0 is rejected i.e. there is significant difference between the average incomes of
the localities.
Example2. Intelligence tests were given to two groups of boys and girls: one group had mean score 75 with S.D. 8 (𝑛 = 60), the other mean score 73 with S.D. 10 (𝑛 = 100). Examine whether the difference between the mean scores is significant.
Solution: Null hypothesis 𝐻0: there is no significant difference between the mean scores, i.e. x̄1 = x̄2;
𝐻1: x̄1 ≠ x̄2.
Under the null hypothesis 𝐻0,
𝑧 = (x̄1 − x̄2)/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2) = (75 − 73)/√(8²/60 + 10²/100) = 1.3912.
Conclusion: As the calculated value of |𝑧| < 1.96, the significant value of 𝑧 at the 5% level of
significance, 𝐻0 is accepted, i.e. there is no significant difference between the mean scores.
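The large-sample test for the difference of means reduces to a one-line computation. A Python sketch (the helper name `z_two_means` is ours), applied to the figures of Example 2:

```python
import math

# Large-sample z test for the difference of two means when only sample
# standard deviations are available:
#   z = (x1bar - x2bar) / sqrt(s1^2/n1 + s2^2/n2)
def z_two_means(x1bar, s1, n1, x2bar, s2, n2):
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1bar - x2bar) / se

# Example 2: mean 75, S.D. 8, n = 60 against mean 73, S.D. 10, n = 100.
z = z_two_means(75, 8, 60, 73, 10, 100)
print(z)  # about 1.39, below 1.96, so not significant at the 5% level
```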
If 𝑠1 and 𝑠2 are the standard deviations of two independent samples then under the null
hypothesis 𝐻0 ∶ 𝜎1 = 𝜎2 i.e. the sample standard deviations do not differ significantly, the
statistic
𝑧 = (𝑠1 − 𝑠2)/√(𝜎1²/2𝑛1 + 𝜎2²/2𝑛2), where 𝜎1 and 𝜎2 are the population standard deviations.
When the population standard deviations are not known, then
𝑧 = (𝑠1 − 𝑠2)/√(𝑠1²/2𝑛1 + 𝑠2²/2𝑛2).
Example1. Random samples drawn from two countries gave the following data relating to the
heights of adult males. Is the difference between the standard deviations significant?
                              Country A   Country B
Mean height (in inches)       67.42       67.25
Standard deviation            2.58        2.50
Number in samples             1000        1200
Solution: Null hypothesis 𝐻0: 𝜎1 = 𝜎2, i.e. the sample standard deviations do not differ significantly.
𝑧 = (𝑠1 − 𝑠2)/√(𝑠1²/2𝑛1 + 𝑠2²/2𝑛2) = (2.58 − 2.50)/√(2.58²/2000 + 2.50²/2400) = 0.08/0.077 = 1.04.
Since |𝑧| < 1.96, we accept the null hypothesis at the 5% level of significance: the standard deviations do not differ significantly.
Test of significance of small samples
When the size of sample is less than 30, then the sample is called small sample. For such sample
it will not be possible for us to assume that the random sampling distribution of a statistic is
approximately normal and the values given by the sample data are sufficiently close to the
population values and can be used in their place for the calculation of the standard error of the
estimate.
This 𝑡-distribution is used when the sample size is ≤ 30 and the population standard deviation is
unknown.
The 𝑡-statistic is defined as 𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where 𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)],
x̄ is the sample mean, 𝜇 is the population mean, 𝑆 is the sample estimate of the population standard
deviation and 𝑛 is the sample size.
If the S.D. of the sample, 𝑠, is given, then the 𝑡-statistic is defined as 𝑡 = (x̄ − 𝜇)/(𝑠/√(𝑛 − 1)).
The 𝒕 −Table
The 𝑡 −table given at the end is the probability integral of 𝑡 −distribution. The 𝑡 −distribution
has different values for each degrees of freedom and when the degrees of freedom are
infinitely large, the 𝑡 −distribution is equivalent to normal distribution and the probabilities
shown in the normal distribution tables are applicable.
Applications of 𝑡 −distribution
1. To test if the sample mean (𝑥̅ ) differs significantly from the hypothetical value 𝜇 of the
population mean.
The critical value or significant value of 𝑡 at level of significance𝛼, degrees of freedom 𝛾 for two
tailed test is given by
𝑃[|𝑡| ≤ 𝑡𝛾 (𝛼)] = 1 − 𝛼
The significant value of 𝑡 at level of significance 𝛼, for a single tailed test can be got from those
of two tailed test by referring to the values at 2𝛼.
To test whether the mean of a sample drawn from a normal population deviates significantly
from a stated value when variance of the population is unknown.
𝐻0 : There is no significant difference between the sample mean 𝑥̅ and the population mean 𝜇
i.e. we use the statistic
𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where 𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)], with 𝑛 − 1 degrees of freedom.
At given level of significance 𝛼 and degree of freedom (𝑛 − 1), we refer to 𝑡- table 𝑡𝛼 (two
tailed or one tailed). If calculated 𝑡-value is such that |𝑡| < 𝑡𝛼 , the null hypothesis is accepted. If
|𝑡| > 𝑡𝛼 , 𝐻0 is rejected.
|(x̄ − 𝜇)/(𝑆/√𝑛)| < 𝑡𝛼 for acceptance of 𝐻0.
Example1. A random sample of size 16 has 53 as mean. The sum of squares of the deviation
from mean is 135. Can this sample be regarded as taken from the population having 56 as
mean? Obtain 95% and 99% confidence limits of the mean of the population.
Solution: 𝐻0 : There is no significant difference between the sample mean and hypothetical
population mean i.e. 𝜇 = 56.
Alternative hypothesis, 𝐻1 : 𝜇 ≠ 56 (two tailed test).
Test statistic: Under 𝐻0, the test statistic is 𝑡 = (x̄ − 𝜇)/(𝑆/√𝑛), where
𝑆 = √[∑(𝑥 − x̄)²/(𝑛 − 1)] = √(135/15) = 3,
so 𝑡 = (53 − 56)/(3/√16) = −4, i.e. |𝑡| = 4.
Conclusion: Since |𝑡| = 4 > 𝑡0.05 = 2.13 i.e. the calculated value of 𝑡 is more than the
tabulated value, the null hypothesis is rejected. Hence, the sample mean has not come from a
population having 56 as mean.
95% confidence limits of the population mean = 𝑥̅ ± 𝑡0.05 𝑆/√𝑛 = 53 ± (2.13)(3/√16) = 51.4025, 54.5975.
99% confidence limits of the population mean = 𝑥̅ ± 𝑡0.01 𝑆/√𝑛 = 53 ± (2.95)(3/√16) = 50.7875, 55.2125.
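The computation above can be checked with a short script (a sketch using only the standard library; the tabulated values 𝑡0.05 = 2.13 and 𝑡0.01 = 2.95 for 15 d.f. are taken from the 𝑡-table):

```python
import math

# One-sample t-test, Example 1: n = 16, sample mean 53, H0: mu = 56,
# sum of squared deviations from the mean = 135.
n, x_bar, mu = 16, 53, 56
sum_sq_dev = 135

S = math.sqrt(sum_sq_dev / (n - 1))    # S = sqrt(135/15) = 3
t = (x_bar - mu) / (S / math.sqrt(n))  # t = -3 / (3/4) = -4

t_05, t_01 = 2.13, 2.95                # tabulated t-values for 15 d.f.
print(abs(t) > t_05)                   # True -> reject H0 at 5% level

# Confidence limits for the population mean
lo95 = x_bar - t_05 * S / math.sqrt(n)
hi95 = x_bar + t_05 * S / math.sqrt(n)
lo99 = x_bar - t_01 * S / math.sqrt(n)
hi99 = x_bar + t_01 * S / math.sqrt(n)
print(round(lo95, 4), round(hi95, 4))  # 51.4025 54.5975
print(round(lo99, 4), round(hi99, 4))  # 50.7875 55.2125
```

Note that the 95% limits come out symmetric about the sample mean 53, as they must.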
Example2. The lifetime of electric bulbs for a random sample of 10 from a large consignment
gave the following data:
Item 1 2 3 4 5 6 7 8 9 10
Life in ‘000 hrs 4.2 4.6 3.9 4.1 5.2 3.8 3.9 4.3 4.4 5.6
Can we accept the hypothesis that the average lifetime of bulb is 4000 hrs?
Solution: Null hypothesis: 𝐻0 : There is no significant difference between the sample mean and
hypothetical population mean i.e. 𝜇 = 4000 hrs.
Mean 𝑥̅ = ∑𝑥/𝑛 = 44/10 = 4.4, ∑(𝑥 − 𝑥̅ )² = 3.12.
𝑆 = √(∑(𝑥 − 𝑥̅ )²/(𝑛 − 1)) = √(3.12/9) = 0.589
𝑡 = (𝑥̅ − 𝜇)/(𝑆/√𝑛) = (4.4 − 4)/(0.589/√10) = 2.148.
Conclusion: Since the calculated value of 𝑡 is less than the tabulated value of 𝑡 (2.262 for 9 d.f.) at the 5% level of significance, the null hypothesis 𝜇 = 4000 hrs is accepted, i.e. the average lifetime of bulbs could be 4000 hrs.
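The same calculation from the raw data can be sketched as follows (stdlib only; the d.f. and critical value are as above):

```python
import math
from statistics import mean

# Example 2 check: bulb lifetimes in '000 hrs, H0: mu = 4.0 ('000 hrs)
x = [4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]
n, mu = len(x), 4.0

x_bar = mean(x)                                              # 4.4
S = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))  # ≈ 0.589
t = (x_bar - mu) / (S / math.sqrt(n))                        # ≈ 2.148
print(round(S, 3), round(t, 3))

# t_0.05 = 2.262 for 9 d.f.; t < 2.262 -> accept H0
```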
Type-II 𝒕- test for difference of means of two small samples (from a normal population)
This test is used to test whether the two samples 𝑥1 , 𝑥2 , … ; 𝑦1 , 𝑦2 , … of sizes 𝑛1 , 𝑛2 have been drawn from two normal populations with means 𝜇1 and 𝜇2 respectively, under the assumption that the population variances are equal (𝜎1 = 𝜎2 = 𝜎).
𝐻0 : The samples have been drawn from normal populations with the same mean, i.e. 𝐻0 ∶ 𝜇1 = 𝜇2 .
The test statistic is 𝑡 = (𝑥̅ − 𝑦̅)/(𝑆√(1/𝑛1 + 1/𝑛2 )), with 𝑛1 + 𝑛2 − 2 degrees of freedom.
Note:-
1. If the two sample standard deviations 𝑠1 , 𝑠2 are given, then 𝑆 2 = (𝑛1 𝑠1 2 + 𝑛2 𝑠2 2 )/(𝑛1 + 𝑛2 − 2).
2. If 𝑠1 , 𝑠2 are not given, then 𝑆 2 = (∑(𝑥1 − 𝑥̅1 )2 + ∑(𝑥2 − 𝑥̅2 )2 )/(𝑛1 + 𝑛2 − 2).
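The two notes can be sketched as small helper functions (the demonstration values are those of the next example, where S ≈ 4.772):

```python
import math

# Pooled estimate of the common population variance, in the two forms above.
def pooled_S2_from_sd(n1, s1, n2, s2):
    """Note 1: the sample standard deviations s1, s2 are given."""
    return (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)

def pooled_S2_from_sums(ss1, ss2, n1, n2):
    """Note 2: the sums of squared deviations about each sample mean are given."""
    return (ss1 + ss2) / (n1 + n2 - 2)

# e.g. n1=10, s1=3.5, n2=14, s2=5.2 (data of the next example)
print(round(math.sqrt(pooled_S2_from_sd(10, 3.5, 14, 5.2)), 3))  # 4.772
```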
Example1. Samples of sizes 10 and 14 were taken from two normal populations with S.D. 3.5 and 5.2. The sample means were found to be 20.3 and 18.6. Test whether the means of the two populations are the same at the 5% level.
Solution: 𝑆 2 = (𝑛1 𝑠1 2 + 𝑛2 𝑠2 2 )/(𝑛1 + 𝑛2 − 2) = (10 × 3.5² + 14 × 5.2²)/22 = 22.775. Therefore, 𝑆 = 4.772.
Null hypothesis: 𝐻0 ∶ 𝜇1 = 𝜇2 i.e. the means of the two populations are the same.
Alternative hypothesis : 𝐻1 ∶ 𝜇1 ≠ 𝜇2 .
Test statistic: Under 𝐻0 , the test statistic is
𝑡 = (𝑥̅ − 𝑦̅)/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (20.3 − 18.6)/(4.772√(1/10 + 1/14)) = 0.8604.
The tabulated value of 𝑡 at the 5% level of significance for 22 degrees of freedom is 𝑡0.05 = 2.0739.
Conclusion: Since 𝑡 = 0.8604 < 𝑡0.05 , the null hypothesis 𝐻0 is accepted; i.e. there is no
significant difference between their means.
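This two-sample test from summary statistics can be reproduced as follows (a sketch; the critical value 2.0739 is taken from the 𝑡-table for 22 d.f.):

```python
import math

# Two-sample t-test from summary statistics (Example 1 above).
n1, n2 = 10, 14
s1, s2 = 3.5, 5.2          # sample standard deviations
x_bar, y_bar = 20.3, 18.6  # sample means

# Pooled estimate of the common S.D., then the t-statistic
S = math.sqrt((n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2))
t = (x_bar - y_bar) / (S * math.sqrt(1/n1 + 1/n2))
print(round(S, 3), round(t, 3))   # S ≈ 4.772, t ≈ 0.860

# t_0.05 = 2.0739 for 22 d.f.; |t| < 2.0739 -> accept H0
```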
Example2. The heights of 6 randomly chosen sailors in inches are 63, 65, 68, 69, 71 and 72. Those of 9 randomly chosen soldiers are 61, 62, 65, 66, 69, 70, 71, 72 and 73. Test whether the sailors are on the average taller than the soldiers.
Solution: Let 𝑋1 and 𝑋2 be the two samples denoting the heights of sailors and soldiers respectively.
𝑛1 = 6, 𝑛2 = 9
𝑋̅1 = ∑𝑋1 /𝑛1 = 408/6 = 68

𝑋1 :            63   65   68   69   71   72
𝑋1 − 𝑋̅1 :       −5   −3    0    1    3    4
(𝑋1 − 𝑋̅1 )² :   25    9    0    1    9   16

∑(𝑋1 − 𝑋̅1 )² = 60
𝑋̅2 = ∑𝑋2 /𝑛2 = 609/9 = 67.66

𝑋2 :            61      62      65      66      69     70     71      72      73
𝑋2 − 𝑋̅2 :      −6.66   −5.66   −2.66   −1.66    1.34   2.34   3.34    4.34    5.34
(𝑋2 − 𝑋̅2 )² :  44.36   32.04    7.08    2.76    1.80   5.48  11.16   18.84   28.52

∑(𝑋2 − 𝑋̅2 )² = 152 (approx.)
𝑆 = √((60 + 152)/(6 + 9 − 2)) = √(212/13) = 4.038
Null hypothesis 𝐻0 ∶ 𝜇1 = 𝜇2 ; alternative hypothesis 𝐻1 ∶ 𝜇1 > 𝜇2 (one tailed test).
Test statistic: Under 𝐻0 ,
𝑡 = (𝑋̅1 − 𝑋̅2 )/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (68 − 67.666)/(4.038√(1/6 + 1/9)) = 0.1569.
Conclusion: Since 𝑡𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝑑 = 0.1569 < 𝑡0.05 = 1.77 (one tailed, 13 d.f.), the null hypothesis 𝐻0 is accepted, i.e. there is no significant difference between their averages; the sailors are not on the average taller than the soldiers.
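Working from the raw heights, the whole calculation can be sketched as:

```python
import math
from statistics import mean

# Sailors vs soldiers (Example 2): two-sample t-test from the raw data.
sailors  = [63, 65, 68, 69, 71, 72]
soldiers = [61, 62, 65, 66, 69, 70, 71, 72, 73]
n1, n2 = len(sailors), len(soldiers)

m1, m2 = mean(sailors), mean(soldiers)              # 68 and 67.67
ss1 = sum((x - m1) ** 2 for x in sailors)           # 60
ss2 = sum((x - m2) ** 2 for x in soldiers)          # 152

S = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))          # ≈ 4.04
t = (m1 - m2) / (S * math.sqrt(1/n1 + 1/n2))        # ≈ 0.157
print(round(t, 3))

# One-tailed t_0.05 = 1.77 for 13 d.f.; t < 1.77 -> accept H0
```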
F-Test OR Snedecor’s variance ratio test
In testing the significance of the difference of the means of two samples, we assumed that the two samples came from the same population or from populations with equal variance. The object of the F-test is to discover whether two independent estimates of the population variance differ significantly, or whether the two samples may be regarded as drawn from normal populations having the same variance. Hence, before applying the t-test for the significance of the difference of two means, we have to test for the equality of the population variances by using the F-test.
Let 𝑛1 and 𝑛2 be the sizes of two samples with variances 𝑠1 2 and 𝑠2 2 . The estimates of the population variance based on these samples are 𝑆1 2 = 𝑛1 𝑠1 2 /(𝑛1 − 1) and 𝑆2 2 = 𝑛2 𝑠2 2 /(𝑛2 − 1). The degrees of freedom of these estimates are 𝑣1 = 𝑛1 − 1 and 𝑣2 = 𝑛2 − 1.
To test whether these estimates 𝑆1 2 and 𝑆2 2 are significantly different, or whether the samples may be regarded as drawn from the same population or from two populations with the same variance 𝜎 2 , we set up the null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 = 𝜎 2 , i.e. the independent estimates of the common population variance do not differ significantly.
To carry out the test of significance of the difference of the variances we calculate the test
statistic
𝐹 = 𝑆1 2 /𝑆2 2 if 𝑆1 2 > 𝑆2 2 , and
𝐹 = 𝑆2 2 /𝑆1 2 if 𝑆2 2 > 𝑆1 2 .
Conclusion: If the calculated value of 𝐹 exceeds 𝐹0.05 for (𝑛1 − 1), (𝑛2 − 1) degrees of freedom given in the table, we conclude that the ratio is significant at the 5% level and 𝐻0 is rejected; otherwise, we conclude that the samples could have come from two normal populations with the same variance.
Note:-
1. The ratio of 𝑆1 2 to 𝑆2 2 should be equal to 1 or greater than 1; that is why we take the larger variance in the numerator of the ratio.
2. The F-test thus tests whether the two independent estimates of the population variance are homogeneous or not.
Example1. In two independent samples of sizes 8 and 10, the sums of squares of deviations of the sample values from the respective sample means were 84.4 and 102.6. Test whether the difference between the population variances is significant.
Solution: Null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 = 𝜎 2 , i.e. there is no significant difference between the population variances.
Under 𝐻0 : 𝐹 = 𝑆1 2 /𝑆2 2 ~ 𝐹 with (𝑣1 , 𝑣2 ) degrees of freedom.
𝑆1 2 = 84.4/7 = 12.057 and 𝑆2 2 = 102.6/9 = 11.4.
𝐹 = 𝑆1 2 /𝑆2 2 , because 𝑆1 2 > 𝑆2 2 .
Therefore, 𝐹 = 12.057/11.4 = 1.0576.
Conclusion: The tabulated value of 𝐹 at the 5% level of significance for (7, 9) degrees of freedom is 3.29. Since the calculated value 1.0576 < 3.29, 𝐻0 is accepted, i.e. the difference between the population variances is not significant.
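The variance-ratio computation can be sketched as:

```python
# F-test (Example 1): sums of squared deviations 84.4 and 102.6,
# sample sizes 8 and 10.
n1, n2 = 8, 10
ss1, ss2 = 84.4, 102.6

S1_sq = ss1 / (n1 - 1)        # 12.057
S2_sq = ss2 / (n2 - 1)        # 11.4

# The larger estimate goes in the numerator
F = max(S1_sq, S2_sq) / min(S1_sq, S2_sq)
print(round(F, 4))            # 1.0576

# F_0.05(7, 9) = 3.29 from the table; F < 3.29 -> accept H0
```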
Example2. Two random samples are drawn from two normal populations are as follows:
𝐴 17 27 18 25 27 29 13 17
𝐵 16 16 20 27 26 25 21
Test whether the samples are drawn from the same normal population.
Solution: To test if two independent samples have been drawn from the same population, we have to test (i) the equality of the population variances and (ii) the equality of the population means.
Since the 𝑡-test assumes that the sample variances are equal, we shall first apply the 𝐹-test.
Null hypothesis 𝐻0 : 𝜎1 2 = 𝜎2 2 and 𝜇1 = 𝜇2 .
Test statistic: 𝐹 = 𝑆1 2 /𝑆2 2 , if 𝑆1 2 > 𝑆2 2 .
𝑋1 : 17, 27, 18, 25, 27, 29, 13, 17 ; 𝑋̅1 = 21.625 ; 𝑛1 = 8
𝑋2 : 16, 16, 20, 27, 26, 25, 21 ; 𝑋̅2 = 18.714 ; 𝑛2 = 7
Therefore, 𝐹 = 𝑆1 2 /𝑆2 2 = 36.267/30.47 = 1.19.
Conclusion: The table value of 𝐹 for 𝑣1 = 7 and 𝑣2 = 6 degrees of freedom at 5% level is 4.21.
The calculated value of 𝐹 is less than the tabulated value of 𝐹. Therefore, 𝐻0 is accepted. Hence
we conclude that the variability in two populations is same.
Since the variability of the two populations is the same, we can now apply the 𝑡-test.
Null hypothesis 𝐻0 : 𝜇1 = 𝜇2 ; alternative hypothesis 𝐻1 : 𝜇1 ≠ 𝜇2 .
Test statistic:
𝑡 = (𝑋̅1 − 𝑋̅2 )/(𝑆√(1/𝑛1 + 1/𝑛2 )) = (21.625 − 18.714)/(5.796√(1/8 + 1/7)) = 0.9704 ~ 𝑡 with (𝑛1 + 𝑛2 − 2) degrees of freedom.
The calculated value of 𝑡 is less than the tabulated value, so 𝐻0 is accepted, i.e. there is no significant difference between the population means, i.e. 𝜇1 = 𝜇2 . Therefore, we conclude that the two samples have been drawn from the same normal population.
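The two-stage procedure (F-test, then t-test) can be sketched directly from the raw samples. Note that the mean of sample B recomputed from the values as printed is 151/7 ≈ 21.57 rather than the 18.714 used in the worked figures above, so the recomputed statistics differ somewhat from those in the text; both tests still accept 𝐻0 either way.

```python
import math
from statistics import mean

# Example 2: F-test for equality of variances, then pooled t-test for means.
A = [17, 27, 18, 25, 27, 29, 13, 17]
B = [16, 16, 20, 27, 26, 25, 21]
n1, n2 = len(A), len(B)

mA, mB = mean(A), mean(B)
ssA = sum((x - mA) ** 2 for x in A)
ssB = sum((x - mB) ** 2 for x in B)
S1_sq, S2_sq = ssA / (n1 - 1), ssB / (n2 - 1)

# Stage 1: F-test (larger variance estimate in the numerator)
F = max(S1_sq, S2_sq) / min(S1_sq, S2_sq)
print(round(F, 2))            # compare with F_0.05(7, 6) = 4.21

# Stage 2: pooled t-test, n1 + n2 - 2 = 13 degrees of freedom
S = math.sqrt((ssA + ssB) / (n1 + n2 - 2))
t = (mA - mB) / (S * math.sqrt(1/n1 + 1/n2))
print(round(t, 3))            # compare with the tabulated t for 13 d.f.
```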
Chi-square (χ²) test
When a coin is tossed 200 times, theoretical considerations lead us to expect 100 heads and 100 tails. But in practice, these results are rarely achieved. The quantity χ² describes the magnitude of the discrepancy between theory and observation. If χ² = 0, the observed and expected frequencies completely coincide. The greater the discrepancy between the observed and expected frequencies, the greater is the value of χ². Thus χ² affords a measure of the correspondence between theory and observation.
If 𝑂𝑖 (𝑖 = 1,2, … , 𝑛) is a set of observed (experimental) frequencies and 𝐸𝑖 (𝑖 = 1,2, … , 𝑛) is the corresponding set of expected (theoretical or hypothetical) frequencies, then χ² is defined as
χ² = ∑𝑛𝑖=1 [(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ]
Degrees of freedom
While comparing the calculated value of χ² with the tabulated value, we have to determine the degrees of freedom.
If we have to choose any four numbers whose sum is 50, we can exercise our independent
choice for any three numbers only, the fourth being 50 minus the total of the three numbers
selected. Thus, though we were to choose any four numbers, our choice was reduced to three
because of one condition imposed. There was only one restraint on our freedom and our
degrees of freedom were 4 − 1 = 3. If two restrictions are imposed, our freedom to choose
will be further curtailed and degrees of freedom will be 4 − 2 = 2.
In general, the number of degrees of freedom is the total number of observations less the
number of independent constraints imposed on the observations. Degrees of freedom (d.f.) are
usually denoted by 𝑣.
The χ²-test is one of the simplest and most general tests known. It is applicable to a very large number of problems in practice which can be summed up under the following heads:
The χ²-test is an approximate test for large values of 𝑛. For the validity of the χ²-test of goodness of fit between theory and experiment, the following conditions must be satisfied.
(i) The sample observations should be independent.
(ii) Constraints on the cell frequencies, if any, should be linear, e.g. ∑ 𝑛𝑖 = ∑ λ𝑖 or ∑ 𝑂𝑖 = ∑ 𝐸𝑖 .
(iii) 𝑁, the total number of frequencies, should be reasonably large. It is difficult to say what constitutes largeness, but as an arbitrary figure, we may say that 𝑁 should be at least 50, however few the cells.
(iv) No theoretical cell frequency should be small. Here again, it is difficult to say what constitutes smallness, but 5 should be regarded as the very minimum and 10 is better. If small theoretical frequencies occur (i.e. < 10), the difficulty is overcome by grouping two or more classes together before calculating (𝑂 − 𝐸). It is important to remember that the number of degrees of freedom is determined from the number of classes after regrouping.
Note:- It may be noted that the χ²-test depends only on the set of observed and expected frequencies and on the degrees of freedom (d.f.). It does not make any assumption regarding the parent population from which the observations are taken. Since χ² does not involve any population parameters, it is termed a statistic, and the test is known as a non-parametric or distribution-free test.
The χ² distribution
For large sample sizes, the sampling distribution of χ² can be closely approximated by a continuous curve known as the chi-square distribution. The probability function of the χ² distribution is given by
𝑓(χ²) = 𝑐 (χ²)^(𝑣/2 − 1) 𝑒^(−χ²/2)
Symbolically, the degrees of freedom are denoted by 𝑣 (or d.f.) and are obtained by the rule 𝑣 = 𝑛 − 𝑘, where 𝑘 refers to the number of independent constraints.
In general, when we fit a binomial distribution the number of degrees of freedom is one less than the number of classes; when we fit a Poisson distribution the degrees of freedom are 2 less than the number of classes, because we use the total frequency and the arithmetic mean to get the parameter of the Poisson distribution. When we fit a normal curve the number of degrees of freedom is 3 less than the number of classes, because in this fitting we use the total frequency, the mean and the standard deviation.
The χ²-test enables us to ascertain how well theoretical distributions such as the Binomial, Poisson or Normal fit empirical distributions, i.e. distributions obtained from sample data. If the calculated value of χ² is less than the tabulated value at a specified level of significance (generally 5%), the fit is considered to be good, i.e. the divergence between the actual and expected frequencies is attributed to fluctuations of simple sampling. If the calculated value of χ² is greater than the tabulated value, the fit is considered to be poor.
Example1. In experiments on pea breeding, the following frequencies of seeds were obtained:

Round & Yellow   Wrinkled & Yellow   Round & Green   Wrinkled & Green   Total
315              101                 108             32                 556

Theory predicts that the frequencies should be in the proportions 9:3:3:1. Examine the correspondence between theory and experiment.
𝐻0 : The experimental results support the theory, i.e. there is no significant difference between the observed and theoretical frequencies.
The expected frequencies in the ratio 9:3:3:1 are
𝐸1 = (556 × 9)/16 = 312.75 ; 𝐸2 = (556 × 3)/16 = 104.25 ; 𝐸3 = (556 × 3)/16 = 104.25 ; 𝐸4 = (556 × 1)/16 = 34.75
Calculation of χ²:
χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 0.470024.
Conclusion: The tabulated value of χ² at the 5% level for 3 degrees of freedom is 7.815. Since the calculated value of χ² is less than the tabulated value, 𝐻0 is accepted. Therefore, the experimental results support the theory.
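The goodness-of-fit computation can be sketched as:

```python
# Chi-square goodness of fit for the 9:3:3:1 pea-breeding example.
observed = [315, 101, 108, 32]
ratios   = [9, 3, 3, 1]
N = sum(observed)                                   # 556

expected = [N * r / 16 for r in ratios]             # 312.75, 104.25, 104.25, 34.75
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 6))                             # 0.470024

# chi^2_0.05 = 7.815 for 3 d.f.; 0.47 < 7.815 -> H0 accepted
```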
Example2. The following table gives the number of accidents that took place in an industry
during various days of the week. Test if accidents are uniformly distributed over the week.
Solution: Null hypothesis 𝐻0 : The accidents are uniformly distributed over the week.
Under 𝐻0 , the expected frequency of accidents on each of these days = 84/6 = 14.

Observed frequency (𝑂𝑖 )   14   18   12   11   15   14
Expected frequency (𝐸𝑖 )   14   14   14   14   14   14
(𝑂𝑖 − 𝐸𝑖 )²                 0   16    4    9    1    0

χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 30/14 = 2.14.
The tabulated value of χ² at the 5% level for 5 degrees of freedom is 11.07.
Conclusion: Since the calculated value of χ² is less than the tabulated value, 𝐻0 is accepted, i.e. the accidents are uniformly distributed over the week.
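The uniformity test above can be sketched as:

```python
# Chi-square test for uniformity of accidents over the week (Example 2).
observed = [14, 18, 12, 11, 15, 14]
E = sum(observed) / len(observed)                   # 84/6 = 14

chi_sq = sum((o - E) ** 2 / E for o in observed)    # 30/14 ≈ 2.14
print(round(chi_sq, 2))

# chi^2_0.05 = 11.07 for 5 d.f.; 2.14 < 11.07 -> H0 accepted
```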
Example3. A die is thrown 276 times and the results of these throws are given below:

Number appeared on the die   1    2    3    4    5    6
Frequency                   40   32   29   59   57   59

Test whether the die is unbiased.
Solution: Null hypothesis 𝐻0 : The die is unbiased, i.e. all numbers appear with equal frequency.
Under 𝐻0 , the expected frequency for each number = 276/6 = 46.

Observed frequency (𝑂𝑖 )   40    32    29    59    57    59
Expected frequency (𝐸𝑖 )   46    46    46    46    46    46
(𝑂𝑖 − 𝐸𝑖 )²                36   196   289   169   121   169

χ² = ∑[(𝑂𝑖 − 𝐸𝑖 )²/𝐸𝑖 ] = 980/46 = 21.30.
Conclusion: Since the calculated value χ² = 21.30 > 11.07, the tabulated value at the 5% level for 5 degrees of freedom, 𝐻0 is rejected, i.e. the die is not unbiased (the die is biased).
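The die example can be sketched the same way:

```python
# Chi-square test for the die (Example 3): under H0 the die is unbiased,
# so each face is expected 276/6 = 46 times.
observed = [40, 32, 29, 59, 57, 59]
E = sum(observed) / 6                               # 46.0

chi_sq = sum((o - E) ** 2 / E for o in observed)    # 980/46 ≈ 21.30
print(round(chi_sq, 2))

# chi^2_0.05 = 11.07 for 5 d.f.; 21.30 > 11.07 -> H0 rejected, die is biased
```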
Fisher’s Z-test
This test is used to test the significance of the correlation coefficient in small samples. If 𝑟 is the correlation coefficient of the sample and 𝜌 that of the population, then we calculate the value of
(𝑍 − ξ)/(1/√(𝑛 − 3)),
where 𝑍 = (1/2) 𝑡𝑎𝑛ℎ−1 𝑟 = (1/2) 𝑙𝑜𝑔𝑒 ((1 + 𝑟)/(1 − 𝑟)) OR 1.1513 𝑙𝑜𝑔10 ((1 + 𝑟)/(1 − 𝑟)),
ξ = (1/2) 𝑡𝑎𝑛ℎ−1 𝜌 = (1/2) 𝑙𝑜𝑔𝑒 ((1 + 𝜌)/(1 − 𝜌)) OR 1.1513 𝑙𝑜𝑔10 ((1 + 𝜌)/(1 − 𝜌)),
and 1/√(𝑛 − 3) = 𝑆. 𝐸.
If the absolute value of 𝐷𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒/𝑆. 𝐸. exceeds 1.96, the difference is significant at the 5% level.
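The transformation and the test can be sketched as a pair of small functions (the demonstration values are those of Example 1 below):

```python
import math

# Fisher's Z-test for a sample correlation r against a population value rho.
def fisher_z(r):
    # Z = (1/2) ln((1+r)/(1-r)), i.e. atanh(r)
    return 0.5 * math.log((1 + r) / (1 - r))

def z_statistic(r, rho, n):
    se = 1 / math.sqrt(n - 3)                      # standard error
    return abs(fisher_z(r) - fisher_z(rho)) / se

# Example 1 below: r = 0.5, rho = 0.7, n = 18
print(round(z_statistic(0.5, 0.7, 18), 2))         # ≈ 1.23, below 1.96 -> not significant
```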
Example1. Test the significance of the correlation 𝑟 = 0.5 from a sample of size 18 against
hypothetical correlation 𝜌 = 0.7.
Solution: We have to test the hypothesis that correlation in the population is 0.7.
𝑍 = 1.1513 𝑙𝑜𝑔10 ((1 + 0.5)/(1 − 0.5)) = 1.1513 𝑙𝑜𝑔10 3 = 1.1513 × 0.4771 = 0.549.
ξ = 1.1513 𝑙𝑜𝑔10 ((1 + 0.7)/(1 − 0.7)) = 1.1513 𝑙𝑜𝑔10 5.67 = 1.1513 × 0.7536 = 0.868.
𝑆. 𝐸. = 1/√(𝑛 − 3) = 1/√15 = 0.26.
Absolute value of (𝑍 − ξ)/𝑆. 𝐸. = 0.319/0.26 = 1.23, which is less than 1.96 (5% level of significance) and is, therefore, not significant. Hence the sample may be regarded as coming from a population with 𝜌 = 0.7.
Example2. From a sample of 19 pairs of observations, the correlation is 0.5 and the corresponding population value is 0.3. Is the difference significant?
Solution: 𝑍 = 1.1513 𝑙𝑜𝑔10 ((1 + 0.5)/(1 − 0.5)) = 0.549 ; ξ = 1.1513 𝑙𝑜𝑔10 ((1 + 0.3)/(1 − 0.3)) = 0.310 ; 𝑆. 𝐸. = 1/√16 = 0.25.
Therefore, (𝑍 − ξ)/𝑆. 𝐸. = 0.239/0.25 = 0.956, which is less than 1.96 (5% level of significance) and is, therefore, not significant. Hence the sample may be regarded as coming from a population with 𝜌 = 0.3.