Unit 1 Statistics - 21MA41

Department of Mathematics
UNIT-I
STATISTICS
Topic Learning Objectives:
Upon Completion of this unit, students will be able to:
➢ Expand their knowledge of statistical concepts and develop the skills needed for
statistical data analysis.
➢ Understand the Central Moments, Skewness and Kurtosis.
➢ Describe & evaluate the concept of correlation and regression coefficients.
➢ Investigate the strength and direction of a relationship between two variables by
collecting measurements and using appropriate statistical analysis.
➢ Model a linear relationship between a dependent variable and two or more
independent variables.
Introduction:
In many fields of applied mathematics and engineering, we encounter problems and
experiments that involve two or more variables. This chapter presents an elementary treatment
of the mathematical theory of statistics: central moments, mean, variance, coefficients of
skewness and kurtosis in terms of moments, curve fitting, correlation and regression.
In mathematics, a moment is a specific quantitative measure of the shape of a
function. It is used in both mechanics and statistics. If the function represents physical density,
then the zeroth moment is the total mass, the first moment divided by the total mass is the center
of mass, and the second moment is the rotational inertia. If the function is a probability
distribution, then the zeroth moment is the total probability (i.e. one), the first moment is
the mean, the second central moment is the variance, the third standardized moment is
the skewness, and the fourth standardized moment is the kurtosis.
Moments:
In mechanics, moment refers to the turning or the rotating effect of a force whereas it is used
to describe the peculiarities of a frequency distribution in statistics. We can measure the central
tendency of a set of observations by using moments. Moments also help in measuring the
scatteredness, asymmetry and peakedness of a curve for a particular distribution.
A moment is the average of the deviations from the mean, or from some other value, raised to a
certain power. The arithmetic means of the various powers of the deviations from the mean are
called the moments of the distribution about the mean. Moments about the mean are generally
used in statistics.
Fourth Semester 1 Statistics (21MA41)

Moments for ungrouped data:
Now we first define the moments for ungrouped data. The rth moment about the origin is denoted
by $\mu_r'$ and defined by
$$\mu_r' = \frac{1}{n}\sum_{i=1}^{n} x_i^r, \quad r = 1, 2, 3, \ldots \qquad (1)$$
Here $\mu_r'$ is the rth moment of the n observations $x_1, x_2, \ldots, x_n$.
Thus, for r = 1, 2, 3 and 4 we get the first four raw moments about the origin:
$$\mu_1' = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \mu_2' = \frac{1}{n}\sum_{i=1}^{n} x_i^2, \quad \mu_3' = \frac{1}{n}\sum_{i=1}^{n} x_i^3 \quad \text{and} \quad \mu_4' = \frac{1}{n}\sum_{i=1}^{n} x_i^4.$$
Similarly, the rth moment about the arithmetic mean $\bar{x}$, also called the rth central
moment, is denoted by $\mu_r$ and defined as
$$\mu_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r, \quad r = 1, 2, 3, \ldots \qquad (2)$$
Thus, for r = 1, we get the first central moment as
$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
Similarly, for r = 2, we get the second central moment as
$$\mu_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$
which is equal to the variance.
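Definitions (1) and (2) can be sketched in a few lines of Python. The function names and the five-value sample below are illustrative assumptions, not part of the notes; the sketch only shows how the two formulas translate into code.

```python
def raw_moment(data, r):
    """r-th moment about the origin: (1/n) * sum(x_i ** r), as in equation (1)."""
    return sum(x ** r for x in data) / len(data)


def central_moment(data, r):
    """r-th moment about the arithmetic mean, as in equation (2)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)


# Hypothetical sample chosen only for illustration.
sample = [1, 2, 3, 4, 5]
raw = [raw_moment(sample, r) for r in range(1, 5)]
central = [central_moment(sample, r) for r in range(1, 5)]
print("raw moments:    ", raw)
print("central moments:", central)
```

For this sample the first central moment is 0 and the second equals the variance, as stated above.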
Moments for grouped data:

Suppose $x_1, x_2, \ldots, x_n$ are the mid-points of the class intervals and
$f_1, f_2, \ldots, f_n$ are the corresponding frequencies. Then the rth moment about the origin
is denoted by $\mu_r'$ and defined by
$$\mu_r' = \frac{1}{N}\sum_{i=1}^{n} f_i x_i^r, \quad r = 1, 2, 3, \ldots, \quad \text{where } N = \sum_{i=1}^{n} f_i \qquad (3)$$
Similarly, the rth moment about the arithmetic mean is denoted by $\mu_r$ and defined by
$$\mu_r = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^r, \quad r = 1, 2, 3, \ldots \qquad (4)$$
Also, the rth moment about any point A is denoted by $\mu_r'$ and defined by
$$\mu_r' = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - A)^r, \quad r = 1, 2, 3, \ldots \qquad (5)$$
Note: If $d_i = \frac{x_i - A}{h}$ or $d_i = \frac{x_i - \bar{x}}{h}$, then the rth order moments
about an arbitrary point A and about the mean $\bar{x}$ are given respectively by
$$\mu_r' = \frac{1}{N}\sum_{i=1}^{n} f_i d_i^r h^r \quad \text{and} \quad \mu_r = \frac{1}{N}\sum_{i=1}^{n} f_i d_i^r h^r, \quad r = 1, 2, 3, \ldots$$
Relation between raw (Moments about origin or any point) and Central Moments
The central moments can be expressed in terms of raw moments and vice versa. The general
relation giving the moments about the mean in terms of the moments about any point is
$$\mu_r = \mu_r' - {}^{r}C_1\, \mu_{r-1}'\, \mu_1' + {}^{r}C_2\, \mu_{r-2}'\, (\mu_1')^2 - \cdots + (-1)^r (\mu_1')^r, \quad r = 1, 2, 3, \ldots \qquad (6)$$
In particular, on putting r = 2, 3 and 4 in equation (6), we get
$$\mu_2 = \mu_2' - (\mu_1')^2, \quad \mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 \quad \text{and} \quad \mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4.$$
Conversely,
$$\mu_r' = \mu_r + {}^{r}C_1\, \mu_{r-1}\, \mu_1' + {}^{r}C_2\, \mu_{r-2}\, (\mu_1')^2 + \cdots + (\mu_1')^r, \quad r = 1, 2, 3, \ldots \qquad (7)$$
In particular, on putting r = 2, 3 and 4 in equation (7) and using $\mu_1 = 0$, we get
$$\mu_2' = \mu_2 + (\mu_1')^2, \quad \mu_3' = \mu_3 + 3\mu_2\mu_1' + (\mu_1')^3 \quad \text{and} \quad \mu_4' = \mu_4 + 4\mu_3\mu_1' + 6\mu_2(\mu_1')^2 + (\mu_1')^4.$$
Example 1: The first four moments of a distribution about the value 4 of the variables are
-1.5, 17, -30 and 108. Find the moments about the mean.
Solution: Given A = 4, $\mu_1' = -1.5$, $\mu_2' = 17$, $\mu_3' = -30$ and $\mu_4' = 108$.
Moments about mean:
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2 = 17 - (-1.5)^2 = 14.75$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = -30 - 3(17)(-1.5) + 2(-1.5)^3 = 39.75$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$
$= 108 - 4(-30)(-1.5) + 6(17)(-1.5)^2 - 3(-1.5)^4 = 142.3125.$
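The arithmetic of Example 1 can be checked with a short Python sketch of the r = 2, 3, 4 relations between raw and central moments. The function name `central_from_raw` is a hypothetical helper introduced here for illustration.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four moments about any point A into central moments."""
    mu1 = 0.0  # the first central moment is always zero
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu1, mu2, mu3, mu4


# Moments about A = 4 given in Example 1.
mu = central_from_raw(-1.5, 17, -30, 108)
print(mu)
```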
Example 2: Calculate the first four moments of the following distribution about the mean.
x:   0    1    2    3    4    5    6    7    8
f:   1    8    28   56   70   56   28   8    1
Solution: Here N = ∑f = 256 and $\bar{x}$ = ∑fx / N = 1024/256 = 4.
x      f      d = x - x̄    fd     fd²    fd³    fd⁴
0      1      -4           -4     16     -64    256
1      8      -3           -24    72     -216   648
2      28     -2           -56    112    -224   448
3      56     -1           -56    56     -56    56
4      70     0            0      0      0      0
5      56     1            56     56     56     56
6      28     2            56     112    224    448
7      8      3            24     72     216    648
8      1      4            4      16     64     256
Total  256                 0      512    0      2816
Moments about the mean $\bar{x} = 4$ are
$$\mu_1 = \frac{\sum fd}{N} = 0, \quad \mu_2 = \frac{\sum fd^2}{N} = 2, \quad \mu_3 = \frac{\sum fd^3}{N} = 0, \quad \mu_4 = \frac{\sum fd^4}{N} = 11.$$
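As a cross-check of Example 2, the grouped-data formula (4) can be sketched in Python. The function name `grouped_central_moments` is an assumption made for this illustration.

```python
def grouped_central_moments(values, freqs, max_order=4):
    """Return the mean and the central moments mu_1..mu_max_order of a frequency table."""
    total = sum(freqs)  # N = sum of frequencies
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    moments = [
        sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / total
        for r in range(1, max_order + 1)
    ]
    return mean, moments


# Data of Example 2.
x = list(range(9))
f = [1, 8, 28, 56, 70, 56, 28, 8, 1]
mean, mu = grouped_central_moments(x, f)
print(mean, mu)
```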

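Example 3: Wages of workers are given in the following table:
Wages:  1.5 - 2.5   2.5 - 3.5   3.5 - 4.5   4.5 - 5.5   5.5 - 6.5
f:      1           3           7           3           3
Calculate the first four central moments of the distribution.
Solution: Take A = 4 (the mid-point of the middle class) and d = x - A.
Wages       f     Mid-point x   d = x - 4   fd    fd²   fd³   fd⁴
1.5 - 2.5   1     2             -2          -2    4     -8    16
2.5 - 3.5   3     3             -1          -3    3     -3    3
3.5 - 4.5   7     4             0           0     0     0     0
4.5 - 5.5   3     5             1           3     3     3     3
5.5 - 6.5   3     6             2           6     12    24    48
Total       17                              4     22    16    70
Here N = ∑f = 17, so the moments about A = 4 are
$$\mu_1' = \frac{4}{17} = 0.2353, \quad \mu_2' = \frac{22}{17} = 1.2941, \quad \mu_3' = \frac{16}{17} = 0.9412, \quad \mu_4' = \frac{70}{17} = 4.1176.$$
Using the relations between raw and central moments, the moments about the mean are
respectively 0, 1.2388, 0.0537 and 3.6525.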
Skewness and Kurtosis:

Averages tell us about the central value of the distribution and measures of dispersion tell us
about the concentration of the items around a central value. These measures do not reveal
whether the dispersal of value on either side of an average is symmetrical or not. If
observations are arranged in a symmetrical manner around a measure of central tendency, we
get a symmetrical distribution; otherwise, it may be arranged in an asymmetrical order which
gives asymmetrical distribution.
Measures of Skewness and Kurtosis, like measures of central tendency and dispersion, study
the characteristics of a frequency distribution. Thus, skewness is a measure that studies the
degree and direction of departure from symmetry.
A symmetrical distribution gives a ‘symmetrical curve’, in which the values of the mean, median
and mode are exactly equal. In an asymmetrical distribution, on the other hand, the values of the
mean, median and mode are not equal. When two or more symmetrical distributions are compared,
the difference between them is studied with ‘Kurtosis’. When two or more asymmetrical
distributions are compared, they will show different degrees of skewness. The two measures
describe different features of a distribution: skewness measures the lack of symmetry, while
kurtosis measures the flatness or peakedness of the curve.
Measures of Kurtosis:
Kurtosis enables us to have an idea about the flatness or peakedness of the curve. It is measured
by the Karl Pearson coefficient $\beta_2$, given by
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
Kurtosis studies the concentration of the items at the central part of a series. In the following
figure, all three curves A, B and C are symmetrical about the mean.

Curve of the type ‘A’ which is neither flat nor peaked is called the normal curve or
‘MESOKURTIC’ curve (β2 = 3). If items concentrate too much at the center (more peaked
than the normal curve), the curve of the type ‘C’ becomes ‘LEPTOKURTIC’ curve (β2 > 3).
If the concentration at the center is comparatively less (flatter than the normal curve), the curve
of the type ‘B’ becomes ‘PLATYKURTIC’ curve (β2 < 3).
Measures of Skewness:
Literally, skewness means ‘lack of symmetry’. A distribution is said to be skewed if
(i) Mean, Median and Mode fall at different points.
(ii) The curve drawn with the help of the given data is not symmetrical but stretched more to
one side than to the other.
Karl Pearson’s coefficient of Skewness: This method is the one most frequently used for measuring
skewness. The formula for the coefficient of skewness is as follows:
$$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma},$$
where σ is the standard deviation of the distribution.
Based upon moments, the coefficient of skewness is defined as follows:
$$S_k = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}, \quad \text{where } \beta_1 = \frac{\mu_3^2}{\mu_2^3} \text{ and } \beta_2 = \frac{\mu_4}{\mu_2^2}.$$
Nature of Skewness:
Skewness can be positive or negative or zero. The direction of skewness is determined by
observing whether the mean is greater than the mode (positive skewness) or less than the
mode (negative skewness).
(i) When the values of mean, median and mode are equal, there is no skewness.
(ii) When mean > median > mode, skewness will be positive.
(iii) When mean < median < mode, skewness will be negative.
Characteristics of a good measure of skewness:

1. It should be a pure number in the sense that its value should be independent of the unit of
the series and also degree of variation in the series.
2. It should have zero-value, when the distribution is symmetrical.

3. It should have a meaningful scale of measurement so that we could easily interpret the
measured value.
Note:
From
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \qquad (*)$$
we observe the following:
• $\mu_3^2$ is never negative, whether $\mu_3$ is positive or negative.
• $\mu_2^3$ is always positive, as $\mu_2$ is the variance.
∴ from (*), $\beta_1$ is never negative, although skewness itself may be negative.
To overcome this, the measure of skewness is defined by
$$\gamma_1 = \pm\sqrt{\beta_1}$$
Here the sign of $\gamma_1$ is the sign of $\mu_3$.
Similarly, the measure of kurtosis is defined by $\gamma_2 = \beta_2 - 3$.
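The coefficients $\beta_1$, $\beta_2$, $\gamma_1$ and $\gamma_2$ follow directly from the central moments. The sketch below assumes a hypothetical helper `shape_coefficients` and applies it to the central moments found in Example 2 ($\mu_2 = 2$, $\mu_3 = 0$, $\mu_4 = 11$).

```python
import math


def shape_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2, gamma1, gamma2) from the central moments."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    # gamma1 carries the sign of mu3, as described in the note above.
    gamma1 = math.copysign(math.sqrt(beta1), mu3)
    gamma2 = beta2 - 3
    return beta1, beta2, gamma1, gamma2


beta1, beta2, gamma1, gamma2 = shape_coefficients(2, 0, 11)
print(beta1, beta2, gamma1, gamma2)
```

Here $\beta_1 = 0$ (a symmetrical distribution) and $\beta_2 = 2.75 < 3$, so the distribution of Example 2 is slightly platykurtic.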

Example: Wages of workers are given in the following table:
Wages:  10-12   12-14   14-16   16-18   18-20   20-22   22-24
f:      1       3       7       12      12      4       3
Calculate the first four central moments of the distribution. Also compute β1 and β2.
Solution: Take A = 17 and h = 2, so that d = (x - 17) / 2.
Wages    f     Mid-point x   d = (x - 17)/2   fd    fd²   fd³   fd⁴
10-12    1     11            -3               -3    9     -27   81
12-14    3     13            -2               -6    12    -24   48
14-16    7     15            -1               -7    7     -7    7
16-18    12    17            0                0     0     0     0
18-20    12    19            1                12    12    12    12
20-22    4     21            2                8     16    32    64
22-24    3     23            3                9     27    81    243
Total    42                                   13    83    67    455
Here N = ∑f = 42. The moments about A = 17 are
$$\mu_1' = \frac{\sum fd}{N}\, h = \frac{13}{42}(2) = 0.6190, \quad \mu_2' = \frac{\sum fd^2}{N}\, h^2 = \frac{83}{42}(4) = 7.9048,$$
$$\mu_3' = \frac{\sum fd^3}{N}\, h^3 = \frac{67}{42}(8) = 12.7619, \quad \mu_4' = \frac{\sum fd^4}{N}\, h^4 = \frac{455}{42}(16) = 173.3333.$$
Moments about mean (computed from the unrounded values):
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2 = 7.9048 - 0.3832 = 7.5215$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = 12.7619 - 14.6803 + 0.4745 = -1.4439$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 173.3333 - 31.6009 + 18.1756 - 0.4406 = 159.4674$
So, we have
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.0049, \quad \beta_2 = \frac{\mu_4}{\mu_2^2} = 2.8188.$$
Exercise:
1. The first four raw moments of a distribution are 2, 136, 320 and 40,000. Find the
coefficients of skewness and kurtosis.
Ans. $\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.1002$, $\beta_2 = \frac{\mu_4}{\mu_2^2} = 2.333$.
2. Find the second, third and fourth central moments of the frequency distribution given
below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.
Class limits     Frequency
110.0 – 114.9    5
115.0 – 119.9    15
120.0 – 124.9    20
125.0 – 129.9    35
130.0 – 134.9    10
135.0 – 139.9    10
140.0 – 144.9    5

Ans.
$\mu_2 = 2.16$, $\mu_3 = 0.804$, $\mu_4 = 12.5232$,
$\gamma_1 = \sqrt{\beta_1} = 0.25298$; $\gamma_2 = \beta_2 - 3 = -0.317$.
3. Find the second, third and fourth central moments of the frequency distribution
given below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.
x:  5    10   15   20   25   30   35
f:  4    10   20   36   16   12   2
Ans.
$\mu_2 = 44.41$, $\mu_3 = -12.504$, $\mu_4 = 5423.5057$, $\beta_1 = 0.001785$,
$\beta_2 = 2.7499$, $\gamma_1 = -\sqrt{\beta_1} = -0.0422$; $\gamma_2 = \beta_2 - 3 = -0.2501$.
4. Compute the first four moments about the mean from the following data. Hence, find (i) a
measure of skewness and (ii) a measure of kurtosis.
Class Intervals:  0 – 10   10 – 20   20 – 30   30 – 40
Frequency:        1        3         4         2
Ans.
$\mu_1 = 0$, $\mu_2 = 81$, $\mu_3 = -144$, $\mu_4 = 14817$, $\beta_1 = 0.03902$,
$\beta_2 = 2.2583$, $\gamma_1 = -\sqrt{\beta_1} = -0.1975$; $\gamma_2 = \beta_2 - 3 = -0.7417$.
Correlation and Regression:

The word correlation is used in everyday life to denote some form of association. In statistical
terms we use correlation to denote association between two quantitative variables. We also
assume that the association is linear, that one variable increases or decreases a fixed amount
for a unit increase or decrease in the other. The other technique that is often used in these
circumstances is regression, which involves estimating the best straight line to summarize the
association.
Correlation:
Correlation means simply a relation between two or more variables.
Two variables are said to be correlated if the change in one variable results in a corresponding
change in the other.
Ex: 1. x: supply y: price
2. x: demand y: Price.
Positive correlation:
If an increase or decrease in one variable corresponds to an increase or decrease in the other
then the correlation is said to be positive correlation or direct correlation.
Ex: 1. Demand and price of commodity. 2. Income and expenditure.

Negative correlation:
If an increase or decrease in one variable corresponds to a decrease or increase, respectively, in
the other, then the correlation is said to be negative correlation or inverse correlation.
Ex: 1.Supply and Price of a commodity.
2. Correlation between Volume and pressure of a perfect gas.
No correlation
If there exists no relationship between two variables, then they are said to be uncorrelated.
Scatter diagram
To obtain a measure of relationship between two variables x and y we plot their corresponding
values in the xy - plane. The resulting diagram showing the collection of the dots is called the
dot diagram or scatter diagram.
Correlation Coefficient (Karl Pearson correlation coefficient)

The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes
called Karl Pearson's correlation coefficient and is a measure of linear association. If a curved
line is needed to express the relationship, other and more complicated measures of the
correlation must be used.
Let $x_1, x_2, x_3, \ldots, x_n$ be n values of x and $y_1, y_2, y_3, \ldots, y_n$ be the
corresponding n values of y. Then the coefficient of correlation between x and y is
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y},$$
where $\sigma_x^2$ is the variance of the x series, $\sigma_y^2$ is the variance of the y series,
$\bar{x} = \frac{\sum x}{n}$ is the mean of the x series and $\bar{y} = \frac{\sum y}{n}$ is the
mean of the y series.
For computation purposes we can use the formula
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}}.$$

Limits for correlation coefficient
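The coefficient of correlation numerically does not exceed unity, i.e. $-1 \le r \le 1$.
Proof:
We have
$$r = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}\; \sqrt{\frac{1}{n}\sum (y_i - \bar{y})^2}}, \quad i = 1, 2, \ldots, n.$$
Writing $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$,
$$r = \frac{\frac{1}{n}\sum a_i b_i}{\sqrt{\frac{1}{n}\sum a_i^2}\; \sqrt{\frac{1}{n}\sum b_i^2}}, \qquad r^2 = \frac{(\sum a_i b_i)^2}{\sum a_i^2 \sum b_i^2}. \qquad (1)$$
The Schwarz inequality states that if $a_i, b_i$, i = 1, 2, …, n are real quantities, then
$$\left(\sum a_i b_i\right)^2 \le \sum a_i^2 \sum b_i^2,$$
with the sign of equality holding if and only if
$$\frac{a_1}{b_1} = \frac{a_2}{b_2} = \frac{a_3}{b_3} = \cdots = \frac{a_n}{b_n}.$$
Using this, equation (1) becomes $r^2 \le 1$,
⇒ $|r| \le 1$,
⇒ $-1 \le r \le 1$.
Hence the correlation coefficient cannot exceed unity numerically.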
Figure 1.1 Correlation illustrated.
Note:
1. If r = -1 there is a perfect negative correlation.
2. If r = 1 there is a perfect positive correlation.
3. If r = 0 then the variables are uncorrelated.
4. When r = 0, $\theta = \frac{\pi}{2}$, where θ is the angle between the two lines of regression;
i.e., when the variables are uncorrelated, the two lines of regression are perpendicular to each
other.
5. When $r = \pm 1$, θ = 0 or π; i.e., the lines of regression coincide.

RANK CORRELATION
In many practical situations, characters are not measurable.
They are qualitative characteristics and individuals or items can be ranked in order of their
merits. This type of situation occurs when we deal with qualitative characteristics such as
honesty, beauty, voice, etc. For example, contestants in a singing competition may be ranked by a
judge according to their performance. In another example, students may be ranked in different
subjects according to their performance in tests.
Arrangement of individuals or items in order of merit or proficiency in the possession of a
certain characteristic is called ranking and the number indicating the position of individuals or
items is known as rank.
If ranks of individuals or items are available for two characteristics then correlation between
ranks of these two characteristics is known as rank correlation.
With the help of rank correlation, we find the association between two qualitative
characteristics. Karl Pearson’s correlation coefficient gives the intensity of the linear
relationship between two variables, whereas Spearman’s rank correlation coefficient measures
the strength of association between two ranked variables. The formula for Spearman’s rank
correlation coefficient is given in the following section.
RANK CORRELATION COEFFICIENT FORMULA
Suppose we have a group of n individuals and let $x_1, x_2, \ldots, x_n$ and
$y_1, y_2, \ldots, y_n$ be the ranks of the n individuals in characteristics A and B respectively.
Then the rank correlation coefficient $r_s$ is given by
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Here $d_i = x_i - y_i$ is the difference between the ranks assigned in characteristics A and B,
and n is the number of pairs of data.
This formula was given by Spearman and hence it is known as Spearman’s rank correlation
coefficient formula.
Note 1: When two or more observations have equal values (a tie), it is difficult to assign
ranks to them. In such cases, each tied observation is given the average of the ranks the tied
observations would have received, and a modified formula is used to calculate the correlation
coefficient.
The Spearman’s correlation coefficient for tied ranks can be calculated using the formula
$$r_s = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \sum_i \frac{1}{12}(m_i^3 - m_i)\right]}{n(n^2 - 1)}$$
where $m_1, m_2, \ldots$ are the numbers of repetitions of the tied ranks and
$\frac{1}{12}(m_i^3 - m_i)$ are the corresponding correction factors.
Note 2: $r_s$ lies between -1 and 1.
Examples:
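1. If r is the correlation coefficient between x and y and z = ax + by, show that
$$r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$
Solution:
Let z = ax + by. Then
$$\frac{1}{n}\sum z = \frac{a}{n}\sum x + \frac{b}{n}\sum y \;\Rightarrow\; \bar{z} = a\bar{x} + b\bar{y},$$
so that $z - \bar{z} = a(x - \bar{x}) + b(y - \bar{y})$. Squaring and averaging,
$$\frac{1}{n}\sum (z - \bar{z})^2 = a^2\,\frac{1}{n}\sum (x - \bar{x})^2 + b^2\,\frac{1}{n}\sum (y - \bar{y})^2 + 2ab\,\frac{1}{n}\sum (x - \bar{x})(y - \bar{y}),$$
$$\Rightarrow\; \sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,r\,\sigma_x\sigma_y,$$
$$\Rightarrow\; r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$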
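2. While calculating the correlation coefficient between x and y from 25 pairs of
observations, a person obtained the following values: $\sum x_i = 125$, $\sum x_i^2 = 650$,
$\sum y_i = 100$, $\sum y_i^2 = 460$, $\sum x_i y_i = 508$. It was later discovered that he had
copied down the pairs (8,12) and (6,8) as (6,12) and (8,6) respectively. Obtain the correct value
of the correlation coefficient.
Solution:
Removing the wrong pairs and adding the correct ones,
Correct $\sum x_i = 125 - (6 + 8) + (8 + 6) = 125$,
Correct $\sum x_i^2 = 650 - (36 + 64) + (64 + 36) = 650$,
Correct $\sum y_i = 100 - (12 + 6) + (12 + 8) = 102$,
Correct $\sum y_i^2 = 460 - (144 + 36) + (144 + 64) = 488$,
Correct $\sum x_i y_i = 508 - (72 + 48) + (96 + 48) = 532$,
and n = 25. Hence
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}} = 0.51912.$$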

3. The following table gives the age (in years) of 10 married couples. Calculate the coefficient
of correlation between these ages.
Age of husband (x):  23  27  28  29  30  31  33  35  36  39
Age of wife (y):     18  22  23  24  25  26  28  29  30  32
Solution:
Here n = 10.
We find $\bar{x} = \frac{1}{n}\sum x_i = \frac{311}{10} = 31.1$ and $\bar{y} = \frac{1}{n}\sum y_i = \frac{257}{10} = 25.7$.
x_i     X_i = x_i - x̄    X_i²     Y_i = y_i - ȳ    Y_i²     X_i Y_i
23      -8.1             65.61    -7.7             59.29    62.37
27      -4.1             16.81    -3.7             13.69    15.17
28      -3.1             9.61     -2.7             7.29     8.37
29      -2.1             4.41     -1.7             2.89     3.57
30      -1.1             1.21     -0.7             0.49     0.77
31      -0.1             0.01     0.3              0.09     -0.03
33      1.9              3.61     2.3              5.29     4.37
35      3.9              15.21    3.3              10.89    12.87
36      4.9              24.01    4.3              18.49    21.07
39      7.9              62.41    6.3              39.69    49.77
Total                    202.90                    158.10   178.30
$$r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \sum Y_i^2}} = \frac{178.3}{\sqrt{(202.9)(158.1)}} = 0.9955 \approx 1,$$
i.e., the ages of husbands and wives are almost perfectly correlated.
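The results of Examples 2 and 3 above can be reproduced with the computational formula for r. This is a minimal sketch; the function names `pearson_r_from_sums` and `pearson_r` are assumptions made for illustration.

```python
import math


def pearson_r_from_sums(n, sx, sy, sxx, syy, sxy):
    """Karl Pearson's r from n, sum x, sum y, sum x^2, sum y^2 and sum xy."""
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


def pearson_r(xs, ys):
    """Karl Pearson's r computed directly from paired observations."""
    return pearson_r_from_sums(
        len(xs), sum(xs), sum(ys),
        sum(x * x for x in xs), sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )


# Example 2: corrected sums for the 25 pairs.
r_example2 = pearson_r_from_sums(25, 125, 102, 650, 488, 532)

# Example 3: ages of husbands and wives.
husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
r_example3 = pearson_r(husband, wife)

print(round(r_example2, 5), round(r_example3, 4))
```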

4. Suppose we have the ranks of 8 B.Sc. students in Statistics and Mathematics. On the
basis of the ranks, we would like to know to what extent the knowledge of the students
in Statistics and Mathematics is related.
Rank in Statistics:   1  2  3  4  5  6  7  8
Rank in Mathematics:  2  4  1  5  3  8  7  6
Solution: Spearman’s rank correlation coefficient formula is
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Let us denote the rank of students in Statistics by $R_x$ and the rank in Mathematics by $R_y$.
For the calculation of the rank correlation coefficient, we have to find $\sum_{i=1}^{n} d_i^2$,
which is obtained through the following table:
Rank in Statistics (Rx)   Rank in Mathematics (Ry)   d_i = Rx - Ry   d_i²
1                         2                          -1              1
2                         4                          -2              4
3                         1                          2               4
4                         5                          -1              1
5                         3                          2               4
6                         8                          -2              4
7                         7                          0               0
8                         6                          2               4
Total                                                                22
Here, n = number of paired observations = 8 and $\sum_{i=1}^{8} d_i^2 = 22$, so
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22}{8 \times 63} = 0.74$$
Thus, there is a positive association between ranks of Statistics and Mathematics.

5. Suppose we have the ranks of 5 students in three subjects, Computer, Physics and
Statistics, and we want to test which two subjects have the same trend.
Rank in Computer:    2  4  5  1  3
Rank in Physics:     5  1  2  3  4
Rank in Statistics:  2  3  5  4  1
Solution: In this problem, we want to see which two subjects have the same trend, i.e.,
which two subjects have a positive rank correlation coefficient.
Here we have to calculate three rank correlation coefficients:
$r_{12s}$ = Rank correlation coefficient between the ranks of Computer and Physics
$r_{23s}$ = Rank correlation coefficient between the ranks of Physics and Statistics
$r_{13s}$ = Rank correlation coefficient between the ranks of Computer and Statistics
Let $R_1$, $R_2$ and $R_3$ be the ranks of students in Computer, Physics and
Statistics respectively.
Computer (R1)   Physics (R2)   Statistics (R3)   d12 = R1-R2   d12²   d23 = R2-R3   d23²   d13 = R1-R3   d13²
2               5              2                 -3            9      3             9      0             0
4               1              3                 3             9      -2            4      1             1
5               2              5                 3             9      -3            9      0             0
1               3              4                 -2            4      -1            1      -3            9
3               4              1                 -1            1      3             9      2             4
Total                                                          32                   32                   14
Thus $\sum d_{12}^2 = 32$, $\sum d_{23}^2 = 32$, $\sum d_{13}^2 = 14$.
Now, with n = 5,
$$r_{12s} = 1 - \frac{6\sum d_{12}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 32}{5 \times 24} = -0.6$$
$$r_{23s} = 1 - \frac{6\sum d_{23}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 32}{5 \times 24} = -0.6$$
$$r_{13s} = 1 - \frac{6\sum d_{13}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 14}{5 \times 24} = 0.3$$
$r_{12s}$ is negative, which indicates that Computer and Physics have opposite trends. Similarly,
the negative rank correlation $r_{23s}$ shows opposite trends in Physics and Statistics.
$r_{13s} = 0.3$ indicates that Computer and Statistics have the same trend.
Sometimes we do not have ranks, but the actual values of the variables are available. If we are
interested in the rank correlation coefficient, we first find the ranks from the given values.
The following problems illustrate this case.

6. Calculate the rank correlation coefficient from the following data:
x:  78   89   97   69   59   79   68
y:  125  137  156  112  107  136  124
Solution: The calculations are shown in the following table:
x     y      Rank of x (Rx)   Rank of y (Ry)   d = Rx - Ry   d²
78    125    4                4                0             0
89    137    2                2                0             0
97    156    1                1                0             0
69    112    5                6                -1            1
59    107    7                7                0             0
79    136    3                3                0             0
68    124    6                5                1             1
Total                                                        2
Spearman’s rank correlation formula gives
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 2}{7(49 - 1)} = 0.96$$
7. Calculate the rank correlation coefficient from the following data:
x:  81  78  73  73  69  68  62  58
y:  10  12  18  18  18  22  20  24
Solution: The calculations are shown in the following table:
x     y     Rank of x (Rx)   Rank of y (Ry)   d = Rx - Ry   d²
81    10    1                8                -7            49
78    12    2                7                -5            25
73    18    3.5              5                -1.5          2.25
73    18    3.5              5                -1.5          2.25
69    18    5                5                0             0
68    22    6                2                4             16
62    20    7                3                4             16
58    24    8                1                7             49
Total                                                       159.50
Spearman’s rank correlation formula for tied ranks is
$$r_s = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\{(m_1^3 - m_1) + (m_2^3 - m_2)\}\right]}{n(n^2 - 1)}$$
where $m_1 = 2$ (two items of x have the equal value 73) and $m_2 = 3$ (three items of y
have the value 18). Hence
$$r_s = 1 - \frac{6 \times \left[159.50 + \frac{1}{12}\{(8 - 2) + (27 - 3)\}\right]}{8(64 - 1)} = 1 - \frac{6 \times 162}{504} = -0.9286$$
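The two rank-correlation problems above, one without ties and one with ties, can be checked with a short Python sketch. It assumes ranks are assigned in descending order (largest value gets rank 1) with tied values sharing their average rank, and it applies the tie correction $\frac{1}{12}(m^3 - m)$ for every group of equal values in either series. The function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank in descending order (largest = 1); tied values share the average rank."""
    positions = {}
    for pos, value in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(value, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman_rs(xs, ys):
    """Spearman's rank correlation with the tie correction factor."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    # One correction term (m^3 - m) / 12 per group of m tied values.
    correction = sum(
        (m ** 3 - m) / 12
        for series in (xs, ys)
        for m in Counter(series).values()
        if m > 1
    )
    return 1 - 6 * (d_squared + correction) / (n * (n ** 2 - 1))


# Data without ties.
rs_no_ties = spearman_rs([78, 89, 97, 69, 59, 79, 68],
                         [125, 137, 156, 112, 107, 136, 124])

# Data with ties (73 twice in x, 18 three times in y).
rs_ties = spearman_rs([81, 78, 73, 73, 69, 68, 62, 58],
                      [10, 12, 18, 18, 18, 22, 20, 24])

print(round(rs_no_ties, 4), round(rs_ties, 4))
```

When there are no ties the correction term is zero, so the same function covers both formulas.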
Regression :
Correlation describes the strength of an association between two variables and is completely
symmetrical: the correlation between A and B is the same as the correlation between B and A.
However, if the two variables are related it means that when one changes by a certain amount
the other changes on an average by a certain amount. The relationship can be represented by a
simple equation called the regression equation. In this context "regression" (the term is a
historical anomaly) simply means that the average value of y is a "function" of x, that is, it
changes with x.
Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of data.
Line of regression:
Line of regression is the line which gives the best estimate to the value of one variable for any
specific value of the other variable. So the line of regression is the line of best fit.
Method of Least squares:
Suppose we are given n values of x1, x2, x3,….., xn of an independent variable x and the
corresponding values y1, y2, y3,….., yn of a variable y depending on x. Then the pairs (x1, y1),
(x2, y2),........, (xn, yn) give us n- points in the xy-plane. Generally, it is not possible to find the
actual curve y = f(x) that passes through these points. Hence, we try to find a curve that serves
as best approximation to the curve y = f(x). Such a curve is referred to as the curve of best fit.
The process of determining a curve of best fit is called curve fitting. A method to find curve of
best fit is called method of least squares.

The method of least squares requires that the curve pass as closely as possible to all
the points. Let y = f(x) be an approximate relation that fits the data (xi, yi). Then yi are called
the observed values and Yi = f(xi) are called the expected values. The differences Ei = yi - Yi
are called the estimated errors or residuals.
The method of least squares provides a relationship y = f(x) such that the sum of the squares of
the residuals is least. Such a curve is known as the least squares curve.
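Regression line of y on x:
Let the regression line of y on x be y = a + bx.
The normal equations given by the method of least squares are
$$\sum y = na + b\sum x,$$
$$\sum xy = a\sum x + b\sum x^2.$$
Dividing the first equation by n,
$$\frac{1}{n}\sum y = a + \frac{b}{n}\sum x, \quad \text{i.e.} \quad \bar{y} = a + b\bar{x},$$
so the regression line passes through $(\bar{x}, \bar{y})$. Writing $X = x - \bar{x}$ and
$Y = y - \bar{y}$, the slope is
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum XY}{\sum X^2} = \frac{\sum XY}{n\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x}.$$
Hence
$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \;\Rightarrow\; Y = b_{yx} X$$
is the regression line of y on x.
Regression line of x on y:
$$x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}(y - \bar{y}) \;\Rightarrow\; X = b_{xy} Y$$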
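Note:
1. Regression coefficient of y on x:
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = r\,\frac{\sigma_y}{\sigma_x}.$$
2. Regression coefficient of x on y:
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = r\,\frac{\sigma_x}{\sigma_y}.$$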
Examples:
1. If the two regression equations of the variables x and y are x = 19.13 - 0.87y and
y = 11.64 - 0.5x, find
(a) the mean of x
(b) the mean of y
(c) the correlation coefficient between x and y.
Soln:
Since $(\bar{x}, \bar{y})$ lies on both regression lines,
$\bar{x} = 19.13 - 0.87\bar{y}$, $\bar{y} = 11.64 - 0.5\bar{x}$.
Solving, we get $\bar{x} = 15.93$, $\bar{y} = 3.67$.
$b_{yx} = -0.5$, $b_{xy} = -0.87$, so $r^2 = b_{yx} b_{xy} = 0.435$. Since both regression
coefficients are negative, $r = -\sqrt{0.435} = -0.66$.
2. The following table shows the test scores obtained by salesmen in an intelligence
test and their weekly sales.
Test scores (x):  1    2  3    4  5    6  7    8  9    10
Sales (y):        2.5  6  4.5  5  4.5  2  5.5  3  4.5  3
Calculate the regression line of sales on test scores and estimate the most probable weekly
sales volume if a salesman scores 70.
Soln:
$\bar{x} = 60$, $\bar{y} = 4.05$. The regression line of y on x is
$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}(x - \bar{x}),$$
which gives y = 0.06x + 0.45.
When x = 70, y = 4.65.
3. In a partially destroyed laboratory record of an analysis of correlation data, only the
following results are legible:
Variance of x = 9, Regression equations 8x - 10y + 66 = 0, 40x - 18y = 214.
What are (i) the mean values of x and y,
(ii) the correlation coefficient between x and y,
(iii) the standard deviation of y?
Soln:
(i) Since both the lines of regression pass through the point $(\bar{x}, \bar{y})$,
$8\bar{x} - 10\bar{y} + 66 = 0$,
$40\bar{x} - 18\bar{y} - 214 = 0$.
Solving these equations, we get $\bar{x} = 13$, $\bar{y} = 17$.
(ii) Let 8x - 10y + 66 = 0 and 40x - 18y = 214 be the lines of regression of y on x
and x on y respectively, i.e. y = (4/5)x + 6.6 and x = (9/20)y + 5.35. Then
$$b_{yx} = \frac{4}{5}, \quad b_{xy} = \frac{18}{40} = \frac{9}{20}, \quad \text{hence } r^2 = b_{yx} b_{xy} = \frac{9}{25}, \quad r = \pm\frac{3}{5} = \pm 0.6.$$
Since both the regression coefficients are positive, we take r = 0.6.
(iii) $\sigma_x^2 = 9$, so $\sigma_x = 3$. From $b_{yx} = r\,\sigma_y / \sigma_x$,
$\sigma_y = b_{yx}\,\sigma_x / r = (0.8)(3)/0.6 = 4$.
Standard deviation of y = 4.
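The steps of this example can be sketched in Python: solve the two regression equations for the means, read off the regression coefficients, and recover r and $\sigma_y$. The helper `solve_2x2` and the variable names are assumptions introduced for illustration, and the assignment of the first equation to "y on x" follows the solution above.

```python
import math


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y


# 8x - 10y = -66 (taken as y on x) and 40x - 18y = 214 (taken as x on y).
x_bar, y_bar = solve_2x2(8, -10, -66, 40, -18, 214)

b_yx = 8 / 10    # slope of y on x: y = (8/10) x + 6.6
b_xy = 18 / 40   # slope of x on y: x = (18/40) y + 5.35

# r has the same sign as the regression coefficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

sigma_x = math.sqrt(9)            # variance of x is given as 9
sigma_y = b_yx * sigma_x / r      # from b_yx = r * sigma_y / sigma_x

print(x_bar, y_bar, round(r, 4), round(sigma_y, 4))
```

A quick sanity check on the choice of lines is that $b_{yx} b_{xy} = r^2$ must not exceed 1; here it equals 0.36.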
4. The following table gives the stopping distance y in meters of a motor bike
moving at a speed of x km/hour when the brakes are applied.
x:  16    24    32    40    48    56
y:  0.39  0.75  1.23  1.91  2.77  3.81
Find the correlation coefficient between the speed and the stopping distance, and the
equations of the regression lines. Hence estimate the maximum speed at which the motor
bike could be driven if the stopping distance is not to exceed 5 meters.
Soln:
$\bar{x} = 36$, $\bar{y} = 1.81$, $\sigma_x = 13.663$, $\sigma_y = 1.1831$,
$b_{yx} = 0.0851$, $b_{xy} = 11.352$,
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}} = 0.983.$$
The equation of the line of regression of y on x is y = 0.0851x - 1.2536 (i)
and the equation of the line of regression of x on y is x = 11.352y + 15.453. (ii)
For y = 5, equation (ii) gives x = 72.213.
Accordingly, for the stopping distance not to exceed 5 meters, the speed must not
exceed 72 km/hour.
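The quantities in this example can be recomputed from the raw data with a short Python sketch. The function name `regression_summary` is an assumption; small differences from the rounded figures quoted above (for instance in $b_{xy}$) are to be expected when intermediate values are not rounded.

```python
import math


def regression_summary(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy, r) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b_yx = s_xy / s_xx          # regression coefficient of y on x
    b_xy = s_xy / s_yy          # regression coefficient of x on y
    r = math.copysign(math.sqrt(b_yx * b_xy), s_xy)
    return x_bar, y_bar, b_yx, b_xy, r


speed = [16, 24, 32, 40, 48, 56]
distance = [0.39, 0.75, 1.23, 1.91, 2.77, 3.81]
x_bar, y_bar, b_yx, b_xy, r = regression_summary(speed, distance)

# Line of x on y evaluated at a stopping distance of 5 m.
speed_at_5m = x_bar + b_xy * (5 - y_bar)
print(round(b_yx, 4), round(b_xy, 3), round(r, 3), round(speed_at_5m, 1))
```

The estimated limiting speed is again about 72 km/hour.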
Multivariate Regression Analysis using least squares estimation of the parameters
When several independent variables are used to estimate the value of the
dependent variable, it is called multiple regression. The multiple linear regression model is just
an extension of the simple linear regression model. In simple linear regression, we used an “x”
to represent the explanatory variable. In multiple linear regression, we will have more than one
explanatory variable.
Let an experiment be conducted n times, and the data be obtained as follows:
Observation number   Response y   Explanatory variables
                                  X1     X2     …    Xk
1                    y1           x11    x12    …    x1k
2                    y2           x21    x22    …    x2k
⁝                    ⁝            ⁝      ⁝      ⋱    ⁝
n                    yn           xn1    xn2    …    xnk
Assume that the model is
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k,$$
where y is an observed value of the response variable for a particular observation in the
population and $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are parameters which are to be
determined. The n tuples of observations are also assumed to follow the same model. Thus they
satisfy
$$y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k}$$
$$y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k}$$
$$\vdots$$
$$y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk}.$$
These n equations can be written as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}, \quad i = 1, \ldots, n.$$
Using the least squares principle, we get the following normal equations:
$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \beta_2 \sum_{i=1}^{n} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$$
$$\beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 + \beta_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} x_{i1} x_{ik} = \sum_{i=1}^{n} x_{i1} y_i$$
$$\vdots$$
$$\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \beta_2 \sum_{i=1}^{n} x_{ik} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i$$
Solving the above normal equations, we get the values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$.

Example:
1. A company produces two different items A and B. The data below show the sales of these items
in one day and the profit made by the company on that day.
x1 (Sales of item A):  8      11     9      8      6      10     7
x2 (Sales of item B):  6      4      5      7      1      1      0
Profit (y):            93.26  89.76  60.78  79.34  28.23  75.83  32.74
Fit the best multilinear model that represents the relationship between sales of A and B and
the profit.
Solution: The normal equations corresponding to the regression equation
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ are:
$$n\beta_0 + \beta_1 \sum x_1 + \beta_2 \sum x_2 = \sum y$$
$$\beta_0 \sum x_1 + \beta_1 \sum x_1^2 + \beta_2 \sum x_1 x_2 = \sum y x_1$$
$$\beta_0 \sum x_2 + \beta_1 \sum x_1 x_2 + \beta_2 \sum x_2^2 = \sum y x_2$$
x1     x2     y        y²        x1²    x2²    x1*y      x2*y      x1*x2
8      6      93.26    8697.43   64     36     746.08    559.56    48
11     4      89.76    8056.86   121    16     987.36    359.04    44
9      5      60.78    3694.21   81     25     547.02    303.9     45
8      7      79.34    6294.84   64     49     634.72    555.38    56
6      1      28.23    796.933   36     1      169.38    28.23     6
10     1      75.83    5750.19   100    1      758.3     75.83     10
7      0      32.74    1071.91   49     0      229.18    0         0
∑ = 59 24     459.94   34362.4   515    128    4072.04   1881.94   209
Substituting the sums (with n = 7) into the normal equations,
$$7\beta_0 + 59\beta_1 + 24\beta_2 = 459.94$$
$$59\beta_0 + 515\beta_1 + 209\beta_2 = 4072.04$$
$$24\beta_0 + 209\beta_1 + 128\beta_2 = 1881.94$$
Solving, $\beta_0 = -28.5193$, $\beta_1 = 9.0031$, $\beta_2 = 5.3496$
⇒ $y = -28.5193 + 9.0031 x_1 + 5.3496 x_2$
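The fit in this example can be reproduced by forming the normal equations from the data and solving them. The sketch below uses plain Python with Gauss-Jordan elimination so that it has no external dependencies; the function names `solve_linear_system` and `fit_multiple_regression` are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple_regression(columns, y):
    """Least squares estimates [b0, b1, ..., bk] obtained from the normal equations."""
    n = len(y)
    design = [[1.0] * n] + [[float(v) for v in col] for col in columns]
    # Entry (p, q) of the normal matrix is the sum of products of columns p and q.
    normal = [[sum(u * v for u, v in zip(p, q)) for q in design] for p in design]
    rhs = [sum(u * v for u, v in zip(p, y)) for p in design]
    return solve_linear_system(normal, rhs)


x1 = [8, 11, 9, 8, 6, 10, 7]
x2 = [6, 4, 5, 7, 1, 1, 0]
profit = [93.26, 89.76, 60.78, 79.34, 28.23, 75.83, 32.74]
beta = fit_multiple_regression([x1, x2], profit)
print([round(b, 4) for b in beta])
```

The same function handles any number of explanatory variables, since the normal matrix is built column by column.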

2. A set of experimental runs was made to determine a way of predicting cooking time 𝑦 at
various values of oven width 𝑥1 and flue temperature 𝑥2 . The coded data were recorded
as follows:
| $y$ | 6.40 | 15.05 | 18.75 | 30.25 | 44.85 | 48.94 | 51.55 | 61.50 | 100.44 | 111.42 |
| $x_1$ | 1.32 | 2.69 | 3.56 | 4.41 | 5.35 | 6.20 | 7.12 | 8.87 | 9.80 | 10.65 |
| $x_2$ | 1.15 | 3.40 | 4.10 | 8.75 | 14.82 | 15.15 | 15.32 | 18.18 | 35.19 | 40.40 |
Estimate the multiple linear regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
Solution:
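The normal equations corresponding to the regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ are:
$$n\beta_0 + \beta_1 \sum x_1 + \beta_2 \sum x_2 = \sum y$$
$$\beta_0 \sum x_1 + \beta_1 \sum x_1^2 + \beta_2 \sum x_1 x_2 = \sum x_1 y$$
$$\beta_0 \sum x_2 + \beta_1 \sum x_1 x_2 + \beta_2 \sum x_2^2 = \sum x_2 y$$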
For the given data,
$$n = 10, \quad \sum x_1 = 59.97, \quad \sum x_1^2 = 446.9965, \quad \sum y = 489.15, \quad \sum x_1 y = 3875.9365,$$
$$\sum x_2 = 156.46, \quad \sum x_2^2 = 3991.1208, \quad \sum x_2 y = 11749.8781, \quad \sum x_1 x_2 = 1282.5215.$$
Substituting these values in the above normal equations and solving, we get
$$\beta_0 = 0.5800, \quad \beta_1 = 2.7122, \quad \beta_2 = 2.0497.$$
Hence the required multiple linear regression equation is
$$y = 0.5800 + 2.7122x_1 + 2.0497x_2.$$
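With only two predictors, the normal equations can also be solved in closed form by first eliminating $\beta_0$, which leaves a 2 × 2 system in sums of products of deviations from the means. The sketch below applies that route to the data of this example, recomputing every sum from the table so that each quantity can be checked independently. The names s11, s22, s12, s1y, s2y and centered_sum are notation introduced here, not part of the notes.

```python
# Data of Example 2 (coded values).
y = [6.40, 15.05, 18.75, 30.25, 44.85, 48.94, 51.55, 61.50, 100.44, 111.42]
x1 = [1.32, 2.69, 3.56, 4.41, 5.35, 6.20, 7.12, 8.87, 9.80, 10.65]
x2 = [1.15, 3.40, 4.10, 8.75, 14.82, 15.15, 15.32, 18.18, 35.19, 40.40]


def mean(v):
    return sum(v) / len(v)


def centered_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Raw sums that enter the normal equations, for comparison with the text.
raw = {
    "sum_x1": sum(x1),
    "sum_x2": sum(x2),
    "sum_y": sum(y),
    "sum_x1_sq": sum(a * a for a in x1),
    "sum_x2_sq": sum(a * a for a in x2),
    "sum_x1_x2": sum(a * b for a, b in zip(x1, x2)),
    "sum_x1_y": sum(a * b for a, b in zip(x1, y)),
    "sum_x2_y": sum(a * b for a, b in zip(x2, y)),
}

# Centered sums of squares and products.
s11 = centered_sum(x1, x1)
s22 = centered_sum(x2, x2)
s12 = centered_sum(x1, x2)
s1y = centered_sum(x1, y)
s2y = centered_sum(x2, y)

# Solve the reduced 2 x 2 system, then recover the intercept from the
# first normal equation (the fitted plane passes through the means).
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

for name, value in raw.items():
    print(name, round(value, 4))
print(round(b0, 4), round(b1, 4), round(b2, 4))
```

This gives the same estimates as solving the three normal equations directly, since the reduced system is obtained from them by algebra alone.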
Exercise:
1. If the coefficient of correlation between the variables x and y is 0.5 and the acute angle between their lines of regression is $\tan^{-1}(3/5)$, find the ratio of the standard deviations of x and y.
Ans. $\sigma_x / \sigma_y = 1/2$ or $\sigma_x / \sigma_y = 2$.
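2. Prove the following formulas for the coefficient of correlation r (in the usual notation):
a) $r = 1 - \dfrac{1}{2n} \sum \left( \dfrac{X_i}{\sigma_x} - \dfrac{Y_i}{\sigma_y} \right)^2$, $\quad r = -1 + \dfrac{1}{2n} \sum \left( \dfrac{X_i}{\sigma_x} + \dfrac{Y_i}{\sigma_y} \right)^2$.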
3. Find the rank correlation coefficient for the following data:
| x | 56 | 42 | 72 | 36 | 63 | 47 | 55 | 49 | 38 | 42 | 68 | 60 |
| y | 147 | 125 | 160 | 118 | 149 | 128 | 150 | 145 | 115 | 140 | 152 | 155 |

4. Ten participants in a contest are ranked by two judges as follows:
| X | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
| Y | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |
Calculate the rank correlation coefficient.

5. The following table shows the ages x and the systolic blood pressures y of 12 persons.
| Age (x) | 56 | 42 | 72 | 36 | 63 | 47 | 55 | 49 | 38 | 42 | 68 | 60 |
| Blood pressure (y) | 147 | 125 | 160 | 118 | 149 | 128 | 150 | 145 | 115 | 140 | 152 | 155 |
Calculate the coefficient of correlation between x and y. Estimate the blood pressure of a person whose age is 45 years.
Ans. r = 0.8961, y = 80.78 + 1.138x; when x = 45, y = 132.
6. The height (inches) and weight (pounds) of baseball players are given below:
(76, 212), (76, 224), (72, 180), (74, 210), (75, 215), (71, 200), (77, 235), (78, 235),
(77, 194), (76, 185).
(i) Estimate the coefficient of correlation between weight and height of baseball
players.
(ii) Find the regression line between weight and height. Use the regression equation to
find the weight of a baseball player that is 68 inches tall.
Ans. r = 0.5529, y = 4.737x − 147.227, x = 0.064y + 61.712; when x = 68, y ≈ 174.89 (from the line of y on x).
7. The equations of the regression lines of two variables x and y are 4x − 5y + 33 = 0 and
20x − 9y = 107. Find the correlation coefficient and the means of x and y.
Ans. r = 0.6, Mean of x = 13 and Mean of y = 17.
8. If the tangent of the angle between the lines of regression of y on x and x on y is 0.6
and the standard deviation of y is twice the standard deviation of x, find the coefficient
of correlation between x and y.
Ans. r = 0.5.
9. The chemistry grade, intelligence test score and number of classes missed for 12
students are given.
| Chemistry grade ($y$) | 85 | 74 | 76 | 90 | 85 | 87 | 94 | 98 | 81 | 91 | 76 | 74 |
| Test score ($x_1$) | 65 | 50 | 55 | 65 | 55 | 70 | 65 | 70 | 55 | 70 | 50 | 55 |
| Classes missed ($x_2$) | 1 | 7 | 5 | 2 | 6 | 3 | 2 | 5 | 4 | 3 | 1 | 4 |
a) Fit the best multilinear model that represents the relationship of the form
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
b) Estimate the chemistry grade for a student who has an intelligence test score of 60
and missed 4 classes.
10. An experiment was conducted to determine if the weight of an animal can be predicted after a
given period of time on the basis of the initial weight of the animal and the amount of feed that
was eaten. The following data, measured in kilograms, were recorded:
| Final weight ($y$) | 95 | 77 | 80 | 100 | 97 | 70 | 50 | 80 | 92 | 84 |
| Initial weight ($x_1$) | 42 | 33 | 33 | 45 | 39 | 36 | 32 | 41 | 40 | 38 |
| Feed weight ($x_2$) | 272 | 226 | 259 | 292 | 311 | 183 | 173 | 236 | 230 | 235 |
a) Fit the best multilinear model that represents the relationship of the form
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
b) Predict the final weight of an animal having an initial weight of 35 kilograms that is
given 250 kilograms of feed.
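Answers to the correlation and regression exercises can be cross-checked numerically. The following is a minimal sketch, shown on the age and blood pressure data of Exercise 5. The function name correlation_and_lines and the layout of its return value are illustrative choices, not notation from the notes.

```python
from math import sqrt


def correlation_and_lines(x, y):
    """Pearson r and both regression lines, each as (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    b_yx = sxy / sxx  # regression coefficient of y on x
    b_xy = sxy / syy  # regression coefficient of x on y
    return {
        "r": r,
        "y_on_x": (b_yx, my - b_yx * mx),  # y = intercept + slope * x
        "x_on_y": (b_xy, mx - b_xy * my),  # x = intercept + slope * y
    }


# Data of Exercise 5.
age = [56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60]
pressure = [147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155]

fit = correlation_and_lines(age, pressure)
slope, intercept = fit["y_on_x"]
print(round(fit["r"], 4), round(slope, 3), round(intercept, 2))
print(round(intercept + slope * 45))  # estimated pressure at age 45
```

When a value of y is to be estimated from a given x, the line of y on x is the one to use. Inverting the line of x on y generally gives a different value unless r = ±1.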
Resources:
1. https://nptel.ac.in/courses/111105042/
2. http://www.nptelvideos.in/2012/12/regression-analysis.html
3. https://nptel.ac.in/courses/111104074/

