
CORRELATION AND REGRESSION
CORRELATION:
The relationship between two or more variables is known as correlation. For example, the relationship between cost and price, demand and supply, distance and velocity, or production of crops and fertility of soil, amount of rainfall, relative humidity etc. are examples of correlation. There are three types of correlation:
(i) Simple correlation,
(ii) Partial correlation and
(iii) Multiple correlation.
SIMPLE CORRELATION:
The relationship between two variables is called simple correlation or linear correlation.
The numerical measurement of the strength (or degree) of relationship between two variables is known as the simple correlation coefficient. If x and y are two variables then the simple correlation coefficient between them is denoted by r_xy. The variables x and y are interchangeable, so if one is considered the dependent variable then the other will be the independent variable. The simple correlation coefficient is given by the following formulas:

1. r_{xy} = \frac{s_{xy}}{\sqrt{s_{xx}}\,\sqrt{s_{yy}}},  where

   s_{xy} = \sum(x-\bar{x})(y-\bar{y}) = \sum xy - \frac{\sum x \sum y}{n},

   s_{xx} = \sum(x-\bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}  and

   s_{yy} = \sum(y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}

2. r_{xy} = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\,\sqrt{\sum(y-\bar{y})^2}}

3. r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}

4. r_{xy} = \frac{n\sum uv - \sum u \sum v}{\sqrt{n\sum u^2 - (\sum u)^2}\,\sqrt{n\sum v^2 - (\sum v)^2}},  where u = \frac{x-a}{h} and v = \frac{y-b}{k}

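Formula (3) translates directly into a short program. Below is a minimal Python sketch that evaluates r_xy and the coefficient of determination r_xy^2; the data values are illustrative only (the first five pairs of the air-velocity worked example later in these notes).

# Minimal sketch: simple correlation coefficient by formula (3).
from math import sqrt

x = [20, 60, 100, 140, 180]          # illustrative data
y = [0.18, 0.37, 0.35, 0.78, 0.56]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

r_xy = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2))

print("r_xy   =", round(r_xy, 4))
print("r_xy^2 =", round(r_xy ** 2, 4))   # coefficient of determination (see below)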
Theorem:
Prove that the simple correlation coefficient always lies between -1 and +1, i.e. -1 ≤ r_xy ≤ 1.
Proof:
We know s_x^2 = \frac{\sum(x-\bar{x})^2}{n-1}, so \sum(x-\bar{x})^2 = (n-1)s_x^2.

Similarly, s_y^2 = \frac{\sum(y-\bar{y})^2}{n-1}, so \sum(y-\bar{y})^2 = (n-1)s_y^2.

Also, r_{xy} = \frac{1}{n-1}\,\frac{\sum(x-\bar{x})(y-\bar{y})}{s_x s_y}, so \sum(x-\bar{x})(y-\bar{y}) = (n-1)\,r_{xy}\,s_x s_y.

Consider the sum of squares

\sum\left(\frac{x-\bar{x}}{s_x} \pm \frac{y-\bar{y}}{s_y}\right)^2 \ge 0

Or, \sum\frac{(x-\bar{x})^2}{s_x^2} \pm 2\sum\frac{(x-\bar{x})(y-\bar{y})}{s_x s_y} + \sum\frac{(y-\bar{y})^2}{s_y^2} \ge 0

Or, \frac{(n-1)s_x^2}{s_x^2} \pm 2\frac{(n-1)\,r_{xy}\,s_x s_y}{s_x s_y} + \frac{(n-1)s_y^2}{s_y^2} \ge 0

Or, (n-1) \pm 2(n-1)r_{xy} + (n-1) \ge 0

Or, 2(n-1) \pm 2(n-1)r_{xy} \ge 0

Or, 2(n-1)\{1 \pm r_{xy}\} \ge 0

Or, 1 \pm r_{xy} \ge 0

Taking the positive sign: r_xy ≥ -1.
Taking the negative sign: r_xy ≤ 1.
Combining both we get -1 ≤ r_xy ≤ 1. Proved.

Properties of simple correlation coefficient

1. The dependent and independent variables are interchangeable: r_xy = r_yx.
2. It has no units.
3. It is independent of the change of origin and scale.
4. Its value always lies between -1 and +1, i.e. -1 ≤ r_xy ≤ 1.
5. (i) If r_xy = +1 then x and y have perfect positive correlation.
   (ii) If r_xy = -1 then x and y have perfect negative correlation.
   (iii) If r_xy = 0 then x and y have no correlation.
6. The simple correlation coefficient is the geometric mean of the two regression coefficients, i.e. r_xy = \sqrt{b_{xy}\,b_{yx}}.
INTERPRETING THE LINEAR CORRELATION COEFFICIENT
The value of r_xy must always fall between -1 and +1 inclusive. If r is close to zero, we conclude that there is no significant linear correlation between x and y, but if it is close to -1 or +1 we conclude that there is a significant linear correlation between x and y.
COEFFICIENT OF LINEAR DETERMINATION:
If r_xy is the linear correlation coefficient between two variables x and y, then r_xy^2 is the coefficient of determination. It is used to interpret the value of the coefficient of linear correlation: it tells how far the changes in one variable are explained by the other variable. For example, if r_xy = 0.6 then r_xy^2 = 0.36 = 36%, which means 36% of the changes in one variable are explained by the other variable.

PARTIAL CORRELATION
The relationship between three or more variables, in which one is dependent, one is independent and the rest of the independent variables are kept constant, is known as partial correlation.
The numerical measurement of the strength of relationship between a dependent variable and an independent variable, keeping the rest of the independent variables constant, is known as the partial correlation coefficient. For example, the relationship between quantity of production of crops and fertility of soil, keeping amount of rainfall, quality of seeds etc. constant, is an example of partial correlation.
If x₁ is a dependent variable and x₂ and x₃ are independent variables, then the partial correlation coefficient between x₁ and x₂ keeping x₃ constant is denoted by r₁₂.₃ and is given by the formula

r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1-r_{13}^2}\,\sqrt{1-r_{23}^2}}

Similarly,

r_{13.2} = \frac{r_{13} - r_{12}\,r_{32}}{\sqrt{1-r_{12}^2}\,\sqrt{1-r_{32}^2}}  and

r_{23.1} = \frac{r_{23} - r_{21}\,r_{31}}{\sqrt{1-r_{21}^2}\,\sqrt{1-r_{31}^2}}

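As a quick numerical check of these formulas, here is a small Python sketch; the helper name partial_r is just for this sketch, and the values of the simple correlation coefficients are the ones given in problem 3 of the exercises below.

from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    # Partial correlation of variables a and b, holding c constant.
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac**2) * sqrt(1 - r_bc**2))

r12, r13, r23 = 0.69, 0.64, 0.85   # simple correlation coefficients (problem 3 below)
print("r12.3 =", round(partial_r(r12, r13, r23), 4))
print("r13.2 =", round(partial_r(r13, r12, r23), 4))
print("r23.1 =", round(partial_r(r23, r12, r13), 4))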
PROPERTIES OF PARTIAL CORRELATION COEFFICIENT:

1. The subscripts before the dot can be interchanged, i.e. r₁₂.₃ = r₂₁.₃.
2. Its value lies between -1 and +1.
3. In partial correlation, if only one independent variable is kept constant then it is said to be a partial correlation of first order; if two independent variables are kept constant then it is said to be a partial correlation of second order, and so on.
COEFFICIENT OF PARTIAL DETERMINATION
The square of the partial correlation coefficient is called the coefficient of partial determination.
Hence r²₁₂.₃, r²₁₃.₂ and r²₂₃.₁ are coefficients of partial determination. It is used to interpret the value of the partial correlation coefficient. For example, if r₁₂.₃ = 0.9 then r²₁₂.₃ = 0.81 = 81%, which means 81% of the variation in the dependent variable x₁ is explained by the independent variable x₂, keeping x₃ constant.

MULTIPLE CORRELATION
The relationship between a dependent variable and two or more independent variables, in which the effects of all the independent variables are taken together, is known as multiple correlation.
The numerical measurement of the strength of relationship between a dependent variable and two or more independent variables, taking the effects of all the independent variables together, is known as the multiple correlation coefficient. The multiple correlation coefficient between a dependent variable x₁ and independent variables x₂ and x₃ is denoted by R₁.₂₃ and is given by the following formula:

R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}}

Similarly,

R_{2.31} = \sqrt{\frac{r_{23}^2 + r_{21}^2 - 2\,r_{23}\,r_{21}\,r_{31}}{1 - r_{13}^2}}  and

R_{3.12} = \sqrt{\frac{r_{31}^2 + r_{32}^2 - 2\,r_{31}\,r_{32}\,r_{12}}{1 - r_{12}^2}}

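The same formula is easy to script. The following Python sketch (the helper name multiple_R is ours) uses the simple correlation coefficients given in problem 8 of the exercises below and also reports the coefficient of multiple determination discussed next.

from math import sqrt

def multiple_R(r12, r13, r23):
    # R1.23: multiple correlation of x1 on x2 and x3 taken together.
    return sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))

r12, r13, r23 = 0.93, 0.50, 0.34   # values from problem 8 below
R1_23 = multiple_R(r12, r13, r23)
print("R1.23   =", round(R1_23, 4))
print("R1.23^2 =", round(R1_23**2, 4))   # coefficient of multiple determination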
PROPERTIES OF MULTIPLE CORRELATION COEFFICIENT

(i) Its value lies between 0 and +1, i.e. 0 ≤ R₁.₂₃ ≤ 1.
(ii) The positions of the subscripts after the dot can be interchanged: R₁.₂₃ = R₁.₃₂.
(iii) The multiple correlation coefficient is not less than the corresponding simple correlation coefficients, i.e. R₁.₂₃ ≥ r₁₂ and R₁.₂₃ ≥ r₁₃.
(iv) If R₁.₂₃ = 0 then r₁₂ = 0 and r₁₃ = 0.
COEFFICIENT OF MULTIPLE DETERMINATION
The square of the multiple correlation coefficient is known as the coefficient of multiple determination. Thus R²₁.₂₃, R²₂.₃₁ and R²₃.₁₂ are coefficients of multiple determination. It is used to interpret the value of the multiple correlation coefficient.
For example, if R₁.₂₃ = 0.8 then R²₁.₂₃ = 0.64 = 64%, which means 64% of the variation in the dependent variable x₁ is explained by the independent variables x₂ and x₃ together.
Problems:(Old questions of IOE)
1. Define partial correlation and multiple correlation
with suitable examples. Write down the properties
of multiple and partial correlation coefficient.
2. What do you mean by the correlation coefficient?
Show that the correlation coefficient lies between
-1 and +1.
3. The simple correlation coefficients between fertilizer x₁, seeds x₂ and productivity x₃ are r₁₂ = 0.69, r₁₃ = 0.64 and r₂₃ = 0.85. Calculate the partial correlation coefficient r₁₂.₃ and the multiple correlation coefficient R₁.₂₃.
4. Write the uses of correlation and regression in the field of engineering.
5. Write the properties of the correlation coefficient and describe under what condition there exists only one regression line.
6. Distinguish between correlation coefficient and regression coefficient and write their importance in the field of engineering.
7. A sample of 10 values of three variables X₁, X₂ and X₃ were obtained as:
   ΣX₁ = 10,  ΣX₂ = 20,  ΣX₃ = 30
   ΣX₁² = 20,  ΣX₂² = 68,  ΣX₃² = 170
   ΣX₁X₂ = 10,  ΣX₁X₃ = 15,  ΣX₂X₃ = 64
   Find:
   a) the partial correlation coefficient between X₁ and X₂, eliminating the effect of X₃;
   b) the multiple correlation coefficient between X₁, X₂ and X₃, assuming X₁ as dependent.
8. Define multiple correlation with a suitable example. You are given the simple correlation coefficients r₁₂ = 0.93, r₁₃ = 0.50 and r₂₃ = 0.34. Assuming the first variable as dependent, compute the multiple correlation coefficient R₁.₂₃ and the coefficient of multiple determination R²₁.₂₃. Also interpret the result.
9. Ten steel wires of diameter 0.5 mm and length 2.5 m were extended in a laboratory by applying vertical forces of varying magnitudes. Results are as follows:
   Forces (kg):             15  19  25  35  42  48  53  56  62  65
   Increase in length (mm): 1.7 2.1 2.5 3.4 3.9 4.9 5.4 5.7 6.6 7.2
   Determine the correlation coefficient and coefficient of determination between force and increase in length, and interpret the result using the coefficient of determination.
10. The concentrations of chloride and phosphate in a solution, in milligrams per liter, were determined over a 10-day period.
    Chloride:    64   66   64   62   65   64   64   67   74   69
    Phosphate: 1.31 1.39 1.59 1.68 1.89 1.98 1.97 1.99 1.98 2.15
    i) Compute the correlation coefficient r and comment on the result.
    ii) Do you see any role in this association for predictive purposes?
11. An article in the Journal of Environmental Engineering (Vol. 115, No. 3, 1989, pp. 608-619) reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island. The following data are chloride concentration y (in milligrams per liter) and roadway area in the watershed x (in percentage).
    x: 4.4  6.6  9.7  10.6 10.8 10.9 11.8 12.1
    y: 0.19 0.15 0.57 0.70 0.67 0.63 0.47 0.70
    Find the correlation coefficient and coefficient of determination of the given data and draw your conclusion.
12. A household survey of monthly expenditure on food yields the following results:
    Monthly expenditure (Rs 100): 10 15 20 25 30 35 40
    Monthly income (Rs 1000):      2  4  5  7  6  6  5
    Size of the family:            4  5  7 10  8 11  4
    Calculate the coefficient of multiple correlation and the coefficient of multiple determination. Also interpret the result.
13. From the following data find Karl Pearson's coefficient of correlation and interpret the result:
    Marks in statistics:  39 65 62 90 82 75 25 98 36 78
    Marks in mathematics: 47 53 58 86 62 68 60 91 51 84
14. In trying to evaluate the effectiveness of antibodies in killing bacteria, a research institute compiled the following information:
    Antibodies (mg): 12 15  14  16  17  10
    Bacteria:         5  7 5.6 7.2 8.6 6.2
    Find the strength and direction of the relationship between them.
REGRESSION ANALYSIS
The relationship between a dependent variable and one or more independent variables, in which the value of the dependent variable is predicted with the help of the independent variables, is known as regression analysis. It indicates the cause-and-effect relationship between the variables and establishes the functional relationship. Correlation describes to what degree the variables are related; regression, on the other hand, describes how the variables are related. That is, regression explains the nature of the relationship whereas correlation explains the strength of the relationship.

SIMPLE REGRESSION:
What is Simple Linear Regression?
Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables:
 One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
 The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
Because the other terms are used less frequently today, we'll use the "predictor" and "response"
terms to refer to the variables encountered in this course. The other terms are mentioned only to
make you aware of them should you encounter them. Simple linear regression gets its adjective
"simple," because it concerns the study of only one predictor variable. In contrast, multiple linear
regression, which we study later in this course, gets its adjective "multiple," because it concerns the
study of two or more predictor variables.
METHOD OF LEAST SQUARE:

Least Square Method


The least squares method is the process of finding the best-fitting curve or line of best fit for a set of data points by minimizing the sum of the squares of the offsets (residuals) of the points from the curve. During the process of finding the relation between two variables, the trend of outcomes is estimated quantitatively. This process is termed regression analysis. The method of curve fitting is an approach to regression analysis, and the method of fitting equations which approximates the curves to the given raw data is the least squares method.
It is quite obvious that the fitting of curves for a particular data set is not always unique. Thus, it is required to find a curve having minimal deviation from all the measured data points. This is known as the best-fitting curve and is found by using the least-squares method.

Least Square Method Definition


The least-squares method is a crucial statistical method that is used to find a regression line, or best-fit line, for a given pattern of data. The fitted line or curve is described by an equation with specific parameters. The method of least squares is widely used in estimation and regression; in regression analysis it is the standard approach for the approximate solution of sets of equations having more equations than unknowns.
The method of least squares defines the solution as the one that minimizes the sum of squares of the deviations, or errors, in the result of each equation. This sum of squared errors is also the quantity used to measure the variation in the observed data.
The least-squares method is often applied in data fitting. The best-fit result is the one that minimizes the sum of squared errors or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.
There are two basic categories of least-squares problems:

 Ordinary or linear least squares
 Nonlinear least squares
These depend upon the linearity or nonlinearity of the residuals. Linear problems are often seen in regression analysis in statistics. Non-linear problems, on the other hand, are generally solved by iterative refinement, in which the model is approximated by a linear one at each iteration.

Least Square Method Graph


In linear regression, the line of best fit is a straight line. The line is chosen by minimizing the residuals, or offsets, of each data point from the line. In common practice the vertical offsets from the line (or polynomial, surface, hyperplane, etc.) are the quantities minimized, rather than the perpendicular offsets.

Least Square Method Formula


The least-squares method states that the curve that best fits a given set of observations is the curve having the minimum sum of squared residuals (or deviations, or errors) from the given data points. Let us assume that the given data points are (x1,y1), (x2,y2), (x3,y3), …, (xn,yn), in which all x's are independent variables, while all y's are dependent ones. Also, let f(x) be the fitting curve and let d represent the error, or deviation, at each given point.
Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least-squares principle states that the curve that best fits is the one for which the sum of squares of all the deviations from the given values is a minimum, i.e.:

d1² + d2² + d3² + ⋯ + dn² = Σ (yi − f(xi))² = minimum

Limitations for Least-Square Method


The least-squares method is a very beneficial method of curve fitting. Despite many benefits, it has a few shortcomings too. One of the main limitations is discussed here.
Regression analysis using the least-squares method implicitly assumes that the errors in the independent variable are negligible or zero. When the errors in the independent variable are not negligible, the model is subject to measurement error, and the parameter estimates, confidence intervals and hypothesis tests obtained from the least-squares fit may then be unreliable.
The method of finding a and b in y = a + bx ……… (i)
The normal equations of the above line are
Σy = na + bΣx ……… (ii)
Σxy = aΣx + bΣx² ……… (iii)
Solving (ii) and (iii) we get the values of a and b. Putting these values in (i) gives the line of regression of y on x.
The method of finding c and d in x = c + dy ……… (iv)
The normal equations of the above line are
Σx = nc + dΣy ……… (v)
Σxy = cΣy + dΣy² ……… (vi)
Solving (v) and (vi) we get the values of c and d. Putting these values in (iv) gives the line of regression of x on y.

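A minimal Python sketch of this procedure, solving the two normal equations for a and b (the data values are illustrative only):

# Fit y = a + bx by solving the normal equations (ii) and (iii).
x = [15, 19, 25, 35, 42, 48]          # illustrative data
y = [1.7, 2.1, 2.5, 3.4, 3.9, 4.9]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# From  Sy = n*a + b*Sx  and  Sxy = a*Sx + b*Sxx :
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = (Sy - b * Sx) / n
print(f"regression line of y on x:  y = {a:.4f} + {b:.4f} x")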
Regression Coefficient
Definition: The regression coefficient is the constant 'b' in the regression equation that tells about the change in the value of the dependent variable corresponding to a unit change in the independent variable.

If there are two regression equations, then there will be two regression coefficients:

 Regression Coefficient of X on Y: The regression coefficient of X on Y is represented by the symbol b_xy; it measures the change in X for a unit change in Y. When the deviations are taken from the actual means of X and Y, b_xy can be obtained by the following formula:

b_{xy} = \frac{\sum(X-\bar{X})(Y-\bar{Y})}{\sum(Y-\bar{Y})^2}

When the deviations are taken from assumed means, the following formula is used:

b_{xy} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_y^2 - (\sum d_y)^2},  where d_x and d_y are the deviations of X and Y from their assumed means.

 Regression Coefficient of Y on X: The symbol b_yx is used; it measures the change in Y corresponding to a unit change in X. When the deviations are taken from the actual means, the following formula is used:

b_{yx} = \frac{\sum(X-\bar{X})(Y-\bar{Y})}{\sum(X-\bar{X})^2}

When the deviations are taken from the assumed means, b_yx can be calculated by the following formula:

b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - (\sum d_x)^2}

The regression coefficient is also called a slope coefficient because it determines the slope of the line, i.e. the change in the dependent variable for a unit change in the independent variable.
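As an illustration, the following Python sketch computes both regression coefficients from deviations about the actual means, using the first five pairs of marks from problem 13 of the exercises above, and checks the first property listed in the next section, r = √(b_xy·b_yx):

from math import sqrt, copysign

x = [39, 65, 62, 90, 82]   # marks in statistics (problem 13 above, first five values)
y = [47, 53, 58, 86, 62]   # marks in mathematics
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)

byx = Sxy / Sxx            # change in Y per unit change in X
bxy = Sxy / Syy            # change in X per unit change in Y
r = copysign(sqrt(bxy * byx), byx)   # correlation = geometric mean of the two coefficients

print("b_yx =", round(byx, 4), " b_xy =", round(bxy, 4), " r =", round(r, 4))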
Properties of Regression Coefficient
Definition: The constant 'b' in the regression equation (Ye = a + bX) is called the regression coefficient. It determines the slope of the line, i.e. the change in the value of Y corresponding to a unit change in X, and therefore it is also called a "slope coefficient."

Properties of Regression Coefficient

1. The correlation coefficient is the geometric mean of two regression coefficients. Symbolically, it can be
expressed as: r=√𝑏𝑥𝑦 𝑏𝑦𝑥
2. The value of the coefficient of correlation cannot exceed unity i.e. 1. Therefore, if one of the regression
coefficients is greater than unity, the other must be less than unity.
3. The sign of both the regression coefficients will be same, i.e. they will be either positive or negative.
Thus, it is not possible that one regression coefficient is negative while the other is positive.
4. The coefficient of correlation will have the same sign as that of the regression coefficients, such as if
the regression coefficients have a positive sign, then “r” will be positive and vice-versa.
5. The average value of the two regression coefficients is not less than the value of the correlation coefficient. Symbolically, (b_xy + b_yx)/2 ≥ r.


6. The regression coefficients are independent of the change of origin, but not of the scale. By origin, we
mean that there will be no effect on the regression coefficients if any constant is subtracted from the
value of X and Y. By scale, we mean that if the value of X and Y is either multiplied or divided by some
constant, then the regression coefficients will also change.

Thus, all these properties should be kept in mind while solving for the regression coefficient
MULTIPLE REGRESSION PLANE:
If x, y and z are three variables, then the regression plane of y on x and z is given by y = a + bx + cz ……… (i)
The normal equations of equation (i) are
Σy = na + bΣx + cΣz ……… (ii)
Σxy = aΣx + bΣx² + cΣxz ……… (iii)
Σyz = aΣz + bΣxz + cΣz² ……… (iv)
Solving equations (ii), (iii) and (iv) gives the values of a, b and c. Putting these values in equation (i) gives the regression plane of y on x and z.
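A small numpy-based sketch of the same procedure (numpy is assumed to be available; the data values are illustrative only):

import numpy as np

# Regression plane y = a + b*x + c*z via the normal equations (ii)-(iv).
x = np.array([57., 59., 49., 62., 51., 50.])
z = np.array([ 8., 10.,  6., 11.,  8.,  7.])
y = np.array([64., 71., 53., 67., 55., 58.])
n = len(y)

A = np.array([[n,        x.sum(),       z.sum()],
              [x.sum(),  (x * x).sum(), (x * z).sum()],
              [z.sum(),  (x * z).sum(), (z * z).sum()]])
rhs = np.array([y.sum(), (x * y).sum(), (y * z).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(f"y = {a:.4f} + {b:.4f} x + {c:.4f} z")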
INFERENCE CONCERNING THE LEAST SQUARES METHOD:
The regression equation y = a + bx is obtained on the basis of sample data. We are often interested in the corresponding equation y = α + βx of the population from which the sample is drawn. The following are tests concerning a normal population.

A TEST OF HYPOTHESIS CONCERNING THE SLOPE PARAMETER β:
To test the hypothesis that the population regression coefficient β is equal to some specified value, we use the test statistic

t = \frac{b-\beta}{s_e/\sqrt{s_{xx}}}  with n − 2 degrees of freedom.

Similarly, the test statistic for inference about the intercept α is

t = \frac{a-\alpha}{s_e}\sqrt{\frac{n\,s_{xx}}{s_{xx}+n\bar{x}^2}}  with n − 2 degrees of freedom,

where

s_e = \sqrt{\frac{s_{yy} - (s_{xy})^2/s_{xx}}{n-2}}

CONFIDENCE INTERVALS FOR THE INTERCEPT AND SLOPE:

i. For the intercept α:

C.I. = a ± t_{α/2, n−2} · s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}}

ii. For the slope β:

C.I. = b ± t_{α/2, n−2} · s_e \sqrt{\frac{1}{s_{xx}}}
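Assuming scipy is available for the t critical value, the following Python sketch (the function name slope_intercept_inference is ours) collects the above test statistics and confidence intervals in one place:

from math import sqrt
from scipy.stats import t as t_dist

def slope_intercept_inference(x, y, alpha=0.05, beta0=0.0, alpha0=0.0):
    # Least-squares fit, standard error of estimate, t statistics and
    # (1 - alpha) confidence intervals for the slope and intercept.
    n = len(x)
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    xbar = sum(x) / n
    b = sxy / sxx
    a = sum(y) / n - b * xbar
    se = sqrt((syy - sxy ** 2 / sxx) / (n - 2))
    t_slope = (b - beta0) / (se / sqrt(sxx))
    t_intercept = (a - alpha0) / se * sqrt(n * sxx / (sxx + n * xbar ** 2))
    t_crit = t_dist.ppf(1 - alpha / 2, n - 2)
    ci_slope = (b - t_crit * se / sqrt(sxx), b + t_crit * se / sqrt(sxx))
    half = t_crit * se * sqrt(1 / n + xbar ** 2 / sxx)
    ci_intercept = (a - half, a + half)
    return a, b, se, t_intercept, t_slope, ci_intercept, ci_slope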
WORKEDOUT PROBLEMS:
1. The following are the measurements of the air
velocity and evaporation coefficient of burning fuel
droplets in an impulse engine:

Air velocity (cm/sec), x    Evaporation coefficient (mm²/sec), y
20 0.18
60 0.37
100 0.35
140 0.78
180 0.56
220 0.75
260 1.18
300 1.36
340 1.17
380 1.65
i. Fit a straight line to these data by the method
of least squares and use it to estimate the
evaporation coefficient of a droplet when the
air velocity is 190 cm/sec.
ii. Construct a 95% confidence interval for the
intercept 𝜶 and slope𝜷.
iii. Test the null hypothesis β = 0 versus the alternative hypothesis β ≠ 0 at 5% level of significance.
iv. Test the null hypothesis α = 0 versus the alternative hypothesis α ≠ 0 at 5% level of significance.
Solution:
x     y     x²       y²      xy
20    0.18  400      0.0324  3.6
60    0.37  3600     0.1369  22.2
100   0.35  10000    0.1225  35
140   0.78  19600    0.6084  109.2
180   0.56  32400    0.3136  100.8
220   0.75  48400    0.5625  165
260   1.18  67600    1.3924  306.8
300   1.36  90000    1.8496  408
340   1.17  115600   1.3689  397.8
380   1.65  144400   2.7225  627
Σx = 2000   Σy = 8.35   Σx² = 532000   Σy² = 9.1097   Σxy = 2175.40

Now,

s_xx = Σx² − (Σx)²/n = 532000 − (2000)²/10 = 132000

s_yy = Σy² − (Σy)²/n = 9.1097 − (8.35)²/10 = 2.13745

s_xy = Σxy − (Σx)(Σy)/n = 2175.40 − (2000)(8.35)/10 = 505.40

x̄ = Σx/n = 200   and   ȳ = Σy/n = 0.835

b = b_yx = s_xy/s_xx = 505.40/132000 = 0.00383

a = ȳ − b·x̄ = 0.835 − (0.00383)(200) = 0.069

s_e = √[(s_yy − (s_xy)²/s_xx)/(n − 2)] = √[(2.13745 − (505.40)²/132000)/(10 − 2)] = √0.0253 = 0.159

(1 − α)100% = 95%, so α = 0.05.

(i) The equation of the straight line that best fits the given data in the least-squares sense is
y = a + bx = 0.069 + 0.00383x
∴ y = 0.069 + 0.00383x
When x = 190 cm/sec,
y = 0.069 + (0.00383)(190) = 0.80 mm²/sec

(ii) 95% confidence intervals for the intercept α and the slope β.
For the intercept α:
C.I. = a ± t_{α/2, n−2} · s_e √(1/n + x̄²/s_xx)
     = 0.069 ± (2.306)(0.159) √(1/10 + (200)²/132000)
     = 0.069 ± 0.233
     = (−0.164, 0.302)
For the slope β:
C.I. = b ± t_{α/2, n−2} · s_e √(1/s_xx)
     = 0.00383 ± (2.306)(0.159)/√132000
     = 0.00383 ± 0.00101
     = (0.00282, 0.00484)
(iii) A test of hypothesis concerning the slope, β = 0.
STEP I:
H₀: β = 0
H₁: β ≠ 0
STEP II:
α = 5% = 0.05
STEP III:
t_tab = t_{α/2, n−2} = t_{0.025, 8} = 2.306
STEP IV:
Test statistic under the null hypothesis H₀: β = 0:
t_cal = (b − β)/(s_e/√s_xx) = (0.00383 − 0)/(0.159/√132000) = 8.75
STEP V: (Decision)
Since t_cal > t_tab, the null hypothesis is rejected and the alternative hypothesis is accepted.
STEP VI: (Conclusion)
From the above procedure we conclude that the slope β ≠ 0.
(iv) Test of hypothesis concerning the intercept α:
(do yourself)
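The hand computation in this worked problem can be verified with a few lines of plain Python (no extra libraries needed):

x = [20, 60, 100, 140, 180, 220, 260, 300, 340, 380]
y = [0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n                      # 132000
syy = sum(v * v for v in y) - sum(y) ** 2 / n                      # about 2.1374
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # 505.40
b = sxy / sxx                                                      # about 0.00383
a = sum(y) / n - b * sum(x) / n                                    # about 0.069
se = ((syy - sxy ** 2 / sxx) / (n - 2)) ** 0.5                     # about 0.159
t_slope = b / (se / sxx ** 0.5)                                    # about 8.75
print(b, a, se, t_slope)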
2. Ten steel wires of diameter 0.5 mm and length 2.5 m were extended in a laboratory by applying vertical forces of varying magnitudes. Results are as follows:
   Forces (kg):             15  19  25  35  42  48  53  56  62  65
   Increase in length (mm): 1.7 2.1 2.5 3.4 3.9 4.9 5.4 5.7 6.6 7.2
   (a) Estimate the parameters of a simple linear regression model with force as the explanatory variable.
   (b) Find 95% confidence limits for the slope of the line.
3. Find the equation of the regression line of y on x, if the observations (xi, yi) are the following:
   (1,4), (2,8), (3,2), (4,12), (5,10), (6,14), (7,16), (8,6), (9,18)
4. The following table shows the weight z to the nearest pound, height x to the nearest inch, and age y to the nearest year, of 12 boys:
   Weight (z): 64 71 53 67 55 58 77 57 56 51 76 68
   Height (x): 57 59 49 62 51 50 55 48 52 42 61 57
   Age (y):     8 10  6 11  8  7 10  9 10  6 12  9
   (a) Fit a least squares regression plane.
   (b) Find the weight of a boy who is 9 years old and 54 inches tall.
Solution:

x (height)  y (age)  z (weight)
57          8        64
59          10       71
49          6        53
62          11       67
51          8        55
50          7        58
55          10       77
48          9        57
52          10       56
42          6        51
61          12       76
57          9        68

Totals: Σx = 643, Σy = 106, Σz = 753, Σx² = 34843, Σy² = 976, Σz² = 48139, Σxy = 5779, Σyz = 6796, Σzx = 40830.

(a) The linear regression equation of z on x and y can be written as
z = a + bx + cy ……… (i)
The normal equations of (i) are
Σz = na + bΣx + cΣy
Σxz = aΣx + bΣx² + cΣxy
Σyz = aΣy + bΣxy + cΣy²
Then
753 = 12a + 643b + 106c ……… (ii)
40830 = 643a + 34843b + 5779c ……… (iii)
6796 = 106a + 5779b + 976c ……… (iv)
Solving (ii), (iii) and (iv) we get
a = 3.6512, b = 0.8546 and c = 1.5063
The required regression plane is
z = 3.6512 + (0.855)x + (1.506)y
(b) When x = 54 and y = 9,
z = 3.6512 + (0.855)(54) + (1.506)(9)
  = 63.356
  ≈ 63 pounds
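A quick numpy check of the solution of the normal equations (numpy assumed available):

import numpy as np

# Normal equations (ii)-(iv) from the solution above.
A = np.array([[ 12.,   643.,  106.],
              [643., 34843., 5779.],
              [106.,  5779.,  976.]])
rhs = np.array([753., 40830., 6796.])

a, b, c = np.linalg.solve(A, rhs)
print(a, b, c)              # approximately 3.651, 0.855, 1.506
print(a + b * 54 + c * 9)   # predicted weight: approximately 63 pounds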
5. The following table gives measurements of train resistance; V is the velocity in miles per hour and R is the resistance in pounds per ton:
   V: 20  40  60  80  100  120
   R: 5.5 9.1 14.9 22.8 33.3 46
   If R is related to V by the relation R = A + BV + CV², find A, B and C.
Solution:
Here the number of observations is even. The two middle values of V are 60 and 80, whose mean is 70.
We take x = (V − 70)/10 and y = R − 22.8.

Let y = a + bx + cx² ………… (i)
The normal equations are
Σy = na + bΣx + cΣx²
Σxy = aΣx + bΣx² + cΣx³
Σx²y = aΣx² + bΣx³ + cΣx⁴
x    y      xy     x²   x³    x⁴    x²y
-5   -17.3  86.5   25   -125  625   -432.5
-3   -13.7  41.1   9    -27   81    -123.3
-1   -7.9   7.9    1    -1    1     -7.9
1    0      0      1    1     1     0
3    10.5   31.5   9    27    81    94.5
5    23.2   116    25   125   625   580
Σx = 0   Σy = -5.2   Σxy = 283   Σx² = 70   Σx³ = 0   Σx⁴ = 1414   Σx²y = 110.8

Substituting in the above normal equations:
−5.2 = 6a + 0·b + 70c ………… (ii)
283 = 0·a + 70b + 0·c ………… (iii)
110.8 = 70a + 0·b + 1414c ………… (iv)
Solving equations (ii), (iii) and (iv) we get
a = −4.25, b = 4.04 and c = 0.29
Hence, y = −4.25 + 4.04x + 0.29x²
Or, R − 22.8 = −4.25 + 4.04(V − 70)/10 + 0.29(V − 70)²/10²
R = 4.48 − 0.002V + 0.0029V²
Comparing with R = A + BV + CV² we get
A = 4.48, B = −0.002 and C = 0.0029
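The result can be cross-checked by fitting the quadratic directly in V, for example with numpy's polyfit (numpy assumed available). Because the hand calculation rounds a, b and c to two decimals before back-substituting, the constant and linear terms it reports are sensitive to that rounding, so the direct fit will differ slightly:

import numpy as np

V = np.array([20., 40., 60., 80., 100., 120.])
R = np.array([5.5, 9.1, 14.9, 22.8, 33.3, 46.0])

# polyfit returns coefficients from the highest power down: C, B, A.
C, B, A = np.polyfit(V, R, 2)
print(A, B, C)   # the unrounded least-squares values of A, B and C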

Scatter Plots

A Scatter (XY) Plot has points that show the relationship between two
sets of data.

In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as "Cartesian (x,y) Coordinates")

Example:
The local ice cream shop keeps track of how much ice cream they sell versus
the noon temperature on that day. Here are their figures for the last 12 days:

Ice Cream Sales vs Temperature

Temperature °C Ice Cream Sales

14.2° $215

16.4° $325

11.9° $185

15.2° $332

18.5° $406

22.1° $522

19.4° $412

25.1° $614

23.4° $544

18.1° $421

22.6° $445

17.2° $408

And here is the same data as a Scatter Plot:


It is now easy to see that warmer weather leads to more sales, but the
relationship is not perfect.

Line of Best Fit


We can also draw a "Line of Best Fit" (also called a "Trend Line") on our scatter
plot:

Try to have the line as close as possible to all points, and as many points
above the line as below.

But for better accuracy we can calculate the line using least squares regression, as described earlier.

Example: Sea Level Rise

A scatter plot of sea level rise, with a "Line of Best Fit" drawn on it.

Interpolation and Extrapolation


Interpolation is where we find a value inside our set of data points.

Here we use linear interpolation to estimate the sales at 21 °C.

Extrapolation is where we find a value outside our set of data points.

Here we use linear extrapolation to estimate the sales at 29 °C (which is higher than any value we have).

Careful: Extrapolation can give misleading results because we are in "uncharted territory".

As well as using a graph (like above) we can create a formula to help us.
Example: Straight Line Equation

We can estimate a straight line equation from two points on the graph above.

Let's estimate two points on the line near actual values: (12°, $180) and (25°, $610).

First, find the slope:

slope m = (change in y)/(change in x)
        = ($610 − $180)/(25° − 12°)
        = $430/13°
        = 33 (rounded)

Now put the slope and the point (12°, $180) into the "point-slope" formula:

y − y1 = m(x − x1)

y − 180 = 33(x − 12)

y = 33(x − 12) + 180

y = 33x − 396 + 180

y = 33x − 216

INTERPOLATING

Now we can use that equation to interpolate a sales value at 21°:

y = 33×21° − 216 = $477


EXTRAPOLATING

And to extrapolate a sales value at 29°:

y = 33×29° − 216 = $741

The values are close to what we got on the graph. But that doesn't mean they
are more (or less) accurate. They are all just estimates.

Don't use extrapolation too far! What sales would you expect at 0° ?

y = 33×0° − 216 = −$216

Hmmm... Minus $216? We extrapolated too far!
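The same two-point estimate, interpolation and extrapolation can be scripted. A small Python sketch (the function name sales is just for this sketch):

# Two-point estimate of the trend line, then interpolation and extrapolation.
x1, y1 = 12, 180        # first estimated point  (12 deg C, $180)
x2, y2 = 25, 610        # second estimated point (25 deg C, $610)

m = (y2 - y1) / (x2 - x1)          # slope, about 33.1

def sales(temp):
    # point-slope form of the line through (x1, y1)
    return m * (temp - x1) + y1

print(round(sales(21)))    # interpolation at 21 deg C: about $478
print(round(sales(29)))    # extrapolation at 29 deg C: about $742
print(round(sales(0)))     # extrapolating much too far: a negative figure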

Note: we used linear (based on a line) interpolation and extrapolation, but there are many other types; for example, we could use polynomials to make curvy lines, etc.
