Applications of Mathematics
Project : Application of Least Squares Method in Regression Analysis
Presented by : Tamal Kanti Panja, Shouvik Sardar, Suvadip Roy
INTRODUCTION :
• What is Regression?
Regression is a statistical measure that attempts to determine the strength of the relationship
between one dependent variable (usually denoted by Y) and a single or a series of other changing
variables (known as independent variables).
Note: The non-linear functions mentioned above are used for some particular phenomena, such as:
(i) The exponential function is used in census work and for population prediction.
(ii) The logistic function is used for predicting the outcome of categorical data, i.e. data whose values fall into a limited number of categories.
(iii) Trigonometric functions are used for modelling periodic phenomena, for example in share market analysis.
Though there are various types of regression model, for general purposes we emphasize linear and polynomial equations only.
• Why linear and polynomial equations only?
We use linear and polynomial equations because
(i) Linear and polynomial equations are linear in the parameters, so they are easier to deal with than other types of equations.
(ii) Analysis of linear and polynomial equations is less time-, money- and labour-consuming.
(iii) In most cases, the regression equation can be well approximated by a linear or polynomial equation with little error.
But such a situation will be realised very rarely in practice. For other situations, we shall try to
minimize the RSS with respect to the unknown parameters a1 , a2 , . . . , ak . For this purpose, we
equate the partial derivatives of S 2 with respect to a1 , a2 , . . . , ak separately to 0 to generate as many
equations as the no. of parameters. These equations are known as normal equations.
Mathematically,
\frac{\partial S^2}{\partial a_j} = 0 \quad \forall\, j = 1, 2, \ldots, k
are the k normal equations in the k unknowns. Finally, these normal equations are solved simultaneously to get the estimates of a_1, a_2, \ldots, a_k. Let the estimates be \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_k. Then the fitted equation is
Y = f(\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_k ; X)
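For instance (a concrete illustration, not part of the original text), with f(a_0, a_1; X) = a_0 + a_1 X and k = 2 parameters, the two normal equations obtained from \partial S^2 / \partial a_0 = 0 and \partial S^2 / \partial a_1 = 0 take the familiar form
\begin{aligned}
\sum_{i=1}^{n} y_i &= n\,a_0 + a_1 \sum_{i=1}^{n} x_i,\\
\sum_{i=1}^{n} x_i y_i &= a_0 \sum_{i=1}^{n} x_i + a_1 \sum_{i=1}^{n} x_i^2.
\end{aligned}
Solving these two equations simultaneously gives \hat{a}_0 and \hat{a}_1.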
• Fitting of equation when it is linear in parameters :
Polynomial and multiple linear equations are linear in parameters. So, in this context we
will fit polynomial and multiple linear equations on the basis of observed data on dependent and
independent variables.
1. To fit Y = a_0 + a_1 X + a_2 X^2 + \cdots + a_p X^p (a pth degree polynomial) on n paired observations (x_i, y_i), i = 1, 2, \ldots, n.
Here,
S^2 = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_p x_i^p \right)^2
(∗∗) : (m+1) normal equations for fitting a multiple linear equation on m independent variables.
Here, if we take m = 1, we get the simple linear equation Y = \hat{b}_0 + \hat{b}_1 X. (Since here we have only one independent variable X_1, we can simply call it X.)
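The project reports its computations with Scilab, but no code is shown there. As a rough illustration of how the (p+1) normal equations for the polynomial fit can be set up and solved numerically, here is a minimal Python/NumPy sketch; the helper name fit_polynomial and the sample data are hypothetical, not from the original project.

import numpy as np

def fit_polynomial(x, y, p):
    """Least squares fit of a p-th degree polynomial y = a0 + a1*x + ... + ap*x^p.

    Builds the design matrix, forms the (p+1) normal equations (X'X) a = X'y
    and solves them for the estimates a-hat.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: column j holds x_i^j for j = 0, 1, ..., p
    X = np.vander(x, N=p + 1, increasing=True)
    # Normal equations: (X'X) a = X'y
    a_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return a_hat

# Hypothetical usage with made-up data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 9.0, 15.8, 25.2]
print(fit_polynomial(x, y, p=2))   # estimates of a0, a1, a2

Here np.vander builds the matrix whose columns are x_i^0, x_i^1, \ldots, x_i^p, so X.T @ X and X.T @ y are exactly the coefficient matrix and right-hand side of the normal equations described above.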
A measure of goodness of fit for fitting a pth degree polynomial Y = \hat{a}_0 + \hat{a}_1 X + \hat{a}_2 X^2 + \cdots + \hat{a}_p X^p, where \hat{a}_0, \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_p are the least squares estimates of a_0, a_1, a_2, \ldots, a_p obtained by solving the (p+1) normal equations, is given by
\begin{aligned}
RSS &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \text{where } \hat{y}_i = \hat{a}_0 + \hat{a}_1 x_i + \hat{a}_2 x_i^2 + \cdots + \hat{a}_p x_i^p \\
    &= \sum_{i=1}^{n} (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p)(y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p) \\
    &= \sum_{i=1}^{n} (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p)\,y_i
       - \hat{a}_0 \sum_{i=1}^{n} (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p) \\
    &\qquad - \hat{a}_1 \sum_{i=1}^{n} x_i (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p)
       - \hat{a}_2 \sum_{i=1}^{n} x_i^2 (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p) \\
    &\qquad - \cdots - \hat{a}_p \sum_{i=1}^{n} x_i^p (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p) \\
    &= \sum_{i=1}^{n} (y_i - \hat{a}_0 - \hat{a}_1 x_i - \cdots - \hat{a}_p x_i^p)\,y_i \qquad \text{(by the normal equations)} \\
    &= \sum_{i=1}^{n} y_i^2 - \hat{a}_0 \sum_{i=1}^{n} y_i - \hat{a}_1 \sum_{i=1}^{n} x_i y_i - \hat{a}_2 \sum_{i=1}^{n} x_i^2 y_i - \cdots - \hat{a}_p \sum_{i=1}^{n} x_i^p y_i
\end{aligned}
Note that, in addition to the calculations already made for solving the normal equations, only \sum_{i=1}^{n} y_i^2 is required for the determination of RSS.
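As a hedged illustration of this shortcut (again in Python/NumPy with hypothetical helper names; the project itself used Scilab), the following sketch computes RSS both directly and via the simplified expression. The two values agree only when a_hat holds the least squares estimates, since the simplification relies on the normal equations.

import numpy as np

def rss_shortcut(x, y, a_hat):
    """RSS via the simplified expression derived above:
    RSS = sum(y_i^2) - sum_j a_j-hat * sum_i (x_i^j * y_i).
    Valid only when a_hat are the least squares estimates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a_hat = np.asarray(a_hat, dtype=float)
    X = np.vander(x, N=len(a_hat), increasing=True)   # columns x^0, x^1, ..., x^p
    return np.sum(y**2) - np.sum(a_hat * (X.T @ y))

def rss_direct(x, y, a_hat):
    """RSS computed directly as the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a_hat = np.asarray(a_hat, dtype=float)
    y_hat = np.vander(x, N=len(a_hat), increasing=True) @ a_hat
    return np.sum((y - y_hat)**2)

# Made-up data, fitted by solving the normal equations (illustration only)
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 9.0, 15.8, 25.2]
X = np.vander(np.asarray(x, float), N=3, increasing=True)
a_hat = np.linalg.solve(X.T @ X, X.T @ np.asarray(y, float))
print(rss_direct(x, y, a_hat), rss_shortcut(x, y, a_hat))   # equal up to rounding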
Similarly, for fitting a multilinear equation with m independent variables Y = \hat{b}_0 + \hat{b}_1 X_1 + \hat{b}_2 X_2 + \cdots + \hat{b}_m X_m, where \hat{b}_0, \hat{b}_1, \hat{b}_2, \ldots, \hat{b}_m are the least squares estimates of b_0, b_1, b_2, \ldots, b_m obtained by solving the (m+1) normal equations, a measure of goodness of fit can be given by
\begin{aligned}
RSS &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \text{where } \hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{1i} + \hat{b}_2 x_{2i} + \cdots + \hat{b}_m x_{mi} \\
    &= \sum_{i=1}^{n} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi})(y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi}) \\
    &= \sum_{i=1}^{n} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi})\,y_i
       - \hat{b}_0 \sum_{i=1}^{n} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi}) \\
    &\qquad - \hat{b}_1 \sum_{i=1}^{n} x_{1i} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi})
       - \hat{b}_2 \sum_{i=1}^{n} x_{2i} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi}) \\
    &\qquad - \cdots - \hat{b}_m \sum_{i=1}^{n} x_{mi} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi}) \\
    &= \sum_{i=1}^{n} (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \cdots - \hat{b}_m x_{mi})\,y_i \qquad \text{(by the normal equations)} \\
    &= \sum_{i=1}^{n} y_i^2 - \hat{b}_0 \sum_{i=1}^{n} y_i - \hat{b}_1 \sum_{i=1}^{n} x_{1i} y_i - \hat{b}_2 \sum_{i=1}^{n} x_{2i} y_i - \cdots - \hat{b}_m \sum_{i=1}^{n} x_{mi} y_i
\end{aligned}
Here also, we can see that, in addition to the calculations already made for solving the normal equations, only \sum_{i=1}^{n} y_i^2 is required for the determination of RSS.
We see that RSS is actually a measure of the sum of squared errors (SSE). A small value of RSS indicates that the equation fitted to the given data set is good. In the case of polynomial fitting, it can be shown that RSS is a non-increasing function of the degree of the polynomial to be fitted, so RSS falls (or at least does not rise) as the degree of the polynomial increases. Likewise, RSS decreases as the number of explanatory variables increases in the case of multilinear fitting. In this way we can reduce the error up to a certain threshold; a small numerical illustration of this behaviour is sketched below.
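The following minimal Python/NumPy sketch (made-up data, purely for illustration and not from the original project) shows this behaviour by fitting polynomials of increasing degree to the same data and printing the RSS of each fit:

import numpy as np

# Made-up sample data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 2.9, 4.1, 4.0, 5.2, 6.1, 6.4])

for p in range(1, 5):
    X = np.vander(x, N=p + 1, increasing=True)      # design matrix for degree p
    a_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solve the normal equations
    rss = np.sum((y - X @ a_hat) ** 2)              # residual sum of squares
    print(f"degree {p}: RSS = {rss:.4f}")           # RSS never increases with p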
We can also use the least squares method to fit a transformed equation using transformed data, whenever the empirical relation can be made linear in the parameters by a suitable transformation.
1. To fit Y = ab^X (exponential equation) on n paired observations (x_i, y_i), i = 1, 2, \ldots, n.
Here, we take logarithms of both sides of the equation and get
log Y = log a + X log b
⇒ Z = A + BX, where Z = log Y, A = log a and B = log b.
We then fit the simple linear equation Z = A + BX to the transformed observations (x_i, log y_i) by the least squares method to obtain the estimates \hat{A} and \hat{B}, and recover \hat{a} = antilog \hat{A} and \hat{b} = antilog \hat{B}. The fitted equation is therefore
Y = \hat{a}\,\hat{b}^X
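A minimal Python/NumPy sketch of this transformed fit follows; the function name fit_exponential and the sample data are hypothetical, and natural logarithms are used, which does not affect the recovered a and b.

import numpy as np

def fit_exponential(x, y):
    """Fit Y = a * b**X by least squares on the log-transformed model
    log Y = log a + X log b (valid only when all y are positive)."""
    x = np.asarray(x, dtype=float)
    z = np.log(np.asarray(y, dtype=float))            # Z = log Y (natural log)
    X = np.column_stack([np.ones_like(x), x])         # columns for A and B
    A_hat, B_hat = np.linalg.solve(X.T @ X, X.T @ z)  # solve the 2 normal equations
    return np.exp(A_hat), np.exp(B_hat)               # a-hat = antilog A-hat, etc.

# Hypothetical usage with made-up data (illustration only)
a, b = fit_exponential([0, 1, 2, 3, 4], [2.0, 3.1, 4.4, 6.5, 9.8])
print(a, b)   # parameters of the fitted growth model y ≈ a * b**x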
APPLICATION OF LEAST SQUARES METHOD IN REGRESSION :
Our main objective is to show how the least squares method is applied in the regression anal-
ysis to get the regression equation rather than doing a whole regression analysis. In this context, we
will discuss only two types of regression model as follows:
(i) Simple Linear Regression
(ii) Multiple Linear Regression
We will not discuss polynomial regression separately, since the problem of polynomial regression is an extension of the problem of simple linear regression; indeed, simple linear regression is a particular form of polynomial regression. Likewise, the problem of exponential regression can be reduced to a simple linear regression problem. So, we will discuss these two regression models and compare them with an example. Before starting that discussion, we first need to know what simple linear regression and multiple linear regression are.
1. Simple Linear Regression :
Definition : Simple Linear Regression is a statistical technique that uses only one explanatory
variable to predict the outcome of a response variable. The goal of simple linear regression
(SLR) is to model the relationship between an explanatory and a response variable.
For example, we can think of the height of a human body as the explanatory variable and weight
as the response variable. Then we could try to predict the weight on the basis of height. Some
more such examples are:
(i) Production and sale of a large business house,
(ii) Age and blood pressure of a human body,
(iii) Weight of a new born baby and age of the mother, etc.
A simple linear regression equation can be taken as Y = a0 + a1 X, where Y is the dependent or
response variable and X is the independent or explanatory variable. Here, we try to estimate
the value of Y on the basis of X. Given a set of paired observations on the variables X and Y ,
we can use the least squares method to get the estimates of a_0 and a_1 as \hat{a}_0 and \hat{a}_1 respectively, and obtain the simple linear regression equation of Y on X as Y = \hat{a}_0 + \hat{a}_1 X. (When the regression equation expresses Y as a function of X, i.e. we take Y as the response variable and X as the explanatory variable, we call it the regression equation of Y on X; in the reverse case we call it the regression equation of X on Y.)
2. Multiple Linear Regression :
Definition : Multiple Linear Regression is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable. The goal of multiple linear regression
(MLR) is to model the relationship between the explanatory and response variables.
It is often the case that a response variable may depend on more than one explanatory variable.
For example, human weight could reasonably be expected to depend on both the height and the
age of the person. Furthermore, possible explanatory variables often co-vary with one another
(e.g. sea surface temperatures and sea-level pressures). This makes it impossible to subtract
out the effects of the factors separately by performing successive linear regressions for each
individual factor. It is necessary in such cases to perform multiple regression defined by an
extended linear model.
A multiple linear regression equation can be taken as Y = b0 + b1 X1 + b2 X2 + · · · + bm Xm , where
Y is the dependent or response variable and X1 , X2 , . . . , Xm are the independent or explanatory
variables. Here, we try to estimate the value of Y on the basis of X1 , X2 , . . . , Xm . Given a set
of tuples of observations on the variables X1 , X2 , . . . , Xm and Y , we can use the least square
method to get the estimates of b0 , b1 , . . . , bm as bˆ0 , bˆ1 , . . . , bˆm respectively and obtain the multiple
linear regression equation as Y = bˆ0 + bˆ1 X1 + bˆ2 X2 + · · · + bˆm Xm .
• Elaboration with an Example :
Here, we have a dataset (Table 1) on systolic blood pressure of 11 persons with their ages
(in years) and weight (in pounds). In this context, we will deal with this dataset and will try to
obtain a simple and a multiple linear regression equation to obtain an estimate of the systolic blood
pressure of a person with respect to age (in case of simple linear regression) and with respect to
both age and weight (in case of multiple linear regression). Before that, we need a clear idea of what systolic blood pressure is.
There are many physical factors, such as age, weight, diet, exercise, disease, and drugs or alcohol, that influence systolic blood pressure. Here, we consider only the first two of these factors.
i yi x1i x2i
1 132 52 173
2 143 59 184
3 153 67 194
4 162 73 211
5 154 64 196
6 168 74 220
7 137 54 188
8 149 61 188
9 159 65 207
10 128 46 167
11 166 72 217
Table 1 : Showing the data on systolic blood pressure (yi ) with age in years (x1i ) and weight in pounds (x2i )
corresponding to ith person of total 11 persons
At first, we fit a simple linear regression equation to the data to estimate systolic blood pressure
(BP) with respect to age only. For this, we take our regression equation to be fitted as Y = a0 +a1 X1 ,
where Y is the response variable denoting the systolic BP and X1 is the explanatory variable denoting
the age (in years).
Now, by Least Squares Method, for fitting Y = a0 + a1 X1 , a simple linear equation, we have
the normal equations in matrix form as:
\begin{pmatrix} n & \sum_{i=1}^{n} x_{1i} \\ \sum_{i=1}^{n} x_{1i} & \sum_{i=1}^{n} x_{1i}^2 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{1i} y_i \end{pmatrix}
\quad \Leftrightarrow \quad M a = c, \text{ say}
Here, n = 11. Using Scilab, we get
\sum_{i=1}^{n} x_{1i} = 687, \quad \sum_{i=1}^{n} x_{1i}^2 = 43737, \quad \sum_{i=1}^{n} y_i = 1651, \quad \sum_{i=1}^{n} x_{1i} y_i = 104328
\therefore \quad M = \begin{pmatrix} 11 & 687 \\ 687 & 43737 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 1651 \\ 104328 \end{pmatrix}
We get the estimate of a as
\hat{a} = \begin{pmatrix} \hat{a}_0 \\ \hat{a}_1 \end{pmatrix} = M^{-1} c = \begin{pmatrix} 58.705515 \\ 1.4632305 \end{pmatrix}
Hence, the fitted equation is Y = 58.705515 + 1.4632305\, X_1
The estimated values of Y obtained by this equation are given below (in Table 2):
i    y_i    ŷ_{i,simple}    e_i = y_i − ŷ_{i,simple}
1 132 134.7935 -2.7934997
2 143 145.03611 -2.0361129
3 153 156.74196 -3.7419567
4 162 165.52134 -3.5213395
5 154 152.35227 1.6477347
6 168 166.98457 1.0154301
7 137 137.71996 -0.7199606
8 149 147.96257 1.0374261
9 159 153.8155 5.1845043
10 128 126.01412 1.9858831
11 166 164.05811 1.941891
Table 2 : Showing the observed (y_i) and expected (ŷ_{i,simple}) values of systolic BP, with the error (e_i = y_i − ŷ_{i,simple}) due to this fitting, for the ith of the 11 persons
From Table 2, considering the errors, we can say that the simple linear regression equation fitted to the observed data is moderately good. We compute the RSS (Residual Sum of Squares) for comparison with other models as:
RSS_{\text{simple}} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_{i,\text{simple}})^2 = 78.285949
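The project carried out these computations in Scilab; the script itself is not reproduced here, but the following Python/NumPy sketch (an illustrative translation, not the original code) reproduces the same steps from the data of Table 1:

import numpy as np

# Data from Table 1: systolic BP (y) and age in years (x1)
y  = np.array([132, 143, 153, 162, 154, 168, 137, 149, 159, 128, 166], dtype=float)
x1 = np.array([ 52,  59,  67,  73,  64,  74,  54,  61,  65,  46,  72], dtype=float)
n = len(y)

# Normal equations M a = c for Y = a0 + a1*X1
M = np.array([[n,        x1.sum()],
              [x1.sum(), (x1**2).sum()]])
c = np.array([y.sum(), (x1 * y).sum()])

a_hat = np.linalg.solve(M, c)            # (a0-hat, a1-hat)
y_hat = a_hat[0] + a_hat[1] * x1         # fitted values
rss_simple = np.sum((y - y_hat)**2)      # residual sum of squares

print(a_hat)        # approximately (58.7055, 1.46323)
print(rss_simple)   # approximately 78.286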
Now, we fit a multiple linear regression equation to the data to estimate systolic BP with respect
to both age and weight. For this, we take our regression equation to be fitted as Y = b0 +b1 X1 +b2 X2 ,
where Y is the response variable denoting the systolic BP, X1 and X2 are the explanatory variables
denoting the age (in years) and the weight (in pounds) respectively.
Now, by Least Squares Method, for fitting Y = b0 + b1 X1 + b2 X2 , a multiple linear equation,
we have the normal equations in matrix form as:
\begin{pmatrix}
n & \sum_{i=1}^{n} x_{1i} & \sum_{i=1}^{n} x_{2i} \\
\sum_{i=1}^{n} x_{1i} & \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i} x_{2i} \\
\sum_{i=1}^{n} x_{2i} & \sum_{i=1}^{n} x_{2i} x_{1i} & \sum_{i=1}^{n} x_{2i}^2
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{1i} y_i \\ \sum_{i=1}^{n} x_{2i} y_i \end{pmatrix}
\quad \Leftrightarrow \quad N b = d, \text{ say}
Using Scilab, along with the previous calculations of \sum x_{1i}, \sum x_{1i}^2, \sum y_i and \sum x_{1i} y_i, we get
\sum_{i=1}^{n} x_{2i} = 2146, \quad \sum_{i=1}^{n} x_{1i} x_{2i} = \sum_{i=1}^{n} x_{2i} x_{1i} = 135530, \quad \sum_{i=1}^{n} x_{2i}^2 = 421708, \quad \sum_{i=1}^{n} x_{2i} y_i = 324401
\therefore \quad N = \begin{pmatrix} 11 & 687 & 2146 \\ 687 & 43737 & 135530 \\ 2146 & 135530 & 421708 \end{pmatrix} \quad \text{and} \quad d = \begin{pmatrix} 1651 \\ 104328 \\ 324401 \end{pmatrix}
We get the estimate of b as
\hat{b} = \begin{pmatrix} \hat{b}_0 \\ \hat{b}_1 \\ \hat{b}_2 \end{pmatrix} = N^{-1} d = \begin{pmatrix} 31.60052 \\ 0.8663002 \\ 0.3300308 \end{pmatrix}
Hence, the fitted equation is Y = 31.60052 + 0.8663002\, X_1 + 0.3300308\, X_2
From Table 3, considering the errors, we can say that the multiple linear regression equation fitted to the observed data is very good. We compute the RSS (Residual Sum of Squares) for comparison with other models as:
RSS_{\text{multiple}} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_{i,\text{multiple}})^2 = 42.860841
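Similarly, a Python/NumPy sketch for the multiple linear regression fit is given below; again it is an illustrative translation of the Scilab computations rather than the original code, and the printed values should come out close to those reported above.

import numpy as np

# Data from Table 1: systolic BP (y), age in years (x1), weight in pounds (x2)
y  = np.array([132, 143, 153, 162, 154, 168, 137, 149, 159, 128, 166], dtype=float)
x1 = np.array([ 52,  59,  67,  73,  64,  74,  54,  61,  65,  46,  72], dtype=float)
x2 = np.array([173, 184, 194, 211, 196, 220, 188, 188, 207, 167, 217], dtype=float)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(y), x1, x2])

# Normal equations N b = d, with N = X'X and d = X'y
N = X.T @ X
d = X.T @ y
b_hat = np.linalg.solve(N, d)          # (b0-hat, b1-hat, b2-hat)

y_hat = X @ b_hat
rss_multiple = np.sum((y - y_hat)**2)

print(b_hat)          # coefficients close to those reported above
print(rss_multiple)   # RSS_multiple, for comparison with RSS_simple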
Now, comparing RSS_simple and RSS_multiple, we can clearly say that for the observed data, the multiple linear regression fit is better than the simple linear regression fit. This implies that we can estimate the systolic BP better by taking the weight into consideration along with the age than by taking the age only.
From the above example, we can conclude that estimating the value of the response variable using several explanatory variables is better than using only one explanatory variable. The RSS decreases as the number of explanatory variables increases, but that does not mean we can increase the number of explanatory variables as much as we want, because adding explanatory variables is more time-, money- and labour-consuming.
CONCLUSION :
Regression analysis is itself a huge subject of discussion and is subject to various considerations. It is important to state a few points about this, such as:
• While working on a linear regression problem, we may deal with simple linear regression or multiple linear regression. In a model, different factors have different extents of effect, so if we consider a larger number of variables in the model, we are able to explain a larger amount of the variation in the dependent variable. In this sense, the error is inversely related to the number of factors taken into account in the model.
• Here, in this project, we have deliberately assumed that a mathematical model can be fitted to a given data set, and we have shown that there exist various types of mathematical model which, of course, fit the original data to different extents. In regression analysis there is also another notion, the inferential problem, where we test whether the effect of a particular factor is present in the problem or not. If some factors have no effect on a particular problem, we drop those factors when we fix the model and go on with the important ones. But there are some technical problems, related to Statistics, that should be taken into account. When we test for the presence of a factor's effect, it may happen that the test concludes there is no significant effect of that factor in the model, but we should think about it practically before drawing a conclusion. For example, suppose we are to fit a regression model for predicting the cut-off marks for admission to an engineering college, taking as explanatory variables the marks of the students of that college in different subjects, and someone is interested in testing whether a particular subject has a significant effect in the model; a test is conducted for that purpose. In practice, if the tests for both English and Mathematics give the result that the subject has no significant effect, we should accept that result for English but reject it for Mathematics, to keep the model appropriate.
• Another problem related to regression theory is that of predicting the value of the dependent variable when the independent variable takes a value very unlike those in the set on which the model was fitted. In this case, even though the model fits the observed data well, it may give a large error for such an unlikely value of the independent variable. For example, suppose a regression model is fitted on heights and weights of some adults, where weight is to be predicted from a given height, and the heights in the data set lie between 5 feet and 5 feet 11 inches. If we then want to predict the weight for a height above 7 feet, it may result in a large error.
• Sometimes it may happen that a regression model can be fitted to some data set, but in reality there is no meaning in fitting such a model. For example, we may construct a linear relationship between people's shoe sizes and their I.Q. levels, and the model may even fit the observed data well, but actually there is no real relationship between the two.
So, for regression analysis, we should take into account the various factors related to the problem while fitting an appropriate model, otherwise the model will be worthless, and the decision should be taken with practical knowledge of the field of interest. So, it is convenient to state a quotation at the end: “I am in full favor of keeping dangerous weapons from the fools. Let’s start with STATISTICS.”
• References :
– Fundamentals of Statistics (Volume I), Eighth Edition (2008) [The World Press Private
Limited, Kolkata] – A. M. Gun, M. K. Gupta, B. Dasgupta.
– Linear Algebra and Its Applications, Fourth Edition – Gilbert Strang.
– Statistical Inference, Second Edition – George Casella, Roger L. Berger.
– Least squares - Wikipedia, the free encyclopedia
weblink: en.wikipedia.org/wiki/Least_squares
– Linear regression - Wikipedia, the free encyclopedia
weblink: en.wikipedia.org/wiki/Linear_regression
– Regression analysis - Wikipedia, the free encyclopedia
weblink: en.wikipedia.org/wiki/Regression_analysis
– Simple linear regression - Wikipedia, the free encyclopedia
weblink: en.wikipedia.org/wiki/Simple_linear_regression
– Multiple Linear Regression
weblink: www.stat.yale.edu/Courses/1997-98/101/linmult.htm
– Multiple Linear Regression (MLR) Definition - Investopedia
weblink: www.investopedia.com/terms/m/mlr.asp
– Data source for the elaborate example:
weblink: http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/mlr02.html