UNIT 9 REGRESSION ANALYSIS
Structure
9.0 Objectives
9.1 Introduction
9.2 The Concept of Regression
9.3 Linear Relationship: Two Variable Case
9.4 Minimisation of Errors
9.5 Method of Least Squares
9.6 Prediction
9.7 Relationship between Regression and Correlation
9.8 Multiple Regression
9.9 Non-linear Regression
9.10 Let Us Sum Up
9.11 Key Words
9.12 Some Useful Books
9.13 Answers/Hints to Check Your Progress Exercises
9.0 OBJECTIVES
After going through this unit, you should be able to:
• explain the concept of regression;
• explain the method of least squares;
• identify the limitations of linear regression;
• apply linear regression models to given data; and
• use the regression equation for prediction.
9.1 INTRODUCTION
In the previous Unit we noted that the correlation coefficient does not reflect cause and effect relationship between two variables. Thus we cannot predict the value of one variable for a given value of the other variable. This limitation is removed by regression analysis. In regression analysis, to be discussed in this Unit, the relationship between variables is expressed in the form of a mathematical equation. It is assumed that one variable is the cause and the other is the effect. You should remember that regression is a statistical tool which helps understand the relationship between variables and predicts the unknown values of the dependent variable from known values of the independent variable.
In the simplest case of regression analysis there is one dependent variable and one independent variable. Let us assume that consumption expenditure of a household is related to the household income. For example, it can be postulated that as household income increases, expenditure also increases. Here consumption expenditure is the dependent variable and household income is the independent variable.
Usually we denote the dependent variable as Y and the independent variable as X. Suppose we took up a household survey and collected n pairs of observations on X and Y. The next step is to find out the nature of the relationship between X and Y.
9.2 THE CONCEPT OF REGRESSION

The relationship between X and Y can take many forms. The general practice is to express the relationship in terms of some mathematical equation. The simplest of these equations is the linear equation. This means that the relationship between X and Y is in the form of a straight line and is termed linear regression. When the equation represents curves (not a straight line) the regression is called non-linear or curvilinear.

Now the question arises, 'How do we identify the equation form?' There is no hard and fast rule as such. The form of the equation depends upon the reasoning and assumptions made by us. However, we may plot the X and Y variables on a graph paper to prepare a scatter diagram. From the scatter diagram, the location of the points on the graph paper helps in identifying the type of equation to be fitted. If the points are more or less in a straight line, then a linear equation is assumed. On the other hand, if the points are not in a straight line and are in the form of a curve, a suitable non-linear equation (which resembles the scatter) is assumed.

We have to take another decision, that is, the identification of dependent and independent variables. This again depends on the logic put forth and the purpose of analysis: whether 'Y depends on X' or 'X depends on Y'. Thus there can be two regression equations from the same set of data. These are:

i) Y is assumed to be dependent on X (this is termed the 'Y on X' line), and
ii) X is assumed to be dependent on Y (this is termed the 'X on Y' line).

Regression analysis can be extended to cases where one dependent variable is explained by a number of independent variables. Such a case is termed multiple regression. In advanced regression models there can be a number of both dependent as well as independent variables.

You may by now be wondering why the term 'regression', which means 'reduce'. This name is associated with a phenomenon that was observed in a study on the relationship between the stature of father (X) and son (Y). It was observed that the average stature of sons of the tallest fathers has a tendency to be less than the average stature of these fathers. On the other hand, the average stature of sons of the shortest fathers has a tendency to be more than the average stature of these fathers. This phenomenon was called regression towards the mean. Although this appeared somewhat strange at that time, it was found later that this is due to natural variation within subgroups of a group, and the same phenomenon occurred in most problems and data sets. The explanation is that many tall men come from families with average stature due to vagaries of natural variation and they produce sons who are shorter than them on the whole. A similar phenomenon takes place at the lower end of the scale.
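This tendency can be reproduced with a small simulation. The sketch below is not from the original study; the sample size, mean height and the strength of the father-son relationship are assumptions chosen only to illustrate the effect.

```python
import random

random.seed(42)

# Simulate father-son heights (cm): a son inherits only part of the
# father's deviation from the mean, plus independent variation.
n, mean, sd = 10000, 170, 7
pairs = []
for _ in range(n):
    father = random.gauss(mean, sd)
    son = mean + 0.5 * (father - mean) + random.gauss(0, 6)
    pairs.append((father, son))

# Average heights within the group of the tallest 10% of fathers
tallest = sorted(pairs, reverse=True)[: n // 10]
avg_father = sum(f for f, _ in tallest) / len(tallest)
avg_son = sum(s for _, s in tallest) / len(tallest)
print(f"fathers: {avg_father:.1f} cm, sons: {avg_son:.1f} cm")
# The sons' average lies closer to 170: regression towards the mean.
```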
9.3 LINEAR RELATIONSHIP: TWO VARIABLE CASE

Let us assume that the relationship between X and Y is linear and is given by

Y_i = a + bX_i    ...(9.1)
In the above equation X is the independent variable or explanatory variable and Y is the dependent variable or explained variable. You may recall that the subscript i represents the observation number; i ranges from 1 to n. Thus Y_1 is the first observation of the dependent variable, X_i is the i-th observation of the independent variable, and so on. Equation (9.1) implies that Y is completely determined by X and the parameters a and b. Suppose we have parameter values a = 3 and b = 0.75; then our linear equation is Y = 3 + 0.75X. From this equation we can find out the value of Y for given values of X. For example, when X = 8, we find that Y = 9. Thus if we have different values of X then we obtain corresponding Y values on the basis of (9.1). Again, if X_i is the same for two observations, then the value of Y_i will also be identical for both the observations. A plot of Y on X will show no deviation from the straight line with intercept 'a' and slope 'b'.

If we look into the deterministic model given by (9.1) we find that it may not be appropriate for describing economic interrelationships between variables. For example, let Y = consumption and X = income of households. Suppose you record your income and consumption for successive months. For the months when your income is the same, does your consumption remain the same? The point we are trying to make is that economic relationships involve certain randomness. Therefore, we assume the relationship between Y and X to be stochastic and add an error term to (9.1). Thus our stochastic model is

Y_i = a + bX_i + e_i    ...(9.2)

where e_i is the error term. In real life situations e_i represents randomness in human behaviour and excluded variables, if any, in the model. Remember that the right hand side of (9.2) has two parts, viz., i) the deterministic part (that is, a + bX_i), and ii) the stochastic or random part (that is, e_i). Equation (9.2) implies that even if X_i remains the same for two observations, Y_i need not be the same because of different e_i. Thus, if we plot (9.2) on a graph paper the observations will not remain on a straight line.
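The contrast between (9.1) and (9.2) is easy to see numerically. A minimal sketch, using the parameter values a = 3 and b = 0.75 from the text (the normal error distribution is our assumption):

```python
import random

random.seed(1)
a, b = 3, 0.75          # parameter values used in the text

x_values = [8, 8, 10, 10]  # note the repeated X values

# Deterministic model (9.1): identical X gives identical Y
print([a + b * x for x in x_values])   # [9.0, 9.0, 10.5, 10.5]

# Stochastic model (9.2): the error term e makes Y differ even for equal X
print([a + b * x + random.gauss(0, 1) for x in x_values])
```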
Example 9.1 The amount of rainfall and agricultural production for ten years are given in Table 9.1.
Table 9.1: Rainfall and Agricultural Production

Rainfall (in mm)               60   62   65   71   73   75   81   85   88   90
Production (in '000 tonnes)    33   37   38   42   42   45   49   52   55   57
We plot the data on a graph paper. The scatter diagram looks something like Fig. 9.1. We observe from Fig. 9.1 that the points do not lie strictly on a straight line. But they show an upward rising tendency where a straight line can be fitted. Let us draw the regression line along with the scatter plot.
The vertical difference between the regression line and the observations is the error e_i. The value corresponding to the regression line is called the predicted value or the expected value. On the other hand, the actual value of the dependent variable corresponding to a particular value of the independent variable is called the observed value. Thus 'error' is the difference between the observed value and the predicted value.
A question that arises is, 'How do we obtain the regression line?' The procedure of fitting a straight line to the data is explained below.
9.4 MINIMISATION OF ERRORS
As mentioned earlier, a straight line can be represented by Y_i = a + bX_i, where b is the slope and a is the intercept on the y-axis. The location of a straight line depends on the values of a and b, called parameters. Therefore, the task before us is to estimate these parameters from the collected data. (You will learn more about the concept of estimation in Block 7.) In order to obtain the line of best fit to the data we should find estimates of a and b in such a way that the errors are as small as possible.
In Fig. 9.1 these differences between observed and predicted values of Y are marked with straight lines from the observed points, parallel to the y-axis, meeting the regression line. The lengths of these segments are the errors at the observed points.
Let us denote the n observations as before by (X_i, Y_i), i = 1, 2, ..., n. In Example 9.1 on agricultural production and rainfall, n = 10. Let us denote the predicted value of Y_i at X_i by Ŷ_i (the notation Ŷ_i is pronounced 'Y_i-cap' or 'Y_i-hat'). Thus

Ŷ_i = a + bX_i,  i = 1, 2, ..., n

The error at the i-th point will then be

e_i = Y_i − Ŷ_i    ...(9.3)
It would be nice if we could determine a and b in such a way that each of the e_i, i = 1, 2, ..., n is zero. But this is impossible unless it so happens that all the n points lie on a straight line, which is very unlikely. Thus we have to be content with minimising a combination of the e_i, i = 1, 2, ..., n. What are the options before us?
It is tempting to think that the total of all the e_i, i = 1, 2, ..., n, that is

$\sum_{i=1}^{n} e_i$

is a suitable choice. But it is not, because the errors for points above the line are positive and those for points below the line are negative. Thus by having a combination of large positive and large negative errors, it is possible for $\sum_{i=1}^{n} e_i$ to be very small.
A second possibility is that if we take a = Ȳ (the arithmetic mean of the Y_i's) and b = 0, $\sum_{i=1}^{n} e_i$ could be made zero. In this case, however, we do not need the value of X at all for prediction! The predicted value is the same irrespective of the observed value of X. This evidently is wrong.
What then is wrong with the criterion $\sum_{i=1}^{n} e_i$? It takes into account the sign of e_i. What matters is the magnitude of the error; whether the error is on the positive side or the negative side is really immaterial. Thus, the criterion

$\sum_{i=1}^{n} |e_i|$
is a suitable criterion to minimise. Remember that |e_i| means the absolute value of e_i. Thus, if e_i = 5 then |e_i| = 5 and also if e_i = −5 then |e_i| = 5. However, this option poses some computational problems. For theoretical and computational reasons, the criterion of least squares is preferred to the absolute value criterion. While in the absolute value criterion the sign of e_i is removed by taking its absolute value, in the least squares criterion it is done by squaring it. Remember that the squares of both 5 and −5 are 25. This device has been found to be mathematically and computationally more attractive. We explain the least squares method in detail in the following Section.
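A small numerical sketch (with hypothetical errors) shows why the plain sum of errors fails as a criterion while the absolute and squared sums do not:

```python
errors = [5, -5, 2, -2]  # hypothetical errors: large, but cancelling in sign

print(sum(errors))                  # 0  -> looks perfect, yet the fit is bad
print(sum(abs(e) for e in errors))  # 14 -> reflects the true size of the errors
print(sum(e * e for e in errors))   # 58 -> the least squares criterion
```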
9.5 METHOD OF LEAST SQUARES

In the method of least squares we choose a and b so as to minimise the sum of squares of the errors, that is,

$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$    ...(9.4)
The next question is: How do we obtain the values of a and b to minimise (9.4)? Those of you who are familiar with the concept of differentiation will remember that the value of a function is minimum when the first derivative of the function is zero and the second derivative is positive. Here we have to choose the values of a and b. Hence, the partial derivatives of $\sum_{i=1}^{n} e_i^2$ with respect to a and b are obtained as

$\frac{\partial \sum e_i^2}{\partial a} = -2\sum_{i=1}^{n}(Y_i - a - bX_i)$    ...(9.5)

$\frac{\partial \sum e_i^2}{\partial b} = -2\sum_{i=1}^{n}X_i(Y_i - a - bX_i)$    ...(9.6)
By equating (9.5) and (9.6) to zero and re-arranging the terms we get the following two equations:
$\sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i$    ...(9.7)

$\sum_{i=1}^{n} X_iY_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2$    ...(9.8)
These two equations, (9.7) and (9.8), are called the normal equations of least squares. These are two simultaneous linear equations in two unknowns, which can be solved to obtain the values of a and b. Those of you who are not familiar with the concept of differentiation can use a rule of thumb (we suggest that you should learn the concept of differentiation, which is so useful in Economics). We can say that the normal equations given at (9.7) and (9.8) are derived by multiplying the linear equation by the coefficients of a and b respectively and summing over all observations. Here the linear equation is Y_i = a + bX_i. The first normal equation is simply the linear equation Y_i = a + bX_i summed over all observations (since the coefficient of a is 1).
The second normal equation is the linear equation multiplied by X_i (since the coefficient of b is X_i) and summed over all observations.
After obtaining the normal equations we calculate the values of a and b from the set of data we have.
Example 9.2: Assume that quantity of agricultural production depends on the amount of rainfall and fit a linear regression to the data given in Example 9.1.
In this case the dependent variable (Y) is quantity of agricultural production and the independent variable (X) is amount of rainfall. The regression equation to be fitted is

Y_i = a + bX_i + e_i

For the above equation we find out the normal equations by the method of least squares. These equations are given at (9.7) and (9.8). Next we construct a table as follows:
Table 9.2: Computation of Regression Line

(Table 9.2 lists, for each observation, X_i, Y_i, X_iY_i, X_i², the predicted value Ŷ_i and the error e_i. The column totals are $\sum X_i = 750$, $\sum Y_i = 450$, $\sum X_iY_i = 34526$ and $\sum X_i^2 = 57294$.)

By substituting values from Table 9.2 in the normal equations (9.7) and (9.8) we get the following:

450 = 10a + 750b
34526 = 750a + 57294b
By solving these two equations we obtain a = −10.73 and b = 0.743. So the regression line is

Ŷ_i = −10.73 + 0.743X_i
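The same solution can be checked mechanically. A minimal sketch with NumPy, using the totals from Table 9.2 (the small difference from −10.73 arises only because the hand computation rounds b to three decimal places first):

```python
import numpy as np

# Normal equations from Example 9.2:
#   450   = 10 a + 750 b
#   34526 = 750 a + 57294 b
A = np.array([[10.0, 750.0],
              [750.0, 57294.0]])
c = np.array([450.0, 34526.0])

a, b = np.linalg.solve(A, c)
print(round(a, 2), round(b, 3))  # -10.75 and 0.743
```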
Notice that the sum of the errors e_i for the estimated regression equation is zero (see the last column of Table 9.2). The computation given in Table 9.2 often involves large numbers and poses difficulty. Hence we have a short-cut method for calculating the values of a and b from the normal equations. Let us take x_i = X_i − X̄ and y_i = Y_i − Ȳ, where X̄ and Ȳ are the arithmetic means of X and Y respectively.
Hence

$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$  and  $a = \bar{Y} - b\bar{X}$
Since these formulae are derived from the normal equations we get the same values for a and b in this method also. For the data given in Table 9.1 we compute the values of a and b by this method. For this purpose we construct Table 9.3.
Table 9.3: Computation of Regression Line (Short-cut Method)

X_i     Y_i     x_i     y_i     x_i²    x_iy_i
60      33      -15     -12     225     180
62      37      -13     -8      169     104
65      38      -10     -7      100     70
71      42      -4      -3      16      12
73      42      -2      -3      4       6
75      45      0       0       0       0
81      49      6       4       36      24
85      52      10      7       100     70
88      55      13      10      169     130
90      57      15      12      225     180

Total: 750     450     0       0       1044    776
Thus b = 776/1044 = 0.743 and a = Ȳ − bX̄ = 45 − 0.743 × 75 = −10.73. The estimated regression line is

Ŷ_i = −10.73 + 0.743X_i    ...(9.12)
The coefficient b in (9.12) is called the regression coefficient. This coefficient reflects the amount of increase in Y when there is a unit increase in X. In regression equation (9.12) the coefficient b = 0.743 implies that if rainfall increases by 1 mm, agricultural production will increase by 0.743 thousand tonnes. The regression coefficient is widely used. It is also an important tool of analysis. For example, if Y is aggregate consumption and X is aggregate income, b represents the marginal propensity to consume (MPC).
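The short-cut method can be verified in a few lines of plain Python using the data of Table 9.1; no special library is needed:

```python
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]   # rainfall (mm)
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]   # production ('000 tonnes)

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n           # 75 and 45

# Sums of products of deviations, as in Table 9.3
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 776
Sxx = sum((x - x_bar) ** 2 for x in X)                       # 1044

b = Sxy / Sxx           # regression coefficient, about 0.743
a = y_bar - b * x_bar   # intercept, about -10.75 (-10.73 in the text after rounding b)
print(round(b, 3), round(a, 2))
```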
9.6 PREDICTION
A major interest in studying regression lies in its ability to forecast. In Example 9.1 in the previous Section we assumed that the quantity of agricultural production is dependent on the amount of rainfall. We fitted a linear equation to the observed data and got the relationship

Ŷ_i = −10.73 + 0.743X_i

From this equation we can predict the quantity of agricultural output given the amount of rainfall. Thus when rainfall is 60 mm, agricultural production is (−10.73 + 0.743 × 60) = 33.85 thousand tonnes. This figure is the predicted value on the basis of the regression equation. In a similar manner we can find the predicted values of Y for different values of X.
Compare the predicted value with the observed value. From Table 9.1, where observed values are given, we find that when rainfall is 60 mm, agricultural production is 33 thousand tonnes. In fact, the predicted values for observed values of X are given in the fifth column of Table 9.2. Thus when rainfall is 60 mm, the predicted value is 33.85 thousand tonnes while the observed value is 33 thousand tonnes. Thus the error is −0.85 thousand tonnes.

Now a question arises, 'Which one, between observed and predicted values, should we believe?' In other words, what will be the quantity of agricultural production if there is a rainfall of 60 mm in future? On the basis of our regression line it is given to be 33.85 thousand tonnes. And we accept this value because it is based on the overall data. The error of −0.85 is considered as a random fluctuation which may not be repeated.

The second question that comes to our mind is, 'Is the prediction valid for any value of X?' For example, we find from the regression equation that when rainfall is zero, agricultural production is −10.73 thousand tonnes. But common sense tells us that agricultural production cannot be negative! Is there anything wrong with our regression equation? In fact, the regression equation here is estimated on the basis of rainfall data in the range of 60-90 mm. Thus prediction is valid in this range of X. Our prediction should not be for far off values of X.

A third question that arises here is, 'Will the predicted value come true?' This depends upon the coefficient of determination. If the coefficient of determination is closer to one, there is greater likelihood that the prediction will be realised. However, the predicted value is constrained by elements of randomness involved with human behaviour and other unforeseen factors.
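These cautions can be built into a small prediction routine. A sketch follows; the function name and the hard range check are our own devices, not part of the unit:

```python
def predict_production(rainfall_mm: float) -> float:
    """Predicted agricultural production ('000 tonnes) from line (9.12)."""
    if not 60 <= rainfall_mm <= 90:
        # The line was estimated from rainfall observed between 60 and 90 mm;
        # extrapolating beyond this range can give meaningless values,
        # e.g. negative production at zero rainfall.
        raise ValueError("prediction valid only for rainfall in 60-90 mm")
    return -10.73 + 0.743 * rainfall_mm

print(round(predict_production(60), 2))  # 33.85, against an observed 33
```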
9.7 RELATIONSHIP BETWEEN REGRESSION AND CORRELATION
Thus we can have two regression coefficients from a given set of data depending upon the choice of dependent and independent variables. These are:
a) Y on X line, Ŷ_i = a + bX_i, and
b) X on Y line, X̂_i = α + βY_i

You may ask, 'What is the need for having two different lines?' By rearrangement of terms of the Y on X line we obtain X_i = −(a/b) + (1/b)Y_i. Thus we should have α = −a/b and β = 1/b.
However, this does not hold in general, because the relation between X and Y is not a mathematical one. You may recall that estimates of the parameters are obtained by the method of least squares. Thus the regression line Ŷ_i = a + bX_i is obtained by minimising $\sum(Y_i - a - bX_i)^2$, whereas the X on Y line is obtained by minimising $\sum(X_i - \alpha - \beta Y_i)^2$. In terms of the correlation coefficient r and the standard deviations σ_X and σ_Y of X and Y we find

$b = r\frac{\sigma_Y}{\sigma_X}$  and  $\beta = r\frac{\sigma_X}{\sigma_Y}$
Thus b × β = r(σ_Y/σ_X) × r(σ_X/σ_Y) = r². This r² is called the coefficient of determination. Thus the product of the two regression coefficients of Y on X and X on Y is the square of the correlation coefficient. This gives a relationship between correlation and regression. Notice, however, that the coefficient of determination of either regression is the same, i.e., r²; this means that although the two regression lines are different, their predictive powers are the same. Note that the coefficient of determination r² ranges between 0 and 1, i.e., the maximum value it can assume is unity and the minimum value is zero; it cannot be negative. From the previous discussion, two points emerge clearly:
1) If the points in the scatter diagram lie close to a straight line, then there is a strong relationship between X and Y and the correlation coefficient is high.

2) If the points in the scatter diagram lie close to a straight line, then the observed values and the predicted values of Y by least squares are very close and the prediction errors (Y_i − Ŷ_i) are small.
Thus, the prediction errors by least squares seem to be related to the correlation coefficient. We explain this relationship here. The sum of squares of errors at the various points upon using the least squares linear regression is

$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

On the other hand, if we had not used the value of observed X to predict Y, then the prediction would be a constant, say α. The best value of α by least squares is Ȳ, and the sum of squares of prediction errors in that case is

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$

The ratio

$\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \Big/ \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

can then be used as an index of how much the prediction has been improved by the use of X. In fact, this ratio is the coefficient of determination and the same as r² mentioned above. Since both the numerator and denominator of this ratio are non-negative, the ratio is greater than or equal to zero.
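Both routes to the coefficient of determination can be checked on the rainfall data. A sketch (fitted values from the line Ŷ = −10.73 + 0.743X; small discrepancies between the two printed numbers come only from rounding the coefficients):

```python
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]

y_bar = sum(Y) / len(Y)
Y_hat = [-10.73 + 0.743 * x for x in X]

# r^2 as the ratio of explained to total sum of squares
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)
tss = sum((y - y_bar) ** 2 for y in Y)
print(round(ess / tss, 3))            # about 0.99

# r^2 as the product of the two regression coefficients b and beta
Sxy = sum((x - 75) * (y - 45) for x, y in zip(X, Y))
b = Sxy / sum((x - 75) ** 2 for x in X)      # Y on X coefficient
beta = Sxy / sum((y - 45) ** 2 for y in Y)   # X on Y coefficient
print(round(b * beta, 3))             # also about 0.99
```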
Check Your Progress 1

1) From the following data find the coefficient of linear correlation between X and Y. Determine also the regression line of Y on X, and then make an estimate of the value of Y when X = 12.
3) Find the two lines of regression from the following data:

Age of Husband (X): 25 22 28 25 35 20 22 40 20 18
Age of Wife (Y):    18 15 20 17 22 14 16 21 15 14

Hence estimate (i) the age of husband when the age of wife is 19, (ii) the age of wife when the age of husband is 30.
5) Obtain the equation of the line of regression of yield of rice (y) on water (x), from the data given in the following table:

Water in inches (x): 12    18    24    30    36    42    48
Yield in tons (y):   5.27  5.68  6.25  7.21  8.02  8.71  8.42

Estimate the most probable yield of rice for 40 inches of water.
9.8 MULTIPLE REGRESSION
So far we have considered the case of the dependent variable being explained by one independent variable. However, there are many cases where the dependent variable is explained by two or more independent variables. For example, yield of crops (Y) may be explained by application of fertilizer (X₁) and irrigation water (X₂). This sort of model is termed multiple regression. Here, the equation that we consider is

Y = α + βX₁ + γX₂ + e    ...(9.13)
where Y is the explained variable, X₁ and X₂ are explanatory variables, and e is the error term. In order to make the presentation simple we have dropped the subscripts. A regression equation can be fitted to (9.13) by applying the method of least squares discussed in Section 9.5. Here also we minimise $\sum e^2$ and obtain the normal equations as follows:

$\sum Y = n\alpha + \beta\sum X_1 + \gamma\sum X_2$
$\sum X_1Y = \alpha\sum X_1 + \beta\sum X_1^2 + \gamma\sum X_1X_2$    ...(9.14)
$\sum X_2Y = \alpha\sum X_2 + \beta\sum X_1X_2 + \gamma\sum X_2^2$
By solving the above equations we obtain estimates for α, β and γ. The regression equation that we obtain is

Ŷ = α + βX₁ + γX₂    ...(9.15)
Remember that we obtain predicted or forecast values of Y (that is, Ŷ) through (9.15) by applying various values of X₁ and X₂.

In the bivariate case (Y, X) we could plot the regression line on a graph paper. However, it is quite complex to plot the three variable case (Y, X₁, X₂) on graph paper because it will require three dimensions. However, the intuitive idea remains the same and we have to minimise the sum of squared errors. In fact, when we add up all the error terms (e₁, e₂, ..., e_n) they sum to zero.

In many cases the number of explanatory variables may be more than two. In such cases we have to follow the basic principle of least squares: minimise $\sum e^2$. Thus if Y = a₀ + a₁X₁ + a₂X₂ + ... + a_nX_n + e, then we have to minimise

$\sum e^2 = \sum (Y - a_0 - a_1X_1 - a_2X_2 - ... - a_nX_n)^2$

and find out the normal equations.
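In matrix form the least squares solution for any number of explanatory variables comes from these same normal equations. A minimal NumPy sketch with hypothetical data (the numbers are ours, chosen only to illustrate the mechanics):

```python
import numpy as np

# Hypothetical observations: Y explained by X1 and X2
Y = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 24.0])
X1 = np.array([20.0, 25.0, 30.0, 38.0, 45.0, 50.0])
X2 = np.array([5.0, 4.0, 4.0, 3.0, 2.0, 1.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(X1), X1, X2])

# np.linalg.lstsq minimises the sum of squared errors, which is
# equivalent to solving the normal equations (X'X) coef = X'Y
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha, beta, gamma = coef
print(alpha, beta, gamma)   # estimates of the intercept and slopes
```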
Now a question arises, 'How many variables should be included in a regression equation?' It depends on our logic and what variables are considered to be important. Whether a variable is important or not can also be identified on the basis of statistical tests. These tests will be discussed later in Block 7.
We present a numerical example of multiple regression below.
Example 9.3
A student tries to explain the rent charged for housing near the University. She collects data on monthly rent, area of the house and distance of the house from the university campus, and fits a linear regression model. The variables are: Rent (in Rs. '000) Y, Area (in sq. mt.) X₁, and Distance (in km) X₂.
In the above example rent charged (Y) is the dependent variable while area of the house (X₁) and distance of the house from the university campus (X₂) are independent variables. The steps involved in estimation of the regression line are:

i) Find out the regression equation to be estimated. In this case it is given by Y = α + βX₁ + γX₂ + e.
ii) Find out the normal equations for the regression equation to be estimated. In this case the normal equations are

$\sum Y = n\alpha + \beta\sum X_1 + \gamma\sum X_2$
$\sum X_1Y = \alpha\sum X_1 + \beta\sum X_1^2 + \gamma\sum X_1X_2$
$\sum X_2Y = \alpha\sum X_2 + \beta\sum X_1X_2 + \gamma\sum X_2^2$

iii) Construct a table as given in Table 9.4.

iv) Put the values from the table in the normal equations.

v) Solve for the estimates of α, β and γ.
By applying the above mentioned steps we obtain the estimated regression line as

Ŷ = −4.80 + 0.45X₁ + 0.09X₂
9.9 NON-LINEAR REGRESSION

So far we have considered linear regression equations. In many cases, however, the relationship is non-linear; an example is the equation Y = aX^b. By taking logarithm (log), this equation can be transformed into

log Y = log a + b log X,  or  Y′ = α + βX′

where Y′ = log Y, α = log a, β = b and X′ = log X.
Similarly, if the relationship is of the form Y = a + b(1/X), then by taking X′ = 1/X we obtain Y = a + bX′, which is linear in X′.
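A sketch of this transform-and-fit procedure for the form Y = aX^b follows (the data are hypothetical, generated to lie roughly on Y = 2X^0.5; natural logarithms are used, with exp() as the reverse transformation):

```python
import math

# Hypothetical data lying roughly on Y = 2 * X ** 0.5
X = [1, 4, 9, 16, 25]
Y = [2.1, 3.9, 6.2, 7.9, 10.1]

# Transform: log Y = log a + b log X is linear in the logs
lx = [math.log(x) for x in X]
ly = [math.log(y) for y in Y]

n = len(X)
lx_bar, ly_bar = sum(lx) / n, sum(ly) / n
b = sum((u - lx_bar) * (v - ly_bar) for u, v in zip(lx, ly)) \
    / sum((u - lx_bar) ** 2 for u in lx)
log_a = ly_bar - b * lx_bar

a = math.exp(log_a)               # reverse transformation recovers a
print(round(a, 2), round(b, 2))   # close to 2 and 0.5
```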
Once the non-linear equation is transformed, the fitting of a regression line is as per the method discussed earlier in this unit. We derive the normal equations and substitute the values calculated from the observed data. From the transformed parameters, the actual parameters can be obtained by making the reverse transformation.

Check Your Progress 2

1) Using the data on scores in Statistics and Economics of Table 9.8, compute the regression of Y on X and X on Y and check that the two lines are different. On the scatter diagram, plot both these regression lines. Check that the product of the regression coefficients is the square of the correlation coefficient.
...................................................................................................................
2) Suppose that the least squares linear regression of family expenditure on clothing (Rs. Y) on family annual income (Rs. X) has been found to be Y = 100 + 0.09X, in the range 1000 < X < 100000. Interpret this regression line. Predict the expenditure on clothing of a family with an annual income of Rs. 10,000. What about families with annual incomes of Rs. 100 and Rs. 10,00,000?
9.10 LET US SUM UP

In this unit we discussed the concept of regression, where the relationship between variables takes the form of a mathematical equation. Based on our logic, understanding and purpose of analysis we categorise variables and identify the equation form. The regression coefficient enables us to make predictions for the dependent variable given the values of the independent variable. However, prediction remains more or less valid within the range of data used for analysis. If we attempt to predict for far off values of the independent variable we may get insensible values for the dependent variable.
9.11 KEY WORDS

Normal Equations: Equations derived in the application of the least squares method, for example in regression analysis. They are used to estimate the parameters of the model.

Regression: An analysis of the relationship between two or more variables in terms of the original units of the data.
9.12 SOME USEFUL BOOKS

Nagar, A.L. and R.K. Das, 1989: Basic Statistics, Oxford University Press, Delhi.

Goon, A.M., M.K. Gupta and B. Dasgupta, 1987: Basic Statistics, The World Press Pvt. Ltd., Calcutta.
9.13 ANSWERS/HINTS TO CHECK YOUR PROGRESS EXERCISES

Check Your Progress 1

1) r = 0.98; Y = 0.64X + 0.54; 8.2
2) X = 0.95Y − 6.4; Y = 0.95X + 7.25
3) X = 2.23Y − 12.70; Y = 0.39X + 7.33; (i) 29.6 (ii) 18.9
4) Y = 0.613X + 14.81; X = 1.360Y − 5.2
5) Y = 3.99 + 0.103X; 8.11 tons
Check Your Progress 2
1) i) Y = a + bX = 5.856 + 0.676X
   ii) X = α + βY = 29.848 + 0.799Y
   iii) r = 0.73
   iv) 0.676 × 0.799 = 0.54 = r²

2) Expenditure on clothing, when family income is Rs. 10,000, is Rs. 1,000. In the case of incomes below Rs. 1,000 or above Rs. 1,00,000 the regression line may not hold good. In between these two figures, a one rupee increase in income increases expenditure on clothes by 9 paise.