Regression Analysis
Introduction
The variable whose value is estimated using the algebraic equation is called the dependent (or response) variable, and the variable whose value is used to make this estimate is called the independent (regressor or predictor) variable. The linear algebraic equation used for expressing a dependent variable in terms of the independent variable is called the linear regression equation.
• The dictionary meaning of the word Regression is 'stepping back' or 'going back'.
• Regression is a measure of the average relationship between two or more variables in terms of the original units of the data.
• It also attempts to establish the nature of the relationship between the variables, that is, to study their functional relationship and thereby provide a mechanism for prediction or forecasting.
Differences between Correlation and Regression Analysis
• Developing an algebraic equation between two variables from sample data and predicting the value of one variable, given the value of the other variable, is referred to as regression analysis, while measuring the strength (or degree) of the relationship between two variables is referred to as correlation analysis. The sign of the correlation coefficient indicates the nature (direct or inverse) of the relationship between the two variables, while the absolute value of the correlation coefficient indicates the extent of the relationship.
• Correlation analysis determines an association between two variables x and y but not that they have a cause-and-
effect relationship. Regression analysis, in contrast to correlation, determines the cause-and-effect relationship
between x and y, that is, a change in the value of independent variable x causes a corresponding change (effect)
in the value of dependent variable y if all other factors that affect y remain unchanged.
• In linear regression analysis one variable is considered the dependent variable and the other the independent variable, while in correlation analysis both variables are treated as independent (neither is designated as dependent).
• The coefficient of determination r² indicates the proportion of the total variance in the dependent variable that is explained, or accounted for, by the variation in the independent variable (as illustrated in the sketch below). Since the value of r² is determined from a sample, it is subject to sampling error. Even if the value of r² is high, the assumption of a linear regression may be incorrect.
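As a minimal numerical sketch of the distinction between r and r², the following Python snippet (assuming NumPy is available; the data values are made up purely for illustration) computes both for a small sample:

import numpy as np

# Made-up sample data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient: its sign gives the direction of the
# relationship, its absolute value gives the strength
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: proportion of the variance in y explained by x
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")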
ADVANTAGES OF REGRESSION ANALYSIS
• Regression analysis helps in developing a regression equation by which the value of a
dependent variable can be estimated given a value of an independent variable.
• Regression analysis helps to determine the standard error of estimate, which measures the variability or spread of the values of the dependent variable about the regression line. The smaller the standard error of estimate, the closer the pairs of values (x, y) fall about the regression line and the better the line fits the data, that is, a good estimate can be made of the value of variable y. When all the points fall on the line, the standard error of estimate equals zero (see the sketch after this list).
• When the sample size is large (df ≥ 30), interval estimation for predicting the value of the dependent variable based on the standard error of estimate is considered acceptable. The magnitude of r² remains the same regardless of which of the two variables is treated as the dependent variable.
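A minimal sketch of the standard error of estimate, assuming NumPy and using the small five-point data set from the worked example later in these notes:

import numpy as np

# Data from the worked example later in these notes
x = np.array([3.0, 2.0, 7.0, 4.0, 8.0])
y = np.array([6.0, 1.0, 8.0, 5.0, 9.0])
n = len(x)

# Least-squares line y_hat = a + b*x (np.polyfit returns the slope first)
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Standard error of estimate: spread of the observed y values about the
# regression line; it equals zero only when every point falls exactly on the line
s_yx = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
print(f"y_hat = {a:.2f} + {b:.2f}x, S_yx = {s_yx:.3f}")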
Remarks
• The relationship between the dependent variable y and the independent variable x exists and is linear. The average relationship between x and y can be described by the simple linear regression equation y = a + bx + e, where e is the deviation of a particular value of y from its expected value for a given value of the independent variable x (a small simulation sketch after these remarks illustrates this model).
• For every value of the independent variable x, there is an expected (or mean) value of the dependent variable y, and these values are normally distributed. The means of these normally distributed values fall on the line of regression.
• The dependent variable y is a continuous random variable, whereas values of the independent variable x are
fixed values and are not random.
• The sampling error associated with the expected value of the dependent variable y is assumed to be an independent random variable, normally distributed with mean zero and constant standard deviation. The errors in successive observations are not correlated with each other.
• The standard deviation and variance of the values of the dependent variable y about the regression line are constant for all values of the independent variable x within the range of the sample data. The value of the dependent variable cannot be estimated for a value of the independent variable lying outside the range of values in the sample data.
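As a sketch of the assumptions listed above, the following snippet (assuming NumPy; all parameter values are invented for illustration) generates data according to the model y = a + bx + e, with fixed x values and independent, normally distributed errors of constant standard deviation:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (assumed, not taken from these notes)
a, b, sigma = 2.0, 0.5, 1.0

# Fixed, non-random values of the independent variable x
x = np.linspace(0, 10, 50)

# Errors e: independent, normally distributed, mean zero, constant standard deviation
e = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Dependent variable generated according to y = a + b*x + e
y = a + b * x + e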
PARAMETERS OF SIMPLE LINEAR REGRESSION MODEL
• The fundamental aim of regression analysis is to determine a regression equation (line) that makes sense and fits the representative data such that the error variance is as small as possible. This implies that the regression equation should be adequate for prediction. J. R. Stockton stated: 'The device used for estimating the values of one variable from the value of the other consists of a line through the points, drawn in such a manner as to represent the average relationship between the two variables. Such a line is called the line of regression.'
• The two variables x and y which are correlated can be expressed in terms of each other in the form of
straight line equations called regression equations. Such lines should be able to provide the best fit of
sample data to the population data. The algebraic expression of regression lines is written as:
• Regression equation of y on x: Y = a + bX, used for estimating the value of y for given values of x.
• Regression equation of x on y: X = a + bY, used for estimating the value of x for given values of y (both lines are computed numerically in the sketch below).
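A minimal sketch, assuming NumPy, that fits both regression lines to the small data set used in the worked examples later in these notes (np.polyfit is used here purely as a least-squares helper; the intercepts it returns may differ slightly from the hand-rounded values in those examples):

import numpy as np

# The same small data set used in the worked examples below
x = np.array([3.0, 2.0, 7.0, 4.0, 8.0])
y = np.array([6.0, 1.0, 8.0, 5.0, 9.0])

# Regression of y on x: Y = a + bX, used to estimate y from x
b_yx, a_yx = np.polyfit(x, y, 1)

# Regression of x on y: X = a + bY, used to estimate x from y
b_xy, a_xy = np.polyfit(y, x, 1)

print(f"y on x: Y = {a_yx:.2f} + {b_yx:.2f}X")
print(f"x on y: X = {a_xy:.2f} + {b_xy:.2f}Y")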
Remarks
• To determine the value of ŷ for a given value of x, this equation requires the determination of two unknown constants a (intercept) and b (also called the regression coefficient). Once these constants are calculated, the regression line can be used to compute an estimated value of the dependent variable y for a given value of the independent variable x.
• The particular values of a and b define a specific linear relationship between x and y based on sample data. The coefficient 'a' represents the level of the fitted line (i.e., the distance of the line above or below the origin) when x equals zero, whereas coefficient 'b' represents the slope of the line (a measure of the change in the estimated value of y for a one-unit change in x), as illustrated in the sketch after this list.
• The regression coefficient ‘b’ is also denoted as:
• byx (regression coefficient of y on x) in the regression line, Y = a + bX
• bxy (regression coefficient of x on y) in the regression line, X = a + bY
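A small sketch of how a and b are read off a fitted line, using the equation Y = 0.66 + 1.07X obtained in the worked example below (the function name y_hat is just a label for this illustration):

# Fitted line Y = 0.66 + 1.07X from the worked example below
a, b = 0.66, 1.07

def y_hat(x):
    return a + b * x

# 'a' is the estimated value of y when x = 0 (the level of the fitted line)
print(y_hat(0))             # 0.66

# 'b' is the change in the estimated y for a one-unit increase in x
print(y_hat(5) - y_hat(4))  # 1.07 (up to floating-point rounding)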
Properties of Regression Coefficients
• The correlation coefficient is the geometric mean of the two regression coefficients, that is, r = ±√(byx × bxy), the sign being that of the regression coefficients.
• If one regression coefficient is greater than one, then the other regression coefficient must be less than one, because the value of the correlation coefficient r cannot exceed one. However, both regression coefficients may be less than one.
• Both regression coefficients must have the same sign (either positive or negative). This property rules out the case of opposite signs of the two regression coefficients.
• The correlation coefficient has the same sign (positive or negative) as the two regression coefficients. For example, if byx = −0.664 and bxy = −0.234, then r = −√(0.664 × 0.234) = −0.394.
• The arithmetic mean of the regression coefficients bxy and byx is greater than or equal to the correlation coefficient r in absolute value, that is, |byx + bxy| / 2 ≥ |r|. For example, if byx = −0.664 and bxy = −0.234, then their arithmetic mean is (−0.664 − 0.234)/2 = −0.449, whose absolute value 0.449 exceeds |r| = 0.394. (These properties are checked numerically in the sketch after this list.)
• Regression coefficients are independent of origin but not of scale.
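A minimal numerical check of these properties, assuming NumPy and using the same five-point data set as the worked examples below:

import numpy as np

x = np.array([3.0, 2.0, 7.0, 4.0, 8.0])
y = np.array([6.0, 1.0, 8.0, 5.0, 9.0])

b_yx = np.polyfit(x, y, 1)[0]   # regression coefficient of y on x
b_xy = np.polyfit(y, x, 1)[0]   # regression coefficient of x on y
r = np.corrcoef(x, y)[0, 1]

# r is the geometric mean of the two regression coefficients, with their common sign
print(np.sign(b_yx) * np.sqrt(b_yx * b_xy), r)

# The arithmetic mean of the regression coefficients dominates r in absolute value
print(abs((b_yx + b_xy) / 2) >= abs(r))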
METHODS TO DETERMINE REGRESSION COEFFICIENTS
Algebraic method (least squares normal equations):
Example 1: Obtain the two regression equations from the following data.
X 3 2 7 4 8
Y 6 1 8 5 9
Solution:

X        Y        XY        X²        Y²
3        6        18        9         36
2        1        2         4         1
7        8        56        49        64
4        5        20        16        25
8        9        72        64        81
ΣX = 24  ΣY = 29  ΣXY = 168  ΣX² = 142  ΣY² = 207
Regression equation of Y on X (Y = a + bX): the normal equations are
ΣY = na + bΣX      →  29 = 5a + 24b
ΣXY = aΣX + bΣX²   →  168 = 24a + 142b
Solving these gives b ≈ 1.07 and a ≈ 0.66, so Y = 0.66 + 1.07X.

Regression equation of X on Y (X = a + bY): the normal equations are
ΣX = na + bΣY      →  24 = 5a + 29b
ΣXY = aΣY + bΣY²   →  168 = 29a + 207b
Solving these gives b ≈ 0.74 and a ≈ 0.49, so X = 0.49 + 0.74Y.
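A minimal sketch, assuming NumPy, that solves the two pairs of normal equations directly from the column totals above (small differences in the intercepts can appear because the hand calculation rounds b before computing a):

import numpy as np

# Column totals from the computation table above
n, sum_x, sum_y = 5, 24.0, 29.0
sum_xy, sum_x2, sum_y2 = 168.0, 142.0, 207.0

# Normal equations for Y on X: ΣY = na + bΣX and ΣXY = aΣX + bΣX²
a_yx, b_yx = np.linalg.solve([[n, sum_x], [sum_x, sum_x2]], [sum_y, sum_xy])
print(f"Y = {a_yx:.2f} + {b_yx:.2f}X")

# Normal equations for X on Y: ΣX = na + bΣY and ΣXY = aΣY + bΣY²
a_xy, b_xy = np.linalg.solve([[n, sum_y], [sum_y, sum_y2]], [sum_x, sum_xy])
print(f"X = {a_xy:.2f} + {b_xy:.2f}Y")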
The calculations by the least squares method are quite cumbersome when the values of X and Y are large, so the work can be simplified by using the deviation (from the mean) method described below.
The formulas for the regression equations by this method are:

Regression equation of X on Y:  (X − X̄) = bxy (Y − Ȳ)
Regression equation of Y on X:  (Y − Ȳ) = byx (X − X̄)

where bxy and byx are the regression coefficients, computed from the deviations x = X − X̄ and y = Y − Ȳ as

bxy = Σxy / Σy²    and    byx = Σxy / Σx²
Example 2: From the data of Example 1, obtain the regression equations by taking deviations from the actual means of the X and Y series.
X 3 2 7 4 8
Y 6 1 8 5 9
Solution: with X̄ = 24/5 = 4.8 and Ȳ = 29/5 = 5.8,

X   Y   x = X − X̄   y = Y − Ȳ   x²      y²      xy
3   6   −1.8        0.2         3.24    0.04    −0.36
2   1   −2.8        −4.8        7.84    23.04   13.44
7   8   2.2         2.2         4.84    4.84    4.84
4   5   −0.8        −0.8        0.64    0.64    0.64
8   9   3.2         3.2         10.24   10.24   10.24
ΣX = 24  ΣY = 29  Σx = 0  Σy = 0  Σx² = 26.8  Σy² = 38.8  Σxy = 28.8
Regression equation of X on Y:
(X − X̄) = bxy (Y − Ȳ)
bxy = Σxy / Σy² = 28.8 / 38.8 = 0.74
X − 4.8 = 0.74 (Y − 5.8)
X = 0.74Y + 0.49
Regression equation of Y on X:
(Y − Ȳ) = byx (X − X̄)
byx = Σxy / Σx² = 28.8 / 26.8 = 1.07
Y − 5.8 = 1.07 (X − 4.8)
Y = 1.07X + 0.66
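A minimal sketch of the deviation-from-means method, assuming NumPy; it reproduces the two regression equations above (up to small rounding differences in the intercepts, since the hand calculation rounds the coefficients first):

import numpy as np

X = np.array([3.0, 2.0, 7.0, 4.0, 8.0])
Y = np.array([6.0, 1.0, 8.0, 5.0, 9.0])

# Deviations from the actual means (X̄ = 4.8, Ȳ = 5.8)
x = X - X.mean()
y = Y - Y.mean()

b_yx = np.sum(x * y) / np.sum(x ** 2)   # 28.8 / 26.8 ≈ 1.07
b_xy = np.sum(x * y) / np.sum(y ** 2)   # 28.8 / 38.8 ≈ 0.74

# Regression of Y on X: (Y − Ȳ) = byx (X − X̄)
print(f"Y = {b_yx:.2f}X + {Y.mean() - b_yx * X.mean():.2f}")

# Regression of X on Y: (X − X̄) = bxy (Y − Ȳ)
print(f"X = {b_xy:.2f}Y + {X.mean() - b_xy * Y.mean():.2f}")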