Unit 5
Regression Analysis
Objectives
After going through this unit, you will be able to:
Explain Regression Analysis
Explain Simple Regression
Explain Multiple Regression
Structure
5.1 Introduction
5.2 Simple Regression Analysis
5.3 Multiple Regression Analysis
5.4 Assessing the Regression Equation
5.5 Key Words
5.6 Summary
5.1 INTRODUCTION
The most commonly used form of regression is linear regression, and the most common type of linear
regression is called ordinary least squares regression.
5.2 SIMPLE REGRESSION ANALYSIS
Linear regression uses the values from an existing data set consisting of measurements of the values
of two variables, X and Y, to develop a model that is useful for predicting the value of the dependent
variable, Y, for given values of the independent variable, X.
For example, say we know what the average speed of cars on the freeway is when we have 2 highway
patrols deployed (average speed=75 mph) or 10 highway patrols deployed (average speed=35 mph).
But what will be the average speed of cars on the freeway when we deploy 5 highway patrols?
From our known data, we can use the regression formula (calculations not shown) to compute the
values of a and b and obtain the following equation: Y = 85 + (-5)X, where Y is the predicted average
speed and X is the number of highway patrols deployed.
That is, the average speed of cars on the freeway when there are no highway patrols working (X=0)
will be 85 mph. For each additional highway patrol car working, the average speed will drop by 5 mph.
For five patrols (X=5), Y = 85 + (-5) (5) = 85 - 25 = 60 mph
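A quick way to verify this arithmetic is to fit the line programmatically. The following is a minimal Python sketch using NumPy; the variable names are ours, and the data are just the two points given above:

```python
import numpy as np

patrols = np.array([2.0, 10.0])   # X: number of highway patrols deployed
speed = np.array([75.0, 35.0])    # Y: average freeway speed in mph

# Fit Y = a + bX by least squares; np.polyfit returns [slope, intercept].
b, a = np.polyfit(patrols, speed, 1)
print(a, b)        # 85.0 and -5.0, matching the equation above

print(a + b * 5)   # predicted average speed for five patrols: 60.0 mph
```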
There may be some variations on how regression equations are written in the literature. For example,
you may sometimes see the dependent variable term (Y) written with a little “hat” ( ^ ) on it, or called
Y-hat. This refers to the predicted value of Y. The plain Y refers to observed values of Y in the data set
used to calculate the regression equation.
You may see the symbols for alpha (a) and beta (b) written in Greek letters, or you may see them
written in English letters. The coefficient of the independent variable may have a subscript, as may
the term for X, for example, b1X1 (this is common in multiple regression).
In theory, there are several important assumptions that must be satisfied if linear regression is to be
used. These are:
1. Both the independent (X) and the dependent (Y) variables are measured at the interval or
ratio level.
2. The relationship between the independent (X) and the dependent (Y) variables is linear.
3. Errors in prediction of the value of Y are distributed in a way that approaches the normal
curve.
4. Errors in prediction of the value of Y are all independent of one another.
5. The distribution of the errors in prediction of the value of Y is constant regardless of the value
of X.
There are a number of advanced statistical tests that can be used to examine whether or not these
assumptions are true for any given regression equation. However, these are beyond the scope of this
discussion.
5.3 MULTIPLE REGRESSION ANALYSIS
In multiple regression, the dependent variable y is modelled as a linear function of p explanatory variables:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + u

where u is the error term. The model rests on the following assumptions:
1. E[u_i] = 0 and V[u_i] = \sigma^2 for each observation i.
2. Because the observations y_1, y_2, ..., y_n are a random sample, they are mutually independent, and hence the error terms are also mutually independent.
3. The distribution of the error term is independent of the joint distribution of x_1, x_2, ..., x_p.
4. The unknown parameters \beta_0, \beta_1, \beta_2, ..., \beta_p are constants.
The parameters \beta_0, \beta_1, \beta_2, ..., \beta_p can be estimated using the least squares procedure, which minimizes the sum of squares of errors. Minimizing the sum of squares leads to the normal equations, from which the values of the estimates a, b_1, b_2, ..., b_p can be computed. In matrix notation,

(X'X) b = X'y, so that b = (X'X)^{-1} X'y

where X is the matrix of observations on the explanatory variables (with a leading column of ones for the intercept a), y is the vector of observations on the dependent variable, and b = (a, b_1, ..., b_p)'.
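To make the procedure concrete, here is a minimal Python/NumPy sketch of solving the normal equations for a small, entirely hypothetical data set (the true coefficients 4.0, 2.0 and -1.5 are our own choices, not from the text):

```python
import numpy as np

# Hypothetical sample: n = 30 observations on p = 2 explanatory variables.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept a.
Xd = np.column_stack([np.ones(n), X])

# Normal equations (X'X) b = X'y, solved for b = (a, b1, b2).
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
print(b)  # approximately [4.0, 2.0, -1.5]
```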
The standard error of estimate is

S_e = \sqrt{ \frac{\sum (y_i - \hat{y}_i)^2}{n - p - 1} }

where n is the number of observations and p is the number of independent variables. The denominator
of the equation indicates that in multiple regression with p independent variables, the standard error
has n - p - 1 degrees of freedom. This happens because the degrees of freedom are reduced from n by
the p + 1 numerical constants a, b_1, b_2, ..., b_p that have been estimated from the sample.
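Continuing the same hypothetical fit, the standard error of estimate and its n - p - 1 degrees of freedom might be computed like this:

```python
import numpy as np

# Same hypothetical data and fit as in the previous sketch.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# p + 1 constants (a, b1, ..., bp) were estimated, leaving n - p - 1 df.
residuals = y - Xd @ b
se = np.sqrt(np.sum(residuals**2) / (n - p - 1))
print(se)  # close to the true error scale of 0.5 used to generate the data
```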
Fit of the regression model
The fit of the multiple regression model can be assessed by the coefficient of multiple
determination, which is a fraction that represents the proportion of the total variation of y that is
explained by the regression plane. The relevant sums of squares are

SSE = \sum (y_i - \hat{y}_i)^2 (the error sum of squares),

SSR = \sum (\hat{y}_i - \bar{y})^2 (the regression sum of squares),

SST = \sum (y_i - \bar{y})^2 (the total sum of squares).

Obviously,

SST = SSR + SSE.

The ratio SSR/SST represents the proportion of the total variation in y explained by the
regression model. This ratio, denoted by R^2, is called the coefficient of multiple
determination. R^2 is sensitive to the magnitudes of n and p in small samples: if p is large
relative to n, the model tends to fit the data very well. In the extreme case, if n = p + 1, the
model would fit the data exactly.
A better goodness-of-fit measure is the adjusted R^2, which is computed as follows:

\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}
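A sketch of these fit measures in code, again on the hypothetical data set used above:

```python
import numpy as np

# Same hypothetical data and fit as in the previous sketches.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
y_hat = Xd @ b

sse = np.sum((y - y_hat) ** 2)         # unexplained (error) variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the plane
sst = np.sum((y - y.mean()) ** 2)      # total variation; SST = SSR + SSE
r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, r2_adj)
```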
The overall goodness of fit of the regression model (i.e., whether the regression model is at all
helpful in predicting the values of y) can be evaluated using an F-test in the format of an analysis
of variance.
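In code, the ANOVA-style F statistic is the explained mean square divided by the unexplained mean square, F = (SSR/p) / (SSE/(n - p - 1)); a sketch using SciPy for the tail probability, on the same hypothetical data:

```python
import numpy as np
from scipy import stats

# Same hypothetical data and fit as in the previous sketches.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
y_hat = Xd @ b

sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

# F = explained mean square / unexplained mean square,
# with p and n - p - 1 degrees of freedom.
f_stat = (ssr / p) / (sse / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)  # upper-tail probability
print(f_stat, p_value)
```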
The statistical significance of an individual regression coefficient b_j can be assessed by computing

t = \frac{b_j}{s_{b_j}}

where s_{b_j} is the standard error of b_j, and performing a one- or two-tailed t-test with n - p - 1
degrees of freedom.
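A sketch of the coefficient t-tests on the same hypothetical data; the standard errors are taken from the diagonal of S_e^2 (X'X)^{-1}:

```python
import numpy as np
from scipy import stats

# Same hypothetical data as in the previous sketches.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xd = np.column_stack([np.ones(n), X])
XtX_inv = np.linalg.inv(Xd.T @ Xd)
b = XtX_inv @ Xd.T @ y

# Estimated error variance and coefficient standard errors.
s2 = np.sum((y - Xd @ b) ** 2) / (n - p - 1)
se_b = np.sqrt(s2 * np.diag(XtX_inv))

t = b / se_b                                   # one t-value per coefficient
p_two_tailed = 2 * stats.t.sf(np.abs(t), n - p - 1)
print(t, p_two_tailed)
```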
Standardized regression coefficients
The magnitude of the regression coefficients depends upon the scales of measurement
used for the dependent variable y and the explanatory variables included in the regression
equation. Unstandardized regression coefficients cannot be compared directly because of
differing units of measurement and different variances of the x variables. It is therefore
necessary to standardize the variables for meaningful comparisons.
The estimated model can be written in standardized form as

\frac{\hat{y} - \bar{y}}{s_y} = \frac{b_1 s_1}{s_y} \left( \frac{x_1 - \bar{x}_1}{s_1} \right) + \frac{b_2 s_2}{s_y} \left( \frac{x_2 - \bar{x}_2}{s_2} \right) + \cdots + \frac{b_p s_p}{s_y} \left( \frac{x_p - \bar{x}_p}{s_p} \right)

The expressions in the parentheses are standardized variables; the b's are unstandardized
regression coefficients; s_1, s_2, ..., s_p are the standard deviations of the variables x_1, x_2, ..., x_p;
and s_y is the standard deviation of the variable y.
The coefficients (b_j s_j)/s_y, j = 1, 2, ..., p, are called standardized regression coefficients. The
standardized regression coefficient measures the impact of a unit change in the standardized
value of x_j on the standardized value of y. The larger the magnitude of the standardized
coefficient, the more x_j contributes to the prediction of y. However, the regression equation
itself should be reported in terms of the unstandardized regression coefficients, so that
predictions of y can be made directly from the x variables.
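A sketch of computing the standardized coefficients b_j s_j / s_y on the same hypothetical data:

```python
import numpy as np

# Same hypothetical data and fit as in the previous sketches.
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# Standardized coefficient b_j * s_j / s_y; the intercept b[0] is excluded.
s_x = X.std(axis=0, ddof=1)
s_y = y.std(ddof=1)
beta_std = b[1:] * s_x / s_y
print(beta_std)  # comparable across the x variables despite differing units
```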
5.4 ASSESSING THE REGRESSION EQUATION
We now have a regression equation. But how good is the equation at predicting values of Y for given
values of X? For that assessment, we turn to measures of association and measures of statistical
significance that are used with regression equations.
r2: It is a measure of association; it represents the proportion of the variance in the values of Y
that can be explained by knowing the value of X. It varies from a low of 0.0 (none of the
variance is explained) to a high of 1.0 (all of the variance is explained).
Standard error of the computed value of b: A t-test for statistical significance of the coefficient
is conducted by dividing the value of b by its standard error. As a rule of thumb, a t-value
greater than 2.0 is usually statistically significant, but you must consult a t-table to be sure. If
the t-value indicates that the b coefficient is statistically significant, this means that the
independent variable X (number of patrol cars deployed) should be kept in the regression
equation, since it has a statistically significant relationship with the dependent variable Y
(average speed in mph). If the relationship were not statistically significant, the value of the b
coefficient would be (statistically speaking) indistinguishable from zero.
F: It is a test for statistical significance of the regression equation as a whole. It is obtained by dividing
the explained variance by the unexplained variance. As a rule of thumb, an F-value greater than 4.0
is usually statistically significant, but you must consult an F-table to be sure. If F is significant, then the
regression equation helps us to understand the relationship between X and Y.
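These measures can be obtained together with scipy.stats.linregress. The sketch below uses a hypothetical extension of the patrol example (the original two data points are too few for meaningful significance tests); the extra observations are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical extension of the patrol example with more observations.
patrols = np.array([0, 2, 4, 5, 6, 8, 10], dtype=float)
speed = np.array([84, 74, 66, 61, 54, 46, 34], dtype=float)

fit = stats.linregress(patrols, speed)
print(fit.intercept, fit.slope)   # a and b
print(fit.rvalue ** 2)            # r2: share of the variance in Y explained
print(fit.slope / fit.stderr)     # t-value for b; |t| > 2 suggests significance
print(fit.pvalue)                 # two-tailed p-value for the test on b
```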
5.5 KEYWORDS
The coefficient of determination: It tells the amount of variability in one variable explained
by variability in the other variable.
Linear relationship: From the definition of correlation as the degree of linear relationship
between two variables, we can use the correlation coefficient to compute the equations for
the straight lines best describing the relationship between the variables.
Regression equations: The equations (one to predict X and one to predict Y) are called
regression equations, and we can use them to predict a score on one variable if we know a
score on the other.
The least squares line: The general form of the equation is Y = bX + a, where ‘b’ is the slope
of the line and ‘a’ is where the line intercepts the Y axis. The regression line is also called the
least squares line.
5.6 SUMMARY
Most parametric models are “regression models.” Regression models require data sets from past
performance in order that a regression formula can be derived. The regression formula is used to
predict or forecast future performance. Thus, to employ parametric models they first must be
calibrated with history. Calibration requires some standardization of the definition of deliverable
items and item attributes. Once a calibrated model is in hand, it is fed with parameter data of the
project being estimated in order to obtain estimates of the deliverables. Model parameters are also set or adjusted to
account for similarity or dissimilarity between the project being estimated and the project history.
Usually, a methodology is incorporated into the model. Some models also allow for specification of
risk factors as well as the severity of those risks.