
UNIT 5 SIMPLE AND MULTIPLE REGRESSION

Objectives
After going through this unit, you will be able to:
 Explain Regression Analysis
 Explain Simple Regression
 Explain Multiple Regression
Structure
5.1 Introduction
5.2 Simple Regression Analysis
5.3 Multiple Regression Analysis
5.4 Assessing the Regression Equation
5.5 Key Words
5.6 Summary

5.1 INTRODUCTION

The most commonly used form of regression is linear regression, and the most common type of linear
regression is called ordinary least squares regression.

Linear regression uses an existing data set of paired measurements of two variables, X and Y, to develop a model for predicting the value of the dependent variable, Y, for given values of X.

5.2 SIMPLE REGRESSION ANALYSIS

The regression equation is written as Y = a + bX + e, where:

 Y is the value of the dependent variable, what is being predicted or explained.
 a, or alpha, is a constant; it equals the value of Y when X = 0.
 b, or beta, is the coefficient of X; it is the slope of the regression line, i.e., how much Y changes for each one-unit change in X.
 X is the value of the independent variable, what is predicting or explaining the value of Y.
 e is the error term; the error in predicting the value of Y, given the value of X (it is not displayed in most regression equations).

For example, say we know what the average speed is of cars on the freeway when we have 2 highway
patrols deployed (average speed=75 mph) or 10 highway patrols deployed (average speed=35 mph).
But what will be the average speed of cars on the freeway when we deploy 5 highway patrols?

Average Speed on Freeway, mph (Y)    Number of Patrol Cars Deployed (X)
               75                                    2
               35                                   10

From our known data, we can use the regression formula (calculations not shown) to compute the values of a and b and obtain the following equation: Y = 85 + (-5)X, where

Y is the average speed of cars on the freeway


a=85, or the average speed when X=0
b= (-5), the impact on Y of each additional patrol car deployed
X is the number of patrol cars deployed

That is, the average speed of cars on the freeway when there are no highway patrols working (X=0)
will be 85 mph. For each additional highway patrol car working, the average speed will drop by 5 mph.
For five patrols (X=5), Y = 85 + (-5) (5) = 85 - 25 = 60 mph
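
To see these calculations concretely, here is a minimal sketch in Python (assuming NumPy is available; the data are just the two observations above):

    import numpy as np

    # Known observations: patrol cars deployed (X) and average speed (Y)
    X = np.array([2, 10])
    Y = np.array([75, 35])

    # Fit the line Y = a + bX by least squares.
    # np.polyfit returns coefficients with the highest power first: [b, a].
    b, a = np.polyfit(X, Y, deg=1)
    print(a, b)        # 85.0 -5.0

    # Predicted average speed for five patrol cars (X = 5)
    print(a + b * 5)   # 60.0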

There may be some variations on how regression equations are written in the literature. For example,
you may sometimes see the dependent variable term (Y) written with a little “hat” ( ^ ) on it, or called
Y-hat. This refers to the predicted value of Y. The plain Y refers to observed values of Y in the data set
used to calculate the regression equation.

You may see the symbols for alpha (a) and beta (b) written in Greek letters, or you may see them
written in English letters. The coefficient of the independent variable may have a subscript, as may
the term for X, for example, b1X1 (this is common in multiple regression).

In theory, there are several important assumptions that must be satisfied if linear regression is to be
used. These are:
1. Both the independent (X) and the dependent (Y) variables are measured at the interval or
ratio level.
2. The relationship between the independent (X) and the dependent (Y) variables is linear.
3. Errors in prediction of the value of Y are distributed in a way that approaches the normal
curve.
4. Errors in prediction of the value of Y are all independent of one another.
5. The distribution of the errors in prediction of the value of Y is constant regardless of the value
of X.

There are a number of advanced statistical tests that can be used to examine whether or not these
assumptions are true for any given regression equation. However, these are beyond the scope of this
discussion.
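
While the formal tests are beyond our scope, a quick informal check of the error assumptions can be made by examining the residuals. The following sketch is illustrative only, with made-up data, and assumes NumPy and SciPy are available:

    import numpy as np
    from scipy import stats

    # Hypothetical observations of X and Y
    X = np.array([2, 4, 5, 7, 8, 10], dtype=float)
    Y = np.array([76, 65, 61, 49, 45, 34], dtype=float)

    # Fit the line and compute the errors in prediction (residuals)
    b, a = np.polyfit(X, Y, deg=1)
    residuals = Y - (a + b * X)

    # Assumption 3: residuals approximately normal (Shapiro-Wilk test)
    stat, p_value = stats.shapiro(residuals)
    print("Shapiro-Wilk p-value:", p_value)  # large p: no evidence against normality

    # Assumption 5: the spread of residuals should look constant across X
    for x, r in zip(X, residuals):
        print(x, round(r, 2))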

5.3 MULTIPLE REGRESSION ANALYSIS

Consider a random sample of n observations (xi1, xi2, . . . , xip, yi), i = 1, 2, . . . , n.

The p + 1 random variables are assumed to satisfy the linear model

yi = β0 + β1xi1 + β2xi2 + . . . + βpxip + ui,   i = 1, 2, . . . , n

where the ui are values of an unobserved error term, u, and the unknown parameters β0, β1, . . . , βp are constants.
Some assumptions of multiple regression analysis are:
 The error terms ui are mutually independent and identically distributed, with mean 0 and constant variance σ2; that is, E[ui] = 0 and V[ui] = σ2. This holds because the observations y1, y2, . . . , yn are a random sample; they are mutually independent, and hence the error terms are also mutually independent.
 The distribution of the error term is independent of the joint distribution of x1, x2, . . . , xp.
 The unknown parameters β0, β1, β2, . . . , βp are constants.

Equations relating the n observations can be written compactly in matrix form as

y = Xβ + u

where y is the n × 1 vector of observations of the dependent variable, X is the n × (p + 1) design matrix whose first column consists of ones, β is the (p + 1) × 1 vector of parameters, and u is the n × 1 vector of error terms.

The parameters β0, β1, β2, . . . , βp can be estimated using the least squares procedure, which minimizes the sum of squares of errors.

Minimizing the sum of squares leads to the normal equations, from which the vector of estimates b can be computed:

(X'X) b = X'y

The problem of multiple regression can be represented geometrically as follows. The n observations (xi1, xi2, . . . , xip, yi), i = 1, 2, . . . , n can be visualized as points in a (p + 1)-dimensional space. The regression problem is to determine, among all possible hyperplanes in this space, the one that best fits these points. Using the least squares criterion, we locate the hyperplane that minimizes the sum of squares of the errors, i.e., the distances between the observed points and the corresponding points on the plane.

That is, the estimate is ŷ = a + b1x1 + b2x2 + … + bpxp
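
The least squares estimates can be computed numerically. The following minimal Python sketch (assuming NumPy; the two-predictor data are made up for illustration) solves the least squares problem for p = 2:

    import numpy as np

    # Hypothetical sample: n = 6 observations of two predictors and a response
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y  = np.array([4.1, 5.0, 8.9, 9.8, 14.2, 14.9])

    # Design matrix with a leading column of ones for the constant term a
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Least squares solution of the normal equations (X'X)b = X'y
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)           # [a, b1, b2]

    # Fitted values ŷ = a + b1*x1 + b2*x2
    y_hat = X @ b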

Standard error of the estimate

Se = √[ Σ (yi - ŷi)2 / (n - p - 1) ]

where yi = the sample value of the dependent variable

ŷi = the corresponding value estimated from the regression equation

n = the number of observations

p = the number of predictors or independent variables

The denominator of the equation indicates that in multiple regression with p independent variables, the standard error has n - p - 1 degrees of freedom. This happens because the degrees of freedom are reduced from n by the p + 1 numerical constants a, b1, b2, …, bp that have been estimated from the sample.
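
Continuing the sketch above (the arrays y and y_hat come from the previous snippet):

    # Standard error of the estimate: Se = √[SSE / (n - p - 1)]
    n, p = len(y), 2                   # p = number of predictors
    SSE = np.sum((y - y_hat) ** 2)     # sum of squares due to error
    Se = np.sqrt(SSE / (n - p - 1))
    print(Se)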
Fit of the regression model

The fit of the multiple regression model can be assessed by the coefficient of multiple determination, which is a fraction that represents the proportion of the total variation of y that is explained by the regression plane.

Sum of squares due to error: SSE = Σ (yi - ŷi)2

Sum of squares due to regression: SSR = Σ (ŷi - ȳ)2

Total sum of squares: SST = Σ (yi - ȳ)2

Obviously,

SST = SSR + SSE

The ratio SSR/SST represents the proportion of the total variation in y explained by the regression model. This ratio, denoted by R2, is called the coefficient of multiple determination. R2 is sensitive to the magnitudes of n and p in small samples: if p is large relative to n, the model tends to fit the data very well. In the extreme case, if n = p + 1, the model fits the data exactly.

A better goodness of fit measure is the adjusted R2, which is computed as follows:

Adjusted R2 = 1 - [(n - 1)/(n - p - 1)] (1 - R2)

            = 1 - [SSE/(n - p - 1)] / [SST/(n - 1)]
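
These quantities are easy to compute in the running sketch (SSE, n and p come from the standard error snippet above):

    # Coefficient of multiple determination and its adjusted version
    SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
    SSR = SST - SSE                        # sum of squares due to regression
    R2 = SSR / SST
    adj_R2 = 1 - (n - 1) / (n - p - 1) * (1 - R2)
    print(R2, adj_R2)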

Statistical inferences for the model

The overall goodness of fit of the regression model (i.e., whether the regression model is at all helpful in predicting the values of y) can be evaluated using an F-test in the format of an analysis of variance.

Under the null hypothesis Ho: β1 = β2 = ... = βp = 0, the statistic

F = MSR/MSE = [SSR/p] / [SSE/(n - p - 1)]

has an F-distribution with p and n - p - 1 degrees of freedom.

ANOVA Table for Multiple Regression

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Squares    F ratio
Regression             SSR               p                     MSR             MSR/MSE
Error                  SSE               n - p - 1             MSE
Total                  SST               n - 1
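
A sketch of the F-test, continuing the running example (scipy.stats.f provides the F-distribution; SSR, SSE, n and p come from the earlier snippets):

    from scipy import stats

    # F ratio from the ANOVA decomposition
    MSR = SSR / p
    MSE = SSE / (n - p - 1)
    F = MSR / MSE
    p_value = stats.f.sf(F, p, n - p - 1)  # upper-tail probability
    print(F, p_value)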

Whether a particular variable contributes significantly to the regression equation can be tested as follows. For any specific variable xi, we can test the null hypothesis Ho: βi = 0 by computing the statistic

t = bi / s(bi)

where s(bi) is the estimated standard error of the coefficient bi, and performing a one- or two-tailed t-test with n - p - 1 degrees of freedom.
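
The standard errors s(bi) can be estimated from the diagonal of MSE·(X'X)-1; a minimal sketch continuing the running example (X, b, MSE, n and p are from the earlier snippets):

    # Estimated covariance matrix of the coefficients: MSE * (X'X)^-1
    cov_b = MSE * np.linalg.inv(X.T @ X)
    se_b = np.sqrt(np.diag(cov_b))        # standard errors, including the constant

    # t statistic and two-tailed p-value for each coefficient
    t_stats = b / se_b
    p_values = 2 * stats.t.sf(np.abs(t_stats), n - p - 1)
    print(t_stats, p_values)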
Standardized regression coefficients
The magnitude of the regression coefficients depends upon the scales of measurement
used for the dependent variable y and the explanatory variables included in the regression
equation. Unstandardized regression coefficients cannot be compared directly because of
differing units of measurements and different variances of the x variables. It is therefore
necessary to standardize the variables for meaningful comparisons.
The estimated model ŷi = b0 + b1xi1 + b2xi2 + … + bpxip can be written as:

(ŷi - ȳ)/sy = (b1s1/sy)[(xi1 - x̄1)/s1] + (b2s2/sy)[(xi2 - x̄2)/s2] + … + (bpsp/sy)[(xip - x̄p)/sp]

The expressions in the square brackets are standardized variables; the b's are unstandardized regression coefficients; s1, s2, …, sp are the standard deviations of the variables x1, x2, …, xp; and sy is the standard deviation of the variable y.

The coefficients (bjsj)/sy, j = 1, 2, …, p, are called standardized regression coefficients. The standardized regression coefficient measures the impact of a unit change in the standardized value of xj on the standardized value of y. The larger the magnitude of the standardized bj, the more xj contributes to the prediction of y. However, the regression equation itself should be reported in terms of the unstandardized regression coefficients so that predictions of y can be made directly from the x variables.
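
The conversion itself is short in the running sketch (b, x1, x2 and y are from the earlier snippets):

    # Standardized coefficients: beta_j = b_j * s_j / s_y
    s_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
    s_y = y.std(ddof=1)
    std_coefs = b[1:] * s_x / s_y          # skip the constant term b[0]
    print(std_coefs)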

5.4 ASSESSING THE REGRESSION EQUATION

We now have a regression equation. But how good is the equation at predicting values of Y, for given
values of X? For that assessment, we turn to measures of association and measures of statistical
significance that are used with regression equations.

 r2: It is a measure of association; it represents the proportion of the variance in the values of Y that can be explained by knowing the value of X. It varies from a low of 0.0 (none of the variance is explained) to a high of 1.0 (all of the variance is explained).
 Standard error of the computed value of b: A t-test for statistical significance of the coefficient is conducted by dividing the value of b by its standard error. As a rule of thumb, a t-value greater than 2.0 is usually statistically significant, but you must consult a t-table to be sure. If the t-value indicates that the b coefficient is statistically significant, this means that the independent variable X (number of patrol cars deployed) should be kept in the regression equation, since it has a statistically significant relationship with the dependent variable Y (average speed in mph). If the relationship were not statistically significant, the value of the b coefficient would be (statistically speaking) indistinguishable from zero.

 F: It is a test for statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance. As a rule of thumb, an F-value greater than 4.0 is usually statistically significant, but you must consult an F-table to be sure. If F is significant, then the regression equation helps us to understand the relationship between X and Y.
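
For simple regression, all of these quantities can be obtained in a single call; a minimal sketch using scipy.stats.linregress with made-up data in the spirit of the freeway example:

    import numpy as np
    from scipy import stats

    X = np.array([2, 3, 5, 7, 8, 10], dtype=float)
    Y = np.array([76, 69, 61, 48, 45, 34], dtype=float)

    res = stats.linregress(X, Y)
    print(res.slope, res.intercept)   # b and a
    print(res.rvalue ** 2)            # r2: proportion of variance explained
    print(res.stderr)                 # standard error of b; t = slope / stderr
    print(res.pvalue)                 # significance of the b coefficient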

5.5 KEYWORDS
 The coefficient of determination: It tells the amount of variability in one variable explained
by variability in the other variable.
 Linear relationship: From the definition of correlation as the degree of linear relationship
between two variables, we can use the correlation coefficient to compute the equations for
the straight lines best describing the relationship between the variables.
 Regression equations: The equations (one to predict X and one to predict Y) are called
regression equations, and we can use them to predict a score on one variable if we know a
score on the other.
 The least squares line: The general form of the equation is Y = bX + a, where ‘b’ is the slope
of the line and ‘a’ is where the line intercepts the Y axis. The regression line is also called the
least squares line.

5.6 SUMMARY
Most parametric models are “regression models.” Regression models require data sets from past
performance in order that a regression formula can be derived. The regression formula is used to
predict or forecast future performance. Thus, to employ parametric models, they must first be calibrated with history. Calibration requires some standardization of the definition of deliverable items and item attributes. Once a calibrated model is in hand, it is fed with parameter data of the project being estimated in order to obtain estimates of the deliverables. Model parameters are also set or adjusted to account for similarity or dissimilarity between the project being estimated and the project history.
Usually, a methodology is incorporated into the model. Some models also allow for specification of
risk factors as well as the severity of those risks.
