CH 14 Handout
A regression model that includes two or more independent variables is called a multiple regression model. The population
model is given by the equation:
$$y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \dots + B_k x_k + \varepsilon$$
If all the x variables are raised to the power of one, the model is a first-order multiple regression model. If a
model has k independent variables and a sample size of n, then the degrees of freedom are df = n − k − 1.
Assumption 1: The mean of the probability distribution of ε is zero, that is, E(ε) = 0.
Assumption 2: The errors associated with different sets of values of independent variables are independent.
Furthermore, these errors are normally distributed and have a constant standard deviation, which is denoted by σε
Assumption 3: The independent variables are not linearly related. However, they can have a nonlinear relationship.
When independent variables are highly linearly correlated, it is referred to as multicollinearity.
Assumption 4: There is no linear association between the random error term ε and each independent variable xᵢ.
The population standard deviation of errors is denoted by the symbol σε. The sample standard deviation of errors is denoted
by sₑ and is calculated as
$$s_e = \sqrt{\frac{SSE}{n - k - 1}}, \quad \text{where } SSE = \sum (y - \hat{y})^2$$
This value can be generated using Stata. In the results, sₑ is usually termed Root MSE, where MSE means Mean Square
Error.
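The calculation behind Root MSE can be sketched in a few lines. The example below is written in Python rather than Stata so the arithmetic is easy to rerun. The function name and the five observed and fitted values are hypothetical (they are not from this handout and the fitted values are not an actual least-squares fit); only the formula comes from the text above.

```python
import math

def residual_std_error(y, y_hat, k):
    """Return (SSE, s_e) for observed values y, fitted values y_hat,
    and k independent variables, using s_e = sqrt(SSE / (n - k - 1))."""
    n = len(y)
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for positive degrees of freedom")
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sse, math.sqrt(sse / (n - k - 1))

# Hypothetical data: 5 observations, 1 independent variable.
y_obs = [3.0, 5.0, 7.0, 8.0, 12.0]
y_fit = [3.5, 4.5, 7.0, 9.0, 11.0]

sse, s_e = residual_std_error(y_obs, y_fit, k=1)
print(f"SSE = {sse:.2f}, s_e = {s_e:.4f}")  # SSE = 2.50, s_e = 0.9129
```

With n = 5 and k = 1 the divisor is n − k − 1 = 3, which is the degrees of freedom defined earlier.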
Similar to Simple Regression models, the coefficient of determination in a Multiple Regression Model, R², measures the
proportion of Total Sum of Squares (SST) that is explained by the regression model. It measures how well the
independent variables explain the dependent variable. Recall that SSR is the part of SST that is explained by the
regression model (SSE is the part of SST that is not explained by the regression model).
$$R^2 = \frac{SSR}{SST}$$
In multiple regression models, the value of R² never decreases when more independent variables are added, even if
the additional variables have no significant influence on the dependent variable. A rising R² can therefore
mislead us into thinking that the model is improving with each additional independent variable.
To overcome this shortcoming of R², we use an alternative coefficient of determination measure, the adjusted R² or R̄².
The formula is
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \quad \text{or} \quad \bar{R}^2 = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)}$$
One thing to note is that while R² can only take a value between 0 and 1, R̄² can be negative.
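A short numerical sketch may help here. It is written in Python (an assumption made for easy checking, not part of the original handout) and the function names are illustrative. The first call uses the figures from the math aptitude example in this handout (R-squared = 0.4803, n = 5, k = 1). The sums of squares in the second call and the weak model in the third call are hypothetical, chosen to show that the two forms of the formula agree and that the adjusted value can fall below zero.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared from R-squared, sample size n, and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def adjusted_r2_from_ss(sse, sst, n, k):
    """Equivalent form based on the sums of squares SSE and SST."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Math aptitude example from this handout: R-squared = 0.4803, n = 5, k = 1.
adj_aptitude = adjusted_r2(0.4803, n=5, k=1)

# Hypothetical sums of squares with the same R-squared (SSE / SST = 0.5197).
adj_from_ss = adjusted_r2_from_ss(sse=51.97, sst=100.0, n=5, k=1)

# Hypothetical weak model: low R-squared, two predictors, only five observations.
adj_weak = adjusted_r2(0.10, n=5, k=2)

print(f"{adj_aptitude:.4f} {adj_from_ss:.4f} {adj_weak:.4f}")  # 0.3071 0.3071 -0.8000
```

The drop from 0.4803 to about 0.31 reflects how much the adjustment penalizes a model fitted to only five observations.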
Last year five randomly selected students took a math aptitude test before they began their statistics course. Their
results are presented in the table below:
Q1. Present descriptive statistics of the dependent and the independent variables.
. sum statistics
. sum aptitude
Q2. What linear regression equation best predicts statistics performance, based on math aptitude scores?
------------------------------------------------------------------------------
statistics | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
aptitude | .6438356 .3866477 1.67 0.194 -.58665 1.874321
_cons | 26.78082 30.51824 0.88 0.445 -70.34184 123.9035
------------------------------------------------------------------------------
$$\hat{y} = 26.78 + 0.644 \times \text{aptitude}$$
Q3. Fit the linear regression line through a scatter plot of the observations.
[Figure: scatter plot of the observations with the fitted regression line; horizontal axis: aptitude, ticks from 60 to 100]
Q4. If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
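Q4 amounts to evaluating the fitted equation from Q2 at an aptitude score of 80. The sketch below does this in Python (chosen here only so the arithmetic is easy to rerun). The coefficients are copied from the Stata output above and the function name is illustrative.

```python
# Coefficients copied from the Stata regression output for Q2.
INTERCEPT = 26.78082
SLOPE = 0.6438356

def predict_statistics(aptitude):
    """Predicted statistics grade for a given math aptitude score."""
    return INTERCEPT + SLOPE * aptitude

predicted = predict_statistics(80)
print(f"Predicted statistics grade: {predicted:.2f}")  # 78.29
```

A student who scores 80 on the aptitude test is therefore predicted to earn a statistics grade of about 78.3.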
Q5. How well does the regression equation fit the data?
R-squared = 0.4803
48.03% of the variation in statistics grades is explained by the scores of the aptitude
test.
Example 2: Multiple Regression Model
A researcher wanted to find the effect of driving experience and the number of driving violations on auto insurance
premiums. A sample of 12 drivers was selected.
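y = the monthly auto insurance premium (in dollars) paid by a driver
x1 = the driving experience (in years) of a driver
x2 = the number of driving violations committed by a driver during the past three years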
------------------------------------------------------------------------------
premium | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
experience | -2.747272 .9770062 -2.81 0.020 -4.957414 -.5371308
violations | 16.10612 2.61332 6.16 0.000 10.19438 22.01786
_cons | 110.2761 14.61865 7.54 0.000 77.20639 143.3458
------------------------------------------------------------------------------
Using the software output, we are going to answer the following questions:
(a) Explain the meaning of the estimated regression coefficients.
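From the Stata output above, we can form the estimated regression equation as follows:
$$\hat{y} = 110.2761 - 2.7473 \times \text{experience} + 16.1061 \times \text{violations}$$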
The value of a = 110.2761 means that a person with 0 years of driving experience and 0 violations in the past three
years is predicted to pay a premium of $110.28 per month.
The value of b₁ = −2.747 indicates that with each additional year of experience, the premium decreases by $2.747,
keeping the number of violations in the past three years fixed.
The value of b₂ = 16.106 indicates that with each additional violation committed in the past three years, the premium
increases by $16.106, keeping the number of years of driving experience fixed.
By intuition, we expect the relationship between years of driving experience and monthly insurance premiums to be
negative. In our sample, the coefficient of driving experience in the regression model is b₁ = −2.747 < 0. To
determine the statistical significance of the coefficient, we need to evaluate whether there is enough evidence
to conclude that the relationship between monthly premiums and years of driving experience is negative in the
population. (Stata runs a two-tailed test by default.)
H₀: B₁ = 0
H₁: B₁ ≠ 0
The test statistic follows a t distribution with df = n − k − 1 = 12 − 2 − 1 = 9, and its value is −2.81. The p-value for
this two-tailed test is 0.020. If we select α = 5%, then p-value < α and we reject the null hypothesis, concluding that
there is sufficient evidence that the relationship between the two variables is different from zero in the population
(the negative t-statistic indicates that the relationship is negative). However, if we choose a 1% level of significance,
then p-value > α and we do not reject the null hypothesis. Hence, there is a moderately strong indication that the
relationship between the variables is different from 0 in the population: the variable is statistically significant at
the 5% level but not at the 1% level.
By intuition, we expect the relationship between the number of driving violations (in the past three years) and monthly
insurance premiums to be positive. In our sample, the coefficient of the number of violations (in the past three
years) in the regression model is b₂ = 16.106 > 0. To determine the statistical significance of the coefficient, we need to
evaluate whether there is enough evidence to conclude that the relationship between monthly premiums and the
number of violations in the past three years is positive in the population.
H₀: B₂ = 0
H₁: B₂ ≠ 0
The test statistic follows a t distribution with df = 9, and its value is 6.16. The p-value for this two-tailed test is
reported as 0.000, that is, less than 0.0005. Even if we select α = 1%, p-value < α and we reject the null hypothesis,
concluding that there is sufficient evidence that the relationship between the two variables is different from zero in
the population (the positive t-statistic indicates that the relationship is positive). Hence, there is overwhelming
evidence that the relationship is different from 0, and the variable, number of violations in the past three years, is
statistically significant in explaining variations in the dependent variable.
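The two tests above can be reproduced by hand from the coefficient and standard error columns of the output. The following Python sketch (Python is used here only for convenience) divides each coefficient by its standard error and compares the result with two-tailed critical values for 9 degrees of freedom. The critical values 2.262 and 3.250 are taken from a standard t table and are an assumption of this sketch, not part of the Stata output. The function names are illustrative.

```python
# (coefficient, standard error) copied from the Stata output above.
ESTIMATES = {
    "experience": (-2.747272, 0.9770062),
    "violations": (16.10612, 2.61332),
}

N, K = 12, 2
DF = N - K - 1  # 12 - 2 - 1 = 9 degrees of freedom

# Two-tailed critical values of t for df = 9 (assumed, from a standard t table).
T_CRIT = {0.05: 2.262, 0.01: 3.250}

def t_statistic(coef, se):
    """t statistic for H0: B = 0."""
    return coef / se

def reject_null(t_stat, alpha):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > T_CRIT[alpha]

results = {}
for name, (coef, se) in ESTIMATES.items():
    t = t_statistic(coef, se)
    results[name] = (t, reject_null(t, 0.05), reject_null(t, 0.01))
    print(f"{name}: t = {t:.2f}, reject at 5%: {results[name][1]}, "
          f"reject at 1%: {results[name][2]}")
```

The critical-value comparison leads to the same decisions as the p-value comparison in the text: experience is significant at 5% but not at 1%, while violations is significant at both levels.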
(c) What are the values of the coefficient of multiple determination, and the adjusted coefficient of multiple
determination?
(d) What is the predicted auto insurance premium paid per month by a driver with seven years of driving
experience and three driving violations committed in the past three years?
A driver with 7 years of driving experience and three violations in the past three years is predicted to pay a premium of
$139.37.
(e) What is the point estimate of the expected (or mean) auto insurance premium paid per month by all drivers
with 12 years of driving experience and 4 driving violations committed in the past three years?
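Parts (d) and (e) both come down to substituting values into the estimated regression equation. The Python sketch below (an illustration, with a hypothetical function name) uses the full-precision coefficients from the Stata output. With those coefficients, part (d) evaluates to about 139.36, one cent below the figure quoted in the answer to (d); a gap of that size is what rounding the coefficients before substituting would produce.

```python
# Full-precision coefficients copied from the Stata output above.
A = 110.2761     # _cons
B1 = -2.747272   # experience (years)
B2 = 16.10612    # violations (past three years)

def predict_premium(experience, violations):
    """Predicted monthly premium in dollars."""
    return A + B1 * experience + B2 * violations

premium_d = predict_premium(experience=7, violations=3)
premium_e = predict_premium(experience=12, violations=4)

print(f"(d) 7 years, 3 violations:  ${premium_d:.2f}")   # $139.36
print(f"(e) 12 years, 4 violations: ${premium_e:.2f}")   # $141.73
```

For part (e), the same number serves as the point estimate of the mean premium for all drivers with 12 years of experience and 4 violations, which comes to roughly $141.73 per month under these coefficients.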
(f) Determine a 95% confidence interval for B1 (the coefficient of experience) for the multiple regression of
auto insurance premium on driving experience and the number of driving violations.
According to the Stata output, the 95% confidence interval for the coefficient of experience runs from −4.957 to −0.537.
This means we are 95% confident that the interval from −4.957 to −0.537 contains the true population coefficient for experience.
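The interval reported by Stata can be checked by hand as b₁ ± t × (standard error). The Python sketch below assumes a critical value of t = 2.262 for 9 degrees of freedom and 0.025 in each tail, taken from a standard t table; that value and the function name are assumptions of this sketch, while the coefficient and standard error are copied from the output.

```python
B1 = -2.747272       # coefficient of experience (Stata output)
SE_B1 = 0.9770062    # its standard error (Stata output)
DF = 12 - 2 - 1      # n - k - 1 = 9
T_CRIT_95 = 2.262    # assumed t-table value for df = 9, 95% confidence

def confidence_interval(coef, se, t_crit):
    """Return (lower, upper) limits of coef +/- t_crit * se."""
    margin = t_crit * se
    return coef - margin, coef + margin

lower, upper = confidence_interval(B1, SE_B1, T_CRIT_95)
print(f"95% CI for B1: ({lower:.3f}, {upper:.3f})")  # (-4.957, -0.537)
```

Because the whole interval lies below zero, it agrees with the 5% two-tailed test, which rejected the hypothesis that the coefficient is zero.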