Data On Regression
Quick survey
Case Studies:
1. Office Trip Study
2. Does an Increasing Crime Rate Decrease House Prices?
3. Analysis of Car Mileage Data
Motivating Examples
Suppose we have data on sales of houses in some area.
For each house, we have complete information about its size, the number of
bedrooms, bathrooms, total rooms, the size of the lot, the corresponding
property tax, etc., and also the price at which the house was eventually sold.
Can we use this data to predict the selling price of a house currently on the
market?
The first step is to postulate a model of how the various features of a house
determine its selling price.
A linear model would have the following form:
selling price = β0 + β1 (sq. ft.) + β2 (no. bedrooms) + β3 (no. baths)
+ β4 (no. acres) + β5 (taxes) + error
In this expression, β1 represents the increase in selling price for each additional
square foot of area: it is the marginal cost of additional area.
β2 and β3 are the marginal costs of additional bedrooms and bathrooms, and so on.
The intercept β0 could in theory be thought of as the price of a house for which all
the variables specified are zero; of course, no such house could exist, but including
β0 gives us more flexibility in picking a model.
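As a sketch of how such a model is fitted, here is a least-squares fit on made-up house data (all column values and prices are invented for illustration; they are not the data set described above):

```python
import numpy as np

# Hypothetical house data: each row is (sq. ft., bedrooms, baths, acres, taxes).
X = np.array([
    [1500, 3, 2, 0.25, 3200],
    [2100, 4, 2, 0.30, 4100],
    [1200, 2, 1, 0.20, 2500],
    [2600, 4, 3, 0.50, 5200],
    [1800, 3, 2, 0.35, 3600],
    [3000, 5, 3, 0.60, 6100],
    [1700, 3, 2, 0.28, 3400],
    [2300, 4, 2, 0.40, 4600],
], dtype=float)
y = np.array([210.0, 279.0, 165.0, 345.0, 240.0, 398.0, 225.0, 300.0])  # price, $1000s

# Append a column of ones so beta_0 (the intercept) is estimated too.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
print(beta)          # [beta_0, beta_1, ..., beta_5]
print(y - fitted)    # residuals (the "error" term)
```

With an intercept in the model, the residuals always sum to zero; that is a property of least squares, not of these particular numbers.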
Sales of Houses
The error reflects the fact that two houses with exactly the same characteristics
need not sell for exactly the same price.
There is always some variability left over, even after we specify the values of a
large number of variables.
This variability is captured by an error term, which we will treat as a random
variable.
Levels of advertising
Determine appropriate levels of advertising and promotion for a
particular market segment.
Consider the problem of managing sales of beer at large college
campuses.
Sales over, say, one semester might be influenced by ads in the college paper,
ads on the campus radio station, sponsorship of sports-related events,
sponsorship of contests, etc.
Purpose of Modeling
Prediction: The fitted regression line is a prediction rule!
The regression equation is Price = 38.9 + 35.4 Size
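Used as a prediction rule, the fitted equation is simply evaluated at a given size (the units are an assumption here: Size in 1,000 sq. ft., Price in $1,000s):

```python
def predict_price(size):
    """Prediction rule from the fitted regression: Price = 38.9 + 35.4 * Size."""
    return 38.9 + 35.4 * size

print(predict_price(1.0))  # predicted price for a 1,000 sq. ft. house
```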
Forecast Accuracy
Our forecast is not going to be right on the money every time and
we need to develop the notion of forecast accuracy.
Two things we want:
What kind of Y can we expect for a given value of X?
How sure are we about this forecast?
How different could Y be from what we expect?
Prediction Interval
Prediction Interval: range of possible Y values that are likely given
X
What influences the length of the prediction interval?
Intuitively, the answer must lie in observed variation of the data points about
the prediction rule or fitted line.
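For simple regression, the usual 95% prediction interval at a point x* is ŷ ± t·s·√(1 + 1/n + (x* − x̄)²/Sxx), where s is the residual standard error; its width grows with the scatter s and with the distance of x* from x̄. A sketch on made-up data (the numbers, and the t-value for 6 degrees of freedom, are illustrative assumptions):

```python
import math

# Made-up (x, y) pairs, roughly y = 2x plus noise
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
        (5, 10.3), (6, 11.7), (7, 14.2), (8, 15.9)]
x = [p[0] for p in data]
y = [p[1] for p in data]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                       # slope
b0 = ybar - b1 * xbar                # intercept
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))  # residual std error

def prediction_interval(xstar, t=2.447):  # t_{0.975, 6} ~= 2.447
    half = t * s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / sxx)
    yhat = b0 + b1 * xstar
    return yhat - half, yhat + half

print(prediction_interval(xbar))      # narrowest, at the mean of x
print(prediction_interval(xbar + 4))  # wider, far from the mean
```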
Linear Regression
A regression model specifies a relation between a dependent variable
Y and certain independent variables X1, …, XK.
A simple linear regression refers to a model with just one
independent variable, K = 1.
Y = β0 + β1X + ε
Independent variables are also called explanatory variables; in the equation
above, we say that X explains part of the variability in the dependent variable
Y.
Example
The following is a list of salary levels ($1000s) for 20 managers and
the sizes of the budgets ($100,000s) they manage: (59.0,3.5),
(67.4,5.0), (50.4,2.5), (83.2,6.0), (105.6, 7.5), (86.0,4.5), (74.4,6.0),
(52.2,4.0), (59.0,3.5), (67.4,5.0), (50.4,2.5), (83.2,6.0), (105.6,7.5),
(86.0,4.5), (74.4,6.0), (52.2, 4.0)
Best Line
Want to fit a straight line to this data.
The slope of this line gives the marginal increase in salary with respect to
increase in budget responsibility.
Least Squares
For the budget level Xi, the least squares line predicts the salary
level
SALARY = 31.9 + 7.73 BUDGET or PYi = 31.9 + 7.73Xi
Unless the line happens to go through the point (Xi, Yi), the predicted value
PYi will generally be different from the observed value Yi.
Each additional $100,000 of budget responsibility translates to an expected
additional salary of $7,730.
For the average salary corresponding to a budget of 6.0, we get a salary of
31.9 + 7.73(6.0) = 78.28.
The difference between the two is the error or residual ei = Yi - PYi.
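As a quick check of this definition, take the manager with budget 6.0 and salary 83.2 from the list above:

```python
b0, b1 = 31.9, 7.73        # least squares estimates quoted above
xi, yi = 6.0, 83.2         # one (budget, salary) observation
pyi = b0 + b1 * xi         # predicted salary: 78.28
ei = yi - pyi              # residual: 4.92
print(pyi, ei)
```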
Questions
Q1: Why is the least squares criterion the correct principle to
follow?
Q2: How do we evaluate and use the regression line?
Assumptions Underlying Least Squares
Q1: What is the angle between (1, …, 1) and (ε1, …, εn), and between
(ε1, …, εn) and (X1, …, Xn)?
Discussion on Assumptions
(The standard assumptions: the errors ei are independent of the Xi, have
mean zero, have a common variance, and are uncorrelated with one another.)
The first two are very reasonable: if the ei's are indeed
random errors, then there is no reason to expect them to
depend on the data or to have a nonzero mean.
The second two assumptions are less automatic.
Do we necessarily believe that the variability in salary levels
among managers with large budgets is the same as the variability
among managers with small budgets? Is the variability in price
really the same among large houses and small houses?
These considerations suggest that the third assumption may not
be valid if we look at too broad a range of data values.
Correlation of errors becomes an issue when we use regression
to do forecasting. If we use data from several past periods to
forecast future results, we may introduce correlation by
overlapping several periods and this would violate the fourth
assumption.
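One simple diagnostic, in the spirit of the Durbin-Watson test, is the lag-1 autocorrelation of the residuals, which should be near zero if the errors are uncorrelated. A sketch on simulated independent errors (the data are made up):

```python
import random

random.seed(0)
resid = [random.gauss(0, 1) for _ in range(200)]  # stand-in residuals

mean = sum(resid) / len(resid)
num = sum((resid[i] - mean) * (resid[i - 1] - mean) for i in range(1, len(resid)))
den = sum((e - mean) ** 2 for e in resid)
lag1 = num / den
print(lag1)  # near 0 for uncorrelated errors
```

If instead the residuals came from overlapping forecast periods, lag1 would drift well away from zero, flagging the violation.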
Linear Regression
We assume that the outcome we are predicting depends
linearly on the information used to make the prediction.
Linear dependence means constant rate of increase of one
variable with respect to another (as opposed to, e.g., diminishing
returns).
E(Y|X) is the population average value of Y for any given
value of X. For example, the average house price for a house
size = 1,000 sq ft.
Reduction of Variability
The Error SS
This reflects differences in salary levels that cannot be attributed
to differences in budget responsibilities.
The explained and unexplained variation sum to the Total SS.
How much of the original variability has been explained?
The answer is given by the ratio of the explained variation to the total
variation, which is
R2 =SSR/SST=6884.7/9538.8= 72.2%
This quantity is the coefficient of determination, though everybody calls it
R-squared.
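The decomposition can be checked numerically with the sums of squares quoted above:

```python
ssr = 6884.7          # explained (regression) sum of squares, from the example
sst = 9538.8          # total sum of squares
sse = sst - ssr       # unexplained (error) sum of squares
r2 = ssr / sst
print(r2, 1 - sse / sst)  # both equal R-squared, about 0.722
```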
Normal Distribution
The following figure depicts two different normal distributions,
both with mean 0: one with σ = 0.5 and one with σ = 2.
one σ: 68%, two σ: 95.44%, three σ: 99.7%
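These coverages follow from the normal CDF: P(|Z| ≤ k) = erf(k/√2) for a standard normal Z, which a few lines of code can confirm:

```python
import math

def coverage(k):
    """P(|Z| <= k) for Z ~ N(0, 1)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * coverage(k), 2))  # ~68.27, ~95.45, ~99.73
```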
[Data table: house sale prices ($1,000s) with columns LOTSZ, BEDRM, BATH, ROOMS, AGE, GARG, EMEADW, LVTTWN; prices range from 159.90 to 215.00 and ages from 10 to 39 years.]
Suggestion
If X has no forecasting power, then the marginal and conditionals
will be the same.
If X has some forecasting information or power, then
the conditional means will differ from the marginal (overall) mean, and
the conditional standard deviations will be less than the marginal standard deviation.
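A small simulation (with invented numbers) illustrates this: when Y truly depends on X, conditioning on a narrow band of X values shrinks the spread of Y:

```python
import random

random.seed(1)
pairs = []
for _ in range(2000):
    x = random.uniform(0, 10)
    y = 5 + 2 * x + random.gauss(0, 1)  # Y really does depend on X
    pairs.append((x, y))

def sd(vals):
    m = sum(vals) / len(vals)
    return (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5

marginal = sd([y for _, y in pairs])
# condition on X falling in a narrow band around 5
band = [y for x, y in pairs if 4.5 <= x <= 5.5]
conditional = sd(band)
print(marginal, conditional)  # the conditional spread is much smaller
```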
Multiple Regression
A regression model specifies a relation between a dependent
variable Y and certain independent variables X1, …, XK.
(Here independence is not in the sense of random variables; rather, it
means that the value of Y depends on, or is determined by, the Xi
variables.)
Sir Francis Galton (1822-1911)
Regression to the mean
For example, if a new office building of 350,000 sq. ft. were being planned,
planners, zoning administrators, etc., would need to know how much
additional traffic to expect after the building was completed and occupied.
Data PM: Similar data for PM hours was also measured, and some
other information about the size, occupancy, location, etc., of the
building was also recorded.
Residual Plot
How do you know that a correct model is being fitted?
Prediction: For a 350,000 sq. ft. bldg, it generates -7.939 + 1.697(350) =
586.0 trips. The 95% confidence interval for this prediction is 535.8 to 636.1.
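The point forecast is just the fitted line evaluated at an occupancy of 350 (thousand sq. ft.):

```python
b0, b1 = -7.939, 1.697   # fitted coefficients from the trips model
occup = 350              # 350,000 sq. ft., in thousands
trips = b0 + b1 * occup  # about 586 trips
print(trips)
```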
Noticeable
heteroscedasticity
by looking at scatter plot.
undesirable histogram of
residuals
Transformation Attempt #1
Since the (vertical) spread of the data increases as the y-value
increases, a log transformation may cure this heteroscedasticity.
Scatter plot after transforming y to Log(y)
Residual Plot
Linear Fit: Log Trips = 1.958 + 0.00208 Occup. Sq. Ft. (1000)
Summary of Fit: R2 = 0.761
Standard Assumptions
After log-log transformation: Linearity, homoscedasticity, and
normality of residuals are all quite OK.
Prediction
If a zoning board were in the process of considering the zoning application from a proposed 350,000 sq. ft. office bldg, what is their
primary concern?
Proposal I: Find the 95% confidence limits for 350,000 sq. ft. office buildings.
                   Lower 95% Pred    Upper 95% Pred
Log(Trips)             2.6262            2.7399
Number of Trips        422.9             549.4
(e.g., 422.9 = 10^2.6262)
Compare this to the confidence interval of 535.8 to 636.1 from the initial model.
These CIs are very different. The latter one, based on the log-log analysis, is the valid
one, since the analysis leading to it is valid.
Proposal II: Consider 95% Individual Prediction intervals - that is, in 95%
intervals for the actual traffic that might accompany the particular proposed
building. These are
                   Lower 95% Indiv   Upper 95% Indiv
Log(Trips)             2.3901            2.9760
Number of Trips        245.5             946.2
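Both sets of trip counts are obtained by back-transforming the log10 limits, i.e. computing 10 to the power of each limit:

```python
pred = (2.6262, 2.7399)    # 95% limits for the mean, log10 scale
indiv = (2.3901, 2.9760)   # 95% limits for an individual building
back = lambda t: 10 ** t
print([round(back(v), 1) for v in pred])   # about 422.9 and 549.4 trips
print([round(back(v), 1) for v in indiv])  # about 245.5 and 946.2 trips
```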
Linear Fit
Hs Prc ($10,000) = 22.53 - 0.229 Crime Rate
Analysis of Variance
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept      22.52      1.649      13.73     <.0001
Crime Rate    -0.229      0.0499     -4.66     <.0001
Analysis of Variance
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept      19.91      1.193      16.69     <.0001
Crime Rate    -0.184      0.0351     -5.23     <.0001
Residuals Hs Prc ($10,000); Analysis 1
Residuals Hs Prc ($10,000); Analysis 2
Another Fit
Fit of MPG City By Horsepower
MPG City = 32.057274 - 0.0843973 Horsepower
MPG City = 76.485987 - 11.511504 Log(Horsepower)
R2 = 0.705884
RMSE = 4.739567
Transformation
The first analysis looks nicely linear, but there is some
evident heteroscedasticity.
The second analysis seems to be slightly curved;
maybe we could try using log(HP) as a predictor.
It seems reasonable to also try transforming to Log(Y)
= Log10 (MPG).
Since MPG was nearly linear in Wt, it seems more reasonable
to try Log10 (Wt) as a predictor here, and similarly for Log10
(HP).
Log10(MPG) = 4.2147744 - 0.8388428 Log10(Wt)
Transformation
Linearity looks fine on these plots.
There may be a hint of heteroscedasticity - but not
close to enough to worry about.
Again, there are no leverage points or outliers that
seem to need special care.
Log10(MPG) = 2.3941295 - 0.5155074 Log10(HP)
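On the original scale this log-log fit is a power law, MPG = 10^2.3941295 × HP^(−0.5155074); a short function can evaluate it (the 100-hp input is just an example):

```python
import math

a, b = 2.3941295, -0.5155074  # intercept and slope on the log10-log10 scale

def mpg_from_hp(hp):
    """Back-transformed prediction from Log10(MPG) = a + b * Log10(HP)."""
    return 10 ** (a + b * math.log10(hp))

print(mpg_from_hp(100))  # predicted city MPG for a 100-hp car, about 23
```

The negative slope means predicted MPG falls steadily as horsepower grows, but with diminishing effect on the original scale.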
[Correlation table from the scatterplot matrix: pairwise correlations among Log(Displ), Log_10(Wt), Log_10(Cargo), Log_10(Price), cylinder, and length.]
Scatterplot Matrix
Plots: Actual by Predicted; Residual by Predicted