Linear Regression
Sharyn O'Halloran
U9611, Spring 2005
Data: (Yi, Xi) for i = 1, ..., n. Interest is in the probability distribution of Y as a function of X.

Linear regression model: the mean of Y is a straight-line function of X, plus an error term or residual. The goal is to find the best-fit line, the one that minimizes the sum of the squared error terms.
[Figure: scatterplot of pH (5.5 to 6.5) with fitted line Ŷ = 6.98 − .73X; the error term is the vertical distance from a data point to the fitted line]
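The fitted line Ŷ = 6.98 − .73X can be used to compute fitted values and residuals directly. A minimal sketch in Python (the observed pH value 6.2 here is a made-up illustration, not from the slides):

```python
# Fitted line from the pH example: Y-hat = 6.98 - 0.73 * X
def fitted(x):
    return 6.98 - 0.73 * x

# Residual = observed - fitted; the observed value 6.2 is hypothetical
x_obs, y_obs = 1.0, 6.2
resid = y_obs - fitted(x_obs)
print(round(fitted(x_obs), 2))  # 6.25
print(round(resid, 2))          # -0.05
```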
Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables:

  μ{Y | X}

Regression model: an ideal formula to approximate the regression.

Simple linear regression model:

  μ{Y | X} = β0 + β1X

μ{Y | X} is read "the mean of Y given X," or "the regression of Y on X." The intercept β0 and the slope β1 are unknown parameters.
Regression Terminology
Y: dependent variable, explained variable, response variable
X: independent variable, explanatory variable, control variable
The fitted line is β̂0 + β̂1X. Choose β̂0 and β̂1 as estimates of the unknown parameters β0 and β1.
Regression Terminology
Fitted value for obs. i is its estimated mean: fitᵢ = μ̂{Y | Xᵢ} = β̂0 + β̂1Xᵢ

Residual for obs. i: resᵢ = Yᵢ − fitᵢ, i.e., eᵢ = Yᵢ − Ŷᵢ
The least squares estimation method finds the estimates that minimize the sum of squared residuals.
The least-squares procedure obtains estimates of the linear equation coefficients β0 and β1 in the model

  ŷᵢ = β̂0 + β̂1xᵢ

by minimizing the sum of squared errors:

  SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (β̂0 + β̂1xᵢ))²
The resulting estimators are:

  β̂1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²  (equivalently, β̂1 = rₓy · sY / sX)

  β̂0 = ȳ − β̂1x̄
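These closed-form estimators are easy to compute directly. A minimal sketch in Python (rather than Stata, which the slides use), with made-up example data:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1-hat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0-hat = ybar - beta1-hat * xbar
b0 = y.mean() - b1 * x.mean()

# Equivalent form: beta1-hat = r_xy * (sY / sX)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * y.std(ddof=1) / x.std(ddof=1)
assert np.isclose(b1, b1_alt)
```

For these data β̂1 ≈ 1.96 and β̂0 ≈ 0.14; the two formulas for β̂1 agree, as they must.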
[Figure: Relation Between Yield (bushels/acre) and Fertilizer (lb/acre), 0–800 lb/acre, with trend line]

Note that the regression line always goes through the mean (x̄, ȳ). Think of the regression line as the expected value of Y for a given value of X: for any value of the independent variable there is a single most likely value of the dependent variable.
Degrees of freedom: (n − 2)

Estimate of the variance σ² of {Y|X}: σ̂² = (sum of squared residuals) / (n − 2)

Confidence intervals
Inference Tools
Estimated mean at X0: μ̂{Y | X0} = β̂0 + β̂1X0

Standard error of the estimated mean:

  SE[μ̂{Y | X0}] = σ̂ √( 1/n + (X0 − x̄)² / ((n − 1)sₓ²) )

Conduct the t-test and confidence interval in the usual way (df = n − 2).
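A numeric sketch of these inference tools in Python, with made-up data; the t critical value 3.182 is the tabled t(.975, 3) for df = n − 2 = 3:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# sigma-hat^2 = (sum of squared residuals) / (n - 2)
sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

# SE of the estimated mean at X0
x0 = 3.0
se_mean = sigma * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))

# 95% CI with df = n - 2 = 3; 3.182 is the t-table value t(.975, 3)
t_crit = 3.182
lo, hi = b0 + b1 * x0 - t_crit * se_mean, b0 + b1 * x0 + t_crit * se_mean
```

At X0 = x̄ the (X0 − x̄)² term vanishes, so this is where the confidence band is narrowest.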
In Stata, the lfitci command automatically calculates and graphs the confidence bands.
Prediction
The predicted value at X0 is the estimated mean: Pred(Y | X0) = μ̂{Y | X0}

Its standard error includes an extra σ̂² term for the variability of a new observation:

  SE[Pred(Y | X0)] = σ̂ √( 1 + 1/n + (X0 − x̄)² / ((n − 1)sₓ²) )
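The prediction standard error adds the variance of a new observation, σ̂², to the variance of the estimated mean, so it is always larger. A quick check in Python with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.0
core = 1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
se_mean = sigma * np.sqrt(core)      # SE for the estimated mean
se_pred = sigma * np.sqrt(1 + core)  # SE for a new observation: extra sigma^2 term

# The prediction SE is always larger: se_pred^2 = se_mean^2 + sigma^2
assert se_pred > se_mean
```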
After any regression analysis we can draw a residual-versus-fitted plot just by typing rvfplot.
Residuals (e)
In Stata, predict e, resid creates a new variable e containing the residuals.
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with a known X value.
[Figures: Distance vs. VELOCITY scatterplots with fitted line, confidence bands, and prediction bands]
The width of the confidence interval shrinks toward zero as n grows large; this is not true of the prediction interval.
1. Model: μ{Y | X} = β0 + β1X, with constant variance var{Y | X} = σ².

2. Least squares: choose estimators β̂0 and β̂1 to minimize the sum of squared residuals:

  β̂1 = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

3. Properties of estimators:

  SE(β̂1) = σ̂ / √((n − 1)sₓ²)
Linearity and constant variance: μ{Y|X} = β0 + β1X, var{Y|X} = σ²
Normality
Independence
Examples of Violations
Non-Linearity
The true relation between the independent and dependent variables may not be linear.
For example, consider campaign fundraising and the probability of winning an election.
The probability of winning, P(w), increases with each additional dollar spent and then levels off after $50,000.

[Figure: P(w) vs. spending, leveling off at $50,000]
The homoskedasticity assumption implies that, on average, we do not expect larger errors in some cases than in others.
Of course, due to the luck of the draw, some errors will turn out to be larger than others. But homoskedasticity is violated only when this happens in a predictable manner.
Example:
People with higher incomes have more choices about what to buy. We would expect that their consumption of certain goods is more variable than that of families with lower incomes.
[Figure: consumption vs. income scatterplot with fitted line; residuals such as e6 = Y6 − (a + bX6) and e9 = Y9 − (a + bX9) grow with income]

As income increases, so do the errors (the vertical distances from the predicted line).
If constant variance is violated, LS estimates are still unbiased but SEs, tests, Confidence Intervals, and Prediction Intervals are incorrect
Violation of Normality
Non-Normality
[Figure: frequency distribution of nicotine use]
Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day.
Consequence of non-Normality
If normality is violated,
- LS estimates are still unbiased
- tests and CIs are quite robust
- PIs are not
Of all the assumptions, this is the one that we need to be least worried about violating. Why?
Violation of Independence
[Figure: residuals of GNP and consumption over time]
Non-Independence
The independence assumption means that the error terms of any two observations do not influence one another.
Highly Correlated
The most common violation occurs with data collected over time, i.e., time-series data.
Example: high tariff rates in one period are often associated with very high tariff rates in the next period. Example: Nominal GNP and Consumption
Consequence of non-independence
If independence is violated:
- LS estimates are still unbiased
- everything else can be misleading
[Figure: log height vs. log weight for mice, by litter]

Note that mice from litters 4 and 5 have higher weight and height.
The constant variance assumption is important. Normality is not too important for confidence intervals and p-values, but is important for prediction intervals. Long-tailed distributions and/or outliers can heavily influence the results. Non-independence problems: serial correlation (Ch. 15) and cluster effects (we deal with this in Ch. 9-14).
Scatterplot of Y vs. X (see Display 8.6, p. 213)*
Scatterplot of residuals vs. fitted values*
It is sometimes useful for checking whether the distribution is symmetric or normal (i.e., for PIs) (Section 8.5).
Scatterplot of Y vs. X
In Stata, the option yline(0) adds a horizontal reference line at zero to the residual plot.
Quantile normal plots (p. 224) compare quantiles of a variable's distribution with quantiles of a normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of the distribution.
For simple regression this is about the same as plotting residuals vs. X. Look for outliers, curvature, or increasing spread (a funnel or horn shape); then take appropriate action.
Statistical illustrations

Goal: to describe the distribution of breakdown time of an insulating fluid as a function of the voltage applied to it.

- Recognizing the need for a log transformation of the response from the scatterplot and the residual plot
- Checking the simple linear regression fit with a lack-of-fit F-test
- Stata commands (follow)
Simple regression
The residuals-vs-fitted-values plot shows increasing spread with increasing fitted values, suggesting a log transformation of the response: log(Y) = log(time).
μ{log(Y) | X} = β0 + β1X

(if the distribution of log(Y) given X is symmetric)

  Median{Y | X} = e^(β0 + β1X)

  Median{Y | X = x+1} / Median{Y | X = x} = e^(β0 + β1(x+1)) / e^(β0 + β1x) = e^β1

so Median{Y | X = x+1} = e^β1 · Median{Y | X = x}
Interpretation of Y logged
As X increases by 1, the median of Y changes by the multiplicative factor e^β1. Or, better: if β1 > 0, as X increases by 1, the median of Y increases by (e^β1 − 1) · 100%.
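The percent-change reading of e^β1 is simple arithmetic. A sketch in Python with a hypothetical slope (β1 = 0.25 is not from the slides):

```python
import math

# Hypothetical slope from a regression of log(Y) on X
beta1 = 0.25
# As X increases by 1, the median of Y increases by (e^beta1 - 1) * 100%
pct = (math.exp(beta1) - 1) * 100
print(round(pct, 1))  # 28.4: each unit of X raises the median of Y by about 28.4%
```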
If β1 < 0, the median of Y decreases by (1 − e^β1) · 100%; e.g., for β1 = −0.5, 1 − e^(−0.5) ≈ .4, a decrease of about 40%.
[Figure: log(TIME) vs. VOLTAGE (25–40) scatterplot with fitted line]
Associated with each two-fold increase (i.e., doubling) of X is a β1·log(2) change in the mean of Y.
For each doubling of time after slaughter (between 0 and 8 hours), the mean pH decreases by about .5 (with β̂1 = −0.73 on log(time), β̂1·log(2) ≈ −0.5).
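The −0.5 per doubling follows from the slope −0.73 of the fitted line Ŷ = 6.98 − .73X shown earlier, where X is log(time). A quick arithmetic check in Python:

```python
import math

# Slope of pH on log(time) from the fitted line above
beta1 = -0.73
# Change in mean pH per doubling of time: beta1 * log(2)
change_per_doubling = beta1 * math.log(2)
print(round(change_per_doubling, 2))  # -0.51: mean pH drops about 0.5 per doubling
```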
[Figure: pH vs. ltime with fitted values]
If Y is also logged, doubling X multiplies the median of Y by e^(β1·log(2)): an increase of (e^(β1·log(2)) − 1) · 100% if β1 > 0, or a decrease of (1 − e^(β1·log(2))) · 100% if β1 < 0.
μ{log(Y) | log(X)} = β0 + β1·log(X)
Y and X logged
Example: Log-Log
In order to graph the log-log plot we need to generate two new variables (the natural logarithms of Y and X).