Linear Regression
Sharyn O'Halloran
U9611, Spring 2005
Data: (Yi, Xi) for i = 1, ..., n. Interest is in the probability distribution of Y as a function of X.

Linear regression model: the mean of Y is a straight-line function of X, plus an error term or residual. The goal is to find the best-fit line, the one that minimizes the sum of the squared error terms.
[Figure: scatterplot of pH (5.5 to 6.5) with fitted line Ŷ = 6.98 − .73X; the error term is the vertical distance from a data point to the fitted line]
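The fitted line Ŷ = 6.98 − .73X can be used to compute fitted values and residuals directly. A minimal sketch in Python (the observed pH value 6.2 here is a made-up illustration, not from the slides):

```python
# Fitted line from the pH example: Y-hat = 6.98 - 0.73 * X
def fitted(x):
    return 6.98 - 0.73 * x

# Residual = observed - fitted; the observed value 6.2 is hypothetical
x_obs, y_obs = 1.0, 6.2
resid = y_obs - fitted(x_obs)
print(round(fitted(x_obs), 2))  # 6.25
print(round(resid, 2))          # -0.05
```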
Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables:

  μ{Y | X}

Regression model: an ideal formula to approximate the regression.

Simple linear regression model:

  μ{Y | X} = β0 + β1X

μ{Y | X} is read "the mean of Y given X," or "the regression of Y on X." The intercept β0 and the slope β1 are unknown parameters.
Regression Terminology
Y: dependent variable, explained variable, response variable
X: independent variable, explanatory variable, control variable
The fitted line is β̂0 + β̂1X. Choose β̂0 and β̂1 as estimates of the unknown parameters β0 and β1.
Regression Terminology
Fitted value for obs. i is its estimated mean: fitᵢ = μ̂{Y | Xᵢ} = β̂0 + β̂1Xᵢ

Residual for obs. i: resᵢ = Yᵢ − fitᵢ, i.e., eᵢ = Yᵢ − Ŷᵢ
The least squares estimation method finds the estimates that minimize the sum of squared residuals.
The least-squares procedure obtains estimates of the linear equation coefficients β0 and β1 in the model

  ŷᵢ = β̂0 + β̂1xᵢ

by minimizing the sum of squared errors:

  SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (β̂0 + β̂1xᵢ))²
The resulting estimators are:

  β̂1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²  (equivalently, β̂1 = rₓy · sY / sX)

  β̂0 = ȳ − β̂1x̄
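These closed-form estimators are easy to compute directly. A minimal sketch in Python (rather than Stata, which the slides use), with made-up example data:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1-hat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0-hat = ybar - beta1-hat * xbar
b0 = y.mean() - b1 * x.mean()

# Equivalent form: beta1-hat = r_xy * (sY / sX)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * y.std(ddof=1) / x.std(ddof=1)
assert np.isclose(b1, b1_alt)
```

For these data β̂1 ≈ 1.96 and β̂0 ≈ 0.14; the two formulas for β̂1 agree, as they must.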
[Figure: Relation Between Yield (bushels/acre) and Fertilizer (lb/acre), 0–800 lb/acre, with trend line]

Note that the regression line always goes through the mean (x̄, ȳ). Think of the regression line as the expected value of Y for a given value of X: for any value of the independent variable there is a single most likely value of the dependent variable.
Degrees of freedom: (n − 2)

Estimate of the variance σ² of {Y|X}: σ̂² = (sum of squared residuals) / (n − 2)

Confidence intervals
Inference Tools
Estimated mean at X0: μ̂{Y | X0} = β̂0 + β̂1X0

Standard error of the estimated mean:

  SE[μ̂{Y | X0}] = σ̂ √( 1/n + (X0 − x̄)² / ((n − 1)sₓ²) )

Conduct the t-test and confidence interval in the usual way (df = n − 2).
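A numeric sketch of these inference tools in Python, with made-up data; the t critical value 3.182 is the tabled t(.975, 3) for df = n − 2 = 3:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# sigma-hat^2 = (sum of squared residuals) / (n - 2)
sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

# SE of the estimated mean at X0
x0 = 3.0
se_mean = sigma * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))

# 95% CI with df = n - 2 = 3; 3.182 is the t-table value t(.975, 3)
t_crit = 3.182
lo, hi = b0 + b1 * x0 - t_crit * se_mean, b0 + b1 * x0 + t_crit * se_mean
```

At X0 = x̄ the (X0 − x̄)² term vanishes, so this is where the confidence band is narrowest.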
In Stata, the lfitci command automatically calculates and graphs the confidence bands.
Prediction
The predicted value at X0 is the estimated mean: Pred(Y | X0) = μ̂{Y | X0}

Its standard error includes an extra σ̂² term for the variability of a new observation:

  SE[Pred(Y | X0)] = σ̂ √( 1 + 1/n + (X0 − x̄)² / ((n − 1)sₓ²) )
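The prediction standard error adds the variance of a new observation, σ̂², to the variance of the estimated mean, so it is always larger. A quick check in Python with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.0
core = 1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
se_mean = sigma * np.sqrt(core)      # SE for the estimated mean
se_pred = sigma * np.sqrt(1 + core)  # SE for a new observation: extra sigma^2 term

# The prediction SE is always larger: se_pred^2 = se_mean^2 + sigma^2
assert se_pred > se_mean
```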
After any regression analysis we can draw a residual-versus-fitted plot just by typing rvfplot.
Residuals (e)
In Stata, predict e, resid creates a new variable e containing the residuals.
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with a known X value.
[Figures: Distance vs. VELOCITY scatterplots with fitted line, confidence bands, and prediction bands]
The width of the confidence interval shrinks toward zero as n grows large; this is not true of the prediction interval.
1. Model: μ{Y | X} = β0 + β1X, with constant variance var{Y | X} = σ².

2. Least squares: choose estimators β̂0 and β̂1 to minimize the sum of squared residuals:

  β̂1 = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

3. Properties of estimators:

  SE(β̂1) = σ̂ / √((n − 1)sₓ²)
Linearity and constant variance: μ{Y|X} = β0 + β1X, var{Y|X} = σ²
Normality
Independence
Examples of Violations
Non-Linearity
The true relation between the independent and dependent variables may not be linear.
For example, consider campaign fundraising and the probability of winning an election.
The probability of winning, P(w), increases with each additional dollar spent and then levels off after $50,000.

[Figure: P(w) vs. spending, leveling off at $50,000]
The homoskedasticity assumption implies that, on average, we do not expect larger errors in some cases than in others.
Of course, due to the luck of the draw, some errors will turn out to be larger than others. But homoskedasticity is violated only when this happens in a predictable manner.
Example:
People with higher incomes have more choices about what to buy. We would expect that their consumption of certain goods is more variable than that of families with lower incomes.
[Figure: consumption vs. income scatterplot with fitted line; residuals such as e6 = Y6 − (a + bX6) and e9 = Y9 − (a + bX9) grow with income]

As income increases, so do the errors (the vertical distances from the predicted line).
If constant variance is violated, LS estimates are still unbiased but SEs, tests, Confidence Intervals, and Prediction Intervals are incorrect
Violation of Normality
Non-Normality
[Figure: frequency distribution of nicotine use]
Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day.
Consequence of non-Normality
If normality is violated,
- LS estimates are still unbiased
- tests and CIs are quite robust
- PIs are not
Of all the assumptions, this is the one that we need to be least worried about violating. Why?
Violation of Independence
[Figure: residuals of GNP and consumption over time]
Non-Independence
The independence assumption means that the error terms of any two observations do not influence one another.
Highly Correlated
The most common violation occurs with data collected over time, i.e., time-series data.
Example: high tariff rates in one period are often associated with very high tariff rates in the next period. Example: Nominal GNP and Consumption
Consequence of non-independence
If independence is violated:
- LS estimates are still unbiased
- everything else can be misleading
[Figure: log height vs. log weight for mice, by litter]

Note that mice from litters 4 and 5 have higher weight and height.
The constant variance assumption is important. Normality is not too important for confidence intervals and p-values, but is important for prediction intervals. Long-tailed distributions and/or outliers can heavily influence the results. Non-independence problems: serial correlation (Ch. 15) and cluster effects (we deal with this in Ch. 9-14).
Scatterplot of Y vs. X (see Display 8.6, p. 213)*
Scatterplot of residuals vs. fitted values*
It is sometimes useful for checking whether the distribution is symmetric or normal (i.e., for PIs) (Section 8.5).
Scatterplot of Y vs. X
In Stata, the option yline(0) adds a horizontal reference line at zero to the residual plot.
Quantile normal plots (p. 224) compare quantiles of a variable's distribution with quantiles of a normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of the distribution.
For simple regression this is about the same as plotting residuals vs. X. Look for outliers, curvature, or increasing spread (a funnel or horn shape); then take appropriate action.
Statistical illustrations

Goal: to describe the distribution of breakdown time of an insulating fluid as a function of the voltage applied to it.

- Recognizing the need for a log transformation of the response from the scatterplot and the residual plot
- Checking the simple linear regression fit with a lack-of-fit F-test
- Stata commands (follow)
Simple regression
The residuals-vs-fitted-values plot shows increasing spread with increasing fitted values, suggesting a log transformation of the response: log(Y) = log(time).
μ{log(Y) | X} = β0 + β1X

(if the distribution of log(Y) given X is symmetric)

  Median{Y | X} = e^(β0 + β1X)

  Median{Y | X = x+1} / Median{Y | X = x} = e^(β0 + β1(x+1)) / e^(β0 + β1x) = e^β1

so Median{Y | X = x+1} = e^β1 · Median{Y | X = x}
Interpretation of Y logged
As X increases by 1, the median of Y changes by the multiplicative factor e^β1. Or, better: if β1 > 0, as X increases by 1, the median of Y increases by (e^β1 − 1) · 100%.
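The percent-change reading of e^β1 is simple arithmetic. A sketch in Python with a hypothetical slope (β1 = 0.25 is not from the slides):

```python
import math

# Hypothetical slope from a regression of log(Y) on X
beta1 = 0.25
# As X increases by 1, the median of Y increases by (e^beta1 - 1) * 100%
pct = (math.exp(beta1) - 1) * 100
print(round(pct, 1))  # 28.4: each unit of X raises the median of Y by about 28.4%
```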
If β1 < 0, the median of Y decreases by (1 − e^β1) · 100%; e.g., for β1 = −0.5, 1 − e^(−0.5) ≈ .4, a decrease of about 40%.
[Figure: log(TIME) vs. VOLTAGE (25–40) scatterplot with fitted line]
Associated with each two-fold increase (i.e., doubling) of X is a β1·log(2) change in the mean of Y.
For each doubling of time after slaughter (between 0 and 8 hours), the mean pH decreases by about .5 (with β̂1 = −0.73 on log(time), β̂1·log(2) ≈ −0.5).
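The −0.5 per doubling follows from the slope −0.73 of the fitted line Ŷ = 6.98 − .73X shown earlier, where X is log(time). A quick arithmetic check in Python:

```python
import math

# Slope of pH on log(time) from the fitted line above
beta1 = -0.73
# Change in mean pH per doubling of time: beta1 * log(2)
change_per_doubling = beta1 * math.log(2)
print(round(change_per_doubling, 2))  # -0.51: mean pH drops about 0.5 per doubling
```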
[Figure: pH vs. ltime with fitted values]
If Y is also logged, doubling X multiplies the median of Y by e^(β1·log(2)): an increase of (e^(β1·log(2)) − 1) · 100% if β1 > 0, or a decrease of (1 − e^(β1·log(2))) · 100% if β1 < 0.
μ{log(Y) | log(X)} = β0 + β1·log(X)
Y and X logged
Example: Log-Log
In order to graph the log-log plot we need to generate two new variables (the natural logarithms of Y and X).