
Lecture for 111424

Topics for today


Activity
Power training data
Motor vehicle theft data
Influence and multicollinearity
Star data
How do you estimate polynomial
curve models?
Estimate these models sequentially
Add linear term
Add quadratic term
Add cubic term
etc. until two terms in a row are not significant
As you build the model, keep significant terms in the model even if they
become non-significant after additional terms are added.
Remember that powers are essentially interaction terms
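The sequential procedure above can be sketched in code. The class files (powerser.training.sas, powerser.training.session1.R) do this in SAS and R; below is an illustrative Python/numpy sketch on simulated data, where the true curve is quadratic, so the partial F for the quadratic term should be large and the cubic and quartic terms should not be.

```python
import numpy as np

# Simulated data (an assumption for this sketch): a true quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = 1 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.size)

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

n = x.size
X = np.ones((n, 1))            # start from the intercept-only model
fs = {}
for degree in range(1, 5):     # add linear, quadratic, cubic, quartic terms
    X_new = np.column_stack([X, x**degree])
    # partial F for the term just added
    fs[degree] = (rss(X, y) - rss(X_new, y)) / (rss(X_new, y) / (n - X_new.shape[1]))
    # keep the term and continue; in practice you would stop after
    # two consecutive non-significant terms, per the lecture's rule
    X = X_new

for degree, f in fs.items():
    print(f"degree {degree}: partial F = {f:.2f}")
```

In practice each partial F would be compared to an F(1, n − k) critical value, stopping once two terms in a row are non-significant.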
Here's why you need to go two additional steps

     x   x2   x3   x4  |  xcen  xcen2  xcen3  xcen4
     0    0    0    0  |   -2     4     -8     16
     1    1    1    1  |   -1     1     -1      1
     2    4    8   16  |    0     0      0      0
     3    9   27   81  |    1     1      1      1
     4   16   64  256  |    2     4      8     16

The even and odd powers are tied together.
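The table can be checked numerically. In this Python/numpy sketch (made-up symmetric x, as in the table), centering makes odd and even powers exactly uncorrelated, while odd powers stay tied to other odd powers:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
xc = x - x.mean()                        # centered: [-2, -1, 0, 1, 2]

# raw powers are strongly correlated with each other
r_raw = np.corrcoef(x, x**2)[0, 1]
# centered odd vs even powers: correlation is exactly zero by symmetry
r_cen = np.corrcoef(xc, xc**2)[0, 1]
# but odd powers remain tied to each other (and even to even)
r_odd = np.corrcoef(xc, xc**3)[0, 1]

print(r_raw, r_cen, r_odd)
```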
An example
Training series 1 data set
powerser.training.sas
powerser.training.session1.R
Using polynomials with different
scales
Three ways of creating polynomials
different.scaling.of.polynomials.R
Motor vehicle data
predict.Model.for.Bob.add.manipulate.R
pa.mvtheft.frst40.higher.order.poly.toclass.R
pa.mvtheft.frst40.dat
Maximum-likelihood regression
Maximum-likelihood regression.docx
Influence statistics
What is the effect of removing a data point?
How influential is each data point?
Leverage
Residual value
Determine this for each data point
This is equivalent to the ‘jackknife’
For problematic data points, remove one at a time (the worst first)
See if the regression results change
add a data point to class.R
Based on the work of Belsley, Kuh, & Welsch (1980)
Examples in R and SAS
first.influence.and.collinearity.example.five.predi
ctors.sas
influence.statistics.R
collinearity.statistics.R
Hat matrix: the degree of leverage

h_i = x_i (X'X)^-1 x_i'

Cutoff: h_i > 2p/n
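The leverage formula and the 2p/n cutoff, as a Python/numpy sketch (data are made up; the last point is deliberately far from the rest):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])   # toy data; x = 10 is extreme
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])
X = np.column_stack([np.ones_like(x), x])  # intercept + slope
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_i
cutoff = 2 * p / n
print(h.round(3), "flagged:", np.where(h > cutoff)[0])
```

Note that the leverages always sum to p, the number of estimated coefficients.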
Studentized residual

rstudent(i) = r_i / ( s(i) sqrt(1 - h_i) )

where s(i) is the residual standard deviation with observation i deleted.

Cutoff: |rstudent(i)| > 2
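A Python/numpy sketch of the deletion formula (made-up data; the class files compute this in R/SAS):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])  # toy data
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                # raw residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages

# s(i): residual standard deviation with observation i deleted
t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    ei = y[keep] - X[keep] @ bi
    s_i = np.sqrt(ei @ ei / (n - 1 - p))
    t[i] = e[i] / (s_i * np.sqrt(1 - h[i]))

print(t.round(2))   # flag any |t| > 2
```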
Covariance ratio

COVRATIO(i) = det( s(i)^2 (X(i)'X(i))^-1 ) / det( s^2 (X'X)^-1 )

Cutoff: |COVRATIO - 1| > 3p/n
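The same deletion idea, written out directly from the determinant definition as a Python/numpy sketch (toy data again):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])  # toy data
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
s2 = e @ e / (n - p)
denom = np.linalg.det(s2 * np.linalg.inv(X.T @ X))  # full-data covariance det

covratio = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    ei = yi - Xi @ bi
    s2i = ei @ ei / (n - 1 - p)                     # s(i)^2
    covratio[i] = np.linalg.det(s2i * np.linalg.inv(Xi.T @ Xi)) / denom

cutoff = 3 * p / n
print(covratio.round(2), "flagged:", np.where(np.abs(covratio - 1) > cutoff)[0])
```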
Difference in fits: DFFITS

DFFITS(i) = ( ŷ_i - ŷ_i(i) ) / ( s(i) sqrt(h_i) )

where ŷ_i(i) is the fitted value for point i when point i is deleted.

Cutoff: |DFFITS(i)| > 2 sqrt(p/n)
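A Python/numpy sketch of DFFITS by direct deletion (toy data; 2*sqrt(p/n) is the usual rule-of-thumb cutoff):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])  # toy data
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    ei = y[keep] - X[keep] @ bi
    s_i = np.sqrt(ei @ ei / (n - 1 - p))           # s(i)
    yhat_i_del = X[i] @ bi                         # fitted value for i with i deleted
    dffits[i] = (yhat[i] - yhat_i_del) / (s_i * np.sqrt(h[i]))

cutoff = 2 * np.sqrt(p / n)
print(dffits.round(2), "flagged:", np.where(np.abs(dffits) > cutoff)[0])
```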
Difference in betas: DFBETAS
(Mike's favorite)

DFBETAS(j,i) = ( b_j - b_j(i) ) / ( s(i) sqrt( [(X'X)^-1]_jj ) )

Cutoff: |DFBETAS(j,i)| > 2/sqrt(n)
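A Python/numpy sketch: one DFBETAS value per coefficient per deleted observation (toy data; 2/sqrt(n) is the usual rule-of-thumb cutoff):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])  # toy data
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

b = np.linalg.lstsq(X, y, rcond=None)[0]        # full-data coefficients
xtx_inv = np.linalg.inv(X.T @ X)

dfbetas = np.empty((n, p))                      # rows: deleted obs; cols: coefficients
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    ei = y[keep] - X[keep] @ bi
    s_i = np.sqrt(ei @ ei / (n - 1 - p))        # s(i)
    dfbetas[i] = (b - bi) / (s_i * np.sqrt(np.diag(xtx_inv)))

cutoff = 2 / np.sqrt(n)
print("flagged (obs, coef):", list(zip(*np.where(np.abs(dfbetas) > cutoff))))
```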
Cook's Distance

D_i = r_i^2 h_i / ( p s^2 (1 - h_i)^2 )

Cutoff: D_i > 4/n (a common rule of thumb)
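Unlike the deletion statistics above, Cook's distance needs no refitting; it comes straight from the residuals and leverages. A Python/numpy sketch on the same toy data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 5.6, 10.5])  # toy data
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                # raw residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
s2 = e @ e / (n - p)                            # residual variance

D = e**2 * h / (p * s2 * (1 - h)**2)            # Cook's distance
print(D.round(3))
```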
Multicollinearity
Tolerance (Tol)
Variance Inflation Factor (VIF)
Eigenvalues
Condition numbers
Tolerance and Variance Inflation Factor

Tolerance_k = 1 - R_k^2

Rough cutoff 1: Tol < .1
Rough cutoff 2: Tol < 1 - R^2

VIF_k = 1 / Tolerance_k = 1 / (1 - R_k^2)

Rough cutoff 1: VIF > 10
Rough cutoff 2: VIF > 1/(1 - R^2)
Definition of R_k^2 for the variance inflation factor

R_k^2 is the multiple R^2 from regressing X_k on the other
covariates; this regression does not involve the
response variable Y.
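A Python/numpy sketch straight from the definition: regress each predictor on the others, take that R_k^2, and form Tol and VIF. The simulated predictors (an assumption of this sketch) build in near-collinearity, so the VIFs should exceed the rule-of-thumb 10:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)   # nearly a linear combination
X = np.column_stack([x1, x2, x3])

def vif(X, k):
    """VIF for column k: regress X[:, k] on the other columns (plus intercept)."""
    others = np.delete(X, k, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Z, X[:, k], rcond=None)[0]
    resid = X[:, k] - Z @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, k] - X[:, k].mean())**2)
    return 1.0 / (1.0 - r2)                     # = 1 / Tolerance_k

vifs = np.array([vif(X, k) for k in range(3)])
print(vifs.round(1))
```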
Condition numbers

Condition numbers (condition indices) are the square roots of the ratios
of the largest eigenvalue to each individual eigenvalue. Eigenvalues are
the characteristic roots of X'X. Conventionally, an eigenvalue close to zero
(say, less than .01) or a condition number greater than 50 (30 for conservative
analysts) indicates serious multicollinearity. Belsley, Kuh, and Welsch (1980)
treat 10 as the point where collinearity begins to affect estimates and 100
as seriously problematic.
Eigenvalues

Covariance matrix A = [ 3  1 ]     Identity matrix I = [ 1  0 ]
                      [ 1  3 ]                         [ 0  1 ]

Eigenvalue equation: |A - λI| = 0

|A - λI| = | 3-λ    1  |
           |  1    3-λ | = 0

(3 - λ)^2 - 1 = 0
9 - 6λ + λ^2 - 1 = 0
(λ - 4)(λ - 2) = 0
λ = 4, 2
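The hand calculation can be checked numerically; a Python/numpy sketch (eigvalsh handles symmetric matrices) that also forms the condition numbers sqrt(λ_max / λ_i):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 3.]])

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # descending: [4, 2]
cond = np.sqrt(eigvals[0] / eigvals)             # condition numbers: [1, sqrt(2)]
print(eigvals, cond)
```

Here the eigenvalues 4 and 2 are roughly equal, so the condition numbers are small and there is no sign of multicollinearity.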
Interpretation
Eigenvalues roughly equal or none particularly large indicate no
multicollinearity
Turn into condition number

Criteria
>50
10 beginning; 100 problematic (Belsley, Kuh, & Welsch, 1980)
For high condition number check which independent variable has largest
proportion of variance associated with that eigenvalue and consider
removing
Star data
stardata.toclass.R
stardata.forR.dat
stardata.dat
