BM1 Cheatsheet Exam 2
Group of functions: Y_i = β_0 + Σ_j β_j X_ji + ε_i, so E(Y) = β_0 + Σ_j β_j X_j. Fitted simple regression line: Ŷ = b_0 + b_1 X_1.
Linear in parameters: no parameter is multiplied by, divided by, or an exponent of another β.
Error assumptions: E[ε_i] = 0, σ²(ε_i) = σ², Cov(ε_i, ε_j) = 0, meaning each observation is independent. Normality of the errors is assumed only for inference.
Deterministic and random components: β_0 and β_1 represent the model coefficients to be estimated. Random error: ε_i = Y_i − E[Y_i], true and unknown. Residual: e_i = Y_i − Ŷ_i.
MLE properties
Likelihood: l(θ | x) = P_θ(X = x) for a R.V. X with θ as a parameter. For an independent sample, l(θ | x_1, …, x_n) = P(x_1, …, x_n | θ) = ∏_{i=1}^n f_X(x_i), and the log-likelihood is Σ_{i=1}^n log f_X(x_i). The estimates are the ArgMax, found by solving ∂ ln l(β_0, β_1, σ² | x_1, …, x_n)/∂θ = 0 (ε is not assumed to be zero yet).
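A minimal sketch (made-up data, placeholder values) of the ArgMax idea: maximize the normal log-likelihood numerically and check that the fitted intercept/slope match an ordinary least squares fit (np.polyfit).

```python
import numpy as np
from scipy.optimize import minimize

# toy data, made up for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)
n = x.size

def neg_log_lik(params):
    b0, b1, log_sigma2 = params              # log-variance keeps sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    resid = y - b0 - b1 * x
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(resid**2) / (2 * sigma2)

mle = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x
print(mle[:2])                               # MLE of (beta0, beta1)
print(np.polyfit(x, y, 1)[::-1])             # least squares (intercept, slope): same values
```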
Derivation of Least Squares Estimates and LSE properties
Minimize the errors = minimize the criterion Q = Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)².
∂Q/∂β_0 = ∂/∂β_0 Σ(Y_i − β_0 − β_1 X_i)² = 2 Σ[(Y_i − β_0 − β_1 X_i) · ∂/∂β_0(Y_i − β_0 − β_1 X_i)] = 2 Σ[(Y_i − β_0 − β_1 X_i) · (−1)] by the chain rule.
Setting it to zero: Σ(−Y_i + β_0 + β_1 X_i) = 0; since β_0 is constant across i, Σβ_0 = nβ_0, so b_0 = Ȳ − b_1 X̄.
∂Q/∂β_1 = ∂/∂β_1 Σ(Y_i − β_0 − β_1 X_i)² = 2 Σ[(Y_i − β_0 − β_1 X_i) · (−X_i)] by the chain rule.
Setting it to zero: Σ(−X_i Y_i + X_i β_0 + β_1 X_i²) = 0, i.e. −Σ X_i Y_i + n X̄ Ȳ − β_1 n X̄² + β_1 Σ X_i² = 0, giving b_1 = (Σ X_i Y_i − n X̄ Ȳ) / (Σ X_i² − n X̄²).
To see whether (b_0, b_1) is a minimum or a maximum, the second derivatives are needed: [∂²Q/∂β_0² , ∂²Q/∂β_0∂β_1 ; ∂²Q/∂β_1∂β_0 , ∂²Q/∂β_1²] = [2n , 2nX̄ ; 2nX̄ , 2ΣX_i²]. For (b_0, b_1) to be a global minimum the determinant must be positive: 2n · 2ΣX_i² − (2nX̄)² = 4n(ΣX_i² − nX̄²) = 4n Σ(X_i − X̄)² > 0, and 2n > 0, 2ΣX_i² > 0, so Q is minimized.
Properties of the LSE: unbiased and linear (in Y).
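A minimal numpy sketch of the closed-form estimates derived above (x and y are placeholder data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # placeholder predictor values
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])      # placeholder responses
n = x.size

# normal-equation solutions derived above
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
print(b0, b1)
```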
Estimation of errors
Error sum of squares (residual sum of squares): SSE = Σ e_i² = Σ(Y_i − Ŷ_i)².
s² = MSE = SSE/(n − 2) = Σ(Y_i − Ŷ_i)²/(n − 2); E[MSE] = σ², the true variance of the errors.
Inference about model coefficients
For β_1: H_0: β_1 = β_10 = 0; H_a: β_1 ≠ 0 (β_1 > 0 or < 0 if one-tailed).
For β_0: H_0: β_0 = β_00 = 0; H_a: β_0 ≠ 0 (β_0 > 0 or < 0 if one-tailed).
Standardized statistic: (b_1 − β_10)/s(b_1) ~ t_{n−2}, with s(b_1) = √MSE(b_1) = √[Σ(Y_i − Ŷ_i)² / ((n − 2) Σ(X_i − X̄)²)] = √[MSE / Σ(X_i − X̄)²].
s(b_0) = √MSE(b_0) = √[MSE (1/n + X̄² / Σ(X_i − X̄)²)].
Studentized statistic: a statistic that is standardized, but the denominator is an estimated standard deviation rather than the true standard deviation.
Confidence intervals (CI) for intercept/slope: β_i = b_i ± t_{n−2, 0.975} · √MSE(b_i).
95% CI of β interpreted: with independent sampling at each level of X and a 95 percent confidence interval constructed for each sample, 95 percent of the intervals will contain the true value of β.
Departures from normality: the sampling distributions of b_0 and b_1 will be approximately normal even when the errors depart from normality (asymptotic normality, approached with increased sample size).
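A sketch of the slope t test and 95% CI using the formulas above (scipy's t distribution supplies the quantile; x and y are the same placeholder data as before):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = x.size

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = np.sum(resid**2) / (n - 2)                     # s^2 = SSE/(n-2)
s_b1 = np.sqrt(mse / np.sum((x - x.mean())**2))      # s(b1)
s_b0 = np.sqrt(mse * (1/n + x.mean()**2 / np.sum((x - x.mean())**2)))

t_star = b1 / s_b1                                   # H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_star), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
print(t_star, p_val, ci_b1)
```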
Polynomial regression: use powers of X as predictors (X, X², X³, …). A high enough order polynomial can eventually pass through every point, resulting in overfitting. Retaining a higher-order term means retaining all lower-order terms. Polynomial terms have strong multicollinearity; compensate with the centered variable x_i = X_i − X̄.
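A small sketch of why centering helps (made-up x values): the raw X and X² columns are nearly collinear, while the centered versions are much less so.

```python
import numpy as np

x = np.linspace(10, 20, 50)          # made-up predictor on a shifted scale
xc = x - x.mean()                    # centered variable

print(np.corrcoef(x, x**2)[0, 1])    # close to 1: strong collinearity
print(np.corrcoef(xc, xc**2)[0, 1])  # near 0 for this symmetric design
```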
Model using matrix notation
Product of two matrices: A_{i×c} B_{c×j} = C_{i×j} = [Σ_{k=1}^c a_{ik} b_{kj}]. In particular, A_{i×c} A^T_{c×i} = [Σ_{k=1}^c a_{ik} a_{jk}].
X^T X = [n , ΣX_i ; ΣX_i , ΣX_i²], Y^T Y = [ΣY_i²], X^T Y = [ΣY_i ; ΣX_i Y_i].
Inverse of a matrix: X^{−1} X = I.
2×2: [a , b ; c , d]^{−1} = 1/(ad − bc) · [d , −b ; −c , a].
3×3: [a , b , c ; d , e , f ; g , h , k]^{−1} = (1/Z) · [ek − fh , ch − bk , bf − ce ; fg − dk , ak − cg , cd − af ; dh − eg , bg − ah , ae − bd], where Z = a(ek − fh) − b(dk − fg) + c(dh − eg).
Linear regression in matrix form
Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.
Q = ε^T ε = (Y − Xβ)^T (Y − Xβ) = (Y^T − (Xβ)^T)(Y − Xβ) = Y^T Y − Y^T Xβ − β^T X^T Y + β^T X^T Xβ = Y^T Y − 2 Y^T Xβ + β^T X^T Xβ.
Normal equations: X^T Y = X^T X b, so b_{p×1} = (X^T X)^{−1} X^T Y (for simple regression, b is 2×1).
Fitted values: Ŷ = HY = X(X^T X)^{−1} X^T Y. Projection (hat) matrix H, definition and role: it represents the orthogonal projection of Y onto the space spanned by X.
β_0: the intercept of Y with the regression surface; the value at the baseline level, i.e. all X_j = 0. β_j: the change in E[Y] for a one-unit change in X_j when the other X's are kept constant.
Inference about the response variable E[Y_h]
Y_h corresponds to an X_h that is new yet within the scope of the model; we want to estimate E[Y_h]. E[Ŷ_h] = E[Y_h]: Ŷ_h represents the average of the Ŷ_h from repeated independent sampling and regression at each level of X.
σ²{Ŷ_h} = σ² [1/n + (X_h − X̄)² / Σ(X_i − X̄)²], estimated by s²{Ŷ_h} = MSE [1/n + (X_h − X̄)² / Σ(X_i − X̄)²]; this measures the spread of Ŷ_h from repeated independent sampling and regression at each level of X. The distance of X_h from X̄ affects σ²{Ŷ_h}.
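A numpy sketch of the matrix results above: normal equations, hat matrix, and s²{Ŷ_h} at a new X_h (the data and X_h are placeholders):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = x.size

X = np.column_stack([np.ones(n), x])          # design matrix with intercept column
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                         # b = (X'X)^{-1} X'Y
H = X @ XtX_inv @ X.T                         # hat (projection) matrix
y_hat = H @ y                                 # Y_hat = H Y
mse = np.sum((y - y_hat)**2) / (n - X.shape[1])

x_h = 3.5                                     # new X within the model's scope
var_yhat_h = mse * (1/n + (x_h - x.mean())**2 / np.sum((x - x.mean())**2))
print(b, var_yhat_h)
```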
CI for the mean response at a particular (mean) X-value
(Ŷ_h − E[Y_h]) / s(Ŷ_h) ~ t_{n−2}, with s(Ŷ_h) = √[MSE (1/n + (X_h − X̄)² / Σ(X_i − X̄)²)].
CI = Ŷ_h ± t_{n−2, 97.5%} · s(Ŷ_h).
Interpretation: we conclude with 95% confidence that the mean of Y when X = X_h is somewhere between the CI limits.
PI for a new observation at a particular X-value
Predict the outcome of drawing a new Y_h at X_h; the prediction error adds the variability of a single draw to the error in estimating the mean:
s{pred} = √(MSE + s²{Ŷ_h}) = √[MSE (1 + 1/n + (X_h − X̄)² / Σ(X_i − X̄)²)].
PI = Ŷ_h ± t_{n−2, 97.5%} · s{pred}.
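A sketch computing the 95% CI for the mean response and the 95% PI for a new observation at X_h, following the formulas above (placeholder data):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n, x_h = x.size, 3.5

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - b0 - b1 * x)**2) / (n - 2)
y_hat_h = b0 + b1 * x_h

leverage_term = 1/n + (x_h - x.mean())**2 / np.sum((x - x.mean())**2)
s_mean = np.sqrt(mse * leverage_term)         # s{Yhat_h}
s_pred = np.sqrt(mse * (1 + leverage_term))   # s{pred}
t_crit = stats.t.ppf(0.975, df=n - 2)

print((y_hat_h - t_crit * s_mean, y_hat_h + t_crit * s_mean))  # CI for E[Yh]
print((y_hat_h - t_crit * s_pred, y_hat_h + t_crit * s_pred))  # PI for a new Yh
```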
Analysis of Variance (ANOVA) in MLR
General linear test: fit the full model and obtain its error sum of squares SSE(F); fit the reduced model under H_0 and obtain SSE(R). F* = [(SSE(R) − SSE(F)) / (df_R − df_F)] / [SSE(F) / df_F].
Global ANOVA F-test: H_0: all β_i = 0; H_a: at least one β_i ≠ 0. F* = MSR/MSE.
Partial ANOVA F-test: H_0: a specific subset of the β_i = 0.
Test statistic: F* = [(SSR(β_i, S) − SSR(S)) / (# of added β_i)] / MSE(β_i, S) = [(R²_f − R²_r) / (df_r − df_f)] / [(1 − R²_f) / df_f].
Decision rule: if F* ≤ F(1 − α; # of tested β_i, n − p), conclude H_0; otherwise conclude H_a.
Comparing nested models: state the null and alternative hypotheses and use the test statistic above.
Likelihood Ratio Test (LRT): Δ = −2 log(L_0/L_1) = −2(l_0 − l_1) ~ χ²_d, where d = # of added parameters and L_0 is the likelihood of the small/null model.
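A sketch of the general linear (full vs reduced) F test described above; the design matrices and data are placeholders for nested models.

```python
import numpy as np
from scipy import stats

def sse(X, y):
    # residual sum of squares and error df from a least squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta)**2), X.shape[0] - X.shape[1]

rng = np.random.default_rng(1)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)                  # x2 is truly irrelevant here

X_red = np.column_stack([np.ones(n), x1])                # reduced model (under H0)
X_full = np.column_stack([np.ones(n), x1, x2])           # full model

sse_r, df_r = sse(X_red, y)
sse_f, df_f = sse(X_full, y)
F = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p = stats.f.sf(F, df_r - df_f, df_f)
print(F, p)
```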
Test/interpretation of interactions and confounders
Test: fit 2 models (with and without the interaction term); there is interaction if the interaction term is significant, otherwise drop it.
Confounder: β_1 changes sign or changes by more than 10% when the other variable is added; no interaction term is involved, and the significance of β_2 does not matter.
Interaction effect: the change in E[Y] per unit increase in X_i is β_i + β_I X_I; β_I is the impact of every unit increase of X_I on the slope of X_i. When β_I and β_i have different signs: interference effect; otherwise: reinforcement effect.
Model selection
From any set of p − 1 predictors, 2^(p−1) alternative models can be constructed.
Automatic procedures: backward, forward, and stepwise elimination; "best" subsets algorithms.
Backward: start with the full model, do a t test for each predictor, drop the least significant (largest p-value), and repeat.
Forward: start with no predictors and add the most significant predictor one at a time.
Stepwise: forward first, then at each step backward; a variable already in the model can be dropped if it is no longer needed.
Major weakness of these procedures: each of the stepwise search procedures can sometimes err and miss the best subset.
Goal: find the point where adding more X variables is not worthwhile because it leads to only a very small increase in R². When using adjusted R², find the maximum.
AIC_p = n ln(SSE_p) − n ln(n) + 2p; BIC_p = n ln(SSE_p) − n ln(n) + ln(n)·p. The term n ln(SSE_p) decreases as p increases, while the penalty (2p for AIC, ln(n)·p for BIC) increases as p increases. BIC penalizes extra parameters more heavily, favoring smaller models (parsimony).
Mallow's Cp criterion: estimates the total mean squared error of the n fitted values for each subset regression model. When there is no bias in the regression model with p − 1 X variables, E[Ŷ_i] = μ_i and E[Cp] is approximately p. When the Cp values for all possible regression models are plotted against p, models with little bias tend to fall near the line Cp = p; models with substantial bias tend to fall considerably above the line. Cp values below the line are interpreted as showing no bias, being below the line due to sampling error.
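A sketch computing SSE-based selection criteria for one candidate subset (placeholder data). The adjusted R², AIC, and BIC follow the formulas above; the Cp expression uses the common form SSE_p/MSE(full) − (n − 2p), which is an assumption since it is not spelled out above.

```python
import numpy as np

def fit_sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta)**2)

rng = np.random.default_rng(2)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2, x3])
X_subset = np.column_stack([np.ones(n), x1, x2])        # candidate subset, p = 3
p = X_subset.shape[1]

ssto = np.sum((y - y.mean())**2)
sse_p = fit_sse(X_subset, y)
mse_full = fit_sse(X_full, y) / (n - X_full.shape[1])

r2_adj = 1 - (n - 1) / (n - p) * sse_p / ssto
aic = n * np.log(sse_p) - n * np.log(n) + 2 * p
bic = n * np.log(sse_p) - n * np.log(n) + np.log(n) * p
cp = sse_p / mse_full - (n - 2 * p)
print(r2_adj, aic, bic, cp)
```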
Continuous and categorical predictors
With a categorical predictor, β indicates how much higher (lower) the response function for an included level is than the one for the baseline level, at any given level of the continuous predictor (e.g. tool speed).
Interpretation of an interaction with a categorical variable: β indicates how much greater (smaller) the intercept/slope of the response function is for the class coded 1 than for the class coded 0.
Partitioning variance: SSTO, SSR, SSE
SSTO = Σ(Y_i − Ȳ)² = Y^T [I − (1/n)J] Y.
SSE (unexplained) = Σ(Y_i − Ŷ_i)² = Y^T [I − H] Y; MSE = SSE/(n − p); E[MSE] = σ².
SSR (explained) = Σ(Ŷ_i − Ȳ)² = Y^T [H − (1/n)J] Y; MSR = SSR/(p − 1); E[MSR] = σ² + k, where k = 0 only when all the slope β's are 0 (for p = 2, k = β_1² Σ(X_i − X̄)²).
SSTO = SSE (unexplained) + SSR (explained).
Extra SSR: the increase in SSR from an additional predictor: SSR(X_3 | X_2, X_1) = SSR(X_3, X_2, X_1) − SSR(X_2, X_1) = SSE(X_2, X_1) − SSE(X_3, X_2, X_1).
Degrees of freedom for each SS: SSTO: n − 1 (the mean is fixed, one constraint); SSE: n − p (p parameters were estimated); SSR: p − 1 (p − 1 predictors are free to vary).
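A sketch verifying the quadratic-form versions of SSTO, SSE, and SSR above (placeholder data; J is the all-ones matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 1 + 0.8 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
J = np.ones((n, n))                        # all-ones matrix
I = np.eye(n)

ssto = y @ (I - J / n) @ y                 # = sum((y - ybar)^2)
sse = y @ (I - H) @ y                      # = sum((y - yhat)^2)
ssr = y @ (H - J / n) @ y                  # = sum((yhat - ybar)^2)
print(np.isclose(ssto, sse + ssr))         # SSTO = SSE + SSR
```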
Coefficient of multiple determination R²
R² = SSR/SSTO: it measures the proportionate reduction of total variation in Y associated with the use of the set of X's, i.e. the degree of linear association.
Coefficient of partial determination: R²_{Y1|2} = [SSE(X_2) − SSE(X_1, X_2)] / SSE(X_2) = SSR(X_1 | X_2) / SSE(X_2).
Adjusted R² and partial R²: needed because adding X_2 can only make R² larger, never smaller, since SSE can only get smaller while SSTO stays constant with additional predictors.
R²_a = 1 − [(n − 1)/(n − p)] · SSE/SSTO; it can become smaller when the decrease in SSE is offset by the loss of degrees of freedom.
R-squared vs correlation
Coefficient of correlation: r = ±√R²; the sign represents the direction of the slope.
Misunderstanding 1: a high R² indicates that useful predictions can be made (no, precision is not included).
Misunderstanding 2: a high coefficient of determination indicates that the estimated regression line is a good fit (the relation may be curvilinear).
Misunderstanding 3: a zero coefficient of determination indicates no correlation (the relation may be curvilinear). R² is also affected by the spacing of the X values.
Model Building (concepts)
MSE: seek to minimize it. MSPE: minimize it; it is the prediction error. PRESS: a measure of how well the fitted values from a subset model can predict the observed responses Y_j; the sum of SSE over LOOCV.
Model Validation (concepts)
Cross-validation: data splitting. In the model-building data set, the number of cases should be at least 6 to 10 times the number of variables in the pool of predictor variables.
K-fold and N-fold validation: the data are first split into K roughly equal parts. For k = 1, 2, …, K, use the kth part as the validation set, fit the model using the other K − 1 parts, and obtain the predicted sum of squares for error. The K estimates of prediction error are then combined to produce a K-fold cross-validation estimate. When K = n, the K-fold cross-validation estimate is identical to the PRESS_p statistic. Usually K = 5 or 10. N-fold: LOOCV, K = n.
Bootstrap: sample with replacement so that the sample mimics the population; feed the bootstrapped samples into the constructed regression model and look at the differences in the coefficients (plot them to see their variance).
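A minimal K-fold cross-validation sketch for a linear model (plain numpy, K = 5, made-up data); with K = n this reduces to LOOCV/PRESS.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 100, 5
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

folds = np.array_split(rng.permutation(n), K)       # K roughly equal parts
press = 0.0
for k in range(K):
    test = folds[k]
    train = np.setdiff1d(np.arange(n), test)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    press += np.sum((y[test] - X[test] @ beta)**2)  # prediction SSE on the held-out fold
print(press / n)                                    # K-fold CV estimate of prediction error
```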
MLR Diagnostics
Assumptions to check:
Residuals are normally distributed? Normal probability plot and quantile-quantile (QQ) plot: a straight line means the residuals are normal. Small departures from normality are not a serious problem; heavy tails indicate outliers.
Variance of the residuals is constant? Residuals (Y-axis) vs fitted values (X-axis): the residuals should bounce around 0 (E[e] = 0) in a 'band' around zero, above and below (constant variance), with a random pattern (no potential outliers).
Residuals are independent of one another? Residuals (Y-axis) vs an observed covariate (X-axis): the spread of the residuals around the Y = 0 line should be approximately constant over the range of X; a curved pattern indicates that the predictor should enter the model in a curvilinear manner; fanning indicates that the variance is not constant.
Remedies: transformations of the outcome (Y) or of the covariates (X). Box-Cox: find the 'best' power transformation by raising Y to different λ (Y^λ).
Identify collinearity: variance inflation factor (VIF), obtained using the R² of the regression of one predictor against all the others: VIF_j = 1/(1 − R_j²). Predictors with high VIFs are removed. VIF < 5: moderate, does not warrant correction; VIF > 5: severe; VIF > 10: serious collinearity.
Remedies for collinearity: drop one or several correlated variables from the model (the one with the bad VIF); in polynomial regression, use centered predictors; use principal components analysis (PCA), based on eigenvalues; use shrinkage/penalized methods.
Outliers: observations that are different from the major data set. Outlier in Y: 'studentized residuals' r_i = e_i / s{e_i} = e_i / √(MSE(1 − h_ii)), where h_ii is the ith diagonal element of H = X(X'X)^{-1}X'.
Rule of thumb: an observation with |ri|>2.5 may
be considered an outlier (in Y)
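A sketch computing internally studentized residuals and flagging |r_i| > 2.5, per the rule above (placeholder data with one injected outlier):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
y[0] += 6                                   # inject one outlier in Y

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                              # leverage values h_ii
e = y - H @ y                               # residuals
mse = np.sum(e**2) / (n - X.shape[1])

r = e / np.sqrt(mse * (1 - h))              # studentized residuals
print(np.where(np.abs(r) > 2.5)[0])         # flagged outliers in Y
```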
Leverage values: Outlier in X •The diagonal
elements ℎii of the hat matrix are also called
leverage values, measure how far each
observation is from the center of the X space.
σ²{e} = σ²(I − H); 0 ≤ h_ii ≤ 1, with Σ_{i=1}^n h_ii = p, where p = # of model parameters (includes intercept). •
ℎii high means that the ith observation might have
an influence on the fitted equation (model
parameter estimates).Rules of thumb:
0.2 < h_ii < 0.5: moderate; h_ii > 2p/n: high; h_ii > 0.5: very high. The 2p/n cutoff only applies to large data sets; for p = 2, take 0.2 as the cutoff.
Influential Observations if outliers are
influential: exclusion or inclusion causes major
changes in the regression function (estimates)
DFFITS: the difference (DF) between Ŷ_i for the complete model and for the model with the ith observation omitted:
DFFITS_i = (Ŷ_i − Ŷ_{i(i)}) / √(MSE_{(i)} · h_ii).
|DFFITS_i| > 1 for small data sets; |DFFITS_i| > 2√(p/n) for large data sets.
Cook's distance: D_i = Σ_{j=1}^n (Ŷ_j − Ŷ_{j(i)})² / (p · MSE).
An observation is influential if it has a high residual e_i and moderate h_ii, a high h_ii and moderate e_i, or both high. Rule of thumb: D_i > 1 (0.5 in R); D_i > 4/n is of concern.
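A sketch of DFFITS and Cook's distance via leave-one-out refits, following the formulas above (placeholder data; p includes the intercept):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
y_hat = H @ y
mse = np.sum((y - y_hat)**2) / (n - p)

dffits = np.empty(n)
cooks = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    y_hat_i = X @ beta_i                              # fits with observation i omitted
    mse_i = np.sum((y[keep] - X[keep] @ beta_i)**2) / (n - 1 - p)
    dffits[i] = (y_hat[i] - y_hat_i[i]) / np.sqrt(mse_i * h[i])
    cooks[i] = np.sum((y_hat - y_hat_i)**2) / (p * mse)

print(np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0], np.where(cooks > 4 / n)[0])
```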
Residuals vs Leverage Identify influential cases -
outlying values at the upper right or lower right
Handling ‘Unusual’ Observations
be true to the data, examine the observations
before excluding them from the analysis •If they
are truly representative of the population, it is
better to leave them in the dataset •Compare the
model estimates with and without outliers •If the
slope estimates change, then you might be dealing with
influential points
Multicollinearity or Intercorrelation
(highly) correlated predictors: essentially the same
variable is entered into the regression twice.
Other causes of collinearity -One variable is a
linear combination of several variables in the data
-Perfect collinearity is usually ‘by mistake’
Effects of Collinearity a unique solution might not
exist , Design matrix X is not full rank, Matrix X
′X is singular: (X′X)−1 DNE
one or more correlated variables will become non-
significant •The magnitude / direction of the
coefficients will change •The SE will be inflated
-‘Flat’ SSR • little increase in R^2
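A sketch of VIF_j = 1/(1 − R_j²), regressing each predictor on the others as described earlier (placeholder data with two deliberately correlated predictors):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # nearly collinear with x1
x3 = rng.normal(size=n)
Xp = np.column_stack([x1, x2, x3])          # predictors only (no intercept column here)

def vif(Xp, j):
    # R^2 from regressing predictor j on the remaining predictors (with intercept)
    others = np.column_stack([np.ones(n), np.delete(Xp, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, Xp[:, j], rcond=None)
    resid = Xp[:, j] - others @ beta
    r2 = 1 - np.sum(resid**2) / np.sum((Xp[:, j] - Xp[:, j].mean())**2)
    return 1 / (1 - r2)

print([round(vif(Xp, j), 1) for j in range(Xp.shape[1])])   # large VIFs for x1 and x2
```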