Applied Statistics Cheatsheet

Statistical Inference

An inference is a conclusion that patterns in the data are present in some broader context. A statistical inference is an inference justified by a probability model linking the data to the broader context.
• Observational Study: the group status of the subjects is established beyond the control of the investigator.
• Randomized Experiment: the investigator controls the assignment of experimental units to groups and uses a chance mechanism (like the flip of a coin) to make the assignment.

Causal Inference
Statistical inferences of cause-and-effect relationships can be drawn from randomized experiments, but not from observational studies.

Confounding Variables
A confounding variable is related both to group membership and to the outcome. Its presence makes it hard to establish the outcome as being a direct consequence of group membership.

Inference to Populations
Inferences to populations can be drawn from random sampling studies, but not otherwise. Random sampling ensures that all subpopulations are represented in the sample in roughly the same mix as in the overall population.

Simple Random Sample
A simple random sample of size n from a population is a subset of the population consisting of n members selected in such a way that every subset of size n is afforded the same chance of being selected.

Simple Linear Regression
µ{Y|X} = β0 + β1·X

Model Assumptions
1. Linearity
2. Normality: Y|X ∼ Normal
3. Constant variance: σ{Y|X} = σ
4. Independence

R Squared
Total sum of squares: SST = Σ(Yi − Ȳ)²
Regression sum of squares: SSR = Σ(Ŷi − Ȳ)²
Residual sum of squares: SSE = Σ(Yi − Ŷi)²
SST = SSR + SSE
R² = (SST − SSE)/SST = SSR/SST (often reported as a percentage)

Sampling Distribution
SD(b1) = σ̂ · √( 1 / ((n−1)·σx²) )
SD(b0) = σ̂ · √( 1/n + X̄² / ((n−1)·σx²) )
MSE = SSE / (n − 2)
(b1 − β1)/SE(b1) ∼ t(n−2),  (b0 − β0)/SE(b0) ∼ t(n−2)

Extra-Sums-of-Squares F-test (simple linear regression)
H0: β1 = 0
F-stat = (extra sum of squares / # β being tested) / (σ̂² from full model) = ((SSR_full − SSR_null)/1) / MSE

Matrix Form
Y = Xβ + ϵ, i.e. (Y1, ..., Yn)ᵀ = [1 X1; 1 X2; ...; 1 Xn]·(β0, β1)ᵀ + (ϵ1, ..., ϵn)ᵀ
Ψ = (Y − Xβ)ᵀ(Y − Xβ); minimizing Ψ gives β̂ = (XᵀX)⁻¹XᵀY

Multiple Regression
σ̂² = Σ(Yi − Ŷi)² / (n − p) = SSE / (n − p)
SD(bj) = σ̂ · √cjj, where cjj is the j-th diagonal element of (XᵀX)⁻¹
standardized bj ∼ t(n − p)
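A minimal R sketch tying the simple-regression quantities above together, assuming a hypothetical data frame dat with response y and explanatory variable x (names made up for illustration):

fit <- lm(y ~ x, data = dat)          # least-squares fit of mu{Y|X} = b0 + b1*X
summary(fit)                          # b0, b1, SE(b0), SE(b1), t-statistics on n-2 df, R^2
SST <- sum((dat$y - mean(dat$y))^2)   # total sum of squares
SSE <- sum(resid(fit)^2)              # residual sum of squares
SSR <- SST - SSE                      # regression sum of squares
c(R2 = SSR / SST, MSE = SSE / (nrow(dat) - 2))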
Least Squares Method
Minimize Q = Σ(Yi − b0 − b1·Xi)² = Σ(Yi − Ŷi)²
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b0 = Ȳ − b1·X̄
σ̂ = √( Σ(Yi − Ŷi)² / (n − 2) )

Confidence Interval (for the mean response at X0)
SD(µ̂{Y|X0}) = σ̂ · √( 1/n + (X0 − X̄)² / ((n−1)·σx²) )
standardized µ̂{Y|X0} ∼ t(n−2)

Prediction Interval (for a new observation at X0)
SD(Y|X0) = σ̂ · √( 1 + 1/n + (X0 − X̄)² / ((n−1)·σx²) )
standardized Y|X0 ∼ t(n−2)

Linear Combination of Coefficients
H0: c0β0 + c1β1 + ... + cpβp = 0
HA: c0β0 + c1β1 + ... + cpβp ≠ 0
est = c0b0 + c1b1 + ... + cpbp
Var(est) = c0²Var(b0) + ... + cp²Var(bp) + 2c0c1Cov(b0, b1) + ... + 2cp−1cpCov(bp−1, bp)
In matrix form, with c = (c0, ..., cp)ᵀ: Var(est) = cᵀCov(b)c, where Cov(b) = σ̂²(XᵀX)⁻¹
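The interval formulas map onto predict() and vcov() in R; a sketch, continuing the hypothetical fit above with a made-up new value x0:

x0  <- 10                                                  # hypothetical value of X
new <- data.frame(x = x0)
predict(fit, new, interval = "confidence", level = 0.95)   # CI for mu{Y|X0}
predict(fit, new, interval = "prediction", level = 0.95)   # PI for a new observation at X0
cvec <- c(1, x0)                                           # linear combination c0*b0 + c1*b1
est  <- sum(cvec * coef(fit))
se   <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))         # vcov(fit) = sigma-hat^2 * (X'X)^-1
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se          # t-based 95% CI for the combination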
Extra-Sums-of-Squares F-test (multiple regression)
H0: β1 = β2 = ... = βp = 0
F-stat = (extra sum of squares / # of β's being tested) / (σ̂² from full model)
       = ((SSR_full − SSR_reduced) / (df_reduced − df_full)) / MSE
MSE = SSE / (n − p − 1)

Weighted Regression
var(Yi|X) = σ² / wi
Q = Σ wi·(Yi − Ŷi)²
β̂ = (XᵀWX)⁻¹XᵀWY, where W = diag(w1, w2, ..., wn)

Adjusted R²
Only for model comparison, not for model assessment.
Adjusted R² = ( SST/(n−1) − SSE/(n−p) ) / ( SST/(n−1) )

Ridge and Lasso Regression
Lasso: minimize Σ(yi − ŷi)² + λ·Σ|βj|   (Σ|βj|: L1-norm penalty)
Ridge: minimize Σ(yi − ŷi)² + λ·Σβj²   (Σβj²: L2-norm penalty)

Leverage
Measures the distance between an observation's explanatory values and the mean of the explanatory values.
H = X(XᵀX)⁻¹Xᵀ
For the i-th observation: hi = Hii = ∂Ŷi/∂Yi
SD(residual_i) = σ·√(1 − hi),  h̄ = p/n
Cutoff: larger than 2p/n (p: the number of parameters)

Model Selection Strategies
Forward Selection: start with the null model.
Backward Selection: start with the full model.
Stepwise Selection:
1. Start with the null model.
2. Do one step of forward selection.
3. Do one step of backward elimination.
4. Repeat 2 and 3 until no explanatory variables can be added or removed.

Exhaustive Search Through All Subsets
Use the Cp statistic, R², Adjusted R², AIC and BIC.

Cp Statistic
The lower, the better.
Cp = p + (n − p)·(σ̂² − σ̂²_full) / σ̂²_full

Akaike's Information Criterion (AIC)
The lower, the better.
AIC = 2p + n·log(σ̂²) = 2p − 2·log(L)
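In R, the extra-sums-of-squares F-test is an anova() comparison of nested fits, and AIC-based stepwise selection is step(); a sketch with hypothetical predictors x1, x2, x3 in dat (ridge and lasso fits usually come from an add-on package such as glmnet):

full    <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1, data = dat)
anova(reduced, full)                   # F-test of H0: coefficients dropped from 'full' are 0
null <- lm(y ~ 1, data = dat)
step(null, scope = formula(full), direction = "both")   # stepwise selection by AIC
AIC(full); BIC(full)                   # "the lower, the better"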
Bayesian Information Criterion (BIC)
The lower, the better.
BIC = p·log(n) + n·log(σ̂²) = p·log(n) − 2·log(L)

Studentized Residual
studres_i = residual_i / (σ̂·√(1 − hi))
Roughly normally distributed; check observations with absolute studentized residual larger than 2.

Model Validation
For a new data set of size k, define the mean square prediction error as:
MSPE = Σ_{i=1}^{k} (Yi − Ŷi)² / k
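The corresponding R helpers, as a sketch (newdat is a hypothetical hold-out data set with the same columns as dat):

rstandard(fit)                   # studentized residuals residual_i / (sigma-hat * sqrt(1 - h_i))
hatvalues(fit)                   # leverages h_i; compare against 2p/n
pred <- predict(fit, newdata = newdat)
mean((newdat$y - pred)^2)        # MSPE on the hold-out set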
Cross Validation
Estimate prediction error (e.g. the MSPE above) by repeatedly splitting the data into training and validation sets.

Cook's Distance
Di = Σj (Ŷj(i) − Ŷj)² / (p·σ̂²) = (1/p)·(studres_i)²·( hi / (1 − hi) )
Ŷj(i) is the j-th fitted value when case i is left out of the dataset.
Cutoff: larger than 1 → influential.

Serial Correlation
First-Order Autoregression Model, AR(1):
Yt = β0 + β1·X1,t + ... + βk·Xk,t + ϵt
ϵt = α·ϵ_{t−1} + ψt,  ψt ∼ N(0, σ²)
Estimating α: use the correlation coefficient between subsequent ordinary regression residuals.

Partial Autocorrelation Function (PACF)
A plot of the partial autocorrelations against lags.
Cutoff: [−2/√n, 2/√n]

Large-Sample Test
If one estimates the serial correlation coefficient from a series of n independent observations with constant variance, the estimate has an approximately normal distribution with mean 0 and standard deviation 1/√n.

Model Diagnosis
1. Residual vs. fitted value plot:
   • Pattern?
   • Non-constant variance?
   • Influential observations?
2. QQ-plot: normality.
3. Cook's distance and leverage plot.
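The usual R calls for these diagnostics (a sketch based on the hypothetical fit above):

cooks.distance(fit)        # Cook's distances; values above 1 flag influential cases
plot(fit, which = 1)       # residuals vs. fitted values
plot(fit, which = 2)       # normal QQ-plot of the residuals
plot(fit, which = 5)       # residuals vs. leverage, with Cook's distance contours
r <- resid(fit)
cor(r[-1], r[-length(r)])  # rough estimate of alpha from subsequent residuals
pacf(r)                    # partial autocorrelations with approximate +/- 2/sqrt(n) bounds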
Variance Inflation Factor
Var(b̂j) = σ̂²·(XᵀX)⁻¹_{j+1,j+1} = ( σ̂² / ((n−1)·Var(Xj)) ) · ( 1 / (1 − Rj²) )
Rj² := R² for the regression of Xj on the other covariates.
VIF := 1 / (1 − Rj²)
Cut-off rule of thumb: VIF(b̂j) > 5 indicates high multicollinearity.

Natural Cubic Spline
splines package in R.
1. Divide the range of X into intervals.
2. Inside each interval, a cubic polynomial is fitted.
3. At the interval split points (knots), the cubic polynomials are continuous and have continuous first and second derivatives.
R example: fit2 <- lm(ozone ~ ns(temperature, knots = c(70, 90)))
df = 2 + # of knots
When to use natural cubic splines?
1. Smoothing.
2. To model confounding variables.
3. Higher-order terms are required for X.

Canonical Correlation Analysis (CCA)
CCA finds linear combinations in the two sets of variables that have the largest possible correlations.
R command: cancor

Bartlett's Chi-square Test
How many pairs of canonical variables are significant?
V = −( (n − 1) − (p + q + 1)/2 )·ln(k)
n: number of observations
p: number of X variables minus the number of times the test has been applied
q: number of Y variables minus the number of times the test has been applied
k: (1 − r_t²)(1 − r_{t+1}²)···(1 − r_T²)
rj²: the squared correlation between the j-th pair of canonical variables
T: the total number of canonical variables
t: the number of times the test has been applied
V ∼ χ²_{pq} under H0: the remaining canonical correlations are zero (reject H0 → the pair is significant).

Bootstrap
Assumption: independence between samples.

Non-parametric Bootstrap
Repeated re-sampling with replacement.
The number of different bootstrap samples is C(2n−1, n) for sample size n.
Can obtain statistics (e.g. mean, standard deviation) of the estimator from only one observed set of samples.

Parametric Bootstrap
1. A parametric model is fitted to the data (often by maximum likelihood).
2. Samples of random numbers are drawn from this fitted model.
3. Calculate the estimate/quantity of interest from these samples.
4. Repeat 2 and 3 many times, as for other bootstrap methods.
The parametric bootstrap will be more accurate than the non-parametric bootstrap if the parametric assumption is true, and less accurate if it is false.

Bootstrap Regression
Let X be the explanatory variable and Y be the response variable.
Case-based:
1. Re-sample (X, Y) pairs.
2. Fit a regression model on the bootstrap sample.
3. Repeat 1 and 2 several times.
Problem: when X is an indicator variable.
Residual-based:
1. Fit a regression model on the original sample.
2. Re-sample the residuals from step 1.
3. Add the bootstrap residuals to Ŷ to form the new Y′.
4. Fit a regression model on Y′ and X.
5. Repeat 2–4 several times.
Solves the problem of X being extremely skewed (or an indicator).

Bootstrap Confidence Interval
Useful when the distribution of the estimator is skewed or not normal.
Use quantiles of the bootstrap estimates as the boundaries of the confidence interval.
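A case-based (non-parametric) bootstrap of the slope with a percentile confidence interval, sketched on the same hypothetical data frame dat:

B <- 2000
boot_b1 <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)    # re-sample (X, Y) pairs with replacement
  coef(lm(y ~ x, data = dat[idx, ]))["x"]     # refit and keep the slope
})
quantile(boot_b1, c(0.025, 0.975))            # percentile bootstrap 95% CI for beta1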
Principal Component Analysis
z = U_reduce·X (project the original X variables onto the leading principal components)
• Sensitive to scale: standardize before fitting!
• Use z in regression: solves multicollinearity and increases the residual degrees of freedom.
• Benefits: low dimension and no correlation among the components.
• Drawback: hard to interpret.

Some True/False Questions
1. A multiple linear regression model should only include explanatory variables that have a normal distribution. FALSE
2. Adding an extra explanatory variable to a simple linear regression model cannot increase the significance, as measured by the t-test, of the explanatory variable that is already in the model. FALSE
3. The main reason logistic regression is preferred to multiple linear regression for a categorical response with two categories is that the logistic regression model allows for the non-constant variance of the response. FALSE
4. The multiple linear regression µ(Y|X) = β0 + β1X1 + β2X2 + β3X3 will have the same R² value as the multiple linear regression µ(Y|Z) = β0 + β1Z1 + β2Z2 + β3Z3, where Z1, Z2, Z3 are the first three principal component variables of X1, X2, X3. TRUE (see the sketch after these questions)
5. The bootstrap cannot be used for hypothesis testing. FALSE
6. It is possible for the first three principal component variables (Z1, ..., Z3) from a principal components analysis of ten variables (X1, ..., X10) to explain 100% of the total variation in the ten original variables X1, ..., X10. TRUE
7. The estimated mean response for the regression µ(Y|X) = β0 + β1X1 + β2X2 + β3(X1 × X2) corresponding to a particular set of explanatory variable values X1 = 3, X2 = 2 is 15. Based on this information we would estimate that there is more than a 50% chance that the response variable, given X1 = 3, X2 = 2, would take on a value greater than 17. FALSE
8. Modelling marital status (Single, Married, Divorced) as a categorical explanatory variable in a Poisson log-linear regression model will require three parameters to be estimated, not including the intercept and other variables included in the model. FALSE
9. A fitted linear regression model based on 500 observations returns b6 = 0.21, SE(b6) = 0.06. You are given two 90% confidence intervals, (a) (0.09, 0.33) and (b) (0.13, 0.42), that have been computed based on the fitted regression model. One of the intervals was computed using the bootstrap and the other using standard linear regression theory. Interval (b) (0.13, 0.42) is the confidence interval computed using the standard theory. FALSE
10. There are 64 possible logistic regression models that can be fitted in a situation where there are seven explanatory variables and we are only interested in models that contain these seven variables; that is, we are not including interaction terms, terms for curvature, etc. FALSE
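A quick numerical check of statement 4 (a sketch on simulated data; all names are made up):

set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)      # three explanatory variables
y <- drop(X %*% c(1, -2, 0.5)) + rnorm(100)
Z <- prcomp(X, scale. = TRUE)$x            # all three principal component scores
summary(lm(y ~ X))$r.squared               # the two R^2 values agree because Z spans
summary(lm(y ~ Z))$r.squared               # the same column space as X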
