Applied Statistics Cheatsheet

Statistical Inference

An inference is a conclusion that patterns in the data are present in some broader context. A statistical inference is an inference justified by a probability model linking the data to the broader context.
• Observational Study: the group status of the subjects is established beyond the control of the investigator.
• Randomized Experiment: the investigator controls the assignment of experimental units to groups and uses a chance mechanism (like the flip of a coin) to make the assignment.

Causal Inference
Statistical inferences of cause-and-effect relationships can be drawn from randomized experiments, but not from observational studies.

Confounding Variables
A confounding variable is related both to group membership and to the outcome. Its presence makes it hard to establish the outcome as being a direct consequence of group membership.

Inference to Populations
Inferences to populations can be drawn from random sampling studies, but not otherwise. Random sampling ensures that all subpopulations are represented in the sample in roughly the same mix as in the overall population.

Simple Random Sample
A simple random sample of size n from a population is a subset of the population consisting of n members selected in such a way that every subset of size n is afforded the same chance of being selected.

Simple Linear Regression
µ{Y|X} = β0 + β1·X

Model Assumptions
1. Linearity
2. Normality: Y|X ∼ Normal
3. Constant variance: σ{Y|X} = σ
4. Independence

R Squared
Total sum of squares: SST = Σ(Yi − Ȳ)²
Regression sum of squares: SSR = Σ(Ŷi − Ȳ)²
Residual sum of squares: SSE = Σ(Yi − Ŷi)²
SST = SSR + SSE
R² = (SST − SSE)/SST = SSR/SST (often reported as a percentage)

Sampling Distribution
SD(b1) = σ̂ · √( 1 / ((n−1)·σx²) )
SD(b0) = σ̂ · √( 1/n + X̄² / ((n−1)·σx²) )
MSE = SSE / (n − 2)
(b1 − β1)/SE(b1) ∼ t(n−2),  (b0 − β0)/SE(b0) ∼ t(n−2)

Extra-Sums-of-Squares F-test (simple linear regression)
H0: β1 = 0
F-stat = (extra sum of squares / # β being tested) / (σ̂² from full model) = ((SSR_full − SSR_null)/1) / MSE

Matrix Form
Y = Xβ + ϵ, i.e. (Y1, ..., Yn)ᵀ = [1 X1; 1 X2; ...; 1 Xn]·(β0, β1)ᵀ + (ϵ1, ..., ϵn)ᵀ
Ψ = (Y − Xβ)ᵀ(Y − Xβ); minimizing Ψ gives β̂ = (XᵀX)⁻¹XᵀY

Multiple Regression
σ̂² = Σ(Yi − Ŷi)² / (n − p) = SSE / (n − p)
SD(bj) = σ̂ · √cjj, where cjj is the j-th diagonal element of (XᵀX)⁻¹
standardized bj ∼ t(n − p)
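A minimal R sketch tying the simple-regression quantities above together, assuming a hypothetical data frame dat with response y and explanatory variable x (names made up for illustration):

fit <- lm(y ~ x, data = dat)          # least-squares fit of mu{Y|X} = b0 + b1*X
summary(fit)                          # b0, b1, SE(b0), SE(b1), t-statistics on n-2 df, R^2
SST <- sum((dat$y - mean(dat$y))^2)   # total sum of squares
SSE <- sum(resid(fit)^2)              # residual sum of squares
SSR <- SST - SSE                      # regression sum of squares
c(R2 = SSR / SST, MSE = SSE / (nrow(dat) - 2))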
Least Squares Method
Minimize Q = Σ(Yi − b0 − b1·Xi)² = Σ(Yi − Ŷi)²
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b0 = Ȳ − b1·X̄
σ̂ = √( Σ(Yi − Ŷi)² / (n − 2) )

Confidence Interval (for the mean response at X0)
SD(µ̂{Y|X0}) = σ̂ · √( 1/n + (X0 − X̄)² / ((n−1)·σx²) )
standardized µ̂{Y|X0} ∼ t(n−2)

Prediction Interval (for a new observation at X0)
SD(Y|X0) = σ̂ · √( 1 + 1/n + (X0 − X̄)² / ((n−1)·σx²) )
standardized Y|X0 ∼ t(n−2)

Linear Combination of Coefficients
H0: c0β0 + c1β1 + ... + cpβp = 0
HA: c0β0 + c1β1 + ... + cpβp ≠ 0
est = c0b0 + c1b1 + ... + cpbp
Var(est) = c0²Var(b0) + ... + cp²Var(bp) + 2c0c1Cov(b0, b1) + ... + 2cp−1cpCov(bp−1, bp)
In matrix form, with c = (c0, ..., cp)ᵀ: Var(est) = cᵀCov(b)c, where Cov(b) = σ̂²(XᵀX)⁻¹
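The interval formulas map onto predict() and vcov() in R; a sketch, continuing the hypothetical fit above with a made-up new value x0:

x0  <- 10                                                  # hypothetical value of X
new <- data.frame(x = x0)
predict(fit, new, interval = "confidence", level = 0.95)   # CI for mu{Y|X0}
predict(fit, new, interval = "prediction", level = 0.95)   # PI for a new observation at X0
cvec <- c(1, x0)                                           # linear combination c0*b0 + c1*b1
est  <- sum(cvec * coef(fit))
se   <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))         # vcov(fit) = sigma-hat^2 * (X'X)^-1
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se          # t-based 95% CI for the combination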
Extra-Sums-of-Squares F-test (multiple regression)
H0: β1 = β2 = ... = βp = 0
F-stat = (extra sum of squares / # of β's being tested) / (σ̂² from full model)
       = ((SSR_full − SSR_reduced) / (df_reduced − df_full)) / MSE
MSE = SSE / (n − p − 1)

Weighted Regression
var(Yi|X) = σ² / wi
Q = Σ wi·(Yi − Ŷi)²
β̂ = (XᵀWX)⁻¹XᵀWY, where W = diag(w1, w2, ..., wn)

Adjusted R²
Only for model comparison, not for model assessment.
Adjusted R² = ( SST/(n−1) − SSE/(n−p) ) / ( SST/(n−1) )

Ridge and Lasso Regression
Lasso: minimize Σ(yi − ŷi)² + λ·Σ|βj|   (Σ|βj|: L1-norm penalty)
Ridge: minimize Σ(yi − ŷi)² + λ·Σβj²   (Σβj²: L2-norm penalty)

Leverage
Measures the distance between an observation's explanatory values and the mean of the explanatory values.
H = X(XᵀX)⁻¹Xᵀ
For the i-th observation: hi = Hii = ∂Ŷi/∂Yi
SD(residual_i) = σ·√(1 − hi),  h̄ = p/n
Cutoff: larger than 2p/n (p: the number of parameters)

Model Selection Strategies
Forward Selection: start with the null model.
Backward Selection: start with the full model.
Stepwise Selection:
1. Start with the null model.
2. Do one step of forward selection.
3. Do one step of backward elimination.
4. Repeat 2 and 3 until no explanatory variables can be added or removed.

Exhaustive Search Through All Subsets
Use the Cp statistic, R², Adjusted R², AIC and BIC.

Cp Statistic
The lower, the better.
Cp = p + (n − p)·(σ̂² − σ̂²_full) / σ̂²_full

Akaike's Information Criterion (AIC)
The lower, the better.
AIC = 2p + n·log(σ̂²) = 2p − 2·log(L)
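In R, the extra-sums-of-squares F-test is an anova() comparison of nested fits, and AIC-based stepwise selection is step(); a sketch with hypothetical predictors x1, x2, x3 in dat (ridge and lasso fits usually come from an add-on package such as glmnet):

full    <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1, data = dat)
anova(reduced, full)                   # F-test of H0: coefficients dropped from 'full' are 0
null <- lm(y ~ 1, data = dat)
step(null, scope = formula(full), direction = "both")   # stepwise selection by AIC
AIC(full); BIC(full)                   # "the lower, the better"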
Bayesian Information Criterion (BIC)
The lower, the better.
BIC = p·log(n) + n·log(σ̂²) = p·log(n) − 2·log(L)

Studentized Residual
studres_i = residual_i / (σ̂·√(1 − hi))
Roughly normally distributed; check observations with absolute studentized residual larger than 2.

Model Validation
For a new data set of size k, define the mean square prediction error as:
MSPE = Σ_{i=1}^{k} (Yi − Ŷi)² / k
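The corresponding R helpers, as a sketch (newdat is a hypothetical hold-out data set with the same columns as dat):

rstandard(fit)                   # studentized residuals residual_i / (sigma-hat * sqrt(1 - h_i))
hatvalues(fit)                   # leverages h_i; compare against 2p/n
pred <- predict(fit, newdata = newdat)
mean((newdat$y - pred)^2)        # MSPE on the hold-out set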
Cross Validation
Estimate prediction error (e.g. the MSPE above) by repeatedly splitting the data into training and validation sets.

Cook's Distance
Di = Σj (Ŷj(i) − Ŷj)² / (p·σ̂²) = (1/p)·(studres_i)²·( hi / (1 − hi) )
Ŷj(i) is the j-th fitted value when case i is left out of the dataset.
Cutoff: larger than 1 → influential.

Serial Correlation
First-Order Autoregression Model, AR(1):
Yt = β0 + β1·X1,t + ... + βk·Xk,t + ϵt
ϵt = α·ϵ_{t−1} + ψt,  ψt ∼ N(0, σ²)
Estimating α: use the correlation coefficient between subsequent ordinary regression residuals.

Partial Autocorrelation Function (PACF)
A plot of the partial autocorrelations against lags.
Cutoff: [−2/√n, 2/√n]

Large-Sample Test
If one estimates the serial correlation coefficient from a series of n independent observations with constant variance, the estimate has an approximately normal distribution with mean 0 and standard deviation 1/√n.

Model Diagnosis
1. Residual vs. fitted value plot:
   • Pattern?
   • Non-constant variance?
   • Influential observations?
2. QQ-plot: normality.
3. Cook's distance and leverage plot.
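The usual R calls for these diagnostics (a sketch based on the hypothetical fit above):

cooks.distance(fit)        # Cook's distances; values above 1 flag influential cases
plot(fit, which = 1)       # residuals vs. fitted values
plot(fit, which = 2)       # normal QQ-plot of the residuals
plot(fit, which = 5)       # residuals vs. leverage, with Cook's distance contours
r <- resid(fit)
cor(r[-1], r[-length(r)])  # rough estimate of alpha from subsequent residuals
pacf(r)                    # partial autocorrelations with approximate +/- 2/sqrt(n) bounds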
Variance Inflation Factor
Var(b̂j) = σ̂²·(XᵀX)⁻¹_{j+1,j+1} = ( σ̂² / ((n−1)·Var(Xj)) ) · ( 1 / (1 − Rj²) )
Rj² := R² for the regression of Xj on the other covariates.
VIF := 1 / (1 − Rj²)
Cut-off rule of thumb: VIF(b̂j) > 5 indicates high multicollinearity.

Natural Cubic Spline
splines package in R.
1. Divide the range of X into intervals.
2. Inside each interval, a cubic polynomial is fitted.
3. At the interval split points (knots), the cubic polynomials are continuous and have continuous first and second derivatives.
R example: fit2 <- lm(ozone ~ ns(temperature, knots = c(70, 90)))
df = 2 + # of knots
When to use natural cubic splines?
1. Smoothing.
2. To model confounding variables.
3. Higher-order terms are required for X.

Canonical Correlation Analysis (CCA)
CCA finds linear combinations in the two sets of variables that have the largest possible correlations.
R command: cancor

Bartlett's Chi-square Test
How many pairs of canonical variables are significant?
V = −( (n − 1) − (p + q + 1)/2 )·ln(k)
n: number of observations
p: number of X variables minus the number of times the test has been applied
q: number of Y variables minus the number of times the test has been applied
k: (1 − r_t²)(1 − r_{t+1}²)···(1 − r_T²)
rj²: the squared correlation between the j-th pair of canonical variables
T: the total number of canonical variables
t: the number of times the test has been applied
V ∼ χ²_{pq} under H0: the remaining canonical correlations are zero (reject H0 → the pair is significant).

Bootstrap
Assumption: independence between samples.

Non-parametric Bootstrap
Repeated re-sampling with replacement.
The number of different bootstrap samples is C(2n−1, n) for sample size n.
Can obtain statistics (e.g. mean, standard deviation) of the estimator from only one observed set of samples.

Parametric Bootstrap
1. A parametric model is fitted to the data (often by maximum likelihood).
2. Samples of random numbers are drawn from this fitted model.
3. Calculate the estimate/quantity of interest from these samples.
4. Repeat 2 and 3 many times, as for other bootstrap methods.
The parametric bootstrap will be more accurate than the non-parametric bootstrap if the parametric assumption is true, and less accurate if it is false.

Bootstrap Regression
Let X be the explanatory variable and Y be the response variable.
Case-based:
1. Re-sample (X, Y) pairs.
2. Fit a regression model on the bootstrap sample.
3. Repeat 1 and 2 several times.
Problem: when X is an indicator variable.
Residual-based:
1. Fit a regression model on the original sample.
2. Re-sample the residuals from step 1.
3. Add the bootstrap residuals to Ŷ to form the new Y′.
4. Fit a regression model on Y′ and X.
5. Repeat 2–4 several times.
Solves the problem of X being extremely skewed (or an indicator).

Bootstrap Confidence Interval
Useful when the distribution of the estimator is skewed or not normal.
Use quantiles of the bootstrap estimates as the boundaries of the confidence interval.
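A case-based (non-parametric) bootstrap of the slope with a percentile confidence interval, sketched on the same hypothetical data frame dat:

B <- 2000
boot_b1 <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)    # re-sample (X, Y) pairs with replacement
  coef(lm(y ~ x, data = dat[idx, ]))["x"]     # refit and keep the slope
})
quantile(boot_b1, c(0.025, 0.975))            # percentile bootstrap 95% CI for beta1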
Principal Component Analysis
z = U_reduce·X (project the original X variables onto the leading principal components)
• Sensitive to scale: standardize before fitting!
• Use z in regression: solves multicollinearity and increases the residual degrees of freedom.
• Benefits: low dimension and no correlation among the components.
• Drawback: hard to interpret.

Some True/False Questions
1. A multiple linear regression model should only include explanatory variables that have a normal distribution. FALSE
2. Adding an extra explanatory variable to a simple linear regression model cannot increase the significance, as measured by the t-test, of the explanatory variable that is already in the model. FALSE
3. The main reason logistic regression is preferred to multiple linear regression for a categorical response with two categories is that the logistic regression model allows for the non-constant variance of the response. FALSE
4. The multiple linear regression µ(Y|X) = β0 + β1X1 + β2X2 + β3X3 will have the same R² value as the multiple linear regression µ(Y|Z) = β0 + β1Z1 + β2Z2 + β3Z3, where Z1, Z2, Z3 are the first three principal component variables of X1, X2, X3. TRUE (see the sketch after these questions)
5. The bootstrap cannot be used for hypothesis testing. FALSE
6. It is possible for the first three principal component variables (Z1, ..., Z3) from a principal components analysis of ten variables (X1, ..., X10) to explain 100% of the total variation in the ten original variables X1, ..., X10. TRUE
7. The estimated mean response for the regression µ(Y|X) = β0 + β1X1 + β2X2 + β3(X1 × X2) corresponding to a particular set of explanatory variable values X1 = 3, X2 = 2 is 15. Based on this information we would estimate that there is more than a 50% chance that the response variable, given X1 = 3, X2 = 2, would take on a value greater than 17. FALSE
8. Modelling marital status (Single, Married, Divorced) as a categorical explanatory variable in a Poisson log-linear regression model will require three parameters to be estimated, not including the intercept and other variables included in the model. FALSE
9. A fitted linear regression model based on 500 observations returns b6 = 0.21, SE(b6) = 0.06. You are given two 90% confidence intervals, (a) (0.09, 0.33) and (b) (0.13, 0.42), that have been computed based on the fitted regression model. One of the intervals was computed using the bootstrap and the other using standard linear regression theory. Interval (b) (0.13, 0.42) is the confidence interval computed using the standard theory. FALSE
10. There are 64 possible logistic regression models that can be fitted in a situation where there are seven explanatory variables and we are only interested in models that contain these seven variables; that is, we are not including interaction terms, terms for curvature, etc. FALSE
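A quick numerical check of statement 4 (a sketch on simulated data; all names are made up):

set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)      # three explanatory variables
y <- drop(X %*% c(1, -2, 0.5)) + rnorm(100)
Z <- prcomp(X, scale. = TRUE)$x            # all three principal component scores
summary(lm(y ~ X))$r.squared               # the two R^2 values agree because Z spans
summary(lm(y ~ Z))$r.squared               # the same column space as X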
