Statistics Formula Cheatsheet

Observational Study
➔ there can always be lurking variables affecting results
➔ e.g., a strong positive association between shoe size and intelligence for boys
➔ **should never show causation

Experimental Study
➔ lurking variables can be controlled; can give good evidence for causation

Quartiles
➔ split the ranked data into 4 equal groups
◆ Box and Whisker Plot

Sample Standard Deviation
➔ s = √( Σ (xᵢ − x̄)² / (n − 1) ), summing over i = 1 to n
◆ highly affected by outliers
◆ has same units as original data
◆ in finance, a horrible measure of risk (trampoline example)
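A minimal sketch of the s formula and the quartile idea above; the sample values are made up for illustration:

```python
import math
import statistics

# Hypothetical sample (made-up values, just to illustrate the formulas)
x = [4, 8, 6, 5, 3, 7, 9, 5]
n = len(x)

# Sample mean
x_bar = sum(x) / n

# Sample standard deviation: s = sqrt( sum (x_i - x_bar)^2 / (n - 1) )
s = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))

# Quartiles: cut points that split the ranked data into 4 equal groups
q1, q2, q3 = statistics.quantiles(x, n=4)

print(f"mean = {x_bar:.3f}, s = {s:.3f}, quartiles = {q1}, {q2}, {q3}")
```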
Binomial Distribution
➔ doing something n times
➔ only 2 outcomes: success or failure
➔ trials are independent of each other
➔ probability remains constant

Binomial Example (cont'd next page)
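The worked binomial example the sheet refers to was a figure; as a stand-in, here is a sketch of the binomial probability formula with made-up n, p, and k:

```python
from math import comb

# Made-up example: n trials, success probability p, asking for exactly k successes
n, p, k = 10, 0.3, 4

# Binomial probability: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
print(f"P(X = {k}) = {prob:.4f}")
```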
Uniform Distribution
➔ Mean for uniform distribution: E(X) = (a + b)/2
➔ Variance for uniform distribution: Var(X) = (b − a)²/12

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ X ~ N(μ, σ²)
➔ If n > 30, we can substitute s for σ.

Sums of Normals
➔ Sums of Normals Example

Confidence Intervals
➔ tell us how good our estimate is
➔ **Want high confidence, narrow interval
➔ **As confidence increases, the interval also widens

A. One Sample Proportion
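The one sample proportion interval itself was shown in a figure; as a sketch, the usual large-sample form p̂ ± z·√(p̂(1 − p̂)/n) can be computed like this (the counts are made up):

```python
import math

# Made-up data: x successes out of n trials
x, n = 54, 120
p_hat = x / n

# 95% confidence -> z = 1.96
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)

print(f"p_hat = {p_hat:.3f}")
print(f"95% CI for p: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```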
Example of Sample Proportion Problem

Hypothesis Testing
➔ Null Hypothesis: H0, a statement of no change; it is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: Ha, a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.
2. Test Statistic Approach (Population Mean): see the sketch below

3. Test Statistic Approach (Population Proportion): see the sketch below

4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), H0 must go: reject the null hypothesis
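The test statistic formulas for sections 2 and 3 were figures; as a hedged sketch, the usual large-sample versions are z = (x̄ − μ0)/(s/√n) for a mean and z = (p̂ − p0)/√(p0(1 − p0)/n) for a proportion, shown here with made-up numbers:

```python
import math

# Made-up summary numbers for illustration
x_bar, mu0, s, n = 52.3, 50.0, 8.1, 64     # population mean test (n > 30)
p_hat, p0, m = 0.46, 0.50, 200             # population proportion test

# z statistic for a mean, with s substituted for sigma
z_mean = (x_bar - mu0) / (s / math.sqrt(n))

# z statistic for a proportion
z_prop = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / m)

print(f"z (mean) = {z_mean:.2f}, z (proportion) = {z_prop:.2f}")
```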
Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Test Statistic for Two Proportions
➔ Calculate Confidence Interval
2. Comparing Two Means (large independent samples, n > 30)
➔ Calculating Confidence Interval: see the sketch below

Matched Pairs
➔ Two samples are DEPENDENT
➔ Example:
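The two-sample formulas themselves were figures; below is a minimal sketch of the large-sample confidence interval for a difference of two means, (x̄1 − x̄2) ± z·√(s1²/n1 + s2²/n2), with made-up summary statistics:

```python
import math

# Made-up summary statistics for two independent large samples (n > 30 each)
x1_bar, s1, n1 = 75.2, 10.4, 80
x2_bar, s2, n2 = 71.8, 9.7, 90

z = 1.96  # 95% confidence
diff = x1_bar - x2_bar
se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# CI for mu1 - mu2: (x1_bar - x2_bar) +/- z * sqrt(s1^2/n1 + s2^2/n2)
print(f"95% CI for mu1 - mu2: ({diff - z * se:.2f}, {diff + z * se:.2f})")
```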
Simple Linear Regression
➔ used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables)
➔ Ŷ = b0 + b1·X
➔ Residual: e = Y − Ŷ
➔ Fitting error: eᵢ = Yᵢ − Ŷᵢ = Yᵢ − b0 − b1·Xᵢ
◆ e is the part of Y not related to X
➔ Values of b0 and b1 which minimize the residual sum of squares (see the sketch after this list):
◆ slope: b1 = r·(sy / sx)
◆ intercept: b0 = Ȳ − b1·X̄
➔ Interpretation of slope: for each additional x value (e.g., mile on the odometer), the y value decreases/increases by an average of the b1 value
➔ Interpretation of y-intercept: plug in 0 for x and the value you get for ŷ is the y-intercept (e.g., ŷ = 3.25 − 0.0614·SkippedClass: a student who skips no classes has a GPA of 3.25)
➔ **danger of extrapolation: if an x value is outside of our data set, we can't confidently predict the fitted y value
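A minimal sketch of the least-squares formulas above (b1 = r·sy/sx, b0 = Ȳ − b1·X̄); the odometer-style data is made up:

```python
import math

# Made-up (x, y) data for illustration, e.g., odometer reading (x) vs. price (y)
x = [20, 35, 50, 65, 80, 95]
y = [14.9, 14.1, 13.4, 12.8, 12.1, 11.5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx          # slope: b1 = r * (sy / sx)
b0 = y_bar - b1 * x_bar   # intercept: b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
```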
Properties of the Residuals and Fitted Values
1. Mean of the residuals = 0; sum of the residuals = 0
2. Mean of the original values is the same as the mean of the fitted values: mean(Ŷ) = mean(Y)
3.
4. Correlation Matrix
➔ corr(Ŷ, e) = 0

A Measure of Fit: R²
➔ R²: coefficient of determination
➔ R² = SSR/SST = 1 − SSE/SST
➔ R² is between 0 and 1; the closer R² is to 1, the better the fit
➔ Good fit: SSR is big, SSE is small
➔ SST = SSR: perfect fit
➔ Interpretation of R²: e.g., 65% of the variation in the selling price is explained by the variation in odometer reading; the rest, 35%, remains unexplained by this model
➔ **R² doesn't indicate whether the model is adequate**
➔ As you add more X's to the model, R² goes up
➔ Guide to finding SSR, SSE, SST: see the sketch below
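The "Guide to finding SSR, SSE, SST" was a figure; as a sketch, the three sums of squares and R² can be computed directly from observed and fitted values (both lists below are made up):

```python
# Made-up observed values and hypothetical least-squares fitted values
y     = [14.9, 14.1, 13.4, 12.8, 12.1, 11.5]
y_hat = [14.8, 14.2, 13.5, 12.7, 12.0, 11.6]

y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression (explained) sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error (residual) sum of squares

# For a least-squares fit, SST = SSR + SSE, so R^2 = SSR/SST = 1 - SSE/SST
r_squared = ssr / sst
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}, R^2 = {r_squared:.3f}")
```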
Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than the thing itself
2.

Example of Prediction Intervals:

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether the slope (β1) is needed in our model
➔ H0: β1 = 0 (don't need x); Ha: β1 ≠ 0 (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Standard Errors for b1 and b0
➔ standard errors measure the uncertainty caused by noise
➔ sb0: amount of uncertainty in our estimate of β0 (small s good, large s bad)
➔ sb1: amount of uncertainty in our estimate of β1

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values
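A minimal sketch of the slope test described above, assuming hypothetical regression output (the b1 and sb1 values are made up):

```python
# Hypothetical regression output (slope and its standard error are made-up numbers)
b1  = -0.0614   # estimated slope
sb1 = 0.0112    # standard error of the slope, s_b1

# Two-sided test of H0: beta1 = 0 vs Ha: beta1 != 0
t = b1 / sb1
print(f"t = {t:.2f}")
print("Need X in the model" if abs(t) > 1.96 else "No evidence that X is needed")
```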
➔ Outliers
◆ Regression likes to move towards outliers (shows up as R² being really high)
◆ want to remove an outlier that is extreme in both x and y
➔ Nonlinearity (ovtest)
◆ Plotting residuals vs. fitted values will show a relationship if the data is nonlinear (R² also high)
◆ ovtest: a significant test statistic indicates that polynomial terms should be added
◆ H0: data = no transformation; Ha: data ≠ no transformation
◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, and eliminates heteroskedasticity
◆ **Only take log of X variable
➔ Normality (sktest)
◆ H0: data = normality; Ha: data ≠ normality
◆ don't want to reject the null hypothesis; the p-value should be big
➔ Homoskedasticity (hettest)
◆ H0: data = homoskedasticity; Ha: data ≠ homoskedasticity
➔ Multicollinearity
◆ when x variables are highly correlated with each other
◆ R² > 0.9
◆ pairwise correlation > 0.9
◆ correlate all x variables, include the y variable, and drop the x variable that is less correlated to y (see the sketch below)

Summary of Regression Output
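The sheet's diagnostics reference Stata commands (ovtest, sktest, hettest); as a rough stand-in for the multicollinearity check only, here is a sketch that correlates all x variables plus y using pandas and a made-up data frame:

```python
import pandas as pd

# Made-up data frame: two x variables and y, just to illustrate the check
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],   # nearly a multiple of x1 -> highly correlated
    "y":  [3, 5, 8, 9, 12, 14],
})

# Correlation matrix of all x variables plus y
corr = df.corr()
print(corr)

# Pairwise correlation between the x variables; > 0.9 suggests multicollinearity
print("corr(x1, x2) =", round(corr.loc["x1", "x2"], 3))

# Of two collinear x's, the sheet's rule is to drop the one less correlated with y
print("corr(x1, y) =", round(corr.loc["x1", "y"], 3))
print("corr(x2, y) =", round(corr.loc["x2", "y"], 3))
```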