
Cheat Sheet Stats For Exam

This document contains a cheat sheet of key statistical concepts and formulas for an exam on statistical design and analysis. It defines important terms like population, sample, parameter, statistic, and hypothesis testing. It also summarizes common statistical tests and analyses for different variable types, including measures of central tendency, variability, confidence intervals, hypothesis testing, and more. Formulas are provided for calculating z-scores, t-scores, standard errors, margins of error, and confidence intervals. Visualization techniques are also outlined for different variable combinations.

Uploaded by Urbi Roy Barman


Cheat sheet stats for exam

Statistical Design and Analysis (University of Technology Sydney)




Statistical inference is the process of using data from a sample to gain information about a population.

A sample statistic is an estimate of the value of a descriptive characteristic of the population.

A parameter is a number that describes some aspect of a population.

When the mean is larger than the median, the distribution is skewed to the right.

A statistic is a number that is computed from data in a sample.

Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some way. If bias occurs, we cannot trust generalizations to the rest of the population. Causes: question wording, context, inaccurate responses, poor sampling methods. Random sampling is needed; control groups, placebos and blinding reduce bias.

Correlation does not equal causation.

Response/explanatory variables: when we use one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.

Confounding variable: an additional variable that is associated with both the explanatory variable and the response variable. Confounding is minimised by random allocation of subjects to treatment groups.

Bootstrap statistic: the statistic computed on a bootstrap sample.

Bootstrap distribution: the distribution of many bootstrap statistics.

Our interest is in the variability of the statistics from the bootstrap samples, which will be similar to the SE from the true population. The sampling distribution is centred around the population parameter, while the bootstrap distribution is centred around the sample statistic.

Margin of error = 1.96 x SE for a 95% confidence interval.

Hypothesis tests (statistical tests): remember to DEFINE parameters.
Null hypothesis (H0): no effect or difference (status quo); uses =.
Alternative hypothesis (Ha): the claim for which we seek evidence (not equal to, less than, or greater than).
REJECT H0 when the p-value is less than 0.05 for 5% significance.
Hypotheses are always about population parameters and not sample statistics, so only population notation is used.
Never accept the null; only reject or do not reject the null.
Notation              Population    Sample
Mean                  μ             x̄
Proportion            p             p̂
Correlation           ρ             r
Standard deviation    σ             s

Resistance: a metric is resistant when it is not distorted by skewed data. Medians are resistant; means are not.

Standard deviation: in a bell-shaped curve, 95% of the data should fall within 1.96 standard deviations of the mean.

Z-score: how many standard deviations an observation is from the mean: z = (observed − mean) / standard deviation.

IQR = Q3 − Q1, where Q3 = the median of the values above the median.

Range = maximum value − minimum value.

Outliers: use the mean and standard deviation for shape if there are NO outliers; use the IQR and median if there are outliers.

Five-number summary: Min, Q1, Median, Q3, Max.

                Reject H0        Do not reject H0
H0 is true      Type I error     No error
H0 is false     No error         Type II error

One categorical variable
Summary statistics: mode, frequency table, proportions
Visualization: bar chart, pie chart

Statistical significance: if the sample statistic is unlikely to occur just by chance, the results are statistically significant. If the results are not significant, they are inconclusive.

The p-value is the chance of obtaining a sample statistic at least as extreme as the observed sample statistic, if the null hypothesis is true. Reject when the p-value < α (0.05) or when the observed value is in the rejection region.

Randomization distribution: like bootstrapping, but focused on the null rather than the sample. A randomization distribution assumes H0 is true, while a bootstrap distribution has no knowledge of the null hypothesis and is used for confidence intervals.

If a 95% CI misses the parameter in H0, then a two-tailed test should reject H0 at a 5% significance level, and vice versa.
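The z-score and five-number-summary rules above can be checked on a small made-up sample. The crude quartile rule below follows the sheet's "Q3 = median of the values above the median" (other software may use slightly different quartile conventions):

```python
import statistics

# Hypothetical quantitative sample (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
sd = statistics.stdev(data)        # sample standard deviation

# Z-score: (observed - mean) / standard deviation
z = (9 - mean) / sd

# Five-number summary pieces, using the sheet's quartile rule.
s = sorted(data)
median = statistics.median(s)
q1 = statistics.median([x for x in s if x < median])  # median of values below the median
q3 = statistics.median([x for x in s if x > median])  # median of values above the median
iqr = q3 - q1
rng = max(s) - min(s)              # range = max - min
```

For this sample the maximum (9) sits just under two standard deviations above the mean, so it is unusual but not extreme by the 1.96-SD rule.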
Two categorical variables
Summary statistics: two-way table, difference in proportions
Visualization: stacked or clustered bar chart

One quantitative variable
R – range, S – shape, L – location
Summary statistics: median, mean
Visualization: dotplots, histograms or boxplots

Confidence intervals are most useful when you want to estimate population parameters. Hypothesis tests and p-values are most useful when you want to test hypotheses about population parameters.

A density curve is a theoretical model to describe a variable's distribution.
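The one- and two-categorical summaries above can be computed directly. All counts below are invented for illustration:

```python
from collections import Counter

# One categorical variable: hypothetical survey responses.
answers = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

freq = Counter(answers)                      # frequency table
n = len(answers)
props = {k: v / n for k, v in freq.items()}  # table of proportions
mode = freq.most_common(1)[0][0]             # most frequent category

# Two categorical variables: a 2x2 two-way table (made-up counts)
# and the difference in proportions between the two groups.
table = {"A": {"success": 30, "failure": 20},
         "B": {"success": 18, "failure": 32}}
p_A = table["A"]["success"] / sum(table["A"].values())
p_B = table["B"]["success"] / sum(table["B"].values())
diff = p_A - p_B                             # difference in proportions
```

The difference in proportions (here p_A − p_B) is the statistic you would then bootstrap or test, as described elsewhere on this sheet.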
Regression
The observed response value, y, is the response value observed for a particular data point. The predicted response value, ŷ, is the response that would be predicted for a given x value, based on a model.

Shape: if the sample size is large enough, the sampling distribution will be symmetric and bell-shaped (CLT).

An interval estimate gives a range of plausible values for a population parameter: statistic ± margin of error.

The standard error of a statistic, SE, is the standard deviation of the sampling distribution (or bootstrap distribution) of that statistic.

Confidence intervals: 95% CI = sample statistic ± 1.96 x SE. We are 95% confident that the true [population parameter] lies between (lower bound and upper bound).

Bootstrapping: to simulate a population from a sample by drawing samples randomly, with replacement (out of a hat).

A normal distribution has a symmetric bell-shaped density curve. The mean is its centre of symmetry (μ); the standard deviation controls its spread (σ).

Central Limit Theorem: for random samples with a sufficiently large sample size (≥ 30 for quantitative data if not very skewed; ≥ 10 in each category for categorical data), the distribution of sample statistics for a mean is Normal. (Use a t-score if the sample is smaller.)

CI = statistic ± z* x SE, where z* is specific to each %CI.

The standardized test statistic is the number of standard errors a statistic is from the null: z = (sample statistic − null) / SE.

t-distribution: compensates for the added variability of not knowing the standard deviation, and is used for small samples. It is very similar to the normal curve but has fatter tails to reflect the added uncertainty. df = n − 1.

The t-score is used when the standard deviation is not known or the sample size is less than 30. t* is found using the t-distribution.
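The CI and test-statistic formulas above fit together in one short Python sketch. The data (p̂ = 0.54 from n = 400, H0: p = 0.5) are invented, and the SE formula for a proportion, sqrt(p(1 − p)/n), is standard theory rather than something stated on this sheet:

```python
import math

# Hypothetical: sample proportion 0.54 from n = 400, testing H0: p = 0.5.
p_hat, n, p0 = 0.54, 400, 0.5

# SE of a proportion under the null (standard formula, assumed here).
se_null = math.sqrt(p0 * (1 - p0) / n)

# Standardized test statistic: z = (sample statistic - null) / SE
z = (p_hat - p0) / se_null

# 95% CI = statistic +/- 1.96 x SE, with SE based on p-hat.
se = math.sqrt(p_hat * (1 - p_hat) / n)
moe = 1.96 * se                      # margin of error
ci = (p_hat - moe, p_hat + moe)
```

Here z < 1.96, so the two-tailed test does not reject H0 at the 5% level; consistently, the 95% CI contains the null value 0.5 (the CI/test duality noted earlier on this sheet).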


The chi-square statistic quantifies how much observed counts vary from expected counts for a categorical variable.
The expected count is the sample size (n) times the null proportion (pᵢ).
df = number of categories − 1.
Always use the right tail to find the p-value in the χ² distribution.
Single mean: 1 quantitative variable
Difference in means: 1 quantitative and 1 categorical variable
Single proportion p: one categorical variable
Difference in proportions: two categorical variables
Paired difference in means (matched pairs): looks like 2 samples but is really only 1 (an observation for paired data is 2 observations per subject)

Chi-square test for association (two categorical)
Tests for an association between two categorical variables.
Expected = row total x column total / grand total
H0: independent / not associated; Ha: associated / not independent
df = (rows − 1) x (columns − 1) (use the right tail of the χ² distribution)
Ensure the expected count in each group is at least 5 for both chi-square tests.
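The expected-count and df rules for the association test can be sketched on a made-up two-way table. The χ² statistic formula, Σ(observed − expected)²/expected, is the standard one and is assumed here rather than quoted from the sheet:

```python
# Hypothetical 2x3 two-way table of observed counts.
observed = [[20, 30, 50],
            [30, 20, 50]]

row_totals = [sum(row) for row in observed]        # per-row totals
col_totals = [sum(col) for col in zip(*observed)]  # per-column totals
grand = sum(row_totals)                            # grand total

# Expected = row total x column total / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# df = (rows - 1) x (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

All expected counts here are at least 5, so the sheet's condition for using the χ² distribution is met; the p-value would come from the right tail with df = 2.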
ANOVA: test for a difference in means across more than two samples.
H0: mean1 = mean2 = mean3 = …; Ha: at least one mean is different.
Analysis of variance compares the variability between groups to the variability within groups.
Total variability = variability between groups + variability within groups.
With k = number of groups and n = total sample size:
F = [variability between groups / (k − 1)] / [variability within groups / (n − k)]

Conditions required to use the theoretical F-distribution:
1. Sample sizes are large (nᵢ ≥ 30) OR the data are reasonably normal
2. Variability is similar in all groups (max SD / min SD < 2)
df = n − 1 of the smallest group if there is more than one group.
ANOVA is for a single quantitative variable split over more than 2 levels.
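A minimal sketch of the F-statistic computation, using three made-up groups (the data are illustrative assumptions, not from the sheet):

```python
import statistics

# Hypothetical data: one quantitative variable split over 3 groups.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total sample size
grand_mean = statistics.mean(x for g in groups for x in g)

# Variability between groups (each group mean vs the grand mean)...
between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# ...and variability within groups (each value vs its own group mean).
within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

# F = (between / (k - 1)) / (within / (n - k))
f_stat = (between / (k - 1)) / (within / (n - k))

# Equal-variability condition check: max SD / min SD < 2.
sds = [statistics.stdev(g) for g in groups]
sd_ratio = max(sds) / min(sds)
```

Each group here has the same spread (sd_ratio = 1), so the condition "max SD / min SD < 2" holds, and the large F value reflects group means that differ far more than the within-group noise.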
Regression is for 2 quantitative variables.
y = α + βX + ε; if either the slope or the correlation is zero, there is no relationship. df = n − 2.

R² (coefficient of determination) = the proportion of variability in the response variable Y that is "explained" by the model based on the predictor X. R² is high if the data are close to the line.

Correlation (r): −1 ≤ r ≤ 1. If we square the correlation to get r², we get a number between 0 and 1 that we can express as a percentage. A good model explains 75% or more.
(*Replace s in the t-score with σ.)

We assume the errors (ε) are randomly distributed above and below the line, the relationship is linear, the variability is even, and there are no outliers/influential points.

Residual (e): the difference between the observed value of the dependent variable (y) and the predicted value (ŷ).
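The regression quantities above (ŷ, residuals, R²) can be computed by hand on a small made-up dataset. The least-squares formulas for the slope and intercept are standard theory; the sheet only gives the model form y = α + βX + ε:

```python
import statistics

# Hypothetical paired quantitative data (x, y).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)

# Least-squares slope and intercept (standard formulas, assumed here).
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
       sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar

y_hat = [alpha + beta * x for x in xs]             # predicted responses
residuals = [y - yh for y, yh in zip(ys, y_hat)]   # e = y - y-hat

# R^2 = proportion of variability in y explained by the model.
ss_tot = sum((y - y_bar) ** 2 for y in ys)
ss_res = sum(e ** 2 for e in residuals)
r2 = 1 - ss_res / ss_tot
```

For this data r² = 0.6, i.e. 60% of the variability in y is explained by x, which falls short of the sheet's "75% or more" rule of thumb for a good model.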
Chi-square goodness of fit (single categorical)
The χ² goodness-of-fit test tests whether the distribution of a single categorical variable differs from a hypothesized distribution.
Previously, to get a p-value, we found how far a p̂ was from the null, in standard errors, on the distribution of a difference in proportions. This won't work for more than 2 groups: there is more than one sample statistic (p̂) and more than one null value. We use the chi-square statistic (a single number) to quantify how much the observed counts vary from the expected counts.
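A short goodness-of-fit sketch with invented data: 120 die rolls tested against H0 that all six faces are equally likely (null proportion pᵢ = 1/6 for every category):

```python
# Hypothetical observed counts for the six faces of a die (120 rolls total).
observed = [25, 18, 22, 15, 20, 20]
n = sum(observed)

# Expected count = sample size (n) x null proportion (p_i).
null_props = [1 / 6] * 6
expected = [n * p for p in null_props]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = number of categories - 1; p-value comes from the RIGHT tail
# of the chi-square distribution with this df.
df = len(observed) - 1
```

Every expected count is 20 (well above 5), so the χ² distribution applies; a statistic of 2.9 on df = 5 is small, giving a large right-tail p-value and no evidence against a fair die.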

