Cheat Sheet Stats For Exam
Statistical inference is the process of using data from a sample to gain information about the population.
A sample statistic is an estimate of the value of a descriptive characteristic of the population.
A parameter is a number that describes some aspect of a population.
A statistic is a number that is computed from data in a sample.
When the mean is larger than the median, the distribution is typically skewed to the right.
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some way. If bias occurs, then we cannot trust generalizations from the sample to reflect the rest of the population. Causes: question wording, context, inaccurate responses, poor sampling methods. Reducing bias needs random sampling, control groups, placebos and blinding.
Correlation does not equal causation.
Response/Explanatory variable: when using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.
Confounding variable: an additional variable that is associated with both the explanatory variable and the response variable. Confounding variables are minimised by random allocation of subjects to treatment groups.
Bootstrap statistic: the statistic computed on a bootstrap sample.
Bootstrap distribution: the distribution of many bootstrap statistics.
Our interest is in the variability of the statistics from the bootstrap samples, which will be similar to the standard error (SE) of the statistic across samples from the true population.
The sampling distribution is centred around the population parameter, while the bootstrap distribution is centred around the sample statistic.
Margin of error: 1.96 x SE for a 95% confidence interval.
Hypothesis tests (statistical tests)
Remember to DEFINE parameters.
Null hypothesis (H0): no effect or difference (status quo); always stated with =.
Alternative hypothesis (Ha): the claim for which we seek evidence (stated with ≠, < or >).
REJECT H0 when the p-value is less than 0.05 at the 5% significance level.
Hypotheses are always about population parameters and not sample statistics, thus only population notation is used.
- Never accept the null. Only reject or do not reject the null.
Notation             Population   Sample
Mean                 μ            x̄
Proportion           p            p̂
Correlation          ρ            r
Standard deviation   σ            s
Resistance: a statistic is resistant when it is not distorted by skewed data or outliers. Medians are resistant, means are not.
Standard deviation: about 95% of the data in a bell-shaped distribution should fall within 1.96 standard deviations of the mean.
Z score: how many standard deviations the observation is from the mean. z = (observed – mean) / standard deviation
IQR: Q3 – Q1, where Q1 = median of the values below the median and Q3 = median of the values above the median.
Statistical significance: if the sample statistic is unlikely to occur just by chance when H0 is true, the results are statistically significant. If the results are not significant, they are inconclusive.
The p-value is the chance of obtaining a sample statistic at least as extreme as the observed sample statistic, if the null hypothesis is true.
Reject H0 when the p-value is < alpha (0.05) or when the observed value falls in the rejection region.
Randomization distribution: built by simulation like a bootstrap distribution, but with the focus on the null rather than the sample. A randomization distribution assumes H0 is true, while a bootstrap distribution has no knowledge of the null hypothesis and is used for confidence intervals.
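A randomization distribution and its p-value can be sketched in a few lines of Python. This is a minimal illustration, assuming a difference-in-means test between two groups; the data, the function name `randomization_p_value`, and the number of shuffles are all made up for the example.

```python
import random
from statistics import mean

def randomization_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-tailed p-value for a difference in means, by reshuffling group labels."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        # Re-deal the values to the two groups as if H0 (no difference) were true
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Count randomization statistics at least as extreme as the observed one
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles

# Hypothetical data, for illustration only
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [10.2, 11.1, 9.8, 12.0, 10.7, 11.4]

observed_diff, p_value = randomization_p_value(treatment, control)
reject_h0 = p_value < 0.05   # decision rule at the 5% significance level
print(observed_diff, p_value, reject_h0)
```

The shuffling step is what builds in the assumption that H0 is true: if group membership does not matter, any re-dealing of the values is equally likely.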
Range: maximum value – minimum value
Outliers: use the mean and standard deviation to describe centre and spread if there are NO outliers. Use the median and IQR if there are outliers.
Five number summary: Min, Q1, Median, Q3, Max
               Reject H0       Do not reject H0
H0 is true     Type I error    No error
H0 is false    No error        Type II error
If a 95% CI misses the parameter value in H0, then a two-tailed test should reject H0 at a 5% significance level, and vice versa.
One categorical variable
Summary statistics: Mode, frequency table, proportions
Visualization: Bar chart, Pie chart
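For one categorical variable, the mode, frequency table and proportions can be computed as in this small Python sketch. The responses are invented for the example.

```python
from collections import Counter

# Hypothetical survey responses
responses = ["yes", "no", "yes", "unsure", "yes", "no", "yes", "no"]

counts = Counter(responses)                      # frequency table
n = len(responses)
proportions = {category: count / n for category, count in counts.items()}
mode_category, mode_count = counts.most_common(1)[0]

# Print the table from most to least frequent category
for category, count in counts.most_common():
    print(f"{category:<8}{count:>3}{proportions[category]:>8.3f}")
print("mode:", mode_category)
```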
Two categorical variables
Summary statistics: two-way table, difference in proportions
Visualization: stacked or clustered bar chart
One quantitative variable
R – Range   S – Shape   L – Location
Summary statistics: median, mean
Visualization: dotplots, histograms or boxplots
Confidence intervals are most useful when you want to estimate population parameters.
Hypothesis tests and p-values are most useful when you want to test hypotheses about population parameters.
A density curve is a theoretical model to describe a variable's distribution.
Regression
The observed response value, y, is the response value observed for a particular data point.
The predicted response value, ŷ, is the response that would be predicted for a given x value, based on a model.
Shape: if the sample size is large enough, the sampling distribution will be symmetric and bell-shaped (CLT).
A normal distribution has a symmetric, bell-shaped density curve. The mean (µ) is its centre of symmetry. The standard deviation (σ) controls its spread.
Central Limit Theorem: for random samples with a sufficiently large sample size (>= 30 for quantitative data if not very skewed, >= 10 in each category for categorical data), the distribution of sample statistics such as a mean is approximately normal. Use a t score if the sample size is smaller.
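The Central Limit Theorem can be seen in a quick simulation. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1; the population, sample size and number of samples are choices made for the example, not part of the sheet.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)
n = 40               # sample size, above the n >= 30 guideline
num_samples = 2000   # number of simulated samples

# Each entry is the mean of one random sample from the skewed population
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

centre = mean(sample_means)     # should be close to the population mean, 1
spread = stdev(sample_means)    # should be close to sigma / sqrt(n)
expected_se = 1.0 / n ** 0.5

# Share of sample means within 1.96 standard errors of the centre
within = sum(abs(m - centre) <= 1.96 * spread for m in sample_means) / num_samples
print(centre, spread, expected_se, within)
```

Even though the population is far from bell-shaped, roughly 95% of the simulated sample means should land within 1.96 standard errors of the centre.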
An interval estimate gives a range of plausible values for a population parameter: statistic ± margin of error
CI = statistic ± z* x SE, where z* is specific to each confidence level.
The standardized test statistic is the number of standard errors a statistic is from the null value: z = (sample statistic – null value) / SE
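The z* multiplier, the confidence interval and the standardized test statistic can be worked through with Python's standard library. This is a sketch under simplifying assumptions: the numbers are hypothetical, the function names are invented, and the same SE is used for both the interval and the test.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def z_star(confidence):
    """Critical value z* that leaves (1 - confidence) / 2 in each tail."""
    return standard_normal.inv_cdf(1 - (1 - confidence) / 2)

def z_interval(statistic, se, confidence=0.95):
    """CI = statistic +/- z* x SE."""
    margin = z_star(confidence) * se
    return statistic - margin, statistic + margin

def z_test(statistic, null_value, se):
    """Standardized test statistic and its two-tailed p-value."""
    z = (statistic - null_value) / se
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, p_value

# Hypothetical sample proportion, null value and standard error
p_hat, null_p, se = 0.56, 0.50, 0.025

low, high = z_interval(p_hat, se)
z, p_value = z_test(p_hat, null_p, se)
print(round(z_star(0.95), 3), (low, high), z, p_value)
```

With these numbers the 95% interval misses the null value and the two-tailed p-value is below 0.05, so the interval and the test agree.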
The standard error of a statistic, SE, is the standard deviation of the sampling distribution of that statistic; it can be estimated by the standard deviation of the bootstrap distribution.
Confidence intervals: 95% CI = sample statistic ± 1.96 x SE
Interpretation: we are 95% confident that the true [population parameter] lies between [lower bound] and [upper bound].
t-distribution: compensates for the added variability of not knowing the population standard deviation and is used for small samples. It is very similar to the normal curve but has fatter tails to reflect the added uncertainty. df = n – 1
A t score is used when the population standard deviation is not known or the sample size is less than 30.
t* is found using the t-distribution.
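Bootstrapping: to simulate sampling from the population, repeatedly draw samples at random from the original sample, with replacement and of the same size as the original (like drawing out of a hat and putting each value back).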
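Bootstrapping can be sketched in Python as follows. The example assumes the statistic of interest is the sample mean; the data, the helper name `bootstrap_means` and the number of resamples are made up for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_means(sample, num_resamples=2000, seed=1):
    """Bootstrap distribution of the mean: resample with replacement, same size as the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(num_resamples)]

# Hypothetical sample
sample = [23, 19, 31, 27, 22, 35, 28, 24, 26, 30]

boot = bootstrap_means(sample)
sample_mean = mean(sample)
se = stdev(boot)   # standard deviation of the bootstrap distribution estimates the SE

# 95% CI = sample statistic +/- 1.96 x SE
ci_low, ci_high = sample_mean - 1.96 * se, sample_mean + 1.96 * se
print(sample_mean, mean(boot), se, (ci_low, ci_high))
```

The bootstrap distribution should be centred near the sample mean rather than the unknown population mean, which is why it is used for the spread (SE) and not for the centre.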