Statistics Cheat Sheet Formulas and Steps
Statistics
Descriptive statistics = techniques that enable us to summarize a set of numbers (stick to this today)
Inferential statistics = techniques that enable us to make inferences based on the data
Scales of measurement
The scale of measurement matters for visualization and other descriptions of the data, and for choosing the right analysis in inferential statistics
Z scores
= the number of standard deviations by which a score differs from its mean
Raw score to z score
1 Draw a picture
2 Calculate z score: z = (X – μ) / σ
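A minimal Python sketch of the conversion (the score, mean, and SD below are made-up values, not course data):

```python
# Raw score to z score: z = (X - mu) / sigma
def z_score(x, mu, sigma):
    """Number of standard deviations by which x differs from the mean mu."""
    return (x - mu) / sigma

# Example: X = 130 in a population with mu = 100 and sigma = 15
print(z_score(130, mu=100, sigma=15))  # 2.0 -> two SDs above the mean
```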
Goodness-of-fit chi-square
Nominal data: 1 variable, >= 2 categories
• One variable, multiple outcomes
• Compares observed and expected frequencies
• Overall test of significance
• Expected frequencies must be 5 or more in each cell
It is non-parametric and distribution-free
Expected frequencies based on theoretical distribution
Degrees of freedom = c - 1
Assumptions:
• One nominal variable
• Observations are independent
Examples:
“Does the handedness among the LiS students equal the expected distribution?”
“Was the deck adequately shuffled?”
“Is the coin fair?”
1 Formulate hypotheses H0: the variable equals the theoretical distribution
H1: the variable does not equal the theoretical distribution
2 Calculate χ² χ² = Σ (fo – fe)² / fe
fo = observed frequencies
fe = expected frequencies (if the null hypothesis is true)
3 Determine the critical value χ² (α set at .05) Depends on the degrees of freedom (df) = the number of observations out of the total that are free to vary
df = c – 1
c = number of categories
4 Decide on hypotheses χ² higher than the critical value? Significant: Reject H0, accept H1
χ² lower than the critical value? Not significant: Retain H0
5 Report the results “The results do (not) provide sufficient evidence to reject/retain the null hypothesis” + what it means
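A hedged Python sketch of the whole procedure, using an invented "fair coin" example (H0: both outcomes equally likely) and scipy for the table lookup:

```python
# Goodness-of-fit chi-square: chi2 = sum((fo - fe)**2 / fe), df = c - 1
from scipy import stats

observed = [62, 38]   # fo: observed frequencies (invented data)
expected = [50, 50]   # fe: expected frequencies if H0 (fair coin) is true

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1                  # df = c - 1
critical = stats.chi2.ppf(0.95, df)     # critical value at alpha = .05

print(chi2, critical, chi2 > critical)  # significant: reject H0 if chi2 > critical

# scipy computes the same statistic plus a p value directly:
print(stats.chisquare(f_obs=observed, f_exp=expected))
```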
3 Calculate z test z = (M – μ) / σM, with σM = σ / √n
M = sample mean
μ = population mean
Two-tailed:
z value between the critical values? Retain H0
z value beyond one of the critical values? Reject H0 & accept H1
6 Report the results “There was (not) sufficient evidence that the sample mean differs from the hypothesized or known population mean”
(z = value, p </> .05)
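A short Python sketch of the z test; M, μ, σ, and n are invented, and σ is assumed known:

```python
# One-sample z test: z = (M - mu) / sigma_M, sigma_M = sigma / sqrt(n)
import math
from scipy import stats

M, mu, sigma, n = 103.0, 100.0, 15.0, 36   # invented values
sigma_M = sigma / math.sqrt(n)             # standard error of the mean
z = (M - mu) / sigma_M

critical = stats.norm.ppf(0.975)           # two-tailed critical value, alpha = .05
p = 2 * stats.norm.sf(abs(z))              # two-tailed p value
print(z, critical, p)                      # reject H0 if |z| > critical
```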
3 Calculate t test t = (M – μ) / SM, with SM = S / √n
4 Critical value t df = n – 1
• Look at table J.3 (choose the most conservative one)
5 Decide on hypotheses One-tailed:
t value lower than the critical value? Retain H0
t value higher than the critical value? Reject H0 & accept H1
Two-tailed:
t value between the critical values? Retain H0
t value beyond one of the critical values? Reject H0 & accept H1
6 Report the results “There was (not) sufficient evidence that the sample mean differs from the hypothesized or known population mean”
(t(df) = value, p </> .05)
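A sketch of the one-sample t test with scipy (the sample is invented; scipy.stats.ttest_1samp is two-tailed by default):

```python
# One-sample t test: t = (M - mu) / S_M, S_M = S / sqrt(n), df = n - 1
from scipy import stats

sample = [12, 15, 11, 14, 13, 16, 12, 14]   # invented scores
mu = 12                                     # hypothesized population mean

t, p = stats.ttest_1samp(sample, popmean=mu)
df = len(sample) - 1
critical = stats.t.ppf(0.975, df)           # two-tailed critical value, alpha = .05
print(t, df, p, abs(t) > critical)          # reject H0 if |t| > critical
```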
2 Calculate standard error of the difference between sample means S(M1–M2) = √(S²P/n1 + S²P/n2)
n = sample size per group
S²P = pooled estimate of the population variance
3 Calculate t test t = (M1 – M2) / S(M1–M2)
4 Critical value t df = n1 + n2 – 2
• Table J.3
5 Decide on hypotheses Two-tailed:
t value between the critical values? Retain H0
t value beyond one of the critical values? Reject H0 & accept H1
6 Report the results (t(df) = value, p </> .05)
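A sketch with two invented groups; scipy.stats.ttest_ind with equal_var=True uses the pooled-variance approach described above:

```python
# Independent-samples t test: t = (M1 - M2) / S_(M1-M2), df = n1 + n2 - 2
from scipy import stats

group1 = [23, 25, 28, 22, 26]   # invented data
group2 = [19, 21, 24, 20, 18]

t, p = stats.ttest_ind(group1, group2, equal_var=True)  # pooled variance
df = len(group1) + len(group2) - 2
critical = stats.t.ppf(0.975, df)   # two-tailed critical value, alpha = .05
print(t, df, p, abs(t) > critical)  # reject H0 if |t| > critical
```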
Post hoc comparisons (Tukey’s HSD) q: look up in table J.5
• Column = number of means being compared
• Row = dfW
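When software is at hand instead of table J.5, the same q critical value can be looked up with scipy's studentized range distribution (scipy >= 1.7; k and dfW below are arbitrary):

```python
# Critical value of the studentized range statistic q (the table J.5 lookup)
from scipy import stats

k, df_w = 3, 12   # number of means being compared, within-groups df
q_crit = stats.studentized_range.ppf(0.95, k, df_w)   # alpha = .05
print(q_crit)     # compare the computed q against this critical value
```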
Correlation
Association = as one variable changes, the other variable changes in a predictable manner:
the variables covary
Correlation = a measure of the degree of association among variables (not a matter of cause and
effect!!!)
Phi (rϕ)
Nominal data: Correlation
• Measure of correlation between two nominal dichotomous variables (correlation varies from
-1 to 1)
rϕ = 0 = no relationship
rϕ = -1 = perfect relationship
rϕ = 1 = perfect relationship
Assumptions of phi:
• Nominal variables
• Data are in the form of two dichotomies
1 Formulate hypotheses H0: the two dichotomous variables are not related
H1: the two dichotomous variables are related
2 Calculate phi For a 2×2 table with cell frequencies:
(a) (b)
(c) (d)
rϕ = (ad – bc) / √((a+b)(c+d)(a+c)(b+d))
3 Calculate χ² test statistic Testing whether the found value of phi is significantly different from 0 is done using a χ² distribution: χ² = N · rϕ²
4 Determine critical value χ² Find the critical value χ² in the table, with df = (#columns – 1) × (#rows – 1) = 1
5 Decide on hypotheses χ² lower than critical value? Retain H0
χ² higher than critical value? Reject H0, accept H1
6 Report the results The variables X and Y are (not) related
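A sketch computing phi and its χ² test from invented 2×2 cell counts, using the identity χ² = N·rϕ² from step 3:

```python
# Phi for a 2x2 table with cells a, b / c, d, and its chi-square test (df = 1)
import math
from scipy import stats

a, b, c, d = 30, 10, 15, 25    # invented cell frequencies
n = a + b + c + d

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi2 = n * phi ** 2            # test statistic, df = 1
critical = stats.chi2.ppf(0.95, 1)   # 3.84 at alpha = .05
print(phi, chi2, chi2 > critical)    # reject H0 if chi2 > critical
```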
Pearson r
Interval/ratio data: Correlation
r = correlation in a sample (ρ = population correlation)
• Measure of correlation for 2 interval/ratio variables (correlation varies from -1 to 1)
r = 0 = no relationship
r = -1 = perfect negative relationship (one increases, the other decreases)
r = 1 = perfect positive relationship (one increases, the other one does too)
• The sign (+ or -) indicates the direction of the relationship
• Measure of linear relationship
• Be aware of restriction of the range
• Prediction is limited to the range of the original variables
• Coefficient of determination (r squared) = proportion of the variance in one variable that is
explained by another variable
Assumptions
• Interval or ratio data
• Data are paired
• Linear relationship
• Normal distribution for X and Y variables
1 Formulate hypotheses H0: the variables are not related (ρXY = 0)
(one- or two-tailed?) H1: the variables are related (ρXY ≠ 0)
Scatterplot: inspect a scatterplot of the paired X and Y scores (check the linearity assumption)
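A sketch with invented paired scores; scipy.stats.pearsonr returns r along with the two-tailed p value for H0: ρ = 0:

```python
# Pearson r and the coefficient of determination r**2
from scipy import stats

x = [2, 4, 5, 7, 8, 10]   # invented X scores
y = [3, 5, 4, 8, 9, 11]   # paired Y scores

r, p = stats.pearsonr(x, y)   # two-tailed test of H0: rho = 0
print(r, r ** 2, p)           # r**2 = proportion of variance explained
```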
Linear regression
Interval/ratio data: Regression
• If there is a significant correlation, knowing the value of one variable will assist in predicting
the value of the other variable
• Regression: predicting one variable from another variable
• Provides an equation for predicting the value of Y: Ŷ = a + bX, with b = r(SY/SX) and a = MY – bMX
• Prediction is limited to the original range of the values
• Standard error of estimate (SŶ) = standard deviation of Y scores around the regression line
Assumptions:
• Interval or ratio data
• Data are paired
• Linear relationship
• Only used when Pearson r is statistically significant
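A sketch with the same kind of invented paired data; scipy.stats.linregress returns the slope (b), intercept (a), and r of the least-squares line:

```python
# Simple linear regression: Y-hat = a + bX
from scipy import stats

x = [2, 4, 5, 7, 8, 10]   # invented predictor scores
y = [3, 5, 4, 8, 9, 11]   # paired criterion scores

res = stats.linregress(x, y)
print(res.intercept, res.slope)        # a and b in Y-hat = a + bX
print(res.rvalue ** 2)                 # coefficient of determination
print(res.intercept + res.slope * 6)   # predict Y for X = 6 (within the original range!)
```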
BOOK REID
Chapter 1 - INTRODUCTION
Absolute value: the magnitude of a number irrespective of whether it is positive or negative.
Data (plural of datum): factual information, often in the form of numbers.
Descriptive statistics: techniques that are used to summarize a set of numbers.
Inferential statistics: techniques that are used in making decisions based on data.
Mean: a measure of central tendency for use with interval or ratio data. It is what is commonly
called an average, but in statistics, the term average can refer to a mean, median, or mode. The
mean is the sum of the scores divided by the number of scores.
Negatively skewed: a nonsymmetrical distribution in which the tail pointing to the left is larger
than the tail pointing to the right.
Normal distribution: a specific, bell-shaped distribution. Many statistical procedures assume that
the data are distributed normally.
Population: the entire group that is of interest.
Positively skewed: a nonsymmetrical distribution in which the tail pointing to the right is larger
than the tail pointing to the left.
Range: a measure of variability. With interval or ratio data, it equals the difference between the
upper real limit of the highest score or category and the lower real limit of the lowest score or
category.
Real limits: with interval or ratio data, the actual limits used in assigning a measurement. These are
halfway between adjacent scores. Each score thus has an upper and a lower real limit.
Sample: a subset of a population.
Standard deviation: a measure of variability—the average deviation of scores within a
distribution. It is defined as the square root of the variance. The symbol for the population standard
deviation is σ.
Sum of the squared deviations: for a population, it is equal to Σ(X – μ)² or Σx². It is often
abbreviated as “sum of squares,” which is shortened even further to SS.
Symmetrical distribution: a distribution in which the right half is the mirror image of the left half.
In such a distribution, there is a high score corresponding to each low score.
Unimodal distribution: a distribution with only one mode.
Variance: a measure of variability—the average of the sum of the squared deviations of scores
from their mean. The symbol for the population variance is σ².
x: the symbol for a deviation. Thus, x = (X – μ) if we are dealing with a population.
Area of rejection: area of the distribution equal to the alpha level. It is also called the critical
region.
Critical region: area of the distribution equal to the alpha level. It is also called the area of
rejection.
Degrees of freedom (df): the number of observations out of the total that are free to vary.
Expected frequencies: with nominal data, the outcome that would be expected if the null
hypothesis were true.
Independent: two events, samples, or variables are independent if knowing the outcome of one
does not enhance our prediction of the other.
Observed frequencies: with nominal data, the actual data that were collected.
Significant: in statistics, a measure of how unlikely it is that an event occurred by chance.
Bonferroni method: a procedure to control the Type I error rate when making numerous
comparisons. In this procedure, the alpha level that the experimenter sets is divided by the number
of comparisons.
Dependent: two events, samples or variables are dependent if knowing the outcome of one
enhances our prediction of the other.
Effect size: a measure of how strong a statistically significant outcome is.
Gambler’s fallacy: the incorrect assumption that if an event has not occurred recently, then the
probability of it occurring in the future increases.
Interaction: a statistical term indicating that the effects of two or more variables are not
independent.
Post hoc comparisons: statistical procedures utilized following an initial, overall test of
significance to identify the specific samples that differ.
Biased estimator: an estimator that does not accurately predict what it is intended to because of
systematic error.
Central limit theorem:
—with increasing sample sizes, the shape of the distribution of sample means (sampling distribution
of the mean) rapidly approximates the normal distribution irrespective of the shape of the
population from which it is drawn.
—the mean of the distribution of sample means is an unbiased estimator of the population mean.
—and the standard deviation of the distribution of sample means (σM) = σX/√n.
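A quick numpy simulation of the last point, using an arbitrary, clearly non-normal population:

```python
# Empirical check that the SD of sample means approaches sigma / sqrt(n)
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed population

n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

print(np.std(sample_means))            # empirical sigma_M
print(population.std() / np.sqrt(n))   # theoretical sigma / sqrt(n)
```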
Confidence interval: the range of values that has a known probability of including the population
parameter, usually the mean.
Law of large numbers: the larger the sample size, the better the estimate of population parameters
such as μ.
One-tailed or directional test: an analysis in which the null hypothesis will be rejected if an
extreme outcome occurs in only one direction. In such a test, the single area of rejection is equal to
alpha.
Sampling distribution of the mean: a theoretical probability distribution of sample means. The
samples are all of the same size and are randomly selected from the same population.
Standard error: the standard deviation of the sampling distribution of a statistic. Thus the standard
error of the mean is the standard deviation of the sampling distribution of means.
Two-tailed or nondirectional test: an analysis in which the null hypothesis will be rejected if an
extreme outcome occurs in either direction. In such a test, the alpha level is divided into two equal
parts.
Carryover effect: a treatment or intervention at one point in time may affect or carry over to
another point in time.
Counterbalancing: a method used to control for carryover effects. In counterbalancing, the order
of the treatments or interventions is balanced so that an equal number of subjects will experience
each order of presentation.
Longitudinal study: a study in which subjects are measured repeatedly across time. A repeated
measures design is a type of longitudinal study.
Standard error of the difference between sample means (S(M1–M2)): the standard deviation of the
sampling distribution of the difference between sample means.
Standard error of the mean difference (SMD): the standard deviation of the sampling distribution
of the mean difference between measures.
Chapter 11 –
Coefficient of determination: the square of the correlation. It indicates the proportion of variability
in one variable that is explained or accounted for by the variability in the other variable.
Coefficient of nondetermination: the proportion of the variability of one variable not explained or
accounted for by the variability of the other variable. For phi, it is equal to 1 – rϕ².
Correlation: a measure of the degree of association among variables. A correlation indicates
whether a variable changes in a predictable manner as another variable changes.
Covariance: a statistical measure indicating the extent to which two variables vary together.
Covary: if knowledge of how one variable changes assists you in predicting the value of another
variable, the two variables are said to covary.
Multiple correlation (R): the association between one criterion variable and a combination of two
or more predictor variables.
Negative correlation: a relationship between two variables in which as one variable increases in
value, the other variable decreases in value. Also, as one variable decreases in value, the other
increases in value.
Partial correlation: a procedure in which the effect of a variable that is not of interest is removed.
Pearson r: correlation used with interval or ratio data.
phi (rϕ): correlation used with nominal data. It is a form of Pearson r.
Point biserial (rpb): correlation used when one variable is nominal (a true dichotomy) and the other
consists of interval or ratio data.
Positive correlation: a relationship between two variables in which as one variable increases in
value, so does the other variable. Also, as one variable decreases in value, so does the other.
Regression: procedure researchers use to develop an equation that permits the prediction of one
variable of a correlation if the value of the other variable is known.
Restriction of the range: reducing the range of values for a variable will reduce the size of the
correlation.
rho (ρ): symbol used for the population correlation.
True dichotomy: a natural division of scores into two distinct categories.