Bear Handout Hypothesis Testing
Hypothesis testing is the starting point for learning inferential statistics. We want to know something about the world, and we use hypothesis testing to evaluate statements about how the world works.
[Figure: a population of typical polar bears (μ = 100) and a sample drawn from it (Sample 1). Probability asks, "What are the likely characteristics of a sample drawn from this population?" Inferential statistics asks, "What can we infer about the population based on what we know from this sample?"]
A Five-Step Guide
Questions to Ask Before You Begin

What are your variables?
• Independent variable (IV)
• Dependent variable (DV)

What do you want to know?
• Relationships between groups?
• Differences between groups?

What do your data look like?
• What level are the DV data: NOIR?
• Have the assumptions been met?
• How many IV groups do you have?

What type of test will you use?
• Directional or non-directional H1?
• Is your alpha level .05 or .01?

Step 1: Choose the Right Statistical Test
Use the diagrams for Step 1 to choose the right statistical test. Choose your test based on (a) whether you are looking for relationships or differences between groups, (b) the level of the data (NOIR), and (c) whether the assumptions for the test have been met. Exploratory data analysis will help you know what test to choose.

Step 2: Establish the Null and Alternative Hypotheses
The null hypothesis states that there is really no difference between means, so any apparent differences are simply due to random variation or chance. The alternative hypothesis states that one group mean will be statistically significantly different from the other. The null hypothesis is written as H0: and the alternative as H1:.

Step 3: Select a Criterion for Significance
After choosing a one-tailed vs. two-tailed test, you have three options for determining statistical significance:
1. Is the p-value of your test less than the level of significance (a.k.a. alpha level)? The most common level is α = .05. (The level of significance sets your critical value.)
2. Is the test statistic greater than the critical value (CV)?
3. Does the confidence interval around the mean difference exclude 0?
All three options will agree with each other. For most tests, assume α = .05 and a two-tailed test.
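The handout works in SPSS, but the equivalence of the three decision rules can be sketched outside it. The snippet below is not part of the original handout: the data are simulated and μ0 = 100 is simply the placeholder value used above. It runs a one-sample t test and checks all three criteria, which always give the same answer.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=110, scale=15, size=25)   # hypothetical sample data
    mu0, alpha = 100, 0.05                            # placeholder population mean, alpha level

    t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)   # two-tailed by default
    df = len(sample) - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)           # two-tailed critical value

    # 95% confidence interval around the mean difference (sample mean minus mu0)
    ci_low, ci_high = stats.t.interval(1 - alpha, df,
                                       loc=sample.mean() - mu0,
                                       scale=stats.sem(sample))

    print(p_value < alpha)               # 1. p-value below alpha?
    print(abs(t_stat) > t_crit)          # 2. test statistic beyond the critical value?
    print(not (ci_low <= 0 <= ci_high))  # 3. CI around the difference excludes 0?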
[Decision diagrams for Step 1: branch on whether you have 1, 2, or 3+ IV groups to reach the appropriate test.]
Parametric Statistics and Nonparametric (NP) Alternatives

2 Independent Samples
• Independent Samples t Test: compares the means (DV) of only two independent groups (IV).
• NP: Wilcoxon-Mann-Whitney Test: compares the ranks/medians of two independent samples. Preferred when data have extreme outliers and/or sample sizes less than 20.

2 Dependent Samples
• Paired Samples t Test: compares the means (DV) of only two related samples (IV); i.e., the same people are measured twice or subjects have been deliberately matched.
• NP: Wilcoxon Signed Rank Sum Test: compares two related samples; the DV is ranks, with a median test.
Related = repeated, paired, and/or dependent.

3+ Independent Samples
• One-Way ANOVA: compares the means (DV) of a single IV with 3+ independent levels/groups; analyzes variability.
• NP: Kruskal-Wallis Test: compares a single IV with three or more independent levels/groups; the DV is ranks, with a median test. A generalized form of the Mann-Whitney test.
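For readers working outside SPSS, here is a minimal sketch (not from the handout; the three arrays are invented example scores) of how each parametric test and its nonparametric alternative can be run with scipy:

    from scipy import stats

    g1 = [12, 15, 14, 10, 13, 18, 11, 14]        # hypothetical group 1 scores
    g2 = [16, 19, 15, 17, 20, 14, 18, 21]        # hypothetical group 2 scores
    g3 = [22, 25, 19, 24, 23, 26, 21, 20]        # hypothetical group 3 scores

    # 2 independent samples
    print(stats.ttest_ind(g1, g2))               # Independent Samples t Test
    print(stats.mannwhitneyu(g1, g2))            # NP: Wilcoxon-Mann-Whitney Test

    # 2 dependent (paired) samples: the same people measured twice
    print(stats.ttest_rel(g1, g2))               # Paired Samples t Test
    print(stats.wilcoxon(g1, g2))                # NP: Wilcoxon Signed Rank Sum Test

    # 3+ independent samples
    print(stats.f_oneway(g1, g2, g3))            # One-Way ANOVA
    print(stats.kruskal(g1, g2, g3))             # NP: Kruskal-Wallis Test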
Two-Tailed Test (nondirectional)
Create the null hypothesis first: H0: μ = 100*
Then create the alternative: H1: μ ≠ 100
Example of a two-tailed (nondirectional) hypothesis:
The height of the average gnome is 28 inches.
H0: μ = 28
H1: μ ≠ 28

One-Tailed Test (directional)
Create the alternative hypothesis first: H1: μ > 100*
Then create the null hypothesis from what is left: H0: μ ≤ 100
Examples of possible null hypotheses: μ = 100, μ ≤ 100, or μ ≥ 100
Remember that the null hypothesis will always include the equal sign.
Examples of one-tailed (directional) hypotheses:
The average grizzly bear eats more than 12 trout per day. (one-tail, right)
H0: μ ≤ 12
H1: μ > 12
Most people wearing diapers are less than 30 months old. (one-tail, left)
H0: μ ≥ 30
H1: μ < 30

Symbol key: > greater than; < less than; ≥ greater than or equal to; ≤ less than or equal to.
Memory aid: 3 < 5 & 7 > 1, "the alligator always eats more." The equal sign always goes with the null hypothesis.

* The number 100 is used here as a placeholder for the population mean. Do not use this number in your hypothesis unless 100 happens to be the actual mean of the population, and don't include the asterisk, either.
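As an illustration outside SPSS (not part of the original handout; the trout counts are invented), the grizzly-bear hypothesis above can be tested with a one-sample t test, where the alternative argument switches between a two-tailed test and a one-tailed (right) test:

    from scipy import stats

    trout_per_day = [14, 11, 15, 13, 16, 12, 17, 15, 14, 13]   # hypothetical daily counts

    # Two-tailed: H0: mu = 12 vs. H1: mu != 12
    print(stats.ttest_1samp(trout_per_day, popmean=12, alternative='two-sided'))

    # One-tailed (right): H0: mu <= 12 vs. H1: mu > 12
    print(stats.ttest_1samp(trout_per_day, popmean=12, alternative='greater'))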
Step 3: Level of Significance
Calculating the Critical Region for Various Tests
For other tests (t test, ANOVA, chi-square), you can look up the critical value on a table. These tables are included at the back of your class notes. Consult the class notes for the type of test you are using to find the appropriate table and instructions on how to use it. Using most tables to calculate the critical region will require knowing the degrees of freedom (df). In most cases, the degrees of freedom will be n – 1.

Degrees of Freedom
• One-Way ANOVA: dfT = n – 1, dfB = k – 1, dfW = n – k; reported as F(3, 16) = 10.49, p < .001
• Two-Way Chi-Square Test: df = (krows – 1) × (kcolumns – 1); reported as χ²(3, N = 350) = 46.03, p < .05
n = number of participants
k = number of IV categories
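Outside SPSS, the critical values from those tables can also be reproduced with scipy's distribution functions. This sketch is not in the original handout; it assumes α = .05 and the degrees of freedom from the examples above:

    from scipy import stats

    alpha = 0.05
    # t test: df = n - 1 (here n = 20)
    print(stats.t.ppf(1 - alpha / 2, df=19))      # two-tailed critical t
    # ANOVA: dfB = k - 1, dfW = n - k (here F(3, 16))
    print(stats.f.ppf(1 - alpha, dfn=3, dfd=16))  # critical F
    # chi-square with df = 3
    print(stats.chi2.ppf(1 - alpha, df=3))        # critical chi-square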
Step 4: Calculate (Run the Statistics)
Analyze Menu
When you run a test in SPSS, you begin with the Analyze menu. You will not use all
of these commands at an introductory level, but you should become familiar with the
commands that you will need for this class. When you click on Analyze, a drop down
menu opens from which you can choose the appropriate test.
Descriptive Statistics
Used for exploring data and calculating descriptive statistics. Both Frequencies and Explore offer similar options, but one may be more suitable for a particular task, so you should know both. The Descriptives menu is the fast track to descriptive stats, although you might find Frequencies more usable. Crosstabs is used for the two-way chi-square and cross-tabulation. Q-Q plots are used to assess normality or compare any two distributions.
Compare Means
You will use this menu a lot: it contains the most commonly used hypothesis-testing tools, including three types of t tests as well as simple analysis of variance (ANOVA).
Correlate
Simple correlation, both Pearson and Spearman, is performed from this menu using the Bivariate command. The Partial command lets you control for a third variable, and the Distances command computes distance and similarity measures.
Regression
Simple & multiple linear regressions are conducted using the Linear command.
You can feed multiple IVs into the regression model to create a predictive
equation for a single DV.
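The SPSS Linear command does this through menus; as a rough sketch of the same idea outside SPSS (not from the handout; the study/sleep/score data and variable names are invented), ordinary least squares can be run with numpy:

    import numpy as np

    # Hypothetical data: two IVs (hours studied, hours slept) predicting one DV (exam score)
    study = np.array([2, 4, 5, 7, 8, 10], dtype=float)
    sleep = np.array([6, 7, 5, 8, 6, 7], dtype=float)
    score = np.array([60, 70, 72, 85, 83, 95], dtype=float)

    # Design matrix with an intercept column; least squares gives the predictive equation
    X = np.column_stack([np.ones_like(study), study, sleep])
    b, *_ = np.linalg.lstsq(X, score, rcond=None)
    print("predicted score = %.2f + %.2f*study + %.2f*sleep" % tuple(b))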
Nonparametric Statistics
When you have categorical data (or your scale data violate the assumptions of a parametric test), you can use a nonparametric alternative. These tests are found under the Legacy Dialogs submenu. Chi-square can be conducted here (or with the Crosstabs command in the Descriptive Statistics menu). Kolmogorov-Smirnov, Mann-Whitney, Kruskal-Wallis, Wilcoxon, and Friedman tests are all conducted here.
Reporting the p value (the "Sig (2-tailed)" column in SPSS)
• p value given in SPSS: if significant, report it exactly (e.g., p = .042); if not significant, report it with "ns" (e.g., p = .45, ns).
• p value shown as .000 in SPSS: report p < .001 (a value shown as .000 can never be non-significant).
• No exact p value (e.g., doing a test by hand, or doing a one-tailed test): report p < .05 if significant and p > .05 if not.
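A small helper makes these reporting rules concrete; it is a sketch that is not part of the handout, and report_p is just an illustrative name:

    def report_p(p: float, alpha: float = .05) -> str:
        """Format a p value following the reporting rules above (sketch)."""
        if p < .001:
            return "p < .001"                               # SPSS prints .000; report p < .001 instead
        if p < alpha:
            return "p = " + f"{p:.3f}".lstrip("0")          # e.g., p = .042
        return "p = " + f"{p:.2f}".lstrip("0") + ", ns"     # e.g., p = .45, ns

    print(report_p(.042))   # p = .042
    print(report_p(.45))    # p = .45, ns
    print(report_p(.0003))  # p < .001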
[Figure: a distribution with the critical regions shaded in both tails (results falling there have p < .05) and the region around the mean M where p > .05; the x-axis shows the test statistic.]
Result of Pregnancy Test
• NEGATIVE (not pregnant): either a CORRECT RESULT (you avoid an error) or a TYPE II ERROR (false negative: pregnant but don't know it).
• POSITIVE (pregnant): either a CORRECT RESULT (you avoid an error) or a TYPE I ERROR (false positive: not really pregnant after all).
Type I Error: rejecting a null hypothesis that is actually true (false positive)
Concluding that a treatment has an effect when it does not. Type I errors are only a concern when you reject the null hypothesis. The probability of a Type I error is alpha (α) (typically .05 or 5%).
To decrease the chance of a Type I error, you can change the alpha level from .05 to .01; however, this increases the possibility of a Type II error.
Type II Error: "accepting" a null hypothesis that is actually false (false negative)
Failure to detect an effect that actually exists. Type II errors are only a concern when you fail to reject the null hypothesis. The probability of a Type II error is beta (β).
Type II errors are most likely to happen:
• When effect size is small – you are looking for a very tiny effect and it is easy to miss; or,
• When the n is small – you do not have a large enough sample size to adequately detect the effect.
To decrease the chance of a Type II error, increase power by increasing sample size (n).
Memory Aids: Type I is like Pinocchio’s nose (lying); Type II is like a dunce cap (missed the effect)
Errors are always written with a Roman numeral (Type I and Type II, not Type One or Type 2).
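These error rates can be made concrete with a quick simulation outside SPSS. This is a sketch, not part of the handout; the population values (means 50 and 55, SD 10) and sample size are arbitrary. When the null is true, roughly α of the replications reject it (Type I); when a modest effect exists but n is small, many replications miss it (Type II):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, n, runs = .05, 20, 5000

    # Type I error rate: H0 is true (both samples come from the same population)
    false_pos = sum(
        stats.ttest_ind(rng.normal(50, 10, n), rng.normal(50, 10, n)).pvalue < alpha
        for _ in range(runs))
    print("Type I rate ~", false_pos / runs)    # close to alpha = .05

    # Type II error rate: H1 is true, but the effect is modest for this n
    misses = sum(
        stats.ttest_ind(rng.normal(50, 10, n), rng.normal(55, 10, n)).pvalue >= alpha
        for _ in range(runs))
    print("Type II rate (beta) ~", misses / runs, "; power ~", 1 - misses / runs)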
[Figure: a normal curve with α = .05 split between the two tails, a z-score axis from -3 to +3 (SD units), and a raw-score number line from 38 to 62 for a distribution with M = 50, SD = 4, showing the proportions under the curve.]
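To connect the two axes in that figure, a raw score can be converted to a z score and compared with the two-tailed critical values. This sketch is not in the handout; it assumes the figure's M = 50 and SD = 4 and an arbitrary raw score of 59:

    from scipy import stats

    M, SD, alpha = 50, 4, .05
    z_crit = stats.norm.ppf(1 - alpha / 2)          # about 1.96 for a two-tailed test

    lower, upper = M - z_crit * SD, M + z_crit * SD
    print("Critical regions: below", round(lower, 1), "or above", round(upper, 1))

    x = 59                                          # hypothetical raw score
    z = (x - M) / SD
    print("z =", z, "significant:", abs(z) > z_crit)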
Hypothesis Testing Is Like a Jury Trial

The null hypothesis: Assume the treatment has no effect unless there is enough evidence to prove otherwise.
H0: rats1 = rats2 (no difference in group means)
Presumption of innocence: Assume the defendant is innocent until proven guilty.
H0: no difference between this person and an innocent person

The alpha level: We set a standard of how different the groups must be before we will be willing to conclude that the treatment had an effect; otherwise, any differences are attributed to chance.
Standard of proof: The jury must be convinced beyond a reasonable doubt ("the standard") before they find a person guilty.

The sample data: The research study is conducted to gather data (evidence) to demonstrate whether the treatment had an effect.
Run the rats through the maze and collect data.
Evidence: The prosecutor presents evidence to demonstrate that the defendant is guilty.
The victim's blood was on his clothes, his DNA was at the crime scene, and he had a history of violence toward the victim.

The critical region: The sample data fall in the critical region (meet the burden of proof) or they fall outside the critical region (there is not enough evidence to reject the null).
A difference of more than 12 points will be significantly different.
Deliberation: Either there is enough evidence to meet the burden of proof of guilt (beyond a reasonable doubt or a preponderance of evidence) or there is not.
It is highly unlikely that he would have the victim's blood on his clothes the same night as her murder if he were innocent. He is different from an innocent person.

Conclusion: If the sample data (evidence) fall into the critical region (standard of proof), we reject the null (convict). If the sample data do not fall into the critical region, we fail to reject the null (acquit).
The means are 15 points apart. Our standard was 12. The groups are statistically significantly different.
Verdict: If the evidence exceeds the burden of proof, the jury votes "guilty." If the evidence falls short, the verdict is "not guilty." The verdict declares the defendant guilty but does not prove the defendant is guilty. Being acquitted does not prove the defendant is innocent, either.
This defendant is guilty.