Hypothesis Testing
Standard Deviation
Regression
[Slide figure: scatter plot of a dependent variable (y) against an independent variable (x), with a regression line defined by b1 (slope = ∆y/∆x) and b0 (y intercept); each plotted point is an observation, y.]
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.
Simple Linear Regression
[Slide figure: for one data point, the observation y, the prediction ŷ, and the prediction error ε.]
y = ŷ + ε
Actual = Explained + Error
Regression
Mathematically, the total variation in y partitions into explained and unexplained parts:
SSE = ∑ ( y – ŷ )²   (measure of unexplained variation)
SSR = ∑ ( ŷ – ȳ )²   (measure of explained variation)
SST = SSR + SSE = ∑ ( y – ȳ )²   (measure of total variation in y)
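These sums of squares are easy to verify numerically. Below is a minimal sketch in Python (hypothetical x and y values; least-squares slope and intercept from numpy.polyfit) showing that SSR + SSE reproduces SST.

    import numpy as np

    # Hypothetical data: x = independent variable, y = dependent variable
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.8, 6.9, 7.4])

    b1, b0 = np.polyfit(x, y, 1)   # slope and y intercept of the regression line
    y_hat = b0 + b1 * x            # prediction for each observed data point

    sse = np.sum((y - y_hat) ** 2)         # unexplained variation
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
    sst = np.sum((y - y.mean()) ** 2)      # total variation in y

    print(f"SSE = {sse:.3f}, SSR = {ssr:.3f}, SST = {sst:.3f}")
    print("SSR + SSE =", round(ssr + sse, 3))   # equals SST (up to rounding)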
What is Hypothesis Testing?
• … the use of statistical procedures to answer research
questions
• Typical research question (generic):
– Is there a difference between the test conditions, and is the difference real or merely due to chance?
Analysis of Variance
• The analysis of variance (ANOVA) is the most
widely used statistical test for hypothesis testing in
factorial experiments
• Goal: determine if an independent variable has a significant effect on a dependent variable
• Remember, an independent variable has at least two levels (test conditions)
• Goal (put another way): determine if the test conditions yield different outcomes on the dependent variable (e.g., one of the test conditions is faster/slower than the other); a code sketch follows below
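As a concrete illustration of this goal, here is a minimal sketch of a one-way analysis of variance using scipy.stats.f_oneway on made-up data for two test conditions. It assumes a single-factor, between-subjects layout; the within-subjects examples later in the deck (and the deck's Anova2 tool) are handled differently.

    from scipy import stats

    # Hypothetical task-completion times (seconds) for two test conditions
    # of a single independent variable (between-subjects assignment assumed)
    condition_a = [5.2, 4.8, 6.1, 5.5, 5.9, 4.7, 5.3, 6.0]
    condition_b = [6.3, 6.8, 5.9, 7.1, 6.6, 7.0, 6.2, 6.9]

    f, p = stats.f_oneway(condition_a, condition_b)

    df_between = 2 - 1                                    # k - 1 conditions
    df_within = len(condition_a) + len(condition_b) - 2   # N - k observations
    print(f"F({df_between},{df_within}) = {f:.3f}, p = {p:.4f}")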
Why Analyse the Variance?
• Seems odd that we analyse the variance, but the
research question is concerned with the overall
means:
[Slide figure: two charts, Example #1 and Example #2; data in file 06-AnovaDemo.xlsx]
Example #1 - Details
Note: Within-subjects design
1 ANOVA table created by StatView (now marketed as JMP, a product of SAS; www.sas.com)
How to Report an F-statistic
F(1,9) = 0.626, ns
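If needed, the p-value and critical value behind a report like this can be recovered from the F distribution in scipy.stats; a minimal sketch using the values from the line above:

    from scipy import stats

    # F(1,9) = 0.626 from Example #1: the survival function gives the p-value
    p = stats.f.sf(0.626, dfn=1, dfd=9)
    print(f"p = {p:.3f}")            # well above .05, hence the "ns" in the report

    # Critical value F must exceed for significance at alpha = .05
    print(f"F_crit = {stats.f.ppf(0.95, dfn=1, dfd=9):.3f}")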
Example #2 - Reporting
More Than Two Test Conditions
ANOVA
Post Hoc Comparison Tests
• A significant F-test means that at least one of the test
conditions differed significantly from one other test
condition
• Does not indicate which test conditions differed
significantly from one another
• To determine which pairs differ significantly, a post hoc comparison test is used
• Examples:
– Fisher PLSD, Bonferroni/Dunn, Dunnett, Tukey/Kramer, Games/Howell, Student-Newman-Keuls, orthogonal contrasts, Scheffé
• Scheffé test on next slide; a Tukey/Kramer code sketch follows below
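As an illustrative example using one of the procedures listed above, here is a minimal sketch of the Tukey/Kramer (HSD) test on made-up data for three test conditions, using scipy.stats.tukey_hsd (the group values are hypothetical and the function assumes a between-subjects layout):

    from scipy import stats

    # Hypothetical measurements for three test conditions
    c1 = [4.1, 4.5, 3.9, 4.7, 4.3, 4.6]
    c2 = [5.0, 5.4, 4.9, 5.6, 5.2, 5.1]
    c3 = [4.2, 4.4, 4.0, 4.6, 4.5, 4.1]

    res = stats.tukey_hsd(c1, c2, c3)   # available in recent versions of SciPy
    print(res)                          # pairwise differences, CIs, and p-values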
Scheffé Post Hoc Comparisons
Between-subjects Designs
• Research question:
– Do left-handed users and
right-handed users differ in
the time to complete an
interaction task?
• The independent variable
(handedness) must be
assigned between-subjects
• Example data set
Summary Data and Chart
ANOVA
Two-way ANOVA
• An experiment with two independent variables is a two-way design
• ANOVA tests for
– Two main effects + one interaction effect
• Example
– Independent variables
• Device D1, D2, D3 (e.g., mouse, stylus, touchpad)
• Task T1, T2 (e.g., point-select, drag-select)
– Dependent variable
• Task completion time (the particular measure isn’t important here)
– Both IVs assigned within-subjects
– Participants: 12
– Data set (next slide); a repeated-measures code sketch follows below
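A minimal sketch of this two-way, within-subjects layout using pandas and statsmodels' AnovaRM (repeated-measures ANOVA). The column names and the normally distributed completion times are made up for illustration; the output table lists the two main effects and the interaction.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(1)
    devices, tasks = ["D1", "D2", "D3"], ["T1", "T2"]

    # Long-format table: one row per participant x device x task cell
    rows = [{"participant": p, "device": d, "task": t,
             "time": rng.normal(10, 2)}
            for p in range(1, 13) for d in devices for t in tasks]
    data = pd.DataFrame(rows)

    # Both IVs within-subjects: two main effects + one interaction effect
    res = AnovaRM(data, depvar="time", subject="participant",
                  within=["device", "task"]).fit()
    print(res)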
Data Set
Summary Data and Chart
ANOVA
ANOVA - Reporting
Anova2 Software
• HCI:ERP web site includes analysis of variance Java
software: Anova2
• Operates from command line on data in a text file
• Extensive API with demos, data files, discussions, etc.
• Download and demonstrate
Demo
Dix et al. Example¹
• Single-factor, within-subjects design
• See API for discussion
1 Dix, A., Finlay, J., Abowd, G., & Beale, R. (2004). Human-computer interaction (3rd ed.). London: Prentice Hall. (p. 337)
Dix et al. Example
• With counterbalancing
• Treating “Group” as a between-subjects factor¹
• Includes header lines
1 See API and HCI:ERP for discussion on “counterbalancing and testing for a group effect”.
Chi-square Test (Nominal Data)
• A chi-square test is used to investigate relationships
• Relationships between categorical, or nominal-scale,
variables representing attributes of people, interaction
techniques, systems, etc.
• Data organized in a contingency table – a cross tabulation containing counts (frequency data) for the number of observations in each category
• A chi-square test compares the observed values against
expected values
• Expected values assume “no difference”
• Research question:
– Do males and females differ in their method of scrolling on desktop systems? (next slide; a code sketch follows below)
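A minimal sketch of such a test using scipy.stats.chi2_contingency on made-up counts (not the deck's data), arranged as a 2 × 3 contingency table of sex by scrolling method:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: male, female; columns: mouse wheel (MW), clicking/dragging (CD), keyboard (KB)
    observed = np.array([[28, 15, 13],
                         [31, 11, 12]])

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print("expected counts (assuming no difference):")
    print(expected.round(2))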
Chi-square – Example #1
MW = mouse wheel
CD = clicking, dragging
KB = keyboard
Chi-square – Example #1
χ² = 1.462 (see HCI:ERP for calculations)
Significant if it exceeds the critical value (next slide)
Chi-square Critical Values
• Decide in advance on alpha (typically .05); the corresponding critical value is computed in the sketch below
• Degrees of freedom
– df = (r – 1)(c – 1) = (2 – 1)(3 – 1) = 2
– r = number of rows, c = number of columns
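The critical value can also be computed directly from the chi-square distribution in scipy.stats; a minimal sketch for the alpha and df above:

    from scipy.stats import chi2

    alpha = 0.05
    df = (2 - 1) * (3 - 1)                  # (r - 1)(c - 1) = 2
    critical = chi2.ppf(1 - alpha, df)      # value the statistic must exceed
    print(f"critical value at alpha = {alpha}, df = {df}: {critical:.3f}")
    # The 1.462 from Example #1 is below this critical value, so not significant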
Demo
Chi-square – Example #2
• Research question:
– Do students, professors, and parents differ in their
responses to the question: Students should be allowed
to use mobile phones during classroom lectures?
• Data:
Chi-square – Example #2
• Result: significant difference in responses (χ² = 20.5, p < .0001)
• Post hoc comparisons reveal that opinions differ between
students:parents and professors:parents (students:professors do not
differ significantly in their responses)
Non-parametric – Example #1
• Research question:
– Is there a difference in the political leaning of Mac
users and PC users?
• Method:
– 10 Mac users and 10 PC users randomly selected and
interviewed
– Participants assessed on a 10-point linear scale for
political leaning
• 1 = very left
• 10 = very right
• Data (next slide)
Data (Example #1)
• Means:
– 3.7 (Mac users)
– 4.5 (PC users)
• Data suggest PC users more right-leaning, but is the difference statistically significant?
• Data are ordinal (at least), so a non-parametric test is used
• Which test? (see below)
Mann-Whitney U Test¹
Test statistic: U
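A minimal sketch of this test using scipy.stats.mannwhitneyu on made-up political-leaning ratings (not the deck's data set):

    from scipy.stats import mannwhitneyu

    # Hypothetical ratings, 1 = very left ... 10 = very right
    mac_users = [3, 4, 2, 5, 4, 3, 5, 4, 3, 4]
    pc_users  = [5, 4, 6, 3, 5, 4, 6, 5, 4, 3]

    u, p = mannwhitneyu(mac_users, pc_users, alternative="two-sided")
    print(f"U = {u}, p = {p:.4f}")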
Demo
1 MannWhitneyU files contained in NonParametric.zip.
Non-parametric – Example #2
• Research question:
– Do two new designs for media players differ in “cool
appeal” for young users?
• Method:
– 10 young tech-savvy participants recruited and given
demos of the two media players (MPA, MPB)
– Participants asked to rate the media players for “cool
appeal” on a 10-point linear scale
• 1 = not cool at all
• 10 = really cool
• Data (next slide)
Data (Example #2)
• Means
– 6.4 (MPA)
– 3.7 (MPB)
• Data suggest MPA has more “cool
appeal”, but is the difference
statistically significant?
• Data are ordinal (at least), so a non-parametric test is used
• Which test? (see below)
Wilcoxon Signed-Rank Test¹
Conclusion:
The null hypothesis is rejected:
Media player A has more “cool
appeal” than media player B
(z = -2.254, p < .05).
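Because the ratings here are paired (the same participants rated both players), the analogous sketch uses scipy.stats.wilcoxon; the ratings below are made up, so the result will not reproduce z = -2.254.

    from scipy.stats import wilcoxon

    # Hypothetical paired ratings from the same 10 participants, 1-10 scale
    mpa = [7, 6, 8, 5, 7, 6, 9, 6, 7, 5]
    mpb = [4, 3, 5, 4, 3, 5, 4, 2, 4, 3]

    stat, p = wilcoxon(mpa, mpb)
    print(f"W = {stat}, p = {p:.4f}")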
Demo
1 WilcoxonSignedRank files contained in NonParametric.zip.
Non-parametric – Example #3
• Research question:
– Is age a factor in the acceptance of a new GPS device for
automobiles?
• Method
– 8 participants recruited from each of three age categories:
20-29, 30-39, 40-49
– Participants demo’d the new GPS device and were then asked if they would consider purchasing it for personal use
– They respond on a 10-point linear scale
• 1 = definitely no
• 10 = definitely yes
• Data (next slide)
Data (Example #3)
• Means
– 7.1 (20-29)
– 4.0 (30-39)
– 2.9 (40-49)
• Data suggest differences by age,
but are differences statistically
significant?
• Data are ordinal (at least), so a non-parametric test is used
• Which test? (see below)
Kruskal-Wallis Test¹
Conclusion:
The null hypothesis is rejected: there is an age difference in the acceptance of the new GPS device (χ² = 9.605, p < .01).
Demo
1 KruskalWallis files contained in NonParametric.zip.
Post Hoc Comparisons
• As with the analysis of variance, a significant result only
indicates that at least one condition differs significantly
from one other condition
• To determine which pairs of conditions differ significantly,
a post hoc comparisons test is used
• Available using the -ph option (see below); a pairwise code sketch also follows
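The deck's tool handles this through its -ph option. As a hand-rolled illustration of the same idea (outside the deck's software), here is a minimal sketch of pairwise Mann-Whitney U tests with a Bonferroni correction, applied to the made-up ratings from the Kruskal-Wallis sketch above:

    from itertools import combinations
    from scipy.stats import mannwhitneyu

    # Hypothetical groups (same made-up ratings as the Kruskal-Wallis sketch)
    groups = {"20-29": [8, 7, 9, 6, 8, 7, 6, 8],
              "30-39": [5, 4, 6, 3, 5, 4, 4, 5],
              "40-49": [3, 2, 4, 3, 2, 3, 4, 2]}

    pairs = list(combinations(groups, 2))
    for a, b in pairs:
        u, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        # Bonferroni: multiply each p-value by the number of comparisons
        print(f"{a} vs {b}: U = {u}, corrected p = {min(p * len(pairs), 1.0):.4f}")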
Non-parametric – Example #4
• Research question:
– Do four variations of a search engine interface (A, B,
C, D) differ in “quality of results”?
• Method
– 8 participants recruited and demo’d the four interfaces
– Participants do a series of search tasks on the four
search interfaces (Note: counterbalancing is used, but
this isn’t important here)
– Quality of results for each search interface assessed on
a linear scale from 1 to 100
• 1 = very poor quality of results
• 100 = very good quality of results
• Data (next slide)
Data (Example #4)
• Means
– 71.0 (A), 68.1 (B), 60.9 (C),
69.8 (D)
• Data suggest a difference in
quality of results, but are the
differences statistically
significant?
• Data are ordinal (at least), so a non-parametric test is used
• Which test? (see below)
Friedman Test¹
Conclusion:
The null hypothesis is rejected: there is a difference in the quality of results provided by the search interfaces (χ² = 8.692, p < .05).
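A minimal sketch of this test using scipy.stats.friedmanchisquare on made-up quality-of-results scores from the same eight participants (not the deck's data):

    from scipy.stats import friedmanchisquare

    # Hypothetical 1-100 quality scores; one list per interface,
    # with the i-th entry in each list coming from the same participant
    a = [72, 68, 75, 70, 74, 69, 73, 71]
    b = [67, 70, 66, 69, 71, 65, 68, 70]
    c = [60, 62, 58, 63, 61, 59, 64, 60]
    d = [70, 69, 72, 68, 71, 70, 69, 73]

    chi2, p = friedmanchisquare(a, b, c, d)
    print(f"chi-square = {chi2:.3f}, p = {p:.4f}")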
Demo
1 Friedman files contained in NonParametric.zip.
Post Hoc Comparisons
• As with the KruskalWallis application, available using the -ph option …
Points of Discussion
• Reporting the mean vs. median for scaled
responses
• Non-parametric tests for multi-factor experiments
• Non-parametric tests for ratio-scale data
Thank You