Lecture Notes #1: Review of Basic Concepts
Richard Gonzalez1
Psych 613
Version 2.10 (9/2019)
LECTURE NOTES #1
Reading assignment
• Review MD chs 1&2 or G chs 1-5. Acquaint yourself with notation and underlying
logic.
Footnote 1: These lecture notes have benefited from feedback given by teaching assistants and students. Previous
teaching assistants, or GSIs, in chronological order: Mark Calogero, Steve Holste, Kimi Yamagishi, Ericka
Peterson, Henry Schmidt, Lisa Smith, Jane Swanson, Bill Fulton, Dave Waller, Ian Skurnick, William
Brenneman, Ajita Gopikrishnan, Shane Mueller, Greg Dyson, David Fencsik, Jennifer Hu, Arran Caza,
Vera Sacharin, Alexandra Atkins, Jonathon Kopecky, Julie Maslowsky, Igor Grossmann, Ryan Bremner,
Bill Chopik, Josh Wondra, Brian Vickers, Kim Brink, Ben Blankenship, Anne Waldo, Nick Michalak, Esra
Ascigil and Koji Takahashi. Current GSIs are Esra Ascigil, Megha Ghosh and Zac Zhang. Olena Sinkevich
provided helpful comments on an earlier draft. I also want to thank my UW colleague and friend, John
Miyamoto. We had many conversations about how to teach statistics. These discussions usually occurred
over food (most often pastries), which provides one of many explanations for why I’ve gained 30 pounds
since I began teaching statistics in 1990.
1. Review syllabus
2. Study tips
This is a demanding course because the pace is quick. There is a lot of material that
needs to be covered during the term. Here are some study tips that will help you
perform well in this course.
3. Intuitive review of a few concepts from Introductory Statistics (you might find it
helpful to review an introductory, undergraduate textbook)
ii. median: the score at (or near, by some appropriate definition) the 50th
percentile
iv. standard deviation (aka s, σ, or σ̂; the square root of the variance): know the difference between the standard deviation and the standard error

v. interquartile range (aka IQR): the difference between the score at (or near) the 75th percentile and the score at (or near) the 25th percentile
CLT: The key idea is that the sampling distribution of the mean has known properties. In words, the theorem gives the mean and the variance of the sample means under repeated sampling. Put simply, take repeated samples of size n (with replacement) from some population and calculate the mean of each sample. Denote the mean of sample i by Ȳi. You can compute the mean of the means (i.e., the mean of all the Ȳi's), the variance of the sample means, and you can construct a histogram of the sample means. The central limit theorem states that as the size of the samples gets large, the distribution of the means becomes normally distributed with E(Ȳ) = µ and var(Ȳ) = σ²/n (where the function E denotes expectation, or average). Thus, the CLT shows that the mean and the variance of the sampling distribution of the mean are related to the mean and the variance of the original parent population, respectively. Figure 1-1 illustrates a normal distribution and the resulting sampling distribution when the sample size is n = 5.
For a more detailed treatment of the central limit theorem, including a mathematical proof, consult a mathematically oriented introductory textbook such as A First Course in Probability by Ross (but to understand the proof you need a good understanding of some advanced mathematical concepts).
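To make the theorem concrete, here is a minimal R simulation of my own (the normal(0, 1) population, the sample size of 5, and the number of replications are arbitrary choices, not part of the original notes):

set.seed(613)
n.samples <- 10000                      # number of repeated samples
n <- 5                                  # size of each sample
# draw repeated samples from a normal(0, 1) population and store each sample mean
sample.means <- replicate(n.samples, mean(rnorm(n, mean = 0, sd = 1)))
mean(sample.means)                      # close to the population mean, 0
var(sample.means)                       # close to sigma^2 / n = 1/5
hist(sample.means)                      # roughly normal, as the theorem predicts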
Figure 1-1: The top panel shows the normal(0, 1) parent distribution and the bottom panel shows the normal(0, 1/5) sampling distribution of the mean for samples of size n = 5 (density plotted against z).
(c) t distribution
Intuitive definition of t: In this course we will use the simple and elegant result that for normally distributed parameters

t ∼ (estimate of population parameter) / (estimated st. dev. of the sampling distribution)    (1-1)
(sometimes the denominator is called the estimated standard error). The right hand side of Equation 1-1 can be thought of as a standardized parameter estimate, where the parameter estimate is divided by a measure of its sampling variability.
This is a useful result. Most of the estimators we deal with in this course are
normally distributed (e.g., means, difference between two means, contrasts, re-
gression coefficients, correlation coefficients when the population correlation is
assumed to be ρ = 0). Thus, we can use the t distribution in many situations.
The degrees of freedom corresponding to the denominator of Equation 1-1 will depend on the particular application; more on this later.
Excel and t values: It is convenient to use a t table to get the necessary values. For instance, with 20 degrees of freedom (explained later), the t value is 2.09 as given by the tabled values. If you want to compute your own tables, you can use a spreadsheet such as Microsoft Excel. For instance, the Excel function TINV gives the t value corresponding to a particular two-tailed α and a particular number of degrees of freedom. If you type "=TINV(.05,20)" (no quotes) in a cell of the spreadsheet, the number 2.085963 will appear, which is the t value corresponding to a two-tailed α of 0.05 with 20 degrees of freedom.
R and t values: For users of R, the function qt() provides the t value for a given α level and degrees of freedom. If you want the two-tailed t value corresponding to α = 0.05 with 20 degrees of freedom, then the command is

qt(.975, 20)

which returns 2.085963, matching the tabled value.
df      t value
1       12.71
...     ...
20      2.09
...     ...
100     1.98
...     ...
∞       1.96
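As a check, the following short R sketch (mine, not from the original notes) reproduces the two-tailed critical values in the table; the very large df value stands in for ∞:

dfs <- c(1, 20, 100, 1e6)     # 1e6 approximates infinite degrees of freedom
round(qt(.975, dfs), 2)       # returns 12.71, 2.09, 1.98, 1.96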
The CI is generally more informative than the result of a hypothesis test. Re-
view an intro book if these concepts are not familiar. Here are two elementary
examples of both the CI and the standard hypothesis test.
The hypothesis test for this situation has an analogous form. Let’s assume
that the null hypothesis is that the population mean µ = 0 (but, more
generally, the null hypothesis can be µ = k where k is any real number).
tobserved = (estimate of mean) / (st. error of mean)    (1-4)

          = Ȳ / (s / √n)    (1-5)
with degrees of freedom = n - 1. After computing the observed t from
Equation 1-5, one would then compare the computed t to the tabled value
at the desired α level and n - 1 degrees of freedom.
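For concreteness, here is a small R sketch of the one sample test against µ = 0; the scores are made-up numbers of my own, not data from the notes:

y <- c(1.2, 0.4, -0.3, 2.1, 0.8, 1.5)      # hypothetical scores
n <- length(y)
t.observed <- mean(y) / (sd(y) / sqrt(n))  # Equation 1-5
t.critical <- qt(.975, df = n - 1)         # two-tailed alpha = .05, df = n - 1
abs(t.observed) > t.critical               # TRUE here, so reject the null
t.test(y, mu = 0)                          # same t, df, and p-value in one call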
Hypothesis testing template: It is useful to introduce a template for the hypothesis test that I will use throughout the year. All hypothesis tests proceed in the general manner described in Figure 1-2. For the specific case of a one sample t test the template can be completed as shown in Figure 1-3.
Null Hypothesis The null hypothesis Ho and its alternative Ha are stated.
Structural Model and Test Statistic One states the underlying model
(more on this later), its parameters, and test statistics that will evaluate
the performance of the model against data relative to the null hypothesis.
Critical Test Value Each hypothesis test has a critical value, or "a number to beat," in order to make a statistical decision. This number depends on other features such as degrees of freedom, the number of tests made, whether the test is one- or two-sided, etc.
Statistical decision Reject or fail to reject the null hypothesis.
The α level (e.g., α = .05) refers to the long-run probability of incorrectly rejecting a true null hypothesis. That is, the null hypothesis is treated
as a fixed number, each study provides a test of the null hypothesis (the
information in the study is what is variable, not the fixed value of the null
hypothesis). If the null hypothesis is true and you repeatedly perform study
after study (under the same conditions), 5% of these studies will incorrectly
reject the null hypothesis.
I hope you can see that there is a relation between the CI and the hypothesis
test. Both are very similar in their setup. The hypothesis test is identical
to checking whether the confidence interval includes the value of the null
hypothesis. For example, if the CI around a sample mean has a lower bound
of .8 and an upper bound of 12.1, I immediately know (without having to
do any additional computation) that I would reject the null hypothesis that
the population mean µ = 0. The reason is that the CI does not include 0
within its interval. A second example: if the CI around a sample mean has
a lower bound of -2.4 and an upper bound of 14.3, I immediately know that
I would fail to reject the null hypothesis because the interval does include
the value of the null hypothesis.
There is a sense in which the CI is more informative than the hypothesis test
because the latter is included in the former. The CI also provides information
about variability around the parameter estimate. One could criticize the
null hypothesis test for confounding two pieces of information—effect and
its variability get lumped into a single number. So a t ratio of 2 could arise
in many different ways (e.g., the effect is small and its variability is small, the
effect is large and its variability is large, etc). However, the CI keeps those
two terms separate in that one sees the center of the interval (the estimate)
separate from the width of the interval (its variability).
Figure 1-3: Hypothesis test template for the one sample t test
Null Hypothesis
• Ho : µ = 0
• Ha : µ ≠ 0 (two-sided test)
Structural Model and Test Statistic
The structural model is that the dependent variable Y consists of a grand population mean µ plus random noise ε. In symbols, for each subject i, his or her individual observation Yi is modeled as Yi = µ + εi.
The test statistic operates on the population mean µ and specifies its sampling distribution. The test of the hypothesis will involve an estimate divided by the standard error of that estimate; therefore we make use of the definition of the t distribution (Equation 1-1).

Using the statistical results stated above, we write the specific details for this problem into the definition of the t distribution:

tobserved = (Ȳ − µ) / (s / √n)
Critical Test Value Because we are using a test statistic based on the t
distribution, we use the t table to find the critical value of t, denoted
tcritical . We decide the α level (such as α = 0.05 two-tailed), then do a
table lookup to find the critical value. For instance, if we want to perform
a two-tailed test with α = 0.05, the critical t value with 20 degrees of
freedom will be 2.09 (that is, tcritical = 2.09). This acts as the cutoff in
the next step; the theoretical gauntlet has been thrown.
Statistical decision If the observed t computed from the raw data (the sec-
ond step) exceeds in absolute value terms the critical value tcritical , then
we reject the null hypothesis. If the observed t value does not exceed
the critical value tcritical , then we fail to reject the null hypothesis. In
symbols, if |tobserved | > tcritical , then reject the null hypothesis, otherwise
fail to reject.
This example involves two groups. We will use the trick of converting this new problem into something that we already know how to solve. Let D = Ȳ1 − Ȳ2 denote the difference of the two sample means. We know from section 3c above that

t ∼ D / (st. dev. of the sampling dist of D)    (1-6)
For this application the degrees of freedom = n1 + n2 - 2.
(note that there are no "bars" over the Y's on the right-hand side because we are talking about the variance of the population Y, not the variance of the means Ȳ), then Equation 1-8 reduces to

var(Ȳ1 − Ȳ2) = var(Y) (1/n1 + 1/n2)    (1-9)
This equation states that the variance we need (the variance of the difference between two means) is, under the assumptions made, equal to the variance of the population times a factor that depends on the sample sizes.
The right hand side of the last equation is not hard to work with. Recall
that to get here we had to assume that (1) data were independent and (2)
the two population variances were identical.
Pooled error term: Because the two population variances are assumed equal, we can pool our individual estimates of var(Y1) and var(Y2). We will denote the estimate of the pooled variance as s²pooled, the degrees-of-freedom weighted average of the two sample variances: s²pooled = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2).
The reason for this high level of detail is to highlight how crucial the assump-
tions of independence (Equation 1-7) and equality of population variances
are in the two sample t test. The classic two-sample t test depends on these
assumptions in a fundamental way. If the assumptions do not hold, then the
results of the t test are suspect.
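To make the pooling concrete, here is an R sketch of my own that computes the pooled variance and the classic two sample t by hand, using the two-group example data that appear later in these notes, and checks the result against t.test():

y1 <- c(3, 4, 5, 4, 3, 4, 5, 4, 3, 4)      # group 1 from the example below
y2 <- c(4, 5, 6, 5, 4, 5, 6, 5, 4, 11)     # group 2 (contains the outlier)
n1 <- length(y1); n2 <- length(y2)
# pooled variance: degrees-of-freedom weighted average of the two sample variances
s2.pooled <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)
t.obs <- (mean(y1) - mean(y2)) / sqrt(s2.pooled * (1/n1 + 1/n2))
t.obs                                      # about -2.30 on n1 + n2 - 2 = 18 df
t.test(y1, y2, var.equal = TRUE)$statistic # the built-in test gives the same value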
p-values: The hypothesis testing template doesn't specifically mention p-values. That's
by design. Technically, when we perform a hypothesis test the issue is only
whether or not the observed test statistic (such as t observed) is more ex-
treme than the critical value from the table lookup. If your t observed is
equal to the critical value, then the p-value is .05. It turns out that the t
critical (two-tailed, α = .05) can be stated in the p-value scale as .05. The
p-value is another way to express the t observed, and we can compare our
observed p-value to the critical α = .05 level. The decision to reject or fail
to reject is identical regardless of whether we use t observed and t critical,
or we use p-value and α.
A p-value can serve as a measure of how far one is from the critical test value. As your observed test statistic moves farther beyond the critical value, the p-value drops farther below .05. The decision (reject, fail to reject) remains the same. Traditionally, the view has been that there is relatively little information in a p-value. To make sports analogies: all that matters is that you cross into the endzone, not how far into the endzone you go; a ball is either over the line in tennis or not, how far over doesn't matter; when a pitch is outside the strike zone it is a ball, it doesn't matter how far outside the strike zone.
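If you do want the exact p-value, it is a one-liner in R; this sketch (mine) converts the observed t from the two sample example used in these notes into a two-tailed p-value:

t.obs <- -2.3041                 # observed t from the two sample example below
df <- 18
2 * pt(-abs(t.obs), df)          # two-tailed p-value, about .033
2 * pt(-qt(.975, df), df)        # a t exactly at the critical value gives p = .05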
I wrote a paper years ago (Greenwald et al, 1995; we’ll talk about it later
in the semester) about what information one can extract from a p-value and
the answer is relatively little. But if you do report p-values you might as well
report the complete value (like p = .023 rather than p < .05) so that you can
extract that little bit of extra info, which we’ll talk about later. We already
know whether or not the test is significant (aka null hypothesis rejected), so
also writing down p < .05 in your paper is not necessary. To make things
worse, as people began to realize that p-values didn’t provide the info they
thought, they started adding more things to the list of what to report. So
some fields now report not just t observed, degrees of freedom, whether the
difference between the two means is statistically significant, and confidence
intervals (say in a graph with means and +/- 1 standard error around the
means) but also p-values, effect sizes, and power. It turns out though that
much of the extra information is completely redundant with info already
reported (means, st dev, sample sizes, t observed), and if some of the info is
provided the other missing pieces can be computed.
(Ȳ1 − Ȳ2) ± tcritical spooled √(1/n1 + 1/n2), with df = n1 + n2 − 2.
Thus, the CI conveys the same information as the (two-tailed) hypothesis test.
But, unlike the hypothesis test the CI does not confound the size of the effect
with its variability (i.e., does not lump both numbers into a single ratio but
keeps them separate). The CI format will become very useful later in the course.
Several journals are moving in the direction of encouraging authors to report
confidence intervals rather than p-values from inferential tests.
i. independent samples

ii. equal population variances

iii. normality (in practice what tends to be critical is that the distributions be symmetric)
Figure 1-4: Hypothesis test template for the two sample t test
Null Hypothesis
• Ho : D = 0
• Ha : D ≠ 0 (two-sided test)
Structural Model and Test Statistic
The structural model is that the dependent variable Y consists of a grand population mean µ plus a treatment effect αj plus random noise ε (we'll come back to this later). In symbols, for each subject i, his or her individual observation Yi is modeled as Yi = µ + αj + εi.
The test statistic operates on the difference between the two population means, D, and specifies its sampling distribution. The test of the hypothesis will involve an estimate divided by the standard error of that estimate; therefore we make use of the definition of the t distribution (Equation 1-1).

Using the statistical results stated above, we write the specific details for this problem into the definition of the t distribution:

tobserved = D / (spooled √(1/n1 + 1/n2))
with df = n1 + n2 − 2.
Critical Test Value We use the t table to find the critical value of t, denoted
tcritical for the specific degrees of freedom, two-sided, and α = 0.05.
Statistical decision If |tobserved | > tcritical , then reject the null hypothesis,
otherwise fail to reject.
You may want to graph each mean separately with its own standard error. The SPSS
syntax outline for 95% confidence intervals is
graph
/errorbar (CI 95) DV by GROUP.
Similarly, the SPSS syntax outline for plus or minus one standard error is
graph
/errorbar (STERROR 1) DV by GROUP.
The error bar of plus or minus one standard error roughly corresponds to the hypothesis test, in the sense that whether or not the error bars overlap roughly corresponds to the decision of whether or not to reject the null hypothesis. Another way to say this is that the tcritical value is approximately 2 (the exact value depends on the degrees of freedom for the specific problem as well as other things, such as how the standard error of the difference is defined and whether or not we assume equal population variances). The number "to beat" is approximately two. Intuitively, by plotting one standard error around one mean and one standard error around the other mean, we have the criterion that if the two error bars do not overlap, this roughly corresponds to a significant t test (because nonoverlapping error bars roughly correspond to a t of about 2; see Footnote 3).
Footnote 2: I will adopt the convention that capitalized words are names chosen by the user. For instance, you might
have a dependent variable in your dataset that you call “precall” for percent of words correctly recalled. In
the syntax below, you would replace “DEPENDENT VARIABLE” with “precall”.
Footnote 3: If you want to be exact in having overlapping bars correspond to the test of significance, you also need to take into account other factors. For example, for two groups with equal sample sizes using the classic equal variance t test, the denominator of the t observed is sp√(2/n), so there is a √2 factor that needs to be taken into account in defining the width of the interval around the mean. Also, instead of just saying the
t critical is 2, one should use the exact critical value for t (e.g., for 20 df, the exact critical t is 2.086, so
that means we need half of 2.086 for each of the two means). Being careful in this way, one can construct a
graph where one can use the criterion of nonoverlapping intervals and reach conclusions identical to the test of significance. Some people, like Geoff Cumming (2013, Psychological Science), have argued that we should just use 95% CIs around the mean, forget completely about doing hypothesis testing, and not try to scale the figure, as I did earlier in this long footnote, to connect with hypothesis testing results. There still is no convention about how to plot error bars, so you should be clear about what you are plotting, for example by stating "error bars reflect plus/minus one standard error."
I don’t know of an easy way to get SPSS to plot the 95% CI around the difference
of the two means (where the difference is the estimate around which the CI is com-
puted). The previous two commands give intervals around individual cell means not
the difference between two means. The numerical confidence interval around the dif-
ference between two means is presented in the output of the T-TEST command, but
unfortunately SPSS doesn’t seem to plot it.
SPSS Example: Here is an excerpt of the example in Appendix 2. This example has an outlier that we will discuss a little bit later.
Group 1 Group 2
3 4
4 5
5 6
4 5
3 4
4 5
5 6
4 5
3 4
4 11
Note that Group 2 has one outlier; otherwise, the scores in Group 2 are equal to
Group 1 scores plus one.
The SPSS syntax to read the data and run the two sample t test:
2 6
2 5
2 4
2 5
2 6
2 5
2 4
2 11
end data.
t-test groups=group(1,2)
/variables=dv.
Number
Variable of Cases Mean SD SE of Mean
-----------------------------------------------------------------------
DV
5. R syntax
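The command itself is not printed in this excerpt; a minimal version, using the placeholder names referenced just below (my sketch, not the original line), is

t.test(dependentvar ~ groupingvar, var.equal = TRUE)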
where dependentvar is the name of the dependent variable, groupingvar is the name of the grouping variable, and the var.equal argument denotes that we assume equal variances.
Graphing error bars in R requires some knowledge of setting up plots, which we’ll
cover later. There are several libraries that offer error bar capabilities, including the packages gplots (function plotCI), Zelig (function plot.ci), plotrix (function plotCI), and the package ggplot2 (function geom_errorbar).
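As a preview, here is a minimal ggplot2 sketch of my own (the data are the example scores used below, constructed inline here; the variable names dv and group match that example) that plots each group mean with plus or minus one standard error:

library(ggplot2)
# example data in the same format as the example.dat file used below
data <- data.frame(group = factor(rep(1:2, each = 10)),
                   dv    = c(3,4,5,4,3,4,5,4,3,4, 4,5,6,5,4,5,6,5,4,11))
se   <- function(x) sd(x) / sqrt(length(x))          # standard error of the mean
summ <- data.frame(group = levels(data$group),
                   m     = tapply(data$dv, data$group, mean),
                   se    = tapply(data$dv, data$group, se))
ggplot(summ, aes(x = group, y = m)) +
  geom_point() +
  geom_errorbar(aes(ymin = m - se, ymax = m + se), width = 0.2) +
  labs(y = "mean of dv", caption = "error bars reflect plus/minus one standard error")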
Here is the same example but using R. I put the data in a file called example.dat and then read it into R. You'll need to edit the "PATH/TO/FILE/" part to point to where you saved the example.dat file and uncomment the setwd line by deleting the hashtag.
> #setwd("PATH/TO/FILE/")
> data <- read.table("example.dat", header=T)
> data <- data.frame(data)
group dv
1 3
1 4
1 5
1 4
1 3
1 4
1 5
1 4
1 3
1 4
2 4
2 5
2 6
2 5
2 4
2 5
2 6
2 5
2 4
2 11
Next we make the column called group a factor so that commands know to treat it
as group 1 and group 2 rather than the numbers 1 and 2. Here I specify the column
called group and convert it to a factor.
I use the data argument and assign it the value data; this allows R to find the variables
called group and dv because they ”reside” in the data.frame called data.
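The two commands that produced the output below are not shown above; a minimal version of them, under the naming used in these notes, would look like this (my reconstruction, not the original code):

data$group <- factor(data$group)                   # treat group as a factor, not a number
t.test(dv ~ group, data = data, var.equal = TRUE)  # classic equal variance two sample t test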
data: dv by group
t = -2.3041, df = 18, p-value = 0.03335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.058927 -0.141073
sample estimates:
mean in group 1 mean in group 2
3.9 5.5
Boxplots require knowing the median, the quartiles, and the smallest and largest values. Review an intro book for more information. The boxplot is depicted and summarized in Figure 1-5. The whiskers usually are drawn to the data point closest to, but not exceeding, 1.5 interquartile ranges beyond the quartiles. Data points that exceed the whiskers are plotted as individual points and may be potential outliers.
Boxplot in R: The standard way of plotting boxplots in R is through the boxplot command. The basic command is
Figure 1-5: Schematic boxplots for two samples, with the upper whisker, median, 1st quartile, and lower whisker labeled.
boxplot(dependentvar ~ independentvar)
where dependentvar is the name of the dependent variable and independentvar is the
name of the independent variable.
Well, I was surprised in 2008 to open up the New York Times and see Chernoff faces
reported as though they were something new (Figure 1-7). The story made it seem
that this data analysis technique represents more information than is really there.
Anyway, it was fun to see something from statistics make it into the Times, despite
the fact that it appeared on April Fools Day.
These techniques, generally called exploratory data analysis (EDA), allow you to
understand your data. There are many more techniques in the EDA tradition that
we will discuss throughout the course, especially when examining residuals.
There are two packages in R that have functions for plotting Chernoff faces: aplpack
and TeachingDemos.
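For example, a quick sketch of my own using the aplpack package and R's built-in mtcars data (neither of which appears in the original notes):

library(aplpack)
# one Chernoff face per car, with the first seven variables mapped to facial features
faces(mtcars[1:12, 1:7])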
Appendix 2 shows how to do boxplots for the example on which we just ran two sample t tests.
There are procedures one can perform to test whether assumptions such as equal variance hold. An example is the Levene test (see Footnote 4), which compares two (or more) variances to test whether the null hypothesis of equal population variances can be rejected.
But these tests make their own assumptions. There is something silly about performing a test to check an assumption in order to perform a test..., so I won't emphasize such statistical tests on assumptions. Further, statistical tests on assumptions behave like any other tests: with enough power any difference, no matter how small, will reject the null hypothesis. This is not a desirable feature when testing assumptions because many statistical tests can tolerate small deviations from assumptions, but if the sample size is large even the smallest deviations will appear statistically significant.
SPSS prints out these tests on assumptions (such as the Levene test) automatically,
so you may simply ignore them.
R doesn’t provide these tests automatically. Instead, R has the ability to apply a
different t test that doesn’t assume equal variances, which we’ll discuss later. You
just change the var.equal argument to FALSE.
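If you do want a Levene-type test in R, the car package provides one; here is a sketch under the assumption that the car package is installed and that data is the example data.frame with dv and the factor group (the calls are mine, not from the notes), together with the Welch version of the t test:

library(car)
leveneTest(dv ~ group, data = data)                 # Levene test (median-centered by default)
t.test(dv ~ group, data = data, var.equal = FALSE)  # Welch separate variance t test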
In Lecture Notes 2 I’ll present more techniques for checking assumptions. These
techniques will not rely on statistical tests and are relatively easy to perform and
understand.
Our friend the boxplot: One useful technique for checking the equality of variance assumption involves the boxplot. To check for equality of variance, you can graph boxplots separately for
Footnote 4: The Levene test is based on absolute deviations from the sample medians. There are several versions of the Levene test around; see Kutner et al. for a discussion. Another test you might hear about is the Hartley test, which is a bad test for checking equality of population variances because it is very sensitive to departures from normality. The Hartley test amounts to taking a ratio of the two variances and using a standard F test; again, see Kutner et al. for discussion.
each group. If the width of the box (i.e., the interquartile range, IQR) for one group
differs dramatically from the width for the other group, and/or the length of the
whiskers across groups differs dramatically, then that suggests a violation of equality
of variance. “Dramatically” will be defined in more detail later. The boxplot can also
be used to check for symmetry of the distribution (which is a necessary property of
the normal distribution). A symmetric distribution should have its median close to the middle of the "box" and its two whiskers (indicating minimum and maximum) of about equal length.
8. Effects of outliers; a simple example using an Excel spreadsheet and the two-sample
t test.
(a) You can use a test that does not make the particular assumption you’re violating.
For example, Welch’s separate variance two-sample t test does not make the
equality of variances assumption. The Welch test is similar to the classical two
sample t test but it does not make the assumption that both groups have equal
population variances. The computation is similar to the classic test with two
exceptions: 1) there is no pooling because variances are not assumed to be equal,
and 2) the degrees of freedom are “adjusted” to take into account the discrepancy
between the two variances.
Welch's t test: FYI, here is the formula for Welch's two sample t test. Most statistical packages compute this formula, but it is instructive to look at how the formula works. Similar to the typical t test we have

t = (Ȳ1 − Ȳ2) / √(s1²/n1 + s2²/n2)    (1-15)

with degrees of freedom

df = [(n1 − 1)(n2 − 1)] / [(n2 − 1)c² + (n1 − 1)(1 − c)²]    (1-16)

with

c = (s1²/n1) / (s1²/n1 + s2²/n2)    (1-17)

Quite ugly!
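A small R sketch of my own that plugs the example data from these notes into Equations 1-15 through 1-17 and checks the result against the built-in Welch test:

y1 <- c(3, 4, 5, 4, 3, 4, 5, 4, 3, 4)
y2 <- c(4, 5, 6, 5, 4, 5, 6, 5, 4, 11)
n1 <- length(y1); n2 <- length(y2)
t.welch  <- (mean(y1) - mean(y2)) / sqrt(var(y1)/n1 + var(y2)/n2)      # Equation 1-15
c.term   <- (var(y1)/n1) / (var(y1)/n1 + var(y2)/n2)                   # Equation 1-17
df.welch <- (n1 - 1) * (n2 - 1) /
            ((n2 - 1) * c.term^2 + (n1 - 1) * (1 - c.term)^2)          # Equation 1-16
c(t = t.welch, df = df.welch)                # noninteger df, smaller than n1 + n2 - 2
t.test(y1, y2, var.equal = FALSE)            # same t and df from the built-in Welch test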
To help us understand these degrees of freedom we’ll consider the bounds. The
degrees of freedom for Welch's test are greater than or equal to the sample size minus one for the group having the greater variance. For example, if one
group has size 5 and the other size 10, then Welch’s df will be no smaller than
either 5-1=4 or 10-1=9. Whether the lower bound is 4 or 9 in this example is
driven by which group has the greater variance (e.g., if the group with size 5 has
the greater variance, then the lower bound would be 4; however if the group with
size 10 has the greater variance, then the lower bound would be 9). A conservative
approximation to the lower bound is simply to say that the lower bound is the
smaller sample size minus one (denoted min(n1,n2) - 1, where min represents
the minimum of the two sample sizes). This conservative approximation will be
right when the group with the smaller sample size has the greater variance but
not when the other group has the greater variance. The upper bound is less
than or equal to the usual degrees of freedom (i.e., dfWelch ≤ n1 + n2 − 2). The
magnitude of change in degrees of freedom could be interpreted as an indication
of how severely the equality of variance assumption is violated.
Note that if the two variances are equal, then both the classical two sample t test
and Welch’s separate variance t test yield identical values. So, a simple decision
rule that many statisticians adopt is to always use Welch’s test. The rationale
for this decision rule should be obvious: when the population variances are equal
one gets the same result as the classical test, but when the population variances
differ Welch's test gives the correct result. Also, if the two sample sizes are equal (n1 = n2), then the observed values of the Welch and the classical t tests are equivalent (but note that the degrees of freedom could still differ due to the difference in how the degrees of freedom are calculated, yielding different p-values for the two tests even though the values of the ts may be the same). The difference
in degrees of freedom means the critical t value for the Welch will not be the
same as the t critical in the classic test that assumes equal variances.
SPSS performs Welch’s separate variance t test. But, not all psychologists are
familiar with Welch’s test and you may get funny looks if you present a test
having noninteger degrees of freedom (e.g., t(12.7) = 2.20). Some people round
down to the nearest integer, but if you do that you should label it clearly as a
Welch’s t-test or a separate variance test in your write-up. I don’t mind having
noninteger degrees of freedom.
R performs the Welch t test when the t.test() command is given the argument var.equal = FALSE (which is in fact the default in t.test()).
Here is another heuristic for checking the violation of equality of variances (we’ve
already talked a little about using boxplots for this purpose). If the original
degrees of freedom and the Welch degrees of freedom are close to each other,
Lecture Notes #1: Review of Basic Concepts 1-26
then that suggests that the equal population variance assumption probably holds
because the degree of freedom adjustment wasn’t drastic. However, if there is a
large discrepancy between the two degrees of freedom (i.e., Welch is penalizing
in a big way), then that suggests the equal variance assumption may be violated.
The Mann-Whitney two sample test is a nonparametric version of the two sample
t test. It is identical to performing the usual pooled t test on the ranks of the
data (rather than the raw data). That is, take two independent samples and
transform the scores into ranks (the ranks are computed with respect to both samples combined, not by ranking one sample separately and then ranking the second sample). Then compute a regular two sample t test on those ranks rather
than the original observed scores. This t test on ranks is identical to the Mann-
Whitney U test (if there are ties in rank then a special correction is needed for
the equivalence to hold).
The ranking reduces the effects of outliers and tends to help with violations of
equality of variance. I’ll come back to this test and its generalization later when
we cover ANOVA. It is remarkable that the classic t-test works well on data that
have been transformed to ranks.
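A quick R sketch (mine) showing the connection on the example data from these notes; because these data contain ties, the two p-values agree only approximately:

dv    <- c(3,4,5,4,3,4,5,4,3,4, 4,5,6,5,4,5,6,5,4,11)
group <- factor(rep(1:2, each = 10))
wilcox.test(dv ~ group)                     # Mann-Whitney U test
t.test(rank(dv) ~ group, var.equal = TRUE)  # pooled t test on the ranks of both samples combined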
Data analysis is a juggling act: Keep in mind that there are two sets of assumptions whenever you measure and perform statistical tests. One set of assumptions deals with the measurement
part (see Appendix 3) and the other set of assumptions deals with the statistical
tests (e.g., independence, equal variances, normal distributions). The art of data
analysis involves juggling these two sets of assumptions, especially when remedial
measures needed to deal with one set violate assumptions in the other set.
We will return to the issue of transformations later. At that time, I’ll present
some techniques that will help you select the best transformation given the partic-
ular properties of the data, and I will also spend some more time on exploratory
data analysis.
The paper would have a data analysis section, and that's where you would state how you test hypotheses with a two sample t-test, why you are using the Welch test, how you checked assumptions, any remedial measures you used (transformation or nonparametric tests), etc.
I haven’t talked about effect sizes yet and some journals require effect sizes. I defer
effect size discussion until the next set of lecture notes.
In terms of pre-registration, it can get pretty ugly. For each hypothesis, state the test you will run and provide the code you will use. State how you will check assumptions. What will you do if assumptions aren't met? If you use Bayesian methods (short introduction below), which prior will you use on each of your parameters, what robustness checks will you use to assess the sensitivity of the analysis to the choice of prior, and what criteria will you use to assess simulation convergence? Some preregistration protocols ask you to outline all your analytical steps, including assumption checking and remedial measures. For some cases, like unequal variances in between subjects designs, you can make things easier by pre-registering a Welch test. In that case, if you have unequal variances you are safe; if you meet the assumption you are still ok, because when the assumption is met the Welch test converges to the classical equal variance t test. But
you may not anticipate all the issues and you may have a more complex pattern of
assumption violation, such as nonnormal data, outliers, missing data, etc. A Welch
won’t handle those and some ways of handling those issues may require equal vari-
ances. So pre-registration can get rather detailed. A good strategy is to pre-register
the code you will use and include detailed comments about what you will look at and
actions you will take depending on those results, e.g., "I will conduct a boxplot in R using the command presented below to evaluate whether distributions are symmetric, inspect outliers, and evaluate the equality of variance assumption. If I see outliers as detected by the boxplot default parameters, I will also conduct a Mann-Whitney U test for robustness and will report both the originally planned t test and the Mann-Whitney test in the paper."
If you pre-register, then do what you said you would do. It is fine to deviate from the
plan, but you should be transparent in the paper and point out which analyses were
preregistered and which were not. There is nothing wrong with that; the key is to
be transparent. What if a new paper with a new analytic method comes out as you
are completing the study so you didn’t know about it at the time of preregistration?
You can still use the new method. “At the time of preregistration this technique
was not available. We report the preregistered analysis as well as the newer analytic
approach.” There are several recent cases of authors waving the pre-registration flag
in their paper, but when you compare the plan with what was reported in the paper,
there are major discrepancies and the authors didn’t call them out. At least with
pre-registration one can compare the plan with the implementation, but don’t assume
that just because a paper has a pre-registration badge that all analyses presented were
preregistered (apparently not all reviewers and editors verify that the preregistered
plan was followed).
Bayesian statistics has been rediscovered. It was prominent in the 1940s to 1960s and,
while much work was done to develop theoretical understanding, it did not catch on
despite several attempts by proponents to convince entire disciplines such as psychol-
ogy about its merits (e.g., Edwards, Lindman & Savage, 1963, Bayesian statistical
inference for psychological research. Psychological Review, 70, 193-242). One reason
why it has come back is that computing power makes Bayesian statistics relatively easy
to conduct, whereas in the past it required sophisticated mathematics, such as integrating out parameters and deriving conditional and marginal distributions. Throughout the term I'll present snippets of Bayesian thinking and approaches.
If you want a readable introduction to the topic see the Feinberg & Gonzalez (2012)
chapter that’s in the articles section of Canvas.
For now, I’ll just point out special case Bayesian tests of the one sample and two
sample t-tests we reviewed here. Bayesian statistics makes the same assumptions
as we have already made. The data are assumed independent, groups have equal
population variances and the data are normally distributed. In addition, the Bayesian
also introduces a prior distribution, which represents the uncertainty about every unknown parameter prior to seeing the data. The output of the analysis is a posterior distribution
that takes into account not just the data but the prior information, including the
uncertainty of all the unknown parameters in the model. The important point about
the Bayesian framework is that it provides probability information about the unknown
parameters given the data. I’ll explain through the one and two sample t-tests.
In this case we have two unknown parameters: the mean and the variance. We
hope to use data to reduce that uncertainty. In this illustration I’ll use a pair of
special priors, known as noninformative priors, that say all values are equally likely (specifically, a uniform distribution on the mean and a uniform distribution on log σ, a detail I don't want to get into now). We assume data are normally distributed and independent. Under this setup it is possible to derive the posterior distribution
for the unknown mean that incorporates the uncertainty of the unknown variance.
The posterior distribution of the mean follows a t distribution with n - 1 degrees
of freedom. We know the entire posterior distribution but we can pick out specific
pieces, such as the two values corresponding to the middle 95% of the distribution.
To compute those values we use
Ȳ ± tα/2 (s / √n)    (1-18)
where s is the sample st. dev. and Ȳ is the sample mean. Does this look familiar? It is identical to the CI we saw for the one sample t test. The interval
will be the same under these assumptions and this choice of prior distributions.
The interpretation differs however. The Bayesian interprets this interval as the
95% credible interval such that there is a 95% probability that the true unknown
value of the mean is contained in the interval. Recall the frequentist interpre-
tation is one of repeated sampling (95% of such intervals...), so the Bayesian
interpretation is more natural and more in line with what researchers are hoping
statistical tests provide.
In a two sample t test there are 4 unknown parameters: two unknown means and
two unknown variances. If we assume equal population variances we can reduce
that to 3 unknown parameters. Further, assume independence and normally
distributed data, and invoke three uniform priors, one on each of the two means
and one on log σ. In this setup the posterior distribution for the difference
between two means follows a t distribution with degrees of freedom n1 + n2 -
2. One can then probe this distribution, such as finding the middle 95% and
the formula turns out to be identical to the classical computation of the 95%
confidence interval we saw before, but the interpretation is in terms of the credible
interval.
One can also derive a Bayesian test with unequal population variances. It also
follows a t distribution and the degrees of freedom are almost identical to the
Welch test. For derivations of the Bayesian t tests I presented in these lecture
notes, see Box and Tiao's 1973 book Bayesian Inference in Statistical Analysis, and for a modern textbook see Gelman et al., Bayesian Data Analysis. You'll need a good understanding of probability theory and calculus.
I purposely selected special case prior distributions so that the Bayesian analysis would
yield the same result as the classic frequentist CIs we saw earlier. Of course, the Bayesian is free to choose other priors, and those choices would generally lead to different results than the frequentist approach. The Bayesian approach can also be extended quite
easily. For example, Kruschke (2013) proposed a Bayesian two sample t test that has
an additional parameter to handle outliers (so data that aren’t normally distributed).
Some extensions are not easy to derive, but usually they can be estimated easily
with simulation methods. So there isn't really one Bayesian t-test; rather, there is a Bayesian framework that accommodates different priors and various extensions. When working
with Bayesian statistics it is important to do sensitivity analysis to verify that the
conclusions are not unique to the particular choice of prior distributions, or at least
give a sense for how much a prior has to change before the conclusions change.
12. Bootstrapping
There is another method for computing standard errors that we will use periodically
this year. It uses the same logic as the central limit theorem but instead of relying
on an asymptotic theorem to give a formula for the standard error, the bootstrap
method uses a simulation to compute the standard error. The most common form
of the bootstrap for a mean looks a lot like the logic I presented justifying the central
limit theorem: take a sample and compute the mean, take a second sample and
compute a second mean, etc. Store all those means. You now have an estimate
of the standard deviation of the means, or of other quantities like the 95% CI obtained by taking the values of the sampled means corresponding to the 2.5% and the 97.5% percentiles.
However, a key difference is that rather than repeatedly sampling from the population, the bootstrap treats the sample like the "population" and repeatedly samples from it with replacement to create bootstrapped samples.
the bootstrap is not limited to means. You can compute anything you want for each
sample (such as a median, a correlation, a regression slope, a latent variable, etc)
and get a standard error or a CI for that computed value. There are other forms
of the bootstrap that, instead of sampling raw data, sample the residuals from a fitted model (we first have to learn about models and residuals before covering that method), and there are other bells and whistles like the bias-corrected bootstrap (which doesn't
always remove the bias so don’t be fooled by the name). Appendix 2 illustrates
how to run bootstrapping in SPSS and R using the same example we’ve been using
throughout these lecture notes.
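Before the packaged version in Appendix 2, here is the bare-bones logic in a few lines of R (my own sketch, bootstrapping the mean of group 2 from the running example):

set.seed(613)
y <- c(4, 5, 6, 5, 4, 5, 6, 5, 4, 11)        # group 2 scores from the example
B <- 2000                                    # number of bootstrap samples
# resample the observed data (not the population) with replacement, B times
boot.means <- replicate(B, mean(sample(y, replace = TRUE)))
sd(boot.means)                               # bootstrap standard error of the mean
quantile(boot.means, c(.025, .975))          # percentile 95% CI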
Appendix 1
GRAPH
/HISTOGRAM(NORMAL)=dv .
The SPSS syntax above calls for a histogram with a normal curve superimposed on the
histogram. The normal curve does not provide a reasonable approximation to the observed
histogram.
[Figure: histogram of DV with a normal curve superimposed; Mean = 34168.2, N = 100.]
Now, I’ll transform those same data using a log transformation. Notice that the histogram
appears more symmetric and bell-shaped after the transformation; the bell-shaped curve
also represents the observed histogram better.
GRAPH
/HISTOGRAM(NORMAL)=logdv .
[Figure: histogram of LOGDV with a normal curve superimposed; Mean = 10.03, N = 100; scores range from about 8 to 12.5.]
The logic of a transformation to improve symmetry is that one is trying to affect scores
differentially at the two ends of the distribution. For example, if there is a long tail on
the right side, then a transformation that exerts more influence on those numbers may
do the trick to transform the data into a more symmetric distribution. For example, if
there are many scores between 1 and 5 but a few scores between 25 and 100, then a sqrt
transformation may help with symmetry because the sqrt exerts relatively little effect on
the low numbers but relatively more effect on the high numbers (i.e., 25 becomes 5 and 100 becomes 10).
Appendix 2
Data:
Group 1 Group 2
3 4
4 5
5 6
4 5
3 4
4 5
5 6
4 5
3 4
4 11
Note that Group 2 has one outlier; otherwise, the scores in Group 2 are equal to Group 1 scores plus one.
[SPSS EXAMINE output and boxplots of DV by GROUP, N = 10 per group; the outlier in group 2 appears as a flagged individual point.]
LOGDV
By GROUP 2.00
Valid cases: 10.0 Missing cases: .0 Percent missing: .0
[Figure: SPSS boxplots of LOGDV by GROUP, N = 10 per group; the y axis runs from about .4 to 1.1.]
t-test groups=group(1,2)
/variables=dv.
Number
Variable of Cases Mean SD SE of Mean
-----------------------------------------------------------------------
DV
Number
Variable of Cases Mean SD SE of Mean
-----------------------------------------------------------------------
LOGDV
DV
by GROUP
SUMMARY:
We tested the hypothesis that the population means are equal, that is, µ1 = µ2 . But, we
observed that the variances were not equal (due to a single outlier). We tried the classic t
test, the separate variance (Welch’s) t test, a classic t test on the transformed variable, and
a nonparametric test.
Normally, one does not do all these tests. I just show them together so you can compare
them. You should begin thinking about how you want to handle violations of the assump-
tions. This is a choice based purely on aesthetics. Each method has its advantages and
disadvantages.
One thing we did not do is omit the outlier. Rarely is throwing out a bad data point a good
thing.
Bayesian analysis in SPSS is very limited. Version 25 introduced a few basic Bayesian
analyses such as t tests, one way ANOVA, and simple regression. For now, it is better to use other Bayesian programs, such as ones implemented in R or standalone ones like BUGS or Stan, which are mature and allow complete flexibility. But here goes for Bayesian methods
in SPSS. The syntax for the two sample t test in Bayesian form:
BAYES INDEPENDENT
/INFERENCE DISTRIBUTION=NORMAL VARIABLES=dv ANALYSIS=BOTH GROUP=group SELECT=LEVEL(1 2)
/PRIOR EQUALDATAVAR=FALSE VARDIST=DIFFUSE
/ESTBF COMPUTATION=ROUDER.
Finally, here is how to do bootstrapping in SPSS. We first run the BOOTSTRAP command to inform SPSS that the subsequent command will use the bootstrap; here I use the SPSS default of 1000 bootstrap samples.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES TARGET=group INPUT=dv
/CRITERIA CILEVEL=95 CITYPE=PERCENTILE NSAMPLES=1000.
T-TEST GROUPS=group(1 2)
/VARIABLES=dv
/CRITERIA=CI(.95).
The output for the bootstrapped difference between the two means is a separate table after the usual independent samples test, and the table of descriptive statistics for each group also contains information from the bootstrap.
SWITCHING TO R (repeating this example with R)

The data file is saved as a two column text file called "example.dat". I assume the first row of example.dat has the column names and subsequent rows have the data, which look just like the two columns between begin data/end data in the SPSS example above. In this case the argument header=T is used. If the file example.dat doesn't have column names, then use header=F and add the column names later with the names() command.
I like to use data.frames so after reading in the data file I convert it to a data.frame.
You’ll need to edit the ”PATH/TO/FILE/” part to where you saved the example.dat file.
> #setwd("PATH/TO/FILE/")
> data <- read.table("example.dat", header=T)
> data <- data.frame(data)
group dv
1 3
1 4
1 5
1 4
1 3
1 4
1 5
1 4
1 3
1 4
2 4
2 5
2 6
2 5
2 4
2 5
2 6
2 5
2 4
2 11
Next we make the column called group a factor so that commands know to treat it as group
1 and group 2 rather than the numbers 1 and 2. Here I specify the column called group
and convert it to a factor. Also, add a column of log transformed dv to the data.frame data
(use base 10 to be consistent with the SPSS output). Now the data.frame data will have
three columns.
[Figure: boxplots of dv by group for the example data; the group 2 outlier (dv = 11) is plotted as an individual point.]
T test on both raw and log data. I use the data argument and assign it the value data; this
allows R to find the variables called group and dv because they ”reside” in the data.frame
data.
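The commands described in the preceding paragraphs (converting group to a factor, adding the log transformed column, and running the t tests) are not printed above; a minimal version, under the variable names used here, would be (my sketch, not the original code):

data$group <- factor(data$group)                      # group as a factor
data$logdv <- log10(data$dv)                          # base 10 log of dv, as in the SPSS output
t.test(dv ~ group, data = data, var.equal = TRUE)     # t test on the raw data
t.test(logdv ~ group, data = data, var.equal = TRUE)  # t test on the log data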
data: dv by group
t = -2.3041, df = 18, p-value = 0.03335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.058927 -0.141073
sample estimates:
mean in group 1 mean in group 2
3.9 5.5
data: dv by group
W = 17.5, p-value = 0.01108
alternative hypothesis: true location shift is not equal to 0
Bayesian Analysis: To anticipate later lectures, I’ll present a snippet of a Bayesian analysis.
Here we can use it to verify the claim that if the priors are specified in a particular way the
Bayesian analysis 95% interval mimics the 95% confidence interval from the classical test.
> library(brms)
> #so results are same sign as t test, assign group 2 as reference group
> data$group.relevel <- relevel(data$group, ref = "2")
> #set prior close to the specification above;
> #probably better to stick with default prior for noninformative prior
> bayes.prior <- prior(normal(0, 1e6), class=b) + prior(normal(0, 1e6),class=sigma) +
Family: gaussian
Links: mu = identity; sigma = identity
Formula: dv ~ group.relevel
Data: data (Number of observations: 20)
Samples: 4 chains, each with iter = 20000; warmup = 10000; thin = 5;
total post-warmup samples = 8000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 5.49 0.54 4.45 6.55 7802 1.00
group.relevel1 -1.59 0.76 -3.11 -0.12 7801 1.00
The estimate of the difference between the two means is -1.59 and the CI is [-3.11, -0.12],
very close to the CI from the t test command under equal variances (small discrepancy
between the t test and Bayes is in part due to sampling and the prior settings allowed by
the brm program). I set the iterations of the simulation at 20,000. The first 10,000 are
dropped to account for the simulator “settling in” and then thin means keep every fifth value
of the remaining 10,000 sample draws so 2000 samples remain. Chain = 4 means this was
done 4 times so a total of 8000 samples in the entire simulation.
Here is a plot of the Bayesian analysis, which yields a distribution for the difference between
the two groups (middle row), the distribution of the residual standard deviation (bottom
row), and the distribution of the mean of group 2 (top row), which is nicely normal due to the central limit theorem in action. The second column contains traceplots, which I'll cover later in the term.
There are many ways of accomplishing the bootstrap in R. The most common approach is to
use the general boot package in R. Other approaches include specific functions people have
written and contributed to various packages such as the boot.t.est in the nonpar package. I
prefer the boot package because it is general and can be used across many types of models.
Figure 1-9: Density plot of difference between means with shaded 95% interval
> library(boot)
> #function to compute mean difference;
> #illustrated using the t.test function to do the heavy lifting
> mean.diff <- function(formula,data,indices) {
+ boot.sample <- data[indices,]
+ means <- t.test(formula,data=boot.sample)$estimate
+ return(means[1] - means[2])
+ }
> boot.results <- boot(data=data, statistic=mean.diff, R=2000, formula = dv ~ group)
> boot.ci(boot.results)
CALL :
boot.ci(boot.out = boot.results)
Intervals :
Level Normal Basic
95% (-2.932, -0.276 ) (-2.766, -0.189 )
Note the slightly skewed distribution of the difference between the two means most likely
due to the outlier and relatively small sample. This would slightly throw off inferences of
the classic test. The normal probability plot (introduced later in the course) illustrates the
long tails typical of the t distribution and also a slight asymmetry related to the skewness.
The boot.ci function provides several types of confidence intervals, and these are heavily debated in the field. The normal CI is similar to the regular CI in computation but uses the standard error computed from the bootstrap and a correction for bias; the percentile CI simply reports the 2.5% and 97.5% values of the bootstrap samples; the basic CI is like the percentile with a type of bias correction; and BCa is a different bias correction imposed on the percentile method. There are many more types of CIs used in the bootstrapping
literature. See, e.g., Efron and Tibshirani (1993), An Introduction to the Bootstrap; Davison
and Hinkley (1997), Bootstrap Methods and Their Application, Chapter 5; various websites
like www.datacamp.com/community/tutorials/bootstrap-r.
> plot(boot.results)
[Figure: output of plot(boot.results): a histogram with density overlay of the bootstrapped statistic t* (values roughly between -4 and 0) and a quantile-quantile plot of the bootstrap values.]

Appendix 3
Measurement deals with the assignment of numbers to objects. Statistics deals with what
you do after you’ve collected the numbers and the error around those numbers. Obviously,
one would want to assign numbers in some principled way (this is the fundamental problem
of measurement). There are three primary issues in measurement:
1. does a numerical scale exist at all for the objects in question?

2. how unique is the scale? are there different numerical assignments that are just as good?
3. is the scale meaningful? once the scale is defined, then what are the numbers really
telling us? what operations are permissible on the numbers?
For our purposes, we will only consider uniqueness and meaningfulness (2 and 3); the
existence of numerical scales (1) is beyond our scope and a full treatment requires a strong
background in abstract algebra. Uniqueness refers to the permissible transformations that can be made on a scale while still preserving its inherent properties. Examples:
1. A ratio scale preserves ratios of scale values. Thus, the only permissible transformation is multiplication by a positive number (why?). Examples: length, weight, and time.
2. An interval scale has the property of “equal intervals” throughout the scale. A linear
transformation will preserve this property. Example: temperature.
3. An ordinal scale only has the property that it preserves order. Thus, any transforma-
tion f that preserves order (i.e., is monotonic in the sense that if x > y, then f(x) >
f(y)) is permissible. Examples: the Mohs hardness scale, academic grades (A, B, C, etc.),
Beaufort Wind Scale (0=calm, 1=light air, 2=light breeze, . . . , 12=hurricane). See
Figure 1-11.
Lecture Notes #1: Review of Basic Concepts 1-50
4. A nominal scale uses numbers as labels. The word nominal means name. Thus, any
one-to-one mapping is permissible.
In general,
1. Ordinal scales can be used to make comparisons of order, like f(x) > f(y)

2. Ratio scales can additionally be used to make statements about ratios of scale values, like f(a) = 2f(b)
Consider the process of taking the mean of a bunch of scale values (a common practice in
psychology). Taking a mean is simply a transformation of the scale values. For example,
for four objects where f is the function that assigns a number to an object
What other scales different from f would also be meaningful? Ratio and interval scales
are okay because the transformation permitted under meaningfulness (multiplication by
a constant and linear transformation, respectively) will preserve the order of the means.
However, taking means of an ordinal scale is meaningless because the ordering of the means
may not be preserved when permissible transformations are performed. For example, taking
square roots of all the numbers may alter the ordering of the means. So, even though a
transformation may be called for to satisfy statistical assumptions (say to deal with the
violation of equality of variance), an inappropriate transformation may violate the property
of meaningfulness.
Another example. Take a group of subjects. Each subject watches two different movies
and rates each movie on its level of aggressiveness using a seven point scale. The researcher
wants to see whether movie a is rated as more aggressive than movie b. Can we take means?
The knee-jerk answer is to say “yes”. Let’s look at this problem more closely. We have
(1/n) Σ f(a) > (1/n) Σ f(b)    (1-20)
which represents the statement “the average aggressiveness score for movie a is greater than
the average aggressiveness score for movie b.” Here the average is defined over subjects.
The difference between Equations 1-19 and 1-20 is that in the latter there are no subscripts on a and b because they are always the same two movies.
It is more plausible to think that each subject has his or her own scale fi leading to
(1/n) Σ fi(a) > (1/n) Σ fi(b)    (1-21)
But, it is this more plausible case that kills us as far as meaningfulness goes. If we take
averages on the raw scores, not even a ratio scale is meaningful because each subject has his or her own scale with its own arbitrary positive multiplier αi. That is, even if we make the strong assumption that
aggressiveness ratings are ratio scale (implausible as it may be), the ordering of the means
will not necessarily be preserved. Thus, if each subject is allowed an arbitrary positive
constant αi (because we assume each fi is a ratio scale) we have
(1/n) Σ αi fi(a)  and  (1/n) Σ αi fi(b)    (1-22)
The two terms in Equation 1-22 need not be ordered the same way as in Equation 1-20 due to the different αi's.
One way to guarantee that the ordering of the means observed in Equation 1-20 remains
the same regardless of the arbitrary positive α’s is to perform the log transformation before
computing the mean. Recall that log(xy) = log(x) + log(y).
(1/n) Σ log[fi(a)] > (1/n) Σ log[fi(b)]    (1-23)

(1/n) Σ log[αi f(a)] > (1/n) Σ log[αi f(b)]    (1-24)

(1/n) Σ log αi + (1/n) Σ log f(a) > (1/n) Σ log αi + (1/n) Σ log f(b),    (1-25)
the terms containing the αi's cancel out, giving Equation 1-25 the same ordering as Equation 1-20. So, taking means of logs of different ratio scales is a meaningful operation (see Footnote 5). I suggest you construct an example with made-up numbers that produces such a reversal on the raw data but not on the log scale.
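Taking up that suggestion, here is a small R sketch with made-up numbers of my own: the ordering of the raw means reverses under a permissible per-subject rescaling αi, while the ordering of the mean logs does not:

# two subjects rate movies a and b on their own ratio scales f_i
f <- matrix(c(10, 1,     # subject 1: f1(a), f1(b)
               2, 3),    # subject 2: f2(a), f2(b)
            nrow = 2, byrow = TRUE, dimnames = list(NULL, c("a", "b")))
alpha <- c(1, 100)       # arbitrary positive multipliers, one per subject
g <- alpha * f           # each subject's permissibly rescaled ratings

colMeans(f)              # a = 6.0, b = 2.0   -> movie a has the higher raw mean
colMeans(g)              # a = 105, b = 150.5 -> the ordering reverses after rescaling
colMeans(log(f))         # a > b
colMeans(log(g))         # still a > b: the ordering of mean logs is unaffected by alpha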
The reason I present this example is to illustrate that sometimes a transformation is justified
because of measurement concerns (in the previous example the log played a useful role in
making the mean a “meaningful” operation). So there are cases in which a transformation
is legitimate on measurement grounds. One feature that makes a transformation legitimate
is that it can force parameters from “permissible transformations” to cancel out.
Now we are ready to evaluate (and make some sense of) the following statement: “A t test
doesn’t know where the numbers came from, so any transformation is legitimate.”
Footnote 5: Once you compute the mean on the log scale it is a good idea to transform the means back to the "original"
scale; in this example, take the exponential of each mean. This is mainly for clarity when presenting your
data to others. This transformed mean, i.e., mean on logs and then taking the exponential, is equivalent to
what is known as the “geometric mean”.
Yes, this is true. But, the statement only deals with the statistical assumptions of the
t test (independence, equality of population variances, and normality). After all, we can
even compute a t test on ranks and get a legitimate result with respect to p-values (i.e.,
the Mann-Whitney U test, which many consider to be a legitimate test). What the above
statement doesn’t consider is the notion of meaningfulness. If we want our descriptive
statistics (i.e., the transformations we perform on our scales) to say something meaningful
about the objects in question, then the scale type becomes very important.
A sermon: When analyzing data we usually want to say something about both the objects in question
(that is why we collected data in the first place) and the statistical properties (that’s why
we are computing a p value–in the classical sense, to compute the chances that the observed
data could have come from the conditions specified in the null hypothesis). In other words,
statistics helps us make sense of the numbers in a study, but statistics only operates on the
numbers. To figure out what those numbers mean, i.e., how they relate to the objects in
question, we need measurement theory. Statistics deals with making inferences from data;
measurement theory deals with the connection between data and reality. Clearly, a good
researcher is skillful in both statistics and measurement.
Let’s return to the question of which test to use when the assumptions of the two sample
t test are violated. Unfortunately, there is no best recommendation that works “across the
board.” The choice of test depends on the particular situation. The route we will take in this
course will be to give more weight to the statistical assumptions and transform, or, to use
Tukey’s more neutral term, “re-express,” the data. If we use transformations appropriately,
we can get much mileage from the classical statistical techniques. Of course, as we go along
I will point out situations where remedial procedures other than transformations may be
more appropriate. The reason for emphasizing transformations is that they are quite general
and can be used in just about any situation. As our models become more complicated we
will not have options like tests that don’t make a specific assumption (such as the Welch
test) or nonparametric tests (like the Mann-Whitney).