Lesson 2: Simple Comparative Experiments
This chapter should be a review for most students who have the required prerequisites. We included it to focus the course and to review the basic assumptions and underpinnings of estimation and hypothesis testing.
Here is an example from the text in which there are two formulations for making cement mortar. It is hard to get a sense of the data when looking only at a table of numbers; you get a much better understanding from a graphical view of the data.
Dot plots work well for getting a sense of the distribution, and they work especially well for very small sets of data.
Another graphical tool is the boxplot, useful for small or larger data sets. If you look at the box
plot you get a quick snapshot of the distribution of the data.
Remember that the box spans the middle 50% of the data (from the 25th to the 75th percentile), and the whiskers extend out to the minimum and maximum of the data, but no farther than 1.5 times the width of the box, i.e. 1.5 times the interquartile range (IQR). So if the data are normal you would expect to see just the box and whiskers with no dots outside. Potential outliers are displayed as single dots beyond the whiskers.
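To make these plots concrete, here is a minimal matplotlib sketch that draws a dot plot and a box plot side by side for two groups. The numbers are illustrative placeholders, not the actual mortar data from the text:

```python
import matplotlib.pyplot as plt

# Hypothetical bond-strength values for two mortar formulations
# (placeholders, not the textbook's data).
formulation_1 = [16.9, 16.4, 17.2, 16.4, 16.5, 17.0, 17.0, 17.2, 16.6, 16.6]
formulation_2 = [16.6, 16.8, 17.4, 17.1, 17.0, 16.9, 17.3, 17.0, 17.1, 17.3]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Dot plot: one point per observation, grouped by formulation.
ax1.plot([1] * len(formulation_1), formulation_1, "o")
ax1.plot([2] * len(formulation_2), formulation_2, "o")
ax1.set_xlim(0.5, 2.5)
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Form. 1", "Form. 2"])
ax1.set_title("Dot plot")

# Box plot: the box spans the IQR; whiskers reach the most extreme
# points within 1.5 * IQR of the box; anything beyond shows as a dot.
ax2.boxplot([formulation_1, formulation_2])
ax2.set_xticks([1, 2])
ax2.set_xticklabels(["Form. 1", "Form. 2"])
ax2.set_title("Box plot")

plt.tight_layout()
plt.show()
```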
This example is a case where the two groups differ in terms of the median, which is the horizontal line in the box. One cannot be sure simply by visualizing the data whether there is a significant difference between the means of these two groups, but both the box plots and the dot plot hint at differences.
For the two-sample t-test, both samples are assumed to come from Normal populations with (possibly different) means μi and variances σi². When the variances are not equal, we will generally try to overcome this by transforming the data so that, on the transformed scale, the variances are approximately equal; with equal variances we can also use the more complex ANOVA models, which likewise assume equal variances. (There is a version of the two-sample t-test that can handle unequal variances, but unfortunately this does not extend to the more complex ANOVA models.) We want to test the hypothesis that the means μi are equal.
Our first look at the data above shows that the means are somewhat different but the
variances look to be about the same. We estimate the mean and the sample variance using
formulas:
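For a sample $y_1, \dots, y_n$ these are the standard estimators:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$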
We divide by n - 1 to get an unbiased estimate of σ². These are the summary statistics for the two-sample problem: if you know the sample size n, the sample mean, and the sample standard deviation (or variance) for each of the two groups, these three quantities are sufficient for performing statistical inference. However, it is dangerous to look only at the summary statistics and not at the data themselves, because the summary statistics tell you nothing about the shape of the distribution or about potential outliers, both things you'd want to know to determine whether the assumptions are satisfied.
The two-sample t-test basically looks at the difference between the sample means relative to the standard deviation of the difference of the sample means. Engineers would express this as a signal-to-noise ratio for the difference between the two groups.
If the underlying distributions are normal, then the z-statistic is the difference between the sample means divided by the true standard deviation of that difference. Of course, we usually do not know the true variances, so we have to estimate them. We therefore use the t-distribution and substitute sample quantities for population quantities, which is something we do frequently in statistics. The resulting ratio is an approximate z-statistic; Gosset published its exact distribution under the pseudonym "Student", and the test is often called the "Student t" test. If we can assume that the variances are equal, an assumption we will make whenever possible, then we can pool or combine the two sample variances to get the pooled standard deviation shown below.
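In symbols, with sample variances $S_1^2, S_2^2$ and sample sizes $n_1, n_2$:

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}, \qquad t = \frac{\bar{y}_1 - \bar{y}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

Under the null hypothesis this statistic follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.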
The denominator of the t-statistic, the noise term, is the pooled standard deviation sp times the square root of the sum of the inverses of the two sample sizes. The t-statistic is thus a signal-to-noise ratio: a measure of how far apart the means are, relative to the variation, for determining whether they are really different.
Do the data provide evidence that the true means differ? Let's test H0: μ1 = μ2 against HA: μ1 ≠ μ2. This is always a relative question: are they different relative to the variation within the groups? Perhaps; they look a bit different. Our t-statistic turns out to be -2.19. If you know the t-distribution, you will recognize this as a borderline value, one that requires us to examine carefully whether these two samples are really far apart.
We compare the sample t-statistic to the t-distribution with the appropriate degrees of freedom. Typically we just calculate the p-value, which is the probability of finding a value at least as extreme as the one in our sample, under the assumption of the null hypothesis that the means are equal. The p-value in our example is essentially 0.043, as shown in the Minitab output below.
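The same test is easy to reproduce outside Minitab. Here is a minimal scipy sketch of the pooled two-sample t-test; the two arrays are hypothetical placeholders standing in for the mortar data, so the printed values will not match the output above exactly:

```python
from scipy import stats

# Hypothetical bond-strength values (placeholders, not the text's data).
formulation_1 = [16.9, 16.4, 17.2, 16.4, 16.5, 17.0, 17.0, 17.2, 16.6, 16.6]
formulation_2 = [16.6, 16.8, 17.4, 17.1, 17.0, 16.9, 17.3, 17.0, 17.1, 17.3]

# equal_var=True gives the pooled test described above;
# equal_var=False would give Welch's test for unequal variances.
t_stat, p_value = stats.ttest_ind(formulation_1, formulation_2, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```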
Confidence intervals involve finding an interval, in this case for the difference in means: we want upper and lower limits that include the true difference in the means with a specified level of confidence, typically 95%.
When a two-sided hypothesis test rejects the null hypothesis, the corresponding confidence interval will not contain 0. In our example above we can see in the Minitab output that the 95% confidence interval does not include 0, the hypothesized value for the difference under the null hypothesis that the two means are equal.
To plan the sample size, one approach is to specify a bound, say B, for the margin of error, and then to specify how certain we want to be that we can detect a difference that large. Recall that when we assume equal sample sizes of n, a confidence interval for μ1 - μ2 is given by:

$$\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2,\,2(n-1)} \, s_p \sqrt{\frac{2}{n}}$$

where n is the sample size for each group, df = n + n - 2 = 2(n - 1), and sp is the pooled standard deviation. Therefore, we first specify B and then solve the equation

$$t_{\alpha/2,\,2(n-1)} \, s_p \sqrt{\frac{2}{n}} \le B$$

for n. Therefore,

$$n = 2\left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2$$
Since in practice we don't know what s will be prior to collecting the data, we will need a guesstimate of σ to substitute into this equation. To do this by hand we use z rather than t, since we don't know the degrees of freedom if we don't know the sample size n; the computer will iteratively update the degrees of freedom as it computes the sample size, giving a slightly larger sample size when n is small.
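Here is a sketch of that iteration, assuming equal group sizes, a two-sided confidence level 1 - α, a bound B, and a guesstimate of σ. The function name and the example values are ours, not from any particular package:

```python
import math
from scipy import stats

def sample_size_per_group(sigma, B, alpha=0.05, max_iter=100):
    """Smallest n per group so that the half-width of the confidence
    interval for mu1 - mu2 is at most B."""
    # Start from the z-based formula: n = 2 * (z * sigma / B)^2.
    z = stats.norm.ppf(1 - alpha / 2)
    n = max(2, math.ceil(2 * (z * sigma / B) ** 2))
    for _ in range(max_iter):
        # Update using the t quantile with df = 2(n - 1); this is the
        # iteration that bumps n up slightly when n is small.
        t = stats.t.ppf(1 - alpha / 2, df=2 * (n - 1))
        n_new = max(2, math.ceil(2 * (t * sigma / B) ** 2))
        if n_new == n:
            break
        n = n_new
    return n

# Hypothetical example: guesstimate sigma = 12, bound B = 10.
print(sample_size_per_group(sigma=12, B=10))
```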
So we need an estimate of σ², a bound B on the margin of error that we want to achieve, and a confidence level 1 - α. With these we can determine the sample size in this comparative type of experiment. We may or may not have direct control over σ², but by using different experimental designs we do have some control over it, and we will address this later in the course. In most cases an estimate of σ² is needed in order to determine the sample size.
One special extension of this method is the binomial situation. Here, where we are estimating proportions rather than a quantitative mean level, we know that the worst-case variance, p(1 - p), occurs when p (the true proportion) equals 0.5, and then we have a simpler approximate sample-size formula, namely n = 2/B² for α = 0.05.
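To see where n = 2/B² comes from, put the worst-case variance of 1/4 in each of the two groups and solve the margin-of-error equation for n:

$$B = z_{\alpha/2}\sqrt{\frac{1/4}{n} + \frac{1/4}{n}} = z_{\alpha/2}\sqrt{\frac{1}{2n}} \;\Longrightarrow\; n = \frac{z_{\alpha/2}^2}{2B^2} \approx \frac{(1.96)^2}{2B^2} \approx \frac{2}{B^2}$$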
In the paired-sample situation, we have a group of subjects on each of whom two measurements are taken. For example, blood pressure was measured before and after a treatment was administered to five subjects. These are not independent samples: for each subject, two measurements are taken, and they are typically correlated, hence we call this paired data. If we perform an independent two-sample t-test, ignoring the pairing, we lose the benefit of the pairing and the variability among subjects becomes part of the error. By using a paired t-test, the analysis is based on the differences (after - before), and thus any variation among subjects is eliminated.
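The contrast between the two analyses is easy to see in a minimal scipy sketch. The before/after blood pressures below are hypothetical placeholders for five subjects, not the Minitab example's actual values:

```python
from scipy import stats

# Hypothetical before/after blood pressures for five subjects.
before = [142, 135, 128, 150, 139]
after  = [138, 132, 125, 145, 136]

# Paired test: based on the within-subject differences (after - before),
# so subject-to-subject variation drops out of the error.
t_paired, p_paired = stats.ttest_rel(after, before)

# Ignoring the pairing leaves that variation in the error term.
t_indep, p_indep = stats.ttest_ind(after, before, equal_var=True)

print(f"paired:      t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"independent: t = {t_indep:.2f}, p = {p_indep:.4f}")
```

With data like these, the paired test is strongly significant while the independent test is not, which is exactly the benefit of pairing described above.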
In our Minitab output we show the example with Blood Pressure on five subjects.
Viewing the output, we see that the different patients' blood pressures vary a lot (standard deviation about 12), but the treatment seems to make a small but consistent difference within each subject. Clearly we have a nuisance factor involved, the subject, which is causing much of this variation. This is a stereotypical situation in which, because the observations are paired and therefore correlated, we should do a paired t-test.
These results show that by using a paired design and taking into account the pairing of the
data we have reduced the variance. Hence our test gives a more powerful conclusion
regarding the significance of the difference in means.
The paired t-test is our first example of a blocking design. In this context the subject is used as a block, and the results from the paired t-test are identical to what we will find when we analyze these data as a Randomized Complete Block Design in Lesson 4.
Decision                   H0 True             HA True
Reject Null Hypothesis     Type I Error (α)    OK
Accept Null Hypothesis     OK                  Type II Error (β)
Before any experiment is conducted, you typically want to know how many observations you will need to run. Suppose you are performing a study to test a hypothesis, for instance the blood pressure example, where we are measuring the efficacy of a blood pressure medication: if the drug is effective, there should be a difference in blood pressure before and after the medication. In that case we want to reject our null hypothesis, and thus we want the power (i.e. the probability of rejecting H0 when it is false) to be as high as possible.
To use the Figure in the text, we first need to calculate the difference in means measured in numbers of standard deviations, i.e. |μ1 - μ2| / σ. You can think of this as a signal-to-noise ratio: how large or strong is the signal, |μ1 - μ2|, in relation to the variation in the measurements, σ. We are not using the symbols in the text because the two editions define d and δ differently. Different software packages and operating characteristic curves may require either |μ1 - μ2| / σ or |μ1 - μ2| / 2σ to compute sample sizes or estimate power, so you need to read the documentation carefully. Minitab avoids the ambiguity by asking for |μ1 - μ2| and σ separately, which seems like a very sensible solution.
Again, if you look at the Figure you read off a β of approximately 0.9. Therefore the power, i.e. the chance of rejecting the null hypothesis when it is false, computed before doing the experiment, is 1 - β = 1 - 0.9 = 0.1, or about ten percent. With such low power we should not even do the experiment!
If we were willing to do a study that would only detect a true difference of, say, |μ1 - μ2| = 18, then n* would still equal 9, and the Figure (2-12 in the text) shows that β is about 0.5, so the power, the chance of detecting a difference of 18, is also 0.5. This is still not very satisfactory, since we only have a 50/50 chance of detecting a true difference of 18 even if it exists.
These calculations can also be done in Minitab as shown below. Under the Menu:
Stat > Power and Sample Size > 2-sample t, simply input the sample size n = 10, the difference δ = 18, and the standard deviation σ = 12.
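The same calculation can be sketched in Python with statsmodels; note that, unlike Minitab, statsmodels expects the standardized effect size |μ1 - μ2| / σ rather than the difference and σ separately:

```python
from statsmodels.stats.power import TTestIndPower

# Inputs mirroring the Minitab dialog: n = 10 per group,
# difference = 18, standard deviation = 12.
effect_size = 18 / 12  # |mu1 - mu2| / sigma = 1.5

power = TTestIndPower().power(effect_size=effect_size, nobs1=10,
                              alpha=0.05, alternative="two-sided")
print(f"power = {power:.3f}")
```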
Another way to improve power is to use a more efficient procedure. For example, if we have paired observations we could use a paired t-test; we would then expect a much smaller sigma, perhaps somewhere around 2 rather than 12, so our signal-to-noise ratio would be larger because the noise component is smaller. We do pay a small price for this: the paired t-test has n - 1 degrees of freedom instead of 2n - 2.
If you can reduce the variance, or noise, you can achieve an enormous savings in the number of observations you have to collect. The benefit of a good design, therefore, is to get much more power for the same cost, or the same power at a much lower cost.
We now show another approach to calculating power, namely using software tools rather than
the graph in Figure 2.12. Let's take a look at how Minitab handles this below.
You can use these dialog boxes to plug in the values that you have assumed and have
Minitab calculate the sample size for a specified power, or the power that would result, for a
given sample size.
Exercise: Use the assumptions above, and confirm the calculations of power for these
values.