Maintaining Standards: Differences Between The Standard Deviation and Standard Error, and When To Use Each
Many people confuse the standard deviation (SD) and the standard error of the mean (SE) and are unsure which,
if either, to use in presenting data in graphical or tabular form. The SD is an index of the variability of the original
data points and should be reported in all studies. The SE reflects the variability of the mean values, as if the study
were repeated a large number of times. By itself, the SE is not particularly useful; however, it is used in constructing
95% and 99% confidence intervals (CIs), which indicate a range of values within which the “true” value lies. The
CI shows the reader how accurate the estimates of the population values actually are. If graphs are used, error
bars equal to plus and minus 2 SEs (which show the 95% CI) should be drawn around mean values. Both statistical
significance testing and CIs are useful because they assist the reader in determining the meaning of the findings.
Key Words: statistics, standard deviation, standard error, confidence intervals, graphing
Figure 1. Data from Table I, plotted with different types of error bars.

below it, and thus they’ll cancel each other out. We can get around this problem by taking the absolute value of each difference (that is, we can ignore the sign whenever it’s negative), but for a number of arcane reasons, statisticians don’t like to use absolute numbers. Another way to eliminate negative values is to square them, since the square of any number, negative or positive, is always positive. So, what we now have is $\sum (X_i - M)^2$.

The second problem is that the result of this equation will increase as we add more subjects. Let’s imagine that we have a sample of 25 values, with an SD of 10. If we now add another 25 subjects who look exactly the same, it makes intuitive sense that the dispersion of these 50 points should stay the same. Yet the formula as it now reads can result only in a larger sum as we add more data points. We can compensate for this by dividing by the number of subjects, N, so that the equation now reads $\sum (X_i - M)^2 / N$.

In the true spirit of Murphy’s Law, what we’ve done in solving these 2 difficulties is to create 2 new ones. The first (or should we say third, so we can keep track of our problems) is that now we are expressing the deviation in squared units; we can fix this simply by taking the square root of the whole expression.

The last problem (yes, it really is the last one) is that the results of the formula as it exists so far produce a biased estimate, that is, one that is consistently either higher or (as in this case) lower than the “true” value. The explanation of this is a bit more complicated and requires somewhat of a detour. Most of the time when we do research, we are not interested so much in the samples we study as in the populations they come from. That is, if we look at the level of expressed emotion (EE) in the families of young schizophrenic males, our interest is in the families of all people who meet the criteria (the population), not just those in our study. What we do is estimate the population mean and SD from our sample. Because all we are studying is a sample, however, these estimates will deviate by some unknown amount from the population values. In calculating the SD, we would ideally see how much each person’s score deviates from the population mean, but all we have available to us is the sample mean. By definition, scores deviate less from their own mean than from any other number. So, when we do the calculation and subtract each score from the sample mean, the result will be smaller than if we subtracted each score from the population mean (which we don’t know); hence, the result is biased downwards. To correct for this, we divide by N − 1 instead of N. Putting all of this together, we finally arrive at the formula for the standard deviation, which is:

$$SD = \sqrt{\frac{\sum (X_i - M)^2}{N - 1}}$$

(By the way, don’t use this equation if, for whatever bizarre reason, you want to calculate the SD by hand, because it leads to too much rounding error. There is another formula, mathematically equivalent and found in any statistics book, which yields a more precise figure.)
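To make these steps concrete, here is a minimal Python sketch (not part of the original article; the scores are made-up values) that computes the SD both ways and shows the effect of the N − 1 correction:

```python
import numpy as np

# Made-up sample of 10 scores, purely for illustration
scores = np.array([23, 27, 31, 25, 29, 33, 21, 28, 30, 26], dtype=float)

n = len(scores)
m = scores.mean()                    # the sample mean, M
sum_sq = np.sum((scores - m) ** 2)   # sum of squared deviations from M

sd_biased = np.sqrt(sum_sq / n)        # divides by N: biased downwards
sd = np.sqrt(sum_sq / (n - 1))         # divides by N - 1: the formula above
print(sd_biased, sd)

# numpy's built-in divides by N unless told otherwise;
# ddof=1 gives the N - 1 version.
print(np.std(scores), np.std(scores, ddof=1))
```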
Now that we’ve gone through all this work, what does it all mean? If we assume that the data are normally distributed, then knowing the mean and SD tells us everything we need to know about the distribution of scores. In any normal distribution, roughly two-thirds (actually, 68.2%) of the scores fall between −1 and +1 SD, and 95.4% between −2 and +2 SD. For example, most of the tests used for admission to graduate or professional schools (the GRE, MCAT, LSAT, and other instruments of torture) were originally designed to have a mean of 500 and an SD of 100. That means that 68% of people get scores between 400 and 600, and just over 95% between 300 and 700. Using a table of the normal curve (found in most statistics books), we can figure out exactly what proportion of people get scores above or below any given value. Conversely, if we want to fail the lowest 5% of test takers (as is done with the LMCCs), then knowing the mean and SD of this year’s class and armed with the table, we can work out what the cut-off point should be.
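In software, the normal-curve table lookup is a one-liner. A small illustrative sketch (assuming the GRE-style scale with a mean of 500 and an SD of 100 described above; scipy’s norm is used here, though any normal table would do):

```python
from scipy.stats import norm

mean, sd = 500, 100   # the GRE/MCAT-style scale from the text

# Proportion scoring between 400 and 600 (within 1 SD): about 0.68
print(norm.cdf(600, mean, sd) - norm.cdf(400, mean, sd))

# Proportion scoring between 300 and 700 (within 2 SDs): about 0.95
print(norm.cdf(700, mean, sd) - norm.cdf(300, mean, sd))

# Cut-off that fails the lowest 5% of test takers: about 335.5
print(norm.ppf(0.05, mean, sd))
```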
So, to summarize, the SD tells us the distribution of individual scores around the mean. Now, let’s turn our attention to the standard error.

Standard Error

I mentioned previously that the purpose of most studies is to estimate some population parameter, such as the mean, the SD, a correlation, or a proportion. Once we have that estimate, another question then arises: How accurate is our estimate? This may seem an unanswerable question; if we don’t know what the population value is, how can we evaluate how close we are to it? Mere logic, however, has never stopped statisticians in the past, and it won’t stop us now. What we can do is resort to probabilities: What is the probability (P) that the true (population) mean falls within a certain range of values? (To cite one of our mottos, “Statistics means you never have to say you’re certain.”)

One way to answer the question is to repeat the study a few hundred times, which will give us many estimates of the mean. We can then take the mean of these means, as well as figure out what the distribution of means is; that is, we can get the standard deviation of the mean values. Then, using the same table of the normal curve that we used previously, we can estimate what range of values would encompass 90% or 95% of the means. If each sample had been drawn from the population at random, we would be fairly safe in concluding that the true mean also falls within this range 90% or 95% of the time. We assign a new name to the standard deviation of the means: we call it the standard error of the mean (abbreviated as SEM or, if there is no ambiguity that we’re talking about the mean, SE).

But first, let’s deal with one slight problem: replicating the study a few hundred times. Nowadays, it’s hard enough to get money to do a study once, much less replicate it this many times (even assuming you would actually want to spend the rest of your life doing the same study over and over). Ever helpful, statisticians have figured out a way to determine the SE based on the results of a single study. Let’s approach this first from an intuitive standpoint: What would make us more or less confident that our estimate of the population mean, based on our study, is accurate? One obvious thing would be the size of the study; the larger the sample size, N, the less chance that one or two aberrant values are distorting the results and the more likely it is that our estimate is close to the true value. So, some index of N should be in the denominator of SE, since the larger N is, the smaller SE would become. Second, and for similar reasons, the smaller the variability in the data, the more confident we are that one value (the mean) accurately reflects them. Thus, the SD should be in the numerator: the larger it is, the larger SE will be, and we end up with the equation:

$$SE = \frac{SD}{\sqrt{N}}$$

(Why does the denominator read √N instead of just N? Because we are really dividing the variance, which is SD², by N, but we end up again with squared units, so we take the square root of everything. Aren’t you sorry you asked?)

So, the SD reflects the variability of individual data points, and the SE is the variability of means.
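The claim that the SE is just “the SD of the means over many replications” is easy to check by simulation. In this sketch (the population values and sample size are arbitrary, chosen only for illustration), we “replicate the study” 10,000 times and compare the SD of the resulting means against the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
pop_mean, pop_sd, n = 50.0, 10.0, 25   # arbitrary population and sample size

# Draw 10,000 samples of size n and keep each sample's mean
means = rng.normal(pop_mean, pop_sd, size=(10_000, n)).mean(axis=1)

print(means.std(ddof=1))      # empirical SD of the means: about 2.0
print(pop_sd / np.sqrt(n))    # the formula SE = SD / sqrt(N): 10 / 5 = 2.0
```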
Confidence Intervals

In the previous section, on the SE, we spoke of a range of values in which we were 95% or 99% confident that the true value of the mean fell. Not surprisingly, this range is called the confidence interval, or CI. Let’s see how it’s calculated. If we turn again to our table of the normal curve, we’ll find that 95% of the area falls between −1.96 and +1.96 SDs. Going back to our example of GREs and MCATs, which have a mean of 500 and an SD of 100, 95% of scores fall between 304 and 696. How did we get those figures? First, we multiplied the SD by 1.96, subtracted it from the mean to find the lower bound, and added it to the mean for the upper bound. The CI is calculated in the same way, except that we use the SE instead of the SD. So, the 95% CI is:

$$95\%\ CI = M \pm (1.96 \times SE)$$

For the 90% CI, we would use the value 1.65 instead of 1.96, and for the 99% CI, 2.58. Using the data from Table I, the SE for administrators is 5.72 / √25, or 1.14, and thus the 95% CI would be 25.83 ± (1.96 × 1.14), or 23.59 to 28.07. We would interpret this to mean that we are 95% confident that the value of LDE in the population of administrators is somewhere within this interval. If we wanted to be more confident, we would multiply 1.14 by 2.58; the penalty we pay for our increased confidence is a wider CI, so that we are less sure of the exact value.
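The administrators’ interval above can be reproduced in a few lines; the mean (25.83), SD (5.72), and N (25) are the Table I values cited in the text:

```python
import math

m, sd, n = 25.83, 5.72, 25   # Table I values for the administrators

se = sd / math.sqrt(n)       # 5.72 / 5 = 1.14 (rounded)
lower = m - 1.96 * se        # about 23.59
upper = m + 1.96 * se        # about 28.07
print(f"95% CI: {lower:.2f} to {upper:.2f}")

# For the 90% CI, use 1.65 in place of 1.96; for the 99% CI, 2.58.
```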
The Choice of Units

Now we have the SD, the SE, and any one of a number of CIs, and the question becomes, which do we use, and when? Obviously, when we are describing the results of any study we’ve done, it is imperative that we report the SD. Just as obviously, armed with this and the sample size, it is a simple matter for the reader to figure out the SE and any CI. Do we gain anything by adding them? The answer, as usual, is yes and no.

Essentially, we want to convey to the reader that there will always be sample-to-sample variation and that the answers we get from one study wouldn’t be exactly the same if the study were replicated. What we would like to show is how much of a difference in findings we can expect: just a few points either way, but not enough to substantially alter our conclusions, or so much that the next study is as likely to show results going in the opposite direction as to replicate the findings. To some degree, this is what significance testing does: the lower the P level, the less likely the results are due simply to chance and the greater the probability that they will be repeated the next time around. Significance tests, however, are usually interpreted in an all-or-nothing manner: either the result was statistically significant or it wasn’t, and a difference between group means that just barely squeaked under the P < 0.05 wire is often given as much credence as one that is highly unlikely to be due to chance.

If we used CIs, either in a table or a graph, it would be much easier for the reader to determine how much variation in results to expect from sample to sample. But which CI should we use? We could draw the error bars on a graph or show in a table a CI that is equal to exactly one SE. This has the advantages that we don’t have to choose between the SE or the CI (they’re identical) and that not much calculation is involved. Unfortunately, this choice of an interval conveys very little useful information. An error bar of plus and minus one SE is the same as the 68% CI; we would be 68% sure that the true mean (or difference between 2 means) fell within this range. The trouble is, we’re more used to being 95% or 99% sure, not 68%. So, to begin with, let’s forget about showing the SE: it tells us little that is useful, and its sole purpose is in calculating CIs.

What about the advice to use plus and minus 2 SEs in the graph? This makes more sense; 2 is a good approximation of 1.96, at least to the degree that graphics programs can display the value and our eyes discern it. The advantages are twofold. First, this method shows the 95% CI, which is more meaningful than 68%. Second, it allows us to do an “eyeball” test of significance, at least in the 2-group situation. If the top of the lower bar (the controls in Figure 1) and the bottom of the higher bar (the administrators) do not overlap, then the difference between the groups is significant at the 5% level or better. Thus we would say that, in this example, the 2 groups were significantly different from one another. If we actually did a t test, we would find this to be true: t(48) = 2.668, P < 0.05. This doesn’t work too accurately if there are more than 2 groups, since we have the issue of multiple tests to deal with (for example, Group 1 versus Group 2, Group 2 versus 3, and Group 1 versus 3), but it gives a rough indication of where the differences lie. Needless to say, when presenting the CI in a table, you should give the exact values (multiply by 1.96, not 2).
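Plotting libraries generally take the half-width of the error bar directly, so drawing 95% CI bars is a matter of passing 1.96 × SE. A matplotlib sketch (the administrators’ mean and SE are from Table I; the controls’ values are placeholders, since their summary statistics are not reproduced here):

```python
import matplotlib.pyplot as plt

# Administrators' mean and SE are from Table I; the controls' numbers
# are placeholders for illustration only.
groups = ["Controls", "Administrators"]
means = [21.0, 25.83]   # control mean: placeholder
ses = [1.3, 1.14]       # control SE: placeholder

half_widths = [1.96 * se for se in ses]   # each bar spans the 95% CI
plt.bar(groups, means, yerr=half_widths, capsize=5)
plt.ylabel("Mean score")
plt.title("Group means with 95% CI error bars (±1.96 × SE)")
plt.show()
```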
Wrapping Up

The SD indicates the dispersion of individual data values around their mean and should be given any time we report data. The SE is an index of the variability of the means that would be expected if the study were exactly replicated a large number of times. By itself, this measure doesn’t convey much useful information. Its main function is to help construct 95% and 99% CIs, which can supplement statistical significance testing and indicate the range within which the true mean or difference between means may be found. Some journals have dropped significance testing entirely and replaced it with the reporting of CIs; this is probably going too far, since both have advantages, and both can be misused to equal degrees. For example, a study using a small sample size may report that the difference between the control and experimental group is significant at the 0.05 level. Had the study indicated the CIs, however, it would be more apparent to the reader that the CI is very wide and the estimate of the difference is crude, at best. By contrast, the much-touted figure of the number of people affected by second-hand smoke is actually not the estimate of the mean. The best estimate of the mean is zero, and it has a very broad CI; what is reported is the upper end of that CI.

To sum up, SDs, significance testing, and 95% or 99% CIs should be reported to help the reader; all are informative and complement, rather than replace, each other. Conversely, “naked” SEs don’t tell us much by themselves, and more or less just take up space in a report. Conducting our studies with these guidelines in mind may help us to maintain the standards in psychiatric research.