Statistics A Gentle Introduction CH - 3
STATISTICAL PARAMETERS
Measures of Central Tendency and Variation
Chapter 3 Goals
The Mean
The mean is the arithmetic average of a set of scores. There are actually different kinds of means,
such as the harmonic mean (which will be discussed later in the book) and the geometric mean.
We will first deal with the arithmetic mean. The mean gives someone an idea where the center
lies for a set of scores. The arithmetic mean is obtained by taking the sum of all the numbers in
the set and dividing by the total number of scores in the set.
You already learned how to derive the mean in Chapter 1. Here again is the official formula
in proper statistical notation:
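\[
\bar{x} = \frac{\sum_{i=1}^{k} x_i}{N}
\]
where
\bar{x} = the mean,
\sum_{i=1}^{k} x_i = the sum of all the scores in the set, from the first score (i = 1) to the last (i = k), and
N = the total number of scores in the set (so k = N).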
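The formula can be sketched in a few lines of Python. This is only a minimal illustration: the function name arithmetic_mean and the six scores are chosen here for the example and are not part of the formula itself.

```python
def arithmetic_mean(scores):
    """Return the sum of the scores divided by the number of scores."""
    if not scores:
        raise ValueError("the mean of an empty set of scores is undefined")
    return sum(scores) / len(scores)


scores = [2, 3, 5, 7, 8, 10]  # illustrative scores, not data from a study
mean = arithmetic_mean(scores)
print(round(mean, 1))  # 35 / 6, rounded to one decimal place
```

The function does exactly the three things the formula asks for: adding, counting, and dividing.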
The mean has many important properties that make it useful. Probably its most attractive
quality is that it has a clear conceptual meaning. People almost automatically understand and
easily form a picture of an unseen set of numbers when the mean of that set is presented alone.
Another attractive quality is that the mathematical formula is simple and easy. It involves only
adding, counting, and dividing. The mean also has some more complicated mathematical properties that make it highly useful in more advanced statistical settings such as inferential
statistics. One of these properties is that the mean of a sample is said to be an unbiased estimator of the population mean. But first, let us back up a bit.
The branch of statistics known as inferential statistics involves making inferences or
guesses from a sample about a population. As members of society, we continually make decisions, some big (such as what school to attend, whom to marry, or whether to have an operation) and some little (such as what clothes to wear or what brand of soda to buy). We hope
that our big decisions are based on sound research. For example, if we decide to take a drug
to lower high blood pressure, we hope that the mean response of the participants to the drug
is not just true of the sample but also of the population, that is, all people who could take
the drug for high blood pressure. So if we take the mean blood pressure of a sample of
patients after taking the drug, we hope that it will serve as an unbiased estimator of the
population mean; that is, the mean of a sample should have no tendency to overestimate or
underestimate the population mean, μ (which is also written mu and pronounced mew).
Sample mean:
\[
\bar{x} = \frac{\sum x}{N}
\]
Population mean:
\[
\mu = \frac{\sum x}{N}
\]
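The idea of an unbiased estimator can be illustrated with a small simulation. The sketch below is only a demonstration under stated assumptions: the population (the integers 1 through 100), the sample size of 10, the 2,000 repetitions, and the random seed are all invented for the example.

```python
import random
import statistics

rng = random.Random(0)            # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population: the integers 1..100
mu = statistics.mean(population)  # population mean

# Draw many random samples of 10 scores and record each sample mean.
sample_means = [
    statistics.mean(rng.sample(population, 10))
    for _ in range(2000)
]

# The long-run average of the sample means should sit close to mu.
average_of_sample_means = statistics.mean(sample_means)
print(mu, round(average_of_sample_means, 2))
```

Any single sample mean may land above or below the population mean, but across many samples the means show no systematic tendency in either direction.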
The Median
Please think back to when you heard income or wealth reports in the United States. Can you
recall if they reported the mean or the median income (or wealth)? More than likely, you heard
reports about the median and not the mean. Although the mean is the most widely used
measure of central tendency, it is not always appropriate to use it. There may be many situations
where the median may be a better measure of central tendency. The median value in a set of
numbers is that value that divides the set into equal halves when all the numbers have been
ordered from lowest to highest. Thus, when the median value has been derived, half of all the
numbers in the set should be above that score, and half should be below that score. The reason
that the median is used in reports about income or wealth in the United States is that income
and wealth are unevenly distributed (i.e., they are not normally distributed). A plot of the frequency distribution of income or wealth would reveal a skewed distribution. Now, would the
resulting distribution be positively skewed or negatively skewed? It is easier to answer that
question if you think in terms of wealth. Are most people in the United States wealthy and
there are just a few outlying poor people, or do most people have modest wealth and a few
people own a huge amount? Of course, the latter is true. It is often claimed that the wealthiest
3% of U.S. citizens own about 40% of the total wealth. Thus, wealth is positively skewed. A mean
value of wealth would be skewed or drawn to the right by a few higher values, giving the
appearance that the average person in the United States is far wealthier than he or she really is.
In all skewed distributions, the mean is strongly influenced by the few high or low scores. Thus,
the mean may not truthfully represent the central tendency of the set of scores because it has
been raised (or lowered) by a few outlying scores. In these cases, the median would be a better
measure of central tendency. The formula for obtaining the median for a set of scores will vary
depending on the nature of the ordered set of scores. The following two methods can be used
in many situations.
Method 2
When the scores are ordered from lowest to highest and there are an even number of scores,
the point midway between the two middle values will be the median score. For example, examine the following set of scores:
2, 3, 5, 6, 8, 10
There are six scores, and 6 is an even number; therefore, take the average of 5 and 6, which
is 5.5, and that will be the median value. Notice that in this case, the median value is a hypothetical number that is not in the set of numbers. Let us change the previous set of numbers
slightly and find the median:
2, 3, 5, 7, 8, 10
In this case, 5 and 7 are the two middle values, and their average is 6; therefore, the median
in this set of scores is 6. Let us obtain the mean for this last set, and that is 5.8. In this set, the
median is actually slightly higher than the mean. Overall, however, there is not much of a difference between these two measures of central tendency. The reason for this is that the numbers
are relatively evenly distributed in the set. If the population from which this sample was drawn
is normally distributed (and not skewed), then the mean and the median of the sample will be
about the same value. In a perfectly normally distributed sample, the mean and the median will
be exactly the same value.
Now, let us change the last set of numbers once again:
2, 3, 5, 7, 8, 29
Now the mean for this set of numbers is 9.0, and the median remains 6. Notice that the
mean value was skewed toward the single highest value (29), while the median value was not affected.
[Figure: two frequency distributions (vertical axis: Frequency), each with the mean and the median marked.]
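The three sets of scores above can be checked with a short Python sketch. The even-count branch follows Method 2. The odd-count branch assumes the usual rule of taking the single middle score, because Method 1 is not shown in this text. The function does not implement the special formula for ties at the median.

```python
def median(scores):
    """Median of a list of scores (ties at the median get no special handling)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty set of scores is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: assumed rule, take the single middle score.
        return ordered[middle]
    # Even count (Method 2): midpoint of the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


for scores in ([2, 3, 5, 6, 8, 10], [2, 3, 5, 7, 8, 10], [2, 3, 5, 7, 8, 29]):
    mean = sum(scores) / len(scores)
    print(scores, "mean =", round(mean, 1), "median =", median(scores))
```

Replacing 10 with 29 moves the mean from about 5.8 to 9.0, while the median stays at 6.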
If there are ties at the median value when you use either of the two previous methods, then
you should consult an advanced statistics text for a third median formula, which is much more
complicated than the previous two methods. For example, examine the following set:
2, 3, 5, 5, 5, 10
There are an even number of scores in this set, and normally we would take the average of
the two middle values. However, there are three 5s, and that constitutes a tie at the median
value. Notice that if we used 5 as the value of the median, there is one score above the value 5,
and there are two scores below 5. Therefore, 5 is not the correct median value. It is actually 4.54,
which is confusing (because there are two scores below that value and four above it); however,
it is the correct theoretical median.
The Mode
The mode is a third measure of central tendency. The mode score is the most frequently occurring number in a set of scores. In the previous set of numbers, 5 would be the mode score
because it occurs at a greater frequency than any other number in that set. Notice that the mode
[Figure: two frequency histograms (horizontal axis: Age Intervals; vertical axis: Frequency).]
Player        Salary ($)
Jordan        30,140,000
Rodman         9,000,000
Kukoc          3,960,000
Harper         3,840,000
Longley        2,790,000
Pippen         2,250,000
Brown          1,300,000
Simpkins       1,040,000
Parish         1,000,000
Wennington     1,000,000
Kerr             750,000
Caffey           700,000
Buechler         500,000
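The passage that discussed this table is not preserved here, so the following Python sketch is only one plausible use of it: computing the three measures of central tendency for the 13 salaries with the standard statistics module. The variable names are chosen for the example.

```python
import statistics

salaries = {
    "Jordan": 30_140_000,
    "Rodman": 9_000_000,
    "Kukoc": 3_960_000,
    "Harper": 3_840_000,
    "Longley": 2_790_000,
    "Pippen": 2_250_000,
    "Brown": 1_300_000,
    "Simpkins": 1_040_000,
    "Parish": 1_000_000,
    "Wennington": 1_000_000,
    "Kerr": 750_000,
    "Caffey": 700_000,
    "Buechler": 500_000,
}

values = list(salaries.values())
mean_salary = statistics.mean(values)      # pulled upward by the largest salary
median_salary = statistics.median(values)  # the 7th of the 13 ordered salaries
mode_salary = statistics.mode(values)      # the most frequently occurring salary

print(f"mean   = {mean_salary:,.0f}")
print(f"median = {median_salary:,}")
print(f"mode   = {mode_salary:,}")
```

The mean comes out at roughly 4.5 million dollars, a figure that only three of the 13 players actually reach, while the median is 1,300,000. This is the same pattern described for positively skewed income and wealth data.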
MEASURES OF VARIATION
The second major category of statistical parameters is measures of variation. Measures of variation tell us how far the numbers are scattered about the center value of the set. They are also called
measures of dispersion. There are three common parameters of variation: the range, standard
deviation, and variance. While measures of central tendency are indispensable in statistics, measures of variation provide another important yet different picture of a distribution of numbers. For
example, have you heard of the warning against trying to swim across a lake that averages only 3
feet deep? While the mean does give a picture that the lake on the whole is shallow, we intuitively
know that there is danger because while the lake may average 3 feet in depth, there may be much
deeper places as well as much shallower places. Thus, measures of central tendency are useful in
understanding how scores cluster about a center value, and measures of variation are useful in
understanding how far, wide, or deep the high scores are scattered about the center value.
The Range
The range is the simplest of the measures of variation. The range describes the difference
between the lowest score and the highest score in a set of numbers. Typically, statisticians do not
actually report the range value, but they do state the lowest and highest scores. For example,
given the set of scores 85, 90, 92, 98, 100, 110, 122, the range value is 122 - 85 = 37. Therefore,
the range value for this set of scores is 37, the lowest score is 85, and the highest score is 122.
It is also important to note that the mean for this set is 99.6, and the median score is 98.
However, neither of these measures of central tendency tells us how far the numbers are from
Score (x_i)    Mean (x̄)    Deviation (x_i − x̄)
28             48           −20
42             48           −6
48             48           0
59             48           +11
63             48           +15

\[
\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}
\]
Note that σ, or sigma, represents the population value of the standard deviation. You previously learned Σ as the command to sum numbers together. Σ is the capital Greek letter, and σ is
the lowercase Greek letter. Also note that although they are pronounced the same, they have
radically different meanings. The Σ sign is actually in the imperative mode; that is, it states a
command (to sum a group of numbers). The other value, σ, is in the declarative mode; that is, it
states a fact (it represents the population value of the standard deviation).
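The deviation table and the population formula can be traced step by step in Python. This is a minimal sketch that uses the five scores from the table (28, 42, 48, 59, 63, with a mean of 48). The variable names are chosen for the example.

```python
import math

scores = [28, 42, 48, 59, 63]
n = len(scores)
mean = sum(scores) / n                    # 240 / 5 = 48.0
deviations = [x - mean for x in scores]   # each score minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)
population_sd = math.sqrt(sum_of_squares / n)  # divide by N, then take the root

for x, d in zip(scores, deviations):
    print(f"{x:>3} {mean:>5.0f} {d:+.0f}")
print("sum of deviations:", sum(deviations))
print("population standard deviation:", round(population_sd, 2))
```

The raw deviations always sum to zero, which is why they are squared before they are added together.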
The sample standard deviation has been shown to be a biased estimator of the population
value, and consequently, there is bad news and good news. The bad news is that there are two different formulas, one for the sample standard deviation and one for the population standard deviation. The good news is that statisticians do not often work with a population of numbers. They
typically only work with samples and make inferences about the populations from which they were
drawn. Therefore, we will only use the sample formula. The two formulas are presented as follows:
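Sample standard deviation:
\[
S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}
\]
Population standard deviation:
\[
\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}
\]
where S (capital English letter S) stands for the sample standard deviation.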
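Computational formula for the sample standard deviation:
\[
S = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N - 1}}
\]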
Remember that this computational formula is exactly equal to the theoretical formula presented earlier. The computational formula is simply easier to compute. The theoretical formula
requires going through the entire data three times: once to obtain the mean, once again to
subtract the mean from each number in the set, and a third time to square and add the numbers
together. Note that on most calculators, Σx and Σx² can be computed at the same time; thus,
the set of numbers will only have to be entered in once. Of course, many calculators can obtain
the sample standard deviation or the population value with just a single button (after entering
all of the data). You may wish to practice your algebra, nonetheless, with the computational
formula and check your final answer with the automatic buttons on your calculator afterward.
Later in the course, you will be required to pool standard deviations, and the automatic standard
deviation buttons of your calculator will not be of use. Your algebraic skills will be required, so
it would be good to practice them now.
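The equivalence of the two formulas can be checked with a short Python sketch. The function names sd_theoretical and sd_computational and the five scores are assumptions made for this illustration.

```python
import math


def sd_theoretical(scores):
    """Sample standard deviation from squared deviations about the mean."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))


def sd_computational(scores):
    """Sample standard deviation from the running totals of x and x squared."""
    n = len(scores)
    sum_x = 0
    sum_x2 = 0
    for x in scores:  # a single pass through the data
        sum_x += x
        sum_x2 += x * x
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


scores = [28, 42, 48, 59, 63]  # illustrative scores
print(round(sd_theoretical(scores), 2), round(sd_computational(scores), 2))
```

Both functions return the same value apart from floating-point rounding. The computational version needs only the two running totals, which is what a calculator accumulates as each number is entered.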
THE VARIANCE
The variance is a third measure of variation. It has an intimate mathematical relationship with standard deviation. Variance is defined as the average of the square of the deviations of a set of scores
from their mean. In other words, we use the same formula as we did for the standard deviation,
except that we do not take the square root of the final value. The formulas are presented as follows:
Sample variance:
\[
S^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}
\]
Population variance:
\[
\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{N}
\]
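The relationship between the variance and the standard deviation can be confirmed with Python's standard statistics module. This is a minimal sketch, and the five scores are illustrative.

```python
import math
import statistics

scores = [28, 42, 48, 59, 63]  # illustrative scores

sample_variance = statistics.variance(scores)       # divides by N - 1
population_variance = statistics.pvariance(scores)  # divides by N
sample_sd = statistics.stdev(scores)                # square root of the sample variance

print(sample_variance, population_variance, round(sample_sd, 2))
```

Squaring the sample standard deviation recovers the sample variance, and the population variance is always the smaller of the two because its divisor is larger.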
Statisticians frequently talk about the variance of a set of data, and it is an often-used
parameter in inferential statistics. However, it has some conceptual drawbacks. One of them is
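\[
\begin{aligned}
S &= \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N - 1}}
   = \sqrt{\frac{42287 - \frac{(733)^2}{14}}{13}}
   = \sqrt{\frac{42287 - \frac{537289}{14}}{13}} \\
  &= \sqrt{\frac{42287 - 38377.7857}{13}}
   = \sqrt{\frac{3909.2143}{13}}
   = \sqrt{300.7088} = 17.34
\end{aligned}
\]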
Thus, S = 17.34.
Now, let us see what predictions the empirical rule will make regarding this mean and standard deviation.
1. x̄ + 1S = 52.4 + 17.3 = 69.7
x̄ − 1S = 52.4 − 17.3 = 35.1
Thus, the empirical rule predicts that approximately 68% of all the numbers will fall within
this range of 35.1 years old to 69.7 years old.
If we examine the data, we find that 10 of the 14 numbers are within that range, and 10/14
is about 70%. We find, therefore, that the empirical rule was relatively accurate for this distribution of numbers.
2. x̄ + 2S = 52.4 + 2 (17.3) = 52.4 + 34.6 = 87.0
x̄ − 2S = 52.4 − 2 (17.3) = 52.4 − 34.6 = 17.8
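These figures can be reproduced in Python from the three summary values used in the calculation (N = 14, Σx = 733, Σx² = 42287). The 14 individual ages are not listed here, so the sketch works from the totals only. It rounds the mean and S to one decimal place before forming the intervals, as the worked steps above do.

```python
import math

n = 14          # number of scores
sum_x = 733     # sum of the scores
sum_x2 = 42287  # sum of the squared scores

mean = sum_x / n
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))  # computational formula

mean_1dp = round(mean, 1)
s_1dp = round(s, 1)

intervals = {}
for k in (1, 2):
    low = round(mean_1dp - k * s_1dp, 1)
    high = round(mean_1dp + k * s_1dp, 1)
    intervals[k] = (low, high)
    print(f"mean ± {k}S: {low} to {high}")
```

Rounding before adding and subtracting is a choice made to match the hand calculation. Carrying full precision through to the end would shift some limits by about a tenth of a year.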
HISTORY TRIVIA
Fisher to Eels
Ronald A. Fisher (1890–1962) received an undergraduate degree in astronomy in England. After
graduation, he worked as a statistician and taught mathematics. At the age of 29, he was hired at
an agricultural experimental station. Part of the lure of the position was that they had gathered
approximately 70 years of data on wheat crop yields and weather conditions. The director of the
station wanted to see if Fisher could statistically analyze the data and make some conclusions.
Fisher kept the position for 14 years. Consequently, modern statistics came to develop some
strong theoretical roots in the science of agriculture.
Fisher wrote two classic books on statistics, published in 1925 (Statistical Methods for
Research Workers) and 1935 (The Design of Experiments). He also gave modern statistics two
of its three most frequently used statistical tests, t tests and analysis of variance. Later in his
career, in 1954, he published an interesting story of a scientific discovery about eels and the
standard deviation. The story is as follows:
Johannes Schmidt was an ichthyologist (one who studies fish) and biometrician (one who
applies mathematical and statistical theory to biology). One of his topics of interest was the
number of vertebrae in various species of fish. By establishing means and standard deviations for
the number of vertebrae, he was able to differentiate between samples of the same species
depending on where they were spawned. In some cases, he could even differentiate between
two samples from different parts of a fjord or bay.
However, with eels, he found approximately the same mean and same large standard deviation from samples from all over Europe, Iceland, and Egypt. Therefore, he inferred that eels from
all these different places had the same breeding ground in the ocean. A research expedition in
the Western Atlantic Ocean subsequently confirmed his speculation. In fact, Fisher notes, the
expedition found a different species of eel larvae for eels of the eastern rivers of North America
and the Gulf of Mexico.
SPSS Lesson 3
Your objective for this assignment is to become familiar with generating measurements of central
tendency and variation using SPSS.
3. Double-click the Current Age [Age] variable to move it to the right into the (selected) Variable(s) field.
4. Click Statistics to open the Frequencies: Statistics dialog.
5. In the Central Tendency group box, select Mean, Median, and Mode.
6. In the Dispersion group box, select Std. deviation, Variance, Range, Minimum, and Maximum.
7. Click Continue > OK.
8. This opens the Statistics Viewer to display the frequency distribution for the Age variable.
9. Observe the statistics table for the Current Age variable in the Statistics Viewer.
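For readers without access to SPSS, the same eight statistics can be approximated with Python's standard statistics module. This sketch is not SPSS output and does not use the lesson's data file. It reuses the seven scores from the range example earlier in the chapter, and the dictionary labels simply mirror the SPSS checkbox names.

```python
import statistics

scores = [85, 90, 92, 98, 100, 110, 122]  # scores from the range example

summary = {
    "Mean": statistics.mean(scores),
    "Median": statistics.median(scores),
    "Mode": statistics.multimode(scores),        # every most-frequent value
    "Std. deviation": statistics.stdev(scores),  # sample formula (N - 1)
    "Variance": statistics.variance(scores),     # sample formula (N - 1)
    "Range": max(scores) - min(scores),
    "Minimum": min(scores),
    "Maximum": max(scores),
}

for label, value in summary.items():
    print(f"{label:<15} {value}")
```

Because no score repeats in this set, multimode returns all seven values, which is a reminder that the mode is not informative when every score occurs only once.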