
MODULE IN EDUCATIONAL STATISTICS

TITLE: Measures of Central Tendency

Introduction

A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data. As such, measures
of central tendency are sometimes called measures of central location. They are
also classed as summary statistics. The mean (often called the average) is most
likely the measure of central tendency that you are most familiar with, but there
are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency but, under
different conditions, some measures of central tendency become more appropriate
to use than others. In the following discussions we will look at the mean, median
and mode and learn how to calculate them and under what conditions they are most
appropriate to be used.

Mean (Arithmetic)

The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use
is most often with continuous data. The mean is equal to the sum of all the values
in the data set divided by the number of values in the data set. So, if we have n
values in a data set and they have values $x_1, x_2, \dots, x_n$, then the sample mean,
usually denoted by $\bar{x}$ (pronounced "x bar"), is:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek
capital letter $\Sigma$, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x}{n}$$

You may have noticed that the above formula refers to the sample mean. So, why
have we called it a sample mean? This is because, in statistics, samples and
populations have very different meanings and these differences are very
important, even if, in the case of the mean, they are calculated in the same way.
To acknowledge that we are calculating the population mean and not the sample
mean, we use the Greek lower case letter "mu", denoted as µ:

$$\mu = \frac{\sum x}{n}$$

Module in MEASURES OF CENTRAL TENDENCY


Prepared by: PROF. ALFRED M. OROSCO, M.B.M. Page 1
The mean is essentially a model of your data set: a single value that stands in for
all the values in the set. It is not the most common value (that is the mode), and you
will notice that the mean is often not one of the actual values that you have observed
in your data set. However, one of its important properties is that it minimizes error
in the prediction of any one value in your data set. That is, it is the value that
produces the lowest total squared error when used to predict every value in the data set.

An important property of the mean is that it includes every value in your data set
as part of the calculation. In addition, the mean is the only measure of central
tendency where the sum of the deviations of each value from the mean is always
zero.
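The zero-sum property can be checked numerically. The following is a minimal Python sketch; the scores are hypothetical values chosen only for illustration, and the variable names are not part of the module.

```python
from statistics import mean

# Hypothetical scores, chosen only for illustration.
scores = [4, 8, 15, 16, 23, 42]

x_bar = mean(scores)                      # sum of the values / number of values
deviations = [x - x_bar for x in scores]  # signed distance of each value from the mean

print("mean:", x_bar)
print("sum of deviations:", sum(deviations))  # positive and negative deviations cancel
```

Repeating the check with the median in place of the mean generally does not give zero, which is what makes this property specific to the mean.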

When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence
of outliers. These are values that are unusual compared to the rest of the data set
by being especially small or large in numerical value. For example, consider the
wages of staff at a factory below:

Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data
suggests that this mean value might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries in the $12k to $18k range.
The mean is being skewed by the two large salaries. Therefore, in this situation
we would like to have a better measure of central tendency. As we will find out
later, taking the median would be a better measure of central tendency in this
situation.
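As a quick check of the figures above, the sketch below computes both measures for the ten salaries using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

# Salaries from the table above, in thousands of dollars.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries)      # pulled upward by the 90k and 95k salaries
median_salary = median(salaries)  # average of the 5th and 6th values once sorted

print(mean_salary)    # 30.7
print(median_salary)  # 15.5
```

The median of $15.5k sits inside the $12k to $18k range where most salaries lie, while the mean of $30.7k matches no worker's salary closely.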

Another time when we usually prefer the median over the mean (or mode) is when
our data is skewed (i.e. the frequency distribution for our data is skewed). If we
consider the normal distribution - as this is the most frequently assessed in
statistics - when the data is perfectly normal, then the mean, median and mode are
identical. Moreover, they all represent the most typical value in the data set.
However, as the data becomes skewed the mean loses its ability to provide the best
central location for the data as the skewed data is dragging it away from the typical
value. However, the median best retains this position and is not as strongly
influenced by the skewed values. This is explained in more detail in the skewed
distribution section later in this guide.

Median

The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to
calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56. It is
the middle mark because there are 5 scores before it and 5 scores after it. This
works fine when you have an odd number of scores but what happens when you
have an even number of scores? What if you had only 10 scores? Well, you
simply have to take the middle two scores and average the result. So, if we look at
the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them
to get a median of 55.5.
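The two cases (odd and even number of scores) can be sketched as a small Python function. This is an illustrative implementation of the procedure described above, not a replacement for a library routine; the function name is an assumption.

```python
def median(values):
    """Middle value of the sorted data; averages the two middle values when n is even."""
    ordered = sorted(values)  # arrange in order of magnitude, smallest first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle scores

odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 scores
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 scores

print(median(odd_scores))   # 56
print(median(even_scores))  # 55.5
```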

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it
is represented by the highest bar. You can, therefore, sometimes consider the mode as
being the most popular option. An example of a mode is presented below:

Normally, the mode is used for categorical data where we wish to know which is
the most common category as illustrated below:

We can see above that the most common form of transport, in this particular data
set, is the bus. However, one of the problems with the mode is that it is not always
unique, so it leaves us with problems when we have two or more values that share the
highest frequency, as in the diagram below:

We are now stuck as to which mode best describes the central tendency of the data.
This is particularly problematic when we have continuous data, because it is quite
likely that no single value occurs more often than any other. For example,
consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that
several people will have exactly the same weight, e.g. 67.4 kg? Not very: many
people might be close, but with such a small sample (30 people) and a large range of
possible weights you are unlikely to find many people sharing exactly the same
weight to the nearest 0.1 kg, so no value stands out as the most frequent. This
is why the mode is very rarely used with continuous data.
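Both problems can be seen in a short Python sketch using `statistics.multimode` (available from Python 3.8). The survey answers and weights below are invented for illustration; they are not the data behind the module's diagrams.

```python
from collections import Counter
from statistics import multimode

# Hypothetical survey answers (categorical data).
transport = ["bus", "car", "bus", "train", "car", "bus", "walk", "car"]
counts = Counter(transport)             # frequency of each category
transport_modes = multimode(transport)  # every value that shares the highest frequency

# Hypothetical weights in kg, recorded to the nearest 0.1 kg (continuous data).
weights = [67.4, 71.2, 58.9, 80.1, 64.3]
weight_modes = multimode(weights)       # no repeats, so every value ties for "most frequent"

print(counts)
print(transport_modes)  # "bus" and "car" both occur three times
print(weight_modes)     # all five weights are returned
```

When every value is returned as a mode, the mode carries no information about where the centre of the data lies.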

Another problem with the mode is that it will not provide us with a very good
measure of central tendency when the most common mark is far away from the
rest of the data in the data set, as depicted in the diagram below:

In the previous diagram the mode has a value of 2. We can clearly see, however,
that the mode is not representative of the data, which is mostly concentrated
around the 20 to 30 value range. To use the mode to describe the central tendency
of this data set would be misleading.

Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed as this is a common
assumption underlying many statistical tests. An example of a normally
distributed set of data is presented below:

When you have a normally distributed sample you can legitimately use both the
mean and the median as your measure of central tendency. In fact, in any
symmetrical distribution with a single peak the mean, median and mode are equal. However, in this
situation, the mean is widely preferred as the best measure of central tendency as
it is the measure that includes all the values in the data set for its calculation, and
any change in any of the scores will affect the value of the mean. This is not the
case with the median or mode.

However, when our data is skewed, for example, as with the right-skewed data
set shown in the diagram below, we find that the mean is being dragged in the
direction of the skew. In these
situations, the median is generally considered to be the best representative of the
central location of the data. The more skewed the distribution the greater the
difference between the median and mean, and the greater emphasis should be
placed on using the median as opposed to the mean. A classic example of the
above right-skewed distribution is income (salary), where higher-earners provide
a false representation of the typical income if expressed as a mean and not a
median.

If we expect a normal distribution but tests of normality show that the data is
non-normal, then it is customary to use the median instead of the mean. This is
more a rule of thumb than a strict guideline, however. Sometimes, researchers
wish to report the mean of a skewed distribution if the median and mean are not
appreciably different (a subjective assessment) and if it allows easier comparisons
to previous research to be made.

Summary of when to use the mean, median and mode

Please use the following summary table to know what the best measure of central
tendency is with respect to the different types of variable.

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

CALCULATING THE MEAN, MEDIAN, & MODE OF GROUPED DATA

Formulas for finding the measures of central tendency of grouped data as
well as its standard deviation are discussed below.

Consider the frequency table showing the raw scores of students in a
recently concluded examination.

Class Interval     f    cf     m     fm      fm²
21 - 25            3
26 - 30            5
31 - 35           11
36 - 40            9
41 - 45           12
46 - 50            6
51 - 55            4
                n = 50

There are seven (7) class intervals (or groups) ranging from 21-25 up to
51-55. Each group has a class width (or size) of five (5). Three (3) students scored in
the range of 21-25, five (5) students scored 26-30, eleven (11) scored
31-35, nine (9) scored 36-40, … while four (4) scored
51-55. Altogether, 50 students took the examination.

The cumulative frequency (cf) is the accumulated frequency of each class
interval. Starting from the frequency of the class interval 21-25, that row has cf of
3 (0 + 3 = 3). Then, the cf of the class interval 26-30 is equal to the previous cf
(which is 3) plus the frequency of the class interval (26-30), hence, cf of this group
is 8. For the class interval 31-35, cf is 19 (that is, 8 + 11 = 19) … and for the class
interval 51-55, cf is equal to 50 (which should be equal to n). See table below.

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3
26 - 30            5     8
31 - 35           11    19
36 - 40            9    28
41 - 45           12    40
46 - 50            6    46
51 - 55            4    50
                n = 50
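The running total in the cf column can be reproduced in a few lines. This is a minimal Python sketch using `itertools.accumulate` from the standard library; the frequencies are copied from the table and the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies of the seven class intervals, 21-25 through 51-55.
frequencies = [3, 5, 11, 9, 12, 6, 4]

# Each entry is the previous cumulative frequency plus the current frequency.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]  # the last cumulative frequency should equal the sample size

print(cumulative)  # [3, 8, 19, 28, 40, 46, 50]
print(n)           # 50
```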

The midpoint (m) of each class interval is equal to the sum of the lower limit
(lower range) and the upper limit (higher range) divided by 2:

$$m = \frac{\text{lower limit} + \text{upper limit}}{2}$$

For the class interval 21-25, the midpoint m is 23.
For the class interval 26-30, m = 28, the next midpoint is 33 (for the class interval
31-35), …, and the midpoint m of the class interval 51-55 is 53.

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3    23
26 - 30            5     8    28
31 - 35           11    19    33
36 - 40            9    28    38
41 - 45           12    40    43
46 - 50            6    46    48
51 - 55            4    50    53
                n = 50

The column for fm is the product of the frequency of the class interval and
its midpoint (frequency x midpoint). For the class interval 21-25, fm is 69 (3 x 23
= 69), for 26-30, 140 (5 x 28), then 363 (11 x 33), …, and for the class interval 51-55,
fm is 212 (4 x 53). The sum of the fm column is 1930.

The last column (fm²) is the product of the midpoint m and fm (m x fm),
hence, for the class interval 21-25 fm² is equal to 1587 (23 x 69). For the class
interval 26-30, fm² is 3920 (28 x 140), the next group's fm² is 11979 (33 x 363), …, and
for class interval 51-55, fm² is 11236 (53 x 212). The summation of fm² is 77730.

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3    23     69     1587
26 - 30            5     8    28    140     3920
31 - 35           11    19    33    363    11979
36 - 40            9    28    38    342    12996
41 - 45           12    40    43    516    22188
46 - 50            6    46    48    288    13824
51 - 55            4    50    53    212    11236
                n = 50    Σfm = 1930    Σfm² = 77730

Formulas for computing the mean, median, mode and standard deviation of
grouped data are presented below.

The arithmetic mean of the above grouped data is

$$\text{Mean} = \frac{\sum fm}{n} = \frac{1930}{50} = 38.6$$
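The midpoint, fm and fm² columns and the grouped mean can be reproduced with a short loop. The following Python sketch takes the class limits and frequencies from the table; the variable names (`classes`, `sum_fm`, `sum_fm2`, `grouped_mean`) are illustrative, not part of the module.

```python
# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [
    (21, 25, 3), (26, 30, 5), (31, 35, 11), (36, 40, 9),
    (41, 45, 12), (46, 50, 6), (51, 55, 4),
]

n = sum(f for _, _, f in classes)  # total number of scores

sum_fm = 0
sum_fm2 = 0
for lower, upper, f in classes:
    m = (lower + upper) / 2  # class midpoint
    fm = f * m               # frequency x midpoint
    fm2 = m * fm             # midpoint x fm, which equals f x m squared
    sum_fm += fm
    sum_fm2 += fm2
    print(f"{lower}-{upper}: m={m:g}  fm={fm:g}  fm2={fm2:g}")

grouped_mean = sum_fm / n    # sum of fm divided by n
print(f"n={n}  sum fm={sum_fm:g}  sum fm2={sum_fm2:g}  mean={grouped_mean:g}")
```

Because every score in a class is represented by its midpoint, the grouped mean is an approximation of the mean of the raw scores.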

The median is defined by the formula

$$\text{Median} = LB + \left(\frac{\frac{n}{2} - cf_p}{f_{md}}\right) w$$

where: LB = real lower boundary of the median class (which is 0.5 less
            than its lower limit)
       n = sample size
       cf_p = cumulative frequency of the class that precedes (just
            before) the median class
       f_md = frequency of the median class
       w = class width (or size)

Note: In finding the median class of the grouped data, we have to identify
      the class interval that contains the middle score. In this case, the
      25th score (n/2 = 50/2 = 25) belongs to the class interval 36-40,
      the first class whose cumulative frequency (28) reaches 25.

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3    23     69     1587
26 - 30            5     8    28    140     3920
31 - 35           11    19    33    363    11979
36 - 40            9    28    38    342    12996    <- median class
41 - 45           12    40    43    516    22188
46 - 50            6    46    48    288    13824
51 - 55            4    50    53    212    11236
                n = 50    Σfm = 1930    Σfm² = 77730

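$$\text{Median} = LB + \left(\frac{\frac{n}{2} - cf_p}{f_{md}}\right) w$$

$$\text{Median} = 35.5 + \left(\frac{\frac{50}{2} - 19}{9}\right) 5 = 35.5 + \left(\frac{25 - 19}{9}\right) 5$$

$$\text{Median} = 38.83$$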
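The steps above (find the median class, then interpolate within it) can be sketched in Python as follows. The function name and argument layout are assumptions made for illustration, and subtracting 0.5 to get the real lower boundary assumes whole-number class limits, as in this table.

```python
def grouped_median(classes, width):
    """Median of grouped data.

    classes: list of (lower_limit, frequency) pairs in ascending order.
    width:   class width (w).
    """
    n = sum(f for _, f in classes)
    cf_prev = 0  # cumulative frequency of the classes before the current one
    for lower, f in classes:
        if cf_prev + f >= n / 2:       # first class whose cf reaches n/2: the median class
            lb = lower - 0.5           # real lower boundary (assumes integer limits)
            return lb + (n / 2 - cf_prev) / f * width
        cf_prev += f
    raise ValueError("classes must contain at least one non-zero frequency")

# Lower limits and frequencies from the table above.
score_classes = [(21, 3), (26, 5), (31, 11), (36, 9), (41, 12), (46, 6), (51, 4)]

print(round(grouped_median(score_classes, 5), 2))  # 38.83
```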

The mode of the grouped data is defined by the formula

$$\text{Mode} = LB + \left(\frac{f_{mode} - f_B}{2 f_{mode} - f_B - f_A}\right) w$$

where: LB = real lower boundary of the modal class (which is 0.5 less
            than its lower limit)
       f_mode = frequency of the modal class
       f_B = frequency of the class interval preceding (just before) the
            modal class
       f_A = frequency of the class interval succeeding (just after) the
            modal class
       w = class width (or size)

Note: In finding the modal class of the grouped data, we have to identify
the class interval with the highest frequency. In this case, the modal
class is the class interval 41-45 (with the highest frequency of 12).

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3    23     69     1587
26 - 30            5     8    28    140     3920
31 - 35           11    19    33    363    11979
36 - 40            9    28    38    342    12996
41 - 45           12    40    43    516    22188    <- modal class
46 - 50            6    46    48    288    13824
51 - 55            4    50    53    212    11236
                n = 50    Σfm = 1930    Σfm² = 77730

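Solving for the mode of the grouped data,

$$\text{Mode} = LB + \left(\frac{f_{mode} - f_B}{2 f_{mode} - f_B - f_A}\right) w$$

$$\text{Mode} = 40.5 + \left(\frac{12 - 9}{2(12) - 9 - 6}\right) 5 = 40.5 + \left(\frac{3}{9}\right) 5$$

$$\text{Mode} = 42.17$$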
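The same calculation can be sketched in Python. The function below is illustrative only: it picks the first class with the highest frequency, assumes whole-number class limits (so the real lower boundary is the lower limit minus 0.5), and treats a missing neighbouring class as having frequency 0, which is an assumption the module does not discuss.

```python
def grouped_mode(classes, width):
    """Mode of grouped data.

    classes: list of (lower_limit, frequency) pairs in ascending order.
    width:   class width (w).
    """
    freqs = [f for _, f in classes]
    i = freqs.index(max(freqs))  # modal class: first class with the highest frequency
    f_mode = freqs[i]
    f_before = freqs[i - 1] if i > 0 else 0               # assumption: 0 if no class before
    f_after = freqs[i + 1] if i < len(freqs) - 1 else 0   # assumption: 0 if no class after
    lb = classes[i][0] - 0.5                              # real lower boundary
    # Note: the denominator is zero only if both neighbours equal f_mode.
    return lb + (f_mode - f_before) / (2 * f_mode - f_before - f_after) * width

# Lower limits and frequencies from the table above.
score_classes = [(21, 3), (26, 5), (31, 11), (36, 9), (41, 12), (46, 6), (51, 4)]

print(round(grouped_mode(score_classes, 5), 2))  # 42.17
```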

The Standard Deviation

The standard deviation is a statistic that tells you how tightly all the entries
are clustered around the mean in a set of data. It provides a good indication of
volatility (unpredictability or unstableness). When the sample data are pretty
tightly bunched together and the bell-shaped curve is steep, the standard
deviation is small. When the sample data are spread apart and the bell curve is
relatively flat, there is a relatively large standard deviation.

Standard deviation measures how widely values (scores or grades for
instance) are dispersed from the average (mean). The deviation of a value is the
difference between the actual value (score or grade) and the average value (mean). The
larger these differences are across the group, the higher the standard deviation
will be and the higher the volatility. The
closer the scores are to the mean, the lower the standard deviation and the lower
the volatility.

When the data are normally distributed, one standard deviation away from the
mean in either direction accounts for about 68 percent of the values. Two standard
deviations away from the mean account for roughly 95 percent of the values.
And three standard deviations account for about 99.7 percent of the
values.
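These percentages come from the normal distribution and can be checked with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). This is a small sketch under the assumption that the data follow a normal distribution exactly; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

# Share of values lying within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} s.d.: {share:.1%}")
```

For skewed or otherwise non-normal data the shares can differ noticeably from these values.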

The standard deviation can tell you how spread out the examples in a set
are from the mean.

One use of the standard deviation is in comparing test
scores for different schools. The standard deviation will tell you how diverse the
test scores are for each school.

For example, School A has a higher mean test score than School B. Your
first reaction might be to say that the kids at School A are smarter.

But a bigger standard deviation for one school tells you that there are
relatively more kids at that school scoring toward one extreme or the other. By
asking a few follow-up questions you might find that, say, School A’s mean was
skewed up because the school district sends all of the gifted education kids to
School A. Or that School B’s scores were dragged down because students who
recently have been "mainstreamed" from special education classes have all been
sent to School B.

In this way, looking at the standard deviation can help point you in the
right direction when asking why information is the way it is.

CALCULATING THE STANDARD DEVIATION OF GROUPED DATA

To compute the standard deviation, we will use the formula

$$\text{s.d.}\ (\sigma) = \frac{1}{n}\sqrt{n \sum fm^2 - \left(\sum fm\right)^2}$$

Using the same frequency table as shown below,

Class Interval     f    cf     m     fm      fm²
21 - 25            3     3    23     69     1587
26 - 30            5     8    28    140     3920
31 - 35           11    19    33    363    11979
36 - 40            9    28    38    342    12996
41 - 45           12    40    43    516    22188
46 - 50            6    46    48    288    13824
51 - 55            4    50    53    212    11236
                n = 50    Σfm = 1930    Σfm² = 77730

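the standard deviation is

$$\text{s.d.}\ (\sigma) = \frac{1}{n}\sqrt{n \sum fm^2 - \left(\sum fm\right)^2}$$

$$\text{s.d.}\ (\sigma) = \frac{1}{50}\sqrt{50(77730) - (1930)^2}$$

$$\text{s.d.}\ (\sigma) = \frac{1}{50}\sqrt{3886500 - 3724900}$$

$$\text{s.d.}\ (\sigma) = \frac{1}{50}\sqrt{161600}$$

$$\text{s.d.}\ (\sigma) = \frac{1}{50}(401.9950)$$

$$\text{s.d.}\ (\sigma) = 8.04$$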

EXERCISES:

1. Complete the frequency table and compute for the mean, median, mode
and standard deviation.

Class Interval     f    cf     m     fm      fm²
86 - 96            1
97 - 107           8
108 - 118         18
119 - 129         14
130 - 140         12
141 - 151          3
152 - 162          4

2. Make a frequency table (see above) with six (6) class intervals starting with
the range 15 – 25 until 70 – 80. Then compute for the mean, median, mode
and standard deviation.

18 20 31 30 37 18 53 27 32 55
60 32 55 45 47 51 54 23 56 42
57 62 75 67 49 22 27 32 32 45
27 73 58 41 42 52 53 35 40 50
19 49 21 50 25 30 38 42 45 47

3. Construct a frequency table with seven (7) classes starting with the class
interval 31 – 35 (up to the class interval 61 – 65). Compute for the mean,
median, mode and standard deviation of the grouped data.

47 41 36 43 48 33 50 52 53 47
44 46 52 54 57 48 53 53 32 45
63 43 31 47 44 40 52 45 60 52
35 53 57 48 42 60 46 58 41 36
51 47 54 37 34 54 65 63 40 48

NOTE: All modules and PowerPoint presentations are protected by copyright. It is
unlawful to make copies without the prior written permission of the undersigned.
