Chapter 3 PDF
DATA DESCRIPTION
Objectives
After completing this chapter, you should be able to:
1. Describe data using measures of central tendency, such as the mean, median, mode and
midrange.
2. Describe data using measures of variation, such as the range, variance and standard deviation.
3. Identify the position of a data value in a data set using various measures of position, such as
standard scores, percentiles, deciles and quartiles.
4. Check for outliers in a data set.
5. Use the techniques of exploratory data analysis, including boxplots, to discover the nature of the
data.
3.1 Introduction
In Chapter 2, we saw how raw data can be analysed by organizing it into a frequency
distribution and then presenting it using various graphs. Organizing and presenting data alone is not
enough to describe it meaningfully, so we will now examine some statistical methods that can be used
to describe the data. These methods include measures of central tendency, measures of variation and
measures of position.
Measures of average, or measures of central tendency, are numerical measures that locate the
center of the dataset. They include the mean, median, mode, midrange and
weighted mean.
Knowing an average such as the mean, median or mode is not enough to describe the dataset entirely,
so measures of variation (or dispersion) are also studied. Measures of variation are
numerical measures that describe the spread of the data values about the center. They
include the range, variance and standard deviation.
In addition to measures of central tendency and measures of variation, there are measures of position (or
location). They are used to locate the relative position of a data value in the dataset. Measures of position
include percentiles, deciles and quartiles. These measures are used extensively in psychology and
education, where they are sometimes referred to as norms.
The types of measures of central tendency that will be discussed in this section are mean, median, mode,
midrange and weighted mean.
A parameter is a characteristic or measure obtained by using all the data values from an entire
population.
A statistic is a characteristic or measure obtained by using all the data values from a specific sample
chosen from a large population.
General Rounding Rule: When computations are done in statistics, the basic rounding rule is that,
rounding should not be done until the final answer is calculated. If rounding is done in every step along
the way, it tends to increase the difference between that answer and the exact one.
The symbol $\bar{X}$ represents the sample mean and $\mu$ represents the population mean.
We use the following formulas summarized in the table below to compute the mean:
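             Raw data                       Ungrouped frequency distribution   Grouped frequency distribution
Sample       $\bar{X} = \frac{\sum X}{n}$   $\bar{X} = \frac{\sum fX}{n}$      $\bar{X} = \frac{\sum fX_m}{n}$
Population   $\mu = \frac{\sum X}{N}$       $\mu = \frac{\sum fX}{N}$          $\mu = \frac{\sum fX_m}{N}$
Where,
n is the sample size
N is the population size
f is the frequency of a class
$X_m$ is the midpoint of a class interval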
The data given below represents the marks scored by a sample of 11 students selected from a particular
English class. Find the mean mark.
67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91
SOLUTION
Since the dataset is a sample and consists of raw data, the mean is given by:
$\bar{X} = \frac{\sum X}{n} = \frac{67 + 89 + \cdots + 91}{11} = \frac{791}{11} \approx 71.9$
Hence, the mean mark is 71.9.
Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs
in the raw data.
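The calculation above can be checked with a short Python sketch. The function name `sample_mean` is illustrative and not part of the text; rounding is applied only to the final answer, as the rounding rules require.

```python
# Marks of the 11 sampled students from the example above.
marks = [67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91]

def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

mean_mark = sample_mean(marks)   # kept unrounded during the computation
print(sum(marks))                # 791
print(round(mean_mark, 1))       # 71.9, rounded only at the end
```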
EXAMPLE 3−2
Using the frequency distribution as in Example 2-2 of Chapter 2, find the mean.
SOLUTION
Rating( X ) Frequency ( f ) fX
1 2
2 1
3 2
4 2
5 2
6 5
7 3
8 2
9 2
10 3
Total n = 24
Step 3: Find the sum of the values in the 3rd column. The completed table is shown below.
Rating (X)   Frequency (f)   fX
1            2               2
2            1               2
3            2               6
4            2               8
5            2               10
6            5               30
7            3               21
8            2               16
9            2               18
10           3               30
Total        n = 24          $\sum fX = 143$
Step 4: Divide the sum of 3rd column by n to get the mean.
$\bar{X} = \frac{\sum fX}{n} = \frac{143}{24} \approx 5.96$
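The steps of this example can be sketched in Python as follows. The helper name `frequency_mean` is an illustrative choice; the ratings and frequencies are those of the table above.

```python
# Ratings (X) and their frequencies (f) from the table above.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [2, 1, 2, 2, 2, 5, 3, 2, 2, 3]

def frequency_mean(values, frequencies):
    """Mean of an ungrouped frequency distribution: sum(f * X) / sum(f)."""
    n = sum(frequencies)
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / n

mean_rating = frequency_mean(ratings, freqs)
print(sum(freqs))               # 24
print(round(mean_rating, 2))    # 5.96
```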
EXAMPLE 3−3
The following is the distribution of the number of fish caught by all 50 fishermen in a coastal area. Find
the mean number of fish caught by a fisherman.
No. of fish caught   No. of fishermen
11 − 15              12
16 − 20              14
21 − 25              13
26 − 30              11
                     n = 50
Step 2: Find the midpoint of each class and enter them in the 3rd column.
Step 3: For each class, multiply the frequency with the midpoints and enter them in the 4 th column.
Step 4: Find the sum of the values in the 4th column. The completed table is shown below.
No. of fish caught   No. of fishermen (f)   Midpoint ($X_m$)   $fX_m$
11 − 15              12                     13                 156
16 − 20              14                     18                 252
21 − 25              13                     23                 299
26 − 30              11                     28                 308
                     n = 50                                    $\sum fX_m = 1015$
$\mu = \frac{\sum fX_m}{N} = \frac{1015}{50} = 20.3$
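A minimal Python sketch of the grouped-data mean, using the class limits and frequencies of this example. The name `grouped_mean` and the tuple layout `(lower limit, upper limit, frequency)` are illustrative assumptions.

```python
# (lower class limit, upper class limit, frequency) for the fish data.
classes = [(11, 15, 12), (16, 20, 14), (21, 25, 13), (26, 30, 11)]

def grouped_mean(class_table):
    """Mean of grouped data: sum(f * X_m) / sum(f), where X_m is the class midpoint."""
    total_f = sum(f for _, _, f in class_table)
    total_fx = sum(f * (low + high) / 2 for low, high, f in class_table)
    return total_fx / total_f

mean_fish = grouped_mean(classes)
print(mean_fish)   # 20.3
```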
The median is the midpoint of the data set when the data is arranged in order.
The numbers of comics purchased on a particular day by nine school students are given below.
3, 7, 10, 5, 9, 4, 11, 7, 2
Find the median.
SOLUTION
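As a sketch, the median of the nine values can be found by sorting them and taking the middle value; for an even number of values the two middle values are averaged. The function below is an illustrative implementation, not code from the text.

```python
# Number of comics purchased by the nine students.
comics = [3, 7, 10, 5, 9, 4, 11, 7, 2]

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sorted(comics))    # [2, 3, 4, 5, 7, 7, 9, 10, 11]
print(median(comics))    # 7
```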
EXAMPLE 3−5
The numbers of tropical cyclones in the Pacific over an 8-year period are as follows.
SOLUTION
The median number of tropical cyclones is $\frac{687 + 702}{2} = 694.5$.
EXAMPLE 3−6
SOLUTION
Step 1: Find the class boundaries, cumulative frequency and cumulative percentage for each class.
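$\text{cumulative percentage} = \frac{\text{cumulative frequency}}{\text{total frequency}} \times 100$
The table is shown below:
Class boundaries   Frequency   Cumulative frequency   Cumulative percentage
10.5 – 15.5        12          12                     $\frac{12}{50} \times 100 = 24$
15.5 – 20.5        14          26                     $\frac{26}{50} \times 100 = 52$
20.5 – 25.5        13          39                     $\frac{39}{50} \times 100 = 78$
25.5 – 30.5        11          50                     $\frac{50}{50} \times 100 = 100$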
Step 2: Using the upper class boundaries for the x values and the cumulative percentage as the y values,
plot the points. This type of ogive is called a Percentile Graph.
[Figure: Percentile Graph. Horizontal axis: no. of fish caught (class boundaries 10.5 to 30.5); vertical axis: cumulative percentage (0 to 100).]
To estimate the median, find the x−value corresponding to the y-value of 50 from the percentile graph.
So the median is estimated to be 20.
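Reading the graph at a cumulative percentage of 50 amounts to linear interpolation between two plotted points. The sketch below assumes the plotted points come from the fish data of Example 3−3 (class frequencies 12, 14, 13 and 11, starting from 0% at the boundary 10.5); `value_at_percentage` is an illustrative helper name.

```python
# (upper class boundary, cumulative percentage) points of the percentile graph.
# Assumption: derived from the class frequencies 12, 14, 13, 11 of Example 3-3.
points = [(10.5, 0.0), (15.5, 24.0), (20.5, 52.0), (25.5, 78.0), (30.5, 100.0)]

def value_at_percentage(graph_points, target):
    """Linearly interpolate the x-value at which the graph reaches `target` percent."""
    for (x0, y0), (x1, y1) in zip(graph_points, graph_points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("target percentage lies outside the plotted range")

median_estimate = value_at_percentage(points, 50)
print(round(median_estimate, 2))   # 20.14, which a graph reading rounds to about 20
```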
Find the mode of the transfer fees of 9 professional soccer players for a specific year. The transfer fees, in
millions of dollars, are: 1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5
SOLUTION
Since $4.5 million occurred 3 times (most often), the mode is $4.5 million.
EXAMPLE 3−8
SOLUTION
A. Since each value occurs only once, there is no mode. (Do not say that the mode is zero).
B. Since both 45 and 55 occur most often (3 times each), the modes are 45 and 55. This set of data
is said to be bimodal.
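The three situations met so far (one mode, no mode, two modes) can be sketched in Python. The function `modes` is illustrative; the bimodal and no-mode lists below are hypothetical data chosen only to exercise those cases, while the transfer fees are those given earlier in this section.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values; an empty list if every value occurs only once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []          # no mode (this is not the same as a mode of zero)
    return sorted(v for v, c in counts.items() if c == highest)

fees = [1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5]   # transfer fees, $ millions
print(modes(fees))                                # [4.5]
print(modes([10, 20, 30, 40]))                    # [] (hypothetical data, no mode)
print(modes([45, 45, 45, 55, 55, 55, 60]))        # [45, 55] (hypothetical, bimodal)
```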
EXAMPLE 3−9
SOLUTION
The modal class is 16 – 20, as it has the highest frequency.
Note: In many cases, the measures of central tendency may have significantly different values. One has
to be very cautious in using these measures.
EXAMPLE 3−10
A small company consists of the owner, the manager, salesperson and two technicians, all of whose
annual salaries are listed below. Find the mean, median and mode.
Here the mean is $20,000, the median is $12,000 and the mode is $9,000. The mean is much higher
than the median and the mode because of the extremely high salary of the owner. In such situations, the median
should be used as the measure of central tendency.
EXAMPLE 3−11
SOLUTION
$MR = \frac{9000 + 50000}{2} = 29500$
Hence, the midrange is $29,500. The midrange is affected by the extreme value of $50,000 in the dataset.
Note: In statistics, several measures can be used for an average. The most common measures are
the mean, median, mode and midrange. Each has its own specific purpose and use. The median is a better
measure when there are extreme values in the dataset, as in Example 3−10.
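The effect of an extreme value on the four averages can be illustrated with a short sketch. The salary list below is an assumption: it is chosen so that it reproduces the stated mean ($20,000), median ($12,000), mode ($9,000) and midrange ($29,500), and should not be read as the original data.

```python
from statistics import mean, median, mode

# Assumed salaries (owner, manager, salesperson, two technicians), chosen to be
# consistent with the results stated in Examples 3-10 and 3-11.
salaries = [50000, 20000, 12000, 9000, 9000]

midrange = (min(salaries) + max(salaries)) / 2

print(mean(salaries))     # 20000, pulled upward by the owner's salary
print(median(salaries))   # 12000, unaffected by the extreme value
print(mode(salaries))     # 9000
print(midrange)           # 29500.0, depends only on the two extreme values
```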
The weighted mean of the data set $x_1, x_2, \ldots, x_n$ with respective weightings $w_1, w_2, \ldots, w_n$ is given by
$\text{Weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}$.
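A minimal sketch of the weighted mean formula in Python. The marks below are hypothetical, and the weights (20, 10, 10 and 60) are one illustrative weighting for a mid-semester test, two assignments and a final exam; `weighted_mean` is an illustrative name.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

marks = [70, 80, 90, 60]      # hypothetical marks: test, assignment 1, assignment 2, exam
weights = [20, 10, 10, 60]    # illustrative weights, in percent

print(weighted_mean(marks, weights))   # 67.0
```

Because the final exam carries the largest weight, the result lies closer to the exam mark of 60 than the unweighted mean of 75 does.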
The mid-semester test had a weight of 20%, assignments had a weight of 10% each and the final exam
has a weight of 60%.
SOLUTION
As in regulation, the weights for the results are in the following ratio:
For awarding the final result, we have to take this weighting into account:
EXAMPLE 3−13
I wish to test two brands of outdoor paint to see how long each will last before fading. The results (in
months) are shown. Find the mean and median of each group. (Assume Population)
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
The mean and the median for both brands of paint are 35 months. Since the mean and the median are
the same for both brands, we cannot conclude which paint is better using these measures of central tendency.
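The claim that both brands share the same mean and median can be verified with Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Months before fading, from the table above.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    # Both brands give a mean of 35 and a median of 35.
    print(name, mean(data), median(data))
```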
The types of measures of variation that will be discussed in this section are range, variance, and standard
deviation.
3.3.1 Range
The range is the simplest measure of variation and is defined as follows:
The range (R) is the highest value minus the lowest value in the data set. That is,
$R = \text{highest value} - \text{lowest value}$
EXAMPLE 3−14
Find the range for the two brands of paints given in Example 3−13.
SOLUTION
Since the range of Brand B is smaller, it can be concluded that Brand B is less variable (more reliable or a
better choice) than Brand A.
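The two ranges behind this conclusion can be computed directly from the data of Example 3−13. The helper name `data_range` is illustrative.

```python
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def data_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)

print(data_range(brand_a))   # 50 months
print(data_range(brand_b))   # 20 months
```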
Since the range is not a good measure of variability when there are extreme values in the dataset, statisticians use
other measures called the variance and the standard deviation.
The corresponding formulas used to calculate these variances of raw data are
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ and $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$,
where
$\mu = \frac{\sum X}{N}$ and $\bar{X} = \frac{\sum X}{n}$.
EXAMPLE 3−15
Find the variance and standard deviation for Brand A paint data given in Example 3−13.
SOLUTION
$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$
Step 2: Subtract the mean from each data value and square each result. The completed table is shown
below.
Brand A (X)   $(X - \mu)^2$
10            $(10 - 35)^2 = 625$
60            $(60 - 35)^2 = 625$
50            225
30            25
40            25
20            225
$\sum (X - \mu)^2 = 625 + 625 + 225 + 25 + 25 + 225 = 1750$
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$
Remarks:
1. The variance and standard deviation of Brand B paint are 41.7 and 6.5 respectively.
2. Since the standard deviation of Brand B is smaller, one can conclude that Brand B is less variable (more
reliable or a better choice) than Brand A.
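The definitional formula for the population variance, and the figures quoted in the remarks, can be reproduced with the following sketch. The function name is illustrative; the data are those of Example 3−13, treated as populations as that example instructs.

```python
from math import sqrt

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def population_variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    var = population_variance(data)
    # Brand A: 291.7 and 17.1; Brand B: 41.7 and 6.5
    print(name, round(var, 1), round(sqrt(var), 1))
```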
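             Raw data                                                  Ungrouped frequency distribution                              Grouped frequency distribution
Sample       $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$     $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$       $s^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{n}}{n - 1}$
Population   $\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$    $\sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N}$      $\sigma^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{N}}{N}$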
Note: Always use the shortcut formulas to compute variance and standard deviation.
EXAMPLE 3−16
Find the variance and standard deviation for Brand A paint data given in Example 3−13 using the shortcut
formula.
SOLUTION
Step 2: Square each data value and enter them in the 2nd column
Brand A (X)      $X^2$
10               100
60               3600
50               2500
30               900
40               1600
20               400
$\sum X = 210$   $\sum X^2 = 9100$
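With the two column totals in hand, the population shortcut formula $\sigma^2 = \frac{\sum X^2 - (\sum X)^2 / N}{N}$ can be evaluated as in the sketch below. The comparison against `statistics.pvariance` is added here only as a cross-check.

```python
from math import isclose, sqrt
from statistics import pvariance

brand_a = [10, 60, 50, 30, 40, 20]

n = len(brand_a)
sum_x = sum(brand_a)                    # 210
sum_x2 = sum(x * x for x in brand_a)    # 9100

# Population shortcut formula: (sum of squares - (sum)^2 / N) / N
variance = (sum_x2 - sum_x ** 2 / n) / n
std_dev = sqrt(variance)

print(round(variance, 1), round(std_dev, 1))    # 291.7 17.1
print(isclose(variance, pvariance(brand_a)))    # True: agrees with the definition
```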
EXAMPLE 3−17
Find the variance and standard deviation of the number of fish caught using the data in Example 3−3.
SOLUTION
No. of fish caught   No. of fishermen (f)
11 – 15              12
16 – 20              14
21 – 25              13
26 – 30              11
                     n = 50
Step 2: Find the midpoint of each class and enter them in the 3rd column.
Step 3: For each class, multiply the frequency with the midpoints and enter them in the 4th column. Find
the sum of the values in the 4th column.
Step 4: For each class, multiply the frequency with the square of the midpoints and enter them in the
5th column. Find the sum of the values in the 5th column. The completed table is shown below.
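No. of fish caught   f        $X_m$   $fX_m$               $fX_m^2$
11 – 15              12       13      156                  2028
16 – 20              14       18      252                  4536
21 – 25              13       23      299                  6877
26 – 30              11       28      308                  8624
                     n = 50           $\sum fX_m = 1015$   $\sum fX_m^2 = 22065$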
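The column totals can be fed into the grouped-data shortcut formula as sketched below. Whether to divide by N or by n − 1 depends on whether the 50 fishermen are treated as a population or a sample; the sketch computes both, and the `population` flag is an illustrative parameter.

```python
from math import sqrt

# (lower class limit, upper class limit, frequency) from Example 3-3.
classes = [(11, 15, 12), (16, 20, 14), (21, 25, 13), (26, 30, 11)]

def grouped_variance(class_table, population=True):
    """Shortcut formula for grouped data, using class midpoints X_m."""
    n = sum(f for _, _, f in class_table)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in class_table)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in class_table)
    numerator = sum_fx2 - sum_fx ** 2 / n
    return numerator / n if population else numerator / (n - 1)

pop_var = grouped_variance(classes, population=True)
sample_var = grouped_variance(classes, population=False)
print(round(pop_var, 2), round(sqrt(pop_var), 2))         # 29.21 5.4
print(round(sample_var, 2), round(sqrt(sample_var), 2))   # 29.81 5.46
```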
The coefficient of variation, denoted by CV, is the standard deviation divided by the mean. The result
is expressed as a percentage.
For population: $CV = \frac{\sigma}{\mu} \times 100\%$
For sample: $CV = \frac{s}{\bar{X}} \times 100\%$
EXAMPLE 3−18
The mean of the number of sales of airplane engines over a 6-month period is 92, and the standard
deviation is 5. The mean of the commissions earned is $5255, and the standard deviation is $770.
Compare the variations of the two.
SOLUTION
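A sketch of the comparison, applying the coefficient of variation formula to the figures stated in the example. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 92)
cv_commissions = coefficient_of_variation(770, 5255)

print(round(cv_sales, 1))         # 5.4 (percent)
print(round(cv_commissions, 1))   # 14.7 (percent)
```

Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales, even though the two quantities are measured in different units.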
The types of measures of position that will be discussed in this section are standard scores, percentiles,
deciles and quartiles.
A z score or standard score for a value is obtained by subtracting the mean from the value and dividing
the result by the standard deviation, i.e.
For population: $z = \frac{X - \mu}{\sigma}$
For sample: $z = \frac{X - \bar{X}}{s}$
EXAMPLE 3−19
A student scored 90 on a Maths test that had a mean of 52 and a standard deviation of 10; she also scored
45 on an English test with a mean of 35 and a standard deviation of 5. Compare her relative positions on
the two tests.
SOLUTION
For Maths: $z = \frac{X - \bar{X}}{s} = \frac{90 - 52}{10} = 3.8$
For English: $z = \frac{X - \bar{X}}{s} = \frac{45 - 35}{5} = 2.0$
Since the z score for Maths is larger, her relative position on the Maths test is higher than on the English test.
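The two standard scores can be reproduced with a one-line function; `z_score` is an illustrative name.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_maths = z_score(90, 52, 10)
z_english = z_score(45, 35, 5)

print(z_maths, z_english)    # 3.8 2.0
print(z_maths > z_english)   # True: the Maths result is relatively higher
```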
3.4.2 Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of
an individual in a group.
Percentiles are data values that divide a dataset, arranged in ascending order, into 100 equal parts.
Each set of observations has 99 percentiles, denoted by $P_1, P_2, \ldots, P_{99}$.
Remarks:
1. $P_{20}$ is called the 20th percentile, which indicates that 20% of the scores fall below $P_{20}$.
2. $P_{50}$ is called the 50th percentile, which indicates that 50% of the scores fall below $P_{50}$.
   $P_{50}$ = median.
Note:
1. To calculate quartiles and deciles of raw data, convert them to percentiles and use the same
steps.
2. To estimate percentiles, deciles and quartiles of raw data, use a Percentile Graph.
Percentile Rank
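We can calculate the percentile rank for a particular value x of a data set by using the formula:
$\text{Percentile rank} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \times 100\%$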
Remarks:
1. $D_4$ is called the 4th decile, which indicates that 40% of the scores fall below $D_4$.
2. $D_5$ is called the 5th decile, which indicates that 50% of the scores fall below $D_5$.
3. $P_{50} = D_5$ = median.
4. $D_1 = P_{10}$; $D_2 = P_{20}$; $D_3 = P_{30}$; …; $D_9 = P_{90}$
3.4.4 Quartiles
Quartiles are data values that divide a dataset, arranged in ascending order, into 4 equal parts.
Each set of observations has 3 quartiles, denoted by $Q_1$, $Q_2$ and $Q_3$.
Remarks:
1. $Q_1$ is called the 1st quartile (or lower quartile), which indicates that 25% of the scores fall below
$Q_1$.
2. $Q_3$ is called the 3rd quartile (or upper quartile), which indicates that 75% of the scores fall below
$Q_3$.
70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82
SOLUTION
56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99
Therefore,
$Q_3 = \frac{85 + 87}{2} = 86$.
4. Percentile rank of 92 $= \frac{10 + 0.5}{12} \times 100\% = 87.5\%$.
Hence, approximately 87.5% of the scores are below 92 in the given data.
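The results of this example can be reproduced with the sketch below. The rule used in `percentile_value` is an assumption: compute c = np/100; if c is a whole number, average the c-th and (c+1)-th ordered values, otherwise round c up and take that value. It is a common textbook convention and gives the values 67.5 and 86 for the first and third quartiles of this data.

```python
import math

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

def percentile_value(values, p):
    """Value at the p-th percentile (0 < p < 100) under the assumed textbook rule."""
    ordered = sorted(values)
    c = len(ordered) * p / 100
    if c == int(c):                      # whole number: average c-th and (c+1)-th values
        k = int(c)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(c) - 1]     # otherwise round up and take that value

def percentile_rank(values, x):
    """Percentile rank: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

print(percentile_value(scores, 25))   # 67.5 (Q1)
print(percentile_value(scores, 75))   # 86.0 (Q3)
print(percentile_rank(scores, 92))    # 87.5
```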
EXAMPLE 3−21
1. P20 .
2. Percentile rank for the score 26.
SOLUTION
[Figure: Percentile Graph. Horizontal axis: no. of fish caught (class boundaries 10.5 to 30.5); vertical axis: cumulative percentage (0 to 100).]
The interquartile range is the difference between the upper quartile and the lower quartile. That
is,
Interquartile range (IQR) $= Q_3 - Q_1$
The quartile deviation is half of the difference between the upper quartile and the lower
quartile. That is,
Quartile deviation (QD) $= \frac{Q_3 - Q_1}{2}$
EXAMPLE 3−22
Find the interquartile range and the quartile deviation for the given data in Example 3−20.
SOLUTION
$Q_1 = 67.5$ and $Q_3 = 86$
Therefore,
Interquartile range $= Q_3 - Q_1 = 86 - 67.5 = 18.5$
and
Quartile deviation $= \frac{Q_3 - Q_1}{2} = \frac{86 - 67.5}{2} = 9.25$
An outlier is an extremely high or an extremely low data value when compared with the rest of the
data values.
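One widely used way to make "extremely high or low" precise, assumed here for illustration, is to flag any value below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$. The sketch applies this rule to the scores of Example 3−20, with the quartiles found in Example 3−22; the extra value 120 is hypothetical and is added only to show a flagged value.

```python
scores = [56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99]
q1, q3 = 67.5, 86.0          # quartiles from Example 3-22

iqr = q3 - q1                # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def find_outliers(values, low, high):
    """Return the values lying outside the interval [low, high]."""
    return [v for v in values if v < low or v > high]

print(iqr, lower_fence, upper_fence)                           # 18.5 39.75 113.75
print(find_outliers(scores, lower_fence, upper_fence))         # [] (no outliers)
print(find_outliers(scores + [120], lower_fence, upper_fence)) # [120] (hypothetical value)
```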
EXAMPLE 3−23
SOLUTION
The data value 70 is suspected of being an outlier. Using the procedure given above, we have:
In EDA,
Data can be organised using a stem and leaf plot.
The measure of central tendency used is the median.
The measure of variation used is the interquartile range.
Data are represented graphically using a box-plot.
A box-plot is a graph that is used to determine the nature and shape of the distribution in EDA. It is
obtained by drawing a horizontal line from the minimum data value to Q1 , drawing a horizontal line from
Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with
a vertical line inside the box passing through the median.
EXAMPLE 3−24
SOLUTION
Step 1: The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 3;
2. $Q_1 = 8$;
3. The median is 11.5;
4. $Q_3 = 16$;
5. The highest value is 20.
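[Figure: box-plot on a horizontal axis from 0 to 22, with whiskers from 3 to 8 and from 16 to 20, a box from 8 to 16 and the median line at 11.5.]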
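A five-number summary of the kind used to draw a box-plot can be computed as sketched below. The quartile convention is an assumption (the median of the lower half and of the upper half of the ordered data, excluding the middle value when n is odd), and the scores of Example 3−20 are used as input for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]
print(five_number_summary(scores))   # (56, 67.5, 78.0, 86.0, 99)
```

The box of the plot spans Q1 to Q3, the line inside it marks the median, and the two horizontal lines extend to the minimum and the maximum.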
3.7 Summary
This chapter discusses statistical techniques for describing data: measures of central tendency,
measures of variation and measures of position. The
measures of central tendency (mean, median, mode and midrange) locate the center of the
data set; the measures of variation (range, variance and standard deviation) gauge the spread
of the data values; and the measures of position (standard score, percentile, decile and quartile) locate
the position of a data value. Further, the chapter explains how to detect outliers in a data set and how
to construct a box-plot.
EXERCISES
2. A survey of all the 110 firms in a small state was carried out to find the number of people employed
at each. The results are shown in the following table.
Number of Employees 1 – 10 11 – 20 21 – 30 31 – 40 41 – 50
Frequency 32 34 14 12 18
3. Suppose an instructor gives two exams and a final exam, assigning the final exam a weight twice
that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the
first two exams and 85 on the final exam.
4. An analysis of monthly wages paid to the workers of firm A and B belonging to the same industry
gives the following results:
Firm A Firm B
Number of Workers 100 200
Average monthly wage $196 $185
Variance of distribution of wages $81 $144