Chapter 3 PDF

This chapter discusses statistical methods for describing data, including measures of central tendency, variation, and position. It provides objectives and an introduction. Measures of central tendency evaluated are the mean, median, mode, midrange, and weighted mean. Measures of variation examined include range, variance, and standard deviation. Measures of position referenced are percentiles, deciles, and quartiles. Examples are provided to demonstrate calculating the mean from raw data and a grouped frequency distribution.

Uploaded by

Shiv Neel
Copyright
© All Rights Reserved

CHAPTER 3:

DATA DESCRIPTION

Chapter 3: Data Description 34


Overview
This chapter discusses how data can be described using statistical methods. The concepts discussed in
this chapter are as follows: measure of central tendency; measure of variation; measure of position;
outliers; exploratory data analysis. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Describe data, using measures of central tendencies, such as mean, median, mode and
midrange.
2. Describe data, using measures of variations, such as range, variance and standard deviation.
3. Identify the position of a data value in a data set, using various measures of position, such as
standard scores, percentiles, deciles and quartiles.
4. Check for outliers in a data set.
5. Use the techniques of exploratory data analysis, including boxplots to discover the nature of the
data.

3.1 Introduction
In Chapter 2, we saw how raw data can be analysed by organizing it into a frequency
distribution and then presenting it using various graphs. Organizing and presenting data alone is not
enough to describe it meaningfully, so we will now examine some statistical methods that can be used
to describe the data. These methods include measures of central tendency, measures of variation and
measures of position.

Measures of average, or measures of central tendency, are numerical measures that locate the
center of the dataset. Measures of central tendency include the mean, median, mode, midrange and
weighted mean.

Knowing an average such as the mean, median or mode is not enough to describe the dataset entirely;
therefore, measures of variation or dispersion are also studied. Measures of variation or dispersion are
numerical measures that determine the spread of the data values from the center. Measures of variation
include the range, variance and standard deviation.

In addition to measures of central tendency and measures of variation, there are measures of position or
location. They are used to locate the relative position of a data value in the dataset. Measures of position
include percentiles, deciles and quartiles. These measures are used extensively in psychology and
education, and they are sometimes referred to as norms.

3.2 Measures of Central Tendency


The measures of central tendency (also known as measures of average) are numerical measures that
locate the center of the dataset. In other words, each of these measures is a single value that gives us
an idea of the entire set of data. Measures of central tendency also make it easier to compare
two or more sets of data.

The types of measures of central tendency that will be discussed in this section are mean, median, mode,
midrange and weighted mean.

Recall that when the population is small, there is no need to use samples, since the entire population can be
used to gain information. However, if the population is very large or infinite, we make use of samples and then
generalize from samples to populations. Therefore, it is important to know the following terms:

A parameter is a characteristic or measure obtained by using all the data values from an entire
population.

A statistic is a characteristic or measure obtained by using all the data values from a specific sample
chosen from a large population.

General Rounding Rule: When computations are done in statistics, rounding should not be done until
the final answer is calculated. Rounding at every step along the way tends to increase the difference
between the computed answer and the exact one.

3.2.1 The Mean


The mean (arithmetic average) is calculated by adding all the data values and then dividing by the total
number of values. For example, the mean of the dataset 3, 2, 6, 5 and 4 is found by adding
3 + 2 + 6 + 5 + 4 = 20 and dividing by 5; hence the mean of the data is 20/5 = 4.

The symbol $\bar{X}$ represents the sample mean and $\mu$ represents the population mean.

Formulas to Compute Mean

We use the following formulas summarized in the table below to compute the mean:

Data type                           Sample                             Population

Raw data                            $\bar{X} = \frac{\sum X}{n}$       $\mu = \frac{\sum X}{N}$

Ungrouped frequency distribution    $\bar{X} = \frac{\sum fX}{n}$      $\mu = \frac{\sum fX}{N}$

Grouped frequency distribution      $\bar{X} = \frac{\sum fX_m}{n}$    $\mu = \frac{\sum fX_m}{N}$

Where,
n is the sample size
N is the population size
f is the frequency of a class
$X_m$ is the midpoint of a class interval
$\sum X$ is the sum of all data values
$\sum fX$ is the sum of the frequency multiplied by the data value of each class

EXAMPLE 3−1

The data given below represents the marks scored by a sample of 11 students selected from a particular
English class. Find the mean mark.

67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91

SOLUTION
Since the dataset is a sample of raw data, the mean is given by:

$\bar{X} = \frac{\sum X}{n} = \frac{67 + 89 + \cdots + 91}{11} = \frac{791}{11} \approx 71.9$

Hence, the mean mark is 71.9.

Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs
in the raw data.
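
As a quick check of Example 3−1, the raw-data formula can be sketched in Python. This is only an illustration; the names `marks` and `sample_mean` are assumptions made here and do not come from the chapter.

```python
# Marks from Example 3-1 (a sample of 11 students).
marks = [67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


mean_mark = sample_mean(marks)

# Round only at the end, to one more decimal place than the raw data.
print(round(mean_mark, 1))
```

Rounding is applied once, at the very end, which follows the general rounding rule stated earlier in this section.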

EXAMPLE 3−2

Using the frequency distribution as in Example 2-2 of Chapter 2, find the mean.

SOLUTION

Step 1: Make a table as shown.

Rating (X)    Frequency (f)    fX
1             2
2             1
3             2
4             2
5             2
6             5
7             3
8             2
9             2
10            3
Total         n = 24

Step 2: Multiply the frequency with the data value of each class and enter them in the 3rd column.

Step 3: Find the sum of the values in the 3rd column. The completed table is shown below.

Rating (X)    Frequency (f)    fX
1             2                2
2             1                2
3             2                6
4             2                8
5             2                10
6             5                30
7             3                21
8             2                16
9             2                18
10            3                30
Total         n = 24           $\sum fX = 143$

Step 4: Divide the sum of the 3rd column by n to get the mean.

$\bar{X} = \frac{\sum fX}{n} = \frac{143}{24} \approx 5.96$
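
The four steps above can be sketched in Python as follows. The dictionary `rating_freq` and the function `frequency_mean` are illustrative names assumed for this sketch; the data are the ratings and frequencies from the table in this example.

```python
# Rating -> frequency, as in the table of Example 3-2.
rating_freq = {1: 2, 2: 1, 3: 2, 4: 2, 5: 2, 6: 5, 7: 3, 8: 2, 9: 2, 10: 3}


def frequency_mean(freq):
    """Mean of an ungrouped frequency distribution: sum(f * X) / sum(f)."""
    n = sum(freq.values())                          # total frequency
    sum_fx = sum(x * f for x, f in freq.items())    # the fX column, summed
    return sum_fx / n


print(round(frequency_mean(rating_freq), 2))
```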

EXAMPLE 3−3

The following is the distribution of the number of fish caught by all 50 fishermen in a coastal area. Find
the mean number of fish caught by a fisherman.
No. of fish caught    No. of fishermen
11 − 15               12
16 − 20               14
21 − 25               13
26 − 30               11

SOLUTION

Step 1: Make a table as shown.

No. of fish caught    No. of fishermen (f)    Midpoints ($X_m$)    $fX_m$
11 − 15               12
16 − 20               14
21 − 25               13
26 − 30               11
                      N = 50

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency with the midpoint and enter the result in the 4th column.

Step 4: Find the sum of the values in the 4th column. The completed table is shown below.

No. of fish caught    No. of fishermen (f)    Midpoints ($X_m$)    $fX_m$
11 − 15               12                      13                   156
16 − 20               14                      18                   252
21 − 25               13                      23                   299
26 − 30               11                      28                   308
                      N = 50                                       $\sum fX_m = 1015$

Step 5: Divide the sum of the 4th column by N to get the mean.

$\mu = \frac{\sum fX_m}{N} = \frac{1015}{50} = 20.3$
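
For grouped data, the same calculation can be sketched in Python using class midpoints. The names `classes` and `grouped_mean` are assumptions made for this illustration; the class limits and frequencies are those of Example 3−3.

```python
# (class limits, frequency) from Example 3-3: all 50 fishermen, so a population.
classes = [((11, 15), 12), ((16, 20), 14), ((21, 25), 13), ((26, 30), 11)]


def grouped_mean(class_table):
    """Approximate the mean of grouped data from the class midpoints."""
    total_f = 0
    total_fxm = 0.0
    for (lower, upper), f in class_table:
        midpoint = (lower + upper) / 2    # X_m, the class midpoint
        total_f += f                      # running total of frequencies (N)
        total_fxm += f * midpoint         # running sum of f * X_m
    return total_fxm / total_f


print(grouped_mean(classes))
```

Because every value in a class is replaced by the class midpoint, the result is an approximation of the mean of the underlying raw data.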

3.2.2 The Median


The median is the midpoint of the data set. To calculate the median, it is necessary to arrange the data
in order. The median can either be a specific value in the data set or can fall between two values.

The median is the midpoint of the data set when the data is arranged in order.

EXAMPLE 3−4

The numbers of comics purchased on a particular day by nine school students are given below.

3, 7, 10, 5, 9, 4, 11, 7, 2
Find the median.

SOLUTION

Step 1: Arrange the data in order


2, 3, 4, 5, 7, 7, 9, 10, 11

Step 2: Select the middle point. With nine values, the middle point is the 5th value in the ordered data, which is 7.

Hence, the median is 7 comics.

EXAMPLE 3−5

The numbers of tropical cyclones in the Pacific over an 8-year period are as follows.

687, 576, 702, 405, 237, 899, 799, 907


Find the median.

SOLUTION

Step 1: Arrange the data in order.


237, 405, 576, 687, 702, 799, 899, 907

Step 2: Select the middle point. With eight values, the middle point falls between the 4th and 5th values, 687 and 702.

Since there are two values at the middle point, we add the two values and divide by 2 to find the median.

The median number of tropical cyclones is $\frac{687 + 702}{2} = 694.5$.
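
The two cases in Examples 3−4 and 3−5 (an odd and an even number of values) can be sketched in one Python function. The function name `median` and the variable names are assumptions made for this illustration.

```python
def median(values):
    """Return the middle value of the ordered data.

    With an odd number of values the median is the middle value itself;
    with an even number it is the average of the two middle values.
    """
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2


comics = [3, 7, 10, 5, 9, 4, 11, 7, 2]                  # Example 3-4
cyclones = [687, 576, 702, 405, 237, 899, 799, 907]     # Example 3-5

print(median(comics))
print(median(cyclones))
```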

EXAMPLE 3−6

Estimate the median of the data given in Example 3−3.

SOLUTION

Step 1: Find the class boundaries, cumulative frequency and cumulative percentage for each class.

$\text{cumulative percentage} = \frac{\text{cumulative frequency}}{\text{total frequency}} \times 100$

The table is shown below:

Class boundaries    Frequency    Cumulative frequency    Cumulative percentage
10.5 – 15.5         12           12                      (12/50) × 100 = 24
15.5 – 20.5         14           26                      (26/50) × 100 = 52
20.5 – 25.5         13           39                      78
25.5 – 30.5         11           50                      100
Total               50

Step 2: Using the upper class boundaries for the x values and the cumulative percentage as the y values,
plot the points. This type of ogive is called a Percentile Graph.

[Figure: Percentile Graph. x-axis: no. of fish caught (class boundaries 10.5 to 30.5); y-axis: cumulative percentage (0 to 100).]

To estimate the median, find the x−value corresponding to the y-value of 50 from the percentile graph.
So the median is estimated to be 20.
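
Reading a value off the percentile graph amounts to linear interpolation between the plotted points. The sketch below assumes the graph starts at 0% at the lowest class boundary (10.5) and joins consecutive points with straight lines; the function `value_at_percent` and the variable names are illustrative assumptions, not part of the chapter.

```python
from itertools import accumulate

# Class boundaries and frequencies from the table in Example 3-6.
lowest_boundary = 10.5
upper_boundaries = [15.5, 20.5, 25.5, 30.5]
frequencies = [12, 14, 13, 11]

total = sum(frequencies)
cumulative_pct = [100 * c / total for c in accumulate(frequencies)]

# Points of the percentile graph: (class boundary, cumulative percentage).
xs = [lowest_boundary] + upper_boundaries
ys = [0.0] + cumulative_pct


def value_at_percent(target, xs, ys):
    """Interpolate linearly to find the x-value at a given cumulative percentage."""
    for i in range(len(xs) - 1):
        x0, x1, y0, y1 = xs[i], xs[i + 1], ys[i], ys[i + 1]
        if y0 <= target <= y1:
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("target percentage is outside the graph")


median_estimate = value_at_percent(50, xs, ys)
print(round(median_estimate, 1))
```

The interpolated value is about 20.1, which agrees with the estimate of 20 read from the graph.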

3.2.3 The Mode


The mode is the third measure of central tendency. It is the value that occurs most often in a data set.
Note:
• A data set that has only one value that occurs most often is said to be unimodal.
• If a data set has two values that occur most often, both values are considered to be the mode
and the data set is said to be bimodal.
• If a data set has more than two values that occur most often, each value is used as the mode,
and the data set is said to be multimodal.
• A data set in which no data value occurs more than once is said to have no mode.
• If data is grouped in class intervals, then the interval that has the highest frequency is called the
modal class and its midpoint is called the crude mode.

EXAMPLE 3−7

Find the mode of the transfer fees of 9 professional soccer players for a specific year. The transfer fee in
millions of dollars is: 1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5

SOLUTION

Since $4.5 million occurred 3 times (most often), the mode is $4.5 million.

EXAMPLE 3−8

Find the mode for the following sets of data:


A. 40, 44, 57, 78, 48
B. 45, 55, 50, 45, 40, 55, 45, 55

SOLUTION

A. Since each value occurs only once, there is no mode. (Do not say that the mode is zero).
B. Since both 45 and 55 occur most often (3 times each), the modes are 45 and 55. This set of data
is said to be bimodal.
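
The cases in Examples 3−7 and 3−8 (one mode, no mode, two modes) can be sketched with Python's `collections.Counter`. The function `modes` is an illustrative name assumed here; it returns an empty list when no value occurs more than once.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []    # every value occurs once: no mode (not "mode is zero")
    return sorted(v for v, c in counts.items() if c == highest)


fees = [1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5]    # Example 3-7
set_a = [40, 44, 57, 78, 48]                               # Example 3-8 A
set_b = [45, 55, 50, 45, 40, 55, 45, 55]                   # Example 3-8 B

print(modes(fees))
print(modes(set_a))
print(modes(set_b))
```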

EXAMPLE 3−9

Find the mode of the frequency distribution in Example 3-3.

SOLUTION

The modal class is 16 – 20, as it has the highest frequency. Note: In many cases, the measures of central
tendency may have significantly different values. One has to be very cautious in using these measures.

EXAMPLE 3−10

A small company consists of the owner, the manager, salesperson and two technicians, all of whose
annual salaries are listed below. Find the mean, median and mode.

Staff          Salary ($)
Owner          50,000
Manager        20,000
Salesperson    12,000
Technician     9,000
Technician     9,000

SOLUTION

Here the mean is $20,000, the median is $12,000 and the mode is $9,000. The mean is much higher
than the median and the mode because of the extremely high salary of the owner. In such situations, the median
should be used as the measure of central tendency.

3.2.4 The Midrange


The midrange (MR) is a rough estimate of the middle. It is found by adding the lowest and the highest
values in the data set and dividing the result by 2. It can be affected by extreme values in the dataset.

$MR = \frac{\text{lowest value} + \text{highest value}}{2}$

EXAMPLE 3−11

Find the midrange of the data in Example 3−10.

SOLUTION

$MR = \frac{9000 + 50000}{2} = 29{,}500$

Hence, the midrange is $29,500. The midrange is affected by the extreme value of $50,000 in the dataset.
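
A short Python sketch of the midrange for the salaries in Example 3−10 is given below; the names `salaries` and `midrange` are assumptions made for this illustration.

```python
# Annual salaries from Example 3-10.
salaries = [50_000, 20_000, 12_000, 9_000, 9_000]


def midrange(values):
    """Midrange: the average of the lowest and the highest value."""
    return (min(values) + max(values)) / 2


print(midrange(salaries))
```

Only the two extreme values enter the calculation, which is why a single unusually high or low value shifts the midrange so strongly.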

Note: In statistics, several measures can be used for an average. The most common measures are
the mean, median, mode and midrange. Each has its own specific purpose and use. The median is a better
measure when there are extreme values in the dataset, as in Example 3−10.

3.2.5 The Weighted Mean


The weighted mean is used when we wish to place greater emphasis on some of the values in the data
set. In such situations, an ordinary mean may not be suitable. A mean that takes these additional
weighting factors into account is called the weighted mean.

The weighted mean of the data set $x_1, x_2, \ldots, x_n$ with respective weightings $w_1, w_2, \ldots, w_n$ is given by

$\text{Weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}$.

The use of weighted mean is illustrated in the following example.

EXAMPLE 3−12

In ST130, a student obtained the following marks in the continuous assessment:

Mid-semester test (MST): 67%


Assignment 1: 88%
Assignment 2: 94%
Final exam: 75%

The mid-semester test has a weight of 20%, each assignment has a weight of 10% and the final exam
has a weight of 60%.

Calculate the final mark of the student.

SOLUTION

According to the regulations, the weights for the results are in the following ratio:

MST : Assignment 1 : Assignment 2 : Final Exam = 20% : 10% : 10% : 60% = 2 : 1 : 1 : 6

For awarding the final result, we have to take this weighting into account:

$\text{Weighted mean} = \frac{2(67) + 1(88) + 1(94) + 6(75)}{2 + 1 + 1 + 6} = \frac{766}{10} = 76.6$

Therefore, the final mark is 76.6%, or 77% to the nearest whole number.
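
The weighted mean of Example 3−12 can be sketched in Python as follows. The names `weighted_mean`, `marks` and `weights` are illustrative assumptions; the weights are written as the ratio 2 : 1 : 1 : 6 used in the solution.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [67, 88, 94, 75]    # MST, Assignment 1, Assignment 2, Final exam
weights = [2, 1, 1, 6]      # 20% : 10% : 10% : 60%

print(weighted_mean(marks, weights))
```

Using the percentages 20, 10, 10 and 60 as weights gives the same result, because only the ratio between the weights matters.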

3.2.6 Relationships among Mean, Median and Mode


If the values of the mean, median and mode are known, they can give us some idea about the shape of a
frequency distribution. Now we will discuss the relationships among the mean, median and mode for
symmetric, positively skewed and negatively skewed distributions.

For a symmetric distribution with one peak, the values of the mean, median and mode
are the same, and they lie at the center of the distribution.

For a right skewed distribution, the value
of the mean is the largest, the mode is the
smallest, and the value of the median lies
between these two. Notice that the mode
always occurs at the peak point. The value of
the mean is the largest in this case because
it is sensitive to outliers that occur in the right
tail. These outliers pull the mean to the right.

If a distribution is skewed to the left, the value of the mean is the smallest and the
mode is the largest, with the value of the median lying between these two. In this case,
the outliers in the left tail pull the mean to the left.

3.3 Measures of Variation


The measures of variation (also known as measures of dispersion) are numerical measures that determine
the spread of the data values around the center. In many cases, the measures of central tendency
alone cannot describe the data.

EXAMPLE 3−13

I wish to test two brands of outdoor paint to see how long each will last before fading. The results (in
months) are shown. Find the mean and median of each group. (Assume Population)

Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25

The mean and median for both brands of paint are 35 months. Since the mean and median are the same for
both brands, we cannot conclude which paint is better using these measures of central tendency.

Therefore, to find out which paint is the better choice, a measure of variation is needed.

The types of measures of variation that will be discussed in this section are range, variance, and standard
deviation.

3.3.1 Range
The range is the simplest measure of variation and is defined as:

The range (R) is the highest value minus the lowest value in the data set. That is

R = Highest value – lowest value

EXAMPLE 3−14

Find the range for the two brands of paints given in Example 3−13.

SOLUTION

Brand A: The range R = 60 – 10 = 50 months.

Brand B: The range R = 45 – 25 = 20 months.

Since the range of Brand B is less it can be concluded that Brand B is less variable (more reliable or a
better choice) than Brand A.

Since the range is not a good measure of variability when there are extreme values in the dataset, statisticians use
other measures called the variance and standard deviation.

3.3.2 The Variance and Standard Deviation


The variance is defined as the average of the squares of the deviations of each data value from the mean.
It is denoted by $\sigma^2$ for the population variance and $s^2$ for the sample variance.

The corresponding formulas used to calculate these variances for raw data are

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ and $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$,

where

$\mu = \frac{\sum X}{N}$ and $\bar{X} = \frac{\sum X}{n}$.

The standard deviation is the most commonly used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. The standard deviation is
found by taking the square root of the variance. It is denoted by $\sigma$ for the population standard deviation and $s$
for the sample standard deviation.

EXAMPLE 3−15

Find the variance and standard deviation for Brand A paint data given in Example 3−13.

SOLUTION

Step 1: Find the mean.

$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$
Step 2: Subtract the mean from each data value and square each result. The completed table is shown
below.

Brand A (X)    $(X - \mu)^2$
10             $(10 - 35)^2 = 625$
60             $(60 - 35)^2 = 625$
50             225
30             25
40             25
20             225

Step 3: Find the sum of the 2nd column.

$\sum (X - \mu)^2 = 625 + 625 + 225 + 25 + 25 + 225 = 1750$

Step 4: Find the variance.

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$

Step 5: Find the standard deviation.

$\sigma = \sqrt{291.7} \approx 17.1$

Remarks:
1. The variance and standard deviation of Brand B paint are 41.7 and 6.5 respectively.
2. Since the standard deviation of Brand B is smaller, one can conclude that Brand B is less variable (more
reliable or a better choice) than Brand A.

3. There are shortcut formulas for computing the variance and standard deviation. They are summarized in the
table below:

Data type                           Sample                                                          Population

Raw data                            $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$           $\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$

Ungrouped frequency distribution    $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$         $\sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N}$

Grouped frequency distribution      $s^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{n}}{n - 1}$     $\sigma^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{N}}{N}$

Note: Always use the shortcut formulas to compute variance and standard deviation.
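
As an illustration, the definition of the population variance and the shortcut formula for raw data can be compared in Python. The function names below are assumptions made for this sketch; both brands from Example 3−13 are treated as populations, as that example states.

```python
from math import sqrt

brand_a = [10, 60, 50, 30, 40, 20]    # months, Example 3-13
brand_b = [35, 45, 30, 35, 40, 25]


def population_variance(values):
    """Definition: the average of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Shortcut for raw data: (sum of X^2 - (sum of X)^2 / N) / N."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n


for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    variance = population_variance_shortcut(data)
    print(name, round(variance, 1), round(sqrt(variance), 1))
```

Both functions return the same variance for a given data set (up to floating-point rounding); the shortcut simply avoids computing each deviation from the mean.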

EXAMPLE 3−16

Find the variance and standard deviation for Brand A paint data given in Example 3−13 using the shortcut
formula.

SOLUTION

Step 1: Find the sum of all the data values.

Step 2: Square each data value and enter them in the 2nd column

Step 3: Find the sum of 2nd column.

Brand A (X)          $X^2$
10                   100
60                   3600
50                   2500
30                   900
40                   1600
20                   400
$\sum X = 210$       $\sum X^2 = 9100$

Step 4: Find the variance.

$\sigma^2 = \frac{9100 - \frac{210^2}{6}}{6} \approx 291.7$

Step 5: Find the standard deviation.

$\sigma = \sqrt{291.7} \approx 17.1$

EXAMPLE 3−17

Find the variance and standard deviation of the number of fish caught using the data in Example 3−3.

SOLUTION

Step 1: Make a table as shown.

No. of fish caught    No. of fishermen (f)    Midpoints ($X_m$)    $fX_m$    $fX_m^2$
11 – 15               12
16 – 20               14
21 – 25               13
26 – 30               11
                      N = 50

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency with the midpoints and enter them in the 4th column. Find
the sum of the values in the 4th column.

Step 4: For each class, multiply the frequency with the square of the midpoints and enter them in the
5th column. Find the sum of the values in the 5th column. The completed table is shown below.

No. of fish caught    No. of fishermen (f)    Midpoints ($X_m$)    $fX_m$               $fX_m^2$
11 – 15               12                      13                   12 × 13 = 156        12 × 13² = 2028
16 – 20               14                      18                   14 × 18 = 252        14 × 18² = 4536
21 – 25               13                      23                   299                  6877
26 – 30               11                      28                   308                  8624
                      N = 50                                       $\sum fX_m = 1015$   $\sum fX_m^2 = 22065$

Step 5: Find the variance.

$\sigma^2 = \frac{22065 - \frac{1015^2}{50}}{50} = 29.21$

Step 6: Find the standard deviation.

$\sigma = \sqrt{29.21} \approx 5.4$

3.3.3 Coefficient of Variation


When two or more datasets have the same units of measure, the variance or standard deviation can be used to
compare the variability of the datasets. However, when the units of measure are different, the
coefficient of variation is used to compare their variability.

The coefficient of variation, denoted by CV, is the standard deviation divided by the mean. The result
is expressed as a percentage.

For population: $CV = \frac{\sigma}{\mu} \times 100\%$

For sample: $CV = \frac{s}{\bar{X}} \times 100\%$

EXAMPLE 3−18

The mean of the number of sales of airplane engines over a 6-month period is 92, and the standard
deviation is 5. The mean of the commissions earned is $5255, and the standard deviation is $770.
Compare the variations of the two.

SOLUTION

The coefficients of variation are:

For sales: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{5}{92} \times 100\% \approx 5.4\%$

For commission: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{770}{5255} \times 100\% \approx 14.7\%$

Since the coefficient of variation is larger for commissions, the commissions are more variable than the
sales.
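
The comparison in Example 3−18 can be sketched in Python; the function name `coefficient_of_variation` is an assumption made for this illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 92)             # number of sales
cv_commission = coefficient_of_variation(770, 5255)    # commissions in dollars

print(round(cv_sales, 1), round(cv_commission, 1))
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one data set counts sales and the other is measured in dollars.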

3.4 Measures of Position


The measures of position (also known as measures of location) are the numerical measures to determine
the relative position of a data value in a data set.

The types of measures of position that will be discussed in this section are standard scores, percentiles,
deciles and quartiles.

3.4.1 Standard Scores
There is an old saying, “You can’t compare apples and oranges.” However, with the use of statistics, it
can be done to some extent. Suppose that a student scored 90 in mathematics test and 45 in English
test. Direct comparison of these raw scores is impossible, since the exams might not be equivalent in
terms of number of questions, value of each question, and so on. However, a comparison of a relative
standard similar to both can be made. This comparison uses the mean and standard deviation and is
called a standard score or z score.
A standard score or z score tells how many standard deviations a data value is above or below the mean
for a specific distribution of values. If the standard score is zero, then the data value is the same as the
mean.

A z score or standard score for a value is obtained by subtracting the mean from the value and dividing
the result by the standard deviation, i.e.

For population: $z = \frac{X - \mu}{\sigma}$

For sample: $z = \frac{X - \bar{X}}{s}$

EXAMPLE 3−19

A student scored 90 on a Maths test that had a mean of 52 and a standard deviation of 10; the student also scored
45 on an English test with a mean of 35 and a standard deviation of 5. Compare the student's relative positions on
the two tests.

SOLUTION

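Step 1: Find the z scores.

For Maths: $z = \frac{X - \bar{X}}{s} = \frac{90 - 52}{10} = 3.8$

For English: $z = \frac{X - \bar{X}}{s} = \frac{45 - 35}{5} = 2.0$

Since the z score for Maths is larger, the student's relative position is higher on the Maths test than on the English test.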
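
The z scores of Example 3−19 can be reproduced with a short Python sketch; the function name `z_score` is an assumption made for this illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_maths = z_score(90, 52, 10)
z_english = z_score(45, 35, 5)

print(z_maths, z_english)
```

A positive z score means the value is above the mean, a negative one means it is below, and zero means the value equals the mean.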

3.4.2 Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of
an individual in a group.

Percentiles are data values that divide the dataset into 100 equal parts, where the dataset is arranged in
ascending order. Each set of observations has 99 percentiles, denoted by P1, P2, …, P99.

The following figure describes the positions of the 99 percentiles.

[Figure: a data set arranged in increasing order and divided by P1, P2, …, P99 into 100 portions. Each of these portions contains 1% of the observations.]

Remarks:
1. P20 is called the 20th percentile, which indicates that 20% of the scores fall below P20.
2. P50 is called the 50th percentile, which indicates that 50% of the scores fall below P50.
That is, P50 = median.

Steps to Compute Percentile of Raw data


Step 1: Arrange the data from lowest to highest (ascending order).

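Step 2: Find the kth percentile (Pk).

$P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term}$

Where,
k is the number of the percentile and n is the sample size.

If kn/100 is not a whole number, round it up to the next whole number and use that term. If kn/100 is a
whole number, use the average of that term and the next term, as in Example 3−20.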

Note:
1. To calculate quartiles and deciles of raw data, convert them to percentiles and use the same
steps.
2. To estimate percentiles, deciles and quartiles of grouped data, use a Percentile Graph.

Percentile Rank
We can calculate the percentile rank for a particular value x of a data set by using the formula:

$\text{Percentile rank of } x = \frac{(\text{number of values less than } x) + 0.5}{\text{total number of values}} \times 100\%$

Note:
1. A percentile is a value in the data set.
2. The percentile rank of a score indicates what percent of data lies below the score.

3.4.3 Deciles
Deciles are data values that divide the dataset into 10 equal parts, where the dataset is arranged in
ascending order. Each set of observations has 9 deciles, denoted by D1, D2, …, D9.

The following figure describes the positions of the 9 deciles.

[Figure: a data set arranged in increasing order and divided by D1, D2, …, D9 into 10 portions. Each of these portions contains 10% of the observations.]

Remarks:
1. D4 is called the 4th decile, which indicates that 40% of the scores fall below D4.
2. D5 is called the 5th decile, which indicates that 50% of the scores fall below D5.
3. P50 = D5 = median.
4. D1 = P10; D2 = P20; D3 = P30; …; D9 = P90.

3.4.4 Quartiles
Quartiles are data values that divide the dataset into 4 equal parts, where the dataset is arranged in
ascending order. Each set of observations has 3 quartiles, denoted by Q1, Q2 and Q3.

The following figure describes the positions of the 3 quartiles.

[Figure: a data set arranged in increasing order and divided by Q1, Q2 and Q3 into 4 portions. Each of these portions contains 25% of the observations.]

Remarks:
1. Q1 is called the 1st quartile (or lower quartile), which indicates that 25% of the scores fall below Q1.
2. Q3 is called the 3rd quartile (or upper quartile), which indicates that 75% of the scores fall below Q3.
3. Q1 = P25; Q2 = P50; Q3 = P75.
4. Q2 = D5 = P50 = Median.

EXAMPLE 3−20

The following are the test scores of 12 students in a statistics class:

70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82

Calculate the following:

1. P80 and interpret its value.


2. D6 .
3. Q1 and Q3 .
4. Percentile rank for the score 92.

SOLUTION

Arrange the data from lowest to highest (ascending order).

56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99

1. P80 is obtained by:

$P_{80} = \frac{80(12)}{100}\text{th term} = 9.6\text{th term}$

Since 9.6 is not a whole number, the value of the 9.6th term is approximated by the 10th term in the ranked data. Therefore,
P80 = 87.
Hence, approximately 80% of the scores are below 87 in the given data.

2. D6 (or P60) is obtained by:

$P_{60} = \frac{60(12)}{100}\text{th term} = 7.2\text{th term}$

Since 7.2 is not a whole number, the value of the 7.2th term is approximated by the 8th term in the ranked data. Therefore,
D6 = 82.
Hence, approximately 60% of the scores are below 82 in the given data.

3. Q1 (or P25) is obtained by:

$P_{25} = \frac{25(12)}{100}\text{th term} = 3\text{rd term}$

Since 3 is a whole number, the value is approximated by the average of the 3rd and 4th terms in the ranked data.
Therefore,

$Q_1 = \frac{65 + 70}{2} = 67.5$

Q3 (or P75) is obtained by:

$P_{75} = \frac{75(12)}{100}\text{th term} = 9\text{th term}$

Since 9 is a whole number, the value is approximated by the average of the 9th and 10th terms in the ranked data.
Therefore,

$Q_3 = \frac{85 + 87}{2} = 86$
4. Percentile rank of 92 $= \frac{10 + 0.5}{12} \times 100\% = 87.5\%$.

Hence, approximately 87.5% of the scores are below 92 in the given data.
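
The calculations in this example can be sketched in Python. The sketch assumes the rule applied in the solution above holds in general: when kn/100 is not a whole number it is rounded up, and when it is a whole number the average of that term and the next is used. The function names `percentile` and `percentile_rank` are illustrative assumptions.

```python
def percentile(values, k):
    """k-th percentile (k = 1..99) using the rule applied in Example 3-20."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    position, remainder = divmod(k * len(ordered), 100)    # c = kn/100
    if remainder:
        # c is not whole: round up; 0-based index `position` is the (position+1)-th term.
        return ordered[position]
    # c is whole: average the c-th and (c+1)-th terms.
    return (ordered[position - 1] + ordered[position]) / 2


def percentile_rank(values, x):
    """(number of values below x + 0.5) / total number of values * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100


scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

print(percentile(scores, 80))                          # P80
print(percentile(scores, 60))                          # D6
print(percentile(scores, 25), percentile(scores, 75))  # Q1 and Q3
print(percentile_rank(scores, 92))
```

Integer arithmetic (`divmod`) is used to decide whether kn/100 is a whole number, which avoids floating-point comparisons. Other textbooks and software packages use different percentile conventions, so their results may differ slightly from this rule.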

EXAMPLE 3−21

Estimate the following from the data given in Example 3−3.

1. P20 .
2. Percentile rank for the score 26.

SOLUTION

Using the percentile graph plotted in Example 3−6 (x-axis: no. of fish caught; y-axis: cumulative percentage):

1. Observe the x-value for the y-value 20, and we get P20 ≈ 14.
2. Observe the y-value for the x-value 26, and we get the percentile rank for the score 26 to be approximately 81.

3.4.5 Other Measures of Variation


The variance and standard deviation are regarded as the best and the most powerful measures of
dispersion. One of the drawbacks of these measures of dispersion is that they are influenced by
extreme observations called outliers. Thus, when there are outliers in a dataset, many statisticians think
that the median should be used as the measure of central tendency and that other measures of dispersion,
namely the interquartile range or the quartile deviation, should be used to describe the variability.

The interquartile range is the difference between the upper quartile and the lower quartile. That
is,

$\text{Interquartile range (IQR)} = Q_3 - Q_1$

The quartile deviation is half of the difference between the upper quartile and the lower
quartile. That is,

$\text{Quartile deviation (QD)} = \frac{Q_3 - Q_1}{2}$

EXAMPLE 3−22

Find the interquartile range and the quartile deviation for the given data in Example 3−20.

SOLUTION

From Example 3−20, we obtain

$Q_1 = 67.5$ and $Q_3 = 86$

Therefore,

$\text{Interquartile range} = Q_3 - Q_1 = 86 - 67.5 = 18.5$

and

$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{86 - 67.5}{2} = 9.25$

3.5 Outliers
We already know that values that are very small (or extreme low) or very large (or extreme high) relative
to the majority of the values in a data set are known as outliers. We have seen that outliers strongly affect
the mean, standard deviation and some other measures as well. Therefore, it is important to identify
outliers in the dataset so that we use appropriate measures when outliers are present in the dataset.

An outlier is an extremely high or an extremely low data value when compared with the rest of the
data values.

How does an outlier occur?


There are several reasons why outliers may occur. The data value may have resulted from a:
• Measurement or observational error. That is, the researcher measured the variable incorrectly.
• Recording error. That is, it may have been written or typed incorrectly.
• Subject that is not in the defined population.

Procedure for Identifying Outliers


There are several ways to check a dataset for outliers. A good rule of thumb for detecting outliers is as
follows:
Step 1: Arrange the data in ascending order and find Q1 and Q3.

Step 2: Find the interquartile range: IQR = Q3 − Q1.

Step 3: Find the interval: $Q_1 - 1.5 \times IQR \le x \le Q_3 + 1.5 \times IQR$.

Step 4: Check the data set for any data values x that fall outside the interval. Those values are
outliers.

EXAMPLE 3−23

Check the following data set for outliers.

70, 5, 12, 6, 15, 13, 18, 30

SOLUTION

The data value 70 is suspected to be an outlier. Using the procedure given above we have:

Step 1: The data in ascending order is


5, 6, 12, 13, 15, 18, 30, 70

Using the procedure for quartiles given in Section 3.4, Q1 = (6 + 12)/2 = 9 and Q3 = (18 + 30)/2 = 24.

Step 2: The interquartile range (IQR), IQR = 24 – 9 = 15.

Step 3: The interval is: $9 - 1.5 \times 15 \le x \le 24 + 1.5 \times 15$, that is, $-13.5 \le x \le 46.5$.

Step 4: Check the data set for any data values that fall outside the interval from −13.5 to 46.5. Since the
data value 70 is outside this interval, it can be considered an outlier.
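
The four-step procedure can be sketched in Python as follows. The quartiles are computed with the same rule used in Example 3−20 (round kn/100 up when it is not a whole number, otherwise average that term and the next); the helper `quartile` and the function `find_outliers` are illustrative names assumed for this sketch.

```python
def quartile(ordered, k):
    """k-th percentile of already-sorted data, using the rule from Example 3-20."""
    position, remainder = divmod(k * len(ordered), 100)
    if remainder:
        return ordered[position]                              # round up
    return (ordered[position - 1] + ordered[position]) / 2    # average two terms


def find_outliers(values):
    """Return (lower limit, upper limit, outliers) using the 1.5 x IQR rule."""
    ordered = sorted(values)                          # Step 1
    q1, q3 = quartile(ordered, 25), quartile(ordered, 75)
    iqr = q3 - q1                                     # Step 2
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # Step 3
    outliers = [x for x in ordered if x < lower or x > upper]    # Step 4
    return lower, upper, outliers


data = [70, 5, 12, 6, 15, 13, 18, 30]    # Example 3-23
print(find_outliers(data))
```

Values exactly equal to either limit are kept, which matches the inclusive interval stated in Step 3.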

3.6 Exploratory Data Analysis (EDA)


In traditional statistics, data are organized by using a frequency distribution and various graphs are
constructed to determine the shape or nature of the distribution. Exploratory Data Analysis (EDA) is
the process of using graphical and descriptive statistical techniques (such as the median and the IQR) to learn about the
structure of a data set.

In EDA,
• Data can be organised using a stem and leaf plot.
• The measure of central tendency used is the median.
• The measure of variation used is the interquartile range.
• Data are represented graphically using a box-plot.

A box-plot is a graph that is used to determine the nature and shape of the distribution in EDA. It is
obtained by drawing a horizontal line from the minimum data value to Q1 , drawing a horizontal line from
Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with
a vertical line inside the box passing through the median.

Information obtained from a Box-plot


a. If the median is near the center of the box or the lines are about the same length, the distribution is
approximately symmetric.
b. If the median falls to the left of the center of the box or the right line is longer than the left line, the
distribution is positively skewed.
c. If the median falls to the right of the center of the box or the left line is longer than the right line,
the distribution is negatively skewed.

EXAMPLE 3−24

Construct a box-plot for the data given below.

16, 18, 12, 11, 8, 13, 4, 3, 9, 20

SOLUTION

Step 1: The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 3;
2. Q1 = 8;
3. The median is 11.5;
4. Q3 = 16;
5. The highest value is 20.

Step 2: Draw a horizontal axis with a suitable scale.

Step 3: Draw a horizontal line from the minimum data value to Q1 , then draw a horizontal line from Q3
to the maximum data value, and then draw a box whose vertical sides pass through Q1 and Q3 with a
vertical line inside the box passing through the median.

Therefore, the box-plot is given below:

[Figure: box-plot on a horizontal axis, with the minimum at 3, Q1 at 8, the median at 11.5, Q3 at 16 and the maximum at 20.]

The distribution is somewhat symmetric.
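
The five-number summary and the two cues used to read the shape of a box-plot can be computed directly. The Python sketch below is illustrative rather than definitive. It assumes the same quartile convention as the solution (Q1 and Q3 taken as the medians of the lower and upper halves of the sorted data), and the names `five_number_summary`, `left_line`, `right_line` and `median_offset` are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data; the middle value is left out of both halves
    when the number of values is odd.
    """
    if len(values) < 2:
        raise ValueError("at least two data values are needed")
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    return data[0], q1, median(data), q3, data[-1]


data = [16, 18, 12, 11, 8, 13, 4, 3, 9, 20]
low, q1, med, q3, high = five_number_summary(data)

left_line = q1 - low                  # length of the line from the minimum to Q1
right_line = high - q3                # length of the line from Q3 to the maximum
median_offset = med - (q1 + q3) / 2   # negative: median is left of the box center

print((low, q1, med, q3, high))              # (3, 8, 11.5, 16, 20)
print(left_line, right_line, median_offset)  # 5 4 -0.5
```

For this data set the median lies only 0.5 to the left of the center of the box, and the left line is just one unit longer than the right line. Both differences are small and point in opposite directions, which is consistent with describing the distribution as somewhat symmetric.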

3.7 Summary
This chapter discussed statistical techniques for describing data: measures of central tendency,
measures of variation and measures of position. The measures of central tendency include the mean,
median, mode and midrange, which locate the center of the data set; the measures of variation include
the range, variance and standard deviation, which gauge the spread of the data values; and the measures
of position include the standard score, percentiles, deciles and quartiles, which locate the position of
a data value within the data set. Further, the chapter explained how to detect outliers in a data set and
how to construct a box-plot.

EXERCISES

1. The cash compensations received in 2009 by the highest-paid executives of 12 international
companies (in $000s) were as follows:

2215 1888 1477 1059 977 956
947 924 899 856 856 803

A. Compute the mean, median, mode and standard deviation.
B. Calculate the three quartiles, the 40th percentile and the percentile rank of 956.
C. Check for outliers in the data.
D. Construct a box-plot and use it to comment on the shape of the distribution.

2. A survey of all the 110 firms in a small state was carried out to find the number of people employed
at each. The results are shown in the following table.

Number of Employees    1 – 10    11 – 20    21 – 30    31 – 40    41 – 50
Frequency                32         34         14         12         18

A. Approximate the mean, the mode and the median of the number of people employed at each
firm.
B. Calculate the variance and standard deviation.

3. Suppose an instructor gives two exams and a final exam, assigning the final exam a weight twice
that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the
first two exams and 85 on the final exam.

4. An analysis of monthly wages paid to the workers of firms A and B, which belong to the same industry,
gives the following results:

                                     Firm A    Firm B
Number of Workers                    100       200
Average monthly wage                 $196      $185
Variance of distribution of wages    $81       $144

A. Which firm, A or B, has a larger wage bill?
B. In which firm, A or B, is there greater variability among individual wages?
