
Chapter 3 PDF

This chapter discusses statistical methods for describing data, including measures of central tendency, variation, and position. It provides objectives and an introduction. Measures of central tendency evaluated are the mean, median, mode, midrange, and weighted mean. Measures of variation examined include range, variance, and standard deviation. Measures of position referenced are percentiles, deciles, and quartiles. Examples are provided to demonstrate calculating the mean from raw data and a grouped frequency distribution.

Uploaded by

Shiv Neel


CHAPTER 3:

DATA DESCRIPTION

Chapter 3: Data Description 34

Overview
This chapter discusses how data can be described using statistical methods. The concepts discussed in
this chapter are as follows: measure of central tendency; measure of variation; measure of position;
outliers; exploratory data analysis. The chapter concludes with a summary and a set of exercises.

Objectives
After completing this chapter, you should be able to:
1. Describe data, using measures of central tendencies, such as mean, median, mode and
midrange.
2. Describe data, using measures of variations, such as range, variance and standard deviation.
3. Identify the position of a data value in a data set, using various measures of position, such as
standard scores, percentiles, deciles and quartiles.
4. Check for outliers in a data set.
5. Use the techniques of exploratory data analysis, including boxplots to discover the nature of the
data.

3.1 Introduction
In Chapter 2, we saw how raw data can be analysed by organizing it into a frequency
distribution and then presenting it using various graphs. Organizing and presenting data alone is not
enough to describe it meaningfully, so we will now examine some statistical methods that can be used
to describe the data. These methods include measures of central tendency, measures of variation and
measures of position.

Measures of average, or measures of central tendency, are numerical measures that locate the
center of the dataset. Measures of central tendency include the mean, median, mode, midrange and
weighted mean.

Knowing an average such as the mean, median or mode is not enough to describe the dataset entirely;
therefore measures of variation (or dispersion) are also studied. Measures of variation are
numerical measures that determine the spread of the data values from the center. They
include the range, variance and standard deviation.

In addition to measures of central tendency and measures of variation, there are measures of position (or
location). They are used to locate the relative position of a data value in the dataset. Measures of position
include percentiles, deciles and quartiles. These measures are used extensively in psychology and
education, where they are sometimes referred to as norms.

3.2 Measures of Central Tendency

Measures of central tendency (also known as measures of average) are numerical measures that
locate the center of the dataset. In other words, such a measure is a single value that gives us
an idea of the entire set of data. Measures of central tendency also make it easier to
compare two or more sets of data.

The types of measures of central tendency that will be discussed in this section are mean, median, mode,
midrange and weighted mean.


Recall that when the population is small, there is no need to use samples, since the entire population can be
used to gain information. However, if the population is very large or infinite, we make use of samples and then
generalize from samples to populations. Therefore, it is important to know the following terms:

A parameter is a characteristic or measure obtained by using all the data values from an entire
population.

A statistic is a characteristic or measure obtained by using all the data values from a specific sample
chosen from a large population.

General Rounding Rule: When computations are done in statistics, the basic rounding rule is that
rounding should not be done until the final answer is calculated. Rounding at every step along
the way tends to increase the difference between the computed answer and the exact one.

3.2.1 The Mean

The mean (arithmetic average) is calculated by adding all the data values and then dividing by the total
number of values. For example, the mean of the dataset 3, 2, 6, 5 and 4 is found by adding
3 + 2 + 6 + 5 + 4 = 20 and dividing by 5; hence the mean of the data is 20/5 = 4.

The symbol $\bar{X}$ represents the sample mean and $\mu$ represents the population mean.

Formulas to Compute Mean

We use the formulas summarized in the table below to compute the mean:

| | Raw data | Ungrouped frequency distribution | Grouped frequency distribution |
|---|---|---|---|
| Sample | $\bar{X} = \frac{\sum X}{n}$ | $\bar{X} = \frac{\sum fX}{n}$ | $\bar{X} = \frac{\sum fX_m}{n}$ |
| Population | $\mu = \frac{\sum X}{N}$ | $\mu = \frac{\sum fX}{N}$ | $\mu = \frac{\sum fX_m}{N}$ |

Where,
n is the sample size
N is the population size
f is the frequency of a class
$X_m$ is the midpoint of a class interval
$\sum X$ is the sum of all data values
$\sum fX$ is the sum of the frequency multiplied by the data value of each class


EXAMPLE 3−1

The data given below represents the marks scored by a sample of 11 students selected from a particular
English class. Find the mean mark.

67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91

SOLUTION
Since the dataset is a sample of raw data, the mean is given by:

$$\bar{X} = \frac{\sum X}{n} = \frac{67 + 89 + \cdots + 91}{11} = \frac{791}{11} \approx 71.9$$

Hence, the mean mark is 71.9.
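
The calculation in Example 3−1 can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the original text.

```python
from statistics import mean

# Marks of the 11 sampled students from Example 3-1
marks = [67, 89, 49, 55, 87, 79, 72, 69, 81, 52, 91]

total = sum(marks)        # sum of all data values
n = len(marks)            # sample size
sample_mean = total / n   # X-bar = (sum of X) / n

print(total, n, round(sample_mean, 1))  # 791 11 71.9

# The standard-library helper gives the same value
assert abs(mean(marks) - sample_mean) < 1e-9
```

Rounding is applied only when printing, in line with the general rounding rule stated earlier.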

Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs
in the raw data.

EXAMPLE 3−2

Using the frequency distribution as in Example 2-2 of Chapter 2, find the mean.

SOLUTION

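Step 1: Make a table as shown.

| Rating ($X$) | Frequency ($f$) | $fX$ |
|---|---|---|
| 1 | 2 | |
| 2 | 1 | |
| 3 | 2 | |
| 4 | 2 | |
| 5 | 2 | |
| 6 | 5 | |
| 7 | 3 | |
| 8 | 2 | |
| 9 | 2 | |
| 10 | 3 | |
| Total | n = 24 | |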


Step 2: Multiply the frequency by the data value of each class and enter the products in the 3rd column.

Step 3: Find the sum of the values in the 3rd column. The completed table is shown below.

| Rating ($X$) | Frequency ($f$) | $fX$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 1 | 2 |
| 3 | 2 | 6 |
| 4 | 2 | 8 |
| 5 | 2 | 10 |
| 6 | 5 | 30 |
| 7 | 3 | 21 |
| 8 | 2 | 16 |
| 9 | 2 | 18 |
| 10 | 3 | 30 |
| Total | n = 24 | $\sum fX = 143$ |

Step 4: Divide the sum of the 3rd column by n to get the mean.

$$\bar{X} = \frac{\sum fX}{n} = \frac{143}{24} \approx 5.96$$
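
The table-based steps for an ungrouped frequency distribution can be sketched in Python as follows. The dictionary `freq` is a name chosen for this illustration; it simply maps each rating to its frequency as listed in the example.

```python
# Rating -> frequency, taken from the table in Example 3-2
freq = {1: 2, 2: 1, 3: 2, 4: 2, 5: 2, 6: 5, 7: 3, 8: 2, 9: 2, 10: 3}

n = sum(freq.values())                        # total frequency
sum_fx = sum(x * f for x, f in freq.items())  # sum of f * X (3rd column)
mean_rating = sum_fx / n                      # X-bar = (sum of fX) / n

print(n, sum_fx, round(mean_rating, 2))  # 24 143 5.96
```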

EXAMPLE 3−3

The following is the distribution of the number of fish caught by all 50 fishermen in a coastal area. Find
the mean number of fish caught by a fisherman.

| No. of fish caught | No. of fishermen |
|---|---|
| 11 − 15 | 12 |
| 16 − 20 | 14 |
| 21 − 25 | 13 |
| 26 − 30 | 11 |


SOLUTION

Step 1: Make a table as shown.

| No. of fish caught | No. of fishermen ($f$) | Midpoints ($X_m$) | $fX_m$ |
|---|---|---|---|
| 11 − 15 | 12 | | |
| 16 − 20 | 14 | | |
| 21 − 25 | 13 | | |
| 26 − 30 | 11 | | |
| | N = 50 | | |

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency by the midpoint and enter the products in the 4th column.

Step 4: Find the sum of the values in the 4th column. The completed table is shown below.

| No. of fish caught | No. of fishermen ($f$) | Midpoints ($X_m$) | $fX_m$ |
|---|---|---|---|
| 11 − 15 | 12 | 13 | 156 |
| 16 − 20 | 14 | 18 | 252 |
| 21 − 25 | 13 | 23 | 299 |
| 26 − 30 | 11 | 28 | 308 |
| | N = 50 | | $\sum fX_m = 1015$ |

Step 5: Divide the sum of the 4th column by N to get the mean.

$$\mu = \frac{\sum fX_m}{N} = \frac{1015}{50} = 20.3$$
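
For a grouped frequency distribution the same idea applies, with the class midpoint standing in for every value in the class. The sketch below assumes each class is stored as a (lower limit, upper limit, frequency) tuple, which is a representation chosen for this illustration only.

```python
# (lower limit, upper limit, frequency) for each class in Example 3-3
classes = [(11, 15, 12), (16, 20, 14), (21, 25, 13), (26, 30, 11)]

N = sum(f for _, _, f in classes)  # population size

# Midpoint of a class = (lower limit + upper limit) / 2
sum_fxm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

mu = sum_fxm / N  # mu = (sum of f * X_m) / N

print(N, sum_fxm, round(mu, 1))  # 50 1015.0 20.3
```

Because the midpoints only approximate the individual values, the result is an estimate of the true mean.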

3.2.2 The Median

The median is the midpoint of the data set. To calculate the median, it is necessary to arrange the data
in order. The median can either be a specific value in the data set or can fall between two values.

The median is the midpoint of the data set when the data is arranged in order.


EXAMPLE 3−4

The numbers of comics purchased on a particular day by nine school students are given below.

3, 7, 10, 5, 9, 4, 11, 7, 2
Find the median.

SOLUTION

Step 1: Arrange the data in order

2, 3, 4, 5, 7, 7, 9, 10, 11

Step 2: Select the middle point.

2, 3, 4, 5, [7], 7, 9, 10, 11

Since there are nine values, the middle point is the 5th value. Hence, the median is 7 comics.

EXAMPLE 3−5

The numbers of tropical cyclones in the Pacific over an 8-year period are as follows.

687, 576, 702, 405, 237, 899, 799, 907

Find the median.

SOLUTION

Step 1: Arrange the data in order.

237, 405, 576, 687, 702, 799, 899, 907

Step 2: Select the middle point.

237, 405, 576, [687, 702], 799, 899, 907

Since there are two values at the middle point, we add the two values and divide by 2 to find the median.

The median number of tropical cyclones is $\frac{687 + 702}{2} = 694.5$.
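
The two cases seen in Examples 3−4 and 3−5 (an odd and an even number of values) can be captured in one small function. This is a minimal sketch; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Return the middle value of the ordered data.

    With an even number of values, the two middle values are averaged.
    """
    ordered = sorted(values)  # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:            # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count


comics = [3, 7, 10, 5, 9, 4, 11, 7, 2]
cyclones = [687, 576, 702, 405, 237, 899, 799, 907]

print(median(comics))    # 7
print(median(cyclones))  # 694.5
```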

EXAMPLE 3−6

Estimate the median of the data given in Example 3−3.

SOLUTION

Step 1: Find the class boundaries, cumulative frequency and cumulative percentage for each class.

$$\text{cumulative percentage} = \frac{\text{cumulative frequency}}{\text{total frequency}} \times 100$$

The table is shown below:

| Class boundaries | Frequency | Cumulative frequency | Cumulative percentage |
|---|---|---|---|
| 10.5 – 15.5 | 12 | 12 | $\frac{12}{50} \times 100 = 24$ |
| 15.5 – 20.5 | 14 | 26 | $\frac{26}{50} \times 100 = 52$ |
| 20.5 – 25.5 | 13 | 39 | 78 |
| 25.5 – 30.5 | 11 | 50 | 100 |

Step 2: Using the upper class boundaries for the x values and the cumulative percentage as the y values,
plot the points. This type of ogive is called a Percentile Graph.

[Figure: Percentile Graph. x-axis: no. of fish caught (class boundaries 10.5 to 30.5); y-axis: cumulative percentage (0 to 100).]

To estimate the median, find the x−value corresponding to the y-value of 50 from the percentile graph.
So the median is estimated to be 20.
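
Reading a value off the percentile graph can be imitated numerically. The sketch below assumes that the plotted points are joined by straight line segments and that the graph starts at 0% at the first lower class boundary (10.5); both are assumptions made for this illustration, and the function name `value_at_percent` is hypothetical.

```python
# Points of the percentile graph: class boundaries against cumulative percentage.
# The starting point (10.5, 0) is an assumption about how the graph is drawn.
boundaries = [10.5, 15.5, 20.5, 25.5, 30.5]
cum_pct = [0, 24, 52, 78, 100]


def value_at_percent(p):
    """Estimate the x-value whose cumulative percentage is p (0 <= p <= 100)."""
    for i in range(1, len(boundaries)):
        if cum_pct[i] >= p:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum_pct[i - 1], cum_pct[i]
            # Linear interpolation along the segment that contains p
            return x0 + (p - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("p must lie between 0 and 100")


print(round(value_at_percent(50), 1))  # 20.1
```

The interpolated value of about 20.1 agrees with the graphical estimate of 20 to the precision with which a graph can be read.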

3.2.3 The Mode

The mode is the third measure of central tendency. It is the value that occurs most often in a data set.
Note:
• A data set that has only one value that occurs most often is said to be unimodal.
• If a data set has two values that occur most often, both values are considered to be modes
and the data set is said to be bimodal.
• If a data set has more than two values that occur most often, each value is used as a mode,
and the data set is said to be multimodal.
• A data set in which no data value occurs more than once is said to have no mode.
• If data is grouped in class intervals, then the interval that has the highest frequency is called the
modal class and its midpoint is called the crude mode.


EXAMPLE 3−7

Find the mode of the transfer fees of 9 professional soccer players for a specific year. The transfer fee in
millions of dollars is: 1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5

SOLUTION

Since $4.5 million occurred 3 times (most often), the mode is $4.5 million.

EXAMPLE 3−8

Find the mode for the following sets of data:

A. 40, 44, 57, 78, 48
B. 45, 55, 50, 45, 40, 55, 45, 55

SOLUTION

A. Since each value occurs only once, there is no mode. (Do not say that the mode is zero).
B. Since both 45 and 55 occur most often (3 times each), the modes are 45 and 55. This set of data
is said to be bimodal.
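
The rules for unimodal, bimodal and no-mode data sets can be sketched with a frequency count. The function name `modes` is chosen for this illustration; it returns a list so that several modes, or none, can be reported.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == top)


fees = [1.2, 12.0, 4.5, 6.1, 8.3, 4.5, 7.2, 11.0, 4.5]  # Example 3-7
set_a = [40, 44, 57, 78, 48]                             # Example 3-8 A
set_b = [45, 55, 50, 45, 40, 55, 45, 55]                 # Example 3-8 B

print(modes(fees))   # [4.5]
print(modes(set_a))  # []
print(modes(set_b))  # [45, 55]
```

An empty list corresponds to "no mode", which is different from a mode of zero.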

EXAMPLE 3−9

Find the mode of the frequency distribution in Example 3-3.

SOLUTION

The modal class is 16 – 20, as it has the highest frequency. Note: In many cases, the measures of central
tendency may have significantly different values. One has to be very cautious in using these measures.

EXAMPLE 3−10

A small company consists of the owner, the manager, salesperson and two technicians, all of whose
annual salaries are listed below. Find the mean, median and mode.

Staff Salary ($)

Owner 50,000
Manager 20,000
Salesperson 12,000
Technician 9,000
Technician 9,000


SOLUTION

Here the mean is $20,000, the median is $12,000 and the mode is $9,000. The mean is much higher
than the median and mode because of the extremely high salary of the owner. In such situations, the median
should be used as the measure of central tendency.

3.2.4 The Midrange

The midrange (MR) is a rough estimate of the middle. It is found by adding the lowest and the highest
values in the data set and dividing the result by 2. It can be affected by extreme values in the dataset.

$$MR = \frac{\text{lowest value} + \text{highest value}}{2}$$

EXAMPLE 3−11

Find the midrange of the data in Example 3−10.

SOLUTION

$$MR = \frac{9000 + 50000}{2} = 29{,}500$$

Hence, the midrange is 29,500. The midrange is affected by the extreme value of $50,000 in the dataset.
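
The four averages for the salary data of Examples 3−10 and 3−11 can be compared side by side. This sketch uses Python's standard `statistics` module for the mean, median and mode, and computes the midrange directly from its definition.

```python
from statistics import mean, median, mode

# Annual salaries from Example 3-10: owner, manager, salesperson, two technicians
salaries = [50_000, 20_000, 12_000, 9_000, 9_000]

# Midrange = (lowest value + highest value) / 2
midrange = (min(salaries) + max(salaries)) / 2

print(mean(salaries), median(salaries), mode(salaries), midrange)
```

The mean and the midrange are both pulled upwards by the owner's salary, while the median and mode are not.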

Note: In statistics, several measures can be used for an average. The most common measures are the
mean, median, mode and midrange. Each has its own specific purpose and use. The median is a better
measure when there are extreme values in the dataset, as in Example 3−10.

3.2.5 The Weighted Mean

The weighted mean is used when we wish to place greater emphasis on some of the values in the data
set. In such situations, it may not be suitable to calculate an ordinary mean. A mean that
takes this additional factor into account is called the weighted mean.

The weighted mean of the data set $x_1, x_2, \ldots, x_n$ with respective weightings $w_1, w_2, \ldots, w_n$ is given by

$$\text{Weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}.$$

The use of weighted mean is illustrated in the following example.


EXAMPLE 3−12

In ST130, a student obtained the following marks in the continuous assessment:

Mid-semester test (MST): 67%

Assignment 1: 88%
Assignment 2: 94%
Final exam: 75%

The mid-semester test had a weight of 20%, the assignments had a weight of 10% each and the final exam
had a weight of 60%.

Calculate the final mark of the student.

SOLUTION

As per the regulations, the weights for the results are in the following ratio:

MST : Assignment 1 : Assignment 2 : Final Exam = 20% : 10% : 10% : 60% = 2 : 1 : 1 : 6

For awarding the final result, we have to take this weighting into account:

$$\text{Weighted mean} = \frac{2(67) + 1(88) + 1(94) + 6(75)}{2 + 1 + 1 + 6} = \frac{766}{10} = 76.6$$

Therefore, the final mark is 77%.
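
The weighted mean formula translates directly into code. The function name `weighted_mean` below is chosen for this sketch; it is not taken from the original text.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Example 3-12: MST, Assignment 1, Assignment 2, Final exam
marks = [67, 88, 94, 75]
weights = [2, 1, 1, 6]  # the ratio 20% : 10% : 10% : 60%

final_mark = weighted_mean(marks, weights)
print(final_mark, round(final_mark))  # 76.6 77
```

Only the ratio of the weights matters, so passing the percentages 20, 10, 10 and 60 gives the same result.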

3.2.6 Relationships among Mean, Median and Mode

If the values of the mean, median and mode are known, they can give us some idea about the shape of a
frequency distribution. We now discuss the relationships among the mean, median and mode for
symmetric, positively skewed and negatively skewed distributions.

For a symmetric distribution with one peak, the values of the mean, median and mode are the same,
and they lie at the center of the distribution.

For a right-skewed (positively skewed) distribution, the value of the mean is the largest, the mode is the
smallest, and the value of the median lies between these two. Notice that the mode always occurs at the
peak point. The value of the mean is the largest in this case because it is sensitive to outliers that occur
in the right tail. These outliers pull the mean to the right.

If a distribution is skewed to the left (negatively skewed), the value of the mean is the smallest and the
mode is the largest, with the value of the median lying between these two. In this case, the outliers in
the left tail pull the mean to the left.

3.3 Measures of Variation

The measures of variation (also known as measures of dispersion) are numerical measures to determine
the spread of the data values from the central tendencies. Many times the measures of central tendency
alone cannot describe the data.

EXAMPLE 3−13

I wish to test two brands of outdoor paint to see how long each will last before fading. The results (in
months) are shown. Find the mean and median of each group. (Assume Population)

Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25

The mean and median for both brands of paint are 35 months. Since the mean and median are the same
for both brands, we cannot conclude which paint is better using these measures of central tendency.

Therefore, to decide which paint is the better choice, a measure of variation is
needed.

The types of measures of variation that will be discussed in this section are range, variance, and standard
deviation.

3.3.1 Range
The range is the simplest measure of variation and is defined as:

The range (R) is the highest value minus the lowest value in the data set. That is

R = Highest value – lowest value

EXAMPLE 3−14

Find the range for the two brands of paints given in Example 3−13.

SOLUTION

Brand A: The range R = 60 – 10 = 50 months.

Brand B: The range R = 45 – 25 = 20 months.

Since the range of Brand B is smaller, it can be concluded that Brand B is less variable (more reliable, or a
better choice) than Brand A.

Since the range is not a good measure of variability when there are extreme values in the dataset, statisticians use
other measures called the variance and the standard deviation.

3.3.2 The Variance and Standard Deviation

The variance is defined as the average of the squares of the deviations of each data value from the mean.
It is denoted by $\sigma^2$ for the population variance and $s^2$ for the sample variance.

The corresponding formulas used to calculate these variances from raw data are

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (X - \bar{X})^2}{n - 1},$$

where

$$\mu = \frac{\sum X}{N} \quad \text{and} \quad \bar{X} = \frac{\sum X}{n}.$$


The standard deviation is the most commonly used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. The standard deviation is
found by taking the square root of the variance. It is denoted by $\sigma$ for the population standard deviation and $s$
for the sample standard deviation.
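
The definitions of the variance and standard deviation can be sketched in Python using the paint data of Example 3−13. The function name `variance` and its `sample` flag are choices made for this illustration: with `sample=False` the sum of squared deviations is divided by N, and with `sample=True` it is divided by n − 1.

```python
from math import sqrt


def variance(values, sample=False):
    """Average squared deviation from the mean (divide by n - 1 for a sample)."""
    n = len(values)
    centre = sum(values) / n
    squared_deviations = sum((x - centre) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, data in (("A", brand_a), ("B", brand_b)):
    var = variance(data)  # the paint data are treated as a population
    print(name, round(var, 1), round(sqrt(var), 1))
# A 291.7 17.1
# B 41.7 6.5
```

Both brands have the same mean of 35 months, yet their standard deviations differ considerably.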

EXAMPLE 3−15

Find the variance and standard deviation for Brand A paint data given in Example 3−13.

SOLUTION

Step 1: Find the mean.

$$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$$

Step 2: Subtract the mean from each data value and square each result. The completed table is shown
below.

| Brand A ($X$) | $(X - \mu)^2$ |
|---|---|
| 10 | $(10 - 35)^2 = 625$ |
| 60 | $(60 - 35)^2 = 625$ |
| 50 | 225 |
| 30 | 25 |
| 40 | 25 |
| 20 | 225 |

Step 3: Find the sum of the 2nd column.

$$\sum (X - \mu)^2 = 625 + 625 + 225 + 25 + 25 + 225 = 1750$$

Step 4: Find the variance.

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$$

Step 5: Find the standard deviation.

$$\sigma = \sqrt{291.7} \approx 17.1$$

Remarks:
1. The variance and standard deviation of Brand B paint are 41.7 and 6.5 respectively.
2. Since the standard deviation of Brand B is smaller, one can conclude that Brand B is less variable (more
reliable, or a better choice) than Brand A.
3. There are shortcut formulas for computing the variance and standard deviation, which are summarized in the
table below:
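| | Raw data | Ungrouped frequency distribution | Grouped frequency distribution |
|---|---|---|---|
| Sample | $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$ | $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$ | $s^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{n}}{n - 1}$ |
| Population | $\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$ | $\sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N}$ | $\sigma^2 = \frac{\sum fX_m^2 - \frac{(\sum fX_m)^2}{N}}{N}$ |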

Note: Always use the shortcut formulas to compute variance and standard deviation.

EXAMPLE 3−16

Find the variance and standard deviation for Brand A paint data given in Example 3−13 using the shortcut
formula.

SOLUTION

Step 1: Find the sum of all the data values.

Step 2: Square each data value and enter the results in the 2nd column.

Step 3: Find the sum of the 2nd column.

| Brand A ($X$) | $X^2$ |
|---|---|
| 10 | 100 |
| 60 | 3600 |
| 50 | 2500 |
| 30 | 900 |
| 40 | 1600 |
| 20 | 400 |
| $\sum X = 210$ | $\sum X^2 = 9100$ |

Step 4: Find the variance.

$$\sigma^2 = \frac{9100 - \frac{210^2}{6}}{6} \approx 291.7$$

Step 5: Find the standard deviation.

$$\sigma = \sqrt{291.7} \approx 17.1$$

EXAMPLE 3−17

Find the variance and standard deviation of the number of fish caught using the data in Example 3−3.

SOLUTION

Step 1: Make a table as shown.

| No. of fish caught | No. of fishermen ($f$) | Midpoints ($X_m$) | $fX_m$ | $fX_m^2$ |
|---|---|---|---|---|
| 11 – 15 | 12 | | | |
| 16 – 20 | 14 | | | |
| 21 – 25 | 13 | | | |
| 26 – 30 | 11 | | | |
| | N = 50 | | | |

Step 2: Find the midpoint of each class and enter them in the 3rd column.

Step 3: For each class, multiply the frequency by the midpoint and enter the products in the 4th column. Find
the sum of the values in the 4th column.

Step 4: For each class, multiply the frequency by the square of the midpoint and enter the products in the
5th column. Find the sum of the values in the 5th column. The completed table is shown below.

| No. of fish caught | No. of fishermen ($f$) | Midpoints ($X_m$) | $fX_m$ | $fX_m^2$ |
|---|---|---|---|---|
| 11 – 15 | 12 | 13 | 12 × 13 = 156 | 12 × 13² = 2028 |
| 16 – 20 | 14 | 18 | 14 × 18 = 252 | 14 × 18² = 4536 |
| 21 – 25 | 13 | 23 | 299 | 6877 |
| 26 – 30 | 11 | 28 | 308 | 8624 |
| | N = 50 | | $\sum fX_m = 1015$ | $\sum fX_m^2 = 22065$ |


Step 5: Find the variance.

$$\sigma^2 = \frac{22065 - \frac{1015^2}{50}}{50} = 29.21 \approx 29.2$$

Step 6: Find the standard deviation.

$$\sigma = \sqrt{29.21} \approx 5.4$$

3.3.3 Coefficient of Variation

When two or more datasets have the same units of measure, the variance or standard deviation can be used to
compare the variability of the datasets. However, when the units of measure are different, the
coefficient of variation is used to compare their variability.

The coefficient of variation, denoted by CV, is the standard deviation divided by the mean. The result
is expressed as a percentage.

For a population: $CV = \frac{\sigma}{\mu} \times 100\%$

For a sample: $CV = \frac{s}{\bar{X}} \times 100\%$
EXAMPLE 3−18

The mean of the number of sales of airplane engines over a 6-month period is 92, and the standard
deviation is 5. The mean of the commissions earned is $5255, and the standard deviation is $770.
Compare the variations of the two.

SOLUTION

The coefficients of variation are:

For sales: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{5}{92} \times 100\% \approx 5.4\%$

For commissions: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{770}{5255} \times 100\% \approx 14.7\%$

Since the coefficient of variation is larger for commissions, the commissions are more variable than the
sales.
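
The comparison in Example 3−18 can be reproduced with a one-line function. The name `coefficient_of_variation` is chosen for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 92)            # number of engines sold
cv_commission = coefficient_of_variation(770, 5255)   # dollars

print(round(cv_sales, 1), round(cv_commission, 1))  # 5.4 14.7
```

Because the units cancel in the ratio, the two percentages can be compared even though one data set counts engines and the other is in dollars.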

3.4 Measures of Position

The measures of position (also known as measures of location) are the numerical measures to determine
the relative position of a data value in a data set.

The measures of position that will be discussed in this section are standard scores, percentiles,
deciles and quartiles.


3.4.1 Standard Scores
There is an old saying, “You can’t compare apples and oranges.” However, with the use of statistics, it
can be done to some extent. Suppose that a student scored 90 in a mathematics test and 45 in an English
test. Direct comparison of these raw scores is impossible, since the exams might not be equivalent in
terms of number of questions, value of each question, and so on. However, a comparison on a relative
standard common to both can be made. This comparison uses the mean and standard deviation and is
called a standard score or z score.
A standard score or z score tells how many standard deviations a data value is above or below the mean
for a specific distribution of values. If the standard score is zero, then the data value is the same as the
mean.

A z score or standard score for a value is obtained by subtracting the mean from the value and dividing
the result by the standard deviation, i.e.

For a population: $z = \frac{X - \mu}{\sigma}$

For a sample: $z = \frac{X - \bar{X}}{s}$

EXAMPLE 3−19

A student scored 90 on a Maths test that had a mean of 52 and a standard deviation of 10; the same student scored
45 on an English test with a mean of 35 and a standard deviation of 5. Compare the student's relative positions on
the two tests.

SOLUTION

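Step 1: Find the z scores.

For Maths: $z = \frac{X - \bar{X}}{s} = \frac{90 - 52}{10} = 3.8$

For English: $z = \frac{X - \bar{X}}{s} = \frac{45 - 35}{5} = 2.0$

Since the z score for Maths is larger, the student's relative position is higher on the Maths test than on the English test.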
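
The z score calculation of Example 3−19 can be sketched as a small function. The name `z_score` is chosen for this illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_maths = z_score(90, 52, 10)
z_english = z_score(45, 35, 5)

print(z_maths, z_english)  # 3.8 2.0
```

A value equal to the mean gives a z score of zero, and a value below the mean gives a negative z score.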

3.4.2 Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of
an individual in a group.

Percentiles are data values that divide the dataset, arranged in ascending order, into 100 equal parts.
Each set of observations has 99 percentiles, denoted by $P_1, P_2, \ldots, P_{99}$.

The following figure describes the positions of the 99 percentiles.

[Figure: each of these portions contains 1% of the observations of a data set arranged in increasing order.]

Remarks:
1. $P_{20}$ is called the 20th percentile, which indicates that 20% of the scores fall below $P_{20}$.
2. $P_{50}$ is called the 50th percentile, which indicates that 50% of the scores fall below $P_{50}$.
$P_{50}$ = median.

Steps to Compute Percentile of Raw data

Step 1: Arrange the data from lowest to highest (ascending order).

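Step 2: Find the $k$th percentile ($P_k$).

$$P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term}$$

Where,
k is the number of the percentile and n is the sample size.

If $\frac{kn}{100}$ is not a whole number, round it up to the next whole number; if it is a whole number, use the
average of that term and the next term (as done in Example 3−20).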

Note:
1. To calculate quartiles and deciles of raw data, convert them to percentiles and use the same
steps.
2. To estimate percentiles, deciles and quartiles of raw data, use a Percentile Graph.

Percentile Rank
We can calculate the percentile rank for a particular value x of a data set by using the formula:

$$\text{Percentile rank of } x = \frac{(\text{number of values less than } x) + 0.5}{\text{total number of values}} \times 100\%$$
Note:
1. A percentile is a value in the data set.
2. The percentile rank of a score indicates what percent of data lies below the score.


3.4.3 Deciles
Deciles are data values that divide the dataset, arranged in ascending order, into 10 equal parts.
Each set of observations has 9 deciles, denoted by $D_1, D_2, \ldots, D_9$.

The following figure describes the positions of the 9 deciles.

[Figure: each of these portions contains 10% of the observations of a data set arranged in increasing order.]

Remarks:
1. $D_4$ is called the 4th decile, which indicates that 40% of the scores fall below $D_4$.
2. $D_5$ is called the 5th decile, which indicates that 50% of the scores fall below $D_5$.
3. $P_{50} = D_5 =$ median.
4. $D_1 = P_{10}$; $D_2 = P_{20}$; $D_3 = P_{30}$; …; $D_9 = P_{90}$.

3.4.4 Quartiles
Quartiles are data values that divide the dataset, arranged in ascending order, into 4 equal parts.
Each set of observations has 3 quartiles, denoted by $Q_1$, $Q_2$ and $Q_3$.

The following figure describes the positions of the 3 quartiles.

[Figure: each of these portions contains 25% of the observations of a data set arranged in increasing order.]

Remarks:
1. $Q_1$ is called the 1st quartile (or lower quartile), which indicates that 25% of the scores fall below
$Q_1$.
2. $Q_3$ is called the 3rd quartile (or upper quartile), which indicates that 75% of the scores fall below
$Q_3$.
3. $Q_1 = P_{25}$; $Q_2 = P_{50}$; $Q_3 = P_{75}$.
4. $Q_2 = D_5 = P_{50} =$ Median.


EXAMPLE 3−20

The following are the test scores of 12 students in a statistics class:

70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82

Calculate the following:

1. $P_{80}$ and interpret its value.
2. $D_6$.
3. $Q_1$ and $Q_3$.
4. Percentile rank for the score 92.

SOLUTION

Arrange the data from lowest to highest (ascending order).

56, 62, 65, 70, 73, 77, 79, 82, 85, 87, 92, 99

1. $P_{80}$ is obtained by:

$$P_{80} = \frac{80(12)}{100}\text{th term} = 9.6\text{th term}$$

The value of the 9.6th term can be approximated by the 10th term in the ranked data. Therefore,
$P_{80} = 87$.
Hence, approximately 80% of the scores are below 87 in the given data.

2. $D_6$ (or $P_{60}$) is obtained by:

$$P_{60} = \frac{60(12)}{100}\text{th term} = 7.2\text{th term}$$

The value of the 7.2th term can be approximated by the 8th term in the ranked data. Therefore,
$D_6 = 82$.
Hence, approximately 60% of the scores are below 82 in the given data.

3. $Q_1$ (or $P_{25}$) is obtained by:

$$P_{25} = \frac{25(12)}{100}\text{th term} = 3\text{rd term}$$

The value of the 3rd term can be approximated by the average of the 3rd and 4th terms in the ranked data.
Therefore,

$$Q_1 = \frac{65 + 70}{2} = 67.5$$

$Q_3$ (or $P_{75}$) is obtained by:

$$P_{75} = \frac{75(12)}{100}\text{th term} = 9\text{th term}$$

The value of the 9th term can be approximated by the average of the 9th and 10th terms in the ranked data.
Therefore,

$$Q_3 = \frac{85 + 87}{2} = 86$$

4. Percentile rank of 92 $= \frac{10 + 0.5}{12} \times 100\% = 87.5\%$.
Hence, approximately 87.5% of the scores are below 92 in the given data.
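
The positional rule used in this example can be sketched in Python. The rounding convention below (round a fractional position up; average a whole-numbered position with the next term) is inferred from the worked solution above rather than stated as a general definition, and other textbooks and software use different conventions. The function names are chosen for this illustration.

```python
def percentile(values, k):
    """k-th percentile (k a whole number from 1 to 99) by the positional rule."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    # Position kn/100 split into its whole part and remainder (integer arithmetic)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        # Whole-numbered position: average that term and the next one
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Fractional position: round up to the next term (index `whole` is 0-based)
    return ordered[whole]


def percentile_rank(values, x):
    """Percentage of values below x, with the 0.5 adjustment from the formula."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100


scores = [70, 77, 65, 56, 99, 62, 79, 73, 85, 87, 92, 82]

print(percentile(scores, 80))       # 87
print(percentile(scores, 60))       # 82
print(percentile(scores, 25))       # 67.5
print(percentile(scores, 75))       # 86.0
print(percentile_rank(scores, 92))  # 87.5
```

Deciles and quartiles are obtained by calling `percentile` with the equivalent percentile number, for example 60 for the 6th decile.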

EXAMPLE 3−21

Estimate the following from the data given in Example 3−3.

1. $P_{20}$.
2. Percentile rank for the score 26.

SOLUTION

Using the percentile graph plotted before,

[Figure: Percentile Graph. x-axis: no. of fish caught (class boundaries 10.5 to 30.5); y-axis: cumulative percentage (0 to 100).]

1. Reading the x-value for the y-value 20, we get $P_{20} \approx 14$.
2. Reading the y-value for the x-value 26, we get the percentile rank for the score 26 to be approximately 81.

3.4.5 Other Measures of Variation

The variance and standard deviation are regarded as the best and most powerful measures of
dispersion. One drawback of these measures is that they are influenced by
extreme observations called outliers. Thus, when there are outliers in a dataset, many statisticians think
that the median should be used as the measure of central tendency and that other measures of dispersion,
namely the interquartile range or the quartile deviation, should be used to describe the variability.

The interquartile range is the difference between the upper quartile and the lower quartile. That
is,

$$\text{Interquartile range (IQR)} = Q_3 - Q_1$$

The quartile deviation is half of the difference between the upper quartile and the lower
quartile. That is,

$$\text{Quartile deviation (QD)} = \frac{Q_3 - Q_1}{2}$$

EXAMPLE 3−22

Find the interquartile range and the quartile deviation for the given data in Example 3−20.

SOLUTION

From Example 3−20, we obtain

$Q_1 = 67.5$ and $Q_3 = 86$.

Therefore,

$$\text{Interquartile range} = Q_3 - Q_1 = 86 - 67.5 = 18.5$$

and

$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{86 - 67.5}{2} = 9.25$$


3.5 Outliers
We already know that values that are very small (or extreme low) or very large (or extreme high) relative
to the majority of the values in a data set are known as outliers. We have seen that outliers strongly affect
the mean, standard deviation and some other measures as well. Therefore, it is important to identify
outliers in the dataset so that we use appropriate measures when outliers are present in the dataset.

An outlier is an extremely high or an extremely low data value when compared with the rest of the
data values.

How does an outlier occur?

There are several reasons why outliers may occur. The data value may have resulted from a:
• Measurement or observational error. That is, the researcher measured the variable incorrectly.
• Recording error. That is, it may have been written or typed incorrectly.
• Subject that is not in the defined population.

Procedure for Identifying Outliers

There are several ways to check a dataset for outliers. A good rule of thumb for detecting outliers is as
follows:
Step 1: Arrange the data in ascending order and find $Q_1$ and $Q_3$.

Step 2: Find the interquartile range: $IQR = Q_3 - Q_1$.

Step 3: Find the interval: $Q_1 - 1.5 \times IQR \le x \le Q_3 + 1.5 \times IQR$.

Step 4: Check the data set for any data values x that fall outside the interval. Those values are
outliers.

EXAMPLE 3−23

Check the following data set for outliers.

70, 5, 12, 6, 15, 13, 18, 30

SOLUTION

The data value 70 is suspected to be an outlier. Using the procedure given above we have:

Step 1: The data in ascending order is

5, 6, 12, 13, 15, 18, 30, 70

Using the procedure taught before, $Q_1 = 9$ and $Q_3 = 24$.

Step 2: The interquartile range is IQR = 24 – 9 = 15.

Step 3: The interval is: $9 - 1.5 \times 15 \le x \le 24 + 1.5 \times 15 \Rightarrow -13.5 \le x \le 46.5$.


Step 4: Check the data set for any data values that fall outside the interval from −13.5 to 46.5. Since the
data value 70 is outside this interval, it can be considered an outlier.
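
The outlier check can be sketched as a function that takes the quartiles as inputs. The name `find_outliers` is hypothetical; the quartile values 9 and 24 are the ones obtained in the example.

```python
def find_outliers(values, q1, q3):
    """Return the interval limits and the values lying outside them."""
    iqr = q3 - q1            # Step 2: interquartile range
    low = q1 - 1.5 * iqr     # Step 3: lower limit of the interval
    high = q3 + 1.5 * iqr    #         upper limit of the interval
    outliers = [x for x in values if x < low or x > high]  # Step 4
    return low, high, outliers


data = [70, 5, 12, 6, 15, 13, 18, 30]
low, high, outliers = find_outliers(data, q1=9, q3=24)

print(low, high, outliers)  # -13.5 46.5 [70]
```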

3.6 Exploratory Data Analysis (EDA)

In traditional statistics, data are organized by using a frequency distribution and various graphs are
constructed to determine the shape or nature of the distribution. Exploratory Data Analysis (EDA) is
the process of using graphical and descriptive statistical techniques (like median, IQR) to learn about the
structure of a dataset.

In EDA,
• Data can be organised using a stem and leaf plot.
• The measure of central tendency used is the median.
• The measure of variation used is the interquartile range.
• Data are represented graphically using a box-plot.

A box-plot is a graph that is used to determine the nature and shape of the distribution in EDA. It is
obtained by drawing a horizontal line from the minimum data value to $Q_1$, drawing a horizontal line from
$Q_3$ to the maximum data value, and drawing a box whose vertical sides pass through $Q_1$ and $Q_3$ with
a vertical line inside the box passing through the median.

Information obtained from a Box-plot

a. If the median is near the center of the box or the lines are about the same length, the distribution is
approximately symmetric.
b. If the median is to the left of the center of the box or the right line is longer than the left line, the
distribution is positively skewed.
c. If the median is to the right of the center of the box or the left line is longer than the right line,
the distribution is negatively skewed.
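
The three rules above can be sketched as a small function that reads the two cues (median position and line lengths) from a five-number summary. This is an illustration only: the function name shape_cues is hypothetical, and the tolerance of 0.2 that decides what counts as "near the center" or "about the same length" is an arbitrary assumption, since the text does not define these terms numerically. The summary values in the usage line are made up for illustration.

```python
def shape_cues(minimum, q1, med, q3, maximum, tolerance=0.2):
    """Return (median cue, line cue) read from a five-number summary."""

    def compare(left, right):
        # Treat the two lengths as "about the same" when their difference
        # is small relative to their total (tolerance is an assumption).
        total = left + right
        if total == 0 or abs(right - left) <= tolerance * total:
            return "approximately symmetric"
        return "positively skewed" if right > left else "negatively skewed"

    # Cue 1: a median left of the box center leaves the wider part of the box on the right.
    median_cue = compare(med - q1, q3 - med)
    # Cue 2: compare the left line (minimum to Q1) with the right line (Q3 to maximum).
    line_cue = compare(q1 - minimum, maximum - q3)
    return median_cue, line_cue


# Hypothetical five-number summary: minimum 2, Q1 10, median 12, Q3 20, maximum 60.
print(shape_cues(2, 10, 12, 20, 60))   # ('positively skewed', 'positively skewed')
```

The two cues are returned separately because they can disagree for a given data set; in that case the box-plot itself should be inspected rather than relying on a single rule.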

EXAMPLE 3−24

Construct a box-plot for the data given below.

16, 18, 12, 11, 8, 13, 4, 3, 9, 20

SOLUTION

Step 1: The Five-Number Summary (Note: The data should be arranged in ascending order first)
1. The lowest value is 3;
2. Q1 = 8;
3. The median is 11.5;
4. Q3 = 16;
5. The highest value is 20.

Step 2: Draw a horizontal axis with a suitable scale.

Step 3: Draw a horizontal line from the minimum data value to Q1 , then draw a horizontal line from Q3
to the maximum data value, and then draw a box whose vertical sides pass through Q1 and Q3 with a
vertical line inside the box passing through the median.

Therefore, the boxplot is given below:

 8  1  1
 3 1 6 
.
5

 0 4 8 12 16 20 22

The distribution is somewhat symmetric.
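
The five-number summary in Step 1 can be reproduced with a few lines of code. This is a minimal sketch under one assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, which matches the values used in this example. The function name five_number_summary is hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)                     # arrange in ascending order first
    half = len(data) // 2                     # size of the lower and upper halves
    q1 = median(data[:half])                  # median of the lower half (assumed rule)
    q3 = median(data[len(data) - half:])      # median of the upper half (assumed rule)
    return data[0], q1, median(data), q3, data[-1]


print(five_number_summary([16, 18, 12, 11, 8, 13, 4, 3, 9, 20]))
# (3, 8, 11.5, 16, 20)
```

Statistical software often computes quartiles by interpolation, so a library routine may return slightly different values of Q1 and Q3 for the same data; the rule used should be stated alongside the result.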

3.7 Summary
This chapter discussed statistical techniques for describing data: measures of central tendency,
measures of variation and measures of position. The measures of central tendency (mean, median,
mode and midrange) locate the center of a data set; the measures of variation (range, variance and
standard deviation) gauge the spread of the data values; and the measures of position (standard score,
percentile, decile and quartile) locate the position of a data value within the data set. The chapter also
explained how to detect outliers in a data set and how to construct a box-plot.

EXERCISES

1. The cash compensations received in 2009 by the highest-paid executives of 12 international
companies (in $000s) were as follows:

2215 1888 1477 1059 977 956

947 924 899 856 856 803

A. Compute the mean, median, mode and the standard deviation.

B. Calculate the values of three quartiles, 40th percentile and the percentile rank of 956.
C. Check for outliers in the data.
D. Construct a box-plot and use it to comment on the shape of the distribution.

2. A survey of all the 110 firms in a small state was carried out to find the number of people employed
at each. The results are shown in the following table.

Number of Employees    1 – 10    11 – 20    21 – 30    31 – 40    41 – 50
Frequency                32        34         14         12         18

A. Approximate the mean, the mode and the median of the number of people employed at each
firm.
B. Calculate the variance and standard deviation.

3. Suppose an instructor gives two exams and a final exam, assigning the final exam a weight twice
that of each of the other exams. Find the weighted mean for a student who scores 73 and 67 on the
first two exams and 85 on the final exam.

4. An analysis of monthly wages paid to the workers of firms A and B, which belong to the same industry,
gives the following results:

                                     Firm A    Firm B
Number of Workers                    100       200
Average monthly wage                 $196      $185
Variance of distribution of wages    $81       $144

A. Which firm, A or B, has the larger wage bill?

B. In which firm, A or B, is there greater variability among individual wages?
