MAS 101: Descriptive Statistics

Describing Data with Numerical Measures

Samue Baffoe & Seth Opuku Larbi

Maseno University

September 2, 2024

Sam & Seth Maseno University September 2, 2024 1 / 63


Table of Contents

1 General objective

2 Introduction

3 Measures of Variability

4 Measures of Relative Standing

5 The five-Number Summary and the Box Plot

General objective

Graphs are extremely useful for the visual description of a data set. However, they
are not always the best tool when you want to make inferences about a population
from the information contained in a sample. For this purpose, it is better to use
numerical measures to construct a mental picture of the data.

Introduction

Describing Data with Numerical Measures

Graphs are useful for illustrating data distributions, but they have limitations.
For instance, if your data projector fails or you need to describe your data over
the phone, you must find another way to convey the information effectively.
Another limitation of graphs is their imprecision for statistical inference. For
instance, comparing sample and population histograms can be challenging.
While identical histograms can easily be described as "the same," differences
are harder to quantify concretely.
To address these problems, use numerical measures, which can be calculated
for a sample or a population. These numbers provide a clear picture of the
frequency distribution. Numerical measures computed for a population are called
parameters; those computed from a sample are called statistics.

Measures of Center

We introduced dotplots, stem and leaf plots, and histograms to describe the
distribution of a set of measurements on a quantitative variable x.
The horizontal axis displays the values of x, and the data are “distributed”
along this horizontal line. One of the first important numerical measures is a
measure of center—a measure along the horizontal axis that locates the
center of the distribution.
The arithmetic average of a set of measurements is a very common and
useful measure of center. This measure is often referred to as the arithmetic
mean, or simply the mean, of a set of measurements.
To distinguish between the mean for the sample and the mean for the
population, we will use the symbol x̄ (x-bar) for a sample mean and the
symbol µ (Greek lowercase mu) for the mean of a population.

Definition: The arithmetic mean or average of a set of n measurements is
equal to the sum of the measurements divided by n.
Suppose there are n measurements on the variable x; call them
$x_1, x_2, \ldots, x_n$. To add the n measurements together, we use this shorthand
notation:
\[ \sum_{i=1}^{n} x_i \quad \text{which means} \quad x_1 + x_2 + x_3 + \cdots + x_n. \]
The Greek capital sigma (Σ) tells you to add the items that appear to its
right, beginning with the number below the sigma (i = 1) and ending with
the number above (i = n).
However, since the typical sums in statistical calculations are almost always
made on the total set of n measurements, you can use a simpler notation:
$\sum x_i$ means "the sum of all the x measurements."
Using this notation, we write the formulas for the sample and population means:
NOTATION
Sample mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Population mean:
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

TABLE 1.9: Birth Weights of 30 Full-Term Newborn Babies


7.2 7.8 6.8 6.2 8.2
8.0 8.2 5.6 8.6 7.1
8.2 7.7 7.5 7.2 7.7
5.8 6.8 6.8 8.5 7.5
6.1 7.9 9.4 9.0 7.8
8.5 9.0 7.7 6.7 7.7

Figure 2.1

EXAMPLE
Draw a dotplot for the n = 5 measurements 2, 9, 11, 5, 6. Find the sample mean
and compare its value with what you might consider the “center” of these
observations on the dotplot.
Solution
The dotplot in Figure 2.2 seems to be centered between 6 and 8. To find the
sample mean, calculate
Sample mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6 \]

Figure 2.2
The statistic x̄ = 6.6 is the balancing point or fulcrum shown on the dot plot. It
does seem to mark the center of the data.
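
The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice, not something defined in these notes.

```python
# Sample mean of the n = 5 measurements from the example.
measurements = [2, 9, 11, 5, 6]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    return sum(values) / len(values)


x_bar = sample_mean(measurements)
print(x_bar)  # 6.6
```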

An important use of the sample mean x̄ is as an estimator of the unknown
population mean µ.
The birth weight data (Table 1.9) are a sample from a larger population of
birth weights, and that sample has a distribution of its own.
The mean of the 30 birth weights is
\[ \bar{x} = \frac{\sum_{i=1}^{30} x_i}{30} = \frac{227.2}{30} = 7.57 \]
The mean of the entire population of newborn birth weights is unknown, but
if you had to guess its value, your best estimate would be 7.57.
Although the sample mean x̄ changes from sample to sample, the population
mean µ stays the same.
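
As a quick check of the figures quoted for Table 1.9, the sketch below recomputes the total and the mean of the 30 birth weights. The variable names are illustrative assumptions.

```python
# The 30 birth weights from Table 1.9, entered row by row.
birth_weights = [
    7.2, 7.8, 6.8, 6.2, 8.2,
    8.0, 8.2, 5.6, 8.6, 7.1,
    8.2, 7.7, 7.5, 7.2, 7.7,
    5.8, 6.8, 6.8, 8.5, 7.5,
    6.1, 7.9, 9.4, 9.0, 7.8,
    8.5, 9.0, 7.7, 6.7, 7.7,
]

total = sum(birth_weights)               # sum of all x_i
x_bar = total / len(birth_weights)       # sample mean

print(len(birth_weights), round(total, 1), round(x_bar, 2))  # 30 227.2 7.57
```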

Median

A second measure of central tendency is the median, which is the value in
the middle position in the set of measurements ordered from smallest to
largest.
Definition: The median m of a set of n measurements is the value of x that
falls in the middle position when the measurements are ordered from smallest
to largest.
Example 2.2: Find the median for the set of measurements 2, 9, 11, 5, 6.
Solution: Rank the n = 5 measurements from smallest to largest:

2, 5, 6 (← middle observation), 9, 11

The middle observation, marked with an arrow, is in the center of the set, so
m = 6.

Example 2.3: Find the median for the set of measurements 2, 9, 11, 5, 6, 27.
Solution: Rank the measurements from smallest to largest:

2, 5, [6, 9], 11, 27

Now there are two "middle" observations, shown in brackets. To find the median,
choose a value halfway between the two middle observations:
\[ \text{Median} = \frac{6 + 9}{2} = 7.5 \]
The value 0.5(n + 1) indicates the position of the median in the ordered data
set. If the position of the median is a number that ends in .5, you need
to average the two adjacent values.

For the n = 5 ordered measurements from Example 2.2, the position of the
median is
\[ 0.5(n + 1) = 0.5(5 + 1) = 3 \]
and the median is the 3rd ordered observation, or m = 6. For the n = 6 ordered
measurements from Example 2.3, the position of the median is
\[ 0.5(n + 1) = 0.5(6 + 1) = 3.5 \]
and the median is the average of the 3rd and 4th ordered observations, or
\[ m = \frac{6 + 9}{2} = 7.5. \]
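
The position rule 0.5(n + 1) translates directly into code. The sketch below is one possible implementation under that rule; the function name `median_by_position` is an assumption made for illustration.

```python
def median_by_position(values):
    """Median located at 1-based position 0.5 * (n + 1) in the ordered data."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    n = len(ordered)
    position = 0.5 * (n + 1)      # 1-based position of the median
    lower = int(position)         # whole-number part of the position
    if position == lower:
        # n is odd: the position is a whole number, take that observation.
        return ordered[lower - 1]
    # n is even: the position ends in .5, average the two adjacent observations.
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median_by_position([2, 9, 11, 5, 6]))      # 6
print(median_by_position([2, 9, 11, 5, 6, 27]))  # 7.5
```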

Although both the mean and the median are good measures of the center of a
distribution, the median is less sensitive to extreme values or outliers. For example,
the value x = 27 in Example 2.3 is much larger than the other five measurements.
The median, m = 7.5, is not affected by the outlier, whereas the sample average,
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{60}{6} = 10 \]
is affected; its value is not representative of the remaining five observations. When
a data set has extremely small or extremely large observations, the sample mean is
drawn toward the direction of the extreme measurements (see Figure 2.3).

Figure 2.3

If a distribution is skewed to the right, the mean shifts to the right; if a
distribution is skewed to the left, the mean shifts to the left.
The median is not affected by these extreme values because the numerical
values of the measurements are not used in its calculation.
When a distribution is symmetric, the mean and the median are equal.
If a distribution is strongly skewed by one or more extreme values, you should
use the median rather than the mean as a measure of center.

Mode
Another way to locate the center of a distribution is to look for the value of x that
occurs with the highest frequency. This measure of the center is called the mode.
Definition: The mode is the category that occurs most frequently or the
most frequently occurring value of x.
When measurements on a continuous variable have been grouped as a
frequency or relative frequency histogram, the class with the highest peak or
frequency is called the modal class, and the midpoint of that class is taken
to be the mode.
The mode is generally used to describe large data sets, whereas the mean and
median are used for both large and small data sets.
In Example 1.11, the mode of weekly visits to Starbucks for the 25 customers
listed in Table 2.1(a) is 5.
For the birth weight data in Table 2.1(b), the mode is 7.7, occurring four
times. The histogram shows that the class with the highest peak runs from 7.6 to
8.1, so the mode of the grouped data is the midpoint, 7.85. See Figure 2.4(b).
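
The two modes quoted above can be checked by counting how often each value occurs. This is a sketch only: it assumes the Starbucks entries in Table 2.1(a) are 25 single-digit weekly visit counts, and the helper name `mode_with_count` is illustrative. When several values tie for the highest frequency, this helper reports only one of them.

```python
from collections import Counter

# Table 2.1(a): weekly Starbucks visits (assumed to be single-digit counts).
starbucks_visits = [
    6, 7, 1, 5, 6,
    4, 6, 4, 6, 8,
    6, 5, 6, 3, 4,
    5, 5, 5, 7, 6,
    3, 5, 7, 5, 5,
]

# Table 2.1(b): birth weights.
birth_weights = [
    7.2, 7.8, 6.8, 6.2, 8.2,
    8.0, 8.2, 5.6, 8.6, 7.1,
    8.2, 7.7, 7.5, 7.2, 7.7,
    5.8, 6.8, 6.8, 8.5, 7.5,
    6.1, 7.9, 9.4, 9.0, 7.8,
    8.5, 9.0, 7.7, 6.7, 7.7,
]


def mode_with_count(values):
    """Return (most frequent value, its frequency)."""
    value, count = Counter(values).most_common(1)[0]
    return value, count


print(mode_with_count(starbucks_visits))  # (5, 8)
print(mode_with_count(birth_weights))     # (7.7, 4)
```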

A distribution of measurements can have more than one mode. These modes
would appear as "local peaks" in the relative frequency distribution.
For example, if we were to tabulate the length of fish taken from a lake
during one season, we might get a bimodal distribution, possibly reflecting a
mixture of young and old fish in the population.
Sometimes bimodal distributions of sizes or weights reflect a mixture of
measurements taken on males and females. In any case, a set or distribution
of measurements may have more than one mode.

Table 2.1

(a) Starbucks data
6 7 1 5 6
4 6 4 6 8
6 5 6 3 4
5 5 5 7 6
3 5 7 5 5

(b) Birth weight data
7.2 7.8 6.8 6.2 8.2
8.0 8.2 5.6 8.6 7.1
8.2 7.7 7.5 7.2 7.7
5.8 6.8 6.8 8.5 7.5
6.1 7.9 9.4 9.0 7.8
8.5 9.0 7.7 6.7 7.7

Figure 2.4

Measures of Variability

Data sets may have the same center but look different because of the way the
numbers spread out from the center. Consider the two distributions shown in
Figure 2.5. Both distributions are centered at x = 4, but there is a big difference
in how the measurements spread out, or vary. The measurements in Figure 2.5(a)
vary from 3 to 5; in Figure 2.5(b) the measurements vary from 0 to 8.

Figure 2.5

Variability or dispersion is a very important characteristic of data. For
example, if you were manufacturing bolts, extreme variation in the bolt
diameters would cause a high percentage of defective products. On the other
hand, if you were trying to discriminate between good and poor accountants,
you would have trouble if the examination always produced test grades with
little variation.
Measures of variability can help you create a mental picture of the spread
of the data. We will present some of the more important ones. The simplest
measure of variation is the range.
Definition: The range, R, of a set of n measurements is defined as the
difference between the largest and smallest measurements.

For example, the measurements 5, 7, 1, 2, 4 vary from 1 to 7. Hence, the range is
7 − 1 = 6. The range is easy to calculate, easy to interpret, and is an adequate
measure of variation for small sets of data. But, for large data sets, the range is
not an adequate measure of variability. For example, the two relative frequency
distributions in Figure 2.6 have the same range but very different shapes and
variability.

Figure 2.6

The horizontal distances between each measurement and the mean x̄ indicate
variability. Large distances mean more variability, while small distances mean less.
The deviation of a measurement xi from the mean is (xi − x̄). Positive deviations
occur for measurements to the right of the mean, and negative deviations for
those to the left. The values of x and their deviations are shown in Table 2.2.
Table 2.2
$x_i$    $(x_i - \bar{x})$    $(x_i - \bar{x})^2$
5        1.2                  1.44
7        3.2                  10.24
1        -2.8                 7.84
2        -1.8                 3.24
4        0.2                  0.04
Sum: 19  0.0                  22.80

Because the deviations in the table indicate variability, we might average
them to get a single measure. However, since some deviations are positive
and some are negative, their sum is always zero, making the average
ineffective for this purpose.
Another approach is to ignore the signs of the deviations and average their
absolute values.
Although this average of absolute deviations is used in exploratory data
analysis and time series analysis, we prefer to use the sum of squared
deviations. This sum is used to calculate a single measure called the variance.
We denote the sample variance by $s^2$ and the population variance by $\sigma^2$. The
variance is larger for highly variable data and smaller for less variable data.

Variance
The variance of a population of N measurements is the average of the
squares of the deviations of the measurements about their mean µ.
The population variance is denoted by $\sigma^2$ and is given by the formula:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Most often, you will not have all the population measurements available but
will need to calculate the variance of a sample of n measurements.
The variance of a sample of n measurements is the sum of the squared
deviations of the measurements about their mean x̄ divided by (n − 1). The
sample variance is denoted by $s^2$ and is given by the formula:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
NOTE: The variance and the standard deviation cannot be negative
numbers.

For the set of n = 5 sample measurements presented in Table 2.2, the square
of the deviation of each measurement is recorded in the third column.
Adding, we obtain
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 22.80 \]
and the sample variance is
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{22.80}{4} = 5.70 \]
The variance is measured in terms of the square of the original units of
measurement.
If the original measurements are in inches, the variance is expressed in square
inches.
Taking the square root of the variance, we obtain the standard deviation,
which returns the measure of variability to the original units of measurement.
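
The definition of the sample variance can be followed step by step in code. The sketch below rebuilds the deviation and squared-deviation columns of Table 2.2 for the measurements 5, 7, 1, 2, 4; the variable names are illustrative assumptions.

```python
import math

data = [5, 7, 1, 2, 4]
n = len(data)
x_bar = sum(data) / n                          # sample mean, 3.8

deviations = [x - x_bar for x in data]         # (x_i - x_bar) column
squared = [d ** 2 for d in deviations]         # (x_i - x_bar)^2 column

sum_of_squares = sum(squared)                  # about 22.80
sample_variance = sum_of_squares / (n - 1)     # divide by n - 1, about 5.70
sample_sd = math.sqrt(sample_variance)         # back in the original units

print(round(sum(deviations), 10))   # 0.0 (up to rounding): deviations cancel out
print(round(sample_variance, 2))    # 5.7
print(round(sample_sd, 2))          # 2.39
```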

Standard Deviation

Definition: The standard deviation of a set of measurements is equal to the
positive square root of the variance.
Notation
n: number of measurements in the sample
N: number of measurements in the population
$s^2$: sample variance
$\sigma^2$: population variance
$s = \sqrt{s^2}$: sample standard deviation
$\sigma = \sqrt{\sigma^2}$: population standard deviation
For the set of n = 5 sample measurements in Table 2.2, the sample variance is
$s^2 = 5.70$. So the sample standard deviation is $s = \sqrt{5.70} \approx 2.39$. The more
variable the data set is, the larger the value of s.

For a small set of measurements, calculating the variance is not too difficult.
For larger sets, use a scientific calculator with built-in programs to compute x̄
and s or µ and σ.
The sample mean key is usually marked with x̄, the sample standard
deviation key with s, $s_x$, or $s_{x(n-1)}$, and the population standard deviation
key with σ, $\sigma_x$, or $\sigma_{xn}$.
Be sure to know which calculation each key performs. For hand calculations,
use the shortcut method for computing $s^2$.
The Computing Formula for Calculating $s^2$
\[ s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} \]
The symbols $(\sum x_i)^2$ and $\sum x_i^2$ in the computing formula are shortcut ways
to indicate the arithmetic operations you need to perform.
You know from the formula for the sample mean that $\sum x_i$ is the sum of all
the measurements.
To find $\sum x_i^2$, you square each individual measurement and then add the
squares together.
$\sum x_i^2$ = sum of the squares of the individual measurements
$(\sum x_i)^2$ = square of the sum of the individual measurements
The sample standard deviation, s, is the positive square root of $s^2$.
Example
Calculate the variance and standard deviation for the five measurements
5, 7, 1, 2, 4. Use the computing formula for $s^2$ and compare your results
with those obtained using the original definition of $s^2$.

Solution
Let’s calculate the variance and standard deviation for the five measurements 5,
7, 1, 2, 4 using the computing formula for $s^2$.
Given the formula
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]
first calculate the necessary sums:
\[ \sum_{i=1}^{n} x_i = 5 + 7 + 1 + 2 + 4 = 19 \]
\[ \sum_{i=1}^{n} x_i^2 = 5^2 + 7^2 + 1^2 + 2^2 + 4^2 = 25 + 49 + 1 + 4 + 16 = 95 \]

Now, plug these values into the formula:
\[ s^2 = \frac{95 - \frac{19^2}{5}}{5 - 1} = \frac{95 - \frac{361}{5}}{4} = \frac{95 - 72.2}{4} = \frac{22.8}{4} = 5.7 \]
The variance ($s^2$) is 5.7, which matches the value obtained from the original
definition. To find the standard deviation (s), take the square root
of the variance:
\[ s = \sqrt{5.7} \approx 2.39 \]
So, the variance is 5.7 and the standard deviation is approximately 2.39.
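
The same shortcut can be written as a short function and compared with the definition-based result. This is a minimal sketch; the name `variance_shortcut` is an assumption made for illustration.

```python
import math


def variance_shortcut(values):
    """Sample variance via the computing formula:
    (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)                      # sum of the measurements
    sum_x2 = sum(x * x for x in values)      # sum of the squared measurements
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [5, 7, 1, 2, 4]
s2 = variance_shortcut(data)
s = math.sqrt(s2)

print(sum(data), sum(x * x for x in data))   # 19 95
print(round(s2, 2), round(s, 2))             # 5.7 2.39
```

Because the formula subtracts two numbers that can be close in size, results are compared after rounding rather than with exact equality.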

Why Divide by n-1

You divide by (n − 1) rather than n when computing the sample variance $s^2$
because, just as the sample mean x̄ estimates the population mean µ, the
sample variance $s^2$ is used to estimate $\sigma^2$, and the version with (n − 1) in
the denominator provides better estimates of $\sigma^2$ than an estimator calculated
with n in the denominator. Dividing by n tends to underestimate $\sigma^2$.
Now that you have learned how to compute the variance and standard deviation,
remember these points:
The value of s is always greater than or equal to zero.
The larger the value of $s^2$ or s, the greater the variability of the data set.
If $s^2$ or s is equal to zero, all the measurements must have the same value.
In order to measure the variability in the same units as the original
observations, we compute the standard deviation $s = \sqrt{s^2}$.
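
One way to see the effect of the divisor is to enumerate every possible sample from a tiny population. The sketch below is illustrative only: it treats the five measurements 5, 7, 1, 2, 4 as a hypothetical population, assumes samples of size n = 2 drawn with replacement, and averages both versions of the variance estimate over all 25 ordered samples.

```python
from itertools import product

# Hypothetical population (an assumption made only for this illustration).
population = [5, 7, 1, 2, 4]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))   # all 25 ordered samples

# Average of each estimator over every possible sample.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(round(sigma2, 2))             # 4.56
print(round(avg_div_n_minus_1, 2))  # 4.56, matches sigma^2 on average
print(round(avg_div_n, 2))          # 2.28, too small on average
```

In this small setting, dividing by (n − 1) reproduces the population variance on average, while dividing by n falls short of it.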

Practical Significance of the Standard Deviation

We now introduce a useful theorem developed by the Russian mathematician
Tchebysheff. Proof of the theorem is not difficult, but we are more interested in
its application than its proof.
Tchebysheff’s Theorem: Given a number k ≥ 1 and a set of n measurements, at
least $1 - \frac{1}{k^2}$ of the measurements will lie within k standard deviations
of their mean.
This applies to both samples and populations.

Illustrating Tchebysheff’s Theorem

Figure 2.8

In Table 2.4, we choose a few numerical values for k and compute $1 - \frac{1}{k^2}$.

Table 2.4
k    $1 - \frac{1}{k^2}$
1    1 − 1 = 0
2    1 − 1/4 = 3/4
3    1 − 1/9 = 8/9

From the calculations in Table 2.4, the theorem states:
At least none of the measurements lie in the interval x̄ − s to x̄ + s.
At least 3/4 of the measurements lie in the interval x̄ − 2s to x̄ + 2s.
At least 8/9 of the measurements lie in the interval x̄ − 3s to x̄ + 3s.
(For a population, replace x̄ and s with µ and σ.)
Although the first statement is not at all helpful, the other two values of k
provide valuable information about the proportion of measurements that fall
in certain intervals.
The values k = 2 and k = 3 are not the only values of k you can use; for
example, the proportion of measurements that fall within k = 2.5 standard
deviations of the mean is at least $1 - \frac{1}{(2.5)^2} = 0.84$.
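
The lower bound in Tchebysheff’s Theorem is a one-line calculation. The sketch below evaluates it for the values of k discussed above; the function name `tchebysheff_lower_bound` is an illustrative assumption.

```python
def tchebysheff_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 1 - 1 / k ** 2


for k in (1, 2, 2.5, 3):
    print(k, round(tchebysheff_lower_bound(k), 2))
# 1 0.0
# 2 0.75
# 2.5 0.84
# 3 0.89
```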

Example

The mean and variance of a sample of n = 25 measurements are 75 and 100,
respectively. Use Tchebysheff’s Theorem to describe the distribution of
measurements.
Solution
You are given x̄ = 75 and $s^2 = 100$. The standard deviation is $s = \sqrt{100} = 10$.
The distribution of measurements is centered about x̄ = 75, and Tchebysheff’s
Theorem states:
At least 3/4 of the 25 measurements lie in the interval x̄ ± 2s, or
75 ± 2(10), that is, 55 to 95.
At least 8/9 of the measurements lie in the interval x̄ ± 3s, or 75 ± 3(10), that
is, 45 to 105.
Since Tchebysheff’s Theorem applies to any distribution, it is very conservative.
This is why we emphasize “at least $1 - \frac{1}{k^2}$” in this theorem.

Empirical Rule
The Empirical Rule works well for mound-shaped data distributions, which are
common in nature. The closer your data distribution is to the mound shape, the
more accurate the rule will be.

Figure 2.9

Empirical Rule: Given a distribution of measurements that is approximately
mound-shaped:
The interval (µ ± σ) contains approximately 68% of the measurements.
The interval (µ ± 2σ) contains approximately 95% of the measurements.
The interval (µ ± 3σ) contains approximately 99.7% of the measurements.

Empirical Rule: Example

In a time study conducted at a manufacturing plant, the length of time to
complete a specified operation is measured for each of n = 40 workers. The mean
and standard deviation are found to be 12.8 and 1.7, respectively. Describe the
sample data using the Empirical Rule.
Solution: To describe the data, calculate these intervals:

x̄ ± s = 12.8 ± 1.7 or 11.1 to 14.5
x̄ ± 2s = 12.8 ± 2(1.7) or 9.4 to 16.2
x̄ ± 3s = 12.8 ± 3(1.7) or 7.7 to 17.9

According to the Empirical Rule, you expect approximately 68% of the
measurements to fall into the interval from 11.1 to 14.5, approximately 95% to
fall into the interval from 9.4 to 16.2, and approximately 99.7% to fall into the
interval from 7.7 to 17.9.
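
The three intervals can be generated from the mean and standard deviation alone. The sketch below does this for the time-study values; the function name `k_sd_intervals` and the dictionary of Empirical Rule percentages are illustrative assumptions.

```python
# Approximate percentages stated by the Empirical Rule for k = 1, 2, 3.
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def k_sd_intervals(mean, sd, ks=(1, 2, 3)):
    """Return {k: (mean - k*sd, mean + k*sd)}, rounded to one decimal place."""
    return {k: (round(mean - k * sd, 1), round(mean + k * sd, 1)) for k in ks}


intervals = k_sd_intervals(12.8, 1.7)
for k, (low, high) in intervals.items():
    print(f"k={k}: {low} to {high}, about {EMPIRICAL_RULE[k]}% of measurements")
# k=1: 11.1 to 14.5, about 68.0% of measurements
# k=2: 9.4 to 16.2, about 95.0% of measurements
# k=3: 7.7 to 17.9, about 99.7% of measurements
```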

Student teachers are trained to develop lesson plans, on the assumption that the
written plan will help them to perform successfully in the classroom. In a study to
assess the relationship between written lesson plans and their implementation in
the classroom, 25 lesson plans were scored on a scale of 0 to 34 according to a
Lesson Plan Assessment Checklist. The 25 scores are shown in Table 2.5. Use
Tchebysheff’s Theorem and the Empirical Rule (if applicable) to describe the
distribution of these assessment scores.
26.1 26.0 14.5 29.3 19.7
22.1 21.2 26.6 31.9 25.0
15.9 20.8 20.2 17.8 13.3
25.6 26.5 15.7 22.1 13.8
29.0 21.3 23.5 22.1 10.2
Table 2.5

Solution
Use your calculator or the computing formulas to verify that x̄ = 21.6 and
s = 5.5. The appropriate intervals are calculated and listed in Table 2.6. We have
also referred back to the original 25 measurements and counted the actual number
of measurements that fall into each of these intervals. These frequencies and
relative frequencies are shown in Table 2.6.

Table 2.6: Intervals x̄ ± ks for the Data of Table 2.5
k    Interval x̄ ± ks    Number in Interval    Relative Frequency
1    16.1–27.1           16                    0.64
2    10.6–32.6           24                    0.96
3    5.1–38.1            25                    1.00

Is Tchebysheff’s Theorem applicable? Yes, because it can be used for any set of
data. According to Tchebysheff’s Theorem,
at least 3/4 of the measurements will fall between 10.6 and 32.6.
at least 8/9 of the measurements will fall between 5.1 and 38.1.
You can see in Table 2.6 that Tchebysheff’s Theorem holds for these data. In
fact, the proportions of measurements that fall into the specified intervals exceed
the lower bound given by this theorem.

Is the Empirical Rule applicable?


You can check for yourself by drawing a graph—either a stem and leaf plot or a
histogram. The relative frequency histogram in Figure 2.10 shows that the
distribution is relatively mound-shaped, so the Empirical Rule should work
relatively well. That is,
approximately 68% of the measurements will fall between 16.1 and 27.1.
approximately 95% of the measurements will fall between 10.6 and 32.6.
approximately 99.7% of the measurements will fall between 5.1 and 38.1.
The relative frequencies in Table 2.6 closely approximate those specified by the
Empirical Rule.
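
The counts in Table 2.6 can be reproduced directly from the 25 scores in Table 2.5. The sketch below rounds x̄ and s to one decimal place, as the solution does, and then counts the scores inside each interval; the variable names are illustrative assumptions.

```python
import statistics

# The 25 lesson plan scores from Table 2.5.
scores = [
    26.1, 26.0, 14.5, 29.3, 19.7,
    22.1, 21.2, 26.6, 31.9, 25.0,
    15.9, 20.8, 20.2, 17.8, 13.3,
    25.6, 26.5, 15.7, 22.1, 13.8,
    29.0, 21.3, 23.5, 22.1, 10.2,
]

x_bar = round(statistics.mean(scores), 1)   # 21.6
s = round(statistics.stdev(scores), 1)      # 5.5 (n - 1 in the denominator)

rows = []
for k in (1, 2, 3):
    low, high = x_bar - k * s, x_bar + k * s
    count = sum(low <= x <= high for x in scores)   # scores inside the interval
    rows.append((k, count, count / len(scores)))

for k, count, rel_freq in rows:
    print(k, count, rel_freq)
# 1 16 0.64
# 2 24 0.96
# 3 25 1.0
```

The observed relative frequencies 0.96 and 1.00 exceed the Tchebysheff lower bounds of 3/4 and 8/9, and all three are close to the Empirical Rule values.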

Figure 2.10

TCHEBYSHEFF’S THEOREM AND THE EMPIRICAL RULE

Tchebysheff’s Theorem can be proven mathematically. It applies to any set of
measurements, whether sample or population, large or small, mound-shaped or
skewed. Tchebysheff’s Theorem gives a lower bound to the fraction of
measurements to be found in an interval constructed as x̄ ± ks. At least
$1 - \frac{1}{k^2}$ of the measurements will fall into this interval, and probably more!
The Empirical Rule is a “rule of thumb” that can be used as a descriptive
tool only when the data tend to be roughly mound-shaped (the data tend to
pile up near the center of the distribution).
When you use these two tools for describing a set of measurements,
Tchebysheff’s Theorem will always be satisfied, but it is a very conservative
estimate of the fraction of measurements that fall into a particular interval. If
it is appropriate to use the Empirical Rule (mound-shaped data), this rule will
give you a more accurate estimate of the fraction of measurements that fall
into the interval.

Measures of Relative Standing

Sometimes you need to know the position of one observation relative to
others in a data set.
For example, if you took an examination with a total of 35 points, you might
want to know how your score of 30 compared to the scores of the other
students in the class.
The mean and standard deviation of the scores can be used to calculate a
z-score, which measures the relative standing of a measurement in a data set.
Definition: The sample z-score is a measure of relative standing defined by
\[ z\text{-score} = \frac{x - \bar{x}}{s} \]
Positive z-score ⇔ x is above the mean.
Negative z-score ⇔ x is below the mean.

A z-score measures the distance between an observation and the mean,
measured in units of standard deviation.
For example, suppose that the mean and standard deviation of the test scores
(based on a total of 35 points) are 25 and 4, respectively.
The z-score for your score of 30 is calculated as follows:
\[ z\text{-score} = \frac{x - \bar{x}}{s} = \frac{30 - 25}{4} = 1.25 \]
Your score of 30 lies 1.25 standard deviations above the mean
(30 = x̄ + 1.25s).
The z-score is a valuable tool for determining whether a particular observation
is likely to occur frequently or is unlikely and might be considered an outlier.
According to Tchebysheff’s Theorem and the Empirical Rule,
at least 75% and more likely 95% of the observations lie within two standard
deviations of their mean: their z-scores are between −2 and 2. Observations
with z-scores exceeding 2 in absolute value happen about 5% of the time for
mound-shaped data and are considered somewhat unlikely.
at least 89% and more likely 99.7% of the observations lie within three
standard deviations of their mean: their z-scores are between −3 and 3.
Observations with z-scores exceeding 3 in absolute value happen less than 1%
of the time for mound-shaped data and are considered very unlikely.

Example

Consider this sample of n = 10 measurements:

1, 1, 0, 15, 2, 3, 4, 0, 1, 3

The measurement x = 15 appears to be unusually large. Calculate the z-score for
this observation and state your conclusions.
Solution: Calculate x̄ = 3.0 and s = 4.42 for the n = 10 measurements. Then the
z-score for the suspected outlier, x = 15, is calculated as
\[ z\text{-score} = \frac{x - \bar{x}}{s} = \frac{15 - 3}{4.42} = 2.71 \]
Hence, the measurement x = 15 lies 2.71 standard deviations above the sample
mean, x̄ = 3.0. Although the z-score does not exceed 3, it is close enough so that
you might suspect that x = 15 is an outlier. You should examine the sampling
procedure to see whether x = 15 is a faulty observation.
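
Both z-score calculations in this section can be reproduced in a few lines. This is a sketch under the definition z = (x − x̄)/s; the function name `z_score` is an illustrative assumption.

```python
import statistics


def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd


# Examination example: mean 25, standard deviation 4, score 30.
print(z_score(30, 25, 4))                       # 1.25

# Suspected outlier example with n = 10 measurements.
sample = [1, 1, 0, 15, 2, 3, 4, 0, 1, 3]
x_bar = statistics.mean(sample)                 # 3
s = statistics.stdev(sample)                    # about 4.42 (n - 1 denominator)
z = z_score(15, x_bar, s)

print(round(s, 2), round(z, 2))                 # 4.42 2.71
```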

Percentile

Definition: A set of n measurements on the variable x has been arranged in order
of magnitude. The pth percentile is the value of x that is greater than p% of the
measurements and is less than the remaining (100 − p)%.
Example:
Suppose you have been notified that your score of 610 on the Verbal Graduate
Record Examination placed you at the 60th percentile in the distribution of scores.
Where does your score of 610 stand in relation to the scores of others who took
the examination?
Solution Scoring at the 60th percentile means that 60% of all the examination
scores were lower than your score and 40% were higher.
The 60th percentile for any data distribution is the point where 60% of the
measurements are less and 40% are greater (see Figure 2.12). The median is the
50th percentile, with 50% of the measurements smaller and 50% larger.

Figure 2.12

The 25th and 75th percentiles, known as the lower and upper quartiles, along
with the median (the 50th percentile), divide the data into four equal sets.
The lower quartile has 25% of the measurements below it, the median has
50%, and the upper quartile has 75%.
This partitions the area under the relative frequency histogram into four
equal parts (see Figure 2.13).

Figure 2.13

Definition
A set of n measurements on the variable x has been arranged in order of
magnitude.
The lower quartile (first quartile), Q1 , is the value of x that is greater than
one-fourth of the measurements and is less than the remaining three-fourths.
The second quartile is the median.
The upper quartile (third quartile), Q3 , is the value of x that is greater than
three-fourths of the measurements and is less than the remaining one-fourth.

For small data sets, it is often impossible to divide the set into four groups,
each of which contains exactly 25% of the measurements.
For example, when n = 10, you would need to have 2.5 measurements in
each group! Even when you can perform this task (for example, if n = 12),
many numbers would satisfy the preceding definition, and could therefore be
considered “quartiles.”
To avoid this ambiguity, we use the following rule to locate sample quartiles.
Calculating Sample Quartiles
When the measurements are arranged in order of magnitude, the lower
quartile, Q1, is the value of x in position 0.25(n + 1), and the upper quartile,
Q3, is the value of x in position 0.75(n + 1).
When 0.25(n + 1) and 0.75(n + 1) are not integers, the quartiles are found
by interpolation, using the values in the two adjacent positions.


Example
Find the lower and upper quartiles for this set of measurements:
16, 25, 4, 18, 11, 13, 20, 8, 11, 9
Solution: Rank the n = 10 measurements from smallest to largest:
4, 8, 9, 11, 11, 13, 16, 18, 20, 25
Calculate
Position of Q1 = 0.25(n + 1) = 0.25(10 + 1) = 2.75
Position of Q3 = 0.75(n + 1) = 0.75(10 + 1) = 8.25
Since these positions are not integers, the lower quartile is taken to be the value
3/4 of the distance between the second and third ordered measurements, and the
upper quartile is taken to be the value 1/4 of the distance between the eighth and
ninth ordered measurements. Therefore,
Q1 = 8 + 0.75(9 − 8) = 8 + 0.75 = 8.75
Q3 = 18 + 0.25(20 − 18) = 18 + 0.5 = 18.5
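
The same calculation can be sketched in Python. This is a minimal sketch of the position rule 0.25(n + 1) and 0.75(n + 1) with linear interpolation; the function names value_at_position and sample_quartiles are illustrative choices, and the code assumes at least three measurements so that both positions fall inside the ordered list.

```python
import math

def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional, position in an ordered list."""
    lower = math.floor(pos)      # position just below the calculated one
    frac = pos - lower           # fraction of the way to the next position
    if frac == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

def sample_quartiles(data):
    """Return (Q1, Q3) using positions 0.25(n + 1) and 0.75(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3

data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
print(sample_quartiles(data))  # (8.75, 18.5)
```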
The median and the quartiles divide the data distribution into four parts, each
containing approximately 25% of the measurements. Q1 and Q3 are the
boundaries for the middle 50% of the distribution. The range of this middle 50%
is measured by the interquartile range (IQR).

Interquartile range (IQR)


Definition: The interquartile range (IQR) for a set of measurements is the
difference between the upper and lower quartiles; that is,
IQR = Q3 − Q1
For the data in the example above, IQR = Q3 − Q1 = 18.50 − 8.75 = 9.75.
We will use the IQR along with the quartiles and the median in the next section to
construct another graph for describing data sets.
How to Calculate Sample Quartiles
1 Arrange the data set in order of magnitude from smallest to largest.
2 Calculate the quartile positions:
Position of Q1 : 0.25(n + 1)
Position of Q3 : 0.75(n + 1)
3 If the positions are integers, then Q1 and Q3 are the values in the ordered
data set found in those positions.
4 If the positions in step 2 are not integers, find the two measurements in
positions just above and just below the calculated position. Calculate the
quartile by finding a value either one-fourth, one-half, or three-fourths of the
way between these two measurements.
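
As a cross-check on these steps, Python's standard library function statistics.quantiles, called with method="exclusive", should place the quartiles at the same 0.25(n + 1) and 0.75(n + 1) positions and interpolate between neighbours. The sketch below applies it to the ten measurements from the earlier example and also computes the IQR; treat the agreement with the rule above as something to verify on your own data rather than a guarantee.

```python
import statistics

data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]

# n=4 asks for the three cut points that split the data into quarters.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, median, q3, iqr)  # 8.75 12.0 18.5 9.75
```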
The five-Number Summary and the Box Plot


The five-number summary consists of the smallest number, the lower quartile, the
median, the upper quartile, and the largest number, presented in order from
smallest to largest:
Min Q1 Median Q3 Max
By definition, one-fourth of the measurements in the data set lie between each of
the four adjacent pairs of numbers.
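
A minimal Python sketch of the five-number summary follows. The function name five_number_summary is an illustrative choice, and the quartiles are taken from statistics.quantiles with method="exclusive", which is assumed here to match the 0.25(n + 1) and 0.75(n + 1) position rule. The data are the ten measurements used in the quartile example.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) for a list of measurements."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, median, q3, max(data)

data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
print(five_number_summary(data))  # (4, 8.75, 12.0, 18.5, 25)
```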
The five-number summary can be used to create a simple graph called a box
plot to visually describe the data distribution.
From the box plot, you can quickly detect any skewness in the shape of the
distribution and see whether there are any outliers in the data set.
Even when there are no recording or observational errors, a data set may
contain one or more valid measurements that, for one reason or another,
differ markedly from the others in the set.
These outliers can cause a marked distortion in commonly used numerical
measures such as x̄ and s.


In fact, outliers may themselves contain important information not shared
with the other measurements in the set.
Therefore, isolating outliers, if they are present, is an important step in any
preliminary analysis of a data set.
The box plot is designed expressly for this purpose.
How to Construct a Box Plot
Calculate the median, the upper and lower quartiles, and the IQR for the
data set.
Draw a horizontal line representing the scale of measurement. Form a box
just above the horizontal line with the left and right ends at Q1 and Q3.
Draw a vertical line through the box at the location of the median.


Figure: 2.15

Note

The z-score provides boundaries for identifying unusually large or small
measurements, specifically looking for z-scores greater than 2 or 3 in absolute
value.
The box plot uses the IQR to create imaginary "fences" that separate outliers
from the rest of the data set.


Detecting Outliers: Observations That Are Beyond
Lower fence: Q1 − 1.5 × IQR
Upper fence: Q3 + 1.5 × IQR
The upper and lower fences are shown with broken lines in Figure 2.15, but
they are not usually drawn on the box plot.
Any measurement beyond the upper or lower fence is an outlier; the rest of
the measurements, inside the fences, are not unusual.
Finally, the box plot marks the range of the data set using "whiskers" to
connect the smallest and largest measurements (excluding outliers) to the box.
To Finish the Box Plot
Mark any outliers with an asterisk (∗) on the graph.
Extend horizontal lines called "whiskers" from the ends of the box to the
smallest and largest observations that are not outliers.
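
The fence and whisker rules can be sketched in Python as follows. The function name box_plot_stats and the dictionary keys are illustrative, and the quartiles again come from statistics.quantiles with method="exclusive", assumed to match the 0.25(n + 1) and 0.75(n + 1) rule. The data are the ten ordered measurements from the quartile example, which turn out to contain no outliers.

```python
import statistics

def box_plot_stats(data):
    """Fences, outliers, and whisker ends for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Anything beyond a fence is an outlier; the rest sit inside the fences.
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
        "whisker_low": min(inside),    # smallest observation that is not an outlier
        "whisker_high": max(inside),   # largest observation that is not an outlier
    }

stats = box_plot_stats([4, 8, 9, 11, 11, 13, 16, 18, 20, 25])
print(stats["lower_fence"], stats["upper_fence"])   # -5.875 33.125
print(stats["outliers"])                            # []
print(stats["whisker_low"], stats["whisker_high"])  # 4 25
```

Because every measurement lies inside the fences, the whiskers here simply run to the minimum and maximum.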


Example

As American consumers become more careful about the foods they eat, food
processors try to stay competitive by avoiding excessive amounts of fat,
cholesterol, and sodium in the foods they sell. The following data are the amounts
of sodium per slice (in milligrams) for each of the eight brands of regular
American cheese. Construct a boxplot for the data and look for outliers.
340, 300, 520, 340, 320, 290, 260, 330


Solution: The n = 8 measurements are first ranked from smallest to largest:

260, 290, 300, 320, 330, 340, 340, 520

The positions of the median, Q1, and Q3 are:

Median position: (n + 1)/2 = 9/2 = 4.5, so m = (320 + 330)/2 = 325
Lower quartile (Q1) position: 0.25 × (n + 1) = 0.25 × 9 = 2.25
Upper quartile (Q3) position: 0.75 × (n + 1) = 0.75 × 9 = 6.75
Thus, Q1 = 290 + 0.25(300 − 290) = 292.5 and Q3 = 340 + 0.75(340 − 340) = 340.
The interquartile range (IQR) is calculated as:

IQR = Q3 − Q1 = 340 − 292.5 = 47.5

Calculate the upper and lower fences:

Lower fence: Q1 − 1.5 × IQR = 292.5 − 1.5 × 47.5 = 221.25

Upper fence: Q3 + 1.5 × IQR = 340 + 1.5 × 47.5 = 411.25



The value x = 520, a brand of cheese containing 520 milligrams of sodium, is
the only outlier, lying beyond the upper fence.
The box plot for the data is shown in Figure 2.16. The outlier is marked with
an asterisk (*).
Once the outlier is excluded, we find (from the ranked data set) that the
smallest and largest measurements are x = 260 and x = 340.
These are the two values that form the whiskers. Since the value x = 340 is
the same as Q3, there is no whisker on the right side of the box.
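
The numbers in this solution can be checked with a few lines of Python. This is only a verification sketch: the variable names are illustrative, and statistics.quantiles with method="exclusive" is assumed to follow the same 0.25(n + 1) and 0.75(n + 1) position rule used above.

```python
import statistics

# Sodium per slice (mg) for the eight brands of regular American cheese.
sodium = [340, 300, 520, 340, 320, 290, 260, 330]

q1, median, q3 = statistics.quantiles(sodium, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in sodium if x < lower_fence or x > upper_fence]
inside = [x for x in sodium if lower_fence <= x <= upper_fence]

print(q1, median, q3, iqr)                 # 292.5 325.0 340.0 47.5
print(lower_fence, upper_fence)            # 221.25 411.25
print(outliers, min(inside), max(inside))  # [520] 260 340
```

The largest value inside the fences equals Q3, which is why the box plot has no whisker on the right.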

Figure: 2.16


You can use the box plot to describe the shape of a data distribution by
examining the position of the median line relative to Q1 and Q3, the left and
right ends of the box. If the median is close to the middle of the box, the
distribution is fairly symmetric, with equal-sized intervals containing the two
middle quarters of the data. If the median is to the left of center, the distribution
is skewed to the right; if the median is to the right of center, the distribution is
skewed to the left. Additionally, for most skewed distributions, the whisker on the
skewed side of the box tends to be longer than the whisker on the other side.
Figure 2.17 shows two box plots: one for the sodium contents of eight brands of
cheese and another for five brands of fat-free cheese with the following sodium
contents:
300, 300, 320, 290, 180
Examine the long whisker on the left side of both box plots and the position of
the median lines. Both distributions are skewed to the left, indicating a few
unusually small measurements. The regular cheese data also show one brand
(x = 520) with an unusually high sodium content. Generally, the sodium content
of fat-free brands appears lower than that of regular brands, but the variability of
sodium content for regular cheese (excluding the outlier) is less than for the
fat-free brands.
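
The comparison can be roughed out numerically. In the sketch below, the function name box_summary and the "median_side" label are illustrative; the median's position is compared with the midpoint of the box, and the IQR is used as the measure of spread, which is one reasonable reading of "variability" here rather than the only one. Quartiles come from statistics.quantiles with method="exclusive", assumed to match the position rule used in these notes.

```python
import statistics

def box_summary(data):
    """Quartiles, IQR, and where the median sits relative to the box center."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    center = (q1 + q3) / 2
    if median > center:
        side = "right of center"   # suggests skew to the left
    elif median < center:
        side = "left of center"    # suggests skew to the right
    else:
        side = "at center"         # suggests rough symmetry
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": q3 - q1, "median_side": side}

regular = [340, 300, 520, 340, 320, 290, 260, 330]
fat_free = [300, 300, 320, 290, 180]

reg, ff = box_summary(regular), box_summary(fat_free)
print(reg["median"], reg["iqr"], reg["median_side"])  # 325.0 47.5 right of center
print(ff["median"], ff["iqr"], ff["median_side"])     # 300.0 75.0 right of center
```

Both medians sit to the right of the box center, consistent with skewness to the left, and the box for regular cheese is narrower than the box for the fat-free brands.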


Figure: 2.17
