Mas 101 3
Maseno University
September 2, 2024
1 General objective
2 Introduction
3 Measures of Variability
General objective
Graphs are extremely useful for the visual description of a data set. However, they
are not always the best tool when you want to make inferences about a population
from the information contained in a sample. For this purpose, it is better to use
numerical measures to construct a mental picture of the data.
Introduction
Graphs are useful for illustrating data distributions, but they have limitations.
For instance, if your data projector fails or you need to describe your data over
the phone, you must find another way to convey the information effectively.
Another limitation of graphs is their imprecision for statistical inference. For
instance, comparing sample and population histograms can be challenging.
While identical histograms can be easily described as ’the same,’ differences
are harder to quantify concretely.
To address these problems, use numerical measures, which can be calculated
for a sample or population. These numbers provide a clear picture of the
frequency distribution. When related to the population, they are called
parameters; when from a sample, they are called statistics.
Measures of Center
We introduced dotplots, stem and leaf plots, and histograms to describe the
distribution of a set of measurements on a quantitative variable x.
The horizontal axis displays the values of x, and the data are “distributed”
along this horizontal line. One of the first important numerical measures is a
measure of center—a measure along the horizontal axis that locates the
center of the distribution.
The arithmetic average of a set of measurements is a very common and
useful measure of center. This measure is often referred to as the arithmetic
mean, or simply the mean, of a set of measurements.
To distinguish between the mean for the sample and the mean for the
population, we will use the symbol x̄ (x-bar) for a sample mean and the
symbol µ (Greek lowercase mu) for the mean of a population.
Definition: The arithmetic mean or average of a set of n measurements is
equal to the sum of the measurements divided by n.
Suppose there are n measurements on the variable x—call them
$x_1, x_2, \ldots, x_n$. To add the n measurements together, we use this shorthand
notation:
$\sum_{i=1}^{n} x_i$, which means $x_1 + x_2 + x_3 + \cdots + x_n$.
The Greek capital sigma (Σ) tells you to add the items that appear to its
right, beginning with the number below the sigma (i = 1) and ending with
the number above (i = n).
However, since the typical sums in statistical calculations are almost always
made on the total set of n measurements, you can use a simpler notation:
$\sum x_i$ means “the sum of all the x measurements.”
Using this notation, we write the formula for the sample mean:
NOTATION
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sam & Seth Maseno University September 2, 2024 8 / 63
EXAMPLE
Draw a dotplot for the n = 5 measurements 2, 9, 11, 5, 6. Find the sample mean
and compare its value with what you might consider the “center” of these
observations on the dotplot.
Solution
The dotplot in Figure 2.2 seems to be centered between 6 and 8. To find the
sample mean, calculate
Sample mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$
Figure 2.2
The statistic x̄ = 6.6 is the balancing point or fulcrum shown on the dot plot. It
does seem to mark the center of the data.
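The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula for x̄; the variable names are chosen here and are not part of the lecture.

```python
# Sample mean of the n = 5 measurements from the example.
measurements = [2, 9, 11, 5, 6]

n = len(measurements)          # number of measurements
total = sum(measurements)      # sum of all x_i
sample_mean = total / n        # x-bar = (sum of x_i) / n

print(f"n = {n}, sum = {total}, mean = {sample_mean}")
```

The sum is 33, so the mean is 33/5 = 6.6, matching the hand calculation.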
Median
2, 5, 6 ← middle observation, 9, 11
The middle observation, marked with an arrow, is in the center of the set, or
m = 6.
Example 2.3: Find the median for the set of measurements 2, 9, 11, 5, 6, 27.
Solution: Rank the measurements from smallest to largest:
2, 5, [6, 9], 11, 27
Now there are two “middle” observations, shown in brackets. To find the median,
choose a value halfway between the two middle observations:
$\text{Median} = \frac{6 + 9}{2} = 7.5$
The value 0.5(n + 1) indicates the position of the median in the ordered data
set. If the position of the median is a number that ends in .5, you need
to average the two adjacent values.
For the n = 5 ordered measurements from Example 2.2, the position of the
median is:
0.5(n + 1) = 0.5(5 + 1) = 3
and the median is the 3rd ordered observation, or m = 6. For the n = 6 ordered
measurements from Example 2.3, the position of the median is:
0.5(n + 1) = 0.5(6 + 1) = 3.5
and the median is the average of the 3rd and 4th ordered observations, or
$m = \frac{6 + 9}{2} = 7.5$.
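The position rule can be written as a small function. The sketch below is one possible way to code it, assuming a non-empty list of numbers; the function name `median_by_position` is made up for this illustration.

```python
def median_by_position(values):
    """Median using the 0.5(n + 1) position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = 0.5 * (n + 1)      # 1-based position in the ordered list
    lower = int(position)         # whole-number part of the position
    if position == lower:
        # Position is an integer: the median is that ordered observation.
        return ordered[lower - 1]
    # Position ends in .5: average the two adjacent ordered observations.
    return (ordered[lower - 1] + ordered[lower]) / 2


m_five = median_by_position([2, 9, 11, 5, 6])       # n = 5, position 3
m_six = median_by_position([2, 9, 11, 5, 6, 27])    # n = 6, position 3.5
print(m_five, m_six)
```

For the two data sets above this gives 6 and 7.5, the same values found by hand.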
Although both the mean and the median are good measures of the center of a
distribution, the median is less sensitive to extreme values or outliers. For example,
the value x = 27 in Example 2.3 is much larger than the other five measurements.
The median, m = 7.5, is not affected by the outlier, whereas the sample average,
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{60}{6} = 10$
is affected; its value is not representative of the remaining five observations. When
a data set has extremely small or extremely large observations, the sample mean is
drawn toward the direction of the extreme measurements (see Figure 2.3).
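To see the effect of the outlier numerically, the sketch below compares the mean and median with and without x = 27. It uses Python's standard `statistics` module; the dictionary `summary` is just a convenience introduced for this illustration.

```python
from statistics import mean, median

without_outlier = [2, 9, 11, 5, 6]        # the n = 5 measurements
with_outlier = without_outlier + [27]     # Example 2.3 adds x = 27

summary = {}
for label, data in [("without 27", without_outlier), ("with 27", with_outlier)]:
    # Store (mean, median) for each version of the data set.
    summary[label] = (mean(data), median(data))
    print(f"{label}: mean = {summary[label][0]}, median = {summary[label][1]}")
```

Adding the single large value moves the mean from 6.6 to 10, while the median only moves from 6 to 7.5.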
Figure 2.3
Mode
Another way to locate the center of a distribution is to look for the value of x that
occurs with the highest frequency. This measure of the center is called the mode.
Definition: The mode is the category that occurs most frequently or the
most frequently occurring value of x.
When measurements on a continuous variable have been grouped as a
frequency or relative frequency histogram, the class with the highest peak or
frequency is called the modal class, and the midpoint of that class is taken
to be the mode.
The mode is generally used to describe large data sets, whereas the mean and
median are used for both large and small data sets. In Example 1.11, the mode of
weekly visits to Starbucks for 25 customers is 5, as shown in Table 2.1(a).
For the raw birth weight data in Table 2.1(b), the mode is 7.7, occurring four
times. For the grouped data, the histogram shows that the class with the highest
peak is from 7.6 to 8.1, so the mode is taken to be the class midpoint, 7.85.
See Figure 2.4(b).
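The modes of the two data sets in Table 2.1 can be checked by counting. The sketch below assumes the Starbucks values in the table are single digits (so the first row reads 6, 7, 1, 5, 6); the helper name `modes` is introduced here for illustration.

```python
from collections import Counter

# Table 2.1(a): weekly visits, read as single-digit values (an assumption).
starbucks = [6, 7, 1, 5, 6, 4, 6, 4, 6, 8, 6, 5, 6, 3, 4,
             5, 5, 5, 7, 6, 3, 5, 7, 5, 5]
# Table 2.1(b): birth weights.
birth_weights = [7.2, 7.8, 6.8, 6.2, 8.2, 8.0, 8.2, 5.6, 8.6, 7.1,
                 8.2, 7.7, 7.5, 7.2, 7.7, 5.8, 6.8, 6.8, 8.5, 7.5,
                 6.1, 7.9, 9.4, 9.0, 7.8, 8.5, 9.0, 7.7, 6.7, 7.7]


def modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


print(modes(starbucks))
print(modes(birth_weights))
```

The function returns a list because, as discussed next, a data set can have more than one mode.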
A distribution of measurements can have more than one mode. These modes
would appear as “local peaks” in the relative frequency distribution.
For example, if we were to tabulate the length of fish taken from a lake
during one season, we might get a bimodal distribution, possibly reflecting a
mixture of young and old fish in the population.
Sometimes bimodal distributions of sizes or weights reflect a mixture of
measurements taken on males and females. In any case, a set or distribution
of measurements may have more than one mode.
Table 2.1
(a) Starbucks data    (b) Birth weight data
6 7 1 5 6             7.2 7.8 6.8 6.2 8.2
4 6 4 6 8             8.0 8.2 5.6 8.6 7.1
6 5 6 3 4             8.2 7.7 7.5 7.2 7.7
5 5 5 7 6             5.8 6.8 6.8 8.5 7.5
3 5 7 5 5             6.1 7.9 9.4 9.0 7.8
                      8.5 9.0 7.7 6.7 7.7
Figure 2.4
Measures of Variability
Data sets may have the same center but look different because of the way the
numbers spread out from the center. Consider the two distributions shown in
Figure 2.5. Both distributions are centered at x = 4, but there is a big difference
in how the measurements spread out, or vary. The measurements in Figure 2.5(a)
vary from 3 to 5; in Figure 2.5(b), the measurements vary from 0 to 8.
Figure 2.5
The range, R, of a set of n measurements is the difference between the largest
and smallest measurements.
For example, the measurements 5, 7, 1, 2, 4 vary from 1 to 7. Hence, the range is
7 − 1 = 6. The range is easy to calculate, easy to interpret, and is an adequate
measure of variation for small sets of data. But, for large data sets, the range is
not an adequate measure of variability. For example, the two relative frequency
distributions in Figure 2.6 have the same range but very different shapes and
variability.
Figure 2.6
The horizontal distances between each measurement and the mean x̄ indicate
variability. Large distances mean more variability, while small distances mean less.
The deviation of a measurement xi from the mean is (xi − x̄). Positive deviations
occur for measurements to the right of the mean, and negative deviations for
those to the left. The values of x and their deviations are shown in Table 2.2.
Table 2.2
x_i     (x_i − x̄)    (x_i − x̄)²
5        1.2          1.44
7        3.2         10.24
1       −2.8          7.84
2       −1.8          3.24
4        0.2          0.04
Sum 19   0.0         22.80
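The columns of Table 2.2 can be reproduced with a short script. This is a sketch for checking the arithmetic; the variable names are chosen here for illustration.

```python
data = [5, 7, 1, 2, 4]
x_bar = sum(data) / len(data)               # sample mean, 19 / 5

deviations = [x - x_bar for x in data]      # (x_i - x_bar)
squared = [d ** 2 for d in deviations]      # (x_i - x_bar)^2

print(f"{'x_i':>4} {'dev':>6} {'dev^2':>7}")
for x, d, d2 in zip(data, deviations, squared):
    print(f"{x:>4} {d:>6.1f} {d2:>7.2f}")

# Column totals: the deviations cancel (zero up to floating-point rounding),
# which is why the squared deviations are used to measure spread.
total_dev = sum(deviations)
total_sq = sum(squared)
print(f"sum of squared deviations = {total_sq:.2f}")
```

Because positive and negative deviations always cancel, their plain sum says nothing about variability; the sum of squares, 22.80, does.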
Variance
The variance of a population of N measurements is the average of the
squares of the deviations of the measurements about their mean µ.
The population variance is denoted by $\sigma^2$ and is given by the formula:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Most often, you will not have all the population measurements available but
will need to calculate the variance of a sample of n measurements.
The variance of a sample of n measurements is the sum of the squared
deviations of the measurements about their mean x̄ divided by (n − 1). The
sample variance is denoted by $s^2$ and is given by the formula:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
NOTE: The variance and the standard deviation cannot be negative
numbers.
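The only difference between the two formulas is the divisor. The sketch below applies both to the five measurements 5, 7, 1, 2, 4, treating them first as a whole population and then as a sample (purely for illustration), and cross-checks the results against Python's standard `statistics` module.

```python
import math
from statistics import pvariance, variance

data = [5, 7, 1, 2, 4]
n = len(data)
center = sum(data) / n                                # mean of the data
sum_sq_dev = sum((x - center) ** 2 for x in data)     # sum of squared deviations

population_variance = sum_sq_dev / n        # sigma^2: divide by N
sample_variance = sum_sq_dev / (n - 1)      # s^2: divide by n - 1

# Cross-check against the standard library implementations.
assert math.isclose(population_variance, pvariance(data))
assert math.isclose(sample_variance, variance(data))

print(f"sigma^2 = {population_variance:.2f}, s^2 = {sample_variance:.2f}")
```

Dividing by n − 1 rather than n always gives the larger value, and both results are non-negative because they are built from squares.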
For the set of n = 5 sample measurements presented in Table 2.2, the square
of the deviation of each measurement is recorded in the third column.
Adding, we obtain
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 22.80$
so the sample variance is $s^2 = \frac{22.80}{5 - 1} = 5.70$.
Standard Deviation
For a small set of measurements, calculating the variance is not too difficult.
For larger sets, use a scientific calculator with built-in statistical functions
to compute x̄ and s, or µ and σ.
The sample mean key is usually marked with x̄, the sample standard
deviation key with $s$, $s_x$, or $s_{x(n-1)}$, and the population standard deviation
key with $\sigma$, $\sigma_x$, or $\sigma_{xn}$.
Be sure to know which calculation each key performs. For hand calculations,
use the shortcut method for computing $s^2$.
The Computing Formula for calculating $s^2$
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
The symbols $(\sum x_i)^2$ and $\sum x_i^2$ in the computing formula are shortcut ways
to indicate the arithmetic operation you need to perform.
You know from the formula for the sample mean that $\sum x_i$ is the sum of all
the measurements.
To find $\sum x_i^2$, you square each individual measurement and then add them
together.
$\sum x_i^2$ = Sum of the squares of the individual measurements
$(\sum x_i)^2$ = Square of the sum of the individual measurements
The sample standard deviation, $s$, is the positive square root of $s^2$.
Example
Calculate the variance and standard deviation for the following five measurements:
5, 7, 1, 2, 4. Use the computing formula for $s^2$ and compare your results
with those obtained using the original definition of $s^2$.
Solution
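Let’s calculate the variance and standard deviation for the five measurements 5,
7, 1, 2, 4 using the computing formula for $s^2$:
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
First, calculate the necessary sums:
$\sum_{i=1}^{n} x_i = 5 + 7 + 1 + 2 + 4 = 19$
$\sum_{i=1}^{n} x_i^2 = 5^2 + 7^2 + 1^2 + 2^2 + 4^2 = 25 + 49 + 1 + 4 + 16 = 95$
Then substitute into the formula:
$s^2 = \frac{95 - \frac{19^2}{5}}{4} = \frac{95 - \frac{361}{5}}{4} = \frac{95 - 72.2}{4} = \frac{22.8}{4} = 5.7$
The variance ($s^2$) is 5.7. To find the standard deviation ($s$), take the square root
of the variance:
$s = \sqrt{5.7} \approx 2.39$
So, the variance is 5.7 and the standard deviation is approximately 2.39.
This agrees with the original definition of $s^2$: Table 2.2 gives
$\sum (x_i - \bar{x})^2 = 22.80$, and $22.80 / 4 = 5.7$.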
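The comparison asked for in the example can also be done in code. The sketch below computes $s^2$ both ways for the same five measurements; the variable names are illustrative only.

```python
import math

data = [5, 7, 1, 2, 4]
n = len(data)

sum_x = sum(data)                      # sum of the measurements: 19
sum_x_sq = sum(x * x for x in data)    # sum of the squared measurements: 95

# Computing (shortcut) formula.
s2_shortcut = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

# Original definition: squared deviations about the mean.
x_bar = sum_x / n
s2_definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

s = math.sqrt(s2_shortcut)             # sample standard deviation

print(f"shortcut:   s^2 = {s2_shortcut:.2f}")
print(f"definition: s^2 = {s2_definition:.2f}")
print(f"s = {s:.2f}")
```

Both routes give 5.70 (up to floating-point rounding), and the standard deviation is about 2.39. The shortcut form avoids computing each deviation, which is why it is convenient for hand calculation.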
Figure 2.8
Example
Empirical Rule
The Empirical Rule works well for mound-shaped data distributions, which are
common in nature. The closer your data distribution is to the mound shape, the
more accurate the rule will be (see Figure 2.9).
Figure 2.9
Student teachers are trained to develop lesson plans, on the assumption that the
written plan will help them to perform successfully in the classroom. In a study to
assess the relationship between written lesson plans and their implementation in
the classroom, 25 lesson plans were scored on a scale of 0 to 34 according to a
Lesson Plan Assessment Checklist. The 25 scores are shown in Table 2.5. Use
Tchebysheff’s Theorem and the Empirical Rule (if applicable) to describe the
distribution of these assessment scores.
26.1 26.0 14.5 29.3 19.7
22.1 21.2 26.6 31.9 25.0
15.9 20.8 20.2 17.8 13.3
25.6 26.5 15.7 22.1 13.8
29.0 21.3 23.5 22.1 10.2
Table 2.5
Solution
Use your calculator or the computing formulas to verify that x̄ = 21.6 and
s = 5.5. The appropriate intervals are calculated and listed in Table 2.6. We have
also referred back to the original 25 measurements and counted the actual number
of measurements that fall into each of these intervals. These frequencies and
relative frequencies are shown in Table 2.6.
Table 2.6: Intervals x̄ ± ks for the Data of Table 2.5
k   Interval x̄ ± ks   Frequency in Interval   Relative Frequency
1   16.1–27.1         16                      0.64
2   10.6–32.6         24                      0.96
3   5.1–38.1          25                      1.00
Is Tchebysheff’s Theorem applicable? Yes, because it can be used for any set of
data. According to Tchebysheff’s Theorem,
at least 3/4 of the measurements will fall between 10.6 and 32.6.
at least 8/9 of the measurements will fall between 5.1 and 38.1.
You can see in Table 2.6 that Tchebysheff’s Theorem is true for these data. In
fact, the proportions of measurements that fall into the specified intervals exceed
the lower bound given by this theorem.
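The counts in Table 2.6 can be verified with a short script. This sketch assumes the usual statements of the two results, which are not spelled out above: Tchebysheff's lower bound of 1 − 1/k² and the Empirical Rule's approximate proportions of 68%, 95%, and 99.7% for k = 1, 2, 3. Following the solution, x̄ and s are rounded to one decimal place before the intervals are formed.

```python
from statistics import mean, stdev

# Table 2.5: the 25 lesson plan scores.
scores = [26.1, 26.0, 14.5, 29.3, 19.7,
          22.1, 21.2, 26.6, 31.9, 25.0,
          15.9, 20.8, 20.2, 17.8, 13.3,
          25.6, 26.5, 15.7, 22.1, 13.8,
          29.0, 21.3, 23.5, 22.1, 10.2]

x_bar = round(mean(scores), 1)     # sample mean, rounded as in the solution
s = round(stdev(scores), 1)        # sample standard deviation (n - 1 divisor)

empirical = {1: 0.68, 2: 0.95, 3: 0.997}   # assumed Empirical Rule proportions
results = {}
for k in (1, 2, 3):
    low, high = x_bar - k * s, x_bar + k * s
    count = sum(low <= x <= high for x in scores)
    fraction = count / len(scores)
    bound = 1 - 1 / k ** 2         # assumed Tchebysheff lower bound
    results[k] = (count, fraction)
    print(f"k={k}: {low:.1f} to {high:.1f}, count={count}, "
          f"fraction={fraction:.2f}, Tchebysheff>={bound:.2f}, "
          f"Empirical~{empirical[k]}")
```

Each observed fraction is at least the Tchebysheff bound, and the fractions 0.64, 0.96, and 1.00 are also fairly close to the Empirical Rule values.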
Figure 2.10
Example
1, 1, 0, 15, 2, 3, 4, 0, 1, 3
Percentile
Figure 2.12
The 25th and 75th percentiles, known as the lower and upper quartiles, along
with the median (the 50th percentile), divide the data into four equal sets.
The lower quartile has 25% of the measurements below it, the median has
50%, and the upper quartile has 75%.
This partitions the area under the relative frequency histogram into four
equal parts (see Figure 2.13).
Measures of Relative Standing
Figure 2.13
Definition
A set of n measurements on the variable x has been arranged in order of
magnitude.
The lower quartile (first quartile), Q1, is the value of x that is greater than
one-fourth of the measurements and is less than the remaining three-fourths.
The second quartile is the median.
The upper quartile (third quartile), Q3, is the value of x that is greater than
three-fourths of the measurements and is less than the remaining one-fourth.
For small data sets, it is often impossible to divide the set into four groups,
each of which contains exactly 25% of the measurements.
For example, when n = 10, you would need to have 2.5 measurements in
each group! Even when you can perform this task (for example, if n = 12),
many numbers would satisfy the preceding definition, and could therefore be
considered “quartiles.”
To avoid this ambiguity, we use the following rule to locate sample quartiles.
Calculating Sample Quartiles
When the measurements are arranged in order of magnitude, the lower
quartile, Q1, is the value of x in position 0.25(n + 1), and the upper quartile,
Q3, is the value of x in position 0.75(n + 1).
When 0.25(n + 1) and 0.75(n + 1) are not integers, the quartiles are found
by interpolation, using the values in the two adjacent positions.
Example
Find the lower and upper quartiles for this set of measurements:
16, 25, 4, 18, 11, 13, 20, 8, 11, 9
Solution: Rank the n = 10 measurements from smallest to largest:
4, 8, 9, 11, 11, 13, 16, 18, 20, 25
Calculate
Position of Q1 = 0.25(n + 1) = 0.25(10 + 1) = 2.75
Position of Q3 = 0.75(n + 1) = 0.75(10 + 1) = 8.25
Since these positions are not integers, the lower quartile is taken to be the value
3/4 of the distance between the second and third ordered measurements, and the
upper quartile is taken to be the value 1/4 of the distance between the eighth and
ninth ordered measurements. Therefore,
Q1 = 8 + 0.75(9 − 8) = 8 + 0.75 = 8.75
Q3 = 18 + 0.25(20 − 18) = 18 + 0.5 = 18.5
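The position rule with interpolation can be sketched as a pair of small functions. The names `value_at_position` and `quartiles` are invented for this illustration, and the sketch assumes at least three measurements so that both positions fall inside the ordered list.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    lower = int(position)              # whole-number part of the position
    fraction = position - lower        # how far to move toward the next value
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q3) using positions 0.25(n + 1) and 0.75(n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3


data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
q1, q3 = quartiles(data)
print(f"Q1 = {q1}, Q3 = {q3}")
```

For the ten measurements in the example this gives Q1 = 8.75 and Q3 = 18.5, as found by hand. Other software may use slightly different interpolation rules, so quartiles from different tools do not always agree exactly.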
The median and the quartiles divide the data distribution into four parts, each
containing approximately 25% of the measurements. Q1 and Q3 are the
boundaries for the middle 50% of the distribution. The range of this middle 50%
is measured by the interquartile range, IQR = Q3 − Q1.
The five-number summary consists of the smallest number, the lower quartile, the
median, the upper quartile, and the largest number, presented in order from
smallest to largest:
Min Q1 Median Q3 Max
By definition, one-fourth of the measurements in the data set lie between each of
the four adjacent pairs of numbers.
The five-number summary can be used to create a simple graph called a box
plot to visually describe the data distribution.
From the box plot, you can quickly detect any skewness in the shape of the
distribution and see whether there are any outliers in the data set.
Even when there are no recording or observational errors, a data set may
contain one or more valid measurements that, for one reason or another,
differ markedly from the others in the set.
These outliers can cause a marked distortion in commonly used numerical
measures such as x̄ and s.
Figure 2.15
Note
Example
As American consumers become more careful about the foods they eat, food
processors try to stay competitive by avoiding excessive amounts of fat,
cholesterol, and sodium in the foods they sell. The following data are the amounts
of sodium per slice (in milligrams) for each of the eight brands of regular
American cheese. Construct a boxplot for the data and look for outliers.
340, 300, 520, 340, 320, 290, 260, 330
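A numerical sketch of the quantities behind the box plot is given below. It uses the position rule for quartiles from this section and, as an assumption not stated above, the common convention that points more than 1.5 × IQR beyond the quartiles are flagged as outliers. The helper name `value_at_position` is invented for this illustration.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


sodium = [340, 300, 520, 340, 320, 290, 260, 330]   # mg per slice
ordered = sorted(sodium)
n = len(ordered)

q1 = value_at_position(ordered, 0.25 * (n + 1))     # position 2.25
median = value_at_position(ordered, 0.50 * (n + 1)) # position 4.5
q3 = value_at_position(ordered, 0.75 * (n + 1))     # position 6.75
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print("five-number summary:", ordered[0], q1, median, q3, ordered[-1])
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Under this convention the only flagged value is 520, the brand with unusually high sodium content.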
Figure 2.16
You can use the box plot to describe the shape of a data distribution by
examining the position of the median line relative to Q1 and Q3, the left and
right ends of the box. If the median is close to the middle of the box, the
distribution is fairly symmetric, with equal-sized intervals containing the two
middle quarters of the data. If the median is to the left of center, the distribution
is skewed to the right; if the median is to the right of center, the distribution is
skewed to the left. Additionally, for most skewed distributions, the whisker on the
skewed side of the box tends to be longer than the whisker on the other side.
Figure 2.17 shows two box plots: one for the sodium contents of eight brands of
cheese and another for five brands of fat-free cheese with the following sodium
contents:
300, 300, 320, 290, 180
Examine the long whisker on the left side of both box plots and the position of
the median lines. Both distributions are skewed to the left, indicating a few
unusually small measurements. The regular cheese data also show one brand
(x = 520) with an unusually high sodium content. Generally, the sodium content
of fat-free brands appears lower than that of regular brands, but the variability of
sodium content for regular cheese (excluding the outlier) is less than for the
fat-free brands.
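The comparison of the two cheese data sets can be backed by numbers. The sketch below uses `statistics.quantiles` from the Python standard library, whose default "exclusive" method appears to use the same (n + 1) positions as the quartile rule in this section; the helper name `spread` is invented for this illustration.

```python
from statistics import quantiles

regular = [340, 300, 520, 340, 320, 290, 260, 330]
fat_free = [300, 300, 320, 290, 180]
regular_no_outlier = [x for x in regular if x != 520]   # drop the outlier


def spread(values):
    """Median, interquartile range, and range of a data set."""
    q1, med, q3 = quantiles(values, n=4)    # default method="exclusive"
    return {"median": med, "IQR": q3 - q1, "range": max(values) - min(values)}


for label, data in [("regular", regular),
                    ("regular, no outlier", regular_no_outlier),
                    ("fat-free", fat_free)]:
    print(label, spread(data))
```

With the outlier removed, both the IQR and the range of the regular brands are smaller than those of the fat-free brands, while the fat-free median is lower.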
Figure 2.17