Lecture Notes 4
In statistics, to describe the data set accurately, statisticians must know more than the measures of central
tendency. Although location is generally considered to be the most important single characteristic of a distribution,
the variability or dispersion of the values is also very important.
The mean or median alone gives an inadequate description of a variable's distribution, because two data sets with
the same center can be spread over very different ranges. A distribution is therefore best described by a measure
of central tendency together with a measure of dispersion.
For normally distributed data, the mean and standard deviation are the most appropriate summary; for data that are
not normally distributed, the median and interquartile range are more appropriate.
Measures of dispersion include:
- Range.
- Variance.
- Standard deviation.
- Coefficient of variation.
[Diagram: descriptive statistics for ungrouped data and for grouped data]
Range:
The range is the highest value minus the lowest value. The symbol R is used for the range.
Solution
The range is R = $100,000 - $15,000 = $85,000.
The variance is the average of the squares of the distance each value is from the mean.
The symbol for the population variance is σ². The formula is
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]
where
X = individual value
µ = population mean
N = population size
The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ .
The corresponding formula for the population standard deviation is
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
Example 4–2:
Solution
Step 1 Find the mean for the data.
\[ \mu = \frac{\sum X}{N} = \frac{10 + 60 + 50 + 30 + 40 + 20}{6} = \frac{210}{6} = 35 \]
A: X      B: X − µ      C: (X − µ)²
10        -25           625
60        +25           625
50        +15           225
30        -5            25
40        +5            25
20        -15           225
                        ∑ = 1750
Column A contains the raw data X. Column B contains the differences X − µ obtained in step 2. Column C
contains the squares of the differences obtained in step 3.
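The table can be checked numerically, and the population variance and standard deviation follow from the formulas for σ² and σ given earlier. The snippet below is a minimal sketch using only the six values from column A; the variable names are illustrative and not part of the notes.

```python
import math

# Raw data X from column A of the Example 4-2 table.
data = [10, 60, 50, 30, 40, 20]

N = len(data)
mu = sum(data) / N                              # population mean: 210 / 6
squared_diffs = [(x - mu) ** 2 for x in data]   # column C: (X - mu)^2
variance = sum(squared_diffs) / N               # sigma^2 = 1750 / 6
std_dev = math.sqrt(variance)                   # sigma

print(f"mean = {mu}, sum of squared differences = {sum(squared_diffs)}")
print(f"population variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```

Dividing the column C total of 1750 by N = 6 gives a variance of about 291.7, and its square root is about 17.1.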
The formula for the sample variance, denoted by s², is
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]
where
X = individual value
X̄ = sample mean
n = sample size
The symbol for the sample standard deviation is s .
The formula for the sample standard deviation is
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
where
X = individual value
X̄ = sample mean
n = sample size
PHM111s - Probability and Statistics
The shortcut formulas for computing the variance and standard deviation for data obtained from samples are as
follows:
Variance:
\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} \]
Standard deviation:
\[ s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}} \]
Example 4–3: Find the sample variance and standard deviation for the amount of European auto sales for a sample
of 6 years shown. The data are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
Solution
Step 1 Find the sum of the values.
∑ X = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6
Step 2 Square each value and find the sum.
\[ \sum X^2 = 11.2^2 + 11.9^2 + 12.0^2 + 12.8^2 + 13.4^2 + 14.3^2 = 958.94 \]
Step 3 Substitute in the formulas and solve.
\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} = \frac{6(958.94) - 75.6^2}{6(6 - 1)} = \frac{38.28}{30} = 1.276 \]
The variance is 1.28 rounded.
\[ s = \sqrt{1.28} = 1.13 \]
Hence, the sample standard deviation is 1.13.
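The shortcut formula is easy to verify in code. The following is a minimal sketch; the function name `sample_variance_shortcut` is an illustrative choice, not something defined in the notes.

```python
import math

def sample_variance_shortcut(values):
    """Sample variance via s^2 = [n * sum(X^2) - (sum X)^2] / [n (n - 1)]."""
    n = len(values)
    sum_x = sum(values)                   # sum of X
    sum_x2 = sum(x * x for x in values)   # sum of X^2
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# European auto sales (millions of dollars) from Example 4-3.
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]

s2 = sample_variance_shortcut(sales)
s = math.sqrt(s2)
print(f"s^2 = {s2:.3f}, s = {s:.2f}")
```

Up to floating-point rounding, this reproduces the hand calculation: a variance of 1.276 and a standard deviation of 1.13.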
Step 1 Make a table as shown, and find the midpoint of each class.
A          B              C              D        E
Class      Frequency f    Midpoint Xm    f·Xm     f·Xm²
Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
Step 4 Find the sums of columns B, D, and E. (The sum of column B is n. The sum of column D is ∑ f·Xm. The
sum of column E is ∑ f·Xm².)
Step 5 Substitute in the formula and solve to get the variance.
\[ s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n - 1)} \]
Step 6 Take the square root to get the standard deviation.
Step 1 Make a table as shown, and find the midpoint of each class.
A            B              C              D        E
Class        Frequency f    Midpoint Xm    f·Xm     f·Xm²
5.5–10.5     1              8
10.5–15.5    2              13
15.5–20.5    3              18
20.5–25.5    5              23
25.5–30.5    4              28
30.5–35.5    3              33
35.5–40.5    2              38
Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
(1)(8) = 8 (2)(13) = 26 . . . (2)(38) = 76
Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
(1)(8²) = 64   (2)(13²) = 338   . . .   (2)(38²) = 2888
Step 4 Find the sums of columns B, D, and E. The sum of column B is n, the sum of column D is ∑ f·Xm, and
the sum of column E is ∑ f·Xm². The completed table is shown.
A            B              C              D        E
Class        Frequency f    Midpoint Xm    f·Xm     f·Xm²
5.5–10.5     1              8              8        64
10.5–15.5    2              13             26       338
15.5–20.5    3              18             54       972
20.5–25.5    5              23             115      2,645
25.5–30.5    4              28             112      3,136
30.5–35.5    3              33             99       3,267
35.5–40.5    2              38             76       2,888
n = 20   ∑ f·Xm = 490   ∑ f·Xm² = 13,310
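Steps 5 and 6 of the procedure (substituting the column totals into the grouped-data variance formula, then taking the square root) can be sketched as follows. This is a minimal illustration that rebuilds columns C, D, and E from the class boundaries and frequencies in the table; the variable names are assumptions made for the example.

```python
import math

# ((lower boundary, upper boundary), frequency) for each class in the table.
classes = [
    ((5.5, 10.5), 1),
    ((10.5, 15.5), 2),
    ((15.5, 20.5), 3),
    ((20.5, 25.5), 5),
    ((25.5, 30.5), 4),
    ((30.5, 35.5), 3),
    ((35.5, 40.5), 2),
]

n = sum(f for _, f in classes)        # sum of column B
sum_fx = 0.0                          # sum of column D
sum_fx2 = 0.0                         # sum of column E
for (lower, upper), f in classes:
    midpoint = (lower + upper) / 2    # column C
    sum_fx += f * midpoint
    sum_fx2 += f * midpoint ** 2

# Step 5: s^2 = [n * sum(f * Xm^2) - (sum(f * Xm))^2] / [n (n - 1)]
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
# Step 6: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"n = {n}, sum f*Xm = {sum_fx}, sum f*Xm^2 = {sum_fx2}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```

With n = 20, ∑ f·Xm = 490, and ∑ f·Xm² = 13,310, the variance is 26,100/380, about 68.7, and the standard deviation is about 8.3.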
Coefficient of Variation
Whenever two samples have the same units of measure, the variance and standard deviation for each can be
compared directly. A statistic that allows you to compare standard deviations when the units are different
is called the coefficient of variation.
The coefficient of variation, denoted by CVar, is the standard deviation divided by the mean. The result is
expressed as a percentage.
Example 4–5 The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5.
The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations
of the two.
Solution
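Applying the definition (standard deviation divided by the mean, expressed as a percentage) to the two sets of figures gives the comparison. The snippet below is a minimal sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)             # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)   # commissions in dollars

print(f"CVar (sales)       = {cv_sales:.1f}%")
print(f"CVar (commissions) = {cv_commissions:.1f}%")
```

The coefficient of variation is about 5.7% for sales and about 14.8% for commissions, so the commissions are more variable than the sales.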
Standard Scores
A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result
by the standard deviation. The symbol for a standard score is z. The formula is
\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]
For samples, the formula is
\[ z = \frac{X - \bar{X}}{s} \]
For populations, the formula is
\[ z = \frac{X - \mu}{\sigma} \]
The z score represents the number of standard deviations that a data value falls above or below the mean.
Example 4–6: A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she
scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative
positions on the two tests.
Solution
First, find the z scores. For calculus the z score is
\[ z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{10} = 1.5 \]
For history the z score is
\[ z = \frac{30 - 25}{5} = 1.0 \]
Since the z score for calculus is larger, her relative position in the calculus class is higher than her relative position
in the history class.
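The same comparison can be written as a short function. This is a minimal sketch; `z_score` is an illustrative name.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)  # 1.5 1.0
```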
When all data for a variable are transformed into z scores, the resulting distribution will have a mean of 0 and a
standard deviation of 1. A z score, then, is actually the number of standard deviations each value is from the mean
for a specific distribution.
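The claim that standardized data have mean 0 and standard deviation 1 can be checked numerically. The sketch below reuses the six values from Example 4–2 purely as an illustration and treats them as a population, so it uses `statistics.pstdev`; with sample data, `statistics.stdev` would play the same role.

```python
import statistics

data = [10, 60, 50, 30, 40, 20]

mu = statistics.fmean(data)        # population mean
sigma = statistics.pstdev(data)    # population standard deviation

# Transform every value into its z score.
z_scores = [(x - mu) / sigma for x in data]

mean_z = statistics.fmean(z_scores)
sd_z = statistics.pstdev(z_scores)
print(f"mean of z scores = {mean_z:.6f}, standard deviation of z scores = {sd_z:.6f}")
```

Up to floating-point rounding, the mean of the z scores is 0 and their standard deviation is 1.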
Percentiles
Percentiles divide the data set into 100 equal groups.
Percentile Formula
The percentile corresponding to a given value X is computed by using the following formula:
\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\% \]
and the order (position) c of the value corresponding to a given percentile is
\[ c = \frac{n \cdot p}{100} \]
where
n = total number of values
p = percentile
Example 4–7: A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile rank
of a score of 12. Also find the value corresponding to the 25th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Solution:
Arrange the data in order from lowest to highest.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Then substitute into the formula.
\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\% \]
Since there are six values below a score of 12, the solution is
\[ \text{Percentile} = \frac{6 + 0.5}{10} \cdot 100\% = 65\text{th percentile} \]
Thus, a student whose score was 12 did better than 65% of the class.
For the 25th percentile,
\[ c = \frac{(10)(25)}{100} = 2.5 \Rightarrow c = 3 \]
(Note: If c is not a whole number, round it up to the next whole number, as in this example.)
Hence, the third value counting up from the lowest, 5, corresponds to the 25th percentile.
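The percentile formula translates directly into code. The following is a minimal sketch; the function name `percentile_rank` is illustrative.

```python
def percentile_rank(values, x):
    """Percentile of x: ((number of values below x) + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
rank = percentile_rank(scores, 12)
print(rank)  # 65.0
```

The data do not need to be sorted first, because the formula only counts how many values fall below X.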
Example 4–8: Using the data set in the previous Example, find the value that corresponds to the 60th percentile.
Solution
Arrange the data in order from smallest to largest.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Substitute in the formula.
\[ c = \frac{n \cdot p}{100} = \frac{(10)(60)}{100} = 6 \]
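If c is a whole number, use the value halfway between the cth and (c + 1)th values when counting up from the lowest
value. In this case, these are the 6th and 7th values, 10 and 12:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
The value halfway between them, (10 + 12)/2 = 11, corresponds to the 60th percentile.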
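The two rules for locating the value at a given percentile (round c up when it is not a whole number; take the value halfway between the cth and (c + 1)th values when it is) can be sketched as below. The function name `value_at_percentile` is illustrative, and the sketch assumes p is a whole number between 1 and 99.

```python
def value_at_percentile(values, p):
    """Value at the p-th percentile using c = n * p / 100 (p a whole number, 1 to 99)."""
    ordered = sorted(values)
    n = len(ordered)
    c, remainder = divmod(n * p, 100)
    if remainder:
        # c is not a whole number: round up to position c + 1,
        # which is index c in a 0-based list.
        return ordered[c]
    # c is a whole number: halfway between the c-th and (c + 1)-th values.
    return (ordered[c - 1] + ordered[c]) / 2

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
p25 = value_at_percentile(scores, 25)
p60 = value_at_percentile(scores, 60)
print(p25, p60)  # 5 11.0
```

Other percentile conventions exist (statistical software often interpolates between values), so results from library functions may differ slightly from this rule.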
Quartiles divide the distribution into four groups, separated by Q1, Q2, and Q3.
Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile,
or the median; and Q3 corresponds to the 75th percentile.
Example 4–9: Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
Solution
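One way to compute the quartiles is sketched below. It assumes the median-of-halves rule (Q2 is the median of the ordered data, Q1 is the median of the values below Q2, and Q3 is the median of the values above Q2), which is consistent with the values Q1 = 9 and Q3 = 20 quoted for this data set in the outlier example later in these notes. The function name `quartiles` is illustrative.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of values, the median itself is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, q2, q3 = quartiles([15, 13, 6, 5, 12, 50, 22, 18])
print(q1, q2, q3)  # 9.0 14.0 20.0
```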
In addition to dividing the data set into four groups, quartiles can be used as a rough measurement of variability.
The interquartile range (IQR) is defined as the difference between Q1 and Q3 and is the range of the middle 50%
of the data.
IQR = Q3 - Q1
The interquartile range is used to identify outliers, and it is also used as a measure of variability in exploratory data
analysis.
\[ \text{Midhinge} = \frac{Q_1 + Q_3}{2} \]
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, etc.
Note that D1 corresponds to P10; D2 corresponds to P20; etc. Deciles can be found by using the formulas given for
percentiles. Taken altogether then, these are the relationships among percentiles, deciles, and quartiles.
Deciles are denoted by D1, D2, D3, . . . , D9, and they correspond to P10, P20, P30, . . . , P90.
Quartiles are denoted by Q1, Q2, Q3 and they correspond to P25, P50, P75.
The median is the same as P50 or Q2 or D5.
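Using the same method of calculation as for the median, Q1 and Q3 for grouped data are given by
\[ Q_1 = L_{Q_1} + \frac{\frac{n}{4} - F}{f_{Q_1}} \cdot i, \qquad Q_3 = L_{Q_3} + \frac{\frac{3n}{4} - F}{f_{Q_3}} \cdot i \]
where L is the lower boundary of the quartile class, F is the cumulative frequency of the classes before it, f is
the frequency of the quartile class, n is the total frequency, and i is the class width.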
Example 4–10: Based on the grouped data below, find the interquartile range (IQR).
Time to travel to work f
1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7
Solution
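The total frequency is n = 8 + 14 + 12 + 9 + 7 = 50.
Class Q1: n/4 = 50/4 = 12.5 → class Q1 is the 2nd class (cumulative frequencies 8, 22, 34, 43, 50).
\[ Q_1 = L_{Q_1} + \frac{\frac{n}{4} - F}{f_{Q_1}} \cdot i = 10.5 + \frac{12.5 - 8}{14} \cdot 10 = 13.7143 \]
Class Q3: 3n/4 = 3(50)/4 = 37.5 → class Q3 is the 4th class.
\[ Q_3 = L_{Q_3} + \frac{\frac{3n}{4} - F}{f_{Q_3}} \cdot i = 30.5 + \frac{37.5 - 34}{9} \cdot 10 = 34.3889 \]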
IQR = Q3 - Q1 = 34.3889 - 13.7143 = 20.6746
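The grouped-data quartile calculation can be sketched in code as follows. The class boundaries 0.5, 10.5, ... are an assumption obtained by subtracting 0.5 from each lower class limit, consistent with the boundary 30.5 used for the 4th class in the solution; the function name `grouped_quantile` is illustrative.

```python
def grouped_quantile(classes, position, width):
    """Locate the class containing `position` and interpolate within it.

    classes: list of (lower boundary, frequency) in increasing order.
    position: n/4 for Q1, 3n/4 for Q3.
    width: class width i.
    """
    cumulative = 0  # F: cumulative frequency of the classes before the current one
    for lower_boundary, freq in classes:
        if cumulative + freq >= position:
            return lower_boundary + (position - cumulative) / freq * width
        cumulative += freq
    raise ValueError("position exceeds the total frequency")

# Time to travel to work, from Example 4-10.
classes = [(0.5, 8), (10.5, 14), (20.5, 12), (30.5, 9), (40.5, 7)]
n = sum(f for _, f in classes)

q1 = grouped_quantile(classes, n / 4, 10)
q3 = grouped_quantile(classes, 3 * n / 4, 10)
iqr = q3 - q1
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
```

This reproduces Q1 = 13.7143, Q3 = 34.3889, and IQR = 20.6746.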
Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
Solution
The data value 50 is extremely suspect. These are the steps in checking for an outlier.
Step 1 Find Q1 and Q3. From the previous example, Q1 is 9 and Q3 is 20.
Step 2 Find the interquartile range (IQR), which is Q3 - Q1.
IQR = Q3 - Q1 = 20 - 9 = 11
Step 3 Multiply the IQR by 1.5.
1.5(11) = 16.5
Step 4 Subtract the value obtained in step 3 from Q1, and add the value obtained in step 3 to Q3.
9 - 16.5 = -7.5 and 20 + 16.5 = 36.5
Step 5 Check the data set for any data values that fall outside the interval from -7.5 to 36.5. The value 50 is outside
this interval; hence, it can be considered an outlier.
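The outlier check can be sketched as a short function. The data set below is assumed to be the one from Example 4–9, to which the quoted quartiles Q1 = 9 and Q3 = 20 refer; the function name `find_outliers` is illustrative.

```python
def find_outliers(values, q1, q3):
    """Flag values outside the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

data = [5, 6, 12, 13, 15, 18, 22, 50]
low, high, outliers = find_outliers(data, q1=9, q3=20)
print(low, high, outliers)  # -7.5 36.5 [50]
```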
A boxplot is a graph of a data set obtained by drawing a horizontal line from the minimum data value to Q1,
drawing a horizontal line from Q3 to the maximum data value, and drawing a box whose vertical sides pass
through Q1 and Q3 with a vertical line inside the box passing through the median or Q2.
Example 4–12: The number of meteorites found in 10 states of the United States is 89, 47, 164, 296, 30, 215, 138,
78, 48, 39. Construct a boxplot for the data.
Solution
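The five values that a boxplot locates (lowest value, Q1, median, Q3, highest value) can be computed directly from the data. The sketch below assumes the median-of-halves quartile rule; other quartile conventions may give slightly different Q1 and Q3. The function name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of values, the median itself is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(meteorites)
print(summary)  # (30, 47, 83.5, 164, 296)
```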
Step 6 Locate the lowest value, Q1, the median, Q3, and the highest value on the scale.
Step 7 Draw a box around Q1 and Q3, draw a vertical line through the median, and connect the upper value and the
lower value to the box as in the figure.
Figure 3–