Module 6 Statistical Tools
Learning Outcomes
By the end of the module, students will be able to:
References:
[1] Sobecki, D. (2019). Math in Our World (4th Ed.). McGraw-Hill Education
Terms to Remember
Data are measurements or observations that are gathered for an event under study.
Statistics is the branch of mathematics that involves collecting, organizing,
summarizing, and presenting data and drawing general conclusions from the data.
A population consists of all subjects under study.
A sample is a representative subgroup or subset of a population.
There are two main branches of statistics: descriptive and inferential.
1. Statistical techniques used to describe data are called descriptive statistics.
This is based on collecting, organizing, and reporting data without using the data
to draw any wide-ranging conclusions.
For example, a researcher might be interested in the average age of the
full-time students on your campus and how many credit hours they’re scheduled
for this term.
2. Statistical techniques used to make inferences are called inferential statistics.
This is based on studying characteristics of a sample within a larger population
and using them to draw conclusions about the entire population.
For example, the Bureau of Labor Statistics estimates the number of
people in the United States who are unemployed every month. Since it would be
impossible to survey everyone, the bureau picks a sample of adults to see what
percentage are unemployed. It then uses that information to estimate the
unemployment rate for the entire population.
Sampling Methods
We will study four basic sampling methods:
1. In order to obtain a random sample, each subject of the population must have an
equal chance of being selected.
● Ask the school registrar to give him a list of 50 students whose student ID
numbers end in 4.
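The equal-chance requirement for a random sample can be sketched in a few lines of Python. The population of 500 student ID numbers is made up for illustration, and the seed is fixed only so that the sketch gives the same result each time it runs.

```python
import random

# Hypothetical population: 500 student ID numbers (assumed for illustration).
population = list(range(1001, 1501))

# A fixed seed makes the sketch repeatable; a real study would not fix it.
rng = random.Random(6)

# sample() draws without replacement, and every ID is equally likely to be chosen.
sample = rng.sample(population, 50)

print(sorted(sample)[:5])
```

By contrast, picking only the students whose ID numbers end in 4 gives no chance of selection to anyone else, so it is not a random sample in this sense.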
MATHEMATICS IN THE MODERN WORLD
MODULE
Frequency Distribution
The data collected for a statistical study are called raw data. In order to describe
situations and draw conclusions, we need to organize the data in a meaningful way.
Two methods that we will use are frequency distributions and stem and leaf plots.
A B B AB O O O B AB B
B B O A O AB A O B A
A O O O AB
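As a rough sketch, the 25 raw values above can be tallied into a categorical frequency distribution with Python's `Counter`. The column labels and the percent column are our own additions, not part of the original table.

```python
from collections import Counter

# The 25 raw data values listed above, read row by row.
raw = """
A B B AB O O O B AB B
B B O A O AB A O B A
A O O O AB
""".split()

counts = Counter(raw)   # tally each category
total = len(raw)        # should match the number of data values

print("Class  Frequency  Percent")
for label in ["A", "B", "O", "AB"]:
    percent = 100 * counts[label] / total
    print(f"{label:<5}  {counts[label]:>9}  {percent:>6.0f}%")
```

Checking that the frequencies add up to 25 is the same sanity check you would do after tallying by hand.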
4. Make sure the range of numbers included in a class is the same for each one.
5. The beginning and ending values have to be chosen based on how the data
values are rounded.
These data represent the record high temperatures for each of the 50 states in
degrees Fahrenheit. Construct a grouped frequency distribution for the data.
112 100 128 120 134 118 106 110 109 112
100 118 117 116 118 121 114 114 105 109
107 112 114 115 118 117 118 125 106 110
122 108 110 121 113 120 119 111 104 113
120 113 120 117 105 110 118 112 114 115
Step 1 Subtract the lowest value from the highest value: 134 − 100 = 34.
Step 2 If we use a range of 5 degrees, that will give us seven classes, since the entire
range (34 degrees) divided by 5 is 6.8, which we round up to 7.
Step 3 Start with the lowest value and add 5 to get the lower class limits: 100, 105, 110,
115, 120, 125, 130. Notice that all of the data are rounded to the nearest whole number,
so that’s reflected in our choices of class limits.
Step 4 Set up the classes. To find the upper limit for each, subtract one from the next
lower limit: 100-104, 105-109, 110-114, 115-119, 120-124, 125-129, 130-134.
Step 5 Tally the data and record the frequencies. It’s a really good idea to cross out
each data value as you tally it up, and an even better idea to make sure that all of the
frequencies add up to the total number of data values.
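The tallying in the steps above can be checked with a short Python sketch. It assumes classes that are 5 degrees wide and start at the lowest data value, as in the example; the variable names are ours.

```python
# Record high temperatures for the 50 states, copied from the data above.
temps = [
    112, 100, 128, 120, 134, 118, 106, 110, 109, 112,
    100, 118, 117, 116, 118, 121, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 125, 106, 110,
    122, 108, 110, 121, 113, 120, 119, 111, 104, 113,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 115,
]

width = 5
low, high = min(temps), max(temps)

# Lower class limits: start at the lowest value and keep adding the width.
lower_limits = list(range(low, high + 1, width))

freq = {}
for lower in lower_limits:
    upper = lower + width - 1  # one less than the next lower limit
    freq[(lower, upper)] = sum(lower <= t <= upper for t in temps)

print("Class    Frequency")
for (lower, upper), f in freq.items():
    print(f"{lower}-{upper}  {f:>9}")

# The frequencies should add up to the number of data values.
print("Total:", sum(freq.values()))
```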
The data below are the July 2015 unemployment rates for each state. Draw a
stem and leaf plot illustrating these data.
6.2 6.6 6.3 5.4 6.1 4.2 5.3 4.9 5.3 5.9
3.5 4.2 5.6 4.6 3.7 4.6 5.2 6.0 4.5 5.1
4.7 5.1 4.0 6.3 5.6 4.1 2.8 6.8 3.6 5.7
6.7 5.2 5.9 2.9 4.7 4.6 6.1 5.4 5.6 6.0
3.7 5.7 4.1 3.7 3.6 4.5 5.3 7.6 4.5 4.0
It is also possible to draw a grouped stem and leaf plot. It follows the same rules as the
grouped frequency distribution, and we can construct one using the data from the
previous example.
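One way to sketch an ordinary (ungrouped) stem and leaf plot for the unemployment data is shown below. It assumes the whole-number part of each rate is the stem and the tenths digit is the leaf; the values are kept as text so that the digits can be split without any rounding.

```python
from collections import defaultdict

# July 2015 unemployment rates, copied from the data above.
rates = """
6.2 6.6 6.3 5.4 6.1 4.2 5.3 4.9 5.3 5.9
3.5 4.2 5.6 4.6 3.7 4.6 5.2 6.0 4.5 5.1
4.7 5.1 4.0 6.3 5.6 4.1 2.8 6.8 3.6 5.7
6.7 5.2 5.9 2.9 4.7 4.6 6.1 5.4 5.6 6.0
3.7 5.7 4.1 3.7 3.6 4.5 5.3 7.6 4.5 4.0
""".split()

plot = defaultdict(list)
for value in rates:
    stem, leaf = value.split(".")   # "6.2" -> stem "6", leaf "2"
    plot[int(stem)].append(int(leaf))

# Print one row per stem, with the leaves in increasing order.
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in sorted(plot[stem]))
    print(f"{stem} | {leaves}")
```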
Presentation of Data
When data are representative of certain categories, rather than numerical, we
often use bar graphs or circle graphs (commonly known as pie charts) to illustrate the
data.
The marketing firm Deloitte Retail conducted a survey of grocery shoppers. The
frequency distribution below represents the responses to the survey question “How
often do you bring your own bags when grocery shopping?” Draw a pie chart to
represent the data.
Response Frequency
Always 10
Never 39
Frequently 19
Occasionally 32
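Before drawing the pie chart, it helps to work out how large each slice should be. The sketch below assumes the usual approach of giving each category the same fraction of 360 degrees as its share of the total frequency; the variable names are ours.

```python
# Survey responses and their frequencies, from the table above.
responses = {"Always": 10, "Never": 39, "Frequently": 19, "Occasionally": 32}

total = sum(responses.values())

angles = {}
for label, frequency in responses.items():
    percent = 100 * frequency / total
    angles[label] = 360 * frequency / total  # central angle of the slice
    print(f"{label:<13} {percent:5.1f}%  {angles[label]:6.1f} degrees")
```

The angles should add up to 360 degrees, which is a quick check that no category was left out.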
Store Frequency
Other 65
A time series graph can be drawn for data collected over a period of time. This
type of graph is used primarily to show trends, like prices rising or falling, for the time
period.
There are three types of trends. Secular trends are viewed over a long period of
time, such as yearly. Cyclical trends show oscillating patterns. Seasonal trends show
the values of a commodity for shorter periods of the year, such as fall, winter, spring,
and summer.
Politicians often talk about violent crime in the United States in apocalyptic terms,
but what does the data say? The table shows the number of violent crimes committed
per 100,000 citizens every five years from 1980 to 2015. Draw a time series graph for
the data and use it to discuss trends in violent crime.
Step 1 Write the scale for the frequencies on the vertical axis and the class limits on the
horizontal axis. Make sure that your labeling on the vertical axis starts at zero.
Step 2 Draw vertical bars with heights that correspond to the frequencies for each class.
Class Frequency
100-104 3
105-109 8
110-114 16
115-119 13
120-124 7
125-129 2
130-134 1
121 101 129 121 135 119 107 111 110 113
102 120 119 118 120 123 116 116 107 111
105 110 112 113 116 115 116 123 104 108
108 109 119 116 122 118 112 115 106 109
The Greek letter Σ (sigma) is used to represent the sum of a list of numbers. If
we use the letter X to represent data values, then ΣX means to find the sum of all
values in a data set.
The mean is the sum of the values in a data set divided by the number of values.
If X1, X2, X3, …, Xn are the data values, we use X̄ ("X-bar") to stand for the mean, and
X̄ = (X1 + X2 + X3 + … + Xn) / n = ΣX / n
Employee Jerry Kramer Newman George Elaine Susan Tim Estelle Frank
Salary 58 65 944 20 52 51 53 55 50
Adding the nine salaries gives 1,348, and 1,348 / 9 ≈ 149.8. The result of 149.8 tells
us that the mean salary is 149.8 thousand dollars, or $149,800.
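The arithmetic behind the mean can be reproduced with a minimal Python sketch. The salaries are in thousands of dollars, as in the table; the variable names are ours.

```python
# Salaries in thousands of dollars, from the table above.
salaries = {
    "Jerry": 58, "Kramer": 65, "Newman": 944, "George": 20, "Elaine": 52,
    "Susan": 51, "Tim": 53, "Estelle": 55, "Frank": 50,
}

total = sum(salaries.values())   # the sum of all data values
n = len(salaries)                # the number of data values
mean = total / n

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")
```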
The Median
In short, the median of a data set is the value in the middle if all values are
arranged in order. The median will either be a specific data value in the set, or will fall in
between two values.
Find the median salary for Vandelay Industries. How does it compare to the mean?
Employee George Frank Susan Elaine Tim Estelle Jerry Kramer Newman
Salary 20 50 51 52 53 55 58 65 944
There are nine salaries listed, and where I come from, nine is odd. So the
median will be the salary right in the middle: there will be four salaries less and four
more. That makes it the fifth salary on the list, which is $53,000. This is a whole lot less
than the mean of $149,800, and in fact is a much more reasonable measure of average
for these data.
Find the mean and median if Newman’s salary is left out. What can you conclude?
Salary 20 50 51 52 53 55 58 65
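Now there are eight salaries, so we’ll need to find the mean of the two in the
middle, which are 52,000 and 53,000. It would be nice if you could just figure out that
the mean is halfway in between, but for the sake of completeness:
(52,000 + 53,000) / 2 = 52,500
So the median is $52,500. The eight remaining salaries add up to 404 (in thousands), so
the mean is 404 / 8 = 50.5 thousand dollars, or $50,500.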
Now that’s interesting. The median was almost unaffected by throwing away the
largest value, but the mean changed dramatically, to say the least. This is exactly why
the mean was a poor measure of average for this data set: the one very large value has
a great impact on the mean, but not so much on the median.
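To see the effect of the one very large salary side by side, here is a short sketch using Python's `statistics` module. Salaries are in thousands of dollars, and the list is already in order, so dropping the last entry removes Newman's salary.

```python
from statistics import mean, median

# Ordered salaries in thousands of dollars; the last one is Newman's.
salaries = [20, 50, 51, 52, 53, 55, 58, 65, 944]
without_newman = salaries[:-1]

for label, data in [("all nine", salaries), ("without Newman", without_newman)]:
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

The median moves only from 53 to 52.5, while the mean drops from about 149.8 to 50.5.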
The Midrange
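The midrange is the mean of the lowest and highest values in a data set:
Midrange = (Lowest value + Highest value) / 2
The advantage of the midrange is that it’s very quick and easy to calculate.
The disadvantage is that it totally ignores most of the data values, so it’s not a
particularly reliable measure.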
Find the midrange of all salaries at Vandelay Industries. Is it meaningful in this case?
Employee George Frank Susan Elaine Tim Estelle Jerry Kramer Newman
Salary 20 50 51 52 53 55 58 65 944
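A minimal sketch of the calculation, taking the midrange to be the mean of the lowest and highest values (salaries are in thousands of dollars):

```python
# Salaries in thousands of dollars, from the table above.
salaries = [20, 50, 51, 52, 53, 55, 58, 65, 944]

# Only the smallest and largest values are used; the other seven are ignored.
midrange = (min(salaries) + max(salaries)) / 2

print(midrange)
```

The result is 482 thousand dollars, which is more than every salary except Newman's, so it says very little about a typical salary at the company.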
The Mode
The mode is sometimes said to be the most typical case.
The value that occurs most often in a data set is called the mode.
A data set can have more than one mode or no mode at all.
These data represent the duration (in days) of the final 20 U.S. space shuttle voyages.
11, 12, 13, 12, 15, 12, 15, 13, 15, 12, 12, 15, 13, 10, 13, 15, 11, 12, 15, 12
Days Frequency
10 1
11 2
12 7
13 4
15 6
The number of Atlantic hurricanes for each of the years from 1997–2016 is
shown in the list.
Find the mode, and describe what that tells you.
3, 10, 8, 8, 9, 4, 7, 9, 15, 5, 6, 8, 3, 12, 7, 10, 2, 6, 4, 7
This time, we’ll find the mode without making a frequency distribution. Instead,
we can just work down the list, counting the number of occurrences for each number of
hurricanes. It turns out that there are two numbers that appear three times, while no
others appear more than twice. Those numbers are 7 and 8, so this data set has two
modes. This means that over that 20-year span, the most common numbers of Atlantic
hurricanes in a year were 7 and 8.
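Counting occurrences by hand gets tedious, so here is a sketch that finds every mode of a list. The helper `modes` is our own illustrative function; it returns an empty list when no value repeats, which is one common convention for "no mode at all".

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, in increasing order."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so we report no mode
    return sorted(v for v, c in counts.items() if c == top)

# Durations (in days) of the final 20 space shuttle voyages.
shuttle_days = [11, 12, 13, 12, 15, 12, 15, 13, 15, 12,
                12, 15, 13, 10, 13, 15, 11, 12, 15, 12]

# Number of Atlantic hurricanes per year, 1997-2016.
hurricanes = [3, 10, 8, 8, 9, 4, 7, 9, 15, 5,
              6, 8, 3, 12, 7, 10, 2, 6, 4, 7]

print(modes(shuttle_days))
print(modes(hurricanes))
```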
Mode frequent categorical data like from mean and median Mo with highest
Midrange
Definition: The mean of the highest and lowest values.
Advantages:
• Very quick and easy to compute
• Provides a simple look at average
Disadvantages:
• Dramatically affected by extremely high or low values in the data set
• Ignores all but two values in the set
Measures of Variation
In this section we will study measures of variation, which will help to describe how
the data within a set vary. The three most commonly used measures of variation are
range, variance, and standard deviation.
The range of a data set is the difference between the highest and lowest values
in the set.
Range = Highest value – Lowest value
The first list below is the weights of the dogs in the first picture, and the second is
the weights of the dogs in the second picture.
Find the range for each list, then describe any observations you can make
based on the results.
1st: 70, 73, 58, 60
2nd: 30, 85, 40, 125, 42, 75, 60, 55
The range of the first list is 73 − 58 = 15, and the range of the second is 125 − 30 = 95.
The ranges are very different. This is reflective of the fact that there’s a lot more
variation in size among the dogs in the second picture.
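The two ranges can be checked with a few lines of Python. The helper name `value_range` is ours.

```python
def value_range(data):
    """Range = highest value - lowest value."""
    return max(data) - min(data)

first = [70, 73, 58, 60]
second = [30, 85, 40, 125, 42, 75, 60, 55]

print("1st range:", value_range(first))
print("2nd range:", value_range(second))
```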
Find the variance and standard deviation for the weights of the eight dogs in the
second picture at the beginning of this section.
The weights are listed again for reference.
30, 85, 40, 125, 42, 75, 60, 55
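Step 5 Divide the sum by n − 1 to get the variance, where n is the sample size. In this
case, n is 8, so n − 1 = 7. The squares in the table below add up to 6,596, so the
variance is 6,596 / 7 ≈ 942.3.
Step 6 Take the square root of the variance to get standard deviation: √942.3 ≈ 30.7.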
X  X − X̄  (X − X̄)²
30 -34 1,156
85 21 441
40 -24 576
125 61 3,721
42 -22 484
75 11 121
60 -4 16
55 -9 81
The sample standard deviation (s) is the square root of the variance. It provides
an approximate average of the distances between data values and the mean.
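The whole calculation for the eight dog weights can be sketched in Python. The code follows the columns of the table (value, distance from the mean, square of that distance) and then Steps 5 and 6; the last line compares the result with `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics

weights = [30, 85, 40, 125, 42, 75, 60, 55]

n = len(weights)
mean = sum(weights) / n

# Distance of each weight from the mean, and the square of each distance.
deviations = [x - mean for x in weights]
squares = [d ** 2 for d in deviations]

# Step 5: divide the sum of the squares by n - 1 to get the sample variance.
variance = sum(squares) / (n - 1)

# Step 6: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {sum(squares)}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
print(f"statistics.stdev gives {statistics.stdev(weights):.1f}")
```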
Measures of Position
A percentile, or percentile rank, of a data value indicates the percent of data
values in a set that are below that particular value.
Suppose you score 77 on a test in a class of 10 people, with the 10 scores listed
below. What was your percentile rank?
93 82 64 75 98 52 77 88 90 71
The ordered list is 52, 64, 71, 75, 77, 82, 88, 90, 93, 98
Now if you focus on 77, you can see that there are exactly four scores lower than
yours. Since there were 10 scores total, that means that 4/10 or 40% of the scores were
lower than yours.
As for the percentile rank? That’s just to see if you’re paying attention. The
definition of percentile rank is the percentage of data values that are lower than a given
value, so we say that a score of 77 is at the 40th percentile.
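The same count can be done in Python. This sketch uses the definition given above (the percent of data values strictly below the given value); the function name is ours.

```python
def percentile_rank(value, data):
    """Percent of the data values that are lower than the given value."""
    below = sum(x < value for x in data)
    return 100 * below / len(data)

scores = [93, 82, 64, 75, 98, 52, 77, 88, 90, 71]

print(percentile_rank(77, scores))  # 40.0
```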
Step 1 We’re asked to find the number on the list that has 30% of the numbers
below it. There are 10 numbers, and 30% of 10 is 3.
Step 2 Arrange the data in order from smallest to largest, and find the value that
has 3 values below it.
1,433, 1,592, 1,598, 2,071, 2,096, 2,155, 2,320, 2,395, 2,427, 2,561
The 30th percentile is the speech that consists of 2,071 words.
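Going the other way, from a percentile to a data value, can be sketched as follows. The function is our own and covers only the simple case used in the two steps above, where the given percent of the number of values is a whole number; other cases need a rounding rule that this example does not cover.

```python
def value_at_percentile(p, data):
    """Return the data value that has p percent of the values below it."""
    ordered = sorted(data)
    n = len(ordered)
    if not 0 <= p < 100 or (p * n) % 100 != 0:
        raise ValueError("this sketch needs p percent of n to be a whole number")
    count_below = p * n // 100   # Step 1: how many values must lie below
    return ordered[count_below]  # Step 2: the value with that many below it

word_counts = [1433, 1592, 1598, 2071, 2096, 2155, 2320, 2395, 2427, 2561]

print(value_at_percentile(30, word_counts))
```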
Both are excellent students, but Miguel’s ranking is higher even though he was
51st and Dustin was 27th.
Quartiles
A quartile divides a data set into quarters.
The second quartile is the same as the median, and divides a data set into an
upper half and a lower half.
The first quartile is the median of the lower half, and the third quartile is the
median of the upper half.
We use the symbols Q1, Q2, and Q3 for the first, second, and third quartiles
respectively.
● Q1 (first quartile): 25% of data values are less than this, and 75% are greater
than it.
● Q2 (second quartile): 50% of data values are less than this, and 50% are greater
than it.
● Q3 (third quartile): 75% of data values are less than this, and 25% are greater
than it.
The data below are the percentages of total electricity generated that comes from
nuclear power for the nations with the 12 largest economies in the world, listed by size
of economy. Find the quartiles and describe what they mean.
19.5 2.4 0 15.8 17.2 76.9 2.9 0 3.7 18.6 16.8 0
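First, we’ll find the median, which is also Q2 (the second quartile).
To do so, put the data values in order, from least to greatest.
0, 0, 0, 2.4, 2.9, 3.7, 15.8, 16.8, 17.2, 18.6, 19.5, 76.9
There are 12 values, so the median is halfway between the sixth and seventh values,
which are 3.7 and 15.8. So Q2 = 9.75%.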
Next, we’ll find the median of the lower half: this is Q1 (the first quartile). The two values
in the middle of the bottom half are 0 and 2.4, and halfway between them is 1.2.
So Q1 = 1.2%.
Finally, we’ll find the median of the upper half. The two values in the middle of the upper
six values are 17.2 and 18.6, and 17.9 is halfway between. So Q3 = 17.9%.
The second quartile is 9.75%, which tells us that nations that get less than 9.75%
of their electricity from nuclear power are in the bottom half.
The third quartile is 17.9%, so if a nation gets more than 17.9% of its electricity
from nuclear power, it’s in the top fourth in terms of nuclear generation among the
world’s largest economies.
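The median-of-halves method used in this example can be sketched in Python. The function `quartiles` is our own and assumes at least two data values; note that statistical software often uses slightly different quartile rules, so its answers may not match this method exactly.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 using the median of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

nuclear = [19.5, 2.4, 0, 15.8, 17.2, 76.9, 2.9, 0, 3.7, 18.6, 16.8, 0]

q1, q2, q3 = quartiles(nuclear)
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
```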
Box Plot
One of the most useful applications of quartiles is using them to draw a box plot
(sometimes called a box and whisker plot). This is a graphical way to evaluate the
spread of a data set. In particular, a box plot makes it easy to identify data points that
are outliers—those that appear to be aberrational in some way.
First, we’ll need to define a new term. The distance between the first and third
quartiles for a data set is called the interquartile range, or IQR.
That is, IQR= Q3 − Q1
Data values are considered to be outliers if they're more than 1.5 times the IQR
below Q1, or above Q3.
You can see what all of the quartiles are (at least approximately): Q1 is about 68,
Q2 is about 78, and Q3 is about 89.
The interquartile range is the distance between 89 and 68, which is 21.
The lowest score was 34 and the highest was 99.
Looking a little deeper, the box shows us that the middle half of all scores (which
are the values between Q1 and Q3) fall between 68 and 89.
Draw a box plot for the nuclear power data, then use it to answer some questions
about the data.
(a) What does the box plot tell us about the data set?
(b) Find any outliers in the data set.
Step 1 Find the quartiles. We know that the quartiles are Q1 = 1.2, Q2 = 9.75, and
Q3 = 17.9.
Step 2 Draw a number line that begins before the lowest value in the data set
and ends after the highest value. Locate the lowest and highest data values and draw
short vertical lines above the line at those locations.
Step 3 Draw a rectangular box over the number line, beginning at Q1 and ending
at Q3. Then draw a vertical line through the box at Q2. Then draw horizontal lines from
the lowest and highest values to the edges of the box.
(a) The position of the box shows that most of the values in the data set are on the low
end of the distribution.
Half of the values are between 1.2 and 17.9, and three fourths are less than 17.9.
(b) To decide if there are any outliers, we’ll first need to multiply the interquartile range
by 1.5.
The IQR is 17.9 - 1.2 = 16.7, and 1.5 · IQR = 25.05.
Next, we subtract this number from the first quartile and add it to the third:
Q1 - 25.05 = 1.2 - 25.05 = -23.85
Q3 + 25.05 = 17.9 + 25.05 = 42.95
There are no negative data values, so there can’t be any less than -23.85, but there’s
definitely a value greater than 42.95. Looking back at the data, there’s one outlier: the
maximum value of 76.9%.
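The outlier check can be sketched in Python as well. It uses the same median-of-halves quartiles as the example and the 1.5 times IQR rule stated earlier; the names `low_fence` and `high_fence` are ours.

```python
from statistics import median

nuclear = [19.5, 2.4, 0, 15.8, 17.2, 76.9, 2.9, 0, 3.7, 18.6, 16.8, 0]

ordered = sorted(nuclear)
half = len(ordered) // 2
q1 = median(ordered[:half])    # median of the lower half
q3 = median(ordered[-half:])   # median of the upper half

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr     # values below this are outliers
high_fence = q3 + 1.5 * iqr    # values above this are outliers

outliers = [x for x in ordered if x < low_fence or x > high_fence]

print(f"IQR = {iqr:.2f}, fences = ({low_fence:.2f}, {high_fence:.2f})")
print("Outliers:", outliers)
```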
Normal Distribution
Instructions
80 85 83 88
80 85 83 88
81 85 83 88
81 87 83 89
81 87 85 89
82 87 85 89
82 87 85 90
82 87 85 90
83 88