Describing Data Numerically Lesson
This lesson includes an overview of the subject, instructor notes, and example exercises using
Minitab.
Statistics is the discipline concerned with the optimal acquisition (where garbage in equals
garbage out) and analysis of data in order to model a population or process.
We can begin to analyze a data set by describing it both numerically and graphically. This lesson
considers important numerical summaries of data. In this lesson, we will use sample data taken
from a large population, and we are only considering quantitative (numeric) data, not qualitative
(categorical) data. For the data sets of interest, we will select only one variable of interest; that is,
we will be working with univariate data, not bivariate or multivariate data.
Prerequisites
This lesson requires knowledge of basic arithmetic. Symbolic notation will be introduced and
used to simplify the formulas for the computation of numerical measurements. In Minitab,
computations will be made on single columns of data.
Learning Targets
Calculate basic numerical measures of center for a sample set of data, including its
mean, median, and mode
Determine which measure of center may be more appropriate for a given data set
Calculate basic numerical measures of spread for a sample set of data, including its
range, variance, and standard deviation
Time Required
It will take the instructor 30-45 minutes in class to introduce the descriptive statistics formulas.
We recommend starting the activity sheet in class so that students can ask the instructor
questions while working on it. The exercises on the activity sheet will take an additional 30-45
minutes, and they can be used as homework or quiz problems.
Materials Required
Assessment
The activity sheet contains exercises for students to assess their understanding of the learning
targets for this lesson.
Possible Extensions
This lesson provides good introductory examples for students new to statistics. The instructor
may want to do the Sampling lesson first so that students know how data is being selected
from the population. The recommended follow-up lesson is Describing Data Graphically.
References
Definition: A sample is a subset of subjects from the population for which observations are
actually made.
The numerical values that are calculated on a sample are called statistics.
There are two branches of statistics that are discussed in introductory statistics courses –
descriptive statistics and inferential statistics. Later lessons will be devoted to inferential
statistics.
Definition: Descriptive statistics (also called summary statistics) uses graphical and/or
numerical summaries for describing or summarizing data from a sample.
The most common descriptive statistics provide information about a sample’s central
tendency (mean, median, mode) and variability (variance, standard deviation, range).
Some graphical methods for displaying and describing data include: dotplot, stem-and-
leaf plot, histogram, boxplot, and time series plot (time ordered data). Additional lessons
describe these graphs.
Notation: When discussing samples throughout this lesson, we need to have notation for a
generic sample of size n. We’ll use:
$x_1, x_2, \ldots, x_n$
where $x_i$ denotes the $i$th data value in the sample.
Sample Mean
Definition: The sample mean, denoted by $\bar{x}$, is the arithmetic average of the n data values
in the sample.
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
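As a minimal sketch of the sample mean formula, the Python snippet below sums the data values and divides by n. Python is not part of the original lesson, which uses Minitab; the function name sample_mean is a hypothetical helper, and the data are the battery brand A lifetimes listed in Example 2.

```python
# Battery brand A lifetimes in hours, as listed in Example 2.
brand_a = [38, 41, 87, 94, 102, 116, 155, 179, 214, 289]


def sample_mean(values):
    """Return the arithmetic average: (x_1 + x_2 + ... + x_n) / n."""
    n = len(values)
    return sum(values) / n


mean_a = sample_mean(brand_a)
print(mean_a)  # 131.5
```

The result agrees with the brand A sample mean of 131.5 hours used later in the variance calculation.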
Also noted in each picture are the modes (circled) and location of the medians (m). The
definitions of these statistics are contained in the following pages.
Example 1
Ten batteries from brands A, B, and C were tested to determine their lifetimes (in hours).
Sample Median
Definition: The sample median is the middle ordered data value if the sample size n is
odd and the average of the middle two ordered data values if the sample size n is even.
Example 2
The sample median lifetimes of batteries from brands A, B, and C are:
Battery brand A ordered lifetimes: 38, 41, 87, 94, 102, 116, 155, 179, 214, 289. Since there
is an even number of data points, the sample median is $\frac{102 + 116}{2} = 109$ hours.
Battery brand B ordered lifetimes: 22, 22, 32, 39, 64, 65, 99, 142, 191, 317. The sample
median is 64.5 hours.
Battery brand C ordered lifetimes: 18, 24, 34, 41, 43, 95, 122, 139, 318, 360. The sample
median is $\frac{43 + 95}{2} = 69$ hours.
As an additional example, suppose we have battery brand D with ordered lifetimes: 20,
32, 45, 67, 69, 142, 150. Since there is an odd number of data points, the sample
median is 67 hours, the middle ordered data value.
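The odd and even cases of the median definition can be sketched in Python as follows. This is an illustrative sketch rather than part of the Minitab lesson; sample_median is a hypothetical helper name, and the data are the four sets of battery lifetimes from Example 2.

```python
def sample_median(values):
    """Return the middle ordered value (n odd) or the average of the
    two middle ordered values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Battery lifetimes in hours from Example 2.
lifetimes = {
    "A": [38, 41, 87, 94, 102, 116, 155, 179, 214, 289],
    "B": [22, 22, 32, 39, 64, 65, 99, 142, 191, 317],
    "C": [18, 24, 34, 41, 43, 95, 122, 139, 318, 360],
    "D": [20, 32, 45, 67, 69, 142, 150],
}

medians = {brand: sample_median(values) for brand, values in lifetimes.items()}
print(medians)  # {'A': 109.0, 'B': 64.5, 'C': 69.0, 'D': 67}
```

Sorting inside the function means the data do not need to be ordered beforehand.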
Sample Mode
Definition: The most frequently occurring sample data value is the mode. There can be
more than one mode.
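One way to find the mode or modes is to count how often each value occurs and keep the values with the highest count. The sketch below does this in Python, which is not part of the original lesson; sample_modes is a hypothetical helper name, the brand B data come from Example 2, and the second list is made up purely to illustrate two modes.

```python
from collections import Counter


def sample_modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Battery brand B lifetimes in hours from Example 2.
brand_b = [22, 22, 32, 39, 64, 65, 99, 142, 191, 317]
print(sample_modes(brand_b))  # [22]

# Made-up values, used only to show a sample with more than one mode.
made_up = [1, 1, 2, 2, 3]
print(sample_modes(made_up))  # [1, 2]
```

Note that when every value occurs exactly once, this sketch returns all of the values; many textbooks instead say that such a sample has no mode.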
Example 4
You decide to participate in a fishing contest at a local pond. Each contestant must catch 5 fish,
and the winner is the contestant with the “longest” catches overall. Given
that you caught the 5 fish below, would you rather the judges use the mean or the median to
determine the longest catches?
Answer: You want to win the contest! So, hopefully the judges will determine the longest
catches using the mean of the five catches. The length of the median catch definitely won’t win
you the top prize!
Skewed Data
A data set is said to be skewed if it is asymmetric, either positively or negatively, as denoted in
the figures below.
For positively skewed data, generally the mean is greater than the median.
For negatively skewed data, generally the mean is less than the median.
For symmetric data, the mean and median tend to be close to the same value.
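As a quick check of the first rule, the sketch below compares the mean and median of the battery lifetimes from Example 2 using Python's standard statistics module (Python is an assumption here; the lesson itself uses Minitab). A mean above the median is consistent with positive skew, although it does not prove it.

```python
import statistics

# Battery lifetimes in hours from Example 2.
lifetimes = {
    "A": [38, 41, 87, 94, 102, 116, 155, 179, 214, 289],
    "B": [22, 22, 32, 39, 64, 65, 99, 142, 191, 317],
    "C": [18, 24, 34, 41, 43, 95, 122, 139, 318, 360],
}

mean_exceeds_median = {}
for brand, values in lifetimes.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    mean_exceeds_median[brand] = mean > median
    print(f"Brand {brand}: mean = {mean}, median = {median}, mean > median: {mean > median}")
```

For all three brands the mean is larger than the median, because a few long-lived batteries pull the mean upward.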
Below are histograms of exam scores for 110 students. Note: All histogram bins contain their left
endpoints.
First histogram: the mean (~75.04) and median (75) are about the same.
Second histogram: the mean (~83.50) is less than the median (89).
Measures of Spread
We can observe three measures of spread for a sample: the sample range, sample variance, and
sample standard deviation.
Sample Range
Definition: The sample range for a data set is the difference between the largest
(maximum) and smallest (minimum) data values in the sample.
Returning to Example 1 (data below), we can calculate sample ranges for battery brands A, B,
and C.
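A minimal Python sketch of the range calculation, using the battery lifetimes listed in Example 2, is given here. Python and the helper name sample_range are assumptions for illustration; the lesson's own tool is Minitab.

```python
def sample_range(values):
    """Return the largest data value minus the smallest data value."""
    return max(values) - min(values)


# Battery lifetimes in hours from Example 2.
lifetimes = {
    "A": [38, 41, 87, 94, 102, 116, 155, 179, 214, 289],
    "B": [22, 22, 32, 39, 64, 65, 99, 142, 191, 317],
    "C": [18, 24, 34, 41, 43, 95, 122, 139, 318, 360],
}

ranges = {brand: sample_range(values) for brand, values in lifetimes.items()}
print(ranges)  # {'A': 251, 'B': 295, 'C': 342}
```

Because the range depends only on the two most extreme values, a single unusually long-lived battery can change it substantially.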
Definition: The sample variance is the most common estimate of data spread, and we use
it in conjunction with the sample mean. It is a measure of deviation from the sample mean $\bar{x}$.
For instance, the difference $(x_1 - \bar{x})$ is the deviation of the first data point from the sample
mean. Hence, we have the n deviations:
$(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_i - \bar{x}), \ldots, (x_n - \bar{x})$
Some deviations are negative, while others are positive, and summing the deviations yields
0. To keep the deviations from canceling, we square each one so that none is negative. The sample variance is
the sum of the squared deviations divided by (n – 1) and is denoted by the symbol $s^2$.
\[ s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The sample variance ($s^2$) measures the average scatter of the data values about the
sample mean. It is approximately the average of the squared deviations (the divisor is n – 1 rather than n).
Why do we divide by n – 1 instead of n? Because dividing by n – 1 gives us a BETTER
ESTIMATE of the true population variance $\sigma^2$.
The units of $s^2$ are squared units. For example, if our data consist of people’s weights
in pounds, $s^2$ has units of pounds squared. To return to the same units as the sample mean,
we take the square root of the sample variance; it is called the sample standard
deviation, and it is denoted by s.
We already computed the sample mean of battery brand A as 𝑥̅ = 131.5 hours. So, the sum of
the squared deviations is:
$(41 - 131.5)^2 + (289 - 131.5)^2 + (214 - 131.5)^2 + \cdots + (155 - 131.5)^2 = 55850.5$ hours$^2$
Thus,
$s^2 = \frac{55850.5}{9} \approx 6205.61$ hrs$^2$, and $s = \sqrt{6205.61} \approx 78.78$ hrs.
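The hand calculation for battery brand A can be reproduced step by step with the short Python sketch below. Python is an assumption for illustration only (the lesson uses Minitab), and the variable names are hypothetical; the data are the brand A lifetimes from Example 2.

```python
import math

# Battery brand A lifetimes in hours from Example 2.
brand_a = [38, 41, 87, 94, 102, 116, 155, 179, 214, 289]

n = len(brand_a)
mean_a = sum(brand_a) / n

# Square each deviation from the sample mean, then add them up.
squared_deviations = [(x - mean_a) ** 2 for x in brand_a]
sum_of_squares = sum(squared_deviations)

# Divide by n - 1 for the sample variance; take the square root for s.
variance_a = sum_of_squares / (n - 1)
std_dev_a = math.sqrt(variance_a)

print(sum_of_squares)        # 55850.5
print(round(variance_a, 2))  # 6205.61
print(round(std_dev_a, 2))   # 78.78
```

Python's standard statistics module also provides statistics.variance and statistics.stdev, which use the same n – 1 divisor and can serve as a cross-check.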
Minitab Calculations
All computations we just did by hand in previous examples can be easily calculated in Minitab.
Example 5
Ten batteries from brands A, B, and C were tested to determine their lifetimes (in hours).
Minitab
Before beginning the activity sheet, here’s a fun riddle for remembering the mean, median,
mode, and range.