W4_Lecture slides
Maryam Zangiabadi
Ch. 5: Displaying and Describing Quantitative Data
Learning Objectives
1) Display and summarize quantitative data
2) Display data in a histogram and in a stem-and-leaf diagram
3) Estimate the “centre” of the data distribution; Calculate measures of centre (mean,
median)
4) Estimate the spread of the data distribution; Calculate measures of spread (range,
IQR, and standard deviation)
5) Graph the centre of the data distribution and the extent to which it’s spread in a
“boxplot”
6) Identify outliers
7) Standardize data relative to its spread
8) Graph time series data
2
Displaying Data Distributions
Histograms
A histogram uses adjacent bars to show the distribution of values in a quantitative
variable. Each bar represents the frequency (or relative frequency) of values falling in an
interval of values.
A bin is one of the groups of values on the horizontal axis of the histogram. A
histogram plots the bin counts as the height of bars, and it describes the overall
“shape” of data.
If our focus is on the overall pattern of how the values are distributed rather than on
the counts themselves, it can be useful to make a relative frequency histogram,
replacing the counts (frequencies) on the vertical axis with the percentage of the total
number of cases falling in each bin.
3
Displaying Data Distributions
Histograms
How do histograms work?
1) Decide how wide to make the bins. If there are n data points, use \(\log_2 n\) for the
number of bins. If we have n = 29 data points, \(\log_2 29 = 4.86\). We round 4.86 to 5,
so we use 5 bins.
The formula in Excel is LOG(29,2).
2) Determine the count for each bin.
3) Decide where to place values that land on the endpoint of a bin. For example, does
a value of $5 go into the $0 to $5 bin or the $5 to $10 bin? The standard rule is to
place such values in the higher bin.
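The steps above can be sketched in Python. This is a minimal sketch: the helper names `number_of_bins` and `bin_counts` and the small set of dollar values are illustrative assumptions, not part of the slides. Values that land on a bin endpoint are placed in the higher bin, following the standard rule.

```python
import math

def number_of_bins(n: int) -> int:
    """Rule of thumb from the slide: round log2(n) to the nearest integer."""
    return round(math.log2(n))

def bin_counts(values, start, width, k):
    """Count values in k equal-width bins; endpoint values go to the higher bin."""
    counts = [0] * k
    for v in values:
        index = int((v - start) // width)  # a value on a boundary lands in the higher bin
        if 0 <= index < k:                 # values outside the bins are ignored
            counts[index] += 1
    return counts

print(number_of_bins(29))  # 5

# Hypothetical dollar amounts; $5 falls in the $5 to $10 bin, $10 in the $10 to $15 bin.
print(bin_counts([0, 2.5, 5, 7, 10, 12], start=0, width=5, k=3))  # [2, 2, 2]
```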
4
Displaying Data Distributions
Histograms
This table shows daily price changes in
Bell Canada stock for the period
September 12 to October 24, 2014.
5
Displaying Data Distributions
Histograms
Daily price changes of Bell Canada stock. The histogram displays the distribution of price
changes.
6
Displaying Data Distributions
Histograms
• If we use too few bins, we lose information.
• If we use too many bins, the overall shape of the histogram will be lost.
7
Displaying Data Distributions
Stem-and-Leaf Displays
Stem-and-leaf displays are like histograms, but they also give the individual values.
How do stem-and-leaf displays work?
1) The leftmost digit (or digits) of a number form the stem.
2) The next digit (the rightmost digit) of the number is the leaf.
If a set of data has only two digits, the stem is the value on the left and the leaf is the
value on the right.
For example, for the number 21, we would write 2 | 1 with 2 serving as the stem and 1
as the leaf.
8
Displaying Data Distributions
Stem-and-Leaf Displays
Stem-and-leaf displays:
• Sort the numbers in ascending order
• Put each stem in a row, and then branch out its leaves to the right. The leaf values are arranged
in ascending order. For a negative stem, leaf values are arranged in descending order.
Example: Show how to display the data 21, 22, 24, 33, 33, 36, 38, 41 in a stem-and-leaf display
21: 2 is the stem and 1 is the leaf.
38: 3 is the stem and 8 is the leaf.
2 | 124
3 | 3368
4 | 1
Key: 2|1 represents 21
Note: If you turn your head sideways to look at the display, it resembles the histogram
for the same data
9
Displaying Data Distributions
Stem-and-Leaf Displays
Example: Show how to display the data 11, 12, 14, 25, 26, 27, 31, 46, 47, 48, 49, 51,
52, 55, 56, 57, 61, 62, 73, 75, 89 in a stem-and-leaf display
1|1 represents 11
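One way to build this display is sketched below in Python. The function name `stem_and_leaf` is an illustrative assumption, and the sketch only handles non-negative two-digit whole numbers (stem = tens digit, leaf = units digit).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the display lines for non-negative two-digit whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts the leaves in ascending order
        leaves[v // 10].append(v % 10)  # stem = tens digit, leaf = units digit
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

data = [11, 12, 14, 25, 26, 27, 31, 46, 47, 48, 49, 51,
        52, 55, 56, 57, 61, 62, 73, 75, 89]
for line in stem_and_leaf(data):
    print(line)
```

Every stem between the smallest and the largest is printed, even if it has no leaves, so that gaps in the data remain visible.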
10
Displaying Data Distributions
Before making a histogram or a stem-and-leaf display, the Quantitative Data
Condition must be satisfied: the data values are of a quantitative variable whose units
are known.
11
Shape
When describing a distribution, attention should be paid to
• its shape
• its centre
• its spread
We describe the shape of a distribution in terms of its modes, its symmetry, and
whether it has any gaps or outlying values.
12
Shape
Modes
The mode is the value that appears most frequently.
13
Shape
Modes
14
Shape
Modes
A distribution whose histogram doesn’t appear to have any clear mode and in which all
the bars are approximately the same height is called a uniform distribution.
In an approximately uniform
distribution, bars are all about the
same height. The histogram does
not have a clearly defined mode.
15
Shape
Symmetry
A distribution is symmetric if the halves on either side of the centre look, at least
approximately, like mirror images.
16
Shape
Skewed
The thinner ends of a distribution are called the tails. If one tail stretches out farther
than the other, the distribution is said to be skewed to the side of the longer tail. The
distribution below is skewed to the right.
17
Shape
Outliers
Outliers are extreme values that do not appear to belong with the rest of the data. They may be
unusual values that deserve further investigation, or they may simply be mistakes.
Always be careful to point out the outliers in a distribution: those values that stand off
away from the body of the distribution. Outliers …
• can affect every statistical method we will study
• can be the most informative part of your data
• may be an error in the data (find the error and correct it)
• should be discussed in any conclusions drawn about the data
18
Shape
Characterizing the shape of a distribution is often a judgment call.
Understanding the data and how they arose can help.
Look at a histogram at several different bin widths to see how persistent some of the
features are. For example:
➢ Does the gap you see in the histogram really reveal that you have two subgroups, or
will it go away if you change the bin width slightly?
➢ Are those observations at the high end of the histogram truly unusual, or are they
just the largest ones at the end of a long tail?
Think about the data, where they came from, and what kinds of questions you hope to
answer from them.
19
Centre
The mean (sometimes called the arithmetic mean) is a natural summary of the centre
of a unimodal and symmetric distribution.
To find the mean of the variable y, add all the values of the variable and divide that
sum by the number of data values, n.
20
Centre
The hourly salaries of a sample of employees in a restaurant are: $23, $13, $21, $16,
$19, $19, $18, $18, $19, $16, and $21. Calculate the mean salary of these employees.
\[ \bar{y} = \frac{\sum y}{n} = \frac{23+13+21+16+19+19+18+18+19+16+21}{11} = \frac{203}{11} = 18.45 \]
21
Centre
If a distribution is skewed, contains gaps, or contains outliers, the mean can be
misleading. It might then be better to use the median, the value that splits the
histogram into two equal areas.
The median is halfway through the list of numbers after they have been ordered from
the minimum to the maximum, i.e., the median splits the distribution in half
numerically.
22
Centre
1. For an even number of values, the median is the average of the two middle
values in the data set arranged in ascending order.
2. For an odd number of values, the median is the middle value of the data set arranged
in ascending order.
Follow these steps to find the median:
1. Order the values from minimum to maximum.
2. Calculate \( i = n/2 \).
• If \( i = n/2 \) is an integer, take the average of the \(i\)th and \((i+1)\)st values.
• If \( i = n/2 \) is not an integer, round it up to the next integer. This is the
median location; take the value at this location.
23
Centre
Example: The ages for a sample of seven teachers are:
29, 36, 31, 30, 32, 35, 40
Find the median age.
Solution:
1. Arrange the data in ascending order: 29, 30, 31, 32, 35, 36, 40
2. n = 7, so \( i = n/2 = 7/2 = 3.5 \). Round it up to 4; the fourth number in the sorted data set
is the median.
Thus, the median is 32.
24
Centre
Example: The hourly wages of six teachers, in $, are: 56, 65, 70, 70, 64, 61
Find the median wage.
Solution:
Arranging the data in ascending order gives: 56, 61, 64, 65, 70, 70
◦ n = 6, so \( i = n/2 = 6/2 = 3 \), and the median is the average of the 3rd and 4th numbers in the
ordered data set. The median is \( (64 + 65)/2 = 64.5 \), that is, $64.50.
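The location rule used in the two examples above can be sketched in Python. The function name `median_by_location` is an illustrative assumption; the logic follows the steps on the slide (compute i = n/2, then either average two values or round up).

```python
import math

def median_by_location(values):
    """Median via the slide's rule: i = n/2, average if whole, otherwise round up."""
    ordered = sorted(values)
    n = len(ordered)
    i = n / 2
    if i.is_integer():
        i = int(i)
        # average of the i-th and (i+1)-st values (1-based positions)
        return (ordered[i - 1] + ordered[i]) / 2
    location = math.ceil(i)       # median location, 1-based
    return ordered[location - 1]

print(median_by_location([29, 36, 31, 30, 32, 35, 40]))  # 32
print(median_by_location([56, 65, 70, 70, 64, 61]))      # 64.5
```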
25
Centre
If a distribution is roughly symmetric, we’d expect the mean and median to be close.
➢ For a right-skewed distribution: 1) the mean is larger than the median, 2) more values are
located on the left side of the distribution, and 3) the right tail is longer.
➢ For a left-skewed distribution: 1) the mean is smaller than the median, 2) more values are
located on the right side of the distribution, and 3) the left tail is longer.
26
Centre
Figure: The median splits the area of the
histogram in half at $8619. Because the
distribution is skewed to the right, the
mean $10,260 is higher than the median.
The points at the right in the tail of the
data distribution have pulled the mean
toward them, away from the median.
27
Centre
Geometric mean
The geometric mean has many applications in business. For example, it is useful when we are
interested in finding the average rate of growth of an investment.
In general, we find the geometric mean of a set of n numbers \(a_1, a_2, \dots, a_n\) by multiplying them
together and taking the nth root of the product.
\[ \text{Geometric mean} = \sqrt[n]{a_1 a_2 \cdots a_n} = (a_1 \times a_2 \times \cdots \times a_n)^{1/n} \]
28
Centre
Geometric mean
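Suppose you put $1000 into an investment that grows 10% in the first year, 20% in
the second year, and 60% in the third year. The average rate of growth of your
investment is not (10 + 20 + 60)/3 = 30%.
Year   Growth   Value ($)
1      10%      1100.00
2      20%      1320.00
3      60%      2112.00
The average growth rate comes from the geometric mean of the yearly growth factors:
\[ r = (1.10 \times 1.20 \times 1.60)^{1/3} - 1 = 28.3\% \]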
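The investment example can be checked with a short Python sketch. The variable names are illustrative assumptions; the growth rates and the $1000 starting amount come from the slide.

```python
import math

growth_rates = [0.10, 0.20, 0.60]            # yearly growth from the example
factors = [1 + r for r in growth_rates]      # growth factors 1.10, 1.20, 1.60

# Geometric mean of the growth factors: nth root of their product
geometric_mean = math.prod(factors) ** (1 / len(factors))
average_growth = geometric_mean - 1

final_value = 1000 * math.prod(factors)      # value of $1000 after three years

print(f"{average_growth:.1%}")   # 28.3%
print(f"{final_value:.2f}")      # 2112.00
```

Growing $1000 at 28.3% per year for three years gives roughly the same final value, which is why the geometric mean, not the arithmetic mean of 30%, is the average growth rate.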
29
Spread
We need to determine how spread out the data are because the more the data vary,
the less a measure of centre can tell us.
One simple measure of spread is the range, defined as the difference between the
extremes.
30
Spread
Quartiles divide a set of observations into four equal parts.
The first quartile, Q1, is the median of the first half of the sorted data points from lowest to
highest.
The third quartile, Q3, is the median of the second half of the sorted data points from lowest to
highest.
The interquartile range (IQR) is defined to be the difference between the two quartiles:
IQR = Q3 − Q1
31
Spread
The number of downloads per hour has
been collected for the past 24 hours.
32
Spread
Sorted data: 2 3 5 6 10 12 14 17 18 18 18 19
20 20 21 22 23 24 25 27 28 30 30 36
The first quartile, Q1, is the median of the first 12 data points (i.e., the average of the
sixth and seventh numbers): \( Q_1 = (12 + 14)/2 = 13 \) downloads per hour
The third quartile, Q3, is the median of the last 12 data points (i.e., the average of the
18th and 19th numbers): \( Q_3 = (24 + 25)/2 = 24.5 \) downloads per hour
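A minimal Python sketch of the quartile calculation above, using the median-of-halves definition from the slides. The function name `quartiles` is an illustrative assumption, and so is the handling of an odd number of values (the middle value is left out of both halves); the slides only show the even case.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # first half of the sorted data
    upper = ordered[-half:]    # second half (middle value excluded when n is odd)
    return median(lower), median(upper)

downloads = [2, 3, 5, 6, 10, 12, 14, 17, 18, 18, 18, 19,
             20, 20, 21, 22, 23, 24, 25, 27, 28, 30, 30, 36]
q1, q3 = quartiles(downloads)
iqr = q3 - q1
print(q1, q3, iqr)  # 13.0 24.5 11.5
```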
33
34
Spread
Taking into account how far each value is from the mean gives a powerful measure of
the spread of a distribution.
The average of the squared deviations of the values of the variable y from the mean is
called the variance and is denoted by \(s^2\).
Sample variance: \( s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \)
Population variance: \( \sigma^2 = \frac{\sum (y - \mu)^2}{N} \)
Spread
The variance plays an important role in measuring spread, but the units are the square
of the original units of the data.
Taking the square root of the variance corrects this issue and gives us the standard
deviation, which is denoted by s.
Sample standard deviation: \( s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \)
35
Spread
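                     Population                                   Sample
Variance             \( \sigma^2 = \frac{\sum (y - \mu)^2}{N} \)   \( s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \)
Standard deviation   \( \sigma = \sqrt{\sigma^2} \)                \( s = \sqrt{s^2} \)
Size                 N                                            n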
36
Spread
For the sample data set 10, 20, 15, 40, 77, 90, compute the mean, variance, and standard
deviation.
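One way to work this exercise is sketched below in Python, applying the sample formulas directly (dividing by n − 1). The variable names are illustrative assumptions.

```python
import math

data = [10, 20, 15, 40, 77, 90]
n = len(data)

mean = sum(data) / n
squared_deviations = [(y - mean) ** 2 for y in data]
sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(mean)                  # 42.0
print(sample_variance)       # 1154.0
print(round(sample_sd, 2))   # 33.97
```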
37
Spread
The data:
2 3 5 6 10 12 14 17 18 18 18 19
20 20 21 22 23 24 25 27 28 30 30 36
\[ \text{Mean} = \bar{y} = \frac{2 + 3 + 5 + \cdots + 30 + 36}{24} = \frac{448}{24} = 18.7 \]
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]
38
Spread
Computational Formulas:
\[ \sigma^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{N}}{N} \qquad s^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1} \]
39
Spread
Sample: 1, 2, 4, 5, 8
n = 5
Data y    y^2
1         1^2 = 1
2         2^2 = 4
4         4^2 = 16
5         5^2 = 25
8         8^2 = 64
Total     \(\sum y = 20\)    \(\sum y^2 = 110\)
Computational formula:
\[ s^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1} \]
Sample variance:
\[ s^2 = \frac{110 - \frac{20^2}{5}}{4} = \frac{30}{4} = 7.5 \]
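The computational formula can be checked against the definition with a short Python sketch. The function names are illustrative assumptions; both functions compute the sample variance.

```python
def sample_variance_shortcut(values):
    """Computational formula: (sum of y^2 - (sum of y)^2 / n) / (n - 1)."""
    n = len(values)
    sum_y = sum(values)
    sum_y2 = sum(y * y for y in values)
    return (sum_y2 - sum_y ** 2 / n) / (n - 1)

def sample_variance_definition(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

sample = [1, 2, 4, 5, 8]
print(sample_variance_shortcut(sample))    # 7.5
print(sample_variance_definition(sample))  # 7.5
```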
40
Spread
The coefficient of variation (CV) measures how much variability exists compared with the mean.
\[ CV = \frac{\text{Standard deviation}}{\text{Mean}} = \frac{s}{\bar{y}} \]
Example: For the sample data set 10, 20, 15, 40, 77, 90, compute the coefficient of variation.
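A short Python sketch of this example, using the standard library's `statistics` module for the sample standard deviation and the mean. The variable names are illustrative assumptions.

```python
from statistics import mean, stdev

data = [10, 20, 15, 40, 77, 90]

s = stdev(data)       # sample standard deviation (divides by n - 1)
y_bar = mean(data)    # sample mean
cv = s / y_bar

print(round(cv, 3))   # 0.809
```

A CV of about 0.81 says the standard deviation is roughly 81% of the mean, so these data vary a great deal relative to their centre.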
41
Spread
During the period October 2, 2014, to November 13, 2014, the daily closing prices of
the Toronto-Dominion Bank (TD) and the Canadian Imperial Bank of Commerce (CIBC)
had the means and standard deviations given in the following table:
42
Reporting the Shape, Centre, and Spread
Which measures of centre and spread should be used for a distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard deviation and
possibly the median and IQR should be reported.
• If there are multiple modes, try to determine if the data can be split into separate
groups.
• If there are unusual observations, point them out and report the mean and standard
deviation with and without the values.
• Always pair the median with the IQR and the mean with the standard deviation.
43
Adding Measures of Centre and Spread
In many cases we are interested in the centre and spread of a combination of two or
more distributions.
A company may have a two-stage production process. Each stage has a unique
distribution. However, we want to understand the entire production process.
The mean of the entire process is the sum of the means of the two stages.
44
Adding Measures of Centre and Spread
The mean of the entire process is the sum of the means of the two stages. However,
the same is not true for the median or the mode.
Similarly, only under certain circumstances can the variance be computed as the sum
of the variances. No other measures of spread can be added.
Stage 1 100 20 18 17 5 3 9
Stage 2 100 30 26 25 6 4 16
45
Grouped Data
Often data is grouped into ranges prior to data collection.
Example: We are interested in understanding how much of a premium people are willing
to pay in order to “Buy Canadian”. The following table summarizes the result of a survey
conducted by a marketing research company.
0 23%
1–5 14%
6–10 23%
11–19 8%
20 or more 17%
No answer 15%
Table - How much extra Canadians would be prepared to pay to purchase products made in Canada.
46
Grouped Data
To calculate the mean of grouped data we use the midpoint of the ranges in our
calculations.
The calculation of the average extra amount Canadians are prepared to pay in order to
buy Canadian products is shown in the table below. (For the last entry we chose a
midpoint of 30%.)
Range ($)     Midpoint ($)   % of Sample   MidPt × %
0             0              23%           0.00
1–5           3              14%           0.42
6–10          8              23%           1.84
11–19         15             8%            1.20
20 or more    30             17%           5.10
                             Mean          $8.56
47
Grouped Data
To calculate the variance of grouped data we also use the midpoint of the ranges in
our calculations. Once we have the variance, we take its square root to get the
standard deviation.
Range ($)   Midpoint ($)   % of Sample   MidPt × %   (MidPt − Mean)^2 × %
0           0              23%           0.00        0.001685
1–5         3              14%           0.42        0.000433
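The grouped-data calculations can be sketched in Python as follows. This is a sketch under stated assumptions: the variable names are illustrative, and the midpoints are written as fractions (3 becomes 0.03), because that scale reproduces the variance terms shown in the table. The shares add up to 85%, since 15% of respondents gave no answer, and they are used as-is, as in the slide's calculation.

```python
import math

# (midpoint, share of sample); midpoints as fractions, an assumption about scale
groups = [(0.00, 0.23), (0.03, 0.14), (0.08, 0.23), (0.15, 0.08), (0.30, 0.17)]

# Grouped mean: sum of midpoint x share
grouped_mean = sum(mid * share for mid, share in groups)

# Grouped variance: sum of (midpoint - mean)^2 x share, then square root for the SD
terms = [(mid - grouped_mean) ** 2 * share for mid, share in groups]
grouped_variance = sum(terms)
grouped_sd = math.sqrt(grouped_variance)

print(round(grouped_mean, 4))             # 0.0856
print([round(t, 6) for t in terms[:2]])   # [0.001685, 0.000433]
```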
48
Five-Number Summary and Boxplots
The five-number summary of a distribution reports its median, quartiles, and extremes
(maximum and minimum).
It provides a good overall summary of the distribution of data.
The five-number summary of NYSE daily volume (in billions of shares) for 2006 is given below.
Max 3.287
Upper Quartile, Q3 1.972
Median 1.824
Lower Quartile, Q1 1.675
Min 0.616
Table: The five-number summary of NYSE daily volume (in billions of shares)
49
Five-Number Summary and Boxplots
Once we have a five-number summary of a variable, we can display that information in
a boxplot. To make a boxplot:
1) Draw a single vertical axis spanning the extent of the data
2) Draw short horizontal lines at the lower and upper quartiles and at the median.
Then connect them with vertical lines to form a box
50
Five-Number Summary and Boxplots
3) Erect (but don’t show in the final plot) “fences” around the main part of the data,
placing the upper fence 1.5 IQRs above the upper quartile and the lower fence
1.5 IQRs below the lower quartile.
Max                  3.287
Upper Quartile, Q3   1.972
Median               1.824
Lower Quartile, Q1   1.675
IQR = 0.297
1.5 IQR = 0.4455
Q3 + 1.5 IQR = 2.4175
51
Five-Number Summary and Boxplots
4) Draw lines (whiskers) from each end of the box up and down to the most extreme
data values found within the fences.
52
Five-Number Summary and Boxplots
5) Add any outliers by displaying data values that lie beyond the fences with special symbols.
53
Five-Number Summary and Boxplots
Example:
• Find the interquartile range. Q1 = 7, Q3 = 11
IQR = Q3 − Q1 = 11 − 7 = 4
• Find the maximum and minimum values of the data set.
min = 1, max = 18
• Find the upper and lower fences.
Upper fence = Q3 + 1.5 IQR = 11 + 1.5(4) = 17
Lower fence = Q1 − 1.5 IQR = 7 − 1.5(4) = 1
Any data values below the lower fence or above the upper fence are
outliers. Here 18 lies above the upper fence of 17, so it is an outlier.
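The fence calculation in this example can be sketched in Python. The function names `fences` and `is_outlier` are illustrative assumptions; a value exactly on a fence is not treated as an outlier here, since only values beyond the fences are flagged.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value, lower, upper):
    """A value is an outlier if it lies beyond either fence."""
    return value < lower or value > upper

lower, upper = fences(q1=7, q3=11)
print(lower, upper)                   # 1.0 17.0
print(is_outlier(18, lower, upper))   # True: the maximum lies above the upper fence
print(is_outlier(1, lower, upper))    # False: the minimum sits exactly on the lower fence
```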
54
Five-Number Summary and Boxplots
The centre of a boxplot shows the middle half of the data
between the quartiles – the height of the box equals the IQR
If the median is roughly centered between the quartiles, then
the middle half of the data is roughly symmetric. If it is not
centered, the distribution is skewed
The whiskers show skewness as well if they are not roughly
the same length
The outliers are displayed individually to keep them out of the
way in judging skewness and to display them for special
attention
55
Percentiles
A percentile shows where a given percentage of the data lies.
For example, if your mark in this course is 82%, that tells us that you got 82% of
questions right. If your mark is at the 82nd percentile, by contrast, that tells us how
your mark compares with other students’ marks: 82% of the marks lie below yours.
The first quartile is Q1. Twenty-five percent of the data lies below Q1, and another name
for Q1 is “25th percentile.”
The second quartile is the median. Fifty percent of the data lies below the median, and
another name for the median is “50th percentile.”
The third quartile is Q3. Seventy-five percent of the data lies below Q3, and another
name for Q3 is “75th percentile.”
56
Percentiles
Suppose the numbers of passengers on 12 flights from Ottawa to Iqaluit are
24, 18, 31, 27, 15, 16,
26, 15, 24, 26, 25, 30.
Step 1. We first put the data in ascending order, getting
15, 15, 16, 18, 24, 24,
25, 26, 26, 27, 30, 31.
Step 2: Suppose we want to calculate the 80th percentile of this data. Since there are 12 data
values, we first calculate 80% of 12, which is 9.6. Since 9.6 is not an integer, we round it up to
10 and the 80th percentile is the 10th data value, or 27.
What is the 50th percentile of this data?
Since there are 12 data values, we first calculate 50% of 12, which is 6. Since 6 is an integer, the
50th percentile is the average of 6th and 7th data value, or (24+25)/2 = 24.5.
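The percentile rule used in these two calculations can be sketched in Python. The function name `percentile` is an illustrative assumption, and the sketch takes p as a whole number between 1 and 99 so that the check "is p% of n an integer?" can be done with exact integer arithmetic.

```python
def percentile(values, p):
    """p-th percentile by the slide's rule; p is a whole number from 1 to 99."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)   # p% of n, split into whole part and leftover
    if remainder == 0:
        # p% of n is an integer i: average the i-th and (i+1)-st values (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise round up to the next position and take that value
    return ordered[whole]

passengers = [24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25, 30]
print(percentile(passengers, 80))  # 27
print(percentile(passengers, 50))  # 24.5
```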
57
Comparing Groups
In attempting to understand the data, we may want to look for patterns, differences, and
trends over different time periods
We may want to split the data in half and display histograms for each half. Histograms for the
NYSE data split into six-month halves are shown below.
58
Comparing Groups
Histograms work well for comparing two groups, but boxplots tend to offer better results when
side-by-side comparison of several groups is sought.
Below, the NYSE data are displayed in monthly boxplots.
59
Dealing with Outliers
What should be done with outliers?
They should be understood in the context of the data. An outlier for a year of data may
not be an outlier for the month in which it occurred and vice versa.
They should be investigated to determine if they are in error. The values may have
simply been entered incorrectly. If a value can be corrected, it should be.
They should be investigated to determine why they are so different from the rest of
the data. For example, were extra sales or fewer sales seen because of a special event
like a holiday?
60
Standardizing
Often, we wish to compare very different variables.
To do so, the values are standardized by measuring how far they are from the mean.
We measure the distance from the mean with the standard deviation, and the result is
the standardized value which records how many standard deviations each value is above
or below the overall mean.
61
Standardizing
Example: Compare two companies (from the “top” 100 companies) with respect to the
variables New Jobs (jobs created) and Average Pay.
Starbucks created over 2000 jobs and has an average salary of $44,790 while Wrigley
created 16 jobs and has an average salary of $56,351.
For all 100 companies, the mean number of new jobs created was 305.9 and the average
salary was $73,299.42.
62
Standardizing
Example (continued): We first display the data for all 100 companies in a stem-and-leaf
display.
Figure: Stem-and-leaf displays for both the number of New Jobs
created and the Average Pay of salaried employees at the top 100
companies to work for in 2005 from Fortune magazine. Starbucks (in red)
created more jobs, but Wrigley (in blue) did better in average pay.
Which company did better for both variables combined?
63
Standardizing
Example (continued): To compare the two companies based on these variables, we find
the mean and standard deviation for all 100 companies.
Variable Mean SD
New Jobs 305.9 1507.97
Avg. Pay $73,299.42 $34,055.25
To quantify how well each of the companies did and to combine the two
scores, we’ll determine how many standard deviations they each are
from the variable means.
64
Standardizing
To find how many standard deviations a value is from the mean we calculate a
standardized value or z-score.
\[ z = \frac{y - \bar{y}}{s} \]
For example, a z-score of 2.0 indicates that a data value is two standard
deviations above the mean and a z-score of −2 indicates that the data
value is two standard deviations below the mean.
A rule of thumb for identifying outliers is 𝑧 > 3 or 𝑧 < −3
65
66
Standardizing
Example (continued): Computing the z-scores for these variables for Starbucks and
Wrigley we obtain the results summarized below. For each variable, the z-score for each
observation is found by subtracting the mean from the value and then dividing that
difference by the standard deviation.
                       New Jobs                            Average Pay
Mean (all companies)   305.9                               $73,299.42
SD                     1507.97                             $34,055.25
Starbucks              2193                                $44,790
z-score                1.25 = (2193 − 305.9)/1507.97       −0.84 = (44,790 − 73,299.42)/34,055.25
Wrigley                16                                  $56,351
z-score                −0.19 = (16 − 305.9)/1507.97        −0.50 = (56,351 − 73,299.42)/34,055.25
By this method, we get that Starbucks would be ranked higher based on these two variables.
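The comparison can be reproduced with a short Python sketch. The function and variable names are illustrative assumptions, and so is the way the two scores are combined: the sketch simply adds each company's two z-scores.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# (mean, SD) for all 100 companies, from the table above
summary = {"new_jobs": (305.9, 1507.97), "avg_pay": (73299.42, 34055.25)}

companies = {
    "Starbucks": {"new_jobs": 2193, "avg_pay": 44790},
    "Wrigley": {"new_jobs": 16, "avg_pay": 56351},
}

for name, values in companies.items():
    scores = {var: z_score(v, *summary[var]) for var, v in values.items()}
    total = sum(scores.values())   # assumed way of combining the two scores
    print(name, {var: round(z, 2) for var, z in scores.items()}, round(total, 2))
```

Under this assumption Starbucks has the higher combined score, because its large positive z-score for New Jobs outweighs its below-average pay.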
Time Series Plots
A display of values against time is sometimes called a time series plot. Below we
have a time series plot of the NYSE daily volumes for 2006.
67
Time Series Plots
Time series plots often show a great deal of
point-to-point variation, but general patterns
do emerge from the plot.
Time series plots may be drawn with the
points connected. The NYSE data from
before is displayed this way, see the figure.
69
Time Series Plots
Example: Consider the time series plot for the
daily price changes of Bell Canada stock seen
earlier. The histogram showed a symmetric,
possibly unimodal distribution.
The time series plot shows a period of relatively
small price changes followed by several years of
extreme volatility.
Figure: A time series plot of daily Bell Canada stock price changes.
70
Time Series Plots
When a time series is stationary (without a strong trend or change in variability), then
a histogram can provide a useful summary.
A histogram may fail to summarize a distribution with extreme behaviour changes
over time; a time series plot would be more informative.
71
What Can Go Wrong?
Don’t make a histogram of a categorical variable. The histogram below of policy
numbers is not at all informative.
72
What Can Go Wrong?
• Choose a scale appropriate to the data
• Avoid inconsistent scales. Don’t change scales in the middle of a plot, and compare
groups on the same scale
• Label variables and axes clearly
• Do a reality check. Make sure the calculated summaries make sense
• Don’t compute numerical summaries of a categorical variable
• Watch out for multiple modes. If the data has multiple modes, consider separating
the data
• Beware of outliers
73
What Have We Learned?
• To display and summarize quantitative data
• To summarize distributions of quantitative variables numerically
• To calculate measures of spread (range, IQR, and standard deviation)
• To compare groups and look for patterns among groups and over time
• To identify and investigate outliers
• To standardize data and understand its power.
• To graph data and look for trends both by eye and with a data smoother.
74
The slides are a combination of the material from your text and Pearson resources as
well as original material.
75