W4_Lecture slides

The document outlines the key concepts of displaying and describing quantitative data in statistics, focusing on histograms, stem-and-leaf displays, and measures of center and spread. It covers how to create visual representations of data, calculate mean and median, identify outliers, and understand the shape of data distributions. Additionally, it emphasizes the importance of understanding data spread through measures such as range and interquartile range (IQR).

Uploaded by

J.C.
Copyright
© © All Rights Reserved
STAT 4001

STATISTICS I FOR ANALYTICS


WEEK 4 - DISPLAYING AND DESCRIBING QUANTITATIVE DATA

Maryam Zangiabadi
Ch. 5: Displaying and Describing
Quantitative Data
Learning Objectives
1) Display and summarize quantitative data
2) Display data in a histogram and in a stem-and-leaf diagram
3) Estimate the “centre” of the data distribution; calculate measures of centre (mean, median)
4) Estimate the spread of the data distribution; calculate measures of spread (range, IQR, and standard deviation)
5) Graph the centre of the data distribution and the extent to which it’s spread in a “boxplot”
6) Identify outliers
7) Standardize data relative to its spread
8) Graph time series data
2
Displaying Data Distributions
Histograms
A histogram uses adjacent bars to show the distribution of values in a quantitative
variable. Each bar represents the frequency (relative frequency) of values falling in an
interval of values.
A bin is one of the groups of values on the horizontal axis of the histogram. A
histogram plots the bin counts as the height of bars, and it describes the overall
“shape” of data.
If our focus is on the overall pattern of how the values are distributed rather than on
the counts themselves, it can be useful to make a relative frequency histogram,
replacing the counts (frequencies) on the vertical axis with the percentage of the total
number of cases falling in each bin.

3
Displaying Data Distributions
Histograms
How do histograms work?
1) Decide how wide to make the bins. If there are n data points, use \( \log_2 n \) for the
number of bins. If we have n = 29 data points, \( \log_2 29 = 4.86 \). We round 4.86 to 5,
so we use 5 bins. The formula in Excel is LOG(29,2).
2) Determine the count for each bin.
3) Decide where to place values that land on the endpoint of a bin. For example, does
a value of $5 go into the $0 to $5 bin or the $5 to $10 bin? The standard rule is to
place such values in the higher bin.
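The three steps above can be sketched in Python. This is a minimal illustration, not the method of any particular software package: the function names, the sample dollar amounts, and the choice to keep a value that sits exactly on the last upper edge in the last bin are all assumptions made for the example.

```python
import math

def number_of_bins(n):
    """Rule from the slide: round log2(n) to the nearest whole number."""
    return round(math.log2(n))

def bin_counts(values, start, width, bins):
    """Count values per bin; a value on a shared endpoint goes to the higher bin.

    Assumes every value is at least `start`.
    """
    counts = [0] * bins
    for v in values:
        index = math.floor((v - start) / width)
        # Assumption: a value on the very last upper edge stays in the last bin.
        index = min(index, bins - 1)
        counts[index] += 1
    return counts

k = number_of_bins(29)

# Hypothetical dollar amounts, binned as $0 to $5, $5 to $10, $10 to $15.
amounts = [0.5, 2.0, 5.0, 5.0, 7.5, 10.0, 12.0, 15.0]
counts = bin_counts(amounts, start=0, width=5, bins=3)
print(k, counts)
```

The two values of exactly $5 are counted in the $5 to $10 bin, as the endpoint rule requires.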

4
Displaying Data Distributions
Histograms
This table shows daily price changes in
Bell Canada stock for the period
September 12 to October 24, 2014.

5
Displaying Data Distributions
Histograms
Daily price changes of Bell Canada stock. The histogram displays the distribution of price
changes.

6
Displaying Data Distributions
Histograms
• If we use too few bins, we lose information.
• If we use too many bins, the overall shape of the histogram will be lost.

7
Displaying Data Distributions
Stem-and-Leaf Displays
Stem-and-leaf displays are like histograms, but they also give the individual values.
How do stem-and-leaf displays work?
1) The leftmost digit or digits of a number form the stem.
2) The next digit (the rightmost digit) of the number is the leaf.

If a set of data has only two digits, the stem is the value on the left and the leaf is the
value on the right.
For example, for the number 21, we would write 2 | 1 with 2 serving as the stem and 1
as the leaf.

8
Displaying Data Distributions
Stem-and-Leaf Displays
Stem-and-leaf displays:
• Sort the numbers in ascending order
• Put each stem in a row, and then branch out its leaves to the right. The leaf values are arranged
in ascending order. For a negative stem, leaf values are arranged in descending order.
Example: Show how to display the data 21, 22, 24, 33, 33, 36, 38, 41 in a stem-and-leaf display.
For 21, 2 is the stem and 1 is the leaf. For 38, 3 is the stem and 8 is the leaf.
2 | 1 2 4
3 | 3 3 6 8
4 | 1
2|1 represents 21
Note: If you turn your head sideways to look at the display, it resembles the histogram
for the same data

9
Displaying Data Distributions
Stem-and-Leaf Displays
Example: Show how to display the data 11, 12, 14, 25, 26, 27, 31, 46, 47, 48, 49, 51,
52, 55, 56, 57, 61, 62, 73, 75, 89 in a stem-and-leaf display

1|1 represents 11
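A stem-and-leaf display for this data can be generated with a short Python sketch. The function name `stem_and_leaf` is hypothetical, and the code assumes non-negative whole numbers where the units digit is the leaf and everything to its left is the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows: the units digit is the leaf, the rest is the stem."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sort so leaves appear in ascending order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    rows = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {' '.join(leaves.get(stem, []))}".rstrip())
    return rows

data = [11, 12, 14, 25, 26, 27, 31, 46, 47, 48, 49, 51,
        52, 55, 56, 57, 61, 62, 73, 75, 89]
rows = stem_and_leaf(data)
for row in rows:
    print(row)
```

Printing empty stems keeps the gaps in the data visible, which matters when the display is read as a sideways histogram.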

10
Displaying Data Distributions
Before making a histogram or a stem-and-leaf display, the Quantitative Data
Condition must be satisfied: the data values are of a quantitative variable whose units
are known.

11
Shape
When describing a distribution, attention should be paid to
• its shape
• its centre
• its spread
We describe the shape of a distribution in terms of its modes, its symmetry, and
whether it has any gaps or outlying values.

12
Shape
Modes
The mode is the value that appears most frequently.

Peaks seen in a histogram are called the modes of a distribution:


• A distribution whose histogram has one main peak is called unimodal,
• A distribution whose histogram has two peaks is called bimodal
• A distribution whose histogram has three or more peaks is called multimodal

13
Shape
Modes

A bimodal distribution has two apparent modes.

14
Shape
Modes
A distribution whose histogram doesn’t appear to have any clear mode and in which all
the bars are approximately the same height is called a uniform distribution.

In an approximately uniform
distribution, bars are all about the
same height. The histogram does
not have a clearly defined mode.

15
Shape
Symmetry
A distribution is symmetric if the halves on either side of the centre look, at least
approximately, like mirror images.

An approximately symmetric histogram can be folded in the middle so that the two sides almost match.

16
Shape
Skewed
The thinner ends of a distribution are called the tails. If one tail stretches out farther
than the other, the distribution is said to be skewed to the side of the longer tail. The
distribution below is skewed to the right.

17
Shape
Outliers
Outliers are extreme values that do not appear to belong with the rest of the data. They may be
unusual values that deserve further investigation or just mistakes.
Always be careful to point out the outliers in a distribution: those values that stand off
away from the body of the distribution. Outliers …
• can affect every statistical method we will study
• can be the most informative part of your data
• may be an error in the data (find the error and correct it)
• should be discussed in any conclusions drawn about the data

18
Shape
Characterizing the shape of a distribution is often a judgment call
Understanding the data and how they arose can help.
Look at a histogram at several different bin widths to see how persistent some of the
features are. For example:
➢ Does the gap you see in the histogram really reveal that you have two subgroups, or
will it go away if you change the bin width slightly?
➢ Are those observations at the high end of the histogram truly unusual, or are they
just the largest ones at the end of a long tail?
Think about the data, where they came from, and what kinds of questions you hope to
answer from them.

19
Centre
Mean (sometimes called the arithmetic mean) is a natural summary and the centre
point of a unimodal and symmetric distribution.
To find the mean of the variable y, add all the values of the variable and divide that
sum by the number of data values, n.

\[ \text{Mean} = \frac{\text{Sum of all the values in the sample}}{\text{Number of values in the sample}} \]

We will use the Greek letter sigma to represent a sum, so the equation for finding the
mean can be written as \( \bar{y} = \frac{\sum y}{n} \), where:
\( \bar{y} \) is the sample mean. It is read “y bar”.
\( n \) is the number of values.
\( y \) represents any value.
\( \sum y \) is the sum of the \( y \) values.

20
Centre
The hourly salaries of a sample of employees in a restaurant are: $23, $13, $21, $16,
$19, $19, $18, $18, $19, $16, and $21. Calculate the mean salary of these employees.

\[ \bar{y} = \frac{\sum y}{n} = \frac{23+13+21+16+19+19+18+18+19+16+21}{11} = \frac{203}{11} = 18.45 \]

The mean is $18.45.
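The same calculation can be checked in a few lines of Python. This is only a sketch of the formula; the variable names are chosen for the example.

```python
# Hourly salaries ($) from the restaurant example
wages = [23, 13, 21, 16, 19, 19, 18, 18, 19, 16, 21]

total = sum(wages)   # sum of all the values
n = len(wages)       # number of values
mean = total / n     # y-bar = (sum of y) / n

print(total, n, round(mean, 2))
```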

21
Centre
If a distribution is skewed, contains gaps, or contains outliers, the mean can be
misleading. In that case it might be better to use the median, the value that splits the
histogram into two equal areas.
The median is halfway through the list of numbers after they have been ordered from
the minimum to the maximum, i.e., the median splits the distribution in half
numerically.

The median is said to be resistant because it is not affected by unusual observations or
by the shape of the distribution.

22
Centre
1. For an even number of values, the median is the average of the two middle
numbers in the data set arranged in ascending order.
2. For an odd number of values, the median is the middle number of the data set arranged
in ascending order.
Follow these steps to find the median:
1. Order the values from minimum to maximum.
2. Calculate \( i = \frac{n}{2} \).
• If \( i \) is an integer, take the average of the \( i \)th and \( (i+1) \)st values.
• If \( i \) is not an integer, round it up to the next integer. This is the
median location; take the value at this location.

23
Centre
Example: The ages for a sample of seven teachers are:
29, 36, 31, 30, 32, 35, 40
Find the median age.
Solution:
1. Arranging the data in ascending order: 29, 30, 31, 32, 35, 36, 40
2. n = 7, \( i = \frac{n}{2} = \frac{7}{2} = 3.5 \). Round it up to 4, so the fourth number in the sorted data set
is the median.
Thus, the median is 32.

24
Centre
Example: The hourly wages of six teachers, in $, are: 56, 65, 70, 70, 64, 61
Find the median wage.
Solution:
Arranging the data in ascending order gives: 56, 61, 64, 65, 70, 70
◦ n = 6, \( i = \frac{n}{2} = \frac{6}{2} = 3 \), so the median is the average of the 3rd and 4th numbers in the
ordered data set. The median is \( \frac{64+65}{2} = 64.5 \), that is, $64.50.
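The median-location rule used in the two examples can be written as a small Python function. The name `median_by_location` is hypothetical; the logic follows the steps on the slide (compute i = n/2, average the i-th and (i+1)-st values if i is an integer, otherwise round up).

```python
import math

def median_by_location(values):
    """Median via the slide's rule based on i = n / 2."""
    ordered = sorted(values)           # step 1: order from minimum to maximum
    n = len(ordered)
    i = n / 2                          # step 2
    if i == int(i):
        i = int(i)
        # Average of the i-th and (i+1)-st values (1-based positions).
        return (ordered[i - 1] + ordered[i]) / 2
    location = math.ceil(i)            # round up to the median location
    return ordered[location - 1]

ages = [29, 36, 31, 30, 32, 35, 40]
wages = [56, 65, 70, 70, 64, 61]
print(median_by_location(ages), median_by_location(wages))
```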

25
Centre
If a distribution is roughly symmetric, we’d expect the mean and median to be close.

➢ For a right skewed distribution, 1) the mean is larger than the median, 2) More values are
located on the left side of the distribution and 3) the right tail is longer.
➢ For a left skewed distribution: 1) the mean is smaller than the median, 2) More values are
located on the right side of the distribution and 3) the left tail is longer.

26
Centre
Figure: The median splits the area of the
histogram in half at $8619. Because the
distribution is skewed to the right, the
mean $10,260 is higher than the median.
The points at the right in the tail of the
data distribution have pulled the mean
toward them, away from the median.

27
Centre
Geometric mean
The geometric mean has many applications in business. For example, it is useful when we are
interested in finding the average rate of growth of an investment.

In general, we find the geometric mean of a set of n numbers \( a_1, a_2, \ldots, a_n \) by multiplying them
together and taking the nth root of the product.

\[ \text{Geometric mean} = \sqrt[n]{a_1 a_2 \cdots a_n} = (a_1 \times a_2 \times \cdots \times a_n)^{1/n} \]

28
Centre
Geometric mean
Suppose you put $1000 into an investment that grows 10% in the first year, 20% in
the second year, and 60% in the third year. The average rate of growth of your
investment is not (10 + 20 + 60)/3 = 30%.

End of Year    Growth Rate    Value ($)
Start                         1000.00
1              10%            1100.00
2              20%            1320.00
3              60%            2112.00

\[ \text{Geometric mean} = (a_1 \times a_2 \times \cdots \times a_n)^{1/n} \]
\[ 1 + r = (1.1 \times 1.2 \times 1.6)^{1/3} = 1.283 \]
\[ r = 28.3\% \]
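The investment example can be reproduced with a short Python sketch (variable names are illustrative; `math.prod` needs Python 3.8 or later). It also shows why the arithmetic mean of 30% is wrong: compounding 30% for three years does not give the actual final value.

```python
import math

growth_rates = [0.10, 0.20, 0.60]
factors = [1 + r for r in growth_rates]           # 1.1, 1.2, 1.6

product = math.prod(factors)                      # overall growth factor
average_factor = product ** (1 / len(factors))    # geometric mean of the factors
average_rate = average_factor - 1                 # average growth rate r

final_value = 1000 * product                      # value after three years
value_at_arithmetic_mean = 1000 * 1.30 ** 3       # what a 30% average would imply

print(round(average_rate, 3), round(final_value, 2), round(value_at_arithmetic_mean, 2))
```

Growing $1000 at the geometric-mean rate for three years gives back the actual final value, while the 30% figure overshoots it.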

29
Spread
We need to determine how spread out the data are because the more the data vary,
the less a measure of centre can tell us.
One simple measure of spread is the range, defined as the difference between the
extremes.

𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 − 𝑀𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒

The range is a single value and it is not resistant to unusual observations.


Concentrating on the middle of the data avoids this problem.

30
Spread
Quartiles divide a set of observations into four equal parts.
The first quartile, Q1, is the median of the first half of the sorted data points from lowest to
highest.
The third quartile, Q3, is the median of the second half of the sorted data points from lowest to
highest.
The interquartile range (IQR) is defined to be the difference between the two quartiles:
IQR = Q3 − Q1

31
Spread
The number of downloads per hour has
been collected for the past 24 hours.

For this data set, describe the spread of the number of downloads per hour.

32
Spread
Sorted data: 2 3 5 6 10 12 14 17 18 18 18 19
20 20 21 22 23 24 25 27 28 30 30 36

The first quartile, Q1, is the median of the first 12 data points (i.e., the average of the
sixth and seventh numbers): \( Q_1 = \frac{12+14}{2} = 13 \) downloads per hour.

The third quartile, Q3, is the median of the last 12 data points (i.e., the average of the
18th and 19th numbers): \( Q_3 = \frac{24+25}{2} = 24.5 \) downloads per hour.

The interquartile range is IQR = 24.5 − 13 = 11.5 downloads per hour.
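The quartile calculation above can be sketched in Python. The helper names are hypothetical. The slides only show an even number of data points; leaving the middle value out of both halves when n is odd is an assumption of this sketch, and statistical packages use several different quartile conventions.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (needs n >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]   # assumption: for odd n the middle value is excluded
    return median(lower), median(upper)

downloads = [2, 3, 5, 6, 10, 12, 14, 17, 18, 18, 18, 19,
             20, 20, 21, 22, 23, 24, 25, 27, 28, 30, 30, 36]

q1, q3 = quartiles(downloads)
iqr = q3 - q1
data_range = max(downloads) - min(downloads)
print(q1, q3, iqr, data_range)
```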

33
34

Spread
Taking into account how far each value is from the mean gives a powerful measure of
the spread of a distribution.
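The average of the squared deviations of the values of the variable y from the mean is
called the variance. The sample variance is denoted by \( s^2 \) and the population variance by \( \sigma^2 \).

Sample variance: \( s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \)

Population variance: \( \sigma^2 = \frac{\sum (y - \mu)^2}{N} \)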
Spread
The variance plays an important role in measuring spread, but the units are the square
of the original units of the data.
Taking the square root of the variance corrects this issue and gives us the standard
deviation, which is denoted by s.

Sample standard deviation: \( s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \)

35
Spread

Symbol / Formula      Population parameter                           Sample statistic
Variance              \( \sigma^2 = \frac{\sum (y - \mu)^2}{N} \)    \( s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \)
Standard deviation    \( \sigma = \sqrt{\sigma^2} \)                 \( s = \sqrt{s^2} \)
Size                  N                                              n

36
Spread
For the sample data set 10, 20, 15, 40, 77, 90, compute the mean, variance, and standard
deviation.

Price y ($)    y − Mean          (y − Mean) squared
10             10 − 42 = −32     (−32)² = 1024
20             20 − 42 = −22     (−22)² = 484
15             15 − 42 = −27     (−27)² = 729
40             40 − 42 = −2      (−2)² = 4
77             77 − 42 = 35      35² = 1225
90             90 − 42 = 48      48² = 2304
Total 252      0                 5770

\[ \bar{y} = \frac{\sum y}{n} = \frac{252}{6} = 42 \]
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{5770}{6 - 1} = 1154 \]
\[ s = \sqrt{1154} = 33.97 \]
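The worked example follows the definitional formula step by step, which a short Python sketch can mirror (variable names are illustrative):

```python
import math

prices = [10, 20, 15, 40, 77, 90]

n = len(prices)
mean = sum(prices) / n                           # y-bar
deviations = [y - mean for y in prices]          # y - y-bar, these sum to 0
squared = [d ** 2 for d in deviations]           # (y - y-bar)^2

variance = sum(squared) / (n - 1)                # sample variance s^2
sd = math.sqrt(variance)                         # sample standard deviation s

print(mean, sum(deviations), sum(squared), variance, round(sd, 2))
```

Dividing by n − 1 rather than n is what makes this the sample variance; the population version would divide by N.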

37
Spread
The data:
2 3 5 6 10 12 14 17 18 18 18 19
20 20 21 22 23 24 25 27 28 30 30 36

The mean is 18.7 downloads per hour.

\[ \text{Mean} = \bar{y} = \frac{2 + 3 + 5 + \cdots + 30 + 36}{24} = \frac{448}{24} = 18.7 \]

The standard deviation is 8.94 downloads per hour.

\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = 8.94 \]

38
Spread
Computational Formulas:

Population variance: \( \sigma^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{N}}{N} \)

Sample variance: \( s^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1} \)

N is the population size; n is the sample size.

39
Spread
Sample: 1, 2, 4, 5, 8
n = 5

Data y    y²
1         1² = 1
2         2² = 4
4         4² = 16
5         5² = 25
8         8² = 64
Total     \( \sum y = 20 \),  \( \sum y^2 = 110 \)

Computational formula: \( s^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1} \)

Sample variance: \( s^2 = \frac{110 - \frac{20^2}{5}}{4} = \frac{30}{4} = 7.5 \)

Sample standard deviation: \( s = \sqrt{s^2} = \sqrt{7.5} = 2.74 \)
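As a check, the computational formula can be coded directly and compared with Python's standard-library `statistics.variance`, which computes the sample variance. The function name `sample_variance_shortcut` is hypothetical.

```python
import math
import statistics

def sample_variance_shortcut(values):
    """Sample variance from the sum of y and the sum of y squared."""
    n = len(values)
    sum_y = sum(values)
    sum_y2 = sum(y * y for y in values)
    return (sum_y2 - sum_y ** 2 / n) / (n - 1)

sample = [1, 2, 4, 5, 8]
s2 = sample_variance_shortcut(sample)
s = math.sqrt(s2)

# Both formulas should agree up to floating-point rounding.
agrees = math.isclose(s2, statistics.variance(sample))
print(s2, round(s, 2), agrees)
```

The shortcut needs only two running totals, which is why it is convenient for hand calculation.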

40
Spread
The coefficient of variation measures how much variability exists compared with the mean.

\[ CV = \frac{\text{Standard deviation}}{\text{Mean}} = \frac{s}{\bar{y}} \]

Example: For the sample data set 10, 20, 15, 40, 77, 90, compute the coefficient of variation.

\[ CV = \frac{\text{Standard deviation}}{\text{Mean}} = \frac{33.97}{42} = 0.81 \]

41
Spread
During the period October 2, 2014, to November 13, 2014, the daily closing prices of
the Toronto-Dominion Bank (TD) and the Canadian Imperial Bank of Commerce (CIBC)
had the means and standard deviations given in the following table:

Bank    Mean       Standard deviation
TD      $54.54     $1.37
CIBC    $100.92    $2.34

For TD: \( CV = \frac{\text{Standard deviation}}{\text{Mean}} = \frac{\$1.37}{\$54.54} = 0.0252 \)
For CIBC: \( CV = \frac{\text{Standard deviation}}{\text{Mean}} = \frac{\$2.34}{\$100.92} = 0.0232 \)
The standard deviation for CIBC is higher than for TD, but does that mean the share
price was more variable?
Relative to its mean, TD was more variable (it has the larger CV), even though the standard deviation for CIBC was higher.
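The comparison can be sketched in Python using the rounded means and standard deviations quoted in the calculations above. The function name is hypothetical, and results computed from rounded inputs may differ in the last digit from figures computed on the raw prices.

```python
def coefficient_of_variation(sd, mean):
    """CV = standard deviation / mean (unitless)."""
    return sd / mean

cv_td = coefficient_of_variation(1.37, 54.54)
cv_cibc = coefficient_of_variation(2.34, 100.92)

# CIBC has the larger standard deviation, but TD has the larger CV.
td_more_variable = cv_td > cv_cibc
print(round(cv_td, 3), round(cv_cibc, 3), td_more_variable)
```

Because the CV divides out the units, it allows a fair comparison between a stock trading near $55 and one trading near $100.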

42
Reporting the Shape, Centre, and Spread
Which measures of centre and spread should be used for a distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard deviation and
possibly the median and IQR should be reported.
• If there are multiple modes, try to determine if the data can be split into separate
groups.
• If there are unusual observations, point them out and report the mean and standard
deviation with and without the values.
• Always pair the median with the IQR and the mean with the standard deviation.

43
Adding Measures of Centre and Spread
In many cases we are interested in the centre and spread of a combination of two or
more distributions.
A company may have a two-stage production process. Each stage has a unique
distribution. However, we want to understand the entire production process.
The mean of the entire process is the sum of the means of the two stages.

44
Adding Measures of Centre and Spread
The mean of the entire process is the sum of the means of the two stages. However,
the same is not true for the median or the mode.
Similarly, only under certain circumstances can the variance be computed as the sum
of the variances. No other measures of spread can be added.

Processing    Number of    Mean         Median       Mode         IQR          Standard deviation                Variance
Time          Products     (minutes)    (minutes)    (minutes)    (minutes)    (minutes)                         (minutes²)
Stage 1       100          20           18           17           5            3                                 9
Stage 2       100          30           26           25           6            4                                 16
Total         100          50           ?            ?            ?            5 if stages are uncorrelated      25 if stages are uncorrelated
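The "Total" row can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the dictionary layout is an assumption for the example, and the variance sum is valid only when the two stages are uncorrelated, as the table states.

```python
import math

stage1 = {"mean": 20, "sd": 3, "variance": 9}
stage2 = {"mean": 30, "sd": 4, "variance": 16}

# Means always add.
total_mean = stage1["mean"] + stage2["mean"]

# Variances add only if the stages are uncorrelated.
total_variance = stage1["variance"] + stage2["variance"]
total_sd = math.sqrt(total_variance)

# Standard deviations do not add: 3 + 4 is not the SD of the total.
naive_sd_sum = stage1["sd"] + stage2["sd"]

print(total_mean, total_variance, total_sd, naive_sd_sum)
```

The standard deviation of the total comes from the square root of the summed variances, not from adding the two standard deviations.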

45
Grouped Data
Often data is grouped into ranges prior to data collection.
Example: We are interested in understanding how much of a premium people are willing
to pay in order to “Buy Canadian”. The following table summarizes the result of a survey
conducted by a marketing research company.

Amount Extra a Person Would Be Prepared to Pay($) Percentage of Sample

0 23%
1–5 14%
6–10 23%
11–19 8%
20 or more 17%
No answer 15%
Table - How much extra Canadians would be prepared to pay to purchase products made in Canada.
46
Grouped Data
To calculate the mean of grouped data we use the midpoint of the ranges in our
calculations.
The calculation of the average extra amount Canadians are prepared to pay in order to
buy Canadian products is shown in the table below. (For the last entry we chose a
midpoint of $30.)

Range ($)     Midpoint ($)    % of Sample    MidPt × %
0             0               23%            0.00
1–5           3               14%            0.42
6–10          8               23%            1.84
11–19         15              8%             1.20
20 or more    30              17%            5.10
                              Mean           $8.56

47
Grouped Data
To calculate the variance of grouped data we also use the midpoint of the ranges in
our calculations. Once we have the variance, we take its square root to get the
standard deviation.

Variance = \( \sum p \, (y - \text{Mean})^2 \), summed over the k ranges, where p is the percentage of the sample in a range and y is the midpoint of that range.

Range ($)     Midpoint ($)    % of Sample    MidPt × %    (MidPt − Mean)² × %
0             0               23%            0.00         16.85
1–5           3               14%            0.42         4.33
6–10          8               23%            1.84         0.07
11–19         15              8%             1.20         3.32
20 or more    30              17%            5.10         78.14
                              Mean           $8.56
                                             Variance =   102.71
                                             SD =         $10.13

Table: Calculation of variance and standard deviation for grouped data.
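The grouped-data calculation can be sketched in Python with the midpoints in dollars and the percentages written as proportions. The variable names are illustrative, and, as in the table, the "No answer" group is left out and the percentages are not rescaled.

```python
import math

# (midpoint in $, proportion of sample) for each range; $30 is the chosen
# midpoint for the open-ended "20 or more" range.
groups = [(0, 0.23), (3, 0.14), (8, 0.23), (15, 0.08), (30, 0.17)]

# Mean: sum of midpoint x proportion.
mean = sum(y * p for y, p in groups)

# Variance: sum of proportion x squared distance of the midpoint from the mean.
variance = sum(p * (y - mean) ** 2 for y, p in groups)
sd = math.sqrt(variance)

print(round(mean, 2), round(sd, 2))
```

The result depends on the midpoint chosen for the open-ended range, so a different choice than $30 would change both the mean and the standard deviation.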

48
Five-Number Summary and Boxplots
The five-number summary of a distribution reports its median, quartiles, and extremes
(maximum and minimum).
It provides a good overall summary of the distribution of data.
The five-number summary of NYSE daily volume (in billions of shares) for 2006 is given below.

Max 3.287
Upper Quartile, Q3 1.972
Median 1.824
Lower Quartile, Q1 1.675
Min 0.616

Table: The five-number summary of NYSE daily volume (in billions of shares)

49
Five-Number Summary and Boxplots
Once we have a five-number summary of a variable, we can display that information in
a boxplot. To make a boxplot:
1) Draw a single vertical axis spanning the extent of the data
2) Draw short horizontal lines at the lower and upper quartiles and at the median.
Then connect them with vertical lines to form a box

50
Five-Number Summary and Boxplots
3) Erect (but don’t show in the final plot) “fences” around the main part of the data,
placing the upper fence 1.5 IQRs above the upper quartile and the lower fence
1.5 IQRs below the lower quartile.

Max                   3.287
Upper Quartile, Q3    1.972
Median                1.824
Lower Quartile, Q1    1.675
Min                   0.616

IQR = 1.972 − 1.675 = 0.297
1.5 IQR = 0.4455
Upper fence: Q3 + 1.5 IQR = 2.4175
Lower fence: Q1 − 1.5 IQR = 1.2295

51
Five-Number Summary and Boxplots
4) Draw lines (whiskers) from each end of the box up and down to the most extreme
data values found within the fences.

52
Five-Number Summary and Boxplots
5) Add any outliers by displaying data values that lie beyond the fences with special symbols.

53
Five-Number Summary and Boxplots
Example:
• Find the interquartile range. Q1 = 7, Q3 = 11
IQR = Q3 − Q1 = 11 − 7 = 4
• Find the maximum and minimum values of the data set.
min = 1, max = 18
• Find the upper and lower fences.
Upper fence = Q3 + 1.5 IQR = 11 + 1.5(4) = 17
Lower fence = Q1 − 1.5 IQR = 7 − 1.5(4) = 1
Note that any data values below the lower fence or above the upper fence are
outliers. Here 18 is an outlier because it lies above the upper fence of 17.
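The fence rule can be sketched in Python as follows. The function names are hypothetical; the logic is the 1.5 IQR rule from the boxplot steps.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value, lower, upper):
    """A value is an outlier only if it lies strictly beyond a fence."""
    return value < lower or value > upper

lower, upper = fences(7, 11)
print(lower, upper, is_outlier(18, lower, upper), is_outlier(1, lower, upper))

# NYSE daily volume (billions of shares), quartiles from the five-number summary.
nyse_lower, nyse_upper = fences(1.675, 1.972)
print(round(nyse_lower, 4), round(nyse_upper, 4))
```

The minimum of 1 sits exactly on the lower fence, so it is not flagged, while 18 lies above the upper fence. For the NYSE data, both the minimum (0.616) and the maximum (3.287) fall outside the fences.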

54
Five-Number Summary and Boxplots
The centre of a boxplot shows the middle half of the data
between the quartiles. The height of the box equals the IQR.
If the median is roughly centred between the quartiles, then
the middle half of the data is roughly symmetric. If it is not
centred, the distribution is skewed.
The whiskers show skewness as well if they are not roughly
the same length.
The outliers are displayed individually to keep them out of the
way in judging skewness and to display them for special
attention.

55
Percentiles
A percentile shows where a given percentage of the data lies.
For example, if your mark in this course is 82%, that tells us that you got 82% of
questions right. Whereas the 82nd percentile shows your mark compared with other
students’ marks.

The first quartile is Q1. Twenty-five percent of the data lies below Q1, and another name
for Q1 is “25th percentile.”
The second quartile is median. Fifty percent of the data lies below the median, and
another name for median is “50th percentile.”
The third quartile is Q3. Seventy-five percent of the data lies below Q3, and another
name for Q3 is “75th percentile.”

56
Percentiles
Suppose the numbers of passengers on 12 flights from Ottawa to Iqaluit are
24, 18, 31, 27, 15, 16,
26, 15, 24, 26, 25, 30.
Step 1. We first put the data in ascending order, getting
15, 15, 16, 18, 24, 24,
25, 26, 26, 27, 30, 31.
Step 2. Suppose we want to calculate the 80th percentile of this data. Since there are 12 data
values, we first calculate 80% of 12, which is 9.6. Since 9.6 is not an integer, we round it up to
10 and the 80th percentile is the 10th data value, or 27.
What is the 50th percentile of this data?
Since there are 12 data values, we first calculate 50% of 12, which is 6. Since 6 is an integer, the
50th percentile is the average of the 6th and 7th data values, or (24+25)/2 = 24.5.
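The two-step percentile rule can be sketched in Python. The function name is hypothetical, and the sketch assumes a whole-number percentile strictly between 0 and 100. It uses integer arithmetic to decide whether p% of n is a whole number, which avoids floating-point surprises.

```python
def percentile(values, p):
    """Percentile by the slide's rule; p is a whole number with 0 < p < 100."""
    ordered = sorted(values)                       # step 1: ascending order
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # p% of n is an integer: average that value and the next one.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round up: 1-based position whole + 1 is 0-based index whole.
    return ordered[whole]

passengers = [24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25, 30]
p80 = percentile(passengers, 80)
p50 = percentile(passengers, 50)
print(p80, p50)
```

Other textbooks and software interpolate between neighbouring values, so their percentiles can differ slightly from this rule.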

57
Comparing Groups
In attempting to understand the data, we may want to look for patterns, differences, and
trends over different time periods
We may want to split the data in half and display histograms for each half. Histograms of the
NYSE data split into six-month halves are shown below.

58
Comparing Groups
Histograms work well for comparing two groups, but boxplots tend to offer better results when
side-by-side comparison of several groups is sought.
Below the NYSE data is displayed in monthly boxplots.

59
Dealing with Outliers
What should be done with outliers?
They should be understood in the context of the data. An outlier for a year of data may
not be an outlier for the month in which it occurred and vice versa.
They should be investigated to determine if they are in error. The values may have
simply been entered incorrectly. If a value can be corrected, it should be.
They should be investigated to determine why they are so different from the rest of
the data. For example, were extra sales or fewer sales seen because of a special event
like a holiday?

60
Standardizing
Often, we wish to compare very different variables.
To do so, the values are standardized by measuring how far they are from the mean.
We measure the distance from the mean with the standard deviation, and the result is
the standardized value which records how many standard deviations each value is above
or below the overall mean.

61
Standardizing
Example: Compare two companies (from the “top” 100 companies) with respect to the
variables New Jobs (jobs created) and Average Pay.
Starbucks created over 2000 jobs and has an average salary of $44,790 while Wrigley
created 16 jobs and has an average salary of $56,351.
For all 100 companies, the mean number of new jobs created was 305.9 and the average
salary was $73,299.42.

62
Standardizing
Example (continued): We first display the data for all 100 companies in a stem-and-leaf
display.
Figure: Stem-and-leaf displays for both the number of New Jobs created and the Average Pay of
salaried employees at the top 100 companies to work for in 2005 from Fortune magazine.
Starbucks (in red) created more jobs, but Wrigley (in blue) did better in average pay.
Which company did better for both variables combined?

63
Standardizing
Example (continued): To compare the two companies based on these variables, we find
the mean and standard deviation for all 100 companies.

Variable Mean SD
New Jobs 305.9 1507.97
Avg. Pay $73,299.42 $34,055.25

To quantify how well each of the companies did and to combine the two
scores, we’ll determine how many standard deviations they each are
from the variable means.

64
Standardizing
To find how many standard deviations a value is from the mean we calculate a
standardized value or z-score.

\[ z = \frac{y - \bar{y}}{s} \]

For example, a z-score of 2.0 indicates that a data value is two standard
deviations above the mean and a z-score of −2.0 indicates that the data
value is two standard deviations below the mean.
A rule of thumb for identifying outliers is \( z > 3 \) or \( z < -3 \).

65
66

Standardizing
Example (continued): Computing the z-scores for these variables for Starbucks and
Wrigley we obtain the results summarized below. For each variable, the z-score for each
observation is found by subtracting the mean from the value and then dividing that
difference by the standard deviation.

                        New Jobs                            Average Pay
Mean (all companies)    305.9                               $73,299.42
SD                      1507.97                             $34,055.25
Starbucks               2193                                $44,790
z-score                 1.25 = (2193 − 305.9)/1507.97       −0.84 = (44,790 − 73,299.42)/34,055.25
Wrigley                 16                                  $56,351
z-score                 −0.19 = (16 − 305.9)/1507.97        −0.50 = (56,351 − 73,299.42)/34,055.25

Adding the two z-scores gives 1.25 − 0.84 = 0.41 for Starbucks and −0.19 − 0.50 = −0.69 for Wrigley, so by this method Starbucks would be ranked higher based on these two variables.
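The z-score comparison can be reproduced with a short Python sketch. The function and variable names are illustrative, and adding the two z-scores to get a combined score is one simple way to combine them, assumed here for the comparison.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

JOBS_MEAN, JOBS_SD = 305.9, 1507.97
PAY_MEAN, PAY_SD = 73299.42, 34055.25

starbucks = (z_score(2193, JOBS_MEAN, JOBS_SD), z_score(44790, PAY_MEAN, PAY_SD))
wrigley = (z_score(16, JOBS_MEAN, JOBS_SD), z_score(56351, PAY_MEAN, PAY_SD))

# Combined score: sum of the z-scores on the two variables.
starbucks_total = sum(starbucks)
wrigley_total = sum(wrigley)

print([round(z, 2) for z in starbucks], round(starbucks_total, 2))
print([round(z, 2) for z in wrigley], round(wrigley_total, 2))
```

Because z-scores are unitless, a job count and a dollar salary can be placed on the same scale before they are combined.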
Time Series Plots
A display of values against time is sometimes called a time series plot. Below we
have a time series plot of the NYSE daily volumes for 2006.

Figure: A time series plot of Daily Volume shows the overall pattern and changes in variation.

67
Time Series Plots
Time series plots often show a great deal of
point-to-point variation, but general patterns
do emerge from the plot.
Time series plots may be drawn with the
points connected. The NYSE data from
before is displayed this way, see the figure.

Figure: The Daily Volumes of the previous figure, drawn by connecting all the points. Sometimes this can help us
see the underlying pattern.
68
Time Series Plots
To better understand the trend of time
series data, a smooth trace may be
plotted with the data. A trace is typically
created using a statistics software package
and will be discussed in a later section.
The NYSE data has been plotted with a
smooth trace here.
Unless there is strong evidence for doing
otherwise, we should resist the temptation
to think that any trend we see will continue
indefinitely.
Figure: The Daily Volumes with a smooth
trace added to help your eye see the long-term pattern.

69
Time Series Plots
Example: Consider the time series plot for the
monthly stock closing price of Bell Canada seen
earlier. The histogram showed a symmetric,
possibly unimodal distribution.
The time series plot shows a period of relatively
small price changes followed by several years of
extreme volatility.

Figure: A time series plot of daily Bell Canada stock price changes.

70
Time Series Plots
When a time series is stationary (without a strong trend or change in variability), then
a histogram can provide a useful summary.
A histogram may fail to summarize a distribution with extreme behaviour changes
over time; a time series plot would be more informative.

71
What Can Go Wrong?
Don’t make a histogram of a categorical variable. The histogram below of policy
numbers is not at all informative.

Figure: It’s not appropriate to display categorical data like policy numbers
with a histogram.

72
What Can Go Wrong?
• Choose a scale appropriate to the data
• Avoid inconsistent scales. Don’t change scales in the middle of a plot, and compare
groups on the same scale
• Label variables and axes clearly
• Do a reality check. Make sure the calculated summaries make sense
• Don’t compute numerical summaries of a categorical variable
• Watch out for multiple modes. If the data has multiple modes, consider separating
the data
• Beware of outliers

73
What Have We Learned?
• To display and summarize quantitative data
• To summarize distributions of quantitative variables numerically
• To calculate measures of spread (range, IQR, and standard deviation)
• To compare groups and look for patterns among groups and over time
• To identify and investigate outliers
• To standardize data and understand the power of standardizing
• To graph data and look for trends both by eye and with a data smoother

74
The slides are a combination of material from your text and Pearson resources as
well as original material.

75
