Displaying and Describing Quantitative Data

This document provides information on displaying and describing quantitative data through various statistical methods. It discusses how to summarize numerical data using histograms, stem-and-leaf plots, and analyzing shape and skewness. It also covers measuring the center through mean and median, and spread using boxplots, interquartile range, and standard deviation. Examples are provided on building frequency distributions for continuous data and interpreting histograms, stem-and-leaf plots, and boxplots.

Uploaded by

Josh Potash
Displaying and Describing Quantitative Data
Summarizing numerical data
Histograms
Stem-and-Leaf plots
Shape and Skewness
Center: Mean vs. Median
Boxplots (5 number summary)
Measuring the spread


Continuous Data: may take on any value in
some interval
Summarized in a grouped data frequency
table

Example: A manufacturer of insulation randomly selects 20
winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

NOTE: Temperature is a continuous variable because it could be
measured to any degree of precision desired
Frequency Distribution:
Continuous Data

1. Determine the number of categories (classes/bins)
2. Establish class width
   Minimum width is the range of the data divided by the number of classes
   Largest data point - Smallest data point = Range
3. Set the class boundaries
4. Determine the frequency in each class
   Count the number of data points in each category





Building a Frequency Table:
Continuous Data
How Many Categories?
Many (Narrow class intervals)
May yield a very jagged distribution
with gaps from empty classes
Can give a poor indication of how
frequency varies across classes

Few (Wide class intervals)
May compress variation too much
and yield a blocky distribution
Can obscure important patterns of
variation
[Figure: two histograms of the temperature data (Frequency vs. Temperature): one with very wide classes (upper endpoints 0, 30, 60, More) and one with very narrow classes (upper endpoints 4, 8, 12, ..., 60, More). X axis labels are upper class endpoints.]
General Guidelines
Number of Data Points    Number of Classes
under 50                 5 - 7
50 - 100                 6 - 10
100 - 250                7 - 12
over 250                 10 - 20

Class widths can typically be reduced as the
number of observations increases
Distributions with numerous observations are
more likely to be smooth and have gaps filled
since data are plentiful
Considerations:
Continuous Data

Must be mutually exclusive

Must be all-inclusive

Bins should be of equal width

Avoid empty categories
How should the endpoints be
determined?
Often by trial and error

The goal is to create a distribution that is
neither too "jagged" nor too "blocky"

You want to appropriately show the pattern
of variation

Example: Continuous Data
Sort raw data from low to high:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 20)
Compute class width: 10 (46/5 = 9.2, then round up)
Determine class boundaries: 10, 20, 30, 40, 50, 60
(Sometimes class midpoints are reported: 15, 25, 35, 45, 55)
Count the number of values in each class
Frequency Distribution Example
Data from low to high:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Frequency   Relative Frequency
10 but under 20        3            .15
20 but under 30        6            .30
30 but under 40        5            .25
40 but under 50        4            .20
50 but under 60        2            .10
Total                 20           1.00
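The steps for building this table can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `frequency_table` and its parameters are made up for the example, and the starting boundary (10) and class width (10) are the values chosen on the slide.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

def frequency_table(data, start, width, num_classes):
    """Count the values falling in each class [lower, lower + width)."""
    counts = [0] * num_classes
    for x in data:
        k = int((x - start) // width)      # index of the class that holds x
        if not 0 <= k < num_classes:
            raise ValueError(f"{x} lies outside the classes")
        counts[k] += 1
    n = len(data)
    # Each row: (lower bound, upper bound, frequency, relative frequency)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]

data_range = max(temps) - min(temps)       # 58 - 12 = 46
raw_width = data_range / 5                 # 46 / 5 = 9.2
width = 10                                 # 9.2 rounded up to a convenient width
table = frequency_table(temps, start=10, width=width, num_classes=5)

for lower, upper, freq, rel in table:
    print(f"{lower} but under {upper}: {freq:2d}  {rel:.2f}")
```

Because each class includes its lower bound and excludes its upper bound, the classes are mutually exclusive and all-inclusive for this data.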
Histogram Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency distribution histogram; Frequency (0 to 7) against class midpoints 5, 15, 25, 35, 45, 55, More, with bar heights 0, 3, 6, 5, 4, 2, 0; class endpoints/bins at 0, 10, 20, 30, 40, 50, 60]
No gaps between bars, since the data are continuous
Frequency Histograms
Visual representation of the frequency table
The classes/intervals/bins are shown on the
horizontal axis
Frequency is measured on the vertical axis

Bars of the appropriate heights can be used to
represent the number of observations within
each class
Shows the center of the data and the spread

Stem-and-Leaf Plots
A quick and dirty histogram
11, 24, 24, 25, 27, 30, 30, 31, 32, 33, 44, 46, 47, 50, 52

5 | 02
4 | 467
3 | 00123
2 | 4457
1 | 1

The parts
Stems | Leaves
5 | 02
4 | 467
3 | 00123
2 | 4457
1 | 1
Key: 1|1 stands for 11
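A stem-and-leaf plot like this one can be produced mechanically. The sketch below is an illustration only; it assumes non-negative whole numbers, uses the tens as the stem and the ones digit as the leaf, and the function name `stem_and_leaf` is invented for the example.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return the display lines, largest stem first."""
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)          # tens part | ones digit
        leaves[stem].append(str(leaf))
    top, bottom = max(leaves), min(leaves)
    # List every stem between the extremes so empty stems show up as gaps.
    return [f"{stem} | {''.join(leaves.get(stem, []))}"
            for stem in range(top, bottom - 1, -1)]

data = [11, 24, 24, 25, 27, 30, 30, 31, 32, 33, 44, 46, 47, 50, 52]
for line in stem_and_leaf(data):
    print(line)
```

Sorting the data first means the leaves on each stem come out in increasing order, which is what makes the plot read like a histogram turned on its side.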
Splitting Stems
You want about 7 to 10 stems (depending on how
much data you have)
Create more stems by splitting them
8 | 555555566    (8.5 to 8.9)
8 | 0012233      (8.0 to 8.4)
7 | 8899         (7.5 to 7.9)
7 | 02           (7.0 to 7.4)
Or More
8 | 2233    (8.2 and 8.3)
8 | 001     (8.0 and 8.1)
7 | 8899    (7.8 and 7.9)
7 |         (7.6 and 7.7)
7 |         (7.4 and 7.5)
7 | 2       (7.2 and 7.3)
7 | 0       (7.0 and 7.1)
20 18 16 29 26 14 21 26 24 21 22
8 28 21 31 20 29 33 26 18 21 19
38 21 13 29 17 15 35 26

3 | 1358
2 | 00111126668999
1 | 34567889
0 | 8

With each stem split in two:
3 | 58
3 | 13
2 | 6668999
2 | 0011112
1 | 567889
1 | 34
0 | 8
Key: 1|3 means 13
Normal Body Temperature
96.3 96.7 96.9 97.0 97.1
97.1 97.1 97.2 97.3 97.4
97.4 97.4 97.4 97.5 97.5
97.6 97.6 97.6 97.7 97.8
97.8 97.8 97.8 97.9 97.9
98.0 98.0 98.0 98.0 98.0
98.0 98.1 98.1 98.2 98.2
98.2 98.2 98.3 98.3 98.4
98.4 98.4 98.4 98.5 98.5
98.6 98.6 98.6 98.6 98.6
98.6 98.7 98.7 98.8 98.8
98.8 98.9 99.0 99.0 99.0
99.1 99.2 99.3 99.4 99.5
99 | 5
99 | 0001234
98 | 55666666778889
98 | 000000112222334444
97 | 556667888899
97 | 0111234444
96 | 79
96 | 3
Key: 96|3 means 96.3
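Splitting stems can be sketched the same way. The example below is an illustration under stated assumptions: values are recorded to one decimal place, the whole degrees form the stem, the tenths digit is the leaf, and each stem is split into a lower half (leaves 0 to 4) and an upper half (leaves 5 to 9). The function name `split_stem_and_leaf` is hypothetical.

```python
body_temps = [
    96.3, 96.7, 96.9, 97.0, 97.1, 97.1, 97.1, 97.2, 97.3, 97.4,
    97.4, 97.4, 97.4, 97.5, 97.5, 97.6, 97.6, 97.6, 97.7, 97.8,
    97.8, 97.8, 97.8, 97.9, 97.9, 98.0, 98.0, 98.0, 98.0, 98.0,
    98.0, 98.1, 98.1, 98.2, 98.2, 98.2, 98.2, 98.3, 98.3, 98.4,
    98.4, 98.4, 98.4, 98.5, 98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
    98.6, 98.7, 98.7, 98.8, 98.8, 98.8, 98.9, 99.0, 99.0, 99.0,
    99.1, 99.2, 99.3, 99.4, 99.5,
]

def split_stem_and_leaf(data):
    """Stem = whole units, leaf = tenths; each stem split into 0-4 and 5-9."""
    rows = {}
    for x in sorted(data):
        # Work in integer tenths so 96.3 becomes 963 without float surprises.
        stem, leaf = divmod(round(x * 10), 10)
        half = 1 if leaf >= 5 else 0
        rows.setdefault((stem, half), []).append(str(leaf))
    key, bottom = max(rows), min(rows)
    lines = []
    while key >= bottom:
        lines.append(f"{key[0]} | {''.join(rows.get(key, []))}")
        # Step down: upper half, then lower half, then the next stem's upper half.
        key = (key[0], 0) if key[1] == 1 else (key[0] - 1, 1)
    return lines

for line in split_stem_and_leaf(body_temps):
    print(line)
```

With 65 observations, four unsplit stems would be too blocky; splitting gives eight rows, in line with the guideline of about 7 to 10 stems.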



Symmetric or Skewed?
Is the distribution symmetric?
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Left skewed?
It is skewed to the left if the left side of the histogram (side with smaller values) extends much farther out than the right side.

Or right skewed?
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side.



Other Distributions
Bimodal Distribution
Uniform Distribution
Outliers
Data points that don't seem to fit in the distribution.
Far to the left or right in the graph.
Quantitative Summaries
Mean
Median
5 number summary -- Boxplots
Measuring Spread
Describing Quantitative Data
Where is it? What is its center?
What is the spread or variability? How much
noise is in the data?
What is the shape of the distribution? Is it
symmetric?
Measuring these attributes
Center of the Distribution
The average salary, height, etc.

Mean: add up the data and divide by the
number of observations.

Median: the middle value, with an equal number of
observations above and below it.
Mean
Add up the data and divide by the number of
observations

Data: 1, 2, 2, 3, 4
Mean = (1 + 2 + 2 + 3 + 4) /5 = 2.4

Data: 10, 12, 56, 78, 113, 1209
Mean = (10 + 12 + 56 +78 + 113 + 1209)/6 = 246.3
Some Algebra
Median
The middle observation
Data: 1, 2, 2, 3, 4
Mean = (1 + 2 + 2 + 3 + 4) /5 = 2.4
Median = 2

Data: 10, 12, 56, 78, 113, 1209
Mean = (10 + 12 + 56 +78 + 113 + 1209)/6 = 246.3
Median = (56 + 78)/2 = 67
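The two worked examples can be checked with Python's standard `statistics` module. This is a small sketch for illustration; the variable names are made up, and the last step (dropping 1209) is an added demonstration of how one extreme value affects each measure.

```python
from statistics import mean, median

small = [1, 2, 2, 3, 4]
skewed = [10, 12, 56, 78, 113, 1209]

for data in (small, skewed):
    print(f"data = {data}: mean = {mean(data):.1f}, median = {median(data)}")

# Drop the extreme value 1209 and see how far each measure moves.
trimmed = skewed[:-1]
mean_shift = mean(skewed) - mean(trimmed)        # large change
median_shift = median(skewed) - median(trimmed)  # small change
print(f"mean shifts by {mean_shift:.1f}, median shifts by {median_shift:.1f}")
```

The mean moves by roughly 190 when the largest value is removed, while the median moves by 11, which is why the median is preferred for skewed data.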

Comparison
Mean and median are similar for symmetric distributions.
For a skewed distribution, the mean is pulled in the direction of the skew (toward the long tail), away from the median.
Modes
Mode: peak in the distribution
Bimodal = Two Modes

Mean and Median
5 number summary
Median
Minimum, Maximum
Quartiles: the middle observation of the values below the
median (Q1) and of the values above the median (Q3)

Min, Q1, Med, Q3, Max
Finding Quartiles
1. Data: 7, 23, 75, 82, 34, 91, 10
2. Put it in order:
7, 10, 23, 34, 75, 82, 91
3. Find the median: 34
4. Below the median: 7, 10, 23
Lower Quartile Q1 = 10
5. Above the median: 75, 82, 91
Upper Quartile Q3 = 82
More Quartiles
7, 8, 22, 38, 48, 62
Median = (22 + 38)/2 = 30

Below the median: 7, 8, 22
Q1 = 8
Above the median: 38, 48, 62
Q3 = 48
One more time
125, 126, 127, 129, 133, 136, 136, 140, 141, 143, 143, 147, 152
Median = 136
Below the median: 125, 126, 127, 129, 133, 136
Q1 = (127 + 129)/2 = 128
Above the median: 140, 141, 143, 143, 147, 152
Q3 = (143 + 143)/2 = 143

5 Numbers
125, 126, 127, 129, 133, 136, 136, 140, 141,
143, 143, 147, 152

Five number summary
(min, Q1, med, Q3, max) =
(125, 128, 136, 143, 152)
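The quartile rule used in these examples (take the median of the values below the overall median and of the values above it, leaving out the middle observation when the count is odd) can be sketched as follows. The function name `five_number_summary` is made up for illustration, and statistical software often uses slightly different quartile conventions, so results from other tools may not match exactly.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median
    upper = xs[half + 1:] if n % 2 else xs[half:]   # values above the median
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

examples = [
    [7, 23, 75, 82, 34, 91, 10],
    [7, 8, 22, 38, 48, 62],
    [125, 126, 127, 129, 133, 136, 136, 140, 141, 143, 143, 147, 152],
]
for data in examples:
    print(five_number_summary(data))
```

The function needs at least two observations, since each half must contain at least one value.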
5 Number Summary
Example: Shares traded daily on NYSE
Max 3,115,805,723
Q3 1,739,245,625
Median 1,584,406,064
Q1 1,451,269,968
Min 545,244,020
Box Plot
Example: monthly credit card charges($)
[Figure: histogram with boxplot of monthly credit card charges; Count (100 to 300) against charges from 0 to 7000 ($)]
How would you describe this distribution?
We can compare groups
Side-by-side boxplots compare two sets of data.
Do they have the same center? Spread? Shape?
Is the difference between the medians much
bigger than the variability in the data?
Let's examine the % hotel occupancy rate in Hawaii
to compare the summer months with the non-summer
months

Case: Hotel Occupancy in Hawaii
[Figure: Boxplot of Hotel Occupancy vs Season; Hotel Occupancy (%) from 50 to 90 for Summer and Non-Summer]
Case (cont.): Occupancy by each of the 4
seasons
There are two high seasons for hotels in Hawaii.
[Figure: Boxplot of Hotel Occupancy vs Season; Hotel Occupancy (%) from 50 to 90 for Fall, Summer, Spring and Winter]
[Figure: Time Series Plot of Hotel Occupancy; Hotel Occupancy (%) from 50 to 90 by month (Jan, Jul) and year, 2000 to 2004]
Time Series Plots
What additional information does this graph give us?
Seasonal behavior of unemployment rates
Data from 1980-2001
Measuring the Spread
How much variability is in the data?
1. Range
Maximum - Minimum
2. Interquartile Range
Q3 - Q1
3. Standard Deviation
Square root of the average squared distance from the mean
Why I don't use the range.
An outlier is going to be either the largest or
smallest data point.
If there is an outlier, then I don't want to use it;
it isn't typical.
Even if there are no outliers, I'm more
interested in central numbers than extreme
numbers.
IQR
The IQR is the length of the central half of the data.

IQR = Q3 - Q1
It is the least sensitive to outliers of any of our
measures of spread.
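The difference in outlier sensitivity is easy to see numerically. The sketch below is illustrative only: it reuses the 13 values from the quartile example, the helper names are invented, and the value 520 is a hypothetical outlier swapped in for the largest observation.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the values below and above the median."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(upper)

def data_range(data):
    return max(data) - min(data)

def iqr(data):
    q1, q3 = quartiles(data)
    return q3 - q1

readings = [125, 126, 127, 129, 133, 136, 136, 140, 141, 143, 143, 147, 152]
with_outlier = readings[:-1] + [520]   # hypothetical: 152 replaced by 520

for label, data in (("original", readings), ("with outlier", with_outlier)):
    print(f"{label}: range = {data_range(data)}, IQR = {iqr(data)}")
```

The range jumps from 27 to 395 when the outlier is introduced, while the IQR stays at 15 because the quartiles depend only on the central half of the data.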
How much does it vary?
Range = 46555
IQR = 6587
Variance
Average of the squared deviations
Shows variation about the mean
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Standard Deviation
Square root of the variance
Has the same units as the original data
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
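As a check on the sample variance and standard deviation formulas, the sketch below computes both by hand for the small data set 1, 2, 2, 3, 4 from the mean example and compares them with the standard library's `statistics.variance` and `statistics.stdev`. The function name `sample_variance` is made up for illustration.

```python
import math
from statistics import stdev, variance

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [1, 2, 2, 3, 4]             # mean is 2.4
s2 = sample_variance(data)         # squared deviations sum to 5.2; 5.2 / 4 = 1.3
s = math.sqrt(s2)                  # same units as the data

print(f"variance = {s2:.2f}, standard deviation = {s:.3f}")
print(f"library:   {variance(data):.2f}, {stdev(data):.3f}")
```

Dividing by n - 1 rather than n is what makes this the sample (rather than population) variance.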
Standardizing
How many standard deviations is a value from the mean?

$$z = \frac{\text{value} - \text{mean}}{\text{st. dev.}}$$

z = 2: the value is 2 standard deviations above the mean

z = -2: the value is 2 standard deviations below the mean

Allows us to compare values with different units
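Standardizing is a one-line calculation. In the sketch below, the function name `z_score` and all the numbers (a height in inches and a weight in kilograms, each with an assumed mean and standard deviation) are hypothetical and chosen only to show how values in different units become comparable.

```python
def z_score(value, mean, st_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / st_dev

# Hypothetical measurements in different units.
height_z = z_score(74, mean=70, st_dev=2)    # inches
weight_z = z_score(60, mean=75, st_dev=10)   # kilograms

print(f"height z = {height_z}, weight z = {weight_z}")
```

Here the height is 2 standard deviations above its mean and the weight is 1.5 standard deviations below its mean, so the height is the more unusual of the two even though the raw units cannot be compared directly.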
