Ch 4 – 5 Displaying and Describing Data

By the end of this week, the student should be able to:

• Display and interpret categorical data
• Display and interpret quantitative data
• Recognize, describe, and calculate the measures of the center of quantitative data
• Recognize, describe, and calculate the measures of the spread of quantitative data
• Recognize, describe, and calculate the measures of location of quantitative data
• Identify outliers in quantitative data
Ch. 4: Displaying and Describing Categorical Data
Ch. 5: Displaying and Describing Quantitative Data
Once you have collected data,
what will you do with it?
4.1 The Three Rules of Data Analysis
Rule 1: Make a picture
Rule 2: Make a picture
Rule 3: Make a picture

Pictures …
• summarize a table of numbers in one object
• reveal things that you are not likely to see in a table of numbers
• show important features and patterns in the data
• provide an excellent means to report findings to others
4.2 Frequency Tables (1 of 3)
• A frequency table organizes data by recording counts and category names, as in the table here.

Province                     Corporate Stores
Newfoundland and Labrador    12
Prince Edward Island         4
Nova Scotia                  32
New Brunswick                22
Quebec                       171
Ontario                      165

Table 4.1 Frequency table of the number of Loblaw stores in eastern Canada.
4.2 Frequency Tables (3 of 3)
A relative frequency table displays the proportions or percentages of the total that lie in each category rather than the counts.
Table 4.2 Relative frequency table showing percentages of Loblaw stores in eastern Canada.

Province       Corporate Stores    Corporate Stores (%)
Quebec         171                 42.12
Ontario        165                 40.64
Nova Scotia    32                  7.88
Other          38                  9.36
Total          406                 100.00
Source: Based on Loblaw Companies Limited (2013). Annual information form.
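As a rough illustration of how the percentages in Table 4.2 follow from the counts, here is a minimal Python sketch. The variable names are illustrative choices, not part of the original notes.

```python
# Counts copied from Table 4.2 (Loblaw corporate stores, eastern Canada).
counts = {"Quebec": 171, "Ontario": 165, "Nova Scotia": 32, "Other": 38}

total = sum(counts.values())

# Relative frequency = category count / total count, shown here as a percentage.
relative = {province: 100 * n / total for province, n in counts.items()}

for province, pct in relative.items():
    print(f"{province:<12} {counts[province]:>4} {pct:6.2f}%")
print(f"{'Total':<12} {total:>4} {sum(relative.values()):6.2f}%")
```

The percentages always sum to 100, which is a quick check that no category was dropped.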
4.3 Charts (4 of 8)
Bar Charts
A bar chart displays the distribution of a categorical variable, showing the
counts for each category next to each other for easy comparison.

The bar graph here gives a chart that obeys the area principle. It gives a more accurate visual impression of the distribution.

Figure 4.3 Number of Loblaw stores in each province in eastern Canada. With the area principle satisfied, the true distribution is clear.
4.3 Charts (5 of 8)
Bar Charts
Bar charts are normally drawn in vertical columns,
although they can also be drawn with horizontal bars.
4.3 Charts (7 of 8)

Pie Charts
Pie charts show the whole group as a circle (“pie”) sliced into pieces.

The size of each piece is proportional to the fraction of the whole in each category.

Figure 4.4 Number of Loblaw stores by province in eastern Canada.

Copyright © 2021 Pearson Canada Inc.


4.3 Mode
• The mode of a data set is the most frequently occurring value.
• A data set can have more than one mode, provided those values share the same frequency and that frequency is the highest.
• A data set with two modes is called bimodal, one with three modes trimodal, and one with several modes multimodal.
• In most cases the mode can easily be found as the largest piece of a pie chart or the largest bar in a bar chart.
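The definition above can be sketched in a few lines of Python. The function name `modes` and the short letter list are hypothetical illustrations; the store counts are those of Table 4.1.

```python
from collections import Counter

def modes(frequencies):
    """Return every category whose frequency equals the highest frequency."""
    top = max(frequencies.values())
    return [category for category, count in frequencies.items() if count == top]

# Frequency table copied from Table 4.1.
stores = {
    "Newfoundland and Labrador": 12,
    "Prince Edward Island": 4,
    "Nova Scotia": 32,
    "New Brunswick": 22,
    "Quebec": 171,
    "Ontario": 165,
}
print(modes(stores))

# Hypothetical raw observations, used only to show a bimodal case.
letters = ["a", "b", "a", "b", "c"]
print(modes(Counter(letters)))
```

Returning a list rather than a single value keeps the bimodal and multimodal cases visible instead of silently picking one mode.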
Variability
The best way to gauge variability in categorical data is by thinking about it as diversity.

Although we will not calculate a numerical measure here, we can note it visually. A variable that has
observations spread out fairly evenly over all categories shows high variability, while a variable where
most observations are only in one or a handful of categories displays low variability. Consider the level
of variability in the two pie charts below.
[Figure: two pie charts showing different levels of variability]
4.4 Exploring Two Categorical Variables: Contingency Tables
Table: Contingency table of sex and continent of origin of Dr. Yawo’s students (2024).

                     Africa   Asia   Europe   North America   South America   Grand Total
Female               6        35     7        14              4               66
Male                 7        28     8        26              2               71
Prefer not to say    1        1      0        0               0               2
Grand Total          14       64     15       40              6               139
Joint distribution and Marginal distribution

                     Africa    Asia      Europe    North America   South America   Grand Total
Female               4.32%     25.18%    5.04%     10.07%          2.88%           47.48%
Male                 5.04%     20.14%    5.76%     18.71%          1.44%           51.08%
Prefer not to say    0.72%     0.72%     0.00%     0.00%           0.00%           1.44%
Grand Total          10.07%    46.04%    10.79%    28.78%          4.32%           100.00%


Conditional Probability
Conditional on sex (each row sums to 100%):

                     Africa    Asia      Europe    North America   South America   Grand Total
Female               9.09%     53.03%    10.61%    21.21%          6.06%           100.00%
Male                 9.86%     39.44%    11.27%    36.62%          2.82%           100.00%
Prefer not to say    50.00%    50.00%    0.00%     0.00%           0.00%           100.00%
Grand Total          10.07%    46.04%    10.79%    28.78%          4.32%           100.00%

Conditional on continent (each column sums to 100%):

                     Africa    Asia      Europe    North America   South America   Grand Total
Female               42.86%    54.69%    46.67%    35.00%          66.67%          47.48%
Male                 50.00%    43.75%    53.33%    65.00%          33.33%          51.08%
Prefer not to say    7.14%     1.56%     0.00%     0.00%           0.00%           1.44%
Grand Total          100.00%   100.00%   100.00%   100.00%         100.00%         100.00%
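The joint, marginal, and conditional percentages can all be derived from the raw counts in the contingency table. The following is a minimal Python sketch under that assumption. The variable names are illustrative and the counts are copied from the table of students by sex and continent.

```python
continents = ["Africa", "Asia", "Europe", "North America", "South America"]
counts = {
    "Female": [6, 35, 7, 14, 4],
    "Male": [7, 28, 8, 26, 2],
    "Prefer not to say": [1, 1, 0, 0, 0],
}

row_totals = {sex: sum(row) for sex, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

# Joint distribution: each cell divided by the grand total.
joint = {sex: [100 * n / grand_total for n in row] for sex, row in counts.items()}

# Marginal distributions: row or column totals divided by the grand total.
marginal_sex = {sex: 100 * t / grand_total for sex, t in row_totals.items()}
marginal_continent = [100 * t / grand_total for t in col_totals]

# Conditional distributions: each cell divided by its row (or column) total.
cond_on_sex = {sex: [100 * n / row_totals[sex] for n in row]
               for sex, row in counts.items()}
cond_on_continent = {sex: [100 * n / c for n, c in zip(row, col_totals)]
                     for sex, row in counts.items()}

for sex, row in cond_on_sex.items():
    print(sex, [f"{p:.2f}%" for p in row])
```

The only difference between the three kinds of table is the denominator: grand total, row total, or column total.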


4.4 Exploring Two Categorical Variables: Graphs
4.5 Simpson’s Paradox (2 of 2)
Table 4.7 Look at the percentages within each product category. Who has a
better success rate closing sales of paper? Who has the better success rate
closing sales of flash drives? Who has the better performance overall?

             Product
Sales Rep    Printer Paper          USB Flash Drive        Overall
Peter        90 out of 100 (90%)    10 out of 20 (50%)     100 out of 120 (83%)
Katrina      19 out of 20 (95%)     75 out of 100 (75%)    94 out of 120 (78%)
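To see the reversal in Table 4.7 numerically, the success rates can be recomputed from the raw counts. This is a small Python sketch. The dictionary layout and the helper `rate` are illustrative choices, not part of the original notes.

```python
# (sales closed, attempts) per product, copied from Table 4.7.
sales = {
    "Peter": {"Printer Paper": (90, 100), "USB Flash Drive": (10, 20)},
    "Katrina": {"Printer Paper": (19, 20), "USB Flash Drive": (75, 100)},
}

def rate(closed, attempts):
    """Success rate as a percentage."""
    return 100 * closed / attempts

by_product = {rep: {product: rate(*pair) for product, pair in products.items()}
              for rep, products in sales.items()}

# The overall rate pools the counts first; it is not the average of the two rates.
overall = {rep: rate(sum(c for c, _ in products.values()),
                     sum(a for _, a in products.values()))
           for rep, products in sales.items()}

for rep in sales:
    print(rep, by_product[rep], f"overall {overall[rep]:.0f}%")
```

Katrina has the higher rate within each product, yet Peter has the higher overall rate because most of his attempts were for printer paper, the product that is easier to sell.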
5.1 Displaying Data Distributions

Histograms
A histogram plots the bin counts as the heights of bars and shows the overall “shape” of the data.
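As a sketch of what "bin counts" means, the snippet below bins the 29 actor ages that appear in the percentile example later in these notes. The bin width of 10 years is an assumption made for illustration, and the text bars merely stand in for a drawn histogram.

```python
from collections import Counter

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

bin_width = 10  # assumed bin width; other widths give a different picture

# Map each age to the lower edge of its bin, then count the ages per bin.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(bin_counts):
    count = bin_counts[start]
    print(f"{start}-{start + bin_width - 1}: {'#' * count} ({count})")
```

Changing `bin_width` is a quick way to see how strongly the apparent shape depends on the choice of bins.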
5.1 Displaying Data Distributions (2 of 6)
Stem-and-Leaf Displays
Stem-and-leaf displays are like histograms, but they also give the individual values.
A stem-and-leaf display for thirty-six months of stock price data is shown below together with a histogram.
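The stock price data are not reproduced in these notes, so the sketch below builds a stem-and-leaf display for the 29 actor ages from the percentile example instead. Using the tens digit as the stem and the units digit as the leaf is an assumption that suits two-digit values.

```python
from collections import defaultdict

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

# Stem = tens digit, leaf = units digit (assumed split for two-digit data).
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Each row has the same length as the corresponding histogram bar would have with bins of width 10, while still showing every individual value.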
5.2 Shape (1 of 7)
When describing a quantitative distribution, we want to note at least four things: the shape of the distribution, the presence of outliers, the center, and the spread. A helpful acronym to remember this is SOCS:
o Shape
o Outliers
o Center
o Spread
5.2 Shape (2 of 7)
Peaks or humps seen in a histogram are called the modes of a distribution.
A distribution whose histogram has one main peak is called unimodal; one with two peaks is bimodal; one with three or more peaks is multimodal.
5.2 Shape (3 of 7)
Modes
A distribution whose histogram doesn’t appear to have any clear mode and in which all the bars are approximately the same height is called a uniform distribution.

Figure 5.5 In an approximately uniform distribution, bars are all about the same height. The histogram does not have a clearly defined mode.
5.2 Shape (4 of 7)
Symmetry
A distribution is symmetric if the halves on either side of the centre look, at least approximately, like mirror images.

Figure 5.6 An approximately symmetric histogram can be folded in the middle so that the two sides almost match.
5.2 Shape (5 of 7)
Symmetry
The thinner ends of a distribution are called the tails. If one tail stretches
out farther than the other, the distribution is said to be skewed to the side
of the longer tail.
5.2 Shape (6 of 7)
Outliers
Always be careful to point out the outliers in a distribution: those
values that stand off away from the body of the distribution.
Outliers …
• can affect every statistical method we will study
• can be the most informative part of your data
• may be an error in the data (find the error and correct it)
• should be investigated, understood, and discussed in any conclusions drawn
about the data
5.3 Centre (1 of 5)
The mean is a natural summary and the centre point of a unimodal and symmetric distribution.
To find the mean of the variable y, add all the values of the variable and divide that sum by the number of data values, n.

\[ \bar{y} = \frac{\sum y}{n} \]
Example
The following table lists the number of people killed in traffic accidents over a 10-year period. During this period, what was the average number of people killed per year? How many people died each day on average in traffic accidents?

\[ \bar{d} = \frac{623 + 583 + 959 + 1037 + 960 + 797 + 663 + 652 + 560 + 619}{10} = \frac{7453}{10} = 745.3 \approx 745 \text{ people/year} \]

\[ \frac{745.3}{365} \approx 2 \text{ people/day} \]
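The same arithmetic can be checked with a short Python sketch. The list name `deaths` is an illustrative choice, and the per-day figure assumes a 365-day year.

```python
# Yearly traffic deaths, copied from the sum in the example above.
deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

# Mean = sum of the values divided by the number of values.
mean_per_year = sum(deaths) / len(deaths)

# Assumption: a 365-day year is used to convert to a daily average.
mean_per_day = mean_per_year / 365

print(f"{mean_per_year:.1f} people/year, about {mean_per_day:.2f} people/day")
```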
More about shapes
Measures of location
Common measures of location are quartiles and percentiles.
Quartiles divide ordered data into quarters while percentiles
divide ordered data into hundredths.
The median is the second quartile and the 50th percentile.

Percentiles are often used to compare values. Being in the 90th percentile on an exam does not necessarily mean that you received 90% on the test. It means that 90% of test scores are the same as or less than your score and 10% of the test scores are the same as or greater than your score.
Computing percentile
• Order the data from smallest to largest.
• Calculate the position i of the kth percentile in a sample of n observations with the formula

\[ i = \frac{k(n+1)}{100} \]

• If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.

• If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set. This is easier to understand in an example.

NOTE: You can calculate percentiles using calculators and computers. A variety of online calculators are available; they may use slightly different algorithms, so their results can differ a little from this hand method.
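The steps above can be sketched as a small Python function. The name `percentile` and the clamping of out-of-range positions are assumptions made for illustration, since the notes do not say what to do when the position falls outside 1 to n. The ages are the 29 values used in the example that follows.

```python
import math

def percentile(sorted_values, k):
    """kth percentile using the position rule i = k(n + 1) / 100."""
    n = len(sorted_values)
    i = k * (n + 1) / 100
    lower, upper = math.floor(i), math.ceil(i)
    # Assumption: positions outside 1..n are clamped to the nearest end,
    # a case the notes do not cover.
    lower = min(max(lower, 1), n)
    upper = min(max(upper, 1), n)
    # When i is an integer, lower == upper and the average is the value itself.
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile(ages, 70))  # 64.0
```

Library routines generally use other interpolation conventions, so they need not agree exactly with this hand rule.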
Example
Listed are 29 ages for Academy Award-winning best actors in order from smallest to largest: 18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77
• Find the 70th percentile.
k = 70, i = the index, and n = 29

\[ i = \frac{k(n+1)}{100} = \frac{70(29+1)}{100} = 21 \]

Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.

•Now, your turn, try to find the 83rd percentile.


Five-Number Summary and Boxplots
The five-number summary is a simple way to quickly summarize a data set. It consists of:

1. Minimum
2. Q1, first quartile, also the 25th percentile
3. Median, second quartile, also the 50th percentile
4. Q3, third quartile, also the 75th percentile
5. Maximum

Example
18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62,
64, 67, 69, 71, 72, 73, 74, 76, 77
• Find the five-number summary.

Min = 18
Q1 = 29.5
Q2 = 47
Q3 = 68
Max = 77
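These five values can be reproduced with the percentile position rule i = k(n + 1)/100 from the earlier slides. The sketch below is one way to do it. The function names are illustrative, and other quartile conventions can give slightly different Q1 and Q3.

```python
import math

def percentile(sorted_values, k):
    """kth percentile using the position rule i = k(n + 1) / 100."""
    n = len(sorted_values)
    i = k * (n + 1) / 100
    # Assumption: positions are clamped to 1..n at the extremes.
    lower = min(max(math.floor(i), 1), n)
    upper = min(max(math.ceil(i), 1), n)
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    data = sorted(values)
    return (data[0], percentile(data, 25), percentile(data, 50),
            percentile(data, 75), data[-1])

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(five_number_summary(ages))  # (18, 29.5, 47.0, 68.0, 77)
```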
Range (Measure of dispersion)
We need to determine how spread out the data are because the more the data
vary, the less a measure of centre can tell us.
One simple measure of spread is the range, defined as the difference between the
extremes.

Range = max − min


Interquartile Range (Measure of dispersion)
The interquartile range (IQR) is defined as the difference between the third and first quartiles:

IQR = Q3 − Q1

The IQR can be used as a measure of spread. It is also helpful in determining potential outliers.
The lower fence (LF) and upper fence (UF) are used to identify outliers: values below LF or above UF are flagged as potential outliers.

LF = Q1 − 1.5 × IQR
UF = Q3 + 1.5 × IQR
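A minimal sketch of the fence rule follows, applied to the actor ages with Q1 = 29.5 and Q3 = 68 taken from the five-number summary above. The function name `fences` is an illustrative choice.

```python
def fences(q1, q3):
    """Lower and upper fences at 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

# Quartiles taken from the five-number summary in these notes.
q1, q3 = 29.5, 68
lower_fence, upper_fence = fences(q1, q3)

# A value is a potential outlier if it falls outside the fences.
outliers = [age for age in ages if age < lower_fence or age > upper_fence]

print(lower_fence, upper_fence, outliers)  # -28.25 125.75 []
```

For these ages both fences lie well outside the observed range, so no value is flagged.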
Five-number summary and Boxplot
18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77

Min = 18
Q1 = 29.5
Q2 = 47
Q3 = 68
Max = 77
IQR = 68 − 29.5 = 38.5
LF = 29.5 − 1.5 × 38.5 = −28.25
UF = 68 + 1.5 × 38.5 = 125.75
All 29 ages lie between the fences, so there are no outliers.
Boxplots
[Figure: boxplot with labelled values Max = 165, UF = 111, Q3 = 80, Q2 = 68, Q1 = 59, Min = 42, LF = 29]
Variance and standard deviation (Measure of dispersion)

Sample variance:
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]

Sample standard deviation:
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]

Population variance:
\[ \sigma^2 = \frac{\sum (y - \mu)^2}{n} \]
Example
For the traffic-fatality example, what is the variance?

\[ s^2 = \frac{(623 - 745.3)^2 + (583 - 745.3)^2 + (959 - 745.3)^2 + (1037 - 745.3)^2 + (960 - 745.3)^2 + (797 - 745.3)^2 + (663 - 745.3)^2 + (652 - 745.3)^2 + (560 - 745.3)^2 + (619 - 745.3)^2}{10 - 1} = \frac{286590.1}{9} \approx 31843.3 \]
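The sample variance and standard deviation for this example can be computed step by step, following the formula with n − 1 in the denominator. This is a sketch with illustrative variable names, using the ten yearly counts from the mean example.

```python
import math

deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

n = len(deaths)
mean = sum(deaths) / n

# Squared deviation of each yearly count from the mean.
squared_deviations = [(y - mean) ** 2 for y in deaths]

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"variance = {sample_variance:.1f}, standard deviation = {sample_sd:.1f}")
```

The standard deviation is in the same units as the data (people per year), which is why it is usually reported instead of the variance.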
Coefficient of variation
Coefficient of variation measures how much
variability exists compared with the mean.

CV = Standard deviation / Mean

\[ CV = \frac{s}{\bar{y}} \]
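As a brief illustration, the coefficient of variation for the traffic-fatality data can be obtained with Python's standard `statistics` module. `statistics.stdev` is the sample standard deviation (n − 1 denominator), matching the formula used in these notes.

```python
import statistics

deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

mean = statistics.mean(deaths)
sd = statistics.stdev(deaths)  # sample standard deviation

# CV expresses the spread relative to the mean, so it has no units.
cv = sd / mean

print(f"CV = {cv:.3f} ({100 * cv:.1f}% of the mean)")
```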
Reporting the Shape, Centre, and Spread
Which measures of centre and spread should be used for a
distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard
deviation and possibly IQR should be reported.
• If multiple modes exist, determine if the data can be split into separate
groups.
• If there are unusual observations, point them out and report the
(median and IQR) OR (mean and standard deviation) with and without
the values.
• Always pair the median with the IQR and the mean with the standard
deviation.
Mean and Variance of Grouped Data
To calculate the mean or variance of grouped data, we use the midpoint of each range in our calculations.

\[ \bar{x} = \sum x \, p(x) \qquad \sigma^2 = \sum (x - \bar{x})^2 \, p(x) \]

Range ($)    Midpoint ($)    % of Sample    MidPt × %    (MidPt − Mean)² × %
0            0               23%            0.00         16.85
1–5          3               14%            0.42         4.33
6–10         8               23%            1.84         0.07
11–19        15              8%             1.20         3.32
>20          30              17%            5.10         78.14
                             Mean           $8.56
                                            Variance     102.71
                                            SD           $10.13

Table 5.3 Calculation of the average and variance
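The grouped-data formulas can be sketched in Python using the midpoints and sample percentages from Table 5.3. Two caveats apply: the listed percentages sum to 85%, and this sketch follows the table as given without renormalizing. The variance is computed in squared dollars, which is the scale consistent with the reported SD of $10.13.

```python
import math

# Midpoints ($) and sample shares copied from Table 5.3.
midpoints = [0, 3, 8, 15, 30]
shares = [0.23, 0.14, 0.23, 0.08, 0.17]  # sums to 0.85 as listed in the table

# Grouped mean: sum of midpoint times share.
mean = sum(x * p for x, p in zip(midpoints, shares))

# Grouped variance: sum of squared deviation from the mean times share.
variance = sum((x - mean) ** 2 * p for x, p in zip(midpoints, shares))
sd = math.sqrt(variance)

print(f"mean = ${mean:.2f}, variance = {variance:.2f}, SD = ${sd:.2f}")
```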


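Standardizing
Standardizing expresses a value as the number of standard deviations it lies from the mean. The result is called a standardized value or z-score.

\[ z = \frac{y - \bar{y}}{s} \]

A rule of thumb for identifying outliers is z > 3 or z < −3.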
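As an illustration of the rule of thumb, the sketch below standardizes the ten yearly traffic-fatality counts from the earlier example and flags any value with |z| > 3. The variable names are illustrative choices.

```python
import statistics

deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

mean = statistics.mean(deaths)
sd = statistics.stdev(deaths)  # sample standard deviation (n - 1 denominator)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(y - mean) / sd for y in deaths]

# Rule of thumb from the notes: flag values with z > 3 or z < -3.
flagged = [y for y, z in zip(deaths, z_scores) if abs(z) > 3]

for y, z in zip(deaths, z_scores):
    print(f"{y:>5}  z = {z:+.2f}")
print("flagged:", flagged)
```

No year is flagged here: the largest count, 1037, is only about 1.6 standard deviations above the mean.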


5.13 Time Series Plots (1 of 7)
A display of values against time is sometimes called a time series plot.
Below we have a time series plot of the NYSE daily volumes for 2006.

Figure 5.12 A time series plot of Daily Volume shows the overall pattern and changes in variation.
Transforming Skewed Data (1 of 3)
Example: Below we display the skewed distribution of total
compensation for the CEOs of the 500 largest companies.

Figure 5.16 The total compensation for CEOs (in $000) of the 500 largest companies is skewed and
includes some extraordinarily large values.

What is the “centre” of this distribution? Are there outliers?


*5.14 Transforming Skewed Data (3 of 3)
Example: Below we display the transformed distribution of total compensation for the CEOs of the 500 largest companies. A simple log function is used to transform the data values.

This histogram is much more symmetric, and we see that a typical log compensation is between 6.0 and 7.0, or $1 million and $10 million in the original terms.

Figure 5.17 Taking logs makes the histogram of CEO total compensation nearly symmetric.
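The correspondence between 6.0 to 7.0 on the log scale and $1 million to $10 million can be checked with a short sketch. It assumes the transformation is a base-10 logarithm applied to compensation in dollars, which is what that correspondence implies; the notes only say "a simple log function". The value 6.5 is a hypothetical log compensation used to show the back-transformation.

```python
import math

# Assumption: base-10 logarithm of compensation measured in dollars.
low_dollars, high_dollars = 1_000_000, 10_000_000

log_low = math.log10(low_dollars)
log_high = math.log10(high_dollars)
print(log_low, log_high)

# Back-transforming a hypothetical log compensation of 6.5 to dollars.
typical_dollars = 10 ** 6.5
print(f"${typical_dollars:,.0f}")
```

Equal steps on the log scale correspond to equal ratios in dollars, which is why the long right tail is pulled in.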
