ch4 5new
Pictures …
• summarize a table of numbers in one object
• reveal things that you are not likely to see in a table of numbers
• show important features and patterns in the data
• provide an excellent means to report findings to others
4.2 Frequency Tables (1 of 3)
• A frequency table organizes data by recording counts and
category names as in the table here.
Province Corporate Stores
Newfoundland and Labrador 12
Prince Edward Island 4
Nova Scotia 32
New Brunswick 22
Quebec 171
Ontario 165
• Table 4.1 Frequency table of the number of Loblaw stores in eastern Canada.
4.2 Frequency Tables (3 of 3)
A relative frequency table displays the proportions or percentages of
the total that lie in each category rather than the counts.
Table 4.2 Relative frequency table showing percentages of Loblaw stores in
eastern Canada.
Province    Corporate Stores    Corporate Stores (%)
Quebec      171                 42.12
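As a minimal sketch of how the percentages in a relative frequency table can be obtained, the Python snippet below divides each count from Table 4.1 by the total number of stores. The names `store_counts` and `relative_frequencies` are illustrative and do not come from the slides.

```python
# Counts copied from Table 4.1 (Loblaw corporate stores in eastern Canada).
store_counts = {
    "Newfoundland and Labrador": 12,
    "Prince Edward Island": 4,
    "Nova Scotia": 32,
    "New Brunswick": 22,
    "Quebec": 171,
    "Ontario": 165,
}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


percentages = relative_frequencies(store_counts)

# Print province, count, and percentage side by side.
for province, pct in percentages.items():
    print(f"{province:<28}{store_counts[province]:>5}{pct:>8.2f}")
```

The Quebec row reproduces the 42.12% shown in Table 4.2, and the percentages add up to 100.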
Pie Charts
Pie charts show the whole
group as a circle (“pie”) sliced
into pieces.
Variability
Although we will not calculate a numerical measure of variability here, we can judge it visually. A variable whose
observations are spread fairly evenly over all categories shows high variability, while a variable whose
observations fall mostly in one or a handful of categories shows low variability. Consider the level
of variability in the two pie charts below.
4.4 Exploring Two Categorical Variables:
Contingency Tables
Table Contingency table of sex and continent of origin of Dr Yawo’s Students (2024).
                    Africa   Asia   Europe   North America   South America   Grand Total
Female                   6     35        7              14               4            66
Male                     7     28        8              26               2            71
Prefer not to say        1      1        0               0               0             2
Grand Total             14     64       15              40               6           139
Joint distribution and Marginal distribution
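One possible way to compute these distributions from the contingency table of sex and continent of origin is sketched below in Python. It assumes the usual definitions: the joint distribution gives each cell as a proportion of the grand total, and the marginal distributions give each row total or column total as a proportion of the grand total. The variable names are illustrative only.

```python
continents = ["Africa", "Asia", "Europe", "North America", "South America"]

# Cell counts copied from the contingency table (rows: sex, columns: continent).
counts = {
    "Female": [6, 35, 7, 14, 4],
    "Male": [7, 28, 8, 26, 2],
    "Prefer not to say": [1, 1, 0, 0, 0],
}

grand_total = sum(sum(row) for row in counts.values())

# Joint distribution: every cell divided by the grand total.
joint = {sex: [c / grand_total for c in row] for sex, row in counts.items()}

# Marginal distribution of sex: row totals divided by the grand total.
row_marginal = {sex: sum(row) / grand_total for sex, row in counts.items()}

# Marginal distribution of continent: column totals divided by the grand total.
col_marginal = {
    continent: sum(row[j] for row in counts.values()) / grand_total
    for j, continent in enumerate(continents)
}

for sex, share in row_marginal.items():
    print(f"{sex:<20}{share:.3f}")
for continent, share in col_marginal.items():
    print(f"{continent:<20}{share:.3f}")
```

Each marginal distribution sums to 1, as does the joint distribution taken over all cells.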
Histograms
A histogram plots the bin
counts as the heights of
bars and shows the
overall “shape” of the data.
5.1 Displaying Data Distributions (2 of 6)
Stem-and-Leaf Displays
Stem-and-leaf displays are like
histograms, but they also give the
individual values.
A stem-and-leaf display for thirty-six
months of stock price data is shown
below together with a histogram.
5.2 Shape (1 of 7)
When describing a quantitative distribution, we want to note at
least four things: the shape of the distribution, the
presence of outliers, the center, and the spread. A helpful
acronym for remembering these is SOCS:
o Shape
o Outliers
o Center
o Spread
5.2 Shape (2 of 7)
Peaks or humps seen in a histogram are called the modes of a
distribution.
A distribution whose histogram has one main peak is called
unimodal; with two peaks, bimodal; with three or more peaks, multimodal.
5.2 Shape (3 of 7)
Modes
A distribution whose histogram doesn’t appear to have any clear
mode, and in which all the bars are approximately the same height, is
called a uniform distribution.
Mean: $\bar{y} = \frac{\sum y}{n}$
Example
The following table lists the number of people killed in
traffic accidents over a 10-year period. During this period,
what was the average number of people killed
each year? How many people died each day, on average,
in traffic accidents?
$\bar{d} = \frac{623 + 583 + 959 + 1037 + 960 + 797 + 663 + 652 + 560 + 619}{10} = \frac{7453}{10} = 745.3$
≈ 745 people/year ≈ 2 people/day
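The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the list name `deaths` is illustrative, and the per-day figure assumes 365-day years.

```python
# Yearly traffic deaths from the example, in the order given.
deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

# Mean = sum of the values divided by the number of values.
mean_per_year = sum(deaths) / len(deaths)

# Assumption: every year has 365 days.
mean_per_day = mean_per_year / 365

print(f"Total deaths: {sum(deaths)}")
print(f"Mean per year: {mean_per_year:.1f}")
print(f"Mean per day: {mean_per_day:.2f}")
```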
More about shapes
Measures of location
Common measures of location are quartiles and percentiles.
Quartiles divide ordered data into quarters while percentiles
divide ordered data into hundredths.
The median is the second quartile and the 50th percentile.
To find the kth percentile of n ordered data values, first compute the index $i = \frac{k(n+1)}{100}$.
• If i is an integer, then the kth percentile is the data value in the ith position in the
ordered set of data.
• If i is not an integer, then round i up and round i down to the nearest integers.
Average the two data values in these two positions in the ordered data set. This
is easier to understand in an example.
NOTE: You can also calculate percentiles using calculators and computers. Different
tools use slightly different rules, so their answers may differ slightly from the method above.
Example
Listed are 29 ages for Academy Award-winning best actors, in
order from smallest to largest: 18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36,
37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77
• Find the 70th percentile.
k = 70, i = the index, and n = 29
Formula: $i = \frac{k(n+1)}{100} = \frac{70(29+1)}{100} = 21$.
Twenty-one is an integer, and the data value in the 21st position in the
ordered data set is 64. The 70th percentile is 64 years.
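The percentile rule used in this example, with index i = k(n + 1)/100, can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch simply rejects values of k for which the index falls outside the data, which is an assumption since the slides do not cover that case.

```python
import math

# The 29 ordered ages from the example.
ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]


def percentile(sorted_values, k):
    """kth percentile using the index rule i = k(n + 1) / 100."""
    n = len(sorted_values)
    i = k * (n + 1) / 100
    if i < 1 or i > n:
        # Assumption: positions outside the data are not handled here.
        raise ValueError("index falls outside the data set")
    if i == int(i):
        # Integer index: take the value in the ith position (1-based).
        return sorted_values[int(i) - 1]
    # Otherwise average the values in the positions just below and above i.
    lower, upper = math.floor(i), math.ceil(i)
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


print(percentile(ages, 70))  # 70th percentile
print(percentile(ages, 25))  # first quartile
```

The same function gives the quartiles when called with k = 25, 50, and 75.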
1. Minimum
2. Q1, first quartile, also the 25th percentile
3. Median, second quartile, also the 50th percentile
4. Q3, third quartile, also the 75th percentile
5. Maximum
Min = 18
Q1 = 29.5
Q2 = 47
Q3 = 68
Max = 77
Range (Measure of dispersion)
We need to determine how spread out the data are because the more the data
vary, the less a measure of centre can tell us.
One simple measure of spread is the range, defined as the difference between the
extremes.
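The interquartile range is IQR = Q3 – Q1. The lower fence (LF) and upper fence (UF), used to flag outliers, are
LF = Q1 – 1.5*IQR,
UF = Q3 + 1.5*IQR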
Five-number summary and Boxplot
18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77
Min = 18
Q1 = 29.5
Q2 = 47
Q3 = 68
Max = 77
IQR = 68 – 29.5 = 38.5
LF = 29.5 – 1.5 * 38.5 = –28.25
UF = 68 + 1.5 * 38.5 = 125.75
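The range, the IQR, and the fences for the ages data can be checked with the short Python sketch below. The quartiles are taken directly from the five-number summary above rather than recomputed, and the variable names are illustrative.

```python
# The 29 ordered ages from the example.
ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

# Quartiles as given in the five-number summary.
q1, q3 = 29.5, 68

data_range = max(ages) - min(ages)  # difference between the extremes
iqr = q3 - q1                       # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond either fence would be flagged as outliers.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(f"Range = {data_range}, IQR = {iqr}")
print(f"LF = {lower_fence}, UF = {upper_fence}")
print(f"Outliers: {outliers}")
```

Every age lies between the two fences, so this data set has no flagged outliers.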
Boxplots
[Figure: boxplot with labelled values Max = 165, UF = 111, Q3 = 80, Q2 = 68, Q1 = 59, Min = 42, LF = 29]
Variance and standard deviation (Measure of dispersion)
Sample variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
Sample standard deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
Population variance: $\sigma^2 = \frac{\sum (y - \mu)^2}{n}$
Example
For the traffic deaths example, what is the sample variance?
$s^2 = \frac{(623 - 745.3)^2 + (583 - 745.3)^2 + (959 - 745.3)^2 + (1037 - 745.3)^2 + (960 - 745.3)^2 + (797 - 745.3)^2 + (663 - 745.3)^2 + (652 - 745.3)^2 + (560 - 745.3)^2 + (619 - 745.3)^2}{10 - 1} = \frac{286590.1}{9} \approx 31843.3$
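A minimal Python sketch of this calculation follows, using the sample formula with n − 1 in the denominator and taking the square root to get the sample standard deviation. The variable names are illustrative.

```python
import math

# Yearly traffic deaths from the earlier example.
deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

n = len(deaths)
mean = sum(deaths) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((y - mean) ** 2 for y in deaths)

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"Sum of squared deviations: {sum_sq_dev:.1f}")
print(f"Sample variance: {sample_variance:.2f}")
print(f"Sample standard deviation: {sample_sd:.2f}")
```

The standard library function `statistics.variance` applies the same n − 1 formula and can serve as a cross-check.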
Coefficient of variation
The coefficient of variation measures how much
variability exists compared with the mean: $CV = s / \bar{y}$, the standard deviation divided by the mean.
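As an illustration, the sketch below computes the coefficient of variation for the traffic deaths data from the earlier example, taking it as the sample standard deviation divided by the mean. Applying it to this particular data set is an illustrative choice, not something the slides do.

```python
import statistics

# Yearly traffic deaths from the earlier example.
deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

mean = statistics.mean(deaths)
sd = statistics.stdev(deaths)  # sample standard deviation (n - 1 denominator)

# Coefficient of variation: spread relative to the mean, unit-free.
cv = sd / mean

print(f"CV = {cv:.3f} ({100 * cv:.1f}% of the mean)")
```

Because the units cancel, the coefficient of variation can be used to compare variability across variables measured on different scales.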
$\bar{x} = \sum x\,p(x)$    $\sigma^2 = \sum (x - \bar{x})^2\,p(x)$
Range($)   Midpoint($)   % of Sample   MidPt × %   (MidPt − Mean)² × %
0          0             23%           0.00        16.85
1–5        3             14%           0.42        4.33
6–10       8             23%           1.84        0.07
11–19      15            8%            1.20        3.32
>20        30            17%           5.10        78.14
Mean = $8.56
Variance = 102.71
SD = $10.13
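The grouped-data calculation in this table can be sketched in Python as a weighted sum over the midpoints. The sketch follows the table's own arithmetic: the listed shares add up to 85% and are used as given, without rescaling, which is an assumption about how the slide's figures were produced. The variance is expressed in squared dollars so that its square root matches the SD of $10.13.

```python
import math

# (midpoint in dollars, share of sample) for each row of the table.
rows = [(0, 0.23), (3, 0.14), (8, 0.23), (15, 0.08), (30, 0.17)]

# Weighted mean: sum of midpoint times share.
mean = sum(mid * p for mid, p in rows)

# Weighted variance: sum of squared deviation from the mean times share.
variance = sum((mid - mean) ** 2 * p for mid, p in rows)
sd = math.sqrt(variance)

print(f"Mean = ${mean:.2f}")
print(f"Variance = {variance:.2f}")
print(f"SD = ${sd:.2f}")
```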
Standardized value (z-score): $z = \frac{y - \bar{y}}{s}$
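A short Python sketch of standardizing values with z = (y − ȳ)/s is given below. It reuses the traffic deaths data from the earlier example purely for illustration, and the variable names are not from the slides.

```python
import statistics

# Yearly traffic deaths from the earlier example.
deaths = [623, 583, 959, 1037, 960, 797, 663, 652, 560, 619]

mean = statistics.mean(deaths)
sd = statistics.stdev(deaths)  # sample standard deviation

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = [(y - mean) / sd for y in deaths]

for y, z in zip(deaths, z_scores):
    print(f"{y:>5}  z = {z:+.2f}")
```

The largest value, 1037, lies about 1.63 standard deviations above the mean, and the z-scores sum to zero apart from rounding error.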
Figure 5.12 A time series plot of Daily Volume shows the overall pattern and changes in variation.
Transforming Skewed Data (1 of 3)
Example: Below we display the skewed distribution of total
compensation for the CEOs of the 500 largest companies.
Figure 5.16 The total compensation for CEOs (in $000) of the 500 largest companies is skewed and
includes some extraordinarily large values.