Unit 1 - Examining Distributions
Analysis
Data Set (the element names are the companies)
Company        Stock Exchange   Annual Sales ($M)   Earn/Share ($)
Dataram        NQ               73.10
EnergySouth    N                74.00               1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40
Psychemedics   N                17.60               0.13
Population VS Sample
How Many Variables Have You Measured?
Univariate data: one variable is measured on a single experimental unit
Ordinal
The scale determines the amount of information contained in the data.
The scale indicates the data summarization and statistical analyses that are most appropriate.
Measurement Scale of Qualitative
Nominal level, Ordinal level
Quantitative Variables
We can use:
histograms.
Time Series chart.
Graphing Qualitative Variables
Use a data distribution to describe:
What values of the variable have been measured
How often each value has occurred:
Frequency
Relative frequency = Frequency / n (where n = sample size)
Percent = 100 × Relative frequency
What is the probability (chance) that a random student picked scored at least 80?
Class Limits     Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total            30
Relative Frequency Distribution of Grades
Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
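The relative frequencies can be checked with a short Python sketch. It also answers the question about a randomly picked student scoring at least 80, on the assumption that every one of the 30 students is equally likely to be picked. The variable names are illustrative only.

```python
# Grade classes from the frequency table: (lower limit, upper limit, frequency).
classes = [(40, 50, 2), (50, 60, 6), (60, 70, 8),
           (70, 80, 7), (80, 90, 5), (90, 100, 2)]

n = sum(freq for _, _, freq in classes)  # sample size

# Relative frequency = Frequency / n; Percent = 100 x Relative frequency.
relative = {(lo, hi): freq / n for lo, hi, freq in classes}
for (lo, hi), rf in relative.items():
    print(f"{lo} up to {hi}: {rf:.3f} ({100 * rf:.1f}%)")

# Chance that a randomly picked student scored at least 80:
# add the frequencies of the classes starting at 80 or higher.
p_at_least_80 = sum(freq for lo, _, freq in classes if lo >= 80) / n
print(f"P(score >= 80) = {p_at_least_80:.3f}")
```

Seven of the 30 students fall in the two highest classes, so the chance is 7/30, or about .233.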
[Figure: Relative Frequency Histogram of Grades. Horizontal axis: Grade (40 to 100). Vertical axis: Relative frequency (0 to .30).]
Figure - Histograms (SHAPES)
Symmetric = Bell Shaped
Sample Mean
For ungrouped data, the sample mean is the sum of all the
sample values divided by the number of sample values:
Let X be a random variable
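\[ \bar{X} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
For example, for a sample of n = 12 values:
\[ \bar{X} = \frac{\sum X}{n} = \frac{90 + 77 + 94 + \dots + 113 + 83}{12} = 97.5 \]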
Example 4
A random sample of eight newborns was selected, and their lengths (in inches) were as follows:
20.4 18.5 16.3 17.9 19.2 21.2 17.3 ???
It is known that the sample mean of these eight newborns equals 18.825 inches.
What is the 8th baby's length?
Solution:
The sample mean is the sum of all eight lengths divided by 8:
\[ \bar{X} = \frac{\sum x_i}{8} = 18.825 \]
Sum of all eight lengths: \sum x_i = 18.825(8) = 150.6
Sum of the seven known lengths: 20.4 + 18.5 + 16.3 + 17.9 + 19.2 + 21.2 + 17.3 = 130.8
Subtract the sums:
150.6 − 130.8 = 19.8 inches
is the 8th baby's length.
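The same back-calculation can be written as a minimal Python sketch. The variable names are illustrative and not part of the original example.

```python
# Seven known newborn lengths (inches); the eighth is unknown.
known_lengths = [20.4, 18.5, 16.3, 17.9, 19.2, 21.2, 17.3]
n = 8                 # total number of newborns in the sample
sample_mean = 18.825  # given sample mean of all eight lengths

# mean = total / n, so total = mean * n.
total = sample_mean * n

# The missing value is whatever is left after removing the known values.
missing = total - sum(known_lengths)
print(round(missing, 1))
```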
Median
Median: the middle measurement when the
measurements are ranked from smallest to largest
The position of the median is
0.5(n + 1)
Median = average of the two middle values = (326 + 380)/2 = 353 minutes
Mode
Example
The class standings of five students who are members of the student senate at a college
are senior, sophomore, senior, junior, and senior, respectively. Find the mode.
Solution:
Because senior occurs more frequently than the other categories, it is the mode
for this data set. We cannot calculate the mean and median for this data set
because the data are qualitative.
Example
The number of liters of milk purchased by 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3
3 3 3 3 4 4 4 5
Mean?
Median?
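One way to check the answers is with Python's standard `statistics` module. This is a sketch; the list below simply re-enters the 25 values from the example.

```python
import statistics

# Liters of milk purchased by the 25 households, already ranked.
liters = [0] * 2 + [1] * 5 + [2] * 9 + [3] * 5 + [4] * 3 + [5]

n = len(liters)
mean = statistics.mean(liters)          # sum of values / n = 55 / 25

# Position of the median in the ranked list is 0.5(n + 1).
median_position = 0.5 * (n + 1)
median = statistics.median(liters)

mode = statistics.mode(liters)          # most frequent value

print(n, mean, median_position, median, mode)
```

With n = 25 the median sits at position 13 of the ranked list, which is a 2. The value 2 is also the mode, and the mean is 55/25 = 2.2.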
\[ \text{Weighted Mean} = \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} \]
Weighted Mean (Example)
Example
Solution
\[ \text{Weighted Average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = 90.2 \]
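The weighted mean formula can be sketched as a small Python function. The function name `weighted_mean` is hypothetical; the values and weights are the ones used in the solution.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Scores of 83 and 95 weighted 40% and 60%, as in the solution.
result = weighted_mean([83, 95], [0.40, 0.60])
print(round(result, 1))
```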
The Relative Positions of the Mean, Median
and the Mode
Measures of Variability
Measure of variability: a measure along
the horizontal axis of the data distribution
that describes the spread of the
distribution from the center
Copyright © 2019 by Nelson Education Ltd.
The Variance
The variance of a population of N
measurements is the average of the
squared deviations of the measurements
about their mean μ
Why divide by n – 1?
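The sample standard deviation s is often used to estimate the
population standard deviation σ.
Dividing by n – 1 rather than n gives us a better estimate of σ, because deviations measured
from the sample mean tend to be smaller than deviations measured from the population mean μ.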
EXAMPLE – Sample Variance
The hourly wages for a sample of part-time
employees at Home Depot are: $12, $20, $16, $18,
and $19. What is the sample variance?
(Sample Mean is calculated and is $17)
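A minimal Python sketch of the calculation follows. It sums the squared deviations from the sample mean and divides by n – 1. The final line cross-checks against `statistics.variance`, which also uses the n – 1 divisor.

```python
import statistics

wages = [12, 20, 16, 18, 19]   # hourly wages in dollars
n = len(wages)

mean = sum(wages) / n          # sample mean, $17

# Squared deviations of each wage from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in wages]

# Sample variance: divide the sum of squared deviations by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)

# For comparison only: dividing by n instead gives a smaller value.
divided_by_n = sum(squared_deviations) / n

print(sample_variance, divided_by_n, statistics.variance(wages))
```

The squared deviations are 25, 9, 1, 1 and 4, which sum to 40, so the sample variance is 40/4 = 10 (in dollars squared).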
Example 9 – Variance and Standard deviation
Consider a small sample dataset: 4, 16, 9, 7, 0, 1, 10, 8
Find the standard deviation of the above set.
Solution
Step 1: Find the sample mean:
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{55}{8} = 6.875 \]

i             $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1             4        -2.875             8.265625
2             16       9.125              83.265625
3             9        2.125              4.515625
4             7        0.125              0.015625
5             0        -6.875             47.265625
6             1        -5.875             34.515625
7             10       3.125              9.765625
8             8        1.125              1.265625
Total (Sum)   55       0                  188.875

Step 2: Find the standard deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{188.875}{8 - 1}} = \sqrt{26.982143} = 5.194434 \]
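The two steps of Example 9 can be reproduced with a short Python sketch. The variable names are illustrative.

```python
import math

data = [4, 16, 9, 7, 0, 1, 10, 8]
n = len(data)

# Step 1: sample mean.
mean = sum(data) / n

# Step 2: sum of squared deviations, then divide by n - 1 and take the root.
sum_sq_dev = sum((x - mean) ** 2 for x in data)
s = math.sqrt(sum_sq_dev / (n - 1))

print(mean, sum_sq_dev, round(s, 6))
```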
Measures of Position
Quartiles and Interquartile Range
Percentiles and Percentile Rank
Definition
Quartiles are three summary measures that divide a ranked data
set into four equal parts.
The second quartile is the same as the median of a data set.
The first quartile is the value of the middle term among the
observations that are less than the median, and the third quartile
is the value of the middle term among the observations that are
greater than the median.
Interquartile Range (IQR)
Interquartile Range (IQR) is a measure of variability which is
the difference between the third and the first quartiles.
IQR = Interquartile range = Q3 – Q1
Example 12:
A sample of 12 commuter students was selected from a college.
The following data give the typical one-way commuting times (in
minutes) from home to college for these 12 students.
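29 14 39 17 7 47 63 37 42 18 24 55
Solution:
Step 1. We rank the data in increasing order:
7 14 17 18 24 29 37 39 42 47 55 63
Step 2. We find the median (second quartile), which is the average of the two middle values:
Median = Q2 = (29 + 37)/2 = 33
Step 3. We find the median of the data values that are smaller than the median, and this gives
the value of the first quartile. Similarly, the median of the data values that are larger than
the median gives the value of the third quartile.
7 14 17 18 24 29      Q1 = (17 + 18)/2 = 17.5
37 39 42 47 55 63     Q3 = (42 + 47)/2 = 44.5
(b) By looking at the position of 47 minutes, which is above Q3 = 44.5, we can state that this
value lies in the top 25% of the commuting times.
(c) The interquartile range is given by the difference between the values of the
third and first quartiles. Thus
IQR = Q3 – Q1 = 44.5 – 17.5 = 27 minutes
Interpretation: The range of the middle half of commuting times in the sample is 27 minutes.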
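The quartile definition used in this unit (median of the lower half, overall median, median of the upper half) can be sketched in Python as below. The helper name `quartiles` is hypothetical. Built-in routines such as `statistics.quantiles` use interpolation rules that can give slightly different values, so the sketch implements the definition directly.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves definition.

    Needs at least two values. For an odd number of values the median
    itself is left out of both halves.
    """
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]               # values below the median
    upper = ranked[half + (n % 2):]     # values above the median
    return (statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper))


times = [29, 14, 39, 17, 7, 47, 63, 37, 42, 18, 24, 55]
q1, q2, q3 = quartiles(times)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For the 12 commuting times this gives Q1 = 17.5, Q2 = 33, Q3 = 44.5 and an IQR of 27 minutes.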
Identifying OUTLIERS
In addition to serving as a measure of spread, the interquartile range (IQR) is
used as part of a rule of thumb for identifying outliers.
The 1.5 x IQR Rule for Outliers
Any value below Q1 – 1.5 x IQR is considered a low outlier, and any value above
Q3 + 1.5 x IQR is considered a high outlier.
Example: The data below are the commuting times (in minutes) of 20 randomly selected
New York workers. Use the 1.5 x IQR rule to identify any outliers.
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Solution: In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and
IQR = 27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 – 1.5 x IQR = 15 – 41.25 = –26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than –26.25 minutes or longer than 83.75 minutes is considered an
outlier. No travel time can be negative, so there are no low outliers. The only value above
83.75 is 85, so 85 is a high outlier.
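The 1.5 x IQR rule can be sketched in Python as follows. The function name `iqr_fences` is hypothetical, and the quartiles are taken as given from the solution rather than recomputed.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

# Quartiles as stated in the solution: Q1 = 15, Q3 = 42.5.
low, high = iqr_fences(15, 42.5)

# Any value outside the two cutoffs is flagged as an outlier.
outliers = [t for t in times if t < low or t > high]
print(low, high, outliers)
```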
FIVE NUMBER SUMMARY
The minimum and maximum values alone tell us little about the
distribution as a whole. Likewise, the median and quartiles tell us little
about the tails of a distribution.
To get a quick summary of both center and spread, combine all five
numbers: the minimum, Q1, the median, Q3, and the maximum.
6, 8, 1, 5, 7, 4, 4, 9, 2
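As an illustration, a five-number summary of the nine values above can be computed with the Python sketch below. It assumes the median-of-halves definition of quartiles given earlier in this unit; other quartile conventions can give slightly different Q1 and Q3. The function name is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for two or more values."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]              # values below the median
    upper = ranked[half + (n % 2):]    # values above the median
    return (ranked[0],
            statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper),
            ranked[-1])


summary = five_number_summary([6, 8, 1, 5, 7, 4, 4, 9, 2])
print(summary)
```

Ranked, the data are 1 2 4 4 5 6 7 8 9, so under this definition the summary is minimum 1, Q1 = 3, median 5, Q3 = 7.5, maximum 9.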
Example:
Sadie, a student in a large class with 250 students, receives a score of 80% on her
exam. Sadie learns that her score is at the 70th percentile in the class. How many
students scored higher than Sadie?
Solution: If Sadie's score is at the 70th percentile, then about 70% of the students in the
class scored at or below her score and about 30% scored higher.
Therefore, 0.30 x 250 = 75 students scored higher than Sadie.
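The percentile arithmetic can be written as a short Python sketch. The variable names are illustrative.

```python
class_size = 250
percentile = 70   # Sadie's score is at the 70th percentile

# Share of the class scoring above the given percentile.
share_above = (100 - percentile) / 100

# Number of students above, rounded to a whole student.
students_above = round(share_above * class_size)
print(students_above)
```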
BOX and WHISKER PLOT
The box in the center of the boxplot shows us the middle half of
the data, between the quartiles.
The height of the box is equal to the IQR.
If the median is roughly centered between the
quartiles, then the middle half of the data is roughly
symmetric. Thus, if the median is not centered, the
distribution is skewed.
The whiskers also show the skewness if they are not the
same length.
Outliers are plotted separately so that they do not affect your judgment of
skewness, but give them special attention.
Interpreting Box Plots
Median line in centre of box and whiskers of
equal length: symmetric distribution
Solution:
Step 1 and 2. First, rank the data in increasing order and calculate the
values of the median, the first quartile, the third quartile, and the
interquartile range. The ranked data are
Step 3. Find the points that are 1.5 x IQR below Q1 and 1.5 x
IQR above Q3.
Box and Whisker Outlier Box Plot
(Example)
Step 5. Draw two lines joining the box to the smallest and the largest
values that lie within the two inner fences.
These values are 69 and 112 in this example.
This completes the box-and-whisker plot, as shown in the figure
below.
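[Figure: box-and-whisker plot with Q1, Q2 and Q3 marked on the box, whiskers extending to 69 and 112, and 145 plotted separately as an outlier]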
Interpretation:
Right Skewed with one outlier. The outlier is 145.
Example
Amount of sodium in 8 brands of cheese:
260 290 300 320 330 340 340 520
Q1 = (290 + 300)/2 = 295, Q3 = (340 + 340)/2 = 340
Example (cont’d)
IQR = 340 – 295 = 45
Lower fence (LF): 295 – 1.5(45) = 295 – 67.5 = 227.5
Upper fence (UF): 340 + 1.5(45) = 340 + 67.5 = 407.5
Outlier: x = 520
Draw "whiskers" connecting the box to the largest observation (that is, 340) and the smallest
observation (that is, 260) that are NOT outliers. The value 260 lies above the lower fence of
227.5, so it is not an outlier.
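The fence and whisker calculation for the sodium data can be sketched in Python as below. Q1 = 295 and Q3 = 340 are taken from the example, and the variable names are illustrative. With these fences, 260 lies above the lower fence of 227.5, so the sketch reports 260 as the end of the lower whisker.

```python
sodium = [260, 290, 300, 320, 330, 340, 340, 520]
q1, q3 = 295, 340          # quartiles used in the example

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations between the fences are not outliers.
inside = [x for x in sodium if lower_fence <= x <= upper_fence]
outliers = [x for x in sodium if x < lower_fence or x > upper_fence]

# Whiskers run from the box to the most extreme non-outlier values.
whisker_low, whisker_high = min(inside), max(inside)
print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```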
Example 15: Side by Side Boxplot (Male vs
Female Weight Distribution)
Example 16
The sport of boxing divides its athletes into different weight classes in
order to make the competition fairer. The side-by-side basic (quantile)
boxplots shown below display the weights (in pounds) of a random
sample of 16 Cruiserweight boxers and 17 Heavyweight boxers.
            Cruiserweight   Heavyweight
Minimum     204             270
Q1          220             294
Median      226             304
Q3          230             312
Maximum     250             320
n           16              17
Example 16
Which of the following statements is/are true?
(I) The distribution of weights for the Heavyweights is skewed to the left.
(II) There are 12 Cruiserweights in the sample who weigh at least 220 pounds.
(III) The mean weight for the Heavyweights is likely greater than the median
weight.
(A) I only
(B) III only
(C) I and II only
(D) II and III only
(E) I, II and III
Answer: C
Example 16
What is the median weight of all of the boxers in the sample (Cruiserweight and
Heavyweight) combined?
(A) 260
(B) 262
(C) 265
(D) 268
(E) 270
Answer: E
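The reasoning behind this answer can be sketched in Python. The sketch assumes the sample sizes stated in the problem (16 Cruiserweights and 17 Heavyweights) and relies on the fact that the heaviest Cruiserweight (250) is lighter than the lightest Heavyweight (270). The variable names are illustrative.

```python
n_cruiser, n_heavy = 16, 17        # sample sizes stated in the problem
cruiser_max, heavy_min = 250, 270  # from the five-number summaries

# Every Cruiserweight is lighter than every Heavyweight, so in the
# combined ranked list all 16 Cruiserweights come first.
groups_do_not_overlap = cruiser_max < heavy_min

n_total = n_cruiser + n_heavy
median_position = 0.5 * (n_total + 1)   # position of the median

# Rank of the median within the Heavyweight group.
rank_among_heavy = median_position - n_cruiser

# Rank 1 among the Heavyweights is their minimum weight.
combined_median = heavy_min if rank_among_heavy == 1 else None
print(n_total, median_position, combined_median)
```

With 33 boxers the median is the 17th ranked weight, which is the lightest Heavyweight, 270 pounds.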