Topic 1 Describing Data II
• Central Tendency: Mean, Median, Mode
• Shape: Skewness
• Location: Minimum, Maximum, Percentiles, Quartiles, z-Score
• Variation: Range, IQR, Variance, Standard deviation, Coefficient of variation
• Relationship: Covariance, Correlation
Measures of Central Tendency: Mean, Median, and Mode
• Measures of central tendency provide information about a “typical” observation in the data
• Usually computed from sample data rather than from population data
Central Tendency
• Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value (if one exists)
Measures of Central Tendency: Mean, Median, and Mode
• The (arithmetic) mean of a set of data is the sum of the data values
divided by the number of observations
• The population mean is a parameter given by
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$, where N is the population size
• The sample mean is a statistic given by
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n is the sample size
• The mean is appropriate for numerical data
Median
The median is the middle observation of a set of observations that are
arranged in increasing (or decreasing) order.
• If n is odd, the median is the middle observation.
• If n is even, the median is the average of the two middle
observations.
• In both cases, the median is the number located in the 0.5(n + 1)th
ordered position; when n is even, this position falls halfway between
the two middle observations.
• The median is more robust to outliers than the mean. [why?]
Mode
The mode, if one exists, is the most frequently occurring value.
• A distribution with one mode is called unimodal; with two (local)
modes, it is called bimodal; and with more than two (local) modes, it
is said to be multimodal.
• The mode is most commonly used with categorical data
Question:
• You want to measure the central tendency of the following
data. Which measurement would you use? Mean, Median or
Mode?
• Mode is most commonly used with categorical data (why?)
2. As a student, you want to know where you are among the class from
the midterm grades.
• The mean is $\bar{x} = \frac{60+84+65+67+75+72+80+85+63+82+70+75}{12} = \frac{878}{12} = 73.17$
• To find the median, arrange the grades from least to greatest:
• 60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85
• So the median is $x_{0.5} = \frac{72+75}{2} = 73.5$
• The mode is 75.
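The three measures for the midterm grades can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Midterm grades from the example above
grades = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

mean = statistics.mean(grades)      # sum of the values divided by n
median = statistics.median(grades)  # n is even, so the two middle values are averaged
mode = statistics.mode(grades)      # most frequently occurring value

print(round(mean, 2), median, mode)  # 73.17 73.5 75
```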
Describing Data Numerically: Measures of Central Tendency
Pth percentile = value located in the $\frac{P}{100}(n+1)$th ordered position
Percentiles and Quartiles
Quartiles are descriptive measures that separate large data sets into four
quarters.
• Split the ranked data into 4 segments with an equal number of values
per segment. Note that the widths of the segments may be different.
1. The first quartile, 𝑄1 , (or 25th percentile) separates approximately
the smallest 25% of the data from the remainder of the data.
2. The second quartile, 𝑄2 , (or 50th percentile) is the median.
3. The third quartile, 𝑄3 , (or 75th percentile) separates approximately
the smallest 75% of the data from the remainder of the data.
[Figure: ranked data divided into four segments by Q1, Q2, and Q3]
Find a quartile by determining the value in the appropriate position in the
ranked data, where n is the number of observed values:
• $Q_1$ = the value in the 0.25(n+1)th ordered position.
• $Q_2$ = the value in the 0.50(n+1)th ordered position.
• $Q_3$ = the value in the 0.75(n+1)th ordered position.
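The position rule can be sketched in code, using the midterm grades from the earlier example. The slides do not say what to do when the position is not a whole number. This sketch assumes linear interpolation between the two neighbouring ranked values, which is a common textbook convention. The function name `percentile` and the clamping at the ends are likewise assumptions.

```python
import math

def percentile(data, p):
    """Value at the (p/100)(n+1)th ordered position.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring ranked values.
    """
    ranked = sorted(data)
    n = len(ranked)
    position = p / 100 * (n + 1)      # 1-based position in the ranked data
    lower = math.floor(position)      # whole part of the position
    fraction = position - lower       # fractional part of the position
    if lower < 1:                     # position before the first value (assumed clamp)
        return ranked[0]
    if lower >= n:                    # position at or after the last value (assumed clamp)
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

grades = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
q1 = percentile(grades, 25)   # position 3.25
q2 = percentile(grades, 50)   # position 6.5, the median
q3 = percentile(grades, 75)   # position 9.75
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other conventions exist, so software may report slightly different quartiles for the same data. Python's `statistics.quantiles` with `method="exclusive"` appears to follow the same (n+1) convention.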
Variability
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Measures of Variability: Range and Interquartile Range
Example:
• Xminimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Xmaximum = 70
• Each of the four segments between these values contains 25% of the observations
A Boxplot in Excel
• The upper whisker ends at $\min(Q_3 + 1.5 \times IQR,\ \text{maximum})$
• The lower whisker ends at $\max(Q_1 - 1.5 \times IQR,\ \text{minimum})$
• Points beyond the whiskers are marked as extreme points
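A minimal sketch of the two whisker rules, applied to the five-number summary 12, 30, 45, 57, 70 from the example earlier in this section. The function name `whisker_ends` is hypothetical. The second call uses a made-up maximum of 120 to show a case where the upper whisker stops short of the maximum.

```python
def whisker_ends(minimum, q1, q3, maximum):
    """Return (lower, upper) whisker ends using the 1.5 x IQR rule stated above."""
    iqr = q3 - q1
    upper = min(q3 + 1.5 * iqr, maximum)  # never extends past the largest value
    lower = max(q1 - 1.5 * iqr, minimum)  # never extends below the smallest value
    return lower, upper

# Five-number summary from the example: no value lies beyond the fences
lower, upper = whisker_ends(12, 30, 57, 70)
print(lower, upper)  # 12 70

# Hypothetical maximum of 120: the upper whisker ends at 57 + 1.5 * 27 = 97.5
lower_out, upper_out = whisker_ends(12, 30, 57, 120)
print(lower_out, upper_out)  # 12 97.5
```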
Practice Question
A small accounting office is trying to determine its staffing needs for the coming tax
season. The manager has collected the following data: 46, 27, 79, 57, 99, 75, 48, 89,
and 85. These values represent the number of returns the office completed each year
over the entire nine years it has been doing tax returns. For this data, what is the
interquartile range (IQR) of the number of tax returns completed each year?
Variance and Standard Deviation
• Both range and IQR use only two of the data values. Variance uses
the distances of all observations from the mean.
• Variance measures the average squared “deviation” from the mean.
The unit of variance is the squared unit of the observations, $x_i$.
• Standard deviation is the positive square root of the variance. By
taking the square root, we get back to the “standard” (original) unit of
the observations, $x_i$.
Variance and Standard Deviation
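• The population variance is the sum of the squared differences between each
observation and the population mean divided by the population size, N:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• The sample variance, $s^2_{n-1}$, is the sum of the squared differences between each
observation and the sample mean divided by n − 1:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
• The biased sample variance, $s^2_n$, divides the same sum by n instead:
$s_n^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$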
Variance and Standard Deviation
Sample variance, $s^2$, can be computed as follows:
• $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$
• $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
• The population standard deviation, σ, is the (positive) square root of
the population variance:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
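As a rough check, the definitions and the two computational formulas can be compared on the midterm grades from the earlier example. Treating the 12 grades as a whole population for the population variance is an assumption made only for illustration. The variable names are not from the slides.

```python
import math
import statistics

grades = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
n = len(grades)
mean = sum(grades) / n

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in grades)

population_variance = squared_deviations / n   # divide by N (grades treated as a population)
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

# Computational formulas for the sample variance
sum_of_squares = sum(x * x for x in grades)
shortcut_sums = (sum_of_squares - sum(grades) ** 2 / n) / (n - 1)
shortcut_mean = (sum_of_squares - n * mean ** 2) / (n - 1)

# Standard deviation: positive square root of the variance
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 2), round(shortcut_sums, 2), round(shortcut_mean, 2))
print(round(sample_sd, 2))
```

The standard library functions `statistics.variance` and `statistics.pvariance` compute the n − 1 and N versions respectively and can be used to confirm the results.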
Example 2.9: Gilotti’s Pizzeria Sales at Location 1
(b). Subtract 5 from every observation and compute the sample variance for the original
data and the new data
(c). What effect, if any, does subtracting 5 from every observation have on the sample
mean and sample variance?
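The pizzeria data are not reproduced here, so the sketch below uses a small hypothetical data set (not the Gilotti's sales figures) to show how parts (b) and (c) could be explored numerically.

```python
import statistics

# Hypothetical observations, used only for illustration
sales = [6, 8, 10, 12, 14]
shifted = [x - 5 for x in sales]  # subtract 5 from every observation

mean_original = statistics.mean(sales)
mean_shifted = statistics.mean(shifted)
variance_original = statistics.variance(sales)   # sample variance, divides by n - 1
variance_shifted = statistics.variance(shifted)

print(mean_original, mean_shifted)          # 10 5
print(variance_original, variance_shifted)  # 10 10
```

In this hypothetical case, the mean drops by exactly 5 and the sample variance is unchanged, because every deviation from the mean stays the same.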
• Stock A
• Average price last year = $50
• Standard deviation = $5
• Stock B:
• Average price last year = $100
• Standard deviation = $5
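The two stocks have the same standard deviation but different average prices. One way to compare them is the coefficient of variation listed in the topic overview, taken here in its usual form: the standard deviation as a percentage of the mean. The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```

Relative to its average price, Stock A varies twice as much as Stock B.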
(b) Approximately what proportion of the observations is between 428 and 572?
(c) Approximately what proportion of the observations is between 476 and 524?
Practice Question
The manager of 45 sales people examined their monthly expenditures
on entertaining clients. He found that the mean amount was $237.50
with a standard deviation of $27.40. Assuming the data is bell-shaped,
would a claim for the amount of $300 be considered unlikely? Why or
why not?
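One possible starting point is the z-score from the topic overview, assumed here in its usual form z = (x − mean) / standard deviation. The sketch only computes how many standard deviations the claim lies from the mean; the interpretation is left to the reader.

```python
mean = 237.50   # mean monthly expenditure from the question
sd = 27.40      # standard deviation from the question
claim = 300.00  # amount of the claim

# Number of standard deviations the claim lies above the mean
z = (claim - mean) / sd
print(round(z, 2))  # 2.28
```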
Practice Question
A large sample is selected from a bell-shaped distribution. The middle
99.7% of the sample data falls between 24.2 and 69.2. Estimate the
sample mean and the sample standard deviation.
Video (3:13): What Is And How To Use Chebyshev's Theorem And The Empirical Rule Formula In Statistics Explained
Weighted Mean
The weighted mean of a set of data is
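$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{n}$
• where $w_i$ is the weight of the ith observation and the divisor $n = \sum w_i$ is the sum of the weights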
• $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{10+6+18+0+0}{19} = 1.79$
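The table behind the example above is not reproduced here, so this sketch uses hypothetical values and weights (grade points weighted by course credits) to illustrate the weighted mean. The divisor is taken to be the sum of the weights.

```python
# Hypothetical data: grade points earned and the credits (weights) of each course
values = [4.0, 3.0, 2.0]
weights = [3, 4, 2]

weighted_sum = sum(w * x for w, x in zip(weights, values))  # 12 + 12 + 4 = 28
total_weight = sum(weights)                                 # 9
weighted_mean = weighted_sum / total_weight

print(round(weighted_mean, 2))  # 3.11
```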
If the data are intervals rather than specific values,
can we calculate the exact mean and variance?
• No, but we can approximate them
Measures of Grouped Data
• Suppose that data are grouped into 𝐾 classes, with
frequencies 𝑓1 , 𝑓2 , … , 𝑓𝐾 . If the midpoints of these
classes are 𝑚1 , 𝑚2 , … , 𝑚𝐾 , then the sample mean
and sample variance can be approximated as
$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$ and $s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n-1}$
• where $n = \sum_{i=1}^{K} f_i$
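The grouped-data approximations can be sketched as follows. The class midpoints and frequencies are hypothetical (for example, classes 0 to 10, 10 to 20, and 20 to 30) and are not taken from the slides.

```python
# Hypothetical grouped data: class midpoints m_i and frequencies f_i
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

n = sum(frequencies)  # total number of observations

# Approximate mean: frequency-weighted average of the midpoints
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Approximate sample variance: frequency-weighted squared deviations over n - 1
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(grouped_mean, round(grouped_variance, 2))  # 16.0 54.44
```

These are approximations because every observation in a class is treated as if it sat at the class midpoint.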
Practice Question
What is the (approximate) mean
and variance of this sample?
Measures of Relationships Between Variables: Covariance