Lecture 04 (09.16)
STATISTICS
There are two basic measures of central tendency that we are going
to discuss today.
CENTRAL TENDENCY: MEAN AND MEDIAN
Median – the middle value when the data are arranged from smallest to largest.
If n (the sample size) is odd, the median is the single middle value. Counting in from
either end, we find this value in position (n + 1)/2.
If n is even, there are two middle values. In this case, the median is the average of
the two values in positions n/2 and n/2 + 1.
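The position rule can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` and the two small data sets are made up for the example and are not part of the lecture data.

```python
def median_by_position(values):
    """Return the median using the position rule for odd and even n."""
    ordered = sorted(values)  # arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: the middle value is in position (n + 1)/2, counting from 1.
        position = (n + 1) // 2
        return ordered[position - 1]  # Python lists count from 0
    # Even n: average the values in positions n/2 and n/2 + 1, counting from 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


# Hypothetical data sets, chosen only to exercise both branches.
odd_example = [3, 1, 2]        # n = 3, middle value is in position 2
even_example = [4, 1, 3, 2]    # n = 4, average positions 2 and 3

print(median_by_position(odd_example))
print(median_by_position(even_example))
```

For the odd-sized list the function returns 2; for the even-sized list it returns 2.5, the average of 2 and 3.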
EXAMPLE: RESTING HEART RATE
50 51 57 60 70 71 80 84
EXAMPLE: RESTING HEART RATE
What if the highest resting heart rate was incorrectly entered as 840 beats
per minute instead of 84? Recalculate the mean and median.
71 50 57 840 61 70 80 51
Mean =
50 51 57 60 70 71 80 840
Median =
CENTRAL TENDENCY: MEAN AND MEDIAN
KEY IDEA:
The mean is sensitive to extreme observations.
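As a rough check of this idea, the sketch below recomputes both measures with Python's standard `statistics` module. It assumes the eight values exactly as they appear in the unsorted row above (71 50 57 840 61 70 80 51), once with the intended 84 and once with the mistyped 840.

```python
from statistics import mean, median

# Values taken from the unsorted row in the example above.
intended = [71, 50, 57, 84, 61, 70, 80, 51]
mistyped = [71, 50, 57, 840, 61, 70, 80, 51]   # 84 entered as 840

for label, data in [("intended", intended), ("mistyped", mistyped)]:
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

With these values the mean jumps from 65.5 to 160.0, while the median stays at 65.5, because the data entry error only changes the largest value and not the two middle ones.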
Question: How do we decide whether it is better to report the mean or the median
as a measure of central tendency?
Answer: Generally, you want to choose the one that best represents a “typical” or
“center” value in the data set.
CENTRAL TENDENCY: MEAN AND MEDIAN
Consider the following histograms. For each histogram, determine whether you would
expect the mean and median to be approximately equal, the mean to be greater than the
median, or the mean to be less than the median.
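The bottom line: when analyzing a histogram, remember that the mean follows the skew of the
histogram. The mean is pulled toward the longer tail, so a right-skewed histogram tends to
have mean > median and a left-skewed histogram tends to have mean < median.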
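A small numerical sketch of this pattern follows. The three data sets are invented for illustration (they are not the histograms from the slide), and `compare_center` is a hypothetical helper name.

```python
from statistics import mean, median

def compare_center(values):
    """Return the mean, the median, and which of the two is larger."""
    m, med = mean(values), median(values)
    if m > med:
        relation = "mean > median"
    elif m < med:
        relation = "mean < median"
    else:
        relation = "mean = median"
    return m, med, relation


# Hypothetical data sets, not taken from the lecture.
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 20]   # long tail to the right
left_skewed = [21 - x for x in right_skewed]          # mirror image, long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    m, med, relation = compare_center(data)
    print(f"{name}: mean = {m:.2f}, median = {med:.2f}, {relation}")
```

In the right-skewed set the two large values drag the mean above the median; mirroring the data reverses the relationship.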
DESCRIBING VARIATION
Midterms are returned and “the average” was reported as 76 points out of 100 points. You
received a score of 88 points. How do you feel about your performance in each scenario?
DESCRIBING VARIATION
One way to describe the variability in the heart rate measurements would be to compute
the range, the difference between the largest and smallest observations:
range = maximum – minimum
Consider three alternative scenarios where the spread of resting heart
rates is quite different. What is the range in each case?
The range only uses 2 observations to describe the variation in an entire data set,
and there are obviously cases where it does not do a particularly good job.
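To illustrate the limitation, the sketch below computes the range for the sorted heart rate values from the first example and for two invented scenarios. The `clustered` and `spread_out` lists are hypothetical; they were chosen so that the minimum and maximum match the heart rate data while the values in between behave very differently.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


heart_rates = [50, 51, 57, 60, 70, 71, 80, 84]   # sorted values from the first example
clustered = [50, 66, 67, 67, 68, 68, 69, 84]     # hypothetical: most values near the middle
spread_out = [50, 51, 52, 53, 81, 82, 83, 84]    # hypothetical: values piled at both ends

for name, data in [("heart rates", heart_rates),
                   ("clustered", clustered),
                   ("spread out", spread_out)]:
    print(f"{name}: range = {data_range(data)}")
```

All three data sets have a range of 34, even though their spreads are clearly not alike.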
PERCENTILES
The p-th percentile is the value such that p% of the observations fall
at or below that value.
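The definition can be read in reverse: for a given value, count the share of observations at or below it. The sketch below does this for a made-up set of ten exam scores; both the scores and the helper name `percent_at_or_below` are assumptions for illustration only.

```python
def percent_at_or_below(values, cutoff):
    """Percentage of observations that are less than or equal to cutoff."""
    count = sum(1 for x in values if x <= cutoff)
    return 100 * count / len(values)


# Hypothetical exam scores, not data from the lecture.
scores = [55, 62, 70, 70, 76, 81, 88, 90, 93, 97]

print(percent_at_or_below(scores, 88))
```

Seven of the ten scores are at or below 88, so in this made-up data set a score of 88 sits at the 70th percentile.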
PERCENTILES
25th percentile – a value such that 25% of the observations are below the value and
75% are above it; also called the first quartile or Q1; it is the median of the lower half
of the data.
75th percentile – a value such that 75% of the observations are below the value and
25% are above it; also called the third quartile or Q3; it is the median of the upper
half of the data.
IQR
The inter-quartile range (IQR) is found by taking the difference between the 75th and
25th percentile values:
IQR = Q3 – Q1
We’ve already found the 50th percentile (the median) of our heart rate data set. Now find
the 25th and 75th percentiles for the data set. Afterward, compute the corresponding
IQR.
50 51 57 60 70 71 80 84
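One way to check the hand calculation is to code the "median of each half" rule directly. The sketch below uses the sorted heart rate values from the first example (50, 51, 57, 60, 70, 71, 80, 84). The helper name `quartiles` is hypothetical, and leaving the middle value out of both halves when n is odd is an assumed convention; statistical software often uses other interpolation rules and may report slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumed convention: when n is odd, the middle value belongs to neither half.
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(upper_half)


heart_rates = [50, 51, 57, 60, 70, 71, 80, 84]
q1, q3 = quartiles(heart_rates)
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

For these values the lower half is 50, 51, 57, 60 and the upper half is 70, 71, 80, 84, giving Q1 = 54.0, Q3 = 75.5, and IQR = 21.5.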
IQR
Let’s see how the IQR holds up against our fictional data sets.
VISUALIZING IQR: BOXPLOTS
A boxplot displays five summary values:
The minimum
First quartile Q1 (25th percentile)
The median
Third quartile Q3 (75th percentile)
The maximum
DRAWING BOXPLOTS
Let’s practice drawing one boxplot by hand using the AMES Living Area variable.
Min    Q1     Median   Q3     Max    IQR    Q1 - 1.5*IQR   Q3 + 1.5*IQR
672    1162   1505     1746   2495   ____   ____           ____
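The three blank columns follow from Q1 and Q3 by simple arithmetic, sketched below with the values from the table. The variable names are illustrative. A common convention (assumed here, since the slide does not spell it out) is to treat observations outside the two fences as potential outliers.

```python
# Five-number summary for the living area variable, from the table above.
minimum, q1, med, q3, maximum = 672, 1162, 1505, 1746, 2495

iqr = q3 - q1                      # width of the box
lower_fence = q1 - 1.5 * iqr       # Q1 - 1.5*IQR
upper_fence = q3 + 1.5 * iqr       # Q3 + 1.5*IQR

print(f"IQR = {iqr}")
print(f"Q1 - 1.5*IQR = {lower_fence}")
print(f"Q3 + 1.5*IQR = {upper_fence}")

# Assumed convention: values beyond the fences are flagged as potential outliers.
print("minimum inside fences:", minimum >= lower_fence)
print("maximum inside fences:", maximum <= upper_fence)
```

Here IQR = 584, the lower fence is 286.0, and the upper fence is 2622.0. Both the minimum (672) and the maximum (2495) fall inside the fences.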
BOXPLOTS
Boxplots can be drawn horizontally or vertically. They give a quick glance at the data, while
histograms give a more detailed view of the shape of the data distribution.
MATCHING BOXPLOTS AND HISTOGRAMS
VARIANCE & STANDARD DEVIATION
What is the best way to involve every single data point in our data set in
a calculation of the variation of that data set?
71 50 57 84 61 70 80 51
VARIANCE & STANDARD DEVIATION
Population variance formula:
\[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]
Dividing by n works well if we are dealing with population data, but it consistently
underestimates the population variance when we use sample data.
When we have sample data, we correct for this underestimation problem by dividing the
sum of squares by n – 1.
Sample variance formula:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
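The underestimation can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the population is taken to be normal with variance 1, each sample has n = 5 observations, and the seed and number of repetitions are arbitrary. None of these settings come from the lecture.

```python
import random

def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)


random.seed(4)            # arbitrary seed so the run is repeatable
n = 5                     # assumed sample size
repetitions = 20_000      # assumed number of simulated samples

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(repetitions):
    # Assumed population: normal with mean 0 and variance 1.
    sample = [random.gauss(0, 1) for _ in range(n)]
    ss = sum_of_squares(sample)
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / repetitions
avg_div_n_minus_1 = total_div_n_minus_1 / repetitions
print(f"average when dividing by n:     {avg_div_n:.3f}")
print(f"average when dividing by n - 1: {avg_div_n_minus_1:.3f}")
```

Averaged over many samples, dividing by n lands near 0.8, short of the true variance of 1, while dividing by n – 1 lands near 1.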
VARIANCE & STANDARD DEVIATION
In this course, we will not often use the variance because it describes the variability of the
data in squared units.
For this reason, we will instead use the square root of the variance, called the standard
deviation, since it is in the original units of the data.
Population standard deviation formula:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
Sample standard deviation formula:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
STANDARD DEVIATION
Let’s calculate the sample standard deviation of our heart rate data:
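The calculation can be traced step by step in Python. This sketch assumes the eight unsorted heart rate values listed at the start of this section (71 50 57 84 61 70 80 51) and follows the sample formula: subtract the mean, square, add up, divide by n – 1, and take the square root.

```python
import math

heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]

n = len(heart_rates)
xbar = sum(heart_rates) / n                        # sample mean

deviations = [x - xbar for x in heart_rates]       # distance of each value from the mean
squared_deviations = [d ** 2 for d in deviations]
sum_sq = sum(squared_deviations)                   # sum of squares

sample_variance = sum_sq / (n - 1)                 # divide by n - 1 for sample data
s = math.sqrt(sample_variance)                     # back to beats per minute

print(f"mean = {xbar}")
print(f"sum of squares = {sum_sq}")
print(f"sample variance = {sample_variance:.2f}")
print(f"sample standard deviation = {s:.2f}")
```

With these values the mean is 65.5, the sum of squares is 1146.0, and the sample standard deviation is about 12.8 beats per minute.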
The standard deviation is the square root of the variance and describes how
close the data are to the mean using the units in which the data are recorded.
s = 0 means every data point is the same value – there are no deviations from the
mean!
To indicate whether our summary values are calculated from population data (and are
therefore parameters) or from sample data (and are therefore statistics), we use
special notation.
Statistic Value
n 42
Min $416
Q1 $1416
Median $1882
Mean $1924
Q3 $2497
Max $3154
s $703
Compute the IQR and range of this data set.
ROBUST STATISTICS
Suppose two new high-quality diamonds were accidentally mixed in with the original 42, one with a sale price of
$7,393 and the other with a sale price of $8,979. Compared to the low-quality diamonds, their sale prices are
outliers. The summaries below show the new sale price distribution and its summary statistics.
Statistic Value
n 44
Min $416
Q1 $1429
Median $1899
Mean $2209
Q3 $2537
Max $8979
s $1497
Compute the IQR and range of this data set. How do these summaries change after the
high-quality diamonds are included?
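Both summaries need only four numbers from each table. The sketch below uses the Min, Q1, Q3, and Max values reported in the two tables; the dictionary layout and the helper name `iqr_and_range` are illustrative choices.

```python
# Summary values in dollars, copied from the two tables above.
original = {"Min": 416, "Q1": 1416, "Q3": 2497, "Max": 3154}
with_outliers = {"Min": 416, "Q1": 1429, "Q3": 2537, "Max": 8979}

def iqr_and_range(summary):
    """Return (IQR, range) from a dictionary of summary statistics."""
    iqr = summary["Q3"] - summary["Q1"]
    data_range = summary["Max"] - summary["Min"]
    return iqr, data_range


for label, summary in [("original", original), ("with outliers", with_outliers)]:
    iqr, data_range = iqr_and_range(summary)
    print(f"{label}: IQR = ${iqr}, range = ${data_range}")
```

The IQR moves only from $1081 to $1108, while the range grows from $2738 to $8563.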
ROBUST STATISTICS
Compare the mean and median values for the two histograms. How do they
differ?
Consider the summary statistics we’ve explored so far. We can categorize each of
them with respect to whether the statistic appears to be robust to outlying values.
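One rough way to organize this comparison is to compute how much each statistic changed, in percentage terms, once the two high-quality diamonds were included. The sketch below uses the values from the two diamond tables and adds the IQR and range derived from them. It only reports the changes; where to draw the line between robust and not robust is left as a judgment call.

```python
# Summary values in dollars, copied from the two diamond tables.
before = {"Min": 416, "Q1": 1416, "Median": 1882, "Mean": 1924,
          "Q3": 2497, "Max": 3154, "s": 703}
after = {"Min": 416, "Q1": 1429, "Median": 1899, "Mean": 2209,
         "Q3": 2537, "Max": 8979, "s": 1497}

# Derived summaries.
for table in (before, after):
    table["IQR"] = table["Q3"] - table["Q1"]
    table["Range"] = table["Max"] - table["Min"]

# Percentage change of each statistic after the outliers were added.
pct_change = {name: 100 * (after[name] - before[name]) / before[name]
              for name in before}

# Print from least affected to most affected.
for name, change in sorted(pct_change.items(), key=lambda item: item[1]):
    print(f"{name:>6}: {change:6.1f}%")
```

The quartiles, median, and IQR each change by less than 3%, the mean rises by roughly 15%, and the standard deviation, maximum, and range each more than double.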