Lecture 04 (09.16)

This document provides an introduction to statistical concepts focusing on numerical summaries of data, specifically measures of central tendency such as mean and median. It discusses the importance of these measures in representing typical values in datasets and introduces variability measures like range, interquartile range (IQR), variance, and standard deviation. Additionally, it emphasizes the significance of visualizations like boxplots and histograms in analyzing data distributions.

Uploaded by

sabrinawang830

INTRODUCTION TO STATISTICS

LECTURE 04: VISUAL AND NUMERICAL SUMMARIES OF DATA (PART II)
NUMERICAL SUMMARIES OF DATA: CENTRAL TENDENCY

• The first step in analyzing a set of numerical data is to describe the “typical” values of the data.

• Measures of central tendency acknowledge that each observation of a numerical variable might be different, but that the observations also have some “center” that represents them all together.

• There are two basic measures of central tendency that we are going to discuss today.
CENTRAL TENDENCY: MEAN AND MEDIAN

• Mean – the numerical average value.

• We represent the mean of a sample by $\bar{x}$:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
CENTRAL TENDENCY: MEAN AND MEDIAN

• Median – the middle value when the data are arranged from smallest to largest.

• If $n$ (the sample size) is odd, the median is the middle value. Counting in from the ends, we find this value in position $\frac{n+1}{2}$.

• If $n$ is even, there are two middle values. In this case, the median is the average of the two values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
EXAMPLE: RESTING HEART RATE

Each year, as part of a statistics education program, school children in Australia participate in the CensusAtSchool program by filling out a questionnaire. One of the questions asks: “What is your resting pulse rate?” Let’s take a sample of 8 of the Australian school children. Their reported resting heart rates are:
71 50 57 84 61 70 80 51

Compute the mean resting heart rate for these children:
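The mean formula can be checked on these eight values with a few lines of Python. This is a minimal sketch rather than part of the original slides; the variable names `heart_rates` and `mean_rate` are chosen here for illustration.

```python
# Resting heart rates (beats per minute) reported by the 8 children.
heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]

# Mean = (x_1 + x_2 + ... + x_n) / n
n = len(heart_rates)
mean_rate = sum(heart_rates) / n

print(mean_rate)  # 65.5
```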


EXAMPLE: RESTING HEART RATE

Compute the median resting heart rate.

First, let’s order the numbers from smallest to largest:

50 51 57 61 70 71 80 84
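The position rule for the median can be sketched as a small function. This is an illustration only; the function name `median` and the variable names are assumptions made here, not part of the slides.

```python
def median(values):
    """Median by the position rule: the middle value for odd n,
    the average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1) / 2, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Positions n/2 and n/2 + 1, converted to 0-based indices.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]
median_rate = median(heart_rates)

print(sorted(heart_rates))  # [50, 51, 57, 61, 70, 71, 80, 84]
print(median_rate)          # 65.5
```

With 8 observations, the two middle values sit in positions 4 and 5 (61 and 70), so the median is their average.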
EXAMPLE: RESTING HEART RATE

What if the highest resting heart rate was incorrectly entered as 840 beats per minute instead of 84? Recalculate the mean and median.
71 50 57 840 61 70 80 51

Mean =

50 51 57 61 70 71 80 840

Median =
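One way to see the effect of the data-entry error is to compute both summaries side by side. The sketch below uses Python's standard `statistics.median`; the list names `correct` and `typo` are chosen here for illustration.

```python
import statistics

correct = [71, 50, 57, 84, 61, 70, 80, 51]
typo = [71, 50, 57, 840, 61, 70, 80, 51]  # 84 mistakenly entered as 840

for label, data in [("correct", correct), ("with typo", typo)]:
    mean = sum(data) / len(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean}, median = {median}")
# correct: mean = 65.5, median = 65.5
# with typo: mean = 160.0, median = 65.5
```

The single wrong value more than doubles the mean, while the median does not move at all.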
CENTRAL TENDENCY: MEAN AND MEDIAN

KEY IDEA:
• The mean is sensitive to extreme observations.

• The median is robust to extreme observations.

Question: How do we decide whether it is better to report the mean or the median
as a measure of central tendency?
Answer: Generally, you want to choose the one that best represents a “typical” or
“center” value in the data set.
CENTRAL TENDENCY: MEAN AND MEDIAN
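Consider the following histograms. For each histogram, determine whether you would expect the mean and median to be approximately equal, the mean to be greater than the median (mean > median), or the mean to be less than the median (mean < median).

The bottom line: When analyzing a histogram, remember that the mean follows the skew of the histogram: it is pulled toward the longer tail, while the median stays closer to the bulk of the data.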
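The relationship between skew and the mean can be illustrated numerically. The three small data sets below are hypothetical, made up purely for this sketch; they are not the data behind the histograms on the slide.

```python
import statistics

# Hypothetical data sets, invented for illustration only.
examples = {
    "symmetric": [1, 2, 3, 4, 5],
    "right-skewed": [1, 2, 2, 3, 3, 4, 5, 20],
    "left-skewed": [1, 16, 17, 18, 18, 19, 19, 20],
}

summary = {}
for name, data in examples.items():
    mean = sum(data) / len(data)
    median = statistics.median(data)
    summary[name] = (mean, median)
    print(f"{name}: mean = {mean}, median = {median}")
```

In the right-skewed set the long tail on the high side drags the mean above the median; in the left-skewed set the tail on the low side drags it below.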
DESCRIBING VARIATION
Midterms are returned and “the average” was reported as 76 points out of 100 points. You received a score of 88 points. How do you feel about your performance in each scenario?
DESCRIBING VARIATION

• Often what is missing when the central tendency of something is reported is a corresponding measure of ‘spread’ or variability that describes how tightly or loosely the observations in the data set are clustered around that measure of central tendency.

• A measure of variability is perhaps the most important quantity in statistical analysis.

• Here we discuss several measures of variation, each useful in some situations, each with some limitations.
RANGE
Let’s return to our sample of heart rates, which are shown visually in the graphics below.

One way to describe the variability in the heart rate measurements would be to compute
the range:

Range = Maximum Value – Minimum Value
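Applied to the eight resting heart rates from the earlier example, the range formula can be sketched as follows (variable names are illustrative):

```python
heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]

# Range = maximum value - minimum value
data_range = max(heart_rates) - min(heart_rates)

print(data_range)  # 34
```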


RANGE: LIMITATIONS

Consider three different alternative scenarios where the spread of resting heart
rates was quite different. What is the range in each case?

The range only uses 2 observations to describe the variation in an entire data set,
and there are obviously cases where it does not do a particularly good job.
PERCENTILES

• Another measure of variation, called the interquartile range (IQR), tries to address this issue. To understand how the IQR works, we must first introduce the idea of percentiles.

• The p-th percentile is the value such that p% of the observations fall at or below that value.
PERCENTILES

Some common percentiles:

• 50th percentile – a value such that 50% of the observations are below the value and 50% are above it; also called the median.

• 25th percentile – a value such that 25% of the observations are below the value and 75% are above it; also called the first quartile or Q1; it is the median of the lower half of the data.

• 75th percentile – a value such that 75% of the observations are below the value and 25% are above it; also called the third quartile or Q3; it is the median of the upper half of the data.
IQR
The interquartile range (IQR) is found by taking the difference between the 75th and 25th percentile values:

IQR = Q3 – Q1

We’ve already found the 50th percentile (the median) of our heart rate data set. Now find the 25th and 75th percentiles for the data set. Afterward, compute the corresponding IQR.
50 51 57 61 70 71 80 84
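The quartiles can be sketched in code using the definition given in this lecture: Q1 is the median of the lower half of the data and Q3 is the median of the upper half. The helper name `quartiles` is an assumption made here, and so is the choice to leave the overall median out of both halves when n is odd (the slides only cover the even case). Statistical software often uses other percentile conventions, so library functions may return slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for odd n, the overall median belongs to neither half.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest half of the observations
    upper = ordered[-half:]   # largest half of the observations
    return statistics.median(lower), statistics.median(upper)

heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]
q1, q3 = quartiles(heart_rates)
iqr = q3 - q1

print(q1, q3, iqr)  # 54.0 75.5 21.5
```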
IQR

Let’s see how the IQR holds up against our fictional data sets.

Fictional set 1: 50, 64, 64, 64, 66, 66, 66, 84

Fictional set 2: 50, 51, 52, 55, 80, 82, 83, 84
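A short sketch comparing the range and the IQR for the two fictional sets. It takes Q1 and Q3 to be the medians of the lower and upper halves of the sorted data; the helper name `iqr` is illustrative, and the code assumes an even number of observations, as in both sets.

```python
import statistics

def iqr(values):
    """IQR with Q1 and Q3 as medians of the lower and upper halves (even n assumed)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half:])
    return q3 - q1

fictional_1 = [50, 64, 64, 64, 66, 66, 66, 84]
fictional_2 = [50, 51, 52, 55, 80, 82, 83, 84]

results = {}
for name, data in [("set 1", fictional_1), ("set 2", fictional_2)]:
    results[name] = (max(data) - min(data), iqr(data))
    print(f"{name}: range = {results[name][0]}, IQR = {results[name][1]}")
# set 1: range = 34, IQR = 2.0
# set 2: range = 34, IQR = 31.0
```

Both sets have the same range, but the IQR separates the tightly clustered set from the one split into two distant groups.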


VISUALIZING IQR: BOXPLOTS
A boxplot is a data visualization that summarizes a data set using five statistics while also plotting unusual observations.

To construct a boxplot, we use five numbers calculated from the data:

• The minimum
• First quartile Q1 (25th percentile)
• The median
• Third quartile Q3 (75th percentile)
• The maximum
DRAWING BOXPLOTS
Let’s practice drawing one boxplot by hand using the AMES Living Area variable.

Min    Q1      Median   Q3      Max     IQR     Q1 - 1.5*IQR   Q3 + 1.5*IQR
672    1162    1505     1746    2495    ____    ____           ____
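The three remaining columns follow directly from the five-number summary. A minimal sketch, assuming the common convention that the 1.5 × IQR cutoffs mark the boundary beyond which observations are plotted individually as unusual; the names `lower_fence` and `upper_fence` are chosen here for illustration.

```python
# Five-number summary of the living-area variable, as given in the table.
minimum, q1, median, q3, maximum = 672, 1162, 1505, 1746, 2495

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr)          # 584
print(lower_fence)  # 286.0
print(upper_fence)  # 2622.0

# Would the minimum or maximum fall outside the cutoffs?
print(minimum < lower_fence, maximum > upper_fence)  # False False
```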
BOXPLOTS
Boxplots can be drawn horizontally or vertically. They give a quick glance at the data, while histograms give a more detailed view of the shape of the data distribution.
MATCHING BOXPLOTS AND HISTOGRAMS
VARIANCE & STANDARD DEVIATION

What is the best way to involve every single data point in our data set in
a calculation of the variation of that data set?

71 50 57 84 61 70 80 51
VARIANCE & STANDARD DEVIATION

• A common method is to calculate the mean value, and then analyze the departures from the mean.
• We calculate the distance between each observation and the average of the observations, $\bar{x}$, and call this distance the deviation from the mean.
• The deviations are visualized by the horizontal line segments in the plot and are calculated in the table on the next slide.
VARIANCE & STANDARD DEVIATION

• The larger the deviations, the more variable the data!
• Problem: The sum of the deviations is always zero!
• Solution: Square the deviations before adding them all up. The result is called the sum of squares, and it is never negative.
• Another problem: The sum of squares keeps growing as more observations are added.
• Solution: Take the sum of squares and divide it by the number of observations, n, to find the mean squared deviation.
• This calculation gives us a number we call the population variance.

Resting Heart Rate    Deviation from $\bar{x}$
50                    50 – 65.5 = -15.5
51                    51 – 65.5 = -14.5
57                    57 – 65.5 = -8.5
61                    61 – 65.5 = -4.5
70                    70 – 65.5 = 4.5
71                    71 – 65.5 = 5.5
80                    80 – 65.5 = 14.5
84                    84 – 65.5 = 18.5
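The chain of problems and solutions above can be traced step by step on the heart rate data. This is a sketch with illustrative variable names, not code from the lecture.

```python
heart_rates = [50, 51, 57, 61, 70, 71, 80, 84]
n = len(heart_rates)
mean = sum(heart_rates) / n                       # 65.5

# Deviation of each observation from the mean.
deviations = [x - mean for x in heart_rates]
print(sum(deviations))                            # 0.0 (positives and negatives cancel)

# Squaring removes the signs, so the deviations no longer cancel.
sum_of_squares = sum(d ** 2 for d in deviations)
print(sum_of_squares)                             # 1146.0

# Dividing by n gives the mean squared deviation.
mean_squared_deviation = sum_of_squares / n
print(mean_squared_deviation)                     # 143.25
```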
VARIANCE & STANDARD DEVIATION

• Population variance formula: $\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$

• Dividing by n works well if we are dealing with population data, but it consistently underestimates the population variance when we use sample data.

• When we have sample data, we correct for this underestimation problem by dividing the sum of squares by n – 1.

• Sample variance formula: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
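The underestimation can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the population (the integers 1 to 100), the sample size of 5, and the number of repeats are all hypothetical choices made here. It draws many samples, computes the sum of squares around each sample mean, and averages the two versions of the variance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population, invented for this illustration.
population = list(range(1, 101))
true_variance = statistics.pvariance(population)  # 833.25

sample_size = 5
repeats = 20_000
avg_div_n = 0.0
avg_div_n_minus_1 = 0.0

for _ in range(repeats):
    sample = random.choices(population, k=sample_size)  # sampling with replacement
    xbar = sum(sample) / sample_size
    sum_sq = sum((x - xbar) ** 2 for x in sample)
    avg_div_n += sum_sq / sample_size / repeats
    avg_div_n_minus_1 += sum_sq / (sample_size - 1) / repeats

print(f"population variance:       {true_variance:.1f}")
print(f"average when dividing n:   {avg_div_n:.1f}")
print(f"average when dividing n-1: {avg_div_n_minus_1:.1f}")
```

Across many samples, dividing by n lands well below the population variance on average, while dividing by n – 1 lands close to it.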
VARIANCE & STANDARD DEVIATION

• In this course, we will not often use the variance because it describes the variability of the data in squared units.
• For this reason, we will instead use the square root of the variance, called the standard deviation, since it is in the original units of the data.

• Population standard deviation formula: $\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$

• Sample standard deviation formula: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
STANDARD DEVIATION
Let’s calculate the sample standard deviation of our heart rate data:

Resting Heart Rate    Deviation from $\bar{x}$    Squared Deviation
50                    50 – 65.5 = -15.5           ____
51                    51 – 65.5 = -14.5           ____
57                    57 – 65.5 = -8.5            ____
61                    61 – 65.5 = -4.5            ____
70                    70 – 65.5 = 4.5             ____
71                    71 – 65.5 = 5.5             ____
80                    80 – 65.5 = 14.5            ____
84                    84 – 65.5 = 18.5            ____

Sum of squares: ____

Standard deviation, s: ____

Interpretation of s: The resting heart rates of the students in our sample are roughly ________ away from the mean resting rate of ______, on average.
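The table can be completed by hand or checked with a short script. The sketch below follows the sample standard deviation formula directly and then cross-checks the result against `statistics.stdev` from Python's standard library, which also divides by n – 1. Variable names are illustrative.

```python
import math
import statistics

heart_rates = [50, 51, 57, 61, 70, 71, 80, 84]
n = len(heart_rates)
mean = sum(heart_rates) / n                              # 65.5

squared_deviations = [(x - mean) ** 2 for x in heart_rates]
sum_of_squares = sum(squared_deviations)                 # 1146.0

sample_variance = sum_of_squares / (n - 1)               # divide by n - 1 for sample data
s = math.sqrt(sample_variance)

print(squared_deviations)
print(round(s, 1))                                       # 12.8
print(math.isclose(s, statistics.stdev(heart_rates)))    # True
```

Read this way, the heart rates in the sample are roughly 12.8 beats per minute away from the mean of 65.5, on average.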
STANDARD DEVIATION

Notes about the standard deviation:

• The standard deviation is the square root of the variance and describes how close the data are to the mean using the units in which the data are recorded.

• s = 0 means every data point is the same value – there are no deviations from the mean!

• Like the mean, s is sensitive to outliers.


STANDARD DEVIATION & THE EMPIRICAL RULE
Many data sets we encounter in the natural world generally follow a symmetric, bell-shaped pattern. Shown below is a histogram of the heights of adult women, as an example.
STANDARD DEVIATION & THE EMPIRICAL RULE

When our data is symmetric and bell-shaped,

• Approximately 68% of the data will be within one standard deviation of the mean.

• Approximately 95% of the data will be within two standard deviations of the mean.

• Approximately 99.7% of the data will be within three standard deviations of the mean.
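The three percentages come from the normal curve, the idealized model for symmetric, bell-shaped data. As a sketch, assuming the data follow a normal distribution exactly (real data only do so approximately), the areas can be computed with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

within = {}
for k in (1, 2, 3):
    # Area under the curve between k standard deviations below and above the mean.
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```

The rule's 68%, 95%, and 99.7% are rounded versions of these areas.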
NOTATIONS FOR PARAMETERS VS STATISTICS

In order to indicate whether our summary values are calculated from population
data and therefore are parameters or if they are from sample data and are
statistics, we use special notation.

Measure               Parameter Notation    Statistic Notation
Mean                  $\mu$                 $\bar{x}$
Variance              $\sigma^2$            $s^2$
Standard Deviation    $\sigma$              $s$
Proportion            $p$                   $\hat{p}$
QUICK PRACTICE WITH THE STANDARD DEVIATION
At the end of each semester, responses from the Student-Instructional Rating System are
provided to professors across the university. The dot plots below show student ratings (on
a scale of 1-5) of four hypothetical professors (professors A – D). Arrange these professors
in order from smallest variability in ratings to highest variability in ratings.
QUICK PRACTICE WITH THE STANDARD DEVIATION
The table below asks you to compare the standard deviations of two data sets. Without doing any calculations, choose one of the four statements below to describe the relationship between the data sets compared.

Statements:
I. The quantity in column A is greater
II. The quantity in column B is greater
III. The two quantities are equal
IV. The relationship cannot be determined from the given information

Column A                                          Column B
The standard deviation of {0.2, 0.4, 0.6, 0.8}    The standard deviation of {2, 4, 6, 8}
The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {3, 5, 7, 9, 11}
The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {1, 3, 5, 7, 9, 9}
The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {1, 3, 5, 5, 7, 9}
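After committing to an answer for each row, one way to check the reasoning is to compute the standard deviations directly. The sketch below treats each set as sample data and uses `statistics.stdev` (which divides by n – 1); that choice is an assumption made here, since the exercise does not say whether the sets are samples or populations.

```python
import statistics

# Each pair is (column A, column B) from one row of the exercise.
pairs = [
    ([0.2, 0.4, 0.6, 0.8], [2, 4, 6, 8]),
    ([1, 3, 5, 7, 9], [3, 5, 7, 9, 11]),
    ([1, 3, 5, 7, 9], [1, 3, 5, 7, 9, 9]),
    ([1, 3, 5, 7, 9], [1, 3, 5, 5, 7, 9]),
]

results = []
for a, b in pairs:
    sd_a = statistics.stdev(a)
    sd_b = statistics.stdev(b)
    results.append((sd_a, sd_b))
    print(f"A: {sd_a:.3f}   B: {sd_b:.3f}")
```

The patterns to look for: multiplying every value by a constant scales the standard deviation by that constant, adding a constant to every value leaves it unchanged, and an extra observation changes it depending on how far that observation is from the mean.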
ROBUST STATISTICS
Now that we’ve discussed how to interpret histograms, let’s see how they measure up against our
statistical summaries. Below is a histogram of the sale price of 42 low-quality diamonds that are less
than 1 carat in size. Their summary statistics are provided in the table next to it.

Statistic Value
n 42
Min $416
Q1 $1416
Median $1882
Mean $1924
Q3 $2497
Max $3154
s $703
Compute the IQR and range of this data set
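Both quantities follow from the summary table for the 42 low-quality diamonds. A minimal sketch with illustrative variable names:

```python
# Summary statistics for the 42 low-quality diamonds (sale price in dollars).
minimum, q1, q3, maximum = 416, 1416, 2497, 3154

iqr = q3 - q1
price_range = maximum - minimum

print(iqr)          # 1081
print(price_range)  # 2738
```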
ROBUST STATISTICS
Suppose two new high-quality diamonds were accidentally mixed in with the original 42, one with sale price of
$7,393 and the other with a sale price of $8,979. Compared to the low-quality diamonds, their sale prices are
outliers. The summaries below show the new sale price distribution and its summary statistics.

Statistic Value
n 44
Min $416
Q1 $1429
Median $1899
Mean $2209
Q3 $2537
Max $8979
s $1497

Compute the IQR and range of this data set. How do these summaries change after the
high-quality diamonds are included?
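To compare the two situations, the summaries reported for each can be placed side by side and the changes computed. This is a sketch; the dictionary names `before` and `after` and the helper `add_spread` are chosen here for illustration, and the numbers are the ones reported in the two summary tables.

```python
before = {"Min": 416, "Q1": 1416, "Median": 1882, "Mean": 1924,
          "Q3": 2497, "Max": 3154, "s": 703}
after = {"Min": 416, "Q1": 1429, "Median": 1899, "Mean": 2209,
         "Q3": 2537, "Max": 8979, "s": 1497}

def add_spread(stats):
    """Return a copy of the summary with the IQR and range added."""
    out = dict(stats)
    out["IQR"] = stats["Q3"] - stats["Q1"]
    out["Range"] = stats["Max"] - stats["Min"]
    return out

before, after = add_spread(before), add_spread(after)

changes = {}
for name in ("Median", "IQR", "Mean", "s", "Range"):
    changes[name] = after[name] - before[name]
    print(f"{name:>6}: {before[name]:>5} -> {after[name]:>5}  (change {changes[name]:+})")
```

The median and IQR shift by a few tens of dollars, while the mean, standard deviation, and range move by hundreds or thousands.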
ROBUST STATISTICS

• Compare the mean and median values of the two histograms. How do they compare?

Consider the summary statistics we’ve explored so far. We can categorize each of them with respect to whether the statistic appears to be robust to outlying values.

Statistics that are robust to outliers    Statistics that are sensitive to outliers
Median                                    Mean
IQR                                       Standard deviation
                                          Range
