Lecture 04 (09.16)

This document provides an introduction to statistical concepts focusing on numerical summaries of data, specifically measures of central tendency such as mean and median. It discusses the importance of these measures in representing typical values in datasets and introduces variability measures like range, interquartile range (IQR), variance, and standard deviation. Additionally, it emphasizes the significance of visualizations like boxplots and histograms in analyzing data distributions.

Uploaded by

sabrinawang830


INTRODUCTION TO STATISTICS

LECTURE 04: VISUAL AND NUMERICAL SUMMARIES OF DATA (PART II)
NUMERICAL SUMMARIES OF DATA: CENTRAL TENDENCY

• The first step in analyzing a set of numerical data is to describe the “typical” values of the data.

• Measures of central tendency acknowledge that each observation of a numerical variable might be different, but that the observations also have some “center” that represents them all together.

• There are two basic measures of central tendency that we are going to discuss today.
CENTRAL TENDENCY: MEAN AND MEDIAN

• Mean – the numerical average value.

• We represent the mean of a sample by

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
CENTRAL TENDENCY: MEAN AND MEDIAN

• Median – the middle value when the data are arranged from smallest to largest.

• If $n$ (the sample size) is odd, the median is the middle value. Counting in from the ends, we find this value in position $\frac{n+1}{2}$.

• If $n$ is even, there are two middle values. In this case, the median is the average of the two values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
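The position rule above can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` and the two small data sets are hypothetical and not part of the lecture.

```python
def median_by_position(values):
    """Return the median using the position rule for odd and even n."""
    ordered = sorted(values)              # arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # odd n: the middle value is in position (n + 1) / 2 (counting from 1)
        return ordered[(n + 1) // 2 - 1]
    # even n: average the values in positions n/2 and n/2 + 1 (counting from 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_example = median_by_position([3, 1, 2])        # hypothetical data, n = 3
even_example = median_by_position([4, 1, 3, 2])    # hypothetical data, n = 4
print(odd_example, even_example)
```

Python lists are indexed from 0, which is why each position is shifted down by one.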
EXAMPLE: RESTING HEART RATE

Each year, as part of a statistics education program, school children in Australia participate in the CensusAtSchool program by filling out a questionnaire. One of the questions asks: “What is your resting pulse rate?” Let’s take a sample of 8 of the Australian school children. Their reported resting heart rates are:
71 50 57 84 61 70 80 51

Compute the mean resting heart rate for these children:

EXAMPLE: RESTING HEART RATE

Compute the median resting heart rate.

First, let’s order the numbers from smallest to largest:

50 51 57 61 70 71 80 84
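One way to check a hand calculation is with Python's standard `statistics` module. The sketch below uses the eight reported heart rates; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# The eight reported resting heart rates (beats per minute)
heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]

sample_mean = mean(heart_rates)      # sum of the values divided by n = 8
sample_median = median(heart_rates)  # average of the 4th and 5th ordered values

print("mean:", sample_mean)
print("median:", sample_median)
```

For this sample the two measures happen to coincide, because the values are spread fairly evenly on both sides of the center.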
EXAMPLE: RESTING HEART RATE

What if the highest resting heart rate was incorrectly entered as 840 beats per minute instead of 84? Recalculate the mean and median.
71 50 57 840 61 70 80 51

Mean =

50 51 57 61 70 71 80 840

Median =
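The effect of the data-entry error can also be sketched in Python by computing both summaries before and after the change. The variable names are illustrative only.

```python
from statistics import mean, median

original = [71, 50, 57, 84, 61, 70, 80, 51]
with_typo = [71, 50, 57, 840, 61, 70, 80, 51]   # 84 mistakenly entered as 840

# How far does each summary move when one value is corrupted?
mean_shift = mean(with_typo) - mean(original)
median_shift = median(with_typo) - median(original)

print("change in mean:", mean_shift)
print("change in median:", median_shift)
```

Only the largest value changed, so the two middle values in the ordered list stay the same and the median does not move, while the mean absorbs the full size of the error divided by n.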
CENTRAL TENDENCY: MEAN AND MEDIAN

KEY IDEA:
• The mean is sensitive to extreme observations.

• The median is robust to extreme observations.

Question: How do we decide whether it is better to report the mean or the median
as a measure of central tendency?
Answer: Generally, you want to choose the one that best represents a “typical” or
“center” value in the data set.
CENTRAL TENDENCY: MEAN AND MEDIAN
Consider the following histograms. For each histogram, determine whether you would expect the mean and median to be approximately equal, the mean to be greater than the median, or the mean to be less than the median.

The bottom line: When analyzing a histogram, remember that the mean follows the skew of the histogram; that is, the mean is pulled toward the longer tail.
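A small numerical sketch can make this concrete. The three data sets below are hypothetical, invented here to have a long right tail, a long left tail, and a symmetric shape.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # hypothetical: long right tail
left_skewed = [1, 7, 8, 8, 8, 9, 9, 10]    # hypothetical: long left tail
symmetric = [1, 2, 3, 4, 5, 6, 7]          # hypothetical: symmetric

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    # Compare the two measures of center for each shape
    print(name, "mean =", mean(data), "median =", median(data))
```

In the right-skewed set the mean exceeds the median, in the left-skewed set it falls below the median, and in the symmetric set the two agree.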
DESCRIBING VARIATION
Midterms are returned and “the average” was reported as 76 points out of 100 points. You received a score of 88 points. How do you feel about your performance in each scenario?
DESCRIBING VARIATION

• Often what is missing when the central tendency of something is reported is a corresponding measure of “spread” or variability that describes how tightly or loosely the observations in the data set are clustered around that measure of central tendency.

• A measure of variability is perhaps the most important quantity in statistical analysis.

• Here we discuss several measures of variation, each useful in some situations, each with some limitations.
RANGE
Let’s return to our sample of heart rates, which are shown visually in the graphics below.

One way to describe the variability in the heart rate measurements would be to compute
the range:

Range = Maximum Value – Minimum Value
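As a quick sketch, the range of the heart rate sample can be computed directly from this formula in Python:

```python
heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]

# Range = Maximum Value - Minimum Value
data_range = max(heart_rates) - min(heart_rates)
print("range:", data_range)
```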

RANGE: LIMITATIONS

Consider three different alternative scenarios where the spread of resting heart
rates was quite different. What is the range in each case?

The range only uses 2 observations to describe the variation in an entire data set,
and there are obviously cases where it does not do a particularly good job.
PERCENTILES

• Another measure of variation, called the interquartile range (IQR), tries to address this issue. To understand how the IQR works, we must first introduce the idea of percentiles.

• The p-th percentile is the value such that p% of the observations fall at or below that value.
PERCENTILES

Some common percentiles:

• 50th percentile – a value such that 50% of the observations are below the value and 50% are above it; also called the median.

• 25th percentile – a value such that 25% of the observations are below the value and 75% are above it; also called the first quartile or Q1; it is the median of the lower half of the data.

• 75th percentile – a value such that 75% of the observations are below the value and 25% are above it; also called the third quartile or Q3; it is the median of the upper half of the data.
IQR
The interquartile range (IQR) is found by taking the difference between the 75th and 25th percentile values:

IQR = Q3 – Q1

We’ve already found the 50th percentile (the median) of our heart rate data set. Now find the 25th and 75th percentiles for the data set. Afterward, compute the corresponding IQR.
50 51 57 61 70 71 80 84
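The "median of each half" definition of Q1 and Q3 can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and leaving the middle value out of both halves when n is odd is an assumption made here, since the lecture only shows an even-sized sample. Software packages often use other percentile conventions and may report slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    lower = ordered[:half]       # smallest half of the data
    upper = ordered[-half:]      # largest half (middle value excluded if n is odd)
    return median(lower), median(upper)


heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]
q1, q3 = quartiles_by_halves(heart_rates)
iqr = q3 - q1                    # IQR = Q3 - Q1
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```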
IQR

Let’s see how the IQR holds up against our fictional data sets.

Fictional set 1: 50, 64, 64, 64, 66, 66, 66, 84

Fictional set 2: 50, 51, 52, 55, 80, 82, 83, 84
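To see the contrast numerically, the sketch below computes both the range and the IQR for the two fictional sets, using the same "median of each half" rule for the quartiles. The helper name `range_and_iqr` is an illustrative choice, and the helper assumes an even number of observations, as in both sets.

```python
from statistics import median

def range_and_iqr(values):
    """Return (range, IQR) for a data set with an even number of values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])   # median of the lower half
    q3 = median(ordered[half:])   # median of the upper half
    return ordered[-1] - ordered[0], q3 - q1


set_1 = [50, 64, 64, 64, 66, 66, 66, 84]
set_2 = [50, 51, 52, 55, 80, 82, 83, 84]

print("set 1 (range, IQR):", range_and_iqr(set_1))
print("set 2 (range, IQR):", range_and_iqr(set_2))
```

Both sets share the same minimum and maximum, so the range cannot tell them apart, while the IQR is small for the tightly clustered set and large for the set split into two groups.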

VISUALIZING IQR: BOXPLOTS
A boxplot is a data visualization that summarizes a data set using five statistics while also plotting unusual observations.

To construct a boxplot, we use five numbers calculated from the data:

• The minimum
• First quartile Q1 (25th percentile)
• The median
• Third quartile Q3 (75th percentile)
• The maximum
DRAWING BOXPLOTS
Let’s practice drawing one boxplot by hand using the AMES Living Area variable.
Min    Q1      Median   Q3      Max     IQR     Q1-1.5*IQR   Q3+1.5*IQR
672    1162    1505     1746    2495    ____    ____         ____
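The three missing columns follow from the five-number summary by arithmetic. A minimal Python sketch, with variable names chosen here for illustration:

```python
# Five-number summary of the AMES Living Area variable, from the table above
minimum, q1, med, q3, maximum = 672, 1162, 1505, 1746, 2495

iqr = q3 - q1                    # width of the box
lower_fence = q1 - 1.5 * iqr     # values below this would be plotted as unusual
upper_fence = q3 + 1.5 * iqr     # values above this would be plotted as unusual

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("any value below the lower fence?", minimum < lower_fence)
print("any value above the upper fence?", maximum > upper_fence)
```

Comparing the minimum and maximum with the fences is enough to tell whether any observation would be drawn as an individual point.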
BOXPLOTS
Boxplots can be drawn horizontally or vertically. They give a quick glance at the data, while histograms give a more detailed view of the shape of the data distribution.
MATCHING BOXPLOTS AND HISTOGRAMS
VARIANCE & STANDARD DEVIATION

What is the best way to involve every single data point in our data set in
a calculation of the variation of that data set?

71 50 57 84 61 70 80 51
VARIANCE & STANDARD DEVIATION

• A common method is to calculate the mean value, and then analyze the departures from the mean.
• We calculate the distance between each observation and the average of the observations, $\bar{x}$, and call this distance the deviation from the mean.
• The deviations are visualized by the horizontal line segments in the plot and are calculated in the table on the next slide.
VARIANCE & STANDARD DEVIATION
• The larger the deviations, the more variable the data!
• Problem: The sum of the deviations is always zero!
• Solution: Square the deviations before adding them all up. The result is called the sum of squares, and it can never be negative.
• Another problem: The sum of squares keeps growing as observations are added (it can never shrink), so it depends on the sample size.
• Solution: Take the sum of squares and divide it by the number of observations, n, to find the mean squared deviation.
• This calculation gives us a number we call the population variance.

Resting Heart Rate    Deviation from $\bar{x}$
50                    50 – 65.5 = -15.5
51                    51 – 65.5 = -14.5
57                    57 – 65.5 = -8.5
61                    61 – 65.5 = -4.5
70                    70 – 65.5 = 4.5
71                    71 – 65.5 = 5.5
80                    80 – 65.5 = 14.5
84                    84 – 65.5 = 18.5
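The chain of steps above (deviations, their sum, the sum of squares, the mean squared deviation) can be traced in a short Python sketch. The variable names are illustrative.

```python
heart_rates = [50, 51, 57, 61, 70, 71, 80, 84]
n = len(heart_rates)

x_bar = sum(heart_rates) / n                      # the sample mean
deviations = [x - x_bar for x in heart_rates]     # deviation of each value from the mean

sum_of_deviations = sum(deviations)               # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in deviations)  # squaring removes the signs
mean_squared_deviation = sum_of_squares / n       # divide by n

print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("sum of squares:", sum_of_squares)
print("mean squared deviation:", mean_squared_deviation)
```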
VARIANCE & STANDARD DEVIATION

• Population variance formula:

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

• Dividing by n works well if we are dealing with population data, but it consistently underestimates the population variance when we use sample data.

• When we have sample data, we correct for this underestimation problem by dividing the sum of squares by n – 1.

• Sample variance formula:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
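Python's standard `statistics` module provides both versions: `pvariance` divides by n and `variance` divides by n – 1. The sketch below applies them to the heart rate sample to show the size of the correction.

```python
from statistics import pvariance, variance

heart_rates = [71, 50, 57, 84, 61, 70, 80, 51]
n = len(heart_rates)

pop_var = pvariance(heart_rates)    # sum of squares divided by n
samp_var = variance(heart_rates)    # sum of squares divided by n - 1

print("dividing by n:    ", pop_var)
print("dividing by n - 1:", samp_var)
# The two differ only by the factor n / (n - 1)
print("ratio:", samp_var / pop_var, "vs n/(n-1) =", n / (n - 1))
```

The factor n / (n – 1) approaches 1 as the sample grows, so the correction matters most for small samples like this one.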
VARIANCE & STANDARD DEVIATION

• In this course, we will not often use the variance because it describes the variability of the data in squared units.
• For this reason, we will instead use the square root of the variance, called the standard deviation, since it is in the original units of the data.

• Population standard deviation formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$

• Sample standard deviation formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$
STANDARD DEVIATION
Let’s calculate the sample standard deviation of our heart rate data:

Resting Heart Rate    Deviation from $\bar{x}$    Squared Deviations
50                    50 – 65.5 = -15.5           ______
51                    51 – 65.5 = -14.5           ______
57                    57 – 65.5 = -8.5            ______
61                    61 – 65.5 = -4.5            ______
70                    70 – 65.5 = 4.5             ______
71                    71 – 65.5 = 5.5             ______
80                    80 – 65.5 = 14.5            ______
84                    84 – 65.5 = 18.5            ______

Sum of squares:

Standard deviation, s:

Interpretation of s: The resting heart rates of the students in our sample are roughly ________ away from the mean resting rate of ______, on average.
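A sketch of the same calculation in Python, first following the sample formula step by step and then comparing against `statistics.stdev` as a cross-check. The variable names are illustrative.

```python
import math
from statistics import stdev

heart_rates = [50, 51, 57, 61, 70, 71, 80, 84]
n = len(heart_rates)
x_bar = sum(heart_rates) / n

squared_deviations = [(x - x_bar) ** 2 for x in heart_rates]
sum_of_squares = sum(squared_deviations)

s = math.sqrt(sum_of_squares / (n - 1))   # sample standard deviation

print("squared deviations:", squared_deviations)
print("sum of squares:", sum_of_squares)
print(f"s is about {s:.1f} beats per minute around a mean of {x_bar}")
print("matches statistics.stdev:", math.isclose(s, stdev(heart_rates)))
```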
STANDARD DEVIATION

Notes about the standard deviation:

• The standard deviation is the square root of the variance and describes how close the data are to the mean using the units in which the data are recorded.

• s = 0 means every data point is the same value – there are no deviations from the mean!

• Like the mean, s is sensitive to outliers.

STANDARD DEVIATION & THE EMPIRICAL RULE
Many data sets we encounter in the natural world generally follow a symmetric, bell-shaped pattern. Shown below is a histogram of the heights of adult women, as an example.
STANDARD DEVIATION & THE EMPIRICAL RULE

When our data is symmetric and bell-shaped,

• Approximately 68% of the data will be within one standard deviation of the mean.

• Approximately 95% of the data will be within two standard deviations of the mean.

• Approximately 99.7% of the data will be within three standard deviations of the mean.
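The empirical rule can be checked roughly by simulation. The sketch below draws 10,000 values from a normal distribution and counts how many fall within one, two, and three standard deviations of the mean. The center of 64 and spread of 2.5 are hypothetical values chosen for illustration and are not taken from the lecture's height data.

```python
import random
from statistics import mean, stdev

random.seed(0)   # fixed seed so the simulation is repeatable
# Hypothetical bell-shaped data: normal with mean 64 and standard deviation 2.5
heights = [random.gauss(64, 2.5) for _ in range(10_000)]

x_bar = mean(heights)
s = stdev(heights)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in heights) / len(heights)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The simulated shares should land close to 68%, 95%, and 99.7%, though not exactly on them, since the rule is an approximation and the sample is random.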
NOTATIONS FOR PARAMETERS VS STATISTICS

In order to indicate whether our summary values are calculated from population
data and therefore are parameters or if they are from sample data and are
statistics, we use special notation.

Measure               Parameter Notation    Statistic Notation

Mean                  $\mu$                 $\bar{x}$
Variance              $\sigma^2$            $s^2$
Standard Deviation    $\sigma$              $s$
Proportion            $p$                   $\hat{p}$
QUICK PRACTICE WITH THE STANDARD DEVIATION
At the end of each semester, responses from the Student-Instructional Rating System are provided to professors across the university. The dot plots below show student ratings (on a scale of 1-5) of four hypothetical professors (professors A – D). Arrange these professors in order from smallest variability in ratings to highest variability in ratings.
QUICK PRACTICE WITH THE STANDARD DEVIATION
The table below asks you to compare the standard deviations of two data sets. Without doing any calculations, choose one of the four statements below to describe the relationship between the data sets compared.

Statement    Column A                                           Column B

____         The standard deviation of {0.2, 0.4, 0.6, 0.8}     The standard deviation of {2, 4, 6, 8}
____         The standard deviation of {1, 3, 5, 7, 9}          The standard deviation of {3, 5, 7, 9, 11}
____         The standard deviation of {1, 3, 5, 7, 9}          The standard deviation of {1, 3, 5, 7, 9, 9}
____         The standard deviation of {1, 3, 5, 7, 9}          The standard deviation of {1, 3, 5, 5, 7, 9}

I. The quantity in column A is greater
II. The quantity in column B is greater
III. The two quantities are equal
IV. The relationship cannot be determined from the given information
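After reasoning through each row without calculations, one way to check the answers is to compute the sample standard deviations with `statistics.stdev`. The helper name `compare` and the result labels are illustrative.

```python
import math
from statistics import stdev

# (Column A, Column B) for each row of the table
pairs = [
    ([0.2, 0.4, 0.6, 0.8], [2, 4, 6, 8]),
    ([1, 3, 5, 7, 9], [3, 5, 7, 9, 11]),
    ([1, 3, 5, 7, 9], [1, 3, 5, 7, 9, 9]),
    ([1, 3, 5, 7, 9], [1, 3, 5, 5, 7, 9]),
]

def compare(a, b):
    """Say which data set has the larger sample standard deviation."""
    s_a, s_b = stdev(a), stdev(b)
    if math.isclose(s_a, s_b):
        return "equal"
    return "A greater" if s_a > s_b else "B greater"

results = [compare(a, b) for a, b in pairs]
for row, outcome in enumerate(results, start=1):
    print("row", row, "->", outcome)
```

The pattern behind the results: multiplying every value by a constant scales the standard deviation, shifting every value leaves it unchanged, adding a value far from the mean raises it, and adding a value at the mean lowers it.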
ROBUST STATISTICS
Now that we’ve discussed how to interpret histograms, let’s see how they measure up against our
statistical summaries. Below is a histogram of the sale price of 42 low-quality diamonds that are less
than 1 carat in size. Their summary statistics are provided in the table next to it.

Statistic Value
n 42
Min $416
Q1 $1416
Median $1882
Mean $1924
Q3 $2497
Max $3154
s $703
Compute the IQR and range of this data set.
ROBUST STATISTICS
Suppose two new high-quality diamonds were accidentally mixed in with the original 42, one with sale price of
$7,393 and the other with a sale price of $8,979. Compared to the low-quality diamonds, their sale prices are
outliers. The summaries below show the new sale price distribution and its summary statistics.

Statistic Value
n 44
Min $416
Q1 $1429
Median $1899
Mean $2209
Q3 $2537
Max $8979
s $1497

Compute the IQR and range of this data set. How do these summaries change after the
high-quality diamonds are included?
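Both summaries follow directly from the values in the two tables. A minimal Python sketch, with the dictionary layout and the helper name `iqr_and_range` chosen here for illustration:

```python
# Summary values (in dollars) copied from the two tables above
before = {"min": 416, "q1": 1416, "q3": 2497, "max": 3154}
after = {"min": 416, "q1": 1429, "q3": 2537, "max": 8979}

def iqr_and_range(summary):
    """Return (IQR, range) from a dictionary of summary statistics."""
    return summary["q3"] - summary["q1"], summary["max"] - summary["min"]

iqr_before, range_before = iqr_and_range(before)
iqr_after, range_after = iqr_and_range(after)

print("IQR:  ", iqr_before, "->", iqr_after)
print("range:", range_before, "->", range_after)
```

The IQR barely moves when the two expensive diamonds are added, while the range roughly triples.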
ROBUST STATISTICS

• Compare the mean and median values of the two histograms. How do they compare?

Consider the summary statistics we’ve explored so far. We can categorize each of them with respect to whether the statistic appears to be robust to outlying values.

Statistics that are robust to outliers    Statistics that are sensitive to outliers
Median                                    Mean
IQR                                       Standard deviation
                                          Range
