BA 216 Lecture 4 Notes

This document discusses measures of central tendency (mean, median, mode) and dispersion (standard deviation, interquartile range, range) to describe datasets. It provides definitions and formulas for these statistical concepts. Key points covered include: the mean provides a point estimate of the population mean; the standard deviation represents the typical deviation from the mean; and for skewed data, the median may better indicate the center than the mean. Box and whisker plots and histograms are presented as visualizations to identify outliers and skewed distributions.

Uploaded by

Harrison Lim


BA 216

Central Tendency (mean, median, mode)

Two key ways to describe a dataset are Central Tendency & Dispersion (we’ll be
adding shape/modality, skewness, and outliers over the next two lectures)

● There are two ultra-important questions to answer when describing a dataset.

○ Central Tendency: where is the ‘middle’ of the dataset?
○ Dispersion: how spread out, how ‘wide’ is the dataset?

● Measures of Central Tendency

○ Categorical data: Mode
○ Numerical data: Mean & median, rarely mode

● Measures of Dispersion
○ Categorical data: Range (meaningful only when the categories are ordered, i.e. ordinal)
○ Numerical data: Standard deviation, Interquartile Range (IQR), rarely
range
Central Tendency for Categorical Data -- Mode
● A MODE is represented by a prominent peak in the distribution.
● A definition of mode sometimes taught in math classes is the value with the most
occurrences in the data set.
○ However, for many real-world numerical data sets, it is common to
have no observations with the same value, making this definition
impractical in data analysis.
● Mode is primarily useful for categorical data, and is most commonly used with
ordinal categorical data.
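
As a rough illustration of finding the mode of a categorical variable, the sketch below counts how often each category occurs and picks the most frequent one. The survey responses are made-up data, and the variable names are assumptions chosen for this example.

```python
from collections import Counter

# Hypothetical responses on an ordinal scale (made-up data for illustration).
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# Count occurrences of each category.
counts = Counter(responses)

# most_common(1) returns a list with the single (category, count) pair
# that has the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```

If two categories tie for the highest count, `most_common` returns only one of them, so a real analysis may want to inspect the full `counts` table.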

Central Tendency for Numerical Data -- Median

Central Tendency for Numerical Data -- Mean (x̄ and µ)

A key (if somewhat obvious) concept: the sample mean (x̄) provides a window to the
true, hidden population mean (µ)

Shorthand statistical notation for population vs. sample mean

Dispersion (range, standard deviation, IQR)

Overview of dispersion – 3 options

● Range - the highest value minus the lowest value
○ Simpler than the other measures of dispersion (variance, standard deviation, IQR)
○ This is the only option for categorical data, but it can also be used for
numerical data if a super simple statistic is needed.

● Variance & Standard deviation

○ Mathematically and analytically paired with mean
○ Can be used with numerical data

● Quartiles & Interquartile range (IQR)

○ Mathematically and analytically paired with median
○ Can be used with numerical data

DEVIATION is just another way to say “distance from the mean”, which we use to calculate
variance (and standard deviation)

The Variance (s²) of a sample is roughly the average squared deviation from the mean, across
all the observations in the dataset

Why is squared deviation used in the numerator when calculating variance (s²)?

Now, we use variance (s²) to find standard deviation (s)

Summary: variance (s²) vs. standard deviation (s)

Summary of process for finding standard deviation (s):

Deviation → Variance → Standard Deviation

● The variance is the average squared distance from the mean.

● The standard deviation is the square root of the variance.
○ The standard deviation is useful when considering how far the data are
distributed from the mean.
○ We nearly always use standard deviation, because...

● The standard deviation represents the typical deviation of observations from the
mean.
○ Usually about 70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
○ However, these percentages are not strict rules.
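
The Deviation → Variance → Standard Deviation process can be sketched in a few lines of Python. The data values are made up, and the sketch assumes the usual sample formula that divides the sum of squared deviations by n − 1 (the notes above do not spell out the denominator). It also shows one reason for squaring: the raw deviations always sum to zero.

```python
import math
import statistics

# Made-up sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: deviation = distance of each observation from the mean.
mean = sum(data) / n
deviations = [x - mean for x in data]

# Raw deviations cancel out, so they cannot measure spread on their own.
print(sum(deviations))

# Step 2: sample variance = sum of squared deviations divided by (n - 1).
squared_deviations = [d ** 2 for d in deviations]
sample_variance = sum(squared_deviations) / (n - 1)

# Step 3: standard deviation = square root of the variance.
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library's sample statistics.
print(sample_variance, statistics.variance(data))
print(sample_sd, statistics.stdev(data))
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.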

Summary: sample statistics (x̄) vs. population parameters (µ)

● Numerical Variables:
○ When working with averages/means, we calculate a sample statistic x̄
from our sample, and (if certain conditions are met) use that as the point
estimate for the (“real”, but unknown) population parameter µ.

● Population parameter: The “true population mean” is denoted with the Greek
symbol µ, pronounced “mew”
● Sample Statistic/point estimate: The “sample mean” is denoted with x̄,
pronounced “x-bar”.
● Sample standard deviation – s
● Population standard deviation - σ
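
To make the notation concrete, here is a small sketch that treats a made-up list of ten values as the whole population and every third value as a sample. In practice the population is usually unknown, so this setup is purely an assumption for illustration. `statistics.pstdev` divides by N (population, σ), while `statistics.stdev` divides by n − 1 (sample, s).

```python
import statistics

# Pretend these ten values are the entire population (made-up data).
population = [12, 15, 9, 20, 14, 11, 18, 16, 13, 12]

# Take every third value as a small sample (an arbitrary choice for the sketch).
sample = population[::3]

mu = statistics.mean(population)        # population mean, µ
sigma = statistics.pstdev(population)   # population standard deviation, σ

x_bar = statistics.mean(sample)         # sample mean, x̄ (point estimate of µ)
s = statistics.stdev(sample)            # sample standard deviation, s

print("mu =", mu, "sigma =", sigma)
print("x_bar =", x_bar, "s =", s)
```

The sample mean lands near, but not exactly on, the population mean, which is what "point estimate" means here.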
Interquartile range (IQR) is another way to measure dispersion, and is conceptually
related to median


Visualizing median & IQR in Box-and-Whisker Plots

Data outliers

Box and whisker plots – step 1 – median

Box and whisker plots – step 2 – IQR

Box and whisker plots - questions

Visual example: interpreting quartiles

Box and whisker plots – step 3 – Whiskers

Box & whiskers – step 4 – suspected outliers
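
One common convention, assumed here because these notes do not state the exact rule, flags a value as a suspected outlier when it lies more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that convention to made-up data; the function name `suspected_outliers` and the parameter `k` are hypothetical.

```python
import statistics

def suspected_outliers(values, k=1.5):
    """Return (whisker_ends, outliers) using the k * IQR fence convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers extend to the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything beyond a fence is drawn as a separate point.
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return (min(inside), max(inside)), outside

# Made-up data with one unusually large value.
values = [1, 3, 4, 6, 7, 8, 10, 12, 40]
whiskers, outliers = suspected_outliers(values)
print(whiskers, outliers)
```

A flagged value is only "suspected": it still deserves a look at how it was recorded before anyone decides what to do with it.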

Outliers are useful for many reasons in statistics

Two histogram examples for identifying suspected outliers
● Below are two examples of data distributions from a numerical variable.
● In which option would you be more suspicious of outlier(s)?

You can use box-and-whisker plots and histograms to spot suspected outliers

Example - Box and whiskers can be used with 2+ categories

Box & Whiskers Plots vs. Histograms (2 category example)

Example - Box & Whisker plot, 4 categories

The Shape/Modality of a Distribution

Why do we care about skewness and shape?

Preview of common data shapes – modality/shape and skewness

Hint to determine whether a data shape is left or right skewed

● Pay attention to the direction in which the tail of the data extends
● Ex: If the data has a hill on the left side but a tail on the right, then it is RIGHT
SKEWED
● Likewise, if there is a hill on the right side but a tail on the left, then it is
LEFT SKEWED
More about shape – UNIMODAL, BIMODAL, MULTIMODAL

Note: In order to determine modality, step back and imagine a smooth curve over the
histogram. Imagine that the bars are wooden blocks and you drop a limp strand of spaghetti
over them; the shape the spaghetti takes can be viewed as a smooth curve.

How to recognize & describe “skewed” data

Shape of a distribution – symmetry vs asymmetry

● The symmetrical, bell-shaped curve (from the Normal Distribution) pops up
everywhere in nature, but it does not represent every kind of distribution.
We’ve already seen some skewed data examples

RIGHT (POSITIVE) SKEWED vs. LEFT (NEGATIVE) SKEWED

Skewness, the mean, and the median

Preview – skewness, the mean, and the median

Examples of data that are likely to be skewed
● Common examples of right/positively skewed data include: people's
incomes; mileage on used cars for sale; reaction times in a psychology
experiment; house prices; number of accident claims by an insurance customer;
number of children in a family.
● Examples of left/negatively skewed data are rarer and harder to grasp
intuitively.
○ An example is the number of fingers; most people have ten, but some lose
one or more in accidents.
○ Also, the age of someone when they die (in wealthy countries) is
negatively skewed.

Example of real-world left/negatively skewed data

Example of real-world right/positively skewed data
Data visualization & skewed data – example 1
● In addition to summary statistics, we’ve learned about data visualizations as a
way to describe data. Two of your top choices when it comes to visualizing
skewed data are:
○ Histograms (covered extensively in last class’s lecture)
○ Box-and-whisker plots

● Homework 2 will have more practice with box-and-whisker plots

Data visualization & skewed data – example 2

Data visualization & skewed data – example 3

Bringing it all together: describing a data distribution

Robust statistics:
How to choose summary statistics for central tendency and dispersion, when dealing
with skewed (“funky shaped”) data
Skewness, the mean, and the median
Review, notation:

Knowledge check questions:

● Which is mean, and which is variance?
● How are variance and standard deviation related?
● What is the difference between population mean and sample mean?
● What is the difference between population variance and sample variance?

Skewed data and your statistical toolbox

● If the data are skewed, the mean gets pulled further in the direction of the skew
than the median.
● In the case of very skewed data, the mean may not provide a good estimate
of the center of the data or represent where most of the data fall.
● In this case, you should consider using the median to evaluate the center of the
data, rather than the mean.

● Note: If the data are skewed, this may also mean we can’t use certain statistical
tools like the t-test or an ANOVA (which we’ll learn about later).
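
The pull of the mean toward the tail can be seen with two tiny made-up datasets: incomes (in thousands) with one very high earner, and finger counts where a few people have fewer than ten. Comparing mean and median is only a quick hint about skew direction, not a formal test, and the function name `mean_vs_median` is hypothetical.

```python
import statistics

def mean_vs_median(values):
    """Return (mean, median, hint), where hint guesses the skew direction."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        hint = "right"   # mean pulled toward a long right tail
    elif mean < median:
        hint = "left"    # mean pulled toward a long left tail
    else:
        hint = "none"
    return mean, median, hint

# Made-up incomes in thousands: one very large value on the right.
incomes = [30, 32, 35, 38, 40, 42, 45, 50, 60, 250]
# Made-up finger counts: most people have ten, a few have fewer.
fingers = [10, 10, 10, 10, 10, 10, 10, 10, 9, 7]

print(mean_vs_median(incomes))
print(mean_vs_median(fingers))
```

In the income data the median stays close to what a typical person earns, while the mean is dragged upward by the single large value.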

Robust statistics – skewness, with median & IQR

Hint: Pay attention to the locations of the original outlier in each row of dot plots
● The median and IQR are only sensitive to numbers near Q1, the median, and
Q3.
● Since values in these regions are stable in the three data sets, the median and
IQR estimates are also stable.

● If we're looking to simply understand what a typical individual loan looks like, the
median is probably more useful.
● However, if the goal is to understand something that scales well, such as the total
amount of money we might need to have on hand if we were to offer 1,000 loans,
then the mean would be more useful.
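
The robustness idea above can be checked with a short sketch: replace the largest value in a made-up set of loan amounts (in thousands of dollars, an assumed unit) with an extreme one and compare the summaries. The helper name `summarize` and the "inclusive" quartile method are assumptions for this example.

```python
import statistics

def summarize(values):
    """Return mean, sd, median and IQR for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }

# Made-up loan amounts, in thousands of dollars.
loans = [5, 8, 10, 12, 15, 18, 20, 25, 30]
# Same data, but the largest loan is replaced by an extreme value.
loans_extreme = loans[:-1] + [300]

base = summarize(loans)
extreme = summarize(loans_extreme)
print(base)
print(extreme)

# The mean is the statistic that scales: an estimate of the total needed
# for 1,000 loans is 1,000 times the mean loan amount.
expected_total = 1000 * base["mean"]
print(expected_total)
```

The median and IQR are identical in both versions because the values near Q1, the median, and Q3 did not change, while the mean and standard deviation move substantially.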
Skewness, the mean, and the median
Summary: statistics and skewed data distributions
● If the data are skewed, the mean gets pulled further in the direction of the skew
than the median. In the case of very skewed data, the mean may not provide
a good estimate of the center of the data or represent where most of the data
fall.
● Note: If the data are skewed, this also may mean we can’t use certain statistical
tools like the t-test or an ANOVA (which we’ll learn about later).

Data shape                     | Summary statistic for centrality of data | Summary statistic for data spread | Mean compared to median
Symmetrical data               | Mean                                     | Standard deviation                | Mean approx. = Median
Right (positively) skewed data | Median                                   | Interquartile range (IQR)         | Median < Mean
Left (negatively) skewed data  | Median                                   | Interquartile range (IQR)         | Mean < Median
