BA 216 Lecture 4 Notes

This document discusses measures of central tendency (mean, median, mode) and dispersion (standard deviation, interquartile range, range) to describe datasets. It provides definitions and formulas for these statistical concepts. Key points covered include: the mean provides a point estimate of the population mean; the standard deviation represents the typical deviation from the mean; and for skewed data, the median may better indicate the center than the mean. Box and whisker plots and histograms are presented as visualizations to identify outliers and skewed distributions.

Uploaded by

Harrison Lim

BA 216

Central Tendency (mean, median, mode)

Two key ways to describe a dataset are Central Tendency & Dispersion (we’ll be
adding shape/modality, skewness, and outliers over the next two lectures)

● There are two ultra-important questions to answer when describing a dataset.


○ Central Tendency: where is the ‘middle’ of the dataset?
○ Dispersion: how spread out, how ‘wide’ is the dataset?

● Measures of Central Tendency


○ Categorical data: Mode
○ Numerical data: Mean & median, rarely mode

● Measures of Dispersion
○ Categorical data: Range
○ Numerical data: Standard deviation, Interquartile Range (IQR), rarely
range
Central Tendency for Categorical Data -- Mode
● A MODE is represented by a prominent peak in the distribution.
● A definition of mode sometimes taught in math classes is the value with the most
occurrences in the data set.
○ However, for many real-world numerical data sets, it is common to
have no observations with the same value, making this definition
impractical in data analysis.
● Mode is primarily useful for categorical data, and is most commonly used for
ordinal categorical data.
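
As a minimal sketch of the two points above, the snippet below finds the mode of a categorical variable and then shows why "most frequent value" breaks down for continuous measurements. The survey responses and measurements are made-up values for illustration only.

```python
from collections import Counter

# Hypothetical ordinal survey responses (not from the lecture).
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# The mode is the category with the highest count.
counts = Counter(responses)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)

# Hypothetical continuous measurements: every value is unique, so the
# "most occurrences" definition cannot single out one value.
measurements = [12.31, 9.87, 15.02, 11.46, 10.95]
max_repeats = max(Counter(measurements).values())
print(max_repeats)
```

For numerical data like the second list, the "prominent peak" definition applied to a histogram is the more practical one.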

Central Tendency for Numerical Data -- Median


Central Tendency for Numerical Data -- Mean (x̄ and µ)

A key (if somewhat obvious) concept: the sample mean (x̄) provides a window to the
true, hidden population mean (µ)

Shorthand statistical notation for population vs. sample mean
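
Written out, the two means are computed the same way and differ only in what they average over. The symbols n (sample size), N (population size), and x_i (the i-th observation) are standard conventions assumed here, not taken from the notes:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\text{(sample mean)}
\qquad\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\text{(population mean)}
```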


Dispersion (range, standard deviation, IQR)

Overview of dispersion – 3 options


● Range - the highest number minus the lowest number
○ Simpler than measures of variance/dispersion
○ This is the only option for categorical data, but also can be used for
numerical data if a super simple statistic is needed.

● Variance & Standard deviation


○ Mathematically and analytically paired with mean
○ Can be used with numerical data

● Quartiles & Interquartile range (IQR)


○ Mathematically and analytically paired with median
○ Can be used with numerical data

DEVIATION is just another way to say “distance from mean”, which we use to calculate
variance (and standard deviation)
The variance (s²) of a sample is roughly the average squared deviation from the mean,
across all the observations in the dataset
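
In symbols, one standard way to write the sample variance is shown below. The "roughly" reflects the usual convention of dividing by n − 1 rather than n for a sample; the formula itself does not appear in this part of the notes, so treat this standard form (and the symbols n and x_i) as an assumption:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

Each term (x_i − x̄) is one observation's deviation; squaring it keeps positive and negative deviations from cancelling out.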

Why is squared deviation used in the numerator when calculating variance (s²)?
Now, we use variance (s²) to find standard deviation (s)


Summary: variance (s²) vs. standard deviation (s)

Summary of process for finding standard deviation (s):


Deviation → Variance → Standard Deviation

● The variance is the average squared distance from the mean.


● The standard deviation is the square root of the variance.
○ The standard deviation is useful when considering how far the data are
distributed from the mean.
○ We nearly always use standard deviation, because...

● The standard deviation represents the typical deviation of observations from the
mean.
○ Usually about 70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
○ However, these percentages are not strict rules.
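
The Deviation → Variance → Standard Deviation process can be sketched step by step as follows. The data set is hypothetical, and dividing by n − 1 is the usual sample convention, assumed here rather than taken from the notes.

```python
import math

# Hypothetical sample (not from the lecture).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: deviation = distance of each observation from the mean.
mean = sum(data) / n
deviations = [x - mean for x in data]

# Step 2: variance = sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Step 3: standard deviation = square root of the variance.
std_dev = math.sqrt(variance)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    lo, hi = mean - k * std_dev, mean + k * std_dev
    return sum(lo <= x <= hi for x in data) / n

print(round(variance, 3), round(std_dev, 3))
print(share_within(1), share_within(2))
```

In this small sample 75% of the values fall within one standard deviation and 100% within two, which illustrates that the 70% and 95% figures are rough guides rather than strict rules.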

Summary: sample statistics (x̄) vs. population parameters (µ)


● Numerical Variables:
○ When working with averages/means, we calculate a sample statistic x̄
from our sample, and (if certain conditions are met) use that as the point
estimate for the (“real”, but unknown) population parameter µ.

● Population parameter: The “true population mean” is denoted with the Greek
symbol µ, pronounced “mew”
● Sample Statistic/point estimate: The “sample mean” is denoted with x̄,
pronounced “x-bar”.
● Sample standard deviation – s
● Population standard deviation - σ
Interquartile range (IQR) is another way to measure dispersion, and is conceptually
related to median
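
A minimal sketch of computing quartiles and the IQR with Python's standard library is shown below. The data are hypothetical, and the `method="inclusive"` choice is an assumption: textbooks and software use slightly different quartile conventions, so results may differ a little from hand calculations.

```python
import statistics

# Hypothetical sample (not from the lecture).
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The IQR is the width of the middle 50% of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Q2 is the median, which is why the IQR is described as the median's natural partner for measuring spread.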

Visualizing median & IQR in Box-and-Whisker Plots

Data outliers


Box and whisker plots – step 1 – median

Box and whisker plots – step 2 – IQR


Box and whisker plots - questions

Visual example: interpreting quartiles


Box and whisker plots – step 3 – Whiskers

Box & whiskers – step 4 – suspected outliers
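
One widely used convention for flagging suspected outliers in a box plot is the 1.5 × IQR rule: points more than 1.5 IQRs below Q1 or above Q3 are drawn individually beyond the whiskers. The notes do not state which rule the course uses, so the multiplier `k=1.5`, the function name `suspected_outliers`, and the data below are all assumptions for illustration.

```python
import statistics

def suspected_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample with one unusually large value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 45]
print(suspected_outliers(data))
```

The word "suspected" matters: the rule only flags points for a closer look and does not by itself justify dropping them.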

Outliers are useful for many reasons in statistics


Two histogram examples for identifying suspected outliers
● Below are two examples of data distributions from a numerical variable.
● In which option would you be more suspicious of outlier(s)?

You can use box-and-whisker plots and histograms to spot suspected outliers
Example - Box and whiskers can be used with 2+ categories
Box & Whiskers Plots vs. Histograms (2 category example)

Example - Box & Whisker plot, 4 categories

The Shape/Modality of a Distribution


Why do we care about skewness and shape?

Preview of common data shapes – modality/shape and skewness

Hint to determine whether a data shape is left or right skewed


● Pay attention to the direction in which the tail of the data extends.
● Ex: If the data have a hill on the left side and a tail on the right, the distribution is
RIGHT SKEWED.
● Likewise, if there is a hill on the right side and a tail on the left, the distribution is
LEFT SKEWED.
More about shape – UNIMODAL, BIMODAL, MULTIMODAL

Note: To determine modality, step back and imagine a smooth curve over the
histogram. Imagine that the bars are wooden blocks and you drape a limp strand of
spaghetti over them; the shape the spaghetti takes can be viewed as the smooth curve.

How to recognize & describe “skewed” data

Shape of a distribution – symmetry vs asymmetry


● The symmetrical, bell-shaped curve (from the Normal Distribution) pops up
everywhere in nature, but it does not represent every kind of distribution.
We’ve already seen some skewed data examples

RIGHT (POSITIVE) SKEWED vs. LEFT (NEGATIVE) SKEWED


Skewness, the mean, and the median

Preview – skewness, the mean, and the median


Examples of data that is likely to be skewed
● Common examples of right/positively skewed data include -- people's
incomes; mileage on used cars for sale; reaction times in a psychology
experiment; house prices; number of accident claims by an insurance customer;
number of children in a family.
● Examples of left/negatively skewed data are rarer and harder to grasp
intuitively.
○ An example is the number of fingers; most people have ten, but some lose
one or more in accidents.
○ Also, the age of someone when they die (in wealthy countries) is
negatively skewed.

Example of real-world left/negatively skewed data


Example of real-world right/positively skewed data
Data visualization & skewed data – example 1
● In addition to summary statistics, we’ve learned about data visualizations as a
way to describe data. Two of your top choices when it comes to visualizing
skewed data are:
○ Histograms (covered extensively in last class’s lecture)
○ Box-and-whisker plots

● Homework 2 will have more practice with box-and-whisker plots

Data visualization & skewed data – example 2


Data visualization & skewed data – example 3

Bringing it all together….describing a data distribution

Robust statistics:
How to choose summary statistics for central tendency and dispersion, when dealing
with skewed (“funky shaped”) data
Skewness, the mean, and the median
Review, notation:

Knowledge check questions:


● Which is mean, and which is variance?
● How are variance and standard deviation related?
● What is the difference between population mean and sample mean?
● What is the difference between population variance and sample variance?

Skewed data and your statistical toolbox


● If the data are skewed, the mean gets pulled further in the direction of the skew
than the median.
● In the case of very skewed data, the mean may not provide a good estimate
of the center of the data or represent where most of the data fall.
● In this case, you should consider using the median to evaluate the center of the
data, rather than the mean.

● Note: If the data are skewed, this also may mean we can’t use certain statistical
tools like the t-test or an ANOVA (which we’ll learn about later).
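
The pull of the skew on the mean is easy to see numerically. The sketch below uses made-up incomes (in thousands) with one very large value forming a right tail; none of these numbers come from the lecture.

```python
import statistics

# Hypothetical right-skewed incomes, in thousands.
incomes = [28, 31, 33, 35, 38, 40, 42, 45, 60, 250]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Count how many observations fall below the mean.
below_mean = sum(x < mean for x in incomes)

print(mean, median, below_mean)
```

Here the mean lands above nine of the ten observations, while the median sits in the middle of the bulk of the data, so the median is the better description of a "typical" income.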

Robust statistics – skewness, with median & IQR

Hint: Pay attention to the locations of the original outlier in each row of dot plots
● The median and IQR are only sensitive to numbers near Q1, the median, and
Q3.
● Since values in these regions are stable in the three data sets, the median and
IQR estimates are also stable.
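
The same stability can be checked numerically. This is a minimal sketch with two hypothetical data sets that differ only in where the largest value sits; the helper name `summarize` and the `method="inclusive"` quartile convention are assumptions.

```python
import statistics

def summarize(values):
    """Return the mean, SD, median, and IQR of a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": med,
        "iqr": q3 - q1,
    }

# Hypothetical data: identical except that the largest value is moved far out.
base = [3, 4, 5, 6, 7, 8, 9, 10, 11]
moved = base[:-1] + [50]

print(summarize(base))
print(summarize(moved))
```

Moving the extreme value changes the mean and standard deviation but leaves the median and IQR untouched, because the values near Q1, the median, and Q3 are the same in both lists.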

● If we're looking to simply understand what a typical individual loan looks like, the
median is probably more useful.
● However, if the goal is to understand something that scales well, such as the total
amount of money we might need to have on hand if we were to offer 1,000 loans,
then the mean would be more useful.
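
The reason the mean "scales well" is that the mean times the number of observations recovers the total exactly, which the median does not. The loan amounts below are hypothetical and chosen to be right skewed.

```python
import statistics

# Hypothetical right-skewed loan amounts, in dollars.
loans = [5_000, 6_000, 7_000, 8_000, 9_000, 10_000,
         12_000, 15_000, 40_000, 88_000]

mean = statistics.mean(loans)
median = statistics.median(loans)

# Mean x count reproduces the actual total; median x count falls short here.
total = sum(loans)
total_from_mean = mean * len(loans)
total_from_median = median * len(loans)

# Rough funding estimate for 1,000 similar loans, based on the mean.
estimate_for_1000 = mean * 1_000

print(total, total_from_mean, total_from_median, estimate_for_1000)
```

So the median answers "what does a typical loan look like?", while the mean answers "how much money do we need in total?".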
Skewness, the mean, and the median
Summary: statistics and skewed data distributions
● If the data are skewed, the mean gets pulled further in the direction of the skew
than the median. In the case of very skewed data, the mean may not provide
a good estimate of the center of the data or represent where most of the data
fall.
● Note: If the data are skewed, this also may mean we can’t use certain statistical
tools like the t-test or an ANOVA (which we’ll learn about later).

Data shape                      Summary statistic         Summary statistic           Mean compared to
                                for centrality of data    for data spread             median
------------------------------  ------------------------  --------------------------  ------------------
Symmetrical data                Mean                      Standard deviation          Mean approx. = Median
Right (positively) skewed data  Median                    Interquartile range (IQR)   Median < Mean
Left (negatively) skewed data   Median                    Interquartile range (IQR)   Mean < Median
