BA 216 Lecture 4 Notes
Two key ways to describe a dataset are Central Tendency & Dispersion (we’ll be
adding shape/modality, skewness, and outliers over the next two lectures)
● Measures of Dispersion
○ Categorical data: Range
○ Numerical data: Standard deviation, Interquartile Range (IQR), rarely
range
Central Tendency for Categorical Data -- Mode
● A MODE is represented by a prominent peak in the distribution.
● A definition of mode sometimes taught in math classes is the value with the most
occurrences in the data set.
○ However, for many real-world numerical data sets, it is common to
have no observations with the same value, making this definition
impractical in data analysis.
● Mode is primarily useful for categorical data, and is most commonly used for
ordinal categorical data.
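The contrast between the two uses of the mode can be sketched in a few lines of Python. The survey responses and loan amounts below are made-up values for illustration, not data from the lecture.

```python
from collections import Counter

# Hypothetical categorical data (survey responses on an ordinal scale).
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# For categorical data, the mode is simply the most frequent category.
counts = Counter(responses)
mode_category, mode_count = counts.most_common(1)[0]
print(mode_category, mode_count)  # agree 3

# Hypothetical numerical data with no repeated values: every value occurs
# exactly once, so "the most frequent value" does not single anything out.
loan_amounts = [10250.75, 4999.10, 7320.40, 15800.00, 6125.35]
max_repeats = max(Counter(loan_amounts).values())
print(max_repeats)  # 1
```

For numerical data like the second list, looking for peaks in a histogram is the more practical way to talk about modes.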
A key (if somewhat obvious) concept: the sample mean (x̄) provides a window to the
true, hidden population mean (µ)
DEVIATION is just another way to say “distance from mean”, which we use to calculate
variance (and standard deviation)
The Variance (s²) of a sample is roughly the average squared deviation from the mean,
across all the observations in the dataset
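Written as a formula, this is the standard sample-variance definition, which these notes describe only in words. Here x_i is the i-th observation, x̄ is the sample mean, and n is the number of observations. Dividing by n − 1 rather than n is why the variance is only "roughly" an average.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```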
Why is squared deviation used in the numerator when calculating variance (s²)?
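One commonly given reason can be checked directly: raw deviations from the mean always cancel out to zero, so they cannot measure spread on their own. The five-value sample below is hypothetical.

```python
# Hypothetical sample of five observations.
data = [2, 4, 4, 6, 9]
mean = sum(data) / len(data)  # 5.0

deviations = [x - mean for x in data]   # [-3.0, -1.0, -1.0, 1.0, 4.0]
squared = [d ** 2 for d in deviations]  # [9.0, 1.0, 1.0, 1.0, 16.0]

print(sum(deviations))  # 0.0: negative and positive deviations cancel
print(sum(squared))     # 28.0: squaring makes every deviation non-negative
```

Squaring also gives large deviations more weight than small ones (here, the deviation of 4 contributes 16 of the 28).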
Now, we use variance (s²) to find standard deviation (s)
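A minimal sketch of that step, assuming the usual n − 1 denominator for a sample: the standard deviation is the square root of the variance, which brings the measure back to the original units of the data. The sample is hypothetical, and the hand calculation is cross-checked against Python's built-in `statistics` module.

```python
import math
import statistics

# Hypothetical sample of five observations.
data = [2, 4, 4, 6, 9]
n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # 28 / 4 = 7.0

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# The hand calculation agrees with the standard library.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))

print(variance, round(std_dev, 4))  # 7.0 2.6458
```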
Summary: variance (s²) vs. standard deviation (s)
● The standard deviation represents the typical deviation of observations from the
mean.
○ Usually about 70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
○ However, these percentages are not strict rules.
● Population parameter: The “true population mean” is denoted with the Greek
symbol µ, pronounced “mew”
● Sample statistic/point estimate: The “sample mean” is denoted with x̄,
pronounced “x-bar”.
● Sample standard deviation: s
● Population standard deviation: σ
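The "about 70% within one standard deviation, about 95% within two" guideline from the summary above can be checked on any dataset by counting. The sketch below uses simulated bell-shaped data (an assumption made for illustration; the seed, center, and spread are arbitrary), and the shares will differ for other shapes, which is why the percentages are not strict rules.

```python
import random
import statistics

random.seed(216)  # arbitrary seed so the run is repeatable

# Hypothetical bell-shaped sample of 1,000 observations.
data = [random.gauss(50, 10) for _ in range(1000)]

mean = statistics.mean(data)
s = statistics.stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * s for x in data) / len(data)

within_one = share_within(1)
within_two = share_within(2)
print(f"within 1 SD: {within_one:.0%}, within 2 SD: {within_two:.0%}")
```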
Interquartile range (IQR) is another way to measure dispersion, and is conceptually
related to median
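As a rough sketch of the connection to the median: the median splits the sorted data in half, the quartiles Q1 and Q3 split each half again, and the IQR is the distance Q3 − Q1, which is the width of the middle 50% of the data. The sample below is hypothetical. Quartile conventions differ slightly between textbooks and software, so hand-computed values may not match exactly.

```python
import statistics

# Hypothetical sample, shown sorted for readability.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 40]

# quantiles(n=4) returns the three cut points: Q1, the median (Q2), and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```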
You can use box-and-whisker plots and histograms to spot suspected outliers
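These notes do not state how a box plot decides which points to flag. A widely used convention, assumed here, marks any observation more than 1.5 × IQR below Q1 or above Q3 as a suspected outlier. A minimal sketch on hypothetical data:

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 40]

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [40]
```

A flagged point is only a suspect. Whether it is an error or a genuine extreme value still needs a look at the data.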
Example - Box and whiskers can be used with 2+ categories
Box & Whiskers Plots vs. Histograms (2 category example)
Note: To determine modality, step back and imagine a smooth curve over the
histogram. Picture the bars as wooden blocks with a limp strand of spaghetti dropped
over them: the shape the spaghetti takes can be viewed as that smooth curve.
Robust statistics:
How to choose summary statistics for central tendency and dispersion, when dealing
with skewed (“funky shaped”) data
Skewness, the mean, and the median
Review, notation:
Hint: Pay attention to the locations of the original outlier in each row of dot plots
● The median and IQR are only sensitive to numbers near Q1, the median, and
Q3.
● Since values in these regions are stable in the three data sets, the median and
IQR estimates are also stable.
● If we're looking to simply understand what a typical individual loan looks like, the
median is probably more useful.
● However, if the goal is to understand something that scales well, such as the total
amount of money we might need to have on hand if we were to offer 1,000 loans,
then the mean would be more useful.
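The stability described in the bullets above can be reproduced with a small experiment. The values below are hypothetical stand-ins for the three data sets in the dot plots: the same nine observations, with only the largest value moved closer in or further out.

```python
import statistics

def summarize(values):
    """Return robust (median, IQR) and non-robust (mean, SD) summaries."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

# Hypothetical values; only the most extreme observation changes.
base = [1, 3, 4, 6, 7, 8, 10, 12, 15]
original = summarize(base + [40])
moved_in = summarize(base + [20])
moved_out = summarize(base + [400])

for label, stats in [("original", original),
                     ("outlier moved in", moved_in),
                     ("outlier moved out", moved_out)]:
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

The median and IQR come out identical in all three runs because the values near Q1, the median, and Q3 never change, while the mean and standard deviation shift with the outlier.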
Summary: statistics and skewed data distributions
● If the data are skewed, the mean gets pulled further in the direction of the skew
than the median. In the case of very skewed data, the mean may not provide
a good estimate of the center of the data or represent where most of the data
fall.
● Note: If the data are skewed, this also may mean we can’t use certain statistical
tools like the t-test or an ANOVA (which we’ll learn about later).
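A small numerical illustration of the first summary point, using a hypothetical right-skewed sample (most values small, a few large):

```python
import statistics

# Hypothetical right-skewed sample.
right_skewed = [2, 3, 3, 4, 4, 5, 5, 6, 30, 60]

mean = statistics.mean(right_skewed)      # 12.2
median = statistics.median(right_skewed)  # 4.5

# Share of observations that fall below the mean.
below_mean = sum(x < mean for x in right_skewed) / len(right_skewed)
print(mean, median, below_mean)  # 12.2 4.5 0.8

# Mirroring the data gives a left-skewed sample, and the mean is pulled
# the other way, again further than the median.
left_skewed = [-x for x in right_skewed]
print(statistics.mean(left_skewed), statistics.median(left_skewed))  # -12.2 -4.5
```

With 80% of the observations sitting below the mean, the mean here says little about where most of the data fall, while the median stays in the bulk of the values.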