Learn Visual Data Analysis Live and in Person!
Dave Langer
TDWI Las Vegas, February 2023
Distribution Analysis
Section 2
For the flights dataset, distribution analysis can help answer questions like:
• Are longer arrival delays more concentrated at one airport versus other airports?
• Do more short-distance flights depart from one airport versus other airports?
The questions above are specific to the flights dataset, but the pattern of questions applies to all domains!
We use the average all the time to summarize historical data and use the summary to predict the future.
Most people would use the average delay to schedule connecting flights, assuming most flight delays are close to the average.
While averages are immensely useful, they obscure the data’s underlying distribution.
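A minimal sketch of how an average can hide the shape of the data. The two delay vectors below are made-up values for illustration, not taken from the flights dataset; they share the same mean but have very different distributions.

```r
# Hypothetical arrival delays in minutes (made-up values, not real flight data)
delays_a <- c(8, 9, 10, 10, 10, 11, 12)      # tightly clustered around 10
delays_b <- c(-10, -8, -5, -5, -2, 0, 100)   # mostly early, one very long delay

# Both collections have the same average delay
mean_a <- mean(delays_a)
mean_b <- mean(delays_b)

# Share of flights whose delay is within 5 minutes of the average
near_mean_a <- mean(abs(delays_a - mean_a) <= 5)
near_mean_b <- mean(abs(delays_b - mean_b) <= 5)

print(c(mean_a = mean_a, mean_b = mean_b))
print(c(near_mean_a = near_mean_a, near_mean_b = near_mean_b))
```

In the second vector no single flight is close to the average, so a connection planned around that average would be a poor guide.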
To accommodate large datasets, bins are commonly used to group the data…
The bins for this histogram are set to 250 units (i.e., miles) wide.
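To make the idea of binning concrete, here is a small base R sketch that groups distances into 250-mile bins and counts the values in each. The distances are hypothetical illustration values, and the 0 to 2,750 mile range is an assumption chosen to cover them.

```r
# Hypothetical flight distances in miles (made-up values for illustration)
distances <- c(80, 187, 229, 545, 733, 762, 1089, 1400, 2475, 2586)

# Bin edges every 250 miles: 0, 250, 500, ..., 2750
breaks <- seq(0, 2750, by = 250)

# Assign each distance to a bin; right = FALSE makes bins of the form [0, 250)
bins <- cut(distances, breaks = breaks, right = FALSE, dig.lab = 4)

# Count how many distances fall in each bin (this is what a histogram draws)
counts <- table(bins)
print(counts)
```

Each bar of a histogram is one of these counts; a wider bin width produces fewer, taller bars.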
Consider the following histograms of the distance feature of the flights dataset…
• Load tidyverse goodness
• Minimal ink for the visualization
• Add a histogram geometry layer to the visualization
• Set histogram bins to be 1000 units wide
• The feature to use for the distribution visualization
• Color histogram bars by origin airport count
Langer - Visual Data Analysis with R
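The callouts above annotate a code listing that did not survive on this slide. The following is one plausible reconstruction, not the original code. It assumes the flights data frame comes from the nycflights13 package, that the feature is the distance column, that origin identifies the departure airport, and that theme_minimal() stands in for the "minimal ink" theme.

```r
library(tidyverse)      # load tidyverse goodness (includes ggplot2)
library(nycflights13)   # assumption: provides the flights data frame

# x = distance: the feature to use for the distribution visualization
# fill = origin: color histogram bars by origin airport count
distance_hist <- ggplot(flights, aes(x = distance, fill = origin)) +
  theme_minimal() +                  # assumption: a low-ink theme
  geom_histogram(binwidth = 1000)    # histogram layer with 1000-mile-wide bins

print(distance_hist)
```

Changing binwidth to 250 would give the finer-grained version of the same histogram.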
Medians
Everyone is familiar with using the average to quantify the location of a collection of data.
However, there is often a more insightful way to quantify location: the median.
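As a quick illustration, the sketch below compares the mean and the median on a small set of made-up delays containing one extreme value. The numbers are hypothetical and chosen only to show how differently the two measures react.

```r
# Hypothetical arrival delays in minutes (made-up values for illustration)
delays <- c(-12, -5, -3, 0, 4, 9, 175)

# The mean is pulled upward by the single 175-minute delay
mean_delay <- mean(delays)

# The median is the middle value once the delays are sorted
median_delay <- median(delays)

print(c(mean = mean_delay, median = median_delay))
```

Here the mean is 24 minutes while the median is 0, so the median describes the typical flight better when a few delays are extreme.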
Quartiles do precisely what you expect – they divide datasets into quarters.
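A short sketch of computing quartiles in base R with quantile(), using its default interpolation method. The distances are hypothetical illustration values.

```r
# Hypothetical flight distances in miles (made-up values for illustration)
distances <- c(100, 200, 300, 400, 500, 600, 700, 800, 900)

# 25th, 50th, and 75th percentiles split the sorted data into quarters
quartiles <- quantile(distances, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# The interquartile range (IQR) is the width of the middle half of the data
iqr_value <- IQR(distances)
print(iqr_value)
```

The 50th percentile is the median, so the quartiles extend the same idea to the lower and upper halves of the data.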
Either:
• The max data value
• 75th percentile + 1.5 * IQR
Whichever is smaller.
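Assuming the rule above describes the upper limit of a box plot whisker, it can be sketched in base R as follows. The distances are hypothetical illustration values, and quantile() is used with its default method, which can differ slightly from the hinges some plotting functions use.

```r
# Hypothetical flight distances in miles (made-up values for illustration)
distances <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 2500)

q1 <- unname(quantile(distances, 0.25))   # 25th percentile
q3 <- unname(quantile(distances, 0.75))   # 75th percentile
iqr_value <- q3 - q1                      # interquartile range

# Upper limit: the max data value or Q3 + 1.5 * IQR, whichever is smaller
upper_limit <- min(max(distances), q3 + 1.5 * iqr_value)

# Values beyond the limit are the points drawn individually as outliers
outliers <- distances[distances > upper_limit]

print(upper_limit)
print(outliers)
```

With these values the limit is 1450 miles and the 2500-mile flight is flagged as an outlier. Plotting functions such as geom_boxplot() draw the whisker itself only as far as the largest observation at or below this limit.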