Learn Visual Data Analysis Live and in Person!

The document discusses using visualizations to analyze distribution and frequency data. It covers histograms, which show the distribution of values across bins. Interpreting histograms involves examining the spread, center, and shape of the data. The choice of bin width impacts the insights from histograms. Histograms can also show comparisons across categorical variables.
Visual Data Analysis with R

Dave Langer
TDWI Las Vegas, February 2023
Distribution Analysis
Section 2

Langer - Visual Data Analysis with R 2


Business Questions
Insightful data analyses always start with a business question. It doesn’t matter whether your analysis uses
visualizations, statistics, or machine learning.

For the flights dataset, distribution analysis can help answer questions like:

• What is the range of arrival delay times?

• Are most arrival delays relatively small? The opposite? Neither?

• Are longer arrival delays more concentrated at one airport versus other airports?

• What is the range of the distances traveled by departing flights?

• Do more short-distance flights depart from one airport versus other airports?

The questions above are specific to the flights dataset, but the pattern of questions applies to all domains!



The Average Arrival Delay
The world’s most common predictive model is the average.

We use the average all the time to summarize historical data and use the summary to predict the future.

For example, the average flight arrival delay is 7 minutes.

Most people would use this number to schedule connecting flights, assuming most flight delays are close
to the average.

What if it was known 10% of flights arrive 52+ minutes late?

Do you think that would change people’s plans?

While averages are immensely useful, they obscure the data’s underlying distribution.
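
The comparison above can be sketched in a few lines of R. This is only an illustration: it assumes the flights dataset is the one shipped in the nycflights13 package (the slides do not name the package), and the variable names are made up for the example.

```r
# Assumption: "the flights dataset" is nycflights13::flights.
library(nycflights13)

# Some flights have no recorded arrival delay (NA), so drop those first.
delays <- flights$arr_delay[!is.na(flights$arr_delay)]

avg_delay  <- mean(delays)                   # the single-number summary
p90_delay  <- quantile(delays, probs = 0.9)  # 10% of flights arrive later than this
share_late <- mean(delays >= 52)             # share of flights arriving 52+ minutes late

cat("Average arrival delay:", round(avg_delay, 1), "minutes\n")
cat("90th percentile:", p90_delay, "minutes\n")
cat("Share arriving 52+ minutes late:", round(100 * share_late, 1), "%\n")
```

Printing the average next to a high percentile shows how much the single number hides.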



Frequency Distributions
Given a collection of data, the frequency distribution is a count of the individual values within the data.

Take the following collection of data as an example…


arr_delay values: 11, 20, 12, 31, 7, 12, 10, 17, 5, 23

Tally the individual values to get the frequencies:

value   count
5       1
7       1
10      1
11      1
12      2
17      1
20      1
23      1
31      1

Notice the following:
1. The average value is 14.8
2. 60% of the values are below 14.8
3. 10% of the values are more than double the average

Frequency distributions give us a more thorough understanding of the data.

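
The tally on this slide can be reproduced with base R. A minimal sketch using the ten arr_delay values from the slide; the variable names are illustrative.

```r
# The ten arr_delay values shown on the slide.
arr_delay <- c(11, 20, 12, 31, 7, 12, 10, 17, 5, 23)

# table() counts how often each distinct value occurs.
freq <- table(arr_delay)
print(freq)

avg          <- mean(arr_delay)             # average of all ten values
share_below  <- mean(arr_delay < avg)       # fraction of values below the average
share_double <- mean(arr_delay >= 2 * avg)  # fraction at least double the average

cat("Average:", avg, "\n")
cat("Share below average:", share_below, "\n")
cat("Share at least double the average:", share_double, "\n")
```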
Frequency Distributions - Bins
There can be many, many distinct values in a collection of data. Creating a frequency distribution based on
distinct values quickly becomes unwieldy.

To accommodate large datasets, bins are commonly used to group the data…

Define some bins and tally the values that fall in each bin…

arr_delay values: 11, 20, 12, 31, 7, 12, 10, 17, 5, 23

bin        count
1 to 5     1
6 to 10    2
11 to 15   3
16 to 20   2
21 to 25   1
26 to 30   0
31 to 35   1

Notice the following:
1. 60% of the data falls in the first three bins
2. There are no values in the “26 to 30” bin
3. Only 10% of values fall in the last two bins

As the data analyst, you define the bins. It is common to experiment with different bin widths while exploring your data.

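
One way to bin the same ten values in base R is cut() followed by table(). This is a sketch, not the course's own code; the break points and labels are chosen to match the bins on the slide.

```r
# The ten arr_delay values shown on the slide.
arr_delay <- c(11, 20, 12, 31, 7, 12, 10, 17, 5, 23)

# Bin edges every 5 minutes. cut() uses right-closed intervals by default,
# so (0, 5] holds the whole-minute values 1 to 5, (5, 10] holds 6 to 10, etc.
breaks <- seq(0, 35, by = 5)
labels <- paste(head(breaks, -1) + 1, "to", tail(breaks, -1))

bins <- cut(arr_delay, breaks = breaks, labels = labels)

# table() on a factor keeps empty levels, so the "26 to 30" bin shows a 0.
bin_counts <- table(bins)
print(bin_counts)
```

Changing `by = 5` to another width is the quickest way to experiment with different bins.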
Histograms
A histogram is a frequency distribution visualization.

Here’s a histogram of the arr_delay feature of the flights dataset.

The height of the bars in a histogram corresponds to the count of values within each bin.

The bins used in this visualization are set to be 15 units (i.e., minutes) wide.

What does this visualization tell you at first glance about the distribution of arr_delay?



Interpreting Histograms - Spread
When interpreting histograms, you are looking at the
following visual characteristics:
• Spread
• Center
• Shape

Here’s a histogram of the distance feature of the flights dataset.

The bins for this histogram are set to 250 units (i.e., miles) wide.

The spread of a histogram is the range of values depicted from the highest to the lowest.

What is the spread for this histogram? Any insights?



Interpreting Histograms - Center
When interpreting the center of a histogram, you are
looking to quantify what a typical value might be.

Statisticians refer to this idea as a measure of location or central tendency.

The most common measure of location is the average (or arithmetic mean).

For many histograms, you can estimate the location using your eyes.

In cases like this histogram, you can calculate the average using R code and add it to the visualization.

What insights does the red average line provide?
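
A hedged sketch of calculating the average and drawing it as a red line. It assumes ggplot2 and the nycflights13 flights data, and it assumes the histogram in question is the distance histogram with 250-mile bins from the Spread slide; the theme and object names are illustrative.

```r
# Assumptions: ggplot2 is installed and "flights" is nycflights13::flights.
library(ggplot2)
library(nycflights13)

# Calculate the average distance once so it can be reused in the plot.
avg_distance <- mean(flights$distance, na.rm = TRUE)

distance_hist <- ggplot(flights, aes(x = distance)) +
  geom_histogram(binwidth = 250) +                          # 250-mile-wide bins
  geom_vline(xintercept = avg_distance, color = "red") +    # red line at the average
  theme_bw()

print(distance_hist)
```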



Interpreting Histograms - Shape
When interpreting the shape of a histogram, you are
looking to understand the density (i.e., counts) of values
throughout the spread.

Some common questions to ask while looking at the shape:
• Are many values bunched up at one end of the spread?
• Are many values bunched up at both ends of the spread (i.e., the histogram looks like a valley)?
• Is the histogram mound-shaped and symmetrical?
• Are there multiple peaks?

In the case of this histogram, what insights does looking at the shape provide?



Interpreting Histograms - Bins
The choice of bin width is the most critical aspect of producing insightful histograms as it determines
the shape of the histogram.

Consider the following histograms of the distance feature of the flights dataset…

There is no “right” bin size; it depends on the situation.

Which of these visualizations seems most insightful to you?
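
The effect of bin width can be seen without plotting anything. The sketch below uses the ten arr_delay values from the earlier frequency distribution slide (not the distance feature shown here) and base R's hist() with plot = FALSE to return only the counts; the two bin widths are arbitrary choices for illustration.

```r
# The ten arr_delay values from the frequency distribution slides.
arr_delay <- c(11, 20, 12, 31, 7, 12, 10, 17, 5, 23)

# Same data, two bin widths. plot = FALSE returns the bin counts only.
narrow <- hist(arr_delay, breaks = seq(0, 35, by = 5),  plot = FALSE)
wide   <- hist(arr_delay, breaks = seq(0, 40, by = 20), plot = FALSE)

cat("5-minute bins: ", narrow$counts, "\n")  # a mound with a gap before the last bin
cat("20-minute bins:", wide$counts, "\n")    # the gap and the mound both disappear
```

With 5-minute bins the empty “26 to 30” bin is visible; with 20-minute bins the same data collapses into two bars.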



Insights with Histograms
Histograms become even more powerful when combined with categorical data.

By adding categorical data, you can make comparisons across the distribution.

The following histogram adds the origin airport to the distance histogram.

How does the distance distribution vary across the JFK, LaGuardia, and Newark origin airports?

Do some airports seem bunched up at one or both ends of the spread?



Histograms with ggplot2

[Annotated code listing; only the callouts are recoverable]
• Load tidyverse goodness
• Load flights dataset
• First layer is the data frame to use for the visualization
• The aes() function creates the aesthetic, defining how the data will be used and aspects of the final appearance
• The feature to use for the distribution
• Color histogram bars by origin airport
• Minimal ink for the visualization
• Add a histogram geometry layer to the visualization
• Set histogram bins to be 1000 units wide

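
The callouts on this slide describe a ggplot2 call. The following is a sketch consistent with them, not the slide's actual code: it assumes the flights data comes from the nycflights13 package, that the distribution feature is distance, and that the "minimal ink" theme is theme_bw(). All three are guesses.

```r
# Load tidyverse goodness (ggplot2 is part of the tidyverse).
library(tidyverse)

# Load flights dataset. Assumption: it comes from the nycflights13 package.
library(nycflights13)

# First layer: the data frame, plus aes() mapping the distribution feature
# (assumed to be distance) to x and the origin airport to the bar fill color.
origin_hist <- ggplot(flights, aes(x = distance, fill = origin)) +
  theme_bw() +                       # assumed "minimal ink" theme
  geom_histogram(binwidth = 1000)    # histogram layer with 1000-unit-wide bins

print(origin_hist)
```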
Medians
Everyone is familiar with the average to quantify the location in a collection of data.

However, there is often a more insightful way to quantify location: the median.

Take the following values…

arr_delay: 11, 20, 12, 31, 7, 12, 10, 17, 5, 23, 25, 2

To find the median, first sort the values…

arr_delay (sorted): 2, 5, 7, 10, 11, 12, 12, 17, 20, 23, 25, 31

The median is the point that splits the data in half (i.e., it is the middle value).

This case has no single middle value, so the median is calculated as the average of the two middle values (12 and 12), which is 12.

For this data, the average is 14.58, while the median is 12.

The average is calculated using all values, while the median is not.

Outliers can skew the average. The median is robust to outliers.

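
The median calculation and its robustness to outliers can be checked in base R. A minimal sketch using the twelve values from this slide; the 310-minute outlier is hypothetical and added only for illustration.

```r
# The twelve arr_delay values from the slide.
arr_delay <- c(11, 20, 12, 31, 7, 12, 10, 17, 5, 23, 25, 2)

# Manual median for an even number of values: average the two middle ones.
sorted        <- sort(arr_delay)
n             <- length(sorted)
middle_pair   <- sorted[c(n / 2, n / 2 + 1)]
manual_median <- mean(middle_pair)

cat("Average:", round(mean(arr_delay), 2), "\n")
cat("Median (manual):", manual_median, " Median (built-in):", median(arr_delay), "\n")

# Hypothetical outlier: replace the largest delay (31) with 310 minutes.
with_outlier <- replace(arr_delay, which.max(arr_delay), 310)

cat("Average with outlier:", round(mean(with_outlier), 2), "\n")
cat("Median with outlier:", median(with_outlier), "\n")
```

The average moves substantially when one value changes, while the median stays where it was.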
Quartiles
In addition to the median, quartiles are very useful for quantifying the spread (or dispersion) of values.

Quartiles do precisely what you expect – they divide datasets into quarters.

Take the sorted values from the previous slide…

The first quartile contains the first 25% of the data. The second quartile contains the next 25% of the data. And so on…

arr_delay (sorted)   quartile
2, 5, 7              1st quartile
10, 11, 12           2nd quartile
12, 17, 20           3rd quartile
23, 25, 31           4th quartile

The median (12) sits at the boundary between the 2nd and 3rd quartiles.

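
A sketch of both views of quartiles in base R: the four equal-sized groups shown on the slide, and the cut points between them. The grouping assumes the number of values is divisible by four. Note that quantile() interpolates between neighbouring values by default, so its cut points need not be values that appear in the data.

```r
# The twelve sorted arr_delay values from the slide.
arr_delay <- c(2, 5, 7, 10, 11, 12, 12, 17, 20, 23, 25, 31)

# Label each sorted value with its quarter (1 to 4), then group by label.
quarter <- rep(1:4, each = length(arr_delay) / 4)
groups  <- split(arr_delay, quarter)
print(groups)

# Cut points between the quarters: 25th, 50th (median), and 75th percentiles.
cut_points <- quantile(arr_delay, probs = c(0.25, 0.5, 0.75))
print(cut_points)
```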
Box Plots
Box plots (or box and whisker plots) are a visualization for frequency distributions.

Unlike histograms, box plots use medians and quartiles to visualize the distribution.

The following is a box plot of the distance feature of the flights dataset...

As with histograms, the usefulness of box plots increases when categories are added to the visualization.

However, first we must learn how to interpret box plots.



Interpreting Box Plots

Reading the box plot from top to bottom:
• Outliers (points beyond the whiskers)
• Upper whisker: either the max data value or the 75th percentile + 1.5 * IQR, whichever is smaller
• Top of the box: the 75th percentile
• Line inside the box: the median (the 50th percentile)
• Bottom of the box: the 25th percentile
• Lower whisker: either the min data value or the 25th percentile - 1.5 * IQR, whichever is larger

Interquartile range (IQR): the range of the middle 50% of the data

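
The whisker rules can be written out as a small function. This is a sketch of the rule as the slide states it; `whisker_limits` is a hypothetical helper, not part of any package, and the 90-minute value is made up to produce an outlier. Plotting functions such as ggplot2's geom_boxplot() draw each whisker to the most extreme data point that lies within these limits, so the drawn whisker can stop short of the limit itself.

```r
# Hypothetical helper implementing the whisker rules described on the slide.
whisker_limits <- function(x) {
  q1  <- unname(quantile(x, 0.25))   # 25th percentile
  q3  <- unname(quantile(x, 0.75))   # 75th percentile
  iqr <- q3 - q1                     # range of the middle 50% of the data

  upper <- min(max(x), q3 + 1.5 * iqr)  # whichever is smaller
  lower <- max(min(x), q1 - 1.5 * iqr)  # whichever is larger

  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

# The twelve sorted arr_delay values from the Medians and Quartiles slides.
arr_delay <- c(2, 5, 7, 10, 11, 12, 12, 17, 20, 23, 25, 31)

limits       <- whisker_limits(arr_delay)          # no value beyond the limits
with_extreme <- whisker_limits(c(arr_delay, 90))   # made-up 90-minute delay

str(limits)
str(with_extreme)
```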
Insights with Box Plots
Box plots allow for side-by-side comparisons of distributions by categories.

Box plots are a go-to visualization for data exploration and analysis.

The following is a box plot of the distance feature by the origin airport...

Box plots allow for easily comparing distribution spreads and centers across categories.

What do you see in this plot? What insights can you glean about flight distances based on the origin airport?



Box Plots with ggplot2

[Annotated code listing; only the callouts are recoverable]
• Load tidyverse goodness
• Load flights dataset
• First layer is the data frame to use for the visualization
• The aes() function creates the aesthetic, defining how the data will be used and aspects of the final appearance
• The categories
• The distribution feature
• Minimal ink for the visualization
• Add a box plot geometry layer to the visualization
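
As with the histogram slide, the callouts here describe a ggplot2 call. The sketch below is consistent with them but is not the slide's actual code: it assumes the flights data comes from nycflights13, that the categories are the origin airports, that the distribution feature is distance, and that the "minimal ink" theme is theme_bw().

```r
# Load tidyverse goodness (ggplot2 is part of the tidyverse).
library(tidyverse)

# Load flights dataset. Assumption: it comes from the nycflights13 package.
library(nycflights13)

# First layer: the data frame, plus aes() mapping the categories (origin)
# to x and the distribution feature (distance) to y.
origin_box <- ggplot(flights, aes(x = origin, y = distance)) +
  theme_bw() +       # assumed "minimal ink" theme
  geom_boxplot()     # box plot geometry layer

print(origin_box)
```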
