7 Exploratory Data Analysis
7.1 Introduction
This chapter will show you how to use visualisation and transformation to
explore your data in a systematic way, a task that statisticians call
exploratory data analysis, or EDA for short. EDA is an iterative cycle: you
generate questions about your data, search for answers by visualising,
transforming, and modelling the data, and then use what you learn to refine
your questions or generate new ones.
EDA is not a formal process with a strict set of rules. More than anything,
EDA is a state of mind. During the initial phases of EDA you should feel free
to investigate every idea that occurs to you. Some of these ideas will pan
out, and some will be dead ends. As your exploration continues, you will
home in on a few particularly productive areas that you’ll eventually write
up and communicate to others.
EDA is an important part of any data analysis, even if the questions are
handed to you on a platter, because you always need to investigate the
quality of your data. Data cleaning is just one application of EDA: you ask
questions about whether your data meets your expectations or not. To do
data cleaning, you’ll need to deploy all the tools of EDA: visualisation,
transformation, and modelling.
7.1.1 Prerequisites
In this chapter we’ll combine what you’ve learned about dplyr and ggplot2
to interactively ask questions, answer them with data, and then ask new
questions.
library(tidyverse)
7.2 Questions
“There are no routine statistical questions, only questionable statistical
routines.” — Sir David Cox
“Far better an approximate answer to the right question, which is often
vague, than an exact answer to the wrong question, which can always be
made precise.” — John Tukey
Your goal during EDA is to develop an understanding of your data. The
easiest way to do this is to use questions as tools to guide your
investigation. When you ask a question, the question focuses your attention
on a specific part of your dataset and helps you decide which graphs,
models, or transformations to make.
There is no rule about which questions you should ask to guide your
research. However, two types of questions will always be useful for making
discoveries within your data. You can loosely word these questions as: What
type of variation occurs within my variables? What type of covariation
occurs between my variables?
The rest of this chapter will look at these two questions. I’ll explain what
variation and covariation are, and I’ll show you several ways to answer each
question. To make the discussion easier, let’s define some terms: a variable
is a quantity, quality, or property that you can measure; a value is the
state of a variable when you measure it; and an observation is a set of
measurements made under similar conditions.
So far, all of the data that you’ve seen has been tidy. In real life, most
data isn’t tidy, so we’ll come back to these ideas again in tidy data.
7.3 Variation
Variation is the tendency of the values of a variable to change from
measurement to measurement. You can see variation easily in real life; if
you measure any continuous variable twice, you will get two different
results. This is true even if you measure quantities that are constant, like the
speed of light. Each of your measurements will include a small amount of
error that varies from measurement to measurement. Categorical variables
can also vary if you measure across different subjects (e.g. the eye colors of
different people), or different times (e.g. the energy levels of an electron at
different moments). Every variable has its own pattern of variation, which
can reveal interesting information. The best way to understand that pattern
is to visualise the distribution of the variable’s values.
How you visualise the distribution of a variable will depend on whether the
variable is categorical or continuous. A variable is categorical if it can only
take one of a small set of values. In R, categorical variables are usually
saved as factors or character vectors. To examine the distribution of a
categorical variable, use a bar chart:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
The height of the bars displays how many observations occurred with each
x value. You can compute these values manually with dplyr::count():
diamonds %>%
  count(cut)
#> # A tibble: 5 × 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551
A variable is continuous if it can take any of an infinite set of ordered
values. Numbers and date-times are two examples of continuous variables.
To examine the distribution of a continuous variable, use a histogram:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
You can compute this by hand by
combining dplyr::count() and ggplot2::cut_width():
diamonds %>%
  count(cut_width(carat, 0.5))
#> # A tibble: 11 × 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows
A histogram divides the x-axis into equally spaced bins and then uses the
height of a bar to display the number of observations that fall in each bin.
In the graph above, the tallest bar shows that almost 30,000 observations
have a carat value between 0.25 and 0.75, which are the left and right
edges of the bar.
Now that you can visualise variation, what should you look for in your
plots? And what type of follow-up questions should you ask? I’ve put
together a list below of the most useful types of information that you will
find in your graphs, along with some follow-up questions for each type of
information. The key to asking good follow-up questions will be to rely on
your curiosity (What do you want to learn more about?) as well as your
skepticism (How could this be misleading?).
In both bar charts and histograms, tall bars show the common values of a
variable, and shorter bars show less-common values. Places that do not
have bars reveal values that were not seen in your data. To turn this
information into useful questions, look for anything unexpected: Which
values are the most common, and why? Which values are rare, and does that
match your expectations? Can you see any unusual patterns, and what might
explain them? Clusters of similar values suggest that subgroups exist in
your data. To understand the subgroups, ask:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each
other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
The histogram below shows the length (in minutes) of 272 eruptions of the
Old Faithful Geyser in Yellowstone National Park. Eruption times appear to
be clustered into two groups: there are short eruptions (of around 2
minutes) and long eruptions (4-5 minutes), but little in between.
Outliers are observations that are unusual; data points that don’t seem to fit
the pattern. Sometimes outliers are data entry errors; other times outliers
suggest important new science. When you have a lot of data, outliers are
sometimes difficult to see in a histogram. For example, take the distribution
of the y variable from the diamonds dataset. The only evidence of outliers is
the unusually wide limits on the x-axis.
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
There are so many observations in the common bins that the rare bins are
so short that you can’t see them (although maybe if you stare intently at 0
you’ll spot something). To make it easy to see the unusual values, we need
to zoom to small values of the y-axis with coord_cartesian():
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
(coord_cartesian() also has an xlim() argument for when you need to
zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that
work slightly differently: they throw away the data outside the limits.)
This allows us to see that there are three unusual values: 0, ~30, and ~60.
We pluck them out with dplyr:
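A sketch of that step with dplyr (the cutoffs y < 3 and y > 20 are judgment
calls read off the zoomed histogram, and unusual is just an illustrative
name):

```r
library(tidyverse)

# Keep only the suspicious rows: y is a length in mm, so values of 0
# or more than ~20 mm are implausible for a diamond
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
```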
It’s good practice to repeat your analysis with and without the outliers. If
they have minimal effect on the results, and you can’t figure out why
they’re there, it’s reasonable to replace them with missing values, and move
on. However, if they have a substantial effect on your results, you shouldn’t
drop them without justification. You’ll need to figure out what caused them
(e.g. a data entry error) and disclose that you removed them in your write-
up.
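One way to do that replacement, as a sketch: mutate() with ifelse(), using
cutoffs read off the plot (diamonds2 is an illustrative name):

```r
library(tidyverse)

# Replace implausible y values (a length in mm) with missing values
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```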
7.3.4 Exercises
7.4 Missing values
For example, to compare the scheduled departure times for cancelled and
non-cancelled flights, you can make a new variable with is.na():
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
However, this plot isn’t great because there are many more non-cancelled
flights than cancelled flights. In the next section we’ll explore some
techniques for improving this comparison.
7.4.1 Exercises
7.5 Covariation
If variation describes the behavior within a variable, covariation describes
the behavior between variables. Covariation is the tendency for the values
of two or more variables to vary together in a related way. The best way to
spot covariation is to visualise the relationship between two or more
variables. How you do that should again depend on the type of variables
involved.
7.5.1 A categorical and continuous variable
It’s hard to see the difference in distribution because the overall counts
differ so much:
ggplot(diamonds) +
  geom_bar(mapping = aes(x = cut))
To make the comparison easier we need to swap what is displayed on the
y-axis. Instead of displaying count, we’ll display density, which is the count
standardised so that the area under each frequency polygon is one.
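With geom_freqpoly(), that might look like the following sketch
(after_stat(density) is how current ggplot2 maps the computed density to
the y-axis):

```r
library(tidyverse)

# Frequency polygons of price by cut, standardised so that each
# curve has an area of one
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```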
We see much less information about the distribution, but the boxplots are
much more compact so we can more easily compare them (and fit more on
one plot). It supports the counterintuitive finding that better quality
diamonds are cheaper on average! In the exercises, you’ll be challenged to
figure out why.
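A sketch of the boxplot comparison just described, with one boxplot per
level of cut:

```r
library(tidyverse)

# Distribution of price within each level of cut
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```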
cut is an ordered factor: fair is worse than good, which is worse than very
good and so on. Many categorical variables don’t have such an intrinsic
order, so you might want to reorder them to make a more informative
display. One way to do that is with the reorder() function.
For example, take the class variable in the mpg dataset. You might be
interested to know how highway mileage varies across classes:
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
If you have long variable names, geom_boxplot() will work better if you flip
it 90°. You can do that with coord_flip().
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
7.5.1.1 Exercises
7.5.2 Two categorical variables
To visualise the covariation between categorical variables, you’ll need to
count the number of observations for each combination. One way to do that
is to rely on the built-in geom_count():
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
The size of each circle in the plot displays how many observations occurred
at each combination of values. Covariation will appear as a strong
correlation between specific x values and specific y values.
Another approach is to compute the count with dplyr:
diamonds %>%
  count(color, cut)
#> # A tibble: 35 × 3
#>   color cut           n
#>   <ord> <ord>     <int>
#> 1 D     Fair        163
#> 2 D     Good        662
#> 3 D     Very Good  1513
#> 4 D     Premium    1603
#> 5 D     Ideal      2834
#> 6 E     Fair        224
#> # … with 29 more rows
Then visualise with geom_tile() and the fill aesthetic:
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
If the categorical variables are unordered, you might want to use the
seriation package to simultaneously reorder the rows and columns in order
to more clearly reveal interesting patterns. For larger plots, you might want
to try the d3heatmap or heatmaply packages, which create interactive plots.
7.5.2.1 Exercises
1. How could you rescale the count dataset above to more clearly show
the distribution of cut within colour, or colour within cut?
2. Use geom_tile() together with dplyr to explore how average flight
delays vary by destination and month of year. What makes the plot
difficult to read? How could you improve it?
3. Why is it slightly better to use aes(x = color, y = cut) rather
than aes(x = cut, y = color) in the example above?
7.5.3 Two continuous variables
You’ve already seen one great way to visualise the covariation between two
continuous variables: draw a scatterplot with geom_point(). For example, you
can see an exponential relationship between the carat size and price of a
diamond:
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
Scatterplots become less useful as the size of your dataset grows, because
points begin to overplot, and pile up into areas of uniform black (as above).
You’ve already seen one way to fix the problem: using the alpha aesthetic
to add transparency.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
But using transparency can be challenging for very large datasets. Another
solution is to bin. Previously you
used geom_histogram() and geom_freqpoly() to bin in one dimension. Now
you’ll learn how to use geom_bin2d() and geom_hex() to bin in two
dimensions.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and
then use a fill color to display how many points fall into each
bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal
bins. You will need to install the hexbin package to use geom_hex().
# smaller is used without a definition above; in context it is the
# diamonds data with the very largest stones removed:
smaller <- diamonds %>% filter(carat < 3)

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))
Another option is to bin one continuous variable so it acts like a categorical
variable. Then you can use one of the techniques for visualising the
combination of a categorical and a continuous variable that you learned
about. For example, you could bin carat and then for each group, display a
boxplot:
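For example, a sketch using cut_width() (assuming smaller is the diamonds
data restricted to carat < 3, as in the binned plots above):

```r
library(tidyverse)

# smaller: the diamonds data without the few very large stones
smaller <- diamonds %>% filter(carat < 3)

# Bin carat into 0.1-carat-wide groups and draw one boxplot per bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```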
7.6 Patterns and models
A scatterplot of Old Faithful eruption lengths versus the wait time between
eruptions shows a pattern: longer wait times are associated with longer
eruptions. The scatterplot also displays the two clusters that we noticed
above.
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))
Patterns provide one of the most useful tools for data scientists because
they reveal covariation. If you think of variation as a phenomenon that
creates uncertainty, covariation is a phenomenon that reduces it. If two
variables covary, you can use the values of one variable to make better
predictions about the values of the second. If the covariation is due to a
causal relationship (a special case), then you can use the value of one
variable to control the value of the second.
Models are a tool for extracting patterns out of data. For example, consider
the diamonds data. It’s hard to understand the relationship between cut
and price, because cut and carat, and carat and price are tightly related. It’s
possible to use a model to remove the very strong relationship between
price and carat so we can explore the subtleties that remain. The following
code fits a model that predicts price from carat and then computes the
residuals (the difference between the predicted value and the actual value).
The residuals give us a view of the price of the diamond, once the effect of
carat has been removed.
library(modelr)
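A sketch of the model fit described above, assuming a log-log fit to
linearise the carat-price relationship (diamonds2 and resid match the names
the next plot uses):

```r
library(tidyverse)
library(modelr)

# Fit a model that predicts price from carat (on the log scale,
# where the relationship is roughly linear) ...
mod <- lm(log(price) ~ log(carat), data = diamonds)

# ... then compute the residuals: the price of each diamond once
# the effect of carat has been removed
diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))
```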
Once you’ve removed the strong relationship between carat and price, you
can see what you expect in the relationship between cut and price: relative
to their size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
You’ll learn how models, and the modelr package, work in the final part of
the book, model. We’re saving modelling for later because understanding
what models are and how they work is easiest once you have tools of data
wrangling and programming in hand.
7.7 ggplot2 calls
As we move on from these introductory chapters, we’ll transition to a more
concise expression of ggplot2 code, relying on you knowing the first two
arguments: data and mapping. For example, a frequency polygon of the
faithful eruption times:
ggplot(faithful, aes(eruptions)) +
  geom_freqpoly(binwidth = 0.25)
Sometimes we’ll turn the end of a pipeline of data transformation into a
plot. Watch for the transition from %>% to +. I wish this transition wasn’t
necessary but unfortunately ggplot2 was created before the pipe was
discovered.
diamonds %>%
  count(cut, clarity) %>%
  ggplot(aes(clarity, cut, fill = n)) +
    geom_tile()