Apostila Ggplot
A ggplot2 Primer
Ehssan Ghashim1 , Patrick Boily1,2,3,4
Abstract
R has become one of the world’s leading languages for statistical and data analysis. While the base R
installation does support simple visualizations, its plots are rarely of high-enough quality for publication.
Enter Hadley Wickham’s ggplot2, an aesthetically pleasing and logical approach to data visualization. In this short
report, we introduce ggplot2’s graphic grammar elements, and present a number of examples.
Keywords
R, ggplot2, data visualization
1 Centre for Quantitative Analysis and Decision Support, Carleton University, Ottawa
2 Sprott School of Business, Carleton University, Ottawa
3 Department of Mathematics and Statistics, University of Ottawa, Ottawa
4 Idlewyld Analytics and Consulting Services, Wakefield, Canada
Email: [email protected]
3. Basics of ggplot2 Grammar
Let’s look at some illustrative ggplot2 code:

library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# artificial data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8))
head(d)

## R output
##   x         y group1 group2
## 1 1 0.8683116      a      1
## 2 2 0.1934542      a      1
## 3 3 0.1131743      a      1
## 4 4 0.9260514      a      1
## 5 5 0.9476787      b      1
## 6 6 0.2949107      b      1

ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
  facet_grid(~group2)

A basic display call contains the following elements:
ggplot(): start an object and specify the data
geom_point(): we want a scatter plot; this is called a “geom”
aes(): specifies the “aesthetic” elements; a legend is automatically created
facet_grid(): specifies the “faceting” or panel layout

Other components include statistics, scales, and annotation options. At a bare minimum, charts require a dataset, some aesthetics, and a geom, combined, as above, with “+” symbols. This non-standard approach has the advantage of allowing ggplot2 plots to be proper R objects, which can be modified, inspected, and re-used.
ggplot2’s main plotting functions are qplot() and ggplot(). qplot() is short for “quick plot” and is meant to mimic the format of base R’s plot(); it requires less syntax for many common tasks, but has limitations. It is essentially a wrapper for ggplot(), which is not itself that complicated to use. We will focus on this latter function.
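Because a ggplot2 plot is an ordinary R object, it can be stored, extended, and inspected before anything is drawn. The following is a minimal sketch of that idea, assuming an artificial data frame with the same layout as d; the object names p_base and p_faceted are illustrative only and do not come from the original report.

```r
library(ggplot2)

# artificial data with the same layout as the data frame d used in this section
set.seed(123)
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
                group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
                group2 = gl(2, 8))

# store the plot in an object instead of printing it right away
p_base <- ggplot(data = d) + geom_point(aes(x, y, colour = group1))

# re-use the stored object: adding a facet specification gives a second plot
p_faceted <- p_base + facet_grid(~group2)

# inspect the objects
class(p_base)          # the class vector includes "ggplot"
length(p_base$layers)  # number of layers added so far (here, the points)

# nothing is drawn until the object is printed
print(p_faceted)
```

Keeping the base plot in its own object makes it easy to try several facet or theme variations without repeating the data and aesthetic specification.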
4. Specifying Plot Types with geoms
Whereas ggplot() specifies the data source and variables to be plotted, the various geom functions specify how these variables are to be visually represented (using points, bars, lines, and shaded regions). There are currently 37 available geoms. Table 1 lists the more common ones, along with frequently used options (most of the graphs shown in this report can be created using those geoms).
For example, the next bit of code produces a histogram of the heights of singers in the 1979 edition of the New York Choral Society (Figure 4), and a display of height by voice part for the same data (Figure 5).

library("ggplot2")
data(singer, package="lattice")
ggplot(singer, aes(x=height)) +
  geom_histogram()
ggplot(singer, aes(x=voice.part, y=height)) +
  geom_boxplot()

From Figure 5, it appears that basses tend to be taller and sopranos tend to be shorter. Although the singers’ gender was not recorded, it probably accounts for much of the variation seen in the diagram.
Note that only the x variable (height) was specified when creating the histogram, but that both the x (voice part) and the y (height) variables were specified for the box plot. Indeed, geom_histogram() defaults to counts on the y-axis when no y variable is specified (each function’s documentation contains details and additional examples, but there’s a lot of value to be found in playing around with data in order to determine their behaviour).
Let’s examine the use of some of these options using the Salaries dataset (from package “carData”; it was distributed with package “car” before version 3.0). The dataframe contains information on the salaries of university professors collected during the 2008–2009 academic year. Variables include rank (AsstProf, AssocProf, Prof), sex (Female, Male), yrs.since.phd (years since Ph.D.), yrs.service (years of service), and salary (nine-month salary in dollars). The next code produces the plot in Figure 3.

data(Salaries, package="carData")
library(ggplot2)
ggplot(Salaries, aes(x=rank, y=salary)) +
  geom_boxplot(fill="cornflowerblue", color="black", notch=TRUE) +
  geom_point(position="jitter", color="blue", alpha=.5) +
  geom_rug(sides="l", color="black") # rug on the left (vertical) axis only
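The default counting behaviour of geom_histogram() can be checked directly. The sketch below is an illustration rather than part of the original analysis: it sets the bin width explicitly (the choice of 1 inch is an assumption made here) and uses layer_data() to look at the counts that ggplot2 computes for each bin.

```r
library(ggplot2)
data(singer, package = "lattice")

# heights are given in inches; a bin width of 1 is an assumption made for this sketch
p_hist <- ggplot(singer, aes(x = height)) +
  geom_histogram(binwidth = 1, fill = "cornflowerblue", colour = "black")

# values computed by the count statistic, one row per bin
bins <- layer_data(p_hist)
head(bins[, c("x", "count")])

# every singer falls in exactly one bin, so the counts add up to the sample size
sum(bins$count) == nrow(singer)

print(p_hist)
```

Setting binwidth (or bins) explicitly also avoids the message that geom_histogram() prints when it falls back on its default number of bins.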
Figure 3. Notched box plots with superimposed points describing the salaries of college professors by rank. A rug plot is
provided on the vertical axis.
Figure 4. Histogram of singer heights Figure 5. Box plot of singer heights by voice part
Figure 6. Visualizations of the mpg dataset – with aes() on the left, without on the right.
Figure 7. Faceted graph showing the distribution (histogram) of singer heights by voice part
Figure 8. Scatterplot of years since graduation and salary. Academic rank is represented by color and shape, and sex is
faceted.
9. Tidy Data: Getting Data into the Right Format
ggplot2 is compatible with what is generally referred to as the tidyverse [22]. Social scientists will likely be familiar with the distinction between data in wide format and in long format:
  in a long format table, every column represents a variable, and every row an observation,
  whereas in a wide format table, some variables are spread out across columns, perhaps along some other characteristic such as the year, say.
The plots that have been produced so far were simple to create because the data points were given in the format of one observation per row, which we call a "tall" format. But many datasets come in a "wide" format, i.e. there is more than one observation (more than one point on the scatterplot) in each row.
Consider, for instance, the WorldPhones dataset, one of R’s built-in datasets:

data("WorldPhones")

This dataset records the number of telephones, in thousands, on each continent for several years in the 1950s (see Table 2).
Each column represents a different continent, and each row represents a different year. This wide format seems like a reasonable way to store data, but suppose that we want to compare increases in phone usage between continents, with time on the horizontal axis. In that case, each point on the plot is going to represent a continent during one year; there are seven observations in each row, which makes it very difficult to plot using ggplot2.
Fortunately, the tidyverse provides an easy way to convert this wide dataset into a tall dataset, by melting the data. This can be achieved by loading a third-party package called reshape2. The WorldPhones dataset can now be melted from a wide to a tall dataset with the melt() function. Let’s assign the new, melted data to an object called WorldPhones.m, where the m reminds us that the data has been melted.

library(reshape2)
WorldPhones.m = melt(WorldPhones)

The new, melted data looks like:

head(WorldPhones.m)

##   Var1   Var2 value
## 1 1951 N.Amer 45939
## 2 1956 N.Amer 60423
## 3 1957 N.Amer 64721
## ...

Notice that while there were originally seven columns, there are now only three: Var1, Var2, and value; Var1 represents the year, Var2 the continents, and value the number of phones. Every data cell in the original dataset (every observation, that is, every number of phones per year per continent) now has its own row in the melted dataset.
In 1951, in North America, for instance, there were 45,939,000 phones, which is the same value as in the original unmelted data: the data has not changed, it just got reshaped.
Changing the column names might make the data more intuitive to read:

colnames(WorldPhones.m) = c("Year", "Continent", "Phones")
head(WorldPhones.m)

##   Year Continent Phones
## 1 1951    N.Amer  45939
## 2 1956    N.Amer  60423
## 3 1957    N.Amer  64721
## ...

Now that the data has been melted into a tall dataset, it is easy to create a plot with ggplot2, with the usual steps of a ggplot() call, but with WorldPhones.m instead of WorldPhones:

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_point()

We place the Year on the x-axis, in order to see how the numbers change over time, while the number of Phones (the variable of interest) is displayed on the y-axis. The Continent factor will be represented with colour. A scatterplot is obtained by adding a geom_point() layer.
Scatterplots can also be used to show trends over time, by drawing lines between points for each continent. This only requires a change to a geom_line() layer.

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_line()

The result is shown in Figure 11. Incidentally, one might expect the number of phones to increase exponentially over time, rather than linearly (a fair number of observations are clustered at the bottom of the chart).
When that’s the case, it’s a good idea to plot the vertical axis on a log scale. This can be done by adding a logarithmic scale to the chart.

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_line() + scale_y_log10()
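One rough way to see why a log scale suits these data is to compare the continents by their growth ratios rather than by their absolute increases. The sketch below is an illustration added here (it is not part of the original analysis) and works directly on the built-in WorldPhones matrix; the object name growth is arbitrary.

```r
data("WorldPhones")  # built-in matrix: rows are years, columns are continents

# multiplicative growth between 1951 and 1961, per continent
growth <- WorldPhones["1961", ] / WorldPhones["1951", ]
round(growth, 1)

# on a log10 axis, equal ratios correspond to equal vertical distances,
# so small and large continents become comparable on the same chart
round(log10(WorldPhones["1961", ]) - log10(WorldPhones["1951", ]), 2)
```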
Now each of the phone trends looks linear, and the lower values are spotted more easily; for example, it is now clear that Africa has overtaken Central America by 1956 (see Figure 12).
Notice how easy it was to build this plot once the data was in the tall format: one row for every point on the graph, that is, for every combination of year and continent.

10. Saving Graphs
Plots might look great on the screen, but they typically have to be embedded in other documents (Markdown, LaTeX, Word, etc.). In order to do so, they must first be saved in an appropriate format, with a specific resolution and size. Default size settings can be saved within the .Rmd document by declaring them in the first chunk of code. For instance, this would tell knitr to produce 8 in. × 5 in. charts:

knitr::opts_chunk$set(fig.width=8, fig.height=5)

A convenience function named ggsave() can be particularly useful. Options include which plot to save, where to save it, and in what format. For example,

myplot <- ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(filename="mygraph.png", plot=myplot, width=5, height=4)

saves the myplot object as a 5-inch by 4-inch .png file named mygraph.png in the current working directory. The available formats include .ps, .tex, .jpeg, .pdf, .jpg, .tiff, .png, .bmp, .svg, or .wmf (the latter only being available on Windows machines).
Without the plot= option, the most recently created graph is saved. For instance, the following bit of code would also save the mtcars plot (the latest plot) to the current working directory (see the ggsave() help file for additional details):

ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(filename="mygraph.pdf")

Within RStudio, an alternative is to click on Export, then “Save Plot As Image” to open a GUI.

11. Summary
The first 10 sections reviewed the ggplot2 package, which provides advanced graphical methods based on a comprehensive grammar of graphics. The package is designed to provide the user with a complete and comprehensive alternative to the native graphics provided with R. It offers methods for creating attractive and meaningful data visualizations that are difficult to generate in other ways.
It does come with some drawbacks, however: the ggplot2 and tidyverse design teams have fairly strong opinions about how data should be visualized and processed. As a result, it can sometimes be difficult to produce charts that go against their design ideals. In the same vein, the various package updates do not always preserve the functionality of working code, sending analysts scurrying to figure out how the new functions work, which can cause problems with legacy code. Still, the versatility and overall simplicity of ggplot2 cannot be overstated.
A list of all ggplot2 functions, along with examples, can be found at http://docs.ggplot2.org. The theory underlying ggplot2 is explained in great detail in [2]; useful examples and starting points can also be found in [1, 5].
The ggplot2 action flow is always the same: start with data in a table, map the display variables to various aesthetics (position, colour, shape, etc.), and select one or more geoms to draw the graph. This is accomplished in the code by first creating an object with the basic data and mappings information, and then by adding or layering additional information as needed.
Once this general way of thinking about plots is understood (especially the aesthetic mapping part), the drawing process is simplified significantly. There is no need to think about how to draw particular shapes or colours in the chart; the many (self-explanatory) geom_ functions do all the heavy lifting.
Similarly, learning how to use new geoms is easier when they are viewed as ways to display specific aesthetic mappings.
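The aesthetic mapping step is the part of this flow that most often causes confusion, so a small sketch may help. It contrasts a colour that is mapped to a variable (inside aes()) with a colour that is set to a constant (outside aes()), using the mpg dataset that ships with ggplot2. The object names p_mapped and p_fixed are illustrative, and the choice of variables is an assumption made for this example.

```r
library(ggplot2)

# colour mapped to a variable: it goes inside aes(), and a legend is created
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class))

# colour set to a constant: it goes outside aes(), and no legend is needed
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")

# the computed layer data shows the difference
length(unique(layer_data(p_mapped)$colour))  # one colour per vehicle class
unique(layer_data(p_fixed)$colour)           # a single colour for all points

print(p_mapped)
print(p_fixed)
```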
Figure 11. WorldPhones.m plots, using geom_point() (above) and geom_line() (below).
email_campaign_funnel.csv")
# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")
# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) + # Fill column
  geom_bar(stat = "identity", width = .6) + # draw the bars
  scale_y_continuous(breaks = brks, # Breaks
                     labels = lbls) + # Labels
  coord_flip() + # Flip axes
  labs(title="Email Campaign Funnel") +
  theme_tufte() + # Tufte theme from ggthemes
  theme(plot.title = element_text(hjust = .5),
        axis.ticks = element_blank()) + # Centre plot title
  scale_fill_brewer(palette = "Dark2") # Colour palette

Example 9 (Calendar Heatmap)
The calendar heat map is a great tool to see the daily variation (especially the highs and lows) of a variable like stock price, as it emphasizes the variation over time rather than the actual value itself. It can (with a fair amount of data preparation) be produced with geom_tile.

# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
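To give a sense of the data preparation involved, here is a minimal sketch of a calendar heat map. It is not the code behind the example above: the daily series is synthetic, and the column names (week, wday, month) and the colour choices are assumptions made for illustration.

```r
library(ggplot2)

# synthetic daily series for three months (illustration only, not real prices)
set.seed(1)
days <- seq(as.Date("2021-01-01"), as.Date("2021-03-31"), by = "day")
cal <- data.frame(
  date  = days,
  value = 100 + cumsum(rnorm(length(days))),
  month = format(days, "%Y-%m"),                 # panel variable
  week  = as.integer(format(days, "%U")),        # week of the year, Sunday start
  wday  = factor(format(days, "%w"),             # 0 = Sunday, ..., 6 = Saturday
                 levels = as.character(6:0),
                 labels = c("Sat", "Fri", "Thu", "Wed", "Tue", "Mon", "Sun"))
)

# one tile per day: weeks on the x-axis, weekdays on the y-axis, value as fill
p_cal <- ggplot(cal, aes(x = week, y = wday, fill = value)) +
  geom_tile(colour = "white") +
  facet_wrap(~month, scales = "free_x") +
  scale_fill_gradient(low = "red", high = "green") +
  labs(title = "Calendar heat map (synthetic data)",
       x = "Week of year", y = NULL, fill = "Value")
print(p_cal)
```

Most of the work is in deriving the week and weekday columns; once each day has its own (week, weekday) cell, geom_tile() does the drawing.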
Figure 22. Population pyramid for the email campaign funnel dataset.
Figure 27. Graph of Mad Men characters who are linked by a romantic relationship.
Example 16 (Time Series Plot For a Monthly Time Series)
In order to select specific breaks on the x-axis, consider the functionality offered by scale_x_date() (plot shown in Figure 30).

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

economics_m <- economics[1:24, ]

# labels and breaks for X axis text
lbls <- paste0(month.abb[month(economics_m$date)], " ",
               lubridate::year(economics_m$date))
brks <- economics_m$date

# plot
ggplot(economics_m, aes(x=date)) +
  geom_line(aes(y=pce)) +
  labs(title="Monthly Time Series",
       subtitle="Personal consumption expenditures, in billions of dollars",
       caption="Source: Economics",
       y="pce") + # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) + # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
        panel.grid.minor = element_blank()) # turn off minor grid

Example 17 (Time Series Plot For a Yearly Time Series)
Here’s the same, but with a yearly breakdown (plot shown in Figure 31).

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

economics_y <- economics[1:90, ]

# labels and breaks for X axis text
brks <- economics_y$date[seq(1, length(economics_y$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(economics_y, aes(x=date)) +
  geom_line(aes(y=psavert)) +
  labs(title="Yearly Time Series",
       subtitle="Personal savings rate",
       caption="Source: Economics",
       y="psavert") + # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) + # change to yearly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
        panel.grid.minor = element_blank()) # turn off minor grid

Example 18 (Time Series Plot From Long Data Format)
In this example, we construct the plot from a long data format (i.e. the column names and respective values of all the columns are stacked in only 2 variables, variable and value, respectively). In the wide format, the data would take the appearance of the economics dataset.
Below, the geom_line objects are drawn using value, and aes(col) is set to variable. In this way, multiple coloured lines are plotted (one for each unique variable level) with a single call; scale_x_date() changes the x-axis breaks and labels, while the line colours are changed by scale_color_manual.

data(economics_long, package = "ggplot2")
# head(economics_long)
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics_long[economics_long$variable %in% c("psavert", "uempmed"), ]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_line(aes(y=value, col=variable)) +
  labs(title="Time Series of Returns Percentage",
       subtitle="Drawn from Long Data format",
       caption="Source: Economics",
       y="Returns %",
       color=NULL) + # title and caption
  scale_x_date(labels = lbls, breaks = brks) + # change to yearly ticks and labels
  scale_color_manual(labels = c("psavert", "uempmed"),
                     values = c("psavert"="#00ba38",
                                "uempmed"="#f8766d")) # line color
Figure 32. Time series from long data format for the economics dataset.

the contribution from individual components, the area has to be stacked on top of the previous component, rather than relative to the floor of the plot. All the bottom layers have to be added to the y value of a new area.
In the example below (and in Figure 33), the top layer is y=psavert+uempmed. However nice the plot might look, keep in mind that it can be difficult to interpret.

library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_area(aes(y=psavert+uempmed, fill="psavert")) +
  geom_area(aes(y=uempmed, fill="uempmed")) +
  labs(title="Area Chart of Returns Percentage",
       subtitle="From Wide Data format",
       caption="Source: Economics",
       y="Returns %") + # title and caption
  scale_x_date(labels = lbls, breaks = brks) + # change to yearly ticks and labels
  scale_fill_manual(name="",
                    values = c("psavert"="#00ba38",
                               "uempmed"="#f8766d")) + # fill colours
  theme(panel.grid.minor = element_blank()) # turn off minor grid
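Adding the lower layers by hand works for two series but becomes tedious with more. A possible alternative, sketched here under the assumption that the data is first put in long format, is to let geom_area() do the stacking, since its default position is "stack". The objects wide, long, and p_area are names introduced for this illustration only.

```r
library(ggplot2)

# same variables and years as in the area chart example
wide <- as.data.frame(economics[, c("date", "psavert", "uempmed")])
wide <- wide[format(wide$date, "%Y") %in% as.character(1967:1981), ]

# long format built with base R: one row per (date, variable) pair
long <- data.frame(
  date     = rep(wide$date, times = 2),
  variable = rep(c("psavert", "uempmed"), each = nrow(wide)),
  value    = c(wide$psavert, wide$uempmed)
)

# a single geom_area() call; the groups are stacked by default
p_area <- ggplot(long, aes(x = date, y = value, fill = variable)) +
  geom_area() +
  scale_fill_manual(name = "",
                    values = c("psavert" = "#00ba38", "uempmed" = "#f8766d")) +
  labs(title = "Stacked area chart from long data", y = "psavert + uempmed")
print(p_area)
```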
Figure 33. Stacked area chart for the economics dataset (green area is the sum of psavert and uempmed).
Example 21 (Parallel Coordinate Plots)
Parallel coordinate plots are useful to visualize multivariate data. As a practical example, assume that a survey has been conducted, with a variety of questions. Each question is asked three times, each time in a different context, and is answered on a discrete scale from 1 to 7. Consequently, each question has three “dimensions”. The distribution of answers across the three dimensions should be displayed for each question. Because the three dimensions have the same unit and scale, they can easily be compared on parallel coordinates (it would be possible to display more than three dimensions, of course).

library(triangle)
set.seed(0)
q1_d1 <- round(rtriangle(1000, 1, 7, 5))
q1_d2 <- round(rtriangle(1000, 1, 7, 6))
q1_d3 <- round(rtriangle(1000, 1, 7, 2))
df <- data.frame(q1_d1 = factor(q1_d1),
                 q1_d2 = factor(q1_d2),
                 q1_d3 = factor(q1_d3))

We are using the triangular distribution to get random integers r ∈ [1, 7], around a different mode c for each dimension (5, 6 and 2). To plot the main “answer paths” (i.e. the most frequent answer combinations across the three dimensions), we need to group by all dimensions, and then to count the frequency of each unique answer combination. This can be done with the dplyr package.

library(dplyr)

library(reshape2)
library(ggplot2)
# create long format
df_pcp <- melt(df_grouped, id.vars = c('id', 'freq'))
df_pcp$value <- factor(df_pcp$value)

We can then specify what levels should be drawn on the y-axis (1 to 7). In the ggplot() function we define an aesthetic that uses the “variable” column for the x-axis and the “value” column for the y-axis. We also specify that the values should be grouped by using the id column. This is required, as the connections between the three dimensions won’t be drawn otherwise. We use geom_path() to draw the connection lines and make the width and colour of the connection dependent on the freq and id columns, respectively,

y_levels <- levels(factor(1:7))
ggplot(df_pcp, aes(x = variable, y = value, group = id)) + # group = id is important!
  geom_path(aes(size = freq, color = id),
            alpha = 0.5,
            lineend = 'round', linejoin = 'round') +
  scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
  scale_size(breaks = NULL, range = c(1, 7))
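The construction of df_grouped is not shown in this section. The sketch below is one plausible way to carry out the grouping and counting step with dplyr; it is an assumption, not the authors' code. The answers are drawn with sample() as a stand-in for the rtriangle() calls, the column names id and freq are chosen to match the melt() call, and keeping only the ten most frequent combinations is an arbitrary choice made here.

```r
library(dplyr)

set.seed(0)
# stand-in data: integer answers on a 1-7 scale for three dimensions
# (the example in the text draws these with rtriangle() instead)
df <- data.frame(q1_d1 = sample(1:7, 1000, replace = TRUE),
                 q1_d2 = sample(1:7, 1000, replace = TRUE),
                 q1_d3 = sample(1:7, 1000, replace = TRUE))

# count how often each answer combination occurs
df_counts <- df %>%
  count(q1_d1, q1_d2, q1_d3, name = "freq")

# label each combination and keep the most frequent "answer paths"
df_grouped <- df_counts %>%
  mutate(id = factor(paste(q1_d1, q1_d2, q1_d3, sep = "-"))) %>%
  arrange(desc(freq)) %>%
  head(10)  # number of paths kept is an assumption

df_grouped
```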
Figure 34. Seasonal plots for the AirPassengers dataset (above) and the nottem dataset (below).
ggsave(file="clusters.png",
plot=clustering, width=5, height=4)
Figure 36. Clusters in the iris dataset, projected on the first 2 principal components.
Figure 38. Slope chart for the cancer survival rates dataset.
Figure 40. Density plot for the mpg dataset; simultaneous (top), faceted (bottom).
Figure 44. Word cloud for The Green Mile (English subtitles).
Figure 45. Bar chart for The Green Mile (English subtitles).
References
[1] Chang, W. [2013], R Graphics Cookbook, O’Reilly.
[2] Wickham, H. [2009], ggplot2: Elegant Graphics for Data Analysis, Springer.
[3] Wickham, H. [2009], A Layered Grammar of Graphics, Journal of Computational and Graphical Statistics 19:3–28.
[4] Horton, N.J., Kleinman, K. [2016], Using R and RStudio for Data Management, Statistical Analysis, and Graphics, 2nd ed., CRC Press.
[5] Healy, K. [2018], Data Visualization: A Practical Introduction.
[6] Kabacoff, R.I. [2011], R in Action, Second Edition: Data analysis and graphics with R, Live.
[7] Maindonald, J.H. [2008], Using R for Data Analysis and Graphics Introduction, Code and Commentary.
[8] Tyner, S., Briatte, F., Hofmann, H. [2017], Network Visualization with ggplot2, The R Journal, vol. 9(1).
[9] Broman, K. [2016], Data Visualization with ggplot2.
[10] ggplot2 Extensions.
[11] R Graph Gallery.
[12] Anderson, S.C. [2015], An Introduction to ggplot2.
[13] Prabhakaran, S., Top-50 ggplot2 Visualization (with Master List R Code).
[14] Konrad, M. [2016], Parallel Coordinate Plots for Discrete and Categorical Data in R: A Comparison.
[15] Wilkins, D., treemapify github repository.
[16] Wilkins, D., treemapify R package (v. 0.2.1).
[17] STHDA, Beautiful Dendrogram Visualizations in R: 5+ must-know methods - Unsupervised Machine Learning.
[18] Text Mining and Word Cloud Fundamentals in R: 5 simple steps you should know, on Easy Guides.
[19] Harvard tutorial notes, R graphics with ggplot2 workshop.
[20] Robinson, D., Visualizing Data Using ggplot2, on varianceexplained.org.
[21] Manipulating, analyzing and exporting data with tidyverse, on datacarpentry.org.
[22] Wickham, H. [2014], Tidy Data, Journal of Statistical Software, v59, n10.