
DATA SCIENCE REPORT SERIES

A ggplot2 Primer
Ehssan Ghashim¹, Patrick Boily¹,²,³,⁴

Abstract
R has become one of the world’s leading languages for statistical and data analysis. While the base R
installation does support simple visualizations, its plots are rarely of high-enough quality for publication.
Enter Hadley Wickham’s ggplot2, an aesthetically pleasing and logical approach to data visualization. In this short
report, we introduce ggplot2’s graphic grammar elements, and present a number of examples.
Keywords
R, ggplot2, data visualization
¹ Centre for Quantitative Analysis and Decision Support, Carleton University, Ottawa
² Sprott School of Business, Carleton University, Ottawa
³ Department of Mathematics and Statistics, University of Ottawa, Ottawa
⁴ Idlewyld Analytics and Consulting Services, Wakefield, Canada
Email: [email protected]

Contents

1 Introduction
2 How ggplot2 Works
3 Basics of ggplot2 Grammar
4 Specifying Plot Types with geoms
5 Aesthetics
6 Facets
7 Multiple Graphs per Page
8 Themes
9 Tidy Data: Getting Data into the Right Format
10 Saving Graphs
11 Summary
12 Examples

1. Introduction
There are currently four graphical systems available in R.

1. The base graphics system, written by Ross Ihaka, is included in every R installation. Most of the graphs produced in the ‘Basics of R’ report rely on base graphics functions.

2. The grid graphics system, written by Paul Murrell in 2011, is implemented through the grid package, which offers a lower-level alternative to the standard graphics system. The user can create arbitrary rectangular regions on graphics devices, define coordinate systems for each region, and use a rich set of drawing primitives to control the arrangement and appearance of graphic elements. This flexibility makes grid a valuable tool for software developers. But the grid package doesn’t provide functions for producing statistical graphics or complete plots. As a result, it is rarely used directly by data analysts and won’t be discussed further (see Dr. Murrell’s Grid website at http://mng.bz/C86p).

3. The lattice package, written by Deepayan Sarkar in 2008, implements trellis graphs, as outlined by Cleveland (1985, 1993). Basically, trellis graphs display the distribution of a variable or the relationship between variables, separately for each level of one or more other variables. Built using the grid package, the lattice package has grown beyond Cleveland’s original approach to visualizing multivariate data and now provides a comprehensive alternative system for creating statistical graphics in R.

4. Finally, the ggplot2 package, written by Hadley Wickham [2], provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham [3]. The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led to ggplot2 becoming one of the most common R data visualization tools.

Access to the four systems differs: they are all included in the base installation, except for ggplot2, and they must all be explicitly loaded, except for the base graphics system.

2. How ggplot2 Works


As we saw in Basics of R for Data Analysis, visualization
involves representing data using various elements, such as
lines, shapes, colours, etc. There is a structured relationship
(some mapping) between the variables in the data
and their representation in the displayed plot. We also saw
that not all mappings make sense for all types of variables,
and (independently), that some representations are harder
to interpret than others.
ggplot2 provides a set of tools to map data to visual
display elements and to specify the desired type of plot,
and subsequently to control the fine details of how it will
be displayed. Figure 1 shows a schematic outline of the
process starting from data, at the top, down to a finished
plot at the bottom.
The most important aspect of ggplot2 is the way it can
be used to think about the logical structure of the plot.
The code allows the user to explicitly state the connections
between the variables and the plot elements that are seen
on the screen – items such as points, colors, and shapes.
In ggplot2, these logical connections between the data
and the plot elements are called aesthetic mappings, or
simply aesthetics.

After installing and loading the package, a plot is created by telling the ggplot() function what the data is, and how the variables in this data logically map onto the plot’s aesthetics.

The next step is to specify what sort of plot is desired (scatterplot, boxplot, bar chart, etc.), also known as a geom. Each geom is created by a specific function:

geom_point() for scatterplots
geom_bar() for barplots
geom_boxplot() for boxplots,
and so on.

These two components are combined, literally adding them together in an expression, using the “+” symbol.

At this point, ggplot2 has enough information to draw a plot; the other components (see Figure 1) provide additional design elements.
If no further details are specified, ggplot2 uses a set of
sensible default parameters; usually, however, the user will
want to be more specific about, say, the scales, the labels of
legends and axes, and other guides that can improve the
plot readability.
These additional pieces are added to the plot in the
same manner as the geom_function() component,
with specific arguments, again using the “+” symbol. Plots
are built systematically in this manner, piece by piece.

Figure 1. ggplot2’s graphics grammar [5].
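The piece-by-piece construction described in this section can be sketched as follows. This is only an illustration, assuming the mpg dataset that ships with ggplot2; the axis breaks and label text are arbitrary choices and do not come from the report.

```r
library(ggplot2)

# Step 1: the data and the aesthetic mappings (no geom yet, so no points).
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: add a geom to state what kind of plot is wanted.
p <- p + geom_point()

# Step 3: optional pieces (scales, labels), each added with "+".
p <- p +
  scale_x_continuous(breaks = 2:7) +
  labs(x = "Engine displacement (litres)",
       y = "Highway mileage (mpg)",
       title = "Built piece by piece")

print(p)
```

Each intermediate value of p is already a complete plot specification, so any step can be printed on its own to see what the latest piece contributed.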

E.Gashim, P.Boily, 2018 Page 2 of 59



Figure 2. Artificial data - visualization.

3. Basics of ggplot2 Grammar

Let’s look at some illustrative ggplot2 code:

library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# artificial data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8))
head(d)
## R output
##   x         y group1 group2
## 1 1 0.8683116      a      1
## 2 2 0.1934542      a      1
## 3 3 0.1131743      a      1
## 4 4 0.9260514      a      1
## 5 5 0.9476787      b      1
## 6 6 0.2949107      b      1
ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
  facet_grid(~group2)

The data is plotted in Figure 2.

A basic display call contains the following elements:

ggplot(): start an object and specify the data
geom_point(): we want a scatter plot; this is called a “geom”
aes(): specifies the “aesthetic” elements; a legend is automatically created
facet_grid(): specifies the “faceting” or panel layout

Other components include statistics, scales, and annotation options. At a bare minimum, charts require a dataset, some aesthetics, and a geom, combined, as above, with “+” symbols! This non-standard approach has the advantage of allowing ggplot2 plots to be proper R objects, which can be modified, inspected, and re-used.

ggplot2’s main plotting functions are qplot() and ggplot(); qplot() is short for “quick plot” and is meant to mimic the format of base R’s plot(); it requires less syntax for many common tasks, but has limitations: it is essentially a wrapper for ggplot(), which is not itself that complicated to use.

We will focus on this latter function.
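The remark that ggplot2 plots are proper R objects can be illustrated with a short sketch. It reuses the artificial data frame from this section; the seed and the object names (base_plot, faceted, lined) are hypothetical and only serve the illustration.

```r
library(ggplot2)

set.seed(42)  # hypothetical seed, only to make the sketch reproducible
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
                group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
                group2 = gl(2, 8))

# Store the plot instead of printing it right away.
base_plot <- ggplot(d, aes(x, y, colour = group1)) + geom_point()

# Inspect: the stored value is an ordinary R object.
class(base_plot)

# Re-use: derive two different plots from the same stored object.
faceted <- base_plot + facet_grid(~group2)
lined   <- base_plot + geom_line()

print(faceted)
print(lined)
```

Adding a component never changes base_plot itself; each "+" returns a new object, which is what makes this kind of re-use safe.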


Function            Adds                Options
geom_bar()          Bar chart           color, fill, alpha
geom_boxplot()      Box plot            color, fill, alpha, notch, width
geom_density()      Density plot        color, fill, alpha, linetype
geom_histogram()    Histogram           color, fill, alpha, linetype, binwidth
geom_hline()        Horizontal lines    color, alpha, linetype, size
geom_jitter()       Jittered points     color, size, alpha, shape
geom_line()         Line graph          color, alpha, linetype, size
geom_point()        Scatterplot         color, alpha, shape, size
geom_rug()          Rug plot            color, sides
geom_smooth()       Fitted line         method, formula, color, fill, linetype, size
geom_text()         Text annotations    Many; see the help for this function
geom_violin()       Violin plot         color, fill, alpha, linetype
geom_vline()        Vertical lines      color, alpha, linetype, size

Option      Specifies
color       colour of points, lines, and borders around filled regions
fill        colour of filled areas such as bars and density regions
alpha       transparency of colors, ranging from 0 (fully transparent) to 1 (opaque)
linetype    pattern for lines (1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash)
size        point size and line width
shape       point shapes (same as pch, with 0 = open square, 1 = open circle, 2 = open triangle, and so on)
position    position of plotted objects such as bars and points. For bars, “dodge” places grouped bar charts
            side by side, “stack” vertically stacks grouped bar charts, and “fill” vertically stacks grouped
            bar charts and standardizes their heights to be equal; for points, “jitter” reduces point overlap
binwidth    bin width for histograms
notch       indicates whether box plots should be notched (TRUE/FALSE)
sides       placement of rug plots on the graph (“b” = bottom, “l” = left, “t” = top, “r” = right, “bl” = both
            bottom and left, and so on)
width       width of box plots

Table 1. Commonly-used geom functions (top); common options for the various geom functions (bottom).
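The position option listed in Table 1 is easiest to understand by comparing its values side by side. The following is a minimal sketch, assuming the mpg dataset bundled with ggplot2; the choice of variables (class, drv, cty, hwy) is arbitrary.

```r
library(ggplot2)

# Shared data and mappings: one bar per vehicle class, split by drive train.
bars <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- bars + geom_bar(position = "stack")  # segments stacked vertically
p_dodge <- bars + geom_bar(position = "dodge")  # segments placed side by side
p_fill  <- bars + geom_bar(position = "fill")   # stacked, rescaled to height 1

# For points, "jitter" adds a little random noise to reduce overlap.
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.5)

print(p_stack)
print(p_dodge)
print(p_fill)
print(p_jitter)
```

With position = "fill" the vertical axis shows proportions rather than counts, which is useful for comparing the composition of groups of very different sizes.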

4. Specifying Plot Types with geoms

Whereas ggplot() specifies the data source and variables to be plotted, the various geom functions specify how these variables are to be visually represented (using points, bars, lines, and shaded regions). There are currently 37 available geoms. Table 1 lists the more common ones, along with frequently used options (most of the graphs shown in this report can be created using those geoms).

For example, the next bit of code produces a histogram of the heights of singers in the 1979 edition of the New York Choral Society (Figure 4), and a display of height by voice part for the same data (Figure 5).

library("ggplot2")
data(singer, package="lattice")
ggplot(singer, aes(x=height)) + geom_histogram()
ggplot(singer, aes(x=voice.part, y=height)) + geom_boxplot()

From Figure 5, it appears that basses tend to be taller and sopranos tend to be shorter. Although the singers’ gender was not recorded, it probably accounts for much of the variation seen in the diagram.

Note that only the x variable (height) was specified when creating the histogram, but that both the x (voice part) and the y (height) variables were specified for the box plot. Indeed, geom_histogram() defaults to counts on the y-axis when no y variable is specified (each function’s documentation contains details and additional examples, but there’s a lot of value to be found in playing around with data in order to determine their behaviour).

Let’s examine the use of some of these options using the Salaries dataset (originally from package “car”; recent versions of that package ship their datasets in “carData”). The dataframe contains information on the salaries of university professors collected during the 2008–2009 academic year. Variables include rank (AsstProf, AssocProf, Prof), sex (Female, Male), yrs.since.phd (years since Ph.D.), yrs.service (years of service), and salary (nine-month salary in dollars). The next code produces the plot in Figure 3.

data(Salaries, package="carData") # package="car" in older versions of car
library(ggplot2)
ggplot(Salaries, aes(x=rank, y=salary)) +
  geom_boxplot(fill="cornflowerblue", color="black", notch=TRUE) +
  geom_point(position="jitter", color="blue", alpha=.5) +
  geom_rug(sides="l", color="black") # the argument is "sides", not "side"


Figure 3 displays notched box plots of salary by academic rank. The actual observations (teachers) are overlaid and given some transparency so they don’t obscure the box plots. They’re also jittered to reduce their overlap. Finally, a rug plot is provided on the left to indicate the general spread of salaries. From Figure 3, we see that the salaries of assistant, associate, and full professors differ significantly from each other (there is no overlap in the box plot notches).

Additionally, the variance in salaries increases with greater rank, with a larger range of salaries for full professors. In fact, at least one full professor earns less than all assistant professors. There are also three full professors whose salaries are so large as to make them outliers (as indicated by the black dots in the Prof box plot).

5. Aesthetics

Aesthetics refer to the displayed attributes of the data. They map the data to an attribute (such as the size or shape of a marker) and generate an appropriate legend. Aesthetics are specified with the aes() function.

The aesthetics available for geom_point(), as an example, are:

x
y
alpha
color
fill
shape
size

Note that ggplot() tries to accommodate the user who’s never “suffered” through base graphics before by using intuitive arguments like color, size, and linetype, but ggplot() also accepts arguments such as col, cex, and lty. The documentation goes some way towards explaining which aesthetic options exist for each geom (they’re generally self-explanatory).

Aesthetics can be specified within the ggplot() call or within a geom. If they’re specified within the ggplot() call, then they apply to all specified geoms.

Note the important difference between specifying characteristics like colour and shape inside or outside the aes() function: those inside it are assigned colour or shape automatically based on the data. If characteristics like colour or shape are defined outside the aes() function, then they will not be mapped to data.

Here’s an example, using the mpg dataset:

ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = class))
ggplot(mpg, aes(cty, hwy)) +
  geom_point(colour = "red")

The outputs are shown in Figure 6.

6. Facets

In ggplot2 parlance, small multiples are referred to as facets. There are two kinds:

facet_wrap()
facet_grid()

The former plots the panels in the order of the factor levels; when it gets to the end of a row it wraps to the next row (the number of rows and columns can be specified with nrow and ncol). The grid layout facet_grid() produces a grid with explicit x and y positions.

By default, the panels all share the same x and y axes. Note, however, that the various y-axes are allowed to vary via

facet_wrap(scales = "free_y"),

and that all axes are allowed to vary via

facet_wrap(scales = "free").

To specify the data frame columns that are mapped to the rows and columns of the facets, separate them with a tilde. Usually, only a row or a column is fed to facet_wrap(). What happens if both are fed to that component?

Going back to the choral example, a faceted graph can be produced using the following code:

data(singer, package="lattice")
library(ggplot2)
ggplot(data=singer, aes(x=height)) +
  geom_histogram() +
  facet_wrap(~voice.part, nrow=4)

The resulting plot (Figure 7) displays the distribution of singer heights by voice part. Separating the height distributions into their own small, side-by-side plots makes them easier to compare.

As a second example, let’s create a graph that has faceting and grouping:

library(ggplot2)
ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=rank)) +
  geom_point() +
  facet_grid(.~sex)

The resulting graph is presented in Figure 8. It contains the same information, but separating the plot into facets makes it somewhat easier to read.
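The faceting options discussed in this part of the report (wrapping, free scales, and feeding both a row and a column variable) can be compared with a small sketch. It assumes the mpg dataset bundled with ggplot2 rather than the report's own data, and the variable choices are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Panels laid out in factor-level order, wrapping after 4 columns.
p_wrap <- p + facet_wrap(~class, ncol = 4)

# Same panels, but each one gets its own y-axis range.
p_free <- p + facet_wrap(~class, scales = "free_y")

# Explicit grid: rows are drive trains, columns are cylinder counts.
# Combinations absent from the data still get an (empty) panel.
p_grid <- p + facet_grid(drv ~ cyl)

# Two variables fed to facet_wrap(): one panel per combination that
# occurs in the data, wrapped into a ribbon instead of a strict grid.
p_wrap2 <- p + facet_wrap(drv ~ cyl)

print(p_wrap)
print(p_free)
print(p_grid)
print(p_wrap2)
```

Comparing p_grid and p_wrap2 answers the question about feeding both variables to facet_wrap(): the panels are the same combinations, but the row and column structure of the grid is lost.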


Figure 3. Notched box plots with superimposed points describing the salaries of college professors by rank. A rug plot is
provided on the vertical axis.

Figure 4. Histogram of singer heights

Figure 5. Box plot of singer heights by voice part


Figure 6. Visualizations of the mpg dataset – with aes() on the left, without on the right.

7. Multiple Graphs per Page

In basic R, the graphic parameter mfrow and the base function layout() are used to combine two or more base graphs into a single plot. This approach will not work with plots created with the ggplot2 package, however. The easiest way to place multiple ggplot2 graphs in a single figure is to use the grid.arrange() function found in the gridExtra package.

This code places three ggplot2 charts based on the Salaries dataset onto a single graph.

data(Salaries, package="carData") # package="car" in older versions of car
library(ggplot2)
p1 <- ggplot(data=Salaries, aes(x=rank)) + geom_bar()
p2 <- ggplot(data=Salaries, aes(x=sex)) + geom_bar()
p3 <- ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) +
  geom_point()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=3)

The resulting graph is shown in Figure 9. Each graph is saved as an object and then arranged into a single plot via grid.arrange(). Note the difference between faceting and multiple graphs: faceting creates an array of plots based on one or more categorical variables, but the components of a multiple graph could be completely independent plots arranged into a single display.

8. Themes

Themes allow the user to control the overall appearance of ggplot2 charts; theme() options are used to change fonts, backgrounds, colours, gridlines, and more. Themes can be used once or saved and applied to multiple charts. See below for an example.

data(Salaries, package="carData") # package="car" in older versions of car
library(ggplot2)
mytheme <- theme(
  plot.title=element_text(face="bold.italic", size=14, color="brown"),
  axis.title=element_text(face="bold.italic", size=10, color="brown"),
  axis.text=element_text(face="bold", size=9, color="darkblue"),
  panel.background=element_rect(fill="white", color="darkblue"),
  panel.grid.major.y=element_line(color="grey", linetype=1),
  panel.grid.minor.y=element_line(color="grey", linetype=2),
  panel.grid.minor.x=element_blank(),
  legend.position="top")

ggplot(Salaries, aes(x=rank, y=salary, fill=sex)) +
  geom_boxplot() +
  labs(title="Salary by Rank and Sex", x="Rank", y="Salary") +
  mytheme

Adding “+ mytheme” to the plotting statement generates the graph shown in Figure 10; mytheme specifies that plot titles are printed in brown 14-point bold italics; axis titles in brown 10-point bold italics; axis labels in dark blue 9-point bold; the plot area should have a white fill and dark blue borders; major horizontal grids should be solid grey lines; minor horizontal grids should be dashed grey lines; minor vertical grids should be suppressed; and the legend should appear at the top of the graph. The theme() function gives you great control over the look of the finished product (consult help(theme) to learn more about these options).
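The statement that a theme can be saved and applied to multiple charts can be sketched as follows. This is an illustration only, assuming the mpg dataset bundled with ggplot2; the theme settings and the name report_theme are hypothetical.

```r
library(ggplot2)

# A saved theme: start from a complete theme and override a few elements.
report_theme <- theme_bw() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "top")

# Applied explicitly to two unrelated charts.
p1 <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(title = "Vehicles per class") +
  report_theme

p2 <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(title = "Displacement vs highway mileage") +
  report_theme

# Alternatively, make it the session default; theme_set() returns the
# previous default, which can be restored afterwards.
old_theme <- theme_set(report_theme)
p3 <- ggplot(mpg, aes(x = cty)) + geom_histogram(bins = 20)
print(p3)
theme_set(old_theme)

print(p1)
print(p2)
```

Adding the theme explicitly keeps each plot self-describing, while theme_set() is convenient when every chart in a report should share the same look.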


Figure 7. Faceted graph showing the distribution (histogram) of singer heights by voice part


Figure 8. Scatterplot of years since graduation and salary. Academic rank is represented by color and shape, and sex is
faceted.


Figure 9. Placing three ggplot2 plots in a single graph


Figure 10. Box plots with a customized theme


9. Tidy Data: Getting Data into the Right Format

ggplot2 is compatible with what is generally referred to as the tidyverse [22]. Social scientists will likely be familiar with the distinction between data in wide format and in long format:

in a long format table, every column represents a variable, and every row an observation,
whereas in a wide format table, some variables are spread out across columns, perhaps along some other characteristic such as the year, say.

The plots that have been produced so far were simple to create because the data points were given in the format of one observation per row, which we call a "tall" format. But many datasets come in a "wide" format, i.e. there is more than one observation (more than one point on the scatterplot) in each row.

Consider, for instance, the WorldPhones dataset, one of R’s built-in datasets:

data("WorldPhones")

This dataset records the number of telephones, in thousands, on each continent for several years in the 1950s (see Table 2).

Each column represents a different continent, and each row represents a different year. This wide format seems like a reasonable way to store data, but suppose that we want to compare increases in phone usage between continents, with time on the horizontal axis. In that case, each point on the plot is going to represent a continent during one year: there are seven observations in each row, which makes it very difficult to plot using ggplot2.

Fortunately, the tidyverse provides an easy way to convert this wide dataset into a tall dataset, by melting the data. This can be achieved by loading a third-party package called reshape2. The WorldPhones dataset can now be melted from a wide to a tall dataset with the melt() function. Let’s assign the new, melted data to an object called WorldPhones.m, where the m reminds us that the data has been melted.

library(reshape2)
WorldPhones.m = melt(WorldPhones)

The new, melted data looks like:

head(WorldPhones.m)

##   Var1   Var2 value
## 1 1951 N.Amer 45939
## 2 1956 N.Amer 60423
## 3 1957 N.Amer 64721
## ...

Notice that while there were originally seven columns, there are now only three: Var1, Var2, and value; Var1 represents the year, Var2 the continents, and value the number of phones. Every data cell (every observation, that is, every number of phones per year per continent) in the original dataset now has its own row in the melted dataset.

In 1951, in North America, for instance, there were 45,939,000 phones, which is the same value as in the original unmelted data: the data has not changed, it just got reshaped.

Changing the column names might make the data more intuitive to read:

colnames(WorldPhones.m) = c("Year", "Continent", "Phones")
head(WorldPhones.m)

##   Year Continent Phones
## 1 1951    N.Amer  45939
## 2 1956    N.Amer  60423
## 3 1957    N.Amer  64721
## ...

Now that the data has been melted into a tall dataset, it is easy to create a plot with ggplot2, with the usual steps of a ggplot() call, but with WorldPhones.m instead of WorldPhones:

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_point()

We place the Year on the x-axis, in order to see how the numbers change over time, while the number of Phones (the variable of interest) is displayed on the y-axis. The Continent factor will be represented with colour. A scatterplot is obtained by adding a geom_point() layer.

Scatterplots can also be used to show trends over time, by drawing lines between points for each continent. This only requires a change to a geom_line() layer.

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_line()

The result is shown in Figure 11. Incidentally, one might expect the number of phones to increase exponentially over time, rather than linearly (a fair number of observations are clustered at the bottom of the chart).

When that’s the case, it’s a good idea to plot the vertical axis on a log scale. This can be done by adding a logarithmic scale to the chart.

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) +
  geom_line() + scale_y_log10()
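As an aside, the wide-to-tall reshaping used for WorldPhones does not strictly require reshape2. The sketch below shows one base R alternative, going through as.table(); it is not the report's method, and the object name phones_long is hypothetical.

```r
library(ggplot2)

data("WorldPhones")  # a numeric matrix: rows are years, columns are continents

# as.table() + as.data.frame() yields one row per (year, continent) cell,
# in columns named Var1, Var2 and Freq.
phones_long <- as.data.frame(as.table(WorldPhones), stringsAsFactors = FALSE)
colnames(phones_long) <- c("Year", "Continent", "Phones")

# The years come from the row names, so they start out as character strings.
phones_long$Year <- as.numeric(phones_long$Year)

head(phones_long, 3)

p <- ggplot(phones_long, aes(x = Year, y = Phones, color = Continent)) +
  geom_line() +
  scale_y_log10()
print(p)
```

The resulting data frame has the same one-row-per-observation layout as the melted data, so the plotting code is unchanged apart from the name of the data object.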


Table 2. WorldPhones dataset in wide format.

Now each of the phone trends looks linear, and the lower values are spotted more easily; for example, it is now clear that Africa has overtaken Central America by 1956 (see Figure 12).

Notice how easy it was to build this plot once the data was in the tall format: one row for every point (that is, every combination of year and continent) on the graph.

10. Saving Graphs

Plots might look great on the screen, but they typically have to be embedded in other documents (Markdown, LaTeX, Word, etc.). In order to do so, they must first be saved in an appropriate format, with a specific resolution and size. Default size settings can be saved within the .Rmd document by declaring them in the first chunk of code. For instance, this would tell knitr to produce 8 in. × 5 in. charts:

knitr::opts_chunk$set(fig.width=8, fig.height=5)

A convenience function named ggsave() can be particularly useful. Options include which plot to save, where to save it, and in what format. For example,

myplot <- ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(filename="mygraph.png", plot=myplot, width=5, height=4)

saves the myplot object as a 5-inch by 4-inch .png file named mygraph.png in the current working directory. The available formats include .ps, .tex, .jpeg, .pdf, .jpg, .tiff, .png, .bmp, .svg, or .wmf (the latter only being available on Windows machines).

Without the plot= option, the most recently created graph is saved. The following bit of code, for instance, would also save the mtcars plot (the latest plot) to the current working directory (see the ggsave() help file for additional details):

ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(filename="mygraph.pdf")

Within RStudio, an alternative is to click on Export, then “Save Plot As Image” to open a GUI.

11. Summary

The first 10 sections reviewed the ggplot2 package, which provides advanced graphical methods based on a comprehensive grammar of graphics. The package is designed to provide the user with a complete and comprehensive alternative to the native graphics provided with R. It offers methods for creating attractive and meaningful data visualizations that are difficult to generate in other ways.

It does come with some drawbacks, however: the ggplot2 and tidyverse design teams have fairly strong opinions about how data should be visualized and processed. As a result, it can sometimes be difficult to produce charts that go against their design ideals. In the same vein, the various package updates do not always preserve the functionality of working code, sending the analysts scurrying to figure out how the new functions work, which can cause problems with legacy code. Still, the versatility and overall simplicity of ggplot2 cannot be overstated.

A list of all ggplot2 functions, along with examples, can be found at http://docs.ggplot2.org. The theory underlying ggplot2 is explained in great detail in [2]; useful examples and starting points can also be found in [1, 5].

The ggplot2 action flow is always the same: start with data in a table, map the display variables to various aesthetics (position, colour, shape, etc.), and select one or more geoms to draw the graph. This is accomplished in the code by first creating an object with the basic data and mappings information, and then by adding or layering additional information as needed.

Once this general way of thinking about plots is understood (especially the aesthetic mapping part), the drawing process is simplified significantly. There is no need to think about how to draw particular shapes or colours in the chart; the many (self-explanatory) geom_ functions do all the heavy lifting.

Similarly, learning how to use new geoms is easier when they are viewed as ways to display specific aesthetic mappings.
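The whole flow, from data and mappings through geoms to a saved file, can be sketched in a few lines. This is a minimal illustration, assuming the built-in mtcars dataset; the output location (a temporary directory), the file name, and the 300 dpi resolution are arbitrary choices made for the example.

```r
library(ggplot2)

# 1. Data in a table, with variables mapped to aesthetics.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl)))

# 2. One or more geoms, layered on with "+".
p <- p +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# 3. Save with an explicit size (in inches) and resolution (dots per inch).
out_file <- file.path(tempdir(), "mtcars_sketch.png")
ggsave(filename = out_file, plot = p, width = 5, height = 4, dpi = 300)

file.exists(out_file)
```

Passing plot = p explicitly avoids depending on whichever plot happened to be displayed last, which matters in scripts that build several plots before saving any of them.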


Figure 11. WorldPhones.m plots, using geom_point() (above) and geom_line() (below).


Figure 12. WorldPhones.m on a vertical logarithmic scale.


Figure 13. Scatterplot of the midwest dataset.

12. Examples

In this final section, we provide 30 additional examples of ggplot2 visualizations. Some of those examples have been taken directly (and modified) from various online sources (see references).

Example 1 (Scatterplot)
The most frequently used plot for data analysis is undoubtedly the scatterplot. Whenever you want to understand the nature of the relationship between two variables, invariably the first choice is the scatterplot, which is drawn using geom_point(). Additionally, the geom_smooth() default will draw a “loess” smoothing line, which can be tweaked to draw the line of best fit instead by setting method='lm' (see Figure 13).

# load package and data
options(scipen=999) # turn off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme
data("midwest", package = "ggplot2")
# Scatterplot
ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +
  geom_smooth(method="loess", se=FALSE) +
  xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
    y="Population",
    x="Area",
    title="Scatterplot",
    caption = "Source: midwest")
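The switch from a loess curve to a line of best fit mentioned in Example 1 can be sketched as follows. The sketch reuses the midwest data; using coord_cartesian() for the axis limits is a choice made here, not something taken from the report's code.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Common base: points only, zoomed to the same window as Example 1.
# coord_cartesian() zooms without discarding data, so both smoothers
# below are fitted to all counties rather than only the visible ones.
base <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(alpha = 0.4) +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 500000))

p_loess <- base + geom_smooth(method = "loess", se = FALSE)  # local smoother
p_lm    <- base + geom_smooth(method = "lm", se = FALSE)     # straight line

print(p_loess)
print(p_lm)
```

By contrast, xlim() and ylim() remove the observations outside the limits before any smoothing takes place, so the fitted curve can differ depending on which approach is used to restrict the axes.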


Figure 14. Bubble chart of the mpg dataset.

Example 2 (Bubble Chart)
While scatterplots allow for comparisons between 2 continuous variables,
bubble charts extend the principle to 4 or more variables using various
marker elements:
- the colour of the marker can be mapped to a categorical variable (finite
  colour choices) or a continuous variable (gradient scale);
- the size of the marker is typically mapped to a positive continuous
  variable.
Additionally, the shape of the marker can be mapped to a categorical
variable.

With the mpg dataset, a bubble chart can help to clearly distinguish the
range of the displ feature for the various manufacturers, and can be used to
show how the slopes of the lines of best fit vary by manufacturer, providing
a better visual comparison between the groups (see Figure 14).

library(ggplot2)
data(mpg, package="ggplot2")
mpg_select <- mpg[mpg$manufacturer %in%
                  c("audi", "ford", "honda", "hyundai"), ]

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) +
  labs(subtitle="mpg: Displacement vs City Mileage",
       title="Bubble chart")
g + geom_jitter(aes(col=manufacturer, size=hwy)) +
  geom_smooth(aes(col=manufacturer), method="lm", se=F)
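The remark in Example 2 about mapping marker shape to a categorical variable can be sketched as follows. The choice of drv (drive train) as the shape variable is an assumption made for illustration only; any categorical column with a handful of levels would do.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

makes <- c("audi", "ford", "honda", "hyundai")
mpg_select <- mpg[mpg$manufacturer %in% makes, ]

# Five variables on one chart: x, y, colour, size and (assumed here) shape
p <- ggplot(mpg_select, aes(displ, cty)) +
  geom_jitter(aes(col = manufacturer, size = hwy, shape = drv)) +
  labs(title = "Bubble chart with a shape mapping")
```

Shapes are harder to tell apart than colours, so this mapping works best when the categorical variable has few levels.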


Example 3 (Animated Bubble Chart)
An animated bubble chart can be implemented using the gganimate package. It
works in much the same way as a bubble chart, but allows the user to show how
the chart changes with an additional variable (typically time). The key
element is to specify the column on which to animate, here through
transition_time() (older versions of gganimate used aes(frame) and
gganimate() instead). The rest of the plot construction procedure is the
same as before. Once the plot is constructed and rendered, the animation can
be written to file with anim_save().
Note that ImageMagick (http://imagemagick.org) must be installed in order to
use the anim_save() function.

library(ggplot2)
library(gganimate)
library(gapminder)
theme_set(theme_bw())  # pre-set bw theme
head(gapminder)

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop,
                      colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  #scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent) +
  # Here are the gganimate bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita',
       y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')
anim_save(filename="gapminder.gif")  # saved, not plotted

Example 4 (Maps)
The ggmap package provides facilities to interact with the Google Maps API
and get the coordinates (latitude and longitude) of places you want to plot.
The example below provides road (Figure 16), hybrid (Figure 17) and satellite
(Figure 18) maps of the city of Ottawa, encircling some locations of interest
(the Google Maps API has changed its functionality since the images were
originally created; the example below will require some tweaking). The
geocode() function is used to get the coordinates of the locations and qmap()
is used to retrieve the maps. The type of map to fetch is determined by the
value set to maptype.
The map supports zooms; the default value of 10 is suitable for large cities.
It can be reduced to 3 for zooming out, and increased to 21 to zoom in at the
building level.

library(ggplot2)
library(ggmap)
library(ggalt)

# Get Ottawa's Coordinates --------------------------------
ottawa <- geocode("Ottawa")  # get longitude and latitude

# Get Coordinates for Ottawa's Places ---------------------
ottawa_places <- c("Canadian War Museum", "Rideau Centre",
                   "University of Ottawa", "Carleton University")
places_loc <- geocode(ottawa_places)  # get longitudes and latitudes

# Get the Map ------------------------------
# Google Satellite Map
ottawa_ggl_sat_map <- qmap("ottawa", zoom=13, source = "google",
                           maptype="satellite")

# Google Hybrid Map
ottawa_ggl_hybrid_map <- qmap("ottawa", zoom=13, source = "google",
                              maptype="hybrid")

# Google Road Map
ottawa_ggl_road_map <- qmap("ottawa", zoom=13, source = "google",
                            maptype="roadmap")

# Plot Google Road Map -------------------------------------
ottawa_ggl_road_map +
  geom_point(aes(x=lon, y=lat), data = places_loc, alpha = 0.8, size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat), data = places_loc, size = 2,
                color = "blue")

# Google Hybrid Map ----------------------------------------
ottawa_ggl_hybrid_map +
  geom_point(aes(x=lon, y=lat), data = places_loc, alpha = 0.7, size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat), data = places_loc, size = 2,
                color = "blue")

# Google Satellite Map ----------------------------------------
ottawa_ggl_sat_map +
  geom_point(aes(x=lon, y=lat), data = places_loc, alpha = 0.7, size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat), data = places_loc, size = 2,
                color = "blue")

Figure 15. Bubble chart of the gapminder dataset (selected frames).

Figure 16. Ottawa Road Map

Figure 17. Ottawa Hybrid Map

Figure 18. Ottawa Satellite Map


Example 5 (Marginal Histogram / Boxplot)
A marginal histogram is a chart on which the scatterplot of two variables X
and Y is shown, together with the distribution of each of the variables. This
can be implemented using the ggMarginal() function from the ggExtra package.
Other marginal plots are available. The output of the following code is shown
in Figure 19.

# load package and data
library(ggplot2)
library(ggExtra)
data(mpg, package="ggplot2")

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
mpg_select <- mpg[mpg$hwy >= 35 & mpg$cty > 27, ]
g <- ggplot(mpg, aes(cty, hwy)) +
  geom_count() +
  geom_smooth(method="lm", se=F)

ggMarginal(g, type = "histogram", fill="transparent")
ggMarginal(g, type = "boxplot", fill="transparent")

Example 6 (Diverging Bars)
We might want bar charts that can handle both negative and positive values
(diverging bars); this can be implemented by tweaking geom_bar():
- set stat="identity";
- provide both x and y inside the aes() call, where x is a character or a
  factor and y is numeric.
In order to guarantee diverging bars (instead of simple bars), the
categorical variable must have two levels whose values change at a given
threshold of the continuous variable. In the display of Figure 20, mpg (from
mtcars) is normalised by computing the z score: vehicles with a z score above
zero are shown in green; those below in red.

library(ggplot2)
theme_set(theme_bw())
data("mtcars")  # load data
mtcars$`car name` <- rownames(mtcars)  # create new column for car names
mtcars$mpg_z <- round((mtcars$mpg - mean(mtcars$mpg))/sd(mtcars$mpg), 2)
# compute normalized mpg
mtcars$mpg_type <- ifelse(mtcars$mpg_z < 0, "below", "above")
# above / below avg flag
mtcars <- mtcars[order(mtcars$mpg_z), ]  # sort
mtcars$`car name` <- factor(mtcars$`car name`, levels = mtcars$`car name`)
# convert to factor to retain sorted order in plot.

# Diverging Barcharts
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) +
  geom_bar(stat='identity', aes(fill=mpg_type), width=.5) +
  scale_fill_manual(name="Mileage",
                    labels = c("Above Average", "Below Average"),
                    values = c("above"="#00ba38", "below"="#f8766d")) +
  labs(subtitle="Normalised mileage",
       title= "Diverging Bars - mtcars") +
  coord_flip()

Example 7 (Area Chart)
Area charts are typically used to visualize how a particular metric (such as
% returns from a stock) performed compared to a certain baseline. Other types
of % returns or % change data are commonly used; area charts are implemented
with geom_area() (see Figure 21).

library(ggplot2)
#install.packages("quantmod")
library(quantmod)
data("economics", package = "ggplot2")

# Compute % Returns
economics$returns_perc <- c(0, diff(economics$psavert)/
                               economics$psavert[-length(economics$psavert)])

# Create break points and labels for axis ticks
brks <- economics$date[seq(1, length(economics$date), 12)]
#install.packages("lubridate")
lbls <- lubridate::year(economics$date[seq(1, length(economics$date), 12)])

# Plot
ggplot(economics[1:100, ], aes(date, returns_perc)) +
  geom_area() +
  scale_x_date(breaks=brks, labels=lbls) +
  theme(axis.text.x = element_text(angle=90)) +
  labs(title="Area Chart",
       subtitle = "Perc Returns for Personal Savings",
       y="% Returns for Personal Savings",
       caption="Source: economics")
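The returns computation used for the area chart of Example 7 can be checked on a small vector. This is a minimal sketch on made-up numbers (the vector x is an assumption, not data from the report); it shows that diff() divided by the lagged values gives the period-over-period change as a fraction.

```r
# Toy series (assumed values, for illustration only)
x <- c(100, 110, 99, 99)

# Change relative to the previous value; the first entry has no predecessor,
# so it is set to 0, as in the economics example.
returns <- c(0, diff(x) / x[-length(x)])
print(returns)

# Multiply by 100 to express the change in percent.
returns_pct <- 100 * returns
print(returns_pct)
```

The values produced this way are fractions, so an axis labelled “%” is only accurate after the multiplication by 100.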



Figure 19. Marginal histograms / boxplots for the mpg dataset.

Figure 20. Diverging bars for the mtcars dataset.

Figure 21. Area chart for the economics dataset.

Example 8 (Population Pyramid)
Population pyramids offer a way to visualize how much of the population (or
what percentage of the population) falls under a certain category. The
pyramid of Figure 22 is an excellent example, showing how many users are
retained at each stage of an email marketing campaign funnel.

library(ggplot2)
library(ggthemes)
options(scipen = 999)  # turns off scientific notations (like 1e+40)

# Read data
email_campaign_funnel <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls <- paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")

# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) +
  # Fill column
  geom_bar(stat = "identity", width = .6) +  # draw the bars
  scale_y_continuous(breaks = brks,  # Breaks
                     labels = lbls) +  # Labels
  coord_flip() +  # Flip axes
  labs(title="Email Campaign Funnel") +
  theme_tufte() +  # Tufte theme from ggthemes
  theme(plot.title = element_text(hjust = .5),
        axis.ticks = element_blank()) +  # Centre plot title
  scale_fill_brewer(palette = "Dark2")  # Colour palette

Example 9 (Calendar Heatmap)
The calendar heat map is a great tool to see the daily variation (especially
the highs and lows) of a variable like stock price, as it emphasizes the
variation over time rather than the actual value itself. It can (with a fair
amount of data preparation) be produced with geom_tile (see Figure 23).

# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)
library(scales)
library(zoo)

df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date)  # format date
df <- df[df$year >= 2012, ]  # filter years

# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df, .(yearmonthf), transform,
            monthweek=1+week-min(week))  # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek",
             "weekdayf", "VIX.Close")]
head(df)

# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) +
  geom_tile(colour = "white") +
  facet_grid(year~monthf) +
  scale_fill_gradient(low="red", high="green") +
  labs(x="Week of Month",
       y="",
       title = "Time-Series Calendar Heatmap",
       subtitle="Yahoo Closing Price",
       fill="Close")
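The “week of month” step in the calendar heatmap preparation is the least obvious part of Example 9. The sketch below reproduces it with base R's ave() on a toy data frame; the data frame toy and its values are assumptions for illustration, and ave() is used here in place of the plyr::ddply() call of the example.

```r
# Toy data: week-of-year numbers for a few days in two months (assumed values)
toy <- data.frame(
  yearmonth = c("2012-01", "2012-01", "2012-01", "2012-02", "2012-02"),
  week      = c(1, 2, 3, 5, 6)
)

# Within each month: week of month = 1 + week - (first week seen in the month)
toy$monthweek <- ave(toy$week, toy$yearmonth,
                     FUN = function(w) 1 + w - min(w))
print(toy)
```

Each month restarts at week 1, which is what lets facet_grid(year ~ monthf) line the tiles up as a calendar.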

Figure 22. Population pyramid for the email campaign funnel dataset.

Figure 23. Time-series calendar heatmap of the yahoo stockprice dataset.


Example 10 (Ordered Bar Chart)
An ordered bar chart is a bar chart that is ordered by the y-axis variable.
It is not sufficient to sort the dataframe by the variable of interest; in
order for the bar chart to retain the row ordering, the x-axis variable (i.e.
the categories) has to be converted into a factor object.
This is shown in Figure 24, for the mean city mileage of each manufacturer in
the mpg dataset. The data is first aggregated and sorted, and the x-variable
is converted to a factor.

library(ggplot2)  # also provides the mpg dataset
theme_set(theme_bw())

# Prepare data: group mean city mileage by manufacturer.
cty_mpg <- aggregate(mpg$cty, by=list(mpg$manufacturer), FUN=mean)
# aggregate
colnames(cty_mpg) <- c("make", "mileage")  # change column names
cty_mpg <- cty_mpg[order(cty_mpg$mileage), ]  # sort
cty_mpg$make <- factor(cty_mpg$make, levels = cty_mpg$make)
# to retain the order in plot.

# Draw plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
  geom_bar(stat="identity", width=.5, fill="tomato3") +
  labs(title="Ordered Bar Chart",
       subtitle="Make Vs Avg. Mileage",
       caption="source: mpg") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

Example 11 (Correlogram)
Correlograms can be used to examine the level of correlation among the data
variables. The cells of the matrix can be shaded or coloured to show the
correlation value (the darker the colour, the higher the magnitude of the
correlation between a pair of variables). Positive correlations are displayed
in one colour, and negative correlations in another, with intensity
proportional to the actual correlation value. This is conveniently
implemented using the ggcorrplot package (see Figure 25).

#install.packages("ggcorrplot")
library(ggplot2)
library(ggcorrplot)

# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)

# Plot
ggcorrplot(corr, hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method="circle",
           colors = c("tomato2", "white", "springgreen3"),
           title="Correlogram of mtcars",
           ggtheme=theme_bw)

Example 12 (Treemap)
A treemap requires a data frame with (at least) the following columns:
- a numeric column, which determines the area of each treemap rectangle, and
- another numeric column, which determines the fill colour of each treemap
  rectangle.
The treemapify package includes, as an example, a dataset containing
statistics about the G20 world economies. For this example, we will further
use two optional columns: a factor column, containing labels for each
rectangle (Country) and a second factor column, containing labels for groups
of rectangles (Region); see Figure 26 for the final display.
We start by drawing a treemap where each tile represents a G20 country. The
area of the tile will be mapped to the country’s GDP, and the tile’s fill
colour mapped to its HDI (Human Development Index). The basic geom used for
that purpose is geom_treemap(), but without a label to identify each country,
the display will not be very insightful. To add a text label to each tile,
use geom_treemap_text(), which uses the ggfittext package to resize the text
so that it fits inside the tile.
In addition to standard text formatting aesthetics in geom_text() (like
fontface or color), ggfittext-specific options are available; for example, we
can centre the text in the tile with place = "centre", and expand it to fill
as much of the tile as possible with grow = TRUE.
The geom_treemap geom supports subgrouping by passing a subgroup aesthetic.
Countries can be subdivided by region, say; geom_treemap_subgroup_border can
be used to draw a border around these regions, with labels given by
geom_treemap_subgroup_text (the latter takes the same input arguments for
text placement and resizing as geom_treemap_text).
Like any ggplot2 plot, treemapify plots can be faceted, scaled, themed, etc.
The code below facets the treemap by economic classification and maps the
fill colour to the region.

library(devtools)
#devtools::install_github("wilkox/treemapify")
library(treemapify)
library(ggplot2)
data(G20)

ggplot(G20, aes(area = gdp_mil_usd, fill = region, label = country)) +
  geom_treemap() +
  geom_treemap_text(grow = T, reflow = T, colour = "black") +
  facet_wrap( ~ econ_classification) +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom") +
  labs(
    title = "The G20 Economies",
    caption = "The area of each country is proportional to its relative GDP within the economic group (advanced or developing)",
    fill = "Region"
  )

Example 13 (Network Visualization)
This example, which uses the geomnet package, shows a social network from the
popular television show Mad Men (which we have never actually seen, for the
record). The network uses data and code that were made available at CRAN’s
gcookbook page [1]; it consists of 52 vertices and 87 edges. Each vertex
represents a character on the show; there is an edge between two characters
if they have shared a romantic relationship.
The network visualization of Figure 27 is provided by the ggnetwork package
under the ggplot2 framework, using layering.

library(ggplot2)
library(ggnetwork)
library(geomnet)
library(network)
# make the data available
data(madmen, package = 'geomnet')
# create undirected network
mm.net <- network(madmen$edges[, 1:2], directed = FALSE)
# mm.net # glance at network object
# create node attribute (gender)
rownames(madmen$vertices) <- madmen$vertices$label
mm.net %v% "gender" <- as.character(
  madmen$vertices[network.vertex.names(mm.net), "Gender"])
# gender color palette
mm.col <- c("female" = "#ff0000", "male" = "#00ff00")
set.seed(10052016)
ggplot(data = ggnetwork(mm.net, layout = "kamadakawai"),
       aes(x, y, xend = xend, yend = yend)) +
  geom_edges(color = "grey50") +  # draw edge layer
  geom_nodes(aes(colour = gender), size = 2) +  # draw node layer
  geom_nodetext(aes(colour = gender, label = vertex.names),
                size = 3, vjust = -0.6) +  # draw node label layer
  scale_colour_manual(values = mm.col) +
  xlim(c(-0.05, 1.05)) +
  theme_blank() +
  theme(legend.position = "bottom")

In the plot, we can see that there is one central character who has many more
relationships than any other character. This vertex represents the main
character of the show, Don Draper, who is apparently quite the Lothario.
Networks can be found practically in all data environments; ggnetwork
provides the curious reader with a straightforward way to visualize any
network.
Colouring the vertices or edges in a graph is a quick way to visualize
grouping and helps with pattern or cluster detection. The vertices in a
network and the edges between them compose the structure of a network, and
being able to visually discover patterns among them is a key part of network
analysis.

Example 14 (Time Series Plot From a Time Series Object)
The ggfortify package allows autoplot to automatically plot directly from a
ts object (see Figure 28).

## From Timeseries object (ts)
library(ggplot2)
library(ggfortify)
# Plot
autoplot(AirPassengers) +
  labs(title="AirPassengers") +
  theme(plot.title = element_text(hjust=0.5))

Example 15 (Time Series Plot From a Data Frame)
Using geom_line(), a time series (or line chart) can be drawn from a data
frame as well. The horizontal axis breaks are generated by default. In the
example below (Figure 29), the breaks are formed once every 10 years.

library(ggplot2)
theme_set(theme_classic())
# Allow Default X Axis Labels
ggplot(economics, aes(x=date)) +
  geom_line(aes(y=unemploy)) +
  labs(title="Time Series Chart",
       subtitle="Number of unemployed in thousands from 'Economics-US' Dataset",
       caption="Source: Economics",
       y="unemploy")
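Example 15 relies on the default date breaks. As a minimal sketch, the same ten-year spacing can be requested explicitly; the break dates chosen below (1970 to 2010) are an assumption made for illustration.

```r
library(ggplot2)

# Explicit breaks every 10 years (assumed range, for illustration)
brks <- seq(as.Date("1970-01-01"), as.Date("2010-01-01"), by = "10 years")

p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  scale_x_date(breaks = brks, date_labels = "%Y") +
  labs(title = "Time Series Chart with explicit breaks")
```

Setting the breaks by hand makes the axis independent of the default heuristics, which may pick a different spacing if the date range changes.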


Figure 24. Ordered bar plot of the mpg dataset.
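For the ordered bar chart of Example 10, the sort-then-convert-to-factor step can also be written with reorder(), which builds the ordered factor directly. This is a sketch of an alternative, not the approach used in the example; the object names are assumptions.

```r
library(ggplot2)

# Mean city mileage per manufacturer (same aggregation idea as Example 10)
cty_mpg <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_make <- reorder(cty_mpg$manufacturer, cty_mpg$cty)
print(levels(ordered_make))

# The same call can be used inside aes(), with no need to pre-sort the data
p <- ggplot(cty_mpg, aes(x = reorder(manufacturer, cty), y = cty)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato3") +
  labs(x = "make", y = "mileage")
```

Either way, it is the order of the factor levels, not the order of the rows, that determines the order of the bars.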

Figure 25. Correlogram of the mtcars dataset.

Figure 26. Treemap for the G20 economies.

Figure 27. Graph of Mad Men characters who are linked by a romantic relationship.

Figure 28. Time series plot for the AirPassengers dataset.

Figure 29. Time series plot for the economics dataset.


Example 16 (Time Series Plot For a Monthly Time Series)
In order to select specific breaks on the x-axis, consider the functionality
offered by scale_x_date() (plot shown in Figure 30).

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

economics_m <- economics[1:24, ]

# labels and breaks for X axis text
lbls <- paste0(month.abb[month(economics_m$date)], " ",
               lubridate::year(economics_m$date))
brks <- economics_m$date

# plot
ggplot(economics_m, aes(x=date)) +
  geom_line(aes(y=pce)) +
  labs(title="Monthly Time Series",
       subtitle="Personal consumption expenditures, in billions of dollars",
       caption="Source: Economics",
       y="pce") +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),
        # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

Figure 30. Monthly time series for the economics dataset.

Example 17 (Time Series Plot For a Yearly Time Series)
Here’s the same, but with a yearly breakdown (plot shown in Figure 31).

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

economics_y <- economics[1:90, ]

# labels and breaks for X axis text
brks <- economics_y$date[seq(1, length(economics_y$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(economics_y, aes(x=date)) +
  geom_line(aes(y=psavert)) +
  labs(title="Yearly Time Series",
       subtitle="Personal savings rate",
       caption="Source: Economics",
       y="psavert") +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to yearly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),
        # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

Figure 31. Yearly time series for the economics dataset.

Example 18 (Time Series Plot From Long Data Format)
In this example, we construct the plot from a long data format (i.e. the
column names and respective values of all the columns are stacked in only 2
variables, variable and value, respectively). In the wide format, the data
would take the appearance of the economics dataset.
Below, the geom_line objects are drawn using value and aes(col) is set to
variable. In this way, multiple coloured lines are plotted (one for each
unique variable level) with a single call; scale_x_date() changes the x-axis
breaks and labels, while the line colours are changed by scale_color_manual
(see Figure 32).

data(economics_long, package = "ggplot2")
# head(economics_long)
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics_long[economics_long$variable %in%
                     c("psavert", "uempmed"), ]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]
# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_line(aes(y=value, col=variable)) +
  labs(title="Time Series of Returns Percentage",
       subtitle="Drawn from Long Data format",
       caption="Source: Economics",
       y="Returns %",
       color=NULL) +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to yearly ticks and labels
  scale_color_manual(labels = c("psavert", "uempmed"),
                     values = c("psavert"="#00ba38",
                                "uempmed"="#f8766d")) +  # line color
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),
        # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

Figure 32. Time series from long data format for the economics dataset.

Example 19 (Stacked Area Chart)
A stacked area chart is just like a line chart, except that the region below
the line is filled in. This is typically used to:
- describe how a quantity or volume (rather than something like a price)
  changes over time;
- when the data contains a “large” number of points (for “small” datasets,
  consider using a bar chart), or
- when the respective contributions of each individual component need to be
  highlighted.
The appropriate call uses geom_area, which works very much like geom_line,
with an important difference: by default, each geom_area starts from the
bottom of the y-axis (which is typically set at 0), but in order to show the
contribution from individual components, the area has to be stacked on top of
the previous component, rather than relative to the floor of the plot. All
the bottom layers have to be added to the y value of a new area.
In the example below (and in Figure 33), the top layer is
y=psavert+uempmed. However nice the plot might look, keep in mind that it can
be difficult to interpret.

library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]
# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)
# plot
ggplot(df, aes(x=date)) +
  geom_area(aes(y=psavert+uempmed, fill="psavert")) +
  geom_area(aes(y=uempmed, fill="uempmed")) +
  labs(title="Area Chart of Returns Percentage",
       subtitle="From Wide Data format",
       caption="Source: Economics",
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to yearly ticks and labels
  scale_fill_manual(name="",
                    values = c("psavert"="#00ba38",
                               "uempmed"="#f8766d")) +  # fill color
  theme(panel.grid.minor = element_blank())  # turn off minor grid


Figure 33. Stacked area chart for the economics dataset (green area is the sum of psavert and uempmed).
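The manual stacking of Example 19 (adding the lower layer to the y value of the upper one) is needed because the data is in wide format. As a sketch of an alternative, and under the assumption that reshaping to long format is acceptable, geom_area() can do the stacking itself, since its default position is "stack". The row range and object names below are assumptions for illustration.

```r
library(ggplot2)

# Assumption: first 180 rows of ggplot2's economics data, for illustration
wide <- economics[1:180, c("date", "psavert", "uempmed")]

# Reshape wide -> long with base R: one row per (date, variable) pair
long <- data.frame(
  date     = rep(wide$date, times = 2),
  variable = rep(c("psavert", "uempmed"), each = nrow(wide)),
  value    = c(wide$psavert, wide$uempmed)
)

# One geom_area() call; the groups are stacked on top of each other by default
p <- ggplot(long, aes(x = date, y = value, fill = variable)) +
  geom_area() +
  scale_fill_manual(name = "",
                    values = c("psavert" = "#00ba38", "uempmed" = "#f8766d"))
```

With this form, adding a third component only requires more rows in the long data frame, not another hand-computed sum.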


Example 20 (Seasonal Plot)
When working with a time series object of class ts or xts, the seasonal
fluctuations can be viewed through a seasonal plot using
forecast::ggseasonplot.
You can see the traffic increase in air passengers in AirPassengers over the
years along with the repetitive seasonal patterns in traffic; in the same
vein, the temperatures in nottem do not increase over time, but they
definitely follow a seasonal pattern (see below, and Figure 34).

library(ggplot2)
library(forecast)
theme_set(theme_classic())
# Subset data
nottem_small <- window(nottem, start=c(1920, 1), end=c(1925, 12))
# subset a smaller timewindow
# Plot
ggseasonplot(AirPassengers) +
  labs(title="Seasonal plot: International Airline Passengers")
ggseasonplot(nottem_small) +
  labs(title="Seasonal plot: Air temperatures at Nottingham Castle")

Example 21 (Parallel Coordinate Plots)
Parallel coordinate plots are useful to visualize multivariate data. As a
practical example, assume that a survey has been conducted, with a variety of
questions. Each question is asked three times, in a different context, and is
answered on a discrete scale from 1 to 7. Consequently, each question has
three “dimensions”. The distribution of answers across the three dimensions
should be displayed for each question. Because the three dimensions have the
same unit and scale, they can easily be compared on parallel coordinates (it
would be possible to display more than three dimensions, of course).

library(triangle)
set.seed(0)
q1_d1 <- round(rtriangle(1000, 1, 7, 5))
q1_d2 <- round(rtriangle(1000, 1, 7, 6))
q1_d3 <- round(rtriangle(1000, 1, 7, 2))
df <- data.frame(q1_d1 = factor(q1_d1), q1_d2 = factor(q1_d2),
                 q1_d3 = factor(q1_d3))

We are using the triangular distribution to get random integers r ∈ [1, 7],
around a different mode c for each dimension (5, 6 and 2). To plot the main
“answer paths” (i.e. the most frequent answer combinations across the three
dimensions), we need to group by all dimensions, and then to count the
frequency of each unique answer combination. This can be done with the dplyr
package.

library(dplyr)
# group by combinations and count
df_grouped <- df %>% group_by(q1_d1, q1_d2, q1_d3) %>% count()
# set an id string that denotes the value combination
df_grouped <- df_grouped %>%
  mutate(id = factor(paste(q1_d1, q1_d2, q1_d3, sep = '-')))
order.freq <- order(df_grouped$n, decreasing=TRUE)
# sort by count and select top rows
df_grouped <- df_grouped[order.freq[1:25], ]

The count per group is automatically stored in a column n. We additionally
set an id column which denotes the unique answer combination. The dataset is
sorted by decreasing count and the 25 most frequent paths are retained
(optional).
We can now plot the data by using geom_path, after processing the data
appropriately. We need to convert our grouped data frame into a "long format"
using melt() from the package reshape2 so that our three dimensions are
contained in a column named "variable" and the respective values are in the
column "value":

library(reshape2)
library(ggplot2)
# create long format
df_pcp <- melt(as.data.frame(df_grouped), id.vars = c('id', 'n'))
df_pcp$value <- factor(df_pcp$value)

We can then specify what levels should be drawn on the y-axis (1 to 7). In
the ggplot() function we define an aesthetic that uses the “variable” column
for the x-axis and the “value” column for the y-axis. We also specify that
the values should be grouped by using the id column. This is required, as the
connections between the three dimensions won’t be drawn otherwise. We use
geom_path() to draw the connection lines and make the width and colour of the
connection dependent on the n and id columns, respectively (see Figure 35):

y_levels <- levels(factor(1:7))
ggplot(df_pcp, aes(x = variable, y = value, group = id)) +
  # group = id is important!
  geom_path(aes(size = n, color = id),
            alpha = 0.5,
            lineend = 'round', linejoin = 'round') +
  scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
  scale_size(breaks = NULL, range = c(1, 7))

Figure 34. Seasonal plots for the AirPassengers dataset (above) and the nottem dataset (below).


Figure 35. Parallel coordinate plots for randomly generated data.
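The grouping-and-counting step behind the parallel coordinate plot of Example 21 can be illustrated on a handful of rows. The sketch below uses base R's aggregate() rather than the dplyr pipeline of the example, and the toy data frame answers is an assumption made for illustration.

```r
# Toy answers for the three "dimensions" of one question (assumed values)
answers <- data.frame(
  d1 = c(5, 5, 4, 5, 4),
  d2 = c(6, 6, 6, 6, 6),
  d3 = c(2, 2, 3, 2, 3)
)

# Count each observed combination by summing a column of ones per group
counts <- aggregate(n ~ d1 + d2 + d3, data = cbind(answers, n = 1), FUN = sum)

# Most frequent "answer path" first, with an id string per combination
counts <- counts[order(counts$n, decreasing = TRUE), ]
counts$id <- paste(counts$d1, counts$d2, counts$d3, sep = "-")
print(counts)
```

Each row of counts then corresponds to one path in the plot, with n driving the line width.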


Example 22 (Clusters)
It is possible to show the distinct clusters or groups using
geom_encircle(). If the dataset has multiple “weak” features, we often first
find the principal components and display the dataset as a scatterplot using
PC1 and PC2 for the x and y axes, respectively.
The geom geom_encircle() can be used to encircle the desired groups. The only
thing to note is the data argument to geom_encircle(): it must be a subsetted
dataframe containing only the observations (rows) belonging to the group (see
Figure 36).

# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
library(ggfortify)
theme_set(theme_classic())
# Compute data with principal components
df <- iris[c(1, 2, 3, 4)]
pca_mod <- prcomp(df)  # compute principal components
# Data frame of principal components
df_pc <- data.frame(pca_mod$x, Species=iris$Species)
# dataframe of principal components
df_pc_vir <- df_pc[df_pc$Species == "virginica", ]  # df for 'virginica'
df_pc_set <- df_pc[df_pc$Species == "setosa", ]  # df for 'setosa'
df_pc_ver <- df_pc[df_pc$Species == "versicolor", ]  # df for 'versicolor'

# Plot
clustering <- ggplot(df_pc, aes(PC1, PC2, col=Species)) +
  geom_point(aes(shape=Species), size=2) +  # draw points
  labs(title="Iris Clustering",
       subtitle="With principal components PC1 and PC2 as X and Y axis",
       caption="Source: Iris") +
  coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)),
                  ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2))) +
  # change axis limits
  geom_encircle(data = df_pc_vir, aes(x=PC1, y=PC2)) +  # draw circles
  geom_encircle(data = df_pc_set, aes(x=PC1, y=PC2)) +
  geom_encircle(data = df_pc_ver, aes(x=PC1, y=PC2))

ggsave(file="clusters.png", plot=clustering, width=5, height=4)

Example 23 (Dumbbell Plot)
Dumbbell charts are a great tool to visualize relative positions (like growth
and decline) between two points in time, and compare distances between two
categories.
In order to get the correct ordering of the dumbbells, the y-axis variable
should be a factor and the levels of the factor variable have to be in the
same order as they should appear in the plot (see Figure 37).

# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
theme_set(theme_classic())

health <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")

# for right ordering of the dumbbells
health$Area <- factor(health$Area, levels=as.character(health$Area))
# health$Area <- factor(health$Area)
gg <- ggplot(health, aes(x=pct_2013, xend=pct_2014, y=Area, group=Area)) +
  geom_dumbbell(color="#a3c4dc",
                size=0.75,
                point.colour.l="#0e668b") +
  scale_x_continuous(label=waiver()) +
  labs(x=NULL,
       y=NULL,
       title="Dumbbell Chart",
       subtitle="Pct Change: 2013 vs 2014",
       caption="Source: https://github.com/hrbrmstr/ggalt") +
  theme(plot.title = element_text(hjust=0.5, face="bold"),
        plot.background=element_rect(fill="#f7f7f7"),
        panel.background=element_rect(fill="#f7f7f7"),
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        panel.grid.major.x=element_line(),
        axis.ticks=element_blank(),
        legend.position="top",
        panel.border=element_blank())
plot(gg)


Figure 36. Clusters in the iris dataset, projected on the first 2 principal components.
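To gauge how much structure a two-component projection such as the one in Figure 36 can retain, one can check the share of variance carried by each principal component. The following is a minimal base-R sketch, not part of the original example; the names var_explained, pc12_share and groups are illustrative, and split() is shown only as one possible alternative to building the three per-species data frames by hand.

```r
# Principal components of the four numeric iris measurements
# (unscaled, as in the clustering example)
pca_mod <- prcomp(iris[, 1:4])

# Share of the total variance carried by each component
var_explained <- pca_mod$sdev^2 / sum(pca_mod$sdev^2)
names(var_explained) <- colnames(pca_mod$x)
print(round(var_explained, 3))

# Cumulative share retained by the PC1-PC2 projection
pc12_share <- sum(var_explained[1:2])

# One data frame per species, e.g. for the data argument of geom_encircle()
df_pc <- data.frame(pca_mod$x, Species = iris$Species)
groups <- split(df_pc, df_pc$Species)
```

If pc12_share is close to 1, little information is lost by plotting only the first two components.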


Figure 37. Dumbbell plot for the health dataset.
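The ordering rule for the dumbbells can be checked without drawing anything, since it only depends on the factor levels. The sketch below uses a small hypothetical data frame (health_toy, with made-up values) rather than the health dataset, and assumes that each area appears only once, which factor() requires when the levels are taken from the column itself.

```r
# Hypothetical stand-in for the health data (values are made up)
health_toy <- data.frame(
  Area     = c("Houston", "Miami", "Dallas"),
  pct_2013 = c(0.19, 0.20, 0.18),
  stringsAsFactors = FALSE
)

# Default: levels are sorted alphabetically
default_order <- levels(factor(health_toy$Area))

# Levels follow the row order of the data frame (as in the dumbbell example)
row_order <- levels(factor(health_toy$Area,
                           levels = as.character(health_toy$Area)))

# Levels follow the value of a variable, here pct_2013
by_value <- levels(factor(health_toy$Area,
                          levels = health_toy$Area[order(health_toy$pct_2013)]))
```

A discrete y-axis is drawn in level order from bottom to top, so the first level ends up at the bottom of the chart.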


Example 24 (Slope Chart)
A slope chart is a great tool to visualize changes in value and ranking between categories. It is more suitable than a time series when very few time points are present.

library(ggplot2)
library(dplyr)
theme_set(theme_classic())
source_df <- read.csv("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv")

# Define functions. Source: https://github.com/jkeirstead/r-slopegraph
tufte_sort <- function(df, x="year", y="value", group="group",
                       method="tufte", min.space=0.05) {
  ## First rename the columns for consistency
  ids <- match(c(x, y, group), names(df))
  df <- df[,ids]
  names(df) <- c("x", "y", "group")

  ## Expand grid to ensure every combination has a defined value
  tmp <- expand.grid(x=unique(df$x), group=unique(df$group))
  tmp <- merge(df, tmp, all.y=TRUE)
  df <- mutate(tmp, y=ifelse(is.na(y), 0, y))

  ## Cast into a matrix shape and arrange by first column
  require(reshape2)
  tmp <- dcast(df, group ~ x, value.var="y")
  ord <- order(tmp[,2])
  tmp <- tmp[ord,]

  min.space <- min.space*diff(range(tmp[,-1]))
  yshift <- numeric(nrow(tmp))
  ## Start at "bottom" row
  ## Repeat for rest of the rows until you hit the top
  for (i in 2:nrow(tmp)) {
    ## Shift subsequent row up by equal space so gap between
    ## two entries is >= minimum
    mat <- as.matrix(tmp[(i-1):i, -1])
    d.min <- min(diff(mat))
    yshift[i] <- ifelse(d.min < min.space, min.space - d.min, 0)
  }

  tmp <- cbind(tmp, yshift=cumsum(yshift))
  scale <- 1
  tmp <- melt(tmp, id=c("group", "yshift"), variable.name="x", value.name="y")
  ## Store these gaps in a separate variable so that they can be scaled:
  ## ypos = a*yshift + y
  tmp <- transform(tmp, ypos=y + scale*yshift)
  return(tmp)
}

plot_slopegraph <- function(df) {
  ylabs <- subset(df, x==head(x,1))$group
  yvals <- subset(df, x==head(x,1))$ypos
  fontSize <- 3
  gg <- ggplot(df, aes(x=x, y=ypos)) +
    geom_line(aes(group=group), colour="grey80") +
    geom_point(colour="white", size=8) +
    geom_text(aes(label=y),
              size=fontSize,
              family="American Typewriter") +
    scale_y_continuous(name="", breaks=yvals, labels=ylabs)
  return(gg)
}

## Prepare data
df <- tufte_sort(source_df,
                 x="year",
                 y="value",
                 group="group",
                 method="tufte",
                 min.space=0.05)

df <- transform(df,
                x=factor(x, levels=c(5,10,15,20),
                         labels=c("5 years","10 years","15 years","20 years")),
                y=round(y))

## Plot
plot_slopegraph(df) +
  labs(title="Estimates of % survival rates") +
  theme(axis.title=element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust=0.5,
                                  family = "American Typewriter",
                                  face="bold"),
        axis.text = element_text(family = "American Typewriter",
                                 face="bold"))


Figure 38. Slope chart for the cancer survival rates dataset.
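The label-spacing step inside tufte_sort() is easier to follow on a tiny example. The sketch below reproduces only that loop on a hypothetical 3 × 2 matrix of values (rows are groups already sorted by the first column, columns are time points); the matrix and the variable names are illustrative and are not taken from the cancer survival data.

```r
# Hypothetical values: 3 groups (rows) observed at 2 time points (columns)
vals <- matrix(c(10, 12,
                 11, 12.5,
                 30, 31), ncol = 2, byrow = TRUE)

# Minimum vertical gap: 5% of the overall range of values
min_space <- 0.05 * diff(range(vals))

# For each row, compare it with the row below it: if the smallest gap across
# the time points is under min_space, record the amount needed to reach it
yshift <- numeric(nrow(vals))
for (i in 2:nrow(vals)) {
  d_min <- min(diff(vals[(i - 1):i, ]))
  yshift[i] <- if (d_min < min_space) min_space - d_min else 0
}

# Shifts accumulate: moving one row up also moves every row above it
total_shift <- cumsum(yshift)
ypos <- vals + total_shift   # plotting positions; the labels still show vals
```

Here the second group sits only 0.5 above the first at the second time point, so it (and everything above it) is pushed up until the gap reaches min_space; the displayed labels keep the original values.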


Example 25 (Hierarchical Dendrogram)
A dendrogram is a tree-structured graph used to visualize the result of a hierarchical clustering calculation; it can be drawn with ggdendrogram() from the ggdendro package (see Figure 39).

#install.packages("ggdendro")
library("ggplot2")
library("ggdendro")
theme_set(theme_bw())
hc <- hclust(dist(USArrests), "ave")  # hierarchical clustering
# plot
ggdendrogram(hc, rotate = TRUE, size = 2)

Figure 39. Hierarchical dendrogram for the USArrests dataset.

Example 26 (Density Plot)
Density plots can be viewed as smoothed histograms (see Figure 40).

library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
  labs(title="Density plot",
       subtitle="City Mileage Grouped by Number of cylinders",
       caption="Source: mpg",
       x="City Mileage",
       fill="# Cylinders")

h <- ggplot(mpg, aes(cty))
h + geom_density(aes(x=cty, fill=factor(cyl)), alpha=0.8) +
  facet_wrap(~cyl) +
  labs(title="Density plot",
       subtitle="City Mileage by Number of cylinders",
       caption="Source: mpg",
       x="City Mileage",
       fill="# Cylinders")

Example 27 (Box Plot)
Boxplots are an excellent tool to study a univariate distribution. They can also be used to show the distribution within multiple groups, along with the median, range, and suspected outliers (assuming the underlying distribution is normal).
The dark line inside the box represents the median. The top of the box is the 75th percentile (the 3rd quartile) and the bottom is the 25th percentile (the 1st quartile). The whiskers extend from the box to the most extreme observations lying within 1.5 × the interquartile range (3rd quartile - 1st quartile) of the box edges. The points beyond the whiskers are marked as dots and are normally considered extreme points.
Setting varwidth=TRUE in geom_boxplot() makes the width of each box proportional to the square root of the number of observations it contains (see Figure 41).

library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(class, cty))
g + geom_boxplot(varwidth=TRUE, fill="plum") +
  labs(title="Box plot",
       subtitle="City Mileage grouped by Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

Example 28 (Dot + Box Plot)
Overlaying a dot plot on a box plot adds the individual observations to the summary statistics shown for each group. The dots are staggered so that each dot represents one observation. In Figure 42, the number of dots for a given manufacturer matches the number of rows for that manufacturer in the source data.

library(ggplot2)
theme_set(theme_bw())

# plot
g <- ggplot(mpg, aes(manufacturer, cty))
g + geom_boxplot() +
  geom_dotplot(binaxis='y',
               stackdir='center',
               dotsize = .5,
               fill="red") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Box plot + Dot plot",
       subtitle="City Mileage vs Manufacturer: Each dot represents 1 row in source data",
       caption="Source: mpg",
       x="Manufacturer",
       y="City Mileage")

Example 29 (Waffle Chart)
Waffle charts provide a nice way to show the categorical composition of the overall population. There is no direct waffle chart geom, but they can be produced using geom_tile(), as shown below (result in Figure 43).

library(ggplot2)
var <- mpg$class  # the categorical data
nrows <- 10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(var) * ((nrows*nrows)/(length(var))))
# categ_table
df$category <- factor(rep(names(categ_table), categ_table))
# NOTE: if sum(categ_table) is not 100 (i.e. nrows^2), it will need
# adjustment to make the sum to 100.
## Plot
ggplot(df, aes(x = x, y = y, fill = category)) +
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
  scale_fill_brewer(palette = "Set3") +
  labs(title="Waffle Chart",
       subtitle="'Class' of vehicles",
       caption="Source: mpg") +
  theme(panel.border = element_rect(size = 2),
        plot.title = element_text(size = rel(1.2)),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_blank(),
        legend.position = "right")


Figure 40. Density plot for the mpg dataset; simultaneous (top), faceted (bottom).
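The link between histograms and density plots can be made concrete with base R: on the density scale, the bars of a histogram have a total area of 1, and a kernel density estimate is a smooth curve with (approximately) the same area. The sketch below is illustrative only; it uses mtcars$mpg from base R as a stand-in for mpg$cty so that it runs without ggplot2, and the variable names are assumptions.

```r
x <- mtcars$mpg   # stand-in variable (the example in the text uses mpg$cty)

# Histogram on the density scale: bar heights times bar widths sum to 1
h <- hist(x, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate (Gaussian kernel by default), evaluated on a regular grid
d <- density(x)
dens_area <- sum(d$y) * diff(d$x[1:2])   # Riemann-sum approximation of the area

# A larger bandwidth gives a smoother curve, a smaller one a bumpier curve
d_smooth <- density(x, adjust = 2)
d_bumpy  <- density(x, adjust = 0.5)
```

In ggplot2, geom_density() exposes the same kind of bandwidth multiplier through its adjust argument.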


Figure 41. Boxplots of the mpg dataset.
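The quantities drawn in a boxplot can be computed directly. The sketch below uses boxplot.stats() from base R on a small hypothetical sample with one extreme value; note that boxplot.stats() works with Tukey's hinges, whereas geom_boxplot() is understood to compute the quartiles with quantile(), so the box edges can differ slightly between the two.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical sample with one extreme value

bs <- boxplot.stats(x, coef = 1.5)

box      <- bs$stats[2:4]          # lower hinge, median, upper hinge
iqr      <- box[3] - box[1]        # interquartile range (hinge-based)
fences   <- c(box[1] - 1.5 * iqr,  # observations beyond these limits
              box[3] + 1.5 * iqr)  # are flagged as extreme points
whiskers <- bs$stats[c(1, 5)]      # most extreme observations inside the fences
outliers <- bs$out                 # points drawn individually as dots
```

The whiskers stop at observed values (here 1 and 9), not at the fences themselves, which is why the two whiskers of a box usually have different lengths.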


Figure 42. Dot boxplot of the mpg dataset.


Figure 43. Waffle chart of the mpg dataset.
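The waffle code notes that the rounded tile counts need adjusting when they do not add up to the number of tiles (100 for a 10 × 10 grid). One possible adjustment, sketched here under the assumption that a largest-remainder allocation is acceptable, rounds every share down and then hands the leftover tiles to the categories with the largest fractional parts. The function name waffle_counts and the toy counts are hypothetical.

```r
# Allocate n_cells tiles to categories in proportion to their counts,
# guaranteeing that the allocation sums to exactly n_cells
waffle_counts <- function(counts, n_cells = 100) {
  exact    <- counts * n_cells / sum(counts)   # ideal (fractional) tile counts
  cells    <- floor(exact)                     # round everything down first
  leftover <- n_cells - sum(cells)             # tiles still unassigned
  if (leftover > 0) {
    # give one extra tile to the categories with the largest remainders
    bump <- order(exact - cells, decreasing = TRUE)[seq_len(leftover)]
    cells[bump] <- cells[bump] + 1
  }
  cells
}

# Toy example where plain rounding fails: round(33.33) * 3 = 99 tiles
counts <- c(a = 1, b = 1, c = 1)
cells  <- waffle_counts(counts)
sum(cells)

# The result can be used as in the waffle example
category <- factor(rep(names(cells), cells))
```

With the mpg classes, plain rounding happens to give exactly 100 tiles, so the adjustment only matters for other datasets or grid sizes.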


Example 30 (Text Visualization)
In the following example, we will process the subtitles of The Green Mile, saved in a plain text (.txt) file.

# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("ggplot2")

Loading the text: the text is read into R and then loaded as a corpus using Corpus() from the tm (text mining) package. A corpus is a collection of documents; with VectorSource(), each element of the character vector (here, each line of the file) is treated as a document. Typically, the text would be read using code such as:

text <- readLines(file.choose())
# Read the text file from internet
#filePath <- "http://..."
#text <- readLines(filePath)

The corpus can then be built and inspected using:

# Load the data as a corpus
docs <- Corpus(VectorSource(text))
inspect(docs)

Text transformation: text processing is performed using various tm_map() calls to replace, for instance, the special characters “/”, “@” and “|” with a blank space.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

Cleaning the text: the tm_map() function can also be used to remove unnecessary white spaces, to convert the text to lower case, and to remove common stopwords like “the” or “we”.
The information content of these stopwords is basically nil because they are used so commonly in a given language. Removing such terms simplifies the final analysis (stopword lists are available for numerous languages; the language names are case-sensitive). Numbers and punctuation can also be removed with the removeNumbers and removePunctuation transformations.
Another important pre-processing step is to stem words to reduce them to their root form. This process removes word suffixes to get the common origin. For example, “moving”, “moved” and “movement” would all be stemmed to the root word “move” (stemming requires the package SnowballC).

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)

Building a term-document matrix: a term-document matrix is a table containing the frequency of each word in each document. Row names are words (or terms) and column names are documents, which is why the word frequencies below are obtained with rowSums(). The function TermDocumentMatrix() can be used as follows:

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)

Generating a word cloud: the relative importance of words can be illustrated via a word cloud.

set.seed(1234)
wordcloud(words = d$word, freq = d$freq,
          min.freq = 1,
          max.words=200,
          random.order=FALSE,
          rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Plotting: ggplot2 can be used to provide bar plots of the most frequent words.

p <- ggplot(subset(d, freq>30), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1))
p

A word cloud and a bar plot for The Green Mile are shown in Figures 44 and 45.


Figure 44. Word cloud for The Green Mile (English subtitles).
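The word frequencies behind a word cloud can be illustrated without the subtitle file. The sketch below deliberately swaps the tm pipeline for plain base-R string functions and runs on three hypothetical lines of text with a tiny hand-made stopword list; it mirrors the cleaning steps (special characters, lower case, numbers, punctuation, stopwords) and ends with a word/freq data frame of the same shape as d.

```r
# Hypothetical lines standing in for the subtitle file
text <- c("The mile is long, boss.",
          "We walk the green mile @ night.",
          "2 guards walk/talk with the boss")

clean <- tolower(text)                      # convert to lower case
clean <- gsub("[/@|]", " ", clean)          # special characters become spaces
clean <- gsub("[[:digit:]]+", "", clean)    # remove numbers
clean <- gsub("[[:punct:]]+", "", clean)    # remove punctuation

words <- unlist(strsplit(clean, "[[:space:]]+"))   # split on white space
words <- words[words != ""]                        # drop empty strings

stop_words <- c("the", "we", "is", "with")  # tiny illustrative stopword list
words <- words[!words %in% stop_words]

# Frequency table, most frequent words first
v <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(v), freq = as.vector(v))
```

The resulting d can be passed to wordcloud() or ggplot() exactly like the data frame built from the term-document matrix.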


Figure 45. Bar chart for The Green Mile (English subtitles).
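The bars in a chart such as Figure 45 appear from most to least frequent because reorder(word, -freq) sorts the factor levels by decreasing frequency; without it, the words would be laid out alphabetically. The sketch below shows the effect on a small hypothetical frequency table (the words and counts are made up, and the threshold of 35 is arbitrary).

```r
# Hypothetical word frequencies
d <- data.frame(word = c("paul", "mile", "john", "boss"),
                freq = c(40, 75, 52, 33),
                stringsAsFactors = FALSE)

# Keep only the frequent words, as subset(d, freq > 30) does in the example
top <- subset(d, freq > 35)

# Default factor levels: alphabetical order
alphabetical <- levels(factor(top$word))

# Levels sorted by increasing -freq, i.e. by decreasing frequency
by_freq <- levels(reorder(top$word, -top$freq))
```

A discrete x-axis follows the level order, so by_freq is the left-to-right order of the bars.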


