Introduction to ggplot2
Saier (Vivien) Ye
September 16, 2013
1 Intro
When it comes to producing graphics in R, there are basically three options:
1. base graphics
2. lattice
3. ggplot2
We introduced base graphics in the R session last week. Base graphics are attractive and
flexible, but when it comes to creating more complex plots, the code that you have to write
becomes cumbersome, often involving many loops.
Both lattice and ggplot2 make creating complex plots easier. The lattice package uses grid
graphics to implement the trellis graphics system and is a considerable improvement over base
graphics. However, lattice lacks a formal model, which can make it hard to extend. ggplot2
has gained significant popularity in recent years, and it has become a mainstream package for
making complex graphics.
2 Basics
In the book ggplot2: elegant graphics for data analysis by the author of ggplot2 Hadley Wickham,
ggplot2 is described as an R package for producing statistical, or data, graphics, and it differs
from other graphics packages because it has a deep underlying grammar. This particular grammar
is based on the Grammar of Graphics, hence the name gg-plot. The basic notion is that there is a
grammar to the composition of graphical components in statistical graphics. This makes ggplot2
very powerful because by directly controlling the grammar, you can generate a large set of graphics
tailored to your particular needs. You are no longer limited to a set of pre-specified graphics.
To install ggplot2, make sure you have a recent version of R (at least version 2.8). Type
install.packages("ggplot2") in the R console to install the package. Or, if you work in
RStudio, go to the Packages window, click on Install Packages and search for ggplot2. Details of
installing R and RStudio can be found in the notes from last week's R help session. To use the
package, you have to load it in every new R session with the command library("ggplot2").
The ggplot2 package comes with a number of built-in data sets for demonstration purposes. In this
session, we will work with a data set called mpg. All the examples are from the book mentioned
above.
> summary(mpg)
> head(mpg)
> library("YaleToolkit")
> whatis(mpg)
There's a quick plotting function in ggplot2 called qplot(), which is very similar to the plot()
function from base graphics. A simple qplot() command looks like the following:
> qplot(displ, hwy, data = mpg, colour = factor(cyl))
You can do a lot with qplot() alone, but the main disadvantage is that it only permits a single
dataset and a single set of aesthetic mappings. ggplot2 is designed to work in a layered fashion,
such that each graphical component is added to the plot as a layer. Each layer can come from a
different dataset and have a different aesthetic mapping, allowing us to create plots that could not
be generated using qplot().
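To make the idea of layers with their own data concrete, here is a minimal sketch. The summary data frame cyl_means and the plot name p_layers are made up for illustration; they are not part of ggplot2 or of the book's examples.

```r
library(ggplot2)

# Hypothetical summary table: mean displacement and highway mileage per
# cylinder count, computed with base R's aggregate().
cyl_means <- aggregate(cbind(displ, hwy) ~ cyl, data = mpg, FUN = mean)

p_layers <- ggplot() +
  # Layer 1: every car in mpg, drawn in grey
  geom_point(data = mpg, aes(displ, hwy), colour = "grey60") +
  # Layer 2: one large point per cylinder count, taken from the second
  # data frame and given its own colour mapping
  geom_point(data = cyl_means, aes(displ, hwy, colour = factor(cyl)), size = 4)

print(p_layers)
```

Each geom_point() call here names its own data and its own aes() mapping, which is exactly what a single qplot() call cannot express.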
A more systematic way to use the ggplot2 package is to build plots with the function ggplot().
The function takes two primary arguments: data and an aesthetic mapping. These arguments set up
defaults for the plot and can be omitted if you specify data and aesthetics when adding each layer.
data is the data frame that you want to visualize, and the aes() mappings will be passed on to the
plot elements. A simple example:
> p <- ggplot(mpg, aes(displ, hwy))
With this function, we have set up a plot which is going to draw from the data frame mpg; the
variable displ will be mapped to the x-axis, and the variable hwy will be mapped to the y-axis.
However, if you just type p or print(p) in the R console, you'll get a message saying that the plot
lacks any layers (recent versions of ggplot2 draw an empty panel instead). Looking at the command,
we have not specified which kind of geometric object will represent the data. Let's add points,
for a scatterplot.
> p+geom_point()
You add geometries to a plot with one of the geom_*() functions, using the + operator. Our
command now has two parts, connected by +: the plot defaults and a layer of points. This is what
we meant by saying ggplot2 works in layers. We use layers to add various features to the graph,
and to customize the graph based on our needs.
Notice how we didn't write any arguments in geom_point(). In order to map points to values
on the x and y axes, geom_point() needs to know what variables we're mapping to the x and y
axes. Because we set them as defaults in ggplot(), the layer inherits them.
[Figure: hwy against displ, with points and a line; legend: factor(cyl)]
The points are colored, the lines are not, and a legend has automatically been added.
Next, we'll pass the color mapping to the line, not the points:
[Figure: hwy against displ, with points and lines; legend: factor(cyl)]
Now the line is colored, and the points are not. It's hard to tell from this plot, but lines
of different colors are not connected to each other. The legend now also shows that it is the
lines that are colored.
Finally, we can include the color mapping in ggplot(), meaning all the geom objects following
will inherit this mapping:
[Figure: hwy against displ, with points and lines; legend: factor(cyl)]
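A minimal sketch of the three color-mapping variants described above. It assumes the line is drawn with geom_line() and that the mapped variable is factor(cyl), as the figure legends suggest; treat it as a reconstruction rather than the exact commands behind the figures. The names base, p1, p2 and p3 are made up for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy))

# 1. Color mapped inside geom_point() only:
#    colored points, one uncolored line through all observations
p1 <- base + geom_point(aes(colour = factor(cyl))) + geom_line()

# 2. Color mapped inside geom_line() only:
#    uncolored points, one separate line per value of cyl
p2 <- base + geom_point() + geom_line(aes(colour = factor(cyl)))

# 3. Color mapped in ggplot(): every layer that follows inherits it
p3 <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() +
  geom_line()

print(p3)
```

Mapping a discrete variable to color also defines the groups, which is why the lines in p2 and p3 are split by cyl while the line in p1 is not.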
3 Displaying Statistics
You'll frequently want to add statistical analyses to your plots, or your plots may just be of
statistical summaries anyway. ggplot2 has a few built-in statistics to make plotting easier.
The statistic I use most often is a smoothing line with stat_smooth(). There are a number
of different smoothing lines you can add, from local regression lines (loess) to linear or logistic
regressions. Let's start with the mpg data again.
[Figure: hwy against displ, with a smoothing line]
By default, stat_smooth() has added a loess line with the standard error represented by a
semi-transparent ribbon. You could also specify the method argument to add a different smoothing
line:
[Figure: hwy against displ, with a different smoothing line]
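A minimal sketch of the two smoothers discussed above, assuming a scatterplot of displ against hwy as the starting point. The method and formula are spelled out explicitly so that the result does not depend on the defaults of your ggplot2 version; the choice of "lm" for the second plot is only one possibility.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Local regression (loess) with the standard-error ribbon
p_loess <- p + stat_smooth(method = "loess", formula = y ~ x)

# Ordinary linear regression instead of loess
p_lm <- p + stat_smooth(method = "lm", formula = y ~ x)

print(p_loess)
print(p_lm)
```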
Now, statistics are represented with default geometries. For stat_smooth(), its default geoms
are the semi-transparent ribbon and the smoothing line. You could also represent the output with
points and errorbars.
[Figure: smoothed hwy against displ, drawn with points and errorbars]
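One way to get points and errorbars from stat_smooth() is to override its geom argument. This is a sketch under assumptions: the linear method and n = 15 (the number of x values at which the fit is evaluated) are my choices, not necessarily those behind the figure.

```r
library(ggplot2)

p_err <- ggplot(mpg, aes(displ, hwy)) +
  # Fitted values drawn as points instead of a line
  stat_smooth(method = "lm", formula = y ~ x, geom = "point", n = 15) +
  # Confidence band drawn as errorbars instead of a ribbon
  stat_smooth(method = "lm", formula = y ~ x, geom = "errorbar", n = 15)

print(p_err)
```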
To compare a numeric variable across the levels of a categorical variable, you can calculate the
statistics that make up boxplots:
[Figure: boxplots of hwy by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]
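A minimal sketch of the boxplot statistic, assuming class on the x-axis and hwy on the y-axis as in the figure. stat_boxplot() computes the five-number summary per class and, by default, draws it with the boxplot geom.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(class, hwy)) + stat_boxplot()

print(p_box)

# The computed statistics (one row per class) can be inspected directly
head(layer_data(p_box, 1)[, c("x", "ymin", "lower", "middle", "upper", "ymax")])
```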
4 Grouping
An important feature of ggplot2 is that you can easily represent data as grouped, and draw geoms
and calculate statistics according to these groupings. We've already seen an example of this,
where lines of different colors aren't connected:
[Figure: hwy against displ; legend: factor(cyl)]
We mapped the color aesthetic to the variable .... in ggplot(). When we add points to the
plot, their color is set according to their color group. Same with the regression lines.
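A minimal sketch of grouping by color. It assumes the mapped variable is factor(cyl), as in the figure legend, and uses a linear fit for the regression lines; both are assumptions on my part.

```r
library(ggplot2)

p_grouped <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() +
  # One regression line per color group; se = FALSE hides the ribbons
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_grouped)
```

Because the color mapping sits in ggplot(), both the points and the regression lines are split by the same groups.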
There are various ways of mapping groups to the plot, for example, point shape:
[Figure: hwy against displ; legend: factor(cyl) (4, 5, 6, 8)]
Now the color of the smoothing lines isn't meaningful anymore, but they've still been grouped
and separated.
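A sketch of the shape-based grouping, again assuming factor(cyl) as the grouping variable and a linear fit. The smoothing lines cannot show a shape, but they are still computed once per group.

```r
library(ggplot2)

p_shape <- ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) +
  geom_point() +
  # Lines are all the same color, yet there is one per value of cyl
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_shape)
```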
We could also group by size:
[Figures: hwy against displ; legend: factor(cyl)]
If you use multiple grouping variables, groups will be defined as unique combinations of each of
the levels.
> library(MASS)
> ggplot(mpg, aes(displ, hwy, color = factor(cyl),
+                 shape = factor(year),
+                 linetype = factor(year))) +
+   geom_point() +
+   stat_smooth(method = "rlm")
[Figure: hwy against displ; legends: factor(cyl) and factor(year) (1999, 2008)]
Grouping isn't only useful for smoothing functions. Boxplots, for example, can be grouped:
[Figures: boxplots of hwy by class, grouped by factor(year) (1999, 2008); in the second figure
the classes appear in the order pickup, suv, minivan, 2seater, subcompact, compact, midsize]
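A sketch of grouped boxplots, assuming the year is mapped to fill. For the second plot, the order of the classes in the figure looks like a sort by hwy, so I use reorder() with the median; that is a guess about how the figure was made, not something stated in the notes.

```r
library(ggplot2)

# Boxplots of hwy by class, split by model year within each class
p_year <- ggplot(mpg, aes(class, hwy, fill = factor(year))) +
  geom_boxplot()

# Same plot, with classes sorted by their median hwy (assumed ordering)
p_sorted <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy,
                            fill = factor(year))) +
  geom_boxplot() +
  xlab("class")

print(p_year)
print(p_sorted)
```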
5 Faceting
A very useful visualization technique is the small multiple, i.e. multiple rows and columns of
panels in one graph. In base graphics this is achieved with par(mfrow=c()). In ggplot2, it is
known as faceting, and there are two ways of achieving it: facet_wrap() and facet_grid().
facet_wrap() creates and labels a plot for every level of a factor which is passed to it. For
example:
[Figure: hwy against displ, one panel per year (1999, 2008)]
[Figure: hwy against displ, one panel per manufacturer (audi, chevrolet, dodge, ford, honda,
hyundai, jeep, land rover, lincoln, mercury, nissan, pontiac, subaru, toyota, volkswagen)]
One important thing to note here is that the x and y scales are the same in every facet.
If you would like free scales on each of the facets, just modify your facet line as:
facet_wrap(~factor, scales="free").
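A minimal sketch of facet_wrap(), assuming the scatterplot of displ against hwy and the two faceting variables visible in the figures (year and manufacturer). The plot names are made up for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# One panel per model year
p_by_year <- p + facet_wrap(~year)

# One panel per manufacturer, all panels sharing the same axes
p_by_make <- p + facet_wrap(~manufacturer)

# Same panels, but each with its own x and y scales
p_free <- p + facet_wrap(~manufacturer, scales = "free")

print(p_by_year)
print(p_free)
```

scales also accepts "free_x" and "free_y" if only one axis should vary between panels.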
With two variables, you can facet with facet_grid(). Recall the tips data shown in class by Prof
Chen:
  TOTBILL
1   16.99
2   10.34
3   21.01
4   23.68
5   24.59
6   25.29
[Figure: TIP/TOTBILL against SIZE, in a grid of facets]
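The tips data is not bundled with ggplot2, so here is a sketch of facet_grid() on mpg instead. The choice of drv for the rows and year for the columns is mine, purely to illustrate the rows ~ columns formula.

```r
library(ggplot2)

p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # Rows: drive type (4, f, r); columns: model year (1999, 2008)
  facet_grid(drv ~ year)

print(p_grid)
```

The variable to the left of ~ defines the rows of the grid and the variable to the right defines the columns.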
6 Positioning
How geoms are positioned relative to each other is another feature of plots that you might want to
adjust. The possible position adjustments in ggplot2 are:
position_dodge()
position_fill()
position_identity()
position_jitter()
position_stack()
We will use another data set for the demonstration of positioning in ggplot2. The data set is
called diamonds.
> data(diamonds)
> head(diamonds)
> summary(diamonds)
Here are some demonstrations of these positions:
[Figure: stacked bars of count by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), filled by
cut (Fair, Good, Very Good, Premium, Ideal)]
> p + geom_bar(position = "fill")
[Figure: bars of count by clarity, filled by cut, each bar scaled to a height of 1]
> p + geom_bar(position = "dodge")
[Figure: bars of count by clarity, filled by cut, placed side by side]
> p + geom_bar(position = "identity", alpha = 0.2)
[Figure: bars of count by clarity, filled by cut, overlaid on top of each other]
In the identity case, the parameter alpha controls the level of transparency. The lower the
number, the more transparent the bins are.
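The position adjustments above can be compared side by side with the sketch below. It assumes that p maps clarity to the x-axis and cut to the fill color, which is what the figures suggest, and it uses geom_bar() because clarity is a discrete variable.

```r
library(ggplot2)

# Assumed base plot: clarity on the x-axis, bars filled by cut
p <- ggplot(diamonds, aes(clarity, fill = cut))

p_stack    <- p + geom_bar(position = "stack")     # bars on top of each other
p_fill     <- p + geom_bar(position = "fill")      # stacked, rescaled to height 1
p_dodge    <- p + geom_bar(position = "dodge")     # bars side by side
p_identity <- p + geom_bar(position = "identity",  # bars overlaid in place
                           alpha = 0.2)

print(p_stack)
print(p_fill)
```

position = "stack" is the default for bars, so p + geom_bar() gives the same picture as p_stack.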
7 Scales
Every aesthetic which is mapped to the data expresses the magnitude of its value along some scale.
These can be adjusted using the scale_*() functions.
The most common scale adjustments are for the x and y axes. The most basic way to adjust the
x and y scales for continuous data is with scale_x_continuous() or scale_y_continuous().
Some examples of scale manipulation:
> p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
> p + scale_x_continuous(name = "Engine Displacement in Liters")
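A slightly fuller sketch of scale adjustments on the same scatterplot. The axis titles, breaks and limits below are values I picked for illustration; name, breaks and limits are the arguments that set the axis title, the tick positions and the range.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

p_scaled <- p +
  scale_x_continuous(name = "Engine displacement (liters)",
                     breaks = 2:7,        # tick marks at whole liters
                     limits = c(1, 8)) +  # range wide enough to keep all points
  scale_y_continuous(name = "Highway miles per gallon")

print(p_scaled)
```

Note that observations falling outside limits are dropped from the plot, so choose limits that cover the data you want to show.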
Some people don't like the default discrete colors. With scale_color_brewer() you can set the
color palette to one of the RColorBrewer palettes. To see the possible options:
> library(RColorBrewer)
> display.brewer.all()
If you like Set1 for qualitative differences:
> p + scale_color_brewer(palette = "Set1")
where p is your ggplot2 object.
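The brewer scale only has a visible effect when a variable is mapped to color, so here is a sketch with factor(cyl) as the colored variable (my choice for illustration).

```r
library(ggplot2)

p_brewer <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() +
  # Replace the default discrete colors with the "Set1" palette
  scale_color_brewer(palette = "Set1")

print(p_brewer)
```

For a fill mapping, such as the bars in the positioning section, the matching function is scale_fill_brewer().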
8 Some Examples
In this section, we look at some more advanced examples of plots made with ggplot2.
We've seen the basic histograms in ggplot2, where frequency is represented by bins. There are a
number of variations on a histogram. They all use the same statistical transformation underlying
a histogram, the bin stat, but use different geoms to display the results.
[Figure: count against carat]
> d + stat_bin(
+   aes(size = ..density..), binwidth = 0.1,
+   geom = "point", position = "identity"
+ )
[Figure: count against carat; legend: density]
> d + stat_bin(
+   aes(y = 1, fill = ..count..), binwidth = 0.1,
+   geom = "tile", position = "identity"
+ )
[Figure: tiles along carat; legend: count]
The first histogram shown here uses the area geom to display frequency, the second uses the
point geom, and the third the tile geom.
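A sketch of the area and point variants in current ggplot2 syntax. The definition of d is an assumption on my part (carat from the diamonds data, with the x-axis limited to 0 to 3), and after_stat() is the newer spelling of the ..density.. notation, available from ggplot2 3.3.0. The tile variant is left out of this sketch.

```r
library(ggplot2)

# Assumed base plot; rows with carat above 3 are dropped with a warning
d <- ggplot(diamonds, aes(carat)) + xlim(0, 3)

# Bin counts drawn as a filled area
p_area <- d + stat_bin(binwidth = 0.1, geom = "area")

# Bin counts drawn as points, with point size showing the density
p_point <- d + stat_bin(
  aes(size = after_stat(density)), binwidth = 0.1,
  geom = "point", position = "identity"
)

print(p_area)
print(p_point)
```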
We showed this plot briefly in last week's session. This is again the diamonds data set. It
shows the distribution of depth, broken down by cut.
[Figure: count (scaled from 0 to 1) against depth, filled by cut (Fair, Good, Very Good,
Premium, Ideal)]
We see that this is essentially a histogram with a very small binwidth, and it provides a much
better visual presentation than traditional bars.
The last example that I'd like to discuss is drawing maps in ggplot2. ggplot2 provides some
tools that make it easy to combine maps from the maps package with other ggplot2 graphics. To be
able to draw maps, you have to install the package maps in addition to ggplot2 (and, of course,
don't forget to load it before use).
The example below shows the crime statistics for all states in the United States (excluding HI
and AK). The crime data is the USArrests data set that ships with R; the state outlines come
from the package maps.
> library(maps)
> states <- map_data("state")
> arrests <- USArrests
> names(arrests) <- tolower(names(arrests))
> arrests$region <- tolower(rownames(USArrests))
> choro <- merge(states, arrests, by = "region")
> # Reorder the rows because order matters when drawing polygons
> choro <- choro[order(choro$order), ]
> qplot(long, lat, data = choro, group = group, fill = assault,
+   geom = "polygon")
[Figure: map of the United States (lat against long), states filled by assault]
> qplot(long, lat, data = choro, group = group, fill = assault / murder,
+ geom="polygon")
[Figure: map of the United States (lat against long), states filled by assault/murder]
Drawing maps usually involves the polygon geom, which is otherwise rarely used. Details of this
geom can be found here: http://docs.ggplot2.org/current/geom_polygon.html. Briefly speaking,
you need two data frames for this geom: one contains the coordinates of each polygon
(positions), and the other contains the values associated with each polygon (values). The two
are merged before plotting, and because merge() disrupts the row ordering, we had to reorder our
data frame choro so that the vertices of each polygon are drawn in sequence.
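The two-data-frame idea can be tried without any map data. Below is a toy sketch with two unit squares; the names ids, values, positions and datapoly, and the explicit order column, are made up for illustration.

```r
library(ggplot2)

ids <- c("a", "b")

# values: one row per polygon
values <- data.frame(id = ids, value = c(3, 5))

# positions: one row per vertex, with an explicit drawing order
positions <- data.frame(
  id    = rep(ids, each = 4),
  order = rep(1:4, times = 2),
  x     = c(0, 1, 1, 0,  2, 3, 3, 2),
  y     = c(0, 0, 1, 1,  0, 0, 1, 1)
)

# merge() does not promise to keep the vertex order, so sort again afterwards
datapoly <- merge(positions, values, by = "id")
datapoly <- datapoly[order(datapoly$id, datapoly$order), ]

p_poly <- ggplot(datapoly, aes(x, y, group = id, fill = value)) +
  geom_polygon()

print(p_poly)
```

The group aesthetic tells geom_polygon() which vertices belong to the same polygon, just as group = group does in the map example.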
9 Resources ggplot2
The best resource for low-level details will always be the built-in documentation. This
documentation is accessible online at: http://docs.ggplot2.org/current/. You can also use the
usual help syntax (help() or ?) in R to access the contents. The online documentation is more
convenient, as you can see all the example plots and navigate between topics easily.