Introduction to ggplot2

STAT 361/661 Data Analysis

Saier (Vivien) Ye
September 16, 2013

1 Intro
When it comes to producing graphics in R , there are basically three options:
1. base graphics
2. lattice
3. ggplot2
We introduced base graphics in last week's R session. Base graphics are attractive and
flexible, but the code needed for more complex plots quickly becomes cumbersome, often
involving many loops.
Both lattice and ggplot2 make creating complex plots easier. The lattice package uses grid
graphics to implement the trellis graphics system and is a considerable improvement over base
graphics. However, lattice lacks a formal model, which can make it hard to extend. ggplot2
has gained significant popularity in recent years and has become a mainstream package for
making complex graphics.

2 Basics
In the book ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham, the author of ggplot2,
describes it as an R package for producing statistical, or data, graphics that differs
from other graphics packages because it has a deep underlying grammar. This grammar
is based on the Grammar of Graphics, hence the name gg-plot. The basic notion is that there is a
grammar to the composition of graphical components in statistical graphics. This makes ggplot2
very powerful: by directly controlling the grammar, you can generate a large set of graphics
tailored to your particular needs. You are no longer limited to a set of pre-specified graphics.
To install ggplot2, make sure you have a recent version of R (at least version 2.8). Type
install.packages("ggplot2") in the R console to install the package. If you work in
RStudio, go to the Packages window, click on Install Packages and search for ggplot2. Details on
installing R and RStudio can be found in the notes from last week's R help session. To
use the package, you have to load it at the start of every session with the command library("ggplot2").
The ggplot2 package comes with a number of built-in data sets for demonstration purposes. In this
session, we will work with a data set called mpg. All the examples are from the book
mentioned above.

Saier (Vivien) Ye, Department of Statistics, Yale University 2013


An overview of the data mpg:
> data(mpg)
> ?mpg
Before plotting, it is always useful to perform a sanity check on the data. This helps you gain a
general idea of the structure of the data, and spot abnormality in the data if there is any.
> summary(mpg)
> head(mpg)
> library("YaleToolkit")
> whatis(mpg)
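If YaleToolkit is not installed, a similar structural overview can be sketched with base R alone. The particular checks below (dimensions, missing values per column, a frequency table of cyl) are illustrative choices, not part of the original session:

```r
library(ggplot2)  # provides the mpg data set

dim(mpg)              # number of rows and columns
str(mpg)              # type and first few values of every column
colSums(is.na(mpg))   # count of missing values per column
table(mpg$cyl)        # number of cars for each cylinder count
```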

There's a quick plotting function in ggplot2 called qplot(), which is very similar to the plot()
function from base graphics. A simple qplot() command looks like the following:
> qplot(displ, hwy, data = mpg, colour = factor(cyl))
You can do a lot with qplot() alone, but the main disadvantage is that it only permits a single
dataset and a single set of aesthetic mappings. ggplot2 is designed to work in a layered fashion,
such that each graphical component is added to the plot as a layer. Each layer can come from a
different dataset and have a different aesthetic mapping, allowing us to create plots that could not
be generated using qplot().
A more systematic way to use the ggplot2 package is to build plots with the function ggplot().
The function takes two primary arguments: data and an aesthetic mapping. These arguments set up
defaults for the plot and can be omitted if you specify data and aesthetics when adding each layer.
data is the data frame that you want to visualize, and the aes() mappings are passed on to the plot
elements. A simple example:
> p <- ggplot(mpg, aes(displ, hwy))
With this function, we have set up a plot which is going to draw from the data frame mpg; the
variable displ will be mapped to the x-axis, and the variable hwy to the y-axis.
However, if you just type p or print(p) in the R console, you'll get back a message saying that the plot
lacks any layers (recent versions of ggplot2 draw an empty panel instead). Looking at the command,
we have not specified which kind of geometric object will represent the data. Let's add points,
for a scatterplot.
> p + geom_point()
You add geometries to a plot with one of the geom_*() functions, using the + operator. Our
command now has two parts, connected by +. This is what we meant by saying ggplot2 works
in layers. We use layers to add various features to the graph, and to customize the graph based on
our needs.
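As a minimal sketch of what layering makes possible, the example below draws two point layers from two different data frames. The summary data frame cyl_means is a hypothetical helper built here for illustration; it is not part of ggplot2 or of the original examples.

```r
library(ggplot2)

# Hypothetical second data set: mean displ and hwy for each cylinder count
cyl_means <- aggregate(cbind(displ, hwy) ~ cyl, data = mpg, FUN = mean)

p_layers <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "grey60") +                   # layer 1: raw data from mpg
  geom_point(data = cyl_means,                      # layer 2: its own data frame
             aes(colour = factor(cyl)), size = 4)   # plus an extra colour mapping
p_layers
```

The second layer still inherits x = displ and y = hwy from ggplot(); only the data frame and the colour mapping are specific to it.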
Notice how we didn't write any arguments in geom_point(). In order to map points to values
on the x and y axes, geom_point() needs to know which variables we're mapping to the x and y
axes. It inherited this information from ggplot(). If, however, you supply arguments to the geom_*()
functions, they will override what is in the main ggplot() call.
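A minimal sketch of such an override, assuming the cty (city mileage) column of mpg: the point layer replaces the inherited y mapping while keeping the inherited x mapping. layer_data() is used here only to inspect what the layer actually plots.

```r
library(ggplot2)

p_override <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(y = cty))   # y is overridden for this layer; x = displ is inherited

# Inspect the values the point layer will draw
drawn <- layer_data(p_override, 1)
all(drawn$x == mpg$displ)   # x still comes from ggplot()
all(drawn$y == mpg$cty)     # y comes from the layer's own aes()
```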
The best way to demonstrate this is to make a few plots.

> ggplot(mpg, aes(displ, hwy)) +
+   geom_point(aes(color = factor(cyl))) +
+   geom_line()

[Figure: hwy against displ; points colored by factor(cyl), line uncolored.]

The points are colored, the lines are not, and a legend has automatically been added.
Next, we'll pass the color mapping to the line, not the points:

> ggplot(mpg, aes(displ, hwy)) +
+   geom_point() +
+   geom_line(aes(color = factor(cyl)))

[Figure: hwy against displ; lines colored by factor(cyl), points uncolored.]

Now the line is colored, and the points are not. It's kind of hard to tell with this plot, but lines
of different colors are not connected. The legend also reflects the fact that the lines are
colored.
Finally, we can include the color mapping in ggplot(), meaning all the geoms that follow
will inherit this mapping:

> ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
+   geom_point() +
+   geom_line()

[Figure: hwy against displ; both points and lines colored by factor(cyl).]

3 Displaying Statistics
You'll frequently want to add statistical analyses to your plots, or your plots may consist of
statistical summaries in the first place. ggplot2 has a few built-in statistics to make plotting easier.
The statistic I use most often is a smoothing line with stat_smooth(). There are a number
of different smoothing lines you can add, from local regression lines (loess) to linear or logistic
regressions. Let's start with the mpg data again.

> p <- ggplot(mpg, aes(displ, hwy))
> p + geom_point() + stat_smooth()

[Figure: hwy against displ with the default smoothing line.]

By default, stat_smooth() has added a loess line, with a 95% confidence band around it drawn as a
semi-transparent ribbon. You could also specify the method argument to add a different smoothing
line:

> p + geom_point() + stat_smooth(method = "lm")

[Figure: hwy against displ with a linear regression line.]

Now, statistics are represented with default geometries. For stat_smooth(), its default geoms
are the semi-transparent ribbon and the smoothing line. You could also represent the output with
points and errorbars.

> p + stat_smooth(geom = "point") +
+   stat_smooth(geom = "errorbar")

[Figure: hwy against displ; smoothed values drawn as points with error bars.]

To compare a numeric variable across the levels of a categorical variable, you can calculate the statistics that make up boxplots:

> ggplot(mpg, aes(class, hwy)) +
+   geom_boxplot()

[Figure: boxplots of hwy for each class: 2seater, compact, midsize, minivan, pickup, subcompact, suv.]

4 Grouping
An important feature of ggplot2 is that you can easily represent data as grouped, and draw geoms and
calculate statistics according to these groupings. We've already seen an example of this, where
lines of different colors aren't connected:

> ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
+   geom_point() +
+   stat_smooth(method = "lm")

[Figure: hwy against displ; points and regression lines colored by factor(cyl).]

We mapped the color aesthetic to the variable .... in ggplot(). When we add points to the
plot, their color is set according to their color group. Same with the regression lines.
There are various ways of mapping groups to the plot, for example, point shape:

> ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) +
+   geom_point() +
+   stat_smooth(method = "lm")

[Figure: hwy against displ; point shape by factor(cyl), with levels 4, 5, 6, 8.]

Now the smoothing lines no longer differ in color, but they have still been grouped and
calculated separately.
We could also group by size:

> ggplot(mpg, aes(displ, hwy, size = factor(cyl))) +
+   geom_point() +
+   stat_smooth(method = "lm")

[Figure: hwy against displ; point size by factor(cyl).]

A silly plot though.


We could also define a grouping which is only meaningful for geom_smooth() and not geom_point().
This will cause each smoothing line to be calculated and appear separately, but the points will be
undifferentiated.

> ggplot(mpg, aes(displ, hwy, linetype = factor(cyl))) +
+   geom_point() +
+   stat_smooth(method = "lm")

[Figure: hwy against displ; regression lines with line type by factor(cyl).]
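The same separation can also be sketched with the group aesthetic, set only in the smoothing layer, so that the points carry no grouping at all. This is an alternative to the linetype mapping, not the code used for the figure:

```r
library(ggplot2)

p_group <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +                                         # one undifferentiated cloud of points
  stat_smooth(aes(group = factor(cyl)), method = "lm")   # a separate fit per cylinder count

# The smoothing layer is computed group by group; a group whose displ values
# are all identical cannot be fitted and may be left out of the result.
length(unique(layer_data(p_group, 2)$group))
```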

If you use multiple grouping variables, groups will be defined as unique combinations of each of
the levels.

> library(MASS)
> ggplot(mpg, aes(displ, hwy, color = factor(cyl),
+                 shape = factor(year),
+                 linetype = factor(year))) +
+   geom_point() +
+   stat_smooth(method = "rlm")

[Figure: hwy against displ; color by factor(cyl), point shape and line type by factor(year) (1999, 2008).]

Grouping isn't only useful for smoothing functions. Boxplots, for example, can be grouped:

> ggplot(mpg, aes(class, hwy, fill = factor(year))) +
+   geom_boxplot()

[Figure: boxplots of hwy by class (2seater, compact, midsize, minivan, pickup, subcompact, suv), filled by factor(year) (1999, 2008).]

You can reorder class according to median(hwy):

> ggplot(mpg, aes(reorder(class, hwy, median), hwy,
+                 fill = factor(year))) +
+   geom_boxplot()

[Figure: boxplots of hwy by reorder(class, hwy, median), filled by factor(year) (1999, 2008); classes appear in the order pickup, suv, minivan, 2seater, subcompact, compact, midsize.]
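To see what reorder() does on its own, outside of a plot, here is a small sketch in base R. It returns a factor whose levels are sorted by the chosen summary of the second argument, and that level order is what determines the order of the boxes on the axis:

```r
library(ggplot2)  # only for the mpg data set

class_by_median <- reorder(mpg$class, mpg$hwy, FUN = median)

levels(class_by_median)                    # classes, from lowest to highest median hwy
tapply(mpg$hwy, class_by_median, median)   # the medians, listed in level order
```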

5 Faceting
A very useful visualization technique is the small multiple, i.e. multiple rows and columns
of panels in one graph. In base graphics this is achieved with par(mfrow = c()). In ggplot2 it is known
as faceting, and there are two ways of achieving it: facet_wrap() and facet_grid().
facet_wrap() creates and labels a plot for every level of a factor which is passed to it. For
example:

> ggplot(mpg, aes(displ, hwy)) +
+   geom_point() +
+   stat_smooth() +
+   facet_wrap(~year)

[Figure: hwy against displ with a smoothing line, in two panels: 1999 and 2008.]

> ggplot(mpg, aes(displ, hwy)) +
+   geom_point() +
+   facet_wrap(~manufacturer)

[Figure: hwy against displ, one panel per manufacturer: audi, chevrolet, dodge, ford, honda, hyundai, jeep, land rover, lincoln, mercury, nissan, pontiac, subaru, toyota, volkswagen.]

One important thing to note here is that the x and y scales are the same in every facet.
If you would like free scales in each facet, just modify your facet line to facet_wrap(~factor, scales = "free").
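A sketch of the free-scales variant, reusing the manufacturer example; scales also accepts "free_x" or "free_y" to free only one axis:

```r
library(ggplot2)

p_free <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~manufacturer, scales = "free")   # each panel gets its own x and y range
p_free
```

Free scales make each panel easier to read on its own, at the cost of making panels harder to compare with one another.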
With two variables, you can facet with facet_grid(). Recall the tips data shown in class by Prof.
Chen:

> tips <- read.table("tips.dat", header = TRUE)
> head(tips)
  TOTBILL  TIP FEMALE SMOKER DAY TIME SIZE
1   16.99 1.01      1      0   6    1    2
2   10.34 1.66      0      0   6    1    3
3   21.01 3.50      0      0   6    1    3
4   23.68 3.31      0      0   6    1    2
5   24.59 3.61      1      0   6    1    4
6   25.29 4.71      0      0   6    1    4

> ggplot(tips, aes(SIZE, TIP/TOTBILL)) +
+   geom_point(position = position_jitter(width = 0.2, height = 0)) +
+   facet_grid(TIME ~ FEMALE)

[Figure: TIP/TOTBILL against SIZE, in a grid of panels with TIME in rows and FEMALE in columns.]

In the facet_grid() command, TIME splits the graph into rows (along the y-axis), and FEMALE splits
it into columns (along the x-axis). So across the columns, 0 stands for male and 1 stands for female,
while down the rows, 0 stands for lunch and 1 stands for dinner. For example, it looks like a male bill
payer in a dinner party of two tipped 70%.
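Because tips.dat is a course file, here is the same rows ~ columns idea sketched on the built-in mpg data instead. The choice of drv and year as faceting variables is only for illustration:

```r
library(ggplot2)

p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ year)   # rows: drive type (drv); columns: model year
p_grid
```

The variable on the left of ~ always defines the rows, and the variable on the right defines the columns.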

6 Positioning
How geoms are positioned relative to each other is another feature of plots that you might want to
adjust. The possible position adjustments in ggplot2 are:
position_dodge()
position_fill()
position_identity()
position_jitter()
position_stack()
We will use another data set for the demonstration of positioning in ggplot2. The data set is
called diamonds.
> data(diamonds)
> head(diamonds)
> summary(diamonds)
Here are some demonstrations of these positions:

> p <- ggplot(diamonds, aes(clarity, fill = cut))
> p + geom_histogram(aes(y = ..count..), position = "stack")

[Figure: stacked bars of count by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), filled by cut (Fair, Good, Very Good, Premium, Ideal).]

> p + geom_histogram(aes(y = ..count..), position = "fill")

[Figure: bars of count by clarity, each scaled to a total height of 1.00, filled by cut.]

> p + geom_histogram(aes(y = ..count..), position = "dodge")

[Figure: bars of count by clarity, with one bar per cut placed side by side.]

> p + geom_histogram(aes(y = ..count..), position = "identity", alpha = 0.2)

[Figure: bars of count by clarity, with semi-transparent bars for each cut drawn on top of one another.]

In the identity case, the parameter alpha controls the level of transparency. The lower the
number, the more transparent the bins are.
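A note for readers on a recent ggplot2 release: geom_histogram() now bins a continuous variable and rejects a discrete x such as clarity, so the four commands above may stop with an error. A sketch of the equivalent plots, assuming geom_bar() (which counts rows per category) as the replacement:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(clarity, fill = cut))

# One plot per position adjustment, collected in a named list
bars <- list(
  stack    = p + geom_bar(position = "stack"),
  fill     = p + geom_bar(position = "fill"),
  dodge    = p + geom_bar(position = "dodge"),
  identity = p + geom_bar(position = "identity", alpha = 0.2)
)

bars$stack   # print any element to draw it, e.g. bars$dodge
```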

7 Scales
Every aesthetic which is mapped to the data expresses the magnitude of its value along some scale.
These can be adjusted using the scale_*() functions.
The most common scale adjustments are for the x and y axes. The most basic way to adjust the
x and y scales for continuous data is with scale_x_continuous() or scale_y_continuous().
Some examples of scale manipulation:
> p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
> # p + scale_x_continuous(name = "Engine Displacement in Liters")
> # or
> p + xlab("Engine Displacement in Liters")
> # p + scale_x_continuous(limits = c(2, 4))
> # or
> p + xlim(2, 4)
> p + scale_x_continuous(trans = "log10")
> # or
> p + scale_x_log10()

Some people don't like the default discrete colors. With scale_color_brewer() you can set the
color palette to one of the RColorBrewer palettes. To see the possible options:
> library(RColorBrewer)
> display.brewer.all()
If you like Set1 for qualitative differences:
> # p + scale_color_brewer(palette = "Set1")
where p is a ggplot2 object with a discrete variable mapped to color.
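A minimal sketch with a plot that actually maps a variable to color (the scale has nothing to act on otherwise). The plot itself is an illustrative choice; palette is the argument name in current ggplot2:

```r
library(ggplot2)

p_col <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Set1")   # qualitative palette from RColorBrewer
p_col

# The colours actually assigned to the points, one per level of factor(cyl)
cols <- unique(layer_data(p_col, 1)$colour)
cols
```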

8 Some Examples
In this section, we look at some more advanced examples of plots made with ggplot2.
We've seen the basic histograms in ggplot2, where frequency is represented by bins. There are a
number of variations on a histogram. They all use the same statistical transformation underlying
a histogram, the bin stat, but use different geoms to display the results.

> d <- ggplot(diamonds, aes(carat)) + xlim(0, 3)
> d + stat_bin(aes(ymax = ..count..), binwidth = 0.1, geom = "area")

[Figure: area plot of count against carat.]

> d + stat_bin(
+   aes(size = ..density..), binwidth = 0.1,
+   geom = "point", position = "identity"
+ )

[Figure: count against carat drawn as points, with point size showing density.]

> d + stat_bin(
+   aes(y = 1, fill = ..count..), binwidth = 0.1,
+   geom = "tile", position = "identity"
+ )

[Figure: a single row of tiles along carat, with fill showing count.]

The first histogram shown here uses an area geom to display frequency, the second uses the point
geom, and the third uses the tile geom.
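In ggplot2 3.3.0 and later, the ..count.. and ..density.. notation has a documented replacement, after_stat(). A sketch of the area variant in that notation, plotting density instead of count as an illustrative choice:

```r
library(ggplot2)

d <- ggplot(diamonds, aes(carat)) + xlim(0, 3)
p_area <- d + stat_bin(aes(y = after_stat(density)), binwidth = 0.1, geom = "area")

# Check on the computed values: bin densities times the bin width add up to about 1.
# xlim() drops diamonds above 3 carats with a warning, which is silenced here.
binned <- suppressWarnings(layer_data(p_area, 1))
sum(binned$density) * 0.1
```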
We showed the next plot briefly in last week's session. It again uses the diamonds data set and
shows the distribution of depth, broken down by cut.

> depth_dist <- ggplot(diamonds, aes(depth)) + xlim(58, 68)
> depth_dist +
+   geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill")

[Figure: filled histogram of depth, scaled to a total height of 1.00, filled by cut (Fair, Good, Very Good, Premium, Ideal).]

We see that this is essentially a histogram with a very small binwidth, and it provides a much
better visual presentation than traditional bars.
The last example that I'd like to discuss is drawing maps in ggplot2. ggplot2 provides some
tools to make it easy to combine maps from the maps package with other ggplot2 graphics. To be
able to draw maps, you have to install the package maps in addition to ggplot2 (and, of course,
don't forget to load it before use).
The example below shows crime statistics for the states of the United States (excluding HI
and AK). The crime data come from the USArrests data set that ships with R, and the state
outlines come from the maps package.

> library(maps)
> states <- map_data("state")
> arrests <- USArrests
> names(arrests) <- tolower(names(arrests))
> arrests$region <- tolower(rownames(USArrests))
> choro <- merge(states, arrests, by = "region")
> # Reorder the rows because order matters when drawing polygons
> choro <- choro[order(choro$order), ]
> qplot(long, lat, data = choro, group = group, fill = assault,
+       geom = "polygon")

[Figure: map of the United States (lat against long), with states filled by assault.]

> qplot(long, lat, data = choro, group = group, fill = assault / murder,
+       geom = "polygon")

[Figure: map of the United States (lat against long), with states filled by assault/murder.]

Drawing maps usually involves the polygon geom, which is otherwise relatively rarely used.
Details of this geom can be found here: http://docs.ggplot2.org/current/geom_polygon.html.
Briefly speaking, the data for this geom typically comes from two data frames: one contains the
coordinates of each polygon (positions), and the other contains the values associated with each
polygon (values). The two are merged before plotting. Polygons are drawn by connecting points in
row order, and merge() disrupts that order, which is why we had to reorder our data frame choro.
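The positions/values pattern can be sketched on a toy example that does not need the maps package. The two squares, their ids and their values are hypothetical data made up for illustration:

```r
library(ggplot2)

# One row per polygon: the value to display
values <- data.frame(id = c("a", "b"), value = c(1, 3))

# One row per corner: coordinates plus the drawing order within each polygon
positions <- data.frame(
  id    = rep(c("a", "b"), each = 4),
  order = rep(1:4, times = 2),
  x     = c(0, 1, 1, 0,  2, 3, 3, 2),
  y     = c(0, 0, 1, 1,  0, 0, 1, 1)
)

polys <- merge(positions, values, by = "id")
polys <- polys[order(polys$id, polys$order), ]   # make the drawing order explicit after merge()

p_poly <- ggplot(polys, aes(x, y, group = id, fill = value)) +
  geom_polygon()
p_poly
```

The group aesthetic plays the same role as group = group in the map example: it tells geom_polygon() which rows belong to the same polygon.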

9 Resources ggplot2
The best resource for low-level details will always be the built-in documentation. This documentation
is accessible online at http://docs.ggplot2.org/current/. You can also use the usual help
syntax (help() or ?) in R to access the contents. The online documentation is more convenient,
as you can see all the example plots and navigate between topics easily.
The official CRAN page, http://cran.r-project.org/web/packages/ggplot2/, is another
useful resource. This page links to what is new and different in each release.
Lastly, the book we mentioned earlier by Hadley Wickham, the author of ggplot2,
http://www.amazon.com/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20, provides a detailed and
comprehensive introduction to data analysis with ggplot2. The book website,
http://ggplot2.org/book/, contains updates to the book, as well as all graphics used in the
book, with the code and data needed to reproduce them.
