Data Visualization With R Ggplot2


DATA VISUALIZATION

WITH R: GGPLOT2
Content
• Getting started with ggplot2
• Toolbox
• Build a plot layer by layer
• Scales, axes and legends


Fuel economy data
library(ggplot2)
mpg
• The variables are mostly self-explanatory:
• cty and hwy record miles per gallon (mpg) for city and highway driving.
• displ is the engine displacement in litres.
• drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
• model is the model of car. There are 38 models, selected because they had a new edition every
year between 1999 and 2008.
• class (not shown) is a categorical variable describing the “type” of car: two seater, SUV,
compact, etc.
Key components
• Every ggplot2 plot has three key components:
1. Data,
2. A set of aesthetic mappings between variables in the data and visual properties, and
3. At least one layer which describes how to render each observation. Layers are usually created
with a geom function.
Example
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
OR
ggplot(mpg, aes(displ, hwy)) +
geom_point()
Example
• This produces a scatterplot defined by:
1. Data: mpg.
2. Aesthetic mapping: engine size mapped to x position, fuel economy to y position.
3. Layer: points.
Pay attention to the structure of this function call: data and aesthetic mappings are supplied in
ggplot(), then layers are added on with +. This is an important pattern, and as you learn more about
ggplot2 you’ll construct increasingly sophisticated plots by adding on more types of components.
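The layering pattern can be made concrete by building the plot in two steps and looking inside the plot object. This is only a sketch: the name p and the use of p$layers and p$mapping to inspect the object are choices made here for illustration, not something the slides depend on.

```r
library(ggplot2)

# Data and aesthetic mappings only: axes are set up, but nothing is drawn yet
p <- ggplot(mpg, aes(x = displ, y = hwy))
length(p$layers)   # number of layers before adding a geom

# Add one layer with +
p <- p + geom_point()
length(p$layers)   # number of layers after adding geom_point()

# The mappings supplied in ggplot() are stored on the plot object
names(p$mapping)
```

Every further component (more geoms, facets, scales) is added with + in the same way.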
Exercise 1
1. How would you describe the relationship between cty and hwy? Do you have any concerns about
drawing conclusions from that plot?
2. What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How
could you modify the data to make it more informative?
3. Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need
to guess a little because you haven’t seen all the datasets and functions yet, but use your common
sense! See if you can predict what the plot will look like before running the code.
1. ggplot(mpg, aes(cty, hwy)) + geom_point()
2. ggplot(diamonds, aes(carat, price)) + geom_point()
3. ggplot(economics, aes(date, unemploy)) + geom_line()
4. ggplot(mpg, aes(cty)) + geom_histogram()
Colour, size, shape and other aesthetic attributes
• To add additional variables to a plot, we can use other aesthetics like colour, shape, and size. These
work in the same way as the x and y aesthetics, and are added into the call to aes():
• aes(displ, hwy, colour = class)
• aes(displ, hwy, shape = drv)
• aes(displ, hwy, size = cyl)
Colour, size, shape and other aesthetic attributes
• ggplot2 takes care of the details of converting data (e.g., ‘f’, ‘r’, ‘4’) into aesthetics (e.g., ‘red’,
‘yellow’, ‘green’) with a scale. There is one scale for each aesthetic mapping in a plot. The scale is
also responsible for creating a guide, an axis or legend, that allows you to read the plot, converting
aesthetic values back into data values. For now, we’ll stick with the default scales provided by
ggplot2. To learn more about the outlying points in the previous scatterplot, we could map the
class variable to colour:
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
Colour, size, shape and other aesthetic attributes
• If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer
outside of aes(). Compare the following two plots:
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
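The difference between the two calls can be checked without looking at the rendered plots. In the first, "blue" sits inside aes(), so it is treated as a data value and the colour scale chooses the colour (and draws a legend). In the second, "blue" is passed straight to the layer as an R colour name. The sketch below uses layer_data() to show the colours that would actually be drawn; the names p_mapped and p_set are made up for this example.

```r
library(ggplot2)

# Mapped: "blue" is data, so the default colour scale picks the colour
p_mapped <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))

# Set: "blue" is used directly as a colour
p_set <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

# layer_data() returns the values used for drawing each layer
unique(layer_data(p_mapped)$colour)  # a colour chosen by the scale, not "blue"
unique(layer_data(p_set)$colour)     # "blue"
```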
Colour, size, shape and other aesthetic attributes
• Different types of aesthetic attributes work better with different types of variables. For example,
colour and shape work well with categorical variables, while size works well for continuous
variables. The amount of data also makes a difference: if there is a lot of data it can be hard to
distinguish different groups. An alternative solution is to use facetting, as described next.
• When using aesthetics in a plot, less is usually more. It’s difficult to see the simultaneous
relationships among colour and shape and size, so exercise restraint when using aesthetics. Instead
of trying to make one very complex plot that shows everything at once, see if you can create a
series of simple plots that tell a story, leading the reader from ignorance to knowledge.
Facetting
• Another technique for displaying additional categorical variables on a plot is facetting. Facetting
creates tables of graphics by splitting the data into subsets and displaying the same graph for each
subset.
• There are two types of facetting: grid and wrapped. Wrapped is the most useful, so we’ll discuss it
here, and you can learn about grid facetting later. To facet a plot you simply add a facetting
specification with facet_wrap(), which takes the name of a variable preceded by ~.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
Plot geoms
• You might guess that by substituting geom_point() for a different geom function, you’d get a
different type of plot. That’s a great guess! In the following sections, you’ll learn about some of
the other important geoms provided in ggplot2. This isn’t an exhaustive list, but should cover the
most commonly used plot types.
• geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
• geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of
points.
• geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
• geom_bar() shows the distribution of categorical variables.
• geom_path() and geom_line() draw lines between the data points. A line plot is constrained to
produce lines that travel from left to right, while paths can go in any direction. Lines are
typically used to explore how things change over time.
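The difference between geom_line() and geom_path() shows up as soon as the x values are not sorted. A minimal sketch with a made-up three-row data frame, using layer_data() to inspect the order in which each geom connects the points:

```r
library(ggplot2)

# Three observations whose x values are deliberately out of order
df <- data.frame(x = c(3, 1, 5), y = c(2, 4, 6))

line_data <- layer_data(ggplot(df, aes(x, y)) + geom_line())
path_data <- layer_data(ggplot(df, aes(x, y)) + geom_path())

line_data$x  # sorted by x, so the line travels left to right
path_data$x  # kept in data order, so the path can double back
```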
Adding a smoother to a plot
• If you have a scatterplot with a lot of noise, it can be hard to see the dominant pattern. In this case
it’s useful to add a smoothed line to the plot with geom_smooth():
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
• If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
Adding a smoother to a plot
• An important argument to geom_smooth() is the method, which allows you to choose which type
of model is used to fit the smooth curve:
• method = "loess", the default for small n, uses a smooth local regression (as described in ?loess).
The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly
wiggly) to 1 (not so wiggly).
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span = 0.2)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span = 1)
Adding a smoother to a plot
• method = "gam" fits a generalised additive model provided by the mgcv package. You need to first
load mgcv, then use a formula like formula = y ~ s(x) or y ~ s(x, bs = "cs") (for large data). This is
what ggplot2 uses when there are more than 1,000 points.
library(mgcv)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x))
Adding a smoother to a plot
• method = "lm" fits a linear model, giving the line of best fit.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm")
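To see that the "line of best fit" is an ordinary least-squares fit, it can be compared with a model fitted directly with lm(). This sketch assumes the default formula y ~ x (passed explicitly here to silence the message) and uses layer_data() to pull the fitted line out of the second layer; the object names are illustrative.

```r
library(ggplot2)

# Fit the same model by hand
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope of the line

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smoother: x is a grid of displ values, y the fitted hwy
smooth_data <- layer_data(p, 2)
fitted_by_lm <- predict(fit, newdata = data.frame(displ = smooth_data$x))

# The smoother's y values should agree with predictions from lm()
all.equal(smooth_data$y, unname(fitted_by_lm))
```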
Adding a smoother to a plot
• method = "rlm" works like lm(), but uses a robust fitting algorithm so that outliers don’t affect the
fit as much.
library(MASS)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "rlm")
Boxplots and jittered points
• When a set of data includes a categorical variable and one or more continuous variables, you will
probably be interested to know how the values of the continuous variables vary with the levels of
the categorical variable. Say we’re interested in seeing how fuel economy varies with drivetrain (drv).
We might start with a scatterplot like this:
ggplot(mpg, aes(drv, hwy)) +
geom_point()
Boxplots and jittered points
• Because there are few unique values of both drv and hwy, there is a lot of overplotting. Many
points are plotted in the same location, and it’s difficult to see the distribution. There are three
useful techniques that help alleviate the problem:
• Jittering, geom_jitter(), adds a little random noise to the data which can help avoid
overplotting.
• Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of
summary statistics.
• Violin plots, geom_violin(), show a compact representation of the “density” of the
distribution, highlighting the areas where more points are found.
Boxplots and jittered points
ggplot(mpg, aes(drv, hwy)) + geom_jitter()
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
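The "handful of summary statistics" behind a boxplot can be inspected directly. The sketch below computes the median of hwy for each drivetrain by hand and compares it with the middle column that layer_data() reports for the boxplot layer; the variable names are illustrative.

```r
library(ggplot2)

# Median highway mileage per drivetrain, computed by hand
medians <- tapply(mpg$hwy, mpg$drv, median)
medians

# The statistics ggplot2 computed for each box
box_data <- layer_data(ggplot(mpg, aes(drv, hwy)) + geom_boxplot())
box_data[, c("ymin", "lower", "middle", "upper", "ymax")]

# The thick line in each box is the median
all.equal(as.numeric(medians), box_data$middle)
```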
Histograms and frequency polygons
• Histograms and frequency polygons show the distribution of a single numeric variable. They
provide more information about the distribution of a single group than boxplots do, at the expense
of needing more space.
ggplot(mpg, aes(hwy)) + geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_freqpoly()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histograms and frequency polygons
• You can control the width of the bins with the binwidth argument (if you don’t want evenly spaced
bins you can use the breaks argument). It is very important to experiment with the bin width. The
default just splits your data into 30 bins, which is unlikely to be the best choice. You should
always try many bin widths, and you may find you need multiple bin widths to tell the full story of
your data.
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 2.5)
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 1)
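What binwidth does can be checked on the binned data itself. This sketch (object names are illustrative) bins hwy with binwidth = 2.5 and confirms that every bin has that width and that the counts add up to the number of rows in mpg.

```r
library(ggplot2)

hist_data <- layer_data(
  ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)
)

# One row per bin: its edges and the number of observations in it
hist_data[, c("xmin", "xmax", "count")]

# Every bin is 2.5 mpg wide, and no observation is lost
unique(round(hist_data$xmax - hist_data$xmin, 8))
sum(hist_data$count) == nrow(mpg)
```

Re-running this with a different binwidth is a quick way to see how many bins a given choice produces.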
Histograms and frequency polygons
• To compare the distributions of different subgroups, you can map a categorical variable to either
fill (for geom_histogram()) or colour (for geom_freqpoly()). It’s easier to compare distributions
using the frequency polygon because the underlying perceptual task is easier. You can also use
facetting: this makes comparisons a little harder, but it’s easier to see the distribution of each
group.
ggplot(mpg, aes(displ, colour = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
Bar charts
• The discrete analogue of the histogram is the bar chart, geom_bar(). It’s easy to use:
ggplot(mpg, aes(manufacturer)) +
geom_bar()
Bar charts
• Bar charts can be confusing because there are two rather different plots that are both commonly
called bar charts. The above form expects you to have unsummarised data, and each observation
contributes one unit to the height of each bar. The other form of bar chart is used for
presummarised data. For example, you might have three drugs with their average effect:
drugs <- data.frame(
drug = c("a", "b", "c"),
effect = c(4.2, 9.7, 6.1)
)
• To display this sort of data, you need to tell geom_bar() to not run the default stat which bins and
counts the data. However, I think it’s even better to use geom_point() because points take up less
space than bars, and don’t require that the y axis includes 0.
ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
ggplot(drugs, aes(drug, effect)) + geom_point()
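For presummarised data, geom_col() is a shorthand for geom_bar(stat = "identity"). The sketch below assumes a ggplot2 version that provides geom_col() (2.2.0 or later) and checks that both forms give bars of the same height.

```r
library(ggplot2)

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

p_identity <- ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
p_col <- ggplot(drugs, aes(drug, effect)) + geom_col()

# Both layers draw bars whose heights are the effect values themselves
all.equal(layer_data(p_identity)$y, layer_data(p_col)$y)
```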
Time series with line and path plots
• Because the year variable in the mpg dataset only has two values, we’ll show some time series
plots using the economics dataset, which contains economic data on the US measured over the last
40 years. The figure below shows two plots of unemployment over time, both produced using
geom_line(). The first shows the unemployment rate while the second shows the median number
of weeks unemployed. We can already see some differences in these two variables, particularly in
the last peak, where the unemployment percentage is lower than it was in the preceding peaks, but
the length of unemployment is high.
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
ggplot(economics, aes(date, uempmed)) +
geom_line()
Time series with line and path plots
• To examine this relationship in greater detail, we would like to draw both time series on the same
plot. We could draw a scatterplot of unemployment rate vs. length of unemployment, but then we
could no longer see the evolution over time. The solution is to join points adjacent in time with
line segments, forming a path plot.
ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path() +
  geom_point()

# Helper: extract the calendar year from a date
year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(colour = "grey50") +
  geom_point(aes(colour = year(date)))
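The year() helper used above works because a POSIXlt object stores the year as the number of years since 1900. A quick check on two arbitrary dates (chosen here only for illustration):

```r
# Same helper as in the plot above
year <- function(x) as.POSIXlt(x)$year + 1900

# POSIXlt counts years from 1900, hence the + 1900
year(as.Date(c("1967-07-01", "2008-12-31")))  # 1967 2008
```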
Modifying the axes
• Two families of useful helpers let you make the most common modifications. xlab() and ylab() modify the x- and y-axis
labels:
ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3)

ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) +
  xlab("city driving (mpg)") +
  ylab("highway driving (mpg)")

• Remove the axis labels with NULL:
ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) +
  xlab(NULL) +
  ylab(NULL)
Modifying the axes
• xlim() and ylim() modify the limits of axes:
ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25)

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25) +
  xlim("f", "r") +
  ylim(20, 30)
#> Warning: Removed 138 rows containing missing values (geom_point).

# For continuous scales, use NA to set only one limit
ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25, na.rm = TRUE) +
  ylim(NA, 30)
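The warning in the second plot arises because values outside the limits are turned into missing values before drawing. A rough way to see which rows are affected is to count them by hand. This is a sketch of that reasoning, not a reproduction of ggplot2's internals, and the variable name outside is made up.

```r
library(ggplot2)

# Rows that fall outside xlim("f", "r") or ylim(20, 30)
outside <- !(mpg$drv %in% c("f", "r")) | mpg$hwy < 20 | mpg$hwy > 30

sum(outside)   # rows dropped by the limits
sum(!outside)  # rows that remain visible
```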
Output
• Most of the time you create a plot object and immediately plot it, but you can also save a plot to a variable
and manipulate it:
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
• Render it on screen with print(). This happens automatically when running interactively, but inside a loop or
function, you’ll need to print() it yourself.
print(p)
• Save it to disk with ggsave():
# Save png to disk
ggsave("plot.png", width = 5, height = 5)
Output
• Briefly describe its structure with summary().
summary(p)
• Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so
you can easily re-create it with readRDS().
saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")
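A round trip through saveRDS() and readRDS() can be tried without leaving files behind by writing to a temporary file. In this sketch the use of tempfile() and the name path are additions for illustration; the slides write to "plot.rds" in the working directory.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

# Write the plot object to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(p, path)
q <- readRDS(path)

# The restored object is still a ggplot and can be printed or extended
inherits(q, "ggplot")
length(q$layers)
```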
Quick plots
• In some cases, you will want to create a quick plot with a minimum of typing. In these cases you
may prefer to use qplot() over ggplot(). qplot() lets you define a plot in a single call, picking a
geom by default if you don’t supply one. To use it, provide a set of aesthetics and a data set:
qplot(displ, hwy, data = mpg)
qplot(displ, data = mpg)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Quick plots
• qplot() assumes that all variables should be scaled by default. If you want to set an aesthetic to a
constant, you need to use I():
qplot(displ, hwy, data = mpg, colour = "blue")
qplot(displ, hwy, data = mpg, colour = I("blue"))
Toolbox
• It is useful to think about the purpose of each layer before it is added. In general, there are three
purposes for a layer:
• To display the data.
• To display a statistical summary of the data.
• To add additional metadata: context, annotations, and references.
Toolbox
• Basic plot types that produce common, ‘named’ graphics like scatterplots and line charts.
• Displaying text.
• Adding arbitrary additional annotations.
• Working with collective geoms, like lines and polygons, that each display multiple rows of data.
• Surface plots to display 3d surfaces in 2d.
• Drawing maps.
• Revealing uncertainty and error, with various 1d and 2d intervals.
• Weighted data.
• Diamonds dataset.
Basic plot types
• These geoms are the fundamental building blocks of ggplot2. They are useful in their own right, but are also
used to construct more complex geoms. Most of these geoms are associated with a named plot: when that
geom is used by itself in a plot, that plot has a special name.
• Each of these geoms is two dimensional and requires both x and y aesthetics. All of them understand colour
(or color) and size aesthetics, and the filled geoms (bar, tile and polygon) also understand fill.

Geoms                                         Plot
geom_area()                                   area plot
geom_bar(stat = "identity")                   bar chart
geom_line()                                   line plot
geom_path()
geom_point()                                  scatterplot
geom_polygon()                                polygons
geom_rect(), geom_tile() and geom_raster()    rectangles
Examples
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
p + geom_point() + ggtitle("point")
p + geom_text() + ggtitle("text")
p + geom_bar(stat = "identity") + ggtitle("bar")
p + geom_tile() + ggtitle("raster")
Examples
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")
Exercise 2
1. What geoms would you use to draw each of the following named plots?
• Scatterplot
• Line chart
• Histogram
• Bar chart
• Pie chart
2. What’s the difference between geom_path() and geom_polygon()? What’s the difference between
geom_path() and geom_line()?
3. What low-level geoms are used to draw geom_smooth()? What about geom_boxplot() and
geom_violin()?
Labels
• Adding text to a plot can be quite tricky. ggplot2 doesn’t have all the answers, but does provide
some tools to make your life a little easier. The main tool is geom_text(), which adds labels at the
specified x and y positions. geom_text() has the most aesthetics of any geom, because there are so
many ways to control the appearance of a text:
• family gives the name of a font. There are only three fonts that are guaranteed to work
everywhere: “sans” (the default), “serif”, or “mono”:
df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = family, family = family))
Labels
• fontface specifies the face: “plain” (the default), “bold” or “italic”.
df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = face, fontface = face))
Labels
• You can adjust the alignment of the text with the hjust (“left”, “center”, “right”, “inward”, “outward”) and
vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics. The default alignment is centered. One of
the most useful alignments is “inward”: it aligns text towards the middle of the plot:
df <- data.frame(
  x = c(1, 1, 2, 2, 1.5),
  y = c(1, 2, 1, 2, 1.5),
  # Labels match the positions: (1, 2) is top-left, (2, 1) is bottom-right
  text = c("bottom-left", "top-left",
           "bottom-right", "top-right", "center")
)
ggplot(df, aes(x, y)) +
geom_text(aes(label = text))
ggplot(df, aes(x, y)) +
geom_text(aes(label = text), vjust = "inward", hjust = "inward")
Labels
• size controls the font size. Unlike most tools, ggplot2 uses mm, rather than the usual points (pts).
This makes it consistent with other size units in ggplot2. (There are 72.27 pts in an inch and 25.4 mm
in an inch, so to convert from points to mm, multiply by 25.4 / 72.27.)
• angle specifies the rotation of the text in degrees.
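The unit conversion for size is easy to get backwards, so a worked example may help. The helper names pt_to_mm() and mm_to_pt() are made up for this sketch; .pt is the constant ggplot2 exports for converting mm to points.

```r
library(ggplot2)

# 72.27 points per inch, 25.4 mm per inch
pt_to_mm <- function(pt) pt * 25.4 / 72.27
mm_to_pt <- function(mm) mm * 72.27 / 25.4

pt_to_mm(12)            # a 12 pt font is roughly 4.2 mm
mm_to_pt(pt_to_mm(12))  # and back to 12

# ggplot2's own constant is the mm-to-points factor
all.equal(pt_to_mm(12), 12 / .pt)
```

So to draw text at 12 pt, one option is geom_text(size = 12 / .pt).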
Labels
• You can map data values to these aesthetics, but use restraint: it is hard to perceive the relationship
between variables mapped to these aesthetics. geom_text() also has three parameters. Unlike the
aesthetics, these only take single values, so they must be the same for all labels:
• Often you want to label existing points on the plot. You don’t want the text to overlap with the
points (or bars etc), so it’s useful to offset the text a little. The nudge_x and nudge_y parameters
allow you to nudge the text a little horizontally or vertically:
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)
Labels
• If check_overlap = TRUE, overlapping labels will be automatically removed. The algorithm is
simple: labels are plotted in the order they appear in the data frame; if a label would overlap with
a label that has already been drawn, it’s omitted. This is not incredibly useful, but can be handy.
ggplot(mpg, aes(displ, hwy)) +
  geom_text(aes(label = model)) +
  xlim(1, 8)

ggplot(mpg, aes(displ, hwy)) +
  geom_text(aes(label = model), check_overlap = TRUE) +
  xlim(1, 8)
Labels
• A variation on geom_text() is geom_label(): it draws a rounded rectangle behind the text. This
makes it useful for adding labels to plots with busy backgrounds:
label <- data.frame(
waiting = c(55, 80),
eruptions = c(2, 4.3),
label = c("peak one", "peak two")
)
ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = label, aes(label = label))
Labels
• Text labels can also serve as an alternative to a legend. This usually makes the plot easier to read
because it puts the labels closer to the data. The directlabels package, by Toby Dylan Hocking,
provides a number of tools to make this easier:
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point(show.legend = FALSE) +
  directlabels::geom_dl(aes(label = class), method = "smart.grid")
Annotations
• Annotations add metadata to your plot. But metadata is just data, so you can use:
• geom_text() to add text descriptions or to label points. Most plots will not benefit from adding
text to every single observation on the plot, but labelling outliers and other important points is
very useful.
• geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics
xmin, xmax, ymin and ymax.
• geom_line(), geom_path() and geom_segment() to add lines. All these geoms have an arrow
parameter, which allows you to place an arrowhead on the line. Create arrowheads with
arrow(), which has arguments angle, length, ends and type.
• geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes
called rules), that span the full range of the plot.
Annotations
• To show off the basic idea, we’ll draw a time series of unemployment:
ggplot(economics, aes(date, unemploy)) +
geom_line()
Annotations
• We can annotate this plot with which president was in power at the time. There is little new in this code - it’s a straightforward
manipulation of existing geoms. There is one special thing to note: the use of -Inf and Inf as positions. These refer to the bottom and
top (or left and right) limits of the plot.
presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) +
  geom_rect(
    aes(xmin = start, xmax = end, fill = party),
    ymin = -Inf, ymax = Inf, alpha = 0.2,
    data = presidential
  ) +
  geom_vline(
    aes(xintercept = as.numeric(start)),
    data = presidential,
    colour = "grey50", alpha = 0.5
  ) +
  geom_text(
    aes(x = start, y = 2500, label = name),
    data = presidential,
    size = 3, vjust = 0, hjust = 0, nudge_x = 50
  ) +
  geom_line(aes(date, unemploy)) +
  scale_fill_manual(values = c("blue", "red"))
Annotations
• You can use the same technique to add a single annotation to a plot, but it’s a bit fiddly because you have to
create a one row data frame:
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have
varied a lot over the years", 40), collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
geom_text(
aes(x, y, label = caption),
data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
hjust = 0, vjust = 1, size = 4
)
Annotations
• It’s easier to use the annotate() helper function which creates the data frame for you:
ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate("text", x = xrng[1], y = yrng[2], label = caption,
hjust = 0, vjust = 1, size = 4
)
Annotations
• Annotations, particularly reference lines, are also useful when comparing groups across facets. In
the following plot, it’s much easier to see the subtle differences if we add a reference line.
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
facet_wrap(~cut, nrow = 1)
Annotations
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
colour = "white", size = 1) +
facet_wrap(~cut, nrow = 1)
Collective geoms
• Geoms can be roughly divided into individual and collective geoms. An individual geom draws a
distinct graphical object for each observation (row). For example, the point geom draws one point
per row. A collective geom displays multiple observations with one geometric object. This may be
a result of a statistical summary, like a boxplot, or may be fundamental to the display of the geom,
like a polygon. Lines and paths fall somewhere in between: each line is composed of a set of
straight segments, but each segment represents two points. How do we control the assignment of
observations to graphical elements? This is the job of the group aesthetic.
• By default, the group aesthetic is mapped to the interaction of all discrete variables in the plot.
This often partitions the data correctly, but when it does not, or when no discrete variable is used
in a plot, you’ll need to explicitly define the grouping structure by mapping group to a variable
that has a different value for each group.
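The default grouping rule can be observed with layer_data(), which reports the group assigned to every row. The sketch below uses mpg rather than a new dataset; the plot names are illustrative, and the three cases are: no discrete variable, one discrete variable, and two discrete variables.

```r
library(ggplot2)

# No discrete variable: every row falls into one constant group
p_none <- ggplot(mpg, aes(displ, hwy)) + geom_line()
unique(layer_data(p_none)$group)

# One discrete variable: one group per level of drv
p_one <- ggplot(mpg, aes(displ, hwy, colour = drv)) + geom_line()
length(unique(layer_data(p_one)$group))

# Two discrete variables: one group per observed combination
p_two <- ggplot(mpg, aes(displ, hwy, colour = drv, linetype = factor(year))) +
  geom_line()
length(unique(layer_data(p_two)$group))
nrow(unique(mpg[, c("drv", "year")]))  # combinations present in the data
```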
Multiple groups, one aesthetic
• You want to be able to distinguish individual subjects, but not identify them. This is common in longitudinal
studies with many subjects, where the plots are often descriptively called spaghetti plots. For example, the
following plot shows the growth trajectory for each boy (each Subject):
data(Oxboys, package = "nlme")
head(Oxboys)
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
Multiple groups, one aesthetic
• If you incorrectly specify the grouping variable, you’ll get a characteristic sawtooth appearance:
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()
Different groups on different layers
• Sometimes we want to plot summaries that use different levels of aggregation: one layer might display
individuals, while another displays an overall summary. Building on the previous example, suppose we want
to add a single smooth line, showing the overall trend for all boys. If we use the same grouping in both layers,
we get one smooth per boy:
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE)
Different groups on different layers
• Instead of setting the grouping aesthetic in ggplot(), where it will apply to all layers, we set it in geom_line()
so it applies only to the lines. There are no discrete variables in the plot so the default grouping variable will
be a constant and we get one smooth:
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", size = 2, se = FALSE)
Overriding the default grouping
• Some plots have a discrete x scale, but you still want to draw lines connecting across groups. This
is the strategy used in interaction plots, profile plots, and parallel coordinate plots, among others.
For example, imagine we’ve drawn boxplots of height at each measurement occasion:
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()
Overriding the default grouping
• There is one discrete variable in this plot, Occasion, so we get one boxplot for each unique x value. Now we
want to overlay lines that connect each individual boy. Simply adding geom_line() does not work: the lines
are drawn within each occasion, not across each subject:
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(colour = "#3366FF", alpha = 0.5)
Overriding the default grouping
• To get the plot we want, we need to override the grouping to say we want one line per boy:
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)
Matching aesthetics to graphic objects
• Lines and paths operate on an off-by-one principle: there is one more observation than line segment, and so
the aesthetic for the first observation is used for the first segment, the second observation for the second
segment and so on. This means that the aesthetic for the last observation is not used:
df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
ggplot(df, aes(x, y, colour = colour)) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
Matching aesthetics to graphic objects
• You could imagine a more complicated system where segments smoothly blend from one aesthetic to another.
This would work for continuous variables like size or colour, but not for discrete variables, and is not used in
ggplot2. If this is the behaviour you want, you can perform the linear interpolation yourself:
xgrid <- with(df, seq(min(x), max(x), length.out = 50))
interp <- data.frame(
x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)
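The interpolation step relies on approx(), which interpolates linearly between the known points. A small check on the same three-row data frame, evaluated halfway between neighbouring observations (the name mid is made up here):

```r
df <- data.frame(x = 1:3, y = 1:3, colour = c(1, 3, 5))

# Colour values halfway between observations 1 and 2, and between 2 and 3
mid <- approx(df$x, df$colour, xout = c(1.5, 2.5))$y
mid  # 2 and 4: the averages of the neighbouring colour values
```

With 50 grid points instead of two, the colour changes in small enough steps that the line appears to blend smoothly.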
Matching aesthetics to graphic objects
• For all other collective geoms, like polygons, the aesthetics from the individual components are only used if
they are all the same, otherwise the default value is used. It’s particularly clear why this makes sense for fill:
how would you colour a polygon that had a different fill colour for each point on its border?
• These issues are most relevant when mapping aesthetics to continuous variables, because, as described above,
when you introduce a mapping to a discrete variable, it will by default split apart collective geoms into
smaller pieces. This works particularly well for bar and area plots, because stacking the individual pieces
produces the same shape as the original ungrouped data:
ggplot(mpg, aes(class)) +
geom_bar()
ggplot(mpg, aes(class, fill = drv)) +
geom_bar()
Matching aesthetics to graphic objects
• If you try to map fill to a continuous variable in the same way, it doesn’t work. The default grouping will only
be based on class, so each bar will be given multiple colours. Since a bar can only display one colour, it will
use the default grey. To show multiple colours, we need multiple bars for each class, which we can get by
overriding the grouping:
ggplot(mpg, aes(class, fill = hwy)) +
geom_bar()
ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
geom_bar()
Exercise 3
1. Draw a boxplot of hwy for each value of cyl, without turning cyl into a factor. What extra aesthetic do you need to set?
2. Modify the following plot so that you get one boxplot per integer value of displ.
ggplot(mpg, aes(displ, cty)) +
geom_boxplot()
3. When illustrating the difference between mapping continuous and discrete colours to a line, the discrete example needed aes(group = 1).
Why? What happens if that is omitted? What’s the difference between aes(group = 1) and aes(group = 2)? Why?
4. How many bars are in each of the following plots?
ggplot(mpg, aes(drv)) +
geom_bar()
ggplot(mpg, aes(drv, fill = hwy, group = hwy)) +
geom_bar()
library(dplyr)
mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
ggplot(mpg2, aes(drv, fill = hwy, group = id)) +
geom_bar()
(Hint: try adding an outline around each bar with colour = "white")
Exercise 3
5. Install the babynames package. It contains data about the popularity of babynames in the US. Run the following code and fix the resulting
graph. Why does this graph make me unhappy?
library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n)) +
geom_line()
Diamonds data
• To demonstrate tools for large datasets, we’ll use the built in diamonds dataset, which consists of
price and quality information for ~54,000 diamonds:
diamonds
• The data contains the four C’s of diamond quality: carat, cut, colour and clarity; and five physical
measurements: depth, table, x, y and z.
Displaying distributions
• For 1d continuous distributions the most important geom is the histogram, geom_histogram():
ggplot(diamonds, aes(depth)) +
geom_histogram()

ggplot(diamonds, aes(depth)) +
geom_histogram(binwidth = 0.1) +
xlim(55, 70)
Displaying distributions
• If you want to compare the distribution between groups, you have a few options:
• Show small multiples of the histogram, facet_wrap(~ var).
• Use colour and a frequency polygon, geom_freqpoly().
• Use a “conditional density plot”, geom_histogram(position = "fill").
Displaying distributions
• The frequency polygon and conditional density plots are shown below. The conditional density plot uses
position_fill() to stack each bin, scaling it to the same height. This plot is perceptually challenging because
you need to compare bar heights, not positions, but you can see the strongest patterns.
ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
Displaying distributions
• An alternative to a bin-based visualisation is a density estimate. geom_density() places a little normal
distribution at each data point and sums up all the curves. It has desirable theoretical properties, but is more
difficult to relate back to the data. Use a density plot when you know that the underlying density is smooth,
continuous and unbounded. You can use the adjust parameter to make the density more or less smooth.
ggplot(diamonds, aes(depth)) +
geom_density(na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.2, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
Displaying distributions
• The histogram, frequency polygon and density display a detailed view of the distribution. However,
sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice
quality for quantity. Here are three options:
1. geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”. It
displays far less information than a histogram, but also takes up much less space. You can use boxplot with both
categorical and continuous x. For continuous x, you’ll also need to set the group aesthetic to define how the x
variable is broken up into bins. A useful helper function is cut_width():
ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
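To see what the group aesthetic receives in the second plot, it can help to call cut_width() on its own. This is a small sketch, assuming ggplot2 is attached; carat_bin is just an illustrative name.

```r
library(ggplot2)

# cut_width() turns a continuous vector into a factor of equal-width bins.
carat_bin <- cut_width(diamonds$carat, width = 0.1)

nlevels(carat_bin)      # number of bins actually produced
head(table(carat_bin))  # observation counts in the first few bins
```

Each level of the resulting factor becomes one group, and therefore one boxplot.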
Displaying distributions
2. geom_violin(): the violin plot is a compact version of the density plot. The underlying computation is the
same, but the results are displayed in a similar fashion to the boxplot:
ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
Displaying distributions
3. geom_dotplot(): draws one point for each observation, carefully adjusted in space to avoid
overlaps and show the distribution. It is useful for smaller datasets.
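No code accompanies the dot plot on this slide. A minimal sketch, assuming the mpg data used earlier (234 cars, so one dot per observation is still readable) and an arbitrarily chosen binwidth of 0.5:

```r
library(ggplot2)

# One dot per car; binwidth is an illustrative choice, not a recommendation.
dots <- ggplot(mpg, aes(hwy)) +
  geom_dotplot(binwidth = 0.5)
dots
```

The dots are stacked within each bin, so the y axis labels of this plot should not be read as counts.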
Dealing with overplotting
• The scatterplot is a very important tool for assessing the relationship between two continuous variables.
However, when the data is large, points will often be plotted on top of each other, obscuring the true
relationship. In extreme cases, you will only be able to see the extent of the data, and any conclusions drawn
from the graphic will be suspect. This problem is called overplotting.
• For smaller datasets: Very small amounts of overplotting can sometimes be alleviated by making the points
smaller, or using hollow glyphs. The following code shows some options for 2000 points sampled from a
bivariate normal distribution.
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles
norm + geom_point(shape = ".") # Pixel sized
Dealing with overplotting
• For larger datasets with more overplotting, you can use alpha blending (transparency) to make the points
transparent. If you specify alpha as a ratio, the denominator gives the number of points that must be
overplotted to give a solid colour. Values smaller than ~1/500 are rounded down to zero, giving completely
transparent points.
norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)
Statistical summaries
• geom_histogram() and geom_bin2d() pair a familiar geom, geom_bar() and geom_raster() respectively, with a
new statistical transformation, stat_bin() and stat_bin2d(). These stats combine the data into
bins and count the number of observations in each bin. But what if we want a
summary other than the count? So far, we’ve just used the default statistical transformation associated with each
geom. Now we’re going to explore how to use stat_summary_bin() and stat_summary_2d() to compute
different summaries.
• Let’s start with a couple of examples with the diamonds data. The first example in each pair shows how we
can count the number of diamonds in each bin; the second shows how we can compute the average price.
ggplot(diamonds, aes(color)) +
geom_bar()
ggplot(diamonds, aes(color, price)) +
# fun replaces the older fun.y argument (deprecated since ggplot2 3.3.0)
geom_bar(stat = "summary_bin", fun = mean)
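As a cross-check on the second plot, the same per-colour averages can be computed directly in base R. This is only a sketch; mean_price is an illustrative name, and the values should correspond to the bar heights of the mean-price plot.

```r
library(ggplot2)

# Mean price for each colour grade in diamonds.
mean_price <- tapply(diamonds$price, diamonds$color, mean)
round(mean_price)
```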
Statistical summaries
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1, na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
Example: building a scatterplot
How are engine size and fuel economy related? We might create a scatterplot of engine displacement and
highway mpg with points coloured by number of cylinders:

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
Underlying grammar
• You can create plots like this easily, but what is going on underneath the surface? How
does ggplot2 draw this plot?
• In order to unlock the full power of ggplot2, you’ll need to master the underlying
grammar. By understanding the grammar, and how its components fit together, you can
create a wider range of visualizations, combine multiple sources of data, and customise to
your heart’s content.
• The grammar makes it easier for you to iteratively update a plot, changing a single feature
at a time. The grammar is also useful because it suggests the high-level aspects of a plot
that can be changed, giving you a framework to think about graphics, and hopefully
shortening the distance from mind to paper. It also encourages the use of graphics
customised to a particular problem, rather than relying on specific chart types.

Source: Wickham (2016)


Mapping aesthetics to data
• A scatterplot represents each observation as a point,
positioned according to the value of two variables. As well as
a horizontal and vertical position, each point also has a size, a
colour and a shape. These attributes are called aesthetics, and
are the properties that can be perceived on the graphic. Each
aesthetic can be mapped to a variable, or set to a constant
value. In the previous graphic, displ is mapped to horizontal
position, hwy to vertical position and cyl to colour. Size and
shape are not mapped to variables, but remain at their
(constant) default values.
Mapping aesthetics to data
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_line() +
theme(legend.position = "none")
Mapping aesthetics to data
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_bar(stat = "identity", position = "identity", fill = NA) +
theme(legend.position = "none")
Mapping aesthetics to data
• Points, lines and bars are all examples of geometric objects, or geoms. Geoms
determine the “type” of the plot. Plots that use a single geom are often given a
special name:

Named plot            Geom     Other features
scatterplot           point
bubblechart           point    Size mapped to a variable
barchart              bar
box-and-whisker plot  boxplot
line chart            line
Combining multiple geoms
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm")
Scaling
• The values in the previous table have no meaning to the computer. We need to convert them from
data units (e.g., litres, miles per gallon and number of cylinders) to graphical units (e.g., pixels and
colours) that the computer can display. This conversion process is called scaling and performed by
scales. Now that these values are meaningful to the computer, they may not be meaningful to us:
colours are represented by a six-letter hexadecimal string, sizes by a number and shapes by an
integer. These aesthetic specifications that are meaningful to R are described in
vignette("ggplot2-specs").
• For position, we need only a linear mapping from the range of the data to [0, 1]. We use [0, 1] instead of exact
pixels because the drawing system that ggplot2 uses, grid, takes care of that final conversion for
us. A final step determines how the two positions (x and y) are combined to form the final location
on the plot. This is done by the coordinate system, or coord. In most cases this will be Cartesian
coordinates, but it might be polar coordinates, or a spherical projection used for a map.
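The linear mapping to the unit interval can be written out by hand. This is a sketch of the idea only, not ggplot2's internal code; rescale01 is a hypothetical helper name.

```r
library(ggplot2)

# Linear map from the range of the data to [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

x_unit <- rescale01(mpg$displ)
range(x_unit)  # smallest engine maps to 0, largest to 1
```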
Scaling
• The process for mapping the colour is a little more
complicated, as we have a non-numeric result: colours.
However, colours can be thought of as having three
components, corresponding to the three types of colour-
detecting cells in the human eye. These three cell types give
rise to a three-dimensional colour space. Scaling then
involves mapping the data values to points in this space.
There are many ways to do this, but here since cyl is a
categorical variable we map values to evenly spaced hues
on the colour wheel, as shown in the below figure. A
different mapping is used when the variable is continuous.

Fig. A colour wheel illustrating the choice of five equally spaced colours. This is the default scale for
discrete variables.
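The evenly spaced hues can be inspected directly. The sketch below assumes the scales package is available (it is installed alongside ggplot2) and asks for four colours, since cyl takes four distinct values in mpg; cyl_colours is an illustrative name.

```r
library(scales)

# Evenly spaced hues around the colour wheel, one per level.
cyl_colours <- hue_pal()(4)
cyl_colours
```

The result is a character vector of hexadecimal colour strings, which is the form the computer needs.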
Scaling
• The result of these conversions is below. As well as aesthetics
that have been mapped to variables, we also include aesthetics
that are constant. We need these so that the aesthetics for each
point are completely specified and R can draw the plot. The
points will be filled circles (shape 19 in R) with a 1-mm
diameter:
• Finally, we need to render this data to create the graphical
objects that are displayed on the screen. To create a complete
plot we need to combine graphical objects from three
sources: the data, represented by the point geom; the scales
and coordinate system, which generate axes and legends so
that we can read values from the graph; and plot annotations,
such as the background and plot title.
Adding complexity
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth() +
facet_wrap(~year)
Adding complexity
• The smooth layer is different to the point layer because it doesn’t display the raw data, but
instead displays a statistical transformation of the data. Specifically, the smooth layer fits
a smooth line through the middle of the data. This requires an additional step in the
process described above: after mapping the data to aesthetics, the data is passed to a
statistical transformation, or stat, which manipulates the data in some useful way. In this
example, the stat fits the data to a loess smoother, and then returns predictions from
evenly spaced points within the range of the data. Other useful stats include 1d and 2d
binning, group means, quantile regression and contouring.
Components of the layered grammar
• Together, the data, mappings, stat, geom and position
adjustment form a layer. A plot may have multiple
layers, as in the example where we overlaid a
smoothed line on a scatterplot. All together, the layered
grammar defines a plot as the combination of:
– A default dataset and set of mappings from
variables to aesthetics.
– One or more layers, each composed of a geometric
object, a statistical transformation, a position
adjustment, and optionally, a dataset and aesthetic
mappings.
– One scale for each aesthetic mapping.
– A coordinate system.
– The facetting specification.
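To make the list concrete, here is a scatterplot with every component spelled out rather than left to its default. It is a sketch: each explicit call below simply restates what ggplot2 would otherwise choose for this plot, and full is an illustrative name.

```r
library(ggplot2)

full <- ggplot(mpg, aes(displ, hwy)) +   # default dataset and mappings
  geom_point(stat = "identity",          # layer: geom, stat and
             position = "identity") +    #   position adjustment
  scale_x_continuous() +                 # one scale per mapped aesthetic
  scale_y_continuous() +
  coord_cartesian() +                    # coordinate system
  facet_null()                           # facetting: a single panel
full
```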
Layers
• Layers are responsible for creating the objects that we perceive on the plot. A layer is
composed of five parts:
1. Data
2. Aesthetic mappings.
3. A statistical transformation (stat).
4. A geometric object (geom).
5. A position adjustment.
Scales
• A scale controls the mapping from data to aesthetic attributes, and we need a scale for
every aesthetic used on a plot. Each scale operates across all the data in the plot, ensuring
a consistent mapping from data to aesthetics.
• A scale is a function and its inverse, along with a set of parameters. For example, the
colour gradient scale maps a segment of the real line to a path through a colour space. The
parameters of the function define whether the path is linear or curved, which colour space
to use (e.g., LUV or RGB), and the colours at the start and end.
• The inverse function is used to draw a guide so that you can read values from the graph.
Guides are either axes (for position scales) or legends (for everything else). Most
mappings have a unique inverse (i.e., the mapping function is one-to-one), but many do
not. A unique inverse makes it possible to recover the original data, but this is not always
desirable if we want to focus attention on a single aspect.
Coordinate system
• A coordinate system, or coord for short, maps the position of objects onto the plane of the
plot. Position is often specified by two coordinates (x, y), but potentially could be three or
more (although this is not implemented in ggplot2). The Cartesian coordinate system is
the most common coordinate system for two dimensions, while polar coordinates and
various map projections are used less frequently.
• Coordinate systems affect all position variables simultaneously and differ from scales in
that they also change the appearance of the geometric objects. For example, in polar
coordinates, bar geoms look like segments of a circle. Additionally, scaling is performed
before statistical transformation, while coordinate transformations occur afterward.
• Coordinate systems control how the axes and grid lines are drawn.
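The effect of the coordinate system on a geom's appearance can be seen by drawing the same stacked bar twice. This is a short sketch using mpg; bars and pie are illustrative names.

```r
library(ggplot2)

bars <- ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1)

bars                                    # Cartesian: one stacked bar
pie <- bars + coord_polar(theta = "y")  # polar: the same bar drawn as circle segments
pie
```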
Facetting
• There is also another thing that turns out to be sufficiently useful that we should include it
in our general framework: facetting, a general case of conditioned or trellised plots. This
makes it easy to create small multiples, each showing a different subset of the whole
dataset. This is a powerful tool when investigating whether patterns hold across all
conditions. The facetting specification describes which variables should be used to split
up the data, and whether position scales should be free or constrained.
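A small sketch of a facetting specification, assuming the mpg data: the formula names the splitting variable and the scales argument says whether position scales are shared or free. The object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

fixed_panels <- base + facet_wrap(~drv)                  # shared position scales (the default)
free_panels  <- base + facet_wrap(~drv, scales = "free") # each panel gets its own axes

fixed_panels
free_panels
```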
Building a plot
• It’s important to realise that there really are two distinct steps. First we create a
plot with default dataset and aesthetic mappings:
p <- ggplot(mpg, aes(displ, hwy))
p
Add a layer
p + geom_point()

# geom_point() is shorthand for this call; current versions of layer()
# have no geom_params or stat_params arguments
p + layer(
mapping = NULL,
data = NULL,
geom = "point",
stat = "identity",
position = "identity"
)
Data
• The data on each layer doesn’t need to be the same, and it’s often useful to combine multiple
datasets in a single plot. To illustrate that idea I’m going to generate two new datasets related to
the mpg dataset. First I’ll fit a loess model and generate predictions from it. (This is what
geom_smooth() does behind the scenes)
mod <- loess(hwy ~ displ, data = mpg)
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(mod, newdata = grid)
grid
Data
• Next, I’ll isolate observations that are particularly far away from their predicted values:
std_resid <- resid(mod) / mod$s
outlier <- filter(mpg, abs(std_resid) > 2)
outlier
Data
• I’ve generated these datasets because it’s common to enhance the display of raw data with
a statistical summary and some annotations. With these new datasets, I can improve our
initial scatterplot by overlaying a smoothed line, and labelling the outlying points:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_line(data = grid, colour = "blue", size = 1.5) +
geom_text(data = outlier, aes(label = model))
Data
• In this example, every layer uses a different dataset. We could define the same plot in
another way, omitting the default dataset, and specifying a dataset for each layer:
ggplot(mapping = aes(displ, hwy)) +
geom_point(data = mpg) +
geom_line(data = grid) +
geom_text(data = outlier, aes(label = model))
Exercise 4
• The following code uses dplyr to generate some summary statistics about each class of car:
library(dplyr)
class <- mpg %>%
group_by(class) %>%
summarise(n = n(), hwy = mean(hwy))
Use the data to recreate this plot:
Aesthetic mappings
• The aesthetic mappings, defined with aes(), describe how variables are mapped to visual
properties or aesthetics. aes() takes a sequence of aesthetic-variable pairs like this:
aes(x = displ, y = hwy, colour = class)
Or
aes(displ, hwy, colour = class)
Specifying the aesthetics in the plot vs. in the layers

• Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in
some combination of both. All of these calls create the same plot specification:
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
ggplot(mpg, aes(displ)) +
geom_point(aes(y = hwy, colour = class))
ggplot(mpg) +
geom_point(aes(displ, hwy, colour = class))
Specifying the aesthetics in the plot vs. in the layers

• Within each layer, you can add, override, or remove mappings:
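The three operations can be illustrated with a short sketch, assuming a base plot that maps displ to x and hwy to y. The layers and object names here are chosen for illustration only.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy))

added     <- base + geom_point(aes(colour = class))        # add: colour joins x and y
overriden <- base + geom_point(aes(y = cty))               # override: y is now cty, not hwy
removed   <- base + geom_histogram(aes(y = NULL), bins = 30) # remove: y dropped, only x remains
```

Setting an aesthetic to NULL in the layer removes the mapping inherited from the plot, which is what lets a one-variable geom such as the histogram sit on a plot whose defaults include y.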


Specifying the aesthetics in the plot vs. in the layers
• If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference.
However, the distinction is important when you start adding additional layers. These two plots are
both valid and interesting, but focus on quite different aspects of the data:
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme(legend.position = "none")

ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(method = "lm", se = FALSE) +
theme(legend.position = "none")
Setting vs. mapping
• Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the
layer parameters. We map an aesthetic to a variable (e.g., aes(colour = cut)) or set it to a constant (e.g.,
colour = "red"). If you want appearance to be governed by a variable, put the specification inside aes(); if you
want to override the default size or colour, put the value outside of aes().
• The following plots are created with similar code, but have rather different outputs. The second plot maps
(not sets) the colour to the value ‘darkblue’. This effectively creates a new variable containing only the value
‘darkblue’ and then scales it with a colour scale. Because this value is discrete, the default colour scale uses
evenly spaced colours on the colour wheel, and since there is only one value this colour is pinkish.
ggplot(mpg, aes(cty, hwy)) +
geom_point(colour = "darkblue")
ggplot(mpg, aes(cty, hwy)) +
geom_point(aes(colour = "darkblue"))
Setting vs. mapping
• A third approach is to map the value, but override the default scale:
ggplot(mpg, aes(cty, hwy)) +
geom_point(aes(colour = "darkblue")) +
scale_colour_identity()
Setting vs. mapping
• It’s sometimes useful to map aesthetics to constants. For example, if you want to display multiple
layers with varying parameters, you can “name” each layer:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(aes(colour = "loess"), method = "loess", se = FALSE) +
geom_smooth(aes(colour = "lm"), method = "lm", se = FALSE) +
labs(colour = "Method")
Exercise 5
1. Simplify the following plot specifications:
ggplot(mpg) +
geom_point(aes(mpg$disp, mpg$hwy))
ggplot() +
geom_point(mapping = aes(y = hwy, x = cty), data = mpg) +
geom_smooth(data = mpg, mapping = aes(cty, hwy))
ggplot(diamonds, aes(carat, price)) +
geom_point(aes(log(brainwt), log(bodywt)), data = msleep)
2. What does the following code do? Does it work? Does it make sense? Why/why not?
ggplot(mpg) +
geom_point(aes(class, cty)) +
geom_boxplot(aes(trans, hwy))
3. What happens if you try to use a continuous variable on the x axis in one layer, and a categorical variable in
another layer? What happens if you do it in the opposite order?
Geoms
• Geometric objects, or geoms for short, perform the actual rendering of the layer,
controlling the type of plot that you create. For example, using a point geom will create a
scatterplot, while using a line geom will create a line plot.
Graphical primitives
– geom_blank(): display nothing. Most useful for adjusting axes limits using data.
– geom_point(): points.
– geom_path(): paths.
– geom_ribbon(): ribbons, a path with vertical thickness.
– geom_segment(): a line segment, specified by start and end position.
– geom_rect(): rectangles.
– geom_polygon(): filled polygons.
– geom_text(): text.
One variable
– Discrete:
* geom_bar(): display distribution of discrete variable.
– Continuous
* geom_histogram(): bin and count continuous variable, display with bars.
* geom_density(): smoothed density estimate.
* geom_dotplot(): stack individual points into a dot plot.
* geom_freqpoly(): bin and count continuous variable, display with lines.
Two variables
– Both continuous:
· geom_point(): scatterplot.
· geom_quantile(): smoothed quantile regression.
· geom_rug(): marginal rug plots.
· geom_smooth(): smoothed line of best fit.
· geom_text(): text labels.
– Show distribution:
· geom_bin2d(): bin into rectangles and count.
· geom_density2d(): smoothed 2d density estimate.
· geom_hex(): bin into hexagons and count.
Two variables
– At least one discrete:
· geom_count(): count number of points at distinct locations
· geom_jitter(): randomly jitter overlapping points.
– One continuous, one discrete:
· geom_bar(stat = "identity"): a bar chart of precomputed summaries.
· geom_boxplot(): boxplots.
· geom_violin(): show density of values in each group.
– One time, one continuous
· geom_area(): area plot.
· geom_line(): line plot.
· geom_step(): step plot.
Two variables
– Display uncertainty:
· geom_crossbar(): vertical bar with center.
· geom_errorbar(): error bars.
· geom_linerange(): vertical line.
· geom_pointrange(): vertical line with center.
– Spatial
· geom_map(): fast version of geom_polygon() for map data.
Three variables
– geom_contour(): contours.
– geom_tile(): tile the plane with rectangles.
– geom_raster(): fast version of geom_tile() for equal sized tiles.

Each geom has a set of aesthetics that it understands, some of which must be provided. For
example, the point geom requires x and y position, and understands colour, size and shape
aesthetics. A bar requires height (ymax), and understands width, border colour and fill
colour. Each geom lists its aesthetics in the documentation.
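Besides the documentation, the geom objects themselves can be queried. The sketch below assumes the exported GeomPoint object and its required_aes and default_aes fields; it is a convenience for exploration rather than a documented workflow.

```r
library(ggplot2)

GeomPoint$required_aes        # aesthetics that must be supplied
names(GeomPoint$default_aes)  # optional aesthetics that fall back to defaults
```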
Geoms
• Some geoms differ primarily in the way that they are parameterised. For example, you
can draw a square in three ways:
• By giving geom_tile() the location (x and y) and dimensions (width and height).
• By giving geom_rect() top (ymax), bottom (ymin), left (xmin) and right (xmax)
positions.
• By giving geom_polygon() a four row data frame with the x and y positions of each
corner.
• Other related geoms are:
• geom_segment() and geom_line()
• geom_area() and geom_ribbon().
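The three parameterisations of a square listed above can be sketched as follows. The data frames are made up for illustration and all describe the same unit square.

```r
library(ggplot2)

# The same unit square, centred on (0.5, 0.5), described three ways.
tile_df <- data.frame(x = 0.5, y = 0.5, width = 1, height = 1)
rect_df <- data.frame(xmin = 0, xmax = 1, ymin = 0, ymax = 1)
poly_df <- data.frame(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))

sq_tile <- ggplot(tile_df, aes(x, y, width = width, height = height)) +
  geom_tile()
sq_rect <- ggplot(rect_df, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_rect()
sq_poly <- ggplot(poly_df, aes(x, y)) +
  geom_polygon()
```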
Exercise 6
• For each of the plots below, identify the geom used to draw it.
Stats
• A statistical transformation, or stat, transforms the data, typically by summarising it in
some manner. For example, a useful stat is the smoother, which calculates the smoothed
mean of y, conditional on x. You’ve already used many of ggplot2’s stats because they’re
used behind the scenes to generate many important geoms:
• stat_bin(): geom_freqpoly(), geom_histogram()
• stat_count(): geom_bar()
• stat_bin2d(): geom_bin2d()
• stat_bindot(): geom_dotplot()
• stat_binhex(): geom_hex()
• stat_boxplot(): geom_boxplot()
• stat_contour(): geom_contour()
• stat_quantile(): geom_quantile()
• stat_smooth(): geom_smooth()
• stat_sum(): geom_count()
Stats
• Other stats can’t be created with a geom_ function:
• stat_ecdf(): compute an empirical cumulative distribution plot.
• stat_function(): compute y values from a function of x values.
• stat_summary(): summarise y values at distinct x values.
• stat_summary_2d(), stat_summary_hex(): summarise binned values.
• stat_qq(): perform calculations for a quantile-quantile plot.
• stat_spoke(): convert angle and radius to position.
• stat_unique(): remove duplicated rows.
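Two of these stats are easy to try directly, since each comes with a sensible default geom. A brief sketch, with the normal density chosen arbitrarily as the function to plot and illustrative object names:

```r
library(ggplot2)

# Empirical cumulative distribution of diamond prices.
ecdf_plot <- ggplot(diamonds, aes(price)) +
  stat_ecdf()

# y values computed from a function of x; the data only fixes the x range.
fun_plot <- ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm)

ecdf_plot
fun_plot
```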
Stats
• There are two ways to use these functions. You can either add a stat_() function and
override the default geom, or add a geom_() function and override the default stat:
ggplot(mpg, aes(trans, cty)) +
geom_point() +
# fun replaces the older fun.y argument (deprecated since ggplot2 3.3.0)
stat_summary(geom = "point", fun = "mean", colour = "red", size = 4)

ggplot(mpg, aes(trans, cty)) +
geom_point() +
geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)
Generated variables
• Internally, a stat takes a data frame as input and returns a data frame as output, and so a
stat can add new variables to the original dataset. It is possible to map aesthetics to these
new variables. For example, stat_bin, the statistic used to make histograms, produces the
following variables:
• count, the number of observations in each bin
• density, the density of observations in each bin (percentage of total / bar width)
• x, the centre of the bin
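The generated variables can be inspected after the stat has run. This sketch assumes layer_data() from ggplot2 and a histogram of diamond prices with a binwidth of 500; hist_plot and binned are illustrative names.

```r
library(ggplot2)

hist_plot <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 500)

# layer_data() returns the layer's data frame after the stat has been applied.
binned <- layer_data(hist_plot)
head(binned[, c("x", "count", "density")])
```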
Generated variables
• These generated variables can be used instead of the variables present in the original dataset. For example,
the default histogram geom assigns the height of the bars to the number of observations (count), but if you’d
prefer a more traditional histogram, you can use the density (density). To refer to a generated variable like
density, “..” must surround the name. This prevents confusion in case the original dataset includes a variable
with the same name as a generated variable, and it makes it clear to any later reader of the code that this
variable was generated by a stat. Each statistic lists the variables that it creates in its documentation. Compare
the y-axes on these two plots:
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 500)

ggplot(diamonds, aes(price)) +
geom_histogram(aes(y = ..density..), binwidth = 500)
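In ggplot2 3.3.0 and later, the same generated variable can also be referenced with after_stat(), and recent releases warn about the ".." notation. A sketch of the second plot in that style, assuming such a version is installed:

```r
library(ggplot2)

# Same density histogram, referring to the stat's output with after_stat().
dens_hist <- ggplot(diamonds, aes(price)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 500)
dens_hist
```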
Generated variables
• This technique is particularly useful when you want to compare the distribution of multiple groups
that have very different sizes. For example, it’s hard to compare the distribution of price within cut
because some groups are quite small. It’s easier to compare if we standardise each group to take up
the same area:
ggplot(diamonds, aes(price, colour = cut)) +
geom_freqpoly(binwidth = 500) +
theme(legend.position = "none")

ggplot(diamonds, aes(price, colour = cut)) +
geom_freqpoly(aes(y = ..density..), binwidth = 500) +
theme(legend.position = "none")
Exercise 7
1. The code below creates a similar dataset to stat_smooth(). Use the appropriate geoms to
mimic the default geom_smooth() display.
mod <- loess(hwy ~ displ, data = mpg)
smoothed <- data.frame(displ = seq(1.6, 7, length = 50))
pred <- predict(mod, newdata = smoothed, se = TRUE)
smoothed$hwy <- pred$fit
smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit
smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit
Exercise 7
2. What stats were used to create the following plots?

3. Read the help for stat_sum() then use geom_count() to create a plot that
shows the proportion of cars that have each combination of drv and trans.
Position adjustments
• Position adjustments apply minor tweaks to the position of elements within a layer. Three
adjustments apply primarily to bars:
• position_stack(): stack overlapping bars (or areas) on top of each other.
• position_fill(): stack overlapping bars, scaling so the top is always at 1.
• position_dodge(): place overlapping bars (or boxplots) side-by-side.
Position adjustments
dplot <- ggplot(diamonds, aes(color, fill = cut)) + xlab(NULL) + ylab(NULL) +
theme(legend.position = "none")
# position stack is the default for bars, so `geom_bar()` is equivalent to
# `geom_bar(position = "stack")`.
dplot + geom_bar()
dplot + geom_bar(position = "fill")
dplot + geom_bar(position = "dodge")
Position adjustments
• There’s also a position adjustment that does nothing: position_identity(). The identity position
adjustment is not useful for bars, because each bar obscures the bars behind, but there are many
geoms that don’t need adjusting, like the frequency polygon:
dplot + geom_bar(position = "identity", alpha = 1 / 2, colour = "grey50")

ggplot(diamonds, aes(color, colour = cut)) +
geom_freqpoly(aes(group = cut), stat = "count") +
xlab(NULL) + ylab(NULL) +
theme(legend.position = "none")
Position adjustments
• There are three position adjustments that are primarily useful for points:
• position_nudge(): move points by a fixed offset.
• position_jitter(): add a little random noise to every position.
• position_jitterdodge(): dodge points within groups, then add a little random noise.
Position adjustments
• Note that the way you pass parameters to position adjustments differs to stats and geoms. Instead
of including additional arguments in ..., you construct a position adjustment object, supplying
additional arguments in the call:
ggplot(mpg, aes(displ, hwy)) +
geom_point(position = "jitter")

ggplot(mpg, aes(displ, hwy)) +
geom_point(position = position_jitter(width = 0.05, height = 0.5))
• This is rather verbose, so geom_jitter() provides a convenient shortcut:
ggplot(mpg, aes(displ, hwy)) +
geom_jitter(width = 0.05, height = 0.5)
Scales
• Scales control the mapping from data to aesthetics. They take your data and turn it into something
that you can see, like size, colour, position or shape. Scales also provide the tools that let you read
the plot: the axes and legends. Formally, each scale is a function from a region in data space (the
domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the
inverse function: it allows you to convert visual properties back to data.
Modifying scales
• A scale is required for every aesthetic used on the plot. When you write:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))

What actually happens is this:


ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()

Default scales are named according to the aesthetic and the variable type: scale_y_continuous(),
scale_colour_discrete(), etc.
Modifying scales
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous("A really awesome x axis label") +
scale_y_continuous("An amazingly great y axis label")
Compare two specifications?
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 1") +
scale_x_continuous("Label 2")

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 2")
Modifying scales
• You can also use a different scale altogether:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_sqrt() +
scale_colour_brewer()
Modifying scales
You’ve probably already figured out the naming scheme for scales, but to be concrete, it’s
made up of three pieces separated by "_":
1. scale
2. The name of the aesthetic (e.g., colour, shape or x)
3. The name of the scale (e.g., continuous, discrete, brewer).
Exercise 8
• What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete
scale?
• Simplify the following plot specifications to make them easier to understand.
ggplot(mpg, aes(displ)) +
scale_y_continuous("Highway mpg") +
scale_x_continuous() +
geom_point(aes(y = hwy))

ggplot(mpg, aes(y = displ, x = class)) +
scale_y_continuous("Displacement (l)") +
scale_x_discrete("Car type") +
scale_x_discrete("Type of car") +
scale_colour_discrete() +
geom_point(aes(colour = drv)) +
scale_colour_discrete("Drive\ntrain")
Guides: legends and axes
• The component of a scale that you’re most likely to want to modify is the guide, the axis or legend associated
with the scale. Guides allow you to read observations from the plot and map them back to their original values. In
ggplot2, guides are produced automatically based on the layers in your plot. This is very different to base R
graphics, where you are responsible for drawing the legends by hand. In ggplot2, you don’t directly control the
legend; instead you set up the data so that there’s a clear mapping between data and aesthetics, and a legend is
generated for you automatically. This can be frustrating when you first start using ggplot2, but once you get the
hang of it, you’ll find that it saves you time, and there is little you cannot do. If you’re struggling to get the legend
you want, it’s likely that your data is in the wrong form.
Scale title
• The first argument to the scale function, name, is the axis or legend title. You can supply text strings
(using \n for line breaks) or mathematical expressions in quote() (as described in ?plotmath):
df <- data.frame(x = 1:2, y = 1, z = "a")
p <- ggplot(df, aes(x, y)) + geom_point()
p + scale_x_continuous("X axis")
p + scale_x_continuous(quote(a + mathematical ^ expression))
Scale title
• Because tweaking these labels is such a common task, there are three helpers that save you some
typing: xlab(), ylab() and labs():
p <- ggplot(df, aes(x, y)) + geom_point(aes(colour = z))
p +
xlab("X axis") +
ylab("Y axis")
p + labs(x = "X axis", y = "Y axis", colour = "Colour\nlegend")
Scale title
• There are two ways to remove the axis label. Setting it to "" omits the label, but still allocates
space; NULL removes the label and its space. Look closely at the left and bottom borders of the
following two plots. I’ve drawn a grey rectangle around the plot to make it easier to see the
difference.
p <- ggplot(df, aes(x, y)) +
geom_point() +
theme(plot.background = element_rect(colour = "grey50"))
p + labs(x = "", y = "")
p + labs(x = NULL, y = NULL)
Breaks and labels
• The breaks argument controls which values appear as tick marks on axes and keys on legends.
Each break has an associated label, controlled by the labels argument. If you set labels, you must
also set breaks; otherwise, if data changes, the breaks will no longer align with the labels.
df <- data.frame(x = c(1, 3, 5) * 1000, y = 1)
axs <- ggplot(df, aes(x, y)) +
geom_point() +
labs(x = NULL, y = NULL)
axs
axs + scale_x_continuous(breaks = c(2000, 4000))
axs + scale_x_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))
Breaks and labels
leg <- ggplot(df, aes(y, x, fill = x)) +
geom_tile() +
labs(x = NULL, y = NULL)
leg
leg + scale_fill_continuous(breaks = c(2000, 4000))
leg + scale_fill_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))
Breaks and labels
• If you want to relabel the breaks in a categorical scale, you can use a named labels vector:
df2 <- data.frame(x = 1:3, y = c("a", "b", "c"))
ggplot(df2, aes(x, y)) +
geom_point()
ggplot(df2, aes(x, y)) +
geom_point() +
scale_y_discrete(labels = c(a = "apple", b = "banana", c = "carrot"))
Breaks and labels
• To suppress breaks (and for axes, grid lines) or labels, set them to NULL:
axs + scale_x_continuous(breaks = NULL)
axs + scale_x_continuous(labels = NULL)
Breaks and labels
leg + scale_fill_continuous(breaks = NULL)
leg + scale_fill_continuous(labels = NULL)
Breaks and labels
• Additionally, you can supply a function to breaks or labels. The breaks function should have one
argument, the limits (a numeric vector of length two), and should return a numeric vector of
breaks. The labels function should accept a numeric vector of breaks and return a character vector
of labels (the same length as the input). The scales package provides a number of useful labelling
functions:
• scales::comma_format() adds commas to make it easier to read large numbers.
• scales::unit_format(unit, scale) adds a unit suffix, optionally scaling.
• scales::dollar_format(prefix, suffix) displays currency values, rounding to two decimal places
and adding a prefix or suffix.
• scales::wrap_format() wraps long labels into multiple lines.
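Besides the ready-made helpers from the scales package, you can write the two functions yourself. The sketch below is a minimal example under stated assumptions: three_breaks() and in_thousands() are made-up names, and the data frame is illustrative.

```r
library(ggplot2)

# Illustrative data; the helper names below are made up for this example.
price_df <- data.frame(x = c(1, 3, 5) * 1000, y = 1)

# A breaks function receives the limits (length two) and returns break positions.
three_breaks <- function(limits) {
  seq(limits[1], limits[2], length.out = 3)
}

# A labels function receives the breaks and returns one label per break.
in_thousands <- function(breaks) {
  paste0(breaks / 1000, "k")
}

p_custom <- ggplot(price_df, aes(x, y)) +
  geom_point() +
  scale_x_continuous(breaks = three_breaks, labels = in_thousands)
```

Because the labels are computed from whatever breaks the scale ends up using, the two stay aligned even when the data changes.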
Breaks and labels
axs + scale_y_continuous(labels = scales::percent_format())
axs + scale_y_continuous(labels = scales::dollar_format())
leg + scale_fill_continuous(labels = scales::unit_format())
Breaks and labels
• You can adjust the minor breaks (the faint grid lines that appear between the major grid lines) by
supplying a numeric vector of positions to the minor_breaks argument. This is particularly useful
for log scales:
df <- data.frame(x = c(2, 3, 5, 10, 200, 3000), y = 1)
ggplot(df, aes(x, y)) +
geom_point() +
scale_x_log10()
mb <- as.numeric(1:10 %o% 10 ^ (0:4))
ggplot(df, aes(x, y)) +
geom_point() +
scale_x_log10(minor_breaks = mb) # current ggplot2 expects minor breaks in data units and log-transforms them itself
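The expression 1:10 %o% 10 ^ (0:4) can look cryptic. This base R sketch uses a smaller version (two powers of ten instead of five) to show what the outer product produces; the variable names are chosen for this example.

```r
# 1:10 %o% 10^(0:1) is a 10 x 2 matrix: one column per power of ten.
grid <- 1:10 %o% 10 ^ (0:1)
dim(grid)

# as.numeric() flattens it column by column: 1..10, then 10, 20, ..., 100.
mb_small <- as.numeric(grid)

# 10 appears twice (10 * 1 and 1 * 10); unique() drops the repeat.
mb_unique <- unique(mb_small)
```

The duplicated values are harmless for grid lines, so the slide code leaves them in.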
Exercise 9
1. Recreate the following graphic:
Exercise 9
2. Recreate the following graphic:
Legends
• While the most important parameters are shared between axes and legends, there are some extra
options that only apply to legends. Legends are more complicated than axes because:
1. A legend can display multiple aesthetics (e.g. colour and shape), from multiple layers, and the
symbol displayed in a legend varies based on the geom used in the layer.
2. Axes always appear in the same place. Legends can appear in different places, so you need some
global way of controlling them.
3. Legends have considerably more details that can be tweaked: should they be displayed vertically
or horizontally? How many columns? How big should the keys be?
Layers and legends
• A legend may need to draw symbols from multiple layers. For example, if you’ve mapped colour
to both points and lines, the keys will show both points and lines. If you’ve mapped fill colour,
you get a rectangle. Note the way the legend varies in the plots below:
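A minimal sketch of three such plots follows. The data frame legend_df and the object names are assumptions made for this example; the original figures are not reproduced here.

```r
library(ggplot2)

# Small illustrative data set (not part of the original slides).
legend_df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))

# Colour mapped on points only: the keys are points.
p_points <- ggplot(legend_df, aes(x, y, colour = z)) +
  geom_point()

# Colour mapped on points and lines: each key shows a point and a line.
p_both <- ggplot(legend_df, aes(x, y, colour = z)) +
  geom_point() +
  geom_line(aes(group = 1))

# Fill mapped on bars: the keys are filled rectangles.
p_fill <- ggplot(legend_df, aes(x, y, fill = z)) +
  geom_col()
```

Printing p_points, p_both and p_fill one after another makes the difference in the legend keys easy to compare.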
Layers and legends
• By default, a layer will only appear in the legend if the corresponding aesthetic is mapped to a
variable with aes(). You can override whether or not a layer appears in the legend with show.legend:
FALSE prevents a layer from ever appearing in the legend; TRUE forces it to appear when it otherwise
wouldn’t. Using TRUE can be useful in conjunction with the following trick to make points stand
out:
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c")) # df needs a z column for the examples that follow
ggplot(df, aes(y, y)) +
geom_point(size = 4, colour = "grey20") +
geom_point(aes(colour = z), size = 2)
ggplot(df, aes(y, y)) +
geom_point(size = 4, colour = "grey20", show.legend = TRUE) +
geom_point(aes(colour = z), size = 2)
Layers and legends
• Sometimes you want the geoms in the legend to display differently to the geoms in the plot. This is
particularly useful when you’ve used transparency or size to deal with moderate overplotting and
also used colour in the plot. You can do this using the override.aes parameter of guide_legend(),
which you’ll learn more about shortly.
norm <- data.frame(x = rnorm(1000), y = rnorm(1000))
norm$z <- cut(norm$x, 3, labels = c("a", "b", "c"))
ggplot(norm, aes(x, y)) +
geom_point(aes(colour = z), alpha = 0.1)
ggplot(norm, aes(x, y)) +
geom_point(aes(colour = z), alpha = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
Layers and legends
• ggplot2 tries to use the fewest number of legends to accurately convey the aesthetics used in the
plot. It does this by combining legends where the same variable is mapped to different aesthetics.
The figure below shows how this works for points: if both colour and shape are mapped to the
same variable, then only a single legend is necessary.
ggplot(df, aes(x, y)) + geom_point(aes(colour = z))
ggplot(df, aes(x, y)) + geom_point(aes(shape = z))
ggplot(df, aes(x, y)) + geom_point(aes(shape = z, colour = z))
Legend layout
• The position and justification of legends are controlled by the theme setting legend.position, which
takes values “right”, “left”, “top”, “bottom”, or “none” (no legend).
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))
base <- ggplot(df, aes(x, y)) +
geom_point(aes(colour = z), size = 3) +
xlab(NULL) +
ylab(NULL)
base + theme(legend.position = "right") # the default
base + theme(legend.position = "bottom")
base + theme(legend.position = "none")
Legend layout
• Switching between left/right and top/bottom modifies how the keys in each legend are laid out
(horizontally or vertically), and how multiple legends are stacked (horizontally or vertically). If
needed, you can adjust those options independently:
• legend.direction: layout of items in legends (“horizontal” or “vertical”).
• legend.box: arrangement of multiple legends (“horizontal” or “vertical”).
• legend.box.just: justification of each legend within the overall bounding box, when there are
multiple legends (“top”, “bottom”, “left”, or “right”).
Legend layout
• Alternatively, if there’s a lot of blank space in your plot you might want to place the legend inside
the plot. You can do this by setting legend.position to a numeric vector of length two. The numbers
represent a relative location in the panel area: c(0, 1) is the top-left corner and c(1, 0) is the
bottom-right corner. You control which corner of the legend the legend.position refers to with
legend.justification, which is specified in a similar way. Unfortunately positioning the legend
exactly where you want it requires a lot of trial and error.
base <- ggplot(df, aes(x, y)) +
geom_point(aes(colour = z), size = 3)
base + theme(legend.position = c(0, 1), legend.justification = c(0, 1))
base + theme(legend.position = c(0.5, 0.5), legend.justification = c(0.5, 0.5))
base + theme(legend.position = c(1, 0), legend.justification = c(1, 0))
Guide functions
• The guide functions, guide_colourbar() and guide_legend(), offer additional control over the fine
details of the legend. Legend guides can be used for any aesthetic (discrete or continuous) while
the colour bar guide can only be used with continuous colour scales.
• You can override the default guide using the guide argument of the corresponding scale function,
or more conveniently, the guides() helper function. guides() works like labs(): you can override the
default guide associated with each aesthetic.
df <- data.frame(x = 1, y = 1:3, z = 1:3)
base <- ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))
base
base + scale_fill_continuous(guide = guide_legend())
base + guides(fill = guide_legend())
guide_legend()
• The legend guide displays individual keys in a table. The most useful options are:
• nrow or ncol which specify the dimensions of the table. byrow controls how the table is filled:
FALSE fills it by column (the default), TRUE fills it by row.
df <- data.frame(x = 1, y = 1:4, z = letters[1:4])
# Base plot
p <- ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))
p
p + guides(fill = guide_legend(ncol = 2))
p + guides(fill = guide_legend(ncol = 2, byrow = TRUE))
guide_legend()
• reverse reverses the order of the keys. This is particularly useful when you have stacked bars
because the default stacking and legend orders are different:
p <- ggplot(df, aes(1, y)) + geom_bar(stat = "identity", aes(fill = z))
p
p + guides(fill = guide_legend(reverse = TRUE))
• override.aes: override some of the aesthetic settings derived from each layer. This is useful if you
want to make the elements in the legend more visually prominent.
• keywidth and keyheight (along with default.unit) allow you to specify the size of the keys. These
are grid units, e.g. unit(1, "cm").
guide_colourbar
• The colour bar guide is designed for continuous ranges of colors—as its name implies, it outputs a
rectangle over which the color gradient varies. The most important arguments are:
• barwidth and barheight (along with default.unit) allow you to specify the size of the bar.
These are grid units, e.g. unit(1, "cm").
• nbin controls the number of slices. You may want to increase this if you draw a very long bar
(the default was 20 in older ggplot2 releases; recent releases use a larger default).
• reverse flips the colour bar to put the lowest values at the top.
• These options are illustrated below:
df <- data.frame(x = 1, y = 1:4, z = 4:1)
p <- ggplot(df, aes(x, y)) + geom_tile(aes(fill = z))
p
p + guides(fill = guide_colorbar(reverse = TRUE))
p + guides(fill = guide_colorbar(barheight = unit(4, "cm")))
Exercise 10
1. How do you make legends appear to the left of the plot?
2. What’s gone wrong with this plot? How could you fix it?
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = drv, shape = drv)) +
scale_colour_discrete("Drive train")
Exercise 10
3. Can you recreate the code for this plot?
Limits
• The limits, or domain, of a scale are usually derived from the range of the data. There are two
reasons you might want to specify limits rather than relying on the data:
1. You want to make limits smaller than the range of the data to focus on an interesting area of the
plot.
2. You want to make the limits larger than the range of the data because you want multiple plots to
match up.

It’s most natural to think about the limits of position scales: they map directly to the ranges of the
axes. But limits also apply to scales that have legends, like colour, size, and shape. This is
particularly important to realise if you want your colours to match up across multiple plots in your
paper.
Limits
• You can modify the limits using the limits parameter of the scale:
• For continuous scales, this should be a numeric vector of length two. If you only want to set
the upper or lower limit, you can set the other value to NA.
• For discrete scales, this is a character vector which enumerates all possible values.
df <- data.frame(x = 1:3, y = 1:3)
base <- ggplot(df, aes(x, y)) + geom_point()
base
base + scale_x_continuous(limits = c(1.5, 2.5))
#> Warning: Removed 2 rows containing missing values (geom_point).
base + scale_x_continuous(limits = c(0, 4))
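The same limits argument works for scales that have legends. The sketch below, using made-up data frames df_low and df_high, gives two plots one shared set of colour limits so that a given colour means the same value in both.

```r
library(ggplot2)

# Two illustrative data sets whose z values cover different ranges.
df_low  <- data.frame(x = 1:3, y = 1:3, z = c(1, 5, 10))
df_high <- data.frame(x = 1:3, y = 1:3, z = c(20, 25, 30))

# One set of limits spanning both data sets.
shared_limits <- range(c(df_low$z, df_high$z))

p_low <- ggplot(df_low, aes(x, y, colour = z)) +
  geom_point() +
  scale_colour_continuous(limits = shared_limits)

p_high <- ggplot(df_high, aes(x, y, colour = z)) +
  geom_point() +
  scale_colour_continuous(limits = shared_limits)
```

Without the shared limits, each plot would stretch its colour gradient over its own data range and the two legends would not be comparable.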
Limits
• Because modifying the limits is such a common task, ggplot2 provides some helpers to make this
even easier: xlim(), ylim() and lims(). These functions inspect their input and then create the
appropriate scale, as follows:
• xlim(10, 20): a continuous scale from 10 to 20
• ylim(20, 10): a reversed continuous scale from 20 to 10
• xlim("a", "b", "c"): a discrete scale
• xlim(as.Date(c("2008-05-01", "2008-08-01"))): a date scale from May 1 to August 1 2008.
base + xlim(0, 4)
base + xlim(4, 0)
base + lims(x = c(0, 4))
Components of ggplot2
• In essence, every plot built with the ggplot2 package has three main components:
1. The input data,
2. A set of aesthetic mappings (written aes) between the variables in the data set and
visual properties,
3. At least one layer that describes how to draw the data. Layers are usually created with a geom function.
• The examples use the data set "ch4bt8.wf1":
library(ggplot2)
library(hexView)
test = readEViews("ch4bt8.wf1")
Scatter plot in ggplot2
ggplot(test, aes(x = EDUC, y = WAGE)) + geom_point() # Option 1
ggplot(test, aes(EDUC, WAGE)) + geom_point() # Option 2
Scatter plot in ggplot2
• For the two groups of workers, one per sex, in separate panels:
test$MALE[test$MALE == 1] = "MALE"
test$MALE[test$MALE == 0] = "FEMALE"
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point(show.legend = FALSE) +
facet_wrap(~ MALE)
Scatter plot in ggplot2
• For the two groups of workers, one per sex, in a single panel:
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point() # Both scatter plots combined
Regression line in ggplot2
• Regression line for the whole sample, with EDUC as the independent variable and WAGE as the
dependent variable.
ggplot(test, aes(EDUC, WAGE)) + geom_point() + geom_smooth(method = "lm", se =
FALSE)
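To make the roles of the two variables explicit: geom_smooth(method = "lm") fits y on x, so the variable mapped to y is the response. The sketch below shows this with the mpg data from ggplot2, used here as a stand-in because the "ch4bt8.wf1" file is not assumed to be available.

```r
library(ggplot2)

# geom_smooth(method = "lm") fits y ~ x, so the y aesthetic is the response.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# The same straight line, fitted directly with lm().
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)
```

The intercept and slope returned by coef(fit) describe the line that the smooth layer draws.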
Regression line in ggplot2
• Two regression lines, one per group of workers, shown on the same plot:
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point() + geom_smooth(method
= "lm")
Histogram in ggplot2
• Histograms of the variable IQ (IQ score), drawn separately for the two sexes.
ggplot(test, aes(IQ, fill = MALE)) + geom_histogram(show.legend = FALSE) + facet_wrap(~
MALE)
Histogram in ggplot2
• With slightly lighter colours:
ggplot(test, aes(IQ, fill = MALE)) + geom_histogram(show.legend = FALSE, alpha = 0.5) +
facet_wrap(~ MALE)
Box plot in ggplot2
library(hexView)
test = readEViews("ch4bt8.wf1") # Example using the data set "ch4bt8.wf1"
test$MALE[test$MALE == 1] = "MALE"
test$MALE[test$MALE == 0] = "FEMALE"
ggplot(test, aes(MALE, IQ, colour = MALE)) + geom_boxplot(show.legend = FALSE)
# Box plot of IQ by sex
Probability density plot in ggplot2
• Plot the probability density of log(wage) for the four groups of workers from the four geographic
regions:
data("CPS1988", package = "AER") # CPS1988 ships with the AER package
ggplot(CPS1988, aes(log(wage), colour = region)) + geom_density() # Option 1
Probability density plot in ggplot2
ggplot(CPS1988, aes(log(wage), fill = region)) + geom_density(alpha = 0.3) # Option 2
Probability density plot in ggplot2
• Plot the probability density of log(wage) for all observations:
ggplot(CPS1988, aes(log(wage))) + geom_density(color = "darkblue", fill = "lightblue")
Probability density plot in ggplot2
• Show the histogram and the density together:
ggplot(CPS1988, aes(log(wage))) + geom_density(alpha = 0.3, fill = "blue", color = "blue") +
geom_histogram(aes(y = after_stat(density)), fill = "red", color = "red", alpha = 0.3) + theme_bw()
Bar chart in ggplot2
install.packages("tidyverse") # Run once if the tidyverse is not installed yet
library(AER)
data("CPS1988")
library(tidyverse) # Plot the number of workers by geographic region
k1 = CPS1988 %>% group_by(region) %>% count() %>%
  ggplot(aes(region, n)) + geom_col() + theme_minimal() # Style 1
k1
k2 = CPS1988 %>% group_by(region) %>% count() %>%
  ggplot(aes(region, n, fill = region)) + geom_col() + theme_minimal() # Style 2
k2
k3 = CPS1988 %>% group_by(region) %>% count() %>%
  ggplot(aes(reorder(region, n), n, fill = region)) + geom_col(show.legend = FALSE) # Style 3
k3
Bar chart in ggplot2
k4 = k3 + coord_flip() + labs(x = NULL, y = NULL, title = "Observations by Region",
  caption = "Data Source: US Census Bureau") # Flip k3 and tidy the labels
k4
Bar chart in ggplot2
# Number of workers by geographic region and ethnicity
k5 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>%
  ggplot(aes(region, n)) + geom_col() + facet_wrap(~ ethnicity, scales = "free") +
  geom_text(aes(label = n), color = "white", vjust = 1.2, size = 3) +
  labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
    caption = "Data Source: US Census Bureau")
k5 # Style 1
k6 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>%
  ggplot(aes(region, n, fill = ethnicity)) + geom_col(position = "stack") +
  labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
    caption = "Data Source: US Census Bureau")
k6 # Style 2
k7 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>%
  ggplot(aes(region, n, fill = ethnicity)) + geom_col(position = "dodge") +
  labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
    caption = "Data Source: US Census Bureau")
k7 # Style 3
Pie chart in ggplot2
library(dplyr)
p = CPS1988 %>% group_by(region) %>% count() %>%
  ggplot(aes(x = "", n, fill = region)) + geom_col(width = 1) # geom_col() already uses stat = "identity"
p # Bar chart
p + coord_polar("y", start = 0) + labs(x = NULL, y = NULL) # Pie chart
Line chart in ggplot2
library(foreign)
test = read.dta("Table13_1.dta")
ggplot(test, aes(time, ex)) + geom_line(colour = "blue")
Line chart in ggplot2
library(data.table)
test = fread("cophieu.csv") # Read the file "cophieu.csv"
library(tidyverse) # Provides mutate()
library(lubridate) # Date parsing
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>%
  ggplot(aes(Date, X.Open.)) + geom_line(color = "blue") +
  facet_wrap(~ X.Ticker., drop = TRUE, scales = "free") + theme_bw() # Style 1
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>%
  ggplot(aes(Date, X.Open., color = X.Ticker.)) + geom_line() + theme_bw() # Style 2
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>%
  rename(Price = X.Open., Symbol = X.Ticker.) %>%
  ggplot(aes(Date, Price, color = Symbol)) + geom_line() + theme_bw() # Style 3
Bar chart
library(ggplot2)
data(Marriage, package = "mosaicData")
# plot the distribution of race
ggplot(Marriage, aes(x = race)) + geom_bar()
# plot the distribution of race with modified colors and labels
ggplot(Marriage, aes(x = race)) +
geom_bar(fill = "cornflowerblue",
color="black") +
labs(x = "Race",
y = "Frequency",
title = "Participants by race")
Bar chart: percentage
# plot the distribution as percentages
ggplot(Marriage,
aes(x = race,
y = after_stat(count / sum(count)))) +
geom_bar() +
labs(x = "Race",
y = "Percent",
title = "Participants by race") +
scale_y_continuous(labels = scales::percent)
Bar chart: sorting categories
# calculate number of participants in each race category
library(dplyr)
plotdata <- Marriage %>%
count(race)
# plot the bars in ascending order
ggplot(plotdata,
aes(x = reorder(race, n),
y = n)) +
geom_bar(stat = "identity") +
labs(x = "Race",
y = "Frequency",
title = "Participants by race")
Bar chart: labeling bars
# plot the bars with numeric labels
ggplot(plotdata,
aes(x = race,
y = n)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n),
vjust=-0.5) +
labs(x = "Race",
y = "Frequency",
title = "Participants by race")
Bar chart
library(dplyr)
library(scales)
plotdata <- Marriage %>%
count(race) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
# plot the bars as percentages,
# in descending order with bar labels
ggplot(plotdata,
aes(x = reorder(race, -pct),
y = pct)) +
geom_bar(stat = "identity",
fill = "indianred3",
color = "black") +
geom_text(aes(label = pctlabel),
vjust = -0.25) +
scale_y_continuous(labels = percent) +
labs(x = "Race",
y = "Percent",
title = "Participants by race")
Bar chart: overlapping labels
# basic bar chart with overlapping labels
ggplot(Marriage, aes(x = officialTitle)) +
geom_bar() +
labs(x = "Officiate",
y = "Frequency",
title = "Marriages by officiate")
Bar chart: overlapping labels
# horizontal bar chart
ggplot(Marriage, aes(x = officialTitle)) +
geom_bar() +
labs(x = "",
y = "Frequency",
title = "Marriages by officiate") +
coord_flip()
Bar chart: overlapping labels
# bar chart with rotated labels
ggplot(Marriage, aes(x = officialTitle)) +
geom_bar() +
labs(x = "",
y = "Frequency",
title = "Marriages by officiate") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Bar chart: overlapping labels
# bar chart with staggered labels
lbls <- paste0(c("", "\n"),
levels(Marriage$officialTitle))
ggplot(Marriage,
aes(x=factor(officialTitle,
labels = lbls))) +
geom_bar() +
labs(x = "",
y = "Frequency",
title = "Marriages by officiate")
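The staggering relies on vector recycling in paste0(): the two-element prefix c("", "\n") is repeated along the labels, so every second label starts on a new line. A base R sketch with made-up titles (not the actual levels of Marriage$officialTitle):

```r
# Illustrative category names, assumed for this example.
titles <- c("BISHOP", "CATHOLIC PRIEST", "CHIEF CLERK", "CIRCUIT JUDGE")

# c("", "\n") is recycled: odd positions get "", even positions get "\n".
lbls <- paste0(c("", "\n"), titles)
lbls
```

Labels that begin with "\n" are pushed down one line on the axis, which keeps neighbouring labels from colliding.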
Pie chart
# create a basic ggplot2 pie chart
plotdata <- Marriage %>%
count(race) %>%
arrange(desc(race)) %>%
mutate(prop = round(n * 100 / sum(n), 1),
lab.ypos = cumsum(prop) - 0.5 *prop)

ggplot(plotdata,
aes(x = "",
y = prop,
fill = race)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void()
Pie chart: add labels, remove the legend
# create a pie chart with slice labels
plotdata <- Marriage %>%
count(race) %>%
arrange(desc(race)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$race, "\n",
round(plotdata$prop), "%")
ggplot(plotdata,
aes(x = "",
y = prop,
fill = race)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
geom_text(aes(y = lab.ypos, label = label),
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Participants by race")
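The lab.ypos column places each label at the midpoint of its slice along the stacked axis. A small base R sketch with illustrative percentages (not the Marriage data) shows the arithmetic:

```r
# Illustrative slice percentages (not the Marriage data).
prop <- c(50, 30, 20)

# Upper edge of each slice on the stacked axis.
upper <- cumsum(prop)

# Midpoint of each slice: upper edge minus half the slice.
lab.ypos <- upper - 0.5 * prop
lab.ypos
```

For slices of 50, 30 and 20 percent the upper edges are 50, 80 and 100, so the labels sit at 25, 65 and 90.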
Tree map
library(treemapify)
# create a treemap of marriage officials
plotdata <- Marriage %>%
count(officialTitle)

ggplot(plotdata,
aes(fill = officialTitle,
area = n)) +
geom_treemap() +
labs(title = "Marriages by officiate")
Tree map: with labels
# create a treemap with tile labels
ggplot(plotdata,
aes(fill = officialTitle,
area = n,
label = officialTitle)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre") +
labs(title = "Marriages by officiate") +
theme(legend.position = "none")
Histogram
library(ggplot2)
data(Marriage, package="mosaicData")
# plot the age distribution using a histogram
ggplot(Marriage, aes(x = age)) +
geom_histogram() +
labs(title = "Participants by age",
x = "Age")
Histogram: with colors
# plot the histogram with blue bars and white borders
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
labs(title="Participants by age",
x = "Age")
Histogram: bins and bandwidths
# plot the histogram with 20 bins
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white",
bins = 20) +
labs(title="Participants by age",
subtitle = "number of bins = 20",
x = "Age")
Histogram: binwidth
# plot the histogram with a binwidth of 5
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
labs(title="Participants by age",
subtitle = "binwidth = 5 years",
x = "Age")
Histogram: percentage
# plot the histogram with percentages on the y-axis
library(scales)
ggplot(Marriage,
aes(x = age,
y = after_stat(count / sum(count)))) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
labs(title="Participants by age",
y = "Percent",
x = "Age") +
scale_y_continuous(labels = percent)
Kernel Density plot
# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
geom_density() +
labs(title = "Participants by age")
Kernel Density plot: the fill and border colors
# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
geom_density(fill = "indianred3") +
labs(title = "Participants by age")
Kernel Density plot: smoothing parameter
# default bandwidth for the age variable
bw.nrd0(Marriage$age)
## [1] 5.181946
# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
geom_density(fill = "deepskyblue",
bw = 1) +
labs(title = "Participants by age",
subtitle = "bandwidth = 1")
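The effect of the bandwidth can also be inspected outside ggplot2 with base R's density(). The sketch below uses simulated ages, an assumption made so that the example runs without the Marriage data.

```r
set.seed(1)
# Simulated ages, used only for illustration.
ages <- rnorm(100, mean = 35, sd = 10)

# Rule-of-thumb bandwidth that density() uses by default.
default_bw <- bw.nrd0(ages)

d_default <- density(ages)          # uses bw.nrd0
d_narrow  <- density(ages, bw = 1)  # smaller bandwidth, wigglier curve
```

Plotting d_default and d_narrow side by side (for example with plot()) shows how a smaller bandwidth follows the data more closely at the cost of a rougher curve.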
Stacked bar chart
library(ggplot2)
data(mpg, package="ggplot2")
# stacked bar chart
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "stack")
Grouped bar chart
library(ggplot2)
# grouped bar plot
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "dodge")
Grouped bar chart: wider
library(ggplot2)
# grouped bar plot preserving zero count bars
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = position_dodge(preserve = "single"))
Segmented bar chart
library(ggplot2)
# bar plot, with each bar representing 100%
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "fill") +
labs(y = "Proportion")
Improving the color and labeling
library(ggplot2)
# bar plot, with each bar representing 100%,
# reordered bars, and better labels and colors
library(scales)
ggplot(mpg,
aes(x = factor(class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup")),
fill = factor(drv,
levels = c("f", "r", "4"),
labels = c("front-wheel",
"rear-wheel",
"4-wheel")))) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
labels = percent) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class") +
theme_minimal()
Improving the color and labeling
# change the order the levels for the categorical variable "class"
mpg$class = factor(mpg$class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup"))
Improving the color and labeling
# create a summary dataset
library(dplyr)
plotdata <- mpg %>%
group_by(class, drv) %>%
summarize(n = n()) %>%
mutate(pct = n/sum(n),
lbl = scales::percent(pct))
plotdata
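The percentages add up to 100% within each class because summarize() drops the last grouping variable (drv), leaving the result grouped by class when mutate() runs. A minimal sketch on a made-up data frame; toy_cars and shares are names invented for this example, and .groups = "drop_last" only spells out the default behaviour:

```r
library(dplyr)

# Tiny illustrative data set (not mpg).
toy_cars <- data.frame(
  class = c("suv", "suv", "suv", "compact"),
  drv   = c("4", "4", "r", "f")
)

shares <- toy_cars %>%
  group_by(class, drv) %>%
  # After summarize(), the data is still grouped by class.
  summarize(n = n(), .groups = "drop_last") %>%
  # sum(n) is therefore the total per class, not the grand total.
  mutate(pct = n / sum(n))

shares
```

In this toy data, two of the three SUVs are four-wheel drive, so that row gets a share of two thirds, while the single compact car gets a share of one.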
Improving the color and labeling
# create segmented bar chart
# adding labels to each segment

ggplot(plotdata,
aes(x = factor(class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup")),
y = pct,
fill = factor(drv,
levels = c("f", "r", "4"),
labels = c("front-wheel",
"rear-wheel",
"4-wheel")))) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
labels = percent) +
geom_text(aes(label = lbl),
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class") +
theme_minimal()
Scatterplot
library(ggplot2)
data(Salaries, package="carData")

# simple scatterplot
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point()
Vietnam
https://future.ueh.edu.vn/

Thank You
