Lecture 6 - Data Visualization with ggplot2
This lecture introduces ggplot2 [1], a modern R package for data visualization. Last week we
discussed graphics in base R, which is the original plotting system for R. There are pros and
cons to each approach – base R graphics tend to be more customizable, while ggplot2 graphics
tend to look nicer without many adjustments. ggplot2 also has advantages when dealing with
categorical data. For the rest of the semester we will focus on the ggplot2 approach to
graphics.
ggplot2 is part of the tidyverse, which is a collection of R packages designed for data science.
To install the tidyverse, run the following command in the console:
install.packages("tidyverse")
Installation only needs to be done once. To load the tidyverse, run:
library(tidyverse)
The library() command needs to be run during each R session when you want to use ggplot2 or other
tidyverse packages.
Alternatively, you can just load ggplot2, without the other tidyverse packages:
library(ggplot2)
[1] https://ggplot2.tidyverse.org/
mpg data frame
The mpg data frame is part of the ggplot2 package. The data set is stored as a tibble, which
is how data frames are represented in the tidyverse. A nice feature of tibbles is that when
you type the name of a data frame, only the first 10 rows and all columns that fit on the screen
are displayed. The type of each column (variable) is also shown under its name.
From R for Data Science: “A data frame is a rectangular collection of variables (in the columns)
and observations (in the rows). mpg contains observations collected by the US EPA on 38 car
models.”
mpg
# A tibble: 234 x 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
 2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
 3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
 4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
 5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
 6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
 7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
 8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
 9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
# i 224 more rows
To learn more about this data set, read the documentation in the help menu:
help(mpg)
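As a small complement to printing the tibble, the sketch below uses a few base R functions to check the size, column names, and class of mpg. It assumes only that ggplot2 is installed; nothing here is specific to the lecture's own code.

```r
library(ggplot2)  # mpg ships with ggplot2

dim(mpg)    # number of rows and columns (234 and 11, as in the printout above)
names(mpg)  # variable names
class(mpg)  # includes "tbl_df", the tibble class
```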
Creating a ggplot
Run the following code to make a scatter plot with displ on the 𝑥-axis and hwy on the 𝑦-axis:
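A minimal sketch of such code is given here. It assumes the R for Data Science style in which the data go to ggplot() and the mapping goes to geom_point(); the name p is only a convenience for storing the plot.

```r
library(ggplot2)

# ggplot() initializes the plot with the mpg data frame;
# geom_point() adds the points, with displ on x and hwy on y.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p  # printing the object draws the plot
```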
[Figure: scatter plot of hwy (y-axis) versus displ (x-axis)]
There are two steps to create this scatter plot. First, ggplot() initializes the plot and specifies
the mpg data frame used for the plot. Then geom_point() adds the points to the scatter plot,
with displ on the 𝑥-axis and hwy on the 𝑦-axis.
Coloring points
Here we create a scatter plot with points colored according to the class of each car.
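One way to write this, as a sketch that assumes the same two-step layering as the first scatter plot, is to add class to the aesthetic mapping:

```r
library(ggplot2)

# Mapping class to color inside aes() gives each car class its own
# point color and adds a legend automatically.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
p
```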
[Figure: scatter plot of hwy versus displ, with points colored by class (legend: 2seater, compact, midsize, minivan, pickup, subcompact, suv)]
aes() stands for aesthetics. Conceptually, aes() specifies the mapping of the variables to the
different aesthetics, or visual properties of the plot. In this example, displ is mapped to the
𝑥-axis, hwy is mapped to the 𝑦-axis, and class is mapped to the point color.
To make all the points blue:
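A sketch of this, assuming the same layering as the earlier plots, sets color as a fixed argument of geom_point() rather than as a mapping:

```r
library(ggplot2)

# color is set outside aes(), so it is a constant applied to every point
# rather than a mapping from a variable; no legend is drawn.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p
```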
[Figure: scatter plot of hwy versus displ with all points drawn in blue]
Here color does not go inside the aes() function. This is because the color blue does not
convey any information about a particular variable.
Labels
By default ggplot2 uses the variable names as labels for the axes and legend. We can use the
labs() function to add more descriptive labels.
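A sketch of how labs() could be used here, assuming the colored scatter plot from the previous section as the starting point, with the label text taken from the figure:

```r
library(ggplot2)

# labs() overrides the default axis and legend titles.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Car type"
  )
p
```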
[Figure: scatter plot with x-axis "Engine displacement (liters)", y-axis "Highway miles per gallon", and legend "Car type" (2seater, compact, midsize, minivan, pickup, subcompact, suv)]
Facets
We can use facet_wrap() to split the plot into facets, subplots that each display one subset
of the data. For example, the code below creates a scatter plot of hwy versus displ for each
category of class.
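A sketch of such code, assuming the one-sided formula interface of facet_wrap():

```r
library(ggplot2)

# facet_wrap(~ class) draws one panel per level of class;
# all panels share the same x and y scales by default.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class)
p
```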
[Figure: scatter plots of hwy versus displ, one facet per class; the only panel title that survives here is suv; x-axis: displ]
Exercises
3. Make a scatter plot of hwy versus displ. Map the categorical variable drv to the color
and shape of the points. Type help(mpg) to read the description of the drv variable in
the help menu.
4. Use facet_wrap() to create 3 facets with scatter plots of hwy versus displ for each
category of drv.
Scatter plot smoothing
Use geom_smooth() to add a smooth line that displays the average trend in the scatter plot.
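A sketch of this, assuming the mapping is moved into ggplot() so that both layers share it:

```r
library(ggplot2)

# Aesthetics given in ggplot() are inherited by every layer, so
# geom_point() and geom_smooth() both use displ and hwy.
# With default settings ggplot2 prints a message naming the
# smoothing method it chose.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
p
```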
[Figure: scatter plot of hwy versus displ with a smooth trend line]
Setting method = "lm" adds the least squares regression line instead.
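A sketch of the same plot with a linear fit, assuming the shared mapping in ggplot() as before:

```r
library(ggplot2)

# method = "lm" replaces the default smoother with a straight line
# fitted by least squares (hwy regressed on displ).
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```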
[Figure: scatter plot of hwy versus displ with the least squares regression line]
This code adds a smooth trend line to each subplot when using facet_wrap():
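A sketch of such code, assuming the facets are the three categories of drv (4, f, r) that appear as panel titles in the figure:

```r
library(ggplot2)

# Each facet gets its own trend line because the smoother is fitted
# separately to the data shown in each panel.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ drv)
p
```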
[Figure: scatter plots of hwy versus displ with smooth trend lines, in three facets titled 4, f, and r]
This code colors both the points and trend lines according to the categories of drv.
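A sketch of this, assuming color is mapped in ggplot() so that both layers inherit it:

```r
library(ggplot2)

# Because color = drv is set in ggplot(), it applies to the points and
# to the smoother, which is then fitted once per drv category.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()
p
```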
[Figure: scatter plot of hwy versus displ with points and trend lines colored by drv (legend: 4, f, r)]
In the plot below, the mapping of drv to shape is only applied to the points, while the trend
line is for the entire data set.
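A sketch of this layering, assuming the shared x and y mapping stays in ggplot() while shape is mapped only inside geom_point():

```r
library(ggplot2)

# shape = drv is local to geom_point(), so geom_smooth() does not see it
# and fits a single trend line to all 234 observations.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(shape = drv)) +
  geom_smooth()
p
```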
[Figure: scatter plot of hwy versus displ with point shape given by drv (legend: 4, f, r) and a single trend line for all the data]
Exercises
5. Run the following code. What does the argument se = FALSE of geom_smooth() do?
6. Recreate the R code necessary to make the following plot. (If you get a warning message
just ignore it.)
[Figure: faceted plot with displ on the x-axis; the only panel title that survives here is suv]