Lecture 6: Data Visualization with ggplot2

Eric Fox, STAT 450, Fall 2024

Table of contents

mpg data frame
Creating a ggplot
Coloring points
Labels
Facets
    Exercises
Scatter plot smoothing
    Exercises

This lecture introduces ggplot2,¹ a modern R package for data visualization. Last week we
discussed graphics in base R, which is the original plotting system for R. There are pros and
cons to each approach: base R graphics tend to be more customizable, while ggplot2 graphics
tend to look nicer without many adjustments. ggplot2 also has advantages when dealing with
categorical data. For the rest of the semester we will focus on the ggplot2 approach to
graphics.
ggplot2 is part of the tidyverse, which is a collection of R packages designed for data science.
To install the tidyverse, run the following command in the console:

install.packages("tidyverse")

You only need to install this package once on your computer.

To load ggplot2 into your current R session, run the following command:

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.1     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.1
v purrr     1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to becom

This command needs to be run during each R session when you want to use ggplot2 or other
tidyverse packages.
Alternatively, you can just load ggplot2, without the other tidyverse packages:

library(ggplot2)

¹ https://ggplot2.tidyverse.org/

mpg data frame

The mpg data frame is part of the ggplot2 package. The data set is stored as a tibble, which
is how data frames are represented in the tidyverse. A nice feature of tibbles is that when
you type the name of a data frame, only the first 10 rows and all columns that fit on the screen
are displayed. The type of each column (variable) is also shown under its name.
From R for Data Science: “A data frame is a rectangular collection of variables (in the columns)
and observations (in the rows). mpg contains observations collected by the US EPA on 38 car
models.”

mpg

# A tibble: 234 x 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
 2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
 3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
 4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
 5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
 6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
 7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
 8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
 9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
# i 224 more rows

To learn more about this data set, read the documentation in the help menu:

help(mpg)

We will focus on the following variables:

• displ: a car’s engine size, in liters
• hwy: highway miles per gallon
• class: the type of car

Creating a ggplot

Run the following code to make a scatter plot with displ on the 𝑥-axis and hwy on the 𝑦-axis:

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

[Figure: scatter plot of hwy (y-axis) versus displ (x-axis).]

There are two steps to create this scatter plot. First, ggplot() initializes the plot and specifies
the mpg data frame used for the plot. Then geom_point() adds the points to the scatter plot,
with displ on the 𝑥-axis and hwy on the 𝑦-axis.
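
The two steps can also be carried out separately by storing the initialized plot in an object. The sketch below is an illustration rather than part of the original notes; the object names p and p_points are arbitrary, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Step 1: initialize the plot with the data frame and the aesthetic mapping.
p <- ggplot(data = mpg, aes(x = displ, y = hwy))

# Step 2: add a layer of points to the initialized plot.
p_points <- p + geom_point()

# Compare the number of layers before and after adding the points.
length(p$layers)
length(p_points$layers)

# Printing the object draws the scatter plot.
p_points
```

Building a plot in pieces like this can be convenient when several versions of the same plot share the first step.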

Coloring points

Here we create a scatter plot with points colored according to the class of each car.

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

[Figure: scatter plot of hwy versus displ, with points colored by class. Legend: 2seater,
compact, midsize, minivan, pickup, subcompact, suv.]

aes() stands for aesthetics. Conceptually, aes() specifies the mapping of the variables to the
different aesthetics, or visual properties of the plot. In this example, displ is mapped to the
𝑥-axis, hwy is mapped to the 𝑦-axis, and class is mapped to the point color.
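
Aesthetics are not limited to categorical variables. As a small sketch (not from the original notes), the code below maps the numeric variable cty, the city miles per gallon column in the mpg data frame, to color; the object name p_cty is arbitrary. With a numeric variable, ggplot2 draws a continuous color gradient instead of one color per category.

```r
library(ggplot2)

# Map the numeric variable cty (city miles per gallon) to the point color.
p_cty <- ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Show which variable is mapped to each aesthetic.
p_cty$mapping

# Draw the plot; the legend is a color bar rather than a list of categories.
p_cty
```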

To make all the points blue:

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

[Figure: scatter plot of hwy versus displ, with all points colored blue.]

Here color does not go inside the aes() function. This is because the color blue does not
convey any information about a particular variable.
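
Other fixed visual properties can be set the same way, as arguments of geom_point() outside aes(). The sketch below is an illustration with arbitrarily chosen values for size and alpha (transparency); the object name p_blue is also arbitrary.

```r
library(ggplot2)

# color, size, and alpha are set to constants, so they go outside aes().
p_blue <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3, alpha = 0.5)

# Draw the plot: larger, semi-transparent blue points.
p_blue
```

Semi-transparent points can make overlapping observations easier to see, since darker regions indicate several points plotted on top of each other.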

Labels

By default ggplot2 uses the variable names as labels for the axes and legend. We can use the
labs() function to add more descriptive labels.

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    title = "Relationship between fuel efficiency and engine size",
    color = "Car type"
  )

[Figure: scatter plot titled "Relationship between fuel efficiency and engine size", with
"Engine displacement (liters)" on the x-axis, "Highway miles per gallon" on the y-axis, and a
legend titled "Car type" listing 2seater, compact, midsize, minivan, pickup, subcompact, suv.]

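labs() also accepts subtitle and caption arguments. The sketch below is an illustration only; the subtitle and caption text are made up for this example, and the object name p_labels is arbitrary.

```r
library(ggplot2)

p_labels <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    title = "Relationship between fuel efficiency and engine size",
    subtitle = "Each point is one car in the mpg data frame",  # example text
    caption = "Data: mpg data frame from the ggplot2 package", # example text
    color = "Car type"
  )

# Draw the plot with the extra subtitle and caption.
p_labels
```
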
Facets

We can use facet_wrap() to split the plot into facets, subplots that each display one subset
of the data. For example, the code below creates a scatter plot of hwy versus displ for each
category of class.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(class))

[Figure: scatter plots of hwy versus displ in seven facets, one per class: 2seater, compact,
midsize, minivan, pickup, subcompact, suv.]

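The layout of the facets can be adjusted with the nrow or ncol argument of facet_wrap(). As a sketch (the choice of two rows and the object name p_facets are arbitrary), the code below arranges the seven class facets in two rows:

```r
library(ggplot2)

# Arrange the facets in 2 rows instead of the default layout.
p_facets <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(class), nrow = 2)

# Draw the faceted plot.
p_facets
```
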
Exercises

1. Run ggplot(data = mpg). What do you see?


2. What’s gone wrong with this code? How can you fix it?

ggplot(data = mpg, aes(x = displ, y = hwy, color = "purple", size = 0.7)) +
  geom_point()

3. Make a scatter plot of hwy versus displ. Map the categorical variable drv to the color
and shape of the points. Type help(mpg) to read the description of the drv variable in
the help menu.
4. Use facet_wrap() to create 3 facets with scatter plots of hwy versus displ for each
category of drv.

Scatter plot smoothing

Use geom_smooth() to add a smooth line that displays the average trend in the scatter plot.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

[Figure: scatter plot of hwy versus displ with a smooth trend line.]

Setting method = "lm" adds the least squares regression line instead.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

[Figure: scatter plot of hwy versus displ with the least squares regression line.]

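The line drawn with method = "lm" is the least squares fit of hwy on displ. As a sketch using the lm() function from base R (the object name fit is arbitrary), the intercept and slope of that line can be computed directly:

```r
library(ggplot2)  # provides the mpg data frame

# Fit the simple linear regression of hwy on displ.
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the least squares line.
coef(fit)
```

The slope is the estimated change in highway miles per gallon for each additional liter of engine displacement.
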
This code adds a smooth trend line to each subplot when using facet_wrap():

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(vars(drv))

[Figure: scatter plots of hwy versus displ in three facets, one per category of drv (4, f, r),
each with its own smooth trend line.]

This code colors both the points and trend lines according to the categories of drv.

ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

[Figure: scatter plot of hwy versus displ, with points and smooth trend lines colored by drv.
Legend: 4, f, r.]

In the plot below, the mapping of drv to shape is only applied to the points, while the trend
line is for the entire data set.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(shape = drv)) +
  geom_smooth()

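[Figure: scatter plot of hwy versus displ, with point shapes determined by drv (legend: 4, f, r)
and a single smooth trend line for the entire data set.]
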
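The same idea applies to other aesthetics. In the sketch below (an illustration, with the arbitrary object name p_local), color is mapped to drv only inside geom_point(), so the points are colored by drv while geom_smooth() still fits a single trend line to the entire data set.

```r
library(ggplot2)

# x and y are mapped globally; color is mapped only in the points layer.
p_local <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth()

# Draw the plot: colored points, one overall trend line.
p_local
```
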
Exercises

5. Run the following code. What does the argument se = FALSE of geom_smooth() do?

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

6. Recreate the R code necessary to make the following plot. (If you get a warning message
just ignore it.)

[Figure: plots of hwy versus displ in seven facets, one per class: 2seater, compact, midsize,
minivan, pickup, subcompact, suv.]
