R Graphics Essentials For Great Data Visualization
Alboukadel KASSAMBARA
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior
written permission of the Publisher. Requests to the Publisher for permission should
be addressed to STHDA (http://www.sthda.com).
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials.
When you copy multi-line R code directly from the PDF and paste it into the R console,
it sometimes generates an error. If this happens, a solution is to:
• First paste the code into your R code editor or text editor
• Then copy the code from your text/code editor to the R console
0.4 Colophon
This book was built with R 3.3.2 and the following packages:
## name version source
## 1 bookdown 0.5 Github:rstudio/bookdown
## 2 changepoint 2.2.2 CRAN
## 3 cowplot 0.8.0.9000 Github:wilkelab/cowplot
## 4 dplyr 0.7.4 cran
## 5 factoextra 1.0.5.999 local:kassambara/factoextra
## 6 FactoMineR 1.38 CRAN
## 7 GGally 1.3.0 CRAN
## 8 ggcorrplot 0.1.1.9000 Github:kassambara/ggcorrplot
## 9 ggforce 0.1.1 Github:thomasp85/ggforce
## 10 ggformula 0.6 CRAN
## 11 ggfortify 0.4.1 CRAN
## 12 ggpmisc 0.2.15 CRAN
## 13 ggpubr 0.1.5.999 Github:kassambara/ggpubr
## 14 lattice 0.20-34 CRAN
## 15 readr 1.1.1 cran
## 16 scatterplot3d 0.3-40 cran
## 17 strucchange 1.5-1 CRAN
## 18 tidyr 0.7.2 cran
About the author
Chapter 1
1.1 Introduction
R is a free and powerful statistical software for analyzing and visualizing data.
In this chapter, you’ll learn:
• the basics of R programming for importing and manipulating your data:
– filtering and ordering rows,
– renaming and adding columns,
– computing summary statistics
• R graphics systems and packages for data visualization:
– R traditional base plots
– Lattice plotting system, which aims to improve on R base graphics
– ggplot2 package, a powerful and flexible R package for producing elegant
graphics piece by piece.
– ggpubr package, which facilitates the creation of beautiful ggplot2-based
graphs for researchers with non-advanced programming backgrounds.
– ggformula package, an extension of ggplot2 based on formula interfaces (much
like the lattice interface)
2 CHAPTER 1. R BASICS FOR DATA VISUALIZATION
• If the above R code fails, you can install the latest stable version from CRAN:
install.packages("ggpubr")
3. Load the required packages. After installation, you must load a package before you
can use its functions. The function library() is used for this task.
An alternative function is require(). For example, to load the ggplot2 and ggpubr
packages, type this:
library("ggplot2")
library("ggpubr")
Now, we can use R functions, such as ggscatter() [in the ggpubr package], to create a
scatter plot.
To learn more about a given function, say ggscatter(), type this in the R console:
?ggscatter
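The practical difference between library() and require() appears when a package is missing: library() stops with an error, whereas require() returns FALSE with a warning. The following is a minimal sketch of that behavior; notARealPackage is a made-up name used only to trigger the failure case.

```r
# require() returns TRUE or FALSE, so its result can be tested.
# "stats" ships with R; "notARealPackage" is a hypothetical, non-existent name.
ok_stats <- suppressWarnings(
  require("stats", quietly = TRUE, character.only = TRUE)
)
ok_fake <- suppressWarnings(
  require("notARealPackage", quietly = TRUE, character.only = TRUE)
)

# library() signals an error instead, which tryCatch() can intercept.
lib_result <- tryCatch(
  {
    library("notARealPackage", character.only = TRUE)
    "loaded"
  },
  error = function(e) "not installed"
)

ok_stats
ok_fake
lib_result
```

In interactive scripts library() is usually the safer choice, because a missing package fails immediately rather than later, at the first call to one of its functions.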
• Avoid beginning column names with a number. Use a letter instead. Good column
names: sport_100m or x100m. Bad column name: 100m.
• Replace missing values with NA (for not available)
For example, your data should look like this:
manufacturer model displ year cyl trans drv
1 audi a4 1.8 1999 4 auto(l5) f
2 audi a4 1.8 1999 4 manual(m5) f
3 audi a4 2.0 2008 4 manual(m6) f
4 audi a4 2.0 2008 4 auto(av) f
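The column-naming rule above can be checked with the base R function make.names(), which turns arbitrary strings into syntactically valid names. This is a small sketch; the raw names are illustrative values, not taken from a real file.

```r
# Column names as they might appear in a raw file (illustrative values).
raw_names <- c("100m", "sport 100m", "year")

# make.names() prefixes a name that starts with a digit with "X"
# and replaces invalid characters, such as spaces, with ".".
clean_names <- make.names(raw_names)
clean_names
```

Names that are already valid, such as year, are returned unchanged.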
Read more at: Best Practices in Preparing Data Files for Importing into R
Read more about how to import data into R at this link:
http://www.sthda.com/english/wiki/importing-data-into-r
After typing the above R code, you will see the description of the iris data set: it
gives the measurements, in centimeters, of the variables sepal length and width
and petal length and width, for 50 flowers from each of 3 species of iris. The
species are Iris setosa, versicolor, and virginica.
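The figures in this description can be checked directly in R. The sketch below uses only base R functions and the built-in iris data set.

```r
# Load the built-in iris data set into the workspace.
data("iris")

dim(iris)            # number of rows and columns
table(iris$Species)  # number of flowers per species
head(iris, 3)        # first three rows
```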
Note that the dplyr package provides the forward-pipe operator (%>%) for
chaining multiple operations. For example, x %>% f is equivalent to f(x). With
the pipe, the output of each operation is passed as the first argument of the next
operation, so a sequence of steps reads from left to right instead of as nested calls.
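As a small illustration of this equivalence, the sketch below writes the same operations once with the pipe and once as nested calls. The variable names (x, setosa_top, nested) are chosen here for the example only.

```r
library(dplyr)

x <- c(1, 4, 9)
# x %>% f is the same call as f(x).
same_result <- identical(x %>% sqrt, sqrt(x))

# Piped version: keep the setosa rows, sort by decreasing sepal length,
# then keep the first three rows.
setosa_top <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# Equivalent nested version, which has to be read from the inside out.
nested <- head(arrange(filter(iris, Species == "setosa"), desc(Sepal.Length)), 3)

same_result
setosa_top
```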
We’ll show you how these functions work in the different chapters of this book.
R comes with simple functions to create many types of graphs. For example:
In most cases, you can use the following arguments to customize the plot:
• pch: change the point shape. Allowed values are integers from 0 to 25.
• cex: change the point size. Example: cex = 0.8.
• col: change the point color. Example: col = "blue".
• frame: logical value. frame = FALSE removes the plot panel border frame.
• main, xlab, ylab: specify the main title and the x and y axis labels, respectively.
• las: change the orientation of the axis tick labels. For vertical x axis text, use las = 2.
In the following R code, we’ll use the iris data set to create:
• (1) A scatter plot of Sepal.Length (on x-axis) and Sepal.Width (on y-axis).
• (2) A box plot of Sepal.Length (y-axis) by Species (x-axis).
# (1) Create a scatter plot
plot(
  x = iris$Sepal.Length, y = iris$Sepal.Width,
  pch = 19, cex = 0.8, frame = FALSE,
  xlab = "Sepal Length", ylab = "Sepal Width"
)
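The code for the second plot in the list, the box plot, does not appear here. A minimal sketch of how it might be written with the base function boxplot() follows; the axis labels and fill color are illustrative choices.

```r
# (2) Box plot of Sepal.Length (y-axis) by Species (x-axis).
# The formula y ~ group splits the numeric variable by the grouping factor.
bp <- boxplot(
  Sepal.Length ~ Species, data = iris,
  xlab = "Species", ylab = "Sepal Length",
  frame = FALSE, col = "lightgray"
)
```

boxplot() also returns, invisibly, the statistics used to draw the boxes, which is why the result is stored in bp here.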
[Figure: scatter plot of Sepal.Width against Sepal.Length, and box plot of Sepal.Length by Species]
The lattice R package provides a plotting system that aims to improve on R base graphs.
After installing the package, with the R command install.packages("lattice"), you
can test the following functions.
• Main functions in the lattice package:
The lattice package uses a formula interface. For example, in lattice terminology, the
formula y ~ x | group means that we want to plot the y variable against the x
variable, splitting the plot into multiple panels by the variable group.
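As a sketch of the formula interface, the following code uses the lattice function xyplot() to plot Sepal.Length against Petal.Length in one panel per species. The layout argument is an optional choice made here to place the three panels in a single row.

```r
library(lattice)

# y ~ x | group: Sepal.Length against Petal.Length, one panel per Species.
p_lattice <- xyplot(
  Sepal.Length ~ Petal.Length | Species, data = iris,
  layout = c(3, 1)  # 3 columns, 1 row of panels
)

# lattice functions return an object; printing it draws the plot.
print(p_lattice)
```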
1.8. R GRAPHICS SYSTEMS 7
[Figure: lattice scatter plots of Sepal.Length against Petal.Length, in multiple panels]
The ggplot2 syntax might seem opaque to beginners, but once you understand the
basics, you can create and customize any kind of plot you want.
Note that, to reduce this opacity, we recently created an R package named ggpubr
(ggplot2 Based Publication Ready Plots), which makes ggplot2 simpler for students and
researchers with non-advanced programming backgrounds. We’ll present ggpubr in
the next section.
After installing and loading the ggplot2 package, you can use the following key functions:
The main function in the ggplot2 package is ggplot(), which can be used to initialize
the plotting system with data and x/y variables.
For example, the following R code initializes a ggplot with the iris data set, then
adds a layer (geom_point()) to create a scatter plot of Sepal.Width (y-axis)
against Sepal.Length (x-axis):
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point()
[Figure: scatter plots of Sepal.Width against Sepal.Length]
Note that, in the code above, the shape of points is specified as a number. To display the
different point shapes available in R, type this:
ggpubr::show_point_shapes()
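The code that this note refers to does not appear here. As a hedged sketch of what setting a point shape by number looks like, the following passes fixed size, color and shape values to geom_point(); the specific values are illustrative.

```r
library(ggplot2)

# Fixed aesthetics are set outside aes(): every point gets the same
# size, color and shape. Shape 21 is a circle with a separate fill color.
p_shape <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 1.2, color = "steelblue", shape = 21)

p_shape
```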
It’s also possible to control the shape and color of points by a grouping variable (here, Species).
For example, in the code below, we map point color and shape to the Species grouping
variable.
# Control point color and shape by groups
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(aes(color = Species, shape = Species))
[Figure: scatter plots of Sepal.Width against Sepal.Length, with point color and shape mapped to Species]
You can also split the plot into multiple panels according to a grouping variable, using
the function facet_wrap(). Another useful feature of ggplot2 is the possibility of
combining multiple layers on the same plot. For example, with the following R code, we’ll:
• Add points with geom_point(), colored by groups.
• Add a fitted smooth line using geom_smooth(). By default, geom_smooth() adds
the fitted line and its confidence band. You can control the line color and the
confidence band fill color by groups.
• Facet the plot into multiple panels by groups.
• Change the colors and fills manually using the functions scale_color_manual() and
scale_fill_manual().
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(aes(color = Species))+
geom_smooth(aes(color = Species, fill = Species))+
facet_wrap(~Species, ncol = 3, nrow = 1)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
[Figure: scatter plots of Sepal.Width against Sepal.Length with smoothed fits, faceted and colored by Species (setosa, versicolor, virginica)]
Note that the default theme of ggplots is theme_gray() (or theme_grey()), which has
a grey background and white grid lines. More themes are available for professional
presentations or publications. These include theme_bw(), theme_classic() and
theme_minimal().
To change the theme of a given ggplot (p), use this: p + theme_classic(). To change
the default theme to theme_classic() for all the future ggplots during your entire R
session, type the following R code:
theme_set(
theme_classic()
)
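The two approaches, changing the theme of one plot and changing the session default, can be compared in a short sketch. The object names (p, p_min, old_theme) are chosen for this example; theme_set() returns the previous default invisibly, which makes it possible to restore it afterwards.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Change the theme of this plot only.
p_min <- p + theme_minimal()
p_min

# Change the default theme for the session and keep the previous one.
old_theme <- theme_set(theme_classic())
p  # now drawn with theme_classic()

# Restore the previous default theme.
theme_set(old_theme)
```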
[Figure: scatter plot of Sepal.Width against Sepal.Length]
The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for
researchers with non-advanced programming backgrounds (Kassambara, 2017).
For example, to create the density distribution of “Sepal.Length”, colored by groups
(“Species”), type this:
library(ggpubr)
# Density plot with mean lines and marginal rug
ggdensity(iris, x = "Sepal.Length",
add = "mean", rug = TRUE, # Add mean line and marginal rugs
color = "Species", fill = "Species", # Color by groups
palette = "jco") # use jco journal color palette
[Figure: density plot of Sepal.Length by Species]
Note that the argument palette can also take a custom color palette. For example,
palette = c("#00AFBB", "#E7B800", "#FC4E07").
[Figure: plot of Sepal.Length annotated with p-values (< 2.2e-16, 1.9e-07, < 2.2e-16)]
pdf("r-base-plot.pdf")
# Plot 1 --> in the first page of PDF
plot(x = iris$Sepal.Length, y = iris$Sepal.Width)
# Plot 2 ---> in the second page of the PDF
hist(iris$Sepal.Length)
dev.off()
Note that for a ggplot, you can also use the following functions to export the graphic:
• ggsave() [in ggplot2]. Makes it easy to save a ggplot. It guesses the type of
graphics device from the file extension.
• ggexport() [in ggpubr]. Makes it easy to arrange and export multiple ggplots at
once.
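As a sketch of the first option, the following saves a ggplot with ggsave(). The file name and dimensions are illustrative, and the file is written to the session's temporary directory so that the example leaves nothing behind in the working directory.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# The ".pdf" extension tells ggsave() to use the PDF graphics device.
# Width and height are given in inches by default.
out_file <- file.path(tempdir(), "iris-scatter.pdf")
ggsave(out_file, plot = p, width = 5, height = 4)

file.exists(out_file)
```

Passing the plot explicitly through the plot argument avoids relying on the default, which is the last plot displayed.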
See also the following blog post on saving high-resolution ggplots:
http://www.sthda.com/english/wiki/saving-high-resolution-ggplots-how-to-preserve-semi-transparency
Chapter 2
2.1 Introduction
To visualize one variable, the type of graph to use depends on the type of the
variable:
• For a categorical or grouping variable, you can visualize the count of each
category using a bar plot, or the proportion of each category using a pie
chart.
• For a continuous variable, you can visualize the distribution of the variable using
density plots, histograms and alternatives.
In this R graphics tutorial, you’ll learn how to:
• Visualize a categorical variable using bar plots, dot charts and pie charts
• Visualize the distribution of a continuous variable using:
– density and histogram plots,
– other alternatives, such as frequency polygons, area plots, dot plots, box plots,
empirical cumulative distribution function (ECDF) plots and quantile-quantile
(QQ) plots,
– density ridgeline plots, which are useful for visualizing changes in the distribution
of a continuous variable over time or space,
– bar plots and modern alternatives, including lollipop charts and Cleveland’s
dot plots.
2.2 Prerequisites
Load required packages and set the theme function theme_pubr() [in ggpubr] as the
default theme:
library(ggplot2)
library(ggpubr)
theme_set(theme_pubr())
16 CHAPTER 2. PLOT ONE VARIABLE
[Figure: bar plot of count by cut (Fair, Good, Very Good, Premium, Ideal)]
Compute the frequency of each category and add the labels on the bar plot:
• The dplyr package is used to summarise the data.
• geom_bar() with the option stat = "identity" is used to create the bar plot from the
summary output as it is.
• geom_text() is used to add text labels. Adjust the position of the labels with
hjust (horizontal justification) and vjust (vertical justification). Values are usually
in [0, 1], but values outside this range are accepted; for example, vjust = -0.3
places each label just above its bar.
# Compute the frequency
library(dplyr)
df <- diamonds %>%
group_by(cut) %>%
summarise(counts = n())
df
## # A tibble: 5 x 2
## cut counts
## <ord> <int>
2.3. ONE CATEGORICAL VARIABLE 17
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
# Create the bar plot. Use theme_pubclean() [in ggpubr]
ggplot(df, aes(x = cut, y = counts)) +
geom_bar(fill = "#0073C2FF", stat = "identity") +
geom_text(aes(label = counts), vjust = -0.3) +
theme_pubclean()
[Figure: bar plot of counts by cut, with the count printed above each bar]
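The pie chart and dot chart code further on uses a prop column in df, but the code that computes it does not appear here. The following is a hedged sketch of how columns named prop and lab.ypos might be derived with dplyr. It assumes that prop is the percentage of each category rounded to one decimal, and that lab.ypos is the midpoint of each slice on the cumulative scale; both definitions are assumptions.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

df <- diamonds %>%
  group_by(cut) %>%
  summarise(counts = n()) %>%
  arrange(desc(cut)) %>%
  mutate(
    # Percentage of observations in each category (assumed definition).
    prop = round(counts * 100 / sum(counts), 1),
    # Midpoint of each slice on the cumulative scale (assumed definition).
    lab.ypos = cumsum(prop) - 0.5 * prop
  )

head(df, 4)
```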
## # A tibble: 4 x 4
## cut counts prop lab.ypos
[Figure: pie chart of prop by cut (Fair, Good, Very Good, Premium, Ideal), with slices labeled 3, 9.1, 22.4, 25.6 and 40]
• Alternative solution to easily create a pie chart: use the function ggpie() [in ggpubr]:
ggpie(
df, x = "prop", label = "prop",
lab.pos = "in", lab.font = list(color = "white"),
fill = "cut", color = "white",
palette = "jco"
)
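# Dot chart. The opening lines of this block are missing from the source;
# the ggplot() and geom_linerange() calls below are an assumed reconstruction.
ggplot(df, aes(x = cut, y = prop)) +
  geom_linerange(aes(ymin = 0, ymax = prop), color = "lightgray") +
  geom_point(aes(color = cut), size = 2) +
  ggpubr::color_palette("jco") +
  theme_pubclean()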
[Figure: dot chart of prop by cut (Fair, Good, Very Good, Premium, Ideal)]
An easy alternative for creating a dot chart is ggdotchart() [in ggpubr]:
ggdotchart(
df, x = "cut", y = "prop",
color = "cut", size = 3, # Points color and size
add = "segment", # Add line segments
add.params = list(size = 2),
palette = "jco",
ggtheme = theme_pubclean()
)
Create some data (wdata) containing the weights by sex (M for male; F for female):
set.seed(1234)
wdata = data.frame(
sex = factor(rep(c("F", "M"), each=200)),
weight = c(rnorm(200, 55), rnorm(200, 58))
)
head(wdata, 4)