3 Ggplot PDF
Table of contents
3 ggplot
3.1 Histograms
3.2 Bar charts
3.3 Box plots
3.4 Line graphs
3.5 Scatter plots
3.6 Faceting
4 Review Questions
2 Read in the data
library(tidyverse)
3 ggplot
ggplot2 is a powerful visualization package. It provides many options for making graphs,
maps, and plots of all sorts. We will look at some important graph types today.
3.1 Histograms
For a histogram of age at first marriage in the GSS, we start by specifying the dataset
and variable:
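A minimal sketch of this first step is given here. The gss data are not reproduced in these notes, so gss_toy is a simulated stand-in (an assumption); only the column name age_at_first_marriage comes from the text.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

# Dataset and aesthetic mapping only: no geom has been added yet,
# so the result is an empty panel with the x axis set up
p_blank <- ggplot(data = gss_toy, aes(x = age_at_first_marriage))
p_blank
```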
[Figure: empty plot panel; x axis: age_at_first_marriage]
Note this is just a blank box, but it has the right x axis. Add a histogram:
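Adding the histogram layer could look like the following sketch. Again gss_toy is a simulated stand-in for gss, not the real data.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

# geom_histogram() bins the x variable and draws one bar per bin;
# with no bin settings it falls back to a default number of bins
p_hist <- ggplot(data = gss_toy, aes(x = age_at_first_marriage)) +
  geom_histogram()
p_hist
```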
[Figure: histogram; x axis: age_at_first_marriage, y axis: count]
Customize the labels:
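One way to set the x axis label is with xlab(), sketched here on simulated data (gss_toy is an assumption standing in for gss). labs(x = ...) would be an equivalent choice.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

# xlab() replaces the default axis label (the column name) with readable text
p_labeled <- ggplot(data = gss_toy, aes(x = age_at_first_marriage)) +
  geom_histogram() +
  xlab("Age at first marriage (years)")
p_labeled
```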
[Figure: histogram; x axis: Age at first marriage (years), y axis: count]
Now we can change the color of the bars. Note that for histograms, bar charts, and box plots,
fill is the main color choice (color changes the outline).
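A sketch of the difference between fill and color follows. The specific colors are our own picks, and gss_toy is a simulated stand-in for gss. The title matches the one shown on the figure for this step.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

# fill sets the inside of the bars, color sets their outline;
# both are constants here, so they go outside aes()
p_colored <- ggplot(data = gss_toy, aes(x = age_at_first_marriage)) +
  geom_histogram(fill = "lightblue", color = "navy") +
  xlab("Age at first marriage (years)") +
  ggtitle("Age at first marriage, GSS")
p_colored
```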
[Figure: histogram titled "Age at first marriage, GSS"; x axis: Age at first marriage (years), y axis: count]
Note that you can also save the plot as an object and then print it or add further layers to it:
# print
my_plot + ylab("Number of observations")
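For completeness, here is a self-contained sketch of the whole pattern, including the assignment step. The name my_plot comes from the code above; the data (gss_toy) and the colors are assumptions.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

# Save the plot as an object instead of drawing it straight away
my_plot <- ggplot(data = gss_toy, aes(x = age_at_first_marriage)) +
  geom_histogram(fill = "lightblue", color = "navy") +
  xlab("Age at first marriage (years)") +
  ggtitle("Age at first marriage, GSS")

# Typing the object name prints (draws) the plot
my_plot

# Further layers can be added to the saved object
my_plot + ylab("Number of observations")
```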
[Figure: histogram titled "Age at first marriage, GSS"; x axis: Age at first marriage (years), y axis: Number of observations]
A histogram divides the range of the data into bins of a chosen width (the binwidth) and
counts how many of the observations fall within each bin. Histograms look different depending
on the size of the bins. Instead of a binwidth, you can also supply the number of bins that
you want to create.
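The two options can be sketched as follows. The particular values (a binwidth of 1 year, 10 bins) are arbitrary choices for illustration, and gss_toy is a simulated stand-in for gss.

```r
library(tidyverse)

# Simulated stand-in for the gss data (values are made up for illustration)
set.seed(1)
gss_toy <- tibble(age_at_first_marriage = rnorm(500, mean = 25, sd = 5))

base_plot <- ggplot(data = gss_toy, aes(x = age_at_first_marriage)) +
  xlab("Age at first marriage (years)")

# Option 1: fix the width of each bin (here, 1 year)
p_binwidth <- base_plot + geom_histogram(binwidth = 1)

# Option 2: fix the number of bins and let ggplot2 work out the width
p_bins <- base_plot + geom_histogram(bins = 10)

p_binwidth
p_bins
```

Whichever option is used, every observation is still counted exactly once; only the grouping changes.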
[Figure: two histograms of age at first marriage drawn with different bin settings; x axis: Age at first marriage (years), y axis: count]
We can also compare the distribution across the categories of another variable.
For example, we look at plots by whether or not people have at least a bachelor's degree:
[Figure: histogram with bars filled by has_bachelor_or_higher (No / Yes); x axis: Age at first marriage (years), y axis: count]
Importantly, note that the fill color is now specified in the aes function, because it depends
on a variable. Also note that when specifying the data, we have dropped the NAs in the
has_bachelor_or_higher variable.
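A sketch of both points (fill inside aes, and dropping NAs when specifying the data) is given here. The data are simulated, and position = "dodge" (bars side by side) is our assumption about the layout rather than something the notes state.

```r
library(tidyverse)

# Simulated stand-in for the gss data, including some missing education values
set.seed(1)
gss_toy <- tibble(
  age_at_first_marriage = rnorm(500, mean = 25, sd = 5),
  has_bachelor_or_higher = sample(c("No", "Yes", NA), 500,
                                  replace = TRUE, prob = c(0.6, 0.3, 0.1))
)

# Drop rows where the grouping variable is missing
gss_complete <- gss_toy |> drop_na(has_bachelor_or_higher)

# fill depends on a variable, so it is mapped inside aes()
p_fill <- ggplot(data = gss_complete,
                 aes(x = age_at_first_marriage, fill = has_bachelor_or_higher)) +
  geom_histogram(position = "dodge") +
  xlab("Age at first marriage (years)")
p_fill
```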
3.2 Bar charts
Let’s plot the proportion of respondents by province as a bar chart. First, save the proportions
as a new data frame:
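The step that builds resp_by_prov might look like the sketch below. The gss data are not included in these notes, so gss_toy is rebuilt from the respondent counts in the resp_by_prov printout in this section; using uncount() to do that is our own shortcut, not part of the original workflow.

```r
library(tidyverse)

# Counts copied from the resp_by_prov printout; uncount() expands them
# into one row per respondent so the data resemble the raw gss rows
gss_toy <- tibble(
  province = c("Alberta", "British Columbia", "Manitoba", "New Brunswick",
               "Newfoundland and Labrador", "Nova Scotia", "Ontario",
               "Prince Edward Island", "Quebec", "Saskatchewan"),
  n = c(1728, 2522, 1192, 1337, 1094, 1425, 5621, 708, 3822, 1153)
) |>
  uncount(n)

# Count respondents per province, then convert the counts to proportions
resp_by_prov <- gss_toy |>
  group_by(province) |>
  summarize(n = n()) |>
  mutate(prop = n / sum(n))
```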
resp_by_prov
# A tibble: 10 x 3
province n prop
<chr> <int> <dbl>
1 Alberta 1728 0.0839
2 British Columbia 2522 0.122
3 Manitoba 1192 0.0579
4 New Brunswick 1337 0.0649
5 Newfoundland and Labrador 1094 0.0531
6 Nova Scotia 1425 0.0692
7 Ontario 5621 0.273
8 Prince Edward Island 708 0.0344
9 Quebec 3822 0.186
10 Saskatchewan 1153 0.0560
Now plot
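A sketch of the bar chart, using the counts from the resp_by_prov printout above. We use geom_col() here, which draws bars whose heights are taken directly from a y variable; the original notes may have used a different but equivalent call.

```r
library(tidyverse)

# Counts copied from the resp_by_prov printout in this section
resp_by_prov <- tibble(
  province = c("Alberta", "British Columbia", "Manitoba", "New Brunswick",
               "Newfoundland and Labrador", "Nova Scotia", "Ontario",
               "Prince Edward Island", "Quebec", "Saskatchewan"),
  n = c(1728, 2522, 1192, 1337, 1094, 1425, 5621, 708, 3822, 1153)
) |>
  mutate(prop = n / sum(n))

# One bar per province, with bar height equal to the proportion
p_bar <- ggplot(data = resp_by_prov, aes(x = province, y = prop)) +
  geom_col() +
  ylab("proportion")
p_bar
```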
[Figure: bar chart of proportion by province, provinces in alphabetical order; x axis: province, y axis: proportion]
There are a few things here that would be nice to fix. Firstly, the categories are ordered
alphabetically, which is the default. It would be better visually to order by proportion. We
can do this using the fct_reorder function to alter (mutate) the province variable.
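The reordering can be sketched as follows, again using the counts from the resp_by_prov printout. fct_reorder() turns province into a factor whose levels are sorted by prop, and ggplot2 then draws the bars in level order.

```r
library(tidyverse)

# Counts copied from the resp_by_prov printout in this section
resp_by_prov <- tibble(
  province = c("Alberta", "British Columbia", "Manitoba", "New Brunswick",
               "Newfoundland and Labrador", "Nova Scotia", "Ontario",
               "Prince Edward Island", "Quebec", "Saskatchewan"),
  n = c(1728, 2522, 1192, 1337, 1094, 1425, 5621, 708, 3822, 1153)
) |>
  mutate(prop = n / sum(n))

# Reorder the province factor levels by proportion (smallest first)
resp_ordered <- resp_by_prov |>
  mutate(province = fct_reorder(province, prop))

p_ordered <- ggplot(data = resp_ordered, aes(x = province, y = prop)) +
  geom_col() +
  ylab("proportion")
p_ordered

# coord_flip() swaps the axes, giving horizontal bars with readable labels
p_ordered + coord_flip()
```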
[Figure: bar chart of proportion by province, provinces ordered from lowest to highest proportion; x axis: province, y axis: proportion]
[Figure: horizontal bar chart titled "Proportion of GSS respondents by province"; provinces on the vertical axis, ordered by proportion]
3.3 Box plots
Let’s use the country indicators dataset here and do box plots of child mortality in 2017 by
region. As in the bar chart example, it is best to reorder the regions by the variable we are
interested in.
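A sketch of this box plot is shown here. The country indicators data are not reproduced in these notes, so country_ind_toy and its column names (region, year, child_mort) are hypothetical, and the values are made up.

```r
library(tidyverse)

# Hypothetical stand-in for the country indicators data (made-up values)
country_ind_toy <- tibble(
  region = rep(c("Developed regions", "Southern Asia", "Sub-Saharan Africa"),
               each = 20),
  year = 2017,
  child_mort = c(seq(2, 10, length.out = 20),
                 seq(20, 60, length.out = 20),
                 seq(40, 120, length.out = 20))
)

# Keep 2017 and order regions by their median child mortality, highest first
box_data <- country_ind_toy |>
  filter(year == 2017) |>
  mutate(region = fct_reorder(region, child_mort, .desc = TRUE))

p_box <- ggplot(data = box_data, aes(x = region, y = child_mort)) +
  geom_boxplot() +
  labs(title = "Distribution of child mortality by region, 2017",
       y = "under-five child mortality (deaths per 1000 live births)")
p_box
```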
[Figure: box plots titled "Distribution of child mortality by region, 2017"; x axis: region, y axis: under-five child mortality (deaths per 1000 live births)]
The labels on the x axis are hard to read. We could do the same as last time (switch to
horizontal), or we can change the alignment of the labels:
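Changing the label alignment can be done through theme(), as in this sketch. The 45 degree angle is our choice, and country_ind_toy with its column names is a hypothetical stand-in for the country indicators data.

```r
library(tidyverse)

# Hypothetical stand-in for the country indicators data (made-up values)
country_ind_toy <- tibble(
  region = rep(c("Developed regions", "Southern Asia", "Sub-Saharan Africa"),
               each = 20),
  child_mort = c(seq(2, 10, length.out = 20),
                 seq(20, 60, length.out = 20),
                 seq(40, 120, length.out = 20))
) |>
  mutate(region = fct_reorder(region, child_mort, .desc = TRUE))

# Rotate the x axis labels; hjust = 1 right-aligns them against the axis
p_rotated <- ggplot(data = country_ind_toy, aes(x = region, y = child_mort)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_rotated
```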
[Figure: box plots titled "Distribution of child mortality by region, 2017", with rotated x axis labels; x axis: region, y axis: under-five child mortality (deaths per 1000 live births)]
Note that if you want to color the boxes, use fill; the legend can then be removed, since it
only repeats the x axis labels.
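A sketch of filled boxes with the legend switched off. As before, country_ind_toy and its column names are hypothetical.

```r
library(tidyverse)

# Hypothetical stand-in for the country indicators data (made-up values)
country_ind_toy <- tibble(
  region = rep(c("Developed regions", "Southern Asia", "Sub-Saharan Africa"),
               each = 20),
  child_mort = c(seq(2, 10, length.out = 20),
                 seq(20, 60, length.out = 20),
                 seq(40, 120, length.out = 20))
) |>
  mutate(region = fct_reorder(region, child_mort, .desc = TRUE))

# fill is mapped to region inside aes(); legend.position = "none" hides
# the legend that this mapping would otherwise create
p_box_fill <- ggplot(data = country_ind_toy,
                     aes(x = region, y = child_mort, fill = region)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
p_box_fill
```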
[Figure: box plots titled "Distribution of child mortality by region, 2017", boxes filled by region, no legend; x axis: region, y axis: under-five child mortality (deaths per 1000 live births)]
3.4 Line graphs
Let’s look at the mean life satisfaction by age of respondent. First, let’s make a new variable
in the gss dataset that groups people into 5-year age groups. Here’s the code to do this:
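The original code for this step is not shown in these notes. One way to get 5-year groups, sketched here, is to round each age down to the nearest multiple of 5; this is an assumption, but it reproduces the age_group values in the printout in this section. The ages in gss_toy are the ten values from that printout.

```r
library(tidyverse)

# The ten ages shown in the printout in this section
gss_toy <- tibble(age = c(52.7, 51.1, 63.6, 80, 28, 63, 58.8, 80, 63.8, 25.2))

# Round each age down to the start of its 5-year group (e.g. 52.7 -> 50)
gss_toy <- gss_toy |>
  mutate(age_group = floor(age / 5) * 5)

gss_toy
```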
#check
gss |> select(age, age_group)
# A tibble: 20,602 x 2
age age_group
<dbl> <dbl>
1 52.7 50
2 51.1 50
3 63.6 60
4 80 80
5 28 25
6 63 60
7 58.8 55
8 80 80
9 63.8 60
10 25.2 25
# ... with 20,592 more rows
Now let’s calculate the average of the ‘life satisfaction’ variable by age group and whether or
not they had at least a bachelor’s degree. This involves a group_by by two variables:
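Grouping by two variables might look like this sketch. The column name feelings_life for the life satisfaction score is hypothetical, and the values in gss_toy are made up.

```r
library(tidyverse)

# Made-up stand-in for gss; feelings_life is a hypothetical column name
gss_toy <- tibble(
  age_group = rep(c(20, 25, 30), each = 4),
  has_bachelor_or_higher = rep(c("No", "No", "Yes", "Yes"), times = 3),
  feelings_life = c(8, 9, 7, 8, 8, 8, 9, 9, 7, 9, 8, 10)
)

# One row per combination of age group and education, with the group mean
avg_life_sat <- gss_toy |>
  group_by(age_group, has_bachelor_or_higher) |>
  summarize(mean_life_satisfaction = mean(feelings_life, na.rm = TRUE),
            .groups = "drop")

avg_life_sat
```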
Plot as a line chart over age, coloring by education. For this example we use a different
colour palette called “Set1”:
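A sketch of the line chart, with one line per education group (matching the grouping used for the averages). The summary values and column names in avg_life_sat are made up for illustration.

```r
library(tidyverse)

# Made-up summary table standing in for the averages computed from gss
avg_life_sat <- tibble(
  age_group = rep(c(20, 40, 60, 80), times = 2),
  has_bachelor_or_higher = rep(c("No", "Yes"), each = 4),
  mean_life_satisfaction = c(8.0, 7.8, 8.1, 8.3, 8.1, 8.0, 8.2, 8.4)
)

# color is mapped to a variable, so each category gets its own line;
# scale_color_brewer() swaps the default colours for the "Set1" palette
p_line <- ggplot(data = avg_life_sat,
                 aes(x = age_group, y = mean_life_satisfaction,
                     color = has_bachelor_or_higher)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Average life satisfaction by age and education",
       x = "age group", y = "average life satisfaction")
p_line
```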
[Figure: line chart titled "Average life satisfaction by age and education"; x axis: age group, y axis: average life satisfaction]
3.5 Scatter plots
Let’s use the country indicators dataset here. The example in the lecture slides is life
expectancy versus TFR (total fertility rate). We also use a new colour palette called viridis;
these colour palettes are designed to remain readable in black and white as well.
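A sketch of the scatter plot with a viridis palette. The data frame country_ind_toy, its column names, and all values are invented for illustration (the placeholder codes C01 to C06 are not real countries).

```r
library(tidyverse)

# Invented stand-in for the country indicators data
country_ind_toy <- tibble(
  country_code = c("C01", "C02", "C03", "C04", "C05", "C06"),
  region = rep(c("Region A", "Region B", "Region C"), each = 2),
  tfr = c(6.5, 5.8, 3.1, 2.6, 1.6, 1.4),
  life_expectancy = c(58, 61, 69, 72, 81, 84)
)

# One point per country, coloured by region;
# scale_color_viridis_d() applies the discrete viridis palette
p_scatter <- ggplot(data = country_ind_toy,
                    aes(x = tfr, y = life_expectancy, color = region)) +
  geom_point() +
  scale_color_viridis_d() +
  labs(title = "TFR versus life expectancy, 2017",
       x = "TFR (births per woman)", y = "life expectancy (years)")
p_scatter
```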
[Figure: scatter plot titled "TFR versus life expectancy, 2017", points coloured by region; x axis: TFR (births per woman), y axis: life expectancy (years)]
Instead of dots, we could plot the country codes (this becomes hard to read, but makes
outliers easy to identify).
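Swapping points for text labels can be sketched with geom_text(). As in any example here that uses country data, country_ind_toy, its column names, and its values are invented.

```r
library(tidyverse)

# Invented stand-in for the country indicators data
country_ind_toy <- tibble(
  country_code = c("C01", "C02", "C03", "C04", "C05", "C06"),
  region = rep(c("Region A", "Region B", "Region C"), each = 2),
  tfr = c(6.5, 5.8, 3.1, 2.6, 1.6, 1.4),
  life_expectancy = c(58, 61, 69, 72, 81, 84)
)

# geom_text() draws the label aesthetic at each (x, y) position
p_codes <- ggplot(data = country_ind_toy,
                  aes(x = tfr, y = life_expectancy,
                      color = region, label = country_code)) +
  geom_text() +
  scale_color_viridis_d() +
  labs(title = "TFR versus life expectancy, 2017",
       x = "TFR (births per woman)", y = "life expectancy (years)")
p_codes
```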
[Figure: scatter plot titled "TFR versus life expectancy, 2017", with three-letter country codes in place of points, coloured by region; x axis: TFR (births per woman), y axis: life expectancy (years)]
3.6 Faceting
Changing the color and fill is useful for showing one other variable on a graph. For more
complicated set-ups, faceting graphs by an additional variable becomes useful.
For example, let’s go back to plotting a histogram of age at first marriage by whether or not
the respondent has at least a bachelor's degree, but also add in whether or not the respondent
was born in Canada. First, look at the unique values of the place_birth_canada variable:
gss |>
  select(place_birth_canada) |>
  unique()
# A tibble: 4 x 1
place_birth_canada
<chr>
1 Born in Canada
2 Born outside Canada
3 <NA>
4 Don't know
For now, filter the data to only include the first two categories. To do this, use the %in%
operator within filter:
gss_subset <- gss |>
  filter(place_birth_canada %in% c("Born in Canada", "Born outside Canada")) |>
  drop_na(has_bachelor_or_higher) # also remove the NAs from the education variable
Now plot the histograms as before, but now also facet by place of birth. Note we are plotting
the density here.
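A sketch of the faceted density histogram follows. gss_subset_toy is a simulated stand-in for gss_subset, and position = "dodge" is our assumption about the bar layout. after_stat(density) tells geom_histogram() to plot densities rather than counts.

```r
library(tidyverse)

# Simulated stand-in for gss_subset (values are made up for illustration)
set.seed(1)
gss_subset_toy <- tibble(
  age_at_first_marriage = rnorm(400, mean = 25, sd = 5),
  has_bachelor_or_higher = sample(c("No", "Yes"), 400, replace = TRUE),
  place_birth_canada = sample(c("Born in Canada", "Born outside Canada"),
                              400, replace = TRUE)
)

# facet_wrap() draws one panel per value of place_birth_canada
p_facet <- ggplot(data = gss_subset_toy,
                  aes(x = age_at_first_marriage, fill = has_bachelor_or_higher)) +
  geom_histogram(aes(y = after_stat(density)), position = "dodge") +
  facet_wrap(~place_birth_canada) +
  xlab("age at first marriage")
p_facet
```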
[Figure: density histograms of age at first marriage, filled by has_bachelor_or_higher (No / Yes), in two panels by place of birth; x axis: age at first marriage, y axis: density]
4 Review Questions
1. Using the country_indicator dataset, create a scatter plot of GDP over life expectancy
by region for the year 2014. Edit the labels, set a title, and make sure the graph is
color-coded.
2. Using the GSS dataset, create a bar graph of non-missing values for the province of birth
(place_birth_province) and then arrange the proportions from high to low. Make sure
to color-code the bars and that all labels are readable.