Graphs and Viz With R
Oliver Kirchkamp
© Oliver Kirchkamp
2 Using Graphs and Visualising Data — Contents
[4 July 2022 19:50:51]
Contents
1 Introduction
1.1 Literature
1.2 Examples
1.3 Properties of good graphs
1.4 How to present
1.4.1 Axes
1.4.2 Points
1.4.3 Points, lines and bars
1.4.4 Error Bars
1.5 Legends
1.6 Clutter
1.7 Unnecessary 3D
1.8 Aspect ratio
1.9 What to present
1.9.1 Structuring content
1.9.2 Don’t discard parts of your data
1.9.3 Projecting data
1.9.4 Differences
2 ggplot
2.1 Elements of ggplot
2.2 Labels and legends
2.3 Scatterplots
4 Nominal data
4.1 Nominal univariate
7 Lattice
7.1 Multiway xyplots
7.2 Syntax
7.3 Multiway continued
7.4 Densityplots
7.5 Histograms
7.6 Empirical cumulative densities
7.7 Q-Q plots
7.8 Sample Q-Q plots
7.9 Boxplots
7.10 Barcharts
7.11 Coplots
7.12 Parameters
7.12.1 Types
1 Introduction
1.1 Literature
• The Elements of Graphing Data (Revised Edition). W. S. Cleveland (1994). Hobart
Press, Summit, New Jersey, U.S.A.
1.2 Examples
The following four datasets all have the same correlation ρ = 0.8730379.
They all have the same regression coefficients β0 =0.15 and β1 =0.5.
x1 y1 x2 y2 x3 y3 x4 y4
0.00 0.16 0.00 0.17 0.00 0.19 0.00 0.00
0.10 0.21 0.00 0.01 0.10 0.07 0.10 0.14
0.20 0.25 0.00 0.10 0.20 0.21 0.20 0.26
0.30 0.29 0.00 0.25 0.30 0.42 0.30 0.36
0.40 0.33 0.00 0.10 0.40 0.29 0.40 0.44
0.50 0.37 0.00 0.25 0.50 0.49 0.50 0.50
0.60 0.41 0.00 0.22 0.60 0.51 0.60 0.54
0.70 0.45 0.00 0.17 0.70 0.50 0.70 0.56
0.80 0.82 0.00 0.22 0.80 0.59 0.80 0.56
0.90 0.54 0.00 0.01 0.90 0.42 0.90 0.54
1.00 0.58 1.00 0.65 1.00 0.71 1.00 0.50
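The same effect can be checked with Anscombe's quartet, which ships with R as the data frame anscombe. This is a different set of four datasets than the table above (an assumption made to keep the sketch self-contained), but the lesson is the same: correlations and regression coefficients almost coincide although the scatterplots look very different.

```r
## Sketch: four datasets with (almost) identical summary statistics.
## 'anscombe' is R's built-in quartet, not the data from the table above.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(rho = cor(x, y),
    beta0 = unname(coef(fit)[1]),
    beta1 = unname(coef(fit)[2]))
})
colnames(stats) <- paste("dataset", 1:4)
print(round(stats, 3))
## Only a plot reveals how different the four datasets are:
## par(mfrow = c(1, 4)); for (i in 1:4) plot(anscombe[[i]], anscombe[[i + 4]])
```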
[Figure: scatterplots of the four datasets (panels 1 to 4), y against x.]
46996 numbers:
[Figures: temperature / °C by month (Jan to Jan); legend: temperature range 1957-2021 and the year 2021.]
Aims:
• …make the reader familiar with the structure of the data (create trust),
• Some readers look only or mainly at the figures and the graphs.
• Among the many ways to present our data and our results, we have to choose the best
way.
Presenting only aggregate statistics, tests and estimates might discard relevant structure:
• All elements of the graph are explained in the figure (not only in the text).
set.seed(131)
N<-50
x<-rnorm(n=N,mean=1.5)
y<-x+rnorm(N)
x<-pmax(0,x)
#
z<-rbinom(N,2,.5)
Treatment <- factor(z)
labels<-c("Baseline","A","B")
levels(Treatment)<-labels
gData <- data.frame(x,y,z,Treatment)
par(mfrow=c(1,2),mar=c(4,3,0,1),mex=.5)
plot(y ~ x,axes=FALSE,cex=.25)
axis(side=1,pos=0)
axis(side=2,pos=0)
plot(y ~ x,cex=.25)
axis(3,labels=FALSE)
axis(4,labels=FALSE)
[Figure: two scatterplots of y against x; left with axes drawn through the origin, right with tick marks on all four sides.]
Ranges: In the following example both graphs show the same data. The only difference is
the range of the axes.
library(ggplot2)    # ggplot(), geom_point(), scales
library(gridExtra)  # grid.arrange()
p1 <- ggplot() + geom_point(aes(x,y))
p2 <- ggplot() + geom_point(aes(x,y)) +
  scale_x_continuous(expand=expansion(mult=.25)) +
  scale_y_continuous(expand=expansion(mult=.25))
grid.arrange(p1,p2,nrow=1)
[Figure: two scatterplots of y against x; the right one with expanded axis ranges.]
Ranges and correlations We tend to perceive more correlation if the data occupies less
space.
• The amount of white space around the data should be similar in all graphs.
[Figure: CO2 against year (1960 to 1990), once with a y-axis from 0 to 300 and above, once with a y-axis from 320 to 360.]
Comparable scales: In the following example we use different scales for the two diagrams.
This makes them difficult to compare (although space is used efficiently; see footnote 1).
library(dplyr)   # %>%, filter(), select(), mutate()
library(pwt10)
data(pwt10.0)
pwt10.0 %>%
  filter(country %in% c("Norway","Haiti")) %>%
  select(c("cgdpo","pop","country","year")) %>%
  mutate(gdp = cgdpo/pop) -> pwt   # gdp is GDPo per head
ggplot(pwt) + geom_line(aes(x=year,y=gdp)) +
  labs(y="GDPo/head (US \\$)") +
  facet_wrap(vars(country),scales="free")
[Figure: panels Haiti and Norway, GDPo/head (US $) against year, each panel with its own scale.]
The figure shows real gross domestic product (GDPo) per capita (US dollars in 2017 prices).
Data is taken from Penn World Table Version 10.0.
In the next figure we use the same scale for the two diagrams. Now we see immediately
that GDPo is larger in Norway and smaller in Haiti. Of course, presenting both lines in
one diagram might be preferable here.
ggplot(pwt) + geom_line(aes(x=year,y=gdp)) +   # gdp is GDPo per head
  labs(y="GDPo/head (US \\$)") +
  facet_wrap(vars(country),scales="fixed")
1 Data from Robert C. Feenstra, Robert Inklaar and Marcel P. Timmer (2015), The Next Generation of the Penn World Table, American Economic Review, 105(10), pp. 3150-82 (Penn World Table Version 10.0).
[Figure: panels Haiti and Norway, GDPo/head (US $) against year, same scale in both panels.]
The figure shows real gross domestic product (GDPo) per capita (US dollars in 2017 prices). Data is taken from
Penn World Table Version 10.0.
ggplot(pwt) + geom_line(aes(x=year,y=gdp)) +   # gdp is GDPo per head
  labs(y="GDPo/head (US \\$)") +
  facet_wrap(vars(country),scales="fixed") +
  scale_y_log10()
[Figure: panels Haiti and Norway, GDPo/head (US $) on a logarithmic scale.]
A logarithmic scale facilitates comparing relative growth. Even with logs, comparability
does not require the same axes in all diagrams. Sometimes it might be better to
use the same scale with a different origin.
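Why a logarithmic scale helps with relative growth can be seen in a small sketch with made-up numbers (the two series below are hypothetical, not PWT data): two series that grow at the same rate have very different absolute increments, but identical increments in logs, i.e. parallel lines on a log scale.

```r
## Sketch with hypothetical data: same growth rate, different levels.
year  <- 2000:2010
small <- 1000  * 1.05^(year - 2000)   # grows 5% per year
large <- 50000 * 1.05^(year - 2000)   # grows 5% per year, 50 times larger
## absolute changes differ by a factor of 50 ...
absDiff <- rbind(small = diff(small), large = diff(large))
## ... but changes in logs are the same for both series
logDiff <- rbind(small = diff(log10(small)), large = diff(log10(large)))
print(round(absDiff[, 1:3], 1))
print(round(logDiff[, 1:3], 4))
```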
Another possibility is sliced scales, i.e. the same scale, but different origins:
[Figures: panels Haiti and Norway, GDPo/head (US $) against year on logarithmic scales with different origins.]
The figure shows real gross domestic product (GDPo) per capita (dollars in 2017 prices). Data is taken from
Penn World Table Version 10.0.
Breaks:
data.frame(x=1:8,y=c(1,3,4,52,51,5,4,3)) %>%
mutate(out=y > mean(c(max(y),min(y))),
shrink=min(y[out])-max(y[!out])-2,
yShrink=y - ifelse(out,shrink,0)) -> dBreak
p1<-ggplot(dBreak,aes(x=x,y=y)) + geom_line()
p2<-ggplot(dBreak,aes(x=x,y=y)) + geom_line() + scale_y_log10()
p3<-ggplot(dBreak,aes(x=x,y=y)) + geom_line() + geom_point() +
facet_grid(out ~ .,scales="free_y",space="free_y",as.table=FALSE) +
theme(strip.text.y = element_blank())
grid.arrange(p1,p2,p3,nrow=1)
[Figure: three graphs of y against x: linear scale, logarithmic scale, and broken y-axis.]
All three graphs try to show the same data. The first graph uses a linear scale. Here the
outlier is clearly visible.
The graph in the middle uses a logarithmic scale. This can be reasonable if ratios of the
variable are interesting.
The last graph tries to save space by “breaking” the axis. If gaps cannot be avoided,
dividing the graph into different panels might be preferable (the shingle function might
help to find divisions).
ggplot can “slice” axes; however, I find the result not entirely convincing:
GGplot:
ggplot(dBreak,aes(x=x,y=y)) + geom_line() +
geom_point() +
facet_grid(out ~ .,scales="free_y",
space="free_y",as.table=FALSE) +
theme(strip.text.y = element_blank())
[Figure: y against x with the y-axis split into two panels (1 to 5 and 51 to 52).]
Lattice:
xyplot(y ~ x|out,data=dBreak,layout=c(1,2),
strip=FALSE,scales=list(y="sliced"),
between=list(y=.5),type="o")
[Figure: y against x with a sliced y-axis (1 to 5 and 50 to 53).]
Logarithmic scales: The first graph shows GDPo on a linear scale, the second one uses
a logarithmic scale. On a linear scale countries like Congo and Ethiopia seem to be quite
similar, Germany and U.S.A. look distinct.
The log scale makes it easier to compare ratios. We see that, in relative terms, Germany is
perhaps closer to the United States of America than Congo is to Ethiopia.
N<-20
pwt10.0 %>% filter(year==2019 & country!="China Version 2") %>%
filter(pop >= -sort(-pop)[N]) %>% ## only the top N countries
mutate(country=substr(country,1,10)) %>%
mutate(country=reorder(country,-cgdpo/pop),
gdp=cgdpo/pop) -> pwtG20
ggplot(pwtG20,aes(x=gdp,y=country)) + geom_point() + labs(x="GDP/head")
[Figure: dotplot of GDP/head by country (from Congo, Dem to United Sta), linear scale.]
[Figure: the same dotplot with a logarithmic scale for GDP/head.]
Which logarithm?
When the ratio of the largest to the smallest value is really large, then a logarithmic scale
is easy to grasp.
par(mar=c(4,0,0,0),mex=.5)
plot(NULL,xlim=c(.5,20000),ylim=c(0,1),axes=FALSE,log="x",ylab="",xlab="")
axis(1)
labs<-10^(0:4)
plot(NULL,xlim=c(.5,20000),ylim=c(0,1),axes=FALSE,log="x",ylab="",xlab="")
axis(1,at=labs,labels=paste("$10^",log10(labs),"$"))
1.4.2 Points
Size In the following example the graph on the left uses fairly small points, the one on the
right uses larger points to display the data.
plot(y ~ x,cex=.25)
plot(y ~ x)
[Figure: two scatterplots of y against x; left with small points, right with larger points.]
Shape Plotting symbols should be easy to distinguish. Compare the graph on the left with
the graph on the right:
plot(y ~ x,pch=c(15,16,17)[z+1])
legend("bottomright",c("Group I","Group II","Baseline"),pch=c(15,16,17),bg="white")
plot(y ~ x,pch=c(1,3,5)[z+1])
legend("bottomright",c("Group I","Group II","Baseline"),pch=c(1,3,5),bg="white")
[Figure: two scatterplots of y against x with different plotting symbols; legend: Group I, Group II, Baseline.]
• When points do not tend to overlap we can use both heavy (•) and light (◦) plotting
symbols. Their contrast helps us to distinguish different types of points.
• When points tend to overlap, heavy symbols are difficult to disentangle. In this situation
we should use only light symbols.
Kröse’s experiment
Participants see patterns of symbols for 80 ms. They have to identify presence or absence
of a given symbol.
symbols % recognised
+◦ 100.0
+□ 88.1
L+ 68.6
∆↓ 52.3
+T 37.6
+X 30.3
TL 30.6
Nominal data and points Sometimes, in particular with nominal data, we want to show
the same observation several times. In the diagram on the left multiple dots are simply printed
on top of each other. One cannot see how frequent an observation is. In the middle we add
jitter to each observation, i.e. small noise, which allows us to distinguish the single observations.
The right graph shows frequencies as size of the symbol.
set.seed(1)
nomData <- data.frame(x=sample(1:4,size=200,replace=TRUE),y=sample(1:4,size=200,replace=TRUE))
p1 <- ggplot(nomData,aes(x,y)) + geom_point(shape=1)
p2 <- ggplot(nomData,aes(x,y)) + geom_jitter(width=.1,height=.1,shape=1)
nomData %>% group_by(x,y) %>% summarise(size=n()) %>% mutate(y=factor(y)) -> nomData2
p3 <- ggplot(nomData2,aes(x,y,size=size)) + geom_point(shape=1)
grid.arrange(p1,p2,p3,nrow=1)
[Figure: three graphs of y against x: overprinted points, jittered points, and symbol size proportional to frequency (legend: size).]
If we have a small number of categories (at least in one dimension), a dotplot might be
better:
ggplot(nomData2,aes(x=x,y=size,color=y,lty=y,shape=y)) +
geom_line() + geom_point() +
theme(legend.key.width = unit(2,"cm"))
[Figure: size against x, one line per value of y (1 to 4).]
plot(sunspot.year[1:40],ylab="",xlab="")
plot(sunspot.year[1:40],t="l",ylab="",xlab="")
plot(sunspot.year[1:40],t="b",ylab="",xlab="")
plot(sunspot.year[1:40],t="h",ylab="",xlab="")
[Figure: the first 40 values of sunspot.year shown as points, as lines, as points and lines, and as vertical bars.]
• Lines alone make it impossible to find out when the measurements were taken.
set.seed(6)
eBars<-data.frame(y=rnorm(200),i=rep(1:20,each=10))
eBars %>% group_by(i) %>%
summarise(ym = median(y)+mean(y)) %>%
mutate(x = factor(rank(ym))) %>% right_join(eBars) -> eBars2
eBars2 %>% group_by(x) %>%
summarise(s=sd(y),y=mean(y)) -> eBarsS
We have to explain clearly in the figure which quantity (standard deviation of the sample,
standard deviation of the estimated mean, confidence interval,…) is shown.
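How much the choice matters can be seen in a small sketch with simulated data (the sample below is hypothetical and unrelated to eBars): for the same ten observations the three quantities lead to bars of quite different length.

```r
## Sketch with simulated data: three quantities an error bar could show.
set.seed(1)
y  <- rnorm(10, mean = 2, sd = 1)       # hypothetical sample of 10 observations
n  <- length(y)
s  <- sd(y)                             # standard deviation of the sample
se <- s / sqrt(n)                       # standard error of the estimated mean
ci <- qt(0.975, df = n - 1) * se        # half-width of the 95% confidence interval
halfWidths <- c(sd = s, se = se, ci95 = ci)
print(round(halfWidths, 3))
```

A bar of ± one standard deviation describes the spread of the data, while the standard error and the confidence interval describe the precision of the estimated mean.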
ggplot(eBars2,aes(x=x,y=y)) + geom_boxplot(outlier.shape=1)
[Figure: boxplots of y for x = 1 to 20.]
1.5 Legends
p1 <- ggplot(gData,aes(x=x,y=y,shape=Treatment,color=Treatment)) + geom_point() +
scale_shape_manual(values=1:4) +
theme(legend.position=c(.8,.25),legend.box.background = element_rect(colour = "black"))
p2 <- ggplot(gData,aes(x=x,y=y,shape=Treatment,color=Treatment)) +
geom_point() + scale_shape_manual(values=1:4)
grid.arrange(p1,p2,nrow=1)
[Figure: two scatterplots of y against x by Treatment (Baseline, A, B); left with the legend inside the graph, right with the legend outside.]
When we use more than one type of points, we need a legend. Putting the legend inside
the graph saves space. However, a legend outside the graph produces less clutter (see also
1.6).
With lines, we need a legend too. It helps if labels follow the same order as lines (first
graph). Often the graph is easier to understand if we label the curves directly (second graph).
library(directlabels)
gData %>% mutate(Treatment=reorder(Treatment,-y,max)) %>%
ggplot(aes(x=x,y=y,lty=Treatment,color=Treatment)) + geom_smooth(se=FALSE) -> p
grid.arrange(p,
direct.label(p,list("far.from.others.borders","calc.boxes",
"enlarge.box","draw.rects")),nrow=1)
[Figure: smoothed y against x by Treatment; left with a legend, right with directly labelled curves.]
1.6 Clutter
The following graph presents too many things in one graph.
set.seed(123)
N<-24
data.frame(y=rnorm(N),
s=sqrt(abs(rnorm(N))),
group=factor(rep(1:2,length.out=N),label=c("Baseline","Treatment")),
x=rep(1:(N/2),each=2)) %>%
mutate(ymin=y-s,
ymax=y+s) -> dataCl
[Figures: y against x for the groups Baseline and Treatment, shown in one graph and in separate panels.]
1.7 Unnecessary 3D
[Figure: the same value shown as a 3D bar (left) and as a simple dot (right), scale from 0.0 to 1.0.]
Unnecessary 3D effects distract from the content of your graph. Is the bar in the graph on
the left larger or smaller than 1.0? Of course, one can work it out, but a simple dot without
3D (on the right) is much easier to understand.
[Figure: sunspot count against year (1700 to 2000), shown with two different aspect ratios.]
The graph on the right might be more informative than the one on the left. We can see
that activity increases more quickly than it decreases. This is less obvious in the left graph.
If we feel that the right graph is too flat then we can ‘cut-and-stack’ it as follows:
library(latticeExtra)
xyplot(sunspot.year,aspect="xy",strip=FALSE,strip.left=TRUE,
cut=list(number=3,overlap=0.05))
[Figure: sunspot.year cut into three stacked panels.]
In the following example the graph on the left has a slope of about 45°. This makes it easier
to see the convexity of the curve.
lco2<-data.frame(lowess(co2))
plot(xyplot(y~x,data=lco2,type="l",aspect="xy"),position=c(0,0,.5,1),more=TRUE)
plot(xyplot(y~x,data=lco2,type="l",aspect=.1),position=c(.5,0,1,1))
[Figure: lowess fit of co2, y against x; left with aspect="xy", right with aspect=.1.]
\[
\frac{\partial \Delta}{\partial \epsilon}
  = \frac{s}{1 + (1+\epsilon)^2 \cdot s^2}
  \overset{\epsilon \to 0}{=} \frac{s}{1 + s^2}
\]
[Figure: s/(1+s^2) plotted against arctan s (0° to 90°).]
Hence, if we want to see differences in slopes, we should scale the graph such that slopes
are close to 1.
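This can be checked numerically. A minimal sketch, assuming that Δ denotes the angle between two lines with slopes s and (1+ε)·s, so that its sensitivity to ε at ε=0 is s/(1+s²); the function name sensitivity is only used for illustration.

```r
## Sensitivity of the angle between slopes s and (1+eps)*s at eps = 0
sensitivity <- function(s) s / (1 + s^2)
## numerical check of the derivative for one (arbitrary) slope
s0  <- 0.3
eps <- 1e-6
numDeriv <- (atan((1 + eps) * s0) - atan(s0)) / eps
## the sensitivity is largest where the slope is 1, i.e. at 45 degrees
best <- optimize(sensitivity, interval = c(0, 10), maximum = TRUE)
print(c(analytic = sensitivity(s0), numeric = numDeriv))
print(c(slope = best$maximum, angle = atan(best$maximum) * 180 / pi))
```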
Also in the following graph lines have a slope of about 45°. This makes it easier to compare
the different slopes.
library(pwt10)
N<-6
data(pwt10.0)
pwt10.0 %>%
semi_join(pwt10.0 %>% ## find N most populous countries:
group_by(country) %>%
summarise(popM=median(pop,na.rm=TRUE)) %>%
arrange(-popM) %>%
top_n(N)) %>%
mutate(country=reorder(factor(substr(country,1,10)),-cgdpo/pop,median,na.rm=TRUE),
gdp = cgdpo/pop) -> pwt6
xyplot(gdp ~ year,group=country,
scales=list(y=list(log=10)),
yscale.components=yscale.components.log10.3,
aspect="xy",
type="l",data=pwt6,
auto.key=list(space="right",points=FALSE,lines=TRUE),
ylab="real GDPo/head (US\\$)",xlab="year")
[Figure: real GDPo/head (US$) against year on a logarithmic scale, one line per country.]
45° with Lattice
xyplot(gdp ~ year,group=country,
scales=list(y=list(log=10)),
yscale.components=yscale.components.log10.3,
type="l",data=pwt6,
ylab="real GDPo/head (US\\$)",xlab="year")
[Figure: real GDPo/head (US$) against year on a logarithmic scale, one line per country.]
When we stretch the graph to fill the entire page, convexities and concavities are harder
to see:
layout(matrix(data=c(1,1,2,3),2,2),heights=c(.7,.3))
plot(co2, ylab = "Atmospheric concentration of CO$_2$",las = 1)
lco2 <- lowess(co2)
plot(lco2,t="l",ylab="lowess",xlab="")
plot(co2-lco2$y,t='l',ylab='seasonal')
[Figure: atmospheric concentration of CO2 against time (1960 to 1990), with the lowess fit and the seasonal component in separate panels.]
Sometimes it helps to show residuals in a separate graph. In the following example only
the graph on the right shows that noise increases for large values of x. The graph on the left
does not reveal this structure.
N <- 200
x <- sort(runif(N))
y0 <- 20 * exp(x)^3
y <- y0 +4*rnorm(N,sd=.5+x)
residuals <- y0-y
plot(lattice::xyplot(y~x,panel=function(...) {
panel.xyplot(...);
panel.xyplot(x=x,y=y0,type="l")}),
position=c(0,0,.5,1),more=TRUE)
plot(xyplot(residuals ~ x),position=c(.5,0,1,1))
[Figure: left, y against x with the fitted curve; right, residuals against x.]
Since temperature is measured as integers, we first have to aggregate, so that we can plot
superimposed points from two different flights in a different way.
library(vcd)
data(SpaceShuttle)
par(mfrow=c(1,2),mar=c(4,4.5,4,0),mex=.5)
xx<-with(SpaceShuttle,aggregate(Temperature,list(Temperature=Temperature,
Failures=nFailures),length))
plot(Failures ~ Temperature,data=xx,cex=sqrt(x),
yaxt="n",xlab="Temperature/[\\degree F]",ylab="Failures",
main="all previous flights")
axis(2,at=0:2)
plot(Failures ~ Temperature,data=xx,subset=Failures>0,cex=sqrt(x),
yaxt="n",xlab="Temperature/[\\degree F]",ylab="Failures",
main="only flights with failures")
axis(2,at=1:2)
[Figure: failures against temperature / °F; left: all previous flights, right: only flights with failures.]
[Figure: failures (0, 1, 2) against temperature / °F.]
2 Data from S. Dalal, E. B. Fowlkes, B. Hoadley (1989), Risk analysis of the space shuttle: Pre-Challenger prediction of failure, Journal of the American Statistical Association, 84, 945-957.
library(MASS)
library(plotrix)   # provides thigmophobe.labels()
data(Animals)
plot(brain ~ body,data=Animals,log="xy")
with(Animals,thigmophobe.labels(body,brain,rownames(Animals),cex=.5))
[Figure: brain against body for the Animals data on logarithmic axes, each point labelled with the name of the animal.]
• surface grows quadratically with height
• volume (and weight) grows cubically
\[
\text{excess brain} = \log(\text{brain mass}) - \frac{2}{3} \log(\text{body mass})
\]
To make it easier to interpret this difference of logs we use logarithms with base 10.
The left graph is ordered by the quantity of “excess brain”, the right one is ordered alpha-
betically. Often dotplots are easier to understand when they are sorted by the quantity.
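The quantity and the ordering can be reproduced with a few lines. A minimal sketch, assuming the Animals data from MASS (body in kg, brain in g) and a base R dotchart instead of the plotting code used for the figure; the names excess and excessSorted are only used for illustration.

```r
## Sketch: "excess brain" for the Animals data (MASS), using base-10 logs.
library(MASS)
data(Animals)
excess <- with(Animals, log10(brain) - 2/3 * log10(body))
names(excess) <- rownames(Animals)
## sorting by the quantity makes the dotplot easier to read
excessSorted <- sort(excess)
dotchart(excessSorted, xlab = "excess brain", cex = .6)
print(round(tail(excessSorted, 3), 2))
```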
3 The Dragons of Eden: Speculations on the Evolution of Human Intelligence. Random House, New York, 1977.
4 Ever Since Darwin: Reflections in Natural History. Norton, New York, 1977.
[Figure: two dotplots of excess brain by animal; left ordered by excess, right ordered alphabetically.]
Also for a multiway dotplot ordering by quantity helps. In the following example we use
medians of the different categories.
data(pwt10.0)
N <- 12
pwt10.0 %>%
semi_join(pwt10.0 %>% ## find N most populous countries:
group_by(country) %>%
summarise(popM=median(pop,na.rm=TRUE)) %>%
arrange(-popM) %>%
top_n(N)) %>%
filter(year>max(year)-6) %>%
mutate(gdp = cgdpo/pop,
country = reorder(factor(substr(country,1,10)),-gdp,median,na.rm=TRUE)) -> pwt12
[Figure: multiway dotplot of GDPo/head (US$) by country (Bangladesh to United Sta), six panels, logarithmic scale.]
[Figure: GDPo/head (US$) on a logarithmic scale, legend with countries ordered from United Sta to Bangladesh.]
1.9.4 Differences
par(mfrow=c(2,1),mex=.5)
x <- seq(-1,5,.01)
y <- sin(x)^6
dy <- abs(6*cos(x)^2*sin(x)^6)
yPrime <- y + .1*dy + .1   # second curve, called y' in the legend
plot(y ~ x,t="l",ylim=c(0,1.2))
lines(x,yPrime,lty=2)
legend("topleft",c("y'","y"),lty=c(2,1),cex=.5)
plot(x,yPrime - y,t="l",ylab="y' - y",xlab="x")   # the actual difference
[Figure: top, the curves y and y' against x; bottom, the difference y' - y against x.]
In the top graph it is difficult to assess the difference between the two curves.
If it is the difference that is interesting, then the graph should show the difference (bottom
graph).
2 ggplot
R provides a number of ways to create graphs. The most basic is perhaps the built-in plot.
More powerful ones are lattice and ggplot2. Here we use ggplot2 as a starting point. In
this chapter we want to explain how some standard graphs can be created with ggplot2.
iris
Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris
Society, 59, 2-5.
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point()
[Figure: scatterplot of Sepal.Width against Sepal.Length.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_smooth()
[Figure: smoothed Sepal.Width against Sepal.Length.]
ggplot(iris) +
geom_point(aes(x=Sepal.Length,y=Sepal.Width))
[Figure: scatterplot of Sepal.Width against Sepal.Length.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point() + geom_smooth()
[Figure: scatterplot of Sepal.Width against Sepal.Length with a smoothed line.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point(aes(color=Species)) + geom_smooth()
[Figure: scatterplot of Sepal.Width against Sepal.Length, points coloured by Species, one smoothed line.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point() + geom_smooth(aes(color=Species))
[Figure: scatterplot of Sepal.Width against Sepal.Length, one smoothed line per Species.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
stat_identity(geom="point")
[Figure: scatterplot of Sepal.Width against Sepal.Length.]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point(stat="identity")
[Figure: scatterplot of Sepal.Width against Sepal.Length.]
ggplot(iris,aes(x=Sepal.Width,color=Species)) +
stat_density()
[Figure: density of Sepal.Width by Species.]
ggplot(iris,aes(x=Sepal.Width,color=Species)) +
stat_density(geom="line")
[Figure: density of Sepal.Width by Species, drawn as lines.]
[Figure: wt by factor(cyl) (4, 6, 8), legend on the right.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl))) +
geom_point() + theme(legend.position="top")
[Figure: wt against hp, legend factor(cyl) at the top.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl))) +
geom_point() +
labs(color="Cyl.",x="Power (hp)",
y="Weight (1000 lbs)")
[Figure: Weight (1000 lbs) against Power (hp), legend title Cyl.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl))) +
geom_point() +
labs(color=NULL,x=NULL,y=NULL)
[Figure: the same scatterplot without axis labels and without legend title.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl))) +
geom_point() +
theme(legend.position="none")
[Figure: wt against hp without legend.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl))) +
geom_point() +
theme(legend.background=element_rect(
fill="gray",color="black"))
[Figure: wt against hp, legend factor(cyl) with gray background and black border.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl),
shape=factor(gear))) +
geom_point()
[Figure: wt against hp, legends factor(gear) (shape) and factor(cyl) (colour).]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl),
shape=factor(gear))) +
geom_point() +
guides(color=guide_legend(order=1))
[Figure: wt against hp, legend factor(cyl) now shown before factor(gear).]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl),
shape=factor(gear))) +
geom_point() + guides(color="none")
[Figure: wt against hp, only the legend factor(gear) is shown.]
ggplot(mtcars,aes(x=hp,y=wt,color=factor(cyl),
shape=factor(gear))) +
geom_point() +
labs(shape="Gears",color="Cyl.")
[Figure: wt against hp, legends titled Gears and Cyl.]
2.3 Scatterplots
library(pwt10)
data(pwt10.0)
pwtYC <- function(years,countries) {
pwt10.0 %>%
semi_join(pwt10.0 %>% ## find N most populous countries:
group_by(country) %>%
summarise(popM=median(pop,na.rm=TRUE)) %>%
arrange(-popM) %>%
top_n(countries)) %>% ## only the ... largest countries
filter(year>max(year)-years) %>% ## only the last ... years
mutate(gdp = cgdpo/pop,
country = substr(country,1,10)) %>%
select(c("country","gdp","year","csh_c","csh_i","csh_g"))
}
Feenstra RC, Inklaar R, Timmer MP (2015). The Next Generation of the Penn World Table,
American Economic Review, 105(10). pp. 3150-82.
pwtYC(years=6, countries=6)
[Figure: share of investment against year (2014 to 2019), one panel per country.]
Within ggplot we can simply draw two lines as two separate geoms:
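The code for this figure is not shown here. A minimal sketch of the idea, using a small made-up data frame instead of the PWT data (shares, consumption and investment are hypothetical names): with the data in wide format each line needs its own geom; after reshaping to long format one geom and one aesthetic are enough.

```r
## Sketch with hypothetical data: two lines as two geoms, or one geom on long data.
shares <- data.frame(year = 2014:2019,
                     consumption = c(.62, .61, .63, .62, .60, .61),
                     investment  = c(.21, .22, .20, .21, .23, .22))
## long format: one row per year and type
sharesLong <- data.frame(year  = rep(shares$year, times = 2),
                         type  = rep(c("C", "I"), each = nrow(shares)),
                         share = c(shares$consumption, shares$investment))
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ## wide data: one geom per line, the legend comes from aes(color = "...")
  pWide <- ggplot(shares, aes(x = year)) +
    geom_line(aes(y = consumption, color = "C")) +
    geom_line(aes(y = investment, color = "I")) +
    labs(y = "share", color = "type")
  ## long data: a single geom is sufficient
  pLong <- ggplot(sharesLong, aes(x = year, y = share, color = type)) +
    geom_line()
  print(pWide)
  print(pLong)
}
```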
[Figure: shares against year (1960 to 2020), one panel per country, legend type: C, I.]
# A tibble: 3 x 5
country gdp year type share
<chr> <dbl> <int> <chr> <dbl>
1 Brazil 1606. 1950 c 0.654
2 Brazil 1606. 1950 i 0.197
3 Brazil 1606. 1950 g 0.133
[Figure: share against year (1960 to 2020), one panel per country, one line per type (c, g, i).]
[Figures: share against year, one panel per type (c, g, i), one line per country (Brazil, China, India, Indonesia, Russian Fe, United Sta).]
ggplot(pwtYC(99,6),aes(y=country,x=csh_i)) +
stat_summary(fun.min=min,fun.max=max,fun=mean) +
labs(x="investment share",y=NULL)
[Figure: investment share by country, mean with range from minimum to maximum.]
ggplot(pwtYC(99,6),aes(y=country,x=csh_i)) +
stat_summary(fun.min=min,fun.max=max,fun=mean,
geom="crossbar") +
labs(x="investment share",y=NULL)
[Figure: investment share by country, shown as crossbars.]
ggplot(pwtYC(99,6),aes(y=country,x=csh_i)) +
stat_summary(fun.min=min,fun.max=max,
geom="errorbar",width=.2) +
stat_summary(fun=mean,geom="bar",alpha=.3) +
labs(x="investment share",y=NULL)
The “bar” suggests that the elements of the bar have a meaning. This might sometimes
make sense, for example if the bar stands for something you can count. Most of the time
bars make no sense.
[Figure: investment share by country, mean as bar with error bars from minimum to maximum.]
Segment plots and regression results We can also use segment plots to show regression
results. In the following example we use the pwt6.3 dataset to study the relation between
openc and gdpc per country:
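The code for the following figure is not included here. A minimal sketch of how such a segment plot could be produced, using the built-in iris data as a stand-in for the Penn World Table (an assumption), with one regression per group and one segment for the 95% confidence interval of the slope; the names ciList and ciData are only used for illustration.

```r
## Sketch: 95% confidence intervals for beta_1, one regression per group.
## iris is only a stand-in for the PWT data used in the figure.
ciList <- lapply(split(iris, iris$Species), function(d) {
  fit <- lm(Sepal.Width ~ Sepal.Length, data = d)
  ci  <- confint(fit, "Sepal.Length", level = 0.95)
  data.frame(beta1 = unname(coef(fit)["Sepal.Length"]),
             lower = ci[1, 1], upper = ci[1, 2])
})
ciData <- do.call(rbind, ciList)
ciData$group <- rownames(ciData)
print(ciData)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ## one segment per group, ordered by the estimated coefficient
  print(ggplot(ciData, aes(y = reorder(group, beta1))) +
          geom_segment(aes(x = lower, xend = upper,
                           yend = reorder(group, beta1))) +
          geom_point(aes(x = beta1)) +
          labs(x = "95% CI for beta_1", y = NULL))
}
```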
[Figure: 95% CI for β1 by country (Russian Fe, United Sta, Brazil, Indonesia, India, China).]
3.2 Densityplots
48 Using Graphs and Visualising Data — 3 GGPLOT, MORE ADVANCED FEATURES
ggplot(iris) +
geom_density(aes(x=Sepal.Length,color="Sepal.Length")) +
geom_density(aes(x=Sepal.Width,color="Sepal.Width")) +
facet_wrap(~ Species) + labs(x="Length, Width")
[Figure: densities of Sepal.Length and Sepal.Width, one panel per species.]
3.3 Histograms
ggplot(iris) +
geom_histogram(aes(x=Sepal.Length,fill="Sepal.Length")) +
geom_histogram(aes(x=Sepal.Width,fill="Sepal.Width")) +
facet_wrap(~ Species) + labs(x="Length, Width")
[Figure: histograms of Sepal.Length and Sepal.Width, one panel per species.]
[Figure: ECDF of Sepal.Length and Sepal.Width, one panel per species.]
# A tibble: 3 x 5
Petal.Length Petal.Width Species name value
<dbl> <dbl> <fct> <chr> <dbl>
1 1.4 0.2 setosa Sepal.Length 5.1
2 1.4 0.2 setosa Sepal.Width 3.5
3 1.4 0.2 setosa Sepal.Length 4.9
tidyr::pivot_longer(iris,cols=starts_with("Sepal.")) %>%
ggplot(aes(x=value,color=Species,lty=name)) +
stat_ecdf() + labs(lty="Measure",y="ECDF")
[Figure: ECDF of value, colour by Species, line type by Measure (Sepal.Length, Sepal.Width).]
tidyr::pivot_longer(iris,cols=starts_with("Sepal.")) %>%
ggplot(aes(sample=value,color=Species,lty=name)) +
stat_qq() + stat_qq_line() + labs(y="Length, Width", x="Quantile")
[Figure: Q-Q plot of Length, Width against Quantile, colour by Species, line type by name.]
data.frame(qqplot(0:100,1:5))
x y
1 0 1
2 25 2
3 50 3
4 75 4
5 100 5
[Figure: Q-Q plot of 1:5 against 0:100]
data(Wages,package="Ecdat")
Wages %>% mutate(edG = cut_number(ed,n=3)) %>%
group_by(edG) %>%
summarise(data.frame(qqplot(plot.it=FALSE,lwage[sex!="male"],lwage[sex=="male"]))) %>%
ggplot(aes(x=x,y=y)) + geom_line() + labs(x="female",y="male") +
geom_abline(slope=1,intercept=0) + facet_wrap(~edG)
[Figure: Q-Q plots of lwage, male against female, with 45° line, one panel per education group edG]
3.7 Boxplots
Wages %>% mutate(edG = cut_number(ed,n=3)) %>%
ggplot(aes(y=lwage,x=sex)) + geom_boxplot() + facet_wrap(~edG)
[Figure: boxplots of lwage by sex, one panel per education group edG]
3.8 Barcharts
ggplot can also do bar charts:
We should note that often a dotplot or xyplot presents the same data in a better way.
[Figure: bar chart of share against year, filled by type (c, g, i), one panel per country]
3.9 Coplots
[Figure: share by country (Indonesia, Russian Fe, China, United Sta, Brazil, India), conditioned on type (c, g, i)]
3.10 Parameters
3.10.1 Types of lines
With lattice we would choose between different types of lines with type. With ggplot
we use different geoms. In the following graph we use aes(color=...) to create a legend
for the different geoms.
data(Caschool,package="Ecdat")
ggplot(data=Caschool,aes(x=avginc,y=testscr))+geom_point()+
geom_smooth(aes(color="loess",fill="loess",lty="loess"))+
geom_smooth(aes(color="lm",fill="lm",lty="lm"),method="lm") +
labs(fill="type",color="type",lty="type")
[Figure: testscr against avginc with lm and loess smoothers; legend: type]
3.10.2 Axes
Different scales for different panels Like lattice, ggplot by default uses the same
scale for all panels in a plot. This can be changed with the parameter scales in
facet_wrap.
Same scale (the default):
ggplot(pwtYC(99,6), aes(x=year,y=csh_g)) +
geom_line() +
facet_wrap(~reorder(country,csh_g)) +
labs(y="G")
[Figure: G against year, one panel per country, same y scale in all panels]
ggplot(pwtYC(99,6), aes(x=year,y=csh_g)) +
geom_line() +
facet_wrap(~reorder(country,csh_g),scales="free_y") +
labs(y="G")
Sliced scale (facet_grid(...,space='free')): the panels have the same scale, but a different
origin (this differs from lattice):
ggplot(pwtYC(99,6), aes(x=year,y=csh_g)) +
geom_line() + facet_grid(.~reorder(country,csh_g),scales="free",space="free") +
labs(y="G") + coord_flip()
[Figure: year against G (flipped coordinates), one panel per country, free scales and free panel space]
Individual axes
We can influence where an axis is labelled as follows:
ggplot(data=pwtYC(99,6), aes(x=year,y=csh_g)) +
geom_line() + facet_grid(.~reorder(country,csh_g)) + labs(y="G") +
scale_y_log10(breaks=c(.07,.08,.09,.1,.12,.15,.2,.25,.3)) +
scale_x_continuous(breaks=c(1950,2000))
[Figure: G against year on a log scale, one panel per country, x axis labelled at 1950 and 2000]
[Figure: scatterplot of g against i with country labels]
3.11 Zooming
Sometimes we want to show only part of the data. This is no problem if the graph shows nothing
but the data. If, however, the graph also shows statistics, e.g. a smooth line, then the shape of
the line depends on which data are included.
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_smooth() + geom_point() +
annotate("rect",xmin=4.5,xmax=5,ymin=2.9,
ymax=3.4,alpha=.3,fill="red")
[Figure: Sepal.Width against Sepal.Length with smooth line; the region from 4.5 to 5 (x) and 2.9 to 3.4 (y) is shaded red]
The following graph uses only a subset of the data to calculate the smooth line.
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_smooth() + geom_point() +
xlim(c(4.5,5)) + ylim(c(2.9,3.4))
[Figure: Sepal.Width against Sepal.Length, restricted to the shaded region with xlim and ylim]
The following graph uses the entire data to calculate the smooth line.
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_smooth() + geom_point() +
coord_cartesian(xlim=c(4.5,5),ylim=c(2.9,3.4))
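The difference between the two approaches can also be checked numerically. The following is a minimal sketch that uses a linear fit with lm as a simple stand-in for the smooth line (an assumption made for brevity; the object names inWindow, irisZoom, fitAll and fitZoom are illustrative only). Dropping the observations outside the window, as xlim and ylim do, changes the estimate; coord_cartesian only changes the visible region and leaves the estimate untouched.

```r
# Observations inside the zoom window (what xlim/ylim would keep)
inWindow <- with(iris, Sepal.Length >= 4.5 & Sepal.Length <= 5 &
                       Sepal.Width  >= 2.9 & Sepal.Width  <= 3.4)
irisZoom <- iris[inWindow, ]

# lm is a stand-in for the smoother: fit once on all data, once on the subset
fitAll  <- lm(Sepal.Width ~ Sepal.Length, data = iris)
fitZoom <- lm(Sepal.Width ~ Sepal.Length, data = irisZoom)

# Number of observations and estimated coefficients in both cases
c(all = nrow(iris), zoom = nrow(irisZoom))
rbind(all = coef(fitAll), zoom = coef(fitZoom))
```

The two rows of coefficients differ, which is the numerical counterpart of the two differently shaped smooth lines.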
[Figure: Sepal.Width against Sepal.Length, zoomed into the shaded region with coord_cartesian]
3.12 Themes
p + theme_gray()
p + theme_bw()
p + theme_light()
library(ggthemes)
p + theme_economist() + scale_colour_economist()
library(ggthemes)
p + theme_solarized() + scale_colour_solarized("blue")
p + theme(strip.background=element_rect(fill="#8FC9C7"))
Setting colors Parts of the plot which do not represent data can be influenced with theme_set().
If we also want to change the presentation of the data once and for all, we can redefine the
ggplot function:
Lines will be drawn in a different color. Points will have a different shape.
If no colors are desired, then use scale_color_grey(end=0) and scale_fill_grey(end=0).
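The redefinition itself is not shown here. One possible way to do it is sketched below; this is an assumption and not necessarily the author's code. A wrapper named ggplot calls ggplot2::ggplot and always adds the two grey scales, so that every later plot is drawn without colors.

```r
library(ggplot2)

# Hypothetical redefinition: every call to ggplot() now adds grey scales
# for the color and fill aesthetics
ggplot <- function(...) ggplot2::ggplot(...) +
  scale_color_grey(end = 0) + scale_fill_grey(end = 0)

# Plots are created as usual; Species is now shown in shades of grey
pGrey <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
pGrey
```

Calling rm(ggplot) removes the wrapper and restores the original behaviour.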
4 Nominal data
The case of purely nominal data is rare, although we might want to present a simplified version
(where only nominal categories matter) of the data in the description.
n group
1 2 type A
2 4 type B
3 3 type C
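The code that creates nomD is not shown. Judging from the printed table, it could be built as follows (an assumed construction; only the column names and values are taken from the output above):

```r
# Assumed construction of the data frame printed above
nomD <- data.frame(n = c(2, 4, 3),
                   group = c("type A", "type B", "type C"))
nomD
```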
ggplot(nomD,aes(x=group,y=n)) +
geom_point() + expand_limits(y=0)
[Figure: dot plot of n by group (type A, type B, type C), y axis starting at 0]
ggplot(nomD,aes(x=group,y=n,fill=group)) +
geom_bar(stat="identity") +
theme(legend.position="none")
[Figure: bar chart of n by group (type A, type B, type C)]
waste of ink
ggplot(nomD,aes(x="",y=n,fill=group)) +
geom_bar(stat="identity")
[Figure: single stacked bar of n, filled by group]
ggplot(nomD,aes(x=n,y="",fill=group)) +
geom_bar(stat="identity") +
coord_polar()
[Figure: pie chart of n, filled by group]
ggplot(nomD,aes(x=n,y=group)) +
geom_point() +
geom_segment(aes(x=0,xend=n,yend=group)) +
expand_limits(x=0)
[Figure: horizontal dot plot of n by group, with segments starting at 0]
Pie chart
– The eye is not good at comparing angles (except 90° and 180°).
Avoid pie charts (unless 90° and 180° are of special significance).
• Bubbleplots
• Dot-plots
data(HairEyeColor)
reshape2::melt(HairEyeColor)
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Eye,y=value,fill=Hair)) +
geom_bar(stat='identity') + myFillSc
[Figure: stacked bar chart of value by Eye, filled by Hair (Black, Brown, Red, Blond)]
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Eye,y=value,fill=Hair)) +
geom_bar(stat='identity',position="dodge2") +
myFillSc
[Figure: dodged bar chart of value by Eye, filled by Hair (Black, Brown, Red, Blond)]
library(ggmosaic)
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>% ggplot() +
geom_mosaic(aes(x=product(Hair,Eye),
weight=value,fill=Hair)) +
myFillSc
[Figure: mosaic plot of Hair by Eye, filled by Hair]
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>% ggplot() +
geom_mosaic(aes(x=product(Eye,Hair),
weight=value,fill=Eye)) +
scale_fill_manual(values=c("brown","blue",
"#ffdd88","green"))
[Figure: mosaic plot of Eye by Hair, filled by Eye]
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Eye,y=Hair,size=value)) +
geom_point()
[Figure: bubble plot of Hair against Eye, point size by value]
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>%
ggplot(aes(y=Hair,x=value)) +
geom_point() + facet_wrap(vars(Eye),nrow=1)
[Figure: dot plots of value by Hair, one panel per Eye colour]
reshape2::melt(HairEyeColor) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Eye,y=value,color=Hair)) +
geom_point() + myColSc
[Figure: value against Eye, colour by Hair (Black, Brown, Red, Blond)]
Multiway dot-plots Multiway dot-plots are another way to present two-way count
data:
[Figure: multiway dot plot of value by Eye, one panel per Hair colour (Black, Brown, Red, Blond)]
colHair<-c("black","gold","brown","red")
myTheme<-within(lTheme,{
dot.line$col<-colEye
superpose.symbol$col<-colHair
superpose.line$col<-colHair})
keys<-list(space="top",columns=2,lines=TRUE)
(d2<-dotplot(Eye ~ value ,group=Hair,data=HairEyeMale,t=c("p","a"),auto.key=keys,par.settings=m
[Figure: dot plot of value by Eye, grouped by Hair (Black, Brown, Red, Blond)]
+ good if the number of categories is large (in particular for numeric categories)
reshape2::melt(HairEyeColor) %>%
ggplot(aes(y=Hair,x=value)) +
geom_point() +
facet_grid(cols=vars(Eye),rows=vars(Sex))
[Figure: dot plots of value by Hair, one column per Eye colour and one row per Sex]
[Figures: two further plots of the combinations of Sex and Eye (Sex:Eye; Eye within Sex)]
set.seed(123)
data.frame(x = rnorm(100,mean=12,sd=4)) %>%
ggplot(aes(sample=x)) + stat_qq() +
stat_qq_line() +
labs(x="Theoretical quantiles",
y="Sample quantiles")
[Figure: normal Q-Q plot with Q-Q line; x: Theoretical quantiles; y: Sample quantiles]
mtcars %>%
ggplot(aes(sample=mpg)) +
stat_qq() + stat_qq_line()
[Figure: normal Q-Q plot of mpg with Q-Q line]
mtcars %>%
ggplot(aes(sample=mpg,color=factor(cyl))) +
stat_qq() + stat_qq_line()
[Figure: normal Q-Q plots of mpg with Q-Q lines, colour by factor(cyl) (4, 6, 8)]
qqnorm compares a sample with a given theoretical distribution (the normal distribution).
qqplot compares a sample with a given empirical distribution, i.e. with a second sample.
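A minimal sketch of the difference, using simulated data (the samples x and y and all object names are made up for illustration). With plot.it=FALSE both functions return the coordinates instead of drawing a plot:

```r
set.seed(123)
x <- rnorm(100, mean = 12, sd = 4)   # first sample
y <- rnorm(80,  mean = 10, sd = 4)   # second sample

# qqnorm: theoretical normal quantiles (x) against the sample (y)
qn <- qqnorm(x, plot.it = FALSE)

# qqplot: quantiles of the first sample against quantiles of the second;
# the longer sample is interpolated to the length of the shorter one
qq <- qqplot(x, y, plot.it = FALSE)

head(data.frame(qq))
```

The data frame returned by qqplot can be passed on to ggplot, as in the Wages example earlier.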
5.2.1 Histograms
[Figure: histogram of Sepal.Length; y: count]
ggplot(iris,aes(Sepal.Length)) +
geom_histogram(bins=12,fill="gray",color="black")
[Figure: histogram of Sepal.Length with 12 bins; y: count]
ggplot(iris,aes(x=Sepal.Length,fill=Species)) +
geom_histogram()
[Figure: stacked histogram of Sepal.Length, filled by Species]
ggplot(iris,aes(x=Sepal.Length,fill=Species)) +
geom_histogram(position="dodge")
[Figure: dodged histogram of Sepal.Length, filled by Species]
ggplot(iris,aes(x=Sepal.Length,color=Species)) +
geom_density()
[Figure: density of Sepal.Length, colour by Species]
[Figure: density of Sepal.Length by Species, y axis from 0 to 1]
[Figures: three plots of Sepal.Length by Species, with Species on the vertical axis]
Means and standard deviations are much less informative than boxplots. The following three
graphs all show the same four distributions, which all have identical sample means and
standard deviations (right diagram). Still, scattergrams (left) and boxplots (middle) reveal that
the four samples are quite different.
set.seed(123)
xx<-as.data.frame(cbind(seq(0,2,length=20),c(seq(-8,2,length=10),seq(4,6,length=10)),
c(seq(0,1.8,length=18),4,4.1),rnorm(20)))
for(i in 1:4) {xx[,i]<-(xx[,i]-mean(xx[,i]))/sd(xx[,i])}
xx<-reshape(xx,direction="long",v.names="x",varying=list(1:4))
xx<-within(xx,{time<-as.factor(time);levels(time)<-letters[1:4]})
ylim=range(c(xx$x))
par(mfrow=c(1,3))
with(xx,plot(x ~ as.integer(time),xaxt="n",ylim=ylim,xlab="",main="sample"))
axis(1,at=1:4,labels=letters[1:4])
boxplot(x ~ time,data=xx,ylim=ylim,main="boxplot")
library(plotrix)
dispData<-aggregate(x~time,
FUN=function(x) c(mean=mean(x),sd=sd(x)),
data=within(xx,time<-as.numeric(time)))
with(dispData,{
plot(x[,"mean"] ~ time,dispData,xaxt="n",ylim=ylim,xlab="",ylab="",
main="means and sample standard deviations")
dispersion(1:4,x[,"mean"],x[,"sd"])
})
axis(1,at=1:4,labels=letters[1:4])
[Figure: three panels for the samples a, b, c, d: sample (left), boxplot (middle), means and sample standard deviations (right)]
ggplot(iris,aes(Sepal.Length,color=Species)) +
stat_ecdf() +
labs(y="Empirical CDF")
[Figure: empirical CDF of Sepal.Length, colour by Species]
data(pwt10.0)
N <- 12
pwt10.0 %>%
semi_join(pwt10.0 %>% ## find N most populous countries:
group_by(country) %>%
summarise(popM=median(pop,
na.rm=TRUE)) %>%
arrange(-popM) %>%
top_n(N)) %>%
filter(year>max(year)-6) %>%
mutate(gdp = cgdpo/pop) %>%
select(c("country","gdp","year")) -> pwt12
pwt12
ggplot(pwt12,aes(y=country,x=gdp)) + geom_point() +
scale_x_log10() + labs(x="GDPo/head (US\\$)")
[Figure: dot plot of GDPo/head (log scale) by country, countries in alphabetical order]
pwt12 %>%
mutate(country = reorder(factor(substr(country,1,10)),-gdp)) %>%
ggplot(aes(y=country,x=gdp)) + geom_point() +
scale_x_log10() + labs(x="GDPo/head (US\\$)")
[Figure: dot plot of GDPo/head (log scale) by country, countries ordered by GDP: Bangladesh, Pakistan, Nigeria, India, Indonesia, China, Brazil, Mexico, Russian Fe, Japan, Germany, United Sta]
Multiway dot-plots
pwt12 %>%
mutate(country = reorder(factor(substr(country,1,10)),-gdp)) %>%
ggplot(aes(y=country,x=gdp)) + geom_point() +
scale_x_log10() + labs(x="GDPo/head (US\\$)") +
facet_wrap(vars(year))
5.4 Summary
Histograms
+ Everybody understands them
– Don’t reveal small differences
– Depend on breaks
Densities
+ Easy to understand
– Need assumptions (must be estimated)
– Depend on bandwidth
Boxplot
+ Shows summary statistics
– Aggregates data
Barplot of means
– Uses a lot of space to show a small amount of information.
ECDF
+ Provides a lot of information
+ Doesn’t depend much on parameters
Q-Q Plot
+ Provides a lot of information
+ Doesn’t depend much on parameters
+ Reveals even small differences between distributions
– Only compares two variables
Dot-Plot
+ Provides detailed information
+ Doesn’t depend much on parameters
– Requires a small number of observations
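The dependence of histograms on breaks and of densities on the bandwidth can be illustrated with a short sketch (the values for breaks and bw are arbitrary choices for illustration):

```r
x <- iris$Sepal.Length

# Histogram: the same data with few and with many breaks
hCoarse <- hist(x, breaks = 5,  plot = FALSE)
hFine   <- hist(x, breaks = 20, plot = FALSE)
c(coarse = length(hCoarse$counts), fine = length(hFine$counts))

# Density: the same data with a small and a large bandwidth
dNarrow <- density(x, bw = 0.05)
dWide   <- density(x, bw = 0.5)
c(narrow = max(dNarrow$y), wide = max(dWide$y))
```

A small bandwidth produces many sharp peaks, a large bandwidth a single flat hump. The ECDF of the same data needs no such parameter.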
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species,shape=Species)) +
geom_point()
[Figure: scatterplot of Sepal.Width against Sepal.Length, colour and shape by Species]
With larger data frames scatterplots might provide too much information:
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species,shape=Species)) +
geom_point() + ggpubr::stat_conf_ellipse(bary=FALSE)
[Figure: scatterplot of Sepal.Width against Sepal.Length with one ellipse per Species]
library(car)
attach(iris)
xhist <- hist(Sepal.Width, breaks=10,plot=FALSE)
yhist <- hist(Sepal.Length, breaks=10,plot=FALSE)
xrange <- range(xhist$breaks)
yrange <- range(yhist$breaks)
layout(rbind(c(2,0),c(1,3)),
widths=c(4,1), heights=c(1,4))
par(mar=c(4,4,0,0))
plot(Sepal.Width, Sepal.Length,
xlim=xrange, ylim=yrange)
dataEllipse(Sepal.Width,Sepal.Length,
levels=c(.5,.95),plot.points=FALSE)
par(mar=c(0,4,0,0))
barplot(xhist$counts, axes=FALSE)
par(mar=c(4,0,0,0))
barplot(yhist$counts, axes=FALSE,
horiz=TRUE)
detach(iris)
[Figure: scatterplot of Sepal.Length against Sepal.Width with data ellipses and marginal histograms]
We use hist to calculate the range of the plot and to prepare the barplot:
attach(iris)
xhist <- hist(Sepal.Width, breaks=10,plot=FALSE)
yhist <- hist(Sepal.Length, breaks=10,plot=FALSE)
xhist$breaks
[1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
xhist$counts
[1] 4 7 13 23 36 24 18 10 9 3 2 1
detach(iris)
library(car)
attach(iris)
dataEllipse(Sepal.Length,Sepal.Width,
groups=Species,levels=c(.5,.95))
detach(iris)
[Figure: Sepal.Width against Sepal.Length with points and data ellipses for setosa, versicolor and virginica]
library(car)
attach(iris)
dataEllipse(Sepal.Length,Sepal.Width,
groups=Species,levels=c(.5,.95),
draw=TRUE,plot.points=FALSE,add=FALSE)
detach(iris)
[Figure: data ellipses of Sepal.Width against Sepal.Length for setosa, versicolor and virginica, without points]
5.5.3 Bagplot
• The dark-blue area: The “bag”. This area contains 50% of all observations.
• The light-blue area: Contains all points which lie within the bag expanded by a factor of 3.
[Figure: y against g (0, 1)]
library(aplpack)
with(iris,bagplot(Sepal.Length,Sepal.Width))
[Figure: bagplot of Sepal.Width against Sepal.Length]
library(ks)
iris %>%
select(Sepal.Length,Sepal.Width,Species) ->
data2
dlply(data2,.(Species),
function(d) {
kde(d[,1:2])
}) -> kdeList
with(data2,
plot(Sepal.Length,Sepal.Width,cex=.2,
col="gray",pch=as.numeric(Species)))
for(i in 1:length(kdeList))
plot(kdeList[[i]],add=TRUE,lty=i,
col.fun=function(n){rainbow(n)})
legend("topright",
lty=1:length(kdeList),
pch=1:length(kdeList),
names(kdeList),cex=.5)
[Figure: kernel density contours (25, 50, 75) of Sepal.Width against Sepal.Length for setosa, versicolor and virginica]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,
color=Species,shape=Species)) +
geom_density_2d() +
geom_point()
[Figure: scatterplot of Sepal.Width against Sepal.Length with 2D density contours, colour and shape by Species]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
stat_density_2d(aes(fill = ..level..),
geom = "polygon",
colour="white") +
scale_fill_distiller(palette= "Spectral")
[Figure: filled 2D density polygons of Sepal.Width against Sepal.Length; legend: level]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_bin2d(bins=10) +
scale_fill_continuous(type = "viridis")
[Figure: rectangular 2D bins of Sepal.Width against Sepal.Length; legend: count]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_hex(bins=10) +
scale_fill_continuous(type = "viridis")
[Figure: hexagonal 2D bins of Sepal.Width against Sepal.Length; legend: count]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,
color=Species,shape=Species)) +
geom_point() + geom_smooth()
[Figure: scatterplot of Sepal.Width against Sepal.Length with loess smoother per Species]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,
color=Species,shape=Species)) +
geom_point() + geom_smooth(method="lm")
[Figure: scatterplot of Sepal.Width against Sepal.Length with linear fit per Species]
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,
color=Species,shape=Species)) +
geom_smooth(span=1)
[Figure: smoother of Sepal.Width against Sepal.Length per Species, span=1]
How smooth?
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,
color=Species,shape=Species)) +
geom_smooth(span=.5)
[Figure: smoother of Sepal.Width against Sepal.Length per Species, span=.5]
6.1.2 GAM
Loess (locally estimated scatterplot smoothing) relates one variable to a smooth function
of only one other variable. What if there are more variables?
For more complex relationships (and as an extension of the linear model) we can use GAM
(generalised additive models).
Linear Regression:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + u$
GAM (Generalised additive model):
$Y = \beta_0 + s_1(X_1) + s_2(X_2) + \ldots + \beta_k X_k + \ldots + u$
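The commands that estimate est.ols and est.gam are not shown. Judging from the formulas reported in the summary output, the two models could be estimated along the following lines (a sketch which assumes that the packages mgcv and Ecdat are installed):

```r
library(mgcv)                        # provides gam() and s()
data(Caschool, package = "Ecdat")

# Linear model: all regressors enter linearly
est.ols <- lm(testscr ~ elpct + avginc + str, data = Caschool)

# GAM: smooth functions of elpct and avginc, linear term for str
est.gam <- gam(testscr ~ s(elpct) + s(avginc) + str, data = Caschool)
```

Only str keeps a single coefficient in the GAM; elpct and avginc are each represented by a spline.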
summary(est.ols)
Call:
lm(formula = testscr ~ elpct + avginc + str, data = Caschool)
Residuals:
Min 1Q Median 3Q Max
-42.800 -6.862 0.275 6.586 31.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 640.31550 5.77489 110.879 <2e-16 ***
elpct -0.48827 0.02928 -16.674 <2e-16 ***
avginc 1.49452 0.07483 19.971 <2e-16 ***
str -0.06878 0.27691 -0.248 0.804
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output for a GAM is similar to the output for OLS. Of course, the splines (here for
elpct and avginc) are not shown. The output provides only the result of an F-test and the
estimated degrees of freedom (edf).
summary(est.gam)
Family: gaussian
Link function: identity
Formula:
testscr ~ s(elpct) + s(avginc) + str
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 656.9103 5.3040 123.852 <2e-16 ***
str -0.1402 0.2689 -0.521 0.602
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(est.gam,pages=0)
[Figure: estimated smooth terms s(elpct,2.42) against elpct and s(avginc,3.17) against avginc]
library(mgcv)
est2.gam <- gam(testscr ~ s(elpct,avginc) + str,data=Caschool)
summary(est2.gam)
Family: gaussian
Link function: identity
Formula:
testscr ~ s(elpct, avginc) + str
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 658.0740 5.3777 122.371 <2e-16 ***
str -0.1995 0.2728 -0.731 0.465
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(est2.gam,pages=1,pers=TRUE,theta=0)
[Figure: perspective plot of s(elpct,avginc,18.96) over elpct and avginc, theta=0]
plot(est2.gam,pages=1,pers=TRUE,theta=40)
[Figure: perspective plot of s(elpct,avginc,18.96) over elpct and avginc, theta=40]
plot(est2.gam,pages=1,pers=TRUE,theta=75,phi=5)
[Figure: perspective plot of s(elpct,avginc,18.96) over elpct and avginc, theta=75, phi=5]
plot(est2.gam,pages=1)
[Figure: contour plot of s(elpct,avginc) with elpct on the horizontal and avginc on the vertical axis]
vwReg(testscr~avginc,data=CaPart,B=100,
spag=TRUE,slices=50)
[Figure: visually weighted regression of testscr on avginc]
[Figure: second plot of testscr against avginc]
Data ellipses
– assumes a linear relationship
Bagplot
– not very well known
Kernel densities
+ easy to understand
– relies on assumptions (must be estimated, depend on bandwidth)
Regression line
– Assumes a linear causal relationship
Loess/GAM/VWReg
– Assume causal (not necessarily linear) relationship
If the data are highly correlated then a standard scatterplot (left diagram) wastes a lot of
space to the top left and bottom right of the 45° line.
The Tukey mean-difference plot (also known as Bland-Altman plot) basically rotates the
diagram by 45° and, thus, can save space. (This plot aims at showing agreement of the two
elements of the pairs and, hence, also shows the mean of the differences ± two standard
deviations.)
The bump chart presents essentially the same information, but with a focus on the identity
of the observations. We would usually not do this for anonymous observations, but it is useful,
e.g., if observations are countries or cities.
Create some paired data:
set.seed(1)
N<-12
x<-runif(N)
y<-x+.1*rnorm(N)
pairD <- data.frame(x=x,y=y,g=letters[1:N])
[Figure: scatterplot of y against x with labels a to l]
pairD %>%
mutate(xyMean = (x+y)/2, yxDiff = (y-x)) %>%
ggplot(aes(x=xyMean,y=yxDiff)) + geom_point() +
labs(x="(x+y)/2",y="y-x") +
directlabels::geom_dl(aes(label=g),
method="smart.grid")
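The mean of the differences ± two standard deviations can be added as horizontal lines. The following sketch recreates the paired data and computes these limits by hand (the names limits and pMD are chosen here for illustration; the labels from directlabels are left out to keep the example short):

```r
library(ggplot2)

set.seed(1)
N <- 12
x <- runif(N)
y <- x + .1 * rnorm(N)
pairD <- data.frame(x = x, y = y, g = letters[1:N])

# Mean and difference of each pair
pairD$xyMean <- (pairD$x + pairD$y) / 2
pairD$yxDiff <- pairD$y - pairD$x

# Mean difference and mean difference +/- two standard deviations
mD <- mean(pairD$yxDiff)
sD <- sd(pairD$yxDiff)
limits <- c(lower = mD - 2 * sD, mean = mD, upper = mD + 2 * sD)

pMD <- ggplot(pairD, aes(x = xyMean, y = yxDiff)) + geom_point() +
  geom_hline(yintercept = unname(limits),
             linetype = c("dashed", "solid", "dashed")) +
  labs(x = "(x+y)/2", y = "y-x")
pMD
```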
[Figure: Tukey mean-difference plot, y-x against (x+y)/2, with labels a to l]
Bump-plot:
pairD %>% tidyr::pivot_longer(cols=c("x","y")) %>%
mutate(time=ifelse(name=="x",0,1)) %>%
ggplot(aes(x=time,y=value,color=g)) +
ggbump::geom_bump() +
geom_text(data=pairD,aes(x=-.01,y=x,label=g)) +
geom_text(data=pairD,aes(x=1.01,y=y,label=g)) +
theme(legend.position="none")
[Figure: bump chart of value against time (0 = x, 1 = y), one line per observation a to l]
x y z
[1,] 0.30809754 0.0491014380 0.64280102
[2,] 0.50424783 0.2828580196 0.21289415
[3,] 0.24106888 0.4709212352 0.28800989
[4,] 0.45065923 0.0622128465 0.48712793
triax.plot(data3,show.grid=TRUE,pch=16*type+1,
cex.ticks=.7)
legend(1,1,c("Treatment A","Treatment B"),
pch=c(1,17))
[Figure: triangle plot of x, y and z with grid; symbols for Treatment A and Treatment B]
6.2.3 Stars
[Figure: star plots (Volvo 142E, Maserati Bora) with rays for mpg, cyl, disp, hp, drat, wt and qsec]
7 Lattice
7.1 Multiway xyplots
Sometimes we want to display one type of diagram separately for different levels of a factor.
Here is an example:
Example: development of investment share (ci) over time (year), separately for each
country.
Get a subset of the data (six largest countries, later than 2001) from the Penn World Table:
library(pwt)
lattice.options(default.args=list(as.table=TRUE))
data(pwt6.3)
xx<-with(pwt6.3,aggregate(pop,list(country=country),mean))
xx<-subset(xx,country!="China Version 2")
N<-6
xx<-subset(xx,x>=-sort(-xx[["x"]])[N])
xx<-merge(xx,pwt6.3)
xx<-subset(xx,year>2001)
levels(xx$country)[grep("United States",levels(xx$country))]<-"U.S.A."
levels(xx$country)[grep("China",levels(xx$country))]<-"China"
Reorder the countries. The order of the factor levels is used later in the plots. Here we order
according to the median of ci:
xx<-within(xx,country<-reorder(factor(country),xx$ci,
function(x) -median(x,na.rm=TRUE)))
Sorting the data by country and year makes it easier to draw connecting lines later on:
xx.pw<-xx[with(xx,order(country,year)),]
xx.pw[,c("country","year","pop","ci")]
[Figure: investment share against year (2002 to 2007), one panel per country]
[Figure: investment share against year, grouped by country (China, U.S.A., India, Russia, Indonesia, Brazil)]
The legend of the previous plot does not show the lines.
[Figure: investment share against year, grouped by country, with lines shown in the legend]
7.2 Syntax
The data we want to display in our lattice plot is described with the help of a formula:
• vertical ∼ horizontal | conditioning variable creates for each level of the conditioning
variable one panel with one graph.
Graphs with variables only on the horizontal axis (examples would be density plots,
histograms, etc.), where R creates the values for the vertical axis (e.g. histogram, densityplot, ecdfplot):
• ∼ horizontal | conditioning variable creates for each level of the conditioning variable
one panel with one graph.
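Both forms of the formula can be tried with a built-in dataset. The following sketch uses iris instead of the Penn World Table data (the choice of dataset and the names pXY and pDens are assumptions made for illustration):

```r
library(lattice)

# vertical ~ horizontal | conditioning variable: one panel per Species
pXY <- xyplot(Sepal.Width ~ Sepal.Length | Species, data = iris)

# ~ horizontal | conditioning variable: lattice computes the vertical axis
pDens <- densityplot(~ Sepal.Length | Species, data = iris)

pXY
pDens
```

Each call returns a trellis object with three panels, one for each level of Species.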
Types of lines The parameter type determines how points are displayed, e.g. type="b" or
type=c("b","smooth","g").
[Figure: panels for type "p", "l", "b", "r" and "g", with horizontal=TRUE and horizontal=FALSE]
[Figure: investment share by country, one panel per year]
If the vertical variable (country in this case) is a factor, then dotplot generates even nicer
graphs:
[Figure: dot plot of investment share by country, one panel per year]
We can, of course, show more than one variable on the horizontal axis:
keys<-list(text=c("private","gov."),space="right",lines=TRUE,size=2,between=.5)
dotplot(country ~ ci+cg| factor(year),data=xx.pw,xlab="investment share",
horizontal=TRUE,auto.key=keys,t="b")
[Figure: dot plot of ci and cg (private, gov.) by country, one panel per year; x: investment share]
Certainly, we can also have more than one variable on the vertical axis:
xyplot(ci+cg ~ year| country,layout=c(6,1),data=xx.pw,
ylab="investment share",auto.key=keys,t="b")
[Figure: ci and cg (private, gov.) against year, one panel per country; y: investment share]
Segment plots Sometimes we have to plot segments. Here we plot a range of the minimum
investment share to the maximum investment share.
library(latticeExtra)
xx2<-as.data.frame(t(sapply(by(xx.pw,list(xx.pw$country),function(x)
c(min=min(x$ci),mean=mean(x$ci),max=max(x$ci))),c)))
xx2
xx2<-within(xx2,{country<-factor(rownames(xx2))})
segplot(country ~ min+max,centers=mean,draw.bands=FALSE,xlab="investment share",data=xx2)
[Figure: segment plot of minimum, mean and maximum investment share by country, countries in alphabetical order]
segplot(reorder(factor(country),mean) ~ min+max,centers=mean,
xlab="investment share",draw.bands=FALSE,data=xx2)
[Figure: segment plot of minimum, mean and maximum investment share by country, countries ordered by mean]
Segment plots and regression results We can also use segment plots to show regression
results. In the following example we use the pwt6.3 dataset to study the relation between
openc and cgdp per country:
segplot(reorder(country,coef)~lower+upper,
centers=coef,data=reg.ci,
draw.bands=FALSE,
segments.fun = panel.arrows,
ends = "both",angle = 90,
length = 1, unit = "mm")
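The construction of reg.ci is not shown. A table with one coefficient and its confidence interval per group can be built as in the following sketch. It uses the iris data and a regression of Sepal.Width on Sepal.Length per Species as a stand-in, since the exact regression used for the Penn World Table data is not given here; the column name group is likewise an assumption.

```r
# One regression per group; collect the slope and its 95% confidence interval
reg.ci <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  est <- lm(Sepal.Width ~ Sepal.Length, data = d)
  ci  <- confint(est, "Sepal.Length", level = 0.95)
  data.frame(group = d$Species[1],
             coef  = coef(est)[["Sepal.Length"]],
             lower = ci[1, 1],
             upper = ci[1, 2])
}))
reg.ci
```

A data frame of this shape can be passed to segplot with a formula such as reorder(group,coef) ~ lower+upper and centers=coef.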
[Figure: segment plot of regression coefficients with confidence intervals by country (U.S.A., Brazil, India, Russia, Indonesia, China)]
7.4 Densityplots
data(pwt5.6)
pwt5.6<-within(pwt5.6,continent<-sub(" & ","+",continent))
keys<-list(text=c("private","gov."),space="top",columns=2,lines=TRUE)
densityplot(~i+g | continent,data=pwt5.6,plot.points=FALSE,xlab="investment share",
auto.key=keys)
[Figure: densities of i and g (private, gov.), one panel per continent; x: investment share]
7.5 Histograms
histogram(~c | continent,data=pwt5.6,plot.points=
FALSE,xlab="consumption share")
[Figure: histograms of consumption share, one panel per continent; y: Percent of Total]
library(latticeExtra)
ecdfplot(~c | continent,data=pwt5.6,xlab="consumption share")
[Figure: empirical CDF of consumption share, one panel per continent]
key<-list(x=0,y=1,corner=c(0,1),background="white",border=TRUE)
ecdfplot(~c ,groups= continent,data=pwt5.6,
auto.key=key,xlab="consumption share")
[Figure: empirical CDF of consumption share, grouped by continent]
[Figure: normal Q-Q plots (qnorm) of consumption share, one panel per continent]
[Figure: normal Q-Q plot (qnorm) of consumption share, grouped by continent]
library(Ecdat)
data(Wages)
qq(sex ~ lwage | factor(ed),data=subset(Wages,ed>=7),type="l")
[Figure: Q-Q plots of lwage, male against female, one panel per level of ed (7 to 17)]
7.9 Boxplots
Here we have to convert ed into a factor to make clear whether we want boxplots over ed or over lwage.
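The call that produces the boxplots is not shown. A possible version is sketched below (an assumption, not necessarily the author's call; the restriction to ed>=7 is taken over from the Q-Q plot above and the name bp is illustrative):

```r
library(lattice)
data(Wages, package = "Ecdat")

# factor(ed) on the horizontal axis tells bwplot to draw one box per level of ed
bp <- bwplot(lwage ~ factor(ed) | sex, data = subset(Wages, ed >= 7))
bp
```

Without factor() both variables are numeric and lattice has to guess over which of the two the boxes should be drawn.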
[Figure: boxplots of lwage by ed (7 to 17), panels female and male]
7.10 Barcharts
lattice can also do bar charts:
keys<-list(text=c("private","gov."),space="top",columns=2)
barchart(country ~ ci+cg| as.factor(year),data=xx.pw,xlab="investment share",horizontal=TRUE,
auto.key=keys)
[Figure: bar chart of ci and cg (private, gov.) by country, one panel per year; x: investment share]
We should note that often a dotplot or xyplot presents the same data in a better way.
keys<-list(text=c("private","gov."),space="right",lines=TRUE,size=2,between=.5)
dotplot(country ~ ci+cg| factor(year),data=xx.pw,xlab="investment share",
horizontal=TRUE,auto.key=keys,t="b")
[Figure: dot plot of ci and cg (private, gov.) by country, one panel per year; x: investment share]
7.11 Coplots
[Figure: coplot of breaks against index, given wool (A, B) and tension (L, M, H)]
7.12 Parameters
7.12.1 Types
Usually lattice renders data as points. The argument type=(...) modifies this behaviour.
Some useful values are the following:
• type='p': points
• type='g': a grid
data(Caschool,package="Ecdat")
xyplot(testscr ~ str,data=Caschool,type=c("p","g","r","smooth"))
[Figure: testscr against str with points, grid, regression line and smoother]
7.12.2 Axes
Different scales for different panels Usually, lattice chooses the same scale for all
panels in a plot. This can be changed with the help of the parameter scales.
Same scale (the default):
[Figure: investment share against year, one panel per country, same scale in all panels]
[Figure: investment share against year, one panel per country, separate y scale per panel]
Sliced scale (scales=list(x="same",y="sliced")): the panels have the same scale, but a
different origin:
[Figure: investment share against year (2002 to 2007), sliced y scale.]
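The scale relations can be compared side by side. The following is only a sketch: the data frame d is simulated and hypothetical, it merely mimics three countries whose series differ in level.

```r
library(lattice)

# Hypothetical panel data: three countries, same variable, different levels
set.seed(2)
d <- expand.grid(year = 2002:2007, country = c("A", "B", "C"))
level <- c(A = 15, B = 22, C = 30)
d$share <- unname(level[as.character(d$country)]) + rnorm(nrow(d))

# One plot for each relation of the y axis; the x axis stays the same
rel <- c("same", "free", "sliced")
plots <- lapply(rel, function(r)
  xyplot(share ~ year | country, data = d, type = "b", layout = c(3, 1),
         scales = list(x = list(relation = "same"),
                       y = list(relation = r)),
         main = paste("y relation:", r)))
names(plots) <- rel

print(plots$sliced)   # plots$same and plots$free can be printed in the same way
```

With "same" small movements within a country are hard to see, with "free" the size of movements can no longer be compared across panels. "sliced" is a compromise: one unit has the same length in every panel.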
Individual axes
We can influence where an axis is labelled as follows:
myscale<-list(x=list(at=2002:2007,labels=c(2002,"","","","",2007)),
y=list(log=TRUE,at=c(15,20,25,30,35)))
xyplot(ci ~ year| as.factor(country),layout=c(6,1),
scales=myscale,data=xx.pw,ylab="investment share",t="b")
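[Figure: investment share against year, one panel per country, logarithmic y axis, x axis labelled only at 2002 and 2007.]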
[Figure: testscr against avginc.]
xyplot provides a loess smoother, but how can we provide more detail, e.g. confidence
bands for the smoother?
Let us first calculate the necessary data:
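data(Caschool,package="Ecdat")
# fit a loess smoother of test scores on average income
cal.lo<-loess(testscr ~ avginc,data=Caschool)
# 50 equally spaced values of avginc over the observed range
newx <- with(Caschool,seq(min(avginc),max(avginc),length.out=50))
# predictions together with their standard errors
cal.pred <- predict(cal.lo,newdata=newx,se=TRUE)
# fitted values and pointwise 95% confidence band (normal approximation)
cal.df<-with(cal.pred,{data.frame(testscr=fit,
    avginc=newx,
    upper=fit+qnorm(.975)*se.fit,
    lower=fit+qnorm(.025)*se.fit)})
head(cal.df)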
xyplot(testscr ~ avginc,data=cal.df,type="l")
[Figure: fitted loess curve, testscr against avginc.]
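# panel.xyplot is the default panel function of xyplot, so this is the same plot:
xyplot(testscr ~ avginc,data=cal.df,type="l",
    panel=panel.xyplot)
# the same plot again, now with the panel function wrapped in our own function:
xyplot(testscr ~ avginc,data=cal.df,type="l",
    panel=function(...) panel.xyplot(...))
# our own panel function can draw more, e.g. a horizontal reference line:
xyplot(testscr ~ avginc,data=cal.df,type="l",
    panel=function(...) {
        panel.xyplot(...);
        panel.refline(h=660)
    })
# or the confidence band (avginc, upper and lower are found through with(cal.df,...)):
with(cal.df,xyplot(testscr ~ avginc,type="l",
    panel=function(...) {
        panel.xyplot(...);
        panel.xyplot(avginc,upper,type="l",lty=2)
        panel.xyplot(avginc,lower,type="l",lty=2)
    }))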
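The same steps can be tried without the Ecdat package. The following sketch uses the cars data that ship with R (an assumption made only to keep the example self-contained). It draws the band as a shaded polygon with panel.polygon instead of two dashed lines, and it sets ylim explicitly, since otherwise the limits are computed from the fitted values alone and the band may be cut off.

```r
library(lattice)

# cars (speed, dist) ships with R; it stands in for Caschool here
lo   <- loess(dist ~ speed, data = cars)
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 50))
pr   <- predict(lo, newdata = grid, se = TRUE)
band <- data.frame(speed = grid$speed, dist = pr$fit,
                   upper = pr$fit + qnorm(.975) * pr$se.fit,
                   lower = pr$fit + qnorm(.025) * pr$se.fit)

p <- xyplot(dist ~ speed, data = band, type = "l",
            # make room for the whole band, with a small margin
            ylim = extendrange(c(band$lower, band$upper)),
            panel = function(x, y, ...) {
              # shaded band first, so that the fitted line is drawn on top
              panel.polygon(c(x, rev(x)), c(band$upper, rev(band$lower)),
                            col = "grey85", border = NA)
              panel.xyplot(x, y, ...)
            })
print(p)
```

The panel function reads band$upper and band$lower from the enclosing environment. This works here because the plot has a single panel and the rows of band are in the order in which x is passed to the panel function.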
All this can also be done with the panel.smoother function from the latticeExtra package:
xyplot(testscr ~ avginc,data=Caschool,
panel=function(...) {
panel.smoother(...)
panel.xyplot(...)
})
Themes
keys<-list(text=c("consume","private invest.","gov."),lines=TRUE,
space="top",columns=3)
mTheme1<-custom.theme(symbol = brewer.pal(3, "Set1"),
bg = "grey90", fg = "black", pch = 16,lty=1:3,lwd=3)
mTheme2<-custom.theme(symbol = brewer.pal(3, "Pastel1"),
fg = "black", lty=1:3,lwd=3)
mTheme3<-custom.theme(symbol = brewer.pal(3, "Paired"),
fg = "black")
mTheme3$strip.background$col=brewer.pal(3, "Pastel2")
xx<-xyplot(cc+ ci + cg ~ year| as.factor(country),layout=c(6,1),
data=xx.pw,ylab="",t="b",auto.key=keys)
xx
[Figure: consume, private invest. and gov. against year, one panel per country, default theme.]
update(xx,par.settings=mTheme1)
[Figure: the same plot with mTheme1.]
update(xx,par.settings=mTheme2)
[Figure: the same plot with mTheme2.]
update(xx,par.settings=mTheme3)
[Figure: the same plot with mTheme3.]
update(xx,par.settings=standard.theme("pdf", color=FALSE))
[Figure: the same plot with the black and white pdf theme.]
Vector graphs
• eps (sometimes)
• pdf (sometimes)
• svg
• wmf
• ...
Raster graphs
• jpeg
• png
• gif
• tiff
• pdf (sometimes)
• ...
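As an illustration of the two families, the following sketch writes the same lattice plot once as a vector graph (pdf) and once as a raster graph (png). The file names and sizes are arbitrary choices, and the cars data are used only because they ship with R.

```r
library(lattice)

p <- xyplot(dist ~ speed, data = cars, type = c("p", "smooth"))

# Vector format: pdf() stores shapes, the plot can be scaled without loss
vec_file <- file.path(tempdir(), "cars.pdf")
pdf(vec_file, width = 5, height = 4)    # size in inches
print(p)    # inside functions, loops or source() a lattice plot must be printed
dev.off()

# Raster format: png() stores pixels, the resolution is fixed when writing
ras_file <- file.path(tempdir(), "cars.png")
if (capabilities("png")) {              # png() is not available in every R build
  png(ras_file, width = 1000, height = 800, res = 200)   # 5 x 4 inches at 200 dpi
  print(p)
  dev.off()
}
```

Whether a pdf is a vector or a raster graph depends on what is put into it: a pdf that only wraps a bitmap scales no better than the bitmap itself.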