Ggplot2 Exercise
The only way I know of to make such a plot with a visually appealing layout
is via ggplot2. From a statistical point of view the plot has the following
features:
Applied Statistics Bo Markussen
Statistical Methods for the Biosciences November, 2021
- The y-axis is transformed, more precisely by the cube root. Although
  the cube root isn’t a standard transformation (like the logarithm), it
  can still be implemented rather easily (but we will not try this in
  this exercise). Note also that the reader of the figure doesn’t need
  to know the precise transformation in order to perceive the
  information contained in the graphics.
- The same x- and y-axes are used in all 12 panels. This makes it possible
  to compare data across the panels.
- Within each panel 4 different plotting symbols and 4 different line types
  are used to distinguish observed data and model predictions from 4
  different regions. These symbols and line types are summarized in the
  legend placed to the right of the 12 panels.
- The model predictions (i.e. the lines) are only made in the range of
  the observed data, and are omitted if there are no observed data in the
  corresponding panel.
1. Discuss the features of the figure on page 1. Do you agree with the
description I gave above?
> library(ggplot2)
> ?diamonds
4. The data frame diamonds is very big, with almost 54,000 observations.
Sometimes it can be useful to make a random selection of the
observations in order to avoid having a lot of points plotted on top of
each other. Execute the following R code, and discuss what it does:
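A minimal sketch of such a random selection is given here. The seed, the sample size of 2000 and the name mydiamonds are assumptions chosen for illustration.

```r
library(ggplot2)

# Fix the seed so the same random rows are drawn on every run (assumed value).
set.seed(2021)

# Draw 2000 row numbers without replacement and keep only those rows.
rows <- sample(nrow(diamonds), size = 2000)
mydiamonds <- diamonds[rows, ]

# Check the size of the reduced data frame and glance at one variable.
dim(mydiamonds)
summary(mydiamonds$price)
```

Because the rows are chosen at random, a different seed gives a different subset, but the overall patterns in the plots should stay the same.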
5. Let’s make our first plots using ggplot2! Try the following lines of R
code one by one (!) and discuss the relation between the graphical output
and the R code. Can you see what is plotted?
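First plots of this kind could look as follows. This is a sketch only: the subsample mydiamonds and the choice of the variables carat, price and color are assumptions.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample

# Data and axes are declared, but no layer draws anything yet.
ggplot(mydiamonds, aes(x = carat, y = price))

# Adding a point layer gives a scatter plot of price against carat.
ggplot(mydiamonds, aes(x = carat, y = price)) + geom_point()

# Mapping a further variable to colour distinguishes the diamond colors.
ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) + geom_point()
```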
Note that after executing the first line the variable “myplot” appears
in the Environment window. This variable now contains the result of
the ggplot() call, and can be reused instead of retyping that call.
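Storing a plot in a variable can be sketched like this. The exact definition of myplot used here, with colour mapped to the variable color, is an assumption, as are the seed and the subsample mydiamonds.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample

# The assignment stores the description of the plot; nothing is drawn yet.
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()

# Typing the name of the variable (or calling print) draws the stored plot.
myplot
print(myplot)
```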
8. It’s easy to subdivide the plot into several panels. Let’s try this
according to the categorical variables cut and clarity. Execute the
following lines of R code one by one and discuss the output:
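Panels of this kind can be made with facet_wrap() and facet_grid(). The sketch below assumes a subsample mydiamonds and a stored plot myplot, both defined in the code for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# One panel per level of cut, wrapped into a rectangular layout.
myplot + facet_wrap(~ cut)

# One column of panels per level of clarity.
myplot + facet_grid(. ~ clarity)

# A full grid: rows given by cut, columns given by clarity.
myplot + facet_grid(cut ~ clarity)
```

By default all panels share the same x- and y-axes, which makes comparison across panels possible.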
9. To export the most recent figure, such that it can be inserted in a paper
or a report, use the ggsave() function. The following code saves the
plot to your working directory in png and pdf format, respectively:
> ggsave("diamonds.png")
> ggsave("diamonds.pdf")
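ggsave() can also be told explicitly which plot to save and how large the figure should be. In the sketch below the file name, the figure size, the temporary output folder and the definitions of mydiamonds and myplot are assumptions for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# Save a named plot (not just the most recent one) with a fixed size.
# A temporary folder is used here to avoid cluttering the working directory.
outfile <- file.path(tempdir(), "diamonds_16x10.pdf")
ggsave(outfile, plot = myplot, width = 16, height = 10, units = "cm")

# Confirm that the file was written.
file.exists(outfile)
```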
10. As already hinted at in several places above, the output of a call to
ggplot() is not a graphical output, but a grammatical description of that
output. What you see on the screen is a print() of that description. One
implication of this is that you can add more “layers” to the description
before it is printed. The symbol for adding components is “+”, which
in this context shouldn’t be confused with the mathematical operation
of adding numbers. To change the axes to be logarithmic you add this
information. As above, try the following and think about the output:
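Logarithmic axes can be added with scale_x_log10() and scale_y_log10(). The sketch assumes a subsample mydiamonds and a stored plot myplot, both defined in the code for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# Logarithmic x-axis only.
myplot + scale_x_log10()

# Logarithmic x- and y-axes.
myplot + scale_x_log10() + scale_y_log10()

# The stored description itself is unchanged by the additions above.
myplot
```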
11. You can also add smoothing lines and other statistical output to the
graph:
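A smoothing line can be added with geom_smooth(). The sketch assumes a subsample mydiamonds and a stored plot myplot in which colour is mapped to the variable color. Both are assumptions defined in the code.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# Add a smoother with a confidence band; the method is chosen automatically
# and reported in a message in the console.
myplot + geom_smooth()

# The same without the confidence band.
myplot + geom_smooth(se = FALSE)
```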
Note that a smoothing line is generated for each of the diamond colors.
The reason for this is that the separation into distinct colors is inherited
from the aes() code inside our variable “myplot”. To make a single
smoothing line for all diamonds we must turn off the inheritance
and restate the necessary aesthetics:
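Turning off the inheritance can be sketched as follows, using the options aes(x=carat,y=price) and inherit.aes = FALSE. The seed, the subsample mydiamonds and the definition of myplot are assumptions for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# inherit.aes = FALSE drops the aesthetics of myplot (including colour) for
# this layer, so x and y must be stated again. The data are still inherited.
myplot + geom_smooth(aes(x = carat, y = price), inherit.aes = FALSE)
```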
Note that the smoothing method has changed from loess (= local
polynomial regression fitting) to gam (= generalized additive model).
Can you find out why this is the case? Hint: see the help page by
executing ?geom_smooth in the R console.
Can you change back to loess smoothing? Hint: use the option
method="loess".
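The hint about method="loess" could be applied as in this sketch, where the seed, the subsample mydiamonds and the definition of myplot are assumptions for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample
myplot <- ggplot(mydiamonds, aes(x = carat, y = price, colour = color)) +
  geom_point()                                        # assumed definition

# Request loess explicitly instead of the automatically chosen method.
myplot + geom_smooth(aes(x = carat, y = price), inherit.aes = FALSE,
                     method = "loess")
```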
13. We have written the two options “aes(x=carat,y=price)” and “inherit.aes = FALSE”
many times above. Discuss whether it would have been more clever to
define “myplot” as
How would this change the solution code for the above questions?
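One possible reading of this question is sketched below: only x and y are stated in the ggplot() call, and colour is moved into the point layer. This particular definition, together with the seed and the subsample mydiamonds, is an assumption for illustration.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample

# Only the aesthetics shared by all layers are stated in the ggplot() call.
myplot <- ggplot(mydiamonds, aes(x = carat, y = price))

# Colour is mapped in the point layer only, so the smoother inherits just
# x and y and draws a single line without inherit.aes = FALSE.
myplot + geom_point(aes(colour = color)) + geom_smooth()
```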
Please note that the lines in the figure on page 1 were not generated
automatically by ggplot2. Instead I used predictions from a linear
mixed effects model fitted via the R package lme4. In order to do that I
made another data frame with the model predictions, and used this new data
frame together with the geom_line() function. I hope to be able to give an
example of this technique later in the course.
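The idea of plotting predictions from a separate data frame can be sketched as follows. To keep the example free of extra packages, an ordinary linear regression fitted with lm() on the log-log scale stands in for the mixed effects model. This model, the seed, and the names mydiamonds, fit and pred are assumptions, and this is not the code behind the figure on page 1.

```r
library(ggplot2)
set.seed(2021)                                        # assumed seed
mydiamonds <- diamonds[sample(nrow(diamonds), 2000), ]  # assumed subsample

# Stand-in model (assumption): linear regression on the log-log scale.
fit <- lm(log(price) ~ log(carat), data = mydiamonds)

# New data frame with predictions, restricted to the observed range of carat.
pred <- data.frame(carat = seq(min(mydiamonds$carat), max(mydiamonds$carat),
                               length.out = 100))
pred$price <- exp(predict(fit, newdata = pred))  # back-transform to price

# The points use mydiamonds; the line layer gets its own data frame and
# inherits only the x and y aesthetics from the ggplot() call.
ggplot(mydiamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = color)) +
  geom_line(data = pred)
```

Because the prediction grid runs from the smallest to the largest observed carat, the line is only drawn in the range of the observed data.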
If you want to read more about ggplot2, then you might start at the
homepage:
http://www.r-bloggers.com/basic-introduction-to-ggplot2/
http://r4ds.had.co.nz/
End of exercise