Basics of Data Analysis and Graphics in R
Part two:
retrieving summary statistics
and drawing basic graphs
Alexei Kouprianov∗
Preface
R is a most powerful instrument of data visualization. Besides a wide variety
of built-in standard graphs (histograms, bar-, box-, and scatterplots, etc.), it
can draw more sophisticated things like social networks or maps with the aid
of installable libraries, and it even allows you to visualise nearly everything you
can imagine with the aid of graphical primitives (i. e. points, line segments,
arrows, and polygons). In this part of the manual, we shall start with simple
standard graphs representing variation of a single variable, then turn to more
complex bivariate graphs of various kinds, and, finally, to a few special cases of
still more complex graphs which are, none the less, used far from infrequently.
As we proceed along this sequence, I shall explain to you the ways to control
the appearance of the graphs and show you how to save them to files. In a
special section, I shall treat the problem of producing printing-quality graphs
for submission to a scholarly journal. Maps, social networks, and other possible
non-trivial graphs will be discussed elsewhere.
not. In fact, it is a fountain of hypotheses you formulate, reject, corroborate, refine, and put
to new tests, sometimes at an enormous speed. I still remember how I calculated regression
or analysis of variance by hand, armed with a pencil and a slide rule, using special tables which
helped to organise the workflow, adding and squaring numbers on a sheet of paper, finding
square roots and determining p-values by the book, drawing ‘frequency polygons’ and scatter
plots on graph paper. Now, one can perform analysis of variance and plot basic graphs with a
couple of function calls. Back then, the process of hypothesis testing was painfully palpable.
Now, the pain has nearly gone, but hypothesis testing is still in place.
rocky path of calculator mode first. E. g., we may wish to calculate the mean
height of the students from the training dataset. Technically, it does not seem
that complicated. All we need to know is the sum of all individual heights and
the number of individuals measured. We already know that the heights are
stored in a vector within our data frame students.df. We may re-create it (if
it is not at hand) and try retrieving the sum of all individual heights.
# This is the code to re-create the students.df if it is lost;
# The file "kouprianov.students.v.2.1.txt" must be present
# in the working directory;
students.df <- read.table("kouprianov.students.v.2.1.txt", header=TRUE,
sep="\t", stringsAsFactors=TRUE)
# The following two lines would help you checking if everything is OK;
dim(students.df)
head(students.df)
We already know what R denotes with NA and its allies. How come that a
sum of a vector, which is full of numbers, can be a missing value? The answer
is simple. If a vector contains missing values, its true sum, strictly speaking,
is unknown, being itself a missing value. If we preview the vector by calling
students.df$HEIGHT, we can spot two NA values in the 14th and 20th positions.
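The behaviour described here is easy to reproduce on a toy vector (the values below are invented purely for illustration):

```r
# A toy numeric vector with one missing value
x <- c(170, NA, 165)
sum(x)               # returns NA: the true sum is unknown
sum(x, na.rm = TRUE) # returns 335: the NA value is removed first
```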
The sum() function belongs to a large family of functions dealing with numeric
objects that require explicit instructions on how to treat the NA values.
The argument used for exclusion of NA values from calculations is na.rm, and
its default value is FALSE, so we need to explicitly set it to TRUE:
> sum(students.df$HEIGHT, na.rm=TRUE)
[1] 37738.4
>
Now, we need only the number of persons measured, and the much-needed
mean value is ours. The length() of the vector is easy to obtain. The problem,
however, is that we should somehow get rid of the persons contributing NA
values. From our preview, we already know that there are only two of them (so,
we can use length(students.df$HEIGHT) - 2 as a denominator). But it is
just as clear that visual inspection of a vector cannot be recommended as a
general recipe when dealing with vectors of considerable length containing an
unknown number of NA values. An easier solution, then, is to resort to specialised
functions that return summary statistics at once. Some of them are as smart
as to deal with NA values properly on their own.
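For completeness, here is a sketch of the ‘manual’ route: counting only the non-missing values with sum(!is.na()). The example uses R’s built-in airquality dataset (its Ozone column contains NA values), since the students file may not be at hand:

```r
oz <- airquality$Ozone        # built-in dataset; Ozone contains NA values
n.obs <- sum(!is.na(oz))      # each TRUE counts as 1: number of actual measurements
sum(oz, na.rm = TRUE) / n.obs # the mean computed by hand...
mean(oz, na.rm = TRUE)        # ...coincides with the specialised function
```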
The most powerful of the whole family is, perhaps, a super smart function
summary().2 Let us see what it can do to the students.df$HEIGHT:
2 This function is a true miracle. It is capable of summarising objects of any sort, each time
communicating something important about them. We shall see it time and again summarising
the most incredible things, from the simplest numeric vectors or factors to the results of
regression analysis.
> summary(students.df$HEIGHT)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
148.0 165.0 170.0 170.8 175.0 192.0 2
>
This table contains just one row (there can be also multi-row tables as we
shall see later) and it is invariably treated in such a case as a uni-dimensional
object. Its elements can be called both by their numbers and names.
> summary(students.df$HEIGHT)[2]
1st Qu.
165
> summary(students.df$HEIGHT)["1st Qu."]
1st Qu.
165
>
The summary for a numeric vector includes its minimal and maximal values,
mean, median, first and third quartiles, and the number of NA values (when they
are present). Some of these statistics are rather self-evident (minimum and
maximum), some deserve an explanation or, at least, a reminder. The mean,
as you remember, is calculated by dividing the sum of values by their number:

x̄ = (x₁ + x₂ + … + xₙ) / n
The median is defined as the middle element of a series of values ordered
from minimum to maximum (if there is an even number of elements in the series,
then the median equals the mean of the two elements closest to the middle of
the series). Mean and median are two important centrality measures (the third
is mode, the most frequent value or the biggest class of values). We shall come
back to them soon. The quartiles are similar to the median but they cut a
series not in halves but in quarters (by the way, the median is, understandably,
nothing but the second quartile).
Some of these statistics can be retrieved individually, either by calling elements
of the summary() output (as shown above) or by means of specialised
functions: mean(), median(), min(), max(), and range() (the latter function
returns a vector of two values, the minimum and the maximum for a given
series). It should be noted that the five above-mentioned specialised functions
require you to explicitly exclude the NA values, just like the sum() function (see
the code in the Appendix to this manual).
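A minimal sketch of these specialised functions, again on a toy vector with a missing value (the values are invented):

```r
x <- c(148, 165, NA, 170, 192)
mean(x, na.rm = TRUE)    # 168.75
median(x, na.rm = TRUE)  # 167.5
min(x, na.rm = TRUE)     # 148
max(x, na.rm = TRUE)     # 192
range(x, na.rm = TRUE)   # 148 192
```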
3 One may see that it also belongs to the ‘summaryDefault’ class, but this is not that
important now.
As I said before, when dealing with quantitative variables, the analysts are
usually interested both in centrality measures and in the estimation of variability
within a given sample or population. The summary() function tries to return
some metrics from both (centrality can be assessed with mean and median,
while the degree of variation is represented with minimum, maximum, and the
quartiles).
But the output of the summary() function mixes not only the measures of
centrality and variation but also the two major families of such measures.
The first (sometimes called ‘traditional’) is based on the mean
and deviations from it; the second is based on the median and quartiles. The
difference is fundamental. Mean-based metrics take into account both the size of
the sample or population and the values of the parameter measured (and heavily
depend on them). Most median-based metrics take into account only the order
of values in a series ordered from the smallest to the largest, which makes them
to a considerable extent value-independent (and insensitive to outliers — rare
exceptionally big or exceptionally small values) or robust.
number originating from the tiny imperfections of the calculation algorithm. For our students’
heights this number would be −9.66 × 10⁻¹³ or −0.000000000000966. . .
For a sample variance, a slightly different formula is used:5

s² = Σ(xᵢ − x̄)² / (n − 1)
The sample variance can be calculated with the var() function.
> var(students.df$HEIGHT, na.rm=TRUE)
[1] 61.60064
>
The variance is a highly useful statistic, but not without a blemish. If one
considers carefully the issue of measurement units, one may notice that the
mean squared deviation is. . . well. . . squared. While the original variable was
measured, e. g., in centimeters, its variance should be measured in square
centimeters, which makes little sense if we would like to discuss the limits of
variation.6 To overcome this difficulty, another measure of variation is introduced,
the standard deviation, which is the square root of the variance.
Unlike the variance, the standard deviation can be used meaningfully
together with the mean to represent natural variation, and we shall learn more
on the use of it later. For now, it is enough to say that it can be calculated
with the sd() function (which also requires na.rm=TRUE). We may calculate the
standard deviation for our students.df$HEIGHT vector and check if it really
equals the square root of the variance.
> sd(students.df$HEIGHT, na.rm=TRUE)
[1] 7.848607
> (sd(students.df$HEIGHT, na.rm=TRUE))^2
[1] 61.60064
>
(s²), the expectation of a parameter (µ) and the sample mean (x̄), the population size (N)
and the sample size (n). The difference, however, is not so much in the choice of particular
letters as in the subtleties of statistical reasoning, the most practically important of which
is the decrease of the denominator in order to increase the summary variation estimate. (n − 1)
is the number of degrees of freedom, a rather tricky statistical concept, which will be treated
in more detail elsewhere. For the sample variance it is equal to the maximum number of
elements of a numeric vector one may change arbitrarily and still get back to the same mean
value by involuntary compensation on the account of the remaining elements. E. g., in a series
of 1, 2, 3, 4, 5, the mean is 3. We may change four elements as we please, e. g., replace
the first four with 10, 100, −218, and 3.28, but to keep the mean at three, the last element of
the vector has to be changed (in our case) to 119.72.
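The arithmetic of this example can be checked directly in R (a small sketch, not part of the original manual code):

```r
x <- c(1, 2, 3, 4, 5)
mean(x)                                    # 3
# Change four elements arbitrarily; the fifth is then forced:
y <- c(10, 100, -218, 3.28, NA)
y[5] <- length(x) * mean(x) - sum(y[1:4])  # 119.72
mean(y)                                    # back to 3
```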
6 Square centimeters look a lot less bizarre than square kilograms or square friends (Sponge
Bob does not really fit here either) but as a measure of length they fare no better.
In practice, to delimit the quartiles is a more complex task than to find
the mean. There are several conventional ways to do it, which sometimes bring
slightly different results. Accordingly, there are several functions in R for
quartile analysis. Besides summary() (and one or two more functions we
shall learn later), there are fivenum() (which stands for ‘five numbers’) and a
highly customisable quantile().7
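A quick sketch of both functions, using the built-in airquality$Ozone vector (which contains NA values; fivenum() drops them by default, while quantile() needs na.rm = TRUE):

```r
fivenum(airquality$Ozone)                  # min, lower hinge, median, upper hinge, max
quantile(airquality$Ozone, na.rm = TRUE)   # quartiles: 0%, 25%, 50%, 75%, 100%
quantile(airquality$Ozone,
         probs = c(0.1, 0.9), na.rm = TRUE)  # any other quantiles on demand
```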
Here, by the way, one may see a palpable difference between factors and character
vectors. Try summary(as.character(students.df$SEX)) to see a strikingly
different result.
Table 1: Anscombe’s quartet: the data (Anscombe 1973).
x1 y1 x2 y2 x3 y3 x4 y4
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
Table 2: Anscombe’s quartet: some descriptive statistics are nearly or exactly the
same for all four datasets (Anscombe 1973, with modifications).
Parameter Value
Number of observations (n) 11
Mean of the x’s (x̄) 9.0
Mean of the y’s (ȳ) 7.5
Standard deviation of x’s 3.32
Standard deviation of y’s 2.03
Equation of regression line y = 3 + 0.5 × x
Multiple R2 0.667
Now, after we know how to retrieve centrality and variation estimates, the
time has come to explain why we should not begin our analysis with calculating
them. The only reason we studied them first is that they will be vitally needed in
the following section which will expose their futility at the first stage of analysis.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician 27
(1): 17–21.
Figure 1: Anscombe’s quartet: bivariate plots for each dataset. Dashed line: regres-
sion of y by x (see tab. 2).
The seventh line of the script applies the function mean() (hence the third
parameter of apply() is mean) to the columns of the object ans (hence the
first two parameters are ans and 2; ans and 1 would call rows, not columns,
remember the Roman Catholic rule). The result is rounded to two decimal
places (hence the second argument of round() is 2). The eighth line does the
same for the standard deviation. You can also reproduce the calculations without
rounding to see the tiny differences between the datasets.
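Since the script itself appears on the previous page, here is a sketch of the calls just described, using R’s built-in anscombe data frame (in the manual the object is called ans):

```r
ans <- anscombe                 # built-in copy of Anscombe's quartet
round(apply(ans, 2, mean), 2)   # column means: all x's 9, all y's 7.5
round(apply(ans, 2, sd), 2)     # column sds: all x's 3.32, all y's 2.03
apply(ans, 2, mean)             # unrounded: the tiny differences reappear
```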
Despite all these remarkable numerical similarities, the bivariate scatter plots
for Anscombe’s four datasets are strikingly different (Fig. 1). The four patterns
apparently resulted from different processes. The first case represents a rather
strong linear connection between the two variables with some noise contributing
to the deviations of observations from the fitted regression line. The second
represents a very strong non-linear connection between the two variables (and
the one on the y-axis is clearly a dependent variable, as the same values of y
correspond to different values of x but not vice versa, as in the first case); the
linear regression parameters can be calculated, but they are completely out of
place here. The third case presents a very strong linear connection between the
variables, with a single outlier driving the regression line away from the possible
stronger dependence. The fourth graph depicts two apparently independent
variables with an outlier providing a (possibly false) basis for regression.11 It
should be noted that in the two latter cases robust methods would return strikingly
different results, but we are not yet ready for that.
The sharp contrast between the table of summary statistics (tab. 2) and the
pictures (fig. 1) is a menacing warning to anyone who dares to start ‘counting’
without trying to see the patterns first. In social sciences, with their inherent
indeterminacy, the patterns that are worth being mentioned must be elephantine,
and many of them must be visible on bi- or multivariate plots. There
are, of course, formal numerical methods which may help us decide whether
an observed pattern is more likely a result of some underlying causal relationship
or just a strange coincidence. Sometimes formal analysis reveals patterns
that are invisible at first glance. Sometimes these formal methods should be
applied to make some patterns visible. The visual inspection of the data
available is, however, an unavoidable step of analysis, which should either precede
or immediately follow the extraction of summary statistics. Thus, we come to
graphics.
11 In a real-life case, an analyst would suspect that cases III and IV deal with representatives
of two different populations. A true story comes to my mind that happened to a friend of
mine years ago. He took part in a biological indication project, which aimed at the detection
of developmental anomalies in animals caused by industrial pollution. He took a huge sample
of ants (the rather common Lasius niger) somewhere near the nuclear power plant in Sosnovyi
Bor (near St. Petersburg, Russia), and another one away from it, and measured the ants in several
ways. Then he fed the data into a computer program (at that time R was still out of sight
and I am not sure which software he used). It turned out that several ants captured
near the power plant exhibited unusual body sizes and proportions. Happy with this result, he
took the promising monsters to a local expert in ants. To my friend’s sheer disappointment,
the monstrous ants happened to belong to a different ant genus and species (Formica fusca),
which only superficially resembled Lasius niger (both were ‘small and dark-brown’).
Grammar of graphics: six most basic graphs
In data analysis, several classifications of variables are possible. In this section
we will use a rather crude one, which splits variables into qualitative and quan-
titative, and the latter into discrete and potentially continuous.12 The variation
patterns for different kinds of variables and various kinds of interactions between
variables are traditionally depicted in different ways. Even though infographics
always involves a great amount of creativity, and the same data can be pre-
sented in a number of ways, academic readers usually expect a certain degree of
uniformity in graphs, which eases comprehension of the author’s message.
The six most basic graphs based on our ‘students’ dataset are presented in
fig. 2. I shall deal with each of these graphs in a special section in more detail.
Here I only briefly characterise them all.
The two upper graphs depict variation patterns for a quantitative (his-
togram) and qualitative (bar plot) variable. Note, please, an important graph-
ical convention here. The bars of a histogram (the so-called bins) are plotted
immediately next to each other, for they usually represent nothing but a conve-
nient way of splitting a potentially continuous variable into arbitrary intervals
(strangely enough, this applies to countable discrete quantitative variables too).
The bars of a bar plot are separated with a space because they represent alter-
native states of a qualitative variable, which are not arbitrary and have no
meaningful intermediate states (if a variable is categorised properly, of course,
but this is a different issue). In both kinds of graphs, the heights of bars rep-
resent absolute or relative frequency of variants falling within a bin or within a
class of a qualitative variable.
The four bivariate graphs represent four possible combinations of x-axis and
y-axis variables of different kinds. Usually, at the x-axis, an ‘independent’ vari-
able is placed, while the y-axis is reserved for a ‘dependent’ variable. We should
be extremely careful, though, about ascribing any causal relationship to the
pairs of variables on a plot. E. g., body mass and height can be meaningfully
plotted against each other, and a certain pattern of interdependence is clearly
visible (and not at all counter-intuitive) but the explanation of a causal rela-
tionship between these two variables is far from trivial. I would rather say, that
a relationship between two variables involved in a theoretically sound simple
causal model can be depicted by any of these four graphs, but the mere fact
that the two variables are plotted against each other and show a meaningful
pattern does not imply a trivial causal relationship.
The scatter plot (fig. 2, middle left) depicts analytic individuals (they can
be of any nature, e. g., persons, countries, electoral precincts, journal articles,
etc.) in a two-dimensional space defined by two quantitative variables. Each
12 Another very popular classification is based on the kind of scale used (nominal, ordinal,
interval, and ratio). In nominal scale different states of a variable are specified with names
only (e. g. gender, which can be male, female, or whatever; country of origin; the University
one graduated from). In ordinal scale, the named states can be arranged in a meaningfully
ordered sequence (e. g. levels of education, ranks of military or civil service). With interval
and ratio scales more and more numerical operations become possible. The interval scale
(e. g. degrees Centigrade or Fahrenheit for temperature) allows meaningful addition and
subtraction; the ratio scale (e. g. counts, masses, distances, age, income, etc., anything quantitative
having a ‘true’ zero) also allows meaningful multiplication and division. ‘Qualitative’ variables
are coded with nominal or ordinal scales, ‘quantitative’ are usually coded with ordinal, interval,
and ratio scales. Strangely enough, however important these distinctions are for picking the
right tools for statistical analysis, they are not that useful from the drawing perspective.
Figure 2: Six most basic graphs (based on ‘students’ dataset). Upper row: univariate
graphs, middle and bottom rows: bivariate graphs. Left: quantitative variable at
x-axis, right: qualitative variable at x-axis. Middle row: quantitative variable at y-
axis, bottom row: qualitative variable at y-axis. This is not a dogma and should be
perceived critically and creatively. There are more ways to express variation patterns
for these variables.
individual is represented with a dot with coordinates corresponding to the value
of parameters forming x- and y-dimensions. E. g., an easy-to-spot lonely dot at
less than 150 cm and slightly over 50 kg represents a miniature person, 148 cm
by 50.5 kg.
The multiple box plot (or, multiple box-and-whisker plot, fig. 2, middle right)
is used to depict a relationship between a qualitative variable at x-axis and a
quantitative at y-axis. The box plot summarises robust measures of centrality
and deviation (see more on that in appropriate sections). The thick bar in the
middle of the box represents the median. The lower and upper limits of the box
represent the 1st and 3rd quartiles. The ‘whiskers’ represent either the minimum and
maximum or the 1.5× interquartile range distance from the median (in the latter case,
‘outliers’ falling outside of this range are depicted as dots, just as in our graph).
A series of histograms plotted one under another can be used to depict the same
relationship as well. Histograms, however, lose their information content as they
become more and more flat, because the general shape of frequency distribution
becomes less expressed. Even though box plots present only summaries of the
frequency distributions, they look just as good when they are narrow. One may
plot dozens of box plots in a line without making them less informative and
comparable.
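A sketch of such a multiple box plot. Since the ‘students’ file may not be at hand, the vectors below are simulated stand-ins with invented distribution parameters; the real call with our data frame is shown in the last comment (column names HEIGHT and SEX follow the manual):

```r
# Simulated stand-in data (invented parameters):
sex    <- factor(rep(c("f", "m"), each = 50))
height <- c(rnorm(50, mean = 166, sd = 6), rnorm(50, mean = 179, sd = 7))
boxplot(height ~ sex, xlab = "Sex", ylab = "Height, cm")
# With the training dataset: boxplot(HEIGHT ~ SEX, data = students.df)
```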
The scatter plot with jitter (fig. 2, bottom left) presents basically the same
information as the multiple box plot but the relationship between quantitative
and qualitative variables is reversed. In fact, a raw scatter plot with height
at x-axis and gender at y-axis would look different because, by default, the
numeric representation of feminine and masculine genders would be just 1 and
2, so the dots representing individuals would overlap each other and lie in two
thin lines. To see the individual points, some noise is added, so each line smears
to form a cloud. Each cloud, however, symbolises only one distinct state of the
qualitative variable, so this graph is conceptually different from the true scatter
plot above (because the dispersal of a cloud along the y-axis is purely decorative).
The variables used in this particular instance of a plot do not really fit its
primary purpose, which is to depict dependence of a qualitative variable on a
quantitative one (no one seriously expects that gender is determined by height).
There are, however, more relevant examples: decision to go to work or to take
a sick leave at y-axis and body temperature at x-axis, decision to go and vote
or to stay home at y-axis and the income at x-axis, etc.
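A sketch of such a jittered plot, using the same kind of simulated stand-in vectors (invented values; with the real data one would jitter as.numeric(students.df$SEX) against students.df$HEIGHT):

```r
sex    <- factor(rep(c("f", "m"), each = 50))
height <- c(rnorm(50, mean = 166, sd = 6), rnorm(50, mean = 179, sd = 7))
# jitter() adds a small amount of noise to the 1/2 codes of the factor:
plot(height, jitter(as.numeric(sex)), yaxt = "n", ylab = "")
axis(2, at = 1:2, labels = levels(sex))
```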
The relationship between two qualitative variables can be depicted in a num-
ber of ways, of which I picked the default R mosaic plot, which is essentially a
modification of a stacked bar plot. This plot is difficult to interpret at first (to
the extent that when I unintentionally evoked it for the first time, I thought that
I had broken the graphical console). None the less, it is rather intuitive: the width of
bars is proportional to the frequencies of the states of the x-axis variable, while
the heights of coloured segments of these bars are proportional to the shares
of the states of the y-axis variable in each of the x-variable classes. E. g., we can
see that there are probably disproportionately many male students describing
themselves as heavy smokers (compare the relative height of the upper segment
in both bars). The left y-axis lists the states of the y-axis variable, in order of
appearance at the diagram, the right y-axis shows frequency scale from zero to
one.
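A sketch of a mosaic plot on an invented two-way table (the column names and counts below are hypothetical, only imitating the sex-by-smoking plot of fig. 2):

```r
# Hypothetical contingency table: sex (x-axis) vs. smoking habits (y-axis)
tab <- table(
  SEX   = rep(c("f", "m"), times = c(60, 40)),
  SMOKE = c(rep(c("no", "heavy"), times = c(52, 8)),
            rep(c("no", "heavy"), times = c(28, 12)))
)
mosaicplot(tab, color = TRUE, main = "Smoking by sex")
```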
The following sections will deal with these (and some other) graphs in more
detail as well as with implementation of graphics in R code.
Visualising a single variable
Histogram
While the generic smart drawing function is plot(), we shall none the less begin
with hist().13 The latter is more specialised, has fewer parameters than the more
generic plot(), and allows us to illustrate some ways to control the graph layout
without going into other distracting details.
hist() produces histograms. A histogram is a specific kind of a graph,
which displays the density distribution of a numerical variable. This may sound
a bit obscure, so I shall explain it at some length.
Let us suppose that we wish to measure the body mass of people with a
considerable degree of precision. As one can imagine, there is no point in being too
precise, because a glass of water should immediately increase your body mass
by approximately 200 g (≈ 0.5 lb); however, I do not wish this example to be
extremely realistic in this particular respect. What is important is that we can
measure body mass to the gram.
After weighing a random sample of fifty people of approximately the same age
and sex we would come up with a list of measurements (which can, from the R
perspective, be described as a numerical vector). Given this small number of
people and this unnecessary degree of precision, this vector most likely would
be a list of unique five-digit numerical values, like this:
[1] 48244 67159 48444 55839 57019 66308 58104 48608 62987 57294 57555 47966
[13] 64067 60454 56558 52717 48542 62485 46601 62793 64456 59868 66475 64809
[25] 55584 49777 51727 60335 59389 57154 56990 66374 55136 54873 64931 54924
[37] 58724 50229 63285 75209 53338 55694 61122 57141 61876 47330 38335 56252
[49] 57825 44942
If we try to depict it “as is”, representing each value with a vertical line
drawn at a certain point of the x-axis, the variation of mass would be visualised
as in fig. 3.
As one may see, the lines are distributed unevenly. The maximal density of
the lines can be observed around the mean value (57080, in our case). As we
come closer to the marginal values (38340 and 75210), spaces between the lines
become bigger. A purely visual assessment of the density of the lines in this
graph is, however, problematic. We can ease it by dividing the variation range
into intervals (data analysts call them ‘bins’), counting how many values fall
into every bin, and producing a more advanced visualisation as seen in fig. 4. It
is exactly this more advanced graph, which bears the name of the histogram.
13 We have already dealt with other ‘smart’ functions, summary() and str(). As you might
remember, when I say that the function is ‘smart’, this means that it can identify the kind of
an object on its own and act accordingly without asking us for help.
[Figure 3: each measured body mass represented as a vertical line at the corresponding
point of the x-axis; y-axis: Frequency, 0–6.]
The numbers of cases falling within the limits of bins are represented with
bars’ heights (hence the y-axis is labelled ‘Frequency’). The same graph can
be tuned to represent the density, which is nothing but a normalised frequency.
To obtain density values we need to calculate the share taken by a bin area
from the total area of the histogram. The shape of the histogram would remain
unchanged, it is only the y-axis that would look different (see fig. 4, middle).
Bin width is usually assigned arbitrarily.14 The only thing to be taken into
consideration is the need to produce a density distribution of a recognisable
shape for a quick assessment of the number of ‘peaks’, or modes, of symmetry, and
of the balance between its hump and its tails. E. g., in our imagined case, we may
change the bin width from 1 to 5 kg (fig. 4, bottom) to see a smoother picture.
This graphical test is not the best way to identify the theoretical probability
density function, which can approximate our data (e. g., to test whether some
density distribution can be regarded as strictly normal or Gaussian, one would
need to perform a set of more rigorous tests), but it is good enough for a quick
assessment at a glance.
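The imagined body-mass example can be imitated with simulated data (a sketch; the mean and standard deviation below are assumptions roughly matching the sample printed above):

```r
set.seed(1)                                # make the simulation reproducible
mass <- rnorm(50, mean = 57000, sd = 6000) # fifty 'weighings' in grams
hist(mass / 1000, freq = FALSE,            # freq = FALSE: density, not counts
     xlab = "Body mass, kg")
hist(mass / 1000, breaks = seq(30, 80, by = 5),  # 5-kg bins
     xlab = "Body mass, kg")
```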
This being said, we may shift to the implementation of a histogram in R.
As I said, the histogram drawing function is hist(). It has got a number of
arguments which allow you to tune the image but, in its most simplistic form, it
requires just one: the variable under consideration.
hist(students.df$HEIGHT)
This code would produce a rather ugly graph (fig. 5, top-left). The histogram
is barely visible, the title and x-axis label are incomprehensible to anyone unfa-
miliar with R notation and the dataset you use, and the bins may not suit our
expectations. All these features can be adjusted using arguments of hist().
hist() shares most arguments with other plotting functions, so by studying it
in detail we will learn a lot about plotting in general.
The three most frequently used text elements (the plot’s title and the x- and y-axis
labels) are controlled with main, xlab, and ylab, respectively. The y-axis label does
not need any adjustment here, so we shall change only two of them.
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm")
This graph is already half as ugly as the previous one (fig. 5, top-right), but it can
be better. First, we should make it stand out of the page by adding colour or
a dashed pattern. Colours, sometimes really bright ones, are better for screen
presentations, while line patterns are better for black-and-white academic print.
Colours can be added with the col argument, while dashes require two arguments,
density (in lines per inch) and angle (in degrees, counting counter-clockwise
from three o’clock on). Note that as the code gets longer we may split it into
lines for our convenience. As you might remember from the first part of the
manual, if you enter these commands from the R text console, R responds to
an unfinished function call with a + prompt, not with a > prompt.
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm",
col="grey")
14 Of course, hist() has a built-in algorithm which calculates a more or less appropriate
bin width automatically, but we should learn how to cut the bins at the points we need, not
just at the points R allows us to cut them. As we shall see later, the bins do not even have to
be equal to each other within the same histogram.
[Figure 5: six variants of the height histogram (rows as described in the text). Panel
titles: ‘Histogram of students.df$HEIGHT’ (default) and ‘Undergraduate students’;
x-axes: Height, cm (150–190); y-axes: Frequency.]
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm",
density=15, angle=45)
The results can be seen in fig. 5 (middle row). The colour need not be grey all
the time.15 R has got a number of ways to specify colour of the elements of the
plot (they can be rather flexibly assigned different colours). There are reserved
text names like red, blue, green, orange, darkred, etc. There are numeric
codes for the nine basic colours (white,16 black, red, green, blue, cyan, magenta,
yellow, grey).17 You may also specify colours with hexadecimal RGB codes,
from "#000000" for black through "#ffffff" for white (with all 16 777 214
shades in between).18 Finally, there is a special function rgb(), which encodes
RGB colours and transparency (the so-called alpha-channel) but we shall learn
about it later, in the scatter plot section.
If you inspect the middle-left image in fig. 5, you will see that what changed
was the fill colour. The outline remained black. The deeper reasons for
that are yet to be discovered, but for now it is enough to say that the outline
colour is controlled with a different argument, border. Sometimes it is really
important to control this parameter too, so do not forget about this option.
Finally, we need to take control over the bin size. As you already know,
the bin size greatly affects the shape of a histogram. There are several ways to
control it. All of them deal with the breaks argument. The number of breaks
can be communicated to hist() as an integer or as a vector. If you supply an
integer value, hist() calculates an appropriate number of breaks which is as
close to your integer as R finds appropriate (fig. 5, bottom-left). If you supply a
vector, hist() cuts the bins exactly at the points you specify (including unequal
bins).19 In this latter case, however, you have to take care of the upper and
lower limits, because the bins should embrace the full variation range. To
generate a vector of breaks, the c() and seq() functions are used.
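The two ways of specifying breaks can be sketched as follows (the cut points in
the second call are arbitrary, chosen only to illustrate unequal bins):

```r
# An integer is only a suggestion: hist() picks 'pretty' cut points near it
hist(students.df$HEIGHT, breaks=5)
# A vector sets the cut points exactly; the bins may be unequal,
# but the vector must cover the full range of the variable
hist(students.df$HEIGHT, breaks=c(140, 155, 160, 165, 170, 175, 200))
```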
Our ‘students’ dataset is not big enough to demonstrate the true power of
breaks. The dataset from the Russian Federation presidential elections suits
better. The analysts were anxious to visualise some strange patterns they dis-
covered in the frequency distribution of the voters’ turnout. One of the best
studied phaenomena was an anomalously high number of polling stations displaying
integer percentages of voters’ turnout. The frequencies of polling stations where
the voters’ turnout was equal to, e. g., 90%, 91%, etc., were remarkably higher
than those of, say, 89.9% or 91.1%, etc. To visualise this effect, one would need
a very small bin of 0.1%. Besides being small, the bins should be cut in such
15 It should be noted that both gray and grey are acceptable.
16 In fact, transparent. This can only be seen when the background is not white but this is
important to know in advance.
17 The good-for-nothing pie() function, which produces the pie chart most hated by
analysts, may serve well here. Try pie(rep(1,9), col=0:8, labels=0:8) from your console.
After 8, the colours 1 through 8 are recycled, so 9 is again black, like 1, 10 is red, etc.
18 The total number of colours in the hexadecimal RGB palette is 256^3. The hexadecimal
system needs letters because after 9 the conventional decimal system runs out of digits, so
10 in hex is a, etc., until 15 = f, while 16 is a new 10. So, the colour channel intensity varies
from 0 (00) to ff = 255 = 16^2 − 1. The first pair of digits in the hexadecimal RGB colour
code controls red, the second green, and the final pair blue. Accordingly, "#ff0000" brings
bright red, etc. You may experiment with that on your own or google a palette.
19 One may wonder why on Earth we would need unequal bins. Strangely enough, some
major pollster companies use rather peculiar bins for respondents’ ages. To make data com-
parable to their polls’ results, unequal bins for age groups are sometimes needed.
a way as to place the value of interest (e. g. 91%) at the centre of a bin, not
at its margin. This means that the bin borders should be not 90.9%
– 91.0% – 91.1%, but 90.95% – 91.05% – 91.15%. To generate a sequence of
numbers indicating the breaks this way, we will need the following call:
seq(-.0005, 1.0005, .001)
This sequence starts at −0.05%, ends at 100.05%, and its increment is 0.1%
(we should never forget that 1 percent is 1/100, so 1/10 of it would be 1/1000). The
full script for generating this graph ab ovo can be found in the Appendix to this
part of the manual. Here I give only the part relevant to the picture.
hist(pres.2008$TURNOUT,
breaks=seq(-.0005, 1.0005, .001),
col="black",
main="", xlab="Voters’ turnout at a polling station, 0.1% bin")
The histogram corresponding to this code can be seen in fig. 6, top-
left. Its general shape is nearly unreadable because about 5 000 polling stations
demonstrate 100% turnout.20 To enhance the readability of the general shape of
the histogram, we need to stretch it vertically somehow. The right way to do
it is to set arbitrary limits to the segment of the y-axis visible within the plotting
area (thus stretching or compressing the y-axis). This can be done with the ylim
argument. In histograms, its default value is derived from the baseline (0) and
the height of the tallest bin, and can be replaced with any vector of length 2
specifying an arbitrary range (see fig. 6, top-right, for the results of the code given
below):
hist(pres.2008$TURNOUT,
breaks=seq(-.0005, 1.0005, .001),
ylim=c(0,400),
col="black",
main="", xlab="Voters’ turnout at a polling station, 0.1% bin")
We may apply a similar magnifying glass to the x-axis too. Its default
value is derived from the range (remember the range() function discussed above)
of the variable under consideration. In the two bottom histograms of fig. 6,
xlim=c(.9,1) was applied. For our convenience, white dashed lines are added
to the bottom-right image to point at the integer percentages. The art of adding
lines will be discussed in more detail below, in the sections on plotting mathe-
matical functions and graphical primitives.21
The final remark on histograms, for now, concerns the meaning of the y-axis.
In the beginning of this section, I mentioned that the y-axis can reflect both
frequency and density. To switch between the two, the freq argument is used.
It defaults to TRUE; when set to freq=FALSE, the y-axis turns to density.
You may experiment with it on your own (note, please, how the density scale
changes depending on the bin width). When a density scale is
used, a smoothing density function curve can be added to the histogram. This,
however, would require two more specialised functions for plotting alone and,
sometimes, additional data transformation. Given these complications
and a dubious aesthetic value, I postpone the discussion of the density line until
the section on graphical primitives.
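For instance, the grey histogram from fig. 5 can be redrawn with a density scale
by a call along these lines (a sketch, not shown in the figures):

```r
# Same histogram, but the y-axis now shows density instead of frequency
hist(students.df$HEIGHT, freq=FALSE,
     main="Undergraduate students", xlab="Height, cm", col="grey")
```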
20 Many of them are, indeed, rather peculiar, e. g., they may be located at railway
stations where (by definition) there are no lists of registered voters, or on ships.
21 For less patient students, I may recommend studying the help page for the abline() function.
[Figure 6: four histograms of voters’ turnout at a polling station, 0.1% bin
(y-axis: Frequency). Top-left: default axes; top-right: ylim=c(0,400); bottom
row: xlim=c(.9,1) applied.]
other device is specified, it opens automatically when a plotting function is called. Like other
devices it can be closed using dev.off().
major raster graphics editors and preview software, including web browsers. To
save our simplest histogram (fig. 5, top-left), it would be enough to type:
png("hist.students.df.height.png") # 1. Opening a graphic device;
hist(students.df$HEIGHT) # 2. Plotting a graph;
dev.off() # 3. Closing the device;
The result of this code can be seen in fig. 2, top-right. plot() uses the
factor values sorted according to their numeric codes (as you might remember,
by default they are sorted alphanumerically) for the x-axis, and the frequency of
the units characterised by this or that state of the variable for the y-axis. One
cannot change much in the appearance of a bar plot drawn with the generic plot()
function. The axes can be scaled with xlim and ylim; the main, xlab, and ylab
23 Some printing devices expect size in inches but it is not the default expectation for PNG.
Figure 7: Simple bar plots. Left: students by gender. Right: students by self-reported
smoking class, from ‘never tried’ (0) through ‘heavy smoker’ (4); see the dataset legend
for details.
can also be adjusted, but sometimes even the simplest bar plots require more
interventions.
A more specialised barplot() function offers more controls. Like other
specialised functions, barplot() is not ‘smart’ enough and requires a special
kind of data object as its argument. When visualising a single variable, the
object should be a table. A tabular summary of a variable can be obtained
with the table() function. The nature of the variable is unimportant: table()
tabulates numerical vectors as easily as character vectors and factors.
> table(students.df$SEX)
f m
169 52
> table(students.df$SMOKING)
0 1 2 3 4
55 51 85 21 11
>
The results of these two lines of code can be seen in fig. 7. It should be noted that
barplot() cannot use the original variable as an argument.24 To specify the
heights of the bars, it requires a numeric vector. In our example, it was supplied
as a table, but it can be supplied as a simple vector as well (including numerical
vectors within data frames).
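The two panels of fig. 7 can thus be reproduced by feeding the tables straight
into barplot() (a sketch; the original calls are not shown above):

```r
# Bar heights are taken from the tabulated frequencies;
# the tables' names attributes supply the bar labels
barplot(table(students.df$SEX))
barplot(table(students.df$SMOKING))
```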
24 In its turn, the use of table() transformation with plot() is technically possible but
it brings rather weird results: they are interpretable but as a graphical representation of the
data they leave much to be desired.
[Figure 8: bar plots of the phil dataset. Top row: vertical bars, without and
with name labels (y-axis: Mentions). Bottom row: horizontal bars labelled with
the philosophers’ names, Bardill through Wagner (x-axis: Mentions).]
The ‘students’ dataset contains no data that would cause any problem with
representing them in a bar plot. A microscopic real-life dataset on philosophers
mentioned in the chapter headings of the six German textbooks in the history
of philosophy published from 1800 through 1830 will be, strangely enough, more
challenging.25
A straightforward application of barplot() function to the appropriate vari-
able within the phil data frame looks rather imperfect (fig. 8, top-left):
barplot(phil$Freq)
25 The dataset is supplied in a separate file, philosophers.txt. In the code examples, it
is supposed to be read into the phil object, which is a data frame. In fact, it was one of
the many by-products of a rather complex data transformation used in Maxim Demin &
Alexei Kouprianov (2018): Studying Kanonbildung: An Exercise in a Distant Reading of
Contemporary Self-descriptions of the 19th Century German Philosophy, Social Epistemology,
DOI: 10.1080/02691728.2017.1414332
24
In fact, it looks even worse than the ones for the ‘students’ dataset (fig. 7).
We see just the bars themselves and a scale axis to the left. Not a single philoso-
pher’s name appears in the plot. Why did the students data look better? Because
tabulated data contain a names attribute (see the code examples above), which
is picked up by barplot() by default. When working with a vector of unnamed
entities, another vector with names is needed to label the bars (the names.arg
argument specifies the labels).26
barplot(phil$Freq, names.arg=phil$NAMES, ylab="Mentions")
To draw the bars horizontally and turn the axis labels, we need two more
arguments, horiz and las (fig. 8, bottom-right):27
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")
There remain three problems. One of them (which would not be present in
the on-screen plot, for reasons we shall discuss later) is that the graph is
over-cluttered with labels.28 The other is that the y-axis labels are only partly
visible (e. g., ‘rmacher’, the third from the top, is apparently ‘Schleiermacher’,
whose name was too long to fit in the margin). The third I would not call a problem,
but sometimes you may also need to order the items in the bar plot according to their
frequencies. These three problems require three different kinds of solutions, and
none of them can be solved using barplot() arguments.
The first one will not be treated here in detail. It is enough to say that it
concerns the relative size of the font and the plotting area (I could make the graphs
look exactly the way you see them on screen, at the expense of labels being
too small to read without a magnifying glass, but I preferred to keep them
readable). For the purposes of fig. 9 it was solved by changing the image
file aspect ratio from 1:1 to 8:7.
The second is, however, more pressing, and pretty annoying when you need
to work with lots of bar plots and have no time to play around with margin
widths, adjusting them manually. Fortunately, R has a couple of functions
that return the size of a text string (strwidth() for the width of a string
26 And we, of course, should label the frequency axis in some meaningful way. For the
purposes of our graph "Mentions" would be enough, as it should be explained in the article
text and image caption in more detail.
27 The las argument can be assigned with any of four numerical codes. 0 (default): x- and
y-labels are oriented along their axes (x-labels — horizontally, y-labels — vertically); 1: both
horizontally; 2: both perpendicular to their axes; 3: both vertically.
28 As you might have noticed, some of the graphs appear in this manual in a slightly
different way than on your screen. This is largely because of rather different requirements for
printing-quality graphs and on-screen graphs we shall discuss in an appropriate section of the
manual.
Figure 9: Controlling parameters of a bar plot. Continued from fig. 8. Left: the plot
stretched vertically to provide enough space for all bars (not necessarily needed when
working with standard output, the graphics console), left margin adjusted to make the
names visible. Right: same as the left, data ordered according to frequency. See text
for code details.
and strheight() for its height). The width can be returned in several ways,
specified with the units argument. The units can be defined in three ways, of
which two are easily understandable and more immediately relevant to ordinary
users: "inches" means literally the size in inches, while "user" returns the size in
the units of the coordinate system used in a particular plot. The first is more
useful for identifying the size of text labels printed in the margins of the plot,
while the second is better for labels used in the plotting area. What we need
now is to find the maximal width of the y-axis labels in inches and somehow
account for it when specifying the width of the left margin of the plot.29
The maximal width of the label can be thus identified with:
max(strwidth(phil$NAMES, units="inches"))
Now, we need a method to feed it into the plot parameters. The width
of the margins is defined at the level of the plotting device with the mar or mai
arguments of the par() function. The first specifies each margin as the number
of lines of text that could be printed in a margin this wide, the second as a width
in inches (only one of them should be used for a given graph).
Margins are defined clockwise, starting from the bottom margin. The default
values are:30
> par()$mar
29 In fact, there are other important arguments in the strwidth() and strheight() func-
tions because the size of the string depends on many parameters, all of which should be taken
into account. In our case, however, we do not change the default font family, size, etc., so we
are spared of all these complications.
30 Unfortunately, this easy way of retrieving default values does not work for all functions,
[1] 5.1 4.1 4.1 2.1
> par()$mai
[1] 1.02 0.82 0.82 0.42
>
For the purposes of our graph (fig. 9, left), I used the following script:
par(mai = c(0.82, 0.42 + max(strwidth(phil$NAMES, units="inches")), 0.42, 0.42))
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")
Now, the final touch. To sort the items according to their frequencies, we
need a rather simple data transformation. The phil data frame should be
sorted not by phil$NAMES but by phil$Freq. This can be accomplished with
the order() function.31 We shall create a new object, phil.s, which contains the
same philosophers but sorted differently.32
phil.s <- phil[order(-phil$Freq),]
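To see what order() actually returns, a tiny example may help (the vector is
arbitrary):

```r
x <- c(5, 2, 9)
order(x)    # 2 1 3: the positions of the elements in ascending order
x[order(x)] # 2 5 9: the vector sorted via its index
order(-x)   # 3 1 2: negating the argument gives descending order
```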
The resulting graph (fig. 9, right) can be called with the code:33
par(mai = c(0.82, 0.42 + max(strwidth(phil.s$NAMES, units="inches")), 0.42,0.42))
barplot(phil.s$Freq, names.arg=phil.s$NAMES, horiz=TRUE, las=1, xlab="Mentions")
We shall come back to bar plots in the section on bivariate plots. Alongside
mosaic plots, they can be used to visualise the interrelationships of qualitative
variables.
object.
34 The axes range is specified with the same xlim and ylim parameters, margin labels with
Figure 10: Simple scatter plots (continued in fig. 11). Left: raw
plot(students.df$HEIGHT, students.df$MASS) call. Right: pch appearance
changed, axes labels added.
There are many things in this plot that can be improved. E. g., we may
wish to change the appearance of the plotting symbols from open circles to some-
thing more aesthetically acceptable. We may wish to smuggle into our plot some
important third dimension, e. g., the degree to which the data points overlap
each other (it happens even to rather small samples if the measurement scales
are crude enough) or, given the natural heterogeneity of our sample, whether
the graphical separation of male and female students displays any meaningful
pattern.
The shape of the plotting symbol is governed by the pch argument. The
value for this argument is either a numerical code or a text string. The standard
26 symbols with their numerical codes are shown in fig. 12 (as you see from
the example above, pch defaults to 1). Besides these symbols, any character
supplied as a text string can be used to represent a data point (examples are
given in the bottom-left corner of fig. 12). It should be noted, however, that the
use of characters other than "." in presentation-quality plots is not advisable for
aesthetic reasons (even though it may sometimes be useful for quick previews
at some points of analysis).
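All 26 standard symbols can be previewed with a one-liner of this kind (a
sketch, not part of the original figures):

```r
# One row of points, each drawn with its own pch code, 0 through 25
plot(0:25, rep(1, 26), pch=0:25)
```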
The following code brings us fig. 10, right (now, the axes are labelled and
the appearance of data points changed):
plot(students.df$HEIGHT, students.df$MASS,
xlab="Stature, cm", ylab="Body mass, kg", pch=20)
For a quick assessment of the degree to which the data points overlap each
other, semi-transparent colouration can be used.35 The semi-transparent
colours, as I promised in the section on histograms, can be called with the rgb()
function. Its arguments define the intensity of the three channels of the RGB system
and the α channel, responsible for transparency. All channels’ intensities may
35 See the section ‘Stepping beyond two dimensions’ below for a more sophisticated ap-
proach.
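(The call behind fig. 11, left, is not reproduced above; it was presumably along
these lines, with the fourth argument of rgb() setting the transparency.)

```r
# Semi-transparent black dots: overlapping points appear darker
plot(students.df$HEIGHT, students.df$MASS,
     xlab="Stature, cm", ylab="Body mass, kg",
     pch=20, col=rgb(0, 0, 0, 0.3))
```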
Figure 11: Simple scatter plots (continued from fig. 10). Left: a quick way to as-
sess data point overlap using the rgb() function with transparency as a value for the col
argument of plot(). Right: a quick way to bring in a third variable, students’ gen-
der, by supplying the numerical value of the students.df$SEX factor level as a colour’s
numerical code for the col argument.
The results can be seen in fig. 11, left. Now, we see that some points (pale
grey) apparently represent just one individual each. The darker the point,
the more individual data points overlap, demonstrating exactly the same com-
bination of stature and body mass, given the limitations of our measurement
scale (stature rounded to centimetres, and body mass to kilograms).
As we know all too well from the very beginning, our students dataset contains
data on both female and male students. There is a way to quickly preview this
heterogeneity. Apparently, we need to use either different colours or different
data point shapes to distinguish students of different genders. The easiest way
to do this without data transformations is to supply the numerical value
of the factor students.df$SEX as the value for the col argument. As you may
remember, we mentioned the standard numerical codes for the nine basic colours
above. The numerical values of the levels of our factor are 1 and 2, so
we may expect black and red dots for female and male students, respectively
(remember that, by default, the factor levels are sorted alphanumerically). The
following code brings us fig. 11, right.
plot(students.df$HEIGHT, students.df$MASS,
Figure 12: 0–25: numerical codes for the 26 pre-set printing characters (pch). Symbols
21–25 have a background fill colour which can be defined separately from the outline
colour (col) using the bg argument (defaults to "white"; for this figure it was set to
bg="red"). The text strings ".", "a", and "1" in the bottom-left corner illustrate
the possibility of using any characters (letters, numbers, and punctuation marks) in the
plots. Even though it is, in principle, possible, the use of text strings other than ".", "*",
and "+" is not advisable.
We can see now that the male students (red dots) concentrate in the top-
right portion of the data point cloud, while the female students (black dots)
concentrate in the opposite, bottom-left corner. This preview does not allow us
to use transparency; any more sophisticated graph, however, would require at least
a minor data transformation. In our case, it would be simple: we need to
segregate female and male students into two different objects, students.f.df
and students.m.df.
The segregation of a dataset into subsets can be achieved with the subset()
function. Its general form looks like this: subset(object, condition(s)). We
should first specify the object we would like to use for the subsets extraction,
then one or several conditions for subset extraction (see tab. 3 for the list of
operators used to formulate and combine conditions). E. g., if we would like to
put female and male students into different objects, this can be done this way:
students.f.df <- subset(students.df, students.df$SEX == "f")
students.m.df <- subset(students.df, students.df$SEX == "m")
Now, we have two different objects for female and male students and we can
proceed with our graph. We already know that the two clouds of data points
Table 3: Conditional operators in R. The two former can be applied to character
values as well as numerical values, the following four can be applied to numerical
values only. The latter two are used to combine different conditions that should be
applied simultaneously. See more in the section on data transformations.

  ==  equal to                  !=  not equal to
  <   less than                 >   greater than
  <=  less than or equal to     >=  greater than or equal to
  &   and (both conditions)     |   or (either condition)
are shifted relative to each other, gravitate to opposite corners of the plotting
area, and apparently have a very limited overlap area. This means that if we,
say, plot the cloud for female students first, R would automatically adjust the
size of the plotting area in such a way that there would not be enough space to plot
all the data points for male students, and vice versa. This means that we should
learn another practical trick.
The easiest solution is as follows. We should first call an empty plot with
enough space for all data points and then add the data points for all subsets
to the existing empty plot. To call an empty plot, we need to learn another
argument of the plot() function, type. This argument may be assigned several
values, each of them representing a specific kind of graph.36 For the purposes
of the present graph we shall use the "n" value, which means ‘an empty plot for
the data points specified with the x and y arguments’.
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")
The result can be seen in fig. 13, top-left. As you see from the axes, the size
of the plotting area is exactly the same as in figs. 10 and 11, but the data points
themselves are not present. It’s time to add them. To add data points to an
existing plot, the points() function is used. Its arguments are similar to those
of the plot() function, but the points() function cannot influence the axes and
the text in the margins of the plot (the xlim, ylim, main, xlab, ylab, and some
other arguments).
points(students.f.df$HEIGHT, students.f.df$MASS, pch=20, col=rgb(1,0,0,.3))
The results of this code can be seen in fig. 13, top right. I used semi-
transparent red to account simultaneously for both the possible overlap of the
data points and the differences connected to gender. Now, we can add the data
points for male students, for which I picked a semi-transparent blue:
points(students.m.df$HEIGHT, students.m.df$MASS, pch=20, col=rgb(0,0,1,.3))
36 This subject will be treated in more detail in the following sub-section on time series
plots.
Figure 13: Adding points to a plot. Top-left: calling an empty plot with type="n".
Top-right: adding the points for the first subset. Bottom-left: adding points for the
second subset. Bottom-right: the same as left but with the use of black and white
symbols with different geometry.
Now we can see clearly the two clouds as well as the small overlap area
including several violet dots of varying intensity in the middle of the combined
cloud where female and male students’ data points overlap.
It should be noted that the points() function cannot itself call the plot.
It can only add elements to an existing plot; so, first plot() (or any other
primary plotting function) should be called, and only then comes the time for
points(). Thus, the complete code for fig. 13, bottom-left, should, in fact, look
like this:
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=20, col=rgb(1,0,0,.3))
points(students.m.df$HEIGHT, students.m.df$MASS, pch=20, col=rgb(0,0,1,.3))
and white line art. A possible solution is proposed in fig. 13, bottom-right. The
code for it is as follows:
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=3)
points(students.m.df$HEIGHT, students.m.df$MASS, pch=4)
37 One may think that as time is continuous, so, the dependent variable should be measured
continuously. This, however, is not always technically possible (the only notable exception
that comes to my mind is a heliograph, an ingenious analog inscription device composed of a
sphaerical lens and a band of paper onto which the Sun burns its trace as the Earth revolves).
The bits of time between which we measure the parameter under study may be tiny, but they
are bits. A more valid objection would be that if the dependent variable is countable (like the
number of people in a certain category), then a line merely connecting the points is misleading.
Indeed, if there were 222 students by January 1, 1882 and 267 students by January 1, 1883 in
the Imperial Moscow University, there is no reason to believe that there were 244.5 students
by July 1 and 255.75 by October 1, 1882 (the numbers are taken from the dataset we will
use a bit later). It is, however, not the plot itself that is misleading, it is our straightforward
interpretation that is. The continuity of the line symbolises here not a continuous variation
but the identity of an analytic object (e. g., a body of students). For methodological purists,
there is a couple of possible alternatives to the default line chart though, which are considered
below.
38 Please, do not forget to close the plotting device after plotting this, or to revert the par()
[Figure 14 panels: a 3 × 3 grid of toy plots titled ‘Plot type = n’, ‘Plot type = p’,
‘Plot type = l’, and so on for the remaining type values (axes: x.coord vs
y.coord).]
Figure 14: Nine values of the type argument for the plot() function.
The results of this code can be seen in fig. 14. As you see, the default value
for type in plot() is "p". As for the other types, it seems that we have not
seen them before (and, actually, we will not see most of them afterwards). Two
types, however, are most relevant to time series plots ("l" and "o"; some
would also consider "b" an option, but it looks no good when the data points
are close to each other).39 I would generally recommend "l" for the cases
when data points are separated by equal time intervals, while "o" (or "b")
suits better the cases when the time intervals between the measurements are
39 It can be easily seen, by the way, that, graphically, "o" is a combination of "p" and "l",
[Figure 15 panels: ‘Points default’ and ‘Points type="l"’ (top row); ‘Lines
default’ and ‘Lines type="p"’ (bottom row); axes: x.coord vs y.coord.]
Figure 15: points() and lines() are nearly the same but their type arguments
default to different values ("p" and "l" respectively).
of various lengths (the points would thus emphasize the moments at which the
parameter was measured).
What is a bit surprising is that the points() function, which we used to
add points to an already existing plot, actually has the same parameter, type
(defaulting to "p"), which works exactly the same way as with plot(), with all its
nine options. As time series plots are most usual, a special function, lines(),
which works exactly the same way as points() and plot() but defaults to "l",
is also present.40 You may check it with the following code (see fig. 15):41
par(mfrow=c(2, 2), pch=20) # Setting parameters for the plotting device;
plot() of type="l" and the like require at least two data points to plot something. This
influences also the handling of NA values (you may test it experimentally on your own).
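The effect of NAs can be demonstrated with a toy vector (a sketch for self-testing):

```r
y <- c(1, 2, NA, 4, 5)
plot(y, type="p") # the NA point is simply omitted
plot(y, type="l") # the line breaks at the NA: each segment needs two non-NA ends
```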
41 Please, do not forget to close the plotting device after plotting this, or to revert the par()
[Figure 16 panels: the same set of points plotted from (x.coord, y.coord) (left)
and from the differently ordered (x.coord.1, y.coord.1) (right).]
Figure 16: On the importance of sorting the entries when plotting a line chart.
Thus, to simplify the code, the lines() function can be used when adding
multiple time series to the same graph (as well as, of course, the points()
function with an appropriate type value). There is, however, a most important
difference between plotting data points and lines. To illustrate that, I would
Figure 17: A vector (x.coord.2 in this case; left) vs. the index (right) as the x-axis
value source. The y-axis values are in both cases derived from the same vector, y.coord.
See text for code details.
You might have noticed that x.coord.1 and y.coord.1 describe the same
set of points as x.coord and y.coord, but the points are sorted differently. In
the original vectors (x.coord and y.coord) we used in figs. 14, 15, and 16 (left),
the points are sorted by x.coord in ascending order, while in x.coord.1 and
y.coord.1 no such ordering is observed.
plot(type="o"), like the other line chart types, uses the order of elements
within the object to identify the order in which the data points should be con-
nected to each other with line segments to form a line. This is why the bottom
plots in fig. 16 are so different. We have seen the thing responsible for that
many times. Remember those element numbers in their square brackets? This
thing has a name: the index. So, one may see that the index value for the data
point at (7, 5) in the original dataset (fig. 16, left) is [7], and in the modified
dataset (fig. 16, right), [2].43
42 Please, do not forget to close the plotting device after plotting this, or to revert the par()
y.coord.1[2].
Figure 18: Line types (the lty argument). 0 — transparent; 1 — solid; 2 — dashed,
with short dashes; 3 — dotted; 4 — dots and short dashes; 5 — dashed, with long
dashes; 6 — very short and longer dashes (magnification needed).
In fact, the index may serve as an x-axis variable itself. If you supply plot()
with just one vector, it will by default serve as the source for the data points' y-
axis values, while the x-axis values will be derived from the index. You should take
into account, however, that index starts with 1 and has, naturally, a regular
increment of 1, while the successive values in a real-life x-axis variable may be
rather diverse even when sorted. See fig. 17 for a rather moderate but, none the
less, telling example, the code for which is provided below:
# Adding a new object;
x.coord.2 <- c(1, 3:8)
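The two panels of fig. 17 can then be produced along these lines (a self-contained sketch: the y.coord values below are stand-ins of matching length, as the original vector was defined earlier in the manual):

```r
x.coord.2 <- c(1, 3:8)             # seven x values, note the gap after 1
y.coord <- c(1, 3, 2, 5, 4, 7, 6)  # stand-in y values of the same length
par(mfrow=c(1, 2))                 # two panels side by side
plot(x.coord.2, y.coord, type="o") # left: x taken from the vector
plot(y.coord, type="o")            # right: x derived from the index, 1..7
par(mfrow=c(1, 1))                 # restoring the default
```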
Naturally, one may find it useful to add more than one line to a time series
plot. We already know that this can be done with points() and lines().
When colours are available (e. g., in on-screen presentations), different lines can
be distinguished by colours (the already familiar col argument). When they
are not (e. g., in a scholarly journal with no colour prints), we need to employ
different line types (the lty argument) to highlight the difference. Lines are
less graphically diverse than the data point shapes, but there are no less than
six (seven, actually) kinds of them (see fig. 18). Despite all this diversity, it is
not advisable to use more than three contrasting line types in the same plot
(arguably the best combination is 1-3-5).
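The recommended combination can be previewed with three horizontal lines (a throwaway sketch):

```r
plot(0:10, 0:10, type="n", xlab="", ylab="") # an empty canvas
abline(h=3, lty=1) # solid
abline(h=5, lty=3) # dotted
abline(h=7, lty=5) # dashed, with long dashes
```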
Now, we are nearly ripe for our first real-life time series plot. The last thing
to learn here is how to represent the dates on which a time series plot is supposed
to be based. By default, R recognises dates encoded in the YYYY-MM-DD format (e. g.,
2018-04-12 for April 12, 2018). Other date formats can also be used after some
additional transformations.44 A training dataset on the temporal dynamics of
the Imperial Moscow University students over the last two decades of the 19th
century contains dates in YYYY-MM-DD format and four variables reflecting the
yearly reports on the number of students immatriculated with the four faculties
(Law, History and Philology, Physics and Mathematics, Medicine):45
44 As I have already said, an extended discussion of the date and time formats will be
postponed until the section on data transformation.
45 See the dataset at https://github.com/alexei-kouprianov/Breaking-the-ice-with-R
Figure 19: A time series plot. Moscow university students by faculty. See text for
code details.
> head(moscow)
DATE HP L PM M
1 1881-01-01 190 451 392 1397
2 1882-01-01 222 567 463 1346
3 1883-01-01 267 692 526 1314
4 1884-01-01 276 844 497 1257
5 1885-01-01 297 949 550 1195
6 1886-01-01 314 1045 586 1237
As the dates supplied this way are essentially text strings, to plot a time
series one needs to use the as.Date() function to interpret them correctly. Here is
a nearly complete code for fig. 19:46
46 The legend located in the top-left corner is not included. The legend() function will be
# Plotting the time series for the faculty of Law;
plot(as.Date(moscow$DATE), moscow$L, type="l",
ylim=c(0, max(moscow[, 2:5])),
main="", xlab="Timeline", ylab="Number of students")
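The remaining three faculties, and the legend seen in fig. 19, can then be added with lines() and legend(). A self-contained sketch follows, rebuilding the six rows shown in the head(moscow) listing above so it runs on its own; the lty assignments are my choice, not necessarily the original's:

```r
# Rebuilding the first rows of the dataset from the head() listing:
moscow <- data.frame(
  DATE=c("1881-01-01","1882-01-01","1883-01-01",
         "1884-01-01","1885-01-01","1886-01-01"),
  HP=c(190,222,267,276,297,314), L=c(451,567,692,844,949,1045),
  PM=c(392,463,526,497,550,586), M=c(1397,1346,1314,1257,1195,1237))
plot(as.Date(moscow$DATE), moscow$L, type="l",
  ylim=c(0, max(moscow[, 2:5])),
  xlab="Timeline", ylab="Number of students")
lines(as.Date(moscow$DATE), moscow$HP, lty=3) # History and Philology
lines(as.Date(moscow$DATE), moscow$PM, lty=5) # Physics and Mathematics
lines(as.Date(moscow$DATE), moscow$M, lty=2)  # Medicine
legend("topleft", legend=c("Law", "History and Philology",
  "Physics and Mathematics", "Medicine"), lty=c(1, 3, 5, 2))
```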
As you might have noticed, I used both points() and lines() to add time
series to plots. This was quite deliberate, because I wanted to stress again
that both these functions can be used to add lines to the plot.
Of course, for practical reasons, it is a lot easier sometimes (e. g., when the
leaps of time between measurements are year long or even longer) not to use the
date format at all. Even though not quite philosophically correct, technically,
the plot based on a simplified time series in which dates are represented with
just years or decades will not look much (if at all) different from a 'properly
dated' one. The strictly chronological order of entries is, however, to be observed
for the reasons discussed above.
Multiple boxplot
The multiple boxplot is used when we need to compare several samples or
sub-populations by some quantitative parameter. With our training students
dataset, we may wish to compare, e. g., students of different genders by height
or body mass, or do the same for other possible groupings like smokers and
non-smokers or different departments and academic groups.
We do know that variation of a quantitative variable across the sample can
be depicted with a histogram. Theoretically, we may compare two samples
just by printing two histograms next to each other (fig. 20, top-left). To do it
we need to modify the plotting device properties using mfrow argument of the
par() function.47
# Re-creating the objects;
students.df <- read.table("kouprianov.students.v.2.1.txt", h=TRUE,
sep="\t", stringsAsFactors=TRUE)
students.m.df <- subset(students.df, students.df$SEX == "m")
students.f.df <- subset(students.df, students.df$SEX == "f")
To make the histograms truly comparable, one needs to adjust the x-axes,
though, applying proper xlim values (fig. 20, bottom-left):
par(mfrow=c(2, 1)) # Setting the capacity of the plotting device;
hist(students.m.df$HEIGHT, xlim=c(140, 200), col=8,
main="Male students", xlab="Stature, cm")
hist(students.f.df$HEIGHT, xlim=c(140, 200), col=8,
main="Female students", xlab="Stature, cm")
47 I also made them look less ugly from the start by adding colour, titles, and x-axis labels.
Figure 20: Comparing two samples by a quantitative variable: juxtaposed (left) and
superimposed (right) histograms. Top-left: a rough attempt to juxtapose stature
histograms for male and female students. Bottom-left: a unified x-axis range allows
a reasonable comparison. Top-right: female students' stature in red, male students'
stature in blue; frequency at the y-axis shows the relative size of the samples. Bottom-right:
same, but density at the y-axis allows comparison of the bell curves' proportions regardless
of the sample size. See text for the code.
Another option would be to combine two semi-transparent histograms in the
same plot using the add=TRUE argument in the second hist (fig. 20, top-right).
Note, please, that all basic properties of the plot (like default xlim and ylim
values or axes labels) are specified only for the first hist() (this is why we need
to plot the more numerous sub-population first):
par(mfrow=c(1, 1)) # Restoring the default settings;
hist(students.f.df$HEIGHT, xlim=c(140, 200), col=rgb(1, 0, 0, .5),
main="", xlab="Stature, cm")
hist(students.m.df$HEIGHT, col=rgb(0, 0, 1, .5), add=TRUE)
If the two samples we compare are of different size, and we still need to
compare visually the overall shapes of the bell curves, we may use density as
the y-axis value by setting freq=FALSE (fig. 20, bottom-right). Note, please, that
in this case we need to repeat the freq=FALSE for both calls of hist().
hist(students.m.df$HEIGHT, xlim=c(140, 200), freq=FALSE, col=rgb(0, 0, 1, .5),
main="", xlab="Stature, cm")
hist(students.f.df$HEIGHT, freq=FALSE, col=rgb(1, 0, 0, .5), add=TRUE)
In this particular case, we also had to change the order of appearance of the
samples (which explains the different shade of violet in the overlapping areas
of fig. 20, right), because the blue histogram has a slightly more acute peak
than the red one.
But, what if we need more than just two histograms? What if we need to
compare not the two sexes but ten academic groups? What about 85 regions in
the Presidential elections dataset? The more histograms we put into one plot,
the less informative is their shape. As they flatten, the peaks and tails get less
prominent and, in the end, nearly indiscernible. In this case, a series of boxplots
may seem a decent alternative.
As you might remember, the boxplots summarise the robust measures of
variation: the median (the thick bar crossing the box) and the quartiles (the sides
of the box parallel to the median bar). It is a little bit trickier to explain what
the 'whiskers' mean. Usually they represent either the maximum and minimum, or
the most extreme data points lying within 1.5 × interquartile range of the box (in
the R functions in charge of boxplots, this is adjustable).
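In boxplot(), the whisker rule is controlled by the range argument; a quick sketch on simulated data:

```r
set.seed(1)
x <- rnorm(100)
boxplot(x, range=1.5) # default: whiskers within 1.5 * IQR of the box
boxplot(x, range=0)   # whiskers drawn to the minimum and maximum
```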
Boxplots can be created either with the generic plot() function or with the
more specialised boxplot() function, which allows more control at the expense
of a less uniform syntax. The plot() function generates boxplots automatically if the x-
axis variable is a factor and the y-axis variable is a numerical vector (see fig. 21,
top-left):
plot(students.df$SEX, students.df$HEIGHT)
Note, please, that to use the academic group number as the x-axis variable
for a boxplot, we should apply the as.factor() transformation (as the groups are
encoded with numerical codes, they are perceived as numbers by default):48
48 You may try it without this transformation at your own risk and see the result.
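With the numeric group codes, the call might look as follows. Since the real students.df is read from a file earlier in the manual, the sketch substitutes a small stand-in data frame so it runs on its own:

```r
# With the real data the call would be:
# plot(as.factor(students.df$GROUP), students.df$HEIGHT)
set.seed(2)
demo.df <- data.frame(GROUP=rep(1:3, each=20),
                      HEIGHT=round(rnorm(60, 170, 8)))
plot(as.factor(demo.df$GROUP), demo.df$HEIGHT) # one box per group
```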
Figure 21: Boxplots created with plot() (top), and boxplot() (bottom). See text
for the code and explanations.
From the section on the barplot() function, you might remember that the
objects it was ready to handle were of a different nature than those required
by plot() to produce the same result. It is the case with boxplot() too. If
we try to apply the same syntax as with plot(), the results will be strikingly
different (fig. 21, bottom-left):
boxplot(students.df$SEX, students.df$HEIGHT)
Figure 22: Scatter plot with jitter. Left: raw plot(students.df$HEIGHT,
students.df$SEX) call. Right: the same plot after some adjustments (xlab and ylab
fixed, data points pch and col changed, jitter() added to avoid extreme overlap of
data points, y-axis tuned to reflect the two qualitative states without any technically
inappropriate intermediate gradations). See text for details.
The only difference here is that we will need to take care of the x-axis
tickmark labels on our own by switching off the default axes and manually adding
new ones. It is unfair, perhaps, not to explain the details of axes manipulation
here. My only excuse is that they will be treated in more detail in the following
section.
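For the record, the intended side-by-side boxes can also be requested from boxplot() through its formula interface, boxplot(y ~ g). A sketch on stand-in data (with the real data the call would presumably be boxplot(HEIGHT ~ SEX, data=students.df)):

```r
set.seed(3)
demo.df <- data.frame(SEX=rep(c("f", "m"), c(30, 10)),
                      HEIGHT=c(rnorm(30, 166, 6), rnorm(10, 178, 7)))
boxplot(HEIGHT ~ SEX, data=demo.df) # one box per factor level
```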
The first is students.df$SEX represented with factor levels forming a rather ugly boxplot trying its best
to visualise a series of 1s and 2s. The second is students.df$HEIGHT, not split by gender.
boxplot we discussed in the previous section), so it would be useful to learn how
to handle them.
A most straightforward solution would be to call:
plot(students.df$HEIGHT, students.df$SEX)
The results are presented in fig. 22, left. It seems that the plotting
function took the numerical values of the two levels of the students.df$SEX
factor for the y-axis coordinates. The plot is thus perfectly meaningful, however
ugly. Apparently, it needs the standard aesthetic adjustments (a meaningful
application of xlab and ylab, as well as the changes in data point appearance
we used before), but this goes without saying. It is clear, however, that, first,
no matter how semi-transparent the points will be, they will form indiscernible
linear clusters. And, secondly, that the y-axis needs a radical reconstruction.
This means that we need to use the jitter() function to add some noise to the
y-coordinate and that we need to learn how to fully control the axes.
A straightforward application of the jitter() to students.df$SEX returns
an error message:
> plot(students.df$HEIGHT, jitter(students.df$SEX))
Error in jitter(students.df$SEX) : ’x’ must be numeric
>
This problem is partly familiar to us. From the section on scatter plots, we
already know how to transform a factor into a numerical vector on the basis
of factor levels’ numerical values. We should apply first the as.numeric()
transformation, and only then, the jitter(). The following code will return a
slightly improved version of the fig. 22, left.51
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX)))
argument, the tickmarks' labels with labels, the tickmark labels' orientation
with the already familiar las (see the 'Bar plots' section for more details). The
invisibility of the axis and tickmark lines will be achieved by simply colouring
them the background (white) colour.
Thus, the complete code for fig. 22, right, is as follows:
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX), factor=.5),
xlab="Height, cm", ylab="Gender", pch=20, col=rgb(0,0,0,.3), axes=FALSE)
axis(1)
axis(2, at=c(1:2), labels=c("f","m"), las=1, col="white")
As you see, within the jitter(), the noise factor is slightly diminished to
narrow the resulting clouds; within the plot(), the axis labels are fixed, pch
and col applied, and the original axes suppressed. The bottom axis is redrawn 'as
is’, while the left axis is supplied with a vector of at values, a vector of new
tickmark labels, tickmark label orientation (las) and white colour for the axis
and tickmarks’ lines (col).
Structured barplots
A structured barplot is needed when two qualitative variables meet each other.
R has a number of options for drawing graphs of this sort. The default mosaic
plot discussed above in the section on six basic graph types can be produced
with the basic plot() function supplied with two qualitative variables as x- and
y-axis values. The more conventional structured barplots require the special
barplot() function (we already started working with it above in the section
on simple barplots) and some additional data transformation. In this section, I
shall treat the mosaic plot first, and then discuss the structured bar plots.
First, we need the variables. In our training students dataset, there are
at least four variables that suit the purpose. Besides the conspicuously qualitative
SEX, there are DEPARTMENT and GROUP (even though they are encoded
with numbers, these numerical codes, as we already know, should be perceived
as nothing but names). The scale used to encode SMOKING is ordinal, not nominal,
as in the three former cases, but from the drawing perspective it makes
no difference. The only thing we should keep in our minds is that all the above-
mentioned variables encoded with numbers should be converted to factors (e.
g. as.factor(students.df$SMOKING) instead of just students.df$SMOKING)
to make R interpret them as 'qualitative' from the graphical perspective.
The code for the mosaic plot is very simple:
plot(students.df$SEX, as.factor(students.df$SMOKING))
The interpretation of the resulting graph (fig. 23) is rather tricky. It was
already given in the section on six basic graphs, but I repeat it here again in
more detail. The two vertical bars represent the two states of the x-axis variable.
Their widths are proportional to the frequencies of the x-axis variable's states.
In our case, we see that women are roughly three times more numerous than
men. The differently coloured areas in each bar represent the distinct states
of the y-axis variable. An ordered list of all possible states is given at the
left vertical axis, a frequency scale (from zero to one), at the right. A closer
inspection of the graph shows that roughly a half of the students in both genders
identify themselves as more or less regular smokers (classes 2–4). On the other
Figure 23: Mosaic plot. Students’ self-identified smoking class (0–4, at the y-axis)
vs. gender (at the x-axis). See text for explanations and code details.
hand, less than 5% (0.05) of women students and slightly more than 10% (0.1)
of men students identify themselves as heavy smokers (class 4).
We can check our eye estimates by running table() for the items of interest.
> table(students.df$SEX, students.df$SMOKING)
0 1 2 3 4
f 43 37 68 16 5
m 11 14 17 4 6
> table(students.df$SEX)
f m
169 52
As one may see from these tabulations, our eye estimates were quite close to
the exact numerical proportions. There is nothing unnatural or counterintuitive
about that, for the graph represents the numerical proportions with the highest
technically possible degree of precision. I just wish to stress that this graph may
seem intuitive only after a proper amount of explanation.
The structured barplots, however, are a little bit more conventional and,
generally, more advisable. We have already had some experience with the barplot()
function, and its twisted character could not have escaped the reader's attention.
In the present section, we will deal only with the specific problems of
representation of the interdependence of two qualitative variables. All more
general problems are addressed above in the section on univariate barplots.
We already know that barplot() requires numerical values for frequencies
of distinct states of qualitative variables. When the relationship between two
qualitative variables is explored, we need to visualise frequencies of their states'
possible combinations. E. g., considering the relationship between gender and
department, we should be interested to visualise somehow frequencies of women
and men students from the department no. 1 and those from the department
no. 2 (just as we did above for combinations of gender and self-ascribed smoking
habits’ classes). We already know that the most natural way to obtain these
frequencies is the table() function. Remember, this worked with the univariate
simple barplots as well. The only difference is that the tabulation of a single
variable produces a uni-dimensional table (a table with just one column), while
the tabulation of two variables result in a two-dimensional contingency table
with as many rows and columns as there are states in the first and second
chosen variables respectively (remember the Roman Catholic rule).
I. e., if we change positions of variables in the brackets, the table gets
transposed (this may be roughly described as turning 90° clockwise plus mirroring).
> table(students.df$SMOKING, students.df$SEX)
f m
0 43 11
1 37 14
2 68 17
3 16 4
4 5 6
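The same flip can also be obtained without retyping the arguments, via the t() function; a small aside on a toy factor pair:

```r
a <- factor(c("f", "f", "m"))
b <- factor(c(0, 1, 1))
table(a, b)    # a in rows
t(table(a, b)) # b in rows
```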
> table(students.df$SEX, students.df$DEPARTMENT)
1 2
f 114 56
m 46 6
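The stacked bars of fig. 24 (top-left) come from feeding exactly such a table to barplot(); with the real data the call is barplot(table(students.df$SEX, students.df$DEPARTMENT)). A self-contained sketch rebuilding the table from the counts printed above:

```r
dep.tab <- as.table(rbind(f=c(114, 56), m=c(46, 6)))
colnames(dep.tab) <- c("1", "2") # department codes as column names
barplot(dep.tab)                 # one stacked bar per department
```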
A careful comparison of the table to the graph (fig. 24, top-left) shows that,
in a way, the latter is just a pictorial representation of the former mirrored
in relation to the x-axis (imagine the table, in which areas of the cells are
proportional to the values they contain). The columns of the table correspond
to the vertical bars, while the numbers in the table’s cells specify the heights of
bars' segments. Note, please, an important difference from the mosaic plot we
produced with the plot() function in the beginning of this section: in the plot(),
the default order of x- and y-axis variables prevails, while in the barplot(), the
table() interferes, turning the tables.52 The order of bar segments is reversed
52 The pun was not intended at first.
Figure 24: Structured bar plots. Top: students’ gender vs. department (departments
indicated with numbers, women with a darker, men with a lighter shade of gray).
Bottom: experimental test.df dataset. Left: results of raw barplot() function calls.
Right: same graphs with beside=TRUE. See text for explanations and code details.
(i. e., the top row of the table is placed at the bottom of the graph; but this
is quite natural, because the graph starts with zero, which is by default placed
at the graph's bottom line). The bars' proportions allow us to see at once that
the 1st department outnumbers the 2nd (the latter is just 2/5 of the former), and
that there is a significant gender disproportion (men comprise about 2/7 of the
1st department students and just about 1/10 of the 2nd). That's all we need from
this graph.
Of the many parameters controlling the appearance of the structured bar
plot, one is most important to know. It is beside (defaults to FALSE); when set
to TRUE it changes the appearance of the graph in a most dramatic way.
barplot(table(students.df$SEX, students.df$DEPARTMENT), beside=TRUE)
Now (fig. 24, top-right), the bar segments are placed not on top of but next
to each other. In some cases this layout might seem preferable.
Of course, barplot() has many more adjustable parameters than just beside
(you should be already aware of horiz and names.arg discussed earlier). You
may also control the widths of the bars and the widths of the spaces separating
them (and even the width of spaces separating bar segments), the colours of the bar
segments, the presence or absence of a legend, and many other things.53
Note, please, that in case we would like to emphasise the proportion of
departments within gender groups rather than the proportion of the gender groups
within departments, we would need to change the order of variables within the
barplot() function call. The horiz=TRUE argument would definitely not be enough.
The frequencies may be picked from a data frame as well. To satisfy the
expectations of the barplot() function, however, the segment of the data frame
which is to be used as the contingency table should be transformed with
as.matrix(). To illustrate that point, we shall create a miniature abstract
data frame.
# Create the data frame
test.df <- data.frame(c(10,12,8), c(16,6,4), c(5,7,9))
colnames(test.df) <- c("A","B","C")
# Preview the data
test.df
Now, we are ready to test the barplot() with it. A straightforward attempt
to apply barplot() to our data would result in an error message:
> barplot(test.df)
Error in barplot.default(test.df) : ’height’ must be a vector or a matrix
The use of the as.matrix() transformation solves the problem (see fig. 24,
bottom-left and bottom-right):
barplot(as.matrix(test.df))
barplot(as.matrix(test.df), beside=TRUE)
Two considerations should be kept in mind, however, when using data frames
as a barplot source. First, a matrix is, essentially, a folded vector. This means
that the segment of a data frame we convert to a matrix is not allowed to contain
anything but numbers representing the heights of bar segments. Secondly,
naturally occurring contingency tables do not contain NA values (which is quite
understandable: if a certain combination of variables’ states does not happen
to occur, its frequency is zero, 0, not NA).54
53 At the very least, adding legend.text=TRUE to our gender vs. department plot is a must try.
54 To test the effect of an NA value on the plot, you may try to modify our experimental data
frame test.df by replacing one of the numbers with NA and redraw the plots. For didactic
purposes, I would personally recommend to replace 12 in the first vector or 6 in the second
with NA and see what happens to both beside=TRUE and beside=FALSE barplots. It would be
also most instructive to replace then NA with 0 and redraw the plots again to see the difference.
diagrams, not to mention a whole separate ggplot2-based graphical system;
they all are postponed to the chapters to come). This section deals with four
rather common issues which, however disconnected they appear, are all much
needed at the very first steps of exploratory analysis and data visualisation. The
first subsection will deal with drawing straight lines and curves based on
equations reflecting an ideal relationship between x- and y-axis variables (something
like y = a + b · x or y = x²). The second will be devoted to the rather bizarre
issue of the ways to add a third variable to the two-dimensional scatter plots.
The third will treat the so-called graphical primitives (you are already familiar
with some of them, like points(), but this section will add more). The fourth
will lead you into the depths of the basic R graphical system to reveal an important
secret of graphical functions: like many other functions in R, they count a lot
before they draw, and we shall learn how to get more direct access to the
results of these calculations and make use of them.
y = a + b · x
where a is responsible for up and down shifts of the line along the y-axis, while
b represents the ratio of y1 − y0 to x1 − x0 , or how much y changes when x
increases by 1 (which is also known as the tangent of the angle between the x-
axis and the function line). It is important to mention that, just like points()
and lines(), the abline() itself is unable to call a new plot, so it can be used
only to add elements to an existing plot (whether empty or full of data points),
not to create a new one.
# Creating an empty plot;
plot(-10:10, -10:10, type="n")
# Drawing a number of lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=6-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;
As one may see from this code and the results in fig. 25 (top-left), abline()
uses some standard formatting arguments like lty for line type (see the section
Figure 25: Top-left: just some lines (solid: y = x, dashed: y = 6 − x/2, dotted:
y = −7 + 2·x, dashes-and-dots: y = 4 + x·√2); top-right: same as left, with the h and
v arguments of abline() tested. Bottom: the use of lm() with abline(). See text
for explanations and code details.
on the time series plots, esp. fig. 18 for more details), lwd for line width, and
col for line colour (feel free to try lwd and col out on your own).
There are two special cases, when the line runs parallel to either of the axes.
For the lines running parallel to the x-axis, it is enough to specify the single
y-value, and vice versa (the generic forms would be abline(h=y) and abline(v=x)
accordingly; h and v stand for horizontal and vertical respectively).
Unlike a and b, h and v can be defined with numerical vectors or with the
output of some vector-generating function like seq() or c(). However strange it
may sound, vertical and horizontal lines may turn out to be most useful, e. g. when
one needs to draw a co-ordinate grid of a desired density or to highlight some x- or
y-axis values.55
55 Technically, a horizontal line at some a value may result also from b=0, and a vertical
one (located at x=0) from a b which is big enough to be practically indiscernible from infinity.
Anyway, the use of h and v arguments makes the function call shorter, and the results more
flexible. Grid as such can also be added with a less flexible but sometimes easier-to-use grid()
52
What is most surprising, however, is that h and v can be used simultaneously
within a single function call. The code that follows will add the axes at zero
level and a co-ordinate grid to the previous plot (I suggest redrawing it from
scratch because the axes and the co-ordinate grid should be in the back-, not in the
fore-ground; see fig. 25, top-right):56
# Re-creating an empty plot;
plot(-10:10, -10:10, type="n")
# Adding the grid;
abline(h=seq(-10,10,1), v=c(-10:10), lty=3, col=8)
# Adding the axes;
abline(h=0, v=0)
# Adding back the lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=6-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;
There is also a special case with a and b, when they are derived from the re-
sults of the linear least squares approximation (also known as linear regression).
Regression analysis itself is a rather big topic, and it does not deserve to be
treated lightly.57 Its general idea, however, is rather simple. Assuming a simple
linear relationship between two variables, we are trying to find a straight line
which fits the cloud of data points in the best possible way. There are different
ways to define which way is the best possible. Least squares way means that
we are trying to find a line which minimises the sum of squared deviations of
data points from their projections on this line.58
The linear regression parameters (a and b for our line equation and a number
of other important parameters providing information on the goodness of
fit and statistical significance of the estimates) are calculated in R with the lm()
function (its name is derived from the Linear Model, a complex
of parametric methods unifying the linear least squares regression in its many
versions and the analysis of variance). The abline() function is trained to pick
these estimates right from the lm() function's output. The generic form of the
call is abline(lm(y~x)). With our usual training datasets, we can try this on
students’ stature and body mass (see fig. 25, bottom-left for results).
# Plotting the data points;
plot(students.df$HEIGHT, students.df$MASS,
pch=20, col=rgb(0,0,0,.3),
xlab="Stature, cm", ylab="Body mass, kg")
# Adding regression line;
abline(lm(students.df$MASS ~ students.df$HEIGHT), col="red", lwd=3)
It is intuitively clear that, on average, the taller the student, the heavier
she or he is likely to be. The red line illustrates this general trend. As I
said, I am not going to enter here any deeper into the interpretation of the regression
function (reading help(grid) is strongly advised).
56 Note, please, that I added h and v using different sequencing functions to emphasise that
the lines that can be drawn between all possible pairs of data points (the so-called Theil —
Sen estimator).
analysis results, for this would distract us from our more immediate agenda.
However, we need to know where the coefficients came from. The result of the
lm() function is an object of the list class.59 One of its elements stores the values
we need. Little surprise, the element's name is coefficients.
> lm(students.df$MASS ~ students.df$HEIGHT)$coefficients
(Intercept) students.df$HEIGHT
-85.2025591 0.8520774
Now, we shall try and re-create the line using the values we see. They are
provided in the same order as they are used in abline(), so the equation
is, in rounded terms:60
y = −85.2 + 0.852 · x
In the following line of code, I use a different line type and colour to see both
the original red and the new blue:
abline(-85.2, .852, col="blue", lty=2, lwd=3)
59 You may inspect the full structure of this object with str(lm(students.df$MASS ~ students.df$HEIGHT)).
60 The bizarre side-result that a 100 cm tall human should have a zero weight and that shorter
people are at risk of defeating the laws of gravitation comes from the over-simplified nature
of the model we used. The relationship between the body mass and the stature is rather
non-linear and it changes with age. Real-life 100 cm tall kids possess a body mass of about
16 kg, and 50 cm newborns about 3.5 kg. The fact that a is nearly exactly a hundred times
bigger than b is, of course, just a coincidence.
Figure 26: y = x² depicted with curve() under different values of n: top-left n=1,
top-right n=2, bottom-left n=3, bottom-right n=10. See text for explanations and code
details.
Note, please, that what the latter line of code draws is, in fact, not at all
a curve. It is a straight dotted line at y = x (which may also result from
abline(0, 1)).
Here are some more elementary functions to make use of the math abundantly
built into R (see fig. 27, right for results):
curve(sin(x), from=-2*pi, to=2*pi, ylim=c(-2, 2)*pi)
curve(exp(x), lty=2, add=TRUE)
curve(log(x), lty=3, add=TRUE, n=303)
abline(h=0, v=0)
Note, please, the warning message (NaNs produced) after the last line of code.
This is because natural logarithms are defined for positive numbers only, and
the x range (inherited from the first function call, that for sin(x)) contains
negative values. I had to use n=303 this time, because the default n=101 produces
a broken line on some plotting devices, there being not enough points to make the
curve look smooth.
The functions used with curve() can be very complicated. An interesting
challenge you may entertain on your own is to draw the probability density
function for the normal (Gaussian) distribution (you may start with µ = 0 and
σ = 1 and then try something more challenging):
y = 1/√(2πσ²) · e^(−(x−µ)²/(2σ²))
or a circle:
x² + y² = 1
The latter may seem even more challenging than the former, because the
function expression within curve() is not allowed to contain any y.
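If you later want to check your solution, one possible approach (a sketch that spoils the exercise a little, so look away if you prefer to solve it yourself) is to write the density out explicitly and to plot the circle as two semicircles:

```r
# Normal density with mu = 0 and sigma = 1, written out explicitly:
curve(1/sqrt(2*pi) * exp(-x^2/2), from=-4, to=4)
# The unit circle as two semicircles, y = +/- sqrt(1 - x^2):
curve(sqrt(1 - x^2), from=-1, to=1, n=1001, asp=1, ylim=c(-1, 1))
curve(-sqrt(1 - x^2), add=TRUE, n=1001)
```

The asp=1 argument fixes the aspect ratio at 1:1, so the circle looks like a circle rather than an ellipse.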
add the third variable to the seemingly 2D scatter plot. In fact, we already did
it (see fig. 11 and two bottom graphs from 13). Now, we shall treat this topic
in a slightly more systematic manner.
There are three cases when we may need to circumvent the limitations of the
2D scatter plot: (1) we need to see how a third ‘qualitative’ variable fits with
the pattern formed by the two quantitative ones that form the original scatter;
(2) same for a ‘quantitative’ variable; (3) a special case when the data points
are so numerous that they overlap each other, and we want to see the pattern
formed by this overlap. The latter case is also known as ‘2D histogram’.
The two former cases are rather simple. To incorporate the third variable
without abandoning the 2D-scatter layout, we change the appearance of the
data points. A qualitative variable can be encoded with colour (col) or shape
(pch) of the data points. We did it before (see fig. 11, right; 13, bottom, and
the appropriate code in the section on the scatter plots).
A quantitative variable can be encoded with a colour scale (this is rather tricky
both to plot and to interpret, and I am not going to teach it here) or directly with
the data point size (cex). The latter way (the use of cex) requires a little trick.
The point is that cex sets the height of the plotting character, not its area,
which means that a character of cex=2 is twice as high as that of cex=1 but
four times as big in terms of the occupied area. You may easily check it visually
with:
plot(1,1,pch=0)
points(1,1,pch=0,cex=2)
This means that when using cex to encode a quantitative variable, we need
to sqrt() the numeric vector supplying the cex values. Let us try it with the
2008 Russian presidential elections dataset. The following graph represents the
upper-right corner of the turnout-result plot with data points proportional to
the (alleged) number of ballots cast (see fig. 28 for the results; note, please, that
we also have to divide sqrt(pres.2008$VOTED) by some arbitrarily chosen number,
36 in our case, which is big enough to make the data points proportional to the
plotting area):
# Loading data and adding variables:
pres.2008 <- read.csv("pres_2008.csv", h=TRUE)
pres.2008$VOTED <- pres.2008$BALL.VALID + pres.2008$BALL.INVALID
pres.2008$TURNOUT <- pres.2008$VOTED/pres.2008$VOTERS
pres.2008$MEDVEDEV.sh <- pres.2008$MEDVEDEV/pres.2008$VOTED
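The plotting call itself might look as follows (a sketch: the axis limits for the 'upper-right corner' are my assumptions, not taken from the original script):

```r
plot(pres.2008$TURNOUT, pres.2008$MEDVEDEV.sh,
     xlim=c(0.9, 1), ylim=c(0.95, 1),   # upper-right corner (assumed limits)
     cex=sqrt(pres.2008$VOTED)/36,      # sqrt(): point area, not height, tracks VOTED
     xlab="Voters' turnout at a polling station",
     ylab="Share of ballots cast for D. Medvedev at a polling station")
```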
The case of overlapping points is much more complex than the previous two.
There are several ways of dealing with this problem, each with advantages and
disadvantages of its own. Some can be used with the original data, others require
a rather sophisticated data transformation.
The easiest way, perhaps, is to use semi-transparent data points with the original
data (e. g. by setting the colour to col=rgb(0,0,0,.3), see fig. 29). It requires no
data transformation and produces visually tolerable results. On the other
hand, it lacks precision when it comes to the issue of a density scale.
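A minimal sketch of the idea, reusing the variables computed above:

```r
plot(pres.2008$TURNOUT, pres.2008$MEDVEDEV.sh,
     pch=16, col=rgb(0, 0, 0, .3))  # each point is 30% opaque; overlaps darken
```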
Figure 28: Representation of a third quantitative variable on a 2D scatter plot via
the data point size using cex: 2008 presidential elections in Russia, the size of the
data points is proportional to the number of ballots cast at the polling station. Note
the thin diagonal rays formed by small polling stations (an arithmetic artifact), and
the clusters of bigger polling stations at integer percentages forming a rather weak
but discernible grid-like pattern (evidence of electoral fraud). See text for code and
details.
Figure 29: Representing density of analytic individuals in a scatter plot using colour
transparency with rgb() function. Cf. fig. 30, 31, 32.
There are plotting functions which can present density levels in a more
explicit way, but they require data transformation because, in addition to x- and
y-coordinates, they require a z-coordinate, presented in the form of a density
matrix. We shall use two of them (image() and contour(), see fig. 30, 31, and
32), but there are more.
Figure 30: Representing density of analytic individuals in a scatter plot using the
contour() function. This method requires data transformation aimed at the extraction
of the density matrix. See text for details. Cf. fig. 29, 31, 32.

Before getting to the plots, however, we need to perform the required data
transformation. To create the 2D density matrix, we need to extract as many
subsets as there are cells in the matrix. I have chosen a 101×101 matrix (101 —
one for each whole percentage point from 0 through 100) of centered 2D bins (I
hope you have already noticed the similarity to the histogram; the only difference
is that the histogram is one-dimensional):
pres.2008.TMs.m <- NULL
pres.2008.TMs.m <- as.data.frame(pres.2008.TMs.m)
j <- NULL
i <- 1
while(i <= 101){
j <- 1
while(j <= 101){
pres.2008.TMs.m[i,j] <- nrow(subset(pres.2008,
pres.2008$TURNOUT > (i-1.5)/100 &
pres.2008$TURNOUT <= (i-.5)/100 &
pres.2008$MEDVEDEV.sh > (j-1.5)/100 &
pres.2008$MEDVEDEV.sh <= (j-.5)/100
))
j <- j+1
}
i <- i+1
}
Figure 31: Representing density of analytic individuals in a scatter plot using image()
function. This method requires data transformation aimed at the extraction of the
density matrix. See text for details. Cf. fig. 29, 30, 32.
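The contour() call below uses a vector of explicit contour levels, zlvl, defined in a part of the script not reproduced here. A plausible sketch (the exact values are my assumption, not the author's):

```r
# Hypothetical contour levels: powers of two span a wide range of bin counts
zlvl <- 2^(1:8)  # 2, 4, 8, 16, 32, 64, 128, 256
```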
# Plotting contour()
contour(0:100, 0:100, as.matrix(pres.2008.TMs.m),
levels=c(1, zlvl), col = "black", lty = "solid",
method="flattest", labcex=1,
vfont=c("sans serif", "bold"),
xlab="Voters' turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")
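For what it is worth, the same 101×101 density matrix can also be built without explicit loops (a sketch, not the author's code; the bin edges mirror the centered bins of the while-loop above):

```r
brks <- seq(-0.005, 1.005, by=0.01)  # 102 break points -> 101 centered bins
pres.2008.TMs.m2 <- as.data.frame.matrix(
  table(cut(pres.2008$TURNOUT, breaks=brks),       # rows: turnout bins
        cut(pres.2008$MEDVEDEV.sh, breaks=brks)))  # columns: result bins
```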
Figure 32: Representing density of analytic individuals in a scatter plot combining
image() and contour() functions. This method requires data transformation aimed
at the extraction of the density matrix. See text for details. Cf. fig. 29, 30, 31.
Here we see three variables: x=0:100, the same for y, and a matrix of z
values, then the explicitly set levels, and three arguments controlling the appearance
of the numeric labels at the isolines (method="flattest" means that the numbers
are to be placed at the 'flattest' segments of the isolines, labcex sets the character
expansion factor for the labels, and vfont defines a font family for the labels,
etc.)61 Note, please, that we cannot use the usual 'share' values ranging
from 0 through 1. In this graph, just as in the following two, they are replaced
with percentages (ranging, naturally, from 0% through 100%).
The image() function is widely used to produce so-called heatmaps. A
heatmap is a plot visualising a 2D matrix in a most direct way by mapping
a matrix of numbers to a matrix of pixels (or bigger rectangles) coloured ac-
cording to the values of the matrix cells. It can be used in our case as well.
It does not require explicit levels and splits the data point density range into
61 As usual, I strongly recommend studying help(contour).
equal intervals according to the colour palette used. It looks more spectacular
against a dark background, so I've chosen a good old dark blue background
(bg="#001133") and white foreground. When the slopes of the density peaks
are too steep, they can be levelled down with a log-transformation (which we
had to use in this particular case too; see fig. 31 for the graphical results of the
following code):62
par(bg="#001133", fg="white", col.axis="white", col.main="white", col.lab="white")
image(1:101, 1:101, log(as.matrix(pres.2008.TMs.m)),
xlim=c(0,101), ylim=c(0,101), lty = "solid", col=terrain.colors(12),
axes=FALSE,
xlab="Voters' turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")
The two functions can be combined to enhance the readability of the plot
(fig. 32):63
par(bg="#001133", fg="white", col.axis="white", col.main="white", col.lab="white")
image(1:101, 1:101, log(as.matrix(pres.2008.TMs.m)), xlim=c(0,101),
ylim=c(0,101), lty = "solid", col=terrain.colors(12),
axes=FALSE,
xlab="Voters' turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")
contour(1:101, 1:101, as.matrix(pres.2008.TMs.m),
levels=zlvl, col = "black", lty = "solid",
method="flattest", labcex=1,
vfont=c("sans serif", "bold"),
add=TRUE)
de-magnifying glass for 'big' numbers (consider: log(2) ≈ 0.69, log(4) ≈ 1.38, log(16) ≈ 2.77,
log(256) ≈ 5.55; thus, log(256)/log(2) = 8 while 256/2 = 128). To revert the plotting console
to the default values, close it, please, with dev.off() after reviewing the plot.
63 To revert the plotting console to the default values, close it, please, with dev.off() after
reviewing the plot.
Figure 33: Line segments(), left, and arrows(), right. Note, please, different shapes
and positions of arrowheads. See text for code details. The size of arrowheads is
proportional to other parts of the graph, so, in your plots, they may appear larger or
smaller by default.
Figure 34: An example of rect(), left, and polygon(), right; the numbers in the
right plot are added with text(). Pay attention, please, to the difference between
polygons 4 and 5. See text for code details.
Every set of arrows can be adjusted with respect to the length of the
arrowheads (length argument, defaults to 0.25), the angle between an arrowhead's
edges (angle argument, defaults to 30°), and the arrowhead's position (code
argument, defaults to 2).65 Note, please, the code that corresponds to the
arrows running from (1, 8) to (8, 9) and from (6, 4) to (10, 2). code=1 sets
the arrowhead at the beginning (x0, y0), while code=3 sets arrowheads
at both ends. And, of course, every linewise element can be decorated with
different colours (col argument), line types (lty), and line widths (lwd).
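A minimal sketch of these arguments in action (coordinates chosen to match the arrows described above; the third call merely illustrates length and angle):

```r
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10))  # an empty canvas
arrows(1, 8, 8, 9, code=1)                          # arrowhead at the start (x0, y0)
arrows(6, 4, 10, 2, code=3)                         # arrowheads at both ends
arrows(2, 1, 9, 5, length=0.15, angle=20, lwd=2)    # a smaller, narrower head
```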
More complex geometrical figures can be drawn with rect() and polygon().
Even though rectangles can be considered polygons from the viewpoint of
geometry, they are far simpler from the perspective of defining their coordinates
(mostly because the rect() function can only draw rectangles with sides
parallel to the sides of the plotting area). Thus, to define a rectangle's position
and size, one needs to specify its bottom-left and its upper-right corners, and
that's it. The following script draws the three rectangles in fig. 34, left:
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot
rect(xleft=c(1, 2, 6), ybottom=c(1, 3, 0), xright=c(3, 8, 10), ytop=c(4, 8, 1))
An arbitrary polygon can be drawn with the polygon() function. Its main
arguments are two vectors defining x and y co-ordinates of the polygon’s vertices.
The following script draws the polygon no. 1 in fig. 34 (right):
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot
65 The arrows within a given set have to be uniform with respect to the shape and position
of their arrowheads.
p.1.x <- c(2, 1, 1, 3, 5, 4) # X coordinates for polygon no. 1
p.1.y <- c(0, 1, 3, 5, 2, 1) # Y coordinates for polygon no. 1
polygon(p.1.x, p.1.y)
One and the same vector can define more than one polygon. To do so, we
need to insert NA to separate data for different polygons. E. g., the polygons no.
2 and 3 (fig. 34, right) are the results of the following code:
p.2.x <- c(1, 2, 3, 4, NA, 7, 5, 9, 8) # X coord. for polygons no. 2 and 3
p.2.y <- c(6, 8, 9, 5, NA, 5, 5, 8, 9) # Y coord. for polygons no. 2 and 3
polygon(p.2.x, p.2.y, density=c(12, 24))
As you see from the example polygon 3, the R polygons may contain inter-
secting edges. This, in turn, may affect the polygon background fill pattern,
which, in this rather special case, is regulated with fillOddEven argument (de-
faults to FALSE). I shall use two pentagrams to illustrate this point (fig. 34,
right: 4 and 5).
p.3.x <- c(6, 7, 8, 5.5, 8.5) # X coordinates for polygon no. 4
p.3.y <- c(0, 3, 0, 2, 2) # Y coordinates for polygon no. 4
polygon(p.3.x, p.3.y, density=12, angle=-45)
p.4.x <- c(7.5, 8.5, 9.5, 7, 10) # X coordinates for polygon no. 5
p.4.y <- c(2.5, 5, 2.5, 4.5, 4.5) # Y coordinates for polygon no. 5
polygon(p.4.x, p.4.y, density=12, angle=-45, fillOddEven=TRUE)
The last thing to learn in this section is how to add text labels to annotate
graphs. As we have added already a lot of polygons to our plot, we may need
to somehow label each of them just as they are labelled in fig. 34. The text
labels can be added with the text() function. Its most vital arguments specify
co-ordinates of the invisible points to which the text labels are attached (x and
y), contents of text labels (labels), and positions of the text labels relative
to the invisible points. By default, the text label is horizontally and vertically
centered at the invisible point, but it may occupy four standard positions around
the point, specified in an already familiar manner: they are numbered from 1
through 4 clockwise, starting from bottom.66 The code that adds text labels to
the fig. 34 (right) is as follows:67
text.x <- c(1, 1, 8, 7, 8.5)
text.y <- c(1, 6, 9, 3, 5.7)
text(x=text.x, y=text.y, labels=c(1:5), pos=c(2, 2, 3, 2, 4))
In real-life situations, all these graphical primitives can be more useful
than it may seem after this cursory acquaintance. Needless to say, R itself
relies upon some of them when plotting. Histograms and barplots can be rebuilt
from rectangles. By adding a segment, a couple of modified arrows, and, maybe,
a few points to a rectangle, we get a boxplot. We have not yet dealt with any
maps but, as we shall see soon, some of them are nothing but collections of
coordinates for plotting sets of sophisticated polygons, line segments, and dots
packed into more complex objects. Social network graphs are generally built of
points and line segments or arrows. Text labels employed in any graph are
printed with text().
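To make the point about rectangles tangible, here is a minimal sketch (my illustration, with made-up counts) rebuilding a three-bar barplot from rect() alone:

```r
counts <- c(5, 9, 4)  # hypothetical frequencies
plot(1, 1, type="n", xlim=c(0.5, 3.5), ylim=c(0, 10), xlab="", ylab="Frequency")
rect(xleft=1:3 - 0.4, ybottom=0, xright=1:3 + 0.4, ytop=counts, col="grey")
```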
66 Compare this to the axis() function discussed above.
67 There are other important arguments of the text() function, and I would suggest studying
them in greater detail experimentally after reading help(text).
[Figure 35 legend: Russian universities except Dorpat; Dorpat; other European, mostly
German, universities. Y-axis: faculty members, 0 to 80.]
Figure 35: Using polygon() function. Temporal dynamics of the Dorpat university
faculty composition. Areas representing faculty members who graduated from different
universities are coloured differently. Note the dramatic increase of graduates of Russian
universities in the 1890s. See text for code details.
# An empty plot
plot(dpt.dyn$YEAR, dpt.dyn$TOT, type="n", ylim=c(0,85),
xlab="Timeline", ylab="Number of persons")
# Adding polygons
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$TOT, 0), col="#0072CE")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, (dpt.dyn$U1.DPT.TOT + dpt.dyn$U1.EUR.TOT), 0), col="black")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$U1.EUR.TOT, 0), col="white")
Figure 36: Using polygon() function. Presidential elections in Russia, March 18,
2018. Original voters' turnout histogram (blue) and the results of Monte-Carlo
simulation (thick red line — mean simulated frequency, gray shaded area between thin
red lines — mean ±3 standard deviations). Note the thin blue peaks extending far
beyond the corridor of likely variation: it is highly likely that the area marked with
peaks contains the results of mass data forgery. See text for code details.
its lower border is formed with data points for the mean minus three standard
deviations listed in the reverse order:68
# Reading datasets, adding calculable variables;
ru.2018 <- read.table("ru.2018.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)
ru.2018$VOTED <- ru.2018$BALL.VALID + ru.2018$BALL.INVALID
ru.2018$TURNOUT <- ru.2018$VOTED / ru.2018$VOTERS
ru.2018$PUTIN.share <- ru.2018$PUTIN / ru.2018$VOTED
ru.hist.MC <- read.table("ru.hist.MC.txt", h=TRUE, sep=" ", stringsAsFactors=TRUE)
# Adding the polygon for the MC simulation mean +/- 3 standard deviations;
polygon(
x = c(ru.hist.MC$PCT, ru.hist.MC$PCT[1001:1]), # Note reverse order ([1001:1])!
y = c(ru.hist.MC$MEAN + 3 * ru.hist.MC$SD,
(ru.hist.MC$MEAN - 3 * ru.hist.MC$SD)[1001:1]), # Note reverse order ([1001:1])!
col="grey", border=NA) # the fill and border are assumptions based on the caption
68 See the code appendix for this section for the details of data pre-processing. The Monte-
Carlo simulation script is included as a bonus. It should be noted, however, that it takes rather
long to run it with a reasonable number of iterations. My worn Dell Latitude 3440 with its
1.70 GHz Intel and 4 GB RAM spent about eight hours going through 1 000 iterations. For
those who prefer not to go this deep, a ready-made dataset for this specific graph is provided
in a separate file.
[Figure 37 timeline labels (Nomina): Wender, Tennemann, Reinhold, Chalybaeus,
Schwegler, Erdmann, Weigelt, Ueberweg, Dühring, Stöckl, Deter, Kirchner,
Windelband, Falckenberg, Eucken.]
The third, and final, graph (fig. 37) combines segments and rectangles to
display the timeline of publication of German textbooks in the history of phi-
losophy. The script generating this latter graph is rather long (and it requires
rather specifically structured data), so I do not provide it here, but you may
try to develop something similar. All you need is a module which calculates
the positions of the individual timelines (thereby also determining an appropriate
aspect ratio for the picture) and a module drawing the segments, rectangles, and
text labels.
and diverse list-like complex objects. Some of these resulting objects can be
meaningfully fed into the plotting functions to produce plots. In fact, we did it
all the time. Remember the basic plots we produced with the plot() function?
plot() is super-smart, to the degree that it is nearly omnipotent. Every time,
plot() identified the kind of object and the nature of its elements and
produced an appropriate kind of plot, whether we wanted it or not.69
It is important to know that some (but not all) plotting functions also per-
form data transformations and even create objects. We never saw them because
these objects are usually re-directed to some other, more basic plotting functions
and never appear in public. These objects, however, can be extracted, reviewed,
and, sometimes, used for custom plotting. Of the plotting functions we exten-
sively used before, only plot() and the ones involved with graphical primitives
(points, lines, segments, etc.) do not produce anything but graphics. All other
functions create objects of some sort. Not all of these objects are extremely
useful, but two of them (the ones resulting from hist() and boxplot()) are
worth studying in depth.
To bring the hidden object to light, it is enough to divert the function’s
output into a named object. E. g. the line:
students.mass.hist <- hist(students.df$MASS)
Figure 38: Plotting students.mass.hist with plot(). Left: a nearly raw function
call, right: ylim adjusted. See text for code details.
plot(students.mass.hist,
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)
The only thing that remains to be fixed (fig. 38, left) is the ylim of the plot:
the label for the most frequent bin, from 50 to 55 kg, extends slightly beyond
the plotting area. In the following code, I add a little bit to the maximal value
of counts to set the new ylim (fig. 38, right):70
plot(students.mass.hist, ylim=c(0,max(students.mass.hist$counts)+5),
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)
for details). Here I skipped this step for brevity by just adding +5 but this is not generally
advisable.
71 Note, please, the notation for matrices (stats and conf): vectors defining the number of
rows and columns are separated with commas. E. g. [1:5, 1] in stats means that there are
five rows and one column.
$ conf : num [1:2, 1] 57.6 60.4
$ out : num [1:6] 87 85 90 85 98 90
$ group: num [1:6] 1 1 1 1 1 1
$ names: chr "1"
>
It contains the y-axis co-ordinates for the elements of the main part of the box-
and-whiskers plot (stats), the number of elements in the analysed vector (n),
the 95% confidence interval for the median (conf),72 a vector of y-axis co-ordinates
for the outliers (out), the variable (group) to which the outliers belong (in our
case, all six belong to the single variable under consideration), and the grouping
variable name (names; this time it was arbitrarily assigned to 1 because no
grouping variable was provided).
As boxplot() can handle several variables at a time, or split a single
numerical variable on the basis of a grouping variable, the objects it creates
reflect this as well. In these cases, the matrix elements acquire additional
columns, and the vector elements acquire additional entries. To illustrate this,
let us build a multiple boxplot using the same body mass as the numeric variable
and the students' sex as the grouping variable:73
> students.mass.2.box <- boxplot(students.df$MASS ~ students.df$SEX)
> str(students.mass.2.box)
List of 6
$ stats: num [1:5, 1:2] 40 52 56 62 77 54 64 69 76 90
$ n : num [1:2] 168 52
$ conf : num [1:2, 1:2] 54.8 57.2 66.4 71.6
$ out : num [1:2] 82 98
$ group: num [1:2] 1 2
$ names: chr [1:2] "f" "m"
>
Note, please, also the differences in the out, group, and names elements.
The number of the outliers changed because the general shape of the variable’s
72 The limits are identified with the following formula: M ± 1.57 × IQR/√n, where M stands
for the median, IQR for the interquartile range, and n for the number of elements in the vector
under consideration (see: John M. Chambers, William S. Cleveland, Beat Kleiner, & Paul A.
Tukey (1983) Graphical Methods for Data Analysis. Pacific Grove, California: Wadsworth &
Brooks/Cole Publishing Company: p. 62).
73 Note the changes in the matrices stats and conf.
frequency distributions did (the distribution of mass within each sex group is
more compact, hence the smaller number of outliers). The group now indicates
that the first outlier belongs to the first sub-series, and the second to the
second. And names now contains names for the sub-series derived from the
values of students.df$SEX used to group students.df$MASS. The group, by
the way, also offers an important insight into how the x-axis values for boxplot
elements are generated (they are assigned integer numbers from 1 to N, where
N is the number of vectors or sub-series under comparison).
The statistics routinely produced and discarded by boxplot() were consid-
ered important enough to deserve a special non-plotting function that brings
them to light:
> boxplot.stats(students.df$MASS)
$stats
[1] 40.00 53.00 59.00 65.75 84.00
$n
[1] 220
$conf
[1] 57.64182 60.35818
$out
[1] 87 85 90 85 98 90
>
This latter function, however, cannot process more than one variable at a
time.
As I said, not all objects created by plotting functions are extremely useful.
You may also try barplot() and curve() on your own and see what the
resulting objects contain and what good plot() can extract from them.
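For instance, a quick sketch for barplot() (with made-up heights):

```r
heights <- c(3, 7, 5)       # hypothetical frequencies
bar.x <- barplot(heights)   # barplot() invisibly returns the bar midpoints
str(bar.x)                  # the x co-ordinates of the bars
text(x=bar.x, y=heights, labels=heights, pos=3)  # label each bar, histogram-style
```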
their ‘pixels’ and printers print with small dots).74 The R graphical system is
inherently vectorised but it easily produces both vector and raster graphics with
its numerous virtual plotting devices (we already used one of them, png()).
Vector graphics in R has its limitations. Neither pdf() nor other vector
plotting devices can handle transparency (hence, no semi-transparent colours
set with the alpha argument of rgb()). The vector plotting devices are also
notoriously problematic when working with glyphs beyond basic Latin characters.75
The uses of raster and vector graphic files produced by R are slightly differ-
ent. Raster graphics files work better for a quick preview (the standard graphics
preview software under all major operating systems handles them by default and
allows for flipping through large amounts of images with a keystroke). Vector
graphics files fit best with electronic publishing as quality book and journal il-
lustrations. Of course, there is a lot of flexibility (and you are not at all bound to
use raster for on-screen previews and vector for books); however, every time you
pick a particular flavour of raster or vector image, make your choice consciously
and not without some deliberation.
The first subsection will treat the specific problems of raster graphics (PNG,
JPEG, and TIFF), the second, those of vector graphics (SVG, EPS, and PDF),
and the third, a rather special problem of colour models (sRGB vs. CMYK),
which is relevant if you are going to prepare colour graphics for printing on
paper.
instructions immediately.
75 It is possible to force it into printing Cyrillics, e. g., but neither the process itself, nor its
files worth 6 megapixel or 3008 × 2000 pixels, my more recent compact camera
does 20 megapixel or 5184 × 3888 (which look, in fact, worse than those taken
with the 6 mp DSLR). The pixel size of the image file as measured in pixels
remains the same as long as we do not crop or resize it forcibly with a raster
graphics editor.77 Each dot or pixel (or groups of them if some method of
compression is employed) is described more or less economically in terms of
colour. This means that the image files of the same pixel size are not necessarily
of the same byte size.
The byte size of an image file depends, besides the pixel size, on many things,
among which the arguably most basic is the number of shades of colour reserved
to describe each pixel. The minimum is, understandably, two, black vs. white, so
each pixel can be described with just one bit (1 vs. 0), and the maximum (not
reachable within the R graphical system) currently extends to 48 bits per pixel.
The so-called true colour scheme as defined in the RGB colour space (based on
the representation of each colour as a mixture of red, green, and blue) uses 32
bits, eight to encode the intensity of each of the three colour channels (which
makes 24 bit) and eight for transparency (or unused). Thus it is capable of
describing up to 256³ = 16 777 216 colours, which is formally a lot bigger than
a human eye could discern.78 Another important thing contributing to the byte
size is compression method and the degree of compression used but we shall not
go deep into this now.
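The arithmetic is easy to reproduce in R itself; a sketch for the 6000 × 3600 image discussed later in this section (the variable names are mine):

```r
# Uncompressed size of a 24-bit true-colour image: width * height * 3 bytes
width <- 6000; height <- 3600
size.bytes <- width * height * (24 / 8)
size.bytes / 2^20   # about 61.8 MiB before any compression is applied
256^3               # the number of representable 24-bit colours
```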
The resolution is a bit more confusing (because it is often abused). To cut it
short, the resolution is not a characteristic of the image file (even though some
image file formats also allow storing the intended resolution of the picture). The
resolution is a characteristic of the physical printing device (a computer screen
or a printer). It is measured not in pixels or dots but in pixels or dots per
inch (ppi or dpi accordingly). This means that the resolution of a displayed
or printed image is always the same as the resolution of the physical printing
device, no matter what. These days it is usually about 100 dpi for computer
screens and about 600–1200 dpi for laser printers.
One may wonder, what then happens to all these virtual pixels described in
the image file, the ones responsible for the pixel size of the image? And why
then the ‘high resolution’ images sometimes look better than ‘low resolution’
images? The answer is simple. The most straightforward way to represent a
digital image on a physical printing device is to represent each of the virtual
pixels of the former with a real pixel of the latter. This way, however, is not
always desirable.
E. g. my aged laptop screen contains slightly over 1 megapixel — 1366 × 768
pixels. What would I do with my 6 to 20 megapixel digital photos? Preview
them in parts, one sixth or one twentieth at a time? No sane person would do that (unless she is a
professional photographer or willing to drag herself into a complex and possibly
dangerous affair like that in Michelangelo Antonioni’s 1966 ‘Blowup’). The
image preview programs resize large images automatically to fit them into the
77 To crop an image means to carve a fragment out of it. To resize the image means to
shrink or expand its pixel size. Obviously, both operations change the pixel size by definition.
78 In practice, human vision is slightly more complicated, to say the least, than the formal
description of digital colours and the ability to distinguish the shades of colour is unevenly
distributed over the full spectrum. So, for a human, some of these sixteen million colours
look the same, while in some other areas of the spectrum the border between two formally
neighbouring colours is still visible.
screen. To do it they perform complex calculations (each pixel of the printing
device represents a summary of several pixels of the original digital image, and
this summary should be technically accurate and aesthetically acceptable). The
same goes for printer drivers (except that printers usually have a far higher
resolution than computer screens).79 The images of a smaller pixel size
than the screen are usually displayed 1:1 and if we would like to enlarge them,
then, again, the preview software or a printer driver recalculates the resulting
physical picture in such a way as to represent one digital pixel of the image file
with several physical pixels or dots of the printing device. Sometimes this goes
too far and the digital pixels become quite visible large squares (each displayed
or printed with a considerable number of physical pixels).
What does it all mean to us? The most important lesson is that raster
images need to be tailored to the size and resolution of the physical device
they are intended for. A small image file 500 × 500 would be enough for a
quick on-screen preview. Beamer presentations usually require slightly larger
images about 800 × 600 or 1000 × 750. Academic journals request page-wide
illustrations, which can be printed at a resolution of 1200 dpi, which, given the
average width of the printing area (5′′ ) leaves us with images 6000 pixels wide
at least.
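The rule of thumb behind all these numbers is that the required pixel size is the physical size multiplied by the target resolution; a sketch (the helper name px.needed is mine, not a standard function):

```r
# pixels needed = physical size (inches) * target resolution (dpi)
px.needed <- function(inches, dpi) inches * dpi
px.needed(5, 1200)   # a 5-inch-wide journal figure at 1200 dpi: 6000 px
px.needed(5, 100)    # the same width on a ~100 dpi screen: only 500 px
```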
less, but now it would be difficult to find a laser printer even with 300 dpi, for the majority
already has 600 dpi set as default.
80 In the following examples we shall work on the German philosophical journals publication
dynamics dataset, so, please, take the trouble and create the object j.phil on the basis of
the file dt.j.phil.1655_1970.txt.
Figure 39: Left, if only you can see it: a default 480×480 image file printed at 1200 dpi
‘as is’, without size adjustments. Right: the same, reproduced not to scale. When
previewing this manual as a PDF-file, use magnification to examine the appearance of
characters and lines in both.
The result of this code would be a small image file closely resembling fig. 39
(right) saved to the R working directory. Evidently, it is far too small for quality
printing. Printed at 1200 dpi it would turn into a square-centimetre image filled
with microscopic characters, numbers, and hair-thin lines (fig. 39, left).81
The first thing to be adjusted is obviously the pixel size of the image file (this
can be done, as we have already discussed above, with the aid of two arguments
of png() function, width and height):82
png("img_wh.png", width=6000, height=3600)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
The resulting image file (fig. 40), however, would indicate that something got
completely out of hand. While the plotting area indeed expanded to 6000×3600
pixels, the printing characters and lines remained ridiculously small. We haven’t
noticed anything of the sort when printing image file 500 × 500 only because 500
is not that different from 480. Now the vertical pixel size is 7.5 times bigger than
that of the default image. However, if you try and preview img_wh.png and
img_default.png in different viewer windows at 1:1 and juxtapose them on the
81 From this you may conclude that in all previous figures of this manual I tricked you, for
they looked far better. Yes, I did. Most of them are either high-resolution raster graphs or
vector graphs in PDF.
82 For this graph, I jumped right away to the 1200 dpi resolution. I also changed the aspect
Figure 40: Image file printed with pixel size adjusted to 6000 × 3600 all other settings
unchanged (trust me, there is something in this image, even though you probably
can hardly see it). See text for code details.
screen, you will see that the lines are of the same width, and the numbers are of
the same size in both images. Apparently they have not stretched automatically.
In R, the appearance of the graph can be mended in several ways. If we
need the way that is economical in a long run (i. e., requires less adjustments
for individual elements of the graph) and philosophically sound, then the time
has come to remember whatever we know about pixel size and resolution (see
the previous section).
The matter is that the R printing system makes default assumptions not only
about the pixel size of the printing device but about its resolution as well. To fit
better with the low-resolution computer screens of old days and to ease some
calculations, the default resolution was set to 72 dpi. The image file 480 × 480 px
reproduced 1:1 at a 72 dpi physical device makes 6.67 × 6.67′′ image, which is
pretty big (not to mention that 480 px constituted the full height of a VGA
monitor). It is close to the images R produces on-screen today. The default size
of the graphical console that pops up on my laptop is 6 × 6′′ . The resolution
of my laptop’s monitor, though, is close to 112 dpi, which is roughly 1.56 times
bigger than 72 dpi.83 Apparently, the screen images these days are rescaled to
befit bigger (in terms of pixel size) monitors with higher resolution. Anyway,
72 dpi is too far from the intended 1200 dpi resolution of our printed picture.
The intended resolution can be fixed with res argument of the same png()
function:
png("img_whr.png", width=6000, height=3600, res=1200)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
83 A widespread belief that laptop screen resolution these days is invariably 96 dpi is
plain wrong.
dev.off()
The result can be seen in fig. 41, top. The image quality has markedly
improved. The lines are now thick and solid, the letters and numbers are big
and fully readable. Their shapes are elegantly curved and bear no traces of
dithering or visible pixel dents on their margins. The same result as in fig. 41
(top) could be obtained in a slightly different way by specifying image size in
inches (note the changes in width and height assigned values and the new
units argument) and indicating explicitly its intended resolution:
png("img_whri.png", width=5, height=3, units="in", res=1200)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
However technically tolerable, the graph is still far from being aesthetically
and scholarly acceptable. The only thing to be fixed in this section,
however, is the font size, which is apparently too big for this book. To change
it in a sensible way we have to learn another bit of theory, this time from the
art of typography.
At bottom, many digital printing systems keep trying to imitate the good
old movable type. One of its features was a peculiar measurement unit (the point)
which, to make matters worse, varied slightly across countries and continents.
Roughly speaking, it was based on 1/72 of an inch (or French pouce, given all
historical variations within and adding several attempts to reconcile it with the
metric system and a number of software implementations on top). It is this
system, which through its wide use in WYSIWYG word processors is also familiar
to any ordinary computer user. When you pick 12 pt Times New Roman in MS
Office Word or 10 pt Liberation Sans in LibreOffice Writer, you are using just
it. 12 pt means 12/72, or 1/6 of an inch (≈ 4.23 mm), etc. You can try and measure
it for yourself. It should be taken into account, though, that the size of the
font defines the distance between the ascending and descending elements of font
characters (e. g., between the highest point of an ‘l’ and the lowest point of a
‘y’) and should be measured accordingly.
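The conversion from points to physical units is simple arithmetic; a sketch (pt2mm is my name for the helper, not a standard function):

```r
# 1 pt = 1/72 in; 1 in = 25.4 mm
pt2mm <- function(pt) pt / 72 * 25.4
pt2mm(12)   # a 12 pt font: about 4.23 mm from ascender to descender
pt2mm(9)    # a 9 pt caption font: about 3.18 mm
```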
The R graphical system sets the default font size at 12 pt. The paragraphs
of this manual are set in a 10 pt font, and the image captions in an even smaller
9 pt font. To make the image font proportional to the one in the caption, we
need to specify the font size explicitly and set it to 9 pt, making the characters
smaller by one quarter (note the new pointsize argument added to the script):
png("img_whrp.png", width=6000, height=3600, res=1200, pointsize=9)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
The result can be seen in fig. 41, bottom. The plot looks better but is
still far from perfect. We have fixed the image size and made the font size
proportional to the rest of the text in the book, but the plotting area is still
too small, and the graph looks too ugly for a scholarly journal. This, and many
other things concerning this graph, will be discussed in the following subsection.
Figure 41: Image files printed with crude size adjustments. Top: same as fig. 40,
intended resolution (res) set to 1200 dpi. Bottom: the same, plus font size (pointsize)
set to 9 pt.
Black and white line art in earnest
Any customer can have a car painted any
colour that he wants, so long as it is black.
— Henry Ford
85 See Simple bar plots for mai and mar, A digression: Time series and the type of plot() for
mfrow, and Stepping beyond two dimensions for bg, fg, col.axis, col.main, and col.lab. In
fact, we used it also to change the default pch and even default plot elements’ size (the latter
less systematically (in Simple bar plots) we tried to expand one of the margins
proportionally to the length of the axis labels. Now, the task is much simpler.
We need to remove extra space at the margins to give more room to the
plot itself. In a book, the margins of the page are wide enough, and the page-
wide graph should fit exactly the width of the page’s printing area, leaving no
extra space above or to the right. In case the image is located differently, the
offset from it is anyway defined by the publishing software, so you do not need
to waste the space in the image file. As you may remember, the margins are
controlled with two arguments of par() — mar and mai. The former defines
margins in terms of standard lines of text, the latter in inches. Their default
values can be previewed by calling their names with an empty function call.
> par()$mar
[1] 5.1 4.1 4.1 2.1
> par()$mai
[1] 1.02 0.82 0.82 0.42
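Before shrinking them, note that the two vectors are two views of the same margins: with the default 12 pt font one margin line is 0.2 in high (this is par()$csi), so mai is simply mar times 0.2. A sketch of the relation:

```r
# mar (margins in text lines) and mai (margins in inches) describe the
# same thing; with the default 12 pt font a text line is 0.2 in high, so:
mar.default <- c(5.1, 4.1, 4.1, 2.1)
mar.default * 0.2   # reproduces the default mai: 1.02 0.82 0.82 0.42
```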
As you see, the empty top and right margins still occupy the spatial equiva-
lent of two to four lines of text, and even the bottom and left margins containing
axis labels are too wide. To make them shrink a bit I would change the default
settings as follows:
png("img_whrp_Mar.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3, 3, 0, 0)+.1)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
The result can be seen in fig. 42, top: the margins shrank, but the axis labels
were pushed out of the visible area. The remedy is the mgp argument of par(),
which takes three values. The two former values specify positions of the axis label and tickmarks labels.
The third affects the axes themselves.
png("img_whrp_MarMgp.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3, 3, 0, 0)+.1, mgp=c(2.1, .6, 0))
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
This code brings axis labels back in (fig. 42, bottom). Now, we need to get
rid of the black border around the plot area and ponder whether we should fix
the axes and tickmarks (after removing the box around the plot the axes may
is not advisable in case you work on quality images for academic publishers but it may pass if
you need a quick preview of a bigger size and do not wish to enter the dark forest of image
sizes, fonts, and resolutions we are wandering now). I recommend consulting help(par) to
have a glimpse into the whole universe of graph settings.
Figure 42: Adjusting margins. Top: same as fig. 41, mar adjusted to expand the plot
area; pay attention to the dot to the left of the graph about y=35, it is the tip of the
descender of ‘q’ in ‘Unique titles’. Bottom: margin lines’ positions fixed (mgp).
appear too short, and the tickmarks already look too long for the now too close
tickmark labels).
The box around the plot can be removed in two ways. The first is to spec-
ify par()$bty (defaults to "o", which means that the box encircles the graph
entirely).86 It is more economical as it affects only the box itself, and not the
axes.
png("img_whrp_MarMgpBty.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
The results can be seen in fig. 43, top. Even though the defaults for the
plots in R graphical system were developed by expert designers and in view of
academic press requirements, the axes sometimes require further trimming. In
this particular case, I would not, perhaps, change anything but the length of
the tickmarks. I would, however, like to make use of the second way to get rid
of the box, and to review some of the other axis controls.87 In the following
code axes=FALSE argument of the plot() function suppresses both axes and
the box, and axis() is used to re-create axes with more detailed gradation (pay
attention to at and labels) and shorter tickmarks of varying length (look at
tcl).
png("img_whrp_MarMgp_at.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0))
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles",
axes=FALSE)
axis(1, tcl=-.4)
axis(1, at=seq(1660, 1970, 10), labels=FALSE, tcl=-.25)
axis(2, tcl=-.4)
axis(2, at=seq(0, 70, 10), labels=FALSE, tcl=-.4)
dev.off()
The results can be seen in fig. 43, bottom. I am not sure the axes look better
this way. Moreover, aesthetically, the top figure looks more appealing to me. Its
axes are not over-cluttered with tickmarks, and, as a result, the whole picture
gets more air. In this case, however, the tickmarks are, of course, too sparse to
help us establish a more or less precise correspondence between the ups and
downs of the journals’ population dynamics and major historical events, and,
to be honest, fig. 43, bottom, with its seemingly far more detailed axes, offers
a very moderate improvement. Would it be easy to decide whether World
War I had any effect? Whether the sharp decline in the number of periodicals
coincides with the Nazis coming to power in 1933? Probably, a better strategy would
be to introduce a regular grid against the background of which the curve would
look more informative. Or, still better, to use lines or shaded areas to indicate
important historical events or periods directly.
86 The values of bty are meant to recall the shape of the border, however remote the resem-
blance is. "l" stands for left+bottom, literally an L- (not l-) shaped frame. "7" means top+right.
"c", "u", and "]" provide top+left+bottom, left+bottom+right, and top+right+bottom
frames. Strangely enough, "n" results not in a n-shaped frame but in a complete absence
thereof.
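The frame styles enumerated above are easiest to appreciate side by side; a throw-away sketch (drawn to a temporary PDF so it also runs non-interactively):

```r
# Draw the same toy plot with each bty value to compare the frame shapes
pdf(tf <- tempfile(fileext = ".pdf"), width = 9, height = 6)
par(mfrow = c(2, 3))
for (b in c("o", "l", "7", "c", "u", "]")) {
  par(bty = b)
  plot(1:10, main = paste0('bty = "', b, '"'))
}
dev.off()
file.exists(tf)   # TRUE once the comparison sheet is written
```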
87 We have already discussed axes above in several instances.
Figure 43: Removing box and trimming axes. Top: same as fig. 42, par()$bty set to
"n". Bottom: default axes replaced (see text for the code details).
Figure 44: Same as fig. 43 (top) with meaningful historical landmarks added as shaded
areas or vertical lines (see text for the code details), left to right: the Seven Years’
War, Napoleonic Wars (in part), Märzrevolution of 1848, the Franco-Prussian War of
1870, World War I, Hitler appointed Chancellor of Germany in 1933, World War II.
The result can be seen in fig. 44. Now, the questions I posed above can be
answered easily. We see a small but marked dent at World War I, and we
clearly see that the decline of the journals’ population began well before 1933.
The only remaining problem is that the shaded areas and auxiliary vertical
lines now look as prominent as the data line and obscure the latter.
To make the data line stand out against the background, a couple more
tricks are needed. From our previous experience with stepwise graph produc-
tion we know that each complex graph in R is a multi-layered construction, and
each layer is printed in its turn. E. g. in fig. 44 we have plotted first the axes
with their labels and the data line with plot(), then added shaded areas with
rect(), and, finally, the thin vertical lines with abline().
What if we use this layered composition combined with another trick that
comes from the old days of ink-pen illustrations? To show that one of the
overlapping lines runs over another, the artists of the past made a small break
in the line that was supposed to lay in the background at the intersection with
the line that runs over it. Of course, in R one cannot do this. A visually
similar result, however, could be obtained by plotting over the background lines
first a slightly wider white line, and then a regular-width black line.88
In the code that follows, the elements of the graph are carefully layered one
upon the other: the background auxiliary elements are drawn first and in thinner
lines, the foreground element is drawn last and with a white background line
added. So, the order of layers is: (1) empty plotting area with axis labels, (2)
shaded rectangles in thinner lines (lwd=.5), (3) auxiliary vertical lines (again,
lwd=.5), (4) a wide white background for the data line (note col="white" and
lwd=5), (5) the data line itself, (6) the axes. It is important to print the axes
last. No element of the plot that touches the axes can be printed over the
latter. Axes are sacred. In this graph, out of purely aesthetical considerations,
I also extended slightly the range of x- and y-axes and shortened the tickmarks
(tcl=-.4).
png("img_whrp_MarMgpBty_l2.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="n",
xlim=c(1650,2000), ylim=c(0,80), axes=FALSE,
xlab="Timeline, years", ylab="Unique titles")
rect(xleft=c(1756, 1803, 1914, 1939),
xright=c(1763, 1815, 1918, 1945),
ybottom=-100,
ytop=100,
density=36, lwd=.5)
abline(v=c(1848, 1870, 1933), lwd=.5)
lines(j.phil$YEAR, j.phil$TOT, col="white", lwd=5)
lines(j.phil$YEAR, j.phil$TOT)
axis(1, tcl=-.4)
axis(2, tcl=-.4)
dev.off()
The result can be seen in fig. 45. Now, both the data line and the background
elements are clearly visible and visually distinct, and the graph is nearly ready
to go to the publishers.89
The same method of highlighting the foreground elements with a white outline
can be applied not only to overlapping lines (with several different data lines
you should also think of the best possible order of appearance, which ensures
readability of the graph) but to data points and text labels as well. For data
points, pch 21–25 with col="white" and bg="black" could be recommended
(see section on scatter plots above). For the text labels, there is a special
function shadowtext() from the package TeachingDemos (analogous to text()
discussed above in the section on graphical primitives).90
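If installing a package is not an option, the effect can be approximated with base graphics alone; a hypothetical helper (outline.text() is my name, not part of any package):

```r
# Print the label several times in white with small circular offsets, then
# once in black on top -- essentially what TeachingDemos::shadowtext() does
outline.text <- function(x, y, label, r = 0.1) {
  theta <- seq(0, 2 * pi, length.out = 17)[-17]
  for (th in theta)
    text(x + r * strwidth("M") * cos(th),
         y + r * strheight("M") * sin(th), label, col = "white")
  text(x, y, label, col = "black")
}
pdf(tf <- tempfile(fileext = ".pdf"))
plot(1:10, type = "l", lwd = 10)   # a thick line to label over
outline.text(5, 5, "label on top")
dev.off()
```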
88 This could be used for colour graphs with many overlapping lines as well. Just use
the background colour of the plot instead of white (for obvious reasons, in black-and-white graphs
the background colour is always white by definition).
89 The only difference is that academic publishers usually prefer TIFF over all other raster
formats, so you will either have to use the tiff() function or convert the PNG to an uncompressed
TIFF file with an external graphics editor.
90 Like other packages, it should be installed with install.packages("TeachingDemos")
Figure 45: Same as fig. 44 with image’s layered structure carefully observed, data
lines, shaded areas, and axes trimmed (see text for the code details and fig. 44 caption
for the explanation of shaded areas and vertical lines).
Presentation slides
[Section not yet complete]
Conspectus: Two elements of visual rhetoric: juxtaposition and highlighting.
Elements of animation. Precise positioning of the elements. Neighbouring slides
as parts of a very short animation sequence. Truly animated graphs. R and
ffmpeg.
Vector graphs
Unlike raster graphics, vector graphics are automatically scaled to the size of the
plotted image (and rasterised only after that), so they have fewer problems with
plot element sizes than the raster graphs described in the previous sections. On the
other hand, they have specific problems of their own. They do not support
transparency, and the vector fonts do not support UNICODE.91
The choice between different vector formats is based on the intended further
use of the resulting image files. Theoretically, EPS is preferable for the use
with desktop publishing systems, PDF for stand-alone preview or as a part
of a presentation merged from PDF files, and SVG for further editing with
vector graphics editors (e. g. InkScape or Adobe Illustrator). In practice, LATEX
converts EPS to PDF before embedding it into the resulting PDF, and post-
processing in vector graphics editors is hardly needed if you have mastered R.
and then loaded for each R session with library(TeachingDemos). After loading the package,
one may enjoy reading help(shadowtext) and using the shadowtext() function.
91 There are workarounds that allow printing, e. g., Cyrillic characters with pdf() and
postscript(), but they are so intricate that I would not dare to recommend using them.
It would be needless to repeat here everything that was written on the black
and white line art graphics in the section on raster plots. Here I shall focus only
on the few specific differences from raster graphics. Actually, the only important
difference for now is the default unit for width and height of an image.92 While
in the raster graphics, the numeric values for width and height default to pixels,
in the vector graphics, they default to inches. Below is the same code I used
for fig. 45 with necessary adjustments for PDF (pay attention to the first line
of the code). The result can be seen in fig. 46.
pdf("img_whrp_MarMgpBty_l2.pdf", width=5, height=3, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="n",
xlim=c(1650,2000), ylim=c(0,80), axes=FALSE,
xlab="Timeline, years", ylab="Unique titles")
rect(xleft=c(1756, 1803, 1914, 1939),
xright=c(1763, 1815, 1918, 1945),
ybottom=-100,
ytop=100,
density=36, lwd=.5)
abline(v=c(1848, 1870, 1933), lwd=.5)
lines(j.phil$YEAR, j.phil$TOT, col="white", lwd=5)
lines(j.phil$YEAR, j.phil$TOT)
axis(1, tcl=-.4)
axis(2, tcl=-.4)
dev.off()
Even though the thin lines look, perhaps, a bit coarser in the PDF version,
the visual differences are negligible. File size is, however, dramatically different.
The 8.5 KiB PDF is roughly 37.8 times smaller than the 321.4 KiB PNG (and
still smaller than the 20.6 MiB uncompressed TIFF of the same image). The
arguably funniest difference is visible only when viewing this manual as a PDF
file with a PDF-viewer. If you try and select fig. 46 with your mouse you will
see that some characters and numbers in it can be selected and copied as text.
With fig. 45 this would not be possible.
PDF graphs can be used not only as parts of LATEX documents but, in
their own right, as parts of multi-page PDF presentations. Nearly all PDF-
viewers, including original Adobe Acrobat Reader, can switch to the full-screen
presentation mode, so PDF is frequently used as a cross-platform presentation
format. The easiest way to merge individual PDF images into a multi-page
PDF presentation is to resort to an external application. I shall explain it using
PDFtk and qPDF.93
Both can operate from the command line and allow rather flexible ma-
nipulation of PDF files. Here are examples of two CLI-commands94 merging several
PDF graphs into one file:
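The command lines themselves did not survive in this copy; reconstructed from the description that follows, the standard invocations (file names as in the text) would be:

```shell
# PDFtk: list the inputs, 'cat' concatenates them, 'output' names the result
pdftk plot.01.pdf plot.02.pdf cat output presentation.pdf
# qPDF: start from an empty file and append the pages of each input in order
qpdf --empty --pages plot.01.pdf plot.02.pdf -- presentation.pdf
```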
92 Other differences worth mentioning are antialias and fallback resolution arguments.
Anti-aliasing (smoothing of the font and line edges) can be adjusted in several ways, see
help(svg) for details. Resolution in itself is unimportant when working with vector images
because their rasterisation is rendered on an ad hoc basis. The fallback resolution specifies,
however, the preferred resolution when sending vector images to a laser printer as a raster
(this is a standard and, sometimes, much needed option with PDF files).
93 Both are cross-platform and have free versions distributed under GNU/GPL. See https:
94 To access the command line you should call a terminal under OS X and GNU/Linux or the command prompt
under Windows.
Figure 46: Same as fig. 45 rendered as PDF (see text for the code details and fig. 44
caption for the explanation of shaded areas and vertical lines).
The code above merges two files plot.01.pdf and plot.02.pdf in the order
they are listed into one file presentation.pdf (the first line shows it for PDFtk
and the second, for qPDF ). There are more sophisticated ways to use both
PDFtk and qPDF but I leave it to you to explore them further.
To conclude this section, I have to confess that more than half of the images
used as figures in this part of the manual are, in fact, PDFs. If you read a PDF
version of the manual with a dedicated PDF-viewer, you can try and use the
mouse selection trick described above to differentiate them from PNGs at your
leisure.
on the screen, and when installing four different cartridges into a colour printer.
Besides the number of basic colours involved, there is another important differ-
ence. With the RGB model, the more colour intensity one adds, the brighter the
resulting colour (until everything ends up with white at maximum values for all
three channels). With the CMYK model, the relationship is reversed. The more
colour intensity one adds, the darker the resulting colour.
Some of R printing devices can encode colours with CMYK model. E. g.
with pdf() and postscript(), there is a special argument colormodel="cmyk"
(defaults to srgb), which can be used to print a .pdf or an .eps file with CMYK
colours directly.
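A minimal sketch (toy data, written to a temporary file; the only change against an ordinary pdf() call is the colormodel argument):

```r
# The colormodel argument switches the PDF device from sRGB to CMYK encoding
pdf(tf <- tempfile(fileext = ".pdf"), width = 5, height = 3,
    colormodel = "cmyk")
plot(1:10, type = "l", col = "red")   # the red is written as a CMYK tuple
dev.off()
file.exists(tf)
```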
Others, most notably tiff(), lack this option. In the latter case, we
need to resort to external programs. All major raster image editors can con-
vert from sRGB to CMYK. One may use PhotoShop or GIMP for this purpose.
Besides that, there is, as always, a powerful command-line tool, ImageMag-
ick.95 One of its most basic tools, convert, can do everything you need
with a single short line of code run from a terminal.
convert srgbfile.tiff -colorspace CMYK cmykfile.tiff
###############################################################################
###############################################################################
# The very first steps of analysis: descriptive statistics
###############################################################################
###############################################################################
help(sum)
sum(students.df$HEIGHT, na.rm=TRUE)
95 There are releases for all major operating systems, see: http://www.imagemagick.org
# Summary function
summary(students.df$HEIGHT)
###############################################################################
# Traditional (mean-based) estimates of variation
###############################################################################
###############################################################################
# Robust estimates of variation
###############################################################################
IQR(students.df$HEIGHT, na.rm=TRUE)
help(quantile)
###############################################################################
# Summarising a qualitative variable
###############################################################################
summary(students.df$SEX)
summary(as.character(students.df$SEX))
###############################################################################
# Summarising a data frame
###############################################################################
summary(students.df)
###############################################################################
###############################################################################
# Analysts’ nightmare: Anscombe’s quartet
###############################################################################
###############################################################################
# Applying mean() and sd() to all columns and rounding the results
# to two decimal places:
round(apply(anscombe, 2, mean), 2)
round(apply(anscombe, 2, sd), 2)
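The column-wise means and standard deviations come out (nearly) identical across the four sets; the same holds for the correlation within each x–y pair, which can be checked with a short sketch (anscombe ships with base R):

```r
# Correlation within each of the four Anscombe x-y pairs,
# rounded to three decimal places: all four come out the same
sapply(1:4, function(i)
    round(cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]), 3))
```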
abline(lm(anscombe$y3 ~ anscombe$x3), lty=2, col="red")
###############################################################################
###############################################################################
###############################################################################
#
# Grammar of graphics: six most basic graphs
#
###############################################################################
###############################################################################
###############################################################################
###############################################################################
###############################################################################
# Visualising a single variable
###############################################################################
###############################################################################
###############################################################################
# Histogram
###############################################################################
hist(students.df$HEIGHT)
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm")
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm", col="grey")
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm", density=15,
angle=45)
###############################################################################
# Digression: saving the plots as files
###############################################################################
dev.off()
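The device-opening calls themselves are not shown in this excerpt; a minimal sketch of the usual pattern (the file name and the stand-in data are hypothetical):

```r
# Open a PNG device, draw into it, then close it with dev.off(),
# which is what actually writes the file to disk
png("height_hist.png", width=600, height=400)
hist(rnorm(100), main="A throwaway example")  # stand-in data
dev.off()
```

pdf(), svg(), and jpeg() work the same way; forgetting dev.off() leaves the file incomplete.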
###############################################################################
# Simple bar plots
###############################################################################
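The bar plot code is missing from the excerpt; a minimal sketch using table() counts, with a hypothetical vector in place of students.df$SEX:

```r
# One bar per category, with bar height equal to the category count
counts <- table(c("f", "m", "f", "f", "m"))  # hypothetical stand-in data
barplot(counts, ylab="Count", main="Students by sex")
```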
###############################################################################
###############################################################################
# Visualising interdependence: bivariate plots
###############################################################################
###############################################################################
###############################################################################
# Scatter plot
###############################################################################
plot(students.df$HEIGHT, students.df$MASS, type="n", xlab="Stature, cm",
ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=3)
points(students.m.df$HEIGHT, students.m.df$MASS, pch=4)
###############################################################################
# A special case: Time series
###############################################################################
###############################################################################
# Nine values for "type" argument of the plot() function
# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)
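The exercise can be sketched as follows: the nine legitimate values of "type", each drawn in its own panel of a 3 × 3 grid (the toy series is my own choice):

```r
types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")
par(mfrow=c(3, 3))   # a 3 x 3 grid of panels
for (tp in types) plot(1:10, (1:10)^2, type=tp, main=paste("type =", tp))
par(mfrow=c(1, 1))   # reset to the default
```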
###############################################################################
# Testing "type" argument with points() and lines()
# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)
###############################################################################
# On the importance of ordering the entries chronologically
# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)
###############################################################################
# X coordinate vs. index
###############################################################################
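The point of this subsection can be sketched as follows: given a single argument, plot() uses the observation index 1, 2, 3, ... as the x coordinate, which misrepresents unevenly spaced years (the series here is hypothetical):

```r
year  <- c(1990, 1991, 1995, 2005)  # unevenly spaced, hypothetical
value <- c(10, 12, 9, 15)
par(mfrow=c(1, 2))
plot(value, type="b", main="Against index")       # equal steps on the x axis
plot(year, value, type="b", main="Against year")  # true gaps preserved
par(mfrow=c(1, 1))
```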
# A real-life time series example
###############################################################################
# Multiple boxplot
###############################################################################
plot(students.df$SEX, students.df$HEIGHT)
# Creating boxplots with plot(); mind the as.factor() transformation needed
# when numerical codes serve as entity names (e.g. a group of students is
# named by its number);
plot(as.factor(students.df$GROUP), students.df$HEIGHT)
# Attempt #1
boxplot(students.df$SEX, students.df$HEIGHT)
# Attempt #2
boxplot(students.df$HEIGHT ~ students.df$SEX)
###############################################################################
# Scatter plot with jitter
###############################################################################
# Using jitter()
# Attempt #1
plot(students.df$HEIGHT, jitter(students.df$SEX))
# Attempt #2
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX)))
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX), factor=.5),
xlab="Height, cm", ylab="Gender", pch=20, col=rgb(0,0,0,.3), axes=FALSE)
axis(1)
axis(2, at=c(1:2), labels=c("f","m"), las=1, col="white")
###############################################################################
# Structured barplots
###############################################################################
table(students.df$SMOKING, students.df$SEX)
# Experimenting; test.df is assumed to hold the table above as a data frame:
test.df <- as.data.frame.matrix(table(students.df$SMOKING, students.df$SEX))
barplot(test.df)             # fails: barplot() does not accept data frames
barplot(as.matrix(test.df))
barplot(as.matrix(test.df), beside=TRUE)
###############################################################################
###############################################################################
# Drawing math: abline() and curve()
###############################################################################
###############################################################################
curve(x^2, from=-2, to=2, n=2)
curve(x^2, from=-2, to=2, n=3)
curve(x^2, from=-2, to=2, n=10)
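The section header also names abline(), whose code is not shown above; a small sketch adding reference lines to the parabola (the tangent line is my own choice):

```r
curve(x^2, from=-2, to=2)
abline(h=0, lty=3)             # horizontal line y = 0
abline(v=0, lty=3)             # vertical line x = 0
abline(a=-1, b=2, col="red")   # intercept a, slope b: the tangent at x = 1
```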
###############################################################################
###############################################################################
# Many layers of dots
###############################################################################
###############################################################################
dev.off()
###############################################################################
###############################################################################
# Visualising everything: graphical primitives
###############################################################################
###############################################################################
# Drawing arrows
# Drawing rectangles
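The code for the arrows and rectangles did not make it into this excerpt; a minimal sketch on an empty canvas:

```r
plot(c(0, 10), c(0, 10), type="n", xlab="", ylab="")  # an empty canvas
arrows(1, 1, 4, 4)                 # arrow from (1, 1) to (4, 4)
arrows(1, 4, 4, 1, code=3)         # double-headed arrow
rect(5, 5, 8, 8, border="blue")    # rectangle by two opposite corners
rect(6, 1, 9, 3, col="grey")       # filled rectangle
```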
# Drawing polygons
# (the definition of p.1.x did not survive extraction; a plausible stand-in)
p.1.x <- c(0, 1, 2, 3, 4, 5)
p.1.y <- c(0, 1, 3, 5, 2, 1)
plot(p.1.x, p.1.y, type="n")   # an empty plot to draw on
polygon(p.1.x, p.1.y) # Adding polygon 1
# For self-intersecting polygons, see the fillOddEven=TRUE argument
###############################################################################
# Examples of real-life graphs using polygon() function
###############################################################################
###############################################################################
# Dorpat faculty dynamics
# Loading dataset;
dpt.dyn <- read.table("dpt.dyn.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)
# An empty plot
plot(dpt.dyn$YEAR, dpt.dyn$TOT, type="n", ylim=c(0,85),
xlab="Timeline", ylab="Number of persons")
# Adding polygons;
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$TOT, 0), col="#0072CE")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, (dpt.dyn$U1.DPT.TOT + dpt.dyn$U1.EUR.TOT), 0), col="black")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$U1.EUR.TOT, 0), col="white")
###############################################################################
# Presidential elections Monte-Carlo simulation results
# Loading datasets;
ru.2018 <- read.table("ru.2018.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)
ru.hist.MC <- read.table("ru.hist.MC.txt", h=TRUE, sep=" ", stringsAsFactors=TRUE)
# Adding the polygon for the MC simulation mean +/- 3 standard deviations;
polygon(
c(ru.hist.MC$PCT, ru.hist.MC$PCT[1001:1]),
c(ru.hist.MC$MEAN + 3 * ru.hist.MC$SD,
(ru.hist.MC$MEAN - 3 * ru.hist.MC$SD)[1001:1]),
col=rgb(0, 0, 0, .3), border=rgb(1, 0, 0, .3), lwd=.5)
###############################################################################
###############################################################################
# Getting inside: numeric output of the plotting functions
###############################################################################
###############################################################################
students.mass.hist <- hist(students.df$MASS)
# Adjusting ylim
plot(students.mass.hist, ylim=c(0,max(students.mass.hist$counts)+5),
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)
# Trying boxplot.stats():
boxplot.stats(students.df$MASS)
###############################################################################
###############################################################################
##
## Bonus: Monte-Carlo simulation for St. Petersburg voters’ turnout histogram
##
###############################################################################
###############################################################################
###############################################################################
#
# The following script is based on the methods described in
# Dmitry Kobak, Sergey Shpilkin, and Maxim S. Pshenichnikov "Integer
# percentages as electoral falsification fingerprints". The Annals of Applied
# Statistics. Volume 10, Number 1 (2016), 54-73.
# [https://arxiv.org/abs/1410.6059]
#
# I am indebted to Sergey Shpilkin for his kind explanations and critical
# remarks on the early versions of the script.
# I am grateful to Boris Ovchinnikov for a stimulating discussion.
#
# Alexei Kouprianov, [email protected]
#
###############################################################################
###############################################################################
# Loading data
###############################################################################
###############################################################################
# Declaring objects for the loop
###############################################################################
# MK.repeats <- 100 # For preliminary testing;
# MK.repeats <- 1000 # For preliminary testing;
# DO NOT try the following line on larger datasets. 10,000 iterations on
# the dataset for the whole Russia (97+K records) and for 0.1% bins
# may take days to complete.
MK.repeats <- 10000 # Working repeats number;
j <- NULL
i <- NULL
###############################################################################
# Main simulation loop begins
###############################################################################
for (k in 1:MK.repeats) {
	# ... (the body of the simulation loop is not reproduced in this excerpt) ...
}
###############################################################################
# Main simulation loop ends
###############################################################################
# Extracting summary stats for counts from all MK.repeats simulated hists into
# a data frame
for (i in 1:101){
spb.ksp.rbinom.TURNOUT.hist.counts.MEAN <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.MEAN,
mean(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.SD <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.SD,
sd(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN,
median(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.Q1 <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.Q1,
summary(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])[2]
)
spb.ksp.rbinom.TURNOUT.hist.counts.Q3 <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.Q3,
summary(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])[5]
)
spb.ksp.rbinom.TURNOUT.hist.counts.IQR <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.IQR,
IQR(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i]))
}
colnames(spb.hist.ksp.simulated.stats.1pct) <-
c("PCT","MEAN","SD","MEDIAN","Q1","Q3","IQR")
# Control plots
# (the opening lines of this polygon() call were lost to a page break;
# reconstructed by analogy with the election example above)
polygon(c(spb.hist.ksp.simulated.stats.1pct$PCT,
	spb.hist.ksp.simulated.stats.1pct$PCT[101:1]),
	c(spb.hist.ksp.simulated.stats.1pct$MEAN +
	3*spb.hist.ksp.simulated.stats.1pct$SD,
	(spb.hist.ksp.simulated.stats.1pct$MEAN -
	3*spb.hist.ksp.simulated.stats.1pct$SD)[101:1]),
	col=rgb(0,0,0,.3), border=rgb(1,0,0,.3))
points(spb.hist.ksp.simulated.stats.1pct$PCT,
spb.hist.ksp.simulated.stats.1pct$MEAN, type="l", col=2, lwd=3)
legend("topleft", lwd=c(3,1), col=c(2,2),
legend=c("MC-simulated mean", "MC-simulated mean +/- 3 SD"), bty="n")
axis(1, at=seq(0,1,.1), labels=FALSE, lwd=1.5)
axis(2, at=seq(0,190,10), tcl=-.25, labels=FALSE, lwd=1.5)
dev.off()