0% found this document useful (0 votes)

32 views103 pages

Basics of Data Analysis and Graphics in

Uploaded by

Andrianarisoa Ratsimbazafy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views103 pages

Basics of Data Analysis and Graphics in

Uploaded by

Andrianarisoa Ratsimbazafy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 103

Basics of data analysis and graphics in R

Part two:
retrieving summary statistics
and drawing basic graphs
Alexei Kouprianov∗

Contents
Preface 3

Elements of exploratory analysis: drawings first 3

The very first steps of analysis: descriptive statistics . . . . . . . . . . 3
Analysts’ nightmare: Anscombe’s quartet . . . . . . . . . . . . . . . . 9

Grammar of graphics: six most basic graphs 12

Visualising a single variable . . . . . . . . . . . . . . . . . . . . . . . . 15
Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Digression: saving the plots . . . . . . . . . . . . . . . . . . . . . 21
Simple bar plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Visualising interdependence: bivariate plots . . . . . . . . . . . . . . . 27
Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
A digression: Time series and the type of plot() . . . . . . . . . 33
Multiple boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Scatter plot with jitter . . . . . . . . . . . . . . . . . . . . . . . . 44
Structured barplots . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Four steps towards more advanced graphs 50

Plotting math: abline() and curve() . . . . . . . . . . . . . . . . . . 51
Stepping beyond two dimensions . . . . . . . . . . . . . . . . . . . . . 56
Visualising everything: graphical primitives . . . . . . . . . . . . . . . 64
Getting inside: Numeric output of plotting functions . . . . . . . . . . 69

Preparing presentation quality graphs 73

Raster graphs: screen vs. paper . . . . . . . . . . . . . . . . . . . . . . 74
Digression: Pixel size, byte size, and resolution . . . . . . . . . . 74
Adjusting graph size . . . . . . . . . . . . . . . . . . . . . . . . . 76
Black and white line art in earnest . . . . . . . . . . . . . . . . . 81
Vector graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
∗ Department of Sociology, National Research University Higher School of Economics (St.

Petersburg).

1
Quality colour graphs: post-processing . . . . . . . . . . . . . . . . . . 90

Appendix: Code for this manual 91

2
Preface
R is a most powerful instrument of data visualization. Besides a wide variety
of built-in standard graphs (histograms, bar-, box-, and scatterplots, etc.) it
can draw more sophisticated things like social networks or maps with the aid
of installable libraries and even allow you to visualise nearly everything you
can imagine with the aid of graphical primitives (i. e. points, line segments,
arrows, and polygons). In this part of the manual, we shall start with simple
standard graphs representing variation of a single variable, then turn to more
complex bivariate graphs of various kinds, and, finally, to few special cases of
still more complex graphs which are, none the less, used far from infrequently.
As we proceed along this sequence, I shall explain you the ways to control
the appearance of the graphs and show you how to save them to files. In a
special section, I shall treat the problem of producing printing quality graphs
for submission to a scholarly journal. Maps, social networks, and other possible
non-trivial graphs will be discussed elsewhere.

Elements of exploratory analysis: drawings first

The very first steps of analysis: descriptive statistics
The craft of data analysis implies study of variation patterns. In our search
for patterns, we study individual variables first, then shift to the analysis of
possible interdependences between variables by bi- and multivariate methods
until we exhaust all ideas in search for meaningful patterns.1
The pattern of variation for a single variable can be represented graphically
(and we shall move to this soon), but many scientists prefer numerical descrip-
tions. In empirical studies, many variables follow a pattern which seems natural
to many of us, because it is a pattern we learn by many examples before we
grasp it theoretically. The individual values apparently tend to cluster around
some centre. Most values are rather close to it, but some tend to deviate to
a greater or lesser extent. The more the individual values deviate from this
imagined centre, the less frequently they occur. This ‘centre’, towards which
the values ‘gravitate’, the degree to which some extreme values deviate from
the centre, and the general pattern of deviations (which may be surprisingly
regular) are the subject matter of the descriptive statistics. Some of them are
familiar to us since the days of the school math (e. g., the mean, which is one
of the indicators of centrality), some are less known, but we need to familiarise
with them before going any farther.
Although one can find important descriptive statistics for a variable using R
in a calculator mode, there exist a number of specialised functions which do it
in a more efficient way. To appreciate this efficiency I would suggest trying the
1 Even though this process might seem ‘purely inductive’, ‘mindless’ or ‘atheoretical’, it is

not. In fact, it is a fountain of hypotheses you formulate, reject, corroborate, refine, and put
to new tests, sometimes with an enormous speed. I still remember how I calculated regression
or analysis of variance by hand armed with a pencil and a slide rule, using special tables which
hepled to organise the workflow, adding and squaring numbers on a sheet of paper, finding
square roots and determining p-values by the book, drawing ‘frequency polygons’ and scatter
plots on graph paper. Now, one can perform analysis of variance and plot basic graphs with a
couple of function calls. Back then, the process of hypotheses testing was painfully palpable.
Now, the pain has nearly gone, but hypotheses testing is still in place.

3
rocky path of calculator mode first. E. g., we may wish to calculate the mean
height of the students from the training dataset. Technically, it does not seem
that complicated. All we need to know is the sum of all individual heights and
the number of individuals measured. We already know, that the heights are
stored in a vector within our data frame students.df. We may re-create it (if
it is not at hand) and try retrieving the sum of all individual heights.
# This is the code to re-create the students.df if it is lost;
# The file "kouprianov.students.v.2.1.txt" must be present
# in the working directory;
students.df <- read.table("kouprianov.students.v.2.1.txt", header=TRUE,
sep="\t", stringsAsFactors=TRUE)
# The following two lines would help you checking if everything is OK;
dim(students.df)
head(students.df)

The sum of elements of an object containing numerical values (or a series of

numerical values which is not yet part of any object) can be retrieved with sum()
function. E. g. sum(1:4) would return 10. For our vector students.df$HEIGHT,
the result would be more disappointing:
> sum(students.df$HEIGHT)
[1] NA
>

We already know what R denotes with NA and its allies. How come that a
sum of a vector, which is full of numbers, can be a missing value? The answer
is simple. If a vector contains missing values, its true sum, strictly speaking,
is unknown, being itself a missing value. If we preview the vector by calling
students.df$HEIGHT, we can spot two NA values in 14th and 20th positions.
The sum() function belongs to a large family of functions dealing with nu-
meric objects that require explicit instructions on to how to treat the NA values.
The argument used for exclusion of NA values from calculations is na.rm, and
its default value is FALSE, so we need to explicitly set it to TRUE:
> sum(students.df$HEIGHT, na.rm=TRUE)
[1] 37738.4
>

Now, we need only the number of persons measured, and the much needed
mean value is ours. The length() of the vector is easy to obtain. The problem,
however, is that we should somehow get rid of the persons contributing NA
values. From our preview, we already know that there are only two of them (so,
we can use (length(students.df$HEIGHT)-2)) as a denominator). But we
see as clearly that visual inspection of a vector could not be recommended as a
general recipe when dealing with vectors of a considerable length containing an
unknown number of NA values. An easier solution then is to resort to specialised
functions that return summary statistics at once. Some of them are as smart
as to deal with NA values properly on their own.
The most powerful of the whole family is, perhaps, a super smart function
summary().2 Let us see what it can do to the students.df$HEIGHT:
2 This function is a true miracle. It is capable of summarising objects of any sort, each time

communicating something important on them. We shall see it time and again summarising
most incredible things from simplest numeric vectors or factors to the results of regression
analysis.

4
> summary(students.df$HEIGHT)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
148.0 165.0 170.0 170.8 175.0 192.0 2
>

This is a kind of output we never saw. Before we explore the meaning of

each item of the summary, we need to take a closer look at its form. It looks like
a vector with named entities but lacks a usual number in the square brackets
to the left. This suggests that this is clearly not a vector. The str() function
would reveal that this is an object of table class:3
> str(summary(students.df$HEIGHT))
Classes ’summaryDefault’, ’table’ Named num [1:7] 148 165 170 171 175 ...
..- attr(*, "names")= chr [1:7] "Min." "1st Qu." "Median" "Mean" ...
>

This table contains just one row (there can be also multi-row tables as we
shall see later) and it is invariably treated in such a case as a uni-dimensional
object. Its elements can be called both by their numbers and names.
> summary(students.df$HEIGHT)[2]
1st Qu.
165
> summary(students.df$HEIGHT)["1st Qu."]
1st Qu.
165
>

The summary for a numeric vector includes its minimal and maximal values,
mean, median, first and third quartiles, and the number of NA values (when they
are present). Some of these statistics are rather self-evident (minimum and
maximum), some deserve an explanation or, at least, a reminder. The, as you
remember, mean is calculated by dividing the sum of values by their number:
x1 + x2 + . . . + xn
x̄ =
n
The median is defined as the middle element of a series of values ordered
from minimum to maximum (if there is an even number of elements in the series,
then the median equals to the mean of two elements closest to the middle of
the series). Mean and median are two important centrality measures (the third
is mode, the most frequent value or the biggest class of values). We shall come
back to them soon. The quartiles are similar to the median but they cut a
series not in halves but in quarters (by the way, the median is, understandably,
nothing but the second quartile).
Some of these statisics can be retrieved individually either by calling elements
of the summary() output (as it was shown above) or by means of specialised
functions: mean(), median(), min(), max(), and range() (the latter function
returns a vector of two values, the minimum and the maximum for a given
series). It should be noted that the five above-mentioned specialised functions
require to explicitly exclude the NA values, just like the sum() function (see the
code in the Appendix to this manual).
3 One may see that it belongs also to ’summaryDefault’ class but this is not that important
now.

5
As I said before, when dealing with quantitative variables, the analysts are
usually interested both in centrality measures and in the estimation of variability
within a given sample or population. The summary() function tries to return
some metrics from both (centrality can be assessed with mean and median,
while the degree of variation is represented with minimum, maximum, and the
quartiles).
But the output of the summary function mixes up not only the measures of
centrality and variation but also the two major families of measures of centrality
and variation. The first (sometimes called ‘traditional’) is based on the mean
and deviations from it, the second is based on the median and quartiles. The
difference is fundamental. Mean-based metrics take into account both the size of
the sample or population and the values of the parameter measured (and heavily
depend on them). Most median-based metrics take into account only the order
of values in a series ordered from the smallest to the largest, which makes them
to a considerable extent value-independent (and insensitive to outliers — rare
exceptionally big or exceptionally small values) or robust.

Traditional (mean-based) estimates of variation

The mean-based variation estimates are variance and standard deviation, the
latter being the square root of the former. The logics behind these two measures
is as follows. The degree of variation can be estimated by comparing each value
in the series with the series’ mean. The bigger is the summary difference, the
bigger is the variance. A straightforvard attempt to calculate the summary
difference would bring, however, a most disappointing result. No matter how
widely the values vary, the sum of their deviations from the mean would be
equal to zero.4
∑︁
(xi − x̄) = 0
Of the two ways to overcome this problem (to take the absolute value of the
deviations, and to square the deviations), mathematicians picked the second.
Now, the more some values deviate from the mean, the higher is the value of
the new metrics, the sum of squares (SS).
∑︁
SS = (xi − x̄)2 ≥ 0
There, however, remains one more problem. The sum of squared deviations
from the mean would vary not only because of the absolute value of deviations
but because of the size of the sample or population under study. The more
elements we include in our calculations the higher is the value of SS. We, of
course, can take a solemn oath to compare only the samples of equal size and
run a Crusade to force the infidels to follow our heroic example, but there is a
simpler solution. We may calculate the mean squared deviation. For the whole
population, the formula would be:
∑︀
2 (xi − µ)2
σ =
N
4 If you try to calculate this in R in practice, you may get (instead of zero) a very small

number originating from the tiny imperfections of calculations algorithm. For our students
heights this number would be −9.66 × 10−13 or −0.000000000000966. . .

6
For a sample variance, a slightly different formula is used:5
∑︀
(xi − x̄)2
s2 =
n−1
The sample variance can be calculated with var() function.
> var(students.df$HEIGHT, na.rm=TRUE)
[1] 61.60064
>

The variance is a highly useful statistics but not without a blemish. If one
considers carefully the issue of measurement units, one may notice that the
mean squared deviation is. . . well. . . squared. While the original variable was
measured, e. g., in centimeters, its variance should be measured in square cen-
timeters, which makes little sense if we would like to discuss the limits of vari-
ation.6 To overcome this difficulty, another measure of variation is introduced,
the standard deviation, which is the square root of the variance.
Unlike the variance, the standard deviation can be used meaningfully to-
gether with the mean to represent natural variation, and we shall learn more
on the use of it later. For now, it is enough to say that it can be calculated
with the sd() function (which also requires na.rm=TRUE). We may calculate the
standard deviation for our students.df$HEIGHT vector and check if it really
equals the square root of the variance.
> sd(students.df$HEIGHT, na.rm=TRUE)
[1] 7.848607
> (sd(students.df$HEIGHT, na.rm=TRUE))^2
[1] 61.60064
>

Robust estimates of variation

On the robust statistics side, there is a function IQR(), which returns the in-
terquartile range or the difference between the values of the third and the first
quartiles. The interquartile range is a robust estimate of the variation, analo-
gous to the standard deviation.
> IQR(students.df$HEIGHT, na.rm=TRUE)
[1] 10
>
5 Note, please, the differences between the population variance (σ 2 ) and the sample variance

(s2 ), the expectation of a parameter (µ) and the sample mean (x̄), the population size (N )
and the sample size n. The difference however is not so much in the choice of particular
letters, as in the subtleties of statistical reasoning, the most practically important of which
is the decrease of denominator in order to increase the summary variation estimate. (n − 1)
is the number of degrees of freedom, a rather tricky statistical concept, which will be treated
in more detail elsewhere. For the sample variance it is equal to the maximum number of
elements of a numeric vector one may change arbitrarily and still get back to the same mean
value by involuntary compensation on the account of the remaining elements. E. g., in a series
of 1, 2, 3, 4, 5, the mean is 3. We may change four elements as we pleased, e. g., to replace
the first four with 10, 100, -218, and 3.28, but to keep the mean at three, the last element of
the vector has to be canged (in our case) to 119.72.
6 Square centimeters look a lot less bizarre than square kilograms or square friends (Sponge

Bob does not really fit here either) but as a measure of length they fare no better.

7
In practice, to delimit the quartiles is a more complex task than to find
the mean. There are several conventional ways to do it, which bring some-
times slightly different results. Accordingly, there are several functions in R for
the quartile analysis. Besides summary() (and one or two more functions we
shall learn later), there are fivenum() (stands for five numbers) and a highly
customisable quantile().7

Summarising a qualitative variable

It should be noted that qualitative variables also have variation patterns of
their own and, like those for quantitative variables, they can be summarised
with the smart summary() function. When I say that the summary() function
is a smart one, I use the word ‘smart’ in a technical sense. I mean that this
function identifies the class of the object under operation and acts accordingly.
Moreover, it ‘knows’ how to deal with NA values without any help from us.
We have dealt already with one smart function, str(), and we shall see more
of them in the future. When applied to a factor, summary() returns a table
presenting number of elements for each factor level and the number of NA values
(when they are present):
> summary(students.df$SEX)
f m NA’s
169 52 2
>

Here, by the way, one may see a palpable difference of factors from character
vectors. Try summary(as.character(students.df$SEX)) to see a strikingly
different result.

Summarising a data frame

Finally, even though it is not always practical, summaries can be retrieved for
all variables of the data frame at once. More specialised functions we discussed
in this section do not work this way.
> summary(students.df)
HEIGHT MASS SEX SMOKING SMOKING.AGE
Min. :148.0 Min. :40.00 f :169 Min. :0.000 Min. : 9.00
1st Qu.:165.0 1st Qu.:53.00 m : 52 1st Qu.:1.000 1st Qu.:13.75
Median :170.0 Median :59.00 NA’s: 2 Median :2.000 Median :15.00
Mean :170.8 Mean :60.35 Mean :1.475 Mean :15.06
3rd Qu.:175.0 3rd Qu.:65.62 3rd Qu.:2.000 3rd Qu.:16.00
Max. :192.0 Max. :98.00 Max. :5.000 Max. :20.00
NA’s :2 NA’s :3 NA’s :175
MATH GROUP DEPARTMENT
Min. :48.00 Min. : 1.00 Min. :1.000
1st Qu.:62.50 1st Qu.: 9.00 1st Qu.:1.000
Median :70.00 Median :13.00 Median :1.000
Mean :67.42 Mean :14.07 Mean :1.274
3rd Qu.:73.00 3rd Qu.:22.00 3rd Qu.:2.000
Max. :81.00 Max. :24.00 Max. :2.000
NA’s :164 NA’s :7
>
7 Both require na.rm and return quantiles as a vector. The quantile() can return any
quantiles (not only quartiles) using as many as nine different algorithms for different kinds of
variables. Please, read help pages for quantile() and papers cited therein for more details.

8
Table 1: Anscombe’s quartet: the data (Anscombe 1973).

x1 y1 x2 y2 x3 y3 x4 y4
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

Table 2: Anscombe’s quartet: some descriptive statistics are nearly or exactly the
same for all four datasets (Anscombe 1973, with modifications).

Parameter Value
Number of observations (n) 11
Mean of the x’s (x̄) 9.0
Mean of the y’s (ȳ) 7.5
Standard deviation of x’s 3.32
Standard deviation of y’s 2.03
Equation of regression line y = 3 + 0.5 × x
Multiple R2 0.667

Now, after we know how to retrieve centrality and variation estimates, the
time has come to explain why we should not begin our analysis with calculating
them. The only reason we studied them first is that they will be vitally needed in
the following section which will expose their futility at the first stage of analysis.

Analysts’ nightmare: Anscombe’s quartet

In 1973, Francis Ascombe, by that time a renowned statistician working a Yale
University, published a paper in which he argued for the use of graphics in
computerised statistical analysis.8 To illustrate the importance of ‘seeing’ the
data, not just ‘counting’ them, he designed a set of four bivariate datasets with
remarkable properties (tab. 1). As one may see, the first three datasets share
the same values for x variable, and all four share the same basic descriptive
statistics (and even some derived statistical metrics we are not yet supposed to
be familiar with, see tab. 2).
We can see it for ourselves by loading the dataset for Anscombe’s quartet and
8 Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27

(1): 17–21.

9
Figure 1: Anscombe’s quartet: bivariate plots for each dataset. Dashed line: regres-
sion of y by x (see tab. 2).

performing all necessary calculations.9 To do it quick we shall learn couple more

of functions. The first one is round(), which rounds the numbers to a specified
number of decimal places. The second is apply(), which applies functions to
the uniform arrays of numbers (not all data frames are good for that because
they too often unite different kinds of data, but ours contains numbers only).10
# Loading and previewing data:
anscombe <- read.table("anscombe.txt", head=TRUE, sep="\t")
dim(anscombe)
anscombe
# Applying mean() and sd() to all columns and rounding the results
# to two decimal places:
round(apply(anscombe, 2, mean), 2)
round(apply(anscombe, 2, sd), 2)
9Please, make sure that the file is in the working directory.
10 Please, read carefully the help pages for each of these functions. Both represent whole
small families of functions, round() and its allies deal with different kinds of rounding, while
apply() and its allies deal with application of functions to various R object classes.

10
The seventh line of the script applies the function mean() (hence the third
paremeter of apply() is mean) to the columns of the object ans (hence the
first two parameters are ans and 2, ans and 1 would call rows, not columns,
remember the Roman Catholic rule). The result is rounded to two decimal
places (hence the second argument of round() is 2). The eighth line does the
same for standard deviation. You can also reproduce the calculations without
rounding to see the tiny differences that will become visible.
Despite all these remarkable numerical similarities, the bivariate scatter plots
for Anscombe’s four datasets are strikingly different (Fig. 1). The four patterns
apparently resulted from different processes. The first case represents a rather
strong linear connection between the two variables with some noise contributing
to the deviations of observations from the fitting regression line. The second
represents a very strong non-linear connection between the two variables (and
the one at y axis is clearly a dependent variable, as the same values of y cor-
respond to different values of x but not vice versa, as in the first case), the
linear regression parameters can be calculated but they are completely out of
place here. The third case presents a very strong linear connection between the
variables with a single outlier driving the regression line away from the possi-
ble stronger dependence. The fourth graph depicts two apparently independent
variables with an outlier providing a (possibly false) basis for regression.11 It
should be noted that in two latter cases robust methods would return strikingly
different results, but we are not yet ready for that.
The sharp contrast between the table of summary statistics (tab. 2) and the
pictures (fig. 1) is a menacing warning to anyone who dares to start ‘counting’
without trying to see the patterns first. In social sciences, with their inherent
indeterminancy, the patterns that are worth being mentioned must be elephan-
tine, and many of them must be visible on bi- or multivariate plots. There
are, of course, formal numerical methods which may help us deciding whether
an observed pattern is more likely a result of some underlying causal relation-
ship or just a strange coincidence. Sometimes formal analysis reveals patterns
that are invisible at first glance. Sometimes these formal methods should be
applied to make some patterns visible. The visual inspection of the data avail-
able is, however, an unavoidable step of analysis, which should either precede
or immediately follow the extraction of summary statistics. Thus, we come to
graphics.
11 In a real-life case, an analyst would suspect that cases III and IV deal with representatives

of two different populations. A true story comes to my mind that happened to a friend of
mine years ago. He took part in a biological indication project, which aimed at the detection
of developmental anomalies in animals caused by industrial pollution. He took a huge sample
of ants (a rather common Lasius niger ) somewhere near the nuclear power plant in Sosnovyi
Bor (near St. Petersburg, Russia), and another one away from it and measured ants in several
ways. Then he fed the data into a computer program (at that time R was still out of sight
and I am not sure as to which software he used). It turned out that several ants captured
near the power plant exhibited unusual body sizes and proportions. Happy with this result he
took the promising monsters to a local expert in ants. To my friends’ sheer disappointment
the monstorus ants happened to belong to a different ant genus and species (Formica fusca),
which only superficially resembled Lasius niger (both were ‘small and dark-brown’).

11
Grammar of graphics: six most basic graphs
In data analysis, several classifications of variables are possible. In this section
we will use a rather crude one, which splits variables into qualitative and quan-
titative, and the latter into discrete and potentially continuous.12 The variation
patterns for different kinds of variables and various kinds of interactions between
variables are traditionally depicted in different ways. Even though infographics
always involves a great amount of creativity, and the same data can be pre-
sented in a number of ways, academic readers usually expect certain degree of
uniformity in graphs, which eases comprehension of author’s message.
The six most basic graphs based on our ‘students’ dataset are presented in
fig. 2. I shall deal with each of these graphs in a special section in more detail.
Here I only briefly characterise them all.
The two upper graphs depict variation patterns for a quantitative (his-
togram) and qualitative (bar plot) variable. Note, please, an important graph-
ical convention here. The bars of a histogram (the so-called bins) are plotted
immediately next to each other, for they represent usually nothing but a conve-
nient way of splitting a potentially continuous variable into arbitrary intervals
(strangely enough, this applies to countable discrete quantitative variables too).
The bars of a bar plot are separated with a space because they represent alter-
native states of a qualitative variable, which neither are arbitrary nor have any
meaningful intermediate states (if a variable is categorised properly, of course,
but this is a different issue). In both kinds of graphs, the heights of bars rep-
resent absolute or relative frequency of variants falling within a bin or within a
class of a qualitative variable.
The four bivariate graphs represent four possible combinations of x-axis and
y-axis variables of different kinds. Usually, at the x-axis, an ‘independent’ vari-
able is placed, while the y-axis is reserved for a ‘dependent’ variable. We should
be extremely careful, though, about ascribing any causal relationship to the
pairs of variables on a plot. E. g., body mass and height can be meaningfully
plotted against each other, and certain pattern of interdependence is clearly
visible (and not at all counter-intuitive) but the explanation of a causal rela-
tionship between these two variables is far from trivial. I would rather say, that
a relationship between two variables involved in a theoretically sound simple
causal model can be depicted by any of these four graphs, but the mere fact
that the two variables are plotted against each other and show a meaningful
pattern does not imply a trivial causal relationship.
The scatter plot (fig. 2, middle left) depicts analytic individuals (they can
be of any nature, e. g., persons, countries, electoral precincts, journal articles,
etc.) in a two-dimensional space defined by two quantitative variables. Each
12 Another very popular classification is based on the kind of scale used (nominal, ordinal,

interval, and ratio). In nominal scale different states of a variable are specified with names
only (e. g. gender, which can be male, female, or whatever; country of origin; the University
one graduated from). In ordinal scale, the named states can be arranged in a meaningfully
ordered sequence (e. g. levels of education, ranks of military or civil service). With interval
and ratio scales more and more numerical operations become possible. The interval scale
(e. g. degrees of Centigrade or of Fahrenheit for temperature) allows meaningful addition and
subtraction, rational (e. g. counts, masses, distances, age, income, etc., anything quantitaive
having a ‘true’ zero) allows also meaningful multiplication and division. ‘Qualitative’ variables
are coded with nominal or ordinal scales, ‘quantitative’ are usually coded with ordinal, interval
and ratio scales. Strangely enough, however important these distinctions are for picking the
right tools for statistical analysis, they are not that useful from the drawing perspective.

12
Figure 2: Six most basic graphs (based on ‘students’ dataset). Upper row: univariate
graphs, middle and bottom rows: bivariate graphs. Right: quantitative variable at
x-axis, left: qualitative variable at x-axis. Middle row: quantitative variable at y-
axis, bottom row: qualitative variable at y-axis. This is not a dogma and should be
perceived critically and creatively. There are more ways to express variation patterns
for these variables.

13
individual is represented with a dot with coordinates corresponding to the value
of parameters forming x- and y-dimensions. E. g., an easy-to-spot lonely dot at
less than 150 cm and slightly over 50 kg represents a miniature person, 148 cm
by 50.5 kg.
The multiple box plot (or, multiple box-and-whisker plot, fig. 2, middle right)
is used to depict a relationship between a qualitative variable at x-axis and a
quantitative at y-axis. The box plot summarises robust measures of centrality
and deviation (see more on that in appropriate sections). The thick bar in the
middle of the box represents median. The lower and upper limits of the box
represent 1st and 3rd quartiles. The ‘whiskers’ represent either the minimum and
maximum or 1.5× interquartile range distance from median (in the latter case,
‘outliers’ falling outside of this range are depicted as dots, just as in our graph).
A series of histograms plotted one under another can be used to depict the same
relationship as well. Histograms, however, lose their information content as they
become more and more flat, because the general shape of frequency distribution
becomes less expressed. Even though box plots present only summaries of the
frequency distributions, they look just as good when they are narrow. One may
plot dozens of box plots in a line without making them less informative and
comparable.
The scatter plot with jitter (fig. 2, bottom left) presents basically the same
information as the multiple box plot but the relationship between quantitative
and qualitative variables is reversed. In fact, a raw scatter plot with height
at x-axis and gender at y-axis would look differently, because, by default, the
numeric representation of feminine and masculine genders would be just 1 and
2, so the dots representing individuals would overlap each other and lie in two
thin lines. To see the individual points, some noise is added, so each line smears
to form a cloud. Each cloud, however, symbolises only one distinct state of
qualitative variable, so this graph is conceptually different from the true scatter
plot above (because the dispersal of a cloud along the y-axis is purely decora-
tive). The variables used in this particular instance of a plot do not really fit its
primary purpose, which is to depict dependence of a quanlitative variable on a
quantitative one (no one seriously expects that gender is determined by height).
There are, however, more relevant examples: decision to go to work or to take
a sick leave at y-axis and body temperature at x-axis, decision to go and vote
or to stay home at y-axis and the income at x-axis, etc.
The relationship between two qualitative variables can be depicted in a num-
ber of ways, of which I picked the default R mosaic plot, which is essentially a
modification of a stacked bar plot. This plot is difficult to interpret at first (to
the extent that when I unintentially evoked it for the first time, I thought that
I broke the graphical console). None the less, it is rather intuitive: the width of
bars is proportional to the frequencies of the states of the x-axis variable, while
the heights of coloured segments of these bars are proportional to the shares
of the states of y-axis variable in each of the x-variable classes. E.ġ., we can
see that there are probably disproportionally many male students describing
themselves as heavy smokers (compare the relative height of the upper segment
in both bars). The left y-axis lists the states of the y-axis variable, in order of
appearance at the diagram, the right y-axis shows frequency scale from zero to
one.
The following sections will deal with these (and some other) graphs in more
detail as well as with implementation of graphics in R code.

14
Visualising a single variable
Histogram
While the generic smart drawing function is plot(), we shall none the less begin
with hist().13 The latter is more specialised, has less parameters than the more
generic plot(), and allows to illustrate some ways to control the graph layout
without going into other distracting details.
hist() produces histograms. A histogram is a specific kind of a graph,
which displays the density distribution of a numerical variable. This may sound
a bit obscure, so I shall explain it at some length.
Let us suppose that we wish to measure body mass of people with a con-
siderable degree of precision. As one can imagine, there is no point to be too
precise, because a glas of water should immediately increase your body mass
by approximately 200 g (≈ 0.5 Lb), however I do not wish this example to be
extremely realistic in this particular respect. What is important, is that we can
measure body mass to the gram.
After weighting a random sample of fifty people of approximately same age
and sex we would come up with a list of measurements (which can, from the R
perspective, be described as a numerical vector). Given this small number of
people and this unnecessary degree of precision, this vector most likely would
be a list of unique five-digit numerical values, like this:
[1] 48244 67159 48444 55839 57019 66308 58104 48608 62987 57294 57555 47966
[13] 64067 60454 56558 52717 48542 62485 46601 62793 64456 59868 66475 64809
[25] 55584 49777 51727 60335 59389 57154 56990 66374 55136 54873 64931 54924
[37] 58724 50229 63285 75209 53338 55694 61122 57141 61876 47330 38335 56252
[49] 57825 44942

If we try to depict it “as is”, representing each value with a vertical line
drawn at a certain point of the x-axis, the variation of mass would be visualised
as in fig. 3.

40000 50000 60000 70000

Body mass, gram

Figure 3: Variation of body mass, raw.

As one may see, the lines are distributed unevenly. The maximal density of
the lines can be observed around the mean value (57080, in our case). As we
come closer to the marginal values (38340 and 75210), spaces between the lines
become bigger. A purely visual assessment of the density of the lines in this
graph is, however, problematic. We can ease it by dividing the variation range
into intervals (data analysts call them ‘bins’), counting how many values fall
into every bin, and producing a more advanced visualisation as seen in fig. 4. It
is exactly this more advanced graph, which bears the name of the histogram.
13 We have already dealt with other ‘smart’ functions, summary() and str(). As you might

remember, when I say that the function is ‘smart’, this means that it can identify the kind of
an object on its own and act accordingly without asking us for help.

15
6
5
Frequency
4
3
2
1
0

40000 50000 60000 70000

Body mass, gram
0.00012
0.00006
Density
0.00000

40000 50000 60000 70000

Body mass, gram
15
Frequency
10
5
0

40000 50000 60000 70000

Body mass, gram

Figure 4: Variation of body mass, same as in fig. 3, represented by a histogram with

1 kg bin (top). The same using density instead of frequency (middle). The same with
a histogram with 5 kg bin (bottom).

16
The numbers of cases falling within the limits of bins are represented with
bars’ heights (hence the y-axis is labelled ‘Frequency’). The same graph can
be tuned to represent the density, which is nothing but a normalised frequency.
To obtain density values we need to calculate the share taken by a bin area
from the total area of the histogram. The shape of the histogram would remain
unchanged, it is only the y-axis that would look different (see fig. 4, middle).
Bin width is usually assigned arbitrarily.14 The only thing to be taken into
consideration is the need to produce a density distribution of a recogniseable
shape for a quick assessment of the number of ‘peaks’, or modes, symmetry, and
balance between its hump and its tails. E. g., in our imagined case, we may
change the bin width from 1 to 5 kg (fig. 4, bottom) to see a smoother picture.
This graphical test is not the best way to identify the theoretical probability
density function, which can approximate our data (e. g., to test whether some
density distribution can be regarded as strictly normal or Gaussian, one would
need to perform a set of more rigorous tests), but it is good enough for a quick
assessment at a glance.
This being said, we may shift to the implementation of a histogram in R.
As I said, the histogram drawing function is hist(). It has got a number of
arguments which allow to tune the image, but, in its most simplistic form, it
requires just one: the variable under consideration.
hist(students.df$HEIGHT)

This code would produce a rather ugly graph (fig. 5, top-left). The histogram
is barely visible, the title and x-axis label are incomprehensible to anyone unfa-
miliar with R notation and the dataset you use, and the bins may not suit our
expectations. All these features can be adjusted using arguments of hist().
hist() shares most arguments with other plotting functions, so by studying it
in detail we will learn a lot about plotting in general.
The three most frequently used text elements: the plot’s title, x- and y-axis
labels are controlled with main, xlab, and ylab respectively. Y-axis label does
not need any adjustment here, so we shall change only two of them.
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm")

This graph is already half as ugly as the previous (fig. 5, top-right), but it can
be better. First, we should make it standing out of the page by adding colour or
a dashed pattern. Colours, sometimes really bright ones, are better for screen
presentations, while line patterns are better for black and white academic print.
Colours can be added with col argument, while dashes require two arguments,
density (in lines per inch) and angle (in degrees, counting counter-clockwise
from three o’clock on). Note that as the code gets longer we may split it into
lines for our convenience. As you might remember from the first part of the
manual, if yoy enter these commands from the R text console, R would respond
with a + prompt, not with a > prompt to an unfinished function call.
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm",
col="grey")
14 Of course, hist() has a built-in algorithm which calculates a more or less appropriate

bin width automatically, but we should learn how to cut the bins at the points we need, not
just at the points R allows us to cut the bins. As we shall see later, bins even do not have to
be necessarily equal to each other within the same histogram.

17
50 Histogram of students.df$HEIGHT Undergraduate students

50
40

40
Frequency

Frequency
30

30
20

20
10

10
0

0
150 160 170 180 190 150 160 170 180 190

students.df$HEIGHT Height, cm

Undergraduate students Undergraduate students

50
40

40
Frequency

Frequency
30

30
20

20
10

10
0

150 160 170 180 190 150 160 170 180 190

Height, cm Height, cm

Undergraduate students Undergraduate students

40
20

30
Frequency

Frequency
15

20
10

10
5
0

150 160 170 180 190 150 160 170 180 190

Height, cm Height, cm

Figure 5: Controlling parameters of a histogram: Variation of students heights. See

text for code details. Top-left: raw hist(students.df$HEIGHT). Top-right: main and
xlab applied. Middle-left: col applied. Middle-right: density and angle applied.
Bottom-left: breaks=20 applied. Bottom-right: breaks are specified with a vector.
Note, please, that the number of bins in the bottom-left histogram is not at all equal
to twenty.

18
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm",
density=15, angle=45)

The results can be seen in fig. 5 (middle row). The colour need not be grey all
the time.15 R has got a number of ways to specify colour of the elements of the
plot (they can be rather flexibly assigned different colours). There are reserved
text names like red, blue, green, orange, darkred, etc. There are numeric
codes for basic nine colours (white,16 black, red, green, blue, cyan, magenta,
yellow, grey).17 You may also specify colours with hexadecimal RGB codes,
from "#000000" for black through "#ffffff" for white (with all 16 777 214
shades in between).18 Finally, there is a special function rgb(), which encodes
RGB colours and transparency (the so-called alpha-channel) but we shall learn
about it later, in the scatter plot section.
If you inspect the middle-left image in fig. 5, you will see that what changed
was the fill colour. The outline remained black. The deep-sitting reasons for
that are yet to be discovered but for now it is enough to say that the outline
colour is controlled with a different argument, border. Sometimes, it is really
important to control this parameter too, so do not forget about this option.
Finally, we need to take control over the bin size. As you already know,
the bin size affects greatly the shape of a histogram. There are several ways to
conrol it. All of them deal with the breaks argument. The number of breaks
can be communicated to hist() as an integer or as a vector. If you supply an
integer value, hist() calculates an appropriate number of breaks which is as
close to your integer as R finds appropriate (fig. 5, bottom-left). If you supply a
vector, hist() cuts the bins exactly at the points you specify (including unequal
bins).19 In this latter case, however, you have to take care of the upper and
lower limits, because the bins should embrace the full variation range. For the
generation of breaks vector, c() and seq() functions are used.
Our ‘students’ dataset is not big enough to demonstrate the true power of
breaks. The dataset from the Russian Federation presidential elections suits
better. The analysts were anxious to visualise some strange patterns they dis-
covered in the frequency distribution of the voters’ turnout. One of the best
studied phaenomena was an anomally high number of polling stations displaying
integer percentiles of voters’ turnout. The frequencies of polling stations where
the voters’ turnout was equal to, e. g., 90%, 91%, etc, was remarkably higher
than that of, say, 89.9% or 91.1% etc. To visualise this effect, one would need
a very small bin of 0.1%. Besides of being small, the bins should be cut in such
15 It should be noted that both gray and grey are acceptable.
16 In fact, transparent. This can only be seen when the background is not white but this is
important to know in advance.
17 A good-for-nothing pie() function, which produces the most hated by analysts pie-chart,

may serve here well. Try pie(rep(1,9), col=0:8, labels=0:8) from your console. After 8,
the colours 1 throurg 8 are recycled, so 9 is again black, like 1, 10 is red, etc.
18 The total number of colours in the hexadecimal RGB palette is 2563 . The hexadecimal

system needs letters because after 9 the conventional decimal system runs out of numbers, so
10 in hex is a, etc., until 15=f, while 16 is a new 10. So, the colour channel intensity varies
from 0 (00) to ff = 255 = 162 − 1. The first pair of digits in the hexadecimal RGB colour
code controls red, the second green, and the final blue. Accordingly, "#ff0000" brings bright
red, etc. You may experiment with that on your own or google a palette.
19 One may wonder why on Earth would we need unequal bins. Stragnely enough, some

major pollster companies use rather peculiar bins for respondents’ ages. To make data com-
parable to their polls’ results, unequal bins for age groups are sometimes needed.

19
a way as to place the value of interest (i. e. 91%) at the centre of the bin, not
at its margin. This means that the borders of the bin should be not: 90.9%
– 91.0% – 91.1%, but 90.95% – 91.05% – 91.15%. To generate a sequence of
numbers indicating the breaks this way, we will need the following formula:
seq(-.0005, 1.0005, .001)

This sequence starts at −0.5%, ends at 100.5%, and its increment is 0.1%
1 1 1
(we should never forget that 1 percent is 100 , so 10 of it would be 1000 ). The
full script for generating this graph ab ovo can be found in the Appendix to this
part of the manual. Here I give only the part relevant to the picture.
hist(pres.2008$TURNOUT,
breaks=seq(-.0005, 1.0005, .001),
col="black",
main="", xlab="Voters’ turnout at a polling station, 0.1% bin")

The histogram corresponding to this code can be seen at the fig. 6, top-
left. Its general shape is nearly unreadable because about 5 000 polling stations
demonstrate 100% turnout.20 To enhance readability of the general shape of
the histogram, we need to stretch it somehow vertically. The right way to do
it is to set arbitrary limits to the segment of y-axis visible within the plotting
area (thus stretching or compressing the y-axis). This can be done with ylim
argument. In histograms, its default value is derived from the baseline (0) and
the height of the tallest bin and can be replaced with any vector of length 2
specifying an arbitrary range (see fig. 6, top-right for results of the code given
below):
hist(pres.2008$TURNOUT,
breaks=seq(-.0005, 1.0005, .001),
ylim=c(0,400),
col="black",
main="", xlab="Voters’ turnout at a polling station, 0.1% bin")

We may apply a similar magnifying glass to the x-axis too. Its default
value is derived from the range (remember range() function discussed above)
of the variable under consideration. In the two bottom histograms of the fig. 6,
xlim=c(.9,1) was applied. For our convenience, white dashed lines are added
to the bottom-right image to point at the integer percentiles. The art of adding
lines will be discussed in more detail below, in the sections on plotting mathe-
matical functions and graphical primitives.21
The final remark on histograms for now will be about the y-axis meaning.
In the beginning of this section, I mentioned that the y-axis can reflect both
frequency and density. To switch between the two, the freq argument is used.
It defaults to TRUE and when set to freq=FALSE, the y-axis turns to density.
You may experiment with it on your own (note, please, how the density scale
changes depending on the change of the bin width). When a density scale is
used, a smoothing density function curve can be added to the histogram. This,
however, would require two more specialised functions for plotting alone and,
sometimes, additional data transformation is needed. Given these complications
and a dubious aesthetical value, I postpone the discussion of density line until
the section on graphical primitives.
20 Many of them are, indeed, rather peculiar, e. g., they may be located at the railway
stations where (by definition) there are no lists of registered voters, or at the ships.
21 For less patient students, I may recommend studying help pages for abline() function.

20
5000

400
300
3000
Frequency

Frequency

200
100
1000
0

0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Voters' turnout at a polling station, 0.1% bin Voters' turnout at a polling station, 0.1% bin
400

400
300

300
Frequency

Frequency
200

200
100

100
0

0.90 0.92 0.94 0.96 0.98 1.00 0.90 0.92 0.94 0.96 0.98 1.00

Voters' turnout at a polling station, 0.1% bin Voters' turnout at a polling station, 0.1% bin

Figure 6: Russian Federation presidential elections (2008), voters’ turnout histogram.

Top-left: general outline of the histogram. Top-right: general outline with y-axis ab-
breviated to enhance readability. Bottom: 90–100% interval enlarged; in the bottom-
left histogram, white dashed lines added to point at the integer percentiles.

Digression: saving the plots

The most practically important feature of the R plotting system is that it allows
creation of fully reproducible plots which can be easily saved as graphical files.
R has got a number of options for doing this. The general pattern of saving a
graphical file includes three steps: (1) opening a graphic device, (2) plotting a
graph, (3) closing the device.22
We shall learn more on the diversity of graphic devices later, in the section
on preparing presentation quality graphs. Now, only the very first step: how to
save a PNG image. PNG (stands for Portable Network Graphics) is a widely
used file format for raster graphics using loseless compression. It is good for
saving graphs with few colours and sharp edges, it is well understood by all
22 It should be noted that the graphic console is also a graphic device. By default, if no

other device is specified, it opens automatically when a plotting function is called. Like other
devices it can be closed using dev.off().

21
major raster graphics editors and preview software, including web browsers. To
save our simplest histogram (fig. 5, top-left), it would be enough to type:
png("hist.students.df.height.png") # 1. Opening a graphic device;
hist(students.df$HEIGHT) # 2. Plotting a graph;
dev.off() # 3. Closing the device;

It is vitally important to close the device because, while it is open, R redirects

all its graphical output to this device and not to the ‘standard output’ (or
stdout) which is the graphics console. The set of plotting commands between
(1) and (3) can be very long depending on the nature of the graph you are
plotting and fine-tuning of the image parameters. It may include many lines of
code. The example given above is the simplest possible code with no tuning.
The default size of a PNG file is 480 × 480 pixels, which is a bit too small
even for the on-screen preview. It comes from the ancient times when many
computer screens were still about 640 × 480 or 800 × 600 at best. Of all aspects
of fine-tuning the printed image we shall learn now only how to regulate the
image size in pixels. The arguments height and width are responsible for that.
In the PNG printing device, they are supplied with numeric values denoting the
number of pixels.23 A slightly bigger graph 500 × 500 pixels is usually enough
for preview. The still bigger graphs face other problems we are not yet ready
to discuss.
png("hist.students.df.height.png", height=500, width=500)
hist(students.df$HEIGHT,
main="Undergraduate students", xlab="Height, cm",
breaks=20,
density=15, angle=45)
dev.off()

As you see, here I included a longer multi-line plotting command. I did it

to illustrate the point made above on the possibility to use as complex plotting
commands as you need. This one is, in fact, rather short. The plotting command
for the six basic graphs (fig. 2), however simple it was, included 17 much longer
lines of code, and some plots may take pages of code to look fine.

Simple bar plots

As I said above, the bar plots are used to represent variation of a qualitative
variable encoded with nominal or ordinal scale. R has got two functions that
produce bar plots. The first one is the generic plot() function which, by default,
returns a bar plot when a factor variable is used as an argument.
plot(students.df$SEX)

The result of this code can be seen in fig. 2, top-right. plot() uses the
factor values sorted according to their numeric codes (as you might remember,
by default they are sorted alphanumerically) for the x-axis, and frequency of
the units characterised by this or that state of the variable, for y-axis. One
can’t change much the appearance of a bar plot drawn with the generic plot
function. The axes can be scaled with xlim and ylim, the main, xlab, and ylab
23 Some printing devices expect size in inches but it is not the default expectation for PNG.

22
80
150

60
100

40
50

20
0

0
f m 0 1 2 3 4

Figure 7: Simple bar plots. Right: students by gender. Left: students by self-reported
smoking class, from ‘never tried’ (0) through ‘heavy smoker’ (4), see dataset legend
for details.

can be also adjusted but, sometimes, even the simplest bar plots require more
interventions.
A more specialised barplot() function offers more controls. Like other
specialised functions, barplot() is not ‘smart’ enough and requires a special
kind of data object as its argument. When visualising a single variable, the
object should be a table. A tabular summary of a variable can be obtained
with table() function. The nature of the variable is unimportant, table()
tabulates numerical vectors as easily as character vectors and factors.
> table(students.df$SEX)

f m
169 52
> table(students.df$SMOKING)

0 1 2 3 4
55 51 85 21 11
>

A tabular summary can be used immediately as data argument for barplot():

barplot(table(students.df$SEX))
barplot(table(students.df$SMOKING))

The results of these two lines of code see in fig. 7. It should be noted that
barplot() can not use the original variable as an argument.24 To specify the
height of the bars, it requires a numeric vector. In our example, it was supplied
as a table, but it can be supplied as a simple vector as well (including numerical
vectors within data frames).
24 In its turn, the use of table() transformation with plot() is technically possible but

it brings rather weird results: they are interpretable but as a graphical representation of the
data they leave much to be desired.

23
6

6
5

5
4

4
Mentions
3

3
2

2
1

1
0

0
Bardill Fries Kant Salat

Wagner
Schulze
Schleiermacher
Salat

Schelling
Salat
Reinhold
Krug
Krause
Koeppen
Kant

Kant
Jacobi
Herder
Herbart
Fries

Hegel
Fries
Fichte
Eschenmayer
Bouterwek
Bardill

Beck
Bardill

0 1 2 3 4 5 6 0 1 2 3 4 5 6

Mentions Mentions

Figure 8: Controlling parameters of a bar plot. Top-left: raw barplot() applied

to a vector within the phil data frame. Top-right: names.arg applied. Bottom-left:
horiz=TRUE applied. Bottom-right: las=1 applied. See text for code details. For the
further improvements, see fig. 9.

The ‘students’ dataset contains no data that would cause any problem with
representing them in a bar plot. A microscopic real-life dataset on philosophers
mentioned in the chapter headings of the six German textbooks in the history
of philosophy published from 1800 through 1830 will be, strangely enough, more
challenging.25
A straightforward application of barplot() function to the appropriate vari-
able within the phil data frame looks rather imperfect (fig. 8, top-left):
barplot(phil$Freq)
25 The dataset is supplied in a separate file philosophers.txt In the code examples, it

is supposed to be read in the phil object, which is a data frame. In fact, it was one of
the many by-products of a rather complex data transformation used in Maxim Demin &
Alexei Kouprianov (2018): Studying Kanonbildung: An Exercise in a Distant Reading of
Contemporary Self-descriptions of the 19th Century German Philosophy, Social Epistemology,
DOI: 10.1080/02691728.2017.1414332

24
In fact, it looks even worse than the ones for the ‘students’ dataset (fig. 7).
We see just the bars tehmselves and a scale axis to the left. Not a single philoso-
pher’s name appears in the plot. Why the students data looked better? Because
tabulated data contain a names attribute (see the code examples above), which
is picked by barplot() by default. When working with a vector of unnamed
entities, another vector with names is needed to label the bars (names.arg ar-
gument specifies the labels).26
barplot(phil$Freq, names.arg=phil$NAMES, ylab="Mentions")

Some labels appeared. However, just as it happens to any barplot (including

those in MS Excel), most of them are skipped because they are either too long
or too numerous (or both) to fit in the graph. The labels in fig. 7 were too
short to observe this effect but the philosophers’ names presented us with a real
problem. To bring them all to light and to make them easily readable, we need
to turn the graph and change label orientation. Turning the graph with horiz
(counter-clockwise) does not help much (fig. 8, bottom-left):
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, xlab="Mentions")

To turn the axis labels, we need another argument, las (fig. 8, bottom-
right):27
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")

There remain three problems. One of them (which would not be present in
the on-screen plot, for the reasons we shall discuss later) is that the graph is
over-cluttered with labels.28 The other is that the y-axis labels are only partly
visible (e. g., ‘rmacher’, the third from top, is apparently ‘Schleiermacher’ who
was too long to fit in the margin). The third I would not call a problem, but
sometimes you may also need to order the items in the barplot according to their
frequencies. These three problems require three different kinds of solutions, and
none of them can be solved using barplot() arguments.
The first one will not be treated here in detail. It is enough to say that it
concerns the relative size of font and plotting area (I could make the graphs
looking exactly the way you see them on screen at the expense of labels being
to small to read them without a magnifying glass but I preferred to keep them
readable). For the purposes of the fig. 9 it was solved by changing the image
file aspect ratio from 1:1 to 8:7.
The second is, however, more pressing and pretty annoying when you need
to work with lots of barplots and have no time to play around with margin
widths adjusting them manually. Fortunately, R has got a couple of functions
that return the size of a text string (strwidth() for the length of a string
26 And we, of course, should label the frequency axis in some meaningful way. For the

purposes of our graph "Mentions" would be enough, as it should be explained in the article
text and image caption in more detail.
27 The las argument can be assigned with any of four numerical codes. 0 (default): x- and

y-labels are oriented along their axes (x-labels — horizontally, y-labels — vertically); 1: both
horizontally; 2: both perpendicular to their axes; 3: both vertically.
28 As you might have noticed, some of the graphs appear in this manual in a slightly

different way than on your screen. This is largely because of rather different requirements for
printing-quality graphs and on-screen graphs we shall discuss in an appropriate section of the
manual.

25
Wagner Salat
Schulze Koeppen
Schleiermacher Herder
Schelling Krug
Salat Krause
Reinhold Wagner
Krug Schleiermacher
Krause Herbart
Koeppen Fries
Kant Eschenmayer
Jacobi Bouterwek
Herder Schulze
Herbart Reinhold
Hegel Hegel
Fries Beck
Fichte Bardill
Eschenmayer Schelling
Bouterwek Jacobi
Beck Kant
Bardill Fichte

0 1 2 3 4 5 6 0 1 2 3 4 5 6

Mentions Mentions

Figure 9: Controlling parameters of a bar plot. Continued from fig. 8. Left: the plot
stretched vertically to provide enough space for all bars (not necessarily needed when
working with standard output, the graphics console), left margin adjusted to make the
names visible. Right: same as the left, data ordered according to frequency. See text
for code details.

and strheight() for its height). The length can be returned in several ways
specified with the units argument. The units can be defined in three ways, of
which two are easily understandable and more immediately relevant to ordinary
users. "inches" mean literally the size in inches, while "user" returns size in
the units of the coordinate system used in a particular plot. The first is more
useful for identifying the size of text labels printed in the margins of the plot,
while the second is better for labels used in the plotting area. What we need
now is to find the maximal length of the y-axis label in inches and somehow
account for it when specifying the width of the left margin of the plot.29
The maximal width of the label can be thus identified with:
max(strwidth(phil$NAMES, units="inches"))

Now, we need a method to feed it into the plot parameters. The width
of the margins is defined at the level of the plotting device with mar or mai
arguments of par() function. The first argument specifies the number of lines
of text, which could be printed in the margin this wide, the second the width
of a margin in inches (only one of them should be used for a given graph).
Margins are defined clockwise starting from the bottom margin. The default
values are:30
> par()$mar
29 In fact, there are other important arguments in the strwidth() and strheight() func-

tions because the size of the string depends on many parameters, all of which should be taken
into account. In our case, however, we do not change the default font family, size, etc., so we
are spared of all these complications.
30 Unfortunately, this easy way of retrieving default values does not work for all funcions,

so, usually, you need reading relevant help pages.

26
[1] 5.1 4.1 4.1 2.1
> par()$mai
[1] 1.02 0.82 0.82 0.42
>

For the purposes of our graph (fig. 9, left), I used the following script:
par(mai = c(0.82, 0.42 + max(strwidth(phil$NAMES, units="inches")), 0.42, 0.42))
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")

Now, the final touch. To sort the items according to their frequencies, we
need a rather simple data transformation. The phil data frame should be
sorted not by phil$NAMES but by phil$Freq. This can be accomplished with
order() function.31 We shall create a new object, phil.s which contains the
same philosophers but sorted differently.32
phil.s <- phil[order(-phil$Freq),]

The resulting graph (fig. 9, right) can be called with the code:33
par(mai = c(0.82, 0.42 + max(strwidth(phil.s$NAMES, units="inches")), 0.42,0.42))
barplot(phil.s$Freq, names.arg=phil.s$NAMES, horiz=TRUE, las=1, xlab="Mentions")

We shall come back to bar plots in the section on bivariate plots. Alongside
with mosaic plots, they can be used to visualise interrelationships of qualitative
variables.

Visualising interdependence: bivariate plots

Scatter plot
The scatterplot is the most common way to represent interdependence of two
quantitative variables. R returns a scatter plot after calling the plot function
supplied with two arguments for x and y coordinates when both variables are
numeric vectors. The function call can be formulated in two equivalent ways:
plot(x, y) or plot(y∼x). The second, which is called ‘formula’ syntax, is
more convenient when working with some forms of formal analysis because the
function calls frequently look like function(y∼x), thus making editing of the
code easier. We shall proceed the same way as with histograms, from the basic
default plot to more and more perfect ones.
Many parameters of plot() are already familiar to us from the previous
sections, so I will not treat them in detail here focusing only on the more specific
problems of scatter plots.34
Let us suppose that we would like to plot students’ body mass against stu-
dents’ stature. The following function call will return a plot, which can be seen
in fig. 10, left:
plot(students.df$HEIGHT, students.df$MASS)
31 It will be discussed in more detail in the section on data transformations.
32 As you see, we list the rows of phil in descending order of phil$Freq (hence the minus
sign before the variable name).
33 Note, please, that the code is the same as for the previous graph save the name of the

object.
34 The axes range is specified with the same xlim and ylim parameters, margin labels with

main, xlab, and ylab.

27
Figure 10: Simple scatter plots (continued in fig. 11). Left: raw
plot(students.df$HEIGHT, students.df$MASS) call. Right: pch appearance
changed, axes labels added.

There are many things in this plot, which can be improved. E. g., we may
wish to change the appearance of plotting symbols from open circles to some-
thing more aesthetically acceptable. We may wish to smuggle into our plot some
important third dimension, e. g., the degree to which the data points overlap
each other (it happens even to rather small samples if the measurement scales
are crude enough) or, given the natural heterogeneity of our sample, whether
the graphical separation of male and female students displays any meaningful
pattern.
The shape of the plotting symbol is governed with the pch argument. The
value for this argument is either a numerical code or a text string. The standard
26 symbols with their numerical codes are shown in fig. 12 (as you see from
the example above, pch defaults to 1). Besides these symbols, any character
supplied as a text string can be used to represent a data point (examples are
given in the bottom-left corner of fig. 12). It should be noted, however, that the
use of other characters than "." for presentation quality plots is not advisable for
aesthetical reasons (even though sometimes it may be useful for quick previews
at some points of analysis).
The following code brings us fig. 10, right (now, the axes are labelled and
the appearance of data points changed):
plot(students.df$HEIGHT, students.df$MASS,
xlab="Stature, cm", ylab="Body mass, kg", pch=20)

For a quick assessment of the degree to which the data points overlap each
other, a semi-transparent colouration can be used.35 The semi-transparent
colours, as I promised in the section on histograms, can be called with rgb()
function. Its arguments define the intensity of the three channels of RGB system
and the α channel, responsible for transparency. All channels’ intensities may
35 See the section ‘Stepping beyond two dimensions’ below for a more sophisticated ap-

proach.

28
Figure 11: Simple scatter plots (continued from fig. 10). Left: a quick way to as-
sess data points overlap using rgb() function with transparency as a value for col
argument of plot(). Right: a quick way to bring in a third variable, students gen-
der by supplying a numerical value of the students.df$SEX factor level as a colour’s
numerical code for col argument.

vary from 0 to 1, from solid black rgb(0,0,0,1) to solid white rgb(1,1,1,1).

The fourth channel set to 0 gives us completely transparent colours which look
the same under all shades of the three former channels, just like the Emperor’s
new clothes. When one needs semi-transparency, some figure in between 0
and 1 is needed. E. g. rgb(1,0,0,.3) would give a semi-transparent red,
rgb(0,1,0,.3) a semi-transparent green, etc. For our graph, I used a semi-
transparent black:
plot(students.df$HEIGHT, students.df$MASS,
xlab="Stature, cm", ylab="Body mass, kg", pch=20, col=rgb(0,0,0,.3))

The results can be seen at fig. 11, left. Now, we see that some points (pale-
grey) are apparently representing just one individual. The darker is the point,
the more individual data points overlap demonstrating exactly the same com-
bination of stature and body mass, given the limitations of our measurement
scale (stature rounded to centimeters, and body mass, to kilograms).
As we knew too well from the very beginning, our students dataset contains
data on both female and male students. There is a way to preview quickly this
heterogeneity. Apparently, we need to use either different colours or different
data point shapes to distinguish students of different genders. The easiest way
to do this not involving data transformations is to supply the numerical value
of the factor students.df$SEX as the value for the col argument. As you may
remember, we mentioned the standard numerical codes for nine basic colours
above. The numerical values of the levels of our factor would be 1 and 2, so
we may expect black and red dots for female and male students respectively
(remember that, by default, the factor levels are sorted alphanumerically). The
following code brings us fig. 11, right.
plot(students.df$HEIGHT, students.df$MASS,

29
0 1 ● 6 11 16 ● 21 ●

2 7 12 17 22

"." 3 8 13 ● 18 23

"a" a 4 9 14 19 ● 24

"1" 1 5 10 ● 15 20 ● 25

Figure 12: 0–25: numerical codes for 26 pre-set printing characters (pch). Characters
21–25 contain background fill colour which can be defined separately from the outline
colour (col) using bg argument (defaults to "white", for this figure it was set to
bg="red"). The text strings ".", "a", and "1" in the bottom-left corner illustrate
the possibility to use any characters (letters, numbers, and punctuation marks) in the
plots. Even though it is, in principle, possible, the use of text strings, except ".", "*",
and "+", is not advisable.

xlab="Stature, cm", ylab="Body mass, kg", pch=20, col=as.numeric(students.df$SEX))

We can see now, that the male students (red dots) concentrate in the top-
right portion of the data point cloud, while the female students (black dots)
concentrate in the opposite bottom-left corner. This preview does not allow us
to use transparency, but any more sophisticated graph would require at least
a minor data transformation. In our case, it would be simple: we need to
segregate female and male students into two different objects, students.f.df
and students.m.df.
The segregation of a dataset into subsets can be achieved with the subset()
function. Its general form looks like this: subset(object, condition(s)). We
should first specify the object we would like to use for the subsets extraction,
then one or several conditions for subset extraction (see tab. 3 for the list of
operators used to formulate and combine conditions). E. g., if we would like to
put female and male students into different objects, this can be done this way:
students.f.df <- subset(students.df, students.df$SEX == "f")
students.m.df <- subset(students.df, students.df$SEX == "m")

Now, we have two different objects for female and male students and we can
proceed with our graph. We already know that the two clouds of data points

30
Table 3: Conditional operators in R. The two former can be applied to character
values as well as numerical values, the following four can be applied to the numerical
values only. The latter two are used to combine different conditions that should be
applied simultaneously. See more in the section on data transformations.

Conditional operator Explanation

== Equivalent
!= Not equivalent
<= Less or equal
>= Greater or equal
< Less than
> Greater than
& Logical AND
| Logical OR

are shifted relatively to each other, gravitate to opposite corners of the plotting
area, and have apparently a very limited overlap area. This means that if we,
say, plot first the cloud for female students, R would authomatically adjust the
size of the plotting area in such a way that there will be not enough space to plot
all data points for male students, and vice versa. This means that we should
learn another practical trick.
The easiest solution is as follows. We should first call an empty plot with
enough space for all data points and then add the data points for all subsets
to the existing empty plot. To call an empty plot we need to learn another
argument of the plot() function, type. This argument may be assigned several
values, each of them representing a specific kind of graph.36 For the purposes
of the present graph we shall use the "n" value, which means ‘an empty plot for
the data points specified with x and y arguments’.
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")

The result can be seen in fig. 13, top left. As you see from the axes, the size
of the plotting area is exactly the same as in figs. 10 and 11, but the data points
themselves are not present. It’s time to add them. To add data points to the
existing plot, the points() function is used. Its arguments are similar to those
of the plot() function, but the points() function can not influence axes and
text in the margins of the plot (xlim, ylim, main, xlab, ylab, and some other
arguments).
points(students.f.df$HEIGHT, students.f.df$MASS, pch=20, col=rgb(1,0,0,.3))

The results of this code can be seen in fig. 13, top right. I used semi-
transparent red to account simultaneously for both the possible overlap of the
data points and the differences connected to gender. Now, we can add the data
points for male students, for which I picked a semi-transparent blue:
points(students.m.df$HEIGHT, students.m.df$MASS, pch=20, col=rgb(0,0,1,.3))
36 This subject will be treated in more detail in the following sub-section on time series

plots.

31
Figure 13: Adding points to a plot. Top-left: calling an empty plot with type="n".
Top-right: adding the points for the first subset. Bottom-left: adding points for the
second subset. Bottom-right: the same as left but with the use of black and white
symbols with different geometry.

Now we can see clearly the two clouds as well as the small overlap area
including several violet dots of varying intensity in the middle of the combined
cloud where female and male students’ data points overlap.
It should be noted that the points() function itself can not call the plot.
It can only add elements to an existing plot, so, first the plot() (or any other
primary plotting function) should be called, and only then the time comes for
points(). Thus, the complete code for fig. 13, bottom-left, should, in fact, look
like this:
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=20, col=rgb(1,0,0,.3))
points(students.m.df$HEIGHT, students.m.df$MASS, pch=20, col=rgb(0,0,1,.3))

Understandably, sometimes, we need to show the same relationship in black

32
and white line art. A possible solution is proposed in fig. 13, bottom-right. The
code for it is as follows:
plot(students.df$HEIGHT, students.df$MASS, type="n",
xlab="Stature, cm", ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=3)
points(students.m.df$HEIGHT, students.m.df$MASS, pch=4)

A digression: Time series and the type of plot()

From the plotting perspective, a time series is but a variety of a scatter plot
with time at the x-axis, and some parameter varying with time at the y-axis. A
graphic convention is only slightly different, for a time series is usually depicted
not as a set of points but as a line connecting the points at which the parameter
under study was measured (hence, it is usually called a line chart).37 There is
a couple of vitally important differences, however, in the dataset organisation.
First, the entries of the dataset should be ordered chronologically. Secondly, in
most cases, a special data type should be used to encode date and time. In this
section, I shall not treat the latter topic at length keeping the details to absolute
minimum. A more detailed discussion of the date and time representations in
R will be postponed untli the data transfotmations part.
As the new graphical convention requires a special kind o plot, I feel, it
is a valid excuse to learn a bit more of types of plots in general. This time,
however, we shall not speak of the meaning of plots. We shall consider the
formal argument type of the plot() function. We have already seen that it
may take the "n" value when we do not need any data points to be plotted.
The time is ripe to learn what are the other possible values and which graphical
outcomes they bring. To do it, we need first to create objects for a list of plot
types, x and y coordinates of data points, and then to plot different types of
plot() next to each other to appreciate the difference.38
# Putting together example objects;
plot.types = c("n", "p","l","o","b","c","s","S","h")
x.coord <- c(1:7)
y.coord <- c(1, 3, 7, 4, 2, 3, 5)

# Setting parameters for the plotting device;

par(mfrow=c(3, 3), pch=20)

37 One may think that as time is continuous, so, the dependent variable should be measured

continuously. This, however, is not always technically possible (the only notable exception
that comes to my mind is a heliograph, an ingenious analog inscription device composed of a
sphaerical lens and a band of paper onto which the Sun burns its trace as the Earth revolves).
The bits of time between which we measure the parameter under study may be tiny, but they
are bits. A more valid objection would be that if the dependent variable is countable (like the
number of people in a certain category), then a line merely connecting the points is misleading.
Indeed, if there were 222 students by January 1, 1882 and 267 students by January 1, 1883 in
the Imperial Moscow University, there is no reason to believe that there were 244.5 students
by July 1 and 255.75 by October 1, 1882 (the numbers are taken from the dataset we will
use a bit later). It is, however, not the plot itself that is misleading, it is our straightforward
interpretation that is. The continuity of the line symbolises here not a continuous variation
but the identity of an analytic object (e. g., a body of students). For methodological purists,
there is a couple of possible alternatives to the default line chart though, which are considered
below.
38 Please, do not forget to close the plotting device after plotting this, or to revert the par()

to par(mfrow=c(1, 1), pch=1).

33
Plot type = n Plot type = p Plot type = l
8

8
●
6

6
y.coord

y.coord

y.coord
●
4

4
●

● ●
2

2
●

●
0

0
0 2 4 6 8 0 2 4 6 8 0 2 4 6 8
x.coord x.coord x.coord

Plot type = o Plot type = b Plot type = c

8
● ●
6

6
y.coord

y.coord

y.coord
● ●
4

4
● ●

● ● ● ●
2

2
● ●

● ●
0

0
0 2 4 6 8 0 2 4 6 8 0 2 4 6 8
x.coord x.coord x.coord

Plot type = s Plot type = S Plot type = h

8
6

6
y.coord

y.coord

y.coord
4

4
2

2
0

0 2 4 6 8 0 2 4 6 8 0 2 4 6 8
x.coord x.coord x.coord

Figure 14: Nine values of the type argument for the plot() function.

# Using for(){} loop to print all plots in a single chart;

for (i in 1:9){
plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),
type=plot.types[i],
main=paste("Plot type =", plot.types[i]))
}

The results of this code can be seen in fig. 14. As you see, the default value
for type in plot() is "p". As for the other types, it seems that we haven’t
seen them before (and, actually, we will not see most of them afterwards). Two
types, however, are most relevant to the time series plots, ("l" and "o", some
would also consider "b" as an option, but it looks no good when the data points
are close to each other).39 I would generally recommend "l" for the cases
when data points are separated with equal time intervals, while "o" (or "b")
suits better the cases when the time intervals between the measurements are
39 It can be easily seen, by the way, that, graphically, "o" is a combination of "p" and "l",

while "b" is a combination of "p" and "c".

34
Points default Points type="l"
8

8
●
6

6
y.coord

y.coord
●
4

4
●

● ●
2

2
●

●
0

0
0 2 4 6 8 0 2 4 6 8
x.coord x.coord
Lines default Lines type="p"
8

8
●
6

6
y.coord

y.coord

●
4

● ●
2

●
0

0 2 4 6 8 0 2 4 6 8
x.coord x.coord

Figure 15: points() and lines() are nearly the same but their type arguments
default to different values ("p" and "l" respectively).

of various length (the points would thus emphasize the moments at which the
parameter was measured).
What is a bit surprisig is that the points() function, which we used to
add points to the already existing plot, actually has the same parameter type
(defaults to "p"), which works exactly the same way as with plot() with all its
nine options. As the time series plots are most usual, a special function lines(),
which works exactly the same way as points() and plot() but defaults to "l"
is also present.40 You may check it with the following code (see fig. 15):41
par(mfrow=c(2, 2), pch=20) # Setting parameters for the plotting device;

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

40 It should be noted that, for quite obvious reasons, lines() as well as points() and

plot() of type="l" and the like require at least two data points to plot something. This
influences also the handling of NA values (you may test it experimentally on your own).
41 Please, do not forget to close the plotting device after plotting this, or to revert the par()

to par(mfrow=c(1, 1), pch=1).

35
8

8
● ●
6

6
y.coord.1
y.coord

● ●
4

4
● ●

● ● ● ●
2

2
● ●

● ●
0

0
0 2 4 6 8 0 2 4 6 8
x.coord x.coord.1
8

8
● ●
6

6
y.coord.1
y.coord

● ●
4

● ●

● ● ● ●
2

● ●

● ●
0

0 2 4 6 8 0 2 4 6 8
x.coord x.coord.1

Figure 16: On the importance of sorting the entries when plotting a line chart.

type="n", main=’Points default’)

points(x.coord, y.coord)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Points type="l"’)
points(x.coord, y.coord, type="l")

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Lines default’)
lines(x.coord, y.coord)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Lines type="p"’)
lines(x.coord, y.coord, type="p")

Thus, to simplify the code, the lines() function can be used when adding
multiple time series to the same graph (as well as, of course, the points()
function with an appropriate type value). There is, however, a most important
difference between plotting data ploints and lines. To illustrate that, I would

36
8

8
● ●
6

6
● ●
y.coord

y.coord
4

4
● ●

● ● ● ●
2

2
● ●

● ●
0

0
0 2 4 6 8 0 2 4 6 8
x.coord.2 Index

Figure 17: A vector, x.coord.2 in this case, (left) vs. the index (right) as the x-axis
value source. Y-axis values are in both cases derived from the same vector, y.coord.
See text for code details.

like you to consider carefully the following example (fig. 16):42

# Putting together example objects;
x.coord.1 <- c(1, 7, 3, 6, 5, 2, 4)
y.coord.1 <- c(1, 5, 7, 3, 2, 3, 4)

# Setting parameters for the plotting device;

par(mfrow=c(2, 2), pch=20)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8))

plot(x.coord.1, y.coord.1, xlim=c(0, 8), ylim=c(0, 8))
plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o")
plot(x.coord.1, y.coord.1, xlim=c(0, 8), ylim=c(0, 8), type="o")

You might have noticed that x.coord.1 and y.coord.1 describe the same
set of points as x.coord and y.coord but the points are sorted differently. In
the original vectors (x.coord and y.coord) we used in fig. 14, 15, and 16 (left),
the points are sorted by x.coord in ascending order, while in x.coord.1 and
y.coord.1 no such ordering is observed.
The plot(type="o"), like other line chart types, uses the order of elements
within the object to identify the order in which the data points should be con-
nected to each other with line segments to form a line. This is why the bottom
plots in fig. 16 are so different. The thing responsible for that was seen by us
many times. Remember those element numbers in their square brackets? This
thing has its name, index. So, one may see that the index value for the data
point at (7, 5) in the original dataset (fig. 16, left) is [7], and in the modified
dataset (fig. 16, right), [2].43
42 Please, do not forget to close the plotting device after plotting this, or to revert the par()

to par(mfrow=c(1, 1), pch=1).

43 You can check it on your own by calling x.coord[7], y.coord[7], x.coord.1[2], and

y.coord.1[2].

37
lty = 6
lty = 5
lty = 4
lty = 3
lty = 2
lty = 1
lty = 0

Figure 18: Line types (the lty argument). 0 — transparent; 1 — solid; 2 — dashed,
with short dashes; 3 — dotted; 4 — dots and short dashes; 5 — dashed, with long
dashes; 6 — very short and longer dashes (magnification needed).

In fact, index may serve as an x-axis variable itself. If you supply plot()
with just one vector, it will by default to serve as a source for the data points’ y-
axis values, while the x-axis values will be derived from index. You should take
into account, however, that index starts with 1 and has, naturally, a regular
increment of 1, while the successive values in a real-life x-axis variable may be
rather diverse even when sorted. See fig. 17 for a rather moderate but, none the
less, telling example, the code for which is provided below:
# Adding a new object;
x.coord.2 <- c(1, 3:8)

# x.coord.2 is used for x-axis with y-coord:

plot(x.coord.2, y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o", pch=20)

# index is used for x-axis:

plot(y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o", pch=20)

Naturally, one may find it useful to add more than one line to a time series
plot. We already know that this can be done with points() and lines().
When colours are available (e. g., in on-screen presentations), different lines can
be distinguished by colours (the already familiar col argument). When they
are not (e. g., in a scholarly journal with no colour prints), we need to employ
different line types (the lty argument) to highlight the difference. Lines are
less graphically diverse than the data point shapes, but there are no less than
six (seven, actually) kinds of them (see fig. 18). Despite all this diversity, it is
not advisable to use more than three contrasting line types in the same plot
(arguably the best combination is 1-3-5).
Now, we are nearly ripe for our first real-life time series plot. The last thing
to learn here is how to represent dates, on which a time series plot is supposed
to base. By default, R recognises dates encoded in YYYY-MM-DD format (e. g.,
2018-04-12 for April 12, 2018). Other date formats can also be used after some
additional transformations.44 A training dataset on the temporal dynamics of
the Imperial Moscow University students over the last two decades of the 19th
century contains dates in YYYY-MM-DD format and four variables reflecting the
yearly reports on the number of students immatriculated with the four faculties
(Law, History and Philology, Physics and Mathematics, Medicine):45
44 As I have already said, an extended discussion of the date and time formats will be
postponed until the section on data transformation.
45 See the dataset at https://github.com/alexei-kouprianov/Breaking-the-ice-with-R

38
Law
History and Philology
Physics and Mathematics
1500

Medicine
Number of students
1000
500
0

1885 1890 1895 1900

Timeline

Figure 19: A time series plot. Moscow university students by faculty. See text for
code details.

# Loading the data for Imperial Moscow University students;

moscow <- read.table("Moscow.students.txt", h=TRUE, sep="\t",
stringsAsFactors=TRUE)

> head(moscow)
DATE HP L PM M
1 1881-01-01 190 451 392 1397
2 1882-01-01 222 567 463 1346
3 1883-01-01 267 692 526 1314
4 1884-01-01 276 844 497 1257
5 1885-01-01 297 949 550 1195
6 1886-01-01 314 1045 586 1237

As the dates supplied this way are essentially text strings, to plot a time
series, one needs to use as.Date() function to interpret them correctly. Here is
a nearly complete code for the fig. 19:46
46 The legend located in the top-left corner is not included. The legend() function will be

discussed in the section on the preparation of presentation quality graphs.

39
# Plotting the time series for the faculty of Law;
plot(as.Date(moscow$DATE), moscow$L, type="l",
ylim=c(0, max(moscow[, 2:4])),
main="", xlab="Timeline", ylab="Number of students")

# Adding time series for the other three faculties;

points(as.Date(moscow$DATE), moscow$HP, type="l", lty=5)
lines(as.Date(moscow$DATE), moscow$PM, lty=3)
lines(as.Date(moscow$DATE), moscow$M, lty=4)

As you might have noticed, I used both points() and lines to add time
series to plots. This was quite deliberately, because I wanted to stress again
that both these functions can be used to add lines to the plot.
Of course, for practical reasons, it is a lot easier sometimes (e. g., when the
leaps of time between measurements are year long or even longer) not to use the
date format at all. Even though not quite philosophicaly correct, technically,
the plot based on a simplified time series in which dates are represented with
just years or decades, will not look much (if at all) different from a ‘properly
dated’. The strictly chronological order of entries is, however, to be observed
for the reasons discussed above.

Multiple boxplot
The multiple boxplot is used when we need to compare several samples or
sub-populations by some quantitative parameter. With our training students
dataset, we may wish to compare, e. g., students of different genders by height
or body mass, or do the same for other possible groupings like smokers and
non-smokers or different departments and academic groups.
We do know that variation of a quantitative variable across the sample can
be depicted with a histogram. Theoretically, we may compare two samples
just by printing two histograms next to each other (fig. 20, top-left). To do it
we need to modify the plotting device properties using mfrow argument of the
par() function.47
# Re-creating the objects;
students.df <- read.table("kouprianov.students.v.2.1.txt", h=TRUE,
sep="\t", stringsAsFactors=TRUE)
students.m.df <- subset(students.df, students.df$SEX == "m")
students.f.df <- subset(students.df, students.df$SEX == "f")

# Drawing two histograms;

par(mfrow=c(2, 1)) # Setting the capacity of the plotting device;
hist(students.m.df$HEIGHT, col=8, main="Male students", xlab="Stature, cm")
hist(students.f.df$HEIGHT, col=8, main="Female students", xlab="Stature, cm")

To make the histograms truly comparable, one needs to adjust the x-axes,
though, applying proper xlim values (fig. 20, bottom-left):
par(mfrow=c(2, 1)) # Setting the capacity of the plotting device;
hist(students.m.df$HEIGHT, xlim=c(140, 200), col=8,
main="Male students", xlab="Stature, cm")
hist(students.f.df$HEIGHT, xlim=c(140, 200), col=8,
main="Female students", xlab="Stature, cm")
47 I also made them look less ugly from the start by adding colour, titles, and x-axis labels.

40
Male students
Frequency
0 10

160 170 180 190

Stature, cm

Female students
0 20 50
Frequency

145 155 165 175

Stature, cm

Male students
Frequency
0 10

140 150 160 170 180 190 200

Stature, cm

Female students
0 20 50
Frequency

140 150 160 170 180 190 200

Stature, cm

Figure 20: Comparing two samples by a quantitative variable: juxtaposed (left) and
superimposed (right) histograms. Top-left: a rough attempt to juxtapose stature
histograms for male and female students. Bottom-left: unified x-axis range allows
a reasonabe comparison. Tor-right: female students’ stature — red, male sudents’
stature — blue, frequency at the y-axis shows relative size of samples. Bottom-right:
same, density at the y-axis allows comparison of the bell curves’ proportions regardeless
of the sample size. See text for the code.

41
Another option would be to combine two semi-transparent histograms in the
same plot using the add=TRUE argument in the second hist (fig. 20, top-right).
Note, please, that all basic properties of the plot (like default xlim and ylim
values or axes labels) are specified only for the first hist() (this is why we need
plotting more numerous sub-population first):
par(mfrow=c(1, 1)) # Restoring the default settings;
hist(students.f.df$HEIGHT, xlim=c(140, 200), col=rgb(1, 0, 0, .5),
main="", xlab="Stature, cm")
hist(students.m.df$HEIGHT, col=rgb(0, 0, 1, .5), add=TRUE)

If the two samples we compare are of different size, and we still need to
compare visually the overall shapes of the bell curves, we may use density as
the y-axis value by setting freq=TRUE (fig. 20, bottom-right). Note, please, that
in this case we need to repeat the freq=TRUE for both calls of hist().
hist(students.m.df$HEIGHT, xlim=c(140, 200), freq=FALSE, col=rgb(0, 0, 1, .5),
main="", xlab="Stature, cm")
hist(students.f.df$HEIGHT, freq=FALSE, col=rgb(1, 0, 0, .5), add=TRUE)

In this particular case, we had also to change the order of appearance of the
samples (which explains the different shade of violet in the overlapping areas
of fig. 20, right), because the blue histogram has a little bit more acute excess
than the red one.
But, what if we need more than just two histograms? What if we need to
compare not the two sexes but ten academic groups? What about 85 regions in
the Presidential elections dataset? The more histograms we put into one plot,
the less informative is their shape. As they flatten, the peaks and tails get less
prominent and, in the end, nearly indiscernible. In this case, a series of boxplots
may seem a decent alternative.
As you might remember, the boxplots summarise the robust measures of
variation: median (the thick bar crossing the box) and quartiles (sides of the
box parallel to the median bar). It is a little bit trickier to explain what
‘whiskers’ mean. Usually they represent either maximum and minimum, or
median ±1.5×interquartile range (in the R functions in charge of boxplots, this
is sometimes adjustable).
Boxplots can be created either with the generic plot() function or with a
more specialised boxplot() function, which allows more controls at the expense
of less uniform syntax. The plot() generates boxplots authomatically if the x-
axis variable is a factor, and the y-axis variable is a numerical vector (see fig. 21,
top-left):
plot(students.df$SEX, students.df$HEIGHT)

As I promised, the boxplots are equally informative as summary graphs even

when they are rather numerous (fig. 21, top-right):
plot(as.factor(students.df$GROUP), students.df$HEIGHT)

Note, please, that to use academic group number as the x-axis variable
for boxplot, we shoud apply as.factor() transformation (as the groups are
encoded with numerical codes, they are perceived as numbers by default):48
48 You may try it without this transformation at your own risk and see the result.

42
●
190

190
●

●
180

180
170

170
160

160
●
150

150
●

f m 1 2 3 11 12 13 21 22 23 24

●
● 190
150

●
180
100

170
160
50

●
150

●
0

1 2 f m

Figure 21: Boxplots created with plot() (top), and boxplot() (bottom). See text
for the code and explanations.

From the section on the barplot() function, you might remember that the
objects it was ready to handle were of a dirfferent nature than those required
by plot() to produce the same result. It is the case with boxplot() too. If
we try to apply the same syntax as with plot(), the results will be strikingly
different (fig. 21, bottom-left):
boxplot(students.df$SEX, students.df$HEIGHT)

Someting is deeply wrong, it seems, with the variables involved. First, we do

not see the expected variation pattern: instead of two variables varying, roughly
speaking, within 150–195 range, we see a much vider variation with one variable
rouhly fitting the expected range and the other, only slightly varying, at about
zero. Secondly, there are, accordingly, no f and m at the x-axis tickmarks.49
The matter is that, with the boxplot() function, only the ‘formula’ syn-
tax (boxplot(y∼x)) is used to denote interdependence, while the enumeration
49 When you see something outrageously counter-intuitive, a general advice is always to

consult the help() pages for the function in question.

43
Figure 22: Scatter plot with jitter. Left: raw plot(students.df$HEIGHT,
students.df$SEX) call. Right: the same plot after some adjustments (xlab and ylab
fixed, data points pch and col changed, jitter() added to avoid extreme overlap of
data points, y-axis tuned to reflect the two qualitative states without any technically
inappropriate intermediate gradations). See text for details.

of comma-separated objects is reserved for adding numeric vectors to the box-

plot.50 As soon as we fix the syntax, our boxplot comes back in its expected
form (fig. 21, bottom-right):
boxplot(students.df$HEIGHT ~ students.df$SEX)

This feature of the boxplot() function is, in fact, most convenient, as it

allows to combine numeric vectors no matter from how many different objects
into one plot. E. g., the following script will result in exactly the same picture
as in fig. 21 top-left and bottom-right:
boxplot(students.f.df$HEIGHT, students.m.df$HEIGHT, axes=FALSE)
axis(1, at=c(1,2), labels=c("f","m"), col="white")
axis(2)

The only difference here is that we will need to take care of the x-axis tick-
mark labels on our own by switching off the default axes and manually adding
new ones. It is unfair, perhaps, not to explain the details of axes manipulation
here. My only excuse is that they will be treated in more detail in the following
section.

Scatter plot with jitter

This is a rather rare graph, and still, it is an important piece of our four-
basic-bivariate-graphs-puzzle. Besides that, the jittered individual data points
are sometimes added in the same manner to other plots (e. g., to the multiple
50 Now, we may easily decipher fig. 21, bottom-left. The first variable is nothing but

students.df$SEX represented with factor levels forming a rather ugly boxplot trying its best
to visualise a series of 1s and 2s. The second is students.df$HEIGHT, not split by gender.

44
boxplot we discussed in the previous section), so it would be useful to learn how
to handle them.
A most straightforward solution would be to call:
plot(students.df$HEIGHT, students.df$SEX)

The results are presented in the fig. 22, left. It seems that the plotting
function took the numerical values of two levels of the students.df$SEX fac-
tor for the y-axis coordinates. The plot is thus perfectly meaningful, however
ugly. Apparently, it needs standard aesthetical adjustments (a meaningful ap-
plication of xlab and ylab as well as some changes in data points appearance
we used before), so this goes without saying. It is clear, however, that, first,
no matter how semi-transparent the points will be, they will form indiscernible
linear clusters. And, secondly, that the y-axis needs a radical reconstruction.
This means that we need to use the jitter() function to add some noise to the
y-coordinate and that we need to learn how to fully control the axes.
A straightforward application of the jitter() to students.df$SEX returns
an error message:
> plot(students.df$HEIGHT, jitter(students.df$SEX))
Error in jitter(students.df$SEX) : ’x’ must be numeric
>

This problem is partly familiar to us. From the section on scatter plots, we
already know how to transform a factor into a numerical vector on the basis
of factor levels’ numerical values. We should apply first the as.numeric()
transformation, and only then, the jitter(). The following code will return a
slightly improved version of the fig. 22, left.51
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX)))

Now, before switching to purely aesthetical exercises, we need to deal with

axes. The original axes can be removed with axes=FALSE argument of the
plot() function. The new axes can be added to the plot one by one with the
axis() function.
The axis() function has got a number of arguments, and we shall need
nearly half of them. First, the most important argument is side. It comes first
by default and has no default value. The side of the graph is identified with a
numeric code starting from 1 for bottom and then going clockwise through 4 on
the right. We will need two axes, the bottom and the left ones, so we will use
1 and 2 as the value for side.
The x-axis looks tolerable, so we would just re-create it ‘as is’. The y-axis,
on the contrary, needs a complete reconstruction. First, we need to limit the
number of tickmarks to two, one for each gender. Secondly, we need to replace
the numerical values at the tickmarks with "f" and "m". Thirdly, the tickmark
labels should be written in a more natural way, i. e., both "f" and "m" should
be ‘standing’ and not ‘lying on their side’ as the numerical labels in fig. 22,
left. Finally, (this is a matter of pure aesthetics, of course), I would personally
prefer not to see any axis to the left at all, except for the labels "f" and "m"
at its (invisible) tickmarks. The tickmarks positions are defined with the at
51 It would be good to read help(jitter) and to exeriment with it to learn how to control

the amount of noise added.

45
argument, the tickmarks’ labels with labels, the tilckmark labels orientation
with the already familiar las (see the ‘Bar plots’ section for more details). The
invisibility of the axis and tickmarks lines will be achieved by frankly colouring
them to the background (white) colour.
Thus, the complete code for fig. 22, right, is as follows:
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX), factor=.5),
xlab="Height, cm", ylab="Gender", pch=20, col=rgb(0,0,0,.3), axes=FALSE)
axis(1)
axis(2, at=c(1:2), labels=c("f","m"), las=1, col="white")

As you see, within the jitter(), the noise factor is slightly diminished to
narrow the resulting clouds, within the plot(), the axis labels are fixed, pch
and col applied, and original axes suppressed. The bottom axis is redrawn ‘as
is’, while the left axis is supplied with a vector of at values, a vector of new
tickmark labels, tickmark label orientation (las) and white colour for the axis
and tickmarks’ lines (col).

Structured barplots
A structured barplot is needed when two qualitative variables meet each other.
R has a number of options for drawing graphs of this sort. The default mosaic
plot discussed above in the section on six basic graph types can be produced
with the basic plot() function supplied with two qualitative variables as x-and
y-axis values. The more conventional structured barplots require a special
barplot() function (we already started working with it above in the section
on simple barplots) and some additional data transformation. In this section, I
shall treat first the mosaic plot, and secondly discuss the structured bar plots.
First, we need the variables. In our training students dataset, there are
at least four variables that suit the purpose. Besides of conspicuously quali-
tative SEX, there are DEPARTMENT and GROUP (even though they are encoded
with numbers, these numerical codes, as we aleady know, should be perceived
as nothing but names). The scale used to encode SMOKING is ordinal, not nom-
inal, as in the three former cases, but from the drawing perspective it makes
no difference. The only thing we should keep in our minds is that all above-
mentioned variables encoded with numbers should be converted to factors (e.
g. as.factor(students.df$SMOKING) instead of just students.df$SMOKING)
to make R interpret them as ‘qualitative’ from the graphical perspective.
The code for the mosaic plot is very simple:
plot(students.df$SEX, as.factor(students.df$SMOKING))

The interpretation of the resulting graph (fig. 23) is rather tricky. It was
already given in the section on six basic graphs, but I repeat it here again in
more detail. The two vertcal bars represent the two states of the x-axis variable.
Their widths are proportional to the frequencies of the x-axis variable’s states.
In our case, we see that women are roughly three times more numerous than
men. The differently coloured areas in each bar represent the distinct states
of the y-axis variable. An ordered list of all possible states is given at the
left vertical axis, a frequency scale (from zero to one), at the right. A closer
inspection of the graph shows that roughly a half of the students in both genders
identify themselves as more or less regular smokers (classes 2–4). On the other

46
1.0
4

0.8
3

0.6
2
y

0.4
1

0.2
0

0.0
f m
x

Figure 23: Mosaic plot. Students’ self-identified smoking class (0–4, at the y-axis)
vs. gender (at the x-axis). See text for explanations and code details.

hand, less than 5% (0.05) of women students and slightly more than 10% (0.1)
of men students identify themselves as heavy smokers (class 4).
We can check our eye estimates by running table() for the items of interest.
> table(students.df$SEX, students.df$SMOKING)
0 1 2 3 4
f 43 37 68 16 5
m 11 14 17 4 6
> table(students.df$SEX)

f m
169 52

As one may see from these tabulations, our eye estimates were quite close to
the exact numerical proportions. There is nothing unnatural or counterintuitive
about that, for the graph represents the numerical proportions with the highest
technically possible degree of precision, I just wish to stress that this graph may
seem intuitive after a proper amount of explanation.

47
The structured barplots however are a little bit more conventional, and, gen-
erally, more advisable. We have already had some experience with barplot()
function, and its twisted character could not have escaped the reader’s atten-
tion. In the present section, we will deal only with the specific problems of
representation of the interdependance of two qualitative variables. All more
general problems are addressed above in the section on univariate barplots.
We already know that barplot() requires numerical values for frequencies
of distinct states of qualitative variables. When the relatinship between two
qualitative variables are explored, we need to visualise frequencies of their states’
possible combinations. E. g., considering the relationship between gender and
department, we should be interested to visualise somehow frequencies of women
and men students from the department no. 1 and those from the department
no. 2 (just as we did above for combinations of gender and self-ascribed smoking
habits’ classes). We already know that the most natural way to obtain these
frequencies is the table() function. Remember, this worked with the univariate
simple barplots as well. The only difference is that the tabulation of a single
variable produces a uni-dimensional table (a table with just one column), while
the tabulation of two variables result in a two-dimensional contingency table
with as many rows and columns as there are states in the first and second
chosen variables respectively (remember the Roman Catholic rule).
I. e., if we change positions of variables in the brackets, the table gets trans-
posed (this may be roughly described as turning 90°clockwise plus mirroring).
> table(students.df$SMOKING, students.df$SEX)

f m
0 43 11
1 37 14
2 68 17
3 16 4
4 5 6

The tabulation of SEX vs. DEPARTMENT brings the following result:

> table(students.df$SEX, students.df$DEPARTMENT)

1 2
f 114 56
m 46 6

Now, we should try to feed it into the barplot() as an argument.

barplot(table(students.df$SEX, students.df$DEPARTMENT))

A careful comparison of the table to the graph (fig. 24, top-left) shows that,
in a way, the latter is just a pictorial representation of the former mirrored
in relation to the x-axis (imagine the table, in which areas of the cells are
proportional to the values they contain). The columns of the table correspond
to the vertical bars, while the numbers in the table’s cells specify the heights of
bars’ segments. Note, please, an important difference from the mosaic plot we
produced with plot() function in the beginning of this section: in the plot(),
the default order of x- and y-axis variables prevails, while in the barplot(), the
table() interfers turning the tables.52 The order of bar segments is reversed
52 The pun was not intended at first.

48
150

100
80
100

60
40
50

20
0

0
1 2 1 2
30

15
25
20

10
15
10

5
5
0

A B C A B C

Figure 24: Structured bar plots. Top: students’ gender vs. department (departments
indicated with numbers, women with a darker, men with a lighter shade of gray).
Bottom: experimental test.df dataset. Left: results of raw barplot() function calls.
Right: same graphs with beside=TRUE. See text for explanations and code details.

(i. e., the top row of the table is placed at the bottom of the graph; but this
is quite natural, because the graph starts with zero, which is by default placed
at the graph’s bottom line). The bars’ proportions allow us to see at once that
the 1st department outnumbers the 2nd (the latter is just 52 ) of the former, and
that there is a significant gender disproportion (men comprise about 72 of the
1
1st department students and just about 10 of the 2nd ). That’s all we need from
this graph.
Of the many parameters controlling the appearance of the structured bar
plot, one is most important to know. It is beside (defaults to FALSE); when set
to TRUE it changes the appearance of the graph in a most dramatic way.
barplot(table(students.df$SEX, students.df$DEPARTMENT), beside=TRUE)

Now (fig. 24, top-right), the bar segments are placed not on top of but next
to each other. In some cases this layout might seem preferable.

49
Of course, barplot() has many more adjustable parameters than just beside
(you should be already aware of horiz and names.arg discussed earlier). You
may control also the widths of the bars and the widths of the spaces separating
them (and even the width of spaces separating bar segments), colorus of the bar
segments, presence or absence of a legend, and many other things.53
Note, please, that in case we would like to emphasise rather the proportion
of departments within gender groups, than the proportion of the gender groups
within departments, we would need to change the order of variables within the
barplot() function call. The horiz=TRUE would be definitely not enough.
The frequencies may be picked from a data frame as well. To satisfy the
expections of the barplot() function, however, the segment of the data frame,
which is to be used as the contingency table, should be transformed with
as.matrix(). To illustrate that point, we should create a miniature abstract
data frame.
# Create the data frame
test.df <- data.frame(c(10,12,8), c(16,6,4), c(5,7,9))
colnames(test.df) <- c("A","B","C")
# Preview the data
test.df

Now, we are ready to test the barplot() with it. A straightforward attempt
to apply barplot() to our data would result in an error message:
> barplot(test.df)
Error in barplot.default(test.df) : ’height’ must be a vector or a matrix

The use of the as.matrix() transformation solves the problem (see fig. 24,
bottom-left and bottom-right):
barplot(as.matrix(test.df))
barplot(as.matrix(test.df), beside=TRUE)

Two considerations should be kept in mind, however, when using data frames
as a barplot source. First, that matrix is, essentially, a folded vector. This means
that the segment of a data frame we convert to a matrix is not allowed to contain
anything but numbers representing heights of bar segments. Secondly, that the
naturally occurring contingency tables do not contain NA values (which is quite
understandable: if a certain combination of variables’ states does not happen
to occur, its frequency is zero, 0, not NA).54

Four steps towards more advanced graphs

We just started climbing up the Mount Everest of R graphics and there is yet
much to be learned, so the number of steps to perfection is nearly infinite (e. g.,
so far we did not touch the special kinds of graphs like maps or social network
53 A careful reading of help(barplot) and some experimentation are strongly recommended.

At the very least, adding legend.text=TRUE to our gender vs. department plot is a must try.
54 To test the effect of an NA value on the plot, you may try to modify our experimental data

frame test.df by replacing one of the numbers with NA and redraw the plots. For didactic
purposes, I would personally recommend to replace 12 in the first vector or 6 in the second
with NA and see what happens to both beside=TRUE and beside=FALSE barplots. It would be
also most instructive to replace then NA with 0 and redraw the plots again to see the difference.

50
diagrams, not to mention a whole separate ggplot2-based graphical system;
they all are postponed to the chapters to come). This section deals with four
rather common issues which, however disconnected they appear, are all much
needed at the very first steps of exploratory analysis and data visualisation. The
first subsection will deal with drawing straight lines and curves based on equa-
tions reflecting ideal relationship between x- and y-axis variables (something
like y = a + b · x or y = x2 ). The second will be devoted to a rather bizzare
issue of the ways to add a third variable to the two-dimensional scatter plots.
The third will treat the so-called graphical primitives (you are already familiar
with some of them, like points(), but this section will add more). The fourth
will lead you into the depths of basic R graphical system to reveal an important
secret of graphical functions: like many other functions in R, they count a lot
before they draw, and we shall learn how to get a more direct access to the
results of these calculations and make use of them.

Plotting math: abline() and curve()

In this manual, we mostly work with empirical data. Moreover, these data
nearly exclusively belong to the domain of social sciences and humanities. They
are noisy and, when visualised as scatter plots, they rarely form smooth lines.
Sometimes, however, even social scientists need something more abstract and
purely mathematical. There is a whole class of tasks dealing with approximation
of empirical data with linear or non-linear mathematical functions, which require
drawing some smooth, abstract, ‘mathematical’ straight lines and curves. These
straight lines and curves can be drawn with the aid of specialised functions
adding ‘mathematical’ lines to the existing plot.
One of these functions, abline() is already familiar to us. We saw it working
in fig. 1 and 6. Its job is to deliver straight lines running through the graph
at different angles. In its generic form — abline(a, b) — it requires two
arguments, a and b (hence abline), representing coefficients from the following
equation:

y =a+b·x
where a is responsible for up and down shifts of the line along the y-axis, while
b represents the ratio of y1 − y0 to x1 − x0 , or how much y changes when x
increases by 1 (which is also known as the tangent of the angle between the x-
axis and the function line). It is important to mention that, just like points()
and lines(), the abline() itself is unable to call a new plot, so it can be used
only to add elements to an existing plot (whether empty or full of data points),
not to create a new one.
# Creating an empty plot;
plot(-10:10, -10:10, type="n")
# Drawing a number of lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=8-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;

As one may see from this code and the results in fig. 25 (top-left), abline()
uses some standard formatting arguments like lty for line type (see the section

51
10

10
5

5
−10:10

−10:10
0

0
−5

−5
−10

−10
−10 −5 0 5 10 −10 −5 0 5 10
−10:10 −10:10
100

100
90

90
80

80
Body mass, kg

Body mass, kg
70

70
60

60
50

50
40

150 160 170 180 190 150 160 170 180 190
Stature, cm Stature, cm

Figure 25: Top: just some lines. Solid: y = x, dashed: y = 8− x2 , dotted: y = −7+2·x,
√
dashes-and-dots: y = 4 + x · 2; right: same as left, h and v arguments of abline()
tested. Bottom: the use of lm() with abline(). See text for explanations and code
details.

on the time series plots, esp. fig. 18 for more details), lwd for line width, and
col for line colour (feel free to try lwd and col out on your own).
There are two special cases, when the line runs parallel to either of the axes.
For the lines running parallel to the x-axis, it is enough to specify the single y-
value, and vice versa (the generic form would be abline(h=y) and abline(v=x)
accordingly, h and v stand for horisontal and v ertical respectively).
Unlike a and b, h and v can be defined with numerical vectors or with an
output of some vector-generating function like seq() or c(). However strange,
vertical and horisontal lines may turn most useful, e. g. when one needs to draw
a co-ordinate grid of a desired density or to highlight some x- or y-axis values.55
55 Technically, a horizontal line at some a value may result also from b=0, and a vertical

one (located at x=0) from a b which is big enough to be practically indiscernible from infinity.
Anyway, the use of h and v arguments makes the function call shorter, and the results more
flexible. Grid as such can also be added with a less flexible but sometimes easier-to-use grid()

52
What is most surprising, however, is that h and v can be used simultaneously
withih a single function call. The code that follwes will add the axes at zero
level and a co-ordinate grid to the previous plot (I suggest to redraw it from the
scratch because axes and the co-ordinate grid should be in the back-, not in the
fore-ground; see fig. 25, top-right):56
# Re-creating an empty plot;
plot(-10:10, -10:10, type="n")
# Adding the grid;
abline(h=seq(-10,10,1), v=c(-10:10), lty=3, col=8)
# Adding the axes;
abline(h=0, v=0)
# Adding back the lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=8-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;

There is also a special case with a and b, when they are derived from the re-
sults of the linear least squares approximation (also known as linear regression).
Regression analysis itself is a rather big topic, and it does not deserve to be
treated lightly.57 Its general idea, however, is rather simple. Assuming a simple
linear relationship between two variables, we are trying to find a straight line
which fits the cloud of data points in the best possible way. There are different
ways to define which way is the best possible. Least squares way means that
we are trying to find a line which minimises the sum of squared deviations of
data points from their projections on this line.58
The linear regression parameters (a and b for our line equation and a num-
ber of other important parameters providing information on the goodness of
fit and statistical significance of the estimates) are calculated in R with lm()
function (its name is derived from the Generalised Linear M odel, a complex
of parametric methods unifying the linear least squares regression in its many
versions and the analysis of variance). The abline() function is trained to pick
these estimates right from the lm() function’s output. The generic form of the
call is abline(lm(y~x)). With our usual training datasets, we can try this on
students’ stature and body mass (see fig. 25, bottom-left for results).
# Plotting the data points;
plot(students.df$HEIGHT, students.df$MASS,
pch=20, col=rgb(0,0,0,.3),
xlab="Stature, cm", ylab="Body mass, kg")
# Adding regression line;
abline(lm(students.df$MASS ~ students.df$HEIGHT), col="red", lwd=3)

It is intuitively clear, that, on the average, the taller the student is the more
likely she or he is also heavier. The red line illustrates this general trend. As I
said, I am not to enter here any deeper in the interpretation of the regression
function (reading help(grid) is strongly advised).
56 Note, please, that I added h and v using different sequencing functions to emphasise that

we can do it in different ways.

57 At this point I recommend to consult an excellent manual by Andy Field, Jeremy Miles,

and Zoe Field (2012) Discovering Statistics Using R. SAGE Publications.

58 There are other ways too. E. g., one may look for a median angle between x-axis and all

the lines that can be drawn between all possible pairs of data points (the so-called Theil —
Sen estimator).

53
analysis results, for this would distract us from our more immediate agenda.
Howerver, we need to know where the coefficients came from. The result of the
lm() function is an object of list class.59 One of its elements stores the values
we need. Little surprise, the element’s name is coefficients.
> lm(students.df$MASS ~ students.df$HEIGHT)$coefficients
(Intercept) students.df$HEIGHT
-85.2025591 0.8520774

Now, we shall try and re-create the line using the values we see. They are
provided in the same order as they are used in the abline(), so, the equation
is, roundedly speaking:60

y = −85.2 + 0.852 · x
In the following line of code, I use a different line type and colour to see both
the original red and the new blue:
abline(-85.2, .852, col="blue", lty=2, lwd=3)

The lines overlap just as expected (fig. 25, bottom-right).

The last thing to learn in this subsection is how to use the curve() function.
It differs from the abline() not so much because it draws curves (in fact, it can
also draw straight lines, as we shall see soon) but because, by default, it calls
a new plot. This means that to add a curve() to an existing plot, we should
use add=TRUE argument, just as with the overlapping histograms we discussed
above, in the section on the box plots.
The generic form of the function is curve(fun(x), from=x0, to=x1, n=101).
This requires some explanation. By fun(x) is meant any mathematical expres-
sion involving x, e. g. x^ 2 or x+1 which expresses the way y value is defined
by the corresponding x value. Three other arguments, from, to, and n define
the extreme x-axis values and the number of points between which the segments
emulating the curve will be drawn (defaults to 101, which, however big it seems,
sometimes is not big enough to form a smooth graph). The importance of a
properly chosen n can be tested with the following series of plots (see fig.26):
curve(x^2, from=-2, to=2, n=1)
curve(x^2, from=-2, to=2, n=2)
curve(x^2, from=-2, to=2, n=3)
curve(x^2, from=-2, to=2, n=10)

By default, from is set to 0, to to 1. In case they are not specified whereas

xlim is explicitely stated, the from and to values are inherited from xlim. As
we shall see from the examples to follow, from, to, and xlim can be defined
independently from each other as well.
Now, let us try drawing some simple functions (see fig.27, left for results):
59 You can examine it with the already familiar str() function by typing:

str(lm(students.df$MASS ~ students.df$HEIGHT))
60 The bizarre side-result that a 100 cm tall human should have a zero weight and shorter

people are at risk of defeating the laws of gravitation, comes from an over-simplified nature
of the model we used. The relationship between the body mass and the stature is rather
non-linear and it changes with age. The real-life 100 cm tall kids possess a body mass about
16 kg, while the 50 cm newborns about 3.5 kg. The fact that a is nearly exactly hundred times
bigger than b is, of course, just a coincidence.

54
5.5

5.5
5.0

5.0
4.5

4.5
x^2

x^2
4.0

4.0
3.5

3.5
3.0

3.0
2.5

2.5
−2.5 −2.0 −1.5 −2 −1 0 1 2
x x
4

4
3

3
x^2

x^2
2

2
1

1
0

−2 −1 0 1 2 −2 −1 0 1 2
x x

Figure 26: y = x2 depicted with curve() under different values of n: top-left n=1,
top-right n=2, botom-left n=3, bottom-right n=10. See text for explanations and code
details.

curve(x^3, from=-2, to=2, xlim=c(-3, 3), ylim=c(-3, 3))

curve(x^2, add=TRUE, lty=2)
curve(1*x, add=TRUE, lty=3)
abline(h=0, v=0)

Note, please, that what the latter line of code draws is, in fact, not at all
a curve. It is a straight dotted line at y = x (which may also result from
abline(0, 1)).
Here are some more elementay functions to make use of more math abun-
dantly built into R (see fig.27, right for results):
curve(sin(x), from=-2*pi, to=2*pi, ylim=c(-2, 2)*pi)
curve(exp(x), lty=2, add=TRUE)
curve(log(x), lty=3, add=TRUE, n=303)
abline(h=0, v=0)

Note, please, the error message after the last line of code:

55
3

6
2

4
1

2
sin(x)
x^3
0

0−2
−1

−4
−2

−6
−3

−3 −2 −1 0 1 2 3 −6 −4 −2 0 2 4 6
x x

Figure 27: Left: solid: y = x3 , dashed: y = x2 , dotted: y = x. Right: solid:

y = sin(x), dashed: y = ex , dotted: y = log(x). In both graphs x- and y-axes at zero
are added with abline(h=0, v=0). The x- and y-axis labels are the result of the raw
function call and, of course, can be adjusted with setting proper xlab and ylab values.
See text for explanations and code details.

> curve(log(x), lty=3, add=TRUE)

Warning message:
In log(x) : NaNs produced

This is because natural logarithms are defined for positive numbers only, and
the x range (inherited from the first function call, that for sin(x)) contains
negative values. I had to use n=303 this time, because n=101 produces a broken
line on some plotting devices having not enough points to make the curve look
smooth.
The functions used with curve() can be very complicated. An interesting
challenge you may entertain on your own is to draw the probability density
function for the normal (Gaussian) distribution (you may start with µ = 0 and
σ = 1 and then try something more challenging):
1 (x−µ)2
y=√ · e− 2σ 2
2πσ 2
or a circle:

x2 + y 2 = 1
The latter may seem even more challenging than the former, because the
function expression within curve() is not allowed to contain any y.

Stepping beyond two dimensions

[Section partly written]
This section, strangely enough, is not about 3D graphs (they are also possible
but they will be discussed elsewhere). It is about the three meaningful ways to

56
add the third variable to the seemingly 2D scatter plot. In fact, we already did
it (see fig. 11 and two bottom graphs from 13). Now, we shall treat this topic
in a slightly more systematic manner.
There are three cases when we may need to circumvent the limitations of the
2D scatter plot: (1) we need to see how a third ‘qualitative’ variable fits with
the pattern formed by the two quantitative ones that form the original scatter;
(2) same for a ‘quantitative’ variable; (3) a special case when the data points
are so numerous that they overlap each other, and we want to see the pattern
formed by this overlap. The latter case is also known as ‘2D histogram’.
The two former cases are rather simple. To incorporate the third variable
without abandoning the 2D-scatter layout, we change the appearance of the
data points. A qualitative variable can be encoded with colour (col) or shape
(pch) of the data points. We did it before (see fig. 11, right; 13, bottom, and
the appropriate code in the section on the scatter plots).
A quantitative variable can be encoded with colour scale (this is rather tricky
to both plot and interpret, and I am not to teach it here) or immediately with
the data point size (cex). The latter way (the use of cex) requires a little trick.
The matter is that cex sets the height of the plotting character, not its area.
Which means that a character of cex=2 is twice as high as that of cex=1 but
four times as big in terms of the occupied area. You may easily check it visually
with:
plot(1,1,pch=0)
points(1,1,pch=0,cex=2)

Which means that when using cex to encode a quantitative variable, we need
to sqrt() the numeric vector for cex values. Let us try it with the 2008 Russian
presidential elections dataset. The following graph represents the upper-right
corner of the turnout-result plot with data points proportional to the (alleged)
number of ballots cast (see fig. 28 for the results; note, please, that we also have
to divide the sqrt(pres.2008$VOTED) by some arbitrarily chosen number, 36 in
our case, which is big enough to make data points proportional to the plotting
area):
# Loading data and adding variables:
pres.2008 <- read.csv("pres_2008.csv", h=TRUE)
pres.2008$VOTED <- pres.2008$BALL.VALID + pres.2008$BALL.INVALID
pres.2008$TURNOUT <- pres.2008$VOTED/pres.2008$VOTERS
pres.2008$MEDVEDEV.sh <- pres.2008$MEDVEDEV/pres.2008$VOTED

# Representing the number of ballots cast with cex:

plot(pres.2008$TURNOUT, pres.2008$MEDVEDEV.sh, xlim=c(.95,1), ylim=c(.95,1),
cex=sqrt(pres.2008$VOTED)/36,
xlab="Voters’ turnout at a polling station",
ylab="Share of ballots cast for D. Medvedev at a polling station")

The case of overlapping points is much more complex than the previous two.
There are several ways of dealing with this problem. Each has advantages and
disadvantages of their own. Some can be used with original data, others require
a rather sophisticated data transformation.
The easiest way, perhaps, is to use semi-transparent data points with original
data (e. g. by setting colour to col=rgb(0,0,0,.3), see fig. 29). It requires no
data transformation and demonstrates visually tolerable results. On the other
hand, it lacks precision, when it comes to the issue of density scale.

57
1.00

●●
●
●
● ●
● ● ● ● ● ●● ● ●● ● ●● ● ●● ● ●
●
● ● ● ● ● ●●
● ● ●
● ● ● ● ●●
● ●
●
●
●
●
●
●
●●
●
●
●

● ●
●
●
● ●● ●
● ● ●● ● ●
●● ●● ●
●●● ●
●
●
●
●
●
●
● ● ●●
●● ●●
●
●
●
● ● ●
●
● ●●● ●● ● ● ●
●
●
●
●
●
Share of ballots cast for D. Medvedev at a polling station

● ●
● ●
●
● ●●●●●●●●●
●
● ● ●
● ●
●
●
●

● ● ●
● ● ●
● ●● ●●
●
●
● ● ● ●
●
● ● ● ●
●
●●
●●
●
●●
● ●● ●●
● ●
●
●
●●
●●
● ●● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●●● ●●●●● ● ● ● ● ●
● ●
●
●
●
●
●
●
●
●

● ● ●●
● ● ●
●
●
●
●
● ● ● ●● ● ●● ● ●
● ●● ●
●
●
●
●
●
●
●
●
●
●
●
●● ●●
● ●●
●
●
●

●●
●
●● ● ● ●●
●●
●
●
●
●
●
● ● ● ● ●●
● ● ● ● ● ●

●
●
● ●●
● ●● ●
● ●
●
●
● ●● ● ●
● ●
● ● ●●●●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ● ●
●
●●● ●● ● ● ●●●●● ●● ● ●
● ● ● ● ●

●
●● ● ● ●
●●●●●●●●●● ●●● ●●● ●
●
●
0.99

● ● ● ●
● ● ● ●
●
●
● ● ●●●●●● ●● ● ● ● ●
●
●
● ●● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
● ●
● ● ●
●
●●
●
● ●
●

● ●●●●● ● ●
●
● ● ● ● ● ●●● ●
●
●
●● ● ●
●
●
●

● ●
●●
● ●
●
●● ●
●

● ● ●
●
●
●
●

●
● ●
●
●
● ● ●
●
● ● ●
●●
●● ●●
● ● ● ●●
●
● ● ●● ● ●
●
●
●
●
●
●
●
● ●
●● ●
●
● ● ● ● ● ● ● ● ● ● ●
● ● ●
●
●
●
●
●
● ● ● ● ● ● ●
●●
●
●
● ● ● ● ● ● ● ● ●
●
●
●● ●
● ●

● ● ● ●●●● ● ● ●
●

● ● ●
● ●
●
●
●
●● ●●
●
●
●
●
●
● ● ● ● ● ●
●
●●●
● ● ●
●
●
●● ● ● ●●
●● ● ●
●
●
●
●● ●●●●● ● ●
●● ● ●
● ● ●
●
●
●
●

●
●
● ● ●
●● ●●
●
● ● ●
● ● ●
●
●
● ● ●● ●
● ● ●
●
● ● ● ●
● ●
●● ●
●
●
●
●

●
●
● ●
●●●
●
● ●
● ● ●
●●●● ●●●●●
●● ●
● ● ●
● ● ● ●
●
● ● ●● ●
● ● ● ●
● ● ●
●
●
●
●
●
●
●
●
● ● ●
● ● ● ●
● ● ●
●
●
●
●
● ● ●
● ●
●
●
● ● ● ●

●● ● ●
●
●● ● ● ●
●
● ● ●
● ● ●
●
●●
● ● ●● ●
●
●●
● ● ●
● ● ●● ● ●
●
●
● ● ● ●
● ●
●● ● ●
● ● ● ● ●
● ●

●●
● ● ● ●
●
●
●● ●● ● ●
● ● ● ●
● ●
● ●
●
●
● ●

● ●●
● ● ●● ● ●● ●
●
●

●
● ● ● ●
●
●
●

● ●● ● ●●
●● ● ●
● ● ● ● ● ●
●
●
●
●
●
●
●
●

● ●●
● ● ● ● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
● ●●●●● ● ●● ● ●● ● ●
●●
●● ● ● ●
●● ●●
●
● ●
●
●
●
●
●
●
●
●●
●
● ● ● ●
●
● ● ●
●
●
●
●
● ● ● ● ●
●
●
●
●●●●●● ●●●●● ●
● ● ●● ●
●
● ●
●
●●●● ●●●●●●●●●●●●●●●●
● ●
● ●● ● ●
●
● ●
● ● ● ● ●● ●
● ● ●●● ● ●●● ● ●● ● ●
●
●
●
●
●
●
●
●● ● ● ● ●● ●
●
● ●● ● ●
●
●
●
●
●● ● ●● ● ● ●
● ● ● ●●
●
●● ● ●
● ● ● ● ● ●
●
●

● ●
● ● ● ●
●●●●●●● ●●●● ●●●●●●●●●●●
● ● ● ●

● ●
● ● ● ● ●
● ● ● ●● ● ●
● ●
● ●●● ● ●●
● ●
●
●
●
●
●
●
●
●● ● ● ● ●
●
●
● ● ●● ● ● ● ● ●
● ●
●●●● ●●● ● ●
●
●
●
● ●
●
●●
● ●● ●● ●●● ●●
● ●
●
●
●
●
●
0.98

●
● ●● ●● ●
●
● ●● ●
●

●● ●
●
● ●●● ●●
● ● ● ●
● ● ●
● ● ● ● ● ● ● ● ●
● ● ● ●● ●
●
●
●
●●●●●●●●
● ● ●
●
●● ● ● ● ●●
● ● ● ●
● ● ●● ● ● ●
●
●
●
●
● ● ● ●
●●
● ●
●
●
●
●●
● ●
●
●●
● ●●
●●
●
●●
●
●●
●● ●
●

●●●● ● ●●●
●
● ● ● ●●●●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
● ●●
●●
● ●
●● ● ●
● ●●● ●●
●
●
● ● ●●
●● ● ●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
● ●●●●●● ●●
● ●
● ●
● ● ●
● ●● ●● ● ● ●● ●●
●
●●
● ● ● ● ● ● ● ●
●
●
●
●
● ● ●● ● ●
●
●● ● ● ●
●
● ● ● ● ●
●
● ● ●
● ●
● ●

●
● ●

●●●●
●
●●
●● ● ●●● ●● ● ● ●●●●●●●●● ●●● ●●
● ● ● ● ●
●
●
●
●
●
●
● ● ● ● ● ●
●

●● ● ● ●●● ● ●
● ● ●
● ●
●
● ●
●
● ●
● ●● ●
●
●
●● ●● ● ● ●● ● ●●● ●●
● ● ● ●
●
●
● ●
●
●
●
●
● ●●●●
●
● ●
● ●
●●
● ● ● ● ●●
●
● ● ●● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●

●● ● ●● ● ●
●
● ●
●
● ● ● ● ●
●●●●
●●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●

● ● ●
●
● ●
●
●
●●
●
● ● ●● ●
●
● ●● ● ●
●
●
●
●
●
● ●
●
●
●
●

●
●
●
●

● ●
● ● ●
● ● ● ● ● ● ●

● ● ●● ● ● ● ●
●●
●
●
●● ●●
● ● ● ● ● ● ●
●● ● ●
●
●
● ●
●

●●
●
●
●

● ● ● ●
●
●
● ● ● ● ● ● ● ● ●
●● ● ●
●
● ● ●
●● ● ● ●
● ● ● ● ● ● ●
●

●● ● ●
● ● ● ● ●

● ●
●
● ●
● ● ● ●● ●● ●
● ● ● ●
●
● ● ●
● ●
● ● ● ● ● ●
●
● ● ●
●

● ●
●●
●
● ●
● ●

● ● ●
●●●● ●● ●●● ● ●● ● ●
● ●
● ●● ● ●● ● ● ●● ●● ●● ●● ● ●● ● ●● ●●●●●●●
●
●
●
●
●
● ● ●
●
●
● ●
●

● ●
●● ● ● ● ●
● ●
●
● ● ●●
● ● ●
●

● ●● ● ●● ●● ●● ●●●
● ● ●
●●
●
●
● ●
● ●●●
●
●
● ●● ● ●● ●
● ● ●
●
● ● ●
●● ● ●● ●● ●●● ●●●●●●●● ●●● ●●●●● ●●●●●●●●●●
●

●●●●●●●
● ● ●●
● ●
●
● ● ●
● ●
●
● ●●● ●
●
● ● ●● ●
● ● ●●
●
●● ● ● ● ●●●
● ● ● ● ●
●
● ● ● ● ●

● ●
● ●
● ●●● ●
● ● ●
● ●● ●
●
0.97

●
●
● ● ●
●
● ●● ●● ● ● ● ●

● ●● ● ●●●
● ● ● ●
●

●●
●●●●●● ● ●
●
●
● ● ●● ●● ● ● ●
●
● ●

● ● ● ● ● ● ● ●

● ● ●● ● ●●● ● ● ● ● ●● ●
● ●
●●● ●
● ● ● ● ●
● ● ● ●●●
●
● ● ● ●
● ●
● ● ●●● ● ● ● ●●
● ● ● ●
●
●
●
● ●
● ● ●
● ●
●
●
●
● ●
●●
● ●●
● ●●●● ●
●
● ● ● ●● ●●● ●● ● ●● ● ●
● ● ●
● ● ● ●
●

●●●●
● ●
●
●
●● ● ●
●
●
●
● ● ● ●
● ● ● ● ●
●
●
●
● ●
● ●●
●
●
● ●
●
● ●● ●
● ● ●
●
●
●
●
●
● ● ●
●
●
●

●
●

●●
●● ● ● ● ● ●●

●● ●

● ● ●● ●
●● ● ● ●●● ● ●
● ● ● ● ●● ● ● ● ●
●
●
● ●

● ● ● ● ●● ●
● ●
●
● ● ● ●●
●

● ● ● ● ● ● ● ●●●● ●● ●
● ● ●
●● ● ● ● ● ●
●
● ●
●
● ●
●
●
● ●
●
●● ●● ● ●
●
●
● ●
●
● ● ●

● ● ●● ●● ●●●● ●
● ●

●●●
●
● ● ● ●
●
● ● ●
● ●
● ● ● ● ●
● ●● ●● ●● ● ●●●
●
●

● ● ●
●
●
●
●
●
●
●

● ●●
● ● ● ● ● ● ●
●
●
● ● ● ●● ● ● ● ●
●● ● ●
● ●
●
● ●● ● ● ● ●●● ● ● ●
●
●
●
● ● ●
●
●
●
●

● ●
●
● ●
●
●
● ●
● ●
●
● ●
● ●
●● ●
● ● ● ●

●
● ●
●
● ● ● ● ● ● ●
●● ● ●● ● ●
●
● ●
● ●
● ● ●
● ● ● ●
●
●
● ●
●
● ●
● ●
●
●●●●● ●● ● ● ● ● ●
●
● ● ●

● ● ●●●●● ●● ● ● ●●● ●● ● ●●●●● ●● ●●● ● ● ●●● ●●●

●
● ●
●

● ● ● ● ● ● ●

●●
●
●● ● ● ●
●●
●
● ● ●
● ●
●
●
●
●
●
●
●
● ● ●
●
●
●● ● ● ● ●

● ●
● ●
●● ●
●

● ● ●
● ●●●
●
● ●

● ●●●●
●
●● ●
●● ●●●●
●
●● ● ● ● ● ●● ● ●
● ●

● ●
●
● ● ● ● ●● ●
● ● ●
●

●● ●
●
● ● ● ●● ● ●● ● ●● ● ● ●

●● ●● ●
● ● ●
● ●
●● ● ● ●
● ● ● ●
● ●● ●●
● ● ● ●
●
● ● ● ●
0.96

● ●
●● ● ● ●
●

●●● ●
●
●●●●●●
●●●● ●● ●● ●●●● ●●
●
● ● ●● ●● ●
●
● ● ●●●●●
●

● ● ● ● ●
● ● ● ●
● ●●● ●
● ●● ● ● ●● ●●
● ●

●●●
●● ● ● ● ●● ●●● ● ●
●
● ● ●●● ●● ● ●
●● ● ● ●●● ● ●
●● ●● ● ●
●●
●
●
●● ●● ●● ● ●● ●
● ●
● ●● ● ● ● ● ● ● ●
●
●
● ● ●●●
●● ●
● ● ● ●●
●
●
●
●
● ●
● ●
●

●●
●
●●
● ●

●●●● ●●●● ● ●●● ●● ●● ●●●● ●●● ● ● ● ● ● ●

● ●
●

● ● ●● ● ●
●●●●● ● ●
● ● ● ●
● ● ● ●● ● ● ● ● ● ●
●
●
●
●
● ●
●

●
● ● ● ●
● ● ●
●
●

● ●
● ●
● ●
● ● ●
●
●
●
●
●
●

●
● ●
●
●
●
● ●
●
● ● ●
● ●
● ● ●
● ● ● ●
●
● ● ● ● ●●

●
●
● ●● ●
●● ●
● ● ● ● ●
● ●
●
●● ● ●●
● ●
●
●
●

●● ●
● ●
● ● ●
● ●
●●
●
● ●
●
●
●
●
● ●
● ● ●
●
●
●
● ● ●

●
●
● ● ● ● ● ● ● ● ● ●
● ●
●
●
● ●
●

● ● ●
● ●● ● ●
●● ● ●
● ●
●
● ● ● ● ●● ●●●● ●● ● ●●●● ●●●●● ●●●●●●●●● ●●●● ●●●●●●●●●
● ●
● ● ● ● ● ●
●

● ●● ● ● ●
●
● ● ●●
● ●● ● ● ●

●
●
●
●
●
●

● ●●● ● ●

●
● ● ● ●
● ● ●
●
●
● ● ● ●

●●● ● ●● ● ●● ● ● ●● ● ●●●● ●● ●●
●

●
●● ● ● ● ● ●●● ● ●
● ●
● ●● ● ●
● ●
● ● ● ● ●
●
● ●● ● ● ● ●
●
● ●

●
● ● ● ● ● ● ●
● ● ●
●
●
● ●● ●
● ●
●
●
● ● ● ●
●

●● ●● ● ● ● ●
●
● ● ●
●

●●
● ●
● ●●●
● ● ●
●● ● ● ●
●
●
●
● ●
●
●
● ● ●●● ● ● ●● ● ●● ● ●
●
●●
●
● ● ● ●● ●●
● ●
● ● ●● ● ● ●● ● ● ●
●
●

● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ●
● ● ● ●
● ● ●

● ● ● ●●
● ● ●
●
● ●
● ● ● ●
●
● ● ●
● ●● ●● ●
0.95

● ● ● ● ● ● ● ●

● ●
●● ● ● ● ● ●
●
●● ●● ●● ● ● ●● ● ● ● ● ● ● ●
●
●
● ●●
●
● ●
● ● ● ●
●●● ● ● ●

● ●●●●
●● ● ● ● ● ● ●
● ● ● ●
● ● ● ● ●
●
●●
● ●●
● ●
●●
●●● ●
●●
●
●
●
●
●
●●●●●● ●●
●
● ● ●● ●
● ●
● ● ●● ●●
● ●●●● ● ●●● ●●
●
● ● ●●●
●
●
●●●●●●● ●
●
● ●
●
● ●●●●●●●●●●● ● ●●●●●●●●●
●
●
●●
●●
●
● ●●● ●●●● ●●●● ●
●● ●
●
●
●
●
●
● ●
●
●

●●● ●● ●●● ●●●●● ●●●

● ● ● ● ● ● ●
● ●
●
●
●
● ●

●●
● ●● ● ●● ● ●
●
●●●
● ● ● ● ●
●
●● ●● ● ●
● ●●●
●
● ●
●
● ●
●●●
●
●
●
● ● ●
●
●●● ●
●
●
●
●
●
●
●
●
●
●

● ●●
●
● ●
●● ●
●
● ● ●
● ●● ● ● ●● ● ● ● ● ● ●

●
● ●
● ●
●
●

● ●●
● ● ● ● ●
● ●● ● ● ● ● ● ● ●
● ●
●
●
●
●
● ●

0.95 0.96 0.97 0.98 0.99 1.00

Voters' turnout at a polling station

Figure 28: Representation of the third quantitative variable on a 2D scatter plot with
the data point size using cex: 2008 presidential elections in Russia, the size of data
points is proportional to the number of ballots cast at the polling station. Note thin
diagonal rays formed with small polling stations (calculational artifact), and clusters of
bigger polling stations at the integer percentiles forming a rather weak but discernible
grid-like pattern (evidence of electoral fraud). See text for code and details.

58
Figure 29: Representing density of analytic individuals in a scatter plot using colour
transparency with rgb() function. Cf. fig. 30, 31, 32.

plot(pres.2008$TURNOUT, pres.2008$MEDVEDEV.sh, pch=".",

xlim=c(0,1.01), ylim=c(0,1.01),
col=rgb(0,0,0,.3),
xlab="Voters’ turnout at a polling station",
ylab="Share of ballots cast for D. Medvedev at a polling station",
main="Presidential elections in Russia, 2008")

There are plotting functions which may present density levels in a more
explcit way but they require data transformation because, except x- and y-
coordinates they require a z-coordinate, which is presented in form of a density
matrix. We shall use two of them (image() and contour(), see fig. 30, 31, and
32) but there are more.
Before getting to the plots, however, we need to perform the required data
transformation. To create the 2D density matrix we need to extract as many
subsets as there are cells in the matrix. I have chosen a 101×101 matrix (101 —
for each percentile from 0 through 100) of centered 2D-bins (I hope, you noticed
already the similarity to the histogram, the only difference is that the histogram

59
Figure 30: Representing density of analytic individuals in a scatter plot using
contour() function. This method requires data transformation aimed at the extrac-
tion of the density matrix. See text for details. Cf. fig. 29, 31, 32.

is one-dimensional):
pres.2008.TMs.m <- NULL
pres.2008.TMs.m <- as.data.frame(pres.2008.TMs.m)

j <- NULL
i <- 1
while(i <= 101){
j <- 1
while(j <= 101){
pres.2008.TMs.m[i,j] <- nrow(subset(pres.2008,
pres.2008$TURNOUT > (i-1.5)/100 &
pres.2008$TURNOUT <= (i-.5)/100 &
pres.2008$MEDVEDEV.sh > (j-1.5)/100 &
pres.2008$MEDVEDEV.sh <= (j-.5)/100
))
j <- j+1
}
i <- i+1

60
Figure 31: Representing density of analytic individuals in a scatter plot using image()
function. This method requires data transformation aimed at the extraction of the
density matrix. See text for details. Cf. fig. 29, 30, 32.

Now, we are ready to engage in plotting. The first function, contour()

creates a black-and-white line art plot of isolines uniting cells with equal data
point densities. Besides of the density matrix it requires a vector of density
levels at which the isolines should be drawn (fig. 30):
# Setting the vector of data point density levels:
zlvl <- c(10,25,50,75,100,150,200,300,400,500,600)

# Plotting contour()
contour(0:100, 0:100, as.matrix(pres.2008.TMs.m),
levels=c(1, zlvl), col = "black", lty = "solid",
method="flattest", labcex=1,
vfont=c("sans serif", "bold"),
xlab="Voters’ turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")

61
Figure 32: Representing density of analytic individuals in a scatter plot combining
image() and contour() functions. This method requires data transformation aimed
at the extraction of the density matrix. See text for details. Cf. fig. 29, 30, 31.

Here we see three variables: x=0:100, the same for y, and a matrix of z
values, the explicitely set levels, three arguments controlling the appearance of
the numeric labels at the isolines (method="flattest" means that the numbers
are to be placed at the ‘flattest’ segments of the isolines, labcex regulates labels
font character expansion factor, and vfont defines a font-family for the labels
font, etc.)61 Note, please, that we can not use the usual ‘share’ values ranging
from 0 through 1. In this graph, just as in the following two, they are replaced
with percents (ranging, naturally, from 1% through 100%).
The image() function is widely used to produce so-called heatmaps. A
heatmap is a plot visualising a 2D matrix in a most direct way by mapping
a matrix of numbers to a matrix of pixels (or bigger rectangles) coloured ac-
cording to the values of the matrix cells. It can be used in our case as well.
It does not require explicit levels and splits the data point density range into
61 As usual, I strongly recommend studying help(contour).

62
equal intervals according to the colour palette used. It looks more spectacular
against a dark background, so I’ve chosen a good old dark blue background
(bg="#001133") and white foreground. When the slopes of the density peaks
are too steep, they can be leveled down with a log-transformation (which we
had to use in his particular case too; see fig. 31 for the graphical results of the
following code):62
par(bg="#001133", fg="white", col.axis="white", col.main="white", col.lab="white")
image(1:101, 1:101, log(as.matrix(pres.2008.TMs.m)),
xlim=c(0,101), ylim=c(0,101), lty = "solid", col=terrain.colors(12),
axes=FALSE,
xlab="Voters’ turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")

The two functions can be combined to enhance the readability of the plot
(fig. 32): 63
par(bg="#001133", fg="white", col.axis="white", col.main="white", col.lab="white")
image(1:101, 1:101, log(as.matrix(pres.2008.TMs.m)), xlim=c(0,101),
ylim=c(0,101), lty = "solid", col=terrain.colors(12),
axes=FALSE,
xlab="Voters’ turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")
contour(1:101, 1:101, as.matrix(pres.2008.TMs.m),
levels=zlvl, col = "black", lty = "solid",
method="flattest", labcex=1,
vfont=c("sans serif", "bold"),
add=TRUE)

There is also a related function filled.contour() which produces a tolera-

ble on-screen graph with a built-in legend for colours.64 It is, however, designed
in a way that makes it highly problematic with printing-quality graphs. There
is a module within the function that overrides the par() settings. A skilled
R user can build a similar function using the original code as a template, but
we will not go this far. Technically, a combination of image() and contour()
brings fairly tolerable results both on-screen and in print.
Finally, there is a function smoothScatter() which can be used in its own
right and in cobination with contour() (no figures provided, try it at your risk).
# smoothScatter() alone:
smoothScatter(pres.2008$TURNOUT*100, pres.2008$MEDVEDEV.sh*100,
colramp = colorRampPalette(c("black", "red")),
axes=FALSE,
xlab="Voters’ turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")

62 Remember, the log-transformation taking a logarithm of the original value works as a

de-magifying glass for ‘big’ numbers (consider: log(2) ≈ 0.69, log(4) ≈ 1.38, log(16) ≈ 2.77,
log(256)
log(256) ≈ 5.55; thus, log(2) = 8 while 256 2
= 128). To revert the plotting console to the
default values, close it, please, with dev.off() after reviewing the plot.
63 To revert the plotting console to the default values, close it, please, with dev.off() after

reviewing the plot.

64 Readers are encouraged to explore filled.contour() on their own. The function call is

very similar to image() and contour().

63
10

10
8

8
6

6
1

1
4

4
2

2
0

0
0 2 4 6 8 10 0 2 4 6 8 10
1 1

Figure 33: Line segments(), left, and arrows(), right. Note, please, different shapes
and positions of arrowheads. See text for code details. The size of arrowheads is
proportional to other parts of the graph, so, in your plots, they may appear larger or
smaller by default.

# smoothScatter() combined with contour():

smoothScatter(pres.2008$TURNOUT*100, pres.2008$MEDVEDEV.sh*100,
colramp = colorRampPalette(c("black", "red")),
axes=FALSE,
xlab="Voters’ turnout at a polling station, %",
ylab="Share of ballots cast for D. Medvedev at a polling station, %",
main="Presidential elections in Russia, 2008")
contour(1:101, 1:101, as.matrix(pres.2008.TMs.m),
levels=c(1, zlvl), col = "white", lty = "solid",
method="flattest", labcex=1, vfont=c("script","bold"), add=TRUE)

Visualising everything: graphical primitives

We have already met with some of the graphical primitives before. We added
dots and lines to the graphs with points(), lines(), and abline(). There are
other simple graphical elements — segments, arrows, rectangles, and polygons,
which may help you to draw anything you can imagine, provided that you
are able to supply the plotting function with appropriate sets of coordinates.
Besides that, we may plot text to annotate the data points.
All these functions share one important feature: being unable to call a plot
on their own, they can only add elements to an existing plot. We shall try them
one by one on abstract examples and towards the end of the section I shall offer
some hints (partly provided with examples) on how they can be used in some
everyday visualisations.
Two related functions segments() and arrows() require basically four ar-
guments: x0 and y0 for the start points, x1 and y1 for the end points of line
segments and arrows respectively. In arrows(), the shape and position of the
arrowhead can be adjusted. Below are the two code examples for fig. 33:
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot
segments(x0=c(1, 2, 4, 2), y0=c(1, 2, 1, 6), x1=c(2, 3, 6, 8), y1=c(5, 4, 8, 7))

64
10

10
8 3

8
2
6

6
5
1

1
4

4
4
2

2
1
0

0
0 2 4 6 8 10 0 2 4 6 8 10
1 1

Figure 34: A example of rect(), left, and polygon(), right; the numbers in the
right plot are added with text(). Pay attention, please, to the difference between the
polygons 4 and 5. See text for code details.

plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0,10)) # re-creating an empty plot

arrows(x0=c(1, 2, 4, 2), y0=c(1, 2, 1, 6), x1=c(2, 3, 6, 8), y1=c(5, 4, 8, 7))
arrows(x0=1, y0=8, x1=8, y1=9, length=.15, angle=15, code=1)
arrows(x0=6, y0=4, x1=10, y1=2, length=.2, angle=10, code=3)

Every set of arrows can be adjusted with respect to the length of their
arrowheads (length argument, defaults to 0.25), the angle between arrowhead’s
edges (angle argument, defaults to 30°), and the arrowhead’s position (code
argument, defaults to 2).65 Note, please, the code that corresponds to the
arrows running from (1, 8) to (8, 9) and from (6, 4) to (10, 2). The code=1 sets
the arrowhead to the beginning (x0 , y0 ), while the code=3 sets the arrowheads
to both ends. And, of course, every linewise element can be decorated with
different colours (col argument), line types (lty), and line widths (lwd).
More complex geometrical figures can be drawn with rect() and polygon().
Even though rectangles can be considered as polygons from the viewpoint of
geometry, they are far simplier from the perspective of defining their coordinates
(mostly because the rect() function can only draw rectangles with the sides
parallel to the sides of the plotting area). Thus, to define the rectangle’s position
and size, one needs to specify its bottom left and its upper right corners, and
that’s it. The following script draws three rectangles in the fig. 34, left:
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot
rect(xleft=c(1, 2, 6), ybottom=c(1, 3, 0), xright=c(3, 8, 10), ytop=c(4, 8, 1))

An arbitrary polygon can be drawn with the polygon() function. Its main
arguments are two vectors defining x and y co-ordinates of the polygon’s vertices.
The following script draws the polygon no. 1 in fig. 34 (right):
plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot
65 The arrows within the given set have to be uniform with respect to the shape and position

of their arrowheads, which is, in a way, not unreasonable.

65
p.1.x <- c(2, 1, 1, 3, 5, 4) # X coordinates for polygon no. 1
p.1.y <- c(0, 1, 3, 5, 2, 1) # Y coordinates for polygon no. 1
polygon(p.1.x, p.1.y)

One and the same vector can define more than one polygon. To do so, we
need to insert NA to separate data for different polygons. E. g., the polygons no.
2 and 3 (fig. 34, right) are the results of the following code:
p.2.x <- c(1, 2, 3, 4, NA, 7, 5, 9, 8) # X coord. for polygons no. 2 and 3
p.2.y <- c(6, 8, 9, 5, NA, 5, 5, 8, 9) # Y coord. for polygons no. 2 and 3
polygon(p.2.x, p.2.y, density=c(12, 24))

As you see from the example polygon 3, the R polygons may contain inter-
secting edges. This, in turn, may affect the polygon background fill pattern,
which, in this rather special case, is regulated with fillOddEven argument (de-
faults to FALSE). I shall use two pentagrams to illustrate this point (fig. 34,
right: 4 and 5).
p.3.x <- c(6, 7, 8, 5.5, 8.5) # X coordinates for polygon no. 4
p.3.y <- c(0, 3, 0, 2, 2) # Y coordinates for polygon no. 4
polygon(p.3.x, p.3.y, density=12, angle=-45)

p.4.x <- c(7.5, 8.5, 9.5, 7, 10) # X coordinates for polygon no. 5
p.4.y <- c(2.5, 5, 2.5, 4.5, 4.5) # Y coordinates for polygon no. 5
polygon(p.4.x, p.4.y, density=12, angle=-45, fillOddEven=TRUE)

The last thing to learn in this section is how to add text labels to annotate
graphs. As we have added already a lot of polygons to our plot, we may need
to somehow label each of them just as they are labelled in fig. 34. The text
labels can be added with text() function. Its most vital arguments specify
co-ordinates of the invisible points to which the text labels are attached (x and
y), contents of text labels (labels), and positions of the text labels relative
to the invisible points. By default, the text label is horizontally and vertically
centered at the invisible point, but it may occupy four standard positions around
the point, specified in an already familiar manner: they are numbered from 1
through 4 clockwise, starting from bottom.66 The code that adds text labels to
the fig. 34 (right) is as follows:67
text.x <- c(1, 1, 8, 7, 8.5)
text.y <- c(1, 6, 9, 3, 5.7)
text(x=text.x, y=text.y, labels=c(1:5), pos=c(2, 2, 3, 2, 4))

In the real-life situations, all these graphical primitives can be more useful
than it may seem after this cursory acquaintance. Needless to say that R itself
relies upon some of them when plotting. Histograms and barplots can be rebuilt
from rectangles. By adding a segment, a couple of modified arrows, and, maybe,
few points to a rectangle, we get a boxplot. We have not yet dealt with any
maps but, as we shall see soon, some of them are nothing but collections of
coordinates for plotting sets of sophisticated polygons, line segments, and dots
packed in more complex objects. Social network graphs are generally built of
points and line segments or arrows. Text labels employed in any graphs are
printed with text().
66 Compare this to the axis() function discussed above.
67 There are other important arguments of the text() function, and I would suggest studying
them in greater detail experimentally after reading help(text).

66
Russian universities except Dorpat
80
Dorpat
Other European, mostly German, universities
60
Faculty members
40
20
0

1800 1820 1840 1860 1880 1900

Timeline, years

Figure 35: Using polygon() function. Temporal dynamics of the Dorpat university
faculty composition. Areas representing faculty members who graduated from different
universities are coloured differently. Note the dramatic increase of graduates of Russian
universities in the 1890s. See text for code details.

Sometimes, however, we may need graphical primitives to compose custom

graphs even when these custom graphs themselves look pretty ordinarily. I
would like to provide here just a couple of examples.
The first graph (fig. 35) comes from a real-life conference presentation on
the project in the history of universities. It presents the temporal dynamics of
the structure of Dorpat University faculty. Even though it is not at all complex,
it includes three polygons superimposed upon each other:
# Loading data
dpt.dyn <- read.table("dpt.dyn.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)

# An empty plot
plot(dpt.dyn$YEAR, dpt.dyn$TOT, type="n", ylim=c(0,85),
xlab="Timeline", ylab="Number of persons")

# Adding polygons
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$TOT, 0), col="#0072CE")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, (dpt.dyn$U1.DPT.TOT + dpt.dyn$U1.EUR.TOT), 0), col="black")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$U1.EUR.TOT, 0), col="white")

The second graph (fig. 36) is a piece of electoral forensics. It presents a

histogram of voters’ turnout at the 2018 presidential elections in Russia with
a corridor of likely ‘natural’ variation superimposed. The histogram is just
the ordinary hist() with a fine-grained bin, while the corridor is a polygon
assembled in an economic and ingenious way: its upper border is formed with
data points for the mean bin frequency plus three standard deviations while

67
Figure 36: Using polygon() function. Presidential elections in Russia, March 18,
2018. Original voters’ turnout histogram (blue) and the results of Monte-Carlo sim-
ulation (thick red line — mean simulated frequency, gray shaded area between thin
red lines — mean ±3 standard deviations). Note the thin blue peaks extending far
beyond the corridor of likely variation: it is highly likely that the area marked with
peaks contains the results of mass data forgery. See text for code details.

its lower border is formed with data points for the mean minus three standard
deviations listed in the reverse order:68
# Reading datasets, adding calculable variables;
ru.2018 <- read.table("ru.2018.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)
ru.2018$VOTED <- ru.2018$BALL.VALID + ru.2018$BALL.INVALID
ru.2018$TURNOUT <- ru.2018$VOTED / ru.2018$VOTERS
ru.2018$PUTIN.share <- ru.2018$PUTIN / ru.2018$VOTED
ru.hist.MC <- read.table("ru.hist.MC.txt", h=TRUE, sep=" ", stringsAsFactors=TRUE)

# Plotting the histogram;

hist(ru.2018$TURNOUT, breaks=seq(-.0005, 1.0005, .001),
ylim=c(0, 500), col=rgb(0, 0, 1, .3), border=rgb(0, 0, 1, .3),
main="Presidential elections in Russia, 2018-03-18",
xlab="Voters’ turnout at a polling station, 0.1% bin",
ylab="Frequency")

# Adding the polygon for the MC simulation mean +/- 3 standard deviations;
polygon(
x = c(ru.hist.MC$PCT, ru.hist.MC$PCT[1001:1]), # Note reverse order ([1001:1])!
y = c(ru.hist.MC$MEAN + 3 * ru.hist.MC$SD,
(ru.hist.MC$MEAN - 3 * ru.hist.MC$SD)[1001:1]), # Note reverse order ([1001:1])!
68 See the code appendix for this section for the details of data pre-processing. The Monte-

Carlo simulation script is included as a bonus. It should be noted, however, that it takes rather
long to run it with a reasonable number of iterations. My worn Dell Latitude 3440 with its
1.70 GHz Intel and 4 GB RAM spent about eight hours going through 1 000 iterations. For
those who prefer not to go this deep, a ready-made dataset for this specific graph is provided
in a separate file.

68
Wender
Tennemann
Reinhold
Reinhold
Chalybaeus
Schwegler
Erdmann
Weigelt
Ueberweg
Dühring
Stöckl
Deter
Nomina
Kirchner
Windelband
Windelband
Falckenberg
Eucken

1800 1820 1840 1860 1880 1900

Timeline

Figure 37: A combination of segments() and rect(). Nineteenth cenury German

works in the history of philosophy, published at least three times or more. Authors
are indicated by their last names, successive editions by black rectangles and intervals
between different editions of the same book are indicated by line segments. After
Demin and Kouprianov (2018).

col=rgb(0, 0, 0, .3), border=rgb(1, 0, 0, .3), lwd=.5)

# Adding the mean of the MC simulation;

lines(ru.hist.MC$PCT, ru.hist.MC$MEAN, col=2)

The third, and final, graph (fig. 37) combines segments and rectangles to
display the timeline of publication of German textbooks in the history of phi-
losophy. The script generating this latter graph is rather long (and it requires
rather specifically structured data), so I do not provide it here, but you may
try and develop something similar. All you need is a module which calculates
positions of individual timelines (thus calculating by the way an appropriate
aspect ratio for the picture) and a module drawing segments, rectangles, and
text labels.

Getting inside: Numeric output of plotting functions

In this part of the manual, we work mostly with the plotting functions. Arguably
the majority of R functions are, however, non-plotting, i. e. they perform more
or less complex data transformations and/or calculations without any immediate
graphical output. They produce vectors, tables, matrices, data frames, lists,

69
and diverse list-like complex objects. Some of these resulting objects can be
meaningfully fed into the plotting functions to produce plots. In fact, we did it
all the time. Remember the basic plots we produced with the plot() function?
plot() is super-smart, to the degree that it is nearly omnipotent. Every time
plot() identified the kind of the object and the nature of its elements and
produced an appropriate kind of a plot whether we wanted it or not.69
It is important to know that some (but not all) plotting functions also per-
form data transformations and even create objects. We never saw them because
these objecs are usually re-directed to some other more basic plotting functions
and never appear in public. These objects, however, can be extracted, reviewed
and, sometimes, used for custom plotting. Of the plotting fuctions we exten-
sively used before, only plot() and the ones involved with graphical primitives
(points, lines, segments, etc.) do not produce anything but graphics. All other
functions create objects of come sort. Not all of these objects are extremely
useful but two of them (the ones resulting from hist() and boxplot()) are
worth studying in depth.
To bring the hidden object to light, it is enough to divert the function’s
output into a named object. E. g. the line:
students.mass.hist <- hist(students.df$MASS)

produces both a histogram in the plotting console and a students.mass.hist

object. We can preview the latter with str():
> str(students.mass.hist)
List of 6
$ breaks : int [1:13] 40 45 50 55 60 65 70 75 80 85 ...
$ counts : int [1:12] 1 26 55 47 34 26 13 9 5 3 ...
$ density : num [1:12] 0.000909 0.023636 0.05 0.042727 0.030909 ...
$ mids : num [1:12] 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 ...
$ xname : chr "students.df$MASS"
$ equidist: logi TRUE
- attr(*, "class")= chr "histogram"
>

We see here a list of objects (vectors, mostly) describing the histogram.

The first vector contains breaks (the x-axis co-ordinates for the borders of the
histogram’s bins); the second, counts (the number of variants falling within
each bin); the third, density (bin heights for freq=FALSE version of the his-
togram); the fourth, mids (x-axis co-ordinates for the bins’ midpoints); the fifth,
xname (default xlab value); and the sixth, equidist (whether the bins are of
equal width: TRUE or FALSE). There is also attr (list attributes container) with
class=histogram.
This object contains enough information for plot() to interpet it correctly
and build a histogram. Try plot(students.mass.hist).
I shall use it to extract counts and place them on top of bins (sometimes
it is needed for clarity). We already know that, having text labels and their
coordinates one may put text labels next to the data points. The object created
with hist() provides us with all we need to label the bins: mids contains x-axis
co-ordinates, while counts provides us with both y-axis co-ordinates and text
labels. The following code shows how to place counts right above the mids of
the bins (note, please, the pos=3 argument for text()):
69 To see something really sophisticated, try and feed into the plot() the result of an lm()

function.

70
Undergraduate students Undergraduate students
55

60
55
50
47

50
47
40

40
34
34
Frequency

Frequency
30

26 26

30
26 26
20

20
13 13
9 9
10

10
5 5
3 3
1 0 1 1 0 1
0

0
40 50 60 70 80 90 100 40 50 60 70 80 90 100
Body mass, kg Body mass, kg

Figure 38: Plotting students.mass.hist with plot(). Left: a nearly raw function
call, right: ylim adjusted. See text for code details.

plot(students.mass.hist,
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)

The only thing that remains to be fixed (fig. 38, left) is the ylim of the plot:
the label for the most frequent bin, from 50 to 55 kg, extends slightly beyond
the plotting area. In the following code, I add a little bit to the maximal value
of counts to set the new ylim (fig. 38, right):70
plot(students.mass.hist, ylim=c(0,max(students.mass.hist$counts)+5),
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)

The other function I would like to mention here is boxplot(). Unlike

hist(), the boxplot() can not create a readily plottable object, but the latter
contains many useful stats.
students.mass.box <- boxplot(students.df$MASS)

The resulting object is again a list:71

> str(students.mass.box)
List of 6
$ stats: num [1:5, 1] 40 53 59 65.8 84
$ n : num 220
70 The line height can also be calculated in "user" units (see the subsection on the boxplots

for details). Here I skipped this step for brevity by just adding +5 but this is not generally
advisable.
71 Note, please, the notation for matrices (stats and conf): vectors defining the number of

rows and columns are separated with commas. E. g. [1:5, 1] in stats means that there are
five rows and one column.

71
$ conf : num [1:2, 1] 57.6 60.4
$ out : num [1:6] 87 85 90 85 98 90
$ group: num [1:6] 1 1 1 1 1 1
$ names: chr "1"
>

It contains y-axis co-ordinates for the elements of the main part of the box-
and-whiskers plot (stats), the number of elements in the analysed vector (n),
95% confidence interval for the median (conf),72 a vector of y-axis co-ordinates
for outliers (out), the variable (group), to which the outliers belong (in our
case, all six belong to the single variable under consideration), and the grouping
variable name (this time it was arbitrarily assigned to 1 because no grouping
variable was provided).
As boxplot() can handle several variables at a time, or split the single
numerical variable on the basis of the grouping variable, the objects it creates
also can reflect this. In these cases, the matrix elements acquire additional
columns and vector elements additional elements. To illustrate this, let us build
a multiple boxplot using the same body mass as the numeric variable and the
students’ sex as the grouping variable:73
> students.mass.2.box <- boxplot(students.df$MASS ~ students.df$SEX)
> str(students.mass.2.box)
List of 6
$ stats: num [1:5, 1:2] 40 52 56 62 77 54 64 69 76 90
$ n : num [1:2] 168 52
$ conf : num [1:2, 1:2] 54.8 57.2 66.4 71.6
$ out : num [1:2] 82 98
$ group: num [1:2] 1 2
$ names: chr [1:2] "f" "m"
>

It is instructive to compare now the outlook of some of the elements of the

two objects we have just created.
> students.mass.box$stats
[,1]
[1,] 40.00
[2,] 53.00
[3,] 59.00
[4,] 65.75
[5,] 84.00
> students.mass.2.box$stats
[,1] [,2]
[1,] 40 54
[2,] 52 64
[3,] 56 69
[4,] 62 76
[5,] 77 90
>

Note, please, also the differences in the out, group, and names elements.
The number of the outliers changed because the general shape of the variable’s
72 The limits are identified with the following formula: M ± 1.57 × IQR
√ where M stands
n
for the median, IQR for interquartile range, and n for the number of elements in the vector
under consideration (See: John M. Chambers, William S. Cleveland, Beat Kleiner, & Paul A.
Tukey (1983) Graphical methods for data analysis. Pacific Grove, California: Wadsworth &
Brooks/Cole Publishing Company: p. 62).
73 Note the changes in the matrices stats and conf.

72
frequency distributions did (distribution of mass within each gender group is
more compact, hence the lesser number of outliers). The group now indicates
that the first outlier belongs to the first sub-series, while the second to the
second. And the names now contains names for the sub-series derived from the
values of students.df$SEX used to group students.df$MASS. The group, by
the way, offers also an important insight into how the x-axis values for boxplot
elements are generated (they are assigned integer numbers from 1 to N, where
N is the number of vectors or sub-series under comparison).
The statistics routinely produced and discarded by boxplot() were consid-
ered important enough to deserve a special non-plotting function that brings
them to light:
> boxplot.stats(students.df$MASS)
$stats
[1] 40.00 53.00 59.00 65.75 84.00

$n
[1] 220

$conf
[1] 57.64182 60.35818

$out
[1] 87 85 90 85 98 90

This latter function, however, can not process more than one variable at a
time.
As I said, not all objects created with plotting functions are extremely useful.
You may also try barplot() and curve() on your own and see what do the
resulting objects contain and what good can plot() extract from them.

Preparing presentation quality graphs

So far we have not discussed the problem of presentation quality graphs in detail,
working mostly with the on-screen graphical console (a separate window that
pops up when you enter your first plotting command). Some basics, however,
were scattered all over the previous sections. We know how to save the graphs
to a file, how to control appearance of some graph elements, how to control
axes, etc, etc. The time has come to deal with it in a more systematic manner.
This means, that we need to go somewhat deeper into the world of computer
graphics.
As you might have known from other sources, the computer graphics can
be subdivided in two domains: raster and vector. With raster graphics, each
picture is defined as a rectangular set of more or less economically described
coloured dots. With vector graphics, each picture is defined as a set of instruc-
tions similar to the ones we already used to plot the graphs: where to place
points, lines, and polygons, and how they should look (colour, textures, line
width, etc.). On the basis of these instructions, rasterized versions of the graphs
are created for every plotting device on ad hoc basis (because the most common
plotting devices like computer screens or printers are rasterised: screens have

73
their ‘pixels’ and printers print with small dots).74 The R graphical system is
inherently vectorised but it easily produces both vector and raster graphics with
its numerous virtual plotting devices (we already used one of them, png()).
The vector graphics in R has its limitations. Neither pdf() nor other vector
plotting devices can handle transparency (hence, no rgb() for colours). The vec-
tor plotting devices are also notoriously problematic when working with glyphs
beyond basic Latin characters.75
The uses of raster and vector graphic files produced by R are slightly differ-
ent. Raster graphics files work better for a quick preview (the standard graphics
preview software under all major operating systems handles them by default and
allows for flipping through large amounts of images with a keystroke). Vector
graphics files fit best with electronic publishing as quality book and journal il-
lustrations. Of course, there is a lot of flexibility (and you are not at all bound to
use raster for on-screen previews and vector for books), however, every time you
pick a particular flavour of raster or vector image, make your choice cosciously
and not without some deliberation.
The first subsection will treat the specific problems of raster graphics (PNG,
JPEG, and TIFF), the second, those of vector graphics (SVG, EPS, and PDF),
and the third, a rather special problem of colour models (sRGB vs. CMYK),
which is relevant if you are going to prepare colour graphics for printing on
paper.

Raster graphs: screen vs. paper

Apart from the on-screen graphical console, main raster graphic devices include
png(), tiff(), and jpeg().76 In this section, I shall discuss the raster devices
on the basis of png() (all of them are controlled in fundmentally similar ways)
pointing to the peculiarities of other devices when appropriate.
This section consists of four subsections. Before I approach the png() print-
ing device itself, I should clarify a most confusing issue of image file pixel size
and image resolution. The one is so frequently (and erroneously) mistaken for or
confused with another that the need to stress the profound difference dividing
them can not be underestimated. Philosophically, the pixel size of the image is
more fundamental, so I shall start with it and then proceed to resolution. To
make matters worse, there is also the byte size, which deserves some of our atten-
tion as well. The second subsection will cover the practical matters of the image
size adjustment, the third the art of printing quality black and white graphs,
the fourth some issues of visual rhetorics and creation of serialised images for
presentation slides and animated graphs.

Digression: Pixel size, byte size, and resolution

The pixel size of a raster image file is defined by the number of dots (pixels)
it contains. It is usually specified with the total number of pixels or pixel
dimensions of the image file. E. g. my old good DSLR camera produces image
74 There are also plotters, which draw line drawings with pens, and they use vector graphics

instructions immediately.
75 It is possible to force it into printing Cyrillics, e. g., but neither the process itself, nor its

result are pleasant.

76 Windows users may like bmp() as well but we shall not describe it in any detail here.

74
files worth 6 megapixel or 3008 × 2000 pixels, my more recent compact camera
does 20 megapixel or 5184 × 3888 (which look, in fact, worse than those taken
with the 6 mp DSLR). The pixel size of the image file as measured in pixels
remains the same as long as we do not crop or resize it forcibly with a raster
graphics editor.77 Each dot or pixel (or groups of them if some method of
compression is employed) is described more or less economically in terms of
colour. This means that the image files of the same pixel size are not necessarily
of the same byte size.
The byte size of an image file depends, besides the pixel size, on many things,
among which the arguably most basic is the number of shades of colour reserved
to describe each pixel. Minimum is, understandably, two, black vs. white, so
each pixel can be described with just one bit (1 vs. 0) and maximum (not
reacheable within R graphical system) currently extends to 48 bits per pixel.
The so-called true colour scheme as defined in the RGB colour space (based on
the representation of each colour as a mixture of red, green, and blue) uses 32
bits, eight to encode the intensity of each of the three colour channels (which
makes 24 bit) and eight for transparency (or unused). Thus it is capable of
describing up to 2563 = 16 777 216 colours, which is formally a lot bigger than
a human eye could discern.78 Another important thing contributing to the byte
size is compression method and the degree of compression used but we shall not
go deep into this now.
The resolution is a bit more confusing (because it is often abused). To cut it
short, the resolution is not a characteristic of the image file (even though some
image file formats also allow storing the intended resolution of the picture). The
resolution is the characteristic of a physical printing device (a computer screen
or a printer). It is measured not in pixels or dots but in pixels or dots per
inch (ppi or dpi accordingly). This means that the resolution of a displayed
or printed image is always the same as the resolution of the physical printing
device, no matter what. These days it is usually about 100 dpi for computer
screens and about 600–1200 dpi for laser printers.
One may wonder, what then happens to all these virtual pixels described in
the image file, the ones responsible for the pixel size of the image? And why
then the ‘high resolution’ images sometimes look better than ‘low resolution’
images? The answer is simple. The most straightforward way to represent a
digital image on a physical printing device is to represent each of the virtual
pixels of the former with a real pixel of the latter. This way, however, is not
always desirable.
E. g. my aged laptop screen contains slightly over 1 megapixel — 1366 × 768
pixels. What would I do with my 6 to 20 megapixel digital photos? Preview
them in parts? 16 or 201
at a time? No sane person would do that (except she is a
professional photographer or willing to drag herself into a complex and possibly
dangerous affair like that in Michelangelo Antonioni’s 1966 ‘Blowup’). The
image preview programs resize large images automatically to fit them into the
77 To crop an image means to carve a fragment out of it. To resize the image means to

shrink or expand its pixel size. Obviously, both operations change the pixel size by definition.
78 In practice, human vision is slightly more complicated, to say the least, than the formal

description of digital colours and the ability to distinguish the shades of colour is unevenly
distributed over the full spectrum. So, for a human, some of these sixteen billions of colours
look the same, while in some other areas of the spectrum the border between the two formally
neighbouring colours is still visible.

75
screen. To do it they perform complex calculations (each pixel of the printing
device represents a summary of several pixels of the original digital image, and
this summary should be technically accurate and aesthetically acceptable). The
same goes for printer drivers (except that printers usually have a far bigger
resolution than the computer screens).79 The images of a smaller pixel size
than the screen are usually displayed 1:1 and if we would like to enlarge them,
then, again, the preview software or a printer driver recalculates the resulting
physical picture in such a way as to represent one digital pixel of the image file
with several physical pixels or dots of the printing device. Sometimes this goes
too far and the digital pixels become quite visible large squares (each displayed
or printed with a considerable number of physical pixels).
What does it all mean to us? The most important lesson is that raster
images need to be tailored to the size and resolution of the physical device
they are intended for. A small image file 500 × 500 would be enough for a
quick on-screen preview. Beamer presentations usually require slightly larger
images about 800 × 600 or 1000 × 750. Academic journals request page-wide
illustrations, which can be printed at a resolution of 1200 dpi, which, given the
average width of the printing area (5′′ ) leaves us with images 6000 pixels wide
at least.

Adjusting graph size

We have already discussed this briefly above in Saving the plots. Now, the time
has come to explore the topic in detail. As I said above, the default pixel size
of R raster printing devices is 480 × 480, which is too small even for an on-
screen preview. This default setting can be traced back to the size limits of
the first VGA monitors (640 × 480 px), but today it does not look impressive.
Thus, whether on screen or on paper, we have to override the default size for
presentation quality graphs anyway. In this subsection, we shall explore the
ways to control the size of raster plots and their elements.
Even though visually tolerable results can be achieved by adjusting but a
few arguments of png() and par() functions, it takes to know which exactly
arguments to adjust. And there is more trouble than it seems at the first sight,
because there are groups of interconnected parameters which have to be changed
harmoniously.
Let us try and adjust the size of an image in a stepwise manner to see all
the possible pitfalls.
I start then with a default printing function call. As you probably remem-
ber, printing to a file (see Saving the plots section above) requires three steps
(opening a printing device, plotting the graph to it, and closing the device in
question), here they are represented with a single function each:80
png("img_default.png")
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()
79 The fax machines of old days had 150 dpi, and the 24- and 9-pin dot matrix printers even

less, but now it would be difficult to find a laser printer even with 300 dpi, for the majority
already has 600 dpi set as default.
80 In the following examples we shall work on the German philosophical journals publication

dynamics dataset, so, please, take the trouble and create the object j.phil on the basis of
the file dt.j.phil.1655 1970.txt.

76
Figure 39: Left, if you only can see it: a default 480×480 image file printed at 1200 dpi
‘as is’, without size adjustments. Left: the same, reproduced not to scale. When
previewing this manual as a PDF-file, use magnification to examine the appearance of
characters and lines in both.

The result of this code would be a small image file closely resembling fig. 39
(right) saved to the R working directory. Evidently, it is way off small for quality
printing. Printed at 1200 dpi it would turn into a square centimeter image filled
with microscopical characters, numbers and split-of-a-hair lines (fig. 39, left).81
The first thing to be adjusted is obviously the pixel size of the image file (this
can be done, as we have already discussed above, with the aid of two arguments
of png() function, width and height):82
png("img_wh.png", width=6000, height=4000)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

The resulting image file (fig. 40), however, would indicate that something got
completely out of hand. While the plotting area indeed expanded to 6000×3600
pixels, the printing characters and lines remained ridiculously small. We haven’t
noticed anything of the sort when printing image file 500 × 500 only because 500
is not that different from 480. Now the vertical pixel size is 7.5 times bigger than
that of the default image. However, if you try and preview the img wh.png and
img default.png in different viewer windows at 1:1 and juxtapose them on the
81 From this you may conclude that in all previous figures of this manual I tricked you, for

they looked far better. Yes, I did. Most of them are either high-resolution raster graphs or
vector graphs in PDF.
82 For this graph, I jumped right away to the 1200 dpi resolution. I also changed the aspect

ratio of the image file to 5:3 for aesthetic purposes.

77
Figure 40: Image file printed with pixel size adjusted to 6000 × 3600 all other settings
unchanged (trust me, there is something in this image, even though you probably
hardly see it). See text for code details.

screen, you will see that the lines are of the same width, and the numbers are of
the same size in both images. Apparently they have not stretched automatically.
In R, the appearance of the graph can be mended in several ways. If we
need the way that is economical in a long run (i. e., requires less adjustments
for individual elements of the graph) and philosophically sound, then the time
has come to remember whatever we know about pixel size and resolution (see
the previous section).
The matter is that R printing system makes default assumptions not only
on the pixel size of the printing device but on its resolution as well. To fit
better with the low-resolution computer screens of old days and to ease some
calculations the default resolution was set to 72 dpi. The image file 480 × 480 px
reproduced 1:1 at a 72 dpi physical device makes 6.67 × 6.67′′ image, which is
pretty big (not to mention that 480 px constituted the full height of a VGA
monitor). It is close to the images R produces on-screen today. The default size
of the graphical console that pops up on my laptop is 6 × 6′′ . The resolution
of my laptop’s monitor, though, is close to 112 dpi, which is rougly 1.56 times
bigger than 72 dpi.83 Apparently, the screen images these days are rescaled to
befit bigger (in terms of pixel size) monitors with higher resolution. Anyway,
72 dpi is too far from the intended 1200 dpi resolution of our printed picture.
The intended resolution can be fixed with res argument of the same png()
function:
png("img_whr.png", width=6000, height=4000, res=1200)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
83 A widespread belief that laptop screens resolution these days is invariably 96 dpi is plain

wrong.

78
dev.off()

The result can be seen in fig. 41, top. The image quality has markedly
improved. The lines are now thick and solid, the letters and numbers are big
and fully readable. Their shapes are elegantly curved and bear no traces of
dithering or visible pixel dents on their margins. The same result as in fig. 41
(top) could be obtained in a slightly different way by specifying image size in
inches (note the changes in width and height assigned values and the new
units argument) and indicating explicitely its intended resolution:
png("img_whri.png", width=5, height=3, units="in", res=1200)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

However technically tolerable, the graph is still too far from being aesthet-
ically and scholarly acceptable. The only thing to be fixed in this section,
however, is the font size, which is apparently too big for this book. To change
it in a sensible way we have to learn another bit of theory, this time from the
art of typography.
At bottom, many digital printing systems keep trying to imitate the old
good movable type. One of its features was a peculiar measurement unit (point)
which, to make matters worse, varied slightly across countries and continents.
1
Roughly speaking, it was based on 72 of an inch (or French pouce, given all
historical variations within and adding several attempts to reconcile it with
metric system and a number of software implementations on top). It is this
system, which through its wide use in WYSIWYG word processors is also familiar
to any ordinary computer user. When you pick 12 pt Times New Roman in MS
Office Word or 10 pt Liberation Sans in LibreOffice Writer, you are using just
it. 12 pt means 12 1
72 , or 6 of an inch (≈ 4.23 mm), etc. You can try and measure
it for yourself. It should be taken into account though, that the size of the
font defines the distance between the ascending and descending elements of font
characters (e. g., between the highest point of an ‘l’ and the lowest point of an
‘y’) and should be measured accordingly.
The R graphical system sets the default font size at 12 pt. The paragraphs
of this manual are set in a 10 pt font, and the image captions in an even smaller
9 pt font. To make the image font proportional to the one in the caption, we
need to specify the font size explicitely and set it to 9 pt making the characters
smaller by one quarter (note the new pointsize argument added to the script):
png("img_whrp.png", width=6000, height=3600, res=1200, pointsize=9)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

The result can be seen in fig. 41, bottom. The plot looks better but it is
still too far from perfect. We somehow fixed image size and made the font size
proportional to the rest of the text in the book, but the plotting area is still
too small, and the graph looks too ugly for a scholarly journal. This, and many
other things concerning this graph, will be discussed in the following subsection.

79
Figure 41: Image files printed with crude size adjustments. Top: same as fig. 40,
intended resolution (res) set to 1200 dpi. Bottom: the same, plus font size (pointsize)
set to 9 pt.

80
Black and white line art in earnest
Any customer can have a car painted any
colour that he wants, so long as it is black.
— Henry Ford

In this subsection I shall discuss further improvements of the plot we started

polishing in the previous one. I stick here to the black and white graphics for
two reasons. The first is that it is, strangely enough, far trickier to produce a
decent black and white graph than a coloured one. All you need for a coloured
graph is a certain amount of taste in mixing colours and a reference table that
would help you to avoid colour combinations which are indiscernible for colour-
blind people. It takes more to express yourself in black and white. The second is
that the art of sober black and white line art drawings is still in great demand.
Even though these days infographics is mostly pretty colourful, the ages will
pass, I am afraid, before the established journals in humanities start printing
colour graphs. Even if the online version of the journal puts no limitations on
colour, there remains the problem of printed copies. And many journals do not
provide this option.
In this section, I shall deal with three topics. First, I shall briefly discuss
the limitations of black and white plots. Second, I shall get in touch again with
a most powerful function par(). Third, I shall address more systematically the
issue of the layered composition of plots and the importance of a proper ordering
of layers.
Black and white graphics leaves us with very limited means of expression.
We have to use shading lines (angle and density) instead of colour fill for the
polygon-like elements. We have to vary pch and lty instead of col with data
points and lines.84 This all brings in its train still further limitations. The
number of contrasting and easily discernible line types and printing characters
that can be combined within one and the same graph is hardly ever bigger than
three or four (especially in view of the method enchancing readablity of data
points and lines in the black and white graphs, which is discussed towards the
end of this section). It is slightly better with linear shadings (especially because
one could use black and white fill for extreme cases) but not too much better.
All in all this means that we usually can combine less subsets or less data vectors
in one and the same graph than with coloured graphs, which implies a far more
careful choice and planning of illustrations for academic publishing.
It is time, however, to come back to our graph. We left it after initial size
adjustments with intolerably wide margins and few other problems we are not
yet ready even to formulate, let alone solving them. This means it is time
to turn to a most powerful and versatile function par(). We have already
stumbled over par() many times. It is no wonder, for it controls innumerable
general parameters of the entire plotting area of the graphical device. Indeed,
this function has so many arguments, that a whole book can be devoted to it
alone.85 This time we need it to trim the margins. Last time we did it more or
84 See sections Scatter plot, fig. 12, and A digression: Time series and the type of plot(),

fig. 18.
85 See Simple bar plots for mai and mar, A digression: Time series and the type of plot() for

mfrow, and Stepping beyond two dimensions for bg, fg, col.axis, col.main, and col.lab. In
fact, we used it also to change the default pch and even default plot elements’ size (the latter

81
less systematically (in Simple bar plots) we tried to expand one of the margins
proportionally to the length of the axis labels. Now, the task is much simplier.
We need to remove extra space at the margins to give more room to the
plot itself. In a book, the margins of the page are wide enough, and the page-
wide graph should fit exactly the width of the page’s printing area, leaving no
extra space above or to the right. In case the image is located differently, the
offset from it is anyway defined by the publishing software, so you do not need
to waste the space in the image file. As you may remember, the margins are
controlled with two arguments of par() — mar and mai. The former defines
margins in terms of standard lines of text, the latter in inches. Their default
values can be previewed by calling their names with an empty function call.
> par()$mar
[1] 5.1 4.1 4.1 2.1
> par()$mai
[1] 1.02 0.82 0.82 0.42

As you see, the empty top and right margins still occupy the spatial equiva-
lent of two to four lines of text, and even the bottom and left margins containing
axis labels are too wide. To make them shrink a bit I would change the default
settings as follows:
png("img_whrp_Mar.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3, 3, 0, 0)+.1)
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

The result in fig. 42 (top) looks both encouraging and disappointing. On

one hand, the plot now ocupies nearly entire width of the page’s printing area.
On the other hand, the axes labels are gone. The only remnant of the left axis
label is still visible as a small dot representing the tip of the descender of ‘q’ in
‘Unique titles’ at eight-o-clock from 40 tickmark.
To fix it we should change the position of lines with mgp bringing them closer
to the axis. mgp is a vector of three values, which defaults to:
> par()$mgp
[1] 3 1 0

The two former values specify positions of the axis label and tickmarks labels.
The third affects the axes themselves.
png("img_whrp_MarMgp.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3, 3, 0, 0)+.1, mgp=c(2.1, .6, 0))
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

This code brings axis labels back in (fig. 42, bottom). Now, we need to get
rid of the black border around the plot area and ponder whether we should fix
the axes and tickmarks (after removing the box around the plot the axes may
is not avisable in case you work on quality images for academic publishers but it may pass if
you need a quick preview of a bigger size and do not wish to enter the dark forest of image
sizes, fonts, and resolutions we are wandering now). I recommend consulting help(par) to
have a glimpse into the whole universe of graph settings.

82
Figure 42: Adjusting margins. Top: same as fig. 41, mar adjusted to expand the plot
area; pay attention to the dot to the left of the graph about y=35, it is the tip of the
descender of ‘q’ in ‘Unique titles’. Bottom: margin lines’ positions fixed (mgp).

83
appear too short, and the tickmarks already look too long for the now too close
tickmark labels).
The box around the plot can be removed in two ways. The first is to spec-
ify par()$bty (defaults to "o", which means that the box encircles the graph
entirely).86 It is more economical as it affects only the box itself, and not the
axes.
png("img_whrp_MarMgpBty.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
dev.off()

The results can be seen in fig. 43, top. Even though the defaults for the
plots in R graphical system were developed by expert designers and in view of
academic press requirements, the axes sometimes require futher trimming. In
this particular case, I would not, perhaps, change anything but the length of
the tickmarks. I, however, would like to make use of the second way to get rid
of the box, and to remind some of the other axis controls.87 In the following
code axes=FALSE argument of the plot() function suppresses both axes and
the box, and axis() is used to re-create axes with more detailed gradation (pay
attention to at and labels) and shorter tickmarks of varying length (look at
tcl).
png("img_whrp_MarMgp_at.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0))
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles",
axes=FALSE)
axis(1, tcl=-.4)
axis(1, at=seq(1660, 1970, 10), labels=FALSE, tcl=-.25)
axis(2, tcl=-.4)
axis(2, at=seq(0, 70, 10), labels=FALSE, tcl=-.4)
dev.off()

The results can be seen in fig. 43, bottom. I am not sure the axes look better
this way. Moreover, aesthetically, the top figure looks to me more appealing. Its
axes are not over-cluttered with tickmarks, and, as a result, the whole picture
gets more air. In this case, however, the tickmarks are, of course, too sparse to
help us establishing more or less precise correspondence between the ups and
downs of the journals’ population dynamics and major historical events, and,
to be honest, fig. 43, bottom, with its seemingly far more detailed axes, offers
a very moderate improvement. Would it be easy to decide whether the World
War I had any effect? Whether a sharp decline in the number of periodicals
conicides with Nazi coming to power in 1933? Probably, a better strategy would
be to introduce a regular grid against the background of which the curve would
look more informative. Or, a still better, to use lines or shaded areas to indicate
important historical events or periods immediately.
86 The values of bty are to remind of the shape of the border, however remote the resem-

blance is. "l" stands for left+bottom, literally L- (not l-) shaped frame. "7" means top+right.
"c", "u", and "]" provide top+left+bottom, left+bottom+right, and top+right+bottom
frames. Strangely enough, "n" results not in a n-shaped frame but in a complete absence
thereof.
87 We have already discussed axes above in several instances.

84
Figure 43: Removing box and trimming axes. Top: same as fig. 42, par()$bty set to
"n". Bottom: default axes replaced (see text for the code details).

85
Figure 44: Same as fig. 43 (top) with meaningful historical landmarks added as shaded
areas or vertical lines (see text for the code details), left to right: the Seven Years’
War, Napoleonic Wars (in part), Märzrevolution of 1848, the Franco-Prussian War of
1870, World War I, Hitler appointed Chancellor of Germany in 1933, World War II.

In the following code, I use shaded rectangles to represent long historical

periods, like Napoleonic Wars, and vertical lines for shorter ones (like revolution
of 1848, or a rather brief Franco-Prussian War of 1870). Note that the rectangles
expand far beyond the top and bottom borders of the plot area to appear open
on top.
png("img_whrp_MarMgpBty_l1.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="l",
xlab="Timeline, years", ylab="Unique titles")
rect(xleft=c(1756, 1803, 1914, 1939),
xright=c(1763, 1815, 1918, 1945),
ybottom=-100,
ytop=100,
density=24)
abline(v=c(1848, 1870, 1933))
dev.off()

The result can be seen in fig. 44. Now, the answers I posed above can be
answered easily. We see a small but marked dent at the World War I, and we
clearly see that the decline of the journals’ population began well before 1933.
There only remaining problem is that the shaded areas and auxillary vertical
lines look now as prominent as the data line and obscure the latter.
To make the data line stand out against the background a couple more of
tricks are needed. From our previous experiences with stepwise graph produc-
tion we know, that each complex graph in R is a multi-layered construction, and
each layer is printed in its turn. E. g. in fig. 44 we have plotted first the axes
with their labels and the data line with plot(), then added shaded areas with
rect(), and, finally, the thin vertical lines with abline().

86
What if we use this layered composition combined with another trick that
comes from the past times of the inkpen illustrations? To show that one of the
overlapping lines runs over another, the artists of the past made a small break
in the line that was supposed to lay in the background at the intersection with
the line that runs over it. Of course, in R no one could do this. A visually
similar result however, could be obtained by plotting over the background lines
first a slightly wider white line, and then, a regular width black line.88
In the code that follows, the elements of the graph are carefully layered one
upon the other, the background auxillary elements are drawn first and in thinner
lines, the foreground element is drawn last and with a white background line
added. So, the order of layers is: (1) empty plotting area with axis labels, (2)
shaded rectangles in thinner lines (lwd=.5), (3) auxillary vertical lines (again,
lwd=.5), (4) a wide white background for data line (note col="white" and
lwd=5), (5) the data line itself, (6) the axes. It is important to print the axes
last. No element of the plot that touches the axes can be printed over the
latter. Axes are sacred. In this graph, out of purely aesthetical considerations,
I also extended slightly the range of x- and y-axes and shortened the tickmarks
(tcl=-.4).
png("img_whrp_MarMgpBty_l2.png", width=6000, height=3600, res=1200, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="n",
xlim=c(1650,2000), ylim=c(0,80), axes=FALSE,
xlab="Timeline, years", ylab="Unique titles")
rect(xleft=c(1756, 1803, 1914, 1939),
xright=c(1763, 1815, 1918, 1945),
ybottom=-100,
ytop=100,
density=36, lwd=.5)
abline(v=c(1848, 1870, 1933), lwd=.5)
lines(j.phil$YEAR, j.phil$TOT, col="white", lwd=5)
lines(j.phil$YEAR, j.phil$TOT)
axis(1, tcl=-.4)
axis(2, tcl=-.4)
dev.off()

The result can be seen in fig. 45. Now, both the data line and the background
elements are clearly visible and visually distinct, and the graph is nearly ready
to go to the publishers.89
The same method of highlighting the foregound elements with a white outline
can be applied not only to overlapping lines (with several different data lines
you also should think of a best possible order of appearance which ensures
readability of the graph) but to data points and text labels as well. For data
points, pch 21–25 with col="white" and bg="black" could be recommended
(see section on scatter plots above). For the text labels, there is a special
function shadowtext() from the package TeachingDemos (analogous to text()
discussed above in the section on graphical primitives).90
88 This could be used for colour graphs with many overlapping lines as well. Just use

background colour of the plot instead of white (for obvious reasons, in black and white graphs
background colour is always white by definition).
89 The only difference is that academic publishers usually prefer TIFF over all other raster

formats, so you either will have to use tiff() function or to convert PNG to an uncompressed
TIFF file with an external graphics editor.
90 Like other packages, it should be installed with install.packages("TeachingDemos")

87
Figure 45: Same as fig. 44 with image’s layered structure carefully observed, data
lines, shaded areas, and axes trimmed (see text for the code details and fig. 44 caption
for the explanation of shaded areas and vertical lines).

Presentation slides
[Section not yet complete]
Conspectus: Two elements of visual rhetoric: juxtaposition and highlighting.
Elements of animation. Precise positioning of the elements. Neighbouring slides
as parts of a very short animation sequence. Truly animated graphs. R and
ffmpeg.

Vector graphs
Unlike raster, the vector graphics is automatically scaled to the size of the
plotted image (and rasterized only after that), so it has fever problems with
plot elements size than the raster graphs described in previous sections. On the
other hand, they have got specific problems of their own. They do not support
transparency, and the vector fonts do not support UNICODE.91
The choice between different vector formats is based on the intended further
use of the resulting image files. Theoretically, EPS is preferable for the use
with desktop publishing systems, PDF for stand-alone preview or as a part
of a presentation merged from PDF files, and SVG for further editing with
vector graphics editors (e. g. InkScape or Adobe Illustrator ). In practice, LATEX
converts EPS to PDF before embedding them into the resulting PDF, and post-
processing in vector graphic editors is hardly needed if you mastered R.
and then loaded for each R session with library(TeachingDemos). After loading the package,
one may enjoy reading help(shadowtext) and using the shadowtext() function.
91 There are workarounds that allow printing, e. g. Cyrillic characters with pdf() and

postscript(), but they are so intricated that I would not dare to recommend using them.

88
It would be needless to repeat here everything that was written on the black
and white line art grphics in the section on raster plots. Here I shall focus only
on the few specific differences from raster graphics. Actually, the only important
difference for now is the default unit for width and height of an image.92 While
in the raster graphics, the numeric values for width and height default to pixels,
in the vector graphics, they default to inches. Below is the same code I used
for fig. 45 with necessary adjustments for PDF (pay attention to the first line
of the code). The result can be seen in fig. 46.
pdf("img_whrp_MarMgpBty_l2.pdf", width=5, height=3, pointsize=9)
par(mar=c(3,3,0,0)+.1, mgp=c(2.1, .6, 0), bty="n")
plot(j.phil$YEAR, j.phil$TOT, type="n",
xlim=c(1650,2000), ylim=c(0,80), axes=FALSE,
xlab="Timeline, years", ylab="Unique titles")
rect(xleft=c(1756, 1803, 1914, 1939),
xright=c(1763, 1815, 1918, 1945),
ybottom=-100,
ytop=100,
density=36, lwd=.5)
abline(v=c(1848, 1870, 1933), lwd=.5)
lines(j.phil$YEAR, j.phil$TOT, col="white", lwd=5)
lines(j.phil$YEAR, j.phil$TOT)
axis(1, tcl=-.4)
axis(2, tcl=-.4)
dev.off()

Even though the thin lines look, perhaps, a bit coarser in the PDF version,
the visual differences are negligible. File size is, however, dramatically different.
The 8.5 KiB PDF is roughly 37.8 times smaller than the 321.4 KiB PNG (and
still smaller than the 20.6 MiB uncompressed TIFF of the same image). The
arguably funniest difference is visible only when viewing this manual as a PDF
file with a PDF-viewer. If you try and select fig. 46 with your mouse you will
see that some characters and numbers in it can be selected and copied as text.
With fig. 45 this would not be possible.
PDF graphs can be used not only as parts of LATEX documents but, in
their own right, as parts of multy-page PDF presentations. Nearly all PDF-
viewers, including original Adobe Acrobat Reader, can switch to the full-screen
presentation mode, so PDF is frequently used as a cross-platform presentation
format. The easiest way to merge individual PDF images into a multi-page
PDF presentation is to resort to an external application. I shall explain it using
PDFtk and qPDF.93
Both can operate from the command line and allow to rather flexibly ma-
nipulate PDF files. Here are examples of two CLI-commands94 merging several
PDF graphs into one file:
92 Other differences worth mentioning are antialias and fallback resolution arguments.

Anti-aliasing (smoothing of the font and line edges) can be adjusted in several ways, see
help(svg) for details. Resolution in itself is unimportant when working with vector images
because their rasterization is rendered at an ad hoc basis. The fallback resolution specifies,
however, the preferred resolution when sending vector images to a laser printer as a raster
(this is a standard and, sometimes, much needed option with PDF files).
93 Both are cross-platform and have free versions distributed under GNU/GPL. See https:

//www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ and https://github.com/qpdf/qpdf

94 I hope that you still remember that CLI stands for command-line interface. To run this

command line you should call terminals under iOS X and GNU Linux or the command prompt
under Windows.

89
80
60
Unique titles
40
20
0

1650 1700 1750 1800 1850 1900 1950 2000

Timeline, years

Figure 46: Same as fig. 45 rendered as PDF (see text for the code details and fig. 44
caption for the explanation of shaded areas and vertical lines).

pdftk plot.01.pdf plot.02.pdf cat output presentation.pdf

qpdf --empty --pages plot.01.pdf plot.02.pdf -- presentation.pdf

The code above merges two files plot.01.pdf and plot.02.pdf in the order
they are listed into one file presentation.pdf (the first line shows it for PDFtk
and the second, for qPDF ). There are more sophisticated ways to use both
PDFtk and qPDF but I leave it to you to explore them further.
To conclude this section, I have to confess that more than a half of images
used as figures in this part of the manual are, in fact, PDFs. If you read a PDF
version of the manual with a dedicted PDF-viewer, you can try and use the
mouse selection trick described above to differentiate them from PNGs at your
leisure.

Quality colour graphs: post-processing

This section drives us beyond R. However powerful R is in drawing graphs, there
are more things in heaven and earth, than are dreamt of in its drawing philoso-
phy. So far we dealt exclusively with the colour graphs for screen presentation
and black and white line art for academic press. Sometimes, however, we need
to prepare colour graphs for printing. Academic publishers avoid such graphs
at all costs but sometimes they are inevitable. Besides of pricing (printing a
colour graph may cost you a fortune, because publishers usually request to pay
the printing costs in case colour graphs are required), there is another issue: the
colour model.
While the screen colours are encoded with the so-called RGB model (stands
for Red-Green-Blue), the printing colours are encoded with CMYK (Cyan-
Magenta-Yellow-Black). One may easily see it anytime with a magnifying glass

90
on the screen, and when installing four different cartridges into a colour printer.
Besides the number of basic colours involved, there is another important differ-
ence. With RGB model, the more colour intensity one adds, the brighter is the
resulting colour (until everything ends up with white at maximum values for all
three channels). With CMYK model, the relationship is reversed. The more
colour intensity one adds, the darker is the resulting colour.
Some of R printing devices can encode colours with CMYK model. E. g.
with pdf() and postscript(), there is a special argument colormodel="cmyk"
(defaults to srgb), which can be used to print a .pdf or an .eps file with CMYK
colours directly.
The others, most notably, tiff() lack this option. In the latter case, we
need to resort to the external programs. All major raster image editors can con-
vert from sRGB to CMYK. One may use PhotoShop or GIMP for this purpose.
Besides of that, there is, as always, a powerful command-line tool, ImageMag-
ick.95 One of its most basic packages, convert, could do everything you need
with a single short line of code run from a terminal.
convert srgbfile.tiff -colorspace CMYK cmykfile.tiff

Appendix: Code for this manual

###############################################################################
###############################################################################
###############################################################################
#
# Loading the files
#
###############################################################################
###############################################################################
###############################################################################

# This is the code to re-create the students.df if it is lost;

# The file "kouprianov.students.v.2.1.txt" must be present
# in the working directory;
students.df <- read.table("kouprianov.students.v.2.1.txt", header=TRUE,
sep="\t", stringsAsFactors=TRUE)
# The following two lines would help you checking if everything is OK;
dim(students.df)
head(students.df)

# Loading presidential election dataset

pres.2008 <- read.csv("pres.2008.en.csv", h=TRUE)
# Creating additional variables
pres.2008$TURNOUT <- (pres.2008$BALL.VALID + pres.2008$BALL.INVALID)/pres.2008$VOTERS
pres.2008$MEDVEDEV.sh <- pres.2008$MEDVEDEV/(pres.2008$BALL.VALID + pres.2008$BALL.INVALID)

# Loading philosophers dataset

phil <- read.table("philosophers.txt", h=TRUE, sep=" ", stringsAsFactors=TRUE)

###############################################################################
###############################################################################
# The very first steps of analysis: descriptive statistics
###############################################################################
###############################################################################

# Calculating the sum of all individual heights;

sum(students.df$HEIGHT)
students.df$HEIGHT

help(sum)
sum(students.df$HEIGHT, na.rm=TRUE)

95 There are releases for all major operating systems, see: http://www.imagemagick.org

91
# Summary function
summary(students.df$HEIGHT)

# Structure of the summary’s output

str(summary(students.df$HEIGHT))

# Calling individual elements of the summary’s output

summary(students.df$HEIGHT)[2]
summary(students.df$HEIGHT)["1st Qu."]

# Retrieving some descriptive statistics from a numeric vector

# with more specialised functions;
mean(students.df$HEIGHT, na.rm=TRUE)
median(students.df$HEIGHT, na.rm=TRUE)
min(students.df$HEIGHT, na.rm=TRUE)
max(students.df$HEIGHT, na.rm=TRUE)
range(students.df$HEIGHT, na.rm=TRUE)

###############################################################################
# Traditional (mean-based) estimates of variation
###############################################################################

# Calculating variance manually and with var()

sum((students.df$HEIGHT-mean(students.df$HEIGHT, na.rm=TRUE))^2, na.rm=TRUE)/220
var(students.df$HEIGHT, na.rm=TRUE)

# Calculaing standard deviation and checking if it is square root of variance

sd(students.df$HEIGHT, na.rm=TRUE)
(sd(students.df$HEIGHT, na.rm=TRUE))^2

###############################################################################
# Robust estimates of variation
###############################################################################

IQR(students.df$HEIGHT, na.rm=TRUE)
help(quantile())

###############################################################################
# Summarising a qualitative variable
###############################################################################

summary(students.df$SEX)
summary(as.character(students.df\$SEX))

###############################################################################
# Summarising a data frame
###############################################################################

summary(students.df)

# Loading and previewing data:

anscombe <- read.table("anscombe.txt", head=TRUE, sep="\t")
dim(anscombe)
anscombe

# Applying mean() and sd() to all columns and rounding the results
# to two decimal places:
round(apply(anscombe, 2, mean), 2)
round(apply(anscombe, 2, sd), 2)

# Plotting the four datasets

plot(anscombe$x.1, anscombe$y.1, xlim=c(0,20), ylim=c(0,20))
abline(lm(anscombe$y.1 ~ anscombe$x.1), lty=2, col="red")

plot(anscombe$x.2, anscombe$y.2, xlim=c(0,20), ylim=c(0,20))

abline(lm(anscombe$y.2 ~ anscombe$x.2), lty=2, col="red")

plot(anscombe$x.3, anscombe$y.3, xlim=c(0,20), ylim=c(0,20))

92
abline(lm(anscombe$y.3 ~ anscombe$x.3), lty=2, col="red")

plot(anscombe$x.4, anscombe$y.4, xlim=c(0,20), ylim=c(0,20))

abline(lm(anscombe$y.4 ~ anscombe$x.4), lty=2, col="red")

# Regression summaries (just to take a look at them)

summary(lm(anscombe$y.1 ~ anscombe$x.1))
summary(lm(anscombe$y.2 ~ anscombe$x.2))
summary(lm(anscombe$y.3 ~ anscombe$x.3))
summary(lm(anscombe$y.4 ~ anscombe$x.4))

###############################################################################
###############################################################################
###############################################################################
#
# Grammar of graphics: six most basic graphs
#
###############################################################################
###############################################################################
###############################################################################

###############################################################################
# Histogram
###############################################################################

# Generating a random sequence of 50 weights

# with mean 57000 and standard deviation 7200;
mass.rn <- rnorm(50, 57000, 7200)

# Rounding the mass.rn to the gram;

mass.rnr <- round(mass.rn)

# Drawing a graph resembling fig. 3

hist(mass.rnr, breaks=c((min(mass.rnr, na.rm=T)-.5):(max(mass.rnr, na.rm=T)+.5)))

# Gradually improving histogram for students

hist(students.df$HEIGHT)
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm")
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm", col="grey")
hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm", density=15,
angle=45)

hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm",

col="grey", breaks=20)

hist(students.df$HEIGHT, main="Undergraduate students", xlab="Height, cm", col="grey",

breaks=seq(min(students.df$HEIGHT, na.rm=TRUE), max(students.df$HEIGHT, na.rm=TRUE), 4))

hist(pres.2008$TURNOUT, breaks=seq(-.0005, 1.0005, .001), col="black",

main="", xlab="Voters’ turnout at a polling station, 0.1% bin")

hist(pres.2008$TURNOUT, breaks=seq(-.0005, 1.0005, .001), ylim=c(0,400), col="black",

main="", xlab="Voters’ turnout at a polling station, 0.1% bin")

###############################################################################
# Digression: saving the plots as files
###############################################################################

png("hist.students.df.height.png") # 1. Opening a graphic device;

hist(students.df$HEIGHT) # 2. Plotting a graph;
dev.off() # 3. Closing the device;

png("hist.students.df.height.png", height=500, width=500)

hist(students.df$HEIGHT,
main="Undergraduate students", xlab="Height, cm",
breaks=20,
density=15, angle=45)

93
dev.off()

###############################################################################
# Simple bar plots
###############################################################################

# Generic function plot() draws barplot from the raw data

plot(students.df$SEX)

# table() applied to qualitative variables

table(students.df$SEX)
table(students.df$SMOKING)

# table() used as an argument for a specialised barplot() function

barplot(table(students.df$SEX))
barplot(table(students.df$SMOKING))

# Gradually improving philosophers’ barplot

barplot(phil$Freq)
barplot(phil$Freq, names.arg=phil$NAMES, ylab="Mentions")
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, xlab="Mentions")
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")

# Identifying string width in inches

max(strwidth(phil$NAMES, units="inches"))

# The printing device parameters: par() function

par()$mar
par()$mai

# Using strwidth() to expand the left margin

par(mai = c(0.82, 0.42 + max(strwidth(phil$NAMES, units="inches")), 0.42, 0.42))
barplot(phil$Freq, names.arg=phil$NAMES, horiz=TRUE, las=1, xlab="Mentions")

# Sorting philosophers by Freq[uency]

phil.s <- phil[order(-phil$Freq),]

# Plotting the sorted philosophers

par(mai = c(0.82, 0.42 + max(strwidth(phil.s$NAMES, units="inches")), 0.42, 0.42))
barplot(phil.s$Freq, names.arg=phil.s$NAMES, horiz=TRUE, las=1, xlab="Mentions")

###############################################################################
###############################################################################
# Visualising interdependence: bivariate plots
###############################################################################
###############################################################################

###############################################################################
# Scatter plot
###############################################################################

# Gradually omproving scatter plot

plot(students.df$HEIGHT, students.df$MASS)

plot(students.df$HEIGHT, students.df$MASS, pch=20, xlab="Stature, cm",

ylab="Body mass, kg")

plot(students.df$HEIGHT, students.df$MASS, pch=20, col=rgb(0,0,0,.3),

xlab="Stature, cm", ylab="Body mass, kg")

plot(students.df$HEIGHT, students.df$MASS, pch=20,

col=as.numeric(students.df$SEX),
xlab="Stature, cm", ylab="Body mass, kg")

# Subsetting students by gender

students.f.df <- subset(students.df, students.df$SEX=="f")
students.m.df <- subset(students.df, students.df$SEX=="m")

# Superimposing two groups of students on the same graph

plot(students.df$HEIGHT, students.df$MASS, type="n", xlab="Stature, cm",

ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=20, col=rgb(1,0,0,.4))
points(students.m.df$HEIGHT, students.m.df$MASS, pch=20, col=rgb(0,0,1,.4))

94
plot(students.df$HEIGHT, students.df$MASS, type="n", xlab="Stature, cm",
ylab="Body mass, kg")
points(students.f.df$HEIGHT, students.f.df$MASS, pch=3)
points(students.m.df$HEIGHT, students.m.df$MASS, pch=4)

###############################################################################
# A special case: Time series
###############################################################################

###############################################################################
# Nine values for "type" argument of the plot() function

# Setting the objects;

plot.types = c("n", "p","l","o","b","c","s","S","h")
x.coord <- c(1:7)
y.coord <- c(1, 3, 7, 4, 2, 3, 5)

# Setting parameters for the plotting device;

par(mfrow=c(3, 3), pch=20)

# Using for(){} loop to print all plots in a single chart;

for (i in 1:9){
plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),
type=plot.types[i],
main=paste("Plot type =", plot.types[i]))
}

# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)

###############################################################################
# Testing "type" argument with points() and lines()

# Setting parameters for the plotting device;

par(mfrow=c(2, 2), pch=20)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Points default’)
points(x.coord, y.coord)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Points type="l"’)
points(x.coord, y.coord, type="l")

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Lines default’)
lines(x.coord, y.coord)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8),

type="n", main=’Lines type="p"’)
lines(x.coord, y.coord, type="p")

# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)

###############################################################################
# On the importance of ordering the entries chronologically

# Putting together example objects;

x.coord.1 <- c(1, 3, 5, 6, 2, 7, 4)
y.coord.1 <- c(1, 7, 2, 3, 3, 5, 4)

# Setting parameters for the plotting device;

par(mfrow=c(2, 2), pch=20)

plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8))

plot(x.coord.1, y.coord.1, xlim=c(0, 8), ylim=c(0, 8))
plot(x.coord, y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o")
plot(x.coord.1, y.coord.1, xlim=c(0, 8), ylim=c(0, 8), type="o")

95
# Please, do not forget to close the plotting device after this exercise with
# dev.off()
# or reset it to the default
# par(mfrow=c(1, 1), pch=1)

###############################################################################
# X coordinate vs. index

# Putting together an example object;

x.coord.2 <- c(1, 3:8)

plot(y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o")

plot(x.coord.2, y.coord, xlim=c(0, 8), ylim=c(0, 8), type="o")

###############################################################################
# A real-life time series example

# Loading the data for Imperial Moscow University students;

moscow <- read.table("Moscow.students.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)

# Plotting the time series for the faculty of Law;

plot(as.Date(moscow$DATE), moscow$L, type="l", lty=1,
ylim=c(0, max(moscow[,(2:4)])),
main="", xlab="Timeline", ylab="Number of students")

# Adding time series for the other three faculties;

points(as.Date(moscow$DATE), moscow$HP, type="l", lty=2)
points(as.Date(moscow$DATE), moscow$PM, type="l", lty=3)
points(as.Date(moscow$DATE), moscow$M, type="l", lty=4)

###############################################################################
# Multiple boxplot
###############################################################################

# Creating boxplots with plot()

plot(students.df$SEX, students.df$HEIGHT)

# Creating boxplots with plot(), mind the as.factor() data transformation with
# numerical codes used as entity names (e.g. for a group of sudents, its
# number is a name);

plot(as.factor(students.df$GROUP), students.df$HEIGHT)

# Creating boxplots with boxplot()

# Attempt #1
boxplot(students.df$SEX, students.df$HEIGHT)

# Attempt #2
boxplot(students.df$HEIGHT ~ students.df$SEX)

# Combining different objects in the same boxplot

boxplot(students.f.df$HEIGHT, students.m.df$HEIGHT, axes=FALSE)
axis(1, at=c(1,2), labels=c("f","m"), col="white")
axis(2)

###############################################################################
# Scatter plot with jitter
###############################################################################

# Raw plot HEIGHT ~ SEX

plot(students.df$HEIGHT, students.df$SEX)

# Using jitter()

# Attempt #1
plot(students.df$HEIGHT, jitter(students.df$SEX))

# Attempt #2
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX)))

# Controlling the axes and text labels

96
plot(students.df$HEIGHT, jitter(as.numeric(students.df$SEX), factor=.5),
xlab="Height, cm", ylab="Gender", pch=20, col=rgb(0,0,0,.3), axes=FALSE)
axis(1)
axis(2, at=c(1:2), labels=c("f","m"), las=1, col="white")

###############################################################################
# Structured barplots
###############################################################################

# Exploring relationship between gender and smoking with plot()

plot(students.df$SEX, as.factor(students.df$SMOKING))
table(students.df$SEX, students.df$SMOKING)
table(students.df$SEX)

table(students.df$SMOKING, students.df$SEX)

# Exploring relationship between gender and department with barplot()

table(students.df$SEX, students.df$DEPARTMENT)
barplot(table(students.df$SEX, students.df$DEPARTMENT))
barplot(table(students.df$SEX, students.df$DEPARTMENT), beside=TRUE)

# Experimenting with a data frame as a source for barplot()

# Create the data frame

test.df <- data.frame(c(10,12,8), c(16,6,4), c(5,7,9))
colnames(test.df) <- c("A","B","C")

# Preview the data

test.df

# Experimenting
barplot(test.df)
barplot(as.matrix(test.df))
barplot(as.matrix(test.df), beside=TRUE)

# Creating an empty plot;

plot(-10:10, -10:10, type="n")
# Drawing a number of lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=8-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;

# Re-creating an empty plot;

plot(-10:10, -10:10, type="n")
# Adding the grid;
abline(h=seq(-10,10,1), v=c(-10:10), lty=3, col=8)
# Adding the axes;
abline(h=0, v=0)
# Adding back the lines;
abline(0, 1, lty=1) # y=0+1*x;
abline(6, -.5, lty=2) # y=8-0.5*x;
abline(-7, 2, lty=3) # y=-7+2*x;
abline(4, sqrt(2), lty=4) # y=4+sqrt(2)*x;

# Plotting the data points;

plot(students.df$HEIGHT, students.df$MASS,
pch=20, col=rgb(0,0,0,.3),
xlab="Stature, cm", ylab="Body mass, kg")
# Adding regression line;
abline(lm(students.df$MASS ~ students.df$HEIGHT), col=red, lwd=3)
# Previewing coefficients
lm(students.df$MASS ~ students.df$HEIGHT)$coefficients
# Adding regression line in a different way;
abline(-85.2, .852, col="blue", lty=2, lwd=3)

# Trying the y=x^2 curve for different values for ’n’;

curve(x^2, from=-2, to=2, n=1)

97
curve(x^2, from=-2, to=2, n=2)
curve(x^2, from=-2, to=2, n=3)
curve(x^2, from=-2, to=2, n=10)

# Trying the curve: 2;

curve(x^3, from=-2, to=2, xlim=c(-3, 3), ylim=c(-3, 3))
curve(x^2, add=TRUE, lty=2)
curve(1*x, add=TRUE, lty=3)
abline(h=0, v=0)

# Trying the curve: 3;

curve(sin(x), from=-2*pi, to=2*pi, ylim=c(-2, 2)*pi)
curve(exp(x), lty=2, add=TRUE)
curve(log(x), lty=3, add=TRUE, n=303)
abline(h=0, v=0)
==

png("pres2008_2dh01.png", height=3000, width=3000)

par(cex=6, lwd=6, mar=c(4,4,2,0), mgp=c(2.5,1,0))

plot(pres.2008$TURNOUT, pres.2008$MEDVEDEV.sh, pch=".",

xlim=c(0,1.01), ylim=c(0,1.01),
col=rgb(0,0,0,.3),
axes=FALSE,
xlab="Voters’ turnout", ylab="Medvedev’s share",
main="Presidential elections in Russia, 2008")
axis(1, lwd=6)
axis(2, lwd=6)

dev.off()

###############################################################################
###############################################################################
# Visualising everything: graphical primitives
###############################################################################
###############################################################################

# A list of functions (not to be executed)

abline()
points()
lines()
segments()
arrows()
polygon()
rect()
text()

# Drawing line segments

plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # creating an empty plot

segments(x0=c(1, 2, 4, 2), y0=c(1, 2, 1, 6), x1=c(2, 3, 6, 8), y1=c(5, 4, 8, 7))

# Drawing arrows

plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0,10)) # re-creating an empty plot

# Drawing rectangles

plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # re-creating an empty plot

rect(xleft=c(1, 2, 6), ybottom=c(1, 3, 0), xright=c(3, 8, 10), ytop=c(4, 8, 1))

# Drawing polygons

plot(1, 1, type="n", xlim=c(0, 10), ylim=c(0, 10)) # re-creating an empty plot

p.1.x <- c(2, 1, 1, 3, 5, 4)

98
p.1.y <- c(0, 1, 3, 5, 2, 1)
polygon(p.1.x, p.1.y) # Adding polygon 1

p.2.x <- c(1, 2, 3, 4, NA, 7, 5, 9, 8)

p.2.y <- c(6, 8, 9, 5, NA, 5, 5, 8, 9)
polygon(p.2.x, p.2.y, density=c(12, 24)) # Adding two polygons

# Two ways of dealing with intersections

# Default way (fillOddEven=FALSE)

p.3.x <- c(6, 7, 8, 5.5, 8.5)

p.3.y <- c(0, 3.2, 0, 2, 2)
polygon(p.3.x, p.3.y, density=12, angle=-45)

# fillOddEven=TRUE

p.4.x <- c(7.5, 8.5, 9.5, 7, 10)

p.4.y <- c(2.5, 5.7, 2.5, 4.5, 4.5)
polygon(p.4.x, p.4.y, density=12, angle=-45, fillOddEven=TRUE)

###############################################################################
# Examples of real-life graphs using polygon() function
###############################################################################

###############################################################################
# Dorpat faculty dynamics

# Loading dataset;
dpt.dyn <- read.table("dpt.dyn.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)

# An empty plot
plot(dpt.dyn$YEAR, dpt.dyn$TOT, type="n", ylim=c(0,85),
xlab="Timeline", ylab="Number of persons")

# Adding polygons;
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$TOT, 0), col="#0072CE")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, (dpt.dyn$U1.DPT.TOT + dpt.dyn$U1.EUR.TOT), 0), col="black")
polygon(x=c(dpt.dyn$YEAR[1], dpt.dyn$YEAR, dpt.dyn$YEAR[nrow(dpt.dyn)]),
y=c(0, dpt.dyn$U1.EUR.TOT, 0), col="white")

###############################################################################
# Presidential elections Monte-Carlo simulation results

# Loading datasets;
ru.2018 <- read.table("ru.2018.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)
ru.hist.MC <- read.table("ru.hist.MC.txt", h=TRUE, sep=" ", stringsAsFactors=TRUE)

# Plotting the histogram;

# Adding the polygon for the MC simulation mean +/- 3 standard deviations;
polygon(
c(ru.hist.MC$PCT, ru.hist.MC$PCT[1001:1]),
c(ru.hist.MC$MEAN + 3 * ru.hist.MC$SD,
(ru.hist.MC$MEAN - 3 * ru.hist.MC$SD)[1001:1]),
col=rgb(0, 0, 0, .3), border=rgb(1, 0, 0, .3), lwd=.5)

# Adding the mean of the MC simulation;

lines(ru.hist.MC$PCT, ru.hist.MC$MEAN, col=2)

###############################################################################
###############################################################################
# Getting inside: numeric output of the plotting functions
###############################################################################
###############################################################################

# This line creates both a histogram plot and an object:

99
students.mass.hist <- hist(students.df$MASS)

# Preview object structure:

str(students.mass.hist)

# Visualise the object with plot()

plot(students.mass.hist,
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)

# Adjusting ylim
plot(students.mass.hist, ylim=c(0,max(students.mass.hist$counts)+5),
main="Undergraduate students", xlab="Body mass, kg",
col="grey")
text(x=students.mass.hist$mids, y=students.mass.hist$counts,
labels=students.mass.hist$counts, pos=3)

# This line creates both a boxplot and an object:

students.mass.box <- boxplot(students.df$MASS)

# Preview object structure:

str(students.mass.box)

# Another, slightly more complex, object:

students.mass.2.box <- boxplot(students.df$MASS ~ students.df$SEX)

# Preview object structure:

str(students.mass.2.box)

# Comparing stats elements:

students.mass.box$stats
students.mass.2.box$stats

# Trying boxplot.stats():
boxplot.stats(students.df$MASS)

###############################################################################
###############################################################################
##
## Bonus: Monte-Carlo simulation for St. Petersburg voters’ turnout histogram
##
###############################################################################
###############################################################################

###############################################################################
#
# The following script is based on the methods described in
# Dmitry Kobak, Sergey Shpilkin, and Maxim S. Pshenichnikov "Integer
# percentages as electoral falsification fingerprints". The Annals of Applied
# Statistics. Volume 10, Number 1 (2016), 54-73.
# [https://arxiv.org/abs/1410.6059]
#
# I am indebted to Sergey Shpilkin for his kind explanations and critical
# remarks on the early versions of the script.
# I am grateful to Boris Ovchinnikov for a stimulating discussion.
#
# Alexei Kouprianov, [email protected]
#
###############################################################################

###############################################################################
# Loading data
###############################################################################

spb.2018 <- read.table("spb.q.20180319.txt", h=TRUE, sep="\t", stringsAsFactors=TRUE)

spb.2018$VOTED <- spb.2018$BALL.INVALID + spb.2018$BALL.VALID
spb.2018$TURNOUT <- spb.2018$VOTED/spb.2018$VOTERS

###############################################################################
# Declaring objects for the loop
###############################################################################

100
# MK.repeats <- 100 # For preliminary testing;
# MK.repeats <- 1000 # For preliminary testing;
# DO NOT try the following line on larger datasets. 10,000 iterations on
# the dataset for the whole Russia (97+K records) and for 0.1% bins
# may take days to complete.
MK.repeats <- 10000 # Working repeats number;

spb.ksp.rbinom.ls <- NULL # ls of simulated polling stations;

spb.ksp.rbinom.ls <- as.list(spb.ksp.rbinom.ls)

spb.ksp.rbinom.TURNOUT <- NULL # complete vector of simulated turnouts;

# df for simulated histograms’ counts:

spb.ksp.rbinom.TURNOUT.hist.counts.df <- NULL
spb.ksp.rbinom.TURNOUT.hist.counts.df <-
as.data.frame(spb.ksp.rbinom.TURNOUT.hist.counts.df)

j <- NULL
i <- NULL

###############################################################################
# Main simulation loop begins
###############################################################################

for (k in 1:MK.repeats) {

spb.ksp.rbinom.ls <- NULL # ls of simulated polling stations;

spb.ksp.rbinom.ls <- as.list(spb.ksp.rbinom.ls)

spb.ksp.rbinom.TURNOUT <- NULL # complete vector of simulated turnouts;

# Simulating polling stations’ turnouts

j <- 1
while(j <= length(spb.2018$VOTERS)){
spb.ksp.rbinom.ls[[j]] <- rbinom(n=1, size=spb.2018$VOTERS[j],
prob=spb.2018$TURNOUT[j])
j <- j + 1
}

# Gathering simulated turnouts in a vector

i <- 1
while(i <= length(spb.2018$VOTERS)){
spb.ksp.rbinom.TURNOUT <- c(spb.ksp.rbinom.TURNOUT,
spb.ksp.rbinom.ls[[i]]/spb.2018$VOTERS[i])
i <- i + 1
}

# Calculating hist for simulated turnouts, bin 1% (hist plots appear

# in the graphics console)
spb.ksp.rbinom.TURNOUT.hist <- hist(spb.ksp.rbinom.TURNOUT,
breaks=seq(-.005, 1.005, .01))

# Extracting simulated hist counts

spb.ksp.rbinom.TURNOUT.hist.counts.df <- rbind.data.frame(
spb.ksp.rbinom.TURNOUT.hist.counts.df,
spb.ksp.rbinom.TURNOUT.hist$counts
)
}

###############################################################################
# Main simulation loop ends
###############################################################################

# Extracting summary stats for counts from all MK.repeats simulated hists into
# a data frame

spb.ksp.rbinom.TURNOUT.hist.counts.MEAN <- NULL

spb.ksp.rbinom.TURNOUT.hist.counts.SD <- NULL
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN <- NULL
spb.ksp.rbinom.TURNOUT.hist.counts.Q1 <- NULL
spb.ksp.rbinom.TURNOUT.hist.counts.Q3 <- NULL
spb.ksp.rbinom.TURNOUT.hist.counts.IQR <- NULL

for (i in 1:101){

101
spb.ksp.rbinom.TURNOUT.hist.counts.MEAN <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.MEAN,
mean(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.SD <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.SD,
sd(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN,
median(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])
)
spb.ksp.rbinom.TURNOUT.hist.counts.Q1 <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.Q1,
summary(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])[2]
)
spb.ksp.rbinom.TURNOUT.hist.counts.Q3 <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.Q3,
summary(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i])[5]
)
spb.ksp.rbinom.TURNOUT.hist.counts.IQR <- c(
spb.ksp.rbinom.TURNOUT.hist.counts.IQR,
IQR(spb.ksp.rbinom.TURNOUT.hist.counts.df[,i]))
}

spb.hist.ksp.simulated.stats.1pct <- data.frame(

seq(0, 1, .01),
spb.ksp.rbinom.TURNOUT.hist.counts.MEAN,
spb.ksp.rbinom.TURNOUT.hist.counts.SD,
spb.ksp.rbinom.TURNOUT.hist.counts.MEDIAN,
spb.ksp.rbinom.TURNOUT.hist.counts.Q1,
spb.ksp.rbinom.TURNOUT.hist.counts.Q3,
spb.ksp.rbinom.TURNOUT.hist.counts.IQR
)

colnames(spb.hist.ksp.simulated.stats.1pct) <-
c("PCT","MEAN","SD","MEDIAN","Q1","Q3","IQR")

# Control plots

png("hist.TURNOUT.spb.MEDIAN-IQR.simul.png", height=750, width=750)

par(cex=1.5, lwd=1.5)
hist(spb.2018$TURNOUT, breaks=seq(-.005, 1.005, .01),
ylim=c(0,190), col=rgb(0,0,1,.3), border=rgb(0,0,1,.3),
main="Presidential elections in Russia, 2018-03-18\nSt. Petersburg",
xlab="Voters’ turnout, 1% bin",
ylab="Frequency")
polygon(c(spb.hist.ksp.simulated.stats.1pct$PCT,
spb.hist.ksp.simulated.stats.1pct$PCT[101:1]),
c(spb.hist.ksp.simulated.stats.1pct$MEDIAN +
1.5*spb.hist.ksp.simulated.stats.1pct$IQR,
(spb.hist.ksp.simulated.stats.1pct$MEDIAN -
1.5*spb.hist.ksp.simulated.stats.1pct$IQR)[101:1]),
col=rgb(0,0,0,.3), border=rgb(1,0,0,.3))
points(spb.hist.ksp.simulated.stats.1pct$PCT,
spb.hist.ksp.simulated.stats.1pct$MEDIAN, type="l", col=2, lwd=3)
legend("topleft", lwd=c(3,1), col=c(2,2),
legend=c("MC-simulated median", "MC-simulated median +/- 1.5 IQR"), bty="n")
axis(1, at=seq(0,1,.1), labels=FALSE, lwd=1.5)
axis(2, at=seq(0,190,10), tcl=-.25, labels=FALSE, lwd=1.5)
dev.off()

png("hist.TURNOUT.spb.MEAN-SD.simul.png", height=750, width=750)

102
3*spb.hist.ksp.simulated.stats.1pct$SD)[101:1]),
col=rgb(0,0,0,.3), border=rgb(1,0,0,.3))
points(spb.hist.ksp.simulated.stats.1pct$PCT,
spb.hist.ksp.simulated.stats.1pct$MEAN, type="l", col=2, lwd=3)
legend("topleft", lwd=c(3,1), col=c(2,2),
legend=c("MC-simulated mean", "MC-simulated mean +/- 3 SD"), bty="n")
axis(1, at=seq(0,1,.1), labels=FALSE, lwd=1.5)
axis(2, at=seq(0,190,10), tcl=-.25, labels=FALSE, lwd=1.5)
dev.off()

### End of script ###

103

Unit 2
No ratings yet
Unit 2
32 pages
Verzani Answers
100% (8)
Verzani Answers
94 pages
Visual Statistics Use R!
50% (2)
Visual Statistics Use R!
388 pages
Advance R Prog.-1
No ratings yet
Advance R Prog.-1
24 pages
Visual Statistics Use R PDF
No ratings yet
Visual Statistics Use R PDF
388 pages
David Gerbing - R Visualizations Derive Meaning From Data (2020) - 1 - CRC Press (9780429894923)
100% (1)
David Gerbing - R Visualizations Derive Meaning From Data (2020) - 1 - CRC Press (9780429894923)
252 pages
Lab 1 Steps To Make A Straight Through Ethernet Cable
No ratings yet
Lab 1 Steps To Make A Straight Through Ethernet Cable
4 pages
R For Data Exploration
No ratings yet
R For Data Exploration
52 pages
Admin Dumps Final
100% (1)
Admin Dumps Final
9 pages
ATM Simulator: Created By: Abhijeet Karmaker (C0720286) Naresh Gunimanikula (C0719672) PRIYANKA MODI (C0717925)
No ratings yet
ATM Simulator: Created By: Abhijeet Karmaker (C0720286) Naresh Gunimanikula (C0719672) PRIYANKA MODI (C0717925)
15 pages
Advanced Statistics
No ratings yet
Advanced Statistics
259 pages
Slide Present Ee Full Heart Beat
100% (1)
Slide Present Ee Full Heart Beat
21 pages
Visual Statistics Use R
No ratings yet
Visual Statistics Use R
451 pages
96boards Som Carrier Board Schematics
No ratings yet
96boards Som Carrier Board Schematics
28 pages
Shipunov Visual Statistics
No ratings yet
Shipunov Visual Statistics
429 pages
All v2 Basic Statistics Using R
No ratings yet
All v2 Basic Statistics Using R
241 pages
Boulder Handout 2019
No ratings yet
Boulder Handout 2019
187 pages
Business Analytics Unit 4
No ratings yet
Business Analytics Unit 4
24 pages
Vlsi Exp 1
No ratings yet
Vlsi Exp 1
6 pages
Introduction To R PDF
No ratings yet
Introduction To R PDF
56 pages
Graphics Chapter
No ratings yet
Graphics Chapter
49 pages
Aragón, Tomás J. - Applied Epidemiology Using R-Springer (2010)
No ratings yet
Aragón, Tomás J. - Applied Epidemiology Using R-Springer (2010)
190 pages
STAT319 Lab Manual Based On R - Final Version
No ratings yet
STAT319 Lab Manual Based On R - Final Version
127 pages
Bhuvaneswar Reddy Pidatala
No ratings yet
Bhuvaneswar Reddy Pidatala
1 page
Lecture 10 R
No ratings yet
Lecture 10 R
117 pages
MultivariateRGGobi PDF
No ratings yet
MultivariateRGGobi PDF
60 pages
Epcom
100% (1)
Epcom
2 pages
Introduction To R: Nihan Acar-Denizli, Pau Fonseca
No ratings yet
Introduction To R: Nihan Acar-Denizli, Pau Fonseca
50 pages
Unit - 2: Data Manipulation With R & Data Visualization in Watson Studio
No ratings yet
Unit - 2: Data Manipulation With R & Data Visualization in Watson Studio
58 pages
Introduction To R
No ratings yet
Introduction To R
103 pages
Introduction To Matlab Lecture Advanced Data Analysis Jan2012
No ratings yet
Introduction To Matlab Lecture Advanced Data Analysis Jan2012
50 pages
R Workshop Material 18-19, Oct-2023
No ratings yet
R Workshop Material 18-19, Oct-2023
67 pages
Unit V Statistics R
No ratings yet
Unit V Statistics R
60 pages
Chapter - 03 - Review of Basic Data
No ratings yet
Chapter - 03 - Review of Basic Data
92 pages
Teaching Notes of R
No ratings yet
Teaching Notes of R
78 pages
Graph Plotting in R Programming
No ratings yet
Graph Plotting in R Programming
12 pages
21341A0596 Literature Review
No ratings yet
21341A0596 Literature Review
57 pages
Group Technology
No ratings yet
Group Technology
30 pages
Env SPV DR B 001 QC Manual Rev.A
No ratings yet
Env SPV DR B 001 QC Manual Rev.A
92 pages
03 UnderstandData
No ratings yet
03 UnderstandData
29 pages
Nozomi Networks WP Drone Telemetry
No ratings yet
Nozomi Networks WP Drone Telemetry
73 pages
First Course On R
No ratings yet
First Course On R
26 pages
Exploratory Data Analysis - NOTES
No ratings yet
Exploratory Data Analysis - NOTES
31 pages
Exploratory Data Analysis Course Notes
No ratings yet
Exploratory Data Analysis Course Notes
55 pages
P6ADBMS
No ratings yet
P6ADBMS
34 pages
Apunts BLOC 1 Estadística
No ratings yet
Apunts BLOC 1 Estadística
15 pages
Importing The Files
No ratings yet
Importing The Files
14 pages
Unit3 R
No ratings yet
Unit3 R
19 pages
R Visualization ADA
No ratings yet
R Visualization ADA
47 pages
Stats Mid Term
No ratings yet
Stats Mid Term
22 pages
Unit3 R
No ratings yet
Unit3 R
30 pages
Data - Analysis - With - R - 24
No ratings yet
Data - Analysis - With - R - 24
47 pages
R Software Project
No ratings yet
R Software Project
42 pages
Math10282 Ex03 - An R Session
No ratings yet
Math10282 Ex03 - An R Session
10 pages
Avanti Kumari - A Report
No ratings yet
Avanti Kumari - A Report
39 pages
Unit 3
No ratings yet
Unit 3
11 pages
Stats Lab1
No ratings yet
Stats Lab1
11 pages
Muthayammal College of Arts and Science Rasipuram: Assignment No - 1
No ratings yet
Muthayammal College of Arts and Science Rasipuram: Assignment No - 1
10 pages
Handout 3
No ratings yet
Handout 3
24 pages
Amazon's Dynamo - All Things Distributed
No ratings yet
Amazon's Dynamo - All Things Distributed
21 pages
BS51009 Workshop 1
No ratings yet
BS51009 Workshop 1
15 pages
Activator Office 2016.Cmd
No ratings yet
Activator Office 2016.Cmd
1 page
ECON 1100 R04 - R.Commands PDF
No ratings yet
ECON 1100 R04 - R.Commands PDF
15 pages
R Commands
No ratings yet
R Commands
18 pages
HumanEval Pro and MBPPPro Evaluating Large Language Models
No ratings yet
HumanEval Pro and MBPPPro Evaluating Large Language Models
27 pages
User Manual K64
No ratings yet
User Manual K64
20 pages
Module 5-6
No ratings yet
Module 5-6
12 pages
R Data Types 8
No ratings yet
R Data Types 8
7 pages
Forecasting Solved Examples
No ratings yet
Forecasting Solved Examples
10 pages
Introduction To R For Business Analytics
No ratings yet
Introduction To R For Business Analytics
7 pages
Broncolor Mobil Manual
No ratings yet
Broncolor Mobil Manual
15 pages
DX Diag
No ratings yet
DX Diag
31 pages
Lab 1 - Basic Functions in R and Plotting
No ratings yet
Lab 1 - Basic Functions in R and Plotting
8 pages
IITM Thesis Format
No ratings yet
IITM Thesis Format
12 pages
R Imp Funtions
No ratings yet
R Imp Funtions
10 pages
STA1007S Lab 4: Scatterplots and Basic Programming: "Hist"
No ratings yet
STA1007S Lab 4: Scatterplots and Basic Programming: "Hist"
9 pages
Pfe Book
No ratings yet
Pfe Book
9 pages
Brief Introduction To R Kaustav Banerjee: Decision Sciences Area, IIM Lucknow
No ratings yet
Brief Introduction To R Kaustav Banerjee: Decision Sciences Area, IIM Lucknow
7 pages
Student Dropout Prediction
No ratings yet
Student Dropout Prediction
11 pages
Nicolet In10 MX-PS51511
No ratings yet
Nicolet In10 MX-PS51511
4 pages
123 Appendix4
No ratings yet
123 Appendix4
6 pages
Uniq Coching Class
No ratings yet
Uniq Coching Class
3 pages
Root Insurance: Car Insurance Based On How People Drive, Not Who They Are
No ratings yet
Root Insurance: Car Insurance Based On How People Drive, Not Who They Are
4 pages
Audiolab Mdac HFC
No ratings yet
Audiolab Mdac HFC
3 pages
Commucation Ws
No ratings yet
Commucation Ws
3 pages
An R Tutorial Starting Out
No ratings yet
An R Tutorial Starting Out
9 pages
STT - JD Digital Factory Responsible (BE FE) (2) 1
No ratings yet
STT - JD Digital Factory Responsible (BE FE) (2) 1
1 page
Advanced college algebra study guide
From Everand
Advanced college algebra study guide
Harrison Cook
No ratings yet
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
From Everand
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
Harrison K Cook
No ratings yet
Line Drawing Algorithm: Mastering Techniques for Precision Image Rendering
From Everand
Line Drawing Algorithm: Mastering Techniques for Precision Image Rendering
Fouad Sabry
No ratings yet