Chapter1 DataVisualization
Chapter 1
Contents
Glossaries
Acronyms
Chapter 1 1.1 WHAT IS STATISTICS - A PRIMER 1
This is the first chapter in the eight-chapter DTU Introduction to Statistics book.
It consists of eight chapters:
4. Statistics by simulation
In this first chapter the idea of statistics is introduced together with some of the
basic summary statistics and data visualization methods. The software used
throughout the book for working with statistics, probability and data analysis is
the open source environment R. An introduction to R is included in this chapter.
To catch your attention we will start out trying to give an impression of the
importance of statistics in modern science and engineering.
They came up with a list of 11 points summarizing the most important devel-
opments for the health of mankind in a millennium:
The reason for showing the list here is pretty obvious: one of the points is Ap-
plication of Statistics to Medicine! Considering the other points on the list, and
what the state of medical knowledge was around 1000 years ago, it is obviously
a very impressive list of developments. The reasons for statistics to be on this
list are several and we mention two very important historical landmarks here.
Quoting the paper:
"One of the earliest clinical trials took place in 1747, when James Lind treated 12
scorbutic ship passengers with cider, an elixir of vitriol, vinegar, sea water, oranges
and lemons, or an electuary recommended by the ship’s surgeon. The success of the
citrus-containing treatment eventually led the British Admiralty to mandate the provi-
sion of lime juice to all sailors, thereby eliminating scurvy from the navy." (See also
James_Lind).
Still today, clinical trials, including the statistical analysis of the outcomes, are
taking place in massive numbers. The medical industry needs to do this in
order to find out whether its newly developed drugs work and to provide the
documentation needed to have them accepted on world markets. The medical industry
is probably the sector recruiting the highest number of statisticians among all
sectors. Another quote from the paper:
"The origin of modern epidemiology is often traced to 1854, when John Snow demon-
strated the transmission of cholera from contaminated water by analyzing disease rates
among citizens served by the Broad Street Pump in London’s Golden Square. He ar-
rested the further spread of the disease by removing the pump handle from the polluted
well." (See also John_Snow_(physician)).
Today more data than ever are being collected, and the amounts
are still increasing exponentially. One example is internet data, which
companies like Google, Facebook, IBM and others use extensively. A
quote from New York Times, 5. August 2009, from the article titled “For To-
Chapter 1 1.2 STATISTICS AT DTU COMPUTE 3
“I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal
Varian, chief economist at Google. “And I’m not kidding.”
“The key is to let computers do what they are good at, which is trawling these massive
data sets for something that is mathematically odd,” said Daniel Gruhl, an I.B.M. re-
searcher whose recent work includes mining medical data to improve treatment. “And
that makes it easier for humans to do what they are good at - explain those anomalies.”
Each of these sections has its own focus area within statistics, modelling
and data analysis. At the master level, an important option within DTU
Compute studies is to specialize in statistics of some kind on the joint master
programme in Mathematical Modelling and Computation (MMC). Statistician
is a well-known profession in industry, research and public sector institutions.
The high relevance of statistics and data analysis today is also illustrated
by the extensive list of ongoing research projects involving many and diverse
industrial partners within these four sections. Neither society nor industry
can cope with all the available data without people who are highly specialized
in statistical techniques, nor can they remain internationally competitive
without continuously developing these methodologies further in research
projects. Statistics is and will continue to be a relevant, viable and dynamic
field. The number of experts in the field continues to be small compared
to the demand, hence obtaining skills in statistics is a wise
career choice for an engineer. Even for an engineer not specialising in statistics,
a basic level of statistical understanding and data handling ability is crucial for
navigating modern society and business, which will be heavily
influenced by data of many kinds in the future.
Chapter 1 1.3 STATISTICS - WHY, WHAT, HOW? 4
Often in society and media, the word statistics is used simply as the name for
a summary of some numbers, also called data, by means of a summary table
and/or plot. We also embrace this basic notion of statistics, but will call such
basic data summaries descriptive statistics or explorative statistics. Here the meaning
of statistics goes beyond this and will rather be “how to learn from data in an
insightful way and how to use data for clever decision making”. In short, we call this
inferential statistics. This could be on the national/societal level, and could be
related to any kind of topic, such as e.g. health, economy or environment, where
data is collected and used for learning and decision making. For example:
• Cancer registries
• Health registries in general
• Nutritional databases
• Climate data
• Macro economic data (unemployment rates, GNP, etc.)
• etc.
The latter is the type of data that historically gave the word statistics its name. It
originates from the Latin ‘statisticum collegium’ (state advisor) and the Italian
word ‘statista’ (statesman/politician). The word was brought to Denmark by
Gottfried Achenwall from Germany in 1749 and originally described the
processing of data for the state, see also History_of_statistics.
In general, it can be said that we learn from data by analysing the data
with statistical methods. Therefore statistics will in practice involve mathematical
modelling, i.e. using some linear or non-linear function to model the particular
phenomenon. Similarly, the use of probability theory as the concept to describe
randomness is extremely important and at the heart of being able to “be clever”
in our use of the data. Randomness expresses that the data could just as well have
come out differently due to the inherent random nature of the data collection
and the phenomenon we are investigating.
This is all a bit abstract at this point. And likely adding to the potential confusion
about this is the fact that the words population and sample will have a “less
precise” meaning when used in everyday language. When they are used in a
statistical context the meaning is very specific, as given by the definition above.
[Figure: a sample {x1, x2, ..., xn} is randomly selected from the population;
statistical inference goes from the sample mean x̄ to the mean µ.]
Figure 1.1: Illustration of statistical population and sample, and statistical
inference. Note that the bar on each person indicates that it is the height (the
observational variable) and not the person (the observational unit), which are
the elements in the statistical population and the sample. Notice that in all
analysis methods presented in this text the statistical population is assumed to
be very large (or infinite) compared to the sample size.
Let us consider a simple example:
Example 1.2
The following study is carried out (actual data collection): the height of 20 persons
in Denmark is measured. This will give us 20 values x1 , . . . , x20 in cm. The sample
is then simply these 20 values. The statistical population is the height values of all
people in Denmark. The observational unit is a person.
With regard to the meaning of population within statistics, the difference to the
everyday meaning is less obvious: but note that the statistical population in the
example is defined to be the height values of people, not actually the people.
Had we measured the weights instead, the statistical population would be quite
different. Also, later we will realize that statistical populations in engineering
contexts can refer to many other things than populations as in a group of
organisms, hence stretching the use of the word beyond the everyday meaning.
From this point: population will be used instead of statistical population in order
to simplify the text.
The population in a given situation will be linked with the actual study and/or
experiment carried out - the data collection procedure, sometimes also denoted
the data generating process. For the sample to represent relevant information
about the population it should be representative for that population. In the
example, had we only measured male heights, the population we can say
anything about would be the male height population only, not the entire height
population.
Chapter 1 1.4 SUMMARY STATISTICS 7
The initial part is also called an explorative analysis of the data. We use a number
of summary statistics to summarize and describe a sample consisting of one or
two variables:
• Measures of centrality:
– Mean
– Median
– Quantiles
• Measures of “spread”:
– Variance
– Standard deviation
– Coefficient of variation
– Inter Quartile Range (IQR)
• Measures of relation (between two variables):
– Covariance
– Correlation
One important point to notice is that these statistics can only be calculated for
the sample and not for the population - we simply don’t know all the values
in the population! But we want to learn about the population from the sample.
For example when we have a random sample from a population we say that the
sample mean (x̄) is an estimate of the mean of the population, often then denoted
µ, as illustrated in Figure 1.1.
Remark 1.3
Notice, that we put ’sample’ in front of the name of the statistic, when it is
calculated for the sample, but we don’t put ’population’ in front when we
refer to it for the population (e.g. we can think of the mean as the true mean).
HOWEVER we don’t put sample in front of the name every time it should
be there! This is to keep the text simpler and since traditionally this is not
strictly done, for example the median is rarely called the sample median,
even though it makes perfect sense to distinguish between the sample me-
dian and the median (i.e. the population median). Further, it should be
clear from the context if the statistic refers to the sample or the population,
when it is not clear then we distinguish in the text. Most of the way we do
distinguish strictly for the mean, standard deviation, variance, covariance and
correlation.
The sample mean is a key number that indicates the centre of gravity or
centring of the sample. Given a sample of n observations x1 , . . . , xn , it is defined as
follows:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad \text{(1-1)}$$
The median is also a key number indicating the centre of the sample (note that to
be strict we should call it ’sample median’, see Remark 1.3 above). In some
cases, for example in the case of extreme values or skewed distributions, the
median can be preferable to the mean. The median is the observation in the
middle of the sample (in sorted order). One may express the ordered observations
as x(1) , . . . , x(n) , where then x(1) is the smallest of all x1 , . . . , xn (also called
the minimum) and x(n) is the largest of all x1 , . . . , xn (also called the maximum).
If n is odd, the sample median is the middle observation in the ordered list:
$$Q_2 = x_{\left(\frac{n+1}{2}\right)}. \qquad \text{(1-2)}$$
If n is even, the sample median is the average of the two middle observations:
$$Q_2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2}. \qquad \text{(1-3)}$$
The reason why it is denoted with Q2 is explained below in Definition 1.8.
A random sample of the heights (in cm) of 10 students in a statistics class was
168 161 167 179 184 166 198 187 191 179.
$$\bar{x} = \frac{1}{10}(168 + 161 + 167 + 179 + 184 + 166 + 198 + 187 + 191 + 179) = 178.$$
To find the sample median we first order the observations from smallest to largest
x(1)  x(2)  x(3)  x(4)  x(5)  x(6)  x(7)  x(8)  x(9)  x(10)
161   166   167   168   179   179   184   187   191   198
Note that having duplicate observations (like e.g. two of 179) is not a problem - they
all just have to appear in the ordered list. Since n = 10 is an even number the median
becomes the average of the 5th and 6th observations
$$\frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{179 + 179}{2} = 179.$$
As an illustration, let’s look at the results if the sample did not include the 198 cm
height, hence for n = 9:
$$\bar{x} = \frac{1}{9}(168 + 161 + 167 + 179 + 184 + 166 + 187 + 191 + 179) = 175.78,$$
then the median would have been
$$x_{\left(\frac{n+1}{2}\right)} = x_{(5)} = 179.$$
This illustrates the robustness of the median compared to the sample mean: the
sample mean changes a lot more by the inclusion/exclusion of a single “extreme”
measurement. Similarly, it is clear that the median does not depend at all on the
actual values of the most extreme ones.
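The effect can also be checked numerically. The following R sketch (R is introduced in Section 1.5) recomputes both statistics with and without the 198 cm observation; the variable names `x` and `x_without` are chosen here purely for illustration.

```r
# Heights (cm) of the 10 students from the example
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

# The same sample without the most extreme observation (198 cm)
x_without <- x[x != 198]

# The sample mean moves by more than 2 cm ...
mean(x)            # 178
mean(x_without)    # about 175.78

# ... while the median is unchanged
median(x)          # 179
median(x_without)  # 179
```

Replacing 198 by an even larger value would change the mean further, but not the median.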
The median is the point that divides the observations into two halves. It is of
course possible to find other points that divide into other proportions; they are
called quantiles or percentiles (note that this is actually the sample quantile or
sample percentile, see Remark 1.3).
2. Compute pn
4. If pn is a non-integer: take the “next one” in the ordered list. Then the
p’th quantile is
$$q_p = x_{(\lceil np \rceil)}, \qquad \text{(1-5)}$$
where ⌈np⌉ is the ceiling of np, that is, the smallest integer larger than
np.
Footnote a: There exist several other formal definitions. To obtain this definition of
quantiles/percentiles in R use quantile(..., type=2). Using the default in R is also a perfectly
valid approach - just a different one.
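A minimal R sketch of this rule may help. The helper `quantile_def` is hypothetical (it is not part of R). The non-integer branch implements equation (1-5); the integer branch is an assumption here, namely that the two neighbouring ordered observations are averaged, which is what R documents for `type=2`.

```r
# Hypothetical helper: p'th sample quantile following the steps above
quantile_def <- function(x, p) {
  xs <- sort(x)                 # ordered observations x_(1), ..., x_(n)
  np <- length(xs) * p
  if (abs(np - round(np)) < 1e-8) {
    # Assumption: np is an integer, so average x_(np) and x_(np+1)
    k <- round(np)
    # Clamp the indices so that p = 0 and p = 1 give the minimum and maximum
    lo <- max(k, 1)
    hi <- min(k + 1, length(xs))
    (xs[lo] + xs[hi]) / 2
  } else {
    # np is not an integer: take the "next one", x_(ceiling(np))
    xs[ceiling(np)]
  }
}

x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
quantile_def(x, 0.25)   # 167
quantile_def(x, 0.50)   # 179
quantile_def(x, 0.75)   # 187

# Comparison with the built-in function
quantile(x, probs = c(0.25, 0.50, 0.75), type = 2)
```

The tolerance `1e-8` is a design choice of this sketch: it treats values of np that are integers up to rounding error as integers.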
Often calculated percentiles are the so-called quartiles (splitting the sample in
quarters, i.e. 0%, 25%, 50%, 75% and 100%):
Note that the 0’th percentile is the minimum (smallest) observation and the
100’th percentile is the maximum (largest) observation. We have specific names
for the three other quartiles:
Using the n = 10 sample from Example 1.6 and the ordered data table from there,
let us find the lower and upper quartiles (i.e. Q1 and Q3 ), as we already found
Q2 = 179.
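First, Q1: with p = 0.25, we get that np = 2.5, which is not an integer, and we find that
$$Q_1 = x_{(\lceil 2.5 \rceil)} = x_{(3)} = 167.$$
Similarly for Q3: with p = 0.75, we get that np = 7.5 and
$$Q_3 = x_{(\lceil 7.5 \rceil)} = x_{(8)} = 187.$$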
The sample standard deviation and the sample variance are key numbers of
absolute variation. If it is of interest to compare variation between different
samples, it might be a good idea to use a relative measure - most obvious is the
coefficient of variation:
We interpret the standard deviation as the average absolute deviation from the mean
or simply: the average level of differences, and this is by far the most used measure
of spread. Two (relevant) questions are often asked at this point (it is perfectly
fine if you didn’t wonder about them by now and you might skip the answers
and return to them later):
Remark 1.13
Answer: This is indeed an alternative, called the mean absolute deviation, that
one could use. The reason for most often measuring “mean deviation”
NOT by the Mean Absolute Deviation statistic, but rather by the sample
standard deviation s, is the so-called theoretical statistical properties of
the sample variance s2 . This is a bit early in the material for going into
details about this, but in short: inferential statistics is heavily based
on probability considerations, and it turns out that it is theoretically
much easier to put probabilities related to the sample variance s2 on
explicit mathematical formulas than probabilities related to most other
alternative measures of variability. Further, in many cases this choice
is in fact also the optimal one.
Remark 1.14
The Inter Quartile Range (IQR) is the middle 50% range of data defined as
$$IQR = Q_3 - Q_1.$$
Consider again the n = 10 data from Example 1.6. To find the variance let us
compute the n = 10 differences to the mean, that is (xi − 178):
-10 -17 -11 1 6 -12 20 9 13 1.
The sum of the squared differences is 1342, so the sample variance is
$$s^2 = \frac{1}{9} \cdot 1342 = 149.1,$$
and the sample standard deviation is
$$s = 12.21.$$
We can interpret this as: people are on average around 12 cm away from the mean
height of 178 cm. The Range and Inter Quartile Range (IQR) are easily found from
the ordered data table in Example 1.6 and the earlier found quartiles in Example 1.9:
$$\text{Range} = 198 - 161 = 37, \qquad IQR = 187 - 167 = 20.$$
Hence 50% of all people (in the sample) lie within 20 cm.
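The same numbers can be reproduced step by step in R. This is a sketch that assumes the usual sample variance with divisor n − 1 (consistent with the factor 1/9 in the example) and takes the coefficient of variation to be s divided by the sample mean; the formal definitions are not repeated here.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)

# Differences to the mean and the sum of their squares
d <- x - mean(x)
sum(d^2)                    # 1342

# Sample variance and standard deviation "manually"
s2 <- sum(d^2) / (n - 1)    # about 149.1
s  <- sqrt(s2)              # about 12.21

# Coefficient of variation (assumed definition: s / sample mean)
s / mean(x)

# Range and IQR, using the quantile definition obtained with type=2
max(x) - min(x)             # 37
q <- quantile(x, probs = c(0.25, 0.75), type = 2)
unname(q[2] - q[1])         # 20
```

The values `s2` and `s` agree with the built-in functions `var(x)` and `sd(x)`.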
Note that the standard deviation in the example has the physical unit cm,
whereas the variance has cm². This illustrates the fact that the standard
deviation has a more direct interpretation than the variance in general.
When two observational variables are available for each observational unit, it
may be of interest to quantify the relation between the two, that is to quantify
how the two variables co-vary with each other, their sample covariance and/or
sample correlation.
In addition to the previously given student heights we also have their weights (in
kg) available
Heights (xi)   168    161    167    179    184    166    198    187    191    179
Weights (yi)   65.5   58.3   68.1   85.7   80.5   63.4   102.6  91.4   86.7   78.9
The relation between weights and heights can be illustrated by the so-called scatter-
plot, cf. Section 1.6.4, where e.g. weights are plotted versus heights:
[Figure: scatter plot of Weight versus Height, using the observation number (1-10)
as plot symbol, with reference lines at the sample means x̄ = 178 and ȳ = 78.1.]
Each point in the plot corresponds to one student - here illustrated by using the
observation number as plot symbol. The (expected) relation is pretty clear now -
different wordings could be used for what we see:
The sample covariance and sample correlation coefficients are summary
statistics that can be calculated for two (related) sets of observations. They quantify
the (linear) strength of the relation between the two. They are calculated by
combining the two sets of observations (and the means and standard deviations
from the two) in the following ways:
When xi − x̄ and yi − ȳ have the same sign, then the point (xi , yi ) gives a positive
contribution to the sample correlation coefficient and when they have opposite
signs the point gives a negative contribution to the sample correlation coefficient,
as illustrated here:
Using these we can show how each student deviates from the average height and
weight (these deviations are exactly what is used for the sample correlation and
covariance computations)
Student               1      2      3      4      5      6      7      8      9     10
Height (xi)         168    161    167    179    184    166    198    187    191    179
Weight (yi)        65.5   58.3   68.1   85.7   80.5   63.4  102.6   91.4   86.7   78.9
(xi − x̄)            -10    -17    -11      1      6    -12     20      9     13      1
(yi − ȳ)          -12.6  -19.8    -10    7.6    2.4  -14.7   24.5   13.3    8.6    0.8
(xi − x̄)(yi − ȳ)  126.1  336.8  110.1    7.6   14.3  176.5  489.8  119.6  111.7    0.8
Student 1 is below average on both height and weight (−10 and −12.6). Student
10 is above average on both height and weight (+1 and +0.8).
The sample covariance is then given by the sum of the 10 numbers in the last row of
the table divided by n − 1 = 9
$$s_{xy} = \frac{1}{9}(126.1 + 336.8 + 110.1 + 7.6 + 14.3 + 176.5 + 489.8 + 119.6 + 111.7 + 0.8) = \frac{1}{9} \cdot 1493.3 = 165.9.$$
And the sample correlation is then found from this number and the standard
deviations
$$s_x = 12.21 \quad \text{and} \quad s_y = 14.07$$
(the details of the sy computation are not shown). Thus we get the sample correlation
as
$$r = \frac{165.9}{12.21 \cdot 14.07} = 0.97.$$
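As a check, the computation can be sketched in R. The code assumes the usual definitions: the sum of the products of deviations divided by n − 1 for the covariance (as in the calculation above), and the covariance divided by the product of the two standard deviations for the correlation. The built-in functions `cov` and `cor` are used for comparison.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)             # heights (cm)
y <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)  # weights (kg)
n <- length(x)

# Products of the deviations from the two sample means
prods <- (x - mean(x)) * (y - mean(y))

# Sample covariance and correlation "manually"
s_xy <- sum(prods) / (n - 1)      # about 165.9
r    <- s_xy / (sd(x) * sd(y))    # about 0.97

# The built-in functions give the same values
cov(x, y)
cor(x, y)
```

All ten entries of `prods` are positive for these data, which is why the covariance and the correlation are positive.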
Note how all 10 contributions to the sample covariance are positive in the
example case - in line with the fact that all observations are found in the first
and third quadrants of the scatter plot (where the quadrants are defined by the
sample means of x and y). Observations in the second and fourth quadrants would
contribute with negative numbers to the sum; such observations would
be from students who are below average on one feature while above average on the
other. It is then clear that, had all students been like that, the covariance
and the correlation would have been negative, in line with a negative
(downwards) trend in the relation.
The sample correlation coefficient measures the degree of linear relation
between x and y, which implies that we might fail to detect non-linear relationships,
as illustrated in the following plot of four different point clouds and their sample
correlations:
[Figure: four scatter plots of y versus x with sample correlations r ≈ 0.95 (top left),
r ≈ −0.5 (top right), r ≈ 0 (bottom left) and r ≈ 0 (bottom right).]
The sample correlations in both bottom plots are close to zero, but as we see
from the plots this number by itself does not imply that there is no relation between y
and x - there clearly is one in the bottom right, highly non-linear case.
Sample covariances and correlations are closely related to the topic of linear
regression, treated in Chapters 5 and 6, where we will treat in more detail how
we can find the line that could be added to such scatter plots to describe the
relation between x and y in a different (but related) way, as well as the statistical
analysis used for this.
Chapter 1 1.5 INTRODUCTION TO R AND RSTUDIO 20
The program R is an open source software for statistics that you can download
to your own laptop for free. Go to http://mirrors.dotsrc.org/cran/ and
select your platform (Windows, Mac or Linux) and follow the instructions to install.
To use the software, you only need to open RStudio (R will then be used by RStudio for
carrying out the calculations).
Once you have opened RStudio, you will see a number of different windows.
One of them is the console. Here you can write commands and execute them by
hitting Enter. For instance:
[1] 5
If you want to assign a value to a variable, you can use = or <-. The latter is
preferred by R users, so for instance:
It is often useful to assign a set of values to a variable like a vector. This is done
with the function c (short for concatenate):
[1] 1 4 6 2
[1] 1 2 3 4 5 6 7 8 9 10
You can also make a sequence with a specific step-size different from 1
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
If you are in doubt of how to use a certain function, the help page can be opened
by typing ? followed by the function, e.g. ?seq.
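Several of the commands that produce the outputs shown above are not reproduced in this section. The following sketch shows commands that would give such outputs; the specific calls and the variable names `y` and `x` are assumptions inferred from the printed results, not the original listing.

```r
# Simple arithmetic in the console
2 + 3                  # prints [1] 5

# Assignment with <- (y is an arbitrary name)
y <- 3

# A vector created with c()
x <- c(1, 4, 6, 2)
x                      # prints [1] 1 4 6 2

# A sequence from 1 to 10 in steps of 1
1:10

# A sequence from 0 to 1 with step size 0.1
seq(0, 1, by = 0.1)
```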
All the summary statistics measures presented in Section 1.4 can be found as
functions or part of functions in R:
Please again note that the words quantiles and percentiles are used interchange-
ably - they are essentially synonyms meaning exactly the same, even though the
formal distinction has been clarified earlier.
Consider again the n = 10 data from Example 1.6. We can read these data into R
and compute the sample mean and sample median as follows:
# Sample mean and median
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
mean(x)
[1] 178
median(x)
[1] 179
The sample variance and sample standard deviation are found as follows:
# Sample variance and standard deviation
var(x)
[1] 149.1
sqrt(var(x))
[1] 12.21
sd(x)
[1] 12.21
The sample quartiles can be found by using the quantile function as follows:
# Sample quartiles
quantile(x, type=2)
The option “type=2” makes sure that the quantiles are found
using the definition given in Definition 1.7. By default, the quantile function would
use another definition (not detailed here). Generally, we consider this default choice
just as valid as the one explicitly given here; it is merely a different one. Also, the
quantile function has an option called “probs” where any list of probability values
from 0 to 1 can be given. For instance:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
161.0 163.5 166.5 168.0 173.5 179.0 184.0 187.0 189.0 194.5 198.0
You should bring your laptop with R installed with you to the teaching activity
and to the exam. We will need access to the so-called probability distributions
to do statistical computations, and the values of these distributions are not
otherwise part of the written material. These probability distributions are available in
many different software packages, including Excel, but it is part of the syllabus to be able to
work with them within R.
We will see and present all three types of applications of R during the course.
For the first type, the aim is not to learn how to use the given R-code itself
but rather to learn from the insights that the code together with the results of
applying it is providing. It will be stated clearly whenever an R-example is of
this type. Types 2 and 3 are specific tools that should be learned as a part of the
course and represent tools that are explicitly relevant in your future engineering
activity. It is clear that at some point one would love to just do the last kind
of applications. However, it must be stressed that even though the program is
able to calculate things for the user, understanding the details of the calculations
must NOT be forgotten - understanding the methods and knowing the formulas
is an important part of the syllabus, and will be checked at the exam.
A good question to ask yourself each time that you apply an inbuilt R-function
is: “Would I know how to make this computation ‘manually’?”. There are a few
exceptions to this requirement in the course, but only a few. And for these the
question would be: “Do I really understand what R is computing for me now?”
Chapter 1 1.6 PLOTTING, GRAPHICS - DATA VISUALISATION 26
A really important part of working with data analysis is the visualisation of the
raw data, as well as the results of the statistical analysis – the combination of
the two leads to reliable results. Let us focus on the first part now, which can
be seen as being part of the explorative descriptive analysis also mentioned in
Section 1.4. Depending on the data at hand different types of plots and graphics
could be relevant. One can distinguish between quantitative vs. categorical data.
We will touch on the following types of basic plots:
• Quantitative data:
– Frequency plots and histograms
– Box plots
– Cumulative distribution
– Scatter plot (xy plot)
• Categorical data:
– Bar charts
– Pie charts
The default histogram uses equidistant interval widths (the same width for all
intervals) and depicts the raw frequencies/counts in each interval. One may
change the scale into showing what we will learn to be densities by dividing the
raw counts by n and the interval width, i.e.
$$\frac{\text{Interval count}}{n \cdot (\text{Interval width})}.$$
Plotting the densities gives a density histogram, also called the empirical density,
in which the areas of all the bars add up to 1:
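A short R sketch can confirm this scaling. It uses the heights sample from Section 1.4 and calls `hist` with `plot = FALSE`, so that only the computed values are returned; calling `hist(x, freq = FALSE)` instead would draw the density histogram.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Interval widths (equidistant by default)
widths <- diff(h$breaks)

# Densities "manually": interval count / (n * interval width)
dens <- h$counts / (n * widths)

# They agree with the densities computed by hist ...
all.equal(dens, h$density)

# ... and the total area of the bars is 1
sum(dens * widths)
```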
The R-function hist makes some choice of the number of classes based on
the number of observations - it may be changed by the user option nclass as
illustrated here, although the original choice seems better in this case due to the
very small sample.
$$F_n(x) = \sum_{j \text{ where } x_j \le x} \frac{1}{n}. \qquad \text{(1-13)}$$
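In R the empirical cumulative distribution function is available through `ecdf`. The sketch below evaluates equation (1-13) directly and compares the result with `ecdf` for the heights sample; the helper name `Fn_manual` is hypothetical.

```r
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)

# Equation (1-13): add 1/n for every observation x_j <= t
Fn_manual <- function(t) sum(x <= t) / length(x)

# The built-in version returns a function
Fn <- ecdf(x)

Fn_manual(179)   # 0.6, since six of the ten heights are at most 179
Fn(179)          # 0.6
Fn(160)          # 0, below the minimum
Fn(198)          # 1, at the maximum
```

Calling `plot(Fn)` draws the resulting step function.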
The so-called box plot in its basic form depicts the five quartiles (min, Q1 , me-
dian, Q3 , max) with a box from Q1 to Q3 emphasizing the Inter Quartile Range
(IQR):
[Figure: basic box plot of the heights sample, with Q3, median, Q1 and minimum labelled.]
In the modified box plot the whiskers only extend to the min. and max. obser-
vation if they are not too far away from the box: defined to be 1.5 × IQR. Obser-
vations further away are considered as extreme observations and will be plotted
individually - hence the whiskers extend from the smallest to the largest obser-
vation within a distance of 1.5 × IQR of the box (defined as either 1.5 × IQR
larger than Q3 or 1.5 × IQR smaller than Q1 ).
If we add an extreme observation, 235 cm, to the heights sample and make the
modified box plot - the default in R - and the basic box plot, then we have:
[Figure: modified box plot and basic box plot of the heights sample including the
235 cm observation, each with Q3, median, Q1 and minimum labelled.]
Note that since there were no extreme observations among the original 10
observations, the two “different” plots would be the same if we didn’t add the extreme 235
cm observation.
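The limits used by the modified box plot can be sketched in R as follows. The quartiles are computed with `type=2` to match the quantile definition used in this chapter. As an aside, R's own `boxplot` function bases the box on hinges computed by `fivenum`, which may differ slightly from these quartiles.

```r
# Heights sample with the extreme observation 235 cm added
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179, 235)

# Lower and upper quartile and the IQR
q   <- quantile(x, probs = c(0.25, 0.75), type = 2, names = FALSE)
iqr <- q[2] - q[1]

# Observations further than 1.5 x IQR from the box are "extreme"
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
extreme <- x[x < lower | x > upper]
extreme                              # 235

# The whiskers end at the most distant observations inside the limits
range(x[x >= lower & x <= upper])    # 161 198
```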
The box plot hence is an alternative to the histogram in visualising the distribu-
tion of the sample. It is a convenient way of comparing distributions in different
groups, if such data is at hand.
In another statistics course the following heights of 17 female and 23 male students
were found:
Males 152 171 173 173 178 179 180 180 182 182 182 185
185 185 185 185 186 187 190 190 192 192 197
Females 159 166 168 168 171 171 172 172 173 174 175 175
175 175 175 177 178
The two modified box plots of the distributions for each gender can be generated by
a single call to the boxplot function:
[Figure: modified box plots of the heights for Males and Females.]
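The call itself is not reproduced above. A sketch of how it might look is given below, assuming the heights are stored in two vectors named `Males` and `Females` (names chosen here). With `plot = FALSE` the function only returns the box plot statistics; leaving that argument out draws the plot.

```r
Males <- c(152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185,
           185, 185, 185, 185, 186, 187, 190, 190, 192, 192, 197)
Females <- c(159, 166, 168, 168, 171, 171, 172, 172, 173, 174, 175, 175,
             175, 175, 175, 177, 178)

# One call handles both groups when they are passed as a named list
b <- boxplot(list(Males = Males, Females = Females), plot = FALSE)

b$n       # group sizes: 23 17
b$names   # "Males" "Females"
b$out     # observations that would be drawn individually as extreme
```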
At this point, it should be noted that in real work with data using R, one would
generally not import data into R by explicit listings in an R-script as here. This
only works for very small data sets. Usually the data is imported from somewhere
else, e.g. from a spreadsheet exported in a .csv (comma separated values)
format as shown here:
The gender grouped student heights data used in Example 1.30 is available
as a .csv-file via http://www2.compute.dtu.dk/courses/introstat/data/studentheights.csv.
The structure of the data file, as it would appear in a spreadsheet
program (e.g. LibreOffice Calc or Excel) is two columns and 40+1 rows including
a header row:
1 Height Gender
2 152 male
3 171 male
4 173 male
. . .
. . .
24 197 male
25 159 female
26 166 female
27 168 female
. . .
. . .
39 175 female
40 177 female
41 178 female
The data can now be imported into R with the read.table function:
# Read the data (read.table splits on whitespace by default, but this file uses semicolons)
studentheights <- read.table("studentheights.csv", sep=";", dec=".",
                             header=TRUE, stringsAsFactors=TRUE)
# Have a look at the first 6 rows of the data
head(studentheights)
  Height Gender
1    152   male
2    171   male
3    173   male
4    173   male
5    178   male
6    179   male
# Get an overview
str(studentheights)
# Get a summary of each column
summary(studentheights)
     Height         Gender
 Min.   :152.0   female:17
 1st Qu.:172.5   male  :23
 Median :177.5
 Mean   :177.9
 3rd Qu.:185.0
 Max.   :197.0
For quantitative variables we get the quartiles and the mean from summary. For cat-
egorical variables we see the category frequencies. A data structure like this is com-
monly encountered (and often the only needed) for statistical analysis. The gender
grouped box plot can now be generated by:
[Figure: box plots of Height by Gender (female, male).]
The R-syntax Height ~ Gender with the tilde symbol “~” is one that we will use a
lot in various contexts such as plotting and model fitting. In this context it can be
understood as “Height is plotted as a function of Gender”.
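The plotting call is not reproduced above. The following sketch shows how the formula syntax could be used. To keep it self-contained, it rebuilds the data frame from the listed heights instead of reading the .csv file, and it uses `plot = FALSE` so that only the box plot statistics are returned (omit that argument to draw the plot).

```r
# Rebuild the data frame from the heights listed in Example 1.30
studentheights <- data.frame(
  Height = c(152, 171, 173, 173, 178, 179, 180, 180, 182, 182, 182, 185,
             185, 185, 185, 185, 186, 187, 190, 190, 192, 192, 197,
             159, 166, 168, 168, 171, 171, 172, 172, 173, 174, 175, 175,
             175, 175, 175, 177, 178),
  Gender = factor(rep(c("male", "female"), times = c(23, 17)))
)

# Height as a function of Gender: one box per factor level
b <- boxplot(Height ~ Gender, data = studentheights, plot = FALSE)
b$names   # "female" "male"
b$n       # 17 23

# The same grouping idea with another function: median height per gender
tapply(studentheights$Height, studentheights$Gender, median)
```

The factor levels are sorted alphabetically, which is why "female" comes before "male" in the output.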
The scatter plot can be used for two quantitative variables. It is simply one
variable plotted versus the other using some plotting symbol.
Now we will use a data set available as part of R itself. Both base R and many add-
on R-packages include data sets, which can be used for testing and practising. Here
we will use the mtcars data set. If you write:
?mtcars
you will be able to read the following as part of the help info:
“The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74
models). A data frame with 32 observations on 11 variables. Source: Henderson and
Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.”
Let us plot the gasoline use, (mpg=miles pr. gallon), versus the weight (wt):
# To make 2 plots
par(mfrow=c(1,2))
# First the default version
plot(mtcars$wt, mtcars$mpg, xlab="wt", ylab="mpg")
# Then a nicer version
plot(mpg ~ wt, xlab="Car Weight (1000lbs)", data=mtcars,
ylab="Miles pr. Gallon", col=factor(am),
main="Inverse fuel usage vs. size")
# Add a legend to the plot
legend("topright", c("Automatic transmission","Manual transmission"),
col=c("black","red"), pch=1, cex=0.7)
[Figure: two scatter plots of fuel use versus weight: the default version (axes "wt" and
"mpg") and the nicer version (axes "Car Weight (1000lbs)" and "Miles pr. Gallon").]
In the second plot call we have used the so-called formula syntax of R, that was
introduced above for the grouped box plot. Again, it can be read: “mpg is plotted
as a function of wt”. Note also how a color option, col=factor(am), can be used to
group the cars with and without automatic transmission, stored in the data column
am in the data set.
All the plots described so far were for quantitative variables. For categorical
variables the natural basic plot would be a bar plot or pie chart visualizing the
For the gender grouped student heights data used in Example 1.30 we can plot the
gender distribution by:
# Barplot
barplot(table(studentheights$Gender), col=2:3)
[Figure: bar plot of the number of students per gender (female, male).]
# Pie chart
pie(table(studentheights$Gender), cex=1, radius=1)
[Figure: pie chart of the gender distribution (female, male).]
A good place to get more inspiration on how to make easy and nice plots in R is:
http://www.statmethods.net/.
1.7 Exercises
In a study of different occupational groups the infant birth weight was recorded
for randomly selected babies born by hairdressers, who had their first child.
The following table shows the weight in grams (observations specified in sorted
order) for 10 female births and 10 male births:
Females (x) 2474 2547 2830 3219 3429 3448 3677 3872 4001 4116
Males (y) 2844 2863 2963 3239 3379 3449 3582 3926 4151 4356
Solve at least the following questions a)-c) first “manually” and then by the
inbuilt functions in R. It is OK to use R as alternative to your pocket calculator
for the “manual” part, but avoid the inbuilt functions that will produce the
results without forcing you to think about how to compute it during the manual
part.
a) What is the sample mean, variance and standard deviation of the female
births? Express in your own words the story told by these numbers. The
idea is to force you to interpret what can be learned from these numbers.
b) Compute the same summary statistics of the male births. Compare and
explain differences with the results for the female births.
c) Find the five quartiles for each sample, and draw the two box plots with
pen and paper (i.e. not using R).
d) Are there any “extreme” observations in the two samples (use the modified
box plot definition of extremeness)?
b) What are the quartiles and the IQR (Inter Quartile Range)?
Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6 8.2 9.2 7.3 8.5 9.5
After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8 7.1 7.7 6.0 6.6 8.4
a) What is the median of the cholesterol measurements for the patients before
treatment, and similarly after treatment?
a) Go to CampusNet and take a look at the first project and read the project
page on the website for more information (02323.compute.dtu.dk/projects
or 02402.compute.dtu.dk/projects). Follow the steps to import the data
into R and get started with the explorative data analysis.
Glossaries
Box plot [Box plot] The so-called boxplot in its basic form depicts the five
quartiles (min, Q1 , median, Q3 , max) with a box from Q1 to Q3 emphasizing
the IQR
Class The frequency distribution of the data for a certain grouping of the data
Histogram [Histogram] The default histogram uses the same width for all classes
and depicts the raw frequencies/counts in each class. By dividing the raw
counts by n times the class width the density histogram is found where
the area of all bars sum to 1
Inter Quartile Range [Interkvartil bredde] The Inter Quartile Range (IQR) is
the middle 50% range of data
Acronyms
IQR Inter Quartile Range, Glossary: Inter Quartile Range