Minimalist Data Wrangling with Python
Marek Gagolewski
v1.1.0.9002
Prof. Marek Gagolewski
Warsaw University of Technology, Poland
Systems Research Institute, Polish Academy of Sciences
https://www.gagolewski.com/
A little peculiar is the world some people decided to immerse themselves in, so here is
a message stating the obvious. Every effort has been made in the preparation of this
book to ensure the accuracy of the information presented. However, the information
contained in this book is provided without warranty, either express or implied. The
author will, of course, not be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Any bug reports/corrections/feature requests are welcome. To make this textbook even
better, please file them at https://github.com/gagolews/datawranglingpy.
Typeset with XeLaTeX. Please be understanding: it was an algorithmic process. Hence,
the results are ∈ [good enough, perfect).
Homepage: https://datawranglingpy.gagolewski.com/
Datasets: https://github.com/gagolews/teaching-data
Preface  xiii
0.1 The art of data wrangling  xiii
0.2 Aims, scope, and design philosophy  xiv
0.2.1 We need maths  xv
0.2.2 We need some computing environment  xv
0.2.3 We need data and domain knowledge  xvi
0.3 Structure  xvii
0.4 The Rules  xix
0.5 About the author  xxi
0.6 Acknowledgements  xxi
0.7 You can make this book better  xxii
I Introducing Python  1
1 Getting started with Python  3
1.1 Installing Python  3
1.2 Working with Jupyter notebooks  4
1.2.1 Launching JupyterLab  5
1.2.2 First notebook  5
1.2.3 More cells  6
1.2.4 Edit vs command mode  7
1.2.5 Markdown cells  8
1.3 The best note-taking app  10
1.4 Initialising each session and getting example data  10
II Unidimensional data  43
4 Unidimensional numeric data and their empirical distribution  45
4.1 Creating vectors in numpy  46
4.1.1 Enumerating elements  47
4.1.2 Arithmetic progressions  48
4.1.3 Repeating values  49
4.1.4 numpy.r_ (*)  49
4.1.5 Generating pseudorandom variates  50
4.1.6 Loading data from files  50
4.2 Some mathematical notation  51
4.3 Inspecting the data distribution with histograms  52
5.6 Exercises  98
Changelog  411
References  415
0 Preface
We primarily focus on methods and algorithms that have stood the test of time and
that continue to inspire researchers and practitioners. They all meet the reality check
comprised of the three undermentioned properties, which we believe are essential in
practice:
• simplicity (and thus interpretability, being equipped with no or only a few under-
lying tunable parameters; being based on some sensible intuitions that can be ex-
plained in our own words),
• mathematical analysability (at least to some extent; so that we can understand
their strengths and limitations),
• implementability (not too abstract on the one hand, but also not requiring any
advanced computer-y hocus-pocus on the other).
Note Many more complex algorithms are merely variations on or clever combinations
of the more basic ones. This is why we need to study the foundations in great detail.
We might not see it now, but this will become evident as we progress.
need programming at all? Unfortunately, some mathematicians have forgotten that probability and statistics
are deeply rooted in the so-called real world. Theory beautifully supplements practice and provides us with
very deep insights, but we still need to get our hands dirty from time to time.
This course uses the Python language which we shall introduce from scratch. Con-
sequently, we do not require any prior programming experience.
The 2024 StackOverflow Developer Survey11 lists Python as the second most popular
programming language (slightly behind JavaScript, whose primary use is in Web de-
velopment). Over the last couple of years, it has proven to be a quite robust choice for
learning and applying data wrangling techniques. This is possible thanks to the de-
voted community of open-source programmers who wrote the famous high-quality
packages such as numpy, scipy, matplotlib, pandas, seaborn, and scikit-learn.
Nevertheless, Python and its third-party packages are amongst many software tools
which can help extract knowledge from data. Certainly, this ecosystem is not ideal for
all the applications, nor is it the most polished. The R12 environment [36, 65, 96, 102]
is one13 of the recommended alternatives worth considering.
Important We will focus on developing transferable skills: most of what we learn here
can be applied (using different syntax but the same kind of reasoning) in other en-
vironments. Thus, this is a course on data wrangling (with Python), not a course on
Python (with examples in data wrangling).
0.3 Structure
This book is a whole course. We recommend reading it from the beginning to the end.
The material has been divided into five parts.
1. Introducing Python:
• Chapter 1 discusses how to set up the Python environment, including Jupy-
ter Notebooks which are a flexible tool for the reproducible generation of
reports from data analyses.
• Chapter 2 introduces the elementary scalar types in base Python, ways to
call existing and to compose our own functions, and control a code chunk’s
execution flow.
• Chapter 3 mentions sequential and other iterable types in base Python. The
more advanced data structures (vectors, matrices, data frames) will build
upon these concepts.
2. Unidimensional data:
• Chapter 4 introduces vectors from numpy, which we use for storing data on
the real line (think: individual columns in a tabular dataset). Then, we look at
the most common types of empirical distributions of data, e.g., bell-shaped,
right-skewed, heavy-tailed ones.
• In Chapter 5, we list the most basic ways for processing sequences of num-
bers, including methods for data aggregation, transformation (e.g., stand-
ardisation), and filtering. We also mention that a computer’s floating-point
arithmetic is imprecise and what we can do about it.
• Chapter 6 reviews the most common probability distributions (normal, log-
normal, Pareto, uniform, and mixtures thereof), methods for assessing how
well they fit empirical data. It also covers pseudorandom number generation
which is crucial in experiments based on simulations.
3. Multidimensional data:
• Chapter 7 introduces matrices from numpy. They are a convenient means of
storing multidimensional quantitative data, i.e., many points described by
possibly many numerical features. We also present some methods for their
Note (*) Parts marked with a single or double asterisk (e.g., some sections or ex-
amples) can be skipped on first reading for they are of lesser importance or greater
difficulty.
0.6 Acknowledgements
Minimalist Data Wrangling with Python is based on my experience as an author of a quite
successful textbook Przetwarzanie i analiza danych w języku Python [37] that I wrote with
my former (successful) PhD students, Maciej Bartoszuk and Anna Cena – thanks! Even
though the current book is an entirely different work, its predecessor served as an
excellent testbed for many ideas conveyed here.
The teaching style exercised in this book has proven successful in many similar courses
that yours truly has been responsible for, including at Warsaw University of Techno-
logy, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank all my
students and colleagues for the feedback given over the last 10 or so years.
A thank-you to all the authors and contributors of the Python packages that we
use throughout this course: numpy [48], scipy [97], matplotlib [54], pandas [66],
and seaborn [99], amongst others (as well as the many C/C++/Fortran libraries they
provide wrappers for). Their version numbers are given in Section 1.4.
16 https://www.gagolewski.com/
17 https://deepr.gagolewski.com/
18 https://github.com/gagolews
19 https://stringi.gagolewski.com/
20 https://genieclust.gagolewski.com/
This book was prepared in a Markdown superset called MyST21 and typeset with Sphinx22
and TeX (XeLaTeX). Python code chunks were processed with the R (sic!) package
knitr [106]. A little help from Makefiles, custom shell scripts, and Sphinx plugins
(sphinxcontrib-bibtex23 , sphinxcontrib-proof24 ) dotted the j’s and crossed the f ’s.
The Ubuntu Mono25 font is used for the display of code. The typesetting of the main text
relies on the Alegreya26 typeface.
This work received no funding, administrative, technical, or editorial support from
Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or
any other source.
21 https://myst-parser.readthedocs.io/en/latest/index.html
22 https://www.sphinx-doc.org/
23 https://pypi.org/project/sphinxcontrib-bibtex
24 https://pypi.org/project/sphinxcontrib-proof
25 https://design.ubuntu.com/font
26 https://www.huertatipografica.com/en
Part I
Introducing Python
1 Getting started with Python
Note (*) More advanced students might consider, for example, jupytext12 as a means
to create .ipynb files directly from Markdown documents.
This should launch the JupyterLab server and open the corresponding app in our fa-
vourite web browser.
Important The file is stored relative to the running JupyterLab server instance’s
current working directory. Make sure you can locate HelloWorld.ipynb on your
disk using your file explorer. On a side note, .ipynb is just a JSON file that can also
be edited using ordinary text editors.
4. Press Ctrl+Enter (or Cmd+Return on m**OS) to execute the code cell and display
the result; see Figure 1.2.
2. Press Ctrl+Enter to execute whole code chunk and replace the previous outputs
with the updated ones.
3. Enter another command that prints a message that you would like to share with
the world. Note that character strings in Python must be enclosed either in double
quotes or in apostrophes.
4. Press Shift+Enter to execute the current code cell, create a new one below it, and
then enter the edit mode.
5. In the new cell, input and then execute:
import matplotlib.pyplot as plt # the main plotting library
plt.bar(
    ["spam", "bacon", "eggs"],  # the rest of this call is cut off in this excerpt;
    [3, 5, 7],                  # these categories and heights are merely illustrative
)
plt.show()
6. Add three more code cells that display some text or create other bar plots.
Exercise 1.3 Change print(2+5) to PRINT(2+5). Run the corresponding code cell and see
what happens.
Note In the Edit mode, JupyterLab behaves like an ordinary text editor. Most keyboard
shortcuts known from elsewhere are available, for example:
• Shift+LeftArrow, DownArrow, UpArrow, or RightArrow – select text,
• Ctrl+c – copy,
• Ctrl+x – cut,
• Ctrl+v – paste,
• Ctrl+z – undo,
• Ctrl+] – indent,
• Ctrl+[ – dedent,
• Ctrl+/ – toggle comment.
Important ESC and Enter switch between the Command and Edit modes, respectively.
Example 1.4 In Jupyter notebooks, the linear flow of chunks’ execution is not strongly enforced.
For instance:
## In [2]:
x = [1, 2, 3]
## In [10]:
sum(x)
## Out [10]:
## 18
## In [7]:
sum(y)
## Out [7]:
## 6
## In [6]:
x = [5, 6, 7]
## In [5]:
y = x
The chunk IDs reveal the true order in which the author has executed them. By editing cells in
a rather frivolous fashion, we may end up with matter that makes little sense when it is read
from the beginning to the end. It is thus best to always select Restart Kernel and Run All Cells
from the Kernel menu to ensure that evaluating content step by step renders results that meet our
expectations.
## Subsection
* two,
1. aaa,
2. bbbb,
* [three](https://en.wikipedia.org/wiki/3).
---
```python
# some code to display (but not execute)
2+2
```
An image:

And a table:
| A | B |
| -- | -- |
| 1 | 3 |
| 2 | 4 |
import numpy as np               # the imports opening this chunk fall on a cut
import pandas as pd              # page; the usual aliases (np, pd, plt) are
import matplotlib.pyplot as plt  # confirmed by the text below
import os
os.environ["COLUMNS"] = "74" # output width, in characters
np.set_printoptions(
linewidth=74, # output width
legacy="1.25", # print scalars without type information
)
pd.set_option("display.width", 74)
import sklearn
sklearn.set_config(display="text")
_colours = [  # a placeholder palette; the book defines its own list of colours
    "black", "red", "blue", "green", "orange", "purple", "gray", "brown"
]
_linestyles = [
    "solid", "dashed", "dashdot", "dotted"
]
plt.rcParams["axes.prop_cycle"] = plt.cycler(
# each plotted line will have a different plotting style
color=_colours, linestyle=_linestyles*2
)
plt.rcParams["patch.facecolor"] = _colours[0]
First, we imported the most frequently used packages (together with their usual ali-
ases, we will get to that later). Then, we set up some further options that yours truly is
particularly fond of. On a side note, Section 6.4.2 discusses the issues in reproducible
pseudorandom number generation.
Open-source software regularly enjoys feature extensions, API changes, and bug fixes.
It is worthwhile to know which version of the Python environment was used to execute
all the code listed in this book:
import sys
print(sys.version)
## 3.11.11 (main, Dec 04 2024, 21:44:34) [GCC]
Given beneath are the versions of the packages that we will be relying on. This inform-
ation can usually be accessed by calling print(package.__version__).
2 Scalar types and control structures in Python
In this part, we introduce the basics of the Python language itself. As Python is a
general-purpose tool, packages supporting data wrangling operations are provided as
third-party extensions. In further chapters, extending upon the concepts discussed
here, we will be able to use numpy, scipy, matplotlib, pandas, seaborn, and other
packages with a healthy degree of confidence.
Arithmetic operators
Here is the list of available arithmetic operators:
1 + 2 # addition
## 3
1 - 7 # subtraction
## -6
4 * 0.5 # multiplication
## 2.0
7 / 3 # float division (results are always of the type float)
## 2.3333333333333335
7 // 3 # integer division
## 2
7 % 3 # division remainder
## 1
2 ** 4 # exponentiation
## 16
“x” is a great name: it means something of general interest in mathematics. Let’s print out
the value it is bound to:
print(x) # or just `x`
## 7
Exercise 2.2 Define two named variables height (in centimetres) and weight (in kilograms).
Determine the corresponding body mass index (BMI2 ).
Exercise 2.3 Call the print function on the above objects to reveal the meaning of the included
escape sequences.
Important Many string operations are available, e.g., for formatting and pattern
searching. They are especially important in the art of data wrangling as information
often arrives in textual form. Chapter 14 covers this topic in detail.
Notice the “f” prefix. The “{x}” part was replaced with the value stored in the x variable.
The formatting of items can be fine-tuned. As usual, it is best to study the documenta-
tion4 in search of noteworthy features. Here, let’s just mention that we will frequently
be referring to placeholders like “{value:width}” and “{value:width.precision}”,
3 https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
4 https://docs.python.org/3/reference/lexical_analysis.html#f-strings
which specify the field width and the number of fractional digits of a number. This
way, we can output a series of values aesthetically aligned one beneath another.
π = 3.14159265358979323846
e = 2.71828182845904523536
print(f"""
π = {π:10.8f}
e = {e:10.8f}
πe² = {(π*e**2):10.8f}
""")
##
## π = 3.14159265
## e = 2.71828183
## πe² = 23.21340436
“10.8f” means that a value should be formatted as a float, be of width at least ten
characters (text columns), and use eight fractional digits.
e = 2.718281828459045
round(e, 2)
## 2.72
Exercise 2.4 Call help("round") to access the function’s manual. Note that the second argu-
ment, called ndigits, which we set to 2, defaults to None. Check what happens when we omit it
during the call.
Verifying that no other call scheme is permitted is left as an exercise, i.e., positionally
matched arguments must be listed before the keyword ones.
See the official documentation5 for the comprehensive list of objects available. On a
side note, all floating-point computations in any programming language are subject
to round-off errors and other inaccuracies. This is why the result of sin 𝜋 is not exactly
0, but some value very close thereto. We will elaborate on this topic in Section 5.5.6.
Packages can be given aliases, for the sake of code readability or due to our being lazy.
For instance, in Chapter 4 we will get used to importing the numpy package under the
np alias:
import numpy as np
Exercise 2.5 Call help("complex") to reveal that the complex class defines, amongst others,
the conjugate method and the real and imag slots.
5 https://docs.python.org/3/library/math.html
Logical results can be combined using and (conjunction; for testing if both operands are
true) and or (alternative; for determining whether at least one operand is true). Like-
wise, not stands for negation.
3 <= math.pi and math.pi <= 4 # is it between 3 and 4?
## True
not (1 > 2 and 2 < 3) and not 100 <= 3
## True
Notice that not 100 <= 3 is equivalent to 100 > 3. Also, based on the de Morgan laws,
not (1 > 2 and 2 < 3) is true if and only if 1 <= 2 or 2 >= 3 holds.
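We can even ask Python to confirm the latter equivalence (a quick check, not part of the original text):
(not (1 > 2 and 2 < 3)) == (1 <= 2 or 2 >= 3)
## True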
Exercise 2.6 Assuming that p, q, r are logical and a, b, c, d are variables of the type float,
simplify the following expressions:
• not not p,
• not p and not q,
• not (not p or not q or not r),
• not a == b,
• not (b > a and b < c),
• not (a>=b and b>=c and a>=c),
• (a>b and a<c) or (a<c and a>d).
Actually, we remained cool as a cucumber (nothing was printed) because x is equal to:
print(x)
## 0.6964691855978616
Multiple elif (else-if ) parts can be added. They are inspected one by one, until one of
the tests turns out to be successful. At the end, we can include an optional else part.
It is executed when all of the tested conditions turn out to be false.
if x < 0.25: print("spam!")
elif x < 0.5: print("ham!") # i.e., x in [0.25, 0.5)
elif x < 0.75: print("bacon!") # i.e., x in [0.5, 0.75)
else: print("eggs!") # i.e., x >= 0.75
## bacon!
Note that if we wrote the second condition as x >= 0.25 and x < 0.5, we would
introduce some redundancy; when it is being considered, we already know that x <
0.25 (the first test) is not true. Similarly, the else part is only executed when all the
tests fail, which in our case happens if neither x < 0.25, x < 0.5, nor x < 0.75 is
true, i.e., if x >= 0.75.
Important The indentation must be neat and consistent. We recommend using four
spaces. Note the kind of error generated when we try executing:
if x < 0.5:
    print("spam!")
  print("ham!") # :(
IndentationError: unindent does not match any outer indentation level
Exercise 2.7 For a given BMI, print out the corresponding category as defined by the WHO (un-
derweight if less than 18.5 kg/m², normal range up to 25.0 kg/m², etc.). Bear in mind that the
BMI is a simplistic measure. Both the medical and statistical communities pointed out its inher-
ent limitations. Read the Wikipedia article thereon for more details (and appreciate the amount
of data wrangling required for its preparation: tables, charts, calculations; something that we
will be able to perform quite soon, given quality reference data, of course).
Exercise 2.8 (*) Check if it is easy to find on the internet (in reliable sources) some raw datasets
related to the body mass studies, e.g., measuring subjects’ height, weight, body fat and muscle
mass, etc.
Exercise 2.9 Using the while loop, determine the arithmetic mean of 100 randomly generated
numbers (i.e., the sum of the numbers divided by 100).
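The definition of min3 falls on a page not included in this excerpt; a sketch consistent with the calls below might read:
def min3(a, b, c):
    """A sketch only: return the smallest of three given values."""
    if a <= b and a <= c:
        return a
    elif b <= c:
        return b
    else:
        return c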
Example calls:
print(min3(10, 20, 30),
min3(10, 30, 20),
min3(20, 10, 30),
min3(20, 30, 10),
min3(30, 10, 20),
min3(30, 20, 10))
## 10 10 10 10 10 10
Note that min3 returns a value. The result it yields can be consumed in further compu-
tations:
x = min3(np.random.rand(), 0.5, np.random.rand()) # minimum of 3 numbers
x = round(x, 3) # transform the result somehow
print(x)
## 0.5
Exercise 2.10 Write a function named bmi which computes and returns a person’s BMI, given
their weight (in kilograms) and height (in centimetres). As documenting functions constitutes a
good development practice, do not forget about including a docstring.
New variables can be introduced inside a function’s body. This can help the function
perform its duties.
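The redefined min3 referred to below is likewise not shown in this excerpt; a version relying on a local variable could look like this (a sketch, not necessarily the book's exact code):
def min3(a, b, c):
    """A sketch: the smallest of three values, computed via a local variable."""
    m = a              # `m` is local: it is independent of any global `m`
    if b < m:
        m = b
    if c < m:
        m = c
    return m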
Example call:
m = 7
n = 10
o = 3
min3(m, n, o)
## 3
All local variables cease to exist after the function is called. Notice that m inside the func-
tion is a variable independent of m in the global (calling) scope.
print(m) # this is still the global `m` from before the call
## 7
Exercise 2.11 Implement a function max3 which determines the maximum of three given val-
ues.
Exercise 2.12 Write a function med3 which defines the median of three given values (the value
that is in-between two other ones).
Exercise 2.13 (*) Indite a function min4 to compute the minimum of four values.
Unfortunately, once a module is loaded, any changes thereto will not be reflected until
the Python session is restarted. Thus, in an interactive environment (such as when
working with Jupyter notebooks), we may find the importlib.reload function useful.
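For instance, assuming we have already imported our own module named mymodule (a hypothetical name) and then edited its source file:
import importlib
import mymodule             # hypothetical: our own module, imported earlier
importlib.reload(mymodule)  # re-executes the module so that the edits take effect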
2.5 Exercises
Exercise 2.14 What does import xxxxxx as x mean?
Exercise 2.15 What is the difference between if and while?
Exercise 2.16 Name the scalar types we introduced in this chapter.
Exercise 2.17 What is a function’s docstring and how can we create and access it?
Exercise 2.18 What are keyword arguments of a function?
3 Sequential and other types in Python
3.1.1 Lists
Lists consist of arbitrary Python objects. They can be created using standalone square
brackets:
x = [True, "two", 3, [4j, 5, "six"], None]
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None]
The preceding is an example list featuring objects of the types: bool, str, int, list
(yes, it is possible to have a list inside another list), and None (the None object is the
only of this kind, it represents a placeholder for nothingness).
Note We will often be relying on lists when creating vectors in numpy or data frame
columns in pandas. Furthermore, lists of lists of equal lengths can be used to create
matrices.
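As a small preview of what is to come (matrices are the subject of Chapter 7):
import numpy as np
np.array([[1, 2, 3], [4, 5, 6]])  # a list of two lists of equal lengths
## array([[1, 2, 3],
##        [4, 5, 6]])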
Each list is mutable. Consequently, its state may freely be changed. For instance, we
can append a new object at its end:
x.append("spam")
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None, 'spam']
3.1.2 Tuples
Next, tuples are like lists, but they are immutable (read-only): once created, they cannot
be altered.
("one", [], (3j, 4))
## ('one', [], (3j, 4))
This gave us a triple (a 3-tuple) carrying a string, an empty list, and a pair (a 2-tuple).
Let’s stress that we can drop the round brackets and still get a tuple:
1, 2, 3 # the same as `(1, 2, 3)`
## (1, 2, 3)
Also:
42, # equivalently: `(42, )`
## (42,)
Note the trailing comma; we defined a singleton (a 1-tuple). It is not the same as the
scalar 42 or (42), which is an object of the type int.
Note Having a separate data type representing an immutable sequence makes sense
in certain contexts. For example, a data frame’s shape is its inherent property that
should not be tinkered with. If a tabular dataset has 10 rows and 5 columns, we dis-
allow the user to set the former to 15 (without making further assumptions, providing
extra data, etc.).
When creating collections of items, we usually prefer lists, as they are more flexible a
data type. Yet, Section 3.4.2 will mention that many functions return tuples. We are
thus expected to be able to handle them with confidence.
3.1.3 Ranges
Objects defined by calling range(from, to) or range(from, to, by) represent arith-
metic progressions of integers.
list(range(0, 5)) # i.e., range(0, 5, 1) – from 0 to 5 (exclusive) by 1
## [0, 1, 2, 3, 4]
list(range(10, 0, -1)) # from 10 to 0 (exclusive) by -1
## [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
We converted the two ranges to ordinary lists as otherwise their display is not particu-
larly spectacular. Let’s point out that the rightmost boundary (to) is exclusive and that
by defaults to 1.
Strings are most often treated as scalars (atomic entities, as in: a string as a whole).
However, we will soon find out that their individual characters can also be accessed
by index. Furthermore, Chapter 14 will discuss a plethora of operations on parts of
strings.
The valid indexes are 0, 1, … , 𝑛 − 2, 𝑛 − 1, where 𝑛 is the length (size) of the sequence,
which can be fetched by calling len.
Important Think of an index as the distance from the start of a sequence. For example,
x[3] means “three items away from the beginning”, i.e., the fourth element.
More examples:
range(0, 10)[-1] # the last item in an arithmetic progression
## 9
(1, )[0] # extract from a 1-tuple
## 1
Important The same “thing” can have different meanings in different contexts. There-
fore, we must always remain vigilant.
For instance, raw square brackets are used to create a list (e.g., [1, 2, 3]) whereas
their presence after a sequential object indicates some form of indexing (e.g., x[1] or
even [1, 2, 3][1]). Similarly, (1, 2) creates a 2-tuple and f(1, 2) denotes a call to
a function f with two arguments.
3.2.2 Slicing
We can also use slices of the form from:to or from:to:by to select a subsequence of
a given sequence. Slices are similar to ranges, but `:` can only be used within square
brackets.
x = ["one", "two", "three", "four", "five"]
x[1:4] # from the second to the fifth (exclusive)
## ['two', 'three', 'four']
x[-1:0:-2] # from the last to first (exclusive) by every second backwards
## ['five', 'three']
In fact, the from and to parts of a slice are optional. When omitted, they default to
one of the sequence boundaries.
x[3:] # from the third element to the end
## ['four', 'five']
x[:2] # the first two
## ['one', 'two']
x[:0] # none (the first zero)
## []
x[::2] # every second element from the start
## ['one', 'three', 'five']
Knowing the difference between element extraction and subsetting a sequence (creat-
ing a subsequence) is crucial. For example:
x[0] # extraction (indexing with a single integer)
## 'one'
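The call above gave us the string stored at that position. The companion subsetting example, whose display falls on a cut page, would presumably read:
x[:1]  # subsetting (indexing with a slice)
## ['one']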
It returned the object of the same type as x (here, a list), even though, in this case, only
one object was fetched. However, a slice can potentially select any number of elements,
including zero.
pandas data frames and numpy arrays will behave similarly, but there will be many
more indexing options; see Section 5.4, Section 8.2, and Section 10.5.
Exercise 3.1 There are quite a few methods that modify list elements: not only the aforemen-
tioned append, but also insert, remove, pop, etc. Invoke help("list"), read their descrip-
tions, and call them on a few example lists.
Exercise 3.2 Verify that similar operations cannot be performed on tuples, ranges, and strings.
In other words, check that these types are immutable.
Exercise 3.3 In the documentation of the list and other classes, check out the count and in-
dex methods.
3.3 Dictionaries
Dictionaries are sets of key:value pairs, where the value (any Python object) can be
accessed by key (usually1 a string). In other words, they map keys to values.
1 Overall, hashable data types can be used as dictionary keys, e.g., integers, floats, strings, tuples, and other immutable objects; mutable types such as lists cannot serve as keys.
x = {
"a": [1, 2, 3],
"b": 7,
"z": "spam!"
}
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}
We can also create a dictionary with string keys using the dict function which accepts
any keyword arguments:
dict(a=[1, 2, 3], b=7, z="spam!")
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}
The index operator extracts a specific element from a dictionary, uniquely identified
by a given key:
x["a"]
## [1, 2, 3]
In this context, x[0] is not valid and raises an error: a dictionary is not an object of
sequential type; a key of 0 does not exist in x. If we are unsure whether a specific key
is defined, we can use the in operator:
"a" in x, 0 not in x, "z" in x, "w" in x # a tuple of four tests' results
## (True, True, True, False)
There is also a method called get, which returns an element associated with a given
key, or something else (by default, None) if we have a mismatch:
x.get("a")
## [1, 2, 3]
x.get("c") # if missing, returns None by default
x.get("c") is None # indeed
## True
x.get("c", "unknown")
## 'unknown'
We can also add new elements to a dictionary using the index operator:
x["f"] = "more spam!"
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!', 'f': 'more spam!'}
Example 3.4 (*) In practice, we often import JSON files (which is a popular data exchange
format on the internet) exactly in the form of Python dictionaries. Let’s demo it briefly:
import requests
x = requests.get("https://api.github.com/users/gagolews/starred").json()
Now x is a sequence of dictionaries giving the information on the repositories starred by yours
truly on GitHub. As an exercise, the reader is encouraged to inspect its structure.
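One possible starting point (assuming the request succeeded and returned a non-empty list):
type(x), len(x)          # a list of dictionaries, one per starred repository
sorted(x[0].keys())[:5]  # a few of the keys describing the first repository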
Exercise 3.5 Take a look at the documentation of the extend method for the list class. The
manual page suggests that this operation takes any iterable object. Feed it with a list, tuple, range,
and a string and see what happens.
The notion of iterable objects is essential, as they appear in many contexts. There ex-
ist other iterable types that are, for example, non-sequential: we cannot access their
elements at random using the index operator.
Exercise 3.6 (*) Check out the enumerate, zip, and reversed functions and what kind of
iterable objects they return.
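As a gentle hint, here is the kind of output they produce once materialised with list:
list(enumerate(["a", "b", "c"]))       # pairs: (index, element)
## [(0, 'a'), (1, 'b'), (2, 'c')]
list(zip([1, 2, 3], ["x", "y", "z"]))  # pairs of corresponding elements
## [(1, 'x'), (2, 'y'), (3, 'z')]
list(reversed([1, 2, 3]))              # elements in reverse order
## [3, 2, 1]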
Example 3.7 Let’s compute the elementwise multiplication of two vectors of equal lengths,
i.e., the product of their corresponding elements:
x = [1, 2, 3, 4, 5] # for testing
y = [1, 10, 100, 1000, 10000] # just a test
z = [] # result list – start with an empty one
for i in range(len(x)):
    tmp = x[i] * y[i]
    print(f"The product of {x[i]:6} and {y[i]:6} is {tmp:6}")
    z.append(tmp)
## The product of 1 and 1 is 1
## The product of 2 and 10 is 20
## The product of 3 and 100 is 300
## The product of 4 and 1000 is 4000
## The product of 5 and 10000 is 50000
The items were printed with a little help of f-strings; see Section 2.1.3. Here is the resulting list:
print(z)
## [1, 20, 300, 4000, 50000]
And now:
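# the dictionary used below is defined on a page not included in this
# excerpt; judging by the output that follows, it was something like:
map = {"apple": "red", "pear": "yellow", "kiwi": "green"}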
x = ["apple", "pear", "apple", "kiwi", "apple", "kiwi"]
recoded_x = []
for fruit in x:
    recoded_x.append(map[fruit])  # or, e.g., map.get(fruit, "unknown")
print(recoded_x)
## ['red', 'yellow', 'red', 'green', 'red', 'green']
Exercise 3.9 Here is a function that determines the minimum of a given iterable object (com-
pare the built-in min function, see help("min")).
import math
def mymin(x):
    """
    Fetches the smallest element in an iterable object x.
    We assume that x consists of numbers only.
    """
    curmin = math.inf  # infinity is greater than any other number
    for e in x:
        if e < curmin:
            curmin = e  # a better candidate for the minimum
    return curmin
Note that due to the use of math.inf, the function operates under the assumption that all ele-
ments in x are numeric. Rewrite it so that it will work correctly, e.g., in the case of lists of strings.
Exercise 3.10 Using the for loop, author some basic versions of the built-in max, sum, any, and
all functions.
Exercise 3.11 (*) The glob function in the glob module lists all files in a given directory whose
names match a specific wildcard, e.g., glob.glob("~/Music/*.mp3") gives the list of MP3
files in the current user’s home directory; see Section 13.6.1. Moreover, getsize from the os.path module returns the size of a file, in bytes. Compose a function that determines the total size
of all the files in a given directory.
This is useful, for example, when the swapping of two elements is needed:
a, b = 1, 2 # the same as (a, b) = (1, 2) – parentheses are optional
a, b = b, a # swap a and b
Another use case is where we fetch outputs of functions that return many objects at
once. For instance, later we will learn about numpy.unique which (depending on argu-
ments passed) may return a tuple of arrays:
import numpy as np
result = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)
print(result)
## (array([1, 2, 3]), array([5, 3, 1]))
we can write:
values, counts = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)
Note (**) If there are more values to unpack than the number of identifiers, we can
use the notation like *name inside the tuple_of_identifiers on the left side of the
assignment operator. Such a placeholder gathers all the surplus objects in the form of
a list:
for a, b, *c, d in [range(4), range(10), range(3)]:
    print(a, b, c, d, sep="; ")
## 0; 1; [2]; 3
## 0; 1; [2, 3, 4, 5, 6, 7, 8]; 9
## 0; 1; []; 2
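The examples below call a function named test whose definition is not visible in this excerpt; judging from the printed output, a consistent sketch would be:
def test(a, b, c, d):
    """A sketch reconstructed from the outputs shown below."""
    print(f"a = {a}, b = {b}, c = {c}, d = {d}")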
Arguments to be matched positionally can be wrapped inside any iterable object and then
unpacked using the asterisk operator:
args = [1, 2, 3, 4] # merely an example
test(*args) # just like test(1, 2, 3, 4)
## a = 1, b = 2, c = 3, d = 4
Keyword arguments can be wrapped inside a dictionary and unpacked with a double aster-
isk:
kwargs = dict(a=1, c=3, d=4, b=2)
test(**kwargs)  # just like test(a=1, b=2, c=3, d=4)
## a = 1, b = 2, c = 3, d = 4
The unpackings can be intertwined. For this reason, the following calls are equivalent:
test(1, *range(2, 4), 4)
## a = 1, b = 2, c = 3, d = 4
test(1, **dict(d=4, c=3, b=2))
## a = 1, b = 2, c = 3, d = 4
test(*range(1, 3), **dict(d=4, c=3))
## a = 1, b = 2, c = 3, d = 4
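The next example relies on a redefined, variadic version of test (again, the actual definition is not shown here); a sketch consistent with the output:
def test(a, b, *args, **kwargs):
    """A sketch: gathers surplus positional and keyword arguments."""
    print(f"a = {a}, b = {b}, args = {args}, kwargs = {kwargs}")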
For example:
test(1, 2, 3, 4, 5, spam=6, eggs=7)
## a = 1, b = 2, args = (3, 4, 5), kwargs = {'spam': 6, 'eggs': 7}
We see that *args gathers all the positionally matched arguments (except a and b,
which were set explicitly) into a tuple. On the other hand, **kwargs is a dictionary
that stores all keyword arguments that are not mentioned in the function’s parameter
list.
Exercise 3.13 From time to time, we will be coming across *args and **kwargs in various con-
texts. Study what matplotlib.pyplot.plot uses them for (by calling help(plt.plot)).
x = [1, 2, 3]
y = x
the assignment operator does not create a copy of x; both x and y refer to the same
object in the computer’s memory.
Important If x is mutable, any change made to it will affect y (as, again, they are two
different means to access the same object). This will also be true for numpy arrays and
pandas data frames.
For example:
x.append(4)
print(y)
## [1, 2, 3, 4]
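The myadd function called below is defined on a page not reproduced here; its behaviour suggests that it appends its second argument to the first one in place, e.g.:
def myadd(x, y):
    """A sketch: modify the list x in place by appending y to it."""
    x.append(y)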
And now:
myadd(x, 5)
myadd(y, 6)
print(x)
## [1, 2, 3, 4, 5, 6]
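The chunk that the next sentence refers to falls on a cut page; it presumably rebinds x to a brand-new list first, e.g.:
x = [0, 1, 2]   # hypothetical values: x now refers to a new object
x.append(3)
print(y)        # y still refers to the previous list
## [1, 2, 3, 4, 5, 6]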
This did not change the object referred to as y because it is now a different entity.
Additionally, random.shuffle is a function (not: a method) that changes the state of the
argument:
x = [5, 3, 2, 4, 1]
import random
random.shuffle(x) # modifies x in place, returns nothing
print(x)
## [1, 4, 3, 5, 2]
Later we will learn about the Series class in pandas, which represents data frame
columns. It has the sort_values method which, by default, returns a sorted copy of
the object it acts upon:
import pandas as pd
x = pd.Series([5, 3, 2, 4, 1])
print(list(x.sort_values())) # inplace=False
## [1, 2, 3, 4, 5]
print(list(x)) # unchanged
## [5, 3, 2, 4, 1]
Important We are always advised to study the official3 documentation of every func-
tion we call. Although surely some patterns arise (such as: a method is more likely to
modify an object in place whereas a similar standalone function will be returning a
copy), ultimately, the functions’ developers are free to come up with some exceptions
to them if they deem it more sensible or convenient.
3.7 Exercises
Exercise 3.14 Name the sequential objects we introduced.
Exercise 3.15 Is every iterable object sequential?
Exercise 3.16 Is dict an instance of a sequential type?
Exercise 3.17 What is the meaning of `+` and `*` operations on strings and lists?
Exercise 3.18 Given a list x of numeric scalars, how can we create a new list of the same length
giving the squares of all the elements in the former?
Exercise 3.19 (*) How can we make an object copy and when should we do so?
Exercise 3.20 What is the difference between x[0], x[1], x[:0], and x[:1], where x is a se-
quential object?
Part II
Unidimensional data
4 Unidimensional numeric data and their empirical distribution
Our data wrangling adventure starts the moment we get access to loads of data points
representing some measurements, such as industrial sensor readings, patient body
measures, employee salaries, city sizes, etc.
For instance, consider the heights of adult females (in centimetres) in the longitudinal
study called National Health and Nutrition Examination Survey (NHANES1 ) conduc-
ted by the US Centres for Disease Control and Prevention.
heights = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
This is an example of quantitative (numeric) data. They are in the form of a series of
numbers. It makes sense to apply various mathematical operations on them, includ-
ing subtraction, division, taking logarithms, comparing, and so forth.
Most importantly, here, all the observations are independent of each other. Each value
represents a different person. Our data sample consists of 4 221 points on the real line:
a bag of points whose actual ordering does not matter. We depicted them in Figure 4.1.
However, we see that merely looking at the raw numbers themselves tells us nothing.
They are too plentiful.
This is why we are interested in studying a multitude of methods that can bring some
insight into the reality behind the numbers. For example, inspecting their distribu-
tion.
1 https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx
Figure 4.1. The heights dataset is comprised of independent points on the real line.
We added some jitter on the y-axis for dramatic effects only.
import numpy as np
Our code can now refer to the objects defined therein as np.spam, np.bacon, or np.
spam.
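The chunk creating the example vector is not visible in this excerpt; judging by the conversions shown below, it was along the lines of:
x = np.array([10, 20, 30, 40, 50, 60])
x
## array([10, 20, 30, 40, 50, 60])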
Here, the vector elements were specified by means of an ordinary list. Ranges and
tuples can also be used as content providers. The earnest readers are encouraged to
check it now themselves.
We can therefore fetch its length also by accessing x.shape[0]. On a side note,
matrices – two-dimensional arrays discussed in Chapter 7 - will be of shape like
(number_of_rows, number_of_columns).
Important Recall that Python lists, e.g., [1, 2, 3], represent simple sequences of
objects of any kind. Their use cases are very broad, which is both an advantage and
something quite the opposite. Vectors in numpy are like lists, but on steroids. They are
powerful in scientific computing because of the underlying assumption that each ob-
ject they store is of the same type5 . In most scenarios, we will be dealing with vectors
of logical values, integers, and floating-point numbers. Thanks to this, a wide range of
5 (*) Vectors are directly representable as simple arrays in the C programming language, in which the
numpy methods are written. Operations on vectors are very fast provided that we rely on functions that
process them as a whole. The readers with some background in other lower-level languages will need to get
out of the habit of processing individual elements using for-like loops.
methods for performing the most popular mathematical operations could have been
defined.
To show that other element types are also available, we can convert it to a vector with
elements of the type float:
x.astype(float) # or np.array(x, dtype=float)
## array([10., 20., 30., 40., 50., 60.])
Let’s emphasise this vector is printed differently from its int-based counterpart: note
the decimal separators. Furthermore:
np.array([True, False, False, True])
## array([ True, False, False, True])
gives a logical vector, for the array constructor detected that the common type of all
the elements is bool. Also:
np.array(["spam", "spam", "bacon", "spam"])
## array(['spam', 'spam', 'bacon', 'spam'], dtype='<U5')
yields an array of strings in Unicode (i.e., capable of storing any character in any al-
phabet, emojis, mathematical symbols, etc.), each of no more than five6 code points
in length.
6 (*) Longer strings stored in such an array would end up being truncated. We shall see that this can be remedied by calling x.astype("<U10").
7 https://numpy.org/doc/stable/reference/index.html
np.repeat(3, 6) # six 3s
## array([3, 3, 3, 3, 3, 3])
np.repeat([1, 2], 3) # three 1s, three 2s
## array([1, 1, 1, 2, 2, 2])
np.repeat([1, 2], [3, 5]) # three 1s, five 2s
## array([1, 1, 1, 2, 2, 2, 2, 2])
In the last case, every element from the list passed as the first argument was repeated
the corresponding number of times, as defined by the second argument.
numpy.tile, on the other hand, repeats a whole sequence with recycling:
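The corresponding example falls on a cut page; for instance:
np.tile([1, 2], 3)  # repeat the whole sequence three times
## array([1, 2, 1, 2, 1, 2])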
Notice the difference between the above and the result of numpy.repeat([1, 2], 3).
See also9 numpy.zeros and numpy.ones for some specialised versions of the foregoing
functions.
Here, nan stands for a not-a-number and is used as a placeholder for missing values
(discussed in Section 15.1) or wrong results, such as the square root of −1 in the domain
of reals. The inf object, on the other hand, denotes the infinity, ∞. We can think of it
as a value that is too large to be represented in the set of floating-point numbers.
We see that numpy.r_ uses square brackets instead of round ones. This is smart for
we mentioned in Section 3.2.2 that slices (`:`) can only be created inside the index
operator. And so:
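The chunk that followed is not visible here; an illustration of this slice notation inside numpy.r_ might be:
np.r_[0:1:5j]  # five equidistant points between 0 and 1 (inclusive)
## array([0.  , 0.25, 0.5 , 0.75, 1.  ])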
8 DuckDuckGo also supports search bangs like “!numpy linspace”. They redirect us to the official docu-
mentation automatically.
9 When we write “see also”, it means that this is an exercise for the reader (Rule #3), in this case: to look these functions up in the documentation.
Here, 5j does not have its literal meaning (a complex number). By an arbitrary con-
vention, and only in this context, it designates the output length of the sequence to
be generated. Could the numpy authors do that? Well, they could, and they did. End of
story.
We can also combine many types of sequences into one:
np.r_[1, 2, [3]*2, 0:3, 0:3:3j]
## array([1. , 2. , 3. , 3. , 0. , 1. , 2. , 0. , 1.5, 3. ])
and to pick a few values from a given set with replacement (selecting the same value
multiple times is allowed):
np.random.choice(np.arange(1, 10), 20) # replace=True
## array([7, 7, 4, 6, 6, 2, 1, 7, 2, 1, 8, 9, 5, 5, 9, 8, 1, 2, 6, 6])
Note Mathematical notation is pleasantly abstract (general) in the sense that 𝒙 can be
anything, e.g., data on the incomes of households, sizes of the largest cities in some
country, or heights of participants in some longitudinal study. At first glance, such
a representation of objects from the so-called real world might seem overly simplistic,
especially if we wish to store information on very complex entities. Nonetheless, in
most cases, expressing them as vectors (i.e., establishing a set of numeric attributes
that best describe them in a task at hand) is not only natural but also perfectly suffi-
cient for achieving whatever we aim for.
By 𝑥(𝑖) (notice the bracket11 ) we denote the 𝑖-th order statistic, that is, the 𝑖-th smallest
value in 𝒙. In particular, 𝑥(1) is the sample minimum and 𝑥(𝑛) is the maximum. Here
is the same in Python:
x = np.array([5, 4, 2, 1, 3]) # just an example
x_sorted = np.sort(x)
x_sorted[0], x_sorted[-1] # the minimum and the maximum
## (1, 5)
To avoid the clutter of notation, in certain formulae (e.g., in the definition of the type-7
quantiles in Section 5.1.1), we will be assuming that 𝑥(0) is the same as 𝑥(1) and 𝑥(𝑛+1)
is equivalent to 𝑥(𝑛) .
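The histogram-drawing chunk for the heights dataset is not reproduced in this excerpt; a call consistent with the discussion below (11 bins; matplotlib.pyplot.hist returns the bin counts, the bin edges, and the drawn bars) might be:
counts, bins, _ = plt.hist(heights, bins=11,
    color="lightgray", edgecolor="black")
plt.ylabel("Count")
plt.show()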
The data were split into 11 bins. Then, they were plotted so that the bar heights are pro-
portional to the number of observations falling into each of the 11 intervals. The bins
are non-overlapping, adjacent to each other, and of equal lengths. We can read their
coordinates by looking at the bottom side of each rectangular bar. For example, circa
1200 observations fall into the interval [158, 163] (more or less) and roughly 400 into
[168, 173] (approximately). To get more precise information, we can query the return
objects:
bins # 12 interval boundaries; give 11 bins
## array([131.1 , 136.39090909, 141.68181818, 146.97272727,
## 152.26363636, 157.55454545, 162.84545455, 168.13636364,
11 Some textbooks denote the 𝑖-th order statistic by 𝑥𝑖∶𝑛 , but we will not.
12 https://matplotlib.org/
Figure 4.2. A histogram of the heights dataset: the empirical distribution is nicely
bell-shaped.
This distribution is nicely symmetrical around about 160 cm. Traditionally, we are
used to saying that it is in the shape of a bell. The most typical (normal, common) ob-
servations are somewhere in the middle, and the probability mass decreases quickly
on both sides.
As a matter of fact, in Chapter 6, we will model this dataset using a normal distribution
and obtain an excellent fit. In particular, we will mention that observations outside the
interval [139, 181] are very rare (probability less than 1%; via the 3𝜎 rule, i.e., expected
value ± 3 standard deviations).
13 (*) A quantity that arises as a sum of many simpler components of some more complex entity, assuming that they are independent and follow the same
(any!) distribution with finite variance, is approximately normally distributed. This is called the Central
Limit Theorem and it is a very strong mathematical result.
[Figure 4.3: a histogram of the income dataset; y-axis: Count.]
We notice that the probability density quickly increases, reaches its peak at around
£15 500–£35 000, and then slowly goes down. We say that it has a long tail on the right,
or that it is right- or positive-skewed. Accordingly, there are several people earning a de-
cent amount of money. It is quite a non-normal distribution. Most people are rather
unwealthy: their income is way below the typical per-capita revenue (being the average
income for the whole population).
In Section 6.3.1, we will note that such a distribution is frequently encountered in bio-
logy and medicine, social sciences, or technology. For instance, the number of sexual
partners or weights of humans are believed to be aligned this way.
14 For privacy and other reasons, the UK Office for National Statistics does not detail the individual taxpayers’ incomes. This is why we needed to guesstimate them based on more coarse-grained data from a report published at https://www.ons.gov.uk/peoplepopulationandcommunity.
Note Looking at Figure 4.3, we might have taken note of the relatively higher bars, as compared to their neighbours, at c. £100 000 and £120 000. Even though we might
be tempted to try to invent a story about why there can be some difference in the rel-
ative probability mass, we ought to refrain from it. As our data sample is small, they
might merely be due to some natural variability (Section 6.4.4). Of course, there might
be some reasons behind it (theoretically), but we cannot read this only by looking at a
single histogram. In other words, a histogram is a tool that we use to identify some
rather general features of the data distribution (like the overall shape), not the specif-
ics.
For example, in the histogram with five bins, we miss the information that the
c. £20 000 income is more popular than the c. £10 000 one (as given by the first two
bars in Figure 4.3). On the other hand, the histogram with 200 bins seems to be too
fine-grained already.
Figure 4.4. Too few and too many histogram bins (the income dataset).
more nuanced. For instance, the histogram on the left side of Figure 4.4 hides the
poorest households inside the first bar: the first income bracket is very wide. If we
cannot request access to the original data, the best we can do is to simply ignore such
a data visualisation instance and warn others not to trust it. A true data scientist must
be sceptical.
Also, note that in the right histogram, we know exactly what the income of the
wealthiest person is. From the perspective of privacy, this might be a bit insensitive.
Thus, there are 238 observations both in the [15 461, 25 172) and [25 172, 34 883) intervals.
Note A table of ranges and the corresponding counts can be effective for data reporting. It is more informative and takes less space than a series of raw numbers, especially
if we present them like in the table below.
Table 4.1. Incomes of selected British households; the bin edges are pleasantly round
numbers
income bracket [£1000s] count
0–20 236
20–40 459
40–60 191
60–80 64
80–100 26
100–120 11
120–140 10
140–160 2
160–180 0
180–200 1
Reporting data in tabular form can also increase the privacy of the subjects (making
subjects less identifiable, which is good) or hide some uncomfortable facts (which is
not so good; “there are ten people in our company earning more than £200 000 p.a.” –
this can be as much as £10 000 000, but shush).
Exercise 4.5 Find out how we can provide the matplotlib.pyplot.hist and numpy.
histogram functions with custom bin breaks. Plot a histogram where the bin edges are 0,
20 000, 40 000, etc. (just like in the above table). Also let’s highlight the fact that bins do not
have to be of equal sizes: set the last bin to [140 000, 200 000].
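One way to approach it (a sketch, assuming the income vector introduced earlier in this chapter is available):
bins = np.r_[np.arange(0, 140001, 20000), 200000]  # 0, 20 000, ..., 140 000, 200 000
plt.hist(income, bins=bins, color="lightgray", edgecolor="black")
plt.ylabel("Count")
plt.show()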
Exercise 4.6 (*) There are quite a few heuristics to determine the number of bins automagic-
ally, see numpy.histogram_bin_edges for a few formulae. Check out how different values of
the bins argument (e.g., "sturges", "fd") affect the histogram shapes on both income and
heights datasets. Each has its limitations, none is perfect, but some might be a sensible starting
point for further fine-tuning.
We will get back to the topic of manual data binning in Section 11.1.4.
16 http://www.pedestrian.melbourne.vic.gov.au/
This time, data have already been binned by somebody else. Consequently, we cannot
use matplotlib.pyplot.hist to depict them. Instead, we can rely on a more low-level
function, matplotlib.pyplot.bar; see Figure 4.5.
plt.bar(np.arange(24)+0.5, width=1, height=peds,
color="lightgray", edgecolor="black")
plt.xticks([0, 6, 12, 18, 24])
plt.show()
[Figure 4.5: a bar plot of the peds dataset (hourly pedestrian counts); x-axis: hour of the day.]
marathon = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/37_pzu_warsaw_marathon_mins.txt")
Figure 4.7 gives the histogram for the participants who finished the 42.2 km
run in less than three hours, i.e., a truncated version of this dataset (more information
about subsetting vectors using logical indexers will be given in Section 5.4).
plt.hist(marathon[marathon < 180], color="lightgray", edgecolor="black")
plt.ylabel("Count")
plt.show()
Figure 4.7. A histogram of a truncated version of the marathon dataset: the distribu-
tion is left-skewed.
We revealed that the data are highly left-skewed. This was not unexpected. There are
only a few elite runners in the game, but, boy, are they fast. Yours truly wishes his
personal best would be less than 180 minutes someday. We shall see. Running is fun,
and so is walking; why not take a break for an hour and go outside?
Exercise 4.7 Plot the histogram of the untruncated (complete) version of this dataset.
cities = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/us_cities_2000.txt")
Let’s focus only on the cities whose population is not less than 10 000 (another instance
of truncating, this time on the other side of the distribution). Even though they con-
stitute roughly 14% of all the US settlements, they are home to as much as about 84%
of all the US citizens.
large_cities = cities[cities >= 10000]
len(large_cities) / len(cities)
## 0.13863320820692138
np.sum(large_cities) / np.sum(cities) # more on aggregation functions later
## 0.8351248599553305
Here are the populations of the five largest cities (can we guess which ones are they?):
large_cities[-5:] # preview last five – data are sorted increasingly
## array([1517550., 1953633., 2896047., 3694742., 8008654.])
[Figure 4.8: a histogram of the large_cities dataset on a linear scale; y-axis: Count.]
The histogram is virtually unreadable because the distribution is not just right-
skewed; it is extremely heavy-tailed. Most cities are small, and those that are crowded,
such as New York, are really enormous. Had we plotted the whole dataset (cities
instead of large_cities), the results’ intelligibility would be even worse. For this
reason, we should rather draw such a distribution on the logarithmic scale; see Fig-
ure 4.9.
logbins = np.geomspace(np.min(large_cities), np.max(large_cities), 21)
plt.hist(large_cities, bins=logbins, color="lightgray", edgecolor="black")
plt.xscale("log")
plt.ylabel("Count")
plt.show()
Figure 4.9. Another histogram of the same large_cities dataset: the distribution is
right-skewed even on a logarithmic scale.
The log-scale causes the x-axis labels not to increase linearly anymore: it is no longer
based on steps of equal sizes, giving 0, 1 000 000, 2 000 000, …, and so forth. Instead,
the increases are now geometrical: 10 000, 100 000, 1 000 000, etc.
The current dataset enjoys a right-skewed distribution even on the logarithmic scale.
Many real-world datasets behave alike; e.g., the frequencies of occurrences of words
in books. On a side note, Chapter 6 will discuss the Pareto distribution family which
yields similar histograms.
Because the natural logarithm is the inverse of the exponential function and vice versa
(compare Section 5.2), points equidistant on the logarithmic scale can also be
generated as follows:
np.round(np.exp(
np.linspace(
np.log(np.min(large_cities)),
np.log(np.max(large_cities)),
21
)))
## array([ 10001., 13971., 19516., 27263., 38084., 53201.,
## 74319., 103818., 145027., 202594., 283010., 395346.,
## 552272., 771488., 1077717., 1505499., 2103083., 2937867.,
## 4104005., 5733024., 8008654.])
Exercise 4.8 Draw the histogram of income on the logarithmic scale. Does it resemble a bell-
shaped distribution? We will get back to this topic in Section 6.3.1.
Very similar is the plot of the empirical cumulative distribution function (ECDF), which for
a sample 𝒙 = (𝑥1 , … , 𝑥𝑛 ) we denote by 𝐹𝑛̂ . And so, at any given point 𝑡 ∈ ℝ, 𝐹𝑛̂ (𝑡) is
a step function19 that gives the proportion of observations in our sample that are not greater
than 𝑡:
$$\hat{F}_n(t) = \frac{\left|\{ i = 1, \dots, n : x_i \le t \}\right|}{n}.$$
We read |{𝑖 = 1, … , 𝑛 ∶ 𝑥𝑖 ≤ 𝑡}| as the number of indexes like 𝑖 such that the corres-
ponding 𝑥𝑖 is less than or equal to 𝑡. Given the ordered inputs 𝑥(1) < 𝑥(2) < ⋯ < 𝑥(𝑛) ,
we have:
$$\hat{F}_n(t) = \begin{cases} 0 & \text{for } t < x_{(1)}, \\ k/n & \text{for } x_{(k)} \le t < x_{(k+1)}, \\ 1 & \text{for } t \ge x_{(n)}. \end{cases}$$
19 We cannot see the steps in Figure 4.11 for the points are too plentiful.
Let’s underline the fact that drawing the ECDF does not involve binning; we only need
to arrange the observations in an ascending order. Then, assuming that all observa-
tions are unique (there are no ties), the arithmetic progression 1/𝑛, 2/𝑛, … , 𝑛/𝑛 is
plotted against them; see Figure 4.11.
n = len(heights)
heights_sorted = np.sort(heights)
plt.plot(heights_sorted, np.arange(1, n+1)/n, drawstyle="steps-post")
plt.xlabel("$t$") # LaTeX maths
plt.ylabel("$\\hat{F}_n(t)$, i.e., Prob(height $\\leq$ t)")
plt.show()
Thus, for example, the height of 150 cm is not exceeded by 10% of the women.
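We can verify such a reading directly on the data (a quick check of ours):
np.mean(heights <= 150)  # the proportion of women no taller than 150 cm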
Note (*) Quantiles (which we introduce in Section 5.1.1) can be considered a general-
ised inverse of the ECDF.
4.4 Exercises
Exercise 4.9 What is the difference between numpy.arange and numpy.linspace?
20 (*) We are using (La)TeX maths typesetting within "$...$" to obtain nice plot labels, see [72] for a
comprehensive introduction.
Figure 4.11. The empirical cumulative distribution function for the heights dataset.
Exercise 4.10 (*) What happens when we convert a logical vector to a numeric one using the
astype method? And what about when we convert a numeric vector to a logical one? We will
discuss that later, but you might want to check it yourself now.
Exercise 4.11 Check what happens when we try to create a vector storing a mix of logical, in-
teger, and floating-point values.
Exercise 4.12 Answer the following questions:
• What is a bell-shaped distribution?
• What is a right-skewed distribution?
• What is a heavy-tailed distribution?
• What is a multi-modal distribution?
Exercise 4.13 (*) When does logarithmic binning make sense?
5
Processing unidimensional data
Seldom will our datasets bring valid and valuable insights out of the box. The ones we
are using for illustrative purposes in the first part of our book have already been
curated. After all, it is an introductory course. We need to build the necessary skills
up slowly, minding not to overwhelm the tireless reader with too much information
all at once. We learn simple things first, learn them well, and then we move to more
complex matters with a healthy level of confidence.
In real life, various data cleansing and feature engineering techniques will need to be ap-
plied. Most of them are based on the simple operations on vectors that we cover in this
chapter:
• summarising data (for example, computing the median or sum),
• transforming values (applying mathematical operations on each element, such as
subtracting a scalar or taking the natural logarithm),
• filtering (selecting or removing observations that meet specific criteria, e.g., those
that are larger than the arithmetic mean ± 3 standard deviations).
Important Chapter 10 will be applying the same operations on individual data frame
columns.
revealing too much might not be a clever idea for privacy or confidentiality reasons1 .
Consequently, we might be interested in even more coarse descriptions: data aggreg-
ates which reduce the whole dataset into a single number reflecting some of its char-
acteristics. Such summaries can provide us with a kind of bird’s-eye view of some of
the dataset’s aspects.
In this part, we discuss a few noteworthy measures of:
• location; e.g., central tendency measures such as the arithmetic mean and median;
• dispersion; e.g., standard deviation and interquartile range;
• distribution shape; e.g., skewness.
We also introduce box-and-whisker plots.
• the arithmetic mean, being the sum of all the values divided by the sample size:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
• the median, being the middle value in a sorted version of the sample if its length is
odd, or the arithmetic mean of the two middle values otherwise:
$$m = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd}, \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even}. \end{cases}$$
• for skewed distributions, the arithmetic mean will be biased towards the heavier
tail.
Exercise 5.1 Get the arithmetic mean and median for the 37_pzu_warsaw_marathon_mins
dataset mentioned in Chapter 4.
Exercise 5.2 (*) Write a function that computes the median based on its mathematical defini-
tion and numpy.sort.
Note (*) Technically, the arithmetic mean can also be computed using the mean method
for the numpy.ndarray class. It will sometimes be the case that we have many ways to
perform the same operation. We can even compose it manually using the sum function.
Thus, all the following expressions are equivalent:
print(
np.mean(income),
income.mean(),
np.sum(income)/len(income),
income.sum()/income.shape[0]
)
## 35779.994 35779.994 35779.994 35779.994
Unfortunately, the median method for vectors is not available. As functions are more
universal in numpy, we prefer sticking with them.
Comparing this new result to the previous one, oh we all feel much richer now, don’t
we? In fact, the arithmetic mean reflects the income each of us would get if all the
wealth were gathered inside a single Santa Claus’s (Robin Hood’s or Joseph Stalin’s)
sack and then distributed equally amongst all of us. A noble idea provided that every-
one contributes equally to the society which, sadly, is not the case.
On the other hand, the median is the value such that 50% of the observations are not
greater than it and 50% are not less than it. Hence, it is not at all sensitive to most
of the data points on both the left and the right side of the distribution:
print(np.median(income), np.median(income2))
## 30042.0 30076.0
We cannot generally say that one measure is preferred to the other. It depends on the
context (the nature of data, the requirements, etc.). Certainly, for symmetrical dis-
tributions with no outliers (e.g., heights), the mean will be better as it uses all data
(and its efficiency can be proven for certain statistical models). For skewed distribu-
tions (e.g., income), the median has a nice interpretation, as it gives the value in the
middle of the ordered sample. Remember that these data summaries allow us to look
at a single data aspect only, and there can be many different, valid perspectives. The
reality is complex.
Sample quantiles
Quantiles generalise the notion of the sample median and of the inverse of the empir-
ical cumulative distribution function (Section 4.3.8). They provide us with the value
that is not exceeded by the elements in a given sample with a predefined probability.
Before proceeding with their formal definition, which is quite technical, let’s point out
that for larger sample sizes, we have the following rule of thumb.
Important For any 𝑝 between 0 and 1, the 𝑝-quantile, denoted 𝑞𝑝 , is a value dividing
the sample in such a way that approximately 100𝑝% of observations are not greater
than 𝑞𝑝 , and the remaining c. 100(1 − 𝑝)% are not less than 𝑞𝑝 .
Quantiles appear under many different names, but they all refer to the same concept.
In particular, we can speak about the 100𝑝-th percentiles, e.g., the 0.5-quantile is the
same as the 50th percentile. Furthermore:
• 0-quantile (𝑞0 ) is the minimum (also: numpy.min),
• 0.25-quantile (𝑞0.25 ) equals to the first quartile (denoted 𝑄1 ),
• 0.5-quantile (𝑞0.5 ) is the second quartile a.k.a. median (denoted 𝑄2 or 𝑚),
• 0.75-quantile (𝑞0.75 ) is the third quartile (denoted 𝑄3 ),
• 1.0-quantile (𝑞1 ) is the maximum (also: numpy.max).
Here are these five aggregates for our two example datasets:
np.quantile(heights, [0, 0.25, 0.5, 0.75, 1])
## array([131.1, 155.3, 160.1, 164.8, 189.3])
np.quantile(income, [0, 0.25, 0.5, 0.75, 1])
## array([ 5750. , 20669.75, 30042. , 44123.75, 199969. ])
Example 5.3 Let’s print the aggregates neatly using f-strings; see Section 2.1.3:
wh = [0, 0.25, 0.5, 0.75, 1]
qheights = np.quantile(heights, wh)
qincome = np.quantile(income, wh)
print(" heights income")
for i in range(len(wh)):
    print(f"q_{wh[i]:<4g} {qheights[i]:10.2f} {qincome[i]:10.2f}")  # (a reconstructed continuation of this chunk)
Exercise 5.4 What is the income bracket for 95% of the most typical UK taxpayers? In other
words, determine the 2.5th and 97.5th percentiles.
Exercise 5.5 Compute the midrange of income and heights, being the arithmetic mean of
the minimum and the maximum (this measure is extremely sensitive to outliers).
Note (*) As we do not like the approximately part in the above “asymptotic definition”,
in this course, we shall assume that for any 𝑝 ∈ [0, 1], the 𝑝-quantile is given by:
$$q_p = x_{(\lfloor k \rfloor)} + (k - \lfloor k \rfloor)\left( x_{(\lfloor k \rfloor + 1)} - x_{(\lfloor k \rfloor)} \right),$$
where 𝑘 = (𝑛 − 1)𝑝 + 1 and ⌊𝑘⌋ is the floor function, i.e., the greatest integer less
than or equal to 𝑘 (e.g., ⌊2.0⌋ = ⌊2.001⌋ = ⌊2.999⌋ = 2, ⌊3.0⌋ = ⌊3.999⌋ = 3,
⌊−3.0⌋ = ⌊−2.999⌋ = ⌊−2.001⌋ = −3, and ⌊−2.0⌋ = ⌊−1.001⌋ = −2).
𝑞𝑝 is a function that linearly interpolates between the points featuring the consecutive
order statistics, ((𝑘 − 1)/(𝑛 − 1), 𝑥(𝑘) ) for 𝑘 = 1, … , 𝑛. For instance, for 𝑛 = 5, we
connect the points (0, 𝑥(1) ), (0.25, 𝑥(2) ), (0.5, 𝑥(3) ), (0.75, 𝑥(4) ), (1, 𝑥(5) ). For 𝑛 = 6,
we do the same for (0, 𝑥(1) ), (0.2, 𝑥(2) ), (0.4, 𝑥(3) ), (0.6, 𝑥(4) ), (0.8, 𝑥(5) ), (1, 𝑥(6) );
see Figure 5.1.
Notice that for 𝑝 = 0.5 we get the median regardless of whether 𝑛 is even or not.
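To make the above concrete, here is a minimal sketch of this definition in code (the name quantile7 is ours; numpy.quantile relies on the same linear interpolation scheme by default, so the two should agree):
def quantile7(x, p):
    """The p-quantile obtained by linearly interpolating the order statistics"""
    xs = np.sort(np.asarray(x, dtype=float))  # x_(1) <= ... <= x_(n)
    n = len(xs)
    k = (n - 1)*p + 1                         # a fractional rank between 1 and n
    f = int(np.floor(k))
    if f >= n:                                # p == 1 yields the sample maximum
        return xs[-1]
    return xs[f-1] + (k - f)*(xs[f] - xs[f-1])
quantile7([1, 2, 3, 4, 5], 0.3), np.quantile([1, 2, 3, 4, 5], 0.3)  # both give 2.2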
Note (**) There are many possible definitions of quantiles used in statistical software
packages; see the method argument to numpy.quantile. They were nicely summarised
in [56] as well as in the corresponding Wikipedia2 article. They are all approximately
equivalent for large sample sizes (i.e., asymptotically), but the best practice is to be ex-
plicit about which variant we are using in the computations when reporting data ana-
lysis results. Accordingly, in our case, we say that we are relying on the type-7 quantiles
as described in [56]; see also [47].
In fact, simply mentioning that our computations are done with numpy version 1.xx
(as specified in Section 1.4) implicitly implies that the default method parameters are
used everywhere, unless otherwise stated. In many contexts, that is good enough.
2 https://en.wikipedia.org/wiki/Quantile
Figure 5.1. 𝑞𝑝 as a function of 𝑝 for example vectors of length 5 (left subfigure) and 6
(right).
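For the record (its definition is not included in this excerpt), the uncorrected sample standard deviation discussed below is the quadratic mean of the distances to the arithmetic mean:
$$s = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }.$$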
The standard deviation quantifies the typical amount of spread around the arithmetic
mean. It is overall adequate for making comparisons across different samples measur-
ing similar things (e.g., heights of males vs of females, incomes in the UK vs in South
Africa).
However, without further assumptions, it can be difficult to express the meaning of a
particular value of 𝑠: for instance, the statement that the standard deviation of income
is £22 900 is hard to interpret. This measure therefore makes the most sense for data
distributions that are symmetric around the mean.
Note (*) For bell-shaped data such as heights (more precisely: for normally-
distributed samples; see the next chapter), we sometimes report 𝑥 ̄ ± 2𝑠. By the so-
called 2𝜎 rule, the theoretical expectancy is that roughly 95% of data points fall into
the [𝑥 ̄ − 2𝑠, 𝑥 ̄ + 2𝑠] interval.
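For instance, we can check how well the 2𝜎 rule describes the heights sample (a quick empirical verification of ours; expect a proportion close to, though not exactly, 95%):
s = np.std(heights)
np.mean(np.abs(heights - np.mean(heights)) <= 2*s)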
Further, the variance is the square of the standard deviation, 𝑠2 . Mind that if data are
expressed in centimetres, then the variance is in centimetres squared, which is not
very intuitive. The standard deviation does not have this drawback. For many reasons,
mathematicians find the square root in the definition of 𝑠 annoying, though; it is why
we will come across the 𝑠2 measure every now and then too.
Interquartile range
The interquartile range (IQR) is another popular way to quantify data dispersion. It is
defined as the difference between the third and the first quartile: $\mathrm{IQR} = Q_3 - Q_1$.
Computing it is effortless:
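For example, for the heights dataset (our reconstruction; the original call is not preserved in this excerpt):
np.quantile(heights, 0.75) - np.quantile(heights, 0.25)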
3 (**) We mean the one based on the so-called uncorrected for statistical bias version of the sample variance.
We prefer it here for didactical reasons (simplicity, interpretability). Plus, it is the default one in numpy.
Passing ddof=1 (delta degrees of freedom) to numpy.std will apply division by 𝑛 − 1 instead of by 𝑛 (we will
note later that the std methods in pandas have it activated by default). When used as an estimator of the
distribution’s standard deviation, the latter has slightly better statistical properties that we normally explore
in a course on mathematical statistics, which this one is not.
The IQR has an appealing interpretation: it is the range comprised of the 50% most typ-
ical values. It is quite a robust measure, as it ignores the 25% smallest and 25% largest
observations. Standard deviation, on the other hand, is much more sensitive to out-
liers.
Furthermore, by range (or support) we will mean a measure based on extremal
quantiles: it is the difference between the maximal and minimal observation.
Note (*) It is worth stressing that 𝑔 > 0 does not necessarily imply that the sample
mean is greater than the median. As an alternative measure of skewness, the practi-
tioners sometimes use:
$$g' = \frac{\bar{x} - m}{s}.$$
Yule’s coefficient is an example of a robust skewness measure:
$$g'' = \frac{Q_3 + Q_1 - 2m}{Q_3 - Q_1}.$$
In a box-and-whisker plot (compare Figure 5.2, which depicts the heights and income
datasets), the box spans from the first to the third quartile, with the median marked
inside; its width thus equals the IQR. The whiskers stick out until:
– the smallest observation (the minimum) or 𝑄1 − 1.5IQR (the left side of the
box minus 3/2 of its width), whichever is larger, and
– the largest observation (the maximum) or 𝑄3 + 1.5IQR (the right side of the
box plus 3/2 of its width), whichever is smaller.
Additionally, all observations that are less than 𝑄1 − 1.5IQR (if any) or greater than
𝑄3 + 1.5IQR (if any) are separately marked.
Note We are used to referring to the individually marked points as outliers, but it does
not automatically mean there is anything anomalous about them. They are atypical in
the sense that they are considerably farther away from the box. It might indicate some
problems in data quality (e.g., when someone made a typo entering the data), but not
necessarily. Actually, box plots are calibrated (via the nicely round magic constant 1.5)
in such a way that we expect there to be no or only a few outliers if the data are normally
distributed. For skewed distributions, there will naturally be many outliers on either
side; see Section 15.4 for more details.
Box plots are based solely on sample quantiles. Most statistical packages do not draw
the arithmetic mean. If they do, it is marked with a distinctive symbol.
Exercise 5.6 Call matplotlib.pyplot.plot(numpy.mean(..data..), 0, "bX") to
mark the arithmetic mean with a blue cross. Alternatively, pass showmeans=True (amongst
others) to matplotlib.pyplot.boxplot.
Box plots are particularly beneficial for comparing data samples with each other (e.g.,
body measures of men and women separately), both in terms of the relative shift (loc-
ation) as well as spread and skewness; see, e.g., Figure 12.1.
Example 5.7 (*) A violin plot, see Figure 5.3, represents a kernel density estimator, which is a
smoothened version of a histogram; see Section 15.4.2.
plt.subplot(2, 1, 1) # two rows, one column; the first subplot
plt.violinplot(heights, vert=False, showextrema=False)
plt.boxplot(heights, vert=False)
plt.yticks([1], ["heights"])
plt.subplot(2, 1, 2) # two rows, one column; the second subplot
plt.violinplot(income, vert=False, showextrema=False)
plt.boxplot(income, vert=False)
plt.yticks([1], ["income"])
plt.show()
heights
income
• trimmed means – the arithmetic mean of all the observations except several, say 𝑝,
smallest and greatest ones,
• winsorised means – the arithmetic mean with 𝑝 smallest and 𝑝 greatest observations
replaced with the (𝑝 + 1)-th smallest and the (𝑝 + 1)-th greatest one, respectively.
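As a side note, scipy implements the trimmed mean directly; it is parameterised by the proportion (not the count) of observations to cut on each side. An illustration of ours:
scipy.stats.trim_mean(income, 0.1)  # ignores the 10% smallest and 10% greatest values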
The two other famous means are the geometric and harmonic ones. The former is more
meaningful for averaging growth rates and speedups whilst the latter can be used
for computing the average speed from speed measurements at sections of identical
lengths; see also the notion of the F measure in Section 12.3.2. Also, the quadratic mean
is featured in the definition of the standard deviation (it is the quadratic mean of the
distances to the mean).
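Both the geometric and the harmonic mean are readily available in scipy (a quick illustration of ours, using income merely as an example input):
scipy.stats.gmean(income), scipy.stats.hmean(income)  # geometric and harmonic mean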
As far as spread measures are concerned, the interquartile range (IQR) is a robust stat-
istic. If necessary, the standard deviation might be replaced with:
• mean absolute deviation from the mean: $\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$,
• mean absolute deviation from the median: $\frac{1}{n} \sum_{i=1}^{n} |x_i - m|$,
• median absolute deviation from the median: the median of (|𝑥1 − 𝑚|, |𝑥2 −
𝑚|, … , |𝑥𝑛 − 𝑚|).
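All three can be composed from the vectorised operations we already know (a sketch, using income as an example):
np.mean(np.abs(income - np.mean(income)))      # mean absolute deviation from the mean
np.mean(np.abs(income - np.median(income)))    # mean absolute deviation from the median
np.median(np.abs(income - np.median(income)))  # median absolute deviation from the median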
The coefficient of variation, being the standard deviation divided by the arithmetic mean,
is an example of a relative (or normalised) spread measure. It can be appropriate for
comparing data on different scales, as it is unitless (think how standard deviation
changes when you convert between metres and centimetres).
The Gini index, widely used in economics, can also serve as a measure of relative dispersion.
It is normalised so that it takes values in the unit interval. An index of 0 reflects the
situation where all values in a sample are the same (0 variance; perfect equality). If
there is a single entity in possession of all the “wealth”, and the remaining ones are 0,
then the index is equal to 1.
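There are several equivalent formulations of the Gini index; the following sketch (our own, normalised so that the two extreme cases mentioned above give exactly 0 and 1) is based on all pairwise absolute differences:
def gini(x):
    """The Gini index of a vector of nonnegative values (a didactic sketch)"""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ad = np.abs(x.reshape(-1, 1) - x.reshape(1, -1))  # all |x_i - x_j|
    return np.sum(ad) / (2*(n-1)*np.sum(x))
gini(np.r_[np.zeros(9), 1.0])  # a single entity owns everything: gives 1.0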
In other words, 𝑓 operates element by element on the whole array. Vectorised operations
are frequently used for making adjustments to data, e.g., as in Figure 6.8, where we
discover that the logarithm of the UK incomes has a bell-shaped distribution.
Here is an example call to the vectorised version of the rounding function:
np.round([-3.249, -3.151, 2.49, 2.51, 3.49, 3.51], 1)
## array([-3.2, -3.2, 2.5, 2.5, 3.5, 3.5])
Important Thanks to the vectorised functions, our code is not only more readable,
but also runs faster: we do not have to employ the generally slow Python-level while
or for loops to traverse through each element in a given sequence.
Figure 5.4. The natural logarithm (left) and the exponential function (right).
Logarithms of different bases and non-natural exponential functions are also avail-
able. In particular, when drawing plots, we used the base-10 logarithmic scales on the
axes. We have $\log_{10} x = \frac{\log x}{\log 10}$ and its inverse is $10^x = e^{x \log 10}$. For example:
10.0**np.array([-1, 0, 1, 2]) # exponentiation; see below
## array([ 0.1, 1. , 10. , 100. ])
np.log10([-1, 0.01, 0.1, 1, 2, 5, 10, 100, 1000, 10000])
## <string>:1: RuntimeWarning: invalid value encountered in log10
## array([ nan, -2. , -1. , 0. , 0.30103, 0.69897,
## 1. , 2. , 3. , 4. ])
(Figure: the cosine and sine functions.)
Each element was transformed (e.g., squared, divided by 5) and we got a vector of the
same length in return. In these cases, the operators work just like the aforementioned
vectorised mathematical functions.
Mathematically, we commonly assume that the scalar multiplication is performed in
this way. In this book, we will also extend this behaviour to a scalar addition, so that:
$$c\mathbf{x} + t = (c x_1 + t, c x_2 + t, \dots, c x_n + t).$$
We will also become used to writing $(\mathbf{x} - t)/c$, which is equivalent to $(1/c)\mathbf{x} + (-t/c)$.
Note Let 𝒚 = 𝑐𝒙 + 𝑡 and let 𝑥,̄ 𝑦,̄ 𝑠𝑥 , 𝑠𝑦 denote the vectors’ arithmetic means and
standard deviations. The following properties hold.
• The arithmetic mean is equivariant with respect to translation and scaling; we have
𝑦 ̄ = 𝑐𝑥 ̄ + 𝑡. This is also true for all the quantiles (including, of course, the median).
• The standard deviation is invariant to translations, and equivariant to scaling: 𝑠𝑦 =
𝑐𝑠𝑥 . The same happens for the interquartile range and the range.
As a byproduct, for the variance, we get $s_y^2 = c^2 s_x^2$.
(Figure: an example vector x compared with its rescaled and shifted versions 2x, 0.5x, and 0.5x + 2.)
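For instance, the standardisation of the heights vector consists in subtracting the sample mean and then dividing by the standard deviation (a sketch; the variable name heights_std is ours):
heights_std = (heights - np.mean(heights))/np.std(heights)
heights_std[:6]  # preview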
What we obtained is sometimes referred to as the z-scores. They are nicely inter-
pretable: a z-score of 0 corresponds to an observation equal to the sample mean, a z-score of 1 means one standard deviation above the mean, −2 denotes two standard deviations below it, and so forth.
Even though the original heights were measured in centimetres, the z-scores are unit-
less (centimetres divided by centimetres).
Exercise 5.9 We have a patient whose height z-score is 1 and weight z-score is -1. How can we
interpret this piece of information?
What about a patient whose weight z-score is 0 but BMI z-score is 2?
On a side note, sometimes we might be interested in performing some form of robust
standardisation (e.g., for data with outliers or skewed). In such a case, we can replace
the mean with the median and the standard deviation with the IQR.
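Another popular transformation is min-max scaling, which can be composed as follows (an illustrative vector of our own choosing):
v = np.array([10., 12., 15., 11., 18.])
(v - np.min(v))/(np.max(v) - np.min(v))
## array([0.   , 0.25 , 0.625, 0.125, 1.   ])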
Here, the smallest value is mapped to 0 and the largest becomes equal to 1. Let’s stress
that, in this context, 0.5 does not represent the value which is equal to the mean (unless
we are incredibly lucky).
Also, clipping can be used to replace all values less than 0 with 0 and those greater than
1 with 1.
np.clip(x, 0, 1)
## array([0. , 0.5 , 1. , 0. , 0.25, 0.8 ])
The function is, of course, flexible: another popular choice involves clipping to [−1, 1].
Note that this operation can also be composed by means of the vectorised pairwise
minimum and maximum functions:
np.minimum(1, np.maximum(0, x))
## array([0. , 0.5 , 1. , 0. , 0.25, 0.8 ])
Exercise 5.10 Normalisation is similar to standardisation if data are already centred (when
the mean was subtracted). Show that we can obtain one from the other via the scaling by √𝑛.
5 (*) A Box–Cox transformation [12] can help achieve this in some datasets. Chapter 6 will apply its partic-
ular case: it will turn out that the logarithm of incomes follow a normal distribution (hence, incomes follow
a log-normal distribution). Generally, there is nothing “wrong” or “bad” about data’s not being normally-
distributed. It is just a nice feature to have in certain contexts.
We did not apply numpy.abs because the values were already nonnegative.
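Binary operators such as `*` also act elementwise when both operands are vectors of identical lengths; for instance (an illustrative pair of operands, consistent with the description that follows):
np.array([2, 3, 4, 5]) * np.array([10, 100, 1000, 10000])
## array([   20,   300,  4000, 50000])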
We see that the first element in the left operand (2) was multiplied by the first ele-
ment in the right operand (10). Then, we multiplied 3 by 100 (the second correspond-
ing elements), and so forth. Such a behaviour of the binary operators is inspired by
the usual convention in vector algebra where applying + (or −) on 𝒙 = (𝑥1 , … , 𝑥𝑛 )
and 𝒚 = (𝑦1 , … , 𝑦𝑛 ) means exactly:
𝒙 + 𝒚 = (𝑥1 + 𝑦1 , 𝑥2 + 𝑦2 , … , 𝑥𝑛 + 𝑦𝑛 ).
Using other operators this way (elementwisely) is less standard in mathematics (for
instance multiplication might denote the dot product), but in numpy it is really con-
venient.
Example 5.12 Let's compute the value of the expression $h = -(p_1 \log p_1 + \dots + p_n \log p_n)$, i.e., $h = -\sum_{i=1}^{n} p_i \log p_i$ (the entropy):
p = np.array([0.1, 0.3, 0.25, 0.15, 0.12, 0.08]) # example vector
-np.sum(p*np.log(p))
## 1.6790818544987114
It involves the use of a unary vectorised minus (change sign), an aggregation function (sum), a
vectorised mathematical function (log), and an elementwise multiplication of two vectors of the
same lengths.
Example 5.13 Assume we would like to plot two mathematical functions: the sine, $f(x) = \sin x$, and a polynomial of degree 7, $g(x) = x - x^3/6 + x^5/120 - x^7/5040$, for $x$ in the interval $[-\pi, 3\pi/2]$. To do this, we can probe the values of $f$ and $g$ at sufficiently many points using the vectorised operations, and then use the matplotlib.pyplot.plot function to draw what we see in Figure 5.7.
x = np.linspace(-np.pi, 1.5*np.pi, 1001) # many points in the said interval
yf = np.sin(x)
yg = x - x**3/6 + x**5/120 - x**7/5040
plt.plot(x, yf, 'k-', label="f(x)") # black solid line
plt.plot(x, yg, 'r:', label="g(x)") # red dotted line
plt.legend()
plt.show()
Figure 5.7. With vectorised functions, it is easy to generate plots like this one. We used
different line styles so that the plot is readable also when printed in black and white.
Decreasing the number of points in x will reveal that the plotting function merely draws a series
of straight-line segments. Computer graphics is essentially discrete.
Exercise 5.14 Using a single line of code, compute the vector of BMIs of all people in
the nhanes_adult_female_height_20206 and nhanes_adult_female_weight_20207
datasets. It is assumed that the 𝑖-th elements therein both refer to the same person.
numpy vectors support two additional indexer types: integer and boolean sequences.
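The examples below operate on a short vector; judging from the outputs that follow, it is equivalent to:
x = np.array([10, 20, 30, 40, 50])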
Second, we can also use lists or vectors of integer indexes. They return a subvector with
elements at the specified indexes:
x[ [0] ]
## array([10])
x[ [0, 1, -1, 0, 1, 0, 0] ]
## array([10, 20, 50, 10, 20, 10, 10])
x[ [] ]
## array([], dtype=int64)
Spaces between the square brackets were added only for readability, as x[[0]] looks
slightly more obscure. (What are these double square brackets? Nah, it is a list inside
the index operator.)
6 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_height_2020.
txt
7 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_weight_2020.
txt
To return the vector with the elements except at the given indexes, we call the numpy.
delete function. Its name is slightly misleading for the input vector is not modified
in place.
np.delete(x, [0, -1]) # except the first and the last element
## array([20, 30, 40])
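Third, we can index with a logical (boolean) vector of the same length as x; the elements corresponding to True are selected. The call below reconstructs the one described in the next sentence:
x[ [True, False, True, True, False] ]
## array([10, 30, 40])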
It returned the first, third, and fourth element (select the first, omit the second, choose
the third, pick the fourth, skip the fifth).
Such type of indexing is particularly useful as a data filtering technique. Knowing that
the relational vector operators `<`, `<=`, `==`, `!=`, `>=`, and `>` are performed ele-
mentwisely, just like `+`, `*`, etc., for instance:
x >= 30 # elementwise comparison
## array([False, False, True, True, True])
we can write:
x[ x >= 30 ] # indexing by a logical vector
## array([30, 40, 50])
to mean “select the elements in x which are not less than 30”. Of course, the indexed
vector and the vector specifying the filter do not8 have to be the same:
y = (x/10) % 2 # whatever
y # equal to 0 if a number is 10 times an even number
## array([1., 0., 1., 0., 1.])
x[ y == 0 ]
## array([20, 40])
Important Sadly, if we wish to combine many logical vectors, we cannot use the and,
or, and not operators. They are not vectorised (this is a limitation at the language level).
Instead, in numpy, we use the `&`, `|`, and `~` operators. Alas, they have a higher order
of precedence than `<`, `<=`, `==`, etc. Therefore, the bracketing of the comparisons
is obligatory.
8 (*) The indexer is computed first, and its value is passed as an argument to the index operator. Python
neither is a symbolic programming language, nor does it feature any nonstandard evaluation techniques.
In other words, [...] does not care how the indexer was obtained.
For example:
x[ (20 <= x) & (x <= 40) ] # check what happens if we skip the brackets
## array([20, 30, 40])
means “elements in x between 20 and 40” (greater than or equal to 20 and less than or
equal to 40). Also:
len(x[ (x < 15) | (x > 35) ])
## 3
computes the number of elements in x which are less than 15 or greater than 35 (are
not between 15 and 35).
Exercise 5.15 Compute the BMIs only of the women whose height is between 150 and 170 cm.
5.4.3 Slicing
Just as with ordinary lists, slicing with `:` fetches the elements at indexes in a given
range like from:to or from:to:by.
x[:3] # the first three elements
## array([10, 20, 30])
x[::2] # every second element
## array([10, 30, 50])
x[1:4] # from the second (inclusive) to the fifth (exclusive)
## array([20, 30, 40])
Important For efficiency reasons, slicing returns a view of existing data. It does not
have to make an independent copy of the subsetted elements: by definition, sliced
ranges are regular.
In other words, both x and its sliced version share the same memory. This is import-
ant when we apply operations which modify a given vector in place, such as the sort
method that we mention in the sequel.
def zilchify(x):
    x[:] = 0  # re-sets all values in x
y = np.array([6, 4, 8, 5, 1, 3, 2, 9, 7])
zilchify(y[::2]) # modifies parts of y in place
y # has changed
## array([0, 4, 0, 5, 0, 3, 0, 9, 0])
It zeroed every second element in y. On the other hand, indexing with an integer or
logical vector always returns a copy.
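For instance (a reconstruction consistent with the state of y shown below):
zilchify(y[ [0, -1] ])  # operates on a copy; y itself is left alone
y
## array([0, 4, 0, 5, 0, 3, 0, 9, 0])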
The original vector has not been modified, because we applied the function on a new
(temporary) object. However, we note that compound operations such as += work dif-
ferently, because setting elements at specific indexes is always possible:
y[ [0, -1] ] += 7 # the same as y[ [0, -1] ] = y[ [0, -1] ] + 7
y
## array([7, 4, 0, 5, 0, 3, 0, 9, 7])
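Cumulative sums are also worth knowing; for instance (an illustrative vector of our own choosing):
np.cumsum([5, 3, -4, 1, 1, 3])
## array([5, 8, 4, 5, 6, 9])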
It gave, in this order: the first element, the sum of the first two elements, the sum of
the first three elements, …, the sum of all elements.
Iterated differences are a somewhat inverse operation:
np.diff([5, 8, 4, 5, 6, 9])
## array([ 3, -4, 1, 1, 3])
It returned the difference between the second and the first element, then the differ-
ence between the third and the second, and so forth. The resulting vector is one ele-
ment shorter than the input one.
We often make use of cumulative sums and iterated differences when processing
time series, e.g., stock exchange data (e.g., by how much the price changed since the
previous day?; Section 16.3.1) or determining cumulative distribution functions (Sec-
tion 4.3.8).
5.5.2 Sorting
The numpy.sort function returns a sorted copy of a given vector, i.e., determines the
order statistics.
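For example (assuming x is the vector whose contents are previewed below):
np.sort(x)
## array([10, 20, 30, 30, 40, 50, 50])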
The sort method, on the other hand, sorts the vector in place (and returns nothing).
x # before
## array([50, 30, 10, 40, 20, 30, 50])
x.sort()
x # after
## array([10, 20, 30, 30, 40, 50, 50])
Exercise 5.16 Readers concerned more with chaos than bringing order should give numpy.
random.permutation a try: it shuffles the elements in a given vector.
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.unique(x)
## array([10, 20, 30, 40, 50, 70])
Exercise 5.17 Play with the return_index argument to numpy.unique. It permits pinpoint-
ing the indexes of the first occurrences of each unique value.
9 Where, theoretically, the probability of obtaining a tie is equal to 0.
10 Later we will mention pandas.unique which lists the values in the order of appearance.
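The related numpy.argsort returns the indexes that would sort the vector (a reconstruction; we request a stable sort so that tied elements keep their relative order):
np.argsort(x, kind="stable")
## array([2, 4, 1, 5, 3, 0, 6])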
Which means that the smallest element is at index 2, then the second smallest is at
index 4, the third smallest at index 1, etc. Therefore:
x[np.argsort(x)]
## array([10, 20, 30, 30, 40, 50, 50])
is equivalent to numpy.sort(x).
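Ranks, in turn, can be obtained with scipy.stats.rankdata, which by default averages the ranks of tied observations (a sketch reconstructing the call described below):
scipy.stats.rankdata(x)
## array([6.5, 3.5, 1. , 5. , 2. , 3.5, 6.5])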
Element 10 is the smallest (“the winner”, say, the quickest racer). Hence, it ranks first.
The two tied elements equal to 30 are the third/fourth on the podium (ex aequo). Thus,
they receive the average rank of 3.5. And so on.
On a side note, there are many methods in nonparametric statistics (where we do not
make any particular assumptions about the underlying data distribution) that are
based on ranks. In particular, Section 9.1.4 will cover the Spearman correlation coef-
ficient.
Exercise 5.18 Consult the manual page of scipy.stats.rankdata and test various methods
for dealing with ties.
Exercise 5.19 What is the interpretation of a rank divided by the length of the sample?
11 https://github.com/python/cpython/blob/3.12/Objects/listsort.txt
$$i = \operatorname{arg\,min}_j\, x_j,$$
and we read it as “let 𝑖 be the index of the smallest element in the sequence”. Alternat-
ively, it is the argument of the minimum, whenever:
$$x_i = \min_j\, x_j.$$
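In numpy, the corresponding functions are numpy.argmin and numpy.argmax; for the example x considered above:
np.argmin(x), np.argmax(x)  # the positions of the smallest and the greatest element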
We can use numpy.flatnonzero to fetch the indexes where a logical vector has ele-
ments equal to True (Section 11.1.2 mentions that a value equal to zero is treated as
the logical False, and as True in all other cases). For example:
np.flatnonzero(x == np.max(x))
## array([0, 6])
It is a version of numpy.argmax that lets us decide what we would like to do with the
tied maxima (there are two).
Exercise 5.20 Let x be a vector with possible ties. Create an expression that returns a randomly
chosen index pinpointing one of the sample maxima.
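The discussion that follows refers to a mean-centred version of the heights vector; a reconstruction of the elided computation:
heights_centred = heights - np.mean(heights)
np.mean(heights_centred)  # mathematically 0, but not exactly so on a computer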
which is almost zero (0.0000000000000134), but not exactly zero (it is zero for an en-
gineer, not a mathematician). We saw a similar result in Section 5.3.2, when we stand-
ardised a vector (which involves centring).
Important All floating-point operations on a computer12 (not only in Python) are per-
formed with finite precision of 15–17 decimal digits.
We know it from school. For example, some fractions cannot be represented as decim-
als. When asked to add or multiply them, we will always have to apply some rounding
that ultimately leads to precision loss. We know that 1/3 + 1/3 + 1/3 = 1, but using
a decimal representation with one fractional digit, we get 0.3 + 0.3 + 0.3 = 0.9. With
two digits of precision, we obtain 0.33 + 0.33 + 0.33 = 0.99. And so on. This sum will
never be equal exactly to 1 when using a finite precision.
Note Our data are often imprecise by nature. When asked about people’s heights,
rarely will they provide a non-integer answer (assuming they know how tall they are
and are not lying about it, but it is a different story). We will most likely get data roun-
ded to 0 decimal digits. In our heights dataset, the precision is a bit higher:
heights[:6] # preview
## array([160.2, 152.7, 161.2, 157.4, 154.6, 144.7])
Moreover, errors induced at one stage will propagate onto further operations. For in-
stance, the fact that the heights data are not necessarily accurate makes their aggregates,
such as the mean, approximate as well. Most often, the errors should more or less cancel out,
but in extreme cases, they can lead to undesirable consequences (like for some model
matrices in linear regression; see Section 9.2.9).
Exercise 5.21 Compute the BMIs of all females in the NHANES study. Determine their arith-
metic mean. Compare it to the arithmetic mean computed for BMIs rounded to 1, 2, 3, 4, etc.,
decimal digits.
There is no reason to panic, though. The rule to remember is as follows.
Important As the floating-point values are precise up to a few decimal digits, we must
refrain from comparing them using the `==` operator, which tests for exact equality.
When a comparison is needed, we need to take some error margin 𝜀 > 0 into account.
Ideally, instead of testing x == y, we either inspect the absolute error:
$$|x - y| \le \varepsilon,$$
12 Double precision float64 format as defined by the IEEE Standard for Floating-Point Arithmetic (IEEE
754).
or the relative error:
$$\frac{|x - y|}{|y|} \le \varepsilon.$$
For instance, numpy.allclose(x, y) checks (by default) if for all corresponding ele-
ments in both vectors, we have numpy.abs(x-y) <= 1e-8 + 1e-5*numpy.abs(y),
which is a combination of both tests.
np.allclose(np.mean(heights_centred), 0)
## True
To avoid unpleasant surprises, even the testing of inequalities like x >= 0 should rather
be performed as, say, x >= 1e-8.
Note (*) Another problem is related to the fact that floats on a computer use the binary
base, not the decimal one. Therefore, some fractional numbers that we believe to be
representable exactly, require an infinite number of bits. As a consequence, they are
subject to rounding.
0.1 + 0.1 + 0.1 == 0.3 # obviously
## False
This is because 0.1, 0.1+0.1+0.1, and 0.3 are literally represented as, respectively:
print(f"{0.1:.19f}, {0.1+0.1+0.1:.19f}, and {0.3:.19f}.")
## 0.1000000000000000056, 0.3000000000000000444, and 0.2999999999999999889.
x = np.round(np.random.rand(9)*2-1, 2)
x
## array([ 0.86, -0.37, -0.63, -0.59, 0.14, 0.19, 0.93, 0.31, 0.5 ])
if we wish to filter out all elements that are not positive, we can write:
[ e for e in x if e > 0 ]
## [0.86, 0.14, 0.19, 0.93, 0.31, 0.5]
We can also use the ternary operator of the form x_true if cond else x_false to
return either x_true or x_false depending on the truth value of cond.
e = -2
e**0.5 if e >= 0 else (-e)**0.5
## 1.4142135623730951
There is also a tool which vectorises a scalar function so that it can be used on numpy
vectors:
def clip01(x):
    """clip to the unit interval"""
    if x < 0: return 0
    elif x > 1: return 1
    else: return x
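The vectorisation itself can be performed with numpy.vectorize, which is mostly a convenience wrapper rather than a performance booster (a sketch):
clip01v = np.vectorize(clip01)  # clip01v can now be applied on whole sequences
clip01v([0.5, -0.2, 1.3, 0.25])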
Overall, vectorised numpy functions lead to faster, more readable code. However, if the
corresponding operations are unavailable (e.g., string processing, reading many files),
list comprehensions can serve as their reasonable replacement.
Exercise 5.22 Write equivalent versions of the above expressions using vectorised numpy func-
tions. Moreover, implement them using base Python lists, the for loop and the list.append
method (start from an empty list that will store the result). Use the timeit module to compare
the run times of different approaches on large vectors.
5.6 Exercises
Exercise 5.23 What are some benefits of using a numpy vector over an ordinary Python list?
What are the drawbacks?
Exercise 5.24 How can we interpret the possibly different values of the arithmetic mean, me-
dian, standard deviation, interquartile range, and skewness, when comparing between heights
of men and women?
Exercise 5.25 There is something scientific and magical about numbers that make us ap-
proach them with some kind of respect. However, taking into account that there are many pos-
sible data aggregates, there is a risk that a party may be cherry-picking: report the value that
portrays the analysed entity in a good or bad light, e.g., choose the mean instead of the median
or vice versa. Is there anything that can be done about it?
Exercise 5.26 Even though, mathematically speaking, all measures can be computed on all
data, it does not mean that it always makes sense to do so. For instance, some distributions will
have skewness of 0. However, we should not automatically assume that they are delightfully sym-
metric and bell-shaped (e.g., this can be a bimodal distribution). We always ought to visualise
our data. Give some examples of datasets where we need to be critical of the obtained aggregates.
Exercise 5.27 Give the mathematical definitions, use cases, and interpretations of standardisa-
tion, normalisation, and min-max scaling.
Exercise 5.28 How are numpy.log and numpy.exp related to each other? What about numpy.
log vs numpy.log10, numpy.cumsum vs numpy.diff, numpy.min vs numpy.argmin, numpy.
sort vs numpy.argsort, and scipy.stats.rankdata vs numpy.argsort?
4. Compute the range, i.e., the difference between the greatest and the smallest value.
5. Compute the midrange, i.e., the arithmetic mean of the maximum and the minimum.
6. Compute the mean of absolute values.
7. Find the values closest to and farthest away from 0.
8. Find the values closest to and farthest away from 2.
Exercise 5.35 Write a function to compute the 𝑘-winsorised mean, given a numeric vector x and
𝑘 ≤ (𝑛 − 1)/2, i.e., the arithmetic mean of a version of x, where the 𝑘 smallest and 𝑘 greatest values
were replaced by the (𝑘 + 1)-th smallest and greatest value, respectively.
Exercise 5.36 (**) Reflect on the famous13 saying: not everything that can be counted
counts, and not everything that counts can be counted.
Exercise 5.37 (**) Being a data scientist can be a frustrating job, especially when you care for
some causes. Reflect on: some things that count can be counted, but we will not count
them because we ran over budget or because our stupid corporate manager simply
doesn’t care.
Exercise 5.38 (**) Being a data scientist can be a frustrating job, especially when you care for
the truth. Reflect on: some things that count can be counted, but we will not count them
for some people might be offended or find it unpleasant.
Exercise 5.39 (**) Assume you were to become the benevolent dictator of a nation living on some
remote island. How would you measure if your people are happy or not? Assume that you need to
come up with three quantitative measures (key performance indicators). What would happen if
your policy-making was solely focused on optimising those KPIs? What about the same problem
but in the context of your company and employees? Think about what can go wrong in other areas
of life.
13 https://quoteinvestigator.com/2010/05/26/everything-counts-einstein
6
Continuous probability distributions
Successful data analysts deal with hundreds or thousands of datasets in their lifetimes.
In the long run, at some level, most of them will be deemed boring (datasets, not ana-
lysts). This is because only a few common patterns will be occurring over and over
again. In particular, the previously mentioned bell-shapedness and right-skewness
are prevalent in the so-called real world. Surprisingly, however, this is exactly when
things become scientific and interesting, allowing us to study various phenomena at
an appropriate level of generality.
Mathematically, such idealised patterns in the histogram shapes can be formalised
using the notion of a probability density function (PDF) of a continuous, real-valued random
variable. Intuitively1 , a PDF is a smooth curve that would arise if we drew a histogram
for the entire population (e.g., all women living currently on Earth and beyond or oth-
erwise an extremely large data sample obtained by independently querying the same
underlying data generating process) in such a way that the total area of all the bars is
equal to 1 and the bin sizes are very small. On the other hand, a real-valued random vari-
able is a theoretical process that generates quantitative data. From this perspective, a
sample at hand is assumed to be drawn from a given distribution; it is a realisation of
the underlying process.
We do not intend ours to be a course in probability theory and mathematical statistics.
Rather, it is one that precedes and motivates them (e.g., [23, 40, 41, 82, 83]). Therefore,
our definitions must be simplified so that they are digestible. We will thus consider
the following characterisation.
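In plain terms (a simplified restatement of the standard definition): a probability density function is any function $f$ such that
$$f(x) \ge 0 \ \text{ for all } x \qquad \text{and} \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1,$$
and the probability that the corresponding random variable generates a value in an interval $[a, b]$ equals $\int_a^b f(t)\,dt$ (compare Example 6.2 below).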
Some distributions arise more frequently than others and appear to fit empirical data
or their parts particularly well [29]. In this chapter, we review a few noteworthy prob-
1 (*) This intuition is, of course, theoretically grounded and is based on the asymptotic behaviour of the
histograms as the estimators of the underlying probability density function; see, e.g., [30] and the many
references therein.
ability distributions: the normal, log-normal, Pareto, and uniform families (we will
also mention the chi-squared, Kolmogorov, and exponential ones in this course).
Figure 6.1. The probability density functions of some normal distributions N(𝜇, 𝜎).
Note that 𝜇 is responsible for shifting and 𝜎 affects scaling/stretching of the probab-
ility mass.
If all observations are really drawn independently from N(𝜇, 𝜎) each, then we
will expect 𝑥 ̄ and 𝑠 to be equal to, more or less, 𝜇 and 𝜎. Furthermore, the larger the
sample size, the smaller the error.
Recall the heights (females from the NHANES study) dataset and its bell-shaped his-
togram in Figure 4.2.
heights = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
n = len(heights)
n
## 4221
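The natural estimates are the sample mean and standard deviation (reconstructed here, as the code below refers to them via the variables mu and sigma):
mu = np.mean(heights)
sigma = np.std(heights, ddof=1)
mu, sigma  # c. 160.1 and 7.06, respectively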
Mathematically, we will denote these two by 𝜇̂ and 𝜎̂ (mu and sigma with a hat) to
emphasise that they are merely guesstimates of the unknown theoretical parameters
𝜇 and 𝜎 describing the whole population. On a side note, in this context, the requested
ddof=1 estimator has slightly better statistical properties.
Figure 6.2 shows the fitted density function, i.e., the PDF of N(160.1, 7.06), which we
computed using scipy.stats.norm.pdf, on top of a histogram.
plt.hist(heights, density=True, color="lightgray", edgecolor="black")
x = np.linspace(np.min(heights), np.max(heights), 1000)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma), "r--",
label=f"PDF of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Density")
plt.legend()
plt.show()
Figure 6.2. A histogram for the heights dataset and the probability density function
of the fitted normal distribution.
We can risk assuming that the heights data follow the normal distribution (assump-
tion 1) with parameters 𝜇 = 160.1 and 𝜎 = 7.06 (assumption 2). Note that the choice
of the distribution family is one thing, and the way2 we estimate the underlying para-
meters (in our case, we use the aforementioned 𝜇̂ and 𝜎)̂ is another.
Creating a data model not only saves storage space and computational time, but also –
based on what we can learn from a course in probability and statistics (by appropriately
integrating the normal PDF) – lets us infer facts such as:
• c. 68% of (i.e., a majority) women are 𝜇 ± 𝜎 tall (the 1𝜎 rule),
• c. 95% of (i.e., the most typical) women are 𝜇 ± 2𝜎 tall (the 2𝜎 rule),
• c. 99.7% of (i.e., almost all) women are 𝜇 ± 3𝜎 tall (the 3𝜎 rule).
Also, if we knew that the distribution of heights of men is also normal with some other
parameters (spoiler alert: N(173.8, 7.66)), we could make some comparisons between
the two samples. For example, we could compute the probability that a passerby who
is 155 cm tall is actually a man.
2 (*) Sometimes we will have many point estimators to choose from, some being more suitable than
others if data are not of top quality (e.g., contain outliers). For instance, in the normal model, we can also
estimate 𝜇 and 𝜎 via the sample median and IQR/1.349.
(**) It might also be the case that we will have to obtain the estimates of a probability distribution’s
parameters by numerical optimisation because there are no known open-form formulae therefor. For ex-
ample, in the case of the normal family, the maximum likelihood estimation problem involves minimising
ample, in the case of the normal family, the maximum likelihood estimation problem involves minimising $\mathcal{L}(\mu, \sigma) = \sum_{i=1}^{n} \left( \frac{(x_i - \mu)^2}{\sigma^2} + \log \sigma^2 \right)$ with respect to 𝜇 and 𝜎 (here, we are lucky: its solution is exactly the sample mean and standard deviation).
Exercise 6.1 How can different manufacturing industries (e.g., clothing) make use of such models? Are simplifications necessary when dealing with the complexity of the real world? What are the alternatives?
Furthermore, assuming a particular model gives us access to a range of parametric stat-
istical methods (ones that are derived for the corresponding family of probability dis-
tributions), e.g., the t-test to compare the expected values.
Important We should always verify the assumptions of the tool at hand before we
apply it in practice. In particular, we will soon discover that the UK annual incomes
are not normally distributed. Therefore, we must not refer to the aforementioned 2𝜎
rule in their case. A hammer neither barks nor can it serve as a screwdriver. Period.
For the normal distribution family, the values of the theoretical CDF can be computed
by calling scipy.stats.norm.cdf; compare Figure 6.4 below.
Figure 6.3 depicts the CDF of N(160.1, 7.06) and the empirical CDF of the heights
dataset. This looks like a superb match.
x = np.linspace(np.min(heights), np.max(heights), 1001)
probs = scipy.stats.norm.cdf(x, mu, sigma) # sample the CDF at many points
(continues on next page)
3 The probability distribution of any real-valued random variable 𝑋 can be uniquely defined by means
of a nondecreasing, right (upward) continuous function 𝐹 ∶ ℝ → [0, 1] such that lim𝑥→−∞ 𝐹(𝑥) = 0
and lim𝑥→∞ 𝐹(𝑥) = 1, in which case Pr(𝑋 ≤ 𝑥) = 𝐹(𝑥). The probability density function only exists for
continuous random variables and is defined as the derivative of 𝐹.
Figure 6.3. The empirical CDF and the fitted normal CDF for the heights dataset: the
fit is superb.
Example 6.2 $F(b) - F(a) = \int_a^b f(t)\,dt$ is the probability of generating a value in the interval $[a, b]$. Let's compute the probability related to the 3𝜎 rule:
F = lambda x: scipy.stats.norm.cdf(x, mu, sigma)
F(mu+3*sigma) - F(mu-3*sigma)
## 0.9973002039367398
A common way to summarise the discrepancy between the empirical CDF 𝐹𝑛̂ and a
given theoretical CDF 𝐹 is by computing the greatest absolute deviation:
$$\hat{D}_n = \sup_{t \in \mathbb{R}} \left| \hat{F}_n(t) - F(t) \right|.$$
For a continuous 𝐹 and the ordered sample $x_{(1)} \le \dots \le x_{(n)}$, this supremum simplifies to
$$\hat{D}_n = \max_{k=1,\dots,n} \max\left\{ \tfrac{k}{n} - F(x_{(k)}),\; F(x_{(k)}) - \tfrac{k-1}{n} \right\},$$
i.e., 𝐹 needs to be probed only at the 𝑛 points from the sorted input sample.
If the difference is sufficiently4 small, then we can assume that the normal model de-
scribes data quite well.
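The helper compute_Dn used below is not included in this excerpt; a sketch consistent with the above formula (probing 𝐹 only at the sorted sample points) is:
def compute_Dn(x, F):  # the largest deviation between the ECDF of x and the CDF F
    Fx = F(np.sort(x))
    n = len(x)
    k = np.arange(1, n+1)               # 1, 2, ..., n
    Dn1 = np.max(np.abs(k/n - Fx))      # the steps just after each sample point
    Dn2 = np.max(np.abs((k-1)/n - Fx))  # the steps just before each sample point
    return max(Dn1, Dn2)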
Dn = compute_Dn(heights, F)
Dn
## 0.010470976524201148
This is indeed the case here: we may estimate the probability of someone's being no taller
than any given height with an error of less than about 1.05%.
A Q-Q plot draws the sample quantiles against the corresponding theoretical quantiles.
In Section 5.1.1, we mentioned that there are a few possible definitions thereof in the
literature. Thus, we have some degree of flexibility. For simplicity, instead of using
4 The larger the sample size, the less tolerant regarding the size of this disparity we are; see Section 6.2.3.
5 More generally, for an arbitrary 𝐹, 𝑄 is its generalised inverse, defined for any 𝑝 ∈ (0, 1) as 𝑄(𝑝) =
inf{𝑥 ∶ 𝐹(𝑥) ≥ 𝑝}, i.e., the smallest 𝑥 such that the probability of drawing a value not greater than 𝑥 is at
least 𝑝.
Figure 6.4. The cumulative distribution functions (left) and the quantile functions (be-
ing the inverse of the CDF; right) of some normal distributions.
numpy.quantile, we will assume that the 𝑖/(𝑛 + 1)-quantile6 is equal to 𝑥(𝑖) , i.e., the
𝑖-th smallest value in a given sample (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) and consider only 𝑖 = 1, 2, … , 𝑛.
This way, we mitigate the problem which arises when the 0- or 1-quantiles of the theor-
etical distribution, i.e., 𝑄(0) or 𝑄(1), are not finite (and this is the case for the normal
distribution family).
def qq_plot(x, Q):
    """
    Draws a Q-Q plot, given:
    * x - a data sample (vector)
    * Q - a theoretical quantile function
    """
    n = len(x)
    q = np.arange(1, n+1)/(n+1)  # 1/(n+1), 2/(n+1), ..., n/(n+1)
    x_sorted = np.sort(x)        # sample quantiles
    quantiles = Q(q)             # theoretical quantiles
    plt.plot(quantiles, x_sorted, "o")
    plt.axline((x_sorted[n//2], x_sorted[n//2]), slope=1,
        linestyle=":", color="gray")  # the identity line
Figure 6.5 depicts the Q-Q plot for our example dataset.
qq_plot(heights, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Sample quantiles")  # (a reconstructed continuation of this chunk)
plt.show()
6 (*) scipy.stats.probplot uses a slightly different definition (there are many other ones in common
use).
Figure 6.5. The Q-Q plot for the heights dataset. It is a nice fit.
Ideally, we wish that the points would be arranged on the 𝑦 = 𝑥 line. In our case, there
are small discrepancies7 in the tails (e.g., the smallest observation was slightly smaller
than expected, and the largest one was larger than expected), although it is a common
behaviour for small samples and certain distribution families. However, overall, we
can say that we observe a fine fit.
The popular goodness-of-fit test by Kolmogorov and Smirnov can give us a conservat-
ive interval of the acceptable values of the largest deviation between the empirical and
theoretical CDF, 𝐷̂ 𝑛 , as a function of 𝑛.
Namely, if the test statistic 𝐷̂ 𝑛 is smaller than some critical value 𝐾𝑛 , then we shall deem
7 (*) We can quantify (informally) the goodness of fit by using the Pearson linear correlation coefficient;
the difference insignificant. This is to take into account the fact that reality might devi-
ate from the ideal. Section 6.4.4 mentions that even for samples that truly come from
a hypothesised distribution, there is some inherent variability. We need to be some-
what tolerant.
An authoritative textbook in mathematical statistics will tell us (and prove) that, un-
der the assumption that 𝐹𝑛̂ is the ECDF of a sample of 𝑛 independent variables really
generated from a continuous CDF 𝐹, the random variable 𝐷̂ 𝑛 = sup𝑡∈ℝ |𝐹𝑛̂ (𝑡)−𝐹(𝑡)|
follows the Kolmogorov distribution with parameter 𝑛.
In other words, if we generate many samples of length 𝑛 from 𝐹, and compute 𝐷̂ 𝑛 s for
each of them, we expect it to be distributed like in Figure 6.6, which we obtained by
referring to scipy.stats.kstwo.
Figure 6.6. Densities (left) and cumulative distribution functions (right) of some
Kolmogorov distributions. The greater the sample size, the smaller the acceptable de-
viations between the theoretical and empirical CDFs.
The choice of the critical value 𝐾𝑛 involves a trade-off between our desire to:
• accept the null hypothesis when it is true (data really come from 𝐹), and
• reject it when it is false (data follow some other distribution, i.e., the difference is
significant enough).
These two needs are, unfortunately, mutually exclusive.
In the framework of frequentist hypothesis testing, we assume some fixed upper
bound (significance level) for making the former kind of mistake, which we call the
type-I error. A nicely conservative (in a good way) value that we suggest employing is
𝛼 = 0.001 = 0.1%, i.e., only 1 out of 1000 samples that really come from 𝐹 will be
rejected as not coming from 𝐹.
Such a 𝐾𝑛 may be determined by considering the inverse of the CDF of the Kolmogorov
distribution, $\Xi_n$. Namely, $K_n = \Xi_n^{-1}(1 - \alpha)$:
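For instance (a sketch; scipy.stats.kstwo takes the sample size n as its parameter):
alpha = 0.001  # the assumed significance level
scipy.stats.kstwo.ppf(1-alpha, n)  # the critical value K_n; c. 0.02996 for n = 4221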
In our case 𝐷̂ 𝑛 < 𝐾𝑛 because 0.01047 < 0.02996. We conclude that our empirical
(heights) distribution does not differ significantly (at significance level 0.1%) from
the assumed one, i.e., N(160.1, 7.06). In other words, we do not have enough evid-
ence against the statement that data are normally distributed. It is the presumption
of innocence: they are normal enough.
We will return to this discussion in Section 6.4.4 and Section 12.2.6.
Let’s thus proceed with the fitting of a log-normal model, LN(𝜇, 𝜎). The procedure is
similar to the normal case, but this time we determine the mean and standard devi-
ation based on the logarithms of the observations:
lmu = np.mean(np.log(income))
lsigma = np.std(np.log(income), ddof=1)
lmu, lsigma
## (10.314409794364623, 0.5816585197803816)
8 Except for the few filthy rich, who are interesting on their own; see Section 6.3.2 where we discuss the
Pareto distribution.
Figure 6.7. A histogram of the logarithms of the incomes.
Figure 6.8 depicts the histograms on the log- and original scale together with the fitted
probability density function. On the whole, the fit is not too bad; after all, we are only
dealing with a sample of 1000 households. The original UK Office of National Statistics
data9 could tell us more about the quality of this model in general, but it is beyond the
scope of our simple exercise.
plt.subplot(1, 2, 1)
logbins = np.geomspace(np.min(income), np.max(income), 31)
plt.hist(income, bins=logbins, density=True,
color="lightgray", edgecolor="black")
plt.plot(x, fx, "r--")
plt.xscale("log") # log-scale on the x-axis
plt.ylabel("Density")
plt.subplot(1, 2, 2)
plt.hist(income, bins=30, density=True, color="lightgray", edgecolor="black")
plt.plot(x, fx, "r--", label=f"PDF of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Density")
plt.legend()
plt.show()
9 https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/
incomeandwealth/bulletins/householddisposableincomeandinequality/financialyear2020
Figure 6.8. A histogram and the probability density function of the fitted log-normal
distribution for the income dataset, on log- (left) and original (right) scale.
Next, the left side of Figure 6.9 gives the quantile-quantile plot for the above log-
normal model (note the double logarithmic scale). Additionally, on the right, we check
the sensibility of the normality assumption (using a “normal” normal distribution, not
its “log” version).
plt.subplot(1, 2, 1)
qq_plot( # see above for the definition
income,
lambda q: scipy.stats.lognorm.ppf(q, s=lsigma, scale=np.exp(lmu))
)
plt.xlabel(f"Quantiles of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")
plt.subplot(1, 2, 2)
mu = np.mean(income)
sigma = np.std(income, ddof=1)
qq_plot(income, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.show()
Figure 6.9. The Q-Q plots for the income dataset vs the fitted log-normal (good fit; left)
and normal (bad fit; right) distribution.
Exercise 6.4 Graphically compare the ECDF for income and the CDF of LN(10.3, 0.58).
Exercise 6.5 (*) Perform the Kolmogorov–Smirnov goodness-of-fit test as in Section 6.2.3, to
verify that the hypothesis of log-normality is not rejected at the 𝛼 = 0.001 significance level. At
the same time, the income distribution significantly differs from a normal one.
The hypothesis that our data follow a normal distribution is most likely false. On the
other hand, the log-normal model might be adequate. We can again reduce the whole
dataset to merely two numbers, 𝜇 and 𝜎, based on which (and probability theory), we
may deduce that:
• the expected (average) income is $e^{\mu+\sigma^2/2}$,
• the median income is $e^{\mu}$,
• the most probable income (the mode) is $e^{\mu-\sigma^2}$,
and so forth.
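Here is a quick numerical sanity check (a sketch relying on the lmu and lsigma estimates obtained above):
np.exp(lmu + lsigma**2/2)  # the expected (mean) income
np.exp(lmu)                # the median income
np.exp(lmu - lsigma**2)    # the most probable income (mode)
# the first two should agree with scipy's built-ins:
scipy.stats.lognorm.mean(s=lsigma, scale=np.exp(lmu))
scipy.stats.lognorm.median(s=lsigma, scale=np.exp(lmu))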
Note Recall again that for a skewed distribution such as the log-normal one, reporting
the mean might be misleading. This is why most people are sceptical when they read
the news about our prospering economy (“yeah, we’d like to see that kind of money in
our pockets”). It is not only 𝜇 that matters, but also 𝜎 that quantifies the discrepancy
between the rich and the poor.
For a normal distribution, the situation is vastly different. The mean, the median, and
the most probable outcomes are the same: the distribution is symmetric around 𝜇.
Exercise 6.6 What is the fraction of people with earnings below the mean in our LN(10.3, 0.58)
model? Hint: use scipy.stats.lognorm.cdf to get the answer.
Figure 6.10 gives the histogram of the city sizes on the log-scale. It looks like a log-
normal distribution again, which the readers can fit themselves when they are feeling
playful and have nothing better to do. (But, honestly, is there anything more delightful
than doing stats?)
logbins = np.geomspace(np.min(cities), np.max(cities), 21)
plt.hist(cities, bins=logbins, color="lightgray", edgecolor="black")
plt.xscale("log")
plt.ylabel("Count")
plt.show()
Figure 6.10. A histogram of the unabridged cities dataset. Note the log-scale on the
x-axis.
This time, however, we will be concerned with not what is typical, but what is in some
sense anomalous or extreme. Just like in Section 4.3.7, let’s look at the truncated version
of the city size distribution by considering the cities with 10 000 or more inhabitants.
s = 10_000
large_cities = cities[cities >= s] # a right tail of the original dataset
len(large_cities), sum(large_cities) # number of cities, total population
## (2696, 146199374.0)
Let's now fit a Pareto distribution, P(𝛼, 𝑠), whose probability density function is given for 𝑥 ≥ 𝑠 by:
$$ f(x) = \frac{\alpha s^{\alpha}}{x^{\alpha+1}}, $$
and 𝑓(𝑥) = 0 if 𝑥 < 𝑠.
𝑠 is usually taken as the sample minimum (i.e., 10 000 in our case). 𝛼 can be estimated
through the reciprocal of the mean of the scaled logarithms of our observations:
alpha = 1/np.mean(np.log(large_cities/s))
alpha
## 0.9496171695997675
The left side of Figure 6.11 compares the theoretical density and an empirical histogram on the double log-scale. The right part gives the corresponding Q-Q plot on a double logarithmic scale. We see that the populations of the largest cities are overestimated. The model could be better, but the cities are still growing, aren't they?
plt.subplot(1, 2, 1)
logbins = np.geomspace(s, np.max(large_cities), 21) # bin boundaries
plt.hist(large_cities, bins=logbins, density=True,
color="lightgray", edgecolor="black")
plt.plot(logbins, scipy.stats.pareto.pdf(logbins, alpha, scale=s),
"r--", label=f"PDF of P({alpha:.3f}, {s})")
plt.ylabel("Density")
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()
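Here is a sketch of how the right-hand panel (the Q-Q plot) could be generated before the final call to plt.show, reusing the qq_plot helper defined earlier:
plt.subplot(1, 2, 2)
qq_plot(large_cities, lambda q: scipy.stats.pareto.ppf(q, alpha, scale=s))
plt.xlabel(f"Quantiles of P({alpha:.3f}, {s})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")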
Figure 6.11. A histogram (left) and a Q-Q plot (right) of the large_cities dataset vs
the fitted density of a Pareto distribution on a double log-scale.
Example 6.7 (*) We might also be keen on verifying how accurately the probability of a randomly selected city's being at least of a given size can be predicted. Let's denote by 𝑆(𝑥) =
1−𝐹(𝑥) the complementary cumulative distribution function (CCDF; sometimes referred
to as the survival function), and by 𝑆𝑛̂ (𝑥) = 1 − 𝐹𝑛̂ (𝑥) its empirical version. Figure 6.12 com-
pares the empirical and the fitted CCDFs with probabilities on the linear- and log-scale.
x = np.geomspace(np.min(large_cities), np.max(large_cities), 1001)
probs = scipy.stats.pareto.cdf(x, alpha, scale=s)
n = len(large_cities)
for i in [1, 2]:
plt.subplot(1, 2, i)
plt.plot(x, 1-probs, "r--", label=f"CCDF of P({alpha:.3f}, {s})")
plt.plot(np.sort(large_cities), 1-np.arange(1, n+1)/n,
drawstyle="steps-post", label="Empirical CCDF")
plt.xlabel("$x$")
plt.xscale("log")
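    # (the remainder of the listing is a sketch: presumably the second panel
    # uses a log-scale for the probabilities as well)
    if i == 2:
        plt.yscale("log")
    plt.legend()
plt.show()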
[Figure 6.12: the empirical vs the fitted CCDF of the large_cities dataset; probabilities on the linear scale (left) and on the log-scale (right).]
In terms of the maximal absolute distance between the two functions, 𝐷̂ 𝑛 , from the left plot we
see that the fit seems acceptable. Still, let’s stress that the log-scale overemphasises the relatively
minor differences in the right tail and should not be used for judging the value of 𝐷̂ 𝑛 .
However, that the Kolmogorov–Smirnov goodness-of-fit test rejects the hypothesis of Paretianity
(at a significance level 0.1%) is left as an exercise for the reader.
All events seem to occur more or less with the same probability. Of course, the numbers
on the balls are integer, but in our idealised scenario, we may try modelling this dataset
using a continuous uniform distribution U(𝑎, 𝑏), which yields arbitrary real numbers on
a given interval (𝑎, 𝑏), i.e., between some 𝑎 and 𝑏. Its probability density function is
given for 𝑥 ∈ (𝑎, 𝑏) by:
$$ f(x) = \frac{1}{b-a}, $$
and 𝑓(𝑥) = 0 otherwise. Notice that scipy.stats.uniform is parametrised via loc (our 𝑎) and scale, the latter being equal to our 𝑏 − 𝑎.
In the Lotto case, it makes sense to set 𝑎 = 1 and 𝑏 = 50 and interpret an outcome like
49.1253 as representing the 49th ball (compare the notion of the floor function, ⌊𝑥⌋).
x = np.linspace(1, 50, 1001)
plt.bar(np.arange(1, 50), width=1, height=lotto/np.sum(lotto),
color="lightgray", edgecolor="black", alpha=0.8, align="edge")
plt.plot(x, scipy.stats.uniform.pdf(x, 1, scale=49), "r--",
label="PDF of U(1, 50)")
plt.ylim(0, 0.025)
plt.legend()
plt.show()
[Figure 6.13: a bar plot of the observed relative frequencies of the Lotto outcomes together with the density of U(1, 50).]
Visually, see Figure 6.13, this model makes much sense, but again, some more rigorous statistical testing would be required to determine if someone has not been tampering with the lottery results, i.e., if data do not deviate from the uniform distribution significantly. Unfortunately, we cannot use the Kolmogorov–Smirnov test in the foregoing version as data are not continuous. See, however, Section 11.4.3 for the Pearson
chi-squared test which is applicable here.
Exercise 6.8 Does playing lotteries and engaging in gambling make rational sense at all, from the perspective of an individual player? Well, we see that 16 is the most frequently occurring outcome in Lotto, maybe there's some magic in it? Also, some people sometimes become millionaires, don't they?
It might not be a bad idea to try to fit a probabilistic (convex) combination of three normal distributions 𝑓1, 𝑓2, 𝑓3, corresponding to the morning, lunchtime, and evening pedestrian count peaks. This yields a PDF of the form $f(x) = w_1 f_1(x) + w_2 f_2(x) + w_3 f_3(x)$ for some weights 𝑤1, 𝑤2, 𝑤3 ≥ 0 such that 𝑤1 + 𝑤2 + 𝑤3 = 1.
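A minimal sketch of how such a mixture density could be evaluated and drawn, with purely illustrative parameters (not the actual estimates; compare the footnote below):
w  = np.array([0.3, 0.4, 0.3])    # hypothetical weights (they sum to 1)
m  = np.array([8.0, 13.0, 17.5])  # hypothetical component means (hours)
sd = np.array([1.0, 2.0, 2.0])    # hypothetical standard deviations
t  = np.linspace(0, 24, 241)
f  = sum(w[j]*scipy.stats.norm.pdf(t, m[j], sd[j]) for j in range(3))
plt.plot(t, f, "r--")
plt.show()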
Figure 6.14. A histogram of the peds dataset and an estimated mixture of three normal
distributions.
Note Some data clustering techniques (in particular, the 𝑘-means algorithm that we briefly discuss later in this course) can be used to split a data sample into disjoint groups.
10 The estimates were obtained by running mixtools::normalmixEM in R (expectation-maximisation for
mixtures of univariate normals; [7]). Note that to turn peds, which is a table of counts, to a real-valued
sample, we had to call numpy.repeat(numpy.arange(24)+0.5, peds).
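For instance (a sketch; the seed value is an assumption):
np.random.seed(123)  # assumption: some fixed seed
np.random.rand(5)    # five pseudorandom deviates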
gives five observations sampled independently from the uniform distribution on the
unit interval, i.e., U(0, 1). Here is the same with scipy, but this time the support is
(−10, 15).
scipy.stats.uniform.rvs(-10, scale=25, size=5) # from -10 to -10+25
## array([ 0.5776615 , 14.51910496, 7.12074346, 2.02329754, -0.19706205])
Alternatively, we could have shifted and scaled the output of the random number generator on the unit interval using the formula numpy.random.rand(5)*25-10.
Now, let’s set the same seed and see how “random” the next values are:
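np.random.seed(123)  # a sketch: re-set the same seed as before (assumed to be 123)
np.random.rand(5)    # the very same sequence is produced again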
Note If we do not set the seed manually, it will be initialised based on the current wall
time, which is different every… time. As a result, the numbers will seem random to us,
but only because we are slightly ignorant.
Many Python packages that we refer to in the sequel, including pandas and
scikit-learn, rely on numpy’s random number generator. To harness them, we will
have to become used to calling numpy.random.seed. Additionally, some of them (e.g.,
sklearn.model_selection.train_test_split or pandas.DataFrame.sample) will
be equipped with the random_state argument, which behaves as if we temporarily
changed the seed (for just one call to that function). For instance:
scipy.stats.uniform.rvs(size=5, random_state=123)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
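Similarly, we can draw a few deviates from N(160.1, 7.06) directly (a sketch; the seed matches the listing that follows so that the two approaches can be compared):
scipy.stats.norm.rvs(160.1, 7.06, size=3, random_state=50489)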
Pseudorandom deviates from the standard normal distribution, i.e., N(0, 1), can also
be generated using numpy.random.randn. As N(160.1, 7.06) is a scaled and shifted
version thereof, the preceding is equivalent to:
np.random.seed(50489)
np.random.randn(3)*7.06 + 160.1
## array([166.01775384, 136.7107872 , 185.30879579])
Important Conclusions based on simulated data are trustworthy for they cannot be
manipulated. Or can they?
The above pseudorandom number generator’s seed, 50489, is a bit suspicious. It may
suggest that someone wanted to prove some point (in this case, the violation of the
3𝜎 rule). This is why we recommend sticking to only one seed, e.g., 123, or – when
Exercise 6.9 Generate 1000 pseudorandom numbers from the log-normal distribution and
draw its histogram.
Note (*) Having a reliable pseudorandom number generator from the uniform dis-
tribution on the unit interval is crucial as sampling from other distributions usually
involves transforming the independent U(0, 1) variates. For instance, realisations of
random variables following any continuous cumulative distribution function 𝐹 can be
constructed through the inverse transform sampling:
1. Generate a sample 𝑥1 , … , 𝑥𝑛 independently from U(0, 1).
2. Transform each 𝑥𝑖 by applying the quantile function, 𝑦𝑖 = 𝐹−1 (𝑥𝑖 ).
Now 𝑦1 , … , 𝑦𝑛 follow the CDF 𝐹.
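For instance, here is a sketch using the exponential distribution, whose quantile function is available as scipy.stats.expon.ppf:
np.random.seed(123)
u = np.random.rand(1000)               # step 1: a sample from U(0, 1)
y = scipy.stats.expon.ppf(u, scale=2)  # step 2: apply the quantile function
# y now follows the exponential distribution with scale 2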
Exercise 6.10 (*) Generate 1000 pseudorandom numbers from the log-normal distribution us-
ing inverse transform sampling.
Exercise 6.11 (**) Generate 1000 pseudorandom numbers from the distribution mixture dis-
cussed in Section 6.3.4.
There is a certain ruggedness in the bar sizes that a naïve observer would try to interpret as something meaningful. Competent data scientists train their eyes to ignore such impurities. In this case, they are only due to random effects. Nevertheless, we must always be ready to detect deviations from the assumed model that are worth attention.
Exercise 6.12 Repeat the above experiment for samples of sizes 10, 1 000, and 10 000.
Example 6.13 (*) Using a simple Monte Carlo simulation, we can verify (approximately) that
the Kolmogorov–Smirnov goodness-of-fit test introduced in Section 6.2.3 has been calibrated
properly, i.e., that for samples that really follow the assumed distribution, the null hypothesis
is indeed rejected in roughly 0.1% of the cases.
Assume we are interested in the null hypothesis referencing the normal distribution N(160.1, 7.06) and sample size 𝑛 = 4221. We need to generate many (we assume 10 000
below) such samples. For each of them, we compute and store the maximal absolute deviation
from the theoretical CDF, i.e., 𝐷̂ 𝑛 .
n = 4221
distrib = scipy.stats.norm(160.1, 7.06) # the assumed distribution
Dns = []
for i in range(10000): # increase this for better precision
x = distrib.rvs(size=n, random_state=i+1) # really follows the distrib.
Dns.append(compute_Dn(x, distrib.cdf))
Dns = np.array(Dns)
Now let’s compute the proportion of cases which lead to 𝐷̂ 𝑛 greater than the critical value 𝐾𝑛 :
len(Dns[Dns >= scipy.stats.kstwo.ppf(1-0.001, n)]) / len(Dns)
## 0.0008
Its expected value is 0.001. But our approximation is necessarily imprecise because we rely on
randomness. Increasing the number of trials from 10 000 to, say, 1 000 000 should make the
above estimate closer to the theoretical expectation.
It is also worth checking that the density histogram of Dns resembles the Kolmogorov distribution
that we can compute via scipy.stats.kstwo.pdf.
Exercise 6.14 (*) It might also be interesting to verify the test’s power, i.e., the probability that
when the null hypothesis is false, it will actually be rejected. Modify the above code in such a way
that x in the for loop is not generated from N(160.1, 7.06), but N(140, 7.06), N(141, 7.06),
etc., and check the proportion of cases where we deem the sample distribution significantly different from N(160.1, 7.06). Small deviations from the true location parameter 𝜇 are usually
ignored, but this improves with sample size 𝑛.
Adding noise might also be performed for aesthetic reasons, e.g., when drawing scatter plots.
We can consider ourselves very lucky; all numbers are the same. So, the next number
must finally be a “zero”, right?
np.random.choice([0, 1], 1)
## array([1])
Wrong. The numbers we generate are independent of each other. There is no history.
In the current model of randomness (Bernoulli trials; two possible outcomes with the
same probability), there is a 50% chance of obtaining a “one” regardless of how many
“ones” were observed previously.
We should not seek patterns where no regularities exist. Our brain forms expectations
about the world, and overcoming them is hard work. This must be done as the reality
could not care less about what we consider it to be.
We took the logarithm of the log-normally distributed incomes and obtained a normally distributed sample. In statistical practice, it is not rare to apply different nonlinear transforms of the input vectors at the data preprocessing stage (see, e.g., Section 9.2.6). In particular, the Box–Cox (power) transform [12] is of the form 𝑥 ↦ (𝑥^𝜆 − 1)/𝜆 for some 𝜆. Interestingly, in the limit as 𝜆 → 0, this formula yields 𝑥 ↦ log 𝑥, which is exactly what we were applying in this chapter.
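For example (a quick sketch; scipy.stats.boxcox can apply the transform for a given 𝜆):
z = np.array([1.0, 2.0, 5.0, 10.0])  # an arbitrary example sample
lmbda = 0.5                          # an arbitrary lambda
(z**lmbda - 1)/lmbda                 # the Box-Cox transform computed manually
scipy.stats.boxcox(z, lmbda=lmbda)   # the same, via scipy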
Newman et al. [16, 71] give a nice overview of the power-law-like behaviour of some "rich" or otherwise extreme datasets. It is worth noting that the logarithm of a Paretian sample divided by the minimum follows an exponential distribution.
6.6 Exercises
Exercise 6.15 Why is the notion of the mean income confusing to the general public?
Exercise 6.16 When does manually setting the seed of a pseudorandom number generator make sense?
Exercise 6.17 Given a log-normally distributed sample x, how can we turn it into a normally distributed one, i.e., y=f(x), with f being… what?
Exercise 6.18 What is the 3𝜎 rule for normally distributed data?
Exercise 6.19 Can the 3𝜎 rule be applied for log-normally distributed data?
Exercise 6.20 (*) How can we verify graphically if a sample follows a hypothesised theoretical
distribution?
Exercise 6.21 (*) Explain the meaning of the type I error, significance level, and a test’s power.
Part III
Multidimensional data
7
From uni- to multidimensional numeric data
Important Just like vectors, matrices were designed to store data of the same type.
Chapter 10 will cover pandas data frames, which support mixed data types, e.g., numerical and categorical. Moreover, they let their rows and columns be named. pandas is built on top of numpy, and implements many recipes for the most popular data wrangling tasks. We, however, would like to be able to tackle any computational problem. It is worth knowing that many data analysis and machine learning
algorithms automatically convert numerical parts of data frames to matrices so that
numpy can do most of the mathematical heavy lifting.
body = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
delimiter=",")[1:, :] # skip the first row (column names)
The file specifies column names in the first non-comment line (we suggest inspecting
it in a web browser). Therefore, we had to omit it manually (more on matrix indexing
later). Here is a preview of the first few rows:
body[:6, :] # the first six rows, all columns
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
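The columns represent, in this order (we keep the labels in a separate array; the same definition reappears in Chapter 9):
body_columns = np.array([
    "weight",       # weight (kg)
    "height",       # standing height (cm)
    "arm len.",     # upper arm length (cm)
    "leg len.",     # upper leg length (cm)
    "arm circ.",    # arm circumference (cm)
    "hip circ.",    # hip circumference (cm)
    "waist circ."   # waist circumference (cm)
])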
We noted the column names down as numpy matrices give no means for storing column
labels. It is only a minor inconvenience.
The body object is now a two-dimensional array (a matrix), which means that its shape slot is a tuple of length two:
body.shape
## (4221, 7)
2 https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx
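For example, a nested list with one element per row (a sketch):
np.array([ [1], [2], [3] ])
## array([[1],
##        [2],
##        [3]])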
yields a 3 × 1 array. Such two-dimensional arrays with one column will be referred to
as column vectors (they are matrices still). Moreover:
np.array([ [1, 2, 3, 4] ])
## array([[1, 2, 3, 4]])
Note An ordinary vector (a unidimensional array) only displays a single pair of square
brackets:
np.array([1, 2, 3, 4])
## array([1, 2, 3, 4])
repeats a row vector rowwisely, i.e., over the first axis (0). Replicating a column vector
columnwisely is possible as well:
np.repeat([[1], [2], [3]], 4, axis=1) # over the second axis
## array([[1, 1, 1, 1],
## [2, 2, 2, 2],
## [3, 3, 3, 3]])
$$
\begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \end{bmatrix},
\qquad
\begin{bmatrix} 1 & 2 & 1 & 2 & 1 & 2 \\ 1 & 2 & 1 & 2 & 1 & 2 \end{bmatrix},
\qquad
\begin{bmatrix} 1 & 1 & 2 & 2 & 2 \\ 3 & 3 & 4 & 4 & 4 \end{bmatrix}.
$$
The way we specify the output shapes might differ across functions and packages. Con-
sequently, as usual, it is always best to refer to their documentation.
Exercise 7.4 Check out the documentation of the following functions: numpy.eye, numpy.
diag, numpy.zeros, numpy.ones, and numpy.empty.
Internally, a matrix is represented using a long flat vector where elements are stored
in the row-major3 order:
A.size # the total number of elements
## 12
A.ravel() # the underlying flat array
## array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
It is the shape slot that is causing the 12 elements to be treated as if they were arranged
on a 3 × 4 grid, for example in different algebraic operations and during the printing
of the matrix. This virtual arrangement can be altered anytime without modifying the
underlying array:
A.shape = (4, 3)
A
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
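For instance (a sketch of the kind of call discussed next):
A.reshape(-1, 6)   # six columns; the number of rows is to be deduced
## array([[ 1,  2,  3,  4,  5,  6],
##        [ 7,  8,  9, 10, 11, 12]])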
Here, the placeholder “-1” means that numpy must deduce by itself how many rows we
want in the result. Twelve elements are supposed to be arranged in six columns, so
the maths behind it is not rocket science. Thanks to this, generating row or column
vectors is straightforward:
np.linspace(0, 1, 5).reshape(1, -1) # one row, guess the number of columns
3 (*) Sometimes referred to as a C-style array, as opposed to the Fortran-style which is used in, e.g., R.
7.3.1 Transpose
The transpose of a matrix 𝐗 ∈ ℝ𝑛×𝑚 is an (𝑚 × 𝑛)-matrix 𝐘 given by:
$$
\mathbf{Y} = \mathbf{X}^T = \begin{bmatrix}
x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\
x_{1,2} & x_{2,2} & \cdots & x_{n,2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1,m} & x_{2,m} & \cdots & x_{n,m}
\end{bmatrix}.
$$
4 (*) Such matrices are usually sparse, i.e., have many elements equal to 0. We have special, memory-efficient representations at our disposal for storing and processing them.
A # before
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
A.T # the transpose of A
## array([[ 1, 4, 7, 10],
## [ 2, 5, 8, 11],
## [ 3, 6, 9, 12]])
Rows became columns and vice versa. It is not the same as the aforementioned reshaping, which does not change the order of elements in the underlying array:
$$
\mathbf{x}_{\cdot,j} = [\, x_{1,j} \ \ x_{2,j} \ \ \cdots \ \ x_{n,j} \,]^T = \begin{bmatrix} x_{1,j} \\ x_{2,j} \\ \vdots \\ x_{n,j} \end{bmatrix},
$$
Thanks to the use of the matrix transpose, ⋅𝑇, we can save some vertical space (we want this enjoyable book to be as long as possible, but maybe not this way).
Also, recall that we are used to denoting vectors of length 𝑚 by 𝒙 = (𝑥1 , … , 𝑥𝑚 ). A vector
is a one-dimensional array (not a two-dimensional one), hence the slightly different
bold font which is crucial where any ambiguity could be troublesome.
Note To avoid notation clutter, we will often be implicitly promoting vectors like 𝒙 =
(𝑥1 , … , 𝑥𝑚 ) to row vectors 𝐱 = [𝑥1 ⋯ 𝑥𝑚 ]. This is the behaviour that numpy5 uses; see
Chapter 8.
The identity matrix is a neutral element of the matrix multiplication (Section 8.3).
More generally, any diagonal matrix, diag(𝑎1 , … , 𝑎𝑛 ), can be constructed from a given
sequence of elements by calling:
np.diag([1, 2, 3, 4])
## array([[1, 0, 0, 0],
## [0, 2, 0, 0],
## [0, 0, 3, 0],
## [0, 0, 0, 4]])
make the same sense. However, sorting a single column and leaving others unchanged
will be semantically invalid.
Mathematically, we can consider the above as a set of 4221 points in a seven-
dimensional space, ℝ7. Let's discuss how we can visualise its different natural projections.
7.4.1 2D Data
A scatter plot visualises one variable against another one.
plt.plot(body[:, 1], body[:, 3], "o", c="#00000022")
plt.xlabel(body_columns[1])
plt.ylabel(body_columns[3])
plt.show()
Figure 7.1 depicts upper leg length (the y-axis) vs (versus; against; as a function of)
standing height (the x-axis) in the form of a point cloud with (𝑥, 𝑦) coordinates like
(body[i, 1], body[i, 3]) for all 𝑖 = 1, … , 4221.
Example 7.5 Here are the exact coordinates of the point corresponding to the person of the smallest height:
body[np.argmin(body[:, 1]), [1, 3]]
## array([131.1, 30.8])
Locate it in Figure 7.1. Also, pinpoint the one with the greatest upper leg length:
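body[np.argmax(body[:, 3]), [1, 3]]  # a sketch; the height and upper leg length of that person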
As the points are abundant, normally we cannot easily see where most of them are located. As a simple remedy, we plotted the points using a semi-transparent colour. This gave a kind of density estimate of the point cloud. The colour specifier was of the form #rrggbbaa, giving the intensity of the red, green, blue, and alpha (opaqueness) channel in four series of two hexadecimal digits (between 00 = 0 and ff = 255).
Overall, the plot reveals that there is a general tendency for small heights and small upper leg lengths to occur frequently together. The taller the person, the longer her legs on average, and vice versa.
But there is some natural variability: for example, looking at people of height roughly equal to 160 cm, their upper leg length can be anywhere between 25 and 50 cm (range), yet we expect the majority to lie somewhere between 35 and 40 cm. Chapter 9 will explore two measures of correlation that will enable us to quantify the degree (strength)
of association between variable pairs.
Infrequently will such a 3D plot provide us with readable results, though. We are projecting a three-dimensional reality onto a two-dimensional screen or a flat page. Some information must inherently be lost. What we see is relative to the position of the virtual camera and some angles can be more meaningful than others.
Exercise 7.6 (*) Try finding an interesting elevation and azimuth angle by playing with
the arguments passed to the mpl_toolkits.mplot3d.axes3d.Axes3D.view_init func-
tion. Also, depict arm circumference, hip circumference, and weight on a 3D plot.
Note (*) We may have facilities for creating an interactive scatter plot (running the
above from the Python’s console enables this), where the virtual camera can be freely
repositioned with a mouse/touch pad. This can give some more insight into our data.
Also, there are means of creating animated sequences, where we can fly over the data
scene. Some people find it cool, others find it annoying, but the biggest problem therewith is that they cannot be included in printed material. If we are only targeting the display for the Web (this includes mobile devices), we can try some Python libraries6
that output HTML+CSS+JavaScript code which instructs the browser engine to create
some more sophisticated interactive graphics, e.g., bokeh or plotly.
Example 7.7 Instead of drawing a 3D plot, it might be better to play with a 2D scatter plot
that uses different marker colours (or sometimes sizes: think of them as bubbles). Suitable colour
maps7 can distinguish between low and high values of a third variable.
plt.scatter(
body[:, 4], # x
body[:, 5], # y
c=body[:, 0], # "z" - colours
cmap=plt.colormaps.get_cmap("copper"), # colour map
alpha=0.5 # opaqueness level between 0 and 1
)
plt.xlabel(body_columns[4])
plt.ylabel(body_columns[5])
plt.axis("equal")
plt.rcParams["axes.grid"] = False
cbar = plt.colorbar()
plt.rcParams["axes.grid"] = True
cbar.set_label(body_columns[0])
plt.show()
6 https://wiki.python.org/moin/NumericAndScientific/Plotting
7 https://matplotlib.org/stable/tutorials/colors/colormaps.html
In Figure 7.3, we see some tendency for the weight to be greater as both the arm and the hip circumferences increase.
Exercise 7.8 Play around with different colour palettes. However, be wary that every 1 in 12
men (8%) and 1 in 200 women (0.5%) have colour vision deficiencies, especially in the red-green
or blue-yellow spectrum. For this reason, some diverging colour maps might be worse than others.
A piece of paper is two-dimensional: it only has height and width. By looking around, we also perceive the notion of depth. So far so good. But with more-dimensional data, well, suffice it to say that we are three-dimensional creatures and any attempts towards visualising them will simply not work, don't even trip.
Luckily, it is where mathematics comes to our rescue. With some more knowledge and
intuitions, and this book helps us develop them, it will be easy8 to consider a generic
𝑚-dimensional space, and then assume that, say, 𝑚 = 7 or 42. This is exactly why
data science relies on automated methods for knowledge/pattern discovery. Thanks
to them, we can identify, describe, and analyse the structures that might be present
in the data, but cannot be experienced with our imperfect senses.
8 This is an old funny joke that most funny mathematicians find funny. Ha.
def pairplot(X, labels, alpha=0.1):  # the signature is a reconstruction (an assumption)
    k = X.shape[1]
    fig, axes = plt.subplots(nrows=k, ncols=k, sharex="col", sharey="row",
        figsize=(plt.rcParams["figure.figsize"][0], )*2)
    for i in range(k):
        for j in range(k):
            ax = axes[i, j]
            if i == j:  # diagonal: print the variable name only
                ax.text(0.5, 0.5, labels[i], transform=ax.transAxes,
                    ha="center", va="center", size="x-small")
            else:
                ax.plot(X[:, j], X[:, i], ".", color="black", alpha=alpha)
And now:
which = [1, 0, 4, 5]
pairplot(body[:, which], body_columns[which])
plt.show()
Plotting variables against themselves is rather silly (exercise: what would that be?).
Therefore, on the main diagonal of Figure 7.4, we printed out the variable names.
A scatter plot matrix can be a valuable tool for identifying noteworthy combinations
of columns in our datasets. We see that some pairs of variables are more “structured”
than others, e.g., hip circumference and weight are more or less aligned on a straight
line. This is why Chapter 9 will describe ways to model the possible relationships
between the variables.
Exercise 7.9 Create a pairs plot where weight, arm circumference, and hip circumference are
on the log-scale.
Exercise 7.10 (*) Call seaborn.pairplot to create a scatter plot matrix with histograms on
the main diagonal, thanks to which you will be able to see how the marginal distributions are
distributed. Note that the matrix must, unfortunately, be converted to a pandas data frame first.
Exercise 7.11 (**) Modify our pairplot function so that it displays the histograms of the marginal distributions on the main diagonal.
Figure 7.4. The scatter plot matrix for selected columns in the body dataset.
7.5 Exercises
Exercise 7.12 What is the difference between [1, 2, 3], [[1, 2, 3]], and [[1], [2],
[3]] in the context of an array’s creation?
Exercise 7.13 If A is a matrix with five rows and six columns, what is the difference between
A.reshape(6, 5) and A.T?
Exercise 7.14 If A is a matrix with 5 rows and 6 columns, what is the meaning of: A.reshape(-1), A.reshape(3, -1), A.reshape(-1, 3), A.reshape(-1, -1), A.shape = (3, 10), and A.shape = (-1, 3)?
Exercise 7.15 List some methods to add a new row or column to an existing matrix.
8
Processing multidimensional data
A = np.array([
[0.2, 0.6, 0.4, 0.4],
[0.0, 0.2, 0.4, 0.7],
[0.8, 0.8, 0.2, 0.1]
]) # example matrix
Important Let’s stress that axis=1 does not mean that we get the column means (even
though columns constitute the second axis, and we count starting at 0). It denotes the
axis along which the matrix is sliced. Sadly, even yours truly sometimes does not get it
right.
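For instance, for the example matrix A defined above (output spacing approximate):
np.mean(A, axis=0)  # the mean of each column
## array([0.33333333, 0.53333333, 0.33333333, 0.4       ])
np.mean(A, axis=1)  # the mean of each row
## array([0.4  , 0.325, 0.475])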
More generally, a set of rules referred to in the numpy manual as broadcasting2 describes
how this package handles arrays of different shapes.
Important Generally, for two matrices, their column/row counts must match or be
equal to 1. Also, if one operand is a one-dimensional array, it will be promoted to a row
vector.
Matrix vs scalar
If one operand is a scalar, then it is going to be propagated over all matrix elements.
For example:
(-1)*A
## array([[-0.2, -0.6, -0.4, -0.4],
## [-0. , -0.2, -0.4, -0.7],
## [-0.8, -0.8, -0.2, -0.1]])
It changed the sign of every element, which is, mathematically, an instance of multiplying a matrix 𝐗 by a scalar 𝑐:
$$
c\mathbf{X} = \begin{bmatrix}
c x_{1,1} & c x_{1,2} & \cdots & c x_{1,m} \\
c x_{2,1} & c x_{2,2} & \cdots & c x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
c x_{n,1} & c x_{n,2} & \cdots & c x_{n,m}
\end{bmatrix}.
$$
Matrix vs matrix
For two matrices of identical sizes, we act on the corresponding elements:
B = np.tri(A.shape[0], A.shape[1]) # just an example
B # a lower triangular 0-1 matrix
## array([[1., 0., 0., 0.],
## [1., 1., 0., 0.],
## [1., 1., 1., 0.]])
And now:
A * B
## array([[0.2, 0. , 0. , 0. ],
## [0. , 0.2, 0. , 0. ],
## [0.8, 0.8, 0.2, 0. ]])
3 This is not the same as matrix-multiply by itself which we cover in Section 8.3.
Example 8.2 (*) Figure 8.1 depicts a (filled) contour plot of Himmelblau’s function, 𝑓 (𝑥, 𝑦) =
(𝑥2 + 𝑦 − 11)2 + (𝑥 + 𝑦2 − 7)2 , for 𝑥 ∈ [−5, 5] and 𝑦 ∈ [−4, 4]. To draw it, we
probe 250 points from these two intervals, and call numpy.meshgrid to generate two matrices,
both of shape 250 by 250, giving the x- and y-coordinates of all the points on the corresponding
two-dimensional grid. Thanks to this, we are able to use vectorised mathematical operations to
compute the values of 𝑓 thereon.
x = np.linspace(-5, 5, 250)
y = np.linspace(-4, 4, 250)
xg, yg = np.meshgrid(x, y)
z = (xg**2 + yg - 11)**2 + (xg + yg**2 - 7)**2
plt.contourf(x, y, z, levels=20)
CS = plt.contour(x, y, z, levels=[1, 5, 10, 20, 50, 100, 150, 200, 250])
plt.clabel(CS, colors="black")
plt.show()
To understand the result generated by numpy.meshgrid, let’s inspect its output for a smaller
number of probe points:
x = np.linspace(-5, 5, 3)
y = np.linspace(-4, 4, 5)
xg, yg = np.meshgrid(x, y)
xg
## array([[-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.]])
Figure 8.1. An example filled contour plot with additional labelled contour lines.
yg
## array([[-4., -4., -4.],
## [-2., -2., -2.],
## [ 0., 0., 0.],
## [ 2., 2., 2.],
## [ 4., 4., 4.]])
An elementwise arithmetic expression involving xg and yg thus gives a matrix 𝐙 such that 𝑧𝑖,𝑗 is generated by considering the 𝑖-th element in y and the 𝑗-th item
in x, which is exactly what we desired. We will provide an alternative implementation in Ex-
ample 8.5.
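For example (a sketch with a hypothetical column vector):
A + np.array([10, 20, 30]).reshape(-1, 1)
## array([[10.2, 10.6, 10.4, 10.4],
##        [20. , 20.2, 20.4, 20.7],
##        [30.8, 30.8, 30.2, 30.1]])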
It propagated the column vector over all columns (left to right). Similarly, combining
a matrix with a 1×m row vector recycles the latter over all rows (top to bottom).
A + np.array([1, 2, 3, 4]).reshape(1, -1)
## array([[1.2, 2.6, 3.4, 4.4],
## [1. , 2.2, 3.4, 4.7],
## [1.8, 2.8, 3.2, 4.1]])
$$
\mathbf{X} + \mathbf{t} = \mathbf{X} + [\, t_1 \ \ t_2 \ \ \cdots \ \ t_m \,] = \begin{bmatrix}
x_{1,1}+t_1 & x_{1,2}+t_2 & \cdots & x_{1,m}+t_m \\
x_{2,1}+t_1 & x_{2,2}+t_2 & \cdots & x_{2,m}+t_m \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1}+t_1 & x_{n,2}+t_2 & \cdots & x_{n,m}+t_m
\end{bmatrix}.
$$
This corresponds to shifting (translating) every row in the matrix.
Exercise 8.3 In the nhanes_adult_female_bmx_20204 dataset, standardise, normalise,
and min-max scale every column (compare Section 5.3.2). A single line of code will suffice in each
case.
Exercise 8.4 Check out that numpy.nonzero relies on similar shape broadcasting rules as the
binary operators we discussed here, but not with respect to all three arguments.
Example 8.5 (*) Himmelblau's function in Example 8.2 is defined by means of arithmetic operators only, and they all rely on the kind of shape broadcasting that we discuss in this section. Consequently, calling numpy.meshgrid to evaluate 𝑓 on a point grid was not really necessary:
4 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_bmx_2020.
csv
x = np.linspace(-5, 5, 3)
y = np.linspace(-4, 4, 5)
xg = x.reshape(1, -1)
yg = y.reshape(-1, 1)
(xg**2 + yg - 11)**2 + (xg + yg**2 - 7)**2
## array([[116., 306., 296.],
## [208., 178., 148.],
## [340., 170., 200.],
## [320., 90., 260.],
## [340., 130., 520.]])
See also the sparse parameter in numpy.meshgrid, and Section 12.3.1 where this function
turns out useful after all.
Some functions have the default argument axis=-1 meaning that they are applied
along the last5 axis (i.e., columns in the matrix case):
np.diff(A) # means axis=1 in this context (along the columns)
## array([[ 0.4, -0.2, 0. ],
## [ 0.2, 0.2, 0.3],
## [ 0. , -0.6, -0.1]])
Compare the foregoing to the iterated differences in each column separately (along
the rows):
np.diff(A, axis=0)
## array([[-0.2, -0.4, 0. , 0.3],
## [ 0.8, 0.6, -0.2, -0.6]])
If a vectorised function is not equipped with the axis argument, we can propagate it over all the rows or columns by calling numpy.apply_along_axis. For instance, here is another (did you solve Exercise 8.3?) way to compute the z-scores in each matrix column:
def standardise(x):
    return (x-np.mean(x))/np.std(x)  # the body is a reconstruction: a plausible z-score definition
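It can then be applied on each column, presumably along these lines:
np.apply_along_axis(standardise, 0, A)  # apply the function to each column (slices along axis 0)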
Note (*) Matrices are iterable (in the sense of Section 3.4), but in an interesting way.
Namely, an iterator traverses through each row in a matrix. Writing:
r1, r2, r3 = A # A has three rows
creates three variables, each representing a separate row in A, the second of which is:
r2
## array([0. , 0.2, 0.4, 0.7])
Important Generally:
• each scalar index reduces the dimensionality of the subsetted object by 1;
• slice-slice and slice-scalar indexing returns a view of the existing array, so we need
to be careful when modifying the resulting object;
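For instance (a sketch of a single-column selection):
A[:, 3]   # the fourth column
## array([0.4, 0.7, 0.1])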
It selected the fourth column and gave a flat vector (we can always use the reshape
method to convert the resulting object back to a matrix). Furthermore:
A[0, -1] # two scalars: from two to zero dimensions
## 0.4
It yielded the element (scalar) in the first row and the last column.
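Another sketch, this time combining an integer list with a slice:
A[ [0, -1, 0], ::-1 ]
## array([[0.4, 0.4, 0.6, 0.2],
##        [0.1, 0.2, 0.8, 0.8],
##        [0.4, 0.4, 0.6, 0.2]])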
It selected the first, the last, and the first row again. Then, it reversed the order of
columns.
A[ A[:, 0] > 0.1, : ]
## array([[0.2, 0.6, 0.4, 0.4],
## [0.8, 0.8, 0.2, 0.1]])
It chose the rows from A where the values in the first column of A are greater than 0.1.
A[np.mean(A, axis=1) > 0.35, : ]
## array([[0.2, 0.6, 0.4, 0.4],
## [0.8, 0.8, 0.2, 0.1]])
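The preceding call chose the rows of A whose means are greater than 0.35. Ordering the matrix with respect to the first column could, in turn, be sketched as:
A[np.argsort(A[:, 0]), :]
## array([[0. , 0.2, 0.4, 0.7],
##        [0.2, 0.6, 0.4, 0.4],
##        [0.8, 0.8, 0.2, 0.1]])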
It ordered the matrix with respect to the values in the first column (all rows permuted
in the same way, together).
Exercise 8.6 In the nhanes_adult_female_bmx_20206 dataset, select all the participants
whose heights are within their mean ± 2 standard deviations.
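First, when both indexers are integer sequences of identical lengths, the corresponding elements are picked one by one; a sketch:
A[ [0, -1, 0, 2, 0], [1, 2, 0, 2, 1] ]
## array([0.6, 0.2, 0.2, 0.2, 0.6])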
It yielded A[0, 1], A[-1, 2], A[0, 0], A[2, 2], and A[0, 1].
Second, to select a submatrix (a subblock) using integer indexes, it is best to make sure
that the first indexer is a column vector, and the second one is a row vector (or some
objects like these, e.g., compatible lists of lists).
A[ [[0], [-1]], [[1, 3]] ] # column vector-like list, row vector-like list
## array([[0.6, 0.4],
## [0.8, 0.1]])
Third, if indexing involves logical vectors, it is best to convert them to integer ones first
(e.g., by calling numpy.flatnonzero).
A[ np.flatnonzero(np.mean(A, axis=1) > 0.35).reshape(-1, 1), [[0, 2, 3, 0]] ]
## array([[0.2, 0.4, 0.4, 0.2],
## [0.8, 0.2, 0.1, 0.8]])
This is only a mild inconvenience. We will be forced to apply such double indexing
anyway in pandas whenever selecting rows by position and columns by name is required;
see Section 10.5.
Note (*) Interestingly, we can also index a vector using an integer matrix. This is like
subsetting using a list of integer indexes, but the output’s shape matches that of the
indexer:
u = np.array(["a", "b", "c"])
V = np.array([ [0, 1], [1, 0], [2, 1] ])
u[V] # like u[V.ravel()].reshape(V.shape)
## array([['a', 'b'],
## ['b', 'a'],
## ['c', 'b']], dtype='<U1')
This is time and memory efficient, but might lead to some unexpected results if we are
being rather absent-minded. The readers have been warned.
In all other cases, we get a copy of the subsetted array.
With numpy arrays, however, new rows or columns cannot be added via the index operator. Instead, the whole array needs to be created from scratch using, e.g., one of the functions discussed in Section 7.1.4. For example:
A = np.column_stack((A, np.sqrt(A[:, 0])))
A
## array([[ 0.04, 0.6 , -0.4 , 0.4 , 0.2 ],
## [ 0. , 0.2 , -0.4 , 0.7 , 0. ],
## [ 0.64, 0.8 , -0.2 , 0.1 , 0.8 ]])
8 https://numpy.org/devdocs/user/basics.copies.html
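Let's consider two example matrices (a sketch; their entries are inferred from the product displayed further below):
A = np.array([
    [1, 0, 1],
    [2, 2, 1],
    [3, 2, 0],
    [1, 2, 3],
    [0, 0, 1]
])
B = np.array([
    [1, 0, 0, 0],
    [0, 4, 1, 3],
    [2, 0, 3, 1]
])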
And now:
C = A @ B # or: A.dot(B)
C
## array([[ 3, 0, 3, 1],
## [ 4, 8, 5, 7],
## [ 3, 8, 2, 6],
## [ 7, 8, 11, 9],
## [ 2, 0, 3, 1]])
$$
\begin{bmatrix}
1 & 0 & 1 \\
2 & 2 & 1 \\
3 & 2 & 0 \\
\mathbf{1} & \mathbf{2} & \mathbf{3} \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & \mathbf{0} & 0 \\
0 & 4 & \mathbf{1} & 3 \\
2 & 0 & \mathbf{3} & 1
\end{bmatrix}
=
\begin{bmatrix}
3 & 0 & 3 & 1 \\
4 & 8 & 5 & 7 \\
3 & 8 & 2 & 6 \\
7 & 8 & \mathbf{11} & 9 \\
2 & 0 & 3 & 1
\end{bmatrix}.
$$
For example, the element in the fourth row and the third column, 𝑐4,3 takes the
fourth row in the left matrix 𝐚4,⋅ = [1 2 3] and the third column in the right mat-
rix 𝐛⋅,3 = [0 1 3]𝑇 (emphasised with bold), multiplies the corresponding elements,
and computes their sum, i.e., 𝑐4,3 = 1 ⋅ 0 + 2 ⋅ 1 + 3 ⋅ 3 = 11.
Another example:
A = np.array([
[1, 2],
[3, 4]
])
I = np.array([ # np.eye(2)
[1, 0],
[0, 1]
])
A @ I # or A.dot(I)
## array([[1, 2],
## [3, 4]])
Important In most textbooks, just like in this one, 𝐀𝐁 always denotes the matrix mul-
tiplication. This is a very different operation from the elementwise multiplication.
Exercise 8.7 (*) Show that (𝐀𝐁)𝑇 = 𝐁𝑇 𝐀𝑇 . Also notice that, typically, matrix multiplica-
tion is not commutative, i.e., 𝐀𝐁 ≠ 𝐁𝐀.
In matrix multiplication terms, if 𝐱 is a row vector and 𝐲𝑇 is a column vector, then the
above can be written as 𝐱𝐲𝑇 . The result is a single number.
The dot product of a vector with itself, 𝒙 ⋅ 𝒙, is the square of the Euclidean norm of 𝒙 (simply the sum of squares), which is used to measure the magnitude of a vector (Section 5.3.2):
$$ \|\boldsymbol{x}\| = \sqrt{\sum_{i=1}^{p} x_i^2} = \sqrt{\boldsymbol{x}\cdot\boldsymbol{x}} = \sqrt{\mathbf{x}\mathbf{x}^T}. $$
It is worth pointing out that the Euclidean norm fulfils (amongst others) the condition
that ‖𝒙‖ = 0 if and only if 𝒙 = 𝟎 = (0, 0, … , 0). The same naturally holds for its square.
Exercise 8.8 Show that 𝐀𝑇 𝐀 gives the matrix that consists of the dot products of all the pairs
of columns in 𝐀 and 𝐀𝐀𝑇 stores the dot products of all the pairs of rows.
Section 9.3.2 will note that matrix multiplication can be used as a way to express cer-
tain geometrical transformations of points in a dataset, e.g., scaling and rotating.
Also, Section 9.3.3 briefly discusses the concept of the inverse of a matrix. Further-
more, Section 9.3.4 introduces its singular value decomposition.
The Euclidean distance between two points 𝒖, 𝒗 ∈ ℝ𝑚 is given by:
$$ \|\boldsymbol{u}-\boldsymbol{v}\| = \sqrt{\sum_{i=1}^{m} (u_i - v_i)^2}, $$
i.e., the square root of the sum of squared differences between the corresponding coordinates.
In particular, for unidimensional data (𝑚 = 1), we have ‖𝒖 − 𝒗‖ = |𝑢1 − 𝑣1 |, i.e., the
absolute value of the difference.
9 There are many possible distances, allowing to measure the similarity of points not only in ℝ𝑚 , but
also character strings (e.g., the Levenshtein metric), ratings (e.g., cosine dissimilarity), etc. There is even
an encyclopedia of distances, [25].
Important Given two vectors of equal lengths 𝒙, 𝒚 ∈ ℝ𝑚 , the dot product of their
difference:
$$ (\boldsymbol{x}-\boldsymbol{y})\cdot(\boldsymbol{x}-\boldsymbol{y}) = (\mathbf{x}-\mathbf{y})(\mathbf{x}-\mathbf{y})^T = \sum_{i=1}^{m} (x_i - y_i)^2, $$
is nothing else than the square of the Euclidean distance between them.
Consider an example matrix representing four points in ℝ2:
$$ \mathbf{X} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ -3/2 & 1 \\ 1 & 1 \end{bmatrix}. $$
Calculate (by hand): ‖𝐱1,⋅ − 𝐱2,⋅ ‖, ‖𝐱1,⋅ − 𝐱3,⋅ ‖, ‖𝐱1,⋅ − 𝐱4,⋅ ‖, ‖𝐱2,⋅ − 𝐱4,⋅ ‖, ‖𝐱2,⋅ − 𝐱3,⋅ ‖,
‖𝐱1,⋅ − 𝐱1,⋅ ‖, and ‖𝐱2,⋅ − 𝐱1,⋅ ‖.
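In numpy, the above points and the matrix of all pairwise distances between them can be obtained along these lines (a sketch; scipy.spatial.distance.cdist is one possible workhorse):
X = np.array([
    [ 0.0, 0.0],
    [ 1.0, 0.0],
    [-1.5, 1.0],
    [ 1.0, 1.0]
])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, X)  # D[i, j] is the distance between points i and j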
Hence, 𝑑𝑖,𝑗 = ‖𝐱𝑖,⋅ − 𝐱𝑗,⋅ ‖. That we have zeros on the diagonal is due to the fact that
‖𝒖 − 𝒗‖ = 0 if and only if 𝒖 = 𝒗. Furthermore, ‖𝒖 − 𝒗‖ = ‖𝒗 − 𝒖‖, which implies the
symmetry of 𝐃, i.e., we have 𝐃𝑇 = 𝐃.
Figure 8.2 illustrates the six non-trivial pairwise distances. In the left subplot, our perception of distance is disturbed because the aspect ratio (the ratio between the range of the x-axis to the range of the y-axis) is not 1:1. To be able to assess spatial relationships, it is thus very important to call matplotlib.pyplot.axis("equal").
for s in range(2):
plt.subplot(1, 2, s+1)
if s == 1: plt.axis("equal") # right subplot
plt.plot(X[:, 0], X[:, 1], "ko")
for i in range(X.shape[0]-1):
for j in range(i+1, X.shape[0]):
plt.plot(X[[i,j], 0], X[[i,j], 1], "k-", alpha=0.2)
plt.text(
np.mean(X[[i,j], 0]),
np.mean(X[[i,j], 1]),
np.round(D[i, j], 2)
)
plt.show()
Figure 8.2. Distances between four example points. In the left plot, their perception is
disturbed because the aspect ratio is not 1:1.
Exercise 8.10 (*) Each metric also enjoys the triangle inequality: ‖𝒖−𝒗‖ ≤ ‖𝒖−𝒘‖+‖𝒘−𝒗‖
for all 𝒖, 𝒗, 𝒘. Verify that this property holds by studying each triple of points in an example
distance matrix.
Important A few popular data science techniques rely on pairwise distances, e.g.:
• multidimensional data aggregation (undermentioned),
• 𝑘-means clustering (Section 12.4),
• 𝑘-nearest neighbour regression (Section 9.2.1) and classification (Section 12.3.1),
• missing value imputation (Section 15.1),
• density estimation (which we can use for outlier detection; see Section 15.4).
They assume that data have been appropriately preprocessed; compare, e.g., [2]. In particular, matrix columns should be on the same scale (e.g., standardised) as otherwise computing sums of their squared differences might not make sense at all.
The centroid of 𝐗, $\boldsymbol{c} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_{i,\cdot}$, is the componentwise (columnwise) arithmetic mean. In other words, its 𝑗-th component is given by:
$$ c_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}. $$
For example, the centroid of the dataset depicted in Figure 8.2 is:
c = np.mean(X, axis=0)
c
## array([0.125, 0.5 ])
Centroids are the basis for the 𝑘-means clustering method that we discuss in Sec-
tion 12.4.
A natural measure of a dataset's dispersion is $s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_{i,\cdot}-\boldsymbol{c}\|^2}$, being the square root of the average squared distance to the centroid. Notice that 𝑠 is a single number.
np.sqrt(np.mean(scipy.spatial.distance.cdist(X, c.reshape(1, -1))**2))
## 1.1388041973930374
Note (**) Generalising other aggregation functions is not a trivial task because,
amongst others, there is no natural linear ordering relation in the multidimensional
space (see, e.g., [78]). For instance, any point on the convex hull of a dataset could serve
as an analogue of the minimal and maximal observation.
Furthermore, the componentwise median does not behave nicely (it may, for example,
fall outside the convex hull). Instead, we usually consider a different generalisation
of the median: the point 𝒎 which minimises the sum of distances (not squared), $\sum_{i=1}^{n} \|\mathbf{x}_{i,\cdot} - \boldsymbol{m}\|$. Even though it does not have an analytic solution, it can be determined algorithmically.
Note (**) A bag plot [84] is one of the possible multidimensional generalisations of the
box-and-whisker plot. Unfortunately, its use is not popular amongst practitioners.
𝐵𝑟 (𝒙′ ) = {𝑖 ∶ ‖𝐱𝑖,⋅ − 𝒙′ ‖ ≤ 𝑟} ;
• few nearest neighbour search: for some (usually small) integer 𝑘 ≥ 1, we seek the
indexes of the 𝑘 points in 𝐗 which are the closest to 𝒙′ :
𝑁𝑘 (𝒙′ ) = (𝑖1 , 𝑖2 , … , 𝑖𝑘 ),
length 2𝑟 centred at 𝒙′ , i.e., [𝑥1′ − 𝑟, 𝑥1′ + 𝑟]. In ℝ2 , 𝑆𝑟 (𝒙′ ) is the circle of radius 𝑟
centred at (𝑥1′ , 𝑥2′ ).
Let's inspect the local neighbourhood of the point 𝒙′ = (0, 0) by computing the distances to each point in 𝐗.
x_test = np.array([0, 0])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, x_test.reshape(1, -1))
For instance, here are the indexes of the points in 𝐵0.75 (𝒙′ ):
r = 0.75
B = np.flatnonzero(D <= r)
B
## array([ 1, 11, 14, 16, 24])
Note that to prepare Figure 8.3, we need to set the aspect ratio to 1:1 as otherwise the
circle would look like an ellipse.
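The listing below also refers to the k nearest neighbours of x_test; a sketch of how their indexes might have been determined (k = 5 is an assumption):
k = 5
N = np.argsort(D.ravel())[:k]   # indexes of the k points closest to x_test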
fig, ax = plt.subplots()
ax.add_patch(plt.Circle(x_test, r, color="red", alpha=0.1))
for i in range(k):
plt.plot(
[x_test[0], X[N[i], 0]],
[x_test[1], X[N[i], 1]],
"r:", alpha=0.4
)
plt.plot(X[:, 0], X[:, 1], "bo", alpha=0.1)
for i in range(X.shape[0]):
plt.text(X[i, 0], X[i, 1], str(i), va="center", ha="center")
plt.plot(x_test[0], x_test[1], "rX")
plt.text(x_test[0], x_test[1], "$\\mathbf{x}'$", va="center", ha="center")
plt.axis("equal")
plt.show()
[Figure 8.3: the point 𝒙′ = (0, 0), its radius-0.75 neighbourhood, and its nearest neighbours among the points in 𝐗.]
Note (*) In 𝐾-d trees, the data space is partitioned into hyperrectangles along the axes
of the Cartesian coordinate system (standard basis). Thanks to such a representation,
all subareas too far from the query point can be pruned to speed up the search.
Assume we would like to make queries with regard to three pivot points:
X_test = np.array([
[0, 0],
[2, 2],
[2, -2]
])
10 In our context, we would like to refer to them as 𝑚-d trees, but we decided to stick with the traditional
name.
Here are the results for the fixed radius searches (𝑟 = 0.75):
T.query_ball_point(X_test, 0.75)
## array([list([1, 11, 14, 16, 24]), list([20]), list([])], dtype=object)
We see that the method is nicely vectorised. We made a query about three points at the
same time. As a result, we received a list-like object storing three lists representing the
indexes of interest. Note that in the case of the third point, there are no elements in 𝐗
within the requested range (circle), hence the empty index list.
And here are the five nearest neighbours:
distances, indexes = T.query(X_test, 5) # returns a tuple of length two
Each is a matrix with three rows (corresponding to the number of pivot points) and
five columns (the number of neighbours sought).
Note (*) We expect the 𝐾-d trees to be much faster than the brute-force approach
(where we compute all pairwise distances) in low-dimensional spaces. Nonetheless,
due to the phenomenon called the curse of dimensionality, sometimes already for 𝑚 ≥ 5
the speed gains might be very small; see, e.g., [11].
8.5 Exercises
Exercise 8.11 Does numpy.mean(A, axis=0) compute the rowwise or columnwise means of
A?
Exercise 8.12 How does shape broadcasting work? List the most common pairs of shape cases
when performing arithmetic operations like addition or multiplication.
Exercise 8.13 What are the possible ways to index a matrix?
8 PROCESSING MULTIDIMENSIONAL DATA 171
Exercise 8.14 Which kinds of matrix indexers return a view of an existing array?
Exercise 8.15 (*) How can we select a submatrix comprised of the first and the last row and the
first and the last column?
Exercise 8.16 Why is appropriate data preprocessing required when computing the Euclidean distance between the points in a matrix?
Exercise 8.17 What is the relationship between the dot product, the Euclidean norm, and the
Euclidean distance?
Exercise 8.18 What is a centroid? How is it defined by means of the Euclidean distance between
the points in a dataset?
Exercise 8.19 What is the difference between the fixed-radius and the few nearest neighbour
search?
Exercise 8.20 (*) When might 𝐾-d trees or other spatial search data structures be better than a brute-force search based on scipy.spatial.distance.cdist?
Exercise 8.21 (**) See what kind of vector and matrix processing capabilities are available in
the following packages: TensorFlow, PyTorch, Theano, and tinygrad. Are their APIs similar
to that of numpy?
9
Exploring relationships between variables
Recall that in Section 7.4, we performed some graphical exploratory analysis of the
body measures recorded by the National Health and Nutrition Examination Survey:
body = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
delimiter=",")[1:, :] # skip the first row (column names)
body[:6, :] # preview: the first six rows, all columns
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
body.shape
## (4221, 7)
We already know that 𝑛 = 4221 adult female participants are described by seven dif-
ferent numerical features, in this order:
body_columns = np.array([
"weight", # weight (kg)
"height", # standing height (cm)
"arm len.", # upper arm length (cm)
"leg len.", # upper leg length (cm)
"arm circ.", # arm circumference (cm)
"hip circ.", # hip circumference (cm)
"waist circ." # waist circumference (cm)
])
The data in different columns are somewhat related to each other. Figure 7.4 indicates
that higher hip circumferences tend to occur together with higher arm circumferences,
and that the latter metric does not really tell us much about heights. In this chapter,
we discuss ways to describe the possible relationships between variables.
The Pearson linear correlation coefficient is given by:
$$ r(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}, $$
with 𝑠𝑥 , 𝑠𝑦 denoting the standard deviations and 𝑥,̄ 𝑦 ̄ being the means of the two se-
quences 𝒙 = (𝑥1 , … , 𝑥𝑛 ) and 𝒚 = (𝑦1 , … , 𝑦𝑛 ), respectively.
𝑟 is the mean of the pairwise products1 of the standardised versions of two vectors. It
is a normalised measure of how they vary together (co-variance). Here is how we can
compute it manually on two example vectors:
x = body[:, 4] # arm circumference
y = body[:, 5] # hip circumference
x_std = (x-np.mean(x))/np.std(x) # z-scores for x
y_std = (y-np.mean(y))/np.std(y) # z-scores for y
np.mean(x_std*y_std)
## 0.8680627457873239
And here is the built-in function that implements the same formula:
scipy.stats.pearsonr(x, y)[0] # the function returns more than we need
## 0.8680627457873238
To help develop some intuitions, let's illustrate a few noteworthy correlations using a helper function that draws a scatter plot and prints out Pearson's 𝑟 (and Spearman's 𝜌, which we discuss in Section 9.1.4; let's ignore it until then):
def plot_corr(x, y, axes_eq=False):
r = scipy.stats.pearsonr(x, y)[0]
rho = scipy.stats.spearmanr(x, y)[0]
plt.plot(x, y, "o")
plt.title(f"$r = {r:.3}$, $\\rho = {rho:.3}$",
fontdict=dict(fontsize=10))
if axes_eq: plt.axis("equal")
Important A negative correlation means that when one variable increases, the other
one decreases, and vice versa (like in: weight vs expected life span).
Recall that the arm and hip circumferences enjoy a high-ish positive degree of linear correlation (𝑟 ≃ 0.868). Their scatter plot (Figure 7.4) looks somewhat similar to one of the cases depicted here.
Exercise 9.1 Draw a series of similar plots but for the case of negatively correlated point pairs,
e.g., 𝑦 = −2𝑥 + 5.
Important As a rule of thumb, a linear correlation coefficient of 0.9 or more (or -0.9
or less) is fairly strong. Closer to -0.8 and 0.8, we should already start being sceptical
about two variables’ possibly being linearly correlated. Some statisticians are more
lenient; in particular, it is not uncommon in the social sciences to consider 0.6 a decent
degree of correlation, but this is like building your evidence-base on sand. No wonder
why some fields are suffering from a reproducibility crisis these days (albeit there are
beautiful exceptions to this general observation).
If the dataset at hand does not give us sufficiently strong evidence, it is our ethical duty to refrain from making unjustified claims. We must not mislead the recipients of our data
analysis exercises. Still, it can sometimes be interesting to discover that some factors
popularly conceived as dependent are actually not correlated, for this is an instance of
myth-busting.
Figure 9.2. Linear correlation coefficients for data on a line, but with different
amounts of noise.
2 Note that in Section 6.2.3, we were also testing one very specific hypothesis: whether a distribution
was normal, or whether it was anything else. We only know that if the data really follow that particular
distribution, the null hypothesis will not be rejected in 0.1% of the cases. The rest is silence.
False correlations
What is more, sometimes we can fall into the trap of false correlations. This happens
when data are actually functionally dependent, the relationship is not affine, but the
points are aligned not far from a line; see Figure 9.4 for some examples.
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
Figure 9.4. Example nonlinear relationships that look like linear, at least to Pearson’s 𝑟.
A single measure cannot be perfect: we are trying to compress 𝑛 data points into a
single number here. It is obvious that many different datasets, sometimes remarkably
diverse, will yield the same correlation degree.
Note Correlation analysis can aid in constructing regression models, where we would
like to identify a transformation that expresses a variable as a function of one or more
other features. For instance, when we say that 𝑦 can be modelled approximately by
𝑎𝑥 + 𝑏, regression analysis can identify the best matching 𝑎 and 𝑏 coefficients; see
Section 9.2.3 for more details.
3 https://tylervigen.com/spurious-correlations
plt.yticks(np.arange(k), labels=labels)
plt.tick_params(axis="y", which="both",
labelleft=True, labelright=True, left=False, right=False)
plt.grid(False)
Notice that we ordered4 the columns to reveal some naturally occurring variable clusters: for instance, arm, hip, waist circumference, and weight are all strongly correlated.
Of course, we have 1.0s on the main diagonal because a variable is trivially correlated
with itself. This heat map is symmetric which is due to the property 𝑟(𝒙, 𝒚) = 𝑟(𝒚, 𝒙).
Example 9.4 (*) To fetch the row and column index of the most correlated pair of variables
(either positively or negatively), we should first take the upper (or lower) triangle of the correl-
ation matrix (see numpy.triu or numpy.tril) to ignore the irrelevant and repeating items:
Ru = np.triu(np.abs(R), 1)
np.round(Ru, 2)
## array([[0. , 0.35, 0.55, 0.19, 0.91, 0.95, 0.9 ],
## [0. , 0. , 0.67, 0.66, 0.15, 0.2 , 0.13],
## [0. , 0. , 0. , 0.48, 0.45, 0.46, 0.43],
## [0. , 0. , 0. , 0. , 0.08, 0.1 , 0.03],
## [0. , 0. , 0. , 0. , 0. , 0.87, 0.85],
## [0. , 0. , 0. , 0. , 0. , 0. , 0.9 ],
## [0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
If we compute Pearson’s 𝑟 between these two, we will note that the degree of linear
correlation is rather small:
5 https://www.cia.gov/the-world-factbook
Figure 9.6. Scatter plots for life expectancy vs gross domestic product (purchasing
power parity) on linear (left) and log-scale (right).
However, already the logarithm of GDP is more strongly linearly correlated with life
expectancy:
scipy.stats.pearsonr(np.log(world[:, 0]), world[:, 1])[0]
## 0.8066505089380016
which means that modelling our data via 𝒚 = 𝑎 log 𝒙 + 𝑏 could be an idea worth con-
sidering.
The Spearman rank correlation coefficient, 𝜌(𝒙, 𝒚), is⁶ simply the Pearson linear coefficient
computed over the vectors of the corresponding ranks of all the elements in 𝒙 and 𝒚
(denoted by 𝑅(𝒙) and 𝑅(𝒚), respectively), i.e., 𝜌(𝒙, 𝒚) = 𝑟(𝑅(𝒙), 𝑅(𝒚)). Hence,
the two following calls are equivalent:
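The two calls themselves are not reproduced in this excerpt; they were presumably of the following form (here, exemplified on the two columns of the world dataset used below):
scipy.stats.spearmanr(world[:, 0], world[:, 1])[0]
scipy.stats.pearsonr(
    scipy.stats.rankdata(world[:, 0]),
    scipy.stats.rankdata(world[:, 1])
)[0]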
6 If a method Y is nothing else than X on transformed data, we do not consider it a totally new method.
Let’s point out that this measure is invariant to monotone transformations of the input
variables (up to the sign). This is because they do not change the observations’ ranks
(or only reverse them).
scipy.stats.spearmanr(np.log(world[:, 0]), -np.sqrt(world[:, 1]))[0]
## -0.8275220380818622
Exercise 9.6 We included the 𝜌s in all the outputs generated by our plot_corr function. Re-
view all the preceding figures.
Exercise 9.7 Apply numpy.corrcoef and scipy.stats.rankdata (with the appropriate
axis argument) to compute the Spearman correlation matrix for all the variable pairs in body.
Draw it on a heat map.
Exercise 9.8 (*) Draw the scatter plots of the ranks of each column in the world and body data-
sets.
In 𝑘-nearest neighbour regression, to make a prediction at a new point 𝒙′:
1. Find the indices 𝑁𝑘(𝒙′) = {𝑖1, …, 𝑖𝑘} of the 𝑘 points from 𝐗 closest to 𝒙′, i.e., ones
that fulfil, for all 𝑗 ∉ {𝑖1, …, 𝑖𝑘}, ‖𝐱𝑖𝑙,⋅ − 𝒙′‖ ≤ ‖𝐱𝑗,⋅ − 𝒙′‖.
2. Return the arithmetic mean of 𝑦𝑖1, …, 𝑦𝑖𝑘 as the predicted value 𝑦.̂
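The knn_regress function referred to below is defined earlier in the book and is not reproduced in this excerpt. A minimal, non-optimised sketch consistent with how it is called (unidimensional inputs only) could be:
def knn_regress(x_test, x_train, y_train, k):
    # predict each test point as the mean output of its k nearest training inputs
    y_pred = np.empty(len(x_test))
    for i, x in enumerate(x_test):
        nn = np.argsort(np.abs(x_train - x))[:k]  # indices of the k closest inputs
        y_pred[i] = np.mean(y_train[nn])
    return y_pred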
For example, let’s express weight (the first column) as a function of hip circumference
(the sixth column) in the body dataset:
We can also model the life expectancy at birth in different countries (world dataset) as
a function of their GDP per capita (PPP):
Both are instances of the simple regression problem, i.e., where there is only one inde-
pendent variable (𝑚 = 1). We can easily create its appealing visualisation by means of
the following function:
def knn_regress_plot(x, y, K, num_test_points=1001):
    """
    x - 1D vector - reference inputs
    y - 1D vector - corresponding outputs
    K - list of the numbers of nearest neighbours to try
    num_test_points - number of points to evaluate the fit at
    """
    plt.plot(x, y, "o", alpha=0.1)
    _x = np.linspace(x.min(), x.max(), num_test_points)
    for k in K:
        _y = knn_regress(_x, x, y, k)  # see above
        plt.plot(_x, _y, label=f"$k={k}$")
    plt.legend()
Figure 9.7 depicts the fitted functions for a few different 𝑘s.
plt.subplot(1, 2, 1)
knn_regress_plot(body[:, 5], body[:, 0], [5, 25, 100])
plt.xlabel("hip circumference")
plt.subplot(1, 2, 2)
knn_regress_plot(world[:, 0], world[:, 1], [5, 25, 100])
plt.xlabel("per capita GDP PPP")
plt.ylabel("life expectancy (years)")
plt.show()
Figure 9.7. 𝑘-nearest neighbour regression curves for example datasets. The greater
the 𝑘, the more coarse-grained the approximation.
We obtained a smoothened version of the original dataset. The fact that we do not re-
produce the reference data points in an exact manner is reflected by the (figurative)
error term in the above equations. Its role is to emphasise the existence of some nat-
ural data variability; after all, one’s weight is not purely determined by their hip size
and life is not all about money.
For small 𝑘, we adapt to the data points more closely. This can be worthwhile unless the
data are very noisy. The greater the 𝑘, the smoother the approximation, at the cost of
losing fine detail and of restricted usability at the domain boundaries (here: in the left
and right parts of the plots).
Usually, the number of neighbours is chosen by trial and error (just like the number of
bins in a histogram; compare Section 4.3.3).
Note (**) Some methods use weighted arithmetic means for aggregating the 𝑘 refer-
ence outputs, with weights inversely proportional to the distances to the neighbours
(closer inputs are considered more important).
Also, instead of a few nearest neighbours, we can easily compose some form of fixed-radius
search regression, simply by replacing 𝑁𝑘(𝒙′) with 𝐵𝑟(𝒙′); compare Section 8.4.4.
Yet, note that this way we might make the function undefined in sparsely populated
regions of the domain.
In (multiple) linear regression, we model the dependent variable as an affine function of the independent ones:
𝑦 = 𝑓(𝑥1, 𝑥2, …, 𝑥𝑚) = 𝑐1𝑥1 + 𝑐2𝑥2 + ⋯ + 𝑐𝑚𝑥𝑚 + 𝑐𝑚+1,
or, denoting 𝐱 = [𝑥1 𝑥2 ⋯ 𝑥𝑚] and 𝐜 = [𝑐1 𝑐2 ⋯ 𝑐𝑚], more compactly:
𝑦 = 𝐜𝐱𝑇 + 𝑐𝑚+1.
In the simplest case of a single independent variable (𝑚 = 1), this reduces to the familiar
𝑦 = 𝑎𝑥 + 𝑏.
Important A separate intercept term “+𝑐𝑚+1 ” in the defining equation can be cum-
bersome. We will thus restrict ourselves to linear maps like:
𝑦 = 𝐜𝐱𝑇 ,
but where we can possibly have an explicit constant-1 component somewhere inside 𝐱.
For instance:
𝐱 = [𝑥1 𝑥2 ⋯ 𝑥𝑚 1].
Together with 𝐜 = [𝑐1 𝑐2 ⋯ 𝑐𝑚 𝑐𝑚+1 ], as trivially 𝑐𝑚+1 ⋅ 1 = 𝑐𝑚+1 , this new setting
is equivalent to the original one.
The goodness of fit is measured by the sum of squared residuals, SSR(𝐜) = (𝐲 − 𝐜𝐗𝑇)(𝐲 − 𝐜𝐗𝑇)𝑇,
because 𝐲̂ = 𝐜𝐗𝑇 gives the predicted values as a row vector (the diligent readers are
encouraged to check that on a piece of paper now), 𝐫 = 𝐲 − 𝐲̂ computes all the 𝑛
residuals, and 𝐫𝐫𝑇 gives their sum of squares.
The method of least squares is one of the simplest and most natural approaches to
regression analysis (curve fitting). Its theoretical foundations (calculus…) were de-
veloped more than 200 years ago by Gauss and then were polished by Legendre.
Note (*) Had the points lain on a hyperplane exactly (the interpolation problem),
𝐲 = 𝐜𝐗𝑇 would have an exact solution, equivalent to solving the linear system of equa-
tions 𝐲 −𝐜𝐗𝑇 = 𝟎. However, in our setting we assume that there might be some meas-
urement errors or other discrepancies between the reality and the theoretical model.
To account for this, we are trying to solve a more general problem of finding a hyper-
plane for which ‖𝐲 − 𝐜𝐗𝑇 ‖2 is as small as possible.
This optimisation task can be solved analytically (compute the partial derivatives of
SSR with respect to each 𝑐1, …, 𝑐𝑚, equate them to 0, and solve a simple system of
linear equations). This spawns 𝐜 = 𝐲𝐗(𝐗𝑇𝐗)−1, where 𝐀−1 is the inverse of a matrix
𝐀, i.e., the matrix such that 𝐀𝐀−1 = 𝐀−1𝐀 = 𝐈; compare numpy.linalg.inv. As
inverting larger matrices directly is not too robust numerically, we would rather rely
on a more specialised algorithm.
7 To memorise the model for further reference, we only need to serialise its 𝑚 coefficients.
The undermentioned scipy.linalg.lstsq function provides a fairly numerically
stable (yet, see Section 9.2.9) procedure that is based on the singular value decompos-
ition of the model matrix.
Let’s go back to the NHANES study excerpt and express weight (the first column) as
function of hip circumference (the sixth column) again, but this time using an affine
map of the form9 :
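The code constructing the model matrix is not reproduced in this excerpt; judging by the discussion and preview that follow, it was most likely along these lines:
x_original = body[:, 5]                      # hip circumference
X_train = x_original.reshape(-1, 1)**[1, 0]  # append a column of 1s
y_train = body[:, 0]                         # weight
res = scipy.linalg.lstsq(X_train, y_train)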
We used the vectorised exponentiation operator to convert each 𝑥𝑖 (the 𝑖-th hip circum-
ference) to a pair 𝐱𝑖,⋅ = (𝑥𝑖1 , 𝑥𝑖0 ) = (𝑥𝑖 , 1), which is a nice trick to append a column of
1s to a matrix. This way, we included the intercept term in the model (as discussed in
Section 9.2.2). Here is a preview:
preview_indices = [4, 5, 6, 8, 12, 13]
X_train[preview_indices, :]
## array([[ 92.5, 1. ],
## [106.7, 1. ],
## [ 96.3, 1. ],
## [102. , 1. ],
## [ 94.8, 1. ],
## [ 97.5, 1. ]])
y_train[preview_indices]
## array([55.4, 62. , 66.2, 77.2, 64.2, 56.8])
That’s it. The optimal coefficients vector (the one that minimises the SSR) is:
9 We sometimes explicitly list the error term that corresponds to the residuals. This is to assure the
reader that we are not naïve and that we know what we are doing. We see from the scatter plot of the
involved variables that the data do not lie on a straight line perfectly. Each model is merely an idealisa-
tion/simplification of the described reality. It is wise to remind ourselves about that every so often.
c = res[0]
c
## array([ 1.3052463 , -65.10087248])
The model is nicely interpretable. For instance, as hip circumference increases, we expect
the weights to be greater and greater. As we said before, it does not mean that there is
some causal relationship between the two (for instance, there can be some latent variables
that affect both of them). Instead, there is some general tendency regarding how the data
align in the sample space. For instance, the "best guess" (according to the current model;
there can be many, see below) weight for a person with a hip circumference of 100 cm is
65.4 kg. Thanks to such models, we might get more insight into certain phenomena, or find
some proxies for different variables (especially if measuring them directly is tedious,
costly, dangerous, etc.).
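For the record, that prediction amounts to evaluating the fitted affine map at 𝑥 = 100:
c @ np.array([100.0, 1.0])  # predicted weight for a 100 cm hip circumference; ca. 65.4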
The scatter plot and the fitted regression line in Figure 9.8 indicate a fair fit but, of
course, there is some natural variability.
plt.plot(x_original, y_train, "o", alpha=0.1) # scatter plot
_x = np.array([x_original.min(), x_original.max()]).reshape(-1, 1)
_y = c @ (_x**[1, 0]).T
plt.plot(_x, _y, "r-") # a line that goes through the two extreme points
plt.xlabel("hip circumference")
plt.ylabel("weight")
plt.show()
Exercise 9.9 The Anscombe quartet10 is a famous example dataset, where we have four pairs of
variables that have almost identical means, variances, and linear correlation coefficients. Even
though they can be approximated by the same straight line, their scatter plots are vastly different.
Reflect upon this toy example.
10 https://github.com/gagolews/teaching-data/raw/master/r/anscombe.csv
Figure 9.8. The least squares line for weight vs hip circumference.
We wanted the squared residuals (on average – across all the points) to be as small
as possible. The least squares method assures that this is the case relative to the chosen
model, i.e., a linear one. Nonetheless, it still does not mean that what we obtained con-
stitutes a good fit to the training data. Thus, we need to perform the analysis of residuals.
Interestingly, the average of residuals is always zero:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0.$$
Therefore, if we want to summarise the residuals into a single number, we can use, for
example, the root mean squared error instead:
$$\mathrm{RMSE}(\mathbf{y}, \hat{\mathbf{y}}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$$
np.sqrt(np.mean(r**2))
## 6.948470091176111
[Figure 9.9: an illustration of the fitted line, a predicted output, a residual, and an observed value (reference output) on the weight vs hip circumference scatter plot.]
Hopefully, we can see that RMSE is a function of the SSR, which we have already sought to
minimise.
Alternatively, we can compute the mean absolute error:
$$\mathrm{MAE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$$
np.mean(np.abs(r))
## 5.207073583769202
MAE is nicely interpretable: it measures by how many kilograms we err on average. Not
bad.
Exercise 9.10 Fit a regression line explaining weight as a function of the waist circumference
and compute the corresponding RMSE and MAE. Are they better than when hip circumference
is used as an explanatory variable?
Note Generally, fitting simple (involving one independent variable) linear models can
only make sense for highly linearly correlated variables. Interestingly, if 𝒚 and 𝒙 are
both standardised, and 𝑟 is their Pearson’s coefficient, then the least squares solution
is given by 𝑦 = 𝑟𝑥.
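A quick numerical illustration of this property (a sketch reusing the weight and hip circumference columns considered above):
x, y = body[:, 5], body[:, 0]
xs = (x - np.mean(x))/np.std(x)  # standardised hip circumference
ys = (y - np.mean(y))/np.std(y)  # standardised weight
(scipy.stats.pearsonr(x, y)[0],                       # r
    scipy.linalg.lstsq(xs.reshape(-1, 1), ys)[0][0])  # least squares slope; equal to r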
To verify whether a fitted model is not extremely wrong (e.g., when we fit a linear
model to data that clearly follow a different functional relationship), a plot of residuals
against the fitted values can be of help; see Figure 9.10. Ideally, the points should be
spread symmetrically around zero, with a variability that does not depend on the fitted
value (we then say that the residuals are homoscedastic).
Figure 9.10. Residuals vs fitted values for the linear model explaining weight as a func-
tion of hip circumference. The variance of residuals slightly increases as 𝑦𝑖̂ increases.
This is not ideal, but it could be much worse than this.
Exercise 9.11 Compare11 the RMSE and MAE for the 𝑘-nearest neighbour regression curves de-
picted in the left side of Figure 9.7. Also, draw the residuals vs fitted plot.
For linear models fitted using the least squares method, we have:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 + \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
In other words, the variance of the dependent variable (left) can be decomposed into
the sum of the variance of the predictions and the averaged squared residuals. Multiplying
it by 𝑛, we have that the total sum of squares (TSS) is equal to the explained sum of
squares (ESS) plus the residual sum of squares (RSS): TSS = ESS + RSS.
11 In 𝑘-nearest neighbour regression, we are not aiming to minimise anything in particular. If the model
is performing well with respect to some metrics such as RMSE or MAE, we can consider ourselves lucky.
Nevertheless, some asymptotic results guarantee the optimality of the outcomes generated for large sample
sizes (e.g., consistency); see, e.g., [24].
We yearn for ESS to be as close to TSS as possible. Equivalently, it would be jolly nice
to have RSS equal to 0.
The coefficient of determination (unadjusted R-Squared, sometimes referred to as simply
the score) is a popular normalised, unitless measure that is easier to interpret than raw
ESS or RSS when we have no domain-specific knowledge of the modelled problem. It
is given by:
$$R^2(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{s_r^2}{s_y^2}.$$
1 - np.var(y_train-y_pred)/np.var(y_train)
## 0.8959634726270759
The coefficient of determination in the current context12 is thus the proportion of vari-
ance of the dependent variable explained by the independent variables in the model.
The closer it is to 1, the better. A dummy model that always returns the mean of 𝒚 gives
R-squared of 0.
In our case, 𝑅2 ≃ 0.9 is high, which indicates a rather good fit.
Note (*) There are certain statistical results that can be relied upon provided that
the residuals are independent random variables with expectation zero and the same
variance (e.g., the Gauss–Markov theorem). Further, if they are normally distributed,
then we have several hypothesis tests available (e.g., for the significance of coeffi-
cients). This is why in various textbooks such assumptions are additionally verified.
But we do not go that far in this introductory course.
12 For a model that is not generated via least squares, the coefficient of determination can also be negative,
particularly when the fit is extremely bad. Also, note that this measure is dataset-dependent. Therefore, it
ought not to be used for comparing models explaining different dependent variables.
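The code fitting this model (weight as a function of the arm and the hip circumference, with an intercept term; compare the scikit-learn example further below) is not reproduced in this excerpt. A sketch of the likely construction:
X_train = np.column_stack([body[:, [4, 5]], np.ones(body.shape[0])])  # arm circ., hip circ., 1s
y_train = body[:, 0]                                                  # weight
c = scipy.linalg.lstsq(X_train, y_train)[0]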
We skip the visualisation part for we do not expect it to result in a readable plot: these
are multidimensional data. The coefficient of determination is:
y_pred = c @ X_train.T
r = y_train - y_pred
1-np.var(r)/np.var(y_train)
## 0.9243996585518783
It is a slightly better model than the previous one. We can predict the participants'
weights more accurately, at the cost of increased model complexity.
The design matrix is made of rubber: it can handle almost anything. If we have a linear
model, but with respect to transformed data, the algorithm does not care. This is the
beauty of the underlying mathematics; see also [12].
A creative modeller can also turn models such as 𝑢 = 𝑐𝑒𝑎𝑣 into 𝑦 = 𝑎𝑥 + 𝑏 by repla-
cing 𝑦 = log 𝑢, 𝑥 = 𝑣, and 𝑏 = log 𝑐. There are numerous possibilities based on the
properties of the log and exp functions listed in Section 5.2. We call them linearisable
models.
As an example, let’s model the life expectancy at birth in different countries as a func-
tion of their GDP per capita (PPP).
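The definition of make_model_matrix1 (the plain linear model) does not appear in this excerpt; given the naming pattern of its siblings below, it was presumably:
def make_model_matrix1(x):
    return x.reshape(-1, 1)**[0, 1]
The legend of Figure 9.11 additionally suggests that the functions' __name__ attributes were relabelled (e.g., to "linear model", "quadratic model", and so forth) before plotting.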
def make_model_matrix2(x):
    return x.reshape(-1, 1)**[0, 1, 2]
def make_model_matrix3(x):
    return x.reshape(-1, 1)**[0, 1, 2, 3]
def make_model_matrix4(x):
    return (np.log(x)).reshape(-1, 1)**[0, 1]
model_matrix_makers = [
    make_model_matrix1,
    make_model_matrix2,
    make_model_matrix3,
    make_model_matrix4
]
x_original = world[:, 0]
Xs_train = [ make_model_matrix(x_original)
             for make_model_matrix in model_matrix_makers ]
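The fitting step is likewise not reproduced here; a sketch consistent with the later uses of cs and with the comparison of the models' sums of squared residuals:
y_train = world[:, 1]  # life expectancy
cs, ssrs = [], []
for X_cur in Xs_train:
    c = scipy.linalg.lstsq(X_cur, y_train)[0]
    cs.append(c)
    ssrs.append(np.sum((y_train - c @ X_cur.T)**2))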
The logarithmic model is thus the best (out of the models we considered). The four
models are depicted in Figure 9.11.
plt.plot(x_original, y_train, "o", alpha=0.1)
_x = np.linspace(x_original.min(), x_original.max(), 101).reshape(-1, 1)
for i in range(len(model_matrix_makers)):
    _y = cs[i] @ model_matrix_makers[i](_x).T
    plt.plot(_x, _y, label=model_matrix_makers[i].__name__)
plt.legend()
plt.xlabel("per capita GDP PPP")
plt.ylabel("life expectancy (years)")
plt.show()
[Figure 9.11: the four fitted models (linear, quadratic, cubic, and logarithmic) for life expectancy (years) vs per capita GDP PPP.]
Exercise 9.12 Draw box plots and histograms of residuals for each model as well as the scatter
plots of residuals vs fitted values.
Looking at how these models behave, one does not need a university degree in economics/social
policy to conclude that they are not the best description of how the reality behaves (on average).
In particular, the more independent variables we have in the model, the greater the
𝑅2 coefficient will be. We can try correcting for this phenomenon by considering the
adjusted 𝑅2 :
$$\bar{R}^2(\mathbf{y}, \hat{\mathbf{y}}) = 1 - \left(1 - R^2(\mathbf{y}, \hat{\mathbf{y}})\right) \frac{n-1}{n-m-1},$$
which, to some extent, penalises more complex models.
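For instance, a small helper implementing the above (a sketch):
def adjusted_r2(y_true, y_pred, m):
    """Adjusted R-squared for a model with m independent variables."""
    n = len(y_true)
    r2 = 1 - np.var(y_true - y_pred)/np.var(y_true)
    return 1 - (1 - r2)*(n - 1)/(n - m - 1)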
Note (**) Model quality measures adjusted for the number of model parameters, 𝑚,
can also be useful in automated variable selection. The Akaike Information Criterion
(AIC) is one popular example of such a measure.
We should also be concerned with quantifying a model's predictive power, i.e., how well
it generalises to data points that we do not have now (or pretend we do not have)
but might face in the future. As we observe the modelled reality only at a few different
points, the question is how the model performs when filling the gaps between the dots
it connects.
In particular, we must be careful when extrapolating the data, i.e., making predictions
outside of their usual domain. For example, here is the life expectancy that the linear
and the quadratic models predict for an imaginary country with $500 000 per capita GDP:
cs[0] @ model_matrix_makers[0](np.array([500000])).T
## array([164.3593753])
cs[1] @ model_matrix_makers[1](np.array([500000])).T
## array([-364.10630779])
Nonsense.
Example 9.13 Consider a theoretical illustration, where the true model of some reality is 𝑦 =
5 + 3𝑥³.
def true_model(x):
    return 5 + 3*(x**3)
Still, for some reason we are only able to gather a small (𝑛 = 25) sample from this model. What
is even worse, it is subject to some measurement error:
np.random.seed(42)
x = np.random.rand(25) # random xs on [0, 1]
y = true_model(x) + 0.2*np.random.randn(len(x)) # true_model(x) + noise
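The code fitting the model of the matching form 𝑦 = 𝑐1 + 𝑐2𝑥³ is not shown in this excerpt; it was most likely similar to:
X03 = x.reshape(-1, 1)**[0, 3]
c03 = scipy.linalg.lstsq(X03, y)[0]
ssr03 = np.sum((y - c03 @ X03.T)**2)
np.round(c03, 2)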
which is not too far, but still somewhat¹³ distant from the true coefficients, 5 and 3.
We can also fit a more flexible cubic polynomial, 𝑦 = 𝑐1 + 𝑐2 𝑥 + 𝑐3 𝑥2 + 𝑐4 𝑥3 :
X0123 = x.reshape(-1, 1)**[0, 1, 2, 3]
c0123 = scipy.linalg.lstsq(X0123, y)[0]
ssr0123 = np.sum((y-c0123 @ X0123.T)**2)
np.round(c0123, 2)
## array([4.89, 0.32, 0.57, 2.23])
In terms of the SSR, this more complex model of course explains the training data more accur-
ately:
ssr03, ssr0123
## (1.0612111154029558, 0.9619488226837537)
Yet, it is farther away from the truth (which, whilst performing the fitting task based only on
given 𝒙 and 𝒚, is unknown). We may thus say that the first model generalises better on yet-to-
be-observed data; see Figure 9.12 for an illustration.
13 For large 𝑛, we expect to pinpoint the true coefficients exactly. In our scenario (independent, normally
distributed errors with the expectation of 0), the least squares method is the maximum likelihood estimator
of the model parameters. As a consequence, it is consistent.
_x = np.linspace(0, 1, 101)
plt.plot(x, y, "o")
plt.plot(_x, true_model(_x), "--", label="true model")
plt.plot(_x, c0123 @ (_x.reshape(-1, 1)**[0, 1, 2, 3]).T,
label="fitted model y=x**[0, 1, 2, 3]")
plt.plot(_x, c03 @ (_x.reshape(-1, 1)**[0, 3]).T,
label="fitted model y=x**[0, 3]")
plt.legend()
plt.show()
Figure 9.12. The true (theoretical) model vs some guesstimates (fitted based on noisy
data). More degrees of freedom is not always better.
Example 9.14 (**) We defined the sum of squared residuals (and its function, the root mean
squared error) by means of the averaged deviation from the reference values. They are subject to
error themselves, though. Even though they are our best-shot approximation of the truth, they
should be taken with a degree of scepticism.
In the previous example, given the true (reference) model 𝑓 defined over the domain 𝐷 (in our case,
𝑓(𝑥) = 5 + 3𝑥³ and 𝐷 = [0, 1]) and an empirically fitted model 𝑓,̂ we can compute the square
root of the integrated squared error over the whole 𝐷:
$$\mathrm{RMSE}(f, \hat{f}) = \sqrt{\int_D \left(f(x) - \hat{f}(x)\right)^2\, dx}.$$
For polynomials and other simple functions, RMSE can be computed analytically. More gener-
ally, we can approximate it numerically by sampling the above at sufficiently many points and
applying the trapezoidal rule (e.g., [77]). As this can be an educative programming exercise, let’s
consider a range of polynomial models of different degrees.
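The experiment itself is not reproduced in this excerpt; a sketch consistent with the later references to ps and cs could read (scipy.integrate provides the trapezoidal rule):
import scipy.integrate
ps = np.arange(1, 10)           # polynomial degrees to consider
_xx = np.linspace(0, 1, 1001)   # a dense grid over D = [0, 1]
cs, rmse_train, rmse_true = [], [], []
for p in ps:
    X_cur = x.reshape(-1, 1)**np.arange(p+1)
    c = scipy.linalg.lstsq(X_cur, y)[0]
    cs.append(c)
    rmse_train.append(np.sqrt(np.mean((y - c @ X_cur.T)**2)))
    _d = (true_model(_xx) - c @ (_xx.reshape(-1, 1)**np.arange(p+1)).T)**2
    rmse_true.append(np.sqrt(scipy.integrate.trapezoid(_d, _xx)))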
[Figure 9.13 plots the RMSEs (on a logarithmic scale) against the model complexity (polynomial degree).]
Figure 9.13. Small RMSE on training data does not necessarily imply good generalisa-
tion abilities.
Figure 9.13 shows that a model's ability to make correct generalisations onto unseen data
improves as the complexity increases, at least initially. However, it then becomes worse.
It is a typical behaviour. In fact, the model with the smallest RMSE on the training set
overfits to the input sample; see Figure 9.14.
plt.plot(x, y, "o")
plt.plot(_x, true_model(_x), "--", label="true model")
for i in [0, 1, 8]:
    plt.plot(_x, cs[i] @ (_x.reshape(-1, 1)**np.arange(ps[i]+1)).T,
             label=f"fitted degree-{ps[i]} polynomial")
plt.legend()
plt.show()
[Figure 9.14 (referred to above) depicts the true model together with a few fitted polynomials of different degrees.]
Overall, models must never be blindly trusted. Common sense must always be applied.
The fact that we fitted something using a sophisticated procedure on a dataset that
was hard to obtain does not justify its use. Mediocre models must be discarded, and
we should move on, regardless of how much time/resources we have invested whilst
developing them. Too many bad models go into production and make our daily lives
harder. We need to end this madness.
Because of this, we shall only present a quick demo of scikit-learn's¹⁴ API. We will do that
by fitting a multiple linear regression model again for the weight as a function of the
arm and the hip circumference:
X_train = body[:, [4, 5]]
y_train = body[:, 0]
import sklearn.linear_model
lm = sklearn.linear_model.LinearRegression(fit_intercept=True)
lm.fit(X_train, y_train)
lm.intercept_, lm.coef_
## (-63.383425410947524, array([1.30457807, 0.8986582 ]))
14 https://scikit-learn.org/stable/index.html
y_pred = lm.predict(X_train)
import sklearn.metrics
sklearn.metrics.r2_score(y_train, y_pred)
## 0.9243996585518783
This function is convenient, but can we really recall the formula for the score and what
it really measures?
Let’s fit a degree-4 polynomial to the life expectancy vs per capita GDP dataset.
x_original = world[:, 0]
X_train = (x_original.reshape(-1, 1))**[0, 1, 2, 3, 4]
y_train = world[:, 1]
cs = dict()
We store the estimated model coefficients in a dictionary because many methods will
follow next. First, scipy:
res = scipy.linalg.lstsq(X_train, y_train)
cs["scipy_X"] = res[0]
cs["scipy_X"]
## array([ 2.33103950e-16, 6.42872371e-12, 1.34162021e-07,
## -2.33980973e-12, 1.03490968e-17])
If we drew the fitted polynomial now (see Figure 9.15), we would see that the fit is un-
believably bad. The result returned by scipy.linalg.lstsq is now not at all optimal.
All coefficients are approximately equal to 0.
It turns out that the fitting problem is extremely ill-conditioned (and it is not the al-
gorithm’s fault): GDPs range from very small to very large ones. Furthermore, tak-
ing them to the fourth power breeds numbers of ever greater range. Finding the least
squares solution involves some form of matrix inverse (not necessarily directly) and
our model matrix may be close to singular (one that is not invertible).
As a measure of the model matrix's ill-conditioning, we often use the condition number,
denoted 𝜅(𝐗𝑇). It is the ratio of the largest to the smallest singular value¹⁶ of 𝐗𝑇;
the singular values are returned by the scipy.linalg.lstsq method itself:
15 There are methods in statistical learning where there might be multiple local minima.
s = res[3] # singular values of X_train.T
s
## array([5.63097211e+20, 7.90771769e+14, 4.48366565e+09, 6.77575417e+04,
## 5.76116463e+00])
Note that they are already sorted nonincreasingly. The condition number 𝜅(𝐗𝑇 ) is
equal to:
s[0] / s[-1] # condition number (largest/smallest singular value)
## 9.774017018467431e+19
As a rule of thumb, if the condition number is 10𝑘 , we are losing 𝑘 digits of numerical
precision when performing the underlying computations. As the foregoing number is
exceptionally large, we are thus currently faced with a very ill-conditioned problem. If
the values in 𝐗 or 𝐲 are perturbed even slightly, we might expect significant changes
in the computed regression coefficients.
Note (**) The least squares regression problem can be solved by means of the singular
value decomposition of the model matrix, see Section 9.3.4. Let 𝐔𝐒𝐐 be the SVD of
𝐗𝑇 . Then 𝐜 = 𝐔𝐒−1 𝐐𝐲, with 𝐒−1 = diag(1/𝑠1,1 , … , 1/𝑠𝑚,𝑚 ). As 𝑠1,1 ≥ … ≥ 𝑠𝑚,𝑚
gives the singular values of 𝐗𝑇 , the aforementioned condition number can simply be
computed as 𝑠1,1 /𝑠𝑚,𝑚 .
Let’s verify the method used by scikit-learn. As it fits the intercept separately, we ex-
pect it to be slightly better-behaving; still, it is merely a wrapper around scipy.linalg.
lstsq, but with a different API.
import sklearn.linear_model
lm = sklearn.linear_model.LinearRegression(fit_intercept=True)
lm.fit(X_train[:, 1:], y_train)
cs["sklearn"] = np.r_[lm.intercept_, lm.coef_]
cs["sklearn"]
## array([ 6.92257708e+01, 5.05752755e-13, 1.38835643e-08,
## -2.18869346e-13, 9.09347772e-19])
16 (**) Being themselves the square roots of the eigenvalues of 𝐗𝑇𝐗. Equivalently, 𝜅(𝐗𝑇) = ‖(𝐗𝑇)−1‖ ‖𝐗𝑇‖
with respect to the spectral norm. Seriously, we really need to get a good grasp of linear algebra to become
successful data scientists.
The condition number is also enormous. Still, scikit-learn did not warn us about
this being the case (insert frowning face emoji here). Had we trusted the solution returned
by it, we would have ended up with conclusions from our data analysis built on sand.
As we said in Section 9.2.8, the package designers assumed that the users know what
they are doing. This is okay, we are all adults here, although some of us are still
learning.
Overall, if the model matrix is close to singular, the computation of its inverse is prone
to enormous numerical errors. One way of dealing with this is to remove highly cor-
related variables (the multicollinearity problem). Interestingly, standardisation can
sometimes make the fitting more numerically stable.
Let 𝐙 be a standardised version of the model matrix 𝐗 with the intercept part (the
column of 1s) not included, i.e., with 𝐳⋅,𝑗 = (𝐱⋅,𝑗 − 𝑥̄𝑗)/𝑠𝑗, where 𝑥̄𝑗 and 𝑠𝑗 denote the
arithmetic mean and the standard deviation of the 𝑗-th column in 𝐗. If (𝑑1, …, 𝑑𝑚−1)
is the least squares solution for 𝐙, then the least squares solution to the underlying
original regression problem is:
$$\boldsymbol{c} = \left( \bar{y} - \sum_{j=1}^{m-1} \bar{x}_j \frac{d_j}{s_j},\ \frac{d_1}{s_1},\ \frac{d_2}{s_2},\ \dots,\ \frac{d_{m-1}}{s_{m-1}} \right),$$
with the first term constituting the intercept.
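The corresponding computations do not appear in this excerpt; a sketch consistent with the above formula and with the scipy_Z entry in Figure 9.15:
means = np.mean(X_train[:, 1:], axis=0)
stds = np.std(X_train[:, 1:], axis=0)
Z_train = (X_train[:, 1:] - means)/stds    # standardised columns; no intercept column
resZ = scipy.linalg.lstsq(Z_train, y_train)
d = resZ[0]
cs["scipy_Z"] = np.r_[np.mean(y_train) - np.sum(d*means/stds), d/stds]
resZ[3][0] / resZ[3][-1]                   # condition number of the new model matrix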
It is still far from perfect (we would prefer a value close to 1) but nevertheless it is a
significant improvement over what we had before.
Figure 9.15 depicts the three fitted models, each claiming to be the solution to the ori-
ginal regression problem. Note that, luckily, we know that in our case the logarithmic
model is better than the polynomial one.
[Figure 9.15: life expectancy (years) vs per capita GDP PPP with the three fitted curves; the legend reports scipy_X SSR=562307.49, sklearn SSR=6018.16, and scipy_Z SSR=4334.68.]
Figure 9.15. Ill-conditioned model matrix can give a very wrong model.
Exercise 9.15 Check the condition numbers of all the models fitted so far in this chapter via the
least squares method.
To be strict, if we read a paper in, say, social or medical sciences (amongst others)
where the researchers fit a regression model but do not provide the model matrix’s
condition number, it is worthwhile to doubt the conclusions they make.
On a final note, we might wonder why the standardisation is not done automatically
by the least squares solver. As usual with most numerical methods, there is no one-
fits-all solution: e.g., when there are columns of extremely small variance or there are
outliers in data. This is why we need to study all the topics deeply: to be able to respond
flexibly to many different scenarios ourselves.
The dot product has a nice geometrical interpretation: 𝒙 ⋅ 𝒚 = ‖𝒙‖ ‖𝒚‖ cos 𝛼, where 𝛼 is
the angle between two given vectors 𝒙, 𝒚 ∈ ℝ𝑛. In plain English, it is the product of the
magnitudes of the two vectors and the cosine of the angle between them.
We can retrieve the cosine part by computing the dot product of the normalised vectors,
i.e., such that their magnitudes are equal to 1:
$$\cos \alpha = \frac{\boldsymbol{x}}{\|\boldsymbol{x}\|} \cdot \frac{\boldsymbol{y}}{\|\boldsymbol{y}\|}.$$
For example, consider two vectors in ℝ2 , 𝒖 = (1/2, 0) and 𝒗 = (√2/2, √2/2), which
are depicted in Figure 9.16.
u = np.array([0.5, 0])
v = np.array([np.sqrt(2)/2, np.sqrt(2)/2])
The dot product of their normalised versions, i.e., the cosine of the angle between
them is:
u_norm = u/np.sqrt(np.sum(u*u))
v_norm = v/np.sqrt(np.sum(v*v)) # BTW: this vector is already normalised
np.sum(u_norm*v_norm)
## 0.7071067811865476
The angle itself can be determined by referring to the inverse of the cosine function,
i.e., arccosine.
np.arccos(np.sum(u_norm*v_norm)) * 180/np.pi
## 45.0
[Figure 9.16 depicts the two example vectors, [0.500, 0.000] and [0.707, 0.707].]
Important If two vectors are collinear (codirectional, one is a scaled version of another,
angle 0), then cos 0 = 1. If they point in opposite directions (±𝜋 = ±180° angle), then
cos ±𝜋 = −1. For vectors that are orthogonal (perpendicular, ±𝜋/2 = ±90° angle), we
get cos ±𝜋/2 = 0.
Note (**) The standard deviation 𝑠 of a vector 𝒙 ∈ ℝ𝑛 that has already been centred
(whose components' mean is 0) is a scaled version of its magnitude, i.e., 𝑠 = ‖𝒙‖/√𝑛.
Looking at the definition of the Pearson linear correlation coefficient (Section 9.1.1),
we see that it is the dot product of the standardised versions of two vectors 𝒙 and 𝒚
divided by the number of elements therein. If the vectors are centred, we can rewrite
the formula equivalently as 𝑟(𝒙, 𝒚) = (𝒙/‖𝒙‖) ⋅ (𝒚/‖𝒚‖), and thus 𝑟(𝒙, 𝒚) = cos 𝛼. It is
not easy to imagine vectors in high-dimensional spaces, but from this observation we can
at least imply the fact that 𝑟 is bounded between -1 and 1. In this context, being not
linearly correlated corresponds to the vectors' orthogonality.
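We can verify this numerically (a quick sketch), e.g., on the weight and hip circumference columns of the body dataset:
x, y = body[:, 0], body[:, 5]
xc, yc = x - np.mean(x), y - np.mean(y)  # centred versions
np.sum(xc*yc)/(np.sqrt(np.sum(xc**2))*np.sqrt(np.sum(yc**2)))  # equals scipy.stats.pearsonr(x, y)[0]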
If 𝐒 = diag(𝑠1, 𝑠2, …, 𝑠𝑚) is a diagonal matrix, then 𝐗𝐒 represents scaling (stretching)
with respect to the individual axes of the coordinate system because:
$$\mathbf{X}\mathbf{S} = \begin{bmatrix} s_1 x_{1,1} & s_2 x_{1,2} & \cdots & s_m x_{1,m} \\ s_1 x_{2,1} & s_2 x_{2,2} & \cdots & s_m x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ s_1 x_{n-1,1} & s_2 x_{n-1,2} & \cdots & s_m x_{n-1,m} \\ s_1 x_{n,1} & s_2 x_{n,2} & \cdots & s_m x_{n,m} \end{bmatrix}.$$
The above can be expressed in numpy without referring to the matrix multiplication.
A notation like X * np.array([s1, s2, ..., sm]).reshape(1, -1) will suffice
(elementwise multiplication and proper shape broadcasting).
In particular, the matrix representing the rotation in ℝ2 about the origin (0, 0) by the
counterclockwise angle 𝛼:
$$\mathbf{R}(\alpha) = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix},$$
is orthonormal (which can be easily verified using the basic trigonometric equalities).
Furthermore:
$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix},$$
represent the two reflections, one against the x- and the other against the y-axis, re-
spectively. Both are orthonormal matrices too.
Consider a dataset 𝐗′ in ℝ2 :
np.random.seed(12345)
Xp = np.random.randn(10000, 2) * 0.25
Let's consider its scaled, rotated, and shifted version:
$$\mathbf{X} = \mathbf{X}' \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix} \begin{bmatrix} \cos\frac{\pi}{6} & \sin\frac{\pi}{6} \\ -\sin\frac{\pi}{6} & \cos\frac{\pi}{6} \end{bmatrix} + \begin{bmatrix} 3 & 2 \end{bmatrix}.$$
t = np.array([3, 2])
S = np.diag([2, 0.5])
S
## array([[2. , 0. ],
## [0. , 0.5]])
alpha = np.pi/6
Q = np.array([
[ np.cos(alpha), np.sin(alpha)],
[-np.sin(alpha), np.cos(alpha)]
])
Q
## array([[ 0.8660254, 0.5 ],
## [-0.5 , 0.8660254]])
X = Xp @ S @ Q + t
Figure 9.17. A dataset and its scaled, rotated, and shifted version.
Computing such linear combinations of columns is not rare during a dataset's
preprocessing step, especially if the columns are on the same scale or are unitless. As a
matter of fact, the standardisation itself is a form of scaling and translation.
Exercise 9.16 Assume that we have a dataset with two columns representing the number of
apples and the number of oranges in clients’ baskets. What orthonormal and scaling transforms
should be applied to obtain a matrix that gives the total number of fruits and surplus apples
(e.g., to convert a row (4, 7) to (11, −3))?
The inverse of a square matrix 𝐀 (if it exists) is the matrix 𝐀−1 such that 𝐀−1𝐀 = 𝐀𝐀−1 = 𝐈.
Noting that the identity matrix 𝐈 is the neutral element of the matrix multiplication,
the above is thus the analogue of the inverse of a scalar: something like
3 ⋅ 3−1 = 3 ⋅ (1/3) = (1/3) ⋅ 3 = 1.
Important For any invertible matrices of admissible shapes, it might be shown that
the following noteworthy properties hold:
• (𝐀−1 )𝑇 = (𝐀𝑇 )−1 ,
• (𝐀𝐁)−1 = 𝐁−1 𝐀−1 ,
• a matrix equality 𝐀 = 𝐁𝐂 holds if and only if 𝐀𝐂−1 = 𝐁𝐂𝐂−1 = 𝐁; this is also
equivalent to 𝐁−1 𝐀 = 𝐁−1 𝐁𝐂 = 𝐂.
In particular, this allows us to invert the transformation considered earlier and recover
the original dataset: 𝐗′ = (𝐗 − 𝐭)𝐐𝑇𝐒−1 (recall that 𝐐−1 = 𝐐𝑇 for an orthonormal 𝐐 and
that 𝐒 is diagonal).
Let’s verify this numerically (testing equality up to some inherent round-off error):
The (thin) singular value decomposition (SVD) of a matrix 𝐗 ∈ ℝ𝑛×𝑚 expresses it as the
product 𝐗 = 𝐔𝐒𝐐, where:
• 𝐔 is an 𝑛 × 𝑚 semi-orthonormal matrix (its columns are orthonormal vectors; we
have 𝐔𝑇 𝐔 = 𝐈),
• 𝐒 is an 𝑚 × 𝑚 diagonal matrix such that 𝑠1,1 ≥ 𝑠2,2 ≥ … ≥ 𝑠𝑚,𝑚 ≥ 0,
• 𝐐 is an 𝑚 × 𝑚 orthonormal matrix.
Important In data analysis, we usually apply the SVD on matrices that have already
been centred (so that their column means are all 0).
For example:
import scipy.linalg
n = X.shape[0]
X_centred = X - np.mean(X, axis=0)
U, s, Q = scipy.linalg.svd(X_centred, full_matrices=False)
And now:
U[:6, :] # preview the first few rows
## array([[-0.00195072, 0.00474569],
## [-0.00510625, -0.00563582],
## [ 0.01986719, 0.01419324],
## [ 0.00104386, 0.00281853],
## [ 0.00783406, 0.01255288],
## [ 0.01025205, -0.0128136 ]])
The norms of all the columns in 𝐔 are all equal to 1 (and hence standard deviations are
1/√𝑛). Consequently, they are on the same scale:
np.std(U, axis=0), 1/np.sqrt(n) # compare
## (array([0.01, 0.01]), 0.01)
What is more, they are orthogonal: their dot products are all equal to 0. Regarding
what we said about Pearson’s linear correlation coefficient and its relation to dot
products of normalised vectors, we imply that the columns in 𝐔 are not linearly cor-
related. In some sense, they form independent dimensions.
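We can convince ourselves of this numerically, for instance:
np.round(U.T @ U, 8)  # (numerically) the 2x2 identity matrix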
Now, we have 𝐒 = diag(𝑠1 , … , 𝑠𝑚 ), with the elements on the diagonal being:
s
## array([49.72180455, 12.5126241 ])
The elements on the main diagonal of 𝐒 are used to scale the corresponding columns
in 𝐔. The fact that they are ordered decreasingly means that the first column in 𝐔𝐒 has
the greatest standard deviation, the second column has the second greatest variability,
and so forth.
S = np.diag(s)
US = U @ S
np.std(US, axis=0) # equal to s/np.sqrt(n)
## array([0.49721805, 0.12512624])
Multiplying 𝐔𝐒 by 𝐐 simply rotates and/or reflects the dataset. This brings 𝐔𝐒 to a new
coordinate system where, by construction, the dataset projected onto the direction
determined by the first row in 𝐐, i.e., 𝐪1,⋅ has the largest variance, projection onto
𝐪2,⋅ has the second largest variance, and so on.
Q
## array([[ 0.86781968, 0.49687926],
## [-0.49687926, 0.86781968]])
This is why we refer to the rows in 𝐐 as principal directions (or components). Their scaled
versions (proportional to the standard deviations along them) are depicted in Fig-
ure 9.18. Note that we have more or less recreated the steps needed to construct 𝐗 from
𝐗′ (by the way we generated 𝐗′ , we expect it to have linearly uncorrelated columns; yet,
𝐗′ and 𝐔 have different column variances).
plt.plot(X_centred[:, 0], X_centred[:, 1], "o", alpha=0.1)
plt.arrow(
0, 0, Q[0, 0]*s[0]/np.sqrt(n), Q[0, 1]*s[0]/np.sqrt(n), width=0.02,
facecolor="red", edgecolor="white", length_includes_head=True, zorder=2)
plt.arrow(
0, 0, Q[1, 0]*s[1]/np.sqrt(n), Q[1, 1]*s[1]/np.sqrt(n), width=0.02,
facecolor="red", edgecolor="white", length_includes_head=True, zorder=2)
plt.show()
Figure 9.18. Principal directions of an example dataset (scaled so that they are propor-
tional to the standard deviations along them).
chainlink = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/clustering/fcps_chainlink.csv")
Section 7.4 said that the plotting is always done on a two-dimensional surface (be it
the computer screen or book page). We can look at the dataset only from one angle at
a time.
In particular, a scatter plot matrix only depicts the dataset from the perspective of the
axes of the Cartesian coordinate system (standard basis); see Figure 9.19 (we used a
function we defined in Section 7.4.3).
pairplot(chainlink, ["axis1", "axis2", "axis3"]) # our function
plt.show()
These viewpoints by no means must reveal the true geometric structure of the dataset.
However, we know that we can rotate the virtual camera and find some more interesting
angle. It turns out that our dataset represents two nonintersecting rings, hopefully
visible in Figure 9.20.
fig = plt.figure()
ax = fig.add_subplot(1, 3, 1, projection="3d", facecolor="#ffffff00")
ax.scatter(chainlink[:, 0], chainlink[:, 1], chainlink[:, 2])
ax.view_init(elev=45, azim=45, vertical_axis="z")
ax = fig.add_subplot(1, 3, 2, projection="3d", facecolor="#ffffff00")
ax.scatter(chainlink[:, 0], chainlink[:, 1], chainlink[:, 2])
ax.view_init(elev=37, azim=0, vertical_axis="z")
It turns out that we may find a noteworthy viewpoint using the SVD. Namely, we can
perform the decomposition of a centred dataset which we denote by 𝐗:
𝐗 = 𝐔𝐒𝐐.
import scipy.linalg
X_centered = chainlink-np.mean(chainlink, axis=0)
U, s, Q = scipy.linalg.svd(X_centered, full_matrices=False)
Consider the rotated/reflected version of the dataset, 𝐏 = 𝐗𝐐−1 = 𝐔𝐒. We know that its
first column has the highest variance, the second column has the second highest
variability, and so on. It might indeed be worth looking at that dataset from that most
informative perspective.
Figure 9.21 gives the scatter plot for 𝐩⋅,1 and 𝐩⋅,2 . Maybe it does not reveal the true
geometric structure of the dataset (no single two-dimensional projection can do that),
but at least it is better than the initial ones (from the pairs plot).
P2 = U[:, :2] @ np.diag(s[:2])  # the same as (U @ np.diag(s))[:, :2]
plt.plot(P2[:, 0], P2[:, 1], "o")
plt.axis("equal")
plt.show()
What we just did is a kind of dimensionality reduction. We found a viewpoint (in the form
of an orthonormal matrix, being a mixture of rotations and reflections) on 𝐗 such that
its orthonormal projection onto the first two axes of the Cartesian coordinate system
is the most informative18 (in terms of having the highest variance along these axes).
18 (**) The Eckart–Young–Mirsky theorem states that 𝐔
⋅,∶𝑘 𝐒∶𝑘,∶𝑘 𝐐∶𝑘,⋅ (where “∶ 𝑘” denotes “the first 𝑘
rows or columns”) is the best rank-𝑘 approximation of 𝐗 with respect to both the Frobenius and spectral
norms.
[Figure 9.21: a scatter plot of the chainlink dataset projected onto its first two principal directions.]
Each index is on the scale from 0 to 10. These are, in this order:
1. Safe Sanitation,
2. Healthy Life,
3. Energy Use,
4. Greenhouse Gases,
5. Gross Domestic Product.
19 https://ssi.wi.th-koeln.de/
Above we displayed the data corresponding to six countries:
countries = list(ssi.iloc[:, 0]) # select the 1st column from the data frame
countries[:6] # preview
## ['Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia']
This is a five-dimensional dataset. We cannot easily visualise it. Observing that the
pairs plot does not reveal too much is left as an exercise. Let’s thus perform the SVD
decomposition of a standardised version of this dataset, 𝐙 (recall that the centring is
necessary, at the very least).
Z = (X - np.mean(X, axis=0))/np.std(X, axis=0)
U, s, Q = scipy.linalg.svd(Z, full_matrices=False)
The standard deviations of the data projected onto the consecutive principal compon-
ents (columns in 𝐔𝐒) are:
s/np.sqrt(n)
## array([2.02953531, 0.7529221 , 0.3943008 , 0.31897889, 0.23848286])
It is customary to check the ratios of the cumulative variances explained by the con-
secutive principal components, which is a normalised measure of their importances.
We can compute them by calling:
np.cumsum(s**2)/np.sum(s**2)
## array([0.82380272, 0.93718105, 0.96827568, 0.98862519, 1. ])
As, in some sense, the variability within the first two components covers c. 94% of the
variability of the whole dataset, we can restrict ourselves only to a two-dimensional
projection of this dataset. The rows in 𝐐 define the loadings, which give the coefficients
defining the linear combinations of the columns in 𝐙 that correspond to the principal
components.
Let’s try to find their interpretation.
np.round(Q[0, :], 2) # loadings – the first principal axis
## array([-0.43, -0.43, 0.44, 0.45, -0.47])
The first row in 𝐐 consists of similar values, but with different signs. We can consider
them a scaled version of the average Energy Use (column 3), Greenhouse Gases (4), and
MINUS Safe Sanitation (1), MINUS Healthy Life (2), MINUS Gross Domestic Product
(5). We could call this a measure of a country’s overall eco-unfriendliness(?) because
countries with low Healthy Life and high Greenhouse Gases will score highly on this
scale.
The second row in 𝐐 defines a scaled version of the average of Safe Sanitation (1),
Healthy Life (2), Energy Use (3), and Greenhouse Gases (4), almost completely ignor-
ing the GDP (5). Can we call it a measure of industrialisation? Something like this. But
this naming is just for fun20 .
Figure 9.22 is a scatter plot of the countries projected onto the said two principal dir-
ections. For readability, we only display a few chosen labels. This is merely a projec-
tion/approximation, but it might be an interesting one for some decision makers.
P2 = U[:, :2] @ np.diag(s[:2]) # == Y @ Q[:2, :].T
plt.plot(P2[:, 0], P2[:, 1], "o", alpha=0.1)
which = [ # hand-crafted/artisan
141, 117, 69, 123, 35, 80, 93, 45, 15, 2, 60, 56, 14,
104, 122, 8, 134, 128, 0, 94, 114, 50, 34, 41, 33, 77,
64, 67, 152, 135, 148, 99, 149, 126, 111, 57, 20, 63
]
for i in which:
    plt.text(P2[i, 0], P2[i, 1], countries[i], ha="center")
plt.axis("equal")
plt.xlabel("1st principal component (eco-unfriendliness?)")
plt.ylabel("2nd principal component (industrialisation?)")
plt.show()
Exercise 9.17 Perform a principal component analysis of the body dataset. Project the points
onto a two-dimensional plane.
20 Mathematics, unlike the brains of ordinary mortals, does not need our imperfect interpretations/fairy tales to
function properly. We need more maths in our lives.
[Figure 9.22: the countries projected onto the two principal directions, with the 1st principal component (eco-unfriendliness?) on the x-axis and the 2nd principal component (industrialisation?) on the y-axis; a few selected country labels are displayed.]
9.5 Exercises
Exercise 9.18 Why is correlation not causation?
Exercise 9.19 What does a linear correlation of 0.9 mean? What about a rank correlation
of 0.9? And a linear correlation of 0.0?
Exercise 9.20 How is Spearman’s coefficient related to Pearson’s one?
Exercise 9.21 State the optimisation problem behind the least squares fitting of linear models.
Exercise 9.22 What are the different ways of the numerical summarising of residuals?
Exercise 9.23 Why is it important for the residuals to be homoscedastic?
Exercise 9.24 Is a more complex model always better?
Exercise 9.25 Why must extrapolation be handled with care?
Exercise 9.26 Why did we say that novice users should refrain from using scikit-learn?
Exercise 9.27 What is the condition number of a model matrix and why is it worthwhile to
always check it?
Exercise 9.28 What is the geometrical interpretation of the dot product of two normalised vec-
tors?
Exercise 9.29 How can we verify if two vectors are orthonormal? What is an orthonormal pro-
jection? What is the inverse of an orthonormal matrix?
IV Heterogeneous data
10 Introducing data frames
numpy arrays are an extremely versatile tool for performing data analysis activities and
other numerical computations of various kinds. Even though it is theoretically pos-
sible otherwise, in practice, we only store elements of the same type there: most often
numbers.
pandas1 [66] is amongst over one hundred thousand2 open-source packages and re-
positories that use numpy to provide additional data wrangling functionality. It was
originally written by Wes McKinney but was heavily inspired by the data.frame3 ob-
jects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.
import pandas as pd
Important Let’s repeat: pandas is built on top of numpy and most objects therein can
be processed by numpy functions as well. Many other functions, e.g., in scikit-learn,
accept both DataFrame and ndarray objects, but often convert the former to the latter
internally to enable data processing using fast, lower-level C/C++/Fortran routines.
What we have learnt so far4 still applies. But there is more, hence this part.
Notice that rows and columns are labelled (and how readable that is).
A dictionary of vector-like objects of equal lengths is another common data source:
np.random.seed(123)
df = pd.DataFrame(dict(
a = np.round(np.random.rand(5), 2),
b = [1, 2.5, np.nan, 4, np.nan],
c = [True, True, False, False, True],
d = ["A", "B", "C", None, "E"],
e = ["spam", "spam", "bacon", "spam", "eggs"],
f = np.array([
"2021-01-01", "2022-02-02", "2023-03-03", "2024-04-04", "2025-05-05"
], dtype="datetime64[D]"),
g = [
["spam"], ["bacon", "spam"], None, ["eggs", "bacon", "spam"], ["ham"]
    ]
))
4 If, by any chance, some overenthusiastic readers decided to start this superb book at this chapter, it is
now the time to go back to the Preface and learn everything in the right order. See you later.
Reading from URLs and local files is also supported; compare Section 13.6.1.
Exercise 10.2 Check out other pandas.read_* functions in the pandas documentation, e.g.,
for importing spreadsheets, Apache Parquet and HDF5 files, scraping tables from HTML docu-
ments, or reading data from relational databases. We will discuss some of them in more detail
later.
Exercise 10.3 (*) Large files that do not fit into computer’s memory (but not too large) can still
be read with pandas.read_csv. Check out the meaning of the usecols, dtype, skiprows,
and nrows arguments. On a side note, sampling is mentioned in Section 10.5.4 and chunking in
Section 13.2.
5 https://pandas.pydata.org/docs
df.shape
## (5, 7)
Recall that numpy arrays are equipped with the dtype slot.
10.1.2 Series
There is a separate class for storing individual data frame columns: it is called Series.
s = df.loc[:, "a"] # extract the `a` column; alternatively: df.a
s
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
## Name: a, dtype: float64
Data frames with one column are printed out slightly differently. We get the column
name at the top, but do not have the dtype information at the bottom.
s.to_frame() # or: pd.DataFrame(s)
## a
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
Important It is crucial to know when we are dealing with a Series and when with a
DataFrame object as each of them defines a slightly different set of methods.
We will now be relying on object-orientated syntax (compare Section 2.2.3) much more
frequently than before.
As a consequence, what we covered in the part of this book that dealt with vector pro-
cessing still holds for data frame columns (but there will be more).
Series can also be named.
s.name
## 'a'
This is convenient, especially when we convert them to a data frame as the name sets
the label of the newly created column:
s.rename("spam").to_frame()
## spam
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
10.1.3 Index
Another important class is called Index6 . We use it for storing element or axes labels.
The index (lowercase) slot of a data frame stores an object of the class Index (or one of
its derivatives) that gives the row names:
df.index # row labels
## RangeIndex(start=0, stop=5, step=1)
The set_index method can be applied to make a data frame column act as a sequence
of row labels:
df2 = df.set_index("e")
df2
## a b c d f g
## e
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
6 The name Index is confusing not only because it clashes with the index operator (square brackets), but
also with the concept of an index in relational databases. In pandas, we can have nonunique row names.
Having a named index slot is handy when converting a vector of row labels back to a
standalone column:
df2.rename_axis(index="NEW_COLUMN").reset_index()
## NEW_COLUMN a b c d f g
## 0 spam 0.70 1.0 True A 2021-01-01 [spam]
## 1 spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 bacon 0.23 NaN False C 2023-03-03 None
## 3 spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 eggs 0.72 NaN True E 2025-05-05 [ham]
There is also an option to drop the current index whatsoever and to replace it with the
default label sequence, i.e., 0, 1, 2, …:
df2.reset_index(drop=True)
## a b c d f g
## 0 0.70 1.0 True A 2021-01-01 [spam]
## 1 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C 2023-03-03 None
## 3 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E 2025-05-05 [ham]
Take note of the fact that reset_index, and many other methods that we have used so
far, do not modify the data frame in place.
Exercise 10.5 Use the pandas.DataFrame.rename method to change the name of the a
column in df to spam.
Also, a hierarchical index – one that is comprised of more than one level – is possible.
For example, here is a sorted (see Section 10.6.1) version of df with a new index based
on two columns at the same time:
df.sort_values("e", ascending=False).set_index(["e", "c"])
## a b d f g
## e c
## spam True 0.70 1.0 A 2021-01-01 [spam]
## True 0.29 2.5 B 2022-02-02 [bacon, spam]
## False 0.55 4.0 None 2024-04-04 [eggs, bacon, spam]
## eggs True 0.72 NaN E 2025-05-05 [ham]
## bacon False 0.23 NaN C 2023-03-03 None
For the sake of readability, the consecutive repeated spams were not printed.
Example 10.6 Hierarchical indexes might arise after aggregating data in groups. For example:
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#").rename({
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1)
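The aggregation step itself is not included in this excerpt; given the output below, it presumably took the form:
res = nhanes.groupby(["gender", "usborn"])["bmival"].mean()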
This returned a Series object with a hierarchical index. If we do not fancy it, reset_index
comes to our rescue:
res.reset_index()
## gender usborn bmival
## 0 1 1 25.734110
## 1 1 2 27.405251
## 2 2 1 27.120261
## 3 2 2 27.579448
All numpy functions can be applied on individual columns, i.e., objects of the type
Series, because they are vector-like.
u = df.loc[:, "u"] # extract the `u` column (gives a Series; see below)
np.quantile(u, [0, 0.5, 1])
## array([0.23, 0.55, 0.72])
Most numpy functions also work if they are fed with data frames, but we will need to
extract the numeric columns manually.
uv = df.loc[:, ["u", "v"]] # select two columns (a DataFrame; see below)
np.quantile(uv, [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
## [ 0.55, -0.05],
## [ 0.72, 1.98]])
Sometimes the results will automatically be coerced to a Series object with the index
slot set appropriately:
np.mean(uv, axis=0)
## u 0.498
## v 0.086
## dtype: float64
For convenience, many operations are also available as methods for the Series and
DataFrame classes, e.g., mean, median, min, max, quantile, var, std, and skew.
df.mean(numeric_only=True)
## u 0.498
## v 0.086
## dtype: float64
df.quantile([0, 0.5, 1], numeric_only=True)
## u v
## 0.0 0.23 -1.62
## 0.5 0.55 -0.05
## 1.0 0.72 1.98
Also note the describe method, which returns a few statistics at the same time.
df.describe()
## u v
## count 5.000000 5.000000
## mean 0.498000 0.086000
## std 0.227969 1.289643
## min 0.230000 -1.620000
## 25% 0.290000 -0.200000
## 50% 0.550000 -0.050000
## 75% 0.700000 0.320000
## max 0.720000 1.980000
Exercise 10.7 Check out the pandas.DataFrame.agg method that can apply all aggregates
given by a list of functions. Compose a call equivalent to df.describe().
Note (*) Let’s stress that above we see the corrected for bias (but still only asymptotic-
𝑛
ally unbiased) version of standard deviation, given by √ 𝑛−1
1
∑𝑖=1 (𝑥𝑖 − 𝑥)̄ 2 ; compare
Section 5.1. In pandas, std methods assume ddof=1 by default, whereas we recall that
numpy uses ddof=0.
This is an unfortunate inconsistency between the two packages, but please do not
blame the messenger.
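Here is a quick demonstration of the discrepancy on a toy Series:
s = pd.Series([1.0, 2.0, 3.0, 4.0])
(np.std(np.array(s)), s.std())  # numpy's default (ddof=0) vs pandas' default (ddof=1)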
When applying the binary arithmetic, relational, and logical operators on an object of
the class Series and a scalar or a numpy vector, the operations are performed element-
wisely – a style with which we are already familiar.
For instance, here is a standardised version of the u column:
u = df.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a 0.990672
## b -1.020098
## c -1.314357
## d 0.255025
## e 1.088759
## Name: u, dtype: float64
Binary operators act on the elements with corresponding labels. For two objects hav-
ing identical index slots (this is the most common scenario), this is the same as ele-
mentwise vectorisation. For instance:
df.loc[:, "u"] > df.loc[:, "v"] # here: elementwise comparison
## a True
## b True
## c True
## d False
## e True
## dtype: bool
Anticipating what we cover in the next section, in both cases, we can write df.loc[:,
["u", "v"]] = uv2 to replace the old content. Also, new columns can be added based
on the transformed versions of the existing ones. For instance:
df.loc[:, "uv_squared"] = (df.loc[:, "u"] * df.loc[:, "v"])**2
df
## u v w uv_squared
## a 0.70 0.32 spam 0.050176
## b 0.29 -0.05 bacon 0.000210
## c 0.23 -0.20 spam 0.002116
## d 0.55 1.98 eggs 1.185921
## e 0.72 -1.62 sausage 1.360489
Example 10.8 (*) Binary operations on objects with different index slots are vectorised la-
belwisely:
x = pd.Series([1, 10, 100, 1000, 10000], index=["a", "b", "a", "a", "c"])
x
## a 1
## b 10
## a 100
## a 1000
## c 10000
## dtype: int64
y = pd.Series([1, 2, 3, 4, 5], index=["b", "b", "a", "d", "c"])
And now:
x * y
## a 3.0
## a 300.0
## a 3000.0
## b 10.0
## b 20.0
## c 50000.0
## d NaN
## dtype: float64
Here, each element in the first Series named a was multiplied by each (there was only one)
element labelled a in the second Series. For d, there were no matches, hence the result’s being
marked as missing; compare Chapter 15. Thus, it behaves like a full outer join-type operation; see
Section 10.6.3.
The above is different from elementwise vectorisation in numpy:
np.array(x) * np.array(y)
## array([ 1, 20, 300, 4000, 50000])
Labelwise vectorisation can be useful in certain contexts. However, we need to be aware of this
(yet another) incompatibility between the two packages.
np.random.seed(123)
b = pd.Series(np.round(np.random.rand(10), 2))
b.index = np.random.permutation(np.arange(10))
b
## 2 0.70
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## 6 0.68
## 3 0.48
## 0 0.39
## dtype: float64
and:
c = b.copy()
c.index = list("abcdefghij")
c
## a 0.70
## b 0.29
## c 0.23
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## h 0.68
## i 0.48
## j 0.39
## dtype: float64
They consist of the same values, in the same order, but have different labels (index
slots). In particular, b’s labels are integers that do not match the physical element pos-
itions (where 0 would denote the first element, etc.).
Important For numpy vectors, we had four different indexing schemes: via a scalar
(extracts an element at a given position), a slice, an integer vector, and a logical vector.
Series objects are additionally labelled. Therefore, they can also be accessed through
the contents of the index slot.
10.4.1 Do not use [...] directly (in the current version of pandas)
Applying the index operator, [...], directly on Series is currently not a wise idea:
both do not select the first item, but the item labelled 0. However, the undermentioned
two calls fall back to position-based indexing.
b[:1] # do not use it: it will change in the future!
## 2 0.7
## dtype: float64
c[0] # there is no label `0` (do not use it: it will change in the future!)
## 0.7
10.4.2 loc[...]
Series.loc[...] implements label-based indexing.
b.loc[0]
## 0.39
This returned the element labelled 0. On the other hand, c.loc[0] will raise a KeyError because c consists of string labels only. But in this case, we can write:
c.loc["j"]
## 0.39
Note Be careful that if there are repeated labels, then we will be returning all (sic!8 )
the matching items:
d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a 1
## a 3
## dtype: int64
10.4.3 iloc[...]
Here are some examples of position-based indexing with the iloc[...] accessor. It is worth stressing that, fortunately, its behaviour is consistent with its numpy counterpart, i.e., the ordinary square brackets applied on objects of the class ndarray. For
example:
b.iloc[0] # the same: c.iloc[0]
## 0.7
A slice such as b.iloc[1:7] returns the second, third, …, seventh element (not including b.iloc[7], i.e., the eighth one).
For iloc[...], the indexer must be unlabelled, e.g., be an ordinary numpy vector.
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"],
x = [True, False, True, False, True]
))
And now:
df.loc[ df.loc[:, "u"] > 0.5, "u":"w" ]
## u v w
## 0 0.70 0.32 spam
## 3 0.55 1.98 eggs
## 4 0.72 -1.62 sausage
It selected the rows where the values in the u column are greater than 0.5 and then returned all columns between u and w (inclusive!).
Furthermore:
df.iloc[:3, :].loc[:, ["u", "w"]]
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
It fetched the first three rows (by position; iloc is necessary) and then selected the two indicated columns.
Compare this to:
df.loc[:3, ["u", "w"]] # df[:3, ["u", "w"]] does not even work; please don't
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
## 3 0.55 eggs
Important Getting a scrambled numeric index that does not match the physical positions is not rare: for instance, in the context of data frame sorting (Section 10.6.1):
df2 = df.sort_values("v")
df2
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
This accessor is, sadly, not universal. We can verify this by considering a data frame
with a column named, e.g., mean: it clashes with the built-in method. As a workaround,
we should either use loc[...] or rename the column, for instance, like Mean or MEAN.
Exercise 10.11 In the tips9 dataset, select data on male customers where the total bills were in
the [10, 20] interval. Also, select Saturday and Sunday records where the tips were greater than
$5.
9 https://github.com/gagolews/teaching-data/raw/master/other/tips.csv
Important Notation like “df.new_column = ...” does not work. As we said, only loc
and iloc are universal. For other accessors, this is not necessarily the case.
Exercise 10.12 Use pandas.DataFrame.insert to add a new column not necessarily at the
end of df.
Exercise 10.13 Use pandas.DataFrame.assign to add a new column and replace an exist-
ing one with another, not necessarily of the same dtype.
Exercise 10.14 Use pandas.DataFrame.append to add a few more rows to df.
To remedy this, it is best to create a copy of a column, modify it, and then to replace
the old contents with the new ones.
u = df.loc[:, "u"].copy()
u.iloc[0] = 42 # or a whole for loop to process them all, or whatever
df.loc[:, "u"] = u
df.loc[:, "u"].iloc[0] # testing
## 42.0
Notice the random_state argument which controls the seed of the pseudorandom
number generator: this way, we get reproducible results. Alternatively, we could call
numpy.random.seed.
Exercise 10.15 Show how the three aforementioned scenarios can be implemented manually
using iloc[...] and numpy.random.permutation or numpy.random.choice.
Exercise 10.16 Can pandas.read_csv be used to read only a random sample of rows from a
CSV file?
In machine learning practice, we are used to training and evaluating machine learning
models on different (mutually disjoint) subsets of the whole data frame.
For instance, Section 12.3.3 mentions that we may be interested in performing the so-
called training/test split (partitioning), where 80% (or 60% or 70%) of the randomly se-
lected rows would constitute the first new data frame and the remaining 20% (or 40%
or 30%, respectively) would go to the second one.
Given a data frame like:
df = body.head(10) # this is just an example
df
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
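One way to perform such a split, assuming the 80/20 scenario and the example df above (a sketch, not necessarily the exact approach used elsewhere in this chapter):
np.random.seed(123)                   # for reproducibility
idx = np.random.permutation(len(df))  # shuffled row positions
k = int(len(df)*0.8)                  # 80% of the rows
df_train = df.iloc[idx[:k], :]        # the training part
df_test = df.iloc[idx[k:], :]         # the remaining 20%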
Exercise 10.17 In the wine_quality_all10 dataset, leave out all but the white wines. Parti-
tion the resulting data frame randomly into three data frames: wines_train (60% of the rows),
wines_validate (another 20% of the rows), and wines_test (the remaining 20%).
Exercise 10.18 Compose a function kfold which takes a data frame df and an integer 𝑘 > 1 as arguments. Return a list of data frames resulting from randomly partitioning df into 𝑘 disjoint chunks of equal (or almost equal, if that is not possible) sizes.
10 https://github.com/gagolews/teaching-data/raw/master/other/wine_quality_all.csv
The index has both levels named, but this is purely for aesthetic reasons.
Indexing using loc[...] by default relates to the first level of the hierarchy:
df.loc[2023, :]
## data
## quarter
## Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
Note that we selected all rows corresponding to a given label and dropped (!) this level
of the hierarchy.
Another example:
df.loc[ [2023, 2025], : ]
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
Let’s stress again that the `:` operator can only be used directly within the square brack-
ets. Nonetheless, we can always use the slice constructor to create a slice in any con-
text:
df.loc[ (slice(None), ["Q1", "Q3"]), : ] # :, ["Q1", "Q3"]
## data
## year quarter
## 2023 Q1 0.70
## Q3 0.23
## 2024 Q1 0.72
## Q3 0.98
## 2025 Q1 0.48
## Q3 0.34
df.loc[ (slice(None, None, -1), slice("Q2", "Q3")), : ] # ::-1, "Q2":"Q3"
## data
## year quarter
## 2025 Q3 0.34
## Q2 0.39
## 2024 Q3 0.98
## Q2 0.42
## 2023 Q3 0.23
## Q2 0.29
10.6.1 Sorting
Consider another example dataset: the yearly (for 2018) average air quality data11 in
the Australian state of Victoria.
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)
sort_values is a convenient means to order the rows with respect to one criterion, be
it numeric or categorical.
air.sort_values("value", ascending=False)
## sp_name param_id value
## 6 Footscray NO2 10.274531
11 https://discover.data.vic.gov.au/dataset/epa-air-watch-all-sites-air-quality-hourly-averages-yearly
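For instance, to order the readings by parameter and, within each parameter group, by decreasing value, we can pass lists of column names and sort directions (a sketch):
air.sort_values(["param_id", "value"], ascending=[True, False])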
Here, in each group of identical parameters, we get a decreasing order with respect to
the value.
Exercise 10.19 Compare the ordering with respect to param_id and value vs value and then
param_id.
Note (*) Lamentably, DataFrame.sort_values by default does not use a stable algorithm. If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Melbourne CBD 8.072998
## 1 BPM2.5 Moe 6.427079
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Morwell East 6.784596
## 4 BPM2.5 Churchill 6.391230
## 5 BPM2.5 Morwell South 6.512849
## 6 BPM2.5 Traralgon 8.024735
## 7 BPM2.5 Alphington 7.848758
## 8 BPM2.5 Geelong South 6.502762
## 9 NO2 Morwell South 5.124430
## 10 NO2 Traralgon 5.776333
## 11 NO2 Geelong South 5.681722
## 12 NO2 Altona North 9.467912
## 13 NO2 Alphington 9.558120
## 14 NO2 Dandenong 9.800705
## 15 NO2 Footscray 10.274531
We lost the ordering based on station names in the two subgroups. To switch to a
mergesort-like method (timsort), we should pass kind="stable".
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable") # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
Exercise 10.20 (*) Perform identical reorderings but using only loc[...], iloc[...], and
numpy.argsort.
air_wide.T.rename_axis(index="location", columns="param").\
stack().rename("value").reset_index()
## location param value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
We used the data frame transpose (T) to get a location-major order (less boring an outcome in this context). Missing values are gone now. We do not need them anymore. Nevertheless, passing dropna=False would help us identify the combinations of location and param for which the readings are not provided.
We could have stored them alongside the air data frame, but that would be a waste of space. Also,
if we wanted to modify some datum (note, e.g., the annoying double space in param_name for
BPM2.5), we would have to update all the relevant records.
Instead, we can always match the records in both data frames that have the same param_ids,
and join (merge) these datasets only when we really need this.
Let’s discuss the possible join operations by studying two toy datasets:
A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
## x y
## 0 a0 b0
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
and:
B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
B
## x z
## 0 a0 c0
## 1 a2 c1
## 2 a2 c2
## 3 a4 c3
The left join of 𝐴 with 𝐵 guarantees to return all the records from 𝐴, even those which
are not matched by anything in 𝐵.
pd.merge(A, B, how="left", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
The right join of 𝐴 with 𝐵 is the same as the left join of 𝐵 with 𝐴:
pd.merge(A, B, how="right", on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
## 3 a4 NaN c3
Finally, the full outer join is the set-theoretic union of the left and the right join:
pd.merge(A, B, how="outer", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
## 5 a4 NaN c3
and:
B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates2.csv",
comment="#")
B
## Name BirthDate
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 2 Micha Kitchen 17.09.1930
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 5 Katarzyna Lasko 20.10.1971
## 6 Duchanee Panomyaong 19.06.1952
## 7 Mefodiy Shachar 01.10.1914
## 8 Paul Meckler 29.09.1968
## 9 Noe Tae-Woong 11.07.1970
## 10 Åge Trelstad 07.03.1935
In both datasets, there is a single categorical column whose elements uniquely identify
each record (i.e., Name). In the language of relational databases, we would call it the
primary key. In such a case, implementing the set-theoretic operations is relatively
easy, as we can refer to the pandas.Series.isin method.
First, 𝐴 ∩ 𝐵 (intersection), includes only the rows that are both in 𝐴 and in 𝐵:
A.loc[A.Name.isin(B.Name), :]
## Name BirthDate
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
Second, 𝐴 − 𝐵 (difference) gives all the rows that are in 𝐴 but not in 𝐵:
A.loc[~A.Name.isin(B.Name), :]
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
Exercise 10.26 (*) Determine the union, intersection, and difference of the wine_sample120
and wine_sample221 datasets, where there is no column uniquely identifying the observations.
Hint: consider using pandas.concat and pandas.DataFrame.duplicated or pandas.
DataFrame.drop_duplicates.
20 https://github.com/gagolews/teaching-data/raw/master/other/wine_sample1.csv
21 https://github.com/gagolews/teaching-data/raw/master/other/wine_sample2.csv
Nevertheless, the methods are probably too plentiful for our taste. Their developers were overgenerous. They wanted to include a list of all the possible verbs related to data analysis, even if they can be trivially expressed by a composition of 2-3 simpler operations from numpy or scipy instead.
As strong advocates of minimalism, we would rather save ourselves from being overloaded with too much new information. This is why our focus in this book is on developing the most transferable23 skills. Our approach is also slightly more hygienic. We do not want the reader to develop a hopeless mindset, the habit of looking everything up on the internet when faced with even the simplest kinds of problems. We have brains for a reason.
10.7 Exercises
Exercise 10.27 How are data frames different from matrices?
Exercise 10.28 What are the use cases of the name slot in Series and Index objects?
Exercise 10.29 What is the purpose of set_index and reset_index?
Exercise 10.30 Why is learning numpy crucial when someone wants to become a proficient user of pandas?
Exercise 10.31 What is the difference between iloc[...] and loc[...]?
Exercise 10.32 Why is applying the index operator [...] directly on a Series or DataFrame object discouraged?
Exercise 10.33 What is the difference between index, Index, and columns?
Exercise 10.34 How can we compute the arithmetic mean and median of all the numeric
columns in a data frame, using a single line of code?
Exercise 10.35 What is a training/test split and how to perform it using numpy and pandas?
Exercise 10.36 What is the difference between stacking and unstacking? Which one yields a
wide (as opposed to long) format?
Exercise 10.37 Name different data frame join (merge) operations and explain how they work.
22 https://pandas.pydata.org/pandas-docs/stable/reference/index.html
23 This is also in line with the observation that Python with pandas is not the only environment where we
can work with data frames; e.g., base R and Julia with DataFrame.jl support that too.
Exercise 10.38 How does sorting with respect to more than one criterion work?
Exercise 10.39 Name the set-theoretic operations on data frames.
11
Handling categorical data
So far, we have been mostly dealing with quantitative (numeric) data, on which we
were able to apply various mathematical operations, such as computing the arithmetic
mean or taking the square thereof. Naturally, not every transformation must always
make sense in every context (e.g., multiplying temperatures – what does it mean when
we say that it is twice as hot today as compared to yesterday?), but still, the possibilities
were plenty.
Qualitative data (also known as categorical data, factors, or enumerated types) such
as eye colour, blood type, or a flag whether a patient is ill, on the other hand, take a
small number of unique values. They support an extremely limited set of admissible
operations. Namely, we can only determine whether two entities are equal or not.
In datasets involving many features (Chapter 12), categorical variables are often used
for observation grouping (e.g., so that we can compute the best and average time for
marathoners in each age category or draw box plots for finish times of men and women separately). Also, they may serve as target variables in statistical classification
tasks (e.g., so that we can determine if an email is “spam” or “not spam”).
would use integers between 1 and 𝑙 (inclusive). Nevertheless, a dataset creator is free to encode the labels
however they want. For example, DMDBORN4 in NHANES has: 1 (born in 50 US states or Washington, DC), 2
(others), 77 (refused to answer), and 99 (do not know).
Consider the data on the original whereabouts of the top 16 marathoners (the 37th
Warsaw Marathon dataset):
marathon = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/37_pzu_warsaw_marathon_simplified.csv",
comment="#")
cntrs = np.array(marathon.country, dtype="str")
cntrs16 = cntrs[:16]
cntrs16
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
These are two-letter ISO 3166 country codes encoded as strings (notice the dtype="str"
argument).
Calling pandas.unique determines the set of distinct categories:
cat_cntrs16 = pd.unique(cntrs16)
cat_cntrs16
## array(['KE', 'ET', 'MA', 'PL', 'IL'], dtype='<U2')
Note We could have also used numpy.unique (Section 5.5.3), but it would sort the distinct values lexicographically. In other words, they would not be listed in the order of appearance.
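The integer codes referred to below (codes_cntrs16) can be obtained, for instance, with pandas.factorize, which determines both the codes and the distinct levels in their order of appearance (a sketch):
codes_cntrs16 = pd.factorize(cntrs16)[0]  # assumed; the level set matches cat_cntrs16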
The code sequence 0, 0, 0, 1, … corresponds to the first, the first, the first, the second,
… level in cat_cntrs16, i.e., Kenya, Kenya, Kenya, Ethiopia, ….
The fact that the codes are integers does not mean that they become instances of a quantitative type. Arithmetic operations thereon do not really make sense.
The values between 0 (inclusive) and 5 (exclusive) can be used to index a given array of
length 𝑙 = 5. As a consequence, to decode our factor, we can call:
cat_cntrs16[codes_cntrs16]
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
Then we make use of the fact that numpy.argsort applied on a vector representing a permuta-
tion, determines its very inverse:
new_codes_cntrs16 = np.argsort(new_codes)[codes_cntrs16]
new_codes_cntrs16
## array([1, 1, 1, 4, 1, 1, 4, 2, 0, 0, 3, 0, 1, 1, 0, 0])
Verification:
np.all(cntrs16 == new_cat_cntrs16[new_codes_cntrs16])
## True
Exercise 11.2 (**) Determine the set of unique values in cntrs16 in the order of appear-
ance (and not sorted lexicographically), but without using pandas.unique nor pandas.
factorize. Then, encode cntrs16 using this level set.
Furthermore, pandas includes2 a special dtype for storing categorical data. Namely,
we can write:
2 https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
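One way to do so is to pass the dtype argument directly (a sketch):
cntrs16_series = pd.Series(cntrs16, dtype="category")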
or, equivalently:
cntrs16_series = pd.Series(cntrs16).astype("category")
These two yield a Series object displayed as if it was represented using string labels:
cntrs16_series.head() # preview
## 0 KE
## 1 KE
## 2 KE
## 3 ET
## 4 KE
## dtype: category
## Categories (5, object): ['ET', 'IL', 'KE', 'MA', 'PL']
• 1, e.g., ill/success/on/spam/present/…).
Usually, the interesting or noteworthy category is denoted by 1.
Important When converting logical to numeric, False becomes 0 and True becomes
1. Conversely, 0 is converted to False and anything else (including -0.326) to True.
Hence, instead of working with vectors of 0s and 1s, we might equivalently be playing
with logical arrays. For example:
np.array([True, False, True, True, False]).astype(int)
## array([1, 0, 1, 1, 0])
or, equivalently:
np.array([-2, -0.326, -0.000001, 0.0, 0.1, 1, 7643]) != 0
## array([ True, True, True, False, True, True, True])
Important It is not rare to work with vectors of probabilities, where the 𝑖-th element
therein, say p[i], denotes the likelihood of an observation’s belonging to the class 1.
Consequently, the probability of being a member of the class 0 is 1-p[i]. In the case
where we would rather work with crisp classes, we can simply apply the conversion
(p>=0.5) to get a logical vector.
Exercise 11.3 Given a numeric vector x, create a vector of the same length as x whose 𝑖-th ele-
ment is equal to "yes" if x[i] is in the unit interval and to "no" otherwise. Use numpy.where,
which can act as a vectorised version of the if statement.
$$
\mathbf{R} = \left[
\begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
\end{array}
\right].
$$
One can easily verify that each row consists of one and only one 1 (the number of 1s
per one row is 1). Such a representation is adequate when solving a multiclass classification problem by means of 𝑙 binary classifiers. For example, if spam, bacon, and hot
dogs are on the menu, then spam is encoded as (1, 0, 0), i.e., yeah-spam, nah-bacon,
and nah-hot dog. We can build three binary classifiers, each narrowly specialising in
sniffing one particular type of food.
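As an aside, pandas has a ready-made one-hot encoder, pandas.get_dummies; a quick illustration on hypothetical data (implementing such an encoder manually is the subject of the next examples):
pd.get_dummies(pd.Series(["spam", "bacon", "spam", "eggs"]))  # one column per level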
Example 11.4 Write a function to one-hot encode a given categorical vector represented using
character strings.
Example 11.5 Compose a function to decode a one-hot-encoded matrix.
Example 11.6 (*) We can also work with matrices like 𝐏 ∈ [0, 1]𝑛×𝑙 , where 𝑝𝑖,𝑗 denotes the
probability of the 𝑖-th object’s belonging to the 𝑗-th class. Given an example matrix of this kind,
verify that in each row the probabilities sum to 1 (up to a small numeric error). Then, decode such
a matrix by choosing the greatest element in each row.
numpy.searchsorted can determine the interval where each value in mins falls.
By default, the intervals are of the form (𝑎, 𝑏] (not including 𝑎, including 𝑏). The code 0
corresponds to values less than the first bin edge, whereas the code 3 represents items
greater than or equal to the last boundary.
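For instance, with the bin edges 130, 140, and 150 (minutes, as used below) and a few hypothetical running times:
bins = np.array([130.0, 140.0, 150.0])
np.searchsorted(bins, [129.5, 135.2, 142.0, 149.9, 151.3])
## array([0, 1, 2, 2, 3])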
pandas.cut gives us another interface to the same binning method. It returns a vector-
like object with dtype="category", with very readable labels generated automatically
(and ordered; see Section 11.4.7):
cut_mins16 = pd.Series(pd.cut(mins16, [-np.inf, 130, 140, 150, np.inf]))
cut_mins16.iloc[ [0, 1, 6, 7, 13, 14, 15] ].astype("str") # preview
## 0 (-inf, 130.0]
## 1 (130.0, 140.0]
## 6 (130.0, 140.0]
## 7 (140.0, 150.0]
## 13 (140.0, 150.0]
## 14 (150.0, inf]
## 15 (150.0, inf]
## dtype: object
cut_mins16.cat.categories.astype("str")
## Index(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
## '(150.0, inf]'],
## dtype='object')
Example 11.7 (*) We can create a set of the corresponding categories manually, for example, as
follows:
bins2 = np.r_[-np.inf, bins, np.inf]
np.array(
[f"({bins2[i]}, {bins2[i+1]}]" for i in range(len(bins2)-1)]
)
## array(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
## '(150.0, inf]'], dtype='<U14')
Exercise 11.8 (*) Check out the numpy.histogram_bin_edges function which tries to determine some informative interval boundaries based on a few simple heuristics. Recall that
numpy.linspace and numpy.geomspace can be used for generating equidistant bin edges on
linear and logarithmic scales, respectively.
A vector of counts can easily be turned into a vector of proportions (fractions) or percentages (if we multiply them by 100):
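For instance (a sketch; counts_cntrs16 is assumed to hold the per-country counts):
counts_cntrs16 / np.sum(counts_cntrs16) * 100   # percentages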
Almost 31.25% of the top runners were from Poland (this marathon is held in Warsaw
after all…).
Exercise 11.9 Using numpy.argsort, sort counts_cntrs16 increasingly together with the
corresponding items in cat_cntrs16.
The three columns are: sex, age (in 10-year brackets), and country. We can, of course,
analyse the data distribution in each column individually, but this we leave as an exercise. Instead, we note that some interesting patterns might also arise when we study
the combinations of levels of different variables.
Here are the levels of the sex and age variables:
pd.unique(marathon.sex)
## array(['M', 'F'], dtype=object)
pd.unique(marathon.age)
## array(['20', '30', '50', '40', '60+'], dtype=object)
These can be converted to a two-way contingency table, which is a matrix that gives the
number of occurrences of each pair of values from the two factors:
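Below, counts2 is assumed to be a Series giving the number of occurrences of each (sex, age) pair; it could be obtained, for example, as follows (a sketch):
counts2 = marathon.groupby(["sex", "age"]).size()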
V = counts2.unstack(fill_value=0)
V
## age 20 30 40 50 60+
## sex
## F 240 449 262 43 19
## M 879 2200 1708 541 170
For example, there were 19 women aged at least 60 amongst the marathoners. Jolly
good.
The marginal (one-dimensional) frequency distributions can be recreated by computing the rowwise and columnwise sums of V:
np.sum(V, axis=1)
## sex
## F 1013
## M 5498
## dtype: int64
np.sum(V, axis=0)
## age
## 20 1119
## 30 2649
## 40 1970
## 50 584
## 60+ 189
## dtype: int64
The display is in the long format (compare Section 10.6.2) because we cannot nicely
print a three-dimensional array. Yet, we can always partially unstack the dataset, for
aesthetic reasons:
counts3.set_index(["country", "sex", "age"]).unstack()
## count
## age 20 30 40 50 60+
## country sex
## PL F 222.0 422.0 248.0 26.0 8.0
## M 824.0 2081.0 1593.0 475.0 134.0
## SK F NaN NaN NaN 1.0 NaN
## M NaN NaN 1.0 NaN 1.0
## UA F NaN 2.0 NaN NaN NaN
## M 8.0 8.0 2.0 3.0 NaN
Let’s again appreciate how versatile the concept of data frames is. Not only can we
represent data to be investigated (one row per observation, variables possibly of mixed
types) but also we can store the results of such analyses (neatly formatted tables).
Bar plots are self-explanatory and hence will do the trick most of the time; see Figure 11.1.
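Below, x is assumed to be a Series of counts labelled by the age categories, e.g., the marginal distribution computed above:
x = np.sum(V, axis=0)   # counts for the age categories: 20, 30, 40, 50, 60+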
ind = np.arange(len(x)) # 0, 1, 2, 3, 4
plt.bar(ind, height=x, color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks(ind, x.index)
plt.show()
The ind vector gives the x-coordinates of the bars; here: consecutive integers. By calling matplotlib.pyplot.xticks we assign them readable labels.
Exercise 11.10 Draw a bar plot for the five most prevalent foreign (i.e., excluding Polish) marathoners’ original whereabouts. Add a bar that represents “all other” countries. Depict percentages instead of counts, so that the total bar height is 100%. Assign a different colour to each bar.
A bar plot is a versatile tool for visualising the counts also in the two-variable case; see
Figure 11.2. Let’s now use a pleasant wrapper around matplotlib.pyplot.bar offered
by the statistical data visualisation package called seaborn3 [99] (written by Michael
Waskom).
3 https://seaborn.pydata.org/
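A grouped bar plot like the one in Figure 11.2 can be obtained, for instance, with seaborn.countplot (a sketch; the exact call and styling may differ):
sns.countplot(data=marathon, x="age", hue="sex",
    order=["20", "30", "40", "50", "60+"])
plt.show()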
[Figures 11.1 and 11.2 appear here: bar plots of the number of runners in each age category (20, 30, 40, 50, 60+), overall and grouped by sex (F/M).]
Note It is customary to call a single function from seaborn and then perform a series of additional calls to matplotlib to tweak the display details. We should remember that the former uses the latter to achieve its goals, not vice versa. seaborn is particularly convenient for plotting grouped data.
Figure 11.3. An example stacked bar plot: age distribution for different sexes amongst
all the runners.
Duda Trzaskowski
Figure 11.4. Such a great victory! Wait… Just look at the y-axis tick marks…
Important We must always read the axis tick marks. Also, when drawing bar plots,
we must never trick the reader for this is unethical; compare Rule#9. More issues in
statistical deception are explored, e.g., in [98].
11.3.3 Pie charts
We are definitely not going to discuss the infamous pie charts because their use in data
analysis has been widely criticised for a long time (it is difficult to judge the ratios of
areas of their slices). Do not draw them. Ever. Good morning.
[A bar plot of the same two results, this time with the y-axis spanning 0–100%, appears here.]
Let’s display the dataset ordered with respect to the counts, decreasingly:
med = pd.Series(counts_med, index=cat_med).sort_values(ascending=False)
med
4 https://www.cec.health.nsw.gov.au/CEC-Academy/quality-improvement-tools/pareto-charts
Pareto charts may aid in visualising the datasets where the Pareto principle is likely to
hold, at least approximately. They include bar plots with some extras:
• bars are listed in decreasing order,
• the cumulative percentage curve is added.
The plotting of the Pareto chart is a little tricky because it involves using two different
Y axes (as usual, fine-tuning the figure and studying the manual of the matplotlib
package is left as an exercise.)
x = np.arange(len(med)) # 0, 1, 2, ...
p = 100.0*med/np.sum(med) # percentages
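# One way to draw the bars and the cumulative percentage curve on two Y axes
# (a hedged sketch; not necessarily the exact calls that produced Figure 11.6):
fig, ax1 = plt.subplots()
ax1.bar(x, height=p, color="lightgray", edgecolor="black")  # percentages as bars
ax1.set_xticks(x)
ax1.set_xticklabels(med.index, rotation=60)
ax1.set_ylabel("%")
ax2 = ax1.twinx()                   # a second Y axis sharing the same X axis
ax2.plot(x, np.cumsum(p), "ro-")    # the cumulative percentage curve
ax2.set_ylabel("cumulative %")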
fig.tight_layout()
plt.show()
Figure 11.6 shows that the first five causes (less than 40%) correspond to c. 85% of the
medication errors. More precisely, the cumulative probabilities are:
med.cumsum()/np.sum(med)
## Dose missed 0.213953
## Wrong time 0.406977
## Wrong drug 0.583721
## Overdose 0.720930
Figure 11.6. An example Pareto chart: the most frequent causes for medication errors.
Note that there is an explicit assumption here that a single error is only due to a single
cause. Also, we presume that each medication error has a similar degree of severity.
Policymakers and quality controllers often rely on similar simplifications. They most probably are going to be addressing only the top causes. If we ever wondered why some processes (mal)function the way they do, the foregoing is a hint. Inventing something more effective yet so simple at the same time requires much more effort.
It would also be nice to report the number of cases where no mistakes are made and the cases where errors are insignificant. Healthcare workers are doing a wonderful job for our communities, especially in the public system. Why add to their stress?
Figure 11.7. A heat map for the marathoners’ sex and age category.
Therefore, as far as qualitative data aggregation is concerned, what we are left with is
the mode, i.e., the most frequently occurring value.
It turns out that amongst the fastest 22 runners (a nicely round number), there is a tie
between Kenya and Poland – both meet our definition of a mode:
counts = marathon.country.iloc[:22].value_counts()
counts
## country
## KE 7
## PL 7
## ET 3
## IL 3
## MA 1
## MD 1
## Name: count, dtype: int64
To avoid any bias, it is always best to report all the potential mode candidates:
counts.loc[counts == counts.max()].index
## Index(['KE', 'PL'], dtype='object', name='country')
If one value is required, though, we can pick one at random (calling numpy.random.
choice).
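For example, summing such a logical vector directly gives the count of the matching elements (a sketch of the computation discussed next):
np.sum(marathon.country == "PL")  # the number of runners from Poland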
This gave the number of elements that are equal to "PL" because the sum of 0s and 1s
is equal to the number of 1s in the sequence. Note that (country == "PL") is a logical
vector that represents a binary categorical variable with levels: not-Poland (False) and
Poland (True).
If we divide the preceding result by the length of the vector, we will get the proportion:
np.mean(marathon.country == "PL")
## 0.9265857779142989
Roughly 93% of the runners were from Poland. As this is greater than 0.5, “PL” is definitely the mode.
Exercise 11.14 What is the meaning of numpy.all, numpy.any, numpy.min, numpy.max,
numpy.cumsum, and numpy.cumprod applied on logical vectors?
Note (**) Having the 0/1 (or zero/non-zero) vs False/True correspondence permits
us to perform some logical operations using integer arithmetic. In mathematics, 0
is the annihilator of multiplication and the neutral element of addition, whereas 1 is
the neutral element of multiplication. In particular, assuming that p and q are logical
values and a and b are numeric ones, we have, what follows:
• p+q != 0 means that at least one value is True and p+q == 0 if and only if both
are False;
• more generally, p+q == 2 if both elements are True, p+q == 1 if only one is True
(we call it exclusive-or, XOR), and p+q == 0 if both are False;
• p*q != 0 means that both values are True and p*q == 0 holds whenever at least
one is False;
• 1-p corresponds to the negation of p (changes 1 to 0 and 0 to 1);
• p*a + (1-p)*b is equal to a if p is True and equal to b otherwise.
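A small illustration of the last identity, using 0/1 vectors (hypothetical values):
p = np.array([1, 0, 1, 0])     # a logical vector encoded as 0s and 1s
a = np.array([10, 20, 30, 40])
b = np.array([-1, -2, -3, -4])
p*a + (1-p)*b                  # selects a[i] where p[i] is 1 and b[i] otherwise
## array([10, -2, 30, -4])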
Having such a test is beneficial, e.g., when the data we have at hand are based on small
surveys that are supposed to serve as estimates of what might be happening in a larger
population.
Going back to our political example from Section 11.3.2, it turns out that one of the
pre-election polls indicated that 𝑐 = 516 out of 𝑛 = 1017 people would vote for the
first candidate. We have 𝑝1̂ = 50.74% (Duda) and 𝑝2̂ = 49.26% (Trzaskowski). If we
would like to test whether the observed proportions are significantly different from
each other, we could test them against the theoretical distribution 𝑝1 = 50% and 𝑝2 = 50%, which states that there is a tie between the competitors (up to a sampling error).
5 (*) There exists a discrete version of the Kolmogorov–Smirnov test, but it is defined in a different way.
A natural test statistic is based on the relative squared differences:
$$
\hat{T} = n \sum_{i=1}^{l} \frac{(\hat{p}_i - p_i)^2}{p_i}.
$$
c, n = 516, 1017
p_observed = np.array([c, n-c]) / n
p_expected = np.array([0.5, 0.5])
T = n * np.sum( (p_observed-p_expected)**2 / p_expected )
T
## 0.2212389380530986
Similarly to the continuous case in Section 6.2.3, we reject the null hypothesis if $\hat{T} \geq K$.
The critical value 𝐾 is computed based on the fact that, if the null hypothesis is true, 𝑇̂
follows the 𝜒 2 (chi-squared, hence the name of the test) distribution with 𝑙−1 degrees
of freedom, see scipy.stats.chi2.
We thus need to query the theoretical quantile function to determine the test statistic
that is not exceeded in 99.9% of the trials (under the null hypothesis):
alpha = 0.001 # significance level
scipy.stats.chi2.ppf(1-alpha, len(p_observed)-1)
## 10.827566170662733
As 𝑇̂ < 𝐾 (because 0.22 < 10.83), we cannot deem the two proportions significantly
different. In other words, this poll did not indicate (at the significance level 0.1%) any
of the candidates as a clear winner.
Exercise 11.15 Assuming 𝑛 = 1017, determine the smallest 𝑐, i.e., the number of respondents
claiming they would vote for Duda, that leads to the rejection of the null hypothesis.
There are 𝑙 = 5 age categories. First, denote the total number of observations in both
groups by 𝑛′ and 𝑛″ .
n1 = c1.sum()
n2 = c2.sum()
n1, n2
## (1013, 5498)
The observed proportions in the first group (females), denoted by $\hat{p}_1', \dots, \hat{p}_l'$, are, respectively:
p1 = c1/n1
p1
## array([0.23692004, 0.44323791, 0.25863771, 0.04244817, 0.01875617])
Here are the proportions in the second group (males), $\hat{p}_1'', \dots, \hat{p}_l''$:
p2 = c2/n2
p2
## array([0.15987632, 0.40014551, 0.31065842, 0.09839942, 0.03092033])
We would like to verify whether the corresponding proportions are equal (up to some
sampling error):
$$
\begin{cases}
H_0: & \hat{p}_i' = \hat{p}_i'' \ \text{for all}\ i = 1, \dots, l \quad \text{(null hypothesis)} \\
H_1: & \hat{p}_i' \neq \hat{p}_i'' \ \text{for some}\ i = 1, \dots, l \quad \text{(alternative hypothesis)} \\
\end{cases}
$$
In other words, we want to determine whether the categorical data in the two groups
come from the same discrete probability distribution.
Taking the estimated expected proportions:
$$
\bar{p}_i = \frac{n' \hat{p}_i' + n'' \hat{p}_i''}{n' + n''},
$$
for all $i = 1, \dots, l$, the test statistic this time is equal to:
$$
\hat{T} = n' \sum_{i=1}^{l} \frac{(\hat{p}_i' - \bar{p}_i)^2}{\bar{p}_i} + n'' \sum_{i=1}^{l} \frac{(\hat{p}_i'' - \bar{p}_i)^2}{\bar{p}_i}.
$$
If the null hypothesis is true, the test statistic approximately follows the 𝜒 2 distribution
with 𝑙 − 1 degrees of freedom6 . The critical value 𝐾 is equal to:
6 Notice that [77] in Section 14.3 recommends 𝑙 degrees of freedom, but we do not agree with this rather
informal reasoning. Also, simple Monte Carlo simulations suggest that 𝑙 − 1 is a better candidate.
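Based on the formulas above, a sketch of the computation (compare the values quoted in the next paragraph):
p_bar = (n1*p1 + n2*p2) / (n1 + n2)  # estimated expected proportions
T = n1*np.sum((p1-p_bar)**2/p_bar) + n2*np.sum((p2-p_bar)**2/p_bar)  # approx. 75.31
alpha = 0.001  # significance level
K = scipy.stats.chi2.ppf(1-alpha, len(p1)-1)  # approx. 18.47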
As 𝑇̂ ≥ 𝐾 (because 75.31 ≥ 18.47), we reject the null hypothesis. And so, the runners’
age distribution differs across sexes (at significance level 0.1%).
Cramér’s 𝑉 is one of a few ways to measure the degree of association between two categorical variables. It is equal to 0 (the lowest possible value) if the two variables are independent (there is no association between them) and 1 (the highest possible value) if they are tied.
7 https://www.abs.gov.au/statistics/health/health-conditions-and-risks/national-health-survey-first-results/2017-18
Given a two-way contingency table 𝐶 with 𝑛 rows and 𝑚 columns and assuming that:
$$
T = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(c_{i,j} - e_{i,j})^2}{e_{i,j}},
$$
where:
$$
e_{i,j} = \frac{\left(\sum_{k=1}^{m} c_{i,k}\right)\left(\sum_{k=1}^{n} c_{k,j}\right)}{\sum_{i=1}^{n} \sum_{j=1}^{m} c_{i,j}},
$$
Hence, there might be a small association between age and the prevalence of certain
conditions. In other words, some conditions might be more prevalent in some age
groups than others.
Exercise 11.16 Compute the Cramér 𝑉 using only numpy functions.
Example 11.17 (**) We can easily verify the hypothesis whether 𝑉 does not differ significantly
from 0, i.e., whether the variables are independent. Looking at 𝑇, we see that this is essentially
the test statistic in Pearson’s chi-squared goodness-of-fit test.
E = C.sum(axis=1).reshape(-1, 1) * C.sum(axis=0).reshape(1, -1) / C.sum()
T = np.sum((C-E)**2 / E)
T
## 3715440.465191512
If the data are really independent, 𝑇 follows the chi-squared distribution with 𝑛 + 𝑚 − 1 degrees of freedom. As a consequence, the critical value 𝐾 is equal to:
alpha = 0.001 # significance level
scipy.stats.chi2.ppf(1-alpha, C.shape[0] + C.shape[1] - 1)
## 32.90949040736021
As 𝑇 is much greater than 𝐾, we conclude (at significance level 0.1%) that the health conditions
are not independent of age.
Exercise 11.18 (*) Take a look at Table 19: Comorbidity of selected chronic conditions in
the National Health Survey 20188 , where we clearly see that many disorders co-occur. Visualise
them on some heat maps and bar plots (including data grouped by sex and age).
8 https://www.abs.gov.au/statistics/health/health-conditions-and-risks/national-health-survey-first-results/2017-18
9 https://github.com/gagolews/teaching-data/raw/master/marek/uk_income_simulated_2020.txt
10 https://github.com/gagolews/teaching-data/raw/master/marek/grades_results.txt
100 students attending an imaginary course in an Australian university. You can load it in the
form of an ordered categorical Series by calling:
grades = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/grades_results.txt", dtype="str")
grades = pd.Series(pd.Categorical(grades,
categories=["F", "P", "C", "D", "HD"], ordered=True))
grades.value_counts() # note the order of labels
## F 30
## P 29
## C 23
## HD 22
## D 19
## Name: count, dtype: int64
How would you determine the average grade represented as a number between 0 and 100, taking
into account that for a P you need at least 50%, C is given for ≥ 60%, D for ≥ 70%, and HD for only
(!) 80% of the points. Come up with a pessimistic, optimistic, and best-shot estimate, and then
compare your result to the true corresponding scores listed in the grades_scores11 dataset.
11.5 Exercises
Exercise 11.21 Does it make sense to compute the arithmetic mean of a categorical variable?
Exercise 11.22 Name the basic use cases for categorical data.
Exercise 11.23 (*) What is a Pareto chart?
Exercise 11.24 How can we deal with the case of the mode being nonunique?
Exercise 11.25 What is the meaning of the sum and mean for binary data (logical vectors)?
Exercise 11.26 What is the meaning of numpy.mean((x > 0) & (x < 1)), where x is a
numeric vector?
Exercise 11.27 List some ways to visualise multidimensional categorical data (combinations of
two or more factors).
Exercise 11.28 (*) State the null hypotheses verified by the one- and two-sample chi-squared
goodness-of-fit tests.
Exercise 11.29 (*) How is Cramér’s 𝑉 defined and what values does it take?
11 https://github.com/gagolews/teaching-data/raw/master/marek/grades_scores.txt
12
Processing data in groups
Consider another subset of the US Centres for Disease Control and Prevention National Health and Nutrition Examination Survey, this time carrying some body measures (P_BMX1) together with demographics (P_DEMO2).
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#")
nhanes = (
nhanes
.loc[
(nhanes.DMDBORN4 <= 2) & (nhanes.RIDAGEYR >= 18),
["RIDAGEYR", "BMXWT", "BMXHT", "BMXBMI", "RIAGENDR", "DMDBORN4"]
] # age >= 18 and only US and non-US born
.rename({
"RIDAGEYR": "age",
"BMXWT": "weight",
"BMXHT": "height",
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1) # rename columns
.dropna() # remove missing values
.reset_index(drop=True)
)
We consider only the adult (at least 18 years old) participants, whose country of birth (the US or not) is well-defined. Let’s recode the usborn and gender variables (for readability), and introduce the BMI categories:
nhanes = nhanes.assign(
usborn=( # recode usborn
nhanes.usborn.astype("category")
.cat.rename_categories(["yes", "no"]).astype("str")
),
gender=( # recode gender
nhanes.gender.astype("category")
.cat.rename_categories(["male", "female"]).astype("str")
1 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm
2 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm
Important When browsing the list of available attributes in the pandas manual, it is
worth knowing that DataFrameGroupBy and SeriesGroupBy are separate types. Still,
they have many methods and slots in common.
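For instance, grouping by a single qualitative column and counting the number of rows in each group might look as follows (a sketch):
nhanes.groupby("gender").size()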
It returns an object of the type Series. We can also perform the grouping with respect
to a combination of levels in two qualitative columns:
nhanes.groupby(["gender", "bmicat"], observed=False).size()
## gender bmicat
## female underweight 93
## normal 1161
## overweight 1245
## obese 2015
## male underweight 65
## normal 1074
## overweight 1513
## obese 1619
## dtype: int64
This yields a Series with a hierarchical index (Section 10.1.3). Nevertheless, we can
always call reset_index to convert it to standalone columns:
nhanes.groupby(
["gender", "bmicat"], observed=True
).size().rename("counts").reset_index()
## gender bmicat counts
## 0 female underweight 93
## 1 female normal 1161
## 2 female overweight 1245
## 3 female obese 2015
## 4 male underweight 65
## 5 male normal 1074
## 6 male overweight 1513
## 7 male obese 1619
Take note of the rename part. It gave us some readable column names.
Furthermore, it is possible to group rows in a data frame using a list of any Series
3 https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
objects, i.e., not just column names in a given data frame; see Section 16.2.3 for an
example.
Exercise 12.2 (*) Note the difference between pandas.GroupBy.count and pandas.GroupBy.
size methods (by reading their documentation).
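For example, a call along the following lines aggregates each group (a sketch):
nhanes.groupby("gender").mean(numeric_only=True)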
The arithmetic mean was computed only on numeric columns4 . Alternatively, we can
apply the aggregate only on specific columns:
nhanes.groupby("gender")[["weight", "height"]].mean().reset_index()
## gender weight height
## 0 female 78.351839 160.089189
## 1 male 88.589932 173.759541
Another example:
nhanes.groupby(["gender", "bmicat"], observed=False)["height"].\
mean().reset_index()
## gender bmicat height
## 0 female underweight 161.976344
## 1 female normal 160.149182
## 2 female overweight 159.573012
## 3 female obese 160.286452
## 4 male underweight 174.073846
## 5 male normal 173.443855
## 6 male overweight 173.051685
## 7 male obese 174.617851
Further, the most common aggregates that we described in Section 5.1 can be generated by calling the describe method:
nhanes.groupby("gender").height.describe().T
## gender female male
## count 4514.000000 4271.000000
4 (*) In this example, we called pandas.GroupBy.mean. Note that it has slightly different functionality
We have applied the transpose (T) to get a more readable (“tall”) result.
If different aggregates are needed, we can call aggregate to apply a custom list of
functions:
(nhanes.
groupby("gender")[["height", "weight"]].
aggregate(["mean", len, lambda x: (np.max(x)+np.min(x))/2]).
reset_index()
)
## gender height weight
## mean len <lambda_0> mean len <lambda_0>
## 0 female 160.089189 4514 160.2 78.351839 4514 143.45
## 1 male 173.759541 4271 172.1 88.589932 4271 139.70
Note The column names in the output object are generated by reading the applied
functions’ __name__ slots; see, e.g., print(np.mean.__name__).
mr = lambda x: (np.max(x)+np.min(x))/2
mr.__name__ = "midrange"
(nhanes.
loc[:, ["gender", "height", "weight"]].
groupby("gender").
aggregate(["mean", mr]).
reset_index()
)
## gender height weight
## mean midrange mean midrange
## 0 female 160.089189 160.2 78.351839 143.45
## 1 male 173.759541 172.1 88.589932 139.70
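The std0 helper used below is assumed to be the ddof=0 variant of the standard deviation, e.g.:
def std0(x, axis=None):
    # population (ddof=0) standard deviation; compare numpy.std's default
    return np.std(x, axis=axis, ddof=0)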
def standardise(x):
return (x-np.mean(x, axis=0))/std0(x, axis=0)
nhanes.loc[:, "height_std"] = (
nhanes.
loc[:, ["height", "gender"]].
groupby("gender").
transform(standardise)
)
nhanes.head()
## age weight height bmival gender usborn bmicat height_std
## 0 29 97.1 160.2 37.8 female no obese 0.015752
## 1 49 98.8 182.3 29.7 male yes overweight 1.108960
## 2 36 74.3 184.2 21.9 male yes normal 1.355671
## 3 68 103.7 185.3 30.2 male yes obese 1.498504
## 4 76 83.3 177.1 26.6 male yes overweight 0.433751
The new column gives the relative z-scores: a woman with a relative z-score of 0 has
height of 160.1 cm, whereas a man with the same z-score has height of 173.8 cm.
We can check that the means and standard deviations in both groups are equal to 0
and 1:
(nhanes.
loc[:, ["gender", "height", "height_std"]].
groupby("gender").
aggregate(["mean", std0])
)
## height height_std
## mean std0 mean std0
## gender
## female 160.089189 7.034703 -1.352583e-15 1.0
## male 173.759541 7.701323 3.129212e-16 1.0
Note that we needed to use a custom function for computing the standard deviation with ddof=0. This is most likely a bug in pandas: nhanes.groupby("gender").aggregate([np.std]) passes ddof=1 to numpy.std.
Exercise 12.3 Create a data frame comprised of the five tallest men and the five tallest women.
The way the output is formatted is imperfect, so we need to contemplate it for a tick. We see that when iterating through a GroupBy object, we get access to pairs giving all the levels of the grouping variable and the subsets of the input data frame corresponding to these categories.
Here is a simple example where we make use of the earlier fact:
for level, df in grouped:
# level is a string label
# df is a data frame - we can do whatever we want
print(f"There are {df.shape[0]} subject(s) with gender=`{level}`.")
## There are 1 subject(s) with gender=`female`.
## There are 4 subject(s) with gender=`male`.
We see that splitting followed by manual processing of the chunks in a loop is tedious in the case where we would merely like to compute some simple aggregates. These scenarios are extremely common. No wonder the pandas developers introduced a convenient interface in the form of the pandas.DataFrame.groupby and pandas.Series.groupby methods and the DataFrameGroupBy and SeriesGroupBy classes. Still, for more ambitious tasks, the low-level way to perform the splitting will come in handy.
Exercise 12.4 (**) Using the manual splitting and matplotlib.pyplot.boxplot, draw a
box-and-whisker plot of heights grouped by BMI category (four boxes side by side).
Exercise 12.5 (**) Using the manual splitting, compute the relative z-scores of the height
column separately for each BMI category.
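A figure like Figure 12.1 can be generated, for instance, with seaborn.boxplot (a sketch; the exact call and styling may differ):
sns.boxplot(data=nhanes, x="bmival", y="gender", hue="usborn")
plt.show()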
Figure 12.1. The distribution of BMIs for different genders and countries of birth.
Let’s contemplate for a while how easy it is now to compare the BMI distributions in
different groups. Here, we have two grouping variables, as specified by the y and hue
arguments.
Exercise 12.6 Create a similar series of violin plots.
Exercise 12.7 (*) Add the average BMIs in each group to the above box plot using matplotlib.
pyplot.plot. Check ylim to determine the range on the y-axis.
Figure 12.2. Number of persons for each gender and BMI category.
Exercise 12.8 Draw a similar bar plot where the bar heights sum to 100% for each gender.
Exercise 12.9 Using the two-sample chi-squared test, verify whether the BMI category distri-
butions for men and women differ significantly from each other.
Figure 12.3. The weight distribution of the US-born participants has a higher mean
and variance.
Figure 12.5. Distribution of weights for different genders and countries of birth.
Important Grid plots can feature any kind of data visualisation we have discussed so
far (e.g., histograms, bar plots, scatter plots).
Exercise 12.11 Draw a trellis plot with scatter plots of weight vs height grouped by BMI cat-
egory and gender.
We have used manual splitting of the weight column into subgroups and then plotted the two ECDFs separately because a call to seaborn.ecdfplot(data=nhanes, x="weight", hue="usborn") does not honour our wish to use alternating line styles (most likely due to a bug).
A two-sample Kolmogorov–Smirnov test can be used to check whether two ECDFs $\hat{F}_n'$ (e.g., the weight of the US-born participants) and $\hat{F}_m''$ (e.g., the weight of the non-US-born ones) come from the same underlying distribution.
The test statistic will be a variation of the one-sample setting discussed in Section 6.2.3.
Namely, let $\hat{D}_{n,m} = \sup_{t} \left| \hat{F}_n'(t) - \hat{F}_m''(t) \right|$ denote the greatest absolute deviation between the two ECDFs.
Assuming significance level 𝛼 = 0.001, the critical value is approximately (for larger
𝑛 and 𝑚) equal to:
$$
K_{n,m} = \sqrt{ -\frac{ \log(\alpha/2)\,(n+m) }{ 2nm } }.
$$
alpha = 0.001
np.sqrt(-np.log(alpha/2) * (len(x1)+len(x2)) / (2*len(x1)*len(x2)))
## 0.04607410479813944
As usual, we reject the null hypothesis when 𝐷̂ 𝑛,𝑚 ≥ 𝐾𝑛,𝑚 , which is exactly the case
here (at significance level 0.1%). In other words, weights of US- and non-US-born
participants differ significantly.
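As an aside, scipy offers a ready-made two-sample Kolmogorov–Smirnov test; a quick cross-check sketch (x1 and x2 denote the two weight samples compared above, which is not the approach taken in this section):
res = scipy.stats.ks_2samp(x1, x2)
res.statistic  # should agree with the D statistic discussed here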
Important Frequentist hypothesis testing only takes into account the deviation between distributions that is explainable due to sampling effects (the assumed randomness of the data generation process). For large sample sizes, even very small deviations6 will be deemed statistically significant, but it does not mean that we consider them as practically significant.
For instance, we might discover that a very costly, environmentally unfriendly, and generally inconvenient for everyone upgrade leads to a process’ improvement: we reject the null hypothesis stating that two distributions are equal. Nevertheless, a careful
5 Remember that this is an introductory course, and we are still being very generous here. We encourage
the readers to upskill themselves (later, of course) not only in mathematics, but also in programming (e.g.,
algorithms and data structures).
6 Including those that are merely due to round-off errors.
inspection told us that the gains are roughly 0.5%. In such a case, it is worthwhile to
apply good old common sense and refrain from implementing it.
Exercise 12.12 Compare between the ECDFs of weights of men and women who are between 18
and 25 years old. Determine whether they are significantly different.
Important Some statistical textbooks and many research papers in the social sciences (amongst many others) employ the significance level of 𝛼 = 5%, which is often criticised as too high7. Many stakeholders aggressively push towards constant improvements in terms of inventing bigger, better, faster, more efficient things. In this context, larger 𝛼 generates more sensational discoveries: it considers smaller differences as already significant. This all adds to what we call the reproducibility crisis in the empirical sciences.
We, on the other hand, claim that it is better to err on the side of being cautious. This,
in the long run, is more sustainable.
Notice that we interpolated between the quantiles in a larger sample to match the
length of the shorter vector.
7 For similar reasons, we do not introduce the notion of p-values. Most practitioners tend to misunderstand them anyway.
We are given each wine’s alcohol and residual sugar content, as well as a binary cat-
egorical variable stating whether a group of sommeliers deem a given beverage rather
bad (1) or not (0). Figure 12.8 reveals that subpar wines are fairly low in… alcohol and,
to some extent, sugar.
sns.scatterplot(x="alcohol", y="sugar", data=wine_train,
hue="bad", style="bad", markers=["o", "v"], alpha=0.5)
8 http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Figure 12.8. Scatter plot for sugar vs alcohol content for white, rather sweet wines, and
whether they are considered bad (1) or drinkable (0) by some experts.
Someone answer the door! We have a delivery of a few new wine bottles. Interestingly,
their alcohol and sugar contents have been given on their respective labels.
wine_test = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/sweetwhitewine_test2.csv",
comment="#").iloc[:, :-1]
wine_test.head()
## alcohol sugar
## 0 9.315523 10.041971
## 1 12.909232 6.814249
## 2 9.051020 12.818683
## 3 9.567601 11.091827
## 4 9.494031 12.053790
We would like to determine which of the wines from the test set might be not-bad
without asking an expert for their opinion. In other words, we would like to exercise
a classification task (see, e.g., [10, 50]). More formally:
Important Assume we are given a set of training points 𝐗 ∈ ℝ^{𝑛×𝑚} and the corresponding reference outputs 𝒚 ∈ {𝐿1, 𝐿2, …, 𝐿𝑙}^𝑛 in the form of a categorical variable with 𝑙 distinct levels. The aim of a classification algorithm is to predict what the outputs for each point from a possibly different dataset 𝐗′ ∈ ℝ^{𝑛′×𝑚}, i.e., 𝒚′̂ ∈ {𝐿1, 𝐿2, …, 𝐿𝑙}^{𝑛′}, might be.
In other words, we are asked to fill the gaps in a categorical variable. Recall that in a
regression problem (Section 9.2), the reference outputs were numerical.
Exercise 12.13 Which of the following are instances of classification problems and which are
regression tasks?
• Detect email spam.
• Predict a market stock price (good luck with that).
• Assess credit risk.
• Detect tumour tissues in medical images.
• Predict the time-to-recovery of cancer patients.
• Recognise smiling faces on photographs (kind of creepy).
• Detect unattended luggage in airport security camera footage.
What kind of data should you gather to tackle them?
2. Classify 𝒙′ as 𝑦′̂ = mode(𝑦𝑖1 , … , 𝑦𝑖𝑘 ), i.e., assign it the label that most frequently
occurs amongst its 𝑘 nearest neighbours. If a mode is nonunique, resolve the ties
at random.
It is thus a similar algorithm to 𝑘-nearest neighbour regression (Section 9.2.1). We
only replaced the quantitative mean with the qualitative mode.
This is a variation on the theme: if you don’t know what to do in a given situation, try
to mimic what the majority of people around you are doing or saying. For instance, if
you don’t know what to think about a particular wine, discover that amongst the five
most similar ones (in terms of alcohol and sugar content) three are said to be awful.
Now you can claim that you don’t like it because it’s not sweet enough. Thanks to this,
others will take you for a very refined wine taster.
Let’s apply a 5-nearest neighbour classifier on the standardised version of the dataset.
As we are about to use a technique based on pairwise distances, it would be best if the
variables were on the same scale. Thus, we first compute the z-scores for the training
set:
X_train = np.array(wine_train.loc[:, ["alcohol", "sugar"]])
means = np.mean(X_train, axis=0)
sds = np.std(X_train, axis=0)
Z_train = (X_train-means)/sds
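Analogously, the test set must be standardised using the aggregates computed on the training data, not its own ones (a minimal sketch; X_test is just an auxiliary name):
X_test = np.array(wine_test.loc[:, ["alcohol", "sugar"]])
Z_test = (X_test-means)/sds  # note: the training set's means and standard deviations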
Let’s stress that we referred to the aggregates computed for the training set. This is a
representative example of a situation where we cannot simply use a built-in method
from pandas. Instead, we apply what we have learnt about numpy.
To make the predictions, we will use the following function:
def knn_class(X_test, X_train, y_train, k):
    nnis = scipy.spatial.KDTree(X_train).query(X_test, k)[1]
    nnls = y_train[nnis]  # same as: y_train[nnis.reshape(-1)].reshape(-1, k)
    return scipy.stats.mode(nnls.reshape(-1, k), axis=1, keepdims=False)[0]
First, we fetched the indexes of each test point’s nearest neighbours (amongst the
points in the training set). Then, we read their corresponding labels; they are stored
in a matrix with 𝑘 columns. Finally, we computed the modes in each row. As a con-
sequence, we have each point in the test set classified.
And now:
k = 5
y_train = np.array(wine_train.bad)
y_pred = knn_class(Z_test, Z_train, y_train, k)
y_pred[:10] # preview
## array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
Note scipy.stats.mode does not resolve possible ties at random: e.g., the mode of
(1, 1, 1, 2, 2, 2) is always 1. Nevertheless, in our case, 𝑘 is odd and the number of pos-
sible classes is 𝑙 = 2, so the mode is always unique.
Figure 12.9 shows how nearest neighbour classification categorises different regions
of a section of the two-dimensional plane. The greater the 𝑘, the smoother the decision
boundaries. Naturally, in regions corresponding to few training points, we do not ex-
pect the classification accuracy to be acceptable9 .
9 (*) As an exercise, we could author a fixed-radius classifier; compare Section 8.4.4. In sparsely popu-
np.all(y_pred2 == y_pred)
## True
The accuracy score is the most straightforward measure of the similarity between these
true labels (denoted 𝒚′ = (𝑦′_1, …, 𝑦′_{𝑛′})) and the ones predicted by the classifier
(denoted 𝒚′̂ = (𝑦′̂_1, …, 𝑦′̂_{𝑛′})). It is defined as a ratio between the correctly classified
instances and all the instances:
\mathrm{Accuracy}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\sum_{i=1}^{n'} \mathbf{1}(y_i' = \hat{y}_i')}{n'},
where the indicator function 𝟏(𝑦𝑖′ = 𝑦𝑖′̂ ) = 1 if and only if 𝑦𝑖′ = 𝑦𝑖′̂ and 0 otherwise.
Computing it for our test sample gives:
np.mean(y_test == y_pred)
## 0.706
Thus, 71% of the wines were correctly classified with regard to their true quality. Before
we get too enthusiastic, let’s note that our dataset is slightly imbalanced in terms of the
distribution of label counts:
pd.Series(y_test).value_counts() # contingency table
## 0 330
## 1 170
## Name: count, dtype: int64
It turns out that the majority of the wines (330 out of 500) in our sample are truly de-
licious. Notice that a dummy classifier which labels all the wines as great would have
accuracy of 66%. Our 𝑘-nearest neighbour approach to wine quality assessment is not
that usable after all.
C = pd.DataFrame(
dict(y_pred=y_pred, y_test=y_test)
).value_counts().unstack(fill_value=0)
C
## y_test 0 1
## y_pred
## 0 272 89
## 1 58 81
In the binary classification case (𝑙 = 2) such as this one, its entries are usually referred
to as (see also the table below):
• TN – the number of cases where the true 𝑦𝑖′ = 0 and the predicted 𝑦𝑖′̂ = 0 (true
negative),
• TP – the number of instances such that the true 𝑦𝑖′ = 1 and the predicted 𝑦𝑖′̂ = 1
(true positive),
• FN – how many times the true 𝑦𝑖′ = 1 but the predicted 𝑦𝑖′̂ = 0 (false negative),
• FP – how many times the true 𝑦𝑖′ = 0 but the predicted 𝑦𝑖′̂ = 1 (false positive).
The terms positive and negative refer to the output predicted by a classifier, i.e., they
indicate whether some 𝑦𝑖′̂ is equal to 1 and 0, respectively.
Table 12.1. The different cases of true vs predicted labels in a binary classification task (𝑙 = 2).
                          true 𝑦𝑖′ = 0            true 𝑦𝑖′ = 1
predicted 𝑦𝑖′̂ = 0         True Negative (TN)      False Negative (FN)
predicted 𝑦𝑖′̂ = 1         False Positive (FP)     True Positive (TP)
Ideally, the number of false positives and false negatives should be as low as possible.
The accuracy score only takes the raw number of true negatives (TN) and true positives
(TP) into account:
\mathrm{Accuracy}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TN}+\mathrm{TP}}{\mathrm{TN}+\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}.
Consequently, it might not be a valid metric in imbalanced classification problems.
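By the way, we can recompute the accuracy score directly from the confusion matrix determined above (a quick sanity check; we convert C to a numpy array first):
C_arr = np.array(C)  # rows: predicted labels, columns: true labels
(C_arr[0, 0] + C_arr[1, 1]) / np.sum(C_arr)  # (TN+TP)/(TN+TP+FN+FP)
## 0.706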
There are, fortunately, some more meaningful measures in the case where class 1 is
less prevalent and where mispredicting it is considered more hazardous than making
an inaccurate prediction with respect to class 0. After all, most will agree that it is better
to be surprised by a vino mislabelled as bad than to be disappointed with a highly
recommended product around which we have already built some expectations. Further,
a virus infection that goes unrecognised when we are genuinely sick can be more
dangerous for the people around us than being asked to stay at home with nothing but
a headache.
Precision answers the question: If the classifier outputs 1, what is the probability that
this is indeed true?
\mathrm{Precision}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \frac{\sum_{i=1}^{n'} y_i' \hat{y}_i'}{\sum_{i=1}^{n'} \hat{y}_i'}.
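In our example (a quick computation mirroring the recall code below):
np.sum(y_test*y_pred)/np.sum(y_pred)  # precision
## 0.5827338129496403
Only about 58% of the wines flagged as bad by the classifier are genuinely bad.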
Recall (sensitivity, hit rate, or true positive rate) addresses the question: If the true class
is 1, what is the probability that the classifier will detect it?
\mathrm{Recall}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = \frac{\sum_{i=1}^{n'} y_i' \hat{y}_i'}{\sum_{i=1}^{n'} y_i'}.
C[1,1]/(C[1,1]+C[0,1]) # recall
## 0.4764705882352941
np.sum(y_test*y_pred)/np.sum(y_test) # equivalently
## 0.4764705882352941
Only 48% of the really bad wines will be filtered out by the classifier.
The F measure (or 𝐹1 measure) is the harmonic10 mean of precision and recall in the
case where we would rather have them aggregated into a single number:
\mathrm{F}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{1}{\frac{1}{2}\left(\frac{1}{\mathrm{Precision}}+\frac{1}{\mathrm{Recall}}\right)} = \left(\frac{\mathrm{Precision}^{-1}+\mathrm{Recall}^{-1}}{2}\right)^{-1} = \frac{\mathrm{TP}}{\mathrm{TP}+\frac{\mathrm{FP}+\mathrm{FN}}{2}}.
C[1,1]/(C[1,1]+0.5*C[0,1]+0.5*C[1,0]) # F
## 0.5242718446601942
• medical diagnosis,
• medical screening,
• suggestions of potential matches in a dating app,
• plagiarism detection,
• wine recommendation?
These always involve verifying the performance of many different classifiers, for example,
1-, 3-, 9-, and 15-nearest neighbours-based ones. For each of them, we need to
compute separate quality metrics, e.g., the F measures. Then, we promote the classi-
fier which enjoys the highest score. Unfortunately, if we do it recklessly, this can lead
to overfitting, this time with respect to the test set. The obtained metrics might be too
optimistic and can poorly reflect the real performance of the solution on future data.
Assuming that our dataset carries a decent number of observations, to overcome this
problem, we can perform a random training/validation/test split:
• training sample (e.g., 60% of randomly chosen rows) – for model construction,
• validation sample (e.g., 20%) – used to tune the hyperparameters of many classifiers
and to choose the best one,
• test (hold-out) sample (e.g., the remaining 20%) – used to assess the goodness of fit
of the best classifier.
This common sense-based approach is not limited to classification. We can validate
different regression models in the same way.
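For example, such a split can be generated by randomly permuting the row indexes (a minimal sketch on an illustrative data frame with random values; the proportions are of course adjustable):
np.random.seed(123)  # for reproducibility
df = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"])  # example data
idx = np.random.permutation(df.shape[0])  # a random ordering of the row indexes
n_train, n_valid = int(0.6*df.shape[0]), int(0.2*df.shape[0])
df_train = df.iloc[idx[:n_train], :]                    # 60%
df_valid = df.iloc[idx[n_train:(n_train+n_valid)], :]   # 20%
df_test  = df.iloc[idx[(n_train+n_valid):], :]          # the remaining 20%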
Exercise 12.16 Determine the best parameter setting for the 𝑘-nearest neighbour classification
of the color variable based on standardised versions of some physicochemical features (chosen
columns) of wines in the wine_quality_all11 dataset. Create a 60/20/20% dataset split. For
each 𝑘 = 1, 3, 5, 7, 9, compute the corresponding F measure on the validation set. Evaluate the
quality of the best classifier on the test set.
11 https://github.com/gagolews/teaching-data/raw/master/other/wine_quality_all.csv
Exercise 12.17 (**) Redo Exercise 12.16, but this time maximise the F measure obtained by a
5-fold cross-validation.
12 We do not have to denote the number of clusters by 𝑘. We could be speaking about the 2-means, 3-
means, 𝑙-means, or ü-means method too. Nevertheless, some mainstream practitioners consider 𝑘-means
as a kind of a brand name, let’s thus refrain from adding to their confusion. Another widely known al-
gorithm is called fuzzy (weighted) 𝑐-means [8].
We refer to the objective function as the (total) within-cluster sum of squares (WCSS). This
version looks easier, but that is a false impression: the 𝑙𝑖 s depend on the 𝒄𝑗 s. They vary
together. We have just made this dependence less explicit.
Given a fixed label vector 𝒍 representing a partition, 𝒄𝑗 is the centroid (Section 8.4.2)
of the points assigned thereto:
\boldsymbol{c}_j = \frac{1}{n_j} \sum_{i:\, l_i = j} \mathbf{x}_{i,\cdot},
where 𝑛𝑗 = |{𝑖 ∶ 𝑙𝑖 = 𝑗}| gives the number of 𝑖s such that 𝑙𝑖 = 𝑗, i.e., the size of the 𝑗-th
cluster.
The discovered cluster centres are stored in a matrix with 𝑘 rows and 𝑚 columns, i.e.,
the 𝑗-th row gives 𝐜𝑗 .
C
## array([[ 0.99622971, 1.052801 ],
## [-0.90041365, -1.08411794]])
As usual in Python, indexing starts at 0. So for 𝑘 = 2 we only obtain the labels 0 and 1.
Figure 12.10 depicts the two clusters together with the cluster centroids. We use l as a
colour selector in my_colours[l] (this is a clever instance of the integer vector-based
indexing). It seems that we correctly discovered the very natural partitioning of this
dataset into two clusters.
plt.scatter(X[:, 0], X[:, 1], c=np.array(["black", "#DF536B"])[l])
plt.plot(C[:, 0], C[:, 1], "yX")
plt.axis("equal")
plt.show()
Figure 12.10. Two clusters discovered by the 𝑘-means method. Cluster centroids are
marked separately.
The label vector l can be added as a new column in the dataset. Here is a preview:
Xl = pd.DataFrame(dict(X1=X[:, 0], X2=X[:, 1], l=l))
Xl.sample(5, random_state=42) # some randomly chosen rows
## X1 X2 l
## 184 -0.973736 -0.417269 1
## 1724 1.432034 1.392533 0
## 251 -2.407422 -0.302862 1
## 1121 2.158669 -0.000564 0
## 1486 2.060772 2.672565 0
We can now enjoy all the techniques for processing data in groups that we have dis-
cussed so far. In particular, computing the columnwise means gives nothing else than
the above cluster centroids:
Xl.groupby("l").mean()
## X1 X2
## l
## 0 0.996230 1.052801
## 1 -0.900414 -1.084118
The label vector l can be recreated by computing the distances between all the points
and the centroids and then picking the indexes of the closest pivots:
l_test = np.argmin(scipy.spatial.distance.cdist(X, C), axis=1)
np.all(l_test == l) # verify they are identical
## True
Important By construction13 , the 𝑘-means method can only detect clusters of convex
shapes (such as Gaussian blobs).
Exercise 12.18 Perform the clustering of the wut_isolation14 dataset and notice how non-
sensical, geometrically speaking, the returned clusters are.
Exercise 12.19 Determine a clustering of the wut_twosplashes15 dataset and display the res-
ults on a scatter plot. Compare them with those obtained on the standardised version of the data-
set. Recall what we said about the Euclidean distance and its perception being disturbed when a
plot’s aspect ratio is not 1:1.
Note (*) An even simpler classifier than the 𝑘-nearest neighbours one builds upon
the concept of the nearest centroids. Namely, it first determines the centroids (com-
ponentwise arithmetic means) of the points in each class. Then, a new point (from the
test set) is assigned to the class whose centroid is the closest thereto. The implement-
ation of such a classifier is left as a rather straightforward exercise for the reader. As
an application, we recommend using it to extrapolate the results generated by the 𝑘-
means method (for different 𝑘s) to previously unobserved data, e.g., all points on a
dense equidistant grid.
drawback in common. Namely, whatever they return can be suboptimal. Hence, they
can constitute a possibly meaningless solution.
The documentation of scipy.cluster.vq.kmeans2 is, of course, honest about it. It
states that the method attempts to minimise the Euclidean distance between observations and
centroids. Further, sklearn.cluster.KMeans, which implements a similar algorithm,
mentions that the procedure is very fast […], but it falls in local minima. That is why it can
be useful to restart it several times.
To understand what it all means, it will be very educational to study this issue in more
detail. This is because the discussed approach to clustering is not the only hard prob-
lem in data science (selecting an optimal set of independent variables with respect to
AIC or BIC in linear regression is another example).
3. Compute the centroids of the clusters defined by the label vector 𝒍, i.e., for every
𝑗 = 1, 2, … , 𝑘:
\boldsymbol{c}_j = \frac{1}{n_j} \sum_{i:\, l_i = j} \mathbf{x}_{i,\cdot}.
Figure 12.11. An example function (of only one variable; our problem is much higher-
dimensional) with many local minima. How can we be sure there is no better min-
imum outside of the depicted interval?
16 https://ssi.wi.th-koeln.de/
The objective function (total within-cluster sum of squares) at the returned cluster
centres is equal to:
import scipy.spatial.distance

def get_wcss(X, C):
    D = scipy.spatial.distance.cdist(X, C)**2
    return np.sum(np.min(D, axis=1))
get_wcss(X, C1)
## 446.5221283436733
Is it acceptable or not necessarily? We are unable to tell. What we can do, however, is
to run the algorithm again, this time from a different starting point:
np.random.seed(1234) # different seed - different initial centres
C2, l2 = scipy.cluster.vq.kmeans2(X, k)
C2
## array([[7.80779013, 5.19409177, 6.97790733],
## [6.31794579, 3.12048584, 3.84519706],
## [7.92606993, 6.35691349, 3.91202972]])
get_wcss(X, C2)
## 437.51120966832775
It is a better solution (we are lucky; it might as well have been worse). But is it the best
possible? Again, we cannot tell, alone in the dark.
Does a potential suboptimality affect the way the data points are grouped? It is indeed
the case here. Let’s look at the contingency table for the two label vectors:
pd.DataFrame(dict(l1=l1, l2=l2)).value_counts().unstack(fill_value=0)
## l2 0 1 2
## l1
## 0 8 0 43
## 1 39 6 0
## 2 0 57 1
Important Clusters are essentially unordered. The label vector (1, 1, 2, 2, 1, 3) repres-
ents the same clustering as the label vectors (3, 3, 2, 2, 3, 1) and (2, 2, 3, 3, 2, 1).
17 If we have many different heuristics, each aiming to approximate a solution to the 𝑘-means problem,
from the practical point of view it does not really matter which one returns the best solution – they are
merely our tools to achieve a higher goal. Ideally, we could run all of them many times and get the result
that corresponds to the smallest WCSS. It is crucial to do our best to find the optimal set of cluster centres –
the more approaches we test, the better the chance of success.
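Here is a minimal sketch of such a restarting scheme (assuming X, k, and get_wcss are defined as above; wcss gathers the values of the objective function at the local minima found):
np.random.seed(123)
wcss = []
for i in range(1000):                          # 1000 random restarts
    C_i, l_i = scipy.cluster.vq.kmeans2(X, k)  # new random initial centres each time
    wcss.append(get_wcss(X, C_i))
wcss = np.array(wcss)
# the exact minimum found depends on the number of restarts and the seed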
The best of the local minima (no guarantee that it is the global one, again) is:
np.min(wcss)
## 437.51120966832775
They are the same as C2 above (up to a permutation of labels). We were lucky18 , after
all.
It is very educational to look at the distribution of the objective function at the identi-
fied local minima to see that, proportionally, in the case of this dataset, it is not rare
to converge to a really bad solution; see Figure 12.12.
plt.hist(wcss, bins=100)
plt.show()
Also, Figure 12.13 depicts all the cluster centres to which the algorithm converged. We
see that we should not be trusting the results generated by a single run of a heuristic
solver to the 𝑘-means problem.
Example 12.22 (*) The scikit-learn package implements an algorithm similar to Lloyd's. The
method is equipped with the n_init parameter (defaulting to 10) that automatically applies
the aforementioned restarting.
import sklearn.cluster
np.random.seed(123)
km = sklearn.cluster.KMeans(k, n_init=10)
km.fit(X)
## KMeans(n_clusters=3, n_init=10)
km.inertia_ # WCSS – not optimal!
## 437.5467188958928
Still, there are no guarantees: the solution is suboptimal too. As an exercise, pass n_init=100,
n_init=1000, and n_init=10000 and determine the returned WCSS.
Figure 12.12. Within-cluster sum of squares at the results returned by different runs
of the 𝑘-means algorithm. Sometimes we might be very unlucky.
Note It is theoretically possible that a developer from the scikit-learn team, when
they see the preceding result, will make a tweak in the algorithm so that, after an
update to the package, the returned minimum will be better. This cannot be deemed a
bug fix, though, as there are no bugs here. Improving the behaviour of the method in
this example will lead to its degradation in others. There is no free lunch in optimisation.
18 Mind who is the benevolent dictator of the pseudorandom number generator's seed.
Note Some datasets are more well-behaving than others. The 𝑘-means method is over-
all quite usable, but we must always be cautious.
We recommend performing at least 100 random restarts. Also, if a report from data
analysis does not say anything about the number of tries performed, we are advised
to assume that the results are gibberish19 . People will complain about our being a pain,
but we know better; compare Rule#9.
Exercise 12.23 Run the 𝑘-means method, 𝑘 = 8, on the sipu_unbalance20 dataset from
many random sets of cluster centres. Note the value of the total within-cluster sum of squares.
Also, plot the cluster centres discovered. Do they make sense? Compare these to the case where you
start the method from the following cluster centres, which are close to the global minimum:
\mathbf{C} = \begin{bmatrix}
-15 &  5 \\
-12 & 10 \\
-10 &  5 \\
 15 &  0 \\
 15 & 10 \\
 20 &  5 \\
 25 &  0 \\
 25 & 10
\end{bmatrix}.
19 For instance, R's stats::kmeans automatically uses nstart=1. It is not rare, unfortunately, that data analysts rely on such defaults.
Figure 12.13. Traces of different cluster centres our 𝑘-means algorithm converged to.
Some are definitely not optimal, and therefore the method must be restarted a few
times to increase the likelihood of pinpointing the true solution.
12.6 Exercises
Exercise 12.24 Name the data type of the objects that the DataFrame.groupby method re-
turns.
Exercise 12.25 What is the relationship between the GroupBy, DataFrameGroupBy, and
SeriesGroupBy classes?
Exercise 12.26 What are relative z-scores and how can we compute them?
Exercise 12.27 Why and when might the accuracy score not be the best way to quantify a classifier's performance?
Exercise 12.28 What is the difference between recall and precision, both in terms of how they
are defined and where they are the most useful?
Exercise 12.29 Explain how the 𝑘-nearest neighbour classification and regression algorithms
work. Why do we say that they are model-free?
Exercise 12.30 In the context of 𝑘-nearest neighbour classification, why might it be important
to resolve the potential ties at random when computing the mode of the neighbours' labels?
Exercise 12.31 What is the purpose of a training/test and a training/validation/test set split?
Exercise 12.32 Give the formula for the total within-cluster sum of squares.
Exercise 12.33 Are there any cluster shapes that cannot be detected by the 𝑘-means method?
Exercise 12.34 Why do we say that solving the 𝑘-means problem is hard?
Exercise 12.35 Why is restarting Lloyd's algorithm many times necessary? Why are reports
from data analysis that do not mention the number of restarts not trustworthy?
13
Accessing databases
pandas is convenient for working with data that fit into memory and which can be
stored in individual CSV files. Still, larger information banks in a shared environment
will often be made available to us via relational (structured) databases such as
PostgreSQL or MariaDB, or a wide range of commercial products.
Most commonly, we use SQL (Structured Query Language) to define the data chunks1
we want to analyse. Then, we fetch them from the database driver in the form of a
pandas data frame. This enables us to perform the operations we are already familiar
with, e.g., various transformations or visualisations.
Below we make a quick introduction to the basics of SQL using SQLite2 , which is
a lightweight, flat-file, and server-less open-source database management system.
Overall, SQLite is a sensible choice for data of even hundreds or thousands of giga-
bytes in size that fit on a single computer’s disk. This is more than enough for playing
with our data science projects or prototyping more complex solutions.
Important In this chapter, we will learn that the syntax of SQL is very readable: it
is modelled after the natural (English) language. The purpose of this introduction is
not to teach how to compose one's own queries nor how to design new databanks. The latter is covered by a
separate course on database systems; see, e.g., [17, 21].
ally a more versatile choice. If we have too much data, we can always fetch their random samples (this is
what statistics is for) or pre-aggregate the information on the server side. This should be sufficient for most
intermediate-level users.
2 https://sqlite.org/
3 https://travel.stackexchange.com/
4 https://archive.org/details/stackexchange
First, Tags gives, amongst others, topic categories (TagName) and how many questions
mention them (Count):
Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Tags.csv.gz",
comment="#")
Tags.head(3)
## Count ExcerptPostId Id TagName WikiPostId
## 0 104 2138.0 1 cruising 2137.0
## 1 43 357.0 2 caribbean 356.0
## 2 43 319.0 4 vacations 318.0
Third, Badges recalls all rewards handed to the users (UserId) for their engaging in
various praiseworthy activities:
Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Badges.csv.gz",
comment="#")
Badges.head(3)
## Class Date Id Name TagBased UserId
## 0 3 2011-06-21T20:16:48.910 1 Autobiographer False 2
## 1 3 2011-06-21T20:16:48.910 2 Autobiographer False 3
## 2 3 2011-06-21T20:16:48.910 3 Autobiographer False 4
Fourth, Posts lists all the questions and answers (the latter do not have ParentId set
to NaN).
Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Posts.csv.gz",
comment="#")
Posts.head(3)
## AcceptedAnswerId ... ViewCount
## 0 393.0 ... 419.0
## 1 NaN ... 1399.0
## 2 NaN ... NaN
##
## [3 rows x 17 columns]
Fifth, Votes lists all the up-votes (VoteTypeId equal to 2) and down-votes (VoteTypeId
of 3) to all the posts.
Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Votes.csv.gz",
comment="#")
Votes.head(3)
## BountyAmount CreationDate Id PostId UserId VoteTypeId
## 0 NaN 2011-06-21T00:00:00.000 1 1 NaN 2
## 1 NaN 2011-06-21T00:00:00.000 2 1 NaN 2
## 2 NaN 2011-06-21T00:00:00.000 3 2 NaN 2
Exercise 13.1 See the README5 file for a detailed description of each column. Note that rows
are uniquely defined by their respective Ids. There are relations between the data frames, e.g.,
Users.Id vs Badges.UserId, Posts.Id vs Votes.PostId, etc. Moreover, for privacy reas-
ons, some UserIds might be missing. In such a case, they are encoded with a not-a-number; com-
pare Chapter 15.
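Let's prepare a file path for an example SQLite database (one of many possible ways to generate such a path; the exact call is only illustrative):
import tempfile
import os.path
dbfile = os.path.join(tempfile.mkdtemp(), "travel.db")  # a randomly named temporary directory, e.g., under /tmp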
It defines the file path (compare Section 13.6.1) where the database is going to be stored.
We use a randomly generated filename inside the local file system’s (we are on Linux)
temporary directory, /tmp. This is just a pleasant exercise, and we will not be using this
database afterwards. The reader might prefer setting a filename relative to the current
working directory (as given by os.getcwd), e.g., dbfile = "travel.db".
We can now connect to the database:
import sqlite3
conn = sqlite3.connect(dbfile)
The database might now be queried: we can add new tables, insert new rows, and re-
trieve records.
5 https://github.com/gagolews/teaching-data/raw/master/travel_stackexchange_com_2017/README.md
Important In the end, we must not forget about the call to conn.close().
Our data are already in the form of pandas data frames. Therefore, exporting them to
the database is straightforward. We only need to make a series of calls to the pandas.
DataFrame.to_sql method.
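For example (a minimal sketch; passing if_exists="replace" lets us rerun the whole session conveniently):
Tags.to_sql("Tags", conn, index=False, if_exists="replace")
Users.to_sql("Users", conn, index=False, if_exists="replace")
Badges.to_sql("Badges", conn, index=False, if_exists="replace")
Posts.to_sql("Posts", conn, index=False, if_exists="replace")
Votes.to_sql("Votes", conn, index=False, if_exists="replace")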
Note (*) It is possible to export data that do not fit into memory by reading them
in chunks of considerable, but not too large, sizes. In particular pandas.read_csv
has the nrows argument that lets us read several rows from a file connection; see
Section 13.6.4. Then, pandas.DataFrame.to_sql(..., if_exists="append") can be
used to append new rows to an existing table.
Exporting data can be done without pandas as well, e.g., when they are to be fetched
from XML or JSON files (compare Section 13.5) and processed manually, row by row.
Intermediate-level SQL users can call conn.execute("CREATE TABLE t..."), fol-
lowed by conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l), and then
conn.commit(). This will create a new table (here: named t) populated by a list of re-
cords (e.g., in the form of tuples or numpy vectors). For more details, see the manual6
of the sqlite3 package.
6 https://docs.python.org/3/library/sqlite3.html
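With the tables in place, we can start querying the database. For instance (mirroring the reference query in Example 13.3 below):
pd.read_sql_query("""
    SELECT * FROM Tags LIMIT 3
""", conn)
## Count ExcerptPostId Id TagName WikiPostId
## 0 104 2138.0 1 cruising 2137.0
## 1 43 357.0 2 caribbean 356.0
## 2 43 319.0 4 vacations 318.0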
This query selected all columns (SELECT *) and the first three rows (LIMIT 3) from the
Tags table.
Exercise 13.2 For the afore- and all the undermentioned SQL queries, write the equivalent Py-
thon code that generates the same result using pandas functions and methods. In each case, there
might be more than one equally fine solution. In case of any doubt about the meaning of the quer-
ies, please refer to the SQLite documentation7 . Example solutions are provided at the end of this
section.
Example 13.3 For a reference query:
res1a = pd.read_sql_query("""
SELECT * FROM Tags LIMIT 3
""", conn)
No error message means that the test is passed. The cordial thing about the assert_frame_equal
function is that it ignores small round-off errors introduced by arithmetic operations.
Nonetheless, the results generated by pandas might be the same up to the reordering of
rows. In such a case, before calling pandas.testing.assert_frame_equal, we can invoke
DataFrame.sort_values on both data frames to sort them with respect to 1 or 2 chosen
columns.
13.3.1 Filtering
Exercise 13.4 From Tags, select two columns TagName and Count and rows for which Tag-
Name is equal to one of the three choices provided.
res2a = pd.read_sql_query("""
SELECT TagName, Count
FROM Tags
WHERE TagName IN ('poland', 'australia', 'china')
""", conn)
res2a
## TagName Count
## 0 china 443
## 1 australia 411
## 2 poland 139
Exercise 13.5 Select a set of columns from Posts whose rows fulfil a given set of conditions.
res3a = pd.read_sql_query("""
SELECT Title, Score, ViewCount, FavoriteCount
FROM Posts
WHERE PostTypeId=1 AND
ViewCount>=10000 AND
FavoriteCount BETWEEN 35 AND 100
""", conn)
res3a
## Title ... FavoriteCount
## 0 When traveling to a country with a different c... ... 35.0
## 1 How can I do a "broad" search for flights? ... 49.0
## 2 Tactics to avoid getting harassed by corrupt p... ... 42.0
## 3 Flight tickets: buy two weeks before even duri... ... 36.0
## 4 OK we're all adults here, so really, how on ea... ... 79.0
## 5 How to intentionally get denied entry to the U... ... 53.0
## 6 How do you know if Americans genuinely/literal... ... 79.0
## 7 OK, we are all adults here, so what is a bidet... ... 38.0
## 8 How to cope with too slow Wi-Fi at hotel? ... 41.0
##
## [9 rows x 4 columns]
13.3.2 Ordering
Exercise 13.6 Select the Title and Score columns from Posts where ParentId is missing
(i.e., the post is, in fact, a question) and Title is well-defined. Then, sort the results by the Score
column, decreasingly (descending order). Finally, return only the first five rows (e.g., top five scor-
ing questions).
res4a = pd.read_sql_query("""
SELECT Title, Score
FROM Posts
WHERE ParentId IS NULL AND Title IS NOT NULL
ORDER BY Score DESC
LIMIT 5
""", conn)
res4a
## Title Score
## 0 OK we're all adults here, so really, how on ea... 306
## 1 How do you know if Americans genuinely/literal... 254
## 2 How to intentionally get denied entry to the U... 219
## 3 Why are airline passengers asked to lift up wi... 210
## 4 Why prohibit engine braking? 178
Exercise 13.10 Count how many unique combinations of pairs (Name, Year) for the badges
won by the user with Id=23 are there. Then, return only the rows having Count greater than 1
and order the results by Count decreasingly. In other words, list the badges received more than
once in any given year.
res8a = pd.read_sql_query("""
    SELECT
        Name,
        CAST(strftime('%Y', Date) AS FLOAT) AS Year,
        COUNT(*) AS Count
    FROM Badges
    WHERE UserId=23
    GROUP BY Name, Year
    HAVING COUNT(*) > 1
    ORDER BY Count DESC
""", conn)
Note that WHERE is performed before GROUP BY, and HAVING is applied thereafter.
13.3.5 Joining
Exercise 13.11 Join (merge) Tags, Posts, and Users for all posts with OwnerUserId not
equal to -1 (i.e., the tags which were created by “alive” users). Return the top six records with
respect to Tags.Count.
res9a = pd.read_sql_query("""
SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId,
Users.Age, Users.Location, Users.DisplayName
FROM Tags
JOIN Posts ON Posts.Id=Tags.WikiPostId
JOIN Users ON Users.AccountId=Posts.OwnerUserId
WHERE OwnerUserId != -1
ORDER BY Tags.Count DESC, Tags.TagName ASC
LIMIT 6
""", conn)
res9a
## TagName Count ... Location DisplayName
## 0 canada 802 ... Mumbai, India hitec
## 1 europe 681 ... Philadelphia, PA Adam Tuttle
## 2 visa-refusals 554 ... New York, NY Benjamin Pollack
## 3 australia 411 ... Mumbai, India hitec
## 4 eu 204 ... Philadelphia, PA Adam Tuttle
## 5 new-york-city 204 ... Mumbai, India hitec
##
## [6 rows x 6 columns]
Exercise 13.12 First, create an auxiliary (temporary) table named UpVotesTab, where we
store the information about the number of up-votes (VoteTypeId=2) that each post has received.
Then, join (merge) this table with Posts and fetch some details about the five questions (Post-
TypeId=1) with the most up-votes.
res10a = pd.read_sql_query("""
    SELECT UpVotesTab.*, Posts.Title FROM
    (
        SELECT PostId, COUNT(*) AS UpVotes
        FROM Votes
        WHERE VoteTypeId=2
        GROUP BY PostId
    ) AS UpVotesTab
    JOIN Posts ON UpVotesTab.PostId=Posts.Id
    WHERE Posts.PostTypeId=1
    ORDER BY UpVotesTab.UpVotes DESC
    LIMIT 5
""", conn)
Example 13.14 To generate res3a with pandas only, we need some more complex filtering with
loc[...]:
res3b = (
Posts.
loc[
(Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) &
(Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100),
["Title", "Score", "ViewCount", "FavoriteCount"]
].
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res3a, res3b) # no error == OK
Example 13.15 For res4a, some filtering and sorting is all we need:
res4b = (
Posts.
loc[
Posts.ParentId.isna() & (~Posts.Title.isna()),
["Title", "Score"]
].
sort_values("Score", ascending=False).
head(5).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res4a, res4b) # no error == OK
Example 13.17 For res6a, we first need to add a new column to the copy of Badges:
Badges2 = Badges.copy() # otherwise we would destroy the original object
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
Had we not converted Year to float, we would obtain a meaningless average year, without any
warning.
Unfortunately, the rows in res7a and res7b are ordered differently. For testing, we need to re-
order them in the same way:
pd.testing.assert_frame_equal(
res7a.sort_values(["Name", "Count"]).reset_index(drop=True),
res7b.sort_values(["Name", "Count"]).reset_index(drop=True)
) # no error == OK
Example 13.19 For res8a, we first count the number of values in each group:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
res8b = (
Badges2.
loc[ Badges2.UserId == 23, ["Name", "Year"] ].
groupby(["Name", "Year"]).
size().
rename("Count").
reset_index()
)
Example 13.20 To obtain a result equivalent to res9a, we need to merge Posts with Tags, and
then merge the result with Users:
res9b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId")
res9b = pd.merge(Users, res9b, left_on="AccountId", right_on="OwnerUserId")
res9b = (
res9b.
loc[
(res9b.OwnerUserId != -1) & (~res9b.OwnerUserId.isna()),
["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"]
].
sort_values(["Count", "TagName"], ascending=[False, True]).
head(6).
reset_index(drop=True)
)
Example 13.21 To obtain a result equivalent to res10a, we first need to create an auxiliary data
frame that corresponds to the subquery.
UpVotesTab = (
Votes.
loc[Votes.VoteTypeId==2, :].
groupby("PostId").
size().
rename("UpVotes").
reset_index()
)
And now:
res10b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id")
res10b = (
res10b.
loc[res10b.PostTypeId==1, ["PostId", "UpVotes", "Title"]].
sort_values("UpVotes", ascending=False).
head(5).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res10a, res10b) # no error == OK
11 https://uvdata.arpansa.gov.au/xml/uvvalues.xml
12 https://en.wikipedia.org/wiki/List_of_20th-century_classical_composers
13 https://en.wikipedia.org/wiki/Category:Fr%C3%A9d%C3%A9ric_Chopin
14 https://stackexchange.com/
15 https://archive.org/details/stackexchange
16 https://meta.wikimedia.org/wiki/Data_dumps
17 https://wikimediafoundation.org/
import os.path
os.path.join("~", "Desktop", "file.csv") # we are on GNU/Linux
## '~/Desktop/file.csv'
Important We will frequently be referring to file paths relative to the working direct-
ory of the currently executed Python session (e.g., from which IPython/Jupyter note-
book server was started); see os.getcwd.
All non-absolute file names (ones that do not start with “~”, “/”, “c:\\”, and the like),
for example, "filename.csv" or os.path.join("subdir", "filename.csv") are al-
ways relative to the current working directory.
For instance, if the working directory is "/home/marek/projects/python", then
"filename.csv" refers to "/home/marek/projects/python/filename.csv".
Also, “..” denotes the current working directory’s parent directory. Thus, "../
filename2.csv" resolves to "/home/marek/projects/filename2.csv".
Exercise 13.28 Print the current working directory by calling os.getcwd. Next, download the
file air_quality_2018_param18 and save it in the current Python session’s working directory
(e.g., in your web browser, right-click on the web page’s canvas and select Save Page As…). Load
with pandas.read_csv by passing "air_quality_2018_param.csv" as the input path.
Exercise 13.29 (*) Download the aforementioned file programmatically (if you have not done
so yet) using the requests module.
13.8 Exercises
Exercise 13.31 Find an example of an XML and JSON file. Which one is more human-
readable? Do they differ in terms of capabilities?
Exercise 13.32 What is wrong with constructing file paths like "~" + "\\" + "filename.
csv"?
Exercise 13.33 What are the benefits of using a SQL database management system in data sci-
ence activities?
Exercise 13.34 (*) How can we populate a database with gigabytes of data read from many CSV
files?
Part V
There are a few binary operators overloaded for strings, e.g., `+` stands for string con-
catenation:
x + " and eggs"
## 'spam and eggs'
Chapter 3 noted that str is a sequential type. As a consequence, we can extract indi-
vidual code points and create substrings using the index operator:
x[-1] # last letter
## 'm'
Strings are immutable, but parts thereof can always be reused in conjunction with the
concatenation operator:
x[:2] + "ecial"
## 'special'
Note Despite the wide support for Unicode, sometimes our own or other readers’ dis-
play (e.g., web browsers when viewing an HTML version of the output report) might
not be able to render all code points properly, e.g., due to missing fonts. Still, we can
rest assured that they are processed correctly if string functions are applied thereon.
Note (*) More advanced string transliteration2 can be performed by means of the ICU3
(International Components for Unicode) library. Its Python bindings are provided by
the PyICU package. Unfortunately, the package is not easily available on Win***s.
For instance, converting all code points to ASCII (English) might be necessary when
identifiers are expected to miss some diacritics that would normally be included
(as in "Gągolewski" vs "Gagolewski"):
import icu  # PyICU package
(icu.Transliterator
    .createInstance("Lower; Any-Latin; Latin-ASCII")
    .transliterate(
        "Χαίρετε! Groß gżegżółka — © La Niña – köszönöm – Gągolewski"
    )
)
## 'chairete! gross gzegzolka - (C) la nina - koszonom - gagolewski'
1 (*) More precisely, Python strings are UTF-8-encoded. Most web pages and APIs are nowadays served
in UTF-8. But we can still occasionally encounter files encoded in ISO-8859-1 (Western Europe), Windows-
1250 (Eastern Europe), Windows-1251 (Cyrillic), GB18030 and Big5 (Chinese), EUC-KR (Korean), Shift-JIS
and EUC-JP (Japanese), amongst others. They can be converted using the bytes.decode method.
2 https://unicode-org.github.io/icu/userguide/transforms/general
3 https://icu.unicode.org/
food.replace("spam", "veggies")
## 'bacon, veggies, veggies, srapatapam, eggs, and veggies'
Exercise 14.1 Read the manual of the following methods: str.startswith, str.endswith,
str.find, str.rfind, str.rindex, str.removeprefix, and str.removesuffix.
4 https://www.unicode.org/faq/normalization.html
The splitting of long strings at specific fixed delimiters can be done via:
food.split(", ")
## ['bacon', 'spam', 'spam', 'srapatapam', 'eggs', 'and spam']
See also str.partition. The str.join method implements the inverse operation:
", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'
Moreover, Section 14.4 will discuss pattern matching with regular expressions. They
can be useful in, amongst others, extracting more abstract data chunks (numbers,
URLs, email addresses, IDs) from strings.
We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" <
"baaaaaaa" < "bb" < "Spanish Inquisition".
Additionally, it only takes into account the numeric codes (see Section 14.4.3) corres-
ponding to each Unicode character. Consequently, it does not work well with non-
English alphabets:
"MIELONECZKĄ" < "MIELONECZKI"
## False
In Polish, A with ogonek (Ą) is expected to sort after A and before B, let alone I. However,
their corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and
73 (I). The resulting ordering is thus incorrect, as far as natural language processing is
concerned.
It is best to perform string collation using the services provided by ICU. Here is an
example of German phone book-like collation where "ö" is treated the same as "oe":
c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0) # ignore case and some diacritics
c.compare("Löwe", "loewe")
## 0
14 TEXT DATA 351
In some languages, contractions occur, e.g., in Slovak and Czech, two code points "ch"
are treated as a single entity and are sorted after "h":
icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1
This means that we have "chladný" > "hladný" (the first argument is greater than the
second one). Compare the above to something similar in Polish:
icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1
That is, "chłodny" < "hardy" (the first argument is less than the second one).
Which is the correct result: "a9" is less than "a123" (compare the above to the example
where we used the ordinary `<`).
This allows for missing values encoding by means of the None object (which is of the
type None, not str); compare Section 15.1.
Vectorised versions of base string operations are available via the pandas.Series.
str accessor. We thus have pandas.Series.str.strip, pandas.Series.str.split,
pandas.Series.str.find, and so forth. For instance:
5 https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
But there is more. For example, a function to compute the length of each string:
x.str.len()
## 0 4.0
## 1 5.0
## 2 NaN
## 3 9.0
## 4 4.0
## dtype: float64
Vectorised concatenation of strings can be performed using the overloaded `+` oper-
ator:
x + " and spam"
## 0 spam and spam
## 1 bacon and spam
## 2 NaN
## 3 buckwheat and spam
## 4 spam and spam
## dtype: object
Conversion to numeric:
pd.Series(["1.3", "-7", None, "3523"]).astype(float)
## 0 1.3
## 1 -7.0
## 2 NaN
## 3 3523.0
## dtype: float64
Select substrings:
x.str.slice(2, -1)  # like x.iloc[i][2:-1] for all i
## 0 a
## 1 co
## 2 None
## 3 ckwhea
## 4 a
## dtype: object
Replace substrings:
x.str.slice_replace(0, 2, "tofu")  # like x.iloc[i][:2] = "tofu" for all i
## 0 tofuam
## 1 tofucon
## 2 None
## 3 tofuckwheat
## 4 tofuam
## dtype: object
Exercise 14.2 Consider the nasaweather_glaciers6 data frame. All glaciers are assigned
11/12-character unique identifiers as defined by the WGMS convention that forms the glacier ID
number by combining the following five elements:
1. 2-character political unit (the first two letters of the ID),
2. 1-digit continent code (the third letter),
3. 4-character drainage code (the next four),
4. 2-digit free position code (the next two),
5. 2- or 3-digit local glacier code (the remaining ones).
Extract the five chunks and store them as independent columns in the data frame.
Here, the data type “<U5” means that we deal with Unicode strings of length no greater
than five. Unfortunately, replacing elements with too long a content will spawn trun-
cated strings:
6 https://github.com/gagolews/teaching-data/raw/master/other/nasaweather_glaciers.csv
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')
The numpy.char7 module includes several vectorised versions of string routines, most
of which we have already discussed. For example:
x = np.array([
"spam", "spam, bacon, and spam",
"spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
## list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
## dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])
Vectorised operations that we would normally perform through the binary operators
(i.e., `+`, `*`, `<`, etc.) are available through standalone functions:
np.char.add(["spam", "bacon"], " and spam")
## array(['spam and spam', 'bacon and spam'], dtype='<U14')
np.char.equal(["spam", "bacon", "spam"], "spam")
## array([ True, False, True])
The function that returns the length of each string is also noteworthy:
np.char.str_len(x)
## array([ 4, 21, 39])
7 https://numpy.org/doc/stable/reference/routines.char.html
x = pd.Series([
"spam",
"spam, bacon, spam",
"potatoes",
None,
"spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0 [spam]
## 1 [spam, bacon, spam]
## 2 [potatoes]
## 3 None
## 4 [spam, eggs, bacon, spam, spam]
## dtype: object
or slicing:
pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'
creates a string showing the value of the variable pi formatted as a float rounded to
two places after the decimal separator.
Note (**) Similar functionality can be achieved using the str.format method:
"π = {:.2f}".format(pi)
## 'π = 3.14'
as well as the `%` operator overloaded for strings, which uses sprintf-like value place-
holders known to some readers from other programming languages (such as C):
"π = %.2f" % pi
## 'π = 3.14'
The former is more human-readable, and the latter is slightly more technical. Note
that repr often returns an output that can be interpreted as executable Python code
with no or few adjustments. Nonetheless, pandas objects are amongst the many ex-
ceptions to this rule.
Result: 2² = 2 ⋅ 2 = 4.
Recall from Section 1.2.5 that Markdown is a very flexible markup11 language that al-
lows us to define itemised and numbered lists, mathematical formulae, tables, im-
ages, etc.
On a side note, data frames can be nicely prepared for display in a report using pandas.
DataFrame.to_markdown.
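For example (an illustrative data frame; this method relies on the optional tabulate package):
print(pd.DataFrame(dict(Rank=[1, 2, 3], Food=["spam", "bacon", "eggs"])).to_markdown(index=False))
# prints a pipe-style Markdown table with one row per data frame row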
11 (*) Markdown is amongst many markup languages. Other learn-worthy ones include HTML (for the
Web) and LaTeX (especially for the beautiful typesetting of maths, print-ready articles, and books, e.g., PDF;
see [72] for a comprehensive introduction).
We can convert it to other formats, including HTML, PDF, EPUB, ODT, and even
presentations by running12 the pandoc13 tool. We may also embed it directly inside
an IPython/Jupyter notebook:
IPython.display.Markdown(out)
Rank Food
1 maps
2 nocab
3 sgge
4 maps
Note Figures created in matplotlib can be exported to PNG, SVG, or PDF files us-
ing the matplotlib.pyplot.savefig function. We can include them manually in a
Markdown document using the  syntax.
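For instance (a tiny sketch; the figure contents and the file name are arbitrary):
plt.plot([1, 2, 3], [3, 1, 2])
plt.savefig("myfigure.png")  # or .svg, .pdf, etc.
plt.close()
Then,  embeds it in the report.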
Note (*) IPython/Jupyter Notebooks can be converted to different formats using the
jupyter-nbconvert14 command line tool. jupytext15 can create notebooks from ordinary
text files. Literate programming with mixed R and Python is possible with the
R packages knitr16 and reticulate17. See [76] for an overview of many more options.
12 External programs can be executed using subprocess.run.
13 https://pandoc.org/
Most programming languages and text editors (including Kate18 and VSCodium19 )
support regex-based pattern finding and replacing. This is why regular expressions
should be amongst the instruments at every data scientist’s disposal.
14 https://pypi.org/project/nbconvert
15 https://jupytext.readthedocs.io/en/latest
16 https://yihui.org/knitr
17 https://rstudio.github.io/reticulate
18 https://kate-editor.org/
19 https://vscodium.com/
The order of arguments is (look for what, where), not vice versa.
Important We used the r"..." prefix to input a string so that “\b” is not treated as
an escape sequence which denotes the backspace character. Otherwise, the foregoing
would have to be written as “\\bni+\\b”.
If we had not insisted on matching at the word boundaries (i.e., if we used the simple
"ni+" regex instead), we would also match the "ni" in "knights".
The re.search function returns an object of the class re.Match that enables us to get
some more information about the first match:
r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')
It includes the start and the end position (index) as well as the match itself. If the regex
contains capture groups (more details follow), we can also pinpoint the matches thereto.
Moreover, re.finditer returns an iterable object that includes the same details, but
now about all the matches:
rs = re.finditer(r"\bni+\b", x)
for r in rs:
    print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')
re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']
The “!\s+” regex matches an exclamation mark followed by one or more whitespace
characters.
re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"
Note (**) More flexible replacement strings can be generated by passing a custom
function as the second argument:
re.sub(r"\bni+\b", lambda m: "n" + "u"*(m.end()-m.start()-1), x)
## "We're the knights who say nu! nuuuuu! nu! nuuuuuuuuu!"
Note (*) If we intend to seek matches to the same pattern in many different strings
without the use of pandas, it might be faster to precompile a regex first, and then use
the re.Pattern.findall method instead of re.findall:
p = re.compile(r"\bni+\b") # returns an object of the class `re.Pattern`
p.findall("We're the Spanish Inquisition ni! ni! niiiii! nininiiiiiiiii!")
## ['ni', 'ni', 'niiiii']
Important The following characters have special meaning to the regex engine: “.”, “\”,
“|”, “(“, “)”, “[“, “]”, “{“, “}”, “^”, “$”, “*”, “+”, and “?”.
Any regular expression that contains none of the preceding characters behaves like a
fixed pattern:
20 https://docs.python.org/3/library/re.html
There are three occurrences of a pattern that is comprised of four code points, “s” fol-
lowed by “p”, then by “a”, and ending with “m”.
It extracted non-overlapping substrings of length four that end with “am”, case-
insensitively.
The dot’s insensitivity to the newline character is motivated by the need to maintain
compatibility with tools such as grep (when searching within text files in a line-by-line
manner). This behaviour can be altered by setting the DOTALL flag.
re.findall("..am", x, re.DOTALL|re.IGNORECASE) # `|` is the bitwise OR
## ['Spam', ' ham', '\njam', 'SPAM', 'spam']
the “[hj]am” regex matches: “h” or “j”, followed by “a”, followed by “m”. In other words,
"ham" and "jam" are the only two strings that are matched by this pattern (unless
matching is done case-insensitively).
Important The following characters, if used within square brackets, may be treated
not literally: “\”, “[“, “]”, “^”, “-“, “&”, “~”, and “|”.
To include them as-is in a character set, the backslash-escape must be used. For ex-
ample, “[\[\]\\]” matches a backslash or a square bracket.
This pattern denotes the union of three code ranges: ASCII upper- and lowercase let-
ters and digits. Nowadays, in the processing of text in natural languages, this notation
should be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.
Some glyphs are not available in the PDF version of this book because we did not install
the required fonts, e.g., the Arabic digit 4 or left and right arrows. However, they are
well-defined at the program level.
Noteworthy Unicode-aware code point classes include the word characters:
re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '�', '1', '2', '�', '�']
decimal digits:
21 https://www.unicode.org/charts
re.findall(r"\d", x)
## ['1', '2', '�', '�']
and whitespaces:
re.findall(r"\s", x)
## [' ', '\t', '\n']
Moreover, e.g., “\W” is equivalent to “[^\w]” , i.e., denotes the set’s complement.
re.findall(
"(?:sp|h)" + # match either 'sp' or 'h'
"am" + # followed by 'am'
"|" + # ... or ...
"egg", # just match 'egg'
x
)
## ['spam', 'egg', 'ham', 'spam']
Greedy:
x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']
Lazy:
re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']
re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']
The first regex is greedy: it matches an opening bracket, then as many characters as
possible (including “)”) that are followed by a closing bracket. The two other patterns
terminate as soon as the first closing bracket is found.
More examples:
x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']
And:
re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']
Let’s stress that the quantifier is applied to the subexpression that stands directly be-
fore it. Grouping parentheses can be used in case they are needed.
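For instance (an illustrative comparison):
re.findall("spam+", "spammm, spam")         # `+` applies only to the preceding 'm'
## ['spammm', 'spam']
re.findall("(?:spam)+", "spamspam, spam")   # `+` applies to the whole 'spam' group
## ['spamspam', 'spam']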
x = "12, 34.5, 678.901234, 37...629, ..."
re.findall(r"\d+\.\d+", x)
## ['34.5', '678.901234']
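On the other hand, a call along the lines of (one possible pattern)
re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']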
finds digits which are possibly (but not necessarily) followed by a dot and a digit se-
quence.
Exercise 14.5 Write a regex that extracts all #hashtags from a string #omg #SoEasy.
It returned the matches to the individual capture groups, not the whole matching sub-
strings.
re.search and re.finditer can pinpoint each component:
r = re.search(r"(\w+)='(.+?)'", x)
print("whole (0):", (r.start(), r.end(), r.group()))
print(" 1 :", (r.start(1), r.end(1), r.group(1)))
print(" 2 :", (r.start(2), r.end(2), r.group(2)))
## whole (0): (0, 20, "name='Sir Launcelot'")
## 1 : (0, 4, 'name')
## 2 : (6, 19, 'Sir Launcelot')
Here is its vectorised version using pandas, returning the first match:
y = pd.Series([
"name='Sir Launcelot'",
"quest='Seek Grail'",
"favcolour='blue', favcolour='yel.. Aaargh!'"
])
y.str.extract(r"(\w+)='(.+?)'")
## 0 1
## 0 name Sir Launcelot
## 1 quest Seek Grail
## 2 favcolour blue
We see that the findings are conveniently presented in the data frame form. The first
column gives the matches to the first capture group. All matches can be extracted too:
y.str.extractall(r"(\w+)='(.+?)'")
## 0 1
## match
## 0 0 name Sir Launcelot
## 1 0 quest Seek Grail
## 2 0 favcolour blue
## 1 favcolour yel.. Aaargh!
Recall that if we just need the grouping part of “(...)”, i.e., without the capturing
feature, “(?:...)” can be applied.
Back-referencing (**)
Matches to capture groups can also be part of the regexes themselves. In such a context,
e.g., “\1” denotes whatever has been consumed by the first capture group.
In general, parsing HTML code with regexes is not recommended, unless it is well-
structured (which might be the case if it is generated programmatically; but we can
always use the lxml package). Despite this, let’s consider the following examples:
x = "<p><em>spam</em></p><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<p><em>spam</em>', '<code>eggs</code>']
It did not match the correct closing HTML tag. But we can make this happen by writ-
ing:
re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]
This regex guarantees that the match will include all characters between the opening
"<tag>" and the corresponding (not: any) closing "</tag>".
The five regular expressions match "spam", respectively, anywhere within the string,
at the beginning, at the end, at the beginning or end, and in strings that are equal to
the pattern itself. We can check this by calling:
pd.concat([x.str.contains(r) for r in rs], axis=1, keys=rs)
## spam ^spam spam$ spam$|^spam ^spam$
## 0 True True False True False
## 1 True False True True False
## 2 True True True True True
## 3 True False False False False
## 4 False False False False False
Exercise 14.6 Compose a regex that does the same job as str.strip.
Moreover, “(?<!...)...” and “...(?!...)” are their negated versions (negative look-
behind/ahead).
re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']
This time, we matched the words that end with neither a comma nor a dot.
14.5 Exercises
Exercise 14.7 List some ways to normalise character strings.
Exercise 14.8 (**) What are the challenges of processing non-English text?
Exercise 14.9 What are the problems with the "[A-Za-z]" and "[A-z]" character sets?
Exercise 14.10 Name the two ways to turn on case-insensitive regex matching.
Exercise 14.11 What is a word boundary?
Exercise 14.12 What is the difference between the "^" and "$" anchors?
Exercise 14.13 When would we prefer using "[0-9]" instead of "\d"?
Exercise 14.14 What is the difference between the "?", "??", "*", "*?", "+", and "+?" quan-
tifiers?
Exercise 14.15 Does "." match all the characters?
Exercise 14.16 What are named capture groups and how can we refer to the matches thereto in
re.sub?
Exercise 14.17 Write a regex that extracts all standalone numbers accepted by Python, includ-
ing 12.123, -53, +1e-9, -1.2423e10, 4. and .2.
Exercise 14.18 Author a regex that matches all email addresses.
Exercise 14.19 Indite a regex that matches all URLs starting with http:// or https://.
Exercise 14.20 Cleanse the warsaw_weather22 dataset so that it contains analysable nu-
meric data.
22 https://github.com/gagolews/teaching-data/raw/master/marek/warsaw_weather.csv
15
Missing, censored, and questionable data
Up to now, we have been mostly assuming that observations are of decent quality, i.e.,
trustworthy. It would be nice if that was always the case, but it is not.
In this chapter, we briefly address the most rudimentary methods for dealing with
suspicious observations: outliers, missing, censored, imprecise, and incorrect data.
Some of the columns bear NaN (not-a-number) values. They are used here to encode
missing (not available) data. Previously, we decided not to be bothered by them: a shy
call to dropna resulted in their removal. But we are curious now.
The reasons behind why some items are missing might be numerous, in particular:
• a participant did not know the answer to a given question;
• someone refused to answer a given question;
• a person did not take part in the study anymore (attrition, death, etc.);
• an item was not applicable (e.g., the number of minutes spent cycling weekly when someone answered that they have not learnt to ride a bike yet);
• a piece of information was not collected, e.g., due to the lack of funding or a failure
of a piece of equipment.
Looking at the column descriptions on the data provider’s website1 , for example,
BMIHEAD stands for “Head Circumference Comment”, whereas BMXHEAD is “Head Cir-
cumference (cm)”, but these were only collected for infants.
Exercise 15.1 Read the column descriptions (refer to the comments in the CSV file for the relev-
ant URLs) to identify the possible reasons for some of the records in nhanes being missing.
Exercise 15.2 Learn about the difference between the pandas.DataFrameGroupBy.size
and pandas.DataFrameGroupBy.count methods.
1 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm
2 (*) The R environment, on the other hand, supports missing values out-of-the-box.
There are versions of certain aggregation functions that ignore missing values altogether: numpy.nanmean, numpy.nanmin, numpy.nanmax, numpy.nanpercentile, numpy.nanstd, etc.
This behaviour is somewhat unfortunate: relying on these functions blindly, we might miss (sic!) the presence of missing values. Therefore, it is crucial to inspect the dataset carefully in advance.
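For illustration (a tiny made-up vector):
u = np.array([2.0, np.nan, 4.0])
print(np.mean(u), np.nanmean(u))
## nan 3.0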
Unfortunately, comparisons against missing values yield False, instead of the more
semantically valid missing value. Hence, if we want to retain the missingness informa-
tion (we do not know if a missing value is greater than 100), we need to do it manually:
y = y.astype("object") # required for numpy vectors, not for pandas Series
y[np.isnan(x)] = None
y
## 0 None
## 1 True
## 2 False
## 3 True
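To see the whole mechanism at a glance, here is a self-contained sketch on a made-up vector (the names below are hypothetical):
u = np.array([np.nan, 120.0, 90.0, 180.0])
v = u > 100  # a missing value silently compares as False
v
## array([False, True, False, True])
v = v.astype("object")
v[np.isnan(u)] = None  # reinstate the missingness information
v
## array([None, True, False, True], dtype=object)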
Exercise 15.3 Read the pandas documentation3 about missing value handling.
Important In all kinds of reports from data analysis, we need to be explicit about the
way we handle the missing values. Sometimes they might strongly affect the results.
Consider an example vector with missing values, comprised of heights of the adult
participants of the NHANES study.
x = nhanes.loc[nhanes.loc[:, "RIDAGEYR"] >= 18, "BMXHT"]
The simplest approach is to replace each missing value with the corresponding
column’s mean. This does not change the overall average but decreases the variance.
xi = x.copy()
xi[np.isnan(xi)] = np.nanmean(xi)
Similarly, we could consider replacing missing values with the median, or – in the case
of categorical data – the mode.
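For instance, a median-based imputation can be implemented analogously (a brief sketch; the commented line gives one possible mode-based variant for a hypothetical categorical Series c):
xm = x.copy()
xm[np.isnan(xm)] = np.nanmedian(xm)  # replace the missing heights with the median
# c = c.fillna(c.mode().iloc[0])  # mode imputation for a categorical Series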
1. For each numerical column, replace all missing values with the column averages.
2. For each categorical column, replace all missing values with the column modes.
3. For each numerical column, replace all missing values with the averages corresponding to a
patient’s sex (as given by the RIAGENDR column).
Note (**) Rubin (e.g., in [63]) suggests the use of a procedure called multiple imputation
(see also [94]), where copies of the original datasets are created, missing values are
imputed by sampling from some estimated distributions, the inference is made, and
then the results are aggregated. An example implementation of such an algorithm is
available in sklearn.impute.IterativeImputer.
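Here is a minimal usage sketch on a small made-up matrix (note that, with default settings, this performs a single round of model-based imputation rather than the full multiple-imputation procedure):
from sklearn.experimental import enable_iterative_imputer  # noqa: enables the estimator below
from sklearn.impute import IterativeImputer
U = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
IterativeImputer().fit_transform(U)  # the NaN is predicted from the other column; output omitted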
Many tools can assist us with identifying erroneous observations, e.g., spell checkers such as hunspell7.
For smaller datasets, observations can also be inspected manually. In other cases, we
might have to develop custom algorithms for detecting such bugs in data.
Exercise 15.6 Given some data frame with numeric columns only, perform what follows.
1. Check if all numeric values in each column are between 0 and 1000.
2. Check if all values in each column are unique.
3. Verify that all the rowwise sums add up to 1.0 (up to a small numeric error).
4. Check if the data frame consists of 0s and 1s only. Provided that this is the case, verify that
for each row, if there is a 1 in some column, then all the columns to the right are filled with 1s
too.
Many data validation methods can be reduced to operations on strings; see Chapter 14.
They may be as simple as writing a single regular expression or checking if a label is
in a dictionary of possible values but also as difficult as writing your own parser for a
custom context-sensitive grammar.
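For example (an extra illustration on made-up inputs), a single regular expression can verify whether strings follow the NN-NNN postal code format, and a simple membership test can validate labels against a dictionary of admissible values:
codes = pd.Series(["00-950", "61-701", "6-1701", "spam"])
codes.str.fullmatch(r"\d{2}-\d{3}")
## 0 True
## 1 True
## 2 False
## 3 False
## dtype: bool
codes.isin(["00-950", "61-701"])  # check against a list of known values; output omitted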
Exercise 15.7 Once we import the data fetched from dirty sources, relevant information will
have to be extracted from raw text, e.g., strings like "1" should be converted to floating-point
numbers. In the sequel, we suggest several tasks that can aid in developing data validation skills
involving some operations on text.
Given an example data frame with text columns (manually invented, please be creative), perform
what follows.
1. Remove trailing and leading whitespaces from each string.
2. Check if all strings can be interpreted as numbers, e.g., "23.43".
3. Verify if a date string in the YYYY-MM-DD format is correct.
4. Determine if a date-time string in the YYYY-MM-DD hh:mm:ss format is correct.
5. Check if all strings are of the form (+NN) NNN-NNN-NNN or (+NN) NNNN-NNN-NNN, where
N denotes any digit (valid telephone numbers).
11. (**) Using an external spell checker, ascertain that every string is a valid English noun in the
singular form.
12. (**) Resolve all abbreviations by means of a custom dictionary, e.g., "Kat." → "Kather-
ine", "Gr." → "Grzegorz".
15.4 Outliers
Another group of inspection-worthy observations consists of outliers. We can define them as the samples that reside in areas of substantially lower density than their neighbours.
Outliers might be present due to an error or because they are otherwise anomalous, but they may also simply be interesting, original, or novel. After all, statistics does not give any meaning to data items; humans do.
What we do with outliers is a separate decision. We can get rid of them, correct them,
replace them with a missing value (and then possibly impute), or analyse them separ-
ately. In particular, there is a separate subfield in statistics called extreme value the-
ory that is interested in predicting the distribution of very large observations (e.g., for
modelling floods, extreme rainfall, or temperatures); see, e.g., [6]. But this is a topic
for a more advanced course; see, e.g., [53]. For now, let’s stick with some simpler settings.
Note (*) We can choose a different threshold. For instance, for the normal distribution N(10, 1), even though the probability of observing a value greater than 15 is theoretically non-zero, it is smaller than 0.000029%, so it is sensible to treat such an observation as suspicious. On the other hand, we do not want to mark too many observations as outliers: inspecting them manually might be too labour-intensive.
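Such a tail probability can be computed directly (a quick illustrative check, assuming scipy.stats has been imported as elsewhere in this book):
scipy.stats.norm.sf(15, loc=10, scale=1)  # P(X > 15) for X following N(10, 1); circa 2.9e-07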
Exercise 15.8 For each column in nhanes_p_demo_bmx_202010 , inspect a few smallest and
largest observations and see if they make sense.
Exercise 15.9 Perform the foregoing separately for data in each group as defined by the RIA-
GENDR column.
Fixed-radius search techniques discussed in Section 8.4 can be used for estimating
the underlying probability density function. Given a data sample 𝒙 = (𝑥1 , … , 𝑥𝑛 ),
consider11 :
\hat{f}_r(z) = \frac{1}{2rn} \sum_{i=1}^{n} \mathbb{1}\bigl(|z - x_i| \le r\bigr) = \frac{|B_r(z)|}{2rn},
where |𝐵𝑟 (𝑧)| denotes the number of observations from 𝒙 whose distance to 𝑧 is not
greater than 𝑟, i.e., fall into the interval [𝑧 − 𝑟, 𝑧 + 𝑟].
n = len(x)
r = 1 # radius – feel free to play with different values
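The neighbour counts needed next can be obtained with a spatial index (a sketch assuming scipy.spatial.KDTree, by analogy with the query on z further below):
import scipy.spatial
t = scipy.spatial.KDTree(x.reshape(-1, 1))
dx = pd.Series(t.query_ball_point(x.reshape(-1, 1), r)).str.len() / (2*r*n)  # estimates at the observed points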
10 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_p_demo_bmx_2020.csv
11 This is an instance of a kernel density estimator, with the simplest kernel: a rectangular one.
Figure 15.2. With box plots, we may fail to detect some outliers.
Then, points in the sample lying in low-density regions (i.e., all 𝑥𝑖 such that 𝑓𝑟̂ (𝑥𝑖 ) is
small) can be flagged for further inspection:
x[dx < 0.001]
## array([ 0. , 13.57157922, 15. , 45. , 50. ])
See Figure 15.3 for an illustration of 𝑓𝑟̂ . Of course, 𝑟 must be chosen with care, just like
the number of bins in a histogram.
z = np.linspace(np.min(x)-5, np.max(x)+5, 1001)
dz = pd.Series(t.query_ball_point(z.reshape(-1, 1), r)).str.len() / (2*r*n)
plt.plot(z, dz, label=f"density estimator ($r={r}$)")
plt.hist(x, bins=50, color="lightgray", edgecolor="black", density=True)
plt.ylabel("Density")
plt.show()
The scatter plot in Figure 15.5 reveals that the data consist of two well-separable blobs.
There are a few observations that we might mark as outliers. The truth is that yours
truly injected eight junk points at the very end of the dataset. Ha.
X[-8:, :]
## array([[-3. , 3. ],
## [ 3. , 3. ],
## [ 3. , -3. ],
## [-3. , -3. ],
## [-3.5, 3.5],
## [-2.5, 2.5],
c[i] gives the number of points within X[i, :]’s 𝑟-radius (with respect to the Euclidean dis-
tance), including the point itself. Consequently, c[i]==1 denotes a potential outlier; see Fig-
ure 15.6 for an illustration.
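Such neighbour counts can be obtained with a fixed-radius search, e.g. (a sketch; X and the radius r are assumed to be defined as before):
import scipy.spatial
t2 = scipy.spatial.KDTree(X)
c = pd.Series(t2.query_ball_point(X, r)).str.len()  # includes the point itself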
12 (**) We can easily normalise the outputs to get a true 2D kernel density estimator, but multivariate
statistics is beyond the scope of this course. In particular, that data might have fixed marginal distributions
(projections onto 1D) but their multidimensional images might be very different is beautifully described by
the copula theory [70].
Figure 15.6. Outlier detection based on a fixed-radius search for the blobs1 dataset.
15.5 Exercises
Exercise 15.12 How can missing values be represented in numpy and pandas?
Exercise 15.13 Explain some basic strategies for dealing with missing values in numeric vec-
tors.
Exercise 15.14 Why ought we to be very explicit about the way we handle missing and other
suspicious data? Is it advisable to mark as missing (or remove completely) the observations that
we dislike or otherwise deem inappropriate, controversial, dangerous, incompatible with
our political views, etc.?
Exercise 15.15 Is replacing missing values with the sample arithmetic mean for income data
(as in, e.g., the uk_income_simulated_202013 dataset) a sensible strategy?
Exercise 15.16 What are the differences between data missing completely at random, missing
at random, and missing not at random?
13 https://github.com/gagolews/teaching-data/raw/master/marek/uk_income_simulated_2020.txt
Exercise 15.17 List some strategies for dealing with data that might contain outliers.
16
Time series
So far, we have been using numpy and pandas mostly for storing:
• independent measurements, where each row gives, e.g., weight, height, … records
of a different subject; we often consider these a sample of a representative subset
of one or more populations, each recorded at a particular point in time;
• data summaries to be reported in the form of tables or figures, e.g., frequency
distributions giving counts for the corresponding categories or labels.
In this chapter, we will explore the most basic concepts related to the wrangling of
time series, i.e., signals indexed by discrete time. Usually, a time series is a sequence of
measurements sampled at equally spaced moments, e.g., a patient’s heart rate probed
every second, daily average currency exchange rates, or highest yearly temperatures
recorded in some location.
Here are some data aggregates for the whole sample. First, the popular quantiles:
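For instance (a sketch; the printout is omitted here):
np.quantile(temps, [0, 0.25, 0.5, 0.75, 1])  # min, quartiles, max; or numpy.nanquantile if missing values are present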
1 Note that midrange, being the mean of the lowest and the highest observed temperature on a given day,
is not a particularly good estimate of the average daily reading. This dataset is considered for illustrational
purposes only.
Figure 16.1. Distribution of the midrange daily temperatures in Spokane in the period
1889–2021. Observations are treated as a bag of unrelated items (temperature on a
“randomly chosen day” in a version of planet Earth where there is no climate change).
When computing data aggregates or plotting histograms, the order of elements does
not matter. Contrary to the case of the independent measurements, vectors represent-
ing time series do not have to be treated simply as mixed bags of unrelated items.
Important In time series, for any given item 𝑥𝑖 , its neighbouring elements 𝑥𝑖−1 and
𝑥𝑖+1 denote the recordings occurring directly before and after it. We can use this tem-
poral ordering to model how consecutive measurements depend on each other, describe
how they change over time, forecast future values, detect seasonal and long-term trends, and so forth.
Figure 16.2 depicts the data for 2021, plotted as a function of time. What we see is of-
ten referred to as a line chart (line graph): data points are connected by straight line
segments. There are some visible seasonal variations, such as, well, obviously, that
winter is colder than summer. There is also some natural variability on top of seasonal
patterns typical for the Northern Hemisphere.
plt.plot(temps[-365:])
plt.xticks([0, 181, 364], ["2021-01-01", "2021-07-01", "2021-12-31"])
plt.show()
Figure 16.2. Line chart of midrange daily temperatures in Spokane for 2021.
2 https://numpy.org/doc/stable/reference/arrays.datetime.html
spokane = pd.DataFrame(dict(
date=np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]"),
temp=temps
))
spokane.head()
## date temp
## 0 1889-08-01 21.1
## 1 1889-08-02 20.8
## 2 1889-08-03 22.2
## 3 1889-08-04 21.7
## 4 1889-08-05 18.3
When we ask the date column to become the data frame’s index (i.e., its row labels), we will be able to select date ranges easily with loc[...] and string slices (refer to the manual of pandas.DatetimeIndex for more details).
spokane.set_index("date").loc["2021-12-25":, :].reset_index()
## date temp
## 0 2021-12-25 -1.4
## 1 2021-12-26 -5.0
## 2 2021-12-27 -9.4
## 3 2021-12-28 -12.8
## 4 2021-12-29 -12.2
## 5 2021-12-30 -11.4
## 6 2021-12-31 -11.4
Example 16.2 Figure 16.3 depicts the temperatures in the last five years:
x = spokane.set_index("date").loc["2017-01-01":, "temp"].reset_index()
plt.plot(x.date, x.temp)
plt.show()
The pandas.to_datetime function can also convert arbitrarily formatted date strings,
e.g., "MM/DD/YYYY" or "DD.MM.YYYY" to Series of datetime64s.
dates = ["05.04.1991", "14.07.2022", "21.12.2042"]
dates = pd.Series(pd.to_datetime(dates, format="%d.%m.%Y"))
dates
## 0 1991-04-05
## 1 2022-07-14
## 2 2042-12-21
## dtype: datetime64[ns]
Exercise 16.3 From the birth_dates3 dataset, select all people less than 18 years old (as of the
current day).
Several date-time functions and related properties can be referred to via the pandas.
Series.dt accessor, which is similar to pandas.Series.str discussed in Chapter 14.
3 https://github.com/gagolews/teaching-data/raw/master/marek/birth_dates.csv
Figure 16.3. Line chart of midrange daily temperatures in Spokane for 2017–2021.
For instance, converting date-time objects to strings following custom format spe-
cifiers can be performed with:
dates.dt.strftime("%d.%m.%Y")
## 0 05.04.1991
## 1 14.07.2022
## 2 21.12.2042
## dtype: object
We can also extract different date or time fields, such as date, time, year, month, day,
dayofyear, hour, minute, second, etc. For example:
dates_ymd = pd.DataFrame(dict(
year = dates.dt.year,
month = dates.dt.month,
day = dates.dt.day
))
dates_ymd
## year month day
## 0 1991 4 5
## 1 2022 7 14
## 2 2042 12 21
The other way around, we should note that pandas.to_datetime can convert data
frames with columns named year, month, day, etc., to date-time objects:
pd.to_datetime(dates_ymd)
## 0 1991-04-05
Example 16.4 Let’s extract the month and year parts of dates to compute the average monthly
temperatures in the last 50 or so years:
x = spokane.set_index("date").loc["1970":, ].reset_index()
mean_monthly_temps = x.groupby([
x.date.dt.year.rename("year"),
x.date.dt.month.rename("month")
]).temp.mean().unstack()
mean_monthly_temps.head().round(1) # preview
## month 1 2 3 4 5 6 7 8 9 10 11 12
## year
## 1970 -3.4 2.3 2.8 5.3 12.7 19.0 22.5 21.2 12.3 7.2 2.2 -2.4
## 1971 -0.1 0.8 1.7 7.4 13.5 14.6 21.0 23.4 12.9 6.8 1.9 -3.5
## 1972 -5.2 -0.7 5.2 5.6 13.8 16.6 20.0 21.7 13.0 8.4 3.5 -3.7
## 1973 -2.8 1.6 5.0 7.8 13.6 16.7 21.8 20.6 15.4 8.4 0.9 0.7
## 1974 -4.4 1.8 3.6 8.0 10.1 18.9 19.9 20.1 15.8 8.9 2.4 -0.8
Figure 16.4 depicts these data on a heat map. We rediscover the ultimate truth that winters are
cold, whereas in the summertime the living is easy, what a wonderful world.
sns.heatmap(mean_monthly_temps)
plt.show()
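The deltas discussed next can be obtained with numpy.diff; for instance, on the last week of data (a sketch of the presumed setup):
x = spokane.temp.to_numpy()[-7:]  # the last seven daily readings
d = np.diff(x)
d
## array([-3.6, -4.4, -3.4, 0.6, 0.8, 0. ])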
For instance, between the first and the second day of the last week, the midrange temperature dropped by 3.6°C.
The other way around, here are the cumulative sums of the deltas:
np.cumsum(d)
## array([ -3.6, -8. , -11.4, -10.8, -10. , -10. ])
This turned the deltas back into a shifted version of the original series. But we will need the first (root) observation therefrom to restore the dataset in full:
x[0] + np.append(0, np.cumsum(d))
## array([ -1.4, -5. , -9.4, -12.8, -12.2, -11.4, -11.4])
The family of exponential distributions Exp(𝑠) is identified by the scale parameter 𝑠 > 0, which is at the same time its expected value. The probability density function of Exp(𝑠) is given for 𝑥 ≥ 0 by:
f(x) = \frac{1}{s}\, e^{-x/s},
and 𝑓(𝑥) = 0 otherwise. We need to be careful: some textbooks parametrise this distribution by the rate 𝜆 = 1/𝑠 instead of the scale 𝑠; scipy relies on the scale parametrisation (the scale argument below corresponds to 𝑠).
Here is a pseudorandom sample where there are five events per minute on average:
np.random.seed(123)
l = 60/5 # 5 events per 60 seconds on average
d = scipy.stats.expon.rvs(size=1200, scale=l)
np.round(d[:8], 3) # preview
## array([14.307, 4.045, 3.087, 9.617, 15.253, 6.601, 47.412, 13.856])
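We can verify the average inter-event time directly (a quick check; with a sample of this size, it comes out close to the true scale):
np.mean(d)  # should be close to s = 12 seconds; output omitted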
The result is close to what we expected, i.e., 𝑠 = 12 seconds between the events.
We can convert the current sample to date-times (starting at a fixed calendar date) as follows. Note
that we will measure the deltas in milliseconds so that we do not lose precision; datetime64 is
based on integers, not floating-point numbers.
t0 = np.array("2022-01-01T00:00:00", dtype="datetime64[ms]")
d_ms = np.round(d*1000).astype(int) # in milliseconds
t = t0 + np.array(np.cumsum(d_ms), dtype="timedelta64[ms]")
t[:8] # preview
## array(['2022-01-01T00:00:14.307', '2022-01-01T00:00:18.352',
## '2022-01-01T00:00:21.439', '2022-01-01T00:00:31.056',
## '2022-01-01T00:00:46.309', '2022-01-01T00:00:52.910',
## '2022-01-01T00:01:40.322', '2022-01-01T00:01:54.178'],
## dtype='datetime64[ms]')
t[-2:] # preview
## array(['2022-01-01T03:56:45.312', '2022-01-01T03:56:47.890'],
## dtype='datetime64[ms]')
As an exercise, let’s apply binning and count how many events occur in each hour:
b = np.arange( # four 1-hour intervals (five time points)
"2022-01-01T00:00:00", "2022-01-01T05:00:00",
1000*60*60, # number of milliseconds in 1 hour
dtype="datetime64[ms]"
)
np.histogram(t, bins=b)[0]
## array([305, 300, 274, 321])
We expect five events per minute, i.e., 300 of them per hour. On a side note, from a course in statistics we know that for exponential inter-event times, the number of events per unit of time follows a Poisson distribution.
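As a rough sanity check (an added illustration), we can also query the central 95% range of the Poisson distribution with expected value 300; all four hourly counts above fall comfortably within it:
scipy.stats.poisson.ppf([0.025, 0.975], 300)  # approximately [266, 334]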
Exercise 16.7 (*) Consider the wait_times5 dataset that gives the times between some consec-
utive events, in seconds. Estimate the event rate per hour. Draw a histogram representing the
number of events per hour.
Exercise 16.8 (*) Consider the btcusd_ohlcv_2021_dates6 dataset which gives the daily
BTC/USD exchange rates in 2021:
btc = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/btcusd_ohlcv_2021_dates.csv",
comment="#").loc[:, ["Date", "Close"]]
btc["Date"] = btc["Date"].astype("datetime64[s]")
btc.head(12)
## Date Close
## 0 2021-01-01 29374.152
## 1 2021-01-02 32127.268
## 2 2021-01-03 32782.023
## 3 2021-01-04 31971.914
## 4 2021-01-05 33992.430
## 5 2021-01-06 36824.363
## 6 2021-01-07 39371.043
## 7 2021-01-08 40797.609
## 8 2021-01-09 40254.547
## 9 2021-01-10 38356.441
## 10 2021-01-11 35566.656
## 11 2021-01-12 33922.961
Author a function that converts it to a lagged representation, being a convenient form for some
machine learning algorithms.
1. Add the Change column that gives by how much the price changed since the previous day.
2. Add the Dir column indicating if the change was positive or negative.
3. Add the Lag1, …, Lag5 columns which give the Changes in the five preceding days.
The first few rows of the resulting data frame should look like this (assuming we do not want any
missing values):
## Date Close Change Dir Lag1 Lag2 Lag3 Lag4 Lag5
## 2021-01-07 39371 2546.68 inc 2831.93 2020.52 -810.11 654.76 2753.12
## 2021-01-08 40798 1426.57 inc 2546.68 2831.93 2020.52 -810.11 654.76
## 2021-01-09 40255 -543.06 dec 1426.57 2546.68 2831.93 2020.52 -810.11
## 2021-01-10 38356 -1898.11 dec -543.06 1426.57 2546.68 2831.93 2020.52
## 2021-01-11 35567 -2789.78 dec -1898.11 -543.06 1426.57 2546.68 2831.93
## 2021-01-12 33923 -1643.69 dec -2789.78 -1898.11 -543.06 1426.57 2546.68
5 https://github.com/gagolews/teaching-data/raw/master/marek/wait_times.txt
6 https://github.com/gagolews/teaching-data/raw/master/marek/btcusd_ohlcv_2021_dates.csv
In the sixth row (representing 2021-01-12), Lag1 corresponds to Change on 2021-01-11, Lag2
gives the Change on 2021-01-10, and so forth.
To spice things up, make sure your code can generate any number (as defined by another para-
meter to the function) of lagged variables.
y_i = \frac{1}{k}\left(x_i + x_{i+1} + \dots + x_{i+k-1}\right) = \frac{1}{k}\sum_{j=1}^{k} x_{i+j-1},
For example, here are the temperatures in the last seven days of December 2021:
x = spokane.set_index("date").iloc[-7:, :]
x
## temp
## date
## 2021-12-25 -1.4
## 2021-12-26 -5.0
## 2021-12-27 -9.4
## 2021-12-28 -12.8
## 2021-12-29 -12.2
## 2021-12-30 -11.4
## 2021-12-31 -11.4
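The 3-moving average of this series can be computed with pandas’ rolling window aggregates (a sketch; it yields the values described below):
x.rolling(3, center=True).mean().round(2)
## temp
## date
## 2021-12-25 NaN
## 2021-12-26 -5.27
## 2021-12-27 -9.07
## 2021-12-28 -11.47
## 2021-12-29 -12.13
## 2021-12-30 -11.67
## 2021-12-31 NaN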
We get, in this order: the mean of the first three observations; the mean of the second,
third, and fourth items; then the mean of the third, fourth, and fifth; and so forth. No-
tice that the observations were centred in such a way that we have the same number of
missing values at the start and end of the series. This way, we treat the first three-day
moving average (the average of the temperatures on the first three days) as represent-
ative of the second day.
And now for something completely different, the 5-moving average:
x.rolling(5, center=True).mean().round(2)
## temp
## date
## 2021-12-25 NaN
## 2021-12-26 NaN
## 2021-12-27 -8.16
## 2021-12-28 -10.16
## 2021-12-29 -11.44
## 2021-12-30 NaN
## 2021-12-31 NaN
Applying the moving average has the nice effect of smoothing out all kinds of broadly-
conceived noise. To illustrate this, compare the temperature data for the last five years
in Figure 16.3 to their averaged versions in Figure 16.5.
x = spokane.set_index("date").loc["2017-01-01":, "temp"]
x30 = x.rolling(30, center=True).mean()
x100 = x.rolling(100, center=True).mean()
plt.plot(x30, label="30-day moving average")
plt.plot(x100, "r--", label="100-day moving average")
plt.legend()
plt.show()
Exercise 16.9 (*) Other aggregation functions can be applied in rolling windows as well. Draw,
in the same figure, the plots of the one-year moving minimums, medians, and maximums.
Seasonal patterns can be revealed by smoothing out the detrended version of the data, e.g., using a one-year moving average:
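One possible implementation (a sketch only; not necessarily the exact code behind Figure 16.6) subtracts a one-year rolling mean, which serves as the trend, and inspects what remains:
x = spokane.set_index("date").loc["2017-01-01":, "temp"]
trend = x.rolling(365, center=True).mean()  # a one-year moving average
detrended = x - trend                       # mostly the seasonal component (plus noise)
plt.plot(trend, label="trend")
plt.plot(detrended, label="detrended")
plt.legend()
plt.show()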
Figure 16.5. Line chart of 30- and 100-moving averages of the midrange daily temper-
atures in Spokane for 2017–2021.
Also, if we know the length of the seasonal pattern (in our case, 365-ish days), we can
draw a seasonal plot, where we have a separate curve for each season (here: year) and
where all the series share the same x-axis (here: the day of the year); see Figure 16.7.
cmap = plt.colormaps.get_cmap("coolwarm")
x = spokane.set_index("date").loc["1970-01-01":, :].reset_index()
for year in range(1970, 2022, 5): # selected years only
y = x.loc[x.date.dt.year == year, :]
plt.plot(y.date.dt.dayofyear, y.temp,
c=cmap((year-1970)/(2021-1970)), alpha=0.3,
label=year if year % 10 == 0 else None)
avex = x.temp.groupby(x.date.dt.dayofyear).mean()
plt.plot(avex.index, avex, "g-", label="Average") # all years
plt.legend()
plt.xlabel("Day of year")
plt.ylabel("Temperature")
plt.show()
Figure 16.6. Trend and seasonal pattern for the Spokane temperatures in recent years.
Figure 16.7. Seasonal plot: temperatures in Spokane vs the day of the year for 1970–
2021.
Exercise 16.10 Draw a similar plot for the whole data range, i.e., 1889–2021.
Exercise 16.11 Try using pd.Series.dt.strftime with a custom formatter instead of pd.
Series.dt.dayofyear.
1. Based on the hourly observations, compute the daily mean PM2.5 measurements for Mel-
bourne CBD and Morwell South.
For Melbourne CBD, if some hourly measurement is missing, linearly interpolate between
the preceding and following non-missing data, e.g., a PM2.5 sequence of [..., 10, NaN,
NaN, 40, ...] (you need to manually add the NaN rows to the dataset) should be trans-
formed to [..., 10, 20, 30, 40, ...].
For Morwell South, impute the readings with the averages of the records in the nearest air
quality stations, which are located in Morwell East, Moe, Churchill, and Traralgon.
2. Present the daily mean PM2.5 measurements for Melbourne CBD and Morwell South on a
single plot. The x-axis labels should be human-readable and intuitive.
3. For the Melbourne data, determine the number of days where the average PM2.5 was greater
than in the preceding day.
4. Find five most air-polluted days for Melbourne.
This gives EUR/AUD (how many Australian Dollars we pay for 1 Euro), EUR/CNY
(Chinese Yuans), EUR/GBP (British Pounds), and EUR/PLN (Polish Złotys), in this or-
der. Let’s draw the four time series; see Figure 16.8.
dates = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/euraud-20200101-20200630-dates.txt",
dtype="datetime64[s]")
labels = ["AUD", "CNY", "GBP", "PLN"]
styles = ["solid", "dotted", "dashed", "dashdot"]
for i in range(eurxxx.shape[1]):
plt.plot(dates, eurxxx[:, i], ls=styles[i], label=labels[i])
plt.legend(loc="upper right", bbox_to_anchor=(1, 0.9)) # a bit lower
plt.show()
Figure 16.8. EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates in the first
half of 2020.
Unfortunately, they are all on different scales. This is why the plot is not necessarily
readable. It would be better to draw these time series on four separate plots (compare
the trellis plots in Section 12.2.5).
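For instance (one possible arrangement; an added sketch):
fig, axes = plt.subplots(2, 2, sharex=True)
for i, ax in enumerate(axes.ravel()):
    ax.plot(dates, eurxxx[:, i])
    ax.set_title(labels[i])
plt.show()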
Another idea is to depict the currency exchange rates relative to the prices on some day,
say, the first one; see Figure 16.9.
for i in range(eurxxx.shape[1]):
plt.plot(dates, eurxxx[:, i]/eurxxx[0, i],
ls=styles[i], label=labels[i])
plt.legend()
plt.show()
This way, e.g., a relative EUR/AUD rate of c. 1.15 in mid-March means that if an Aussie
bought some Euros on the first day, and then sold them three-ish months later, they
would have 15% more wealth (the Euro became 15% stronger relative to the AUD).
Exercise 16.14 Based on the EUR/AUD and EUR/PLN records, compute and plot the AUD/PLN
as well as PLN/AUD rates.
Exercise 16.15 (*) Draw the EUR/AUD and EUR/GBP rates on a single plot, but where each
series has its own9 y-axis.
Exercise 16.16 (*) Draw the EUR/xxx rates for your favourite currencies over a larger period.
9 https://matplotlib.org/stable/gallery/subplots_axes_and_figures/secondary_axis.html
Figure 16.9. EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates relative to
the prices on the first day.
Use data10 downloaded from the European Central Bank. Add a few moving averages. For each
year, identify the lowest and the highest rate.
This gives the open, high, low, and close (OHLC) prices on the 365 consecutive days,
which is a common way to summarise daily rates.
The mplfinance11 (matplotlib-finance) package defines a few functions related to
the plotting of financial data. Let’s briefly describe the well-known candlestick plot.
10 https://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/
html/index.en.html
11 https://github.com/matplotlib/mplfinance
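A minimal usage sketch (assuming a data frame, say btcusd_ohlc, indexed by date and featuring the Open, High, Low, and Close columns; this is not necessarily the exact call that generated Figure 16.10):
import mplfinance as mpf
mpf.plot(btcusd_ohlc.loc["2021-01", :], type="candle")  # January 2021 only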
Figure 16.10. A candlestick plot for the BTC/USD exchange rates in January 2021.
Figure 16.10 depicts the January 2021 data. Let’s stress that it is not a box and whisker
plot. The candlestick body denotes the difference in the market opening and the clos-
ing price. The wicks (shadows) give the range (high to low). White candlesticks rep-
resent bullish days – where the closing rate is greater than the opening one (uptrend).
Black candles are bearish (decline).
Exercise 16.17 Draw the BTC/USD rates for the entire year and add the 10-day moving aver-
ages.
Exercise 16.18 (*) Draw a candlestick plot manually, without using the mplfinance package.
Hint: matplotlib.pyplot.fill might be helpful.
Signals such as audio, images, and video are different because structured randomness
does not play a dominant role there (unless it is a noise that we would like to filter out).
Instead, more interesting are the patterns occurring in the frequency (think: perceiv-
ing pitches when listening to music) or spatial (seeing green grass and sky in a photo)
domain.
Signal processing thus requires a distinct set of tools, e.g., Fourier analysis and finite
impulse response (discrete convolution) filters. This course obviously cannot be about
everything (also because it requires some more advanced calculus skills that we did
not assume the reader to have at this time); but see, e.g., [87, 89].
Nevertheless, keep in mind that these are not completely independent domains. For
example, we can extract various features of audio signals (e.g., overall loudness,
timbre, and danceability of each recording in a large song database) and then treat
them as tabular data to be analysed using the techniques described in this course.
Moreover, machine learning (e.g., convolutional neural networks) algorithms may
also be used for tasks such as object detection on images or optical character recog-
nition; see, e.g., [44].
16.5 Exercises
Exercise 16.20 Assume we have a time series with 𝑛 observations. What is a 1- and an 𝑛-
moving average? Which one is smoother, a (0.01𝑛)- or a (0.1𝑛)- one?
Important Any bug/typo reports/fixes12 are appreciated. The most up-to-date version
of this book can be found at https://datawranglingpy.gagolewski.com/.
– Bug fixes.
– Minor extensions, including: pandas.Series.dt.strftime, more details
how to avoid pitfalls in data frame indexing, etc.
• 2022-08-24 (v1.0.2):
– The first printed (paperback) version can be ordered from Amazon13 .
– Fixed page margin and header sizes.
– Minor typesetting and other fixes.
• 2022-08-12 (v1.0.1):
– Cover.
– ISBN 978-0-6455719-1-2 assigned.
• 2022-07-16 (v1.0.0):
– Preface complete.
– Handling tied observations.
– Plots now look better when printed in black and white.
– Exception handling.
– File connections.
– Other minor extensions and material reordering: more aggregation func-
tions, pandas.unique, pandas.factorize, probability vectors representing
binary categorical variables, etc.
– Final proofreading and copyediting.
• 2022-06-13 (v0.5.1):
– The Kolmogorov–Smirnov Test (one and two sample).
– The Pearson Chi-Squared Test (one and two sample and for independence).
– Dealing with round-off and measurement errors.
– Adding white noise (jitter).
– Lambda expressions.
– Matrices are iterable.
• 2022-05-31 (v0.4.1):
– The Rules.
– Matrix multiplication, dot products.
– Euclidean distance, few-nearest-neighbour and fixed-radius search.
13 https://www.amazon.com/dp/0645571911
• 2022-01-05 (v0.0.0):
– Project started.
References
[1] Abramowitz, M. and Stegun, I.A., editors. (1972). Handbook of Mathematical Func-
tions with Formulas, Graphs, and Mathematical Tables. Dover Publications. URL: https:
//personal.math.ubc.ca/~cbm/aands/intro.htm.
[2] Aggarwal, C.C. (2015). Data Mining: The Textbook. Springer.
[3] Arnold, B.C. (2015). Pareto Distributions. Chapman and Hall/CRC. DOI:
10.1201/b18141.
[4] Arnold, T.B. and Emerson, J.W. (2011). Nonparametric goodness-of-fit tests for
discrete null distributions. The R Journal, 3(2):34–39. DOI: 10.32614/RJ-2011-016.
[5] Bartoszyński, R. and Niewiadomska-Bugaj, M. (2007). Probability and Statistical In-
ference. Wiley.
[6] Beirlant, J., Goegebeur, Y., Teugels, J., and Segers, J. (2004). Statistics of Extremes:
Theory and Applications. Wiley. DOI: 10.1002/0470012382.
[7] Benaglia, T., Chauveau, D., Hunter, D.R., and Young, D.S. (2009). Mixtools: An
R package for analyzing mixture models. Journal of Statistical Software, 32(6):1–29.
DOI: 10.18637/jss.v032.i06.
[8] Bezdek, J.C., Ehrlich, R., and Full, W. (1984). FCM: The fuzzy c-means clus-
tering algorithm. Computer and Geosciences, 10(2–3):191–203. DOI: 10.1016/0098-
3004(84)90020-7.
[9] Billingsley, P. (1995). Probability and Measure. John Wiley & Sons.
[10] Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag. URL:
https://www.microsoft.com/en-us/research/people/cmbishop.
[11] Blum, A., Hopcroft, J., and Kannan, R. (2020). Foundations of Data Science. Cam-
bridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.
[12] Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. Journal of the
Royal Statistical Society. Series B (Methodological), 26(2):211–252.
[13] Bullen, P.S. (2003). Handbook of Means and Their Inequalities. Springer Sci-
ence+Business Media.
[14] Campello, R.J.G.B., Moulavi, D., Zimek, A., and Sander, J. (2015). Hierarchical
density estimates for data clustering, visualization, and outlier detection. ACM
Transactions on Knowledge Discovery from Data, 10(1):5:1–5:51. DOI: 10.1145/2733381.
[15] Chambers, J.M. and Hastie, T. (1991). Statistical Models in S. Wadsworth &
Brooks/Cole.
[16] Clauset, A., Shalizi, C.R., and Newman, M.E.J. (2009). Power-law distributions
in empirical data. SIAM Review, 51(4):661–703. DOI: 10.1137/070710111.
[17] Connolly, T. and Begg, C. (2015). Database Systems: A Practical Approach to Design,
Implementation, and Management. Pearson.
[18] Conover, W.J. (1972). A Kolmogorov goodness-of-fit test for discontinuous
distributions. Journal of the American Statistical Association, 67(339):591–596. DOI:
10.1080/01621459.1972.10481254.
[19] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.
URL: https://archive.org/details/in.ernet.dli.2015.223699.
[20] Dasu, T. and Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John
Wiley & Sons.
[21] Date, C.J. (2003). An Introduction to Database Systems. Pearson.
[22] Deisenroth, M.P., Faisal, A.A., and Ong, C.S. (2020). Mathematics for Machine
Learning. Cambridge University Press. URL: https://mml-book.github.io/.
[23] Dekking, F.M., Kraaikamp, C., Lopuhaä, H.P., and Meester, L.E. (2005). A Modern
Introduction to Probability and Statistics: Understanding Why and How. Springer.
[24] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recog-
nition. Springer. DOI: 10.1007/978-1-4612-0711-5.
[25] Deza, M.M. and Deza, E. (2014). Encyclopedia of Distances. Springer.
[26] Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evid-
ence, and Data Science. Cambridge University Press.
[27] Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996). A density-based algorithm
for discovering clusters in large spatial databases with noise. In: Proc. KDD'96, pp.
226–231.
[28] Feller, W. (1950). An Introduction to Probability Theory and Its Applications: Volume I.
Wiley.
[29] Forbes, C., Evans, M., Hastings, N., and Peacock, B. (2010). Statistical Distribu-
tions. Wiley.
[30] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L₂
theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57:453–476.
[31] Friedl, J.E.F. (2006). Mastering Regular Expressions. O'Reilly.
[32] Gagolewski, M. (2015). Data Fusion: Theory, Methods, and Applications. Institute of
Computer Science, Polish Academy of Sciences. DOI: 10.5281/zenodo.6960306.
[51] Higham, N.J. (2002). Accuracy and Stability of Numerical Algorithms. SIAM. DOI:
10.1137/1.9780898718027.
[52] Hopcroft, J.E. and Ullman, J.D. (1979). Introduction to Automata Theory, Languages,
and Computation. Addison-Wesley.
[53] Huber, P.J. and Ronchetti, E.M. (2009). Robust Statistics. Wiley.
[54] Hunter, J.D. (2007). Matplotlib: A 2D graphics environment. Computing in Science
& Engineering, 9(3):90–95.
[55] Hyndman, R.J. and Athanasopoulos, G. (2021). Forecasting: Principles and Practice.
OTexts. URL: https://otexts.com/fpp3.
[56] Hyndman, R.J. and Fan, Y. (1996). Sample quantiles in statistical packages. Amer-
ican Statistician, 50(4):361–365. DOI: 10.2307/2684934.
[57] Kleene, S.C. (1951). Representation of events in nerve nets and finite automata.
Technical Report RM-704, The RAND Corporation, Santa Monica, CA. URL:
https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/
RM704.pdf.
[58] Knuth, D.E. (1992). Literate Programming. CSLI.
[59] Knuth, D.E. (1997). The Art of Computer Programming II: Seminumerical Algorithms.
Addison-Wesley.
[60] Kuchling, A.M. (2023). Regular Expression HOWTO. URL: https://docs.python.
org/3/howto/regex.html.
[61] Lee, J. (2011). A First Course in Combinatorial Optimisation. Cambridge University
Press.
[62] Ling, R.F. (1973). A probability theory of cluster analysis. Journal of the American
Statistical Association, 68(341):159–164. DOI: 10.1080/01621459.1973.10481356.
[63] Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. John
Wiley & Sons.
[64] Lloyd, S.P. (1957 (1982)). Least squares quantization in PCM. IEEE Transactions
on Information Theory, 28:128–137. Originally a 1957 Bell Telephone Laboratories Re-
search Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.
[65] Matloff, N.S. (2011). The Art of R Programming: A Tour of Statistical Software Design.
No Starch Press.
[66] McKinney, W. (2022). Python for Data Analysis. O'Reilly. URL: https:
//wesmckinney.com/book.
[67] Modarres, M., Kaminskiy, M.P., and Krivtsov, V. (2016). Reliability Engineering and
Risk Analysis: A Practical Guide. CRC Press.
[68] Monahan, J.F. (2011). Numerical Methods of Statistics. Cambridge University Press.
[87] Smith, S.W. (2002). The Scientist and Engineer's Guide to Digital Signal Processing.
Newnes. URL: https://www.dspguide.com/.
[88] Spicer, A. (2018). Business Bullshit. Routledge.
[89] Steiglitz, K. (1996). A Digital Signal Processing Primer: With Applications to Digital Au-
dio and Computer Music. Pearson.
[90] Tijms, H.C. (2003). A First Course in Stochastic Models. Wiley.
[91] Tufte, E.R. (2001). The Visual Display of Quantitative Information. Graphics Press.
[92] Tukey, J.W. (1962). The future of data analysis. Annals of Mathematical Statist-
ics, 33(1):1–67. URL: https://projecteuclid.org/journalArticle/Download?urlId=10.
1214%2Faoms%2F1177704711, DOI: 10.1214/aoms/1177704711.
[93] Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
[94] van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press. URL: https:
//stefvanbuuren.name/fimd.
[95] van der Loo, M. and de Jonge, E. (2018). Statistical Data Cleaning with Applications
in R. John Wiley & Sons.
[96] Venables, W.N., Smith, D.M., and R Core Team. (2025). An Introduction to R. URL:
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf.
[97] Virtanen, P. and others. (2020). SciPy 1.0: Fundamental algorithms for scientific
computing in Python. Nature Methods, 17:261–272. DOI: 10.1038/s41592-019-0686-
2.
[98] Wainer, H. (1997). Visual Revelations: Graphical Tales of Fate and Deception from Napo-
leon Bonaparte to Ross Perot. Copernicus.
[99] Waskom, M.L. (2021). seaborn: Statistical data visualization. Journal of Open
Source Software, 6(60):3021. DOI: 10.21105/joss.03021.
[100] Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal
of Statistical Software, 40(1):1–29. DOI: 10.18637/jss.v040.i01.
[101] Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10):1–23. DOI:
10.18637/jss.v059.i10.
[102] Wickham, H., Çetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Sci-
ence. O'Reilly. URL: https://r4ds.hadley.nz/.
[103] Wierzchoń, S.T. and Kłopotek, M.A. (2018). Modern Algorithms for Cluster Analysis.
Springer. DOI: 10.1007/978-3-319-69308-8.
[104] Wilson, G. and others. (2014). Best practices for scientific computing. PLOS Bio-
logy, 12(1):1–7. DOI: 10.1371/journal.pbio.1001745.
[105] Wilson, G. and others. (2017). Good enough practices in scientific computing.
PLOS Computational Biology, 13(6):1–20. DOI: 10.1371/journal.pcbi.1005510.
[106] Xie, Y. (2015). Dynamic Documents with R and knitr. Chapman and Hall/CRC.