Introduction To Data Visualization in Python
Introduction
1 Data preparation
1.1 Importing data
1.1.1 Text files
1.1.2 Excel spreadsheets
1.1.3 Statistical packages
1.1.4 Databases
1.2 Cleaning data
1.2.1 Selecting variables
1.2.2 Selecting observations
1.2.3 Creating/recoding variables
1.2.4 Summarizing data
1.2.5 Using pipes
1.2.6 Reshaping data
1.2.7 Missing data
2 Introduction to plotnine
2.1 A worked example
2.1.1 ggplot
2.1.2 geoms
2.1.3 grouping
2.1.4 scales
2.1.5 facets
2.1.6 labels
2.1.7 themes
2.2 Placing the data and mapping options
2.3 Graphs as objects
3 Univariate Graphs
3.1 Categorical
3.1.1 Bar chart
3.1.2 Pie chart
3.1.3 Tree map
3.2 Quantitative
3.2.1 Histogram
3.2.2 Kernel Density plot
3.2.3 Dot Chart
4 Bivariate Graphs
4.1 Categorical vs. Categorical
4.1.1 Stacked bar chart
4.1.2 Grouped bar chart
4.1.3 Segmented bar chart
4.1.4 Improving the color and labeling
4.2 Quantitative vs. Quantitative
4.2.1 Scatterplot
4.2.2 Line plot
4.3 Categorical vs. Quantitative
4.3.1 Bar chart (on summary statistics)
4.3.2 Grouped kernel density plots
4.3.3 Box plots
4.3.4 Violin plots
4.3.5 Ridgeline plots
4.3.6 Mean/SEM plots
4.3.7 Strip plots
4.3.8 Beeswarm plots
4.3.9 Cleveland Dot Charts
5 Multivariate Graphs
5.1 Grouping
5.2 Faceting
6.2 Dumbbell charts
6.3 Slope graphs
6.4 Area Charts
7 Statistical Models
7.1 Correlation plots
7.2 Linear Regression
7.3 Logistic regression
7.3.1 Setting a reference or base level for categorical variables
7.4 Survival plots
7.4.1 The Veterans’ Administration Lung Cancer Trial
7.4.2 Survival Data
7.4.3 The Survival Function
7.4.4 Considering other variables by stratification
7.4.5 Survival functions by cell type
7.4.6 Multivariate Survival Models
7.5 Mosaic plots
8 Other Graphs
8.1 3-D scatterplot
8.2 Biplots
8.3 Bubble charts
8.4 Flow diagrams
8.4.1 Sankey diagrams
8.4.2 What is a Sankey Diagram?
8.4.3 Alluvial diagrams
8.5 Heatmaps
8.6 Radar charts
8.7 Scatterplot matrix
8.8 Waterfall charts
8.9 Word clouds
9.2.1 Specifying colors manually
9.2.2 Color palettes
9.3 Points & Lines
9.3.1 Points
9.3.2 Lines
9.4 Fonts
9.5 Legends
9.5.1 Legend location
9.5.2 Legend title
9.6 Labels
9.7 Annotations
9.7.1 Adding text
9.7.2 Adding lines
9.8 Themes
9.8.1 Altering theme elements
Conclusion
References
Introduction
This book is based on Data visualization with R by Rob Kabacoff, who used the R platform to introduce data visualization. His book is available at https://rkabacoff.github.io/datavis/index.html.
You don’t need to read this book from start to finish in order to start building effective graphs.
Feel free to jump to the section that you need and then explore others that you find interesting.
Graphs are organized by the number of variables to be plotted, the type of variables to be plotted, and the purpose of the visualization.
Table 1: Overview

Chapter  Description
1  provides a quick overview of how to get your data into Python and how to prepare it for analysis
2  provides an overview of the plotnine package
3  describes graphs for visualizing the distribution of a single categorical (e.g. race) or quantitative (e.g. income) variable.
4  describes graphs that display the relationship between two variables.
5  describes graphs that display the relationships among 3 or more variables. It is helpful to read chapters 3 and 4 before this chapter.
6  describes graphs that display change over time.
7  describes graphs that can help you interpret the results of statistical models.
8  covers graphs that do not fit neatly elsewhere (every book needs a miscellaneous chapter).
9  describes how to customize the look and feel of your graphs. If you are going to share your graphs with others, be sure to skim this chapter.
Setup
This book is essentially based on three Python libraries: plotnine, plydata and mizani.
About plotnine
When doing descriptive statistics we frequently need to partition the graphics based on categorical (i.e. qualitative) or ordinal variables. Such graphics can be particularly difficult to produce with classical Python graphical libraries (e.g., matplotlib). The R software benefits from a very nice library for this task: the ggplot2 package developed by Hadley Wickham (Wickham 2016). This package quickly became very popular in the bioinformatics field (where categories may be genes, groups of genes, species, signaling pathways, epigenetic marks, and so on, and ordinal variables a discretized level of expression, for instance). The ggplot2 R package is an implementation of the graphical model proposed by Leland Wilkinson in his book The Grammar of Graphics (Wilkinson 2016). In this model, a graph is viewed as an entity composed of data, layers, scales, a coordinate system and facets. One creates a graphic and then adds the various components using the + operator. Although the syntax may appear a little tricky for beginners, one quickly understands the benefit of such an approach when composing complex diagrams.
Several projects have proposed a port of ggplot2 to Python. The plotnine library is one of these projects, offering a rather stable and exhaustive port of ggplot2 for Python.
Installation
# Using pip
$ pip install plotnine # 1. should be sufficient for most
$ pip install 'plotnine[extra]' # 2. includes extra/optional packages
$ pip install 'plotnine[test]' # 3. testing
$ pip install 'plotnine[doc]' # 4. generating docs
$ pip install 'plotnine[dev]' # 5. development (making releases)
$ pip install 'plotnine[all]' # 6. everything
# Or using conda
$ conda install -c conda-forge plotnine
About plydata
plydata is a library that provides a grammar for data manipulation. The grammar consists of verbs that can be applied to pandas dataframes or database tables. It is based on the R packages dplyr, tidyr and forcats. plydata uses the >> operator as a pipe symbol; alternatively, there is the ply(data, *verbs) function that you can use instead of >>.
At present, the only supported data store is the pandas dataframe.
Installation
$ pip install plydata
About Mizani
Mizani is a Python library that provides the pieces necessary to create scales for a graphics system. It is based on the R scales package.
Installation
$ pip install mizani
1 Data preparation
Before you can visualize your data, you have to get it into Python. This involves importing the data from an external source and massaging it into a useful format.
Python can import data from almost any source, including text files, Excel spreadsheets, statistical packages, and database management systems. We'll illustrate these techniques using the Salaries dataset, containing the 9-month academic salaries of college professors at a single institution in 2008-2009.
1.1 Importing data
1.1.1 Text files
The pandas package provides functions for importing delimited text files into Python data frames.
import pandas as pd
These functions assume that the first line of data contains the variable names, that values are separated by commas or tabs respectively, and that missing data are represented by blanks.
Options allow you to alter these assumptions. See the documentation for more details.
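A minimal sketch of those defaults, using an in-memory string in place of a real file (the file contents and column names below are made up for illustration):

```python
import io
import pandas as pd

# Simulated contents of a comma-delimited file; a real call would
# pass a path instead, e.g. pd.read_csv("salaries.csv")
csv_text = """rank,discipline,salary
Prof,A,139750
AsstProf,B,79750
AssocProf,A,"""

# sep, header and na_values spell out the default assumptions:
# the first line holds variable names, fields are comma-separated,
# and blank fields are read as missing (NaN)
df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, na_values=[""])

print(df.shape)                    # (3, 3)
print(int(df["salary"].isna().sum()))  # 1 (the blank salary became NaN)
```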
1.1.2 Excel spreadsheets
The pandas package can import data from Excel workbooks. Both xls and xlsx formats are supported.
Since workbooks can have more than one worksheet, you can specify the one you want with the
sheet_name option. The default is sheet_name = 0.
1.1.3 Statistical packages
The pandas package provides functions for importing data from a variety of statistical packages.
1.1.4 Databases
Importing data from a database requires additional steps. Depending on the database containing the data, the following packages can help: cantools and pyodbc for DBC files, and pandas.read_sql for SQL databases.
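A small sketch of pandas.read_sql, using an in-memory SQLite database to stand in for a real DBMS (the table and its columns are invented for illustration):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database in place of a real server
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salaries (title TEXT, salary REAL)")
con.executemany("INSERT INTO salaries VALUES (?, ?)",
                [("Prof", 139750), ("AsstProf", 79750)])
con.commit()

# pandas.read_sql runs the query and returns the result as a DataFrame
df = pd.read_sql("SELECT title, salary FROM salaries", con)
con.close()

print(df)
```

The same call works with any DB-API connection (e.g. one created by pyodbc), which is what makes it convenient for database imports.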
1.2 Cleaning data
The process of cleaning your data can be the most time-consuming part of any data analysis. The most important steps are considered below. While there are many approaches, those using the dfply and plydata packages are some of the quickest and easiest to learn.
Table 1.1: Example function available in plydata
Function Use
arrange Sort rows by column variables
create Create DataFrame with columns
define Add column to DataFrame
distinct Select distinct/unique rows
do Do arbitrary operations on dataframe
group_by Group dataframe by one or more columns/variables
group_indices Generate a unique id for each group
head Select the top n rows
mutate Alias of plydata.one_table_verbs.define
pull Pull a single column from the dataframe
query Return rows with matching conditions
rename Rename columns
sample_frac Sample a fraction of rows from dataframe
sample_n Sample n rows from dataframe
select Select columns
slice_rows Select rows
summarize Summarise multiple values to a single value
tail Select the bottom n rows
transmute alias of plydata.one_table_verbs.create
ungroup Remove the grouping variables for dataframe
unique alias of plydata.one_table_verbs.distinct
1.2.1 Selecting variables
The select function allows you to limit your dataset to specified variables (columns).
# load data
from datar.datasets import starwars
from plydata import select
# keep the variable name and all variables between mass and species inclusive
newdata = select(starwars, "name", slice("mass", "species"))
newdata.head()
1.2.2 Selecting observations
The query function allows you to limit your dataset to observations (rows) meeting specific criteria. Multiple criteria can be combined with the & (and) and | (or) symbols.
# Select females (starwars was loaded in the previous example)
from plydata import query
newdata = query(starwars, 'sex == "female"')
newdata.head()
1.2.3 Creating/recoding variables
The define function allows you to create new variables or transform existing ones.
# Using mutate (an alias of define)
from plydata.one_table_verbs import mutate
# e.g. express height in meters (illustrative new variable)
newdata = mutate(starwars, height_m="height / 100")
newdata.head()
The if_else function can be used for recoding data. The format is if_else(test, return if
True, return if False).
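plydata may not be available everywhere; the same recoding idea can be sketched in plain pandas/numpy, where np.where plays the role of if_else (the data frame and the 180 cm cutoff are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Luke", "Chewbacca", "Yoda"],
                   "height": [172, 228, 66]})

# np.where(test, value_if_true, value_if_false) mirrors
# if_else(test, return if True, return if False)
df["size"] = np.where(df["height"] > 180, "tall", "short")

print(df["size"].tolist())  # ['short', 'tall', 'short']
```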
1.2.4 Summarizing data
The summarize function can be used to reduce multiple values down to a single value (such as a mean).
mean_ht mean_mass
174.358 97.31186
It is often used in conjunction with the group_by function, to calculate statistics by group.
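For comparison, the same kind of group-wise summary can be sketched in plain pandas with groupby().agg() (toy data; the column names simply mirror the output shown above):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "male", "female", "female"],
                   "height": [180, 170, 160, 150],
                   "mass": [80.0, 70.0, 55.0, 50.0]})

# group_by(...) followed by summarize(...) corresponds to groupby().agg():
# one row per group, one column per computed statistic
stats = (df.groupby("sex")
           .agg(mean_ht=("height", "mean"),
                mean_mass=("mass", "mean"))
           .reset_index())

print(stats)
```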
1.2.5 Using pipes
Packages like plydata allow you to write your code in a compact format using the pipe >> operator. The >> operator passes the result on the left to the first parameter of the function on the right.
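The same step-by-step style can be sketched with pandas method chaining, a rough analogue of the >> pipe (the data frame and the steps are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "female"],
                   "height": [180, 160, 150]})

# Each step receives the result of the previous one, as with >>
result = (df
          .query("height < 175")                          # filter rows
          .assign(height_m=lambda d: d["height"] / 100)   # add a column
          .sort_values("height_m"))                       # arrange

print(result["height_m"].tolist())  # [1.5, 1.6]
```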
1.2.6 Reshaping data
Some graphs require the data to be in wide format, while others require the data to be in long format.
# Create a dataframe
import pandas as pd
df = pd.DataFrame({
"id" : ["01", "02", "03"],
"name" : ["Bill", "Bob", "Mary"],
"sex" : ["Male", "Male", "Female"],
"age" : [22, 25, 18],
"income" : [55000, 75000, 90000]
})
# Long data
long_data = df.melt(id_vars=['id','name'],var_name='variable',value_name='value')
# Wide data
wide_data = (long_data
.pivot(index=['id','name'], columns='variable', values='value')
.reset_index())
1.2.7 Missing data
Real data are likely to contain missing values. There are three basic approaches to dealing with missing data: feature selection, listwise deletion, and imputation. Let's see how each applies to the msleep dataset from the datar package. The msleep dataset describes the sleep habits of mammals and contains missing values on several variables.
1.2.7.1 Feature selection
In feature selection, you delete variables (columns) that contain too many missing values.
pct. of NA
name 0.0000000
genus 0.0000000
vore 0.0843373
order 0.0000000
conservation 0.3493976
sleep_total 0.0000000
sleep_rem 0.2650602
sleep_cycle 0.6144578
awake 0.0000000
brainwt 0.3253012
bodywt 0.0000000
Sixty-one percent of the sleep_cycle values are missing. You may decide to drop it.
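A plain-pandas sketch of feature selection (toy data; the 50% threshold is an arbitrary choice, not a rule from the text):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":        ["a", "b", "c", "d"],
    "sleep_cycle": [np.nan, np.nan, np.nan, 0.4],  # 75% missing
    "sleep_total": [12.0, np.nan, 10.0, 8.0],      # 25% missing
})

# Keep only columns where less than half of the values are missing
keep = df.columns[df.isnull().mean() < 0.5]
reduced = df[keep]

print(list(reduced.columns))  # ['name', 'sleep_total']
```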
1.2.7.2 Listwise deletion
Listwise deletion involves deleting observations (rows) that contain missing values on any of the variables of interest.
## genus 0.0
## vore 0.0
## conservation 0.0
## dtype: float64
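In plain pandas, listwise deletion is DataFrame.dropna; a sketch with made-up data (the column names only echo the output above):

```python
import pandas as pd

df = pd.DataFrame({"genus": ["Acinonyx", "Bos", None],
                   "vore":  ["carni", "herbi", "carni"],
                   "conservation": ["lc", "domesticated", None]})

# Drop every row that has a missing value in any column of interest
complete = df.dropna(subset=["genus", "vore", "conservation"])

print(len(complete))                       # 2 rows survive
print(complete.isnull().mean().sum())      # 0.0 missing remain
```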
1.2.7.3 Imputation
Imputation involves replacing missing values with “reasonable” guesses about what the values
would have been if they had not been missing. There are several approaches, as detailed in
sklearn.
# Set index
import numpy as np
from datar.datasets import msleep
df = msleep.set_index('name')
# Select numerical variables
num_df = df.select_dtypes(include = np.number)
num_df.isnull().mean().round(2)
## sleep_total 0.00
## sleep_rem 0.27
## sleep_cycle 0.61
## awake 0.00
## brainwt 0.33
## bodywt 0.00
## dtype: float64
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputer.set_output(transform="pandas");
num_df = imputer.fit_transform(num_df)
num_df.isnull().mean().round(2)
## sleep_total 0.0
## sleep_rem 0.0
## sleep_cycle 0.0
## awake 0.0
## brainwt 0.0
## bodywt 0.0
## dtype: float64
## genus 0.00
## vore 0.08
## order 0.00
## conservation 0.35
## dtype: float64
## genus 0.0
## vore 0.0
## order 0.0
## conservation 0.0
## dtype: float64
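For the categorical columns shown above, one simple strategy (an assumption here, not the code from the original) is to fill each missing value with the column's most frequent level:

```python
import pandas as pd

df = pd.DataFrame({"vore": ["carni", "herbi", None, "herbi"],
                   "conservation": ["lc", None, None, "lc"]})

# Replace each missing category with the column's mode (most frequent value)
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isnull().mean().to_dict())  # {'vore': 0.0, 'conservation': 0.0}
```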
2 Introduction to plotnine
This chapter provides a brief overview of how the plotnine package works. If you are simply
seeking code to make a specific type of graph, feel free to skip this section. However, the material
can help you understand how the pieces fit together.
The functions in the plotnine package build up a graph in layers. We'll build a complex graph by starting with a simple graph and adding additional elements, one at a time.
2.1 A worked example
The example uses data from the 1985 Current Population Survey to explore the relationship between wages (wage) and experience (exper). It is an R dataset.
# load data
data(CPS85, package = "mosaicData")
In building a plotnine graph, only the first two functions described below are required. The
other functions are optional and can appear in any order.
2.1.1 ggplot
The first function in building a graph is the ggplot function. It specifies the data frame containing the data and the mapping of variables to the visual properties of the graph (the aesthetics).
Why is the graph empty? We specified that the exper variable should be mapped to the x-axis and that wage should be mapped to the y-axis, but we haven't yet specified what we wanted placed on the graph.
[Figure: an empty plot; the wage and exper axes are set up but nothing is drawn yet]
2.1.2 geoms
Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They
are added using functions that start with geom_. In this example, we’ll add points using the
geom_point function, creating a scatterplot.
In plotnine graphs, functions are chained together using the + sign to build a final plot.
# add points
print((ggplot(data = r.CPS85,mapping = aes(x = "exper", y = "wage"))+
geom_point()))
[Figure: scatterplot of wage vs. exper]
The graph indicates that there is an outlier. One individual has a wage much higher than the
rest. We’ll delete this case before continuing.
# delete outlier
from plydata import query
plotdata = query(r.CPS85,"wage < 40")
# redraw scatterplot
print((ggplot(data = plotdata, mapping = aes(x = "exper", y = "wage"))+
geom_point()))
[Figure: scatterplot of wage vs. exper with the outlier removed]
[Figure: the same scatterplot with a line of best fit added]
Next, let's add a line of best fit. We can do this with the geom_smooth function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line's color, and the presence or absence of a confidence interval. Here we request a linear regression (method = "lm") line (where lm stands for linear model).
2.1.3 grouping
In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.
[Figure: scatterplot of wage vs. exper with points colored by sex]
[Figure: the same scatterplot with separate linear trend lines for men and women]
The color = "sex" option is placed in the aes function, because we are mapping a variable to an aesthetic. The geom_smooth option (se = False) was added to suppress the confidence intervals.
It appears that men tend to make more money than women. Additionally, there may be a
stronger relationship between experience and wages for men than for women.
2.1.4 scales
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions
(which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the
x and y scaling, and the colors employed.
print((ggplot(data = plotdata,
mapping = aes(x = "exper", y = "wage",color = "sex")) +
geom_point(alpha = .7,size = 3) +
geom_smooth(method = "lm", se = False,size = 1.5) +
scale_x_continuous(breaks = range(0,60,10)) +
scale_y_continuous(breaks=range(0,30,5),labels=currency_format()) +
scale_color_manual(values = ["indianred","cornflowerblue"])+
theme(legend_position=(0.85,0.8),legend_direction='vertical')))
[Figure: the scatterplot with axis breaks every 10 years, dollar-formatted wages, and custom colors by sex]
We're getting there. The numbers on the x and y axes are better, the y axis uses dollar notation, and the colors are more attractive (IMHO).
Here is a question. Is the relationship between experience, wages and sex the same for each job
sector? Let’s repeat this graph once for each job sector in order to explore this.
2.1.5 facets
Facets reproduce a graph for each level of a given variable (or combination of variables). Facets are created using functions that start with facet_. Here, facets will be defined by the eight levels of the sector variable.
It appears that the differences between men and women depend on the job sector under consideration.
[Figure: scatterplots of wage vs. exper by sex, faceted by job sector]
2.1.6 labels
Graphs should be easy to interpret and informative labels are a key element in achieving this
goal. The labs function provides customized labels for the axes and legends. Additionally, a
custom title and caption can be added.
[Figure: the faceted plot with custom labels: x axis "Years of Experience", legend title "Gender", and caption "source: http://mosaic-web.org/"]
2.1.7 themes
Finally, we can fine-tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data related features of the graph. Let's use a cleaner theme.
[Figure: the labeled, faceted plot with a cleaner theme applied]
Now we have something. It appears that men earn more than women in management, manufacturing, sales, and the "other" category. They are most similar in clerical, professional, and service positions. The data contain no women in the construction sector. For management positions, wages appear to be related to experience for men, but not for women (this may be the most interesting finding). This also appears to be true for sales.
Of course, these findings are tentative. They are based on a limited sample size and do not
involve statistical testing to assess whether differences may be due to chance variation.
2.2 Placing the data and mapping options
Plots created with plotnine always start with the ggplot function. In the example above, the data and mapping options were placed in this function. In this case they apply to each geom_ function that follows.
You can also place these options directly within a geom. In that case, they apply only to that specific geom.
Consider the following graph.
[Figure: scatterplot with color = "sex" set in the ggplot function: colored points and a separate trend line per sex]
Since the mapping of sex to color appears in the ggplot function, it applies to both geom_point and geom_smooth. The color of the point indicates the sex, and a separate colored trend line is produced for men and women. Compare to this:
Since the sex to color mapping only appears in the geom_point function, it is only used there.
A single trend line is created for all observations.
Most of the examples in this book place the data and mapping options in the ggplot function.
Additionally, the phrases data = and mapping= are omitted since the first option always refers
to data and the second option always refers to mapping.
2.3 Graphs as objects
A plotnine graph can be saved as a named Python object (like a data frame), manipulated further, and then printed or saved to disk.
[Figure: scatterplot with color = "sex" set only in geom_point: colored points but a single trend line]
# Prepare data
plotdata = r.CPS85 >> query('wage < 40')
[Figure: scatterplot of wage vs. exper printed from the saved myplot object]
# make the points larger and blue then print the graph
myplot = myplot + geom_point(size = 3, color = "blue")
print(myplot)
[Figure: the saved plot reprinted with larger blue points]
[Figures: further modified versions of the saved plot]
3 Univariate Graphs
Univariate graphs plot the distribution of data from a single variable. The variable can be
categorical (e.g., race, sex) or quantitative (e.g., age, weight).
3.1 Categorical
The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.
3.1.1 Bar chart
The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.
[Figure: bar chart of wedding participants by race]
The majority of participants are white, followed by black, with very few Hispanics or American
Indians.
You can modify the bar fill and border colors, plot labels, and title by adding options to the
geom_bar function.
[Figure: bar chart "Participants by race" with modified fill and border colors and custom axis labels]
3.1.1.1 Percents
Bars can represent percents rather than counts. For bar charts, the aes(x = "race") is actually a shortcut for aes(x = "race", y = "..count.."), where "..count.." is a special variable representing the frequency within each category.
In the code above, the mizani package is used to add % symbols to the y-axis labels.
It is often helpful to sort the bars by frequency. In the code below, the frequencies are calculated
explicitly. Then the reorder function is used to sort the categories by the frequency. The option
stat="identity" tells the plotting function not to calculate counts, because they are supplied
directly.
[Figure: bar chart "Participants by race" with percents on the y axis]
race n
White 74
Hispanic 1
Black 22
American Indian 1
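The explicit frequency calculation can be sketched in plain pandas, rebuilding the table above from raw values (the raw series is reconstructed from the counts shown):

```python
import pandas as pd

# Raw categories, reconstructed from the frequency table above
race = pd.Series(["White"] * 74 + ["Black"] * 22 +
                 ["Hispanic"] + ["American Indian"])

# Count each category, then sort ascending so the bar chart can
# consume the precomputed values with stat="identity"
plotdata = (race.value_counts()
                .rename_axis("race")
                .reset_index(name="n")
                .sort_values("n"))

print(plotdata["n"].tolist())  # [1, 1, 22, 74]
```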
[Figure: bar chart "Participants by race" sorted by ascending frequency]
The graph bars are sorted in ascending order. Use reorder(race, -n) to sort in descending
order.
Finally, you may want to label each bar with its numerical value.
[Figure: bar chart "Participants by race" with each bar labeled with its count]
Here geom_text adds the labels, and va controls vertical alignment (one of top, center, bottom, baseline).
Putting these ideas together, you can create a graph like the one below. The minus sign in
reorder(race, -n) is used to order the bars in descending order.
plotdata.loc[:,"pctlabel"] = percent_format()(plotdata['pct'])
Category labels may overlap if (1) there are many categories or (2) the labels are long. Consider the following graph.
[Figure: bar chart "Participants by race" in descending order with percent labels above the bars]
[Figure: bar chart "Marriages by officiate" with overlapping category labels]
Finally, you can try staggering the labels. The trick is to add a newline \n to every other label.
import textwrap
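The staggering trick, and textwrap-based wrapping as an alternative, can be sketched in plain Python (the officiate labels come from the figure above):

```python
import textwrap

labels = ["BISHOP", "CATHOLIC PRIEST", "CHIEF CLERK", "CIRCUIT JUDGE",
          "ELDER", "MARRIAGE OFFICIAL", "MINISTER", "PASTOR", "REVEREND"]

# Push every other label down one line so neighbours do not collide
staggered = [lab if i % 2 == 0 else "\n" + lab
             for i, lab in enumerate(labels)]

# textwrap.fill is an alternative: break long labels at ~10 characters
wrapped = [textwrap.fill(lab, width=10) for lab in labels]

print(staggered[:2])
print(wrapped[1])
```

Either list can then be passed to the plotting function in place of the original labels.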
[Figure: horizontal bar chart "Marriages by officiate" with the axes flipped]
[Figure: bar chart "Marriages by officiate" with rotated category labels]
3.1.2 Pie chart
Pie charts are controversial in statistics. If your goal is to compare the frequency of categories, you are better off with bar charts (humans are better at judging the length of bars than the volume of pie slices). If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work for you.
Unfortunately, plotnine has not yet ported over the coord_polar ggplot layout. Instead of using plotnine, we will use the matplotlib package.
[Figure: bar chart "Marriages by officiate" with staggered category labels]
[Figure: pie chart of participants by race]
The pie chart makes it easy to compare each slice with the whole.
3.1.3 Tree map
An alternative to a pie chart is a tree map. Unlike pie charts, it can handle categorical variables that have many levels.
plotnine has not yet ported over the geom_treemap ggplot layout. Instead of using plotnine, we will use the squarify package.
import squarify
[Figure: pie chart of participants by race with percent labels (White 76%, Black 22%, Hispanic 1%, American Indian 1%)]
officialTitle n
CIRCUIT JUDGE 2
MARRIAGE OFFICIAL 44
MINISTER 20
PASTOR 22
CHIEF CLERK 2
REVEREND 2
ELDER 2
CATHOLIC PRIEST 2
BISHOP 2
import matplotlib.pyplot as plt
squarify.plot(sizes=plotdata.n);
plt.title("Marriages by officiate");
plt.axis("off");
plt.show()
[Figure: tree map of marriages by officiate, unlabeled]
# Add labels
squarify.plot(sizes=plotdata.n,label=plotdata.officialTitle,pad=1,
text_kwargs={'fontsize': 6});
plt.title("Marriages by officiate");
plt.axis("off");
plt.show()
[Figure: "Marriages by officiate" treemap with officialTitle labels]
3.2 Quantitative
The distribution of a single quantitative variable is typically plotted with a histogram, kernel
density plot, or dot plot.
3.2.1 Histogram
[Figure: "Participants by age" histogram of count by age]
Most participants appear to be in their early 20’s with another group in their 40’s, and a much
smaller group in their later sixties and early seventies. This would be a multimodal distribution.
Histogram colors can be modified using two options: fill (the bar fill color) and color (the bar border color).
[Figure: "Participants by age" histogram with modified fill and border colors]
One of the most important histogram options is bins, which controls the number of bins into which the numeric variable is divided (i.e., the number of bars in the plot). The default is 30, but it is helpful to try smaller and larger numbers to get a better impression of the shape of the distribution.
[Figure: histogram of age with a smaller number of bins]
Alternatively, you can specify the binwidth, the width of the bins represented by the bars.
[Figure: histogram of age with a specified binwidth]
As with bar charts, the y-axis can represent counts or percent of the total.
[Figure: "Participants by age" histogram with percents on the y-axis]
An alternative to a histogram is the kernel density plot. Technically, kernel density estimation is a nonparametric method for estimating the probability density function of a continuous random variable. (What??) Basically, we are trying to draw a smoothed histogram, where the area under the curve equals one.
[Figure: "Participants by age" kernel density plot]
The graph shows the distribution of scores. For example, the proportion of cases between 20
and 40 years old would be represented by the area under the curve between 20 and 40 on the
x-axis.
As with previous charts, we can use fill and color to specify the fill and border colors.
[Figure: "Participants by age" density plot with fill and border colors]
The degree of smoothness is controlled by the bandwidth parameter bw. To find the default
value for a particular variable, use the bw_nrd0 function.
def bw_nrd0(x):
    """Default bandwidth, as computed by R's bw.nrd0 (Silverman's rule of thumb)"""
    if len(x) < 2:
        raise Exception("need at least 2 data points")
    hi = np.std(x, ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    lo = min(hi, iqr / 1.34)
    return 0.9 * lo * len(x) ** (-1 / 5)
bw_nrd0(r.Marriage.age)
## 5.1819460066411365
Values that are larger will result in more smoothing, while values that are smaller will produce less smoothing.
[Figure: density plot of age with bandwidth bw = 1]
In this example, the default bandwidth for age is 5.18. Choosing a value of 1 resulted in less smoothing (a more detailed curve).
Kernel density plots allow you to easily see which scores are most frequent and which are relatively rare. However, it can be difficult to explain the meaning of the y-axis to a non-statistician. (But it will make you look really smart at parties!)
Another alternative to the histogram is the dot chart. Again, the quantitative variable is divided
into bins, but rather than summary bars, each observation is represented by a dot. By default,
the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing
one observation. This works best when the number of observations is small (say, less than 150).
[Figure: "Participants by age" dot plot]
The fill and color options can be used to specify the fill and border color of each dot, respectively.
[Figure: "Participants by age" dot plot with custom fill and border colors]
There are many more options available. See the help for details and examples.
4
Bivariate Graphs
Bivariate graphs display the relationship between two variables. The type of graph will depend on the measurement level of the variables (categorical or quantitative).
4.1 Categorical vs. Categorical
When plotting the relationship between two categorical variables, stacked, grouped, or segmented bar charts are typically used. A less common approach is the mosaic chart (7.5).
Let’s plot the relationship between automobile class and drive type (front-wheel, rear-wheel, or 4-wheel drive) for the automobiles in the Fuel economy dataset.
print((ggplot(mpg,aes(x="class", fill="drv"))+geom_bar(position="stack")+
theme(legend_position=(0.55,0.75),legend_direction="vertical")))
[Figure: stacked bar chart of count by class, filled by drv]
From the chart, we can see, for example, that the most common vehicle is the SUV. All 2seater cars are rear-wheel drive, while most, but not all, SUVs are 4-wheel drive.
Stacked is the default, so the last line could have also been written as geom_bar.
Grouped bar charts place bars for the second categorical variable side-by-side. To create a
grouped bar plot use the position = "dodge".
[Figure: grouped (dodged) bar chart of count by class and drv]
Notice that all minivans are front-wheel drive. By default, zero-count bars are dropped and the remaining bars are made wider. This may not be the behavior you want. You can modify it using the position = position_dodge(preserve = "single") option.
Figure 4.3: Side-by-side bar chart with zero count bars retained
A segmented bar plot is a stacked bar plot where each bar represents 100 percent. You can create a segmented bar chart using the position = "fill" option.
[Figure: segmented bar chart of Proportion by class, filled by drv]
This type of plot is particularly useful if the goal is to compare the percentage of a category in one variable across each level of another variable. For example, the proportion of front-wheel drive cars goes up as you move from compact, to midsize, to minivan.
You can use additional options to improve color and labeling. In the graph below
• reorder_categories modifies the order of the categories for the class variable and both the order and the labels for the drive variable.
• scale_y_continuous modifies the y-axis tick mark labels.
• labs provides a title and changes the labels for the x and y axes and the legend.
• scale_fill_brewer changes the fill color scheme.
• theme_minimal removes the grey background and changes the grid color.
Figure 4.5: Segmented bar chart with improved labeling and color
In the graph above, the reorder_categories function was used to reorder and/or rename the
levels of a categorical variable. You could also apply this to the original dataset, making these
changes permanent. It would then apply to all future graphs using that dataset. For example:
# change the order the levels for the categorical variable "class"
mpg["class"] = (mpg["class"].cat.reorder_categories(["2seater","subcompact",\
"compact","midsize","minivan", "suv", "pickup"]))
Next, let’s add percent labels to each segment. First, we’ll create a summary dataset that has
the necessary labels.
Next, we’ll use this dataset and the geom_text function to add labels to each bar segment.
[Figure: segmented bar chart with percent labels on each segment]
4.2 Quantitative vs. Quantitative
The relationship between two quantitative variables is typically displayed using scatterplots and line graphs.
4.2.1 Scatterplot
The simplest display of two quantitative variables is a scatterplot, with each variable represented
on an axis. For example, using Salaries
data(Salaries,package="carData")
# Simple scatterplot
print((r.Salaries >>ggplot(aes(x="yrs.since.phd",y="salary"))+geom_point()))
[Figure: scatterplot of salary by yrs.since.phd]
The functions scale_x_continuous and scale_y_continuous control the scaling on x and y axes
respectively.
[Figure: scatterplot with a currency-formatted salary axis and "Years Since PhD" label]
It is often useful to summarize the relationship displayed in the scatterplot using a best fit line. Many types of lines are supported, including linear, polynomial, and nonparametric (loess).
[Figure: scatterplot of salary by yrs.since.phd with a linear best fit line]
Clearly, salary increases with experience. However, there seems to be a dip at the right end -
professors with significant experience, earning lower salaries. A straight line does not capture
this non-linear effect. A line with a bend will fit better here.
A polynomial regression line provides a fit line of the form
ŷ = β₀ + β₁x + β₂x² + β₃x³ + β₄x⁴ + ⋯
Typically either a quadratic (one bend) or cubic (two bends) line is used. It is rarely necessary to use a higher-order (>3) polynomial. Applying a quadratic fit to the salary dataset produces the following result.
# create a polynomial term for use in a model formula
def poly(x, p):
    return x**p
Finally, a smoothed nonparametric fit line can often provide a good picture of the relationship. The default in plotnine is a loess line, which stands for locally weighted scatterplot smoothing.
You can suppress the confidence bands by including the option se = False.
Here is a complete (and more attractive) plot
[Figure: scatterplot of salary by yrs.since.phd with a quadratic fit line]
[Figure: scatterplot of salary by yrs.since.phd with a loess fit line]
When one of the two variables represents time, a line plot can be an effective method of displaying the relationship. For example, the code below displays the relationship between time (year) and
life expectancy (lifeExp) in the United States between 1952 and 2007. The data comes from
the gapminder dataset.
[Figure: finished scatterplot of salary by years since PhD with currency labels]
# Select US cases
plotdata = gapminder >> query("country == 'United States'")
[Figure: line plot of US life expectancy (lifeExp) by year]
It is hard to read individual values in the graph above. In the next plot, we’ll add points as
well.
Time-dependent data is covered in more detail under Time series. Customizing line graphs is covered in the Customizing graphs section.
[Figure: US life expectancy by year, with points added]
Source: http://www.gapminder.org/data/
4.3 Categorical vs. Quantitative
When plotting the relationship between a categorical variable and a quantitative variable, a large number of graph types are available. These include bar charts using summary statistics, grouped kernel density plots, side-by-side box plots, side-by-side violin plots, mean/sem plots, ridgeline plots, and Cleveland plots.
In previous sections, bar charts were used to display the number of cases by category for a single variable or for two variables. You can also use bar charts to display other summary statistics (e.g., means or medians) on a quantitative variable for each level of a categorical variable.
For example, the following graph displays the mean salary for a sample of university professors
by their academic rank.
rank mean_salary
Prof 126772.11
AsstProf 80775.99
AssocProf 93876.44
[Figure: bar chart of mean_salary by rank]
[Figure: improved bar chart with currency labels and full rank names]
One limitation of such plots is that they do not display the distribution of the data - only the
summary statistic for each group. The plots below correct this limitation to some extent.
One can compare groups on a numeric variable by superimposing kernel density plots in a single
graph.
[Figure: superimposed kernel density plots of salary by rank]
The alpha option makes the density plots partially transparent, so that we can see what is
happening under the overlaps. Alpha values range from 0 (transparent) to 1 (opaque). The
graph makes clear that, in general, salary goes up with rank. However, the salary range for full
professors is very wide.
A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers. Side-by-side box plots are very useful for comparing groups (i.e., the levels of a categorical variable) on a numerical variable.
[Figure: side-by-side box plots of salary by rank]
Notched boxplots provide an approximate method for visualizing whether groups differ. Although not a formal test, if the notches of two boxplots do not overlap, there is strong evidence (95% confidence) that the medians of the two groups differ.
[Figure: notched box plots of salary by rank]
Violin plots are similar to kernel density plots, but are mirrored and rotated 90°.
[Figure: violin plots of salary by rank]
[Figure: violin plots of salary by rank with box plots superimposed]
A ridgeline plot (also called a joyplot) displays the distribution of a quantitative variable for
several groups. They’re similar to kernel density plots with vertical faceting, but take up less
room. Ridgeline plots are created with the seaborn package.
Using the Fuel economy dataset, let’s plot the distribution of city driving miles per gallon by
car class.
[Figure: ridgeline plot of city miles per gallon by car class]
A popular method for comparing groups on a numeric variable is the mean plot with error
bars. Error bars can represent standard deviations, standard error of the mean, or confidence
intervals. In this section, we’ll plot means and standard errors.
# Calculate means, standard deviation, standard errors, and 95% confidence intervals
from math import *
import scipy.stats as st
from plotnine.data import mtcars
rank n mean sd se ci
Prof 266 126772.11 27666.523 1696.3434 -3340.026
AsstProf 67 80775.99 8112.882 991.1463 -1978.888
AssocProf 64 93876.44 13723.214 1715.4018 -3427.957
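The summary statistics in the table above can be reproduced with pandas and scipy along these lines (made-up salaries stand in for the real data):

```python
import pandas as pd
import scipy.stats as st

# Hypothetical salaries (stand-ins for the carData Salaries dataset)
df = pd.DataFrame({"rank": ["Prof"] * 4 + ["AsstProf"] * 3,
                   "salary": [120000, 130000, 125000, 128000,
                              80000, 82000, 79000]})

plotdata = (df.groupby("rank")["salary"]
              .agg(n="count", mean="mean", sd="std")
              .reset_index())
plotdata["se"] = plotdata["sd"] / plotdata["n"] ** 0.5
# half-width of a 95% confidence interval around each mean
plotdata["ci"] = [st.t.ppf(0.975, n - 1) * se
                  for n, se in zip(plotdata["n"], plotdata["se"])]
```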
[Figure: mean salary by rank with standard error bars]
Although we plotted error bars representing the standard error, we could have plotted standard deviations or 95% confidence intervals. Simply replace se with sd or ci in the aes option.
We can use the same technique to compare salary across rank and sex. (Technically, this is not
bivariate since we’re plotting rank, sex, and salary, but it seems to fit here)
[Figure: mean salary by rank and sex, with overlapping error bars]
Unfortunately, the error bars overlap. We can dodge the horizontal positions a bit to overcome
this.
Finally, let's add some options to make the graph more attractive.
[Figure: mean/SE plot with horizontal positions dodged by sex]
[Figure: final mean/SE plot with currency labels and a Gender legend]
The relationship between a grouping variable and a numeric variable can be displayed with a
scatter plot. For example
[Figure: strip plot of salary by rank]
These one-dimensional scatterplots are called strip plots. Unfortunately, overprinting of points
makes interpretation difficult. The relationship is easier to see if the points are jittered. Basically
a small random number is added to each y-coordinate.
[Figure: jittered strip plot of salary by rank]
theme(legend_position = "none")+
scale_y_discrete(breaks=["AsstProf","AssocProf","Prof"],
labels = ["Assistant\nProfessor","Associate\nProfessor","Full\nProfessor"])))
[Figure: jittered plot with full rank names on the y-axis]
The option legend_position = "none" is used to suppress the legend (which is not needed here). Jittered plots work well when the number of points is not overly large.
[Figure: jittered plot with the axes flipped]
Finally, the x and y axes are reversed using the coord_flip function (i.e., the graph is turned on its side).
Beeswarm plots (also called violin scatter plots) are similar to jittered scatterplots, in that they display the distribution of a quantitative variable by plotting points in a way that reduces overlap. In addition, they help display the density of the data at each point (in a manner similar to a violin plot).
[Figure: beeswarm plot of salary by rank]
Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Asian countries using the gapminder dataset.
[Figure: Cleveland dot plot of 2007 life expectancy (lifeExp) for Asian countries]
[Figure: sorted Cleveland plot (lollipop graph) with countries ordered by lifeExp]
Japan clearly has the highest life expectancy, while Afghanistan has the lowest by far. This last
plot is also called a lollipop graph.
5
Multivariate Graphs
Multivariate graphs display the relationships among three or more variables. There are two common methods for accommodating multiple variables: grouping and faceting.
5.1 Grouping
In grouping, the values of the first two variables are mapped to the x and y axes. Then additional
variables are mapped to other visual characteristics such as color, shape, size, line type, and
transparency. Grouping allows you to plot the data for multiple groups in a single graph.
Using the Salaries dataset, let’s display the relationship between yrs.since.phd and salary.
data(Salaries, package="carData")
[Figure: scatterplot of salary by yrs.since.phd]
[Figure: scatterplot of salary by yrs.since.phd, colored by rank]
Finally, let’s add the gender of professor, using the shape of the points to indicate sex. We’ll
increase the point size and add transparency to make the individual points clearer.
[Figure: scatterplot with color mapped to rank and point shape mapped to sex]
I can’t say that this is a great graphic. It is very busy, and it can be difficult to distinguish male
from female professors. Faceting (described in the next section) would probably be a better
approach.
Notice the difference between specifying a constant value (such as size = 3) and a mapping of a variable to a visual characteristic (e.g., color = "rank"). Mappings are always placed within the aes function, while the assignment of a constant value appears outside of the aes function.
Here is a cleaner example. We’ll graph the relationship between years since Ph.D. and salary
using the size of the points to indicate years of service. This is called a bubble plot.
[Figure: bubble plot of salary by yrs.since.phd, with point size mapped to yrs.service]
There is obviously a strong positive relationship between years since Ph.D. and years of service. Assistant Professors fall in the 0-11 years since Ph.D. and 0-10 years of service range. Clearly, highly experienced professionals don't stay at the Assistant Professor level (they are probably promoted or leave the university). We don't find the same time demarcation between Associate and Full Professors.
Bubble plots are described in more detail in a later chapter.
As a final example, let’s look at the yrs.since.phd vs salary and add sex using color and quadratic
best fit lines.
def poly(x,p):
return x**p
print((
r.Salaries >>
ggplot(aes(x = "yrs.since.phd",y = "salary", color = "sex")) +
geom_point(alpha = .4,size = 3) +
geom_smooth(se=False,method = "lm",formula = "y~x+poly(x,2)",size = 1.5) +
labs(x = "Years Since Ph.D.",title = "Academic Salary by Sex and Years Experience",
y = "",color = "Sex") +
scale_y_continuous(labels = currency_format()) +
scale_color_brewer(type="qual", palette="Set1") +
theme_minimal()
))
Figure 5.5: Scatterplot with color mapping and quadratic fit lines
To use a qualitative palette with plotnine, you have to include the type = "qual" argument.
5.2 Faceting
Grouping allows you to plot multiple variables in a single graph, using visual characteristics
such as color, shape, and size.
In faceting, a graph consists of several separate plots or small multiples, one for each level of
a third variable, or combination of variables. It is easiest to understand this with an example.
[Figure: histograms of salary faceted by rank]
The facet_wrap function creates a separate graph for each level of rank. The ncol option controls the number of columns.
In the next example, two variables are used to define the facets.
[Figure: salary histograms faceted by sex (rows) and rank (columns)]
# Create
print((plotdata >>ggplot(aes(x = "sex",y = "mean",color = "sex")) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = "mean - se",ymax = "mean + se"),width = 0.1) +
scale_y_continuous(breaks = range(70000, 140000, 10000),
labels = currency_format()) +
[Figure: mean salary by sex, faceted by rank and discipline]
The statement facet_grid(".~ rank + discipline") specifies no row variable (.) and
columns defined by the combination of rank and discipline.
The theme_ functions create a black and white theme and eliminate vertical grid lines and minor horizontal grid lines. The scale_color_brewer function changes the color scheme for the points and error bars.
At first glance, it appears that there might be gender differences in salaries for associate and full
professors in theoretical fields. I say “might” because we haven’t done any formal hypothesis
testing yet (ANCOVA in this case).
See the Customizing section to learn more about customizing the appearance of a graph.
As a final example, we'll shift to a new dataset and plot the change in life expectancy over time for countries in the Americas. The data comes from the gapminder dataset in the gapminder package. Each country appears in its own facet. The theme functions are used to simplify the background color, rotate the x-axis text, and make the font size smaller.
# Plot life expectancy by year separately for each country in the Americas
from gapminder import gapminder
[Figure: life expectancy by year, one facet per country in the Americas]
We can see that life expectancy is increasing in each country, but that Haiti is lagging behind.
6
Time-dependent Graphs
A graph can be a powerful vehicle for displaying change over time. The most common time-dependent graph is the time series line graph. Other options include dumbbell charts and slope graphs.
6.1 Time series
A time series is a set of quantitative values obtained at successive time points. The intervals between time points (e.g., hours, days, weeks, months, or years) are usually equal.
Consider the Economics time series that comes with the plotnine package. It contains US monthly economic data collected from July 1967 to April 2015. Let's plot the personal savings rate (psavert). We can do this with a simple line plot.
[Figure: line plot of the Personal Savings Rate over time]
The calculated breaks are awful, we need to intervene. We do so using the date_breaks and
date_format functions from mizani.
# Date breaks
from mizani.breaks import date_breaks
from plotnine.stats import stat_smooth
Figure 6.2: Simple time series with loess regression and modified date breaks
That is better. Since all the breaks are at the beginning of the year, we can omit the month and
day. Using date_format we override the format string. For more on the options for the format
string see the strftime behavior.
# Date format
from mizani.formatters import date_format
def custom_date_format1(breaks):
"""
Function to format the date
"""
return [x.year if x.month==1 and x.day==1 else "" for x in breaks]
We can use a custom formatting function to get results that are not obtainable with the
date_format function. For example if we have monthly breaks over a handful of years we
Figure 6.4: Simple time series with customized date format - one
can mix date formats as follows: specify the beginning of the year and every other month. Such tricks can be used to reduce overcrowding.
def custom_date_format2(breaks):
"""
Function to format the date
"""
res = []
for x in breaks:
# First day of the year
if x.month == 1 and x.day == 1:
fmt = '%Y'
# Every other month
elif x.month % 2 != 0:
fmt = '%b'
else:
fmt = ''
res.append(date.strftime(x, fmt))
return res
Figure 6.5: Simple time series with customized date format - two
We removed the labels but not the breaks, leaving behind dangling ticks for the skipped months.
We can fix that by wrapping date_breaks around a filtering function.
def custom_date_format3(breaks):
"""
Function to format the date
"""
res = []
for x in breaks:
# First day of the year
if x.month == 1:
fmt = '%Y'
else:
fmt = '%b'
res.append(date.strftime(x, fmt))
return res
def custom_date_breaks(width=None):
    """
    Create a function that calculates date breaks,
    delegating the work to date_breaks
    """
    def filter_func(limits):
        breaks = date_breaks(width)(limits)
        # Keep only the breaks for the odd months (including January)
        return [x for x in breaks if x.month % 2 != 0]

    return filter_func
Figure 6.6: Simple time series with customized date format - three and date breaks
When plotting time series, be sure that the date variable is a datetime type and not a string or category.
6.2 Dumbbell charts
Dumbbell charts are useful for displaying change between time points for several groups or observations.
Using the gapminder dataset, let's plot the change in life expectancy from 1952 to 2007 in the Americas. The dataset is in long format. We will need to convert it to wide format in order to create the dumbbell plot.
# Subset data
from gapminder import gapminder
from plydata import *
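The long-to-wide conversion can be sketched with pandas' pivot; the two countries and the life-expectancy values below are approximate stand-ins:

```python
import pandas as pd

# Approximate stand-ins for a few gapminder rows (country, year, lifeExp)
df = pd.DataFrame({"country": ["Haiti", "Haiti", "Cuba", "Cuba"],
                   "year": [1952, 2007, 1952, 2007],
                   "lifeExp": [37.6, 60.9, 59.4, 78.3]})

# long -> wide: one row per country, one column per year (y1952, y2007)
plotdata_wide = (df.pivot(index="country", columns="year", values="lifeExp")
                   .add_prefix("y")
                   .reset_index())
```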
label='_nolegend_', alpha=.8);
ax.scatter(plotdata_wide["y1952"],plotdata_wide["country"],label='y1952', s=60,
color='#DB444B', zorder=3);
ax.scatter(plotdata_wide["y2007"],plotdata_wide["country"],label='y2007', s=60,
color='#006BA2', zorder=3);
ax.xaxis.set_tick_params(labeltop=False,labelbottom=True,bottom=True,
labelsize=11,pad=-1);
ax.set_xlabel("y1952");
ax.set_yticks(plotdata_wide["country"]);
ax.set_yticklabels(plotdata_wide["country"],ha = 'left');
ax.yaxis.set_tick_params(pad=120,labelsize=11);
ax.legend(['1952', '2007'], loc=(0,0.9),ncol=1, frameon=False,
handletextpad=-0.1, handleheight=1);
plt.show()
[Figure: dumbbell chart of life expectancy in 1952 vs. 2007 for countries in the Americas]
The graph will be easier to read if the countries are sorted and the points are sized and colored.
In the next graph, we’ll sort by 1952 life expectancy, and modify the line and point size, color
the points, add titles and labels, and simplify the theme.
# Sort values by
plotdata_wide = plotdata_wide.sort_values(by= "y1952")
color='#006BA2', zorder=3);
ax.xaxis.set_tick_params(labeltop=False,labelbottom=True,bottom=True,
labelsize=11,pad=-1);
ax.set_xlabel("Life Expectancy (years)");
ax.set_yticks(plotdata_wide["country"]);
ax.set_yticklabels(plotdata_wide["country"],ha = 'left');
ax.yaxis.set_tick_params(pad=120,labelsize=11);
ax.legend(['1952', '2007'], loc=(0,0.9), ncol=1, frameon=False,
handletextpad=-0.1, handleheight=1);
ax.set_title("Change in Life Expectancy",ha='right',weight='bold',fontsize=11);
ax.text(x=-0.04,y=1.01,s="1952 to 2007",ha='right',fontsize=11);
plt.show()
It is easier to discern patterns here. For example, Haiti started with the lowest life expectancy in 1952 and still has the lowest in 2007. Paraguay started relatively high but has made few gains.
6.3 Slope graphs
When there are several groups and several time points, a slope graph can be helpful. Let's plot life expectancy for six Central American countries in 1992, 1997, 2002 and 2007. Again, we'll use the gapminder data.
# Filter
condition = '''
year in [1992, 1997, 2002, 2007] and \
country in ['Panama','Costa Rica',
'Nicaragua','Honduras','El Salvador','Guatemala','Belize']
'''
df = gapminder >> query(condition.replace('\n', ''))
colors = ["green","red","yellow","blue","black","darkmagenta","purple"]
[Figure: slope graph of life expectancy for Central American countries, 1992-2007]
In the graph above, Costa Rica has the highest life expectancy across the range of years studied.
Guatemala has the lowest, and caught up with Honduras (also low at 69) in 2002.
Slope charts can be used to show change in a ‘before and after’ story by comparing their values
at different points in time. It is a great way to show change or difference when there are only
two data points. It works well with both continuous data and categorical data.
df = pd.read_excel("./donnee/slope_2.xlsx")
#set a list of country names
countries=['Afghanistan','Dem. Rep. of the Congo','Myanmar',
'South Sudan','Syrian Arab Rep.']
#plot the chart
fig, ax = plt.subplots()
for i, name in enumerate(countries):
[Figure: slope chart comparing refugee populations at two time points for selected countries]
Source: UNHCR Refugee Data Finder
©UNHCR, The UN Refugee Agency
6.4 Area Charts
A simple area chart is basically a line graph, with a fill from the line to the x-axis.
[Figure: area chart of the personal savings rate over time]
A stacked area chart can be used to show differences between groups over time. Consider the uspopage dataset from the gcookbook package. We'll plot the age distribution of the US population from 1900 to 2002.
[Figure: "US Population by age" stacked area chart of Population in Thousands by Year, filled by AgeGroup]
It is best to avoid scientific notation in your graphs. How likely is it that the average reader will
know that 3e+05 means 300,000,000? It is easy to change the scale in plotnine. Simply divide
the Thousands variable by 1000 and report it as Millions. While we are at it, let’s
The levels of the AgeGroup variable can be reversed using the reorder_categories function in
the pandas package.
r.uspopage["AgeGroup"] = (r.uspopage["AgeGroup"].cat
.reorder_categories([">64","55-64","45-54","35-44","25-34","15-24","5-14","<5"]))
[Figure: stacked area chart with reversed AgeGroup levels and population reported in millions]
Apparently, the number of young children has not changed very much in the past 100 years.
Stacked area charts are most useful when interest is on both (1) group change over time and (2)
overall change over time. Place the most important groups at the bottom. These are the easiest
to interpret in this type of plot.
7
Statistical Models
A statistical model describes the relationship between one or more explanatory variables and one or more response variables. Graphs can help to visualize these relationships. In this section we'll focus on models that have a single response variable that is either quantitative (a number) or binary (yes/no).
Correlation plots help you to visualize the pairwise relationships between a set of quantitative
variables by displaying their correlations using color or shading.
Consider the Saratoga Houses dataset, which contains the sale price and characteristics of
Saratoga County, NY homes in 2006. In order to explore the relationships among the quantita-
tive variables, we can calculate the Pearson Product-Moment correlation coefficients.
# Load dataset
data(SaratogaHouses, package="mosaicData")
price lotSize age landValue livingArea pctCollege bedrooms fireplaces bathrooms rooms
price 1.00 0.16 -0.19 0.58 0.71 0.20 0.40 0.38 0.60 0.53
lotSize 0.16 1.00 -0.02 0.06 0.16 -0.03 0.11 0.09 0.08 0.14
age -0.19 -0.02 1.00 -0.02 -0.17 -0.04 0.03 -0.17 -0.36 -0.08
landValue 0.58 0.06 -0.02 1.00 0.42 0.23 0.20 0.21 0.30 0.30
livingArea 0.71 0.16 -0.17 0.42 1.00 0.21 0.66 0.47 0.72 0.73
pctCollege 0.20 -0.03 -0.04 0.23 0.21 1.00 0.16 0.25 0.18 0.16
bedrooms 0.40 0.11 0.03 0.20 0.66 0.16 1.00 0.28 0.46 0.67
fireplaces 0.38 0.09 -0.17 0.21 0.47 0.25 0.28 1.00 0.44 0.32
bathrooms 0.60 0.08 -0.36 0.30 0.72 0.18 0.46 0.44 1.00 0.52
rooms 0.53 0.14 -0.08 0.30 0.73 0.16 0.67 0.32 0.52 1.00
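A matrix like this can be computed in pandas with DataFrame.corr(), which uses Pearson correlations by default. A small sketch on synthetic data (the column names here are only illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
living = rng.normal(size=200)
df = pd.DataFrame({
    "livingArea": living,
    "price": 0.7 * living + rng.normal(scale=0.7, size=200),  # correlated with livingArea
    "age": rng.normal(size=200),                              # unrelated noise
})
corr = df.corr().round(2)  # Pearson product-moment correlations
print(corr.loc["price", "price"])  # 1.0 on the diagonal
```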
The ggcorrplot function in the ggcorrplot package can be used to visualize these correlations. By default, it creates a plotnine graph where darker red indicates stronger positive correlations, darker blue indicates stronger negative correlations, and white indicates no correlation.
[Figure: ggcorrplot correlation heatmap of the SaratogaHouses variables; red tiles mark positive correlations, blue tiles negative ones]
From the graph, increases in the number of bathrooms and in living area are associated with increased price, while older homes tend to be less expensive. Older homes also tend to have fewer bathrooms.
The ggcorrplot function has a number of options for customizing the output. For example
• hc_order = True reorders the variables, placing variables with similar correlation patterns
together.
• type= "lower" plots the lower portion of the correlation matrix.
• lab = True overlays the correlation coefficients (as text) on the plot.
This, and other options, can make the graph easier to read and interpret.
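ggcorrplot originated in the R ecosystem; if it is not available in your Python setup, a comparable plot (lower triangle, overlaid coefficients) can be sketched with seaborn's heatmap. The column names below are made up.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["price", "age", "rooms"])
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)  # hide the upper triangle
ax = sns.heatmap(corr, mask=mask, annot=True,        # annot plays the role of lab=True
                 vmin=-1, vmax=1, cmap="RdBu_r")
```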
7.2 Linear Regression

Linear regression allows us to explore the relationship between a quantitative response variable and an explanatory variable while other variables are held constant.
Consider the prediction of home prices in the Saratoga dataset from lot size (square feet), age
(years), land value (1000s dollars), living area (square feet), number of bedrooms and bathrooms
and whether the home is on the waterfront or not.
[Figure 7.2: Sorted lower-triangle correlation matrix with options]
# Linear regression
from statsmodels.api import OLS

formula = "price ~ lotSize + age + landValue + livingArea + bedrooms + bathrooms + waterfront"
lm_model = OLS.from_formula(formula=formula, data=r.SaratogaHouses).fit()
lm_res = lm_model.summary2().tables[1]  # coefficient table
From the results, we can estimate that an increase of one square foot of living area is associated with a home price increase of $75, holding the other variables constant. Additionally, waterfront homes cost approximately $120,726 more than non-waterfront homes, again controlling for the other variables in the model.
Let’s plot conditional relationships.
# Plot
print((ggplot(data = df,mapping = aes(x = "livingArea", y = "price"))+
geom_point() +
[Figure: conditional plot of price vs. livingArea]
The graph suggests that, after controlling for lot size, age, land value, number of bedrooms and bathrooms, and waterfront location, sales price increases with living area in a linear fashion.
7.3 Logistic regression

Logistic regression can be used to explore the relationship between a binary response variable and an explanatory variable while other variables are held constant. Binary response variables have two levels (yes/no, lived/died, pass/fail, malignant/benign).
# Load Titanic dataset
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic = titanic >> select("survived","pclass","sex","age","embark_town")
titanic = titanic.dropna()
With categorical variables, you'll sometimes want to set the reference category to a specific value. This can help make the results more interpretable.
In the Titanic Dataset used above, we could examine how likely survival was for first-class
passengers relative to third-class. We can do this with Patsy’s categorical treatments.
In the Titanic dataset, the pclass column gets interpreted as an integer. We change this by wrapping it in an uppercase C and parentheses ().
Notice, though, this only signals to Patsy to treat pclass as categorical. The reference level
hasn’t changed. To set the reference level, we include a Treatment argument with a reference
set to the desired value.
For more on categorical treatments, see here and here from the Patsy docs.
7.4 Survival plots

In many research settings, the response variable is the time to an event. This is frequently true in healthcare research, where we are interested in time to recovery, time to death, or time to relapse.
If the event has not occurred for an observation (either because the study ended or the patient dropped out), the observation is said to be censored.
The Veterans' Administration Lung Cancer Trial is a randomized trial of two treatment regimens for lung cancer. The data set (Kalbfleisch, J. and Prentice, R. (1980), The Statistical Analysis of Failure Time Data. New York: Wiley) consists of 137 patients and 8 variables, which are described below:
• Treatment : denotes the type of lung cancer treatment; standard and test drug.
• Celltype : denotes the type of cell involved; squamous, small cell, adeno, or large
• Karnofsky_score : the Karnofsky performance score
• Diag : the time since diagnosis, in months
• Age : the patient's age, in years
• Prior_Therapy : denotes any prior therapy; none or yes
• Status : the status of the patient; dead or alive
• Survival_in_days : the survival time in days since the treatment
## Status Survival_in_days
## <bool> <float64>
## 0 True 72.0
## 1 True 411.0
## 2 True 228.0
## 3 True 126.0
## 4 True 118.0
## .. ... ...
## 132 True 133.0
## 133 True 111.0
## 134 True 231.0
## 135 True 378.0
## 136 True 49.0
##
## [137 rows x 2 columns]
We can easily see that only a few survival times are right-censored (Status is False), i.e., most veterans died during the study period (Status is True).
A key quantity in survival analysis is the so-called survival function, which relates time to the
probability of surviving beyond a given time point.
If we observed the exact survival time of all subjects, i.e., everyone died before the study ended, the survival function at time t could simply be estimated by the ratio of the number of patients surviving beyond time t to the total number of patients:

Ŝ(t) = (number of patients surviving beyond t) / (total number of patients)
import pandas as pd
## Status Survival_in_days
## <bool> <float64>
## 1 True 8.0
## 2 True 10.0
## 3 True 20.0
## 4 False 25.0
## 5 True 59.0
Using the formula from above, we can compute Ŝ(t = 11) = 3/5, but not Ŝ(t = 30), because we don't know whether the 4th patient is still alive at t = 30; all we know is that when we last checked at t = 25, the patient was still alive.
An estimator, similar to the one above, that is valid if survival times are right-censored is the
Kaplan-Meier estimator.
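The Kaplan-Meier estimator multiplies, at each death time, the fraction of at-risk patients who survive that time; censored observations leave the curve flat but shrink the risk set. A hand-rolled sketch on the five observations above (not sksurv's implementation, and ignoring tied event times):

```python
import numpy as np

times  = np.array([8.0, 10.0, 20.0, 25.0, 59.0])
events = np.array([True, True, True, False, True])  # False = right-censored

order = np.argsort(times)
times, events = times[order], events[order]

at_risk = len(times)
surv = 1.0
curve = {}
for t, died in zip(times, events):
    if died:                       # censored times don't step the curve down
        surv *= (at_risk - 1) / at_risk
    curve[t] = surv
    at_risk -= 1

print(round(curve[20.0], 3))  # 0.4
```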
The estimated curve is a step function, with steps occurring at time points where one or more
patients died. From the plot we can see that most patients died in the first 200 days, as indicated
by the steep slope of the estimated survival function in the first 200 days.
Patients enrolled in the Veterans’ Administration Lung Cancer Trial were randomized to one of
two treatments: standard and a new test drug. Next, let’s have a look at how many patients
underwent the standard treatment and how many received the new drug.
[Figure 7.4: est. probability of survival Ŝ(t), Kaplan-Meier estimate for the full sample]
data_x["Treatment"].value_counts()
## standard 69
## test 68
## Name: Treatment, dtype: int64
fig, ax = plt.subplots(figsize=(16, 6))
for treatment_type in ("standard", "test"):
    mask_treat = data_x["Treatment"] == treatment_type
    time_treatment, survival_prob_treatment = kaplan_meier_estimator(
        data_y["Status"][mask_treat],
        data_y["Survival_in_days"][mask_treat])
    ax.step(time_treatment, survival_prob_treatment, where="post",
            label="Treatment = %s" % treatment_type)
ax.set_ylabel(r"est. probability of survival $\widehat{S}(t)$")
ax.set_xlabel(r"time $t$")
ax.legend(loc="best")
ax.grid(True)
plt.show()
Unfortunately, the results are inconclusive: the difference between the two estimated survival functions is too small to say with confidence whether the drug affects survival.
Next, let’s have a look at the cell type, which has been recorded as well, and repeat the analysis
from above.
[Figure 7.5: est. probability of survival Ŝ(t), by treatment group]
fig, ax = plt.subplots(figsize=(16, 6))
for value in data_x["Celltype"].unique():
    mask = data_x["Celltype"] == value
    time_cell, survival_prob_cell = kaplan_meier_estimator(
        data_y["Status"][mask],
        data_y["Survival_in_days"][mask])
    ax.step(time_cell, survival_prob_cell, where="post",
            label="%s (n = %d)" % (value, mask.sum()))
ax.set_ylabel(r"est. probability of survival $\widehat{S}(t)$")
ax.set_xlabel(r"time $t$")
ax.legend(loc="best")
ax.grid(True)
plt.show()
[Figure 7.6: est. probability of survival Ŝ(t), by cell type]
In this case, we observe a pronounced difference between the groups. Patients with squamous or large cells seem to have a better prognosis compared to patients with small or adeno cells.
In the Kaplan-Meier approach used above, we estimated multiple survival curves by dividing the dataset into smaller sub-groups according to a variable. If we want to consider more than 1 or 2 variables, this approach quickly becomes infeasible, because the subgroups get very small. Instead, we can use a linear model, Cox's proportional hazards model, to estimate the impact each variable has on survival.
First however, we need to convert the categorical variables in the data set into numeric values.
# OneHotEncoder here is assumed to come from scikit-survival's preprocessing module
from sksurv.preprocessing import OneHotEncoder

sc = OneHotEncoder()
sc.set_output(transform="pandas")
data_x_numeric = sc.fit_transform(data_x)
data_x_numeric.head()
Survival models in scikit-survival follow the same rules as estimators in scikit-learn, i.e., they
have a fit method, which expects a data matrix and a structured array of survival times and
binary event indicators.
from sksurv.linear_model import CoxPHSurvivalAnalysis

estimator = CoxPHSurvivalAnalysis()
res = estimator.fit(data_x_numeric, data_y)
The result is a vector of coefficients, one for each variable, where each value corresponds to the
log hazard ratio.
# Coefficients
coef = pd.Series(res.coef_, index=data_x_numeric.columns).to_frame("coefficients")
coefficients
Age_in_years -0.0085
Celltype=large -0.7887
Celltype=smallcell -0.3318
Celltype=squamous -1.1883
Karnofsky_score -0.0326
Months_from_Diagnosis -0.0001
Prior_therapy=yes 0.0723
Treatment=test 0.2899
Using the fitted model, we can predict a patient-specific survival function by passing an appropriate data matrix to the estimator's predict_survival_function method.
x_new = pd.DataFrame.from_dict({
1: [65, 0, 0, 1, 60, 1, 0, 1],
2: [65, 0, 0, 1, 60, 1, 0, 0],
3: [65, 0, 1, 0, 60, 1, 0, 0],
4: [65, 0, 1, 0, 60, 1, 0, 1]},
columns=data_x_numeric.columns, orient='index')
# Prediction
pred_surv = estimator.predict_survival_function(x_new)
time_points = np.arange(1, 1000)
fig, ax = plt.subplots(figsize=(16, 6))
for i, surv_func in enumerate(pred_surv):
    ax.step(time_points, surv_func(time_points), where="post",
            label="Sample %d" % (i + 1))
ax.set_ylabel(r"est. probability of survival $\widehat{S}(t)$")
ax.set_xlabel(r"time $t$")
ax.legend(loc="best")
ax.grid(True)
plt.show()
[Figure 7.7: est. probability of survival Ŝ(t) for the four new samples]
7.5 Mosaic plots

Mosaic charts can display the relationship between categorical variables using rectangles whose areas represent the proportion of cases for any given combination of levels. The color of the tiles can also indicate the degree of relationship among the variables.
The plotnine package can't create a mosaic plot. However, you can create one with the mosaic function in the statsmodels package.
People are fascinated with the Titanic (or is it with Leo?). In the Titanic disaster, what role did
sex and class play in survival? We can visualize the relationship between these three categorical
variables using the code below.
# Mosaic plot
from statsmodels.graphics.mosaicplot import mosaic
# The call below is a plausible reconstruction of the plotting step lost
# in extraction: tile by class, then sex, then survival
fig, _ = mosaic(titanic, ["pclass", "sex", "survived"])
plt.show()
[Figure: mosaic plot of the Titanic data, survival by passenger class and sex]
8 Other Graphs

The graphs in this chapter can be very useful, but don't fit easily within the other chapters.
8.1 3-D scatterplot

The plotnine package can't create a 3-D plot. However, you can create a 3-D scatterplot with the scatter function in the matplotlib package.
Let’s say that we want to plot automobile mileage vs. engine displacement vs. car weight using
the data in the mtcars dataframe.
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')
ax.scatter(mtcars.disp, mtcars.wt, mtcars.mpg, marker= "o",color="black");
ax.set_xlabel('disp');
ax.set_ylabel('wt')
ax.set_zlabel('mpg')
ax.set_title("3-D Scatterplot Example 1")
plt.show()
Now let's modify the graph by replacing the points with filled blue circles, adding drop lines to the x-y plane, and creating more meaningful labels.
import numpy as np
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')
ax.scatter(mtcars.disp, mtcars.wt, mtcars.mpg, marker= "o",color="blue");
z2 = np.ones(shape=mtcars.mpg.shape[0])*min(mtcars.mpg)
for i, j, k, h in zip(mtcars.disp,mtcars.wt,mtcars.mpg,z2):
ax.plot([i,i],[j,j],[k,h],color="blue");
ax.set_xlabel("Displacement (cu. in.)");
ax.set_ylabel("Weight (lb/1000)")
ax.set_zlabel("Miles/(US) Gallon")
[Figures: 3-D scatterplot of mpg vs. disp and wt (Example 1), and the same plot with blue points, drop lines to the x-y plane, and descriptive axis labels]
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(mtcars.disp, mtcars.wt, mtcars.mpg, marker="o", color="blue")
z2 = np.ones(shape=mtcars.mpg.shape[0]) * min(mtcars.mpg)
# Drop lines from each point to the x-y plane
for i, j, k, h in zip(mtcars.disp, mtcars.wt, mtcars.mpg, z2):
    ax.plot([i, i], [j, j], [k, h], color="blue")
# Label each point with the car name
for i, name in enumerate(mtcars.name):
    ax.text(mtcars.disp[i], mtcars.wt[i], mtcars.mpg[i], name,
            color="black", fontsize=7)
[Figure 8.3: 3-D scatterplot with vertical lines and point labels]
Almost there. As a final step, we will add information on the number of cylinders in each car.
To do this, we’ll add a column to the mtcars dataframe indicating the color for each point.
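The color column can be built with a simple mapping from cylinder count to color. A sketch on a hypothetical subset; the palette is arbitrary:

```python
import pandas as pd

cars = pd.DataFrame({"cyl": [4, 6, 8, 4]})     # hypothetical subset of mtcars
color_map = {4: "blue", 6: "green", 8: "red"}  # arbitrary palette
cars["color"] = cars["cyl"].map(color_map)
print(list(cars["color"]))  # ['blue', 'green', 'red', 'blue']
```

The resulting column can then be passed to scatter's color argument so each point is colored by its cylinder count.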
[Figure 8.4: 3-D scatterplot with vertical lines and point labels and legend]
We can easily see that the car with the highest mileage (Toyota Corolla) has low engine dis-
placement, low weight, and 4 cylinders.
8.2 Biplots
A biplot is a specialized graph that attempts to represent the relationship between observations,
between variables, and between observations and variables, in a low (usually two) dimensional
space.
## Perform PCA (PCA and summaryPCA come from the scientisttools package)
X = mtcars.drop(columns=['name'])
res = PCA(normalize=True, row_labels=mtcars.name, col_labels=X.columns).fit(X)
summaryPCA(res, to_markdown=False)
##
## d(i,G) p(i) I(i,G) ... Dim.3 ctr cos2
## ...
## name <float64> <float64> <float64> ... <float64> <float64> <float64>
## Mazda RX4 2.234 0.031 0.156 ... -0.601 1.801 0.072
## Mazda RX4 Wag 2.081 0.031 0.135 ... -0.382 0.728 0.034
## Datsun 710 2.987 0.031 0.279 ... -0.241 0.290 0.007
## Hornet 4 Drive 2.521 0.031 0.199 ... -0.136 0.092 0.003
## Hornet Sportabout 2.456 0.031 0.189 ... -1.134 6.412 0.213
## Valiant 3.014 0.031 0.284 ... 0.164 0.134 0.003
## Duster 360 3.187 0.031 0.317 ... -0.363 0.656 0.013
## Merc 240D 2.841 0.031 0.252 ... 0.944 4.439 0.110
## Merc 230 3.733 0.031 0.435 ... 1.797 16.094 0.232
## Merc 280 1.907 0.031 0.114 1.493 11.103 0.613
##
## [10 rows x 12 columns]
##
## Continuous variables
##
## Dim.1 ctr cos2 ... Dim.3 ctr cos2
## <float64> <float64> <float64> ... <float64> <float64> <float64>
## mpg -0.932 13.143 0.869 ... -0.179 5.096 0.032
## cyl 0.961 13.981 0.924 ... -0.139 3.073 0.019
## disp 0.946 13.556 0.896 ... -0.049 0.378 0.002
## hp 0.848 10.894 0.720 ... 0.111 1.960 0.012
## drat -0.756 8.653 0.572 ... 0.128 2.598 0.016
## wt 0.890 11.979 0.792 ... 0.271 11.684 0.073
## qsec -0.515 4.018 0.266 ... 0.319 16.255 0.102
## vs -0.788 9.395 0.621 ... 0.340 18.388 0.115
## am -0.604 5.520 0.365 ... -0.163 4.234 0.027
## gear -0.532 4.281 0.283 ... 0.229 8.397 0.053
## carb 0.550 4.580 0.303 0.419 27.936 0.175
##
## [11 rows x 9 columns]
# Individuals plots
from scientisttools.pyplot import plotPCA
fig, axe = plt.subplots(figsize=(6,6))
plotPCA(res,choice = "ind",repel=True,ax=axe)
plt.show()
# Variables plots
fig, axe = plt.subplots(figsize=(6,6))
plotPCA(res,choice = "var",repel=True,ax=axe)
plt.show()
It's easiest to see how this works with an example. Let's create a biplot for the mtcars dataset.
# Biplots
fig = plt.figure(figsize=(6,6))
[Figures: PCA individuals plot and variables plot for mtcars, Dim.1 (60.08%) vs. Dim.2 (24.1%)]
axe1 = fig.add_subplot(111)
axe2 = axe1.twiny()
axe2 = axe2.twinx()
axe1.set_title("Biplot of mtcars data")
axe1.set_xlabel("Dim1 (60.1%)")
axe1.set_ylabel("Dim2 (24.1%)")
plt.axhline(0, color="blue", linestyle="--", linewidth=0.5)
plt.axvline(0, color="blue", linestyle="--", linewidth=0.5)
axe1.scatter(res.row_coord_[:, 0], res.row_coord_[:, 1], c="black", alpha=1)
# Add row labels
for i, name in enumerate(mtcars.name):
    axe1.text(res.row_coord_[i, 0], res.row_coord_[i, 1], name,
              color="black", fontsize=10)
# Add columns
for j in range(X.shape[1]):
    axe2.arrow(0, 0, res.col_coord_[j, 0], res.col_coord_[j, 1],
               head_width=0.02, head_length=0.02, color="blue", linestyle="--")
# Add column labels
for j, col in enumerate(X.columns):
    axe2.text(res.col_coord_[j, 0], res.col_coord_[j, 1], col,
              color="blue", fontsize=10)
plt.show()
[Figure: biplot of the mtcars data, observations and variable vectors on Dim1 (60.1%) and Dim2 (24.1%)]
Dim1 and Dim2 are the first two principal components: linear combinations of the original p variables.
The weights of these linear combinations (a_ij) are chosen to maximize the variance accounted for in the original variables. Additionally, the principal components (PCs) are constrained to be uncorrelated with each other.
In this graph, the first PC accounts for 60% of the variability in the original data. The second
PC accounts for 24%. Together, they account for 84% of the variability in the original 𝑝 = 11
variables.
As you can see, both the observations (cars) and variables (car characteristics) are plotted in
the same graph.
• Points represent observations. Smaller distances between points suggest similar values
on the original set of variables. For example, the Toyota Corolla and Honda Civic
are similar to each other, as are the Chrysler Imperial and Lincoln Continental.
However, the Toyota Corolla is very different from the Lincoln Continental.
• The vectors (arrows) represent variables. The angle between two vectors reflects the correlation between the variables: smaller angles indicate stronger positive correlations. For example, gear and am are positively correlated, gear and qsec are uncorrelated (90 degree angle), and am and wt are negatively correlated (angle greater than 90 degrees).
• The observations that are farthest along the direction of a variable's vector have the highest values on that variable. For example, the Toyota Corolla and Honda Civic have higher values on mpg. The Toyota Corona has a higher qsec. The Duster 360 has more cylinders.
Care must be taken in interpreting biplots. They are only accurate when the percentage of
variance accounted for is high. Always check your conclusion with the original data.
See the article by Forrest Young to learn more about interpreting biplots correctly.
8.3 Bubble charts

A bubble chart is basically a scatterplot where the point size is proportional to the value of a third quantitative variable.
Using the mtcars dataset, let’s plot car weight vs. mileage and use point size to represent
horsepower.
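In matplotlib terms, a bubble chart is just scatter with the s argument mapped to the third variable. A sketch on three hypothetical mtcars rows:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

cars = pd.DataFrame({"wt": [2.62, 3.44, 5.25],   # hypothetical rows
                     "mpg": [21.0, 19.2, 10.4],
                     "hp": [110, 123, 205]})
fig, ax = plt.subplots()
ax.scatter(cars.wt, cars.mpg, s=cars.hp, alpha=0.5)  # s maps hp to marker area
ax.set_xlabel("Weight (1000 lbs)")
ax.set_ylabel("Miles/(US) gallon")
```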
[Figure: bubble chart of mpg vs. wt, with point size mapped to hp]
We can improve the default appearance by increasing the size of the bubbles, choosing a different
point shape and color, and adding some transparency.
The range parameter in the scale_size_continuous function specifies the minimum and max-
imum size of the plotting symbol. The default is range = (1, 6).
[Figure: improved bubble chart of Miles/(US) gallon vs. Weight (1000 lbs)]
The shape option in the geom_point function specifies a circle with a border color and a fill color.
Clearly, miles per gallon decreases with increased car weight and horsepower. However, there is
one car with low weight, high horsepower, and high gas mileage. Going back to the data, it’s
the Lotus Europa.
Bubble charts are controversial for the same reason that pie charts are controversial. People are
better at judging length than volume. However, they are quite popular.
8.4 Flow diagrams

A flow diagram represents a set of dynamic relationships. It usually captures the physical or metaphorical flow of people, materials, communications, or objects through a set of nodes in a network.
A Sankey diagram is used to display the flow of some property from various sources to destinations.
A simple diagram has a few source nodes plotted at the beginning and a few destination nodes at the end. Arrows (links) represent the flow of the property from sources to destinations, one per source-destination combination, with the width of each link proportional to the amount of property flowing along it.
Sankey diagrams can also have intermediate nodes (plotted between the source and destination nodes) when the path from source to destination involves multiple stops (e.g., the journey of users across website pages).
Sankey diagrams and variations such as alluvial diagrams are commonly used for purposes like analyzing population migration, website user journeys, the flow of energy, the flow of other properties (oil, gas, etc.), research paper citations, and so on. Google Analytics uses an alluvial diagram to show users' journeys on a website (sessions).
We'll use the holoviews package to build Sankey diagrams. Holoviews is a high-level library that lets us specify chart metadata and then creates charts using one of its backends (Bokeh, Matplotlib, or Plotly). It takes a pandas dataframe as a dataset and lets us create charts from it.
We'll be using the New Zealand migration dataset, which is available on Kaggle for download. The dataset records the number of people who departed from and arrived in New Zealand from all continents and countries of the world from 1979 to 2016. We'll aggregate this data in various ways to create different Sankey diagrams. We suggest that you download the dataset to follow along.
# Load dataset
nz_migration = pd.read_csv("./donnee/migration_nz.csv")
After loading the dataset, we have performed a few steps of data cleaning and aggregation as
mentioned below.
After performing the above steps, we’ll have a dataset where we’ll have information about arrivals
and departure counts from each country and continent of all time.
import holoviews as hv
hv.extension('bokeh')
For our first Sankey diagram, we need to filter entries of the dataframe to keep only entries where
the count for each continent is present. Below we are filtering the dataset based on continent
names to remove all other entries.
8.4.2.2.1 Simple Sankey Diagram

We can plot a Sankey diagram very easily with holoviews by passing it the dataframe prepared above. Holoviews needs a dataframe with at least three columns: it treats the first column as the source, the second as the destination, and the third as the value of the property flowing from source to destination. It plots a link for each source-destination combination.
We can notice from the above chart that all links have a gray color. We can color them in two
ways.
1. Source Node Color - It’ll color links with the same color as the source node from which
they originated. This helps us better understand the flow from source to destination.
2. Destination Node Color - It’ll color links with the same color as the destination node
to which they are going. This helps us better understand reverse flow from destination to
source.
You can color the links according to your needs; below we show how to color them by destination node.
8.4.2.2.2 Coloring Edges as Per Destination Node Color

Below we create the same Sankey plot again, but this time specify which columns of the dataframe to use as source and destination via the kdims parameter, and which column to use for the property-flow arrow sizes via the vdims parameter.

We also pass various parameters to the opts() method called on the Sankey plot object, which further improves the styling of the diagram. These include the colormap to use for nodes and edges, the label position, the column to use for edge color, the edge line width, the node opacity, the graph width and height, the background color, and the title, all of which make the graph more readable and aesthetically pleasing.
Alluvial diagrams are a subset of Sankey diagrams, and are more rigidly defined. A discussion of the differences can be found here.
# Alluvial
import alluvial
from matplotlib import colormaps
[Figure 8.11: Sankey diagram, coloring edges as per destination node color]
import numpy as np
# Plotting (input_data and seed are assumed to have been defined earlier)
cmap = colormaps['jet']
ax = alluvial.plot(
    input_data, alpha=0.4, color_side=1, rand_seed=seed, figsize=(7, 5),
    disp_width=True, wdisp_sep=' ' * 2, cmap=cmap, fontname='Monospace',
    labels=('Capitals', 'Double Capitals'), label_shift=2)
ax.set_title('Utility display', fontsize=14, fontname='Monospace')
plt.show()
[Figure: alluvial diagram "Utility display", linking Capitals to Double Capitals]
8.5 Heatmaps
A heatmap displays a set of data using colored tiles for each variable value within each observation.
First, let’s create a heatmap for the mtcars dataset. The mtcars dataset contains information
on 32 cars measured on 11 variables.
# Basic heatmap
import seaborn as sns
mtcars2 = mtcars.set_index('name')
mtcars2 = (mtcars2 >> select_if("is_numeric"))
def scale(x):
return ((x - x.mean())/x.std(ddof=0))
fig, axe = plt.subplots(figsize=(16,6))
sns.heatmap(mtcars2.transform(scale),ax=axe)
plt.show()
[Figure: heatmap of the standardized mtcars data, cars by variables]
# Clustermap
sns.clustermap(mtcars2,figsize=(16,6),row_cluster=True,
dendrogram_ratio=(.1, .2),cbar_pos=(0, .2, .03, .4))
8.6 Radar charts

A radar chart (also called a spider or star chart) displays one or more groups or observations on three or more quantitative variables.
# Radar charts
from matplotlib.patches import Circle, RegularPolygon
from matplotlib.path import Path
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.spines import Spine
from matplotlib.transforms import Affine2D
[Figure: clustermap of the mtcars data with row dendrogram]
def radar_factory(num_vars, frame='circle'):
    """Create a radar chart with num_vars axes.

    Parameters
    ----------
    num_vars : int
        Number of variables for radar chart.
    frame : {'circle' | 'polygon'}
        Shape of frame surrounding axes.
    """
    # calculate evenly-spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)

    class RadarAxes(PolarAxes):
        name = 'radar'
        def _gen_axes_patch(self):
            # The Axes patch must be centered at (0.5, 0.5) and of radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)

        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0) but we want a polygon of radius 0.5 centered at (0.5,
                # 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)
                return {'polar': spine}

    register_projection(RadarAxes)
    return theta
data = [['Sulfate', 'Nitrate', 'EC', 'OC1', 'OC2', 'OC3', 'OP', 'CO', 'O3'],
        ('Basecase', [
            [0.88, 0.01, 0.03, 0.03, 0.00, 0.06, 0.01, 0.00, 0.00],
            [0.07, 0.95, 0.04, 0.05, 0.00, 0.02, 0.01, 0.00, 0.00],
            [0.01, 0.02, 0.85, 0.19, 0.05, 0.10, 0.00, 0.00, 0.00],
            [0.02, 0.01, 0.07, 0.01, 0.21, 0.12, 0.98, 0.00, 0.00],
            [0.01, 0.01, 0.02, 0.71, 0.74, 0.70, 0.00, 0.00, 0.00]])]
N = len(data[0])
theta = radar_factory(N, frame='polygon')
spoke_labels = data.pop(0)
title, case_data = data[0]

# draw the chart on the 'radar' projection registered above
fig, ax = plt.subplots(subplot_kw=dict(projection='radar'))
ax.set_title(title)
for d in case_data:
    ax.plot(theta, d)
    ax.fill(theta, d, alpha=0.25)
ax.set_thetagrids(np.degrees(theta), spoke_labels)
Figure 8.15: Basic radar chart
# Using plotly
import plotly.graph_objects as go
from plotnine.data import msleep

# `plotdata` (a 'group' column plus the rescaled quantitative variables) and
# `categories` (the variable names) are assumed to be built from msleep in
# code not shown in this excerpt.
fig = go.Figure()
for g in ["Cow", "Dog", "Pig"]:
    df = plotdata[plotdata["group"] == g].drop(columns="group").values[0]
    fig.add_trace(go.Scatterpolar(r=df, theta=categories, fill='toself', name=g))
fig.update_layout(
    polar=dict(radialaxis=dict(visible=True, range=[0, 1])),
    showlegend=True)
# Save figure - run once
fig.write_image("./figure/radar.png")
A waterfall chart illustrates the cumulative effect of a sequence of positive and negative values.
(Figure, from section 8.7 Scatterplot matrix: pairwise scatterplots of the msleep variables log_brainwt, log_bodywt, sleep_total, and sleep_rem)
# Create dataset
income = pd.DataFrame({
    'category': ["Sales", "Services", "Fixed Costs", "Variable Costs", "Taxes"],
    'num': [101000, 52000, -23000, -15000, -10000]})
category num
Sales 101000
Services 52000
Fixed Costs -23000
Variable Costs -15000
Taxes -10000
Now we can visualize this with a waterfall chart, using the waterfall function.
# Waterfall chart
waterfall(income, 'category', 'num')
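The waterfall helper used above is defined earlier in the book; if it is not available, a minimal version can be sketched with matplotlib (the plot_waterfall name and color choices here are hypothetical, not the book's implementation):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

def plot_waterfall(df, cat_col, num_col):
    # Hypothetical stand-in for the book's waterfall() helper:
    # floating bars whose bottoms sit at the running total so far.
    amounts = df[num_col]
    bottoms = amounts.cumsum().shift(1).fillna(0)
    colors = ["#2ca02c" if v >= 0 else "#d62728" for v in amounts]
    fig, ax = plt.subplots()
    ax.bar(df[cat_col], amounts, bottom=bottoms, color=colors)
    ax.set_ylabel(num_col)
    return fig, ax

income = pd.DataFrame({
    'category': ["Sales", "Services", "Fixed Costs", "Variable Costs", "Taxes"],
    'num': [101000, 52000, -23000, -15000, -10000]})
fig, ax = plot_waterfall(income, 'category', 'num')
```

The running total ends at 105,000, which is the final bar's top in the chart above.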
(Figure: waterfall chart showing Sales 101,000, then Services +52,000, Fixed Costs -23,000, Variable Costs -15,000, and Taxes -10,000)
A word cloud (also called a tag cloud) is an infographic that depicts the frequency
of words in a collection of text (e.g., tweets, a text document, a set of text documents). Here is
the text, an excerpt (in French) from an article on data science:
le secteur public. C’est la raison pour laquelle les données sont souvent considérées
comme le pétrole du XXIème siècle. En s’appuyant sur ces découvertes, il est possi-
ble de créer de nouveaux produits et services innovants, de résoudre des problèmes
concrets, d’améliorer ses performances comme jamais auparavant. La Data Science
permet de prendre des décisions basées sur les données, plutôt que sur une simple
intuition. Ainsi, elle révolutionne notre quotidien et nous permet de s’ouvrir à de
nouveaux horizons. En bref, la data science représentera une science incontournable
du monde demain ! Comment fonctionne la data science ? La Data Science couvre
une large variété de disciplines et de champs d’expertise. Son but reste toutefois
de donner du sens aux données brutes. Pour y parvenir, les Data Scientists doivent
posséder des compétences en ingénierie des données, en mathématiques, en statis-
tique, en informatique et en Data Visualization. Ces compétences leur permettront
de parcourir les vastes ensembles de données brutes pour en dégager les informations
les plus pertinentes et les communiquer aux décideurs de leurs organisations. Les
Data Scientists exploitent également l’intelligence artificielle, et plus particulière-
ment le Machine Learning et le Deep Learning. Ces technologies sont utilisées pour
créer des modèles et réaliser des prédictions en utilisant des algorithmes et diverses
techniques. Dans un premier temps, les données doivent être collectées, extraites
à partir de différentes sources. Il s’agit ensuite de les entreposer dans une Data
Warehouse, de les nettoyer, de les transformer afin qu’elles puissent être analysées.
L’étape suivante est celle du traitement des données, par le biais du Data Mining
(forage de données), du clustering, de la classification ou de la modélisation. Les
données sont ensuite analysées à l’aide de techniques comme l’analyse prédictive, la
régression ou le text mining. Enfin, la dernière étape consiste à communiquer les
informations dégagées par le biais du reporting, du dashboarding ou de la Data Vi-
sualization. Les cas d’usage et applications Les cas d’usage de la Data Science sont
aussi nombreux que variés. Cette technologie est utilisée pour assister la prise de
décision en entreprise, mais permet aussi l’automatisation de certaines tâches. Elle
est utilisée à des fins de détection d’anomalies ou de fraude. La science des données
permet aussi la classification, par exemple pour trier automatiquement les emails
dans votre boîte. Elle permet aussi la prédiction, par exemple pour les ventes ou les
revenus. En l’utilisant, il est possible de détecter des tendances ou des patterns dans
les ensembles de données. La Data Science se cache aussi derrière les technologies
de reconnaissance faciale, vocale ou textuelle. Elle alimente aussi les moteurs de
recommandations capables de vous suggérer des produits ou du contenu en fonction
de vos préférences. D’un secteur d’activité à l’autre, la Data Science est exploitée
de différentes manières. Dans le domaine de la santé, les données permettent au-
jourd’hui de mieux comprendre les maladies, de recourir à la médecine préventive,
d’inventer de nouveaux traitements ou d’accélérer les diagnostics. En logistique, la
Data Science aide à optimiser les itinéraires et les opérations internes en temps réel
en tenant compte de facteurs comme la météo ou le trafic. Dans la finance, elle
permet d’automatiser le traitement des données d’accords de crédit grâce au Traite-
ment Naturel du Langage (Vous n’êtes pas familier avec ce concept, découvrez le
NLP dans notre article dédié) ou de détecter la fraude grâce au Machine Learn-
ing. Les entreprises de retail l’utilisent pour le ciblage publicitaire et le marketing
personnalisé. Les moteurs de recommandations, basés sur l’analyse des préférences
du consommateur, sont utilisés par Google pour son moteur de recherche web, par
les plateformes de streaming comme Netflix ou Spotify, et par les entreprises de e-
commerce comme Amazon. Les entreprises de cybersécurité se tournent vers l’IA et
la science des données pour découvrir de nouveaux malwares au quotidien. Même les
voitures autonomes reposent sur la Data Science et l’analyse prédictive pour ajuster
leur vitesse, éviter les obstacles et les changements de voie dangereux ou choisir
l’itinéraire le plus rapide.”
The stop words excluded from the cloud:

exclure_mots = ['d', 'du', 'de', 'la', 'des', 'le', 'et', 'est', 'elle', 'une',
                'en', 'que', 'aux', 'qui', 'ces', 'les', 'dans', 'sur', 'l',
                'un', 'pour', 'par', 'il', 'ou', 'à', 'ce', 'a', 'sont', 'cas',
                'plus', 'leur', 'se', 's', 'vous', 'au', 'c', 'aussi',
                'toutes', 'autre', 'comme']
# Word cloud of the passage above (text), omitting the stop words
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = (
    WordCloud(background_color='white', stopwords=exclure_mots, max_words=50)
    .generate(text)
)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
9 Customizing graphs
Graph defaults are fine for quick data exploration, but when you want to publish your results
to a blog, paper, article, or poster, you'll probably want to customize the graphs. Customization
can improve the clarity and attractiveness of a graph.
This chapter describes how to customize a graph's axes, gridlines, colors, fonts, labels, and
legend. It also describes how to add annotations (text and lines).
9.1 Axes
The x-axis and y-axis represent numeric, categorical, or date values. You can modify the default
scales and labels with the functions below.
print((
    mpg >> ggplot(aes(x="displ", y="hwy")) +
    geom_point() +
    scale_x_continuous(breaks=range(1, 8, 1), limits=[1, 7]) +
    scale_y_continuous(breaks=range(10, 46, 5), limits=[10, 45])
))
(Figure: scatterplot of hwy vs. displ with the customized axis breaks and limits)
The mizani package provides a number of functions for formatting numeric labels. Some of the
most useful are:
• currency_format
• comma_format
• percent_format
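What these formatters produce can be illustrated with plain Python string formatting (a sketch of the conventions only; the exact output and function names in mizani may vary by version):

```python
# Approximations of the mizani-style label formatters
def comma(x):
    return f"{x:,.0f}"    # thousands separators

def percent(x):
    return f"{x:.0%}"     # fraction -> percent

def currency(x):
    return f"${x:,.2f}"   # dollar amounts

print(comma(150000))   # 150,000
print(percent(0.25))   # 25%
print(currency(1000))  # $1,000.00
```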
np.random.seed(1234)
df = pd.DataFrame({
    "xaxis": np.random.normal(loc=100000, scale=50000, size=50),
    "yaxis": np.random.uniform(low=0.0, high=1.0, size=50),
    "pointsize": np.random.normal(loc=1000, scale=1000, size=50)
})
(Figure: scatterplot of yaxis vs. xaxis with percent-formatted y labels, comma-formatted x labels, and a currency-formatted pointsize legend)
A categorical axis is modified using the scale_x_discrete or scale_y_discrete function. Options include:
• limits - a list giving the levels of the categorical variable in the desired order
• labels - an optional list of labels for those levels
A date axis is modified using the scale_x_date or scale_y_date function. Options include:
• date_breaks - a string giving the distance between breaks, like "2 weeks" or "10 years"
• date_labels - a string giving the formatting specification for the labels
The label specifications use Python's standard strftime format codes.
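The date formatting specifications are Python's standard strftime codes; a quick check with the standard library:

```python
from datetime import date

d = date(2005, 1, 15)
print(d.strftime("%Y"))        # 2005 (four-digit year)
print(d.strftime("%Y-%m-%d"))  # 2005-01-15
# %b (the abbreviated month name) is locale-dependent,
# e.g. 'Jan' in an English locale or 'janv.' in a French one.
```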
(Figures: bar chart of counts by vehicle class with wrapped axis labels, and a time series of unemploy by date with customized date breaks and month-year labels)
9.2 Colors
The default colors in plotnine graphs are functional, but often not as visually appealing as
they could be. Happily, this is easy to change.
Specific colors can be specified by name (e.g., "steelblue") or by RGB hex value (e.g., "#4682B4").
To specify a color for points, lines, or text, use the color = "colorname" option in the appropriate geom. To specify a color for bars and areas, use the fill = "colorname" option.
Examples:
• geom_point(color = "blue")
• geom_bar(fill = "steelblue")
9.2.2.1 ColorBrewer
The most popular alternative palettes are probably the ColorBrewer palettes.
You can specify these palettes with the scale_color_brewer and scale_fill_brewer functions.
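The ColorBrewer palettes also ship with matplotlib as named colormaps, so you can inspect one directly (a matplotlib-only sketch, assuming matplotlib 3.5+ for the colormaps registry):

```python
import matplotlib

cmap = matplotlib.colormaps["Set2"]  # a qualitative ColorBrewer palette
print(cmap.N)    # number of distinct colors in the palette
print(cmap(0))   # first color as an RGBA tuple
```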
(Figures: stacked bar charts of diamond cut by clarity, drawn with two different ColorBrewer palettes)
9.2.2.2 Viridis
The viridis palettes are perceptually uniform, print well in grayscale, and remain legible to viewers with color-vision deficiencies.
(Figure: stacked bar chart of diamond cut by clarity using a viridis palette)
9.3.1 Points
For plotnine graphs, the default point is a filled circle. To specify a different shape, use the
shape = # option in the geom_point function. To map shapes to the levels of a categorical
variable, use the shape = variablename option in the aes function.
Examples:
• geom_point(shape = 1)
• geom_point(aes(shape=sex))
Filled shapes provide for both a fill color and a border color.
9.3.2 Lines
The default line type is a solid line. To change the linetype, use the linetype = # option in the
geom_line function. To map linetypes to the levels of a categorical variable, use the linetype
= variablename option in the aes function.
Examples:
• geom_line(linetype=1)
• geom_line(aes(linetype=sex))
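plotnine's linetypes correspond to matplotlib line styles underneath; the named styles can be previewed directly (a matplotlib-only sketch, independent of plotnine):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# draw one horizontal line per named style
for i, style in enumerate(["solid", "dashed", "dashdot", "dotted"]):
    ax.plot([0, 1], [i, i], linestyle=style, label=style)
ax.legend()
```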
9.4 Fonts