
Chart Conversion Describe

Uploaded by

arun
Copyright
© All Rights Reserved

Contents

Numeric data:
  One numeric variable
    Histogram
    Density Plot
  Two numeric variables
    NOT Ordered data
    Ordered Data
  Three numeric variables
    Not Ordered data
    Ordered Data
  Several numeric variables
    Ordered Data
    Not Ordered Data
Categorical data:
  One categorical value
    Bar Plot
    Lollipop Chart
    Word Cloud
    Pie Chart & Donut
    Tree Map
    Circular Packing or Packed bubble
  Two or more categorical values
    Nested
    Subgroup
    Adjacency
Numeric and Categorical data
  One numeric, one categorical
    One observation per group
    Several observations per group
  One categorical, several numerical
    No order
    A number is ordered
    One value per group
  Several Categorical, One numerical
    Subgroup
    Nested
    Adjacency

Data can be of three types: Numeric, Categorical, or Numeric & Categorical.

The analyses that can be performed on this data fall under these categories:
Distribution, Correlation, Ranking, Part of a whole, Evolution, Flow.

We are going to set out a decision tree that starts from the data and ends in a choice of visualizations.

Numeric data:
One numeric variable
A very simple dataset with only one column, composed of numbers.

Variable 1
3
2
6
2

Type of analysis    Choice of charts
Distribution        Histogram, Density Plot

Histogram
A histogram is an accurate graphical representation of the distribution of a numeric variable. It takes
as input numeric variables only. The variable is cut into several bins, and the number of observations
per bin is represented by the height of the bar.
Histograms are used to study the distribution of one or a few variables. Checking the distribution of
your variables one by one is probably the first task you should do when you get a new dataset. It
delivers a good quantity of information. Several distribution shapes exist; the 6 most common ones
are:

 Bimodal
 Comb
 Edge peak
 Normal
 Skewed
 Uniform

Checking this distribution also helps you discover mistakes in the data. For example, the comb
distribution can often denote a rounding that has been applied to the variable, or another mistake.
As a second step, histograms allow you to compare the distribution of a few variables. Don’t compare
more than 3 or 4: it would make the figure cluttered and unreadable. This comparison can be done by
showing the variables on the same graphic and using transparency.

Common mistakes
 Try several bin sizes: the choice can lead to very different conclusions.
 Don’t use a weird colour scheme. It does not give any more insight.
 Don’t confuse a histogram with a bar plot. A bar plot gives a value for each group of a categoric
variable. Here, we have only a numeric variable and we check its distribution.
 Don’t compare more than ~3 groups in the same histogram. The graphic gets cluttered and
hard to understand. Instead use a violin plot, a boxplot, a ridgeline plot or small
multiples.
 Avoid using unequal bin widths.
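
To see why the bin size matters, the binning step can be sketched in plain Python. This is only an illustration: the `histogram_counts` helper and the sample values are made up for the example and do not come from any plotting library.

```python
from collections import Counter


def histogram_counts(values, bin_width, start=0.0):
    """Count observations per equal-width bin [edge, edge + bin_width)."""
    # Index of the bin each value falls into, counted from `start`.
    counts = Counter(int((v - start) // bin_width) for v in values)
    # Map each non-empty bin to its left edge.
    return {start + k * bin_width: counts[k] for k in sorted(counts)}


# Hypothetical sample with two clusters (around 2 and around 7).
sample = [1.8, 2.0, 2.1, 2.3, 2.4, 6.6, 6.9, 7.0, 7.2, 7.5]

narrow = histogram_counts(sample, bin_width=1.0)   # two separate groups of bars
wide = histogram_counts(sample, bin_width=10.0)    # a single bar hides the gap
```

With a width of 1 the two clusters stay visible, while a width of 10 collapses everything into one bar, so the bimodal shape disappears.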

Density Plot

A density plot is a representation of the distribution of a numeric variable. It uses a kernel density
estimate to show the probability density function of the variable.
It is a smoothed version of the histogram and is used in the same context.

Density plots are used to study the distribution of one or a few variables. They deliver a good
quantity of information. Several distribution shapes exist, just as for histograms. Like histograms,
density plots allow you to compare the distribution of a few variables.

Common mistakes
 Play with the bandwidth argument: it can lead to very different conclusions.
 Don’t compare more than ~3 groups on the same density plot. The graphic gets cluttered
and hard to understand. Instead use a violin plot, a boxplot, a ridgeline plot or small
multiples.
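
The effect of the bandwidth can be illustrated with a hand-written kernel density estimate. The sketch below assumes a Gaussian kernel, which is a common choice but is not stated in the text above; `gaussian_kde` and the sample are hypothetical names and values used only for this example.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical sample with two clusters (around 2 and around 7).
sample = [1.8, 2.0, 2.1, 2.3, 2.4, 6.6, 6.9, 7.0, 7.2, 7.5]

sharp = gaussian_kde(sample, bandwidth=0.3)    # keeps the two modes apart
smooth = gaussian_kde(sample, bandwidth=3.0)   # merges them into one bump
```

Evaluating both curves between the clusters (for example at 4.5) shows the difference: the small bandwidth leaves a valley there, while the large one puts its peak there.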

Two numeric variables

NOT Ordered data.


Few Points
A dataset composed of two columns, both numeric, with fewer than approximately 2000 data points.

Variable 1 Variable 2
3 1
2 2
6 5
2 4
… …

Type of analysis    Choice of charts
Distribution        Box Plot, Histogram
Correlation         Scatter Plot

Box Plot
A boxplot gives a nice summary of one or more numeric variables. A boxplot is composed of several
elements:
 The line that divides the box into 2 parts represents the median of the data. If the median is
10, it means that there are the same number of data points below and above 10.
 The two ends of the box show the upper (Q3) and lower (Q1) quartiles. If the third quartile is
15, it means that 75% of the observations are lower than 15.
 The difference between quartiles 1 and 3 is called the interquartile range (IQR).
 The extreme lines (whiskers) reach the highest and lowest values excluding outliers, that is, the
most extreme data points lying between Q1 - 1.5 x IQR and Q3 + 1.5 x IQR.
 Dots (or other markers) beyond the extreme lines show potential outliers.
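
The elements listed above can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the `boxplot_stats` helper and the sample are hypothetical, and the "inclusive" quartile method is an assumption, since several quartile conventions exist and plotting tools may differ slightly.

```python
import statistics


def boxplot_stats(values):
    """Compute the numbers a boxplot draws: quartiles, whiskers, outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical sample with one extreme value.
stats = boxplot_stats([2, 3, 4, 5, 6, 7, 8, 9, 10, 40])
```

Here the value 40 lies beyond the upper fence, so it would be drawn as a separate dot and the upper whisker would stop at 10.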

A boxplot can summarize the distribution of a numeric variable for several groups. The problem is
that summarizing also means losing information, and that can be a pitfall. We cannot see the
underlying distribution of dots in each group or their number of observations.

If the amount of data you are working with is not too large, adding jitter on top of your boxplot can
make the graphic more insightful.
If you have a large sample size, using jitter is not an option anymore since dots will overlap, making
the figure uninterpretable. An alternative is the violin plot, which describes the distribution of the
data for each group. A good alternative is a half violin plot showing the raw data.

Code: jbburant/half_violin_with_raw_data
Scatter Plot
A scatterplot displays the relationship between 2 numeric variables. For each data point, the value of
its first variable is represented on the X axis, the second on the Y axis.

A scatterplot is made to study the relationship between 2 variables. Thus, it is often accompanied by
a correlation coefficient calculation, that usually tries to measure the linear relationship.

However, other types of relationship can be detected using scatterplots, and a common task is to
identify a model explaining Y as a function of X. Here are a few patterns you can detect with a
scatterplot.

Scatterplots are sometimes supported by marginal distributions. They add insight to the
graphic, revealing the distribution of both variables:
Common mistakes
 Overplotting is the most common mistake when the sample size is high.
 Don’t forget to show subgroups if you have some. It can reveal important hidden patterns in
your data, as in the case of Simpson’s paradox.
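
The subgroup effect can be checked numerically with the Pearson correlation coefficient, the usual measure of linear relationship. The sketch below is an illustration only: the `pearson` helper is written by hand and the six points are invented so that the overall trend and the within-group trends point in opposite directions.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical points: (x, y, group).
points = [
    (1, 5, "A"), (2, 4, "A"), (3, 3, "A"),
    (6, 10, "B"), (7, 9, "B"), (8, 8, "B"),
]

# Correlation over all points, ignoring the groups.
overall = pearson([p[0] for p in points], [p[1] for p in points])

# Correlation inside each group.
by_group = {}
for group in sorted({p[2] for p in points}):
    subset = [p for p in points if p[2] == group]
    by_group[group] = pearson([p[0] for p in subset], [p[1] for p in subset])
```

The overall coefficient is clearly positive, while each group on its own shows a perfectly decreasing relationship, which is the kind of pattern a scatterplot without subgroup colours would hide.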

Many Points
A dataset composed of two columns, both numeric, with more than approximately 2000 data points.

Variable 1 Variable 2
3 1
2 2
6 5
2 4
… …

Type of analysis    Choice of charts
Distribution        Violin Plot, Density Plot
Correlation         Scatter Plot, 2D Density plot

Violin Plot

A violin plot allows you to visualize the distribution of a numeric variable for one or several groups.
Each ‘violin’ represents a group or a variable. The shape represents the density estimate of the
variable: the more data points in a specific range, the wider the violin is for that range. It is close
to a boxplot but allows a deeper understanding of the distribution.

The violin plot is a powerful data visualization technique since it allows you to compare both the
ranking of several groups and their distribution. Surprisingly, it is less used than the boxplot, even
if it provides more information in my opinion.
Violins are particularly well suited when the amount of data is huge and showing individual
observations becomes impossible. For small datasets, a boxplot with jitter is probably a better option
since it really shows all the information.

Variation
 Violin plots are made vertically most of the time. If you have long labels, building a
horizontal version like the one above makes the labels more readable.
 It is possible to display a boxplot in the violin: it allows you to assess the median and quartiles
at a glance.
 If your variable is grouped, you can build a grouped violin as you would do for a boxplot.
 Ordering groups by median value makes the chart more insightful.
 If you have just a few groups, you are probably interested in ridgeline charts.

2D density plot

One variable is represented on the X axis, the other on the Y axis, as for a scatterplot. Then, the
number of observations within a particular area of the 2D space is counted and represented by a
colour gradient. The shape can vary:

 Hexagons are often used, leading to a hexbin chart.
 Squares make 2D histograms.
 It is also possible to compute a kernel density estimate to get 2D density plots or contour plots.
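
The counting step behind the square version (the 2D histogram) can be sketched as follows. This is a simplified illustration in plain Python: `bin_2d` and the points are hypothetical, and a plotting library would then map each count to a colour.

```python
from collections import Counter


def bin_2d(points, cell_size):
    """Count points per square cell; keys are (column, row) grid indices."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


# Hypothetical (x, y) observations.
points = [(0.2, 0.3), (0.4, 0.1), (0.9, 0.8), (1.5, 0.2), (1.7, 1.9), (0.1, 0.6)]

counts = bin_2d(points, cell_size=1.0)
densest_cell, densest_count = counts.most_common(1)[0]
```

The densest cell is the one that would receive the strongest colour, regardless of how many dots would have overlapped in a scatterplot.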

2D density plots are very useful to avoid overplotting in a scatterplot. Here is an example showing the
difference between an overplotted scatterplot and a 2D density plot. In the second case, an obvious
hidden pattern appears.
Variation
A 2D distribution is one of the rare cases where using 3D can be worth it.

It is possible to transform the scatterplot information into a grid and count the number of data points
at each position of the grid. Then, instead of representing this number by a colour gradient, the
surface plot uses the third dimension: dense areas appear higher than others.

Ordered Data
Dataset composed of 2 columns, all numeric. Data are ordered (Col 1).

Variable 1 Variable 2
1 2.3
2 3.6
3 5
4 1.4
… …

Type of analysis    Choice of charts
Evolution           Area plot, Line plot
Correlation         Connected Scatter Plot

Connected Scatter plot.


A connected scatterplot displays the evolution of a numeric variable. Data points are represented by
a dot and connected by straight line segments. It often shows a trend in data over intervals of time: a
time series. Basically, it is the same as a line plot in most cases, except that individual
observations are highlighted.

A connected scatterplot makes sense in specific conditions where neither the scatterplot nor the line
chart is enough:

 When doing a line chart, it is sometimes difficult to see where the breaks in the curve
are, and thus when the observations were made.

 When your X axis is ordered, you must connect the dots together to get a connected
scatterplot. Indeed, the pattern is much harder to read if the dots are not connected, as illustrated
in the graphic below. Note that this can even lead to misleading conclusions.
Example of connected scatterplot
To cut or not to cut the Y axis?
Whether the Y axis must start at 0 is a hot topic leading to intense debates. The
graphic below presents the same data, starting at 0 (left) or not (right).
Generally, a line plot does not need to start at 0, since not starting at 0 makes
patterns easier to observe.

Three numeric variables

Not Ordered data


Variable 1 Variable 2 Variable 3
1.2 2.3 5.1
4.2 3.6 4.2
3.1 5 2.5
6.4 1.4 6.1
… … …

Type of analysis    Choice of charts
Distribution        Box Plot, Violin Plot
Correlation         Bubble plot, 2D Density, 3D Surface Plot

Bubble Plot

A bubble plot is a scatterplot where a third dimension is added: the value of an additional numeric
variable is represented through the size of the dots.

You need 3 numerical variables as input: one is represented by the X axis, one by the Y axis, and one
by the dot size.
The problem with the bubble plot is that the relationship between the variables of the X and Y axes is
much more obvious than the relationship with the third variable. Thus, you must prioritize your
variables and be sure of what you want to show.

 As for the scatterplot, the bubble plot suffers from overplotting if the sample size is too big.
 Show a legend for bubble size.

2D Density, 3D Surface Plot

A specific use case where three numeric columns are displayed is the grid system. In this case, the
first two columns give the grid coordinates, and the third variable gives a numeric value for each
position of the grid. For example, the volcano data set provides the altitude of each point of the grid
of a volcano:

lat   long   altitude
1     1      100
2     1      101
3     1      102
4     1      103
5     1      104
6     1      105

This kind of data can be represented using a 2D density map. Another way is to build a surface plot. It
really makes sense to use 3D in this special case since it allows you to visualize the real shape of the
volcano:
Ordered Data

Data is composed of 3 columns, all numeric. Data are ordered (Col 1)

Variable 1   Variable 2   Variable 3
1            2.3          5.1
2            3.6          4.2
3            5            2.5
4            1.4          6.1
…            …            …

Type of analysis    Choice of charts
Evolution           Stacked Area, Stream Graph, Line chart, Area chart

Stacked Area chart


A stacked area chart is the extension of a basic area chart. It displays the evolution of the value of
several groups on the same graphic. The values of each group are displayed on top of each other,
which allows you to check on the same figure the evolution of both the total of a numeric variable and
the importance of each group.

 A stacked area graph is appropriate to study the evolution of the whole and the relative
proportions of each group. Indeed, the top of the areas allows you to visualize how the whole
behaves, as for a classic area chart.
 However, stacked area graphs are not appropriate to study the evolution of each individual group:
it is very hard to subtract the height of other groups at each time point. For a more accurate but less
attractive figure, consider a line chart or area chart using small multiples: here each category
has its own section in the graphic. This makes it easy to understand the pattern of each category.
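
The stacking itself is just a running sum at each time point. As a rough sketch (the `stack` helper is a hypothetical name; the values are Variable 2 and Variable 3 from the ordered table above), the lower and upper edge of each band can be computed like this:

```python
def stack(series):
    """Return, for each group, the (lower, upper) edge of its band per time point.

    `series` maps a group name to its values; all lists share the same length.
    """
    n_points = len(next(iter(series.values())))
    baseline = [0.0] * n_points
    bands = {}
    for name, values in series.items():
        # Each group sits on top of everything stacked before it.
        top = [b + v for b, v in zip(baseline, values)]
        bands[name] = list(zip(baseline, top))
        baseline = top
    return bands


series = {
    "Variable 2": [2.3, 3.6, 5, 1.4],
    "Variable 3": [5.1, 4.2, 2.5, 6.1],
}
bands = stack(series)
```

The upper edge of the last band is the total, which is why the whole is easy to read, while the second group has no flat baseline, which is why its own evolution is hard to judge.
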
Variation
A variation of the stacked area graph is the percent stacked area graph. It is the same thing, but
the value of each group is normalized at each time stamp. That allows you to study the percentage of
each group in the whole more efficiently:
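
The normalization divides each value by the total of its time stamp. A minimal sketch, again using Variable 2 and Variable 3 from the ordered table above and a hypothetical `to_percent` helper:

```python
def to_percent(series):
    """Rescale each group so that all groups sum to 100 at every time stamp."""
    names = list(series)
    n_points = len(series[names[0]])
    # Total of all groups at each time stamp (assumed to be non-zero).
    totals = [sum(series[name][i] for name in names) for i in range(n_points)]
    return {
        name: [100.0 * series[name][i] / totals[i] for i in range(n_points)]
        for name in names
    }


series = {
    "Variable 2": [2.3, 3.6, 5, 1.4],
    "Variable 3": [5.1, 4.2, 2.5, 6.1],
}
percent = to_percent(series)
```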

Common caveats
 Use it with care; consider small multiples of area charts or line charts instead.
 The group order (from bottom to top) can have an influence; try several orders to find the
appropriate one.

Stream Graph
A Stream graph is a type of stacked area chart. It displays the evolution of a numeric value (Y axis)
following another numeric value (X axis). This evolution is represented for several groups, all with a
distinct colour.

Contrary to a stacked area chart, there are no corners: edges are rounded, which gives this nice
impression of flow. Moreover, areas are usually displaced around a central axis, resulting in a flowing
and organic shape.
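
One simple way to displace the areas around a central axis is to start the stack at minus half of the total instead of at 0. The sketch below shows only that offset, under the assumption that this symmetric baseline is the one intended; it does not round the edges, and `centered_bands` is a hypothetical helper applied to Variable 2 and Variable 3 from the ordered table above.

```python
def centered_bands(series):
    """Stack groups symmetrically around 0 and return (lower, upper) edges."""
    names = list(series)
    n_points = len(series[names[0]])
    totals = [sum(series[name][i] for name in names) for i in range(n_points)]
    # Start each column at minus half its total so the stack is centred on 0.
    baseline = [-total / 2.0 for total in totals]
    bands = {}
    for name in names:
        top = [b + v for b, v in zip(baseline, series[name])]
        bands[name] = list(zip(baseline, top))
        baseline = top
    return bands


series = {
    "Variable 2": [2.3, 3.6, 5, 1.4],
    "Variable 3": [5.1, 4.2, 2.5, 6.1],
}
bands = centered_bands(series)
```
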
Stream graphs are good to study the relative proportions of the whole. However, they are bad to study
the evolution of each individual group: it is very hard to subtract the height of other groups at each
time point. For a more accurate but less attractive figure, consider a line chart or area chart using
small multiples.

A stream graph becomes useful when displayed in an interactive mode: highlighting a group directly
gives you an insight into its evolution.

Variation
Even if areas are usually displaced around a central axis, it is possible to display them as for most
chart types: above the 0 axis.

It is also possible to create a percent stream graph where the proportion of each group is displayed
instead of its absolute value.

Common caveats
Stream graphs work well when there is a clear pattern in the data. If the proportion of each group
remains more or less the same all along the time frame, the figure won’t be very insightful since small
variations will be hard to read.
Several numeric variables
Ordered Data
Data is composed of many columns, all numeric. Data are ordered (Col 1)

Variable 1   Variable 2   Variable 3   Variable 4   …
1            2.3          5.1          3.5          …
2            3.6          4.2          2.6          …
3            5            2.5          3.5          …
4            1.4          6.1          7.3          …
…            …            …            …            …

Type of analysis    Choice of charts
Evolution           Stacked Area, Stream Graph, Line chart, Area chart

Not Ordered Data


Data is composed of many columns, all numeric.

Variable 1   Variable 2   Variable 3   Variable 4   …
4.2          2.3          5.1          3.5          …
6.3          3.6          4.2          2.6          …
2.4          5            2.5          3.5          …
1.9          1.4          6.1          7.3          …
…            …            …            …            …

Type of analysis    Choice of charts
Distribution        Box Plot, Violin Plot, Ridge Line
Correlation         Correlogram, Heatmap
Flow                Dendrogram

Ridgeline Plot
A Ridgeline plot (sometimes called Joy plot) shows the distribution of a numeric value for several
groups. Distributions can be represented using histograms or density plots, all aligned to the same
horizontal scale and presented with a slight overlap.
Ridgeline plots make sense when the number of groups to represent is medium to high, and thus giving
each group its own separate panel would take too much space. Indeed, the fact that groups overlap each
other allows you to use space more efficiently. If you have fewer than ~6 groups, using other
distribution plots is probably better.

It works well when there is a clear pattern in the result, for instance an obvious ranking of groups.
Otherwise, groups will tend to overlap each other, leading to a messy plot that provides no insight.

Variation
The above example is a ridgeline plot using a set of density plots. It is possible to use histograms as
well:
It is possible to colour depending on the numeric variable instead of the categoric one.

Common caveats
 As with histogram or density plot, play with bin size / bandwidth argument.
 Think about ordering groups in a smart way.
 Ridgeline plot works well when there is a clear pattern to discover since it hides a part of the
data where the overlap takes place.

Correlogram
A correlogram or correlation matrix allows you to analyse the relationship between each pair of numeric
variables of a dataset. The relationship between each pair of variables is visualised through a
scatterplot, or a symbol that represents the correlation (bubble, line, number, etc.).

The diagonal often represents the distribution of each variable, using a histogram or a density plot.
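
The numbers behind such a matrix are the pairwise correlation coefficients. As a small sketch in plain Python, assuming Pearson correlation (the text does not name a specific coefficient), applied to the four visible rows of the table above; `pearson` and `correlation_matrix` are hypothetical helpers:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Correlation for every ordered pair of columns, keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


columns = {
    "Variable 1": [4.2, 6.3, 2.4, 1.9],
    "Variable 2": [2.3, 3.6, 5, 1.4],
    "Variable 3": [5.1, 4.2, 2.5, 6.1],
    "Variable 4": [3.5, 2.6, 3.5, 7.3],
}
matrix = correlation_matrix(columns)
```

The matrix is symmetric with 1 on the diagonal, which is why correlograms often use the diagonal for something more informative, such as the distribution of each variable.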

Correlograms are handy for exploratory analysis. They allow you to visualize the relationships of the
whole dataset at a glance. For instance, the linear relationship between petal length and petal width is
obvious here, as is the one for sepals.

 The first step of performing analysis on a multivariate dataset is to build a correlogram.
 It is good practice to display subgroups if a categoric variable is available as well:
Common caveats
 Displaying the relationship between more than ~10 variables makes the plot very hard to read.
 All the common caveats of scatterplot and histogram apply.

Heat Map
A heatmap is a graphical representation of data where the individual values contained in a matrix are
represented as colours. It is a bit like looking at a data table from above.
A heatmap is useful to display a general view of numerical data, not to extract a specific data point.
In the graphic above, for example, the huge population size of China and India pops out.

Variations
 For a static heatmap, a common practice is to display the exact value of each cell in numbers.
Indeed, it is hard to translate a colour into a precise number.
 Heatmaps can also be used for time series where there is a regular pattern in time.
 Heatmaps can be applied to an adjacency matrix.

Common caveats
 You often need to normalize your data.
 The colour palette is important.
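
Normalization matters because a single colour scale is shared by all cells: a column with large values would otherwise dominate. One possible approach, shown here only as a sketch, is min-max scaling per column; the `min_max_by_column` helper and the matrix are hypothetical, and other schemes (such as standardizing each column) are equally valid.

```python
def min_max_by_column(rows):
    """Rescale every column of a row-major matrix to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        # A constant column has no spread; map it to 0 to avoid dividing by 0.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    # Transpose back to the original row-major layout.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical matrix: one row per item, two features on very different scales.
rows = [
    [1400.0, 2.0],
    [330.0, 1.7],
    [10.0, 4.5],
]
scaled = min_max_by_column(rows)
```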

Dendrogram
A dendrogram is a network structure. It consists of a root node that gives birth to several nodes
connected by edges or branches. The last nodes of the hierarchy are called leaves.

Two types of dendrogram exist, resulting from 2 types of datasets:


Dendrogram from hierarchic data
A hierarchic dataset provides the links between nodes explicitly. Like below.

Dendrogram from clustering


The result of a clustering algorithm can be visualized as a dendrogram.

Let’s consider a distance matrix that provides the distance between all pairs of 28 major cities. Note
that this kind of matrix can be computed from a multivariate dataset, computing distance between
each pair of individuals using correlation or Euclidean distance.

          Berlin   Cairo   Caracas   …
Berlin         -    1795      5247   …
Cairo       1795       -      6338   …
Caracas     5247    6338         -   …
…              …       …         …   …

It is possible to perform hierarchical cluster analysis on this set of dissimilarities. Basically, this
statistical method seeks to build a hierarchy of clusters: it tries to group samples that are close to
one another.
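
The merging procedure can be sketched on the three distances visible in the matrix above. The example assumes single linkage (the distance between two clusters is the smallest distance between their members); the text does not say which linkage was used, and `single_linkage` is a hypothetical helper rather than a library function.

```python
def single_linkage(labels, dist):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance).

    `dist` maps frozenset({a, b}) to the distance between items a and b.
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: closest pair of members across the two clusters.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        d, i, j = min(
            (cluster_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


labels = ["Berlin", "Cairo", "Caracas"]
dist = {
    frozenset(("Berlin", "Cairo")): 1795,
    frozenset(("Berlin", "Caracas")): 5247,
    frozenset(("Cairo", "Caracas")): 6338,
}
merges = single_linkage(labels, dist)
```

Each merge becomes one junction of the dendrogram, drawn at a height equal to the merge distance: Berlin and Cairo join first, and Caracas joins that pair much higher up.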

The result can be seen as a dendrogram:


As expected, cities that are in the same geographic area tend to be clustered together. For example, the
yellow cluster is composed of all the Asian cities of the dataset. Note that the dendrogram provides
even more information. For instance, Sydney appears to be a bit further from Calcutta than Calcutta is
from Tokyo: this can be deduced from the branch size, which represents the distance.

Variation
Many variations exist for dendrograms. A dendrogram can be horizontal or vertical as shown before. It
can also be linear or circular. The advantage of the circular version is that it uses the graphic space
more efficiently:
A heatmap is also useful to display the result of hierarchical clustering. Basically, clustering checks
which countries tend to have the same features on their numeric variables, that is, which countries are
similar. The usual way to represent the result is to use a dendrogram. This type of chart can be drawn
on top of the heatmap:
Here, Afghanistan, India and Bolivia are grouped together. Indeed, they are 3 countries in strong
expansion, with a lot of children per woman but still a high mortality rate.

Note: in this heatmap, features are also clustered. For instance, life expectancy and mortality rate are
grouped together since they are highly correlated.

Hierarchical clustering is a complex statistical method.

Common caveats
 If using a clustering algorithm, be sure you have a clear idea of which metrics have been used for
the distance calculation and for the clustering algorithm.
 Horizontal versions are more apt for data with long labels.
 Showing the heatmap is a good practice if you’re working with clustering.

Categorical data:
One categorical value
A very simple dataset with just one categorical column.
Variable 1
A
A
B
C

Type of analysis    Choice of charts
Ranking             Bar Plot, Lollipop, Word Cloud
Part of a whole     Donut, Pie, Tree map, Circular Packing or Packed bubble chart

Bar Plot
A bar plot shows the relationship between a numeric and a categoric variable. Each entity of the
categoric variable is represented as a bar. The size of the bar represents its numeric value.

An ordered bar plot is a very good choice here since it displays both the ranking of countries and
their specific value.

Common caveats
 Do not confuse a bar chart with a histogram. A histogram has only a numeric variable as input and
shows its distribution.
 Order your bars. If the levels of your categoric variable have no obvious order, order the bars
following their values.
 Several values per group? Don’t use a bar plot. Even with error bars, it hides information, and
other types of graphic like a boxplot or violin plot are much more appropriate.
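
With a single categorical column, the numeric value of each bar is typically the number of occurrences of each level. As a minimal sketch using the four values of the table above (the text rendering with `#` characters is only a stand-in for a real chart):

```python
from collections import Counter

# The single categorical column from the table above.
values = ["A", "A", "B", "C"]

# Count each level and order the bars from the most to the least frequent.
ranked = Counter(values).most_common()

# Rough text rendering: one row per bar, bar length equal to the count.
lines = [f"{label} {'#' * count}" for label, count in ranked]
print("\n".join(lines))
```
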
Circular bar plot

Lollipop Chart
A lollipop plot is basically a bar plot where the bar is transformed into a line and a dot. It shows the
relationship between a numeric and a categoric variable.
Variations
Dumbbell plot
The dumbbell plot is a handy variation, allowing you to compare 2 numeric values for each
group. This kind of data could also be visualized using a grouped or stacked bar plot. However, this
representation is less cluttered and much easier to read. Use it if you have 2 subgroups per group.

Note that when the number of subgroups is between 3 and ~7, this type of lollipop plot is nice as
well:
Common caveats
 Order your groups. If the levels of your categoric variable have no obvious order, order the
lollipops following their values.
 If for whatever reason your groups must remain unsorted, it is probably better to use a bar plot
instead. A lollipop plot would be harder to read.
 Several values per group? Don’t use a lollipop. Even with error bars, it hides information, and
other types of graphic like a boxplot or violin plot are much more appropriate.
 Think about the horizontal version: it makes the labels easier to read.


Word Cloud
A word cloud (also called tag cloud or weighted list) is a visual representation of text data. Tags are
usually single words, and the importance of each is shown with font size or colour.

A word cloud is useful for quickly perceiving the most prominent terms and for locating a term
alphabetically to determine its relative prominence. It is widely used in media and well understood
by the public.

However, it is a highly criticized way to convey information due to its lack of accuracy. There are 2
main reasons for this:

 Area is a poor metaphor for a numeric value, since it is hard for the human eye to perceive. Thus,
readers struggle to translate word size into an accurate frequency.
 Longer words appear bigger by construction, since they are composed of more letters. This creates
a bias that makes the word cloud even less accurate.
A good workaround is to use a bar plot or lollipop plot instead. Here is an example using the same data
as the previous chart:

Common caveat
Building a word cloud is a pitfall on its own, except if it is done for aesthetic reasons.

Pie Chart & Donut


A pie chart is a circle divided into sectors that each represent a proportion of the whole. It is often
used to show percentages, where the sum of the sectors equals 100%.

The problem is that humans are bad at reading angles. In the adjacent pie chart, try to figure out
which group is the biggest one and try to order the groups by value. You will probably struggle to do
so, and this is why pie charts must be avoided.
Let’s try to understand which group has the highest value in these 3 graphics. Also, try to figure out
how the value evolves among groups.

Now, let’s represent the same data using a bar plot:

As you can see on this bar plot, there is a big difference between the three pie plots, with a hidden
pattern that you don’t want to miss when you tell your story.

Even if pie charts are bad, it is still possible to make them even worse by adding other bad features:

 3D
 a legend instead of annotations
 percentages that do not sum to 100
 too many items

Tree Map
A Tree map displays hierarchical data as a set of nested rectangles. Each group is represented by a
rectangle whose area is proportional to its value. Using colour schemes and/or interactivity, it is
possible to represent several dimensions: groups, subgroups etc.
Tree maps are used to show two types of information simultaneously:

 How the whole is divided: for each level of the hierarchy, it is easy to understand which entity is
the most important and how the whole is distributed among entities. A tree map can even be used
without any hierarchy, just to show the value of several entities as in a bar plot.
 How the hierarchy is organized. Note that it is hard to represent more than 3 levels on a static
version.

Tree maps have the advantage of making efficient use of space, which makes them useful to represent a
large amount of data.

Common considerations:
 Don’t annotate more than 3 levels of the hierarchy, it would make the figure unreadable.
 Interactive version really makes sense for tree map.
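
The rule that area is proportional to value can be sketched with the simplest possible layout, which cuts the available rectangle into vertical strips. This is an assumption made for illustration: real tree map tools generally use more elaborate layouts that keep rectangles closer to squares, and `slice_layout` and its input values are hypothetical.

```python
def slice_layout(values, width, height):
    """Split a width x height rectangle into vertical strips, one per entity.

    Returns name -> (x, y, strip_width, strip_height); areas follow the values.
    """
    total = sum(values.values())
    rects = {}
    x = 0.0
    # Largest entity first, so the ranking reads from left to right.
    for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        strip_width = width * value / total
        rects[name] = (x, 0.0, strip_width, float(height))
        x += strip_width
    return rects


# Hypothetical values for three entities, laid out in an 8 x 5 canvas.
rects = slice_layout({"A": 2, "B": 1, "C": 1}, width=8, height=5)
areas = {name: w * h for name, (_, _, w, h) in rects.items()}
```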

Circular Packing or Packed bubble


Circular packing, or circular tree map, allows you to visualize a hierarchical organization. It is an equivalent of
a tree map or a dendrogram, where each node of the tree is represented as a circle and its sub-nodes
are represented as circles inside of it. The size of each circle can be proportional to a specific
value, which gives more insight into the plot.

Circle packing is not recommended if you need to precisely compare the values of groups. Indeed, it is
hard for the human eye to translate an area into an accurate number. If you need accuracy, use a bar
plot or a lollipop plot instead.
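
A short worked example shows why areas are hard to translate into numbers. It is a sketch that assumes the circle area is set equal to the value; the function name `radius_for_value` is hypothetical.

```python
import math


def radius_for_value(value):
    """Radius of a circle whose area equals the value."""
    return math.sqrt(value / math.pi)


r_small = radius_for_value(10)
r_large = radius_for_value(20)

# Doubling the value multiplies the radius by sqrt(2) only.
ratio = r_large / r_small
```

A value that doubles makes the circle only about 41% wider, so differences look smaller than they are, whereas a bar would simply double in length.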

However, circular packing shows very well how groups are organised in subgroups. It uses space a
bit less efficiently than a tree map, but the hierarchy is depicted well.
Common Considerations:
If the hierarchy has many levels, using an interactive version is advised. Indeed, adding too many labels
could make the graphic unreadable, as shown in the above image.

Two or more categorical values


Nested
The second categorical variable is nested in the first one. Each entry is separately identifiable but also
part of a larger organization. Example: World > Europe > Ireland > Dublin

Variable 1 Variable 2
A a1
A a1
A a2
B b1
B b1
... ...

Type of analysis Choice of charts


Ranking Bar Plot, Lollipop
Part of a whole Tree map, Circular Packing or Packed bubble chart, Sunburst, Dendrogram
Sunburst
A sunburst diagram displays a hierarchical structure. The origin of the organization is represented by
the centre of the circle, and each level of the organization by an additional ring. The last level (leaves)
is located at the extreme outer part of the circle. It is very similar to a tree map, except it uses a
radial layout.

Here is an example describing the world population of 250 countries. The world is divided into
continents (group), continents are divided into regions (subgroup), and regions are divided into countries. In
this tree structure, countries are considered leaves: they are at the end of the branches. They are
thus represented at the outer part of the circle.

Continent Region Country Pop


Asia Southern Asia Afghanistan 25500100
Europe Northern Europe Åland Islands 28502
Europe Southern Europe Albania 2821977
... ... ... ...
Interactivity on hover or click is also necessary as there is no space for labels.
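
In a sunburst, each inner ring shows the total of the leaves beneath it. The sketch below rolls the three sample rows of the table above up to region and continent level; the function name `roll_up` is hypothetical and only the rows shown in the table are used.

```python
from collections import defaultdict

# The three rows shown in the table above.
rows = [
    ("Asia", "Southern Asia", "Afghanistan", 25500100),
    ("Europe", "Northern Europe", "Åland Islands", 28502),
    ("Europe", "Southern Europe", "Albania", 2821977),
]


def roll_up(rows):
    """Sum leaf values (countries) into regions and continents."""
    by_continent = defaultdict(int)
    by_region = defaultdict(int)
    for continent, region, _country, pop in rows:
        by_continent[continent] += pop
        by_region[(continent, region)] += pop
    return dict(by_continent), dict(by_region)


continents, regions = roll_up(rows)
```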

Common caveats
 Labels - It is very hard to represent labels on sunburst charts. This is why using interactivity as
above is often necessary to make the chart useful. This is an important downside though: it is
hard to understand the figure at a glance.
 Angles are hard to read - sunburst charts suffer from the same issue as pie or donut charts. The human eye
is bad at reading angles. Therefore, it is hard to deduce the values behind items accurately.
 Deeper slices are exaggerated - by construction, outer parts tend to get bigger than inner parts for
the same value. Indeed, the circumference of a ring gets longer the further you go from the
centre of the circle!
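
The exaggeration of outer slices can be quantified with the area of a ring sector. This is a sketch assuming rings of equal thickness (one unit each); the function name `ring_sector_area` is hypothetical.

```python
import math


def ring_sector_area(angle, r_inner, r_outer):
    """Area of an annular sector spanning `angle` radians."""
    return 0.5 * angle * (r_outer ** 2 - r_inner ** 2)


angle = math.pi / 2  # the same quarter of the circle at every level
inner = ring_sector_area(angle, 0, 1)  # level 1: a pie-like wedge
outer = ring_sector_area(angle, 2, 3)  # level 3: third ring outwards

# Same angle (same value), but the outer slice covers far more ink.
exaggeration = outer / inner
```

Under these assumptions a slice in the third ring covers five times the area of a slice with the same angle in the first ring.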

Subgroup
Two categorical variables organized in groups and subgroups such that every combination of both
variables is possible. For example, the dataset below quantifies the gender wage gap in 39
countries at three different time stamps. The gender wage gap is defined as the difference between
male and female median wages divided by the male median wages.

Country TIME Value
Australia 2000 17.2
Australia 2005 15.8
Australia 2010 14.0
Australia 2015 13.0
Austria 2000 23.1
Austria 2005 22.0
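
The gender wage gap definition given for this dataset can be written as a one-line formula. The sketch below assumes the values in the table are expressed as percentages; the wages used are hypothetical and were chosen only to reproduce a value of 17.2.

```python
def gender_wage_gap(male_median, female_median):
    """Gap as a percentage of the male median wage."""
    return 100 * (male_median - female_median) / male_median


# Hypothetical median wages, not taken from the dataset.
gap = gender_wage_gap(male_median=1000, female_median=828)
```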

Type of analysis Choice of charts


Correlation Scatter plot if the grouping variable has 2 levels
Ranking Grouped Lollipop, Parallel Plot, Spider Plot, Clustered Bar, Stacked Bar
Flow Sankey diagram

Parallel Plot
A parallel plot, or parallel coordinates plot, allows you to compare the features of several individual
observations (series) on a set of numeric variables. Each vertical axis represents a variable and often
has its own scale. (The units can even be different.) Values are then plotted as series of lines
connected across each axis.

The iris dataset provides four features (each represented with a vertical line) for 150 flower samples
(each represented with a coloured line). Samples are grouped in three species. The chart below
efficiently highlights that setosa has smaller petals, but its sepals tend to be wider.
A parallel plot allows you to study the features of samples for several quantitative variables. Its strength
is that the variables can even be completely different: different ranges and even different units.

In the graphic above, flower features were grouped by species, and all variables were normalized and
shared the same unit (cm). Here is another example where diamonds are compared on 4 variables
that have different units, like the price in $ or the depth in %. Note the use of scaling to be able to
compare them.

Variation:
Here is an overview of the parallel coordinates features you can play with:
 Scaling - scaling transforms the raw data to a new scale that is common with other variables. It is
a crucial step for comparing variables that do not have the same unit, but can also help otherwise,
as shown in the example below:

 Axis order - optimizing the order of the vertical axes can decrease the clutter of your parallel plot.
Basically, the goal is to minimize the number of crossings between series. On the next figure, the left
plot is much harder to understand than the right one. Only the variable order is different.

 Highlighting - a parallel plot being a line plot, the main caveat is the spaghetti chart where too
many lines overlap, making the chart unreadable. Several workarounds exist. One solution is to
highlight a specific sample or a specific group of interest:
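
The scaling step mentioned above can take several forms. The sketch below shows one common choice, min-max scaling to the range 0 to 1, which is an assumption here and not necessarily the method used for the figures; the function name and the diamond values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: no spread to show
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical variables with different units.
price = [326, 5000, 18823]   # price in $
depth = [43.0, 61.5, 79.0]   # depth in %

scaled = {"price": min_max_scale(price), "depth": min_max_scale(depth)}
```

After scaling, both variables span the same range, so their axes can be drawn side by side without one dwarfing the other.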
Common Caveats:
 As with line plots, displaying too many samples results in a cluttered and unreadable spaghetti
chart.
 Order the variables along the X axis so as to avoid crossings between sample lines.
 Try different scalings to find the one that fits your data best.
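
The effect of axis order on clutter can be estimated by counting how often two sample lines cross between adjacent axes. This is a minimal sketch assuming the values are already on a comparable scale; the function name `count_crossings` and the samples are hypothetical.

```python
from itertools import combinations


def count_crossings(samples, order):
    """Count line crossings for a given left-to-right axis order."""
    crossings = 0
    for left, right in zip(order, order[1:]):
        for a, b in combinations(samples, 2):
            # Two lines cross when their relative order flips between axes.
            if (a[left] - b[left]) * (a[right] - b[right]) < 0:
                crossings += 1
    return crossings


# Hypothetical samples: x and y move together, z moves the opposite way.
samples = [
    {"x": 1, "y": 2, "z": 9},
    {"x": 2, "y": 3, "z": 5},
    {"x": 3, "y": 4, "z": 1},
]

good = count_crossings(samples, ["x", "y", "z"])
bad = count_crossings(samples, ["x", "z", "y"])
```

Placing the two correlated variables next to each other halves the number of crossings in this toy example; with a handful of variables every ordering can be tried.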

Spider or Radar chart

A radar, spider or web chart is a two-dimensional chart type designed to plot one or more series of
values over multiple quantitative variables. Each variable has its own axis, and all axes are joined in the
centre of the figure.

Let’s consider the exam results of a student. The student has a mark ranging from 0 to 20 for ten topics like
math, sports, statistics, and so on. The radar chart provides one axis for each topic. The shape allows
you to see which topics the student performed well or poorly in.
Variation
In the previous chart, only one series is plotted, showing how one student performed. A common
task is to compare several individuals. With only a few series it is possible to show every group on the
same chart.

With more than two or three series, it is a good practice to use small multiples to avoid a cluttered
figure. Each student is represented on their own radar chart. It is easy to understand the features of
a specific individual, and looking for similarity in shapes allows you to find students with similar
features.

Common Caveats:
 Circular layout = harder to read - quantitative values are much easier to compare when they are
laid out along a single vertical or horizontal axis. This is a general criticism of circular
layouts. Compare the figure below, which considers the data of one student only. It is easier to
compare values in the bar plot, and more accurate.
 Supporting the ranking - In the above example, the lollipop plot is ordered. It allows you to see
instantly which topic had the best mark and what the ranking of each topic was. This is more
difficult with radar charts, which have no start and no end.
 Category order has a huge impact - Radar chart readers will probably focus on the shape
observed. This can be misleading since this shape highly depends on the ordering of categories
around the plot. See these charts made using the same data, but changing the category ordering:

 About scales - Radar charts display the value of several quantitative variables, each represented on
its own axis. In the previous example, the same scale and the same unit are shared by all variables (a
mark ranging from 0 to 20). It is also possible to display variables that are completely different. In
this case, don’t forget to show an obvious scale for each: the reader will expect the same scale
otherwise.
 Overplotting - showing more than a couple of series would result in an unreadable graphic. Avoid
it and use small multiples instead.
 Over-evaluation of differences - the area of a shape in a radar chart increases quadratically
rather than linearly with the values, which could lead viewers to think that small changes are more
significant than they are. In the following example, the student on the left had a score of 7 in every topic,
whereas the student on the right had a score of 14 in every topic. The scores only double, yet the right
figure’s area is more than twice the left figure’s area.
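
The quadratic growth of the area can be verified numerically. The sketch below assumes evenly spaced axes that all start at 0 and share the same scale; the function name `radar_area` is hypothetical.

```python
import math


def radar_area(scores):
    """Area of the radar polygon, axes evenly spaced and starting at 0."""
    n = len(scores)
    wedge = math.sin(2 * math.pi / n) / 2
    # Sum the triangles formed by each pair of neighbouring axes.
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


left = radar_area([7] * 10)    # a score of 7 in each of ten topics
right = radar_area([14] * 10)  # a score of 14 in each of ten topics
area_ratio = right / left
```

Under these assumptions the scores double while the area is multiplied by four. The same function also illustrates the impact of category order: the scores 20, 1, 20, 1 on four axes give a much smaller area than the same scores arranged as 20, 20, 1, 1.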

Sankey Diagram

A Sankey Diagram is a visualisation technique that allows you to display flows. Several entities (nodes) are
represented by rectangles or text. Their links are represented with arrows or arcs that have a width
proportional to the importance of the flow.

Here is an example displaying the number of people migrating from one country (left) to another
(right).
Sankey diagrams are used to show weighted networks, i.e. flows. This can arise with several data
structures:

 evolution: the nodes are duplicated in 2 or more groups that represent stages. Connections show
the evolution between two states, as in the migration example above. This is more often
visualized as a chord diagram.
 source to end: given a total amount, the diagram shows where it comes from and where it
ends up, with possible intermediate steps. Each node is unique.
Common Caveats:
 The position of nodes is very important: algorithms exist to minimize the number of crossings
between links.
 Beware of over-cluttering, which makes the figure unreadable. It is advised to drop weak connections.
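
Dropping weak connections and sizing the remaining links can be sketched as follows. The threshold (`min_share`), the maximum width and the flows are hypothetical values chosen for illustration, not a recommendation from any particular Sankey tool.

```python
def visible_links(flows, min_share=0.05, max_width=40.0):
    """Drop weak links and map the rest to widths proportional to flow."""
    total = sum(flows.values())
    largest = max(flows.values())
    return {
        link: max_width * value / largest
        for link, value in flows.items()
        if value / total >= min_share  # keep links carrying enough flow
    }


# Hypothetical flows: (source, target) -> amount
flows = {("A", "X"): 60, ("A", "Y"): 30, ("B", "X"): 8, ("B", "Y"): 2}
links = visible_links(flows)
```

Here the link from B to Y carries 2% of the total flow and is removed, while the others keep widths proportional to their amounts.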

Adjacency
Adjacency and incidence matrices describe the relationships between several nodes. The information they
contain can be of different natures, so this document will consider several examples:

 Relationships can be directed and weighted, like the number of people migrating from one
country to another.

Africa East.Asia Europe


Africa 3.142471 0.000000 2.107883
East Asia 0.000000 1.630997 0.601265
Europe 0.000000 0.000000 2.401476

 Relationships can be undirected and unweighted. I will consider all the co-authors of a
researcher and study who is connected through a common publication. The result is an
adjacency matrix with about 100 researchers, filled with 1 if they have published a paper
together, 0 otherwise.

from A.Bateman A.Besnard A.Breil


A Armero NA NA 1
A Bateman NA NA NA
A Besnard NA NA NA
 Relationships can also be undirected and weighted.
 Relationships can also be directed and unweighted.
Type of analysis Choice of charts
Flow Network, Chord, Arc, Sankey
Correlation Heatmap
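
Most flow charts are drawn from a list of links, so a first step is often to turn the adjacency matrix into (source, target, weight) triples. The sketch below uses the migration excerpt shown above and assumes that rows are origins and columns are destinations; the function name `to_edge_list` is hypothetical.

```python
nodes = ["Africa", "East Asia", "Europe"]

# Excerpt of the migration matrix shown above.
# Assumption: rows are origins, columns are destinations.
matrix = [
    [3.142471, 0.000000, 2.107883],
    [0.000000, 1.630997, 0.601265],
    [0.000000, 0.000000, 2.401476],
]


def to_edge_list(nodes, matrix):
    """List (source, target, weight) for every non-zero cell."""
    return [
        (nodes[i], nodes[j], weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight != 0
    ]


edges = to_edge_list(nodes, matrix)
```

For an unweighted matrix the same function can be used and the weight simply ignored.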

Chord diagram
A chord diagram is a good way to represent migration flows. It works well if your data are directed
and weighted, as for migration flows between countries.

Since this kind of graphic is used to display flows, it can be applied only to networks whose
connections are weighted. It does not work for the other example on author connections.

Major flows are easy to detect, like the migration from South Asia towards West Asia, or Africa to
Europe. Moreover, for each continent it is quite easy to quantify the proportion of people leaving and
arriving. However, the chord diagram is not a common way of displaying information. Thus, it is advised to
give a good amount of explanation to educate your audience. A good way to do so is to draw just a
few connections in a first step, before displaying the whole graphic.
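
The quantities of people leaving and arriving that a chord diagram shows for each node correspond to row and column sums of the flow matrix. This is a sketch with hypothetical data, again assuming rows are origins and columns are destinations; the function name is hypothetical.

```python
def leaving_and_arriving(nodes, matrix):
    """Total outgoing (row sum) and incoming (column sum) flow per node."""
    leaving = {n: sum(matrix[i]) for i, n in enumerate(nodes)}
    arriving = {n: sum(row[j] for row in matrix) for j, n in enumerate(nodes)}
    return leaving, arriving


# Hypothetical flows between three nodes.
nodes = ["A", "B", "C"]
matrix = [
    [0, 4, 1],  # flows leaving A
    [2, 0, 3],  # flows leaving B
    [0, 5, 0],  # flows leaving C
]

leaving, arriving = leaving_and_arriving(nodes, matrix)
```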
Instead of using a custom algorithm to position each node, it is possible to place them all around a
circle, making a chord diagram. But this kind of chart makes sense only if the order of nodes around
the circle is carefully chosen, to avoid having a cluttered and unreadable figure.

Arc diagram
An arc diagram follows the same concept but displays nodes along a single axis and links as arcs.
The main advantage is that it keeps the labels easy to read.
Network
Since an adjacency matrix is a network structure, it is possible to build a network graph. In a network
graph, each entity is represented as a node, and each connection as an edge.

This type of representation makes more sense when the connection is unweighted, since drawing
edges with different sizes tends to clutter the figure and make it unreadable.

Thus, here is an application of this chart type to the coauthor network. Researchers are the nodes,
represented as dots. If 2 researchers have published at least one scientific paper together, they are
connected. The node size is proportional to the number of coauthors.
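
The node size described above is the degree of each node, i.e. its number of distinct neighbours. A minimal sketch, with placeholder researcher names and a hypothetical function name:

```python
from collections import Counter


def degrees(edges):
    """Number of distinct neighbours of each node in an undirected graph."""
    # frozenset ignores direction, so (R1, R2) and (R2, R1) count once.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    count = Counter()
    for edge in unique:
        for node in edge:
            count[node] += 1
    return dict(count)


# Hypothetical co-authorship pairs; the names are placeholders.
edges = [("R1", "R2"), ("R1", "R3"), ("R2", "R1"), ("R3", "R4")]
node_size = degrees(edges)
```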
Network graphs are very powerful for studying the global structure of the network. Here, a few groups of
researchers are isolated. Each represents one single paper where Vincent Ranwez was involved. In
the middle, a massive network of researchers appears: these are the people with whom Vincent published
more often, and who are thus all linked together.

However, network charts are very bad at annotating every single point: names tend to overlap edges,
making the figure unreadable. The arc diagram described above is a good alternative if you want to
show labels.

Numeric and Categorical data


One numeric, one categorical
One observation per group
 One numeric variable is provided for each entity of a categorical variable.

Id Feature 1
A 10
B 12
C 15
... ...

Type of analysis Choice of charts


Distribution Boxplot
Ranking Lollipop, Word Cloud
Part of a whole Donut, Pie, Tree map, Circular Packing

Several observations per group


 One numeric variable is provided, with several values per group.

Group Var1
A 1.3
A 3.4
A 2.3
B 3.5
B 4.9
... ...

Type of analysis Choice of charts


Distribution Boxplot, Violin, Ridgeline, Density, Histogram
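
Distribution charts such as the boxplot summarise the values of each group. The sketch below groups the sample rows shown above and computes a few of those summary statistics; the function name `summarize` is hypothetical, and quartiles are left out because their value depends on the method chosen.

```python
from collections import defaultdict
from statistics import median

# Rows of the sample table above: (Group, Var1).
rows = [("A", 1.3), ("A", 3.4), ("A", 2.3), ("B", 3.5), ("B", 4.9)]


def summarize(rows):
    """Group values and report min, median and max for each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        g: {"min": min(v), "median": median(v), "max": max(v), "n": len(v)}
        for g, v in groups.items()
    }


summary = summarize(rows)
```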

One categorical, several numerical


No order
A set of several numerical variables adding meaning for each sample. Several observations are
available per sample.

Group Var1 Var2 ...


A 1.3 1.1 ...
A 3.4 4.3 ...
A 2.3 2.1 ...
B 3.5 9.4 ...
B 3.5 9.4 ...
... ... ... ...

Type of analysis Choice of charts


Distribution Boxplot, Violin
Correlation Grouped Scatter, 2D Density, Correlogram

A number is ordered.

A set of several numerical variables adding meaning for each sample. One of the variables is ordered
and will be used for the x-axis most of the time.

Group Var1 Var2 ...


A 1 1.1 ...
A 2 4.3 ...
A 3 2.1 ...
B 1 9.4 ...
B 2 9.4 ...
... ... ... ...

Type of analysis Choice of charts


Evolution Stacked area, Area, Stream Graph, Line Plot
Correlation Connected Scatter

One value per group


Set of several numerical variables adding meaning for each sample. One row only per sample.

Group Var1 Var2 ...


A 12.1 1.1 ...
B 9.2 7.8 ...
C 1.1 98 ...
... ... ... ...

Type of analysis Choice of charts


Ranking Lollipop, Parallel Plot, Spider Plot
Correlation Grouped Scatter, Heatmap
Part of a whole Grouped Barplot, Stacked Barplot
Flow Sankey Diagram
Several Categorical, One numerical
Subgroup
One observation per group
Categorical variables provide several ways to group samples. For each combination of groups, only
one observation is available.

Group Sub Var1


A Sub1 1.3
A Sub2 2.3
B Sub1 3.5
B Sub2 2.2
... ... ...

Type of analysis Choice of charts


Ranking Lollipop, Parallel Plot, Spider Plot
Correlation Grouped Scatter, Heatmap
Part of a whole Grouped Barplot, Stacked Barplot
Flow Sankey Diagram
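
Grouped and stacked bar plots, as well as heatmaps, are easier to build once the long table above is pivoted to one row per group. This is a sketch using the sample rows shown above; the function names `pivot` and `shares` are hypothetical.

```python
# Rows of the sample table above: (Group, Sub, Var1).
rows = [
    ("A", "Sub1", 1.3),
    ("A", "Sub2", 2.3),
    ("B", "Sub1", 3.5),
    ("B", "Sub2", 2.2),
]


def pivot(rows):
    """Turn long rows into {group: {subgroup: value}}."""
    wide = {}
    for group, sub, value in rows:
        wide.setdefault(group, {})[sub] = value
    return wide


def shares(wide):
    """Share of each subgroup within its group (for a 100% stacked bar)."""
    return {
        group: {sub: v / sum(subs.values()) for sub, v in subs.items()}
        for group, subs in wide.items()
    }


wide = pivot(rows)
stacked = shares(wide)
```

The wide form feeds a grouped bar plot or a heatmap directly, while the shares give the segments of a 100% stacked bar.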

Several observations per group

Categorical variables provide several ways to group samples. For each combination of groups, several
observations are available.

Group Sub Var1


A Sub1 1.3
A Sub1 2.3
A Sub2 2.3
B Sub1 3.2
B Sub1 2.2
B Sub2 3.5
... ... ...

Type of analysis Choice of charts


Distribution Box Plot, Violin

Nested
One observation per group
Categorical variables provide information about the hierarchy. There is only one value for each leaf.

Group Sub Var1


A A.1 1.3
A A.2 2.3
B B.1 2.3
B B.2 3.2
... ... ...

Type of analysis Choice of charts


Ranking Bar plot
Part of a whole Dendrogram, Sunburst, Treemap, Circular Packing

Several observations per group


Categorical variables provide information about the hierarchy. Several values are available for each leaf.

Group Sub Var1


A A.1 1.3
A A.1 3.4
A A.2 2.3
A A.2 9.8
B B.1 3.5
B B.1 4.9
B B.2 1.3
B B.2 2.2
Type of analysis Choice of charts
Distribution Box plot, Violin

Adjacency
A square matrix that gives the relationship between each pair of samples. The first column lists
node 1 and the first row lists node 2.

A B C
A - 1.3 1.3
B 0.3 - 3.2
C - - -

Type of analysis Choice of charts


Flow Network, Chord, Arc, Sankey
Correlation Heatmap
