10 Must-know Seaborn Visualization Plots for Multivariate Data Analysis in Python _ by Susan Maina _ Towards Data Science


myself bogged down by all the documentation, community discussions, and many ways of creating simple plots, and thank goodness I found Seaborn.

Seaborn is an interface built on top of Matplotlib that uses short lines of code to
create and style statistical plots from Pandas dataframes. Because it uses Matplotlib under
the hood, it helps to have a basic understanding of the figure, axes, and axis
objects.

Related article: 8 Seaborn Plots for Univariate Exploratory Data Analysis (EDA) in Python. Learn how to visualize and analyze one variable at a time using seaborn and matplotlib (towardsdatascience.com).

We will use the vehicles dataset from Kaggle that is under the Open database
license. The code below imports the required libraries, sets the style, and loads the
dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
sns.set(font_scale=1.3)

cars = pd.read_csv('edited_cars.csv')

Before we continue, note that seaborn plots belong to one of two groups.

Axes-level plots — These mimic Matplotlib plots and can be bundled into
subplots using the ax parameter. They return an axes object and use normal
Matplotlib functions to style.

Figure-level plots — These provide a wrapper around axes plots and can only
create meaningful and related subplots because they control the entire figure.
They return either FacetGrid, PairGrid, or JointGrid objects and do not support
the ax parameter. They use different styling and customization inputs.
For each plot, I will mention which group it falls in.

Part one: Exploring relationships between numeric columns


Numeric features contain continuous data or numbers as values.

The first two plots will be matrix plots, where you pass the whole dataframe to
visualize all the pairwise distributions in one plot.

1. Pair plot
A pair plot creates a grid of scatter plots to compare the distribution of pairs of
numeric variables. It also features a histogram for each feature in the diagonal
boxes.

Functions to use:

sns.pairplot() — figure-level plot

The kind parameter changes the type of bivariate plots created, with kind='scatter' (default), 'kde', 'hist' or 'reg'.

Two columns per grid (Bivariate)

sns.pairplot(cars);

What to look out for:

Scatter plots showing either positive linear relationships (if x increases, y increases) or negative (if x increases, y decreases).

Histograms in the diagonal boxes that show the distribution of individual features.

In the pair plot below, the circled plots show an apparent linear relationship. The
diagonal line points out the histograms for each feature, and the pair plot’s top
triangle is a mirror image of the bottom.
Pairplot by author

Three columns (multivariate): two numeric and one categorical

We can add a third variable that segments the scatter plots by color using the
parameter hue=’cat_col’ .

sns.pairplot(
data=cars,
aspect=.85,
hue='transmission');

Multivariate pairplot by author

What to look out for:

Clusters of different colors in the scatter plots.

2. Heat map
A heat map is a color-coded graphical representation of values in a grid. It’s an ideal
plot to follow a pair plot because the plotted values represent the correlation
coefficients of the pairs that show the measure of the linear relationships.

In short, a pair plot shows the intuitive trends of the data, while a heat map plots the
actual correlation values using color.

Functions to use:

sns.heatmap() —axes-level plot

First, we run df.corr() to get a table with the correlation coefficients. This table is
also known as a correlation matrix.

# numeric_only=True skips text columns such as fuel (needed in pandas 2.0+)
cars.corr(numeric_only=True)

Correlation matrix by author

sns.heatmap() — Since the table above is not very intuitive, we’ll create a heatmap.

sns.set(font_scale=1.15)
plt.figure(figsize=(8,4))
sns.heatmap(
cars.corr(numeric_only=True),
cmap='RdBu_r',
annot=True,
vmin=-1, vmax=1);

cmap='RdBu_r' sets the color scheme, annot=True draws the values inside the cells,
and vmin and vmax ensure the color scale runs from -1 to 1.
Heatmap by author

What to look out for:

Highly correlated features. These are the dark-red and dark-blue cells. Values
close to 1 (dark red with this color map) mean a strong positive linear relationship,
while values close to -1 (dark blue) mean a strong negative relationship.

Image from www.statisticshowto.com
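
As a rough, self-contained version of the correlation matrix and heat map steps, the sketch below builds a synthetic dataframe (invented values, borrowed column names) and selects the numeric columns explicitly before calling corr(). Selecting with select_dtypes is an assumption of this sketch, chosen because it behaves the same across pandas versions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with one positive and one negative relationship (made up).
rng = np.random.default_rng(2)
demo = pd.DataFrame({"engine_cc": rng.normal(1500, 300, 100)})
demo["max_power_bhp"] = 0.06 * demo["engine_cc"] + rng.normal(0, 5, 100)
demo["mileage_kmpl"] = 30 - 0.008 * demo["engine_cc"] + rng.normal(0, 1, 100)
demo["fuel"] = np.tile(["Diesel", "Petrol"], 50)  # text column, ignored below

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = demo.select_dtypes("number").corr()

# Fix the color scale to [-1, 1] so colors are comparable across datasets.
ax = sns.heatmap(corr, cmap="RdBu_r", annot=True, vmin=-1, vmax=1)
print(corr.round(2))
```

The diagonal is always 1 because every column is perfectly correlated with itself, so the interesting cells are the off-diagonal ones.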

In the following plots, we will further explore these relationships.

3. Scatter plot
A scatter plot shows the relationship between two numeric features by using dots to
visualize how these variables move together.

Functions to use:
sns.scatterplot() — axes-level plot

sns.relplot(kind='scatter') — figure-level plot

Functions with a regression line:

sns.regplot() — axes-level

sns.lmplot() — figure-level

Two numeric columns (bivariate)

sns.scatterplot(x='num_col1', y='num_col2', data=df) — Let us visualize the engine size with the
mileage (efficiency) of the vehicle.

sns.set(font_scale=1.3)
sns.scatterplot(
x='engine_cc',
y='mileage_kmpl',
data=cars)
plt.xlabel(
'Engine size in CC')
plt.ylabel(
'Fuel efficiency')

Scatter plot by author

sns.regplot(x, y, data)
A reg plot draws a scatter plot with a regression line showing the trend of the data.

sns.regplot(
x='engine_cc',
y='mileage_kmpl',
data=cars)
plt.xlabel(
'Engine size in CC')
plt.ylabel(
'Fuel efficiency');

Regression plot by author
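
To see what the regression line represents, the sketch below compares the line drawn by sns.regplot with an ordinary least-squares fit from np.polyfit. The data is synthetic, ci=None is passed only to skip the confidence band, and reading the line back through ax.lines[0] is an assumption about how the Axes is populated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with a downward trend (values are made up).
rng = np.random.default_rng(3)
demo = pd.DataFrame({"engine_cc": rng.normal(1500, 300, 80)})
demo["mileage_kmpl"] = 30 - 0.008 * demo["engine_cc"] + rng.normal(0, 1, 80)

# Scatter plot plus fitted line; ci=None skips the bootstrapped band.
ax = sns.regplot(x="engine_cc", y="mileage_kmpl", data=demo, ci=None)

# Independent straight-line fit for comparison.
slope, intercept = np.polyfit(demo["engine_cc"], demo["mileage_kmpl"], deg=1)

# The first Line2D on the Axes is assumed to be the regression line.
line_x, line_y = ax.lines[0].get_data()
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
```

A negative slope here matches the downward trend visible in the scatter points.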

Three columns (multivariate): two numeric and one categorical.

sns.scatterplot(x, y, data, hue='cat_col'): We can further segment the scatter plot by a categorical variable using hue.

sns.scatterplot(
x='mileage_kmpl',
y='engine_cc',
data=cars,
palette='bright',
hue='fuel');
Scatter plot with hue by author

sns.relplot(x, y, data, kind='scatter', hue='cat_col')

A rel plot, or relational plot, is used to create a scatter plot using kind='scatter' (default), or a line plot using kind='line'.

In our plot below, we use kind='scatter' and hue='cat_col' to segment by color. Note how the image below has similar results to the one above.

sns.relplot(
x='mileage_kmpl',
y='engine_cc',
data=cars,
palette='bright',
kind='scatter',
hue='fuel');
Relplot by author

sns.relplot(x, y, data, kind='scatter', col='cat_col'): We can also create subplots of the segments column-wise using col='cat_col' and/or row-wise using row='cat_col'. The plot below splits the data by the transmission categories into different plots.

sns.relplot(
x='year',
y='selling_price',
data=cars,
kind='scatter',
col='transmission');
Relplot by author

Four columns: two numeric and two categorical.

sns.relplot(x, y, data, hue='cat_col1', col='cat_col2'): The col_wrap parameter sets how many subplots fit in one row; any further columns wrap onto new rows so that the subplots span multiple rows.

sns.relplot(
x='year',
y='selling_price',
data=cars,
palette='bright',
height=3, aspect=1.3,
kind='scatter',
hue='transmission',
col='fuel',
col_wrap=2);
Relational scatterplots by author
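
The effect of col_wrap is easier to see by inspecting the returned grid. The sketch below uses a synthetic dataframe with four made-up fuel classes, so the four subplots wrap into two rows of two.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with four fuel classes and two transmission classes (made up).
rng = np.random.default_rng(4)
fuels = ["Diesel", "Petrol", "CNG", "LPG"]
demo = pd.DataFrame({
    "year": rng.integers(2005, 2021, 120),
    "fuel": np.tile(fuels, 30),
    "transmission": np.repeat(["Manual", "Automatic"], 60),
})
demo["selling_price"] = 1000 * (demo["year"] - 2000) + rng.normal(0, 2000, 120)

# Four col levels with col_wrap=2 gives a 2 x 2 layout.
g = sns.relplot(x="year", y="selling_price", data=demo, kind="scatter",
                hue="transmission", col="fuel", col_wrap=2,
                height=3, aspect=1.3)

print(g.axes.shape)  # with col_wrap the Axes come back as a flat array
```

Because the grid is wrapped, g.axes is one-dimensional, so loop over g.axes.flat when customizing individual subplots.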

sns.lmplot(x, y, data, col='cat_col1', hue='cat_col2')

The lmplot is the figure-level version of a regplot that draws a scatter plot with a
regression line onto a Facet grid. It does not have a kind parameter.

sns.lmplot(
x="seats",
y="engine_cc",
data=cars,
palette='bright',
col="transmission",
hue="fuel");
lmplot by author

4. Line plot
A line plot comprises dots connected by a line that shows the relationship between
the x and y variables. The x-axis usually contains time intervals, while the y-axis
holds a numeric variable whose changes we want to track over time.

Functions to use:

sns.lineplot() — axes-level plot

sns.relplot(kind=’line’) — figure-level plot

Two columns (bivariate): numeric and time series.

sns.lineplot(x=’time’, y=’num_col’, data=df)

sns.lineplot(
x="year",
y="selling_price",
data=cars)
Line plot by author

Three columns (multivariate): time series, numeric, and categorical column.

sns.lineplot(x, y, data, hue='cat_col'): We can split the lines by a categorical variable using hue.

sns.lineplot(
x="year",
y="selling_price",
data=cars,
palette='bright',
hue='fuel');

Lineplot with hue by author


The results above can be obtained using sns.relplot with kind='line' and the hue parameter.

sns.relplot(x, y, data, kind='line', col='cat_col'): As mentioned earlier, a rel plot's kind='line' parameter plots a line graph. We will use col='transmission' to create column-wise subplots for the two transmission classes.

sns.relplot(
x="year",
y="selling_price",
data=cars,
color='blue', height=4,
kind='line',
col='transmission');

Relational line plot by author

Four columns: time series, numeric, and two categorical columns.

sns.relplot(x, y, data, kind='line', col='cat_col1', hue='cat_col2')

sns.relplot(
x="year",
y="selling_price",
data=cars,
palette='bright',
height=4,
kind='line',
col='transmission',
hue="fuel");

Relational line plot with hue by author

5. Joint plot
A joint plot comprises three charts in one. The center contains the bivariate
relationship between the x and y variables. The top and right-side plots show the
univariate distribution of the x-axis and y-axis variables, respectively.

Functions to use:

sns.jointplot() — figure-level plot

Two columns (bivariate): two numeric

sns.jointplot(x='num_col1', y='num_col2', data=df): By default, the center plot is a scatter plot (kind='scatter'), while the side plots are histograms.

sns.jointplot(
x='max_power_bhp',
y='selling_price',
data=cars);
Joint plot by author

The joint plots in the image below utilize different kind parameters ('kde', 'hist', 'hex', or 'reg') as annotated in each figure.


Joint plots with different "kind" parameters by author

Three columns (multivariate): two numeric, one categorical

sns.jointplot(x, y, data, hue=’cat_col’)

sns.jointplot(
x='selling_price',
y='max_power_bhp',
data=cars,
palette='bright',
hue='transmission');
Joint plot with hue parameter by author

Part two: Exploring the relationships between categorical and numeric columns
In the following charts, the x-axis will hold a categorical variable and the y-axis a
numeric variable.
6. Bar plot
The bar chart uses bars of different heights to compare the distribution of a
numeric variable between groups of a categorical variable.

By default, bar heights are estimated using the mean. The estimator parameter
changes this aggregation function and accepts Python's built-in functions such as
estimator=max or len, or NumPy functions like np.max and np.median.
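
As a quick check of what estimator does, the sketch below draws a bar plot with estimator=np.median on synthetic data and compares the bar heights with a pandas groupby. The data and the explicit order list are assumptions made for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic prices for three fuel classes (values are made up).
rng = np.random.default_rng(7)
order = ["Diesel", "Petrol", "CNG"]
demo = pd.DataFrame({
    "fuel": np.tile(order, 20),
    "selling_price": rng.gamma(shape=2.0, scale=3000, size=60),
})

# Bars show the median instead of the default mean.
ax = sns.barplot(x="fuel", y="selling_price", data=demo,
                 order=order, estimator=np.median)

# Compare bar heights with the same aggregation done in pandas.
heights = [bar.get_height() for bar in ax.patches]
medians = demo.groupby("fuel")["selling_price"].median().reindex(order)
print(medians.round(0))
```

Passing order fixes the left-to-right position of the categories, which makes the bars easy to match against the pandas result.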

Functions to use:

sns.barplot() — axes-level plot

sns.catplot(kind=’bar’) — figure-level plot

Two columns (bivariate): numeric and categorical

sns.barplot(x=’cat_col’, y=’num_col’, data=df)


sns.barplot(
x='fuel',
y='selling_price',
data=cars,
color='blue',
# estimator=sum,
# estimator=np.median,
);

Barplot by author

Three columns (multivariate): two categorical and one numeric.

sns.barplot(x, y, data, hue='cat_col2')

sns.barplot(
x='fuel',
y='selling_price',
data=cars,
palette='bright',
hue='transmission');
Barplot with hue by author

sns.catplot(x, y, data, kind='bar', hue=’cat_col')

A catplot, or categorical plot, uses the kind parameter to specify what categorical
plot to draw, with options being 'strip' (default), 'swarm', 'box', 'violin', 'boxen', 'point' and 'bar'.

The plot below uses catplot to create a similar plot to the one above.

sns.catplot(
x='fuel',
y='selling_price',
data=cars,
palette='bright',
kind='bar',
hue='transmission');
Barplot with hue parameter by author

Four columns: three categorical and one numeric

sns.catplot(x, y, data, kind='bar', hue='cat_col2', col='cat_col3'): Use the col_wrap parameter to set how many subplots fit in one row so that the subplots span multiple rows.

g = sns.catplot(
x='fuel',
y='selling_price',
data=cars,
palette='bright',
height=3, aspect=1.3,
kind='bar',
hue='transmission',
col ='seller_type',
col_wrap=2)
g.set_titles(
'Seller: {col_name}');
Categorical barplot by author

7. Point plot
Instead of bars like in a bar plot, a point plot draws dots to represent the mean (or
another estimate) of each category group. A line then joins the dots, making it easy
to compare how the y variable’s central tendency changes for the groups.

Functions to use:

sns.pointplot() — axes-level plot

sns.catplot(kind=’point’) — figure-level plot

Two columns (bivariate): one categorical and one numeric

sns.pointplot(x=’cat_col’, y=’num_col’, data=df)

sns.pointplot(
x='seller_type',
y='mileage_kmpl',
data=cars);
Point plot by author

Three columns (multivariate): two categorical and one numeric

When you add a third category using hue , a point plot is more informative than a
bar plot because a line is drawn through each “hue” class, making it easy to
compare how that class changes across the x variable’s groups.

sns.catplot(x, y, data, kind='point', hue='cat_col2'): Here, catplot is used with kind='point' and hue='cat_col2'. The same results can be obtained using sns.pointplot and the hue parameter.

sns.catplot(
x='transmission',
y='selling_price',
data=cars,
palette='bright',
kind='point',
hue='seller_type');
Categorical point plot by author

sns.catplot(x, y, data, kind='point', col='cat_col2', hue='cat_col2'): Here, we use the same categorical feature in the hue and col parameters.

sns.catplot(
x='fuel',
y='year',
data=cars,
ci=None,
height=5, #default
aspect=.8,
kind='point',
hue='owner',
col='owner',
col_wrap=3);
Point plots using hue and col by author

8. Box plot
A box plot visualizes the distribution between numeric and categorical variables by
displaying the information about the quartiles.

Boxplot illustration by author

From the plots, you can see the minimum value, median, maximum value, and
outliers for every category class.
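
The numbers behind a box plot can be computed by hand. The sketch below uses a small made-up series and the common 1.5 × IQR rule, which is seaborn's default whisker setting (whis=1.5); points beyond the limits are drawn as outliers, and the whiskers stop at the most extreme values inside them.

```python
import pandas as pd

# A small made-up sample with one unusually large value.
values = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30])

# Quartiles: the box spans q1 to q3 and the line inside is the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits used to decide which points count as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

# Whiskers end at the most extreme observations inside the limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3)        # 4.0 6.0 8.0
print(outliers.tolist())     # [30]
```

So the "minimum" and "maximum" read off a box plot are the whisker ends, which exclude any points flagged as outliers.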
Functions to use:

sns.boxplot() — axes-level plot

sns.catplot(kind=’box’) — figure-level plot

Two columns (bivariate): one categorical and one numeric

sns.boxplot(x=’cat_col’, y=’num_col’, data=df)

sns.boxplot(
x='owner',
y='engine_cc',
data=cars,
color='blue')

plt.xticks(rotation=45,
ha='right');

Boxplot by author

Three columns (multivariate): two categorical and one numeric

sns.boxplot(x, y, data, hue='cat_col2'): These results can also be recreated with sns.catplot using kind='box' and hue.
sns.boxplot(
x='fuel',
y='max_power_bhp',
data=cars,
palette='bright',
hue='transmission');

Boxplot using hue by author

sns.catplot(x, y, data, kind='box', col='cat_col2'): Use the catplot function with kind='box' and provide the col parameter to create subplots.

sns.catplot(
x='fuel',
y='max_power_bhp',
data=cars,
palette='bright',
kind = 'box',
col='transmission');
Categorical boxplots by author

Four columns: three categorical and one numeric

sns.catplot(x, y, data, kind='box', hue='cat_col2', col=’cat_col3')

g = sns.catplot(
x='owner',
y='year',
data=cars,
palette='bright',
height=3, aspect=1.5,
kind='box',
hue='transmission',
col='fuel',
col_wrap=2)
g.set_titles(
'Fuel: {col_name}');

g.set_xticklabels(
rotation=45, ha='right')
Categorical boxplots by author

9. Violin plot
In addition to the quartiles displayed by a box plot, a violin plot draws a kernel
density estimate curve that shows where observations are more or less concentrated.

Image from source

Functions to use:

sns.violinplot() — axes-level plot


sns.catplot(kind=’violin’) — figure-level plot

Two columns (bivariate): numeric and categorical.

sns.violinplot(x='cat_col', y='num_col', data=df)

sns.violinplot(
x='transmission',
y='engine_cc',
data=cars,
color='blue');

Violin plot by author

Three columns (multivariate) — Two categorical and one numeric.

sns.catplot(x, y, data, kind='violin', hue='cat_col2'): Use the catplot function with kind='violin' and hue='cat_col2'. The same results below can be replicated using sns.violinplot with the hue parameter.

g = sns.catplot(
x='owner',
y='year',
data=cars,
palette='bright',
height=3,
aspect=2,
split=False,
# split=True
kind='violin',
hue='transmission')
g.set_xticklabels(
rotation=45,
ha='right')

The violin plot supports the split parameter, which draws half of the violin plot for
each categorical class. Note that this works when the hue variable has only two
classes.
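
Here is a small sketch of split=True using the axes-level sns.violinplot. The dataframe is synthetic (invented values, borrowed column names) and the hue column has exactly two classes, which is the case the split parameter is meant for.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two owner classes and two transmission classes (made up).
rng = np.random.default_rng(8)
demo = pd.DataFrame({
    "owner": np.tile(["First", "Second"], 60),
    "transmission": np.repeat(["Manual", "Automatic"], 60),
    "year": rng.normal(2014, 3, 120).round(),
})

# Each violin shows one hue class on the left half and the other on the right.
ax = sns.violinplot(x="owner", y="year", hue="transmission",
                    data=demo, split=True)

legend_labels = sorted(text.get_text() for text in ax.get_legend().get_texts())
print(legend_labels)
```

With split=False, each owner class would instead get two full violins drawn side by side.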

Four columns: three categorical and one numeric


sns.catplot(x, y, data, kind='violin', hue='cat_col2', col=’cat_col3') — Here, we
filter the data for only ‘diesel’ and ‘petrol’ fuel types.

my_df = cars[cars['fuel'].isin(['Diesel','Petrol'])]

g = sns.catplot(
x="owner",
y="engine_cc",
data=my_df,
palette='bright',
kind = 'violin',
hue="transmission",
col = 'fuel')

g.set_xticklabels(
rotation=90);

Violin plots by author

10. Strip plot


A strip plot uses dots to show how a numeric variable is distributed among classes
of a categorical variable. Think of it as a scatter plot where one axis is a categorical
feature.

Functions to use:
sns.stripplot() — axes-level plot

sns.catplot(kind=’strip’) — figure-level plot

Two variables (bivariate): one categorical and one numeric

sns.stripplot(x=’cat_col’, y=’num_col’, data=df)

plt.figure(
figsize=(12, 6))
sns.stripplot(
x='year',
y='km_driven',
data=cars,
linewidth=.5,
color='blue')
plt.xticks(rotation=90);

Stripplot by author

Three columns (multivariate): two categorical and one numeric

sns.catplot(x, y, data, kind='strip', hue='cat_col2'): Use the catplot function with kind='strip' (default) and provide the hue parameter. The argument dodge=True (default is dodge=False) can be used to separate the vertical dots by color.

sns.catplot(
x='seats',
y='km_driven',
data=cars,
palette='bright',
height=3,
aspect=2.5,
# dodge=True,
kind='strip',
hue='transmission');
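
To see what dodge changes, the sketch below draws the same synthetic data twice with the axes-level sns.stripplot and counts the distinct x positions of the dots. jitter=False is set only so the positions are exact; the helper distinct_x_positions is hypothetical and not part of seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: three seat counts, two transmission classes (made up).
rng = np.random.default_rng(9)
demo = pd.DataFrame({
    "seats": np.tile([4, 5, 7], 20),
    "transmission": np.repeat(["Manual", "Automatic"], 30),
    "km_driven": rng.gamma(2.0, 30000, 60),
})

fig, (ax_plain, ax_dodged) = plt.subplots(1, 2, figsize=(10, 4))
common = dict(x="seats", y="km_driven", hue="transmission",
              data=demo, jitter=False)
sns.stripplot(dodge=False, ax=ax_plain, **common)   # colors overlap in one strip
sns.stripplot(dodge=True, ax=ax_dodged, **common)   # one strip per color

def distinct_x_positions(ax):
    """Hypothetical helper: count distinct x positions of the plotted dots."""
    xs = np.concatenate([np.asarray(c.get_offsets())[:, 0]
                         for c in ax.collections])
    return len(np.unique(np.round(xs, 6)))

n_plain = distinct_x_positions(ax_plain)
n_dodged = distinct_x_positions(ax_dodged)
print(n_plain, n_dodged)
```

Without dodge there is one strip per seat count; with dodge each seat count gets a separate strip for every transmission class.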

Four columns: three categorical and one numeric

sns.catplot(x, y, data, kind='strip', hue='cat_col2', col='cat_col3')

g = sns.catplot(
x="seller_type",
y="year",
data=cars,
palette='bright',
height=3, aspect=1.6,
kind='strip',
hue='owner',
col='fuel',
col_wrap=2)
g.set_xticklabels(
rotation=45,
ha='right');

Categorical strip plots by author

Combining strip plot with violin plot

A strip plot can be used together with a violin plot or box plot to show the position
of gaps or outliers in the data.

g = sns.catplot(
x='seats',
y='mileage_kmpl',
data=cars,
palette='bright',
aspect=2,
inner=None,
kind='violin')
sns.stripplot(
x='seats',
y='mileage_kmpl',
data=cars,
color='k',
linewidth=0.2,
edgecolor='white',
ax=g.ax);

Strip and violin plots by author

Additional remarks
For categorical plots such as bar plots and box plots, the bar direction can be
re-oriented to horizontal bars by switching the x and y variables.

The row and col parameters of the FacetGrid figure-level objects used together
can add another dimension to the subplots. However, col_wrap cannot be used with
the row parameter.

The FacetGrid supports different parameters depending on the underlying plot. For example, sns.catplot(kind='violin') will support the split parameter while other kinds will not. More on the kind-specific options in this documentation.

Figure-level functions also create bivariate plots. For example, sns.catplot(x='fuel', y='mileage_kmpl', data=cars, kind='bar') creates a basic bar plot.
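
The remark about row, col, and col_wrap can be checked with a short sketch. It uses a synthetic dataframe (made-up values, borrowed column names) and shows that combining row with col_wrap is rejected with a ValueError.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with three categorical columns (values are made up).
rng = np.random.default_rng(10)
demo = pd.DataFrame({
    "fuel": np.tile(["Diesel", "Petrol"], 40),
    "transmission": np.tile(np.repeat(["Manual", "Automatic"], 2), 20),
    "seller_type": np.repeat(["Dealer", "Individual"], 40),
    "selling_price": rng.gamma(2.0, 3000, 80),
})

# row and col together give a two-dimensional grid of subplots.
g = sns.catplot(x="fuel", y="selling_price", data=demo, kind="bar",
                row="transmission", col="seller_type", height=2.5)
print(g.axes.shape)

# Adding col_wrap on top of row is not allowed.
try:
    sns.catplot(x="fuel", y="selling_price", data=demo, kind="bar",
                row="transmission", col="seller_type", col_wrap=2)
    wrap_with_row_failed = False
except ValueError as err:
    wrap_with_row_failed = True
    print("Rejected:", err)
```

When you need wrapping, drop row and put the extra variable in hue instead.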

Conclusion
In this article, we performed bivariate and multivariate analyses on a dataset.
We first created matrix plots that visualized relationships in a grid to identify
numeric variables with high correlations. We then used different axes-level and
figure-level functions to create charts that explored the relationships between the
numeric and categorical columns. Find the code here on GitHub.

I hope you enjoyed the article. To receive more like this whenever I publish,
subscribe here. If you are not yet a medium member and would like to support me
as a writer, follow this link and I will earn a small commission. Thank you for
reading!


Written by Susan Maina

