21AD71 Module 3 Textbook
Introduction
In the previous chapter, we took an in-depth look at Matplotlib, one of the most
popular plotting libraries for Python. Various plot types were covered, and we looked
into customizing plots to create aesthetic plots.
Seaborn aims to make visualization a central part of data exploration and
understanding. Internally, Seaborn operates on DataFrames and arrays that contain
the complete dataset. This enables it to perform the semantic mappings and statistical
aggregations that are essential for displaying informative visualizations. Seaborn can
also be used simply to change the style and appearance of Matplotlib visualizations.
Its features include the following:
• Built-in color palettes that can be used to reveal patterns in the dataset
• A dataset-oriented interface
Advantages of Seaborn
Working with DataFrames using Matplotlib adds some inconvenient overhead. For
example, simply exploring your dataset can take up a lot of time, since you require
some additional data wrangling to be able to plot the data from the DataFrames
using Matplotlib.
Seaborn, however, is built to operate on DataFrames and full dataset arrays, which
makes this process simpler. It internally performs the necessary semantic mappings
and statistical aggregation to produce informative plots.
Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html
is used in this chapter. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.
Seaborn uses Matplotlib to draw plots. Even though many tasks can be accomplished
with just Seaborn, further customization might require the use of Matplotlib. With
Seaborn, you only provide the names of the variables in the dataset and the roles they
play in the plot. Unlike in Matplotlib, it is not necessary to translate the variables into
parameters of the visualization.
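The dataset-oriented interface described above can be illustrated with a minimal sketch. The DataFrame and its column names (hours, salary, education) are made up for this illustration and are not part of the chapter's dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "hours": [20, 35, 40, 45, 50, 60],
    "salary": [18000, 32000, 41000, 47000, 52000, 70000],
    "education": ["A", "A", "B", "B", "C", "C"],
})

# Only column names and their roles (x, y, hue) are given;
# Seaborn maps the "education" categories to colors on its own.
ax = sns.scatterplot(data=df, x="hours", y="salary", hue="education")
```

The axis labels are taken from the column names, so no separate labeling call is needed.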
Other potential obstacles are the default Matplotlib parameters and configurations.
The default parameters in Seaborn provide better visualizations without
additional customization. We will look at these default parameters in detail in the
upcoming topics.
For users who are already familiar with Matplotlib, the extension with Seaborn is self-
evident, since the core concepts are mostly similar.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
Seaborn categorizes Matplotlib's parameters into two groups. The first group
contains parameters for the aesthetics of the plot, while the second group scales
various elements of the plot so that it can be easily used in different contexts, such as
visualizations that are used for presentations and posters.
Here is an example:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
210 | Simplifying Visualizations Using Seaborn
Here is an example:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
Controlling Figure Aesthetics | 211
The aesthetics are only changed temporarily. The result is shown in the
following diagram:
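A temporary style change is usually written as a with-statement. The following is a minimal sketch, assuming that the axes_style() context manager is what the truncated listing above relied on; the "dark" style is an arbitrary choice for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # global Seaborn defaults (the "darkgrid" style)
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]

# Inside the with-block the "dark" style applies; the axes must be created here.
with sns.axes_style("dark"):
    fig, ax = plt.subplots()
    ax.plot(x1, label="Group A")
    ax.plot(x2, label="Group B")
    ax.legend()

# After the block, the global style parameters are restored.
style_after = sns.axes_style()
```

Axes created after the with-block use the global style again, which is why the change is only temporary.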
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
sns.despine()
plt.legend()
plt.show()
In the next section, we will learn to control the scale of plot elements.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
Contexts are an easy way to use preconfigured scales of plot elements for different
use cases. We will apply them in the following exercise, which uses a box plot to
compare the IQ scores of different test groups.
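To see what a context actually changes, the scaling parameters can be inspected directly. This is a small sketch using plotting_context(), which returns the parameters of a named context as a dictionary; the two keys printed here are just a sample:

```python
import seaborn as sns

contexts = ["paper", "notebook", "talk", "poster"]

# plotting_context() returns the scaling parameters of a context as a dict.
for name in contexts:
    params = sns.plotting_context(name)
    print(name, params["font.size"], params["lines.linewidth"])

# Font sizes grow from the smallest context (paper) to the largest (poster).
sizes = [sns.plotting_context(name)["font.size"] for name in contexts]
```

Only the scale of the elements differs between contexts; the style (colors, grid, spines) is controlled separately.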
Note
All the exercises and activities in this chapter are developed using Jupyter
Notebook. The files can be downloaded from the following link:
https://packt.live/2ONDmLl. All the datasets used in this chapter can be found at
https://packt.live/3bzApYN.
Exercise 4.01: Comparing IQ Scores for Different Test Groups by Using a Box Plot
In this exercise, we will generate a box plot using Seaborn. We will compare IQ scores
among different test groups using a box plot of the Seaborn library to demonstrate
how easy and efficient it is to create plots with Seaborn provided that we have a
proper DataFrame. This exercise also shows how to quickly change the style and
context of a Figure using the pre-configurations supplied by Seaborn.
Let's compare IQ scores among different test groups using the Seaborn library:
3. Use the pandas read_csv() function to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/iq_scores.csv")
4. Access the data of each test group in the column. Convert this into a list using
the tolist() method. Once the data of each test group has been converted
into a list, assign this list to variables of each respective test group:
group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()
5. Print the values of each group to check whether the data inside it is converted
into a list. This can be done with the help of the print() function:
print(group_a)
print(group_b)
print(group_c)
print(group_d)
6. Once we have the data for each test group, we need to construct a DataFrame
from this data. This can be done with the help of the pd.DataFrame()
function, which is provided by pandas:
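The listing for this step is missing from the text. The following is a possible sketch, assuming a long-format DataFrame with the two columns Groups and IQ score that the next step mentions. The short lists are hypothetical stand-ins for the values read from the CSV file:

```python
import pandas as pd

# Hypothetical stand-ins for the lists read from iq_scores.csv in the steps above.
group_a = [118, 103, 125]
group_b = [126, 89, 90]
group_c = [108, 89, 98]
group_d = [93, 99, 91]

# Long ("tidy") format: one row per observation, with the group as a column.
data = pd.DataFrame({
    "Groups": ["Group A"] * len(group_a) + ["Group B"] * len(group_b)
              + ["Group C"] * len(group_c) + ["Group D"] * len(group_d),
    "IQ score": group_a + group_b + group_c + group_d,
})
print(data.columns)
```

Storing the group as its own column is what later lets Seaborn place the groups on the x-axis by name.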
7. If you don't create your own DataFrame, it is often helpful to print the column
names, which is done by calling print(data.columns). The output is
as follows:
You can see that our DataFrame has two variables with the labels Groups and
IQ score. This is especially interesting since we can use them to specify which
variable to plot on the x-axis and which one on the y-axis.
8. Now, since we have the DataFrame, we need to create a box plot using the
boxplot() function provided by Seaborn. Within this function, specify the
variables for both the axes along with the DataFrame. Make Groups the variable
to plot on the x-axis, and IQ score the variable for the y-axis. Pass data as
a parameter. Here, data is the DataFrame that we obtained from the previous
step. Moreover, use the whitegrid style, set the context to talk, and remove
all axes spines, except the one on the bottom:
plt.figure(dpi=150)
# Set style and context
sns.set_style('whitegrid')
sns.set_context('talk')
# Create boxplot
sns.boxplot(x='Groups', y='IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
The despine() function helps in removing the top and right spines from the
plot by default (without passing any arguments to the function). Here, we have
also removed the left spine. Using the title() function, we have set the title
for our plot. The show() function visualizes the plot.
After executing the preceding steps, the final output should be as follows:
From the preceding diagram, we can conclude that Seaborn offers visually appealing
plots out of the box and allows easy customization, such as changing the style,
context, and spines. Once a suitable DataFrame exists, the plot is created with
a single function call. Column names are automatically used for labeling the axes. Even
categorical variables are supported out of the box.
Note
To access the source code for this specific section, please refer to
https://packt.live/3hwvR8m.
Another great advantage of Seaborn is color palettes, which are introduced in the
following section.
Color Palettes
Color is a very important factor for your visualization. Color can reveal patterns in
data if used effectively or hide patterns if used poorly. Seaborn makes it easy to
select and use color palettes that are suited to your task. The color_palette()
function provides an interface for many of the possible ways to generate
color palettes.
You can set the palette for all plots with set_palette(). This function accepts the
same arguments as color_palette(). In the following sections, we will explain
how color palettes are divided into different groups.
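The relationship between color_palette() and set_palette() can be sketched as follows. The palette name and the number of colors are arbitrary choices for illustration:

```python
import seaborn as sns

# color_palette() returns a list-like object of RGB tuples.
muted = sns.color_palette("muted", 4)
print(list(muted))

# set_palette() accepts the same arguments and changes the default color cycle.
sns.set_palette("muted", 4)
current = sns.color_palette()  # no arguments: the current default palette
```

Calling color_palette() only creates a palette, whereas set_palette() also makes it the default for all subsequent plots.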
Choosing the best color palette is not straightforward and, to some extent, subjective.
To make a good decision, you have to know the characteristics of your data. There are
three general groups of color palettes, namely, categorical, sequential, and diverging,
which we will break down in the following sections.
Some examples where it is suitable to use categorical color palettes are line charts
showing stock trends for different companies, and a bar chart with subcategories;
basically, any time you want to group your data.
There are six default palettes in Seaborn: deep, muted, bright, pastel, dark,
and colorblind. The code for each palette is provided below. Among these color
palettes, no single one is objectively best. Choose the one you prefer and the one that
best fits the overall theme of the visualization. The colorblind palette is always a
reasonable choice because it keeps the plot readable for people with color vision
deficiencies. The following is the code to create each palette, starting with deep:
palette1 = sns.color_palette("deep")
sns.palplot(palette1)
palette2 = sns.color_palette("muted")
sns.palplot(palette2)
palette3 = sns.color_palette("bright")
sns.palplot(palette3)
palette4 = sns.color_palette("pastel")
sns.palplot(palette4)
palette5 = sns.color_palette("dark")
sns.palplot(palette5)
palette6 = sns.color_palette("colorblind")
sns.palplot(palette6)
Cubehelix palettes are one family of sequential color palettes that Seaborn offers.
Their brightness increases or decreases linearly while the hue varies slightly, meaning
that the information is preserved even when the palette is converted to black and white.
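The claim about brightness can be checked with a short sketch. The intensity weights used here (0.30, 0.59, 0.11) are a common rough approximation of perceived brightness and are an assumption of this example, not something Seaborn exposes:

```python
import seaborn as sns

# Eight colors from the default cubehelix palette (running from light to dark).
palette = sns.cubehelix_palette(8)

# Approximate perceived intensity of each color; the weights are a rough
# grayscale proxy chosen for this illustration.
intensity = [0.30 * r + 0.59 * g + 0.11 * b for r, g, b in palette]
print([round(value, 2) for value in intensity])
```

The printed values fall steadily from the first color to the last, which is why the ordering survives a grayscale conversion.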
Creating custom sequential palettes that only produce colors that start at either light
or dark desaturated colors and end with a specified color can be accomplished with
light_palette() or dark_palette(). Two examples are given in
the following:
custom_palette2 = sns.light_palette("magenta")
sns.palplot(custom_palette2)
The preceding palette can also be reversed by setting the reverse parameter to
True in the following code:
custom_palette3 = sns.light_palette("magenta", reverse=True)
sns.palplot(custom_palette3)
By default, creating a color palette only returns a list of colors. If you want to use it as
a colormap object, for example, in combination with a heatmap, set the
as_cmap=True argument, as demonstrated in the following example:
import numpy as np
import seaborn as sns
x = np.arange(25).reshape(5, 5)
ax = sns.heatmap(x, cmap=sns.cubehelix_palette(as_cmap=True))
custom_palette4 = sns.color_palette("coolwarm", 7)
sns.palplot(custom_palette4)
As we already mentioned, colors, when used effectively, can reveal patterns in data.
Spend some time thinking about which color palette is best for certain data. Let's
apply color palettes to visualize temperature changes in the following exercise.
Note
The dataset used for this exercise is taken from https://data.giss.nasa.gov/gistemp/
(accessed January 7, 2020). For more details about the dataset,
visit the website, looking at the FAQs in particular. This dataset is also
available in your Datasets folder.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data = pd.read_csv("../../Datasets/northern_surface_temperature.csv",
                   index_col=['Year'])
data = data.transpose()
4. Create a custom-diverging palette that diverges to blue (240 degrees on the hue
wheel) for low values and to red (15 degrees on the hue wheel) for high values.
Set the saturation as s=99. Make sure that the diverging_palette()
function returns a colormap by setting as_cmap=True:
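The listing for this step is missing from the text. A minimal sketch of what it might look like is shown here. The variable name heat_colormap matches the name used in the next step, while the random matrix is a hypothetical stand-in for the temperature data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import numpy as np
import seaborn as sns

# Blue (hue 240) for low values, red (hue 15) for high values, saturation 99.
heat_colormap = sns.diverging_palette(240, 15, s=99, as_cmap=True)

# Hypothetical temperature anomalies, used only to exercise the colormap.
rng = np.random.default_rng(0)
anomalies = rng.normal(0, 1, size=(12, 10))
ax = sns.heatmap(anomalies, cmap=heat_colormap, center=0)
```

With as_cmap=True the function returns a Matplotlib colormap rather than a list of colors, which is what heatmap() needs for continuous values.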
5. Plot the heatmap for every 5 years. To ensure that the neutral color corresponds
to no temperature change (the value is zero), set center=0:
plt.figure(dpi=200)
sns.heatmap(data.iloc[:, ::5], cmap=heat_colormap, center=0)
plt.title("Temperature Changes from 1880 to 2015 " \
"(base period 1951-1980)")
plt.savefig('temperature_change.png', dpi=300, \
bbox_inches='tight')
The preceding diagram helps us to visualize the surface temperature change for
the Northern Hemisphere for past years.
Note
To access the source code for this specific section, please refer to
https://packt.live/3fracg8.
Let's now perform an activity to create a heatmap using a real-life dataset with
various color palettes.
3. Use your own appropriate colormap. Make sure that the lowest value is the
brightest color and the highest value is the darkest. After executing the preceding
steps, the expected output should be as follows:
Note
The solution to this activity can be found on page 420.
After the in-depth discussion about various color palettes, we will introduce some
more advanced plots that Seaborn offers in the following section.
Bar Plots
In the last chapter, we explained how to create bar plots with Matplotlib.
Creating bar plots with subgroups was quite tedious, but Seaborn offers a very
convenient way to create various bar plots. In Seaborn, the height of each bar can
also represent an estimate of central tendency, while the uncertainty of that estimate
is indicated by error bars at the top of each bar.
The following example gives you a good idea of how this works:
import pandas as pd
import seaborn as sns
data = pd.read_csv("../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.barplot(x="Education", y="Salary", hue="District", data=data)
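Because the salary.csv file may not be at hand, the same idea can be tried with a self-contained sketch. The DataFrame below is hypothetical; only the column names Education, Salary, and District come from the example above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical salary data with two education levels and two districts.
df = pd.DataFrame({
    "Education": ["Bachelor", "Bachelor", "Master", "Master"] * 3,
    "District": ["East", "West"] * 6,
    "Salary": [40, 42, 55, 58, 44, 41, 60, 57, 39, 45, 52, 61],
})

# By default, the bar height is the mean of each (Education, District) group;
# the error bars show the uncertainty of that estimate.
ax = sns.barplot(x="Education", y="Salary", hue="District", data=df)

# The same means computed with pandas, for comparison with the bar heights.
group_means = df.groupby(["Education", "District"])["Salary"].mean()
print(group_means)
```

Comparing the printed means with the plot shows that barplot() aggregates the raw rows itself, so no manual grouping is required before plotting.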
Let's get some practice with Seaborn bar plots in the following activity.
3. Use Seaborn to create a visually appealing bar plot that compares the two scores
for all five movies.
Advanced Plots in Seaborn | 231
After executing the preceding steps, the expected output should appear
as follows:
Note
The solution to this activity can be found on page 422.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
# distplot() is deprecated since Seaborn 0.11; in newer releases,
# histplot(..., kde=True, stat='density') is the closest replacement
sns.distplot(data.loc[:, 'Age'])
plt.xlabel('Age')
plt.ylabel('Density')
The KDE plot is shown in the following diagram, along with a shaded area under
the curve:
A scatter plot shows each observation as points on the x and y axes. Additionally, a
histogram for each variable is shown:
import pandas as pd
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="white")
sns.jointplot(x="Annual Salary", y="Age", data=data)
The scatter plot with marginal histograms is shown in the following diagram:
It is also possible to use the KDE procedure to visualize bivariate distributions. The
joint distribution is shown as a contour plot, as demonstrated in the following code:
The joint distribution is shown as a contour plot in the center of the diagram. The
darker the color, the higher the density. The marginal distributions are visualized on
the top and on the right.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data, hue='Education')
Note
The age_salary_hours dataset is derived from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html.
A pair plot, also called a correlogram, is shown in the following diagram. Scatter plots
are shown for all variable pairs on the off-diagonal, while KDEs are shown on the
diagonal. Groups are highlighted by different colors:
Violin Plots
A different approach to visualizing statistical measures is by using violin plots. They
combine box plots with the kernel density estimation procedure that we described
previously. They provide a richer description of the variable's distribution. Additionally,
the quartile and whisker values from the box plot are shown inside the violin.
import pandas as pd
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.violinplot(x='Education', y='Salary', hue='Gender', \
data=data, split=True, cut=0)
The violin plot shows both statistical measures and the probability distribution. The
data is divided into education groups, which are shown on the x-axis, and gender
groups, which are highlighted by different colors.
With the next activity, we will conclude the section about advanced plots. In the
section after that, multi-plots in Seaborn are introduced.
3. Create a pandas DataFrame from the data for each respective group.
4. Create a violin plot for the IQ scores of the different test groups using Seaborn's
violinplot function.
5. Use the whitegrid style, set the context to talk, and remove all axes spines,
except the one on the bottom. Add a title to the plot.
After executing the preceding steps, the final output should appear as follows:
Note
The solution to this activity can be found on page 424.
Multi-Plots in Seaborn
In the previous topic, we introduced a multi-plot, namely, the pair plot. In this topic,
we want to talk about a different way to create flexible multi-plots.
FacetGrid
The FacetGrid is useful for visualizing a certain plot for multiple variables separately.
A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first
two have the obvious relationship with the rows and columns of an array. The hue is
the third dimension and is shown in different colors. The FacetGrid class has to be
initialized with a DataFrame, and the names of the variables that will form the row,
column, or hue dimensions of the grid. These variables should be categorical
or discrete.
• row, col, hue: Variables that define subsets of the given data, which will be
drawn on separate facets in the grid
Initializing the grid does not draw anything on it yet. To visualize data on this grid, the
FacetGrid.map() method has to be used. You can provide any plotting function
and the name(s) of the variable(s) in the DataFrame to the plot:
• *args: The column names in data that identify variables to plot. The data for
each variable is passed to func in the order in which the variables are specified.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")[:1000]
g = sns.FacetGrid(data, col='District')
g.map(plt.scatter, 'Salary', 'Age')
Visualize the given data using a FacetGrid with two columns. The first column should
show the number of subscribers for each YouTube channel, whereas the second
column should show the number of views. The goal of this activity is to get some
practice working with FacetGrids. The following are the steps to implement
this activity:
1. Use pandas to read the YouTube.csv dataset located in the Datasets folder.
2. Access the data of each group in the column, convert this into a list, and assign
this list to variables of each respective group.
3. Create a pandas DataFrame with the preceding data, using the data of each
respective group.
After executing the preceding steps, the final output should appear as follows:
Note
The solution to this activity can be found on page 427.
In the next section, we will learn how to plot a regression plot using Seaborn.
Regression Plots
Regression is a technique in which we estimate the relationship between a
dependent variable (usually plotted along the y-axis) and an independent variable
(usually plotted along the x-axis). Given a dataset, we can assign independent and
dependent variables and then use various regression methods to find the relationship
between these variables. Here, we will only cover linear regression; however, Seaborn
provides a wider range of regression functionality if needed.
import numpy as np
import seaborn as sns
x = np.arange(100)
# normal distribution with mean 0 and a standard deviation of 5
y = x + np.random.normal(0, 5, size=100)
sns.regplot(x=x, y=y)
The regplot() function draws a scatter plot, a regression line, and a 95%
confidence interval for that regression, as shown in the following diagram:
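To make the fitted line concrete, its slope and intercept can be computed with NumPy alongside the plot. This is a sketch with a fixed random seed so that the numbers are reproducible; np.polyfit() is used here for illustration and is not claimed to be identical to Seaborn's internal procedure:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import numpy as np
import seaborn as sns

rng = np.random.default_rng(42)
x = np.arange(100)
y = x + rng.normal(0, 5, size=100)  # true slope 1, intercept 0, plus noise

# Least-squares line through the points.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# regplot() draws the scatter plot, a fitted line, and the confidence band.
ax = sns.regplot(x=x, y=y)
```

The estimated slope is close to 1 because the noise is small relative to the range of x.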
Note
The dataset used is from http://genomics.senescence.info/download.html#anage.
The dataset can also be downloaded from GitHub. Here is the
link to it: https://packt.live/3bzApYN.
After executing the preceding steps, the output should appear as follows:
Note
The solution to this activity can be found on page 430.
In the next section, we will learn how to create tree maps using the Squarify library together with Seaborn.
Squarify
At this point, we will briefly talk about tree maps. Tree maps display hierarchical
data as a set of nested rectangles. Each group is represented by a rectangle whose
area is proportional to its value. Using color schemes, it is possible to represent
hierarchies (groups, subgroups, and so on). Compared to pie charts, tree maps use
space efficiently. Matplotlib and Seaborn do not offer tree maps, so the Squarify
library, which is built on top of Matplotlib, is used. Seaborn is a useful complement for
creating the color palettes.
Note
To install Squarify, first launch the command prompt from the
Anaconda Navigator. Then, execute the following command:
pip install squarify.
The following code snippet is a basic tree map example. It requires the
squarify library:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
colors = sns.light_palette("brown", 4)
squarify.plot(sizes=[50, 25, 10, 15], \
label=["Group A", "Group B", "Group C", "Group D"], \
color=colors)
plt.axis("off")
plt.show()
Now, let's have a look at a real-world example that uses tree maps in the
following exercise.
Note
Before beginning the exercise, make sure you have installed Squarify by
executing pip install squarify on your command prompt. The
water_usage.csv dataset used in this exercise is sourced from this
link: https://www.epa.gov/watersense/how-we-use-water. Their data originates
from https://www.waterrf.org/research/projects/residential-end-uses-water-version-2.
This dataset is also available in your Datasets folder.
mydata = pd.read_csv("../../Datasets/water_usage.csv", \
index_col=0)
4. Create a list of labels by combining the Usage and Percentage columns of the
preceding dataset. Here, the astype('str') method is used to cast the percentages
to strings:
labels = mydata['Usage'] \
+ ' (' + mydata['Percentage'].astype('str') + '%)'
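The effect of this concatenation can be seen on a tiny example. The two rows below are hypothetical and only mimic the shape of the dataset described in the exercise:

```python
import pandas as pd

# Hypothetical rows in the shape the exercise describes.
mydata = pd.DataFrame({
    "Usage": ["Toilet", "Shower"],
    "Percentage": [24, 20],
})

# Numbers must be cast to strings before they can be concatenated with text.
labels = mydata["Usage"] + " (" + mydata["Percentage"].astype("str") + "%)"
print(labels.tolist())  # ['Toilet (24%)', 'Shower (20%)']
```

Without the cast, adding a string column to a numeric column raises a TypeError.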
5. To create a tree map visualization of the given data, use the plot() function of
the squarify library. This function takes three parameters. The first parameter
is a list of all the percentages, and the second parameter is a list of all the labels,
which we got in the previous step. The third parameter is the colormap that can
be created by using the light_palette() function of the Seaborn library:
# Create figure
plt.figure(dpi=200)
# Create tree map
squarify.plot(sizes=mydata['Percentage'], \
label=labels, \
color=sns.light_palette('green', mydata.shape[0]))
plt.axis('off')
# Add title
plt.title('Water usage')
# Show plot
plt.show()
Note
To access the source code for this specific section, please refer to
https://packt.live/3fxRzqZ.
To conclude this exercise, you can see that tree maps are great for visualizing part-
of-a-whole relationships. We immediately see that using the toilet requires the most
water, followed by showers.
Activity 4.06: Visualizing the Impact of Education on Annual Salary and Weekly
Working Hours
In this activity, we will generate multiple plots using a real-life dataset. You're asked
to get insights on whether people's education has an influence on their annual
salary and weekly working hours. You ask 500 people in the state of New York about
their age, annual salary, weekly working hours, and education. You first want
to know the percentage for each education type, so you use a tree map.
Two violin plots will be used to visualize the annual salary and weekly working hours.
Compare in each case to what extent education has an impact.
It should also be taken into account that all visualizations in this activity are designed
to be suitable for colorblind people. In principle, this is always a good idea to bear
in mind.
Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html
is used in this activity. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.
3. Create a subplot with two rows to visualize two violin plots for the annual salary
and weekly working hours, respectively. Compare in each case to what extent
education has an impact. To exclude pensioners, only consider people younger
than 65. Use a colormap that is suitable for colorblind people. subplots()
can be used in combination with Seaborn's plot, by simply passing the ax
argument with the respective axes. The following output will be generated after
implementing this step:
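Passing the ax argument can be sketched as follows. This is not the activity's solution: the data is randomly generated, and the column name "Weekly hours" is an assumption made for this illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical survey data; "Weekly hours" is an assumed column name.
rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    "Education": rng.choice(["High school", "Bachelor", "Master"], size=n),
    "Age": rng.integers(18, 80, size=n),
    "Annual Salary": rng.normal(55000, 12000, size=n),
    "Weekly hours": rng.normal(40, 6, size=n),
})
working_age = df[df["Age"] < 65]  # exclude pensioners

# Make the colorblind palette the default color cycle for the plots below.
sns.set_palette("colorblind")

# Two rows of axes; each Seaborn plot is directed to one of them via ax=.
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
sns.violinplot(x="Education", y="Annual Salary", data=working_age, ax=axes[0])
sns.violinplot(x="Education", y="Weekly hours", data=working_age, ax=axes[1])
fig.tight_layout()
```

Without the ax argument, both violin plots would be drawn onto the same, most recently active axes.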