21AD71 Module 3 Textbook

21AD71-module-3-textbook

Uploaded by

Dhanashree
Copyright
© © All Rights Reserved

MODULE 3

204 | Simplifying Visualizations Using Seaborn

Introduction
In the previous chapter, we took an in-depth look at Matplotlib, one of the most
popular plotting libraries for Python. Various plot types were covered, and we looked
into customizing plots to create aesthetic plots.

Unlike Matplotlib, Seaborn is not a standalone Python library. It is built on top
of Matplotlib and provides a higher-level abstraction to make visually appealing
statistical visualizations. A neat feature of Seaborn is the ability to integrate with
DataFrames from the pandas library.

With Seaborn, we attempt to make visualization a central part of data exploration and
understanding. Internally, Seaborn operates on DataFrames and arrays that contain
the complete dataset. This enables it to perform semantic mappings and statistical
aggregations that are essential for displaying informative visualizations. Seaborn can
also be used to simply change the style and appearance of Matplotlib visualizations.

The most prominent features of Seaborn are as follows:

• Beautiful out-of-the-box plots with different themes

• Built-in color palettes that can be used to reveal patterns in the dataset

• A dataset-oriented interface

• A high-level abstraction that still allows for complex visualizations

Advantages of Seaborn
Working with DataFrames using Matplotlib adds some inconvenient overhead. For
example, simply exploring your dataset can take up a lot of time, since you require
some additional data wrangling to be able to plot the data from the DataFrames
using Matplotlib.

Seaborn, however, is built to operate on DataFrames and full dataset arrays, which
makes this process simpler. It internally performs the necessary semantic mappings
and statistical aggregation to produce informative plots.

Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html
is used in this chapter. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.

The following is an example of plotting using the Seaborn library:

import seaborn as sns
import pandas as pd
sns.set(style="ticks")
data = pd.read_csv("../../Datasets/salary.csv")[:1000]
sns.relplot(x="Salary", y="Age", hue="Education",
            style="Education", col="Gender", data=data)

This creates the following plot:

Figure 4.1: Seaborn relation plot

Seaborn uses Matplotlib to draw plots. Even though many tasks can be accomplished
with just Seaborn, further customization might require the usage of Matplotlib. We
only provided the names of the variables in the dataset and the roles they play in the
plot. Unlike in Matplotlib, it is not necessary to translate the variables into parameters
of the visualization.
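
To make the contrast concrete, here is a minimal sketch that draws the same grouped scatter plot twice. It uses a small made-up DataFrame: the column names age, salary, and group are hypothetical and are not taken from the chapter's dataset. With Matplotlib, each group has to be filtered and plotted by hand; with Seaborn, naming the column that plays the hue role is enough.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30000, 42000, 58000, 61000, 50000, 35000],
    "group": ["A", "B", "A", "B", "A", "B"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# Matplotlib: translate the grouping variable into one plot call per group
for name, subset in df.groupby("group"):
    ax_mpl.scatter(subset["salary"], subset["age"], label=name)
ax_mpl.set_xlabel("salary")
ax_mpl.set_ylabel("age")
ax_mpl.legend(title="group")

# Seaborn: state which column plays which role; labels and legend follow
sns.scatterplot(data=df, x="salary", y="age", hue="group", ax=ax_sns)
```

In the Seaborn call, the axis labels and the legend are derived from the column names, which is the convenience the paragraph above describes.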

Other potential obstacles are the default Matplotlib parameters and configurations.
The default parameters in Seaborn provide better visualizations without
additional customization. We will look at these default parameters in detail in the
upcoming topics.

For users who are already familiar with Matplotlib, moving to Seaborn is
straightforward, since the core concepts are mostly the same.

Controlling Figure Aesthetics


As we mentioned previously, Matplotlib is highly customizable. The downside is that
it can take a long time to adjust all the necessary parameters to get your desired
visualization. In Seaborn, we can use customized themes and a high-level interface
for controlling the appearance of Matplotlib figures.

The following code snippet creates a simple line plot in Matplotlib:

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()

This is what the plot looks like with Matplotlib's default parameters:

Figure 4.2: Matplotlib line plot



To switch to the Seaborn defaults, simply call the set() function:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()

The following is the output of the code:

Figure 4.3: Seaborn line plot



Seaborn categorizes Matplotlib's parameters into two groups. The first group
contains parameters for the aesthetics of the plot, while the second group scales
various elements of the plot so that it can be easily used in different contexts, such as
visualizations that are used for presentations and posters.
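
The two groups can be inspected directly. The following sketch, which assumes only that Seaborn is installed, asks Seaborn for the parameter dictionary of each group and prints a few of the keys. The names style_params and context_params are chosen here for illustration.

```python
import seaborn as sns

# First group: aesthetics (colors, grid, spines, ...)
style_params = sns.axes_style("darkgrid")

# Second group: scaling (font sizes, line widths, ...)
context_params = sns.plotting_context("notebook")

print(sorted(style_params)[:5])
print(sorted(context_params)[:5])
```

Both calls only return dictionaries; they do not change the active settings unless they are used in a with statement.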

Seaborn Figure Styles


To control the plot style, Seaborn provides two methods: set_style(style,
[rc]) and axes_style(style, [rc]).

seaborn.set_style(style, [rc]) sets the aesthetic style of the plots.

Parameters:

• style: A dictionary of parameters or the name of one of the following
preconfigured sets: darkgrid, whitegrid, dark, white, or ticks

• rc (optional): Parameter mappings to override the values in the preset
Seaborn-style dictionaries

Here is an example:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()

This results in the following plot:

Figure 4.4: Seaborn line plot with whitegrid style

seaborn.axes_style(style, [rc]) returns a parameter dictionary for
the aesthetic style of the plots. The function can be used in a with statement to
temporarily change the style parameters.

Here are the parameters:

• style: A dictionary of parameters or the name of one of the following
pre-configured sets: darkgrid, whitegrid, dark, white, or ticks

• rc (optional): Parameter mappings to override the values in the preset
Seaborn-style dictionaries

Here is an example:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
with sns.axes_style('dark'):
    plt.plot(x1, label='Group A')
    plt.plot(x2, label='Group B')
plt.legend()
plt.show()

The aesthetics are only changed temporarily. The result is shown in the
following diagram:

Figure 4.5: Seaborn line plot with dark axes style

For further customization, you can pass a dictionary of parameters to the rc
argument. You can only override parameters that are part of the style definition.
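
As a small sketch of the rc argument, the following call overrides the axes background of the darkgrid style. It also passes font.size, which belongs to the context parameters rather than the style definition, to show that such keys are not applied. The variable name custom is illustrative.

```python
import seaborn as sns

custom = sns.axes_style("darkgrid", rc={
    "axes.facecolor": ".9",  # part of the style definition, so it is applied
    "font.size": 20,         # a context parameter, so it is left out of the style
})

print(custom["axes.facecolor"])
print("font.size" in custom)
```

The same dictionary can be passed to set_style() to apply the override globally, or the call can be used in a with statement for a temporary change.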

Removing Axes Spines


Sometimes, it might be desirable to remove the top and right axes spines. The
despine() function is used to remove the top and right axes spines from the plot:

seaborn.despine(fig=None, ax=None, top=True, right=True,
                left=False, bottom=False,
                offset=None, trim=False)

The following code helps to remove the axes spines:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
sns.despine()
plt.legend()
plt.show()

This results in the following plot:

Figure 4.6: Despined Seaborn line plot
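
The offset and trim arguments in the signature above are not used in the example. A possible sketch of how they could be combined is shown here, reusing the same two lists of values; the dictionary visible is introduced only to inspect the result.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

with sns.axes_style("ticks"):
    fig, ax = plt.subplots()
    ax.plot([10, 20, 5, 40, 8], label="Group A")
    ax.plot([30, 43, 9, 7, 20], label="Group B")
    # Remove top and right spines, move the remaining ones 10 points
    # outward, and trim them to the range of the ticks
    sns.despine(ax=ax, offset=10, trim=True)
    ax.legend()

# Record which spines are still drawn
visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
print(visible)
```

Passing ax explicitly keeps the call independent of which figure is currently active.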

In the next section, we will learn to control the scale of plot elements.

Controlling the Scale of Plot Elements


A separate set of parameters controls the scale of plot elements. This is a handy way
to use the same code to create plots that are suited for use in contexts where larger
or smaller plots are necessary. To control the context, two functions can be used.

seaborn.set_context(context, [font_scale], [rc]) sets the plotting
context parameters. This does not change the overall style of the plot but affects
things such as the size of the labels and lines. The base context is notebook, and
the other contexts are paper, talk, and poster, which are versions of the notebook
parameters scaled by 0.8, 1.3, and 1.6, respectively (the exact factors may differ
between Seaborn versions).

Here are the parameters:

• context: A dictionary of parameters or the name of one of the following
preconfigured sets: paper, notebook, talk, or poster

• font_scale (optional): A scaling factor to independently scale the size of
font elements

• rc (optional): Parameter mappings to override the values in the preset Seaborn
context dictionaries

The following code helps set the context:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()

The preceding code generates the following output:

Figure 4.7: Seaborn line plot with poster context

seaborn.plotting_context(context, [font_scale], [rc]) returns a
parameter dictionary to scale elements of the Figure. This function can be used in a
with statement to temporarily change the context parameters.

Here are the parameters:

• context: A dictionary of parameters or the name of one of the following
pre-configured sets: paper, notebook, talk, or poster

• font_scale (optional): A scaling factor to independently scale the size of
font elements

• rc (optional): Parameter mappings to override the values in the preset Seaborn
context dictionaries
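
The temporary nature of plotting_context() can be checked with a short sketch. It records Matplotlib's font.size setting before, inside, and after a with block; the variable names size_before, size_inside, and size_after are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("notebook")  # start from the base context
size_before = plt.rcParams["font.size"]

with sns.plotting_context("poster"):
    # Inside the block, the scaled-up poster parameters are active
    size_inside = plt.rcParams["font.size"]
    fig, ax = plt.subplots()
    ax.plot([10, 20, 5, 40, 8], label="Group A")
    ax.legend()

# After the block, the previous context is restored
size_after = plt.rcParams["font.size"]
print(size_before, size_inside, size_after)
```

This pattern is handy when a single figure in a notebook needs to be exported at presentation size without affecting the other plots.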

Contexts are an easy way to use preconfigured scales of plot elements for different
use cases. We will apply them in the following exercise, which uses a box plot to
compare the IQ scores of different test groups.

Note
All the exercises and activities in this chapter are developed using Jupyter
Notebook. The files can be downloaded from the following link:
https://packt.live/2ONDmLl. All the datasets used in this chapter can be found at
https://packt.live/3bzApYN.

Exercise 4.01: Comparing IQ Scores for Different Test Groups by Using a Box Plot
In this exercise, we will generate a box plot using Seaborn. We will compare IQ scores
among different test groups using a box plot of the Seaborn library to demonstrate
how easy and efficient it is to create plots with Seaborn provided that we have a
proper DataFrame. This exercise also shows how to quickly change the style and
context of a Figure using the pre-configurations supplied by Seaborn.

Let's compare IQ scores among different test groups using the Seaborn library:

1. Create an Exercise4.01.ipynb Jupyter Notebook in the Chapter04/Exercise4.01
folder to implement this exercise.
2. Import the necessary modules and enable plotting within the
Exercise4.01.ipynb file:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

3. Use the pandas read_csv() function to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/iq_scores.csv")

4. Access the data of each test group in the column. Convert this into a list using
the tolist() method. Once the data of each test group has been converted
into a list, assign this list to variables of each respective test group:

group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()

5. Print the values of each group to check whether the data inside it is converted
into a list. This can be done with the help of the print() function:

print(group_a)

The data values of Group A are shown in the following screenshot:

Figure 4.8: Values of Group A

The following is the code for printing Group B:

print(group_b)

The data values of Group B are shown in the following screenshot:

Figure 4.9: Values of Group B

The following is the code for printing Group C:

print(group_c)

The data values of Group C are shown in the following screenshot:

Figure 4.10: Values of Group C

The following is the code for printing Group D:

print(group_d)

The data values of Group D are shown in the following screenshot:

Figure 4.11: Values of Group D



6. Once we have the data for each test group, we need to construct a DataFrame
from this data. This can be done with the help of the pd.DataFrame()
function, which is provided by pandas:

data = pd.DataFrame({'Groups': ['Group A'] * len(group_a)
                               + ['Group B'] * len(group_b)
                               + ['Group C'] * len(group_c)
                               + ['Group D'] * len(group_d),
                     'IQ score': group_a + group_b
                                 + group_c + group_d})

7. If you don't create your own DataFrame, it is often helpful to print the column
names, which is done by calling print(data.columns). The output is
as follows:

Figure 4.12: Column labels

You can see that our DataFrame has two variables with the labels Groups and
IQ score. This is especially interesting since we can use them to specify which
variable to plot on the x-axis and which one on the y-axis.

8. Now, since we have the DataFrame, we need to create a box plot using the
boxplot() function provided by Seaborn. Within this function, specify the
variables for both the axes along with the DataFrame. Make Groups the variable
to plot on the x-axis, and IQ score the variable for the y-axis. Pass data as
a parameter. Here, data is the DataFrame that we obtained from the previous
step. Moreover, use the whitegrid style, set the context to talk, and remove
all axes spines, except the one on the bottom:

plt.figure(dpi=150)
# Set style and context
sns.set_style('whitegrid')
sns.set_context('talk')
# Create boxplot (recent Seaborn versions require x and y as keywords)
sns.boxplot(x='Groups', y='IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()

The despine() function helps in removing the top and right spines from the
plot by default (without passing any arguments to the function). Here, we have
also removed the left spine. Using the title() function, we have set the title
for our plot. The show() function visualizes the plot.

After executing the preceding steps, the final output should be as follows:

Figure 4.13: IQ scores of groups

From the preceding diagram, we can conclude that Seaborn offers visually appealing
plots out of the box and allows easy customization, such as changing the style,
context, and spines. Once a suitable DataFrame exists, the plotting is achieved with
a single function. Column names are automatically used for labeling the axis. Even
categorical variables are supported out of the box.
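
The DataFrame in step 6 is in long form: one row per observation, with one column naming the group and one holding the value. As an alternative sketch, the same shape can be produced with the pandas melt() method. The wide table below is made up for illustration and does not reproduce the values in iq_scores.csv.

```python
import pandas as pd

# Hypothetical wide-format table: one column per test group
wide = pd.DataFrame({
    "Group A": [118, 103, 125],
    "Group B": [126, 89, 90],
    "Group C": [108, 89, 114],
    "Group D": [93, 99, 91],
})

# Stack the four columns into (Groups, IQ score) pairs
long = wide.melt(var_name="Groups", value_name="IQ score")
print(long.head())
```

The resulting DataFrame has the same two column labels as the one built by hand, so it could be passed to boxplot() in the same way.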

Note
To access the source code for this specific section, please refer to
https://packt.live/3hwvR8m.

You can also run this example online at https://packt.live/2Y6TTPy.

Another great advantage of Seaborn is color palettes, which are introduced in the
following section.

Color Palettes
Color is a very important factor for your visualization. Color can reveal patterns in
data if used effectively or hide patterns if used poorly. Seaborn makes it easy to
select and use color palettes that are suited to your task. The color_palette()
function provides an interface for many of the possible ways to generate
color palettes.

The seaborn.color_palette([palette], [n_colors], [desat])
command returns a list of colors, thus defining a color palette.

The parameters are as follows:

• palette (optional): Name of palette or None to return the current palette.

• n_colors (optional): Number of colors in the palette. If the specified number of
colors is larger than the number of colors in the palette, the colors will be cycled.

• desat (optional): The proportion to desaturate each color by.

You can set the palette for all plots with set_palette(). This function accepts the
same arguments as color_palette(). In the following sections, we will explain
how color palettes are divided into different groups.
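
The cycling behavior of n_colors and the effect of desat can be seen in a short sketch. It uses the built-in deep palette; the variable names are illustrative.

```python
import seaborn as sns

deep = sns.color_palette("deep")
n = len(deep)

# Ask for two more colors than the palette holds: the extras repeat from the start
cycled = sns.color_palette("deep", n + 2)

# Same palette with each color partially desaturated
washed = sns.color_palette("deep", desat=0.5)

print(n, len(cycled))
print(deep.as_hex()[:3])
print(washed.as_hex()[:3])
```

Because colors repeat once the palette is exhausted, a plot with more categories than palette entries ends up with ambiguous colors, which is one reason to check the palette length first.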

Choosing the best color palette is not straightforward and, to some extent, subjective.
To make a good decision, you have to know the characteristics of your data. There are
three general groups of color palettes, namely, categorical, sequential, and diverging,
which we will break down in the following sections.

Categorical Color Palettes


Categorical palettes (or qualitative color palettes) are best suited for distinguishing
categorical data that does not have an inherent ordering. The color palette should
have colors as distinct from one another as possible, resulting in palettes where
mainly the hue changes. When it comes to human perception, there is a limit to how
many different colors are perceived. A rule of thumb is that if you have double-digit
categories, it is advisable to divide the categories into groups. Different shades of
color could be used for a group. Another way to keep groups apart could be to use
hues that are close together in the color wheel within a group and hues that are far
apart for different groups.

Some examples where it is suitable to use categorical color palettes are line charts
showing stock trends for different companies, and a bar chart with subcategories;
basically, any time you want to group your data.

There are six default categorical palettes in Seaborn: deep, muted, bright, pastel,
dark, and colorblind. The code and output for each palette are shown below. Out of
these color palettes, it doesn't really matter which one you use. Choose the one you
prefer and the one that best fits the overall theme of the visualization. It's never
a bad idea to use the colorblind palette to accommodate readers with color vision
deficiencies. The following is the code to create a deep color palette:

import seaborn as sns


palette1 = sns.color_palette("deep")
sns.palplot(palette1)

The following diagram shows the output of the code:

Figure 4.14: Deep color palette

The following code creates a muted color palette:

palette2 = sns.color_palette("muted")
sns.palplot(palette2)

The following is the output of the code:

Figure 4.15: Muted color palette

The following code creates a bright color palette:

palette3 = sns.color_palette("bright")
sns.palplot(palette3)

The following is the output of the code:

Figure 4.16: Bright color palette



The following code creates a pastel color palette:

palette4 = sns.color_palette("pastel")
sns.palplot(palette4)

Here is the output showing a pastel color palette:

Figure 4.17: Pastel color palette

The following code creates a dark color palette:

palette5 = sns.color_palette("dark")
sns.palplot(palette5)

The following diagram shows a dark color palette:

Figure 4.18: Dark color palette

The following code creates a colorblind palette:

palette6 = sns.color_palette("colorblind")
sns.palplot(palette6)

Here is the output of the code:

Figure 4.19: Colorblind color palette



Sequential Color Palettes


Sequential color palettes are appropriate for sequential data that ranges from low
to high values, or vice versa. It is recommended to use bright colors for low values
and dark ones for high values. Some examples of sequential data are absolute
temperature, weight, height, or the number of students in a class.

One of the sequential color palettes that Seaborn offers is cubehelix palettes. They
have a linear increase or decrease in brightness and some variation in hue, meaning
that even when converted to black and white, the information is preserved.

The default palette returned by cubehelix_palette() is illustrated in the
following diagram. To customize the cubehelix palette, the hue at the start of the helix
can be set with start (a value between 0 and 3), or the number of rotations around
the hue wheel can be set with rot:

Figure 4.20: Cubehelix palette
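
As a sketch of the start and rot parameters, the following code builds the default cubehelix palette and a customized one with eight colors each. The specific values 0.5 and -0.75 are arbitrary choices for illustration.

```python
import seaborn as sns

# Default cubehelix palette with 8 colors
default_helix = sns.cubehelix_palette(8)

# Start the helix at a different hue and rotate the other way around the wheel
custom_helix = sns.cubehelix_palette(8, start=0.5, rot=-0.75)

print(custom_helix.as_hex())
```

Either palette can be displayed with sns.palplot(). By default, the colors run from light to dark.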

Creating custom sequential palettes that only produce colors that start at either light
or dark desaturated colors and end with a specified color can be accomplished with
light_palette() or dark_palette(). Two examples are given in
the following:

custom_palette2 = sns.light_palette("magenta")
sns.palplot(custom_palette2)

The following diagram shows the output of the code:

Figure 4.21: Custom magenta color palette



The preceding palette can also be reversed by setting the reverse parameter to
True in the following code:
custom_palette3 = sns.light_palette("magenta", reverse=True)
sns.palplot(custom_palette3)

The following diagram shows the output of the code:

Figure 4.22: Custom reversed magenta color palette

By default, creating a color palette only returns a list of colors. If you want to use it as
a colormap object, for example, in combination with a heatmap, set the
as_cmap=True argument, as demonstrated in the following example:
import numpy as np
import seaborn as sns
x = np.arange(25).reshape(5, 5)
ax = sns.heatmap(x, cmap=sns.cubehelix_palette(as_cmap=True))

This creates the following heatmap:

Figure 4.23: Heatmap with cubehelix palette



In the next section, we will learn about diverging color palettes.

Diverging Color Palettes


Diverging color palettes are used for data that consists of a well-defined midpoint.
An emphasis is placed on both high and low values. For example, if you are plotting
any population changes for a particular region from some baseline population, it is
best to use diverging colormaps to show the relative increase and decrease in the
population. The following code snippet and output provide a better understanding
of diverging palettes, wherein we use the coolwarm colormap, which is built
into Matplotlib:

custom_palette4 = sns.color_palette("coolwarm", 7)
sns.palplot(custom_palette4)

The following diagram shows the output of the code:

Figure 4.24: Coolwarm color palette

You can use the diverging_palette() function to create custom diverging
palettes. We can pass two hues in degrees as parameters, along with the total
number of colors in the palette. The following code snippet and output provide a
better insight:

custom_palette5 = sns.diverging_palette(120, 300, n=7)
sns.palplot(custom_palette5)

The following diagram shows the output of the code:

Figure 4.25: Custom diverging color palette



As we already mentioned, colors, when used effectively, can reveal patterns in data.
Spend some time thinking about which color palette is best for certain data. Let's
apply color palettes to visualize temperature changes in the following exercise.

Exercise 4.02: Surface Temperature Analysis


In this exercise, we will generate a heatmap using Seaborn. The goal of this exercise
is to choose an appropriate color palette for the given data. You are asked to visualize
the surface temperature change for the Northern Hemisphere for past years. Data
from the GISS Surface Temperature Analysis is used, which contains estimates
of global surface temperature change (in degrees Celsius) for every month. The
dataset contains temperature anomalies for every month from 1880 to the present.
Temperature anomalies indicate how much warmer or colder it is than normal. For
the GISS analysis, normal means the average over the 30-year period 1951-1980.

Note
The dataset used for this exercise is taken from https://data.giss.nasa.gov/gistemp/
(accessed January 7, 2020). For more details about the dataset,
visit the website, looking at the FAQs in particular. This dataset is also
available in your Datasets folder.

Following are the steps to perform:

1. Create an Exercise4.02.ipynb Jupyter Notebook in the
Chapter04/Exercise4.02 folder to implement this exercise.
2. Import the necessary modules and enable plotting:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

3. Use the pandas read_csv() function to read the
northern_surface_temperature.csv dataset located in the Datasets folder. After
successful loading, transpose the dataset so that it is in a suitable structure:

data = pd.read_csv("../../Datasets/"\
                   "northern_surface_temperature.csv", \
                   index_col=['Year'])
data = data.transpose()

4. Create a custom-diverging palette that diverges to blue (240 degrees on the hue
wheel) for low values and to red (15 degrees on the hue wheel) for high values.
Set the saturation as s=99. Make sure that the diverging_palette()
function returns a colormap by setting as_cmap=True:

heat_colormap = sns.diverging_palette(240, 15, s=99,
                                      as_cmap=True)

5. Plot the heatmap for every 5 years. To ensure that the neutral color corresponds
to no temperature change (the value is zero), set center=0:

plt.figure(dpi=200)
sns.heatmap(data.iloc[:, ::5], cmap=heat_colormap, center=0)
plt.title("Temperature Changes from 1880 to 2015 " \
          "(base period 1951-1980)")
plt.savefig('temperature_change.png', dpi=300, \
            bbox_inches='tight')

The following is the output of the preceding code:

Figure 4.26: Surface temperature changes visualized as a heatmap

The preceding diagram helps us to visualize the surface temperature change for
the Northern Hemisphere for past years.
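
The effect of center=0 is easiest to see on data that is not symmetric around zero. The following sketch uses a tiny made-up array of anomalies (not taken from the GISS dataset) and draws it twice with the same diverging colormap, once without and once with center=0.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up anomalies, skewed toward positive values
anomalies = np.array([[-0.2, 0.0, 0.4],
                      [-0.1, 0.3, 0.9]])
cmap = sns.diverging_palette(240, 15, s=99, as_cmap=True)

fig, (ax_plain, ax_centered) = plt.subplots(1, 2, figsize=(8, 3))

# Without center, the neutral color sits at the middle of the data range
sns.heatmap(anomalies, cmap=cmap, ax=ax_plain)

# With center=0, the neutral color corresponds to "no change"
sns.heatmap(anomalies, cmap=cmap, center=0, ax=ax_centered)
```

In the left panel, a value of zero is drawn in blue because it lies below the middle of the data range. In the right panel, zero is drawn in the neutral color, which matches how anomalies are usually read.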

Note
To access the source code for this specific section, please refer to
https://packt.live/3fracg8.

You can also run this example online at https://packt.live/3d4u5bd.

Let's now perform an activity to create a heatmap using a real-life dataset with
various color palettes.

Activity 4.01: Using Heatmaps to Find Patterns in Flight Passengers' Data


In this activity, we will use a heatmap to find patterns in the flight passengers' data.
The goal of this activity is to apply your knowledge about color palettes to choose a
suitable color palette for this data.

The following are the steps to perform:

1. Use pandas to read the flight_details.csv dataset located in the
Datasets folder. The given dataset contains the monthly figures for flight
passengers for the years 1949 to 1960. This dataset originates from the
Seaborn library.

2. Use a heatmap to visualize the given data.

3. Use your own appropriate colormap. Make sure that the lowest value is the
brightest, and the highest the darkest, color. After executing the preceding steps,
the expected output should be as follows:

Figure 4.27: Heatmap of flight passengers' data



Note
The solution to this activity can be found on page 420.

After the in-depth discussion about various color palettes, we will introduce some
more advanced plots that Seaborn offers in the following section.

Advanced Plots in Seaborn


In the previous chapter, we discussed various plots in Matplotlib, but there are still
a few visualizations left that we want to discuss. First, we will revisit bar plots since
Seaborn offers some neat additional features for them. Moreover, we will cover
kernel density estimation, correlograms, and violin plots.

Bar Plots
In the last chapter, we already explained how to create bar plots with Matplotlib.
Creating bar plots with subgroups was quite tedious, but Seaborn offers a very
convenient way to create various bar plots. Bar plots can also be used in Seaborn to
represent estimates of central tendency with the height of each bar, while uncertainty
is indicated by error bars at the top of the bar.

The following example gives you a good idea of how this works:

import pandas as pd
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.barplot(x="Education", y="Salary", hue="District", data=data)

The result is shown in the following diagram:

Figure 4.28: Seaborn bar plot
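
To illustrate that the bar height is an estimate of central tendency, the following sketch compares the heights drawn by barplot() with the group means computed by pandas. The data is made up, and only the column names Education and Salary are borrowed from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
df = pd.DataFrame({
    "Education": ["Bachelor"] * 3 + ["Master"] * 3,
    "Salary": [40000, 50000, 60000, 55000, 65000, 75000],
})

fig, ax = plt.subplots()
sns.barplot(x="Education", y="Salary", data=df, ax=ax)

# Heights of the drawn bars versus the per-group means
bar_heights = sorted(patch.get_height() for patch in ax.patches)
group_means = sorted(df.groupby("Education")["Salary"].mean())
print(bar_heights, group_means)
```

By default the estimator is the mean; the error bars come from a resampling procedure, so their exact extent can vary slightly between runs.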

Let's get some practice with Seaborn bar plots in the following activity.

Activity 4.02: Movie Comparison Revisited


In this activity, we will generate a bar plot to compare movie scores. You will be given
five movies with scores from Rotten Tomatoes. The Tomatometer is the percentage
of approved Tomatometer critics who have given a positive review for a movie. The
Audience Score is the percentage of users who have given a score of 3.5 or higher,
out of 5. Compare these two scores among the five movies:

1. Use pandas to read the movie_scores.csv dataset located in the
Datasets folder.
2. Transform the data into a useable format for Seaborn's barplot function.

3. Use Seaborn to create a visually appealing bar plot that compares the two scores
for all five movies.

After executing the preceding steps, the expected output should appear
as follows:

Figure 4.29: Movie Scores comparison

Note
The solution to this activity can be found on page 422.

Kernel Density Estimation


It is often useful to visualize how variables of a dataset are distributed. Seaborn offers
handy functions to examine univariate and bivariate distributions. One possible way
to look at a univariate distribution in Seaborn is by using the distplot() function
(deprecated since Seaborn 0.11 in favor of histplot() and displot()).
This will draw a histogram and fit a kernel density estimate (KDE), as illustrated in
the following example:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.distplot(data.loc[:, 'Age'])
plt.xlabel('Age')
plt.ylabel('Density')

The result is shown in the following diagram:

Figure 4.30: KDE with a histogram for a univariate distribution

To just visualize the KDE, Seaborn provides the kdeplot() function:

# Seaborn 0.11 and later prefer fill=True over the shade argument
sns.kdeplot(data.loc[:, 'Age'], shade=True)
plt.xlabel('Age')
plt.ylabel('Density')

The KDE plot is shown in the following diagram, along with a shaded area under
the curve:

Figure 4.31: KDE for a univariate distribution
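
To show what a kernel density estimate computes, here is a minimal NumPy sketch of a Gaussian KDE. It is not Seaborn's implementation: the function gaussian_kde_1d, the sample values, and the fixed bandwidth of 5.0 are all assumptions made for illustration, whereas Seaborn selects a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

ages = np.array([23, 25, 31, 35, 36, 41, 52, 58])  # made-up values
grid = np.linspace(0, 90, 901)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that can hide structure.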

In the next section, we will learn how to plot bivariate distributions.

Plotting Bivariate Distributions


For visualizing bivariate distributions, we will introduce three different plots. The
first two plots use the jointplot() function, which creates a multi-panel figure
that shows the joint relationship between two variables and the corresponding
marginal distributions.

A scatter plot shows each observation as a point on the x and y axes. Additionally, a
histogram for each variable is shown:

import pandas as pd
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="white")
sns.jointplot(x="Annual Salary", y="Age", data=data)

The scatter plot with marginal histograms is shown in the following diagram:

Figure 4.32: Scatter plot with marginal histograms



It is also possible to use the KDE procedure to visualize bivariate distributions. The
joint distribution is shown as a contour plot, as demonstrated in the following code:

# subdata is assumed to be a subset of data; its definition is not shown here
sns.jointplot(x='Annual Salary', y='Age', data=subdata, \
              kind='kde', xlim=(0, 500000), ylim=(0, 100))

The result is shown in the following diagram:

Figure 4.33: Contour plot

The joint distribution is shown as a contour plot in the center of the diagram. The
darker the color, the higher the density. The marginal distributions are visualized on
the top and on the right.
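The relationship between the joint distribution and the marginal distributions can be checked numerically. The sketch below uses synthetic data (not the age_salary_hours dataset) and NumPy histograms instead of a KDE: summing the joint counts over one variable yields the marginal counts of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for two columns such as salary and age
x = rng.normal(50_000, 15_000, size=1_000)
y = rng.normal(40, 12, size=1_000)

# Joint distribution: counts per (x-bin, y-bin) cell
joint, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Marginals: sum the joint counts over the other variable
marginal_x = joint.sum(axis=1)
marginal_y = joint.sum(axis=0)

# They match the one-dimensional histograms over the same bin edges
hist_x, _ = np.histogram(x, bins=x_edges)
hist_y, _ = np.histogram(y, bins=y_edges)
print(np.array_equal(marginal_x, hist_x), np.array_equal(marginal_y, hist_y))
```

This is what the panels at the top and on the right of a joint plot summarize: each one collapses the central panel along one axis.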

Visualizing Pairwise Relationships


For visualizing multiple pairwise relationships in a dataset, Seaborn offers the
pairplot() function. This function creates a matrix where off-diagonal elements
visualize the relationship between each pair of variables and the diagonal elements
show the marginal distributions.

The following example gives us a better understanding of this:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data, hue='Education')

Note
The age_salary_hours dataset is derived from https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html.

A pair plot, also called a correlogram, is shown in the following diagram. Scatter plots
are shown for all variable pairs on the off-diagonal, while KDEs are shown on the
diagonal. Groups are highlighted by different colors:

Figure 4.34: Seaborn pair plot

Violin Plots
A different approach to visualizing statistical measures is to use violin plots. They
combine box plots with the kernel density estimation procedure that we described
previously, which provides a richer description of the variable's distribution. Additionally,
the quartile and whisker values from the box plot are shown inside the violin.

The following example demonstrates the usage of violin plots:

import pandas as pd
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.violinplot(x='Education', y='Salary', hue='Gender', \
               data=data, split=True, cut=0)

The result appears as follows:

Figure 4.35: Seaborn violin plot

The violin plot shows both statistical measures and the probability distribution. The
data is divided into education groups, which are shown on the x-axis, and gender
groups, which are highlighted by different colors.
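The statistical measures drawn inside a violin can be reproduced with a few lines of NumPy. This is a minimal sketch with hypothetical salary values, assuming the common box plot convention in which whiskers extend to the most extreme points within 1.5 times the interquartile range.

```python
import numpy as np

# Hypothetical salaries (in thousands) for one group
salaries = np.array([31, 35, 38, 42, 45, 47, 52, 58, 63, 70, 120], dtype=float)

q1, median, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the quartiles
lower_whisker = salaries[salaries >= q1 - 1.5 * iqr].min()
upper_whisker = salaries[salaries <= q3 + 1.5 * iqr].max()
print(q1, median, q3, lower_whisker, upper_whisker)
```

The value 120 lies outside the upper whisker, so a box plot would mark it as an outlier, whereas the violin's density outline still reflects it.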

With the next activity, we will conclude the section about advanced plots. The
section that follows introduces multi-plots in Seaborn.

Activity 4.03: Comparing IQ Scores for Different Test Groups by Using a Violin Plot
In this activity, we will compare the IQ scores among four different test groups by
using the violin plot that's provided by the Seaborn library. The following steps will
help you to complete this activity:

1. Use pandas to read the iq_scores.csv dataset located in the Datasets folder.
2. Access the data of each group in the column, convert it into a list, and assign
appropriate variables.

3. Create a pandas DataFrame from the data for each respective group.

4. Create a violin plot for the IQ scores of the different test groups using Seaborn's
violinplot function.
5. Use the whitegrid style, set the context to talk, and remove all axes spines,
except the one on the bottom. Add a title to the plot.

After executing the preceding steps, the final output should appear as follows:

Figure 4.36: Violin plot showing IQ scores of different groups



Note
The solution to this activity can be found on page 424.

In the next section, we will learn about multi-plots in Seaborn.

Multi-Plots in Seaborn
In the previous topic, we introduced a multi-plot, namely, the pair plot. In this topic,
we want to talk about a different way to create flexible multi-plots.

FacetGrid
The FacetGrid is useful for visualizing a certain plot for multiple variables separately.
A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first
two have the obvious relationship with the rows and columns of an array. The hue is
the third dimension and is shown in different colors. The FacetGrid class has to be
initialized with a DataFrame, and the names of the variables that will form the row,
column, or hue dimensions of the grid. These variables should be categorical
or discrete.

The seaborn.FacetGrid(data, row, col, hue, …) command initializes a multi-plot grid for plotting conditional relationships.

Here are some interesting parameters:

• data: A tidy ("long-form") DataFrame where each column corresponds to a variable, and each row corresponds to an observation

• row, col, hue: Variables that define subsets of the given data, which will be
drawn on separate facets in the grid

• sharex, sharey (optional): Share x/y axes across rows/columns

• height (optional): Height (in inches) of each facet
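The tidy ("long-form") layout expected by the data parameter can be produced from a wide table with pandas. The following sketch uses a hypothetical table with one column per group; the column names and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-form table: one column per group
wide = pd.DataFrame({
    "group_a": [118, 103, 125],
    "group_b": [99, 111, 108],
})

# Long form: one row per observation, one column per variable
tidy = wide.melt(var_name="group", value_name="score")
print(tidy)
```

In the long form, the group column can serve as the row, col, or hue variable of a grid, and the score column can be passed on to the plotting function.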

Initializing the grid does not draw anything on it yet. To visualize data on this grid, the
FacetGrid.map() method has to be used. You can provide any plotting function
and the name(s) of the variable(s) in the DataFrame to plot:

FacetGrid.map(func, *args, **kwargs) applies a plotting function to each facet of the grid.

Here are the parameters:

• func: A plotting function that takes data and keyword arguments.

• *args: The column names in data that identify variables to plot. The data for
each variable is passed to func in the order in which the variables are specified.

• **kwargs: Keyword arguments that are passed to the plotting function.

The following example visualizes FacetGrid with scatter plots:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")[:1000]
g = sns.FacetGrid(data, col='District')
g.map(plt.scatter, 'Salary', 'Age')

Figure 4.37: FacetGrid with scatter plots
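Conceptually, mapping a plotting function over a grid with a col variable means drawing one subplot per distinct value of that variable. The sketch below imitates this by hand with pandas and Matplotlib. It is not how Seaborn implements FacetGrid, and the rows are hypothetical stand-ins for salary.csv.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for salary.csv
data = pd.DataFrame({
    "District": ["North", "North", "South", "South", "East", "East"],
    "Salary": [40_000, 52_000, 61_000, 48_000, 55_000, 67_000],
    "Age": [25, 38, 45, 31, 40, 52],
})

groups = data.groupby("District", sort=False)
fig, axes = plt.subplots(1, groups.ngroups, sharex=True, sharey=True,
                         figsize=(3 * groups.ngroups, 3))

# One facet per distinct value of the column variable
for ax, (district, subset) in zip(axes, groups):
    ax.scatter(subset["Salary"], subset["Age"])
    ax.set_title(f"District = {district}")
    ax.set_xlabel("Salary")
axes[0].set_ylabel("Age")
```

FacetGrid handles this bookkeeping automatically, including shared axes, titles, and an optional hue dimension.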

We will conclude FacetGrids with the following activity.

Activity 4.04: Visualizing the Top 30 Music YouTube Channels Using Seaborn's FacetGrid
In this activity, we will generate a FacetGrid plot using the Seaborn library. We will
visualize the total number of subscribers and the total number of views for the
top 30 YouTube channels (as of January 2020) in the music category by using
Seaborn's FacetGrid class.

Visualize the given data using a FacetGrid with two columns. The first column should
show the number of subscribers for each YouTube channel, whereas the second
column should show the number of views. The goal of this activity is to get some
practice working with FacetGrids. The following are the steps to implement
this activity:

1. Use pandas to read the YouTube.csv dataset located in the Datasets folder.

2. Access the data of each group in the column, convert this into a list, and assign
this list to variables of each respective group.

3. Create a pandas DataFrame with the preceding data, using the data of each
respective group.

4. Create a FacetGrid with two columns to visualize the data.

After executing the preceding steps, the final output should appear as follows:

Figure 4.38: Subscribers and views of the top 30 YouTube channels

Note
The solution to this activity can be found on page 427.

In the next section, we will learn how to plot a regression plot using Seaborn.

Regression Plots
Regression is a technique in which we estimate the relationship between a
dependent variable (usually plotted along the y-axis) and an independent variable
(usually plotted along the x-axis). Given a dataset, we can assign independent and
dependent variables and then use various regression methods to find the relationship
between these variables. Here, we will only cover linear regression; however, Seaborn
provides a wider range of regression functionality if needed.

The regplot() function offered by Seaborn helps to visualize linear relationships, determined through linear regression. The following code snippet gives
a simple example:

import numpy as np
import seaborn as sns
x = np.arange(100)
# normal distribution with mean 0 and a standard deviation of 5
y = x + np.random.normal(0, 5, size=100)
sns.regplot(x=x, y=y)

The regplot() function draws a scatter plot, a regression line, and a 95%
confidence interval for that regression, as shown in the following diagram:

Figure 4.39: Seaborn regression plot
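The regression line itself can be estimated numerically. As a minimal sketch, the following code fits a straight line to the same kind of synthetic data with NumPy's least-squares polynomial fit. The seed is an arbitrary choice for reproducibility, and the confidence interval that regplot() draws is not computed here.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(100)
# normal distribution with mean 0 and a standard deviation of 5
y = x + rng.normal(0, 5, size=100)

# Least-squares fit of a degree-1 polynomial: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because the data was generated as y = x plus noise, the slope should come out close to 1 and the intercept close to 0.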



Let's have a look at a more practical example in the following activity.

Activity 4.05: Linear Regression for Animal Attribute Relations


In this activity, we will generate a regression plot to visualize a real-life dataset using
the Seaborn library. You have a dataset pertaining to various animals, including
their body mass and maximum longevity. To discover whether there is any linear
relationship between these two variables, a regression plot will be used.

Note
The dataset used is from http://genomics.senescence.info/download.html#anage. The dataset can also be downloaded from GitHub. Here is the
link to it: https://packt.live/3bzApYN.

The following are the steps to perform:

1. Use pandas to read the anage_data.csv dataset located in the Datasets folder.
2. Filter the data so that you end up with samples containing a body mass and
maximum longevity. Only consider samples for the Mammalia class and a body
mass of less than 200,000.

3. Create a regression plot to visualize the linear relationship between the variables.

After executing the preceding steps, the output should appear as follows:

Figure 4.40: Linear regression for animal attribute relations
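The filtering described in step 2 can be sketched with pandas boolean indexing. The rows below are hypothetical, and the column names are assumptions that may differ from those in anage_data.csv, so they should be checked against the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names are assumptions and may differ in anage_data.csv
animals = pd.DataFrame({
    "Class": ["Mammalia", "Mammalia", "Aves", "Mammalia", "Mammalia", "Mammalia"],
    "Body mass (g)": [25.0, 4_500_000.0, 300.0, np.nan, 70_000.0, 3_000.0],
    "Maximum longevity (yrs)": [4.0, 70.0, 20.0, 12.0, np.nan, 15.0],
})

mass, longevity = "Body mass (g)", "Maximum longevity (yrs)"

# Keep rows where both measurements are present
complete = animals.dropna(subset=[mass, longevity])
# Keep mammals with a body mass below the threshold
mammals = complete[(complete["Class"] == "Mammalia") & (complete[mass] < 200_000)]
print(mammals)
```

The filtered columns can then be passed to regplot() as the x and y variables.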

Note
The solution to this activity can be found on page 430.

In the next section, we will learn how to create tree maps using the Squarify library.

Squarify
At this point, we will briefly talk about tree maps. Tree maps display hierarchical
data as a set of nested rectangles. Each group is represented by a rectangle whose
area is proportional to its value. Using color schemes, it is possible to represent
hierarchies (groups, subgroups, and so on). Compared to pie charts, tree maps use
space efficiently. Matplotlib and Seaborn do not offer tree maps, so the Squarify
library, which is built on top of Matplotlib, is used. Seaborn is a great addition for
creating color palettes.

Note
To install Squarify, first launch the command prompt from the
Anaconda Navigator. Then, execute the following command:
pip install squarify.

The following code snippet is a basic tree map example. It requires the
squarify library:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
colors = sns.light_palette("brown", 4)
squarify.plot(sizes=[50, 25, 10, 15], \
              label=["Group A", "Group B", "Group C", "Group D"], \
              color=colors)
plt.axis("off")
plt.show()

The result is shown in the following diagram:

Figure 4.41: Tree map
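The core property of a tree map, namely that each rectangle's area is proportional to its value, can be illustrated without any plotting library. The following sketch uses a simple slice layout that cuts the canvas into vertical strips. This is not the squarified layout that Squarify computes (which aims for rectangles that are closer to squares); the function name and the canvas size are made up for illustration.

```python
def slice_layout(sizes, width, height):
    """Split a width x height rectangle into vertical strips with areas proportional to sizes."""
    total = sum(sizes)
    rects, x = [], 0.0
    for size in sizes:
        strip_width = width * size / total
        rects.append({"x": x, "y": 0.0, "dx": strip_width, "dy": height})
        x += strip_width
    return rects

# Same sizes as in the tree map example above, laid out on a 100 x 100 canvas
rects = slice_layout([50, 25, 10, 15], width=100, height=100)
areas = [r["dx"] * r["dy"] for r in rects]
print(areas)
```

Each area equals the group's share of the total canvas, so the largest group covers half of it.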

Now, let's have a look at a real-world example that uses tree maps in the
following exercise.

Exercise 4.03: Water Usage Revisited


In this exercise, we will create a tree map using the Squarify and Seaborn libraries.
Consider the scenario where you want to save water. Therefore, you visualize your
household's water usage by using a tree map, which can be created with the help of
the Squarify library.

Note
Before beginning the exercise, make sure you have installed Squarify by
executing pip install squarify on your command prompt. The
water_usage.csv dataset used in this exercise is sourced from this
link: https://www.epa.gov/watersense/how-we-use-water. Their data originates
from https://www.waterrf.org/research/projects/residential-end-uses-water-version-2. This dataset is also available in your Datasets folder.

The following are the steps to perform:

1. Create an Exercise4.03.ipynb Jupyter Notebook in the Chapter04/Exercise4.03 folder to implement this exercise.
2. Import the necessary modules and enable plotting within the Exercise4.03.ipynb file:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

3. Use the read_csv() function of pandas to read the water_usage.csv dataset located in the Datasets folder:

mydata = pd.read_csv("../../Datasets/water_usage.csv", \
                     index_col=0)

4. Create a list of labels by accessing each column from the preceding dataset.
Here, astype('str') is used to cast the fetched data to the string type:

labels = mydata['Usage'] \
         + ' (' + mydata['Percentage'].astype('str') + '%)'

5. To create a tree map visualization of the given data, use the plot() function of
the squarify library. We pass three parameters to it. The first parameter
is a list of all the percentages, and the second parameter is a list of all the labels,
which we got in the previous step. The third parameter is the list of colors, which can
be created by using the light_palette() function of the Seaborn library:

# Create figure
plt.figure(dpi=200)
# Create tree map
squarify.plot(sizes=mydata['Percentage'], \
              label=labels, \
              color=sns.light_palette('green', mydata.shape[0]))

plt.axis('off')
# Add title
plt.title('Water usage')
# Show plot
plt.show()

Following is the output of the code:

Figure 4.42: Tree map visualizing the water usage in a household

Note
To access the source code for this specific section, please refer to
https://packt.live/3fxRzqZ.

You can also run this example online at https://packt.live/2N0U4WD.

To conclude this exercise, you can see that tree maps are great for visualizing part-
of-a-whole relationships. We immediately see that using the toilet requires the most
water, followed by showers.
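The label construction from step 4 of the exercise can be tried on a small table without the dataset file. The values below are hypothetical and only illustrate how pandas concatenates strings element by element.

```python
import pandas as pd

# Hypothetical values, for illustration only
usage = pd.DataFrame({"Usage": ["Kitchen", "Garden"], "Percentage": [60, 40]})

# Adding a string to a Series applies it to every row;
# astype(str) turns the numbers into strings so they can be concatenated
labels = usage["Usage"] + " (" + usage["Percentage"].astype(str) + "%)"
print(labels.tolist())
```

Without the cast, adding a string to a numeric column raises an error, which is why the conversion is needed.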

Activity 4.06: Visualizing the Impact of Education on Annual Salary and Weekly Working Hours
In this activity, we will generate multiple plots using a real-life dataset. You're asked
to get insights on whether the education of people has an influence on their annual
salary and weekly working hours. You ask 500 people in the state of New York about
their age, annual salary, weekly working hours, and their education. You first want
to know the percentage for each education type, so you use a tree map.
Two violin plots will be used to visualize the annual salary and weekly working hours.
Compare in each case to what extent education has an impact.

All visualizations in this activity should also be suitable for colorblind people.
In general, this is always a good idea to bear in mind.

Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html is used in this activity. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.

The following are the steps to perform:

1. Use pandas to read the age_salary_hours.csv dataset located in the Datasets folder.
2. Use a tree map to visualize the percentages for each education type. After
executing the preceding steps, the outputs should appear as follows:

Figure 4.43: Tree map



3. Create a subplot with two rows to visualize two violin plots for the annual salary
and weekly working hours, respectively. Compare in each case to what extent
education has an impact. To exclude pensioners, only consider people younger
than 65. Use a colormap that is suitable for colorblind people. subplots()
can be used in combination with Seaborn's plotting functions by simply passing the ax
argument with the respective axes. The following output will be generated after
implementing this step:

Figure 4.44: Violin plots showing the impact of education on annual salary and weekly working hours
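The data preparation behind steps 2 and 3 can be sketched with pandas before any plotting takes place. The rows below are hypothetical, and the column names Age and Education are assumed to match those in age_salary_hours.csv.

```python
import pandas as pd

# Hypothetical survey rows; the column names are assumptions about age_salary_hours.csv
people = pd.DataFrame({
    "Age": [25, 70, 41, 38, 59, 66],
    "Education": ["Bachelor", "Master", "Bachelor", "Doctorate", "Master", "Bachelor"],
})

# Share of each education type in percent (usable as the sizes of a tree map)
percentages = people["Education"].value_counts(normalize=True) * 100
print(percentages.round(1))

# Exclude pensioners before drawing the violin plots
working_age = people[people["Age"] < 65]
print(len(working_age))
```

The index of percentages holds the education types, which can be reused as tree map labels.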
