DATA VISUALIZATION
21AD71
MODULE-3
Simplifying Visualizations using Seaborn
Seaborn helps you explore and understand your data. Its plotting functions
operate on data frames and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce
informative plots.
Advantages of Seaborn
Seaborn is built to operate on DataFrames and full dataset arrays, which makes
plotting tabular data simpler than with Matplotlib alone. It internally performs
the necessary semantic mappings and statistical aggregation to produce
informative plots.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
S.VINUTHA, RNSIT CSE-DS 4
DATA VISUALIZATION -21AD71 NOTES
%matplotlib inline
import matplotlib.pyplot as plt
The aesthetics are only changed temporarily. The result is shown in the
following diagram:
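A temporary style change of this kind can be sketched with the axes_style() context manager. This is a minimal illustration, assuming the "whitegrid" style and the same sample lists used earlier in these notes; the variable names grid_before, grid_inside, and grid_after are introduced here only to show that the setting is restored.

```python
import matplotlib.pyplot as plt
import seaborn as sns

x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]

# Remember the global grid setting so we can compare it afterwards.
grid_before = plt.rcParams["axes.grid"]

# Used as a context manager, axes_style() applies the style only inside the block.
with sns.axes_style("whitegrid"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    ax.plot(x1, label="Group A")
    ax.plot(x2, label="Group B")
    ax.legend()

# Outside the block, the previous setting is active again.
grid_after = plt.rcParams["axes.grid"]
```

In a notebook the figure renders inline; plots created after the with block use the previous style again.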
The despine() function removes the top and right axes spines from a plot.
Parameters:
fig (optional): Figure object to apply despine to.
ax (optional): Axes object to apply despine to.
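As a small sketch of the ax parameter described above, the following applies despine() to one specific Axes object. The plotted values are sample data chosen for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
ax.plot([10, 20, 5, 40, 8], label="Group A")
ax.legend()

# Without spine arguments, despine() hides the top and right spines of this Axes.
sns.despine(ax=ax)
```

The left and bottom spines stay visible unless left=True or bottom=True is passed as well.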
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
set_context() sets the parameter dictionary used to scale Figure elements.
Parameters:
context: A dictionary or name from preconfigured sets: paper, notebook,
talk, or poster.
font_scale (optional): A scaling factor for font size.
rc (optional): Overrides values in preset Seaborn context dictionaries.
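The three parameters can be combined in one call. The sketch below is illustrative only: the "talk" preset, the font scale of 1.2, and the line width of 3 are arbitrary values chosen for this example, not settings taken from these notes.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# "talk" preset, fonts enlarged by 20%, and the line width overridden through rc.
sns.set_context("talk", font_scale=1.2, rc={"lines.linewidth": 3})

# plotting_context() returns the context parameters that are currently active.
active = sns.plotting_context()

fig, ax = plt.subplots()
ax.plot([10, 20, 5, 40, 8], label="Group A")
ax.legend()
```

Values passed through rc take precedence over the preset, so the line width here is 3 regardless of what the "talk" context defines.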
The following example uses a box plot to compare IQ scores of different test
groups.
print(group_b)
print(group_c)
print(group_d)
Once we have the data for each test group, we need to construct a DataFrame
from this data. This can be done with the help of
the pd.DataFrame() function, which is provided by pandas:
data = pd.DataFrame({
    'Groups': ['Group A'] * len(group_a) + ['Group B'] * len(group_b)
              + ['Group C'] * len(group_c) + ['Group D'] * len(group_d),
    'IQ score': group_a + group_b + group_c + group_d,
})
If you did not create the DataFrame yourself, it is often helpful to print its
column names, which are available through the DataFrame's columns attribute.
You can see that our DataFrame has two variables with the labels Groups and
IQ score. This is especially interesting since
we can use them to specify which variable to plot on the x-axis and which one
on the y-axis.
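To make the two-column layout concrete, here is a minimal sketch with two small groups. The scores are hypothetical values made up for illustration; they are not the data used in these notes.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
group_a = [100, 110, 95]
group_b = [105, 98, 120]

data = pd.DataFrame({
    "Groups": ["Group A"] * len(group_a) + ["Group B"] * len(group_b),
    "IQ score": group_a + group_b,
})

# The column labels are the names later passed as the x and y variables.
print(list(data.columns))
```

Each row holds one observation (a group label and a score), which is the long-form layout that Seaborn's plotting functions expect.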
Pass the DataFrame to the boxplot() function, specifying the x-axis variable as
"Groups" and the y-axis variable as "IQ score."
plt.figure(dpi=150)
# Set style
sns.set_style('whitegrid')
# Create boxplot
sns.boxplot(x='Groups', y='IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
The despine() function removes the top and right spines from the plot by
default (that is, when called without any arguments). Here, we have also
removed the left spine. The title() function sets the title of the plot, and
the show() function displays it.
After executing the preceding steps, the final output should be as follows:
2. Color Palettes
Categorical palettes (or qualitative color palettes) are best suited for
distinguishing categorical data that does not have an inherent ordering.
The color palette should have colors as distinct from one another as possible.
palette2 = sns.color_palette("muted")
sns.palplot(palette2)
palette3 = sns.color_palette("bright")
sns.palplot(palette3)
palette4 = sns.color_palette("pastel")
sns.palplot(palette4)
palette5 = sns.color_palette("dark")
sns.palplot(palette5)
palette6 = sns.color_palette("colorblind")
sns.palplot(palette6)
Sequential color palettes are appropriate for sequential data that ranges from
low to high values, or vice versa.
It is recommended to use bright colors for low values and dark ones for
high values.
One of the sequential color palettes that Seaborn offers is the cubehelix
palette. It has a linear increase or decrease in brightness and some
variation in hue, so the information is preserved even when converted to
black and white.
The default palette returned by cubehelix_palette() is illustrated in the
following diagram. To customize the cubehelix palette, the hue at the
start of the helix can be set with start (a value between 0 and 3), or the
number of rotations around the hue wheel can be set with rot:
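A short sketch of both options follows. The values start=2 and rot=0.8 are arbitrary choices for illustration, not values taken from these notes.

```python
import seaborn as sns

# Default cubehelix palette: a list-like object of RGB tuples.
default_palette = sns.cubehelix_palette()

# Illustrative customization: start the helix at hue 2 and make 0.8 of a rotation.
custom_palette = sns.cubehelix_palette(start=2, rot=0.8)

# palplot() draws each palette as a row of color swatches.
sns.palplot(default_palette)
sns.palplot(custom_palette)
```

The n_colors parameter controls how many colors are returned if more or fewer steps are needed.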
Custom sequential palettes that start at a light or dark desaturated color and
end at a specified color can be created with light_palette() or
dark_palette(). For example:
custom_palette2 = sns.light_palette("magenta")
sns.palplot(custom_palette2)
The preceding palette can also be reversed by setting the reverse parameter to
True in the following code:
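A minimal sketch of the reversed palette, assuming the same "magenta" base color as above. The variable names forward and reversed_palette are introduced here for illustration.

```python
import seaborn as sns

# Default direction: from a light desaturated color to magenta.
forward = sns.light_palette("magenta")

# reverse=True flips the order: from magenta to the light color.
reversed_palette = sns.light_palette("magenta", reverse=True)

sns.palplot(reversed_palette)
```

dark_palette() accepts the same reverse parameter.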
By default, creating a color palette only returns a list of colors. If you want to
use it as a colormap object, for example, in combination
with a heatmap, set the as_cmap=True argument, as demonstrated in the
following example:
import numpy as np
import seaborn as sns
# 5 x 5 grid of the values 0 to 24
x = np.arange(25).reshape(5, 5)
ax = sns.heatmap(x, cmap=sns.cubehelix_palette(as_cmap=True))
This creates the following heatmap:
Creating bar plots with subgroups in Matplotlib is quite tedious, but Seaborn
offers a very convenient way to create various bar plots.
In Seaborn, bar plots can also represent estimates of central tendency with
the height of each bar, while uncertainty is indicated by error bars at the
top of the bar.
import pandas as pd
import seaborn as sns
data = pd.read_csv("../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.barplot(x="Education", y="Salary", hue="District", data=data)
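%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
# distplot() is deprecated in recent Seaborn releases; histplot() with kde=True
# draws the histogram together with the density curve
sns.histplot(data.loc[:, 'Age'], kde=True, stat='density')
plt.xlabel('Age')
plt.ylabel('Density')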
The KDE plot is shown in the following diagram, along with a shaded
area under the curve:
The joint distribution is shown as a contour plot in the center of the diagram.
The darker the color, the higher the density. The marginal distributions are
visualized on the top and on the right.
Visualizing Pairwise Relationships
For visualizing multiple pairwise relationships in a dataset, Seaborn
offers the pairplot() function.
This function creates a matrix where off-diagonal elements visualize the
relationship between each pair of variables and the diagonal elements
show the marginal distributions.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data, hue='Education')
Violin Plots
A different approach to visualizing statistical measures is by using violin
plots. They combine box plots with the kernel density estimation
procedure that we described previously.
Violin plots provide a richer description of the variable's distribution.
Additionally, the quartile and whisker values from the box plot are shown
inside the violin.
The following example demonstrates the usage of violin plots:
import pandas as pd
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.violinplot(x='Education', y='Salary', hue='Gender',
               data=data, split=True, cut=0)