Datascience
Unit-5
DATA VISUALIZATION
Importing Matplotlib - Line plots - Scatter plots -
visualizing errors - density and contour plots -
Histograms - legends - colours - subplots - text
and annotation - customization - three
dimensional plotting - Geographic Data with
Basemap - Visualization with Seaborn
INTRODUCTION OF MATPLOTLIB
Matplotlib is an amazing visualization library in Python for
2D plots of arrays.
Installation :
Windows, Linux and macOS distributions have matplotlib and most of its
dependencies as wheel packages. Run the following command to
install matplotlib:
python -m pip install -U matplotlib
Importing matplotlib :
from matplotlib import pyplot as plt
Line plot :
# x-axis values
x = [5, 2, 9, 4, 7]
# Y-axis values
y = [10, 5, 8, 4, 2]
# Function to plot
plt.plot(x, y)
plt.show()
Output :
Bar plot :
# importing matplotlib module
from matplotlib import pyplot as plt
# x-axis values
x = [5, 2, 9, 4, 7]
# Y-axis values
y = [10, 5, 8, 4, 2]
plt.bar(x,y)
plt.show()
Output :
Histogram :
# importing matplotlib module
from matplotlib import pyplot as plt
# Y-axis values
y = [10, 5, 8, 4, 2]
# Function to plot histogram
plt.hist(y)
plt.show()
Output :
Types of Line Plots
There are three main types of line plots in common use: the simple line plot,
the multiple line plot, and the compound line plot.
Ex: The line plot here is a simple (single) line plot that represents data on
students' heights.
Ex: Here the multiple line plot shows the number of Class 9 and
Class 10 students choosing different subjects.
Compound Line Graph
This type of chart is used when information can be separated into several
categories. It extends the basic line graph to show both the overall total
and the layers that make up that total.
To create a compound line graph, we first draw several line graphs and then
shade the region under each one to show each component's share of the
total. The top line represents the total, while each lower line represents
the running total of the components beneath it. The distance between any
two consecutive lines represents the size of one component, with the
bottom band bounded by the horizontal axis.
Ex: Here the compound line graph shows the number of Class 8,
Class 9, and Class 10 students choosing different subjects.
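The construction described above can be sketched with Matplotlib's stackplot function, which stacks each series on the previous one and shades the layers. The subject names and student counts below are hypothetical and serve only as an illustration; they are not the data behind the original figure.

```python
import matplotlib.pyplot as plt

subjects = ['Maths', 'Science', 'English', 'History']
positions = list(range(len(subjects)))

# Hypothetical student counts per subject, for illustration only
class_8 = [12, 15, 9, 7]
class_9 = [10, 14, 11, 6]
class_10 = [8, 13, 12, 9]

fig, ax = plt.subplots()
# stackplot draws each series on top of the previous one and shades the bands,
# so the top edge is the total and each band's thickness is one class's count
bands = ax.stackplot(positions, class_8, class_9, class_10,
                     labels=['Class 8', 'Class 9', 'Class 10'])
ax.set_xticks(positions)
ax.set_xticklabels(subjects)
ax.set_ylabel('Number of students')
ax.legend(loc='upper right')
```

Reading the chart, the height of the top edge at each subject is the sum of the three classes, which is the "total" line the text refers to.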
SIMPLE SCATTER PLOT
A scatter plot (aka scatter chart, scatter graph) uses dots to
represent values for two different numeric variables. The position
of each dot on the horizontal and vertical axis indicates values for
an individual data point. Scatter plots are used to observe
relationships between variables.
The example scatter plot above shows the diameters and heights
for a sample of fictional trees. Each dot represents a single tree;
each point’s horizontal position indicates that tree’s diameter (in
centimeters) and the vertical position indicates that tree’s height (in
meters). From the plot, we can see a generally tight positive
correlation between a tree’s diameter and its height. We can also
observe an outlier point, a tree that has a much larger diameter
than the others. This tree appears fairly short for its girth, which
might warrant further investigation.
Diameter (cm)   Height (m)
4.20            3.14
5.55            3.87
3.33            2.84
6.91            4.34
…               …
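A minimal sketch of how such a scatter plot could be drawn with Matplotlib, using only the four diameter and height pairs listed above. The variable names and the title are illustrative and not part of the original notes.

```python
import matplotlib.pyplot as plt

# The four sample rows listed above: diameter in cm, height in m
diameters = [4.20, 5.55, 3.33, 6.91]
heights = [3.14, 3.87, 2.84, 4.34]

fig, ax = plt.subplots()
ax.scatter(diameters, heights)  # one dot per tree
ax.set_xlabel('Diameter (cm)')
ax.set_ylabel('Height (m)')
ax.set_title('Tree diameter vs. height')
```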
In [3]:
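import numpy as np
import matplotlib.pyplot as plt
def f(x, y):  # example surface, assumed here because its definition is not shown
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)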
Now let's look at this with a standard line-only contour plot:
In [4]:
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented by
dashed lines, and positive values by solid lines. Alternatively, the lines can be color-coded
by specifying a colormap with the cmap argument. Here, we'll also specify that we want
more lines to be drawn—20 equally spaced intervals within the data range:
In [5]:
plt.contour(X, Y, Z, 20, cmap='RdGy');
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice for centered
data. Matplotlib has a wide range of colormaps available, which you can easily browse in
IPython by doing a tab completion on the plt.cm module:
plt.cm.<TAB>
Our plot is looking nicer, but the spaces between the lines may be a bit distracting. We can
change this by switching to a filled contour plot using the plt.contourf() function (notice
the f at the end), which uses largely the same syntax as plt.contour().
In [6]:
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
The colorbar makes it clear that the black regions are "peaks," while the red regions are
"valleys."
One potential issue with this plot is that it is a bit "splotchy." That is, the color steps are
discrete rather than continuous, which is not always what is desired. This could be
remedied by setting the number of contours to a very high number, but this results in a
rather inefficient plot: Matplotlib must render a new polygon for each step in the level. A
better way to handle this is to use the plt.imshow() function, which interprets a two-
dimensional grid of data as an image.
In [7]:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy')
plt.colorbar()
plt.axis('image');
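There are a few potential gotchas with imshow(), however. It does not accept an x and y
grid, so the extent [xmin, xmax, ymin, ymax] of the image must be specified manually. By
default it also places the origin in the upper left, as is standard for image arrays, rather
than in the lower left as in most contour plots, so origin='lower' is needed when showing
gridded data.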
Finally, it can sometimes be useful to combine contour plots and image plots. For
example, here we'll use a partially transparent background image (with transparency set
via the alpha parameter) and overplot contours with labels on the contours themselves
(using the plt.clabel() function):
In [8]:
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy', alpha=0.5)
plt.colorbar();
This type of graph is widely used in cartography, where contour lines on a topographic map
join points of equal elevation. Many other disciplines use contour graphs, including
astronomy, meteorology, and physics. Contour lines commonly show altitude (such as the
height of geographical features), but they can also be used to show density, brightness, or
electric potential.
A contour plot is appropriate if you want to see how some value Z changes as
a function of two inputs, X and Y:
z = f(x, y).
The most common form is the rectangular contour plot, which is (as the name suggests)
shaped like a rectangle.
Polar contour plots are circular.
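As a hedged illustration of the polar case, the sketch below draws a filled contour plot on polar axes. The function of (r, theta) used here is an arbitrary example chosen for this sketch, not one taken from the notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid in polar coordinates: angle theta (radians) and radius r
theta = np.linspace(0, 2 * np.pi, 100)
r = np.linspace(0, 1, 50)
Theta, R = np.meshgrid(theta, r)

# Illustrative function of (r, theta); any z = f(r, theta) would do
Z = R * np.sin(3 * Theta)

# A polar projection makes the plot circular rather than rectangular
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
filled = ax.contourf(Theta, R, Z, 20, cmap='RdGy')
fig.colorbar(filled)
```

On polar axes the first coordinate passed to contourf is the angle and the second is the radius, which is why Theta comes before R.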
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-white')  # named 'seaborn-white' before Matplotlib 3.6
data = np.random.randn(1000)
In [2]:
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here's
an example of a more customized histogram:
In [3]:
# density=True replaces the normed=True argument, which was removed from Matplotlib
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
The plt.hist docstring has more information on other customization options available. I
find this combination of histtype='stepfilled' along with some transparency alpha to be
very useful when comparing histograms of several distributions:
In [4]:
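x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
# Shared styling for all three histograms (the alpha and bins values are illustrative)
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);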
If you would like to simply compute the histogram (that is, count the number of points in a
given bin) and not display it, the np.histogram() function is available:
In [5]:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[ 12 190 468 301 29]
In [6]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d: Two-dimensional histogram
One straightforward way to plot a two-dimensional histogram is to use
Matplotlib's plt.hist2d function:
In [12]:
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Just as with plt.hist, plt.hist2d has a number of extra options to fine-tune the plot and
the binning, which are nicely outlined in the function docstring. Further, just
as plt.hist has a counterpart in np.histogram, plt.hist2d has a counterpart
in np.histogram2d, which can be used as follows:
In [8]:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
For the generalization of this histogram binning in dimensions higher than two, see
the np.histogramdd function.
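A small sketch of np.histogramdd on three-dimensional data, to show the shape of what it returns. The random points are generated only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 illustrative points in three dimensions (one row per point)
points = rng.normal(size=(1000, 3))

# Five bins along each dimension give a 5 x 5 x 5 array of counts
counts, edges = np.histogramdd(points, bins=5)

print(counts.shape)   # (5, 5, 5)
print(len(edges))     # 3: one array of bin edges per dimension
print(counts.sum())   # 1000.0: every point falls in exactly one bin
```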
plt.hexbin: Hexagonal binnings
The two-dimensional histogram creates a tessellation of squares across the axes. Another
natural shape for such a tessellation is the regular hexagon. For this purpose, Matplotlib
provides the plt.hexbin routine, which represents a two-dimensional dataset binned
within a grid of hexagons:
In [9]:
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
plt.hexbin has a number of interesting options, including the ability to specify weights for
each point, and to change the output in each bin to any NumPy aggregate (mean of
weights, standard deviation of weights, etc.).
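The weighting option mentioned above could look like the following sketch, which passes per-point values through the C argument and averages them in each hexagon with reduce_C_function. The weights are random numbers invented for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T
# Hypothetical per-point weights, used only to illustrate the C argument
weights = rng.uniform(0, 1, size=x.size)

fig, ax = plt.subplots()
# C supplies a value per point; reduce_C_function aggregates them per hexagon
hb = ax.hexbin(x, y, C=weights, reduce_C_function=np.mean,
               gridsize=30, cmap='Blues')
fig.colorbar(hb, label='mean weight in bin')
```

Swapping np.mean for np.std or np.sum changes the statistic shown in each hexagon without changing the binning.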
The simplest legend can be created with the plt.legend() command, which automatically
creates a legend for any labeled plot elements:
In [1]:
import matplotlib.pyplot as plt
plt.style.use('classic')
In [2]:
%matplotlib inline
import numpy as np
In [3]:
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.axis('equal')
leg = ax.legend();
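Beyond the default call, the legend's placement and appearance can be adjusted through keyword arguments. The sketch below shows three commonly used ones (loc, frameon, ncol); the extra unlabeled curve is added here only to show that unlabeled elements are skipped.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.plot(x, np.sin(2 * x))  # no label, so it is left out of the legend

# loc picks the corner, frameon=False drops the box, ncol sets the column count
leg = ax.legend(loc='upper left', frameon=False, ncol=2)
```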
CUSTOMIZED COLORBARS
A colorbar needs a "mappable" (matplotlib.cm.ScalarMappable) object (typically, an
image) which indicates the colormap and the norm to be used. In order to create a
colorbar without an attached image, one can instead use a ScalarMappable with no
associated data.
The arguments to the colorbar call are the ScalarMappable (constructed using
the norm and cmap arguments), the axes where the colorbar should be drawn, and the
colorbar's orientation.
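import matplotlib as mpl
import matplotlib.pyplot as plt
# A short, wide Axes to hold the horizontal colorbar
fig, ax = plt.subplots(figsize=(6, 1))
cmap = mpl.cm.cool
norm = mpl.colors.Normalize(vmin=5, vmax=10)
fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
             cax=ax, orientation='horizontal', label='Some Units')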
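Calling plt.subplots() without arguments returns a Figure and a single Axes. This is the
simplest and recommended way of creating a single Figure and Axes.
# Sample data, assumed for illustration
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('A single plot')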
Stacking subplots in one direction
The first two optional arguments of pyplot.subplots define the number of rows and columns
of the subplot grid.
When stacking in one direction only, the returned axs is a 1D numpy array containing the list
of created Axes.
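A minimal sketch of stacking in one direction, first vertically and then horizontally. The sample data is assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data, assumed for illustration
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

# Two rows, one column: axs is a 1D array holding two Axes
fig, axs = plt.subplots(2)
fig.suptitle('Vertically stacked subplots')
axs[0].plot(x, y)
axs[1].plot(x, -y)

# One row, two columns: the 1D result can be unpacked directly
fig2, (left, right) = plt.subplots(1, 2)
fig2.suptitle('Horizontally stacked subplots')
left.plot(x, y)
right.plot(x, -y)
```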
If you have to set parameters for each subplot it's handy to iterate over all subplots in a 2D
grid using for ax in axs.flat:.
for ax in axs.flat:
    ax.set(xlabel='x-label', ylabel='y-label')
# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()
You can use tuple-unpacking also in 2D to assign all subplots to dedicated variables:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
for ax in fig.get_axes():
    ax.label_outer()
Sharing axes
By default, each Axes is scaled individually. Thus, if the ranges are different the tick values
of the subplots do not align.
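The difference can be seen in a short sketch that builds the same two plots with and without sharex=True. The sample data and the shift of one curve by 1 are assumptions made only to give the two Axes different ranges.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 400)  # sample data, assumed for illustration
y = np.sin(x ** 2)

# Without sharing, each Axes picks its own x-range
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(x, y)
ax2.plot(x + 1, -y)

# With sharex=True, both Axes use one common x-range, so the ticks line up
fig_shared, (ax3, ax4) = plt.subplots(2, sharex=True)
ax3.plot(x, y)
ax4.plot(x + 1, -y)
```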
To precisely control the positioning of the subplots, one can explicitly create
a GridSpec with Figure.add_gridspec, and then call its subplots method. For example, we
can reduce the height between vertical subplots using add_gridspec(hspace=0).
label_outer is a handy method to remove labels and ticks from subplots that are not at the
edge of the grid.
fig = plt.figure()
gs = fig.add_gridspec(3, hspace=0)
axs = gs.subplots(sharex=True, sharey=True)
fig.suptitle('Sharing both axes')
axs[0].plot(x, y ** 2)
axs[1].plot(x, 0.3 * y, 'o')
axs[2].plot(x, y, '+')
# Hide x labels and tick labels for all but bottom plot.
for ax in axs:
    ax.label_outer()
Apart from True and False, both sharex and sharey accept the values 'row' and 'col' to share
the values only per row or column.
fig = plt.figure()
gs = fig.add_gridspec(2, 2, hspace=0, wspace=0)
(ax1, ax2), (ax3, ax4) = gs.subplots(sharex='col', sharey='row')
fig.suptitle('Sharing x per column, y per row')
ax1.plot(x, y)
ax2.plot(x, y**2, 'tab:orange')
ax3.plot(x + 1, -y, 'tab:green')
ax4.plot(x + 2, -y**2, 'tab:red')
for ax in fig.get_axes():
    ax.label_outer()
If you want a more complex sharing structure, you can first create the grid of axes with no
sharing, and then call axes.Axes.sharex or axes.Axes.sharey to add sharing info a posteriori.
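A sketch of adding sharing after the fact, assuming a Matplotlib version that provides Axes.sharex (3.3 or later). The grid is created with no sharing, and only the bottom row is then linked on its x-axis; the data is assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 400)  # sample data, assumed for illustration
y = np.sin(x ** 2)

# Create a 2x2 grid with no sharing at all
fig, axs = plt.subplots(2, 2)
axs[0, 0].plot(x, y)
axs[0, 1].plot(x + 1, y)
axs[1, 0].plot(x, -y)
axs[1, 1].plot(x + 2, -y)

# Afterwards, link only the two Axes in the bottom row on their x-axis
axs[1, 1].sharex(axs[1, 0])
```

The top row stays independent, which is the kind of irregular sharing structure that the sharex and sharey arguments of subplots cannot express directly.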
plt.show()
Text and Annotation
Creating a good visualization involves guiding the reader so that the figure tells a story. In
some cases, this story can be told in an entirely visual manner, without the need for added
text, but in others, small textual cues and labels are necessary. Perhaps the most basic
types of annotations you will use are axes labels and titles, but the options go beyond this.
Let's take a look at some data and how we might visualize and annotate it to help convey
interesting information. We'll start by setting up the notebook for plotting and importing
the functions we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' before Matplotlib 3.6
import numpy as np
import pandas as pd
We'll work with a births dataset, starting with a basic cleaning step and then plotting the results:
In [2]:
births = pd.read_csv('data/births.csv')
births['day'] = births['day'].astype(int)
When we're communicating data like this, it is often useful to annotate certain features of
the plot to draw the reader's attention. This can be done manually with
the plt.text/ax.text command, which will place text at a particular x/y value:
In [4]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
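Because the births data itself is not included in these notes, here is a minimal sketch of ax.text on a synthetic curve. The curve, label strings, and positions are all invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic curve standing in for the real data, for illustration only
x = np.linspace(0, 10, 200)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(x, y)

# Place each label at data coordinates (x, y); ha sets horizontal alignment
style = dict(size=10, color='gray')
ax.text(np.pi / 2, 1.05, 'first peak', ha='center', **style)
ax.text(3 * np.pi / 2, -1.15, 'first trough', ha='center', **style)
ax.set_ylim(-1.5, 1.5)
```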
Any graphics display framework needs some scheme for translating between coordinate
systems. For example, a data point at (x, y) = (1, 1) needs to somehow be
represented at a certain location on the figure, which in turn needs to be represented in
pixels on the screen. Mathematically, such coordinate transformations are relatively
straightforward, and Matplotlib has a well-developed set of tools that it uses internally to
perform them (these tools can be explored in the matplotlib.transforms submodule).
The average user rarely needs to worry about the details of these transforms, but it is
helpful knowledge to have when considering the placement of text on a figure. There are
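three pre-defined transforms that can be useful in this situation: ax.transData (data
coordinates), ax.transAxes (axes coordinates, as a fraction of the axes size), and
fig.transFigure (figure coordinates, as a fraction of the figure size).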
Here let's look at an example of drawing text at various locations using these transforms:
In [5]:
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
Note that by default, the text is anchored at the left end of its baseline, so it extends above
and to the right of the specified coordinates: here the "." at the beginning of each string
approximately marks the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-
axis labels. The transAxes coordinates give the location from the bottom-left corner of the
axes (here the white box), as a fraction of the axes size. The transFigure coordinates are
similar, but specify the position from the bottom-left of the figure (here the gray box), as a
fraction of the figure size.
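The three coordinate systems described above can be sketched as follows. The positions and label strings are illustrative; each string starts with "." so that the dot marks the anchor point.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])

# transform=ax.transData is the default, but it is spelled out here
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
# Fractions of the axes size, measured from the bottom-left of the axes
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
# Fractions of the figure size, measured from the bottom-left of the figure
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure)
```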
Notice now that if we change the axes limits, it is only the transData coordinates that will
be affected, while the others remain stationary:
In [6]:
ax.set_xlim(0, 2)
ax.set_ylim(-6, 6)
fig
Out[6]:
This behavior can be seen more clearly by changing the axes limits interactively: if you are
executing this code in a notebook, you can make that happen by changing %matplotlib
inline to %matplotlib notebook and using each plot's menu to interact with the plot.
Drawing arrows in Matplotlib is often much harder than you'd bargain for. While there is
a plt.arrow() function available, I wouldn't suggest using it: the arrows it creates are SVG
objects that will be subject to the varying aspect ratio of your plots, and the result is rarely
what the user intended. Instead, I'd suggest using the plt.annotate() function. This
function creates some text and an arrow, and the arrows can be very flexibly specified.
In [7]:
%matplotlib inline
fig, ax = plt.subplots()
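The cell above stops after creating the axes. A hedged sketch of how annotate might be used on a cosine curve follows; the label positions and arrow settings are chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')

# xy is the point being annotated; xytext is where the label is placed
ax.annotate('local maximum', xy=(2 * np.pi, 1), xytext=(10, 4),
            arrowprops=dict(facecolor='black', shrink=0.05))
ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
            arrowprops=dict(arrowstyle='->',
                            connectionstyle='angle3,angleA=0,angleB=-90'))
```

The first call uses the default arrow with a fill colour, while the second selects a named arrowstyle and a curved connection, which are two of the main ways to configure arrowprops.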
The arrow style is controlled through the arrowprops dictionary, which has numerous
options available. These options are fairly well-documented in Matplotlib's online
documentation, so rather than repeating them here it is probably more useful to quickly
show some of the possibilities. Let's demonstrate several of the possible options using the
birthrate plot from before:
In [8]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
ax.set_ylim(3600, 5400);
You'll notice that the specifications of the arrows and text boxes are very detailed: this
gives you the power to create nearly any arrow style you wish. Unfortunately, it also
means that these sorts of features often must be manually tweaked, a process that can be
very time consuming when producing publication-quality graphics! Finally, I'll note that
the preceding mix of styles is by no means best practice for presenting data, but rather
included as a demonstration of some of the available options.
More discussion and examples of available arrow and annotation styles can be found in
the Matplotlib gallery, in particular the Annotation Demo.
Customizing Ticks
Matplotlib's default tick locators and formatters are designed to be generally sufficient in
many common situations, but are in no way optimal for every plot. This section will give
several examples of adjusting the tick locations and formatting for the particular plot type
you're interested in.
Before we go into examples, it will be best for us to understand further the object
hierarchy of Matplotlib plots. Matplotlib aims to have a Python object representing
everything that appears on the plot: for example, recall that the figure is the bounding
box within which plot elements appear. Each Matplotlib object can also act as a container
of sub-objects: for example, each figure can contain one or more axes objects, each of
which in turn contains other objects representing plot contents.
The tick marks are no exception. Each axes has attributes xaxis and yaxis, which in turn
have attributes that contain all the properties of the lines, ticks, and labels that make up
the axes.
In [1]:
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
In [2]:
ax = plt.axes(xscale='log', yscale='log')
ax.grid();
We see here that each major tick shows a large tickmark and a label, while each minor tick
shows a smaller tickmark with no label.
In [3]:
print(ax.xaxis.get_major_locator())
print(ax.xaxis.get_minor_locator())
<matplotlib.ticker.LogLocator object at 0x10dbaf630>
<matplotlib.ticker.LogLocator object at 0x10dba6e80>
In [4]:
print(ax.xaxis.get_major_formatter())
print(ax.xaxis.get_minor_formatter())
<matplotlib.ticker.LogFormatterMathtext object at 0x10db8dbe0>
<matplotlib.ticker.NullFormatter object at 0x10db9af60>
We see that both major and minor tick labels have their locations specified by
a LogLocator (which makes sense for a logarithmic plot). Minor ticks, though, have their
labels formatted by a NullFormatter: this says that no labels will be shown.
We'll now show a few examples of setting these locators and formatters for various plots.
Hiding Ticks or Labels
Perhaps the most common tick/label formatting operation is the act of hiding ticks or
labels. This can be done using plt.NullLocator() and plt.NullFormatter(), as shown
here:
In [5]:
ax = plt.axes()
ax.plot(np.random.rand(50))
ax.yaxis.set_major_locator(plt.NullLocator())
ax.xaxis.set_major_formatter(plt.NullFormatter())
Output:
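Hiding ticks is one use of locators; the same mechanism can also reposition them. The sketch below uses two locator classes from matplotlib.ticker on an illustrative sine curve.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker

fig, ax = plt.subplots()
x = np.linspace(0, 3 * np.pi, 300)
ax.plot(x, np.sin(x))

# Major ticks at every multiple of pi/2 instead of the default positions
ax.xaxis.set_major_locator(ticker.MultipleLocator(np.pi / 2))
# At most three intervals (so at most four major ticks) on the y-axis
ax.yaxis.set_major_locator(ticker.MaxNLocator(3))
```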
Points, Lines:
A point in three-dimensional geometry is a location in 3D space that is
uniquely defined by an ordered triplet (x, y, z), where x, y, and z are the
coordinates of the point along the X-axis, Y-axis, and Z-axis respectively
(that is, its signed distances from the YZ-, XZ-, and XY-planes).
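A minimal sketch of plotting a point and a line on three-dimensional axes, assuming a Matplotlib version in which projection='3d' is available without an extra import (3.2 or later). The coordinates and the helix are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# A single point given by the ordered triplet (x, y, z) = (1, 2, 3)
ax.scatter([1], [2], [3], color='red')

# A line (here a helix) is a sequence of points joined in order
t = np.linspace(0, 4 * np.pi, 200)
ax.plot(np.cos(t), np.sin(t), t / (4 * np.pi))

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
```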
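import pandas as pd
# Placeholder values for illustration only; the original data is not shown
data = {'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]}
# Create DataFrame
df = pd.DataFrame(data)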
Output: