Cs3353 Foundations of Data Science Unit V 01.12.2022
1. Data Visualization
Data visualization is the practice of translating information into a visual context, such as a
map or graph, to make data easier for the human brain to understand and pull insights
from. The main goal of data visualization is to make it easier to identify patterns, trends and
outliers in large data sets.
The process of finding trends and correlations in our data by representing it
pictorially is called Data Visualization.
The raw data undergoes different stages within a pipeline, which are:
Fetching the Data
Cleaning the Data
Data Visualization
Modeling the Data
Interpreting the Data
Revision
Data visualization is the graphical representation of information and data in a pictorial or
graphical format (Example: charts, graphs, and maps).
Data visualization is an easy and quick way to convey concepts to others. It also has
further benefits, such as:
Data visualization can identify areas that need improvement or modifications.
Data visualization can clarify which factors influence customer behaviour.
Data visualization helps you to understand which products to place where.
Data visualization can predict sales volumes.
1.2 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-
platform data visualization library built on NumPy arrays and designed to work with the broader
SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest benefits of
visualization is that it allows us visual access to huge amounts of data in easily digestible visuals.
Matplotlib consists of several plots like line, bar, scatter, histogram etc.
Installation :
Run the following command to install the matplotlib package :
python -m pip install -U matplotlib
Once Matplotlib is installed, import it in your applications by adding the import module statement:
from matplotlib import pyplot as plt
or
import matplotlib.pyplot as plt
matplotlib Version
The version string is stored under the __version__ attribute.
import matplotlib
print(matplotlib.__version__)
Output :
3.4.3
Matplotlib Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under the
plt alias:
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Note:
Points plotted are {[5,10], [2,5], [9,8], [4,4], [7,2]}
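import matplotlib.pyplot as plt
import numpy as np
# x and y hold the coordinates of the points to connect
x = np.array([0, 6])
y = np.array([0, 25])
plt.plot(x, y)
plt.show()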
Markers
You can use the keyword argument marker to emphasize each point with a specified marker with
markersize = 15.
/* Python program to show marker */ Output :
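The listing for this caption is not present in these notes. A minimal sketch of the marker and markersize arguments described above might look as follows; the data points are assumed purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data points (assumed for this sketch, not from the notes)
x = np.array([1, 2, 3, 4])
y = np.array([3, 8, 1, 10])

# marker='o' draws a circle at every point; markersize (short form: ms) sets its size
lines = plt.plot(x, y, marker='o', markersize=15)
plt.show()
```

Other marker codes such as '*', 's' or '^' can be passed in the same way.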
Linestyle
You can use the keyword argument linestyle, or the shorter ls, to change the style of the plotted line.
/* Python program to show linestyle */ Output :
plt.plot(x,y,ls ='dashed',marker='*')
plt.show()
Note:
ls ='dashed' is the short form of linestyle = 'dashed'
#plot 1:
x = [0, 1, 2, 3]
y = [3, 8, 1, 10]
plt.subplot(2, 1, 1)
plt.plot(x,y)
#plot 2:
x = [0, 1, 2, 3]
y = [10, 20, 30, 40]
plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.show()
Note:
plt.subplot(2, 1, 1)
It means 2 rows, 1 column, and this plot is the first plot.
plt.subplot(2, 1, 2)
It means 2 rows, 1 column, and this plot is the second plot.
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y=[99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y=[99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y, marker='*', c='red', s=200,
edgecolor='black' )
plt.show()
# dataset2
x2 = [26, 29, 48, 64, 6]
y2 = [26, 34, 90, 33, 38]
plt.scatter(x2, y2, c ="yellow", marker
="^", edgecolor ="red", s =200)
plt.show()
ColorMap
The Matplotlib module has a number of available colormaps. A colormap is like a list of colors,
where each color has a value that ranges from 0 to 100. One such colormap is 'viridis', which
ranges from 0, which is a purple color, up to 100, which is a yellow color.
How to Use the ColorMap?
Specify the colormap with the keyword argument cmap with the value of the colormap, in this
case 'viridis' which is one of the built-in colormaps available in Matplotlib. In addition
create an array with values (from 0 to 100), one value for each point in the scatter plot. Some of the
available ColorMaps are Accent, Blues, BuPu, BuGn, CMRmap, Greens, Greys, Dark2 etc.
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array([0,10,20,30,40,45,50,55,60,70,80,90,100])
plt.scatter(x, y, c=colors, cmap='viridis')
plt.show()
Size
We can change the size of the dots with the s argument. Just like colors, make sure the array for
sizes has the same length as the arrays for the x- and y-axis.
/* Python program to Set your own size for the markers*/ Output :
x = np.array([5,6,7,8,9,10])
y = np.array([10,20,30,40,50,60])
colors=["red","green","blue","yellow","violet","purple"]
sizes = np.array([100,200,300,400,500,600])
plt.scatter(x, y, c = colors, s=sizes )
plt.show()
Alpha
Adjust the transparency of the dots with the alpha argument. Just like colors, make sure the array for
sizes has the same length as the arrays for the x- and y-axis.
/* Python program to Set alpha*/
Output :
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,6,7,8,9,10])
y = np.array([10,20,30,40,50,60])
colors=["red","green","blue","yellow","violet","purple"]
sizes = np.array([100,200,300,400,500,600])
plt.scatter(x,y,c=colors,s=sizes,alpha=0.5)
plt.show()
Create random arrays with 100 values for x-points, y-points, colors and sizes
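/* Python program to create random arrays, random colors, random sizes */ Output :
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randint(100,size=(100))
y = np.random.randint(100,size=(100))
colors = np.random.randint(100,size=(100))
sizes = 10 * np.random.randint(100,size=(100))
# draw the points; the colormap name is an assumption, as the listing stops before this call
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
plt.show()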
Error bars are a graphical enhancement that visualizes the variability of the plotted
data on a Cartesian graph. Error bars can be applied to graphs to provide an additional layer of
detail on the presented data.
Error bars indicate estimated error or uncertainty. They are drawn as markers over the original
graph and its data points: lines extend from the centre of each plotted data point to show the
uncertainty of that data point.
A short error bar shows that values are concentrated around the plotted value, while a
long error bar indicates that the values are more spread out and less reliable. The two error bars
of a pair are usually of equal length on both sides; however, if the data is skewed then
the lengths on each side are unbalanced.
Error bars always run parallel to a quantitative scale axis, so they can be displayed either vertically
or horizontally depending on whether the quantitative scale is on the y-axis or the x-axis. If there are
two quantitative scales, two pairs of error bars can be used, one for each axis.
/* Python program to add some error in y value in the simple graph */ Output :
# importing matplotlib
import matplotlib.pyplot as plt
# sample data (illustrative values; the listing in these notes omits them)
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 2, 1, 2, 1]
# creating error
y_error = 0.2
# plotting graph with vertical error bars
plt.plot(x, y)
plt.errorbar(x, y, yerr = y_error, fmt ='o')
plt.show()
Note:
fmt is a format code controlling the appearance of lines and points
/* Python program to add some error in x Output :
value in the simple graph */
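The listing for this caption is missing from the notes. A minimal sketch of horizontal error bars, assuming the same kind of small sample data as the y-error case and an assumed error size of 0.5, could be:

```python
import matplotlib.pyplot as plt

# Illustrative sample data (assumed; not taken from the notes)
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 2, 1, 2, 1]

# Same error applied to every x value (assumed magnitude)
x_error = 0.5

plt.plot(x, y)
# xerr draws horizontal bars; fmt='o' marks each point with a circle
container = plt.errorbar(x, y, xerr=x_error, fmt='o')
plt.show()
```

Passing both xerr and yerr to the same errorbar() call draws bars along both axes.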
plt.bar(x,y, color="red")
plt.show()
Following is a simple example of the Matplotlib bar plot. It shows the number of students enrolled for
various courses offered at an institute.
/* Python program to implement Bar Chart */ Output :
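The bar chart listing itself did not survive in these notes. The sketch below shows one way such a chart could be drawn; the course names and enrolment numbers are hypothetical placeholders, not data from the original.

```python
import matplotlib.pyplot as plt

# Hypothetical courses and enrolment counts, for illustration only
courses = ['C', 'C++', 'Java', 'Python']
students = [20, 15, 30, 35]

# One bar per course; width is the fraction of the slot each bar occupies
bars = plt.bar(courses, students, color='maroon', width=0.4)
plt.xlabel('Courses offered')
plt.ylabel('Number of students enrolled')
plt.title('Students enrolled in different courses')
plt.show()
```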
Bar Width
The bar() takes the keyword argument width to set the width of the bars. Default width value is 0.8
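import numpy as np
import matplotlib.pyplot as plt
# Illustrative data (not from the original listing, which omits it)
X = ['Group A', 'Group B', 'Group C', 'Group D']
girls = [10, 20, 20, 40]
boys = [20, 30, 25, 30]
X_axis = np.arange(len(X))
width = 0.25
# two bars per group, shifted left and right of each tick position
plt.bar(X_axis - width/2, girls, width, label='Girls')
plt.bar(X_axis + width/2, boys, width, label='Boys')
plt.xticks(X_axis, X)
plt.xlabel("Groups")
plt.ylabel("Number of Students")
plt.title("Number of Students in each group")
plt.legend()
plt.show()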
numpy.linspace() function
The linspace() function returns evenly spaced numbers over a specified interval [start, stop].
The endpoint of the interval can optionally be excluded.
Syntax: numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
import numpy as np
import matplotlib.pyplot as plt
A = 5
y = np.zeros(A)
a1 = np.linspace(0, 10, 5)
plt.plot(a1, y)
plt.scatter(a1, y, c ="red", marker ="s",
edgecolor ="green", s =50)
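To make the effect of the endpoint and retstep parameters in the syntax above concrete, here is a small NumPy-only sketch; the interval 0 to 10 and the count of 5 simply mirror the listing above.

```python
import numpy as np

# 5 evenly spaced values from 0 to 10, stop included (endpoint=True is the default)
a = np.linspace(0, 10, 5)                    # values: 0, 2.5, 5, 7.5, 10

# The same call with the stop value excluded
b = np.linspace(0, 10, 5, endpoint=False)    # values: 0, 2, 4, 6, 8

# retstep=True additionally returns the spacing between samples
c, step = np.linspace(0, 10, 5, retstep=True)  # step is 2.5

print(a)
print(b)
print(step)
```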
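# Python Program to illustrate Linear Plotting Output :
import matplotlib.pyplot as plt
# Illustrative values (not from the original listing, which omits the data)
years = [1972, 1982, 1992, 2002, 2012]
consumption = [100, 160, 300, 400, 720]
plt.plot(years, consumption, marker='o', label='Region A')
plt.xlabel('Years')
plt.ylabel('Power consumption in kWh')
plt.title('Electricity consumption')
plt.legend()
plt.show()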
matplotlib.pyplot.contour
Contour plots, drawn with matplotlib.pyplot.contour(), are useful when Z = f(X, Y), i.e. Z changes
as a function of the inputs X and Y.
contourf() is also available, which allows us to draw filled contours.
Parameters:
X, Y : 2-D numpy arrays with the same shape as Z, or 1-D arrays with len(X)==N & len(Y)==M [M = rows, N = columns of Z]
Z : The height values over which the contour is drawn. Shape is (M, N)
levels : Determines the number and positions of the contour lines / regions
Example #1: Plotting of Contour using contour() which only plots contour lines.
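The listing for this example is absent from the notes. A sketch of a line-only contour plot is given below; it assumes the same grid and the same function, Z = sqrt(X^2 + Y^2), as the filled-contour listing in this section.

```python
import numpy as np
import matplotlib.pyplot as plt

# Build a 100 x 100 grid over [-3, 3] x [-3, 3]
xlist = np.linspace(-3.0, 3.0, 100)
ylist = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(xlist, ylist)

# Height values: distance of each grid point from the origin
Z = np.sqrt(X**2 + Y**2)

fig, ax = plt.subplots(1, 1)
cp = ax.contour(X, Y, Z)                  # contour lines only, no fill
ax.clabel(cp, inline=True, fontsize=8)    # write the level value on each line
ax.set_title('Contour Plot')
ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
plt.show()
```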
Example #2: Plotting of contour using contourf() which plots filled contours.
/* Python program to implement Contourf Plot */
Output :
import numpy as np
import matplotlib.pyplot as plt
xlist = np.linspace(-3.0, 3.0, 100)
ylist = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(xlist, ylist)
Z = np.sqrt(X**2 + Y**2)
fig,ax=plt.subplots(1,1)
cp = ax.contourf(X, Y, Z)
fig.colorbar(cp)
ax.set_title('Filled Contours Plot')
#ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
plt.show()
import numpy as np
x = np.random.normal(170, 10, 250)
print(x)
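The fragment above only generates the values. One way to draw them as a histogram is sketched below; the seed and the bin count of 10 are assumptions added so that the example is repeatable.

```python
import numpy as np
import matplotlib.pyplot as plt

# 250 values drawn from a normal distribution with mean 170 and standard deviation 10
rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
x = rng.normal(170, 10, 250)

# hist() returns the count in each bin, the bin edges and the drawn bars
counts, bin_edges, patches = plt.hist(x, bins=10)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
```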
Matplotlib.pyplot.annotate() in Python
Matplotlib is a plotting library for Python that acts as a numerical and mathematical extension
of the NumPy library. Pyplot is a state-based interface to the Matplotlib module which provides a
MATLAB-like interface.
Syntax: matplotlib.pyplot.annotate(text, xy, xytext=None, xycoords='data', textcoords=None,
arrowprops=None, annotation_clip=None)
The annotate() function in the pyplot module of the matplotlib library is used to annotate the point xy
with the given text.
Terms used
text: The text of the annotation (this parameter was named s in Matplotlib releases before 3.3).
xy: The point (x, y) to annotate.
xytext: Optional. The position (x, y) to place the text at.
xycoords: Optional. Typically a string naming the coordinate system that xy is given in.
textcoords: Optional. Typically a string naming the coordinate system that xytext is given in,
which may be different from the coordinate system used for xy.
arrowprops: Optional. A dict of properties for the arrow drawn between xytext and xy. Its default
value is None.
annotation_clip: Optional. A boolean value. Its default value is None, in which case the
annotation is clipped only when xy is outside the axes and xycoords is 'data'.
/* Sine waveform */ Output :
import matplotlib.pyplot as plt
import numpy as np
fig, ppool = plt.subplots()
t = np.arange(0.0, 1.0, 0.001)
s = np.sin(2 * np.pi * t)
line = ppool.plot(t, s, lw=2)
# label the peak of the sine wave at (0.25, 1) with an arrow
ppool.annotate('Max value II Year', xy=(.25, 1), xytext=(1, 1),
               arrowprops=dict(facecolor='green', shrink=0.05),
               xycoords="data")
ppool.set_ylim(-1.5, 1.5)
plt.show()
ppool.set_ylim(-2, 2)
plt.show()
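import numpy as np
import matplotlib.pyplot as plt
# Illustrative data (not from the original listing, which omits it)
labels = ['Q1', 'Q2', 'Q3', 'Q4']
shop1 = [20, 34, 30, 35]
shop2 = [25, 32, 34, 20]
x = np.arange(len(labels))
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, shop1, width, label='Shop 1')
rects2 = ax.bar(x + width/2, shop2, width, label='Shop 2')
ax.set_ylabel('Sales')
ax.set_title('Sales report of 2 shops')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    # write the height of each bar just above it
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points", size=16, color="Green",
                    ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()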
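Three-dimensional plots are enabled by importing the mplot3d toolkit, which is included with the
main Matplotlib installation:
from mpl_toolkits import mplot3d
Once this submodule is imported, a three-dimensional axes can be created by passing the keyword
projection='3d' to any of the normal axes creation routines:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection ='3d')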
With the above syntax, three-dimensional axes are enabled and data can be plotted in 3 dimensions.
A 3-dimensional graph gives a more dynamic view and makes the data more interactive. Like 2-D graphs,
3-D graphs can be drawn in different ways: scatter plot, contour plot, surface
plot, etc. Let’s have a look at different 3-D plots.
Graphs with lines and points are the simplest 3-dimensional graphs.
ax.plot3D and ax.scatter3D are the functions to plot line and point graphs respectively.
fig = plt.figure()
# plotting
ax.plot3D(x, y, z, 'darkblue')
ax.set_title('3D Line Plot by II CSE B Students')
plt.show()
# plotting
ax.scatter3D(x, y, z, c ='green')
ax.set_title('3D scatter Plot by II CSE B Students')
plt.show()
# plotting
ax.scatter3D(x, y, z, c = col)
ax.set_title('3D scatter Plot by II CSE B Students')
plt.show()
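The fragments above use x, y, z and col without defining them. A self-contained sketch that draws a 3-D line and a 3-D scatter plot on the same axes is shown below; the spiral data is an assumption chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the '3d' projection on older Matplotlib releases

# Assumed sample data: a spiral that widens as z increases
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)

fig = plt.figure()
ax = plt.axes(projection='3d')

# Line through the points, then the points themselves coloured by height
ax.plot3D(x, y, z, 'darkblue')
ax.scatter3D(x, y, z, c=z, cmap='viridis')
ax.set_title('3D line and scatter plot')
plt.show()
```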
X, Y = np.meshgrid(feature_x, feature_y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
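The fragment above relies on f, feature_x and feature_y, none of which is defined in these notes. A runnable sketch under assumed definitions (a radial sine surface sampled on a 50 x 50 grid) could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the '3d' projection on older Matplotlib releases


def f(x, y):
    # Assumed surface for illustration: sine of the distance from the origin
    return np.sin(np.sqrt(x ** 2 + y ** 2))


# Assumed sampling grid
feature_x = np.linspace(-5, 5, 50)
feature_y = np.linspace(-5, 5, 50)

X, Y = np.meshgrid(feature_x, feature_y)
Z = f(X, Y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')   # 50 contour levels
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
```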
Once you have the Basemap toolkit installed and imported, geographic plots can be easily drawn.
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians should always be vertical.
Hence, this gives better properties near the poles of the projection. The Mollweide projection
(projection='moll') is one common example of this, in which all meridians are elliptical arcs.
Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson
(projection='robin') projections.
Projections used:
projection='moll' & projection='sinu' & projection='robin'
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, similar to if
you photographed the Earth from a particular point in space. One common example is the
orthographic projection (projection='ortho'), which shows one side of the globe as seen from a
viewer at a very long distance. Popular projections used are gnomonic projection and
stereographic projection. These are often the most useful for showing small portions of the map.
Projections used:
projection='gnom' & projection='stere'
Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can lead to very
good local properties, but regions far from the focus point of the cone may become very distorted.
One example of this is the Lambert Conformal Conic projection. It projects the map onto a cone
arranged in such a way that two standard parallels (specified in Basemap by lat_1 and lat_2) have
well-represented distances, with scale decreasing between them and increasing outside of them.
Other useful conic projections are the equidistant conic projection and the Albers equal-area
projection. Conic projections, like perspective projections, tend to be good choices for representing
small to medium patches of the globe.
Projections used:
projection='lcc' & projection='eqdc' & projection='aea'
import numpy as np
import matplotlib.pyplot as plt
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Note:
Seaborn has the following dependencies:
Python 2.7 or 3.4+
numpy
scipy
pandas
matplotlib
Dist plot :
Seaborn dist plot is used to plot a histogram, with some other variations like kdeplot and rugplot.
# Importing libraries Output:
import numpy as np
import seaborn as sns
sns.set(style="dark")
fmri = sns.load_dataset("fmri")
sns.set(style="ticks")
seaborn.lineplot()
x, y: Input data variables; must be numeric. Can pass data directly or reference columns in data.
hue: Grouping variable that will produce lines with different colors. Can be either categorical or
numeric, although color mapping will behave differently in latter case.
style: Grouping variable that will produce lines with different dashes and/or markers. Can have a
numeric dtype but will always be treated as categorical.
data: Tidy (“long-form”) dataframe where each column is a variable and each row is an observation.
markers: Object determining how to draw the markers for different levels of the style variable.
legend: How to draw the legend. If “brief”, numeric hue and size variables will be represented
with a sample of evenly spaced values.
# x axis values
x =['sun', 'mon', 'fri', 'sat', 'tue', 'wed', 'thu']
# y axis values
y =[5, 6.7, 4, 6, 2, 4.9, 1.8]
Stripplot
It basically creates a scatter plot based on the category.
Syntax:
stripplot([x, y, hue, data, order, …])
Explanation:
One problem with strip plot is that you can’t really tell which points are stacked on top of
each other, and hence we use the jitter parameter to add some random noise.
jitter parameter is used to add an amount of jitter (only along the categorical axis), which
can be useful when you have many points and they overlap, so that it is easier to see the
distribution.
hue is used to provide an additional categorical separation.
setting dodge=True (named split in older seaborn releases) is used to draw separate strip plots
based on the category specified by the hue parameter.
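The parameters described above can be tried out with the sketch below. The DataFrame is synthetic: its column names and values are assumptions that loosely imitate a bills-per-day table, so that no dataset download is needed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data (assumed): 60 bills spread over three days and two groups
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day": np.repeat(["Thur", "Fri", "Sat"], 20),
    "sex": np.tile(["Male", "Female"], 30),
    "total_bill": rng.uniform(5, 50, size=60),
})

# jitter spreads overlapping points along the categorical axis;
# dodge=True draws a separate strip for each hue level
ax = sns.stripplot(x="day", y="total_bill", hue="sex", data=df,
                   jitter=True, dodge=True)
plt.show()
```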
Countplot
A countplot basically counts the categories and returns a count of their occurrences. It
is one of the simplest plots provided by the seaborn library.
Syntax:
countplot([x, y, hue, data, order, …])
Explanation:
Looking at the plot we can say that the number of males is more than the number of females in
the dataset. As it only returns the count based on a categorical column, we need to specify only
the x parameter.
Boxplot
A box plot is a visual representation of groups of numerical data through their quartiles.
A boxplot is also used to detect the outliers in the data set.
Syntax:
boxplot([x, y, hue, data, order, hue_order, …])
Explanation:
x takes the categorical column and y is a numerical column.
Violinplot
It is similar to the boxplot except that it provides a richer visualization and uses kernel
density estimation to give a better description of the data distribution.
Syntax:
violinplot([x, y, hue, data, order, …])
Explanation:
hue is used to separate the data further using the sex category.
setting split=True will draw half of a violin for each level. This can make it easier to
directly compare the distributions.
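A sketch of the hue and split parameters described above follows. The DataFrame is synthetic (its columns and values are assumptions), and the hue column has exactly two levels, which split=True requires.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data (assumed): 120 bills over three days, two groups per day
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "day": np.repeat(["Thur", "Fri", "Sat"], 40),
    "sex": np.tile(["Male", "Female"], 60),
    "total_bill": rng.normal(20, 5, size=120),
})

# With two hue levels, split=True draws one half of the violin for each level
ax = sns.violinplot(x="day", y="total_bill", hue="sex", data=df, split=True)
plt.show()
```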
# make a countplot
sns.countplot(x ='sex', data = tips)
Note:
style must be one of white, dark, whitegrid, darkgrid, ticks
Removing Axes Spines
The despine() is a function that removes the spines from the right and upper portion of the
plot by default. sns.despine(left = True) helps remove the spine from the left.
tips = sns.load_dataset('tips')
sns.set_style('white')
sns.countplot(x ='sex', data = tips)
sns.despine()
tips = sns.load_dataset('tips')
plt.figure(figsize =(12, 3))
sns.countplot(x ='sex', data = tips)
tips = sns.load_dataset('tips')
sns.set_context('poster', font_scale = 2)
sns.countplot(x ='sex', data = tips, palette ='coolwarm')
tips = sns.load_dataset('tips')
sns.set_context('paper', font_scale = 2)
sns.countplot(x ='sex', data = tips, palette = 'coolwarm')
tips = sns.load_dataset('tips')
sns.set_context('notebook', font_scale = 2)
sns.countplot(x ='sex', data = tips, palette ='coolwarm')
seaborn color_palette()
seaborn color_palette() can be used for coloring the plot. Using the palette we can draw the
points with different colors. The palette is responsible for generating the different
colormap values.
Syntax: seaborn.color_palette(palette=None, n_colors=None, desat=None)
Parameters:
palette: Name of palette or None to return current palette.
n_colors: Number of colors in the palette.
desat: Proportion to desaturate each color.
Returns: list of RGB tuples or matplotlib.colors.Colormap
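As a small illustration of the parameters listed above, the sketch below requests five colors from the 'coolwarm' palette (the same name used in the countplot listings) and a desaturated copy. The choice of five colors and of desat=0.5 is arbitrary.

```python
import seaborn as sns

# Five colors sampled from the 'coolwarm' colormap, each an (r, g, b) tuple in [0, 1]
palette = sns.color_palette("coolwarm", n_colors=5)

# The same palette with every color desaturated to half its saturation
muted = sns.color_palette("coolwarm", n_colors=5, desat=0.5)

for rgb in palette:
    print(rgb)
```

The returned list can be passed as the palette argument of plotting functions such as countplot().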