Unit I: Data Handling using Pandas and Data Visualization
Marks: 25
DATA VISUALIZATION-Plotting with Pyplot -I
Purpose of plotting; drawing and saving following types of plots using
Matplotlib – line plot, bar graph, histogram, pie chart, frequency
polygon, box plot and scatter plot.
Customizing plots: color, style (dashed, dotted), width; adding label, title,
and legend in plots.
Define Data Visualization.
• Data Visualization refers to the graphical or visual representation of
information and data using visual elements like charts, graphs and maps
etc.
How to use in Python:
• Python is an extensible language.
• Modules are often organized in packages.
• A package is a collection of modules that has the same purpose
• Data visualization in Python can be done using matplotlib package.
MATPLOTLIB:
Matplotlib is a Python 2D plotting library which produces Publication-quality
figures. Pyplot is a module of matplotlib library containing collection of methods which
allows user to create 2D plots and graphs easily and interactively.
PYPLOT:
Pyplot is an interface to the plotting library in matplotlib. The plot() method from
Pyplot automatically creates the necessary figure and axes to achieve the desired plot.
PLOT:
Plot is a graphical representation technique for representing a dataset, usually as a
graph, showing the relationship between two or more variables.
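The statement that plot() creates the figure and axes on its own can be checked with a short sketch. It assumes the off-screen "Agg" backend (an assumption made only so that the code runs without a display; it is not needed in an interactive session).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")                          # start with no open figures
figures_before = len(plt.get_fignums())   # number of figures before plotting

plt.plot([1, 2, 3], [2, 4, 6])            # no figure or axes was created by hand

figures_after = len(plt.get_fignums())    # number of figures after plotting
fig = plt.gcf()                           # the current Figure (the canvas)
ax = plt.gca()                            # the current Axes inside that Figure
print(figures_before, figures_after, len(fig.axes))
```

The counts printed before and after the call show that a single figure holding a single axes appeared as a side effect of plot().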
What do you mean by Modules?
Modules are simply files with the “.py” extension containing Python
code that can be imported inside another Python program.
What do you mean by packages?
A python package is a collection of modules. Modules that are related to each
other are mainly put in the same package. When a module from an external
package is required in a program, that package can be imported and its
modules can be put to use.
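A small sketch of the difference between a package and a module, using the standard library package json purely as an illustration (it is not part of the original notes). The same pattern applies to matplotlib (package) and pyplot (module).

```python
import json            # "json" is a package shipped with Python
import json.decoder    # "decoder" is one module inside that package

# A package has a __path__ attribute (the folder holding its modules);
# a plain module does not.
is_package = hasattr(json, "__path__")
is_plain_module = not hasattr(json.decoder, "__path__")

# Put the imported module to use.
decoder = json.decoder.JSONDecoder()
marks = decoder.decode("[50, 65, 80]")
print(is_package, is_plain_module, marks)
```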
Examples where data visualization is used to communicate information
• Traffic symbols
• Ultrasound reports
• Atlas book of maps
• Speedometer of a vehicle
• Tuners of instruments
Installing and Importing matplotlib:
• Open command prompt
• Type cd\ to move to the root directory
• Type pip install matplotlib
• In the Python program, write: import matplotlib.pyplot as plt
Working with Pyplot methods:
• The Pyplot interface allows data to be plotted in multiple ways, such as line chart, bar chart, pie
chart, scatter chart etc.
Basics of Simple Plotting:
Commonly used chart types are:
1) Line chart/Plot: It is a type of chart which displays information as a series of data
points called ‘markers’ connected by straight line segments.
# Example program to draw a line plot/chart
import matplotlib.pyplot as plt
x_values=[1,2,3,4]
y_values=[2,4,6,8]
plt.xlabel("Show values")
plt.ylabel("Doubled values")
plt.title("My First Line Chart")
plt.plot(x_values,y_values)
plt.show()
Basic Components of a Plot/Chart
• Figure: It is a canvas which contains plots.
• Axes: A figure can contain many axes (the plotting areas). Each Axes has a title, an x-label and
a y-label.
• Axis: Number-line-like objects for generating graph limits.
• Artist: Everything on the figure, like text objects and line 2D objects.
• Labels: Pieces of information about the data represented on a chart.
• Title: Describes the purpose of the chart.
• Legends: Explain what each line means in the figure.
[Figure: sample chart titled "Employee details" with its parts marked: title, legend, X-axis with x label "Emp code", Y-axis with y label "sales"]
Applying various settings in plot() function:
• Color (line color / marker color)
• Marker type
• Marker size
i) Changing the line color:
<matplotlib.pyplot>.plot(<data1>, [<data2>], <color code>)
Different color codes:
Character   Color
'b'         Blue
'g'         Green
'r'         Red
'c'         Cyan
'm'         Magenta
'y'         Yellow
'k'         Black
'w'         White
Note 1:
Data points are called Markers
Note 2:
Even if you skip the color information, multiple lines drawn in the same plot
get different colors, chosen automatically by Matplotlib.
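Note 2 can be verified with a minimal sketch: three lines are drawn without any color code and the colors that were picked are read back. The "Agg" backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
x = [1, 2, 3, 4]

# No color code is passed to any of the three calls.
line1, = plt.plot(x, [1, 2, 3, 4])
line2, = plt.plot(x, [2, 4, 6, 8])
line3, = plt.plot(x, [3, 6, 9, 12])

# Read back the color that was chosen for each line.
colors = [line1.get_color(), line2.get_color(), line3.get_color()]
print(colors)
```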
Line Style:
Abbreviation   Style
'-'            Solid line
'--'           Dashed line
'-.'           Dash-dot line
':'            Dotted line
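The color codes and line style abbreviations can be combined in one format string, or passed separately as keyword arguments. A short sketch under the same assumption of an off-screen backend; the data values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
x = [1, 2, 3, 4]

# Format string: 'r' (red) followed by '--' (dashed line).
dashed, = plt.plot(x, [2, 4, 6, 8], 'r--')

# The same kind of settings given as keyword arguments, plus a line width.
dotted, = plt.plot(x, [1, 3, 5, 7], color='g', linestyle=':', linewidth=3)

print(dashed.get_color(), dashed.get_linestyle())
print(dotted.get_color(), dotted.get_linestyle(), dotted.get_linewidth())
```

In a normal session, plt.show() would be called at the end to display the two lines.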
Changing Marker Type, Size and Color:
Marker    Description            Marker   Description         Marker     Description
'.'       Point marker           's'      Square marker       '3'        Tri-left marker
','       Pixel marker           'p'      Pentagon marker     '4'        Tri-right marker
'o'       Circle marker          '*'      Star marker         'v'        Triangle-down marker
'+'       Plus marker            'h'      Hexagon1 marker     '^'        Triangle-up marker
'x'       X marker               'H'      Hexagon2 marker     '<'        Triangle-left marker
'D'       Diamond marker         '1'      Tri-down marker     '>'        Triangle-right marker
'd'       Thin diamond marker    '2'      Tri-up marker       '|', '_'   vline, hline markers
# Program to draw two lines along with proper titles and legends
import matplotlib.pyplot as plt
x1 = [1, 2, 3]
y1 = [5, 7, 4]
plt.plot(x1, y1, label="First Line")
x2 = [1, 2, 3]
y2 = [10, 11, 14]
plt.plot(x2, y2, label="Second Line")
plt.xlabel("Plot Number")
plt.ylabel("Important variables")
plt.title("New Graph")
plt.legend(loc="upper left")
plt.show()
Homework:
1. WAP to plot 4 lines with different colors using plot method.
2. WAP to plot frequency of marks using line chart for the
following data:
marks=[50,50,50,65,65,75,75,80,80]
# Program to generate Sine and cosine wave
import matplotlib.pyplot as plt
import numpy as np
x = np.arange (0.0, 10, 0.1)
a = np.cos(x)
b = np.sin(x)
plt.plot ( x, a, "red")
plt.plot ( x, b, "y")
plt.show ()
NOTE
The sine function is symmetric about the origin.
The cosine function is symmetric about the y-axis.
# Eg1 program to change marker type, size and color
import matplotlib. pyplot as plt
x = [1, 2 , 3, 4]
y = [2, 4, 6, 8]
plt.plot ( x, y, marker = 'd' , markersize = 5, markeredgecolor ='red')
plt.show ( )
# Eg2 program to change marker type, size and color
import matplotlib. pyplot as plt
x = [1, 2 , 3, 4]
y = [2, 4, 6, 8]
plt.plot(x, y, 'ro')
plt.show ( )
Question:
Create an array in the range 1 to 20 with values 1.25 apart. Another array
contains the log values of the elements in the first array. Then create a plot of
the first vs the second array; specify the x-axis (containing the first array's values) title
as “Random Values” and the y-axis title as “Logarithm Values”.
import matplotlib.pyplot as plt
import numpy as np
array1=np.array(np. arange(1,21,1.25))
array2=np.log(array1)
plt.plot( array1,array2)
plt.xlabel("Random Values",fontsize=15)
plt.ylabel("Logarithm Values",fontsize=15)
plt.show()
Bar Charts:
A bar Graph/ Chart is a graphical display of data using bars of different
heights.
Pyplot offers bar() function to create bar charts
Definition
Bar graphs are the pictorial representation of data (generally grouped), in the
form of vertical or horizontal rectangular bars, where the length of each bar is
proportional to the measure of the data.
The bars drawn are of uniform width. The variable quantity is represented on
one of the axes, and the measure of the variable is depicted on the other axis.
The heights or the lengths of the bars denote the value of the variable, and these
graphs are also used to compare certain quantities.
# Program to plot values and their doubled values using bar chart
import matplotlib.pyplot as plt
x,y=[1,2,3,4],[2,4,6,8]
plt.bar(x,y)
plt.xlabel("Values")
plt.ylabel("Doubled values")
plt.show()
Homework:
WAP to plot values and their squared values using bar chart. Use x= np.arange(1,5)
and proper labels for the axes.
WAP to create a bar chart with the following data:
x= np.arange(4)
y=[5., 25., 45., 20.]
# Program to plot the data using Bar chart.
import matplotlib.pyplot as plt
cities=["Delhi", "Mumbai", "Bangalore", "Chennai"]
population =[23456123, 20083104, 18456123, 13411093]
plt.bar( cities, population)
plt.xlabel("Cities")
plt.ylabel("Population")
plt.show()
Note:
The first sequence given to bar() forms the x-axis and the second sequence forms the y-axis.
Changing the width of the Bars in a Bar Chart
• By default, a bar chart draws all bars with equal width.
• Default width is 0.8 units.
• The width of the bars can be changed as
i) You can specify the width for all the bars of the bar chart.
ii) You can specify different widths for different bars of a bar chart.
General form:
<matplotlib.pyplot>.bar(<x sequence>, <height sequence>, width=<float value>)
i) To specify common width for all bars:
import matplotlib.pyplot as plt
cities=["Delhi", "Mumbai", "Bangalore", "Chennai"]
population =[23456123, 20083104, 18456123, 13411093]
plt.bar(cities,population,width=1/2)
plt.xlabel("Cities")
plt.ylabel("Population")
plt.show()
ii) To specify different widths for different bars:
<matplotlib.pyplot>.bar(<x sequence>, <height sequence>, width=<width value sequence>)
import matplotlib.pyplot as plt
cities=["Delhi", "Mumbai","Bangalore", "Chennai"]
population =[23456123, 20083104, 18456123, 13411093]
plt.bar( cities, population, width=[0.5,0.6,0.7,0.8] )
plt.xlabel("Cities")
plt.ylabel("Population")
plt.show()
Creating Multiple Bars Chart:
• Decide the number of X points: Suppose we are going to plot sequences A and B. The
length of sequence A or B determines the number of X points, generated using range() or
numpy.arange().
• Decide the thickness of each bar and accordingly adjust the X points on the X-axis: to plot two
bars per X point, think carefully about the thickness of each bar.
• Give different colors to different data ranges.
• The width argument remains the same for all ranges being plotted.
• Plot using bar() for each range separately.
Let us plot ranges A = [2, 4, 6, 8] B = [ 2.8, 3.5, 6.5, 7.7]
import matplotlib.pyplot as plt
import numpy as np
A = [2,4,6,8]
B = [2.8, 3.5, 6.5, 7.7]
x =np.arange(len(A))
plt.bar(x, A, color='red', width= 0.35)
plt.bar(x+0.35, B, color='blue', width= 0.35)
plt.show()
Here the thickness of each bar is 0.35. For the first range the x points are
x, and for the second range the x points are shifted by the first bar's
thickness, i.e. x+0.35. If there is a third range, its x points will be
x+0.70.
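The shifting rule above extends to a third range. A minimal sketch, assuming a hypothetical third sequence C and a narrower bar width of 0.25 (chosen here so that three bars fit within one unit on the x-axis); the tick labels P1 to P4 are also made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

plt.close("all")
A = [2, 4, 6, 8]
B = [2.8, 3.5, 6.5, 7.7]
C = [3, 5, 5.5, 9]          # hypothetical third range, not from the notes

x = np.arange(len(A))       # one X point per group: 0, 1, 2, 3
width = 0.25                # same width for every range

plt.bar(x, A, color='red', width=width, label='A')
plt.bar(x + width, B, color='blue', width=width, label='B')       # shifted by one bar
plt.bar(x + 2 * width, C, color='green', width=width, label='C')  # shifted by two bars

plt.xticks(x + width, ['P1', 'P2', 'P3', 'P4'])  # put each tick under the middle bar
plt.legend()

third_positions = (x + 2 * width).tolist()
print(third_positions)
```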
Creating a Horizontal Bar Chart:
• Use the barh() function in place of bar().
import matplotlib.pyplot as plt
cities=["Delhi", "Mumbai","Bangalore", "Chennai"]
population =[23456123, 20083104, 18456123, 13411093]
plt.barh(cities,population)
plt.ylabel("Cities")
plt.xlabel("Population")
plt.show()
Setting Xlimits and Ylimits:
• Pyplot automatically finds best fitting range for x-axis and y-axis.
• We can set our own limits using xlim() and ylim() functions
Syntax:
<matplotlib.pyplot>.xlim(<xmin>, <xmax>)
<matplotlib.pyplot>.ylim(<ymin>, <ymax>)
Note:
1) Make sure that the data falls within the limits of the x-axis and y-axis.
2) If you give the limits in the order (max, min), the plot gets flipped.
# Program to set x-limits and y-limits on a bar chart of the following data:
# x = np.arange(4)
# y = [5., 25., 45., 20.]
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(4)   # [0, 1, 2, 3]
y = [5., 25., 45., 20.]
plt.xlim(-1.0, 4.0)
plt.ylim(0.0, 50.0)
plt.bar(x, y)
plt.title("A simple bar chart")
plt.show()
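The note about giving limits in (max, min) order can be tried out with the same data. A minimal sketch, again assuming an off-screen backend; in a normal session the flipped chart would be seen with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
plt.bar([0, 1, 2, 3], [5., 25., 45., 20.])

plt.xlim(4.0, -1.0)                         # limits given as (max, min)

flipped = bool(plt.gca().xaxis_inverted())  # True when the x-axis runs right to left
limits = plt.xlim()                         # calling xlim() with no arguments returns the limits
print(flipped, limits)
```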
Setting Ticks for Axes:
• By default, Pyplot automatically decides which data points will have ticks on the axes.
• To set your own tick marks:
For x-axis:
<matplotlib.pyplot>.xticks(<sequence containing tick data points>, [<optional sequence containing tick labels>])
For y-axis:
<matplotlib.pyplot>.yticks(<sequence containing tick data points>, [<optional sequence containing tick labels>])
#program to add xticks and yticks
import matplotlib.pyplot as plt
q = range(4)
s = [10,15,20,25]
plt.xticks(q)
plt.yticks(s)
plt.bar(q,s,width=0.25)
plt.show()
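The optional second argument of xticks() supplies text labels for the tick positions. A short sketch with made-up labels Q1 to Q4 (an assumption for illustration), using an off-screen backend.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
q = range(4)
s = [10, 15, 20, 25]
plt.bar(q, s, width=0.25)

plt.xticks(q, ['Q1', 'Q2', 'Q3', 'Q4'])   # tick positions plus hypothetical labels
plt.yticks([0, 10, 20, 30])               # tick positions only

x_positions = plt.gca().get_xticks().tolist()
y_positions = plt.gca().get_yticks().tolist()
print(x_positions, y_positions)
```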
Adding Legends:
• A legend is an area describing the elements of the graph: a color or mark linked to each
data range plotted.
• In the plotting function, i.e., plot() or bar(), give each data range a label using the argument
label.
• Add the legend to the plot using legend(), the matplotlib function that places a legend on the axes:
<matplotlib.pyplot>.legend(loc=<position number or string>)
• The argument loc in legend() specifies the location of the legend. The default value is
loc="best", which lets Matplotlib choose the location that overlaps the plotted data the least.
* The loc argument can be given as a location code (0 to 10) or as the matching string, e.g.
1, 2, 3, 4 or "upper right", "upper left", "lower left", "lower right".
Location String Location Code
'best' 0
'upper right' 1
'upper left' 2
'lower left' 3
'lower right' 4
'right' 5
'center left' 6
'center right' 7
'lower center' 8
'upper center' 9
'center' 10
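A minimal sketch of labels and legend placement using one of the location strings from the table. The data and the label texts are made up for illustration, and an off-screen backend is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
x = [1, 2, 3, 4]
plt.plot(x, [2, 4, 6, 8], label="Doubled")    # label is what the legend will show
plt.plot(x, [1, 4, 9, 16], label="Squared")

legend = plt.legend(loc="lower right")        # per the table, the same as loc=4

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```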
Saving a figure:
<matplotlib.pyplot>.savefig(<string with file name and path>)
Example:
plt.savefig("multibar.png") - saves the plot in the current directory
plt.savefig("c:\\programs\\multibar.png") - saves the plot at the given path
Note: Matplotlib accepts file formats such as .png, .pdf and .eps.
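A small sketch of saving the same chart in two formats. It writes into a temporary folder created with tempfile (an assumption made so the example does not depend on any particular path on the reader's machine).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
plt.bar(["Delhi", "Mumbai"], [23456123, 20083104])
plt.title("Population")

out_dir = tempfile.mkdtemp()                       # temporary folder for the output files
png_path = os.path.join(out_dir, "multibar.png")
pdf_path = os.path.join(out_dir, "multibar.pdf")

plt.savefig(png_path)    # the file extension decides the format
plt.savefig(pdf_path)

saved = sorted(os.listdir(out_dir))
print(saved)
```

savefig() is called before plt.show() in a normal session, since the figure may be cleared once the window is closed.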
I) Matplotlib–Histogram
• A histogram is a graphical representation which organizes a group of
datapoints into user-specified ranges.
• Histogram provides a visual interpretation of numerical data by showing the
number of datapoints that fall within a specified range of values(“bins”).
• It is like a vertical bar graph but without gaps between the bars
# Example program to plot a histogram
import matplotlib.pyplot as plt
plt.hist([10,11,18,25,10,15,20,21],bins=3,edgecolor='red')
plt.show()
# Example program for Histogram
import matplotlib.pyplot as plt
plt.hist([5,15,25,35],bins=[0,10,20,30,40],weights=[40,60,80,100],edgecolor="red")
plt.xlabel('Total Number of Students --->', fontsize=15)
plt.ylabel('Percentage of Marks -->', fontsize=15)
plt.title('Marks achieved by students')
plt.show()
Histogram: a graphical display of data using bars of different heights. It is like a
bar chart, but a histogram groups numbers into ranges. The height of each bar shows
how many data values fall into each range.
The Histogram plot divides a variable's data range into a number of bins and then
counts the weighted values that fall within each bin. The bins and the counted data
are then used to create a graph that represents the distribution of data within the
variable's data range.
Note:
A histogram is a summarization tool for discrete or continuous data.
# Program to generate a histogram for the number of students having the same weights, assuming
# there is no student having weight in the range of 40 to 50 kgs.
import matplotlib.pyplot as plt
plt.hist([5,15,25,35,15, 55],bins=[0,10,20,30,40,50,60],weights=[20,10,45,33,6,8],
edgecolor="red")
plt.show()
Note: There is no bar for the interval (bin) 40 to 50 because the first argument (list)
of hist() contains no value between 40 and 50. In the interval 10 to 20 the bar
height is 16 (the weights 10 and 6 are added) because 15 appears twice in the
first argument.
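The bar heights described above can be read directly from the values that hist() returns. A minimal sketch on the same data, assuming an off-screen backend:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")
# hist() returns the bar heights, the bin edges and the drawn patches.
counts, edges, patches = plt.hist(
    [5, 15, 25, 35, 15, 55],
    bins=[0, 10, 20, 30, 40, 50, 60],
    weights=[20, 10, 45, 33, 6, 8],
    edgecolor="red",
)
print(counts.tolist())   # one height per bin, in bin order
print(edges.tolist())    # the seven bin edges
```

The second height is 16 (10 + 6) and the fifth is 0, matching the empty 40 to 50 bin.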
General Syntax of hist() Function:
matplotlib.pyplot.hist(x, bins=None, cumulative=False, histtype='bar',
align='mid', orientation='vertical')
Where
x : array or sequence to be plotted on the histogram.
bins : an integer or a sequence of bin edges. For an integer, bins+1 bin edges are
calculated and returned. If omitted, a default value is taken automatically.
cumulative : if True, each bin gives the count in that bin plus all bins for smaller
values. The last bin gives the total number of data points. Default is False.
histtype : 'bar', 'barstacked', 'step' or 'stepfilled'. Default is 'bar'.
orientation : 'horizontal' or 'vertical'. Default is 'vertical'.
align : 'left', 'mid' or 'right'. Default is 'mid'.
# program to apply cumulative distribution
import matplotlib.pyplot as plt
data=[5,15,25,35,15,55]
plt.hist(data, bins=[0,10,20,30,40,50,60], weights=[20,10,45,33,6,8],
cumulative=True, facecolor='green', orientation='vertical', edgecolor='red')
plt.show()
#Program to apply histtype argument
import matplotlib.pyplot as plt
x=[10,15,20,25]
y=[15,20,25,30]
plt.hist([x,y],edgecolor='red',histtype='barstacked')
plt.show()
# Program to plot Frequency Polygon
import matplotlib.pyplot as plt
data=[5,15,25,35,15,55]
plt.hist(data,bins=[0,10,20,30,40,50,60],weights=[20,10,45,33,6,8],
edgecolor='red',histtype='step')
plt.show()
[Figures: output of the above program with histtype='stepfilled' and with histtype='step']
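A frequency polygon is commonly drawn by joining the midpoints of the tops of the histogram bars with straight lines. The sketch below is one possible way to do that, not the method used in the program above: it assumes numpy.histogram() for the counts and a plain plot() call for the polygon, on the same data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

plt.close("all")
data = [5, 15, 25, 35, 15, 55]
bins = [0, 10, 20, 30, 40, 50, 60]
weights = [20, 10, 45, 33, 6, 8]

# Weighted count per bin and the bin edges, without drawing anything.
counts, edges = np.histogram(data, bins=bins, weights=weights)

# Midpoint of each bin: average of its left and right edge.
midpoints = (edges[:-1] + edges[1:]) / 2

plt.hist(data, bins=bins, weights=weights, edgecolor='red', alpha=0.3)  # faint histogram
plt.plot(midpoints, counts, marker='o', color='blue')                   # the polygon
plt.xlabel("Class interval")
plt.ylabel("Frequency")

print(midpoints.tolist(), counts.tolist())
```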
#Program to apply align argument
import matplotlib.pyplot as plt
plt.hist([10,11,18,25,10,15,20,21],bins=3,edgecolor='red', align = 'left')
plt.show()
[Figures: output of the above program with align='mid' and with align='right']
Plotting data in a dataframe can be done in 2 ways:
*Using pyplot’s graph functions
*Using Dataframe’s plot() function.
1) Plotting a dataframe’s data using Pyplot’s Graph functions:
using plot(),bar(), barh(),scatter(),boxplot(),hist(). It will treat the
plotted data as a series and plot it.
Consider a Dataframe df2
Age Projects
0 30 13
1 27 17
2 32 16
3 40 20
4 28 21
5 32 14
*Using Pyplot method:
Method 1:
import pandas as pd
import matplotlib.pyplot as plt
Age=[30,27,32,40,28,32]
Projects=[13,17,16,20,21,14]
df2=pd.DataFrame({"Age": Age ,"Projects": Projects})
plt.plot(df2.Age)
plt.show()
Method 2:
import pandas as pd
import matplotlib. pyplot as plt
Age=[30,27,32,40,28,32]
Projects=[13,17,16,20,21,14]
df2=pd. DataFrame({"Age": Age, "Projects": Projects})
plt. bar(df2.index,df2.Projects)
plt. show()
Method 3:
import pandas as pd
import matplotlib. pyplot as plt
Age=[30,27,32,40,28,32]
Projects=[13,17,16,20,21,14]
df2=pd.DataFrame({"Age":Age,"Projects":Projects})
plt.plot(df2)
plt.show()
*Using DataFrame's plot() method:
<DF>.plot()
The DataFrame's plot() is a versatile function which can plot all types of chart
using the kind argument.
kind : Type of the plot, can take values as
'line' : line plot (default)
'bar' : vertical bar plot
'barh' : horizontal bar plot
'hist' : histogram
'box' : boxplot
'pie' : pie plot
'scatter' : scatter plot
Note :
• Without kind argument , line chart will be plotted by default.
• It automatically adds legends for the plotted data.
import pandas as pd
import matplotlib. pyplot as plt
Age=[30,27,32,40,28,32]
Projects=[13,17,16,20,21,14]
df2=pd. DataFrame ({"Age": Age, "Projects": Projects})
df2.plot()
plt. show()
The kind argument is missing so line chart is plotted
import pandas as pd
import matplotlib. pyplot as plt
Age=[30,27,32,40,28,32]
Projects=[13,17,16,20,21,14]
df2=pd.DataFrame({"Age": Age, "Projects": Projects})
df2.plot(kind='hist')
plt. show()
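Two further values of the kind argument, 'scatter' and 'box', can be tried on the same DataFrame. A minimal sketch, assuming pandas is installed and an off-screen backend; the chart titles are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

plt.close("all")
Age = [30, 27, 32, 40, 28, 32]
Projects = [13, 17, 16, 20, 21, 14]
df2 = pd.DataFrame({"Age": Age, "Projects": Projects})

# A scatter plot needs the x and y columns to be named.
ax_scatter = df2.plot(kind='scatter', x='Age', y='Projects', title='Age vs Projects')

# A box plot draws one box per numeric column.
ax_box = df2.plot(kind='box', title='Spread of Age and Projects')

print(ax_scatter.get_title(), "|", ax_box.get_title())
```

Each call returns the Axes it drew on, so labels and titles can still be adjusted afterwards. plt.show() would display both figures in a normal session.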
Practice Session
• Example 1
Plotting a line chart of date versus temperature by adding Label on X and Y axis, and adding
a Title and Grids to the chart.
import matplotlib.pyplot as plt
date=["25/12","26/12","27/12"]
temp=[8.5,10.5,6.8]
plt.plot(date, temp)
plt.xlabel("Date") #add the Label on x-axis
plt.ylabel("Temperature") #add the Label on y-axis
plt.title("Date wise Temperature") #add the title to the chart
plt.grid(True) #add gridlines to the background
plt.yticks(temp)
plt.show()
Example 2
Consider the average heights and weights of persons aged 8 to 16 stored in the following two lists:
height = [121.9,124.5,129.5,134.6,139.7,147.3, 152.4, 157.5,162.6]
weight= [19.7,21.3,23.5,25.9,28.5,32.1,35.7,39.6, 43.2]
Let us plot a line chart where:
i. x axis will represent weight
ii. y axis will represent height
iii. x axis label should be “Weight in kg”
iv. y axis label should be “Height in cm”
v. color of the line should be green
vi. use * as marker
vii. Marker size should be 10
viii. The title of the chart should be “Average weight with respect to average height”.
ix. Line style should be dash dot
x. Linewidth should be 2.
import matplotlib.pyplot as plt
import pandas as pd
height=[121.9,124.5,129.5,134.6,139.7,147.3,152.4,157.5,162.6]
weight=[19.7,21.3,23.5,25.9,28.5,32.1,35.7,39.6,43.2]
df=pd.DataFrame({"height":height,"weight":weight})
plt.xlabel("Weight in kg") #Set xlabel for the plot
plt.ylabel("Height in cm") #Set ylabel for the plot
plt.title('Average weight with respect to average height') #Set chart title
#plot using marker'-*' and line colour as green
plt.plot(df.weight,df.height,marker='*',markersize=10,color='green',linewidth=2,
linestyle='-.')
plt.show()
Example 3
Smile NGO has participated in a three week cultural mela. Using
Pandas, they have stored the sales (in Rs) made day wise for every
week in a CSV file named “MelaSales.csv”, as shown in Table 4.6
Depict the sales for the three weeks using a Line chart. It should have the
following:
i. Chart title as “Mela Sales Report”.
ii. x-axis label as “Days”.
iii. y-axis label as “Sales in Rs”.
iv. Line colours are red for week 1, blue for week 2 and brown for week 3.
import pandas as pd
import matplotlib.pyplot as plt
# reads "MelaSales.csv" to df by giving path to the file
df=pd.read_csv("MelaSales.csv")
#create a line plot of different color for each week
df.plot(kind='line', color=['red','blue','brown'])
# Set title to "Mela Sales Report"
plt.title('Mela Sales Report')
# Label x axis as "Days"
plt.xlabel('Days')
# Label y axis as "Sales in Rs"
plt.ylabel('Sales in Rs')
#Display the figure
plt.show()
Example 4
Assuming the same CSV file, i.e., MelaSales.csv, plot the line
chart with the following customisations:
Marker ="*"
Marker size=10
linestyle="--"
Linewidth =3
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("MelaSales.csv")
#creates plot of different color for each week
df.plot(kind='line', color=['red','blue','brown'],marker="*",markersize=10,
linewidth=3,linestyle="--")
plt.title('Mela Sales Report')
plt.xlabel('Days')
plt.ylabel('Sales in Rs')
x=[0,1,2,3,4,5,6]
dayname=['Mon','Tue','Wed','Thur','Fri','Sat','Sun']
plt.xticks(x,labels=dayname)
plt.show()
Example 5
import pandas as pd
df= pd.read_csv('MelaSales.csv')
import matplotlib.pyplot as plt
# plots a bar chart with the column "Day" as x axis and sets the title
df.plot(kind='bar',x='Day',title='Mela Sales Report')
# set ylabel
plt.ylabel('Sales in Rs')
plt.show()
Example 6
Let us write a Python script to display Bar plot for the “MelaSales.csv”
file with column Day on x axis, and having the following customization:
● Changing the color of each bar to red, yellow and purple.
● Edgecolor to green
● Linewidth as 2
● Line style as "--"
import pandas as pd
import matplotlib.pyplot as plt
df= pd.read_csv('MelaSales.csv')
# plots a bar chart with the column "Day" as x axis
df.plot(kind='bar',x='Day',title='Mela Sales Report',color=['red', 'yellow','purple'],
edgecolor='Green', linewidth=2,linestyle='--')
# set xlabel and ylabel
plt.xlabel('Day')
plt.ylabel('Sales in Rs')
plt.show()
Example 7
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Arnav', 'Sheela', 'Azhar', 'Bincy', 'Yash', 'Nazar'],
'Height' : [60,61,63,65,61,60],
'Weight' : [47,89,52,58,50,47]}
df=pd.DataFrame(data)
df.plot(kind='hist')
plt.show()
Example 8
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Arnav', 'Sheela', 'Azhar','Bincy','Yash', 'Nazar'],
'Height' : [60,61,63,65,61,60],
'Weight' : [47,89,52,58,50,47]}
df=pd.DataFrame(data)
df.plot(kind='hist',edgecolor='Green',linewidth=2,linestyle=':',
fill=False,hatch='o')
plt.show()
Example 9
import pandas as pd
import matplotlib.pyplot as plt
#read the CSV file with specified columns
#usecols parameter to extract only two required columns
data=pd.read_csv("Min_Max_Seasonal_IMD_2017.csv", usecols=['ANNUAL - MIN',
'ANNUAL - MAX'])
df=pd.DataFrame(data)
#plot histogram for 'ANNUAL - MIN'
df.plot(kind='hist',y='ANNUAL - MIN',title='Annual Minimum Temperature (1901-2017)')
plt.xlabel('Temperature')
plt.ylabel('Number of times')
#plot histogram for both 'ANNUAL - MIN' and 'ANNUAL - MAX'
df.plot(kind='hist', title='Annual Min and Max Temperature (1901-2017)',color=['blue','red'])
plt.xlabel('Temperature')
plt.ylabel('Number of times')
plt.show()