MCA - S3 - Data Visualisation - U5

Uploaded by

Ramu Atmuri
Copyright
© © All Rights Reserved

Data Visualisation

Unit-05
Visualisation Using Pandas

Semester-03
Master of Computer Application

Names of Sub-Units

Setting Up the Environment, Line Plot, Bar Plot, Stacked Plot, Histogram, Box Plot, Area Plot, Scatter
Plot, Hex Plot, Pie Plot, Scatter Matrix, Subplots

Overview

The unit begins by setting up the pandas environment for visualising data. Next, the unit discusses
how to create a line plot, bar plot and stacked plot in Python using pandas. Further, the unit discusses
the functions for creating a histogram, box plot and area plot using pandas. The unit also discusses how
to create a scatter plot, hex plot and pie plot using pandas. Towards the end, the unit explains how to
build a scatter matrix and subplots in Python using pandas.

Learning Objectives

In this unit, you will learn to:

• Explain the process for setting up the pandas environment for visualising data
• Describe how to create a line plot, bar plot and stacked plot using pandas
• Define the functions for creating a histogram, box plot and area plot using pandas
• Explain how to create a scatter plot, hex plot and pie plot using pandas
• Explain how to build a scatter matrix and subplots in Python using pandas

Learning Outcomes

At the end of this unit, you will be able to:

• Evaluate the process for setting up the pandas environment for visualising data
• Assess your knowledge of creating a line plot, bar plot and stacked plot using pandas
• Analyse the functions for creating a histogram, box plot and area plot using pandas
• Understand the functions for creating a scatter plot, hex plot and pie plot using pandas
• Examine how to build a scatter matrix and subplots in Python using pandas

Pre-Unit Preparatory Material

• http://blaqueyard.com/download/Python%20Data%20Visualization%20Cookbook.pdf

5.1 INTRODUCTION
Data visualisation is perhaps the most critical phase in the whole data science, big data, or machine
learning life cycle. When we use colours and images to present a study or analysis, it becomes more
engaging and easier to understand. By using visualisation components such as graphs, charts, and maps,
clients can better comprehend the underlying structure, trends, patterns, and correlations among
parameters within the dataset. Good data visualisations give us a clear and precise picture of what the
data is trying to tell us, so that we can grasp the insights it contains.

Pandas is an open-source Python package for data manipulation and analysis. It is a fast and
powerful tool that lets you manipulate tabular data and series data using its data structures and operations.
Combining, reshaping, selecting, data cleansing, and data wrangling are forms of data manipulation
procedures. Using this library, data may be imported from a variety of sources, including SQL databases,
MS Excel files, and comma-separated values (CSV) files.

5.2 SETTING UP THE ENVIRONMENT


The standard Python distribution does not come bundled with the pandas module. A lightweight
alternative is to install pandas using the popular Python package installer, pip:
pip install pandas
If you install the Anaconda Python package, pandas is installed by default. The different ways to
install pandas are as follows:
• For Windows users:
  • Anaconda is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.
  • Canopy is available as a free as well as a commercial distribution with the full SciPy stack
    for Windows, Linux and Mac.
  • Python(x,y) is a free Python distribution with the SciPy stack and the Spyder IDE for Windows OS.
  After this, matplotlib is installed for creating charts.
• For Ubuntu users: The command to install pandas on Ubuntu is as follows:
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
• For Fedora users: The command to install pandas on Fedora is as follows:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
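
Once installation finishes, a quick check such as the following sketch can confirm that the libraries import correctly. It assumes pandas and matplotlib were installed into the Python environment that is currently active, and it only prints their version strings.

```python
import pandas as pd
import matplotlib

# If either import fails, the library is not installed in this environment.
print("pandas version:", pd.__version__)
print("matplotlib version:", matplotlib.__version__)
```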

5.3 LINE PLOT

Line charts are used to plot continuous data in the form of lines. Each point on a line chart
corresponds to a value. A line chart can use any number of data series (that is, continuous related data
in a column), and you can distinguish the lines by using different colours or line styles. For instance,
plotting the budget and expenses of an organisation as a line chart may enable you to identify cost
fluctuations. To represent data, a line chart uses a horizontal axis (x-axis) and a vertical axis (y-axis).
Line plots may be created straight from pandas dataframes using the DataFrame.plot() function, which
draws a line plot by default, or the equivalent DataFrame.plot.line() function. The syntax for the line
plot is as follows:
DataFrame.plot.line(x=None, y=None, **kwargs)
where,
• x: Represents the x-axis data

• y: Represents the y-axis data

• color: Specifies the colour for each column in the dataframe (passed as a keyword argument)

• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().

The python program to create line plot is as follows:


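import pandas as pd
import matplotlib.pyplot as plt

# Quarterly sales figures (in thousands); each column is drawn as one line
df = pd.DataFrame({
    'Q1 Sales': [125, 156, 175, 121, 172],
    'Q2 Sales': [152, 169, 131, 189, 135],
    'Q3 Sales': [153, 187, 129, 142, 176],
    'Q4 Sales': [143, 176, 153, 198, 176]
})
df.plot(title="Quarterly Sales of an organisation (in Thousands)")
plt.show()  # display the figure when run as a script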

The output of the given program is as follows:

[Figure: line plot of the four quarterly sales columns; the x-axis shows the row index (0 to 4)]
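
The x, y and color parameters described above can be tried out with a small sketch like the one below. The "Year" column is a hypothetical addition made only for illustration; the sales values are taken from the program above.

```python
import pandas as pd

# "Year" is an assumed column, added so that x can name a real column.
sales = pd.DataFrame({
    "Year": [2017, 2018, 2019, 2020, 2021],
    "Q1 Sales": [125, 156, 175, 121, 172],
    "Q2 Sales": [152, 169, 131, 189, 135],
})

# x selects the column for the horizontal axis, y the columns to draw,
# and color assigns one colour per plotted column.
ax = sales.plot.line(
    x="Year",
    y=["Q1 Sales", "Q2 Sales"],
    color=["tab:blue", "tab:orange"],
    title="Q1 and Q2 Sales by Year (in Thousands)",
)
ax.set_ylabel("Sales")
# In a script, call matplotlib.pyplot.show() afterwards to display the figure.
```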

5.4 BAR PLOT


A bar chart is a visual presentation of categorical data. The data is represented using a number of
bars, each representing a different category. Each bar's height corresponds to a
specific aggregate (for instance, the sum of the values in the category it represents). Bar plots are created
straight from pandas dataframes using the DataFrame.plot.bar() function. The syntax for the bar plot
is as follows:
DataFrame.plot.bar(x=None, y=None, **kwargs)
where,
• x: Represents the x-axis data
• y: Represents the y-axis data
• color: Specifies the colour for each column in the dataframe (passed as a keyword argument)
• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().

The python program to create bar plot is as follows:


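import pandas as pd
import matplotlib.pyplot as plt

# Quarterly sales figures (in thousands); each row gets one group of four bars
df = pd.DataFrame({
    'Q1 Sales': [125, 156, 175, 121, 172],
    'Q2 Sales': [152, 169, 131, 189, 135],
    'Q3 Sales': [153, 187, 129, 142, 176],
    'Q4 Sales': [143, 176, 153, 198, 176]
})
df.plot.bar(title="Quarterly Sales of an organisation (in Thousands)")
plt.show()  # display the figure when run as a script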

The output of the given program is as follows:

[Figure: grouped bar chart of the four quarterly sales columns]
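
To illustrate the idea that each bar's height can represent an aggregate of its category, the following sketch sums values per category before plotting. The "Region" and "Sales" columns and their values are made up for this example.

```python
import pandas as pd

# Hypothetical order records: several rows may belong to the same region.
orders = pd.DataFrame({
    "Region": ["North", "South", "North", "East", "South", "East"],
    "Sales": [120, 90, 80, 150, 60, 50],
})

# Aggregate first: one total per category (groups are sorted alphabetically).
totals = orders.groupby("Region")["Sales"].sum()

# Each bar's height is now the sum of the sales in that region.
ax = totals.plot.bar(title="Total Sales by Region", color="tab:green")
ax.set_ylabel("Sales")
```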

5.5 STACKED PLOT


A stacked bar graph is another name for a stacked bar chart. It is a graph that compares different
sections of a whole. Every bar in a stacked bar chart represents the whole, while the segments or sections
of the bar indicate subcategories within that whole. These subcategories are represented by different
colours. The Python program to create a stacked bar plot is as follows:
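import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Q1 Sales': [125, 156, 175, 121, 172],
    'Q2 Sales': [152, 169, 131, 189, 135],
    'Q3 Sales': [153, 187, 129, 142, 176],
    'Q4 Sales': [143, 176, 153, 198, 176]
})
# stacked takes a boolean; True places the quarterly segments on top of each other
df.plot.bar(title="Quarterly Sales of an organisation (in Thousands)",
            stacked=True)
plt.show()  # display the figure when run as a script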

The output of the given program is as follows:

[Figure: stacked bar chart of the four quarterly sales columns; the x-axis shows the row index (0 to 4)]

5.6 HISTOGRAM
A histogram is used to show the frequency distribution of data. Each column in
the histogram is known as a bin. Continuous data can be represented using a
histogram, which makes it easy to analyse how the data falls within various ranges. The function syntax
for creating a histogram is as follows:
DataFrame.plot.hist(by=None, bins=10, **kwargs)
where,
• by [str or sequence, optional]: Refers to the column in the DataFrame by which the data is grouped.
• bins [int, default 10]: Refers to the number of histogram bins to be used.
• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().

The python program to create histogram is as follows:


import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv("BP_Record.csv")
dataframe.hist()  # draws one histogram per numeric column
plt.show()
The output of the given program is as follows:

[Figure: one histogram panel per numeric column of BP_Record.csv]
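
Because BP_Record.csv is not reproduced in this unit, the effect of the bins parameter can be explored with a self-contained sketch such as the following. The pulse readings are randomly generated stand-ins, not values from that file.

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 pulse-like readings centred on 100 (an assumption).
rng = np.random.default_rng(seed=0)
readings = pd.DataFrame({"Pulse": rng.normal(loc=100, scale=10, size=200)})

# bins controls how many equal-width ranges the values are counted into.
ax = readings.plot.hist(bins=8, title="Distribution of Pulse Readings")
ax.set_xlabel("Pulse")

# Each bar is one bin; the bar heights add up to the number of readings.
bin_counts = [patch.get_height() for patch in ax.patches]
print(bin_counts)
```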

5.7 BOX PLOT


A boxplot, often known as a box and whisker plot, is a visual representation of a data set's spread and
centre. This plot is well suited to comparing related statistical data sets, because its summary is
computed directly from the raw data. The data is divided into quartiles, and the plot highlights the
median and any outliers.

The function syntax for creating box plot is as follows:


DataFrame.boxplot(column=None, by=None, ax=None, fontsize=None, rot=0,
grid=True, figsize=None, layout=None, return_type=None, backend=None,
**kwargs)
where,
• column [str or list of str, optional]: Refers to a column name, a list of names, or a vector of the dataset
• by [str or array-like, optional]: Refers to a column in the dataframe that is passed to the
DataFrame.groupby() function
• ax [object of class matplotlib.axes.Axes, optional]: Specifies the matplotlib axes to be used by boxplot
• fontsize [float or str]: Specifies the font size
• rot [int or float, default 0]: Refers to the rotation angle of labels with respect to the screen coordinate
system
• grid [bool, default True]: Displays the grid if you set it to True
• figsize [tuple (width, height) in inches]: Specifies the size of the figure to build by using matplotlib
• layout [tuple (rows, columns), optional]: Specifies the layout of the subplots
• return_type [{'axes', 'dict', 'both'} or None, default 'axes']: Refers to the type of object to return
• backend [str, default None]: Specifies the backend to use in place of the one set in the plotting.backend option
• **kwargs: Refers to the additional plotting keyword arguments that are passed to
matplotlib.pyplot.boxplot().

The python program to create box plot is as follows:


import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv("BP_Record.csv")
# One box of Calories values for each distinct Pulse value
dataframe.boxplot(by='Pulse', column=['Calories'], grid=True)
plt.show()
The output of the given program is as follows:

[Figure: box plot of Calories grouped by Pulse]
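
The column and by parameters can also be tried without the CSV file. The sketch below uses a small made-up dataset (the "Intensity" grouping column is hypothetical) and prints the group medians that the boxes mark.

```python
import pandas as pd

# Hypothetical workout records; values chosen by hand for illustration.
workouts = pd.DataFrame({
    "Intensity": ["Low", "Low", "Low", "Low", "High", "High", "High", "High"],
    "Calories": [210, 230, 250, 400, 320, 340, 360, 380],
})

# column selects what is summarised; by draws one box per group.
workouts.boxplot(column="Calories", by="Intensity", grid=True)

# The line inside each box is the group median.
medians = workouts.groupby("Intensity")["Calories"].median()
print(medians)

# In the "Low" group, 400 lies more than 1.5 times the interquartile range
# above the third quartile, so it is drawn as a separate outlier point.
```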

5.8 AREA PLOT


In an area chart, areas are used to represent values. It is similar to a line chart in that it displays a series
as a set of points connected by a line. However, the difference is that in an area chart, the area below
the line is filled with the colour of the line. Area charts help to draw attention to the total value across
a given data series.

The function syntax to create an area plot is as follows:


DataFrame.plot.area(x=None, y=None, **kwargs)
where,
• x: Represents the x-axis data
• y: Represents the y-axis data
• stacked: Shows the area plot in stacked form. It is set to True by default. Set it to False to create
an unstacked plot.
• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().

The python program to create area plot is as follows:

import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv("BP_Record.csv")
dataframe.plot.area()  # stacked by default
plt.show()
The output of the given program is as follows:

[Figure: stacked area plot of the columns of BP_Record.csv]
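
The difference between the default stacked form and stacked=False is easier to see on a small dataset. The following sketch uses invented visitor counts; the column names and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical visitor counts per quarter (in thousands).
traffic = pd.DataFrame(
    {"Desktop": [30, 45, 50, 40], "Mobile": [20, 35, 60, 80]},
    index=["Q1", "Q2", "Q3", "Q4"],
)

# Default: areas are stacked, so the top edge shows the combined total.
ax_stacked = traffic.plot.area(title="Visitors (stacked)")

# stacked=False overlays the areas, each measured from zero.
ax_unstacked = traffic.plot.area(stacked=False, title="Visitors (unstacked)")

row_totals = traffic.sum(axis=1)
print(row_totals)
```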

5.9 SCATTER PLOT


A scatter plot is a diagram drawn between a pair of variables, X and Y, on a two-dimensional
plane. It is used as an initial screening tool when analysing two
variables for any relationship that may exist between them.

The function syntax for creating a scatter plot is as follows:


DataFrame.plot.scatter(x, y, s=None, c=None, **kwargs)

where,
• x: Refers to a column name to be used as the horizontal coordinate of each point
• y: Refers to a column name to be used as the vertical coordinate of each point
• s: Specifies the size of the dots
• c: Specifies the colour of the dots
• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().

The python program to create scatter plot is as follows:


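import pandas as pd
import matplotlib.pyplot as plt

dataset = {'Student Name': ['Yash', 'Madhu', 'Gunjan', 'Vihan', 'Dipesh', 'Stuti'],
           'Class': [10, 12, 9, 6, 11, 8]}
df = pd.DataFrame(data=dataset)
# One dot per student: name on the x-axis, class on the y-axis
df.plot.scatter(x='Student Name', y='Class')
plt.show()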
The output of the given program is as follows:

[Figure: scatter plot of Class (y-axis) against Student Name (x-axis)]
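
A scatter plot is most useful as a screening tool when both variables are numeric. The sketch below, built on invented "Hours Studied" and "Score" values, shows the s and c parameters and checks the direction of the relationship with a correlation coefficient.

```python
import pandas as pd

# Hypothetical data: study hours, test scores and class of six students.
students = pd.DataFrame({
    "Hours Studied": [1, 2, 3, 4, 5, 6],
    "Score": [35, 48, 52, 61, 70, 83],
    "Class": [6, 8, 9, 10, 11, 12],
})

# s sets the dot size; c names a column whose values drive the dot colour.
ax = students.plot.scatter(
    x="Hours Studied", y="Score", s=60, c="Class", colormap="viridis"
)

# A value close to +1 indicates a strong positive linear relationship.
corr = students["Hours Studied"].corr(students["Score"])
print(round(corr, 3))
```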

5.10 HEX PLOT


Pandas offers a wide range of utilities, from parsing multiple file formats to converting an entire data
table into a NumPy array. This property makes pandas a trusted ally in data
science and machine learning. Pandas also helps with creating many kinds of data
analysis graphs. One example is the hexagonal bin plot (hexbin plot). A hexbin plot is
particularly helpful when a scatter plot is too dense to interpret. It bins
the area of the chart into hexagons and assigns colour intensity to each bin accordingly.
The function syntax for the hex plot is as follows:
DataFrame.plot.hexbin(x, y, C=None, reduce_C_function=None,
gridsize=None, **kwargs)
where,
• x: Refers to a column name to be used as horizontal coordinates
• y: Refers to a column name to be used as vertical coordinates
• C: Refers to a column name whose values are used as the value of each (x, y) point
• reduce_C_function: Refers to a function that takes a single argument and reduces the values in a
bin to a single number
• gridsize: Specifies the number of hexagons in the x-direction, or a tuple giving the number of
hexagons in the x-direction and the y-direction
• **kwargs: Refers to the additional keyword arguments that are documented in DataFrame.plot().
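The Python program to create a hexbin plot is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 2000 random points drawn from a standard normal distribution on each axis
df = pd.DataFrame({'X-Axis': np.random.randn(2000),
                   'Y-Axis': np.random.randn(2000)})
df.plot.hexbin(x='X-Axis', y='Y-Axis', gridsize=20)
plt.show()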
The output of the given program is as follows:

[Figure: hexbin plot of Y-Axis against X-Axis]
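
By default each hexagon is coloured by the number of points it contains. The C and reduce_C_function parameters change this, as the following sketch suggests: here each hexagon is coloured by the mean of a third column. The column names and the random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
points = pd.DataFrame({
    "x": rng.normal(size=500),
    "y": rng.normal(size=500),
    "value": rng.uniform(0, 10, size=500),  # hypothetical measurement
})

# C names the column to aggregate; reduce_C_function turns all the values
# that fall into one hexagon into a single number (here, their mean).
ax = points.plot.hexbin(
    x="x", y="y", C="value", reduce_C_function=np.mean,
    gridsize=15, cmap="viridis",
)

# One aggregated value per non-empty hexagon.
bin_values = ax.collections[0].get_array()
print(len(bin_values))
```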
5.11 PIE PLOT
A pie chart is used to show the relative proportion that each value in a single data series contributes
to a whole. Pie charts are most effective when representing a small amount of data.

A pie chart presents information and statistics in pie-slice format. This sort of chart represents numbers
as percentages, and the total of all slices should equal 100%.

The function syntax for creating a pie plot is as follows:


DataFrame.plot.pie(**kwargs)
The Python program to create a pie plot is as follows:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Party Name': ['BJP', 'SP', 'Congress', 'Others'],
                   'Partywise Votes': [47.2, 32.8, 8, 12]})
# y selects the column to plot; labels names each slice
df.plot.pie(y='Partywise Votes', labels=df['Party Name'])
plt.show()

The output of the given program is as follows:

[Figure: pie chart of Partywise Votes]
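
The statement that the slices should total 100% can be checked before plotting. The sketch below is one possible variant of the program above: it uses Series.plot.pie() with the party names as the index, and passes matplotlib's autopct option to print the percentage on each slice.

```python
import pandas as pd

# Same vote shares as above, stored as a Series indexed by party name.
votes = pd.Series(
    [47.2, 32.8, 8, 12],
    index=["BJP", "SP", "Congress", "Others"],
    name="Partywise Votes",
)

# Convert to percentages of the whole; these should add up to 100.
shares = votes / votes.sum() * 100
print(shares.sum())

# autopct formats the percentage label drawn inside each slice.
ax = shares.plot.pie(autopct="%.1f%%", title="Partywise Votes")
ax.set_ylabel("")  # hide the default axis label next to the pie
```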

5.12 SCATTER MATRIX


A scatter matrix, also referred to as a pairs plot, succinctly plots all the numeric variables
in a dataset against one another. In Python, this data visualisation
technique can be carried out with several different libraries; however, if we are
using pandas to load the data, we can use its built-in scatter_matrix() function to
examine the dataset.

It is important to check for correlation among the independent variables used in regression analysis
during data pre-processing. Scatter plots make it very easy to understand the correlation
between features. Pandas provides analysts with the scatter_matrix() function to produce these
plots. It is also used to determine whether the correlation is positive or negative.

The function syntax for creating a scatter matrix is as follows:


pandas.plotting.scatter_matrix(frame, alpha=0.5, figsize=None, ax=None,
grid=False, diagonal='hist', marker='.', density_kwds=None,
hist_kwds=None, range_padding=0.05, **kwargs)
where,
• frame: Refers to a dataframe

• alpha [float, optional]: Refers to the amount of transparency to apply

• figsize [(float, float), optional]: Specifies the width and height of the figure in inches

• ax [Matplotlib axis object, optional]: Specifies the axis object for matplotlib

• grid [bool, optional]: Set this option to True to display the grid

• diagonal [{'hist', 'kde'}]: Allows you to select either hist for a histogram plot or kde for a kernel density
estimation plot in the diagonal

• marker [str, optional]: Specifies the marker type for matplotlib
• density_kwds [keywords]: Defines the keyword arguments to be passed to the kernel density estimate plot

• hist_kwds [keywords]: Defines the keyword arguments to be passed to the hist function

• range_padding [float, default 0.05]: Specifies the relative extension of the axis range in x and y

• **kwargs: Refers to the additional keyword arguments that are passed to the scatter function

The python program to create a scatter matrix is as follows:


import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("BP_Record.csv")
pd.plotting.scatter_matrix(df)  # every numeric column against every other
plt.show()
The output of the given program is as follows:

[Figure: scatter matrix with panels for Duration, Pulse, Maxpulse and Calories]
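
To see how a scatter matrix helps in judging correlation, the sketch below builds a synthetic dataset in which one pair of columns is deliberately related, then compares the plot with the numeric correlation table. The column names echo the panels above, but the values are randomly generated and are not taken from BP_Record.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
duration = rng.uniform(20, 90, size=100)

# Calories is constructed to rise with Duration; Pulse is independent noise.
records = pd.DataFrame({
    "Duration": duration,
    "Calories": 5 * duration + rng.normal(0, 20, size=100),
    "Pulse": rng.normal(100, 10, size=100),
})

# Returns a grid of axes: scatter plots off the diagonal, histograms on it.
axes = pd.plotting.scatter_matrix(
    records, alpha=0.5, figsize=(6, 6), diagonal="hist"
)

# The sign of each coefficient tells whether a correlation is positive or negative.
corr = records.corr()
print(corr.round(2))
```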

5.13 SUBPLOTS
Python subplots are a great tool for data visualisation because they give you a lot of flexibility
in how data is shown. The subplots() function of matplotlib.pyplot generates a figure and a set of
subplots. It is a wrapper function that makes it easy to generate standard subplot layouts in a single
call, including the containing figure object.
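
A minimal sketch of both approaches follows, assuming the quarterly sales data used earlier in this unit. The first part creates the figure and axes with matplotlib's subplots() and passes each axes object to pandas through the ax parameter; the second part uses the subplots=True shortcut of DataFrame.plot().

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Q1 Sales": [125, 156, 175, 121, 172],
    "Q2 Sales": [152, 169, 131, 189, 135],
})

# One call creates the figure and a 1 x 2 grid of axes.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# The ax parameter tells pandas which subplot to draw on.
df["Q1 Sales"].plot.line(ax=axes[0], title="Q1 Sales")
df["Q2 Sales"].plot.bar(ax=axes[1], title="Q2 Sales")
fig.tight_layout()

# Shortcut: pandas creates one subplot per column in the given layout.
axes_auto = df.plot(subplots=True, layout=(1, 2), figsize=(10, 4))
```

Creating the axes explicitly is convenient when each panel needs a different plot type, while subplots=True is shorter when every column is drawn the same way.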

5.14 LAB EXERCISE


1. Use pandas to perform Exploratory Data Analysis on a dataset.
Ans. Exploratory Data Analysis (EDA) is a method of analysing data through the use of visual techniques. It
assists data analysts in determining how best to manipulate data sources to obtain the
information they require, making it easier for them to find trends and patterns, test hypotheses,
and verify assumptions.

EDA helps data scientists in a variety of ways:

• Increasing their knowledge of the data
• Detecting a variety of data patterns
• Improving comprehension of the problem statement
Python is one of the most widely used languages for data science, particularly because of the
various libraries and packages that make data analysis easier.
It also provides various functions and methods that both simplify and expedite the data
analysis process.
The steps for exploratory data analysis on the dataset are as follows:
1. First, import the pandas and numpy libraries:
import pandas as pd
import numpy as np
2. Import and read the dataset:
dataset = pd.read_csv("Automobile.csv")
This reads the Automobile.csv file into a pandas dataframe.
3. Apply different operations to the dataset to analyse the data. Some of them are as follows:
a. To display all rows of the dataframe:
print(dataset.to_string())
b. To clean the data:
i. To remove rows with empty cells:
df = dataset.dropna()
ii. To remove duplicate entries:
df = dataset.drop_duplicates()
c. To display the first five rows of the data:
dataset.head()
d. To display the last five rows of the data:
dataset.tail()
4. Display the structure (number of rows and columns) of the dataset.
dataset.shape
5. Display information (columns and their data types) about the dataset.
dataset.info()
6. Display a quick statistical summary of the dataset.
dataset.describe()
7. Prepare the different types of plots or charts that are best suited to the imported dataset.
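
The steps above can be sketched end to end on a small in-memory dataframe. Automobile.csv is not reproduced in this unit, so the column names and values below are invented stand-ins; with the real file, the dataframe would come from pd.read_csv() instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Automobile.csv, with one missing value
# and one duplicated row so that the cleaning steps have an effect.
dataset = pd.DataFrame({
    "make": ["audi", "bmw", "bmw", "toyota", "toyota"],
    "horsepower": [102, 121, 121, np.nan, 62],
    "price": [13950, 16430, 16430, 7898, 5348],
})

# Step 3b: drop rows with empty cells, then drop duplicate rows.
cleaned = dataset.dropna().drop_duplicates().reset_index(drop=True)

# Steps 3c to 6: inspect the data.
print(cleaned.head())
print(cleaned.shape)
cleaned.info()
print(cleaned.describe())

# Step 7: a bar plot suits one numeric value per category.
ax = cleaned.plot.bar(x="make", y="price", title="Price by Make")
```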

5.15 CONCLUSION

• Data visualisation is perhaps the most critical phase in the whole data science, big data, or machine
learning life cycle.
• Pandas is an open-source Python package for data manipulation and analysis.
• Combining, reshaping, selecting, data cleansing, and data wrangling are forms of data
manipulation procedures.
• The standard Python distribution does not come bundled with the pandas module.
• Line charts are used to plot continuous data in the form of lines.
• Line plots may be created straight from pandas dataframes using the DataFrame.plot() function.
• A bar chart is a visual presentation of categorical data.
• Bar plots are created straight from pandas dataframes using the DataFrame.plot.bar() function.
• A stacked bar graph is another name for a stacked bar chart. It is a graph that compares different
sections of a whole.
• A histogram is used to show the frequency distribution of data.
• A boxplot, often known as a box and whisker plot, is a visual representation of a data set's spread
and centre.
• In an area chart, areas are used to represent values. It is similar to a line chart in that it displays a
series as a set of points connected by a line.
• A scatter plot is a diagram drawn between a pair of variables, X and Y, on a
two-dimensional plane.
• A hexbin plot is particularly helpful when a scatter plot is too dense to interpret. It bins the area
of the chart into hexagons and assigns colour intensity accordingly.
• A pie chart is used to show the relative proportion that each value in a single data series
contributes to a whole.
• A scatter matrix, also referred to as a pairs plot, succinctly plots all the numeric
variables in a dataset against one another.
• Python subplots are a great tool for data visualisation because they give you a lot of flexibility
in how data is shown.

5.16 GLOSSARY

• Data visualisation: It is the study of representing data or information in a visual form.
• Data: It refers to raw facts and information that are generally gathered in a systematic way
for some kind of analysis.
• Chart: It is a graphical representation for data visualisation, in which the data is represented
by indicators or symbols.
• Scatter chart: It is used to show the relationship between the numeric values in two data series.
• Histogram chart: It is used to show the frequency distribution of data.

5.17 SELF-ASSESSMENT QUESTIONS

A. Essay Type Questions


1. Describe the process for setting up the environment for using pandas in Python.
2. Explain the concept of a bar plot.
3. How do you create a hex plot using pandas?
4. Explain the histogram plot.
5. How do you create a scatter matrix in Python using pandas?

5.18 ANSWERS AND HINTS FOR SELF-ASSESSMENT QUESTIONS

A. Hints for Essay Type Questions


1. The standard Python distribution does not come bundled with the pandas module. A lightweight
alternative is to install pandas using the popular Python package installer, pip. Refer to Section
Setting Up the Environment
2. A bar chart is a visual presentation of categorical data. The data is represented using a number
of bars, each representing a different category. Each bar's height corresponds
to a specific aggregate (for instance, the sum of the values in the category it represents). Bar plots are
created straight from pandas dataframes using the DataFrame.plot.bar() function. Refer to Section
Bar Plot
3. A hexbin plot is particularly helpful when a scatter plot is too dense to interpret. It bins the area
of the chart into hexagons and assigns colour intensity accordingly. It is created using the
DataFrame.plot.hexbin() function. Refer to Section Hex Plot
4. A histogram is used to show the frequency distribution of data. Each column
in the histogram is known as a bin. Continuous data can be represented
using a histogram, which makes it easy to analyse how the data falls within various ranges. Refer to
Section Histogram
5. A scatter matrix, also referred to as a pairs plot, succinctly plots all the numeric
variables in a dataset against one another. In Python, this
data visualisation technique can be carried out with several different libraries; however,
if we are using pandas to load the data, we can use its built-in
scatter_matrix() function to examine the dataset. Refer to Section Scatter Matrix

5.19 POST-UNIT READING MATERIAL


• https://realpython.com/pandas-plot-python/
• https://stackabuse.com/introduction-to-data-visualization-in-python-with-pandas/

5.20 TOPICS FOR DISCUSSION FORUMS

• Discuss with your friends how to create different plots or charts in Python using pandas.
