Data Visualization in Python

Uploaded by

mohitdb7


In this module, you will learn about the importance of data visualisation in the real world,
data handling and cleaning, sanity checks, and various charts and plots, all of which can be
used to observe trends, obtain relationships, and portray final results.

The major advantage of data visualisation is that it reveals patterns in raw numbers that
are otherwise difficult for the human eye to see. It is therefore important to visualise the
data to observe how different features behave. The following example shows the distribution
of sales corresponding to specific discount rates in four major cities. From the raw numbers
alone, it is difficult to conclude anything, because both summary statistics (average and
standard deviation) are equal across cities. However, when sales are plotted against the
discount rate, each city turns out to follow its own trend.

© Copyright 2018. UpGrad Education Pvt. Ltd. All rights reserved

However, the patterns in the underlying data and the difference become apparent when
visualised through appropriate plots.

Anscombe’s Quartet

Here, the trend followed by each city is different.

● Mumbai: A straight line with little variance.
● Bengaluru: A straight line with a sudden rise in unit sales at a 13% discount rate.
● Hyderabad: A curvy line with maximum unit sales at 11% followed by a decreasing
trend.
● Kolkata: As the discount rate is fixed for almost all months except August, unit sales
are in the range of 500-900.

Each branch employs different strategies to calculate its discount rates. Moreover, sales
numbers were also different across branches. Therefore, one should utilise an
appropriate visualisation technique to ‘look’ into the data.
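The same effect can be reproduced with the classic Anscombe's quartet values. This is a minimal sketch: it uses the published quartet numbers rather than the city sales figures from this module (which are not reproduced here), and the names `quartet` and `summary` are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Classic Anscombe's quartet values (not the module's city sales data).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean and sample standard deviation of y for each dataset.
summary = pd.DataFrame(
    {
        name: {"mean_y": pd.Series(y).mean(), "std_y": pd.Series(y).std()}
        for name, (x, y) in quartet.items()
    }
).T.round(2)
print(summary)

# One scatter plot per dataset: the shapes differ even though the statistics agree.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(f"Dataset {name}")
# plt.show()  # uncomment when running interactively
```

The printed table is nearly identical across the four datasets, while the four panels look nothing alike.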

1. For the popular cricket-playing nations, the following visualisation illustrates every
batsman who has scored at least 20 runs in their ODI career along with their run rate.
Size = Number of runs.
Colour = Strike rate (legend scale: 20, 50, 80, 110, 140).
You can play with the visualisation here.

Each block is further divided into several blocks representing the runs scored by the
cricketers in each innings.

2. A treemap diagram showing how multiple companies and sectors react to the
budget. You can find the interactive graph here.

3. Visual exploratory analytics: Data visualisation helps in understanding the
connections between different software and clustering them based on common features.
You can find the visual here.

Once you load a dataset, it may turn out to contain many kinds of defects. The most common
ones are missing values and incorrect data types.

For missing values, the following techniques are commonly used:

○ Dropping the rows containing the missing values

○ Imputing the missing values

● For numerical variables, use the mean or the median
● For categorical variables, use mode

○ Keep the missing values if they do not affect the analysis
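The first two options above can be sketched with pandas. This is a minimal illustration on a made-up `apps` table; the column names `Rating` and `Category` are assumptions, not taken from the module's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; column names and values are illustrative.
apps = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, 3.9, np.nan],
    "Category": ["GAME", "TOOLS", None, "GAME", "GAME"],
})

# Count the missing values per column before deciding what to do.
print(apps.isnull().sum())

# Option 1: drop every row that contains at least one missing value.
dropped = apps.dropna()

# Option 2: impute. Median for the numerical column, mode for the categorical one.
imputed = apps.copy()
imputed["Rating"] = imputed["Rating"].fillna(imputed["Rating"].median())
imputed["Category"] = imputed["Category"].fillna(imputed["Category"].mode()[0])
```

Dropping is the simplest choice but discards information; imputing keeps every row at the cost of introducing estimated values.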

Incorrect data types:

○ Clean certain values

○ Clean and convert an entire column
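Both fixes can be sketched with pandas string methods. The `Price` and `Installs` columns and their text formats below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical columns that were read in as text; names and values are illustrative.
apps = pd.DataFrame({
    "Price": ["$4.99", "0", "$1.49"],
    "Installs": ["10,000+", "500+", "1,000,000+"],
})

# Clean certain values: strip the currency symbol where present, then convert to float.
apps["Price"] = apps["Price"].str.replace("$", "", regex=False).astype(float)

# Clean and convert an entire column: remove ',' and '+', then convert to int.
apps["Installs"] = (
    apps["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

print(apps.dtypes)
```

Passing `regex=False` treats `$` and `+` as literal characters rather than regular-expression symbols.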

Often, our data is noisy, incomplete, irregular or dirty. Before drawing any conclusions from
those beautiful charts, we need to perform sanity checks on the available data to ensure that
nothing is wrong. These sanity checks are part of data preparation and cleaning, which takes up
a good chunk of a data scientist's time. For example, one should ensure that features such
as price, weight and height are always positive. In our Google Play app case study, we ensured
that the number of reviews is always less than or equal to the number of installs. Rows that
fail such checks and do not make any sense at all can be removed.
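Such checks translate directly into boolean filters. The sketch below uses a made-up `apps` table, and it assumes that a price of zero (a free app) is acceptable while a negative price is not.

```python
import pandas as pd

# Hypothetical data containing two impossible rows; values are illustrative.
apps = pd.DataFrame({
    "Price": [0.0, 4.99, -1.0, 2.99],
    "Reviews": [120, 5000, 10, 900],
    "Installs": [1000, 1000, 500, 10000],
})

# Rule 1 (assumption): a price may be zero for free apps but never negative.
valid_price = apps["Price"] >= 0
# Rule 2: an app cannot have more reviews than installs.
valid_reviews = apps["Reviews"] <= apps["Installs"]

# Inspect the rows that fail at least one check before removing them.
print(apps[~(valid_price & valid_reviews)])

clean = apps[valid_price & valid_reviews].reset_index(drop=True)
```

Printing the failing rows first makes it easier to decide whether they should be dropped or corrected.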

Outliers are extreme values that deviate from the other observations in the data. They may
indicate variability in measurement, experimental errors, or a novelty. In other words, an
outlier is an observation that diverges from the overall pattern of a sample. This is where
one should start using visualisation, and the visualisation best suited for spotting
outliers is the box plot.

The fences of the box plot bound the range of values that are not treated as outliers.
The upper fence is given by the formula Q3 + 1.5*IQR, and the lower fence by the formula
Q1 - 1.5*IQR.

IQR: the interquartile range, Q3 - Q1, which spans the values lying between the 25th and
75th percentiles.

Outliers are the values that fall outside the range between the two fences.

In the previous example, all outlier points greater than 30 were removed from the
dataset.
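The fence formulas can be checked numerically. This is a minimal sketch on a made-up `prices` series; it filters on the computed fences rather than on the fixed cutoff of 30 mentioned above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical prices with two extreme values; the numbers are illustrative.
prices = pd.Series([0.99, 1.49, 1.99, 2.49, 2.99, 3.99, 4.99, 29.99, 79.99], name="Price")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers; the rest are kept.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
kept = prices[prices.between(lower_fence, upper_fence)]

# matplotlib's box plot uses the same 1.5 * IQR rule for its whiskers by default.
fig, ax = plt.subplots()
ax.boxplot(prices.to_numpy())
ax.set_ylabel("Price")
# plt.show()  # uncomment when running interactively
```

Whether flagged values should be removed depends on the analysis; an extreme value can be a genuine observation rather than an error.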

Histograms generally work by bucketing the entire range of values that a particular
variable takes into specific bins.

Vertical bars denote the total number of records in a specific bin, which is also known as
its frequency. The number of bins can be increased or decreased to change the granularity of
the analysis.

Code for plotting a histogram:
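The original listing is not reproduced here. A minimal matplotlib sketch, assuming a synthetic `ratings` array in place of the module's data, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic ratings between 1 and 5; stands in for a real numeric column.
rng = np.random.default_rng(0)
ratings = rng.normal(loc=4.1, scale=0.5, size=500).clip(1, 5)

fig, ax = plt.subplots()
# bins controls the granularity: more bins give narrower buckets.
counts, bin_edges, _ = ax.hist(ratings, bins=20, edgecolor="black")
ax.set_xlabel("Rating")
ax.set_ylabel("Frequency")
# plt.show()  # uncomment when running interactively
```

`counts` holds the frequency of each bin and `bin_edges` the bin boundaries, so there is always one more edge than there are bins.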

Seaborn:
● Python library to create statistical graphs easily
● Built on top of matplotlib and closely integrated with pandas

Functionalities of Seaborn:

● Dataset-oriented API
● Analysing univariate and bivariate distributions
● Automatic estimation and plotting of linear regression models
● Convenient views for complex datasets
● Concise control over the style
● Colour palettes
Importing Seaborn:
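The original listing is not reproduced here. By convention, seaborn is imported under the alias `sns`; the style and palette choices below are only illustrative.

```python
import seaborn as sns

# Apply seaborn's theme to all subsequent plots; both arguments are optional.
sns.set_theme(style="whitegrid", palette="deep")
```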

A distribution plot is similar to the histogram functionality in matplotlib. Instead of
plotting frequencies, it plots an approximate probability density for each bucket (each
rating bucket, in the case study). The curve drawn over the distribution is the kernel
density estimate (KDE), which approximates the probability density curve.

Following is an example of a distribution plot. Notice that the left axis has the density for
each bin or bucket instead of the frequency.

Code:

Distribution Plot

Some styling options in distribution plots:

To analyse categorical columns, pie charts and bar charts are used to portray the
relationships.
1. Pie Chart: A pie chart is a circular graph divided into slices. The larger a slice is, the
bigger the portion of the total quantity it represents. For example, if a company operates
three separate divisions, at the year end, its top management would be interested in
seeing what portion of total revenue each division accounted for.

2. Bar Chart: Bar charts are among the most frequently used chart types. As the name
suggests, a bar chart is composed of a series of bars illustrating a variable’s development.
Given that bar charts are such a common chart type, people are generally familiar with
them and can understand them easily. Examples like the following one are
straightforward to read.

Code:
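The original listing is not reproduced here. The sketch below draws a pie chart and a bar chart for the three-division revenue example, using made-up revenue figures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical year-end revenue per division; the numbers are illustrative.
revenue = pd.Series({"Division A": 120, "Division B": 75, "Division C": 45}, name="Revenue")

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: each slice is a division's share of the total.
ax_pie.pie(revenue.values, labels=revenue.index, autopct="%1.0f%%")
ax_pie.set_title("Share of total revenue")

# Bar chart: one bar per division, with height equal to its revenue.
ax_bar.bar(revenue.index, revenue.values)
ax_bar.set_ylabel("Revenue")
# plt.show()  # uncomment when running interactively
```

The pie chart emphasises proportions of a whole, while the bar chart makes the absolute values easier to compare.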

To plot a horizontal bar chart:
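The original listing is not reproduced here. In matplotlib, `barh` draws horizontal bars; the revenue figures below are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical revenue per division, sorted so the largest bar ends up on top.
revenue = pd.Series({"Division A": 120, "Division B": 75, "Division C": 45}).sort_values()

fig, ax = plt.subplots()
# barh takes the categories first and the bar lengths second.
ax.barh(revenue.index, revenue.values)
ax.set_xlabel("Revenue")
# plt.show()  # uncomment when running interactively
```

Horizontal bars are often easier to read when category labels are long.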

Scatter plots are among the most commonly used and powerful visualisations in the field
of machine learning. They are crucial in revealing relationships between the data points,
and one can generally deduce trends in the data with the help of a scatter plot.

Applications of Scatter Plots in Machine Learning:

● Scatter plots are useful in regression problems to check whether a linear trend
exists in the data or not. For example, in the following image, creating a linear model in
the first case makes far more sense as a clear straight-line trend is visible.

● Scatter plots help in observing naturally occurring clusters. In the following image,
the marks of students in Maths and Biology have been plotted. You can clearly group the
students into four clusters. Cluster one includes students who score very well in Biology
but very poorly in Maths, Cluster two has students who score equally well in both the
subjects, and so on.

Code:
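The original listing is not reproduced here. The sketch below plots synthetic Maths and Biology marks that are linearly related by construction, so a straight-line trend is visible in the scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic marks: Biology depends roughly linearly on Maths, plus noise.
rng = np.random.default_rng(1)
maths = rng.uniform(30, 100, size=60)
biology = 10 + 0.8 * maths + rng.normal(0, 4, size=60)

fig, ax = plt.subplots()
points = ax.scatter(maths, biology, alpha=0.7)
ax.set_xlabel("Maths marks")
ax.set_ylabel("Biology marks")

# The correlation coefficient quantifies the linear trend seen in the plot.
corr = np.corrcoef(maths, biology)[0, 1]
print(f"Correlation: {corr:.2f}")
# plt.show()  # uncomment when running interactively
```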

● A joint plot displays the relationship between two variables, together with the
distribution of each variable. A reg plot extends this scatter view by adding a fitted
regression line.

Pair plots help in quickly identifying the trends between a target variable and the predictor
variables. For example, suppose you want to predict how your company's profits are affected
by three different factors. To choose one factor, you create a pair plot containing profits
and the three factors as variables. Here are the scatter plots of profits vs the three
variables that you obtained from the pair plot.

● When you have several numeric variables, making multiple scatter plots becomes rather
tedious. Therefore, a pair plot visualisation is preferred where all the scatter plots are in a
single view in the form of a matrix.
● For the non-diagonal views, it plots a scatter plot between two numeric variables.
● For the diagonal views, a histogram is plotted.

Heat maps utilise the concept of using colours and colour intensities to visualise a range of
values. In Python, you can create a heat map whenever you have a rectangular grid or table
of numbers analysing any two features.

The heatmap shown previously represents the collinearity among the multiple variables in the
dataset. The data.corr() method was used in the code to compute the pairwise correlations
between the numeric columns. This is also where the target variable comes in: its row shows
how strongly each of the other variables is correlated with it. When reading the blue
heatmap, focus on the dark and light areas. Dark blue represents a strong positive
correlation, whereas the lightest cells (white) represent a negative correlation.
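A correlation heatmap can be sketched with `DataFrame.corr()` and `sns.heatmap`. The data is synthetic, and the `Blues` colour map is an assumption chosen to match the blue heatmap described above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic columns: b rises with a, c falls with a.
rng = np.random.default_rng(4)
a = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": -a + rng.normal(scale=0.5, size=200),
})

corr = data.corr()  # pairwise Pearson correlations between the columns

fig, ax = plt.subplots()
# annot=True prints each coefficient inside its cell.
sns.heatmap(corr, cmap="Blues", annot=True, fmt=".2f", vmin=-1, vmax=1, ax=ax)
# plt.show()  # uncomment when running interactively
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale comparable across datasets, since correlation coefficients always lie in that range.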

A line chart or a line graph is a type of chart which displays information as a series of
data points called markers connected by straight line segments. It is a basic type of chart
commonly used in a number of domains.
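A minimal matplotlib sketch of a line chart with visible markers follows; the monthly sales figures are made up.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales; the numbers are illustrative.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [520, 610, 580, 700, 690, 760]

fig, ax = plt.subplots()
# marker="o" draws each data point; straight segments connect consecutive points.
(line,) = ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Unit sales")
# plt.show()  # uncomment when running interactively
```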

● A stacked bar chart breaks down each bar of the bar chart on the basis of a different
category.
● The main objective of a standard bar chart is to compare numeric values between levels
of a categorical variable. One bar is plotted for each level of the categorical variable,
with each bar's length indicating a numeric value. A stacked bar chart not only achieves
this objective, but also targets a second goal: showing how each bar is split across the
levels of a second categorical variable.
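With pandas, a stacked bar chart is a one-line change to a regular bar chart. The sketch below assumes made-up quarterly sales split by a hypothetical sales channel.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales per quarter (rows) split by channel (columns).
sales = pd.DataFrame(
    {"Online": [30, 45, 50], "In-store": [70, 60, 40]},
    index=["Q1", "Q2", "Q3"],
)

# stacked=True places the channels on top of each other within each quarter's bar.
ax = sales.plot(kind="bar", stacked=True)
ax.set_ylabel("Unit sales")
# plt.show()  # uncomment when running interactively
```

The total height of each bar still allows quarters to be compared, while the coloured segments show each channel's contribution.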

The plotly Python library is an interactive, open-source plotting library that supports over 40
unique chart types covering a wide range of statistical, financial, geographic, scientific, and
three-dimensional use cases.

Line Plot in Plotly:
