3 Awesome Visualization Techniques For Every Dataset: Mlwhiz

9 visualization techniques you should know

Uploaded by

Shivank Das

MLWhiz

Deep Learning, Data Science And NLP Enthusiast

3 Awesome Visualization Techniques for every dataset

April 19, 2019

Visualizations are awesome. However, a good visualization is annoyingly hard to make.

Moreover, presenting these visualizations to a bigger audience takes time and effort.

We all know how to make bar plots, scatter plots, and histograms, yet we don’t pay much attention to making them look good.

This hurts us: it costs us credibility with peers and managers. You won’t feel it now, but it happens.

Also, I find it essential to reuse my code. Do I need to start from scratch every time I visit a new dataset? A few reusable graph ideas can help us find information about the data FAST.

In this post, I am also going to talk about 3 cool visual tools:

Categorical Correlation with Graphs,

Pairplots,
Swarmplots and Graph Annotations using Seaborn.

In short, this post is about useful and presentable graphs.

I will be using data from the FIFA 19 complete player dataset on Kaggle, which has detailed attributes for every player registered in the latest edition of the FIFA 19 database.

Since the dataset has many columns, we will only focus on a subset of categorical and continuous columns.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# We probably don't need the gridlines. If you do want them, comment out this line
sns.set(style="ticks")
player_df = pd.read_csv("../input/data.csv")
numcols = [
'Overall',
'Potential',
'Crossing','Finishing', 'ShortPassing', 'Dribbling','LongPassing', 'BallControl', 'Acceleration',
'SprintSpeed', 'Agility', 'Stamina',
'Value','Wage']
catcols = ['Name','Club','Nationality','Preferred Foot','Position','Body Type']
# Subset the columns
player_df = player_df[numcols+ catcols]
# Few rows of data
player_df.head(5)
Player Data

This is nicely formatted data, yet we need to do some preprocessing on the Wage and Value columns (they are in Euro and stored as strings) to make them numeric for our subsequent analysis.

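def wage_split(x):
    # "€565K" -> 565 (wage in thousands of euros); anything unparseable -> 0
    try:
        return int(x.split("K")[0][1:])
    except (AttributeError, TypeError, ValueError):
        return 0
player_df['Wage'] = player_df['Wage'].apply(wage_split)

def value_split(x):
    # "€110.5M" -> 110.5 and "€600K" -> 0.6 (value in millions of euros)
    try:
        if 'M' in x:
            return float(x.split("M")[0][1:])
        elif 'K' in x:
            return float(x.split("K")[0][1:])/1000
        # No M or K suffix (for example "€0"): treat as zero instead of returning None
        return 0.0
    except (AttributeError, TypeError, ValueError):
        return 0.0
player_df['Value'] = player_df['Value'].apply(value_split)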

Categorical Correlation with Graphs:

In simple terms, correlation is a measure of how two variables move together.

For example, in the real world, income and spend are positively correlated: if one increases, the other also increases.

Academic performance and video game usage are negatively correlated: an increase in one predicts a decrease in the other.

So if our predictor variable is positively or negatively correlated with our target variable, it is valuable.

I feel that looking at correlations among different variables is a pretty good thing to do when we try to understand our data.

We can easily create a pretty good correlation plot using Seaborn.

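# Pearson correlation is only defined for numeric columns, so select them explicitly
# (recent pandas versions raise an error if string columns are passed to corr())
corr = player_df[numcols].corr()
g = sns.heatmap(corr, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5},
                annot=True, fmt='.2f', cmap='coolwarm')
sns.despine()
g.figure.set_size_inches(14,10)

plt.show()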
Where did all the categorical variables go?

But do you notice any problem?

Yes, this graph only calculates correlation between numerical columns. What if my target variable is Club or Position?

I want to be able to measure correlation in three different cases, and we use the following metrics to calculate them:

1. Numerical Variables

We already have this in the form of Pearson’s correlation, which is a measure of how two variables move together. It ranges from -1 to 1.

2. Categorical Variables

We will use Cramer’s V for categorical-categorical cases. It measures the association between two discrete variables and is used with variables having two or more levels. It is a symmetric measure, meaning the order of the variables does not matter: Cramer(A,B) == Cramer(B,A).

For example: in our dataset, Club and Nationality must be somehow correlated.

Let us check this using a stacked graph, which is an excellent way to understand the distribution of one categorical variable against another. Note that we use a subset of the data since there are a lot of nationalities and clubs in this dataset.

We keep only the best teams (FC Porto is kept just for more diversity in the sample) and the most common nationalities.
Note that Club preference says quite a bit about Nationality: knowing the former helps a lot in predicting the latter.

We can see that if a player is from England, it is more probable that he plays for Chelsea or Manchester United and not for FC Barcelona, Bayern München, or Porto.

So there is some information present here. Cramer’s V captures the same information.

If all clubs have the same proportion of players from every nationality, Cramer’s V is 0.

If every club prefers a single nationality, Cramer’s V is 1: for example, all English players play for Manchester United, all Germans for Bayern München, and so on.

In all other cases, it lies between 0 and 1.
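To make the two extremes concrete, here is a minimal sketch of the plain (uncorrected) Cramer’s V, computed from the chi-square statistic of a contingency table. The function name cramers_v and the toy club/nationality arrays are made up for illustration. This is not the dython implementation, which may apply a bias correction and so can give slightly different numbers.

```python
import numpy as np
import pandas as pd


def cramers_v(a, b):
    """Plain Cramer's V between two categorical sequences (no bias correction)."""
    # Contingency table of observed counts
    table = pd.crosstab(np.asarray(a), np.asarray(b)).to_numpy()
    n = table.sum()
    # Counts we would expect if the two variables were independent
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    if k == 0:
        # One of the variables has a single level, so there is nothing to associate
        return 0.0
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical toy data: every club fields exactly one nationality
v_perfect = cramers_v(["Club X", "Club X", "Club Y", "Club Y"],
                      ["England", "England", "Germany", "Germany"])
# Hypothetical toy data: every club has the same mix of nationalities
v_none = cramers_v(["Club X", "Club X", "Club Y", "Club Y"],
                   ["England", "Germany", "England", "Germany"])
print(v_perfect, v_none)
```

Swapping the two arguments gives the same value, which is the symmetry mentioned above.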

3. Numerical and Categorical variables

We will use the Correlation Ratio for categorical-continuous cases.

Without getting into too much math, it measures how much of the dispersion in the numeric variable is explained by the categories.

Given a number, can we find out which category it belongs to?

For example:

Suppose we have two columns from our dataset, SprintSpeed and Position:

GK: 58 (De Gea), 52 (T. Courtois), 58 (M. Neuer), 43 (G. Buffon)

CB: 68 (D. Godin), 59 (V. Kompany), 73 (S. Umtiti), 75 (M. Benatia)

ST: 91 (C. Ronaldo), 94 (G. Bale), 80 (S. Aguero), 76 (R. Lewandowski)

As you can see, these numbers are pretty predictive of the bucket they fall into, and thus the Correlation Ratio is high.
In this sample, if I know the sprint speed is more than 85, I can say with confidence that the player plays at ST.

This ratio also ranges from 0 to 1.
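For the curious, here is a small sketch of how the Correlation Ratio can be computed: the square root of the between-category sum of squares divided by the total sum of squares. The function name correlation_ratio is my own for this illustration and is not necessarily how dython spells or implements it. The numbers are the sprint speeds listed above.

```python
import numpy as np


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    # Dispersion of the category means around the overall mean, weighted by group size
    ss_between = 0.0
    for cat in np.unique(categories):
        group = values[categories == cat]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
    # Total dispersion of the numeric variable
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        # A constant numeric column carries no information about the category
        return 0.0
    return float(np.sqrt(ss_between / ss_total))


position = ["GK"] * 4 + ["CB"] * 4 + ["ST"] * 4
sprint_speed = [58, 52, 58, 43, 68, 59, 73, 75, 91, 94, 80, 76]
eta = correlation_ratio(position, sprint_speed)
print(round(eta, 2))
```

For these twelve players the ratio comes out at roughly 0.89, which matches the intuition that position says a lot about sprint speed in this sample.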

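The code to do this is taken from the dython package. I won’t go deep into the code, which you can find in my Kaggle Kernel anyway. The final result looks something like:

# associations comes from dython; importing it from the installed package is one option
# (the Kaggle kernel carries its own copy of the code)
from dython.nominal import associations
player_df = player_df.fillna(0)
# return_results is the keyword from the dython version this post was written against; newer releases may differ
results = associations(player_df,nominal_columns=catcols,return_results=True)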

Categorical vs. Categorical, Categorical vs. Numeric, Numeric vs. Numeric. Much more interesting

Isn’t it Beautiful?

We can understand so much about football just by looking at this data. For example:
The position of a player is highly correlated with dribbling ability. You won’t play Messi at the back, right?

Value is more highly correlated with passing and ball control than with dribbling. The rule is to always pass the ball. Neymar, I am looking at you.

Club and Wage are highly correlated. To be expected.

Body Type and Preferred Foot are highly correlated. Does that mean that if you are lean, you are most likely left-footed? That doesn’t make much sense. One can investigate further.

We found all of this with one simple graph, and none of it was visible in the typical correlation plot without categorical variables.

I will leave it at that. One can look more into the chart and find more meaningful results, but the point is that this makes finding patterns so much easier.

Pairplots
While I talked a lot about correlation, it is a fickle metric.

To understand what I mean, let us see one example.

Anscombe’s quartet comprises four datasets that have nearly identical correlation (about 0.816 each), yet have very different distributions and appear very different when graphed.

Anscombe Quartet - Correlations can be fickle.

Thus it sometimes becomes crucial to plot correlated data and see the distributions individually.
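If you want to check the Anscombe claim yourself, the sketch below hardcodes the four published datasets and computes Pearson’s correlation for each with NumPy. The variable names are mine. Plotting each pair as a scatter plot is left out to keep it short, but that is where the differences show up.

```python
import numpy as np

# Anscombe's quartet. Sets I-III share the same x values; set IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation of x and y for each of the four datasets
correlations = {
    name: float(np.corrcoef(x, y)[0, 1]) for name, (x, y) in quartet.items()
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

All four values land within a hair of each other, even though one set is a noisy line, one is a curve, and two are dominated by a single outlier.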

Now we have many columns in our dataset. Graphing them all would be so much effort.
No, it is a single line of code.

filtered_player_df = player_df[
    (player_df['Club'].isin(['FC Barcelona', 'Paris Saint-Germain',
                             'Manchester United', 'Manchester City', 'Chelsea',
                             'Real Madrid', 'FC Porto', 'FC Bayern München'])) &
    (player_df['Nationality'].isin(['England', 'Brazil', 'Argentina',
                                    'Italy', 'Spain', 'Germany']))
]
# Single line to create pairplot
g = sns.pairplot(filtered_player_df[['Value','SprintSpeed','Potential','Wage']])

Pretty good. We can see so much in this graph.

Wage and Value are highly correlated.

Most of the other values are correlated too. However, the trend of Potential vs. Value is unusual. We can see how the value increases exponentially once we reach a particular potential threshold. This information can be helpful in modeling. Can we use some transformation on Potential to make it more correlated?
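On the transformation question, one common option is to take the log of the exponentially growing variable (Value here, rather than Potential) and see whether the linear correlation improves. The sketch below uses synthetic data, not the FIFA dataset: the growth rate, noise level, and sample size are assumptions picked only to mimic an exponential trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: potential between 60 and 95, value growing exponentially with it
potential = rng.uniform(60, 95, size=500)
value = np.exp(0.15 * (potential - 60)) * rng.lognormal(0.0, 0.2, size=500)

# Pearson correlation before and after the log transform.
# log1p is used so that a value of 0 would not break the transform.
r_raw = float(np.corrcoef(potential, value)[0, 1])
r_log = float(np.corrcoef(potential, np.log1p(value))[0, 1])
print(f"raw: {r_raw:.2f}, log-transformed: {r_log:.2f}")
```

Whether the same trick helps on the real columns is something to verify on the data itself, for example by adding the transformed column to the pairplot.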

Caveat: No categorical columns.

Can we do better? We always can.

g = sns.pairplot(filtered_player_df[['Value','SprintSpeed','Potential','Wage','Club']],hue = 'Club')
So much more info, just by passing the categorical variable Club as the hue parameter.

Porto’s wage distribution sits heavily towards the lower side.

I don’t see such a steep distribution in the value of Porto players. Porto’s players would always be looking out for an opportunity.
See how a lot of pink points (Chelsea) form a sort of cluster on the Potential vs. Wage graph. Chelsea has a lot of high-potential players with lower wages. This needs more attention.

I already know some of the points on the Wage/Value subplot.

The blue point at a wage of 500K is Messi. Also, the orange point with more value than Messi is Neymar.

Although this hack still doesn’t solve the categorical problem, I have something cool for looking into the distribution of categorical variables, though one at a time.

SwarmPlots
How do we see the relationship between categorical and numerical data?

Enter swarmplots. Just as the name suggests, a swarm of points is plotted for each category, with a little dispersion along the categorical axis (the y-axis here) to make the points easier to see.

They are my current favorite for plotting such relationships.

g = sns.swarmplot(y = "Club",
                  x = 'Wage',
                  data = filtered_player_df,
                  # Point size; decrease it if the points get too crowded
                  size = 7)
# remove the top and right line in graph
sns.despine()
g.figure.set_size_inches(14,10)
plt.show()

Swarmplot...

Why don’t I use boxplots? Where are the median values? Can I plot those? Obviously. Overlay the swarm on a box plot, and we have a great looking graph.

g = sns.boxplot(y = "Club",
                x = 'Wage',
                data = filtered_player_df, whis=np.inf)
g = sns.swarmplot(y = "Club",
                  x = 'Wage',
                  data = filtered_player_df,
                  # Point size; decrease it if the points get too crowded
                  size = 7, color = 'black')
# remove the top and right line in graph
sns.despine()
g.figure.set_size_inches(12,8)
plt.show()
Swarmplot+Boxplot, Interesting

Pretty good. We can see the individual points on the graph, see some statistics, and understand how wages differ across categories.

The far right point is Messi. However, I should not have to tell you that in a text below the chart. Right?

Your boss says this graph is going to go in a presentation. I want to write Messi on the graph itself. This is where annotations come into the picture.

max_wage = filtered_player_df.Wage.max()
# Build the mask from the filtered frame itself so the boolean index lines up with its rows
max_wage_player = filtered_player_df[filtered_player_df['Wage'] == max_wage]['Name'].values[0]
g = sns.boxplot(y = "Club",
                x = 'Wage',
                data = filtered_player_df, whis=np.inf)
g = sns.swarmplot(y = "Club",
                  x = 'Wage',
                  data = filtered_player_df,
                  # Point size; decrease it if the points get too crowded
                  size = 7, color='black')
# remove the top and right line in graph
sns.despine()
# Annotate. xy is the point to label: max_wage is x and 0 is y. In this plot y ranges from 0 to 7, one per club,
# so y = 0 assumes the top-paid player's club is the first one drawn.
# xytext for coordinates of where I want to put my text
# The label is passed positionally because its keyword was renamed from s to text in newer Matplotlib
plt.annotate(max_wage_player,
             xy = (max_wage,0),
             xytext = (500,1),
             # Shrink the arrow to avoid occlusion
             arrowprops = {'facecolor':'gray', 'width': 3, 'shrink': 0.03},
             backgroundcolor = 'white')
g.figure.set_size_inches(12,8)
plt.show()
Annotated, Statistical Info and point swarm. To the presentation, I go.
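Since reusable code is the theme of this post, here is a hedged sketch of a helper that looks up the y position of the label instead of hardcoding 0. The helper name annotate_max, the toy players, and the offsets are all hypothetical. It assumes the clubs are strings drawn in order of first appearance, which is how Seaborn orders a plain string column when no order argument is passed.

```python
import matplotlib.pyplot as plt
import pandas as pd


def annotate_max(ax, df, cat_col, num_col, label_col, text_offset=(-60, 30)):
    """Label the row with the largest num_col on a horizontal categorical plot."""
    top = df.loc[df[num_col].idxmax()]
    # Assumption: categories sit at y = 0, 1, 2, ... in order of first appearance
    order = list(pd.unique(df[cat_col]))
    y_pos = order.index(top[cat_col])
    return ax.annotate(top[label_col],
                       xy=(top[num_col], y_pos),
                       # Place the text relative to the point, in screen points
                       xytext=text_offset, textcoords="offset points",
                       arrowprops={'facecolor': 'gray', 'width': 3, 'shrink': 0.03},
                       backgroundcolor='white')


# Made-up data just to exercise the helper
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Club": ["Club X", "Club Y", "Club Y"],
    "Wage": [200, 565, 120],
})
fig, ax = plt.subplots()
note = annotate_max(ax, toy, "Club", "Wage", "Name")
```

With the plots above, the call would look like annotate_max(g, filtered_player_df, 'Club', 'Wage', 'Name'), made after the boxplot and swarmplot are drawn.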

See Porto down there, competing with the giants on such a small wage budget.
So many highly paid players at Real and Barcelona.
Manchester City has the highest median wage.
Manchester United and Chelsea believe in equality: many players are clustered around the same wage scale.
I am happy that while Neymar is valued more than Messi, there is a huge wage difference between Messi and Neymar.

A semblance of normalcy in this crazy world.

So to recap, in this post we talked about calculating and reading correlations between different variable types, plotting correlations between numerical data, and plotting categorical data against numerical data using swarmplots. I love how we can overlay chart elements on top of each other in Seaborn.

Also, if you want to learn more about visualizations, I would like to call out an excellent course about Data Visualization and applied plotting from the University of Michigan, which is part of a pretty good Data Science Specialization with Python. Do check it out.

If you liked this post, do look at my other post on Seaborn too, where I have created some more straightforward reusable graphs. I am going to be writing more beginner-friendly posts in the future. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz

Code for this post in this kaggle kernel.

References:
The Search for Categorical Correlation
Seaborn Swarmplot Documentation
Seaborn Pairplot Documentation
