
EDA Module 2

Uploaded by

Vivekananda GN
2

Visual Aids for EDA


As data scientists, two important goals in our work are to extract knowledge from the
data and to present the data to stakeholders. Presenting results to stakeholders can be
difficult because our audience may not have the technical know-how to understand
programming jargon and other technicalities. Hence, visual aids are very useful
tools. In this chapter, we will focus on the different types of visual aids that can be used
with our datasets and on the techniques used to create them.

In this chapter, we will cover the following topics:

Line chart
Bar chart
Scatter plot
Area plot and stacked plot
Pie chart
Table chart
Polar chart
Histogram
Lollipop chart
Choosing the best chart
Other libraries to explore
Visual Aids for EDA Chapter 2

Technical requirements
You can find the code for this chapter on GitHub:
https://github.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python.
In order to get the best out of this chapter, ensure the following:

Make sure you have Python 3.X installed on your computer. It is recommended
to use a Python notebook environment, such as the Jupyter Notebook that ships with Anaconda.
You must have Python libraries such as pandas, seaborn, and matplotlib
installed.

Line chart
Do you remember what a continuous variable is and what a discrete variable is? If not,
have a quick look at Chapter 1, Exploratory Data Analysis Fundamentals. Back to the main
topic, a line chart is used to illustrate the relationship between two or more continuous
variables.

We are going to use the matplotlib library and stock price data to plot time series
lines. First of all, let's understand the dataset. We have created a helper function that
generates the dataset, using the radar Python library for random dates and the random
module for prices. It is the simplest possible dataset you can imagine, with just two
columns. The first column is Date and the second column is Price, indicating the stock
price on that date.

Let's generate the dataset by calling the helper method. In addition to this, we have saved
the data as a CSV file. You can optionally load the CSV file using the pandas read_csv
function and proceed with visualization.

The generateData function is defined here:


import datetime
import math
import pandas as pd
import random
import radar
from faker import Faker
fake = Faker()

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start

    for _ in range(n):
        # Pick a random day in the range and a random price between 900 and 1000
        date = radar.random_datetime(start='2019-08-1',
                                     stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Average the prices that fall on the same date; Date becomes the index
    df = df.groupby(by='Date').mean()

    return df
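
If the radar package is not installed, a similar dataset can be produced with only the standard library and pandas. The following is a minimal sketch under that assumption; the function name generate_data_stdlib and its seed parameter are hypothetical and are not part of the chapter's code:

```python
import datetime
import random

import pandas as pd


def generate_data_stdlib(n, seed=None):
    """Return the mean random price per day for August 2019 (hypothetical helper)."""
    rng = random.Random(seed)
    start = datetime.date(2019, 8, 1)
    span_days = (datetime.date(2019, 8, 30) - start).days
    rows = []
    for _ in range(n):
        # Random day between 1 and 30 August, random price between 900 and 1000
        day = start + datetime.timedelta(days=rng.randint(0, span_days))
        rows.append([day, round(rng.uniform(900, 1000), 4)])
    frame = pd.DataFrame(rows, columns=["Date", "Price"])
    frame["Date"] = pd.to_datetime(frame["Date"])
    # Average the prices that share a date; Date becomes the sorted index
    return frame.groupby("Date").mean()


sample = generate_data_stdlib(50, seed=42)
```

Passing a seed makes the random values repeatable, which is convenient when comparing plots between runs.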

Having defined the method to generate data, let's get the data into a pandas dataframe and
check the first 10 entries:
df = generateData(50)
df.head(10)

The output of the preceding code is shown in the following screenshot:

Let's create the line chart in the next section.


Steps involved
Let's look at the process of creating the line chart:

1. Load and prepare the dataset. We will learn more about how to prepare data in
Chapter 4, Data Transformation. For this exercise, all the data is preprocessed.
2. Import the matplotlib library. It can be done with this command:
import matplotlib.pyplot as plt

3. Plot the graph:


plt.plot(df)

4. Display it on the screen:


plt.show()

Here is the code if we put it all together:


import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (14, 10)
plt.plot(df)
plt.show()

And the plotted graph looks something like this:


In the preceding example, we assume the data is available in the CSV format. In real-life
scenarios, the data is mostly available in CSV, JSON, Excel, or XML formats and is mostly
disseminated through some standard API. For this series, we assume you are already
familiar with pandas and how to read different types of files. If not, it's time to revise
pandas. Refer to the pandas-datareader documentation for further details:
https://pandas-datareader.readthedocs.io/en/latest/.
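
As a quick refresher, the sketch below reads the same two made-up rows from CSV text and from JSON text. The in-memory strings stand in for real files or API responses and are assumptions for illustration only:

```python
import io

import pandas as pd

# Two illustrative rows, once as CSV text and once as JSON text
csv_text = "Date,Price\n2019-08-01,950.5\n2019-08-02,962.25\n"
json_text = ('[{"Date": "2019-08-01", "Price": 950.5},'
             ' {"Date": "2019-08-02", "Price": 962.25}]')

# read_csv parses the Date column and uses it as the index
from_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"],
                       index_col="Date")

# read_json builds the frame first; the index is set afterwards
from_json = pd.read_json(io.StringIO(json_text),
                         convert_dates=["Date"]).set_index("Date")
```

Both calls accept a file path or URL in place of the StringIO object, so the same pattern applies to data stored on disk.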

Bar charts
This is one of the most common types of visualization that almost everyone must have
encountered. Bars can be drawn horizontally or vertically to represent categorical
variables.

Bar charts are frequently used to compare values across distinct categories or to track
variations over time. In most cases, bar charts are very convenient when the changes
are large. In order to learn about bar charts, let's assume a pharmacy in Norway keeps track
of the amount of Zoloft sold every month. Zoloft is a medicine prescribed to patients
suffering from depression. We can use the calendar Python library to keep track of the
months of the year (1 to 12) corresponding to January to December:

1. Let's import the required libraries:


import random
import numpy as np
import calendar
import matplotlib.pyplot as plt

2. Set up the data. Remember, the stop parameter of range is exclusive, meaning
that if you call range(1, 13), the last item, 13, is not included:
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

3. Specify the layout of the figure and allocate space:


figure, axis = plt.subplots()

4. In the x axis, we would like to display the names of the months:


plt.xticks(months, calendar.month_name[1:13], rotation=20)

5. Plot the graph:


plot = axis.bar(months, sold_quantity)


6. This step is optional depending upon whether you are interested in displaying
the data value on the head of the bar. It visually gives more meaning to show an
actual number of sold items on the bar itself:
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')

7. Display the graph on the screen:


plt.show()

The bar chart is as follows:


Here are important observations from the preceding visualizations:

months and sold_quantity are Python lists: months holds the month numbers (1 to
12) and sold_quantity holds the amount of Zoloft sold in each month.
We are using the subplots() method in the preceding code. Why? Well, it
provides a way to define the layout of the figure in terms of the number of
graphs and provides ways to organize them. Still confused? Don't worry, we will
be using subplots plenty of times in this chapter. Moreover, if you need a quick
reference, Packt has several books explaining matplotlib. Some of the most
interesting reads have been mentioned in the Further reading section of this
chapter.
In step 4, we use the plt.xticks() function, which allows us to change the
x axis tick labels from the numbers 1 to 12 to the corresponding month names,
which calendar.month_name[1:13] provides from the calendar Python library.
Step 5 actually draws the bars from the months and the quantity sold.
axis.text() within the for loop annotates each bar with its corresponding
value. How it does this might be interesting. For the x coordinate, we take the
left edge of the bar and add half the bar width, which centers the text
horizontally. For the y coordinate, we use 1.002 times the bar height, which
places the text just above the top of the bar. Then, using the va and ha
arguments, we align the text centrally over the bar.
Step 7 actually displays the graph on the screen.
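
For comparison, the manual annotation loop can be replaced by the bar_label method of the axes. This is a sketch that assumes matplotlib 3.4 or later, where bar_label is available; it is an alternative to the chapter's loop, not the code used above:

```python
import calendar
import random

import matplotlib.pyplot as plt

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]

figure, axis = plt.subplots()
bars = axis.bar(months, sold_quantity)
axis.set_xticks(months)
axis.set_xticklabels(calendar.month_name[1:13], rotation=20)

# bar_label writes each bar's height just above the bar,
# which is what the loop over rectangles does by hand
labels = axis.bar_label(bars, fmt='%d')
```

Calling plt.show() afterwards displays the annotated chart.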

As mentioned in the introduction to this section, bars can be either horizontal
or vertical. Let's change to a horizontal format. All the code remains the same,
except that plt.xticks() changes to plt.yticks() and axis.bar() changes to axis.barh().
We assume it is self-explanatory.


In addition to this, placing the exact data values is a bit tricky and requires a few iterations
of trial and error to place them perfectly. But let's see them in action:
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

figure, axis = plt.subplots()

plt.yticks(months, calendar.month_name[1:13], rotation=20)

plot = axis.barh(months, sold_quantity)

for rectangle in plot:
    width = rectangle.get_width()
    axis.text(width + 2.5, rectangle.get_y() + 0.38, '%d' % int(width),
              ha='center', va='bottom')

plt.show()

And the graph it generates is as follows:


Well, that's all about the bar chart in this chapter. We are certainly going to use several
other attributes in the subsequent chapters. Next, we are going to visualize data using a
scatter plot.

Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinates system to display values of typically two
variables for a set of data.

When should we use a scatter plot? Scatter plots can be constructed in the following two
situations:

When one continuous variable is dependent on another variable, which is under
the control of the observer
When both continuous variables are independent

There are two important concepts: independent variable and dependent variable. In
statistical modeling or mathematical modeling, the values of dependent variables rely on
the values of independent variables. The dependent variable is the outcome variable being
studied. The independent variables are also referred to as regressors. The takeaway
message here is that scatter plots are used when we need to show the relationship between
two variables, and hence are sometimes referred to as correlation plots. We will dig into
more details about correlation in Chapter 7, Correlation.
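
To make the idea of a relationship between two variables concrete, here is a small sketch that measures it with the Pearson correlation coefficient. The data is synthetic and purely illustrative: x plays the role of the independent variable and y depends on it, plus some random noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is a linear function of x plus noise
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the points of the corresponding scatter plot lie close to a rising straight line.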

Whether you are an expert data scientist or a beginner computer science student, you have
no doubt encountered a form of scatter plot before. These plots are powerful tools for
visualization, despite their simplicity. The main reasons are that they offer many options
and design choices, have strong representational power, and are flexible enough to
represent a graph in attractive ways.

Some examples in which scatter plots are suitable are as follows:

Research studies have successfully established that the number of hours of sleep
required by a person depends on the age of the person.
The average income for adults is based on the number of years of education.


Let's take the first case. The dataset can be found in the form of a CSV file in the GitHub
repository:
headers_cols = ['age', 'min_recommended', 'max_recommended',
                'may_be_appropriate_min', 'may_be_appropriate_max',
                'min_not_recommended', 'max_not_recommended']

# usecols keeps only the listed columns (this assumes the header row of the
# file uses these names)
sleepDf = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv',
                      usecols=headers_cols)
sleepDf.head(10)

Having imported the dataset correctly, let's display a scatter plot. We start by importing the
required libraries and then plotting the actual graph. Next, we display the x-label and the y-
label. The code is given in the following code block:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# A regular scatter plot


plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf['max_recommended'])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()

The scatter plot generated by the preceding code is as follows:


That was not so difficult, was it? Let's see if we can interpret the graph. You can clearly
see that the total number of hours of sleep required by a person is high initially and
gradually decreases as age increases. The resulting graph is interpretable, but due to the
lack of a continuous line, the trend is not self-explanatory. Let's connect the points with
a line and see if that explains the results in a more obvious way:
# Line plot
plt.plot(sleepDf['age']/12., sleepDf['min_recommended'], 'g--')
plt.plot(sleepDf['age']/12., sleepDf['max_recommended'], 'r--')
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()


A line chart of the same data is as follows:

From the graph, it is clear that the two lines decline as the age increases. It shows that
newborns between 0 and 3 months require 14-17 hours of sleep every day.
Meanwhile, adults and the elderly require 7-9 hours of sleep every day. Is your sleeping
pattern within this range?

Let's take another example of a scatter plot using the most popular dataset used in data
science—the Iris dataset. The dataset was introduced by Ronald Fisher in 1936 and is
widely adopted by bloggers, books, articles, and research papers to demonstrate various
aspects of data science and data mining. The dataset holds 50 examples each of three
different species of Iris, named setosa, virginica, and versicolor. Each example has four
different attributes: petal_length, petal_width, sepal_length, and sepal_width.
The dataset can be loaded in several ways.


Here, we are using seaborn to load the dataset:

1. Import seaborn and set some default parameters of matplotlib:


import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 150

2. Use style from seaborn. Try commenting out the next line and see the difference
in the graph:
sns.set()

3. Load the Iris dataset:


df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1,
                                   "virginica": 2})

4. Create a regular scatter plot:


plt.scatter(x=df["sepal_length"], y=df["sepal_width"], c=df.species)

5. Create the labels for the axes:


plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

6. Display the plot on the screen:


plt.show()


The scatter plot generated by the preceding code is as follows:

Do you find this graph informative? We would assume that most of you agree that you can
clearly see three different types of points and that there are three different clusters.
However, it is not clear which color represents which species of Iris. Thus, we are going to
learn how to create legends in the Scatter plot using seaborn section.
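
With plain matplotlib, one way to get a legend is to keep the species names and draw one scatter call per species, each with its own label. The sketch below assumes a handful of rows shaped like the Iris data as a stand-in for the full dataset, so that it runs without downloading anything:

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few rows in the shape of the Iris data (stand-in for sns.load_dataset('iris'))
iris_sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

fig, ax = plt.subplots()
# One scatter call per species, so each group gets its own color and label
for name, group in iris_sample.groupby("species"):
    ax.scatter(group["sepal_length"], group["sepal_width"], label=name)
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")
legend = ax.legend(title="Species")
legend_labels = [text.get_text() for text in legend.get_texts()]
```

Because the species column is not mapped to numbers here, the legend shows the species names directly.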

Bubble chart
A bubble plot is a variation of the scatter plot in which each data point on the graph is
shown as a bubble. Each bubble can be illustrated with a different color, size, and
appearance.

Let's continue using the Iris dataset to get a bubble plot. Here, the important thing to note
is that we are still going to use the plt.scatter method to draw a bubble chart:
# Load the Iris dataset
df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1,
                                   "virginica": 2})

# Create bubble plot
plt.scatter(df.petal_length, df.petal_width,
            s=50*df.petal_length*df.petal_width,
            c=df.species,
            alpha=0.3
            )

# Create labels for axes
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.show()

The bubble chart generated by the preceding code is as follows:

Can you interpret the results? Well, it is not clear from the graph which color represents
which species of Iris. But we can clearly see three different clusters, which clearly indicates
for each specific species or cluster there is a relationship between Petal Length and Petal
Width.
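
The s argument of plt.scatter is interpreted as the marker area in points squared, so the bubble diameter grows with the square root of s. The following sketch applies the same size rule to three rows (the first row of each species in the Iris data) to show how strongly the sizes differ:

```python
import numpy as np

# First row of each species in the Iris data: petal length and petal width
petal_length = np.array([1.4, 4.7, 6.0])
petal_width = np.array([0.2, 1.4, 2.5])

# Same size rule as the bubble chart: area proportional to length * width
sizes = 50 * petal_length * petal_width

# The marker diameter grows with the square root of the area
relative_diameter = np.sqrt(sizes / sizes.min())
print(sizes)
print(relative_diameter)
```

This is why the setosa bubbles appear tiny next to the virginica ones: their areas differ by a factor of more than 50.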

Scatter plot using seaborn


A scatter plot can also be generated using the seaborn library. Seaborn makes the graph
visually better. We can illustrate the relationship between x and y for distinct subsets of the
data by utilizing the size, style, and hue parameters of the scatter plot in seaborn.

Get more detailed information about the parameters from seaborn's
documentation website: https://seaborn.pydata.org/generated/seaborn.scatterplot.html.

Now, let's load the Iris dataset:


df = sns.load_dataset('iris')

df['species'] = df['species'].map({'setosa': 0, "versicolor": 1,
                                   "virginica": 2})
sns.scatterplot(x=df["sepal_length"], y=df["sepal_width"], hue=df.species,
                data=df)

The scatter plot generated from the preceding code is as follows:

In the preceding plot, we can clearly see there are three species of flowers indicated by
three distinct colors. The diagram makes it clearer how the different species of flowers
vary in terms of sepal width and sepal length.


Area plot and stacked plot


The stacked plot owes its name to the fact that it represents the area under a line plot and
that several such plots can be stacked on top of one another, giving the feeling of a stack.
The stacked plot can be useful when we want to visualize the cumulative effect of multiple
variables being plotted on the y axis.

In order to simplify this, think of an area plot as a line plot that shows the area covered by
filling it with a color. Enough talk. Let's dive into the code base. First of all, let's define the
dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000,
                     8000, 9000, 9000, 9000,
                     9000, 8000, 9000, 9000]

# Utilities Bills for a year
utilitiesBills = [4218, 4218, 4218, 4218,
                  4218, 4218, 4219, 2218,
                  3218, 4233, 3000, 3000]

# Transportation bill for a year
transportation = [782, 900, 732, 892,
                  334, 222, 300, 800,
                  900, 582, 596, 222]

# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703,
               704, 705, 706, 707,
               708, 709, 710, 711]

Now, let's import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

months = [x for x in range(1, 13)]

# Create placeholders for plot and add required color
plt.plot([], [], color='sandybrown', label='houseLoanMortgage')
plt.plot([], [], color='tan', label='utilitiesBills')
plt.plot([], [], color='bisque', label='transportation')
plt.plot([], [], color='darkcyan', label='carMortgage')

# Add stacks to the plot
plt.stackplot(months, houseLoanMortgage, utilitiesBills, transportation,
              carMortgage, colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])

plt.legend()

# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')

# Display on the screen
plt.show()

In the preceding snippet, first, we imported matplotlib and seaborn. Nothing new,
right? Then we added stacks with legends. Finally, we added labels to the axes and
displayed the plot on the screen. Easy and straightforward. Now you know how to create
an area plot or stacked plot. The area plot generated by the preceding code is as follows:

Now the most important part is the ability to interpret the graph. In the preceding graph, it
is clear that the house mortgage loan is the largest expense since its band covers the
largest area. The utility bills band covers the second-largest area, and so on. The graph
clearly conveys meaningful information to the targeted audience. Labels, legends, and
colors are important aspects of creating a meaningful visualization.
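
The cumulative effect can also be checked numerically. The sketch below uses the expense lists defined earlier and computes, for each month, the upper edge of every band in the stack, along with the yearly total per category that the band areas reflect:

```python
import numpy as np

houseLoanMortgage = [9000, 9000, 8000, 9000, 8000, 9000,
                     9000, 9000, 9000, 8000, 9000, 9000]
utilitiesBills = [4218, 4218, 4218, 4218, 4218, 4218,
                  4219, 2218, 3218, 4233, 3000, 3000]
transportation = [782, 900, 732, 892, 334, 222,
                  300, 800, 900, 582, 596, 222]
carMortgage = [700, 701, 702, 703, 704, 705,
               706, 707, 708, 709, 710, 711]

layers = np.array([houseLoanMortgage, utilitiesBills,
                   transportation, carMortgage])

# Upper edge of each band: running sum over the categories, month by month
layer_tops = np.cumsum(layers, axis=0)

# Yearly total per category
yearly_totals = layers.sum(axis=1)
print(layer_tops[:, 0])   # upper edges of the four bands in month 1
print(yearly_totals)
```

The top edge of the last band is the total household expense for the month, which is the outline of the whole stacked plot.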


Pie chart
This is one of the more interesting types of data visualization graphs. We say interesting
not because it has a higher preference or higher illustrative capacity, but because it is one of
the most argued-about types of visualization in research.

A paper by Ian Spence in 2005, No Humble Pie: The Origins and Usage of a Statistical Chart,
argues that the pie chart fails to appeal to most experts. Despite similar studies, people have
still chosen to use pie charts. Communities give several arguments for avoiding the pie
chart. One of the arguments is that human beings are naturally poor at
distinguishing differences in slices of a circle at a glance. Another argument is that people
tend to overestimate the size of obtuse angles. Similarly, people seem to underestimate the
size of acute angles.

Having looked at the criticism, let's also look at the positive side. One counterargument is
this: if the pie chart is not communicative, why does it persist? The main reason is that
people love circles. Moreover, the purpose of the pie chart is to communicate proportions
and it is widely accepted. Enough said; let's use the Pokemon dataset to draw a pie chart.
There are two ways in which you can load the data: first, directly from the GitHub URL; or
you can download the dataset from GitHub and reference it from your local machine by
providing the correct path. In either case, you can use the read_csv method from the
pandas library. Check out the following snippet:

# Create URL to CSV file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'

# Load the CSV file into a data frame, using the type column as the index
pokemon = pd.read_csv(url, index_col='type')

pokemon


The preceding code snippet should display the dataframe as follows:

Next, we attempt to plot the pie chart:


import matplotlib.pyplot as plt

plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False,
        startangle=90, autopct='%1.1f%%',)
plt.axis('equal')
plt.show()

We should get the following pie chart from the preceding code:


Did you know you can directly use the pandas library to create a pie chart? Check out the
following one-liner:
pokemon.plot.pie(y="amount", figsize=(20, 10))

The pie chart generated is as follows:


We generated a nice pie chart with a legend using one line of code. This is why Python is
said to be a comedian. Do you know why? Because it has a lot of one-liners. Pretty true,
right?

Table chart
A table chart combines a bar chart and a table. In order to understand the table chart, let's
consider the following dataset. Consider standard LED bulbs that come in different
wattages. The standard Philips LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5
Watts, 13.5 Watts, and 15 Watts. Let's assume there are two categorical variables, the year
and the wattage, and a numeric variable, which is the number of units sold in a particular
year.

Now, let's declare variables to hold the years and the available wattage data. It can be done
as shown in the following snippet:
# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]

# Available watt
columns = ['4.5W', '6.0W', '7.0W', '8.5W', '9.5W', '13.5W', '15W']
unitsSold = [
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 98],
    [75, 143, 90, 113, 89, 75, 93],
    [65, 144, 91, 114, 90, 77, 92],
    [55, 145, 92, 115, 88, 79, 93],
]

# Define the range and scale for the y axis
values = np.arange(0, 600, 100)

We have now prepared the dataset. Let's now try to draw a table chart using the following
code block:
colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7

y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()

cell_text = []

n_rows = len(unitsSold)
for row in range(n_rows):
    plot = plt.bar(index, unitsSold[row], bar_width, bottom=y_offset,
                   color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(['%1.1f' % (x) for x in y_offset])
    i = 0
    # Each iteration of this for loop labels each bar with the
    # corresponding value for the given year
    for rect in plot:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, y_offset[i],
                '%d' % int(y_offset[i]),
                ha='center', va='bottom')
        i = i+1

Finally, let's add the table to the bottom of the chart:


# Add a table to the bottom of the axes
the_table = plt.table(cellText=cell_text, rowLabels=years,
rowColours=colors, colLabels=columns, loc='bottom')
plt.ylabel("Units Sold")
plt.xticks([])
plt.title('Number of LED Bulb Sold/Year')
plt.show()

The preceding code snippets generate a nice table chart, as follows:


Look at the preceding table chart. Do you think it can be easily interpreted? It is pretty
clear, right? Note that the values are cumulative: each row adds that year's sales to the
totals of the previous years. You can see, for example, that by the end of 2014, a total of
345 units of the 4.5-Watt bulb had been sold since 2010. The same information can be read
from the labels on the stacked bars.
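
The numbers written into the table come from y_offset, which is a running total over the years. A short sketch with np.cumsum, using the same unitsSold data, makes the difference between the running total and the sales of a single year explicit:

```python
import numpy as np

years = ["2010", "2011", "2012", "2013", "2014"]
columns = ['4.5W', '6.0W', '7.0W', '8.5W', '9.5W', '13.5W', '15W']
unitsSold = [
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 98],
    [75, 143, 90, 113, 89, 75, 93],
    [65, 144, 91, 114, 90, 77, 92],
    [55, 145, 92, 115, 88, 79, 93],
]

# Running total down the years for each wattage, as stored in cell_text
cumulative = np.cumsum(unitsSold, axis=0)

row = years.index("2014")
col = columns.index("4.5W")
total_45w_by_2014 = cumulative[row, col]   # sum of 2010 through 2014
sold_45w_in_2014 = unitsSold[row][col]     # 2014 alone
print(total_45w_by_2014, sold_45w_in_2014)
```

If per-year figures are preferred in the table, one option is to build cell_text from unitsSold[row] instead of y_offset.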

Polar chart
Do you remember the polar axis from mathematics class? Well, a polar chart is a diagram
that is plotted on a polar axis. Its coordinates are angle and radius, as opposed to the
Cartesian system of x and y coordinates. Sometimes, it is also referred to as a spider web
plot. Let's see how we can plot an example of a polar chart.

First, let's create the dataset:

1. Let's assume you have five courses in your academic year:


subjects = ["C programming", "Numerical methods", "Operating system",
            "DBMS", "Computer Networks"]

2. And you planned to obtain the following grades in each subject:


plannedGrade = [90, 95, 92, 68, 68, 90]

3. However, after your final examination, these are the grades you got:
actualGrade = [75, 89, 89, 80, 80, 75]

Now that the dataset is ready, let's try to create a polar chart. The first significant step is to
initialize the spider plot. This can be done by setting the figure size and polar projection.
This should be clear by now. Note that in the preceding dataset, the list of grades contains
an extra entry. This is because it is a circular plot and we need to connect the first point and
the last point together to form a circular flow. Hence, we copy the first entry from each list
and append it to the list. In the preceding data, the entries 90 and 75 are the first entries of
the list respectively. Let's look at each step:

1. Import the required libraries:


import numpy as np
import matplotlib.pyplot as plt

2. Prepare the dataset and set up theta:


theta = np.linspace(0, 2 * np.pi, len(plannedGrade))


3. Initialize the plot with the figure size and polar projection:
plt.figure(figsize = (10,6))
plt.subplot(polar=True)

4. Get the grid lines to align with each of the subject names:
(lines, labels) = plt.thetagrids(range(0, 360, int(360/len(subjects))),
                                 (subjects))

5. Use the plt.plot method to plot the graph and fill the area under it:
plt.plot(theta, plannedGrade, label='Planned Grades')
plt.fill(theta, plannedGrade, 'b', alpha=0.2)

6. Now, we plot the actual grades obtained:

plt.plot(theta, actualGrade, label='Actual Grades')

7. We add a legend and a nice comprehensible title to the plot:

# The label given to each plt.plot call supplies its legend entry
plt.legend(loc=1)
plt.title("Plan vs Actual grades by Subject")

8. Finally, we show the plot on the screen:


plt.show()

The generated polar chart is shown in the following screenshot:


As illustrated in the preceding output, the planned and actual grades by subject can easily
be distinguished. The legend makes it clear which line indicates the planned grades (the
blue line in the screenshot) and which line indicates actual grades (the orange line in the
screenshot). This gives the target audience a clear indication of the difference between the
planned and actual grades of a student.
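
The step of closing the loop can be done in code rather than by typing the extra entry by hand. This is a minimal sketch; the variable names planned and closed are illustrative and do not appear in the code above:

```python
import numpy as np

subjects = ["C programming", "Numerical methods", "Operating system",
            "DBMS", "Computer Networks"]
planned = [90, 95, 92, 68, 68]   # one grade per subject

# Repeat the first grade at the end so the outline returns to its start
closed = planned + planned[:1]

# One angle per entry; the last angle (2*pi) coincides with the first (0)
theta = np.linspace(0, 2 * np.pi, len(closed))
degrees = np.degrees(theta)
print(closed)
print(degrees)
```

The first five angles are 72 degrees apart, which matches the spacing of the grid lines that plt.thetagrids draws for five subjects.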

Histogram
Histogram plots are used to depict the distribution of any continuous variable. These types
of plots are very popular in statistical analysis.

Consider the following use case. A survey conducted during vocational training sessions
for developers had 100 participants. Their Python programming experience ranged from
0 to 20 years.

Let's import the required libraries and create the dataset:


import numpy as np
import matplotlib.pyplot as plt

#Create data set


yearsOfExperience = np.array([10, 16, 14, 5, 10, 11, 16, 14, 3, 14, 13, 19,
2, 5, 7, 3, 20,
11, 11, 14, 2, 20, 15, 11, 1, 15, 15, 15, 2, 9, 18, 1, 17, 18,
13, 9, 20, 13, 17, 13, 15, 17, 10, 2, 11, 8, 5, 19, 2, 4, 9,
17, 16, 13, 18, 5, 7, 18, 15, 20, 2, 7, 0, 4, 14, 1, 14, 18,
8, 11, 12, 2, 9, 7, 11, 2, 6, 15, 2, 14, 13, 4, 6, 15, 3,
6, 10, 2, 11, 0, 18, 0, 13, 16, 18, 5, 14, 7, 14, 18])
yearsOfExperience

In order to plot the histogram chart, execute the following steps:

1. Plot the distribution of group experience:


nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins)

2. Add labels to the axes and a title:


plt.xlabel("Years of experience with Python Programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the vocational training session")


3. Draw a green vertical line in the graph at the average experience:


plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color = 'g')

4. Display the plot:


plt.show()

The preceding code generates the following histogram:

Now, from the graph, we can say that the average experience of the
participants is around 10 years. Can we improve the graph for better readability? How
about we normalize the histogram so that the bars show a density (with a total area of 1)
instead of raw counts? In addition to that, we can also plot a normal distribution using the
mean and standard deviation of this data to see the distribution pattern. If you're not sure
what a normal distribution is, we suggest you go through the references in Chapter 1,
Exploratory Data Analysis Fundamentals. In a nutshell, the normal distribution is also
referred to as the Gaussian distribution. The term indicates a probability distribution that is
symmetrical about the mean, illustrating that data near the average (mean) is more
frequent than data far from the mean. Enough theory; let's dive into the practice.


To plot the distribution, we can add a density=1 parameter in the plt.hist function.
Let's go through the code. Note that there are changes in steps 1, 4, 5, and 6. The rest of the
code is the same as the preceding example:

1. Plot the distribution of group experience:


plt.figure(figsize = (10,6))

nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins, density=1)

2. Add labels to the axes and a title:


plt.xlabel("Years of experience with Python Programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the vocational training session")

3. Draw a green vertical line in the graph at the average experience:


plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color = 'g')

4. Compute the mean and standard deviation of the dataset:


mu = yearsOfExperience.mean()
sigma = yearsOfExperience.std()

5. Add a best-fit line for the normal distribution:


y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
     np.exp(-0.5 * (1 / sigma * (bins - mu))**2))

6. Plot the normal distribution:


plt.plot(bins, y, '--')

7. Display the plot:


plt.show()

And the generated histogram with the normal distribution is as follows:


The preceding plot clearly shows that the data does not follow a normal distribution: many
vertical bars lie well above or below the best-fit curve for a normal distribution.
Perhaps you are wondering where we got the formula used in step 5 of the preceding
code. Well, there is a little theory involved here. For a normal distribution with mean mu
and standard deviation sigma, the probability density function is the Gaussian
distribution function, which we computed as ((1 / (np.sqrt(2 * np.pi) * sigma)) *
np.exp(-0.5 * (1 / sigma * (bins - mu))**2)).
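
The same expression can be wrapped in a small function and checked numerically. This is a minimal sketch: the function name normal_pdf and the values of mu and sigma are assumptions chosen for illustration, not part of the chapter's dataset. In plain notation, the function computes f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative parameters (assumed, not taken from the dataset).
mu, sigma = 10.0, 3.0

# Evaluate the density on a grid that is symmetric around the mean.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 4001)
y = normal_pdf(x, mu, sigma)

# Trapezoidal rule written out by hand: the area under the curve should be close to 1.
area = float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# The curve peaks at the mean and is symmetric about it.
peak_x = float(x[np.argmax(y)])
is_symmetric = bool(np.allclose(y, y[::-1]))

print(area, peak_x, is_symmetric)
```

The checks mirror the definition given earlier: the density is highest at the mean, falls off equally on both sides, and encloses a total area of 1.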

Lollipop chart
A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar
chart.

Let's consider the carDF dataset. It can be found in the GitHub repository for chapter 2.
Alternatively, it can be loaded directly from the GitHub link, as shown in the following
code:

1. Load the dataset:


#Read the dataset
carDF = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/cardata.csv')

2. Group the dataset by manufacturer. For now, if it does not make sense, just
remember that the following snippet groups the entries by a particular field (we
will go through groupby functions in detail in Chapter 4, Data Transformation):
#Group by manufacturer and take average mileage
processedDF = carDF[['cty','manufacturer']].groupby('manufacturer').mean()

3. Sort the values by cty and reset the index (again, we will go through sorting
and how we reset the index in Chapter 4, Data Transformation):
#Sort the values by cty and reset index
processedDF.sort_values('cty', inplace=True)
processedDF.reset_index(inplace=True)

4. Plot the graph:


#Plot the graph
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.vlines(x=processedDF.index, ymin=0, ymax=processedDF.cty,
color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=processedDF.index, y=processedDF.cty, s=75,
color='firebrick', alpha=0.7)

5. Annotate the title:


#Annotate Title
ax.set_title('Lollipop Chart for City Mileage using car dataset', fontdict={'size':22})

6. Annotate labels, xticks, and ylims:


ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(processedDF.index)
ax.set_xticklabels(processedDF.manufacturer.str.upper(),
rotation=65, fontdict={'horizontalalignment': 'right', 'size':12})
ax.set_ylim(0, 30)

7. Write the actual mean values in the plot, and display the plot:
#Write the values in the plot
for row in processedDF.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2),
            horizontalalignment= 'center', verticalalignment='bottom',
            fontsize=14)

#Display the plot on the screen
plt.show()

The lollipop chart generated by the preceding snippet is as follows:

Having seen the preceding output, you now know why it is called a lollipop chart, don't
you? The line and the circle on top give a nice illustration of the different car
manufacturers and their average mileage in miles per gallon. Now, the data makes more
sense, doesn't it?
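
The grouping, sorting, and plotting steps above can also be tried without downloading the dataset. The following is a minimal sketch on a tiny hand-made DataFrame: the name toyDF and every value in it are invented for illustration and are not taken from cardata.csv. The Agg backend line is only there so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; these are not values from cardata.csv.
toyDF = pd.DataFrame({
    "manufacturer": ["audi", "audi", "honda", "honda", "toyota", "toyota"],
    "cty": [18, 20, 24, 28, 21, 19],
})

# Average cty per manufacturer, ordered from lowest to highest,
# with the manufacturer moved back from the index into a column.
ranked = toyDF.groupby("manufacturer")["cty"].mean().sort_values().reset_index()

fig, ax = plt.subplots(figsize=(8, 5))
# Stick of each lollipop: a vertical line from 0 up to the mean value.
ax.vlines(x=ranked.index, ymin=0, ymax=ranked["cty"],
          color="firebrick", alpha=0.7, linewidth=2)
# Head of each lollipop: a dot at the mean value.
ax.scatter(x=ranked.index, y=ranked["cty"], s=75, color="firebrick", alpha=0.7)

ax.set_xticks(list(ranked.index))
ax.set_xticklabels(list(ranked["manufacturer"].str.upper()), rotation=65, ha="right")
ax.set_ylabel("Miles Per Gallon")
ax.set_ylim(0, 30)
```

Sorting before plotting is what turns the chart into a ranking: the x position of each lollipop is simply its row number after sorting. In a notebook, remove the backend line and call plt.show() to display the figure.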

Choosing the best chart


There is no standard that defines which chart you should choose to visualize your data.
However, there are some guidelines that can help you. Here are some of them:

As mentioned with each of the preceding charts that we have seen, it is
important to understand what type of data you have. If you have continuous
variables, then a histogram would be a good choice. Similarly, if you want to
show ranking, an ordered bar chart would be a good choice.
Choose the chart that effectively conveys the right and relevant meaning of the
data without actually distorting the facts.
Simplicity is best. It is considered better to draw a simple chart that is
comprehensible than to draw sophisticated ones that require several reports and
texts in order to understand them.
Choose a diagram that does not overload the audience with information. Our
purpose should be to illustrate abstract information in a clear way.

Having said that, let's see if we can generalize some categories of charts based on various
purposes.

The following table shows the different types of charts based on the purposes:

Purpose | Charts
Show correlation | Scatter plot; Correlogram; Pairwise plot; Jittering with strip plot; Counts plot; Marginal histogram; Scatter plot with a line of best fit; Bubble plot with circling
Show deviation | Area chart; Diverging bars; Diverging texts; Diverging dot plot; Diverging lollipop plot with markers
Show distribution | Histogram for continuous variable; Histogram for categorical variable; Density plot; Categorical plots; Density curves with histogram; Population pyramid; Violin plot; Joy plot; Distributed dot plot; Box plot
Show composition | Waffle chart; Pie chart; Treemap; Bar chart
Show change | Time series plot; Time series with peaks and troughs annotated; Autocorrelation plot; Cross-correlation plot; Multiple time series; Plotting with different scales using the secondary y axis; Stacked area chart; Seasonal plot; Calendar heat map; Area chart unstacked
Show groups | Dendrogram; Cluster plot; Andrews curve; Parallel coordinates
Show ranking | Ordered bar chart; Lollipop chart; Dot plot; Slope plot; Dumbbell plot

Note that going through each and every type of plot mentioned in the table is beyond the
scope of this book. However, we have tried to cover most of them in this chapter. A few of
them will be used in the upcoming chapters; we will use these graphs in more contextual
ways and with advanced settings.

Other libraries to explore


So far, we have seen different types of 2D and 3D visualization techniques using
matplotlib and seaborn. Apart from these widely used Python libraries, there are other
libraries that you can explore:

Plotly (https://plot.ly/python/): This is a web-application-based toolkit for
visualization. Its API for Jupyter Notebook and other applications makes it a
powerful tool for representing 2D and 3D charts.
Ggplot (http://ggplot.yhathq.com/): This is a Python implementation based
on the Grammar of Graphics library from the R programming language.
Altair (https://altair-viz.github.io/): This is a declarative statistical
visualization library built on top of the powerful Vega-Lite visualization
grammar. In addition to that, it has a very descriptive and simple API.


Summary
Portraying data, events, concepts, information, processes, or methods graphically has
always been regarded as highly comprehensible on the one hand and easily
marketable on the other. Presenting results to stakeholders is very complex in the sense that
our audience may not be technical enough to understand programming jargon and
technicalities. Hence, visual aids are widely used. In this chapter, we discussed how to use
such data visualization tools.

In the next chapter, we are going to get started with exploratory data analysis in a very
simple way. We will analyze our mailbox to see what types of emails we send
and receive.

Further reading
Matplotlib 3.0 Cookbook, Srinivasa Rao Poladi, Packt Publishing, October 22, 2018
Matplotlib Plotting Cookbook, Alexandre Devert, Packt Publishing, March 26, 2014
Data Visualization with Python, Mario Döbler, Tim Großmann, Packt
Publishing, February 28, 2019
No Humble Pie: The Origins and Usage of a Statistical Chart, Ian Spence, University of
Toronto, 2005.
