EDA Module 2
Line chart
Bar chart
Scatter plot
Area plot and stacked plot
Pie chart
Table chart
Polar chart
Histogram
Lollipop chart
Choosing the best chart
Other libraries to explore
Visual Aids for EDA Chapter 2
Technical requirements
You can find the code for this chapter on GitHub:
https://github.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python.
To get the best out of this chapter, ensure the following:
Make sure you have Python 3.X installed on your computer. A notebook environment
such as Jupyter, which ships with the Anaconda distribution, is recommended.
You must have Python libraries such as pandas, seaborn, and matplotlib
installed.
Line chart
Do you remember what a continuous variable is and what a discrete variable is? If not,
have a quick look at Chapter 1, Exploratory Data Analysis Fundamentals. Back to the main
topic, a line chart is used to illustrate the relationship between two or more continuous
variables.
We are going to use the matplotlib library and the stock price data to plot time series
lines. First of all, let's understand the dataset. We have created a function using the faker
Python library to generate the dataset. It is the simplest possible dataset you can imagine,
with just two columns. The first column is Date and the second column is Price,
indicating the stock price on that date.
Let's generate the dataset by calling the helper method. We have also saved the data as a
CSV file, so you can optionally load it with the pandas read_csv function and
proceed with visualization.
def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    # ... (the loop that fills listdata and builds df is not shown)
    return df
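The body of the helper is not reproduced here, so the following is only a minimal sketch of how such a generator could look. It swaps in the standard library random module instead of faker, and the function name generate_prices, the seed, and the 900 to 1000 price range are assumptions made for illustration:

```python
import datetime
import random

import pandas as pd


def generate_prices(n, seed=42):
    """Return a DataFrame indexed by Date with one mean Price per day."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    start = datetime.date(2019, 8, 1)
    end = datetime.date(2019, 8, 30)
    span_days = (end - start).days

    rows = []
    for _ in range(n):
        # Pick a random day in [start, end] and a made-up price for it.
        day = start + datetime.timedelta(days=rng.randint(0, span_days))
        price = round(rng.uniform(900, 1000), 4)  # hypothetical price range
        rows.append([day, price])

    df = pd.DataFrame(rows, columns=["Date", "Price"])
    df["Date"] = pd.to_datetime(df["Date"])
    # Several draws can land on the same day, so average them per date.
    return df.groupby("Date").mean().sort_index()


prices = generate_prices(50)
```

Because several random draws can share a date, the result usually has fewer than n rows, one per distinct day.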
Having defined the method to generate data, let's get the data into a pandas dataframe and
check the first 10 entries:
df = generateData(50)
df.head(10)
Steps involved
Let's look at the process of creating the line chart:
1. Load and prepare the dataset. We will learn more about how to prepare data in
Chapter 4, Data Transformation. For this exercise, all the data is preprocessed.
2. Import the matplotlib library. It can be done with this command:
import matplotlib.pyplot as plt
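The remaining plotting steps might look like the sketch below. It assumes a data frame with a Date index and a Price column, as described for the stock dataset; the five prices, the figure size, and the title are invented for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the stock data: a Date index and a Price column.
dates = pd.date_range("2019-08-01", periods=5, freq="D")
df = pd.DataFrame({"Price": [950.1, 962.4, 948.7, 971.3, 965.0]}, index=dates)
df.index.name = "Date"

fig, ax = plt.subplots(figsize=(14, 6))
(line,) = ax.plot(df.index, df["Price"])  # one line: price against date
ax.set_title("Price by Date")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

In a notebook the figure renders at the end of the cell; in a script, call plt.show() to display it.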
In the preceding example, we assume the data is available in the CSV format. In real-life
scenarios, the data is mostly available in CSV, JSON, Excel, or XML formats and is mostly
disseminated through some standard API. For this series, we assume you are already
familiar with pandas and how to read different types of files. If not, it's time to review
pandas. Refer to the pandas documentation for further details:
https://pandas-datareader.readthedocs.io/en/latest/.
Bar charts
This is one of the most common types of visualization, and almost everyone has
encountered it. Bars can be drawn horizontally or vertically to represent categorical
variables.
Bar charts are frequently used to compare values across distinct categories or to
track variations over time. In most cases, bar charts are very convenient when the changes
are large. In order to learn about bar charts, let's assume a pharmacy in Norway keeps track
of the amount of Zoloft sold every month. Zoloft is a medicine prescribed to patients
suffering from depression. We can use the calendar Python library to keep track of the
months of the year (1 to 12) corresponding to January to December:
2. Set up the data. Remember, the range stopping parameter is exclusive, meaning
if you generate a range from (1, 13), the last item, 13, is not included:
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
6. This step is optional, depending upon whether you are interested in displaying
the data value at the top of each bar. Showing the actual number of items sold on
the bar itself makes the chart more meaningful:
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')
months and sold_quantity are Python lists representing the amount of Zoloft
sold every month.
We are using the subplots() method in the preceding code. Why? Well, it
provides a way to define the layout of the figure in terms of the number of
graphs and provides ways to organize them. Still confused? Don't worry, we will
be using subplots plenty of times in this chapter. Moreover, if you need a quick
reference, Packt has several books explaining matplotlib. Some of the most
interesting reads are listed in the Further reading section of this
chapter.
In step 3, we use the plt.xticks() function, which allows us to set the
x axis ticks from 1 to 12, whereas calendar.month_name[1:13] converts this
numerical format into the corresponding month names from the calendar Python library.
Step 4 actually draws the bars with months and quantity sold.
axis.text() within the for loop annotates each bar with its corresponding
value. How it does this might be interesting. We place each value by taking the
bar's x coordinate and adding half the bar width to it, and by using 1.002 times
the bar height as the y coordinate. Then, using the va and ha
arguments, we align the text centrally over the bar.
The final step displays the graph on the screen.
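Put together, the steps described above could be sketched as follows. This is an illustrative reconstruction rather than the chapter's exact listing; the random seed and the tick rotation are assumptions added here:

```python
import calendar
import random

import matplotlib.pyplot as plt

random.seed(0)  # assumption: seeded only to make the sketch repeatable

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]

fig, axis = plt.subplots()
bars = axis.bar(months, sold_quantity)

# Label ticks 1..12 with month names; month_name[0] is an empty string, hence [1:13].
axis.set_xticks(months)
axis.set_xticklabels(calendar.month_name[1:13], rotation=20)

# Write each bar's value just above its top, centred horizontally.
for rectangle in bars:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2.0,
              1.002 * height, "%d" % int(height),
              ha="center", va="bottom")
```

Calling plt.show() afterwards displays the chart when running outside a notebook.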
As mentioned in the introduction to this section, bars can be either horizontal
or vertical. Let's change to a horizontal format. All the code remains the same,
except plt.xticks() changes to plt.yticks() and plt.bar() changes to plt.barh().
We assume it is self-explanatory.
In addition to this, placing the exact data values is a bit tricky and requires a few iterations
of trial and error to place them perfectly. But let's see them in action:
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
plt.show()
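A hedged sketch of the horizontal variant follows. The label offset of 1 unit and the left/center alignment are choices made for this example, not values taken from the chapter, and they are the part that typically needs trial and error:

```python
import calendar
import random

import matplotlib.pyplot as plt

random.seed(0)  # assumption: seeded only to make the sketch repeatable

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]

fig, axis = plt.subplots()
bars = axis.barh(months, sold_quantity)  # barh: the value becomes the bar width

# Month names now go on the y axis.
axis.set_yticks(months)
axis.set_yticklabels(calendar.month_name[1:13])

# Place each value just right of the bar end, centred on the bar's thickness.
for rectangle in bars:
    width = rectangle.get_width()
    axis.text(width + 1,
              rectangle.get_y() + rectangle.get_height() / 2.0,
              "%d" % int(width), ha="left", va="center")
```

Note that for horizontal bars get_width() returns the data value, while get_height() returns the thickness of the bar.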
Well, that's all about the bar chart in this chapter. We are certainly going to use several
other attributes in the subsequent chapters. Next, we are going to visualize data using a
scatter plot.
Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinate system to display values of typically two
variables for a set of data.
Whether you are an expert data scientist or a beginner computer science student, you have
no doubt encountered some form of scatter plot before. These plots are powerful tools for
visualization, despite their simplicity. The main reasons are that they offer many options,
considerable representational power, and a wide range of design choices, and are flexible
enough to represent a graph in attractive ways.
When should we use a scatter plot? Scatter plots suit cases such as the following two,
where one variable is studied in relation to another:
Research studies have successfully established that the number of hours of sleep
required by a person depends on the age of the person.
The average income for adults is based on the number of years of education.
Let's take the first case. The dataset can be found in the form of a CSV file in the GitHub
repository:
import pandas as pd

headers_cols = ['age', 'min_recommended', 'max_recommended',
                'may_be_appropriate_min', 'may_be_appropriate_max',
                'min_not_recommended', 'max_not_recommended']
# read_csv has no columns argument; usecols selects these columns by name,
# assuming the header row of the CSV file uses the same names
sleepDf = pd.read_csv(
    'https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv',
    usecols=headers_cols)
sleepDf.head(10)
Having imported the dataset correctly, let's display a scatter plot. We start by importing the
required libraries and then plotting the actual graph. Next, we display the x-label and the y-
label. The code is given in the following code block:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
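The plotting calls themselves can be sketched as follows. To keep the example self-contained it does not download the CSV file; the nine rows below are hypothetical values shaped like the age (in months), min_recommended, and max_recommended columns, and only matplotlib is used:

```python
import matplotlib.pyplot as plt

# Hypothetical rows in the shape of sleep_vs_age.csv: age in months, hours of sleep.
age_months = [1, 6, 18, 48, 120, 192, 300, 540, 840]
min_recommended = [14, 12, 11, 10, 9, 8, 7, 7, 7]
max_recommended = [17, 15, 14, 13, 11, 10, 9, 9, 8]
age_years = [m / 12.0 for m in age_months]  # convert months to years for the x axis

fig, ax = plt.subplots()
low = ax.scatter(age_years, min_recommended, color="green", label="min_recommended")
high = ax.scatter(age_years, max_recommended, color="red", label="max_recommended")
ax.set_xlabel("Age of person in Years")
ax.set_ylabel("Total hours of sleep required")
ax.legend()
```

With the real data frame, the same calls would take sleepDf['age'] / 12 and the two recommended columns as arguments.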
That was not so difficult, was it? Let's see if we can interpret the graph. You can clearly
see that the total number of hours of sleep required by a person is high initially and
gradually decreases as age increases. The resulting graph is interpretable, but without a
continuous line the trend is not immediately obvious. Let's draw lines through the points
and see if that explains the results in a more obvious way:
# Line plot
plt.plot(sleepDf['age']/12., sleepDf['min_recommended'], 'g--')
plt.plot(sleepDf['age']/12., sleepDf['max_recommended'], 'r--')
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
From the graph, it is clear that the two lines decline as the age increases. It shows that
newborns between 0 and 3 months require about 14-17 hours of sleep every day.
Meanwhile, adults and the elderly require 7-9 hours of sleep every day. Is your sleeping
pattern within this range?
Let's take another example of a scatter plot using the most popular dataset used in data
science—the Iris dataset. The dataset was introduced by Ronald Fisher in 1936 and is
widely adopted by bloggers, books, articles, and research papers to demonstrate various
aspects of data science and data mining. The dataset holds 50 examples each of three
different species of Iris, named setosa, virginica, and versicolor. Each example has four
different attributes: petal_length, petal_width, sepal_length, and sepal_width.
The dataset can be loaded in several ways.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 150
2. Use the style from seaborn. Try commenting out the next line and see the difference
in the graph:
sns.set()
Do you find this graph informative? We would assume that most of you agree that you can
clearly see three different types of points and that there are three different clusters.
However, it is not clear which color represents which species of Iris. Thus, we are going to
learn how to create legends in the Scatter plot using seaborn section.
Bubble chart
A bubble plot is a variation of the scatter plot where each data point on the graph is
shown as a bubble. Each bubble can be illustrated with a different color, size, and
appearance.
Let's continue using the Iris dataset to get a bubble plot. Here, the important thing to note
is that we are still going to use the plt.scatter method to draw a bubble chart:
# Load the Iris dataset
df = sns.load_dataset('iris')
# Bubble size scales with petal length times petal width; species names are
# converted to integer codes because c expects colors or numbers, not labels
plt.scatter(df.petal_length, df.petal_width,
            s=50*df.petal_length*df.petal_width,
            c=df.species.astype('category').cat.codes,
            alpha=0.3
            )
Can you interpret the results? Well, it is not clear from the graph which color represents
which species of Iris. But we can clearly see three different clusters, which indicates
that for each species or cluster there is a relationship between Petal Length and Petal
Width.
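One way to make the colors interpretable is to add a legend built from the scatter artist itself. The sketch below is an assumption-laden illustration: it uses synthetic measurements generated with NumPy in place of the Iris data (so it runs without a download), and the cluster centres and the viridis colormap are invented for the example:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Iris: 20 points for each of three species codes (0, 1, 2).
species_codes = np.repeat([0, 1, 2], 20)
petal_length = rng.normal(loc=[1.5, 4.3, 5.5], scale=0.2, size=(20, 3)).T.ravel()
petal_width = rng.normal(loc=[0.3, 1.3, 2.0], scale=0.05, size=(20, 3)).T.ravel()

fig, ax = plt.subplots()
bubbles = ax.scatter(petal_length, petal_width,
                     s=50 * petal_length * petal_width,  # size grows with both measurements
                     c=species_codes,                     # numeric codes mapped through the colormap
                     cmap="viridis", alpha=0.3)
ax.set_xlabel("Petal Length")
ax.set_ylabel("Petal Width")

# One legend handle per distinct color code, labelled with the species names.
handles, _ = bubbles.legend_elements(prop="colors")
ax.legend(handles, ["setosa", "versicolor", "virginica"], title="Species")
```

The label order relies on the codes being assigned alphabetically, which is what astype('category').cat.codes does for the species column.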
In the preceding plot, we can clearly see there are three species of flowers indicated by
three distinct colors. The diagram makes it clearer how the different species of flowers
vary in terms of the sepal width and the length.
Area plot and stacked plot
In order to simplify this, think of an area plot as a line plot that shows the area covered by
filling it with a color. Enough talk. Let's dive into the code base. First of all, let's define the
dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000,
8000, 9000, 9000, 9000,
9000, 8000, 9000, 9000]
Now, let's import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.legend()
# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
In the preceding snippet, first, we imported matplotlib and seaborn. Nothing new,
right? Then we added stacks with legends. Finally, we added labels to the axes and
displayed the plot on the screen. Easy and straightforward. Now you know how to create
an area plot or stacked plot. The area plot generated by the preceding code is as follows:
Now the most important part is the ability to interpret the graph. In the preceding graph, it
is clear that the house mortgage loan is the largest expense since the area under the curve
for the house mortgage loan is the largest. Secondly, the area of utility bills stack covers the
second-largest area, and so on. The graph clearly disseminates meaningful information to
the targeted audience. Labels, legends, and colors are important aspects of creating a
meaningful visualization.
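A compact sketch of a stacked area plot follows. Only houseLoanMortgage comes from the text; the utilitiesBills and food series are hypothetical values invented so that the stack has more than one layer:

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))

# From the text: house loan mortgage cost per month for a year.
houseLoanMortgage = [9000, 9000, 8000, 9000,
                     8000, 9000, 9000, 9000,
                     9000, 8000, 9000, 9000]
# Hypothetical expense series, invented for this sketch.
utilitiesBills = [4200, 4100, 4300, 4200, 4000, 4100,
                  4200, 4300, 4200, 4100, 4200, 4300]
food = [1500, 1400, 1600, 1500, 1450, 1500,
        1550, 1500, 1600, 1450, 1500, 1650]

fig, ax = plt.subplots()
# Each series is drawn on top of the running total of the ones before it.
layers = ax.stackplot(months, houseLoanMortgage, utilitiesBills, food,
                      labels=["House loan mortgage", "Utility bills", "Food"])
ax.legend(loc="upper left")
ax.set_title("Household Expenses")
ax.set_xlabel("Months of the year")
ax.set_ylabel("Cost")
```

Passing labels to stackplot is what lets legend() name each layer, which is essential for reading which area belongs to which expense.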
Pie chart
This is one of the more interesting types of data visualization graphs. We say interesting
not because it has a higher preference or higher illustrative capacity, but because it is one of
the most argued-about types of visualization in research.
A paper by Ian Spence in 2005, No Humble Pie: The Origins and Usage of a Statistical Chart,
argues that the pie chart fails to appeal to most experts. Despite such studies, people
still choose to use pie charts. Several arguments have been made against the pie chart.
One is that human beings are naturally poor at distinguishing differences in slices of a
circle at a glance. Another is that people tend to overestimate the size of obtuse angles
and underestimate the size of acute angles.
Having looked at the criticism, let's also have some positivity. One counterargument is this:
if the pie chart is not communicative, why does it persist? The main reason is that people
love circles. Moreover, the purpose of the pie chart is to communicate proportions, and it is
widely accepted. Enough said; let's use the Pokemon dataset to draw a pie chart. There are
two ways in which you can load the data: directly from the GitHub URL, or by downloading
the dataset from GitHub and referencing it from your local machine by
providing the correct path. In either case, you can use the read_csv method from the
pandas library. Check out the following snippet:
# Load the CSV file into a data frame, using the type column as the index
pokemon = pd.read_csv(url, index_col='type')
pokemon
We should get the following pie chart from the preceding code:
Did you know that you can use the pandas library directly to create a pie chart? Check out
the following one-liner:
pokemon.plot.pie(y="amount", figsize=(20, 10))
We generated a nice pie chart with a legend using one line of code. This is why Python is
said to be a comedian. Do you know why? Because it has a lot of one-liners. Pretty true,
right?
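For comparison, here is a sketch of the same kind of chart drawn with matplotlib directly. The four types and their amounts are hypothetical stand-ins for the Pokemon CSV, which is not downloaded here; only the type index and the amount column name follow the snippets above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts per type; the real values come from the Pokemon CSV file.
pokemon = pd.DataFrame(
    {"amount": [10, 25, 15, 50]},
    index=pd.Index(["fire", "water", "grass", "normal"], name="type"),
)

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    pokemon["amount"],
    labels=list(pokemon.index),
    autopct="%1.1f%%",  # print each slice's share of the total
)
ax.axis("equal")  # equal aspect ratio keeps the pie circular
```

The autopct argument is what writes the percentage on each slice, which helps with the known difficulty of judging angles by eye.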
Table chart
A table chart combines a bar chart and a table. In order to understand the table chart, let's
consider the following dataset. Consider standard LED bulbs that come in different
wattages. The standard Philips LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5
Watts, 13.5 Watts, and 15 Watts. Let's assume there are two categorical variables, the year
and the wattage, and a numeric variable, which is the number of units sold in a particular
year.
Now, let's declare variables to hold the years and the available wattage data. It can be done
as shown in the following snippet:
# Years under consideration
years = ["2010", "2011", "2012", "2013", "2014"]
# Available watt
columns = ['4.5W', '6.0W', '7.0W','8.5W','9.5W','13.5W','15W']
unitsSold = [
[65, 141, 88, 111, 104, 71, 99],
[85, 142, 89, 112, 103, 73, 98],
[75, 143, 90, 113, 89, 75, 93],
[65, 144, 91, 114, 90, 77, 92],
[55, 145, 92, 115, 88, 79, 93],
]
We have now prepared the dataset. Let's now try to draw a table chart using the following
code block:
colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7
y_offset = np.zeros(len(columns))
fig, ax = plt.subplots()
cell_text = []
n_rows = len(unitsSold)
Look at the preceding table chart. Do you think it can be easily interpreted? It is pretty
clear, right? You can see, for example, that by the end of 2014, a total of 345 units of the
4.5-Watt bulb had been sold since 2010 (65 + 85 + 75 + 65 + 55), of which 55 were sold in
2014 itself. The figures for the other wattages can be read from the chart in the same way.
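The rest of the table chart can be sketched as follows. This is a hedged reconstruction that assumes the bars are stacked year on year and that the table cells hold the running totals, which is consistent with the figure of 345 quoted for the 4.5-Watt bulb; the axis label, title, and margins are invented:

```python
import matplotlib.pyplot as plt
import numpy as np

years = ["2010", "2011", "2012", "2013", "2014"]
columns = ['4.5W', '6.0W', '7.0W', '8.5W', '9.5W', '13.5W', '15W']
unitsSold = [
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 98],
    [75, 143, 90, 113, 89, 75, 93],
    [65, 144, 91, 114, 90, 77, 92],
    [55, 145, 92, 115, 88, 79, 93],
]

colors = plt.cm.OrRd(np.linspace(0, 0.7, len(years)))
index = np.arange(len(columns)) + 0.3
bar_width = 0.7
y_offset = np.zeros(len(columns))

fig, ax = plt.subplots()
cell_text = []
for row in range(len(unitsSold)):
    # Stack this year's bars on top of the running total of earlier years.
    ax.bar(index, unitsSold[row], bar_width, bottom=y_offset, color=colors[row])
    y_offset = y_offset + unitsSold[row]
    cell_text.append(["%d" % x for x in y_offset])  # running totals, not per-year sales

# Attach the table below the axes, one colored row per year.
table = ax.table(cellText=cell_text, rowLabels=years, rowColours=colors,
                 colLabels=columns, loc="bottom")
ax.set_xticks([])  # the table's column labels replace the x ticks
ax.set_ylabel("Units sold (cumulative)")
ax.set_title("LED bulbs sold per year")
fig.subplots_adjust(left=0.2, bottom=0.25)  # leave room for the table
```

Because each table row is a running total, the last row gives the total sold over all five years for each wattage.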
Polar chart
Do you remember the polar axis from mathematics class? Well, a polar chart is a diagram
that is plotted on a polar axis. Its coordinates are angle and radius, as opposed to the
Cartesian system of x and y coordinates. Sometimes, it is also referred to as a spider web
plot. Let's see how we can plot an example of a polar chart.
3. However, after your final examination, these are the grades you got:
actualGrade = [75, 89, 89, 80, 80, 75]
Now that the dataset is ready, let's try to create a polar chart. The first significant step is to
initialize the spider plot. This can be done by setting the figure size and polar projection.
Note that in the preceding dataset, each list of grades contains
an extra entry. This is because it is a circular plot and we need to connect the first point and
the last point together to form a closed loop. Hence, we copy the first entry of each list
and append it to the end of that list. In the preceding data, the entries 90 and 75 are the first
entries of the planned and actual grade lists, respectively. Let's look at each step:
3. Initialize the plot with the figure size and polar projection:
plt.figure(figsize = (10,6))
plt.subplot(polar=True)
4. Get the grid lines to align with each of the subject names:
(lines,labels) = plt.thetagrids(range(0,360,
int(360/len(subjects))),
(subjects))
5. Use the plt.plot method to plot the graph and fill the area under it:
plt.plot(theta, plannedGrade)
plt.fill(theta, plannedGrade, 'b', alpha=0.2)
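The steps above can be combined into one runnable sketch. The actualGrade list and the leading 90 of plannedGrade come from the text; the subject names and the remaining planned grades are hypothetical, and the axes-level calls (add_subplot, set_thetagrids) are used here in place of the plt functions:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical subject names; five subjects give one spoke every 72 degrees.
subjects = ["C programming", "Numerical methods", "Operating system",
            "DBMS", "Computer Networks"]
# First entry (90) is from the text; the other planned grades are invented.
# The first value is repeated at the end to close the loop.
plannedGrade = [90, 95, 92, 68, 68, 90]
actualGrade = [75, 89, 89, 80, 80, 75]  # from the text

# One angle per grade entry, running from 0 back round to 2*pi.
theta = np.linspace(0, 2 * np.pi, len(plannedGrade))

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(projection="polar")

# Align the angular grid lines with the subjects.
ax.set_thetagrids(np.arange(0, 360, 360 // len(subjects)), subjects)

ax.plot(theta, plannedGrade, label="Planned Grades")
ax.fill(theta, plannedGrade, "b", alpha=0.2)
ax.plot(theta, actualGrade, label="Actual Grades")
ax.legend()
ax.set_title("Planned grades versus actual grades by subject")
```

Since theta ends at 2*pi, the duplicated last grade lands on the same spoke as the first, which is what closes each outline.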
As illustrated in the preceding output, the planned and actual grades by subject can easily
be distinguished. The legend makes it clear which line indicates the planned grades (the
blue line in the screenshot) and which line indicates actual grades (the orange line in the
screenshot). This gives the target audience a clear indication of the difference between a
student's planned and actual grades.
Histogram
Histogram plots are used to depict the distribution of any continuous variable. These types
of plots are very popular in statistical analysis.
Consider the following use case. A survey conducted during vocational training sessions for
developers had 100 participants. Their Python programming experience ranged from
0 to 20 years.
Much better, right? Now, from the graph, we can say that the average experience of the
participants is around 10 years. Can we improve the graph for better readability? How
about we try to plot the percentage of the sum of all the entries in yearsOfExperience? In
addition to that, we can also plot a normal distribution using the mean and standard
deviation of this data to see the distribution pattern. If you're not sure what a normal
distribution is, we suggest you go through the references in Chapter 1, Exploratory Data
Analysis Fundamentals. In a nutshell, the normal distribution is also referred to as the
Gaussian distribution. The term indicates a probability distribution that is symmetrical
about the mean, illustrating that data near the average (mean) is more frequent than data
far from the mean. Enough theory; let's dive into the practice.
To plot the distribution, we can add a density=1 parameter in the plt.hist function.
Let's go through the code. Note that there are changes in steps 1, 4, 5, and 6. The rest of the
code is the same as the preceding example:
nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins,
density=1)
The preceding plot illustrates clearly that the data does not follow a normal distribution:
many vertical bars lie well above or below the best-fit curve for a normal distribution.
Perhaps you are wondering where we got the formula used in step 6 of the preceding
code. Well, there is a little theory involved here. For a normal distribution with mean mu
and standard deviation sigma, the probability density at the bin edges is given by the
Gaussian density function: ((1 / (np.sqrt(2 * np.pi) * sigma)) *
np.exp(-0.5 * (1 / sigma * (bins - mu))**2)).
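To see the formula in context, here is a minimal sketch that draws a density histogram and overlays the Gaussian curve. The survey responses are not available here, so yearsOfExperience is generated randomly under the stated assumptions (100 participants, 0 to 20 years); the seed and line style are arbitrary:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical survey: 100 participants with 0 to 20 years of experience.
yearsOfExperience = rng.integers(0, 21, size=100)

nbins = 20
fig, ax = plt.subplots()
# density=True rescales bar heights so the total bar area equals 1.
n, bins, patches = ax.hist(yearsOfExperience, bins=nbins, density=True)

mu = yearsOfExperience.mean()
sigma = yearsOfExperience.std()

# Gaussian probability density evaluated at the bin edges.
y = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
ax.plot(bins, y, "--")
ax.set_xlabel("Experience in years")
ax.set_ylabel("Density")
```

Because both the bars and the curve are densities, they share the same vertical scale, which is what makes the visual comparison meaningful.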
Lollipop chart
A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar
chart.
Let's consider the carDF dataset. It can be found in the GitHub repository for Chapter 2.
Alternatively, it can be loaded directly from the GitHub link, as shown in the following
code:
carDF = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/cardata.csv')
2. Group the dataset by manufacturer. For now, if it does not make sense, just
remember that the following snippet groups the entries by a particular field (we
will go through groupby functions in detail in Chapter 4, Data Transformation):
#Group by manufacturer and take average mileage
processedDF = carDF[['cty','manufacturer']].groupby('manufacturer').mean()
3. Sort the values by cty and reset the index (again, we will go through sorting
and how we reset the index in Chapter 4, Data Transformation):
#Sort the values by cty and reset index
processedDF.sort_values('cty', inplace=True)
processedDF.reset_index(inplace=True)
7. Write the actual mean values in the plot, and display the plot:
#Write the values in the plot
for row in processedDF.itertuples():
ax.text(row.Index, row.cty+.5, s=round(row.cty, 2),
Having seen the preceding output, you now know why it is called a lollipop chart, don't
you? The line and the circle on top give a nice illustration of the different car
manufacturers and their average mileage in miles per gallon. Now, the data makes more
sense, doesn't it?
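The whole lollipop recipe can be sketched end to end as follows. The six rows of carDF are hypothetical stand-ins for cardata.csv, and the drawing calls (vlines for the sticks, scatter for the heads) are one common way to build the chart, assumed here rather than taken from the chapter:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for cardata.csv.
carDF = pd.DataFrame({
    "manufacturer": ["audi", "audi", "honda", "honda", "jeep", "jeep"],
    "cty": [18, 16, 28, 24, 14, 13],
})

# Group by manufacturer, take the average mileage, then sort and reset the index.
processedDF = carDF[["cty", "manufacturer"]].groupby("manufacturer").mean()
processedDF.sort_values("cty", inplace=True)
processedDF.reset_index(inplace=True)

fig, ax = plt.subplots(figsize=(8, 5))
# The stick: a vertical line from 0 up to each mean value.
ax.vlines(x=processedDF.index, ymin=0, ymax=processedDF.cty, alpha=0.7, linewidth=2)
# The head of the lollipop: a dot at the top of each stick.
ax.scatter(x=processedDF.index, y=processedDF.cty, s=75, alpha=0.7)

ax.set_xticks(list(processedDF.index))
ax.set_xticklabels(list(processedDF.manufacturer), rotation=65)
ax.set_ylabel("Miles Per Gallon")

# Write the mean value above each dot.
for row in processedDF.itertuples():
    ax.text(row.Index, row.cty + 0.5, str(round(row.cty, 2)),
            ha="center", va="bottom")
```

Sorting before plotting is what turns the chart into a ranking, with the lowest average mileage on the left.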
Choosing the best chart
Having said that, let's see if we can generalize some categories of charts based on various
purposes.
The following table shows the different types of charts based on the purposes:
Purpose             Charts
Show correlation    Scatter plot
                    Correlogram
                    Pairwise plot
                    Jittering with strip plot
                    Counts plot
                    Marginal histogram
                    Scatter plot with a line of best fit
                    Bubble plot with circling
Show deviation      Area chart
                    Diverging bars
                    Diverging texts
                    Diverging dot plot
                    Diverging lollipop plot with markers
Show distribution   Histogram for continuous variable
                    Histogram for categorical variable
                    Density plot
                    Categorical plots
                    Density curves with histogram
                    Population pyramid
                    Violin plot
                    Joy plot
                    Distributed dot plot
                    Box plot
Show composition    Waffle chart
                    Pie chart
                    Treemap
                    Bar chart
Note that going through each and every type of plot mentioned in the table is beyond the
scope of this book. However, we have tried to cover most of them in this chapter. A few of
them will be used in the upcoming chapters; we will use these graphs in more contextual
ways and with advanced settings.
Summary
Portraying data, events, concepts, information, processes, or methods graphically has
always been an effective aid to comprehension, and it also makes the material easier to
market. Presenting results to stakeholders is difficult because our audience may not be
technical enough to understand programming jargon and technicalities. Hence, visual aids
are widely used. In this chapter, we discussed how to use
such data visualization tools.
In the next chapter, we are going to get started with exploratory data analysis in a very
simple way. We will try to analyze our mailbox and analyze what type of emails we send
and receive.
Further reading
Matplotlib 3.0 Cookbook, Srinivasa Rao Poladi, Packt Publishing, October 22, 2018
Matplotlib Plotting Cookbook, Alexandre Devert, Packt Publishing, March 26, 2014
Data Visualization with Python, Mario Döbler, Tim Großmann, Packt
Publishing, February 28, 2019
No Humble Pie: The Origins and Usage of a Statistical Chart, Ian Spence, University of
Toronto, 2005.