Chapter 4
Contents
• Introduction to Exploratory Data Analysis
• Data visualization and visual encoding
• Data visualization libraries
• Basic data visualization tools
Python’s visualization tools
• Main visualization library for Python:
matplotlib
• Matplotlib is a powerful and flexible library.
• Integrates well with other libraries.
• Bokeh and Plotly, which are browser-based libraries, are
also popular.
Basic data visualization tools
• Pie charts
– Humans comprehend information more readily through
visualization than through reading or listening.
– In data science, visualization matters even more: a
complete data set usually cannot be read or understood directly, but a
visualization conveys much information about the data.
– Pie charts are very common in visualization.
– They are one of the clearest ways to present parts of a whole.
– They convey information in a way that the human brain
understands readily.
– Examples of exploratory questions where pie charts perform best:
• How many of our customers are seniors?
• How many page views came from the UK?
• A pie chart looks like a pie.
• It is a circular structure divided into slices.
• Each slice represents a numerical
proportion of the whole data set.
• The arc length of each slice is proportional to
the quantity it represents.
• The underlying data can be either numerical or categorical in
nature.
• A pie chart cannot always be plotted directly from raw variables;
some transformation, such as counting the rows in each category,
may be needed first.
• In the iris data set, we choose the Class variable because it
has three different species of iris, and for the
purpose of the demonstration the Sepal_Length
variable is used.
• Below is the code for the data transformation and
for plotting the pie chart.
• We do the following transformation first:
– df1 = df.groupby("Class").count()
– df2 = df1['Sepal_Length']
– df2
• Output:
Class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Sepal_Length, dtype: int64
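The transformation step can be tried on a self-contained stand-in for the iris data. This is a minimal sketch: the small DataFrame below is a hypothetical substitute (the column names `Class` and `Sepal_Length` follow the slides, the measurement values are placeholders), and the angle calculation illustrates that each slice's arc is proportional to its count.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: 50 rows per class.
# The Sepal_Length values are placeholders; only the row counts matter here.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
df = pd.DataFrame({
    "Class": [c for c in classes for _ in range(50)],
    "Sepal_Length": [5.0] * 150,
})

# Count the non-null Sepal_Length values in each class.
counts = df.groupby("Class")["Sepal_Length"].count()

# Each slice's share of the circle is its count divided by the total.
fractions = counts / counts.sum()
angles = fractions * 360  # central angle of each slice, in degrees

print(counts)
print(angles)
```

With three equally sized classes, every slice covers one third of the circle.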
import matplotlib
import matplotlib.pyplot as plt
# Global styling: font size, text color, line width
plt.rcParams['font.size'] = 22.0
matplotlib.rcParams['text.color'] = 'g'
matplotlib.rcParams['lines.linewidth'] = 2
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
explode = (0, 0, 0.1)  # offset only the third wedge
# df2 holds the per-class counts from the transformation step
plt.pie(df2, labels=labels, explode=explode,
        autopct='%1.1f%%', radius=1.2, colors=("y", "m", "b"),
        textprops={'fontsize': 22})
plt.title("Distribution of Sepal Length by Iris Class",
          bbox={'facecolor': '0.8', 'pad': 5}, y=1.2, fontsize=22)
plt.show()
The "explode" option specifies, for each wedge, the fraction of the radius by
which to offset that wedge from the center.
• Donut chart:
• A donut chart is a kind of pie chart with the center
cut out, which gives it a donut shape.
• Donut charts are considered more space-efficient, since
the blank inner space can be used to display a
percentage or any other information related to the data series.
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 22.0
matplotlib.rcParams['text.color'] = 'r'
matplotlib.rcParams['lines.linewidth'] = 2
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
explode = (0.1, 0.1, 0.1)  # offset every wedge slightly
plt.pie(df2, labels=labels, explode=explode, autopct='%1.1f%%',
        radius=1.2, colors=("b", "g", "y"), textprops={'fontsize': 22})
# Draw a white circle over the center to turn the pie into a donut
centre_circle = plt.Circle((0, 0), 0.75, fc='white', linewidth=1.25)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title("Distribution of Sepal Length by Iris Class",
          bbox={'facecolor': '0.8', 'pad': 5}, y=1.2, fontsize=22)
plt.show()
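An alternative way to get a donut shape, offered as a sketch rather than as the approach used in these slides, is to pass a `width` through the `wedgeprops` argument of `pie()` so that each wedge is drawn as a ring segment. The class counts are hard-coded here so the example runs on its own, and the center label text is an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
sizes = [50, 50, 50]  # per-class counts, hard-coded for a standalone example

fig, ax = plt.subplots(figsize=(6, 6))

# With the default radius of 1, width=0.4 draws only the outer ring
# between radius 0.6 and 1.0, leaving a hole in the middle.
wedges, texts, autotexts = ax.pie(
    sizes,
    labels=labels,
    autopct="%1.1f%%",
    pctdistance=0.8,  # place the percentage labels inside the ring
    colors=("b", "g", "y"),
    wedgeprops={"width": 0.4, "edgecolor": "white"},
)

# The empty center can carry a summary label.
ax.text(0, 0, "150 samples", ha="center", va="center")
ax.set_title("Iris classes (donut chart)")
```

Call `plt.show()` in an interactive session, or `fig.savefig("donut.png")` to write the chart to a file.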
• Histograms
– A graphical display of data using bars of different
heights.
– A histogram represents data that has been
grouped into ranges (bins).
– It is an accurate method for the graphical
representation of the distribution of numerical data.
– It is a type of bar plot where the x-axis represents the
bin ranges and the y-axis gives the frequency of values
in each bin.
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
a = np.array([22, 87, 5, 43, 56, 73, 55, 54,
              11, 20, 51, 5, 79, 31, 27])
# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])
# Show plot
plt.show()
• Two major problems with histograms
– The first is the number and size of the bins you use.
• If the bins are too large, you can obscure interesting
patterns that occur within a single bin.
• If they are too small, many of your bins will contain no
points, and a bell-shaped curve will turn into a scattering of
one-unit-high bars.
– The second is that sometimes the data itself can mar
the picture.
• One bin might contain so many points, for
example, that every other bin is squashed down to what
looks like noise.
• Outliers are a related visual problem: they can squash the
overwhelming majority of the points into the far left of the
graph.
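Both problems can be seen in the bin counts alone, without drawing anything. The sketch below reuses the fifteen values from the histogram example and compares several bin choices with `np.histogram`; the outlier value of 1000 is made up for illustration.

```python
import numpy as np

# The fifteen values from the histogram example.
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# Bins of width 25: a readable summary of the distribution.
counts_25, _ = np.histogram(a, bins=[0, 25, 50, 75, 100])

# One very wide bin: every pattern inside the range is hidden.
counts_wide, _ = np.histogram(a, bins=[0, 100])

# Bins of width 1: most bins are empty and the rest are one or two high.
counts_narrow, _ = np.histogram(a, bins=np.arange(0, 101))
empty_bins = int((counts_narrow == 0).sum())

# A single made-up outlier stretches the axis so that all the original
# points land in the first of four equal-width bins.
with_outlier = np.append(a, 1000)
counts_outlier, edges_outlier = np.histogram(with_outlier, bins=4)

print(counts_25)       # counts per 25-wide bin
print(counts_wide)     # a single count for the whole range
print(empty_bins)      # number of empty 1-wide bins
print(counts_outlier)  # counts once the outlier is included
```

Trying a few bin widths like this before plotting is a cheap way to see how sensitive the picture is to the choice.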
• Bar Charts
• A bar chart is a graph with rectangular bars.
• The graph usually compares different categories.
• The graphs can be plotted vertically (bars standing up)
or horizontally (bars lying flat from left to right). The
vertical form is the most common.
• A bar graph is useful for looking at a set of data and
making comparisons.
• For example, it is easier to see which items take the
largest share of your budget by glancing at a bar
chart than by reading a string of numbers.
• Bar charts can also show trends over time or reveal patterns
in periodic sequences.
• Bar charts can also represent more complex
categories with stacked bar charts or grouped bar charts.
• For example, if you had two houses and needed budgets
for each, you could plot them on the same x-axis with a
grouped bar chart, using different colors to represent
each house.
• Although they look similar, bar charts and histograms
have one important difference: they plot different types
of data.
• Plot discrete data on a bar chart, and plot continuous
data on a histogram.
• A bar chart is used when we have categories of
data: types of movies, music genres, or dog breeds.
• It is also a good choice when we want to compare things
between different groups.
• We could use a bar graph to track change over
time, as long as the changes occur over long periods (for example,
decades or centuries).
• If we have continuous data, like people's weights or IQ
scores, a histogram is best.
Figure: horizontal bar chart with categories on the y-axis
Figure: grouped bar graph
Stacked bar chart
• As in the grouped (double) bar chart, different colors
represent different sub-groups.
• A stacked bar chart is a good choice if we
– Want to show the total size of groups.
– Are interested in showing how the proportions
between groups relate to each other, in addition
to the total of each group.
– Have data that naturally falls into components,
like:
• Sales by district.
• Book sales by type of book.
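The grouped and stacked layouts differ only in where each series is placed. The sketch below uses the two-house budget idea from the slides with made-up numbers: grouped bars are shifted sideways around each tick, while stacked bars use the `bottom` argument of `bar()` to start the second series where the first one ends.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly budgets for two houses (illustrative numbers only).
categories = ["Rent", "Food", "Utilities"]
house_a = np.array([1200, 400, 150])
house_b = np.array([900, 350, 200])

x = np.arange(len(categories))  # one tick position per category
width = 0.35

fig, (ax_grouped, ax_stacked) = plt.subplots(1, 2, figsize=(10, 4))

# Grouped: shift each series half a bar width left or right of the tick.
grouped_a = ax_grouped.bar(x - width / 2, house_a, width, label="House A")
grouped_b = ax_grouped.bar(x + width / 2, house_b, width, label="House B")
ax_grouped.set_xticks(x)
ax_grouped.set_xticklabels(categories)
ax_grouped.set_title("Grouped")
ax_grouped.legend()

# Stacked: the second series starts at the top of the first one.
stacked_a = ax_stacked.bar(x, house_a, width, label="House A")
stacked_b = ax_stacked.bar(x, house_b, width, bottom=house_a, label="House B")
ax_stacked.set_xticks(x)
ax_stacked.set_xticklabels(categories)
ax_stacked.set_title("Stacked")
ax_stacked.legend()
```

The grouped version makes it easy to compare the two houses within a category; the stacked version makes the combined total per category visible at a glance.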
• The matplotlib API in Python provides the
bar() function, which can be used either through the MATLAB-style
pyplot interface or through the object-oriented API.
• The general form of the bar() call is
as follows:
• plt.bar(x, height, width=0.8, bottom=None, align='center')
• The function draws one rectangle per entry in x, with its
position and size determined by the given parameters.
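The two ways of calling `bar()` can be compared side by side. This is a minimal sketch with made-up labels and heights: the first call goes through `pyplot`, which tracks the current figure implicitly, and the second calls the method on an explicit `Axes` object.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt

x = ["A", "B", "C"]   # hypothetical category labels
height = [3, 7, 5]    # hypothetical bar heights

# MATLAB-style: pyplot keeps track of the current figure and axes.
plt.figure()
bars_pyplot = plt.bar(x, height, width=0.6, align="center")
plt.title("pyplot interface")

# Object-oriented style: call bar() on an explicit Axes object.
fig, ax = plt.subplots()
bars_oo = ax.bar(x, height, width=0.6, align="center")
ax.set_title("object-oriented interface")
```

Both calls return a container of rectangles with the same heights and widths. The object-oriented form tends to be easier to manage once a figure holds more than one plot.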
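• A simple example of a bar plot that represents the number of students enrolled in different courses of
an institute.
import matplotlib.pyplot as plt
# Placeholder data: the original course names and counts are not given here
courses = ["Course A", "Course B", "Course C", "Course D"]
values = [20, 15, 30, 35]
plt.bar(courses, values, color="maroon")
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()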
• plt.bar(courses, values, color='maroon') specifies
that the bar chart is plotted with
the courses on the x-axis and the
values on the y-axis.
• The color argument sets the color of the
bars (maroon in this case).
• plt.xlabel("Courses offered") and
plt.ylabel("No. of students enrolled") label
the corresponding axes.
• plt.title() sets the title of the graph.
• plt.show() displays the graph built by
the previous commands.
• Scatter plots:
– They are one of the simplest but most powerful
ways to visualize relationships within a dataset.
– Scatter plots are best when we want a
first look at how two variables in a data set relate.
df.plot(kind="scatter",
x="sepal length (cm)", y="sepal width (cm)")
plt.title("Length vs Width")
plt.show()
• Scatter plots can carry several other
visual attributes, which allow more than just
two dimensions to be packed in.
• Color coding. Data points that fall into different
categories are often given different colors.
• Size. Changing the size of data points communicates
another dimension of information. It also has the
often-desirable ability to draw attention disproportionately to
some points instead of others.
• Opacity. In scatter plots and other visualizations, it is
often useful to make things partially transparent in case
they overlap with other parts of the visualization.
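The three extra channels map directly onto arguments of `scatter()`: `c` for color, `s` for marker area, and `alpha` for opacity. The sketch below uses seeded random data, so the variables `category` and `weight` are made-up stand-ins for real columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
n = 60
x = rng.normal(size=n)
y = rng.normal(size=n)
category = rng.integers(0, 3, size=n)  # made-up category codes, shown as color
weight = rng.uniform(20, 200, size=n)  # made-up third variable, shown as marker area

fig, ax = plt.subplots()
points = ax.scatter(
    x, y,
    c=category,      # color encodes the category
    s=weight,        # marker area encodes the extra variable
    alpha=0.5,       # partial transparency keeps overlapping points visible
    cmap="viridis",
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Color, size, and opacity in one scatter plot")
```

Four dimensions are now visible in a two-dimensional plot: position on each axis, color, and size.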
• Line charts:
– Used to plot continuous data as a sequence of
points connected in order.
– They are created by first plotting the data points
on a Cartesian plane and then joining consecutive points
with straight line segments.
– Line plots can be used to plot data points for both
single-variable analysis and multiple-variable
analysis.
– Generally used for visualizing trends in time-series
problems.
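A line chart of a small time series can be sketched as follows. The monthly figures for the two products are invented for illustration; plotting two series on the same axes shows the multiple-variable case.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly sales for two products (illustrative numbers only).
months = np.arange(1, 13)
product_a = np.array([12, 14, 13, 17, 20, 24, 27, 26, 22, 18, 15, 13])
product_b = np.array([8, 9, 11, 12, 12, 15, 16, 18, 19, 21, 22, 25])

fig, ax = plt.subplots(figsize=(8, 4))

# plot() places the points and joins consecutive ones with line segments.
line_a, = ax.plot(months, product_a, marker="o", label="Product A")
line_b, = ax.plot(months, product_b, marker="s", label="Product B")

ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly sales trend")
ax.legend()
```

Dropping the second `plot()` call gives the single-variable version of the same chart.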
Specialized data visualization tools