Data Visualization

Unit 1
Introduction

Learning Objectives
By the end of this unit, you will be able to understand:
● What is data visualization?
● Why is it important to learn visualization?
● Ways to visualize in Python
● Why choose Matplotlib over others?
● Installation of Matplotlib in Python

Introduction
Data visualization is a powerful tool that transforms complex datasets into meaningful and insightful visuals. In today’s data-driven world, where an immense amount of information is generated daily, the ability to effectively present and understand data is crucial. Data visualization bridges the gap between raw data and human comprehension, allowing us to explore patterns, trends, and relationships that might otherwise remain hidden in spreadsheets or databases.
The primary goal of data visualization is to communicate
information clearly, accurately, and efficiently. It goes
beyond mere aesthetics; a well-designed visualization
can convey information, insights, and narratives
that can drive informed decision-making, enhance
communication, and tell compelling stories. Whether
you’re a business analyst trying to convey sales trends,
a scientist explaining experimental results, or a journalist
presenting investigative findings, data visualization is
a universal language that can be understood by both
experts and laypeople alike.
Why is it Important to Learn Visualization?

● Effective Communication: Visualizations are powerful tools for conveying complex information in a simple and understandable manner. Learning visualization techniques enables you to effectively communicate data-driven insights to both technical and non-technical audiences. This skill is valuable in various professional fields, encompassing business, science, journalism, and academia, wherein data-informed decisions and communication are pivotal for attaining success.

● Decision Making: In today’s data-driven world, organizations depend on informed choices. Data visualizations enable decision-makers to act on the information presented.

● Data Reporting: Whether it’s a business report, research paper, or presentation, data visualizations make your work more professional and impactful. Instead of overwhelming your audience with tables and numbers, you can present the information visually, making it easier for them to grasp the key takeaways.

● Detecting Errors and Anomalies: Visualization can help you identify data errors, inconsistencies, or outliers that may not be apparent when looking at raw data. Spotting and rectifying these issues are crucial to ensure data accuracy and reliability.

● Storytelling with Data: Data visualizations allow you to tell compelling stories with data. By combining different visual elements, design principles, and presenting data in unique and engaging ways, you can guide your audience towards the insights that matter. Becoming fluent at using visualizations also makes it an asset that can enhance your career prospects.
In short, learning visualization makes you a more effective practitioner, enabling you to extract meaningful insights from data and communicate them in a way that resonates with others.

Among the ways to visualize in Python, Plotly is well suited to interactive work in notebooks. Plotly supports a variety of chart types, including line charts, scatter plots, bar charts, 3D plots, choropleth maps, and more.
Installation of Matplotlib in Python

Generally, if you are using the Anaconda distribution, packages like matplotlib, pandas and numpy are already installed. However, if you are not using conda and are using pip as an alternative, then you need to run these commands on the console.
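Assuming a standard pip setup, the commands in question are the usual PyPI installs (package names as published on PyPI):

pip install matplotlib
pip install pandas
pip install numpy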
Note that all of the above commands can be executed from a notebook by putting a % sign before the command, like below.
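Inside a Jupyter cell the same installs look like this (a sketch using the %pip magic):

%pip install matplotlib
%pip install pandas
%pip install numpy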
If you don’t know how to install Anaconda or Python with pip, I have created a separate document for that; you should go through that document to understand the process of installing Anaconda or Python.
Summary
Data visualization is the graphical representation of data to help people understand and make sense of
complex information. It involves using various visual elements such as charts, graphs, maps, and diagrams
to present data in a way that is easy to interpret and analyze. Data visualization can reveal patterns, trends,
and insights within data, making it a powerful tool for decision-making, storytelling, and communication
in fields ranging from business and science to journalism and education. Effective data visualization can
enhance data-driven decision-making, improve data communication, and facilitate a deeper understanding
of data for a wide range of audiences.
Unit 2
Different Approaches of
Learning Matplotlib
Learning Objectives
By the end of this unit, you will be able to understand:
● Matplotlib
● Styles of writing Matplotlib

Introduction
Before we delve into the reasons behind the various approaches of Matplotlib, it’s important to understand the historical development of the library. This history is one of the main factors contributing to the diverse flavors that Matplotlib offers. Due to its lengthy presence as a package in Python and the involvement of multiple maintainers over time, Matplotlib has evolved in different directions for achieving the same goals. Therefore, comprehending the historical progression of Matplotlib is crucial.
Matplotlib
● 2003: The initial version of Matplotlib was released, providing a set of stateful plotting commands. Continuous development and a committed community have kept it relevant and essential, and it remains one of the go-to libraries for data visualization in Python.
Summary
Matplotlib is a widely-used Python library for creating static, animated, and interactive visualizations of data. It
provides a flexible and extensive set of tools for producing high-quality charts, plots, and graphs. Matplotlib’s
features include support for various plot types, customization options for colors and styles, and the ability
to create complex visualizations for a wide range of data analysis and presentation needs. It is often used
in scientific research, data analysis, data exploration, and data communication, making it an essential tool
for those working with data in Python. Matplotlib’s intuitive syntax and extensive documentation make it
accessible to both beginners and experienced data scientists and researchers.
Learning Objectives
By the end of this unit, you will be able to understand:
● Continuous data
● Discrete data
● Scales of measurement

Introduction
Understanding the datatype before commencing the plotting process is of utmost importance, as numerous plots rely heavily on the intrinsic data type of the problem. They may not function as intended with differing datatypes.
Continuous Data

● Weight: Weight provides another illustration of continuous data. It can encompass various values within a specific range, such as 65.7 kg, 72.3 kg, 58.9 kg, and more.

● Time: Time is continuously measurable, even down to fractions, such as 9:30 AM, 11:45 AM, 3:15 PM, and so on.

● Speed: Speed, being subject to continuous variation within a particular range, is also a form of continuous data. For instance, a car’s speed might be 45.6 km/h, 70.2 km/h, 100.8 km/h, and so forth.

Discrete Data

Discrete data takes values within a defined set of categories or levels. Unlike continuous data that spans any value within a range, discrete data remains confined to a countable number of values. Discrete data is frequently expressed as counts or frequencies and is commonly employed for categorization and classification purposes.

Here are Some Examples of Discrete Data:

● Number of Children: The count of children in a family is considered discrete data because it exclusively takes specific and distinct values (e.g., the number of children in a family, the number of students in a class).

● Gender: Categories such as male, female and non-binary are treated as discrete data, as each individual falls within one specific category.
Now, these aforementioned categories can be further divided into subcategories, often called scales of measurement. For example:

● Interval Scale: Interval data has a meaningful order, and the differences between values are consistent and interpretable, although there is no true zero point.
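With these data types in mind, we can start plotting. As a minimal sketch, a univariate histogram of a continuous column can be produced like this (the data below is just random dummy values for illustration):

import numpy as np
import matplotlib.pyplot as plt

# dummy continuous data for illustration
data = np.random.randn(500)

# plot the frequency of values falling into each bin
plt.hist(data, bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of a single continuous variable')
plt.show()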
Most of these arguments are optional; if you ignore them, Matplotlib will take the defaults, and you can add the same elements (labels, titles, legends) as needed.

Now, from the next chapter onwards, comments are extensively written only where the logic is complex. You can use the above as a template to produce different types of charts; you just have to figure out what to plot and what the equivalent command for it is. For example, plt.hist will be replaced with something else and you have a different graph.
The above chart is called a histogram and displays frequencies of values in certain ranges. This is called a univariate chart because it represents a single variable, and in this case it is showing us the distribution of that variable using a histogram.
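For categorical data the counts per category can be drawn in a stem (“lollipop”) style instead; here is a small sketch with a made-up category column (the categories and counts are illustrative only):

import matplotlib.pyplot as plt

# made-up category frequencies for illustration
categories = ['A', 'B', 'C', 'D']
counts = [12, 30, 7, 18]

positions = range(len(categories))
plt.stem(positions, counts)          # a thin line with a marker on top: the "lollipop" look
plt.xticks(positions, categories)
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.show()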
Another example:
import numpy as np
# Generate data
#In this example: and a title to the plot. show() displays the plot.
information can generate the frequencies, We generate random bivariate data x and
solidifying its classification as univariate. This y using NumPy’s np.random.randn() function.
type of plot also goes by a humorous moniker:
The hist2d() function is used to create a 2D
the “lollipop” chart. As you can observe, it can
histogram. We pass the x and y
be employed with categorical data to create a
values as the first and second arguments, and
univariate chart.
specify the number of bins and color map.
Hence, we’ve explored two examples—one involving The colorbar() function adds a color bar indicating
continuous data and the other categorical— the count of data points in each bin.
to generate charts that represent univariate
xlabel(), ylabel(), and title() functions add labels
distributions.
and a title to the plot.
‘’’
In this scenario, we have two independent columns, x and y. Upon observing the graph, it’s evident that there
are numerous records for x values ranging from 0 to 0.5, and for y values between 0 and 1. This insight is
indicated by the presence of darker blue regions on the graph. Similarly, at the extreme values of x and y, the
frequency of occurrences is notably low. This observation is depicted by the presence of very light blue color
boxes (bars).
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
x = np.random.rand(100)            # x values (assumed here): 100 random numbers between 0 and 1
y = 2 * x + np.random.randn(100)

plt.scatter(x, y)                  # a bivariate scatter of x against y
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Now that we’ve comprehended univariate and bivariate charts, as well as created some basic plots using
dummy data, it’s crucial to understand the underlying grammar of charts. This understanding will help make
the code more meaningful and coherent to us.
Summary
Data types, in the context of computer programming and data analysis, refer to the classification or
categorization of data values to specify what kind of data a particular variable or object can hold. Data types
are essential for accurately representing and manipulating data in a computer program.
Choosing the appropriate data type for your variables or data structures is crucial for efficient memory usage
and accurate processing of data. Different programming languages may have variations in data types and
their behavior, so it’s essential to understand the data types available in the language you are working with
and how to use them effectively.
Unit 4
Matplotlib Grammar
Learning Objectives
By the end of this unit, you will be able to understand:
● Graph vocabulary

Introduction
Matplotlib is a widely used Python library for creating static, interactive, and animated visualizations in a variety of formats. It provides a high-level interface for creating a wide range of plots and charts, allowing users to present data in a visually appealing and informative manner. Matplotlib is highly customizable and provides control over every aspect of the visualization, making it a versatile tool for both beginners and experienced data scientists.

Graph Vocabulary
Understanding graph terms is crucial for effectively interpreting and conveying information through charts and graphs. Here are key terms frequently employed to elucidate the vocabulary of a graph:
● Title: A concise descriptive statement summarizing the main intent or content of the graph.
● Axis Labels: Labels indicating what is being measured along the horizontal x-axis and the vertical y-axis.
● Legend: An area or box within the graph explaining the meaning of diverse colors or symbols used to
represent data series.
● Tick Marks: Small marks or lines on the axes signifying specific points or intervals on the scale.
● Tick Labels: Numeric or categorical labels accompanying tick marks, displaying values at those positions.
● Gridlines: Horizontal and vertical lines extending from tick marks to assist in reading the values.
● Bar: A rectangular representation utilized in bar charts to depict the magnitude of a variable.
● Line: A continuous line linking data points in a line chart, illustrating trends or relationships.
● Marker: A symbol or point employed to emphasize individual data points on a line or scatter plot.
● Axis Scale: The scope of values exhibited on an axis, encompassing linear, logarithmic, or other scales.
● Data Label: A numeric value or label connected with a data point, offering specific information.
● Axis Range: The range of values covered by an axis, spanning from the minimum to the maximum value.
● Annotations: Supplementary text, shapes, or lines incorporated into the graph to provide context or
explanations.
These terms constitute the fundamental building blocks of graph vocabulary, enabling individuals to
effectively comprehend and construct graphs in various domains, ranging from scientific research and
business analytics to education and communication.
Now, let’s examine a graph from the Matplotlib documentation and endeavor to correlate the above generic
anatomy of a graph with the specific components of a Matplotlib graph.
Figure 4.1: Anatomy of a Figure
Analyzing the above image, we can observe that while there are slight differences, most of the elements are
similar to what we discussed earlier. This similarity highlights the advantage of understanding Matplotlib, as
it closely aligns with the general principles of plotting grammar. Let’s delve into the components of the above
graph and briefly explain them.
The structure of a Matplotlib plot encompasses various constituents that collaboratively form a comprehensive
visual representation of data. Acquiring familiarity with these components is pivotal for crafting and tailoring
plots proficiently. Here’s an outline of some principal components within a Matplotlib plot:
● Figure: The highest-level container embracing all plot elements. It can accommodate one or multiple
subplots (axes). Think of the figure as the entire canvas on which the plot takes shape.
● Axes: Individual plotting areas within a figure, each with its own x-axis, y-axis, data, and graphical elements.
Unlike Cartesian axes, an “axes” object in Matplotlib represents a plotting region.
● Axis Labels: Labels on the x-axis and y-axis denoting the measured attributes. These labels provide
contextual information for the displayed data.
● Title: A descriptive heading above subplots or axes, encapsulating the main purpose or content of the plot.
● Data: The actual data points being plotted, visualized through various plot types like lines, bars, scatter
points, etc.
● Ticks: Marks or lines along axes indicating specific data values or intervals. Accompanied by tick labels,
which display corresponding values.
● Tick Labels: Numeric or categorical labels aligned with tick marks, aiding comprehension of data scale.
● Gridlines: Horizontal and vertical lines stemming from tick marks, facilitating data value interpretation and
relationship understanding.
● Legend: A section within the plot elucidating the significance of distinct colors, markers, or lines representing
data series.
● Color and Style: Different colors, markers, and line styles to differentiate data series or emphasize specific
data points.
● Annotations: Text or arrows incorporated into the plot to provide extra context or explanations about
specific data or trends.
● Spines: Lines connecting tick marks that frame the plot, outlining axis borders, and adjustable in visibility
and position.
● Background Color: The color of the plot area and entire figure canvas, customizable to match visualization
design.
● Subplots: When a figure includes multiple plots arranged in a grid, each individual plot in the grid is termed
a subplot. Subplots share the figure but possess distinct axes.
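To connect these terms to actual code, here is a small sketch that touches most of the components above on one figure; the data and styling choices are arbitrary, purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
fig, ax = plt.subplots()                               # Figure and a single Axes
ax.plot(x, np.sin(x), marker='o', label='sin(x)')      # data, line, markers
ax.set_title('Anatomy of a simple plot')               # title
ax.set_xlabel('x value')                               # axis labels
ax.set_ylabel('sin(x)')
ax.grid(True)                                          # gridlines
ax.legend()                                            # legend
ax.annotate('first peak', xy=(np.pi / 2, 1),           # annotation with an arrow
            xytext=(3, 0.5),
            arrowprops=dict(arrowstyle='->'))
plt.show()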
Summary
The anatomy of a Matplotlib plot constitutes a hierarchy of components working harmoniously to present
data visually. Understanding these components is pivotal for effective data communication and manipulation
through visualizations.
Unit 5
Different Types of Graphs

Learning Objectives
By the end of this unit, you will be able to understand:
● Chart types

Introduction
Matplotlib offers an extensive array of chart types that cater to visualizing various data forms.
Chart Types
Line Chart: The most basic chart joins the points with a line. (The y values below are illustrative.)

import matplotlib.pyplot as plt

# Data
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 1, 6, 5]

# Create a line chart
plt.plot(x_values, y_values, label='Line')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')

# Add a legend
plt.legend()
plt.show()

Scatter Plot: Each (x, y) pair is drawn as an individual marker. (Again, the y values below are illustrative.)

import matplotlib.pyplot as plt

# Data
x_values = [1, 2, 3, 4, 5]
y_values = [5, 3, 8, 2, 7]

# Create a scatter plot
plt.scatter(x_values, y_values, label='Points')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')

# Add a legend
plt.legend()
plt.show()
Bar Chart: It’s important to note that a bar chart and a histogram are distinct visualizations, even though
these terms are sometimes used interchangeably in conversation. When you’re creating a plot, a bar chart is
commonly referred to as a frequency plot or, in some cases, a count plot. Essentially, a bar chart illustrates
the frequency of values within a dataset. Typically, a single column of categorical data serves as the input for
creating a bar chart. Depending on your preference, you can adjust the orientation of the bars, making them
either vertical or horizontal. In contemporary discussions, there’s a growing interest in a type of plot known
as the ‘lollipop chart,’ which can also be used instead of a traditional bar chart. Additionally, bar charts offer
variations in appearance, involving both horizontal and vertical bars.
import matplotlib.pyplot as plt

# Data (categories and values are illustrative)
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]

plt.bar(categories, values)      # create the bar chart
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
import matplotlib.pyplot as plt

# Data
data = [5, 7, 9, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 33, 35]

# Create a histogram (the bin count is illustrative)
plt.hist(data, bins=5)
plt.xlabel('Value Range')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
import matplotlib.pyplot as plt

# Data
data = [15, 20, 22, 28, 30, 32, 33, 35, 40, 45, 50]

# A box plot of the values, drawn horizontally so the values sit on the x-axis
# (the plotting call in this snippet is an assumed reconstruction)
plt.boxplot(data, vert=False)
plt.xlabel('Value')
plt.show()
The fundamental element of a density plot is the kernel density estimate (KDE), which represents a smoothed
rendition of the data’s distribution. The KDE involves placing a kernel (typically a smoothing function like
Gaussian) at each data point and then aggregating these kernels to form a smooth curve that represents the
overall distribution.
Density plots are a good choice when:

● You aim to visualize the underlying distribution of continuous data without being limited by bin sizes (as seen in histograms).
● You intend to pinpoint modes (peaks) and identify any patterns with multiple modes in the data.
● You require a polished representation that offers a clearer view of the data’s distribution.
Notable variations of density plots encompass the violin plot, which blends the KDE with a box plot, and the
ridge plot, which stacks multiple KDEs to illustrate distributions across distinct categories.
Density plots can be generated using diverse visualization libraries, including Matplotlib and Seaborn in
Python. They provide insights into the general shape and characteristics of your data’s distribution.
## To plot a density plot which represents the shape of a distribution, one can do this
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.randn(1000)
density = gaussian_kde(data)
xs = np.linspace(data.min(), data.max(), 200)   # evaluation grid for the smoothed curve
plt.plot(xs, density(xs))
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
Violin Plot: Violin plots are exclusively applicable to continuous data and offer insights akin to both box plots
and density plots. Therefore, they can prove exceptionally valuable in certain scenarios. A violin plot not
only imparts information about summary statistics such as medians and quartiles but also showcases the
distribution’s shape, spread, and potential multimodal patterns. This amalgamation of features makes violin
plots a comprehensive tool for visualizing data distributions.
import numpy as np
import matplotlib.pyplot as plt

# illustrative data: three groups of continuous values, one violin per group
combined_data = [np.random.normal(loc, 1, 200) for loc in (0, 2, -1)]
plt.violinplot(combined_data)
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
All the aforementioned charts are fundamental representations, and numerous other charts are derived from
these basics. As the book progresses, you’ll encounter variations of these charts with additional options for
gaining a more comprehensive understanding of data. Later on, we’ll also explore how to generate multiple
plots efficiently without writing extensive code. For now, it’s important to recognize that understanding data
types can guide our initial choice of plots. Keep in mind that Matplotlib can also be used to draw geometric
shapes and trigonometric/algebraic functions. These concepts will become clearer in subsequent chapters.
In this chapter, the focus is on comprehending how basic plots can be inferred from data types, serving as a
foundation for further interpretation.
Pie Chart: A pie chart is a circular data visualization tool employed to exhibit the distribution of a categorical
dataset. It represents each category as a “slice” of the pie, with the size of each slice corresponding to the
proportion of the entire dataset that the category occupies. Key attributes of a pie chart include:
● Circular Shape: The entire pie denotes the entire dataset or 100% of the data. Each category’s slice is
proportional to the percentage it contributes to the whole.
● Categories: Categories encircle the circumference of the circle, each with its corresponding slice.
● Angles: The size of each slice is dictated by the angle it forms at the center of the circle. A larger proportion
corresponds to a larger angle.
● Labels: Typically, labels or percentages are positioned inside or outside each slice to provide supplementary
information.
● Legend: A legend is commonly used to associate category names with the colors of the slices.
Pie charts are most effective when you intend to communicate the relative sizes of different categories in
relation to a whole. However, they might be less effective when comparing precise values or when numerous
categories are involved. The circular shape can make it challenging to accurately assess differences in size.
import matplotlib.pyplot as plt

# Data (categories and their shares are illustrative)
labels = ['A', 'B', 'C', 'D']
sizes = [40, 30, 20, 10]

plt.pie(sizes, labels=labels, autopct='%1.1f%%')   # print the percentage on each slice
plt.title('Pie Chart Example')
plt.show()
While pie charts may seem appealing, they are not commonly used in industries such as banking, especially in
my experience. This is due to concerns about financial sensitivity; people prefer more accurate representations
and sometimes pie charts fall short in providing clear financial insights. Additionally, banking data tends to be
intricate, and for analyses involving multiple variables, other visualization techniques are often preferred. Pie
charts are more suitable for univariate representations, while for bivariate and multivariate graphs, alternative
methods are favored. Here are some disadvantages of pie charts:
Despite their visual appeal and suitability for specific situations, pie charts have several limitations in the
realm of data science and data visualization:
● Difficulty in Comparing Slices: Accurately comparing the sizes of different slices in a pie chart can be
challenging, particularly when slices are similar in size or the chart contains numerous slices. Human
perception of angles and areas is not as precise as linear measurements, making it hard to determine
exact proportions.
● Limited Applicability to Few Categories: Pie charts work best when depicting a small number of categories.
As the number of categories increases, the chart can become cluttered and confusing, leading to difficulties
in distinguishing between slices.
● Inaccurate Data Representation: A slice’s size represents its proportion relative to the whole. However,
when precise data comparisons are required, people often make inaccurate estimations based on the
angle or area of slices.
● Unsuitability for Time Series Data: Pie charts are unsuitable for illustrating trends over time or any
sequential data. Time-related data is better presented using line charts or other time series visualizations.
● Unsuitability for Complex Data: Pie charts struggle to convey complex relationships or data with multiple
dimensions. They are confined to illustrating one-dimensional data distributions.
● Misleading 3D Effects: Adding a 3D effect to a pie chart can distort the perception of slice sizes, making
them appear larger or smaller than they truly are.
● Often Better Alternatives: In many instances, other visualization types such as bar charts, stacked bar
charts, and grouped bar charts provide clearer and more accurate data representations, rendering them
superior choices for data analysis.
In the realm of data science and effective data visualization, it’s crucial to thoughtfully select appropriate
chart types based on your data’s characteristics and the insights you aim to convey. While pie charts have
their place in specific contexts, they often fall short in conveying precise information or intricate relationships.
Summary
In a general sense, a “graph” is a mathematical and data structure concept used to represent relationships
between objects. It consists of nodes (vertices) connected by edges (links or arcs). Graphs provide a powerful
way to visualize and analyze complex relationships and networks, making them a fundamental concept in
mathematics, computer science, and various real-world applications.
Learning Objectives
By the end of this unit, you will be able to understand:
● Contour plot versus scatter pair plots

Introduction
We have already explored some fundamental types of graphs in Matplotlib. However, there are instances when generating graphs using Matplotlib can be challenging or visually unappealing. Sometimes, Seaborn’s plotting capabilities are more straightforward than those of Matplotlib. Here are some intriguing graphs that I believe you should be aware of, as they can be effortlessly plotted using Seaborn. Seaborn is a library built on top of Matplotlib, so the fundamental graph and chart terminology remains largely consistent. However, Seaborn’s plotting capabilities can vary in terms of simplicity.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.pairplot(tips)        # one panel per pair of numeric columns
plt.show()
If you were to attempt this using Matplotlib, it would be quite challenging. Generating a pairplot in Matplotlib
is a complex task. Additionally, Seaborn’s pairplot command comes in multiple variations. For instance,
using the same code, you can alter the parameter to obtain a pairplot with contours. We will delve into the
parameter later. For now, focus on comparing the differences between the plots in order to understand how
contour plots differ from scatter pair plots.
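For reference, the variation mentioned above is obtained by switching the kind parameter (a short sketch):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.pairplot(tips, kind="kde")    # contour-style panels instead of plain scatters
plt.show()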
In essence, a contour plot, also referred to as a level plot or isoline plot, is a technique used to visualize two-
dimensional representations of three-dimensional function surfaces. Contour plots are commonly employed
to illustrate functions involving two continuous variables.
In a contour plot, the x and y axes represent the two input variables, while the contours represent the function’s
values at various levels. Each contour line connects points with the same function value. These contour lines
unveil patterns, trends, and regions of significance within the data.
The spacing between contour lines signifies the rate of change of the function. Closer contour lines denote
rapid changes, while wider spacing indicates more gradual changes.
Contour plots prove particularly valuable when visualizing functions with intricate shapes or when identifying
crucial points such as saddle points, minima, maxima, and regions with consistent function values.
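As a small illustration of the idea (the function z = sin(x)·cos(y) is an arbitrary choice), a contour plot of a two-variable function can be drawn directly with Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) * np.cos(Y)              # the function value at every (x, y) grid point

cs = plt.contour(X, Y, Z, levels=10)   # each line joins points with the same Z value
plt.clabel(cs, inline=True)            # label the contour levels
plt.xlabel('x')
plt.ylabel('y')
plt.title('Contour plot of sin(x)·cos(y)')
plt.show()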
Seaborn also makes heatmaps straightforward:

import seaborn as sns
import matplotlib.pyplot as plt

flights = sns.load_dataset("flights")

# Create a heatmap (the month-by-year pivot of passenger counts is the usual shape for this dataset)
flights_pivot = flights.pivot(index="month", columns="year", values="passengers")
sns.heatmap(flights_pivot)
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
# an assumed example: a box plot of the total bill per day
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

tips = sns.load_dataset("tips")
sns.set(style="whitegrid")
# again an assumed example, this time a violin plot on the same columns with the whitegrid style
sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
# one scatter series per day of the week (the exact grouping and styling are assumed for illustration)
for category in tips["day"].unique():
    subset = tips[tips["day"] == category]
    plt.scatter(subset.index, subset["total_bill"], label=str(category))
plt.xlabel("Record")
plt.ylabel("Total Bill")
plt.legend()
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Create a countplot
sns.set(style="darkgrid")
sns.countplot(x="day", data=tips)
plt.title("Countplot Example")
plt.show()
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
day_counts = tips["day"].value_counts()

# the plain-Matplotlib equivalent of the countplot above: compute the counts, then draw the bars
plt.bar(day_counts.index.astype(str), day_counts.values)
plt.xlabel("Day")
plt.ylabel("Count")
plt.show()
In this chapter, we’ve explored several new types of plots and recognized that seaborn is often a more
advantageous option compared to matplotlib, especially given its improved default visualizations. It’s
important to acknowledge that seaborn is built upon matplotlib, so we can also customize seaborn charts, a
topic we’ll delve into later in the course.
Summary
Seaborn is a Python data visualization library built on top of Matplotlib that specializes in creating informative
and aesthetically pleasing statistical graphics. It simplifies the process of creating various types of statistical
plots and enhances the visual appeal of your data visualizations. In summary, Seaborn is a valuable tool for
data visualization in Python, especially when you want to create informative, aesthetically pleasing statistical
plots with ease. It complements Matplotlib and integrates seamlessly with Pandas, making it a popular
choice for data analysts and scientists.
Learning Objectives
By the end of this unit, you will be able to understand:
● Internal plotting mechanism of pandas

Introduction
Up until now, we’ve examined how Matplotlib operates with single-column structures, enabling us to plot based on individual vectors or arrays of values.
import pandas as pd
import matplotlib.pyplot as plt

penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# the .plot accessor on a Series hands the drawing off to Matplotlib behind the scenes
penguins["bill_length_mm"].plot.hist()
plt.ylabel('Frequency')
plt.show()
If you replace the line .plot.hist() with .plot(kind='kde'), you can receive a distribution of a continuous column as well. Try it; you will find that it is really easy to tweak, and to see what the defaults are and how they work.
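For instance, the swap described above looks like this (continuing directly from the snippet above):

penguins["bill_length_mm"].plot(kind='kde')   # a smoothed density instead of binned counts
plt.xlabel('Bill length (mm)')
plt.show()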
Now, if you explore a little bit more on .plot, and if you choose to run this code [item for item in dir(penguins["bill_length_mm"].plot) if not item.startswith('_')], you will see that below are the plot objects supported by the .plot accessor:
['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter']
You are probably thinking this plot is good but has no description at all; well, don’t worry, we can add descriptions by adding plt.legend(). So here is the code which can change this into a much more readable chart by adding a legend and title, but as promised the code is essentially a one-liner. The idea of this code is to break the data into groups by using .groupby; make sure you only pick those columns which you want in the plot. Here we need two columns: one is the categorical column, which is island, and the other is the continuous one, which is bill_length_mm. You can make this more readable by wrapping the code in parentheses. You can see that adding parentheses around plots doesn’t change anything, and you can add comments as well.
(penguins                                 # dataset
 .groupby("island")["bill_length_mm"]     # group by the categorical column, keep only the continuous one
 .plot.kde())                             # one density curve per island
plt.legend()
plt.title("Bill length by island")
plt.show()
Expanding this thought in a different direction, we can add more types of plots. For example, by just replacing plot.kde() with plot.hist(alpha=0.5) you will have histograms for the three groups. Interestingly, you can do this with a categorical column too, but you have to do some more data processing. For example, to do this with a categorical column and generate a count plot (bar plot), you first need to count and then you need to plot. To achieve this we can use groupby with the agg method and count, then we use pivot to convert the data from longer to wider form. Finally we do the plotting, along these lines (the exact column pairing is an assumed example):

(penguins
 .groupby(["island", "species"])                          # two categorical columns
 .agg(n=("bill_length_mm", "count"))                      # count the rows in each group
 .reset_index()
 .pivot(index="island", columns="species", values="n")    # longer to wider
 .plot.bar())                                             # one group of bars per island
plt.show()
As you can observe, with pandas as well, we can generate plots without directly calling matplotlib. However,
you still need to use matplotlib for generating legends and handling other aesthetic aspects. In general, you
can create plots using pandas, and the process is quite intuitive and often quicker. This is the concept behind
the accessor – it provides an interface to create straightforward plots without unnecessary complexity. I
hope this provides you with a better understanding of the pandas plotting mechanism.
Summary
Pandas, a popular Python library for data manipulation and analysis, includes a convenient plotting mechanism
that simplifies the creation of basic plots directly from DataFrame and Series objects. In summary, Pandas’
plotting mechanism offers a convenient and user-friendly way to create basic data visualizations directly
from your data stored in DataFrames and Series. It’s a valuable tool for quick exploratory data analysis and
visualization tasks in data science and data analysis workflows.
Learning Objectives
By the end of this unit, you will be able to understand:
● Plot example

Introduction
Until now, we have been content with utilizing the stateful approach and have explored a modest level of customization through it. However, let’s now delve into more intricate examples where the stateless approach proves to be more advantageous.
Plot Example
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
data = np.random.randn(100)
fig, ax = plt.subplots()

# we can save the information of ax.hist into n, bins and patches; for now it is not important, but you can
# observe what they contain, which would give you more exposure to the metadata that matplotlib charts carry
n, bins, patches = ax.hist(data)

# adding x labels and y labels; note the difference: here we are using set_xlabel (the stateless approach),
# not xlabel (the stateful approach)
ax.set_xlabel('bins')
ax.set_ylabel('counts')
plt.show()
We can see many differences from the stateful approach. One of them is that the command names mostly start with .set_, which suggests that the changes are happening on the object itself. We are also tracking the plot through the ax object alone: whatever relevant method we apply on it gets reflected in the plot. To me this is much more satisfactory than what we were doing with plt., because plt. alters the graph state but we don’t know how it is tracking the object, which can be really confusing and a source of bugs.
Here is an output
Next, let us scale this up to a grid of plots on the penguins dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
# And convert each column to lower case. This is something I left for the reader to do on their own end.

# Replacing missing with mean / always use a vectorized pandas method, don't try to recreate something which is loop based
num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]   # the four numeric columns (an assumed selection)
penguins[num_cols] = penguins[num_cols].fillna(penguins[num_cols].mean())

names = [' '.join([i.title() for i in item.split('_')]) for item in num_cols]  # We are generating name labels here based on the columns given to us
cols, rows = 2, 2  # the number of columns and rows of the grid where our plots are going to be placed;
                   # since we have 4 columns the grid size is 2x2, in case we had 6 columns we could have used 3x2, etc.

fig = plt.figure()
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    sns.boxplot(y=penguins[col], ax=ax)   # one box plot per numeric column (the exact arguments are assumed)
    ax.set_xlabel(names[i])               # putting the names on the labels of the x axis for each of the columns
fig.tight_layout()  # make the distance between plots nicer so that names don't munge all together
plt.show()

# This is exactly the same code; the only thing which is different is the line where we call sns.boxplot.
# Similarly we can use ax.boxplot() in case we want plain matplotlib to do the work; I am putting the code for matplotlib in the next box.
# matplotlib code
# this code is more involved because the parameters for boxplot are not quite so simple; however we can do a lot
# with the given input parameters of boxplot in matplotlib, try changing colors at your end and see how that goes!!!
rows = 2
cols = 2
fig = plt.figure()
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.boxplot(penguins[col], patch_artist=True,   # patch_artist=True lets the box face take a colour
               boxprops=dict(facecolor='yellow', color='red'),
               whiskerprops=dict(color='blue'))
    ax.set_xlabel(names[i])
fig.tight_layout()
plt.show()
Please be informed that we will utilize our sessions to comprehensively interpret these graphs. At this point,
within this book, we will primarily focus on the techniques involved in generating these visualizations.
Let’s progress and employ the same approach to plot bar charts for various columns of character type. It’s
worth noting that this technique can be extended to various types of graphs you create.
import seaborn as sns
import matplotlib.pyplot as plt

lst = ["species", "island", "sex"]        # the character-type columns (an assumed selection)
names = [item.title() for item in lst]    # converting them all into title case

# note this is the same structure, and we can observe that apart from sns.countplot nothing is different as such
cols = 3
rows = 3
fig = plt.figure()
for i, col in enumerate(lst):
    ax = fig.add_subplot(rows, cols, i + 1)
    sns.countplot(x=penguins[col], ax=ax)
    ax.set_xlabel(names[i])
fig.tight_layout()
plt.show()
Now we can try the same code with matplotlib to understand if the same stateless approach can be done on
matplotlib as well. To make it work with matplotlib we can use the below code
import matplotlib.pyplot as plt

data = penguins    # the same dataset as before
cols = 3
rows = 3
fig = plt.figure()
for i, col in enumerate(lst):
    ax = fig.add_subplot(rows, cols, i + 1)
    temp = data[col].value_counts()   # here we created a temporary object which calculates the frequency
    # we put the values in height and the labels in x of the temporary object and we are done; the rest of the code is very similar
    ax.bar(x=temp.index.astype(str), height=temp.values)
    ax.set_xlabel(names[i])
fig.tight_layout()
plt.show()
So, we can see that a similar approach can be used with categorical data to create charts. Let us switch gears, look at some difficult-looking graphs, and understand how they work.

Now, if you ever want to save graphs or play with the graphs (zoom in, zoom out, etc.), you can use this magic command in a Jupyter notebook or JupyterLab:

%matplotlib qt
# Note the above magic command which says `qt`
# Note you have to use %matplotlib inline to stop this behavior

The %matplotlib qt command is employed to switch the backend of the Matplotlib library to the Qt backend. This enables the utilization of interactive plots in a separate window, detached from the notebook interface. The Qt backend provides a graphical user interface (GUI) for engaging with the visualizations.

Here’s what %matplotlib qt accomplishes and why it finds use:

● Backend Selection: Matplotlib supports a range of graphical backends that dictate how plots are displayed. By default, the “inline” backend is commonly used in Jupyter Notebook, rendering plots directly within the notebook. The Qt backend instead renders each figure in its own interactive window; if you re-run the earlier plotting code with this backend active, the figure opens in a separate window.
Once you execute the provided code, a new window will open. Through this window, you can save your plots
as images, perform zooming in and out, and even rotate 3D plots to view data from different perspectives.
This functionality can be particularly advantageous for comprehending complex data that is challenging to
conceptualize.
A screenshot of qt backend
Note the floppy-disk symbol for saving a figure, and similarly the zoom symbol; there are also adjustment buttons to tweak figure properties. You can fiddle with these and realise that, in some cases, this can be very helpful.
To save a graph you do not need the qt backend; you can use the normal inline backend as well, but in that case you need to call plt.savefig to save the figure. Here is an example to demonstrate (the file names are illustrative).

plt.savefig("my_figure.png")   # call this before plt.show()
plt.show()

The above command will save the figure in the current working directory. To change the directory you can provide a path like below.

## The below code saves my current figure object into the test folder present in the E drive
plt.savefig("E:/test/my_figure.png")
plt.show()
Here are some general use cases of 3D graphs. 3D graphs, also referred to as three-dimensional plots, are valuable when you want to analyze and visualize data involving three variables or dimensions. They offer a more comprehensive understanding of relationships and patterns in data that go beyond what traditional 2D graphs can portray. Here are instances where 3D graphs are particularly effective:

● Model Evaluation: 3D graphs facilitate the evaluation of models dependent on three input variables. In finance, for instance, you could visualize the impacts of interest rates, inflation, and investment returns on portfolio value.

● Data Clustering: In machine learning and data mining, 3D graphs aid in visualizing clusters of data points when considering more than two features for clustering.
Here’s how you can understand np.meshgrid: it takes two 1-D coordinate vectors and expands them into two 2-D grids, one holding the x coordinate of every grid point and one holding the y coordinate.

import numpy as np

x = np.linspace(-5, 5, 8)
y = np.linspace(-5, 5, 8)
X, Y = np.meshgrid(x, y)   # each row of X is a copy of x; each column of Y is a copy of y
print(X)
print(Y)
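These grids are what a 3D surface plot consumes. As a minimal sketch (the surface z = sin(√(x² + y²)) is just an illustrative choice), a 3D plot can be drawn like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))        # the height of the surface at each grid point

fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # a 3D-capable axes
ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()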
Below is the output of the earlier 3D plot. Feel free to rotate it on your end to observe its behavior. For the sake
of completeness, I’m also providing the graph here.
Believe it or not, these two graphs are exactly the same, but the perspective changes drastically depending
on how you perceive them. Hence, I emphasize the importance of generating and observing this on your end
to gain a better understanding. With this in mind, we’ve covered a substantial number of important topics in
this chapter. In our next chapter, we’ll delve into small animations achievable using matplotlib. This chapter
promises to be both fun and intriguing, particularly for those intrigued by generative art. Moreover, it holds
potential for simplifying complex concepts when teaching a broader audience. So, until the next section on
animation, see you there!
Customizing Matplotlib graphs allows you to tailor your data visualizations to meet specific design and
presentation requirements. Customizing Matplotlib graphs allows you to create professional-looking
visualizations that effectively communicate your data and insights to your target audience. It's an essential
skill for data scientists, analysts, and researchers working with data visualization in Python.
Unit 9
Animations in Matplotlib

Learning Objectives
By the end of this unit, you will be able to understand:
● Sine curve
● Central limit theorem (CLT)

Introduction
This chapter is designed to be a more enjoyable journey, yet it offers a valuable perspective: animations can be remarkably helpful in grasping complex concepts. Let’s kick things off with a code example that simplifies the understanding of animation components. Subsequently, we will delve into an animation that brings the central limit theorem to life.
Sine Curve
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
xdata, ydata = [], []         # empty x and y values, later filled during the animation
ln, = ax.plot([], [], 'ro')   # a line of markers starting with no data (the 'ro' marker style is assumed)

def init():
    ax.set_xlim(0, 2*np.pi)   # this is the x axis limit, from 0 to 2*pi
    ax.set_ylim(-1, 1)        # this is the y axis limit, from -1 to +1 (because sine and cosine values lie
                              # between -1 and +1, both inclusive, for any angle)
    return ln,

# this function is to be called later in FuncAnimation for updating
def update(frame):
    xdata.append(frame)       # frame is something which is taken as a value while doing the animation;
                              # you can call it anything, this is just a name given
    ydata.append(np.sin(frame))
    ln.set_data(xdata, ydata)
    return ln,

# here the update function will take a value from frames and update the values of xdata, ydata and ln,
# which get reflected in the fig object; the initial state of fig is as per the init function defined
# blit = True is for better rendering
ani = FuncAnimation(fig, update,
                    frames=np.linspace(0, 2*np.pi, 128),
                    init_func=init, blit=True)
plt.show()

● The FuncAnimation class is used to create the animation. It takes the figure, the update function, the frames to iterate through, the init function, and blit=True to improve rendering speed.

It’s important to note that this example serves as a basic illustration to convey the concept of animation using Matplotlib. You have the flexibility to tailor the animation by adjusting plot data and visual aspects within the update function. Furthermore, you can explore more intricate animations and even integrate Matplotlib with libraries like NumPy for advanced visual effects.

Central Limit Theorem (CLT)

Now, if we wish to delve into a more intricate topic, such as the Central Limit Theorem (CLT), we must construct a function that computes averages for samples and subsequently employs these averages to generate a bell curve through plotting.

Let’s begin by comprehending the essence of the Central Limit Theorem before venturing into the animation aspect:
In simpler terms, the Central Limit Theorem stipulates that when you draw numerous samples of a specific size from any population, calculate the mean for each sample, and then depict the distribution of these computed sample means, the resulting distribution will mirror the characteristic bell-shaped curve, emblematic of a normal distribution.

A key consideration related to the Central Limit Theorem is that it underpins hypothesis testing, confidence interval estimation, and various other statistical analyses.

Here is how the code goes (the population, sample size, and plotting range are illustrative choices).

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Set up the figure and axis
fig, ax = plt.subplots()
n_bins = 20
x_range = (-1, 1)                  # plotting range for the sample means
hist = np.zeros(n_bins)            # an (initially empty) histogram
line, = ax.plot([], [])            # an (initially empty) line plot on the axes

def init():
    return line,

def update(frame):
    ax.clear()
    # draw `frame + 1` samples of size 30 from a uniform population and keep each sample's mean
    samples = np.random.uniform(-1, 1, size=(frame + 1, 30)).mean(axis=1)
    # generating the hist and edges values from np.histogram over the values (array of data points); this essentially captures the counts
    hist, edges = np.histogram(samples, bins=n_bins, range=x_range, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    line, = ax.plot(centers, hist)
    # the rest of the lines are just aesthetics
    ax.set_xlim(x_range)
    ax.set_xlabel('Sample mean')
    ax.set_ylabel('Density')
    return line,

# The FuncAnimation call is very similar to what we saw earlier: it has a figure which gets updated
# by the update function, which takes its input from frames
ani = FuncAnimation(fig, update, frames=200, init_func=init, interval=50, blit=False)
plt.show()
● We are initializing an empty histogram hist and an empty line plot line on the axes.
● update(frame) is the function called for each frame. It clears the axes, generates random samples,
calculates the histogram and bin centers, updates the line plot data, sets axis labels, and limits.
● FuncAnimation creates the animation. It takes the figure, update function, frames (number of frames), init
function, and other parameters like blit and interval.
The last two animations can’t be shown in a PDF, so I am not attaching a GIF or any image file. You need to run them on your PC to see that they do work.
In summary, this code generates an animation showing how the distribution of sample means approaches a
normal distribution as the sample size increases.
I hope this clarifies how the central limit theorem can be seen as an animation, to understand how getting more and more data pushes the distribution of sample means towards a normal curve. This is one of the very important concepts in statistics and comes up a lot during statistical discussions and readings.
Summary
Animations in Matplotlib refer to the ability to create dynamic, time-based visualizations where data changes
and evolves over a sequence of frames. In summary, Matplotlib’s animation capabilities allow you to bring
your data to life by creating dynamic and interactive visualizations that convey changes and patterns over
time. Animations are a powerful tool for storytelling and communicating insights in various fields, from
scientific research to data-driven presentations.
Unit 10
Using SymPy Commands With
Matplotlib
Learning Objectives
By the end of this unit, you will be able to understand:
● LaTeX
● SymPy

Introduction
This is going to be a fairly small chapter, and I want to say that this is rather an introduction to using LaTeX-based notation than to SymPy. However, writing LaTeX requires practice and time, so instead we use SymPy (a third-party Python package) to write LaTeX commands. But if you are interested, you can learn LaTeX itself.
Latex
SymPy is an open-source package and can be installed with pip, Python’s package manager:

pip install sympy
# or, inside an Anaconda environment (the exact command may vary with your setup):
# conda install sympy
# Please don't try to use pip within a conda environment; that would not work in most of the cases.

Now we have got SymPy with us. To understand SymPy (symbolic Python), we need to have some basic background in LaTeX. Let us take an equation. Ever wondered how books contain equations like the ones mentioned below?

ax^2 + bx + c = 0

x = (-b ± √(b^2 - 4ac)) / 2a

Upon observation, you’ll encounter a plethora of intriguing symbols such as “+,” “x^2,” square root symbols over certain expressions, and more. If you wish to incorporate similar expressions into your books or journals, you’d need LaTeX. However, instead of using LaTeX directly, we’ll employ SymPy to formulate such expressions. Subsequently, we can utilize these expressions to annotate text, legends, axes, and more within our plots.

First, let’s establish a formal definition of LaTeX to provide you with a foundational understanding, followed by an exploration of SymPy’s capabilities.

LaTeX serves as a typesetting system widely employed for crafting documents that require intricate formatting, such as research papers, theses, reports, academic articles, and books. It enjoys notable popularity within academic and scientific spheres due to its proficiency in generating meticulously formatted documents with professional typography.

Diverging from traditional word processors that prioritize WYSIWYG (What You See Is What You Get) editing, LaTeX leverages a markup language enabling you to describe your document’s structure and formatting using plain text commands. Subsequently, the LaTeX engine processes these commands to generate exquisitely typeset documents. It attends to details encompassing font styles, section headings, references, footnotes, tables, and mathematical equations.

LaTeX offers a host of advantages:

● High-Quality Typesetting: LaTeX yields documents characterized by impeccable typography and formatting, an attribute particularly pivotal in academic and scientific writing.
● Version Control: LaTeX documents are plain text, so they work well with version control systems such as Git. This feature makes collaborative writing and review far simpler than with binary document formats.

SymPy itself is an open-source library and can be effortlessly installed using Python’s package manager, pip. It finds extensive utility among mathematicians, scientists, engineers, and students for symbolic mathematics, exploration of mathematical concepts, and generation of neatly typeset expressions. It equips users with tools to manipulate mathematical expressions symbolically rather than numerically.
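The bullets below walk through a first small SymPy snippet line by line; for reference, the snippet they describe looks like this (a sketch matching those lines):

import sympy as sp

x, y = sp.symbols('x y')    # symbolic variables
expr = x**2 + 2*x + y       # a polynomial in x and y
print(expr)                 # prints the symbolic expression, not a number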
● x, y = sp.symbols('x y') : This line creates symbolic variables x and y using the sp.symbols() function. These variables are used to represent mathematical symbols or placeholders in symbolic computations.

● expr = x**2 + 2*x + y : Here, an expression expr is defined using the symbolic variables x and y. The expression is a polynomial involving these variables: it’s x squared plus 2 times x plus y.

● print(expr) : This line prints the value of the expr variable. However, in the context of SymPy, this doesn’t display a numerical result but rather the symbolic expression itself.

To understand the second example with the same SymPy process (ignoring the mathematical details of what matrices are, etc.), we are just looking at how to represent objects in a more bookish form.

A = sp.Matrix([[1, 2], [3, 4]])
B = sp.Matrix([[5, 6], [7, 8]])
product = A * B
inverse = A.inv()
print(product)
print(inverse)

The two print calls display:

Matrix([[19, 22], [43, 50]])
Matrix([[-2, 1], [3/2, -1/2]])

● A = sp.Matrix([[1, 2], [3, 4]]) : This line creates a 2x2 matrix named A using the sp.Matrix() constructor from the SymPy library. The matrix is initialized with the values [1, 2] in the first row and [3, 4] in the second row.
● print(product) : This line prints the value of the product matrix, which is the result of multiplying matrices A and B.

● print(inverse) : This line prints the value of the inverse matrix, which is the inverse of matrix A.

Now let us put a SymPy expression onto a Matplotlib chart:

import sympy as sp
import matplotlib.pyplot as plt
import numpy as np

x = sp.symbols('x')       # here we defined a symbolic x
expr = x**2 + 2*x + 1     # we wrote our expression of a parabola

# Convert the symbolic expression to a numpy function
f = sp.lambdify(x, expr, 'numpy')
x_vals = np.linspace(-10, 10, 200)   # the plotting range is illustrative
y_vals = f(x_vals)

plt.plot(x_vals, y_vals)
# if you look at the plot, the title doesn't show x**2 + 2*x + 1 literally but rather a much cleaner, typeset version
plt.title(f'${sp.latex(expr)}$')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.show()
You might be wondering about that dollar symbol, so here is what it does:
In LaTeX, the dollar symbol $ is used to enter and exit math mode. Math mode is used for typesetting
mathematical content, such as equations, formulas, variables, and symbols. When you enclose text within
a pair of dollar symbols, LaTeX switches to math mode and treats the enclosed content as mathematical
notation. When you use a single dollar symbol, it enters or exits inline math mode, and when you use a pair of
dollar symbols $$, it enters or exits display math mode.
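Inside Matplotlib the same convention applies to any text element; for instance (the expression is arbitrary):

import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.title(r'$y = x^2$')    # inline math mode inside a Matplotlib title
plt.xlabel(r'$x$')
plt.show()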
Back to the original expression: if you remember where we started, you can create that expression and print it in your notebook using the following (the symbol definitions are the obvious ones).

import sympy as sp
import IPython

a, b, c = sp.symbols('a b c')
quadratic_formula = (-b + sp.sqrt(b**2 - 4*a*c)) / (2*a)
IPython.display.display(quadratic_formula)

This renders, nicely typeset, as (-b + √(b² - 4ac)) / 2a.
Now that we are familiar with how Sympy interacts with Matplotlib and how we can use “$” to manipulate
mathematical expressions as annotations, let’s delve into Sympy’s plotting capabilities. Yes, Sympy has a
concealed plotting feature as well! And it leverages Matplotlib in the background. To cover all aspects, let’s
examine an example to comprehend how it operates.
Through the combination of Sympy and its plotting capabilities, we can effortlessly graph fundamental
mathematical functions. For instance:
from sympy import symbols, plot   # you can also use import sympy as sp; however, for simplicity I am using this notation

x = symbols('x')
p1 = plot(x**2, show=False)       # a single-expression plot object (the expression here is illustrative)
p1.show()                         # this is mandatory to write, otherwise you won't be able to see the plot
If you want to draw multiple plots for several mathematical functions, you can simply do this:

# a one liner, where multiple functions like f(x) = x, f(x) = x**2, f(x) = x**3 can be easily drawn
# (the range used here is illustrative)
plot(x, x**2, x**3, (x, -1, 1))
This is how the plot appears. Note that it is not feasible to obtain legends using Sympy; for that, you must
employ Matplotlib code. Therefore, while Sympy is not designed for visualization, it excels in swiftly solving
uncomplicated mathematical functions. Reserve the use of Sympy when you merely seek to comprehend
data, as opposed to creating publishable content. For more formal and publication-oriented work, Matplotlib
is the tool of choice.
To solve this problem of legends you would have to call Matplotlib directly. Here is an example of doing a similar thing with legends.
import sympy
from sympy import symbols, plot
import matplotlib.pyplot as plt

x = symbols('x')

## creating two plots here with the show=False parameter so as not to draw them, just to get the plot objects in line1 and line2
line1 = plot(x**2, (x, 0, 1), label='$f(x)$', show=False)[0]
line2 = plot(x**3, (x, 0, 1), label='$f(x)$', show=False)[0]

## Now get all the points from the above objects; this will be used in matplotlib
x1, y1 = line1.get_points()
x2, y2 = line2.get_points()

## These code lines must be familiar by now, so we use them to finalise the plot
## define figure and axes
fig, ax = plt.subplots(1, 1)
## define the plots
ax.plot(x1, y1, label='x**2')
ax.plot(x2, y2, label='x**3')
ax.legend(loc='lower left')
plt.show()
In this chapter, we learned about how Sympy operates and how symbols can be integrated into Matplotlib
charts to incorporate more mathematical notations in titles or legends. While text annotations are yet to
be covered in an upcoming chapter, we trust that this has provided you with a clearer understanding of
incorporating symbols into Matplotlib charts and utilizing Sympy for generating rapid plots.
Summary
Using SymPy commands with Matplotlib involves combining the capabilities of two Python libraries: SymPy
and Matplotlib. SymPy is a symbolic mathematics library that allows you to perform algebraic and symbolic
computations, while Matplotlib is a powerful library for creating visualizations and plots.
This combination allows you to visualize mathematical concepts, equations, and data generated through
symbolic computations in a graphical format, making it easier to understand and communicate mathematical
ideas and results.
● Rectangle((x, y), width, height) creates a rectangle at the specified coordinates (x, y) with the given width and height.

Remember to adjust the coordinates, dimensions, colors, and other properties as needed to achieve the desired appearance for your rectangle.
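As a small sketch of that call (all coordinates, sizes and colours here are arbitrary):

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# a rectangle anchored at (0.2, 0.3), 0.4 wide and 0.3 tall
rect = Rectangle((0.2, 0.3), 0.4, 0.3,
                 linewidth=2, edgecolor='red', facecolor='yellow')
ax.add_patch(rect)   # patches must be added to the axes explicitly
plt.show()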
● Closed = True indicates that the polygon is closed, meaning its last vertex is connected to the first vertex.

● Linewidth sets the border thickness of the polygon.

● Edgecolor sets the color of the polygon’s border.

● Facecolor sets the color of the polygon’s interior.

You can adjust the vertices, colors, and other properties to customize the appearance of the polygon according to your needs.
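A short sketch of a Polygon patch using those properties (the vertices are arbitrary):

import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

fig, ax = plt.subplots()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

vertices = [(0.2, 0.2), (0.8, 0.3), (0.6, 0.8)]   # an arbitrary triangle
poly = Polygon(vertices, closed=True,
               linewidth=2, edgecolor='blue', facecolor='lightgreen')
ax.add_patch(poly)
plt.show()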
You might be wondering how many commands are supported by patches; well, you can look at dir(patches) and you will realise that these are the classes that can be used for creating new objects (e.g., JoinStyle is one such entry), and it is left to us whether or not each of them is useful:

['Annulus', 'Arc', 'Arrow', 'ArrowStyle', 'BoxStyle', 'CapStyle', 'Circle', 'CirclePolygon', 'ConnectionPatch', 'ConnectionStyle', 'Ellipse', 'FancyArrow', 'FancyArrowPatch', ...]

Shapes are only half of the story; we often also want to point at things on an ordinary chart, which is what annotations are for:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 9, 3, 6, 8]

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the data
ax.plot(x, y, marker='o', label='Data')

# Annotate a point on the plot
# To use annotation we use ax.annotate, giving a label 'Annotated Point' and an arrow to point at the annotated value
# (the particular point and text position below are illustrative)
ax.annotate('Annotated Point', xy=(3, 3), xytext=(3.5, 6),
            arrowprops=dict(arrowstyle='->'))

ax.legend()

# Show the plot
plt.show()

● ax.annotate is used to add an annotation. The xy parameter specifies the point to annotate, while xytext specifies the location of the annotation text.

● Arrowprops specifies properties for the arrow that connects the annotation to the point.
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt

# Solve the cubic symbolically (the polynomial is the one discussed below)
x = sp.symbols('x')
equation = x**3 - 6*x**2 + 11*x - 6
roots = sp.solve(equation, x)

# Convert SymPy roots to numerical values
numerical_roots = [float(root) for root in roots]

# Plot the equation over a window around the roots (the window and styling are illustrative)
f = sp.lambdify(x, equation, 'numpy')
x_vals = np.linspace(0, 4, 200)
fig, ax = plt.subplots()
ax.plot(x_vals, f(x_vals), color='blue')

# Mark each root on the x-axis and annotate it with its value
ax.scatter(numerical_roots, [0] * len(numerical_roots), color='red')
for root in numerical_roots:
    ax.annotate(f'root = {root:g}', xy=(root, 0), xytext=(root, 5),
                arrowprops=dict(arrowstyle='-|>'))
plt.show()
In this example, we use SymPy to find the roots of the equation x**3 - 6*x**2 + 11*x - 6 . Then, we convert the
SymPy roots to numerical values and use Matplotlib to plot the equation. For each root, we annotate it with
its value on the x-axis.
We can see that with the help of annotations we can point out the actual roots; there are 3 roots in this case, with values of 1, 2 and 3. To represent that we used ax.scatter for plotting and ax.annotate for the annotations, and we have also used an arrow, like in the earlier example, to point the text to the dots, but with a different style.
In this chapter we looked into drawing different shapes and annotating different charts; you can of course use this to annotate any chart drawn in Matplotlib. In the last chapter we will look into algorithm-based visualization, which may not be possible via common visualization tools, as they would require the algorithm to be implemented within the tool. This is also one of the reasons why we choose Python over other dashboarding tools.
Summary
Drawing different shapes and functions typically involves using various programming libraries or tools to
create visual representations of geometric shapes or mathematical functions.
In summary, drawing different shapes and functions involves using various programming and graphical tools
to create visual representations of geometric shapes, mathematical functions, custom shapes, or data. The
choice of tools and libraries depends on your specific requirements and programming language preferences.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

n_samples = 300       # the sample count is illustrative
n_features = 2
n_clusters = 4

# Generating random data using 2 features and 4 groups for example's sake
features, _ = make_blobs(n_samples=n_samples, n_features=n_features, centers=n_clusters)

# Initialize KMeans
kmeans = KMeans(n_clusters=n_clusters)

# We need to fit the algorithm on the data; that is why we are using fit for it.
kmeans.fit(features)

# Plot the data points and cluster centers
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(features[:, 0], features[:, 1], c=kmeans.labels_, alpha=0.5)   # each point coloured by its cluster
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker='o', color='orange', s=200, alpha=0.5)                  # the cluster centres
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
plt.show()
In this visualization, each data point is colored based on the cluster it belongs to, and the orange “O” markers
indicate the cluster centers.
However, it’s important to note that in real-world scenarios, you would typically conduct a more in-depth
analysis of the data, consider feature scaling, and experiment with various cluster numbers to determine the
best configuration for your specific needs.
Interestingly, if you run the code provided below, which is unrelated to clustering, you’ll notice a similar plot.
What does this tell us? It essentially indicates that even without prior knowledge of penguin species, we
could have inferred the existence of distinct groups based on the bill_length_mm and flipper_length_mm
measurements among the penguins. There seem to be three clear clusters, noteworthy is the fact that we
didn’t employ the species column in the K-means clustering process.
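A sketch of the comparison being described, using just the two measurement columns (no species information is passed to the plot):

import pandas as pd
import matplotlib.pyplot as plt

penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# a plain scatter of the two measurements; the apparent groups emerge on their own
plt.scatter(penguins["bill_length_mm"], penguins["flipper_length_mm"], alpha=0.6)
plt.xlabel("Bill length (mm)")
plt.ylabel("Flipper length (mm)")
plt.show()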
A note for the readers: In all the examples above, we utilized data that shares a common scale. If you intend
to compare and cluster features that are significantly different, it’s advisable to standardize the data to the
same scale before applying clustering techniques. This is because distance calculations, often used in
clustering, can be influenced by features with larger values. Therefore, standardizing the dataset can yield
more meaningful results. However, in the cases presented here, standardization wasn’t necessary since we
used features with the same physical unit. (You can refer to one example from DBScan.)
Please keep in mind that the Palmer Penguins dataset contains more features that you could also explore for
clustering purposes. This example, however, focused on two features for the sake of simplicity.
KMeans is a popular and widely used clustering algorithm; however, like any algorithm, it comes with
limitations and drawbacks. Some of the key drawbacks of KMeans include:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two interleaving half-moons of points (the sample count and noise level are illustrative)
X, _ = make_moons(n_samples=300, noise=0.05)

## Note here we are doing a standard scaler (standardization) to make every column the same scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize DBSCAN (the eps and min_samples values are illustrative starting points)
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

## try using multiple values of eps and min_samples and see how that changes things
## You should try kmeans too; you will realise KMeans fails with this dataset in identifying the correct pattern

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("DBSCAN Clustering of Moon Data")
plt.show()
The resulting plot will show the data points colored according to the clusters identified by DBSCAN. You’ll
likely see two clusters representing the two moon shapes.
Remember that parameter tuning is crucial in DBSCAN. Adjusting eps and min_samples can significantly
affect the results. Experiment with different parameter values to see how they impact the clustering outcome.
Also, note that DBSCAN can identify noise points as well, which will be assigned the label -1. These are data
points that do not belong to any dense region or cluster.
With this example, we have reached the conclusion of the book. I hope you found it enjoyable and informative.
It’s important to clarify that this book’s purpose is to provide a straightforward way of explaining data
visualization in Matplotlib. While the book does not cover storytelling extensively, that skill can be developed
through hands-on experience in solving real-world business problems and interactive sessions.
The aim of this book is to equip you with the ability to create visualizations, choose appropriate plots for
different situations, and interpret data insights from them. It serves as a gentle introduction to the fascinating
world of data visualization. However, it’s worth noting that this book is not exhaustive or comprehensive. It’s
only the beginning of your journey into data visualization.
There is a wealth of resources available on the internet and in books that can further enhance your knowledge.
I’d recommend exploring books like “Fundamentals of Data Visualization” by Claus Wilke and “Storytelling
with Data” published by Wiley. These resources can take your skills to the next level.
I chose a tool, Matplotlib, that may seem challenging for beginners, but I hope it has given you the perspective
that data visualization is not as daunting as it may appear. With a bit of effort and dedication, you can master
it. As you move forward in your careers, I wish you the very best and success in all your endeavors.
Summary
Miscellaneous plotting techniques encompass a wide range of creative and specialized methods for visualizing
data and information beyond traditional charts and graphs. These miscellaneous plotting techniques offer
unique ways to represent and explore data, making them valuable tools for data analysis, storytelling, and
decision-making across a wide range of domains. The choice of technique depends on the nature of the data
and the insights you want to convey to your audience.