Summary: Introduction To Data Visualization Tools
Congratulations! You have completed this module. At this point in the course, you know:
Data visualization is the process of presenting data in a visual format, such as charts, graphs, and maps,
to help people understand and analyze data easily.
Data visualization has diverse use cases, such as in business, science, healthcare, and finance.
It is important to follow best practices, such as selecting appropriate visualizations for the data being
presented, choosing colors and fonts that are easy to read and interpret, and minimizing clutter.
There are various types of plots commonly used in data visualization.
Line plots capture trends and changes over time, allowing us to see patterns and fluctuations.
Bar plots compare categories or groups, providing a visual comparison of their values.
Scatter plots explore relationships between variables, helping us identify correlations or trends.
Box plots display the distribution of data, showcasing the median, quartiles, and outliers.
Histograms illustrate the distribution of data within specific intervals, allowing us to understand its shape
and concentration.
Matplotlib is a plotting library that offers a wide range of plotting capabilities.
Pandas is a data analysis library that provides integrated plotting functionalities.
Seaborn is a specialized library for statistical visualizations, offering attractive default aesthetics and
color palettes.
Folium is a Python library that allows you to create interactive and customizable maps.
Plotly is an interactive and dynamic library for data visualization that supports a wide range of plot types
and interactive features.
PyWaffle enables you to visualize proportional representation using squares or rectangles.
Matplotlib is one of the most widely used data visualization libraries in Python.
Matplotlib was initially developed as an EEG/ECoG visualization tool.
Matplotlib’s architecture is composed of three main layers: Backend layer, Artist layer, and the Scripting
layer.
The anatomy of a plot refers to the different components and elements that make up a visual
representation of data.
Matplotlib is a well-established data visualization library that can be integrated in different environments.
Jupyter Notebook is an open-source web application that allows you to create and share documents.
Matplotlib has a number of different backends available.
You can easily add axis labels and a title to your plot with plt.
In order to start creating different types of plots of the data, you will need to import the data into a
Pandas DataFrame.
A line plot is a plot in the form of a series of data points connected by straight line segments.
A line plot is one of the most basic types of chart and is common in many fields.
You can generate a line plot by assigning "line" to the kind parameter of the plot() function.
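The last few points (line plots, the kind parameter, and adding labels and a title with plt) can be sketched as follows. This is a minimal illustration, not the course's own code: the data values and the column name "Value" are invented, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly values, invented purely for illustration
years = [2010, 2011, 2012, 2013]
df = pd.DataFrame({"Value": [12, 18, 15, 22]}, index=years)

# kind="line" asks pandas to draw a line plot through Matplotlib
ax = df["Value"].plot(kind="line")

# Title and axis labels are added through plt, which acts on the current axes
plt.title("Value per year")
plt.xlabel("Year")
plt.ylabel("Value")
```

In a Jupyter notebook the figure would normally be displayed inline; in a script with an interactive backend, plt.show() would open it in a window.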
Pandas DataFrame
Select multiple columns: df[['col1', 'col2']] (example: df_can[['Name', 'Age']])
Create a new column with values derived from existing ones: df['new_column'] = expression (example: df_can['Total'] = df_can['Quantity'] * df_can['Price'])
Select rows in a specified range: df.iloc[start:end] (example: df_can.iloc[2:5])
Select rows in a specified label/index range: df.loc[start:end] (example: df_can.loc['Age':'Quantity'])
Descriptive statistics: statistics for numerical columns
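The operations listed above can be tried on a small toy dataframe. The sketch below uses invented data and hypothetical row labels ("a" to "d"); it also shows one difference that is easy to miss: iloc excludes the end position of a slice, while loc includes the end label.

```python
import pandas as pd

# Hypothetical order data; names and numbers are invented for illustration
df = pd.DataFrame(
    {"Quantity": [2, 5, 1, 4], "Price": [10.0, 3.0, 20.0, 2.5]},
    index=["a", "b", "c", "d"],
)

# New column derived from existing ones
df["Total"] = df["Quantity"] * df["Price"]

# Select multiple columns: a list inside the brackets returns a DataFrame
subset = df[["Quantity", "Total"]]

# iloc slices by position and excludes the end position (rows 1 and 2)
by_position = df.iloc[1:3]

# loc slices by label and includes the end label (rows "b", "c" and "d")
by_label = df.loc["b":"d"]
```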
Seaborn (Python): Statistical data visualization. Stylish, specialized statistical plot types. Heatmaps, violin plots, scatter plots, bar plots, count plots, etc. Medium. Can be combined with other libraries to display plots on dashboards.
pandas is an essential data analysis toolkit for Python. From their website:
https://pandas.pydata.org/pandas-docs/stable/reference/index.html
pandas Basics:
The first thing we'll do is install openpyxl, a module that pandas requires to read Excel (.xlsx) files
(older pandas versions used xlrd for this).
!mamba install openpyxl==3.0.9 -y
import pandas as pd

df_can = pd.read_excel(
    'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2)
head(): To view the first rows of the dataframe (five by default), use the head() function.
When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We
can do this by using the info() method.
info(): df_can.info(verbose=False)
columns: To get the list of column headers, use the columns instance variable: df_can.columns
index: To get the list of indices, use the index instance variable: df_can.index
Note: The default type of the instance variables index and columns is NOT list.
tolist(): To get the index and columns as lists, we can use the tolist() method.
df_can.columns.tolist()
df_can.index.tolist()
print(type(df_can.columns.tolist()))
print(type(df_can.index.tolist()))
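The note above can be checked on a toy dataframe. In this sketch the dataframe is a stand-in for df_can with invented contents; the point is only that columns and index are pandas Index objects until tolist() converts them.

```python
import pandas as pd

# Toy dataframe standing in for df_can; contents are invented
df = pd.DataFrame({"Country": ["A", "B"], 1980: [1, 2]})

print(type(df.columns))  # a pandas Index object, not a list
print(type(df.index))    # a RangeIndex for the default integer index

# tolist() converts both to plain Python lists
columns_list = df.columns.tolist()
index_list = df.index.tolist()
```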
shape: To view the dimensions of the dataframe, we use its shape instance variable.
df_can.shape
drop(): Let's clean the data set to remove a few unnecessary columns. We can use the pandas drop() method as
follows:
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)
rename(): Let's rename the columns so that they make sense. We can use the rename() method by passing in a
dictionary of old and new names as follows:
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent',
'RegName':'Region'}, inplace=True)
df_can.columns
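The calls above modify df_can in place because of inplace=True. As a small sketch of the alternative, the example below (using a toy dataframe with invented values) relies on the default behavior, where drop() and rename() return new dataframes and leave the original untouched.

```python
import pandas as pd

# Toy dataframe reusing a few of the original column names; values are invented
df = pd.DataFrame({"OdName": ["X"], "AreaName": ["Y"], "REG": [1], 1980: [10]})

# Without inplace=True, drop() and rename() return new dataframes
cleaned = df.drop(columns=["REG"])  # same effect as drop(["REG"], axis=1)
cleaned = cleaned.rename(columns={"OdName": "Country", "AreaName": "Continent"})
```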
Adding a column: We will also add a 'Total' column that sums up the total immigrants by country over the
entire period 1980 - 2013.
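The statement that creates the 'Total' column is not shown here, so the following is only one possible way to do it, sketched on a toy dataframe. It assumes the year columns have integer labels and sums across them row by row with sum(axis=1); the data and the three-year range are invented.

```python
import pandas as pd

# Toy dataframe with invented values; year columns are assumed to be integers
df = pd.DataFrame(
    {"Country": ["A", "B"], 1980: [1, 2], 1981: [3, 4], 1982: [5, 6]}
)

# Sum only the year columns, row by row (axis=1)
years = [1980, 1981, 1982]
df["Total"] = df[years].sum(axis=1)
```

Selecting the year columns explicitly keeps text columns such as 'Country' out of the sum.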
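describe(): To view a quick summary of each numerical column in the dataframe, use the describe() method:
df_can.describe()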
pandas Intermediate: Indexing and Selection (slicing)
Select Column
There are two ways to filter on a column name:
Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.
Method 2: Bracket notation, which works for any column name and can select several columns at once.
Example: Let's try filtering on the list of countries ('Country').
df_can.Country # returns a series
Let's try filtering on the list of countries ('Country') and the data for years: 1980 - 1985.
df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe
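To make the difference between the two ways of selecting columns concrete, here is a short sketch on a toy dataframe. The column "Area Name" is a hypothetical name with a space, included only to show a case where attribute access cannot be used.

```python
import pandas as pd

# Toy dataframe; "Area Name" is a hypothetical column whose name has a space
df = pd.DataFrame(
    {"Country": ["A", "B"], "Area Name": ["Asia", "Europe"], 1980: [1, 2]}
)

countries = df.Country         # attribute access returns a Series
same = df["Country"]           # brackets return the same Series
area = df["Area Name"]         # needs brackets because the name has a space
table = df[["Country", 1980]]  # a list of names returns a DataFrame
```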
Select Row
There are 2 main ways to select rows: by label with loc, and by position with iloc.
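As a minimal sketch of label-based versus position-based row selection, the example below sets a hypothetical 'Country' column as the index of a toy dataframe and then fetches the same row both ways. The data is invented.

```python
import pandas as pd

# Toy dataframe with invented values, indexed by country name
df = pd.DataFrame({"Country": ["A", "B", "C"], 1980: [1, 2, 3]})
df = df.set_index("Country")

row_by_label = df.loc["B"]     # select the row whose index label is "B"
row_by_position = df.iloc[1]   # select the second row (positions start at 0)

# loc also accepts a column label to pick a single value
value = df.loc["C", 1980]
```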
In Matplotlib, backends are the components that handle the rendering of plots. They determine how figures are displayed or saved,
and they can be categorized into two main types: interactive backends and non-interactive backends. Here’s a detailed explanation
of each type of backend and its role:
1. Interactive Backends
Interactive backends allow for real-time interaction with plots. They enable features like zooming, panning, and
updating plots dynamically. Here are some common interactive backends:
TkAgg:
Role: Uses the Tkinter library for creating GUI applications.
Usage: Suitable for desktop applications where you want to display plots in a window.
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
Qt5Agg:
Role: Utilizes the Qt framework for creating interactive applications.
Usage: Ideal for applications that require a modern GUI and advanced features.
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
GTK3Agg:
Role: Uses the GTK+ toolkit for creating graphical user interfaces.
Usage: Commonly used in Linux environments.
import matplotlib
matplotlib.use('GTK3Agg')
import matplotlib.pyplot as plt
2. Non-Interactive Backends
Non-interactive backends are used for generating static images without displaying them on the screen. They are
useful for saving plots to files. Here are some common non-interactive backends:
Agg:
Role: A raster graphics backend that generates images in formats like PNG, JPEG, etc.
Usage: Ideal for saving plots to files without displaying them.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
PDF:
Role: Generates vector graphics in PDF format.
SVG:
Role: Generates vector graphics in SVG format.
Usage: Suitable for web applications and scalable graphics.
Choosing a Backend
For Interactive Use: Choose an interactive backend like TkAgg, Qt5Agg, or MacOSX.
For Saving Plots: Use a non-interactive backend like Agg, PDF, or SVG.
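The following is a minimal sketch of the "saving plots" case, assuming the Agg backend and an invented output file name written to the system's temporary directory. It selects the backend before importing pyplot, as in the snippets above, draws a simple figure, and saves it without opening a window.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files only
import matplotlib.pyplot as plt

# Draw a simple figure; the data is invented for illustration
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Saved with the Agg backend")

# Hypothetical output location in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "agg_example.png")
fig.savefig(out_path)
plt.close(fig)

# Report which backend is active
print(matplotlib.get_backend())
```

For interactive use, the same script could request one of the interactive backends instead, provided the corresponding GUI toolkit is installed.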