Pandas Complete + Visualisation Summary of IBM Visualization

Uploaded by melikakhajeh94

Summary: Introduction to Data Visualization Tools

Congratulations! You have completed this module. At this point in the course, you know:

 Data visualization is the process of presenting data in a visual format, such as charts, graphs, and maps,
to help people understand and analyze data easily.
 Data visualization has diverse use cases, such as in business, science, healthcare, and finance.
 It is important to follow best practices, such as selecting appropriate visualizations for the data being
presented, choosing colors and fonts that are easy to read and interpret, and minimizing clutter.
 There are various types of plots commonly used in data visualization.
 Line plots capture trends and changes over time, allowing us to see patterns and fluctuations.
 Bar plots compare categories or groups, providing a visual comparison of their values.
 Scatter plots explore relationships between variables, helping us identify correlations or trends.
 Box plots display the distribution of data, showcasing the median, quartiles, and outliers.
 Histograms illustrate the distribution of data within specific intervals, allowing us to understand its shape
and concentration.
 Matplotlib is a plotting library that offers a wide range of plotting capabilities.
 Pandas is a data analysis library that also provides integrated plotting functionality for quick visual exploration.
 Seaborn is a specialized library for statistical visualizations, offering attractive default aesthetics and
color palettes.
 Folium is a Python library that allows you to create interactive and customizable maps.
 Plotly is an interactive and dynamic library for data visualization that supports a wide range of plot types
and interactive features.
 PyWaffle enables you to visualize proportional representation using squares or rectangles.
 Matplotlib is one of the most widely used data visualization libraries in Python.
 Matplotlib was initially developed as an EEG/ECoG visualization tool.
 Matplotlib’s architecture is composed of three main layers: Backend layer, Artist layer, and the Scripting
layer.
 The anatomy of a plot refers to the different components and elements that make up a visual
representation of data.
 Matplotlib is a well-established data visualization library that can be integrated in different environments.
 Jupyter Notebook is an open-source web application that allows you to create and share documents.
 Matplotlib has a number of different backends available.
 You can easily add axis labels and a title to your plot with plt.xlabel(), plt.ylabel(), and plt.title().
 In order to start creating different types of plots of the data, you will need to import the data into a
Pandas DataFrame.
 A line plot is a plot in the form of a series of data points connected by straight line segments.
 The line plot is one of the most basic types of chart and is common in many fields.
 You can generate a line plot by passing kind='line' to the plot() function.
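That last point can be sketched with a toy DataFrame (the column name and values below are invented for illustration, not taken from the course dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical yearly data, indexed by year
df = pd.DataFrame({'immigrants': [10, 15, 13, 20]},
                  index=[1980, 1981, 1982, 1983])

# kind='line' makes plot() draw a line plot of the values
ax = df['immigrants'].plot(kind='line')
ax.set_xlabel('Year')
ax.set_ylabel('Immigrants')
plt.show()
```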

Data Visualization with Python


Cheat Sheet : Data Preprocessing Tasks in Pandas

Task | Syntax | Description | Example
Load CSV data | pd.read_csv('filename.csv') | Read data from a CSV file into a Pandas DataFrame | df_can = pd.read_csv('data.csv')
Handling Missing Values | df.dropna() | Drop rows with missing values | df_can.dropna()
Handling Missing Values | df.fillna(value) | Fill missing values with a specified value | df_can.fillna(0)
Removing Duplicates | df.drop_duplicates() | Remove duplicate rows | df_can.drop_duplicates()
Renaming Columns | df.rename(columns={'old_name': 'new_name'}) | Rename one or more columns | df_can.rename(columns={'Age': 'Years'})
Selecting Columns | df['column_name'] or df.column_name | Select a single column | df_can.Age or df_can['Age']
Selecting Columns | df[['col1', 'col2']] | Select multiple columns | df_can[['Name', 'Age']]
Filtering Rows | df[df['column'] > value] | Filter rows based on a condition | df_can[df_can['Age'] > 30]
Applying Functions to Columns | df['column'].apply(function_name) | Apply a function to transform values in a column | df_can['Age'].apply(lambda x: x + 1)
Creating New Columns | df['new_column'] = expression | Create a new column with values derived from existing ones | df_can['Total'] = df_can['Quantity'] * df_can['Price']
Grouping and Aggregating | df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'}) | Group rows by a column and apply aggregate functions | df_can.groupby('Category').agg({'Total': 'mean'})
Sorting Rows | df.sort_values('column', ascending=True/False) | Sort rows based on a column | df_can.sort_values('Date', ascending=True)
Displaying First n Rows | df.head(n) | Show the first n rows of the DataFrame | df_can.head(3)
Displaying Last n Rows | df.tail(n) | Show the last n rows of the DataFrame | df_can.tail(3)
Checking for Null Values | df.isnull() | Check for null values in the DataFrame | df_can.isnull()
Selecting Rows by Index | df.iloc[index] | Select rows based on integer index | df_can.iloc[3]
Selecting Rows by Index | df.iloc[start:end] | Select rows in a specified range | df_can.iloc[2:5]
Selecting Rows by Label | df.loc[label] | Select rows based on label/index name | df_can.loc['Label']
Selecting Rows by Label | df.loc[start:end] | Select rows in a specified label/index range | df_can.loc['Age':'Quantity']
Summary Statistics | df.describe() | Generates descriptive statistics for numerical columns | df_can.describe()
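To make the cheat sheet concrete, here is a minimal sketch chaining a few of those tasks on a made-up DataFrame (all names and values are illustrative):

```python
import pandas as pd
import numpy as np

# Tiny DataFrame with a duplicate row and a missing value
df = pd.DataFrame({'Name': ['A', 'B', 'B', 'C'],
                   'Age': [25, 30, 30, np.nan]})

df = df.drop_duplicates()                 # drop the repeated 'B' row
df = df.fillna(0)                         # fill the missing Age with 0
df = df.rename(columns={'Age': 'Years'})  # rename a column
adults = df[df['Years'] > 26]             # filter rows on a condition

print(adults)
```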

Cheat Sheet : Plot Libraries

Library | Main Purpose | Key Features | Programming Language | Level of Customization | Dashboard Capabilities | Types of Plots Possible
Matplotlib | General-purpose plotting | Comprehensive plot types and variety of customization options | Python | High | Requires additional components and customization | Line plots, scatter plots, bar charts, histograms, pie charts, box plots, heatmaps, etc.
Pandas | Fundamentally used for data manipulation but also has plotting functionality | Easy to plot directly on Pandas data structures | Python | Medium | Can be combined with web frameworks for creating dashboards | Line plots, scatter plots, bar charts, histograms, pie charts, box plots, etc.
Seaborn | Statistical data visualization | Stylish, specialized statistical plot types | Python | Medium | Can be combined with other libraries to display plots on dashboards | Heatmaps, violin plots, scatter plots, bar plots, count plots, etc.
Plotly | Interactive data visualization | Interactive web-based visualizations | Python, R, JavaScript | High | Dash framework is dedicated for building interactive dashboards | Line plots, scatter plots, bar charts, pie charts, 3D plots, choropleth maps, etc.
Folium | Geospatial data visualization | Interactive, customizable maps | Python | Medium | For incorporating maps into dashboards, it can be integrated with other frameworks/libraries | Choropleth maps, point maps, heatmaps, etc.
PyWaffle | Plotting waffle charts | Waffle charts | Python | Low | Can be combined with other libraries to display waffle charts on dashboards | Waffle charts, square pie charts, donut charts, etc.

pandas is an essential data analysis toolkit for Python. From their website:
https://pandas.pydata.org/pandas-docs/stable/reference/index.html
pandas Basics:
The first thing we'll do is install openpyxl (formerly xlrd), a module that pandas requires to read Excel files.

!mamba install openpyxl==3.0.9 -y

import pandas as pd

df_can = pd.read_excel(
    'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2)

print('Data read into a pandas dataframe!')

head(): To take a quick look at the data, use the head() method, e.g. df_can.head().
When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.

info(): df_can.info(verbose=False)
columns: To get the list of column headers, use the .columns instance variable: df_can.columns
index: To get the list of indices, use the .index instance variable: df_can.index

**Note: The default type of the instance variables index and columns is NOT list.**
tolist(): To get the index and columns as lists, we can use the tolist() method.
df_can.columns.tolist()
df_can.index.tolist()
print(type(df_can.columns.tolist()))
print(type(df_can.index.tolist()))
shape: To view the dimensions of the dataframe, we use its shape instance variable.
df_can.shape
drop(): Let's clean the data set to remove a few unnecessary columns. We can use the pandas drop() method as follows:
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)

df_can.head(2)
rename(): Let's rename the columns so that they make sense. We can use the rename() method by passing in a dictionary of old and new names as follows:

df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent',
'RegName':'Region'}, inplace=True)
df_can.columns
Adding a column: We will also add a 'Total' column that sums up the total immigrants by country over the entire period 1980 - 2013, as follows:

df_can['Total'] = df_can.sum(axis=1)
df_can['Total']

isnull().sum(): We can check to see how many null objects we have in the dataset as follows: df_can.isnull().sum()

describe(): We can view a quick summary of each numeric column with df_can.describe().
pandas Intermediate: Indexing and Selection
(slicing)
Select Column
There are two ways to filter on a column name:
Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

df.column_name # returns series


Method 2: More robust, and can filter on multiple columns.

df['column'] # returns series


df[['column 1', 'column 2']] # returns dataframe

Example: Let's try filtering on the list of countries ('Country'):

df_can.Country # returns a series

Now let's try filtering on the list of countries ('Country') and the data for years 1980 - 1985:

df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe

Select Row
There are 2 main ways to select rows:

df.loc[label] # filters by the labels of the index/column


df.iloc[index] # filters by the positions of the index/column

Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to do a query by a specific country. For example, to search for data on Japan, we need to know the corresponding index value.
This can be fixed very easily by setting the 'Country' column as the index using set_index() method.
df_can.set_index('Country', inplace=True)

# optional: to remove the name of the index


df_can.index.name = None
Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios: 1. The full row data
(all columns) 2. For year 2013 3. For years 1980 to 1985

# 1. the full row data (all columns)


df_can.loc['Japan']
# alternate methods
df_can.iloc[87]
df_can[df_can.index == 'Japan']
# 2. for year 2013
df_can.loc['Japan', 2013]
# alternate method
# year 2013 is the last column, with a positional index of 36
df_can.iloc[87, 36]
# 3. for years 1980 to 1985
df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1985]]
# Alternative Method
df_can.iloc[87, [3, 4, 5, 6, 7, 8]]
Sample python solution:
# 1. the full row data (all columns)
df_can.loc['Haiti']
#or
df_can[df_can.index == 'Haiti']

# 2. for year 2000


df_can.loc['Haiti', 2000]

# 3. for years 1990 to 1995


df_can.loc['Haiti', [1990, 1991, 1992, 1993, 1994, 1995]]

Column names that are integers (such as the years) might introduce some confusion. For example, when we are referencing the year 2013, one might confuse it with the 2013th positional index.

To avoid this ambiguity, let's convert the column names into strings: '1980' to '2013'.

df_can.columns = list(map(str, df_can.columns))

# [print(type(x)) for x in df_can.columns.values]  # <-- uncomment to check type of column headers
Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of
years:

# useful for plotting later on


years = list(map(str, range(1980, 2014)))
years
Exercise: Create a list named 'year' using the map() function for years ranging from 1990 to 2013. Then extract the data series from the dataframe df_can for Haiti using the year list.

#The correct answer is:


year = list(map(str, range(1990, 2014)))
haiti = df_can.loc['Haiti', year] # passing in years 1990 - 2013
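In the course this extracted series is then plotted directly; here is a sketch with a stand-in Series (the real df_can values are not reproduced here, so the numbers are made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for df_can.loc['Haiti', year]: a Series indexed by year strings
haiti = pd.Series([1666, 3000, 4500], index=['1990', '1991', '1992'])

ax = haiti.plot(kind='line')
ax.set_title('Immigration from Haiti')
ax.set_xlabel('Years')
ax.set_ylabel('Number of immigrants')
plt.show()
```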

Filtering based on a criteria


To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.

For example, let's filter the dataframe to show the data on Asian countries (AreaName = Asia).
# 1. create the condition boolean series

condition = df_can['Continent'] == 'Asia'


print(condition)

df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]


# note: when using 'and' and 'or' operators, pandas requires '&' and '|' instead
# don't forget to enclose the two conditions in parentheses
Sorting Values of a Dataframe or Series
sort_values()

df.sort_values(col_name, axis=0, ascending=True, inplace=False, ignore_index=False)
col_name - the column(s) to sort by.
axis - axis along which to sort. 0 for sorting by rows (default) and 1 for sorting by columns.
ascending - to sort in ascending order (True, default) or descending order (False).
inplace - to perform the sorting operation in-place (True) or return a sorted copy (False, default).
ignore_index - to reset the index after sorting (True) or keep the original index values (False, default).
Sample python solution:
df_can.sort_values(by='2010', ascending=False, axis=0, inplace=True)
top3_2010 = df_can['2010'].head(3)
top3_2010

# Pandas: many operations can be performed directly on DataFrames without assignment.
# In Pandas, you can chain methods together for more complex operations
# without needing intermediate variables.
sorted_df = data_df.sort_values(by='column_name').reset_index(drop=True)

# In-place operations: some Pandas methods allow for in-place modifications,
# meaning they change the DataFrame directly without needing to create a new one.
data_df.sort_values(by='column_name', inplace=True)

 In traditional programming, you often assign results to variables for later use.
 In Pandas, you can access properties and methods directly, and you can also chain methods for more complex
operations.
 Understanding when to use assignment versus direct access can help streamline your data manipulation tasks.
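The contrast between the two styles can be seen on a throwaway DataFrame (names and values are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'v': [3, 1, 2]})

# Chained: no intermediate variables, original df left untouched
chained = df.sort_values(by='v').reset_index(drop=True)

# In-place: the DataFrame itself is modified directly
df.sort_values(by='v', inplace=True)
df.reset_index(drop=True, inplace=True)

print(chained.equals(df))  # both routes end with the same result
```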
Matplotlib Backends


In Matplotlib, backends are the components that handle the rendering of plots. They determine how figures are displayed or saved,
and they can be categorized into two main types: interactive backends and non-interactive backends. Here’s a detailed explanation
of each type of backend and its role:

1. Interactive Backends
Interactive backends allow for real-time interaction with plots. They enable features like zooming, panning, and
updating plots dynamically. Here are some common interactive backends:

TkAgg:
Role: Uses the Tkinter library for creating GUI applications.
Usage: Suitable for desktop applications where you want to display plots in a window.
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
Qt5Agg:
Role: Utilizes the Qt framework for creating interactive applications.
Usage: Ideal for applications that require a modern GUI and advanced features.
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt

GTK3Agg:
Role: Uses the GTK+ toolkit for creating graphical user interfaces.
Usage: Commonly used in Linux environments.
import matplotlib
matplotlib.use('GTK3Agg')
import matplotlib.pyplot as plt

Non-Interactive Backends:
Non-interactive backends are used for generating static images without displaying them on the screen. They are
useful for saving plots to files. Here are some common non-interactive backends:

Agg:
Role: A raster graphics backend that generates images in formats like PNG, JPEG, etc.
Usage: Ideal for saving plots to files without displaying them.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

PDF:
Role: Generates vector graphics in PDF format.
Usage: Useful for creating high-quality documents and publications.
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt

SVG:
Role: Generates vector graphics in SVG format.
Usage: Suitable for web applications and scalable graphics.

Choosing a Backend
For Interactive Use: Choose an interactive backend like TkAgg, Qt5Agg, or MacOSX.

For Saving Plots: Use a non-interactive backend like Agg, PDF, or SVG.
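A minimal sketch of the non-interactive workflow: select Agg before importing pyplot, then save the figure to a file rather than showing it (the filename here is arbitrary):

```python
import matplotlib
matplotlib.use('Agg')    # must run before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
fig.savefig('plot.png')  # rendered straight to a PNG file, no window shown
```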
Box plot:
Step 1: Generate Sample Data
We'll create a DataFrame with 20 columns and 100 rows of random numerical data.

import pandas as pd
import numpy as np

# Set a random seed for reproducibility


np.random.seed(42)

# Generate random data


data = np.random.rand(100, 20) * 100 # 100 rows, 20 columns
columns = [f'Column_{i+1}' for i in range(20)]
df = pd.DataFrame(data, columns=columns)

# Display the first few rows of the DataFrame


print(df.head())
Step 2: Create Box Plots
Next, we will create box plots for each of the 20 columns to visualize their distributions.

import matplotlib.pyplot as plt


# Create box plots for each column
plt.figure(figsize=(15, 10))
df.boxplot()
plt.xticks(rotation=45)
plt.title('Box Plots for 20 Columns')
plt.ylabel('Values')
plt.show()
Step 3: Analyze the Box Plots
After generating the box plots, we can analyze them to make decisions. Here are some key points
to consider:

Identify Outliers:

Look for any points that fall outside the whiskers of the box plots. These points are considered
outliers.
Decision: Depending on the context, you may choose to remove outliers, investigate them
further, or keep them if they are valid data points.
Compare Medians:

Observe the median line within each box. This indicates the central tendency of each column.
Decision: If certain columns have significantly lower or higher medians, it may indicate a need for
normalization or transformation, especially if they are to be used in machine learning models.
Assess Variability:
The height of the boxes (IQR) indicates the variability of the data. Wider boxes suggest higher
variability.
Decision: Columns with low variability may not contribute much information and could be
candidates for removal or further investigation.
Distribution Shape:

The shape of the box plots can indicate the distribution of the data (e.g., symmetric, skewed).
Decision: If a column is skewed, you might consider applying transformations (e.g., log
transformation) to normalize the data.
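The log-transform suggestion above, sketched on synthetic right-skewed data (np.log1p is used here so zero values are handled safely):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
skewed = pd.Series(rng.exponential(scale=10, size=1000))  # right-skewed sample

transformed = np.log1p(skewed)  # log(1 + x) compresses the long right tail

print(round(skewed.skew(), 2), round(transformed.skew(), 2))
```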
Example Analysis
Assuming we analyzed the box plots and found the following:

Column_1: Has several outliers and a median significantly lower than other columns.
Column_5: Shows a very high median and a wide IQR, indicating high variability.
Column_10: Appears to be normally distributed with no outliers.
Column_15: Has a very narrow IQR, suggesting low variability.
Decisions Based on Analysis
Column_1:

Action: Investigate outliers. If they are errors, consider removing them. If they are valid, document their impact
on analysis.
Column_5:

Action: Consider normalization or transformation to reduce the impact of high values on models.
Column_10:
Action: This column can be used as is, as it appears to be well-behaved.
Column_15:

Action: Assess whether this column provides meaningful information. If it is too constant, consider removing it
from the analysis.
Conclusion
By analyzing the box plots of the 20 columns, we can make informed decisions about data cleaning, transformation, and feature selection. This process is essential for preparing the data for further analysis or modeling.

Scatter plot:
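This section was left unfinished in the notes; a minimal pandas sketch (the column names and values are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2, 4, 5, 4, 6]})

# kind='scatter' requires explicit x and y column names
ax = df.plot(kind='scatter', x='x', y='y')
plt.show()
```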

Histogram:
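Also unfinished in the notes; a sketch with synthetic data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.default_rng(0).normal(size=500))

# kind='hist' bins the values into intervals; bins controls how many
ax = s.plot(kind='hist', bins=20)
plt.show()
```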

Pie chart:
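A sketch with made-up category counts:

```python
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([40, 30, 20, 10], index=['A', 'B', 'C', 'D'])

# kind='pie' draws one wedge per value, sized by its share of the total
ax = s.plot(kind='pie', autopct='%1.1f%%')
ax.set_ylabel('')  # drop the default series-name label
plt.show()
```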

Bar chart:
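A sketch with invented categories:

```python
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([5, 3, 7], index=['red', 'green', 'blue'])

# kind='bar' draws one bar per index label
ax = s.plot(kind='bar')
plt.show()
```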

Area plot:
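A sketch with two invented columns (pandas stacks the filled areas by default):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 2]})

# kind='area' draws stacked, filled regions, one per column
ax = df.plot(kind='area')
plt.show()
```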
Line plot:
