Big Data Hadoop and Spark Developer
Data Manipulation with Pandas
Learning Objectives
By the end of this lesson, you will be able to:
Explain Pandas and its features
List different data structures of Pandas
Outline the process to create Series and DataFrame objects from data inputs
Explain how to view, select, and access elements in a data structure
Describe the procedure to handle vectorized operations
Illustrate how to handle missing values
Analyze data with different data operation methods
Introduction to Pandas
Why Pandas
NumPy is great for mathematical computing, but why do we need Pandas?
Pandas builds on NumPy with several additional functionalities:
• Intrinsic data alignment
• Data structures that handle major use cases
• Data operation functions
• Data standardization functions
• Data handling functions
Features of Pandas
The various features of Pandas make it an efficient library for Data Scientists.
• Powerful data structures
• Fast and efficient data wrangling
• High-performance merging and joining of data sets
• Intelligent and automated data alignment
• Easy data aggregation and transformation
• Tools for reading and writing data
Data Structures
Data Structures
The four main Pandas data structures are:
Series
• One-dimensional labeled array
• Supports multiple data types
DataFrame
• Two-dimensional labeled array
• Supports multiple data types
• Input can be a Series
• Input can be another DataFrame
Panel
• Three-dimensional labeled array
• Supports multiple data types
• Items → axis 0
• Major axis → rows
• Minor axis → columns
Panel 4D (Experimental)
• Four-dimensional labeled array
• Supports multiple data types
• Labels → axis 0
• Items → axis 1
• Major axis → rows
• Minor axis → columns
Note: Panel and Panel 4D are not available in current Pandas releases, which provide Series and DataFrame.
Understanding Series
Series is a one-dimensional array-like object containing data and labels (or index).
Data:           4   11   21   36
Label (index):  0    1    2    3
Data alignment is intrinsic: each value stays linked to its label unless the program changes it explicitly.
Series
Series can be created with different data inputs:
Data Input
• ndarray
• dict
• scalar
• list
Data Types
• Integer
• String
• Python Object
• Floating Point
Example data: 2 3 8 4, with Label (index): 0 1 2 3
Series
How to Create Series?
Key points to note while creating a Series are:
• Import Pandas, as it is the main library (import pandas as pd)
• Import NumPy when working with ndarrays (import numpy as np)
• Apply the syntax and pass the data elements as arguments
Basic Method
Example data: 4 11 21 36
S = pd.Series(data, index = [index])
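The basic method can be sketched as follows. The data values come from the slide; the labels "a" to "d" are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = [4, 11, 21, 36]          # values shown on the slide
labels = ["a", "b", "c", "d"]   # hypothetical index labels, not from the lesson

# Pass the data and the index labels to the Series constructor
s = pd.Series(data, index=labels)

print(s)
print(s["c"])   # look up a value by its label
```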
Series
Creating Series from a List
Import libraries
Pass list as an argument
Data value
Index
Data type
We did not specify an index for the data, but notice that a default index is assigned automatically.
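A minimal sketch of these steps, assuming the same example values used earlier in the lesson (the variable name is illustrative):

```python
import pandas as pd

# Pass a plain Python list as the only argument
first_series = pd.Series([4, 11, 21, 36])

print(first_series)        # shows index, data values, and data type
print(first_series.index)  # default index starting at 0
```

Because no index is passed, Pandas generates integer labels starting at 0.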
Creating Series from an ndarray
ndarray for countries
Pass ndarray as an argument
countries
Data type
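The ndarray case might look like the sketch below. The country names are placeholder values picked for illustration, not the ones on the original slide.

```python
import numpy as np
import pandas as pd

# ndarray for countries (placeholder values)
np_country = np.array(["Luxembourg", "Norway", "Japan", "Switzerland"])

# Pass the ndarray as an argument
s_country = pd.Series(np_country)

print(s_country)   # text data is shown with a string-like or object data type
```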
Creating Series from dict
A Series can also be created from a dict. The dict keys become the index labels, which allows fast label-based lookup.
dict for countries and their GDP
Countries have been passed as the index and GDP as the actual data value
GDP
Country
Data type
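As a rough illustration of the dict input, the sketch below uses made-up numbers; they are not real GDP figures and the variable names are hypothetical.

```python
import pandas as pd

# dict for countries and their GDP (made-up values for illustration only)
gdp_by_country = {"Luxembourg": 52.0, "Norway": 40.0, "Japan": 38.0}

# Keys become the index, values become the data
s_gdp = pd.Series(gdp_by_country)

print(s_gdp)
print(s_gdp["Norway"])   # label-based lookup
```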
Creating Series from Scalar
Scalar input
Index
Data
index
Data type
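A scalar input is repeated for every label in the index. The sketch below assumes a hypothetical value of 5.0 and hypothetical labels.

```python
import pandas as pd

# One scalar value, broadcast to each index label
scalar_series = pd.Series(5.0, index=["a", "b", "c", "d", "e"])

print(scalar_series)
```

An index is needed here so that Pandas knows how many times to repeat the scalar.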
Accessing Elements in Series
Data can be accessed with indexers such as loc (by label) and iloc (by integer position), by passing a single label or position, or a range.
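The difference between the two indexers can be sketched as follows. The Series contents are placeholder values chosen for illustration.

```python
import pandas as pd

s = pd.Series(
    [52.0, 40.0, 38.0, 35.0],
    index=["Luxembourg", "Norway", "Japan", "Switzerland"],
)

by_label = s.loc["Japan"]                # access by index label
by_position = s.iloc[0]                  # access by integer position
label_range = s.loc["Norway":"Japan"]    # label slices include both ends
position_range = s.iloc[1:3]             # position slices exclude the end

print(by_label, by_position)
print(label_range)
print(position_range)
```

Note that the two slices return the same two elements even though one is written with labels and the other with positions.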
Vectorizing Operations in Series
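Vectorized operations are applied element by element. Elements are matched by their index labels, and a label that is missing from either Series produces NaN in the result.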
Add the series
Vectorizing Operations in Series
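Adding two Series could look like the sketch below, which uses hypothetical values and labels. It shows both the matching labels and the labels that appear in only one Series.

```python
import pandas as pd

first = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
second = pd.Series([10, 20, 30, 40], index=["a", "b", "d", "e"])

# Addition is applied to values that share a label;
# labels found in only one Series ("c" and "e") become NaN
total = first + second

print(total)
```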
DataFrames
DataFrame
DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
Data Input
• ndarray
• dict
• List
• Series
• DataFrame
Data Types
• Integer
• String
• Python Object
• Floating Point
Example data: 2 3 8 4 and 5 8 10 1, with Label (index): 0 1 2 3
DataFrame
Creating DataFrame from Lists
Pass the list to the DataFrame
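One possible version of this step is sketched below. The rows and column names are illustrative and not taken from the original slide.

```python
import pandas as pd

# Each inner list is one row
rows = [["Beijing", 2008], ["London", 2012], ["Rio de Janeiro", 2016]]

# Pass the list to the DataFrame and name the columns
df_cities = pd.DataFrame(rows, columns=["HostCity", "Year"])

print(df_cities)
```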
Creating DataFrame from dict
This example shows you how to create a DataFrame from a series of dicts.
dict one dict two
Entire dict
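A minimal sketch of the dict input, assuming two small hypothetical dicts collected in a list:

```python
import pandas as pd

dict_one = {"a": 1, "b": 2}
dict_two = {"a": 5, "b": 10, "c": 20}

# Each dict becomes a row; the keys become the column names
df_from_dicts = pd.DataFrame([dict_one, dict_two])

print(df_from_dicts)   # the first row has no "c", so that cell is NaN
```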
Viewing DataFrame
You can view a DataFrame by referring to the column name or with the describe function.
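Both ways of viewing can be sketched on a small hypothetical DataFrame (the names and scores are invented for illustration):

```python
import pandas as pd

df_view = pd.DataFrame({"name": ["Ann", "Ben", "Cal"], "score": [70, 85, 90]})

scores = df_view["score"]      # refer to a column by name; returns a Series
summary = df_view.describe()   # summary statistics for the numeric columns

print(scores)
print(summary)
```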
Creating DataFrame from dict of Series
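A dict of Series might be turned into a DataFrame as in this sketch. The column names "one" and "two" and the values are hypothetical.

```python
import pandas as pd

series_dict = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# Each Series becomes a column; rows are aligned on the index labels
df_from_series = pd.DataFrame(series_dict)

print(df_from_series)   # column "one" has no label "d", so that cell is NaN
```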
Creating DataFrame from ndarray
Create an ndarray with years
Create a dict with the ndarray
Pass this dict to a new DataFrame
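These three steps can be sketched as follows. The years and the column name "year" are placeholders.

```python
import numpy as np
import pandas as pd

# Create an ndarray with years
np_years = np.array([2012, 2013, 2014, 2015])

# Create a dict with the ndarray
dict_years = {"year": np_years}

# Pass this dict to a new DataFrame
df_years = pd.DataFrame(dict_years)

print(df_years)
```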
Creating DataFrame from DataFrame Object
Create a DataFrame from a DataFrame object
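A rough sketch of this case, assuming a small hypothetical source DataFrame. Passing copy=True is a choice made here so that the new object does not share data with the original.

```python
import pandas as pd

df_source = pd.DataFrame({"year": [2012, 2013], "value": [1.5, 2.5]})

# Build a new DataFrame from an existing one
df_copy = pd.DataFrame(df_source, copy=True)

# Optionally keep only some of the columns
df_subset = pd.DataFrame(df_source, columns=["year"])

print(df_copy)
print(df_subset)
```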
View and Select Data
Problem Statement: Demonstrate how to view and select data in a DataFrame
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Missing Values
Missing Values
Various factors may lead to missing data values:
• Data not provided by the source
• Software issue
• Data integration issue
• Network issue
Handling Missing Values
It is difficult to operate on a dataset when it has missing values or uncommon indices.
Handling Missing Values with Functions
The dropna function drops all entries whose values are missing (NaN), such as those produced by uncommon indices.
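The effect of dropna can be sketched with two hypothetical Series whose indices only partly overlap:

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=["a", "b", "c"])
second = pd.Series([10, 20, 30], index=["b", "c", "d"])

# "a" and "d" appear in only one Series, so their sums are NaN
total = first + second

# Drop the entries that have no value
cleaned = total.dropna()

print(total)
print(cleaned)
```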
Handling Missing Values with Functions
The fillna function fills the missing values produced by uncommon indices with a specified number instead of dropping them.
Fill the missing values with zero
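Filling with zero might look like this sketch, which again assumes two hypothetical Series with partly different indices:

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=["a", "b", "c"])
second = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = first + second     # NaN for labels "a" and "d"

# Fill the missing values with zero instead of dropping them
filled = total.fillna(0)

print(filled)
```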
Handling Missing Values with Functions: Example
Data Operation
Data Operation
Data operations can be performed through various built-in methods for faster data processing.
Data Operation with Functions
While performing data operations, custom functions can be applied using the applymap method.
Declare a custom function
Test the function
Apply the function to the DataFrame
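The three steps above can be sketched as follows. The function and data are hypothetical. Newer Pandas releases offer the same element-wise behavior under the name DataFrame.map and deprecate applymap, so this sketch picks whichever method is available.

```python
import pandas as pd

# Declare a custom function (hypothetical example)
def double(x):
    return x * 2

# Test the function
print(double(3))

df_tests = pd.DataFrame({"test1": [70, 85], "test2": [60, 95]})

# Apply the function to every element of the DataFrame.
# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df_tests.map if hasattr(df_tests, "map") else df_tests.applymap
doubled = elementwise(double)

print(doubled)
```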
Data Operation with Statistical Functions
Create a DataFrame with two tests
Apply the max function to find the maximum score
Apply the mean function to find the average score
Apply the std function to find the standard deviation for both the tests
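A minimal sketch of these steps, using invented scores and the hypothetical column names Test1 and Test2:

```python
import pandas as pd

# Create a DataFrame with two tests (invented scores)
df_scores = pd.DataFrame({"Test1": [95, 84, 73, 88], "Test2": [74, 85, 82, 73]})

max_scores = df_scores.max()     # maximum score per test
mean_scores = df_scores.mean()   # average score per test
std_scores = df_scores.std()     # sample standard deviation per test

print(max_scores)
print(mean_scores)
print(std_scores)
```

Each function works column by column and returns a Series indexed by the column names.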
Data Operation Using Groupby
Create a DataFrame with the first and last names of former presidents
Group the DataFrame by the first name
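Grouping by first name could be sketched as below. The column names "first" and "last" are assumptions, and the rows are a small illustrative selection.

```python
import pandas as pd

df_presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Group the DataFrame by the first name
grouped = df_presidents.groupby("first")

georges = grouped.get_group("George")   # rows that share one first name
counts = grouped.size()                 # number of rows per first name

print(georges)
print(counts)
```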
Data Operation Using Sorting
Sort values by first name
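Sorting can be sketched with the sort_values method. The DataFrame below is a hypothetical example with assumed column names.

```python
import pandas as pd

df_names = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy"],
    "last": ["Bush", "Clinton", "Reagan", "Carter"],
})

# Sort values by first name; the result is a new DataFrame
sorted_names = df_names.sort_values("first")

print(sorted_names)
```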
Data Operations
Problem Statement: Demonstrate how to perform data operations
Data Standardization
Data Standardization
Create a function to return the standardized value
Apply the function to the entire dataset
The standardization is applied to the entire DataFrame
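One common way to standardize is to subtract the mean and divide by the standard deviation. The sketch below assumes that definition; the lesson's own function may differ, and the scores are invented.

```python
import pandas as pd

# Create a function to return the standardized value (assumed z-score formula)
def standardize(column):
    return (column - column.mean()) / column.std()

df_scores = pd.DataFrame({"Test1": [95, 84, 73, 88], "Test2": [74, 85, 82, 73]})

# Apply the function to the entire dataset, one column at a time
standardized = df_scores.apply(standardize)

print(standardized)
```

After this step each column has a mean of about 0 and a standard deviation of about 1.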
File Read and Write Support
Read function      Write function
read_csv           to_csv
read_excel         to_excel
read_hdf           to_hdf
read_clipboard     to_clipboard
read_html          to_html
read_json          to_json
read_pickle        to_pickle
read_sql           to_sql
read_stata         to_stata
read_sas           (no writer; Pandas does not provide to_sas)
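As a small illustration of a read and write pair, the sketch below writes a hypothetical DataFrame as CSV text and reads it back. An in-memory buffer stands in for a file so that the example does not depend on any path.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"city": ["Oslo", "Tokyo"], "year": [2012, 2016]})

# Write the DataFrame as CSV text into an in-memory buffer
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
df_in = pd.read_csv(buffer)

print(df_in)
```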
Pandas SQL Operations
Pandas SQL Operation
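The original slides for this topic are not reproduced here, so the following is only a sketch of one way to combine Pandas with SQL. It assumes an in-memory SQLite database, and the table name "scores" and its contents are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used only for this illustration
conn = sqlite3.connect(":memory:")

df_scores = pd.DataFrame({"name": ["Ann", "Ben", "Cal"], "score": [70, 85, 90]})

# Write the DataFrame to a SQL table
df_scores.to_sql("scores", conn, index=False)

# Read the result of a SQL query back into a DataFrame
high_scores = pd.read_sql(
    "SELECT name, score FROM scores WHERE score > 80 ORDER BY score", conn
)
conn.close()

print(high_scores)
```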
Analyze the Federal Aviation Administration (FAA) Dataset Using Pandas
Problem Statement:
Analyze the Federal Aviation Administration (FAA) dataset using Pandas to do the following:
1. View
a. Aircraft manufacturer name
b. State name
c. Aircraft model name
d. Text information
e. Flight phase
f. Event description type
g. Fatal flag
2. Clean the dataset and replace the fatal flag NaN with “No”
3. Find the aircraft types and their occurrences in the dataset
4. Remove all the observations where aircraft names are not available
5. Display the observations where fatal flag is “Yes”
Analyze the Federal Aviation Administration (FAA) Dataset Using Pandas
Instructions to perform the assignment:
• Download the FAA dataset from the “Resource” tab. Upload the dataset to your Jupyter notebook to view and evaluate it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it to the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Analyzing the Dataset
Problem Statement:
A dataset in CSV format is given for the Fire Department of New York City (FDNY). Analyze the dataset to determine:
1. The total number of fire department facilities in New York City
2. The number of fire department facilities in each borough
3. The facility names in Manhattan
Analyzing the Dataset
Instructions to perform the assignment:
• Download the FDNY dataset from the “Resource” tab. You can upload the dataset to your Jupyter notebook to use it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it to the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Key Takeaways
You are now able to:
Explain Pandas and its features
List different data structures of Pandas
Outline the process to create Series and DataFrame objects from data inputs
Explain how to view, select, and access elements in a data structure
Describe the procedure to handle vectorized operations
Illustrate how to handle missing values
Analyze data with different data operation methods
Knowledge Check
Knowledge Check 1
How is an index for data elements assigned while creating a Pandas Series? Select all that apply.
a. Created automatically
b. Needs to be assigned
c. Once created cannot be changed or altered
d. Index is not applicable as Series is one-dimensional
The correct answers are a and b
Data alignment is intrinsic in Pandas data structures, so an index is created automatically when none is given. An index can also be assigned explicitly to the data elements.
Knowledge Check 2
What will the result be in vector addition if label is not found in a series?
a. Marked as zeros for missing labels
b. Labels will be skipped
c. Marked as NaN for missing labels
d. Will prompt an exception, index not found
The correct answer is c
The result will be marked as NaN (Not a Number) for missing labels.
Knowledge Check 3
What is the result of DataFrame[3:9]?
a. Series with sliced index from 3 to 9
b. dict of index positions 3 and 9
c. DataFrame of sliced rows index from 3 to 9
d. DataFrame with data elements at index 3 to 9
The correct answer is c
This is the DataFrame slicing technique, which selects rows by position. When a user passes the range 3:9, the result is a DataFrame with the rows at positions 3 through 8; as with Python slicing, the end position 9 is excluded.
Knowledge Check 4
What does the fillna() method do?
a. Fills all NaN values with zeros
b. Fills all NaN values with one
c. Fills all NaN values with values mentioned in the parenthesis
d. Drops NaN values from the dataset
The correct answer is c
fillna is one of the basic methods for filling NaN values in a dataset. The desired fill value is passed in the parentheses.
Knowledge Check 5
Which of the following data structures is used to store three-dimensional data?
a. Series
b. DataFrame
c. Panel
d. PanelND
The correct answer is c
Panel is a data structure used to store three-dimensional data.
Knowledge Check 6
Which method is used for label-location indexing by label?
a. iat
b. iloc
c. loc
d. std
The correct answer is c
The loc indexer is used for label-location indexing by label. iat accesses a single value strictly by integer location, and iloc is integer-location-based indexing by position.
Knowledge Check 7
While viewing a DataFrame, the head() method will _____.
a. return only the first row
b. return only headers or column name of the DataFrame
c. return the first five rows of the DataFrame
d. throw an exception as it expects parameter(number) in parenthesis
The correct answer is c
If nothing is passed to the head method, the default value is 5, so it returns the first five rows of the DataFrame.
Thank You