
Pandas AI ML Python Software Engineering

This lesson covers data manipulation using the Pandas library in Python, focusing on its features, data structures like Series and DataFrame, and methods for handling missing values. Key functionalities include creating data structures, accessing elements, performing vectorized operations, and executing various data operations. The lesson also highlights the importance of Pandas in data analysis and its compatibility with multiple file formats.

Uploaded by

Vijay Yadav
Copyright
© All Rights Reserved
Data Science with Python

Lesson 7— Data Manipulation with Pandas


What You Will Learn

Pandas and its features

Different data structures of Pandas

Creating Series and DataFrame with data inputs

Viewing, selecting, and accessing elements in a data structure

Handling vectorized operations


Learning how to handle missing values

Analyzing data with different data operation methods


Why Pandas

NumPy is great for mathematical computing. Then why do we need Pandas?

Pandas adds several functionalities on top of NumPy:

• Intrinsic data alignment
• Data structures handling major use cases
• Data operation functions
• Data standardization functions
• Functions for handling missing data
Pandas Features

The various features of Pandas make it an efficient library for data scientists:

• Powerful data structures
• Fast and efficient data wrangling
• High-performance merging and joining of data sets
• Intelligent and automated data alignment
• Easy data aggregation and transformation
• Tools for reading/writing data
Data Structures

The four main data structures of Pandas are:

Series
• One-dimensional labeled array
• Supports multiple data types

DataFrame
• Two-dimensional labeled array
• Supports multiple data types
• Input can be a Series
• Input can be another DataFrame

Panel
• Three-dimensional labeled array
• Supports multiple data types
• Items: axis 0
• Major axis: rows
• Minor axis: columns

Panel 4D (Experimental)
• Four-dimensional labeled array
• Supports multiple data types
• Labels: axis 0
• Items: axis 1
• Major axis: rows
• Minor axis: columns

Note: Panel and Panel 4D exist only in older pandas releases; Panel was removed in pandas 0.25.
Understanding Series

Series is a one-dimensional array-like object containing data and labels (or index).

Data:           4   11   21   36
Label (index):  0    1    2    3

Data alignment is intrinsic: the link between a label and its data is not broken unless the program changes it explicitly.
Series

Series can be created with different data inputs:

Data Input
• ndarray
• dict
• scalar
• list

Data Types
• Integer
• String
• Python Object
• Floating Point

Example Series:
Data:           2   3   8   4
Label (index):  0   1   2   3
How to Create Series

Key points to note while creating a series are as follows:


• Import Pandas as it is the main library
• Apply the syntax and pass the data elements as arguments
• Import NumPy while working with ndarrays

Basic Method

S = pd.Series(data, index = [index])

Example data: 4 11 21 36
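The basic method can be sketched as follows. This is a minimal illustration rather than the lesson's own code: the data values come from the slide, while the string labels are hypothetical.

```python
import pandas as pd

# Data values from the slide; the labels "a".."d" are made up for illustration.
data = [4, 11, 21, 36]
s = pd.Series(data, index=["a", "b", "c", "d"])

print(s)
print(s["c"])  # look up a value by its label
```

Passing `index` is optional; when it is omitted, Pandas assigns integer labels starting at 0.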
Create Series from List
This example shows you how to create a series from a list:

Steps: import the libraries, then pass the list as an argument. The output shows the data values, the index, and the data type.

No index was supplied for the data, yet Pandas assigns a default integer index automatically.
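The original screenshot is not available, so here is a small sketch of what creating a Series from a list might look like. The variable name and values are assumptions chosen for illustration.

```python
import pandas as pd

# No index is passed, so Pandas generates the labels 0, 1, 2, 3.
first_series = pd.Series([4, 11, 21, 36])

print(first_series)        # values, index, and dtype are all displayed
print(first_series.index)  # the automatically created index
```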
Create Series from ndarray

This example shows you how to create a series from an ndarray:

Steps: create an ndarray of countries, then pass the ndarray as an argument. The output shows the countries, the index, and the data type.
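A possible version of this example is sketched below. The country names and variable names are hypothetical stand-ins for the lost screenshot.

```python
import numpy as np
import pandas as pd

# Hypothetical ndarray of country names.
np_country = np.array(["Luxembourg", "Norway", "Japan", "Switzerland"])

# Pass the ndarray as the data argument; a default integer index is created.
s_country = pd.Series(np_country)
print(s_country)
```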
Create Series from dict

A Series can also be created from a dict, which allows fast lookup of values by key.

The example uses a dict of countries and their GDP. The countries are passed as the index and the GDP as the actual data values. The output shows the GDP values, the country labels, and the data type.
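A minimal sketch of the dict case follows. The country labels and GDP figures are placeholders invented for illustration, not real statistics.

```python
import pandas as pd

# Made-up figures: keys become the index, values become the data.
gdp = {"CountryA": 52056, "CountryB": 40258, "CountryC": 40034}
s_gdp = pd.Series(gdp)

print(s_gdp)
print(s_gdp["CountryB"])  # look up a value by its key (label)
```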
Create Series from Scalar

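The example passes a scalar input together with an index; the scalar is repeated for every index label. The output shows the data, the index, and the data type.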
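As a rough sketch of the scalar case (the value and labels below are assumptions), the index determines how many times the scalar is repeated:

```python
import pandas as pd

# The scalar 5.0 is broadcast to every label in the index.
scalar_series = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(scalar_series)
```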
Accessing Elements in Series
Data can be accessed through indexers such as loc (label-based) and iloc (position-based) by passing a label, a position, or a range.
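The difference between the two indexers can be sketched as below. The Series and its labels are hypothetical; note that a label range includes both endpoints, while a position range excludes the end.

```python
import pandas as pd

s = pd.Series([4, 11, 21, 36], index=["a", "b", "c", "d"])

by_label = s.loc["b"]         # label-based lookup
by_position = s.iloc[0]       # position-based lookup
label_range = s.loc["b":"c"]  # label slice: both endpoints are included
position_range = s.iloc[1:3]  # position slice: the end position is excluded

print(by_label, by_position)
print(label_range)
print(position_range)
```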
Vectorized Operations in Series

Vectorized operations are applied element by element, with elements matched by their label (index). Adding two Series adds the values that share a label; a label found in only one Series yields NaN.
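Here is a small sketch of vectorized addition, using made-up values. The first sum uses identical labels; the second uses partly different labels to show how unmatched labels are handled.

```python
import pandas as pd

first = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
second = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Same labels on both sides: values are added label by label.
total = first + second

# Labels in a different order, plus one label ("k") missing from `first`.
third = pd.Series([10, 20, 30, 40], index=["a", "d", "b", "k"])
mismatched = first + third  # "c" and "k" have no partner, so they become NaN

print(total)
print(mismatched)
```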
Knowledge Check

KNOWLEDGE CHECK
How is an index for data elements assigned while creating a Pandas Series? Select all that apply.

a. Created automatically

b. Needs to be assigned

c. Once created cannot be changed or altered

d. Index is not applicable as a Series is one-dimensional

The correct answer is a, b.

Explanation: Data alignment is intrinsic in Pandas data structures and happens automatically. One can also assign an index to data elements.
KNOWLEDGE CHECK
What will the result be in vector addition if a label is not found in a Series?

a. Marked as zeros for missing labels

b. Labels will be skipped

c. Marked as NaN for missing labels

d. Will throw an exception, index not found

The correct answer is c.

Explanation: The result will be marked as NaN (Not a Number) for missing labels.
DataFrame

DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Data Input
• ndarray
• dict
• scalar
• list

Data Types
• Integer
• String
• Python Object
• Floating Point

Example DataFrame:
Data:           2   3   8   4
                5   8  10   1
Label (index):  0   1   2   3
Create DataFrame from Lists
Let’s see how you can create a DataFrame from lists:

Pass the list to the DataFrame
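A minimal sketch of building a DataFrame from lists is shown below. The rows and column names are illustrative assumptions, since the lesson's screenshot is not available.

```python
import pandas as pd

# Each inner list is one row: [host city, year].
rows = [["London", 2012], ["Beijing", 2008], ["Athens", 2004]]

# Column names are supplied separately; the row index defaults to 0, 1, 2.
df = pd.DataFrame(rows, columns=["host_city", "year"])
print(df)
```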


Create DataFrame from dict
This example shows you how to create a DataFrame from several dicts:

The example labels two inner dicts (dict one and dict two) and the entire dict that is passed to the DataFrame.
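One way such an example might look is a dict of dicts, sketched below. This structure is an assumption about the lost screenshot, and all names and values are hypothetical. With a dict of dicts, the outer keys become column names and the inner keys become row labels.

```python
import pandas as pd

dict_one = {"x": 1, "y": 2}    # hypothetical inner dict
dict_two = {"x": 10, "y": 20}  # hypothetical inner dict

# The entire dict: outer keys -> columns, inner keys -> row index.
entire_dict = {"one": dict_one, "two": dict_two}
df = pd.DataFrame(entire_dict)
print(df)
```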
View DataFrame

You can view a DataFrame by referring to a column name or with the describe function.
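Both ways of viewing can be sketched as follows, using a small made-up DataFrame. Selecting one column returns a Series, and `describe` summarizes the numeric columns by default.

```python
import pandas as pd

df = pd.DataFrame({
    "host_city": ["London", "Beijing", "Athens"],
    "year": [2012, 2008, 2004],
})

cities = df["host_city"]  # refer to a column by name
summary = df.describe()   # count, mean, std, min, quartiles, max

print(cities)
print(summary)
```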
Create DataFrame from dict of Series
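The slide content for this case is missing, so the following is only a sketch under assumed names and values. Each Series becomes a column, and rows are aligned on the union of the Series labels.

```python
import pandas as pd

# Hypothetical test scores; "bob" has no score for test2.
test_scores = {
    "test1": pd.Series([70, 85, 90], index=["ann", "bob", "cy"]),
    "test2": pd.Series([65, 95], index=["ann", "cy"]),
}

df = pd.DataFrame(test_scores)
print(df)  # the missing test2 score for "bob" appears as NaN
```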
Create DataFrame from ndarray

Steps:

• Create an ndarray with years
• Create a dict with the ndarray
• Pass this dict to a new DataFrame
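The three steps might look like this in code. The year values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Step 1: an ndarray with years.
np_year = np.array([2012, 2008, 2004])

# Step 2: a dict that maps a column name to the ndarray.
year_dict = {"year": np_year}

# Step 3: pass the dict to a new DataFrame.
df_year = pd.DataFrame(year_dict)
print(df_year)
```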


Create DataFrame from DataFrame

Create a DataFrame from a DataFrame
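As a brief sketch with hypothetical data, an existing DataFrame can be passed directly to the DataFrame constructor:

```python
import pandas as pd

df_year = pd.DataFrame({"year": [2012, 2008, 2004]})

# Pass the existing DataFrame as the data argument.
df_from_df = pd.DataFrame(df_year)
print(df_from_df)
```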
Demo 01—View and Select Data
Demonstrate how to view and select data in a DataFrame.
Missing Values

Various factors may lead to missing data values:

• Data not provided by the source
• Software issue
• Data integration issue
• Network issue
Handle Missing Values
It’s difficult to operate on a dataset when it has missing values or uncommon indices.
Handle Missing Values with Functions
The dropna function drops all missing (NaN) values, such as those produced when two Series with uncommon indices are combined.
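A minimal sketch of `dropna` follows, using two made-up Series whose indices only partly overlap:

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=["a", "b", "c"])
second = pd.Series([10, 20, 30], index=["b", "c", "d"])

# "a" and "d" exist in only one Series, so their sums are NaN.
total = first + second

# Drop the NaN entries and keep only labels common to both Series.
cleaned = total.dropna()
print(cleaned)
```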
The fillna function fills the missing (NaN) values left by uncommon indices with a specified value instead of dropping them. For example, the missing values can be filled with zero.


Handle Missing Values with Functions- Example
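Filling missing values with zero can be sketched as below. The Series are hypothetical and chosen so that two labels have no partner.

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=["a", "b", "c"])
second = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = first + second    # NaN for "a" and "d"
filled = total.fillna(0)  # replace every NaN with 0

print(filled)
```

Any other value can be passed to `fillna` in place of 0.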
Data Operation
Data operations can be performed through various built-in methods for faster data processing.
Data Operation with Functions
During data operations, a custom function can be applied to every element of a DataFrame with the applymap method.

Steps:

• Declare a custom function
• Test the function
• Apply the function to the DataFrame
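The steps above can be sketched as follows. The function and data are hypothetical. Since pandas 2.1, `applymap` is deprecated in favor of `DataFrame.map`, so this sketch uses `map` when it is available and falls back to `applymap` otherwise.

```python
import pandas as pd


def multiply_by_two(x):
    """Custom function applied to each element."""
    return x * 2


# Test the function on a single value first.
print(multiply_by_two(3))

df = pd.DataFrame({"test1": [70, 85], "test2": [65, 95]})

# Apply the function element-wise to the whole DataFrame.
if hasattr(pd.DataFrame, "map"):
    doubled = df.map(multiply_by_two)       # pandas 2.1 and later
else:
    doubled = df.applymap(multiply_by_two)  # older pandas releases

print(doubled)
```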


Data Operation with Statistical Functions
This example shows data operations with different statistical functions:

• Create a DataFrame with two tests
• Apply the max function to find the maximum score
• Apply the mean function to find the average score
• Apply the std function to find the standard deviation for both the tests
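These statistical functions might be used as sketched below. The scores are made up for illustration; each function returns one value per column.

```python
import pandas as pd

# Hypothetical scores for two tests.
df_scores = pd.DataFrame({
    "test1": [95, 84, 73, 88],
    "test2": [74, 85, 82, 73],
})

highest = df_scores.max()   # maximum score per test
average = df_scores.mean()  # average score per test
spread = df_scores.std()    # sample standard deviation per test

print(highest)
print(average)
print(spread)
```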
Data Operation Using Groupby
This example shows how to operate on data using the groupby function:

• Create a DataFrame with the first and last names of former presidents
• Group the DataFrame by the first name
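A possible version of this example is sketched below. The choice of names and the column names are assumptions, since the original screenshot is missing.

```python
import pandas as pd

df_presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Group the rows by first name.
grouped = df_presidents.groupby("first")

georges = grouped.get_group("George")  # all rows whose first name is George
counts = grouped.size()                # number of rows in each group

print(georges)
print(counts)
```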


Data Operation – Sorting
This example shows how to sort data:

Sort values by first name
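Sorting by first name can be sketched with `sort_values`. The DataFrame below is a hypothetical stand-in for the one in the lesson.

```python
import pandas as pd

df_presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Sort the rows alphabetically by the "first" column.
sorted_df = df_presidents.sort_values("first")
print(sorted_df)
```

`sort_values` returns a new DataFrame and leaves the original unchanged.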


Demo 02—Data Operations
Demonstrate how to perform data operations.
Data Standardization
This example shows how to standardize a dataset:

• Create a function to return the standardized value
• Apply the function to the entire dataset

The standardization is applied to the test data across the entire DataFrame.
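The lesson does not show which formula its function uses. The sketch below assumes the common z-score, (value - mean) / standard deviation, computed per column on made-up scores.

```python
import pandas as pd


def standardize(column):
    """Assumed formula: z-score of each value within its column."""
    return (column - column.mean()) / column.std()


# Hypothetical test data.
df_scores = pd.DataFrame({
    "test1": [95, 84, 73, 88],
    "test2": [74, 85, 82, 73],
})

# apply() passes each column to the function in turn.
df_standardized = df_scores.apply(standardize)
print(df_standardized)
```

After standardization each column has a mean of about 0 and a standard deviation of about 1.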
Knowledge Check

KNOWLEDGE CHECK
What is the result of DataFrame[3:9]?

a. Series with sliced index from 3 to 9

b. dict of index position 3 and index position 9

c. DataFrame of sliced rows index from 3 to 9

d. DataFrame with data elements at index 3 to index 9

The correct answer is c.

Explanation: This is the DataFrame slicing technique, which selects rows by position. When a user passes the range 3:9, the rows from position 3 up to, but not including, position 9 are sliced and displayed as output.
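The slice can be checked with a small sketch. The DataFrame below is hypothetical and has a default integer index, so positions and labels coincide.

```python
import pandas as pd

# Twelve rows with the default index 0..11.
df = pd.DataFrame({"value": range(100, 112)})

sliced = df[3:9]  # rows at positions 3, 4, 5, 6, 7, 8
print(sliced)
```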
KNOWLEDGE CHECK
What does the fillna() method do?

a. Fills all NaN values with zeros

b. Fills all NaN values with one

c. Fills all NaN values with values mentioned in the parenthesis

d. Drops NaN values from the dataset

The correct answer is c.

Explanation: fillna is one of the basic methods to fill NaN values in a dataset with a desired value by passing that value in the parenthesis.
File Read and Write Support

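Format      Read             Write
HDF         read_hdf         to_hdf
Excel       read_excel       to_excel
Clipboard   read_clipboard   to_clipboard
CSV         read_csv         to_csv
HTML        read_html        to_html
JSON        read_json        to_json
pickle      read_pickle      to_pickle
SQL         read_sql         to_sql
Stata       read_stata       to_stata
SAS         read_sas         (read only; pandas has no to_sas writer)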
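As an illustration of one reader/writer pair, the sketch below writes a made-up DataFrame as CSV and reads it back. It uses an in-memory text buffer instead of a file on disk so that it runs anywhere; a file path could be passed in the same places.

```python
import io

import pandas as pd

df = pd.DataFrame({"host_city": ["London", "Beijing"], "year": [2012, 2008]})

# Write the DataFrame as CSV text, leaving out the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```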
Pandas SQL operation
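The slides for this topic are missing, so the following is only a sketch of what a SQL round trip might look like. It assumes an in-memory SQLite database from Python's standard library; the table name and data are hypothetical.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "host_city": ["London", "Beijing", "Athens"],
    "year": [2012, 2008, 2004],
})

# An in-memory SQLite database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, then query it back with SQL.
df.to_sql("games", conn, index=False)
result = pd.read_sql(
    "SELECT host_city FROM games WHERE year > 2005 ORDER BY year", conn
)
conn.close()

print(result)
```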
Activity—Sequence it Right!
The code here is buggy. You have to correct its sequence to debug it. To do that, click any two code
snippets, which you feel are out of place, to swap their places.



Assignment
Quiz

QUIZ 1
Which of the following data structures is used to store three-dimensional data?

a. Series

b. DataFrame

c. Panel

d. PanelND

The correct answer is c.

Explanation: Panel is a data structure used to store three-dimensional data.

QUIZ 2
Which method is used for label-based indexing?

a. iat

b. iloc

c. loc

d. std

The correct answer is c.

Explanation: The loc method is used for label-based indexing; iat accesses a single value strictly by integer location, and iloc is integer-location-based indexing by position.

QUIZ 3
While viewing a DataFrame, the head() method will ____.

a. return only the first row

b. return only headers or column names of the DataFrame

c. return the first five rows of the DataFrame

d. throw an exception as it expects a parameter (number) in parenthesis

The correct answer is c.

Explanation: The default value is 5 if nothing is passed to the head method, so it will return the first five rows of the DataFrame.
Key Takeaways

Let us take a quick recap of what we have learned in the lesson:

• Pandas is an open source library for data analysis and is an efficient data wrangling tool in Python.
• The four main data structures of Pandas are Series, DataFrame, Panel, and Panel 4D.
• DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.
• To access data elements in a Series, the 'loc' and 'iloc' methods can be used.
• The 'iat' method enables selection of elements in a DataFrame by index position and returns the corresponding data element.
• Missing data values in Pandas can be resolved through two built-in methods: dropna and fillna.
• Pandas supports multiple file formats for data analysis, such as Excel, PyTables, Clipboard, HTML, pickle, dta, SAS, SQL, JSON, and CSV.
This concludes “Data Manipulation with Pandas.”
The next lesson is “Machine Learning with SciKit Learn.”
