
Pandas for Data Cleaning
What is Pandas?

Pandas is a popular open-source data manipulation and analysis library for Python.

It provides easy-to-use functions for working with structured data seamlessly.

Pandas also integrates smoothly with other popular Python libraries, such as NumPy for numerical computing and Matplotlib for data visualization. This makes it a powerful asset for data-driven tasks.

Pandas excels at handling missing data, reshaping datasets, merging and joining multiple datasets, and performing complex operations on data, making it exceptionally useful for data cleaning and manipulation.

What is Data Cleaning?

Before we embark on our data adventure with Pandas, let's take a moment to explain the term "data cleaning." Think of it as a digital detox for your dataset, where we tidy up and prioritize accuracy above all else.

Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.

Why bother with data cleaning? Well, imagine trying to analyze sales trends when some entries are missing, or working with a dataset that has duplicate records throwing off your calculations.

Not ideal, right?

In this digital detox, we use tools like Pandas to get rid of inconsistencies, straighten out errors, and let the true clarity of your data shine through.

What is Data Preprocessing?

You may be wondering, "Do data cleaning and data preprocessing mean the same thing?" The answer is no – they do not.

Picture this: you stumble upon an ancient treasure chest buried in the digital sands of your dataset. Data cleaning is like carefully unearthing that chest, dusting off the cobwebs, and ensuring that what's inside is authentic and reliable.

As for data preprocessing, you can think of it as taking that discovered treasure and preparing its contents for public display. It goes beyond cleaning; it's about transforming and optimizing the data for specific analyses or tasks.

Data cleaning is the initial phase of refining your dataset, making it readable and usable with techniques like removing duplicates, handling missing values, and converting data types.

Data preprocessing takes this refined data further with more advanced techniques such as feature engineering, encoding categorical variables (sketched below), and handling outliers to achieve better, more advanced results.

The goal is to turn your dataset into a refined masterpiece, ready for
analysis or modeling.
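As an illustration of one preprocessing step mentioned above, here is a minimal sketch of encoding a categorical variable with one-hot encoding. The 'color' column and its values are hypothetical examples, not part of the original dataset.

#Hypothetical demo: one-hot encode a categorical column with pd.get_dummies

import pandas as pd

demo = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

#pd.get_dummies creates one binary indicator column per category

demo_encoded = pd.get_dummies(demo, columns=['color'])

print(demo_encoded)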

How to Import the Necessary Libraries
Before we embark on data cleaning and preprocessing, let's import
the Pandas library.

To save time and typing, we often import Pandas as pd. This lets us use the shorter pd.read_csv() instead of pandas.read_csv() for reading CSV files, making our code more concise and readable.

import pandas as pd

How to Load the Dataset

Start by loading your dataset into a Pandas DataFrame. In this example, we'll use a hypothetical dataset named your_dataset.csv. We will load the dataset into a variable called df.

#Replace 'your_dataset.csv' with the actual dataset name or file path

df = pd.read_csv('your_dataset.csv')
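If your file uses a different delimiter or marks missing data with custom placeholders, read_csv can handle that too. A small sketch, assuming a semicolon-separated file that writes missing values as 'NA' or '?':

#Hypothetical variant: semicolon-separated file with custom missing-value markers

df = pd.read_csv('your_dataset.csv', sep=';', na_values=['NA', '?'])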

Exploratory Data Analysis (EDA)

EDA helps you understand the structure and characteristics of your dataset. Several Pandas functions help us gain insights into it; we call them as methods on the DataFrame variable. For example:

df.head() returns the first 5 rows of the dataset. You can specify the number of rows to display in the parentheses.

df.describe() gives summary statistics such as percentiles, mean, and standard deviation for the numerical values of a Series or DataFrame.

df.info() gives the number of columns, column labels, column data types, memory usage, range index, and the number of non-null cells in each column.

#Display the first few rows of the dataset

print(df.head())

#Summary statistics

print(df.describe())

#Information about the dataset

print(df.info())
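As noted above, head() accepts a row count, so you are not limited to the default of 5:

#Display the first 10 rows instead of the default 5

print(df.head(10))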

How to Handle Missing Values

For a newbie in this field, missing values are a significant source of stress: they come in different formats and can adversely impact your analysis or model.

Machine learning models cannot be trained on data that contains missing ("NaN") values, and such values can skew your results during analysis. But do not fret, Pandas provides methods to handle this problem.

One way to do this is by removing the missing values altogether. Code snippet below:

#Check for missing values in each column

print(df.isnull().sum())

#Drop rows with missing values and place the result in a new variable "df_cleaned"

df_cleaned = df.dropna()

#Fill missing values with the column mean for numerical data and place the result in a new variable called df_filled
#numeric_only=True avoids errors when the DataFrame also contains non-numeric columns

df_filled = df.fillna(df.mean(numeric_only=True))

But if a large number of rows have missing values, dropping them discards too much data, and this method will be inadequate.

For numerical data, you can instead compute the mean and fill it into the rows that have missing values. Code snippet below:

#Replace missing values with the mean of each numeric column

df.fillna(df.mean(numeric_only=True), inplace=True)

#If you want to replace missing values in a specific column, you can do it this way:

#Replace 'column_name' with the actual column name
#Assigning the result back avoids the chained-assignment pitfall of calling fillna with inplace=True on a single column

df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

#Now df's numeric columns contain no missing values; NaNs have been replaced with the column mean
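The mean only makes sense for numerical columns. For a categorical column, one common convention (an assumption here, not covered by the snippet above) is to fill with the most frequent value; 'category_column' is a hypothetical name:

#Fill missing values in a hypothetical categorical column with its most frequent value (mode)

df['category_column'] = df['category_column'].fillna(df['category_column'].mode()[0])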

How to Remove Duplicate Records

Duplicate records can distort your analysis by inflating counts and skewing statistics, so the results no longer accurately reflect trends and underlying patterns.

Pandas makes it easy to identify duplicates and place a deduplicated copy of the data in a new variable.

Code snippet below:

#Identify duplicates

print(df.duplicated().sum())

#Remove duplicates

df_no_duplicates = df.drop_duplicates()
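By default, drop_duplicates compares entire rows. Its subset and keep parameters let you deduplicate on specific columns instead; 'column_name' below is a placeholder:

#Keep only the first occurrence of each value in a specific column

df_no_duplicates = df.drop_duplicates(subset=['column_name'], keep='first')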

Data Types and Conversion

Data type conversion in Pandas is a crucial aspect of data preprocessing, allowing you to ensure that your data is in the appropriate format for analysis or modeling.

Data from various sources is usually messy, and the data types of some values may be in the wrong format. For example, some numerical values may arrive as 'float' or 'string' instead of 'integer', and a mix-up of these formats leads to errors and wrong results.

You can convert a column of type int to float with the following code:

#Convert 'Column1' to float

df['Column1'] = df['Column1'].astype(float)

#Display updated data types

print(df.dtypes)

You can use df.dtypes to print column data types.
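Note that astype(float) raises an error if a column contains values that cannot be parsed as numbers. For messy string columns, pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN; 'Column2' is a hypothetical column name:

#Convert a messy string column to numbers; unparseable values become NaN

df['Column2'] = pd.to_numeric(df['Column2'], errors='coerce')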

How to Handle Outliers

Outliers are data points significantly different from the majority of the data. They can distort statistical measures and adversely affect the performance of machine learning models.

They may be caused by human error or missing NaN values, or they could be accurate data that simply does not fit the rest of the distribution.

There are several methods to identify and remove outliers:

Remove NaN values.
Visualize the data before and after removal.
Z-score method (for normally distributed data; a short sketch follows the IQR snippet below).
IQR (interquartile range) method, which is more robust for skewed data.

The IQR is useful for identifying outliers in a dataset. According to the IQR method, values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.

This rule is based on the assumption that most of the data in a normal distribution should fall within this range.

Here's a code snippet for the IQR method:

#Using quartiles and the IQR, identify the outlier bounds and remove rows outside them

Q1 = df["column_name"].quantile(0.25)

Q3 = df["column_name"].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

df = df[df["column_name"].between(lower_bound, upper_bound)]
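For completeness, here is a minimal sketch of the Z-score method from the list above, assuming the column is roughly normally distributed. The threshold of 3 standard deviations is a common convention, not a rule from the original text:

#Z-score method: drop rows more than 3 standard deviations from the column mean

mean = df["column_name"].mean()

std = df["column_name"].std()

df = df[((df["column_name"] - mean) / std).abs() <= 3]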

If you find this helpful, repost for more content.

linkedin.com/in/ileonjose
