Data Cleaning in Python
Handling missing values and duplicate data is a common preprocessing task when
working with datasets in Python, and the Pandas library provides convenient tools for both.
Here's how you can handle each with Pandas:
Handling Missing Values:
Missing values in a dataset can be represented in various ways, such as NaN, None, or other
custom placeholders. Pandas provides several methods to handle them:
1. Detect Missing Values: You can use the isna() or isnull() methods to detect
missing values in a DataFrame.
import pandas as pd
df = pd.read_csv('your_data.csv')
print(df.isna().sum())   # count of missing values in each column
2. Drop Missing Values: To remove rows or columns containing missing values, you can
use the dropna() method.
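For example, a minimal sketch, assuming df is the DataFrame loaded above:
df_rows_dropped = df.dropna()        # drop rows containing any missing value
df_cols_dropped = df.dropna(axis=1)  # drop columns containing any missing value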
3. Fill Missing Values: You can fill missing values using the fillna() method with a
specified value.
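For example (the fill value 0 here is arbitrary; choose one that makes sense for your data):
df_filled = df.fillna(0)   # replace every missing value with 0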
4. Interpolate Missing Values: You can use interpolation methods, such as linear
interpolation, to estimate missing values.
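For example, linear interpolation on numeric columns:
df_interpolated = df.interpolate()   # estimate missing values from neighboring rows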
Handling Duplicate Data:
Duplicate data can lead to incorrect analysis and results. Pandas provides methods to handle
duplicate data:
1. Detect Duplicates: Use the duplicated() method to identify duplicate rows and the
drop_duplicates() method to remove them.
duplicated_rows = df[df.duplicated()]   # rows that repeat an earlier row
df = df.drop_duplicates()               # returns a new DataFrame; reassign to keep the result
2. Keep the First or Last Occurrence: When removing duplicates, you can specify
whether to keep the first or last occurrence of a duplicate row using the keep
parameter.
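For example:
df_keep_first = df.drop_duplicates(keep='first')   # default: keep the first occurrence
df_keep_last = df.drop_duplicates(keep='last')     # keep the last occurrence instead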
By using these methods, you can effectively handle missing values and duplicate data in your
Pandas DataFrame. It's important to choose the method that best suits your data and analysis
needs.
Handling Outliers:
Outliers are data points that significantly differ from the rest of the data and can affect the
analysis. Here's how you can detect and handle outliers:
1. Detect Outliers: Use statistical methods like z-scores or IQR (Interquartile Range) to
identify outliers in your dataset.
from scipy import stats
import numpy as np

numeric_df = df.select_dtypes(include='number')   # z-scores require numeric data
z_scores = stats.zscore(numeric_df, nan_policy='omit')
abs_z_scores = np.abs(z_scores)
outlier_rows = (abs_z_scores > 3).any(axis=1)     # True for rows with any |z| > 3
outliers = numeric_df[outlier_rows]
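The IQR method is a common alternative that is less sensitive to extreme values. A minimal
sketch, using a hypothetical numeric column named 'Salary':
col = df['Salary']                                # hypothetical column name
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1                                     # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # common 1.5*IQR fences
iqr_outliers = df[(col < lower) | (col > upper)]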
Data Imputation:
Data imputation is the process of filling in missing values with estimates to make the dataset
complete. Here are some methods for data imputation using Pandas:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0],
    'B': [10.0, np.nan, 30.0, 40.0],
})

mean_imputed = df.fillna(df.mean())
median_imputed = df.fillna(df.median())
mode_imputed = df.fillna(df.mode().iloc[0])   # mode() may return several rows; take the first

print("Original DataFrame:")
print(df)
print("\nImputed with Mean:")
print(mean_imputed)
print("\nImputed with Median:")
print(median_imputed)
print("\nImputed with Mode:")
print(mode_imputed)
In this code, we first create a sample DataFrame with missing values. Then, we use the
fillna() method to perform mean, median, and mode imputation on the DataFrame,
creating three separate DataFrames for each imputation method.
Remember that mean imputation fills missing values with the mean of the respective column,
median imputation uses the median, and mode imputation uses the mode (most frequent
value). The code showcases how to apply each of these imputation techniques to handle
missing values in your data.
Data Normalization and Standardization:
Data normalization and data standardization are two common techniques used in data
preprocessing to prepare data for analysis or machine learning. Both transform the data in
ways that make it more suitable for modeling, improving the performance and
interpretability of machine learning algorithms, and both are typically applied to numerical
data.
1. Data Normalization:
Data normalization (min-max scaling) rescales each feature to a fixed range, typically
[0, 1], so that features with different ranges become directly comparable.
Formula: x_norm = (x - x_min) / (x_max - x_min)
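A minimal sketch using scikit-learn's MinMaxScaler (also referenced in the interview
questions below), assuming df contains only numeric columns:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler()   # default feature_range is (0, 1)
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)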
2. Data Standardization:
Data standardization, also known as z-score scaling, is a specific type of data normalization
that transforms data to have a mean of 0 and a standard deviation of 1. Standardization is
particularly useful for algorithms that assume roughly normally distributed inputs; note that
it changes the scale and location of the data, not the shape of its distribution.
Advantages of standardization:
It makes the data more suitable for algorithms like principal component analysis
(PCA) and linear regression.
It helps in comparing features with different units or scales.
Standardization formula: z = (x - mean) / std, i.e. each value is shifted by the column mean
and divided by the column standard deviation.
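A minimal sketch using scikit-learn's StandardScaler (also referenced in the interview
questions below), assuming df contains only numeric columns:
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()   # rescales each column to mean 0, standard deviation 1
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)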
Use normalization when you have features with different ranges or when you want to
scale the data to a specific range.
Use standardization when you want to transform the data to have a mean of 0 and a
standard deviation of 1, which is often a requirement for certain statistical and
machine learning models.
The choice between normalization and standardization depends on the nature of your data
and the requirements of your specific machine learning algorithm. In some cases, it's also a
good practice to try both methods and see which one works better for your particular
problem.
Interview Questions:
1. How do you handle missing values in a Pandas DataFrame?
Answer: You can use the fillna() method to fill missing values with a specific
value or use the dropna() method to remove rows or columns with missing values.
2. How do you identify and handle outliers in a dataset?
Answer: You can identify outliers using methods like z-scores or IQR, and then
decide either to remove them or to transform them to reduce their impact on the analysis.
3. What is data imputation, and how can you perform it in Pandas?
Answer: Data imputation is the process of filling missing values with estimated or
substituted values. Pandas provides functions like fillna() for this purpose.
4. How can you normalize data in Python?
Answer: You can normalize data by scaling it to a specific range, typically [0, 1],
using techniques like Min-Max scaling with the MinMaxScaler from the
sklearn.preprocessing library.
5. What is data standardization, and how can you achieve it using Pandas and
scikit-learn?
Answer: Data standardization (or z-score standardization) scales data to have a mean
of 0 and a standard deviation of 1. You can use the StandardScaler from scikit-learn
or manually calculate it with Pandas.
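The manual Pandas version is a one-liner on a numeric DataFrame df:
df_standardized = (df - df.mean()) / df.std()   # note: pandas std() uses ddof=1, StandardScaler uses ddof=0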
6. Can you demonstrate how to remove all rows with missing values in a Pandas
DataFrame?
Answer: You can use the dropna() method with how='any' to remove all rows
containing any missing values.
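For example:
df_clean = df.dropna(how='any')   # drop every row with at least one missing value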
7. Explain the difference between imputing missing values with the mean and median.
Answer: Imputing with the mean replaces missing values with the average of the
available data, while imputing with the median replaces them with the middle value,
making it less sensitive to outliers.
Practice Problems:
Problem 1: Identify and count the number of missing values in each column of the dataset.
Problem 2: Find and remove duplicate records from the dataset.
Problem 3: Identify outliers in the 'Salary' column using the Z-score method (threshold: z-score >
3 or < -3).
Problem 4: Normalize the 'Age' column using the min-max normalization technique.
Problem 5: Standardize the 'Salary' column using the z-score standardization technique.
Problem 6: Fill in missing values in the 'Gender' column with the mode value.