6. Data Cleaning
Topic Title – Data Cleaning: the need for cleaning data, treating missing values,
treating duplicate values, treating bad data
Data Handling and Visualization
Dr. Akella S Narasimha Raju
Assistant Professor,
Institute of Aeronautical Engineering,
Dundigal,
Hyderabad
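Custom Values:
•Replace missing entries with a default or placeholder value.
•Example:
python
# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may leave df unchanged.
df['City'] = df['City'].fillna('Unknown')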
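A minimal sketch of placeholder filling on several columns at once, assuming a small hypothetical DataFrame (the `Name` and `Phone` columns and all sample values are illustrative, not from the slides). Passing a dictionary to `fillna` sets a separate default for each column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; columns other than 'City' are assumptions.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "City": ["Hyderabad", np.nan, "Chennai"],
    "Phone": [np.nan, "9000000001", np.nan],
})

# Count the missing values per column before filling.
missing_before = df.isna().sum()

# Fill each column with its own placeholder value.
df = df.fillna({"City": "Unknown", "Phone": "N/A"})
```

Checking `missing_before` first helps decide whether a placeholder is reasonable or whether a column is too sparse to keep.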
3. Key Steps in Data Cleaning
•Remove Duplicates:
•Drop duplicate rows using:
python
df.drop_duplicates(inplace=True)
•Remove Bad Data:
•Drop rows with invalid entries, such as negative ages.
python
df = df[df['Age'] >= 0]
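The two steps above can be sketched on a small hypothetical DataFrame (the `Name` column and the sample values are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical sample data with one exact duplicate row and one invalid age.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meena"],
    "Age": [25, 31, 31, -4],
})

# Count exact duplicate rows (the first occurrence is not counted).
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

# Keep only rows whose age is non-negative.
df = df[df["Age"] >= 0].reset_index(drop=True)
```

Note that a missing age (NaN) also fails the `>= 0` comparison, so rows with a missing `Age` are dropped by this filter unless the missing values are filled first.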
• D. Standardizing Formats
1. What is Format Standardization?
• Ensures consistency in how data is stored and represented.
2. Common Scenarios:
• Date Formats: "2023-01-01" vs. "01/01/2023".
• Text Case: "New York" vs. "new york".
• Numeric Precision: 3.14159 vs. 3.14.
3. Techniques:
•Standardize Dates:
python
df['Date'] = pd.to_datetime(df['Date'])
•Normalize Text Case:
python
df['City'] = df['City'].str.title()
•Round Numeric Values:
python
df['Price'] = df['Price'].round(2)
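As an illustration, the three techniques can be applied together to a small hypothetical DataFrame (the sample values are assumptions, chosen to mirror the scenarios listed earlier):

```python
import pandas as pd

# Hypothetical sample data; the values are illustrative only.
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-02-15", "2023-03-31"],
    "City": ["new york", "NEW YORK", " New York "],
    "Price": [3.14159, 2.71828, 10.0],
})

# Parse ISO-formatted strings into datetime values; an explicit format
# avoids ambiguity between day-first and month-first dates.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# Strip stray whitespace, then capitalize the first letter of each word.
df["City"] = df["City"].str.strip().str.title()

# Round prices to two decimal places.
df["Price"] = df["Price"].round(2)
```

After these steps every row holds the same spelling of the city, so grouping or counting by `City` no longer splits one city into several categories. If a single column mixes date formats, one `format` string will not match every row, and strings such as "01/02/2023" remain ambiguous between day-first and month-first, so the source convention should be confirmed before parsing.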
Detailed Example
python
import pandas as pd
import numpy as np
# 1. Handle Missing Values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna("Unknown")
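The following is one possible end-to-end sketch that combines the steps from this section. The `clean` helper, the `Name` column, and the sample data are hypothetical, and the order of the steps is a design choice made here, not something the slides prescribe: invalid and duplicate rows are removed before the mean is computed so that they do not distort the fill value.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (hypothetical helper, not from the slides)."""
    # 1. Drop rows with negative ages but keep missing ones for filling later.
    out = df[df["Age"].isna() | (df["Age"] >= 0)].copy()
    # 2. Standardize text case so rows differing only in case match.
    out["City"] = out["City"].str.title()
    # 3. Drop exact duplicate rows.
    out = out.drop_duplicates().copy()
    # 4. Fill missing values: mean for Age, placeholder for City.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["City"] = out["City"].fillna("Unknown")
    return out.reset_index(drop=True)

# Hypothetical raw data: a case-only duplicate, a missing row, a negative age.
raw = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meena", "Kiran"],
    "Age": [20.0, 30.0, 30.0, np.nan, -5.0],
    "City": ["hyderabad", "Chennai", "chennai", np.nan, "Delhi"],
})
cleaned = clean(raw)
```

Working on a copy leaves `raw` untouched, which makes it easy to compare the data before and after cleaning.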
5. Real-Life Applications
• Healthcare:
• Handle missing patient data for accurate diagnosis.
• Remove duplicate patient records to prevent billing errors.
• E-Commerce:
• Clean transaction data by removing duplicates.
• Treat invalid values in product pricing to avoid incorrect revenue
calculations.
• Education:
• Fill missing exam scores with averages.
• Normalize text data for student names (e.g., capitalizing first letters).
6. Benefits of Data Cleaning