6. Data Cleaning

The document discusses the importance of data cleaning in data handling and visualization, outlining the need to address missing, duplicate, and invalid data to ensure data quality. It details key steps in the data cleaning process, including techniques for handling missing values, duplicates, and standardizing formats. The document emphasizes that effective data cleaning enhances data reliability, improves model performance, and facilitates better decision-making.


Course Title – Data Handling and Visualization

Topic Title – Data Cleaning: Need for cleaning the data, treating missing values,
treating duplicate values, treating bad data

Data Handling and Visualization
Dr. Akella S. Narasimha Raju
Assistant Professor,
Institute of Aeronautical Engineering,
Dundigal,
Hyderabad

27/05/2025 · Data Handling and Visualization – Dr. A. S. Narasimha Raju


MODULE III: Data Cleaning and Pre-Processing
Data Cleaning: Need for cleaning the data, treating missing values, treating duplicate values, treating bad data
1. What is Data Cleaning?
• Definition: Data cleaning is the process of detecting, correcting, or removing corrupt, inaccurate, or incomplete data from a dataset to ensure its quality.
• Objectives:
  • To improve the reliability of data analysis and decision-making.
  • To prepare the data for further processing and modeling.
• Examples:
  • Removing rows with missing customer details.
  • Correcting invalid entries like negative ages in a dataset.
2. Why is Data Cleaning Needed?
• Challenges with Raw Data:
  1. Incomplete Data:
     • Missing values in rows/columns.
     • Example: missing addresses in a delivery database.
  2. Inconsistent Data:
     • Conflicting formats or representations.
     • Example: dates written as "01/01/2023" and "2023-01-01".
  3. Incorrect or Invalid Data:
     • Values outside an acceptable range.
     • Example: age values like -10 or 200.
  4. Duplicate Data:
     • Repeated records leading to redundancy.
     • Example: multiple rows for the same product in a sales dataset.
Importance of Data Cleaning:
1. Improves Data Quality: ensures consistency and accuracy of analysis.
2. Enhances Model Performance: garbage in, garbage out; poor data leads to unreliable models.
3. Saves Time in Long-Term Analysis: clean data simplifies and speeds up analysis.
3. Key Steps in Data Cleaning
A. Handling Missing Values
1. What are Missing Values?
  • Missing values occur when no data is provided for certain attributes in a dataset.
  • Represented as NaN (Not a Number) in Pandas.
2. Causes:
  • Data entry errors (e.g., skipped fields).
  • Faulty data collection systems.
  • Merging datasets with non-overlapping information.
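Before choosing between removal and imputation, it helps to quantify the gaps. A minimal sketch (the DataFrame here is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with gaps
df = pd.DataFrame({
    "Name": ["Alice", None, "Charlie"],
    "Age": [25, np.nan, 35],
})

# Count of missing values per column
missing_per_column = df.isna().sum()

# Fraction of missing cells in the whole frame
missing_fraction = df.isna().mean().mean()
print(missing_per_column, missing_fraction)
```

These counts guide the choice below: drop a column that is mostly empty, impute one with only a few gaps.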
3. Techniques to Handle Missing Values:
  • Removal: remove rows or columns with excessive missing values.

```python
df.dropna(inplace=True)  # remove rows with missing values
```
  • Imputation: replace missing values with appropriate substitutes:
    1. Mean/Median: suitable for numerical data.
    2. Mode: suitable for categorical data.
    3. Forward/Backward Fill: propagate adjacent values.

```python
# Assign the result back; calling fillna(inplace=True) on a column is deprecated in pandas 2.x
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
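The mode and forward-fill options from the list above can be sketched on a small, made-up Series:

```python
import pandas as pd

# Hypothetical categorical Series with gaps
s = pd.Series(["red", None, "blue", None, "blue"])

# Mode: fill with the most frequent value (for categorical data)
s_mode = s.fillna(s.mode()[0])

# Forward fill: propagate the previous valid value downward
s_ffill = s.ffill()
```

Forward fill suits ordered data (e.g. time series) where the previous observation is a sensible stand-in; the mode suits unordered categories.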
  • Custom Values: replace with default or placeholder values.

```python
df['City'] = df['City'].fillna('Unknown')
```
B. Handling Duplicate Values
1. What are Duplicate Values?
  • Records or rows that repeat in the dataset, leading to redundancy.
2. Causes:
  • Data collection or merging errors.
  • Repetitive entries by users.
3. Techniques to Handle Duplicates:
  • Identify Duplicates:

```python
print(df.duplicated())  # boolean Series: True for rows that repeat an earlier row
```
  • Remove Duplicates: drop duplicate rows.

```python
df.drop_duplicates(inplace=True)
```

  • Retain Specific Records: keep the first (the default) or last occurrence.

```python
df.drop_duplicates(keep='last', inplace=True)
```
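In practice, rows are often duplicates only on a key column (e.g. an order id) while other columns differ; both `duplicated()` and `drop_duplicates()` accept a `subset` parameter for this. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical orders table: order 101 was logged twice at different times
df = pd.DataFrame({
    "order_id":  [101, 102, 101],
    "amount":    [50.0, 75.0, 50.0],
    "logged_at": ["09:00", "09:05", "09:30"],
})

# Flag rows whose order_id already appeared, ignoring the other columns
dupes = df.duplicated(subset=["order_id"])

# Keep only the most recent record per order
df_clean = df.drop_duplicates(subset=["order_id"], keep="last")
```

Choosing `keep="last"` here assumes later rows supersede earlier ones; with append-only logs that is usually the right call.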
C. Treating Invalid Data
1. What is Invalid Data?
  • Data that violates logical or domain-specific constraints.
  • Examples:
    • Age: -5 (invalid for any real-world application).
    • Salary: "abc" (a string instead of a numeric value).
2. Causes:
  • Human errors during data entry.
  • Faulty systems generating unrealistic values.
3. Techniques to Handle Invalid Data:
  • Identify Invalid Data: use logical conditions to flag errors.

```python
invalid_ages = df[df['Age'] < 0]
print(invalid_ages)
```
  • Correct Data: replace invalid values with meaningful defaults.

```python
df['Age'] = df['Age'].apply(lambda x: 0 if x < 0 else x)
```

  • Remove Data: drop rows with invalid entries.

```python
df = df[df['Age'] >= 0]
```
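For the Salary-as-"abc" case above, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, converting bad data into ordinary missing values that the techniques from section A can handle. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical column mixing numbers and junk strings
salary = pd.Series(["50000", "abc", "62000"])

# errors='coerce' replaces unparseable entries with NaN instead of raising
salary_numeric = pd.to_numeric(salary, errors="coerce")

# The NaN rows can then be dropped or imputed like any other missing value
valid_salaries = salary_numeric.dropna()
```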
D. Standardizing Formats
1. What is Format Standardization?
  • Ensures consistency in how data is stored and represented.
2. Common Scenarios:
  • Date formats: "2023-01-01" vs. "01/01/2023".
  • Text case: "New York" vs. "new york".
  • Numeric precision: 3.14159 vs. 3.14.
3. Techniques:
  • Standardize dates:

```python
df['Date'] = pd.to_datetime(df['Date'])
```

  • Normalize text case:

```python
df['City'] = df['City'].str.title()
```

  • Round numeric values:

```python
df['Price'] = df['Price'].round(2)
```
4. Detailed Example

```python
import pandas as pd
import numpy as np

# Sample DataFrame with missing, duplicate, and invalid entries
data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", None],
    "Age": [25, np.nan, 35, 25, -5],
    "City": ["New York", "Los Angeles", None, "New York", "Chicago"],
    "Date": ["2023-01-01", "2023/01/02", None, "2023-01-01", "2023-01-05"],
}
df = pd.DataFrame(data)

# 1. Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna("Unknown")

# 2. Remove duplicate rows
df.drop_duplicates(inplace=True)

# 3. Treat invalid data
df = df[df['Age'] >= 0]  # remove invalid ages

# 4. Standardize formats
# format='mixed' (pandas >= 2.0) parses each entry individually, since the dates mix "-" and "/" styles
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
df['City'] = df['City'].str.title()

print("\nCleaned DataFrame:\n", df)
```


5. Real-Life Applications

• Healthcare:
• Handle missing patient data for accurate diagnosis.
• Remove duplicate patient records to prevent billing errors.
• E-Commerce:
• Clean transaction data by removing duplicates.
• Treat invalid values in product pricing to avoid incorrect revenue
calculations.
• Education:
• Fill missing exam scores with averages.
• Normalize text data for student names (e.g., capitalizing first letters).
6. Benefits of Data Cleaning
1. Enhanced Data Reliability: accurate insights from cleaned data.
2. Efficient Processing: clean data reduces processing time in downstream tasks.
3. Improved Decision-Making: better predictions and conclusions.
Conclusion

Data cleaning is an essential step in any data science workflow. By addressing missing, duplicate, and invalid data, and ensuring consistent formats, we create a reliable foundation for analysis and modeling.
