Data Cleaning in Machine Learning With Numerical Example

Data cleaning is essential in machine learning to enhance model performance by addressing issues like missing values, duplicates, and inconsistencies. The document outlines a step-by-step data cleaning process using a numerical example, demonstrating how to handle missing values, remove duplicates, fix inconsistencies, and detect outliers using Python and pandas. Proper data cleaning leads to improved model accuracy and reduces bias in predictions.

Uploaded by mytreyan197
Copyright © All Rights Reserved
Data Cleaning in Machine Learning with Numerical Example

Data cleaning is a crucial step in machine learning that involves handling
missing values, removing duplicates, correcting errors, and ensuring data
consistency. Poor data quality can lead to poor model performance, so
cleaning data properly is essential.

Steps in Data Cleaning


1. Handling Missing Values
2. Removing Duplicates
3. Fixing Inconsistent Data
4. Handling Outliers
5. Converting Data Types
6. Feature Scaling (if necessary)
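
Step 6 is listed above but is not exercised in the numerical example that follows. As a minimal sketch, assuming min-max scaling is the method wanted (the list does not name one), numeric columns can be rescaled to the range [0, 1] with plain pandas. The helper name min_max_scale and the choice of sample rows (the complete, valid records 101, 104, 105, and 108 from the example) are assumptions made for illustration.

```python
import pandas as pd

# Illustrative frame: the four complete, valid rows of the example dataset.
df = pd.DataFrame({
    "Age": [25, 35, 50, 29],
    "Salary": [50000, 70000, 80000, 90000],
})


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column linearly to the range [0, 1]."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column has no spread; map it to zeros.
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span


# Apply the helper to every column of the frame.
scaled = df.apply(min_max_scale)
print(scaled)
```

Scaling is usually applied after the other cleaning steps, because missing or invalid values would distort the minimum and maximum used for the rescaling.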

Numerical Example: Data Cleaning Process


Step 1: Raw Dataset (Before Cleaning)
Suppose we have a dataset with customer purchase information:

Customer_ID   Age   Salary ($)   Purchase (Yes=1, No=0)
101           25    50000        1
102           NaN   60000        0
103           40    NaN          1
104           35    70000        1
105           50    80000        0
106           25    50000        1
107           -5    45000        1
108           29    90000        0

🛑 Issues in the dataset:

• Missing values: Age for Customer 102 and Salary for Customer 103.
• Duplicate record: Customer 106 repeats the Age, Salary, and Purchase values of Customer 101 under a different Customer_ID.
• Inconsistent data: Customer 107 has an invalid Age of -5.
• Possible outliers: the Salary column needs to be checked for extreme values.
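
The issues listed above can also be surfaced programmatically rather than by eye. The following is a minimal sketch that rebuilds the raw table and counts missing values, flags repeated records, and flags negative ages. The variable names (value_cols, duplicate_ids, invalid_age_ids) are illustrative, and comparing rows without Customer_ID is an assumption about what counts as a duplicate here.

```python
import numpy as np
import pandas as pd

# Same values as the raw table above.
df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Age": [25, np.nan, 40, 35, 50, 25, -5, 29],
    "Salary": [50000, 60000, np.nan, 70000, 80000, 50000, 45000, 90000],
    "Purchase": [1, 0, 1, 1, 0, 1, 1, 0],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Flag rows that repeat an earlier row once Customer_ID is ignored.
value_cols = ["Age", "Salary", "Purchase"]
duplicate_ids = df.loc[df.duplicated(subset=value_cols), "Customer_ID"].tolist()

# Flag ages that cannot be valid (negative values).
invalid_age_ids = df.loc[df["Age"] < 0, "Customer_ID"].tolist()

print(missing_counts)
print("Duplicate rows:", duplicate_ids)
print("Invalid ages:", invalid_age_ids)
```

Running checks like these before cleaning gives a baseline that can be re-run afterwards to confirm each issue is gone.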
Step 2: Data Cleaning in Python
Let's clean this dataset step by step using Python and pandas.
import pandas as pd
import numpy as np

# Create the raw dataset from the table above
data = {
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Age": [25, np.nan, 40, 35, 50, 25, -5, 29],
    "Salary": [50000, 60000, np.nan, 70000, 80000, 50000, 45000, 90000],
    "Purchase": [1, 0, 1, 1, 0, 1, 1, 0],
}

df = pd.DataFrame(data)
print("Original Dataset:\n", df)

# **1. Handling Missing Values**
# Assign the filled column back to df; calling fillna(inplace=True) on a
# column selection is not reliable in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing Age with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Fill missing Salary with median

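# **2. Removing Duplicates**
# Customer_ID differs on every row, so compare the remaining columns only
df = df.drop_duplicates(subset=['Age', 'Salary', 'Purchase'])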

# **3. Fixing Inconsistent Data**
# Assumes the negative Age is a sign error made during data entry
df['Age'] = df['Age'].abs()  # Convert negative Age to positive

# **4. Checking Outliers (Optional)**
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Keep only rows whose Salary lies within the IQR bounds
df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

# **5. Convert Data Types if Needed**
# astype(int) truncates, so an imputed mean such as 28.43 becomes 28
df['Age'] = df['Age'].astype(int)

# Final Cleaned Dataset


print("\nCleaned Dataset:\n", df)
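
The script above imputes Age before the -5 is corrected, so the invalid value pulls the mean down, and taking the absolute value assumes the only error was the sign. An alternative ordering, offered here as a sketch and not as the method used in this example, treats out-of-range ages as missing first and only then imputes. The bounds MIN_AGE and MAX_AGE are assumptions; the example does not define a valid age range.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Age": [25, np.nan, 40, 35, 50, 25, -5, 29],
    "Salary": [50000, 60000, np.nan, 70000, 80000, 50000, 45000, 90000],
    "Purchase": [1, 0, 1, 1, 0, 1, 1, 0],
})

# Assumed plausibility range for Age; not stated in the original example.
MIN_AGE, MAX_AGE = 0, 120

# Replace out-of-range ages with NaN so they are imputed like other gaps.
df["Age"] = df["Age"].where(df["Age"].between(MIN_AGE, MAX_AGE))

# Drop repeated records before computing statistics, ignoring the ID column.
df = df.drop_duplicates(subset=["Age", "Salary", "Purchase"])

# Impute using statistics from the valid, de-duplicated rows only.
age_mean = df["Age"].mean()
df["Age"] = df["Age"].fillna(age_mean)
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

print(df)
```

With this ordering the Age mean is computed from the five valid, distinct ages (25, 40, 35, 50, 29) and comes to 35.8, compared with about 28.4 when the -5 and the duplicate are included. Which ordering is appropriate depends on whether the invalid value is believed to be a recoverable typo or simply unknown.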

Step 3: Cleaned Dataset (After Cleaning)


Customer_ID   Age   Salary ($)   Purchase (Yes=1, No=0)
101           25    50000        1
102           28    60000        0
103           40    60000        1
104           35    70000        1
105           50    80000        0
107           5     45000        1
108           29    90000        0

Improvements:
✅ Missing values filled: Age with the column mean (about 28.4, stored as 28 after the integer conversion) and Salary with the median (60000).
✅ Duplicate record removed (Customer 106 repeated the values of Customer 101).
✅ Negative value corrected (Customer 107’s Age changed from -5 to 5).
✅ Salary checked for outliers using the IQR method; every value lies within the bounds, so no rows are dropped.
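
To make the IQR check concrete, the sketch below works through the arithmetic on the Salary column of the raw dataset after median imputation. It relies on the default linear interpolation of pandas' quantile; other quantile conventions give slightly different quartiles. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Salary column of the raw dataset, including the one missing value.
salary = pd.Series([50000, 60000, np.nan, 70000, 80000, 50000, 45000, 90000])

# Fill the gap with the median of the seven known values.
salary = salary.fillna(salary.median())

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] would be treated as outliers.
outliers = salary[(salary < lower_bound) | (salary > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers found:", len(outliers))
```

Here Q1 is 50000 and Q3 is 72500, which gives bounds of 16250 and 106250. All salaries, from 45000 to 90000, fall inside that interval, so the filter keeps every row.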

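Why Data Cleaning Matters


✔️ Improves model accuracy by removing noise from the training data
✔️ Reduces bias introduced by missing, duplicated, or invalid records
✔️ Prevents prediction errors caused by malformed inputs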
