Data Cleaning Made Easy: Essential SQL Techniques in MySQL
Data Cleaning Steps:
1️⃣ Data Profiling (Inspect structure & missing values)
2️⃣ Handling Missing Data (Delete or fill missing values)
3️⃣ Removing Duplicates (Identify and delete duplicate records)
4️⃣ Fixing Inconsistent Data Formats (Standardize text, numbers, and dates)
5️⃣ Handling Outliers (Detect and cap extreme values)
6️⃣ Standardizing Categorical Data (Fix inconsistent categories)
7️⃣ Ensuring Referential Integrity (Check for foreign key violations)
8️⃣ Validating Relationships Between Tables (Fix orphan records)
9️⃣ Data Normalization (Optimize database structure)
🔟 Automating Data Cleaning (Use stored procedures & events)
Step 1: Data Profiling (Understanding the Data)
1.1 Check Table Structure:
DESCRIBE table_name;
SHOW COLUMNS FROM table_name;
1.2 Count Total Rows:
SELECT COUNT(*) FROM table_name;
1.3 Identify Missing Values (NULLs):
SELECT COUNT(*) AS null_count FROM table_name WHERE column_name IS NULL;
1.4 Check for Unique & Duplicate Values:
SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;
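A compact way to profile one column is to compare three counts side by side. This is a minimal sketch that reuses the placeholder names table_name and column_name from above; the aliases are illustrative.

```sql
-- Sketch: total rows, non-NULL rows, and distinct values for one column.
-- COUNT(column_name) and COUNT(DISTINCT column_name) both ignore NULLs.
SELECT COUNT(*)                    AS total_rows,
       COUNT(column_name)          AS non_null_rows,
       COUNT(DISTINCT column_name) AS distinct_values
FROM table_name;
```

If non_null_rows is larger than distinct_values, the column contains repeated values; if total_rows is larger than non_null_rows, it contains NULLs.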
Step 2: Handling Missing Data
2.1 Detect Missing Data:
SELECT * FROM table_name WHERE column_name IS NULL OR column_name = '';
2.2 Delete Rows with Missing Values (Use with caution):
DELETE FROM table_name WHERE column_name IS NULL;
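One way to apply the "use with caution" advice is to run the delete inside a transaction and inspect the row counts before deciding. This sketch assumes a transactional storage engine such as InnoDB; on a non-transactional engine the ROLLBACK has no effect.

```sql
-- Sketch: preview and rehearse a delete before making it permanent.
START TRANSACTION;

-- How many rows would be removed?
SELECT COUNT(*) AS rows_to_delete
FROM table_name
WHERE column_name IS NULL;

DELETE FROM table_name WHERE column_name IS NULL;

-- Number of rows affected by the DELETE just executed.
SELECT ROW_COUNT() AS rows_deleted;

-- Undo the rehearsal; replace with COMMIT; once the counts look right.
ROLLBACK;
```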
2.3 Replace NULL with Default Value:
UPDATE table_name SET column_name = 'default_value' WHERE column_name IS NULL;
2.4 Fill Missing Numeric Values with Mean:
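-- MySQL rejects a direct subquery on the table being updated (error 1093),
-- so the average is computed in a derived table first.
UPDATE table_name
SET column_name = (SELECT avg_value
FROM (SELECT AVG(column_name) AS avg_value FROM table_name) AS stats)
WHERE column_name IS NULL;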
Step 3: Removing Duplicates
3.1 Identify Duplicate Rows:
SELECT column_name, COUNT(*)
FROM table_name GROUP BY column_name
HAVING COUNT(*) > 1;
3.2 Delete Duplicate Rows Keeping One Record:
DELETE t1 FROM table_name t1
JOIN table_name t2
ON t1.column_name = t2.column_name
WHERE t1.id > t2.id;
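When a duplicate is defined by several columns rather than one, a window function can rank the rows in each group. The sketch below assumes MySQL 8.0 or later and a unique id column; col_a and col_b are placeholder column names, not part of the original example.

```sql
-- Sketch: keep the lowest id in each (col_a, col_b) group, delete the rest.
-- The inner derived table is materialized, so MySQL allows it to read
-- from the same table that the DELETE modifies.
DELETE FROM table_name
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY col_a, col_b ORDER BY id) AS rn
        FROM table_name
    ) AS ranked
    WHERE rn > 1
);
```

Running the inner SELECT on its own first shows exactly which rows would be removed.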
Step 4: Fixing Inconsistent Data Formats
4.1 Convert Text to Uppercase:
UPDATE table_name SET column_name = UPPER(column_name);
4.2 Convert Text to Lowercase:
UPDATE table_name SET column_name = LOWER(column_name);
4.3 Trim Extra Spaces:
UPDATE table_name SET column_name = TRIM(column_name);
4.4 Standardize Date Format:
UPDATE table_name
SET date_column = STR_TO_DATE(date_column, '%d/%m/%Y')
WHERE date_column LIKE '%/%/%';
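Before running the update, it may help to list the values that the format string cannot parse. This sketch assumes date_column is still a text column; in a SELECT, STR_TO_DATE returns NULL (with a warning) for input it cannot parse.

```sql
-- Sketch: find slash-separated values that do not parse as day/month/year.
SELECT date_column
FROM table_name
WHERE date_column LIKE '%/%/%'
  AND STR_TO_DATE(date_column, '%d/%m/%Y') IS NULL;
```

Any rows returned here likely need manual correction first, since under a strict SQL mode the UPDATE can fail on them.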
Step 5: Handling Outliers
5.1 Detect Outliers Using Standard Deviation:
SELECT column_name FROM table_name
WHERE column_name > (SELECT AVG(column_name) + 2*STDDEV(column_name)
FROM table_name);
5.2 Cap Outliers at a Maximum Value:
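-- The threshold and the cap are computed in derived tables because MySQL
-- rejects a direct subquery on the table being updated (error 1093).
UPDATE table_name t
CROSS JOIN (SELECT AVG(column_name) + 2*STDDEV(column_name) AS upper_limit
FROM table_name) s
CROSS JOIN (SELECT MAX(column_name) AS cap_value
FROM table_name
WHERE column_name < (SELECT AVG(column_name) + 2*STDDEV(column_name)
FROM table_name)) c
SET t.column_name = c.cap_value
WHERE t.column_name > s.upper_limit;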
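The queries in this step only look at the upper tail. A two-sided check, which flags values far below the mean as well, might look like the following sketch; the aliases mean_value and sd_value are illustrative, and the factor 2 is the same cutoff used above.

```sql
-- Sketch: flag values more than 2 standard deviations from the mean
-- in either direction.
SELECT t.column_name
FROM table_name AS t
CROSS JOIN (
    SELECT AVG(column_name)    AS mean_value,
           STDDEV(column_name) AS sd_value
    FROM table_name
) AS stats
WHERE ABS(t.column_name - stats.mean_value) > 2 * stats.sd_value;
```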
Step 6: Standardizing Categorical Data
6.1 Replace Incorrect Categorical Values:
UPDATE table_name SET column_name = 'CorrectValue'
WHERE column_name IN ('wrongValue1', 'wrongValue2');
6.2 Standardize Gender Data:
UPDATE table_name SET gender = CASE
WHEN gender IN ('M', 'Male') THEN 'Male'
WHEN gender IN ('F', 'Female') THEN 'Female'
ELSE 'Unknown'
END;
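Because the ELSE branch rewrites every value that is not listed, including NULL, it is worth listing the existing values before running the update. A minimal sketch:

```sql
-- Sketch: list every distinct gender value with its frequency,
-- so no valid spelling is silently mapped to 'Unknown'.
SELECT gender, COUNT(*) AS row_count
FROM table_name
GROUP BY gender
ORDER BY row_count DESC;
```

Any spelling that shows up here but is missing from the CASE expression can be added to the matching WHEN list first.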
Step 7: Ensuring Referential Integrity
7.1 Find Orphan Records (Foreign Key Mismatches):
SELECT * FROM orders WHERE customer_id
NOT IN (SELECT customer_id FROM customers);
7.2 Delete Orphan Records:
DELETE FROM orders WHERE customer_id
NOT IN (SELECT customer_id FROM customers);
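Once the orphan rows are gone, a foreign key constraint can stop new ones from appearing. This sketch assumes both tables use InnoDB, that orders has no such constraint yet, and that customers.customer_id is a primary or unique key; the constraint name fk_orders_customer is a made-up example.

```sql
-- Sketch: enforce the relationship after cleaning.
-- The statement fails if any orphan rows remain in orders.
ALTER TABLE orders
ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers (customer_id);
```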
Step 8: Validating and Cleaning Relationships Between Tables
8.1 Identify Invalid Foreign Key References:
SELECT o.* FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
8.2 Fix Incorrect Foreign Key Values:
UPDATE orders o
JOIN customers c ON o.customer_name = c.customer_name
SET o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;
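Matching on customer_name is only safe when each name identifies one customer. A quick check for ambiguous names, sketched with the same tables:

```sql
-- Sketch: names shared by more than one customer row.
-- Orders with these names cannot be assigned a customer_id reliably.
SELECT customer_name, COUNT(*) AS matches
FROM customers
GROUP BY customer_name
HAVING COUNT(*) > 1;
```

If this returns rows, those names are probably better resolved by hand than by the join-based update.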
Step 9: Data Normalization
9.1 Convert Denormalized Data to Normalized Format:
CREATE TABLE customers(
customer_id INT PRIMARY KEY,
customer_name VARCHAR(255) NOT NULL
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
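Creating the tables is only half of the conversion; the existing rows still have to be moved. The sketch below assumes a hypothetical flat source table orders_raw(order_id, customer_name, order_date) that is not part of the original example, an empty customers table, unique order_id values, and MySQL 8.0 or later for ROW_NUMBER().

```sql
-- Sketch: load one customers row per distinct name, generating ids.
INSERT INTO customers (customer_id, customer_name)
SELECT ROW_NUMBER() OVER (ORDER BY customer_name), customer_name
FROM (
    SELECT DISTINCT customer_name
    FROM orders_raw
    WHERE customer_name IS NOT NULL
) AS names;

-- Sketch: load orders, looking up each customer_id by name.
INSERT INTO orders (order_id, customer_id, order_date)
SELECT r.order_id, c.customer_id, r.order_date
FROM orders_raw AS r
JOIN customers AS c ON c.customer_name = r.customer_name;
```

Rows in orders_raw with a NULL customer_name are skipped by the join and would need separate handling.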
Step 10: Automating Data Cleaning
10.1 Create a Stored Procedure for Cleaning Data:
DELIMITER $$
CREATE PROCEDURE clean_data()
BEGIN
DELETE FROM table_name
WHERE column_name IS NULL;
UPDATE table_name SET column_name = TRIM(column_name);
END $$
DELIMITER ;
10.2 Schedule Automated Cleaning Jobs:
CREATE EVENT clean_data_event
ON SCHEDULE EVERY 1 DAY
DO CALL clean_data();
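A scheduled event only fires while the event scheduler is running, and whether it is on by default depends on the MySQL version and server configuration. A short sketch for checking and enabling it, assuming an account with the privilege to set global variables:

```sql
-- Sketch: check whether the scheduler is running.
SHOW VARIABLES LIKE 'event_scheduler';

-- Turn it on for the running server (not persisted across restarts
-- unless it is also set in the server configuration).
SET GLOBAL event_scheduler = ON;

-- Confirm the event is registered in the current database.
SHOW EVENTS;
```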