Data Cleaning Made Easy: Essential SQL Techniques in MySQL

The document outlines essential SQL techniques for data cleaning in MySQL, detailing ten key steps including data profiling, handling missing data, and removing duplicates. Each step provides specific SQL commands to perform tasks such as identifying missing values, standardizing formats, and ensuring referential integrity. Additionally, it emphasizes the importance of automating data cleaning processes through stored procedures and scheduled events.

Uploaded by

zubi.abbasi007

Data Cleaning Steps:

1️⃣ Data Profiling (Inspect structure & missing values)
2️⃣ Handling Missing Data (Delete or fill missing values)
3️⃣ Removing Duplicates (Identify and delete duplicate records)
4️⃣ Fixing Inconsistent Data Formats (Standardize text, numbers, and dates)
5️⃣ Handling Outliers (Detect and cap extreme values)
6️⃣ Standardizing Categorical Data (Fix inconsistent categories)
7️⃣ Ensuring Referential Integrity (Check for foreign key violations)
8️⃣ Validating Relationships Between Tables (Fix orphan records)
9️⃣ Data Normalization (Optimize database structure)
🔟 Automating Data Cleaning (Use stored procedures & events)
Step 1: Data Profiling (Understanding the Data)
1.1 Check Table Structure:

DESCRIBE table_name;

SHOW COLUMNS FROM table_name;
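
When the structure needs to be filtered or joined like ordinary data, the same details can be read from information_schema. This is a minimal sketch that assumes the table lives in the currently selected database; 'table_name' is the same placeholder used throughout this guide.

```sql
-- List each column with its type, nullability, and default value.
-- Assumes the current database (DATABASE()) holds the table.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_DEFAULT
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'table_name'
ORDER BY ORDINAL_POSITION;
```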

1.2 Count Total Rows:


SELECT COUNT(*) FROM table_name;

1.3 Identify Missing Values (NULLs):

SELECT COUNT(*) AS null_count FROM table_name WHERE column_name IS NULL;
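
To profile several columns in a single pass, the NULL checks can be aggregated side by side. The sketch below uses two hypothetical columns, col_a and col_b, which are not part of the original example. It relies on MySQL evaluating a comparison to 1 or 0, and on COUNT(col) skipping NULLs.

```sql
-- col_a and col_b are placeholder column names (assumptions).
SELECT
    COUNT(*)                AS total_rows,
    SUM(col_a IS NULL)      AS col_a_nulls,  -- IS NULL yields 1 or 0 in MySQL
    COUNT(*) - COUNT(col_b) AS col_b_nulls   -- COUNT(col_b) ignores NULLs
FROM table_name;
```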

1.4 Check for Unique & Duplicate Values:

SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;
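
A quick way to see whether a column contains any duplicates at all is to compare the row count with the distinct count. This is a sketch on the same placeholder names; note that COUNT(DISTINCT ...) ignores NULLs, so the difference also includes rows where the column is NULL.

```sql
-- If repeated_or_null_rows is 0, every row has a distinct, non-NULL value.
SELECT
    COUNT(*)                                  AS total_rows,
    COUNT(DISTINCT column_name)               AS distinct_values,
    COUNT(*) - COUNT(DISTINCT column_name)    AS repeated_or_null_rows
FROM table_name;
```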


Step 2: Handling Missing Data
2.1 Detect Missing Data:

SELECT * FROM table_name WHERE column_name IS NULL OR column_name = '';

2.2 Delete Rows with Missing Values (Use with caution):

DELETE FROM table_name WHERE column_name IS NULL;
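
One way to apply that caution is to run the delete inside a transaction and inspect the effect before making it permanent. The sketch below assumes the table uses a transactional engine such as InnoDB; with a non-transactional engine the ROLLBACK would not undo the delete.

```sql
START TRANSACTION;

-- How many rows are about to be removed?
SELECT COUNT(*) AS rows_to_delete
FROM table_name
WHERE column_name IS NULL;

DELETE FROM table_name WHERE column_name IS NULL;

-- ROW_COUNT() reports the rows affected by the preceding DELETE.
SELECT ROW_COUNT() AS rows_deleted;

-- Undo the change; replace with COMMIT once the numbers look right.
ROLLBACK;
```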

2.3 Replace NULL with Default Value:

UPDATE table_name SET column_name = 'default_value' WHERE column_name IS NULL;
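
If the stored data should stay untouched, the default can be applied at read time instead. A minimal sketch with the same placeholder names:

```sql
-- COALESCE returns the first non-NULL argument, so stored NULLs
-- are shown as 'default_value' without modifying the table.
SELECT COALESCE(column_name, 'default_value') AS column_name_filled
FROM table_name;
```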

2.4 Fill Missing Numeric Values with Mean:


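-- The extra derived table avoids MySQL error 1093 (updating a table that the subquery reads).
UPDATE table_name
SET column_name = (SELECT avg_value
                   FROM (SELECT AVG(column_name) AS avg_value
                         FROM table_name) AS stats)
WHERE column_name IS NULL;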
Step 3: Removing Duplicates

3.1 Identify Duplicate Rows:

SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;

3.2 Delete Duplicate Rows Keeping One Record:

DELETE t1 FROM table_name t1
JOIN table_name t2
ON t1.column_name = t2.column_name
WHERE t1.id > t2.id;
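
When a duplicate is defined by several columns together, a window function is an alternative to the self-join. This sketch assumes MySQL 8.0 or later (for ROW_NUMBER), a unique id column, and two hypothetical columns col_a and col_b that together define a duplicate. The inner query is wrapped in a derived table so that MySQL does not reject the delete for reading from its own target table.

```sql
-- Keep the row with the smallest id in each (col_a, col_b) group.
-- col_a and col_b are placeholder names (assumptions).
DELETE FROM table_name
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY col_a, col_b ORDER BY id) AS rn
        FROM table_name
    ) AS ranked
    WHERE rn > 1
);
```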
Step 4: Fixing Inconsistent Data Formats

4.1 Convert Text to Uppercase:

UPDATE table_name SET column_name = UPPER(column_name);

4.2 Convert Text to Lowercase:

UPDATE table_name SET column_name = LOWER(column_name);

4.3 Trim Extra Spaces:

UPDATE table_name SET column_name = TRIM(column_name);
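
TRIM only removes leading and trailing spaces. To also collapse runs of spaces inside the value, a regular expression can be used. This sketch assumes MySQL 8.0 or later, where REGEXP_REPLACE is available.

```sql
-- Trim the ends, then replace every run of spaces with a single space.
UPDATE table_name
SET column_name = REGEXP_REPLACE(TRIM(column_name), ' +', ' ');
```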

4.4 Standardize Date Format:

UPDATE table_name
SET date_column = STR_TO_DATE(date_column, '%d/%m/%Y')
WHERE date_column LIKE '%/%/%';
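
Before running the update, it can help to list values that look like dates but do not parse with the expected format. The sketch assumes date_column is stored as text; in a SELECT, STR_TO_DATE returns NULL (with a warning) when the string does not match, though the exact handling of invalid calendar dates depends on the server's sql_mode.

```sql
-- Values containing slashes that fail to parse as day/month/year.
SELECT date_column
FROM table_name
WHERE date_column LIKE '%/%/%'
  AND STR_TO_DATE(date_column, '%d/%m/%Y') IS NULL;
```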
Step 5: Handling Outliers

5.1 Detect Outliers Using Standard Deviation:

SELECT column_name FROM table_name
WHERE column_name > (SELECT AVG(column_name) + 2*STDDEV(column_name)
FROM table_name);
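
The query above only flags unusually high values. A two-sided variant, sketched below on the same placeholder names, computes the mean and standard deviation once in a derived table and flags values more than two standard deviations away in either direction.

```sql
-- stats holds a single row, so the CROSS JOIN attaches it to every row.
SELECT t.column_name
FROM table_name AS t
CROSS JOIN (
    SELECT AVG(column_name)    AS mean_value,
           STDDEV(column_name) AS sd_value
    FROM table_name
) AS stats
WHERE t.column_name > stats.mean_value + 2 * stats.sd_value
   OR t.column_name < stats.mean_value - 2 * stats.sd_value;
```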

5.2 Cap Outliers at a Maximum Value:

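-- Thresholds come from derived tables because MySQL rejects a subquery
-- on the table being updated (error 1093).
UPDATE table_name AS t
CROSS JOIN (SELECT AVG(column_name) + 2*STDDEV(column_name) AS upper_limit
            FROM table_name) AS lim
CROSS JOIN (SELECT MAX(column_name) AS cap_value
            FROM table_name
            WHERE column_name < (SELECT AVG(column_name) + 2*STDDEV(column_name)
                                 FROM table_name)) AS cap
SET t.column_name = cap.cap_value
WHERE t.column_name > lim.upper_limit;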
Step 6: Standardizing Categorical Data

6.1 Replace Incorrect Categorical Values:

UPDATE table_name SET column_name = 'CorrectValue'
WHERE column_name IN ('wrongValue1', 'wrongValue2');

6.2 Standardize Gender Data:

UPDATE table_name
SET gender = CASE
    WHEN gender IN ('M', 'Male') THEN 'Male'
    WHEN gender IN ('F', 'Female') THEN 'Female'
    ELSE 'Unknown'
END;
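
Because the ELSE branch rewrites every unmatched value to 'Unknown', it is worth listing the values that actually occur before running the update. A minimal sketch:

```sql
-- Show every distinct gender value and how often it appears,
-- so unexpected spellings can be added to the CASE branches first.
SELECT gender, COUNT(*) AS occurrences
FROM table_name
GROUP BY gender
ORDER BY occurrences DESC;
```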
Step 7: Ensuring Referential Integrity

7.1 Find Orphan Records (Foreign Key Mismatches):

SELECT * FROM orders WHERE customer_id
NOT IN (SELECT customer_id FROM customers);

7.2 Delete Orphan Records:

DELETE FROM orders WHERE customer_id
NOT IN (SELECT customer_id FROM customers);
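
Once the orphan rows are gone, a foreign key constraint can stop new ones from being inserted. This sketch assumes both tables use InnoDB and that orders does not already have such a constraint; the constraint name fk_orders_customer is a hypothetical choice. The statement fails if any orphan rows remain.

```sql
-- Enforce that every orders.customer_id exists in customers.
-- fk_orders_customer is a placeholder constraint name (assumption).
ALTER TABLE orders
ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(customer_id);
```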
Step 8: Validating and Cleaning Relationships Between Tables

8.1 Identify Invalid Foreign Key References:

SELECT o.* FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

8.2 Fix Incorrect Foreign Key Values:

UPDATE orders o
JOIN customers c ON o.customer_name = c.customer_name
SET o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;
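
Matching on customer_name is only reliable when each name maps to exactly one customer. A check along these lines can be run first; if it returns rows, the name-based update would pick one of several matching customers and the affected orders need manual review.

```sql
-- Customer names that appear more than once are ambiguous join keys.
SELECT customer_name, COUNT(*) AS matching_customers
FROM customers
GROUP BY customer_name
HAVING COUNT(*) > 1;
```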
Step 9: Data Normalization

9.1 Convert Denormalized Data to Normalized Format:

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
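
The two tables then need to be filled from the denormalized source. The sketch below assumes a hypothetical flat table raw_orders(order_id, customer_name, order_date), which is not part of the original example, an empty customers table, and MySQL 8.0 or later for ROW_NUMBER.

```sql
-- raw_orders is a hypothetical denormalized source table (assumption).
-- 1) One customers row per distinct name, with generated ids.
INSERT INTO customers (customer_id, customer_name)
SELECT ROW_NUMBER() OVER (ORDER BY customer_name), customer_name
FROM (
    SELECT DISTINCT customer_name
    FROM raw_orders
    WHERE customer_name IS NOT NULL
) AS names;

-- 2) Copy the orders, replacing the name with the matching customer_id.
INSERT INTO orders (order_id, customer_id, order_date)
SELECT r.order_id, c.customer_id, r.order_date
FROM raw_orders AS r
JOIN customers AS c ON c.customer_name = r.customer_name;
```

Rows in raw_orders without a customer name are skipped by the inner join and would need separate handling.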
Step 10: Automating Data Cleaning

10.1 Create a Stored Procedure for Cleaning Data:

DELIMITER $$
CREATE PROCEDURE clean_data()
BEGIN
    DELETE FROM table_name
    WHERE column_name IS NULL;
    UPDATE table_name SET column_name = TRIM(column_name);
END $$
DELIMITER ;
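
If the cleaning steps should succeed or fail as a unit, the procedure body can be wrapped in a transaction with an error handler. This is a sketch under the assumption that the table uses InnoDB; clean_data_safe is a hypothetical name chosen so it does not clash with the procedure above.

```sql
DELIMITER $$
CREATE PROCEDURE clean_data_safe()
BEGIN
    -- On any SQL error: undo the partial work, then re-raise the error.
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;
        RESIGNAL;
    END;

    START TRANSACTION;
    DELETE FROM table_name WHERE column_name IS NULL;
    UPDATE table_name SET column_name = TRIM(column_name);
    COMMIT;
END $$
DELIMITER ;
```

It is run the same way as the original: CALL clean_data_safe();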

10.2 Schedule Automated Cleaning Jobs:

CREATE EVENT clean_data_event ON SCHEDULE EVERY 1 DAY DO CALL clean_data();
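
Events only run while the MySQL event scheduler is enabled. The statements below are a sketch for checking and enabling it; changing the global setting requires sufficient privileges, and the default value differs between MySQL versions.

```sql
-- Check whether the scheduler is running (ON, OFF, or DISABLED).
SHOW VARIABLES LIKE 'event_scheduler';

-- Enable it for the running server if it is OFF.
SET GLOBAL event_scheduler = ON;

-- Confirm that the event is registered in the current database.
SHOW EVENTS;
```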
