How To Prepare Messy Data for Analysis using SQL

The document outlines a step-by-step process for cleaning a messy sales dataset using SQL, addressing issues such as missing values, duplicate records, inconsistent date formats, outliers, and invalid product IDs. It provides SQL queries for handling each issue, ensuring the dataset is accurate and ready for analysis. The final cleaned data is presented, demonstrating the effectiveness of the applied techniques.
Data Analytics 2024 Data Cleaning

How To Prepare Messy Data for Analysis using SQL?

How to clean data using SQL?



You work as a data analyst for a retail company that collects daily sales data from various stores.

Let’s walk through the process of cleaning a dataset using SQL with a real-world example.

Below is a sample of the SalesTransactions table:

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1001           2001        12/09/2024       P101       5             500
1002           2001        12/09/2024       P102       NULL          1500
1003           2002        11/09/2024       P103       3             450
1003           2002        11/09/2024       P103       3             450
1004           2003        10/09/2024       P104       10000         100000
1005           2004        09/09/2024       XYZ        2             300

Issues in the Data:

There are several issues in this table:

Missing values (NULL in QuantitySold)
Duplicate transactions (same TransactionID repeated)
Inconsistent date format (dd/mm/yyyy instead of yyyy-mm-dd)
Outliers in QuantitySold and TotalAmount (e.g., 10,000 units sold)
Invalid ProductID (e.g., XYZ is not a valid product)
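
Before fixing anything, it can help to confirm these issues with a couple of counting queries. The sketch below is not part of the original walkthrough: it assumes SQLite (through Python's built-in sqlite3 module) as a stand-in database, loads the six sample rows shown above, and counts the missing quantities and repeated TransactionIDs. The helper names load_sample and profile are made up for illustration.

```python
import sqlite3

# Sample rows copied from the SalesTransactions table above.
ROWS = [
    (1001, 2001, "12/09/2024", "P101", 5, 500),
    (1002, 2001, "12/09/2024", "P102", None, 1500),
    (1003, 2002, "11/09/2024", "P103", 3, 450),
    (1003, 2002, "11/09/2024", "P103", 3, 450),
    (1004, 2003, "10/09/2024", "P104", 10000, 100000),
    (1005, 2004, "09/09/2024", "XYZ", 2, 300),
]


def load_sample():
    """Build an in-memory copy of the sample table (SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SalesTransactions ("
        " TransactionID INTEGER, CustomerID INTEGER, TransactionDate TEXT,"
        " ProductID TEXT, QuantitySold INTEGER, TotalAmount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO SalesTransactions VALUES (?, ?, ?, ?, ?, ?)", ROWS
    )
    return conn


def profile(conn):
    """Count NULL quantities and list TransactionIDs that occur more than once."""
    missing = conn.execute(
        "SELECT COUNT(*) FROM SalesTransactions WHERE QuantitySold IS NULL"
    ).fetchone()[0]
    duplicates = conn.execute(
        "SELECT TransactionID, COUNT(*) FROM SalesTransactions"
        " GROUP BY TransactionID HAVING COUNT(*) > 1"
    ).fetchall()
    return missing, duplicates


missing, duplicates = profile(load_sample())
print(missing, duplicates)
```

Running checks like these first gives a baseline, so each cleaning step can be verified by re-running the same counts afterwards.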

Let’s clean this data using SQL.

Handling Missing Values

Problem:

The QuantitySold column has a missing value (NULL) for ProductID = P102. We can estimate this missing value based on the TotalAmount and the product's price.

First, identify rows where QuantitySold is missing (NULL).

SELECT *
FROM SalesTransactions
WHERE QuantitySold IS NULL;

If a price for the product is available in a Products table, calculate the missing QuantitySold by dividing TotalAmount by the price.

UPDATE SalesTransactions
SET QuantitySold = TotalAmount / (
    -- Look up the unit price of the product on this row
    SELECT PricePerUnit
    FROM Products
    WHERE Products.ProductID = SalesTransactions.ProductID
)
WHERE QuantitySold IS NULL;
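
The update above assumes every product has a usable price. A slightly more defensive variant, sketched here with SQLite through Python's sqlite3 module (an assumption, not the database used in the walkthrough), only fills rows for which a positive price exists and rounds the result to a whole unit. The prices in the Products table are assumptions: 100 and 150 are the values implied by the sample rows (500 / 5 and 1500 / 10).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID TEXT PRIMARY KEY, PricePerUnit REAL);
CREATE TABLE SalesTransactions (
    TransactionID INTEGER, ProductID TEXT,
    QuantitySold INTEGER, TotalAmount REAL);
-- Assumed prices, implied by the sample data rather than stated in it.
INSERT INTO Products VALUES ('P101', 100), ('P102', 150);
INSERT INTO SalesTransactions VALUES
    (1001, 'P101', 5, 500),
    (1002, 'P102', NULL, 1500);
""")

# Fill QuantitySold only where a positive price is available, so rows
# without a price stay NULL instead of causing a division problem.
conn.execute("""
UPDATE SalesTransactions
SET QuantitySold = CAST(ROUND(TotalAmount / (
        SELECT PricePerUnit FROM Products
        WHERE Products.ProductID = SalesTransactions.ProductID)) AS INTEGER)
WHERE QuantitySold IS NULL
  AND EXISTS (SELECT 1 FROM Products
              WHERE Products.ProductID = SalesTransactions.ProductID
                AND PricePerUnit > 0)
""")

filled = conn.execute(
    "SELECT QuantitySold FROM SalesTransactions WHERE TransactionID = 1002"
).fetchone()[0]
print(filled)
```

Rows that still have a NULL quantity after this step are the ones that cannot be estimated from the price.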


If you can’t calculate QuantitySold, you may choose to remove the row:

DELETE FROM SalesTransactions
WHERE QuantitySold IS NULL;

After Handling Missing Values

We used SQL to calculate the missing QuantitySold by dividing the TotalAmount by the product's price from another table (Products).

The missing value in QuantitySold is now replaced by the calculated value, 10 (a TotalAmount of 1500 divided by the unit price of 150 that this result implies for P102).

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1001           2001        12/09/2024       P101       5             500
1002           2001        12/09/2024       P102       10            1500

Removing Duplicate Records

Problem:

There’s a duplicate record for TransactionID = 1003, where the exact same transaction is repeated twice.


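Duplicate records can lead to inaccurate analysis. We can remove duplicates by using the ROW_NUMBER() function, which numbers the rows inside each group of identical transactions so that every copy after the first can be deleted.

WITH DuplicateCheck AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY TransactionID, CustomerID, TransactionDate
        ORDER BY TransactionID
    ) AS RowNum
    FROM SalesTransactions
)
-- Delete through the CTE so that only the extra copies (RowNum > 1) go.
-- Filtering on TransactionID instead would delete every copy, including
-- the one we want to keep. This form works in SQL Server; databases that
-- cannot delete through a CTE need a unique row identifier instead.
DELETE FROM DuplicateCheck
WHERE RowNum > 1;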
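
Not every database can number and delete duplicate rows in one statement. As an alternative sketch, the example below swaps in a different technique: it assumes SQLite (through Python's sqlite3 module) and uses SQLite's implicit rowid with MIN() and GROUP BY, rather than ROW_NUMBER(), to keep exactly one copy of each repeated transaction. The table is trimmed to three columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesTransactions ("
    " TransactionID INTEGER, CustomerID INTEGER, TransactionDate TEXT)"
)
conn.executemany(
    "INSERT INTO SalesTransactions VALUES (?, ?, ?)",
    [
        (1002, 2001, "12/09/2024"),
        (1003, 2002, "11/09/2024"),
        (1003, 2002, "11/09/2024"),  # exact duplicate of the row above
    ],
)

# Keep the lowest rowid in each group of identical rows and delete the rest.
# rowid is SQLite-specific; other databases would need their own unique key.
conn.execute("""
DELETE FROM SalesTransactions
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM SalesTransactions
    GROUP BY TransactionID, CustomerID, TransactionDate
)
""")

remaining = conn.execute(
    "SELECT TransactionID FROM SalesTransactions ORDER BY TransactionID"
).fetchall()
print(remaining)
```

The important property in either approach is that the delete targets individual rows, not the shared TransactionID, so one copy of transaction 1003 survives.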

After Removing Duplicate Records

The duplicate row has been deleted, leaving only one record for TransactionID = 1003.

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1001           2001        12/09/2024       P101       5             500
1002           2001        12/09/2024       P102       10            1500
1003           2002        11/09/2024       P103       3             450



Standardizing Date Formats

Problem:

TransactionDate is stored in dd/mm/yyyy format (e.g., 12/09/2024) rather than the standard yyyy-mm-dd, which makes sorting and comparing dates unreliable.

Now, let’s convert TransactionDate from the dd/mm/yyyy format to yyyy-mm-dd for consistency. STR_TO_DATE is a MySQL function; other databases provide their own conversion functions.

UPDATE SalesTransactions
SET TransactionDate = STR_TO_DATE(TransactionDate, '%d/%m/%Y')
WHERE TransactionDate LIKE '%/%';
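
Where the database has no STR_TO_DATE, the same conversion can be done with a small helper. The sketch below assumes SQLite through Python's sqlite3 module and registers a hypothetical to_iso function that parses dd/mm/yyyy text and returns None for anything that is not a real date, so invalid values are left untouched instead of being overwritten. Transaction 9999 with the date 31/02/2024 is a made-up row added only to show that behaviour.

```python
import sqlite3
from datetime import datetime


def to_iso(date_text):
    """Convert 'dd/mm/yyyy' text to 'yyyy-mm-dd'; return None if it does not parse."""
    try:
        return datetime.strptime(date_text, "%d/%m/%Y").date().isoformat()
    except (ValueError, TypeError):
        return None


conn = sqlite3.connect(":memory:")
conn.create_function("to_iso", 1, to_iso)  # expose the helper to SQL
conn.execute(
    "CREATE TABLE SalesTransactions (TransactionID INTEGER, TransactionDate TEXT)"
)
conn.executemany(
    "INSERT INTO SalesTransactions VALUES (?, ?)",
    [
        (1001, "12/09/2024"),
        (1003, "11/09/2024"),
        (9999, "31/02/2024"),  # hypothetical invalid date
    ],
)

# Convert only values that contain '/' and actually parse as a date.
conn.execute("""
UPDATE SalesTransactions
SET TransactionDate = to_iso(TransactionDate)
WHERE TransactionDate LIKE '%/%'
  AND to_iso(TransactionDate) IS NOT NULL
""")

dates = dict(
    conn.execute("SELECT TransactionID, TransactionDate FROM SalesTransactions")
)
print(dates)
```

Any value still containing a slash after the update is a candidate for manual review.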

After Standardizing Date Formats

All transaction dates are now in the same yyyy-mm-dd format (e.g., 2024-09-12), making it easier for queries, comparisons, and reporting.

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1001           2001        2024-09-12       P101       5             500
1002           2001        2024-09-12       P102       10            1500
1003           2002        2024-09-11       P103       3             450

Identify & Handle Outliers

Problem:

There’s an abnormally high QuantitySold of 10,000 units for TransactionID = 1004, which is likely an error or an extreme outlier.

We can flag or delete such extreme values.

Identify outliers in QuantitySold

SELECT *
FROM SalesTransactions
WHERE QuantitySold > 1000;

Remove the outlier

DELETE FROM SalesTransactions
WHERE QuantitySold > 1000;
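
Deleting is one option; flagging keeps the row available for later review. The sketch below shows the flagging route under a few assumptions: SQLite through Python's sqlite3 module, an added IsOutlier column that is not part of the original table, and the same simple threshold of 1,000 units used above.

```python
import sqlite3

THRESHOLD = 1000  # same simple rule as above: more than 1,000 units is suspicious

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesTransactions ("
    " TransactionID INTEGER, QuantitySold INTEGER,"
    " IsOutlier INTEGER DEFAULT 0)"  # hypothetical flag column
)
conn.executemany(
    "INSERT INTO SalesTransactions (TransactionID, QuantitySold) VALUES (?, ?)",
    [(1001, 5), (1002, 10), (1003, 3), (1004, 10000)],
)

# Mark suspicious rows instead of deleting them.
conn.execute(
    "UPDATE SalesTransactions SET IsOutlier = 1 WHERE QuantitySold > ?",
    (THRESHOLD,),
)

flagged = [
    row[0]
    for row in conn.execute(
        "SELECT TransactionID FROM SalesTransactions WHERE IsOutlier = 1"
    )
]
clean = [
    row[0]
    for row in conn.execute(
        "SELECT TransactionID FROM SalesTransactions"
        " WHERE IsOutlier = 0 ORDER BY TransactionID"
    )
]
print(flagged, clean)
```

Analysis queries can then filter on IsOutlier = 0, while the flagged rows remain in the table in case the value turns out to be a genuine bulk order.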

After Identify & Handle Outliers

We flagged this as an outlier using a simple rule (e.g., quantities greater than 1,000), and then removed this transaction from the data.

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1004           2003        Deleted

Validating Product IDs

Problem:

The ProductID for TransactionID = 1005 is XYZ, which is invalid because there is no such product in the Products table.

Identify invalid ProductIDs

SELECT *
FROM SalesTransactions
WHERE ProductID NOT IN (SELECT ProductID FROM Products);

Remove rows with invalid ProductIDs

DELETE FROM SalesTransactions
WHERE ProductID NOT IN (SELECT ProductID FROM Products);
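
One caveat with NOT IN: if the subquery returns even a single NULL, the comparison evaluates to unknown for every row and the query finds nothing. The sketch below demonstrates this with SQLite through Python's sqlite3 module (an assumption), using a hypothetical Products table that contains a NULL ProductID, and shows that a NOT EXISTS version still finds the invalid row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID TEXT);
CREATE TABLE SalesTransactions (TransactionID INTEGER, ProductID TEXT);
-- The NULL row is hypothetical; it stands in for an incomplete Products entry.
INSERT INTO Products VALUES ('P101'), ('P103'), (NULL);
INSERT INTO SalesTransactions VALUES
    (1001, 'P101'), (1003, 'P103'), (1005, 'XYZ');
""")

# NOT IN: the NULL in Products makes the test unknown for 'XYZ'.
not_in = conn.execute("""
SELECT TransactionID FROM SalesTransactions
WHERE ProductID NOT IN (SELECT ProductID FROM Products)
""").fetchall()

# NOT EXISTS: checks for a matching row, so NULLs in Products do not interfere.
not_exists = conn.execute("""
SELECT TransactionID FROM SalesTransactions AS s
WHERE NOT EXISTS (
    SELECT 1 FROM Products AS p WHERE p.ProductID = s.ProductID
)
""").fetchall()

print(not_in, not_exists)
```

If ProductID in Products is declared NOT NULL (for example as a primary key), the NOT IN form used above behaves as expected.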

After Validating Product IDs

The row with the invalid ProductID = XYZ has been deleted, leaving only valid product entries.

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1005           2004        Deleted



Final Cleaned Data

After performing these steps, your cleaned SalesTransactions table would look something like this:

TransactionID  CustomerID  TransactionDate  ProductID  QuantitySold  TotalAmount
1001           2001        2024-09-12       P101       5             500
1002           2001        2024-09-12       P102       10            1500
1003           2002        2024-09-11       P103       3             450

Summary

Handled missing values by filling or removing rows with NULL values
Removed duplicates to ensure each transaction is unique
Standardized date formats for consistent querying
Handled outliers in QuantitySold and TotalAmount to ensure realistic data
Validated ProductIDs by removing invalid entries

By applying these SQL techniques, you ensure the dataset is clean, accurate, and ready for analysis or reporting.



Dr. Jayen Thakker MetricMinds.in
