Complete Guide to Data Cleaning in SQL
Email [email protected]
PART -1
Data Cleaning
INITIAL STEPS
The first step is to create a new table to use for cleaning,
so that the original data stays untouched.
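Here is a minimal sketch of this step in MySQL syntax. It assumes the raw data lives in a table named employees (a hypothetical name); temp_employees is the working table used in the rest of this part.

```sql
-- Assumption: the raw table is called employees (hypothetical name).
-- Create an empty copy with the same column definitions and indexes.
CREATE TABLE temp_employees LIKE employees;

-- Copy every row into the working table; the original stays untouched.
INSERT INTO temp_employees
SELECT * FROM employees;
```

All cleaning statements can then run against temp_employees.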
UNDERSTANDING SCHEMA
Now that the data is in our temporary table,
we can start cleaning it.
DESCRIBE temp_employees;
We have two ways:
SELECT STR_TO_DATE(hire_date, '%Y-%m-%d') AS hire_date
FROM temp_employees;
VARCHAR to DATE
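One possible way to carry out the conversion is sketched below in MySQL syntax. It assumes hire_date in temp_employees is a VARCHAR column holding text such as '2020-01-15'. This is an illustration, not necessarily the exact statements from the original slides.

```sql
-- Preview: list values that do not parse with the expected format.
-- STR_TO_DATE returns NULL in a SELECT when the text cannot be parsed.
SELECT hire_date
FROM temp_employees
WHERE hire_date IS NOT NULL
  AND STR_TO_DATE(hire_date, '%Y-%m-%d') IS NULL;

-- Once the preview returns no rows, change the column type in place.
ALTER TABLE temp_employees
MODIFY COLUMN hire_date DATE;
```

Running the preview first matters because, in strict SQL mode, the ALTER TABLE fails if any value cannot be converted to a date.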
Follow me on LinkedIn: www.linkedin.com/in/dhruvik-detroja/
PART -2
SELECT
    COUNT(*) AS TotalRows
FROM Customers;
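The row count becomes more useful next to per-column counts of missing values. Here is a minimal sketch, assuming the Customers table has the FullName, Age, Email, PhoneNumber and SignupDate columns used in this part (the Missing... aliases are hypothetical names):

```sql
-- COUNT(column) skips NULLs, so COUNT(*) - COUNT(column)
-- gives the number of rows where that column is missing.
SELECT
    COUNT(*)                      AS TotalRows,
    COUNT(*) - COUNT(FullName)    AS MissingFullName,
    COUNT(*) - COUNT(Age)         AS MissingAge,
    COUNT(*) - COUNT(Email)       AS MissingEmail,
    COUNT(*) - COUNT(PhoneNumber) AS MissingPhoneNumber,
    COUNT(*) - COUNT(SignupDate)  AS MissingSignupDate
FROM Customers;
```

A column with a high missing count is a candidate for a default value or for removal. A low count may point to a handful of rows worth fixing by hand.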
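-- List the rows that have at least one missing value
-- (SELECT * is an assumption; the conditions are from the guide).
SELECT *
FROM Customers
WHERE
    FullName IS NULL OR
    Age IS NULL OR
    Email IS NULL OR
    PhoneNumber IS NULL OR
    SignupDate IS NULL;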
UPDATE customers
UPDATE Customers
UPDATE customers
WHERE company='Airbnb';
PART -3
SELECT
FROM
Customers
GROUP BY
HAVING
COUNT(*) > 1;
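Here is a minimal sketch of the GROUP BY / HAVING pattern for finding duplicates. It assumes that two rows count as duplicates when they share the same Email; that choice of column is an assumption, and the Occurrences alias is a hypothetical name.

```sql
-- Group rows by the column(s) that should be unique
-- and keep only the groups that occur more than once.
SELECT
    Email,
    COUNT(*) AS Occurrences
FROM
    Customers
GROUP BY
    Email
HAVING
    COUNT(*) > 1;
```

To treat rows as duplicates only when several columns match, list all of those columns in both the SELECT and the GROUP BY.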
SELECT
FROM
Customers
WHERE
SELECT
FROM
Customers
GROUP BY
HAVING
COUNT(*) > 1
ORDER BY
SELECT
    CustomerID,
    FullName,
    Age,
    Email,
    PhoneNumber,
    SignupDate
FROM
    Customers
WHERE CustomerID IN (
    SELECT CustomerID
    FROM CTE
);
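The query above relies on a common table expression named CTE. One way such a CTE could be written is sketched below for MySQL 8.0 or later. It assumes duplicates are rows sharing the same Email and that the row with the lowest CustomerID is the one to keep; RowNum is a hypothetical alias.

```sql
-- Number the rows inside each group of identical emails.
-- The first row of each group gets RowNum = 1; later rows are the extras.
WITH CTE AS (
    SELECT
        CustomerID,
        ROW_NUMBER() OVER (
            PARTITION BY Email
            ORDER BY CustomerID
        ) AS RowNum
    FROM Customers
)
SELECT
    c.CustomerID,
    c.FullName,
    c.Age,
    c.Email,
    c.PhoneNumber,
    c.SignupDate
FROM
    Customers AS c
WHERE c.CustomerID IN (
    SELECT CustomerID
    FROM CTE
    WHERE RowNum > 1
);
```

The rows returned are the extra copies. Reviewing them before deleting anything keeps the step reversible.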
PART -4
STANDARDIZING
Standardizing means ensuring that all
data values follow one consistent
format.
STANDARDIZING DATES
We have this Orders table. Let's
standardize it into a proper date format.
-- Format: 'YYYY/MM/DD'
UPDATE Orders
'%Y/%m/%d'), '%Y-%m-%d')
-- Format: 'DD-MM-YYYY'
UPDATE Orders
'%d-%m-%Y'), '%Y-%m-%d')
UPDATE Orders
-- Format: 'YYYY.MM.DD'
UPDATE Orders
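Here is a complete sketch of the idea for the three formats named in the comments above. It assumes the Orders table stores dates as text in a VARCHAR column called OrderDate (a hypothetical name) and that the target format is 'YYYY-MM-DD'.

```sql
-- In a LIKE pattern each underscore matches exactly one character,
-- so every UPDATE only touches rows written in its own format.

-- Format: 'YYYY/MM/DD'
UPDATE Orders
SET OrderDate = DATE_FORMAT(STR_TO_DATE(OrderDate, '%Y/%m/%d'), '%Y-%m-%d')
WHERE OrderDate LIKE '____/__/__';

-- Format: 'DD-MM-YYYY'
UPDATE Orders
SET OrderDate = DATE_FORMAT(STR_TO_DATE(OrderDate, '%d-%m-%Y'), '%Y-%m-%d')
WHERE OrderDate LIKE '__-__-____';

-- Format: 'YYYY.MM.DD'
UPDATE Orders
SET OrderDate = DATE_FORMAT(STR_TO_DATE(OrderDate, '%Y.%m.%d'), '%Y-%m-%d')
WHERE OrderDate LIKE '____.__.__';
```

STR_TO_DATE parses the text using the format it is currently in, and DATE_FORMAT writes it back in the target format. The WHERE clauses keep rows that are already standardized from being parsed with the wrong pattern.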
Here is the result.
-- Trim the name, then capitalize its first letter and lowercase the rest.
UPDATE Employees
SET FullName = CONCAT(
    UPPER(LEFT(TRIM(FullName), 1)),
    LOWER(SUBSTR(TRIM(FullName), 2))
);
UPDATE Employees
SUBSTR(PhoneNumber, 11, 4)
UPDATE Contacts
SUBSTR(PhoneNumber, 9, 4))
STANDARDIZING ADDRESS
Now we have an Addresses table like this.
We have a Contacts table like this.
-- Convert state abbreviations to uppercase
UPDATE Addresses
UPDATE Addresses
-- Trim the city, then capitalize its first letter and lowercase the rest.
UPDATE Addresses
SET City = CONCAT(
    UPPER(LEFT(TRIM(City), 1)),
    LOWER(SUBSTR(TRIM(City), 2))
);
UPDATE Addresses
This time we have a new function, REPLACE().
Its first argument is the column name, the
second is the value we want to replace, and
the third is the value we want to replace
it with.
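Here is a small illustration of the three arguments. It assumes the Addresses table has a Street column (a hypothetical name) in which the abbreviation 'St.' should be spelled out.

```sql
-- Try REPLACE() on a literal value first.
SELECT REPLACE('123 Main St.', 'St.', 'Street') AS Cleaned;
-- returns '123 Main Street'

-- Then apply it to the column (Street is an assumed column name).
UPDATE Addresses
SET Street = REPLACE(Street, 'St.', 'Street');
```

REPLACE() is case-sensitive and replaces every occurrence of the search text, so it is worth checking the affected rows with a SELECT before running the UPDATE.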
END OF STANDARDIZING
We have covered most of the important
topics in our data cleaning series, and
only a few topics are left.
Thank You!
Dhruvik Detroja
PART -5
OUTLIERS
Outliers are data points that
deviate significantly from the rest of the
dataset.
Ready?
Let’s Go
IDENTIFYING OUTLIERS
There are different methods for identifying
outliers. Let's learn them.
1. Z Score Method:
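As a reminder of what the Z score measures, its standard definition is written out below. Here x is a single value (for example one SaleAmount), and the symbols μ and σ stand for the mean and standard deviation of that column. The threshold of 3 matches the ABS(ZScore) > 3 filter used in this part; it is a common convention rather than a fixed rule.

```latex
z = \frac{x - \mu}{\sigma},
\qquad
\text{flag } x \text{ as an outlier when } |z| > 3
```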
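-- ZScore uses the standard definition (value - mean) / standard deviation;
-- the exact expression is an assumption.
SELECT
    SaleID,
    CustomerID,
    SaleAmount,
    (SaleAmount - (SELECT AVG(SaleAmount) FROM Sales))
        / (SELECT STDDEV(SaleAmount) FROM Sales) AS ZScore
FROM
    Sales
HAVING
    ABS(ZScore) > 3;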
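2. IQR Method:
IQR = Q3 - Q1
By the usual convention, a value is an outlier if it is
below Q1 - 1.5 * IQR
OR
above Q3 + 1.5 * IQR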
We have the Products table like this.
-- Calculate Q1
WITH OrderedPrices AS (
SELECT
Price,
FROM Products
),
Quartiles AS (
SELECT
Price AS Q1
FROM OrderedPrices
-- Calculate Q3
WITH OrderedPrices AS (
SELECT
Price,
FROM Products
),
Quartiles AS (
SELECT
Price AS Q3
FROM OrderedPrices
-- Identify outliers
SELECT
ProductID,
ProductName,
Price
FROM
Products
WHERE
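The three steps (Q1, Q3, then the outlier filter) can be combined into one statement. The sketch below targets MySQL 8.0 or later and makes several assumptions: Q1 and Q3 are taken as the prices at the 25% and 75% positions of the sorted list (one simple quartile convention among several), the cut-offs use the conventional 1.5 * IQR rule, and RowNum and TotalRows are hypothetical aliases.

```sql
WITH OrderedPrices AS (
    -- Number the prices from lowest to highest and count the rows.
    SELECT
        Price,
        ROW_NUMBER() OVER (ORDER BY Price) AS RowNum,
        COUNT(*) OVER () AS TotalRows
    FROM Products
),
Quartiles AS (
    -- Pick the prices sitting at the 25% and 75% positions.
    SELECT
        MAX(CASE WHEN RowNum = CEIL(TotalRows * 0.25) THEN Price END) AS Q1,
        MAX(CASE WHEN RowNum = CEIL(TotalRows * 0.75) THEN Price END) AS Q3
    FROM OrderedPrices
)
-- Identify outliers: prices outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
SELECT
    p.ProductID,
    p.ProductName,
    p.Price
FROM
    Products AS p
CROSS JOIN
    Quartiles AS q
WHERE
    p.Price < q.Q1 - 1.5 * (q.Q3 - q.Q1)
    OR p.Price > q.Q3 + 1.5 * (q.Q3 - q.Q1);
```

Because Quartiles returns a single row, the CROSS JOIN simply attaches Q1 and Q3 to every product so the WHERE clause can compare against them.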
Here is the result and we have successfully
found the outlier.
ANALYZING OUTLIERS
This was just a sample dataset.
Take the example of rooms booked in a hotel.
The dataset looks like this:
Rooms Booked
1
2
2
12
Here 12 is an outlier, but it is still
possible in the real world to book 12 rooms
at a time. So there might not be any
data entry error here.
REMOVING OUTLIERS
If you want to remove outliers, simply
delete those rows after identifying them.
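Here is a minimal sketch of such a delete, reusing the Sales table and the Z score threshold of 3 from the identification step. The user variables @mean_amount and @sd_amount are hypothetical names. Storing the statistics first avoids reading from Sales inside the DELETE itself, which MySQL restricts.

```sql
-- Step 1: store the mean and standard deviation in user variables.
SELECT AVG(SaleAmount), STDDEV_POP(SaleAmount)
INTO @mean_amount, @sd_amount
FROM Sales;

-- Step 2: preview the rows that would be removed.
SELECT SaleID, CustomerID, SaleAmount
FROM Sales
WHERE @sd_amount > 0
  AND ABS((SaleAmount - @mean_amount) / @sd_amount) > 3;

-- Step 3: delete them once the preview looks right.
DELETE FROM Sales
WHERE @sd_amount > 0
  AND ABS((SaleAmount - @mean_amount) / @sd_amount) > 3;
```

The @sd_amount > 0 check guards against dividing by zero when every SaleAmount is identical.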
PART -6
ENSURING CONSTRAINTS
During the data cleaning process, it is
important to add meaningful
constraints to our columns.
-- The table name Employees is an assumption.
CREATE TABLE Employees (
    EmployeeID INT,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100),
    PhoneNumber VARCHAR(20),
    HireDate DATETIME,
    DepartmentID INT
);
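Here is a sketch of how constraints could be added to these columns in MySQL. It assumes the table is named Employees (a hypothetical name). The constraint names and the specific rules are illustrations; which constraints are meaningful depends on your own data.

```sql
-- Every employee needs a unique, non-null identifier.
ALTER TABLE Employees
    MODIFY EmployeeID INT NOT NULL,
    ADD PRIMARY KEY (EmployeeID);

-- Names are required.
ALTER TABLE Employees
    MODIFY FirstName VARCHAR(50) NOT NULL,
    MODIFY LastName VARCHAR(50) NOT NULL;

-- No two employees may share an email address.
ALTER TABLE Employees
    ADD CONSTRAINT uq_employees_email UNIQUE (Email);

-- Very loose shape check: some text, an @, some text.
-- CHECK constraints are enforced by MySQL 8.0.16 and later.
ALTER TABLE Employees
    ADD CONSTRAINT chk_employees_email CHECK (Email LIKE '%_@_%');
```

Each statement fails if existing rows already violate the rule, which is why constraints are best added after the earlier cleaning steps.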