Data Cleansing Steps

This document outlines six key steps for cleaning data: 1) remove irrelevant data, 2) deduplicate data, 3) fix structural errors, 4) deal with missing data, 5) filter out outliers, and 6) validate the cleaned data. Effective data cleaning is essential for deriving useful insights; it means removing garbage, fixing errors, and ensuring data quality and consistency.

Uploaded by

Imane Loukili


Data Cleaning Steps

“Garbage in, garbage out”: if you start with bad data (garbage), you’ll only get “garbage” results.

Data cleaning is often a tedious process, but it’s essential for getting reliable results and useful insights from your data.

Step 1: Remove irrelevant data

Take a good look at your data and get an idea of what is relevant and what you may not need. Filter out data or observations that aren’t relevant to your downstream needs.
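
As an illustration only, the filtering described in this step could be sketched in plain Python as below. The records, the field names, and the rule for what counts as relevant (respondents in France, keeping only id and age) are assumptions invented for the example, not part of the original slides.

```python
# Hypothetical survey records; every field name here is made up for illustration.
records = [
    {"id": 1, "country": "FR", "age": 34, "favorite_color": "blue"},
    {"id": 2, "country": "US", "age": 51, "favorite_color": "red"},
    {"id": 3, "country": "FR", "age": 27, "favorite_color": "green"},
]

# Assumption: the downstream analysis only needs these two fields.
RELEVANT_FIELDS = ("id", "age")


def keep_relevant(rows, fields, predicate):
    """Keep the rows that satisfy predicate, reduced to the given fields."""
    return [{f: row[f] for f in fields} for row in rows if predicate(row)]


# Assumption: only respondents in France matter for this analysis.
relevant = keep_relevant(records, RELEVANT_FIELDS, lambda r: r["country"] == "FR")
print(relevant)
```

Both kinds of irrelevance are handled here: whole observations are dropped by the predicate, and unneeded fields are dropped by the field list.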
Step 2: Deduplicate your data

Duplicate records slow down analysis. Even more importantly, if you train a machine learning model on a dataset with duplicate records, the model will likely give more weight to the duplicates, which can bias the resulting model.
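
One possible way to drop duplicates is sketched below. It assumes that two records are duplicates when they agree on a chosen set of key fields, and that the first occurrence is the one to keep; both choices, like the sample customer records, are assumptions for illustration.

```python
def deduplicate(rows, key_fields):
    """Return rows with later duplicates dropped, keeping first occurrences.

    Two rows count as duplicates when they agree on every field in key_fields.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical customer records for illustration.
customers = [
    {"email": "ana@example.com", "plan": "basic"},
    {"email": "bo@example.com", "plan": "pro"},
    {"email": "ana@example.com", "plan": "basic"},  # exact repeat of the first row
]

unique_customers = deduplicate(customers, key_fields=("email", "plan"))
print(len(customers), "->", len(unique_customers))
```

Choosing the key fields is the real decision: too few fields merge records that are genuinely different, too many let near-duplicates through.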
Step 3: Fix structural errors

Structural errors include things like misspellings, inconsistent naming conventions, and incorrect word use. These can affect analysis because, while they may be obvious to humans, most machine learning applications won’t recognize the mistakes, so your analyses will be skewed.
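
A minimal sketch of fixing such errors in a categorical field follows. It assumes the known variants and misspellings can be listed by hand in a lookup table; the table and the sample values are hypothetical.

```python
# Hypothetical lookup from known variants to one canonical label.
CANONICAL = {
    "n/a": "not applicable",
    "not applicable": "not applicable",
    "ny": "new york",
    "new york": "new york",
    "new yrok": "new york",  # a known misspelling
}


def normalize_label(value):
    """Trim, lowercase, collapse inner whitespace, then map known variants."""
    cleaned = " ".join(value.strip().lower().split())
    # Values not in the table pass through in their cleaned form.
    return CANONICAL.get(cleaned, cleaned)


raw = ["New York", " NY ", "new  yrok", "N/A", "Boston"]
normalized = [normalize_label(v) for v in raw]
print(normalized)
```

After normalization the three spellings of the same city collapse into one category instead of being counted as three.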
Step 4: Deal with missing data

Scan your data to locate missing cells, blank spaces in text, etc. You’ll need to determine whether everything connected to the missing data (an entire column or row, a whole survey, etc.) should be discarded completely, whether individual cells should be filled in manually, or whether the data should be left as is.
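
The scan described in this step could look roughly like the sketch below. It assumes a cell is missing when it is None or a string containing only whitespace, and it shows just one of the possible decisions (discarding rows that lack a required field). The sample responses are invented.

```python
def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    if isinstance(value, str):
        return value.strip() == ""
    return value is None


def missing_counts(rows, fields):
    """Count the missing cells in each field."""
    return {f: sum(is_missing(r.get(f)) for r in rows) for f in fields}


def drop_incomplete(rows, required):
    """Discard rows that are missing any of the required fields."""
    return [r for r in rows if not any(is_missing(r.get(f)) for f in required)]


# Hypothetical survey responses for illustration.
responses = [
    {"id": 1, "age": 34, "comment": "ok"},
    {"id": 2, "age": None, "comment": "   "},
    {"id": 3, "age": 27, "comment": ""},
]

counts = missing_counts(responses, ("id", "age", "comment"))
complete_age = drop_incomplete(responses, required=("age",))
print(counts)
print([r["id"] for r in complete_age])
```

Counting first makes the trade-off visible: here the comment field is missing in two of three rows, so dropping the column may be more sensible than dropping those rows.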
Step 5: Filter out data outliers

Outliers are data points that fall far outside of the norm and may skew your analysis too far in a certain direction. You’ll have to consider what kind of analysis you’re running and what effect removing or keeping an outlier will have on your results.
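
The slides do not name a particular outlier test. One common choice, assumed here purely for illustration, is the interquartile-range rule, which flags points more than 1.5 times the IQR below the first quartile or above the third. The sketch separates flagged values from the rest rather than deleting them, so the keep-or-remove decision stays with the analyst.

```python
import statistics


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule with multiplier k."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical measurements with one value far from the rest.
measurements = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = split_outliers(measurements)
print("kept:", kept)
print("flagged:", outliers)
```

The multiplier k is a convention, not a law; a larger value flags fewer points, which may suit data that is naturally heavy-tailed.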
Step 6: Validate your data

Data validation is the final data cleaning step. It confirms that your data is high quality, consistent, and properly formatted for downstream processes. Validate that your data is regularly structured and sufficiently clean for your needs.
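
A validation pass can be sketched as a set of explicit checks run over every cleaned row. The three rules below (integer id, age between 0 and 120, date written as YYYY-MM-DD) and the sample rows are assumptions chosen for the example; real rules depend on what the downstream process expects.

```python
import re

# Assumed date format for this example: four digits, two digits, two digits.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_row(row):
    """Return a list of problems found in one row (empty if the row passes)."""
    problems = []
    if not isinstance(row.get("id"), int):
        problems.append("id must be an integer")
    age = row.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        problems.append("age must be an integer between 0 and 120")
    if not DATE_PATTERN.match(str(row.get("signup_date", ""))):
        problems.append("signup_date must look like YYYY-MM-DD")
    return problems


# Hypothetical cleaned rows for illustration.
cleaned = [
    {"id": 1, "age": 34, "signup_date": "2023-01-15"},
    {"id": 2, "age": 140, "signup_date": "15/01/2023"},
]

report = {row["id"]: validate_row(row) for row in cleaned}
for row_id, problems in report.items():
    print(row_id, "OK" if not problems else problems)
```

Returning a list of problems instead of stopping at the first failure gives a full picture of how clean the data is before it moves downstream.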
