Why Data Cleaning Is Critical
Clean data is essential for effective analysis. If a piece of data is entered
into a spreadsheet or database incorrectly, if it's repeated, if a field is left blank,
or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to
big consequences in the long run.
Dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve.
It can't be used in a meaningful way, which makes analysis very difficult, if not
impossible.
Clean data is complete, correct, and relevant to the problem you're trying to solve.
This allows you to understand and analyze information and identify important
patterns, connect related information, and draw useful conclusions. Then you can
apply what you learn to make effective decisions.
In some cases, you won't have to do a lot of work to clean data. For example, when
you use internal data that's been verified and cared for by your company's data
engineers and data warehouse team, it's more likely to be clean.
Data engineers transform data into a useful format for analysis and give it reliable
infrastructure. This means they develop, maintain, and test databases, data
processors, and related systems.
When you become a data analyst, you can learn a lot by working with the person who
maintains your databases and getting to know their systems. If data passes through the
hands of a data engineer or a data warehousing specialist first, you know you're off to
a good start on your project.
It's important to remember: no dataset is perfect. It's always a good idea to examine
and clean data before beginning analysis. Here's an example. Let's say you're working
on a project where you need to figure out how many people use your company's
software program. You have a spreadsheet that was created internally and verified by
a data engineer and a data warehousing specialist. Check out the column labeled
"Username." It might seem logical that you can just scroll down and count the rows
to figure out how many users you have.
But that won't work because one person sometimes has more than one username.
Maybe they registered from different email addresses, or maybe they have a work
and personal account. In situations like this, you would need to clean the data by
eliminating any rows that are duplicates.
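In a spreadsheet you might use a built-in remove-duplicates tool; the same cleaning step can be sketched in pandas. This is a minimal, hypothetical example — the data and the "Username" column name stand in for the verified spreadsheet from the scenario:

```python
import pandas as pd

# Hypothetical user table; "ana" and "bo" each appear twice,
# standing in for one person holding more than one row.
users = pd.DataFrame({"Username": ["ana", "ana", "bo", "cy", "bo"]})

# Naively counting rows overstates the number of users.
total_rows = len(users)  # 5 rows

# Eliminating duplicate rows gives the true user count.
unique_users = users.drop_duplicates(subset="Username")
user_count = len(unique_users)  # 3 users
```

Note that this only removes exact duplicates; if one person registered under two different usernames (say, a work and a personal email), spotting that requires additional matching logic beyond a simple deduplication.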
Once you've done that, there won't be any more duplicate entries. Then your
spreadsheet is ready to be put to work. So far we've discussed working with internal
data. But data cleaning becomes even more important when working with external
data, especially if it comes from multiple sources. Let's say the software company
from our example surveyed its customers to learn how satisfied they are with its
software product. But when you review the survey data, you find that you have
several nulls.
A null is an indication that a value does not exist in a data set. Note that it's not the
same as a zero. In the case of a survey, a null would mean the customers skipped that
question. A zero would mean they provided zero as their response. To do your
analysis, you would first need to clean this data. Step one would be to decide what to
do with those nulls. You could either filter them out and communicate that you now
have a smaller sample size, or you can keep them in and learn from the fact that the
customers did not provide responses. There are many reasons why this could have
happened. Maybe your survey questions weren't written as well as they could have been.
Maybe they were confusing or biased, something we learned about earlier. We've
touched on the basics of cleaning internal and external data, but there's lots more to
come.
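Both options for handling nulls can be sketched in pandas. The survey values below are hypothetical; the point is that a null (here, `None`) is treated differently from a zero:

```python
import pandas as pd

# Hypothetical satisfaction survey responses on a 0-10 scale.
# None marks a skipped question (a null); note the 0 is a real
# answer, not a missing one.
responses = pd.Series([8, None, 0, 9, None, 7])

# Option 1: filter out the nulls, then communicate the
# smaller sample size.
answered = responses.dropna()
sample_size = len(answered)

# Option 2: keep the nulls and learn from the non-response
# rate itself.
skip_rate = responses.isna().mean()
```

Here `sample_size` is 4, while `skip_rate` shows that 2 of the 6 customers skipped the question — a signal worth investigating on its own.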
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant
to the problem you are trying to solve. This reading summarizes common types of
dirty data, their possible causes, and their potential harm to businesses:

Duplicate data
Description: Any data record that appears more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Inflated counts and inaccurate insights

Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics

Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Inaccurate insights and poor decision-making
For further reading on the business impact of dirty data, enter the term “dirty data”
into your preferred search engine to bring up numerous articles on the topic,
including impacts cited for specific industries.
Key takeaways
Dirty data includes duplicate data, outdated data, incomplete data, incorrect or
inaccurate data, and inconsistent data. Each type of dirty data can have a significant
impact on analyses, leading to inaccurate insights, poor decision-making, and
revenue loss. There are a number of causes of dirty data, including manual data entry
errors, batch data imports, data migration, software obsolescence, improper data
collection, and human errors during data input. As a data professional, you can take
steps to mitigate the impact of dirty data by implementing effective data quality
processes.