Why Data Cleaning Is Critical
Clean data is essential for effective analysis. If a piece of data is entered
into a spreadsheet or database incorrectly, if it's repeated, if a field is left blank,
or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to
big consequences in the long run.
Dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve.
It can't be used in a meaningful way, which makes analysis very difficult, if not
impossible.
Clean data is complete, correct, and relevant to the problem you're trying to solve.
This allows you to understand and analyze information and identify important
patterns, connect related information, and draw useful conclusions. Then you can
apply what you learn to make effective decisions.
In some cases, you won't have to do a lot of work to clean data. For example, when
you use internal data that's been verified and cared for by your company's data
engineers and data warehouse team, it's more likely to be clean.
Data engineers transform data into a useful format for analysis and give it reliable
infrastructure. This means they develop, maintain, and test databases, data
processors, and related systems.
When you become a data analyst, you can learn a lot by working with the person who
maintains your databases and getting to know their systems. If data passes through the
hands of a data engineer or a data warehousing specialist first, you know you're off to
a good start on your project.
It's important to remember: no dataset is perfect. It's always a good idea to examine
and clean data before beginning analysis. Here's an example. Let's say you're working
on a project where you need to figure out how many people use your company's
software program. You have a spreadsheet that was created internally and verified by
a data engineer and a data warehousing specialist. Check out the column labeled
"Username." It might seem logical that you can just scroll down and count the rows
to figure out how many users you have.
But that won't work because one person sometimes has more than one username.
Maybe they registered from different email addresses, or maybe they have a work
and personal account. In situations like this, you would need to clean the data by
eliminating any rows that are duplicates.
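In a spreadsheet you might use a built-in remove-duplicates tool; the same cleaning step can be sketched in pandas. This is a minimal, hypothetical example — the data and the "Username" column name stand in for the verified spreadsheet from the scenario:

```python
import pandas as pd

# Hypothetical user table; "ana" and "bo" each appear twice,
# standing in for one person holding more than one row.
users = pd.DataFrame({"Username": ["ana", "ana", "bo", "cy", "bo"]})

# Naively counting rows overstates the number of users.
total_rows = len(users)  # 5 rows

# Eliminating duplicate rows gives the true user count.
unique_users = users.drop_duplicates(subset="Username")
user_count = len(unique_users)  # 3 users
```

Note that this only removes exact duplicates; if one person registered under two different usernames (say, a work and a personal email), spotting that requires additional matching logic beyond a simple deduplication.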
Once you've done that, there won't be any more duplicate entries. Then your
spreadsheet is ready to be put to work. So far we've discussed working with internal
data. But data cleaning becomes even more important when working with external
data, especially if it comes from multiple sources. Let's say the software company
from our example surveyed its customers to learn how satisfied they are with its
software product. But when you review the survey data, you find that you have
several nulls.
A null is an indication that a value does not exist in a data set. Note that it's not the
same as a zero. In the case of a survey, a null would mean the customers skipped that
question. A zero would mean they provided zero as their response. To do your
analysis, you would first need to clean this data. Step one would be to decide what to
do with those nulls. You could either filter them out and communicate that you now
have a smaller sample size, or you can keep them in and learn from the fact that the
customers did not provide responses. There are many reasons why this could have
happened. Maybe your survey questions weren't written as well as they could have been.
Maybe they were confusing or biased, something we learned about earlier. We've
touched on the basics of cleaning internal and external data, but there's lots more to
come.
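Both options for handling nulls can be sketched in pandas. The survey values below are hypothetical; the point is that a null (here, `None`) is treated differently from a zero:

```python
import pandas as pd

# Hypothetical satisfaction survey responses on a 0-10 scale.
# None marks a skipped question (a null); note the 0 is a real
# answer, not a missing one.
responses = pd.Series([8, None, 0, 9, None, 7])

# Option 1: filter out the nulls, then communicate the
# smaller sample size.
answered = responses.dropna()
sample_size = len(answered)

# Option 2: keep the nulls and learn from the non-response
# rate itself.
skip_rate = responses.isna().mean()
```

Here `sample_size` is 4, while `skip_rate` shows that 2 of the 6 customers skipped the question — a signal worth investigating on its own.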
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant
to the problem you are trying to solve. This reading summarizes common types of
dirty data, their possible causes, and their potential harm to businesses:

Duplicate data
Description: Any data record that appears more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Inflated counts and inaccurate insights

Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics

Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Inaccurate insights and poor decision-making
For further reading on the business impact of dirty data, enter the term “dirty data”
into your preferred search engine to bring up numerous articles on the topic,
including impacts cited for specific industries.
Key takeaways
Dirty data includes duplicate data, outdated data, incomplete data, incorrect or
inaccurate data, and inconsistent data. Each type of dirty data can have a significant
impact on analyses, leading to inaccurate insights, poor decision-making, and
revenue loss. There are a number of causes of dirty data, including manual data entry
errors, batch data imports, data migration, software obsolescence, improper data
collection, and human errors during data input. As a data professional, you can take
steps to mitigate the impact of dirty data by implementing effective data quality
processes.