Data Cleaning
• Data cleaning is the process of fixing or
removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data
within a dataset.
• Data cleansing improves data quality and
helps provide more accurate, consistent and
reliable information for decision-making in an
organization.
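The categories in the definition above (incorrectly formatted, incorrect, duplicate, and incomplete data) can be illustrated with a minimal sketch. The record layout, the field names `name` and `email`, and the rule that a valid email must contain "@" are assumptions made up for this example, not part of the original material.

```python
# Hypothetical required fields for this example only.
REQUIRED_FIELDS = ("name", "email")


def clean_records(records):
    """Return cleaned copies of the records; the input list is not modified."""
    cleaned = []
    seen = set()
    for rec in records:
        # Fix formatting: trim stray whitespace from every string value.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}

        # Remove incomplete rows: a required field is missing or empty.
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue

        # Remove incorrect rows: assumed rule that an email is a string with "@".
        email = row["email"]
        if not isinstance(email, str) or "@" not in email:
            continue
        row["email"] = email.lower()  # normalize case so duplicates match

        # Remove duplicates: same name and email after normalization.
        key = (str(row["name"]).lower(), row["email"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw = [
    {"name": " Alice ", "email": "ALICE@example.com"},   # badly formatted
    {"name": "Alice", "email": "alice@example.com"},     # duplicate once normalized
    {"name": "Bob", "email": ""},                        # incomplete
    {"name": "Carol", "email": "carol-at-example.com"},  # incorrect
    {"name": "Dave", "email": "dave@example.com"},
]

print(clean_records(raw))
```

Only the first Alice row and the Dave row survive. In practice, whether a bad row is fixed or removed depends on the dataset and how the data will be used.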
Why Data Cleaning is Necessary
• Data cleaning might seem uninteresting, but it is one of
the most important tasks you will do as a data science
professional. Wrong or poor-quality data can be
detrimental to your processes and analysis. Poor data
can cause even a stellar algorithm to fail.
• On the other hand, high-quality data can help a simple
algorithm give outstanding results. There are many
data cleaning techniques, and you should get familiar
with them to improve your data quality. Not all data is
useful, and irrelevant data is another major factor that
affects data quality. Poor-quality data can come from
many sources.
• Usually, poor-quality data is a result of human
error, but it can also arise when a lot of data is
combined from different sources. Multichannel
data is not only important, it is also the norm,
so as a data scientist you can expect errors in
this type of data. These errors can lead to
incorrect insights in your project and sidetrack
your data analysis process. This is why data
cleaning methods in data analysis are so important.
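As a rough illustration of how combining sources introduces errors, the sketch below merges two hypothetical channels that record the same purchase with different date formats and ID spellings. The field names (`customer_id`, `date`, `amount`), the two accepted date formats, and the day-first reading of `05/03/2024` are all assumptions for this example.

```python
from datetime import datetime

# Assumed formats: ISO dates from one channel, day/month/year from the other.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known format; return a date, or None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def combine_sources(*sources):
    """Merge records from several sources into one normalized, de-duplicated list."""
    combined = {}
    rejected = []
    for source in sources:
        for rec in source:
            day = parse_date(rec["date"])
            if day is None:
                rejected.append(rec)  # keep unparseable rows for manual review
                continue
            customer = rec["customer_id"].strip().upper()
            # Same customer and day counts as one record; later sources overwrite.
            combined[(customer, day)] = {
                "customer_id": customer,
                "date": day.isoformat(),
                "amount": float(rec["amount"]),
            }
    return list(combined.values()), rejected


web = [{"customer_id": "c001", "date": "2024-03-05", "amount": "19.99"}]
store = [
    {"customer_id": " C001 ", "date": "05/03/2024", "amount": "19.99"},
    {"customer_id": "C002", "date": "March 5", "amount": "5"},
]

rows, rejected = combine_sources(web, store)
print(rows)
print(rejected)
```

The two C001 rows collapse into one after normalization, while the row with the unrecognized date is set aside rather than silently guessed at.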
Reasons why data cleaning is essential
• Efficiency
• Having clean data (free from wrong and
inconsistent values) can help you perform your
analysis a lot faster. Cleaning your data before
using it saves a considerable amount of time,
because you avoid many errors later on. If you
use data containing false values, your results
won't be accurate. A data scientist often has to
spend significantly more time cleaning and
preparing data than analyzing it.
Error Margin