Data Cleansing

DADM Unit1 Level 2


Data Cleansing, also known as data cleaning or data scrubbing, is a
critical step in the data preparation process. It involves identifying and
rectifying errors, inconsistencies, missing values, and outliers in a dataset
to ensure that the data is accurate, reliable, and suitable for analysis or
modeling. Here's a detailed explanation of the data cleansing process:

1. Data Collection and Inspection:

The data cleansing process typically begins after data has been collected
from various sources, such as surveys, databases, sensors, or web
scraping. Before any cleaning takes place, the data is thoroughly
inspected (a short sketch follows this list). This inspection involves:

• Identifying missing values: Checking for cells or fields that are
empty or contain "null" or "NaN" values.
• Spotting errors and inconsistencies: Identifying data entries that do
not conform to the expected format or rules. These could be
typographical errors, conflicting information, or outliers.
• Handling duplicates: Identifying and removing duplicate records or
entries that may skew the analysis.
• Assessing data quality: Evaluating the overall quality of the dataset
and its adherence to data standards and guidelines.
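
To make the inspection step concrete, here is a minimal pandas sketch
(not part of the original notes); the file name survey_responses.csv is a
placeholder for whatever source the data came from:

```python
import pandas as pd

# Load the collected data; the file name is a hypothetical placeholder.
df = pd.read_csv("survey_responses.csv")

# Identify missing values: count empty/NaN cells per column.
print(df.isna().sum())

# Spot errors and inconsistencies: summary statistics expose
# out-of-range or unexpected values.
print(df.describe(include="all"))

# Handle duplicates: count fully duplicated rows before deciding
# how to deal with them.
print(df.duplicated().sum())

# Assess data quality: column types and non-null counts at a glance.
df.info()
```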

2. Handling Missing Values:

Missing data can significantly impact the quality of an analysis. Data
cleansing involves addressing missing values through various methods
(see the sketch after this list):

• Imputation: Replacing missing values with estimated or interpolated
values based on statistical techniques or domain knowledge.
• Removal: If missing values are too numerous or cannot be
accurately imputed, removing rows or columns with missing data
may be necessary.
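
As an illustration of both methods, the sketch below uses pandas; the
column names (age, subscribed, customer_id) are hypothetical examples,
not from the original notes:

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# Imputation: replace missing numeric values with a statistical
# estimate, here the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Imputation from domain knowledge: assume an absent flag means "no".
df["subscribed"] = df["subscribed"].fillna("no")

# Removal: drop rows whose key field is missing and cannot be imputed.
df = df.dropna(subset=["customer_id"])

# Removal: drop columns with too many gaps (here, more than half missing).
df = df.dropna(axis=1, thresh=len(df) // 2)
```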

3. Correcting Errors and Inconsistencies:

Errors and inconsistencies can arise from various sources, such as human
input errors, measurement inaccuracies, or system glitches. Data
cleansing involves the following (a short sketch follows this list):

• Standardizing data: Ensuring that data follows a consistent format,
such as date formats, units of measurement, and naming
conventions.
• Correcting errors: Rectifying typographical errors, invalid entries, or
inaccurate data points using validation rules or cross-referencing
with authoritative sources.
• Dealing with outliers: Identifying and handling outliers that are
genuine data points (e.g., anomalies) differently from errors (e.g.,
typos).
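
A minimal sketch of these corrections with pandas, assuming hypothetical
columns (signup_date, country, age) and an illustrative typo mapping:

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# Standardizing data: parse mixed date strings into one uniform format;
# unparseable entries become NaT so they can be reviewed.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Standardizing naming conventions: trim whitespace and unify casing.
df["country"] = df["country"].str.strip().str.title()

# Correcting errors via a validation rule: map known typos to their
# canonical spellings (the mapping here is made up for illustration).
df["country"] = df["country"].replace({"Untied States": "United States"})

# Dealing with outliers: flag implausible values for review instead of
# deleting them, so genuine anomalies are treated differently from typos.
df["age_suspect"] = ~df["age"].between(0, 120)
```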

4. Removing Duplicates:

Duplicate records can distort analysis results and lead to incorrect
conclusions. Data cleansing typically involves identifying and removing
duplicates based on specific criteria, such as matching certain fields or
attributes, as in the sketch below.
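
For example, a short pandas sketch (file and column names hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Remove duplicates defined by matching specific fields: treat rows
# with the same email as one record and keep the most recent entry.
df = (df.sort_values("signup_date")
        .drop_duplicates(subset=["email"], keep="last"))
```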

5. Data Transformation:

In some cases, data cleansing may also include data transformation,
where data is converted from one format or representation to another:
for example, converting categorical variables into numerical values or
normalizing data to a common scale, as sketched below.
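
Both transformations can be sketched in pandas as follows (the country
and income columns are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# Convert a categorical variable into numerical indicator columns
# (one-hot encoding).
df = pd.get_dummies(df, columns=["country"])

# Normalize a numeric column to a common 0-1 scale (min-max scaling).
df["income_scaled"] = (
    (df["income"] - df["income"].min())
    / (df["income"].max() - df["income"].min())
)
```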

6. Documentation and Auditing:

Throughout the data cleansing process, it's essential to maintain
documentation of all changes made to the dataset. This documentation
helps in understanding the data's lineage and in replicating the cleaning
process in the future. Additionally, the cleaned dataset should be audited
to ensure that it meets the desired quality standards.

7. Quality Assurance:

After cleansing the data, quality assurance checks are performed to verify
that the dataset now adheres to the defined data quality criteria. This
ensures that the data is ready for analysis, modeling, or other data-driven
tasks.
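
A minimal sketch of such checks, assuming pandas and the hypothetical
columns used above; each assertion encodes one defined quality criterion:

```python
import pandas as pd

df = pd.read_csv("survey_responses_clean.csv")  # hypothetical cleaned file

# Verify the defined quality criteria before releasing the data.
assert df["customer_id"].notna().all(), "missing IDs remain"
assert not df.duplicated(subset=["email"]).any(), "duplicate records remain"
assert df["age"].between(0, 120).all(), "out-of-range ages remain"
print("All quality checks passed; dataset is ready for analysis.")
```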

In summary, data cleansing is a crucial step in data preparation that
focuses on identifying and rectifying errors, inconsistencies, missing
values, and outliers in a dataset. It helps ensure data accuracy, reliability,
and consistency, enabling meaningful and trustworthy analysis or
modeling. Properly cleansed data forms the foundation for reliable and
valuable insights in various domains, including business analytics,
research, and decision-making.

Data cleansing is a crucial data preparation process that focuses on
ensuring the accuracy and reliability of data used for analysis. Here's a
detailed explanation of how data cleansing achieves this objective:

1. Identification of Errors and Inconsistencies:
Data cleansing begins with the identification of errors,
inconsistencies, missing values, and outliers in the dataset. These
issues can arise from various sources, including human input errors,
data entry mistakes, system glitches, or incomplete data collection.
2. Handling Missing Data:
Missing data can significantly affect the reliability of analysis results.
Data cleansing addresses this by handling missing values using
techniques such as imputation or removal. Imputation involves
replacing missing values with estimated or interpolated values
based on statistical methods or domain knowledge. Removing
records or attributes with excessive missing data may also be
necessary.
3. Correction of Errors:
Errors in the data can take various forms, including typographical
errors, incorrect data types, or values that do not conform to
expected formats. Data cleansing corrects these errors by:
• Standardizing data: Ensuring that data follows a consistent
format and adheres to predefined rules or standards. For
example, dates may be reformatted to a uniform format, and
units of measurement may be standardized.
• Error correction: Identifying and rectifying typographical
errors, invalid entries, or inaccurate data points. This can
involve validating data against predefined rules or cross-
referencing with authoritative sources to verify accuracy.
• Handling outliers: Distinguishing between genuine data points
(e.g., anomalies) and errors (e.g., typos) and treating them
differently. Outliers may be addressed through techniques like
winsorization or transformation (a winsorization sketch follows
this list).
4. Duplicate Data Removal:
Duplicate records can distort analysis results and lead to incorrect
conclusions. Data cleansing identifies and removes duplicates based
on specific criteria, such as matching fields or attributes. Removing
duplicates ensures that each data point is unique and contributes
only once to the analysis.
5. Data Transformation:
Data cleansing may also include data transformation, where data is
converted from one format or representation to another. For
example, categorical variables may be converted into numerical
values, or data may be normalized to a common scale to facilitate
comparisons.
6. Documentation and Auditing:
Throughout the data cleansing process, detailed documentation is
maintained to track all changes made to the dataset. This
documentation serves as a record of the cleaning process and helps
in understanding the data's lineage. Additionally, the cleaned
dataset undergoes auditing to ensure that it meets the defined data
quality criteria.
7. Quality Assurance:
After data cleansing, quality assurance checks are performed to
verify that the dataset now adheres to the desired data quality
standards. This step ensures that the data is ready for analysis,
modeling, or other data-driven tasks.
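
As mentioned in item 3, outliers may be winsorized rather than deleted. A
minimal winsorization sketch with pandas, using a hypothetical income
column: values beyond the 5th and 95th percentiles are capped at those
bounds.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

# Winsorization: cap extreme values at chosen percentiles instead of
# removing them, so genuine-but-extreme observations still contribute.
lower = df["income"].quantile(0.05)
upper = df["income"].quantile(0.95)
df["income_winsorized"] = df["income"].clip(lower=lower, upper=upper)
```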

In summary, data cleansing is a critical step in data preparation that
plays a fundamental role in ensuring the accuracy and reliability of data
used for analysis. Errors, inconsistencies, and missing data can lead to
incorrect conclusions and decisions, making data cleansing essential for
trustworthy insights and decision-making. By addressing these issues,
data cleansing helps create a reliable foundation for meaningful analysis
across various domains, including business, research, and scientific
investigations.
