Chapter 4
• Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. These include misspellings and other typographical errors, wrong numerical entries,
syntax errors and missing values, such as blank or null fields that should contain data.
• Inconsistent data. Names, addresses and other attributes are often formatted differently
from system to system. For example, one data set might include a customer's middle initial,
while another doesn't. Data elements such as terms and identifiers may also vary. Data
cleansing helps ensure that data is consistent so it can be analyzed accurately.
• Duplicate data. Data cleansing identifies duplicate records in data sets and either removes
or merges them through the use of deduplication measures. For example, when data from
two systems is combined, duplicate data entries can be reconciled to create single records.
• Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be
relevant to analytics applications and could skew their results. Data cleansing
removes irrelevant and redundant data from data sets, which streamlines data preparation
and reduces the required amount of data processing and storage resources.
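The fixes above can be sketched with pandas. The records and values below are hypothetical, chosen only to show a typo correction, a missing-value fill, and deduplication in one pass:

```python
import pandas as pd

# Hypothetical customer records illustrating the issues above:
# a misspelling, a missing (null) field, and a duplicate entry.
df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Bob Jones", "Cara Diaz"],
    "city": ["New Yrok", "Boston", "Boston", None],
})

# Typos: fix known misspellings with a replacement map.
df["city"] = df["city"].replace({"New Yrok": "New York"})

# Missing values: fill blank/null fields with an explicit placeholder.
df["city"] = df["city"].fillna("Unknown")

# Duplicates: drop exact duplicate rows (deduplication).
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

In practice the replacement map would come from profiling the data, and the fill strategy (placeholder, mean, or dropping the row) depends on the downstream analysis.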
Steps in data cleaning
1. Inspection and profiling. First, data is inspected and audited to assess its quality
level and identify issues that need to be fixed. This step usually involves data profiling,
which documents relationships between data elements, checks data quality and
gathers statistics on data sets to help find errors, discrepancies and other problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected
and inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the
work should inspect the data again to verify its cleanliness and make sure it conforms
to internal data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and
business executives to highlight data quality trends and progress. The report could
include the number of issues found and corrected, plus updated metrics on the data's
quality levels.
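The first step, inspection and profiling, can be sketched as a few summary statistics. The data set below is hypothetical, seeded with one missing value, one duplicate ID, and one suspicious negative amount:

```python
import pandas as pd

# Hypothetical data set to profile before cleaning.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [19.99, None, 15.00, -3.50],
})

# Step 1: inspection and profiling -- gather statistics on the data
# set to find errors, discrepancies and other problems.
profile = {
    "rows": len(df),
    "missing_amounts": int(df["amount"].isna().sum()),
    "duplicate_ids": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
}
print(profile)
```

Counts like these feed both the cleaning step (what to fix) and the reporting step (issues found and corrected).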
Characteristics of clean data
• accuracy
• completeness
• consistency
• integrity
• timeliness
• uniformity
• validity
The benefits of effective data cleansing
• Data fuels ML. Harnessing this data to reinvent your business, while
challenging, is imperative to staying relevant now and in the future. It
is survival of the most informed: those who can put their data to
work make better, more informed decisions, respond faster to the
unexpected and uncover new opportunities. Data preparation, though
tedious, is a prerequisite for building accurate ML models and
analytics, and it is often the most time-consuming part of an ML
project. To minimize this time investment, data scientists can use
tools that help automate data preparation in various ways.
How do you prepare your data?
• Data preparation follows a series of steps that starts with collecting the right data, followed by
cleaning, labeling, and then validation and visualization.
• Collect data
• Collecting data is the process of assembling all the data you need for ML. Data collection can be
tedious because data resides in many data sources, including on laptops, in data warehouses, in the
cloud, inside applications, and on devices. Finding ways to connect to different data sources can be
challenging. Data volumes are also increasing exponentially, so there is a lot of data to search
through. Additionally, data has vastly different formats and types depending on the source. For
example, video data and tabular data are not easy to use together.
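One small sketch of the collection step: pulling the same kind of records from two differently formatted sources and combining them. Both sources here are hypothetical stand-ins (an in-memory CSV export and a list of application records):

```python
import io

import pandas as pd

# Hypothetical: the same customer data arrives from two sources in
# different formats -- a CSV export and records from an application.
csv_source = io.StringIO("customer_id,region\n101,East\n102,West\n")
app_records = [{"customer_id": 103, "region": "North"}]

df_csv = pd.read_csv(csv_source)
df_app = pd.DataFrame(app_records)

# Combine both sources into a single data set for ML.
combined = pd.concat([df_csv, df_app], ignore_index=True)
print(combined)
```

Real pipelines connect to warehouses, cloud storage and devices instead of in-memory strings, but the pattern of loading each source into a common structure and concatenating is the same.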
• Clean data
• Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you
have clean data, you will need to transform it into a consistent, readable format. This process can
include changing field formats like dates and currency, modifying naming conventions, and
correcting values and units of measure so they are consistent.
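The transformations just described (date formats, currency fields, units of measure) can be sketched as follows. The column names and conversion factor are illustrative, not from the source:

```python
import pandas as pd

# Hypothetical records with inconsistent date formats, a currency
# string field, and a unit of measure to standardize.
df = pd.DataFrame({
    "order_date": ["2023-01-15", "01/20/2023"],
    "price": ["$19.99", "$5.00"],
    "weight_lb": [2.0, 10.0],
})

# Normalize dates: parse each value into a single datetime format.
df["order_date"] = df["order_date"].apply(pd.to_datetime)

# Strip currency symbols and convert to numeric values.
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Convert units so all weights use kilograms (1 lb = 0.453592 kg).
df["weight_kg"] = df["weight_lb"] * 0.453592
df = df.drop(columns=["weight_lb"])

print(df)
```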
Label data
• Data labeling is the process of identifying raw data (images, text files, videos, and so on) and adding
one or more meaningful and informative labels to provide context so an ML model can learn from it.
For example, labels might indicate if a photo contains a bird or car, which words were mentioned in
an audio recording, or if an X-ray discovered an irregularity. Data labeling is required for various
use cases, including computer vision, natural language processing, and speech recognition.
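In code, the result of labeling is simply raw data paired with its labels. The file names and labels below are hypothetical; in practice the annotations would come from human annotators or a labeling tool:

```python
# Hypothetical: attaching labels to raw image files so a computer
# vision model can learn from them.
raw_files = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

# Labels from annotators, keyed by file name (hard-coded here).
annotations = {"img_001.jpg": "bird", "img_002.jpg": "car", "img_003.jpg": "bird"}

# Pair each raw item with its label to form a training set.
labeled = [(f, annotations[f]) for f in raw_files]
print(labeled)
```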
Validate and visualize
• After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and
ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and bar
charts are all useful tools to confirm data is correct. Visualizations also help data
science teams complete exploratory data analysis. This process uses visualizations to discover
patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does not
require formal modeling; instead, data science teams can use visualizations to decipher the data.
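The numeric checks behind such plots can be sketched directly. The readings below are hypothetical: histogram bin counts summarize the distribution, and the 1.5 × IQR rule used by box-and-whisker plots flags the anomaly:

```python
import numpy as np

# Hypothetical sensor readings with one anomalous value.
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])

# Histogram bins summarize the distribution (what a histogram plots).
counts, edges = np.histogram(readings, bins=4)

# Box-plot logic: flag points beyond 1.5 * IQR as anomalies.
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
outliers = readings[(readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)]

print(counts, outliers)
```

In an interactive setting the same data would be drawn with a plotting library such as matplotlib; the underlying statistics are identical.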
• Feature Engineering for Machine Learning
• Feature engineering is the preprocessing step of machine
learning that transforms raw data into features that can be
used to create a predictive model with machine learning or
statistical modelling. Feature engineering aims to improve
the performance of models.
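A brief sketch of deriving features from raw data. The transaction fields and derived features below are hypothetical examples of the kind of inputs feature engineering produces:

```python
import pandas as pd

# Hypothetical raw transaction data; feature engineering derives new
# predictive inputs from it.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-04 09:30", "2023-03-06 22:15"]),
    "amount": [120.0, 30.0],
    "n_items": [4, 2],
})

# Derived features: time of day, weekend flag, and price per item.
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["amount_per_item"] = df["amount"] / df["n_items"]

print(df[["hour", "is_weekend", "amount_per_item"]])
```

Which features help depends on the model and the problem; candidates like these are usually evaluated against a validation set.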
What is a feature?