0% found this document useful (0 votes)
14 views

What Is Data Preprocessing

Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views

What Is Data Preprocessing

Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

What is data preprocessing?

Data preprocessing, a component of data preparation, describes any type of


processing performed on raw data to prepare it for another data processing
procedure. It has traditionally been an important preliminary step for the data
mining process. More recently, data preprocessing techniques have been
adapted for training machine learning models and AI models and for running
inferences against them.
Data preprocessing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science
tasks. The techniques are generally used at the earliest stages of the machine
learning and AI development pipeline to ensure accurate results.
There are several different tools and methods used for preprocessing data,
including the following:
 sampling, which selects a representative subset from a large population of
data;
 transformation, which manipulates raw data to produce a single input;
 denoising, which removes noise from data;
 imputation, which synthesizes statistically relevant data for missing
values;
 normalization, which organizes data for more efficient access; and
 feature extraction, which pulls out a relevant feature subset that is
significant in a particular context.
These tools and methods can be used on a variety of data sources, including
data stored in files or databases and streaming data.
Why is data preprocessing important?
Virtually any type of data analysis, data science or AI development requires
some type of data preprocessing to provide reliable, precise and robust results
for enterprise applications.
Real-world data is messy and is often created, processed and stored by a variety
of humans, business processes and applications. As a result, a data set may be
missing individual fields, contain manual input errors, or have duplicate data or
different names to describe the same thing. Humans can often identify and
rectify these problems in the data they use in the line of business, but data used
to train machine learning or deep learning algorithms needs to be automatically
preprocessed.
achine learning and deep learning algorithms work best when data is presented
in a format that highlights the relevant aspects required to solve a problem.
Feature engineering practices that involve data wrangling, data transformation,
data reduction, feature selection and feature scaling help restructure raw data
into a form suited for particular types of algorithms. This can significantly reduce
the processing power and time required to train a new machine learning or AI
algorithm or run an inference against it.
One caution that should be observed in preprocessing data: the potential for
reencoding bias into the data set. Identifying and correcting bias is critical for
applications that help make decisions that affect people, such as loan approvals.
Although data scientists may deliberately ignore variables like gender, race or
religion, these traits may be correlated with other variables like zip codes or
schools attended, generating biased results.
Most modern data science packages and services now include various
preprocessing libraries that help to automate many of these tasks.
What are the key steps in data preprocessing?
The steps used in data preprocessing include the following:
1. Data profiling. Data profiling is the process of examining, analyzing and
reviewing data to collect statistics about its quality. It starts with a survey of
existing data and its characteristics. Data scientists identify data sets that are
pertinent to the problem at hand, inventory its significant attributes, and form a
hypothesis of features that might be relevant for the proposed analytics or
machine learning task. They also relate data sources to the relevant business
concepts and consider which preprocessing libraries could be used.
2. Data cleansing. The aim here is to find the easiest way to rectify quality
issues, such as eliminating bad data, filling in missing data or otherwise ensuring
the raw data is suitable for feature engineering.
3. Data reduction. Raw data sets often include redundant data that arise from
characterizing phenomena in different ways or data that is not relevant to a
particular ML, AI or analytics task. Data reduction uses techniques like principal
component analysis to transform the raw data into a simpler form suitable for
particular use cases.
4. Data transformation. Here, data scientists think about how different aspects
of the data need to be organized to make the most sense for the goal. This could
include things like structuring unstructured data, combining salient variables
when it makes sense or identifying important ranges to focus on.
5. Data enrichment. In this step, data scientists apply the various feature
engineering libraries to the data to effect the desired transformations. The result
should be a data set organized to achieve the optimal balance between the
training time for a new model and the required compute.
6. Data validation. At this stage, the data is split into two sets. The first set is
used to train a machine learning or deep learning model. The second set is the
testing data that is used to gauge the accuracy and robustness of the resulting
model. This second step helps identify any problems in the hypothesis used in
the cleaning and feature engineering of the data. If the data scientists are
satisfied with the results, they can push the preprocessing task to a data
engineer who figures out how to scale it for production. If not, the data scientists
can go back and make changes to the way they implemented the data cleansing
and feature engineering steps.
Data preprocessing techniques
There are two main categories of preprocessing -- data cleansing and feature
engineering. Each includes a variety of techniques, as detailed below.
Data cleansing
Techniques for cleaning up messy data include the following:
Identify and sort out missing data. There are a variety of reasons a data set
might be missing individual fields of data. Data scientists need to decide whether
it is better to discard records with missing fields, ignore them or fill them in with
a probable value. For example, in an IoT application that records temperature,
adding in a missing average temperature between the previous and subsequent
record might be a safe fix.
Reduce noisy data. Real-world data is often noisy, which can distort an analytic
or AI model. For example, a temperature sensor that consistently reported a
temperature of 75 degrees Fahrenheit might erroneously report a temperature as
250 degrees. A variety of statistical approaches can be used to reduce the noise,
including binning, regression and clustering.
Identify and remove duplicates. When two records seem to repeat, an
algorithm needs to determine if the same measurement was recorded twice, or
the records represent different events. In some cases, there may be slight
differences in a record because one field was recorded incorrectly. In other cases,
records that seem to be duplicates might indeed be different, as in a father and
son with the same name who are living in the same house but should be
represented as separate individuals. Techniques for identifying and removing or
joining duplicates can help to automatically address these types of problems.
Feature engineering
Feature engineering, as noted, involves techniques used by data scientists to
organize the data in ways that make it more efficient to train data models and
run inferences against them. These techniques include the following:
Feature scaling or normalization. Often, multiple variables change over
different scales, or one will change linearly while another will change
exponentially. For example, salary might be measured in thousands of dollars,
while age is represented in double digits. Scaling helps to transform the data in a
way that makes it easier for algorithms to tease apart a meaningful relationship
between variables.
Data reduction. Data scientists often need to combine a variety of data sources
to create a new AI or analytics model. Some of the variables may not be
correlated with a given outcome and can be safely discarded. Other variables
might be relevant, but only in terms of relationship -- such as the ratio of debt to
credit in the case of a model predicting the likelihood of a loan repayment; they
may be combined into a single variable. Techniques like principal component
analysis play a key role in reducing the number of dimensions in the training
data set into a more efficient representation.
Discretization. It's often useful to lump raw numbers into discrete intervals. For
example, income might be broken into five ranges that are representative of
people who typically apply for a given type of loan. This can reduce the overhead
of training a model or running inferences against it.
Feature encoding. Another aspect of feature engineering involves organizing
unstructured data into a structured format. Unstructured data formats can
include text, audio and video. For example, the process of developing natural
language processing algorithms typically starts by using data transformation
algorithms like Word2vec to translate words into numerical vectors. This makes it
easy to represent to the algorithm that words like "mail" and "parcel" are similar,
while a word like "house" is completely different. Similarly, a facial recognition
algorithm might reencode raw pixel data into vectors representing the distances
between parts of the face.
SCIKIT EARN: https://en.wikipedia.org/wiki/Scikit-learn
TENSORFLOW: https://en.wikipedia.org/wiki/TensorFlow
NUMPY: https://numpy.org/doc/stable/user/absolute_beginners.html
https://numpy.org/doc/stable/user/

You might also like