What Is Data Preprocessing

Uploaded by

Lekh Nath Chettri

What is data preprocessing?

Data preprocessing, a component of data preparation, describes any type of
processing performed on raw data to prepare it for another data processing
procedure. It has traditionally been an important preliminary step for the data
mining process. More recently, data preprocessing techniques have been
adapted for training machine learning models and AI models and for running
inferences against them.
Data preprocessing transforms the data into a format that is more easily and
effectively processed in data mining, machine learning and other data science
tasks. The techniques are generally used at the earliest stages of the machine
learning and AI development pipeline to ensure accurate results.
There are several different tools and methods used for preprocessing data,
including the following:
• sampling, which selects a representative subset from a large population of
data;
• transformation, which manipulates raw data to produce a single input;
• denoising, which removes noise from data;
• imputation, which synthesizes statistically relevant data for missing
values;
• normalization, which organizes data for more efficient access or, in
machine learning, rescales values to a common range; and
• feature extraction, which pulls out a relevant feature subset that is
significant in a particular context.
These tools and methods can be used on a variety of data sources, including
data stored in files or databases and streaming data.
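
As an illustration of the first method in the list, sampling, here is a minimal sketch using only Python's standard library. The helper name sample_records and the stand-in population are assumptions made for this example. Simple random sampling is representative only in expectation; stratified sampling may be a better fit when subgroups must be preserved.

```python
import random


def sample_records(records, k, seed=0):
    """Return k records drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, k)


# Stand-in for a large data set (hypothetical data).
population = list(range(1000))
subset = sample_records(population, k=50)
print(len(subset))  # 50
```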
Why is data preprocessing important?
Virtually any type of data analysis, data science or AI development requires
some type of data preprocessing to provide reliable, precise and robust results
for enterprise applications.
Real-world data is messy and is often created, processed and stored by a variety
of humans, business processes and applications. As a result, a data set may be
missing individual fields, contain manual input errors, or have duplicate data or
different names to describe the same thing. Humans can often identify and
rectify these problems in the data they use in the line of business, but data used
to train machine learning or deep learning algorithms needs to be automatically
preprocessed.
Machine learning and deep learning algorithms work best when data is presented
in a format that highlights the relevant aspects required to solve a problem.
Feature engineering practices that involve data wrangling, data transformation,
data reduction, feature selection and feature scaling help restructure raw data
into a form suited for particular types of algorithms. This can significantly reduce
the processing power and time required to train a new machine learning or AI
algorithm or run an inference against it.
One caution to observe when preprocessing data is the potential for
re-encoding bias into the data set. Identifying and correcting bias is critical for
applications that help make decisions that affect people, such as loan approvals.
Although data scientists may deliberately ignore variables like gender, race or
religion, these traits may be correlated with other variables like zip codes or
schools attended, generating biased results.
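
One rough way to look for such proxy variables is to measure how strongly a candidate feature correlates with the attribute that was deliberately left out. The sketch below computes a Pearson correlation by hand; the toy data and variable names are made up for illustration, and a high correlation is only a warning sign, not proof of bias.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 1 = member of a protected group, 0 = not.
protected = [1, 1, 1, 0, 0, 0]
# Hypothetical indicator: 1 = lives in zip-code area A, 0 = elsewhere.
in_zip_area_a = [1, 1, 0, 0, 0, 0]

r = pearson(protected, in_zip_area_a)
print(round(r, 2))  # 0.71, strong enough to warrant a closer look
```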
Most modern data science packages and services now include various
preprocessing libraries that help to automate many of these tasks.
What are the key steps in data preprocessing?
The steps used in data preprocessing include the following:
1. Data profiling. Data profiling is the process of examining, analyzing and
reviewing data to collect statistics about its quality. It starts with a survey of
existing data and its characteristics. Data scientists identify data sets that are
pertinent to the problem at hand, inventory their significant attributes, and form a
hypothesis of features that might be relevant for the proposed analytics or
machine learning task. They also relate data sources to the relevant business
concepts and consider which preprocessing libraries could be used.
2. Data cleansing. The aim here is to find the easiest way to rectify quality
issues, such as eliminating bad data, filling in missing data or otherwise ensuring
the raw data is suitable for feature engineering.
3. Data reduction. Raw data sets often include redundant data that arise from
characterizing phenomena in different ways or data that is not relevant to a
particular ML, AI or analytics task. Data reduction uses techniques like principal
component analysis to transform the raw data into a simpler form suitable for
particular use cases.
4. Data transformation. Here, data scientists think about how different aspects
of the data need to be organized to make the most sense for the goal. This could
include things like structuring unstructured data, combining salient variables
when it makes sense or identifying important ranges to focus on.
5. Data enrichment. In this step, data scientists apply the various feature
engineering libraries to the data to effect the desired transformations. The result
should be a data set organized to achieve the optimal balance between the
training time for a new model and the required compute.
6. Data validation. At this stage, the data is split into two sets. The first set is
used to train a machine learning or deep learning model. The second set is the
testing data that is used to gauge the accuracy and robustness of the resulting
model. This testing step helps identify any problems in the hypothesis used in
the cleaning and feature engineering of the data. If the data scientists are
satisfied with the results, they can push the preprocessing task to a data
engineer who figures out how to scale it for production. If not, the data scientists
can go back and make changes to the way they implemented the data cleansing
and feature engineering steps.
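
The split described in the data validation step can be sketched as follows. This is a minimal illustration, assuming a simple random split with a hypothetical helper named split_rows and a default of 20% held out for testing. Time-ordered or grouped data usually needs a different splitting strategy.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for preprocessed records
train_rows, test_rows = split_rows(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))  # 8 2
```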
Data preprocessing techniques
There are two main categories of preprocessing -- data cleansing and feature
engineering. Each includes a variety of techniques, as detailed below.
Data cleansing
Techniques for cleaning up messy data include the following:
Identify and sort out missing data. There are a variety of reasons a data set
might be missing individual fields of data. Data scientists need to decide whether
it is better to discard records with missing fields, ignore them or fill them in with
a probable value. For example, in an IoT application that records temperature,
filling a gap with the average of the previous and subsequent
record might be a safe fix.
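
The temperature example can be sketched as follows. This is a minimal illustration, assuming missing readings are stored as None and that only isolated gaps with a known reading on both sides should be filled. The function name is hypothetical.

```python
def fill_missing_with_neighbor_mean(values):
    """Fill a None reading with the mean of its two neighbors, if both are known."""
    filled = list(values)
    for i in range(1, len(filled) - 1):
        prev_value, next_value = values[i - 1], values[i + 1]
        if filled[i] is None and prev_value is not None and next_value is not None:
            filled[i] = (prev_value + next_value) / 2
    return filled


temps = [70.0, None, 74.0, 75.0, None, None, 73.0]
print(fill_missing_with_neighbor_mean(temps))
# [70.0, 72.0, 74.0, 75.0, None, None, 73.0]
```

Longer gaps are left untouched here on purpose, since averaging across several missing readings is less likely to be a safe fix.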
Reduce noisy data. Real-world data is often noisy, which can distort an analytic
or AI model. For example, a temperature sensor that consistently reported a
temperature of 75 degrees Fahrenheit might erroneously report a temperature as
250 degrees. A variety of statistical approaches can be used to reduce the noise,
including binning, regression and clustering.
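
As one possible illustration of binning, the sketch below groups consecutive readings into fixed-size bins and replaces each reading with its bin's median. Textbook binning usually sorts the values first; keeping the readings in time order is an assumption made here so the sensor example stays readable. The function name and bin size are hypothetical.

```python
import statistics


def smooth_by_bin_median(readings, bin_size=3):
    """Replace each reading with the median of its fixed-size bin."""
    smoothed = []
    for start in range(0, len(readings), bin_size):
        bin_values = readings[start:start + bin_size]
        center = statistics.median(bin_values)
        smoothed.extend([center] * len(bin_values))
    return smoothed


# A sensor that normally reads about 75 degrees, with one spurious 250.
readings = [75, 75, 250, 75, 76, 75]
smoothed = smooth_by_bin_median(readings)
print(smoothed)  # [75, 75, 75, 75, 75, 75]
```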
Identify and remove duplicates. When two records seem to repeat, an
algorithm needs to determine if the same measurement was recorded twice, or
the records represent different events. In some cases, there may be slight
differences in a record because one field was recorded incorrectly. In other cases,
records that seem to be duplicates might indeed be different, as in a father and
son with the same name who are living in the same house but should be
represented as separate individuals. Techniques for identifying and removing or
joining duplicates can help to automatically address these types of problems.
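
A minimal sketch of this idea follows, assuming each record is a (name, address, birth year) tuple and that two records are duplicates only when all three fields match after normalizing case and whitespace. The field layout and helper names are hypothetical. Real deduplication often needs fuzzy matching, which this example does not attempt.

```python
def normalize(record):
    """Build a comparison key that ignores case and extra whitespace."""
    name, address, birth_year = record
    clean = lambda text: " ".join(text.lower().split())
    return (clean(name), clean(address), birth_year)


def drop_duplicates(records):
    """Keep the first occurrence of each normalized record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    ("John Smith", "12 Oak St", 1960),
    ("john  smith", "12 oak st", 1960),  # same person, different formatting
    ("John Smith", "12 Oak St", 1990),   # son with the same name and address
]
unique = drop_duplicates(records)
print(len(unique))  # 2
```

Including the birth year in the key is what keeps the father and son apart; dropping it would merge them into one record.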
Feature engineering
Feature engineering, as noted, involves techniques used by data scientists to
organize the data in ways that make it more efficient to train data models and
run inferences against them. These techniques include the following:
Feature scaling or normalization. Often, multiple variables change over
different scales, or one will change linearly while another will change
exponentially. For example, salary might be measured in thousands of dollars,
while age is represented in double digits. Scaling helps to transform the data in a
way that makes it easier for algorithms to tease apart a meaningful relationship
between variables.
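
Two common scaling schemes, min-max scaling and standardization (z-scores), can be sketched with the standard library as follows. The salary and age values are made up, and both helpers assume the column is not constant, since a constant column would cause a division by zero.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


salaries = [30_000, 45_000, 60_000, 90_000]  # hypothetical, in dollars
ages = [25, 35, 45, 55]                      # hypothetical, in years

print(min_max_scale(salaries))  # [0.0, 0.25, 0.5, 1.0]
print(standardize(ages))
```

After either transformation, salary and age occupy comparable numeric ranges, so neither dominates purely because of its units.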
Data reduction. Data scientists often need to combine a variety of data sources
to create a new AI or analytics model. Some of the variables may not be
correlated with a given outcome and can be safely discarded. Other variables
might be relevant, but only in terms of relationship -- such as the ratio of debt to
credit in the case of a model predicting the likelihood of a loan repayment; they
may be combined into a single variable. Techniques like principal component
analysis play a key role in reducing the number of dimensions in the training
data set into a more efficient representation.
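
The debt-to-credit example can be sketched as below: two related columns are combined into a single ratio and an unrelated column is dropped. The field names (debt, credit_limit, favorite_color) are hypothetical, and this is plain feature combination rather than principal component analysis.

```python
def reduce_applicant(applicant):
    """Keep the id and replace debt and credit limit with a single ratio."""
    return {
        "id": applicant["id"],
        "debt_to_credit": applicant["debt"] / applicant["credit_limit"],
    }


applicants = [
    {"id": 1, "debt": 2_000, "credit_limit": 10_000, "favorite_color": "blue"},
    {"id": 2, "debt": 7_500, "credit_limit": 10_000, "favorite_color": "red"},
]
reduced = [reduce_applicant(a) for a in applicants]
print(reduced)
# [{'id': 1, 'debt_to_credit': 0.2}, {'id': 2, 'debt_to_credit': 0.75}]
```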
Discretization. It's often useful to lump raw numbers into discrete intervals. For
example, income might be broken into five ranges that are representative of
people who typically apply for a given type of loan. This can reduce the overhead
of training a model or running inferences against it.
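
A minimal sketch of discretizing income into five ranges is shown below. The boundary values and labels are assumptions chosen for illustration; in practice they would come from the population of applicants.

```python
import bisect

# Hypothetical bracket edges in dollars; four edges define five ranges.
BOUNDARIES = [25_000, 50_000, 75_000, 100_000]
LABELS = ["very low", "low", "medium", "high", "very high"]


def income_bracket(income):
    """Map an income to its bracket label; an edge value falls in the upper bracket."""
    return LABELS[bisect.bisect_right(BOUNDARIES, income)]


for income in (10_000, 25_000, 60_000, 80_000, 100_000):
    print(income, income_bracket(income))
# 10000 very low
# 25000 low
# 60000 medium
# 80000 high
# 100000 very high
```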
Feature encoding. Another aspect of feature engineering involves organizing
unstructured data into a structured format. Unstructured data formats can
include text, audio and video. For example, the process of developing natural
language processing algorithms typically starts by using data transformation
algorithms like Word2vec to translate words into numerical vectors. This makes it
easy to represent to the algorithm that words like "mail" and "parcel" are similar,
while a word like "house" is completely different. Similarly, a facial recognition
algorithm might reencode raw pixel data into vectors representing the distances
between parts of the face.
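
Once words are encoded as vectors, similarity can be measured with cosine similarity, as in the sketch below. The three-dimensional vectors are hand-picked toy values, not the output of Word2vec or any trained model; they only illustrate how "mail" and "parcel" can end up closer to each other than to "house".

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy vectors (hypothetical, NOT produced by a real model).
embeddings = {
    "mail": [0.9, 0.8, 0.1],
    "parcel": [0.8, 0.9, 0.2],
    "house": [0.1, 0.2, 0.9],
}

sim_mail_parcel = cosine_similarity(embeddings["mail"], embeddings["parcel"])
sim_mail_house = cosine_similarity(embeddings["mail"], embeddings["house"])
print(sim_mail_parcel > sim_mail_house)  # True
```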
SCIKIT-LEARN: https://en.wikipedia.org/wiki/Scikit-learn
TENSORFLOW: https://en.wikipedia.org/wiki/TensorFlow
NUMPY: https://numpy.org/doc/stable/user/absolute_beginners.html
https://numpy.org/doc/stable/user/
