U1_DA_Data Preprocessing
Data preprocessing is a step in the data mining and data analysis process that
takes raw data and transforms it into a format that can be understood and
analyzed by computers and machine learning models.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only
may it contain errors and inconsistencies, but it is often incomplete, and doesn’t
have a regular, uniform design.
Machines like to process neat, tidy information – they read data as 1s and 0s.
So calculating structured data, like whole numbers and percentages, is easy.
Unstructured data, in the form of text and images, must first be cleaned and
formatted before analysis.
1. Data quality assessment
Take a good look at your data and get an idea of its overall quality, relevance to
your project, and consistency. There are a number of data anomalies and inherent
problems to look out for in almost any data set, for example (a short code sketch
follows this list):
Mismatched data types: When you collect data from many different sources, it
may come to you in different formats. While the ultimate goal of this entire
process is to reformat your data for machines, you still need to begin with
similarly formatted data. For example, if part of your analysis involves family
income from multiple countries, you’ll have to convert each income amount into a
single currency.
Mixed data values: Perhaps different sources use different descriptors for features
– for example, man or male. These value descriptors should all be made uniform.
Data outliers: Outliers can have a huge impact on data analysis results. For
example, if you’re averaging test scores for a class and one student didn’t respond
to any of the questions, their 0% could greatly skew the results.
Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or incomplete
data. To take care of missing data, you’ll have to perform data cleaning.
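A short pandas sketch of how a few of these anomalies might be surfaced (the
column names and the outlier threshold are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "gender": ["man", "male", "female"],
    "score": [88.0, 0.0, None],
})

# Mixed data values: make descriptors uniform
df["gender"] = df["gender"].replace({"man": "male"})

# Missing data: count blanks per column
print(df.isna().sum())

# Data outliers: flag scores far from the mean
z = (df["score"] - df["score"].mean()) / df["score"].std()
outliers = df[z.abs() > 2]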
2. Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set. Data cleaning is the most
important step of preprocessing because it ensures that your data is ready to go
for your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are
a number of cleaning steps you’ll need to run your data through.
Missing data
There are a number of ways to correct for missing data, but the two most common
are:
Ignore the tuples: simply drop the records with missing values. This is only
practical when the data set is large enough that dropping a handful of records
won’t skew the analysis.
Manually fill in missing data: fill in the gaps by hand, or impute them with a
derived value such as the attribute’s mean or most probable value.
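A minimal pandas sketch of both approaches (the data frame and column names
are hypothetical):

import pandas as pd

# Hypothetical survey data with missing values
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "income": [52000, 48000, None, 61000],
})

# Option 1: ignore the tuples -- drop any row with a missing value
dropped = df.dropna()

# Option 2: impute -- fill each gap with the column mean
imputed = df.fillna(df.mean(numeric_only=True))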
Noisy data
Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s difficult to group
together. Common techniques for smoothing noisy data include binning,
regression, and clustering:
Binning: Binning sorts the values of a wide-ranging data set into smaller groups
(bins) of more similar data. It’s often used when analyzing demographics. Income,
for example, could be grouped: $35,000-$50,000, $50,000-$75,000, etc.
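A short pandas sketch of binning hypothetical income values into those groups:

import pandas as pd

incomes = pd.Series([38000, 47000, 52000, 68000, 73000])

# Sort incomes into the bins from the example above
bins = [35000, 50000, 75000]
labels = ["$35,000-$50,000", "$50,000-$75,000"]
binned = pd.cut(incomes, bins=bins, labels=labels)
print(binned.value_counts())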
Regression: Regression fits a function to your data, which can be used to smooth
noisy values toward the fitted curve and to decide which variables actually apply
to your analysis. This will help you get a handle on your data, so you’re not
overburdened with unnecessary noise.
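A minimal numpy sketch of regression-based smoothing (the measurements are
made up):

import numpy as np

# Hypothetical noisy measurements along a roughly linear trend
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(scale=3.0, size=10)

# Fit a straight line, then replace each noisy value
# with the fitted (smoothed) value
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept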
Clustering: Clustering algorithms are used to properly group data, so that it can
be analyzed with like data. They’re generally used in unsupervised learning, when
not a lot is known about the relationships within your data.
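A minimal scikit-learn sketch of grouping points with k-means (the data and the
cluster count are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points forming two loose groups
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

# Group the points into two clusters of similar data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)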
If you’re working with text data, for example, some things you should consider
when cleaning your data are (a short sketch follows this list):
Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
Translate all text into the language you’ll be working in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
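A minimal Python sketch covering several of these steps (the regular expressions
are simple illustrations, not production-grade):

import re

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"<[^>]+>", "", text)       # remove HTML tags
    text = re.sub(r"[^\w\s.,!?]", "", text)   # remove symbols/emojis
    text = re.sub(r"\s+", " ", text)          # collapse blank runs
    return text.strip()

docs = ["Visit <b>our site</b> at https://example.com  today!!"]
cleaned = [clean_text(d) for d in docs]
deduped = list(dict.fromkeys(cleaned))        # remove duplicate documents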
After data cleaning, you may realize you have insufficient data for the task at hand.
At this point you can also perform data wrangling or data enrichment to add new
data sets and run them through quality assessment and cleaning again before
adding them to your original data.
3. Data transformation
With data cleaning, we’ve already begun to modify our data, but data
transformation will begin the process of turning the data into the proper format(s)
you’ll need for analysis and other downstream processes.
1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
Aggregation: Data aggregation combines all of your data together in a uniform
format.
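A tiny pandas sketch of aggregating two hypothetical sources into one uniform
table:

import pandas as pd

# Two sources sharing the same schema after earlier cleaning steps
source_a = pd.DataFrame({"age": [34, 29], "income": [52000, 48000]})
source_b = pd.DataFrame({"age": [41], "income": [61000]})

# Combine everything into a single uniform data set
combined = pd.concat([source_a, source_b], ignore_index=True)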
Normalization: Normalization scales your data into a regularized range so that
you can compare it more accurately. For example, if you’re comparing employee
loss or gain within a number of companies (some with just a dozen employees and
some with 200+), you’ll have to scale them within a specified range, like -1.0 to
1.0 or 0.0 to 1.0.
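A minimal numpy sketch of min-max scaling into the 0.0 to 1.0 range (the
employee figures are made up):

import numpy as np

# Hypothetical employee-count changes across companies of very different sizes
values = np.array([12.0, 85.0, 200.0, 47.0])

# Min-max normalization into the range 0.0 to 1.0
scaled = (values - values.min()) / (values.max() - values.min())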
Feature selection: Feature selection is the process of deciding which variables
(features, characteristics, categories, etc.) are most important to your analysis.
These features will be used to train ML models. It’s important to remember that
the more features you choose to use, the longer the training process and,
sometimes, the less accurate your results, because some features may overlap or
be only sparsely represented in the data.
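A short scikit-learn sketch that keeps the k most informative features (the Iris
data set is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the two features most strongly related to the label
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (150, 2)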
4. Data reduction
The more data you’re working with, the harder it will be to analyze, even after
cleaning and transforming it. Depending on your task at hand, you may actually
have more data than you need. Especially when working with text analysis, much
of regular human speech is superfluous or irrelevant to the needs of the researcher.
Data reduction not only makes the analysis easier and more accurate, but cuts
down on data storage. It will also help identify the most important features for
the process at hand.
Attribute selection: Similar to discretization, attribute selection can fit your
data into smaller pools. It essentially combines tags or features, so that tags
like male/female and professor could be combined into male professor/female
professor.
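A one-line pandas illustration of combining two hypothetical attribute columns
into one:

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female"],
                   "role": ["professor", "professor"]})

# Combine two attributes into a single, coarser attribute
df["gender_role"] = df["gender"] + " " + df["role"]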
Numerosity reduction: This replaces the original data with a smaller
representation, which helps with data storage and transmission. You can fit a
regression model, for example, and keep only the model’s parameters and the
variables that are relevant to your analysis rather than the full data set.
Dimensionality reduction: This, again, reduces the amount of data used to help
facilitate analysis and downstream processes. Techniques such as principal
component analysis (PCA) use the patterns in your data to combine correlated
features into fewer composite features, making the data more manageable.
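A minimal scikit-learn PCA sketch (again using a stand-in data set):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the four original features down to two composite features
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (150, 2)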
As a simple worked example: if we know certain rows were improperly entered or
are otherwise corrupted, we can use data cleaning to simply remove them. Once
the issue is fixed, we can perform data reduction, in this case by sorting on
descending age and choosing which age range we want to focus on.
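A small pandas sketch of that sequence (the data and the age range are
hypothetical):

import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bo", "Cy", "??"],
                   "age": [34, 29, 58, -1]})

# Data cleaning: remove rows known to be improperly entered
df = df[df["age"] > 0]

# Data reduction: sort by descending age and focus on one range
df = df.sort_values("age", ascending=False)
focus = df[df["age"].between(30, 60)]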