Data Preprocessing Unit 2
Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from
the dataset. Some techniques used in data cleaning are −
Standard values can be used to fill in the missing values manually, but only for a small
dataset.
The attribute's mean and median values can be used to replace missing values for normally and
non-normally distributed data respectively (see the pandas sketch after this list).
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
The most appropriate value can be estimated with regression or decision tree algorithms and used to fill in the missing value.
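A minimal sketch of the mean/median and row-dropping strategies above, using pandas; the column names and values are invented purely for illustration.

    # Handling missing values: fill with mean/median, or drop incomplete rows.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":    [25, 32, np.nan, 41, 29],
        "income": [50000, np.nan, 62000, np.nan, 48000],
    })

    # Fill with the attribute's mean (suits roughly normal distributions).
    df["age"] = df["age"].fillna(df["age"].mean())

    # Fill with the attribute's median (more robust for skewed distributions).
    df["income"] = df["income"].fillna(df["income"].median())

    # Alternatively, drop tuples (rows) that still contain missing values;
    # reasonable only when the dataset is large and few rows are affected.
    df = df.dropna()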
Noisy Data
Noisy data are data that cannot be interpreted by a machine and contain
unnecessary or faulty values. Some ways to handle them are −
Binning − This method smooths noisy data. The sorted data are divided into equal-sized bins,
and a smoothing method is then applied to each bin: smoothing by bin means (each value in a
bin is replaced by the bin's mean), smoothing by bin medians (each value is replaced by the
bin's median), or smoothing by bin boundaries (each value is replaced by the closest of the
bin's minimum and maximum values). A small sketch of smoothing by bin means follows this list.
Regression − Regression functions are used to smooth the data. Regression can be
linear (one independent variable) or multiple (several independent
variables).
Clustering − It groups similar data into clusters and can also be used to find outliers,
since values that fall outside every cluster may be treated as noise.
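A small sketch of smoothing by bin means, assuming equal-frequency bins; the values and the choice of three bins are invented for illustration.

    # Smoothing by bin means: each value is replaced by the mean of its bin.
    import numpy as np

    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
    n_bins = 3

    # Sort the data and split it into equal-frequency bins.
    bins = np.array_split(np.sort(values), n_bins)

    # Replace every value in a bin with that bin's mean.
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)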
Data integration
Data from multiple sources, such as databases, flat files, and data warehouses, are combined
into a single coherent store; conflicts between schemas and attribute values have to be
resolved during this step.
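As a hedged sketch of integration, the pandas snippet below merges two hypothetical sources on a shared key; the table contents and the key name customer_id are assumptions made for illustration.

    # Integrating two sources by joining them on a common key.
    import pandas as pd

    crm = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "name":        ["Ana", "Ben", "Cara"],
    })
    billing = pd.DataFrame({
        "customer_id": [1, 2, 4],
        "balance":     [120.0, 0.0, 35.5],
    })

    # Keep only customers that appear in both sources.
    combined = pd.merge(crm, billing, on="customer_id", how="inner")
    print(combined)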
Data transformation
In this step, the format or structure of the data is changed to make it
suitable for the mining process. Methods for data transformation are −
Attribute Selection − New attributes are constructed from the given attributes to help
the mining process.
Concept Hierarchy Generation − Attributes are generalised from a lower level to a higher
level in a concept hierarchy, for example from city to country.
Aggregation − Data are summarised and stored in a more compact form; the quality of the
result depends on the quality and quantity of the underlying data. A sketch of attribute
construction and aggregation follows this list.
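Below is a hedged pandas sketch of attribute construction and aggregation; the sales table, its columns, and the derived revenue attribute are assumptions made purely for illustration.

    # Deriving a new attribute and aggregating the data.
    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2022, 2022, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "units":   [120, 150, 130, 170],
        "price":   [10.0, 10.0, 11.0, 11.0],
    })

    # Attribute construction: derive a new attribute from existing ones.
    sales["revenue"] = sales["units"] * sales["price"]

    # Aggregation: summarise quarterly figures into yearly totals.
    yearly = sales.groupby("year", as_index=False)[["units", "revenue"]].sum()
    print(yearly)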
Data reduction
It increases storage efficiency and reduces the volume of stored data while
producing almost the same analytical results. Analysis becomes harder with
huge amounts of data, so reduction is used to keep the dataset manageable.
Data Compression
The data are encoded in a compressed representation; the reduction is lossless if the
original data can be reconstructed exactly, and lossy otherwise.
Numerosity Reduction
The volume of data is reduced by storing a smaller representation of the data, for
example the parameters of a model fitted to the data, instead of the whole dataset,
while preserving the information needed for analysis.
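A minimal sketch of parametric numerosity reduction: instead of keeping every point, a simple linear model is fitted and only its coefficients are stored; the synthetic data below is invented for illustration.

    # Parametric numerosity reduction: keep the model, not the raw points.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(1000, dtype=float)
    y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)

    # Store just the slope and intercept instead of the 1000 raw values.
    slope, intercept = np.polyfit(x, y, deg=1)

    # The original values can be approximated from the model when needed.
    y_approx = slope * x + intercept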
Dimensionality reduction
The number of attributes (dimensions) under consideration is reduced, keeping only the
features that carry most of the information; Principal Component Analysis (PCA) and
wavelet transforms are common techniques.
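As one common technique, the scikit-learn sketch below projects a synthetic five-attribute dataset onto its two principal components; the data and the choice of two components are assumptions made for illustration.

    # Dimensionality reduction with PCA.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))         # 200 samples, 5 attributes

    # Project the data onto the 2 directions of greatest variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (200, 2)
    print(pca.explained_variance_ratio_)  # variance captured by each component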