Ch2 Data Preprocessing Part3: Amit KR Upadhyay Sharda University

This document discusses different techniques for data preprocessing, including data transformation, data reduction, and sampling. It describes how data transformation can involve smoothing, aggregation, generalization, normalization, and attribute construction. Common normalization techniques are min-max normalization, z-score normalization, and decimal normalization. The document also discusses why data reduction is important when dealing with large datasets, and covers strategies like data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction using histograms and clustering, and different sampling methods.

Uploaded by

yachana_talk


Ch2 Data Preprocessing part3

Amit Kr Upadhyay
Sharda University
Knowledge Discovery (KDD) Process
 Data mining: core of the knowledge discovery process
[Figure: KDD pipeline, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
Forms of Data Preprocessing
Data Transformation
 Data transformation – the data are
transformed or consolidated into forms
appropriate for mining
Data Transformation
 Data Transformation can involve the
following:
 Smoothing: remove noise from the data; techniques include binning, regression, and clustering
 Aggregation
 Generalization
 Normalization
 Attribute construction
Normalization
 Min-max normalization
 Z-score normalization
 Decimal normalization
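Min-max normalization
 Min-max normalization: maps a value v of attribute A to v' in the range [new_minA, new_maxA]
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]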
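The min-max formula can be checked with a few lines of Python. This is a minimal sketch; the function name `min_max_normalize` and its parameter names are chosen here for illustration and do not come from the slides.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the range [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example from the slide: range $12,000 to $98,000, target [0.0, 1.0].
income = min_max_normalize(73_600, 12_000, 98_000)
print(round(income, 3))  # 0.716
```

The guard against `max_a == min_a` is an addition: the formula is undefined when the attribute takes a single value.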
Z-score normalization
 Z-score normalization (μ_A: mean of attribute A, σ_A: standard deviation of A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
 Ex. Let μ_A = 54,000 and σ_A = 16,000. Then 73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
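A short Python sketch of the same calculation follows. The function names are hypothetical, and `z_score_column` uses the population standard deviation (`statistics.pstdev`), which is an assumption: the slide does not say whether the population or sample form is meant.

```python
import statistics


def z_score(v, mean_a, std_a):
    """Return how many standard deviations v lies from the mean."""
    if std_a == 0:
        raise ValueError("standard deviation must be non-zero")
    return (v - mean_a) / std_a


def z_score_column(values):
    """Normalize a whole attribute, estimating mean and std from the data.

    Assumption: population standard deviation (pstdev) is used.
    """
    mean_a = statistics.mean(values)
    std_a = statistics.pstdev(values)
    return [z_score(v, mean_a, std_a) for v in values]


# Example from the slide: mean 54,000 and standard deviation 16,000.
print(z_score(73_600, 54_000, 16_000))  # 1.225
```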
Decimal normalization
 Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
 Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3, and -986 normalizes to -0.986 and 917 to 0.917.
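Finding j can be sketched as a simple loop. The helper names below are illustrative, and the search is restricted to non-negative j, which is an assumption (values already below 1 in absolute value are left unscaled).

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Return the scaled values together with the exponent j that was used."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values], j


# Example from the slide: values of A range from -986 to 917.
scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```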
Data Reduction
 Why data reduction?
 A database/data warehouse may store
terabytes of data
 Complex data analysis/mining may take a
very long time to run on the complete data
set
Data Reduction
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
 Data reduction strategies
 Data cube aggregation
 Attribute subset selection
 Dimensionality reduction — e.g., remove
unimportant attributes
 Numerosity reduction — e.g., fit data into
models
 Discretization and concept hierarchy
generation
Data cube aggregation
 Multiple levels of aggregation in data cubes
 Each higher level of aggregation further reduces the size of the data to deal with
 Reference appropriate levels
 Use the smallest representation that is sufficient to solve the task
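As an illustration of moving to a coarser aggregation level, the sketch below rolls quarterly sales up to annual totals. The rows, the column layout, and the `roll_up` helper are all hypothetical; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, branch, sales).
rows = [
    (2023, "Q1", "A", 224), (2023, "Q2", "A", 408),
    (2023, "Q3", "A", 350), (2023, "Q4", "A", 586),
    (2024, "Q1", "A", 300), (2024, "Q2", "A", 410),
]


def roll_up(rows, key_indexes, measure_index=3):
    """Sum the measure over every dimension not listed in key_indexes."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[measure_index]
    return dict(totals)


# Six quarterly rows reduce to two annual rows.
by_year = roll_up(rows, key_indexes=(0,))
print(by_year)  # {(2023,): 1568, (2024,): 710}
```

If the task only needs annual sales, the two-row result is the smallest representation that still answers it.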
Attribute subset selection
Dimensionality reduction
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features
 Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Attribute subset selection
Dimensionality reduction
 Heuristic methods (needed because d attributes have 2^d possible subsets):
 Step-wise forward selection
 Step-wise backward elimination
 Combining forward selection and backward
elimination
 Decision-tree induction
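Step-wise forward selection, the first heuristic in the list above, can be sketched as a greedy loop. The scoring function here is a toy stand-in invented for illustration; in practice the score would come from a statistical test or a model's accuracy, which the slides do not specify.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    Start from the empty set and repeatedly add the single attribute that
    most improves score(subset); stop when no attribute improves it.
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Toy score (an assumption): only A4, A1 and A6 carry signal, and every
# attribute in the subset costs a small penalty.
USEFUL = {"A4": 0.5, "A1": 0.3, "A6": 0.15}


def toy_score(subset):
    return sum(USEFUL.get(a, 0.0) for a in subset) - 0.01 * len(subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
# ['A4', 'A1', 'A6']
```

With six attributes the greedy search scores 18 candidate subsets instead of all 64. Backward elimination follows the same pattern but starts from the full set and removes the worst attribute at each step.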
Attribute subset selection
Dimensionality reduction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: induced decision tree. The root tests A4; its two children test A1 and A6; each of those has two leaves, Class 1 and Class 2.]
Reduced attribute set: {A1, A4, A6}

Numerosity reduction
 Reduce data volume by choosing alternative, smaller forms of data representation
 Major families: histograms, clustering, sampling
[Figure: example histogram; one axis has ticks from 10000 to 100000 in steps of 10000, the other from 0 to 40 in steps of 5]
Data Reduction Method:
Histograms
 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
 Equal-width: each bucket covers an equal range of values
 Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
 V-optimal: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents
 MaxDiff: place a bucket boundary between each pair of adjacent values whose difference is among the β - 1 largest, where β is the number of buckets
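The two simplest partitioning rules, equal-width and equal-frequency, can be sketched as follows. The price list and all function names are hypothetical, and only the count and sum per bucket are kept as the reduced representation.

```python
def equal_width_buckets(values, num_buckets):
    """Split [min, max] into num_buckets ranges of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index].append(v)
    return buckets


def equal_frequency_buckets(values, num_buckets):
    """Split the sorted values into buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / num_buckets) for i in range(num_buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(num_buckets)]


def summarize(buckets):
    """Keep only (count, sum) per bucket as the reduced representation."""
    return [(len(b), sum(b)) for b in buckets]


# Hypothetical price data.
prices = [1, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 30]
print(summarize(equal_width_buckets(prices, 3)))      # [(5, 29), (5, 79), (2, 51)]
print(summarize(equal_frequency_buckets(prices, 3)))  # [(4, 19), (4, 51), (4, 89)]
```

Equal-width buckets leave the sparse upper range with only two values, while equal-frequency buckets hold four values each at the cost of uneven ranges.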
Data Reduction Method:
Clustering
 Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 7
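Storing only a cluster representation might look like the sketch below. It assumes the clusters have already been found by some clustering algorithm, and it takes the diameter to be the largest distance between any two points in a cluster; the cluster names and points are made up.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for one cluster of equal-length tuples."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    # Diameter: maximum pairwise Euclidean distance (0.0 for a single point).
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )
    return centroid, diameter


# Hypothetical clusters, assumed to come from an earlier clustering step.
clusters = {
    "c1": [(0, 0), (0, 4), (3, 0)],
    "c2": [(10, 10), (10, 12)],
}
summary = {name: summarize_cluster(pts) for name, pts in clusters.items()}
print(summary["c2"])  # ((10.0, 11.0), 2.0)
```

Each cluster is reduced to one centroid and one number, regardless of how many tuples it contains.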
Data Reduction Method:
Sampling
 Sampling: obtaining a small sample of s tuples to represent the whole data set D of N tuples
 Simple random sample without replacement (SRSWOR)
 Simple random sample with replacement (SRSWR)
 Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, then a simple random sample of s clusters can be obtained, where s < M
 Stratified sample: D is divided into mutually disjoint strata, and a simple random sample is drawn from each stratum
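The four sampling schemes can be sketched with Python's standard `random` module. All function names are illustrative, and the stratified version draws the same fraction from every stratum (at least one tuple each), which is one possible allocation rule and an assumption here.

```python
import random
from collections import defaultdict


def srswor(data, s, rng):
    """Simple random sample without replacement: s distinct tuples."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return [rng.choice(data) for _ in range(s)]


def cluster_sample(clusters, s, rng):
    """Pick s whole clusters (s < M) and return all of their tuples."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw the same fraction from each stratum (assumed allocation rule)."""
    strata = defaultdict(list)
    for t in data:
        strata[stratum_of(t)].append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so that runs are repeatable
data = list(range(100))
pages = [data[i:i + 10] for i in range(0, 100, 10)]  # M = 10 clusters
customers = [(i, "young" if i < 80 else "senior") for i in range(100)]

print(len(srswor(data, 10, rng)), len(srswr(data, 10, rng)))
print(len(cluster_sample(pages, 3, rng)))
print(len(stratified_sample(customers, lambda t: t[1], 0.1, rng)))
```

Stratifying by the second field keeps the small "senior" group represented, which a plain random sample of the same size does not guarantee.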
Sampling: with or without Replacement
[Figure: Raw Data sampled two ways: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Sampling: Cluster or Stratified Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]
