Unit 2 - Data Preprocessing
Samir Siddiqui
CR FINAL YEAR IT
Department of Information Technology
December 22, 2022 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
Clustering
detect and remove outliers
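[Figure: regression smoothing. Axes x and y with the fitted line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1' on the line.]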
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Ex. Let μ = 54,000, σ = 16,000. Then the value 73,600 is normalized to
$$v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
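The z-score example can be checked with a few lines of Python. This is a minimal sketch: the function name `z_score` is illustrative and not part of the slides, and the mean and standard deviation are assumed to be already known for the attribute.

```python
def z_score(v, mean, std):
    """Return (v - mean) / std, the z-score of v for one attribute."""
    if std == 0:
        # A constant attribute has no spread, so the z-score is undefined.
        raise ValueError("standard deviation must be non-zero")
    return (v - mean) / std


# Values from the slide example: mu = 54,000, sigma = 16,000, v = 73,600
print(z_score(73_600, 54_000, 16_000))  # 1.225
```

A value equal to the mean maps to 0, and values below the mean map to negative numbers.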
Normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that Max(|v'|) < 1
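Decimal scaling can be sketched as follows. The function name `decimal_scale` and the sample values are illustrative assumptions, not taken from the slides; the loop simply searches for the smallest j that brings every scaled value strictly below 1 in absolute value.

```python
def decimal_scale(values):
    """Return (j, scaled) where scaled[i] = values[i] / 10**j and
    j is the smallest non-negative integer with max(|scaled|) < 1."""
    max_abs = max(abs(v) for v in values)  # raises ValueError on empty input
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Illustrative data: the largest absolute value is 986, so j = 3
j, scaled = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```

Note that a maximum of exactly 1000 needs j = 4, because the condition is a strict inequality.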
Data Reduction Strategies
smaller in volume yet produces the same (or almost the
same) analytical results
Data reduction strategies
Data cube aggregation:
Data Compression
understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Decision-tree induction
[Figure: decision tree with test nodes A4?, A1?, and A6?]
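Step-wise forward selection can be sketched as a greedy loop: start from an empty set and repeatedly add the single attribute that improves a score the most, stopping when no attribute helps. The sketch below makes several assumptions. The `score` callback, the attribute weights, and the per-attribute penalty are all hypothetical stand-ins for a real evaluation measure (for example, classifier accuracy on held-out data).

```python
def forward_selection(attributes, score):
    """Greedy step-wise forward selection.

    attributes: iterable of attribute names
    score: callable taking a list of attribute names, returning a number
           (higher is better); supplied by the caller
    """
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        # Evaluate adding each remaining attribute to the current subset.
        top_score, top_attr = max(
            ((score(selected + [a]), a) for a in remaining),
            key=lambda pair: pair[0],
        )
        if top_score <= best:
            break  # no single attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Hypothetical usefulness weights, with a 0.1 penalty per attribute used
WEIGHTS = {"A1": 0.5, "A2": 0.05, "A3": 0.02, "A4": 0.9, "A5": 0.01, "A6": 0.3}

def toy_score(subset):
    return sum(WEIGHTS[a] for a in subset) - 0.1 * len(subset)

print(forward_selection(WEIGHTS, toy_score))  # ['A4', 'A1', 'A6']
```

The greedy search evaluates a number of subsets that grows only quadratically with the number of attributes, instead of the exponential number of possible subsets, at the cost of possibly missing the best one.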
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
[Figure: principal components Y1 and Y2 drawn against the original axes X1 and X2]
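The idea of sorted principal components can be illustrated for two-dimensional data using only the standard library. This is a sketch under stated assumptions: the function name `pca_2d` and the sample points are illustrative, the sample covariance (dividing by n - 1) is used, and the closed-form eigen-decomposition below works only for the 2 x 2 case.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, components, projected) for 2-D points.

    eigenvalues: variances along each component, in decreasing order
    components:  unit vectors [(x, y), (x, y)] for Y1 and Y2
    projected:   coordinate of each centered point along Y1 only
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    eigenvalues = [half_trace + spread, half_trace - spread]

    # Eigenvector for the largest eigenvalue
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    first = (vx / norm, vy / norm)
    second = (-first[1], first[0])  # perpendicular to the first component

    # Keeping only Y1 reduces each point from two numbers to one
    projected = [x * first[0] + y * first[1] for x, y in centered]
    return eigenvalues, [first, second], projected


eigenvalues, components, projected = pca_2d([(1, 1), (2, 2), (3, 3)])
print(eigenvalues)  # [2.0, 0.0]
```

Here all points lie on the line y = x, so the second eigenvalue is 0 and dropping Y2 loses no information.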
Q9. Explain data discretization and concept hierarchy generation. [6M][S-17], [7M][S-19]
Q10. What are the measures of data dispersion? [4M][S-19]
Q11. What is the need for multidimensional analysis? [5M][S-16]
Q12. Write short notes on:
a. Binning b. Regression c. Clustering d. Smoothing
e. Generalization f. Aggregation
Q13. Explain MIN-MAX normalization and Z-score normalization. [7M][W-17], [4M]
[S-16], [6M][S-19]
Q14. Explain the various issues to be considered in data integration. Also give the
various forms of preprocessing. [6M][S-16]
Q15. What are the challenges in data preprocessing?