ISE233 Lecture 3
Industrial Systems
(Lecture 3)
Synopsis
• Data Preprocessing
✓ Data cleaning
✓ Data integration
✓ Data reduction
✓ Data transformation
Data Preprocessing
• Real-world data are highly susceptible to noise, missing values, and inconsistencies.
Missing values can be handled in several ways:
1. Ignore the data point (tuple): usually done when the class label is missing (not very effective)
2. Fill in the missing value manually: time-consuming
3. Use a global constant to fill in the missing value, e.g. "Unknown" or -∞
4. Use a measure of central tendency for the attribute (mean or median)
5. Use the attribute mean or median for all samples belonging to the same class as the given
tuple
6. Use the most probable value to fill in the missing value
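Options 4 and 5 in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the `income` values, and the `risk` class labels are all hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from statistics import mean, median

def fill_with_central_tendency(values, how="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if how == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical attribute values and class labels, for illustration only.
income = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

print(fill_with_central_tendency(income))   # global mean of observed values
print(fill_with_class_mean(income, risk))   # per-class mean
```

The class-conditional version fills each gap with a value typical of similar tuples, which is why it usually distorts the data less than a single global mean.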
Data Cleaning
Noisy Data - Noise is a random error or variance in a measured variable
• Binning: binning methods smooth a sorted data value by consulting its “neighborhood”, that is,
the values around it
1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries
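The three smoothing variants can be sketched as follows. This is an illustrative sketch under stated assumptions: bins are equal-frequency (equal-depth) partitions of the sorted data, the function names are hypothetical, the `prices` values are made-up sample data, and ties in boundary smoothing go to the lower boundary.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of the bin's min and max (ties go to the min)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative sample data, partitioned into bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```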
• Regression: data can be smoothed by fitting it to a function, e.g. a straight line (linear regression)
• Outlier analysis – outliers may be detected by clustering
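As one possible illustration of smoothing by regression, the sketch below fits a least-squares straight line to paired observations and replaces each noisy value with its fitted value. The function names and sample data are hypothetical, and a straight-line model is an assumption; other functions could be fitted instead.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)  # assumes xs are not all equal
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
print(smooth_by_regression(xs, ys))
```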
Data Integration
Redundancy and Correlation Analysis
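$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$$
$$E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$$
$$\mathrm{Cov}(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$
$$r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \sigma_B}$$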
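The covariance and correlation formulas can be checked numerically with a short sketch. It follows the slide's convention of dividing by n (population statistics); the function names and the two sample attributes are hypothetical.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(a, b):
    """Population covariance: average product of deviations from the means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)

def std(a):
    """Population standard deviation, i.e. sqrt(Cov(A, A))."""
    return sqrt(covariance(a, a))

def correlation(a, b):
    """Correlation coefficient r_{A,B} = Cov(A, B) / (sigma_A * sigma_B)."""
    return covariance(a, b) / (std(a) * std(b))

# Hypothetical values of two attributes observed on the same five tuples.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]

print(covariance(a, b))   # positive: the attributes tend to rise together
print(correlation(a, b))  # always lies in [-1, 1]
```

A correlation close to +1 or -1 suggests that one of the two attributes may be redundant and a candidate for removal during integration.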
Data Reduction
Principal Components Analysis (PCA)
Data Reduction
Attribute Subset Selection
Data Reduction
• Regression and Log-linear Models: Parametric Data Reduction
• Histograms
• Clustering
• Sampling
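Two of the techniques listed above, histograms and sampling, can be sketched briefly. This is only an illustration under assumptions not stated on the slide: the histogram uses equal-width buckets, the sample is a simple random sample without replacement, and all names and data are hypothetical.

```python
import random

def equal_width_histogram(values, n_buckets):
    """Summarize values as bucket edges plus one count per equal-width bucket.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_buckets + 1)]
    return edges, counts

def simple_random_sample(values, s, seed=0):
    """Draw s items without replacement; the seed makes the draw repeatable."""
    return random.Random(seed).sample(values, s)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical attribute values
edges, counts = equal_width_histogram(data, 3)
print(edges, counts)
print(simple_random_sample(data, 4))
```

In both cases the original values are replaced by a smaller representation: bucket counts for the histogram, a subset of tuples for the sample.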
Data Transformation
Data Transformation by Normalization
• Min-max normalization: a linear transformation of the original data. Suppose that 𝑚𝑖𝑛𝐴 and 𝑚𝑎𝑥𝐴 are the minimum and
maximum values of an attribute A. Min-max normalization maps a value 𝑣𝑖 of A to 𝑣𝑖′ in the range
[new_𝑚𝑖𝑛𝐴 , new_𝑚𝑎𝑥𝐴 ] by computing
$$v_i' = \frac{v_i - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
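A minimal sketch of min-max normalization follows. The function name and the sample values are hypothetical, the default target range [0, 1] is an assumption, and the code assumes the attribute is not constant (so the denominator is non-zero).

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min_A, max_A] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)  # min_A and max_A
    scale = (new_max - new_min) / (hi - lo)  # assumes hi > lo
    return [(v - lo) * scale + new_min for v in values]

income = [12000, 73600, 98000]  # hypothetical attribute values
print(min_max_normalize(income))          # minimum maps to 0.0, maximum to 1.0
print(min_max_normalize(income, -1, 1))   # same data mapped onto [-1, 1]
```

Because the transformation is linear, the relative spacing of the original values is preserved.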
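• Z-score normalization: the values of attribute A are normalized using the mean $\bar{A}$ and standard deviation $\sigma_A$ of A
$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$$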
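Z-score normalization can be sketched as below. The function name and data are hypothetical, and the population standard deviation (dividing by n) is an assumption chosen for simplicity; the code also assumes the attribute is not constant.

```python
from math import sqrt

def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    n = len(values)
    a_bar = sum(values) / n
    sigma = sqrt(sum((v - a_bar) ** 2 for v in values) / n)  # assumes sigma > 0
    return [(v - a_bar) / sigma for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5 and sigma 2
print(z_score_normalize(scores))   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max normalization, this does not require knowing the true minimum and maximum of the attribute.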
• Decimal scaling normalization: normalization by moving the decimal point of values of attribute A
$$v_i' = \frac{v_i}{10^{j}}$$
where j is the smallest integer such that $\max(|v_i'|) < 1$
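Decimal scaling can be sketched as follows, assuming j is chosen as the smallest non-negative integer for which every scaled absolute value is below 1. The function name and sample values are hypothetical.

```python
def decimal_scaling_normalize(values):
    """Divide by the smallest power of ten that brings every |value| below 1.

    Returns the scaled values together with the exponent j that was used.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

readings = [-986, 120, 917]  # hypothetical attribute values
scaled, j = decimal_scaling_normalize(readings)
print(j)       # 3, because the largest absolute value is 986
print(scaled)  # every scaled value lies strictly between -1 and 1
```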
Data Transformation
Data Transformation by Discretization - The raw values of a numeric attribute
are replaced by interval or conceptual labels
• Discretization by binning
• Discretization by histogram analysis
• Discretization by cluster analysis
• Discretization by decision tree
• Discretization by correlation analysis
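Discretization by binning, the first technique listed above, can be sketched as mapping each numeric value to the label of the interval it falls in. The function names, the age values, the cut points, and the conceptual labels are all hypothetical; intervals are assumed to be closed on the left and open on the right.

```python
from bisect import bisect_right

def equal_width_cut_points(values, n_bins):
    """Interior boundaries that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def discretize(values, cut_points, labels):
    """Replace each value with the label of its interval.

    cut_points must be sorted and len(labels) must equal len(cut_points) + 1.
    A value equal to a cut point goes to the interval on its right.
    """
    return [labels[bisect_right(cut_points, v)] for v in values]

ages = [13, 22, 35, 47, 61, 70]                # hypothetical raw values
cut_points = [30, 60]                          # hypothetical: <30, 30 to <60, >=60
labels = ["youth", "middle_aged", "senior"]    # hypothetical conceptual labels

print(discretize(ages, cut_points, labels))
# ['youth', 'youth', 'middle_aged', 'middle_aged', 'senior', 'senior']
```

The cut points here are chosen by hand; `equal_width_cut_points` shows one way to derive them from the data instead.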