IT326 - Ch2 - Pre-Processing
Data Preprocessing
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Feature Selection
Summary
Data Preprocessing:
Data cleaning
Fill in missing values, smooth noisy data, identify
or remove outliers, and resolve inconsistencies.
Data integration
Integration of multiple databases or files.
Data reduction
Dimensionality reduction.
Numerosity reduction.
Data transformation
Normalization.
Concept hierarchy generation.
Discretization
Data in the Real World is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors.
Incomplete: lacking attribute values, or containing only aggregate data.
◼ e.g., Occupation=“ ” (missing data)
Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”.
Noisy: containing noise, errors, or outliers.
◼ e.g., Salary=“−10” (an error)
Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Data Cleaning: Missing Data
Ignore the tuple: usually done when the class label is missing (when doing classification).
◼ effective when the tuple contains several attributes with missing values.
◼ not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value:
1) Manually: time-consuming and infeasible for large data sets with many missing values.
2) Use a global constant (such as a label like “Unknown” or −∞ or “NA”)
3) Use the central tendency for the attribute (e.g., the mean or median)
4) Use the attribute mean/median for all samples belonging to the same class.
5) Use the most probable value.
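The sketch below illustrates options 2), 3), and 4) with pandas; the small data frame and the column names (income, occupation, class) are invented for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [48000, np.nan, 52000, 61000, np.nan],
    "occupation": ["engineer", None, "teacher", "teacher", None],
    "class": ["yes", "no", "yes", "no", "yes"],
})

# 2) Global constant for a nominal attribute.
df["occupation"] = df["occupation"].fillna("Unknown")

# 3) Central tendency (mean) of the whole attribute.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 4) Mean of the attribute within each class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```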
Data Cleaning: Noisy Data
Data integration:
The merging of data from multiple sources into a coherent store.
Challenges:
Entity identification problem: How to match schemas and objects from different
sources?
Redundancy and Correlation Analysis: Are any attributes correlated?
Data Integration: Challenges
Correlation analysis method by data type:
Nominal data: χ² (chi-square) test
Numerical data: Correlation coefficient, Covariance
Data Integration: Challenges (Correlation Analysis)
The χ² (chi-square) statistic compares the observed (actual) counts $o_{ij}$ with the expected counts $e_{ij}$:
$\chi^2 = \sum_{i}\sum_{j} \dfrac{(o_{ij} - e_{ij})^2}{e_{ij}}$
Expected values are calculated using:
$e_{ij} = \dfrac{count(A = a_i) \times count(B = b_j)}{n}$
The larger the χ² value, the stronger the correlation.
The cells that contribute the most to the Χ2 value are those whose actual count is very different
from the expected count.
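A minimal sketch of the χ² test using scipy; the contingency-table counts are invented for illustration, and correction=False is passed so the statistic matches the formula above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows are the values of A, columns the values of B.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("expected counts:\n", expected)   # e_ij = count(A=a_i) * count(B=b_j) / n
print("chi-square:", chi2, "p-value:", p_value)
# A large chi-square (small p-value) indicates that A and B are correlated.
```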
Data Integration: Challenges (Correlation Analysis)
Correlation coefficient:
$r_{A,B} = \dfrac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B} = \dfrac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n \, \sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product.
$r_{A,B} = 0$: A and B are independent.
Covariance:
$Cov(A,B) = \dfrac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$
where n is the number of tuples, and $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B.
Data Integration: Challenges (Correlation Analysis)
◼ Positive covariance: IF A and B both tend to be larger than their expected values
THEN Cov(A,B) > 0 → they rise together
◼ Negative covariance: IF A is larger than its expected value & B is smaller than its expected value
THEN Cov(A,B) < 0.
◼ Independence: IF A and B are independent THEN Cov(A,B) = 0.
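A small sketch computing covariance and the correlation coefficient with numpy; the two value series are invented for illustration.

```python
import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(A)
cov_ab = ((A - A.mean()) * (B - B.mean())).sum() / n   # covariance
r_ab = cov_ab / (A.std() * B.std())                    # correlation coefficient

print("Cov(A,B) =", cov_ab)   # > 0 here: A and B tend to rise together
print("r(A,B)   =", r_ab)
# np.cov(A, B, bias=True)[0, 1] and np.corrcoef(A, B)[0, 1] give the same values.
```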
Data Transformation
Discretization
Encoding
Aggregation
Normalization example (z-score): for the value 1,500 with mean 1,485.70 and standard deviation 718.27:
1,500 − 1,485.70 = 14.30, and 14.30 / 718.27 = 0.0199.
Data Transformation: Strategies
Attribute construction: New attributes constructed from the given ones. New
attributes are added to help the mining process.
Normalization: the attribute data are scaled so as to fall within a smaller, specified range, such as [−1.0, 1.0] or [0.0, 1.0].
Data Transformation: Strategies
Discretization: divide the range of a continuous attribute into intervals, so that many continuous attribute values are replaced by a small number of interval labels.
Example: a numeric attribute (e.g., age)
◼ Raw values are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior).
◼ The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for this
numeric attribute.
Concept hierarchy generation for nominal data: replacing low-level concepts with higher-level concepts,
e.g., attributes such as street can be generalized to higher-level concepts, like city or country.
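A minimal sketch of discretization with pandas, producing both interval labels and conceptual labels (youth, adult, senior); the ages and bin edges are illustrative.

```python
import pandas as pd

ages = pd.Series([4, 15, 23, 37, 45, 62, 71])

# Interval labels (0, 10], (10, 20], ...
intervals = pd.cut(ages, bins=range(0, 81, 10))

# Conceptual labels: a small concept hierarchy over the same attribute.
concepts = pd.cut(ages, bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```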
Data Transformation: Normalization
Min-max normalization to a new range [new_min_A, new_max_A]:
$v' = \dfrac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A$
Example: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
$\dfrac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \dfrac{v - \mu_A}{\sigma_A}$
Example: Let μ = 54,000 and σ = 16,000. Then $\dfrac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling:
$v' = \dfrac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$.
Example: Let A range from −986 to 917. The maximum absolute value is 986, so j = 3, which normalizes A to [−0.986, 0.917].
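The three calculations above can be reproduced in plain Python, e.g.:

```python
import math

v = 73_600.0

# Min-max normalization to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
print((v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0)   # 0.716...

# Z-score normalization
mu, sigma = 54_000.0, 16_000.0
print((v - mu) / sigma)                                    # 1.225

# Decimal scaling: j is the smallest integer such that max(|v'|) < 1
max_abs = 986.0
j = math.ceil(math.log10(max_abs))                         # j = 3
print(-986 / 10**j, 917 / 10**j)                           # -0.986 0.917
```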
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data.
Mining on the reduced data set should be more efficient yet produce the same (or
almost the same) analytical results.
Dimensionality reduction is the process of reducing the number of attributes under consideration. It includes data compression techniques that transform or project the original data onto a smaller space, such as wavelet transforms and principal component analysis (PCA).
Why? To improve the quality and efficiency of the mining process. Mining on a reduced set of attributes reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
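A minimal sketch of dimensionality reduction with PCA from scikit-learn; the random data and the choice of two components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # 100 tuples, 6 numeric attributes

pca = PCA(n_components=2)              # project onto a 2-dimensional space
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```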
Data Reduction: Numerosity Reduction
Parametric methods:
Assume the data fits some model → estimate model parameters → store only the parameters →
discard the data (except possible outliers).
Methods: Regression and Log-Linear Models.
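A minimal sketch of the regression idea: fit a linear model and keep only its parameters; the synthetic data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)   # roughly linear data

# Estimate the model parameters and keep only them.
slope, intercept = np.polyfit(x, y, deg=1)

# The raw y values can now be discarded and approximated as slope * x + intercept.
print(slope, intercept)
```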
Data Reduction: Numerosity Reduction
(Figure: data compression of the original data — lossless compression reconstructs the original data exactly, while lossy compression yields an approximation.)
Feature Selection
• Feature selection is the process of removing redundant or irrelevant features from the original data set.
– It is the process of selecting a small subset of the most informative features, those most predictive of the related class.
• As a result, the running time of the classifier that processes the data decreases, and accuracy often increases, because irrelevant features can introduce noise that negatively affects classification accuracy.
• Although both feature selection and dimensionality reduction methods are used to reduce the number of features in a dataset, there is an important difference: feature selection keeps a subset of the original features, while dimensionality reduction transforms the features into a lower dimension.
https://towardsdatascience.com/feature-selection-and-dimensionality-reduction-f488d1a035de#:~:text=While%20both%20methods%20are%20used,features%20into%20a%20lower%20dimension.
Feature Selection Methods
Filter FS Methods
• This method selects features without depending on the type of classifier used.
• It does that by using statistical tests to find correlations between a feature and a class.
• The advantage of this method is that it is simple and independent of the type of classifier used, so feature selection needs to be done only once (e.g., as a preprocessing step).
• The drawback of this method is that it ignores the interaction with the classifier, ignores the feature dependencies, and considers each feature separately.
*https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
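A minimal sketch of a filter method with scikit-learn: each feature is scored against the class with a univariate statistical test (ANOVA F-test here) and the top k are kept; the dataset and k are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class, independently of any classifier.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)    # one score per feature
print(X_selected.shape)    # (150, 2): only the 2 best-scoring features are kept
```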
Wrapper FS Methods
• In this method the feature selection is dependent upon the classifier, i.e., it uses the result of the classifier to determine the goodness of a given feature or attribute.
• It does that by training the model using a subset of the features; the method then tries to improve the model by adding/removing features.
• Wrapper methods are capable of removing redundant features from the data, since they take the mutual relationship between the features into account.
• The advantage of this method is that it removes the drawbacks of the filter method, i.e., it includes the interaction with the classifier and also takes the feature dependencies into account.
• The drawback of this method is that it is slower than the filter method, because it also takes these dependencies into account.
• The quality of the feature selection is directly measured by the performance of the classifier.
(Figure: wrapper feature selection — candidate subsets drawn from the set of all features are evaluated by the classifier.)
• Exhaustive search is the most greedy of all the wrapper methods, since it tries all the combinations of features and selects the best one.
• It can be slower than the step forward and step backward methods, since it evaluates every possible combination.
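A minimal sketch of a wrapper method using scikit-learn's SequentialFeatureSelector (step forward selection); the dataset and classifier choice are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The classifier is retrained on candidate subsets; "forward" adds one feature
# at a time, "backward" removes one at a time.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the selected features
```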
Embedded FS Methods
• This approach consists of algorithms that simultaneously perform model fitting and feature selection.
• Examples of such classifiers include decision trees (C4.5) and random forests.
• The advantage of this method is that it is less computationally intensive than a wrapper approach.
• The accuracy of the classifier depends not only on the classification algorithm but also on the feature selection method used.
• Selection of irrelevant and inappropriate features may confuse the classifier and lead to incorrect results.
Source: https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
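A minimal sketch of an embedded method: a random forest (mentioned above) ranks features while it is being fitted, and SelectFromModel keeps the most important ones; the dataset and threshold are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# The forest ranks features while it is being fitted; SelectFromModel keeps
# the features whose importance is at or above the median.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median")
X_selected = selector.fit_transform(X, y)

print(selector.estimator_.feature_importances_)   # importance per feature
print(X_selected.shape)                           # roughly half the features kept
```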
Hybrid FS Methods
Hybrid methods usually achieve the high accuracy characteristic of wrappers together with the high efficiency characteristic of filters.
Summary