
2 Data Preprocessing

IT 326: Data Mining


1st term 2023-2024

Chapter 2, “Data Mining: Concepts and Techniques” (4th ed.)


Outline
2

 Data Preprocessing
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Summary
Data Preprocessing:
3

Why Preprocess the Data?

LOW-quality data, when data mining is applied, leads to LOW-quality results.
Applying preprocessing to ENHANCE the quality of the data leads to ENHANCED quality of the results.
Data Quality
4

 Elements defining data quality:


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …

 Consistency: inconsistent naming, coding, format …

 Timeliness: timely updated?

 Believability: how much are the data trusted by users?

 Interpretability: how easily are the data understood?


Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify
or remove outliers, and resolve inconsistencies.
 Data integration
 Integration of multiple databases or files.
 Data reduction
 Dimensionality reduction.
 Numerosity reduction.
 Data transformation
 Normalization.
 Concept hierarchy generation.
 Discretization

Data Cleaning
6

 Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, transmission errors.
 Incomplete: lacking attribute values, or containing only aggregate data.
◼ e.g., Occupation=“ ” (missing data)
 Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”.
 Noisy: containing noise, errors, or outliers.
◼ e.g., Salary=“−10” (an error)
 Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?

Other issues affecting the quality of data


Data Cleaning: Incomplete (Missing) Values
7

 Data is not always available.


 e.g., many tuples have no recorded value for several attributes, such as customer
income in sales data.
 Missing data may be due to:
 equipment malfunction.

 inconsistent with other recorded data and thus deleted.


◼ (e.g., occupation is “teacher” but number of students is zero; age does not match date of birth)

 data not entered due to misunderstanding.


 certain data may not be considered important at the time of entry.

 Missing data may need to be inferred.


Data Cleaning: Incomplete (Missing) Values
8

How to Handle Missing Values?

 Ignore the tuple: usually done when class label is missing (when doing
classification)
◼ effective when the tuple contains several attributes with missing values
◼ not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value (a code sketch follows this list):
1) Manually: time-consuming and infeasible for large data sets with many missing values.
2) Use a global constant (such as a label like “Unknown”, −∞, or “NA”).
3) Use the central tendency for the attribute (e.g., the mean or median)
4) Use the attribute mean/median for all samples belonging to the same class.
5) Use the most probable value.
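As a concrete illustration of options 2)–4), here is a minimal pandas sketch; the column names and
values are made up for illustration and are not part of the slides.

```python
import pandas as pd
import numpy as np

# Illustrative data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 47000, 58000],
    "class":  ["low", "low", "high", "high", "low", "high"],
})

# 2) Global constant: flag missing values with a sentinel.
filled_constant = df["income"].fillna(-1)   # or a label such as "Unknown" for nominal data

# 3) Central tendency: fill with the attribute's overall median.
filled_median = df["income"].fillna(df["income"].median())

# 4) Class-wise central tendency: fill with the mean of the same class.
filled_by_class = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_by_class)
```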
Data Cleaning: Noisy Data
9

 Noise: random error or variance in a measured variable.

 Incorrect attribute values may be due to:


 faulty data collection instruments.
 data entry problems.
 data transmission problems.

 Smooth out the data to remove noise.


Data Cleaning: Noisy Data
10

How to Handle Noisy Data?


 Binning
 First, sort data and partition into (equal-frequency) bins.
 Then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
 Regression
 Smooth by fitting the data into regression functions.
 Outlier Analysis
 Detect and remove outliers using clustering.
◼ Values that fall outside of the set of clusters may be considered outliers.
 Combined computer and human inspection.
Data Cleaning: Noisy Data

Example: Binning to Handle Noise

 Equal-frequency binning: divides the range into N intervals, each containing approximately the
same number of samples.
 Equal-width binning: divides the range into N intervals of equal size.
Width = (max − min) / #bins
Example (sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34; 3 bins): Width = (34 − 4) / 3 = 10
Intervals: [4–13], [14–23], [24–34]
Bin 1: 4, 8
Bin 2: 15, 21, 21
Bin 3: 24, 25, 28, 34
Figure 3.2 – Binning methods for data smoothing.
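As a rough illustration (not the textbook's code) of the two partitioning schemes, the sketch below
bins the nine sorted values above into three equal-frequency and three equal-width bins and smooths
the equal-frequency bins by their means.

```python
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
n_bins = 3

# Equal-frequency binning: each bin gets roughly the same number of samples.
eq_freq_bins = np.array_split(data, n_bins)

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in eq_freq_bins])
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

# Equal-width binning: intervals of equal size over [min, max].
width = (data.max() - data.min()) / n_bins            # (34 - 4) / 3 = 10
edges = data.min() + width * np.arange(n_bins + 1)    # [4, 14, 24, 34]
labels = np.digitize(data, edges[1:-1], right=False)  # bin index for each value
for i in range(n_bins):
    print(f"Bin {i + 1}: {data[labels == i]}")        # [4 8], [15 21 21], [24 25 28 34]
```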
Data Integration
12

 Data integration:
 The merging of data from multiple sources into a coherent store.

 Challenges:
 Entity identification problem: How to match schemas and objects from different
sources?
 Redundancy and Correlation Analysis: Are any attributes correlated?
Data Integration: Challenges
13

Entity Identification Problem


 How can equivalent real-world entities from multiple data sources be matched up?
 The same attribute or object may have different names in different databases.
 Example: attribute names A.cust-id vs. B.cust-#; attribute values Bill Clinton vs. William Clinton.
 If both are kept → redundancy.
 Schema integration and object matching are tricky:
 Integrate metadata from different sources.
 Match equivalent real-world entities from multiple sources.
 Detecting and resolving data value conflicts due to different representations, different scales.
 Metadata can be used to avoid errors in schema integration.
Data Integration: Challenges
14

Attribute Redundancy and Correlation Analysis


 Redundant data often occur when integrating multiple databases.
 Causes of redundancy:
 An attribute may be redundant if it can be “derived” from another attribute or set
of attributes.
 Inconsistencies in attribute naming can also cause redundancies in the resulting
dataset.
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
Data Integration: Challenges
15

Handling Redundancy with Correlation Analysis


 Some redundancies can be detected by correlation analysis.
 Given two attributes, correlation analysis can measure how strongly one
attribute implies the other, based on the available data.
 Each data type has its own correlation measure:
 Nominal data: χ² (chi-square) test
 Numeric data: correlation coefficient and covariance
Data Integration: Challenges (Correlation Analysis )
16

 Nominal Data: Χ2 (chi-square)


 Observed value is the actual frequency (counted from data)
 Expected value is the expected frequency (calculated by formula).
χ² = Σ (observed − expected)² / expected

 Expected frequencies are calculated using:

e_ij = ( count(A = aᵢ) × count(B = bⱼ) ) / n

 The larger the χ² value, the more likely the two attributes are related (correlated).
 The cells that contribute the most to the χ² value are those whose actual count is very different
from the expected count.
Data Integration: Challenges (Correlation Analysis )
17

□ Nominal Data: Χ2 (chi-square) Example


Contingency table (n = 1500): observed counts, with expected counts in parentheses

                 male          Female        Sum (row)
fiction          250 (90)      200 (360)     450
Not fiction      50 (210)      1000 (840)    1050
Sum (col.)       300           1200          1500

(The underlying dataset records, for each of the 1500 people, their Gender and Preferred reading,
e.g., Male–Fiction, Female–Not fiction, Female–Fiction, …)

 The expected frequencies (the numbers in parentheses) are computed as, for example:

e₁₁ = e(male, fiction) = ( count(male) × count(fiction) ) / n = (300 × 450) / 1500 = 90

 χ² (chi-square) calculation to test the correlation between “preferred reading” and “gender”:

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
Data Integration: Challenges (Correlation Analysis )
18

□ Nominal Data: Χ2 (chi-square) Example


Null hypothesis H₀: the two attributes (“preferred reading” and “gender”) are independent.

1. Calculate the degrees of freedom: df = (r − 1)(c − 1), where r is the number of rows and c is the
   number of columns. Here df = (2 − 1)(2 − 1) = 1.
2. Set the significance level α, e.g., α = 0.001.
3. Find the rejection (critical) value from the χ² table: for df = 1 and α = 0.001, the critical
   value is 10.827.
4. Evaluate the result: if χ² > critical value, H₀ is rejected → the attributes are not independent
   → they are correlated. Here χ² = 507.93 >>> 10.827, so H₀ is rejected: “preferred reading” and
   “gender” are not independent → they are strongly correlated in the given group of people.
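To make the arithmetic concrete, here is a small sketch (not part of the slides) that recomputes the
example with SciPy's chi2_contingency; correction=False is passed so that the plain χ² formula above
is applied without Yates' continuity correction.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows = {fiction, not fiction}, cols = {male, female}.
observed = np.array([[250,  200],
                     [ 50, 1000]])

# correction=False reproduces the plain chi-square formula used on the slide.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)          # [[ 90. 360.]
                         #  [210. 840.]]
print(chi2, dof)         # ≈ 507.93, dof = 1
print(p_value < 0.001)   # True -> reject H0: the attributes are correlated
```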
Data Integration: Challenges (Correlation Analysis )
19

□ Numeric Data: Correlation coefficient

r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n·σA·σB) = ( Σᵢ aᵢbᵢ − n·Ā·B̄ ) / (n·σA·σB)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are
the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-product.

 r(A,B) > 0: A and B are positively correlated
 r(A,B) = 0: independent (no linear correlation)
 r(A,B) < 0: A and B are negatively correlated


Data Integration: Challenges (Correlation Analysis )
20

□ Numeric Data: Covariance

Cov(A, B) = E[ (A − Ā)(B − B̄) ] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

where n is the number of tuples, and Ā and B̄ are the respective means (expected values) of A and B.

□ Covariance can be simplified as:

Cov(A, B) = E(A·B) − Ā·B̄

We will use the simplified equation for calculations.

□ Covariance is related to the correlation coefficient:

r(A,B) = Cov(A, B) / (σA·σB)
Data Integration: Challenges (Correlation Analysis )
21

□ Numeric Data: Covariance

◼ Positive covariance: IF A and B both tend to be larger than their expected values
THEN Cov(A,B) > 0 → they rise together
◼ Negative covariance: IF A is larger than its expected value & B is smaller than its expected value
THEN Cov(A,B) < 0.
◼ Independence: IF A and B are independent THEN Cov(A,B) = 0.

◼ But the converse is not true.


▪ Cov(A,B) = 0 does NOT imply that A and B are independent.
▪ Some pairs of random variables may have a covariance of 0 but are not independent.
▪ Only under some additional assumptions does a covariance of 0 imply independence.
Data Integration: Challenges (Correlation Analysis )
22

□ Numeric Data: Covariance Example

Stock prices observed at five time points for AllElectronics and HighTech:

Time point    AllElectronics    HighTech
T1            6                 20
T2            5                 10
T3            4                 14
T4            3                 5
T5            2                 5

1. E(AllElectronics) = (6 + 5 + 4 + 3 + 2) / 5 = 20 / 5 = 4
2. E(HighTech) = (20 + 10 + 14 + 5 + 5) / 5 = 54 / 5 = 10.80
3. Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5) / 5 − 4 × 10.80
   = 50.20 − 43.20 = 7

Since the covariance is positive, the two stock prices tend to rise together.
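The same numbers can be checked quickly with NumPy (a sketch, not from the slides); note that np.cov
divides by n − 1 by default, so ddof=0 is passed to match the population formula used above.

```python
import numpy as np

all_electronics = np.array([6, 5, 4, 3, 2])
high_tech       = np.array([20, 10, 14, 5, 5])

# Simplified formula from the slide: Cov(A, B) = E(A*B) - mean(A)*mean(B)
cov_simplified = np.mean(all_electronics * high_tech) - all_electronics.mean() * high_tech.mean()
print(round(cov_simplified, 2))                                  # 7.0

# Same value via np.cov; ddof=0 uses the population formula (divide by n).
print(round(np.cov(all_electronics, high_tech, ddof=0)[0, 1], 2))  # 7.0

# Correlation coefficient: Cov(A, B) / (sigma_A * sigma_B)
print(np.corrcoef(all_electronics, high_tech)[0, 1])             # ≈ 0.87 -> positively correlated
```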
Data Transformation
23

 Data are transformed or consolidated into forms appropriate for mining.


 the resulting mining process may be more efficient, and the patterns found may be
easier to understand.

 Attribute Transformation: A function that maps the entire set of values


of a given attribute to a new set of replacement values such that each old
value can be identified with one of the new values.
Data Transformation

24

Common attribute transformations include:
 Discretization
 Encoding
 Aggregation
 Normalization

(Figure example of z-score normalization: 1,500 − mean = 1,500 − 1,485.70 = 14.30;
14.30 / stdev = 14.30 / 718.27 = 0.0199.)
Data Transformation: Strategies
25

 Smoothing: Remove noise from data.

 Attribute construction: New attributes constructed from the given ones. New
attributes are added to help the mining process.

 Aggregation: Summary or aggregation operations are applied to the data.


 e.g., daily sales data may be aggregated so as to compute monthly and annual total amounts.

 Normalization: the attribute data are scaled so as to fall within a smaller, specified range,
such as [−1.0, 1.0] or [0.0, 1.0].
Data Transformation: Strategies
26

 Discretization: divide the range of continuous attribute into intervals. Numerous continuous
attribute values are replaced by small interval labels.
 Example: a numeric attribute (e.g., age)
◼ Raw values are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior).
◼ The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for this
numeric attribute.

More than one concept hierarchy can be


defined for the same attribute to
accommodate the needs of various users.

 Concept hierarchy generation for nominal data: replacing low level concepts by higher level
concepts
 i.e. attributes such as street can be generalized to higher-level concepts, like city or country.
Data Transformation: Normalization
27

 The measurement unit used can affect the data analysis.


 For example, changing measurement units from meters to inches for height (2.54 cm = 1 inch), or
from kilograms to pounds for weight (1 kg = 2.2 pounds), may lead to very different results.
 In general, expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such an attribute greater effect or “weight.”
 To help avoid dependence on the choice of measurement units, the data should be normalized or
standardized.
 Normalization is transforming the data to fall within a smaller or common range such
as [-1, 1] or [0.0, 1.0].
 Normalizing the data gives all attributes an equal weight.
 For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g.,
income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
Data Transformation: Normalization
28

 Min-max normalization: maps a value v of attribute A to the range [new_minA, new_maxA]:

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

Example: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation of A):

v' = (v − μA) / σA

Example: Let μ = 54,000 and σ = 16,000. Then 73,600 is normalized to (73,600 − 54,000) / 16,000 = 1.225

 Decimal scaling normalization: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Example: Let A range from −986 to 917. The maximum absolute value is 986, so j = 3 (divide by 1,000),
which normalizes A to [−0.986, 0.917].
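All three schemes are easy to express directly; below is a minimal NumPy sketch (illustrative, not
from the slides) applied to the income and decimal-scaling figures used above.

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization: (v - mean) / std."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by 10^j with the smallest j such that max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
    return v / 10 ** j

income = np.array([12_000, 54_000, 73_600, 98_000])
print(min_max(income))                         # 73,600 -> 0.716..., as in the slide example
print((73_600 - 54_000) / 16_000)              # 1.225, the slide's z-score example (given mean/std)
print(decimal_scaling(np.array([-986, 917])))  # [-0.986  0.917]
```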
Data Reduction
29

Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data.

 Mining on the reduced data set should be more efficient yet produce the same (or
almost the same) analytical results.

 Data reduction strategies include:


◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression.
Data Reduction: Dimensionality Reduction
30

Dimensionality reduction is the process of reducing the number of attributes under consideration.
Including:
 Data compression techniques transform or project the original data onto a smaller space, such
as Wavelet transforms and principal components analysis (PCA).

 Attribute subset selection removes irrelevant, weakly relevant, or redundant attributes or


dimensions.
◼ Irrelevant attributes: attributes contain no information that is useful for the data mining task at hand.
◼ Redundant attributes: attributes duplicate much or all of the information contained in one or more other attributes.
 Attribute construction (Start with one column only, progressively adding one column at a time,
i.e., the column that produces the highest increase in performance)

Why ? to improve quality and efficiency of the mining process. Mining on a reduced set of attributes reduces the
number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
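As one concrete route for the PCA-based reduction mentioned above, here is a minimal scikit-learn
sketch; the iris data and the choice of two components are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)         # 150 samples, 4 numeric attributes

# Standardize first so attributes with larger ranges do not dominate (cf. normalization).
X_std = StandardScaler().fit_transform(X)

# Project the 4 original attributes onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```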
Data Reduction: Numerosity Reduction
31

Numerosity Reduction : reduce data volume by choosing alternative, smaller forms


of data representation.

 Two types: Parametric methods and Non-parametric methods

 Parametric methods:
 Assume the data fits some model → estimate model parameters → store only the parameters →
discard the data (except possible outliers).
 Methods: Regression and Log-Linear Models.
Data Reduction: Numerosity Reduction
32

 Non-parametric methods: Do not assume models.


 Histogram: Divide data into buckets and store average (sum) for
each bucket
 Clustering: Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and diameter)
only.
 Sampling: obtaining a small sample s to represent the whole
data set N.
◼ Key: Choose a representative subset of the data.
 Data cube aggregation: Data can be aggregated so that the resulting data summarize the
original data (smaller in volume), without loss of the information necessary for the analysis task.
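For the sampling and histogram strategies in particular, here is a short pandas sketch (the data set,
sample fraction, and bucket count are illustrative, not from the slides).

```python
import pandas as pd
import numpy as np

# Illustrative data set N with 100,000 tuples.
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50_000, 15_000, 100_000),
                   "age": rng.integers(18, 90, 100_000)})

# Simple random sample without replacement: keep 1% of the tuples as the sample s.
sample = df.sample(frac=0.01, random_state=0)
print(len(sample))            # 1,000 rows representing the full data set

# Histogram-style reduction: 10 equal-width buckets, store only each bucket's mean.
buckets = pd.cut(df["income"], bins=10)
print(df.groupby(buckets, observed=True)["income"].mean())
```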
Data Reduction: Data Compression
33

 Obtain a reduced or “compressed” representation of the original data.


 Two types :
 Lossless: if the original data can be reconstructed from the compressed data without any information loss
 Lossy: if we can reconstruct only an approximation of the original data.

(Figure: original data → compressed data; lossless compression reconstructs the original data
exactly, lossy compression reconstructs only an approximation of the original data.)

 Dimensionality and numerosity reduction may also be considered as forms of data


compression.
Feature Selection Methods
34

• Feature selection is the process of removing redundant or irrelevant features from the
original data set.
– a process of selecting the smallest subset of informative features that are most predictive of the
related class.

• This maximizes the classifier’s ability to classify samples accurately.

• As a result, the running time of the classifier that processes the data decreases and its accuracy
increases, because irrelevant features can introduce noisy data that negatively affects
classification accuracy.

Chapter 7, “Data Mining: Concepts and Techniques” (4th ed.)


Feature Selection vs Dimensionality Reduction
35

• While both Feature Selection and Dimensionality Reduction methods are used to reduce the
number of features in a dataset, there is an important difference.

• Feature selection is simply selecting and excluding given features without


changing them such as
– Remove features with missing values
– Remove features with low variance
– Remove highly correlated features

• Dimensionality reduction transforms features into a lower dimension. (ex: PCA)

https://towardsdatascience.com/feature-selection-and-dimensionality-reduction-f488d1a035de
Feature Selection Methods
36

• The feature selection methods can be classified into four categories:

Filter FS Methods
– Independent of the classification algorithm.
– Computationally simple and fast.
– Worse classification performance.

Wrapper FS Methods
– Model hypothesis search within the feature subset search space.
– High classification accuracy.
– Computationally complex, expensive, and slow.

Embedded FS Methods
– Feature selection is embedded in the classifier.
– Less computational than wrapper approaches.

Hybrid FS Methods
– Search in the combined space of optimal feature subsets and hypotheses.
– Offer a good tradeoff between filter and wrapper approaches.
Filter FS Methods
37

• This method selects the feature without depending upon the type of classifier used.
• It does that by using statistical tests to find correlations between a feature and a class.
• The advantage of this method is that it is simple and independent of the type of classifier used,
so feature selection needs to be done only once (e.g., as a preprocessing step).
• The drawback of this method is that it ignores the interaction with the classifier, ignores
feature dependencies, and considers each feature separately.

*https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
Filter FS Methods
38

 Univariate Filter Methods


1. Individual features are ranked according to specific criteria
2. The top N features are selected

 It may select redundant features because the relationship between individual


features is not taken into account while making decisions.

 Examples of criteria include variance and correlation of the feature.


 Variance thresholds remove features whose values don’t change much from observation to
observation (i.e. their variance falls below a threshold). These features provide little value.
 Correlation examines each feature individually to determine the strength of the relationship of the
feature with the response variable.
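A possible sketch of these two univariate criteria, using scikit-learn's VarianceThreshold plus a
simple correlation ranking; the features, the 0.05 variance threshold, and the data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: three candidate features and a numeric response y.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "f_constant": np.ones(200) * 5 + rng.normal(0, 0.01, 200),  # nearly constant
    "f_noise":    rng.normal(0, 1, 200),
    "f_signal":   rng.normal(0, 1, 200),
})
y = 3 * X["f_signal"] + rng.normal(0, 0.5, 200)

# Variance threshold: drop features whose variance falls below 0.05.
vt = VarianceThreshold(threshold=0.05)
vt.fit(X)
kept = X.columns[vt.get_support()]
print(list(kept))                                   # ['f_noise', 'f_signal']

# Correlation criterion: rank each kept feature by |correlation| with the response.
corr_with_y = X[kept].apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
print(corr_with_y.sort_values(ascending=False))     # f_signal ranks highest
```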
Filter FS Methods
39

 Multivariate filter methods


 It calculates all pair-wise relationships among features according to a criterion

 They are capable of removing redundant features from the data since they take the
mutual relationship between the features into account.

 An example of criteria is Correlation of the features.


 Correlation thresholds remove features that are highly correlated with others (i.e. its values change
very similarly to another’s). These features provide redundant information.
Wrapper FS Methods
40

• In this method feature selection depends on the classifier, i.e., it uses the result of the
classifier to determine the goodness of a given feature or attribute.
• It does that by training the model using a subset of the features; after that, this method tries
to improve the model by adding/removing features.
• The advantage of this method is that it removes the drawback of the filter method, i.e., it
includes the interaction with the classifier and also takes feature dependencies into account.
• The drawback of this method is that it is slower than the filter method because it also takes
these dependencies into account.
• The quality of the feature selection is directly measured by the performance of the
classifier.
Wrapper FS Methods
41

(Figure: wrapper feature selection — candidate subsets drawn from the set of all features are
evaluated by the classifier.)
Wrapper FS Methods
42

 Step Forward Feature Selection


1. It starts with an empty set of features.
2. The performance of the classifier is evaluated with respect to each feature. The
feature that performs the best is selected out of all the features.
3. The first feature is tried in combination with all the other features. The
combination of two features that yield the best algorithm performance is
selected.
4. The process continues until the specified number of features are selected.
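One way to realize this procedure is a sketch that leans on scikit-learn's SequentialFeatureSelector
rather than hand-rolling the loop; the estimator, dataset, and the target of 2 features are
illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step forward selection: start from an empty set and greedily add the feature
# that most improves cross-validated accuracy, until 2 features are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",      # use "backward" for step backward selection
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask over the 4 iris features
```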
Wrapper FS Methods
43

 Step Backwards Feature Selection


1. It starts from the set of all features
2. One feature is removed in round-robin fashion from the feature set and the
performance of the classifier is evaluated. The feature set that yields the best
performance is retained.
3. Another feature is then removed in a round-robin fashion, and the performance of each
combination of features excluding the 2 removed features is evaluated.
4. This process continues until the specified number of features remain in the
dataset.
Wrapper FS Methods
44

 Exhaustive Feature Selection


1. The performance of a machine learning algorithm is evaluated against all possible combinations
of the features in the dataset.
2. The feature subset that yields best performance is selected.

 It is the most exhaustive (brute-force) of all the wrapper methods, since it tries every
combination of features and selects the best.
 It can be much slower than the step forward and step backward methods, since it
evaluates all feature combinations.


Embedded FS Methods
45

• This approach consists of algorithms which simultaneously perform model fitting and
feature selection.
• Examples of classifiers include decision tree (C4.5) and random forest.
• The advantage of this method is that it is less computationally intensive than a
wrapper approach.
• The accuracy of the classifier depends not only on the classification algorithm but
also on the feature selection method used.
• Selection of irrelevant and inappropriate features may confuse the classifier and
lead to incorrect results.
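As an illustration, here is a sketch under the assumption that a tree ensemble's built-in feature
importances are used as the embedded criterion; the dataset, forest size, and "mean" threshold are
illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# The forest is fitted once; feature importances fall out of the fitting itself.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)    # e.g. (569, 30) -> (569, ~10)
```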
Embedded FS Methods
46

Source: https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
Hybrid FS Methods
47

 It combines the best properties of filters and wrappers.


 First, a filter method is used to reduce the dimension of the feature space, possibly obtaining
several candidate subsets.


 Then, a wrapper is employed to find the best candidate subset.

 Hybrid methods usually achieve the high accuracy characteristic of wrappers and the high
efficiency characteristic of filters.
Summary
48

 Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability.


 Data cleaning: e.g. missing/noisy values, outliers.
 Data integration from multiple sources: Entity identification problem, correlation analysis.
 Data transformation:
 Normalization
 Concept hierarchy generation
 Data reduction:
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Feature selection:
 Filter Feature Selection
 Wrapper Feature Selection
 Hybrid Feature Selection
