04 DM BI Data Preprocessing
Data Mining and Business Intelligence
Data Preprocessing
Module 2
Created/Adopted/Modified for
Data Mining and Business Intelligence – MCA II Semester
Vidya Vikas Institute of Engineering & Technology
Mysore
2023-24
GPD
Data Preprocessing
Today’s real-world databases contain data that is noisy, missing, and inconsistent. Two questions therefore arise:
“How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results?”
“How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”
Data Preprocessing Techniques
Data cleaning can be applied to remove noise and correct inconsistencies in data.
Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0.
This can improve the accuracy and efficiency of mining algorithms involving
distance measurements.
Data Preprocessing Techniques
Data cleaning: remove noise and correct inconsistencies in the data.
Data integration: merge data from multiple sources into a coherent data store such as a data warehouse.
Data reduction: reduce data size by aggregating, eliminating redundant features, or clustering.
Data transformation: convert data from one form to another, e.g., by normalization or discretization.
What is Data Preprocessing? — Major Tasks
- Data cleaning
  - Handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation
Why Preprocess the Data? — Data Quality Issues
- Measures for data quality: a multidimensional view
  - Accuracy: correct or wrong, accurate or not
  - Completeness: not recorded, unavailable, ...
  - Consistency: some records modified but some not, dangling references, ...
  - Timeliness: is the data updated in a timely manner?
  - Believability: how much can the data be trusted to be correct?
  - Interpretability: how easily can the data be understood?
Data Cleaning
- Data in the real world is dirty: lots of potentially incorrect data, e.g., instrument faults, human or computer error, and transmission errors
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., Occupation = “ ” (missing data)
- Noisy: containing noise, errors, or outliers
  - e.g., Salary = “−10” (an error)
- Inconsistent: containing discrepancies in codes or names, e.g.,
  - Age = “42”, Birthday = “03/07/2010”
  - Was rating “1, 2, 3”, now rating “A, B, C”
  - Discrepancies between duplicate records
- Intentional (e.g., disguised missing data)
  - Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - Equipment malfunction
  - Inconsistency with other recorded data, leading to deletion
  - Data not entered due to misunderstanding
  - Certain data not considered important at the time of entry
  - No history or changes of the data registered
- Missing data may need to be inferred
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (in classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with
  - a global constant, e.g., “unknown” (which may effectively form a new class!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class (smarter)
  - the most probable value: inference-based methods such as a Bayesian formula or a decision tree
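As a rough illustration, the fill-in strategies above can be sketched with pandas; the small DataFrame, the class column "cls", and the sentinel value are made up for this example.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 42_000, None, 46_000],
})

# 1. Ignore the tuple (drop rows with a missing income)
dropped = df.dropna(subset=["income"])

# 2. Fill with a global constant (a sentinel for "unknown")
const_filled = df["income"].fillna(-1)

# 3. Fill with the attribute mean
mean_filled = df["income"].fillna(df["income"].mean())

# 4. Fill with the attribute mean of the same class (smarter)
class_mean_filled = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(class_mean_filled)
```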
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - Faulty data collection instruments
  - Data entry problems
  - Data transmission problems
  - Technology limitations
  - Inconsistency in naming conventions
- Other data problems
  - Duplicate records
  - Incomplete data
  - Inconsistent data
How to Handle Noisy Data?
- Binning
  - First sort the data and partition it into (equal-frequency) bins
  - Then smooth by bin means, bin medians, bin boundaries, etc.
- Regression
  - Smooth by fitting the data to regression functions
- Clustering
  - Detect and remove outliers
- Semi-supervised: combined computer and human inspection
  - Detect suspicious values and have a human check them (e.g., possible outliers)
Handling Noisy Data : Binning
Equal-width (distance) partitioning
- Divides the range into N intervals of equal size: a uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
- The most straightforward approach, but outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
Handling Noisy Data : Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
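The same smoothing can be reproduced in a few lines of plain Python; this is an illustrative sketch, not code from the course material.

```python
# Equal-frequency binning with smoothing by bin means and bin boundaries
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace every value by the mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```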
Handling Noisy Data : Clustering
Partition the data set into clusters; values that fall outside the clusters may be detected as outliers.
- These are probably noise.
Binning vs. Clustering
Handling Redundancy in Data Integration
- Redundant data often occur when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis and covariance analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (for Categorical Data)
- The χ² (chi-square) test checks whether two categorical attributes are independent by comparing the observed co-occurrence counts with the counts expected under independence:

  \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}

- The larger the χ² value, the more likely the two attributes are related, and one of them may therefore be redundant.
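For illustration, a chi-square test of independence can be run with SciPy on a small contingency table of two categorical attributes; the counts below are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table of two categorical attributes
#                fiction   non_fiction
# male              250          50
# female            200        1000
observed = [[250, 50],
            [200, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}")
# A very small p-value means the two attributes are correlated (not independent),
# so one of them may be redundant after data integration.
```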
Variance for Single Variable (Numerical Data)
- The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X:

  \sigma^2 = \operatorname{var}(X) = E[(X-\mu)^2] =
  \begin{cases}
    \sum_x (x-\mu)^2 f(x) & \text{if } X \text{ is discrete} \\
    \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx & \text{if } X \text{ is continuous}
  \end{cases}

  where σ² is the variance of X, σ is the standard deviation, and µ = E[X] is the mean (expected value) of X.
- That is, the variance is the expected value of the squared deviation from the mean.
- It can also be written as

  \sigma^2 = \operatorname{var}(X) = E[(X-\mu)^2] = E[X^2] - \mu^2 = E[X^2] - (E[X])^2

- The sample variance is the average squared deviation of the data values x_i from the sample mean \hat{\mu}:

  \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
Covariance for Two Variables
- Covariance between two variables X_1 and X_2:

  \sigma_{12} = E[(X_1-\mu_1)(X_2-\mu_2)] = E[X_1 X_2] - \mu_1\mu_2 = E[X_1 X_2] - E[X_1]E[X_2]

  where µ_1 = E[X_1] is the mean or expected value of X_1, and similarly for µ_2.
- Sample covariance between X_1 and X_2:

  \hat{\sigma}_{12} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1}-\hat{\mu}_1)(x_{i2}-\hat{\mu}_2)

- Sample covariance is a generalization of the sample variance:

  \hat{\sigma}_{11} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1}-\hat{\mu}_1)(x_{i1}-\hat{\mu}_1) = \frac{1}{n} \sum_{i=1}^{n} (x_{i1}-\hat{\mu}_1)^2 = \hat{\sigma}_1^2

- Positive covariance: σ_12 > 0
- Negative covariance: σ_12 < 0
- Independence: if X_1 and X_2 are independent, then σ_12 = 0, but the reverse is not true
  - Some pairs of random variables may have covariance 0 yet not be independent
  - Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Example: Calculation of Covariance
q Suppose two stocks X1 and X2 have the following values in one week:
q (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
q Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
q Covariance formula
  \sigma_{12} = E[(X_1-\mu_1)(X_2-\mu_2)] = E[X_1 X_2] - \mu_1\mu_2 = E[X_1 X_2] - E[X_1]E[X_2]
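Working the numbers through this formula (a quick check of the answer, which is not spelled out on the slide):

  E[X_1] = \frac{2+3+5+4+6}{5} = 4, \qquad E[X_2] = \frac{5+8+10+11+14}{5} = 9.6

  E[X_1 X_2] = \frac{2\cdot 5 + 3\cdot 8 + 5\cdot 10 + 4\cdot 11 + 6\cdot 14}{5} = \frac{212}{5} = 42.4

  \sigma_{12} = E[X_1 X_2] - E[X_1]E[X_2] = 42.4 - 4 \times 9.6 = 4 > 0

Since the covariance is positive, the two stock prices tend to rise (and fall) together.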
Correlation Analysis (for Numeric Data)
- Sample correlation coefficient for two attributes X_1 and X_2:

  \hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2}
  = \frac{\sum_{i=1}^{n} (x_{i1}-\hat{\mu}_1)(x_{i2}-\hat{\mu}_2)}
         {\sqrt{\sum_{i=1}^{n} (x_{i1}-\hat{\mu}_1)^2}\,\sqrt{\sum_{i=1}^{n} (x_{i2}-\hat{\mu}_2)^2}}
  where n is the number of tuples, µ_1 and µ_2 are the respective means of X_1 and X_2, and σ_1 and σ_2 are the respective standard deviations of X_1 and X_2.
- If ρ_12 > 0: X_1 and X_2 are positively correlated (X_1’s values increase as X_2’s do); the higher the value, the stronger the correlation
- If ρ_12 = 0: uncorrelated (independent only under the same assumptions discussed for covariance)
- If ρ_12 < 0: negatively correlated
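A quick numerical check of the covariance and correlation for the stock example, using NumPy with the population (1/n) convention used above; this snippet is illustrative only.

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6])     # stock X1
x2 = np.array([5, 8, 10, 11, 14])  # stock X2

# Population (1/n) covariance, matching the sample-covariance formula above
cov12 = np.cov(x1, x2, bias=True)[0, 1]

# Correlation coefficient (scale-free, always in [-1, 1])
rho12 = np.corrcoef(x1, x2)[0, 1]

print(cov12)  # 4.0   -> positive: the two stocks tend to rise together
print(rho12)  # ~0.94 -> strong positive correlation
```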
Visualizing Changes of Correlation Coefficient
Covariance Matrix
- The variance and covariance information for the two variables X_1 and X_2 can be summarized in a 2 × 2 covariance matrix:

  \Sigma = E[(X-\mu)(X-\mu)^T]
  = E\!\left[\begin{pmatrix} X_1-\mu_1 \\ X_2-\mu_2 \end{pmatrix}
       \begin{pmatrix} X_1-\mu_1 & X_2-\mu_2 \end{pmatrix}\right]
  = \begin{pmatrix}
      E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] \\
      E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)]
    \end{pmatrix}
  = \begin{pmatrix}
      \sigma_1^2 & \sigma_{12} \\
      \sigma_{21} & \sigma_2^2
    \end{pmatrix}

- Generalizing to d dimensions, Σ is the d × d matrix whose (i, j)-th entry is \sigma_{ij} = E[(X_i-\mu_i)(X_j-\mu_j)].
Variance, Covariance, Correlation
Variance provides insight into how much individual data points deviate from the mean. If an attribute varies very little across the majority of the data points, it does not contribute much to the variability and can be flagged as redundant.
Variance, Covariance, Correlation
Covariance between two variables measures how they change together.
A positive covariance indicates that the two variables tend to increase or decrease together; a negative covariance indicates that they tend to move in opposite directions.
Dimensionality Reduction – Wavelet Transforms
Separate coefficients:
- Low-frequency components (approximation): capture the general trend of the data.
- High-frequency components (details): capture noise and minor fluctuations.
Discard high-frequency components: these usually represent noise or less important detail, so they can be removed, resulting in fewer attributes.
Reconstruct data: use only the low-frequency components to form a compressed version of the original dataset.
Dimensionality Reduction – Principal Components Analysis
Principal Components Analysis is a statistical technique used to reduce
the dimensionality of data by transforming the data into a set of
linearly uncorrelated variables called Principal Components.
Suppose that the data to be reduced consist of tuples or data vectors
described by n attributes or dimensions.
Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the data, where
k ≤ n.
The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction.
Principal Component Analysis (PCA)
- PCA: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
- The original data are projected onto a much smaller space, resulting in dimensionality reduction
- Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space
[Figure: a ball travelling in a straight line, filmed by three cameras; the data from the three cameras contain much redundancy.]
Dimensionality Reduction – PCA Process
1. Standardize the data: ensure that each feature has a mean of 0 and unit variance.
2. Compute the covariance matrix: find the relationships between all the features.
3. Calculate eigenvectors and eigenvalues: these determine the directions (principal components) and the magnitude of variance in each direction.
4. Select principal components: choose the top components that capture the most variance and discard the rest.
5. Transform the data: project the data onto the selected principal components, reducing the dimensionality.
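A minimal sketch of this PCA pipeline with scikit-learn; the synthetic dataset and the choice of k = 2 components are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: 100 tuples described by 5 correlated attributes
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))])

# 1. Standardize the data (mean 0, unit variance)
X_std = StandardScaler().fit_transform(X)

# 2-4. The covariance matrix, eigenvectors/eigenvalues, and component
#      selection are handled internally by PCA; keep the top k = 2 components
pca = PCA(n_components=2)

# 5. Project the data onto the selected principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                # (100, 2): 5 dimensions reduced to 2
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```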
Dimensionality Reduction – Attribute Subset Selection
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
Mining on a reduced set of attributes has an additional benefit: It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
Dimensionality Reduction – Attribute Subset Selection
Methods include:
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
Data Reduction Techniques
Data reduction techniques are methods used to reduce the volume,
size, or complexity of large datasets while preserving as much
relevant information as possible.
Numerosity Reduction / Data Reduction
- Also known as data size reduction: reduce the number of records/objects/rows in consideration.
- That is, obtain a reduced representation of the dataset that is much smaller in volume (numerosity reduction).
- Techniques include:
  - Regression and log-linear models
  - Data compression
Data Reduction: Parametric vs. Non-Parametric Methods
- Reduce data volume by choosing alternative, smaller forms of data representation.
- Parametric methods (e.g., regression) fit a model to the data and store only the model parameters; non-parametric methods (e.g., histograms, clustering, sampling) store reduced representations of the data itself.
Linear and Multiple Regression
- Linear regression: Y = wX + b
  - Data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
  - The two regression coefficients, w and b, specify the line and are estimated from the data at hand
  - The least-squares criterion is applied to the known values Y_1, Y_2, … and X_1, X_2, …
- Nonlinear regression:
  - Data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables
  - The data are fitted by a method of successive approximations
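A small sketch of fitting Y = wX + b by least squares with NumPy, on synthetic data; only the two coefficients need to be stored as the reduced representation.

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

# Least-squares estimates of the two regression coefficients w and b
w, b = np.polyfit(x, y, deg=1)

print(f"w = {w:.2f}, b = {b:.2f}")
# Each (x, y) pair can now be approximated by the model prediction w*x + b,
# so only the two parameters (plus outliers, if desired) need to be stored.
```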
Histogram
Divide the data into bins (intervals) and aggregate the values within each bin:
- get the count of data points in each bin, or
- store a summary (e.g., the sum or average) of the values in each bin.
Histogram Analysis
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): equal number of values per bucket
[Figure: example histogram of price values with buckets from 10,000 to 90,000.]
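Equal-width and equal-frequency bucketing can be sketched with NumPy as follows; the price values are the same toy list used in the binning example.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width buckets: 3 intervals of equal range
counts, edges = np.histogram(prices, bins=3)
print(edges)   # bucket boundaries, width = (max - min) / 3
print(counts)  # number of values per bucket

# Equal-frequency (equal-depth) buckets: boundaries at the quantiles
q_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
q_counts, _ = np.histogram(prices, bins=q_edges)
print(q_edges, q_counts)  # roughly the same number of values per bucket
```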
Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is “smeared”
- Hierarchical clustering is possible, and clusters can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 10
Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Key principle: choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
  - Adaptive sampling methods, e.g., stratified sampling, handle skew better
- Note: sampling may not reduce database I/Os (data is read a page at a time)
Types of Sampling
- Simple random sampling: equal probability of selecting any particular item
  - Without replacement: once an item is selected, it is removed from the population
  - With replacement: a selected item is not removed, so it may be drawn again
- Stratified sampling: partition the data set into strata and draw samples from each stratum (e.g., proportionally); useful for skewed data
By using sampling to reduce the dataset, data miners can work with more manageable datasets while still retaining the critical information required for meaningful analysis and valuable discoveries.
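A short sketch of simple random and stratified sampling with pandas (grouped sampling needs pandas 1.1+); the DataFrame and the 10% sampling fraction are assumptions.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with a skewed 'segment' attribute
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "segment": rng.choice(["retail", "corporate", "vip"], p=[0.8, 0.15, 0.05], size=1000),
    "income":  rng.normal(50_000, 15_000, size=1000).round(),
})

# Simple random sampling without replacement (10% of the tuples)
srs = df.sample(frac=0.10, random_state=1)

# Simple random sampling with replacement
srswr = df.sample(frac=0.10, replace=True, random_state=1)

# Stratified sampling: 10% from each segment, so rare segments are not lost
stratified = df.groupby("segment", group_keys=False).sample(frac=0.10, random_state=1)

print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())
```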
Data Cube Aggregation
- The lowest level of a data cube (base cuboid)
  - Holds the aggregated data for an individual entity of interest
  - E.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
- Reference appropriate levels
  - Use the smallest representation that is sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible
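For example, daily sales can be rolled up to monthly and annual totals with pandas; the data below is made up.

```python
import pandas as pd
import numpy as np

# Hypothetical daily sales for two years
days = pd.date_range("2022-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(7)
sales = pd.DataFrame({"date": days, "amount": rng.integers(100, 1000, size=len(days))})

# Aggregate to coarser levels of the time dimension (roll-up)
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()

print(monthly.head())
print(annual)   # two rows instead of ~730 daily rows
```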
Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (as opposed to audio)
  - Typically short and varying slowly with time
- Data reduction and dimensionality reduction may also be considered forms of data compression
[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression reconstructs only an approximation of the original data.]
Wavelet Transform: A Data Compression Technique
- Wavelet transform
  - Decomposes a signal into different frequency subbands
  - Applicable to n-dimensional signals
  - Data are transformed to preserve relative distances between objects at different levels of resolution
  - Allows natural clusters to become more distinguishable
  - Used for image compression
Wavelet Transformation
- Discrete wavelet transform (DWT) for linear signal processing and multi-resolution analysis (common wavelet families: Haar-2, Daubechies-4)
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length L of the input must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has two functions: smoothing and difference
  - They are applied to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
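A minimal sketch of a one-level Haar DWT and a compressed reconstruction using the PyWavelets package (assumed to be installed); the signal and the coefficient threshold are arbitrary choices for illustration.

```python
import numpy as np
import pywt

# Synthetic signal: length must be a power of 2 (pad with zeros otherwise)
rng = np.random.default_rng(3)
signal = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.2 * rng.normal(size=64)

# One level of the Haar transform: smoothing (approximation) and difference (detail)
approx, detail = pywt.dwt(signal, "haar")   # each of length 64 / 2 = 32

# Keep only the strongest detail coefficients; zero out the weak ones
detail_compressed = np.where(np.abs(detail) > 0.1, detail, 0.0)

# Reconstruct an approximation of the original signal
reconstructed = pywt.idwt(approx, detail_compressed, "haar")
print(np.max(np.abs(signal - reconstructed)))   # small reconstruction error
```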
Why Wavelet Transform?
- Uses hat-shaped filters
  - Emphasizes regions where points cluster
  - Suppresses weaker information at their boundaries
- Effective removal of outliers
  - Insensitive to noise and to input order
- Multi-resolution
  - Detects arbitrarily shaped clusters at different scales
- Efficient
  - Complexity O(N)
- Only applicable to low-dimensional data
Data Transformation
Data transformation is a preprocessing technique used to convert
data into a suitable format for analysis, modeling, and visualization.
The goal of data transformation is to improve the quality,
distribution, and suitability of the data for specific tasks.
The data are transformed or consolidated so that the resulting
mining process may be more efficient, and the patterns found may
be easier to understand.
Formally, a transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
Data Transformation Methods
1. Smoothing: remove noise from the data
2. Attribute/feature construction: new attributes constructed from the given ones
3. Aggregation: summarization, data cube construction
4. Normalization: scale values to fall within a smaller, specified range
   - min-max normalization
   - z-score normalization
   - normalization by decimal scaling
5. Discretization
6. Concept hierarchy climbing
Data Transformation
Smoothing
- Remove noise from the data.
- Methods: binning, regression, clustering.
Attribute Construction
- New attributes are constructed and added from the given set of attributes to help the mining process.
Aggregation
- Summary or aggregation operations are applied to the data.
- Data cube construction.
- Example: daily sales data aggregated to compute monthly and annual total amounts.
Normalization
Changing measurement units from meters to inches for height, or
from kilograms to pounds for weight, may lead to very different
results.
In general, expressing an attribute in smaller units will lead to a
larger range for that attribute, and thus tend to give such an
attribute greater effect or “weight.”
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized.
This involves transforming the data to fall within a smaller or
common range such as [−1, 1] or [0.0, 1.0].
Normalization
Normalizing the data attempts to give all attributes an equal weight.
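The three normalization methods named earlier can be written out directly; a small sketch with hypothetical income values:

```python
import numpy as np

v = np.array([73_600.0, 54_000.0, 12_000.0, 98_000.0, 60_000.0])

# Min-max normalization to the new range [0.0, 1.0]:
#   v' = (v - min) / (max - min) * (new_max - new_min) + new_min
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: v' = (v - mean) / std
zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: v' = v / 10**j,
# where j is the smallest integer such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal = v / 10**j

print(minmax, zscore, decimal, sep="\n")
```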
Discretization
We differentiate three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: real-valued, e.g., integer or real numbers
Discretization: divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Reduces data size
- Can be supervised or unsupervised
- Split (top-down) vs. merge (bottom-up)
- Can be performed recursively on an attribute
- Prepares the data for further analysis, e.g., classification
Data Discretization Methods
- Binning
  - Top-down split, unsupervised
- Histogram analysis
  - Top-down split, unsupervised
- Clustering analysis
  - Unsupervised, top-down split or bottom-up merge
- Decision-tree analysis
  - Supervised, top-down split
- Correlation (e.g., χ²) analysis
  - Unsupervised, bottom-up merge
- Note: all these methods can be applied recursively
Simple Discretization: Binning
- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  - The most straightforward approach, but outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Example: Binning Methods for Data Smoothing
q Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Discretization Without Supervision: Binning vs. Clustering
Concept Hierarchy Generation
- A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
- Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
- Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
- Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
- Concept hierarchies can be automatically formed for both numeric and nominal data; for numeric data, use the discretization methods shown above
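For numeric data such as age, a one-level concept hierarchy (age → youth/adult/senior) can be sketched with pandas; the cut points are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([13, 22, 35, 47, 58, 64, 71, 80])

# Replace low-level numeric values with higher-level concepts
levels = pd.cut(ages, bins=[0, 24, 59, 120], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "concept": levels}))
```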
Concept Hierarchy Generation for Nominal Data
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes
  - E.g., only street < city, not the others
- Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values
  - E.g., for the set of attributes {street, city, state, country}
Automatic Concept Hierarchy Generation
- Some hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - There are exceptions, e.g., weekday, month, quarter, year
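A sketch of this distinct-value heuristic with pandas; the location data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["India", "India", "India", "India", "USA"],
    "state":   ["Karnataka", "Karnataka", "Karnataka", "Kerala", "Illinois"],
    "city":    ["Mysore", "Mysore", "Bangalore", "Kochi", "Chicago"],
    "street":  ["MG Road", "Temple Road", "Brigade Road", "Marine Drive", "State St"],
})

# Attributes with more distinct values sit lower in the hierarchy
order = df.nunique().sort_values()
print(order)   # country (2) < state (3) < city (4) < street (5)
```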