Data Preprocessing
• Data Quality
• Data Cleaning
• Data Integration
• Data Transformation
Why Data Preprocessing?
Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., occupation=“ ”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why is Data Dirty?
• Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and
when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why is Data Preprocessing Important?
• No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause incorrect or even misleading statistics
• A data warehouse needs consistent integration of quality data
  • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical
data
Data Cleaning
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data were not registered
• Missing data may need to be inferred.
How to handle Missing data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with one of the following (a brief sketch follows this list):
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
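A minimal sketch of the automatic fill-in strategies above, assuming pandas is available; the DataFrame, its "income" attribute, and the "class" label are hypothetical illustrations rather than data from the slides.

```python
# Minimal sketch of the automatic fill-in strategies, assuming a pandas DataFrame
# with a hypothetical numeric attribute "income" and a class label "class".
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, np.nan, 48_000, 71_000],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# 1. Global constant (for a categorical attribute one might use the string "unknown")
filled_const = df["income"].fillna(-1)

# 2. Attribute mean over all tuples
filled_mean = df["income"].fillna(df["income"].mean())

# 3. Attribute mean per class (smarter: uses only tuples of the same class)
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean)
```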
Ignore Tuple
• Pros:
1. Complete removal of tuples with missing values can yield a robust and highly accurate model
2. Deleting a particular row or column that carries no specific information is reasonable, since it does not have high weight
• Cons:
1. Loss of information and data
2. Works poorly if the percentage of missing values is high (say 30%),
compared to the whole dataset
Replacing With Mean/Median/Mode
• Pros:
1. This is a better approach when the data size is small
2. It prevents the data loss that would otherwise result from removing rows and columns
• Cons:
1. Imputing approximate values adds variance and bias
2. Works poorly compared to multiple-imputation methods
Assigning A Unique Category
• A categorical feature has a definite number of possibilities, such as gender. Since it has a definite number of classes, we can assign another class for the missing values. Here, the features Cabin and Embarked have missing values which can be replaced with a new category, say, U for 'unknown' (a brief sketch follows the pros and cons below).
• Pros:
1. Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since the feature is categorical
2. Negates the loss of data by adding a unique category
• Cons:
1. Adds little variance (the extra category carries limited information)
2. Adds another feature to the model while encoding, which may result in poor
performance
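A minimal sketch of this strategy, assuming pandas is available; the Cabin and Embarked columns follow the Titanic-style example mentioned above, with made-up values.

```python
# Minimal sketch of the "assign a unique category" strategy, assuming Titanic-style
# columns named Cabin and Embarked in a pandas DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Cabin":    ["C85", np.nan, "E46", np.nan],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Replace missing values in the categorical features with a new category "U" (unknown)
df[["Cabin", "Embarked"]] = df[["Cabin", "Embarked"]].fillna("U")
print(df)
```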
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitation
◦ inconsistency in naming convention
Binning Methods for Data Smoothing
* Sorted data for price (in dollars):
  4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
* Equal-width partitioning: divides the range into N intervals of equal width, W = (max − min) / N; straightforward, but sensitive to outliers and skewed data
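A minimal sketch of both partitioning rules on the price data above, assuming NumPy is available; the bin means are rounded to integers to mirror the worked example.

```python
# Minimal sketch of equal-frequency binning with smoothing by bin means, plus
# equal-width partitioning, on the price data from the example above.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted

# Equal-frequency (equi-depth) bins: 3 bins of 4 values each
bins = np.split(prices, 3)

# Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean
smoothed = np.concatenate([np.full(len(b), int(b.mean().round())) for b in bins])
print(smoothed)   # [ 9  9  9  9 23 23 23 23 29 29 29 29]

# Equal-width partitioning: 3 intervals of width (34 - 4) / 3 = 10
edges = np.linspace(prices.min(), prices.max(), 4)   # [ 4. 14. 24. 34.]
bin_index = np.digitize(prices, edges[1:-1])
print(bin_index)  # which equal-width bin each price falls into
```

Note how the equal-frequency bins hold four values each, while the equal-width bins end up unevenly filled because the prices are skewed toward the upper end of the range.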
Regression
• Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression: find the best line to fit two variables and use the regression function to smooth the data
• Multiple linear regression: more than two variables; fit the data to a multidimensional surface
(Figure: data points and a fitted line y = x + 1, with an original value Y1 mapped to a smoothed value Y1' on the line)
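A minimal sketch of smoothing by linear regression, assuming NumPy; the data points are invented to lie roughly on y = x + 1, as in the figure placeholder above.

```python
# Minimal sketch of smoothing by linear regression: fit y as a linear function of x
# with least squares and replace each y by its fitted value (illustrative data only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.3, 4.9, 6.1])        # noisy observations of roughly y = x + 1

slope, intercept = np.polyfit(x, y, deg=1)      # least-squares fitted line
y_smoothed = slope * x + intercept              # smoothed values lie on the fitted line
print(slope, intercept)
print(y_smoothed)
```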
Cluster Analysis
• Similar values are organized into groups, or “clusters”; values that fall outside the clusters can be detected and removed as outliers
Data Integration
Data integration:
◦ Combines data from multiple sources into a coherent store
Issues to be considered
Schema integration: e.g., “cust-id” & “cust-no”
◦ Integrate metadata from different sources
◦ Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
◦ Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated.
◦ Object identification: The same attribute or object may have different names in different databases (e.g., linking Aadhaar card and PAN card records)
◦ Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
Redundant attributes can be detected by correlation analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
Data value conflict detection and resolution
◦ The same real-world entity may be represented differently across sources such as hotel chains, retail shops, schools, and institutes
◦ Tuple duplication increases redundancy and inconsistency
Correlation Analysis (Categorical Data)
• Correlation between categorical (discrete) attributes can be evaluated with the chi-square (χ²) test
• The expected frequency of a contingency-table cell, e.g., (male, Fiction), is
  e = count(male) × count(Fiction) / n
Correlation Analysis (Numeric Data)
• Pearson's correlation coefficient:
  r_A,B = Σ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)
• If r_A,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation
• r_A,B = 0: independent; r_A,B < 0: negatively correlated
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
Solution
• Σx = 247, Σy = 486, Σxy = 20,485, Σx² = 11,409, Σy² = 40,022, and n (the sample size) = 6
• The correlation coefficient is
  r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] × [n Σy² − (Σy)²]}
    = [6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}
    = 2,868 / 5,413.27 = 0.5298
• The range of the correlation coefficient is from −1 to 1. Our result of 0.5298 (52.98%) means the variables have a moderate positive correlation.
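The arithmetic can be double-checked directly from the listed sums (the raw x and y values are not given on the slide, so only the sums are used); this is a small verification sketch in Python.

```python
# Re-checking the correlation coefficient above from the given sums.
import math

n, sx, sy, sxy, sx2, sy2 = 6, 247, 486, 20_485, 11_409, 40_022

numerator = n * sxy - sx * sy                                    # 2,868
denominator = math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))   # ~5,413.27
r = numerator / denominator
print(round(r, 4))   # 0.5298 -> moderate positive correlation
```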
Covariance (Numeric Data)
• Covariance:
  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n
• Correlation coefficient:
  r_A,B = Cov(A, B) / (σ_A σ_B)
  where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to
be smaller than its expected value.
• Independence: if A and B are independent, then Cov_A,B = 0, but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 yet not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
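A small sketch answering the question, using the covariance E(AB) − E(A)E(B) from the definition above; NumPy is assumed.

```python
# Minimal sketch: compute Cov(A, B) for the five observed (A, B) price pairs;
# a positive value means the two prices tend to rise together.
import numpy as np

a = np.array([2, 3, 5, 4, 6], dtype=float)
b = np.array([5, 8, 10, 11, 14], dtype=float)

cov = np.mean(a * b) - a.mean() * b.mean()   # E(AB) - E(A)E(B)
print(cov)                                   # 4.0 > 0, so A and B tend to rise together
```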
Data Transformation: Normalization
• Min-max normalization to [new_min_A, new_max_A]:
  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v − μ_A) / σ_A
• Ex. Let μ (mean) = 54,000 and σ (std. dev) = 16,000. Then $73,600 is mapped to
  (73,600 − 54,000) / 16,000 = 1.225
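A minimal sketch of both normalization formulas applied to the income example; the helper function names are illustrative.

```python
# Minimal sketch of min-max and z-score normalization from the formulas above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization."""
    return (v - mu) / sigma

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```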
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
(Figure: data points in the x1–x2 plane projected onto their principal component directions)
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal
component vectors
• The principal components are sorted in order of decreasing “significance”
or strength
• Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only (a sketch follows below)
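A minimal sketch of these steps with NumPy on made-up numeric data; eigendecomposition of the covariance matrix stands in for whatever PCA routine a real pipeline would use.

```python
# Minimal PCA sketch following the steps above: normalize, compute principal
# components, keep the k strongest, and project (illustrative random data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 tuples, n = 5 numeric attributes
k = 2                                         # keep the k strongest components

# Normalize: center and scale each attribute
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components = eigenvectors of the covariance matrix,
# sorted by decreasing eigenvalue ("significance")
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]

# Reduced representation: each tuple becomes a linear combination of k components
X_reduced = Xc @ components
print(X_reduced.shape)                        # (100, 2)
```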
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence
assumption: choose by significance tests
• Best step-wise feature selection (see the sketch after this list):
• The best single attribute is picked first
• Then the next best attribute is added, conditioned on the first, and so on
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking
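A minimal sketch of best step-wise (forward) selection, assuming scikit-learn is available; scoring attributes by the cross-validated accuracy of a decision tree is one possible stand-in for a significance test, not the only choice.

```python
# Minimal sketch of greedy step-wise (forward) attribute selection.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

for _ in range(2):                      # greedily pick the 2 best attributes
    scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)  # next best attribute given those already chosen
    selected.append(best)
    remaining.remove(best)

print(selected)                         # indices of the selected attributes
```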
Data Reduction
Attribute Subset Selection
Irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Methods
Stepwise Forward Selection
Stepwise Backward Elimination
Combination of Forward Selection & Backward Elimination
Decision Tree Induction
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation, using parametric methods (regression, log-linear models) or non-parametric methods (histograms, clustering, sampling)
Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
• Log-linear model
• Approximates discrete multidimensional probability
distributions
Regression Analysis
(Figure: data points and a fitted regression line in the x–y plane)
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional space
for a set of discretized attributes, based on a smaller subset of
dimensional combinations
• Useful for dimensionality reduction and data smoothing
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  • Equal-width: equal bucket range
  • Equal-frequency (equal-depth): each bucket holds roughly the same number of values
(Figure: example histogram with counts on the y-axis)
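A minimal sketch of histogram-based reduction on the earlier price data, assuming NumPy; only per-bucket counts and sums are stored instead of the raw values.

```python
# Minimal sketch of histogram-based reduction: replace raw values by equal-width
# buckets and store only the count and sum per bucket (illustrative price data).
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)             # 3 equal-width buckets
sums, _ = np.histogram(prices, bins=3, weights=prices)   # sum of values per bucket

for lo, hi, c, s in zip(edges[:-1], edges[1:], counts, sums):
    print(f"bucket {lo:.0f}-{hi:.0f}: count={c}, sum={s:.0f}")
```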
Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and clustering
algorithms
• Cluster analysis will be studied in depth in Chapter 10
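A minimal sketch of storing only a centroid and an approximate diameter per cluster, assuming scikit-learn's KMeans; the two-group data is invented.

```python
# Minimal sketch of clustering-based reduction: keep a centroid and diameter
# per cluster instead of the raw points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=8.0, size=(50, 2))])   # two well-separated groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for c in range(2):
    members = X[km.labels_ == c]
    centroid = members.mean(axis=0)
    diameter = 2 * np.max(np.linalg.norm(members - centroid, axis=1))  # rough estimate
    print(f"cluster {c}: centroid={centroid.round(2)}, diameter~{diameter:.2f}")
```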
Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in
the presence of skew
• Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
• Used in conjunction with skewed data
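A minimal sketch of the three sampling variants above, assuming pandas; the skewed "segment" column is a hypothetical stratum attribute.

```python
# Minimal sketch of SRS without replacement, SRS with replacement, and
# stratified sampling on a skewed (hypothetical) "segment" column.
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": np.repeat(["A", "B", "C"], [900, 90, 10]),   # skewed strata
    "value":   rng.normal(size=1000),
})

srswor = df.sample(n=100, replace=False, random_state=0)    # SRS without replacement
srswr  = df.sample(n=100, replace=True,  random_state=0)    # SRS with replacement

# Stratified sampling: draw ~10% from each segment so small strata stay represented
stratified = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=0))

print(stratified["segment"].value_counts())
```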
Sampling: With or without Replacement
(Figure: raw data reduced by SRSWOR, simple random sampling without replacement, and by SRSWR, simple random sampling with replacement)
Sampling: Cluster or Stratified Sampling
Data Reduction 3: Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequences are not audio
  • They are typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered
as forms of data compression
Data Compression
(Figure: the original data can be compressed losslessly, or approximated via lossy compression)