Data Preprocessing

This document discusses data preprocessing techniques including data cleaning, integration, and transformation. It covers major tasks like handling missing data through mean/median imputation or adding a new category. It also discusses handling noisy data through binning, regression, and clustering to detect outliers. Data integration combines data from multiple sources and addresses issues like schema integration to resolve conflicts. Data transformation techniques include normalization, aggregation, and discretization. The document emphasizes that high-quality data is needed for high-quality mining results, and that data preprocessing comprises most of the work of building a data warehouse.


Data Preprocessing

• Data Preprocessing: An Overview

• Data Quality

• Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Transformation
Why Data Preprocessing?
Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., occupation=“ ”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why is Data Dirty?
• Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and
when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why is Data Preprocessing Important?
• No quality data, no quality mining results!
Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical
data
Data Cleaning
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data were not registered
• Missing data may need to be inferred.
How to handle Missing data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
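
A minimal sketch of the automatic strategies above, assuming pandas is available (the small table and column names are made up for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "income": [45000, np.nan, 52000, np.nan, 61000],
        "class":  ["A", "A", "B", "B", "B"],
    })

    # Global constant: replace missing values with a sentinel standing in for "unknown"
    filled_constant = df["income"].fillna(-1)

    # Attribute mean
    filled_mean = df["income"].fillna(df["income"].mean())

    # Attribute mean for all samples belonging to the same class (smarter)
    filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))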
Ignore Tuple
• Pros:
1. Complete removal of data with missing values results in a robust and highly accurate model
2. Deleting a particular row or column that carries no specific information is better, since it does not carry much weight
• Cons:
1. Loss of information and data
2. Works poorly if the percentage of missing values is high (say 30%),
compared to the whole dataset
Replacing With Mean/Median/Mode

• Pros:
1. This is a better approach when the data size is small
2. It can prevent data loss which results in removal of the rows and columns
• Cons:
1. Imputing the approximations adds variance and bias
2. Works poorly compared to other multiple-imputation methods
Assigning A Unique Category
• A categorical feature has a definite number of possible values, such as gender. Since the classes are limited, we can assign another class for the missing values. Here, the features Cabin and Embarked have missing values which can be replaced with a new category, say, U for ‘unknown’.
• Pros:
1. Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since the feature is categorical
2. Negates the loss of data by adding a unique category
• Cons:
1. Adds less variance
2. Adds another feature to the model while encoding, which may result in poor
performance
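
A minimal sketch, assuming a pandas DataFrame with Titanic-style Cabin and Embarked columns (the values here are made up):

    import numpy as np
    import pandas as pd

    titanic = pd.DataFrame({
        "Cabin":    ["C85", np.nan, "E46", np.nan],
        "Embarked": ["S", "C", np.nan, "Q"],
    })

    # Replace missing values of the categorical features with a new category "U" (unknown)
    titanic[["Cabin", "Embarked"]] = titanic[["Cabin", "Embarked"]].fillna("U")
    print(titanic)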
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitation
◦ inconsistency in naming convention

Other data problems which require data cleaning


◦ duplicate records
◦ incomplete data
◦ inconsistent data
How to Handle Noisy Data?
Binning
◦ Binning methods smooth a sorted data value by consulting its
“neighbor-hood,” that is, the values around it.
◦ first sort data and partition into (equal-frequency) bins
◦ then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
Regression
◦ smooth by fitting the data into regression functions
Clustering
◦ detect and remove outliers
Semi-automated method: combined computer and human inspection
◦ detect suspicious values and check manually
Simple Discretization Methods: Binning
 Equal-width (distance) partitioning
◦ Divides the range into N intervals of equal size: uniform grid
◦ if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B − A)/N.
◦ The most straightforward, but outliers may dominate presentation
◦ Skewed data is not handled well

 Equal-depth (frequency) partitioning


◦ Divides the range into N intervals, each containing approximately same
number of samples
◦ Good data scaling
◦ Managing categorical attributes can be tricky
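
A minimal sketch of the two partitioning schemes, assuming pandas is available and using the price values from the next slide:

    import pandas as pd

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    # Equal-width partitioning: 3 intervals of width W = (34 - 4) / 3 = 10
    equal_width = pd.cut(prices, bins=3)

    # Equal-depth (frequency) partitioning: 3 bins holding roughly the same number of values
    equal_depth = pd.qcut(prices, q=3)

    print(pd.Series(equal_width).value_counts(sort=False))
    print(pd.Series(equal_depth).value_counts(sort=False))

Note how the equal-width bins collect unequal numbers of values, while the equal-depth bins hold four values each.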
Binning Methods for Data Smoothing
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21,
24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

• Smoothing by bin means: each value in a bin is replaced by the mean


value of the bin.

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Binning Methods for Data Smoothing
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21,
24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin boundaries: Each bin value is then replaced by the


closest boundary value. In general, the larger the width, the greater the
effect of the smoothing.

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
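
The same smoothing can be reproduced with a short plain-Python sketch (values taken from the slide above):

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equal-frequency bins of depth 4

    # Smoothing by bin means: each value is replaced by the (rounded) mean of its bin
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    # -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

    # Smoothing by bin boundaries: each value is replaced by the closer bin boundary
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
    # -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]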
Equal-width partitioning
Regression
Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression: find the best line to fit two variables and use the regression function to smooth the data
• Multiple linear regression (more than two variables): fit the data to a multidimensional surface
[Figure: a regression line y = x + 1 fitted through the data; the observed value Y1 at X1 is smoothed to the fitted value Y1′]
Cluster Analysis
 Detect and remove outliers: similar values are organized into groups or “clusters”; values that fall outside the clusters may be considered outliers
Data Integration
 Data integration:
◦ Combines data from multiple sources into a coherent store

Issues to be considered
 Schema integration: e.g., “cust-id” & “cust-no”
◦ Integrate metadata from different sources
◦ Entity identification problem:
 Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
◦ Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are
different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
 Redundant data often occur when multiple databases are integrated.
◦ Object identification: The same attribute or object may have different names in different databases (e.g., linking an Aadhaar card and a PAN card)
◦ Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
 Redundant attributes can be detected by correlation analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
 Data value conflict detection and resolution: for the same entity, different sources may record different values (e.g., hotel chains, retail shops, schools, institutes)
 Tuple duplication increases redundancy and inconsistency
Correlation Analysis (Categorical Data)
Correlation analysis of categorical (discrete) attributes uses the chi-square (χ²) test.

For a contingency table built from n tuples, the expected frequency of a cell, e.g., (male, Fiction), is:

  e_ij = (count(A = a_i) × count(B = b_j)) / n

The chi-square statistic is then computed as:

  χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij

where o_ij is the observed and e_ij the expected frequency of cell (a_i, b_j). The calculated value is compared against a chi-square distribution table.
Goodness of fit test
Stating the Hypothesis
• Null Hypothesis: There is no relationship between the two categorical
variables. (They are independent.)
• Alternative Hypothesis: There is a relationship between the two categorical variables. (They are not independent.)
Find the degrees of freedom = (R − 1) × (C − 1)
• For 1 degree of freedom, the chi-square value needed to reject the null hypothesis at the 0.001 significance level is 10.828.
Check whether chi-square (calculated) > chi-square (tabular)
• Yes: reject the null hypothesis and accept the alternative hypothesis
Goodness of fit
• Our value is above this, so we can reject the hypothesis that gender and preferred_reading are independent.
• Conclusion: the two attributes are (strongly) correlated for the given group of people.
Example 1
Finding the P-Value
• Technically, the p-value is the probability of observing a χ² at least as large as the one observed, assuming that no relationship exists between the explanatory and response variables. Using statistical software, we can then find the p-value for this test.
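
A minimal sketch of the whole test, assuming SciPy; the gender × preferred_reading counts below are illustrative only, since the slide’s contingency table is not reproduced here:

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: male, female; columns: fiction, non-fiction (illustrative counts)
    observed = np.array([[250,   50],
                         [200, 1000]])

    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
    print(expected[0, 0])        # expected frequency for the (male, fiction) cell
    print(chi2, dof, p_value)    # compare chi2 with the tabular value 10.828 at dof = 1
                                 # (a tiny p-value means we reject the null hypothesis)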
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson’s product moment coefficient):

  r_A,B = Σ(aᵢ − Ā)(bᵢ − B̄) / ((n − 1)·σ_A·σ_B) = (Σ(aᵢ·bᵢ) − n·Ā·B̄) / ((n − 1)·σ_A·σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(aᵢ·bᵢ) is the sum of the AB cross-product.

If r_A,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
If r_A,B = 0, A and B are independent (uncorrelated); if r_A,B < 0, A and B are negatively correlated.
Visually Evaluating Correlation

[Figure: scatter plots showing correlations ranging from −1 to 1]
Solution
• Σx = 247
• Σy = 486
• Σxy = 20,485
• Σx² = 11,409
• Σy² = 40,022
• n is the sample size, in our case 6
• The correlation coefficient:
  r = [6(20,485) − (247 × 486)] / √([6(11,409) − 247²] × [6(40,022) − 486²])
    = 2,868 / 5,413.27 = 0.5298
• The range of the correlation coefficient is from −1 to 1. Our result is 0.5298, or 52.98%, which means the variables have a moderate positive correlation.
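
The same arithmetic in Python (a minimal sketch; the slide gives only the sums, not the raw x and y values, so the formula is applied to the sums directly):

    from math import sqrt

    n = 6
    sum_x, sum_y, sum_xy = 247, 486, 20_485
    sum_x2, sum_y2 = 11_409, 40_022

    numerator = n * sum_xy - sum_x * sum_y                                     # 2,868
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))  # 5,413.27
    r = numerator / denominator
    print(round(r, 4))   # 0.5298 -> moderate positive correlation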
Covariance (Numeric Data)

• Covariance is similar to correlation:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ(aᵢ − Ā)(bᵢ − B̄) / n

  Correlation coefficient:  r_A,B = Cov(A, B) / (σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

• Positive covariance: if Cov_A,B > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov_A,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov_A,B = 0, but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Co-Variance: An Example

• It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?

• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
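
The same example checked with NumPy (a minimal sketch; bias=True asks for the population covariance, i.e., division by n as in the slide):

    import numpy as np

    a = np.array([2, 3, 5, 4, 6])      # stock A
    b = np.array([5, 8, 10, 11, 14])   # stock B

    # Cov(A, B) = E(A*B) - mean(A)*mean(B)
    print(np.mean(a * b) - a.mean() * b.mean())   # 4.0
    print(np.cov(a, b, bias=True)[0, 1])          # 4.0, same value from the covariance matrix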


Data Transformation
Smoothing: remove noise from data using smoothing techniques
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
Attribute/feature construction:
◦ New attributes constructed from the given ones
Data Transformation: Normalization
 Min-max normalization: a linear transformation to [new_min_A, new_max_A]

  v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

  v′ = (v − μ_A) / σ_A

 Ex. Let μ (mean) = 54,000 and σ (std. dev.) = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

 Normalization by decimal scaling:

  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
• Suppose that the minimum and maximum values for the feature
income are $12,000 and $98,000, respectively.
• We would like to map income to the range [0.0,1.0]
• By min-max normalization, a value of $73,600 for income is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Example
• Suppose that the mean and standard deviation of the values for the feature income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
Example
• Suppose that the recorded values of F range from −986 to 917.
• The maximum absolute value of F is 986.
• To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3),
• so that −986 normalizes to −0.986 and 917 normalizes to 0.917
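
A minimal sketch of all three normalizations, assuming NumPy and reusing the numbers from the examples above:

    import numpy as np

    # Min-max normalization of income to [0.0, 1.0]
    income = np.array([12_000.0, 73_600.0, 98_000.0])
    min_a, max_a = income.min(), income.max()
    print((income - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0)   # 73,600 -> 0.716...

    # Z-score normalization with the given mean and standard deviation
    mu, sigma = 54_000, 16_000
    print((73_600 - mu) / sigma)                                    # 1.225

    # Decimal scaling: divide by 10**j, the smallest j with max(|v'|) < 1
    f = np.array([-986.0, 917.0])
    j = int(np.ceil(np.log10(np.abs(f).max())))   # works here since 986 is not an exact power of 10
    print(f / 10 ** j)                            # [-0.986  0.917]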
Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space

[Figure: data points in the (x1, x2) plane with the principal directions of variation shown]
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal
component vectors
• The principal components are sorted in order of decreasing “significance”
or strength
• Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
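
A minimal sketch of these steps with NumPy (the data are randomly generated for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))       # 100 data vectors in n = 3 dimensions
    X = X - X.mean(axis=0)              # normalize: center each attribute

    # Eigenvectors of the covariance matrix define the new space
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

    # Sort components by decreasing "significance" (variance) and keep the k strongest
    order = np.argsort(eigvals)[::-1]
    k = 2
    components = eigvecs[:, order[:k]]  # n x k projection matrix

    X_reduced = X @ components          # project the data onto the k principal components
    print(X_reduced.shape)              # (100, 2)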
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence
assumption: choose by significance tests
• Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, and so on
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking

Data Reduction
 Attribute Subset Selection
 Irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.

 Methods
 Stepwise Forward Selection
 Stepwise Backward Elimination
 Combination of Forward Selection & Backward Elimination
 Decision Tree Induction
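
A minimal sketch of the Stepwise Forward Selection method above; the scoring function here is a toy stand-in (real use would plug in, e.g., a cross-validated model score or a significance test):

    def forward_selection(attributes, score, k):
        """Greedily pick k attributes, adding the best-scoring one at each step."""
        selected, remaining = [], list(attributes)
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda a: score(selected + [a]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy relevance scores for illustration only (not a real model)
    relevance = {"income": 0.9, "age": 0.6, "zip": 0.1, "student_id": 0.0}
    print(forward_selection(relevance, lambda subset: sum(relevance[a] for a in subset), k=2))
    # -> ['income', 'age']

Stepwise Backward Elimination works the same way in reverse: start from all attributes and repeatedly drop the one whose removal hurts the score least.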
Data Reduction 2: Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of


data representation
• Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
• Ex.: Log-linear models: obtain a value at a point in m-D space as the product of appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model
• Approximates discrete multidimensional probability
distributions

Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors)
• The parameters are estimated so as to give a “best fit” of the data
• Most commonly the best fit is evaluated using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: a regression line y = x + 1 fitted through the data points; the observed value Y1 at X1 is fitted to Y1′]
Regression Analysis and Log-Linear Models
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
• Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …

• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional space
for a set of discretized attributes, based on a smaller subset of
dimensional combinations
• Useful for dimensionality reduction and data smoothing
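
A minimal sketch of fitting Y = wX + b by least squares with NumPy (the data points are made up, roughly following y = x + 1):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 4.2, 5.1, 5.9])

    # Least-squares estimates of the two regression coefficients w and b
    w, b = np.polyfit(x, y, deg=1)
    print(round(w, 3), round(b, 3))   # slope and intercept, both close to 1

    y_fitted = w * x + b              # store only w and b; the fitted line stands in for the raw data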
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  • Equal-width: equal bucket range
  • Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
[Figure: equal-width histogram with buckets from 10,000 to 90,000 and counts ranging from 0 to 40]
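
A minimal sketch of an equal-width histogram with NumPy (the price values are randomly generated for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    prices = rng.integers(10_000, 100_000, size=500)

    # Partition into 5 equal-width buckets; keep only per-bucket counts and sums
    counts, edges = np.histogram(prices, bins=5, range=(10_000, 100_000))
    sums, _ = np.histogram(prices, bins=5, range=(10_000, 100_000), weights=prices)

    print(edges)           # bucket boundaries
    print(counts)          # number of values in each bucket
    print(sums / counts)   # average stored for each bucket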
Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and clustering
algorithms
• Cluster analysis will be studied in depth in Chapter 10

Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in
the presence of skew
• Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
• Used in conjunction with skewed data
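
A minimal sketch of these sampling types, assuming a reasonably recent pandas (the data and the skewed strata are made up):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        "value": rng.normal(size=1_000),
        "stratum": rng.choice(["A", "B", "C"], size=1_000, p=[0.7, 0.2, 0.1]),  # skewed groups
    })

    srs_wor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
    srs_wr  = df.sample(n=100, replace=True,  random_state=0)   # SRS with replacement

    # Stratified sampling: draw ~10% from each partition so small groups are still represented
    stratified = df.groupby("stratum", group_keys=False).sample(frac=0.10, random_state=0)
    print(stratified["stratum"].value_counts())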

Sampling: With or without Replacement

[Figure: raw data sampled by SRS (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]
Sampling: Cluster or Stratified Sampling

[Figure: raw data partitioned into a cluster/stratified sample]
Data Reduction 3: Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless, but only limited manipulation is possible
without expansion
• Audio/video compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequences are not audio: they are typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered
as forms of data compression

Data Compression

[Figure: original data reduced to compressed data by lossless compression; lossy compression yields an approximation of the original data]
