Data Preprocessing: Why Preprocess The Data?

The document discusses data preprocessing which includes data cleaning, integration, transformation, reduction, and discretization. Data cleaning involves filling in missing values, identifying and handling outliers, resolving inconsistencies, and addressing redundancy from data integration. Data integration merges data from multiple sources which requires schema integration and resolving object matching issues. Data transformation includes normalization and aggregation.


Data Preprocessing

 Why preprocess the data?

 Data cleaning

 Data integration and transformation

 Data reduction

 Discretization and concept hierarchy generation

 Summary



Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing
Important?
 No quality data, no quality mining results.
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
 Data warehouse needs consistent integration of quality
data
 Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse



Multi-Dimensional Measure of Data
Quality
 A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Interpretability
 Accessibility



Major Tasks in Data
Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing



Data Preprocessing

 Why preprocess the data?

 Data cleaning

 Data integration and transformation

 Data reduction

 Discretization and concept hierarchy generation

 Summary



Data Cleaning
 Importance
 “Data cleaning is one of the biggest and number one
problems in data warehousing”
 Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data

 Resolve redundancy caused by data integration



Missing Data

 Data is not always available


 E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 Missing data may need to be inferred.



How to Handle Missing Data?
1. Ignore the tuple:
 usually done when the class label is missing (assuming the task is classification).
 This method is not very effective unless the tuple contains several attributes with missing values.
 Not effective when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: time-consuming and may not be feasible given a large data set with many missing values.

How to Handle Missing Data?
3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label
like “Unknown” or infinity.

4. Use the attribute mean to fill in the missing value:
For example, suppose that the average income of customers is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean for all samples belonging to the same class:
For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit-risk category as that of the given tuple.
How to Handle Missing Data?

6. Use the most probable value to fill in the missing value:
inference-based methods such as Bayesian inference or decision tree induction can be used.
 For example, using the other customer attributes in our data
set, we may construct a decision tree to predict the missing
values for income.
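The following is a minimal sketch of strategies 4-6 using pandas and scikit-learn; the column names (income, age, credit_risk) and the use of DecisionTreeRegressor for the inference-based fill are illustrative assumptions, not part of the slides.

# Sketch: three ways to fill missing income values (hypothetical data)
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "income": [45000, None, 56000, None, 61000, 52000],
    "age": [25, 32, 41, 29, 50, 38],
    "credit_risk": ["low", "high", "low", "high", "low", "high"],
})

# 4. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 5. Fill with the mean of samples in the same class (credit_risk)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

# 6. Predict the most probable value from the other attributes
known = df[df["income"].notna()]
model = DecisionTreeRegressor().fit(known[["age"]], known["income"])
df["income_tree"] = df["income"]
missing = df["income"].isna()
df.loc[missing, "income_tree"] = model.predict(df.loc[missing, ["age"]])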



Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data



How to Handle Noisy Data?
1.Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
2.Regression
 smooth by fitting the data into regression functions
3.Clustering
 detect and remove outliers
4.Combined computer and human inspection
 detect suspicious values and have them checked by a human (e.g., to deal with possible outliers)



Noisy Data Handling Methods:
Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately the same number of samples



Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
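A small sketch of the same equi-depth binning and smoothing in plain Python, assuming the price list from the slide:

# Sketch: equal-frequency binning with smoothing by means and by boundaries
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer of the two boundaries
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]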



Noisy Data Handling Methods:
Regression
 Data can be smoothed by fitting the data to a function, such as
with regression.
 Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to
predict the other.
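As a rough sketch, smoothing by simple linear regression with NumPy; the x and y values below are illustrative noisy observations, not data from the slides:

# Sketch: smooth one attribute by regressing it on another (least-squares line fit)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # noisy observations of roughly y = x + 1

slope, intercept = np.polyfit(x, y, deg=1)  # fit the "best" straight line
y_smoothed = slope * x + intercept          # replace noisy values with fitted values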



Regression
(figure: a data point (X1, Y1) and its fitted value Y1' on the regression line y = x + 1)



Noisy Data Handling Methods:
Cluster Analysis
 Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.”



Cluster Analysis



Data Integration
Data integration:
 the merging of data from multiple data stores.

Issues to be considered during data integration

1. Schema integration & object matching
 Entity identification problem: the same attribute may have different
names in two different tables
 E.g., A.cust-id = B.cust-# (A and B are two different tables)
 Can be solved by integrating metadata from different sources.
 Metadata can be used to help avoid errors in schema integration.
 The metadata may also be used to help transform the data (e.g., where
data codes for pay type in one database may be “H” and “S”, and 1
and 2 in another).



Data Integration

2. Redundant data often occur when integrating multiple databases
 Object identification: The same attribute or object may
have different names in different databases
 Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality



Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product moment


coefficient)

i 1 (ai  A)(bi  B) 
n n
( ai bi )  n A B
rA, B   i 1
( n  1) A B (n  1) A B

 where n is the number of tuples, \bar{A} and \bar{B} are the respective
means of A and B, σ_A and σ_B are the respective standard
deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A's values increase
as B's). The higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
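A brief sketch of computing r_{A,B} directly from the definition with NumPy; the two attribute vectors are illustrative:

# Sketch: Pearson correlation coefficient between two numeric attributes A and B
import numpy as np

A = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
B = np.array([10.0, 14.0, 19.0, 27.0, 30.0])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
# Equivalent shortcut: np.corrcoef(A, B)[0, 1]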
Correlation Analysis (Nominal Data)

 Χ2 (chi-square) test
\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count

Correlation Analysis (Nominal
Data)
 For categorical (discrete) data, a correlation relationship
between two attributes, A and B, can be discovered by a chi-
square test.
 Suppose A has c distinct values, namely a1, a2, ..., ac.
 B has r distinct values, namely b1, b2, ..., br.
 The data tuples described by A and B can be shown as a
contingency table, with the c values of A making up the
columns and the r values of B making up the rows.
 Let (Ai, Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj, that is, (A = ai, B = bj).
Each and every possible (Ai, Bj) joint event has its own cell (or
slot) in the table.



 The chi-square value can be computed as
\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \quad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}
 where o_{ij} is the observed (actual) count of the joint event (A_i, B_j), e_{ij} is its expected count, and N is the number of data tuples
 The chi-square statistic tests the hypothesis that A and B are independent.
 The test is based on a significance level, with (r-1)(c-1) degrees of freedom.



Chi-Square Calculation: An
Example
 Suppose that a group of 1,500 people was surveyed. The gender
of each person was noted. Each person was polled as to whether
their preferred type of reading material was fiction or
nonfiction. Thus, we have two attributes, gender and preferred
reading. Find the correlation between these two attributes



Chi-Square Calculation: An Example

                             male        female       Sum (row)
Like science fiction         250 (90)    200 (360)      450
Not like science fiction      50 (210)  1000 (840)     1050
Sum (col.)                   300        1200           1500

 Χ2 (chi-square) calculation (numbers in parentheses are the expected counts, calculated based on the equation on the previous slide)
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

 It shows that like_science_fiction and gender are strongly correlated in the given group
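A short sketch that reproduces the chi-square value for the table above using plain NumPy (scipy.stats.chi2_contingency would give an equivalent result):

# Sketch: chi-square test of independence for the gender / preferred_reading table
import numpy as np

observed = np.array([[250, 200],     # like science fiction: male, female
                     [50, 1000]])    # not like science fiction: male, female

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
N = observed.sum()

expected = row_sums * col_sums / N                    # e_ij = count(row) * count(col) / N
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)                                           # ≈ 507.9 (the slide's 507.93)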
Data Integration
3.Detection and resolution of data value conflicts.
 For the same real-world entity, attribute values from different
sources may differ.
E.g., a weight attribute may be stored in metric units in one system
and British imperial units in another.
 For a hotel chain, the price of rooms in different cities may
involve not only different currencies but also different services
(such as free breakfast) and taxes.
 An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another. For example,
the total sales in one database may refer to one branch of All
Electronics, while an attribute of the same name in another
database may refer to the total sales for All Electronics stores in
a given region.



Data Transformation
 A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values
Methods
 Smoothing: remove noise from data (binning, clustering, regression)
 Normalization: scaled to fall within a small, specified range such as –
1.0 to 1.0 or 0.0 to 1.0
 Attribute/feature construction
 New attributes constructed / added from the given ones
 Aggregation: summarization or aggregation operations are applied to data
 Generalization: concept hierarchy climbing
 Low-level/primitive/raw data are replaced by higher-level concepts

Data Transformation - Normalization
 Min-max normalization: to [new_minA, new_maxA]
   v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
   Then $73,600 is mapped to \frac{73,600 - 12,000}{98,000 - 12,000}\,(1.0 - 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
   v' = \frac{v - \mu_A}{\sigma_A}
 Ex. Let μ = 54,000, σ = 16,000. Then \frac{73,600 - 54,000}{16,000} = 1.225
 Normalization by decimal scaling:
   v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
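A minimal sketch of the three normalization schemes with NumPy, reusing the income figures from the example above:

# Sketch: min-max, z-score, and decimal-scaling normalization
import numpy as np

v = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (mean and standard deviation of the attribute itself)
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
scaled = v / 10 ** j

print(minmax[2])   # ≈ 0.716, i.e. $73,600 maps to about 0.716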
Data Reduction

 Why data reduction?


 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time
to run on the complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the
same) analytical results

Data Reduction

 Data reduction strategies


 Data Cube Aggregation
 Attribute Subset Selection
 Dimensionality reduction, e.g., remove unimportant
attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms,
 Clustering,
 Sampling
Data Reduction 1: Data Cube
Aggregation
 Data cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized
data, thereby benefiting on-line analytical processing as well
as data mining.
 Queries regarding aggregated information should be
answered using data cube, when possible.
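As a loose illustration of the idea, aggregating quarterly sales up to yearly totals with pandas mimics the kind of precomputed summary a data cube stores; the column names and figures below are hypothetical:

# Sketch: aggregating detailed (quarterly) sales up to a coarser (yearly) level
import pandas as pd

sales = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount": [224, 408, 350, 586, 300, 420, 380, 600],
})

# Reduced representation: one row per year instead of one per quarter
annual = sales.groupby("year", as_index=False)["amount"].sum()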



Data Reduction 1: Data Cube Aggregation



Data Reduction 2: Attribute
Subset Selection
 Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).

 The goal of attribute subset selection is to find a minimum set


of attributes such that the resulting probability distribution of
the data classes is as close as possible to the original
distribution obtained using all attributes.

 It reduces the number of attributes appearing in the


discovered patterns, helping to make the patterns easier to
understand.



Attribute Subset Selection -
Techniques
1.Stepwise forward selection:
 The procedure starts with an empty set of attributes as the
reduced set.
 The best of the original attributes is determined and added to
the reduced set.
 At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.

2. Stepwise backward elimination:


 The procedure starts with the full set of attributes.
 At each step, it removes the worst attribute remaining in the set.



Attribute Subset Selection -
Techniques
3. Combination of forward selection and backward
elimination:
 The stepwise forward selection and backward elimination methods can be combined.
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
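A rough sketch of greedy stepwise forward selection, scoring each candidate attribute by cross-validated classifier accuracy; scikit-learn, the iris data set, and the decision-tree scorer are illustrative assumptions:

# Sketch: greedy stepwise forward selection of attributes
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

def score(cols):
    # mean cross-validated accuracy using only the chosen attribute columns
    return cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5).mean()

for _ in range(2):                      # keep the 2 "best" attributes
    best = max(remaining, key=lambda c: score(selected + [c]))
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)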



Attribute Subset Selection -
Techniques
4. Decision tree induction:
 Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction.
 At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
 When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
 All attributes that do not appear in the tree are assumed to be
irrelevant.
 The set of attributes appearing in the tree form the reduced
subset of attributes.
Attribute Subset Selection - Techniques



Data Reduction 3: Dimensionality
Reduction
 Dimensionality reduction
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
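As an illustration of one of these techniques, a minimal PCA sketch via NumPy's SVD, keeping the top two principal components; the data matrix is randomly generated for illustration:

# Sketch: dimensionality reduction with PCA (top-k principal components via SVD)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # 100 tuples, 5 attributes

X_centered = X - X.mean(axis=0)        # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ Vt[:k].T      # project onto the first k principal components
print(X_reduced.shape)                 # (100, 2)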

Data Reduction 4: Numerosity Reduction

 Reduce data volume by choosing alternative, smaller forms


of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
 Ex.: Log-linear models
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and
Log-Linear Models
 Linear regression: Y = w X + b
 A random variable, y (called a response variable), can be modeled as
a linear function of another random variable.

Multiple regression: Y = b0 + b1 X1 + b2 X2
 Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled as a
linear function of two or more predictor variables.
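A brief sketch of fitting such a multiple regression with NumPy least squares, so that only the coefficients (b0, b1, b2) need to be stored in place of the raw tuples; the data values are illustrative:

# Sketch: multiple regression Y = b0 + b1*X1 + b2*X2 via least squares
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([6.1, 6.9, 11.2, 11.8, 16.0])

design = np.column_stack([np.ones_like(X1), X1, X2])       # columns: 1, X1, X2
(b0, b1, b2), *_ = np.linalg.lstsq(design, Y, rcond=None)
# Store only (b0, b1, b2); Y can then be reconstructed approximately from X1 and X2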

Parametric Data Reduction: Regression
and Log-Linear Models
Log-linear models
 Given a set of tuples in n dimensions (e.g., described by n
attributes), we can consider each tuple as a point in an n-
dimensional space.
 Log-linear models can be used to estimate the probability
of each point in a multidimensional space for a set of
discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be
constructed from lower dimensional spaces.
 Log-linear models are thus useful for dimensionality
reduction and data smoothing



Histogram Analysis

 Divide data into buckets and store average (sum) for


each bucket
Partitioning rules:
Equal-width: In an equal-width histogram, the width of each
bucket range is uniform.

Equal-frequency (or equidepth): In an equal-frequency


histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data
samples).
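A small sketch contrasting equal-width and equal-frequency buckets with NumPy, reusing the price list from the binning example:

# Sketch: equal-width vs. equal-frequency histogram buckets
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 buckets with uniform range
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)                              # uniform ranges, varying counts

# Equal-frequency: boundaries at the 1/3 and 2/3 quantiles
q_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(q_edges)                                    # varying ranges, ~4 values per bucket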
Histogram Analysis
V-Optimal: If we consider all of the possible histograms for a
given number of buckets, the V-Optimal histogram is the one
with the least variance. Histogram variance is a weighted sum
of the original values that each bucket represents, where bucket
weight is equal to the number of values in the bucket.

MaxDiff: A MaxDiff histogram considers the difference between
each pair of adjacent values. A bucket boundary is established
between the pairs having the b-1 largest differences,
where b is the user-specified number of buckets.



Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many kinds of clustering algorithms.

Sampling

 Sampling: obtaining a small sample s to represent the whole data set N
 Key principle: Choose a representative subset of the data
Simple random sample without replacement (SRSWOR) of size s:
 This is created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is
similar to SRSWOR, except that each time a tuple is drawn from D, it
is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
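A tiny sketch of both schemes with NumPy's random generator; D here is just an illustrative array of 100 tuple identifiers:

# Sketch: simple random sampling without and with replacement
import numpy as np

rng = np.random.default_rng(seed=0)
D = np.arange(100)                                 # stand-in for N = 100 data tuples
s = 10

srswor = rng.choice(D, size=s, replace=False)      # each tuple can be drawn at most once
srswr = rng.choice(D, size=s, replace=True)        # a tuple may be drawn more than once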

Sampling

Cluster sample: If the tuples in D are grouped into M mutually


disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M
Stratified sample: If D is divided into mutually disjoint parts
called strata, a stratified sample of D is generated by obtaining
an SRS at each stratum
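A short sketch of a stratified sample with pandas, drawing a simple random sample within each stratum; the column names (customer, age_group) are hypothetical:

# Sketch: stratified sampling - an SRS of 2 tuples from each stratum (age group)
import pandas as pd

df = pd.DataFrame({
    "customer": range(12),
    "age_group": ["young", "middle", "senior"] * 4,
})

stratified = df.groupby("age_group").sample(n=2, random_state=0)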


