Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Knowledge Discovery Process
[Figure: the knowledge discovery pipeline, with machine learning as one of its steps]
The Moviegoer Example
Moviegoer Database - Tasks
• Classification
– Determine gender based on age, source and movies seen
– Determine source based on gender, age and movies seen
• Estimation
– For estimation, you need a continuous variable, e.g. age
– Estimate age as a function of source, gender and past movies
• Clustering
– Find groupings of movies that are often seen by the same people
– Find groupings of people that tend to see the same movies
Moviegoer Database - Tasks
• Affinity grouping
– Association rules: which movies go together?
– Need to create “transactions” for each moviegoer containing the movies seen by that person
– May result in association rules such as “people who see movie X also tend to see movie Y”
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily the data can be understood?
Can decisions based on the data be trusted?
There is a better chance of discovering useful knowledge when the data is clean.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization
• Data Discretization
– Part of data reduction but with particular importance, especially for numerical data
• Data Reduction
– Obtains a reduced representation of the data that is smaller in volume but produces the same or similar analytical results
Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments,
human or computer errors, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the record: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant: e.g., “unknown” (which effectively creates a new class!)
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter (see the sketch below)
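The three automatic strategies above fit in a few lines. Below is a minimal pandas sketch on a made-up table; the column names income and cls are hypothetical, not from the slides.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: 'income' has missing values, 'cls' is the class label.
df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 61_000, 58_000],
    "cls":    ["low",  "low",  "high", "high", "high", "high"],
})

# Option 1: fill with a global constant.
const_filled = df["income"].fillna(-1)

# Option 2: fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Option 3 (smarter): fill with the mean of samples in the same class.
class_mean_filled = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(class_mean_filled)
```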
Imputation of Missing Data (Basic)
• Imputation denotes a procedure that replaces the missing values in a dataset with plausible values,
– i.e. by exploiting relationships among correlated attributes of the dataset.
Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
?             cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

If we consider only {attribute#2}, the value “cool” appears in 4 records. Among the other “cool” records, Attribute 1 is 20 twice and 30 once, so:
Probability of imputing value (20) = 66.6%
Probability of imputing value (30) = 33.3%
Imputation of Missing Data (Basic)
For {attribute#4}, the value “true” appears in 3 records of the same table. Among the other “true” records, Attribute 1 is 20 once and 10 once, so:
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value pair {“cool”, “high”} appears in only 2 other records, both with Attribute 1 = 20, so:
Probability of imputing value (20) = 100%
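A short sketch of this frequency-based imputation, using the toy table above. Attribute indices are 0-based, and the helper name imputation_probabilities is made up for illustration.

```python
from collections import Counter

# The slide's toy table; None marks the missing Attribute 1 value.
records = [
    (20,   "cool", "high",   False),
    (None, "cool", "high",   True),
    (20,   "cool", "high",   True),
    (20,   "mild", "low",    False),
    (30,   "cool", "normal", False),
    (10,   "mild", "high",   True),
]

def imputation_probabilities(records, missing_idx, match_idx):
    """Estimate P(candidate value) for attribute `missing_idx`,
    conditioning on the attributes in `match_idx` of the missing record."""
    missing = next(r for r in records if r[missing_idx] is None)
    matches = [r[missing_idx] for r in records
               if r is not missing
               and all(r[i] == missing[i] for i in match_idx)
               and r[missing_idx] is not None]
    counts = Counter(matches)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

print(imputation_probabilities(records, 0, [1]))     # ~{20: 0.67, 30: 0.33}
print(imputation_probabilities(records, 0, [3]))     # {20: 0.5, 10: 0.5}
print(imputation_probabilities(records, 0, [1, 2]))  # {20: 1.0}
```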
Methods of Treating Missing Data
• K-Nearest Neighbor (k-NN) approach
– k-NN imputes a missing attribute value on the basis of the record’s K nearest neighbors, which are determined using a distance measure over the known attributes.
– Once the K neighbors are determined, the missing value is imputed with the mean, median, or mode of the known values of that attribute among the neighbors.
[Figure: the record with the missing value and its nearest neighbors among the other dataset records]
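For numeric data, scikit-learn's KNNImputer implements this idea (it replaces the missing entry with the neighbors' mean); the toy matrix below is made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric matrix; np.nan marks the value to be imputed.
X = np.array([
    [20.0,   1.0, 3.0],
    [np.nan, 1.1, 2.9],   # record with a missing value
    [21.0,   0.9, 3.1],
    [45.0,   5.0, 9.0],
    [47.0,   5.2, 9.1],
])

# Neighbours are found with a distance measure over the known attributes;
# the missing entry is replaced by the mean of that attribute over the
# k nearest neighbours (here rows 0 and 2, giving 20.5).
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```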
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and have a human check them (e.g., to deal with
possible outliers)
Noisy Data (Binning Methods)
Sorted data for price (in dollars), with a bin depth of 3:
4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
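The two smoothings above can be reproduced with a short NumPy sketch (equal-frequency bins of depth 3, as in the example):

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
depth = 3                                               # values per bin
bins = prices.reshape(-1, depth)                        # equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), depth)

# Smoothing by bin boundaries: every value snaps to the closer of
# its bin's minimum or maximum.
by_boundaries = np.where(
    bins - bins.min(axis=1, keepdims=True)
    <= bins.max(axis=1, keepdims=True) - bins,
    bins.min(axis=1, keepdims=True),
    bins.max(axis=1, keepdims=True),
).ravel()

print(by_means)        # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_boundaries)   # [ 4  4 15 21 21 24 25 25 34]
```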
Noisy Data (Binning Methods)
Sorted data for price (in dollars), with a bin depth of 4:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar values are
organized into groups or “clusters”.
• Values which fall outside of the set of clusters may be considered outliers.
Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count
is very different from the expected count
Chi-Square Calculation: An Example
• Null Hypothesis: A & B are independent (not correlated)
• Alternate Hypothesis: A & B are dependent (correlated)
                          Gender
                          Male        Female       Sum (row)
Preferred    fiction      250 (90)    200 (360)    450
Reading      non-fiction  50 (210)    1000 (840)   1050
             Sum (col.)   300         1200         1500
• Χ2 (chi-square) calculation (numbers in parenthesis are expected counts
calculated based on the data distribution in the two categories)
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
Chi Square Calculation: An Example
Degrees of freedom = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1, where r is the number of distinct values of variable A and c is the number of distinct values of variable B.
At a significance level of 0.001, the critical value of the chi-square distribution with 1 degree of freedom is 10.828. Since 507.93 > 10.828, the null hypothesis is rejected.
This shows that Gender and Preferred Reading are strongly correlated in the given group.
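The same test is available in SciPy. The sketch below reproduces the expected counts and the chi-square statistic of the example; correction=False disables the continuity correction so that the hand calculation is matched exactly.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the slide: rows = preferred reading, cols = gender.
observed = np.array([[250, 200],     # fiction:     male, female
                     [50, 1000]])    # non-fiction: male, female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)   # [[ 90. 360.] [210. 840.]]
print(chi2, dof)  # ~507.93, 1
print(p)          # far below 0.001 -> reject independence
```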
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i \;-\; n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of records, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
Covariance (Numeric Data)
• Expected values of A and B: $E(A) = \bar{A}$, $E(B) = \bar{B}$
• Covariance of A and B: $\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n} a_i b_i - \bar{A}\bar{B}$
• Positively correlated (positive covariance): the attributes tend to rise together, e.g., the stock prices of both companies rise together
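As a quick check of the two formulas above, here is a small NumPy sketch with made-up paired observations (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical paired observations of two numeric attributes A and B
# (e.g. weekly stock prices of two companies); the numbers are made up.
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

# Covariance: E[(A - mean_A)(B - mean_B)]  (population form, divide by n)
cov = np.mean((A - A.mean()) * (B - B.mean()))

# Pearson correlation: covariance scaled by the two standard deviations.
# Whether n or n-1 is used cancels out, so r is the same either way.
r = cov / (A.std() * B.std())

print(cov)                      # ~7.0 -> positive: A and B rise together
print(r)                        # ~0.87
print(np.corrcoef(A, B)[0, 1])  # same value from NumPy's built-in
```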
Visually Evaluating Correlation
[Figure: scatter plots illustrating correlations ranging from –1 to 1]
Pearson Correlation
Pearson Correlation (Shortcut)
Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature selection
• A subset of the attributes is selected for further processing
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: supervised
Normalization
• Works on numeric attributes
• Attribute normalization maps one range of values onto another
• Usual ranges: -1 to +1, or 0 to 1.
• Issues: this might introduce distortions or biases into the data
• So you need to understand the properties and potential weaknesses
of the method before using it
Normalization: min-max
• Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

– Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
• Positive: min-max normalization preserves all relationships of data values
exactly and doesn't introduce any potential biases
• Negative: If a future input case falls outside the original data range, an “out of
bounds” error will occur
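A minimal sketch of the formula, plus one way to clip a future out-of-range value (the helper name min_max_normalize is made up):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The slide's income example: $73,600 in [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716

# One way to handle a future out-of-range value (next slide): clip it.
v = min_max_normalize(120_000, 12_000, 98_000)                # > 1.0
print(min(max(v, 0.0), 1.0))                                  # 1.0
```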
Normalization (Dealing with out-of-range values)
• Ignore that the range has been exceeded
– ….but does this affect the quality of the model?
• Ignore the out-of-range instances
– Reducing the number of instances reduces the confidence that the sample
represents the population
– Introduces bias
• Clip the out-of-range values
– E.g. if the value > 1, assign 1 to it. If value < 0, assign 0 to it.
– Information content on the limits is distorted by projecting multiple values
to a single value.
Normalization (z-score)
• Normalization of a value v of attribute A based on the mean and the
standard deviation of the attribute
– The mean and standard deviation depend on the data

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000 and σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
• When should z-score be used rather than min-max normalization?
Normalization: decimal scaling
• Moves the decimal point of the values of A by j positions, where j is the
smallest number of positions such that the maximum absolute value falls in [0, 1]

$$v' = \frac{v}{10^{\,j}}$$

• E.g. if v ranges between −98 and 9,738, then j = 4 means
that v' ranges between −0.0098 and 0.9738
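Both z-score normalization and decimal scaling are one-liners in NumPy. The sketch below reuses the slides' numbers; the extra values 38,000 and 120 are made up to fill out the example.

```python
import numpy as np

# z-score normalization: (v - mean) / std.  Using the slide's parameters
# mu = 54,000 and sigma = 16,000 reproduces 1.225 for v = 73,600.
values = np.array([73_600.0, 54_000.0, 38_000.0])   # 38,000 is made up
mu, sigma = 54_000.0, 16_000.0
print((values - mu) / sigma)    # [ 1.225  0.    -1.   ]

# Decimal scaling: divide by 10**j with the smallest j that brings the
# largest absolute value into [0, 1].
v = np.array([-98.0, 120.0, 9_738.0])               # 120 is made up
j = int(np.ceil(np.log10(np.abs(v).max())))
print(j, v / 10**j)             # 4 [-0.0098  0.012   0.9738]
```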
Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Discretization
• The task of attribute (feature)-discretization techniques is to
discretize the values of continuous features into a small number
of intervals, where each interval is mapped to a discrete symbol.
• Advantages:
– Simplified data description and easier-to-understand data and final data-mining results
– A smaller set of interesting rules is mined
– Decreased end-result processing time
– Improved end-result accuracy
Entropy Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the
information (weighted entropy) after partitioning is

$$E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2), \qquad \mathrm{Ent}(S_1) = -\sum_{i} p_i \log_2 p_i$$

• S1 and S2 correspond to the samples in S satisfying the conditions A < T and A ≥ T, respectively.
• p_i is the probability of class i in S1, determined by dividing the number of samples of
class i in S1 by the total number of samples in S1.
• The boundary that minimizes the entropy function over all possible boundaries is selected as the
binary discretization.
• The process is applied recursively to the partitions obtained until some stopping criterion is met.
Entropy Based Discretization: Example
ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
• Let Grade be the class attribute. Use entropy-based
discretization to divide the range of ages into different
discrete intervals.
• There are 6 possible boundaries. They are 21.5, 23, 24.5,
26, 31, and 38.
• Let us consider the boundary at T = 21.5.
Let S1 = {21}
Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
Entropy Based Discretization: Example
ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
• The number of elements in S1 and S2 are:
|S1| = 1
|S2| = 8
• The entropy of S1 is

$$\mathrm{Ent}(S_1) = -P(\text{Grade}=\mathrm{F})\log_2 P(\text{Grade}=\mathrm{F}) - P(\text{Grade}=\mathrm{P})\log_2 P(\text{Grade}=\mathrm{P}) = -(1/1)\log_2(1/1) - (0/1)\log_2(0/1) = 0$$

• The entropy of S2 is

$$\mathrm{Ent}(S_2) = -P(\text{Grade}=\mathrm{F})\log_2 P(\text{Grade}=\mathrm{F}) - P(\text{Grade}=\mathrm{P})\log_2 P(\text{Grade}=\mathrm{P}) = -(2/8)\log_2(2/8) - (6/8)\log_2(6/8) = 0.5 + 0.311 = 0.811$$
Entropy Based Discretization: Example
• Hence, the entropy after partitioning at T = 21.5 is

$$E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2) = \frac{1}{9}\,\mathrm{Ent}(S_1) + \frac{8}{9}\,\mathrm{Ent}(S_2) = (1/9)(0) + (8/9)(0.811) = 0.721$$
Entropy Based Discretization: Example
• The entropies after partitioning are computed for all the boundaries:
T = 21.5 → E(S, 21.5)
T = 23 → E(S, 23)
…
T = 38 → E(S, 38)
• Select the boundary with the smallest entropy. Suppose the best boundary is T = 23.
• Now recursively apply entropy discretization to both partitions:

ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
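The whole boundary search fits in a short sketch. It recomputes E(S, T) for every candidate boundary of the toy table and returns the minimizing one; on these particular numbers the minimum happens to fall at T = 26, while the slide's "suppose T = 23" merely illustrates the recursive step. The helper names entropy and best_split are made up.

```python
import math
from collections import Counter

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing E(S, T), together with that entropy."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # 21.5, 23, ..., 38
    best = None
    for t in candidates:
        left  = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        e = (len(left) / len(pairs)) * entropy(left) \
            + (len(right) / len(pairs)) * entropy(right)
        if best is None or e < best[1]:
            best = (t, e)
    return best

print(best_split(ages, grades))   # (26.0, ~0.361) on this toy table
```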
Outline
• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Reduction
• Data is often too large and reducing data can improve performance
• Data reduction consists of reducing the representation of the dataset
while producing the same or almost the same results
• Data reduction includes:
– Aggregation, dimensionality reduction, discretization, numerosity reduction
Dimensionality reduction
• Feature selection (or attribute subset selection)
– Select only the necessary attributes
– The goal is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the
distribution obtained using all attributes
• Example Technique:
– Decision-tree induction
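One common way to realize this in practice is to fit a decision tree and keep only the attributes it actually splits on. A minimal scikit-learn sketch on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up data: only the first of four attributes determines the class.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes the tree never splits on get importance 0 and can be dropped.
print(tree.feature_importances_)
selected = np.flatnonzero(tree.feature_importances_ > 0)
print("selected attribute indices:", selected)
```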
Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
– Remove redundancies
– Detect inconsistencies
• Data transformation
– Dimensionality Reduction
– Normalization
– Discretization