2. Data Mining: Data Preprocessing
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Collection for Mining
• Data mining requires collecting a great amount of data (available in data warehouses or databases) to achieve the intended objective.
– Data mining starts with understanding the business or problem domain in order to gain business knowledge.
• Business knowledge guides the process towards useful results and enables the recognition of those results that are useful.
– Based on the business knowledge, data related to the business problem are identified in the database/data warehouse for mining.
• Before feeding data to DM, we have to ensure the quality of the data.
Data Quality Measures
• Well-accepted multidimensional data quality measures are the following:
– Accuracy (No errors, no outliers)
– Completeness (no missing attributes & values)
– Consistency (no inconsistent values and attributes)
– Timeliness (appropriateness)
– Believability (acceptability)
– Interpretability (easy to understand)
• Most real-world data are of poor quality; that is:
– Incomplete, Inconsistent, Noisy, Invalid, Redundant, …
Data is often of low quality
• Collecting the required data is challenging
– In addition to being heterogeneous and distributed in nature, real-world data is low in quality.
• Why?
– You didn’t collect it yourself!
– It probably was created for some other use, and then you came along wanting to integrate it.
– People make mistakes (typos).
– People are busy (“this is good enough”) and do not systematically organize their data using carefully structured formats.
Types of problems with data
• Some data have problems on their own that need to be cleaned:
– Outliers: misleading data that do not fit most of the data/facts
– Missing data: attribute values might be absent and need to be replaced with estimates
– Irrelevant data: attributes in the database that might not be of interest to the DM task at hand
– Noisy data: attribute values that might be invalid or incorrect, e.g., typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we want to integrate them
– Everyone has their own way of structuring and formatting data, based on what is convenient for them
– How to integrate data organized in different formats following different conventions?
Case study: Government Agency Data
• What we want: [figure omitted: the target integrated view of the agency records]
Data Cleaning: Missing Data
• Data is not always available; tuples may lack attribute values. E.g., Occupation=“ ”
– Many tuples have no recorded value for several attributes, such as customer income in sales data.
Data Cleaning: Missing Data
• Missing data may be due to:
– values inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding, or not considered important at the time of entry
– failure to register history or changes of the data
• How to handle missing data? Missing data may need to be inferred:
– Ignore the missing value: not effective when the percentage of missing values per attribute varies considerably
– Fill in the missing value manually: tedious + infeasible?
– Fill in automatically:
• calculate the most probable value, e.g., using the Expectation-Maximization (EM) algorithm (a worked example and a sketch follow)
Predict missing value using EM
• Solves estimation with incomplete data:
– Obtain initial estimates for the parameters using the mean value.
– Use the estimates to calculate a value for the missing data, and
– iterate until convergence (|μi+1 − μi| ≤ θ).
• E.g.: out of six data items, the known values are {1, 5, 10, 4}; estimate the two missing data items.
– Let EM converge when two successive estimates differ by at most 0.05, and let our initial guess of the two missing values be 3.
– μ0 = 3
– μ1 = 4.33
– μ2 = 4.77
– μ3 = 4.92
– μ4 = 4.97
• The algorithm stops, since the last two estimates are only 0.05 apart.
• Thus, our estimate for the two missing items is 4.97.
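A minimal sketch of this mean-based EM-style imputation in Python (the data, initial guess, and convergence threshold come from the slide’s example; the function itself is illustrative, not from the source):

def em_impute(known, n_missing, init=3.0, eps=0.05):
    """Iteratively re-estimate missing values as the mean of the completed data."""
    guess = init
    while True:
        # E-step: fill the missing slots with the current guess
        data = known + [guess] * n_missing
        # M-step: re-estimate the mean from the completed data
        new_guess = sum(data) / len(data)
        if abs(new_guess - guess) <= eps:  # convergence: |mu_{i+1} - mu_i| <= eps
            return new_guess
        guess = new_guess

# Reproduces the slide's example: known = {1, 5, 10, 4}, two missing items
print(em_impute([1, 5, 10, 4], n_missing=2))  # -> about 4.97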
Data Cleaning: Noisy Data
• Noisy data: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
• Typographical errors also corrupt data:
– say, ‘green’ is written as ‘rgeen’ (a fuzzy-matching sketch follows)
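One common way to catch such typos is fuzzy string matching against a dictionary of valid values. A small sketch using Python’s standard difflib (the color list and cutoff are assumptions for illustration):

import difflib

valid_colors = ["green", "red", "blue", "yellow"]

def correct_typo(value, valid, cutoff=0.7):
    """Return the closest valid value, or the input unchanged if no close match."""
    matches = difflib.get_close_matches(value, valid, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(correct_typo("rgeen", valid_colors))  # -> 'green'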
Data Integration: Formats
• Not everyone uses the same format. Do you agree?
– Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Dates are especially problematic (see the parsing sketch after this list):
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
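A small sketch of normalizing such heterogeneous date strings in Python (the candidate format list is an assumption; note that truly ambiguous strings like 12/19/97 vs. 19/12/97 cannot be resolved by format alone — here the first matching format simply wins):

from datetime import datetime

# Candidate formats, tried in order (assumed; adjust per data source)
DATE_FORMATS = ["%m/%d/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(text):
    """Parse a date written in any known format; return ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(normalize_date("Dec 19, 1997"))      # -> '1997-12-19'
print(normalize_date("19 December 1997"))  # -> '1997-12-19'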
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or names; this is also a problem of lacking standardization / naming conventions. e.g.,
– Age=“26” vs. Birthday=“03/07/1986”
– Some use “1, 2, 3” for rating; others “A, B, C”
• Discrepancy between duplicate records:

ID  Name                        City         State
1   Ministry of Transportation  Addis Ababa  Addis Ababa region
2   Ministry of Finance         Addis Ababa  Addis Ababa administration
3   Office of Foreign Affairs   Addis Ababa  Addis Ababa regional administration
Data Integration: different structure
• [Figure omitted: the same records stored with different table structures]
• What’s wrong here? No data type constraints
Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin it
• Example:
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Derive age and divide the data into the appropriate categories
• Sometimes you cannot bin it
• Example:
– I need age ranges 20-30, 30-40, 40-50, etc.
– Data is in age ranges 25-35, 35-45, etc.
– What to do?
• Ignore the age ranges because you aren’t sure
• Make an educated guess based on the imported data (e.g., assume the # of people aged 25-35 is the average of the # aged 20-30 and the # aged 30-40; see the sketch below)
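One simple way to make that educated guess is to assume ages are uniform within each source range and reallocate counts proportionally to the overlap with each target range. A sketch under that uniformity assumption (the counts are invented):

def overlap(a, b):
    """Length of the overlap between intervals a = (lo, hi) and b = (lo, hi)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def rebin(source_counts, targets):
    """Spread each source bin's count over the target bins it overlaps,
    proportionally to the overlap (assumes uniformity within each source bin)."""
    result = {t: 0.0 for t in targets}
    for src, count in source_counts.items():
        width = src[1] - src[0]
        for t in targets:
            result[t] += count * overlap(src, t) / width
    return result

# Hypothetical counts for the slide's source ranges 25-35 and 35-45
src = {(25, 35): 100, (35, 45): 60}
print(rebin(src, [(20, 30), (30, 40), (40, 50)]))
# -> {(20, 30): 50.0, (30, 40): 80.0, (40, 50): 30.0}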
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
– For the same real-world entity, attribute values from different sources differ
– Possible reasons: different representations, different scales, e.g., American vs. British units
• weight measurement: kg or pound
• height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
– Information source #2 says that Alex lives in Mekele
• What to do? (a sketch of one strategy follows)
– Use both (he lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag the row to be investigated further by hand
– Use neither (we’d rather be incomplete than wrong)
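A toy sketch of the “most trusted source” strategy in Python (the trust ranking and the records are invented for illustration):

# Trust ranking: lower number = more trusted (an assumed ordering)
TRUST = {"source1": 0, "source2": 1}

def resolve(conflicting):
    """Pick the value reported by the most trusted source."""
    return min(conflicting, key=lambda rec: TRUST[rec["source"]])["value"]

records = [
    {"source": "source2", "value": "Mekele"},
    {"source": "source1", "value": "Bahirdar"},
]
print(resolve(records))  # -> 'Bahirdar' (source1 outranks source2)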
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation analysis and covariance analysis (see the sketch below)
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
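A minimal sketch of correlation-based redundancy detection with NumPy (the data and the 0.9 threshold are assumptions for illustration):

import numpy as np

age = np.array([20, 25, 30, 35, 40])
years_since_birth = 2024 - np.array([2004, 1999, 1994, 1989, 1984])  # derivable from age

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(age, years_since_birth)[0, 1]
print(r)  # -> 1.0: perfectly correlated, so one attribute is redundant

if abs(r) > 0.9:  # assumed rule-of-thumb threshold
    print("Attributes look redundant; consider dropping one.")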
Covariance
• Covariance is similar to correlation; both measure how two attributes vary together.
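For reference, the standard textbook definitions (not spelled out on the slide) make the relationship explicit:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$$

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}$$

That is, correlation is covariance normalized by the two standard deviations, which makes it scale-free while covariance is not.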
Sampling: with or without Replacement
• [Figure omitted: raw data reduced by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Sampling: Cluster or Stratified Sampling
• [Figure omitted: raw data partitioned into clusters/strata, with samples drawn from each group]
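A quick sketch of the sampling schemes from these two slides in Python (the data, sample sizes, and strata are invented; random.sample and random.choice are standard-library calls):

import random

data = list(range(100))

srswor = random.sample(data, 10)                   # without replacement
srswr = [random.choice(data) for _ in range(10)]   # with replacement

# Stratified sampling: draw from each stratum separately (assumed strata)
strata = {"low": data[:50], "high": data[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]
print(len(srswor), len(srswr), len(stratified))    # -> 10 10 10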
Data Transformation
• A function that maps the entire set of values of a given attribute to
a new set of replacement values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range
of values
• min-max normalization
• z-score normalization
– Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals. Interval labels can then be
used to replace actual data values
• Discretization can be performed recursively on an attribute using methods such as
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e.,
attribute values) hierarchically
Normalization
• Min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization:
v' = (v − μ_A) / σ_A
– Ex. Let μ = 54,000, σ = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
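Both transforms in a few lines of Python, reproducing the slide’s numbers (an illustrative sketch, not from the source):

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Scale v from [lo, hi] into [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    """Standardize v to zero mean and unit variance."""
    return (v - mu) / sigma

print(min_max(73_600, 12_000, 98_000))  # -> 0.716...
print(z_score(73_600, 54_000, 16_000))  # -> 1.225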
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size (uniform grid)
– If A and B are the lowest and highest values of the attribute, the width of intervals for N bins will be:
W = (B − A) / N
– This is the most straightforward approach, but outliers may dominate the presentation (a bin-assignment sketch follows)
• Skewed data is not handled well
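A tiny sketch of assigning values to N equal-width bins (clamping the top value into the last bin is a standard detail, not stated on the slide):

def equal_width_bin(v, A, B, N):
    """Return the 0-based bin index of v given range [A, B] and N bins."""
    W = (B - A) / N                       # bin width, as on the slide
    return min(int((v - A) // W), N - 1)  # clamp v == B into the last bin

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print([equal_width_bin(v, 4, 34, 3) for v in values])
# -> [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2] with bins [4,14), [14,24), [24,34]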
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
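A sketch of equal-frequency binning with smoothing by bin means (rounded, to match the slide’s example) in Python:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(data, n_bins):
    """Split sorted data into n_bins bins with (roughly) equal counts."""
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]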
Concept Hierarchy Generation
• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
– e.g., for location: country → region or state → city → sub-city → kebele
– Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as child, youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
• A concept hierarchy can be formed automatically by analyzing the number of distinct values, e.g., for the set of attributes {kebele, city, state, country} (see the sketch below)
• For numeric data, use discretization methods.
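A sketch of that distinct-value heuristic: the attribute with the fewest distinct values goes at the top of the hierarchy (the toy rows are invented for illustration):

# Toy location table (invented values)
rows = [
    {"kebele": "01", "city": "Addis Ababa", "state": "Addis Ababa", "country": "Ethiopia"},
    {"kebele": "02", "city": "Addis Ababa", "state": "Addis Ababa", "country": "Ethiopia"},
    {"kebele": "05", "city": "Bahirdar",    "state": "Amhara",      "country": "Ethiopia"},
    {"kebele": "03", "city": "Gondar",      "state": "Amhara",      "country": "Ethiopia"},
]

attrs = ["kebele", "city", "state", "country"]
distinct = {a: len({r[a] for r in rows}) for a in attrs}

# Fewest distinct values first: that attribute sits at the top of the hierarchy
hierarchy = sorted(attrs, key=lambda a: distinct[a])
print(hierarchy)  # -> ['country', 'state', 'city', 'kebele']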
Data set preparation for learning
• A standard machine learning and data mining technique is to divide the dataset into a training set and a test set (a split sketch follows).
– The training dataset is used for learning the parameters of the model in order to produce hypotheses.
• A training set is a set of problem instances (described as a set of properties and their values), together with a classification of each instance.
– The test dataset, which is never seen during the hypothesis-forming stage, is used to get a final, unbiased estimate of how well the model works.
• The test set evaluates the accuracy of the model/hypothesis in predicting the categorization of unseen examples.
• It is a set of instances and their classifications used to test the accuracy of a learned hypothesis.
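A minimal sketch of a random train/test split in Python (the 80/20 ratio and the toy data are assumptions, not from the slides):

import random

def train_test_split(instances, test_fraction=0.2, seed=42):
    """Shuffle the instances and hold out a fraction for testing."""
    data = instances[:]            # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]  # (training set, test set)

data = [(x, x % 2) for x in range(100)]  # (feature, class) pairs
train, test = train_test_split(data)
print(len(train), len(test))  # -> 80 20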
Classification: Train, Validation, Test split
• [Diagram omitted: data with known results (+/−) is split; the training set feeds the model builder; the builder’s predictions (Y/N) are evaluated against a held-out test dataset; a separate final test set is used with the final model for the final evaluation]
Divide the dataset into training & test
• There are various ways to separate the data into training and test sets
– These are the established ways to use the two sets to assess the effectiveness and the predictive/descriptive accuracy of a machine learning technique on unseen examples.
Cross-validation
• Cross-validation works as follows:
– First step: the data is randomly split into k subsets of equal size. A partition of a set is a collection of subsets for which the intersection of any pair of sets is empty; that is, no element of one subset is an element of another subset in the partition.
– Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
– Often the subsets are stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate (see the sketch below)
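A plain-Python sketch of the k-fold loop (the model-training and error function is a placeholder you would supply):

import random

def k_fold_cv(data, k, train_and_eval, seed=0):
    """Split data into k folds; each fold is the test set exactly once.
    train_and_eval(train, test) must return an error estimate."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_eval(train, test))
    return sum(errors) / k  # averaged overall error estimate

# Example with a dummy evaluator that just reports the test-fold size
print(k_fold_cv(list(range(10)), k=5, train_and_eval=lambda tr, te: len(te)))  # -> 2.0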
Cross-validation example:
— Break up the data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding aside each group in turn
[Diagram omitted: one fold labeled “Test”, the remaining folds used for training]