Data Mining and Data Warehousing CSPC-308
• Importance
– “Data cleaning is one of the three biggest problems in
data warehousing”—Ralph Kimball
– “Data cleaning is the number one problem in data
warehousing”—DCI survey
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
• Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of missing
values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based such as Bayesian formula or
decision tree (e.g., predict my age based on the info at my web site?)
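The class-conditional mean strategy above can be sketched in plain Python; the data here is hypothetical:

```python
from collections import defaultdict

# Hypothetical (class, income) tuples; None marks a missing value
rows = [("A", 50.0), ("A", None), ("B", 70.0), ("B", None), ("B", 90.0)]

# Attribute mean over the known values (simple fill)
known = [v for _, v in rows if v is not None]
attr_mean = sum(known) / len(known)                 # 70.0

# Smarter: mean per class, then fill each gap with its class's mean
sums, counts = defaultdict(float), defaultdict(int)
for c, v in rows:
    if v is not None:
        sums[c] += v
        counts[c] += 1
class_mean = {c: sums[c] / counts[c] for c in counts}

filled = [v if v is not None else class_mean[c] for c, v in rows]
# filled == [50.0, 50.0, 70.0, 80.0, 90.0]
```

Note how the class-conditional fill gives the two missing tuples different values (50 for class A, 80 for class B), whereas the plain attribute mean would have given both 70.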
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
■ Regression
■ smooth by fitting the data into regression functions
■ Clustering
■ detect and remove outliers
■ Combined computer and human inspection
■ detect suspicious values and check by human (e.g., deal
with possible outliers)
Data Mining: Concepts and Techniques
Simple Discretization Methods: Binning
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
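The worked example above can be reproduced in a few lines of Python (bin means are rounded to the nearest integer, as on the slide):

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
k = 3
size = len(data) // k
bins = [data[i * size:(i + 1) * size] for i in range(k)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
```

Running this reproduces the three smoothed bins shown above: means give 9/23/29 per bin, and boundary smoothing snaps each value to the closer of its bin's minimum or maximum.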
Clustering
Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution) (How
many people are there in Nebraska?)
■ Check uniqueness rule, consecutive rule and null rule
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
■ Data auditing: analyze data to discover rules and relationships, and
detect violators (e.g., use correlation and clustering to find outliers)
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from
different sources are different
■ Possible reasons: different representations, different scales,
e.g., metric vs. British units (e.g., GPA in US and China)
• The larger the χ² value, the more likely the variables are
related
• The cells that contribute the most to the χ² value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
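As a sketch, χ² for a 2×2 contingency table can be computed directly from observed and expected counts; the counts below are hypothetical:

```python
obs = [[250, 200],
       [50, 1000]]                       # hypothetical 2x2 contingency table

row = [sum(r) for r in obs]              # row totals
col = [sum(c) for c in zip(*obs)]        # column totals
n = sum(row)                             # grand total

# Expected count under independence: (row total * column total) / n
exp = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))
# A large chi2 means the observed counts deviate strongly from independence
```

Here the cell (250 observed vs. 90 expected) dominates the statistic, illustrating the second bullet above.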
[Figure: decision-tree induction for attribute subset selection. Example
attributes: Salary, Credit Score, House, Monthly payment, Age; internal nodes
test attributes (A1?, A4?, A6?); leaves are Class 1 / Class 2; the class
label is Loan Approval]
■ String compression
■ There are extensive theories and well-tuned algorithms (e.g.,
Huffman encoding algorithm)
■ Typically lossless
■ But only limited manipulation is possible without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement
■ Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
■ Time sequences are not audio
■ Typically short and vary slowly with time
Data Compression
[Figure: original data compressed losslessly can be fully restored; lossy
compression yields only an approximation of the original data]
■ Non-parametric methods
■ Do not assume models
■ Major families: histograms, clustering, sampling
Data Reduction Method (1): Regression Models
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and
are to be estimated by using the data at hand
– Apply the least squares criterion to the known values Y1,
Y2, …, X1, X2, … to estimate w and b
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the
above
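A minimal least-squares fit for Y = wX + b, on made-up points:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]    # hypothetical observations, roughly y = 2x

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares estimates: w = Sxy / Sxx, b = mean_y - w * mean_x
w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - w * mean_x
```

For these points the fit comes out close to w ≈ 1.95 and b ≈ 0.15, so the twelve raw values are reduced to two coefficients.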
Data Reduction Method: Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
• There are many choices of clustering definitions and clustering algorithms
• Cluster analysis will be studied in depth in Chapter 7
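Storing only a centroid and diameter per cluster, as described above, might look like this (the clusters are assumed to have been found already by some clustering algorithm):

```python
# Hypothetical clusters, assumed precomputed by a clustering algorithm
clusters = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Keep only (centroid, diameter) per cluster instead of the raw points
summary = [(sum(c) / len(c), max(c) - min(c)) for c in clusters]
# e.g., the first cluster is summarized as (9.0, 11)
```

Twelve values reduce to three (centroid, diameter) pairs; reconstruction is only approximate, which is why this works poorly on "smeared" data.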
[Figure: from the Raw Data, a simple random sample is drawn either without
replacement (SRSWOR) or with replacement (SRSWR)]
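The two simple-random-sampling variants can be sketched with Python's standard library:

```python
import random

random.seed(42)                          # fixed seed for reproducibility
data = list(range(100))                  # hypothetical raw data

# SRSWOR: simple random sample without replacement (no duplicates possible)
srswor = random.sample(data, 10)

# SRSWR: with replacement (the same item may be drawn more than once)
srswr = [random.choice(data) for _ in range(10)]
```

`random.sample` guarantees distinct items, while repeated `random.choice` calls model drawing with replacement.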
Sampling: Cluster or Stratified Sampling
■ Discretization
■ Reduce the number of values for a given continuous attribute by dividing
the range of the attribute into intervals
■ Interval labels can then be used to replace actual data values
■ Supervised vs. unsupervised
■ Split (top-down) vs. merge (bottom-up)
■ Discretization can be performed recursively on an attribute
■ Concept hierarchy formation
■ Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
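A concept hierarchy for age might be applied like this; the cut-points 30 and 60 are illustrative, not from the source:

```python
def age_concept(age):
    # Illustrative cut-points; real hierarchies come from domain knowledge
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

# Replace low-level numeric values with higher-level concepts
labels = [age_concept(a) for a in (22, 45, 71)]
# labels == ["young", "middle-aged", "senior"]
```

The same mapping idea extends to multi-level hierarchies (e.g., numeric age → 10-year interval → young/middle-aged/senior), applied recursively as the slide describes.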