Big Data Lecture # 04
Outline:
- Data Cleaning
- Inconsistent Data
- Dimensionality Reduction
How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification). Not effective when the percentage of missing values per attribute varies considerably.
* Use a global constant to fill in the missing value: e.g., "unknown". The risk is that the mining algorithm may mistake the constant for an interesting new class.
* Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter.
* Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree.
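A minimal sketch of these strategies in Python, assuming a pandas DataFrame with hypothetical columns "income" (the attribute with missing values) and "label" (the class):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [4500.0, None, 3800.0, None, 5200.0],
    "label":  ["yes", "yes", "no", None, "yes"],
})

# 1. Ignore the tuple: drop rows whose class label is missing.
cleaned = df.dropna(subset=["label"])

# 2. Global constant: fill with a sentinel value such as "unknown" / -1.
filled_const = df["income"].fillna(-1)

# 3. Class-conditional mean: fill with the mean income of the same class.
filled_mean = df.groupby("label")["income"].transform(
    lambda col: col.fillna(col.mean())
)

# 4. Most probable value: a learned imputer (e.g., a decision tree trained
#    on the complete tuples) would predict the missing entry here.
print(filled_mean)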
[Figure: equi-width binning divides the price range into eight equal intervals (0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80); the second histogram uses intervals of unequal width (0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80), consistent with equi-depth (equal-frequency) binning.]
Smoothing using Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
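The same smoothing can be reproduced in a few lines; a sketch using only numpy, with the price list above:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(prices, 3)       # equi-depth: 4 values per bin

# smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [np.full(len(b), round(b.mean())) for b in bins]

# smoothing by bin boundaries: every value snaps to the closer boundary
# (ties go to the lower boundary here; the tie rule is a choice)
def by_boundaries(b):
    lo, hi = b[0], b[-1]
    return np.where(b - lo <= hi - b, lo, hi)

print(by_means)   # bins become [9 9 9 9], [23 23 23 23], [29 29 29 29]
print([by_boundaries(b) for b in bins])
                  # [4 4 4 15], [21 21 25 25], [26 26 26 34]
```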
Inconsistent Data
Use routines designed to detect inconsistencies, then manually correct them. E.g.,
the routine may check global constraints (such as age > 10) or known functional
dependencies.
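A minimal sketch of such a routine, with illustrative column names (the rule age > 10 is taken from the example above; the birth-year check and the reference year 2024 are assumptions for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 7, 42], "birth_year": [1999, 2017, 1982]})

# Global constraint from the example: age > 10.
violations = df[~(df["age"] > 10)]

# Functional-dependency-style check: age must agree with birth_year
# (assuming the data were recorded in 2024).
inconsistent = df[2024 - df["birth_year"] != df["age"]]

print(violations)     # flags the tuple with age = 7 for manual correction
print(inconsistent)   # empty here: all ages agree with the birth years
```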
Data integration:
- combines data from multiple sources into a coherent store
Schema integration
- integrate metadata from different sources
- metadata: data about the data (i.e., data descriptors)
Entity identification problem
- identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
- for the same real-world entity, attribute values from different sources differ (e.g., "J. D. Smith" and "John Smith" may refer to the same person)
- possible reasons: different representations or different scales, e.g., metric vs. British units (cm vs. inches)
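As a small illustration of both problems, a sketch that merges two hypothetical sources after matching A.cust-id ≡ B.cust-# and converting inches to centimetres (all names and values are made up):

```python
import pandas as pd

a = pd.DataFrame({"cust_id": [1, 2], "height_in": [70.0, 65.0]})
b = pd.DataFrame({"cust_no": [1, 3], "height_cm": [180.0, 172.0]})

# entity identification: cust_no in B is the same entity key as cust_id in A
b = b.rename(columns={"cust_no": "cust_id"})

# resolve the scale conflict: unify on centimetres before merging
a["height_cm"] = a["height_in"] * 2.54

merged = pd.concat([a[["cust_id", "height_cm"]], b], ignore_index=True)
print(merged)
```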
Handling Redundant Data in Data Integration
Attribute/feature construction
New attributes constructed from the given ones
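A one-line sketch of attribute construction, deriving a hypothetical "area" attribute from two given ones:

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.5], "height": [4.0, 2.0]})
df["area"] = df["width"] * df["height"]   # new attribute from given ones
print(df)
```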
Normalization: why normalize?
Normalization helps prevent attributes with large ranges from outweighing attributes with small ranges.
Example: min-max normalization
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
e.g., convert age = 30 to the range [0, 1], when min = 10 and max = 80:
new_age = (30 − 10) / (80 − 10) = 2/7 ≈ 0.29
z-score normalization
v' = (v − mean_A) / stand_dev_A
normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
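A sketch of all three formulas in Python (numpy only; the age values are illustrative):

```python
import numpy as np

age = np.array([10.0, 30.0, 80.0])

# min-max normalization to [new_min, new_max] = [0, 1]
minmax = (age - age.min()) / (age.max() - age.min())   # 30 -> 2/7 ≈ 0.29

# z-score normalization (population standard deviation)
zscore = (age - age.mean()) / age.std()

# decimal scaling: smallest j with max(|v'|) < 1
j = 0
while np.abs(age).max() / 10**j >= 1:
    j += 1
decimal = age / 10**j                                  # j = 2, so 80 -> 0.80

print(minmax, zscore, decimal, sep="\n")
```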
Data Reduction Strategies
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
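As a sketch of the sampling family, simple random sampling without replacement over a stand-in dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10_000)          # stand-in for a large set of tuples

# keep a 1% simple random sample, without replacement
sample = rng.choice(data, size=len(data) // 100, replace=False)
print(sample.shape)               # (100,)
```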
Discretization
Three types of attributes: nominal, ordinal, and continuous; discretization targets the continuous ones.
Why discretize? Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies
reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
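Both ideas fit in a few lines; a sketch that discretizes age into intervals and maps them to the higher-level concepts above (the cut points 30 and 60 are illustrative):

```python
import pandas as pd

ages = pd.Series([15, 28, 45, 52, 67, 73])

# discretization + concept hierarchy: intervals (0,30], (30,60], (60,120]
# replace raw values with the labels young / middle-aged / senior
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```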