Data Integration and Discretization
DATA DISCRETIZATION
Data integration combines data from disparate sources into cohesive, meaningful data,
with quality, governance and compliance considerations.
Combination of technical and business processes used to combine data from disparate
sources into meaningful and valuable information. A complete data integration solution
delivers trusted data from a variety of sources.
This is the traditional domain of ETL (Extract, Transform and Load), which transforms and
cleans the data as it is extracted from various data sources and loaded into one data store
(data warehouse). For example, converting a single "address" variable into "street
address", "city", "state" and "zip code" fields.
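The address example above can be sketched as a small transform step. This is only an illustration: the function name `split_address` and the "street, city, state zip" layout are assumptions, and real addresses need a much more tolerant parser.

```python
def split_address(address: str) -> dict:
    """Split 'street, city, state zip' into separate fields.

    Assumes a hypothetical, well-formed layout with exactly two commas.
    """
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split()
    return {
        "street_address": street,
        "city": city,
        "state": state,
        "zip_code": zip_code,
    }

# Made-up sample value, used only to show the shape of the output.
record = split_address("12 Main St, Springfield, IL 62701")
print(record)
```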
Source: KDnuggets
Data Integration
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units (for example, weight in kg or pounds)
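A minimal sketch of the two issues above: mapping differently named key attributes (A.cust-id and B.cust-#) onto one schema, and converting units before comparing values. The records, the target field names and the tolerance are hypothetical.

```python
LB_PER_KG = 2.20462  # approximate number of pounds in one kilogram

# Hypothetical records from two sources describing the same customer.
source_a = {"cust-id": 17, "weight_kg": 70.0}
source_b = {"cust-#": 17, "weight_lb": 154.3}

def from_a(rec):
    # Source A already uses kilograms; only the key is renamed.
    return {"customer_id": rec["cust-id"], "weight_kg": rec["weight_kg"]}

def from_b(rec):
    # Source B uses pounds; rename the key and convert to kilograms.
    return {"customer_id": rec["cust-#"], "weight_kg": rec["weight_lb"] / LB_PER_KG}

a, b = from_a(source_a), from_b(source_b)
same_entity = a["customer_id"] == b["customer_id"]
# After conversion the values should agree within a small tolerance (assumed 0.1 kg).
weights_agree = abs(a["weight_kg"] - b["weight_kg"]) < 0.1
print(same_entity, weights_agree)
```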
Problems in Data Integration
Differing attribute names
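$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where n is the number of tuples, and $\bar{x}$ and $\bar{y}$ are the respective means of X and Y.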
If r_xy > 0, X and Y are positively correlated (X's values increase
as Y's do). The higher the value, the stronger the correlation.
r_xy = 0: no linear correlation (this alone does not imply independence); r_xy < 0: negatively correlated
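The correlation coefficient described above can be computed directly from its definition. The sketch below uses only the standard library; the function name `pearson_r` and the sample values are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) *
               sum((y - mean_y) ** 2 for y in ys))
    return num / den

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # y rises with x
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises
print(r_pos, r_neg)
```

A value close to +1 or -1 suggests that one of the two attributes may be redundant and is a candidate for removal during integration.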
Association Analysis (Categorical Data)
Data Requirements
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
There is no relationship between the subjects in each group.
The categorical variables are not "paired" in any way (e.g. pre-
test/post-test observations).
• Relatively large sample size.
Expected frequencies for each cell are at least 1.
Expected frequencies should be at least 5 for the majority (80%) of the
cells.
Association Analysis (2)
The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square
Test of Independence can be expressed in two different but equivalent
ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]"
Association Analysis (3)
Chi-Square Calculation: An Example
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
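This value is far above the critical value for 1 degree of freedom (10.828 at the
0.001 significance level), so the hypothesis of independence is rejected:
like_science_fiction and play_chess are associated.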
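The calculation above can be reproduced from a 2x2 table of observed counts. The layout of the table below is an assumption, chosen so that the expected counts come out as 90, 210, 360 and 840 as in the example; only the four observed counts come from the slide.

```python
# Observed counts; the row/column arrangement is assumed, not given in the text.
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for a cell = (row total * column total) / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

print(expected)
print(round(chi2, 2))
```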
Example of Data Redundancy
Consider a data set with three attributes: person_name, is_male and is_female.
is_male is 1 if the corresponding person is male, otherwise 0.
is_female is 1 if the corresponding person is female, otherwise 0.
Each of the two flags can be derived from the other, so one of them is redundant.
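A small sketch of how this kind of redundancy might be checked and removed. The rows are made-up sample data, and the check assumes every person is recorded as exactly one of the two values.

```python
# Hypothetical sample rows using the three attributes from the example.
rows = [
    {"person_name": "Ani", "is_male": 0, "is_female": 1},
    {"person_name": "Budi", "is_male": 1, "is_female": 0},
    {"person_name": "Citra", "is_male": 0, "is_female": 1},
]

# is_female is redundant if it always equals 1 - is_male.
redundant = all(r["is_female"] == 1 - r["is_male"] for r in rows)

# If so, drop the derivable attribute and keep the rest.
if redundant:
    reduced = [{k: v for k, v in r.items() if k != "is_female"} for r in rows]
else:
    reduced = rows

print(redundant, reduced[0])
```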
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
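One simple way to divide a range into intervals is equal-width binning. The text does not name a specific method, so this choice, the function name `equal_width_bins` and the sample ages are assumptions for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (0 .. k-1).

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        labels.append(idx)
    return labels

ages = [3, 18, 25, 40, 41, 63, 90]
bins = equal_width_bins(ages, 3)
print(bins)
```

Each continuous value is replaced by a small integer label, which is the form many classification algorithms expect.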
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals
Interval labels can then be used to replace actual data values
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
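The age example above can be sketched as a mapping from numeric values to higher-level concepts. The cut-off ages (35 and 60) are illustrative assumptions; the text does not specify them.

```python
def age_concept(age):
    """Map a numeric age to a higher-level concept label."""
    # Cut-offs 35 and 60 are assumed for illustration only.
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [22, 37, 45, 61, 70]
labels = [age_concept(a) for a in ages]
print(labels)
```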
Discretization and Concept Hierarchy Generation for
Numeric Data
Entropy is calculated based on the class distribution of the samples in the set. Given
m classes, the entropy of S1 is
$$\text{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the proportion of samples in S1 that belong to class i.
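The entropy of a set of class labels can be computed as follows. This is a minimal sketch; the function name `entropy` and the sample labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    # p_i is the fraction of samples belonging to class i.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

balanced = entropy(["yes", "no", "yes", "no"])  # two equally likely classes
pure = entropy(["yes", "yes", "yes"])           # a single class
print(balanced, pure)
```

In entropy-based discretization, a split point is preferred when the resulting intervals have low entropy, that is, when each interval is dominated by one class.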
[Figure: step-by-step partitioning example; the surviving labels are Step 3 and Step 4, with the intervals (-$1,000 to $2,000) and (-$400 to $5,000)]