Data Integration and Discretization

Data integration involves ingesting, transforming, and integrating data from various sources into a single data warehouse. It includes processes to extract data from sources, transform it by resolving inconsistencies and conflicts, and load it into a data warehouse. Some common issues in data integration are differing attribute names and units, redundant or derived attributes, and inconsistencies in entity values across sources. Techniques like correlation analysis, association analysis, binning, and clustering can be used to identify redundancies and resolve conflicts during integration.

DATA INTEGRATION & DATA DISCRETIZATION

Data Mining, Irhamah


Data Integration
Data Integration ingests, transforms, and integrates structured and unstructured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse:
• Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to "take something in or absorb something."
• It includes both technical processes and business logic to transform data from disparate sources into cohesive, meaningful data with quality, governance, and compliance considerations.
• It is a combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from a variety of sources.
• It is the traditional domain of ETL (Extract, Transform and Load), which transforms and cleans the data as it is extracted from various data sources and loaded into one data store (the data warehouse). For example, converting a single "address" variable into separate "street address", "city", "state", and "zip code" fields.
Source: KDnuggets
Data Integration
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
• Entity identification problem:
  • Identify real-world entities from multiple data sources
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different
  • Possible reasons: different representations, different scales, e.g., metric vs. British units (e.g., weight in kg vs. pounds)
Problem in Data Integration
• Attribute names that differ across sources
• Different units: sales in $ vs. sales in Yen
• Different scales: Rp, Rp million, Rp billion
• Derived attributes: monthly salary vs. annual salary
Problem in Data Integration (2)
• The customer with customer-id 150 has 3 children in relation1 and 4 children in relation2
• The annual salary computed from the monthly salary in relation1 does not match the "annual-salary" attribute in relation2
Handling Redundancy in Data Integration
• Redundant data occur often when multiple databases are integrated
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• An attribute (column or feature of a data set) is called redundant if it can be derived from any other attribute or set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the data set.
• Redundant attributes may be detected by correlation/covariance analysis or association analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

  r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

  where n is the number of tuples, and \bar{x} and \bar{y} are the respective means of X and Y.
• If r_{xy} > 0, X and Y are positively correlated (X's values increase as Y's do). The higher the value, the stronger the correlation.
• If r_{xy} = 0, X and Y are independent; if r_{xy} < 0, they are negatively correlated.
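As a quick numeric check, the coefficient can be computed directly. The sketch below is illustrative only: the monthly/annual salary pairs are made-up values (chosen so that one attribute is exactly 12 times the other), not data from the slides.

```python
# Minimal sketch: Pearson correlation between two numeric attributes.
# The salary values are made up for illustration; annual = 12 x monthly.
import numpy as np

monthly_salary = np.array([2.5, 3.0, 4.2, 5.0, 6.1])
annual_salary = np.array([30.0, 36.0, 50.4, 60.0, 73.2])

x, y = monthly_salary, annual_salary
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum()))

print(round(r, 3))                         # 1.0 -> perfectly correlated, i.e. redundant
print(round(np.corrcoef(x, y)[0, 1], 3))   # same result from NumPy's built-in
```

A correlation close to ±1 flags one of the two attributes as a candidate for removal during integration.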

Association Analysis (Categorical Data)

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.

Data Requirements
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
  - There is no relationship between the subjects in each group.
  - The categorical variables are not "paired" in any way (e.g., pre-test/post-test observations).
• Relatively large sample size.
  - Expected frequencies for each cell are at least 1.
  - Expected frequencies should be at least 5 for the majority (80%) of the cells.
Association Analysis (2)

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square
Test of Independence can be expressed in two different but equivalent
ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]―

Association Analysis (3)

The test statistic compares the observed count o_{ij} in each cell with the expected count e_{ij} under independence:

  \chi^2 = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}

The larger the \chi^2 value, the stronger the evidence that the two variables are related.
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

• It shows that like_science_fiction and play_chess are associated (the calculation is reproduced in the sketch below).
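The same statistic can be reproduced with SciPy; correction=False turns off Yates' continuity correction so the result matches the hand calculation above.

```python
# Minimal sketch: chi-square test of independence for the contingency table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # does not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))   # 507.93
print(expected)         # [[ 90. 360.]  [210. 840.]]
print(p_value)          # ~0, so reject H0: the two variables are associated
```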
Example of Data Redundancy
We have a data set with three attributes: person_name, is_male, is_female.
is_male is 1 if the corresponding person is a male, else it is 0.
is_female is 1 if the corresponding person is a female, else it is 0.

If a person is not male (i.e., is_male is 0 for that person_name), then the person is surely female, since there are only two values in the output class (male and female). This implies that the two attributes are perfectly correlated and that one attribute determines the other. Hence, one of these attributes is redundant, and either of the two can be dropped without any information loss, as the sketch below illustrates.
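A minimal pandas sketch of this check, on a tiny made-up table (the names and rows are illustrative): the two indicator columns are perfectly negatively correlated, so one is dropped.

```python
# Minimal sketch: detecting and dropping a redundant attribute.
import pandas as pd

df = pd.DataFrame({
    "person_name": ["Ana", "Budi", "Citra", "Dewi", "Eko"],   # made-up rows
    "is_male":     [0, 1, 0, 0, 1],
    "is_female":   [1, 0, 1, 1, 0],
})

corr = df["is_male"].corr(df["is_female"])
print(round(corr, 3))                    # -1.0 -> one attribute determines the other
if abs(corr) > 0.99:                     # (near-)perfect correlation -> redundant
    df = df.drop(columns=["is_female"])
print(df.columns.tolist())               # ['person_name', 'is_male']
```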
Discretization (Diskritisasi)

The concept is the same as a histogram:
• Divide the domain of a numeric attribute into intervals.
• Replace the attribute values with a label for each interval.
Example:
  – Dataset (age; salary): (25; 30,000), (30; 80,000), (27; 50,000), (60; 70,000), (50; 55,000), (28; 25,000)
  – Discretized dataset (age, discretizedSalary): (25, low), (30, high), (27, medium), (60, high), (50, medium), (28, low)
A pandas version of this example is sketched below.
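A minimal pandas sketch of the example; the cut points 40,000 and 60,000 are assumed values chosen so the labels come out as in the slide, they are not given in the source.

```python
# Minimal sketch: discretizing salary into low / medium / high with pandas.cut.
# The bin edges 40,000 and 60,000 are assumptions, not values from the slides.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, 27, 60, 50, 28],
    "salary": [30_000, 80_000, 50_000, 70_000, 55_000, 25_000],
})

df["discretizedSalary"] = pd.cut(
    df["salary"],
    bins=[0, 40_000, 60_000, float("inf")],
    labels=["low", "medium", "high"],
)
print(df[["age", "discretizedSalary"]])
# 25 -> low, 30 -> high, 27 -> medium, 60 -> high, 50 -> medium, 28 -> low
```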
Discretization
• Three types of attributes:
  • Nominal — values from an unordered set, e.g., color, profession
  • Ordinal — values from an ordered set, e.g., military or academic rank
  • Continuous — numeric values, e.g., integers or real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Reduce data size by discretization
  • Prepare for further analysis
Discretization and Concept Hierarchy
• Discretization
  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
• Concept hierarchy formation
  • Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for
Numeric Data
• Typical methods (all of the methods can be applied recursively):
  • Binning
  • Histogram analysis
  • Clustering analysis
  • Entropy-based discretization: supervised, top-down split
  • Interval merging by χ² analysis: supervised, bottom-up merge
  • Segmentation by natural partitioning: top-down split, unsupervised
Binning
• Bins are numbers that represent the intervals into which you want to group the source data (input data). The intervals must be consecutive, non-overlapping, and usually of equal size.
Binning
Binning or discretization is the process of transforming numerical variables into categorical counterparts.
An example is to bin values for Age into categories such as 20-39, 40-59, and 60-79. Numerical variables are usually discretized in modeling methods that are based on frequency tables (e.g., decision trees). Moreover, binning may improve the accuracy of predictive models by reducing the noise or non-linearity. Finally, binning allows easy identification of outliers, and of invalid and missing values of numerical variables.
Unsupervised Binning
Unsupervised binning methods transform numerical variables into
categorical counterparts but do not use the target (class)
information. Equal Width and Equal Frequency are two
unsupervised binning methods.
1- Equal Width Binning
• The algorithm divides the data into k intervals of equal size. The width of the intervals is: w = (max - min) / k
• The interval boundaries are: min + w, min + 2w, ..., min + (k-1)w (see the sketch below)
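A minimal sketch with k = 3 on a small made-up list of values; pd.cut(values, bins=k) would produce the same equal-width intervals directly.

```python
# Minimal sketch: equal-width binning with k = 3 (the values are made up).
import pandas as pd

values = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])
k = 3
w = (values.max() - values.min()) / k                 # (28 - 0) / 3 = 9.33
bins = [values.min() + i * w for i in range(k + 1)]   # 0, 9.33, 18.67, 28

equal_width = pd.cut(values, bins=bins, include_lowest=True)
print(equal_width.value_counts().sort_index())        # 2, 4, and 3 values per bin
```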
Unsupervised Binning (2)
2- Equal Frequency Binning
The algorithm divides the data into k groups, each of which contains approximately the same number of values. For both methods, the best way of determining k is to look at the histogram and try different numbers of intervals or groups (see the sketch below).
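A minimal sketch of equal-frequency (quantile) binning with pandas.qcut on the same made-up values as above; each bin ends up with roughly the same count.

```python
# Minimal sketch: equal-frequency binning into 3 quantile-based bins.
import pandas as pd

values = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])
equal_freq = pd.qcut(values, q=3)                 # ~3 values per bin
print(equal_freq.value_counts().sort_index())
```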
Unsupervised Binning (3)
3- Other Methods
• Rank: The rank of a number is its size relative to the other values of a numerical variable. First, we sort the list of values, then we assign the position of each value as its rank. Identical values receive the same rank, but the presence of duplicate values affects the ranks of subsequent values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one major drawback: the same value can have different ranks in different lists.
• Quantiles (median, quartiles, percentiles, ...): Quantiles are also very useful binning methods, but like Rank, one value can be assigned a different quantile if the list of values changes.
• Math functions: For example, FLOOR(LOG(X)) is an effective binning method for numerical variables with a highly skewed distribution, e.g., income (see the sketch below).
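A minimal sketch of the FLOOR(LOG(X)) idea on made-up income values: each value is mapped to its order of magnitude.

```python
# Minimal sketch: FLOOR(LOG10(X)) binning for a skewed variable (made-up incomes).
import numpy as np

income = np.array([900, 2_500, 12_000, 55_000, 310_000, 1_200_000])
bins = np.floor(np.log10(income)).astype(int)   # order-of-magnitude bins
print(bins)                                     # [2 3 4 4 5 6]
```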
Supervised Binning
• Supervised binning methods transform numerical variables into categorical counterparts and refer to the target (class) information when selecting discretization cut points. Entropy-based binning is an example of a supervised binning method.
• The entropy-based method uses a split approach. The entropy (or the information content) is calculated based on the class label.
• The goal of this algorithm is to find the split with the maximum information gain (so that the bins are as pure as possible, i.e., the majority of the values in a bin have the same class label)
• The boundary that minimizes the entropy over all possible boundaries is selected
• The process is recursively applied to the partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy
• Example: Discretize the temperature variable using the entropy-based binning algorithm.
• Step 1: Calculate the entropy of the target.

  O-Ring Failure:  Y = 7,  N = 17

  E(Failure) = E(7, 17) = E(0.29, 0.71) = -0.29 x log2(0.29) - 0.71 x log2(0.71) = 0.871

• Step 2: Calculate the entropy of the target given a bin (T is the value used to split S into S1 and S2):

                       O-Ring Failure
                        Y      N
  Temperature <= 60     3      0
  Temperature  > 60     4     17

  E(Failure, Temperature) = P(<=60) x E(3, 0) + P(>60) x E(4, 17) = 3/24 x 0 + 21/24 x 0.70 = 0.615

• Step 3: Calculate the information gain given the bin, i.e., the difference in entropy between the original set (S) and the weighted split (S1 + S2):

  Information Gain(Failure, Temperature) = 0.871 - 0.615 = 0.256

  (These numbers are reproduced in the sketch below.)
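A minimal sketch reproducing 0.871, 0.615, and 0.256 for the split at T = 60.

```python
# Minimal sketch: entropy and information gain for the O-ring example above.
import math

def entropy(counts):
    """Entropy of a list of class counts, e.g. [7, 17]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_target = entropy([7, 17])                                    # E(Failure)
e_split = (3 / 24) * entropy([3, 0]) + (21 / 24) * entropy([4, 17])
gain = e_target - e_split

print(round(e_target, 3))   # 0.871
print(round(e_split, 3))    # 0.615
print(round(gain, 3))       # 0.256
```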


Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information after partitioning is

  I(S, T) = \frac{|S_1|}{|S|} \mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|} \mathrm{Entropy}(S_2)

• Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

  \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  where p_i is the probability of class i in S1
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
• The process is recursively applied to the partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy (a small boundary search is sketched below)
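A minimal sketch of selecting a single cut point by scanning candidate boundaries (midpoints between consecutive distinct values) and minimizing I(S, T). The (temperature, failure) pairs are illustrative values, not the full data set behind the slides.

```python
# Minimal sketch: find the boundary T that minimizes the weighted entropy I(S, T).
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    best_t, best_i = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                       # cannot split between equal values
        t = (xs[i] + xs[i - 1]) / 2        # candidate boundary: midpoint
        left, right = ys[:i], ys[i:]
        i_st = (len(left) / len(ys)) * entropy(left) + (len(right) / len(ys)) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

# illustrative (temperature, failure) data, not the full O-ring data set
temps  = [53, 57, 58, 63, 66, 67, 67, 68, 69, 70, 70, 72, 73, 75, 75, 76]
failed = ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "N", "Y", "N", "N", "N", "Y", "N"]
print(best_split(temps, failed))   # best cut near 60, separating the coldest launches
```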
Interval Merge by χ² Analysis

• Merging-based (bottom-up) vs. splitting-based methods
• Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
• ChiMerge [Kerber AAAI 1992, see also Liu et al. DMKD 2002]
  • Initially, each distinct value of a numerical attribute A is considered to be one interval
  • χ² tests are performed for every pair of adjacent intervals
  • Adjacent intervals with the lowest χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  • This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.), as in the sketch below
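A compact sketch of the merging loop, on a small made-up (value, class) data set. It uses a fixed target number of intervals as the stopping criterion rather than a χ² significance threshold, so it illustrates the idea rather than Kerber's full algorithm.

```python
# Minimal ChiMerge sketch: start with one interval per distinct value, then
# repeatedly merge the pair of adjacent intervals with the lowest chi-square value.
import numpy as np

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the class-count vectors of two adjacent intervals."""
    obs = np.array([counts_a, counts_b], dtype=float)
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (obs - expected) ** 2 / expected, 0.0)
    return terms.sum()

def chimerge(values, labels, max_intervals=3):
    classes = sorted(set(labels))
    intervals = []                                   # one ([values], class counts) per distinct value
    for v in sorted(set(values)):
        counts = [sum(1 for x, y in zip(values, labels) if x == v and y == c) for c in classes]
        intervals.append(([v], counts))
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1]) for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))                     # most similar adjacent pair
        merged = (intervals[i][0] + intervals[i + 1][0],
                  [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])])
        intervals[i:i + 2] = [merged]
    return [(vs[0], vs[-1]) for vs, _ in intervals]  # (low, high) of each interval

# made-up toy data: low and high values belong to class A, middle values to class B
values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
print(chimerge(values, labels, max_intervals=3))     # [(1, 9), (11, 39), (45, 59)]
```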
Numeric Concept Hierarchy
• A concept hierarchy for a given numerical attribute defines a discretization of the attribute
• Recursively reduce the data by collecting and replacing low-level concepts with higher-level concepts
Segmentation by Natural Partitioning
• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  • If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Suppose the values of the attribute profit range from -$351 to $4,700.

• Step 1: From the data, Min = -$351, Low (the 5th percentile) = -$159, High (the 95th percentile) = $1,838, Max = $4,700.
• Step 2: The most significant digit is msd = 1,000, so Low is rounded to -$1,000 and High to $2,000.
• Step 3: The range (-$1,000 to $2,000) covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (-$1,000 to $0], ($0 to $1,000], and ($1,000 to $2,000].
• Step 4: The boundaries are adjusted to the actual Min and Max, giving the overall range (-$400 to $5,000): the first interval becomes (-$400 to $0], and a new interval ($2,000 to $5,000] is added to cover the values above $2,000.
• Each top-level interval is then recursively partitioned by the same rule:
  • (-$400 to $0] into 4 sub-intervals: (-$400 to -$300], (-$300 to -$200], (-$200 to -$100], (-$100 to $0]
  • ($0 to $1,000] into 5 sub-intervals: ($0 to $200], ($200 to $400], ($400 to $600], ($600 to $800], ($800 to $1,000]
  • ($1,000 to $2,000] into 5 sub-intervals: ($1,000 to $1,200], ($1,200 to $1,400], ($1,400 to $1,600], ($1,600 to $1,800], ($1,800 to $2,000]
  • ($2,000 to $5,000] into 3 sub-intervals: ($2,000 to $3,000], ($3,000 to $4,000], ($4,000 to $5,000]

(The top-level split is sketched in code below.)
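A minimal sketch of the top-level decision in the 3-4-5 rule: given a low/high range already rounded at the most significant digit, it picks 3, 4, or 5 equi-width intervals. Only one level of the recursion is shown; the function name and signature are illustrative.

```python
# Minimal sketch of one level of the 3-4-5 rule (illustrative, not a full implementation).
def partition_345(low, high, msd):
    """Split (low, high] into 3, 4 or 5 equi-width intervals based on the number
    of distinct values covered at the most significant digit msd."""
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    elif distinct in (1, 5, 10):
        k = 5
    else:
        raise ValueError("3-4-5 rule does not cover this number of distinct values")
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

print(partition_345(-1000, 2000, 1000))   # 3 intervals: (-1000, 0], (0, 1000], (1000, 2000]
print(partition_345(2000, 5000, 1000))    # 3 intervals of width 1,000
print(partition_345(-400, 0, 100))        # 4 intervals of width 100
```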
Concept Hierarchy Generation for Categorical Data
• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  • street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
  • {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
  • E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  • E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy
  • Exceptions: e.g., weekday, month, quarter, year
• Example (see the sketch below):
  country:            15 distinct values
  province_or_state:  365 distinct values
  city:               3,567 distinct values
  street:             674,339 distinct values
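A minimal sketch that orders the attributes by distinct-value count, using the numbers from the slide, so the attribute with the fewest distinct values ends up at the top of the hierarchy.

```python
# Minimal sketch: build the hierarchy ordering from distinct-value counts.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# fewest distinct values -> highest level of the hierarchy
levels = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(levels)))   # street < city < province_or_state < country
```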


Summary
• Data preparation or preprocessing is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
• A lot of methods have been developed, but data preprocessing is still an active area of research
Notes
• In real-world applications, data preprocessing usually occupies about 70% of the workload in a data mining task.
• Domain knowledge is usually required to do good data preprocessing.
• To improve the predictive performance of a model:
  – Improve the learning algorithms (different algorithms, different parameters). Most data mining research focuses here.
  – Improve data quality via data preprocessing. This deserves more attention!
