Data Integration and Discretization

Data integration involves ingesting, transforming, and integrating data from various sources into a single data warehouse. It includes processes to extract data from sources, transform it by resolving inconsistencies and conflicts, and load it into a data warehouse. Some common issues in data integration are differing attribute names and units, redundant or derived attributes, and inconsistencies in entity values across sources. Techniques like correlation analysis, association analysis, binning, and clustering can be used to identify redundancies and resolve conflicts during integration.

DATA INTEGRATION & DATA DISCRETIZATION

Data Mining, Irhamah


Data Integration
Data Integration ingests, transforms, and integrates structured and unstructured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse:
• Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to "take something in or absorb something."
• It includes both technical processes and business logic to transform data from disparate sources into cohesive, meaningful data with quality, governance, and compliance considerations.
• It is a combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from a variety of sources.
• It is the traditional domain of ETL (Extract, Transform and Load), which transforms and cleans the data as it is extracted from various data sources and loaded into one data store (the data warehouse). For example, converting a single "address" variable into separate "street address", "city", "state", and "zip code" fields.
Source: KDnuggets
Data Integration
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
• Entity identification problem:
  • Identify real-world entities from multiple data sources
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different
  • Possible reasons: different representations, different scales, e.g., metric vs. British units (e.g., weight in kg vs. pounds)
Problem in Data Integration
• Attribute names that differ across sources
• Different units: sales in $ vs. sales in Yen
• Different scales: Rp, Rp million, Rp billion
• Derived attributes: monthly salary vs. annual salary
Problem in Data Integration (2)
• The customer with customer-id 150 has 3 children in relation1 and 4 children in relation2
• The annual salary computed from the monthly salary in relation1 does not match the "annual-salary" attribute in relation2
Handling Redundancy in Data Integration
• Redundant data occur often when multiple databases are integrated
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• An attribute (column or feature of a data set) is called redundant if it can be derived from any other attribute or set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the data set.
• Redundant attributes may be detected by correlation/covariance analysis or association analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

  r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

  where n is the number of tuples, and \bar{x} and \bar{y} are the respective means of X and Y.
• If r_{xy} > 0, X and Y are positively correlated (X's values increase as Y's do). The higher the value, the stronger the correlation.
• If r_{xy} = 0, X and Y are independent; if r_{xy} < 0, they are negatively correlated.
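As a quick numeric check, the coefficient can be computed directly. The sketch below is illustrative only: the monthly/annual salary pairs are made-up values (chosen so that one attribute is exactly 12 times the other), not data from the slides.

```python
# Minimal sketch: Pearson correlation between two numeric attributes.
# The salary values are made up for illustration; annual = 12 x monthly.
import numpy as np

monthly_salary = np.array([2.5, 3.0, 4.2, 5.0, 6.1])
annual_salary = np.array([30.0, 36.0, 50.4, 60.0, 73.2])

x, y = monthly_salary, annual_salary
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum()))

print(round(r, 3))                         # 1.0 -> perfectly correlated, i.e. redundant
print(round(np.corrcoef(x, y)[0, 1], 3))   # same result from NumPy's built-in
```

A correlation close to ±1 flags one of the two attributes as a candidate for removal during integration.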

Association Analysis (Categorical Data)

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.

Data Requirements
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
  - There is no relationship between the subjects in each group.
  - The categorical variables are not "paired" in any way (e.g., pre-test/post-test observations).
• Relatively large sample size.
  - Expected frequencies for each cell are at least 1.
  - Expected frequencies should be at least 5 for the majority (80%) of the cells.
Association Analysis (2)

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square
Test of Independence can be expressed in two different but equivalent
ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]―

Association Analysis (3)

The test statistic compares the observed count o_{ij} in each cell with the expected count e_{ij} under independence:

  \chi^2 = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}

The larger the \chi^2 value, the stronger the evidence that the two variables are related.
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

• It shows that like_science_fiction and play_chess are associated (the calculation is reproduced in the sketch below).
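The same statistic can be reproduced with SciPy; correction=False turns off Yates' continuity correction so the result matches the hand calculation above.

```python
# Minimal sketch: chi-square test of independence for the contingency table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # does not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))   # 507.93
print(expected)         # [[ 90. 360.]  [210. 840.]]
print(p_value)          # ~0, so reject H0: the two variables are associated
```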
Example of Data Redundancy
We have a data set with three attributes: person_name, is_male, is_female.
is_male is 1 if the corresponding person is a male, else it is 0.
is_female is 1 if the corresponding person is a female, else it is 0.

If a person is not male (i.e., is_male is 0 for that person_name), then the person is surely female, since there are only two values in the output class (male and female). This implies that the two attributes are perfectly correlated and that one attribute determines the other. Hence, one of these attributes is redundant, and either of the two can be dropped without any information loss, as the sketch below illustrates.
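A minimal pandas sketch of this check, on a tiny made-up table (the names and rows are illustrative): the two indicator columns are perfectly negatively correlated, so one is dropped.

```python
# Minimal sketch: detecting and dropping a redundant attribute.
import pandas as pd

df = pd.DataFrame({
    "person_name": ["Ana", "Budi", "Citra", "Dewi", "Eko"],   # made-up rows
    "is_male":     [0, 1, 0, 0, 1],
    "is_female":   [1, 0, 1, 1, 0],
})

corr = df["is_male"].corr(df["is_female"])
print(round(corr, 3))                    # -1.0 -> one attribute determines the other
if abs(corr) > 0.99:                     # (near-)perfect correlation -> redundant
    df = df.drop(columns=["is_female"])
print(df.columns.tolist())               # ['person_name', 'is_male']
```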
Discretization (Diskritisasi)

The concept is the same as a histogram:
• Divide the domain of a numeric attribute into intervals.
• Replace the attribute values with a label for each interval.
Example:
  – Dataset (age; salary): (25; 30,000), (30; 80,000), (27; 50,000), (60; 70,000), (50; 55,000), (28; 25,000)
  – Discretized dataset (age, discretizedSalary): (25, low), (30, high), (27, medium), (60, high), (50, medium), (28, low)
A pandas version of this example is sketched below.
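A minimal pandas sketch of the example; the cut points 40,000 and 60,000 are assumed values chosen so the labels come out as in the slide, they are not given in the source.

```python
# Minimal sketch: discretizing salary into low / medium / high with pandas.cut.
# The bin edges 40,000 and 60,000 are assumptions, not values from the slides.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, 27, 60, 50, 28],
    "salary": [30_000, 80_000, 50_000, 70_000, 55_000, 25_000],
})

df["discretizedSalary"] = pd.cut(
    df["salary"],
    bins=[0, 40_000, 60_000, float("inf")],
    labels=["low", "medium", "high"],
)
print(df[["age", "discretizedSalary"]])
# 25 -> low, 30 -> high, 27 -> medium, 60 -> high, 50 -> medium, 28 -> low
```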
Discretization
• Three types of attributes:
  • Nominal — values from an unordered set, e.g., color, profession
  • Ordinal — values from an ordered set, e.g., military or academic rank
  • Continuous — numeric values, e.g., integers or real numbers
• Discretization:
  • Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept categorical attributes
  • Reduce data size by discretization
  • Prepare for further analysis
Discretization and Concept Hierarchy
• Discretization
  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
• Concept hierarchy formation
  • Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for
Numeric Data
• Typical methods (all of the methods can be applied recursively):
  • Binning
  • Histogram analysis
  • Clustering analysis
  • Entropy-based discretization: supervised, top-down split
  • Interval merging by χ² analysis: supervised, bottom-up merge
  • Segmentation by natural partitioning: top-down split, unsupervised
Binning
• Bins are numbers that represent the intervals into which you want to group the source data (input data). The intervals must be consecutive, non-overlapping, and usually of equal size.
Binning
Binning or discretization is the process of transforming numerical variables into categorical counterparts.
An example is to bin values for Age into categories such as 20-39, 40-59, and 60-79. Numerical variables are usually discretized in modeling methods that are based on frequency tables (e.g., decision trees). Moreover, binning may improve the accuracy of predictive models by reducing the noise or non-linearity. Finally, binning allows easy identification of outliers, and of invalid and missing values of numerical variables.
Unsupervised Binning
Unsupervised binning methods transform numerical variables into
categorical counterparts but do not use the target (class)
information. Equal Width and Equal Frequency are two
unsupervised binning methods.
1- Equal Width Binning
• The algorithm divides the data into k intervals of equal size. The width of the intervals is: w = (max - min) / k
• The interval boundaries are: min + w, min + 2w, ..., min + (k-1)w (see the sketch below)
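A minimal sketch with k = 3 on a small made-up list of values; pd.cut(values, bins=k) would produce the same equal-width intervals directly.

```python
# Minimal sketch: equal-width binning with k = 3 (the values are made up).
import pandas as pd

values = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])
k = 3
w = (values.max() - values.min()) / k                 # (28 - 0) / 3 = 9.33
bins = [values.min() + i * w for i in range(k + 1)]   # 0, 9.33, 18.67, 28

equal_width = pd.cut(values, bins=bins, include_lowest=True)
print(equal_width.value_counts().sort_index())        # 2, 4, and 3 values per bin
```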
Unsupervised Binning (2)
2- Equal Frequency Binning
The algorithm divides the data into k groups, each of which contains approximately the same number of values. For both methods, the best way of determining k is to look at the histogram and try different numbers of intervals or groups (see the sketch below).
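A minimal sketch of equal-frequency (quantile) binning with pandas.qcut on the same made-up values as above; each bin ends up with roughly the same count.

```python
# Minimal sketch: equal-frequency binning into 3 quantile-based bins.
import pandas as pd

values = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])
equal_freq = pd.qcut(values, q=3)                 # ~3 values per bin
print(equal_freq.value_counts().sort_index())
```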
Unsupervised Binning (3)
3- Other Methods
• Rank: The rank of a number is its size relative to the other values of a numerical variable. First, we sort the list of values, then we assign the position of each value as its rank. Identical values receive the same rank, but the presence of duplicate values affects the ranks of subsequent values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one major drawback: the same value can have different ranks in different lists.
• Quantiles (median, quartiles, percentiles, ...): Quantiles are also very useful binning methods, but like Rank, one value can be assigned a different quantile if the list of values changes.
• Math functions: For example, FLOOR(LOG(X)) is an effective binning method for numerical variables with a highly skewed distribution, e.g., income (see the sketch below).
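A minimal sketch of the FLOOR(LOG(X)) idea on made-up income values: each value is mapped to its order of magnitude.

```python
# Minimal sketch: FLOOR(LOG10(X)) binning for a skewed variable (made-up incomes).
import numpy as np

income = np.array([900, 2_500, 12_000, 55_000, 310_000, 1_200_000])
bins = np.floor(np.log10(income)).astype(int)   # order-of-magnitude bins
print(bins)                                     # [2 3 4 4 5 6]
```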
Supervised Binning
• Supervised binning methods transform numerical variables into categorical counterparts and refer to the target (class) information when selecting discretization cut points. Entropy-based binning is an example of a supervised binning method.
• The entropy-based method uses a split approach. The entropy (or the information content) is calculated based on the class label.
• The goal of this algorithm is to find the split with the maximum information gain (so that the bins are as pure as possible, i.e., the majority of the values in a bin have the same class label)
• The boundary that minimizes the entropy over all possible boundaries is selected
• The process is recursively applied to the partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy
• Example: Discretize the temperature variable using the entropy-based binning algorithm.
• Step 1: Calculate the entropy of the target.

  O-Ring Failure:  Y = 7,  N = 17

  E(Failure) = E(7, 17) = E(0.29, 0.71) = -0.29 x log2(0.29) - 0.71 x log2(0.71) = 0.871

• Step 2: Calculate the entropy of the target given a bin (T is the value used to split S into S1 and S2):

                       O-Ring Failure
                        Y      N
  Temperature <= 60     3      0
  Temperature  > 60     4     17

  E(Failure, Temperature) = P(<=60) x E(3, 0) + P(>60) x E(4, 17) = 3/24 x 0 + 21/24 x 0.70 = 0.615

• Step 3: Calculate the information gain given the bin, i.e., the difference in entropy between the original set (S) and the weighted split (S1 + S2):

  Information Gain(Failure, Temperature) = 0.871 - 0.615 = 0.256

  (These numbers are reproduced in the sketch below.)
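A minimal sketch reproducing 0.871, 0.615, and 0.256 for the split at T = 60.

```python
# Minimal sketch: entropy and information gain for the O-ring example above.
import math

def entropy(counts):
    """Entropy of a list of class counts, e.g. [7, 17]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_target = entropy([7, 17])                                    # E(Failure)
e_split = (3 / 24) * entropy([3, 0]) + (21 / 24) * entropy([4, 17])
gain = e_target - e_split

print(round(e_target, 3))   # 0.871
print(round(e_split, 3))    # 0.615
print(round(gain, 3))       # 0.256
```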


Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information after partitioning is

  I(S, T) = \frac{|S_1|}{|S|} \mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|} \mathrm{Entropy}(S_2)

• Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

  \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  where p_i is the probability of class i in S1
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
• The process is recursively applied to the partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy (a small boundary search is sketched below)
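A minimal sketch of selecting a single cut point by scanning candidate boundaries (midpoints between consecutive distinct values) and minimizing I(S, T). The (temperature, failure) pairs are illustrative values, not the full data set behind the slides.

```python
# Minimal sketch: find the boundary T that minimizes the weighted entropy I(S, T).
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    best_t, best_i = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                       # cannot split between equal values
        t = (xs[i] + xs[i - 1]) / 2        # candidate boundary: midpoint
        left, right = ys[:i], ys[i:]
        i_st = (len(left) / len(ys)) * entropy(left) + (len(right) / len(ys)) * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

# illustrative (temperature, failure) data, not the full O-ring data set
temps  = [53, 57, 58, 63, 66, 67, 67, 68, 69, 70, 70, 72, 73, 75, 75, 76]
failed = ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "N", "Y", "N", "N", "N", "Y", "N"]
print(best_split(temps, failed))   # best cut near 60, separating the coldest launches
```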
Interval Merge by χ² Analysis

• Merging-based (bottom-up) vs. splitting-based methods
• Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
• ChiMerge [Kerber AAAI 1992, see also Liu et al. DMKD 2002]
  • Initially, each distinct value of a numerical attribute A is considered to be one interval
  • χ² tests are performed for every pair of adjacent intervals
  • Adjacent intervals with the lowest χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  • This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.), as in the sketch below
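A compact sketch of the merging loop, on a small made-up (value, class) data set. It uses a fixed target number of intervals as the stopping criterion rather than a χ² significance threshold, so it illustrates the idea rather than Kerber's full algorithm.

```python
# Minimal ChiMerge sketch: start with one interval per distinct value, then
# repeatedly merge the pair of adjacent intervals with the lowest chi-square value.
import numpy as np

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the class-count vectors of two adjacent intervals."""
    obs = np.array([counts_a, counts_b], dtype=float)
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (obs - expected) ** 2 / expected, 0.0)
    return terms.sum()

def chimerge(values, labels, max_intervals=3):
    classes = sorted(set(labels))
    intervals = []                                   # one ([values], class counts) per distinct value
    for v in sorted(set(values)):
        counts = [sum(1 for x, y in zip(values, labels) if x == v and y == c) for c in classes]
        intervals.append(([v], counts))
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1]) for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))                     # most similar adjacent pair
        merged = (intervals[i][0] + intervals[i + 1][0],
                  [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])])
        intervals[i:i + 2] = [merged]
    return [(vs[0], vs[-1]) for vs, _ in intervals]  # (low, high) of each interval

# made-up toy data: low and high values belong to class A, middle values to class B
values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
print(chimerge(values, labels, max_intervals=3))     # [(1, 9), (11, 39), (45, 59)]
```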
Numeric Concept Hierarchy
• A concept hierarchy for a given numerical attribute defines a discretization of the attribute
• Recursively reduce the data by collecting and replacing low-level concepts with higher-level concepts
Segmentation by Natural Partitioning
• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  • If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Suppose the values of the attribute profit range from -$351 to $4,700.

• Step 1: From the data, Min = -$351, Low (the 5th percentile) = -$159, High (the 95th percentile) = $1,838, Max = $4,700.
• Step 2: The most significant digit is msd = 1,000, so Low is rounded to -$1,000 and High to $2,000.
• Step 3: The range (-$1,000 to $2,000) covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (-$1,000 to $0], ($0 to $1,000], and ($1,000 to $2,000].
• Step 4: The boundaries are adjusted to the actual Min and Max, giving the overall range (-$400 to $5,000): the first interval becomes (-$400 to $0], and a new interval ($2,000 to $5,000] is added to cover the values above $2,000.
• Each top-level interval is then recursively partitioned by the same rule:
  • (-$400 to $0] into 4 sub-intervals: (-$400 to -$300], (-$300 to -$200], (-$200 to -$100], (-$100 to $0]
  • ($0 to $1,000] into 5 sub-intervals: ($0 to $200], ($200 to $400], ($400 to $600], ($600 to $800], ($800 to $1,000]
  • ($1,000 to $2,000] into 5 sub-intervals: ($1,000 to $1,200], ($1,200 to $1,400], ($1,400 to $1,600], ($1,600 to $1,800], ($1,800 to $2,000]
  • ($2,000 to $5,000] into 3 sub-intervals: ($2,000 to $3,000], ($3,000 to $4,000], ($4,000 to $5,000]

(The top-level split is sketched in code below.)
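A minimal sketch of the top-level decision in the 3-4-5 rule: given a low/high range already rounded at the most significant digit, it picks 3, 4, or 5 equi-width intervals. Only one level of the recursion is shown; the function name and signature are illustrative.

```python
# Minimal sketch of one level of the 3-4-5 rule (illustrative, not a full implementation).
def partition_345(low, high, msd):
    """Split (low, high] into 3, 4 or 5 equi-width intervals based on the number
    of distinct values covered at the most significant digit msd."""
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    elif distinct in (1, 5, 10):
        k = 5
    else:
        raise ValueError("3-4-5 rule does not cover this number of distinct values")
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

print(partition_345(-1000, 2000, 1000))   # 3 intervals: (-1000, 0], (0, 1000], (1000, 2000]
print(partition_345(2000, 5000, 1000))    # 3 intervals of width 1,000
print(partition_345(-400, 0, 100))        # 4 intervals of width 100
```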
Concept Hierarchy Generation for Categorical Data
• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  • street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
  • {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
  • E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  • E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy
  • Exceptions: e.g., weekday, month, quarter, year
• Example (see the sketch below):
  country:            15 distinct values
  province_or_state:  365 distinct values
  city:               3,567 distinct values
  street:             674,339 distinct values
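A minimal sketch that orders the attributes by distinct-value count, using the numbers from the slide, so the attribute with the fewest distinct values ends up at the top of the hierarchy.

```python
# Minimal sketch: build the hierarchy ordering from distinct-value counts.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# fewest distinct values -> highest level of the hierarchy
levels = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(levels)))   # street < city < province_or_state < country
```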


Summary
• Data preparation or preprocessing is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
• A lot of methods have been developed, but data preprocessing is still an active area of research
Notes
• In real-world applications, data preprocessing usually occupies about 70% of the workload in a data mining task.
• Domain knowledge is usually required to do good data preprocessing.
• To improve the predictive performance of a model:
  – Improve the learning algorithms (different algorithms, different parameters). Most data mining research focuses here.
  – Improve data quality via data preprocessing. This deserves more attention!
