Data Integration
4/9/12
Data Models
The main aim of data models is to support the development of information systems by providing the definition and format of data.
Conceptual
Identify the common entities
Example: the Employee dimension can come from the Sales system, the Payroll system, etc. Products can come from manufacturing, sales, purchase, etc.
Once a common entity is identified, its definition should be standardized. Example: does employee include full-time employees as well as temporary workers?
Identify the common attributes
What are the attributes that are common to employee: first name, second name, last name, date of joining, etc.? Each attribute should be defined.
Identify the common values
The same information can be represented in different ...
The second step is the identification of a Data Steward who will hold responsibility and ownership for a particular set of data elements
The third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration. This topic is discussed in more detail under its own heading
The fourth step is to establish a process of maintenance, review and reporting of such elements
Historical Perspective
Figure: degrees of INTEGRATION
LOCATION TRANSPARENCY, global schema
global access only: DISTRIBUTED DBS
global + local access: FEDERATED DBS
LOCATION VISIBILITY, no global schema
multi-DB views, multi-DB access language: MULTIDATABASE SYSTEMS
unstructured or semi-structured data (files, repositories, knowledge bases, spreadsheets, ...), information exchange protocols / languages: INTEROPERABLE SYSTEMS
Federated Databases
Figure (labels only): I.S., DBMS1, DB1, FED. SCHEMA A, Filtering, Integration
DW Architecture
Figure (labels only): Internal Sources, Operational DBs, Data Integration Component, Data Warehouse, OLAP Server, Query and Analysis Component, OLAP, Reports, Data Mining
Catalog matching
In order for a private company to participate in a marketplace (e.g., eBay), it has to determine correspondences between entries of its catalogs and entries of the common catalog of the marketplace.
Once the correspondences between the two schemas have been determined, the next step is to generate query expressions that automatically translate data instances of these catalogs under an integrated catalog.
Having aligned the catalogs, users of the marketplace have unified access to the products that are on sale.
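The translation step of catalog matching can be sketched as follows. This is only a minimal illustration: the field names (prod_name, cost, maker, title, price, brand) and the sample entry are assumptions made for the example, not taken from any real marketplace catalog.

```python
# Hypothetical correspondences between a seller's catalog fields and the
# fields of the marketplace's common catalog (all names are illustrative).
CORRESPONDENCES = {
    "prod_name": "title",
    "cost": "price",
    "maker": "brand",
}


def translate(entry, correspondences):
    """Rewrite one private catalog entry under the common catalog schema.

    Fields without a known correspondence are dropped.
    """
    return {
        common: entry[private]
        for private, common in correspondences.items()
        if private in entry
    }


# Made-up entry from the private catalog
private_entry = {"prod_name": "Nikon D90", "cost": 899.0, "maker": "Nikon", "sku": "X1"}
common_entry = translate(private_entry, CORRESPONDENCES)
print(common_entry)
```

In a real setting the correspondences would come out of the matching step and the translation would usually be a generated query rather than a dictionary lookup.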
Figure (labels only): taxonomy of conflicts
Schemas: Generalization, Specialization, Aggregation, Structural, Typing, Completeness
Data "Conflicts": Semantic, Values, Cognitive
Figure (labels only): object types Person (Pin, Name), Faculty (Rank), Student (GPA)
Schema S2 (relational)
Thesis (Phd-advisor, Phd-student, title)
Figure (labels only): source DBs, homogenized DBs, DW, integrated schema, transformation rules, investigation rules, integration rules
Step 1: Pre-Integration
Given: heterogeneous sources (DBMSs, GISs, XML, UML, OWL, RDF, WSDL, ...)
Reduce this heterogeneity as much as possible to make the sources more suitable for integration
Often,
Design Homogenization
Person
Family
Documentation is non-existent or obsolete (not updated)
Designers are gone
Knowledge is missing
COBOL files
Data description is within the application programs
Data descriptions for the same file vary from one application program to the next
Data descriptions may be incomplete (FILLER)
1st problem: to recover the underlying structure for each file
Relational databases
The existing schema is usually not normalized
Information on primary keys, candidate keys, foreign keys, functional dependencies, inclusion dependencies, other constraints, and derivation rules may be missing, yet it is essential for the reverse engineering process
Example
Person ( P#, name, address )
Firstnames ( P#, firstname )
Employee ( E#, position, salary )
EmpDept ( E#, department )
Department ( name, location, boss, budget )
Figure (labels only): Person, Employee, Department, E.D., boss, cardinalities 0:n, 0:n, 0:n, 1:1
Where to find
Primary keys
often defined in the schema
check Select Distinct in SQL queries
or mine the DB
Foreign keys
check domain compatibility from the schema
check join conditions in SQL queries
check values in the DB
mine the DB
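"Mining the DB" for keys can be sketched as two checks over the stored rows: a uniqueness test for candidate keys and an inclusion test for foreign key candidates. The sketch below uses the EmpDept and Department relations from the example; the rows are made up for illustration, and the function names are hypothetical. A check that holds on the current instance is only evidence for a key, not proof.

```python
def is_candidate_key(rows, columns):
    """True if no two rows share the same values on `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def inclusion_holds(child_rows, child_col, parent_rows, parent_col):
    """True if every non-null child value also appears in the parent column."""
    parent_values = {row[parent_col] for row in parent_rows}
    return all(
        row[child_col] in parent_values
        for row in child_rows
        if row[child_col] is not None
    )


# Made-up sample instances of two relations from the example
department = [
    {"name": "Sales", "location": "Paris", "boss": 1, "budget": 100},
    {"name": "R&D", "location": "Lyon", "boss": 2, "budget": 250},
]
emp_dept = [
    {"E#": 1, "department": "Sales"},
    {"E#": 2, "department": "R&D"},
    {"E#": 2, "department": "Sales"},
]

print(is_candidate_key(emp_dept, ["E#"]))                # E# alone repeats
print(is_candidate_key(emp_dept, ["E#", "department"]))  # the pair is unique
print(inclusion_holds(emp_dept, "department", department, "name"))
```

On this sample the pair (E#, department) is unique while E# alone is not, and every EmpDept.department value occurs in Department.name, which supports treating it as a foreign key candidate.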
Primary Keys?
Person ( P# , name, address )
Firstnames ( P# , firstname )
Employee ( E# , position, salary )
EmpDept ( E# , department )
Department ( name, location, boss, budget )
linguistic heuristics:
Person ( P# , name, address )
Firstnames ( P# , firstname ) => Firstnames ( P# , firstname )
Employee ( E# , position, salary )
EmpDept ( E# , department ) => EmpDept ( E# , department )
Department ( name, location, boss, budget )
Foreign keys?
Person ( P# , name, address )
Firstnames ( P# , firstname )
Employee ( E# , position, salary )
EmpDept ( E# , department )
Department ( name, location, boss, budget )
Finding foreign keys
EmpDept.department -> Department.name
The FK boss in Department defines a relationship with 1:1 role cardinality
firstname typically is a multi-valued attribute of Person
Is-a links
Figure (labels only): Person (P#, name, address, firstnames 0:n), Employee (position, salary), Department, E.D., boss, cardinalities 0:n, 0:n, 0:n, 1:1
Person ( P# , name, address )
Firstnames ( P# , firstname )
Employee ( E# , position, salary )
PK is FK
weak entity
meta-data for each representation
external keys
thesauri
semantic definitions
explicit representation rules
Example: if a road crosses a river, a bridge exists
quality indicators
Enrichment Examples
add a definition of "Employee" as "those persons who have an employment contract with the enterprise"
add a definition of "salary" in "Employee" as "the gross salary amount, before any deductions"
Add
Figure (labels only): Employee (E#, projects) and Employee (E#, projects: name, role)
Figure (labels only): schemas transformation, transformation rules, investigation rules, integration rules
Correspondences relate (schema) elements which describe the same phenomena of the real world
What are the related schema elements?
How are their sets of instances related?
How are corresponding instances identified?
Do their descriptions match? How?
Semantic Relativism
Semantics of correspondences
Real World
Database System
Schema 1
Schema 2
Example of correspondences
Figure (labels only): Schema S1 with Person (Pin, Name), Faculty (Rank), Student (GPA); Schema S2
Figure (labels only): Schema 1 with title, ISBN, authors (name, birthdate); Schema 2 with Author (name)
Asserting Correspondences
S1 thing1 set_relationship S2 thing2
With Corresponding Identifiers: thing1id-predicate = thing2id-predicate
[With Corresponding Properties: thing1attribute set_relationship thing2attribute , ... ]
set_relationship :
one element --- one element (same construct, e.g. object type)
Node CONTAINED-IN Extremity
one element --- one element (different constructs, e.g. object type and attribute)
Book EQUAL books
authors EQUAL Author
Matching takes two schemas/ontologies, each consisting of a set of discrete entities (e.g., tables, XML elements, classes, properties), as input and determines as output the relationships (e.g., equivalence, subsumption) holding between these entities.
Matching Dimensions
Input dimensions
Process dimensions
Output dimensions
Cardinality (e.g., 1:1, 1:m)
Equivalence vs. more relations
Graded vs. absolute confidence
Prefix / Suffix
Checks whether the first string starts (ends) with the second one
prefix: net = network; but also hot = hotel
suffix: phone = telephone; but also word = sword
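A minimal sketch of the prefix / suffix test, assuming the comparison is case-insensitive and symmetric (either string may be the shorter one); both choices are assumptions made for the example. The function names are hypothetical.

```python
def prefix_match(a, b):
    """True if one string starts with the other (case-insensitive)."""
    a, b = a.lower(), b.lower()
    return a.startswith(b) or b.startswith(a)


def suffix_match(a, b):
    """True if one string ends with the other (case-insensitive)."""
    a, b = a.lower(), b.lower()
    return a.endswith(b) or b.endswith(a)


print(prefix_match("net", "network"))      # intended match
print(prefix_match("hot", "hotel"))        # false positive: also matches
print(suffix_match("phone", "telephone"))  # intended match
print(suffix_match("word", "sword"))       # false positive: also matches
```

The hot / hotel and word / sword pairs show why this test is usually combined with other evidence rather than used alone.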
Edit distance
Calculates the number of insertions / deletions / substitutions of characters required to transform one string into another, normalized by max(length(string1), length(string2))
EditDistance(NKN,Nikon) = 2/5 = 0.4
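The normalized edit distance can be sketched with the standard dynamic-programming (Levenshtein) recurrence. Lowercasing both strings first is an assumption; it is what makes the NKN / Nikon example come out as 2 edits. The function names are hypothetical.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(s, t):
    """Edit distance divided by the length of the longer string (case-insensitive)."""
    s, t = s.lower(), t.lower()
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0
    return levenshtein(s, t) / longest


print(normalized_edit_distance("NKN", "Nikon"))  # 2 edits over length 5
```

A value of 0 means the strings are identical after lowercasing, and values close to 1 mean that almost every character has to change.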
N-Gram
Calculates the number of identical n-grams (i.e., sequences of n characters) shared by two strings
the trigrams (n = 3) of the string nikon are nik, iko, kon
n-gram(2) (nikon, konnichiwa) = 3
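A minimal sketch of the n-gram count, assuming that each distinct n-gram is counted once and that the comparison is case-insensitive; the function names are hypothetical.

```python
def ngrams(s, n):
    """All contiguous character sequences of length n in s, in order."""
    s = s.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def shared_ngrams(a, b, n):
    """Number of distinct n-grams that occur in both strings."""
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))


print(ngrams("nikon", 3))                        # the trigrams of nikon
print(shared_ngrams("nikon", "konnichiwa", 2))   # shared bigrams: ni, ko, on
```

The raw count is often divided by the number of n-grams in the shorter or longer string to obtain a similarity between 0 and 1; which normalization to use is a design choice not fixed here.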
Tokenization
Names are parsed into tokens by recognizing punctuation, cases
Hands-Free_Kits => <hands, free, kits>
Lemmatization
Tokens are morphologically analyzed in order to find all their possible basic forms
Kits => Kit
Elimination
Tokens that are articles, prepositions, conjunctions, and so on, are marked to be discarded
a, the, by, type of
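The three normalization steps can be chained as in the sketch below. It rests on loud assumptions: the regular expression is one possible way to split on punctuation and case changes, the lemmatizer is a toy rule that only strips a plural "s" (a real system would use a morphological analyzer), and the stop list covers single tokens only, so the phrase "type of" is handled only through "of".

```python
import re

# Assumed stop list of single tokens (illustrative, not exhaustive)
STOP_WORDS = {"a", "the", "by", "of"}


def tokenize(name):
    """Split a name on punctuation and on lower/upper case changes."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def lemmatize(token):
    """Toy stand-in for morphological analysis: strip a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(name):
    """Tokenization, then elimination of stop words, then lemmatization."""
    return [lemmatize(t) for t in tokenize(name) if t not in STOP_WORDS]


print(tokenize("Hands-Free_Kits"))
print(normalize("Hands-Free_Kits"))
print(normalize("Type_of_the_Kits"))
```

The normalized token lists can then be compared with the string-based techniques instead of comparing the raw names.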
Goal for Integration: creating an Integrated Schema (IS) and the mappings to the local databases
Goal for Alignment: conform a specification to a given prescriptive specification (PS)
Goal for Matching: find correspondences/complementarity
Figure (labels only): PS, IS, S1, S2, S3
GAV (Global As View): the Integrated Schema provides an integrated description of all the data available in the sources
the IS is used to access data from any sources
queries to the IS are mapped into queries to the sources (as in distributed databases)
the IS is defined to allow access to all data: in case of conflicting specifications, an all-encompassing specification is elaborated for the IS
LAV (Local As View): the Integrated Schema provides an integrated description of all the data that is desirable and that somehow matches the requirements of users of the IS
the IS may define data that does not exist in any of the sources (missing / incomplete information problem)
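For the GAV case, the statement that queries to the IS are mapped into queries to the sources can be sketched by defining an IS relation as a view over the sources. Everything below is an assumption made for illustration: the two sources, their field names, and the IS relation Employee(E#, name).

```python
# Two hypothetical sources that describe employees with different layouts
source1 = [{"E#": 1, "name": "Ann"}]
source2 = [{"emp_id": 2, "full_name": "Bob"}]


def is_employee():
    """GAV-style view: the IS relation Employee(E#, name) defined over the sources."""
    for row in source1:
        yield {"E#": row["E#"], "name": row["name"]}
    for row in source2:
        yield {"E#": row["emp_id"], "name": row["full_name"]}


def query_names():
    """A query posed against the IS; evaluating it unfolds into reads of both sources."""
    return [row["name"] for row in is_employee()]


print(query_names())
```

In this style, adding a source means rewriting the view definition, which is the usual price for simple query unfolding.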
Integration Process
Solving the conflicts
Examples: types, classifications (sets of instances), sets of properties, structures, coding schemes, metadata
Integration rules
If an object type corresponds to an attribute, keep the object type
If the population of an object type is included in the population of another object type, build an isa hierarchy
Integration rules depend on what you want the integrated schema to look like
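The population-inclusion rule can be sketched as a comparison of the identifier sets of two object types. EQUAL and CONTAINED-IN appear in the correspondence examples; the other relationship names used here (CONTAINS, INTERSECT, DISJOINT) and the sample identifiers are assumptions made for illustration.

```python
def relate(population1, population2):
    """Name the set relationship between two populations of identifiers."""
    a, b = set(population1), set(population2)
    if a == b:
        return "EQUAL"
    if a < b:
        return "CONTAINED-IN"
    if a > b:
        return "CONTAINS"
    if a & b:
        return "INTERSECT"
    return "DISJOINT"


def suggest_isa(name1, population1, name2, population2):
    """Suggest an isa link when one population is included in the other."""
    relationship = relate(population1, population2)
    if relationship == "CONTAINED-IN":
        return f"{name1} isa {name2}"
    if relationship == "CONTAINS":
        return f"{name2} isa {name1}"
    return None


# Made-up identifiers (e.g. Pin values) for two object types
faculty = [101, 102]
person = [101, 102, 103, 104]

print(relate(faculty, person))
print(suggest_isa("Faculty", faculty, "Person", person))
```

Inclusion observed on the current instances is only a hint; whether the isa link holds in general remains a decision for the designer.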
Structural Conflicts
Different schema element types, e.g.: class, attribute, relationship
Library example:
Classification Conflicts
Conflict Resolution:
Generalization / Specialization hierarchy
Faculty
Phd-advisor Faculty
Merging
Descriptive conflicts
Corresponding types have different properties, or corresponding properties are described in different ways
Object / relationship type:
naming conflicts:
synonyms: Node, Extremity
homonyms: Highway (EU), Highway (USA)
Descriptive conflicts (2)
Attribute:
naming conflicts: synonyms, homonyms
cardinalities
firstname: one, two, N values
domains
salary: $, Euro ...
student grade: [0 : 20], [1 : 5], [A : D] ...
geometry: different scales, different reference systems, ...
integrity constraints
the average area for a flat in Japan, in USA ...
Method:
Descriptive conflicts (3)
S1 : Employee ( E# , name , address , dob )
S2 : Employee ( E# , position , salary )
S1.Employee EQUAL S2.Employee
=> IS. Employee ( E# , name , address , dob , position , salary )
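The merge of the two Employee descriptions can be sketched as a union of attributes keyed on E#. The rows below are made up for illustration and the function names are hypothetical; the sketch assumes that E# identifies the same employee in both schemas, as the EQUAL correspondence states.

```python
def integrated_attributes(attrs1, attrs2):
    """Union of two attribute lists, keeping the order of first appearance."""
    return attrs1 + [a for a in attrs2 if a not in attrs1]


def integrate(rows1, rows2, key):
    """Merge rows that share the same key value into one integrated row."""
    merged = {}
    for row in list(rows1) + list(rows2):
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


s1_attrs = ["E#", "name", "address", "dob"]
s2_attrs = ["E#", "position", "salary"]

# Made-up sample rows
s1_rows = [{"E#": 7, "name": "Ann", "address": "Geneva", "dob": "1980-01-01"}]
s2_rows = [{"E#": 7, "position": "Analyst", "salary": 5000}]

print(integrated_attributes(s1_attrs, s2_attrs))
print(integrate(s1_rows, s2_rows, "E#"))
```

If the same non-key attribute appeared in both schemas with different values, the later row would silently win here; a real integration would need an explicit rule for such value conflicts.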
method: semi-automatic integration
tell me about the problem, I will try to fix it
Figure (labels only): schemas, correspondences, DBA, TOOL, mapping rules, integrated schema
Opens to visual CASE tools, integration servers
BUT knowledge acquisition can be painful