0% found this document useful (0 votes)
33 views

Data Integration: Click To Edit Master Subtitle Style

The document discusses data integration and data modeling. It covers several key steps in data integration including identifying common elements between different data sources, designating data stewards, designing ETL processes, and establishing maintenance procedures. It also provides historical context on data integration, from early approaches focusing on views and interoperability to more modern concepts like data warehousing and data alignment on the web. Challenges of data heterogeneity are discussed along with examples.

Uploaded by

Adnan Ejaz
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views

Data Integration: Click To Edit Master Subtitle Style

The document discusses data integration and data modeling. It covers several key steps in data integration including identifying common elements between different data sources, designating data stewards, designing ETL processes, and establishing maintenance procedures. It also provides historical context on data integration, from early approaches focusing on views and interoperability to more modern concepts like data warehousing and data alignment on the web. Challenges of data heterogeneity are discussed along with examples.

Uploaded by

Adnan Ejaz
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 60

Data Integration

Click to edit Master subtitle style

4/9/12

Data Models
Main aim of data models is to support the development of information systems by providing the definition and format of data

Conceptual

4/9/12

Data Modeling Process

4/9/12

4/9/12

Data Set Interaction

4/9/12

Steps in Data Integration


Step

1: Identification of common elements

Identify the common entities Example, Employee dimension can come from Sales system, Payroll System, etc. Products can come from manufacturing, sales, purchase etc. Once the common entity is identified its definition should be standardized. Example, does employee include fulltime employees as well as temporary workers Identify the common attributes. What are the attributes that are common to employee, 1st name, 2nd name, last name, date of joining, etc? Each attribute should be defined. Identify the common values 4/9/12 Same information can be represented in different

Steps in Data Integration


Second

step is identification of Data Steward who will own the responsibility and ownership for particular set of data elements Third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration. This topic will be discussed in more detailed under its own heading Fourth step is to establish a process of maintenance, review & reporting of such elements
4/9/12

Historical Perspective
Click to edit Master subtitle style

4/9/12

View Integration (1965-75)


From

application requirements to the database schema


Applicatio n requireme nts2 Applicatio n requireme nts3

Applicatio n requireme nts1

Applicatio n requireme ntsn

Integrated Conceptual Schema


4/9/12 9

Data Interoperability (1975-80)

LOCATION TRANSPARENCY global schema global access only: global + local access: DISTRIBUTED DBS FEDERATED DBS

I N T E G R A T I O N

LOCATION VISIBILITY no global schema multiDB views, multi DB access language: MULTIDATABASE SYSTEMS

unstructured or semi-structured data (files, repositories, knowledge bases, spreadsheets, ) information exchange protocols / languages: INTEROPERABLE SYSTEMS

4/9/12 10

Inter-organizational Data Integration (80-95)

Federated Databases
I.S. DBMS1
1

I.S. netw ork DBMS 3

DB1

FED. SCHEMA A

DB3 I.S. DBMS


2

FED. FED.SCHEM SCHEMA A B B


4/9/12 11

FDB Service Architecture


Negotiator Data Wharehouse

Translation: Wrappers / Mediators

Filtering

Integration

local export schemas

4/9/12

12

Data Warehousing (19952000)


Traditional

DW Architecture
OLAP Server OLAP

Internal Sources Data Integration Component Data Warehouse Query and Analysis Component

Reports

Operational DBs

Data Mining

Meta data External Sources Client Tools

4/9/12

13

The Web Era: Data Alignment


Catalog matching In order for a private company to participate in the marketplace (e.g., eBay), it has to determine correspondences between entries of its catalogs and entries of a common catalog of a marketplace. Once the correspondences between two schemas have been determined, the next step is to generate query expressions that automatically translate data instances of these catalogs under an integrated catalog Having aligned the catalogs, users of a marketplace have a unified access to the products which are on sale

4/9/12

14

Heterogeneity is the Issue

Schemas

Generalization Specialization

Aggregation Structural

Typing

Completenes s Taxonomy

Model Syntactical Language

Data "Conflicts"

Semantic

Values Cognitive

4/9/12

15

Library Example (homogeneous)

Same content, different structures


Schema 1
Book title ISBN authors name birthdate

Schema 2
name

Author birthdate books title ISBN 4/9/12 16

Thesis example (heterogeneous)


Schema S1 (OO)

Person
Pin Name

The integrated schema (OO)


Person Pin Name Faculty Rank Phd-advisor Student GP A PhD Student Thesis Adv . Titl e Student

Faculty
Rank

Student
GP A

Schema S2 (relational)
Thesis (Phd-advisor, Phd-student, title)

Different data models, overlapping content 4/9/12

17

DW: Heterogeneous External Structured Sources


Apply transformations as needed to homogenize data structures (optional) Integrate
transformation integration

source DBs

homogeneized DBs

DW
4/9/12 18

A Generic Framework for Integration


Source schema s schemas transformation correspondence s investigation schemas integration mapping rules

integrat ed schema

transformati on rules

investigatio n rules

integration rules

Also for alignment


4/9/12 19

A Generic Framework for Integration


Source schema s schemas transformation correspondence s investigation schemas integration mapping rules

integrated schema

transformati on rules

investigatio n rules

integration rules

Also for alignment


4/9/12 20

Step 1: Pre-Integration
Given

the heterogeneity of the local sources:

DBMSs , GISs, XML, UML, OWL, RDF, WSDL, Reduce this heterogeneity as much as possible to make the sources more suitable for integration
Often,

the source data sets cannot be modified


Keep current applications programs alive Proprietary sources with independent use 4/9/12 => Modifications through a view mechanism
21

Data Model Homogenization

Goal: use a single, common data model and format


transformation integration

source DBs

homogeneized DBs

DW

4/9/12

22

Design Homogenization

Explicit the semantics of the schema


relational normalization : one fact in one place complex object normalization

Enforce standard design rules to reduce the number of structural conflicts

Objectify multivalued attributes:


Person child Child

Person

Family

4/9/12

23

Filling the semantic gap


COBOL

files Relational databases Object-Oriented databases Spreadsheets

? ?

Problem: what do these legacy data sources contain?

4/9/12

24

Understanding legacy data

Typical data reverse engineering process:


From implementation structures To conceptual design

Documentation is non-existing or obsolete (not updated) Designers are gone Knowledge is missing

4/9/12

25

COBOL files

Data description is within the application programs Data descriptions for the same file vary from one application program to the next Data descriptions may be incomplete (FILLER) 1st problem: to recover the underlying structure for each file

Analysis of DATA DIVISIONs Dataflow analysis within application programs


4/9/12

26

Relational databases
The

existing schema is usually not normalized Information on primary keys, candidate keys, foreign keys,functional dependencies, inclusion dependencies, other constraints, derivation rules, may be missing and is essential for the reverse engineering process

4/9/12

27

Example
Person ( P#, name, address Firstnames ( P#, firstname ) Employee ( E#, position, salary ) EmpDept ( E#, department ) Department ( name, location, boss, ) budget )

P# name address firstname s 0:n position salary

Person

0:n

Employee
0:n

E. D. bo ss

0:n

Department

name location budget

1:1

4/9/12

28

Where to find

Primary keys / Candidate keys


often defined in the schema check Select Distinct in SQL queries or mine the DB
check domain compatibility from the schema check join conditions in SQL queries check values in the DB Mine the DB

Foreign keys

Dependencies (functional, inclusion, existence)

4/9/12

29

Primary Keys?
Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) linguistic heuristics: Person ( P# , name, address ) Firstnames ( P# , firstname ) => Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) => EmpDept ( E# , department ) Department ( name, location, boss, budget )

4/9/12

30

Foreign keys?
Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) Finding foreign keys

Firstnames.P# Person.P# but: Person.P# Firstnames.P# EmpDept.E# Employee.E#

EmpDept.department Department.name

Department.boss Employee.E# Employee.E# Person.P# 4/9/12 31

Objects, relationships, attributes ) Person ( P# , name, address


Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) A relation with one PK only is an object

Person ( P# , name, address ) Employee ( E# , position, salary )

A relation whose PK is composed of FK is a relationship


EmpDept ( E# , department ) Employee, Department are objects

The FK boss in Department defines a relationship with 1:1 role cardinality firstname typically is a multi-valued attribute of Person

4/9/12

32

Is-a links
P# name address firstname s 0:n position salary

Person
0:n

Employee
0:n

E. D. bo ss

0:n

Departme nt

name location budget

1:1

Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) PK is FK weak entity
4/9/12 33

Further Semantic Enrichment


Goals: ensure better understanding of the schemas Completing existing descriptions with :

meta-data for each representation external keys thesauri semantic definitions explicit representation rules
Example : if a road crosses a river, a bridge exists

quality indicators
4/9/12 34

Enrichment Examples

Add information on how the schema maps to the real world

add a definition of "Employee" as "those persons who have an employment contract with the enterprise add a definition of "salary" in "Employee" as "the gross salary amount, before any deductions"

Add

information on the description of the real world

"boss" in "Department" is an external key to "Employee "salary" in "Employee" is expressed in Euros


4/9/12 35

Enrichment: Adding Constraints

E.g., add functional dependencies


E m p lo y e e

E#

p ro je c ts

Employee (E#, projectname)

n a m e Project r(name, manager) m anage

E m p lo y e e

E#

p ro je c ts nam e ro le

4/9/12

36

A Generic Framework for Integration


schem as mapping rules correspondences investigation schemas integration integrat ed schema

schemas transformation

transformati on rules

investigatio n rules

integration rules

4/9/12

37

Step 2: Correspondences Investigation

Correspondences relate (schema) elements which describe the same phenomena of the real world

What are the related schema elements ? How their sets of instances are related ? How corresponding instances are identified ? Do their descriptions match ? How ?

Known as the (Schema) Matching problem


4/9/12 38

Semantic Relativism

Information representation depends on perception


cable blue cable color material conflicts red cable copper cable color fiber material

Differences in representation as usually referred to as


4/9/12 39

Semantics of correspondences
Real World

Database System

Schema 1

Schema 2

4/9/12

40

Example of correspondences
Person

Schema S1
Faculty
Rank

Pin Name

Student
GP A

Schema S2

Thesis (Phd-advisor, Phd-student, title)

4/9/12

41

Some Library Correspondences


Book

Schema 1

title ISBN

authors

name birthdate

Schema

2
name

Author

birthdate books title ISBN 4/9/12 42

Asserting Correspondences
S1 thing1 set_relationship S2 thing2
With Corresponding Identifiers: thing1id-predicate = thing2id-predicate [With Corresponding Properties: thing1attribute
set_relationship

thing2attribute , ... ]

set_relationship :

4/9/12

43

Corresponding Schema Elements

one element --- one element (same construct, e.g. object type)
Node CONTAINED-IN Extremity

one element --- one element (different constructs, e.g. objec type and attribute )
Book EQUAL books authors EQUAL Author

4/9/12

44

The Match Operator


It

takes two schemas/ontologies, each consisting of a set of discrete entities (e.g., tables, XML elements, classes, properties) as input and determines as output the relationships (e.g., equivalence, subsumption) holding between these entities.

4/9/12

45

Matching Dimensions
Input

dimensions

Underlying data models (e.g., XML) Schema-level vs. Instance-level

Process

dimensions

Approximate vs. Exact Syntactic vs. semantic vs. external input

Output

dimensions

Cardinality (e.g., 1:1, 1:m) Equivalence vs. more relations Graded vs. absolute confidence
4/9/12 46

e.g. String-based Techniques

Prefix / Suffix

Checks whether the first string starts (ends) with the second one prefix: net = network; but also hot = hotel suffix: phone = telephone; but also word = sword

Edit distance
Calculates the number of insertions / deletions / substitutions of characters required to transform one string into another, ratio max(length(string1); length(string2))
EditDistance(NKN,Nikon) = 2/5 = 0.4

N-Gram

Calculates the number of identical n-grams (i.e., sequences of n characters) between them trigram(3) for the string nikon are nik, iko, kon n-gram(2) (nikon, konnichiwa) = 3 4/9/12 47

e.g. Language-based Techniques

Tokenization
Names are parsed into tokens by recognizing punctuation, cases Hands-Free_Kits => <hands, free, kits>

Lemmatization
Tokens are morphologically analyzed in order to find all their possible basic forms Kits => Kit

Elimination
Tokens that are articles, prepositions, conjunctions, and so on, are marked to be discarded a, the, by, type of

4/9/12

48

A Generic Framework for Integration


schem as mapping rules correspondences investigation schemas integration integrat ed schema

schemas transformation

transformati on rules

investigatio n rules

integration rules

4/9/12

49

Step 3 : Integration (or alignment)


Goal

for Integration: Creating an Integrated Schema ( IS ) and the mappings to the local databases Goal for Alignment: Conform a specification to a given prescriptive specification (PS) Goal for Matching: Find correspondences/complementarity PS IS

S1

S2

S3

S1

S1
4/9/12 50

GAV & LAV Integration

GAV (Global As View): the Integrated Schema provides an integrated description of all the data available in the sources

the IS is used to access data from any sources queries to the IS are mapped into queries to the sources (as in distributed databases) the IS is defined to allow access to all data: in case of conflicting specifications, an all-encompassing specification is elaborated for the IS

LAV (Local As View): the Integrated Schema provides an integrated description of all the data that is desirable and that somehow matches the requirements of users of the IS

the IS may define data that does not exist in any of the sources (missing / incomplete information problem)

Other Sub-Goals for integration:


minimality vs. exhaustiveness of the integrated schema, ... 4/9/12 51

Integration Process
Solving

the conflicts

Classification conflicts Descriptive conflicts Structural conflicts Fragmentation conflicts

Examples

of conflicts among related object

types

different different different different different

classifications (sets of instances) sets of properties structures coding schemes metadata 4/9/12

52

Integration rules

Rules defining the strategy to solve conflicts Example rules:

If an object type corresponds to an attribute, keep the object type If the population of an object type is included in the population of another object type, build an isa hierarchy

Integration rules depend on how you want the integrated schema to look like

4/9/12

53

Structural Conflicts

Different schema element types, e.g.: class, attribute, relationship Library example :

S1 : Book is a class S2 : books is an attribute of Author

Conflict resolution : Choose the less constraining structure

IS: Book are classes

4/9/12

54

Classification Conflicts

Corresponding elements describe different sets of real world objects

S1.Faculty CONTAINS S2.PhD-advisor

Conflict Resolution:
Generalization / Specialization hierarchy
Faculty

Phd-advisor Faculty

Merging

4/9/12

55

Descriptive conflicts
Corresponding types have different properties, or corresponding properties are described in different ways Object / relationship type:

naming conflicts :
synonyms homonyms Node , Extremity Highway (EU) , Highway (USA)

composition conflicts : different attributes and methods


Employee ( E# , name , address , dob )
4/9/12 56

Descriptive conflicts

(2)

Attribute :
naming conflicts : synonyms , homonyms cardinalities
firstname : one , two , N values

domains
salary : $, Euro ... student grade : [ 0 : 20] , [1 : 5 ] , [ A: D ] ... geometry : different scales, different reference systems, ...

integrity constraints
the average area for a flat in Japan, in USA ...

Method

:
4/9/12 57

naming conflicts : synonyms , homonyms

Descriptive conflicts

(3)

Solution depends upon :


the type of the descriptive conflict the type of classification conflict

S1 : Employee ( E# , name , address , dob ) S2 : Employee ( E# , position , salary ) S1.Employee EQUAL S2.Employee => IS. Employee ( E# , name , address , dob , position , salary )
4/9/12 58

Integration Methods: Manual

First method : manual integration do it yourself


a languag e

mapping rules integrate d schema

sche mas

DBA

easy to implement , flexible BUT time consuming for the DBA


4/9/12 59

Integration Methods: Semi-Automatic


Second

method : semi-automatic integration tell me about the problem , I will try to fix it
TOOL mapping rules integrate d schema DBA Opens to visual CASE tools, integration servers BUT knowledge acquisition can be painful
4/9/12 60

correspondence s schem as

You might also like