Data Integration

Uploaded by

aksshu1902
DATA INTEGRATION

OVERVIEW

• Expose the problem of data integration (causes, consequences…)
• Present the concepts of data integration: pipeline, challenges…
• List different data integration approaches: federated databases, data migration, master data management, data warehousing, virtual data integration, semantic (ontology based) data integration
• Explain data warehousing and show limitations
• Present virtual data integration system components and languages for mediated schema (GAV, LAV, GLAV)
• Present ontology-based integration
• Conduct practical example of ontology-based integration using Ontop under Protégé
DATA SILOS PROBLEM

• Companies develop separate applications for different needs


• Applications are used for different processes and departments
• Each application stores its own data

[Figure: the company information system (IS) split into Marketing, HR, Sales, Production and Cust. service, each with its own data]
CAUSES OF DATA SILOS

• Organisational units (departments) are strictly partitioned


• Different cultures, different profession/occupation

• Hierarchical structures make it difficult for departments to share data


• Management layers and rules make sharing difficult (e.g. permissions…)
• Budgets allocated to separate departments

• Variability in technology
• Different departments use different technologies: customer service with an Access DB, HR with Excel files, production with Oracle…
CONSEQUENCES OF DATA SILOS

• No overview of existing data


• Ex. Marketing can benefit from accessing sales data to optimize their plans
• This needs building up connections between data from different processes
• Costs and resources
• Duplicated data has significant costs and wastes resources
• Inconsistent data
• Duplication leads to inconsistency
• Inconsistency may result in loss of quality of service (QoS)
WHY IS IT DIFFICULT TO REMOVE SILOS?

• Difficulty in changing professional culture


• People may feel they are losing ownership of their data

• Technical difficulty to change permissions, data responsibilities...


• Costs of unifying technologies
• Training IT staff on different technologies is a challenge
DATA INTEGRATION

• Combine several disparate data sources to provide a unified view of the data

[Figure: Production, Sales, HR, Marketing and Cust. service data are integrated into a unified view used for analysis, reporting and mining]
DATA INTEGRATION PROCESS

• Identify sources (technology, schema)
• Create an integrated schema: match data, schemas
• Extract, transform, load data according to schema, into centralized or distributed storage
• Query the data over the integrated schema…
DATA INTEGRATION APPROACHES

• Federated databases
• DBMSs coordinated to collaborate
• Data migration
• Concerns merging two systems (e.g. company acquisition/merger). Move data from one to the other or from both to a new system
• Master data management
• Integrate only non-transactional data (customer, client, product, account…)
• Data warehousing
• Virtual data integration
• Semantic data integration
DATA WAREHOUSE ARCHITECTURE

• Make periodical physical copy of the data into an integrated database

[Figure: sources (Production, Sales, HR, Marketing, Cust. service), each with its local schema, feed ETL tools that load the data warehouse (global schema); data marts built on the warehouse serve analysis, reporting and mining]
EXAMPLE DATA WAREHOUSE SOLUTIONS

• Data Warehouse, Autonomous Data Warehouse, Db2 Database, Synapse, BigQuery, Redshift


DATA WAREHOUSE SCHEMA

• A schema is the structure of the database


• Names of tables, records, views, indexes

• Optimized for reporting and analysis


• Two approaches
• Normalised data warehouse: normalised relational model (similar to RDB)
• Dimensional data warehouse: uses facts and dimensions
• Fact: quantitative data to be analysed (number of products ordered, total price paid…)
• Dimension: criteria for the analysis (date, customer, location, sales officer…)
• Multiple dimensions for one fact
DIMENSIONAL DATA WAREHOUSE SCHEMAS

• Star schema
• A star shaped schema linking fact to its dimensions
• Most popular schema
• Easy for interpretation and reporting with small datasets
• Usually used for data marts
• Inappropriate for whole warehouse as they have to change when reporting requirements change
• Snowflake schema
• Extension of the star schema
• Dimensions are stored in multiple tables
• A dimension has multiple branches, giving the snowflake shape
• Separation of dimension allows finer analysis (example, unit of time)
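
The fact/dimension idea can be tried out with a few lines of Python and SQLite. This is only a minimal sketch: the table names (fact_sales, dim_date, dim_customer) and the sample rows are invented for illustration and do not come from the slides.

```python
import sqlite3


def sales_by_month_and_city():
    """Build a tiny star schema in memory and aggregate the fact by two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Dimensions: criteria for the analysis (hypothetical tables)
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    -- Fact: quantitative data, linked to each of its dimensions
    CREATE TABLE fact_sales (
        date_id INTEGER REFERENCES dim_date(date_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        quantity INTEGER,
        total_price REAL
    );
    INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO dim_customer VALUES (1, 'Alice', 'Edinburgh'), (2, 'Bob', 'Glasgow');
    INSERT INTO fact_sales VALUES (1, 1, 3, 30.0), (1, 2, 1, 10.0), (2, 1, 2, 20.0);
    """)
    # Typical reporting query: join the fact to its dimensions, then group.
    rows = conn.execute("""
        SELECT d.month, c.city, SUM(f.quantity), SUM(f.total_price)
        FROM fact_sales f
        JOIN dim_date d ON d.date_id = f.date_id
        JOIN dim_customer c ON c.customer_id = f.customer_id
        GROUP BY d.month, c.city
        ORDER BY d.month, c.city
    """).fetchall()
    conn.close()
    return rows
```

In a snowflake variant of this sketch, dim_customer would itself point to a separate city table instead of storing the city name directly.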
EXAMPLES

[Figure: an example star schema and an example snowflake schema. Source: datawarehouseinfo.com]
EXTRACT, TRANSFORM, LOAD

• Extract: Read data from the sources


• Requires specific software connectors

• Transform: convert extracted data to the target schema


• Transformation can include operators like cleaning, merging, rules, calculating

• Load: write the transformed data to the target database according to its schema
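
The three steps can be sketched in Python as follows. Everything here is an assumption made for illustration: the two "sources" are an in-memory CSV string and a list of dictionaries, the target is an in-memory SQLite table called academic, and real ETL tools use dedicated connectors instead.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different field names and formats.
CSV_SOURCE = "name,exp_years\nSteve,9\nRadu,11\n"
DICT_SOURCE = [{"id": " elisabeth ", "exp": "5"}]


def extract():
    """Extract: read raw records from every source."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from DICT_SOURCE


def transform(record):
    """Transform: clean the values and convert them to the target schema (name, years)."""
    name = record.get("name") or record.get("id")
    years = record.get("exp_years") or record.get("exp")
    return (name.strip().title(), int(years))


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS academic (name TEXT PRIMARY KEY, years INTEGER)")
    conn.executemany("INSERT OR REPLACE INTO academic VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")
    load(conn, (transform(r) for r in extract()))
    result = conn.execute("SELECT name, years FROM academic ORDER BY name").fetchall()
    conn.close()
    return result
```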
EXAMPLE ETL TOOLS

[Figure: logos of example ETL tools; two of them are marked as open source]


DATA WAREHOUSE LIMITATIONS

• Data integrity difficult to maintain


• Data may be out of date

• Data loading is tedious


• Fixed structure, difficult to modify if company changes processes
VIRTUAL DATA INTEGRATION

• Data is kept at the source

• Wrappers convert source data to the global model

• The mediated schema mechanism breaks a query into sub-queries over the sources


[Figure: each source (Production, Sales, HR, Marketing, Cust. service) with its local schema sits behind a wrapper; the mediator (virtual database) holds the global (mediated) schema and reformulates, optimizes and executes queries for analysis, reporting and mining]
CHARACTERISTICS OF VIRTUAL DATA INTEGRATION

• Data sources updated independently


• Sources may participate in different mediated systems
• The system allows sources to be added/removed
• Mediated schema captures the data offered by sources and their relations to the global schema
MEDIATOR ARCHITECTURE

[Figure: a user query enters the query plan generator, which uses the global schema and the source mappings; the execution engine runs the plan through four wrappers, one per source]


EXAMPLE GLOBAL SCHEMA

• Three relational tables


• Student (student, course)
• Enrolment (student, year)
• Course (student, title, year)

• User would like to know which students enrolled on the course 'F21BD' in 2024
• SELECT student FROM Course WHERE title = 'F21BD' AND year = '2024'

• Data is not stored in Course (the global schema holds no data)


• Mediator creates a query plan (which sources, which data, sub-queries to each source, gather
answers to create final answer)
WRAPPER

• Connector responsible for wrapping data source to allow communication with mediator
• Data source exposed as a convenient database with the appropriate (to mediator)
schema
• Export data from source as requested by mediator
• Presented schema can be different from the actual one
• Data can be the result of transformation of real data at the source
• The same source can have different wrappers for different virtual systems
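
A rough sketch of how wrappers and a mediator fit together is given below. The two sources, their field names (learner, module, session) and the class names are hypothetical; the point is only that each wrapper presents its source in the schema the mediator expects, and the mediator sends the same sub-query to every wrapper and gathers the answers.

```python
class TupleSourceWrapper:
    """Wraps a hypothetical source that stores (student, course_code, year) tuples."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        # Present each tuple in the schema expected by the mediator.
        for student, code, year in self.rows:
            yield {"student": student, "title": code, "year": year}


class RecordSourceWrapper:
    """Wraps a hypothetical source whose records use other field names and types."""

    def __init__(self, records):
        self.records = records

    def scan(self):
        # The exported data is a transformation of the real data at the source.
        for rec in self.records:
            yield {"student": rec["learner"], "title": rec["module"], "year": int(rec["session"])}


class Mediator:
    """Holds no data: answers queries by asking every wrapper and merging the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def students_enrolled(self, title, year):
        answers = set()  # a set removes duplicates coming from different sources
        for wrapper in self.wrappers:
            for row in wrapper.scan():
                if row["title"] == title and row["year"] == year:
                    answers.add(row["student"])
        return sorted(answers)


mediator = Mediator([
    TupleSourceWrapper([("Ana", "F21BD", 2024), ("Ben", "F21BD", 2023)]),
    RecordSourceWrapper([
        {"learner": "Cara", "module": "F21BD", "session": "2024"},
        {"learner": "Ana", "module": "F21BD", "session": "2024"},
    ]),
])
```

Calling mediator.students_enrolled("F21BD", 2024) collects matching students from both sources. A real mediator would push the filter down to each source instead of scanning everything.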
MAPPINGS

• Provide information to mediator about data available at sources and relation to global
schema
• This information is the basis for generating the query plans
• Description provided as logical formulas (SQL queries, views..)
• Logical formulas relate local schemas to the global (mediated) schema
• The set of formulas define the mappings between local schemas and global schema
MAPPING APPROACHES

• Global-as-view (GAV) – old approach (named in 1990’s)
• Relations in the global schema are described as views over the local schemas
• Local schemas contain more details than the global schema

• Local-as-view (LAV) – mid-1990’s
• Relations in the local schemas are described as views over the global schema
• Global schema may contain more details than the local schemas

• Global-and-Local-as-Views (GLAV) – 2000’s
• Mapping expressed as correspondence between views on the global schema and views over the local schemas
• GAV and LAV can be derived as special cases from GLAV

[Figures: for each approach, a global schema drawn above three local schemas]
ILLUSTRATION OF MAPPING APPROACHES

Local Schema: prof(name, exp_years)

name       exp_years
Steve      9
Radu       11
Hadj       36

Local Schema: lecturer(id, exp)

id         exp
Elisabeth  5
Hadj       36

Global Schema: academic(name, years)

name       years
Radu       11
Steve      9
Elisabeth  5
Hadj       36
GAV MAPPING

CREATE VIEW Academic AS
SELECT prof.name as name, prof.exp_years as years
FROM prof
UNION
SELECT lecturer.id as name, lecturer.exp as years
FROM lecturer;

• Query for academics with exp > 30

SELECT academic.years
FROM academic
WHERE years > 30;

• Transformed (unfolding) by mediator into

SELECT academic.years
FROM (SELECT prof.name as name, prof.exp_years as years FROM prof
      UNION
      SELECT lecturer.id as name, lecturer.exp as years FROM lecturer) AS academic
WHERE years > 30;
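
The unfolding step can be checked with SQLite from Python. This is a sketch under simplifying assumptions: both local tables live in one in-memory database standing in for two separate sources, the sample rows are the ones from the illustration slide, and the query also returns the name so the result is easier to read.

```python
import sqlite3


def gav_demo():
    """Run the same query through the GAV view and in its unfolded form."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE prof (name TEXT, exp_years INTEGER);
    CREATE TABLE lecturer (id TEXT, exp INTEGER);
    INSERT INTO prof VALUES ('Steve', 9), ('Radu', 11), ('Hadj', 36);
    INSERT INTO lecturer VALUES ('Elisabeth', 5), ('Hadj', 36);
    -- GAV: the global relation is defined as a view over the local schemas
    CREATE VIEW academic AS
        SELECT name AS name, exp_years AS years FROM prof
        UNION
        SELECT id AS name, exp AS years FROM lecturer;
    """)
    via_view = conn.execute(
        "SELECT name, years FROM academic WHERE years > 30"
    ).fetchall()
    # Unfolding: the view name is replaced by its definition.
    unfolded = conn.execute("""
        SELECT name, years FROM (
            SELECT name AS name, exp_years AS years FROM prof
            UNION
            SELECT id AS name, exp AS years FROM lecturer
        ) AS academic
        WHERE years > 30
    """).fetchall()
    conn.close()
    return via_view, unfolded
```

Both queries give the same answer, and UNION removes the duplicate row for Hadj, who appears in both sources.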
LAV MAPPING

• Source schema 1

CREATE VIEW prof AS
SELECT academic.name as name, academic.years as exp_years
FROM academic;

• Source schema 2

CREATE VIEW lecturer AS
SELECT academic.name as id, academic.years as exp
FROM academic;

• Query for academics with exp > 30

SELECT academic.years
FROM academic
WHERE years > 30;

• Unfolding not possible (the global relations are not defined as views over the local schemas), mediator runs reasoning to generate query plan
• Based on mappings
• Look for names in both views
• Fuse results (delete redundancies…)
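
The idea of planning from the mappings and then fusing results can be sketched as below. This is a heavily simplified stand-in, not a real LAV rewriting algorithm (published algorithms such as bucket or MiniCon also handle joins and partial views). The SOURCES dictionary and its "mapping" entries are hypothetical: each one records which local column corresponds to which attribute of the global relation academic(name, years).

```python
# Each source is described in terms of the global schema (LAV style).
SOURCES = {
    "prof": {
        "mapping": {"name": "name", "years": "exp_years"},
        "rows": [
            {"name": "Steve", "exp_years": 9},
            {"name": "Radu", "exp_years": 11},
            {"name": "Hadj", "exp_years": 36},
        ],
    },
    "lecturer": {
        "mapping": {"name": "id", "years": "exp"},
        "rows": [
            {"id": "Elisabeth", "exp": 5},
            {"id": "Hadj", "exp": 36},
        ],
    },
}


def academics_with_more_than(min_years):
    """Answer a query over academic(name, years) using only the source descriptions."""
    needed = {"name", "years"}
    fused = set()  # fusing step: a set deletes redundant answers
    for source in SOURCES.values():
        mapping = source["mapping"]
        if not needed <= mapping.keys():
            continue  # this source cannot contribute to the query
        for row in source["rows"]:
            years = row[mapping["years"]]
            if years > min_years:
                fused.add((row[mapping["name"]], years))
    return sorted(fused)
```

Adding a new source only requires a new entry in SOURCES; nothing in the global schema or in the other entries has to change, which is the usual argument in favour of LAV.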
LIMITATIONS OF VIRTUAL DATA INTEGRATION

• Mediator cannot update data


• Requires describing the data at the sources and the mappings between schemas
• Formalism used should be expressive and easy to maintain
SEMANTIC DATA INTEGRATION

• Large volumes of legacy data in relational databases and other formats


• Importing into RDF stores requires large effort and costs
• How to expose such data to the web (of data)
• Expose relational databases as RDF
ONTOLOGY BASED DATA ACCESS (OBDA)

• A semantic approach to data integration


• Global schema is an ontology
• Local schemas are relational database schemas
• Mappings expressed as logical axioms between ontology and sources
• Semantic technologies can be fully applied
• Reasoning over the ontology to infer information
• Merge data with shared IRI
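
The last two points can be illustrated with plain Python sets of triples. This is only a sketch: the IRIs (ex:hadj, ex:Professor, ex:Academic) are invented, prefixed names are kept as plain strings, and the only reasoning rule applied is subclass inheritance, whereas an OBDA system supports much more.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def merge_and_infer(*graphs):
    """Union several sets of (subject, predicate, object) triples, then add inferred types."""
    merged = set().union(*graphs)
    changed = True
    while changed:  # repeat until no new triple can be inferred
        changed = False
        subclass_pairs = {(s, o) for s, p, o in merged if p == SUBCLASS_OF}
        for s, p, o in list(merged):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                inferred = (s, RDF_TYPE, sup)
                if o == sub and inferred not in merged:
                    merged.add(inferred)
                    changed = True
    return merged


# Two hypothetical sources describe the same person under the same IRI.
hr_graph = {("ex:hadj", RDF_TYPE, "ex:Professor"), ("ex:hadj", "ex:years", 36)}
sales_graph = {("ex:hadj", "ex:email", "hadj@example.org")}
ontology = {("ex:Professor", SUBCLASS_OF, "ex:Academic")}

result = merge_and_infer(hr_graph, sales_graph, ontology)
```

Because both sources use the IRI ex:hadj, their facts end up attached to the same node without any join condition, and the ontology adds the fact that ex:hadj is also an ex:Academic.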
DATABASE TO ONTOLOGY MAPPING APPROACHES
TWO MAPPING APPROACHES

• Direct mapping (figure source: https://doi.org/10.3390/app10207070)
• R2RML (figure source: www.rdb2rdf.org)


ONTOLOGY MAPPING LANGUAGES
(W3C STANDARDS)
• Direct mapping (W3C 2012)
• https://www.w3.org/TR/rdb-direct-mapping
• R2RML: RDB to RDF mapping language (W3C 2012)
• https://www.w3.org/TR/r2rml/
• RML: RDF mapping language
• Generalises R2RML to CSV, XML, JSON…
• Various implementations (https://www.w3.org/TR/rdb2rdf-implementations)
• D2RQ, Morph, RMLMapper, Ontop, Ultrawrap
DIRECT MAPPING EXAMPLE: TABLE TRIPLE
DIRECT MAPPING EXAMPLE: LITERAL TRIPLE
DIRECT MAPPING EXAMPLE: REFERENCE TRIPLE
DIRECT MAPPING RESULT

[Figure: RDF graph produced by the direct mapping, including a node for Bob]
DIRECT RELATIONAL TO RDF MAPPING

• Automated mapping
• Data from the relational database is mapped directly to RDF / OWL
• Resulting graph is poor due to inherent limitations of relational models
• No links to existing ontologies
• Recommended reading: Pinkel, C., Binnig, C., Jiménez-Ruiz, E., Kharlamov, E., May, W., Nikolov,
A., Sasa Bastinos, A., Skjæveland, M.G., Solimando, A., Taheriyan, M. and Heupel, C., 2018. RODI:
Benchmarking relational-to-ontology mapping generation quality. Semantic Web, 9(1), pp.25-52.
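
The three kinds of triples (table, literal, reference) can be generated with a short Python function. This is a sketch modelled loosely on the IRI shapes used in the W3C Direct Mapping examples; the People/Addresses tables, the function name and its parameters are assumptions for illustration, and percent-encoding, datatypes and composite keys are left out.

```python
def direct_map_row(table, pk_col, row, foreign_keys=None):
    """Return the triples for one row.

    foreign_keys maps a column name to (referenced_table, referenced_pk_column).
    """
    foreign_keys = foreign_keys or {}
    subject = f"<{table}/{pk_col}={row[pk_col]}>"
    # Table triple: the row is an instance of the class named after its table.
    triples = [(subject, "rdf:type", f"<{table}>")]
    for col, value in row.items():
        if value is None:
            continue  # NULL values produce no triple
        # Literal triple: one per non-null column.
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        triples.append((subject, f"<{table}#{col}>", literal))
        # Reference triple: one per foreign key, pointing at the referenced row.
        if col in foreign_keys:
            ref_table, ref_pk = foreign_keys[col]
            triples.append(
                (subject, f"<{table}#ref-{col}>", f"<{ref_table}/{ref_pk}={value}>")
            )
    return triples


bob = {"ID": 7, "fname": "Bob", "addr": 18}
bob_triples = direct_map_row("People", "ID", bob, {"addr": ("Addresses", "ID")})
```

Every class and property name comes straight from the table and column names, which is why the resulting graph has no links to existing ontologies.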
R2RML MAPPING
R2RML: RDB to RDF Mapping Language

• Relational database and domain ontologies are input to the mapping process
• RDF graph representing database content with terms from the ontologies
• Optimal graph representation (depending on ontology quality)
• Result linked to existing ontologies, support reuse
• Requires manual building of mappings and ontology alignment
OVERVIEW OF R2RML
https://www.w3.org/TR/r2rml/
R2RML MAPPING FROM TABLE: EXAMPLE

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<TriplesMap1> a rr:TriplesMap ;
    # Table being mapped
    rr:logicalTable [ rr:tableName "Person" ] ;
    # Subject IRI pattern
    rr:subjectMap [
        rr:template "http://www.ex.com/Person/p{ID}" ;
        rr:class foaf:Person
    ] ;
    # Generates: <http://www.ex.com/Person/p1> rdf:type foaf:Person .

    rr:predicateObjectMap [
        # Predicate IRI
        rr:predicate foaf:name ;
        # Object value from DB
        rr:objectMap [ rr:column "NAME" ]
    ] .
    # Generates: <http://www.ex.com/Person/p1> foaf:name "Alice" .
R2RML MAPPING FROM VIEW: EXAMPLE

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<TriplesMap2> a rr:TriplesMap ;
    # SQL query to extract data
    rr:logicalTable [ rr:sqlQuery
        "SELECT ID, NAME FROM Person WHERE gender = 'F'" ] ;
    rr:subjectMap [
        rr:template "http://www.ex.com/Person/p{ID}" ;
        rr:class <http://www.ex.com/Woman>
    ] ;

    rr:predicateObjectMap [
        rr:predicate foaf:name ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
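
To see what such a triples map produces, the mapping can be imitated by hand in Python. This is not an R2RML processor (tools such as Ontop do that job): the dictionary keys in TRIPLES_MAP, the helper names and the sample Person rows are all hypothetical, and the column names are assumed to be upper case (ID, NAME) as in the SQL query.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hand-written Python counterpart of the view-based triples map.
TRIPLES_MAP = {
    "sql_query": "SELECT ID, NAME FROM Person WHERE gender = 'F'",
    "subject_template": "http://www.ex.com/Person/p{ID}",
    "subject_class": "http://www.ex.com/Woman",
    "predicate_object": [(FOAF_NAME, "NAME")],  # (predicate IRI, source column)
}


def build_person_db():
    """Create a small in-memory Person table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Person (ID INTEGER PRIMARY KEY, NAME TEXT, gender TEXT)")
    conn.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                     [(1, "Alice", "F"), (2, "Bob", "M")])
    return conn


def apply_triples_map(conn, tmap):
    """Run the logical table query and generate one set of triples per row."""
    cursor = conn.execute(tmap["sql_query"])
    columns = [desc[0] for desc in cursor.description]
    triples = []
    for values in cursor:
        row = dict(zip(columns, values))
        subject = tmap["subject_template"].format(**row)  # fill the {ID} placeholder
        triples.append((subject, RDF_TYPE, tmap["subject_class"]))
        for predicate, column in tmap["predicate_object"]:
            triples.append((subject, predicate, row[column]))
    return triples
```

Replacing sql_query with "SELECT * FROM Person" and subject_class with the foaf:Person IRI would imitate the table-based mapping instead.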
R2RML MAPPING RESULT

[Figure: RDF graph produced by the R2RML mapping, including a node for Bob]
RML VOCABULARY
Extension of R2RML to other sources (Excel, NoSQL…)
RML VS. R2RML
RDF MATERIALISATION

• Mapping can result in


• materialised RDF (an actual triple store)
• Comparable to data warehouse
• virtual RDF
• Virtual database (view) with wrappers for the sources
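
The difference between the two options shows up as soon as the source changes. The sketch below uses a made-up in-memory list as the source and a made-up to_triples function as the mapping; it only illustrates the trade-off and is not how a triple store or an OBDA engine is implemented.

```python
# Hypothetical source table, kept as a list of rows.
source_rows = [{"id": 1, "name": "Alice"}]


def to_triples(rows):
    """Stand-in for the mapping: one foaf:name triple per row."""
    return [(f"ex:Person/p{r['id']}", "foaf:name", r["name"]) for r in rows]


# Materialised RDF: the mapping is run once and the result is stored.
materialised = to_triples(source_rows)


def virtual():
    """Virtual RDF: the mapping is applied to the source each time it is queried."""
    return to_triples(source_rows)


# The source is updated after materialisation.
source_rows.append({"id": 2, "name": "Bob"})

stale_count = len(materialised)  # still reflects the old state of the source
fresh_count = len(virtual())     # reflects the current state of the source
```

The materialised copy answers quickly but has to be refreshed, like a data warehouse; the virtual one is always up to date but pays the mapping cost at query time.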
