Data Integration
OVERVIEW
Company IS
• Variability in technology
• Different departments use different technologies: customer service with an Access DB, HR with Excel files, production with Oracle…
CONSEQUENCES OF DATA SILOS
• Combine several disparate data sources to provide a unified view of the data
(Figure: production, sales, HR, marketing and customer service data feed an integration step that provides a unified view for analysis, reporting and data mining.)
DATA INTEGRATION PROCESS
• Identify sources (technology, schema)
• Create an integrated schema: match data, schemas
• Extract, transform, load data: centralized or distributed storage
• Query the data according to the integrated schema…
DATA INTEGRATION APPROACHES
• Federated databases
• DBMSs coordinated to collaborate
• Data migration
• Concerns merging two systems (e.g. company acquisition or merger). Move data from one system to the other, or from both to a new system
• Master data management
• Integrate only non-transactional data (customer, client, product, account…)
• Data warehousing
• Virtual data integration
• Semantic data integration
DATA WAREHOUSE ARCHITECTURE
(Figure: production, sales, HR, marketing and customer service data, each with its own local schema, are loaded by ETL tools into a data warehouse with a global schema; data marts serve analysis, reporting and mining.)
EXAMPLE DATA WAREHOUSE SOLUTIONS
• Star schema
• A star-shaped schema linking a fact table to its dimensions
• Most popular schema
• Easy to interpret and report on with small datasets
• Usually used for data marts
• Inappropriate for a whole warehouse, as the schema has to change when reporting requirements change
• Snowflake schema
• Extension of the star schema
• Dimensions are stored in multiple tables
• A dimension has multiple branches, giving the snowflake shape
• Separating dimensions allows finer-grained analysis (for example, by unit of time)
EXAMPLES
(Figures: an example star schema and an example snowflake schema. Source: datawarehouseinfo.com)
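As a rough illustration of a star schema, the sketch below builds one fact table linked to two dimension tables in an in-memory SQLite database and runs a simple report. All table and column names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical and are not taken from the figures above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (hypothetical names and columns)
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table at the centre of the star: one foreign key per dimension
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_date VALUES (10, 2023, 12), (11, 2024, 1);
    INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0);
""")


def sales_by_year(connection):
    """Report total sales per year by joining the fact table to the date dimension."""
    return connection.execute("""
        SELECT d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY d.year
        ORDER BY d.year
    """).fetchall()


print(sales_by_year(conn))
```

In a snowflake variant of this sketch, dim_date could itself be split into further tables (for example a separate table per unit of time), at the cost of extra joins.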
EXTRACT, TRANSFORM, LOAD
• Load: write the transformed data to the target database according to its schema
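A minimal sketch of the three ETL steps, assuming a single CSV export as the source and an in-memory SQLite database as the target. The source content, the column names and the employee table are hypothetical; the cleaning rules are only one possible transformation.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export from one department
SOURCE_CSV = "emp_name,salary\n alice ,1000\nBOB,1500\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and normalise names, cast salaries to numbers."""
    return [(r["emp_name"].strip().title(), float(r["salary"])) for r in rows]


def load(connection, records):
    """Load: write the transformed data to the target according to its schema."""
    connection.execute("CREATE TABLE IF NOT EXISTS employee (name TEXT, salary REAL)")
    connection.executemany("INSERT INTO employee VALUES (?, ?)", records)
    connection.commit()


target = sqlite3.connect(":memory:")
load(target, transform(extract(SOURCE_CSV)))
loaded = target.execute("SELECT name, salary FROM employee ORDER BY name").fetchall()
print(loaded)
```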
EXAMPLE ETL TOOLS
(Figure: virtual data integration architecture with a user query, the global schema, source mappings, a query plan generator and an execution engine.)
• The user would like to know which students enrolled on the course 'F21BD' in 2024
• SELECT student FROM course WHERE title = 'F21BD' AND year = '2024'
• A connector (wrapper) wraps a data source so that it can communicate with the mediator
• The data source is exposed as a convenient database with the schema the mediator expects
• It exports data from the source as requested by the mediator
• The presented schema can be different from the actual one
• The data can be the result of transforming the real data at the source
• The same source can have different wrappers for different virtual systems
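The wrapper idea can be sketched as a small class that hides the source's actual schema and presents rows in the schema the mediator expects. The class name EnrolmentWrapper, the source field names (crs, yr, stud) and the sample rows are hypothetical and chosen only for illustration.

```python
class EnrolmentWrapper:
    """Hypothetical wrapper exposing a source in the schema expected by the mediator."""

    def __init__(self, source_rows):
        # Actual source schema (assumed): keys "crs", "yr", "stud"
        self._rows = source_rows

    def scan(self, **filters):
        """Export rows in the presented schema, keeping those that match the filters."""
        for row in self._rows:
            # Transformation at the source: rename fields, present the year as text
            presented = {
                "title": row["crs"],
                "year": str(row["yr"]),
                "student": row["stud"],
            }
            if all(presented.get(key) == value for key, value in filters.items()):
                yield presented


source = [
    {"crs": "F21BD", "yr": 2024, "stud": "Alice"},
    {"crs": "F21BD", "yr": 2023, "stud": "Bob"},
]
wrapper = EnrolmentWrapper(source)
students = [r["student"] for r in wrapper.scan(title="F21BD", year="2024")]
print(students)
```

A second wrapper over the same source could present a different schema to another virtual system without changing the source itself.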
MAPPINGS
• Provide the mediator with information about the data available at the sources and how it relates to the global schema
• This information is the basis for generating the query plans
• Descriptions are provided as logical formulas (SQL queries, views…)
• Logical formulas relate local schemas to the global (mediated) schema
• The set of formulas define the mappings between local schemas and global schema
MAPPING APPROACHES
• Global-as-view (GAV): older approach (named in the 1990s)
• Relations in the global schema are described as views over the local schemas
• Local schemas contain more detail than the global schema
• Local-as-view (LAV)
• Relations in the local schemas are described as views over the global schema
• The global schema may contain more detail than the local schemas
• Global-and-Local-as-Views (GLAV): 2000s
• Mappings are expressed as correspondences between views over the global schema and views over the local schemas
• GAV and LAV can be derived as special cases of GLAV
(Figures: for each approach, a global schema related to three local schemas.)
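To make the GAV idea concrete, the sketch below defines a global relation as a SQL view over two local relations. For simplicity both local schemas live in one in-memory SQLite database, whereas in a real virtual integration system they would sit in separate sources behind wrappers. The table names (src1_enrolment, src2_registration), their columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two local schemas (hypothetical), each standing in for a separate source
    CREATE TABLE src1_enrolment (student TEXT, course_code TEXT, yr INTEGER);
    CREATE TABLE src2_registration (learner TEXT, module TEXT, session INTEGER);
    INSERT INTO src1_enrolment VALUES ('Alice', 'F21BD', 2024);
    INSERT INTO src2_registration VALUES ('Bob', 'F21BD', 2024), ('Carol', 'F21BD', 2023);

    -- GAV mapping: the global relation is a view over the local relations
    CREATE VIEW course AS
        SELECT student, course_code AS title, yr AS year FROM src1_enrolment
        UNION ALL
        SELECT learner, module, session FROM src2_registration;
""")


def students_on(connection, title, year):
    """Answer a query posed against the global schema only."""
    rows = connection.execute(
        "SELECT student FROM course WHERE title = ? AND year = ? ORDER BY student",
        (title, year),
    ).fetchall()
    return [row[0] for row in rows]


print(students_on(conn, "F21BD", 2024))
```

With GAV, answering a global query amounts to unfolding the view definition; adding a new source means editing the view, which is one reason LAV and GLAV were proposed.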
ILLUSTRATION OF MAPPING APPROACHES
DIRECT RELATIONAL TO RDF MAPPING
• Automated mapping
• Data from a relational database is mapped directly to RDF / OWL
• The resulting graph is poor due to inherent limitations of relational models
• No links to existing ontologies
• Recommended reading: Pinkel, C., Binnig, C., Jiménez-Ruiz, E., Kharlamov, E., May, W., Nikolov,
A., Sasa Bastinos, A., Skjæveland, M.G., Solimando, A., Taheriyan, M. and Heupel, C., 2018. RODI:
Benchmarking relational-to-ontology mapping generation quality. Semantic Web, 9(1), pp.25-52.
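A simplified sketch of an automated direct mapping: each row becomes a subject IRI built from the table name and primary key, each column becomes a predicate, and each value becomes a plain literal. The base IRI, the PERSON table and its row are hypothetical, and the sketch leaves out datatypes, foreign keys and literal escaping, so it is an approximation of the idea rather than a complete implementation of the W3C Direct Mapping.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def direct_map(base, table, pk, rows):
    """Turn table rows into N-Triples-like strings (simplified direct mapping)."""
    triples = []
    for row in rows:
        # Row IRI derived mechanically from table name and primary key value
        subject = f"<{base}{table}/{pk}={row[pk]}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{base}{table}> .")
        for column, value in row.items():
            if value is None:
                continue  # NULLs produce no triple
            # Predicate IRI derived mechanically from table and column names
            triples.append(f'{subject} <{base}{table}#{column}> "{value}" .')
    return triples


triples = direct_map("http://example.org/", "PERSON", "ID", [{"ID": 1, "NAME": "Bob"}])
for triple in triples:
    print(triple)
```

Every IRI here is generated from table and column names, which illustrates why the resulting graph carries no links to existing ontologies.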
R2RML MAPPING
R2RML: RDB to RDF Mapping Language
• Relational database and domain ontologies are input to the mapping process
• RDF graph representing database content with terms from the ontologies
• Optimal graph representation (depending on ontology quality)
• Result linked to existing ontologies, support reuse
• Requires manual building of mappings and ontology alignment
OVERVIEW OF R2RML
https://www.w3.org/TR/r2rml/
R2RML MAPPING FROM TABLE: EXAMPLE
rr:predicateObjectMap [
    rr:predicate foaf:name ;
    rr:objectMap [ rr:column "NAME" ]
] .
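The effect of this predicate-object map can be imitated in a few lines of Python: for each row, the value of the NAME column becomes the object of a foaf:name triple. The fragment above does not show the subject map, so the subject template used here (http://example.org/person/{ID}) and the sample row are assumptions made purely for illustration; this is not an R2RML processor.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def apply_mapping(rows, subject_template, predicate, column):
    """Produce (subject, predicate, object) triples for one predicate-object map."""
    triples = []
    for row in rows:
        # Assumed subject template: placeholders in braces are filled from the row
        subject = subject_template.format(**row)
        # The object is the value of the mapped column, as in rr:column "NAME"
        triples.append((subject, predicate, row[column]))
    return triples


rows = [{"ID": 1, "NAME": "Bob"}]
result = apply_mapping(rows, "http://example.org/person/{ID}", FOAF_NAME, "NAME")
print(result)
```

Unlike a direct mapping, the predicate here comes from an existing vocabulary (FOAF), which is what links the result to existing ontologies.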
R2RML MAPPING RESULT
Bob
RML VOCABULARY
Extension of R2RML to other sources (Excel, NoSQL…)
RML VS. R2RML
RDF MATERIALISATION