
The BigIaS Platform

Simplifying Big Data Integration


‐ A Software‐as‐a‐Service Approach –
~ Preliminary Analysis and Design ~

September 5th, 2013


Sogndal, Norway

Dumitru Roman
Claudia Daniela Pop
Roxana Ioana Roman
Bjørn Magnus Mathisen

Contact: [email protected]

Context: Big Data and Our Primary Focus
• Addresses things that can be done at a large scale but cannot be
done at a smaller one
o Extract new insights or create new forms of value in ways that change stakeholders
and relationships between them
• Causality vs. correlations: not knowing why but only what
o Challenges the basic traditional understanding of how to make decisions

[Figure labels: Heterogeneity, Change]

Overview
• The problem
o Data integration - a complex, unsolved problem
o Tools addressing various aspects of the data integration process can hardly be used
together for more complex, interesting integration tasks
=> Data integration at large scale is costly, complicated, and time-consuming

• The goal: Simplify data integration at large scale!


o Enable users with limited technical data integration skills to get from raw data to
insightful data with minimal effort

• The approach
o Semantic-based data integration
o Flexible and customizable workflows of data integration tools (application
integration)
=> A Software-as-a-Service for data integration at large scale

The BigIaS Platform
• UX: Portal and widgets
• Platform Layer
o Services: data operations, entity extraction, data crawling, data linking, visualization, streaming
o Tools: Talend Data Integration, Open Refine, UIMA/Gate, Karma, Tabula, Silk, Map4Rdf, LodLive, Talend ESB, Grok/NUPIC, Trident-ML, C-SPARQL, LdSpider, other
• Data Layer
o Services: storage services, streaming services
o Tools: Sesame, OWLIM, Streaming
• Data Sets: 3rd party (three shown), Open Data, LOD
Evaluated tools/approaches
1. Application Integration
o Talend ESB
2. Data Processing
o Talend Data Integration
o Tabula
o Karma
o Open Refine
o UIMA
o GATE
o Silk
o LdSpider
3. Storage
o Sesame
o OWLIM
4. Visualization
o Map4Rdf
o LodLive
5. Real-time Machine Learning
o Trident-ML
o Grok/NUPIC
6. Streaming
o C-SPARQL

Proposed Integration Workflows

1. Data Contextualization
2. Entity Discovery
3. Data Linking
4. Data Visualization
5. Real-time Machine Learning
6. RDF Streaming

Data Contextualization
• Inputs: PDF, Excel, …, relational DB
• External data: Excel, CSV, XML, etc.
• Workflow steps
o X2CSV (Tabula for PDF, Open Refine, Karma)
o Data Cleaning (Open Refine)
o Integration with external data (Karma, Open Refine)
o Export RDF (Karma, Open Refine), using an OWL file*
o Store RDF (Sesame + OWLIM), exposed through a SPARQL endpoint
• Supported inputs per tool
o Open Refine: files (TSV, CSV, *SV, Excel .xls and .xlsx, JSON, XML, RDF as XML, Google Docs); Web Services
o Karma: databases (MySQL, SQL Server, Oracle, PostGIS); files (Excel, CSV, XML, JSON, KML); Web Services
o Tabula: PDF

* OWL file can be imported or configured
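
The "Export RDF" step of this workflow can be illustrated with a minimal sketch. It is not how Karma or Open Refine work internally; it only shows the idea of turning tabular rows into RDF triples (N-Triples syntax). The `http://example.org/` namespace, the `resource/` and `property/` path segments, and the sample data are assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not taken from the slides


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_ntriples(text: str, id_column: str, base: str = BASE) -> list[str]:
    """Emit one triple per non-empty cell, using id_column to build the subject."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}resource/{quote(row[id_column])}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # skip the key column and empty cells
            predicate = f"<{base}property/{quote(column)}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


sample = "id,name,city\n1,Sogndal Station,Sogndal\n2,Oslo S,\n"
ntriples = csv_to_ntriples(sample, "id")
for line in ntriples:
    print(line)
```

In a real deployment the predicates would come from the imported or configured OWL file rather than from column names.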
Entity Discovery
• Inputs: CSV, Excel
• Workflow steps
o Entity Extraction (UIMA, Gate)
o Entity Contextualization (Open Refine Reconcile Plug-In)
o Export RDF (Karma, Open Refine), using an OWL file*
o Store RDF (Sesame + OWLIM), exposed through a SPARQL endpoint
• Reconcile against
o SPARQL endpoint
o RDF file
o Sindice service (Twitter, Freebase, DBpedia, and other datasets relevant for the data)
• Supported inputs per tool
o Gate: text files
o UIMA: text, audio, video

* OWL file can be imported or configured
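
The extraction and reconciliation steps can be sketched as follows. This is a toy stand-in, not UIMA, Gate, or the Open Refine reconcile plug-in: candidates are found with a naive regular expression, and the `KNOWN_ENTITIES` dictionary is a hypothetical replacement for a SPARQL endpoint or RDF file.

```python
import re

# Hypothetical reference data standing in for a SPARQL endpoint or RDF file.
KNOWN_ENTITIES = {
    "sogndal": "http://example.org/place/Sogndal",
    "oslo": "http://example.org/place/Oslo",
}

# Naive candidate pattern: one or more consecutive capitalized ASCII words.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


def extract_candidates(text):
    """Return capitalized word runs as entity candidates."""
    return CANDIDATE.findall(text)


def reconcile(candidates, reference=KNOWN_ENTITIES):
    """Map each candidate to a reference URI, or None when nothing matches."""
    return {name: reference.get(name.lower()) for name in candidates}


text = "The workshop in Sogndal drew visitors from Oslo and Bergen."
candidates = extract_candidates(text)
matches = reconcile(candidates)
for name, uri in matches.items():
    print(name, "->", uri)
```

Unmatched candidates (here "The" and "Bergen") map to `None`, which is where a real reconciliation service would offer ranked suggestions instead.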
Data Linking
• Inputs: CSV, Excel, …, relational DB
• Workflow steps
o X2RDF (Open Refine, Karma)
o Store RDF (Sesame + OWLIM), exposed through a SPARQL endpoint
o Link Discovery (Silk)
• The diagram shows two Store RDF (Sesame + OWLIM) components, each with its own SPARQL endpoint, around the Link Discovery step
• Supported inputs per tool
o Open Refine: files (TSV, CSV, *SV, Excel .xls and .xlsx, JSON, XML, RDF as XML, Google Docs); Web Services
o Karma: databases (MySQL, SQL Server, Oracle, PostGIS); files (Excel, CSV, XML, JSON, KML); Web Services
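
Link discovery between two datasets can be illustrated with a small sketch. It does not use Silk or Silk-LSL; it swaps in a plain string-similarity comparison from the Python standard library and emits `owl:sameAs` triples above a threshold. The URIs, labels, and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def similarity(a, b):
    """Case-insensitive similarity ratio between two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.85):
    """For each source entity, link to the best-scoring target above threshold."""
    links = []
    for s_uri, s_label in source.items():
        best_uri, best_score = None, 0.0
        for t_uri, t_label in target.items():
            score = similarity(s_label, t_label)
            if score > best_score:
                best_uri, best_score = t_uri, score
        if best_uri is not None and best_score >= threshold:
            links.append(f"<{s_uri}> <{OWL_SAME_AS}> <{best_uri}> .")
    return links


# Hypothetical label maps, as they might be fetched from two SPARQL endpoints.
source = {"http://example.org/a/1": "Sogndal", "http://example.org/a/2": "Trondheim"}
target = {"http://example.org/b/x": "sogndal", "http://example.org/b/y": "Bergen"}
links = discover_links(source, target)
for link in links:
    print(link)
```

The nested loop compares every pair, which is acceptable for a sketch but would need blocking or indexing at the scale the platform targets.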

Data Visualization
• Inputs: CSV, Excel, relational DB
• Workflow steps
o Data Contextualization, Entity Discovery, Data Linking
o SPARQL endpoint
o Data Visualization (Map4RDF, LodLive)
o GUI

Real‐time Machine Learning
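• Inputs: CSV, Excel, relational DB
• From stored data
o Data Contextualization, Entity Discovery, Data Linking
o SPARQL endpoint
o Real-time Machine Learning (GROK/NUPIC, Trident-ML)
• From streams
o Streaming Service (DB, e.g. JSON)
o Real Time RDFization Service
o Real-time Machine Learning (GROK/NUPIC, Trident-ML)
• Outputs
o Patterns
o C-Sparql Query Generator, producing C-Sparql queries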
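
The step from learned patterns to C-Sparql queries can be sketched as a simple template fill. The `Pattern` class and its fields are hypothetical, since the slides do not say what a pattern looks like. The query text follows the `REGISTER QUERY ... AS` and `FROM STREAM <...> [RANGE ... STEP ...]` shape used in published C-SPARQL examples and should be checked against the engine's documentation before use.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical output of the learning step: a property and a threshold."""
    name: str
    stream_iri: str
    property_iri: str
    threshold: float
    range_spec: str = "30s"  # window width
    step_spec: str = "10s"   # how often the window slides


def to_csparql(p: Pattern) -> str:
    """Render a continuous query that reports readings above the threshold."""
    return (
        f"REGISTER QUERY {p.name} AS\n"
        f"SELECT ?sensor ?value\n"
        f"FROM STREAM <{p.stream_iri}> [RANGE {p.range_spec} STEP {p.step_spec}]\n"
        f"WHERE {{ ?sensor <{p.property_iri}> ?value . FILTER(?value > {p.threshold}) }}"
    )


pattern = Pattern(
    name="HighNO2",
    stream_iri="http://example.org/stream/air",
    property_iri="http://example.org/property/no2",
    threshold=40.0,
)
query = to_csparql(pattern)
print(query)
```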

RDF Streaming
• Workflow steps
o Streaming Service (DB, e.g. JSON)
o Real Time RDFization Service
o Streaming Engine (C-Sparql), driven by C-Sparql queries
o RDF output
o Real-time Machine Learning (GROK/NUPIC, Trident-ML)
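
The RDFization and windowing ideas in this workflow can be sketched in a few lines. This is not the C-Sparql engine: it converts JSON events into timestamped triples and keeps a time-based sliding window in memory, which is roughly what a `RANGE` window does conceptually. The event fields (`ts`, `sensor`, `property`, `value`) and the `http://example.org/` namespace are assumptions, and events are assumed to arrive in timestamp order.

```python
import json
from collections import deque

BASE = "http://example.org/"  # hypothetical namespace


def rdfize(event_json):
    """Turn one JSON reading into (timestamp, subject, predicate, value)."""
    event = json.loads(event_json)
    subject = f"{BASE}sensor/{event['sensor']}"
    predicate = f"{BASE}property/{event['property']}"
    return (event["ts"], subject, predicate, event["value"])


class SlidingWindow:
    """Keep only the triples from the last range_seconds of the stream."""

    def __init__(self, range_seconds):
        self.range_seconds = range_seconds
        self.items = deque()

    def push(self, quad):
        self.items.append(quad)
        newest = quad[0]
        # Evict everything that has fallen out of the window.
        while self.items and self.items[0][0] <= newest - self.range_seconds:
            self.items.popleft()

    def average(self, predicate):
        values = [v for _, _, p, v in self.items if p == predicate]
        return sum(values) / len(values) if values else None


events = [
    '{"ts": 0, "sensor": "s1", "property": "no2", "value": 10.0}',
    '{"ts": 20, "sensor": "s1", "property": "no2", "value": 20.0}',
    '{"ts": 40, "sensor": "s1", "property": "no2", "value": 30.0}',
]
window = SlidingWindow(range_seconds=30)
for raw in events:
    window.push(rdfize(raw))
avg = window.average(BASE + "property/no2")
print(avg)
```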

Core Technological Prerequisites

Tool          Prerequisites
Open Refine   GREL Functions; GREL Examples
Karma         Create ontologies; Ontologies from Karma Web Interface
UIMA          Regular Expressions
Silk          Silk Link Specification Language (Silk-LSL)
C-Sparql      C-Sparql language

Relevant upcoming research projects
(currently under negotiations)
• Data Publishing through the Cloud: A Data- and Platform-as-a-
Service Approach for Efficient Data Publication and
Consumption (DaPaaS)
o The DaPaaS project aims to deliver an integrated DaaS and PaaS environment for
open data (the DaPaaS platform), together with supporting activities for effective
and efficient publication and consumption of data and creation of applications
using the data.
o Expected to start Nov 2013
o Budget ~2.1M € (~1.5M €) for 2 years (EC funded)

Relevant upcoming research projects
(currently under negotiations – cont’)
• ProaSense – The Proactive Sensing Enterprise
o The goal is to provide a highly scalable, distributed architecture for managing
and processing big data that enables continuous monitoring of the need for
service adaptation and proposes corresponding changes in a (semi-)automatic
way.
o Expected to start Nov 2013
o Budget ~4.2M € (~3.2M €) for 3 years (EC funded)

Relevant upcoming research projects
(currently under negotiations – cont’)
• SmartOpenData – Open Linked Data for
environment protection in Smart Regions
o SmartOpenData aims to define mechanisms for acquiring, adapting and
using Open Data provided by existing sources for environment protection
in European protected areas
o Expected to start Nov 2013
o Budget ~3.4M € (~2.5M €) for 2 years (EC funded)

Relevant upcoming research projects
(currently under negotiations – cont’)
• INFRARISK— Novel Indicators for identifying critical
INFRAstructure at RISK from natural Hazards
o Develop reliable stress tests on European critical infrastructure using
integrated modelling tools for decision support. This will lead to higher
resilience of infrastructure networks to rare, low-probability extreme
events, known as “black swans”.
o Expected to start Oct 2013
o Budget ~3.6M € (~2.8M €) for 2 years (EC funded)

Relevant ongoing research projects

• BigFut – Analyzing Big Data: Preparing for the Future of
Intelligent Information Management
o SINTEF Internal project
o Goals:
• Analyze and integrate a suite of technological approaches and techniques
• Advance demo/prototype implementation to penetrate the market in the short term
o Jan-Dec 2013

Relevant ongoing research projects
(cont’)
• CITI-SENSE – develop “Citizen’s Observatories” to empower
citizens to:
o Contribute to and participate in environmental governance
o Support and influence community and policy priorities and associated decision
making
o Contribute to Global Earth Observation System of Systems (GEOSS)
• 27 partners (EC funded)
http://www.citi-sense.eu/

Summary and Outlook
• What’s new here
o The platform itself, implementing flexible data integration workflows
o A set of components (e.g. Real-time RDFization of streams, C-Sparql Query
Generator)

• What’s challenging
o Application integration
o Consistent scalability throughout workflows
o Platform deployment on cloud environments
o Use of new, unproven technologies
o …and many others

• Short-term plan (end of September)


o Get the first prototype implementing the proposed workflows
o Experiment with some data / simple use cases (e.g. CITI-SENSE data)

Thank you!
Q&A
