Web Data Integration Summary

The document summarizes different types of structured data available on the web, including data portals, web APIs, linked data, and HTML-embedded data. It also discusses standards and principles for fair and interoperable data, including FAIR data principles. Common data exchange formats are described, such as CSV, XML, JSON, and RDF. Challenges and methods for schema mapping, integration, and matching between different data sources are outlined.

Uploaded by

paul.huber1999


Web Data Integration Summary

Data Integration Process

Types of Structured Data on the Web

1. Data Portals
• collect and host datasets
• collect and generate metadata describing the datasets
• provide data search and exploration
• provide free or payment-based access to data

Types of Shared Data: public sector, research, commercial

FAIR Data Principles

• Findable
o (Meta)data are assigned a globally unique identifier
o Data are described with rich metadata
o (Meta)data are registered or indexed in a searchable resource
• Accessible
o (Meta)data are retrievable by their identifier using a standardized communications protocol
o Metadata are accessible, even when the data are no longer available
• Interoperable
o (Meta)data use a formal, broadly applicable language for knowledge representation
o (Meta)data use vocabularies that follow FAIR principles
o (Meta)data include qualified references to other (meta)data
• Reusable
o (Meta)data are released with a clear data usage license
o (Meta)data are associated with detailed provenance
o (Meta)data meet domain-relevant community standards

2. Web APIs
Platforms that enable users to share information (e.g., Facebook) are only partly accessible from the Web, typically through Web APIs.
Web APIs slice the Web into data silos:
--- 1. Not indexable by generic web crawlers
--- 2. No automatic discovery of additional data sources
--- 3. No single global data space

3. Linked Data
+++ Entities are identified with HTTP URIs, which take the role of global primary keys. URIs can be looked up on the Web, which makes it possible to discover new data sources and to navigate the global data graph.

4. HTML-embedded Data
+++
1. Web pages traditionally contain structured data in the form of HTML tables as well as template data
2. More and more websites semantically mark up the content of their HTML pages using standardized markup formats, such as RDFa

Data Exchange Formats

• Data Exchange: Transfer of data from one system to another
• Data Exchange Format: Format used to represent (encode) the transferred data
• Web data is heterogeneous with respect to the employed
o 1. Data Exchange Format (technical heterogeneity, e.g., XML, JSON, CSV)
o 2. Character Encoding (syntactical heterogeneity, e.g., ASCII, Unicode)

Character Encoding
A character encoding maps “real” characters to bit sequences. Mismatched encodings are a common problem in data integration.

UTF-8: the most common encoding on the Web. It covers all Unicode characters, including Asian scripts. Common (ASCII) characters are encoded using only one byte; all other characters are encoded in 2 to 4 bytes.
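The variable-length nature of UTF-8 can be checked directly. The following is a minimal sketch; the sample characters and the city name are illustrative choices, not taken from the lecture.

```python
# Number of bytes each sample character needs in UTF-8
samples = {
    "a": "\u0061",          # ASCII letter
    "e-acute": "\u00e9",    # Latin letter with accent
    "zhong": "\u4e2d",      # CJK ideograph
    "emoji": "\U0001F600",  # character outside the Basic Multilingual Plane
}
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_lengths)

# Decoding with the wrong encoding does not fail, it silently garbles the text
raw = "Z\u00fcrich".encode("utf-8")
garbled = raw.decode("latin-1")
print(garbled)
```

The second part illustrates why encodings matter for integration: the bytes are valid in both encodings, so the error only shows up as wrong characters in the data.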

Comma Separated Values (CSV)

• Data model is a table
• Pro: data representation with minimal overhead
• Cons
o restricted to tabular data
o hard for humans to read when tables get wider
o many variations (e.g., delimiters, quoting, header rows), no support for data types
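The two main drawbacks (format variations and missing data types) show up as soon as a file is parsed. This sketch uses Python's standard `csv` module; the semicolon delimiter and the movie rows are made-up sample data.

```python
import csv
import io

raw = "title;year;rating\nAlien;1979;8.5\nHeat;1995;8.3\n"

# The delimiter is not standardized; this sample uses a semicolon
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Every value arrives as a string, so data types must be restored by hand
typed = [
    {"title": r["title"], "year": int(r["year"]), "rating": float(r["rating"])}
    for r in rows
]
print(rows[0])
print(typed[0])
```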

Extensible Markup Language (XML)

• Widely used format for data exchange on the Web and in enterprise contexts
• Data model: tree
• Is a meta-language: defines a standard syntax and allows the definition of specific languages (XML applications)
• HTML versus XML
o HTML: aimed at displaying information to humans; mixes structure, content, and presentation
o XML: aimed at data exchange; separates structure, content, and presentation
• Well-formed XML documents
o 1. Closing tag for each opening tag
o 2. Proper nesting of tags
o 3. Only one attribute with a specific name per element
• XML references
o Trees are limited when it comes to n:m relations
▪ Problem: data duplication, which affects consistency, storage, and transmission volume
o Solution: IDs and references
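The well-formedness rules and the ID/reference idea can be tried out with Python's standard `xml.etree.ElementTree` parser. This is a sketch under assumptions: the element and attribute names (`db`, `actor`, `cast`, `ref`) are hypothetical, and the references are resolved by hand rather than through a DTD-declared ID/IDREF mechanism.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = '<movies><movie id="m1"><title>Alien</title></movie></movies>'
bad_nesting = "<movies><movie><title>Alien</movie></title></movies>"
duplicate_attribute = '<movie id="m1" id="m2"/>'

# An actor who appears in several movies is stored once and referenced by ID
doc = ET.fromstring(
    "<db>"
    '<actor id="a1"><name>Sigourney Weaver</name></actor>'
    '<movie title="Alien"><cast ref="a1"/></movie>'
    '<movie title="Aliens"><cast ref="a1"/></movie>'
    "</db>"
)
actors = {a.get("id"): a.findtext("name") for a in doc.iter("actor")}
cast = {m.get("title"): actors[m.find("cast").get("ref")] for m in doc.iter("movie")}
print(cast)
```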

Document Type Definition (DTD)
• Defines the valid content structure of an XML document
o allowed elements, attributes, child elements, optional elements
o allowed order of elements
• DTDs can be used to validate an XML document from the Web before it is processed further

XPath
See slides 33 to 44.

JavaScript Object Notation (JSON)

• A lightweight data exchange format that uses JavaScript syntax
o less verbose alternative to XML
o widely adopted
– by Web APIs as data exchange format
– for embedding structured data in the HEAD section of HTML pages
• Basics:
o objects are noted as in JavaScript
o objects are enclosed in curly brackets {…}
o data is organized in key-value pairs separated by colons { key : value }
• Example: { "firstname" : "John" , "lastname" : "Smith" , "age" : 46 }
• JSON is a lot like XML:
o data model: tree
o opening/closing tags/brackets
• Differences
o more compact notation compared to XML
o no id/idref – JSON data is strictly tree shaped
o fewer data types (atomic values are only string, number, Boolean, and null; no date type, for example)
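The example object above can be parsed with Python's standard `json` module, which also makes the small set of data types visible. The second document (`phones`, `spouse`) is a made-up illustration of nesting and null values.

```python
import json

text = '{"firstname": "John", "lastname": "Smith", "age": 46}'
person = json.loads(text)

# JSON strings and numbers map to Python str and int/float
types = {key: type(value).__name__ for key, value in person.items()}
print(types)

# Arrays and null are the remaining building blocks; there is no date type
nested = json.loads('{"name": "Smith", "phones": ["123", "456"], "spouse": null}')
print(nested["phones"], nested["spouse"])

# Serializing back yields a compact text representation
roundtrip = json.dumps(person, sort_keys=True)
print(roundtrip)
```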

Resource Description Framework (RDF)

Graph data model designed for sharing data on the Web

• Applications:
o annotation of Web pages (RDFa, JSON-LD)
o publication of data on the Web (Linked Data)
o exchange of graph data between applications
• View 1: Sentences of the form subject-predicate-object, called triples (e.g., “Chris works at Uni of Ma”)
• View 2: Labeled directed graph
• Resources
o everything (a person, a place, a web page, …) is a resource
o are identified by URI references
o may have one or more types (e.g., foaf:Person)
• Literals
o are data values, e.g., strings or integers
o may only be objects, not subjects of triples
o may have a data type or a language tag
• Predicates (Properties)
o connect resources to other resources
o connect resources to literals
• 1. RDF Data Model
• 2. RDF Syntaxes
• 3. RDF Schema
• 4. SPARQL Query Language
• 5. RDF in Java
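The triple view of RDF can be sketched without any RDF library by storing triples as plain tuples. This is only an illustration of the data model, not a real RDF implementation: the `http://example.org/` namespace and the `worksAt` property are hypothetical, while `rdf:type`, `foaf:Person`, and `foaf:name` are existing vocabulary terms.

```python
from collections import namedtuple

# A literal is a data value with an optional datatype; resources are URI strings
Literal = namedtuple("Literal", ["value", "datatype"])

EX = "http://example.org/"  # hypothetical namespace for this sketch
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

triples = {
    (EX + "chris", RDF_TYPE, FOAF + "Person"),
    (EX + "chris", EX + "worksAt", EX + "uni"),
    (EX + "chris", FOAF + "name", Literal("Chris", "xsd:string")),
}

def objects(subject, predicate):
    """Follow one labeled edge of the graph starting at the subject."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def is_valid(triple):
    """Literals may only appear in the object position."""
    s, p, _ = triple
    return not isinstance(s, Literal) and not isinstance(p, Literal)

print(objects(EX + "chris", EX + "worksAt"))
```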

Schema Mapping and Data Translation

1. Two Basic Integration Situations
Schema Mapping

• Goal: Translate data from a set of source schemata into a given target schema
• Top-down integration situation
• Triggered by a concrete information need

Schema Integration

• Goal: Create a new integrated schema that can represent all data from a given set of source schemata
• Bottom-up integration situation
• Triggered by the goal of fulfilling different information needs based on data from all sources (without discarding information)

2. Types of Correspondences
• A correspondence relates a set of elements in a schema S to a set of elements in a schema T
• Mapping = set of all correspondences that relate S and T
• Schema Matching: automatically or semi-automatically discover correspondences between schemata
• Types of correspondences
o One-to-One: Movie.title → Item.name
o One-to-Many: Person.Name → split() → FirstName (token 1), Surname (token 2)
o Many-to-One: Product.basePrice * (1 + Location.taxRate) → Item.price
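The three correspondence types can be read as small translation functions. The sketch below mirrors the examples from the list; the dictionary-based records, the rounding to two decimals, and the assumption that a name consists of a first name followed by a surname are simplifications made for illustration.

```python
def one_to_one(movie):
    """Movie.title -> Item.name"""
    return {"name": movie["title"]}

def one_to_many(person):
    """Person.Name -> FirstName (token 1), Surname (token 2)"""
    first_name, surname = person["Name"].split(" ", 1)
    return {"FirstName": first_name, "Surname": surname}

def many_to_one(product, location):
    """Product.basePrice * (1 + Location.taxRate) -> Item.price"""
    return {"price": round(product["basePrice"] * (1 + location["taxRate"]), 2)}

print(one_to_one({"title": "Alien"}))
print(one_to_many({"Name": "John Smith"}))
print(many_to_one({"basePrice": 100.0}, {"taxRate": 0.19}))
```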

3. Schema Integration
• Completeness: All elements of the source schemata should be covered
• Correctness: All data should be represented in a semantically correct way
• Minimality: The integrated schema should be minimal with respect to the number of relations and attributes
• Understandability: The integrated schema should be easy to understand

4. Data Translation
Query Generation Goal: Derive suitable data translation queries (or programs) from the
correspondences.

5. Schema Matching
Automatically or semi-automatically discover correspondences between schemata

5.1 Challenges to Finding Correspondences

• Large schemata: more than 100 tables and more than 1000 attributes
• Generic, automatically generated names, e.g., attribute1, attribute2, attribute3
• Missing documentation

5.2 Schema Matching Methods

• 1. Label-based Methods: Rely on the names of schema elements
o 1. Generate the cross product of all attributes (classes) from A and B
o 2. For each pair, calculate the similarity of the attribute labels using a similarity metric: Levenshtein (edit operations: insert, delete, replace), Jaccard, Soundex, etc.
o 3. The most similar pairs are the matches
o Problems:
▪ Semantic heterogeneity is not recognized (e.g., synonyms/homonyms)
▪ Problems with different naming conventions (e.g., abbreviations)
o Solution
▪ Preprocessing: Normalize labels (stop word removal, stemming, …) to prepare them for matching
▪ Matching: Employ similarity metrics that fit the specifics of the schemata
• 2. Instance-based Methods: Compare the actual data values
o determine correspondences between A and B by examining which attributes in A and B contain similar values → values often capture the semantics of an attribute better than its label
o 1. Attribute Recognizers
o 2. Value Overlap using Jaccard
o 3. Feature-based Methods
▪ Compare, e.g., attribute data type or average string length
▪ Discussion
▪ Require a decision on which features to use
▪ Require a decision on how to compare and combine values
▪ Similar attribute values do not always imply the same semantics
o 4. Duplicate-based Methods
▪ Check which attribute values closely match in each duplicate (= similar entry in two DBs)
▪ ++ Can correctly distinguish very similar attributes
▪ ++ Work well if duplicates are known or easy to find
▪ -- Do not work well if identity resolution is too noisy, e.g., for products with very similar names
• 3. Structure-based Methods: Exploit the structure of the schema
o Addresses this problem
o High similarity of neighboring attributes and/or of the relation name increases the similarity of an attribute pair
• 4. Combined Approaches: Use combinations of the above methods
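The label-based steps (cross product, label similarity, pick the most similar pairs) and the instance-based value overlap can be sketched as follows. The attribute names, the simple normalization (lowercasing and removing underscores), and the length-normalized Levenshtein similarity are assumptions chosen for this illustration.

```python
from itertools import product

def levenshtein(a, b):
    """Minimum number of insert/delete/replace operations to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def label_similarity(a, b):
    """Normalize both labels, then scale the edit distance into [0, 1]."""
    a, b = a.lower().replace("_", ""), b.lower().replace("_", "")
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def jaccard(values_a, values_b):
    """Value overlap of two attributes (instance-based matching)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0

schema_a = ["title", "release_year"]
schema_b = ["Titel", "ReleaseYear", "director"]

# Step 1 and 2: similarity for every pair in the cross product
matrix = {(a, b): label_similarity(a, b) for a, b in product(schema_a, schema_b)}
# Step 3: the most similar attribute of B for each attribute of A
best = {a: max(schema_b, key=lambda b: matrix[(a, b)]) for a in schema_a}
print(best)
```

Note that a synonym pair such as `title` and `name` would receive a low score here, which is exactly the semantic-heterogeneity problem listed above.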

5.3 Generating Correspondences from the Similarity Matrix
• Input: Matrix containing attribute similarities
• Output: Set of correspondences
• Local Single Attribute Strategies:
o Thresholding
▪ all attribute pairs with a similarity above a threshold are returned as correspondences
▪ a domain expert checks the correspondences afterwards and selects the right ones
o TopK
▪ give the domain expert the top K correspondences for each attribute
o Top1
▪ directly return the best match as correspondence
▪ very optimistic; errors might frustrate the domain expert
• Alternative: Global Matching
o Looking at the complete mapping (all correct correspondences between A and B) gives us the additional restriction that one attribute in A should only be matched to one attribute in B
o Find the optimal set of disjoint correspondences
• Alternative: Stable Marriage
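The strategies differ in what they return for the same similarity matrix. The sketch below uses a made-up 2x2 matrix; `greedy_global` is a simple highest-similarity-first heuristic, and `optimal_global` is a brute-force search that is only feasible for tiny schemata and assumes that A has no more attributes than B.

```python
from itertools import permutations

# Hypothetical similarity matrix between attributes of A and B
sim = {
    ("title", "name"): 0.9, ("title", "label"): 0.8,
    ("alias", "name"): 0.85, ("alias", "label"): 0.3,
}

def thresholding(sim, threshold):
    """Local strategy: every pair at or above the threshold is returned."""
    return {pair for pair, score in sim.items() if score >= threshold}

def top1(sim):
    """Local strategy: best partner per attribute of A, conflicts allowed."""
    best = {}
    for (a, b), score in sim.items():
        if a not in best or score > sim[(a, best[a])]:
            best[a] = b
    return set(best.items())

def greedy_global(sim):
    """One-to-one heuristic: always fix the remaining pair with the highest similarity."""
    used_a, used_b, result = set(), set(), set()
    for (a, b), _ in sorted(sim.items(), key=lambda kv: kv[1], reverse=True):
        if a not in used_a and b not in used_b:
            result.add((a, b))
            used_a.add(a)
            used_b.add(b)
    return result

def optimal_global(sim, attrs_a, attrs_b):
    """One-to-one assignment with the maximal sum of similarities (brute force)."""
    candidates = (set(zip(attrs_a, perm)) for perm in permutations(attrs_b, len(attrs_a)))
    return max(candidates, key=lambda pairs: sum(sim.get(p, 0.0) for p in pairs))

print(top1(sim))
print(greedy_global(sim))
print(optimal_global(sim, ["title", "alias"], ["name", "label"]))
```

On this matrix, Top1 maps both attributes of A to `name`, while the two global variants return disjoint correspondences but disagree with each other, because picking the single best pair first blocks the assignment with the higher total similarity.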

5.4 Finding Many-to-One and One-to-Many Correspondences (see slide 65)

5.5 Table Annotation
Goal: Annotate the columns (type and property) of tables in a large table corpus with concepts from a
knowledge graph or shared vocabulary

6. Schema Heterogeneity on the Web

Identity Resolution
• Goal: Find all records that refer to the same real-world entity.
• Challenge 1: Representations of the same real-world entity are not identical → fuzzy duplicates
o Solution: Entity Matching: compare multiple attributes using attribute-specific similarity measures, after value normalization
• Challenge 2: Quadratic runtime complexity
o Comparing every pair of records is too expensive for larger datasets
o Solution: Blocking methods → avoid unnecessary comparisons

Entity Matching
2.1 Linearly Weighted Matching Rules
Compute the similarity score between records x and y as a linearly weighted combination of individual attribute similarity scores:
sim(x, y) = Σ_i α_i · sim_i(x, y)
where sim_i(x, y) is the similarity of the i-th attribute and α_i is its pre-specified weight.
We declare x and y matched if sim(x, y) >= b for a pre-specified threshold b, and not matched otherwise.
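A linearly weighted matching rule takes only a few lines. In this sketch the attributes, the weights (0.7 and 0.3), the threshold (0.8), and the two attribute similarity functions are all assumptions picked for illustration.

```python
def exact(a, b):
    """Attribute similarity: 1.0 for identical values, else 0.0."""
    return 1.0 if a == b else 0.0

def year_similarity(a, b):
    """Attribute similarity: decreases linearly, reaching 0 at 10 years difference."""
    return max(0.0, 1 - abs(a - b) / 10)

def linear_match(x, y, measures, threshold):
    """Weighted sum of attribute similarities, compared against the threshold."""
    score = sum(weight * fn(x[attr], y[attr]) for attr, (weight, fn) in measures.items())
    return score, score >= threshold

# attribute -> (weight, similarity function); the weights sum to 1
measures = {"title": (0.7, exact), "year": (0.3, year_similarity)}

x = {"title": "alien", "year": 1979}
y = {"title": "alien", "year": 1980}
z = {"title": "aliens", "year": 1986}

score_xy, match_xy = linear_match(x, y, measures, threshold=0.8)
score_xz, match_xz = linear_match(x, z, measures, threshold=0.8)
print(score_xy, match_xy)
print(score_xz, match_xz)
```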

2.2 Non-Linear Matching Rules

Often better than linear rules but require specific domain knowledge.

Non-linear rules can be learned using tree-based learners

2.3 Data Gathering for Matching

Not only values of the records to be compared, but also values of related records are relevant for the
similarity computation, e.g., Movies look at Actors

2.4 Data Preprocessing for Matching

To enable similarity measures to compute reliable scores, the data needs to be normalized (spelling,
value formats, measurement units, abbreviations), then parsed (Extract attribute/value pairs from
title) and translated (e.g., into target language).

2.5 Local versus Global Matching

• Local Matching
o consider all pairs above the threshold as matches
o implies that one record can be matched with several other records
o makes sense for duplicate detection within a single data source
• Global Matching
o enforce the constraint that one record in data set A should only be matched to one record in data set B
o makes sense for data sources that do not contain duplicates
o Stable Marriage: always unite the pair with the highest similarity first

2.6 Cluster Records using Pairwise Correspondences

Goal: Create groups of records describing the same real-world entity
from pairwise correspondences.
Simple Approach: Connected Components, Smarter Approach: Correlation Clustering (…)
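The simple approach, connected components, can be sketched with a small union-find structure. The record identifiers and correspondences are made up. Because the grouping is transitive, a single wrong correspondence merges two clusters, which is one motivation for smarter approaches.

```python
def connected_components(records, correspondences):
    """Group records that are linked directly or transitively."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent links to the representative, shortening the path on the way
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in correspondences:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(clusters.values(), key=sorted)

records = ["a1", "a2", "b1", "b2", "c1"]
correspondences = [("a1", "b1"), ("b1", "c1"), ("a2", "b2")]
clusters = connected_components(records, correspondences)
print(clusters)
```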

Blocking
Since similarity is reflexive and symmetric, the number of comparisons can be reduced from n² to (n² - n)/2 even without blocking.
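The count can be verified by enumerating unordered pairs of distinct records. A minimal check with five placeholder records:

```python
from itertools import combinations

records = ["r1", "r2", "r3", "r4", "r5"]
n = len(records)

# Naive approach: every record against every record, including itself
naive_comparisons = n * n

# Reflexivity removes (r, r); symmetry removes one of (r, s) and (s, r)
pairs = list(combinations(records, 2))
print(naive_comparisons, len(pairs), (n * n - n) // 2)
```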

3.1 Standard Blocking

Idea: Reduce the number of comparisons by partitioning the records into buckets and comparing only records within each bucket, e.g., partition books by publisher.

• Pro: much faster. Con: true matches may be missed

• Reduction ratio depends on the effectiveness of the blocking key
o high: if records are equally distributed over buckets
o low: if the majority of the records end up in one bucket
o possible workaround: build sub-buckets using a second blocking attribute
• Recall depends on how many of the actually matching pairs are kept (no more than 3% should be lost)
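Standard blocking with the publisher as blocking key could look like this. The book records and publisher codes are placeholders, and the reduction ratio is computed here as one minus the share of candidate pairs among all possible pairs.

```python
from collections import defaultdict
from itertools import combinations

def standard_blocking(records, blocking_key):
    """Return only the pairs of records that share the same blocking key value."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[blocking_key(rec)].append(rec)
    pairs = []
    for bucket in buckets.values():
        pairs.extend(combinations(bucket, 2))
    return pairs

books = [
    {"id": 1, "publisher": "A"}, {"id": 2, "publisher": "A"}, {"id": 3, "publisher": "A"},
    {"id": 4, "publisher": "B"}, {"id": 5, "publisher": "B"},
    {"id": 6, "publisher": "C"},
]

candidates = standard_blocking(books, blocking_key=lambda book: book["publisher"])
total_pairs = len(books) * (len(books) - 1) // 2
reduction_ratio = 1 - len(candidates) / total_pairs
print(len(candidates), total_pairs, reduction_ratio)
```

A true match between a book in bucket A and one in bucket B would never be compared, which is the recall risk named above.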

3.2 The Sorted Neighborhood Method (SNM)

Idea: Sort records so that similar records are close to each other. Only compare records within a small neighborhood window.

• Challenges
o Choice of blocking key
o Choice of window size
o But: no problem with different bucket sizes
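A minimal sketch of the sorted neighborhood idea, assuming that the window size counts the records inside the window (so a window of 2 compares each record only with its direct successor). The names and the identity blocking key are illustrative.

```python
def sorted_neighborhood(records, blocking_key, window):
    """Sort by the blocking key and pair each record with the next window-1 records."""
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))
    return pairs

names = ["smith", "smyth", "jones", "johnson", "smithe"]

narrow = sorted_neighborhood(names, blocking_key=lambda s: s, window=2)
wide = sorted_neighborhood(names, blocking_key=lambda s: s, window=3)
print(narrow)
print(len(wide))
```

With the narrow window the pair ("smith", "smyth") is missed because "smithe" sorts between them; the wider window finds it at the cost of more comparisons.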

3.3 Token Blocking for Textual Attributes (…) only slide 32

Evaluation
A gold standard (GS) is necessary.

Accuracy is not a good metric, as matching is a strongly unbalanced task: non-matching pairs vastly outnumber matching pairs.

Goal: Have a high reduction ratio while keeping pair completeness > 0.97
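The measures can be computed from sets of record pairs. In this sketch the gold standard, predictions, and candidate pairs are made up; precision, recall, and F1 are shown as the common alternative to accuracy, which is an assumption about the intended metrics.

```python
def matching_scores(predicted, gold):
    """Precision, recall, and F1 of predicted matches against the gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def blocking_scores(candidates, gold, n_records):
    """Reduction ratio and pair completeness of a blocking result."""
    total_pairs = n_records * (n_records - 1) // 2
    reduction_ratio = 1 - len(candidates) / total_pairs
    pair_completeness = len(candidates & gold) / len(gold)
    return reduction_ratio, pair_completeness

gold = {(1, 2), (3, 4), (5, 6)}
predicted = {(1, 2), (3, 4), (1, 3)}
candidates = {(1, 2), (3, 4), (1, 3), (2, 4)}

precision, recall, f1 = matching_scores(predicted, gold)
reduction_ratio, pair_completeness = blocking_scores(candidates, gold, n_records=6)

# Why accuracy misleads: labeling all 15 pairs as non-matches is already 80% accurate
total_pairs = 6 * 5 // 2
accuracy_all_non_match = (total_pairs - len(gold)) / total_pairs
print(precision, recall, f1, reduction_ratio, pair_completeness, accuracy_all_non_match)
```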

Similarity Measures – In Detail

5.1 Edit-based String Similarity Measures
• Levenshtein Distance (aka Edit Distance)
o Measures the minimum number of edits needed to transform one string into the other (insert, delete, replace)
o ++ can deal with typos
o -- does not work if parts of the string (words) appear in a different order
o -- quadratic runtime complexity
• Jaro Similarity
o Specifically designed for matching names
• Winkler Similarity
o Intuition: Similarity of the first few letters is more important
▪ fewer typos in the first letters
▪ dealing with abbreviations
• Jaro-Winkler Similarity

5.2 Token-based String Similarity Measures

• Token-based measures ignore the order of words, which is often desirable when comparing multi-word strings
• Using n-grams for the Jaccard coefficient
o Deals with typos and different word order
o Reduces the time complexity compared to Levenshtein
• Cosine Similarity: for comparing weighted term vectors
• TF-IDF: gives less weight to common tokens
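The Jaccard coefficient works on sets, so the only design decision is how to tokenize. The sketch below compares word tokens and character bigrams; the example strings and the choice of n = 2 are assumptions.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def word_tokens(s):
    return set(s.lower().split())

def ngrams(s, n=2):
    """Set of character n-grams; strings shorter than n yield themselves."""
    s = s.lower()
    return {s[i : i + n] for i in range(len(s) - n + 1)} if len(s) >= n else {s}

# Word order is ignored by construction
order_score = jaccard(word_tokens("Apple iPhone 12"), word_tokens("iPhone 12 Apple"))

# Character n-grams tolerate a typo inside a token
typo_score = jaccard(ngrams("smith"), ngrams("smyth"))
print(order_score, typo_score)
```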

5.3 Hybrid String Similarity Measures

Hybrid similarity measures split strings into tokens and apply an internal similarity function to compare the tokens.

• Monge-Elkan Similarity
o ++ can deal with typos and different order of words
o -- runtime complexity: quadratic
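Monge-Elkan averages, over the tokens of the first string, the best internal similarity found among the tokens of the second string. In this sketch the internal similarity is the ratio from Python's `difflib.SequenceMatcher`, used only as a readily available stand-in; Levenshtein or Jaro-Winkler are more typical choices. The example names are made up.

```python
from difflib import SequenceMatcher

def inner_similarity(a, b):
    """Stand-in internal similarity in [0, 1] for comparing two tokens."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(s1, s2, sim=inner_similarity):
    """Average best-match similarity of each token in s1 against the tokens of s2."""
    tokens1, tokens2 = s1.lower().split(), s2.lower().split()
    if not tokens1 or not tokens2:
        return 0.0
    # Every token pair is compared, hence the quadratic runtime
    return sum(max(sim(a, b) for b in tokens2) for a in tokens1) / len(tokens1)

reordered = monge_elkan("Peter Christen", "Christen Peter")
with_typos = monge_elkan("Peter Christen", "Christian Pedro")
print(reordered, with_typos)
```

The measure is not symmetric in general, so swapping the arguments can change the score.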

5.4 Datatype-specific Similarity Measures

Numerical Comparison, Dates, geographic coordinates

Learning Matching Rules

Data Fusion
Data Profiling
= the activity of calculating statistics and creating summaries of a data source or data lake.

Data Fusion
= Given multiple records that describe the same real-world entity, create a single record while resolving conflicting data values.
Goal: Create a single high-quality record.
Two basic fusion situations:

• Slot Filling
o Fill missing values (NULLs) in one dataset with corresponding values from other datasets → increased dataset density
• Conflict Resolution
o Resolve contradictions between records by applying a conflict resolution function (heuristic) → increased data quality

Conflict Resolution Functions
• Conflict resolution functions are attribute-specific
o 1. Content-based functions that rely only on the data values to be fused
▪ E.g., average, min/max, union, …
o 2. Metadata-based functions that rely on provenance data, ratings, or quality scores
▪ E.g., favourSources, mostRecent, …
Evaluation of Fusion Results

• 1. Data-Centric Evaluation Measures
o Density: measures the fraction of non-NULL values
o Consistency: A data set is consistent if it is free of conflicting information
• 2. Ground-Truth Based Evaluation Measures
o Accuracy: fraction of correct values selected by the conflict resolution function
o Manually determine the correct values for a subset of the records

Learning Conflict Resolution Functions: Choose the function with the smallest mean absolute error with respect to the gold standard
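Density, accuracy, and the selection of a conflict resolution function by mean absolute error can be sketched together. The conflicting values, the gold standard, and the three candidate functions are made-up illustration data for a single numeric attribute.

```python
from statistics import mean

def density(records, attributes):
    """Fraction of attribute values that are not NULL (None)."""
    cells = [rec.get(attr) for rec in records for attr in attributes]
    return sum(value is not None for value in cells) / len(cells)

def accuracy(fused, gold):
    """Fraction of fused values that equal the gold standard value."""
    return sum(fused[key] == value for key, value in gold.items()) / len(gold)

def mean_absolute_error(fused, gold):
    return mean(abs(fused[key] - value) for key, value in gold.items())

def learn_resolution_function(conflicts, gold, candidates):
    """Pick the candidate function with the smallest MAE on the gold standard."""
    errors = {
        name: mean_absolute_error({k: fn(v) for k, v in conflicts.items()}, gold)
        for name, fn in candidates.items()
    }
    return min(errors, key=errors.get), errors

conflicts = {"e1": [10, 12, 14], "e2": [20, 20, 26]}  # values per entity
gold = {"e1": 12, "e2": 22}                           # manually verified values
candidates = {"min": min, "max": max, "average": mean}

best, errors = learn_resolution_function(conflicts, gold, candidates)
print(best, errors)
print(density([{"a": 1, "b": None}, {"a": 2, "b": 3}], ["a", "b"]))
```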
