Web Data Integration Summary
Findable
o (Meta)data are assigned a globally unique
identifier
o Data are described with rich metadata
o (Meta)data are registered or indexed in a
searchable resource
Accessible
o (Meta)data are retrievable by their identifier using a standardized communications
protocol
o Metadata are accessible, even when the data are no longer available
Interoperable
o (Meta)data use a formal, broadly applicable language for knowledge representation
o (Meta)data use vocabularies that follow FAIR principles
o (Meta)data include qualified references to other (meta)data
Reusable
o (Meta)data are released with a clear data usage license
o (Meta)data are associated with detailed provenance
o (Meta)data meet domain-relevant community standards
2. Web APIs
Platforms that enable users to share information (e.g., Facebook) are only partly accessible from the Web.
They slice the Web into data silos:
--- 1 Not indexable by generic web crawlers
--- 2 No automatic discovery of additional data sources
--- 3 No single global data space
3. Linked Data
+++ Entities are identified with HTTP URIs (role of global primary keys), URIs can be looked up on the
Web (discover new data sources, navigate the global data graph)
4. HTML-embedded Data
+++
1. Webpages traditionally contain structured data in the form of HTML tables as well as template data
2. More and more websites semantically mark up the content of their HTML pages using standardized
markup formats, like RDFa
Character Encoding
Character encoding is the mapping of “real” characters to bit sequences. Mismatched encodings are a
common problem in data integration.
UTF-8: most common encoding, also covers Asian scripts; common (ASCII) characters are encoded using
only one byte, less common ones in 2-4 bytes
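The variable byte length can be checked directly. The following is a minimal sketch; the sample characters and the helper name utf8_length are chosen here for illustration only. It also shows how decoding with the wrong encoding garbles text, which is the typical integration problem.

```python
# Sample characters picked for illustration (not taken from the lecture).
samples = ["a", "ä", "€", "中", "😀"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode the given character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(repr(ch), utf8_length(ch), "byte(s)")

# Reading UTF-8 bytes as if they were Latin-1 produces garbled text.
original = "Zürich"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# The damage is reversible as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
```

UTF-8 as standardized today uses at most four bytes per character.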
XPath
See slides 33 to 44.
Applications:
o annotation of Web pages (RDFa, JSON-LD)
o publication of data on the Web (Linked Data)
o exchange of graph data between applications
View 1: Sentences of the form Subject-Predicate-Object, called triples (e.g., “Chris works at Uni of Ma”)
View 2: Labeled directed graph
Resources
o everything (a person, a place, a web page, …) is a resource
o are identified by URI references
o may have one or more types (e.g. foaf:Person)
Literals
o are data values, e.g., strings or integers
o may only be objects, not subjects of triples
o may have a data type or a language tag
Predicates (Properties)
o connect resources to other resources
o connect resources to literals
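The data model above can be sketched with plain Python tuples, without an RDF library. The example.org URIs, the worksAt property, and the class and function names are hypothetical and only serve the illustration; the FOAF and rdf:type URIs are the standard ones.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A data value; may carry a data type or a language tag."""
    value: str
    datatype: str = ""
    lang: str = ""


class Triple(NamedTuple):
    subject: str                # URI of a resource (never a literal)
    predicate: str              # URI of a property
    obj: Union[str, Literal]    # URI of a resource, or a literal


EX = "http://example.org/"  # hypothetical namespace
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = [
    Triple(EX + "Chris", RDF_TYPE, FOAF + "Person"),               # resource has a type
    Triple(EX + "Chris", EX + "worksAt", EX + "UniOfMa"),          # resource -> resource
    Triple(EX + "Chris", FOAF + "name", Literal("Chris", lang="en")),  # resource -> literal
]


def objects_of(subject: str, predicate: str) -> list:
    """Follow all edges labeled `predicate` that start at `subject`."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


print(objects_of(EX + "Chris", EX + "worksAt"))
```

The list of triples is View 1; reading each triple as an edge from subject to object labeled with the predicate gives View 2.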
1. RDF Data Model
2. RDF Syntaxes
3. RDF Schema
4. SPARQL Query Language
5. RDF in Java
Schema Integration
2. Types of Correspondences
A correspondence relates a set of elements in a schema S to a
set of elements in schema T
Mapping = Set of all correspondences that relate S and T
Schema Matching: Automatically or semi-automatically discover correspondences between
schemata
Types of correspondences
o One-to-One Correspondences – Movie.title → Item.name
o One-to-Many – Person.Name → split() → FirstName (Token 1), Surname (Token 2)
o Many-to-One – Product.basePrice * (1 + Location.taxRate) → Item.price
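The three correspondence types can be read as small translation functions. This is a sketch under assumptions: records are plain dictionaries, names consist of two space-separated tokens, and all function names are hypothetical.

```python
def movie_title_to_item_name(movie: dict) -> dict:
    """One-to-one: Movie.title -> Item.name."""
    return {"name": movie["title"]}


def split_person_name(person: dict) -> dict:
    """One-to-many: Person.Name -> FirstName (token 1), Surname (token 2)."""
    first_name, surname = person["Name"].split(" ", 1)
    return {"FirstName": first_name, "Surname": surname}


def item_price(product: dict, location: dict) -> dict:
    """Many-to-one: Product.basePrice * (1 + Location.taxRate) -> Item.price."""
    return {"price": product["basePrice"] * (1 + location["taxRate"])}


print(movie_title_to_item_name({"title": "Metropolis"}))
print(split_person_name({"Name": "Ada Lovelace"}))
print(item_price({"basePrice": 100.0}, {"taxRate": 0.19}))
```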
3. Schema Integration
Completeness: All elements of the source schemata should be covered
Correctness: All data should be represented in a semantically correct way
Minimality: The integrated schema should be minimal with respect to the number of relations
and attributes
Understandability: The integrated schema should be easy to understand
4. Data Translation
Query Generation Goal: Derive suitable data translation queries (or programs) from the
correspondences.
5. Schema Matching
Automatically or semi-automatically discover correspondences between schemata
Identity Resolution
Goal: Find all records that refer to the
same real-world entity.
Challenge 1: Representations of the
same real-world entity are not
identical → fuzzy duplicates
o Solution: Entity Matching: compare multiple attributes using attribute-specific
similarity measures, after value normalization
Challenge 2: Quadratic Runtime Complexity
o Comparing every pair of records is too expensive for larger datasets
o Solution: Blocking methods avoid unnecessary comparisons
Entity Matching
2.1 Linearly Weighted Matching Rules
Compute the similarity score between
records x and y as a linearly weighted
combination of individual attribute
similarity scores
We declare x and y matched if sim(x,y) >= b for a pre-specified threshold b, and not matched
otherwise
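A linearly weighted matching rule could look as follows. This is only a sketch: the attributes (title, year), the weights, the threshold value, and the use of difflib's ratio as string similarity are assumptions made for the example, not values from the lecture.

```python
from difflib import SequenceMatcher

# Assumed weights and threshold; in practice they are tuned per use case.
WEIGHTS = {"title": 0.7, "year": 0.3}
THRESHOLD = 0.8


def title_sim(a: str, b: str) -> float:
    """String similarity in [0, 1] after simple normalization (lower-casing)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_sim(a: int, b: int, max_diff: int = 5) -> float:
    """Numeric similarity: 1 for equal years, 0 once they differ by max_diff or more."""
    return max(0.0, 1.0 - abs(a - b) / max_diff)


def sim(x: dict, y: dict) -> float:
    """Linearly weighted combination of the attribute similarity scores."""
    return (WEIGHTS["title"] * title_sim(x["title"], y["title"])
            + WEIGHTS["year"] * year_sim(x["year"], y["year"]))


def is_match(x: dict, y: dict, b: float = THRESHOLD) -> bool:
    """Declare x and y matched if sim(x, y) >= b."""
    return sim(x, y) >= b


r1 = {"title": "The Matrix", "year": 1999}
r2 = {"title": "the matrix", "year": 1999}
r3 = {"title": "Zzz", "year": 1950}
print(sim(r1, r2), is_match(r1, r2))
print(sim(r1, r3), is_match(r1, r3))
```

Because the weights sum to 1 and each attribute similarity lies in [0, 1], the combined score also lies in [0, 1], which keeps the threshold easy to interpret.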
Blocking
Since similarity is reflexive and symmetric, a record need not be compared with itself and each pair
needs to be compared only once, which reduces the number of comparisons from n² to (n²-n)/2.
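The two counts can be verified on a toy example. The record identifiers below are placeholders.

```python
from itertools import combinations

records = ["r1", "r2", "r3", "r4", "r5"]  # placeholder record ids
n = len(records)

# Naive approach: every record against every record, including itself: n² pairs.
all_pairs = [(x, y) for x in records for y in records]

# Skip (x, x) and keep only one of (x, y) / (y, x): (n² - n) / 2 pairs.
unique_pairs = list(combinations(records, 2))

print(len(all_pairs), len(unique_pairs))
```

The reduced count is still quadratic in n, which is why blocking is needed for larger datasets.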
Challenges
o Choice of Blocking Key
o Choice of Window Size
o But: no problem with different bucket sizes
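The window size suggests a sorted-neighbourhood style of blocking. Assuming that method, a minimal sketch is: sort the records by blocking key, slide a window over the sorted list, and compare only records inside the same window. The blocking key (first two letters of a name) and the sample names are made up for illustration.

```python
def sorted_neighbourhood_pairs(records, key, window):
    """Return candidate pairs: each record is paired with the next window-1 records
    in the list sorted by blocking key."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.append((rec, other))
    return pairs


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first two letters, lower-cased."""
    return name[:2].lower()


names = ["Meier", "Mayer", "Smith", "Smyth", "Maier"]
candidates = sorted_neighbourhood_pairs(names, blocking_key, window=2)
print(candidates)
```

Both challenges show up here: "Meier" and "Maier" only become a candidate pair because they happen to be adjacent after sorting, and a larger window yields more candidate pairs at higher cost. The number of comparisons depends only on n and the window size, not on how many records share a key.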
Evaluation
A gold standard (GS) is necessary
Monge-Elkan Similarity
o ++ can deal with typos and different order of words
o -- runtime complexity: quadratic
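A sketch of Monge-Elkan similarity makes both properties visible: each token of the first string is matched with its most similar token in the second string, and the best scores are averaged. The inner token similarity used here (difflib's ratio) is a stand-in chosen for the example; edit-distance based measures are a common alternative.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Inner similarity between two tokens (assumption: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(s: str, t: str) -> float:
    """Average, over the tokens of s, of the best inner similarity to any token of t."""
    tokens_s = s.lower().split()
    tokens_t = t.lower().split()
    if not tokens_s or not tokens_t:
        return 0.0
    # Every token of s is compared with every token of t: |s| * |t| comparisons.
    best_scores = [max(token_sim(a, b) for b in tokens_t) for a in tokens_s]
    return sum(best_scores) / len(tokens_s)


print(monge_elkan("Peter Christen", "Christen Peter"))   # different word order
print(monge_elkan("Peter Christen", "Christan Peter"))   # typo plus different order
```

The nested loop over both token lists is the source of the quadratic runtime. Note that the measure is not symmetric in general, since it averages over the tokens of the first string only.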
Data Fusion
Data profiling
= the activity of calculating statistics and creating summaries of a data source or data lake.
Data Fusion
= Given multiple records that describe the same real-world entity, create a single record while
resolving conflicting data values.
Goal: Create a single high-quality record.
Two basic fusion situations:
Slot Filling
o Fill missing values (NULLs) in one dataset with corresponding values from other
datasets → increased dataset density
Conflict Resolution
o Resolve contradictions between records by applying a conflict resolution function
(heuristic) → increased data quality
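Slot filling can be sketched in a few lines, assuming that both records are already known to describe the same entity, that they are plain dictionaries, and that None stands for NULL. The function name fill_slots is hypothetical.

```python
def fill_slots(target: dict, other: dict) -> dict:
    """Copy values from `other` into attributes that are missing (None) in `target`.
    Existing values in `target` are never overwritten."""
    fused = dict(target)
    for attribute, value in other.items():
        if fused.get(attribute) is None:
            fused[attribute] = value
    return fused


a = {"title": "Metropolis", "year": None, "director": None}
b = {"title": "Metropolis (1927)", "year": 1927, "director": "Fritz Lang"}
print(fill_slots(a, b))
```

The differing titles are left untouched here; deciding between them is the job of conflict resolution.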
Conflict Resolution Functions
Conflict resolution functions are attribute-specific
o 1. Content-based functions that rely only on the data values to be fused
E.g., average, min/max, union,…
o 2. Metadata-based functions that rely on provenance data, ratings, or quality scores
E.g., favourSources, mostRecent, …
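The two families of functions can be illustrated as follows. This is a sketch with assumed data: each conflicting value is stored together with hypothetical provenance fields (source, updated), and the Python function names only mirror the examples named above.

```python
from datetime import date

# Conflicting values for one attribute of one entity, with assumed provenance metadata.
claims = [
    {"value": 120, "source": "A", "updated": date(2020, 1, 1)},
    {"value": 100, "source": "B", "updated": date(2022, 6, 1)},
]


# Content-based: looks only at the values themselves.
def average(claims: list) -> float:
    return sum(c["value"] for c in claims) / len(claims)


def union(claims: list) -> set:
    return {c["value"] for c in claims}


# Metadata-based: relies on provenance data.
def most_recent(claims: list):
    """Take the value with the latest update date."""
    return max(claims, key=lambda c: c["updated"])["value"]


def favour_sources(claims: list, preferred: list):
    """Take the value from the first source in the preference order that provides one."""
    for source in preferred:
        for c in claims:
            if c["source"] == source:
                return c["value"]
    return None


print(average(claims), union(claims), most_recent(claims), favour_sources(claims, ["A", "B"]))
```

Since the functions are attribute-specific, a fusion strategy would typically assign one such function to each attribute, for example an average for a numeric rating and a source preference for a title.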