EKG Guide
This tutorial is based on the work being done in the MindLab, an industrial research project for building
knowledge graphs to be consumed by conversational agents in domains like tourism.
An extensive version of the content of this tutorial can be found in our upcoming book “Knowledge
Graphs in Use” (working title).
https://mindlab.ai
For academics:
A brief overview of the literature and an introduction to some tools, especially for knowledge curation.
Relevant Literature:
https://mindlab.ai/en/publications/ - An extensive list of the literature on knowledge graphs and their
applications with conversational agents
● An agent interprets a knowledge graph to make rational decisions about which actions to take to
reach its goals
● KGs do not have a big TBox, but have a very large ABox. There is not much to reason over.
● No strict schema: good for integrating heterogeneous sources, but problematic for data quality.
● A knowledge graph may cost 0.1 to 6 USD per fact [Paulheim, 2018]
Fixing TBox
- We accept schema.org (and its extensions) as the gold standard, so there is no problem here.
Fixing ABox
- This is where knowledge curation comes in.
http://www.schema.org/
[Excerpt of a schema.org JSON-LD annotation: "longitude": "10.9136698539673"]
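The stray longitude value above comes from a larger schema.org annotation; a minimal illustrative reconstruction could look as follows (every value except the longitude is hypothetical):

```json
{
  "@context": "http://schema.org/",
  "@type": "Place",
  "name": "Example Place",
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": "47.2692124",
    "longitude": "10.9136698539673"
  }
}
```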
A tool, developed as a research project, that has grown into a full-stack annotation creation, validation, and publication framework!
[Screenshot of an annotation editor form with the fields Ort (place), Name, Datum (date); example value “Karlsruhe”]
Kärle & Simsek, September 9, 2019
2. Knowledge Creation - tools - semantify.it
2) How to create those JSON-LD files?
- Semantify.it editor & instant annotations
- based on Domain Specifications (DS)
- inside the platform (big DS files)
- or Instant Annotations (IA),
portable to every website (based on JS)
- mappers (RocketRML)
- wrapper framework
- semi-automatic
RocketRML ⇒
● Before starting the mapping process for a TriplesMap, we check whether the TriplesMap is in the
join condition of another TriplesMap. If it is, we get the parent path of the join condition and
evaluate it. The value is then cached as a path-value pair.
● After everything is mapped, we go through the two caches and join the objects with matching child
and parent values.
https://semantifyit.github.io/RocketRML/ (Node.js implementation)
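The two-cache join described above can be sketched roughly as follows; the data structures and names are illustrative, not RocketRML's actual internals:

```javascript
// Cache 1: parent path -> value -> subjects produced by the parent TriplesMap
// (filled while the parent TriplesMap is mapped).
const parentCache = { "$.country.code": { AT: ["ex:Austria"] } };

// Cache 2: results of the child TriplesMap, each remembering the child value
// of the join condition so it can be joined later.
const childCache = [
  { subject: "ex:Innsbruck", joinValue: "AT" },
  { subject: "ex:Springfield", joinValue: "US" },
];

// After all mappings ran, join objects with matching child and parent values.
function joinCaches(parentPath, predicate) {
  const triples = [];
  const byValue = parentCache[parentPath] || {};
  for (const child of childCache) {
    for (const parentSubject of byValue[child.joinValue] || []) {
      triples.push([child.subject, predicate, parentSubject]);
    }
  }
  return triples;
}

const result = joinCaches("$.country.code", "ex:locatedIn");
// "ex:Springfield" has no matching parent value, so only one triple is produced
```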
- semantify.it stores all created annotations and provides them over an API
(http://smtfy.it/sj7Fie2 OR http://smtfy.it/url/http//... OR http://smtfy.it/cid/374fm38dkgi...)
- publication of annotations over JS or into popular CMSs through plugins (WordPress, TYPO3, etc.)
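The three retrieval patterns above can be illustrated with a small helper; `annotationUrl` is hypothetical and only mirrors the URL shapes shown, not the actual semantify.it API:

```javascript
// Hypothetical helper for the three retrieval patterns: short id, source
// web-page URL, or content id. The real API may differ.
const BASE = "http://smtfy.it";

function annotationUrl(kind, value) {
  if (kind === "short") return `${BASE}/${value}`;   // e.g. /sj7Fie2
  if (kind === "url") return `${BASE}/url/${value}`; // by source web page
  if (kind === "cid") return `${BASE}/cid/${value}`; // by content id
  throw new Error(`unknown kind: ${kind}`);
}

// An annotation could then be fetched with any HTTP client, e.g.:
// const annotation = await fetch(annotationUrl("short", "sj7Fie2")).then(r => r.json());
```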
Either a URL:
- to identify resources http://fritz.phantom.com
- to refer to properties of an ontology http://schema.org/name
- to refer to types of an ontology http://schema.org/Person
or a literal
- String: “Fritz Phantom”
- Date: “1.1.19??”
- Number: 42
{
  "@id": "https://fritz.phantom.com",
  "@type": "Person",
  "livesIn": "Innsbruck",
  "born": "19??-01-01",
  "worksFor": { "@type": "schema:Organisation" }
}
Either as
1) JSON-LD
or as
2) Knowledge Graph
Summary (JSON-LD files):
- works very well with tens of millions of JSON-LD files
- we replicate this data periodically into a graph database for “real” Knowledge Graph usage
Summary (graph database):
- overhead aside: great for big knowledge graphs
● Knowledge Cleaning
● Knowledge Enrichment
2. A finite number of type definitions isA(t1,t2), where t1 and t2 are elements of T. isA is reflexive and
transitive.
○ Range definition for a property p, where p is an element of P and t1 and t2 are elements of T.
■ Simple definition: global property definition: hasRange(p,t2)
■ Refined definition: local property definition: hasRange(p,t2) for domain t1, short:
hasLocalRange(p,t1,t2)
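The definitions above can be made concrete with a small sketch; the type hierarchy and property names are invented for illustration:

```javascript
// isA is stored as direct edges; subtypeOf computes the reflexive,
// transitive closure on demand.
const isA = { Hotel: ["LodgingBusiness"], LodgingBusiness: ["Organization"] };

function subtypeOf(t1, t2) {
  if (t1 === t2) return true; // reflexive
  return (isA[t1] || []).some((s) => subtypeOf(s, t2)); // transitive
}

// Global definition hasRange(p,t2) vs. local definition hasLocalRange(p,t1,t2):
const hasRange = { founder: "Thing" }; // global
const hasLocalRange = { founder: { Hotel: "Person" } }; // local, per domain t1

function rangeFor(p, subjectType) {
  const local = hasLocalRange[p] || {};
  for (const t1 of Object.keys(local)) {
    if (subtypeOf(subjectType, t1)) return local[t1]; // local definition wins
  }
  return hasRange[p]; // fall back to the global definition
}
```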
● Various dimensions for data quality assessment have been introduced ([Batini & Scannapieco, 2006],
[Färber et al., 2018], [Pipino et al., 2002], [Wang, 1998], [Wang & Strong, 1996], [Wang et al., 2001],
[Zaveri et al., 2016])
● SWIQA (Semantic Web Information Quality Assessment Framework) [Fürber & Hepp, 2011]:
uses network features to assess data quality (e.g. counting open chains to find wrongly
asserted isSameAs relationships)
An online tool that checks the conformance of RDF graphs against ShEx (Shape Expressions) schemas.
● Luzzu (A Quality Assessment Framework for Linked Open Datasets) [Debattista et al., 2016]
https://eis-bonn.github.io/Luzzu/downloads.html
Allows declarative definitions of quality metrics and produces machine-readable assessment reports
based on the Dataset Quality Vocabulary.
A framework that assesses linked data quality based on test cases defined in various ways (e.g.
RDFS/OWL axioms can be converted into constraints)
Uses statistical distributions to predict the types of instances. Incoming and outgoing properties are
used as indicators for the types of resources.
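A naive sketch of this idea; the per-property type distributions and weights below are invented for illustration, not taken from the actual approach:

```javascript
// For each property, the observed distribution of the types it appears with:
const typeDistribution = {
  "schema:checkinTime": { "schema:LodgingBusiness": 0.9, "schema:Event": 0.1 },
  "schema:author": { "schema:Book": 0.6, "schema:Article": 0.4 },
};

// Score candidate types by summing the distributions of the properties
// observed on a resource, then pick the best-scoring type.
function predictType(observedProperties) {
  const scores = {};
  for (const p of observedProperties) {
    for (const [type, w] of Object.entries(typeDistribution[p] || {})) {
      scores[type] = (scores[type] || 0) + w;
    }
  }
  return Object.entries(scores).sort((a, b) => b[1] - a[1])[0]?.[0];
}
```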
○ Data Quality Indicators: various types of (meta)data that can be used to assess data quality, e.g. data
about the dataset provider or user ratings
○ Scoring Functions: a set of functions that support the calculation of assessment metrics based on the
indicators
○ Assessment Metrics: metrics like relevancy and timeliness that help users assess the quality for an
intended use
○ Aggregate Metrics: allow users to build new metrics by aggregating simple assessment metrics.
○ Error detection
○ Error correction
Errors in instance assertions isElementOf(i,t) and their corrections:
- i is not a proper instance identifier ⇒ delete the assertion or correct i
- t is not a valid type identifier ⇒ delete the assertion or correct t
- the instance assertion is semantically incorrect ⇒ delete the assertion or find the proper t
Errors in property assertions p(i1,i2) and their corrections:
- p is not a valid property ⇒ delete the assertion or correct p
- i1 is not a valid instance identifier ⇒ delete the assertion or correct i1
- i1 is not in any domain of p ⇒ delete the assertion, or add the assertion isElementOf(i1,t)
where t is in a domain of p
Range-side errors for property assertions p(i1,i2) and their corrections:
- i2 is not a valid instance identifier ⇒ delete the assertion or correct i2
- i2 is not in any range of p where i1 is an element of a domain of p ⇒ delete the assertion;
or add the assertion isElementOf(i1,t1), given that hasLocalRange(t1,p,t2) and isElementOf(i2,t2);
or add the assertion isElementOf(i2,t2), given that hasLocalRange(t1,p,t2) and isElementOf(i1,t1)
Errors in equality assertions (e.g. isSameAs(i1,i2)) and their corrections:
- i1 is not a valid instance identifier ⇒ delete the assertion or correct i1
- i2 is not a valid instance identifier ⇒ delete the assertion or correct i2
- the equality assertion is semantically wrong ⇒ delete the assertion or loosen the semantics
(e.g. replace it with a SKOS operator)
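The detection side of the tables above can be sketched for a property assertion p(i1,i2); the TBox/ABox structures below are illustrative:

```javascript
// Toy TBox/ABox used only for this sketch.
const validProperties = new Set(["worksFor"]);
const instanceTypes = { fritz: ["Person"], acme: ["Organisation"] };
const domains = { worksFor: ["Person"] };
const ranges = { worksFor: ["Organisation"] };

// Returns the list of detected errors; each error corresponds to a
// correction option in the tables above.
function checkPropertyAssertion(p, i1, i2) {
  const errors = [];
  if (!validProperties.has(p)) errors.push("p is not a valid property");
  if (!instanceTypes[i1]) errors.push("i1 is not a valid instance identifier");
  else if (!instanceTypes[i1].some((t) => (domains[p] || []).includes(t)))
    errors.push("i1 is not in any domain of p");
  if (!instanceTypes[i2]) errors.push("i2 is not a valid instance identifier");
  else if (!instanceTypes[i2].some((t) => (ranges[p] || []).includes(t)))
    errors.push("i2 is not in any range of p");
  return errors;
}
```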
An error detection and correction tool based on integrity constraints to identify conflicting and invalid
values, external information to support the constraints, and quantitative statistics to detect outliers.
Learns the relationships between data columns and validates the learned patterns with the help of
existing knowledge bases and the crowd, in order to detect errors in the data. Afterwards, it also
suggests possible repairs.
Uses statistical distributions to detect erroneous statements that connect two resources. Statements
with less frequent predicate-object pairs are selected as candidates for being wrong.
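A minimal sketch of this frequency heuristic; the data and threshold are invented for illustration:

```javascript
const statements = [
  ["ex:Innsbruck", "ex:locatedIn", "ex:Austria"],
  ["ex:Salzburg", "ex:locatedIn", "ex:Austria"],
  ["ex:Kufstein", "ex:locatedIn", "ex:Austria"],
  ["ex:Hallstatt", "ex:locatedIn", "ex:Germany"], // suspicious statement
];

// Count predicate-object pairs and flag statements whose pair is rare.
function rareStatements(triples, minCount = 2) {
  const counts = {};
  for (const [, p, o] of triples) {
    const key = `${p} ${o}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return triples.filter(([, p, o]) => counts[`${p} ${o}`] < minCount);
}

const rare = rareStatements(statements); // only the ex:Germany statement
```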
Two approaches that aim to verify RDF graphs against a specification (so-called shapes).
For a comparison of the two approaches, see Chapter 7 in [Gayo et al., 2017].
Detects and corrects syntactic errors (e.g. bad encoding, broken IRIs), replaces blank nodes with IRIs,
removes duplicates in dirty linked open data and re-publishes it in a canonical format.
A framework that tries to identify the time interval where a statement was correct. It uses external
knowledge bases and the web content to extract evidence to assess the validity of a statement for a
time interval.
● Integration of TBox
○ We assume that all data sources are mapped to schema.org
○ Non-RDF sources can be also mapped with the techniques described in Knowledge Creation
Tackling these issues can be realized by:
● Dedupe: https://github.com/dedupeio/dedupe
A Python library that uses machine learning to find duplicates in a dataset and to link two datasets.
Uses various similarity metrics to detect duplicates in a dataset or link records between two datasets based
on a given configuration. The configuration parameters can be
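A generic, configuration-driven duplicate-detection sketch in the spirit of the tools above; it mirrors none of their actual APIs, and the Jaccard-style similarity and all parameters are invented:

```javascript
// Hypothetical configuration: which fields to compare, with what weight,
// and the decision threshold.
const config = {
  threshold: 0.8,
  fields: [{ name: "name", weight: 1.0 }],
};

// Crude similarity: weighted shared-token (Jaccard) ratio per field.
function similarity(a, b) {
  let total = 0, weights = 0;
  for (const f of config.fields) {
    const ta = new Set(String(a[f.name]).toLowerCase().split(/\s+/));
    const tb = new Set(String(b[f.name]).toLowerCase().split(/\s+/));
    const inter = [...ta].filter((t) => tb.has(t)).length;
    const union = new Set([...ta, ...tb]).size;
    total += f.weight * (inter / union);
    weights += f.weight;
  }
  return total / weights;
}

function isDuplicate(a, b) {
  return similarity(a, b) >= config.threshold;
}
```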
A record linkage tool that utilizes the Concise Bounded Description* of resources for comparison.
*https://www.w3.org/Submission/2004/SUBM-CBD-20040930/#r6
A link discovery approach that benefits from metric spaces (in particular the triangle inequality) to
reduce the number of comparisons between the source and target datasets.
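The triangle-inequality trick can be sketched as follows; the numbers are illustrative:

```javascript
// In a metric space, d(x, y) >= |d(x, e) - d(e, y)| for any exemplar e.
// If precomputed distances to e already give a lower bound above the
// matching threshold, d(x, y) never needs to be computed at all.
function lowerBound(dXE, dEY) {
  return Math.abs(dXE - dEY);
}

function canSkip(dXE, dEY, threshold) {
  return lowerBound(dXE, dEY) > threshold;
}
```

The fewer pairs survive this filter, the fewer expensive full distance computations are needed between the source and target datasets.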
A link discovery tool that utilizes string similarity functions on “label properties” without prior
knowledge of the data or schema.
A link discovery tool with declarative linkage rules applying different similarity metrics (e.g. string,
taxonomic, set) that also supports policies for notifying datasets when one of them
publishes new links to others.
A framework for fusing geospatial data. It suggests fusion strategies based on two datasets with
geospatial data and a set of linked entities.
A framework that allows the application of different methods on different attributes in the same
dataset for identification of duplicates and resolves inconsistencies caused by the fusion of linked
instances.
A framework that contains a fusion module that allows users to configure conflict resolution policies
based on different functions (e.g. AVG, MAX, CONCAT) that can be applied on conflicting property
values.
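The per-property conflict-resolution policies described above (AVG, MAX, CONCAT) can be sketched as follows; the property names are hypothetical:

```javascript
// One resolution function per conflicting property.
const policies = {
  price: (vs) => vs.reduce((a, b) => a + b, 0) / vs.length, // AVG
  starRating: (vs) => Math.max(...vs),                      // MAX
  description: (vs) => vs.join(" "),                        // CONCAT
};

// Fuse linked records by applying each property's policy to all
// values observed for that property.
function fuse(records) {
  const fused = {};
  for (const prop of Object.keys(policies)) {
    const values = records.map((r) => r[prop]).filter((v) => v !== undefined);
    if (values.length) fused[prop] = policies[prop](values);
  }
  return fused;
}

const fused = fuse([
  { price: 100, starRating: 3 },
  { price: 120, starRating: 4, description: "Nice" },
]);
```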