An Approach to Extracting Thematic Views from the Highly Heterogeneous Sources of a Data Lake
Abstract. In recent years, data lakes have emerged as an effective and efficient support for information and knowledge extraction from huge amounts of highly heterogeneous and quickly changing data sources. Data lake management requires the definition of new techniques, very different from those adopted for data warehouses in the past. One of the main issues to address in this scenario is the extraction of thematic views from the (very heterogeneous and generally unstructured) data sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to, at least partially, "structure" unstructured data. Finally, we define a technique to extract thematic views from the sources of a data lake, based on similarity and other semantic relations among the metadata of data sources.
1 Introduction
In recent years, data lakes have emerged as an effective and efficient answer to the problem of extracting information and knowledge from huge amounts of highly heterogeneous and quickly changing data sources [13]. Data lake management requires the definition of new techniques, very different from those adopted for data warehouses in the past. These techniques can exploit the large set of metadata that always accompanies a data lake; this metadata represents its core and is the main tool that makes data lakes a very competitive framework in the big data era.
One of the main issues to address in a scenario comprising many data sources that are extremely heterogeneous in format, structure and semantics is the extraction of thematic views from them [3], i.e., the construction of views concerning one or more topics of interest for the user, obtained by extracting and merging data coming from different sources. This problem has been largely investigated in the past for structured and semi-structured data sources stored in data warehouses [26, 8, 23], which witnesses its extreme relevance. However, it is estimated that, currently, more than 80% of data sources are unstructured [9]. As a consequence, this type of source is the main actor of the big data scenario and, consequently, of data lakes.
In this paper, we aim to provide a contribution in this setting. Specifically, we propose a supervised approach to extracting thematic views from the highly heterogeneous sources of a data lake. Our approach represents all the data lake sources by means of a suitable network. Indeed, networks are very flexible structures that allow the modeling of almost all phenomena that researchers aim to investigate [7]. Thanks to this uniform representation of the data lake sources, the extraction of thematic views from them can be performed by exploiting graph-based tools. We call our approach "supervised" because it requires the user to specify the set of topics T = {T1, T2, ..., Tn} that should be present in the thematic view(s) to extract. Our approach consists of two steps. The former is mainly based on the structure of the involved sources. It exploits several notions typical of (social) network analysis, such as the notion of ego network, which represents the core of the proposed approach. The latter exploits a knowledge repository, used to discover new relationships, other than synonymies, among metadata, in order to refine the integration of the thematic views obtained after the first step. In this step, our approach relies on DBpedia (http://dbpedia.org).
This paper is organized as follows. Section 2 illustrates the related literature. In Section 3, we present the proposed approach: first, we describe a unifying model for data lake representation; then, we present our approach to partially structuring unstructured sources; finally, we discuss the two steps of our approach for thematic view extraction. In Section 4, we present an example case, whereas, in Section 5, we draw our conclusions and discuss future work.
2 Related Literature
The new data lake scenario is characterized by several peculiarities that make it very different from the data warehouse paradigm. Hence, it is necessary to adapt (when possible) old algorithms conceived for data warehouses or to define new approaches capable of handling, and taking advantage of, the specificities of this new paradigm. However, most approaches proposed in the literature for data integration, query answering and view extraction do not completely fit the data lake paradigm. For instance, [8] proposes some techniques for building views on semi-structured data sources based on some expected queries. Other researchers focus on materialized views and, specifically, on throughput and execution time; therefore, they define a priori a set of well-known views and, then, materialize them. Two surveys on this issue can be found in [16, 1]. The authors of [26] investigate the same problem but focus on XML sources. The approach of [25] addresses the same issue by means of query rewriting; specifically, it transforms a query Q into a set of new queries, evaluates them and, then, merges the corresponding answers to construct the materialized answer to Q. [4] proposes an approach to constructing materialized views for heterogeneous databases; it requires the presence of a static context and the pre-computation of some queries.
Another family of approaches exploits materialized views to perform tree pattern querying [24] and graph pattern querying [12]. Unfortunately, all these approaches are well suited to structured and semi-structured data, whereas they are not scalable and lightweight enough to be used in a dynamic context or with unstructured data. An interesting advance in this area can be found in [22], where the authors propose an incremental approach to addressing the graph pattern query problem on both static and dynamic real-life data graphs. Other kinds of views are investigated in [6] and [3]. In particular, the latter accesses heterogeneous data sources without requiring knowledge of many of their details; for this purpose, it creates virtual views of these sources.
Finally, semantic-based approaches have long been used to drive data integration in databases and data warehouses. More recently, in the context of big data, formal semantics has been specifically exploited to address issues concerning data variety/heterogeneity, data inconsistency and data quality in such a way as to increase understandability [17]. In the data lake scenario, semantic techniques have been successfully applied to integrate and handle both structured and unstructured data sources more efficiently, by aligning data silos and better managing evolving data models. For instance, in [15], the authors discuss a data lake system with a semantic metadata matching component for ontology modeling, attribute annotation, record linkage and semantic enrichment. Furthermore, [14] presents a system to discover and enforce expressive integrity constraints in data lakes. Similarly to our approach, knowledge graphs in RDF are used to drive integration. To reach their objectives, these techniques usually rely on information extraction tools (e.g., Open Calais) that may assist in linking metadata to uniform vocabularies (e.g., ontologies or knowledge repositories, such as DBpedia).
3 The Proposed Approach

Our network-based model for representing and handling a data lake is particularly well suited to representing and managing semi-structured data, because it has been designed with XML and JSON in mind. Clearly, it is also sufficiently powerful to represent structured data. The greatest difficulty concerns unstructured data, because it is advisable to avoid a flat representation consisting of a simple element for each keyword denoting the source content. As a matter of fact, this kind of representation would make the reconciliation, and the subsequent integration, of an unstructured source with the other (semi-structured and structured) ones of the data lake very difficult. Therefore, it is necessary to (at least partially) "structure" unstructured data.
Our approach to addressing this issue consists of four phases, namely: (i)
creation of nodes; (ii) derivation and management of part-of relationships; (iii)
derivation of lexical and string similarities; (iv) management of lexical and string
similarities.
Phase 1. During this phase, our approach creates a complex element representing the source as a whole, and a simple element for each keyword. Furthermore, it adds an arc from the source to each of the simple elements. Initially, there is no arc between two simple elements; the next phases determine the arcs to add.
Phase 2. During this phase, our approach adds an arc from the node nk1, corresponding to the keyword k1, to the node nk2, corresponding to the keyword k2, if k2 is registered as a lemma of k1 in a suitable thesaurus. Here, we use the term "lemma" according to the meaning it has in BabelNet [20]: given a term, its lemmas are other objects (terms, emoticons, etc.) contributing to specify its meaning. Taking current trends into account, this thesaurus should be a multimedia one; for this purpose, in our experiments, we have adopted BabelNet [20]. When this arc has been added, nk1 must be considered a complex element, instead of a simple one.
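Building on the previous sketch, Phase 2 could look as follows; here lemmas_of is an assumed helper that queries the adopted thesaurus (BabelNet, in our experiments) and returns the set of lemmas of a keyword, since the actual BabelNet API calls are beyond the scope of this sketch.

```python
def add_lemma_arcs(g, lemmas_of):
    """Phase 2: add an arc from nk1 to nk2 whenever k2 is a lemma of k1."""
    keywords = [n for n, d in g.nodes(data=True) if d.get("kind") == "simple"]
    for k1 in keywords:
        for k2 in keywords:
            if k1 != k2 and k2 in lemmas_of(k1):
                g.add_edge(k1, k2)                 # k2 is a lemma of k1
                g.nodes[k1]["kind"] = "complex"    # nk1 becomes a complex element
```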
Phase 3. During this phase, our approach starts by deriving lexical similarities. In particular, it states that there exists a similarity between the nodes nk1, corresponding to the keyword k1, and nk2, corresponding to the keyword k2, if k1 and k2 have at least one common lemma in a suitable thesaurus. Also in this case, we have adopted BabelNet. After having found lexical similarities, our approach derives string similarities: it states that there exists a similarity between nk1 and nk2 if the string similarity degree kd(k1, k2), computed by applying a suitable string similarity metric to k1 and k2, is "sufficiently high" (see below). We have chosen N-Grams [18] as the string similarity metric because we have experimentally verified that it provides the best results in our context. We now illustrate in detail what "sufficiently high" means and how our approach operates. Let KeySim be the set of the string similarities for each pair of keywords of the source under consideration. Each record in KeySim has the form ⟨ki, kj, kd(ki, kj)⟩. Our approach first computes the maximum keyword similarity degree kdmax present in KeySim. Then, it examines each keyword similarity registered therein. Let ⟨k1, k2, kd(k1, k2)⟩ be one of these similarities. If ((kd(k1, k2) ≥ thk · kdmax) and (kd(k1, k2) ≥ thkmin)), which implies that the keyword similarity degree between k1 and k2 is among the highest ones in KeySim and that, in any case, it is higher than or equal to a minimum threshold, then it concludes that there exists a similarity between nk1 and nk2. We have experimentally set thk = 0.70 and thkmin = 0.50. At the end of this phase, our approach has found some (lexical and/or string) similarities, each stating that a node nki is similar to a node nkj.
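The string-similarity part of Phase 3 can be sketched as follows. The metric below (Jaccard overlap of character trigrams) is only a common stand-in for the N-Grams metric of [18]; the thresholds are those reported above.

```python
from itertools import combinations

def ngrams(s, n=3):
    s = f"#{s.lower()}#"                       # pad so short strings yield grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def kd(k1, k2):
    g1, g2 = ngrams(k1), ngrams(k2)
    return len(g1 & g2) / len(g1 | g2)         # similarity degree in [0, 1]

def string_similarities(keywords, th_k=0.70, th_kmin=0.50):
    key_sim = [(k1, k2, kd(k1, k2)) for k1, k2 in combinations(keywords, 2)]
    if not key_sim:
        return []
    kd_max = max(d for _, _, d in key_sim)
    # keep pairs close to the maximum AND above the absolute minimum threshold
    return [(k1, k2, d) for (k1, k2, d) in key_sim
            if d >= th_k * kd_max and d >= th_kmin]
```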
Phase 4. During this phase, our approach manages the similarities found during Phase 3. In particular, if there exists a (lexical and/or string) similarity between two nodes nki and nkj, it merges them into one node nkij, which inherits all the incoming and outgoing arcs of nki and nkj. After all similarities have been considered, there could exist two or more arcs from a node nki to a node nkj. In this case, our approach merges them into one arc.
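Phase 4 can then be sketched as follows, where sims collects the (lexical and/or string) similarity pairs found in Phase 3; networkx's contracted_nodes performs the node merging, and converting the multigraph to a plain DiGraph collapses duplicate arcs.

```python
import networkx as nx

def merge_similar_nodes(g, sims):
    """Phase 4: merge similar nodes and collapse duplicate arcs."""
    for k1, k2, _ in sims:
        if k1 in g and k2 in g and k1 != k2:
            # merge nk2 into nk1; nk1 inherits all incoming and outgoing arcs
            g = nx.contracted_nodes(g, k1, k2, self_loops=False)
    return nx.DiGraph(g)   # at most one arc is kept between any two nodes
```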
Step 1. The first step of our approach receives a data lake DL, a set of topics T = {T1, T2, ..., Tl} representing the themes of interest for the user, and a dictionary Syn of synonymies involving the objects stored in the sources of DL. This dictionary could be a generic thesaurus, such as BabelNet [20], a domain-specific thesaurus, or a dictionary obtained by taking into account the structure and the semantics of the sources to which the corresponding objects refer (such as the dictionaries produced by XIKE [10], MOMIS [5] or Cupid [19]).
In this step, the concept of ego network [2, 11] plays a key role. We recall that
an ego network consists of a focal node (the ego) and the nodes it is directly
connected to (the “alters”), plus the ties, if any, between the alters.
Let Ti be a topic of T. Let Obji = {oi1, oi2, ..., oiq} be the set of the objects of DL that are synonyms of Ti, and let Ni = {ni1, ni2, ..., niq} be the corresponding nodes. First, Step 1 constructs the ego networks Ei1, Ei2, ..., Eiq having ni1, ni2, ..., niq as the corresponding egos. Then, it merges all the egos into a unique node ni. In this way, it obtains a unique ego network Ei from Ei1, Ei2, ..., Eiq. If a synonymy exists between two alters belonging to different ego networks, then these alters are merged into a unique node, and the corresponding arcs linking them to the ego ni are merged into a unique arc. At the end of this task, we have a unique ego network Ei corresponding to Ti.
After having performed the previous task for each topic of T, we have a set E = {E1, E2, ..., El} of l ego networks. At this point, Step 1 finds all the synonymies of Syn involving objects of the ego networks of E and merges the corresponding nodes. After all the possible synonymies involving objects of the ego networks of E have been considered and the corresponding nodes have been merged, a set V = {V1, ..., Vg}, 1 ≤ g ≤ l, of networks representing potential views is obtained.
If g = 1, it is possible to conclude that Step 1 has been capable of extracting a unique thematic view comprising all the topics required by the user; in this case, Step 2 is performed to make this unique view more precise and complete. Otherwise, there exist several views, each comprising some (but not all) of the topics of interest for the user; in this case, Step 2 aims at finding further relationships, different from synonymies, among the objects of the views of V, in such a way as to try to obtain a unique view embracing all the topics of interest for the user.
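To give the flavor of the first task of Step 1, the following sketch builds the unique ego network Ei of a topic; networkx's ego_graph primitive directly supports ego network extraction. Here nodes_of_topic is an assumed helper returning the nodes of DL whose objects are synonyms of the topic according to Syn, and the subsequent merging of synonymous alters across ego networks is omitted for brevity.

```python
import networkx as nx

def topic_ego_network(dl, topic, nodes_of_topic):
    """Step 1, first task: build the unique ego network Ei for a topic Ti."""
    ei = nx.Graph()
    ei.add_node(topic)                             # the merged ego ni
    for n in nodes_of_topic:
        e = nx.ego_graph(dl, n)                    # ego network of one synonym node
        for alter in e.nodes:
            if alter != n:
                ei.add_edge(topic, alter)          # duplicate arcs collapse here
        # keep the ties, if any, between the alters
        ei.add_edges_from((u, v) for u, v in e.edges if n not in (u, v))
    return ei
```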
Step 2. This step starts by enriching each view Vi ∈ V. For this purpose, it connects each of its elements to all the semantically related concepts taken from a reference knowledge repository.
In this work, we rely on DBpedia, one of the largest knowledge graphs in the Linked Data context, including more than 4.58 million entities in RDF. To this aim, first each element of Vi (including its synonyms) is mapped to the corresponding entry in DBpedia. In many cases, such a mapping is already provided by BabelNet; whenever this does not happen, it can be obtained automatically through the DBpedia Lookup service (http://wiki.dbpedia.org/projects/dbpedia-lookup). Then, for each DBpedia entry, all the related concepts are retrieved.
In DBpedia, knowledge is structured according to the Linked Data principles, i.e., as an RDF graph built from triples. Each triple ⟨s(ubject), p(roperty), o(bject)⟩ states that a subject s has a property p, whose value is an object o. Both subjects and properties are resources (i.e., nodes in DBpedia's knowledge graph), whereas objects may be either resources or literals (i.e., values of some primitive data types, such as strings or numbers). Each triple represents the minimal component of the knowledge graph, which is built by merging triples together. Therefore, retrieving the related concepts for a given element x implies finding all the triples in which x is either the subject or the object.
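For instance, a triple like ⟨dbr:Rome, dbo:country, dbr:Italy⟩ states that the resource dbr:Rome has the property dbo:country, whose value is the resource dbr:Italy (here, and in the following, the prefixes dbo: and dbr: stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively).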
For each view Vi ∈ V, the procedure to extend it consists of the following three substeps (a sketch of the last two is given after this list):
1. Mapping: for each node n ∈ Vi, its corresponding DBpedia entry d is found.
2. Triple extraction: all the related triples ⟨d, p, o⟩ and ⟨s, p, d⟩, i.e., all the triples in which d is either the subject or the object, are retrieved.
3. View extension: for each retrieved triple ⟨d, p, o⟩ (resp., ⟨s, p, d⟩), Vi is extended by defining a node for the object o (resp., the subject s), if not already existing, linked to n through an arc labeled p.
These three tasks are repeated for all the views of V. As previously pointed out, this enrichment procedure is particularly important if |V| > 1, because the newly derived relationships could help to merge the thematic views that it was not possible to merge during Step 1. In particular, let Vi ∈ V and Vj ∈ V be two views of V, and let Vi′ and Vj′ be the corresponding extended views. If there exist two nodes nih ∈ Vi′ and njk ∈ Vj′ such that nih = njk (here, two nodes are considered equal if their names coincide), then they can be merged into one node; if this happens, Vi′ and Vj′ become connected. After all equal nodes of the views of V have been merged, either all the views of V have been merged into one view or not. In the former case, the process terminates with success. Otherwise, it is possible to conclude that no thematic view comprising all the topics specified by the user can be found. In this last case, our approach still returns the enriched views of V and leaves the user the choice to accept or reject them.
4 An Example Case

Here and in the following, we use the notation S.o to indicate the object o of the source S.

Fig. 1. Network-based representations of the four sources under consideration.

At this point, Step 2 is executed. As shown in Figure 4, first each term (synonyms included) is semantically aligned to the corresponding DBpedia entry (e.g., Ocean is linked to dbo:Sea, Area is linked to dbo:Location and dbo:Place).

Fig. 2. Ego networks corresponding to V.Ocean, C.Sea, W.Place, C.Place, V.Region and E.Location.

Fig. 4. The integrated thematic view.
5 Conclusion

In this paper, we have presented a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we have proposed a new approach to, at least partially, "structure" unstructured data. Finally, based on these two tools, we have defined a new approach to extracting thematic views from the sources of a data lake, consisting of two steps based on ego networks (Step 1) and semantic relationships (Step 2).
This paper is not intended as an ending point. Rather, it could be the starting point of a new family of approaches aiming at handling information systems in the new big data-oriented scenario. Proceeding in this direction, we first plan to define an unsupervised approach to extracting thematic views from a data lake. Then, we plan to define new approaches to supporting flexible and lightweight querying of the sources of a data lake, as well as approaches to schema matching, schema mapping, data reconciliation and integration strongly oriented to data lakes based mainly on unstructured data sources.
References