
An approach to extracting thematic views from highly heterogeneous sources of a data lake

Claudia Diamantini¹, Paolo Lo Giudice², Lorenzo Musarella², Domenico Potena¹, Emanuele Storti¹, and Domenico Ursino¹

¹ DII, Polytechnic University of Marche
² DIIES, University “Mediterranea” of Reggio Calabria

Abstract. In recent years, data lakes have emerged as an effective and efficient support for information and knowledge extraction from huge amounts of highly heterogeneous and quickly changing data sources. Data lake management requires the definition of new techniques, very different from the ones adopted for data warehouses in the past. One of the main issues to address in this scenario is the extraction of thematic views from the (very heterogeneous and generally unstructured) data sources of a data lake. In this paper, we propose a new network-based model to uniformly represent structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to, at least partially, “structure” unstructured data. Finally, we define a technique to extract thematic views from the sources of a data lake, based on similarity and other semantic relations among the metadata of data sources.

1 Introduction
In recent years, data lakes have emerged as an effective and efficient answer to the problem of extracting information and knowledge from huge amounts of highly heterogeneous and quickly changing data sources [13]. Data lake management requires the definition of new techniques, very different from the ones adopted for data warehouses in the past. These techniques may exploit the large set of metadata always supplied with data lakes; such metadata represent the core of data lakes and the main tool making them a very competitive framework in the big data era.
One of the main issues to address in a scenario comprising many data sources that are extremely heterogeneous in their format, structure and semantics is the extraction of thematic views from data sources [3], i.e., the construction of views concerning one or more topics of interest for the user, obtained by extracting and merging data coming from different sources. This problem has been largely investigated in the past for structured and semi-structured data sources stored in a data warehouse [26, 8, 23], which witnesses its extreme relevance. However, it is estimated that, currently, more than 80% of data sources are unstructured [9]. As a consequence, it is precisely this type of source that represents the main actor of the big data scenario and, consequently, of data lakes.
In this paper, we aim at providing a contribution in this setting. Specifically, we propose a supervised approach to extracting thematic views from highly heterogeneous sources of a data lake. Our approach represents all the data lake sources by means of a suitable network. Indeed, networks are very flexible structures that allow the modeling of almost all phenomena that researchers aim at investigating [7]. Thanks to this uniform representation of the data lake sources, the extraction of thematic views from them can be performed by exploiting graph-based tools. We call our approach “supervised” because it requires the user to specify the set of topics T = {T_1, T_2, ..., T_n} that should be present in the thematic view(s) to extract. Our approach consists of two steps. The former is mainly based on the structure of the involved sources; it exploits several notions typical of (social) network analysis, such as the notion of ego network, which represents the core of the proposed approach. The latter exploits a knowledge repository, which is used to discover new relationships, other than synonymies, among metadata, with the purpose of refining the integration of the different thematic views obtained after the first step. In this step, our approach relies on DBpedia (http://dbpedia.org).
This paper is organized as follows: Section 2 illustrates related literature. In
Section 3, we present the proposed approach. In particular, first we describe a
unifying model for data lake representation; then, we present our approach to
partially structuring unstructured sources; finally, we discuss the two steps of
our approach for thematic view extraction. In Section 4, we present our example
case, whereas, in Section 5, we draw our conclusions and discuss future work.

2 Related Literature

The new data lake scenario is characterized by several peculiarities that make
it very different from the data warehouse paradigm. Hence, it is necessary to
adapt (if possible) old algorithms conceived for data warehouses or to define
new approaches capable of handling and taking advantage of the specificities of
this new paradigm. However, most approaches proposed in the literature for data
integration, query answering and view extraction do not completely fit the data
lake paradigm. For instance, [8] proposes some techniques for building views on
semi-structured data sources based on some expected queries. Other researchers
focus on materialized views and, specifically, on throughput and execution time;
therefore, they define a priori a set of well-known views and then materialize them. Two surveys on this issue can be found in [16, 1]. The authors of [26]
investigate the same problem but they focus on XML sources. The approach
of [25] addresses the same issue by means of query rewriting; specifically, it
transforms a query Q into a set of new queries, evaluates them, and, then,
merges the corresponding answers to construct the materialized answer to Q.
[4] proposes an approach to constructing materialized views for heterogeneous
databases; it requires the presence of a static context and the pre-computation
of some queries.
3
DBpedia: http://dbpedia.org
Another family of approaches exploits materialized views to perform tree
pattern querying [24] and graph pattern queries [12]. Unfortunately, all these
approaches are well-suited for structured and semi-structured data, whereas they
are not scalable and lightweight enough to be used in a dynamic context or with
unstructured data. An interesting advance in this area can be found in [22]. Here,
the authors propose an incremental approach to address the graph pattern query
problem on both static and dynamic real-life data graphs. Other kinds of views
are investigated in [6] and [3]. In particular, this last paper creates virtual views of the data sources, which allow heterogeneous sources to be accessed without knowing many of their details.
Finally, semantic-based approaches have long been used to drive data inte-
gration in databases and data warehouses. More recently, in the context of big
data, formal semantics has been specifically exploited to address issues concern-
ing data variety/heterogeneity, data inconsistency and data quality in such a
way as to increase understandability [17]. In the data lake scenario, semantic
techniques have been successfully applied to more efficiently integrate and han-
dle both structured and unstructured data sources by aligning data silos and
better managing evolving data models. For instance, in [15], the authors discuss
a data lake system with a semantic metadata matching component for ontol-
ogy modeling, attribute annotation, record linkage, and semantic enrichment.
Furthermore, [14] presents a system to discover and enforce expressive integrity
constraints from data lakes. Similarly to what happens in our approach, knowledge graphs in RDF (http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/) are used to drive integration. To reach their objectives, these techniques usually rely on information extraction tools (e.g., Open Calais: http://www.opencalais.com) that may assist in linking metadata to uniform vocabularies (e.g., ontologies or knowledge repositories, such as DBpedia).

3 Description of the proposed approach

3.1 A unifying model for data lake representation

In this section, we illustrate our network-based model to represent and handle a data lake, which we will use in the rest of this paper. In our model, a data lake DL is represented as a set of m data sources: DL = {D_1, D_2, ..., D_m}. A data source D_k ∈ DL is provided with a rich set M_k of metadata. We denote by M_DL the repository of the metadata of all the data sources of DL: M_DL = {M_1, M_2, ..., M_m}.
According to [21], our model represents M_k by means of a triplet: M_k = ⟨M_k^T, M_k^O, M_k^B⟩. Here: (i) M_k^T denotes technical metadata; it represents the type, the format, the structure and the schema of the corresponding data, and it is commonly provided by the source catalogue. (ii) M_k^O represents operational metadata; it includes the source and target locations of the corresponding data, the associated file size, the number of records, and so on. Usually, it is automatically generated by the technical framework handling the data lake. (iii) M_k^B indicates business metadata; it comprises the business names and descriptions assigned to data fields. It also covers business rules, which can become integrity constraints for the corresponding data source.
In this paper, we consider only business metadata. Indeed, they denote, at
the intensional level, the information content stored in the data lake sources and
are those of interest for supporting the extraction of thematic views from a data
lake, which is our ultimate goal. In order to represent M_k^B, our model adopts a notation typical of XML, JSON and many other semi-structured models. According to this notation, Obj_k indicates the set of all the objects stored in M_k^B. It consists of the union of three subsets: Obj_k = Att_k ∪ Smp_k ∪ Cmp_k. Here: (i) Att_k indicates the set of the attributes of M_k^B; (ii) Smp_k represents the set of the simple elements of M_k^B; (iii) Cmp_k denotes the set of the complex elements of M_k^B. In this context, the meaning of the terms “attribute”, “simple element” and “complex element” is the one typical of semi-structured data models.

M_k^B can also be represented as a graph: M_k^B = ⟨N_k, A_k⟩. N_k is the set of the nodes of M_k^B. There exists a node n_kj ∈ N_k for each object o_kj ∈ Obj_k. According to the structure of Obj_k, N_k consists of the union of three subsets: N_k = N_k^Att ∪ N_k^Smp ∪ N_k^Cmp. Here, N_k^Att (resp., N_k^Smp, N_k^Cmp) indicates the set of the nodes corresponding to Att_k (resp., Smp_k, Cmp_k). There is a one-to-one correspondence between a node of N_k and an object of Obj_k; therefore, in the following, we will use the two terms interchangeably. Let x be a complex element of M_k^B: Obj_kx indicates the set of the objects directly contained in x, whereas N_k^Objx denotes the set of the corresponding nodes. Finally, let y be a simple element of M_k^B: Att_ky represents the set of the attributes of y, whereas N_k^Atty denotes the set of the corresponding nodes.

A_k indicates the set of the arcs of M_k^B. It consists of two subsets: A_k = A'_k ∪ A''_k. Here: (i) A'_k = {(n_x, n_y) | n_x ∈ N_k^Cmp, n_y ∈ N_k^Objx}, i.e., there is an arc from a complex element of M_k^B to each object directly contained in it; (ii) A''_k = {(n_x, n_y) | n_x ∈ N_k^Smp, n_y ∈ N_k^Attx}, i.e., there is an arc from a simple element of M_k^B to each of its attributes.
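As an illustration, the following sketch (in Python, using networkx) builds the graph ⟨N_k, A_k⟩ of a source from its complex-element containments (arcs of A'_k) and simple-element attributes (arcs of A''_k). The helper name and the source fragment are ours, for illustration only; they are not part of the model.

import networkx as nx

def build_metadata_graph(cmp_contains, smp_attrs):
    # cmp_contains: complex element -> objects it directly contains (arcs of A'_k)
    # smp_attrs:    simple element  -> its attributes              (arcs of A''_k)
    g = nx.DiGraph()
    for x, contained in cmp_contains.items():
        g.add_node(x, kind="Cmp")
        for o in contained:
            g.add_edge(x, o)                 # arc of A'_k
    for y, attrs in smp_attrs.items():
        g.add_node(y, kind="Smp")
        for a in attrs:
            g.add_node(a, kind="Att")
            g.add_edge(y, a)                 # arc of A''_k
    return g

# Hypothetical fragment of a "Climate" source:
g = build_metadata_graph(
    cmp_contains={"Climate": ["Place", "Sea"]},
    smp_attrs={"Place": ["City", "Altitude"], "Sea": ["Name"]},
)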

3.2 An approach to partially structuring unstructured sources

Our network-based model for representing and handling a data lake is perfectly suited to representing and managing semi-structured data, because it has been designed with XML and JSON in mind. Clearly, it is sufficiently powerful to represent structured data as well. The greatest difficulty concerns unstructured data, because it is worth avoiding a flat representation consisting of a simple element for each keyword provided to denote the source content. As a matter of fact, this kind of representation would make the reconciliation, and the subsequent integration, of an unstructured source with the other (semi-structured and structured) sources of the data lake very difficult. Therefore, it is necessary to (at least partially) “structure” unstructured data.
Our approach to addressing this issue consists of four phases, namely: (i)
creation of nodes; (ii) derivation and management of part-of relationships; (iii)
derivation of lexical and string similarities; (iv) management of lexical and string
similarities.
Phase 1. During this phase, our approach creates a complex element representing the source as a whole, and a simple element for each keyword. Furthermore, it adds an arc from the source to each of the simple elements. Initially, there is no arc between two simple elements; the arcs to add are determined in the next phases.
Phase 2. During this phase, our approach adds an arc from the node n_k1, corresponding to the keyword k_1, to the node n_k2, corresponding to the keyword k_2, if k_2 is registered as a lemma of k_1 in a suitable thesaurus (we use the term “lemma” according to the meaning it has in BabelNet [20]: given a term, its lemmas are other objects, such as terms and emoticons, contributing to specify its meaning). Taking the current trends into account, this thesaurus should be a multimedia one; for this purpose, in our experiments, we have adopted BabelNet. When this arc has been added, n_k1 must be considered a complex element, instead of a simple one.
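A minimal sketch of this phase follows (Python, operating on the networkx DiGraph built in Phase 1). The dictionary is a stand-in for the BabelNet lookup, whose actual API we do not reproduce here, and its content is purely illustrative.

lemmas = {"environment": ["nature", "eco"]}      # hypothetical lemma relation

def add_lemma_arcs(g, keywords):
    # Add an arc from n_k1 to n_k2 whenever k2 is registered as a lemma of k1
    for k1 in keywords:
        for k2 in lemmas.get(k1, []):
            if k2 in keywords:
                g.add_edge(k1, k2)
                g.nodes[k1]["kind"] = "Cmp"      # n_k1 becomes a complex element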
Phase 3. During this phase, our approach starts by deriving lexical similarities. In particular, it states that there exists a similarity between the node n_k1, corresponding to the keyword k_1, and the node n_k2, corresponding to the keyword k_2, if k_1 and k_2 have at least one common lemma in a suitable thesaurus. Also in this case, we have adopted BabelNet. After having found lexical similarities, our approach derives string similarities: it states that there exists a similarity between n_k1 and n_k2 if the string similarity degree kd(k_1, k_2), computed by applying a suitable string similarity metric to k_1 and k_2, is “sufficiently high” (see below). We have chosen N-Grams [18] as the string similarity metric because we have experimentally seen that it provides the best results in our context. We now illustrate in detail what “sufficiently high” means and how our approach operates. Let KeySim be the set of the string similarities for each pair of keywords of the source under consideration. Each record in KeySim has the form ⟨k_i, k_j, kd(k_i, k_j)⟩. Our approach first computes the maximum keyword similarity degree kd_max present in KeySim. Then, it examines each keyword similarity registered therein. Let ⟨k_1, k_2, kd(k_1, k_2)⟩ be one of these similarities. If kd(k_1, k_2) ≥ th_k · kd_max and kd(k_1, k_2) ≥ th_kmin, which implies that the keyword similarity degree between k_1 and k_2 is among the highest ones in KeySim and, in any case, higher than or equal to a minimum threshold, then the approach concludes that there exists a similarity between n_k1 and n_k2. We have experimentally set th_k = 0.70 and th_kmin = 0.50. At the end of this phase, our approach has found some (lexical and/or string) similarities, each stating that a node n_ki is similar to a node n_kj. A sketch of the string-similarity filter follows.
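In the sketch below (Python), the Dice coefficient over character bigrams stands in for the N-Grams metric of [18]; the thresholds are the ones set experimentally in the paper, while the function names are ours.

TH_K, TH_K_MIN = 0.70, 0.50                      # th_k and th_kmin

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def kd(k1, k2):
    # String similarity degree: Dice coefficient over character bigrams
    b1, b2 = bigrams(k1), bigrams(k2)
    return 2 * len(b1 & b2) / (len(b1) + len(b2)) if b1 and b2 else 0.0

def string_similarities(keywords):
    # KeySim: one record <ki, kj, kd(ki, kj)> per keyword pair (assumes >= 2 keywords)
    key_sim = [(k1, k2, kd(k1, k2))
               for i, k1 in enumerate(keywords) for k2 in keywords[i + 1:]]
    kd_max = max(d for _, _, d in key_sim)
    # Keep the pairs that are both among the highest and above the minimum threshold
    return [(k1, k2, d) for k1, k2, d in key_sim
            if d >= TH_K * kd_max and d >= TH_K_MIN]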
Phase 4. During this phase, our approach manages the similarities found during Phase 3. In particular, if there exists a (lexical and/or string) similarity between two nodes n_ki and n_kj, it merges them into one node n_kij, which inherits all the incoming and outgoing arcs of n_ki and n_kj. After all similarities have been considered, there could exist two or more arcs from a node n_ki to a node n_kj; in this case, our approach merges them into one arc.
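A sketch of this merging follows, again using networkx. contracted_nodes is the library's node-contraction primitive; duplicate arcs collapse automatically because g is a simple DiGraph, which matches the final arc-merging rule above.

import networkx as nx

def merge_similar_nodes(g, similarities):
    # Merge each similar pair (n_ki, n_kj) into one node inheriting all arcs
    for nki, nkj in similarities:
        if g.has_node(nki) and g.has_node(nkj) and nki != nkj:
            g = nx.contracted_nodes(g, nki, nkj, self_loops=False)
    return g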

3.3 An approach to extracting thematic views

Our approach to extracting thematic views operates on a data lake DL whose data sources are represented by means of the model described in Section 3.1. It is called “supervised” because it requires the user to specify the set of topics T = {T_1, T_2, ..., T_l} that should be present in the thematic view(s) to extract. It consists of two steps: the former is mainly based on the structure of the sources at hand, while the latter mainly focuses on the corresponding semantics. We describe these two steps in the next subsections.

Step 1. The first step of our approach receives a data lake DL, a set of topics T = {T_1, T_2, ..., T_l} representing the themes of interest for the user, and a dictionary Syn of synonymies involving the objects stored in the sources of DL. This dictionary could be a generic thesaurus, such as BabelNet [20], a domain-specific thesaurus, or a dictionary obtained by taking into account the structure and the semantics of the sources that the corresponding objects refer to (such as the dictionaries produced by XIKE [10], MOMIS [5] or Cupid [19]).
In this step, the concept of ego network [2, 11] plays a key role. We recall that an ego network consists of a focal node (the ego) and the nodes it is directly connected to (the “alters”), plus the ties, if any, between the alters.
Let T_i be a topic of T. Let Obj_i = {o_i1, o_i2, ..., o_iq} be the set of the objects synonymous with T_i in DL, and let N_i = {n_i1, n_i2, ..., n_iq} be the corresponding nodes. First, Step 1 constructs the ego networks E_i1, E_i2, ..., E_iq having n_i1, n_i2, ..., n_iq as the corresponding egos. Then, it merges all the egos into a unique node n_i. In this way, it obtains a unique ego network E_i from E_i1, E_i2, ..., E_iq. If a synonymy exists between two alters belonging to different ego networks, then these are merged into a unique node and the corresponding arcs linking them to the ego n_i are merged into a unique arc. At the end of this task, we have a unique ego network E_i corresponding to T_i (a sketch of this per-topic task follows below).
After having performed the previous task for each topic of T, we have a set E = {E_1, E_2, ..., E_l} of l ego networks. At this point, Step 1 finds all the synonymies of Syn involving objects of the ego networks of E and merges the corresponding nodes. After all the possible synonymies involving objects of the ego networks of E have been considered and the corresponding nodes have been merged, a set V = {V_1, ..., V_g}, 1 ≤ g ≤ l, of networks representing potential views is obtained.
If g = 1, then Step 1 has been capable of extracting a unique thematic view comprising all the topics required by the user; in this case, Step 2 is performed to make the unique view representing all the topics of T more precise and complete. If g > 1, there exist several views, each comprising some (but not all) of the topics of interest for the user; in this case, Step 2 aims at finding other relationships, different from synonymies, among the objects of the views of V, in such a way as to try to obtain a unique view embracing all the topics of interest for the user.
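The following sketch (Python with networkx; the function name is ours) builds the merged ego network E_i for a topic T_i, given the nodes of DL synonymous with T_i. It assumes the data lake network g has been built as in Section 3.1; the subsequent alter-level synonym merging can reuse the same node-contraction primitive.

import networkx as nx

def topic_ego_network(g, topic, synonym_nodes):
    # Ego networks E_i1 ... E_iq of the synonymous nodes (alters and their ties included)
    egos = [nx.ego_graph(g, n, radius=1, undirected=True) for n in synonym_nodes]
    ei = nx.compose_all(egos)                    # union of the ego networks
    ei = nx.relabel_nodes(ei, {synonym_nodes[0]: topic})
    for n in synonym_nodes[1:]:                  # merge all the egos into the unique node n_i
        ei = nx.contracted_nodes(ei, topic, n, self_loops=False)
    return ei

# For the example of Section 4:
# e_ocean = topic_ego_network(g, "Ocean", ["V.Ocean", "C.Sea"])
# e_area  = topic_ego_network(g, "Area", ["W.Place", "C.Place", "V.Region", "E.Location"])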

Step 2. This step starts by enriching each view V_i ∈ V. For this purpose, it connects each of its elements to all the semantically related concepts taken from a reference knowledge repository.
In this work, we rely on DBpedia, one of the largest knowledge graphs in the Linked Data context, including more than 4.58 million entities in RDF. To this aim, first each element of V_i (including its synonyms) is mapped to the corresponding entry in DBpedia. In many cases, such a mapping is already provided by BabelNet; whenever this does not happen, it can be obtained automatically through the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup). Then, for each DBpedia entry, all the related concepts are retrieved. In DBpedia, knowledge is structured according to the Linked Data principles, i.e., as an RDF graph built from triples. Each triple ⟨s(ubject), p(roperty), o(bject)⟩ states that a subject s has a property p, whose value is an object o. Both subjects and properties are resources (i.e., nodes in DBpedia's knowledge graph), whereas objects may be either resources or literals (i.e., values of some primitive data types, such as strings or numbers). Each triple represents the minimal component of the knowledge graph, which is built by merging triples together. Therefore, retrieving the related concepts for a given element x implies finding all the triples where x is either the subject or the object.
For each view V_i ∈ V, the procedure to extend it consists of the following three substeps (a minimal sketch of substep 2 follows the list):
1. Mapping: for each node n ∈ V_i, its corresponding DBpedia entry d is found.
2. Triple extraction: all the related triples ⟨d, p, o⟩ and ⟨s, p, d⟩, i.e., all the triples in which d is either the subject or the object, are retrieved.
3. View extension: for each retrieved triple ⟨d, p, o⟩ (resp., ⟨s, p, d⟩), V_i is extended by defining a node for the object o (resp., the subject s), if not already existing, linked to n through an arc labeled as p.
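The sketch below (Python with SPARQLWrapper, querying DBpedia's public SPARQL endpoint) illustrates the triple-extraction substep; the function name and the LIMIT are ours, and dbo:Sea stands in for a mapped entry d.

from SPARQLWrapper import SPARQLWrapper, JSON

def related_triples(entry):
    # Retrieve all triples in which `entry` is either the subject or the object
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?s ?p ?o WHERE {{
          {{ <{entry}> ?p ?o . }} UNION {{ ?s ?p <{entry}> . }}
        }}
        LIMIT 500
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    # An unbound ?s (resp., ?o) means `entry` itself plays that role
    return [(r.get("s", {"value": entry})["value"],
             r["p"]["value"],
             r.get("o", {"value": entry})["value"]) for r in rows]

# e.g., triples for the entry mapped from Ocean:
# print(related_triples("http://dbpedia.org/ontology/Sea")[:10])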
These three tasks are repeated for all the views of V. As previously pointed out, this enrichment procedure is particularly important if |V| > 1, because the newly derived relationships could help to merge the thematic views that could not be merged during Step 1. In particular, let V_i ∈ V and V_j ∈ V be two views of V, and let V'_i and V'_j be the corresponding extended views. If there exist two nodes n_ih ∈ V'_i and n_jk ∈ V'_j such that n_ih = n_jk (here, two nodes are equal if the corresponding names coincide), then they can be merged into one node; if this happens, V'_i and V'_j become connected. After all equal nodes of the views of V have been merged, the views of V may or may not have been merged into one view. In the former case, the process terminates with success. Otherwise, it is possible to conclude that no thematic view comprising all the topics specified by the user can be found. In this last case, our approach still returns the enriched views of V and leaves the user the choice to accept or reject them.
4 An example case

In this section, we present an example case aimed at illustrating the various tasks of our approach. Here, we consider: (i) a structured source, called Weather Conditions (W, in short), whose corresponding E/R schema is not reported for space limitations; (ii) two semi-structured sources, called Climate (C, in short) and Environment (E, in short), whose corresponding XML Schemas are not reported for space limitations; (iii) an unstructured source, called Environment Video (V, in short), consisting of a YouTube video and whose corresponding keywords are: garden, flower, rain, save, earth, tips, recycle, aurora, planet, garbage, pollution, region, life, plastic, metropolis, environment, nature, wave, eco, weather, simple, fine particle, climate, ocean, environment awareness, educational, reduce, power, bike.
By applying the approaches mentioned in Section 3.2, we obtain the corresponding representations in our network-based model, shown in Figure 1.

Fig. 1. Network-based representations of the four sources under consideration.
Assume, now, that a user specifies the following set T of topics of her interest: T = {Ocean, Area}. First, our approach determines the terms (and, then, the objects) in the four sources that are synonyms of Ocean and Area. As for Ocean, the only synonym present in the sources is Sea; as a consequence, Obj_1 comprises the node Ocean of the source V (V.Ocean; here and in the following, we use the notation S.o to indicate the object o of the source S) and the node Sea of the source C (C.Sea). An analogous activity is performed for Area. At the end of this task, we have Obj_1 = {V.Ocean, C.Sea} and Obj_2 = {W.Place, C.Place, V.Region, E.Location}.
Step 1 of our approach proceeds by constructing the ego networks corresponding to the objects of Obj_1 and Obj_2. They are reported in Figure 2.

Fig. 2. Ego networks corresponding to V.Ocean, C.Sea, W.Place, C.Place, V.Region and E.Location.

Now, consider the ego networks corresponding to V.Ocean and C.Sea. Our approach merges the two egos into a unique node. Then, it verifies whether further synonymies exist between the alters. Since none exists, it returns the ego network shown in Figure 3(a). The same task is performed on the ego networks corresponding to W.Place, C.Place, V.Region and E.Location. In particular, first the four egos are merged. Then, the synonymies between the alters W.City and C.City and between the alters W.Altitude and C.Altitude are retrieved. Based on these, W.City and C.City are merged into one node, and W.Altitude and C.Altitude into another; the arcs linking the ego to W.City and C.City are merged into one arc, and the ones linking the ego to W.Altitude and C.Altitude into another. In this way, the ego network shown in Figure 3(b) is returned. At this point, there are two ego networks, E_Ocean and E_Area, each corresponding to one of the terms specified by the user.

Fig. 3. Ego networks corresponding to Ocean and Area.

Now, Step 1 verifies whether there is any synonymy between a node of E_Ocean and a node of E_Area. Since this does not happen, it terminates and returns the set V = {V_Ocean, V_Area}, where V_Ocean (resp., V_Area) coincides with E_Ocean (resp., E_Area).

At this point, Step 2 is executed. As shown in Figure 4, first each term (synonyms included) is semantically aligned to the corresponding DBpedia entry (e.g., Ocean is linked to dbo:Sea, Area is linked to dbo:Location and dbo:Place, while Country to dbo:Country; prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively). After a single iteration, the triples ⟨dbo:sea, rdfs:range, dbo:Sea⟩ and ⟨dbo:sea, rdfs:domain, dbo:Place⟩ are retrieved. They correspond to the DBpedia property sea, which relates a country to the sea that borders it. Other connections can be found by moving to specific instances of the mentioned resources. In this way, the following triples are retrieved: ⟨instance, rdf:type, dbo:Sea⟩, ⟨instance, rdf:type, dbo:Location⟩ and ⟨instance, rdf:type, dbo:Place⟩, meaning that there are resource instances having types Sea, Location and Place simultaneously (e.g., dbr:Mediterranean_Sea). Furthermore, a triple ⟨instance, dbo:country, dbo:Country⟩ can be retrieved, meaning that those instances being a Sea, a Location or a Place specify, through the dbo:country property, the dbo:Country in which they are located.

Fig. 4. The integrated thematic view.

In this example case, Step 2 succeeded in merging the two views that Step 1 had kept separated.

5 Conclusion
In this paper, we have presented a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we have proposed a new approach to, at least partially, “structuring” unstructured data. Finally, based on these two tools, we have defined a new approach to extracting thematic views from the sources of a data lake, consisting of two steps based on ego networks (Step 1) and semantic relationships (Step 2).
This paper is not to be intended as an ending point; rather, it could be the starting point of a new family of approaches aiming at handling information systems in the new big data-oriented scenario. Proceeding in this direction, we first plan to define an unsupervised approach to extracting thematic views from a data lake. Then, we plan to define new approaches to supporting a flexible and lightweight querying of the sources of a data lake, as well as approaches to schema matching, schema mapping, data reconciliation and integration strongly oriented to data lakes based mainly on unstructured data sources.

References

1. S. Abiteboul and O.M. Duschka. Complexity of answering queries using materialized views. In Proc. of the International Symposium on Principles of Database Systems (SIGMOD/PODS'98), pages 254-263, Seattle, WA, USA, 1998. ACM.
2. V. Arnaboldi, M. Conti, A. Passarella, and F. Pezzoni. Analysis of ego network
structure in online social networks. In Proc. of the International Conference on
Privacy (PASSAT’12), pages 31–40, Amsterdam, Netherlands, 2012. IEEE.
3. L. Aversano, R. Intonti, C. Quattrocchi, and M. Tortorella. Building a virtual
view of heterogeneous data source views. In Proc. of the International Conference
on Software and Data Technologies (ICSOFT’10), pages 266–275, Athens, Greece,
2010. INSTICC Press.
4. C. Bachtarzi and F. Bachtarzi. A model-driven approach for materialized views
definition over heterogeneous databases. In Proc. of the International Conference
on New Technologies of Information and Communication (NTIC’15), pages 1–5,
Mila, Algeria, 2015. IEEE.
5. S. Bergamaschi, S. Castano, M. Vincini, and D. Beneventano. Semantic integration
and query of heterogeneous information sources. Data & Knowledge Engineering,
36(3):215–249, 2001.
6. J. Biskup and D. Embley. Extracting information from heterogeneous information
sources using ontologically specified target views. Information Systems, 28(3):169–
212, 2003. Elsevier.
7. M. R. Bouadjenek, H. Hacid, and M. Bouzeghoub. Social networks and information
retrieval, how are they converging? A survey, a taxonomy and an analysis of social
information retrieval approaches and platforms. Information Systems, 56:1–18,
2016.
8. S. Castano and V. De Antonellis. Building views over semistructured data sources.
In Proc. of the International Conference on Conceptual Modeling (ER’99), pages
146–160, Paris, France, 1999. Springer.
9. A. Corbellini, C. Mateos, A. Zunino, D. Godoy, and S.N. Schiaffino. Persisting
big-data: The NoSQL landscape. Information Systems, 63:1–23, 2017. Elsevier.
10. P. De Meo, G. Quattrone, G. Terracina, and D. Ursino. Integration of XML
Schemas at various “severity” levels. Information Systems, 31(6):397–434, 2006.
11. R. DeJordy and D. Halgin. Introduction to ego network analysis. Boston MA:
Boston College and the Winston Center for Leadership & Ethics, 2008.
12. W. Fan, X. Wang, and Y. Wu. Answering pattern queries using views. IEEE
Transactions on Knowledge and Data Engineering, 28(2):326–341, 2016. IEEE.
13. H. Fang. Managing data lakes in big data era: What’s a data lake and why has it be-
came popular in data management ecosystem. In Proc. of the International Confer-
ence on Cyber Technology in Automation (CYBER’15), pages 820–824, Shenyang,
China, 2015. IEEE.
14. M. Farid, A. Roatis, I.F. Ilyas, H. Hoffmann, and X. Chu. CLAMS: bringing quality
to Data Lakes. In Proc. of the International Conference on Management of Data
(SIGMOD/PODS’16), pages 2089–2092, San Francisco, CA, USA, 2016. ACM.
15. R. Hai, S. Geisler, and C. Quix. Constance: An intelligent data lake system. In Proc.
of the International Conference on Management of Data (SIGMOD/PODS’16),
pages 2097–2100, San Francisco, CA, USA, 2016. ACM.
16. A. Halevy. Answering queries using views: A survey. The VLDB Journal,
10(4):270–294, 2001. Springer.
17. P. Hitzler and K. Janowicz. Linked Data, Big Data, and the 4th Paradigm. Se-
mantic Web, 4(3):233–235, 2013.
18. G. Kondrak. N-gram similarity and distance. In String processing and information
retrieval, pages 115–126, 2005. Springer.
19. J. Madhavan, P.A. Bernstein, and E. Rahm. Generic schema matching with Cupid.
In Proc. of the International Conference on Very Large Data Bases (VLDB 2001),
pages 49–58, Rome, Italy, 2001. Morgan Kaufmann.
20. R. Navigli and S.P. Ponzetto. BabelNet: The automatic construction, evaluation
and application of a wide-coverage multilingual semantic network. Artificial Intel-
ligence, 193:217–250, 2012. Elsevier.
21. A. Oram. Managing the Data Lake. Sebastopol, CA, USA, 2015. O’Reilly.
22. K. Singh and V. Singh. Answering graph pattern query using incremental views.
In Proc. of the International Conference on Computing (ICCCA’16), pages 54–59,
Greater Noida, India, 2016. IEEE.
23. S. Spaccapietra and C. Parent. View integration: A step forward in solving struc-
tural conflicts. IEEE Transactions on Knowledge and Data Engineering, 6(2):258–
274, 1994.
24. J. Wang, J. Li, and J.X. Yu. Answering tree pattern queries using views: a re-
visit. In Proc. of the International Conference on Extending Database Technology
(EDBT/ICDT’11), pages 153–164, Uppsala, Sweden, 2011. ACM.
25. J. Wang and J.X. Yu. Revisiting answering tree pattern queries using views. ACM
Transactions on Database Systems, 37(3):18, 2012. ACM.
26. X. Wu, D. Theodoratos, and W.H. Wang. Answering XML queries using material-
ized views revisited. In Proc. of the International Conference on Information and
Knowledge Management (CIKM ’09), pages 475–484, Hong Kong, China, 2009.
ACM.
