Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching
Abstract. Data Lakes (DLs) are large repositories of raw datasets from
disparate sources. As more datasets are ingested into a DL, there is an
increasing need for efficient techniques to profile them and to detect the
relationships among their schemata, commonly known as holistic schema
matching. Schema matching detects similarity between the information
stored in the datasets to support information discovery and retrieval.
At the volume of state-of-the-art DLs, this is computationally expensive. To handle this challenge, we propose a novel early-pruning
approach to improve efficiency, where we collect different types of content
metadata and schema metadata about the datasets, and then use this
metadata in early-pruning steps to pre-filter the schema matching com-
parisons. This involves computing proximities between datasets based
on their metadata, discovering their relationships based on overall prox-
imities and proposing similar dataset pairs for schema matching. We
improve the effectiveness of this task by introducing a supervised mining
approach for effectively detecting similar datasets which are proposed
for further schema matching. We conduct extensive experiments on a real-world DL, which demonstrate that our approach effectively detects similar datasets for schema matching, with recall rates above 85% and efficiency improvements above 70%. We empirically show
the computational cost saving in space and time by applying our ap-
proach in comparison to instance-based schema matching techniques.
1 Introduction
Today, it is more and more common for data scientists to use Data Lakes (DLs)
to store heterogeneous datasets coming from different sources in their raw format
[35]. Such data repositories support the new era of data analytics where datasets
are ingested in large amounts and are required to be analysed just-in-time [20].
However, it is a challenge for data wranglers [13,16,35] preparing the datasets
[Fig. 1: overview of the stratified matching steps — dataset-level matching (the DS-Prox box, dataset proximity mining) and attribute-level matching driven by dataset meta-features and content-based metadata, trading matching goal against cost to propose dataset relationships]
mining approach (see Section 4) between datasets. The similarity functions are
based on automatically extracted metadata and can be effectively used for pre-
filtering in a stratified approach (see the DS-Prox and Attribute-Prox boxes
attached to the matching steps in Fig. 1). To our knowledge, no other approach
uses automatically extracted metadata for this purpose.
To show the feasibility of our approach, we assess its performance by using it
on a real-world DL. Our approach was able to reduce the number of fine-grained comparisons by about 75% while maintaining a recall rate of at least 85% after
filtration. Our early-pruning approach also saves computational costs in terms
of space and time requirements by at least 2 orders of magnitude compared to
instance-based matching.
Contributions. We present an approach for pre-filtering schema matching tasks. We propose techniques for detecting similar schemata based on metadata at different levels of granularity, which supports early pruning of raw-data, instance-based schema matching tasks. We present an expanded and cross-validated experiment for the DS-Prox technique from our previous work [5], and compare it against combinations with our newly proposed attribute-level proximity metrics to find the most appropriate metrics for assigning similarities between pairs of datasets. We present a detailed analysis of the different proximity metrics based on the different types of meta-features (name-based and content-based). Our improvements outperform our previous work in terms of effectiveness measures such as recall and lift scores.
The paper is organised as follows: Section 2 presents the related work,
Section 3 introduces the main concepts in our research, Section 4 presents our
proximity mining based approach for early-pruning tasks of holistic schema
matching, Section 5 presents our experimental evaluation, and finally, we con-
clude in Section 6.
2 Related Work
granularity, e.g., overall dataset level [5] or attribute descriptions like we propose
in this paper.
The pre-filtering of dataset pairs which are less-likely to have interrelated
data before performing schema-matching is called early-pruning [4,12,7], and it
was implemented in previous research on semi-structured datasets, like XML, by
finding the similarity of hierarchical structures between named data objects [3].
Other works have investigated schema matching with semi-structured datasets
like XML [21,25] and JSON [14]. In the web of data, previous research like [26]
investigated recommendation of RDF datasets in the semantic web using pre-
defined annotations such as the sameAs property. In this paper, we consider
flat datasets without such a hierarchy of embedded data objects and without
pre-defined semantic linkages.
To facilitate the early-pruning tasks for schema matching, we can apply the
same approaches and concepts from collaborative filtering and adapt them to
the holistic schema matching problem [2,15]. The goal is to use profiling infor-
mation for comparison and recommendation, which was applied to multimedia
in [2] and semi-structured documents in [14]. Content-based metadata was also
used to predict schema labels [11]. They use minimum and maximum values for
numeric attributes, and exact values for nominal attributes, including the format
of values. We propose to apply similar techniques but at the dataset granularity
level. Accordingly, we adapt such techniques and find the appropriate similarity
metrics for tabular datasets.
Another line of research aims at optimising the schema matching process by
using computational improvements [12,28]. This can be done using partitioning
techniques that parallelise the schema matching comparison task [28]. Another
approach uses efficient string matching comparisons. Such techniques are very
useful in the case when schema components are properly named in the datasets.
However, such techniques fail when the components are not meaningfully named (e.g., following internal naming conventions such as sequential ID numbering of the components). In
[12], they introduce intelligent indexing techniques based on value-based signa-
tures.
Schema matching can also be automated using data mining techniques [10,11,14].
In [10], they use hybrid name-based and value-based classification models to
match dataset attributes to a mediated integration schema. Their approach is
focused on one-to-one mediation between two schemata, while our approach tar-
gets all datasets in a DL by holistic schema matching using coarser meta-features.
In [11], they use content-based meta-features in multi-value classification mod-
els to match schema labels of attributes across datasets. Decision trees were
also used to profile semi-structured documents before schema matching [14]. In
this paper, we also use data mining classification models, however the goal dif-
fers from [10], [11] and [14] as it tackles the early-pruning and pre-filtering task
rather than instance-based schema matching.
We summarise the state-of-the-art in Table 1. This table compares the most relevant techniques discussed against our approach, based on the main features discussed in this section. As a result, we can see that we propose
an approach based not only on string matching, but also on content metadata
matching involving statistics and profiles of data stored in the datasets. Con-
tent metadata are matched based on approximate similarities and not just exact
value-matches at the instance-level [20,23]. We focus on proposing novel early-
pruning techniques that use supervised learning to pre-filter irrelevant dataset
pairs and to detect likely-to-be similar pairs. Finally, the table shows that our
technique makes a novel contribution to the schema matching pre-filtering prob-
lem that is not achieved by other state-of-the-art techniques.
3 Preliminaries
We consider DLs with datasets having tabular schemas that are structured as attributes and instances, as in Fig. 2. We formally define a dataset D as a set of instances D = {I_1, I_2, ..., I_n}. Each dataset has a schema S = {A_1, A_2, ..., A_m}, where each attribute A_i has a data type and describes a single property of the instances in the dataset. We focus on two types of attributes: numeric attributes
(consisting of real numbers) and nominal attributes (consisting of discrete cat-
egorical values). We differentiate between those two types of attributes, similar
to previous research like [10,11,25], because we collect different profiling metadata for them. The resulting statistics are called content meta-features (e.g., value ranges and mean values for numeric attributes, and numbers of distinct values for nominal attributes).
[Fig. 3: attribute-level and dataset-level proximity computation — meta-feature proximity metrics P_m and Levenshtein name distances feed the classifiers M_cls-num-attr / M_cls-nom-attr (attribute level) and M_cls-ds (dataset level), which output Sim(Ai, Aj) and Sim(Dy, Dz) respectively]
with an aggregation function Agg, such as averaging the Sim(A_i, A_j) scores. When predicting Rel(D_y, D_z), the dataset pair will typically have partially overlapping information in some of their attributes, i.e., ∃A_i ∈ D_y, A_j ∈ D_z such that Rel(A_i, A_j) = 1. An example would be a
pair of datasets describing different human diseases, (e.g., diabetes and hyper-
tension). The datasets will have similar attributes (partially) overlapping their
information like the patient’s age, gender, and some common lab tests like blood
samples.
The intermediate output leading to Rel(Ai , Aj ) and Rel(Dy , Dz ) in our pro-
posed proximity-based approach, seen in Fig. 3, is a similarity score consisting
of a real number in the range of [0, 1], which we indicate using Sim(Ai , Aj ) and
Sim(Dy , Dz ) respectively.
The similarity scores are computed based on proximity models we construct
using ensemble supervised learning techniques [34], which we denote as Mcls−ds
for models handling dataset-level metadata and Mcls−num−attr or Mcls−nom−attr
for models handling attribute-level metadata (depending on the attribute type,
numerical or nominal respectively). The models take as input the distance be-
tween the meta-features describing content of each object pair, whether dataset
pair or attribute pair for Sim(Dy , Dz ) and Sim(Ai , Aj ) respectively, and we call
the distance in a specific meta-feature ‘m’ a proximity metric, which is denoted as P^D_m(D_y, D_z) for dataset pairs or P^A_m(A_i, A_j) for attribute pairs. The names of
objects can also be compared using Levenshtein string distance comparison [22]
to generate a distance score. The output from the models is a score we compute
using the positive class distribution (see Section 4.2).
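As an illustration of the name-based comparison, a minimal Python sketch of a Levenshtein name distance between two object names is given below; the normalisation by the longer name length is our assumption, since the exact form of the name-based proximity metric (Equation 4) is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings [22]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_distance(name_i: str, name_j: str) -> float:
    """Name-based distance in [0, 1]; 0 means identical names (normalisation assumed)."""
    if not name_i and not name_j:
        return 0.0
    return levenshtein(name_i.lower(), name_j.lower()) / max(len(name_i), len(name_j))


print(name_distance("age", "age"))      # -> 0.0 (identical attribute names)
print(name_distance("gender", "type"))  # high distance: dissimilar attribute names
```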
We convert the intermediate similarity scores to the final output consisting
of a boolean value for the binary relationships using Equations (1) and (2). The
sim score computed for each relationship type is checked against a minimum
threshold in the range of [0, 1] to indicate whether the pair involved is overall
related ‘1’ or unrelated ‘0’, and therefore whether they should be proposed for
expensive schema matching or otherwise filtered out. Using cut-off thresholds
of similarity rankings for the collaborative filtering task and schema matching
is a common practice [12,15,20]. We can use different thresholds c_d and c_a for the relationships evaluated at the dataset level and the attribute level, respectively. This means that we only consider a pair similar if their similarity
score is greater than the threshold as in Equations (1) and (2).
\[
Rel(D_y, D_z) = \begin{cases} 1, & Sim(D_y, D_z) > c_d \\ 0, & \text{otherwise} \end{cases} \quad (1)
\qquad
Rel(A_i, A_j) = \begin{cases} 1, & Sim(A_i, A_j) > c_a \\ 0, & \text{otherwise} \end{cases} \quad (2)
\]
To summarise Fig. 3, the hierarchy to compute the final output is: Rel is
based on Sim similarity scores, which in turn are based on Pm proximity metrics
of meta-features. To convert from Pm to Sim we use an ensemble supervised
model Mcls which takes the Pm proximity metrics as input. The output Sim
is compared against a minimum threshold, and those passing the threshold are
positive cases for Rel to be considered for further detailed schema matching.
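To make the use of Equations (1) and (2) concrete, the following sketch (in Python, with hypothetical similarity scores but the threshold values used in this section's examples) converts Sim scores into binary Rel decisions.

```python
def rel(sim_score: float, threshold: float) -> int:
    """Equations (1)/(2): a pair is related ('1') only if its similarity exceeds the threshold."""
    return 1 if sim_score > threshold else 0

c_d, c_a = 0.67, 0.75  # dataset-level and attribute-level cut-off thresholds (example values)

# Hypothetical similarity scores produced by the proximity models
sim_datasets = {("D2", "D3"): 0.70, ("D1", "D3"): 0.20}
sim_attributes = {("A6", "A11"): 0.95, ("A1", "A7"): 0.10}

proposed = [pair for pair, s in sim_datasets.items() if rel(s, c_d) == 1]
print(proposed)  # -> [('D2', 'D3')]: only this pair is proposed for detailed schema matching

matched_attrs = [pair for pair, s in sim_attributes.items() if rel(s, c_a) == 1]
print(matched_attrs)  # -> [('A6', 'A11')]
```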
[Fig. 2: example datasets with their attributes — e.g., A1: salary {25k–600k} and A2: age {20–97} in D1; A6: type {f,m} and A7: age {0–100} in D2; A11: gender {female,male} and A12: Ethnicity {AS,AF,ER,LT} in D3; matching attribute pairs such as Rel(A1,A10)=1 and Rel(A6,A11)=1 are marked with arrows]
P^A_m = 0, because they are identical (i.e., 2 − 2 = 0), thus making the attribute
pair similar (in this case, by their number of distinct values). If we consider this
proximity metric of ‘number of unique values’ alongside other collected content-
based meta-features using an ensemble supervised learning model Mcls , we can
compute a Sim(A6 , A11 ) score based on the positive-class distribution (see Sec-
tion 4.2). This can lead to Sim(A6 , A11 ) = 0.95 and if we use a threshold of
ca = 0.75 then the final output for Rel(A6 , A11 ) = 1. A numeric attribute like
‘A7’ in D2 holds similar data as attributes ‘A13’ and ‘A14’ from D3 , as ex-
pressed by the intersecting numeric ranges. For such numeric attributes we can
consider a meta-feature like ‘mean value’. On the other hand, attributes ‘A1’
and ‘A7’ have different data profiles (different numeric ranges) and therefore are
not labelled with an arrow and do not satisfy the Rel(A1 , A7 ) relationship, as
they will have large differences in their meta-features, leading to high proximity metric values and a low similarity score. In those examples, we collect attribute level
meta-features from the datasets (in this case, the number of distinct values for
nominal attributes and means for numeric attributes) to assess the similarity
between attributes of a given pair of datasets. In our approach, we compute the
similarity between attributes, Sim(A_i, A_j), as a real number in the range [0, 1], and we use it to predict Rel(D_y, D_z) instead of using the binary output of Rel(A_i, A_j). We aggregate the individual attribute pairs' similarities with an aggregation function Agg to obtain a single-valued proximity metric for the overall dataset-level similarity. We discuss this in the description
of our approach in Section 4.
Furthermore, we extract higher-granularity dataset level meta-features (e.g.,
‘number of attributes per attribute type’) from the datasets for the task of di-
rectly computing the Sim(Dy , Dz ) similarity relationships. For example, Rel(D2 , D3 )
returns ‘1’ if we use c_d = 0.67, because they have 2 nominal and 3 numeric attributes each, so overall they can have Sim(D_2, D_3) = 0.7, passing the minimum threshold. Based on Rel(D_2, D_3) = ‘1’, our approach indicates that these two datasets are possibly related and should be considered for further scrutiny by schema matching.
Equation 3 shows the proximity metric computed for a pair of attributes (or
datasets), denoted as Oi , Oj . Using the meta-features described, we compute the
z-score distance for each meta-feature m. The result is a real number, Pm . The
z-score is a normalisation where we use the mean ‘µ’ and standard deviation ‘σ’
of each meta-feature considering its value from all datasets in the DL. A value
of 0 is the most similar, while a larger negative or positive value indicates a greater difference. The z-score is used to standardise the comparisons of attributes in a holistic manner that considers all datasets and attributes in the DL. Most pairs of attributes and datasets will have a value falling in the range of [−3, 3].
\[
P_m = \mathit{zscore\_distance}(O_i, O_j) = \frac{m(O_i) - \mu}{\sigma} - \frac{m(O_j) - \mu}{\sigma} \qquad (3)
\]
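A minimal Python sketch of Equation 3 follows, where the mean and standard deviation of the meta-feature are taken over all objects in the DL; the profile values shown are hypothetical.

```python
import statistics

def zscore_distance(value_i, value_j, all_values):
    """Equation 3: difference of the two objects' z-scores for one meta-feature,
    standardised by the mean and standard deviation over the whole DL."""
    mu = statistics.mean(all_values)
    sigma = statistics.stdev(all_values)
    return (value_i - mu) / sigma - (value_j - mu) / sigma

# Example meta-feature: number of distinct values of nominal attributes in the DL
distinct_counts = [2, 2, 4, 10, 50, 3, 2, 7]   # hypothetical profiling results
p_m = zscore_distance(2, 2, distinct_counts)    # e.g., attributes A6 and A11
print(p_m)  # -> 0.0: identical on this meta-feature, i.e., most similar
```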
[Figure (build phase): the Data Lake samples OML01 (attribute pairs) and OML02 (dataset pairs) are profiled at the attribute and dataset levels; attribute-level meta-feature proximity metrics P_m are used to build the attribute-level classification models M_cls_num_attr and M_cls_nom_attr, whose Sim(Ai, Aj) outputs are aggregated (Agg) to the dataset level; together with dataset-level meta-feature proximity metrics, these are used to build and evaluate the dataset-level classification model M_cls_ds for Rel(Dy, Dz)]
Section 5.1). The pairs in sample OML01 must already be annotated to indicate whether their attributes match (i.e., Rel(A_i, A_j) for the attribute-level models), and the pairs in sample OML02 to indicate whether the datasets are relevant for schema matching (i.e., Rel(D_y, D_z) for the dataset-level models). We start with the attribute
level supervised learning procedure as it is only an auxiliary subcomponent used
for the dataset level, where an aggregation step is used to compute dataset level
proximities. First, we divide the pairs into training and test sets, we train a su-
pervised learning ensemble model for each attribute type (nominal and numeric
types) using the training sample, and we test the performance of the model on
the test set (evaluation distinguished by dotted lines and circles in the figure).
We conduct this test to guarantee that the models generated are accurate in
detecting Rel(Ai , Aj ). Similarly, we do the same with the dataset level super-
vised models which generate Rel(Dy , Dz ). We use the dataset level proximity
metrics and the attribute level aggregated proximity metrics together to train
a supervised model using a training sub-sample of dataset pairs from OML02.
Finally, we evaluate the generated dataset level supervised models to guarantee
their accuracy in detecting Rel(Dy , Dz ).
Supervised learning. We build the proximity models using classical supervised learning. The meta-features are used as input to
the models as seen in Fig. 6, where an object could be an attribute for attribute-
level models or a dataset for dataset-level models. First, for each object we
extract its meta-features (i.e., ‘m1’, ‘m2’, ...). Then, for each object, we gener-
ate all pairs with each of the other objects and compute the proximity metrics
between their meta-features using either Equation 3 for content-based meta-
features or Equation 4 for the name-based comparison. We then take a sample
of pairs of objects which are analysed by a data analyst, i.e., a human annotator
who manually decides whether the pairs of objects satisfy (assign ‘1’) or not
(assign ‘0’) the Rel properties (see Section 3). This can be achieved by simply
labelling the objects with their respective subject-areas and those falling under
the same one are annotated as positively matching ‘1’, otherwise all others are
Fig. 6: Proximity Mining: supervised machine learning for predicting related data
objects.
labelled with ‘0’ (see Section 5.1). We then use supervised learning techniques
and 10-fold cross-validation over the proximity metrics to create two types of
independent models which can classify pairs of datasets or pairs of attributes
according to Rel(Dy , Dz ) and Rel(Ai , Aj ) respectively. This is the final output
consisting of the two auxiliary supervised models Mcls−nom−attr , Mcls−num−attr
for Rel(Ai , Aj ) and the main dataset level model Mcls−ds for Rel(Dy , Dz ). The
positive-class distribution from the generated models is used to score new pairs
of objects (unseen in the training process) with a similarity score Sim(Dy , Dz )
using Mcls−ds , and Sim(Ai , Aj ) using Mcls−nom−attr or Mcls−num−attr .
We use a random forest ensemble algorithm [9,34] to train the supervised
models in predicting related attribute and dataset pairs as it is one of the most
successful supervised learning techniques. The algorithm generates a similarity
score based on the positive-class distribution (i.e., the predicted probability of
the positive-class based on weighted averages of votes for the positive class from
all the sub-models in the ensemble model) to generate a score in [0, 1] for Sim.
For example, if the random forest generates 1000 decision trees, and for a pair of datasets [D_y, D_z] 900 trees vote positive for Rel(D_y, D_z), then we get 900/1000 = 0.9 as the Sim(D_y, D_z) score.
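A hedged sketch of how such a model could be built with scikit-learn is shown below; the feature matrix holds one row per object pair and one column per proximity metric, and `predict_proba` plays the role of the positive-class distribution. The toy values, feature set, and number of folds are illustrative only and differ from the actual experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# One row per object pair, one column per proximity metric P_m (hypothetical values)
X = np.array([[0.1, 0.3, 0.0],
              [2.5, 1.8, 0.9],
              [0.2, 0.1, 0.1],
              [3.1, 2.2, 0.7]])
y = np.array([1, 0, 1, 0])   # manual Rel annotations: '1' related, '0' unrelated

m_cls = RandomForestClassifier(n_estimators=1000, random_state=0)
print(cross_val_score(m_cls, X, y, cv=2))  # the paper uses 10-fold cross-validation

m_cls.fit(X, y)
# Sim score = positive-class distribution, i.e., the fraction of trees voting '1'
new_pair = np.array([[0.15, 0.25, 0.05]])
positive_index = list(m_cls.classes_).index(1)
sim = m_cls.predict_proba(new_pair)[0, positive_index]
print(sim)  # e.g., 900 of 1000 positive votes -> 0.9
```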
We feed the supervised learning algorithm the normalised proximity metrics
of the meta-features for pairs of datasets [Dy , Dz ]. For attribute level meta-
features, we feed the Mcls−ds model with all the different aggregations of the
meta-features after computing their normalised proximity metrics (i.e., after
applying Equation 7, which we describe later in this section).
any meaning (e.g., discrete ID numbers) or those attribute pairs with too low
proximity to be significant.
Thus, the top-matching pairs of attributes are sorted by their proximity and fed to the aggregation function, which allocates each pair a weight in [0, 1] for the weighted summation. The weights must sum to 1.0. The different aggregations we use are as follows:
– Minimum: we allocate all the weight (i.e., W = 1.0) to the single attribute
pair link with the minimum similarity, and we consider this as the overall
proximity between the dataset pair. Therefore, all top-matching attribute
pair links need to have a high similarity score to result in a high proximity
for a dataset pair.
– Maximum: we allocate all the weight (i.e., W = 1.0) to the single attribute
pair link with the maximum similarity, and we consider this as the overall
proximity between the dataset pair. Therefore, only one top-matching at-
tribute pair link needs to have a high similarity score to result in a high
proximity for a dataset pair.
– Euclidean: a Euclidean aggregation of the similarities Sim of all matching
pairs of attributes without any weighting as in Equation 5. Here we consider
all the attribute pair links in the aggregation and we assign equal weights to
all the links.
\[
P^{D}_{m} = \sqrt{\sum_{i=1,\, j=2}^{n} \left[ Sim(A_i, A_j) \right]^2 } \qquad (5)
\]
This is visualised in Fig. 7. Here the weight 0.0 ≤ W ≤ 1.0 for each top-
matching attribute linkage is assigned based on ordering the linkages top-
to-least in terms of their similarity scores, and the weight allocated varies
according to a normal distribution. We use different p-parameters (proba-
bility of success) of {0.1, 0.25, 0.5, 0.75, 0.9}, where a parameter of 0.5 leads
to a standard normal distribution of weights allocated for the sorted pairs
of attributes. A lower parameter value leads to skewness to the left, allo-
cating more weight to highly related pairs, and a higher parameter leads to
skewness to the right, allocating higher weights to pairs with lower ranked
relationships. This means that with lower p we expect similar datasets to
have a few very similar attributes and a higher p value means we expect
most of the attributes to be strongly similar.
[Fig. 7: weight assigned to each ranked top-matching attribute-pair link for the different p-parameters, p ∈ {0.10, 0.25, 0.50, 0.75, 0.90}]
\[
P'^{D}_{m} = P^{D}_{m} \cdot \frac{N}{\min(|Attr_y|, |Attr_z|)} \qquad (7)
\]
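As an illustration of the aggregation and normalisation steps, the sketch below computes the minimum, maximum and Euclidean (Equation 5) aggregations over top-matching attribute-pair similarities and then applies Equation 7; reading N as the number of aggregated attribute-pair links is our interpretation of the extracted formula, and all values are hypothetical.

```python
import math

def aggregate(sim_scores, how="euclidean"):
    """Aggregate top-matching attribute-pair similarities into a dataset-level proximity."""
    if how == "minimum":    # all links must be strong for a high dataset proximity
        return min(sim_scores)
    if how == "maximum":    # a single strong link is enough for a high dataset proximity
        return max(sim_scores)
    if how == "euclidean":  # Equation 5: unweighted Euclidean aggregation
        return math.sqrt(sum(s ** 2 for s in sim_scores))
    raise ValueError(f"unknown aggregation: {how}")

def normalise(p_dm, n_links, n_attr_y, n_attr_z):
    """Equation 7 (as we read it): scale by the link count over the smaller schema size."""
    return p_dm * n_links / min(n_attr_y, n_attr_z)

top_sims = [0.9, 0.7, 0.4]              # hypothetical Sim(Ai, Aj) of top-matching links
p = aggregate(top_sims, "euclidean")
print(normalise(p, len(top_sims), n_attr_y=5, n_attr_z=8))
```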
Applying the models on the DL In the second phase, after building the
ensemble models, we apply them to each new pair of previously unseen datasets
to obtain their similarity score. When applying the models, we
compute for each pair of datasets the similarity score of Sim(Dy , Dz ) and for
each attribute pair Sim(Ai , Aj ) using the supervised models extracted in the
previous phase. The Sim score is the positive-class distribution value generated
by each ensemble model [34]. For the attribute level scoring, we complete the
proximity mining task by aggregating the sim scores between pairs of datasets
(as seen in the last steps of Algorithm 1). To compare dataset pairs, we use
Algorithm 2, and the Mcls−ds model generated by the previous build phase.
Fig. 8: An overview of the process to apply the learnt supervised models in our
approach for pre-filtering previously unseen dataset pairs independent of the
build process.
In the apply phase visualised in Fig. 8, we take the pairs of datasets from
sample OML02 which have not been used in the build phase and we compute
the proximity metrics for the dataset level and attribute level meta-features.
First, we profile the attribute-level meta-features from this new sample of dataset pairs. Then, we apply the attribute level supervised models resulting from the
previous sub-process to score the attribute pairs similarities from the different
dataset pairs. Then, we aggregate the resulting attribute pairs similarities to
the dataset level using the aggregation functions. Once we have the dataset level
proximity metrics generated from dataset level and attribute level meta-features,
we feed them all to the dataset-level supervised model from the build phase, applied to unseen test-set pairs not used in training, which assigns a proximity score to each pair. If a pair exceeds a certain proximity threshold, we
consider that pair as a positive match to propose for further schema matching,
otherwise the pair is considered as a negative match and is pruned out from
further schema matching tasks (we evaluate this by pruning effectiveness metrics
in Section 5.2). This is described in the next subsection.
For the final step of pre-filtering pairs of datasets before applying detailed
instance-based schema matching, we check whether the pairs of datasets are
overall related or not, and therefore whether they should be filtered out or pro-
posed for expensive schema matching. We analyse the final Sim(D_y, D_z) score
generated by the model Mcls−ds for each dataset pair in the DL to decide whether
they satisfy the Rel(Dy , Dz ) or not. We consider the relationship of each dataset
with each of the other datasets existing in the DL. Each dataset pair must pass the similarity threshold c_d to be proposed for detailed
schema matching (as in Equation 1).
If we choose a high cut-off threshold, we restrict the supervised model to return fewer pairs of high proximity, leading to lower recall but also fewer comparisons, thus helping to reduce the computational time at the expense of possibly missing some related pairs that are misclassified. Alternatively, if we choose a lower cut-off threshold, we relax our model to return pairs of lower proximity. This leads to more pairs (i.e., more work for further schema matching tasks) yielding positive matches and higher recall of positive cases, but with more pairs marked incorrectly as matching. We propose how to select an appropriate threshold that optimises this trade-off empirically in Section 5.
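The trade-off can be illustrated with the small sketch below, which sweeps the cut-off threshold c_d over hypothetical similarity scores and reports how many pairs would be proposed, the recall against a toy ground truth, and the fraction of comparisons pruned.

```python
from itertools import combinations

# Hypothetical Sim(Dy, Dz) scores and ground-truth related pairs
datasets = ["D1", "D2", "D3", "D4"]
sim = {("D1", "D2"): 0.9, ("D1", "D3"): 0.4, ("D1", "D4"): 0.2,
       ("D2", "D3"): 0.7, ("D2", "D4"): 0.3, ("D3", "D4"): 0.8}
ground_truth = {("D1", "D2"), ("D3", "D4")}

for c_d in (0.3, 0.5, 0.7, 0.9):
    proposed = {p for p in combinations(datasets, 2) if sim[p] > c_d}
    recall = len(proposed & ground_truth) / len(ground_truth)
    pruned = 1 - len(proposed) / len(sim)   # fraction of expensive comparisons saved
    print(f"c_d={c_d}: proposed={len(proposed)}, recall={recall:.2f}, pruned={pruned:.2f}")
```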
The complexity of our approach is quadratic in the number of objects (attributes or datasets) compared, and therefore runs in polynomial time. However, it applies the cheapest computational steps for early pruning (just computing distances as in Equations 3 and 4 and applying the model to score each pair). This
way, we save unnecessary expensive schema matching processing for each value
instance of the attributes in later steps, reducing the computational workload
at the detailed granularity schema matching level by pre-filtering the matching
tasks. We demonstrate this empirically in Section 5.5.
5 Experimental Evaluation
Rel(Ai , Aj ). We test the attribute level model in experiment 1, the dataset level
pruning effectiveness against the ground-truth in experiment 2, and the compu-
tational performance in experiment 3.
In experiments 2 and 3, we compare the performance of our proposed prox-
imity mining models against traditional instance-based schema matching tech-
niques. Those are the most expensive techniques which compare values from
instances in the datasets to compute schema similarity. We benchmark our results against a naïve averaging of attribute similarity from a prototype
called Probabilistic Alignment of Relations, Instances, and Schema (PARIS),
which is one of the most cited schema matching tools [33]. PARIS was found
to be best performing with large datasets when compared against other tools
[19] and does not need collection of extra metadata (see Table 1). PARIS does
exact value-string matching based on value-frequency inverse functionality [33].
We implement a prototype [4] which compares pairs of attributes from differ-
ent datasets using PARIS and generates an overall score for Sim(Dy , Dz ) by
averaging Sim(Ai , Aj ) generated by PARIS from the top-matching attribute-
pairs (similar to Algorithm 1, where PARIS replaces the supervised models). It
converts tabular datasets to RDF triples, and executes a probabilistic match-
ing algorithm for identifying overlapping instances and attributes. We selected
PARIS because of its simplicity and ease of integration with Java-based APIs
and its high performance in previous research [19]. We parametrised the pro-
totype with the top-performing settings from the experiments in [4]: sampling 700 instances per dataset, 10 comparison iterations, and identity and shingling value-string matching. This is the baseline pre-filtering heuristic approach
we shall compare against in the experiments.
The rest of this section describes the datasets used in the experiments, the
evaluation metrics used and the different experiments implemented. We present
the results from our experiments and discuss their implications.
5.1 Datasets
We use the OpenML DL⁵ in our experiments [36], which has more than 20,000
datasets intended for analytics from different subject areas. OpenML is a web-
based data repository that allows data scientists to upload different datasets,
which can be used in data mining experiments. OpenML stores datasets in the
ARFF tabular format, consisting of diverse raw data loaded without any
specific integration schema. This allows us to evaluate our approach in a real-life
setting where datasets come from heterogeneous domains.
We use two subsets of manually annotated datasets from OpenML as our
ground-truth (gold standard) for our experiments. Those two subsets have been
generated using two different independent processes, and therefore provide in-
dependently generated ground truths that do not overlap. As the research com-
munity is lacking appropriate benchmarking gold standards for approximate
(non-equijoins) dataset and attribute similarity search [23], we published those
⁵ https://www.openml.org
⁶ https://github.com/AymanUPC/all_prox_openml
– OML02 - The dataset level annotated 203 DS: consists of 203 datasets
different from those in the OML01 subset. To collect this sample, we scraped
the OpenML repository to extract all datasets not included in the OML01
sample and having a description of more than 500 characters. Out of the 514
datasets retrieved, we selected 203 with meaningful descriptions (i.e., exclud-
ing datasets whose descriptions do not allow us to interpret the content and assign a topic). The datasets have a total of 10,971 attributes (2,834 nomi-
nal, 8,137 numeric). There are 19,931 pairs of datasets with about 35 million
attribute pairs to match. According to Algorithm 1, there are 3.7 million
comparisons for nominal attributes (leading to 59,570 top matching pairs)
and 31.5 million numeric attribute pairs (leading to 167,882 top matching
pairs). We aim to avoid value-based schema matching on all possible pairs of values between datasets: there are 216,330 values, which would lead to 23.4 billion comparisons at the value level. A domain expert with a
background in pharmaceutical studies and one of the authors collaborated
to manually label the datasets⁷. They used the textual descriptions of the
datasets to extract their topics, which is common experimental practice in
dataset matching assessment, similar to the experimental setup in [6]. The
annotators sat together in the same room and discussed each dataset with
its description and decided on its appropriate real-life subject-area (e.g., car
engines, computer hardware, etc.). To group similar datasets in the same
subject-area grouping, annotators had to discuss and agree together on a
single annotation to give to a dataset. This was done by discussing the
specific real-world concept which the dataset describes, e.g., “animal pro-
files”, “motion sensing”, etc. The annotators were only allowed to scrutinise
the textual descriptions of the datasets and did not receive the underlying
data stored in their attributes to prevent any bias towards our proposed
algorithms. It took the annotators about 15 hours in total to annotate the
datasets. Pairs of datasets falling under the same subject-area were positively
annotated for Rel(Dy , Dz ). The sample consists of 543 positive pairs from
the 20,503 total pairs. The details of the sample are summarised
in Table 6, which lists the number of datasets, the number of topics, top
topics by the number of datasets, and the number of related pairs. Some of
the pairs from the sample can be seen in Table 7. We can see, for example,
that dataset with ID 23 should match all datasets falling under the topic of
‘census data’ like dataset 179. Both datasets have data about citizens from
a population census. In row 4, we can see an example of duplicated datasets
having highly intersecting data in their attributes. Duplicate pairs like those
in row 4 have the same number of instances but are described with different, overlapping sets of attributes. We consider all duplicate pairs
of datasets as related pairs. We aim to detect and recommend such kind
⁷ Those dataset annotations were reviewed by 5 independent judges, and the results of this validation are published online at: https://github.com/AymanUPC/all_prox_openml/blob/master/OML02/oml02_revalidation_results.pdf
of similar dataset pairs as those in Table 7 for schema matching using our
proximity mining approach.
Table 7: An example of pairs of datasets from the OML02 sample from OpenML
No. | DID 1 | Dataset 1 | DID 2 | Dataset 2 | Topic | Relationship
1 | 23 | cmc | 179 | adult | Census Data | related
2 | 14 | mfeat-fourier | 1038 | gina agnostic | Digit Handwriting Recognition | related
3 | 55 | hepatitis | 171 | primary-tumor | Disease | related
4 | 189 | kin8nm | 308 | puma32H | Robot Motion Sensing | duplicate
– Classification effectiveness
• Granularity: Attribute level Rel(Ai , Aj ) and Dataset level Rel(Dy , Dz )
• Models evaluated: Mcls−nom−attr , Mcls−num−attr , Mcls−ds
• Classification measures: Classification accuracy, Recall, Precision, ROC,
Kappa
– Pre-filtering (pruning) effectiveness
• Granularity: Dataset level Rel(Dy , Dz )
• Model evaluated: Mcls−ds , P ARIS
• Retrieval measures: Recall, Precision, Efficiency Gain, Lift Score (illustrated in the sketch after this list)
– Computational performance
• Granularity: Attribute level Rel(Ai , Aj ) and Dataset level Rel(Dy , Dz )
• Model evaluated: Mcls−nom−attr , Mcls−num−attr , Mcls−ds , P ARIS
• Computational measures: computational processing time (milliseconds),
metadata size (megabytes)
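For clarity, the sketch below computes the pruning-effectiveness measures for a set of proposed dataset pairs, assuming the usual definitions of efficiency gain (fraction of pairs pruned) and lift score (precision relative to the base rate of related pairs); these definitions are our assumption where the text does not spell them out, and the pairs are hypothetical.

```python
def pruning_effectiveness(proposed, related, all_pairs):
    tp = len(proposed & related)
    recall = tp / len(related) if related else 0.0
    precision = tp / len(proposed) if proposed else 0.0
    efficiency_gain = 1 - len(proposed) / len(all_pairs)  # fraction of comparisons pruned
    base_rate = len(related) / len(all_pairs)             # chance a random pair is related
    lift = precision / base_rate if base_rate else 0.0
    return {"recall": recall, "precision": precision,
            "efficiency_gain": efficiency_gain, "lift": lift}

all_pairs = {(i, j) for i in range(10) for j in range(i + 1, 10)}  # 45 candidate pairs
related = {(0, 1), (2, 3), (4, 5)}
proposed = {(0, 1), (2, 3), (6, 7), (8, 9)}
print(pruning_effectiveness(proposed, related, all_pairs))
```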
Results The attribute level models were evaluated for both nominal attribute
pairs and numeric attribute pairs. We evaluate the Mcls−nom−attr and Mcls−num−attr
models which assign the Sim(A_i, A_j) for attribute pairs. As can be seen in Table 10, we created two supervised models, one for each type of attribute pair. Both models achieved excellent ROC performance and highly significant results on the Kappa statistic (see Tables 8–9 for the significance of the results). The models had
good accuracy, recall, and precision rates. This is important because the dataset
pairs pre-filtering step depends on this attribute proximity step, so we have to
achieve a good performance at this level to minimise accumulation of errors for
the following tasks.
The most comprehensive of all models is the All-Prox model, which uses all the possible meta-features we collect from the data profiling step. The DS-Prox-Name model is the most generic of all, as it considers just a single meta-feature at the most abstract level; therefore, it is used as the baseline for our performance comparisons in the experiments.
[Fig. 9: Classification accuracy from 10-fold cross-validation of dataset pairs pre-filtering models — All-Prox 78.18, Attribute-Prox 75.51, Name-Prox 74.29, Content-Prox 73.72, DS-Prox 73.29]
[Fig. 10: Kappa statistic from 10-fold cross-validation of dataset pairs pre-filtering models — All-Prox 0.56, Attribute-Prox 0.51, Name-Prox 0.49, Content-Prox 0.47, DS-Prox 0.47]
[Figure: ROC scores from 10-fold cross-validation of dataset pairs pre-filtering models — All-Prox 0.86, Attribute-Prox 0.83, Name-Prox 0.81, Content-Prox 0.81, DS-Prox 0.81]
The different models used for the schema matching pre-filtering task achieve
different results because the meta-features used in the different models are not
correlated and therefore contain different information about the datasets, leading to the different performance of each model. We evaluated the Spearman rank
correlation [15] between the different types of meta-features, which is presented
in Table 11. The Spearman rank correlation ranks the dataset pairs according
to the proximity metrics of the meta-features. If the dataset pairs have the same
identical rankings between two different meta-features then we get a perfect
correlation. If the rankings produced in descending order by the two proximity
metrics are different (e.g., a dataset pair can be ranked in the 100th position by
one meta-feature and in the 9th position by the other, which have a difference
of 81 ranks) then we get a lower correlation, with completely uncorrelated meta-
features. We evaluated the average, standard deviation, minimum, and maximum
of the correlation between the meta-features falling under the different types
of meta-features. Recall that each type will have multiple meta-features (see
Section 4.1), like attribute content will include all the meta-features in Table 3
with all their different proximity metrics according to the aggregations described
in Section 4.2. We calculate the correlation between each individual meta-feature
pair and we calculate aggregates per type. As can be seen in Table 11, all the
correlation values are low.
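As an aside, such rank correlations can be computed directly with SciPy; the sketch below compares two hypothetical proximity-metric value lists over the same dataset pairs.

```python
from scipy.stats import spearmanr

# Hypothetical proximity metrics of two meta-features over the same six dataset pairs
name_proximity = [0.10, 0.80, 0.30, 0.55, 0.95, 0.20]
content_proximity = [0.60, 0.15, 0.90, 0.40, 0.25, 0.70]

rho, p_value = spearmanr(name_proximity, content_proximity)
print(rho)  # a low |rho| means the two meta-features rank the pairs very differently
```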
Table 11: Spearman rank correlation for the different meta-features. We aggre-
gate minimum (Min.), average (Avg.), maximum (Max.), & standard deviation
(Std. Dev.) for different meta-feature types.
Type 1 | Type 2 | Min. Correlation | Avg. Correlation | Max. Correlation | Std. Dev. Correlation
Attribute Name | Attribute Content | -0.12 | -0.01 | 0.19 | 0.04
Attribute Name | Dataset Content | 0.02 | 0.04 | 0.10 | 0.02
Attribute Name | Dataset Name | 0.06 | 0.07 | 0.13 | 0.02
Dataset Content | Attribute Content | -0.01 | 0.09 | 0.15 | 0.04
Dataset Content | Dataset Name | -0.02 | 0.00 | 0.02 | 0.02
Dataset Name | Attribute Content | 0.00 | 0.01 | 0.04 | 0.01
[Fig. 12: Recall against efficiency gain for the different supervised models (All-Prox, Attribute-Prox, Content-Prox, DS-Prox, Name-Prox, Baseline); marks labelled by average lift score.]
[Fig. 13: Recall against precision for the different supervised models; marks labelled by average lift score.]
[Fig. 14: Recall against efficiency gain for the different metric types (Attribute-Name-Prox, Attribute-Content, DS-Content, DS-Name); marks labelled by average lift score.]
[Fig. 15: Recall against precision for the different metric types; marks labelled by average lift score.]
difficult pairs that cannot be retrieved by any single type individually. Therefore, it is possible to depend solely on content-based proximity models as a replacement for name-based proximity models and achieve similar results. This
will be important in the cases of DLs which are not well maintained and do
not have properly named datasets and attributes. We investigate in detail the
performance of the All-Prox proximity model based on its true positives, false
positives and false-negative pairs in the Appendix⁸, where we present the exact cases, discuss the reasons for the discrepancies, and give a comparative analysis
of the underlying proximity metrics which led to those cases.
For each of the pruning effectiveness evaluation metrics listed above, we com-
pute the average and standard deviation of the measure between the different
folds of evaluation for our approach. The average is plotted in the graphs in
Figures 12-15, and the standard deviations of each model for the threshold 0.5
(we chose the mean threshold) are given in Table 12. The standard deviation
indicates the stability of our proposed metrics and models with different subsets
of datasets. We aim for a low standard deviation to prove the high adaptability
of our approach.
Table 12: The standard deviation of each evaluation measure for 10-fold cross-validation of each dataset pairs pre-filtering model, where c_d = 0.5
Proximity | SD Recall | SD Efficiency Gain | SD Precision | SD Lift Score
All-Prox | 5.4 | 0.61 | 0.58 | 0.32
Attribute-Prox | 6.5 | 1.0 | 0.7 | 0.24
Content-Prox | 6.8 | 1.0 | 0.38 | 0.35
DS-Prox | 7.86 | 0.85 | 0.29 | 0.28
Name-Prox | 6.3 | 1.1 | 0.47 | 0.24
We note here that the different types of meta-features are able to model
different information about the datasets and their attributes as seen by the
low correlations in Table 11. That is the main reason we combined the different types of meta-features in our proximity models, which gives better results.
Dataset Pairs Pre-filtering Precision Vs. Recall As seen in Fig. 13, the
proximity models improved the performance of the schema matching pre-filtering, as shown by the higher precision rates compared to the individual meta-features in Fig. 15. By combining the meta-features in a su-
pervised model we were able to achieve higher precision rates with the same
recall rates, for example, a precision of 17% with a recall rate of 75% using
the All-Prox model. This is better than the best achievable precision with the
individual meta-features, which can achieve a precision of 4% with the same
recall rate for the attribute level meta-feature type. However, we acknowledge
that the precision rates are low for all types of models and meta-features. We
can therefore conclude that our proposed proximity mining approach can only
be used as an initial schema matching pre-filter which is able to prune unneces-
sary schema matching comparisons from further steps. Our approach can not be
used for the final schema matching task because it will produce false positives.
Therefore, dataset pairs should be further scrutinised with more comparisons to
assess their schema similarity (as seen in Fig. 1 bottom instance-based matching
Table 13: The computational performance of our approach vs. the PARIS implementation in terms of time and storage space
Task | Timing | Average Time | Storage Space
Dataset Profiling | 263,019 ms (4:23 minutes) | 1,295 ms per dataset | 31.25 MB
Numeric Attribute Matching | 1,184,000 ms (19:44 minutes) | 0.04 ms per attribute pair | In memory
Nominal Attribute Matching | 160,000 ms (2:40 minutes) | 0.04 ms per attribute pair | In memory
Numeric Attribute Top Matching | 3,250,000 ms (54:10 minutes) | 0.1 ms per attribute pair; 208 ms per dataset pair (15,576 dataset pairs) | 7 MB
Nominal Attribute Top Matching | 313,000 ms (5:13 minutes) | 0.08 ms per attribute pair | 2.33 MB
5.6 Generalisability
6 Conclusion
We have presented in this paper a novel approach for pre-filtering schema match-
ing using metadata-based proximity mining algorithms. The approach is able to
detect related dataset pairs containing similar data by analysing their meta-
data and using a supervised learning model to compute their proximity score.
Those pairs exceeding a minimum threshold are proposed for more detailed, more
expensive schema matching at the value-based granularity-level. Our approach
was found to be highly effective in this early-pruning task, whereby dissimilar
datasets were effectively filtered out and datasets with similar data were effec-
tively detected in a real-life DL setting. Our approach achieves high lift scores
and efficiency gain in the pre-filtering task, while maintaining a high recall rate.
For future research, we will investigate different techniques to improve the scalability of our approach by improving attribute-level matching selectivity. We also want to investigate the possibility of detailed semantic schema matching at the attribute level. Finally, we will investigate using our proximity mining approach to cluster the datasets into meaningful similarity groupings.
References
1. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. The VLDB Journal 24(4), 557–581 (2015). https://doi.org/10.1007/s00778-015-0389-y
2. Adomavicius, G., Sankaranarayanan, R., Sen, S., Tuzhilin, A.: Incorporating
contextual information in recommender systems using a multidimensional ap-
proach. ACM Transactions on Information Systems (TOIS) 23(1), 103–145 (2005).
https://doi.org/10.1145/1055709.1055714
3. Algergawy, A., Massmann, S., Rahm, E.: A Clustering-Based Approach for Large-
Scale Ontology Matching. In: East European Conference on Advances in Databases
and Information Systems (ADBIS), pp. 415–428. Springer (2011).
4. Alserafi, A., Abelló, A., Romero, O., Calders, T.: Towards Information Profiling:
Data Lake Content Metadata Management. In: DINA Workshop, ICDM. pp. 178–
185. IEEE (2016). https://doi.org/10.1109/ICDMW.2016.0033
5. Alserafi, A., Calders, T., Abelló, A., Romero, O.: DS-prox: Dataset proxim-
ity mining for governing the data lake. In: International Conference on Simi-
larity Search and Applications. vol. 10609 LNCS, pp. 284–299. Springer (2017).
https://doi.org/10.1007/978-3-319-68474-1_20
6. Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.: Dataset Recommenda-
tion for Data Linking: An Intensional Approach. In: Proceedings of the Inter-
national Semantic Web Conference: The Semantic Web. Latest Advances and New
Domains. vol. 9678, pp. 36–51. Springer (2016). http://link.springer.com/10.1007/978-3-319-34129-3
7. Bernstein, P.A., Madhavan, J., Rahm, E.: Generic Schema Matching, Ten Years
Later. Proceedings of the VLDB Endowment 4(11), 695–701 (2011)
8. Bilke, A., Naumann, F.: Schema Matching using Duplicates. In: Proceedings of the
21st International Conference on Data Engineering. pp. 69–80. IEEE (2005)
9. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
10. Chen, C., Halevy, A., Tan, W.C.: BigGorilla: An Open-Source Ecosystem for Data
Preparation and Integration. IEEE Data Engineering Bulletin 41(2), 10–22 (2018)
11. Chen, Z., Jia, H., Heflin, J., Davison, B.D.: Generating Schema Labels
through Dataset Content Analysis. In: Companion of the The Web Confer-
ence 2018 on The Web Conference 2018 - WWW ’18. pp. 1515–1522 (2018).
https://doi.org/10.1145/3184558.3191601
12. Deng, D., Kim, A., Madden, S., Stonebraker, M.: SilkMoth: An Efficient Method
for Finding Related Sets with Maximum Matching Constraints. Proceedings of the
VLDB Endowment 10(10), 1082–1093 (2017)
13. Furche, T., Gottlob, G., Libkin, L., Orsi, G., Paton, N.W.: Data wrangling for big
data: Challenges and opportunities. In: EDBT. vol. 16, pp. 473–478 (2016)
14. Gallinucci, E., Golfarelli, M., Rizzi, S.: Schema profiling of
document-oriented databases. Information Systems 75, 13–25 (2018).
https://doi.org/10.1016/j.is.2018.02.007
15. Herlocker, J., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating Collabora-
tive Filtering Recommender Systems. ACM Transactions on Information Systems
(TOIS) 22(1), 5–53 (2004)
16. Kandel, S., Heer, J., Plaisant, C., Kennedy, J., Van Ham, F., Riche, N.H., Weaver,
C., Lee, B., Brodbeck, D., Buono, P.: Research directions in data wrangling: Vi-
sualizations and transformations for usable and credible data. Information Visual-
ization 10(4), 271–288 (2011)
17. Kim, J., Peng, Y., Ivezic, N., Shin, J.: An Optimization Approach for Semantic-
based XML Schema Matching. International Journal of Trade, Economics and
Finance 2(1), 78 – 86 (2011)
18. Kruse, S., Papenbrock, T., Harmouch, H., Naumann, F.: Data Anamnesis: Ad-
mitting Raw Data into an Organization. Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering pp. 8–20 (2016)
19. Lacoste-Julien, S., Palla, K., Davies, A., Kasneci, G., Graepel, T., Ghahramani,
Z.: SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases. In: Pro-
ceedings of the 19th ACM SIGKDD international conference. pp. 572–580 (2013).
https://doi.org/10.1145/2487575.2487592
20. Maccioni, A., Torlone, R.: KAYAK: A Framework for Just-in-Time Data Prepa-
ration in a Data Lake. In: International Conference on Advanced Information
Systems Engineering. pp. 474–489. Springer International Publishing (2018).
https://doi.org/10.1007/978-3-319-91563-0
21. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid.
VLDB 1, 49–58 (2001)
22. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2009)
23. Miller, R.: Open Data Integration. PVLDB 11(12), 2130–2139 (2018)
24. Naumann, F.: Data profiling revisited. ACM SIGMOD Record 42(4), 40–49 (2014)
25. Oliveira, A., Tessarolli, G., Ghiotto, G., Pinto, B., Campello, F., Marques, M.,
Oliveira, C., Rodrigues, I., Kalinowski, M., Souza, U., Murta, L., Braganholo, V.:
An efficient similarity-based approach for comparing XML documents. Information
Systems 78, 40–57 (2018). https://doi.org/10.1016/j.is.2018.07.001
26. de Oliveira, H.R., Tavares, A.T., Lóscio, B.F.: Feedback-based data set recommen-
dation for building linked data applications. In: Proceedings of the 8th Interna-
tional Conference on Semantic Systems - I-SEMANTICS ’12. p. 49. ACM (2012)
27. Pei, J., Hong, J., Bell, D.: A novel clustering-based approach to schema match-
ing. In: Proceedings of the international conference on Advances in Information
Systems. pp. 60–69. Springer (2006)
28. Rahm, E.: Towards large-scale schema and ontology matching. In: Schema match-
ing and mapping, pp. 3–27. Springer Berlin Heidelberg (2011)
29. Rahm, E.: The Case for Holistic Data Integration. In: ADBIS. pp. 11–27 (2016).
https://doi.org/10.1007/978-3-319-44039-2
30. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching.
VLDB Journal 10(4), 334–350 (2001). https://doi.org/10.1007/s007780100057
31. Shvaiko, P.: A Survey of Schema-based Matching Approaches. Journal on Data
Semantics 3730, 146–171 (2005).
32. Steorts, R., Ventura, S., Sadinle, M., Fienberg, S.: A Comparison of Blocking
Methods for Record Linkage. In: International Conference on Privacy in Statis-
tical Databases. pp. 253–268 (2014)
33. Suchanek, F.M., Abiteboul, S., Senellart, P.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. Proceedings of the VLDB Endowment 5(3),
157–168 (2011). https://doi.org/10.14778/2078331.2078332
34. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to data mining. Pearson Edu-
cation (2006)
35. Terrizzano, I., Schwarz, P., Roth, M., Colino, J.E.: Data Wrangling: The Challeng-
ing Journey from the Wild to the Lake. In: 7th Biennial Conference on Innovative
Data Systems Research CIDR’15 (2015)
36. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science
in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)