
Machine Learning To Data Management A Round Trip

The document discusses the integration of machine learning (ML) techniques into data management tasks, highlighting their potential for automating processes such as data cleaning, integration, and error detection. It outlines a tutorial structure that includes a SWOT analysis, various applications of ML in instance-based, schema-based, system-oriented, and user-guided data management tasks, and discusses the advantages and limitations of these approaches. The tutorial aims to enhance the understanding and effectiveness of ML applications in data management, ultimately leading to improved database systems.
2018 IEEE 34th International Conference on Data Engineering

Machine Learning to Data Management: A Round Trip
Laure Berti-Equille¹, Angela Bonifati², Tova Milo³
¹ Aix-Marseille Université, LIS (CNRS), France, [email protected]
² Université Claude Bernard Lyon 1, LIRIS (CNRS), France, [email protected]
³ Tel Aviv University, Tel Aviv, Israel, [email protected]

Abstract—With the emergence of machine learning (ML) techniques in database research, ML has already shown tremendous potential to dramatically impact the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Part of the data preparation, standardization, and cleaning processes, such as data matching and deduplication, could be automated by making an ML model “learn” and predict the matches routinely. Data integration can also benefit from ML, as the data to be integrated can be sampled and used to design the data integration algorithms. After the initial manual work to set up the labels, ML models can start learning from the new incoming data that are being submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and deliver accurate results. Therefore, ML is more scalable compared to traditional and time-consuming approaches. Nevertheless, many ML algorithms require out-of-the-box tuning, and their parameters and scope are often not adapted to the problem at hand. For example, in cleaning and integration processes, the window sizes of values used for the ML models cannot be arbitrarily chosen and require an adaptation of the learning parameters. This tutorial will survey the recent trend of applying machine learning solutions to improve data management tasks and establish new paradigms to sharpen data error detection, cleaning, and integration at the data instance level, as well as at schema, system, and user levels.

I. DETAILED OUTLINE OF THE TUTORIAL

The tutorial is organized into an introductory SWOT analysis (Section I-A), four main parts (described in detail from Section I-B to Section I-E), and final conclusions (Section I-F) as follows.

A. ML for Data Management: SWOT Analysis

The tutorial starts with an introduction to the relevant concepts in Machine Learning from the data management perspective. We explore the use of ML techniques as a tool to express and quantify data patterns and knowledge transfer for representing and analyzing data, using examples from literature in database cleaning/repair, data integration, and query inference. We first provide an overview of the opportunities and limitations, along with the computational challenges associated with ML techniques applied to data management. We organize the tutorial into four main parts related to the presented levels of data management: (1) instance-based; (2) schema-based; (3) system-based; and (4) user interaction-based data management.

B. ML in Instance-Based Data Transformation Tasks

In this part, we illustrate the role played by ML techniques in data processing at the data instance level, through ML applications to data cleaning, database repairing, and data fusion. This list of topics is by no means exhaustive, albeit representative of the diversity of problems where ML tools have proved useful. We will review recent DB/ML research leveraging:
• Clustering applied to anomaly detection and data cleaning [16], [41], [49], detection of patterns of glitches [7], [8], replacement of erroneous or missing values, and deduplication [12];
• Classification applied to database repairing [44], [47], regression classification used in record linkage [27], and kNN for data fusion;
• Semi-supervised and supervised learning models for similarity and blocking functions used in deduplication and record linkage [10], [11], and active learning in entity resolution [46];

2375-026X/18/$31.00 ©2018 IEEE 1735


DOI 10.1109/ICDE.2018.00226
Authorized licensed use limited to: KIIT University. Downloaded on January 29,2025 at 13:47:15 UTC from IEEE Xplore. Restrictions apply.
• Bayesian analysis applied to data cleaning [17], probabilistic inference in data repairing [39], disambiguation (conflict resolution), and data fusion [20], [21], [35]; and
• Model optimization and statistical model training with guarantees [31] used in learning from samples for progressive data cleaning.
Finally, we will discuss the pros and cons of ML applications to instance-based data management tasks. We will also review some work in ML that addresses noisy data labels and model robustness, and discuss its applicability to data quality and integration.

C. ML in Schema-Based Data Transformation Tasks

In this part, we focus on the application of ML techniques to schema-driven tasks in data management, such as schema and constraint inference, schema mapping, and query specification. Again, our list is not meant to be exhaustive but aims at fostering the discussion on the usage of learning in all its facets in the above tasks. The presentation will include the following topics:
• Schema and schema mapping discovery techniques [2], [9], [29], [33] to fight database decay and facilitate data integration;
• Usage of input data examples to help the user specify complex data transformation tasks [15], [28];
• Usage of machine learning in data source reconciliation [18];
• Learning in rule discovery and information extraction [23], [26], [32], [40]; and
• Query specification paradigms leveraging grammar induction techniques [3], [13], [14].
We will review recent approaches for data transformation, schema, and constraint discovery that can benefit from learning based on input data examples. We will discuss the advantages and limitations of these techniques and pinpoint a few extensions.

D. ML in System-Oriented Data Management Tasks

In this part, we will discuss a recent trend in our community to use ML techniques for obtaining “trained database systems”, i.e., databases that can learn from past query workloads (or past query executions and optimized plans) the behavior to be adopted in upcoming querying and tuning tasks. We will examine the many ML approaches for database query optimization and bulk data processing systems by highlighting their advantages and their possible impact on full-fledged database systems. We will (not exhaustively) discuss the following trends in the DB community:
• Predicting query answering based on the past history of queries [34];
• Predicting the decisions of a query optimizer [38] and performing database tuning [43] by leveraging ML techniques;
• Feature engineering and labeling are bottlenecks in ML techniques that hinder their adoption in database-oriented tasks; we will review works to facilitate those tasks [1], [5], [36];
• Finally, we will focus on the connections between ML and Databases and the unsolved challenges in this area [24], [37].

E. ML in User-Guided Data Management Tasks

In this part, we will discuss the limitations of pure ML approaches and how users can help to complement the efforts. As illustrative examples, we will examine common data management tasks such as entity resolution and data cleaning [4], [6], [19], [25], [45], [48], [50]. We will consider the interplay between ML-based algorithms and crowd-sourcing, and highlight where users’ input is essential. Specifically, we will discuss three dimensions of the problem:
• How users can help in improving the data itself, e.g., by detecting errors [42], gathering missing data, and choosing among possible data repairs [4], [48];
• How they can assist in gathering meta-data that facilitates improved data processing [6], [45];
• How we can find and identify the most relevant crowd to complement the ML efforts in a given data management task [19], [50].

F. Lessons Learned and Perspectives

We foresee two major outcomes from this tutorial. In the short term, we expect that this tutorial will lead to a more effective use of ML techniques in data management applications. In the long term, we hope that understanding the benefits and limits of the application of ML to the modeling, representation, and analysis of data will lead to a better interaction between data management and ML when designing the next-generation database management systems.

II. PRESENTERS’ BIOGRAPHY

Laure Berti-Equille received her Ph.D. degree in Computer Science from University of Toulon in France in 1999. From 2000 to 2010, she was a tenured Associate Professor at University of Rennes 1, and a two-year visiting researcher at AT&T Labs Research in New Jersey, USA, as a recipient of the prestigious European Marie

Curie Outgoing Fellowship (2007-2009). From 2011 to 2017, she was at IRD, the French Institute of Research for Development, as a Research Director. From 2014 to 2017, she was a Senior Scientist at Qatar Computing Research Institute (Hamad Bin Khalifa University). She is now a full Professor at Aix-Marseille University (AMU) in France. Her interests are at the intersection of large-scale data science, data analytics, and machine learning with a focus on data quality and truth discovery research. She initiated the very first workshop editions on information and data quality in information systems (IQIS 2005) and in databases (QDB 2009 and 2016) in conjunction with SIGMOD and VLDB respectively, and co-organized the first French workshops on Data and Knowledge Quality in conjunction with EGC (Extraction et Gestion de Connaissances) in 2005, 2006, 2010, and 2011. Laure is serving as an associate editor of the ACM Journal on Data and Information Quality and served as a Program Chair of the International Conferences on Information Quality (ICIQ) in 2012 and 2016. She has received various grants from the French Agency for National Research (ANR), the French National Research Council (CNRS), and the European Union.

Angela Bonifati received her Ph.D. degree in Computer Science from Politecnico di Milano in 2002. After graduating she worked as a postdoctoral researcher at the INRIA research institute in Paris. She then obtained a permanent position as a researcher at the Italian National Research Council in 2003. She is now a full Professor in France (since 2011), currently at University of Lyon 1. Her research focuses on advanced database applications such as data integration and exchange, web and graph databases, and query inference, considering both structured and semi-structured data models. She has been a visiting professor at several foreign universities, such as Stanford University, UBC, and Saarland University. Angela served as the Program Chair of several international conferences, including ICDE 2011 (Semi-structured data Track) and ICDE 2018 (Information Extraction and Data Cleaning and Curation Track), WebDB 2013, and XSym 2009. She is currently associate editor of the VLDB Journal, ACM Transactions on Database Systems (TODS), and Distributed and Parallel Databases. She has been the recipient of the prestigious Palse Impulsion Starting Grant at the University of Lyon (IDEX) in 2016. She has received grants from the French and Italian Ministry of Science and the French National Research Council (CNRS).

Tova Milo received her Ph.D. degree in Computer Science from the Hebrew University, Jerusalem, in 1992. After graduating she worked at the INRIA research institute in Paris and at University of Toronto and returned to Israel in 1995, joining the School of Computer Science at Tel Aviv University, where she is now a full Professor. She is the head of the Database research group and holds the Chair of Information Management. She served as the Head of the Computer Science Department from 2011 to 2014. Her research focuses on large-scale data management applications such as data integration, semi-structured information, Data-centered Business Processes, and Crowd-sourcing, studying both theoretical and practical aspects. Tova served as the Program Chair of several international conferences, including PODS, VLDB, ICDT, XSym, and WebDB, and as the chair of the PODS Executive Committee. She served as a member of the VLDB Endowment and the PODS and ICDT executive boards and as an editor of TODS and the Logical Methods in Computer Science Journal. Tova has received grants from the Israel Science Foundation, the US-Israel Binational Science Foundation, the Israeli and French Ministry of Science, and the European Union. She is an ACM Fellow, a member of Academia Europaea, and a recipient of the 2010 ACM PODS Alberto O. Mendelzon Test-of-Time Award, the 2017 VLDB Women in Database Research award, the 2017 Weizmann award for Exact Sciences Research, and the prestigious EU ERC Advanced Investigators grant.

REFERENCES

[1] M.R. Anderson, M.J. Cafarella. Input Selection for Fast Feature Engineering. ICDE:577-588, 2016.
[2] P. Andritsos, R.J. Miller, P. Tsaparas. Information-Theoretic Tools for Mining Database Structure from Large Data Sets. SIGMOD:731-742, 2004.
[3] T. Antonopoulos, F. Neven, F. Servais. Definability Problems for Graph Query Languages. ICDT:141-152, 2013.
[4] A. Assadi, T. Milo, S. Novgorodov. DANCE: Data Cleaning with Constraints and Experts. ICDE:1409-1410, 2017.
[5] S.H. Bach, B. Dawei He, A. Ratner, C. Ré. Learning the Structure of Generative Models without Labeled Data. ICML:273-282, 2017.
[6] M. Bergman, T. Milo, S. Novgorodov, W. Chiew Tan. Query-Oriented Data Cleaning with Oracles. SIGMOD:1199-1214, 2015.
[7] L. Berti-Equille, T. Dasu, D. Srivastava. Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. ICDE:733-744, 2011.
[8] L. Berti-Equille, J.M. Loh, T. Dasu. A masking index for quantifying hidden glitches. Knowl. Inf. Syst. 44(2):253-277, 2015.
[9] G.J. Bex, W. Gelade, F. Neven, S. Vansummeren. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data. WWW:825-834, 2008.

[10] I. Bhattacharya, L. Getoor. Entity Resolution. Encyclopedia of Machine Learning and Data Mining 2017:402-408, 2017.
[11] M. Bilenko, B. Kamath, R.J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage. ICDM:87-96, 2006.
[12] M. Bilenko. Learnable Similarity Functions and their Applications to Clustering and Record Linkage. AAAI:981-982, 2004.
[13] A. Bonifati, R. Ciucanu, S. Staworko. Learning Join Queries from User Examples. ACM Trans. Database Syst. 40(4):24:1-24:38, 2016.
[14] A. Bonifati, R. Ciucanu, A. Lemay. Learning Path Queries on Graph Databases. EDBT:109-120, 2015.
[15] A. Bonifati, U. Comignani, E. Coquery, R. Thion. Interactive Schema Mapping Specification with Exemplar Tuples. SIGMOD:667-682, 2017.
[16] Y. Chung, S. Krishnan, T. Kraska. A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets. PVLDB, 10(10):1094-1105, 2017.
[17] S. De, Y. Hu, V.V. Meduri, Y. Chen, S. Kambhampati. BayesWipe: A Scalable Probabilistic Framework for Improving Data Quality. ACM J. Data and Information Quality, 8(1):5:1-5:30, 2016.
[18] A. Doan, P. Domingos, A.Y. Halevy. Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. SIGMOD:509-520, 2001.
[19] A. Doan, A. Ardalan, J. R. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G. C., H. Zhang. Human-in-the-Loop Challenges for Entity Matching: A Midterm Report. HILDA@SIGMOD:12:1-12:6, 2017.
[20] X.L. Dong, L. Berti-Equille, D. Srivastava. Data fusion: resolving conflicts from multiple sources. Handbook of Data Quality, 293-318.
[21] X.L. Dong, L. Berti-Equille, D. Srivastava. Integrating Conflicting Data: The Role of Source Dependence. PVLDB:550-561, 2009.
[22] H. Fernau. Algorithms for Learning Regular Expressions from Positive Data. Inf. Comput. 207(4):521-541, 2009.
[23] P.A. Flach, I. Savnik. Database Dependency Discovery: A Machine Learning Approach. AI Commun. 12(3):139-160, 1999.
[24] Frontiers in Massive Data Analysis. http://www.stat.berkeley.edu/~mmahoney/pubs/nrc-massive-data.pdf
[25] F. Geerts, G. Mecca, P. Papotti, D. Santoro. The LLUNATIC Data-Cleaning Framework. PVLDB, 6(9):625-636, 2013.
[26] J.M. Hellerstein, C. Ré, F. Schoppmann, D.Z. Wang, E. Fratkin, A. Gorajek, K.S. Ng, C. Welton, X. Feng, K. Li, A. Kumar. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB, 5(12):1700-1711, 2012.
[27] Y. Hu, Q. Wang, D. Vatsalan, P. Christen. Improving Temporal Record Linkage Using Regression Classification. PAKDD:561-573, 2017.
[28] Z. Jin, M.R. Anderson, M.J. Cafarella, H. V. Jagadish. Foofah: Transforming Data By Example. SIGMOD:683-698, 2017.
[29] A. Kimmig, A. Memory, R.J. Miller, L. Getoor. A Collective Probabilistic Approach to Schema Mapping Discovery. ICDE:921-932, 2017.
[30] S. Krishnan, J. Wang, M.J. Franklin, K. Goldberg, T. Kraska, T. Milo, E. Wu. SampleClean: Fast and Reliable Analytics on Dirty Data. IEEE Data Eng. Bull., 38(3):59-75, 2015.
[31] S. Krishnan, J. Wang, E. Wu, M.J. Franklin, K. Goldberg. ActiveClean: Interactive Data Cleaning For Statistical Modeling. VLDB:948-959, 2016.
[32] Y. Li, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H.V. Jagadish. Regular Expression Learning for Information Extraction. EMNLP:21-30, 2008.
[33] R.J. Miller, P. Andritsos. Schema Discovery. IEEE Data Engineering Bulletin, 26(3):40-45, 2003.
[34] Y. Park, A. Shahab Tajik, M. Cafarella, B. Mozafari. Database Learning: Toward a Database that Becomes Smarter Every Time. SIGMOD:745-758, 2017.
[35] R. Pradhan, S. Bykau, S. Prabhakar. Staging User Feedback toward Rapid Conflict Resolution in Data Fusion. SIGMOD 2017.
[36] A. Ratner, S.H. Bach, H.R. Ehrenberg, J.A. Fries, S. Wu, C. Ré. Snorkel: A System for Lightweight Extraction. CIDR 2017.
[37] C. Ré, D. Agrawal, M. Balazinska, M.J. Cafarella, M.I. Jordan, T. Kraska, R. Ramakrishnan. Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype? SIGMOD:283-284, 2015.
[38] M. Schleich, D. Olteanu, R. Ciucanu. Learning Linear Regression Models over Factorized Joins. SIGMOD:3-18, 2016.
[39] T. Rekatsinas, X. Chu, I.F. Ilyas, C. Ré. HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB, 10(11):1190-1201, 2017.
[40] A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, U. Leser. A Machine Learning Approach to Foreign Key Discovery. WebDB@SIGMOD, 2009.
[41] S. Song, C. Li, X. Zhang. Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data. KDD:1115-1124, 2015.
[42] S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiané-Ruiz, N. Tang. UGuide: User-Guided Discovery of FD-Detectable Errors. SIGMOD:1385-1397, 2017.
[43] D. Van Aken, A. Pavlo, G. J. Gordon, B. Zhang. Automatic Database Management System Tuning Through Large-scale Machine Learning. SIGMOD:1009-1024, 2017.
[44] M. Volkovs, F. Chiang, J. Szlichta, R.J. Miller. Continuous Data Cleaning. ICDE:244-255, 2014.
[45] J. Wang, S. Krishnan, M.J. Franklin, K. Goldberg, T. Kraska, T. Milo. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data. SIGMOD:469-480, 2014.
[46] S.E. Whang, D. Marmaros, H. Garcia-Molina. Pay-as-you-go Entity Resolution. TKDE:1111-1124, 2013.
[47] M. Yakout, L. Berti-Equille, A.K. Elmagarmid. Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes. SIGMOD:553-564, 2013.
[48] M. Yakout, A.K. Elmagarmid, J. Neville, M. Ouzzani, I.F. Ilyas. Guided Data Repair. PVLDB, 4(5):279-289, 2011.
[49] A. Zhang, S. Song, J. Wang, P.S. Yu. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. PVLDB, 10(10):1046-1057, 2017.
[50] C.J. Zhang, Z. Zhao, L. Chen, H. V. Jagadish, C. C. Cao. Crowdmatcher: Crowd-Assisted Schema Matching. SIGMOD:721-724, 2014.
