A Survey On Data Collection For Machine Learning A Big Data - AI Integration Perspective
A Survey On Data Collection For Machine Learning A Big Data - AI Integration Perspective
4, APRIL 2021
Abstract—Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are
largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we
are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep
learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts
of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and
computer vision communities, but also from the data management community due to the importance of handling large amounts of data.
In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely
consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these
operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of
machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration
and opens many opportunities for new research.
1 INTRODUCTION
are living in exciting times where machine learn- Whenever there is a new product or a new defect to detect,
W E
ing is having a profound influence on a wide range
of applications from text understanding, image and
there is little or no training data to start with. The na€ıve
approach of manual labeling may not be feasible because it is
speech recognition, to health care and genomics. As a expensive and requires domain expertise. This problem
striking example, deep learning techniques are known to applies to any novel application that benefits from machine
perform on par with ophthalmologists on identifying learning.
diabetic eye diseases in images [1]. Much of the recent Moreover, as deep learning [2] becomes popular, there is
success is due to better computation infrastructure and even more need for training data. In traditional machine
large amounts of training data. learning, feature engineering is one of the most challenging
Among the many challenges in machine learning, data steps where the user needs to understand the application
collection is becoming one of the critical bottlenecks. It is and provide features used for training models. Deep learn-
known that the majority of the time for running machine ing, on the other hand, can automatically generate features,
learning end-to-end is spent on preparing the data, which which saves us of feature engineering, which is a significant
includes collecting, cleaning, analyzing, visualizing, and fea- part of data preparation. However, in return, deep learning
ture engineering. While all of these steps are time-consuming, may require larger amounts of training data to perform
data collection has recently become a challenge due to the well [3].
following reasons. As a result, there is a pressing need of accurate and scal-
First, as machine learning is used in new applications, it is able data collection techniques in the era of Big data, which
usually the case that there is not enough training data. Tradi- motivates us to conduct a comprehensive survey of the data
tional applications like machine translation or object detec- collection literature from a data management point of view.
tion enjoy massive amounts of training data that have been There are largely three methods for data collection. First,
accumulated for decades. On the other hand, more recent if the goal is to share and search new datasets, then data
applications have little or no training data. As an illustration, acquisition techniques can be used to discover, augment, or
smart factories are increasingly becoming automated where generate datasets. Second, once the datasets are available,
product quality control is performed with machine learning. various data labeling techniques can be used to label the
individual examples. Finally, instead of labeling new data-
sets, it may be better to improve existing data or train on
The authors are with the School of Electrical Engineering, Korea Advanced top of trained models. These three methods are not neces-
Institute of Science and Technology, Daejeon, Korea. sarily distinct and can be used together. For example,
E-mail: {yuji.roh, geon.heo, swhang}@kaist.ac.kr. one could search and label more datasets while improving
Manuscript received 8 Nov. 2018; revised 31 July 2019; accepted 2 Oct. 2019. existing ones.
Date of publication 8 Oct. 2019; date of current version 5 Mar. 2021. An interesting observation is that the data collection tech-
(Corresponding author: Steven Euijong Whang.)
Recommended for acceptance by L. Chen. niques come not only from the machine learning commu-
Digital Object Identifier no. 10.1109/TKDE.2019.2946162 nity (including natural language processing and computer
1041-4347 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1329
Fig. 1. A high level research landscape of data collection for machine learning. The topics that are at least partially contributed by the data manage-
ment community are highlighted using blue italic text. Hence, to fully understand the research landscape, one needs to look at the literature from the
viewpoints of both the machine learning and data management communities.
vision, which traditionally use machine learning heavily), or most recent ones. The key audience of this survey can be
but have also been studied for decades by the data manage- researchers or practitioners that are starting to use data
ment community, mainly under the names of data science collection for machine learning and need an overall land-
and data analytics. Fig. 1 shows an overview of the research scape introduction. Since the data collection techniques
landscape where the topics that have contributions from the come from different disciplines, some may involve rela-
data management community are highlighted with blue tional data while others non-relational data (e.g., images
italic text. Traditionally, labeling data has been a natural and text). Sometimes the boundary between operations
focus of research for machine learning tasks. For example, (e.g., data acquisition and data labeling) is not clear cut. In
semi-supervised learning is a classical problem where those cases, we will clarify that the techniques are relevant
model training is done on a small amount of labeled data in multiple operations.
and a larger amount of unlabeled data. However, as Motivating Example. To motivate the need to explore the
machine learning needs to be performed on large amounts techniques in Fig. 1, we present a running example on data
of training data, data management issues including how to collection based on our experience with collaborating with
acquire large datasets, how to perform data labeling at scale, the industry on a smart factory application. Suppose that
and how to improve the quality of large amounts of existing Sally is a data scientist who works on product quality for a
data become more relevant. Hence, to fully understand the smart factory. The factory may produce manufacturing
research landscape of data collection, one needs to under- components like gears where it is important for them not to
stand the literature from both the machine learning and have scratches, dents, or any foreign substance. Sally may
data management communities. want to train a model on images of the components, which
While there are many surveys on data collection that are can be used to automatically classify whether each product
either limited to one discipline or a class of techniques, to has defects or not. This application scenario is depicted in
our knowledge, this survey is the first to bridge the machine Fig. 3. A general decision flow chart of the data collection
learning (including natural language processing and com- techniques that Sally can use is shown in Fig. 2. Although
puter vision) and data management disciplines. We contend the chart may look complicated at first glance, we contend
that a machine learning user needs to know the techniques that it is necessary to understand the entire research land-
on all sides to make informed decisions on which techni- scape to make informed decisions for data collection. In
ques to use when. In fact, data management plays a role in comparison, recent commercial tools [6], [7], [8] only cover a
almost all aspects of machine learning [4], [5]. We note that subset of all the possible data collection techniques. When
many sub-topics including semi-supervised learning, active using the chart, one can quickly narrow down the options
learning, and transfer learning are large enough to have in two steps by deciding whether to perform one of data
their own surveys. The goal of this survey is not to go into acquisition, data labeling, or existing data improvements,
all the depths of these sub-topics, but to focus on breadth and then choosing the specific technique to use for each
and identify what data collection techniques are relevant for operation. For example, if there is no data, then Sally could
machine learning purposes and what research challenges generate a dataset by installing camera equipment. Then if
exist. Hence, we will only cover the most representative she has enough budget for human computation, she can use
work of the sub-topics, which are either the best-performing crowdsourcing platforms like Amazon Mechanical Turk to
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1330 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
Fig. 2. A decision flow chart for data collection. From the top left, Sally can start by asking whether she has enough data. The following ques-
tions lead to specific techniques that can be used for acquiring data, labeling data, or improving existing data or models. This flow chart does
not cover all the details in this survey. For example, data labeling techniques like self learning and crowdsourcing can be performed together
as described in Section 3.2.1. Also, some questions (e.g., “Enough labels for self learning?”) are not easy to answer and may require an
in-depth understanding of the application and data. There are also techniques specific to the data type (images and text), which we detail in
the body of the paper.
label the product images for defects. We will discuss more can be used when there is no available external dataset, but
details of the flow chart in the following sections. it is possible to generate crowdsourced or synthetic datasets
The rest of the paper is organized as follows: instead. The following sections will cover the three opera-
tions in more detail. The individual techniques are classified
We review the data acquisition literature, which can be in Table 1.
categorized into data discovery, data augmentation,
and data generation. Many of the techniques require
scalable solutions and have thus been studied by the
data management community (Section 2). 2.1 Data Discovery
We review the data labeling literature and group the Data discovery can be viewed as two steps. First, the gener-
techniques into three approaches: utilizing existing ated data must be indexed and published for sharing. Many
labels, using crowdsourcing techniques, and using collaborative systems are designed to make this process
weak supervision. While data labeling is traditionally easy. However, other systems are built without the inten-
a machine learning topic, it is also studied in the data tion of sharing datasets. For these systems, a post-hoc
management community as scalability becomes an approach must be used where metadata is generated after
issue (Section 3). the datasets are created, without the help of the dataset
We review techniques for improving existing data or owners. Next, someone else can search the datasets for their
models when acquiring and labeling new data is not machine learning tasks. Here the key challenges include
the best option. Improving data quality through clean- how to scale the searching and how to tell whether a dataset
ing is a traditional data management topic where is suitable for a given machine learning task. While most of
recent techniques are increasingly focusing on machine the data discovery literature came from the data manage-
learning applications (Section 4). ment community for data science and data analytics, they
We put all the techniques together and provide guide- are also relevant in a machine learning context. However,
lines on how to decide which data collection techni- another challenge in machine learning is data labeling,
ques to use when (Section 5). which we cover in Section 3.
Based on the current research landscape, we identify
interesting future research challenges (Section 6).
2 DATA ACQUISITION
The goal of data acquisition is to find datasets that can be
used to train machine learning models. There are largely
three approaches in the literature: data discovery, data aug-
mentation, and data generation. Data discovery is necessary
when one wants to share or search for new datasets and has
become important as more datasets are available on the Web Fig. 3. A running example for data collection. A smart factory may pro-
and corporate data lakes [19], [75]. Data augmentation com- duce various images of product components, which are classified as nor-
mal or defective by a convolutional neural network model. Unfortunately,
plements data discovery where existing datasets are with an application this specific, it is often difficult to find enough data for
enhanced by adding more external data. Data generation training the model.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1331
TABLE 1
A Classification of Data Acquisition Techniques
Gathering [45], [46], [47], [48], [49], [50], [51], [52], [53], [54]
Crowdsourcing
Processing [49], [50], [55], [56]
Generative Adversarial Networks [57], [58], [59], [60], [61], [62]
Data generation Policies [63], [64]
Synthetic Data Image [65], [66], [67], [68], [69], [70], [71]
Text [72], [73], [74]
Some of the techniques can be used together. For example, data can be generated while augmenting existing data.
2.1.1 Data Sharing the Kaggle datasets are coupled with competitions and are
We study data systems that are designed with dataset shar- thus more readily usable for machine learning purposes.
ing in mind. These systems may focus on collaborative anal-
ysis, publishing on the Web, or both. 2.1.2 Data Searching
Collaborative Analysis. In an environment where data sci- While the previous data systems are platforms for sharing
entists are collaboratively analyzing different versions of datasets, as a next logical step, we now explore systems that
datasets, DataHub [9], [10], [11] can be used to host, share, are mainly designed for searching datasets. This setting is
combine, and analyze them. There are two components: a common within large companies or on the Web.
dataset version control system inspired by Git (a version Data Lake. Data searching systems have become more
control system for code) and a hosted platform on top of it, popular with the advent of data lakes [19], [75] in corporate
which provides data search, data cleaning, data integration, environments where many datasets are generated internally,
and data visualization. A common use case of DataHub is but they are not easily discoverable by other teams or indi-
where individuals or teams run machine learning tasks on viduals within the company. Providing a way to search data-
their own versions of a dataset and later merge with other sets and analyze them has significant business value because
versions if necessary. the teams or individuals do not have to make redundant
Web. A different approach of sharing datasets is to publish efforts to re-generate the datasets for their machine learning
them on the Web. Google Fusion Tables [12], [13], [14] is a tasks. Most of the recent data lake systems have come from
cloud-based service for data management and integration. the industry. In many cases, it is not feasible for all the data-
Fusion Tables enables users to upload structured data (e.g., set owners to publish datasets through one system. Instead,
spreadsheets) and provides tools for visually analyzing, fil- a post-hoc approach becomes necessary where datasets are
tering, and aggregating the data. The datasets that are pub- processed for searching after they are created, and no effort
lished through Fusion Tables on the Web can be crawled by is required on the dataset owner’s side.
search engines and show up in search results. The datasets As an early solution for data lakes, IBM proposed a sys-
are therefore primarily accessible through Web search. tem [19] that enables datasets to be curated and then
Fusion Tables has been widely used in data journalism for searched. IBM estimates that 70 percent of the time spent on
creating interactive maps of data and adding them in articles. analytic projects is concerned with discovering, cleaning,
In addition, there are many data marketplaces including and integrating datasets that are scattered among many busi-
CKAN [15], Quandl [16], and DataMarket [17] where users ness applications. Thus, IBM takes the stance of creating, fill-
can buy and sell datasets or find public datasets. ing, maintaining, and governing the data lake where these
Collaborative and Web. More recently, we are seeing a processes are collectively called data wrangling. When ana-
merging of collaborative and Web-based systems. For exam- lyzing data, users do not perform the analytics directly on
ple, Kaggle [18] makes it easy to share datasets on the Web the data lake, but extract data sets and store them separately.
and even host data science competitions for models trained Before this step, the users can do a preliminary exploration
on the datasets. A Kaggle competition host posts a dataset of datasets, e.g., visualizing them to determine if the dataset
along with a description of the challenge. Participants can is useful and does not contain anomalies that need further
then experiment with their techniques and compete with investigation. While supporting data curation in the data
each other. After the deadline passes, a prize is given to the lake saves users from processing raw data, it does limit the
winner of the competition. Kaggle currently has thousands scalability of how many datasets can be indexed.
of public datasets and code snippets (called kernels) from More recently, scalability has become a pressing issue for
competitions. In comparison to DataHub and Fusion Tables, handling data lakes that consists of most datasets in a large
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1332 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
company. Google Data Search (GOODS) [20] is a system that Recently, a service called Google Dataset Search [26] was
catalogues the metadata of tens of billions of datasets from launched for searching repositories of datasets on the Web.
various storage systems within Google. GOODS infers various The motivation is that there are thousands of data reposito-
metadata including owner information and provenance infor- ries on the Web that contain millions of datasets that are not
mation (by looking up job logs), analyzes the contents of the easy to search. Dataset Search lets dataset providers
datasets, and collects input from users. At the core is a central describe their datasets using various metadata (e.g., author,
catalog, which contains the metadata and is indexed for data publication date, how the data was collected, and terms for
searching. Due to Google’s scale, there are many technical using the data) so that they become more searcheable. In
challenges including scaling to the number of datasets, sup- comparison to the fully-automatic WebTables, dataset pro-
porting a variety of data formats where the costs for extracting viders may need to do some manual work, but have the
metadata may differ, updating the catalog entries due to the opportunity to make their datasets more searcheable.
frequent churn of datasets, dealing with uncertainty in meta- In comparison to GOODS, Dataset Search targets the Web
data discovery, computing dataset importance for search instead of a data lake.
ranking, and recovering dataset semantics that are missing.
To find datasets, users can use keywords queries on the GOODS 2.2 Data Augmentation
frontend and view profile pages of the datasets that appear in Another approach to acquiring data is to augment existing
the search results. In addition, users can track the provenance datasets with external data. In the machine learning com-
of a dataset to see which datasets were used to create the munity, adding pre-trained embeddings is a common way
given dataset and those that rely on it. to increase the features to train on. In the data management
Finally, expressive queries are also important for search- community, entity augmentation techniques have been pro-
ing a data lake. While GOODS scales, one downside is that it posed to further enrich existing entity information. Data
only supports simple keyword queries. This approach is sim- integration is a broad topic and can be considered as data
ilar to keyword search in databases [76], [77], but the purpose augmentation if we are extending existing datasets with
is to find datasets instead of tuples. The DATA CIVILIZER sys- newly-acquired ones.
tem [21], [22] complements GOODS by focusing more on the
discovery aspect of datasets. Specifically, DATA CIVILIZER con- 2.2.1 Deriving Latent Semantics
sists of a module for building a linkage graph of data.
Assuming that datasets have schema, the nodes in the link- A common data augmentation is to derive latent semantics
age graph are columns of tables while edges are relationships from data. A popular technique is to generate and use
like primary key-foreign key (PK-FK) relationships. A data embeddings that represent words, entities, or knowledge. In
discovery module then supports a rich set of discovery queries particular, word embeddings have been successfully used to
solve many problems in natural language processing (NLP).
on the linkage graph, which can help users more easily
Word2vec [35] is a seminal work where, given a text corpus,
discover the relevant datasets. DATARAMAN [23] specializes in
a word is represented by a vector of real numbers that cap-
extracting structured data from semi-structured log datasets
tures the linguistic context of the word in the corpus. The
in data lakes automatically by learning patterns. AURUM [78],
word vectors can be generated by training a shallow two-
[79] supports data discovery queries on semantically-linked
datasets. layer neural network to reconstruct the surrounding words
Web. As the Web contains large numbers of structured in the corpus. There are two possible models for training
datasets, there have been significant efforts to automatically word vectors: Continuous Bag-of-Words (CBOW) and Skip-
extract the useful ones [32], [33], [34]. One of the most suc- gram. While CBOW predicts a word based on its surround-
cessful systems is WebTables [24], [25], which automatically ing words, Skip-gram does the opposite and predicts the
extracts structured data that is published online in the form surrounding words based on a given word. As a result, two
words that occur in similar contexts tend to have similar
of HTML tables. For example, WebTables extracts all Wiki-
word vectors. A fascinating application of word vectors is
pedia infoboxes. Initially, about 14.1 billion HTML tables
performing arithmetic operations on the word vectors. For
are collected from the Google search web crawl. Then a clas-
example, the result of subtracting the word vector of “king”
sifier is applied to determine which tables can be viewed as
by that of “queen” is similar to the result of subtracting the
relational database tables. Each relational table consists of a
schema that describes the columns and a set of tuples. In word vector of “man” by that of “woman”. Since word2vec
comparison to the above data lake systems, WebTables col- was proposed, there have been many extensions including
lects structured data from the Web. GloVe [36], which improves word vectors by also taking into
As Web data tends to be much more diverse than say those account global corpus statistics, and Doc2Vec [80], which
in a corporate environment, the table extraction techniques generates representations of documents.
have been extended in multiple ways as well. One direction is Another technique for deriving latent semantics is latent
topic modeling. For example, Latent Dirichlet Allocation [37]
to extend table extraction beyond identifying HTML tags by
(LDA) is a generative model that can be used to explain
extracting relational data in the form of vertical tables and lists
why certain parts of the data are similar using unobserved
and leveraging knowledge bases [27], [28]. Table searching
groups.
also evolved where, in addition to keyword searching, row-
subset queries, entity-attribute queries, and column search
were introduced [29]. Finally, techniques for enhancing the 2.2.2 Entity Augmentation
tables [30], [31] were proposed where entities or attribute val- In many cases, datasets are incomplete and need to be filled
ues are added to make the tables more complete. in by gathering more information. The missing information
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1333
can either be values or entire features. An early system learning, and human computer interaction. There is a wide
called Octopus [30] composes Web search queries using range of crowdsourcing tasks from simple ones like labeling
keys of the table containing the entities. Then all the Web images up to complex ones like collaboratively writing that
tables in the resulting Web pages are clustered by schema, involve multiple steps [86], [87]. Another important usage of
and the tables in the most relevant cluster are joined with crowdsourcing is data labeling (e.g., the ImageNet project),
the entity table. More recently, InfoGather [31] takes a holis- which we discuss in Section 3.2.
tic approach using Web tables on the Web. The entity aug- In this section, we narrow the scope and focus on crowd-
mentation is performed by filling in missing values of sourcing techniques that are specific to data generation
attributes in some or all of the entities by matching multiple tasks. A recent survey [88] provides an extensive discussion
Web tables using schema matching. To help the user decide on the challenges for data crowdsourcing. Another sur-
which attributes to fill in, InfoGather identifies synonymous vey [89] touches on the theoretical foundations of data
attributes in the Web tables. crowdsourcing. According to both surveys, data generation
using crowdsourcing can be divided into two steps: gather-
2.2.3 Data Integration ing data and preprocessing data.
Gathering Data. One way to categorize data gathering
Data integration can also be considered as data augmenta-
techniques is whether the tasks are procedural or declara-
tion, especially if we are extending existing data sets with
tive. A procedural task is where the task creator defines
other acquired ones. Since this discipline is well established,
explicit steps and assigns them to workers. For example,
we point the readers to some excellent surveys [40], [41].
one may write a computer program that issues tasks to
More recently, an interesting line of work relevant to
workers. TurKit [45] allows users to write scripts that
machine learning [42], [43], [44] observes that in practice,
many companies use relational databases where the train- include HITs using a crash-and-return programming model
ing data is divided into smaller tables. However, most where a script can be re-executed without re-running cos-
machine learning toolkits assume that a training dataset is a tly functions with side effects. AUTOMAN [46] is a domain-
single file and ignore the fact that there are typically multi- specific language embedded in Scala where crowdsourcing
ple tables in a database due to normalization. The key ques- tasks can be invoked like conventional functions. DOG [47]
tion is whether joining the tables and augmenting the is a high-level programming language that compiles into
MapReduce tasks that can be performed by humans or
information is useful for model training. The Hamlet sys-
machines. A declarative task is when the task creator speci-
tem [38] and its subsequent Hamlet++ systems [39] address
fies high-level data requirements, and the workers provide
this problem by determining if key-foreign key (KFK) joins
the data that satisfy them. For example, a database users
are necessary for improving the model accuracy for various
may pose an SQL query like “SELECT title, director, genre,
classifiers (linear, decision trees, non-linear SVMs, and arti-
ficial neural networks) and propose decision rules to predict rating FROM MOVIES WHERE genre = “action” to gather
when it is safe to avoid joins and, as a result, significantly movie ratings data for a recommendation system. DECO [48]
reduce the total runtime. A surprising result is that joins can uses a simple extension of SQL and defines precise seman-
often be avoided without negatively influencing the model’s tics for arbitrary queries on stored data and data collected
accuracy. Intuitively, a foreign key determines the entire by the crowd. CrowdDB [49] focuses on the systems aspect
record of the joining table, so the features brought in by a of using crowdsourcing to answer queries that cannot be
answered automatically.
join do not add a lot more information.
Another way to categorize data gathering is whether
the data is assumed to be closed-world or open-world. Under
2.3 Data Generation
a closed-world assumption, the data is assumed to be
If there are no existing datasets that can be used for train- “known” and entirely collectable by asking the right ques-
ing, then another option is to generate the datasets either
tions. ASKIT! [51] uses this assumption and focuses on the
manually or automatically. For manual construction, crowd-
problem of determining which questions should be directed
sourcing is the standard method where human workers are
to which users, in order to minimize the uncertainty of the
given tasks to gather the necessary bits of data that collec-
collected data. In an open-world assumption, there is no lon-
tively become the generated dataset. Alternatively, auto-
ger a guarantee that all the data can be collected. Instead,
matic techniques can be used to generate synthetic datasets. one must estimate if enough data was collected. Statistical
Note that data generation can also be viewed as data aug- tools [52] have been proposed for scanning a single table with
mentation if there is existing data where some missing parts predicates like “SELECT FLAVORS FROM ICE_CREAM.”
needs to be filled in. Initially, many flavors can be collected, but the rate of new fla-
vors will inevitably slow down, and statistical methods are
2.3.1 Crowdsourcing used to estimate the future rate of new values.
Crowdsourcing is used to solve a wide range of problems, Data gathering is not limited to collecting entire records
and there are many surveys as well [81], [82], [83], [84]. One of a table. CrowdFill [53] is a system for collecting parts of
of the earliest and most popular platforms is Amazon structured data from the crowd. Instead of posing specific
Mechanical Turk [85] where tasks (called HITs) are assigned questions to workers, CrowdFill shows a partially-filled
to human workers, and workers are compensated for finish- table. Workers can then fill in the empty cells and also
ing the tasks. Since then, many other crowdsourcing plat- upvote or downvote data entered by other workers. Crowd-
forms have been developed, and research on crowdsourcing Fill provides a collaborative environment and allows the
has flourished in the areas of data management, machine specification of constraints on values and mechanisms for
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1334 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
resolving conflicts when workers are filling in values of the The objective of the generative network is to increase the
same record. ALFRED [54] uses the crowd to train extrac- error rate of the discriminative network. That is, the genera-
tors that can then be used to acquire data. ALFRED asks tive network attempts to fool the discriminative network
simple yes/no membership questions on the contents of into thinking that its candidates are from the true distribu-
Web pages to workers and uses the answers to infer the tion. GANs have been used to generate synthetic images
extraction rules. The quality of the rules can be improved and videos that look realistic in many applications.
by recruiting multiple workers. GANs have recently been used to generate synthetic rela-
Preprocessing Data. Once the data is gathered, one may tional data. A MEDGAN [58] generates synthetic patient
want to preprocess the data to make it suitable for machine records with high-dimensional discrete variable (binary or
learning purposes. While many possible crowd operations count) features based on real patient records. While GANs
have been proposed, the ones that are relevant include data can only learn to approximate discrete patient records, the
curation, entity resolution, and joining datasets. Data novelty is to also use an autoencoder to project these records
Tamer [55] is an end-to-end data curation system that can into a lower dimensional space and then project them back
clean and transform datasets and semantically integrate to the original space. A TABLE-GAN [59] also synthesizes
with other datasets. Data Tamer has a crowdsourcing com- tables that are similar to the real ones, but with a focus on
ponent (called Data Tamer Exchange), which assigns tasks privacy perservation. In particular, a metric for information
to workers. The supported operations are attribute identifi- loss is defined, and two parameters are provided to adjust
cation (i.e., determine if two attributes are the same) and the information loss. The higher the loss, the more privacy
entity resolution (i.e., determine if two entities are the the synthetic table has. A TGAN [60] focuses on simulta-
same). Corleone [56] is a hands-off crowdsourcing system, neously generating values for a mixture of discrete and con-
which crowdsources the entire workflow of entity resolu- tinuous features.
tion to workers. CrowdDB [49] and Qurk [50] are systems Policies. Another recent approach is to use human-
for aggregating, sorting, and joining datasets. defined policies [63], [64] to apply transformations to the
For both gathering and preprocessing data, quality con- images as long as they remain realistic. This criteria can be
trol is an important challenge as well. The issues include enforced by training a reinforcement learning model on a
designing the right interface to maximize worker productiv- separate validation set.
ity, managing workers who may have different levels of Data-Specific. We now introduce data-specific techniques
skills (or may even be spammers), and decomposing prob- for generation. Synthetic image generation is a heavily-
lems into smaller tasks and aggregating them. Several sur- studied topic in the computer vision community. Given the
veys [82], [83], [84] cover these issues in detail. wide range of vision problems, we are not aware of a com-
prehensive survey on synthetic data generation and will
2.3.2 Synthetic Data Generation only focus on a few representative problems. In object detec-
Generating synthetic data along with labels is increasingly tion, it is possible to learn 3D models of objects and give
being used in machine learning due to its low cost and flexi- variations (e.g., rotate a car 90 degrees) to generate another
bility [90]. A simple method is to start from a probability realistic image [65], [66]. If the training data is a rapid
distribution and generate a sample from that distribution sequence of images frames in time [67] the objects of a frame
using tools like scikit learn [91]. In addition, there are more can be assumed to move in a linear trajectory between con-
advanced techniques like Generative Adversarial Networks secutive frames. Text within images is another application
(GANs) [2], [57], [61], [62] and application-specific genera- where one can vary the fonts, sizes, and colors of the text to
tion techniques. We first provide a brief introduction of generate large amounts of synthetic text images [68], [69].
GANs and present synthetic data generation techniques on An alternative approach to generating image datasets is
relational data. We then introduce recent augmentation to start from a large set of noisy images and select the clean
techniques using policies. Finally, we introduce image and ones. Xia et al. [70] searches the Web for images with noise
text data generation techniques due to their importance. and then uses a density-based measure to cluster the images
GANs. The key approach of a GAN is to train two con- and remove outliers. Bai et al. [71] exploits large click-
testing neural networks: a generative network and a dis- through logs, which contains queries of users and the
criminative network. The generative network learns to map images that were clicked by those users. A deep neural net-
from a latent space to a data distribution, and the discrimi- work is used to learn representations of the words and
native network discriminates examples from the true distri- images and compute word-word and image-word similari-
bution from the candidates produced by the generative ties. The noisy images that have low similarities to their cat-
network. egories are then removed.
The training of a GAN can be formalized as: Generating synthetic text data has also been studied in
the natural language processing community. Paraphras-
ing [72] is a classical problem of generating alternative
min max V ðD; GÞ
G D expressions that have the same semantic meaning. For
example “What does X do for a living?” is a paraphrase of
V ðD; GÞ ¼ E ½log DðxÞ þ E ½log ð1 DðGðzÞÞ;
xpdata ðxÞ zpz ðzÞ “What is X’s job?”. We briefly cover two recent methods –
one is syntax-based and the other semantics-based – that
where pdata ðxÞ is the distribution of the real data, pz ðzÞ is the uses paraphrasing to generate large amounts of synthetic
distribution of the generator, GðzÞ is the generative network, text data. Syntactically controlled paraphrase networks [73]
and DðxÞ is the discriminative network. (SCPNs) can be trained to produce paraphrases of a
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1335
TABLE 2
A Classification of Data Labeling Techniques
Some of the techniques can be used for the same application. For example, for classification on graph data, both self-labeled techniques and label
propagation can be used.
sentence with different sentence structures. Semantically Table 2 shows where different labeling approaches fit
equivalent adversarial rules for text [74] (SEARs) have been into the categories. In addition, each labeling approach can
proposed for perturbing input text while preserving its be further categorized as follows:
semantics. SEARs can be used to debug a model by apply-
ing them on training data and seeing if the re-trained model Machine learning task: In supervised learning, the two
changes its predictions. In addition, there are many para- categories are classification (e.g., determining whether
phrasing techniques that are not covered in this survey. a piece of text has a positive sentiment) and regression
(e.g., estimating the salary of a person). Most of the
3 DATA LABELING data labeling research has been focused on classifica-
tion problems rather than regression problems, possi-
Once enough data has been acquired, the next step is to bly because data labeling is simpler in a classification
label individual examples. For instance, given an image setting.
dataset of industrial components in a smart factory applica- Data type: Depending on the data type (e.g., text,
tion, workers can start annotating if there are any defects in images, and graphs) the data labeling techniques dif-
the components. In many cases, data acquisition is done fer significantly. For example, fact extraction from
along with data labeling. When extracting facts from the text is very different from object detection on images.
Web and constructing a knowledge base, then each fact is
assumed to be correct and thus implicitly labeled as true. 3.1 Utilizing Existing Labels
When discussing the data labeling literature, it is easier to A common setting in machine learning is to have a small
separate it from data acquisition as the techniques can be amount of labeled data, which is expensive to produce with
quite different. humans, along with a much larger amount of unlabeled
We believe the following categories provide a reasonable data. Semi-supervised learning techniques [143] exploit
view of understanding the data labeling landscape: both labeled and unlabeled data to make predictions. In a
Use existing labels: An early idea of data labeling is to transductive learning setting, the entire unlabeled data is
exploit any labels that already exist. There is an available while in an inductive learning setting, some unla-
extensive literature on semi-supervised learning beled data is available, but the predictions must be on
where the idea is to learn from the labels to predict unseen data. Semi-supervised learning is a broad topic, and
the rest of the labels. we focus on a smaller branch of research called self-labeled
Crowd-based: The next set of techniques are based on techniques [96] where the goal is to generate more labels by
crowdsourcing. A simple approach is to label individ- trusting one’s own predictions. Since the details are in the
ual examples. A more advanced technique is to use survey, we only provide a summary here. In addition to the
active learning where questions to ask are more care- general techniques, there are graph-based label propagation
fully selected. More recently, many crowdsourcing techniques that are specialized for graph data.
techniques have been proposed to help workers
become more effective in labeling. 3.1.1 Classification
Weak labels: While it is desirable to generate correct For semi-supervised learning techniques for classification,
labels all the time, this process may be too expensive. the goal is to train a model that returns one of multiple pos-
An alternative approach is to newly generate less than sible classes for each example using labeled and unlabeled
perfect labels (i.e., weak labels), but in large quantities datasets. We consider the best-performing techniques in a
to compensate for the lower quality. Recently, the lat- survey that focuses on labeling data [96], which are summa-
ter approach is gaining more popularity as labeled rized in Fig. 4. The performance results are similar regard-
data is scarce in many new applications. less of using transductive or inductive learning.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1336 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
3.1.2 Regression
Relatively less research has been done for semi-supervised
Fig. 4. A simplified classification of semi-supervised learning techniques learning for regression where the goal is to train a model that
for self labeling according to a survey [96] using the best-performing
techniques regardless of inductive or transductive learning. predicts a real number given an example. Co-regularized
least squares regression [97] is a least squares regression
algorithm based on the co-learning approach. Another co-
The simplest class of semi-supervised learning techni-
regularized framework [98] utilizes sufficient and redundant
ques train one model using one learning algorithm on one views similar to Co-training. Co-training regressors [99] uses
set of features. For example, Self-training [92] initially trains two k-nearest neighbor regressors with different distance
a model on the labeled examples. The model is then applied metrics. In each iteration, a regressor labels the unlabeled
to all the unlabeled data where the examples are ranked by data that can be labeled most confidently by the other regres-
the confidences in their predictions. The most confident pre- sor. After the iterations, the final prediction of an example is
dictions are then added into the labeled examples. This pro- made by averaging the regression estimates by the two
cess repeats until all the unlabeled examples are labeled.
regressors. Co-training Regressors can be extended by using
The next class trains multiple classifiers by sampling the
any other base regressor.
training data several times and training a model for each sam-
ple. For example, Tri-training [93] initially trains three models
on the labeled examples using Bagging for the ensemble learn- 3.1.3 Graph-based Label Propagation
ing algorithm. Then each model is updated iteratively where Graph-based label propagation techniques also start with
the other two models make predictions on the unlabeled exam- limited sets of labeled examples, but exploit the graph struc-
ples, and only the examples with the same predictions are used ture of examples based on their similarities to infer the
in conjunction with the original labeled examples to re-train labels of the remaining examples. For example, if an image
the model. The iteration stops when no model changes. Finally, is labeled as a dog, then similar images down the graph can
the unlabeled examples are labeled using majority voting also be labeled as dogs with some probability. The further
where at least two models must agree with each other. the distance, the lower the probability of label propagation.
The next class uses multiple learning algorithms. For Graph-based label propagation has applications in com-
example, Democratic Co-learning [94] uses a set of different puter vision, information retrieval, social networks, and nat-
learning algorithms (in the experiments, they are Naive ural language processing. Zhu et al. [101] proposed a semi-
Bayes, C4.5, and 3-nearest neighbor) are used to train a set supervised learning based on a Gaussian random field
of classifiers separately on the same training data. Predic- model where the unlabeled and labeled examples form a
tions on new examples are generated by combining the weighted graph. The mean of the field is characterized in
results of the three classifiers using weighted voting. The terms of harmonic functions and can be efficiently com-
new labels are then added to the training set of the classi- puted using matrix methods or belief propagation. The
fiers whose predictions are different from the majority MAD-Sketch algorithm [102] was proposed to further
results. This process repeats until no more data is added to reduce the space and time complexities of graph-based SSL
the training data of a classifier. algorithms using count-min sketching. In particular, the
The final class uses multiple views, which are subsets of space complexity per node is reduced from OðmÞ to
features that are conditionally independent given the class. Oðlog mÞ under certain conditions where m is the number
For example, Co-training [95] splits the feature set into two of distinct labels, and a similar improvement is achieved for
sufficient and redundant views, which means that one set the time complexity. Recently, a family of algorithms called
of features is sufficient for learning and independent of EXPANDER [100] were proposed to further reduce the
learning with the other set of features given the label. For space complexity per node to Oð1Þ and compute the MAD-
each feature set, a model is trained and then used to teach Sketch algorithm in a distributed fashion.
the model trained on the other feature set. The co-trained
models can minimize errors by maximizing their agree- 3.2 Crowd-Based Techniques
ments over the unlabeled examples. The most accurate way to label examples is to do it manu-
According to the survey, these algorithms result in simi- ally. A well known use case is the ImageNet image classifi-
lar transductive or inductive accuracies when averaged on cation dataset [146] where tens of millions of images were
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1337
organized according to a semantic hierarchy by WordNet algorithm and trains models on bootstrap samples. There is no
using Amazon Mechanical Turk. However, ImageNet is an general agreement on the best number of models to train,
ambitious project that took years to complete, which most which is application-specific.
machine learning users cannot afford for their own applica- Both uncertainty sampling and query-by-committee focus
tions. Traditionally, active learning has been a key tech- on individual examples instead of the entire set of examples,
nique in the machine learning community for carefully they run into the danger of choosing examples that are outliers
choosing the right examples to label and thus minimize according to the example distribution. Density weighting [107]
cost. More recently, crowdsourcing techniques for labeling is a way to improve the above techniques by choosing instan-
have been proposed where there can be many workers who ces that are not only uncertain or disagreeing, but also repre-
are not necessarily experts in labeling. Hence, there is more sentative of the example distribution.
emphasis on how to assign tasks to workers, what interfaces Decision Theoretic Approaches. Another line of active learn-
to use, and how to ensure high quality labels. Recent com- ing performs decision-theoretic approaches. Decision theory
mercial tools vary in what services they provide for label- is a framework for making decision under uncertainty using
ing. For example, Amazon SageMaker [8] supports labeling states and actions to optimize some objective function. In
based on active learning, Google Cloud AutoML [6] pro- the context of active learning, the objective could be to
vides a manual labeling service, and Microsoft Custom choose an example that maximizes the estimated model
Vision [7] requires labels from the user. While crowdsourc- accuracy [108]. Another possible objective is reducing gen-
ing data labeling is closely related to crowdsourcing data eralization error [109], which is estimated as follows: if the
acquisition, the individual techniques are different. measure is log loss, then the entropy of the predicted class
distribution is considered the error rate; if the measure is
3.2.1 Active Learning 0-1 loss, the maximum probability among all classes is the
Active learning focuses on selecting the most “interesting” error rate. Each example to label is chosen by taking a sam-
unlabeled examples to give to the crowd for labeling. The ple of the unlabeled data and choosing the example that
workers are expected to be very accurate, so there is less minimizes the estimated error rate.
emphasis on how to interact with those with less expertise. Regression. Active learning techniques can also be extended
While some references view active learning as a special to regression problems. For uncertainty sampling, instead of
case of semi-supervised learning, the key difference is that computing the entropy of classes, one can compute the output
there is a human-in-the-loop. The key challenge is choosing variance of the predictions and select the examples with the
the right examples to ask given a limited budget. One highest variance. Query-by-committee can also be extended
downside of active learning is that the examples are biased to regression [110] by training a committee of models and
to the training algorithm and cannot be reused. Active selecting the examples where the variance among the
learning is covered extensively in other surveys [105], [147], committee’s predictions is the largest. This approach is said to
and we only cover the most prominent techniques here. work well when the bias of the models is small. Also, this
Uncertain Examples Uncertainty Sampling [103] is the sim- approach is said to be robust to overspecification, which
plest in active learning and chooses the next unlabeled reduces the chance of overfitting.
example that the model prediction is most uncertain. For Self and Active Learning Combined. The data labeling tech-
example, if the model is a binary classifier, uncertainty sam- niques we consider are complementary to each other and
pling chooses the example whose probability is nearest to can be used together. In fact, semi-supervised learning and
0.5. If there are more than three class labels, we could active learning have a history of being used together [111],
choose the example whose prediction is the least confident. [112], [113], [114], [148]. A key observation is that the two
The downside of this approach is that it throws away the techniques solve opposite problems where semi-supervised
information of all the other possible labels. So an improved learning finds the predictions with the highest confidence
version called margin sampling is to choose the example and adds them to the labeled examples while active learn-
whose probability difference between the most and second- ing finds the predictions with the lowest confidence (using
most probable labels is the largest. This method can be fur- uncertainty sampling, query-by-committee, or density-
ther generalized using entropy as the uncertainty measure weighted method) and sends them for manual labeling.
where entropy is an information-theoretic measure for the There are various ways semi-supervised learning can be
amount of information to encode a distribution. used with active learning. McCallum and Nigam [111]
Query-by-Committee [104] extends uncertainty sampling improves the Query-By-Committee (QBC) technique and
by training a committee of models on the same labeled data. combines it with Expectation-Maximization (EM), which
Each model can vote when labeling each example, and the effectively performs semi-supervised learning. Given a set
most informative example is considered to be the one where of documents for training data, active learning is done by
the most models disagree with each other. More formally, this selecting the documents that are closer to others (and thus
approach minimizes the version space, which is the space of all representative), but have committee disagreement, for label-
possible classifiers that give the same classification results as ing. In addition, the EM algorithm is used to further infer
(and are thus consistent with) the labeled data. The challenge the rest of the labels. The active learning and EM can either
is to train models that represent different regions of the version be done separately or interleaved. Tomanek and Hahn [112]
space and have some amount of disagreement. Various meth- propose semi-supervised active learning (SeSAL) for
ods have been proposed [105], but there does not seem to be a sequence labeling tasks, which include POS tagging, chunk-
clear winner. One general method is called query-by-bag- ing, or named entity recognition (NER). Here the examples
ging [106], which uses bagging as an ensemble learning are sequences of text. The idea is to use active learning for
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1338 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
the subsequences that have the highest training utility records: pair-based and cluster-based. Qurk [50] uses a map-
within the selected sentences and use semi-supervised ping interface where multiple records on one side are
learning to automatically label the rest of the subsequences. matched with records on the other side. Qurk uses a combi-
The utility of a subsequence is highest when the current nation of comparison or rating tasks to accelerate labeling.
model is least confident about the labeling. Quality Control. Controlling the quality of data labeling by
Zhou et al. [113] proposes the semi-supervised active the crowd is important because the workers may vary signifi-
image retrieval (SSAIR) approach where the focus is on cantly in their abilities to provide labels. A simple way to
image retrieval. SSAIR is inspired by the co-training method ensure quality is to repeatedly label the same example using
where initially two classifiers are trained from the labeled multiple workers and perhaps take a majority voting at the
data. Then each learner passes the most relevant/irrelevant end. However, there are more sophisticated approaches as
images to the other classifier. The classifiers are then well. Get another label [119] and Crowdscreen [120] actively
retrained with the additional labels, and their results are solicit labels while Karger et al. [121] passively collects data
combined. The images that still have low confidence are and runs the expectation maximization algorithm. Vox Pop-
selected to be labeled by humans. uli [122] proposes techniques for pruning low-quality work-
Zhu et al. [114] combines semi-supervised and active ers that can achieve better labeling quality without having to
learning under a Gaussian random field model. The labeled repeatedly label examples.
and unlabeled examples are represented as vertices in a Scalability. Scaling up crowdsourced labeling is another
graph where edges are weighted by similarities between important challenge. While traditional active learning tech-
examples. This framework enables one to compute the next niques were proposed for this purpose, more recently the
question that minimizes the expected generalization error data management community has started to apply systems
efficiently for active learning. Once the new labels are added techniques for further scaling the algorithms to large data-
to the labeled data, semi-supervised learning is performed sets. In particular, Mozafari et al. [116] proposes active
using harmonic functions. learning algorithms that can run in parallel. One algorithm
(called Uncertainty) selects examples that the current classi-
3.2.2 Crowdsourcing fier is most uncertain about. A more sophisticated algorithm
In comparison to active learning, the crowdsourcing techni- (called MinExpError) combines the current model’s accuracy
ques here are more focused on running tasks with many with the uncertainty. A key idea is the use of bootstrap the-
workers who are not necessarily labeling experts. As a ory, which makes the algorithms applicable to any classifier
result, workers may make mistakes, and there is a heavy lit- and also enables embarassingly-parallel processing.
erature [81], [82], [83], [84], [115], [149], [150] on improving Regression. In comparison to crowdsourcing research for
the interaction with workers, evaluating workers so they classification tasks, less attention has been given to regres-
are reliable, reducing any bias that the workers may have, sion tasks. Marcus et al. [123] solves the problem of selectiv-
and aggregating the labeling results while resolving any ity estimation in a crowdsourced database. The goal is to
ambiguities among them. estimate the fraction of records that satisfy some property
User Interaction. A major challenge in user interaction is by asking workers questions.
to effectively provide instructions to workers on how to per-
form the labeling. The traditional approach is to provide 3.3 Weak Supervision
some guidelines for labeling to the workers up front and As machine learning is used in a wider range of applica-
then let them make a best effort to follow them. However, tions, it is mostly the case that there is not enough labeled
the guidelines are often incomplete and do not cover all pos- data. For example, in a smart factory setting, any new prod-
sible scenarios, leaving the workers in the dark. Revolt [115] uct will have no labels for training a model for quality con-
is a system that attempts to fix this problem through collab- trol. As a result, weak supervision techniques [151], [152],
orative crowdsourcing. Here workers work in three steps: [153] have become increasingly popular where the idea is to
Voting where workers vote just like in traditional labeling, semi-automatically generate large quantities of labels that
Explaining where workers justify their rational for labeling, are not as accurate as manual labels, but good enough for
and Categorize where workers review explanations from the trained model to obtain a reasonably-high accuracy.
other workers and tag any conflicting labels. This informa- This approach is especially useful when there are large
tion can then be used to make post-hoc judgements of the amounts of data, and manual labeling becomes infeasible.
label decision boundaries. Another approach is to provide In the next sections, we discuss the recently proposed data
better tools to assist workers to organize their concepts, programming paradigm and fact extraction techniques.
which may evolve as more examples are labeled [117].
In addition, providing the right labeling interface is critical
for workers to perform well. The challenge is that each appli- 3.3.1 Data Programming
cation may have a different interface that works best. We will As data labeling at scale becomes more important especially
not cover all the possible applications, but instead illustrate a for deep learning applications, data programming [126] has
line of research for the problem of entity resolution where the been proposed as a solution for generating large amounts of
goal is to find records in a database that refer to the same real- weak labels using multiple labeling functions instead of
world entity. Here the label is whether two (or more) records individual labeling. Fig. 5 illustrates how data program-
are the same or not. Just for this problem, there is a line of ming can be used for Sally’s smart factory application. A
research on providing the best interface for comparisons. labeling function can be any computer program that either
CrowdER [118] provides two types of interfaces to compare generates a label for an example or refrains to do so. For
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1339
extensive research (see the survey [181]) on how to make the like for datasets. The metadata for models may be quite
model training still use noisy labels by becoming more different than metadata for data because one needs to deter-
robust. For specific techniques, Xiao et al. [167] propose a mine if a model can be used for transfer learning in her
general framework for training convolutional neural net- own application. In addition to using pre-trained models,
works on images with a small number of clean labels and another popular technique mainly used in Computer Vision
many noisy labels. The idea is to model the relationships is few-shot learning [175] where the goal is to extend exist-
between images, class labels, and label noises with a proba- ing models to handle new classes using zero or more exam-
bilistic graphical model and integrate it into the model train- ples. Since transfer learning is primarily a machine learning
ing. Label noise is categorized into two types: confusing topic that does not significantly involve data management,
noise, which is caused by confusing content in the images, we only summarize the high-level ideas based on sur-
and pure random noise, which is caused by technical bugs veys [172], [173]. There are studies of transfer learning tech-
like mismatches between images and their surrounding niques in the context of NLP [176], Computer Vision [177],
text. The true labels and noise types are treated as latent var- and deep learning [184] as well.
iables, and an EM algorithm is used for inference. Webly An early survey of transfer learning [172] identifies three
supervised learning [168] is a technique for training a con- main research issues in transfer learning: what to transfer,
volutional neural network on clean and noisy images on the how to transfer, and when to transfer. That is, we need to
Web. First, the model is trained on top-ranked images from decide what part of knowledge can be transferred, what
search engines, which tend to be clean because they are methods should be used to transfer the knowledge, and
highly-ranked, but also biased in the sense that objects tends whether transferring this knowledge is appropriate and
to be centered in the image with a clean background. Then does not have any negative effect. Inductive transfer learning is
relationships are discovered among the clean images, which used when the source task and target task are different while
are then used to adapt the model to more noisier images the two domains may or may not be the same. Here a task
that are harder to classify. This method suggests that it is can be categorizing a document while a domain could be a
worth training on easy and hard data separately. set of university webpages to categorize. Transductive transfer
Goodfellow et al. [171] take a different approach where learning is used when the source and target tasks are the
they explain why machine learning models including neu- same, but the domains are different. Unsupervised transfer
ral networks may misclassify adversarial examples. While learning is similar to inductive transfer learning where the
previous research attempts to explain this phenomenon by source and target tasks are different, but uses unsupervised
focusing on nonlinearity and overfitting, the authors show learning tasks like clustering and dimensionality reduction.
that it is the model’s linear behavior in high-dimensional The three approaches above can also be divided based on
spaces that makes it vulnerable. That is, making many small what to transfer. Instance-based transfer learning assumes that
changes on the features of an example can result in a large the examples of the source can be re-used in the target by
change to the output prediction. As a result, generating re-weighting them. Feature-representation transfer learning
large amounts of adversarial examples becomes easier using assumes that the features that represent the data of the
linear perturbation. source task can be used to represent the data of the target
Even if the labels themselves are clean, it may be the case task. Parameter transfer learning assumes that the source and
that the labels are imbalanced. SMOTE [170] performs target tasks share some parameters or prior distributions
over-sampling for minority classes that need more exam- that can be re-used. Relational knowledge transfer learning
ples. Simply replicating examples may lead to overfitting, assumes that certain relationships within the data of the
so the over-sampling is done by generating synthetic exam- source task can be re-used in the target task.
ples using the minority examples and their nearest neigh- More recent surveys [173], [178] classify most of the tradi-
bors. He and Garcia [169] provide a comprehensive survey tional transfer learning techniques as homogeneous transfer
on learning from imbalanced data. learning where the feature spaces of the source and target
tasks are the same. In addition, the surveys identify a rela-
tively new class of techniques called heterogeneous transfer
4.2.2 Transfer Learning learning where the feature spaces are different, but the source
Transfer learning is a popular approach for training models and target examples are extracted from the same domain.
when there is not enough training data or time to train from Heterogeneous transfer learning largely falls into two cate-
scratch. A common technique is to start from an existing gories: asymmetric and symmetric transformation. In an
model that is well trained (also called a source task), one can asymmetric approach, features of the source task are trans-
incrementally train a new model (a target task) that already formed to the features of the target task. In a symmetric
performs well. For example, a convolutional neural net- approach, the assumption is that there is a common latent
works like AlexNet [182] and VGGNet [183] can be used to feature space that unifies the source and target features.
train a model for a different, but related vision problem. Transfer learning has been successfully used in many appli-
Recently, Google announced TensorFlow Hub [174], which cations including text sentiment analysis, image classifica-
enables users to easily re-use an existing model to train an tion, human activity classification, software defect detection,
accurate model, even with a small dataset. Also, Google and multi-language text classification.
Cloud AutoML [6] provides transfer learning as a service.
From a data management perspective, an interesting ques- 5 PUTTING EVERYTHING TOGETHER
tion is how these existing tools can be extended to index the We now return to Sally’s scenario and provide an end-
metadata of models and provide search as a service, just to-end guideline for data collection (summarized as the
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1342 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
workflow in Fig. 2). If there is no or little data to start with 6 FUTURE RESEARCH CHALLENGES
then Sally would need to acquire datasets. She can either
Although data collection was traditionally a topic in the
search for relevant datasets either on the Web or within the
machine learning, as the amount of training data is increas-
company data lake, or decide to generate a dataset herself
ing, data management research is becoming just as relevant,
by installing camera equipment for taking photos of the
and we are observing a convergence of the two disciplines.
products within the factory. If the products also had some
As such, there needs to be more awareness on how the
metadata, Sally could also augment that data with external
research landscape will evolve for both communities and
information about the product.
more effort to better integrate the techniques.
Once the data is available, then Sally can choose among
Data Evaluation. An open question is how to evaluate
the labeling techniques using the categories discussed in
whether the right data was collected with sufficient quantity.
Section 3. If there are enough existing labels, then self label-
First, it may not be clear if we have found the best datasets
ing using semi-supervised learning is an attractive option.
for a machine learning task and whether the amount of data
There are many variants of self labeling depending on the
is enough to train a model with sufficient accuracy. In some
assumptions on the model training as we studied. If there are
cases, there may be too many datasets, and simply collecting
not enough labels, Sally can decide to generate some using
and integrating all of them may have a negative affect on
the crowd-based techniques using a budget. If there are only
model training. As a result, selecting the right datasets
a few experts available for labeling, active learning may be
becomes an important problem. Moreover, if the datasets are
the right choice, assuming that the important examples that
dynamic (e.g., they are streams of signals from sensors) and
influence the model can be narrowed down. If there are
change in quality, then the choice of datasets may have to
many workers who do not necessarily have expertise, gen-
change dynamically as well. Second, many data discovery
eral crowdsourcing methods can be used. If Sally does not
have enough budget for crowd-based methods or if it is sim- tools rely on dataset owners to annotate their datasets for bet-
ply not worth the cost, and if the model training can tolerate ter discovery, but more automatic techniques for under-
weak labels, then weak supervision techniques like data pro- standing and extracting metadata from the data are needed.
gramming and label propagation can be used. While most of the data collection work assumes that the
If Sally has existing labels, she may also want to make model training comes after the data collection, another
sure whether they can be improved in quality. If the data is important avenue is to augment or improve the data based
noisy or biased, then the various data cleaning techniques on how the model performs. While there is a heavy literature
can be used. If there are existing models for product quality on model interpretation [185], [186], it is not clear how to
through tools like TensorFlow Hub [174], they can be used address feedback on the data level. In the model fairness lit-
to further improve the model using transfer learning. erature [187], one approach to reducing unfairness is to fix
Through our experience, we also realize that it is not the data. In data cleaning, ActiveClean and BoostClean are
always easy to determine if there is enough data and labels. interesting approaches for fixing the data to improve model
For example, even if the dataset is small or there are few accuracy. A key challenge is analyzing the model, which
labels, as long as the distribution of data is easy to learn, then becomes harder as models become more complicated.
automatic approaches like semi-supervised learning will do Performance Tradeoff. While traditional labeling techni-
the job better than manual approaches like active learning. ques focus on accuracy, there is a recent push towards gen-
Another hard-to-measure factor is the amount of human erating large amounts of weak labels. We need to better
effort needed. When comparing active learning versus data understand the tradeoffs of accuracy versus scalability to
programming, we need to compare the tasks of labeling make informed decisions on which approach to use when.
examples and implementing labeling functions, which are For example, simply having more weak labels does not nec-
quite different. Depending on the application, implementing essarily mean the model’s accuracy will eventually reach a
a program on examples can range from trivial (e.g., look for perfect accuracy. At some point, it may be worth investing
certain keywords) to almost impossible (e.g., general object in humans or using transfer learning to make additional
detection). Hence, even if data programming is an attractive improvements. Such decisions can be made through some
option, one must determine the actual effort of programming, trial and error, but an interesting question is whether there
which cannot be determined with a few yes or no questions. is a more systematic way to do such evaluations.
Another thing to keep in mind is how the labeling techni- Crowdsourcing. Despite the many efforts in crowdsourc-
ques tradeoff accuracy and scalability. Manual labeling is ing, leveraging humans is still a non-trivial task. Dealing
obviously the most accurate, but least scalable. Active learn- with humans involves designing the right tasks and interfa-
ing scales better than the manual approach, but is still lim- ces, ensuring that the worker quality is good enough, and
ited to how fast humans can label. Data programming setting the right price for tasks. The recent data program-
produces weak labels, which tend to have lower accuracy ming paradigm introduces a new set of challenges where
than manual labels. On the other hand, data programming workers now have to implement labeling functions instead
can scale better than active learning assuming that the initial of providing labels themselves. One idea is to improve the
cost of implementing labeling functions and debugging quality of such collaborative programming by making the
them is reasonable. Semi-supervised learning obviously programming of labeling functions drastically easier, say by
scales the best with automatic labeling. The labeling accu- introducing libraries or templates for programming.
racy depends on the accuracy of the model trained on exist- Empirical Comparison of Techniques. Although we showed a
ing labels. Combining self labeling with active learning is a flowchart on when to use which techniques, it is far from
good example of taking the best of both worlds. complete, as many factors are application-specific and can
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1343
only be determined by looking at the data and application. [2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cam-
bridge, MA, USA: The MIT Press, 2016.
For example, if the model training can be done with a small [3] S. H. Bach, B. D. He, A. Ratner, and C. Re, “Learning the struc-
number of labels, then we may not have to perform data ture of generative models without labeled data,” in Proc. 34th Int.
labeling using crowdsourcing. In addition, the estimated Conf. Mach. Learn., 2017, pp. 273–282.
human efforts in labeling and data programming may not fol- [4] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data life-
cycle challenges in production machine learning: A survey,” SIG-
low any theoretical model in practice. For example, humans MOD Rec., vol. 47, no. 2, pp. 17–28, Jun. 2018.
may find programming for certain applications much more [5] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data man-
difficult and time-consuming than other applications agement challenges in production machine learning,” in Proc.
ACM Int. Conf. Manage. Data, 2017, pp. 1723–1726.
depending on their expertise. Hence, there needs to be more [6] “Google cloud automl,” [Online]. Available: https://cloud.
empirical research on the effectiveness of the techniques. google.com/automl/. Accessed: Oct. 17, 2019.
Generalizing and Integrating Techniques. We observed that [7] “Microsoft custom vision,” [Online]. Available: https://azure.
many data collection techniques were application or data microsoft.com/en-us/services/cognitive-services/custom-vision-
service/. Accessed: Oct. 17, 2019.
type specific and were often small parts of a larger research. [8] “Amazon sagemaker,” [Online]. Available: https://aws.amazon.
As machine learning becomes widely used in just about any com/sagemaker/. Accessed: Oct. 17, 2019.
application, there needs to be more effort in generalizing [9] A. Bhardwaj, A. Deshpande, A. J. Elmore, D. Karger, S. Madden,
the techniques to other problems. In data labeling, most of A. Parameswaran, H. Subramanyam, E. Wu, and R. Zhang,
“Collaborative data analytics with datahub,” Proc. VLDB Endow-
the research effort has been focused on classification tasks ment, vol. 8, no. 12, pp. 1916–1919, Aug. 2015.
and much less on regression tasks. An interesting question [10] A. P. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande,
is which classification techniques can also be extended to A. J. Elmore, S. Madden, and A. G. Parameswaran, “Datahub:
Collaborative data science & dataset version management at
regression. It is also worth exploring if application-specific scale,” in Proc. Biennial Conf. Innovative Data Syst. Res., 2015.
techniques can be generalized further. For example, the [11] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and
NELL system continuously extracts facts from the Web A. Parameswaran, “Principles of dataset versioning: Exploring
indefinitely. This idea can possibly be applied to collecting the recreation/storage tradeoff,” Proc. VLDB Endowment, vol. 8,
no. 12, pp. 1346–1357, Aug. 2015.
any type of data from any source, although the technical [12] A. Y. Halevy, “Data publishing and sharing using fusion tables,”
details may differ. Finally, given the variety of techniques in Proc. Biennial Conf. Innovative Data Syst. Res., 2013.
for data collection, there needs to be more research on end- [13] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan,
R. Shapley, and W. Shen, “Google fusion tables: Data manage-
to-end solutions that combine techniques for data acquisi- ment, integration and collaboration in the cloud,” in Proc. 1st
tion, data labeling, and improvements of existing data and ACM Symp. Cloud Comput., 2010, pp. 175–180.
models. [14] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan,
R. Shapley, W. Shen, and J. Goldberg-Kidon, “Google fusion
tables: Web-centered data management and collaboration,” in
7 CONCLUSION Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 1061–1066.
[15] “Ckan,” [Online]. Available: http://ckan.org. Accessed: Oct. 17,
As machine learning becomes more widely used, it becomes 2019.
more important to acquire large amounts of data and label [16] “Quandl,” [Online]. Available: https://www.quandl.com.
data, especially for state-of-the-art neural networks. Tradi- Accessed: Oct. 17, 2019.
tionally, the machine learning, natural language processing, [17] “Datamarket,” [Online]. Available: https://datamarket.com.
Accessed: Oct. 17, 2019.
and computer vision communities has contributed to this [18] “Kaggle,” [Online]. Available: https://www.kaggle.com/.
problem – primarily on data labeling techniques including Accessed: Oct. 17, 2019.
semi-supervised learning and active learning. Recently, in [19] I. G. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino, “Data
wrangling: The challenging yourney from the wild to the lake,”
the era of Big data, the data management community is also in Proc. Biennial Conf. Innovative Data Syst. Res., 2015.
contributing to numerous subproblems in data acquisition, [20] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy,
data labeling, and improvement of existing data. In this sur- and S. E. Whang, “Goods: Organizing google’s datasets,” in Proc.
vey, we have investigated the research landscape of how all Int. Conf. Manage. Data, 2016, pp. 795–806.
[21] R. Castro Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao,
these technique complement each other and have provided Z. Abedjan, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani,
guidelines on deciding which technique can be used when. M. Stonebraker, and N. Tang, “A demo of the data civilizer sys-
Finally, we have discussed interesting data collection chal- tem,” in Proc. ACM Int. Conf. Manage. Data, 2017, pp. 1639–1642.
[22] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker,
lenges that remain to be addressed. In the future, we expect A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and
the integration of Big data and AI to happen not only in N. Tang, “The data civilizer system,” in Proc. Biennial Conf. Inno-
data collection, but in all aspects of machine learning. vative Data Syst. Res., 2017.
[23] Y. Gao, S. Huang, and A. G. Parameswaran, “Navigating the data
lake with DATAMARAN: Automatically extracting structure from
ACKNOWLEDGMENTS log datasets,” in Proc. Int. Conf. Manage. Data, 2018, pp. 943–958.
[24] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang,
This research was supported by the Engineering Research “Webtables: Exploring the power of tables on the web,” Proc.
Center Program through the National Research Foundation VLDB Endowment, vol. 1, no. 1, pp. 538–549, 2008.
of Korea (NRF) funded by the Korean Government MSIT [25] M. J. Cafarella, A. Y. Halevy, H. Lee, J. Madhavan, C. Yu,
(NRF-2018R1A5A1059921), by SK Telecom, and by a Google D. Z. Wang, and E. Wu, “Ten years of webtables,” Proc. VLDB
Endowment, vol. 11, no. 12, pp. 2140–2149, 2018.
AI Focused Research Award. [26] “Google dataset search,” [Online]. Available: https://www.blog.
google/products/search/making-it-easier-discover-datasets/.
REFERENCES Accessed: Oct. 17, 2019.
[27] H. Elmeleegy, J. Madhavan, and A. Halevy, “Harvesting rela-
[1] “Deep learning for detection of diabetic eye disease,” [Online]. tional tables from lists on the web,” VLDB J., vol. 20, no. 2,
Available: https://research.googleblog.com/2016/11/deep- pp. 209–226, Apr. 2011.
learning-for-detection-of-diabetic.html. Accessed: Oct. 17, 2019.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1344 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
[28] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam, “Tegra: Table [53] H. Park and J. Widom, “Crowdfill: Collecting structured data
extraction by global record alignment,” in Proc. ACM SIGMOD from the crowd,” in Proc. ACM SIGMOD Int. Conf. Manage. Data,
Int. Conf. Manage. Data, 2015, pp. 1713–1728. 2014, pp. 577–588.
[29] K. Chakrabarti, S. Chaudhuri, Z. Chen, K. Ganjam, and Y. He, [54] V. Crescenzi, P. Merialdo, and D. Qiu, “Crowdsourcing large
“Data services leveraging bing’s data assets,” IEEE Data Eng. scale wrapper inference,” Distrib. Parallel Databases, vol. 33, no. 1,
Bull., vol. 39, no. 3, pp. 15–28, Sept. 2016. pp. 95–122, 2015. doi: 10.1007/s10619-014-7163-9.
[30] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integra- [55] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack,
tion for the relational web,” Proc. VLDB Endowment, vol. 2, no. 1, S. B. Zdonik, A. Pagan, and S. Xu, “Data curation at scale: The data
pp. 1090–1101, Aug. 2009. tamer system,” in Proc. Biennial Conf. Innovative Data Syst. Res., 2013.
[31] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, [56] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik,
“Infogather: Entity augmentation and attribute discovery by and X. Zhu, “Corleone: Hands-off crowdsourcing for entity
holistic matching with web tables,” in Proc. ACM SIGMOD Int. matching,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014,
Conf. Manage. Data, 2012, pp. 97–108. pp. 601–612.
[32] N. N. Dalvi, R. Kumar, and M. A. Soliman, “Automatic wrappers [57] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
for large scale web extraction,” Proc. VLDB Endowment, vol. 4, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
no. 4, pp. 219–230, 2011. in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[33] P. Bohannon, N. N. Dalvi, Y. Filmus, N. Jacoby, S. S. Keerthi, and [58] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun,
A. Kirpal, “Automatic web-scale information extraction,” in “Generating multi-label discrete patient records using generative
Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012, pp. 609–612. adversarial networks,” in Proc. 2nd Mach. Learn. Healthcare Conf.,
[34] R. Baumgartner, W. Gatterbauer, and G. Gottlob, “Web data 2017, pp. 286–305.
extraction system,” in Encyclopedia of Database Systems, 2nd ed., [59] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and
Berlin, Germany: Springer, 2018. Y. Kim, “Data synthesis based on generative adversarial
[35] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, networks,” Proc. VLDB Endowment, vol. 11, no. 10, pp. 1071–1083,
“Distributed representations of words and phrases and their 2018.
compositionality,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst., [60] L. Xu and K. Veeramachaneni, “Synthesizing tabular data using
2013, pp. 3111–3119. generative adversarial networks,” CoRR, vol. abs/1811.11264,
[36] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vec- 2018. [Online]. Available: http://arxiv.org/abs/1811.11264
tors for word representation,” in Proc. Conf. Empirical Methods [61] I. J. Goodfellow, “NIPS 2016 tutorial: Generative adversarial
Natural Language Process., 2014, pp. 1532–1543. networks,” CoRR, vol. abs/1701.00160, 2017. [Online]. Available:
[37] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet http://arxiv.org/abs/1701.00160
allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003. [62] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “The
[38] A. Kumar, J. Naughton, J. M. Patel, and X. Zhu, “To join or not to GAN landscape: Losses, architectures, regularization, and normal-
join?: Thinking twice about joins before feature selection,” in ization,” CoRR, vol. abs/1807.04720, 2018. [Online]. Available:
Proc. Int. Conf. Manage. Data, 2016, pp. 19–34. http://arxiv.org/abs/1807.04720
[39] V. Shah, A. Kumar, and X. Zhu, “Are key-foreign key joins safe [63] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le,
to avoid when learning high-capacity classifiers?” Proc. VLDB “Autoaugment: Learning augmentation policies from data,” in
Endowment, vol. 11, no. 3, pp. 366–379, Nov. 2017. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 113–123.
[40] M. Stonebraker and I. F. Ilyas, “Data integration: The current [Online]. Available: http://openaccess.thecvf.com/content\_
status and the way forward,” IEEE Data Eng. Bull., vol. 41, no. 2, CVPR\_2019/html/Cubuk\_AutoAugment\_Learning\_Augmen
pp. 3–9, June 2018. tation\_Strategies\_From\_Data\_CVPR\_2019\_paper.html
[41] A. Doan, A. Y. Halevy, and Z. G. Ives, Principles of Data Integra- [64] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Re,
tion. Burlington, MA, USA: Morgan Kaufmann, 2012. “Learning to compose domain-specific transformations for data
[42] S. Li, L. Chen, and A. Kumar, “Enabling and optimizing non- augmentation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017,
linear feature interactions in factorized linear algebra,” in Proc. pp. 3239–3249.
Int. Conf. Manage. Data, 2019, pp. 1571–1588. [65] X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object
[43] L. Chen, A. Kumar, J. F. Naughton, and J. M. Patel, “Towards detectors from 3D models,” in Proc. IEEE Int. Conf. Comput. Vis.,
linear algebra over normalized data,” Proc. VLDB Endowment, 2015, pp. 1278–1286.
vol. 10, no. 11, pp. 1214–1225, 2017. [66] S. Kim, G. Choe, B. Ahn, and I. Kweon, “Deep representation of
[44] A. Kumar, J. F. Naughton, J. M. Patel, and X. Zhu, “To join or not industrial components using simulated images,” in Proc. IEEE
to join?: Thinking twice about joins before feature selection,” in Int. Conf. Robotics Autom., 2017, pp. 2003–2010.
Proc. Int. Conf. Manage. Data, 2016, pp. 19–34. [67] T. Oh, R. Jaroensri, C. Kim, M. A. Elgharib, F. Durand,
[45] G. Little, L. B. Chilton, M. Goldman, and R. C. Miller, “Turkit: W. T. Freeman, and W. Matusik, “Learning-based video motion
Human computation algorithms on mechanical turk,” in Proc. 23nd magnification,” Comput. Vis. - {ECCV} 2018 - 15th Eur. Conf.,
Annu. ACM Symp. User Interface Softw. Technol., 2010, pp. 57–66. 2018, pp. 663–679. doi: 10.1007/978-3-030-01225-0\_39.
[46] D. W. Barowy, C. Curtsinger, E. D. Berger, and A. McGregor, [68] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman,
“Automan: A platform for integrating human-based and digital “Synthetic data and artificial neural networks for natural scene
computation,” in Proc. ACM Int. Conf. Object Oriented Program. text recognition,” CoRR, vol. abs/1406.2227, 2014. [Online].
Syst. Languages Appl., 2012, pp. 639–654. Available: http://arxiv.org/abs/1406.2227
[47] S. Ahmad, A. Battle, Z. Malkani, and S. Kamvar, “The jabber- [69] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text
wocky programming environment for structured social compu- localisation in natural images,” in Proc. IEEE Conf. Comput. Vis.
ting,” in Proc. 24th Annu. ACM Symp. User Interface Softw. Pattern Recognit., Jun. 2016, pp. 2315–2324.
Technol., 2011, pp. 53–64. [70] Y. Xia, X. Cao, F. Wen, and J. Sun, “Well begun is half done:
[48] H. Park, R. Pang, A. G. Parameswaran, H. Garcia-Molina, Generating high-quality seeds for automatic image dataset
N. Polyzotis, and J. Widom, “Deco: A system for declarative construction from web,” in Proc. Eur. Conf. Comput. Vis., 2014,
crowdsourcing,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 387–400.
pp. 1990–1993, 2012. [71] Y. Bai, K. Yang, W. Yu, C. Xu, W. Ma, and T. Zhao, “Automatic
[49] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, image dataset construction from click-through logs using deep
“Crowddb: Answering queries with crowdsourcing,” in Proc. neural network,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015,
ACM SIGMOD Int. Conf. Manage. Data, 2011, pp. 61–72. pp. 441–450.
[50] A. Marcus, E. Wu, S. Madden, and R. C. Miller, “Crowdsourced [72] J. Mallinson, R. Sennrich, and M. Lapata, “Paraphrasing revisited
databases: Query processing with people,” in Proc. Biennial Conf. with neural machine translation,” in Proc. 15th Conf. Eur. Chapter
Innovative Data Syst. Res., 2011, pp. 211–214. Association Comput. Linguistics, 2017, pp. 881–893.
[51] R. Boim, O. Greenshpan, T. Milo, S. Novgorodov, N. Polyzotis, and [73] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Adversarial
W. C. Tan, “Asking the right questions in crowd data sourcing,” in example generation with syntactically controlled paraphrase
Proc. IEEE 28th Int. Conf. Data Eng., 2012, pp. 1261–1264. networks,” in Proc. 2018 Conf. North American Chapter Assoc. Com-
[52] M. J. Franklin, B. Trushkowsky, P. Sarkar, and T. Kraska, put. Linguistics: Human Lang. Technol., 2018, pp. 1875–1885. [Online].
“Crowdsourced enumeration queries,” in Proc. IEEE Int. Conf. Available: https://www.aclweb.org/anthology/N18-1170/
Data Eng., 2013, pp. 673–684.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1345
[74] M. T. Ribeiro, S. Singh, and C. Guestrin, “Semantically equiv- [97] U. Brefeld, T. G€artner, T. Scheffer, and S. Wrobel, “Efficient co-
alent adversarial rules for debugging NLP models,” in Proc. regularised least squares regression,” in Proc. 23rd Int. Conf.
56th Annu. Meeting Association Comput. Linguistics, 2018, Mach. Learn., 2006, pp. 137–144.
pp. 856–865. [98] V. Sindhwani and P. Niyogi, “A co-regularized approach to
[75] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, semi-supervised learning with multiple views,” in Proc. ICML
and S. E. Whang, “Managing google’s data lake: An overview Workshop Learn. Multiple Views, 2005.
of the goods system,” IEEE Data Eng. Bulletin, vol. 39, no. 3, [99] Z.-H. Zhou and M. Li, “Semi-supervised regression with
pp. 5–14, Sept. 2016. co-training,” in Proc. 19th Int. Joint Conf. Artif. Intell., 2005,
[76] J. X. Yu, L. Qin, and L. Chang, “Keyword search in relational pp. 908–913.
databases: A survey,” IEEE Data Eng. Bulletin, vol. 33, no. 1, [100] S. Ravi and Q. Diao, “Large scale distributed semi-supervised
pp. 67–78, Mar. 2010. learning using streaming approximation,” in Proc. Int. Conf. Artif.
[77] S. Chaudhuri and G. Das, “Keyword querying and ranking in Intell. Statistics, 2016, pp. 519–528.
databases,” Proc. VLDB Endowment, vol. 2, no. 2, pp. 1658–1659, [101] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learn-
2009. ing using gaussian fields and harmonic functions,” in Proc. 20th
[78] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and Int. Conf. Int. Conf. Mach. Learn., 2003, pp. 912–919.
M. Stonebraker, “Aurum: A data discovery system,” in Proc. [102] P. P. Talukdar and W. W. Cohen, “Scaling graph-based semi
IEEE 34th Int. Conf. Data Eng., 2018, pp. 1001–1012. supervised learning to large number of labels using count-min
[79] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. K. Elmagarmid, sketch,” in Proc. Int. Conf. Artif. Intell. Statistics, 2014, pp. 940–947.
I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang, [103] D. D. Lewis and W. A. Gale, “A sequential algorithm for training
“Seeping semantics: Linking datasets using word embeddings text classifiers,” in Proc. 17th Annu. Int. ACM SIGIR Conf. Res.
for data discovery,” in Proc. IEEE 34th Int. Conf. Data Eng., 2018, Development Inf. Retrieval, 1994, pp. 3–12.
pp. 989–1000. [104] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by
[80] Q. V. Le and T. Mikolov, “Distributed representations of senten- committee,” in Proc. 5th Annu. Workshop Comput. Learn. Theory,
ces and documents,” in Proc. 31st Int. Conf. Int. Conf. Mach. Learn., 1992, pp. 287–294.
2014, pp. 1188–1196. [105] B. Settles, Active Learning, San Rafael, CA, USA: Morgan & Clay-
[81] “Crowdsourced data management: Industry and academic pool, 2012.
perspectives,” Foundations Trends Databases, vol. 6, pp. 1–161, [106] N. Abe and H. Mamitsuka, “Query learning strategies using
2015. boosting and bagging,” in Proc. 15th Int. Conf. Mach. Learn., 1998,
[82] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, pp. 1–9.
E. Bertino, and S. Dustdar, “Quality control in crowdsourcing [107] B. Settles and M. Craven, “An analysis of active learning strate-
systems: Issues and directions,” IEEE Internet Comput., vol. 17, gies for sequence labeling tasks,” in Proc. Conf. Empirical Methods
no. 2, pp. 76–81, Mar. 2013. Natural Language Process., 2008, pp. 1070–1079.
[83] F. Daniel, P. Kucherbaev, C. Cappiello, B. Benatallah, and [108] B. Settles, M. Craven, and S. Ray, “Multiple-instance active
M. Allahbakhsh, “Quality control in crowdsourcing: A survey learning,” in Proc. 20th Int. Conf. Neural Inf. Process. Syst., 2007,
of quality attributes, assessment techniques, and assurance pp. 1289–1296.
actions,” ACM Comput. Surv., vol. 51, no. 1, pp. 7:1–7:40, [109] N. Roy and A. McCallum, “Toward optimal active learning
Jan. 2018. through sampling estimation of error reduction,” in Proc. 18th
[84] G. Li, J. Wang, Y. Zheng, and M. J. Franklin, “Crowdsourced data Int. Conf. Mach. Learn., 2001, pp. 441–448.
management: A survey,” IEEE Trans. Knowl. Data Eng., vol. 28, [110] R. Burbidge, J. J. Rowland, and R. D. King, “Active learning for
no. 9, pp. 2296–2319, Sep. 2016. regression based on query by committee,” in Proc. Int. Conf. Intell.
[85] “Amazon mechanical turk,” [Online]. Available: https://www. Data Eng. Automated Learn., 2007, pp. 209–218.
mturk.com. Accessed: Oct. 17, 2019. [111] A. McCallum and K. Nigam, “Employing em and pool-based
[86] J. Kim, S. Sterman, A. A. B. Cohen, and M. S. Bernstein, active learning for text classification,” in Proc. 15th Int. Conf.
“Mechanical novel: Crowdsourcing complex work through Mach. Learn., 1998, pp. 350–358.
reflection and revision,” in Proc. ACM Conf. Comput. Supported [112] K. Tomanek and U. Hahn, “Semi-supervised active learning for
Cooperative Work Social Comput., 2017, pp. 233–245. sequence labeling,” in Proc. Joint Conf. 47th Annu. Meeting ACL,
[87] N. Salehi, J. Teevan, S. T. Iqbal, and E. Kamar, “Communicating 4th Int. Joint Conf. Natural Language Process. AFNLP, 2009,
context to the crowd for complex writing tasks,” in Proc. ACM pp. 1039–1047.
Conf. Comput. Supported Cooperative Work Social Comput., 2017, [113] Z.-H. Zhou, K.-J. Chen, and Y. Jiang, “Exploiting unlabeled data
pp. 1890–1901. in content-based image retrieval,” in Proc. 15th Eur. Conf. Mach.
[88] H. Garcia-Molina, M. Joglekar, A. Marcus, A. Parameswaran, Learn., 2004, pp. 525–536.
and V. Verroios, “Challenges in data crowdsourcing,” IEEE [114] X. Zhu, J. Lafferty, and Z. Ghahramani, “Combining active learn-
Trans. Knowl. Data Eng., vol. 28, no. 4, pp. 901–911, Apr. 2016. ing and semi-supervised learning using gaussian fields and har-
[89] Y. Amsterdamer and T. Milo, “Foundations of crowd data monic functions,” in Proc. ICML Workshop Continuum Labeled
sourcing,” SIGMOD Rec., vol. 43, no. 4, pp. 5–14, Feb. 2015. Unlabeled Data Mach. Learn. Data Mining, 2003, pp. 58–65.
[90] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data [115] J. C. Chang, S. Amershi, and E. Kamar, “Revolt: Collaborative
vault,” in Proc. IEEE Int. Conf. Data Sci. Advanced Analytics, 2016, crowdsourcing for labeling machine learning datasets,” in Proc.
pp. 399–410. CHI Conf. Human Factors Comput. Syst., 2017, pp. 2334–2346.
[91] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, [116] B. Mozafari, P. Sarkar, M. Franklin, M. Jordan, and S. Madden,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, “Scaling up crowd-sourcing to very large datasets: A case for
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, active learning,” Proc. VLDB Endowment, vol. 8, no. 2, pp. 125–136,
and E. Duchesnay, “Scikit-learn: Machine learning in python,” J. Oct. 2014.
Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011. [117] T. Kulesza, S. Amershi, R. Caruana, D. Fisher, and D. X. Charles,
[92] D. Yarowsky, “Unsupervised word sense disambiguation rival- “Structured labeling for facilitating concept evolution in machine
ing supervised methods,” in Proc. 33rd Annu. Meeting Association learning,” in Proc. SIGCHI Conf. Human Factors Comput. Syst.,
Comput. Linguistics, 1995, pp. 189–196. 2014, pp. 3075–3084.
[93] Z.-H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data [118] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “Crowder:
using three classifiers,” IEEE Trans. Knowl. Data Eng., vol. 17, Crowdsourcing entity resolution,” Proc. VLDB Endowment,
no. 11, pp. 1529–1541, Nov. 2005. vol. 5, no. 11, pp. 1483–1494, 2012.
[94] Y. Zhou and S. A. Goldman, “Democratic co-learning,” in Proc. [119] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label?
16th IEEE Int. Conf. Tools Artif. Intell., 2004, pp. 594–602. improving data quality and data mining using multiple, noisy
[95] A. Blum and T. Mitchell, “Combining labeled and unlabeled data labelers,” in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery
with co-training,” in Proc. 11th Annu. Conf. Comput. Learn. Theory, Data Mining, 2008, pp. 614–622.
1998, pp. 92–100. [120] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis,
[96] I. Triguero, S. Garcıa, and F. Herrera, “Self-labeled techniques for A. Ramesh, and J. Widom, “Crowdscreen: Algorithms for filter-
semi-supervised learning: Taxonomy, software and empirical ing data with humans,” in Proc. ACM SIGMOD Int. Conf. Manage.
study,” Knowl. Inf. Syst., vol. 42, no. 2, pp. 245–284, 2015. Data, 2012, pp. 361–372.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
1346 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
[121] D. R. Karger, S. Oh, and D. Shah, “Iterative learning for reliable [142] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and
crowdsourcing systems,” in Proc. 24th Int. Conf. Neural Inf. Pro- T. M. Mitchell, “Toward an architecture for never-ending lan-
cess. Syst., 2011, pp. 1953–1961. guage learning,” in Proc. 24th AAAI Conf. Artif. Intell., 2010,
[122] O. Dekel and O. Shamir, “Vox populi: Collecting high-quality pp. 1306–1313.
labels from a crowd,” in Proc. 22nd Annu. Conf. Learn. Theory, [143] X. Zhu, “Semi-supervised learning literature survey,” Comput. Sci.,
2009. Univ. Wisconsin-Madison, Madison, WI, Tech. Rep. TR 1530, 2008.
[123] A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh, [144] D. Dheeru and G. Casey, “UCI machine learning repository,”
“Counting with the crowd,” Proc. VLDB Endowment, vol. 6, no. 2, University of California, Irvine, School of Information and Com-
pp. 109–120, 2012. puter Sciences, 2017. [Online]. Available: http://archive.ics.uci.
[124] C. Zhang, “Deepdive: A data management system for automatic edu/ml
knowledge base construction,” Ph.D. dissertation, Computer Sci- [145] J. Alcal-Fdez, A. Fernndez, J. Luengo, J. Derrac, and S. Garca,
ences department, Univ. Wisconsin–Madison, Madison, WI, 2015. “Keel data-mining software tool: Data set repository, integration
[125] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Re, “Data of algorithms and experimental analysis framework,” Multiple-
programming with ddlite: Putting humans in a different part of Valued Logic Soft Comput., vol. 17, pp. 255–287, 2010.
the loop,” in Proc. Workshop Human-In-the-Loop Data Analytics, [146] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A
2016, Art. no. 13. large-scale hierarchical image database,” in Proc. IEEE Conf. Com-
[126] A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Re, “Data pro- put. Vis. Pattern Recognit., 2009, pp. 248–255.
gramming: Creating large training sets, quickly,” in Proc. Conf. [147] F. Ricci, L. Rokach, and B. Shapira, Eds., Recommender Systems
Neural Inf. Process. Syst., 2016, pp. 3567–3575. Handbook. Berlin, Germany: Springer, 2015.
[127] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Re, [148] Y. Gu, Z. Jin, and S. C. Chiu, “Combining active learning and semi-
“Snorkel: Rapid training data creation with weak supervision,” supervised learning using local and global consistency,” in Neural
Proc. VLDB Endowment, vol. 11, no. 3, pp. 269–282, Nov. 2017. Information Processing, C. K. Loo, K. S. Yap, K. W. Wong, A. Teoh,
[128] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, and K. Huang, Eds. Berlin, Germany: Springer, 2014, pp. 215–222.
A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Re, and R. Mal- [149] V. Crescenzi, P. Merialdo, and D. Qiu, “Crowdsourcing large
kin, “Snorkel drybell: A case study in deploying weak supervi- scale wrapper inference,” Distrib. Parallel Databases, vol. 33, no. 1,
sion at industrial scale,” in Proc. Int. Conf. Manage. Data, 2019, pp. 95–122, Mar. 2015.
pp. 362–375. [150] M. Schaekermann, J. Goh, K. Larson, and E. Law, “Resolvable vs.
[129] E. Bringer, A. Israeli, Y. Shoham, A. Ratner, and C. Re, “Osprey: irresolvable disagreement: A study on worker deliberation in
Weak supervision of imbalanced extraction problems without crowd work,” Proc. ACM Human-Computer Interaction, vol. 2,
code,” in Proc. 3rd Int. Workshop Data Manage. End-to-End Mach. pp. 154:1–154:19, 2018.
Learn., 2019, pp. 4:1–4:11. [151] “Weak supervision: The new programming paradigm for machine
[130] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Re, learning,” [Online]. Available: https://hazyresearch.github.io/
“Snorkel metal: Weak supervision for multi-task learning,” in snorkel/blog/ws_blog_post.html. Accessed: Oct. 17, 2019.
Proc. 2nd Workshop Data Manage. End-To-End Mach. Learn., 2018, [152] A. J. Ratner, B. Hancock, and C. Re, “The role of massively multi-
pp. 3:1–3:4. task and weak supervision in software 2.0,” in Proc. Biennial
[131] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, Conf. Innovative Data Syst. Res., 2019.
“Freebase: A collaboratively created graph database for structur- [153] Z.-H. Zhou, “A brief introduction to weakly supervised
ing human knowledge,” in Proc. ACM SIGMOD Int. Conf. Man- learning,” Nat. Sci. Rev., vol. 5, pp. 44–53, 2017.
age. Data, 2008, pp. 1247–1250. [154] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Re, “Data
[132] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of programming with ddlite: Putting humans in a different part of
semantic knowledge,” in Proc. 16th Int. Conf. World Wide Web, the loop,” in Proc. Workshop Human-In-the-Loop Data Analytics,
2007, pp. 697–706. 2016, pp. 13:1–13:6.
[133] F. Mahdisoltani, J. Biega, and F. M. Suchanek, “YAGO3: A [155] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Re, “Snorkel:
knowledge base from multilingual wikipedias,” in Proc. Biennial Fast training set generation for information extraction,” in Proc.
Conf. Innovative Data Syst. Res., 2015. ACM Int. Conf. Manage. Data, 2017, pp. 1683–1686.
[134] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni, [156] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision
“Open language learning for information extraction,” in Proc. for relation extraction without labeled data,” in Proc. Joint Conf.
Joint Conf. Empirical Methods Natural Language Process. Comput. 47th Annu. Meeting ACL, 4th Int. Joint Conf. Natural Language Pro-
Natural Language Learn., 2012, pp. 523–534. cess. AFNLP, 2009, pp. 1003–1011.
[135] N. Sawadsky, G. C. Murphy, and R. Jiresal, “Reverb: Recom- [157] E. Ferrara, P. D. Meo, G. Fiumara, and R. Baumgartner, “Web
mending code-related web pages,” in Proc. Int. Conf. Softw. Eng., data extraction, applications and techniques: A survey,” Knowl.-
2013, pp. 812–821. Based Syst., vol. 70, pp. 301–323, 2014.
[136] M. Yahya, S. Whang, R. Gupta, and A. Y. Halevy, “Renoun: Fact [158] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Re, “Holoclean: Holistic
extraction for nominal attributes,” in Proc. Conf. Empirical Meth- data repairs with probabilistic inference,” Proc. VLDB Endow-
ods Natural Language Process., 2014, pp. 325–335. ment, vol. 10, no. 11, pp. 1190–1201, 2017.
[137] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, [159] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg,
T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: A web- “Activeclean: Interactive data cleaning for statistical modeling,”
scale approach to probabilistic knowledge fusion,” in Proc. 20th Proc. VLDB Endowment, vol. 9, no. 12, pp. 948–959, 2016.
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, [160] S. Krishnan, M. J. Franklin, K. Goldberg, and E. Wu, “Boostclean:
pp. 601–610. Automated error detection and repair for machine learning,”
[138] R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu, CoRR, vol. abs/1711.01299, 2017. [Online]. Available: http://
“Biperpedia: An ontology for search applications,” Proc. VLDB arxiv.org/abs/1711.01299
Endowment, vol. 7, no. 7, pp. 505–516, Mar. 2014. [161] M. Dolatshah, M. Teoh, J. Wang, and J. Pei, “Cleaning crowd-
[139] V. Crescenzi, G. Mecca, and P. Merialdo, “Roadrunner: Towards sourced labels using oracles for statistical classification,” PVLDB,
automatic data extraction from large web sites,” in Proc. 27th Int. vol. 12, no. 4, pp. 376–389, 2018. [Online]. Available: http://
Conf. Very Large Data Bases, 2001, pp. 109–118. www.vldb.org/pvldb/vol12/p376-dolatshah.pdf
[140] O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, [162] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive
T. Shaked, S. Soderland, D. S. Weld, and A. Yates, “Web-scale data cleaning system,” in Proc. 27th Int. Conf. Very Large Data
information extraction in knowitall: (preliminary results),” in Bases, 2001, pp. 381–390.
Proc. 13th Int. Conf. World Wide Web, 2004, pp. 100–110. [163] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, “Wrangler:
[141] T. M. Mitchell, W. W. Cohen, E. R. Hruschka, Jr., P. P. Talukdar, Interactive visual specification of data transformation scripts,”
J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel, in Proc. SIGCHI Conf. Human Factors Comput. Syst., 2011,
J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, pp. 3363–3372.
E. A. Platanios, A. Ritter, M. Samadi, B. Settles, R. C. Wang, [164] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer,
D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and “Profiler: Integrated statistical analysis and visualization for data
J. Welling, “Never-ending learning,” in Proc. Association Advance- quality assessment,” in Proc. Int. Working Conf. Advanced Visual
ment Artif. Intell., 2015, pp. 2302–2310. Interfaces, 2012, pp. 547–554.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.
ROH ET AL.: A SURVEY ON DATA COLLECTION FOR MACHINE LEARNING: A BIG DATA - AI INTEGRATION PERSPECTIVE 1347
[165] W. R. Harris and S. Gulwani, “Spreadsheet table transformations Yuji Roh received the BS degree in electrical
from examples,” in Proc. 32nd ACM SIGPLAN Conf. Program. Lan- engineering from the Korea Advanced Institute of
guage Design Implementation, 2011, pp. 317–328. Science and Technology, in 2018. She is working
[166] K. H. Tae, Y. Roh, Y. H. Oh, H. Kim, and S. E. Whang, “Data toward the PhD degree at the School of Electrical
cleaning for accurate, fair, and robust models: A big data - AI Enginnering, Korea Advanced Institute of Science
integration approach,” in Proc. 3rd Int. Workshop Data Manage. and Technology (KAIST). Her current research
End-to-End Mach. Learn., 2019, Art. no. 5. interests include Big Data - AI Integration, Big
[167] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from Data Analytics, and Fairness of AI.
massive noisy labeled data for image classification,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2691–2699.
[168] X. Chen and A. Gupta, “Webly supervised learning of convolu-
tional networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015,
pp. 1431–1439. Geon Heo received the BS degree from the
[169] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE School of Basic Science, Daegu Gyungbuk Insti-
Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009. tute of Science and Technology (DGIST), in 2018.
[170] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, He is working toward the MS degree in the School
“Smote: Synthetic minority over-sampling technique,” J. Artif. of Electrical Engineering, Korea Advanced Insti-
Int. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002. tute of Science and Technology (KAIST). His cur-
[171] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and har- rent research interests include Big Data AI
nessing adversarial examples,” in Proc. 3rd Int. Conf. Learn. Repre- integration and Big Data Analytics.
sentations, 2015. [Online]. Available: http://arxiv.org/abs/
1412.6572
[172] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[173] K. R. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of Steven Euijong Whang received the BS degree
transfer learning,” J. Big Data, vol. 3, 2016, Art. no. 9. in computer science from the Korea Advanced
[174] “Tensorflow hub,” [Online]. Available: https://www.tensorflow. Institute of Science and Technology, in 2003, and
org/hub/. Accessed: Oct. 17, 2019. the PhD degree in computer science from Stanford
[175] F. Li, R. Fergus, and P. Perona, “One-shot learning of object University, in 2012. He is an assistant professor
categories,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, with the School of Electrical Engineering, Korea
pp. 594–611, Apr. 2006. Advanced Institute of Science and Technology. His
[176] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer research interests are Big Data - AI Integration, Big
learning in natural language processing,” in Proc. Conf. North Data Analytics, and Big Data Systems. Previously,
American Chapter Association Comput. Linguistics: Tutorials, 2019, he was a research scientist at Google Research
pp. 15–18. from Dec. 2012 to Jan. 2018 and co-developed the
[177] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable data infrastructure of the TensorFlow Extended (TFX) machine learning
are features in deep neural networks?” in Proc. 27th Int. Conf. platform. He is a recipient of the Google AI Focused Research Award in
Neural Inf. Process. Syst., 2014, pp. 3320–3328. 2018, the first in Asia. He is also an IEEE senior member.
[178] O. Day and T. M. Khoshgoftaar, “A survey on heterogeneous
transfer learning,” J. Big Data, vol. 4, 2017, Art. no. 29.
[179] D. Baylor, E. Breck, H. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, " For more information on this or any other computing topic,
S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, please visit our Digital Library at www.computer.org/csdl.
A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang,
M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich, “TFX: A
tensorflow-based production-scale machine learning platform,”
in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2017, pp. 1387–1395.
[180] “Tensorflow data validation,” [Online]. Available: https://www.
tensorflow.org/tfx/data_validation/. Accessed: Oct. 17, 2019.
[181] B. Frenay and M. Verleysen, “Classification in the presence of
label noise: A survey,” IEEE Trans. Neural Netw. Learn. Syst.,
vol. 25, no. 5, pp. 845–869, May 2014.
[182] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifi-
cation with deep convolutional neural networks,” in Proc. 25th
Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1106–1114.
[183] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Repre-
sentations, 2015. [Online]. Available: http://arxiv.org/abs/
1409.1556
[184] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A sur-
vey on deep transfer learning,” in Proc. Artif. Neural Netw. Mach.
Learn., 2018, pp. 270–279.
[185] M. T. Ribeiro, S. Singh, and C. Guestrin, ““why should i trust
you?”: Explaining the predictions of any classifier,” in Proc. 22nd
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
pp. 1135–1144.
[186] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-
precision model-agnostic explanations,” in Proc. 32nd AAAI Conf.
Artif. Intell., 2018, pp. 1527–1535.
[187] E. Krasanakis, E. Spyromitros-Xioufis, S. Papadopoulos, and
Y. Kompatsiaris, “Adaptive sensitive reweighting to mitigate
bias in fairness-aware classification,” in Proc. World Wide Web
Conf., 2018, pp. 853–862.
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on March 19,2024 at 11:13:03 UTC from IEEE Xplore. Restrictions apply.