

Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data

Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar

A. Karpatne, M. Steinbach, A. Banerjee, S. Shekhar, and V. Kumar are with the University of Minnesota, Minneapolis, MN 55455. E-mail: {karpa009, kumar001}@umn.edu, {steinbac, banerjee, shekhar}@cs.umn.edu. G. Atluri is with the University of Cincinnati, Cincinnati, OH 45220. J.H. Faghmous is with the Icahn School of Medicine at Mount Sinai, New York, NY 10029. A. Ganguly is with Northeastern University, Boston, MA 02115. N. Samatova is with North Carolina State University, Raleigh, NC 27695.

Manuscript received 27 Dec. 2016; revised 14 May 2017; accepted 12 June 2017. Date of publication 27 June 2017; date of current version 8 Sept. 2017. (Corresponding author: Anuj Karpatne.) Recommended for acceptance by Y. Zhang. Digital Object Identifier no. 10.1109/TKDE.2017.2720168

Abstract—Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.

Index Terms—Data science, knowledge discovery, domain knowledge, scientific theory, physical consistency, interpretability

1 INTRODUCTION

FROM satellites in space to wearable computing devices, and from credit card transactions to electronic healthcare records, the deluge of data [1], [2], [3] has pervaded every walk of life. Our ability to collect, store, and access large volumes of information is accelerating at unprecedented rates with better sensor technologies, more powerful computing platforms, and greater on-line connectivity. With the growing size of data, there has been a simultaneous revolution in the computational and statistical methods for processing and analyzing data, collectively referred to as the field of data science. These advances have made long-lasting impacts on the way we sense, communicate, and make decisions [4], a trend that is only expected to grow in the foreseeable future. Indeed, the start of the twenty-first century may well be remembered in history as the "golden age of data science."

Apart from transforming commercial industries such as retail and advertising, data science is also beginning to play an important role in advancing scientific discovery. Historically, science has progressed by first generating hypotheses (or theories) and then collecting data to confirm or refute these hypotheses. However, in the big data era, ample data, which is being continuously collected without a specific theory or hypothesis in mind, offers further opportunity for discovering new knowledge. Indeed, the role of data science in scientific disciplines is beginning to shift from providing simple analysis tools (e.g., detecting particles in Large Hadron Collider experiments [5], [6]) to providing full-fledged knowledge discovery frameworks (e.g., in bio-informatics [7] and climate science [8], [9]). Based on the success of data science in applications where Internet-scale data is available (with billions or even trillions of samples), e.g., natural language translation, optical character recognition, object tracking, and most recently, autonomous driving, there is a growing anticipation of similar accomplishments in scientific disciplines [10], [11], [12]. To capture this excitement, some have even referred to the rise of data science in scientific disciplines as "the end of theory" [13], the idea being that the increasingly large amounts of data make it possible to build actionable models without using scientific theories.

Unfortunately, this notion of black-box application of data science has met with limited success in scientific domains (e.g., [14], [15], [16]). A well-known example of the perils of using data science methods in a theory-agnostic manner is Google Flu Trends, where a data-driven model was learned to estimate the number of influenza-related physician visits based on the number of influenza-related Google search queries in the United States [17]. This model was built using search terms that were highly correlated with the flu propensity in the Center for Disease Control (CDC) data. Despite its initial success, this model later overestimated the flu propensity by more than a factor of two, as measured by the number of influenza-related doctor visits in subsequent years, according to CDC data [15].

There are two primary characteristics of knowledge discovery in scientific disciplines that have prevented data science models from reaching the level of success achieved in commercial domains. First, scientific problems are often under-constrained in nature, as they suffer from a paucity of representative training samples while involving a large number of physical variables. Further, physical variables commonly show complex and non-stationary patterns that dynamically change over time. For this reason, the limited number of labeled instances available for training or cross-validation can often fail to represent the true nature of relationships in scientific problems. Hence, standard methods for assessing and ensuring generalizability of data science models may break down and lead to misleading conclusions. In particular, it is easy to learn spurious relationships that look deceptively good on training and test sets (even after using methods such as cross-validation), but do not generalize well outside the available labeled data. This was one of the main reasons behind the failure of Google Flu Trends, since the data used for training the model in the first few years was not representative of the trends in subsequent years [15]. The paucity of representative samples is one of the prime challenges that differentiates scientific problems from mainstream problems involving Internet-scale data, such as language translation or object recognition, where large volumes of labeled or unlabeled data have been critical in the success of recent advancements in data science such as deep learning.

The second primary characteristic of scientific domains that has limited the success of black-box data science methods is the basic nature of scientific discovery. While a common end-goal of data science models is the generation of actionable models, the process of knowledge discovery in scientific domains does not end at that. Rather, it is the translation of learned patterns and relationships to interpretable theories and hypotheses that leads to the advancement of scientific knowledge, e.g., by explaining or discovering the physical cause-effect mechanisms between variables. Hence, even if a black-box model achieves somewhat more accurate performance but lacks the ability to deliver a mechanistic understanding of the underlying processes, it cannot be used as a basis for subsequent scientific developments. Further, an interpretable model that is grounded by explainable theories stands a better chance at safeguarding against the learning of spurious patterns from the data that lead to non-generalizable performance. This is especially important when dealing with problems that are critical in nature and associated with high risks (e.g., healthcare).

The limitations of black-box data science models in scientific disciplines motivate a novel paradigm that uses the unique capability of data science models to automatically learn patterns and models from large data, without ignoring the treasure of accumulated scientific knowledge. We refer to this paradigm that attempts to integrate scientific knowledge and data science as theory-guided data science (TGDS). The paradigm of TGDS has already begun to show promise in scientific problems from diverse disciplines. Some examples include the discovery of novel climate patterns and relationships [18], [19], the closure of knowledge gaps in turbulence modeling efforts [20], [21], the discovery of novel compounds in material science [22], [23], [24], the design of density functionals in quantum chemistry [25], improved imaging technologies in bio-medical science [26], [27], the discovery of genetic biomarkers [28], and the estimation of surface water dynamics at a global scale [29], [30]. These efforts have been complemented by recent review papers [8], [31], [32], [33], workshops (e.g., a 2016 conference on physics-informed machine learning [34]), and industry initiatives (e.g., a recent IBM Research initiative on "physical analytics" [35]).

This paper attempts to build the foundations of theory-guided data science by presenting several ways of bringing scientific knowledge and data science models together, and illustrating them using examples of applications from diverse domains. A major goal of this article is to formally conceptualize the paradigm of "theory-guided data science", where scientific theories are systematically integrated with data science models in the process of knowledge discovery.

The remainder of the article is structured as follows. Section 2 provides an introduction to theory-guided data science and presents an overview of research themes in TGDS. Sections 3, 4, 5, 6, and 7 describe several approaches in every research theme of TGDS, using illustrative examples from diverse disciplines. Section 8 provides concluding remarks.

2 THEORY-GUIDED DATA SCIENCE

A common problem in scientific domains is to represent relationships among physical variables, e.g., the combustion pressure and launch velocity of a rocket, or the shape of an aircraft wing and its resultant air drag. The conventional approach for representing such relationships is to use models based on scientific knowledge, i.e., theory-based models, which encapsulate cause-effect relationships between variables that have either been empirically proven or theoretically deduced from first principles. These models can range from solving closed-form equations (e.g., using the Navier-Stokes equation for studying laminar flow) to running computational simulations of dynamical systems (e.g., the use of numerical models in climate science, hydrology, and turbulence modeling). An alternate approach is to use a set of training examples involving input and output variables for learning a data science model that can automatically extract relationships between the variables.

Fig. 1. A representation of knowledge discovery methods in scientific applications. The x-axis measures the use of data while the y-axis measures the use of scientific knowledge. Theory-guided data science explores the space of knowledge discovery that makes ample use of the available data while being observant of the underlying scientific knowledge.

As depicted in Fig. 1, theory-based and data science models represent the two extremes of knowledge discovery, which depend on only one of the two sources of information available in any scientific problem, i.e., scientific knowledge or data. They both enjoy unique strengths and have found success in different types of applications. Theory-based models (see top-left corner of Fig. 1) are well-suited for representing processes that are conceptually well understood using known scientific principles. On the other hand, traditional data science models mainly rely on the information contained in the data and thus reside in the bottom-right corner of Fig. 1. They have a wide range of applicability in domains where we have an ample supply of representative data samples, e.g., in Internet-scale problems such as text mining and object recognition.

Despite their individual strengths, theory-based and data science models suffer from certain deficiencies when applied to problems of great scientific relevance, where both theory and data are currently lacking. For example, a number of scientific problems involve processes that are not completely understood by our current body of knowledge because of the inherent complexity of the processes. In such settings, theory-based models are often forced to make a number of simplifying assumptions about the physical processes, which not only leads to poor performance but also renders the models difficult to comprehend and analyze. We illustrate this scenario using the following example from hydrological modeling.

Example 1 (Hydrological Modeling). One of the primary objectives of hydrology is to study the processes responsible for the movement, distribution, and quality of water across the planet. Some examples of such processes include the discharge of water from the atmosphere via precipitation, and the infiltration of water underneath the Earth's surface, known as subsurface flow. Understanding subsurface flow is important as it is intricately linked with terrestrial ecosystem processes, agricultural water use, and sudden adverse events such as floods. However, our knowledge of subsurface flow using state-of-the-art hydrological models is quite limited [36]. This is mainly because subsurface flow operates in a regime that is difficult to measure directly using in-situ sensors such as boreholes. In addition, subsurface flow involves a number of complex sub-processes that interact in non-linear ways, which are difficult to encapsulate in current theory-based models [37]. Due to these challenges, existing hydrological models make use of a broad range of parameters in several weakly-informed physical equations. Thus, global hydrological models tend to show poor predictive performance in describing subsurface flow processes [38]. In addition, they also lose physical interpretability due to the large number of model parameters that are difficult to interpret meaningfully with respect to the domain.

If we apply "black-box" data science models to scientific problems, we notice a completely different set of issues arising due to the inadequacy of the available data in representing the complex spaces of hypotheses encountered in physical domains. Further, since most data science models can only capture associative relationships between variables, they do not fully serve the goal of understanding causative relationships in scientific problems.

Hence, neither a data-only nor a theory-only approach can be considered sufficient for knowledge discovery in complex scientific applications. Instead, there is a need to explore the continuum between theory-based and data science models, where both theory and data are used in a synergistic manner. The paradigm of theory-guided data science attempts to address the shortcomings of data-only and theory-only models by seamlessly blending scientific knowledge into data science models (see Fig. 1). By integrating scientific knowledge into data science models, TGDS aims to learn dependencies that have a sufficient grounding in physical principles and thus have a better chance of representing causative relationships. TGDS further attempts to achieve better generalizability than models based purely on data by learning models that are consistent with scientific principles, termed physically consistent models.

To illustrate the role of "consistency with scientific knowledge" in ensuring better generalization performance, consider the example of learning a parametric model for a predictive learning problem using a limited supply of labeled samples. Ideally, we would like to learn a model that shows the best generalization performance over any unseen instance. Unfortunately, we can only observe the model performance on the available training set, which may not be truly representative of the true generalization performance (especially when the training size is small). In recognition of this fact, a number of learning frameworks have been explored to favor the selection of simpler models that may have lower accuracy on the training data (compared to more complex models) but are likely to have better generalization performance. This methodology, which builds on the well-known statistical principle of the bias-variance trade-off [39], can be described using Fig. 2.

Fig. 2. Scientific knowledge can help in reducing the model variance by removing physically inconsistent solutions, without likely affecting their bias.

Fig. 2 shows an abstract representation of a succession of model families with varying levels of complexity (shown as curved lines), where M1 represents the set of least complex models while M3 contains highly complex models. Every point on the curved lines represents a model that a learning algorithm can arrive at, given a particular realization of training instances. The true relationship between the input and output variables is depicted as a star in Fig. 2. We can observe that the learned models belonging to M3, on average, are quite close to the true relationship. However, even a small change in the training set can bring about large changes in the learned models of M3. Hence, M3 shows low bias but high variance. On the other hand, models belonging to M1 are quite robust to changes in the training set and thus show low variance. However, M1 shows high bias, as its models are generally farther away from the true relationship as compared to models of M3. It is the trade-off between reducing bias and variance that is at the heart of a number of machine learning algorithms [39], [40], [41].
In scientific applications, there is another source of information that can be used to ensure the selection of generalizable models, which is the available scientific knowledge. By pruning candidate models that are inconsistent with known scientific principles (shown as shaded regions in Fig. 2), we can significantly reduce the variance of models without likely affecting their bias. A learning algorithm can then be focused on the space of physically consistent models, leading to generalizable and scientifically interpretable models. Hence, one of the overarching visions of TGDS is to include physical consistency as a critical component of model performance along with training accuracy and model complexity. This can be summarized in a simple way by the following revised objective of model performance in TGDS:

Performance ∝ Accuracy + Simplicity + Consistency.
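As a concrete illustration of how such a revised objective could be operationalized, the following minimal numpy sketch scores a linear model by training error, a standard complexity term, and a physical-inconsistency penalty. The specific penalty (treating negative predictions of a nonnegative physical quantity as inconsistent) and the weights lam_simple and lam_phys are illustrative assumptions, not quantities from this paper.

```python
import numpy as np

def tgds_objective(w, X, y, lam_simple=0.1, lam_phys=1.0):
    """Toy model-selection score for a linear model y ~ X @ w, written as a
    loss to minimize: training error + complexity + physical inconsistency."""
    pred = X @ w
    accuracy = np.mean((y - pred) ** 2)           # training accuracy term
    simplicity = lam_simple * np.sum(w ** 2)      # standard complexity proxy
    # Assumed consistency term: the target is a nonnegative physical
    # quantity, so negative predictions are penalized as inconsistent.
    consistency = lam_phys * np.mean(np.minimum(pred, 0.0) ** 2)
    return accuracy + simplicity + consistency
```

Any off-the-shelf optimizer (e.g., scipy.optimize.minimize) can minimize this score, steering the search toward the physically consistent (unshaded) region of Fig. 2.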
design of data science models. First, we can use synergistic
There are various ways of introducing physical consistency into data science models, in different forms and capacities. While some approaches attempt to naturally incorporate physical consistency into existing learning frameworks of data science models, others explore innovative ways of blending data science principles with theory-based models. In the following sections, we describe five broad categories of approaches for combining scientific knowledge with data science that are illustrative of emerging examples of TGDS research in diverse disciplines. Note that many of these approaches can be applied together in multiple combinations for a particular problem, depending on the nature of scientific knowledge and the type of data science method. The five research themes of TGDS can be briefly summarized as follows.

First, scientific knowledge can be used in the design of model families to restrict the space of models to physically consistent solutions, e.g., in the selection of response and loss functions or in the design of model architectures. These techniques are discussed in Section 3. Second, given a model family, we can also guide a learning algorithm to focus on physically consistent solutions. This can be achieved, for instance, by initializing the model with physically meaningful parameters, by encoding scientific knowledge as probabilistic relationships, by using domain-guided constraints, or with the help of regularization terms inspired by our physical understanding. These techniques are discussed in Section 4. Third, the outputs of data science models can be refined using explicit or implicit scientific knowledge. This is discussed in Section 5. Fourth, another way of blending scientific knowledge and data science is to construct hybrid models, where some aspects of the problem are modeled using theory-based components while other aspects are modeled using data science components. Techniques for constructing hybrid TGDS models are discussed in Section 6. Fifth, data science methods can also help in augmenting theory-based models to make effective use of observational data. These approaches are discussed in Section 7.

3 THEORY-GUIDED DESIGN OF DATA SCIENCE MODELS

An important decision in the learning of data science models is the choice of model family used for representing the relationships between input and response variables. In scientific applications, if the domain knowledge suggests a particular form of relationship between the inputs and outputs, care must be taken to ensure that the same form of relationship is used in the data science model. Here, we discuss two different ways of using scientific knowledge in the design of data science models. First, we can use synergistic combinations of response and loss functions (e.g., in generalized linear models or artificial neural networks) that not only simplify the optimization process and thus lead to low training errors, but are also consistent with our physical understanding and hence result in generalizable solutions. Another way to infuse domain knowledge is by choosing a model architecture (e.g., the placement of layers in artificial neural networks) that is compliant with scientific knowledge. We discuss both of these approaches in the following.

3.1 Theory-Guided Specification of Response

Many data science models provide the option of specifying the form of relationship used for describing the response variable. For example, a generic family of models, which can represent a broad variety of relationships between input and response variables, is the generalized linear model (GLM). There are two basic building blocks in a GLM, the link function g(·) and the probability distribution P(y|x). Using these building blocks, the expected mean μ of the target variable y is determined as a function of the weighted linear combination of inputs, x, as follows:

g(μ) = wᵀx + b, or equivalently, μ = g⁻¹(wᵀx + b),   (1)

where w and b are the parameters of the GLM to be learned from the data. Some common choices of link and probability distribution functions are listed in Table 1, resulting in varying types of regression models.

TABLE 1
Some Commonly Used Combinations of Link Functions and Probability Distributions in Generalized Linear Models

Name       Link Function       Probability Distribution
Linear     μ                   Gaussian
Poisson    log(μ)              Poisson
Logistic   log(μ / (1 − μ))    Binomial
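To make the pairing of link function and distribution concrete, here is a small sketch (not from the paper) that fits a GLM for a count-valued response with statsmodels, pairing the log link with a Poisson distribution as in the second row of Table 1. The synthetic data and coefficients are placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))           # inputs with intercept
counts = rng.poisson(lam=np.exp(X @ [0.3, 0.5, -0.2]))   # count-valued response

# Theory says the response is a count, so pair the log link with a Poisson
# distribution instead of defaulting to least squares with a Gaussian assumption.
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)   # estimates of (b, w) from Equation (1)
```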
To ensure the learning of GLMs that produce physically meaningful results, it is important to choose an appropriate specification of the response variable that matches domain understanding. For example, while modeling response variables that show extreme effects (highly skewed distributions), e.g., occurrences of unusually severe floods and droughts, it would be inappropriate to assume the response variable to be Gaussian distributed (the standard assumption used in linear regression models). Instead, a regression model that uses the Gumbel distribution to model extreme values would be more accurate and physically meaningful.
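As a hedged sketch of this idea, the following code fits a linear model with a Gumbel-distributed response by maximum likelihood using scipy. The simulated data, the choice of a constant scale, and the optimizer settings are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gumbel_r

def gumbel_nll(params, X, y):
    """Negative log-likelihood of a linear model with Gumbel-distributed
    response: y ~ Gumbel(loc = X @ w + b, scale = exp(log_scale))."""
    w, b, log_scale = params[:-2], params[-2], params[-1]
    loc = X @ w + b
    return -np.sum(gumbel_r.logpdf(y, loc=loc, scale=np.exp(log_scale)))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = gumbel_r.rvs(loc=X @ [1.0, -0.5] + 2.0, scale=0.7, random_state=2)

x0 = np.zeros(X.shape[1] + 2)   # initial (w, b, log_scale)
fit = minimize(gumbel_nll, x0, args=(X, y), method="L-BFGS-B")
print(fit.x)   # recovered weights, intercept, and log of the scale
```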
In general, the idea of specifying the model response using scientific principles can be explored in many types of learning algorithms. An example of theory-guided specification of response can be found in the field of ophthalmology, where the use of Zernike polynomials was explored by Twa et al. [42] for the classification of corneal shape using decision trees.

3.2 Theory-Guided Design of Model Architecture

Scientific knowledge can also be used to influence the architecture of data science models. An example of a data science model that provides ample room for tuning the model architecture is the artificial neural network (ANN), which has recently gained widespread acceptance in several applications such as vision, speech, and language processing. There are a number of design considerations that influence the construction of an effective ANN model. Some examples include the number of hidden layers and the nature of connections among the layers, the sharing of model parameters among nodes, and the choice of activation and loss functions for effective model learning. Many of these design considerations are primarily motivated to simplify the learning procedure, minimize the training loss, and ensure robust generalization performance using statistical principles of regularization.

There is a huge opportunity in informing these design considerations with our physical understanding of a problem, to obtain generalizable as well as scientifically interpretable results. For example, in an attempt to build a model of the brain that learns view-invariant features of human faces, the use of biologically plausible rules in ANN architectures was recently explored in [43]. It was observed that along with preserving view-invariance, such theory-guided ANN models were able to capture a known aspect of human neurology (namely, the mirror-symmetric tuning to head orientation) that was being missed by traditional ANN models. This made it possible to learn scientifically interpretable models of human cognition and thus advance our understanding of the inner workings of the brain. In the following, we describe two promising directions for using scientific knowledge while constructing ANN models: by using a modular design that is inspired by domain understanding, and by specifying the connections among the nodes in a physically consistent manner.

Domain knowledge can be used in the design of ANN models by decomposing the overall problem into modular sub-problems, each of which represents a different physical sub-process. Every sub-problem can then be learned using a different ANN model, whose inputs and outputs are connected with each other in accordance with the physical relationships among the sub-processes. For example, in order to describe the overall hydrological process of surface water discharge, we can learn modular ANN models for different sub-processes such as the atmospheric process of rainfall and evaporation, the process of surface water runoff, and the process related to groundwater seepage. Every ANN model can be fed with appropriately chosen domain features at the input and output layers. This will help in using the power of deep learning frameworks while following a high-level organization in the ANN architecture that is motivated by domain knowledge.
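The following PyTorch sketch illustrates one way such a modular design could be wired. The decomposition into four sub-networks, the feature dimensions, and the connections between them are hypothetical choices for illustration, not an architecture from the paper.

```python
import torch
import torch.nn as nn

def mlp(n_in, n_out, hidden=16):
    return nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_out))

class ModularHydrologyNet(nn.Module):
    """Hypothetical decomposition: one sub-network per physical sub-process,
    wired together the way the sub-processes feed each other."""
    def __init__(self):
        super().__init__()
        self.rainfall_evap = mlp(4, 2)    # atmospheric drivers -> rainfall, evaporation
        self.surface_runoff = mlp(3, 1)   # rainfall/evap + land features -> runoff
        self.groundwater = mlp(3, 1)      # rainfall/evap + soil features -> seepage
        self.discharge = mlp(2, 1)        # runoff + seepage -> surface water discharge

    def forward(self, atmos, land, soil):
        rain_evap = self.rainfall_evap(atmos)
        runoff = self.surface_runoff(torch.cat([rain_evap, land], dim=1))
        seepage = self.groundwater(torch.cat([rain_evap, soil], dim=1))
        return self.discharge(torch.cat([runoff, seepage], dim=1))

net = ModularHydrologyNet()
out = net(torch.randn(8, 4), torch.randn(8, 1), torch.randn(8, 1))
```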
Domain knowledge can also be used in the design of ANN models by specifying node connections that capture theory-guided dependencies among variables. A number of variants of ANN have been explored to capture spatial and temporal dependencies between the input and output variables. For example, recurrent neural networks (RNN) are able to incorporate the sequential context of time in speech and language processing [44]. RNN models have recently been explored to capture notions of long and short term memory (LSTM) with the help of skip connections among nodes to model information delay [45]. Such models can be used to incorporate time-varying domain characteristics in scientific applications. For example, while surface water runoff directly influences surface water discharge without any delay, groundwater runoff has a longer latency and contributes to the surface water discharge after some time lag. Such differences in time delay can be effectively modeled by a suitably designed LSTM model. Another variant of ANN is the convolutional neural network (CNN) [46], which has been widely applied in vision and image processing applications to capture spatial dependencies in the data. It further facilitates the sharing of model parameters so that the learned features are invariant to simple transformations such as scaling and translation. Similar approaches can be explored to share the parameters (and thus reduce model complexity) over more generic similarity structures among the input features that are based on domain knowledge.
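A minimal PyTorch sketch of the discharge example above, under the assumption that the network sees runoff series as inputs: the surface-runoff branch feeds the output head directly, while the groundwater branch passes through an LSTM so the model can learn the lagged contribution. All dimensions and names are hypothetical.

```python
import torch
import torch.nn as nn

class LaggedDischargeNet(nn.Module):
    """Sketch: surface runoff affects discharge instantaneously, while the
    groundwater contribution is routed through an LSTM so the network can
    learn its longer time lag."""
    def __init__(self, hidden=8):
        super().__init__()
        self.groundwater_memory = nn.LSTM(input_size=1, hidden_size=hidden,
                                          batch_first=True)
        self.head = nn.Linear(1 + hidden, 1)

    def forward(self, surface_runoff, groundwater_runoff):
        # both inputs: (batch, time, 1)
        delayed, _ = self.groundwater_memory(groundwater_runoff)
        return self.head(torch.cat([surface_runoff, delayed], dim=-1))

net = LaggedDischargeNet()
discharge = net(torch.randn(4, 30, 1), torch.randn(4, 30, 1))   # (4, 30, 1)
```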

4 THEORY-GUIDED LEARNING OF DATA SCIENCE MODELS

Having chosen a suitable model design, the next step of model building involves navigating the search space of candidate models using a learning algorithm. In the following, we present four different ways of guiding the learning algorithm to choose physically consistent models. First, we can use physically consistent solutions as initial points in iterative learning algorithms such as gradient descent methods. Second, we can restrict the space of probabilistic models with the help of theory-guided priors and relationships. Third, scientific knowledge can be used as constraints in optimization schemes for ensuring physical consistency. Fourth, scientific knowledge can be encoded as regularization terms in the objective function of learning algorithms. We describe each of these approaches in the following.

4.1 Theory-Guided Initialization

Many learning algorithms that are iterative in nature require an initial choice of model parameters as a first step to commence the learning process. For such algorithms, an inferior initialization can lead to the learning of a poor model. Domain knowledge can help in the process of model initialization so that the learning algorithm is guided at an early stage to choose generalizable and physically consistent models.

An example of theory-guided initialization of model parameters is a recent matrix completion approach for plant trait analysis [47], where the rows of the matrix correspond to plants from diverse environments while the columns correspond to plant traits such as leaf area, seed mass, and root length. Since observations about plant traits are sparsely available, such a plant trait matrix would be highly incomplete [48]. Filling the missing entries in a plant trait matrix can help us understand the characteristics of different plant species and their ability to adapt to varying environmental conditions. A traditional data science approach to this problem is to use matrix completion algorithms that have found great success in online recommender systems [49]. However, many of these algorithms are iterative in nature and use fixed or random values to initialize the matrix. In the presence of domain knowledge, we can improve these algorithms by using the species mean of every attribute as initial values in the matrix completion process. This relies on the basic principle that the species mean provides a robust estimate of the average behavior across all organisms. This approach has been shown to provide significant improvements in the accuracy of predicting plant traits over traditional methods [47]. Changes from the species mean can also be learned using subsequent matrix completion operations, which could be physically interpreted as the effect of varying environmental conditions on plant traits.
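To sketch this, the code below runs a soft-impute-style low-rank completion in numpy in which missing cells start at each species' mean trait value rather than zeros or random numbers. It is a simplified stand-in for the method of [47]; the rank, iteration count, and fallback to the global trait mean are assumptions.

```python
import numpy as np

def complete_with_species_mean_init(T, species, rank=5, n_iter=100):
    """Low-rank completion of a plant-trait matrix T (NaN = missing), with
    missing cells initialized at the species mean of each trait, falling
    back to the global trait mean where a species has no observations."""
    missing = np.isnan(T)
    global_means = np.nanmean(T, axis=0)
    filled = T.copy()
    for s in np.unique(species):
        rows = species == s
        col_means = np.nanmean(T[rows], axis=0)            # species mean per trait
        col_means = np.where(np.isnan(col_means), global_means, col_means)
        filled[rows] = np.where(np.isnan(T[rows]), col_means, T[rows])
    for _ in range(n_iter):   # alternate low-rank projection and data re-imposition
        U, s_vals, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s_vals[:rank]) @ Vt[:rank]
        filled[missing] = low_rank[missing]                # only update missing cells
    return filled
```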
One of the data science models that requires special effort in choosing an appropriate combination of initial model parameters is the artificial neural network, which is known to be susceptible to getting stuck at local minima, saddle points, and flat regions of the loss curve. In the era of deep learning, much progress has been made to avoid the problem of inferior ANN initialization with the help of pretraining strategies. The basic idea of these strategies is to train the ANN model on a simpler problem (with ample availability of representative data) and use the trained model to initialize the learning for the original problem. These pretraining strategies have made a major impact on our ability to learn complex hierarchies of features in several application domains such as speech and image processing. However, they rely on plentiful amounts of unlabeled or labeled data and hence are not directly applicable in scientific domains where the data sizes are small relative to the number of variables. One way to address this challenge is by devising novel pretraining strategies where computational simulations of theory-based models are used to initialize the ANN model. This can be especially useful when theory-based models can produce approximate simulations quickly, e.g., approximate model simulations of turbulent flow (see Example 5). Such pretrained theory-guided ANN models can then be fine-tuned using expert-quality ground truth.
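A minimal two-stage sketch of such simulation-based pretraining in PyTorch, assuming hypothetical simulated pairs (X_sim, y_sim) from a cheap theory-based model and a much smaller set of real observations (X_real, y_real); the random tensors below stand in for both.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
mse = nn.MSELoss()

def fit(model, X, y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = mse(model(X), y)
        loss.backward()
        opt.step()

# Stage 1: pretrain on cheap, approximate simulations of a theory-based model.
X_sim, y_sim = torch.randn(5000, 10), torch.randn(5000, 1)   # placeholders
fit(net, X_sim, y_sim, epochs=200, lr=1e-3)

# Stage 2: fine-tune the same weights on the small set of real observations,
# typically with a smaller learning rate so the theory-informed start persists.
X_real, y_real = torch.randn(50, 10), torch.randn(50, 1)     # placeholders
fit(net, X_real, y_real, epochs=50, lr=1e-4)
```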
4.2 Theory-Guided Probabilistic Models

Probabilistic graphical models provide a natural way to encode domain-specific relationships among variables as edges between nodes representing the variables. However, manually encoding domain knowledge in graphical models requires a great deal of expert supervision, which can be cumbersome for problems involving a large number of variables with complex interactions, a common feature of scientific problems. In the presence of a large number of nodes, it is common to apply automated graph estimation techniques such as the graph Lasso [50]. The basic objective of such techniques is to estimate a sparse inverse covariance matrix that maximizes the model likelihood given the data. To assist such techniques with scientific knowledge, a promising research direction is to explore graph estimation techniques that maximize data likelihood while limiting the search to physically consistent solutions.

Another approach to reduce the variance of model parameters (and thus avoid model overfitting) is to introduce priors in the model space. An example of the use of theory-guided priors is the problem of non-invasive electrophysiological imaging of the heart. In this problem, the electrical activity within the walls of the heart needs to be predicted based on the ECG signal measured on the torso of a subject. There are approximately 2,000 locations in the walls of the heart where electrical activity needs to be predicted, based on ECG data collected from approximately 100 electrodes on the torso. Given the large space of model parameters and the paucity of labeled examples with ground-truth information, a traditional black-box model that only uses the information contained in the data is highly prone to learning spurious patterns. However, apart from the knowledge contained in the data, we also have domain knowledge (represented using electrophysiological equations) about how electrical signals are transmitted within the heart via the myocardial fibre structure. These equations can be used to determine the spatial distribution of the electric signals in the heart at time t based on the predicted electric signals at t − 1. Incorporating such theory-guided spatial distributions as priors and using them along with externally collected ECG data in a hierarchical Bayesian model has been shown to provide promising results over traditional data science models [26], [27]. Another example of theory-guided priors can be found in the field of geophysics [51], where the knowledge of convection-diffusion equations was used as priors for determining the connectivity structure of subsurface aquifers.
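The essence of a theory-guided prior can be seen in the following numpy sketch, where a Gaussian prior is centered on a theory-predicted parameter vector instead of zero. The closed-form MAP solution is standard; casting the heart-imaging problem in this simple linear form is an illustrative simplification, not the hierarchical Bayesian model of [26], [27].

```python
import numpy as np

def map_estimate(X, y, w_theory, sigma2=1.0, tau2=0.1):
    """MAP estimate of linear weights under a Gaussian prior centered on a
    theory-predicted vector w_theory (e.g., values propagated by physical
    equations), rather than the usual zero-centered ridge prior.

    Posterior mode: argmin_w ||y - Xw||^2 / sigma2 + ||w - w_theory||^2 / tau2
    """
    d = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(d) / tau2
    b = X.T @ y / sigma2 + w_theory / tau2
    return np.linalg.solve(A, b)
```

Shrinking tau2 expresses stronger trust in the theory-based estimate, which is exactly what helps when labeled examples are scarce relative to the number of parameters.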
4.3 Theory-Guided Constrained Optimization

Constrained optimization techniques are extensively used in data science models for restricting the space of model parameters. For example, support vector machines use constraints for ensuring separability among the classes while maximizing the margin of the hyperplane. There is also a rich literature on constraint-based pattern mining [52], [53] and clustering [54]. The use of constraints provides a natural way to integrate domain knowledge in the learning of data science models. In scientific applications where theory-based constraints can be represented using linear equality or inequality conditions, they can be readily integrated in existing constrained optimization formulations, which are known to provide computationally efficient solutions, especially when the objective function is convex.

However, many scientific problems involve constraints that are represented in complex forms, e.g., using partial differential equations (PDEs) or non-linear transformations of variables, which are not easily handled by traditional constrained optimization methods. For example, the Navier-Stokes equation for momentum expresses the following constraint between the flow velocity u and the fluid pressure p:

ρ (∂u/∂t + u · ∇u) = −∇p + ∇ · ( μ (∇u + (∇u)ᵀ) − (2/3) μ (∇ · u) I ),

where ρ is the fluid density, μ is the fluid dynamic viscosity, and ∇ represents the gradient operator with respect to the spatial coordinates.

To utilize such complex forms of constraints in data science models, it is necessary to develop constrained optimization techniques that can use common forms of partial differential equations encountered in scientific disciplines. An example of a data-driven approach that uses domain-driven PDEs can be found in recent work in climate science [55], [56], where physically constrained time-series regression models were developed to incorporate memory effects in time as well as the nonlinear noise arising from energy-conserving interactions.
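One common way to fold a PDE constraint into learning is to penalize the PDE residual at collocation points. The sketch below does this in PyTorch for a much simpler one-dimensional advection-diffusion equation, u_t + c·u_x − ν·u_xx = 0, standing in for richer constraints like Navier-Stokes; the model, constants, and collocation points are assumed placeholders.

```python
import torch

def pde_residual_loss(model, x, t, c=1.0, nu=0.01):
    """Mean squared residual of u_t + c*u_x - nu*u_xx = 0 at collocation
    points, for a network u = model([x, t]) mapping (N, 2) -> (N, 1)."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=1)).squeeze(-1)

    def grad(out, var):
        return torch.autograd.grad(out.sum(), var, create_graph=True)[0]

    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return ((u_t + c * u_x - nu * u_xx) ** 2).mean()

# total loss (sketch): data misfit + lam * pde_residual_loss(model, x_col, t_col)
```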
In the following, we present detailed discussions of two illustrative examples of the use of theory-guided constraints. While Example 2 explores the use of constraints for predicting electron density in computational chemistry, Example 3 explores the use of elevation-based constraints among locations for mapping surface water dynamics.

Example 2 (Computational Chemistry). In computational chemistry, solving Schrödinger's equation is at the basis of all quantum mechanical calculations for predicting the properties of solids and molecules. Schrödinger's equation can be expressed as

HΨ = EΨ   (2)
   = (T + U + V)Ψ,   (3)

where H is the electronic Hamiltonian operator, Ψ is the wavefunction that describes the quantum state of the system, and E is the total energy consisting of three terms: the kinetic energy, T, the electron-electron interaction energy, U, and the potential energy arising due to external fields, V (e.g., due to positively charged nuclei). Since the computational complexity of directly solving Schrödinger's equation grows rapidly with the number of particles, N, it is infeasible for solving large many-particle systems in practical applications.

To address this, a new class of quantum chemical modeling approaches was developed by Hohenberg and Kohn in 1964 [57], which uses the electron density n(r) as a basic primitive in all calculations, instead of the wavefunction Ψ. This has resulted in the rise of density functional theory (DFT) methods, which have become a standard tool for solving many-particle systems. In DFT, every variable can be expressed as a functional of the electron density function n(r) (where a functional is a function of functions). For example, the total energy E can be expressed in terms of functionals of n(r) as follows:

E[n] = T[n] + U[n] + V[n].   (4)

The density, n₀(r), that leads to the lowest total energy, E[n₀], is known as the ground-state density of the system, which is a critical quantity to determine.

However, obtaining n₀(r) is challenging because of the interaction functional, U[n], whose exact form is unknown. Different approximations of the interaction term have been developed to solve for the ground-state density of a system, the most notable being the class of Kohn-Sham (KS) DFT methods. However, their performance is sensitive to the quality of the approximation used in modeling the interactions. Also, KS DFT methods have a computational complexity of O(N³), which makes them challenging to apply to large systems.

To overcome the challenges in existing DFT methods, recent work by Li et al. [25] explored the use of data science models to approximate T[n], and used such approximations to predict the ground-state density, n₀(r). In this work, kernel ridge regression methods were used to model the kinetic energy, T[n], of a four-particle system as a functional of its electron density, n(r). Having learned T̂[n], we can obtain the ground-state density, n₀(r), using the following Euler-Lagrange equation:

δT̂[n₀] / δn₀(r) = μ − v(r),   (5)

where v(r) is the external potential and μ is an adjustable constant. This imposes a theory-guided constraint on the model learning, such that T̂[n] must not only show good performance in predicting the kinetic energy, but should also accurately estimate the ground-state density, n₀(r), using Equation (5). A functional that adheres to this constraint can be called "self-consistent."

It was shown in [25] that a regression model that only focuses on minimizing the training error leads to highly inconsistent solutions of the ground-state density, and is thus not useful for quantum chemical calculations. This inconsistency can be traced to the inability of regression models to capture the functional derivative forms that are used in Equation (5). In particular, the derivative of T̂[n] can easily leave the space of densities observed in the training set, and thus arrive at ill-conditioned solutions, especially when the training size is small.

To overcome this limitation, a modified Euler-Lagrange constraint was proposed in [25], which restricted the space of n₀(r) to the density manifold observed in the training set. This helped in learning accurate as well as self-consistent ground-state densities using the knowledge contained in the data as well as domain theories.
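The data-science half of this example, fitting T̂[n] with kernel ridge regression, can be sketched in a few lines of scikit-learn. The densities and energies below are random stand-ins and the hyperparameters are arbitrary; the self-consistency constraint of Equation (5) is noted in a comment rather than implemented.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Hypothetical training data: each row is an electron density n(r) sampled
# on a grid, with a corresponding kinetic energy T[n] from a reference solver.
rng = np.random.default_rng(0)
densities = rng.random((40, 100))         # stand-in for n(r) on 100 grid points
kinetic = densities.sum(axis=1) ** 1.5    # stand-in for T[n]

# Kernel ridge regression approximates the functional T[n], as in [25].
T_hat = KernelRidge(kernel="rbf", gamma=1e-3, alpha=1e-6).fit(densities, kinetic)
print(T_hat.predict(densities[:3]))

# The self-consistency constraint of Equation (5) would be enforced separately,
# by restricting the search for n0(r) to the manifold of training densities.
```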
Example 3 (Mapping Surface Water Dynamics). Remote sensing data from Earth-observing satellites presents a promising opportunity for monitoring the dynamics of surface water body extent at regular intervals of time. It is possible to build predictive models that use multi-spectral data from satellite images as input features to classify pixels of the image as water or land. However, these models are challenged by the poor quality of labeled data, noise and missing values in remote sensing signals, and the inherent variability of water and land classes over space and time [58], [59].

To address these challenges, there is an opportunity to improve the quality of classification maps by using the domain knowledge that water bodies have a concave elevation structure. Hence, locations at a lower elevation are filled up first before the water level reaches locations at higher elevations. Thus, if we have access to elevation information (e.g., from bathymetric measurements obtained via sonar instruments), we can use it to constrain the classifier so that it not only minimizes the training error in the feature space but also produces labels that are consistent with the elevation structure. To illustrate this, consider the example of a two-dimensional training set shown in Fig. 3a, where the squares and circles represent training instances belonging to the water and land classes, respectively. Along with the features, we also have information about the elevation of every instance, shown using the intensity of the colored points in Fig. 3a.

If we disregard the elevation information and learn a linear classifier to simply minimize the training errors, we would learn the decision boundary shown using a dotted line in Fig. 3a. This classifier would make some mistakes in the lower-left corner of the feature space, where the class confusion is difficult to resolve using a linear separator. However, if we use the elevation information, we can see that the entire group of instances in the lower-left corner has a higher elevation than the instances shown on the right (labeled as land), and are thus less likely to be filled with water. For example, notice that location A is at a higher elevation than both B and C (see Fig. 3b). Hence, if B is labeled as land, it would be inconsistent to classify A as water, and instead it should be classified as land. The use of such constraints can help in learning a generalizable classification model even with poorly labeled training data.

Fig. 3. An illustrative example of the use of elevation-based ordering (domain theory) for learning physically consistent classification boundaries of water and land. Along with the distribution of training instances in the feature space, we also have information about their elevation, as shown in Fig. 3a. This information can be used to learn an elevation-aware classification boundary that produces physically viable labels, e.g., if B is labeled as land, then A must necessarily be labeled as land as it is at a higher elevation, as shown in Fig. 3b.
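One simple way to encode the filling behavior is to replace per-pixel thresholding with a single elevation cutoff, as in the numpy sketch below. This is a hypothetical post-hoc heuristic (closer in spirit to the refinement style discussed in Section 5) offered for illustration, not the constrained formulation used in this example.

```python
import numpy as np

def enforce_elevation_consistency(water_score, elevation, threshold=0.5):
    """Since a water body fills from the bottom up, the final labels should
    be 'water' below some elevation cutoff and 'land' above it. We pick the
    largest such cutoff that the classifier scores still support, instead
    of thresholding each pixel independently."""
    order = np.argsort(elevation)   # lowest elevation first
    prefix_mean = np.cumsum(water_score[order]) / np.arange(1, len(order) + 1)
    supported = np.nonzero(prefix_mean >= threshold)[0]
    n_water = supported.max() + 1 if supported.size else 0
    labels = np.zeros(len(order), dtype=int)   # 0 = land
    labels[order[:n_water]] = 1                # 1 = water
    return labels
```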

4.4 Theory-Guided Regularization

One way to constrain the search space of model parameters is to use regularization terms in the objective function, which penalize the learning of overly complex models. A number of regularization techniques have been explored in the data science community to enforce different measures of model complexity. For example, minimizing the Lp norm of model parameters has been extensively used for obtaining various effects of regularization in parametric model learning. While the L2 norm has been used to avoid overly large parameter values in ridge regression and support vector machines, minimizing the L1 norm results in the Lasso formulation and the Dantzig selector, both of which encode sparsity in the model parameters.

However, these techniques are agnostic to the physical feasibility of the learned model and thus can lead to physically inconsistent solutions. For example, while predicting the elastic modulus using bond energy and melting point, Lasso may favor melting point over bond energy even though a direct causal link exists between bond energy and the modulus [31]. This can result in the elimination of meaningful attributes and the selection of secondary attributes that are not directly relevant. Hence, there is a need to devise regularization techniques that can incorporate scientific knowledge to restrict the search space of model parameters. For example, instead of using the Lp norm for regularization, we can find solutions on physically consistent sub-spaces of models. The Gaussian widths of such sub-spaces can be used as a regularization term in techniques such as the generalized Dantzig selector [60], [61]. In the following, we describe two research directions for theory-guided regularization that have been explored in different applications: using variants of Lasso to incorporate domain-specific structure among parameters, and the use of multi-task learning formulations to account for the heterogeneity in data sub-populations.
The group Lasso [62] is a useful variant of Lasso that has been explored in problems involving structured attributes. It assumes the knowledge of a grouping structure among the attributes, where only a small number of groups are considered relevant. As an example in bio-marker discovery, the groups of attributes may correspond to sets of bio-markers that are related via a common biological pathway. Group Lasso helps in selecting physically meaningful groups of attributes in data science models, and various extensions of group Lasso have been explored for handling different types of domain characteristics, e.g., overlapping group Lasso [63], tree-guided group Lasso [64], and sparse group Lasso [65].

In recent work [66], applications of sparse group Lasso were explored to model the domain characteristics of climate variables. In this work, climate variables observed over a range of spatial locations were used to predict a climate phenomenon of interest. By treating the set of variables observed at every location as a group, the use of group Lasso ensured that if a location is selected, all of the climate variables observed at that location will be used as relevant features. Such features thus represent meaningful (spatially coherent) regions in space that can be studied to identify physical pathways of relationships in climate science.
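The structure-inducing penalty at the heart of these methods is easy to state in code. The sketch below shows a plain (unweighted) group Lasso penalty with one group per spatial location; the group sizes and counts are illustrative.

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """Sum of L2 norms over predefined attribute groups (e.g., all climate
    variables observed at one spatial location form one group). Driving a
    whole group's norm to zero drops that location from the model."""
    return sum(np.linalg.norm(w[g]) for g in groups)

# e.g., 5 locations x 4 climate variables, flattened into 20 coefficients:
groups = [np.arange(i * 4, (i + 1) * 4) for i in range(5)]
# total objective (sketch): mean squared error + lam * group_lasso_penalty(w, groups)
```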
Another example of Lasso-based regularization that encodes domain knowledge can be found in the problem of discovering genetic markers for diseases. In this problem, data-driven approaches such as elastic nets are traditionally used to determine the relative importance of genetic markers in the context of a disease. However, geneticists understand that the relevant markers typically are located in close proximity on the genome sequence due to a property called linkage disequilibrium, which suggests that genetic information that is closely located travels together between generations of the population. This domain knowledge can be incorporated as a regularizer to ensure that the discovered genetic markers are typically located in close proximity on the genome. In fact, Liu and colleagues [28] introduced a smoothed minimax concave penalty to Lasso that captured squared differences in regression coefficients between adjacent markers, to ensure that the difference in genetic effects between adjacent markers is small.
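In its simplest form, such a penalty just sums the squared differences of adjacent coefficients, as below; this is a simplified stand-in for the smoothed minimax concave penalty of [28].

```python
import numpy as np

def adjacency_smoothness_penalty(w):
    """Squared differences between coefficients of adjacent genetic markers,
    encoding linkage disequilibrium: nearby markers should have similar
    effects on the disease outcome."""
    return np.sum(np.diff(w) ** 2)
```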
Domain knowledge can also be used to guide the regularization of a multi-task learning (MTL) model, as explored for the problem of forest cover estimation in [67]. In the presence of heterogeneity in data sub-populations, different groups of instances in the data show different relationships between the inputs and outputs. For example, different types of vegetation (e.g., forests, farms, and shrublands) may show varying responses to a target variable in remote sensing signals. MTL provides a promising solution to handle sub-population heterogeneity in such cases, by treating the learning at every sub-population as a different task. Further, by sharing the learning across related tasks, MTL enforces a robust regularization on the learning across all tasks, even in the scarcity of training data.

However, most MTL formulations require explicit knowledge of the composition of every task and the similarity structure among the tasks, which is not always known in practical applications. For example, the exact number and distribution of vegetation types is often unavailable, and when they are known, they are available at varying granularities [59]. In recent work [67], the presence of heterogeneity due to varying vegetation types was first inferred by clustering vegetation time series, which was then used to induce similarity in the model parameters at related vegetation types. This resulted in an MTL formulation where the task structure was inferred using contextual variables obtained using domain knowledge.
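As an illustration, here is a minimal sketch of MTL with a graph-based similarity regularizer, in which per-task linear models are pulled toward each other in proportion to an inferred task-similarity matrix S. This is a generic sketch, not the formulation of [67]; the function names and data layout are hypothetical.

import numpy as np

def mtl_with_task_similarity(tasks, S, lam=0.5, lr=1e-2, n_iter=2000):
    """Sketch: multi-task learning with similarity-based regularization.

    tasks : list of (X_t, y_t) pairs, one per sub-population (task).
    S     : (T, T) nonnegative, symmetric task-similarity matrix, e.g.,
            inferred by clustering vegetation time series (contextual
            variables obtained from domain knowledge).
    Objective: sum_t ||y_t - X_t w_t||^2 / n_t
               + lam * sum_{t,u} S[t, u] * ||w_t - w_u||^2.
    """
    T, p = len(tasks), tasks[0][0].shape[1]
    W = np.zeros((T, p))
    for _ in range(n_iter):
        for t, (X, y) in enumerate(tasks):
            grad = -2 * X.T @ (y - X @ W[t]) / len(y)
            # Coupling term: pull w_t toward the weights of similar
            # tasks (factor 4 because S is symmetric and each pair
            # appears twice in the double sum).
            grad += 4 * lam * np.sum(S[t][:, None] * (W[t] - W), axis=0)
            W[t] -= lr * grad
    return W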
5 THEORY-GUIDED REFINEMENT OF DATA SCIENCE OUTPUTS

Domain knowledge can also be used to refine the outputs of data science models so that they are in compliance with our current understanding of physical phenomena. This style of TGDS leverages scientific knowledge at the final stage of model building, where the outputs of any data science model are refined using domain knowledge that is either explicitly known (e.g., in the form of closed-form equations or model simulations) or implicitly available (e.g., in the form of latent constraints).

5.1 Using Explicit Domain Knowledge

Data science outputs are often refined to reduce the effect of noise and missing values and thus improve the overall quality of the results. For example, in the analysis of spatio-temporal data, there is a vast body of literature on refining model outputs to enforce spatial coherence and temporal smoothness among predictions. Data science outputs can also be refined to improve a quality measure, e.g., in the discovery of frequent itemsets by pruning candidate patterns. Building on these methods, a promising direction is to develop model refinement approaches that make ample use of domain knowledge, encoded in the form of scientific theories, for producing physically consistent results.

An example of theory-guided refinement of data science outputs can be found in the problem of material discovery, where the objective is to find novel materials and crystal structures that show a desirable property, e.g., their ability to filter gases or to serve as a catalyst. Traditional approaches for predicting crystal structure and properties rely on ab initio calculations such as density functional theory methods. However, since the space of all possible materials is extremely large, it is impractical to perform computationally expensive ab initio calculations on every material to estimate their structure and properties. Recently, a number of teams in material science have explored the use of probabilistic graphical models for predicting the structure and properties of a material, given a training database of materials with known structure and properties [22], [23], [24]. This provided a computationally efficient approach to reduce the space of candidate materials that show a desirable property, using the knowledge contained in the training data. The results of the data science models were then cross-checked using expensive ab initio calculations to further refine the model outputs. This line of research has resulted in the discovery of a hundred new ternary oxide compounds that were previously unknown using traditional approaches [22], highlighting the effectiveness of TGDS in advancing scientific knowledge.

5.2 Using Implicit Domain Knowledge

In scientific applications, the domain structure among the output variables may not always be known in the form of explicit equations that can be easily integrated in existing model refinement frameworks. This requires jointly solving the dual problem of inferring the domain constraints and using the learned constraints to refine model outputs. We illustrate this using an example in mapping surface water dynamics, where implicit constraints among locations (based on a hidden elevation ordering) are estimated and leveraged for refining classification maps of water bodies.

Example 4 (Post-processing using elevation constraints). As described in Example 3, it is difficult to map the dynamics of surface water bodies by solely using the knowledge contained in remote sensing data, and there is promise in using information about the elevation structure of water bodies to assist classification models. However, such information is seldom available at the desired granularity for most water bodies around the world. Hence, there is a need to infer the latent ordering among the locations (based on their elevation) so that they can be used to produce accurate and physically consistent labels. One way to achieve this is by using the history of imperfect water/land labels produced by a data science model at every location over a long period of time. In particular, a location that has been classified as water for a longer number of time-steps has a higher likelihood of being at a deeper location than a location that has been classified as water less frequently. This implicit elevation ordering, if extracted effectively, can help in improving the classification maps by post-processing the outputs to be consistent with the elevation ordering. Further, the post-processed labels can help in obtaining a better estimate of the elevation ordering, thus resulting in an iterative solution that simultaneously infers the elevation ordering and produces physically consistent classification maps. This approach was successfully used in [29], [30] to build global maps of surface water dynamics. Fig. 4 illustrates the effectiveness of this approach using an example lake in Africa, where the post-processed classification map does not suffer from the errors of the initial classification map and visually matches well with the remote sensing image of the water body.

Fig. 4. Mapping the extent of Lake Abhe (on the border of Ethiopia and Djibouti in Africa) using implicit theory-guided constraints. (a) Remote sensing image of the water body (prepared using multi-spectral false color composites). (b) Initial classification maps. (c) Elevation contours inferred from the history of classification labels. (d) Final classification maps refined using elevation-based constraints.
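The following is a minimal sketch of the iterative idea in Example 4, a simplification rather than the exact algorithm of [29], [30]: locations are ranked by how often they are labeled water, and the labels at each time-step are then made consistent with that ranking. The array names and the prefix-cut rule are illustrative assumptions.

import numpy as np

def refine_water_labels(labels, n_iter=5):
    """Sketch: jointly infer an elevation ordering and refine labels.

    labels : (n_locations, n_timesteps) binary array; 1 = water, 0 = land.
    Physical constraint: if a location is water at time t, every location
    deeper than it (lower elevation) should also be water at time t.
    """
    labels = labels.copy()
    for _ in range(n_iter):
        # Locations labeled water more often are likely deeper.
        depth_rank = np.argsort(-labels.sum(axis=1))   # deepest first
        ordered = labels[depth_rank]
        n = ordered.shape[0]
        for t in range(ordered.shape[1]):
            col = ordered[:, t]
            # Physically consistent labels form a prefix of the depth
            # ordering; pick the cut that agrees with most raw labels.
            cum_water = np.cumsum(col)
            agree = cum_water + (n - np.arange(1, n + 1)) \
                              - (cum_water[-1] - cum_water)
            k = int(np.argmax(agree)) + 1
            ordered[:, t] = 0
            ordered[:k, t] = 1
        labels[depth_rank] = ordered   # write refined labels back
    return labels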
Other examples of the use of implicit constraints include mapping urbanization [68] and tree plantation conversions [69], [70], where hidden Markov models were used to incorporate domain knowledge about the transitions among land covers.
6 LEARNING HYBRID MODELS OF THEORY AND DATA SCIENCE

One way to combine the strengths of scientific knowledge and data science is by creating hybrid combinations of theory-based and data science models, where some aspects of the problem are handled by theory-based components while the remaining ones are modeled using data science components. There are several ways of fusing theory-based and data science models to create hybrid TGDS models. One way is to build a two-component model where the outputs of the theory-based component are used as inputs in the data science component. This idea is used in climate science for statistical downscaling of climate variables [71], where the climate model simulations, available at coarse spatial and temporal resolutions, are used as inputs in a statistical model to predict the climate variables at finer resolutions. Theory-based model outputs can also be used to supervise the training of data science models, by providing physically consistent estimates of the target variable for every training instance.
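For concreteness, a minimal sketch of such a two-component hybrid follows, with a coarse-resolution simulation output feeding a statistical downscaling model. The variables, the synthetic data, and the choice of a random forest regressor are hypothetical illustrations, not a prescription from [71].

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical setup: coarse climate-model output (the theory-based
# component) plus a local covariate predict a fine-resolution variable.
rng = np.random.default_rng(0)
n = 1000
coarse_precip = rng.gamma(2.0, 2.0, size=n)    # simulated coarse output
elevation = rng.uniform(0, 3000, size=n)       # local covariate
fine_precip = 0.8 * coarse_precip + 0.001 * elevation \
              + rng.normal(scale=0.5, size=n)  # observed fine-scale target

X = np.column_stack([coarse_precip, elevation])
downscaler = RandomForestRegressor(n_estimators=100, random_state=0)
downscaler.fit(X, fine_precip)                 # data science component

# At prediction time, the theory-based simulation drives the hybrid.
y_hat = downscaler.predict(X[:5])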
An alternate way of creating a hybrid TGDS model is to use data science methods to predict intermediate quantities in theory-based models that are currently being missed or inaccurately estimated. By feeding data science outputs into theory-based models, such a hybrid model can not only show better predictive performance but also amend the deficiencies in existing theory-based models. Further, the outputs of theory-based models may also be used as training samples in data science components [72], thus creating a two-way synergy between them. Depending on the nature of the model and the requirements of the application, there can be multiple ways of introducing data science outputs in theory-based models. In the following, we provide an illustrative example of this theme of TGDS research in the field of turbulence modeling.

Example 5 (Turbulence Modeling). One of the important problems in aerospace engineering is to model the characteristics of turbulent flow, which consists of chaotic changes in the flow velocity and complex dissipation of momentum and energy. Turbulence modeling is used in a number of applications such as the design and reliability assessment of airfoils in aeroplanes and space vehicles. Key to the study of fluid dynamics are the Navier-Stokes equations, which describe the behavior of viscous fluids under motion. Although the Navier-Stokes equations can be readily applied in simple flow problems involving incompressible and irrotational flow, obtaining an exact representation of turbulent flow requires computationally expensive solutions such as direct numerical simulations (DNS) on fine spatial grids. The high computational costs of DNS make it infeasible for studying practical turbulence problems in the industry, which are typically solved using inexact but computationally cheap approximations. One such approximation is given by the Reynolds-averaged Navier-Stokes (RANS) equations, which introduce a term called the Reynolds stress, $\tau$, to represent the apparent stress due to fluctuations caused by turbulence. Since the exact form of the Reynolds stress is unknown, different approximations of $\tau$ have been explored in previous studies, resulting in a variety of RANS models. Despite the continued efforts in approximating $\tau$, current RANS models are still insufficient for modeling complex flows with separation, curvature, or swirling. To overcome their limitations, recent work by Wang et al. [21] explored the use of machine learning methods to assist RANS models and reduce their discrepancies. In particular, the Reynolds stress was approximated as

$$\tau = \tau_{RANS} + \Delta\tau_{ML}, \qquad (6)$$

where $\tau_{RANS}$ is obtained from a RANS model while $\Delta\tau_{ML}$ is the model discrepancy that is estimated using a random forest model. Although this approach can be used with any generic RANS model to estimate its discrepancy, it does not alter the form of approximation used in obtaining $\tau_{RANS}$, since $\Delta\tau_{ML}$ is learned independently of $\tau_{RANS}$. In another work by Singh et al. [20], a machine learning component was used to directly augment a RANS approximation in the following manner:

$$\tau_{ij} = 2\rho\nu S_{ij} - \frac{2}{3}\rho K \delta_{ij}, \qquad (7)$$

$$\frac{D\nu}{Dt} = \beta\, P - D + T, \qquad (8)$$

where Equation (7) is the standard Boussinesq equation relating the Reynolds stress $\tau_{ij}$ to the effective viscosity $\nu$, and Equation (8) is a variant of the Spalart-Allmaras model that estimates $\nu$ as a function of a machine learning term, $\beta$ (learned using an artificial neural network), and other physical terms, $P$, $D$, and $T$, corresponding to production, destruction, and transport processes, respectively. This class of modeling framework, which integrates machine learning terms in theory-based models, has been called field inversion and machine learning (FIML) [73].

Both these works illustrate the potential of coupling data science outputs with theory-based models to reduce model discrepancies in complex scientific applications. The exact choice of the data science model and its contribution to the theory-based model can be explored in future investigations. Similar lines of TGDS research can be explored in other domains where current theory-based models are lacking, e.g., hydrological models for studying subsurface flow [36].
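As a concrete illustration of the discrepancy-learning idea in Equation (6), the following is a minimal sketch, not the implementation of [21], in which a random forest learns $\Delta\tau_{ML}$ from mean-flow features, using high-fidelity (e.g., DNS) stresses as training targets. The feature set and the synthetic data are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data at n mesh points:
#   features : mean-flow quantities (e.g., strain rate, pressure
#              gradient) extracted from a baseline RANS solution
#   tau_rans : Reynolds stress component predicted by the RANS model
#   tau_dns  : high-fidelity "truth" from DNS on the same points
rng = np.random.default_rng(1)
n = 2000
features = rng.normal(size=(n, 4))
tau_rans = features @ np.array([0.5, -0.2, 0.1, 0.0])
tau_dns = tau_rans + 0.3 * np.tanh(features[:, 0])  # structured discrepancy

# Learn the discrepancy Delta-tau = tau_dns - tau_rans, as in Eq. (6).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features, tau_dns - tau_rans)

# Hybrid prediction: RANS output plus the learned ML correction.
tau_hybrid = tau_rans + model.predict(features)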
considerable attention in the machine learning community
7 AUGMENTING THEORY-BASED MODELS USING in the context of multi-armed bandit problems [76], [77],
DATA SCIENCE [78]. The basic objective in these problems is to incremen-
tally select parameter values so that we can explore the space
There are many ways we can use data science methods to of parameter choices and exploit the parameter choice that
improve the effectiveness of theory-based models. Data can provides the maximum reward, using a limited number of
be assimilated in theory-based models for improved selec- observations. Variants of these techniques have also been
tion of model states in numerical models. Data science explored for settings where the parameters take continuous
methods can also help in calibrating the parameters of values instead of discrete steps [79], [80]. These techniques
theory-based models so that they provide a better realiza- provide a promising direction for calibrating the high-
tion of the physical system. We describe both these dimensional parameters of theory-based models.
approaches in the following.

7.1 Data Assimilation in Theory-Based Models

One of the long-standing approaches of the scientific community for integrating data in theory-based models is to use data assimilation, which has been widely used in climate science and hydrology [74]. These domains typically involve dynamical systems, such as the progression of climate phenomena over time, which can be represented as a sequence of physical states in numerical models. Data assimilation is a way to infer the most likely sequence of states such that the model outputs are in agreement with the observations available at every time-step. In data assimilation, the values of the current state are constrained to depend on previous state values as well as the current data observations. For example, if we use the Gaussian distribution to model the linear transition between consecutive states, this translates to a Kalman filter. However, in general, the dependencies among the states in data assimilation methods are modeled using more complex forms of distributions that are governed by physical laws and equations. Data assimilation provides a promising step in the direction of integrating data with theory-based models so that the knowledge discovery approach relies both on scientific knowledge and observational data.
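To ground the Kalman-filter special case mentioned above, here is a minimal sketch of the standard predict/update recursion for a linear-Gaussian state-space model. The transition matrix A, observation matrix H, and noise covariances Q and R are hypothetical placeholders for the physics of a real system.

import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/update cycle of a Kalman filter.

    x, P : current state estimate and its covariance
    z    : new observation at this time-step
    A    : state-transition model (encodes the system dynamics)
    H    : observation model mapping states to measurements
    Q, R : process and observation noise covariances
    """
    # Predict: propagate the state with the (theory-based) dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the observation.
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new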
7.2 Calibrating Theory-Based Models Using Data

Theory-based models often involve a large number of parameters in their equations that need to be calibrated in order to provide an accurate representation of the physical system. A naïve approach for model calibration is to try out every combination of parameter values, perhaps by searching over a discrete grid defined over the parameters, and choose the combination that produces the maximum likelihood for the data. However, this approach is practically infeasible when the number of parameters is large and every parameter takes many possible values. A number of computationally efficient approaches have been explored in different disciplines for parsimoniously calibrating model parameters with the help of observational data. For example, a seminal work on model calibration in the field of hydrology is the Generalized Likelihood Uncertainty Estimation (GLUE) technique [75]. This approach models the uncertainty associated with every parameter combination using Monte Carlo approaches, and uses a Bayesian formulation to incrementally update the uncertainties as new observations are made available. At any given iteration, the parameter combination that shows maximum agreement with the observations is employed in the model, the results of which are used to update the uncertainties on the next iteration.
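The following is a minimal sketch of this GLUE-style idea, a simplification rather than the full method of [75]: candidate parameter sets are sampled by Monte Carlo, weighted by a likelihood measure of agreement with observations, and re-weighted as new observations arrive. The toy model, the Gaussian likelihood measure, and all names are hypothetical.

import numpy as np

def glue_calibrate(simulate, prior_samples, obs_stream, sigma=1.0):
    """Sketch of GLUE-style Monte Carlo calibration.

    simulate      : function(theta, t) -> model prediction at time t
    prior_samples : (n_candidates, n_params) Monte Carlo parameter draws
    obs_stream    : iterable of observations arriving over time
    Returns the running weights and the best parameter combination.
    """
    weights = np.ones(len(prior_samples))
    for t, z in enumerate(obs_stream):
        preds = np.array([simulate(theta, t) for theta in prior_samples])
        # Likelihood measure of agreement with the new observation.
        lik = np.exp(-0.5 * ((preds - z) / sigma) ** 2)
        weights *= lik                  # incremental Bayesian-style update
        weights /= weights.sum()
    best = prior_samples[np.argmax(weights)]
    return weights, best

# Hypothetical usage: calibrate one parameter of a toy periodic model.
rng = np.random.default_rng(2)
simulate = lambda theta, t: theta[0] * np.sin(0.3 * t)
candidates = rng.uniform(0.0, 2.0, size=(500, 1))
observations = [1.2 * np.sin(0.3 * t) + rng.normal(0, 0.1)
                for t in range(30)]
weights, best = glue_calibrate(simulate, candidates, observations, sigma=0.5)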
[14] P. M. Caldwell, C. S. Bretherton, M. D. Zelinka, S. A. Klein,
[81], [82] for finding ocean eddies from satellite data. The B. D. Santer, and B. M. Sanderson, “Statistical significance of cli-
need to explore TGDS models for uncertainty quantification mate sensitivity predictors obtained by data mining,” Geophysical
is discussed in [33] in the context of understanding and pro- Res. Lett., vol. 41, no. 5, pp. 1803–1808, 2014.
[15] D. Lazer, R. Kennedy, G. King, and A. Vespignani, “The parable
jecting climate extremes. Scientific knowledge can also be of Google flu: Traps in big data analysis,” Science, vol. 343,
used to advance other aspects of data science, e.g., the no. 6176, pp. 1203–1205, Mar. 2014. [Online]. Available: http://
design of scientific work-flows [83], [84] or the generation of www.ncbi.nlm.nih.gov/pubmed/24626916
model simulations [85]. [16] G. Marcus and E. Davis, “Eight (no, nine!) problems with big
data,” New York Times, vol. 6, no. 4, 2014, Art. no. 2014.
We hope that this paper serves as a first step in building [17] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolin-
the foundations of TGDS and encourages follow-on work to ski, and L. Brilliant, “Detecting influenza epidemics using search
develop in-depth theoretical formalizations of this para- engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.
[18] J. Kawale, et al., “A graph-based approach to find teleconnections
digm. While success in this endeavor will need significant in climate data,” Statist. Anal. Data Mining, vol. 6, no. 3, pp. 158–
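As a simple illustration of this explore/exploit trade-off, here is a minimal epsilon-greedy bandit sketch for choosing among discrete parameter settings of a model; the reward function (e.g., agreement of model output with observations) is a hypothetical stand-in, and this is only one of many bandit strategies discussed in [76], [77], [78].

import numpy as np

def epsilon_greedy_calibration(reward_fn, n_arms, n_rounds=200, eps=0.1,
                               seed=3):
    """Sketch: bandit-style selection among discrete parameter settings.

    reward_fn : function(arm) -> noisy reward, e.g., negative error of
                the theory-based model run with parameter setting `arm`
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for _ in range(n_rounds):
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))  # explore a random setting
        else:
            arm = int(np.argmax(means))      # exploit the best so far
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running average
    return int(np.argmax(means))

# Hypothetical usage: arm 2 is the best-calibrated parameter setting.
true_quality = np.array([0.2, 0.5, 0.9, 0.4])
rng = np.random.default_rng(4)
reward = lambda a: true_quality[a] + rng.normal(0, 0.1)
best_arm = epsilon_greedy_calibration(reward, n_arms=4)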
8 CONCLUSION

In this paper, we formally conceptualized the paradigm of theory-guided data science that seeks to exploit the promise of data science without ignoring the treasure of knowledge accumulated in scientific principles. We provided a taxonomy of ways in which scientific knowledge and data science can be brought together in any application with some availability of domain knowledge. These approaches range from methods that strictly enforce physical consistency in data science models (e.g., while designing model architecture or specifying theory-based constraints) to methods that allow a relaxed usage of scientific knowledge where our scientific understanding is weak (e.g., as priors or regularization terms). We presented examples from diverse disciplines to illustrate the various research themes of TGDS and also discussed several avenues of novel research in this rapidly emerging field.

One of the central motivations behind TGDS is to ensure better generalizability of models (even when the problem is complex and data samples are under-representative) by anchoring data science algorithms with scientific knowledge. TGDS also aims at advancing our knowledge of the physical world by producing scientifically interpretable models. Reducing the search space of the learning algorithm to physically consistent models may also have an additional benefit of reducing the computational cost of the algorithm.

The TGDS research themes are not exhaustive and we anticipate the development of novel TGDS themes in the future that explore innovative ways of blending scientific theory with data science. While most of the discussion in this paper focuses on supervised learning problems, similar TGDS research themes can be explored for other traditional tasks of data mining, machine learning, and statistics. For example, the use of physical principles to constrain spatio-temporal pattern mining algorithms has been explored in [81], [82] for finding ocean eddies from satellite data. The need to explore TGDS models for uncertainty quantification is discussed in [33] in the context of understanding and projecting climate extremes. Scientific knowledge can also be used to advance other aspects of data science, e.g., the design of scientific work-flows [83], [84] or the generation of model simulations [85].

We hope that this paper serves as a first step in building the foundations of TGDS and encourages follow-on work to develop in-depth theoretical formalizations of this paradigm. While success in this endeavor will need significant innovations in our ability to handle the diversity of forms in which scientific knowledge is represented and ingested in different disciplines (e.g., differences in granularity and type of information, degree of completeness, and uncertainty in knowledge), the concrete TGDS approaches presented in this paper can be considered as a stepping stone in this ambitious journey. We anticipate the deep integration of theory-based and data science to become a quintessential tool for scientific discovery in future research. The paradigm of TGDS, if effectively utilized, can help us realize the vision of the “fourth paradigm” [86] in its full glory, where data serves an integral role at every step of scientific knowledge discovery.

ACKNOWLEDGMENTS

The ideas in this vision paper were developed while being funded by a US National Science Foundation Expeditions in Computing Grant #1029711.
REFERENCES

[1] G. Bell, T. Hey, and A. Szalay, “Beyond the data deluge,” Science, vol. 323, no. 5919, pp. 1297–1298, 2009.
[2] Economist, “The data deluge,” Special Supplement, 2010. [Online]. Available: http://www.economist.com/node/15579717, Accessed: 13-Jul-2017.
[3] M. James, et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York, NY, USA: The McKinsey Global Inst., 2011.
[4] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intell. Syst., vol. 24, no. 2, pp. 8–12, Mar./Apr. 2009.
[5] B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, “Boosted decision trees as an alternative to artificial neural networks for particle identification,” Nucl. Instruments Methods Physics Res. Section A: Accelerators Spectrometers Detectors Associated Equipment, vol. 543, no. 2, pp. 577–584, 2005.
[6] D. Castelvecchi, “Artificial intelligence called in to tackle LHC data deluge,” Nature, vol. 528, no. 7580, pp. 18–19, 2015.
[7] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach. Cambridge, MA, USA: MIT Press, 2001.
[8] J. H. Faghmous and V. Kumar, “A big data guide to understanding climate change: The case for theory-guided data science,” Big Data, vol. 2, no. 3, pp. 155–163, 2014.
[9] J. H. Faghmous, V. Kumar, and S. Shekhar, “Computing and climate,” Comput. Sci. Eng., vol. 17, no. 6, pp. 6–8, 2015.
[10] D. Graham-Rowe, et al., “Big data: Science in the petabyte era,” Nature, vol. 455, no. 7209, pp. 8–9, 2008.
[11] T. Jonathan, et al., “Special issue: Dealing with data,” Science, vol. 331, no. 6018, pp. 639–806, 2011.
[12] T. J. Sejnowski, P. S. Churchland, and J. A. Movshon, “Putting big data to good use in neuroscience,” Nature Neuroscience, vol. 17, no. 11, pp. 1440–1441, 2014.
[13] B. Sterling, “The end of theory: The data deluge makes the scientific method obsolete,” Wired Mag., 2008. [Online]. Available: https://www.wired.com/2008/06/the-end-of-theo/, Accessed: 13-Jul-2017.
[14] P. M. Caldwell, C. S. Bretherton, M. D. Zelinka, S. A. Klein, B. D. Santer, and B. M. Sanderson, “Statistical significance of climate sensitivity predictors obtained by data mining,” Geophysical Res. Lett., vol. 41, no. 5, pp. 1803–1808, 2014.
[15] D. Lazer, R. Kennedy, G. King, and A. Vespignani, “The parable of Google flu: Traps in big data analysis,” Science, vol. 343, no. 6176, pp. 1203–1205, Mar. 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/24626916
[16] G. Marcus and E. Davis, “Eight (no, nine!) problems with big data,” New York Times, vol. 6, no. 4, 2014, Art. no. 2014.
[17] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.
[18] J. Kawale, et al., “A graph-based approach to find teleconnections in climate data,” Statist. Anal. Data Mining, vol. 6, no. 3, pp. 158–179, 2013.
[19] J. H. Faghmous, I. Frenger, Y. Yao, R. Warmka, A. Lindell, and V. Kumar, “A daily global mesoscale ocean eddy dataset from satellite altimetry,” Sci. Data, vol. 2, 2015, Art. no. 150028.
[20] A. P. Singh, S. Medida, and K. Duraisamy, “Machine learning-augmented predictive modeling of turbulent separated flows over airfoils,” AIAA J., vol. 55, no. 7, pp. 2215–2227, 2017.
[21] J.-X. Wang, J.-L. Wu, and H. Xiao, “Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data,” Physical Review Fluids, 2017.
[22] G. Hautier, C. C. Fischer, A. Jain, T. Mueller, and G. Ceder, “Finding nature’s missing ternary oxide compounds using machine learning and density functional theory,” Chemistry Mater., vol. 22, no. 12, pp. 3762–3767, 2010.
[23] C. C. Fischer, K. J. Tibbetts, D. Morgan, and G. Ceder, “Predicting crystal structure by merging data mining with quantum mechanics,” Nature Mater., vol. 5, no. 8, pp. 641–646, 2006.
[24] S. Curtarolo, G. L. Hart, M. B. Nardelli, N. Mingo, S. Sanvito, and O. Levy, “The high-throughput highway to computational materials design,” Nature Mater., vol. 12, no. 3, pp. 191–201, 2013.
[25] L. Li, et al., “Understanding machine-learned density functionals,” Int. J. Quantum Chemistry, vol. 116, pp. 819–833, 2016.
[26] K. C. Wong, L. Wang, and P. Shi, “Active model with orthotropic hyperelastic material for cardiac image analysis,” in Functional Imaging and Modeling of the Heart. Berlin, Germany: Springer, 2009, pp. 229–238.
[27] J. Xu, J. L. Sapp, A. R. Dehaghani, F. Gao, M. Horacek, and L. Wang, “Robust transmural electrophysiological imaging: Integrating sparse and dynamic physiological models into ECG-based inference,” in Medical Image Computing and Computer-Assisted Intervention. Berlin, Germany: Springer, 2015, pp. 519–527.
[28] J. Liu, K. Wang, S. Ma, and J. Huang, “Accounting for linkage disequilibrium in genome-wide association studies: A penalized regression method,” Statist. Interface, vol. 6, no. 1, 2013, Art. no. 99.
[29] A. Khandelwal, V. Mithal, and V. Kumar, “Post classification label refinement using implicit ordering constraint among data instances,” in Proc. IEEE Int. Conf. Data Mining, 2015, pp. 799–804.
[30] A. Khandelwal, A. Karpatne, M. Marlier, J. Kim, D. Lettenmaier, and V. Kumar, “An approach for global monitoring of surface water extent variations using MODIS data,” Remote Sensing Environment (Rev.), 2017, http://dx.doi.org/10.1016/j.rse.2017.05.039
[31] N. Wagner and J. M. Rondinelli, “Theory-guided machine learning in materials science,” Frontiers Mater., vol. 3, 2016, Art. no. 28.
[32] J. Faghmous, et al., “Theory-guided data science for climate change,” Computer, vol. 47, no. 11, pp. 74–78, Nov. 2014.
[33] A. R. Ganguly, et al., “Toward enhanced understanding and projections of climate extremes using physics-guided data mining techniques,” Nonlinear Processes Geophysics, vol. 21, no. 4, pp. 777–795, 2014.
[34] Physics Informed Machine Learning Conference, Santa Fe, New Mexico, 2016. [Online]. Available: http://www.cvent.com/events/physics-informed-machine-learning/event-summary-7cd2f46ebc144bdeb6e5f4106887ea04.aspx, Accessed: 13-Jul-2017.
[35] Physical Analytics, IBM Research. [Online]. Available: http://researcher.watson.ibm.com/researcher/view_group.php?id=6566, Accessed on: Oct. 20, 2016.
[36] M. Ghasemizade and M. Schirmer, “Subsurface flow contribution in the hydrological cycle: Lessons learned and challenges ahead - a review,” Environ. Earth Sci., vol. 69, no. 2, pp. 707–718, 2013.
[37] C. Paniconi and M. Putti, “Physically based modeling in catchment hydrology at 50: Survey and outlook,” Water Resources Res., vol. 51, no. 9, pp. 7090–7129, 2015.
[38] M. F. Bierkens, “Global hydrology 2015: State, trends, and directions,” Water Resources Res., vol. 51, no. 7, pp. 4923–4947, 2015.
[39] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. Berlin, Germany: Springer, 2001.
[40] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Reading, MA, USA: Addison-Wesley, 2005.
[41] V. N. Vapnik and V. Vapnik, Statistical Learning Theory, vol. 1. New York, NY, USA: Wiley, 1998.
[42] M. D. Twa, S. Parthasarathy, C. Roberts, A. M. Mahmoud, T. W. Raasch, and M. A. Bullimore, “Automated decision tree classification of corneal shape,” Optometry Vis. Sci.: Official Publication Amer. Academy Optometry, vol. 82, no. 12, 2005, Art. no. 1038.
[43] J. Z. Leibo, Q. Liao, F. Anselmi, W. A. Freiwald, and T. Poggio, “View-tolerant face recognition and hebbian learning imply mirror-symmetric neural tuning to head orientation,” Current Biol., vol. 27, pp. 62–67, 2017.
[44] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur, “Recurrent neural network based language model,” in Proc. 11th Annu. Conf. Int. Speech Commun. Assoc., 2010, vol. 2, Art. no. 3.
[45] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2014, pp. 338–342.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[47] F. Schrodt, et al., “BHPMF–a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography,” Global Ecology Biogeography, vol. 24, no. 12, pp. 1510–1521, 2015.
[48] J. Kattge, et al., “Try–a global database of plant traits,” Global Change Biology, vol. 17, no. 9, pp. 2905–2935, 2011.
[49] P. Melville and V. Sindhwani, “Recommender systems,” in Encyclopedia of Machine Learning. Berlin, Germany: Springer, 2011, pp. 829–838.
[50] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[51] H. Denli, H. Denli, and N. Subrahmanya, “Multi-scale graphical models for spatio-temporal processes,” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 316–324.
[52] J.-F. Boulicaut and B. Jeudy, “Constraint-based data mining,” in Data Mining and Knowledge Discovery Handbook. Berlin, Germany: Springer, 2005, pp. 399–416.
[53] J. Pei and J. Han, “Constrained frequent pattern mining: A pattern-growth view,” ACM SIGKDD Explorations Newslett., vol. 4, no. 1, pp. 31–39, 2002.
[54] S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL, USA: CRC Press, 2008.
[55] A. J. Majda and J. Harlim, “Physics constrained nonlinear regression models for time series,” Nonlinearity, vol. 26, no. 1, 2012, Art. no. 201.
[56] A. J. Majda and Y. Yuan, “Fundamental limitations of ad hoc linear and quadratic multi-level regression models for physical systems,” Discrete Continuous Dynamical Syst. B, vol. 17, no. 4, pp. 1333–1363, 2012.
[57] P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Phys. Rev., vol. 136, no. 3B, 1964, Art. no. B864.
[58] A. Karpatne, A. Khandelwal, X. Chen, V. Mithal, J. Faghmous, and V. Kumar, “Global monitoring of inland water dynamics: State-of-the-art, challenges, and opportunities,” in Computational Sustainability. Berlin, Germany: Springer, 2016, pp. 121–147.
[59] A. Karpatne, Z. Jiang, R. R. Vatsavai, S. Shekhar, and V. Kumar, “Monitoring land-cover changes: A machine-learning perspective,” IEEE Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 8–21, Jun. 2016.
[60] G. M. James and P. Radchenko, “A generalized dantzig selector with shrinkage tuning,” Biometrika, vol. 96, no. 2, pp. 323–337, 2009.
[61] S. Chatterjee, S. Chen, and A. Banerjee, “Generalized dantzig selector: Application to the k-support norm,” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 1934–1942.
[62] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Roy. Statist. Soc.: Series B (Statist. Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[63] L. Jacob, G. Obozinski, and J.-P. Vert, “Group Lasso with overlap and graph Lasso,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 433–440.
[64] S. Kim and E. P. Xing, “Tree-guided group Lasso for multi-task regression with structured sparsity,” in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 543–550.
[65] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, “A sparse-group lasso,” J. Comput. Graphical Statist., no. 2, pp. 231–245, 2013.
[66] S. Chatterjee, K. Steinhaeuser, A. Banerjee, S. Chatterjee, and A. R. Ganguly, “Sparse group lasso: Consistency and climate applications,” in Proc. SIAM Int. Conf. Data Mining, 2012, pp. 47–58.
[67] A. Karpatne, A. Khandelwal, S. Boriah, and V. Kumar, “Predictive learning in the presence of heterogeneity and limited training data,” in Proc. SIAM Int. Conf. Data Mining, 2014, pp. 253–261.
[68] V. Mithal, A. Khandelwal, S. Boriah, K. Steinhaeuser, and V. Kumar, “Change detection from temporal sequences of class labels: Application to land cover change mapping,” in Proc. SIAM Int. Conf. Data Mining, 2013, pp. 2–4.
[69] X. Jia, et al., “Automated plantation mapping in southeast asia using remote sensing data,” Dept. Comput. Sci., Univ. Minnesota, Twin Cities, Minneapolis, MN, USA, Tech. Rep. 16–029, 2016.
[70] X. Jia, et al., “Predict land covers with transition modeling and incremental learning,” in Proc. SIAM Int. Conf. Data Mining, 2017, pp. 171–179.
[71] R. L. Wilby, et al., “Statistical downscaling of general circulation model output: A comparison of methods,” Water Resources Res., vol. 34, no. 11, pp. 2995–3008, 1998.
[72] P. Sadowski, D. Fooshee, N. Subrahmanya, and P. Baldi, “Synergies between quantum mechanics and machine learning in reaction prediction,” J. Chemical Inf. Model., vol. 56, no. 11, pp. 2125–2128, 2016.
[73] E. J. Parish and K. Duraisamy, “A paradigm for data-driven predictive modeling using field inversion and machine learning,” J. Comput. Physics, vol. 305, pp. 758–774, 2016.
[74] G. Evensen, Data Assimilation: The Ensemble Kalman Filter. Berlin, Germany: Springer, 2009.
[75] K. Beven and A. Binley, “The future of distributed models: Model calibration and uncertainty prediction,” Hydrological Processes, vol. 6, no. 3, pp. 279–298, 1992.
[76] O. Chapelle and L. Li, “An empirical evaluation of thompson sampling,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 2249–2257.
[77] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 661–670.
[78] M. Jordan and T. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260, 2015.
[79] R. Agrawal, “The continuum-armed bandit problem,” SIAM J. Control Optimization, vol. 33, no. 6, pp. 1926–1951, 1995.
[80] R. D. Kleinberg, “Nearly tight bounds for the continuum-armed bandit problem,” in Proc. Advances Neural Inf. Process. Syst., 2004, pp. 697–704.
[81] J. H. Faghmous, et al., “EddyScan: A physically consistent ocean eddy monitoring application,” in Proc. Conf. Intell. Data Understanding, Oct. 2012, pp. 96–103.
[82] J. H. Faghmous, H. Nguyen, M. Le, and V. Kumar, “Spatio-temporal consistency as a means to identify unlabeled objects in a continuous data field,” in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 410–416.
[83] V. G. Honavar, “The promise and potential of big data: A case for discovery informatics,” Rev. Policy Res., vol. 31, no. 4, pp. 326–330, 2014.
[84] Y. Gil, et al., “Examining the challenges of scientific workflows,” IEEE Comput., vol. 40, no. 12, pp. 26–34, Dec. 2007.
[85] M. Paganini, L. d. Oliveira, and B. Nachman, “CaloGAN: Simulating 3D high energy particle showers in multi-layer electromagnetic calorimeters with generative adversarial networks,” arXiv:1705.02355, 2017.
[86] T. Hey, S. Tansley, and K. M. Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, vol. 1. Redmond, WA, USA: Microsoft Res., 2009.
Anuj Karpatne received the BTech and MTech degrees in mathematics & computing from the Indian Institute of Technology (IIT) Delhi. He is working toward the PhD degree in the Department of Computer Science and Engineering (CSE), University of Minnesota (UMN). His work is in the area of spatio-temporal data mining for environmental applications.

Gowtham Atluri received the MTech degree in computer science (CS) from IIT Roorkee and the PhD degree in CS from the University of Minnesota (UMN). He is an assistant professor in the Department of Electrical Engineering and Computer Science (EECS), University of Cincinnati. His research interests include data mining, neuroimaging, and climate science.

James H. Faghmous received the BS degree in computer science from the City College of New York and the MS and PhD degrees from the University of Minnesota (UMN). He is an assistant professor and founding CTO of the Arnhold Global Health Institute, Icahn School of Medicine at Mount Sinai. His research interests include data science, climate science, and global health.

Michael Steinbach received the BS degree in math, the MS degree in computer science and statistics, and the PhD degree in computer science from the University of Minnesota (UMN). He is a research associate in the Department of CSE, UMN. His research interests include data mining, healthcare, bio-informatics, and statistics.

Arindam Banerjee received the BE degree from Jadavpur University, the MTech degree in EE from IIT Kanpur (IITK), and the PhD degree in computer science from the University of Texas Austin. He is an associate professor in the Department of CSE, University of Minnesota (UMN). His research interests include machine learning, data mining, and optimization.

Auroop Ganguly received the PhD degree in civil and environmental engineering (CEE) from the Massachusetts Institute of Technology. He is an associate professor in the Department of CEE, Northeastern University. His research encompasses weather extremes, water sustainability, and the resilience of critical infrastructures.

Shashi Shekhar received the BTech degree in computer science (CS) from IITK, and the MS degree in business administration and the PhD degree in CS from the University of California Berkeley. He is a McKnight Distinguished University professor in the Department of CSE, University of Minnesota (UMN). His research interests include spatial databases, spatial data mining, and geographic and information systems.

Nagiza Samatova received the PhD degree in applied mathematics from the Computational Center of Russian Academy of Sciences, Moscow. She is a professor in the Department of Computer Science, North Carolina State University. Her research interests include theory of computation, data science, and high performance computing.

Vipin Kumar received the BE degree in electronics & communication engineering from IIT Roorkee, the ME degree in electrical engineering from Philips International Institute Eindhoven, and the PhD degree in computer science from the University of Maryland. He is a Regents Professor and William Norris chair in large scale computing in the Department of CSE, UMN. His research interests include data mining, high-performance computing, and their applications in climate/ecosystems and biomedical domains.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.