Data Driven Modelling Some Past Experien
1 | 2008
Dimitri P. Solomatine (corresponding author)
UNESCO-IHE Institute for Water Education,
PO Box 3015, 2601 DA, Delft,
The Netherlands
E-mail: [email protected]

Avi Ostfeld
Faculty of Civil and Environmental Engineering,
Technion-Israel Institute of Technology,
Haifa, 32000,
Israel

ABSTRACT
Physically based (process) models based on mathematical descriptions of water motion are widely used in river basin management. During the last decade the so-called data-driven models are becoming more and more common. These models rely upon the methods of computational intelligence and machine learning, and thus assume the presence of a considerable amount of data describing the modelled system’s physics (i.e. hydraulic and/or hydrologic phenomena). This paper is a preface to the special issue on Data Driven Modelling and Evolutionary Optimization for River Basin Management, and presents a brief overview of the most popular techniques and some of the experiences of the authors in data-driven modelling relevant to river basin management. It also identifies the current trends and common pitfalls, provides some examples of successful applications and mentions the research challenges.
Key words | computational intelligence, data-driven modelling, neural networks, river basin
management, simulation modelling
INTRODUCTION
Modern river basin management is impossible without adequate hydraulic and hydrologic models – used in different tasks from scenario analysis to real-time forecasting (Falconer et al. 2005). Numerous papers have been published on using a physically based approach for modelling behaviour of river basins, as well as the ways to classify them. Traditionally the river basin (watershed) was treated as a lumped, time-invariant, linear, deterministic system, resulting in the unit hydrograph theory (Sherman 1932), upon which the Nash linear cascade of reservoirs model (Nash 1957) and the parallel cascade of reservoirs of Diskin (1964) were built. Other models within this category are attributed to Dooge (1959), Diskin & Boneh (1975), Eagleson et al. (1966) and others. Later models have expanded the lumped linear deterministic approach to a distributed linear cell approach in which the entire river basin is partitioned into a tree-like structure built of cells, with each cell being a sub-watershed (Diskin et al. 1984). Nowadays the models based on partial differential equations incorporate sophisticated solvers and are encapsulated into modelling environments with advanced interfaces and visualisation tools (Abbott 1991; Falconer et al. 2005). They vary in complexity and orientation at different tasks from general river basin planning like RIBASIM (2006) to models able to simulate the entire land phase of the hydrologic cycle like MIKE SHE (2006).

During the last 10–15 years, the advances in ICT brought the new tools enhancing data acquisition, data analysis and visualisation; such advances are often associated with Hydroinformatics. A Geographical Information System (GIS) connected to remote sensing tools stepped in for watershed management, providing numerous tools to support modelling. A GIS-based hydrological model couples the descriptions of hydrological features on a spatial scale with the predictive power of models. Several examples of these can be mentioned: SWAT – a river basin scale model developed to quantify the impact of land management practices in large, complex watersheds
doi: 10.2166/hydro.2008.015
4 D. P. Solomatine and A. Ostfeld | Data-driven modelling Journal of Hydroinformatics | 10.1 | 2008
(SWAT 2006); BASINS – a multipurpose environmental analysis system for performing watershed and water-quality-based studies, and AVGWLF – a spatial distributed watershed model based on GWLF (Haith & Shoemaker 1987) for simulating runoff, sediment and nutrient loadings from a watershed, given variable-size source areas. The relatively inexpensive satellite technology of today permits using various types of remote sensing data, and associated data analysis and pattern recognition techniques, for water resources management. Hydraulic and hydrologic models are being more and more complemented by the data-driven models which are the subject of this paper. Such integrated systems, coupled with GIS and animation tools, being incorporated into social and managerial environments, are often referred to as Hydroinformatics systems, and form powerful data management and modelling instrumentation for water managers and decision-makers.

In order to make a step towards the understanding of data-driven models, it is useful to provide a classification of models for river basin management. The following types of models can be distinguished:

(1) a physically based (process) model based on the description of the behaviour, typically based on the first-order principles from physics, of a phenomenon or system (also called knowledge-driven or simulation models). In river hydraulics, these are the 1D or 2D hydrodynamic models, and in hydrology the lumped conceptual models or distributed physically based models;
(2) an empirical, or data-driven (DD) model involving mathematical equations assessed not from the physical process in the river basin but from analysis of concurrent input and output time series. Typical examples here are the rating curves, unit hydrograph method and various statistical models (linear regression, multi-linear, ARIMA) and methods of machine learning discussed later.

Data-driven modelling (DDM) is based on the analysis of the data characterising the system under study. A model can then be defined on the basis of connections between the system state variables (input, internal and output variables) with only a limited number of assumptions about the “physical” behaviour of the system. The contemporary methods can go much further than the ones used in conventional empirical modelling in hydraulic engineering and hydrology. They allow for solving numerical prediction problems, reconstructing highly nonlinear functions, performing classification, grouping of data and building rule-based systems.

It should be noted that there is still a certain scepticism about DDM among many hydrologists and water resources specialists. They view the induction of models from datasets as a computational exercise, because in their opinion the derivation is not related to physical principles and mathematical reasoning (See et al. 2007). Another issue is the necessity of using sophisticated data-driven models: are they actually needed when traditional statistical models (typically linear regression or ARIMA-class models) are, in many cases, accurate enough? Some of the concerns of this nature are presented, for example, by Gaume & Gosset (2003) and Han et al. (2007). In their excellent recent paper, Abrahart & See (2007) address some of these problems and demonstrate that the existing nonlinear hydrological relationships, which are so important when building flow forecasting models for river basin management, are effectively captured by a neural network, the most widely used DDM method. This discussion about what model is the best may continue for a while, but in our view it is important to stress that there are always situations when one model type cannot be applied or suffers from inadequacies and can be well complemented or replaced by another one.

DDM is a common topic of research in the framework of Hydroinformatics (Abbott 1991), and, subsequently, is an important topic at the International Conferences on Hydroinformatics, European Geosciences Union (sub-division on Hydroinformatics), and at other conferences related to water management. During the last decade the number of researchers active in this area has considerably increased, so did the number of publications, and naturally they have the tendency of clustering in the form of volumes or special issues of the journals. An example is this special issue of the Journal of Hydroinformatics. Other examples include the edited volume to be published by Springer (Abrahart et al. 2008), recent special issues of the Hydrological Sciences Journal (2007), Hydrology and Earth System Sciences (Abrahart et al. 2007b) and the Neural Networks Journal (Cherkassky et al. 2006) where some of
the challenges of DDM that are very relevant for the purpose of this paper are discussed.

This paper presents a general overview and some of the experiences of the authors in data-driven modelling relevant to river basin management as a preface to this special issue on Data Driven Modelling and Evolutionary Optimization for River Basin Management. It also identifies the current trends and common pitfalls, mentions challenges and provides some examples of successful applications.

ESSENCE OF DATA-DRIVEN MODELLING

Definitions

There are a number of (overlapping) areas contributing to DDM: data mining, knowledge discovery in databases, computational intelligence, machine learning, intelligent data analysis, soft computing and pattern recognition. Computational intelligence (CI) incorporates three large areas: neural networks, fuzzy systems and evolutionary computing. Soft computing (SC) is the area that emerged from fuzzy logic, but currently also incorporates many techniques of CI. Machine learning (ML) is an area of computer science that was for a long time considered a sub-area of artificial intelligence (AI) that concentrates on the theoretical foundations of learning from data. Data mining (DM) and knowledge discovery in databases (KDD) use ML methods, are typically focused on very large databases and are associated with applications in banking, financial services and customer resources management.

Data-driven modelling can thus be considered as an approach to modelling that focuses on using the CI (particularly ML) methods in building models that would complement or replace the “knowledge-driven” models describing physical behaviour. DDM uses the methods developed in the fields mentioned above, and the role of a modeller is to tune them to a particular application area. “Modelling” in the name stresses the fact that this activity is close in its objectives to traditional approaches to modelling, and follows the traditionally accepted modelling steps, and that it does not comprise the analysis or mining of data only. Examples of the most common methods used in data-driven modelling of river basin systems are: statistical methods, artificial neural networks and fuzzy rule-based systems.

It is important to note that a component of CI, evolutionary and genetic algorithms (GA), is primarily oriented towards optimisation that can be used in model calibration and model structure optimisation (Savic 2005; Ostfeld & Preis 2005), or in traditional water resources optimization problems like (multi-objective) reservoir optimisation (Kim et al. 2006: in this study a popular multi-objective genetic algorithm (NSGA-II) was used).

The main part of data-driven modelling is, in fact, learning which incorporates the so-far unknown mappings (or dependencies) between a system’s inputs and its outputs from the available data (Mitchell 1997, Figure 1). By data we understand the known samples that are combinations of inputs and corresponding outputs. As such, a dependence (viz. mapping or “model”) is discovered (induced), which can be used to predict (or effectively deduce) the system’s future outputs from the known input values.

Figure 1 | Learning in data-driven modelling.

By data we usually understand a set K of examples (or instances) represented by the duple $\langle x_k, y_k \rangle$, where $k = 1, \dots, K$, vector $x_k = \{x_1, \dots, x_n\}_k$, vector $y_k = \{y_1, \dots, y_m\}_k$, $n$ = number of inputs and $m$ = number of outputs. The process of building a function (or “mapping”, or “model”) $y = f(x)$ is called training. Very often only one output is considered, so $m = 1$.

In the context of river basin modelling the inputs and outputs are typically real numbers ($x_k, y_k \in \mathbb{R}^n$), so the main learning problem solved is numerical prediction (regression). Sometimes the problems of clustering and classification are solved as well (see, for example, Hall & Minns 1999; Hannah et al. 2000; Harris et al. 2000).

The process of building a data-driven model follows general principles adopted in modelling: study the problem – collect data – select model structure – build the model – test the model and (possibly) iterate. In DDM, not only the model parameters, but also the model structure, is often subject to
optimisation. Generally, following the so-called Occam’s razor principle (Mitchell 1997), parsimonious models are valued (as simple as possible, but no simpler). An example of such a parsimonious model could be a linear regression model vs a nonlinear one, or a neural network with a small number of hidden nodes. Such models can be built by the deliberate use of the so-called regularisation when the objective function representing the overall model performance includes not only the model error term but also a term that increases in value with the increase of model complexity represented, e.g. by the number of terms in the equation or the number of hidden nodes in a neural network.

It is worth mentioning that DDM is sometimes used to build models of models (replicating, for example, physically based models such as 1D hydrodynamic models) rather than models of natural systems; such models are often referred to as surrogate, emulation or meta-models (see, e.g., Solomatine & Torres 1996; Khu et al. 2004).

Use of data: methodological issues and trivial pitfalls

There are a couple of methodological issues related to the use of data for building a DDM. They may be considered trivial by experts in machine learning but are not always appreciated by hydraulic engineers or hydrologists building or using such models.

After the model is trained but before it is put into operation, it has to be tested (or verified) by some form of error measurement (e.g. root mean squared error) on the test dataset. To test the model during training yet another data set is needed – the cross-validation set. As a model gradually improves as a result of the training process, the error on the training data will keep decreasing, whereas the cross-validation error will first decrease and then start to increase (the effect of overfitting), so training should be stopped when the error on the cross-validation dataset starts to increase. If these principles are respected, then there is a hope that the model will generalise well, that is, its prediction error on unseen data will be small.

Note that in an important class of machine learning models – support vector machines – a different approach is taken: it is to build the model that would have the best generalisation ability possible without relying explicitly on the cross-validation set (Vapnik 1998).

In connection to the issues covered above, there are two common pitfalls, especially characteristic of DDM applications where time series are involved, that are worth mentioning herein.

(1) The first pitfall relates to the construction of the three mentioned datasets on the basis of available data. The three sets should be statistically similar, i.e. should have similar distributions or, at least, similar ranges, mean and variance. This can be achieved by careful selection of examples for each dataset to ensure such statistical similarity, by random sampling of data from the whole dataset, or by employing an optimisation procedure resulting in sets with predefined properties (Bowden et al. 2002). A popular approach leading to approximate statistical similarity of training and cross-validation sets is to use the ten-fold cross-validation method, in which a model is built ten times, trained each time on 9/10th of the whole set of available data and validated on the remaining 1/10th (the number of runs is not necessarily ten). An extreme version of this method is the “leave-one-out” method, in which K models are built, each using K-1 examples and leaving out one (a different one every time). The resulting model is either one of the models trained, or an ensemble of all the models built, possibly with the weighted outputs. Note that for generation of the statistically similar training data sets for building a series of similar but different models, one should typically rely on the well-developed statistical resampling methods like the bootstrap originated by B. Efron in the 1970s (see Efron & Tibshirani 1993) where (in its basic form) K data are randomly selected, with replacement, from the K original data. The problem is that, if one of these procedures is followed, the data will not always be contiguous, so that, for example, it would not be possible to visualise a hydrograph when the model is fed with the test set. There is nothing wrong with such a model if the “time structure” of all the datasets is preserved. Such models, however, are reluctantly accepted by practitioners since they are so different from the traditional physically based models that always generate contiguous time series. A solution in such a situation is to group the data into hydrological events (i.e. contiguous blocks of data) and to try to ensure the presence of similar events in all the three datasets.
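The resampling schemes mentioned under pitfall (1) can be illustrated with a short sketch. The Python code below is only an illustration under stated assumptions, not the procedure used in any of the cited studies: the function names (`k_fold_indices`, `bootstrap_indices`, `split_by_events`), the 60/20/20 split fractions and the representation of hydrological events by integer labels are all hypothetical choices. Note also that the event-based split here assigns whole events at random; it keeps each event contiguous but does not by itself guarantee that similar events end up in all three sets, which would still have to be checked by the modeller.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, validation) index lists for n_folds-fold cross-validation.

    With n_folds == n_samples this reduces to the leave-one-out method.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield sorted(train), sorted(val)


def bootstrap_indices(n_samples, seed=0):
    """Basic bootstrap: draw n_samples indices with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def split_by_events(event_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole events (contiguous blocks) to train / cv / test sets.

    event_ids[i] is the (assumed) label of the event that sample i belongs to.
    Only the first two fractions are used; the test set receives the remainder.
    """
    events = sorted(set(event_ids))
    random.Random(seed).shuffle(events)
    n_train = round(fractions[0] * len(events))
    n_cv = round(fractions[1] * len(events))
    groups = {
        "train": set(events[:n_train]),
        "cv": set(events[n_train:n_train + n_cv]),
        "test": set(events[n_train + n_cv:]),
    }
    # Samples keep their original order, so every event stays contiguous.
    return {name: [i for i, e in enumerate(event_ids) if e in members]
            for name, members in groups.items()}
```

For example, `list(k_fold_indices(100))` gives ten train/validation partitions of 100 samples, and `split_by_events([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])` returns three index lists in which no event is divided between sets, so a hydrograph can still be plotted for every event in the test set.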
(2) Another pitfall can be encountered when a modeller tries to optimise the model structure (e.g. number of hidden nodes in an ANN) by using the test set. This is, of course, methodologically incorrect since the role of the test set is to judge the final model performance in operation. Our experience is, however, that this principle is not always respected even by experienced modellers; we could refer to one of the recent international competitions of data-driven models of water systems where both training and test sets were given to the contestants (with the best intentions of the organisers), which inevitably pushed some of them to use the test set to optimize the performance of their model.

A question is, of course, what to do if the dataset is not large enough to allow for building all three sets of substantial size. Often the modellers choose not to build a cross-validation set at all, with the hope that the model trained on the training set would perform well on the test set as well. Another option is to perform ten-fold cross-validation.

Data preparation and choice of input variables

In any modelling exercise an important issue is data preparation and the choice of such variables that would be able to represent the modelled system in the best possible way.

An excellent reference to the first issue is the book by Pyle (1999). We cannot provide here the details but one thing has to be stressed: researchers excited by the power of the new modelling techniques often do not spend enough effort on proper data preparation.

An interesting study of the influence of different data transformation methods (linear, logarithmic and seasonal transformations, histogram equalization and a transformation to normality) was undertaken by Bowden et al. (2003). They found that the model using the linear transformation resulted in the lowest RMSE and more complex transformations did not improve the model (note, however, that the study is based only on one case study to forecast salinity in a river in Australia 14 days in advance). Our own experience shows that it is sometimes useful to apply the smoothing filters to the hydrological time series.

Choice of variables is an important subject and some studies suffer from the lack of relevant analysis. Apart from the expert judgement and visual inspection, there are formal methods that help in making this choice more justified, and the reader can be directed to the paper by Bowden et al. (2005) for an overview of these. Our own experience with using the Average Mutual Information (Solomatine & Dulal 2003) shows that this simple and reliable method can help in selection of relevant input variables.

It is our hope that the adequate data preparation and the rational and formalized choice of variables will become a standard part of any modelling study.

POPULAR METHODS AND TYPICAL APPLICATIONS

Most engineering or water management problems are formulated as prediction of real-valued variables; this is a regression problem (not to be confused with linear regression, a particular case of regression). Machine learning aims at finding a function that would best approximate some data and this prompts for the use of the corresponding methods already available like linear regression, polynomial functions like splines or orthogonal polynomial functions. Most of the data-driven models use combinations of many simple functions. In essence, training aims at optimising the number of these functions and the values of their parameters (given the functions’ class).

Multilayer perceptron (MLP) is a typical example of an artificial neural network (ANN) (Haykin 1999). It consists of several layers of mutually interconnected nodes (neurons), each of which receives several inputs, calculates the weighted sum of them and then passes the result to a non-linear “squashing” function. In this way the inputs to an MLP model are subjected to a multi-parameter nonlinear transformation so that the resulting model is able to approximate complex input–output relationships. Training of MLP is, in fact, solving the problem of minimising the model error (typically, mean squared error) by determining the optimal set of weights.

As the principle of backpropagation for training of MLPs was found and perfected in the 1970–80s (Werbos 1994), this type of ANN has become the most popular machine learning tool. Various types of ANNs are widely used for prediction and classification.

Note that backpropagation is a principle that made it possible to use gradient-based methods for MLP training, and that permits the usage of various optimization algorithms – from the simplistic versions of steepest descent
schemes to much more effective methods like conjugate gradient or Broyden–Fletcher–Goldfarb–Shanno (BFGS) methods (Press et al. 2007). In this respect we believe it is not fully correct to name MLP a “backpropagation network” since it can be trained using various methods, for example, direct search methods like GA. The use of these less efficient (much slower) algorithms is justified when gradient-based backpropagation training prematurely converges to the local optimum.

MLP ANNs are known to have several dozens of successful applications in river basin management and related problems, for example:

• Modelling rainfall–runoff processes: Hsu et al. (1995); Minns & Hall (1996); Dawson & Wilby (1998); Dibike et al. (1999); Abrahart & See (2000); Govindaraju & Ramachandra Rao (2001); Hu et al. (2007); Abrahart et al.
• Building an ANN-based intelligent controller for real-time control of water levels in the channels of polders
• Modelling river stage-discharge relationships (Sudheer & Jain 2003; Bhattacharya & Solomatine 2005);
• Building a surrogate (emulation, meta-) model for:
– replicating the behaviour of hydrodynamic and hydrological models of a river basin where ANNs are used in model-based optimal control of a reservoir (Solomatine & Torres 1996);
– building an assisting surrogate model in calibration of a rainfall–runoff model (Khu et al. 2004);
– emulating by an MLP network and replacing the hydrologic simulation component of multi-objective decision support model for watershed management (Muleta & Nicklow 2004). In this study an alternative to the backpropagation training was used – a direct search method (evolutionary algorithm) that reportedly allowed for avoiding local minima during training.

Most theoretical problems related to MLP have been solved and it should be seen as a quite reliable, well-understood method.

Radial basis functions (RBF) could be seen as a sensible alternative to the use of complex polynomials. The idea is to approximate some function $y = f(x)$ by a superposition of J functions $F(x, s)$, where $s$ is a parameter characterising the span, or “width”, of the function in the input space. Functions F are typically “bell-shaped” (e.g. a Gaussian function) so that they are defined in the proximity to some “representative” locations (centres) $w_j$ in n-dimensional input space and their values are close to zero far from these centres. The aim of learning here is in finding the positions of centres $w_j$ and the parameters of the functions f(x). This can be accomplished by building a radial-basis function neural network; its training allows identifying these unknown parameters. The centres $w_j$ of the RBFs can be chosen using a clustering algorithm, the parameters of the Gaussian can be found based on the spread (variance) of data in each cluster, and it can be shown that the weights can be found by solving a system of linear equations. This is done for a certain number of RBFs, with the exhaustive optimisation run across the number of RBFs in a certain range. Conceptually, RBF networks are close to [...] widely used for problems similar to those where MLP networks were used. The following examples could be mentioned:

• Sudheer & Jain (2003) used RBF ANNs for modelling river stage-discharge relationships and showed that, in the considered case study, RBF ANNs were superior to MLPs;
• Moradkhani et al. (2004) used RBF ANNs for predicting hourly streamflow hydrographs for the daily flow for a river in the USA as a case study, and demonstrated their accuracy if compared to other numerical prediction models. In this study RBF was combined with the self-organising feature maps used to identify the clusters of data;
• Nor et al. (2007) used RBF ANNs for the same purpose; however, just for the hourly flow and considering only storm events in the two catchments in Malaysia as case studies.

Genetic programming (GP) and evolutionary regression. GP is a symbolic regression method in which the specific model structure is not chosen a priori, but is a result of the search process. Various elementary mathematical functions, constants and arithmetic operations are combined in one function and the algorithm tries to build a model recombining these building blocks in one formula. The function structure is represented as a tree and since the resulting function is highly nonlinear and often
non-differentiable, it is optimised by a randomised search method – usually a GA. An overview of GP applications in hydrology can be found in Babovic & Keijzer (2005).

One of the criticisms towards GP relates to the fact that the formulae generated on the basis of the combination of multiple elementary functions are often extremely complex and carry no physical insight. To address this issue, an augmented version of GP – a dimensionally aware GP – has been proposed (Keijzer & Babovic 2002). It constrains the search and ensures that the output has the expected physical dimension by allowing only the formulae with variables with particular dimensions (being the combination of length, time, mass, etc.). This leads to the formula(e) with the dimensional semantics and increases the chance of them having some physical meaning. The usefulness of this approach was demonstrated in the experiments of generating a formula for the Chezy coefficient using the data generated by a numerical model of river flow through flexible vegetation in wetlands (Babovic & Keijzer 2000).

Laucelli et al. (2007) present an application of GP to the problem of forecasting the groundwater heads in an aquifer in Italy; in this study the authors also employed averaging of several models built on the data subsets generated by bootstrap.

In evolutionary regression (Giustolisi & Savic 2006), a method similar to GP, the elementary functions are chosen from a limited set and the structure of the overall function is fixed. Typically, a polynomial regression equation is used and the coefficients are found by GA. This method overcomes some shortcomings of GP, such as the computational requirements, the number of parameters to tune and the complexity of the resulting symbolic models. It was used, for example, for modelling groundwater level (Giustolisi et al. 2007a) and river temperature (Giustolisi et al. 2007b) and the high accuracy and transparency of the resulting models were reported.

Fuzzy rule-based systems (FRBS). Fuzzy logic was introduced by Lotfi Zadeh (1965) and since then it has found multiple successful applications, mainly in control theory (e.g. Kosko 1997). Fuzzy rule-based systems can be built by interviewing human experts, or by processing historical data and thus forming a data-driven model. These rules are “patches” of local models overlapped throughout the parameter space, using a sort of interpolation at a lower level to represent patterns in complex nonlinear relationships. The basics of the data-driven approach and its use in a number of water-related applications can be found in Bárdossy & Duckstein (1995).

Typically the following rules are considered:

IF $x_1$ is $A_{1,r}$ AND … AND $x_n$ is $A_{n,r}$ THEN $y$ is $B$

where $\{x_1, \dots, x_n\} = x$ = input vector; $A_{i,r}$ = fuzzy set; $r$ = index of the rule, $r = 1, \dots, R$. Fuzzy sets $A_{i,r}$ (defined as membership functions with values ranging from 0 to 1) are used to partition the input space into overlapping regions (for each input these are intervals). The structure of B in the consequent could be either a fuzzy set (then such a model is called a Mamdani model), or a function $y = f(x)$, often linear (and then the model is referred to as a Takagi–Sugeno–Kang (TSK) model). The model output is calculated as a weighted combination of the R rules’ responses. Output of the Mamdani model is fuzzy (a membership function of irregular shape), so the crisp output has to be calculated by the so-called defuzzification operator. Note that in the TSK model, each of the R rules can be interpreted as a local model valid for a certain region in the input space defined by the antecedent and overlapping fuzzy sets $A_{i,r}$. Resemblance to the RBF ANN is obvious.

FRBS were effectively used for drought assessment (Pesti et al. 1996); prediction of precipitation events (Abebe et al. 2000a); control of water levels in polder areas (Lobbrecht & Solomatine 1999); modelling rainfall-discharge dynamics (Vernieuwe et al. 2005). One of the limitations of FRBS is that the demand for data grows exponentially with an increase in the number of input variables. It is worth mentioning an important area where the principles and methods of fuzzy logic were also successfully used, which is analysis of model uncertainty. The uncertainty of inputs and parameters is described in fuzzy terms (fuzzy numbers) rather than probabilistic ones, and it is possible to generate the membership function (fuzzy number) characterising the output. This approach was applied, for example, in groundwater modelling (Abebe et al. 2000b) and rainfall–runoff modelling (Maskey et al. 2004).

Support vector machines (SVM). This machine learning method is based on the extension of the idea of identifying a hyperplane that separates two classes in classification. It is closely linked to the statistical learning theory initiated by V. Vapnik in the 1970s at the Institute of Control Sciences of the
Russian Academy of Sciences (Vapnik 1998). Originally developed for classification, it was extended to solving prediction problems, and in this capacity was used in hydrology-related tasks (note that currently some researchers attribute SVM to the group of the so-called kernel machines). Dibike et al. (2001) and Liong & Sivapragasam (2002) reported using SVMs for flood management and in prediction of river water flows and stages. Bray & Han (2004) addressed the issue of tuning SVMs for rainfall–runoff modelling. In all these cases SVM-based predictors have shown good results, in many cases (though not always) surpassing other DDM methods in accuracy.

Chaos theory and nonlinear dynamics appear to be useful for time series forecasting when a time series carries enough information about the behaviour of the system (Abarbanel 1996). Let a time series $\{x_1, x_2, \ldots, x_t, \ldots, x_n\}$ be given (e.g. a sequence of water levels). The state of the system at time $t$ can be represented by a vector $y_t$ in $m$-dimensional state space, $y_t = \{x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau}\}$, where $\tau$ is the delay time. The whole time series can then be represented by a sequence of such vectors $\{y_t\}$: $\{y_m, y_{m+1}, \ldots, y_n\}$. If the original time series exhibits the so-called chaotic properties (manifested by its equivalent trajectory in the phase space following a quasi-periodic pattern), then the methods of chaos theory can be used to predict the future values of $y$, and hence of $x$. For this, the so-called local model predicting the future value of $y$ has to be built in phase space; this is an instance-based learning model, or a regression model (linear or nonlinear) built on the basis of the points representing the “moves in the phase space” of the neighbours of the current $y$. The predictive capacity of chaos theory, based on the idea that the system behaves in the future in a similar manner as in the (distant) past, surpasses that of linear models like ARIMA. In practical applications, the delay time $\tau$ and the dimension $m$ need to be appropriately chosen (or determined by optimisation, for example minimising the model forecast error by GA) in order to fully capture the dynamic structure of the time series. Multivariate models embody time series representing several variables; they capture the interdependences of these variables and can be interpreted as input–output data-driven models.

The chaos theory-based approach was used by Babovic et al. (2000) for predicting water levels at the Venice lagoon. Solomatine et al. (2000) and Velickov et al. (2003) used chaos theory to predict the surge water level in the Rijn river estuary and the two-hourly prediction error was at least on par with the accuracy of hydrodynamic models. Phoon et al. (2002) employed nonlinear dynamics for forecasting hydrologic time series. Note that the chaos-based methods do not have universal applicability: they can be successfully applied only when time series (or their combination) have certain properties, for example are periodic, or indeed exhibit properties of chaotic behaviour (or close to it) and when time series are of adequate (considerable) length. Note also a certain link between the chaos theory that uses local models and the principles of instance-based learning (considered below).

One of the research challenges relates to quite a practical issue: the development of more reliable adaptive routines for determining the number of neighbours used in the local models.

Instance-based learning (IBL). In IBL (Mitchell 1997) no model is built: classification or numeric prediction is made directly by combining instances from the training data set that are close (typically in the Euclidean sense) to the new vector $x_q$ of inputs (query point). In fact, IBL methods construct a local approximation to the modelled function that applies well in the immediate neighbourhood of the new query instance encountered. Thus IBL describes a very complex target function as a collection of less complex local approximations, and often demonstrates competitive performance when compared, for example, to ANNs.

A typical representative of IBL is the k-nearest neighbour (k-NN) method. For nominal output, the predicted class will just be the most common value among the k training examples nearest to the query point $x_q$. For real valued output, the estimate is the mean value of the k-nearest-neighbouring examples, possibly weighted according to their distance to $x_q$. Further extensions are known as locally weighted regression (LWR) when the regression model is built on k nearest instances: the training instances are assigned weights according to their distance to $x_q$ and the regression equations are generated on the weighted data.

Karlsson & Yakowitz (1987) introduced this method in the context of water issues, focusing, however, only on (single-variate) time series forecasts. Galeati (1990) demonstrated the applicability of the k-NN method (with the vectors composed of the lagged rainfall and flow values) for daily discharge forecasting and favourably compared it to the statistical ARX model. Shamseldin & O’Connor (1996)
used the k-NN method for adjusting the parameters of the linear perturbation model for river flow forecasting. Toth et al. (2000) compared the k-NN approach to other time series prediction methods in a problem of short-term rainfall forecasting. Ostfeld & Salomons (2005) developed a hybrid genetic–instance-based learning algorithm through linking a GA with a k-NN scheme for calibrating the 2D surface quantity and water quality model CE-QUAL-W2. Solomatine et al. (2007) explored a number of IBL methods and tested their applicability in short-term hydrologic forecasting.

To conclude the coverage of the popular data-driven methods it can be mentioned that most of them are developed in the computational intelligence community. The main challenges for the researchers in hydroinformatics are in testing various combinations of these methods for particular water-related problems, in combining them with the optimisation techniques, in developing robust modelling procedures able to work with noisy data, and in developing methods providing the model uncertainty estimates.

SOME OF THE MODERN TRENDS

Data-driven modelling has passed the initial stage in which researchers, excited by the power of new machine learning techniques, rushed to search for all possible data available to feed (often indiscriminately) into a model with the hope of constructing a good predictor. The power of basic data-driven modelling techniques has already been proven and the research community is now working towards development of the optimal model architectures and avenues for making data-driven models more robust, understandable and really useful for managers.

As regards the new modelling architectures, we will address herein the issue of the so-called modular models, being combinations of “local” models, with which we have lately obtained some experience.

The usefulness of a model should be measured not only by its methodological correctness and accuracy, but mainly by the degree to which a model would be able to help a water manager or a decision-maker. In river basin management physically based models are widely applied and typically are found to be useful tools, so one of the challenges here is in the inclusion of DDM into existing decision-making frameworks, while taking into consideration both the system’s physics and the data availability.

Another aspect of usefulness is the adequate reflection of reality which is uncertain, and in this respect developing the methods of dealing with the data and model uncertainty is currently an important issue. We will briefly address this issue as well.

Combination of “local” specialized models

Physical processes in rivers and river basins are multi-stationary, are composed of a number of sub-processes (e.g. related to various hydrologic conditions or river flow regimes), and their accurate modelling by the building of one single (“global”) model is sometimes not possible. For river basins usually several physically based models are built, each responsible for modelling various aspects of the basin: hydraulic, hydrology, groundwater, etc. Typically, only one comprehensive model is developed for each of these areas. For example, a hydrologic model should be able to represent all the complexity of hydrologic processes in the basin. If the processes are described with a sufficient level of detail, and properly encapsulated in the model, such a model may become an accurate representation of reality and is often adequate.

Sometimes, however, such a global model is not capable of describing all the sub-processes adequately and is not equally accurate for all hydrological conditions. In this case an option is to try to identify such sub-processes and to build separate models for each of them. Another approach is to build several similar models for the same process and to combine them in an “ensemble”; an example of such an approach is reported by Xiong et al. (2001), where a Takagi–Sugeno fuzzy model is used to combine conceptual rainfall–runoff models, and of course by many researchers using ensembles of meteorological and hydrological models.

In the case of using data-driven models, the situation is similar. A single DDM, e.g. ANN, often is not accurate for all possible situations. The collected data (training set) can be split into a number of subsets and separate models will be trained on these subsets (regions). These models are called local, or expert, models and the overall model a modular model (MM), or a committee machine (Haykin 1999). The way models are
built and combined can be subjected to optimisation, resulting in an overall model with the highest performance.

In the process of building, training and using a MM, two decisions have to be made: (A) which module should receive which training pattern (splitting problem) and (B) how the outputs of the modules should be combined to form the final output of the system (combining problem) (Figure 2). Accordingly, two decision units have to be built, or one unit performing both functions. Such a unit is called an integrating unit or a gating network (a reference to a neural network often used for this purpose). It should be delivered to the user of the final model, along with the trained modules. Functioning of the units A and B could be different during training and operation. Classification of modular models (different from the one of Haykin (1999)) now follows.

Figure 2 | Combining specialised local models.

Soft splitting of the training set. The group of the statistically driven approaches with “soft” splits of input space is represented by mixtures of experts (Jordan & Jacobs 1995), bagging (bootstrap aggregating; Breiman 1996) and boosting (Freund & Schapire 1997). Here we will briefly introduce boosting only.

Boosting (its advanced version, AdaBoost, is described by Freund & Schapire (1997)) can be seen as a method of building a series of modular models using soft splits. In the first iteration the basis model is trained (this will be the first module) on the whole dataset. The probability for each data vector to be selected for the next iteration is adjusted: it is increased if prediction for this data vector was poor. Using this distribution the new dataset of the same size is sampled from the original set and the new model is built. This process is repeated n times, thus resulting in n modules, each trained on different (intersecting) subsets. The combining unit B uses the weighted sum of the modules, where the weight is dependent on the accuracy of the module on the sample used. During training, unit A arranges the recalculation of the distribution and proper resampling, and during operation it simply distributes each new input vector to all modules. Boosting was originally developed for binary classification problems and was later extended to solve multiclass classification problems (AdaBoost.M2) and regression problems (AdaBoost.R). An improved version of boosting for regression is AdaBoost.RT by Shrestha & Solomatine (2006a). They demonstrated its advantages in comparison to other boosting algorithms and other learning methods on several benchmarking problems and two problems of river flow forecasting.

Hard splitting of the training set. A number of methods do not combine the outputs of different models but explicitly use only one of them, the most appropriate one (a particular case when the weights of other expert models are zero). Such methods use “hard” splits of input space into regions. Each local model is trained individually on the subset of instances contained in its region, and finally the output of only one specialized model is taken into consideration. The splitting can be done manually by experts on the basis of domain knowledge. Another way is to use information theory and to perform splitting progressively; examples are decision trees (Quinlan 1986), regression trees (Breiman et al. 1984) or M5 model trees (Quinlan 1992).

Several examples of such an approach can be mentioned. See & Openshaw (2000) built different neural networks based on different types of hydrological events. Hsu et al. (2002) presented a method of reproducing the catchment response through multiple local linear regression models which are built for specific flow conditions relating to the clusters identified by a Kohonen network. Solomatine & Xue (2004) used M5 model trees and neural networks in a flood-forecasting problem, combining the models valid for particular hydrologic conditions only (see the next subsection). Wang et al. (2006) used a combination of ANNs for forecasting flow: different networks were trained on the data subsets determined by applying either a threshold discharge value or clustering in the space of inputs (lagged discharges only but no rainfall data, however). Jain & Srinivasulu (2006) applied a mixture of neural networks and conceptual techniques to model the different segments of a decomposed flow hydrograph. Corzo & Solomatine (2007) used several methods of baseflow separation, built different
models for base and excess flow and combined these models, ensuring optimal overall model performance.

Regression trees and M5 model trees. This class of models is not yet popular in river management, but the known applications to water issues show their high performance (Witten & Frank 2000). These machine-learning techniques use the following idea: split the parameter space into areas (subspaces) and build in each of them a separate regression model of zero or first order (Figure 3). In M5 trees the models in the leaves are linear. The data set T is either associated with a leaf (where a regression model is built) or with a node (where some test is chosen that splits T into subsets corresponding to the test outcomes). The same process is applied recursively to the subsets. In the case of numeric inputs the Boolean tests $a_i$ at a node used to split the data set have the form “$x_i < C$” where $i$ and $C$ are chosen to minimise the standard deviation in the subsets resulting from the split. $M_n$ are local specialised models built for subsets filtered down to a given tree leaf. The resulting model can be seen as a committee of linear models, each specialized on a certain subset of the training set belonging to a particular region of the input space.

Combination of linear models was used in dynamic hydrology already in the 1970s (e.g. multi-linear models by Becker & Kundzewicz (1987)). The M5 model tree approach advances it further by introducing algorithms based on information theory that make it possible to automatically split the multi-dimensional parameter space and to generate a range of models according to the overall quality criterion.

MTs may serve as an alternative to nonlinear models like ANNs and are often almost as accurate as ANNs, but have some important advantages: training of MTs is much faster than that of ANNs, it always converges, and the results can be easily understood by decision-makers. Moreover, it is easy to generate a range of MTs varying in complexity and accuracy.

An early (if not the first) application of M5 model trees in river flow forecasting was reported by Kompare et al. (1997). Solomatine & Dulal (2003) used the M5 model tree in rainfall–runoff modelling of a river sub-basin in Italy. Stravs et al. (2006) used M5 trees in modelling the precipitation interception in the context of the Dragonja river basin case study.

It is worth mentioning that the models (modules) in Figure 2 need not necessarily be data-driven ones but rather may have various natures, and may include expert judgements. If an overall model uses various types of models, it can be called a hybrid model. This is an important emerging research trend and a challenge in modelling of water-related assets.
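The split test “$x_i < C$” described above for M5 model trees can be illustrated with a minimal sketch for a single numeric input. This is not the full M5 algorithm (there is no recursion, pruning, smoothing or linear model fitting in the leaves); it only searches for the threshold that minimises the size-weighted standard deviation of the output in the two subsets. The function name `best_split`, the use of midpoints between consecutive input values as candidate thresholds, and the sample data are assumptions made for illustration.

```python
from statistics import pstdev


def best_split(xs, ys):
    """Find the threshold C for the test "x < C" on one numeric input.

    Candidate thresholds are midpoints between consecutive distinct x values.
    The returned C minimises the size-weighted (population) standard
    deviation of y in the two subsets produced by the split.
    Returns (C, weighted_std); C is None if no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_c, best_score = None, float("inf")
    for k in range(1, n):
        left_x, right_x = pairs[k - 1][0], pairs[k][0]
        if left_x == right_x:
            continue  # equal x values cannot be separated by a threshold
        c = (left_x + right_x) / 2.0
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * pstdev(left) + len(right) * pstdev(right)) / n
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score


# Hypothetical data: an input (e.g. lagged rainfall) and the output (flow).
rain = [1, 2, 3, 10, 11, 12]
flow = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
threshold, spread = best_split(rain, flow)
print(threshold, spread)
```

For this sample the chosen threshold falls between the two groups of input values (at 6.5), so each subset contains outputs of similar magnitude. In a model tree the same search would be repeated over all inputs and applied recursively to each subset, with a linear model then fitted in every leaf.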
in deciding what data should be used and how it should be structured (as is done by most modellers).

We will address here a particular problem of including an expert and using the domain knowledge in the process of building modular models (Figure 2). In this context, the role for a human expert could be, for example, in making decisions (A) and (B) (or approving those made by an algorithm) and, of course, in the choice of models used in each unit.

Inclusion of a human expert. It is possible to mention a number of studies where an attempt is made to include a human expert in the process of building a modular model. In solving a flow forecasting problem, Solomatine & Xue (2004) introduced a human expert to determine the hydrological conditions for which separate DDMs were built. Solomatine & Siek (2004) and Solomatine & Siek (2006) presented the M5flex algorithm, which allows an expert to choose the splitting rules in building M5 trees, thus directing the process of building a DDM, and demonstrated its accuracy in hydrologic modelling. Jain & Srinivasulu (2006) and Corzo & Solomatine (2007) also applied decomposition of the flow hydrograph by a threshold value and then built separate ANNs for low and high flow regimes. All these studies demonstrated the higher accuracy of the resulting modular models compared to the models built to represent all possible regimes of the modelled system.

The study by Solomatine & Xue (2004) will be used as an illustration of such an approach. In it, the flow predictions in the Huai river basin (China) were made on the basis of previous flows and precipitation, and a committee hybrid model was built. The problem was to predict flow $Q_{t+1}$ one day ahead. The following notations are used: flows on the previous and the current day as $Q_{t-1}$ and $Q_t$, respectively; precipitation on the previous day as $P_{t-1}$; moving average (2 days) of the precipitation two days before as $Pmov2_{t-2}$; moving average (3 days) of the precipitation four days before as $Pmov3_{t-4}$.

As a first step the domain experts were asked to identify several hydrological conditions (rules), used to split the input space into regions. Some of the rules follow:
(1) $Q_{t-1} \ge 1000$ m³/s (high flows)
(2) $Q_{t-1} < 1000$ m³/s AND $Q_t \ge 200$ m³/s (medium flows)
(3) $P_{t-1} > 50$ AND $Pmov2_{t-2} < 5$ AND $Pmov3_{t-4} < 5$ (flood condition due to the short but intensive rainfall after a period of dry weather).

For each of these conditions separate local models were built (M5 model trees and ANNs). The presented approach demonstrated that combination of several “local” models improves the accuracy of prediction.

Inclusion of domain knowledge in algorithmic form. A human expert can be seen, of course, as a supplier of domain knowledge. However, recently there has been increased interest in exploring the possibilities of encapsulating the domain knowledge in algorithmic form and thus making it part of a data-driven model, which allows the latter to be optimised. One such approach is being developed by Corzo & Solomatine (2007) and is used to improve the accuracy of a predictive rainfall–runoff model. In it, separate ANN models for baseflow and excess flow are built. For baseflow separation two methods are used: the constant slope method and a recurrent filter. These methods, representing the hydrological knowledge about this phenomenon, are algorithmically implemented and run on the training dataset; then surrogate classifiers are trained to replicate them (since their straightforward implementation needs future data, which is not available during operation). The resulting modular model undergoes an exhaustive optimisation to ensure optimal accuracy. Application of this approach to two catchments demonstrates its value, especially for longer forecasting horizons.

DATA-DRIVEN MODELS OF UNCERTAINTY

Modelling uncertainty was always an issue associated with river basin management, but recently the interest in this problem and, accordingly, the number of publications has dramatically increased. One of the reasons is probably purely technical: computer power and advances in networked computer clusters nowadays allow for running Monte Carlo-based analysis of parametric uncertainty of quite complex models. However, there is, of course, a deeper reason: general recognition of the inadequacy of “point predictions” generated by most water models to the requirements of real-life water management. An important trend of the last several years is complementing the modelling studies of river basins with sensitivity and uncertainty analysis (Montanari & Brath 2004). Recently we have made a step towards building the data-driven models of uncertainty.
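To make the contrast between a “point prediction” and a Monte Carlo-based analysis of parametric uncertainty concrete, a minimal sketch follows. Everything in it is an assumption made for illustration: the one-parameter `toy_model` (flow as a fixed fraction `k` of rainfall) stands in for a real water model, the parameter is sampled from a uniform range, and the 5% and 95% empirical quantiles are used as uncertainty bounds.

```python
import random


def toy_model(rain, k):
    """Hypothetical one-parameter model: flow is a fixed fraction k of rainfall."""
    return [k * r for r in rain]


def monte_carlo_bounds(rain, k_low, k_high, n_runs=1000, seed=0):
    """Propagate uncertainty in k through the model by random sampling.

    k is drawn uniformly from [k_low, k_high] for each run. For every time
    step the function returns (lower, median, upper), where lower and upper
    are the empirical 5% and 95% quantiles of the simulated flows.
    """
    rng = random.Random(seed)
    runs = [toy_model(rain, rng.uniform(k_low, k_high)) for _ in range(n_runs)]
    bounds = []
    for step in zip(*runs):  # all simulated flows for one time step
        ordered = sorted(step)
        lower = ordered[int(0.05 * (n_runs - 1))]
        median = ordered[int(0.50 * (n_runs - 1))]
        upper = ordered[int(0.95 * (n_runs - 1))]
        bounds.append((lower, median, upper))
    return bounds


# Hypothetical rainfall series; each output triple brackets the point prediction.
for lower, median, upper in monte_carlo_bounds([10.0, 20.0, 0.0], 0.3, 0.6):
    print(lower, median, upper)
```

A single run with one value of `k` gives only the point prediction; the repeated runs give an interval per time step, which is the kind of information the uncertainty analysis discussed above is meant to supply. For a complex model each run is expensive, which is why such analysis depends on the computing power mentioned in the text.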
Error prediction models. Consider a model simulating or predicting certain water-related variables (referred to as a primary model). This model’s outputs are compared to the recorded data and the errors are calculated. Another model, a data-driven model, is trained on the recorded errors of the primary model and can be used to correct errors of the primary model. In the context of river modelling, this primary model would typically be a physically based model, but can be a data-driven model as well.

Such an approach was employed in a number of studies. Shamseldin & O’Connor (2001) used ANNs to update runoff forecasts: the simulated flows from a model and the current and previously observed flows were used as input, and the corresponding observed flow as the target output. Updates of daily flow forecasts for a lead-time of up to four days were made, and the ANN models gave more accurate improvements than autoregressive models. Lekkas et al. (2001) showed that error prediction improves real-time flow forecasting, especially when the forecasting model is poor. Babovic et al. (2001) used ANN to predict errors of 2D hydrodynamic models. Abebe & Price (2004) used ANN to correct the errors of a routing model of the River Wye in the UK. Solomatine et al. (2007) built an ANN-based rainfall–runoff model whose outputs were corrected by an instance-based learning model.

Uncertainty prediction models. Data-driven (machine-learning) methods may be helpful not only in modelling natural processes, but also in building models of the error probability distributions for physically based models. Recently Shrestha & Solomatine (2006b) presented an approach termed UNcertainty Estimation based on local model Errors (UNEEC). It is based on an idea to build local data-driven models predicting the properties of the error distribution, and uses clustering and fuzzy logic. This is a distribution-free, non-parametric method to model the propagation of integral uncertainty through the models and it was tested in forecasting river flows in a flood context.

One of the interesting research directions in building the models of uncertainty is finding the ways of combining the fuzzy and probabilistic descriptors of uncertainty in a data-driven model, and building robust predictors of model uncertainty originating from various sources.

TWO EXAMPLES

We are presenting two examples that illustrate several machine-learning methods used in solving river-basin-related problems. They also demonstrate how data-driven models are built in terms of choosing appropriate inputs, data processing and model optimisation.

DDM for forecasting river flows

Solomatine et al. (2007) used decision trees and k-NN in classification of river flow levels according to their severity in a river flood forecasting problem in Nepal. In this problem a medium-sized foothill-fed river in the Bagmati basin was considered, having an area of about 3700 km². Time series data of rainfall at three stations within the basin with daily sampling over eight years (1988–1995) were collected. Daily flows were recorded at one station so this precluded modelling the routing. Weight factors were calculated using the Thiessen polygon. The daily evapotranspiration was computed using the modified Penman method recommended by FAO.

Generally a rainfall–runoff data-driven model predicting flow $T$ days ahead was sought in the form presented in Table 1.

Table 1
Inputs ($L$ and $M$ are to be identified as a result of model optimization): lagged rainfalls $R_{t-\tau}$, $\tau = 0, \ldots, L$
($f$ is typically a multiple linear regression model, ANN, SVM, or M5 model tree): $Q_{t+T} = f(R_t, R_{t-1}, \ldots, R_{t-L}, Q_t, \ldots, Q_{t-M})$

First, dependence analysis of input and output variables was accomplished by visual inspection. Then the
interdependences between variables and the lags $\tau$ were established using correlation and average mutual information (AMI) analyses (Solomatine & Dulal 2003; Bowden et al. 2005). By visual inspection of several precipitation events the maximum value of peak-to-peak time lags of rainfall and runoff was found to be close to one day. The cross-correlation analysis of the rainfall and runoff gave a maximum correlation of 0.78 for one day lag, so this lag was accepted as the average lag time of rainfall. This value of the lag was also consistent with AMI analysis. The autocorrelation function of runoff drops rapidly within three time steps (days). As a result, the model predicting flow one day ahead on the basis of five variables was set to be of the form

$Q_{t+1} = f(RE_{t-2}, RE_{t-1}, RE_t, Q_{t-1}, Q_t)$ (1)

An important problem is splitting the data into training and testing datasets. The ways to do it and the possible problems have been mentioned previously. In this study we used two approaches: a method based on randomization to create statistically similar training, cross-validation and test sets, and a method based on hydrological analysis of data to generate three contiguous datasets, trying to ensure at the same time at least some statistical resemblance of these sets. In the latter, eight years of data (2919 records) were split as follows: the first 919 records were used as the testing data set and the remaining records as training and cross-validation data. Each instance was represented by a vector in five-dimensional space (since there are five inputs) accompanied by the associated value of its output variable.

In the study of Solomatine et al. (2007) several methods of instance-based learning were applied (including locally weighted regression), along with ANN, M5 model tree models and a lumped conceptual model. The results show high accuracy of all data-driven methods, especially the locally weighted regression. Corzo & Solomatine (2007) also applied a modular ANN model to the same case study, and have shown that building two separate models related to baseflow and excess flow, with the global optimisation of the resulting model structure, increases the prediction accuracy compared to a single model. The mentioned papers provide more details of the experiments conducted and the visualisation of the results.

GA-based optimization of M5 model trees for predicting river basin output flow and contaminant transport

This example application is based on Preis et al. (2006) and Ostfeld & Preis (2005) for the flow and the contaminant predictions at Lake Kinneret (the Sea of Galilee) watershed, located in northern Israel. The Lake Kinneret watershed is about 2730 km² (2070 in Israel, with the rest in Lebanon), is inhabited by about 200 000 people, organised into 25 municipalities, and three cities (in the Israeli part). The watershed outlet is Lake Kinneret, which is the most important surface water resource in Israel, providing approximately 35% of its annual drinking water demand.

Factors such as the rapid increase in Israel’s population over the last decade along with an increase in its standard of living, the Israeli peace agreement with Jordan and the increasingly frequent droughts in the region are consistently intensifying the demand for freshwater, and hence the need to remove larger volumes of water from the lake. These factors further increase the likelihood of water quality decline; thus preserving the lake from further pollution is a foremost concern.

Figure 4 | Schematics of the hybrid model tree – genetic algorithm scheme.

The developed data-driven model is aimed at predicting flow and contaminant transports within the watershed,
down to its outlet – Lake Kinneret. The model, entitled KWAT (Kinneret Watershed Analysis Tool), follows a hybrid set-up combining GA and model trees (MT) (Figure 4). The objective of the Flow section of the model is to tune the values of a vector of coefficients $a$ that multiply the average rainfall time series intensity $I(t)$ (the input) imposed on a watershed so as to calibrate its outlet flows $Q(t)$. The Water quality section of the model then uses these optimal flows $Q(t)$ and the effective optimal rainfall intensities $I_e^*(t)$ to adjust the values of a vector of coefficients $b$ so as to calibrate the watershed outlet concentrations $C(t)$.

In the Flow section the following variables are used: $t$ = time; $\Delta T$ = lag in time (e.g. the concentration time of the watershed); $a$ = vector of coefficients (the GA decision variables); $I(t)$ = time series of the average rainfall intensity imposed on the watershed (e.g. using the Thiessen average rainfall method); $ET(t)$ = evapotranspiration time series; $I_e(t)$ = time series of the effective rainfall; $Q(t)$ = flow time series at the watershed outlet; and Fitness = the fitness of the model tree (MT) outcome analysis estimated through a least square type equation.

In the Water quality section the variables used are: $I_e^*(t)$ = the optimal effective rainfall intensity time series (i.e. the outcome of the quantity model); $C_{in}(t)$ = the input concentration time series imposed on the watershed; $b$ = vector of coefficients (the GA decision variables); $C_e(t)$ = an “effective” resultant concentration time series; and $C(t)$ = concentration time series at the watershed outlet.

To reduce the computational complexity and to increase the model robustness, the dimensions of the $a$ and $b$ coefficient vectors are set to be much less than the dimension of $t$. This is accomplished by dividing the time series of the rainfall intensity $I(t)$ into a set of no more than six category domains (i.e. the rainfall intensity is divided into six categories, with $a$ and $b$ values assigned to each).

Figures 5–7 show results for the flow and water quality models as applied to a sub-watershed of Lake Kinneret (Meshushim watershed – 140 km², Ostfeld & Preis (2005)). Figure 5 shows the results for the flow model test data set for 1997–1998. Figures 6 and 7 describe the

It can be seen from Figures 5–7 that the predictions obtained from the developed flow and water quality models were, in general, in good agreement with the measurements. However, the models were less successful in predicting high flows and water quality concentrations. This is an inherent limitation of a data-driven technique whose accuracy is primarily dependent on the quality of the dataset used for training. The larger a dataset, the greater is the chance to have better predictions. It is anticipated that increasing the number of training instances for the proposed model will also improve its prediction accuracy.

Figure 6 | Measured and simulated total nitrogen (normalised), test data, 1997–1998.
Figure 7 | Measured and simulated total phosphorus (normalised), test data, 1997–1998.

CONCLUSIONS

Data-driven modelling and computational intelligence methods have proven their applicability to various problems related to river basin management: modelling, short-term forecasting, classification of hydrology-related data, and even automated generation of flood inundation maps based on aerial photos (not discussed in this paper due to lack of space, see, e.g., Velickov et al. (2000)), etc. A particular problem will benefit from data-driven modelling if: (1) there is a considerable amount of data available; (2) there are no considerable changes to the system during the period covered by the model; (3) it is difficult to build adequate knowledge-driven simulation models due to the lack of understanding and/or the inability to satisfactorily construct a mathematical model of the underlying processes. Of course, data-driven models can also be useful when there is a necessity to validate the simulation results of physically based models with other types of models.

It can be said that it is practically impossible to recommend one particular type of data-driven model for a given problem. Since water-related applications are often characterised by the data being noisy and of poor quality, it is advisable to apply various types of techniques and to compare and/or combine the results. For example, M5 model trees, combining local and global properties, could very well complement ANNs, and be more easily accepted by decision-makers due to their reliance on simple linear models.
results for the water quality model test data set for 1997 – We have considered and demonstrated some of the new
1998 for predicting total nitrogen and total phosphorus, trends in data-driven modelling and mentioned a number of
respectively. research challenges. It is worth mentioning one challenge of
a general nature: development of hybrid models by combining the models of different types and following different modelling paradigms, including the combination of data-driven and physically based models, and finding effective ways of including a human expert in the modelling cycle.
ACKNOWLEDGEMENTS
This work was partly supported by the EU project “Integrated Flood Risk Analysis and Management Methodologies” (FLOODsite), contract GOCE-CT-2004-505420, and the Delft Cluster Research Programme of the Dutch Government (project 4.30 “Safety against flooding”). The authors are grateful to the three reviewers for their valuable comments.
REFERENCES
Abarbanel, H. D. I. 1996 Analysis of Observed Chaotic Data. Springer-Verlag, Berlin.
Abbott, M. B. 1991 Hydroinformatics: Information Technology and the Aquatic Environment. Avebury Technical, Aldershot.
Abebe, A. J., Solomatine, D. P. & Venneker, R. 2000a Application of adaptive fuzzy rule-based models for reconstruction of missing precipitation events. Hydrol. Sci. J. 45 (3), 425–436.
Abebe, A. J., Guinot, V. & Solomatine, D. P. 2000b Fuzzy alpha-cut vs. Monte Carlo techniques in assessing uncertainty in model parameters. Proc. 4th Int. Conf. on Hydroinformatics, Cedar Rapids. Available online: http://www.ihe.nl/hi/sol/papers/HI2000-AlphaCut.pdf
Abebe, A. J. & Price, R. K. 2004 Information theory and neural networks for managing uncertainty in flood routing. ASCE J. Comput. Civil Engng. 18 (4), 373–380.
Abrahart, R. J., Heppenstall, A. J. & See, L. M. 2007a Timing error correction procedure applied to neural network rainfall-runoff modelling. Hydrol. Sci. J. 52 (3), 414–431.
Abrahart, R. J., See, L. M., Solomatine, D. P. & Toth, E. (Eds.) 2007b Data-driven approaches, optimization and model integration: hydrological applications. Hydrol. Earth Syst. Sci. 11 (Special issue).
Abrahart, R. J. & See, L. 2000 Comparing neural network and autoregressive moving average techniques for the provision of continuous river flow forecast in two contrasting catchments. Hydrol. Process. 14, 2157–2172.
Abrahart, R. J. & See, L. M. 2007 Neural network modelling of non-linear hydrological relationships. Hydrol. Earth Syst. Sci. 11, 1563–1579.
Abrahart, B., See, L. M. & Solomatine, D. P. 2008 Hydroinformatics in Practice: Computational Intelligence and Technological Developments in Water Applications. Berlin, Springer-Verlag. In press.
AVGWLF ArcView Generalized Watershed Loading Function. Available at: http://www.avgwlf.psu.edu/
Babovic, V., Canizares, R., Jensen, H. R. & Klinting, A. 2001 Neural networks as routine for error updating of numerical models. ASCE J. Hydraul. Engng. 127 (3), 181–193.
Babovic, V. & Keijzer, M. 2000 Genetic programming as a model induction engine. J. Hydroinf. 2, 35–60.
Babovic, V. & Keijzer, M. 2005 Rainfall runoff modelling based on genetic programming. In Encyclopedia of Hydrological Sciences, vol 1. (ed. Andersen, M. G.). John Wiley & Sons, New York, doi:10.1002/0470848944.hsa017.
Babovic, V., Keijzer, M. & Stefansson, M. 2000 Optimal embedding using evolutionary algorithms. In: Proc. 4th Int. Conference on Hydroinformatics, Cedar Rapids.
Bárdossy, A. & Duckstein, L. 1995 Fuzzy Rule-Based Modeling with Applications to Geophysical, Biological and Engineering Systems. CRC Press, Boca Raton, FL.
BASINS Better Assessment Science Integrating Point and Nonpoint Sources. Available at: http://www.epa.gov/OST/BASINS/
Becker, A. & Kundzewicz, Z. W. 1987 Nonlinear flood routing with multilinear models. Wat. Res. Res. 23, 1043–1048.
Bhattacharya, B. & Solomatine, D. P. 2005 Neural networks and M5 model trees in modelling water level–discharge relationship. Neurocomputing 63, 381–396.
Bowden, G. J., Dandy, G. C. & Maier, H. R. 2003 Data transformation for neural network models in water resources applications. J. Hydroinf. 5, 245–258.
Bowden, G. J., Dandy, G. C. & Maier, H. R. 2005 Input determination for neural network models in water resources applications. Part 1 – Background and methodology. J. Hydrol. 301, 75–92.
Bowden, G. J., Maier, H. R. & Dandy, G. C. 2002 Optimal division of data for neural network models in water resources applications. Wat. Res. Res. 38 (2), 1–11.
Bray, M. & Han, D. 2004 Identification of support vector machines for runoff modelling. J. Hydroinf. 6, 265–280.
Breiman, L. 1996 Bagging predictors. Machine Learning 24 (2), 123–140.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. 1984 Classification and Regression Trees. Wadsworth International, Belmont.
Cherkassky, V., Krasnopolsky, V., Solomatine, D. P. & Valdes, J. (Eds.) 2006 Computational intelligence in earth sciences and environmental applications. Neural Networks J. 19 (2).
Corzo, G. & Solomatine, D. P. 2007 Baseflow separation techniques for modular artificial neural network modelling in flow forecasting. Hydrol. Sci. J. 52 (3), 491–507.
Dawson, C. W. & Wilby, R. 1998 An artificial neural network approach to rainfall-runoff modelling. Hydrol. Sci. J. 43 (1), 47–66.
Dibike, Y., Solomatine, D. P. & Abbott, M. B. 1999 On the encapsulation of numerical-hydraulic models in artificial neural network. J. Hydraul. Res. 37 (2), 147–161.
Dibike, Y. B., Velickov, S., Solomatine, D. P. & Abbott, M. B. 2001 Model induction with support vector machines: introduction and applications. ASCE J. Comput. Civil Engng. 15 (3), 208–216.
Diskin, M. H. 1964 A basic study of the linearity of the rainfall–runoff process in watersheds. PhD thesis, University of Illinois, Urbana, Champaign.
Diskin, M. H. & Boneh, A. 1975 Determination of an optimal IUH for linear time invariant systems from multi-storm records. J. Hydrol. 24, 57–76.
Diskin, M. H., Wyseure, G. & Feyen, J. 1984 Application of a cell model to the Bellebeek watershed. Nordic Hydrol. 15, 25–38.
Dooge, J. C. I. 1959 A general theory of the unit hydrograph. J. Geophys. Res. 64 (2), 241–256.
Eagleson, P. S., Mejia, R. & March, F. 1966 Computation of optimum realizable unit hydrographs. Wat. Res. Res. 2 (4), 755–764.
Efron, B. & Tibshirani, R. J. 1993 An Introduction to the Bootstrap. Chapman & Hall, New York.
Falconer, R., Lin, B. & Harpin, R. 2005 Environmental modelling in river basin management. J. River Basin Mngmnt. 3 (3), 169–184.
Freund, Y. & Schapire, R. 1997 A decision-theoretic generalisation of on-line learning and an application of boosting. J. Comput. System Sci. 55 (1), 119–139.
Galeati, G. 1990 A comparison of parametric and non-parametric methods for runoff forecasting. Hydrol. Sci. J. 35 (1), 79–94.
Gaume, E. & Gosset, R. 2003 Over-parameterisation, a major obstacle to the use of artificial neural networks in hydrology? Hydrol. Earth Syst. Sci. 7 (5), 693–706.
Giustolisi, O., Doglioni, A., Savic, D. A. & di Pierro, F. 2007a An evolutionary multiobjective strategy for the effective management of groundwater resources. Wat. Res. Res. doi:10.1029/2006WR005359.
Giustolisi, O., Doglioni, A., Savic, D. A. & Webb, B. W. 2007b A multi-model approach to analysis of environmental phenomena. Environ. Modell. Syst. J. 22 (5), 674–682.
Giustolisi, O. & Savic, D. A. 2006 A symbolic data-driven technique based on evolutionary polynomial regression. J. Hydroinf. 8 (3), 207–222.
Govindaraju, R. S. & Ramachandra Rao, A. (Eds.) 2001 Artificial Neural Networks in Hydrology. Kluwer, Dordrecht.
Haith, D. A. & Shoemaker, L. L. 1987 Generalized watershed loading functions for stream flow nutrients. Wat. Res. Bull. 23 (3), 471–478.
Hall, M. J. & Minns, A. W. 1999 The classification of hydrologically homogeneous regions. Hydrol. Sci. J. 44, 693–704.
Han, D., Kwong, T. & Li, S. 2007 Uncertainties in real-time flood forecasting with neural networks. Hydrol. Process. 21 (2), 223–228.
Hannah, D. M., Smith, B. P. G., Gurnell, A. M. & McGregor, G. R. 2000 An approach to hydrograph classification. Hydrol. Process. 14 (2), 317–338.
Harris, N. M., Gurnell, A. M., Hannah, D. M. & Petts, G. E. 2000 Classification of river regimes: a context for hydrogeology. Hydrol. Process. 14, 2831–2848.
Haykin, S. 1999 Neural Networks: A Comprehensive Foundation. McMillan, New York.
Hsu, K. L., Gupta, H. V., Gao, X., Sorooshian, S. & Imam, B. 2002 Self-organizing linear output map (SOLO): an artificial neural network suitable for hydrologic modeling and analysis. Wat. Res. Res. 38 (12), 1–17.
Hsu, K. L., Gupta, H. V. & Sorooshian, S. 1995 Artificial neural network modelling of the rainfall-runoff process. Wat. Res. Res. 31 (10), 2517–2530.
Hu, T., Wu, F. & Zhang, X. 2007 Rainfall–runoff modeling using principal component analysis and neural network. Nordic Hydrol. 38 (3), 235–248.
Jain, A. & Srinivasulu, S. 2006 Integrated approach to model decomposed flow hydrograph using artificial neural network and conceptual techniques. J. Hydrol. 317, 291–306.
Jordan, M. I. & Jacobs, R. A. 1995 Modular and hierarchical learning systems. In The Handbook of Brain Theory and Neural Networks (ed. M. Arbib). MIT Press, Cambridge, MA.
Karlsson, M. & Yakowitz, S. 1987 Nearest neighbour methods for non-parametric rainfall runoff forecasting. Wat. Res. Res. 23 (7), 1300–1308.
Keijzer, M. & Babovic, V. 2002 Declarative and preferential bias in GP-based scientific discovery. Genetic Programming and Evolvable Machines 3 (1), 41–79.
Khu, S.-T., Savic, D., Liu, Y. & Madsen, H. 2004 A fast evolutionary-based meta-modelling approach for the calibration of a rainfall-runoff model. In: Trans. 2nd Biennial Meeting of the International Environmental Modelling and Software Society, iEMSs, Manno, Switzerland. Available online: http://www.iemss.org/iemss2004/pdf/evocomp/khuafas.pdf
Kim, T., Heo, J.-H. & Jeong, C.-S. 2006 Multireservoir system optimization in the Han River basin using multi-objective genetic algorithms. Hydrol. Process. 20 (9), 2057–2075.
Kompare, B., Steinman, F., Cerar, U. & Dzeroski, S. 1997 Prediction of rainfall runoff from catchment by intelligent data analysis with machine learning tools within the artificial intelligence tools. Acta Hydrotech. 16/17, 79–94 (in Slovene).
Kosko, B. 1997 Fuzzy Engineering. Prentice-Hall, Englewood Cliffs, NJ.
Laucelli, D., Giustolisi, O., Babovic, V. & Keijzer, M. 2007 Ensemble modeling approach for rainfall/groundwater balancing. J. Hydroinf. 9 (2), 95–106.
Lekkas, D. F., Imrie, C. E. & Lees, M. J. 2001 Improved non-linear transfer function and neural network methods of flow routing for real-time forecasting. J. Hydroinf. 3 (3), 153–164.
Liong, S. Y. & Sivapragasam, C. 2002 Flood stage forecasting with SVM. J. AWRA 38 (1), 173–186.
Lobbrecht, A. H. & Solomatine, D. P. 1999 Control of water levels in polder areas using neural networks and fuzzy adaptive systems. In Water Industry Systems: Modelling and Optimization Applications (ed. D. Savic & G. Walters), pp. 509–518. Research Studies Press, Baldock.
Maskey, S., Guinot, V. & Price, R. K. 2004 Treatment of precipitation uncertainty in rainfall-runoff modelling: a fuzzy set approach. Adv. Wat. Res. 27 (9), 889–898.
MIKE SHE 2006 MIKE SHE Integrated Modelling Tool. Available at: http://www.dhisoftware.com/mikeshe/ (last accessed on 1 July 2006).
Minns, A. W. & Hall, M. J. 1996 Artificial neural network as rainfall-runoff model. Hydrol. Sci. J. 41 (3), 399–417.
Mitchell, T. M. 1997 Machine Learning. McGraw-Hill, New York.
Montanari, A. & Brath, A. 2004 A stochastic approach for assessing the uncertainty of rainfall-runoff simulations. Wat. Res. Res. 40, W01106, doi:10.1029/2003WR002540.
Moradkhani, H., Hsu, K. L., Gupta, H. V. & Sorooshian, S. 2004 Improved streamflow forecasting using self-organizing radial basis function artificial neural networks. J. Hydrol. 295 (1), 246–262.
Muleta, M. K. & Nicklow, J. W. 2004 Joint application of artificial neural networks and evolutionary algorithms to watershed management. J. Wat. Res. Mngmnt. 18 (5), 459–482.
Nash, J. E. 1957 The form of the instantaneous unit hydrograph. IASH Publ. 45 (3), 114–121.
Nor, N. A., Harun, S. & Kassim, A. H. 2007 Radial basis function modeling of hourly streamflow hydrograph. J. Hydrol. Engng. 12 (1), 113–123.
Ostfeld, A. & Preis, A. 2005 A data driven model for flow and contaminants runoff predictions in watersheds. In River Basin Restoration and Management (ed. A. Ostfeld & J. M. Tyson), pp. 62–70. Water and Environmental Management Series. IWA Publishing, London.
Ostfeld, A. & Salomons, S. 2005 A hybrid genetic–instance based learning algorithm for CE-QUAL-W2 calibration. J. Hydrol. 310, 122–142.
Pesti, G., Shrestha, B. P., Duckstein, L. & Bogárdi, I. 1996 A fuzzy rule-based approach to drought assessment. Wat. Res. Res. 32 (6), 1741–1747.
Phoon, K. K., Islam, M. N., Liaw, C. Y. & Liong, S. Y. 2002 A practical inverse approach for forecasting nonlinear hydrological time series. ASCE J. Hydrol. Engng. 7 (2), 116–128.
Preis, A., Tubaltzev, A. & Ostfeld, A. 2006 Kinneret Watershed Analysis Tool (KWAT) - a cell based decision tree model for watershed flow and pollutants predictions. Wat. Sci. Technol. 53 (10), 29–35.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. 2007 Numerical Recipes: The Art of Scientific Computing, 3rd edn. Cambridge University Press, Cambridge.
Quinlan, J. R. 1986 Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J. R. 1992 Learning with continuous classes. In Proc. AI’92, 5th Australian Joint Conference on Artificial Intelligence (ed. A. Adams & L. Sterling), World Scientific, Singapore. pp. 343–348.
RIBASIM 2006 RIBASIM River Basin Planning and Management Tool. Available at: http://www.wldelft.nl/soft/ribasim/int/index.html (last accessed on 1 July 2006).
Savic, D. 2005 Evolutionary computing in hydro-geological systems. In Encyclopedia of Hydrological Sciences, vol 1. (ed. M. G. Andersen), John Wiley & Sons, New York, doi:10.1002/0470848944.hsa016.
See, L. & Openshaw, S. 2000 A hybrid multi-model approach to river level forecasting. Hydrol. Sci. J. 45 (3), 523–536.
See, L. A., Solomatine, D. P., Abrahart, R. & Toth, E. 2007 Hydroinformatics: computational intelligence and technological developments in water science applications – Editorial. Hydrol. Sci. J. 52 (3), 391–396.
Shamseldin, A. Y. & O’Connor, K. M. 1996 A nearest neighbour linear perturbation model for river flow forecasting. J. Hydrol. 179, 353–375.
Shamseldin, A. Y. & O’Connor, K. M. 2001 A non-linear neural network technique for updating of river flow forecasts. Hydrol. Earth Syst. Sci. 5 (4), 557–597.
Sherman, L. K. 1932 Stream flow from rainfall by the unit graph method. Engng. News-Record 108, 501–505.
Shrestha, D. L. & Solomatine, D. P. 2006a Experiments with AdaBoostRT, an improved boosting scheme for regression. Neural Comput. 2006, 17.
Shrestha, D. L. & Solomatine, D. P. 2006b Machine learning approaches for estimation of prediction interval for the model output. Neural Networks 19, doi:10.1016/j.neunet.2006.01.012.
Solomatine, D. P. & Dulal, K. N. 2003 Model tree as an alternative to neural network in rainfall-runoff modelling. Hydrol. Sci. J. 48 (3), 399–411.
Solomatine, D. P., Rojas, C., Velickov, S. & Wust, H. 2000 Chaos theory in predicting surge water levels in the North Sea. In: Proc. 4th Int. Conf. on Hydroinformatics, Cedar Rapids.
Solomatine, D. P., Shrestha, D. L. & Maskey, M. 2007 Instance-based learning compared to other data-driven methods in hydrological forecasting. Hydrol. Process. doi:10.1002/hyp.6592.
Solomatine, D. P. & Siek, M. B. 2004 Flexible and optimal M5 model trees with applications to flow predictions. In: Proc. 6th Int. Conf. on Hydroinformatics. World Scientific, Singapore.
Solomatine, D. P. & Siek, M. B. 2006 Modular learning models in forecasting natural phenomena. Neural Networks J. 19 (2), 225–235.
Solomatine, D. P. & Torres, L. A. 1996 Neural network approximation of a hydrodynamic model in optimizing reservoir operation. In: Proc. 2nd Int. Conf. on Hydroinformatics, Balkema, Rotterdam. pp. 201–206.
Solomatine, D. P. & Xue, Y. 2004 M5 model trees and neural networks: application to flood forecasting in the upper reach of the Huai River in China. ASCE J. Hydrol. Engng. 9 (6), 491–501.
Stravs, L., Brilly, M. & Sraj, M. 2006 Precipitation interception modelling using machine learning methods – the Dragonja River basin case study. In: Hydroinformatics in Practice: Computational Intelligence and Technological Developments in Water Applications (ed. B. Abrahart, L. M. See & D. P. Solomatine), Springer-Verlag, Berlin.
Sudheer, K. P. & Jain, S. K. 2003 Radial basis function neural network for modeling rating curves. ASCE J. Hydrol. Engng. 8 (3), 161–164.
SWAT 2006 SWAT Soil and Water Assessment Tool. Available at: http://www.brc.tamus.edu/swat/ (last accessed on 1 July 2006).
Toth, E., Brath, A. & Montanari, A. 2000 Comparison of short-term rainfall prediction models for real-time flood forecasting. J. Hydrol. 239, 132–147.
Vapnik, V. N. 1998 Statistical Learning Theory. John Wiley & Sons, New York.
Velickov, S., Solomatine, D. & Price, R. K. 2003 Prediction of nonlinear dynamical systems based on time series analysis: issues of entropy, complexity and predictability. In: Proc. of the XXX IAHR Congress. Thessaloniki, Greece.
Velickov, S., Solomatine, D. P., Yu, X. & Price, R. K. 2000 Application of data mining techniques for remote sensing image analysis. In: Proc. 4th Int. Conf. on Hydroinformatics, Cedar Rapids.
Vernieuwe, H., Georgieva, O., De Baets, B., Pauwels, V. R. N., Verhoest, N. E. C. & De Troch, F. P. 2005 Comparison of data-driven Takagi–Sugeno models of rainfall–discharge dynamics. J. Hydrol. 302 (1–4), 173–186.
Wang, W., van Gelder, P. H. A. J. M., Vrijling, J. K. & Ma, J. 2006 Forecasting daily streamflow using hybrid ANN models. J. Hydrol. 324 (1–4), 383–399.
Werbos, P. J. 1974/1994 The Roots of Backpropagation. John Wiley & Sons (includes Werbos’s 1974 Harvard Ph.D. thesis, Beyond Regression).
Witten, I. H. & Frank, E. 2000 Data Mining. Morgan Kaufmann, San Mateo, CA.
Xiong, L. H., Shamseldin, A. Y. & O’Connor, K. M. 2001 A non-linear combination of the forecasts of rainfall–runoff models by the first-order Takagi-Sugeno fuzzy system. J. Hydrol. 245 (1–4), 196–217.
Zadeh, L. A. 1965 Fuzzy sets. Inf. Control 8, 338–353.