Tucker Et Al 2012 Bioinformatics Tools in Predictive Ecology Applications To Fisheries


Phil. Trans. R. Soc. B (2012) 367, 279–290
doi:10.1098/rstb.2011.0184

Research

Bioinformatics tools in predictive ecology: applications to fisheries
Allan Tucker1,* and Daniel Duplisea2
1 School of Information Systems, Computing and Maths, Brunel University, Uxbridge, Middlesex UB8 3PH, UK
2 Fisheries and Oceans Canada, Institut Maurice-Lamontagne, Mont Joli, Quebec, Canada
There has been a huge effort in the advancement of analytical techniques for molecular biological
data over the past decade. This has led to many novel algorithms that are specialized to deal with
Downloaded from https://royalsocietypublishing.org/ on 06 August 2024

data associated with biological phenomena, such as gene expression and protein interactions. In
contrast, ecological data analysis has remained focused to some degree on off-the-shelf statistical
techniques though this is starting to change with the adoption of state-of-the-art methods, where
few assumptions can be made about the data and a more explorative approach is required, for
example, through the use of Bayesian networks. In this paper, some novel bioinformatics tools
for microarray data are discussed along with their ‘crossover potential’ with an application to
fisheries data. In particular, we focus on the development of models that identify functionally
equivalent species in different fish communities with the aim of predicting functional collapse.
Keywords: bioinformatics; Bayesian networks; classification; dynamic models;
fisheries management

1. INTRODUCTION

Bioinformatics has revolutionized the way we analyse molecular biological data. Owing to the explosion in data collection and storage made available since the dawn of parallel sequencing, there has been a demand for specialist techniques to analyse and model data such as microarray experiments, which measure the expression of thousands of genes simultaneously. The advance of research in fields including machine learning [1], data mining [2] and intelligent data analysis [3,4] has resulted in many novel tools for the analysis of such data. In bioinformatics, techniques such as clustering were initially extremely popular for identifying groups of genes with similar expression profiles [5,6]. This allowed biologists to identify the function of previously unknown genes through ‘guilt by association’. It also allowed these groups to be treated as single modules [7,8] in order to reduce the massive number of variables when building models for prediction. Classification of disease outcome [9] has also been very popular with many approaches being developed, including methods to identify relevant biomarkers through feature selection [10]. Modelling time-series microarray data has been useful in understanding the underlying dynamics of microarray time-series, and cell-cycle data have been a popular topic of study [11]. One particular development in these areas is the adoption of graph-based models in the form of genetic regulatory networks (GRNs) [12,13]. These approaches allow biologists to explore the complexities of gene interaction on a large scale and therefore take a systems approach to modelling.

In contrast, ecological data analysis has been rather less explorative to date when compared with bioinformatics and systems biology. There are of course exceptions, and in the study of Hochachka et al. [14] a discussion of the potential of using data-mining techniques is explored for situations where there is little or no prior knowledge about a system. In this paper, we investigate the cross-over potential of techniques used in bioinformatics, such as feature selection, classification, Bayesian networks (BNs) and in particular an adaptation of an algorithm that we previously developed for exploiting the availability of multiple datasets. This is applied to fisheries data in order to identify species that perform similar functional roles in different fish communities. These equivalent species are used to predict functional collapse in their respective regions through the use of dynamic Bayesian models with latent variables.

In the remainder of this section, BNs are introduced in the context of bioinformatics research, and recent relevant work on specialist bioinformatics techniques that have cross-over potential is discussed. The use of these techniques applied to ecological data is also discussed with a focus on fisheries. In §2, the fisheries data and the ‘functional equivalence’ algorithm are described. Results in §3 demonstrate how models learned from data in one region can be used to identify and predict the biomass of ‘functionally similar’ species and as a result, the functional collapse in other regions. Finally, the use of the techniques explored in this paper (namely, BNs for feature selection and classification, the functional equivalence algorithm and dynamic models

* Author for correspondence ([email protected]).
One contribution of 16 to a Discussion Meeting Issue ‘Predictive ecology: systems approaches’.
This journal is © 2011 The Royal Society
280 A. Tucker & D. Duplisea Bioinformatics tools in ecology

[Figure 1 graphic. (a) Five-node network in which X1 and X2 are parents of X3, and X3 is the parent of X4 and X5, with local distributions p(X1), p(X2), p(X3 | X1, X2), p(X4 | X3) and p(X5 | X3). (b) Class node C with nodes X1, X2, X3, ..., XN. (c) Nodes X1, ..., XN at time slices t – 1 and t. (d) Hidden node H and nodes X1, ..., XN at time slices t – 1 and t.]

Figure 1. (a) A Bayesian network (BN) that encodes a joint distribution using a graphical structure and local conditional
distributions. Links between variables represent conditional independences. (b) A BN classifier where C denotes a class
node to predict. (c) A dynamic BN where nodes represent variables at a point in time and (d) a hidden Markov model,
where H denotes an unmeasured (hidden or latent) variable.
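
The factorization shown in figure 1a can be made concrete with a small numerical sketch. The code below assumes binary variables and uses invented probability values (they are not taken from any dataset in this paper); the helper names `joint` and `posterior` are hypothetical. It multiplies the five local distributions to obtain the joint probability and answers a ‘what if?’ query by brute-force enumeration, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical probability tables for binary variables (values 0 or 1).
P_X1 = {1: 0.3, 0: 0.7}
P_X2 = {1: 0.6, 0: 0.4}
P_X3_ON = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # p(X3=1 | X1, X2)
P_X4_ON = {0: 0.2, 1: 0.8}  # p(X4=1 | X3)
P_X5_ON = {0: 0.3, 1: 0.7}  # p(X5=1 | X3)

NAMES = ("X1", "X2", "X3", "X4", "X5")


def bern(p_on, value):
    """Probability of a binary value given the probability of value 1."""
    return p_on if value == 1 else 1.0 - p_on


def joint(x1, x2, x3, x4, x5):
    """Joint probability as the product of the local distributions in figure 1a."""
    return (P_X1[x1] * P_X2[x2] * bern(P_X3_ON[(x1, x2)], x3)
            * bern(P_X4_ON[x3], x4) * bern(P_X5_ON[x3], x5))


def posterior(query, evidence):
    """p(query = 1 | evidence), summing the joint over all consistent states."""
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        state = dict(zip(NAMES, values))
        if any(state[name] != value for name, value in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if state[query] == 1:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(posterior("X3", {}))          # prior belief that X3 is on
    print(posterior("X3", {"X4": 1}))   # belief after observing X4 on
```

Entering evidence on a child (X4) changes the belief in its parent (X3), which is the kind of inference described in the text on entering evidence and inspecting the posterior distribution.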

with latent variables) are discussed in §4 in terms of the wider ecological literature.

(a) Bayesian networks for bioinformatics

BNs have become a popular method for computational modelling of GRNs from microarray expression data [15–17]. A BN describes the joint distribution (which is a way of assigning probabilities to every possible outcome over a set of variables, X1 ... XN) by exploiting conditional independence relationships represented by a directed acyclic graph (DAG). See figure 1a for an example of a BN with five nodes. Each node in the DAG is characterized by a state which can change depending on the state of other nodes and information about those states propagated through the DAG. This kind of inference facilitates the ability to ask ‘what if?’ questions of the data by entering evidence (changing a state or confronting the DAG with new data) into the network, applying inference and inspecting the posterior distribution (which represents the distributions of the variables given the observed evidence). For example, one could ask, what is the probability of seeing gene A ‘switch on’ (through high expression) given that we have observed a low expression in genes B and C?

There are numerous ways to infer both network structure and parameters from data. Constraint-based approaches such as the PC [18] and IC* [19] algorithms both work by applying independence tests between variables and building networks that reflect these independences. However, these do not scale well for high-dimensional datasets and are prone to getting stuck in local minima. Search-and-score methods to infer BNs from data have been used frequently in learning GRNs [15]. These methods involve performing a search through the space of possible networks and scoring each structure. A variety of search strategies can be used [20–23]. BNs are capable of performing many data analysis tasks including feature selection and classification (performed by treating one node as a class node and allowing the structure learning to select relevant features [24] (figure 1b)). Modelling time series can be achieved by using an extension of the BN known as the dynamic Bayesian network (DBN) [25,26], where nodes represent variables at particular time slices (figure 1c). Closely related to the DBN is the Hidden Markov Model (HMM) which models the dynamics of a dataset through the use of a latent variable [27]. This latent variable is used to infer some underlying state of the series and, through an autoregressive link, can capture relationships of a higher order (figure 1d).

BNs offer a natural mechanism for incorporating prior knowledge relating to the network structure through informative structure priors [28]. There has been substantial work in using priors to build more robust GRNs. Steele et al. [22] use concept profiles learned from abstracts in the biological literature (Medline) to bias BN learning algorithms and found that lesser studied systems generally gain more from updating priors with new data. Imoto et al. [29] use energy functions to incorporate prior knowledge sources

including literature-based knowledge extracted from regulatory interactions that are recorded in the Yeast Proteome Database (YPD). In the study of Werhli & Husmeier [30], the approach was extended to multiple sources of prior knowledge, applied to combining protein–protein interactions and pathways from KEGG (Kyoto Encyclopedia of Genes and Genomes) with expression data.

(b) Consensus and functional models

Comparing apparently similar multivariate datasets is often problematic owing to differences in collection methods. This is often the case for microarray data which have methodological and laboratory dependencies [31] and similar issues occur with ecological community data collected for different systems. Though data normalization is the logical solution to such problems, it is neither straightforward nor a complete solution [32,33]. A post-learning aggregation framework called consensus BNs was developed for microarray datasets [34] to overcome some of these issues by combining datasets generated by different platforms, research groups and laboratories without requiring normalization. In this framework, learnt models that are generated from each dataset are aggregated, producing a combined model that represents prominent features which occur in all, or a subset of, the individual dataset models. The problem with this approach is the need to pre-select higher quality datasets to prevent the ‘dumbing down’ of networks from lower quality data resulting in an ‘average’ network rather than a ‘best-of’. A reliable method to identify these higher quality datasets prior to the consensus algorithm was found to be the predictive accuracy of models learned from one dataset and tested on other available independent sets [35]. This approach resulted in consensus models that were consistently more parsimonious to biologically validated networks and was extended by Anvar et al. [36,37]. It is this idea of exploiting independent datasets that shapes the work in this paper.

In summary, the success of bioinformatics methods such as feature selection, classifiers and HMMs has led to many novel discoveries including the identification of biomarkers, the prediction of disease outcome and GRNs built at a systems level. What is more, the exploitation and integration of multiple data sources allow more robust regulatory mechanisms to be identified and predictions to be made across very different platforms and organisms. We now demonstrate the transfer of some of these methods to ecological data with an application in fisheries.

(c) Fisheries and ecoinformatics

In this paper, the focus is on the application of bioinformatics techniques described in §1 to biomass data from Georges Bank (GB in figure 2), the East Scotian Shelf (ESS) and the North Sea (NS) between the years 1960 and 2007. Data are typically noisy and there are similar data quality issues as found with many microarray datasets. There are also multiple studies carried out throughout the world and prior expertise available, much as with bioinformatics datasets. For example, food webs that describe predator–prey and competitor species are available. Some of these are more detailed than others and may include the results of stomach surveys [38], where the diet of specific species can be determined.

[Figure 2 graphic: map with axes longitude (º), –60 to 0, and latitude (º), 40 to 60, marking the locations of NS, ESS and GB.]

Figure 2. Georges Bank (GB), the East Scotian Shelf (ESS) and the North Sea (NS), the focus of the empirical analysis.

The experiments carried out in this paper focus on cod biomass. Some spectacular collapses in fish stocks have occurred in the past 20 years but the most notable is the once largest cod stock in the world, the Northern cod stock off eastern Newfoundland, which experienced a 99 per cent decline in biomass. Cod, unfortunately, is not alone and there are stocks of various species that have been reduced to only a small percentage of stock sizes in recent history. Much of this effect is due to direct mortality on fish through fishing and subsequent indirect effects and weak linkages with other species. Some of these regions may have moved to an ‘alternative stable state’ or experienced a ‘regime shift’ and are unlikely to return to a cod-dominated community without some chance event beyond human control [39].

Different species may have similar functional roles within a system depending on the region. For example, one species may act as a predator of another which regulates a population in one location, but another species may perform an almost identical role in another location. If we can model the function of the interaction rather than the species itself, data from different regions can be used to confirm key functional relationships, to generalize over systems and to predict impacts of forces such as fishing and climate change. The approach concerns functional network topology and avoids the necessity of describing the specifics of network nodes. For example, the ‘wasp waist’ (WW) is a common structure present in many temperate and boreal fish community food webs [40]. The WW functional structure is characterized by few or just one mid-trophic level species preying upon several lower trophic level species, while several high trophic species prey upon the mid-trophic level species. In this way, energy flow from low to high trophic level species is constricted at the mid-trophic level species analogous to a WW. These WW species exert undue influence on aquatic community structure by top-down control of lower trophic levels through predation and bottom-up control of higher trophic levels by restricting energy flow. The WW effect is found in populations in the northwest Atlantic and the northeast Atlantic. This functional structure is identical in the two regions but the species involved are different (in the northwest Atlantic one of


the WW is the capelin and in the northeast one is known to be the sand eel).

This focus on critical sub-systems through the exploration of functionally equivalent species across different populations is a novel approach to fish population modelling. This approach to modelling fish populations will explore functional relationships (such as predator, prey and WW) that are generalizable between different oceanic regions allowing more robust models to be built and predictions to be made about future biomass. There is some research into using BNs for ecological modelling [41] and in particular for modelling fish populations [42,43]. There is also considerable literature on integrating heterogeneous data within the data-warehousing community including the environmental data [44], but no exploration of integrating or comparing different variables under a single function as we do with species. In the study of Thrush et al. [45], functions are explored by investigating weights in hidden nodes of neural network models. Here, we focus on DBNs with latent variables that can be used in conjunction with human expertise to predict functional collapse in different dynamic systems.

A number of questions are posed based upon fish interaction:

— Can we use bioinformatics-style analysis (in particular, feature selection) to identify species that are relevant to some event such as cod functional collapse?
— Can we model the temporal and dynamic nature of fish interactions?
— Can we identify species in different oceans that perform similar functions, and therefore predict functional collapse in their respective regions?

Techniques such as those described in §1 will be employed to answer these questions within a BN framework. In particular, a novel algorithm, the functional equivalence search, is introduced to make inferences between the different geographical systems.

2. MATERIAL AND METHODS

(a) Data description

GB fish community data come from the National Marine Fisheries Service autumn multi-species trawl survey from 1963 to 2008. About 80 randomly selected stations were sampled on GB each year and annual averages of biomass of each species were calculated and used in this analysis. About 220 species have been caught in the survey but most infrequently and with low statistical power; therefore, analyses were confined to a subset of 39 species filtered from the dataset for which we have confidence in their quantitative estimates of abundance each year. ESS and NS data were collected via a similar methodology as on GB and this resulted in subsets of 34 and 45 species, respectively. The sources of these datasets are outlined in the acknowledgements of this paper.

GB is a relatively small productive fishing bank historically supporting large catches of common groundfish such as cod and haddock and also with a very valuable sea scallop fishery. Fish on GB tend to have ideal growing conditions and mature quickly. GB is relatively self-contained with deep channels to the northeast and ocean currents containing waters on the bank giving the region a distinct character. However, the GB community does have seasonal migrants such as mackerel and dogfish which affect the community. Drastic changes occurred on GB in the late 1980s, where groundfish were much less abundant. We have termed 1988 as the collapse year for GB. The ESS, though geographically not far from GB, is a much different system with lower productivity and diversity, and is more open to both the northwest and the southwest biologically and oceanographically. A key characteristic of the ESS is the presence of a small sandy arc 200 km offshore called Sable Island, which is the largest grey seal breeding colony in the world and has been growing exponentially since the mid-1980s. The ESS showed drastic declines in cod and some other groundfish in the early 1990s to almost undetectable levels. We consider 1992 to be the collapse year for ESS. The NS is a shallow warm sea with high fish community diversity and productive multi-species fisheries. The NS has supported very large groundfish and pelagic fisheries and despite extremely high fishing pressure, it is difficult to see a sudden change in the system that might be termed a collapse as seen in GB and ESS. The NS fish community always seems to respond positively to curtailment of fishing effort, while the equivalent is not true for GB and ESS.

(b) Experiments

The experiments undertaken in this paper involve applying classification. This involves the prediction of a pre-selected variable (here functional collapse) based on the values of other variables (here species biomass). Feature selection is used to identify the relevant species for optimal classification. There are two approaches to feature selection: filter selection that simply scores variables (species) independently, and wrapper selection that builds models and selects combinations of variables (thus identifying interactions between them). These experiments adopt the BN classifier approach, where the class node is a binary variable that represents functional collapse in GB. The K2 search algorithm [20] is used to build the BN classifiers. This involves a greedy search technique where links are incrementally added to an initially unconnected graph and scored using the metric given in equation (2.1), where n is the number of nodes, F_ijk is the frequency of occurrences in the dataset that the node x_i takes on the value v_ik (where there are r_i possible instantiations) and the parent nodes π_i take on the instantiation w_ij (where there are q_i possible instantiations). This metric is based on equation (2.2), which calculates the probability of observing a structure G and a set of data D, p(G,D), where c is a constant prior probability p(G). For simplicity, we assume a step change in functional structure in 1988 for GB data and 1992 for ESS. Further work will explore using hidden variables with more states and continuous variables to explore intermediate stages prior to collapse. A bootstrap [46] approach is employed to repeat the following 1000 times:

1. Score each species using the likelihood score given in equation (2.1) and take the mean over the bootstrap. This is known as filter feature selection [10] and scores each variable independently.
2. Learn BN structure with the (greedy) K2 algorithm and score the proportion of times that links are associated with the class node during the bootstrap (the confidence). This is known as wrapper feature selection [10] and scores each variable by taking into account their interaction with other variables through the use of a classifier model.

$$\sum_{i=1}^{n}\sum_{j=1}^{q_i}\left[\log\frac{(r_i-1)!}{(F_{ij}+r_i-1)!}+\sum_{k=1}^{r_i}\log F_{ijk}!\right] \tag{2.1}$$

$$\max_{G}\,[\,p(G,D)\,]=c\prod_{i=1}^{n}\max_{\pi_i}\left[\prod_{j=1}^{q_i}\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}!\right] \tag{2.2}$$

and

$$\log p(G\mid D)\approx\log p(D\mid G,\hat{\theta}_G)-\frac{\log M}{2}\,\mathrm{Dim}_G \tag{2.3}$$

We rank species based upon these two feature selection approaches and examine their relevance to functional collapse in GB. In order to explore the functionally equivalent species in the NS and ESS data, we use species identified using feature selection from GB in conjunction with the functional equivalence search algorithm (which is fully documented in algorithm 1). This is applied to both the NS and ESS to identify equivalent species. Finally, we use dynamic models, specifically DBNs (as seen in figure 1d) but with a single dynamic hidden variable to identify functional collapse. These networks are built from the GB data (using the REVEAL algorithm [47], which is a greedy search applied to DBNs) to predict cod biomass and functional collapse (using the hidden variable).

The functional equivalence algorithm uses a simulated annealing approach [48] to search for an optimal combination of variables that fit the given function. This is where a random allocation of selected variables is initialized and scored. Within each iteration, a single replacement is made to the selected variables and the new selection is scored. Here, we demonstrate the approach using a BN model, where the given function is in the form of a predefined BN structure, BN1, and set of variables, vars1, that is parametrized from a dataset, data1. This model is then used to search for the variables in another dataset, data2, that fit best. The algorithm gives as output the set of variables that best fits the given model. We use the Bayesian Information Criterion which penalizes overly connected networks to avoid overfitting. It is given in equation (2.3), where M is the number of samples, Dim_G is the dimension of the model, and θ̂_G is the maximum-likelihood estimate of the parameters. The first term is essentially the log-likelihood and the second is a penalty for model complexity. We set iterations = 1000 and tstart = 1000 as these were found through experimentation to allow convergence to a good solution.

Algorithm 1. The functional equivalence search algorithm.

Input: tstart, iterations, data1, data2, vars1, BN1
Parametrize Bayesian Network, BN1 from data1
Generate randomly selected variables in data2: vars2
Use vars2 to score the fit with selected model BN1 using equation 2.2: score
Set bestscore = score
Set initial temperature: t = tstart
for i = 1 to iterations do
    Randomly replace one selected variable in data2 and rescore using equation 2.2: rescore
    dscore = rescore − bestscore
    if dscore ≥ 0 OR UnifRand(0,1) < exp(dscore/t) then
        bestscore = rescore
    else
        Undo variable switch in vars2
    end if
    Update the temperature: t = t × 0.9
end for
Output: vars2

3. RESULTS

Figure 3 displays the rankings for filter and wrapper feature selection for differentiating between pre- and post-functional collapse in GB (1988). From both feature selection approaches, it is clear that there are a relatively small number of key players in this collapse and these are known to be involved with cod. For example, the likelihood approach strongly implicates two zooplankton species (Calanus and Pseudocalanus) as key to the functional collapse and it is known from other sources that there were relatively large changes then [49], and these changes can have bottom-up effects which affect species such as cod [50]. Herring (Clupea harengus) was also identified as a key species and its abundance changes in the late 1980s may have changed the predation environment of juvenile cod whose recruitment to adult stages may, in some systems, be significantly controlled by herring abundance [51]. Thorny skate (Amblyraja radiata) became more abundant at the time of the cod collapse on GB and although some attribute this to an ecosystem regime shift [52] others attribute this to immigration from the ESS [53].

Using the higher ranking species from figure 3, a DBN model was built with a hidden node using the REVEAL algorithm (see §2) to confirm how predictable both cod biomass and the unobserved functional collapse were from the related species. Figure 4 plots these results and shows that a reasonable fit to the GB data is achievable. What is more, the hidden state identifies a noisy underlying process which appears to stabilize somewhere in the late-1980s correlating with the expected functional collapse.

The confidences resulting from the Bayesian wrapper method applied to GB showed a quick decline with species rank, such that thorny skate was the most important species implicated in the decline. When this structure is imposed on the ESS and NS using the functional equivalence search, a small number of functionally equivalent species are identified in both the ESS and the NS with high confidence (figure 5). An interesting thing to note was the species/processes

[Figure 3 graphic: bar charts with one bar per species or catch variable. (a) Vertical axis: log-likelihood, –39 to –25. (b) Vertical axis: confidence, 0 to 1.0.]

Figure 3. Features selected from GB data using a bootstrap on (a) filter selection using log-likelihood and (b) a Bayesian
network classifier wrapper.
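
The filter ranking in panel (a) can be illustrated with a short sketch. It is not the authors' implementation: it assumes each species' biomass has already been discretized (for example into low/high), treats each species in turn as the single parent of the binary collapse class, and scores it with the logarithm of the bracketed K2 term of equation (2.2), averaged over bootstrap resamples of the years. The function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def log_k2_score(child, parent, r_child=2):
    """Log of the K2 term for one child node with a single discrete parent.

    For each parent state j: log[(r-1)! / (N_j + r - 1)!] + sum_k log(N_jk!),
    computed with lgamma, since lgamma(m + 1) = log(m!).
    """
    n_j = Counter(parent)
    n_jk = Counter(zip(parent, child))
    score = 0.0
    for count in n_j.values():
        score += math.lgamma(r_child) - math.lgamma(count + r_child)
    for count in n_jk.values():
        score += math.lgamma(count + 1)
    return score


def bootstrap_filter_scores(features, cls, n_boot=1000, rng=None):
    """Mean bootstrap score per feature, sorted from best to worst."""
    rng = rng or random.Random()
    n = len(cls)
    totals = {name: 0.0 for name in features}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample years with replacement
        boot_cls = [cls[i] for i in idx]
        for name, values in features.items():
            totals[name] += log_k2_score(boot_cls, [values[i] for i in idx])
    means = {name: total / n_boot for name, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    collapse = [0] * 10 + [1] * 10             # toy class: pre/post collapse
    toy = {
        "informative": list(collapse),          # tracks the class exactly
        "uninformative": [i % 2 for i in range(20)],
    }
    for name, score in bootstrap_filter_scores(toy, collapse, 200, random.Random(0)):
        print(name, round(score, 2))
```

A higher (less negative) mean score indicates a variable whose states separate the pre- and post-collapse years more cleanly, which is the sense in which the filter approach scores each species independently.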

on GB, where the change in the community seemed to be captured by changes in two zooplankton species while in the ESS and the NS, there was no strong indication of zooplankton changes that accompanied fish community change.

Perhaps, the most striking feature of the functional equivalence applied to the ESS is the presence of many deepwater species such as argentine (Argentina sphyraena), grenadier (Nezumia bairdi) and hakes (Merluccius bilinearis). Surprisingly, cod was not implicated in the ESS collapse despite the fact that cod were a highly targeted species prior to collapse. The inclusion of grey seals is also expected as they were implicated in the decline and lack of recovery of many groundfish stocks on the ESS. The largest breeding colony of grey seals in the world is located on Sable Island in the middle of the ESS.

The presence of coldwater-seeking deepwater species on the ESS could be an indication of the water cooling that occurred on the ESS in the late-1980s and early-1990s, which also led to increases in coldwater shrimp and snow crabs. Furthermore, though grey seals increased in abundance at the same time, grey seals are not deep divers and if the deepwater species remained in the shelf basins and slope water, they would be less susceptible to grey seal predation than would cod.
[Figure 4 graphic: time-series panels for the hidden state and for Gadus morhua (cod), plotted against year (1960 to 2010); vertical axis: standardized biomass.]

Figure 4. The fit for the model trained on GB data along with the associated discovered hidden state. The series marked with crosses denote the predicted biomass and hidden state as opposed to the observed biomass denoted by circles.

[Figure 5 graphic: bar charts with one bar per species or catch variable for panels (a) and (b); vertical axis: confidence, 0 to 1.0.]

Figure 5. Functionally equivalent species to those selected from GB data identified using the functional equivalence algorithm. (a) Shows the equivalent species in the ESS and (b) shows the species in the NS.
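
The search that produces such species sets (algorithm 1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the scoring function is passed in as a callable, so the toy `score_fn` in the usage example merely stands in for the BN-based score of the paper, and the names `functional_equivalence_search`, `candidates` and `n_selected` are hypothetical. The acceptance rule and the cooling schedule (multiply the temperature by 0.9 each iteration) follow algorithm 1.

```python
import math
import random


def functional_equivalence_search(candidates, n_selected, score_fn,
                                  tstart=1000.0, iterations=1000, rng=None):
    """Simulated annealing over which variables of a second dataset fill each role.

    candidates : list of variable names available in the second dataset
    n_selected : number of roles to fill (size of vars1 in algorithm 1)
    score_fn   : callable mapping a list of selected names to a score
                 (higher is better); an assumption standing in for the BN score
    """
    rng = rng or random.Random()
    vars2 = rng.sample(candidates, n_selected)   # random initial selection
    bestscore = score_fn(vars2)
    t = tstart
    for _ in range(iterations):
        unused = [v for v in candidates if v not in vars2]
        if not unused:
            break
        pos = rng.randrange(n_selected)
        old = vars2[pos]
        vars2[pos] = rng.choice(unused)          # replace one selected variable
        rescore = score_fn(vars2)
        dscore = rescore - bestscore
        # Accept improvements always, and worse moves with probability exp(dscore / t).
        if dscore >= 0 or rng.random() < math.exp(dscore / t):
            bestscore = rescore
        else:
            vars2[pos] = old                     # undo the switch
        t *= 0.9                                 # cool the temperature
    return vars2, bestscore


if __name__ == "__main__":
    # Toy score: count how many roles are filled by a hidden "true" equivalent.
    target = ["c", "f"]
    toy_score = lambda selected: -sum(a != b for a, b in zip(selected, target))
    print(functional_equivalence_search(list("abcdefgh"), 2, toy_score,
                                        rng=random.Random(0)))
```

Because the temperature starts high, early iterations behave almost like a random walk; as it cools, the search accepts only moves that do not lower the score, which is why the starting temperature and iteration count need to be tuned together.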

[Figure 6 graphic: time-series panels for the hidden state, Gadus morhua (cod), Pollachius virens (pollack), Atlantic argentine and Halichoerus grypus (grey seals), plotted against year (1970 to 2010); vertical axis: standardized biomass.]
Figure 6. One-step ahead prediction of cod using DBN model trained on GB data and mapping to equivalent species in the
ESS (identified using the functional equivalence algorithm along with the associated discovered hidden state). The series
marked with crosses denote the predicted biomass and hidden state as opposed to the observed biomass denoted by circles.
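One-step-ahead prediction with a discovered hidden state, as plotted in Figure 6, can be sketched with a much simpler model than the DBN used in the paper. The sketch below assumes a two-state hidden Markov model with Gaussian emissions, and forward filtering gives both the belief in each hidden state and the expected next observation. The function names, transition matrix, state means, standard deviation and toy series are illustrative assumptions, not values from the study.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def one_step_ahead(series, trans, means, sd, prior):
    """Forward-filter a discrete hidden state and predict the next value.

    Returns (predictions, filtered):
      predictions[t] is the expected value of the observation after series[t],
      filtered[t] is the belief over hidden states given series[: t + 1].
    """
    k = len(means)
    belief = list(prior)  # belief over the state before the first observation
    predictions, filtered = [], []
    for x in series:
        # Update step: weight each state by how well it explains x.
        post = [belief[h] * gaussian_pdf(x, means[h], sd) for h in range(k)]
        norm = sum(post)
        post = [p / norm for p in post]
        filtered.append(post)
        # Predict step: push the belief through the transition matrix.
        belief = [sum(post[i] * trans[i][j] for i in range(k)) for j in range(k)]
        predictions.append(sum(belief[j] * means[j] for j in range(k)))
    return predictions, filtered


# Hypothetical parameters: state 0 = high biomass, state 1 = collapsed.
trans = [[0.9, 0.1], [0.05, 0.95]]
means = [1.0, -1.0]
sd = 0.5
prior = [0.5, 0.5]
series = [1.1, 0.9, 1.0, -0.8, -1.2, -1.0]  # toy standardized biomass

predictions, filtered = one_step_ahead(series, trans, means, sd, prior)
for obs, belief_t, pred in zip(series, filtered, predictions):
    print(f"obs={obs:+.1f}  P(collapsed)={belief_t[1]:.3f}  next={pred:+.3f}")
```

In this toy run the belief in the collapsed state stays low while the series is high and moves towards one after the drop, which is the kind of behaviour the hidden state panel is meant to show.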

In the NS, most of the selected species are commercially desirable and some experienced large declines in biomass in this period, though the nature of the species is not dissimilar to GB when compared with ESS, which showed the appearance of some qualitatively very different species. Catch of haddock and cod appeared to be important in the NS while commercial fish catch seemed less important on the ESS. These factors combined might suggest that catch is one of the most important factors driving change in the NS, while on the ESS, it may be that other factors lead to fundamental changes in the fish community composition.

The final set of results explores how well the functionally equivalent species can predict future biomass and the underlying state of the geographical system. Figure 6 documents these results for the selected functionally equivalent species for ESS (using the DBN trained on GB data and then mapped on equivalent species on ESS). The prediction of many of these species was surprisingly good, with close fits to the observed data. This is impressive considering that the model was parametrized using biomass data from different species in GB. For example, the model predicts the increase in seal numbers year after year based upon parameters determined on the relationship between cod catch and other species in GB. What is more, the hidden state inferred from the predicted data closely resembles what was observed in terms of functional collapse. While the state fluctuates in the period up to the late-1980s/early-1990s, in the period after the collapse the state becomes very
[Figure 7: panels for hidden state, Gadus morhua (cod), Amblyraja radiata (thorny skate), Merlangius merlangus (whiting), Hippoglossus hippoglossus (Atlantic halibut), Cyclopterus lumpus (lumpfish) and cod catch. Axes: standardized biomass against year, 1965 to 2010.]

Figure 7. One-step ahead prediction of cod using DBN model trained on GB data and mapping to equivalent species in the NS (identified using the functional equivalence algorithm along with the associated discovered hidden state). The series marked with crosses denote the predicted biomass and hidden state as opposed to the observed biomass denoted by circles.

stable. This further adds credence to the conclusion that the selected species are indeed key to the functional collapse of cod in the ESS.

The same analysis was applied to identifying functionally equivalent species in the NS and testing them for prediction of biomass and identifying changes in the underlying state. Figure 7 illustrates the results. Firstly, note that the hidden state appears much less stable than in the ESS results. Rather than identifying no change in state (as was expected, as no collapse has been observed), the hidden variable appears to have fitted the states to some noise process that fluctuates throughout the series. This could be due to the hidden state capturing the functional collapse successfully, which is the most influential predictive feature of cod in the ESS dataset, whereas the prediction of cod in the NS is more complex due to the lack of any collapse.

4. DISCUSSION
Since the large-scale fisheries collapses in many different regions globally in the late-1980s and into the 1990s, there has been a search for causal mechanisms (e.g. [54]). This research has included studies of
the effects of fisheries on species [55] as well as indirect effects of modifications of food webs and functional structure [56]. They have included simulation studies on functional structures in food webs [54,57], development of static functional structures through covariance techniques [58] or summaries of complicated multivariate data to examine overall temporal trends [59]. The present use of machine-learning techniques and BNs is another method applied to the problem.

The use of bioinformatics techniques in this paper is unique because it exploits functional equivalence between different datasets and uses the identified species in conjunction with a dynamic model that uses latent variables to predict functional collapse (and future biomass). The recognition of a latent variable is important in fish community change studies of this nature because it allows for causes of change that are not found purely within the constrained model structure. This is very different from mass balance model approaches whose fitting is conditioned completely upon the model structure. The latent variable therefore may partially represent something external to the fish community such as oceanographic conditions. We intend to explore this further by using data on likely factors such as temperature, nutrients and fishing mortality. Changes in conditions external to the fish community may be responsible for collapse in GB and ESS. The longer runs of similar estimates for the hidden state compared with NS could suggest different processes occurring there. Oceanographic conditions are a contender for ESS. For GB, what is occurring is less clear. NS, being highly exploited but shallow and dynamic, may naturally be more variable and able to cope with disturbances that would send the other two systems into collapse. Further work is warranted, and other processes such as system variability before and after collapse [60] may prove to be useful predictors of collapse.

BN models also facilitate the direct incorporation of expertise into the structures and parameters. While this has not been explored fully here (food webs have been used mostly for validation), using informative priors in the network models based upon available expertise will be investigated. The modelling approach also differs from other methods in how correlative structures, which are assumed to represent causal functional relationships, discovered in one system can be imposed upon another system. The components of the other system which best fit these structures can then be found. The topology of the BN allows us to explore these structures explicitly and a follow-up study will explore them prior to and after suspected regime changes. Though most ecosystem studies recognize the functional relation approach between species, most cannot deal with it in as pure a sense. Essentially, what this approach assumes is that there are only a few ways for similar ecosystems to organize themselves functionally even though the components may have different qualities; our analysis suggests that there may be similar ways to collapse. This can provide real insights into why fished ecosystems collapse and why they sometimes do not recover when a perturbation stops. Most importantly, it may give us an insight into signs of an imminent collapse perhaps while there is still time to prevent it.

We would like to thank Jerry Black (DFO-BIO, Halifax) for assistance with the ESS survey data, Alida Bundy (DFO-BIO, Halifax) for the ESS food web, the ICES DATRAS database for the North Sea IBTS data, Bill Kramer (NOAA–NMFS, Woods Hole) for providing the Georges Bank survey, Jason Link (NOAA–NMFS) for the Georges Bank food web, Jon Hare (NOAA–NMFS) for NE USA plankton data, SAHFOS for North Sea plankton data and Mike Hammill for ESS grey seal data.

REFERENCES
1 Bishop, C. M. 2006 Pattern recognition and machine learning. New York, NY: Springer.
2 Hand, D. J., Mannila, H. & Smyth, P. 2001 Principles of data mining. Cambridge, MA: MIT Press.
3 Berthold, M. R., Borgelt, C., Hoppner, F. & Klawonn, F. 2010 Guide to intelligent data analysis. Berlin, Germany: Springer.
4 Peek, N., Combi, C. & Tucker, A. 2009 Biomedical data mining. Methods Inform. Med. 48, 225–228.
5 Eisen, M., Spellman, P., Brown, P. & Botstein, D. 1998 Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 95, 14863–14868. (doi:10.1073/pnas.95.25.14863)
6 Swift, S., Tucker, A., Liu, X., Martin, N., Orengo, C. & Kellam, P. 2004 Consensus clustering and functional interpretation of gene expression data. Genome Biol. 5, R94.
7 Segal, E., Friedman, N., Koller, D. & Regev, A. 2004 A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098. (doi:10.1038/ng1434)
8 Bar-Joseph, Z. et al. 2003 Computational discovery of gene modules and regulatory networks. Nat. Biotechnol. 21, 1337–1342. (doi:10.1038/nbt890)
9 Pittman, J. et al. 2004 Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc. Natl Acad. Sci. USA 101, 8431–8436. (doi:10.1073/pnas.0401736101)
10 Inza, I., Larrañaga, P., Blanco, R. & Cerrolaza, A. J. 2004 Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. 31, 91–103. (doi:10.1016/j.artmed.2004.01.007)
11 Kim, S., Imoto, S. & Miyano, S. 2003 Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brief. Bioinform. 4, 228–235. (doi:10.1093/bib/4.3.228)
12 Pe’er, D., Regev, A., Elidan, G. & Friedman, N. 2001 Inferring subnetworks from perturbed expression profiles. In Proc. 9th Int. Conf. Intelligent Systems for Molecular Biology (ISMB 2001), Copenhagen, Denmark, July 2001. Oxford, UK: Oxford Journals.
13 Li, H., Xuan, J., Wang, Y. & Zhan, M. 2008 Inferring regulatory networks. Front. Biosci. 13, 263–275. (doi:10.2741/2677)
14 Hochachka, W. M., Caruana, R., Fink, D., Munson, A., Riedewald, M., Sorokina, D. & Kelling, S. 2007 Data-mining discovery of pattern and process in ecological systems. J. Wildlife Manage. 71, 2427–2437. (doi:10.2193/2006-503)
15 Friedman, N., Linial, M., Nachman, I. & Pe’er, D. 2000 Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620. (doi:10.1089/106652700750050961)
16 Pe’er, D., Tanay, A. & Regev, A. 2006 MinReg: a scalable algorithm for learning parsimonious networks in yeast and mammals. J. Mach. Learning Res. 7, 167–189.

17 Hartemink, A. J., Gifford, D., Jaakkola, T. & Young, R. 2002 Bayesian methods for elucidating genetic regulatory networks. IEEE Intell. Syst. 17, 37–43.
18 Spirtes, P., Glymour, C. & Scheines, R. 2000 Causation, prediction, and search. Cambridge, MA: MIT Press.
19 Pearl, J. 2001 Causality: models, reasoning, and inference. Cambridge, UK: Cambridge University Press.
20 Cooper, G. F. & Herskovits, E. 1992 A Bayesian method for the induction of probabilistic networks from data. Mach. Learning 9, 309–347. (doi:10.1007/BF00994110)
21 Janžura, M. & Nielsen, J. 2006 A simulated annealing-based method for learning Bayesian networks from statistical data: research articles. Int. J. Intell. Syst. 21, 335–348. (doi:10.1002/int.20138)
22 Steele, E., Tucker, A., ’t Hoen, P. A. C. & Schuemie, M. J. 2009 Literature-based priors for gene regulatory networks. Bioinformatics 25, 1768–1774. (doi:10.1093/bioinformatics/btp277)
23 Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R. H. & Kuijpers, C. M. H. 1996 Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE Trans. Pattern Anal. Mach. Intell. 18, 912–926. (doi:10.1109/34.537345)
24 Langley, P., Iba, W. & Thompson, K. 1992 An analysis of Bayesian classifiers. In Proc. 10th Natl Conf. Artificial Intelligence, San Jose, CA, July 1992, pp. 223–228. Menlo Park, CA: AAAI Press.
25 Ghahramani, Z. 1998 Learning dynamic Bayesian networks. In Adaptive processing of sequences and data structures. Lecture Notes in Artificial Intelligence, pp. 168–197. Berlin, Germany: Springer.
26 Friedman, N., Geiger, D. & Goldszmidt, M. 1997 Bayesian network classifiers. Mach. Learning 29, 131–163. (doi:10.1023/A:1007465528199)
27 Munch, K., Gardner, P. P., Arctander, P. & Krogh, A. 2006 A hidden Markov model approach for determining expression from genomic tiling microarrays. BMC Bioinform. 7, 239. (doi:10.1186/1471-2105-7-239)
28 Castelo, R. & Siebes, A. 2000 Priors on network structures: biasing the search for Bayesian networks. Int. J. Approximate Reasoning 24, 39–57. (doi:10.1016/S0888-613X(99)00041-9)
29 Imoto, S., Goto, T. & Miyano, S. 2002 Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. In Pacific Symposium on Biocomputing, Lihue, Hawaii, January 2002, vol. 7, pp. 175–186.
30 Werhli, A. V. & Husmeier, D. 2007 Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat. Appl. Genet. Mol. Biol. 6, art. 15. (doi:10.2202/1544-6115.1282)
31 Yauk, C., Berndt, M. L., Williams, A. & Douglas, G. 2004 Comprehensive comparison of six microarray technologies. Nucleic Acids Res. 32, e124. (doi:10.1093/nar/gnh123)
32 Kuo, W. P., Jenssen, T. K., Butte, A. J., Ohno-Machado, L. & Kohane, I. S. 2002 Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18, 405–412. (doi:10.1093/bioinformatics/18.3.405)
33 Jarvinen, A. K., Hautaniemi, S., Edgren, H., Auvinen, P., Saarela, J., Kallioniemi, O. P. & Monni, O. 2004 Are data from different gene expression microarrays comparable? Genomics 83, 1164–1168. (doi:10.1016/j.ygeno.2004.01.004)
34 Steele, E. & Tucker, A. 2008 Consensus and meta-analysis regulatory networks for combining multiple microarray gene expression datasets. J. Biomed. Inform. 41, 914–926. (doi:10.1016/j.jbi.2008.01.011)
35 Steele, E. & Tucker, A. 2009 Selecting and weighting data for building consensus gene regulatory networks. In Advances in intelligent data analysis VIII, vol. LNCS 5772, pp. 190–201. Berlin, Germany: Springer.
36 Anvar, S. Y., ’t Hoen, P. A. & Tucker, A. 2010 The identification of informative genes from multiple datasets with increasing complexity. BMC Bioinform. 11, 32. (doi:10.1186/1471-2105-11-32)
37 Anvar, Y., Tucker, A., Vinciotti, V., Venema, A., van Ommen, G. J. B., van der Maarel, S. M., Raz, V. & ’t Hoen, P. A. C. In press. Interspecies translation of disease networks increases robustness and predictive accuracy. PLOS Comput. Biol.
38 Garrison, L. P. & Link, J. S. 2000 Dietary guild structure of the fish community in the northeast United States continental shelf ecosystem. Mar. Ecol. Prog. Ser. 202, 231–240. (doi:10.3354/meps202231)
39 Jiao, Y. 2009 Regime shift in marine ecosystems and implications for fisheries management, a review. Rev. Fish Biol. Fish. 19, 177–191. (doi:10.1007/s11160-008-9096-8)
40 Bakun, A. 2006 Wasp-waist populations and marine ecosystem dynamics: navigating the ‘predator pit’ topographies. Prog. Oceanogr. 68, 271–288. (doi:10.1016/j.pocean.2006.02.004)
41 Marcot, B. G., Steventon, J. D., Sutherland, G. D. & McCann, R. K. 2006 Guidelines for developing and updating Bayesian belief networks applied to ecological modeling and conservation. Can. J. For. Res. 36, 3063–3074. (doi:10.1139/x06-135)
42 Hammond, T. R. & O’Brien, C. M. 2001 An application of the Bayesian approach to stock assessment model uncertainty. ICES J. Mar. Sci. 58, 648–656. (doi:10.1006/jmsc.2001.1051)
43 Marcot, B. G., Holthausen, R. S., Raphael, M. G., Rowland, M. M. & Wisdom, M. J. 2001 Using Bayesian belief networks to evaluate fish and wildlife population viability under land management alternatives from an environmental impact statement. For. Ecol. Manage. 153, 29–42. (doi:10.1016/S0378-1127(01)00452-2)
44 SEIS 2008. European commission: towards a shared environmental information system. See http://ec.europa.eu/environment/seis/index.htm.
45 Thrush, S., Giovani, C. & Hewitt, J. E. 2008 Complex positive connections between functional groups are revealed by neural network analysis of ecological time-series. Am. Nat. 171, 669–677.
46 Efron, B. & Tibshirani, R. 1995 Cross-validation and the bootstrap: estimating the error rate of a prediction rule. Technical Report no. TR-477, Department of Statistics, Stanford University, Stanford, CA, USA.
47 Liang, S., Fuhrman, S. & Somogyi, R. 1998 Reveal, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing, Maui, Hawaii, January 1998, vol. 3, pp. 18–29.
48 Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. 1983 Optimization by simulated annealing. Science 220, 671–680. (doi:10.1126/science.220.4598.671)
49 Kane, J. 2007 Zooplankton abundance trends on Georges Bank, 1977–2004. ICES J. Mar. Sci. 64, 909–919. (doi:10.1093/icesjms/fsm066)
50 Beaugrand, G., Brander, K., Lindley, J. A., Souissi, S. & Reid, P. C. 2003 Plankton effect on cod recruitment in the North Sea. Nature 426, 661–664. (doi:10.1038/nature02164)
51 Swain, D. P. & Sinclair, A. 2000 Pelagic fishes and the cod recruitment dilemma in the northwest Atlantic. Can. J. Fish. Aquat. Sci. 57, 1321–1325. (doi:10.1139/f00-104)

52 Fogarty, M. J. & Murawski, S. A. 1998 Large-scale disturbance and the structure of marine systems: fishery impacts on Georges Bank. Ecol. Appl. 8, S6–S22.
53 Frisk, M. G., Miller, T. J., Martell, S. J. D. & Sosebee, K. 2008 New hypothesis helps explain elasmobranch ‘outburst’ on Georges Bank in the 1980s. Ecol. Appl. 18, 234–245. (doi:10.1890/06-1392.1)
54 Bundy, A. 2001 Fishing on ecosystems: the interplay of fishing and predation in Newfoundland and Labrador. Can. J. Fish. Aquat. Sci. 58, 1153–1167.
55 Myers, R. A. & Worm, B. 2003 Rapid worldwide depletion of predatory fish communities. Nature 423, 280–283. (doi:10.1038/nature01610)
56 Frank, K., Petrie, B., Choi, J. & Leggett, W. 2005 Trophic cascades in a formerly cod-dominated ecosystem. Science 308, 1621–1623. (doi:10.1126/science.1113075)
57 Petrie, B., Frank, K. T., Shackell, N. L. & Leggett, W. C. 2009 Structure and stability in exploited marine fish communities: quantifying critical transitions. Fish. Oceanogr. 18, 83–101. (doi:10.1111/j.1365-2419.2009.00500.x)
58 Duplisea, D. E. & Blanchard, F. 2005 Relating species and community dynamics in an heavily exploited marine fish community. Ecosystems 8, 899–910. (doi:10.1007/s10021-005-0011-z)
59 Choi, J. S., Frank, K. T., Petrie, B. D. & Leggett, W. C. 2005 Integrated assessment of a large marine ecosystem: a case study of the devolution of the Eastern Scotian Shelf. Oceanogr. Mar. Biol. 43, 47–67.
60 Carpenter, S. R. et al. 2011 Early warnings of regime shifts: a whole-ecosystem experiment. Science 332, 1079–1082. (doi:10.1126/science.1203672)