Data Quality


Professional Biologist

Data-intensive Science: A New Paradigm for Biodiversity Studies

STEVE KELLING, WESLEY M. HOCHACHKA, DANIEL FINK, MIREK RIEDEWALD, RICH CARUANA, GRANT BALLARD,
GILES HOOKER



The increasing availability of massive volumes of scientific data requires new synthetic analysis techniques to explore and identify interesting
patterns that are otherwise not apparent. For biodiversity studies, a “data-driven” approach is necessary because of the complexity of ecological
systems, particularly when viewed at large spatial and temporal scales. Data-intensive science organizes large volumes of data from multiple sources
and fields and then analyzes them using techniques tailored to the discovery of complex patterns in high-dimensional data through visualizations,
simulations, and various types of model building. Through interpreting and analyzing these models, truly novel and surprising patterns that are
“born from the data” can be discovered. These patterns provide valuable insight for concrete hypotheses about the underlying ecological processes
that created the observed data. Data-intensive science allows scientists to analyze bigger and more complex systems efficiently, and complements
more traditional scientific processes of hypothesis generation and experimental testing to refine our understanding of the natural world.

Keywords: data-intensive science, informatics, biodiversity, machine learning, statistics

Biodiversity research is a branch of ecology that identifies and predicts patterns of organism distribution and abundance, and explains the causes of these patterns. Ecological systems are extremely complex, and a multitude of processes may affect organisms (McMichael et al. 2003). These processes can vary over time (Delcourt and Delcourt 2005) and through space (Tuomisto et al. 2003). Consequently, to understand the determinants of biodiversity, data need to be collected over long periods of time (Gaston and McArdle 1994) and at appropriate, potentially large, spatial scales (Doak et al. 1992). Further, because we must often guess at the environmental features that can affect distributions, tens if not hundreds of potentially important predictors must be screened. Given these challenges, we believe that processes different from those typically used by ecologists are necessary to best understand patterns in biodiversity.

Traditional ecological research has relied on expert-centered parametric analysis. While many variants of this approach exist, they all fundamentally rely on extensive domain knowledge to allow a scientist to identify a problem and formulate and test hypotheses. This is accomplished by developing an experimental design to gather the data needed to test the validity of the hypothesis. However, for many biodiversity studies, this expert-centered parametric analysis alone is inherently limiting because collecting data for hypothesis-testing analyses, at the spatial and temporal scope needed, is logistically, financially, or ethically challenging and is most likely not feasible for one individual expert.

Gaining insights into the patterns of species occurrence in complex ecological systems will require new synthetic analyses of massive amounts of disparate data (Brown 1995). Recently there has been much discussion about the need for the organization of large volumes of data and their use in scientific analysis in both the scientific (Lynch 2008) and popular (Anderson 2008) press. This need has led to the creation in the United States of the $100-million DataNet program (NSF 2007), and in Europe to the creation of the Alliance for Permanent Access (Angevaare 2008). The goal of these initiatives is to develop cross-domain data standardization and curation strategies to make scientific data available, from particle colliders to counting birds at a feeder, and preserve these data for long-term and unanticipated use over time and across disciplines. While these initiatives focus on the cyberinfrastructure needed to organize and provide access to massive volumes of data, there has been less discussion on how this organization and access to data will affect scientific processes.

In this article we introduce a new analysis paradigm for biodiversity studies that takes advantage of access to massive quantities of data. Data-intensive science (Newman et al. 2003) takes a “data-driven” approach, in which information emerges from the data, as opposed to the more traditional “knowledge-driven” approach that examines hypothesized patterns expected from the data. Data-intensive science is emerging in the face of similar challenges across multiple scientific domains as a result of the accumulation of large quantities of data, and from the need for new analysis techniques

BioScience 59: 613–620. ISSN 0006-3568, electronic ISSN 1525-3244. © 2009 by American Institute of Biological Sciences. All rights reserved. Request permission to photocopy or reproduce article content at the University of California Press’s Rights and Permissions Web site at www.ucpressjournals.com/reprintinfo.asp. doi:10.1525/bio.2009.59.7.12

www.biosciencemag.org July/August 2009 / Vol. 59 No. 7 • BioScience 613
