
Genomics Proteomics Bioinformatics 12 (2014) 187–189

Genomics Proteomics Bioinformatics


www.elsevier.com/locate/gpb
www.sciencedirect.com

PERSPECTIVE

Big Biological Data: Challenges and Opportunities


Yixue Li *, Luonan Chen *

Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences,
Shanghai 200031, China

Received 1 September 2014; revised 1 October 2014; accepted 1 October 2014


Available online 14 October 2014

* Corresponding authors.
E-mail: [email protected] (Li Y), [email protected] (Chen L).
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences and Genetics Society of China.
http://dx.doi.org/10.1016/j.gpb.2014.10.001
1672-0229 © 2014 The Authors. Production and hosting by Elsevier B.V. on behalf of Beijing Institute of Genomics, Chinese Academy of Sciences and Genetics Society of China.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

In the “Omics” era of the life sciences, data is presented in many forms, which represent information at various levels of biological systems, including data about the genome, transcriptome, epigenome, proteome, metabolome, molecular imaging, molecular pathways, different populations of people and clinical/medical records. Biological data is big, and its scale is already well beyond the petabyte (PB) and even the exabyte (EB) level. Nobody doubts that biological data will create a huge amount of value if scientists can overcome many challenges, e.g., how to handle the complexity of information, how to integrate data from very heterogeneous resources, and what kinds of principles or standards to adopt when dealing with big data. Tools and techniques for analyzing big biological data enable us to translate massive amounts of information into a better understanding of basic biomedical mechanisms, which can be further applied to translational or personalized medicine.

Today, big data is one of the hottest topics in information science, but the concept can be misleading or confusing. The name itself suggests a huge amount of data, which, however, represents only one aspect. In general, big data has four important features, the so-called four V’s: volume of data, velocity of processing the data, variability of data sources, and veracity of the data quality. These four hallmarks of big data need to be characterized by special theory and technology; however, there is currently no satisfactory solution. Now, more biologists are involved with big data due to the rapid advance of high-throughput biotechnologies. As an example, the Human Genome Project utilized the expertise, infrastructure, and people from 20 institutions and took 13 years of work and over $3 billion to determine the whole genome structure of approximately three billion nucleotides. But now we can sequence a whole human genome for $1000 and within three days. We have spent decades struggling to collect enough biological and biomedical data, but when big data overwhelms us, are we ready to face the challenge? The new bottleneck in biology is how to reveal the essential mechanisms of biological systems by understanding big, noisy data. Life sciences today need more robust, expressive, computable, quantitative, accurate and precise ways to handle big data. As a matter of fact, recent work in this area has already brought remarkable advantages and opportunities, which implies a central role for bioinformatics and bioinformaticians in future research in the biological and biomedical fields. In the following text, we describe several aspects of big biological data based on our recent studies.

Expanding volume of the big biological data and its bonanza

With the increasingly large volumes of information accumulated about humans, animals and microbes, researchers are starting to grapple with massive datasets and to elucidate the fundamental implications of those datasets in biology. For instance, a recent genetics study of six dog breeds, which aimed to decipher adaptation mechanisms under hypoxia in highland areas, produced 3.2 terabytes (TB) of genome sequencing data from 60 dogs living at different altitudes along the “Ancient Tea Horse Road” [1]. After the data analysis, an important nonsynonymous mutation, G305S, was found in the gene EPAS1, which encodes endothelial Per-Arnt-Sim (PAS) domain protein 1, and this mutation is most likely involved in the adaptation to hypoxia. In a milestone study using next-generation sequencing technology, Jay Shendure and his co-workers captured 12 human exomes [2]. They obtained over
30 gigabytes (GB) of DNA sequencing data and directly identified a SNP in the gene MYH3 that is causative for a monogenic human disease called Freeman–Sheldon syndrome (FSS). Another impressive big-data-based biomedical study was performed by Lupski and his colleagues [3]. In this work, 12 pedigree samples were sequenced to acquire whole-genome DNA data for finding causative variations related to Charcot–Marie–Tooth (CMT) disease. Over one TB of genome sequencing data were analyzed, and a novel missense mutation, Y169H, in the gene SH3TC2 was found to be associated with CMT disease. All such studies show that big biological data has become an essential part of biological discovery and biomedical research [4], and clearly, the aforementioned discoveries would be inconceivable without big biological data. Exponential growth in the amount of biological data undoubtedly means potential bonanza opportunities; nonetheless, we need to develop revolutionary measures for data management, analysis and accessibility.

Hypothesis-driven study is still a key for big biological data mining

Biological big data, in general, has properties similar to the four V’s of big data, in particular at the molecular level. But, unlike the data gathered by Google, WeChat and Alibaba, big biological data is highly heterogeneous; even within the data, there exist intrinsic structures determined by various biological principles and experiment designs. Because of the 4-V features of big data, associations or correlations rather than causal relationships are expected to be built among certain elements, such as genes, proteins, and pathways, across the whole of the data. However, biological studies always need to know the driving forces or causal relationships among biological elements, which form complex biological systems. Many studies show that the existing intrinsic structures determined by various biological principles and experiment designs already provide biological data miners and curators with possible ways to identify causal relationships among biological molecules in big biological data. In this case, “hypothesis-driven study” is a key for big biological data mining, which can effectively reduce the CPU time of data mining and the use of computing resources. In other words, the key is how we can rely on reasonable hypotheses to guide our big biological data mining and so solve the 4-V problems efficiently. A remarkable example can be seen in a study by Yuan and her colleagues [5]. They found that very diverse outputs, with low overlap and substantial false positives, are often generated when the same gene expression data is analyzed using different algorithms. The problem results from the extreme heterogeneity of gene expression data, and there is no guarantee that a purely statistical model will solve it. A recent effort presented a methodology that aims to circumvent the limitations of purely statistical models and general gene expression data analysis strategies. The method was based on a simple biological assumption: “If a number of genes that are conservatively co-expressed emerge as a dynamically-cooperative group across certain biological processes, these genes are most likely functionally closely related with physiological and pathological processes” [5]. Then, according to this “hypothesis”, the data mining is converted into finding those gene clusters with strongly cooperative and conservative properties across cancer progression stages.

Computational systems biology plays a central role in the big biological data era

There are four key features related to systems biology: integration – the whole biological system is more than the sum of its parts; network – biological function is a phenotype of the inherent biological network structure; emergence – new functions emerge through interactions among the elements of the biological system; interference – correlation or coherence between biological molecules in a single biological pathway or between several biological pathways. The nature of biological big data can be summarized as: hierarchy – data is generated at different levels ranging from molecules, cells and tissues to systems; heterogeneity – data is generated using different methods ranging from genetics, physiology and pathology to imaging; complexity – data can be simultaneously recorded in the form of multi-level information from thousands of cells or even more; dynamics – biological processes or states change with conditions and over time. It is undeniable that association study alone is too superficial to meet the needs of scientists, and our aspiration is to reveal the driving forces or causal relationships among biological elements, which can be used for deciphering the mechanisms of biological processes and diseases, such as cancer, diabetes, and Alzheimer’s disease. The main challenge for big data mining would then be how to achieve a transition from association study to causality study. From this point of view, computational systems biology [6,7] provides a new way for system-wide study and could play a key role in such a transition in the big-data era.

The impacts of engineering, cooperation, standardization and pipelines on big data analysis

Because of the nature of big biological data, research in life science has, to some extent, to change its style in the era of big data, e.g., from individual academic exploration to more cooperative study in systematic, standardized and pipelined ways. The main challenges here could be to establish interoperable databases, make sustainable tools available to the research community, create tool development centers, construct resources and infrastructure, such as cloud computing, to serve the huge amount of research, generate standards, vocabularies and ontologies for big biological data, develop new systems of infrastructure and tools, such as cloud services, and obtain buy-in from the scientific community. Clearly, the aforementioned challenges can be solved in a more engineering-oriented manner, and a well-designed experiment system matched with a systematic, standardized data processing pipeline will be an important factor for a successful study.

Big-data medicine by dynamical network biomarkers

It is commonly recognized that a complicated living organism cannot be completely understood by merely analyzing individual components. Phenotypes and functions of an organism are ultimately determined by interactions between these components or networks in terms of structures and dynamics [2]. Network and dynamics are two key aspects in computational systems biology [6,7–13]. However, the majority of traditional
research focuses on the static and statistical properties (e.g., GWAS) of big data, rather than the essential dynamics and networks of life in living organisms. Generally, a disease results not from the malfunction of individual molecules but from the failure of the relevant system or network, which can be considered as a set of interactions among molecules. Thus, rather than single molecules, networks are stable forms of biomarkers that can reliably characterize complex diseases.

The era of big data [14,15] provides great opportunities for predictive, preventive, personalized and participatory (P4) medicine, which is expected to lead to big-data medicine. The study of networks and interactions of biological elements, rather than the biological elements themselves, can capture previously unobserved features at the levels of both network (or edges) and dynamics. Therefore, with demand from both theoretical and clinical aspects, and owing to the availability of big data, in particular high-dimensional data, biomarkers are evolving from single molecules (e.g., individual genes) to multiple molecules (e.g., gene sets), associated molecules (e.g., molecule networks) and dynamically interacting molecules (e.g., dynamical molecule networks), which can be categorized as node biomarkers [14,15], network-based biomarkers [16–18], network biomarkers [19,20] and dynamical network biomarkers (DNBs) [21,22], respectively. By exploiting the network information in big data, recent studies on EdgeMarker [14,20] demonstrate that non-differentially expressed genes, which are usually ignored by traditional methods, can be as informative as differentially expressed genes for classifying different biological conditions or phenotypes of samples. By exploiting the dynamical information in big data, a novel biomarker, DNB, was recently developed [22]. In contrast to traditional biomarkers, which detect the disease state, DNB is able to identify the pre-disease state before the occurrence or serious deterioration of a disease, and so can be used to prevent further progression before the disease deteriorates into an irreversible state [21–24]. In other words, using high-dimensional data (such as gene expression, RNA-seq, protein expression, and imaging data), this new type of biomarker can achieve early diagnosis of the “pre-disease” state or “un-occurring disease” state, a concept raised in the “Yellow Emperor’s Canon of Internal Medicine” (one of the earliest books of Traditional Chinese Medicine) [14].

Acknowledgements

This project was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB13040700), the National Program on Key Basic Research Project (973 Program, Grant No. 2014CB910504) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61134013, 91130032, 61103075 and 91029301).

References

[1] Gou X, Wang Z, Li N, Qiu F, Xu Z, Yan D, et al. Whole-genome sequencing of six dog breeds from continuous altitudes reveals adaption to high-altitude hypoxia. Genome Res 2014;24:1308–15.
[2] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461:272–6.
[3] Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, et al. Whole-genome sequencing in a patient with Charcot–Marie–Tooth neuropathy. N Engl J Med 2010;362:1181–91.
[4] Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, et al. Big data: the future of biocuration. Nature 2008;455:47–50.
[5] Yuan L, Ding G, Chen YE, Chen Z, Li Y. A novel strategy for deciphering dynamic conservation of gene expression relationship. J Mol Cell Biol 2012;4:177–9.
[6] Chen L, Wang R, Zhang X. Biomolecular networks: methods and applications in systems biology. Hoboken: John Wiley & Sons; 2009.
[7] Chen L, Wang R, Li C, Aihara K. Modelling biomolecular networks in cells: structures and dynamics. London: Springer-Verlag; 2010.
[8] Song W, Wang J, Yang Y, Jing N, Zhang X, Chen L. Rewiring drug-activated p53-regulatory network from suppressing to promoting tumorigenesis. J Mol Cell Biol 2012;4:197–206.
[9] Zeng T, Wang DC, Wang X, Xu F, Chen L. Prediction of dynamical drug sensitivity and resistance by module network rewiring-analysis based on transcriptional profiling. Drug Resist Update 2014;17:64–76.
[10] Liu B, Yuan Z, Aihara K, Chen L. Reinitiation enhances reliable transcriptional responses in eukaryotes. J R Soc Interface 2014;11:20140326.
[11] Wu FX, Wu L, Wang J, Liu J, Chen L. Transittability of complex networks and its applications to regulatory biomolecular networks. Sci Rep 2014;4:4819.
[12] Zhu H, Rao RS, Zeng T, Chen L. Reconstructing dynamic gene regulatory networks from sample-based transcriptional data. Nucleic Acids Res 2012;40:10657–67.
[13] Ma H, Zhou T, Aihara K, Chen L. Predicting time-series from short-term high-dimensional data. Int J Bifurcat Chaos 2014;24:1430033.
[14] Zeng T, Zhang W, Yu X, Liu X, Li M, Liu R, et al. Edge biomarkers for classification and prediction of phenotypes. Sci China Life Sci 2014;57:1103–14.
[15] Liu R, Wang X, Aihara K, Chen L. Early diagnosis of complex diseases by molecular biomarkers, network biomarkers, and dynamical network biomarkers. Med Res Rev 2014;34:4555–78.
[16] Ren X, Wang Y, Chen L, Zhang XS, Jin Q. EllipsoidFN: a tool for identifying a heterogeneous set of cancer biomarkers based on gene expressions. Nucleic Acids Res 2013;41:e53.
[17] Zeng T, Zhang CC, Zhang W, Liu R, Liu J, Chen L. Deciphering early development of complex diseases by progressive module network. Methods 2014;67:334–43.
[18] Wen Z, Liu ZP, Liu Z, Zhang Y, Chen L. An integrated approach to identify causal network modules of complex diseases with application to colorectal cancer. J Am Med Inform Assoc 2013;20:659–67.
[19] Yu X, Li G, Chen L. Prediction and early diagnosis of complex diseases by edge-network. Bioinformatics 2014;30:852–9.
[20] Zhang W, Zeng T, Chen L. EdgeMarker: identifying differentially correlated molecule pairs as edge-biomarkers. J Theor Biol 2014. http://dx.doi.org/10.1016/j.jtbi.2014.05.041.
[21] Liu R, Yu X, Liu X, Xu D, Aihara K, Chen L. Identifying critical transitions of complex diseases based on a single sample. Bioinformatics 2014;30:1579–86.
[22] Chen L, Liu R, Liu ZP, Li M, Aihara K. Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci Rep 2012;2:342.
[23] Teschendorff AE, Liu X, Caren H, Pollard SM, Beck S, Widschwendter M, et al. The dynamics of DNA methylation covariation patterns in carcinogenesis. PLoS Comput Biol 2014;10:e1003709.
[24] Li M, Zeng T, Liu R, Chen L. Detecting tissue-specific early-warning signals for complex diseases based on dynamical network biomarkers: study of type-2 diabetes by cross-tissue analysis. Brief Bioinform 2013;15:229–43.
