
Will Computers Crash Genomics?

1) Advances in DNA sequencing technology have led to an exponential increase in the amount of genetic data available, far outpacing the rate of increase in computer processing power and data storage capacity. 2) This looming data deluge threatens to overwhelm existing genetic databases and computational tools if more funding is not allocated to bioinformatics and new approaches are not developed. 3) Moving genetic data analysis to cloud computing is one strategy being explored to help address this issue, providing flexible, scalable computing resources to handle large genomic analyses.

New technologies are making sequencing DNA easier and cheaper than ever, but the ability to analyze and store all that data is lagging
Lincoln Stein is worried. For decades, computers have improved at rates that have boggled the mind. But Stein, a bioinformaticist at the Ontario Institute for Cancer Research (OICR) in Toronto, Canada, works in a field that is moving even faster: genomics. The cost of sequencing DNA has taken a nosedive in the decade since the human genome was published, and it is now dropping by 50% every 5 months. The amount of sequence available to researchers has consequently skyrocketed, setting off warnings about a data tsunami. A single DNA sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project. Computers are central to archiving and analyzing this information, notes Stein, but their processing power isn't increasing fast enough, and their costs are decreasing too slowly, to keep up with the deluge. The torrent of DNA data and the need to analyze it "will swamp our storage systems and crush our computer clusters," Stein predicted last year in the journal Genome Biology.
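The mismatch Stein describes is a race between two exponentials. A minimal back-of-envelope sketch in Python, assuming a Moore's-law-style halving of compute cost roughly every two years (that halving period is an illustrative assumption, not a figure from the article; the 5-month sequencing figure is):

```python
# Back-of-envelope comparison of the two exponentials described above.
SEQ_HALVING_MONTHS = 5    # sequencing cost halves every 5 months (from the article)
CPU_HALVING_MONTHS = 24   # assumed compute-cost halving period (illustrative)

def relative_cost(months: float, halving_period: float) -> float:
    """Cost relative to today after `months`, given a halving period."""
    return 0.5 ** (months / halving_period)

for years in (1, 3, 5):
    m = years * 12
    seq = relative_cost(m, SEQ_HALVING_MONTHS)
    cpu = relative_cost(m, CPU_HALVING_MONTHS)
    # The ratio shows how much faster data gets cheap than compute does.
    print(f"{years} yr: sequencing cost x{seq:.4f}, compute cost x{cpu:.2f}, "
          f"gap widens x{cpu / seq:.0f}")
```

Under these assumptions the gap grows several-hundred-fold within five years, which is the shape of the problem Stein is warning about.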

Funding agencies have neglected bioinformatics needs, Stein and others argue. "Traditionally, the U.K. and the U.S. have not invested in analysis; instead, the focus has been investing in data generation," says computational biologist Chris Ponting of the University of Oxford in the United Kingdom. "That's got to change." Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that's assuming there's someone who can do it; bioinformaticists are in short supply everywhere. "I worry there won't be enough people around to do the analysis," says Ponting.

Recent reviews, editorials, and scientists' blogs have echoed these concerns (see Perspective on p. 728). They stress the need for new software and infrastructures to deal with computational and storage issues. In the meantime, bioinformaticists are trying new approaches to handle the data onslaught. Some are heading for the clouds: cloud computing, that is, a pay-as-you-go service, accessible from one's own desktop, that provides rented time on a large cluster of machines that work together in parallel as fast as, or faster than, a single powerful computer. "Surviving the data deluge means computing in parallel," says Michael Schatz, a bioinformaticist at Cold Spring Harbor Laboratory (CSHL) in New York.

Dizzy with data

The balance between sequence generation and the ability to handle the data began to shift after 2005. Until then, and even today, most DNA sequencing occurred in large centers, well equipped with the computer personnel and infrastructure to support the analysis of a genome's data. DNA sequences churned out by these centers were deposited and stored in centralized public databases, such as those run by the European Bioinformatics Institute (EBI) in Hinxton, U.K., and the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. Researchers elsewhere could then download the data for study. By 2007, NCBI had 150 billion bases of genetic information stored in its GenBank database.

Then several companies in quick succession introduced next-generation machines, faster sequencers that spit out data more cheaply. But the technologies behind these machines generate such short stretches of sequence, typically just 50 to 120 bases, that far more sequencing is required to assemble those fragments into a cohesive genome, which in turn greatly ups the computer memory and processing required. It once was enough to sequence a genome 10 times over to put together an accurate genome; now it takes 40 or more passes. In addition, the next-generation machines produce their sequence data at incredible rates that devour computer memory and storage. "We all had a moment of panic when we saw the projections for next-generation sequencing," recalls Schatz.
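A rough sense of what the jump from 10 to 40 passes means for data volume, assuming a human genome of about 3 billion bases, reads of roughly 100 bases, and one byte per base as a crude storage proxy (all three figures are illustrative assumptions, not from the article):

```python
# Back-of-envelope data volume for assembling a human-sized genome.
GENOME_BASES = 3e9   # assumed genome size
READ_LENGTH = 100    # assumed short-read length

def raw_sequence(coverage):
    """Return (total bases, number of reads) needed at a given coverage depth."""
    bases = GENOME_BASES * coverage
    return bases, bases / READ_LENGTH

for era, cov in (("~10x coverage, earlier era", 10), ("~40x coverage, short reads", 40)):
    bases, reads = raw_sequence(cov)
    # One byte per base is a crude proxy; real formats carry quality scores too.
    print(f"{era}: {bases/1e9:.0f} gigabases, ~{reads/1e6:.0f} million reads, "
          f"roughly {bases/1e9:.0f} GB of raw sequence")
```

The point is not the exact numbers but the multiplier: quadrupling coverage quadruples the raw data for every genome sequenced.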

Those projections are already being realized. A massive study of genetic variation, the 1000 Genomes Project, generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21-year existence. And ambitious projects like ENCODE, which aims to characterize every DNA sequence in the human genome that has a function, offer jaw-dropping data challenges. Among other efforts, the project has investigated dozens of cell lines to identify every DNA sequence to which 40 transcription factors bind, yielding a complex matrix of data that needs to be not only stored but also represented in a way that makes sense to researchers. "We're moving very rapidly from not having enough data to going, 'Oh, where do we start?'" says EBI bioinformaticist Ewan Birney.

Moreover, as so-called third-generation machines, which promise even cheaper, faster production of DNA sequences (Science, 5 March 2010, p. 1190), become available, more, and smaller, labs will start genome projects of their own. As a result, the amount and kinds of DNA-related data available will grow even faster, and the sheer volume could overwhelm some databases and software programs, says Katherine Pollard, a biostatistician at the Gladstone Institutes of the University of California (UC), San Francisco. Take Genome Browser, a popular UC Santa Cruz Web site. The site's programs can compare 50 vertebrate genomes by aligning their sequences and looking for conserved or nonconserved regions, which reveal clues about the evolutionary history of the human genome. But the software, like most available genome analyzers, won't scale to thousands of genomes, says Pollard.
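One crude way to see the scaling pressure Pollard describes is to count all-pairs comparisons among genomes. This is a toy proxy only: the browser's real alignment pipeline is reference-anchored and works differently, but the combinatorial blow-up gives a feel for why tools built around dozens of genomes strain at thousands.

```python
# Toy illustration: how pairwise genome-to-genome comparisons grow.
# Assumption: all-pairs comparison as a proxy for analysis effort;
# this is not how the UCSC Genome Browser actually computes alignments.
from math import comb

for n in (50, 1_000, 5_000):
    print(f"{n:>5} genomes -> {comb(n, 2):>12,} pairwise comparisons")
```

Fifty genomes give about 1,200 pairs; five thousand give over 12 million, before any growth in the size of each comparison is counted.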
The spread of sequencing technology to smaller labs could also increase the disconnect between data generation and analysis. "The new technology is thought of [as] being democratizing, but the analytical capacity is still focused in the hands of a few," warns Ponting. Although large centers may be stretching their computing, and their labor power, to new limits, they basically still have the means to interpret what they find. But small labs, many of which underestimate computational needs when budgeting time and resources for a sequencing project, could be in over their heads, he warns.

Clouds on the horizon

James Taylor, a bioinformaticist at Emory University in Atlanta, saw some of the demands for data analysis coming. In 2005, he and Anton Nekrutenko of Pennsylvania State University (Penn State), University Park, pulled together various computer genomics tools and databases under one easy-to-use framework. The goal was to make collaborations between experimental and computational researchers easier and more efficient, Taylor explains. They created Galaxy, a software package that can be downloaded to a personal computer or accessed on Penn State's computers via any Internet-connected machine. Galaxy allows any investigator to do basic genome analyses without in-house computer clusters or bioinformaticists.

The public portal for Galaxy works well, but, as a shared resource, it can get bogged down, says Taylor. So last year, he and his colleagues tried a cloud-computing approach to Galaxy. Cloud computing can mean various things, including simply renting off-site computing memory to store data, running one's own software on another facility's computers, or exploiting software programs developed and hosted by others. Amazon Web Services and Microsoft are among the heavyweights running cloud-computing facilities, and there are not-for-profit ones as well, such as the Open Cloud Consortium. For Taylor's team, entering the cloud meant developing a version of Galaxy that would tap into rented off-site computing power. They set up a virtual computer that could run the Galaxy software on remote hardware using data uploaded temporarily into the cloud's off-site computers.

To test their strategy, they worked with Penn State colleague Kateryna Makova, who wanted to look at how the genomes of mitochondria vary from cell to cell in an individual. That involved sequencing the mitochondrial genomes from the blood and cheek swabs of three mother-child pairs, generating in one study some 1.8 gigabases of DNA sequence, about 1/10 of the amount of information generated for the first human genome. Analyzing these data on the Penn State computers would have been a long and costly process. But when they uploaded their data to the cloud system, the processing took just an hour and cost $20, Taylor reported in May 2010 at the Biology of Genomes meeting in Cold Spring Harbor, New York. "This is a particularly cost-effective solution when you need a lot of computing power on an occasional basis," he says. With the help of the cloud, he has access to many computers but doesn't have the overhead costs of maintaining a powerful computer network in-house. "We're going to encourage more people to move to the cloud," he adds.
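The appeal is the rent-only-when-needed arithmetic. A minimal sketch of that reasoning follows; the machine count and hourly price are hypothetical placeholders, and only the roughly $20-per-run figure and the occasional-use pattern come from the article.

```python
# Toy rent-vs-own comparison for occasional, bursty genome analyses.
# Assumptions (hypothetical): 40 rented machines at $0.50/hour for one hour,
# about a dozen analyses per year. Only the ~$20 per run matches the article.
def burst_cost(n_machines: int, rate_per_hour: float, hours: float) -> float:
    """Cost of renting a cluster only for the duration of one analysis."""
    return n_machines * rate_per_hour * hours

per_run = burst_cost(n_machines=40, rate_per_hour=0.50, hours=1.0)
print(f"per run: ${per_run:.2f}; per year at 12 runs: ${per_run * 12:.2f}")
# An in-house cluster of similar size carries hardware, power, cooling, and
# administration costs year-round, and sits idle between runs.
```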

CSHL's Schatz and Ben Langmead, a computer scientist at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, are already there and are helping to make that shift possible for others. In 2009, the pair published one of the first results from marrying cloud computing and genomics. They wanted to identify common sites of DNA variation known as single-nucleotide polymorphisms (SNPs), but to do so they needed to hunt through short sequences of human DNA totaling an amount equivalent to 38 copies of the human genome. With the help of a cloud-based cluster of 320 computers, they identified 3.7 million SNPs in less than 4 hours and for less than $100. "We estimate it would have taken a single computer several hundred hours" for the analysis, says Schatz.

At the Biology of Genomes meeting, Langmead and Schatz unveiled two new cloud-computing initiatives. Langmead described a computer program called Myrna that determines the differential expression of genes from RNA sequence data and is designed for the parallel processing performed by cloud-computing facilities. Schatz introduced another program, Contrail, that can assemble genomes from data that next-generation sequencing machines generate and deposit into a cloud.

[Figure: Cost and growth of bases, 2000-2009. "So much for so little": the decline in sequencing costs (cost per million base pairs, log scale) has led to a surge in stored DNA data (GenBank, billions of bases). Source: NCBI]

Low cost and speed aren't the only advantages of the cloud approach, says Langmead. The cloud user never has to "replace hard drives, renew service contracts, worry about electricity usage and cooling, deal with flooding or other natural disasters, et cetera," he points out. For small labs that lack their own powerful computer clusters, cloud computing may represent "the democratization of computation," says Schatz.

But "cloud computing is not mature," cautions Vivien Bonazzi, program director for computational biology and bioinformatics at NHGRI. Putting data into a cloud cluster by way of the Internet can take many hours, even days, so cloud providers and their customers often resort to the "sneaker net": overnight shipment of data-laden hard drives. And with the exception of Galaxy, Myrna, and a few other computer tools, not much genomics software is configured for the massively parallel processing approach taken by cloud computers. "It is currently too difficult to develop cloud software that's truly easy to use," says Langmead. Also, cloud computing works best if an analysis can be divided into many separate tasks handled by multiple processors. But the connections among the cloud's processors can be fairly slow, so computations requiring processors to talk to each other can get bogged down, says Langmead. Some researchers worry that the burgeoning cloud-computing industry won't agree on standards that will allow for connections between clouds, such that data stored on one cloud can be accessible to another. "Cloud computing is hot and sexy," says Bonazzi. "But it's not the answer to everything."

Storage issues

Cloud computing offers a possible solution to other problems facing the bioinformatics community: data storage and transfer. Because storage costs are dropping much more slowly than the costs of generating sequence data, "there will come a point when we will have to spend an exponential amount on data storage," says Birney.

That has created pressure to let go of the field's long-standing tendency to archive all raw sequence data. Because the raw material from next-generation machines is in the form of high-resolution images, it soaks up huge amounts of computer storage. So scientists are considering discarding the original image files once they produce the preliminarily processed sequence data, which is more easily kept. Eventually, it may be more economical to save no raw data and just resequence a DNA sample if necessary. But for now, as to what should be kept, "there's a lot of thrashing still to happen," says Bonazzi.

Putting the data in an off-site facility could relieve some of the pressure, says OICR's Stein. The economies of scale available to large cloud-providing companies can produce significant cost savings, meaning it might be cheaper to rent transient storage space from the cloud in some cases. Storage costs at Amazon Web Services top out at 14 cents a gigabyte per month, according to Amazon's Deepak Singh. In comparison, it commonly costs 50 cents to $1 per gigabyte for high-end storage on a local system, Schatz says. For NCBI, however, it's still more cost-effective to keep GenBank and its other databases in-house, says Don Preuss of NCBI.
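A quick break-even sketch for the rent-versus-buy storage comparison above, treating the local 50-cents-to-$1-per-gigabyte figure as a one-time hardware cost set against Amazon's quoted monthly rental rate (reading the local figure as a one-time cost is an assumption on my part):

```python
# Break-even months before cloud storage rental overtakes a one-time
# local purchase. Power, cooling, and admin for local storage are ignored,
# which flatters the local option.
CLOUD_PER_GB_PER_MONTH = 0.14   # Amazon rate quoted in the article

for local_per_gb in (0.50, 1.00):
    breakeven = local_per_gb / CLOUD_PER_GB_PER_MONTH
    print(f"local at ${local_per_gb:.2f}/GB ~ break-even after "
          f"{breakeven:.1f} months of cloud rental")
# Short-lived, transient data sets favor renting; long-lived archives such
# as GenBank favor in-house storage, consistent with NCBI's choice above.
```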

Putting data in a cloud may help in other ways as well. Right now, anyone wanting to analyze a genome has to download it from a public archive such as GenBank, and as these data sets get larger, such transfers become slower. Moreover, downloaded copies of these data sets, some now out of date, have proliferated around the world, each one taking up storage space that eats into bioinformatics budgets. In his vision, says Stein, "you have one copy of the data located in this common cloud that everyone uses" and it won't be necessary to download or upload the data between computers for processing. Encouraged by the genomics community, NCBI has put a copy of the data from the pilot project of the 1000 Genomes effort into off-site storage run by a cloud-computing provider. And U.S. East Coast users of Ensembl, the EBI sequence database, are automatically funneled into a cloud environment as part of a test of the strategy.

One worry about this approach is the security of the data. Data involving the health of human subjects, which is being linked more and more to genome information, requires extra precautions that make some researchers hesitant about clouds. However, at least one cloud-computing company already has clients whose human data are covered by the strict health information protection laws of the United States, so there are indications that this concern can be allayed.

All these issues came to the fore last year, when NHGRI hosted several meetings on cloud computing and on informatics and analysis, says Bonazzi. Also, at a retreat last summer, the case was made for more bioinformatics training and education. "One thing that is clear is that as computation becomes more and more necessary throughout biomedical research, the way these [infrastructure] resources are funded will have to change to be more efficient," says Taylor. For now, NHGRI has no programs in place to address these needs. "But they are on our radar," says Bonazzi. Like Stein, she worries about swamped storage systems and overwhelmed computer clusters. But Bonazzi remains sanguine. "Do I think these problems will be solved?" she says. "I'm optimistic." And even Stein is trying to think positively. "I'm very good at predicting disasters that never happen," he says. "There's always sunlight above the clouds."

ELIZABETH PENNISI

Science, Vol. 331, 11 February 2011, pp. 666-668

668

11 FEBRUARY 2011

VOL 331 SCIENCE www.sciencemag.org


Published by AAAS

Downloaded from www.sciencemag.org on February 16, 2011

Cost ($)
