Will Computers Crash Genomics?
New technologies are making sequencing DNA easier and cheaper than ever, but the ability to analyze and store all that data is lagging
Lincoln Stein is worried. For decades, computers have improved at rates that have boggled the mind. But Stein, a bioinformaticist at the Ontario Institute for Cancer Research (OICR) in Toronto, Canada, works in a field that is moving even faster: genomics.

The cost of sequencing DNA has taken a nosedive in the decade since the human genome was published, and it is now dropping by 50% every 5 months. The amount of sequence available to researchers has consequently skyrocketed, setting off warnings about a "data tsunami." A single DNA sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project. Computers are central to archiving and analyzing this information, notes Stein, but their processing power isn't increasing fast enough, and their costs are decreasing too slowly, to keep up with the deluge. "The torrent of DNA data and the need to analyze it will swamp our storage systems and crush our computer clusters," Stein predicted last year in the journal Genome Biology.

Funding agencies have neglected bioinformatics needs, Stein and others argue. "Traditionally, the U.K. and the U.S. have not invested in analysis; instead, the focus has been investing in data generation," says computational biologist Chris Ponting of the University of Oxford in the United Kingdom. "That's got to change."

Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that's assuming there's someone who can do it; bioinformaticists are in short supply everywhere. "I worry there won't be enough people around to do the analysis," says Ponting.

Recent reviews, editorials, and scientists' blogs have echoed these concerns (see Perspective on p. 728). They stress the need for new software and infrastructures to deal with computational and storage issues.

In the meantime, bioinformaticists are trying new approaches to handle the data onslaught. Some are heading for the clouds: cloud computing, that is, a pay-as-you-go service, accessible from one's own desktop, that provides rented time on a large cluster of machines that work together in parallel as fast as, or faster than, a single powerful computer. "Surviving the data deluge means computing in parallel," says Michael Schatz, a bioinformaticist at Cold Spring Harbor Laboratory (CSHL) in New York.

Dizzy with data

The balance between sequence generation and the ability to handle the data began to shift after 2005. Until then, and even today, most DNA sequencing occurred in large centers, well equipped with the computer personnel and infrastructure to support the analysis of a genome's data. DNA sequences churned out by these centers were deposited and stored in centralized public databases, such as those run by the European Bioinformatics Institute (EBI) in Hinxton, U.K., and the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. Researchers elsewhere could then download the data for study.

By 2007, NCBI had 150 billion bases of genetic information stored in its GenBank database. Then several companies in quick succession introduced "next-generation" machines, faster sequencers that spit out data more cheaply. But the technologies behind these machines generate such short stretches of sequence, typically just 50
11 FEBRUARY 2011
Published by AAAS
NEWSFOCUS
first results from marrying cloud computing and genomics. They wanted to identify common sites of DNA variation known as single-nucleotide polymorphisms (SNPs), but to do so they needed to hunt through short sequences of human DNA totaling an amount equivalent to 38 copies of the human genome. With the help of a cloud-based cluster of 320 computers, they identified 3.7 million SNPs in less than 4 hours and for less than $100. "We estimate it would have taken a single computer several hundred hours for the analysis," says Schatz.

At the Biology of Genomes meeting, Langmead and Schatz unveiled two new cloud-computing initiatives. Langmead described a computer program called Myrna that determines the differential expression of genes from RNA sequence data and is designed for the parallel processing performed by cloud-computing facilities. Schatz introduced another program, Contrail, that can assemble genomes from data that next-generation sequencing machines generate and deposit into a cloud.

[Figure: Cost and Growth of Bases. Axes: cost per million base pairs of sequence ($, log scale) and billions of bases in GenBank, 2000 to 2009. Source: NCBI.]
So much for so little. The decline in sequencing costs (red line) has led to a surge in stored DNA data.

Low cost and speed aren't the only advantages of the cloud approach, says Langmead. "The cloud user never has to replace hard drives, renew service contracts, worry about electricity usage and cooling, deal with flooding or other natural disasters, et cetera," he points out. For small labs that lack their own powerful computer clusters, cloud computing may represent "the democratization of computation," says Schatz.

But cloud computing "is not mature," cautions Vivien Bonazzi, program director for computational biology and bioinformatics at NHGRI. Putting data into a cloud cluster by way of the Internet can take many hours, even days, so cloud providers and their customers often resort to the "sneaker net": overnight shipment of data-laden hard drives. And with the exception of Galaxy, Myrna, and a few other computer tools, not much genomics software is configured for the massively parallel processing approach taken by cloud computers. "It is currently too difficult to develop cloud software that's truly easy to use," says Langmead.

Also, cloud computing works best if an analysis can be divided into many separate tasks handled by multiple processors. But the connections among the cloud's processors can be fairly slow, so computations requiring processors to talk to each other can get bogged down, says Langmead. Some researchers worry that the burgeoning cloud-computing industry won't agree on standards that will allow for connections between clouds, such that data stored on one cloud can be accessible to another. "Cloud computing is hot and sexy," says Bonazzi. "But it's not the answer to everything."

Storage issues

Cloud computing offers a possible solution to other problems facing the bioinformatics community: data storage and transfer. Because storage costs are dropping much more slowly than the costs of generating sequence data, "there will come a point when we will have to spend an exponential amount on data storage," says Birney.

That has created pressure to let go of the field's long-standing tendency to archive all raw sequence data. Because the raw material from next-generation machines is in the form of high-resolution images, it soaks up huge amounts of computer storage. So scientists are considering discarding the original image files once they produce the preliminarily processed sequence data, which is more easily kept. Eventually, it may be more economical to save no raw data and just resequence a DNA sample if necessary. But for now, as to what should be kept, "there's a lot of thrashing still to happen," says Bonazzi.

Putting the data in an off-site facility could relieve some of the pressure, says OICR's Stein. The economies of scale available to large cloud-providing companies can produce significant cost savings, meaning it might be cheaper to rent transient storage space from the cloud in some cases. Storage costs at Amazon Web Services top off at 14 cents a gigabyte per month, according to Amazon's Deepak Singh. In comparison, it commonly costs 50 cents to $1 per gigabyte for high-end storage on a local system, Schatz says. For NCBI, however, it's still more cost-effective to keep GenBank and its other databases in-house, says Don Preuss of NCBI.

Putting data in a cloud may help in other ways as well. Right now, anyone wanting to analyze a genome has to download it from a public archive such as GenBank, and as these data sets get larger, such transfers become slower. Moreover, downloaded copies of these data sets, some now out of date, have proliferated around the world, each one taking up storage space that eats into bioinformatics budgets. In his vision, says Stein, you have one copy of the data located in this common cloud that everyone uses, and it won't be necessary to download or upload the data between computers for processing.

Encouraged by the genomics community, NCBI has put a copy of the data from the pilot project of the 1000 Genomes effort into off-site storage run by a cloud-computing provider. And U.S. East Coast users of Ensembl, the EBI sequence database, are automatically funneled into a cloud environment as part of a test of the strategy.

One worry about this approach is the security of the data. Data involving the health of human subjects, which is being linked more and more to genome information, requires extra precautions that make some researchers hesitant about clouds. However, at least one cloud-computing company already has clients whose human data are covered by the strict health information protection laws of the United States, so there are indications that this concern can be allayed.

All these issues came to the fore last year, when NHGRI hosted several meetings on cloud computing and on informatics and analysis, says Bonazzi. Also, at a retreat last summer, the case was made for more bioinformatics training and education.
"One thing that is clear is that as computation becomes more and more necessary throughout biomedical research, the way these [infrastructure] resources are funded will have to change to be more efficient," says Taylor.

For now, NHGRI has no programs in place to address these needs. "But they are on our radar," says Bonazzi. Like Stein, she worries about swamped storage systems and overwhelmed computer clusters. But Bonazzi remains sanguine. "Do I think these problems will be solved?" she says. "I'm optimistic."

And even Stein is trying to think positively. "I'm very good at predicting disasters that never happen," he says. "There's always sunlight above the clouds."
ELIZABETH PENNISI