Barrett et al. - 2013 - NCBI GEO archive for functional genomics data set
Received September 15, 2012; Revised October 28, 2012; Accepted October 29, 2012
*To whom correspondence should be addressed. Tel: +1 301 402 8693; Fax: +1 301 480 0109; Email: [email protected]
efforts to expand this import are in progress. Data for large collaborative projects, including Encyclopedia of DNA Elements (ENCODE) (5) and Roadmap Epigenomics (6), are deposited by Data Coordinating Centres and have dedicated data listings pages at http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html and http://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics/.

Support for next-generation sequence data
GEO has made it a priority to continue to support the microarray community as they switch to next-generation sequence technologies. Established microarray submission formats, metadata standards and administrative procedures have been modified to accommodate the new technologies. The full sequence submission guideline is provided at http://www.ncbi.nlm.nih.gov/geo/info/seq.html and supports ‘minimum information about a high-throughput sequencing experiment’ (MINSEQE) standards (http://www.fged.org/projects/minseqe/). GEO accepts sequence data for studies that examine gene expression (RNA-Seq), gene regulation and epigenomics (e.g. ChIP-Seq, methyl-Seq, DNase hypersensitivity) or other studies where measuring some form of sequence abundance or characterization is part of the study goals. GEO hosts the processed data files together with sample and study metadata; raw data files containing the original sequence reads are brokered and linked with NCBI’s Sequence Read Archive (SRA) database (7). To date, GEO has loaded >44 terabases of read data to SRA. Furthermore, several thousand processed data files have been incorporated into NCBI’s Epigenomics (8) database, where they are further curated and available to view as tracks on genome browsers; work to incorporate several thousand more tracks with reciprocal links to GEO is ongoing.

RECENT UPDATES TO SEARCH, NAVIGATE, DOWNLOAD AND ANALYSE
Much of the infrastructure, organization and search capabilities of GEO remain as previously described (9), but several recent enhancements offer the user alternative
Nucleic Acids Research, 2013, Vol. 41, Database issue D993
methods for locating, downloading and interpreting data, including:
• Sample records are indexed as a distinct entry type in the GEO DataSets database (http://www.ncbi.nlm.nih.gov/gds/), permitting users to more easily identify individual samples within a study.
• Sample characteristics are indexed separately under a new ‘Attribute’ field in the GEO DataSets database allowing more refined queries.
• A ‘similar studies’ link has been added to the GEO

GEO2R web application for identifying differentially expressed genes
A major update recently implemented by GEO was release of the GEO2R web application, available at http://www.ncbi.nlm.nih.gov/geo/geo2r/. GEO2R presents a simple interface that allows users to perform sophisticated R-based analysis of GEO data to help identify and visualize differential gene expression. The GEO2R back end uses established Bioconductor (13) R packages to transform and analyse GEO data and presents results as a table
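To make the idea of a two-group differential expression comparison concrete, the following is a minimal sketch in Python. It is not GEO2R's implementation: GEO2R runs Bioconductor R packages on the server, whereas this sketch swaps in a plain log2 transform followed by Welch's t statistic using only the Python standard library. The function names (`log2_transform`, `welch_t`, `rank_genes`), the gene labels and the intensity values are all hypothetical and chosen purely for illustration.

```python
import math
from statistics import mean, variance


def log2_transform(values):
    """Return log2 of each value; assumes strictly positive intensities."""
    return [math.log2(v) for v in values]


def welch_t(group_a, group_b):
    """Welch's t statistic (unequal variances); each group needs >= 2 values."""
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    if se == 0:
        # No within-group spread: the statistic is 0 or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(expression, treated_idx, control_idx):
    """Rank genes by |t| for treated versus control samples.

    expression maps a gene label to a list of raw intensities, one per sample;
    treated_idx and control_idx are the sample positions in each group.
    """
    rows = []
    for gene, values in expression.items():
        logged = log2_transform(values)
        treated = [logged[i] for i in treated_idx]
        control = [logged[i] for i in control_idx]
        rows.append({
            "gene": gene,
            "log2_fc": mean(treated) - mean(control),
            "t": welch_t(treated, control),
        })
    # Largest absolute statistic first, as in a typical results table.
    rows.sort(key=lambda row: abs(row["t"]), reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical intensities: samples 0-2 are controls, 3-5 are treated.
    toy = {
        "GENE_A": [100, 120, 110, 400, 420, 410],
        "GENE_B": [200, 210, 190, 205, 195, 200],
        "GENE_C": [50, 55, 45, 25, 27, 23],
    }
    for row in rank_genes(toy, treated_idx=[3, 4, 5], control_idx=[0, 1, 2]):
        print(f'{row["gene"]}\t{row["log2_fc"]:.2f}\t{row["t"]:.2f}')
```

A real analysis would additionally need p-values and multiple-testing correction, which this sketch omits; the ranking by absolute t statistic only illustrates the shape of the resulting table.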
demonstrating GEO2R functionality is available at http://www.youtube.com/watch?v=EUPmGWS8ik0.

GEO DATA RE-USE
In the last GEO update article (16), we summarized the diverse ways in which the community re-uses GEO data, including providing evidence of specific gene expression to support hypotheses, testing material for algorithm development, identifying disease predictors, developing value-added target-audience databases and generally aggregating and analyzing data in ways not anticipated by the original data generators. Although data re-use is difficult to track accurately, based on usage citations monitored internally (http://www.ncbi.nlm.nih.gov/geo/info/citations.html) and by others (17), it seems that the re-use rate is increasing. There is evidence that more scientists are using a data-driven approach to research (18), whereby the first step in a project is to combine and re-analyse public data sets to reveal previously unknown relations or uncover ever more subtle trends in the data. The novel insights gained from such analyses are formed into hypotheses that can be tested in the laboratory. Such opportunities will only increase as more and better quality data become available.

SUMMARY
The GEO database, now 12 years old, continues to grow in terms of volume, diversity of data types and usage. The database and tools continue to undergo intensive development aimed at helping users to better explore and extract meaningful information and new discoveries from GEO data. Ongoing challenges include expanding integration and cross-linking with related resources, procuring more consistent sample annotation from submitters and providing additional methods for analysing next-generation sequence data.

FUNDING
Funding for open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine.
Conflict of interest statement. None declared.

REFERENCES
1. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.
2. Microarray standards at last. (2002) Nature, 419, 323.
3. Brazma,A., Hingamp,P., Quackenbush,J., Sherlock,G., Spellman,P., Stoeckert,C., Aach,J., Ansorge,W., Ball,C.A., Causton,H.C. et al. (2001) Minimum information about a
10. Lin,J. and Wilbur,W.J. (2007) PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8, 423.
11. Geer,L.Y., Marchler-Bauer,A., Geer,R.C., Han,L., He,J., He,S., Liu,C., Shi,W. and Bryant,S.H. (2010) The NCBI BioSystems database. Nucleic Acids Res., 38, D492–D496.
12. Barrett,T., Clark,K., Gevorgyan,R., Gorelenkov,V., Gribov,E., Karsch-Mizrachi,I., Kimelman,M., Pruitt,K.D., Resenchuk,S., Tatusova,T. et al. (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 40, D57–D63.
13. Gentleman,R.C., Carey,V.J., Bates,D.M., Bolstad,B., Dettling,M.,