
Published online 27 November 2012
Nucleic Acids Research, 2013, Vol. 41, Database issue D991–D995
doi:10.1093/nar/gks1193

NCBI GEO: archive for functional genomics data sets—update
Tanya Barrett1,*, Stephen E. Wilhite1, Pierre Ledoux1, Carlos Evangelista1,
Irene F. Kim1, Maxim Tomashevsky1, Kimberly A. Marshall1, Katherine H. Phillippy1,
Patti M. Sherman1, Michelle Holko1, Andrey Yefanov1, Hyeseung Lee1, Naigong Zhang1,
Cynthia L. Robertson1, Nadezhda Serova1, Sean Davis2 and Alexandra Soboleva1


1National Center for Biotechnology Information, National Library of Medicine and 2Molecular Genetics Section, Genetics Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA

Received September 15, 2012; Revised October 28, 2012; Accepted October 29, 2012

ABSTRACT

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

INTRODUCTION

The Gene Expression Omnibus (GEO) repository (1) archives and freely distributes microarray, next-generation sequencing (NGS) and other forms of high-throughput functional genomic data. The database is built and maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the campus of the National Institutes of Health in Bethesda, MD, USA. Data in GEO represent original research deposited by the scientific community, often in compliance with grant or journal directives (2) that require data to be made publicly available in a MIAME-supportive (3) database. As a result, GEO now has supporting data and links to almost 20 000 published manuscripts. Together with ArrayExpress (4), data for >1 million samples are currently available in the public domain.

In addition to serving as a public archive, GEO provides tools to help users identify, analyse and visualize data relevant to their specific interests. These tools include a powerful search engine that supports complex fielded queries, sample comparison applications and gene expression profile charts. The GEO database continues to grow and is being actively developed towards facilitating data mining and discovery; this article provides an update of the current status and recent improvements.

GEO CONTENT

At the time of writing, the GEO database hosts >32 000 public series (study records) submitted directly by 13 000 laboratories, comprising 800 000 samples derived from >1600 organisms. As depicted in Figure 1, the overall submission rate continues to grow; in 2011 alone, >6800 new series were processed, a 22% increase over the previous year. The data types archived in GEO mirror evolving trends in technology and methodologies used by the functional genomics community. ‘Expression profiling by array’ continues to be the most common study type submitted to GEO by an order of magnitude, although its growth rate is slowing. Next-generation sequence submission rates have been rapidly increasing since 2008; interestingly, methods like chromatin immunoprecipitation by sequencing (ChIP-seq; included under ‘genome binding/occupancy profiling by NGS’ in Figure 1) are increasing at such a rate that they are now submitted at a higher frequency than their array-based counterpart ChIP–chip. Meanwhile, traditional SAGE (Serial Analysis of Gene Expression) submissions are now infrequent.

Almost all submissions are deposited by individual laboratories or by microarray facilities on behalf of their clients. Some data are imported from ArrayExpress;

*To whom correspondence should be addressed. Tel: +1 301 402 8693; Fax: +1 301 480 0109; Email: [email protected]

Published by Oxford University Press 2012.


This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
[email protected].

Figure 1. Distribution of the number and types of selected studies released by GEO each year since inception. Users can explore and download historical submission numbers using the ‘history’ page at http://www.ncbi.nlm.nih.gov/geo/summary/?type=history, and can construct GEO DataSets database queries for specific data types and date ranges using the ‘DataSet type’ and ‘publication date’ fields as described at http://www.ncbi.nlm.nih.gov/geo/info/qqtutorial.html.
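
The fielded queries mentioned in the Figure 1 caption can also be composed programmatically. The following is a minimal Python sketch, not part of the original article, that builds such a query and wraps it in an NCBI E-utilities `esearch` URL for the GEO DataSets database (`db=gds`). The helper names `gds_query` and `esearch_url` are hypothetical, and restricting results to series with `gse[Entry Type]` is an assumption made for illustration. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gds_query(dataset_type, start, end, entry_type="gse"):
    """Build a fielded GEO DataSets query (hypothetical helper).

    Combines the 'DataSet Type' and 'Publication Date' fields named in the
    caption; the 'Entry Type' restriction is an illustrative assumption.
    """
    return (
        f'"{dataset_type}"[DataSet Type] AND '
        f"{start}:{end}[Publication Date] AND "
        f"{entry_type}[Entry Type]"
    )


def esearch_url(term, retmax=20):
    """Wrap a query term in an esearch URL against db=gds."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Example: array-based expression series published during 2011.
term = gds_query("expression profiling by array", "2011/01/01", "2011/12/31")
url = esearch_url(term)
print(term)
print(url)
```

Keeping query construction separate from URL construction means the same term can be pasted directly into the GEO DataSets search box for checking.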

efforts to expand this import are in progress. Data for large collaborative projects, including Encyclopedia of DNA Elements (ENCODE) (5) and Roadmap Epigenomics (6), are deposited by Data Coordinating Centres and have dedicated data listings pages at http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html and http://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics/.

Support for next-generation sequence data

GEO has made it a priority to continue to support the microarray community as they switch to next-generation sequence technologies. Established microarray submission formats, metadata standards and administrative procedures have been modified to accommodate the new technologies. The full sequence submission guideline is provided at http://www.ncbi.nlm.nih.gov/geo/info/seq.html and supports ‘minimum information about a high-throughput sequencing experiment’ (MINSEQE) standards (http://www.fged.org/projects/minseqe/). GEO accepts sequence data for studies that examine gene expression (RNA-Seq), gene regulation and epigenomics (e.g. ChIP-Seq, methyl-Seq, DNase hypersensitivity) or other studies where measuring some form of sequence abundance or characterization is part of the study goals. GEO hosts the processed data files together with sample and study metadata; raw data files containing the original sequence reads are brokered and linked with NCBI’s Sequence Read Archive (SRA) database (7). To date, GEO has loaded >44 terabases of read data to SRA. Furthermore, several thousand processed data files have been incorporated into NCBI’s Epigenomics (8) database, where they are further curated and available to view as tracks on genome browsers; work to incorporate several thousand more tracks with reciprocal links to GEO is ongoing.

RECENT UPDATES TO SEARCH, NAVIGATE, DOWNLOAD AND ANALYSE

Much of the infrastructure, organization and search capabilities of GEO remain as previously described (9), but several recent enhancements offer the user alternative

methods for locating, downloading and interpreting data, including:
• Sample records are indexed as a distinct entry type in the GEO DataSets database (http://www.ncbi.nlm.nih.gov/gds/), permitting users to more easily identify individual samples within a study.
• Sample characteristics are indexed separately under a new ‘Attribute’ field in the GEO DataSets database, allowing more refined queries.
• A ‘similar studies’ link has been added to the GEO DataSets database. These links help users retrieve additional studies relevant to their area of interest. The links are computed on series PubMed citations using the same algorithm as PubMed’s ‘related articles’ links (10).
• The ‘find pathways’ feature on GEO Profiles (http://www.ncbi.nlm.nih.gov/geoprofiles/) retrievals allows users to map genes to a frequency-weighted list of pathways in NCBI’s BioSystems database (11), helping to characterize lists of genes.
• The ‘GEO repository browser’ (http://www.ncbi.nlm.nih.gov/geo/browse/) has undergone significant re-design. The browser has tabs containing tables that list series, sample, platform and DataSet records. The tables now include more auxiliary information that can be searched and filtered, as well as links to related records and supplementary file downloads. Tables can be exported and include further information not displayed on the browser, including corresponding PubMed identifiers and related SRA accessions.
• The ‘my submissions’ page has been re-designed so that submitters can more easily track, browse and filter their deposits. It also serves as a gateway for performing updates and status edits.
• All GEO series are now brokered to NCBI’s BioProject database (12). The BioProject database enables users to concurrently search for projects hosted by various databases at NCBI, including GenBank whole genome sequencing projects and dbGaP controlled access studies.
• More proactive approaches for acquiring citation information have been implemented. Reciprocal links between GEO series records and corresponding articles in PubMed provide extra context to the data and enhance navigation to related data domains, including to free full-text versions of the article in PubMed Central where available. GEO uses several strategies to procure citation information including, most recently, a statement on series records that highlights when a citation is missing with an invitation to provide that information. When clicked, the invitation initiates either a dialogue box that enables direct provision of the PubMed identifier (for logged in submitters) or an email pre-populated with instructions on how to send citation information to GEO (for any user).
• FTP site re-design. Although transparent to users, the organization of data on the FTP site has been upgraded to a virtual file system, implemented by Filesystem in Userspace (FUSE), offering greater flexibility in how data are packaged.

GEO2R web application for identifying differentially expressed genes

A major update recently implemented by GEO was release of the GEO2R web application, available at http://www.ncbi.nlm.nih.gov/geo/geo2r/. GEO2R presents a simple interface that allows users to perform sophisticated R-based analysis of GEO data to help identify and visualize differential gene expression. The GEO2R back end uses established Bioconductor (13) R packages to transform and analyse GEO data and presents results as a table of genes that is ordered by significance and can be visualized with GEO Profile graphics. Unlike GEO’s other DataSet analysis tools [described in (9)], GEO2R does not rely on curated DataSet records and interrogates original submitter-supplied data directly. Over 90% of GEO studies may be analysed this way. This expands the utility of the database to a much wider audience, allowing a greater proportion of GEO data to be analysed in a timely manner and with more flexibility in terms of what groups of samples to compare and what type of analysis to perform.

Implementation and data flow

On the web interface, after the user specifies the series they want to analyse, a table populated with sample characteristics appears (Figure 2). The user designates up to 10 sample groups to compare and the type of analysis to perform. Users can accept default analysis settings, or they can choose to apply alternative P-value adjustments, force or override log transformation of input data or select alternative gene annotation categories. These parameters are passed to the back end, where a ‘GEOquery’ (14) call loads the corresponding SeriesMatrix file and platform annotation files via FTP and returns the ExpressionSet object and contrasts. These are the input for two R scripts: ‘boxplot’, which draws a boxplot of the distribution of expression values of selected samples, helping users to determine whether the data are suitable for analysis, and ‘limma’ (15), which performs the topTable computation to extract a table of the top-ranked genes. The ‘limma’ results are processed according to the type of output requested, formatted in JSON and then used to create and populate html tables of the top 250 genes ranked by P-value. The results table contains various categories of statistics, including P-values, t-statistics and fold change, as well as gene annotations, including gene symbols, gene names, Gene Ontology (GO) terms and chromosome locations. The expression pattern of each gene in the table can be visualized by clicking the row to reveal expression profile graphs, or the complete set of ordered results can be downloaded as a table. Alternatively, if users are not interested in performing differential expression analysis but rather only want to see the expression profile of a specific gene, they can bypass all the above and simply enter the Platform gene ID to visualize that profile. To help users replicate their analyses, the native R script generated in each session is provided. This information can be saved as a reference for how results were calculated or used to reproduce GEO2R top genes results. A YouTube video tutorial

Figure 2. GEO2R screenshots. After selecting ‘analyse with GEO2R’ on series record GSE18388 (19), the user is presented with a table of the
samples in that study and their descriptions (Panel 1). In this case, two sample groups are defined, and four samples are assigned to each group. The
user can view the distribution of the sample values using the boxplot feature (Panel 2) and click the ‘Top250’ button to retrieve a table of the top 250
differentially expressed genes with statistics and gene annotation (Panel 3). The top hit is clicked to reveal the expression profile chart for that gene.
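
The two-group comparison shown in Figure 2 can be illustrated with a small self-contained Python sketch. It is not GEO2R's code: GEO2R runs R with limma, whose moderated t-statistics use empirical Bayes shrinkage, whereas this sketch substitutes an ordinary Welch t-statistic so that it needs only the standard library. The gene IDs, sample IDs and expression values are invented, and the function names are hypothetical. The sketch ranks genes by absolute t-statistic, keeps the top `n` and serializes the rows as JSON, loosely mirroring the ranked table that the article describes.

```python
import json
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent groups.

    Assumes each group has at least two values and that the groups are
    not both constant (otherwise the standard error is zero).
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def top_table(expr, group_a, group_b, n=250):
    """Rank genes by |t| and return the top n rows as dicts.

    expr maps gene ID -> {sample ID: log2 expression value}.
    """
    rows = []
    for gene, values in expr.items():
        a = [values[s] for s in group_a]
        b = [values[s] for s in group_b]
        rows.append({
            "gene": gene,
            "logFC": round(mean(a) - mean(b), 3),  # difference of log2 means
            "t": round(welch_t(a, b), 3),
        })
    rows.sort(key=lambda r: abs(r["t"]), reverse=True)
    return rows[:n]


# Invented data: two groups of four samples, as in the Figure 2 example.
group_a = ["S1", "S2", "S3", "S4"]
group_b = ["S5", "S6", "S7", "S8"]
expr = {
    "geneA": dict(zip(group_a + group_b, [8.0, 8.2, 7.9, 8.1, 6.0, 6.1, 5.9, 6.2])),
    "geneB": dict(zip(group_a + group_b, [7.0, 7.1, 6.9, 7.0, 7.0, 6.9, 7.1, 7.0])),
    "geneC": dict(zip(group_a + group_b, [5.0, 5.5, 4.8, 5.3, 4.6, 5.1, 4.4, 4.9])),
}

table = top_table(expr, group_a, group_b, n=2)
payload = json.dumps(table, indent=2)
print(payload)
```

With only four samples per group, per-gene variance estimates are noisy, which is the situation that moderated statistics such as limma's are designed to address; the plain Welch statistic here is only a stand-in.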

demonstrating GEO2R functionality is available at http://www.youtube.com/watch?v=EUPmGWS8ik0.

GEO DATA RE-USE

In the last GEO update article (16), we summarized the diverse ways in which the community re-uses GEO data, including providing evidence of specific gene expression to support hypotheses, testing material for algorithm development, identifying disease predictors, developing value-added target-audience databases and generally aggregating and analyzing data in ways not anticipated by the original data generators. Although data re-use is difficult to track accurately, based on usage citations monitored internally (http://www.ncbi.nlm.nih.gov/geo/info/citations.html) and by others (17), it seems that the re-use rate is increasing. There is evidence that more scientists are using a data-driven approach to research (18), whereby the first step in a project is to combine and re-analyse public data sets to reveal previously unknown relations or uncover ever more subtle trends in the data. The novel insights gained from such analyses are formed into hypotheses that can be tested in the laboratory. Such opportunities will only increase as more and better quality data become available.

SUMMARY

The GEO database, now 12 years old, continues to grow in terms of volume, diversity of data types and usage. The database and tools continue to undergo intensive development aimed at helping users to better explore and extract meaningful information and new discoveries from GEO data. Ongoing challenges include expanding integration and cross-linking with related resources, procuring more consistent sample annotation from submitters and providing additional methods for analysing next-generation sequence data.

FUNDING

Funding for open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Conflict of interest statement. None declared.

REFERENCES

1. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.
2. Microarray standards at last. (2002) Nature, 419, 323.
3. Brazma,A., Hingamp,P., Quackenbush,J., Sherlock,G., Spellman,P., Stoeckert,C., Aach,J., Ansorge,W., Ball,C.A., Causton,H.C. et al. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29, 365–371.
4. Parkinson,H., Sarkans,U., Kolesnikov,N., Abeygunawardena,N., Burdett,T., Dylag,M., Emam,I., Farne,A., Hastings,E., Holloway,E. et al. (2011) ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res., 39, D1002–D1004.
5. Bernstein,B.E., Birney,E., Dunham,I., Green,E.D., Gunter,C. and Snyder,M. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.
6. Bernstein,B.E., Stamatoyannopoulos,J.A., Costello,J.F., Ren,B., Milosavljevic,A., Meissner,A., Kellis,M., Marra,M.A., Beaudet,A.L., Ecker,J.R. et al. (2010) The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol., 28, 1045–1048.
7. Shumway,M., Cochrane,G. and Sugawara,H. (2010) Archiving next generation sequencing data. Nucleic Acids Res., 38, D870–D871.
8. Fingerman,I.M., McDaniel,L., Zhang,X., Ratzat,W., Hassan,T., Jiang,Z., Cohen,R.F. and Schuler,G.D. (2011) NCBI Epigenomics: a new public resource for exploring epigenomic data sets. Nucleic Acids Res., 39, D908–D912.
9. Barrett,T., Troup,D.B., Wilhite,S.E., Ledoux,P., Rudnev,D., Evangelista,C., Kim,I.F., Soboleva,A., Tomashevsky,M., Marshall,K.A. et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res., 37, D885–D890.
10. Lin,J. and Wilbur,W.J. (2007) PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8, 423.
11. Geer,L.Y., Marchler-Bauer,A., Geer,R.C., Han,L., He,J., He,S., Liu,C., Shi,W. and Bryant,S.H. (2010) The NCBI BioSystems database. Nucleic Acids Res., 38, D492–D496.
12. Barrett,T., Clark,K., Gevorgyan,R., Gorelenkov,V., Gribov,E., Karsch-Mizrachi,I., Kimelman,M., Pruitt,K.D., Resenchuk,S., Tatusova,T. et al. (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 40, D57–D63.
13. Gentleman,R.C., Carey,V.J., Bates,D.M., Bolstad,B., Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y., Gentry,J. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
14. Davis,S. and Meltzer,P.S. (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23, 1846–1847.
15. Smyth,G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3, Article 3.
16. Barrett,T., Troup,D.B., Wilhite,S.E., Ledoux,P., Evangelista,C., Kim,I.F., Tomashevsky,M., Marshall,K.A., Phillippy,K.H., Sherman,P.M. et al. (2011) NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic Acids Res., 39, D1005–D1010.
17. Piwowar,H.A., Vision,T.J. and Whitlock,M.C. (2011) Data archiving is a good investment. Nature, 473, 285.
18. Baker,M. (2012) Gene data to hit milestone. Nature, 487, 282–283.
19. Lebsack,T.W., Fa,V., Woods,C.C., Gruener,R., Manziello,A.M., Pecaut,M.J., Gridley,D.S., Stodieck,L.S., Ferguson,V.L. and Deluca,D. (2010) Microarray analysis of spaceflown murine thymus tissue reveals changes in gene expression regulating stress and glucocorticoid receptors. J. Cell Biochem., 110, 372–381.
