GEO Oct 2024

The document outlines the features and improvements of GEO (Gene Expression Omnibus) analysis tools, highlighting its extensive database with over 238,000 public series and 7.45 million samples. Key tools include GEO2R for gene expression analysis and Genome Browser Tracks for visualizing data. Additionally, it discusses the evolution of RNA-seq data and programmatic access to GEO data for enhanced user experience.

GEO Analysis Tools: New and Improved

Emily Clough, Ph.D. GEO Product Owner


Outline
• What is GEO?
• GEO analysis tools
• New features
• Programmatic access to data in GEO
• Useful links
GEO is global
• 84K unique submitters
• 101 countries
• 6681 organisms
GEO’s current size
• 238,316 public series
• 7.45 million samples
Organization of data in GEO
• Platform (GPLnnnn)
• Series (GSEnnnn)
• Sample (GSMnnnn)
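
The three accession prefixes above can be told apart mechanically. The following is a minimal sketch, assuming accessions are a prefix followed only by digits; the function name `classify_accession` is hypothetical and not part of any GEO tool.

```python
import re

# Prefix-to-record-type mapping taken from the slide above.
RECORD_TYPES = {"GPL": "Platform", "GSE": "Series", "GSM": "Sample"}

# Assumption: an accession is one of the three prefixes followed by digits only.
_ACCESSION = re.compile(r"^(GPL|GSE|GSM)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return 'Platform', 'Series' or 'Sample' for a GEO accession string."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GPL/GSE/GSM accession: {accession!r}")
    return RECORD_TYPES[match.group(1)]
```

For example, `classify_accession("GSM12345")` returns `"Sample"`, and a string with any other prefix raises `ValueError`.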
SERIES
• Study-level data
• Summary/abstract
• Contributor names
• Web links
• Citations and links to PubMed
• Links to GEO2R and Tracks
SAMPLES
• Metadata
• Protocols
• Links to downloadable data
GEO Growth
[Figure: Total number of records (Platforms, Series and Samples), 2000 to 2024; y-axis from 0 to 9,000,000]
Evolution of assay type in GEO
[Figure: Percent of total by assay type, by year, 2013 to 2023. Series: expression profiling by NGS, expression profiling by array, epigenomic profiling by NGS, epigenomic profiling by array]
RNA-seq data in GEO
[Figure: RNA-seq studies released by GEO each year. x-axis: Year; y-axis: Number of studies, from 0 to 18,000]
https://www.ncbi.nlm.nih.gov/geo/
GEO Analysis tools
• Genome Browser Tracks
• GEO2R
Genome Browser Tracks
• Loadable into NCBI’s Genome Data Viewer
• 9618 series with tracks
• 40,202 samples with tracks
• Tracks are mostly from ENCODE samples
How to find tracks on GEO records
• Use "track"[Filter] in search bar
• Use links from GEO’s home page
• https://www.ncbi.nlm.nih.gov/geo/encode/
Tracks on series vs samples
• Click on the tracks button on the record page
• Button on Series (GSE) page loads tracks for all samples in series
• Button on Sample (GSM) page loads tracks only for that sample
ENCODE ChIP-seq data for SETDB1
What if you want to use the UCSC Genome Browser for data stored in GEO?
• Use the “ftp” link to copy the URL
• Paste the URL into the UCSC Genome Browser
Viewing bigWig file from GEO in UCSC Browser
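
The copy-and-paste step above can also be scripted. Here is a minimal sketch, assuming the copied link points to a bigWig file and that the text to paste is a UCSC custom track line using `type=bigWig` and `bigDataUrl`. The helper name and any URL passed to it are hypothetical.

```python
def bigwig_track_line(url: str, name: str, description: str = "") -> str:
    """Build a one-line UCSC custom track definition for a remote bigWig file."""
    # Assumption: bigWig files end in .bw or .bigWig (case-insensitive).
    if not url.lower().endswith((".bw", ".bigwig")):
        raise ValueError("expected a bigWig URL ending in .bw or .bigWig")
    parts = ["track", "type=bigWig", f'name="{name}"']
    if description:
        parts.append(f'description="{description}"')
    # bigDataUrl tells the browser where to fetch the remote file from.
    parts.append(f"bigDataUrl={url}")
    return " ".join(parts)
```

The returned line would be pasted into the browser's custom track form in place of the bare URL, which gives control over the track label.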
GEO Analysis tools
• Genome Browser Tracks
• GEO2R
GEO2R
• Web-based analysis tool for gene expression studies
• No knowledge of command line computing needed
• Fast
• Introduced in 2011
GEO2R
• Online analysis tool for differential expression
• 105,444 studies available for analysis with GEO2R (45%)
Analyze gene expression data with GEO2R
GEO2R: Create groups and assign samples
• Create up to 10 groups
• Highlight a sample (row will turn yellow) and click on a group to add it
• Fix errors by highlighting the sample again and clicking a different group
• Highlight multiple samples at a single time
Explore data in GEO2R
• Download full results
• Download chosen subsets of results
• Change options
• Explore visualizations
  • Sample relationship with UMAP
  • Volcano plot
  • Venn diagram
  • P-value and expression value distribution
• Look at gene expression of individual genes
Download full results in a text file
Changing options
• P-value adjustment method for multiple testing
• Choose thresholds for P-value and log2 fold-change for plots
• Choose contrasts (which set of samples to display in plots)
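
To make the first option concrete, here is a minimal sketch of one widely used multiple-testing adjustment, the Benjamini-Hochberg procedure. The slide does not say which methods GEO2R offers, so the choice of method is an assumption made for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, then capped by the adjusted value of the next-larger p-value, so an adjusted value never exceeds 1 and never decreases as the raw p-value grows.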
Looking up your favorite gene
• Go to ‘Profile graph’ tab
• Enter your gene of interest
Accessing the R code
• Go to ‘R script’ tab
• Copy entire script!
• Re-run analysis on your own
Visualize and explore results in GEO2R
Visualization plots
• 7 plots provided for every study
• 3 plots (with green outline) are interactive
• Subsets of data are downloadable from interactive plots
• Added in 2020
Volcano plot
• Changed options
  • Padj < 0.01
  • Log2 fold-change threshold set to 1
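
The two thresholds above divide genes into the groups a volcano plot highlights. The following is a minimal sketch, assuming the fold-change cutoff is inclusive and applied symmetrically to up- and down-regulated genes; the function name and labels are hypothetical and are not GEO2R's own.

```python
def classify_gene(log2_fc, padj, padj_cutoff=0.01, log2_fc_cutoff=1.0):
    """Label a gene as 'up', 'down' or 'not significant' for a volcano plot."""
    significant = padj < padj_cutoff
    if significant and log2_fc >= log2_fc_cutoff:
        return "up"
    if significant and log2_fc <= -log2_fc_cutoff:
        return "down"
    # Either the adjusted p-value or the fold-change misses its threshold.
    return "not significant"
```

With the defaults shown, a gene needs both an adjusted p-value below 0.01 and at least a two-fold change (log2 fold-change of 1) to be labeled.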
Venn diagram
GEO2R Help
• https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html (Documentation)
• Tutorial video
  • Updated in 2023
  • https://www.youtube.com/watch?v=9RyWjzSnaE0&t=17s
What’s new at GEO?
Making RNA-seq data more FAIR
FAIR Data Principles

https://www.nlm.nih.gov/oet/ed/cde/tutorial/02-300.html
NCBI has produced an RNA-seq analysis pipeline
• Consistently computed RNA-seq counts for millions of samples
NCBI RNA-seq count pipeline
• Newly released bulk RNA-seq runs enter pipeline
• Align to genome with HISAT2
• Produce run-level QC metrics
• Remove runs with < 50% alignment
• Produce gene-level counts with featureCounts
• Deploy via Cloud and GEO
• Available for use!
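
The filtering step in the pipeline above can be sketched as follows. This is an illustration only, not NCBI's implementation: the `RunQC` record, its field names, and the definition of alignment rate as aligned reads divided by total reads are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunQC:
    """Hypothetical per-run QC record; field names are illustrative."""
    run_id: str
    total_reads: int
    aligned_reads: int

    @property
    def alignment_rate(self) -> float:
        # Guard against empty runs to avoid division by zero.
        return self.aligned_reads / self.total_reads if self.total_reads else 0.0


def keep_runs(runs, min_rate=0.5):
    """Drop runs whose alignment rate is below min_rate (50% by default)."""
    return [run for run in runs if run.alignment_rate >= min_rate]
```

Only the runs that survive this filter would go on to gene-level counting.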
NCBI RNA-seq analysis pipeline goals
➢ Reduce burden on users
➢ Increase access to vast amount of data
➢ Scalability: greatly increases number of samples that can be analyzed
➢ Promote data re-use and discovery
RNA-seq count data are available in GEO
• April 19, 2023
• Search: Human COVID-19
• Analyze with GEO2R, then explore
• Download, then analyze on your own
• Count file: GSEnnnnnn_raw_counts_GRCh38.p13_NCBI.tsv.gz, with GSMnnnnnn sample columns
Search
• “rnaseq counts”[Filter]
• 27,731 studies available with NCBI-provided RNA-seq counts
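
The same filter can be sent to the Entrez ESearch endpoint instead of the search bar. This minimal sketch only builds the request URL and makes no network call. It assumes GEO records are searched through the `gds` Entrez database; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a GEO query term (no request is sent)."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Filters quoted on the slides: studies with NCBI counts, records with tracks.
counts_url = geo_esearch_url('"rnaseq counts"[Filter]')
tracks_url = geo_esearch_url('"track"[Filter]')
```

Fetching `counts_url` with any HTTP client should return matching record identifiers as JSON, which can then be passed on to other E-utilities calls.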
NCBI RNA-seq counts downloadable from GEO
• Columns: GEO Sample name (GSMnnnn)
• Rows: NCBI Gene ID
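
A count matrix with this layout can be read with the standard library alone. This is a minimal sketch, assuming a tab-separated file whose first column holds the NCBI Gene ID and whose remaining columns hold integer counts per sample. The `GeneID` header label and the sample accessions in the example are made up for illustration.

```python
import csv
import io


def read_counts(handle):
    """Parse a gene-by-sample count table into {gene_id: {sample: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # every column after the first is one GSM sample
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = dict(zip(samples, (int(v) for v in row[1:])))
    return samples, counts


# Tiny in-memory example with made-up accessions and values.
example = io.StringIO("GeneID\tGSM0000001\tGSM0000002\n1\t10\t0\n2\t5\t7\n")
samples, counts = read_counts(example)
```

For a downloaded `.tsv.gz` file, the same function should work on a handle opened with `gzip.open(path, "rt")`.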
Accessing run-level count files from SRA (available in early 2025)
• Search in Athena
• Free egress
• Search on SRA website
• Available as analysis object with SRZnnnn ID
• Human and Mouse RNA-seq data
Programmatic access to data in GEO
• https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html
• Search and retrieve files with Entrez Programming Utilities (E-Utils)
• Another option: wget followed by URL
• Bulk download using Aspera Connect
• https://www.ncbi.nlm.nih.gov/books/NBK242625/
• https://www.biostars.org/p/9528910/
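
For the wget route, the URL has to be worked out from the accession. This minimal sketch builds the download directory for a Series, assuming the GEO FTP layout in which the last three digits of the accession are replaced by "nnn" in the parent folder name. The helper name is hypothetical, and the layout should be checked against the programmatic access page before relying on it.

```python
import re


def geo_series_dir(accession: str) -> str:
    """Return the assumed download directory URL for a GSE accession."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a Series accession: {accession!r}")
    digits = match.group(1)
    # Parent folder: accession with its last three digits replaced by 'nnn'.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"
```

The resulting URL can be passed to wget, with `suppl/` appended to reach a Series' supplementary files under the same assumed layout.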
Useful links
• https://www.ncbi.nlm.nih.gov/geo/
• https://www.ncbi.nlm.nih.gov/geo/geo2r/
• https://www.ncbi.nlm.nih.gov/geo/info/
• https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html
• https://www.youtube.com/watch?v=9RyWjzSnaE0&t=17s
For updates, follow NCBI on:
• NCBI Insights Blog: https://ncbiinsights.ncbi.nlm.nih.gov/
• X: https://twitter.com/ncbi
• Facebook: https://www.facebook.com/ncbi.nlm
• LinkedIn: https://www.linkedin.com/company/ncbinlm
• Subscribe to the NCBI Announce email list at: https://bit.ly/NCBI_subscribe

My email:
[email protected]
Acknowledgements
Pierre Ledoux, PhD Alexandra Soboleva Rodney Brister, PhD Ilene Mizrachi, PhD

Hyeseung Lee, PhD Maxim Tomashevsky Ryan Connor, PhD Valerie Schneider, PhD

Kimberly Marshall Naigong Zhang, PhD Ravinder P. Eskandary, PhD Kim Pruitt, PhD

Irene Kim, PhD Nadezhda Serova Andrey Kochergin, PhD

Katherine Phillippy Tanya Barrett, PhD Vamsi Kodali, PhD

Patti Sherman, PhD Lukas Wagner, PhD

Steve Wilhite, PhD

This work was supported by the National Center for Biotechnology Information (NCBI) at the National Library of
Medicine and NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability
(STRIDES) initiative.
