Using the GEOquery Package
Sean Davis
September 21, 2014
Contents
Overview of GEO
  Platforms
  Samples
  Series
  Datasets
Use Cases
Citing GEOquery
Session info
Overview of GEO
The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput
experimental data. These data include single and dual channel microarray-based experiments measuring
mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene
expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data.
At the most basic level of organization of GEO, there are four basic entity types. The first three (Sample,
Platform, and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff
from the user-submitted data. See the GEO home page for more information.
Platforms
A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs,
antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE
tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A
Platform may reference many Samples that have been submitted by multiple submitters.
Samples
A Sample record describes the conditions under which an individual Sample was handled, the manipulations it
underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned
a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform
and may be included in multiple Series.
Series
A Series record defines a set of related Samples considered to be part of a group, how the Samples are related,
and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole.
Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each
Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available
in a couple of formats, which GEOquery handles independently. The smaller and newer GSEMatrix files
are quite fast to parse; a simple flag tells GEOquery to use GSEMatrix files (see below).
Datasets
GEO DataSets (GDSxxx) are curated sets of GEO Sample data. A GDS record represents a collection of
biologically and statistically comparable GEO Samples and forms the basis of GEO's suite of data display
and analysis tools. Samples within a GDS refer to the same Platform, that is, they share a common set
of probe elements. Value measurements for each Sample within a GDS are assumed to be calculated in an
equivalent manner, that is, considerations such as background processing and normalization are consistent
across the dataset. Information reflecting experimental design is provided through GDS subsets.
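The four entity types can be told apart by their accession prefixes (GPL, GSM, GSE, GDS). As a minimal sketch in base R, the helper below maps an accession string to its entity type; the function name geo_entity_type is a hypothetical illustration and is not part of GEOquery.

```r
# Hypothetical helper (not part of GEOquery): classify a GEO accession
# by its three-letter prefix, as described in the sections above.
geo_entity_type <- function(accession) {
  prefix <- toupper(substr(accession, 1, 3))
  types <- c(GPL = "Platform", GSM = "Sample", GSE = "Series", GDS = "DataSet")
  # Unknown prefixes yield NA
  unname(types[prefix])
}

# Accessions that appear elsewhere in this document
geo_entity_type(c("GPL96", "GSM11805", "GSE781", "GDS507"))
```

An unrecognised prefix returns NA, which makes it easy to validate a list of accessions before requesting them.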
library(GEOquery)
Now, we are free to access any GEO accession. Note that in the following, I use a file packaged with the
GEOquery package. In general, you will use only the GEO accession, as noted in the code comments.
# If you have network access, the more typical way to do this
# would be to use this:
# gds <- getGEO("GDS507")
gds <- getGEO(filename=system.file("extdata/GDS507.soft.gz",package="GEOquery"))
Now, gds contains the R data structure (of class GDS) that represents the GDS507 entry from GEO. When
you download by accession, you'll note that the name of the file used to store the download is printed to the
screen (but not saved anywhere); it can be passed to a later call to getGEO(filename=...).
We can do the same with any other GEO accession, such as GSM11805, a GEO sample.
# If you have network access, the more typical way to do this
# would be to use this:
# gsm <- getGEO("GSM11805")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))
## $channel_count
## [1] "1"
##
## $comment
## [1] "Raw data provided as supplementary file"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
# Look at data associated with the GSM:
# but restrict to only first 5 rows, for brevity
Table(gsm)[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
##
##     Column                                                               Description
## 1   ID_REF
## 2    VALUE                        MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 ABS_CALL MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
The GPL class behaves exactly as the GSM class. However, the GDS class has a bit more information associated
with the Columns method:
Columns(gds)[,1:3]
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "[email protected]"
##
## $contact_fax
## [1] "617-414-1646"
##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "[email protected]"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrma
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
##     Column                                                               Description
## 1   ID_REF
## 2    VALUE                        MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 ABS_CALL MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
## 22278 more rows ...
show(pData(phenoData(gse2553[[1]]))[1:5,c(1,6,8)])
##                                                                   title
## GSM48681                       Patient sample ST18, Dermatofibrosarcoma
## GSM48682                            Patient sample ST410, Ewing Sarcoma
## GSM48683                             Patient sample ST130, Sarcoma, NOS
## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor
## GSM48685                              Patient sample ST367, Liposarcoma
##          type                         source_name_ch1
## GSM48681  RNA                     Dermatofibrosarcoma
## GSM48682  RNA                           Ewing Sarcoma
## GSM48683  RNA                            Sarcoma, NOS
## GSM48684  RNA Malignant Peripheral Nerve Sheath Tumor
## GSM48685  RNA                             Liposarcoma
## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"
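A per-sample platform listing like the one above can be tallied to check whether a series mixes platforms. The following is a minimal sketch in base R: the list is typed in by hand from the six entries shown in the output, and the variable names (gsmplatforms, platform_counts, gpl96_samples) are illustrative assumptions rather than names defined by the package.

```r
# Platform IDs for the six samples shown in the output above (entered by hand)
gsmplatforms <- list(
  GSM11805 = "GPL96", GSM11810 = "GPL97", GSM11814 = "GPL96",
  GSM11815 = "GPL97", GSM11823 = "GPL96", GSM11827 = "GPL97"
)

# Count how many samples use each platform
platform_counts <- table(unlist(gsmplatforms))
print(platform_counts)

# TRUE only if every sample shares one platform
single_platform <- length(platform_counts) == 1

# Names of the samples run on one chosen platform
gpl96_samples <- names(Filter(function(p) p == "GPL96", gsmplatforms))
```

With a real GSE object, the same tally could be applied to whatever list of platform IDs has been collected from the samples.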
The output above shows that the samples in this GSE were run on two platforms, GPL96 and GPL97 (which
we could also have checked by looking at the GPLList for gse). So, now we would like to know what column
represents the data that we would like to extract. Looking at the first few rows of the Table of a single GSM
will likely give us an idea (and by the way, GEO uses a convention that the column that contains the single
measurement for each array is called the VALUE column, which we could use if we don't know what other
column is most relevant).
Table(GSMList(gse)[[1]])[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
##
##        Column                                                               Description
## 1      ID_REF
## 2       VALUE                        MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3    ABS_CALL MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## NA       <NA>                                                                      <NA>
## NA.1     <NA>                                                                      <NA>
We will indeed use the VALUE column. We then want to make a matrix of these values like so:
# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind', lapply(GSMList(gse), function(x) {
    tab <- Table(x)
    mymatch <- match(probesets, tab$ID_REF)
    return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]
##       GSM11805 GSM11810  GSM11814 GSM11815  GSM11823 GSM11827  GSM11830
## [1,] 10.926963       NA 11.105254       NA 11.275019       NA 11.438636
## [2,]  5.749534       NA  7.908092       NA  7.093814       NA  7.514122
## [3,]  7.066089       NA  7.750205       NA  7.244126       NA  7.962896
## [4,] 12.660353       NA 12.479755       NA 12.215897       NA 11.458355
## [5,]  6.195741       NA  6.061776       NA  6.565293       NA  6.583459
##      GSM11832  GSM12067 GSM12069  GSM12075 GSM12078  GSM12079 GSM12083
## [1,]       NA 11.424376       NA 11.222795       NA 11.469845       NA
## [2,]       NA  7.901470       NA  6.407693       NA  5.165912       NA
## [3,]       NA  7.337176       NA  6.569856       NA  7.477354       NA
## [4,]       NA 11.397568       NA 12.529870       NA 12.240046       NA
## [5,]       NA  6.877744       NA  6.652486       NA  3.981853       NA
##       GSM12098 GSM12099  GSM12100 GSM12101  GSM12105 GSM12106  GSM12268
## [1,] 10.823367       NA 10.835971       NA 10.810893       NA 11.062653
## [2,]  6.556123       NA  8.207014       NA  6.816344       NA  6.563768
## [3,]  7.708739       NA  7.428779       NA  7.754888       NA  7.126188
## [4,] 12.336534       NA 11.762839       NA 11.237509       NA 12.412490
## [5,]  5.501439       NA  6.247928       NA  6.017922       NA  6.525129
##      GSM12269  GSM12270 GSM12274  GSM12283 GSM12287  GSM12298 GSM12299
## [1,]       NA 10.323055       NA 11.181028       NA 11.566387       NA
## [2,]       NA  7.353147       NA  5.770829       NA  6.912889       NA
## [3,]       NA  8.742815       NA  7.339850       NA  7.602142       NA
## [4,]       NA 11.213408       NA 12.678380       NA 12.232901       NA
## [5,]       NA  6.683696       NA  5.918863       NA  5.837943       NA
##       GSM12300 GSM12301  GSM12399 GSM12412  GSM12444 GSM12448
## [1,] 11.078151       NA 11.535178       NA 11.105450       NA
## [2,]  4.812498       NA  7.471675       NA  7.488644       NA
## [3,]  7.383704       NA  7.432959       NA  7.381110       NA
## [4,] 12.090939       NA 11.421802       NA 12.172834       NA
## [5,]  6.281698       NA  5.419539       NA  5.469235       NA
Note that we do a match to make sure that the values and the platform information are in the same order.
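To see what the match step does, here is a minimal base R sketch with toy data. The probe IDs and values are invented for illustration and do not come from GEO. It also shows why the values are passed through as.character before as.numeric, on the assumption that the VALUE column may arrive as a factor.

```r
# Toy platform probe order and one sample table stored in a different order
# (invented data; the sample lacks probe "p2")
probesets <- c("p1", "p2", "p3", "p4")
tab <- data.frame(
  ID_REF = c("p3", "p1", "p4"),
  VALUE  = c("30.5", "10.1", "40.2"),
  stringsAsFactors = TRUE
)

# For each platform probe, find its row in the sample table (NA if absent)
idx <- match(probesets, tab$ID_REF)

# Reorder the sample values into platform order
aligned <- tab$VALUE[idx]

# Converting a factor via character recovers the real numbers ...
values <- as.numeric(as.character(aligned))

# ... whereas as.numeric() alone would return the internal factor codes
codes <- as.numeric(aligned)
```

Probes that are missing from a sample's table come back as NA, which is one way an all-NA column can arise when a sample was run on a different platform from the one supplying the probeset list.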
Finally, to make the ExpressionSet object:
require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(GSMList(gse))
pdata <- data.frame(samples=names(GSMList(gse)))
rownames(pdata) <- names(GSMList(gse))
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
So, using lapply on the GSMList, one can extract as many columns of interest as necessary
to build the data structure of choice. Because the GSM data from the GEO website are fully downloaded
and included in the GSE object, one can extract foreground and background as well as quality for two-channel
arrays, for example. Getting array annotation is a bit more complicated, but by replacing "platform" in
the lapply call used to get platform information for each array, one can get other information associated with
each array.
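The idea of pulling several columns out of each sample can be sketched in base R as follows. This is an illustration under stated assumptions: plain data frames stand in for the per-sample tables, the sample names and values are invented, and extract_column is a hypothetical helper, not a GEOquery function.

```r
# Hypothetical helper: build a probes-by-samples matrix from one named column
# of each sample table, aligned to a common probe order
extract_column <- function(tables, probesets, column) {
  cols <- lapply(tables, function(tab) {
    tab[[column]][match(probesets, tab$ID_REF)]
  })
  mat <- do.call(cbind, cols)
  rownames(mat) <- probesets
  mat
}

# Invented stand-ins for per-sample tables
tables <- list(
  S1 = data.frame(ID_REF = c("p1", "p2", "p3"), VALUE = c(1.5, 2.5, 3.5),
                  ABS_CALL = c("P", "A", "P"), stringsAsFactors = FALSE),
  S2 = data.frame(ID_REF = c("p3", "p1", "p2"), VALUE = c(30, 10, 20),
                  ABS_CALL = c("M", "P", "P"), stringsAsFactors = FALSE)
)
probesets <- c("p1", "p2", "p3")

# One matrix per column of interest
value_matrix <- extract_column(tables, probesets, "VALUE")
call_matrix  <- extract_column(tables, probesets, "ABS_CALL")
```

Keeping the column name as an argument means the same helper can be reused for any column that all sample tables share.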
Use Cases
GEOquery can be quite powerful for gathering a lot of data quickly. A few examples can be useful to show
how this might be done for data mining purposes.
Conclusion
The GEOquery package provides a bridge to the vast array resources contained in the NCBI GEO repositories.
By maintaining the full richness of the GEO data rather than focusing on getting only the numbers, it is
possible to integrate GEO data into current Bioconductor data structures and to perform analyses on that
data quite quickly and easily. These tools will hopefully open GEO data more fully to the array community
at large.
Citing GEOquery
Please consider citing GEOquery if used in support of your own research:
citation("GEOquery")
##
## Please cite the following if utilizing the GEOquery software:
##
##   Davis, S. and Meltzer, P. S. GEOquery: a bridge between the Gene
##   Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007,
##   14, 1846-1847
##
## A BibTeX entry for LaTeX users is
##
##   @Article{,
##     author = {Sean Davis and Paul Meltzer},
##     title = {GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor},
##     journal = {Bioinformatics},
##     year = {2007},
##     volume = {14},
##     pages = {1846--1847},
##   }
Session info
The following packages and versions were used in the production of this vignette.
## LC_NUMERIC=C               LC_COLLATE=C
## LC_MESSAGES=en_US.UTF-8    LC_NAME=C
## LC_TELEPHONE=C             LC_IDENTIFICATION=C
##
## grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] limma_3.22.0        GEOquery_2.32.0     Biobase_2.26.0
## [4] BiocGenerics_0.12.0 knitr_1.7