0% found this document useful (0 votes)

53 views

Using The Geoquery Package: Sean Davis September 21, 2014

The document describes the GEOquery package for accessing data from the Gene Expression Omnibus (GEO) database. It discusses the different entity types in GEO - Platforms, Samples, Series, and Datasets - and how they are organized. It then explains the key GEOquery data structures - GDS, GSM, GPL, and GSE classes - that are used to store the different types of data from GEO. These classes contain metadata and data tables extracted from GEO. The GSE class is composite, containing lists of GSM and GPL objects for the various samples and platforms in a GEO Series. The document provides examples of accessing and summarizing data from these GEOquery classes.

Uploaded by

Charmi Panchal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

53 views

Using The Geoquery Package: Sean Davis September 21, 2014

Uploaded by

Charmi Panchal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Using the GEOquery Package

Sean Davis
September 21, 2014

Contents
Overview of GEO

Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Getting Started using GEOquery

GEOquery Data Structures

The GDS, GSM, and GPL classes

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The GSE class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting to BioConductor ExpressionSets and limma MALists

Getting GSE Series Matrix files as an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GDS to an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GDS to an MAList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GSE to an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Accessing Raw Data from GEO

Use Cases

Getting all Series Records for a Given Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Conclusion

13
13

Citing GEOquery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Reporting problems or bugs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Session info

Overview of GEO
The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput
experimental data. These data include single and dual channel microarray-based experiments measuring
mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene
expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data.
At the most basic level of organization of GEO, there are four basic entity types. The first three (Sample,
Platform, and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff
from the user-submitted data. See the GEO home page for more information.

Platforms
A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs,
antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE
tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A
Platform may reference many Samples that have been submitted by multiple submitters.

Samples
A Sample record describes the conditions under which an individual Sample was handled, the manipulations it
underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned
a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform
and may be included in multiple Series.

Series
A Series record defines a set of related Samples considered to be part of a group, how the Samples are related,
and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole.
Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each
Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available
in a couple of formats which are handled by GEOquery independently. The smaller and new GSEMatrix files
are quite fast to parse; a simple flag is used by GEOquery to choose to use GSEMatrix files (see below).

Datasets
GEO DataSets (GDSxxx) are curated sets of GEO Sample data. A GDS record represents a collection of
biologically and statistically comparable GEO Samples and forms the basis of GEOs suite of data display
and analysis tools. Samples within a GDS refer to the same Platform, that is, they share a common set
of probe elements. Value measurements for each Sample within a GDS are assumed to be calculated in an
equivalent manner, that is, considerations such as background processing and normalization are consistent
across the dataset. Information reflecting experimental design is provided through GDS subsets.

Getting Started using GEOquery

Getting data from GEO is really quite easy. There is only one command that is needed, getGEO. This one
function interprets its input to determine how to get the data from GEO and then parse the data into useful
R data structures. Usage is quite simple. This loads the GEOquery library.

library(GEOquery)
Now, we are free to access any GEO accession. Note that in the following, I use a file packaged with the
GEOquery package. In general, you will use only the GEO accession, as noted in the code comments.
# If you have network access, the more typical way to do this
# would be to use this:
# gds <- getGEO("GDS507")
gds <- getGEO(filename=system.file("extdata/GDS507.soft.gz",package="GEOquery"))
Now, gds contains the R data structure (of class GDS) that represents the GDS507 entry from GEO. Youll
note that the filename used to store the download was output to the screen (but not saved anywhere) for
later use to a call to getGEO(filename=...).
We can do the same with any other GEO accession, such as GSM11805, a GEO sample.
# If you have network access, the more typical way to do this
# would be to use this:
# gds <- getGEO("GSM11805")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))

GEOquery Data Structures

The GEOquery data structures really come in two forms. The first, comprising GDS, GPL, and GSM all behave
similarly and accessors have similar effects on each. The fourth GEOquery data structure, GSE is a composite
data type made up of a combination of GSM and GPL objects. I will explain the first three together first.

The GDS, GSM, and GPL classes

Each of these classes is comprised of a metadata header (taken nearly verbatim from the SOFT format
header) and a GEODataTable. The GEODataTable has two simple parts, a Columns part which describes
the column headers on the Table part. There is also a show method for each class. For example, using the
gsm from above:
# Look at gsm metadata:
head(Meta(gsm))
##
##
##
##
##
##
##
##
##
##
##
##
##

$channel_count
[1] "1"
$comment
[1] "Raw data provided as supplementary file"
$contact_address
[1] "715 Albany Street, E613B"
$contact_city
[1] "Boston"
$contact_country
3

## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
# Look at data associated with the GSM:
# but restrict to only first 5 rows, for brevity
Table(gsm)[1:5,]
##
##
##
##
##
##

ID_REF
AFFX-BioB-5_at
AFFX-BioB-M_at
AFFX-BioB-3_at
AFFX-BioC-5_at
AFFX-BioC-3_at

1
2
3
4
5

VALUE ABS_CALL
953.9
P
2982.8
P
1657.9
P
2652.7
P
2019.5
P

# Look at Column descriptions:

Columns(gsm)
##
##
##
##
##
##
##
##

Column
1
ID_REF
2
VALUE
3 ABS_CALL

Description
1
2
MAS 5.0 Statistical Algorithm (mean scaled to 500)
3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065

The GPL class behaves exactly as the GSM class. However, the GDS class has a bit more information associated
with the Columns method:
Columns(gds)[,1:3]
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

sample disease.state individual

GSM11815
RCC
035
GSM11832
RCC
023
GSM12069
RCC
001
GSM12083
RCC
005
GSM12101
RCC
011
GSM12106
RCC
032
GSM12274
RCC
2
GSM12299
RCC
3
GSM12412
RCC
4
GSM11810
normal
035
GSM11827
normal
023
GSM12078
normal
001
GSM12099
normal
005
GSM12269
normal
1
GSM12287
normal
2
GSM12301
normal
3
GSM12448
normal
4

The GSE class

The GSE entity is the most confusing of the GEO entities. A GSE entry can represent an arbitrary number
of samples run on an arbitrary number of platforms. The GSE class has a metadata section, just like the
other classes. However, it doesnt have a GEODataTable. Instead, it contains two lists, accessible using the
GPLList and GSMList methods, that are each lists of GPL and GSM objects. To show an example:
# Again, with good network access, one would do:
# gse <- getGEO("GSE781",GSEMatrix=FALSE)
gse <- getGEO(filename=system.file("extdata/GSE781_family.soft.gz",package="GEOquery"))
## Parsing....
head(Meta(gse))
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

$contact_address
[1] "715 Albany Street, E613B"
$contact_city
[1] "Boston"
$contact_country
[1] "USA"
$contact_department
[1] "Genetics and Genomics"
$contact_email
[1] "[email protected]"
$contact_fax
[1] "617-414-1646"

# names of all the GSM objects contained in the GSE

names(GSMList(gse))
##
##
##
##
##
##

[1]
[7]
[13]
[19]
[25]
[31]

"GSM11805"
"GSM11830"
"GSM12079"
"GSM12105"
"GSM12283"
"GSM12399"

"GSM11810"
"GSM11832"
"GSM12083"
"GSM12106"
"GSM12287"
"GSM12412"

"GSM11814"
"GSM12067"
"GSM12098"
"GSM12268"
"GSM12298"
"GSM12444"

"GSM11815"
"GSM12069"
"GSM12099"
"GSM12269"
"GSM12299"
"GSM12448"

# and get the first GSM object on the list

GSMList(gse)[[1]]
##
##
##
##
##

An object of class "GSM"

channel_count
[1] "1"
comment
[1] "Raw data provided as supplementary file"
5

"GSM11823"
"GSM12075"
"GSM12100"
"GSM12270"
"GSM12300"

"GSM11827"
"GSM12078"
"GSM12101"
"GSM12274"
"GSM12301"

##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

contact_address
[1] "715 Albany Street, E613B"
contact_city
[1] "Boston"
contact_country
[1] "USA"
contact_department
[1] "Genetics and Genomics"
contact_email
[1] "[email protected]"
contact_fax
[1] "617-414-1646"
contact_institute
[1] "Boston University School of Medicine"
contact_name
[1] "Marc,E.,Lenburg"
contact_phone
[1] "617-414-1375"
contact_state
[1] "MA"
contact_web_link
[1] "http://gg.bu.edu"
contact_zip/postal_code
[1] "02130"
data_row_count
[1] "22283"
description
[1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrma
[2] "Keywords = kidney"
[3] "Keywords = renal"
[4] "Keywords = RCC"
[5] "Keywords = carcinoma"
[6] "Keywords = cancer"
[7] "Lot batch = 2004638"
geo_accession
[1] "GSM11805"
last_update_date
[1] "May 28 2005"
molecule_ch1
[1] "total RNA"
organism_ch1
[1] "Homo sapiens"
platform_id
[1] "GPL96"
series_id
[1] "GSE781"
source_name_ch1
[1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
status
[1] "Public on Nov 25 2003"
submission_date
[1] "Oct 20 2003"
supplementary_file
[1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
6

##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

title
[1] "N035 Normal Human Kidney U133A"
type
[1] "RNA"
An object of class "GEODataTable"
****** Column Descriptions ******
Column
1
ID_REF
2
VALUE
3 ABS_CALL

Description
1
2
MAS 5.0 Statistical Algorithm (mean scaled to 500)
3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
****** Data Table ******
ID_REF VALUE ABS_CALL
1 AFFX-BioB-5_at 953.9
P
2 AFFX-BioB-M_at 2982.8
P
3 AFFX-BioB-3_at 1657.9
P
4 AFFX-BioC-5_at 2652.7
P
5 AFFX-BioC-3_at 2019.5
P
22278 more rows ...

# and the names of the GPLs represented

names(GPLList(gse))
## [1] "GPL96" "GPL97"
See below for an additional, preferred method of obtaining GSE information.

Converting to BioConductor ExpressionSets and limma MALists

GEO datasets are (unlike some of the other GEO entities), quite similar to the limma data structure MAList
and to the Biobase data structure ExpressionSet. Therefore, there are two functions, GDS2MA and GDS2eSet
that accomplish that task.

Getting GSE Series Matrix files as an ExpressionSet

GEO Series are collections of related experiments. In addition to being available as SOFT format files, which
are quite large, NCBI GEO has prepared a simpler format file based on tab-delimited text. The getGEO
function can handle this format and will parse very large GSEs quite quickly. The data structure returned
from this parsing is a list of ExpressionSets. As an example, we download and parse GSE2553.
# Note that GSEMatrix=TRUE is the default
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
7

##
##
##
##
##
##
##
##
##
##
##
##

element names: exprs

protocolData: none
phenoData
sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
varLabels: title geo_accession ... data_row_count (30 total)
varMetadata: labelDescription
featureData
featureNames: 1 2 ... 12600 (12600 total)
fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL1977

show(pData(phenoData(gse2553[[1]]))[1:5,c(1,6,8)])
##
##
##
##
##
##
##
##
##
##
##
##

title
GSM48681
Patient sample ST18, Dermatofibrosarcoma
GSM48682
Patient sample ST410, Ewing Sarcoma
GSM48683
Patient sample ST130, Sarcoma, NOS
GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor
GSM48685
Patient sample ST367, Liposarcoma
type
source_name_ch1
GSM48681 RNA
Dermatofibrosarcoma
GSM48682 RNA
Ewing Sarcoma
GSM48683 RNA
Sarcoma, NOS
GSM48684 RNA Malignant Peripheral Nerve Sheath Tumor
GSM48685 RNA
Liposarcoma

Converting GDS to an ExpressionSet

Taking our gds object from above, we can simply do:
eset <- GDS2eSet(gds,do.log2=TRUE)
Now, eset is an ExpressionSet that contains the same information as in the GEO dataset, including the
sample information, which we can see here:
eset
##
##
##
##
##
##
##
##
##
##
##
##
##

ExpressionSet (storageMode: lockedEnvironment)

assayData: 22645 features, 17 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
varLabels: sample disease.state individual description
varMetadata: labelDescription
featureData
featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645
total)
fvarLabels: ID Gene title ... GO:Component ID (21 total)
fvarMetadata: Column labelDescription
8

## experimentData: use 'experimentData(object)'

##
pubMedIds: 14641932
## Annotation:
pData(eset)[,1:3]
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

GSM11815
GSM11832
GSM12069
GSM12083
GSM12101
GSM12106
GSM12274
GSM12299
GSM12412
GSM11810
GSM11827
GSM12078
GSM12099
GSM12269
GSM12287
GSM12301
GSM12448

sample disease.state individual

Converting GDS to an MAList

No annotation information (called platform information by GEO) was retrieved from because ExpressionSet
does not contain slots for gene information, typically. However, it is easy to obtain this information. First,
we need to know what platform this GDS used. Then, another call to getGEO will get us what we need.
#get the platform from the GDS metadata
Meta(gds)$platform
## [1] "GPL97"
#So use this information in a call to getGEO
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
So, gpl now contains the information for GPL5 from GEO. Unlike ExpressionSet, the limma MAList does
store gene annotation information, so we can use our newly created gpl of class GPL in a call to GDS2MA like
so:
MA <- GDS2MA(gds,GPL=gpl)
class(MA)
## [1] "MAList"
## attr(,"package")
## [1] "limma"
Now, MA is of class MAList and contains not only the data, but the sample information and gene information
associated with GDS507.
9

Converting GSE to an ExpressionSet

First, make sure that using the method described above in the section Getting GSE Series Matrix files as
an ExpressionSet for using GSE Series Matrix files is not sufficient for the task, as it is much faster and
simpler. If it is not (i.e., other columns from each GSM are needed), then this method will be needed.
Converting a GSE object to an ExpressionSet object currently takes a bit of R data manipulation due to
the varied data that can be stored in a GSE and the underlying GSM and GPL objects. However, using a simple
example will hopefully be illustrative of the technique.
First, we need to make sure that all of the GSMs} are from the same platform:
gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform})
head(gsmplatforms)
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

$GSM11805
[1] "GPL96"
$GSM11810
[1] "GPL97"
$GSM11814
[1] "GPL96"
$GSM11815
[1] "GPL97"
$GSM11823
[1] "GPL96"
$GSM11827
[1] "GPL97"

Indeed, they all used GPL5 as their platform (which we could have determined by looking at the GPLList
for gse, which shows only one GPL for this particular GSE.). So, now we would like to know what column
represents the data that we would like to extract. Looking at the first few rows of the Table of a single GSM
will likely give us an idea (and by the way, GEO uses a convention that the column that contains the single
measurement for each array is called the VALUE column, which we could use if we dont know what other
column is most relevant).
Table(GSMList(gse)[[1]])[1:5,]
##
##
##
##
##
##

1
2
3
4
5

ID_REF
AFFX-BioB-5_at
AFFX-BioB-M_at
AFFX-BioB-3_at
AFFX-BioC-5_at
AFFX-BioC-3_at

VALUE ABS_CALL
953.9
P
2982.8
P
1657.9
P
2652.7
P
2019.5
P

# and get the column descriptions

Columns(GSMList(gse)[[1]])[1:5,]
##

Column
10

##
##
##
##
##
##
##
##
##
##
##

1
ID_REF
2
VALUE
3
ABS_CALL
NA
<NA>
NA.1
<NA>

Description
1
2
MAS 5.0 Statistical Algorithm (mean scaled to 500)
3
MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
NA
<NA>
NA.1
<NA>

We will indeed use the VALUE column. We then want to make a matrix of these values like so:
# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind',lapply(GSMList(gse),function(x)
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##

GSM11805 GSM11810
[1,] 10.926963
NA
[2,] 5.749534
NA
[3,] 7.066089
NA
[4,] 12.660353
NA
[5,] 6.195741
NA
GSM11832 GSM12067
[1,]
NA 11.424376
[2,]
NA 7.901470
[3,]
NA 7.337176
[4,]
NA 11.397568
[5,]
NA 6.877744
GSM12098 GSM12099
[1,] 10.823367
NA
[2,] 6.556123
NA
[3,] 7.708739
NA
[4,] 12.336534
NA
[5,] 5.501439
NA
GSM12269 GSM12270
[1,]
NA 10.323055
[2,]
NA 7.353147
[3,]
NA 8.742815
[4,]
NA 11.213408
[5,]
NA 6.683696
GSM12300 GSM12301

GSM11814 GSM11815
11.105254
NA
7.908092
NA
7.750205
NA
12.479755
NA
6.061776
NA
GSM12069 GSM12075
NA 11.222795
NA 6.407693
NA 6.569856
NA 12.529870
NA 6.652486
GSM12100 GSM12101
10.835971
NA
8.207014
NA
7.428779
NA
11.762839
NA
6.247928
NA
GSM12274 GSM12283
NA 11.181028
NA 5.770829
NA 7.339850
NA 12.678380
NA 5.918863
GSM12399 GSM12412
11

GSM11823 GSM11827
11.275019
NA
7.093814
NA
7.244126
NA
12.215897
NA
6.565293
NA
GSM12078 GSM12079
NA 11.469845
NA 5.165912
NA 7.477354
NA 12.240046
NA 3.981853
GSM12105 GSM12106
10.810893
NA
6.816344
NA
7.754888
NA
11.237509
NA
6.017922
NA
GSM12287 GSM12298
NA 11.566387
NA 6.912889
NA 7.602142
NA 12.232901
NA 5.837943
GSM12444 GSM12448

GSM11830
11.438636
7.514122
7.962896
11.458355
6.583459
GSM12083
NA
NA
NA
NA
NA
GSM12268
11.062653
6.563768
7.126188
12.412490
6.525129
GSM12299
NA
NA
NA
NA
NA

##
##
##
##
##

[1,] 11.078151
[2,] 4.812498
[3,] 7.383704
[4,] 12.090939
[5,] 6.281698

NA 11.535178
NA 7.471675
NA 7.432959
NA 11.421802
NA 5.419539

NA 11.105450
NA 7.488644
NA 7.381110
NA 12.172834
NA 5.469235

NA
NA
NA
NA
NA

Note that we do a match to make sure that the values and the platform information are in the same order.
Finally, to make the ExpressionSet object:
require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(GSMList(gse))
pdata <- data.frame(samples=names(GSMList(gse)))
rownames(pdata) <- names(GSMList(gse))
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
##
##
##
##
##
##
##
##
##
##
##

ExpressionSet (storageMode: lockedEnvironment)

assayData: 22283 features, 34 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM11805 GSM11810 ... GSM12448 (34 total)
varLabels: samples
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:

So, using a combination of lapply on the GSMList, one can extract as many columns of interest as necessary
to build the data structure of choice. Because the GSM data from the GEO website are fully downloaded
and included in the GSE object, one can extract foreground and background as well as quality for two-channel
arrays, for example. Getting array annotation is also a bit more complicated, but by replacing platform in
the lapply call to get platform information for each array, one can get other information associated with each
array.

Accessing Raw Data from GEO

NCBI GEO accepts (but has not always required) raw data such as .CEL files, .CDF files, images, etc.
Sometimes, it is useful to get quick access to such data. A single function, getGEOSuppFiles, can take as an
argument a GEO accession and will download all the raw data associate with that accession. By default, the
function will create a directory in the current working directory to store the raw data for the chosen GEO
accession. Combining a simple sapply statement or other loop structure with getGEOSuppFiles makes for a
very simple way to get gobs of raw data quickly and easily without needing to know the specifics of GEO raw
data URLs.

Use Cases
GEOquery can be quite powerful for gathering a lot of data quickly. A few examples can be useful to show
how this might be done for data mining purposes.
12

Getting all Series Records for a Given Platform

For data mining purposes, it is sometimes useful to be able to pull all the GSE records for a given platform.
GEOquery makes this very easy, but a little bit of knowledge of the GPL record is necessary to get started.
The GPL record contains both the GSE and GSM accessions that reference it. Some code is useful to illustrate
the point:
gpl97 <- getGEO('GPL97')
Meta(gpl97)$title
## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
head(Meta(gpl97)$series_id)
## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
length(Meta(gpl97)$series_id)
## [1] 149
head(Meta(gpl97)$sample_id)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
length(Meta(gpl97)$sample_id)
## [1] 6156
The code above loads the GPL97 record into R. The Meta method extracts a list of header information from
the GPL record. The title gives the human name of the platform. The series_id gives a vector of series
ids. Note that there are 149 series associated with this platform and 6156 samples. Code like the following
could be used to download all the samples or series. I show only the first 5 samples as an example:
gsmids <- Meta(gpl97)$sample_id
gsmlist <- sapply(gsmids[1:5],getGEO)
names(gsmlist)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"

Conclusion
The GEOquery package provides a bridge to the vast array resources contained in the NCBI GEO repositories.
By maintaining the full richness of the GEO data rather than focusing on getting only the numbers, it is
possible to integrate GEO data into current Bioconductor data structures and to perform analyses on that
data quite quickly and easily. These tools will hopefully open GEO data more fully to the array community
at large.

Citing GEOquery
Please consider citing GEOquery if used in support of your own research:
citation("GEOquery")
##
## Please cite the following if utilizing the GEOquery software:
##
##
Davis, S. and Meltzer, P. S. GEOquery: a bridge between the Gene
##
Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007,
##
14, 1846-1847
##
## A BibTeX entry for LaTeX users is
##
##
@Article{,
##
author = {Sean Davis and Paul Meltzer},
##
title = {GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor},
##
journal = {Bioinformatics},
##
year = {2007},
##
volume = {14},
##
pages = {1846--1847},
##
}

Reporting problems or bugs

If you run into problems using GEOquery, the Bioconductor Support site is a good first place to ask for help.
If you are convinced that there is a bug in GEOquery (this is pretty unusual, but not unheard of), feel free to
submit an issue on the GEOquery github site or file a bug report directly from R (will open a new github
issue):
bug.report(package='GEOquery')

Session info
The following package and versions were used in the production of this vignette.
##
##
##
##
##
##
##
##
##
##
##
##
##

R version 3.1.1 Patched (2014-09-25 r66681)

Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8
[3] LC_TIME=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8
[9] LC_ADDRESS=C
[11] LC_MEASUREMENT=en_US.UTF-8

LC_NUMERIC=C
LC_COLLATE=C
LC_MESSAGES=en_US.UTF-8
LC_NAME=C
LC_TELEPHONE=C
LC_IDENTIFICATION=C

attached base packages:

[1] parallel stats
graphics

grDevices utils
14

datasets

methods

##
##
##
##
##
##
##
##
##
##

[8] base
other attached packages:
[1] limma_3.22.0
GEOquery_2.32.0
[4] BiocGenerics_0.12.0 knitr_1.7

Biobase_2.26.0

loaded via a namespace (and not attached):

[1] RCurl_1.95-4.3 XML_3.98-1.1
codetools_0.2-9 digest_0.6.4
[5] evaluate_0.5.5 formatR_1.0
htmltools_0.2.6 rmarkdown_0.3.3
[9] stringr_0.6.2
tools_3.1.1
yaml_2.1.13

Durga Saptashati in Gujrati
65% (17)
Durga Saptashati in Gujrati
241 pages
s4 Inheritance by Tr. Okion
No ratings yet
s4 Inheritance by Tr. Okion
123 pages
My Gita - Devdutt Pattanaik
90% (30)
My Gita - Devdutt Pattanaik
156 pages
Tutorial Software SGeMS
100% (3)
Tutorial Software SGeMS
26 pages
Exercise 03 - Siting A Fire Tower in Nebraska
100% (1)
Exercise 03 - Siting A Fire Tower in Nebraska
48 pages
Matgpr Manual R3
No ratings yet
Matgpr Manual R3
109 pages
Rujuta Diveker - Latest Book - v1
100% (6)
Rujuta Diveker - Latest Book - v1
273 pages
Gene Expression Omnibus (GEO)
No ratings yet
Gene Expression Omnibus (GEO)
28 pages
GEO_Oct_2024
No ratings yet
GEO_Oct_2024
61 pages
Seismic Data Generator
No ratings yet
Seismic Data Generator
9 pages
Fat Noews Docx (5)
No ratings yet
Fat Noews Docx (5)
27 pages
Geosx Geosx Readthedocs Hosted Com en Latest
No ratings yet
Geosx Geosx Readthedocs Hosted Com en Latest
993 pages
Gds Course
No ratings yet
Gds Course
190 pages
Google Earth TBX Tutorial
No ratings yet
Google Earth TBX Tutorial
9 pages
An Introduction To: A Package For Geostatistical Data Analysis Using The Software and
No ratings yet
An Introduction To: A Package For Geostatistical Data Analysis Using The Software and
41 pages
Barrett 等。 - 2013 - NCBI GEO archive for functional genomics data set
No ratings yet
Barrett 等。 - 2013 - NCBI GEO archive for functional genomics data set
5 pages
Geopsy 20051026
0% (1)
Geopsy 20051026
67 pages
I Am Sharing 'Document' With You
No ratings yet
I Am Sharing 'Document' With You
3 pages
GmE 205 Lec 11 - OGC and GeoTools
No ratings yet
GmE 205 Lec 11 - OGC and GeoTools
34 pages
R Basics and Essentials
No ratings yet
R Basics and Essentials
51 pages
GY4005 Earthquakes Exercise
No ratings yet
GY4005 Earthquakes Exercise
14 pages
Manual Rgeostats
No ratings yet
Manual Rgeostats
18 pages
Gilberto Schwertner Eacbf9e GPT
No ratings yet
Gilberto Schwertner Eacbf9e GPT
310 pages
JAVA PROGRAMMING FOR BEGINNERS: Master Java Fundamentals and Build Your Own Applications (2023 Crash Course)
From Everand
JAVA PROGRAMMING FOR BEGINNERS: Master Java Fundamentals and Build Your Own Applications (2023 Crash Course)
Theo Houle
No ratings yet
Mastering Go A Practical Guide to Developers: A Practical Guide to Developers
From Everand
Mastering Go A Practical Guide to Developers: A Practical Guide to Developers
Miguel Miranda de Mattos
No ratings yet
SAS Programming Guidelines Interview Questions You'll Most Likely Be Asked
From Everand
SAS Programming Guidelines Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
S-Gems Tutorial Notes: Hydrogeophysics: Theory, Methods, and Modeling
No ratings yet
S-Gems Tutorial Notes: Hydrogeophysics: Theory, Methods, and Modeling
26 pages
(Ebook) Applied Geostatistics with SGeMS: A User's Guide by Nicolas Remy, Alexandre Boucher, Jianbing Wu ISBN 9780521514149, 0521514142 download
100% (1)
(Ebook) Applied Geostatistics with SGeMS: A User's Guide by Nicolas Remy, Alexandre Boucher, Jianbing Wu ISBN 9780521514149, 0521514142 download
59 pages
Geor and Georglm: Software For Model-Based Geostatistics: Paulo J. Ribeiro Jr. Ole F. Christensen
No ratings yet
Geor and Georglm: Software For Model-Based Geostatistics: Paulo J. Ribeiro Jr. Ole F. Christensen
16 pages
Opengeo Community Suite
No ratings yet
Opengeo Community Suite
5 pages
Data Structures I Essentials
From Everand
Data Structures I Essentials
Dennis Smolarski
No ratings yet
SAS Interview Questions You'll Most Likely Be Asked
From Everand
SAS Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Gluster Filesystem - Practical Method
From Everand
Gluster Filesystem - Practical Method
Fabian Mestre
No ratings yet
ULFEM Time-Series-Analysis Package: Open-File Report 2013-1285
No ratings yet
ULFEM Time-Series-Analysis Package: Open-File Report 2013-1285
327 pages
Intro To Geospatial Data and Maps in R
No ratings yet
Intro To Geospatial Data and Maps in R
13 pages
Open Source Software Related To Geoscience and Remote Sensing
No ratings yet
Open Source Software Related To Geoscience and Remote Sensing
21 pages
Google BigQuery Analytics
From Everand
Google BigQuery Analytics
Jordan Tigani
3/5 (1)
NCBI Genome
No ratings yet
NCBI Genome
37 pages
12 Gmt2 Gridding Processing
No ratings yet
12 Gmt2 Gridding Processing
27 pages
SAS Graphics for Clinical Trials by Example
From Everand
SAS Graphics for Clinical Trials by Example
Kriss Harris
5/5 (1)
Making Maps and Plotting Data With GMT and iGMT: Thorsten Becker
No ratings yet
Making Maps and Plotting Data With GMT and iGMT: Thorsten Becker
25 pages
Sslib
No ratings yet
Sslib
106 pages
An Introduction To The Package GeoR
No ratings yet
An Introduction To The Package GeoR
17 pages
Geoxdraw
No ratings yet
Geoxdraw
12 pages
Open Source Approaches in Spatial Data Handling Advances in Geographic Information Science 1st Edition Brent Hall 2025 scribd download
100% (2)
Open Source Approaches in Spatial Data Handling Advances in Geographic Information Science 1st Edition Brent Hall 2025 scribd download
78 pages
Ofr20091103 USGS Geostatistics
No ratings yet
Ofr20091103 USGS Geostatistics
348 pages
State of The Art and Requirements in GIS
No ratings yet
State of The Art and Requirements in GIS
10 pages
Openquake Ground Motion Tookit - User Guide: "Openquake: Calculate, Share, Explore"
No ratings yet
Openquake Ground Motion Tookit - User Guide: "Openquake: Calculate, Share, Explore"
88 pages
Biopython Org DIST Docs Tutorial Tutorial HTML
No ratings yet
Biopython Org DIST Docs Tutorial Tutorial HTML
267 pages
Introduction to Geopandas
No ratings yet
Introduction to Geopandas
14 pages
Download Full (Ebook) Geostatistics for Compositional Data with R by Raimon Tolosana-Delgado; Ute Mueller ISBN 9783030825676, 3030825671 PDF All Chapters
100% (8)
Download Full (Ebook) Geostatistics for Compositional Data with R by Raimon Tolosana-Delgado; Ute Mueller ISBN 9783030825676, 3030825671 PDF All Chapters
81 pages
Wang2019 Article MiningDataAndMetadataFromTheGe
No ratings yet
Wang2019 Article MiningDataAndMetadataFromTheGe
8 pages
GRASS GIS 8 Programmer's Manual - GRASS GIS 8 Programmer's Manual
No ratings yet
GRASS GIS 8 Programmer's Manual - GRASS GIS 8 Programmer's Manual
12 pages
Connect Tutorial Geolog
No ratings yet
Connect Tutorial Geolog
59 pages
Derinlemesine React Data
From Everand
Derinlemesine React Data
Onder Teker
No ratings yet
Chapter 2
No ratings yet
Chapter 2
13 pages
C# Package Mastery: 100 Essentials in 1 Hour - 2024 Edition
From Everand
C# Package Mastery: 100 Essentials in 1 Hour - 2024 Edition
Tenko
No ratings yet
Using Google Earth Engine With R
No ratings yet
Using Google Earth Engine With R
14 pages
GPSSolutions Matlab 2003 PDF
No ratings yet
GPSSolutions Matlab 2003 PDF
7 pages
Cran Task Viewpauqte R Analis Espacila
No ratings yet
Cran Task Viewpauqte R Analis Espacila
13 pages
Intro GG
No ratings yet
Intro GG
50 pages
Introduction To R For Gene Expression Data Analysis
No ratings yet
Introduction To R For Gene Expression Data Analysis
11 pages
e6-64-03-01
No ratings yet
e6-64-03-01
5 pages
Spatial Data r 3
No ratings yet
Spatial Data r 3
6 pages
Charmi Horoscope - Pankaj Bhatt
No ratings yet
Charmi Horoscope - Pankaj Bhatt
15 pages
Shri Hit Ashtayam
No ratings yet
Shri Hit Ashtayam
155 pages
Life Speeches: AND of Sardar
No ratings yet
Life Speeches: AND of Sardar
402 pages
Time To Frequency: The Transformation From Domain
No ratings yet
Time To Frequency: The Transformation From Domain
20 pages
Stator Current and Motor Efficiency As Indicators For Different Types of Bearing Faults in Induction Motors
No ratings yet
Stator Current and Motor Efficiency As Indicators For Different Types of Bearing Faults in Induction Motors
8 pages
Application Note - Indicator Explanations
No ratings yet
Application Note - Indicator Explanations
5 pages
BTM 110 Btn9/Hcp4Cdb: Angular Contact Thrust Ball Bearings, Double Direction, Super-Precision
No ratings yet
BTM 110 Btn9/Hcp4Cdb: Angular Contact Thrust Ball Bearings, Double Direction, Super-Precision
3 pages
Fundamentals of DSP: Introduction To Digital Signal Processing
No ratings yet
Fundamentals of DSP: Introduction To Digital Signal Processing
51 pages
FFT Analysis in Practice
No ratings yet
FFT Analysis in Practice
86 pages
Jnana Vahini: Stream of Spiritual Wisdom
No ratings yet
Jnana Vahini: Stream of Spiritual Wisdom
53 pages
1005hans PDF
No ratings yet
1005hans PDF
3 pages
Bearing Frequency Estimator
No ratings yet
Bearing Frequency Estimator
13 pages
A Quick Introduction To Bearing Envelope Analysis
No ratings yet
A Quick Introduction To Bearing Envelope Analysis
10 pages
Bhagavad Gita
100% (2)
Bhagavad Gita
390 pages
Dampatya Jeevano Aadarsh Gujrati PDF
No ratings yet
Dampatya Jeevano Aadarsh Gujrati PDF
124 pages
Gayatri Mahavigyan Part 2 PDF
No ratings yet
Gayatri Mahavigyan Part 2 PDF
171 pages
Screening
No ratings yet
Screening
77 pages
GeneticsIII syllabus_07012019
No ratings yet
GeneticsIII syllabus_07012019
19 pages
Read The Text and Answer The Questions That Follow
No ratings yet
Read The Text and Answer The Questions That Follow
2 pages
Biology Paper 2 TZ2 SL
No ratings yet
Biology Paper 2 TZ2 SL
20 pages
Download Complete (Ebook) Textbook of Diagnostic Microbiology by Connie R. Mahon, Donald C. Lehman ISBN 9780323482189, 032348218X PDF for All Chapters
100% (13)
Download Complete (Ebook) Textbook of Diagnostic Microbiology by Connie R. Mahon, Donald C. Lehman ISBN 9780323482189, 032348218X PDF for All Chapters
67 pages
Biological Basis For Orthodontic Tooth Movement
No ratings yet
Biological Basis For Orthodontic Tooth Movement
31 pages
Genetics
No ratings yet
Genetics
1 page
Cervical Cancer
100% (1)
Cervical Cancer
7 pages
001-Fibromelanosis in Domestic Chickens PDF
No ratings yet
001-Fibromelanosis in Domestic Chickens PDF
16 pages
Savaged Fallout Wasteland Survival Guide PDF
100% (2)
Savaged Fallout Wasteland Survival Guide PDF
19 pages
ETHAN RUSSO - SINDROME DA DEFICIENCIA CLINICA DE ENDOCANABINOIDES
No ratings yet
ETHAN RUSSO - SINDROME DA DEFICIENCIA CLINICA DE ENDOCANABINOIDES
12 pages
Microorganisms
No ratings yet
Microorganisms
5 pages
MICROBIOLOGY AND PARASITOLOGY (Prelims)
No ratings yet
MICROBIOLOGY AND PARASITOLOGY (Prelims)
15 pages
Dutch Passion Seed Catalogue
100% (1)
Dutch Passion Seed Catalogue
42 pages
Microbiology Material C Arise Mahima Girase 2
No ratings yet
Microbiology Material C Arise Mahima Girase 2
105 pages
Histology Uestions From Lecturio
No ratings yet
Histology Uestions From Lecturio
80 pages
Ortoic Acid
100% (1)
Ortoic Acid
25 pages
NCPP
No ratings yet
NCPP
11 pages
978-0323049092 Color Textbook of Pediatric Dermatology (Color Textbook of Pediatric Dermatology (Weston)) All Chapters Instant Download
100% (2)
978-0323049092 Color Textbook of Pediatric Dermatology (Color Textbook of Pediatric Dermatology (Weston)) All Chapters Instant Download
41 pages
Khan 2018
No ratings yet
Khan 2018
29 pages
Withdrawn: DOI For Details
No ratings yet
Withdrawn: DOI For Details
14 pages
The Sickle Cell Membrane: Tip of The Iceberg
No ratings yet
The Sickle Cell Membrane: Tip of The Iceberg
38 pages
Postgraduate Manual of Pathology 4th Ed. 2021
No ratings yet
Postgraduate Manual of Pathology 4th Ed. 2021
87 pages
Download ebooks file (Ebook) In Defense of Self: How the Immune System Really Works by William R. Clark ISBN 9780195311266, 9780195335552, 9781435605886, 0195311264, 0195335554, 1435605888 all chapters
No ratings yet
Download ebooks file (Ebook) In Defense of Self: How the Immune System Really Works by William R. Clark ISBN 9780195311266, 9780195335552, 9781435605886, 0195311264, 0195335554, 1435605888 all chapters
82 pages
The Metabolic Processes of Folic Acid and Vitamin B12
No ratings yet
The Metabolic Processes of Folic Acid and Vitamin B12
5 pages
Vizza 1996
No ratings yet
Vizza 1996
10 pages
Pharmacology II (5)
No ratings yet
Pharmacology II (5)
2 pages
Hepatobiliary System
No ratings yet
Hepatobiliary System
2 pages
Cannabis Grow Course 101 - Module 1 - Introduction To Growing Cannabis (Preamble & Lesson 1)
No ratings yet
Cannabis Grow Course 101 - Module 1 - Introduction To Growing Cannabis (Preamble & Lesson 1)
16 pages

Using The Geoquery Package: Sean Davis September 21, 2014

Uploaded by

Using The Geoquery Package: Sean Davis September 21, 2014

Uploaded by

Using the GEOquery Package

Getting Started using GEOquery

GEOquery Data Structures

The GDS, GSM, and GPL classes

The GSE class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting to BioConductor ExpressionSets and limma MALists

Getting GSE Series Matrix files as an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GDS to an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GDS to an MAList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Converting GSE to an ExpressionSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Accessing Raw Data from GEO

Getting all Series Records for a Given Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Reporting problems or bugs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Getting Started using GEOquery

GEOquery Data Structures

The GDS, GSM, and GPL classes

# Look at Column descriptions:

sample disease.state individual

The GSE class

# names of all the GSM objects contained in the GSE

# and get the first GSM object on the list

An object of class "GSM"

# and the names of the GPLs represented

Converting to BioConductor ExpressionSets and limma MALists

Getting GSE Series Matrix files as an ExpressionSet

element names: exprs

Converting GDS to an ExpressionSet

ExpressionSet (storageMode: lockedEnvironment)

## experimentData: use 'experimentData(object)'

sample disease.state individual

Converting GDS to an MAList

Converting GSE to an ExpressionSet

# and get the column descriptions

ExpressionSet (storageMode: lockedEnvironment)

Accessing Raw Data from GEO

Getting all Series Records for a Given Platform

Reporting problems or bugs

R version 3.1.1 Patched (2014-09-25 r66681)

attached base packages:

loaded via a namespace (and not attached):

You might also like