Project O: Breast Cancer Gene Analysis Using R: Sheena Scroggins, Susan Mcgowan, John Caras

The document describes analysis of breast cancer gene expression data from 3 datasets using R. The analysis included quality assessment, normalization, pairwise comparisons to identify differentially expressed genes, hierarchical clustering, and meta-analysis combining p-values across datasets. While some genes were identified in individual datasets, meta-analysis found no significant genes, possibly due to inconsistencies across the larger but less relevant second dataset.
Copyright
© Attribution Non-Commercial (BY-NC)

Project O: Breast Cancer Gene Analysis Using R

Sheena Scroggins, Susan McGowan, John Caras


Introduction

• We chose to analyze breast cancer cells.
• We retrieved CEL files generated on the Affymetrix HGU133a
oligonucleotide array through a combination of the sites:
http://www.nextbio.com/b/nextbio.nb
and http://www.ncbi.nlm.nih.gov/geo/
• We performed our analysis on 3 datasets containing breast
cancer cells and normal epithelial tissue cells.
• The data sets and colors associated with each set are:
– data1 = GSE6883-GDS2617 (22)-orange
– data2 = GSE9574-GDS3139 (49)-green
– data3 = GSE5874-GDS3097 (28)-blue
Brief Background on Datasets

• The first group of cells compares cells with high
tumorigenic capacity vs. non-tumorigenic cells vs. normal cells.
The tumorigenic cells express CD44 and low or
undetectable levels of CD24. This data set contains 2
disease states. We labeled the non-tumorigenic disease
cells as Cancer = Y in the treat.txt file.
• The second group of cells we chose contained normal
epithelial cells of breast cancer patients vs. normal
epithelial cells of non-breast cancer patients.
• The third group of cells chosen had 2 disease states:
inflammatory breast cancer cells vs. invasive, non-
inflammatory breast cancer cells.
Quality Assessment & NUSE Plotting
• QA/QC - perform quality assessment and quality control to
remove, as early as possible, any ‘bad arrays’ identified.
• The code used to generate QC plots:
 library(simpleaffy)
 library(affyQCReport)
 library(affyPLM)
 dataset1 <- read.affy("treat1.txt")
 saqc1 <- qc(dataset1)
 plot(saqc1)

• The code used to generate NUSE plots:
 dataPLM1 <- fitPLM(dataset1)
 boxplot(dataPLM1, main="NUSE Data Set 1", ylim=c(0.95, 1.10),
outline=FALSE, col="orange", las=3, whisklty=0,
staplelty=0)
QC Plots datasets 1,2,3
NUSE Plots Data sets 1, 2, 3
Pre-Processing of Data: RMA Normalization & Expression Set Creation
• RMA performs three tasks:
– background adjustment
– quantile normalization
– summarization
• Our code consists of RMA in conjunction with
creation of the expression sets.
 data1.rma <- call.exprs(dataset1,"rma")
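To make the quantile normalization step more concrete, here is a minimal base-R sketch on a toy matrix. It only illustrates the general idea and is not the code that runs inside RMA; the function name quantile_normalize and the toy data are hypothetical.

```r
# Quantile normalization of a probes-by-arrays matrix (base R only).
# Illustration of the idea; not the affy/simpleaffy implementation.
quantile_normalize <- function(x) {
  # Rank of every value within its own array (column)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: row-wise mean of the sorted columns
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value of the same rank
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(1)
# Hypothetical toy data: 6 probes on 3 arrays with different intensity scales
toy <- cbind(a1 = rnorm(6, 8, 1), a2 = rnorm(6, 9, 2), a3 = rnorm(6, 7, 1))
toy_qn <- quantile_normalize(toy)
```

After the call, every column of toy_qn holds the same set of values, while the ordering of probes within each array is unchanged.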
Pairwise Comparison
• We used the function pairwise.comparison to compute the
mean expression of each group and the log2 fold change
between the two groups: Cancer = Y and Cancer = N.
• Then we filtered the results to determine significantly
changing genes.
• Code used for pairwise analysis:
 pairwise_results1 <- pairwise.comparison(data1.rma,
"Cancer", c("Y", "N"), dataset1)
 significant1 <- pairwise.filter(pairwise_results1,
fc=log2(1.5), min.present.no=10, tt= 0.01,
present.by.group=FALSE)
• Next we sorted the significant results by t-test p-value
(lowest p-value = highest significance), then we
displayed the first 25 genes.
 sort(abs(tt(significant1)),decreasing=FALSE)[1:25]
Volcano Plots

• Create volcano plots to display the fold change vs.
-log10 of the t-test p-value; the 25 probe sets with the
largest absolute fold change are highlighted.
 plot(all_foldchange_probesets1, lod1, pch = ".",
xlab = "fold change", ylab = expression(-log[10]~p))
 o1 <- order(abs(all_foldchange_probesets1),
decreasing = TRUE)[1:25]
 points(all_foldchange_probesets1[o1], lod1[o1],
pch = 18, col = "orange")
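The vectors all_foldchange_probesets1 and lod1 are not defined on the slides. As a rough sketch of how such quantities could be computed, the base-R example below uses simulated log2 expression values. All names and data are hypothetical, and a plain Welch t-test stands in for whatever pairwise.comparison computes internally.

```r
set.seed(42)
n_genes <- 200
# Hypothetical log2 expression matrix: 200 genes x 10 samples
expr <- matrix(rnorm(n_genes * 10, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
cancer <- rep(c("Y", "N"), each = 5)
# Shift the first 10 genes in the Cancer = Y group so that some genes change
expr[1:10, cancer == "Y"] <- expr[1:10, cancer == "Y"] + 2

# On the log2 scale, the difference of group means is the log2 fold change
log2_fc <- rowMeans(expr[, cancer == "Y"]) - rowMeans(expr[, cancer == "N"])
# Welch two-sample t-test per gene
p_val <- apply(expr, 1, function(g) t.test(g[cancer == "Y"], g[cancer == "N"])$p.value)
lod <- -log10(p_val)

# Indices of the 25 genes with the largest absolute fold change
top25 <- order(abs(log2_fc), decreasing = TRUE)[1:25]

# Volcano plot with the top 25 highlighted (uncomment to draw)
# plot(log2_fc, lod, pch = ".", xlab = "log2 fold change", ylab = expression(-log[10]~p))
# points(log2_fc[top25], lod[top25], pch = 18, col = "orange")
```

Ordering by abs(log2_fc) highlights the largest changes regardless of direction, which is why the highlighted points sit at both edges of a volcano plot.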
Volcano Plots 1, 2, 3
Filter Expression Sets

• Filter expression sets with criteria IQR >0.5


• Code – first load libraries needed
 library(affy)
 library(genefilter)
 library(multtest)
 library(RColorBrewer)
 library(pvclust)
 library(hopach)
 library(cluster)

• Next build expression set by RMA


 data1 <- read.AnnotatedDataFrame( "treat1.txt", header=TRUE,
row.names=1, sep="\t" )
 pData(data1)
 rma_eset1 <- justRMA( filenames = rownames (pData(data1)) )
 rma_data1 <- exprs(rma_eset1)
Filter code continued
 IQRfil <- function( x ) ( IQR(x) > 0.5 )
 Filter <- filterfun( IQRfil )
 rma_filtered1 <- genefilter( rma_data1, Filter)
 rma_selected1 <- rma_data1 [ rma_filtered1, ]
 cl1 <- as.numeric(data1$Cancer == "Y" )
 resT1 <- mt.maxT(rma_selected1, classlabel=cl1, B=2000 )
 top_multtest1 <- rownames(resT1) [1:50]
 library( "annaffy" ) # provides aafTableAnn(), aaf.handler() and saveHTML()
 library( "hgu133a.db" )
 probe_sets1 <- rownames(resT1) [1:10]
 gene_symbols1 <- unlist( mget( probe_sets1, hgu133aSYMBOL ) )
 tabulated_probes1 <- aafTableAnn( probe_sets1, "hgu133a.db",
aaf.handler( ) )
 saveHTML( tabulated_probes1 , file="MULTTEST-GeneList1.html" )
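The IQR > 0.5 criterion itself does not depend on genefilter. A minimal base-R sketch on simulated data (the matrix and object names are hypothetical) applies the same rule:

```r
set.seed(7)
# Hypothetical log2 expression matrix: 100 probe sets x 8 arrays
expr <- matrix(rnorm(100 * 8, mean = 8, sd = 0.2), nrow = 100,
               dimnames = list(paste0("probe", 1:100), paste0("array", 1:8)))
# Give the first 15 probe sets a much larger spread across arrays
expr[1:15, ] <- expr[1:15, ] + rnorm(15 * 8, sd = 2)

# Same criterion as IQRfil: keep rows whose interquartile range exceeds 0.5
keep <- apply(expr, 1, IQR) > 0.5
expr_selected <- expr[keep, , drop = FALSE]
```

Probe sets that barely vary across arrays carry little information about group differences, so removing them first reduces the number of tests that the later multiple-testing step has to correct for.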
Distance Computation & Visualization of gene expression data
 es1 <- rma_eset1[top_multtest1,]
 iqrs1 <- esApply(es1,1,IQR)
 gvals1 <- scale(t(exprs(es1)),rowMedians(es1),
iqrs1[featureNames(es1)])
Computing Distances – 3 methods
 manDist1 <- dist(gvals1, method="manhattan")
 hr1 <- hclust(as.dist(1-cor(t(gvals1),method=
"pearson")),method="complete")
 hc1 <- hclust(as.dist(1-cor(gvals1, method="spearman")),method="complete")
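Because cor() correlates the columns of its input, transposing the matrix decides whether samples or genes are clustered. The base-R sketch below repeats the three calculations on a small simulated matrix laid out like gvals1 (samples in rows, genes in columns); the data and object names are hypothetical.

```r
set.seed(3)
# Hypothetical scaled matrix: 6 samples (rows) x 20 genes (columns)
gvals <- matrix(rnorm(6 * 20), nrow = 6,
                dimnames = list(paste0("sample", 1:6), paste0("gene", 1:20)))

# 1. Manhattan distance between samples (dist() works on rows)
man_dist <- dist(gvals, method = "manhattan")

# 2. Pearson correlation distance between samples: cor() works on columns,
#    so the matrix is transposed first
sample_dist <- as.dist(1 - cor(t(gvals), method = "pearson"))
sample_tree <- hclust(sample_dist, method = "complete")

# 3. Spearman correlation distance between genes (the columns of gvals)
gene_dist <- as.dist(1 - cor(gvals, method = "spearman"))
gene_tree <- hclust(gene_dist, method = "complete")
```

With this layout, the Pearson tree groups the 6 samples and the Spearman tree groups the 20 genes.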
Creating Heatmaps
 hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
 hmcol <- rev(hmcol)
 heatmap(as.matrix(manDist1),sym=TRUE,col=hmcol,
distfun=function(x)as.dist(x))
Cluster Analysis (Pearson) Data set 1, 2, 3
Cluster Analysis (Spearman) Data set 1, 2, 3
Heat Map Data Set #1
The samples are in rows and the features are in columns.
Heat Map Data Set #2
Heat Map Data Set #3
JUST FOR FUN

• Affymetrix front end or GUI tool – Expression Console


Meta-Analysis of Microarray Data

• Method used - combining p-values
Two measurements of the significance of a change in gene
expression are obtained for each study:
1. value of the test statistic
2. p-value
The method then combines the p-values from all three studies
and reports them as one p-value.
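One widely used way to combine p-values across studies is the inverse normal method. The sketch below assumes, without confirmation from the slides, that this is the flavor of combination applied. It also assumes that the bracketed numbers in the data set list (22, 49, 28) are sample counts. The function name and the p-values are hypothetical.

```r
# Inverse normal combination of one-sided p-values across studies.
# p: genes x studies matrix of p-values; n: number of samples per study.
combine_inverse_normal <- function(p, n) {
  w <- sqrt(n / sum(n))            # study weights; their squares sum to 1
  z <- qnorm(1 - p)                # convert p-values to z-scores
  stat <- as.vector(z %*% w)       # weighted sum is N(0, 1) under the null
  1 - pnorm(stat)                  # combined p-value per gene
}

# Hypothetical p-values for 4 genes in 3 studies
p <- rbind(g1 = c(0.001, 0.02, 0.004),
           g2 = c(0.40, 0.55, 0.62),
           g3 = c(0.03, 0.80, 0.01),
           g4 = c(0.50, 0.50, 0.50))
n <- c(22, 49, 28)                 # assumed sample counts for data1, data2, data3
p_meta <- combine_inverse_normal(p, n)
p_meta_bh <- p.adjust(p_meta, method = "BH")   # Benjamini-Hochberg adjustment
```

Under this weighting the largest study has the largest weight, so a study with weak evidence can pull a gene's combined p-value up even when the smaller studies agree with each other.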
Make expression sets, apply filters and merge:
 library(MAMA)
 eset1 = call.exprs(phenodata1, "rma")
 eset1 = nsFilter(eset1, require.entrez = TRUE, require.GOBP = TRUE,
remove.dupEntrez = TRUE, var.func = IQR, var.cutoff = 0.5,
feature.exclude = "^AFFX")
 eset1.data = eset1$eset
 # eset2.data and eset3.data are assumed to be built the same way
 esets = list(exprs(eset1.data), exprs(eset2.data), exprs(eset3.data))
 classes = list(pData(phenodata1)[,2], pData(phenodata2)[,2],
pData(phenodata3)[,2])
Detecting differentially expressed genes

 pvalt = pvalcombination(esets, classes, moderated = "t",
BHth = 0.01)
DE   IDD   Loss   IDR   IRR
 0     0     92   NaN   100

RESULTS:
DE – the number of genes significant in the meta-analysis.
Our chosen group of genes shows none.
IDD – genes which are significant in the meta-analysis but not in
individual studies. None is expected since no differentially
expressed genes (DEG) were found.
Loss – genes significant in individual data sets but not in the
meta-analysis.
IDR & IRR are the percentages of Integration Driven
Discoveries and Integration Driven Revisions in identified
differentially expressed genes.
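The NaN and 100 in the result row can be reproduced by hand. The sketch below assumes that IDR is IDD as a percentage of DE and that IRR is Loss as a percentage of the genes significant in at least one individual study (92, the AllIndStudies count in the summary). These formulas are an assumption that happens to match the reported values, and the variable names are hypothetical.

```r
# Counts reported in the result row
de   <- 0    # genes significant in the meta-analysis
idd  <- 0    # genes significant in the meta-analysis only
loss <- 92   # genes significant in an individual study but not in the meta-analysis
ind  <- 92   # genes significant in at least one individual study (76 + 16)

# Assumed definitions of the two rates, in percent
idr <- 100 * idd / de     # 0 / 0 evaluates to NaN in R
irr <- 100 * loss / ind   # every individually significant gene was lost
```

Under these assumed definitions, IDR is undefined because no gene is significant in the meta-analysis, and IRR is 100 because all 92 individually significant genes were dropped.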
DEG analysis – Summary of the combination of p-values
 summary(pvalt)
               Length  Class   Mode
study1             76  -none-  numeric
study2              0  -none-  numeric
study3             16  -none-  numeric
AllIndStudies      92  -none-  numeric
Meta                0  -none-  numeric
TestStatistic    5228  -none-  numeric
• study1, study2 & study3 – indices of differentially
expressed genes (DEG) in data sets 1 through 3.
• AllIndStudies – indices of DEG in at least one data set.
• Meta – indices of DEG found by meta-analysis.
• TestStatistic – the test statistics from the meta-analysis
(one per gene).
Conclusion

• All three data sets used the HGU133a array. All three data
sets came from breast cancer tissue or individuals who had
breast cancer, with the exception of the second set, which
compared normal epithelial cells in breast cancer vs.
normal epithelial cells in non-breast cancer samples.
• Analyzing microarray data by meta-analysis can be
problematic. In some cases, such as this study, it can lead
to null results.
• Whether we analyze our data sets individually or by
meta-analysis, we come to the conclusion that the second
data set, which happened to be the largest, did not
have highly expressed or differentially expressed genes.
