Microarray Data Analysis: Stuart M. Brown NYU School of Medicine

Microarray data analysis involves measuring gene expression levels across thousands of genes simultaneously using microarray technology. Microarrays allow researchers to compare gene expression between experimental conditions, such as healthy vs diseased tissue. The data obtained from microarrays is analyzed using software to quantify expression levels, normalize the data, and identify differentially expressed genes between conditions. The goals are to discover genes that change expression between groups, classify samples based on expression profiles, and identify patterns of co-regulated genes.

Uploaded by

Patrick Hs Tan
Copyright
© Attribution Non-Commercial (BY-NC)

Microarray Data Analysis

Stuart M. Brown
NYU School of Medicine
The Central Dogma of Molecular Biology
DNA is transcribed into RNA, which is then
translated into protein

[Diagram: DNA (replication) → transcription → RNA → translation → protein. The RNA level is what a microarray measures.]
What is a Microarray?
• A simple concept: Dot Blot + Northern
• Reverse the hybridization - put the probes
on the filter and label the bulk RNA
• Make probes for lots of genes - a massively
parallel experiment
• Make it tiny so you don’t need so much
RNA from your experimental cells.
• Make quantitative measurements
Microarrays are Popular
• At NYU Med Center we are now collecting
about 3 GB of microarray data per week (60
chips, 6-10 different experiments)
• PubMed search "microarray" = 13,948 papers
– 2005 = 4406
– 2004 = 3509
– 2003 = 2421
– 2002 = 1557
– 2001 = 834
– 2000 = 294
[Bar chart: PubMed "microarray" papers per year, 2000-2005]
A Filter Array
DNA Chip Microarrays
• Put a large number (~100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide (or other
substrate) in known locations on a grid.
• Label an RNA sample and hybridize
• Measure amounts of RNA bound to each square in the
grid
• Make comparisons
– Cancerous vs. normal tissue
– Treated vs. untreated
– Time course
• Many applications in both basic and clinical research
cDNA Microarray Technologies
• Spot cloned cDNAs onto a glass microscope
slide
– usually PCR amplified segments of plasmids
• Label 2 RNA samples with 2 different colors of
fluorescent dye - control vs. experimental
• Mix two labeled RNAs and hybridize to the
chip
• Make two scans - one for each color
• Combine the images to calculate ratios of
amounts of each RNA that bind to each spot
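The ratio step above can be sketched in a few lines. This is a minimal illustration, not the scanner software's method: the function name `log2_ratio`, the `floor` value, the spot intensities, and the choice of red for the experimental sample are all assumptions made for the example.

```python
import math

def log2_ratio(red, green, floor=1.0):
    """Return log2(red/green) for one spot.

    Both channels are floored (assumed value) so that a zero or
    negative background-subtracted intensity cannot break the log.
    """
    r = max(red, floor)
    g = max(green, floor)
    return math.log2(r / g)

# Hypothetical background-subtracted intensities: (red, green),
# assuming red = experimental sample and green = control sample.
spots = {
    "geneA": (1200.0, 300.0),
    "geneB": (250.0, 1000.0),
    "geneC": (500.0, 500.0),
}

for name, (red, green) in spots.items():
    print(name, round(log2_ratio(red, green), 2))
```

Working on the log2 scale makes a 4-fold increase (+2) and a 4-fold decrease (-2) symmetric around zero, which a raw ratio does not.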
Spot your own Chip
(plans available for free from Pat Brown’s website)

Robot spotter

Ordinary glass
microscope slide
Combine scans for Red & Green

False color image is made from digitized fluorescence data,
not by superimposing scanned images
cDNA Spotted Microarrays
Affymetrix “Gene chip” system
• Uses 25 base oligos synthesized in place on a
chip (20 pairs of oligos for each gene)
• RNA labeled and scanned in a single “color”
– one sample per chip
• Can have as many as 20,000 genes on a chip
• Arrays get smaller every year (more genes)
• Chips are expensive
• Proprietary system: “black box” software, can
only use their chips
Affymetrix Gene Chip
Affymetrix Technology
Affymetrix Pivot Table
normal tumor tumor normal normal tumor
ID_REF VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL
AFFX-BioB-5_at 210.6 P 234.6 P 362.5 P 389 P 305.6 P 330.5 P
AFFX-BioB-M_at 393 P 327.8 P 501.4 P 816.5 P 542 P 440.8 P
AFFX-BioB-3_at 264.9 P 164.6 P 244.7 P 379.7 P 261.3 P 303.7 P
AFFX-BioC-5_at 738.6 P 676.1 P 737.6 P 1191.2 P 917 P 767.9 P
AFFX-BioC-3_at 356.3 P 365.9 P 423.4 P 711.6 P 560.3 P 484.9 P
AFFX-BioDn-5_at 566.3 P 442.2 P 649.7 P 834.3 P 599.1 P 606.9 P
AFFX-BioDn-3_at 3911.8 P 3703.7 P 4680.9 P 6037.7 P 4653.7 P 4232 P
AFFX-CreX-5_at 6433.3 P 5980 P 7734.7 P 10591 P 8162.1 P 8428 P
AFFX-CreX-3_at 11917.8 P 9376.7 P 11509.3 P 16814.4 P 13861.8 P 13653.4 P
AFFX-DapX-5_at 12.2 A 44.3 M 31.2 A 37.7 P 33.3 A 12.8 A
AFFX-DapX-M_at 57.8 M 42.5 A 79 M 48.8 P 39.5 A 39.2 A
AFFX-DapX-3_at 29.8 A 6.2 A 23.4 A 28.4 A 3.2 A 7.6 A
AFFX-LysX-5_at 15.3 A 16.2 A 15.6 A 16.7 A 3.1 A 3.9 A
AFFX-LysX-M_at 33.2 A 12 A 17.7 A 37.3 A 49.2 A 9.1 A
AFFX-LysX-3_at 40.7 M 10.7 A 36.2 A 22.1 A 22.8 A 28.2 A
AFFX-PheX-5_at 7.8 A 3 A 7.6 A 5.6 A 5 A 6.4 A
AFFX-PheX-M_at 4.2 A 4.8 A 6.8 A 6.1 A 3.7 A 5.5 A
AFFX-PheX-3_at 54.2 A 39.6 A 19.4 A 16.1 A 44.7 A 31.2 A
AFFX-ThrX-5_at 8.2 A 11.2 A 13.2 A 9.5 A 8.5 A 7.5 A
AFFX-ThrX-M_at 38.1 A 30.6 A 37.6 A 7.2 A 26.9 A 36.3 A
AFFX-ThrX-3_at 15.2 A 5 A 15 A 8.3 A 36.8 A 11.5 A
AFFX-TrpnX-5_at 11.2 A 11.8 A 22.2 A 22.1 A 8.9 A 35.6 A
AFFX-TrpnX-M_at 9 A 8.1 A 9.1 A 8.7 A 8.1 A 12 A
AFFX-TrpnX-3_at 19.8 A 12.8 A 11.8 A 43.2 M 17.4 A 10 A
AFFX-HUMISGF3A/M97935_5_at 82.7 P 120.7 P 92.7 P 46.4 P 55.9 P 46.5 P
AFFX-HUMISGF3A/M97935_MA_at 397.6 P 416.7 P 244.8 A 181.4 A 197.5 A 192.3 A
AFFX-HUMISGF3A/M97935_MB_at 206.2 P 303 P 300.8 P 253.5 P 195.3 P 216 P
AFFX-HUMISGF3A/M97935_3_at 663.8 P 723.9 P 812.1 P 666.1 P 629.4 P 754.1 P
AFFX-HUMRGE/M10098_5_at 547.6 P 405.9 P 6894.7 P 3496.1 P 1958.5 P 5799.4 P
AFFX-HUMRGE/M10098_M_at 239.1 P 175.8 P 3675 P 1348.6 P 695.9 P 2428.2 P
AFFX-HUMRGE/M10098_3_at 1236.4 P 721.4 P 9076.1 P 7795.9 P 4237.1 P 7890 P
AFFX-HUMGAPDH/M33197_5_at 19508 P 19267.1 P 22892 P 26584 P 29666.6 P 25038.1 P
AFFX-HUMGAPDH/M33197_M_at 18996.6 P 20610.4 P 21573.7 P 29936 P 30106.6 P 22380.2 P
AFFX-HUMGAPDH/M33197_3_at 18016.4 P 17463.8 P 20921.3 P 26908.3 P 28382.2 P 21885 P
AFFX-HSAC07/X00351_5_at 23294.6 P 21783.7 P 18423.3 P 21858.9 P 23517.1 P 19450.3 P
AFFX-HSAC07/X00351_M_at 25373.1 P 24922.8 P 22384.2 P 25760.2 P 27718.5 P 21401.6 P
AFFX-HSAC07/X00351_3_at 20032.8 P 20251.1 P 20961.7 P 23494.6 P 23381.2 P 21173.3 P
Data Acquisition
• Scan the arrays
• Quantitate each spot
• Subtract background
• Normalize
• Export a table of fluorescent intensities
for each gene in the array
Automate!!
• All of this can be done automatically by
software.
• Much more consistent
• Mistakes will be made (especially in the
spot quantitation) but you can’t
manually check hundreds of thousands
of spots
Affymetrix Software
• Affymetrix System is totally automated
• Computes a single value for each gene from 40
probes - (using surprisingly kludgy math)
• Highly reproducible
(re-scan of same chip or hyb. of duplicate chips with
same labeled sample gives very similar results)
• Incorporates false results due to image artefacts
– dust, bubbles
– pixel spillover from bright spot to neighboring dark
spots
Goals of a Microarray
Experiment
1. Find the genes that change expression
between experimental and control
samples
2. Classify samples based on a gene
expression profile
3. Find patterns: Groups of biologically
related genes that change expression
together across samples/treatments
Basic Data Analysis
• Fold change (relative increase or decrease in
intensity for each gene)
• Set cutoff filter for low values
(background +noise)
• Cluster genes by similar changes - only really
meaningful across multiple treatments or
time points
• Cluster samples by similar gene expression
profiles
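A fold-change calculation with a low-value cutoff might look like the sketch below. The two rows are copied from the Affymetrix pivot table; the `floor` of 20 and the helper name `fold_change` are assumptions for illustration, and a real analysis would normalize first.

```python
from statistics import mean

# Sample classes in the column order of the pivot table
CLASSES = ["normal", "tumor", "tumor", "normal", "normal", "tumor"]

# Two rows copied from the pivot table
ROWS = {
    "AFFX-HUMRGE/M10098_5_at": [547.6, 405.9, 6894.7, 3496.1, 1958.5, 5799.4],
    "AFFX-PheX-5_at": [7.8, 3.0, 7.6, 5.6, 5.0, 6.4],
}

def fold_change(values, classes, floor=20.0):
    """Tumor/normal ratio of class means.

    Returns None when both class means are below the floor
    (an assumed background + noise cutoff), because a ratio of
    two noise-level values is not meaningful.
    """
    tumor = mean(v for v, c in zip(values, classes) if c == "tumor")
    normal = mean(v for v, c in zip(values, classes) if c == "normal")
    if max(tumor, normal) < floor:
        return None
    return max(tumor, floor) / max(normal, floor)

for probe, values in ROWS.items():
    print(probe, fold_change(values, CLASSES))
```

The PheX control probe is filtered out because every value is near background, while the HUMRGE probe gives a tumor/normal ratio of a little over 2.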
Streamlined Affy Analysis
[Flowchart, drawn over the pivot table shown above:]
Raw data
Normalize (RMA)
Filter
• Present/Absent
• Minimum value
• Fold change
Significance / Classification / Clustering
• Significance: t-test, SAM, Rank Product
• Classification: PAM, Machine learning
Gene lists
Function (Gene Ontology)
Sources of Variability
• Image analysis (identifying and quantitating
each spot on the array)
• Scanning (laser and detector, chemistry of the
fluorescent label)
• Hybridization (temperature, time, mixing, etc.)
• Probe labeling
• RNA extraction
• Biological variability
Scatter plot of all genes in a
simple comparison of two
control (A) and two
treatments (B: high vs. low
glucose) showing changes in
expression greater than 2.2
and 3 fold.
Thomas Hudson, Montreal Genome Center
Normalization
• Can control for many of the experimental
sources of variability (systematic, not random
or gene specific)
• Bring each image to the same average
brightness
• Can use simple math or fancy -
– divide by the mean (whole chip or by sectors)
– LOESS (locally weighted regression)
• No sure biological standards
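The "divide by the mean" option can be sketched as follows. The chip values and the target of 500 are hypothetical, and the function name `scale_to_target` is an assumption for the example, not part of any vendor software.

```python
from statistics import mean

def scale_to_target(chip, target=500.0):
    """Rescale one chip so that its mean intensity equals the target."""
    factor = target / mean(chip)
    return [v * factor for v in chip]

# Hypothetical intensities: chip2 was scanned three times brighter overall
chips = {
    "chip1": [100.0, 200.0, 300.0],
    "chip2": [300.0, 600.0, 900.0],
}

normalized = {name: scale_to_target(values) for name, values in chips.items()}
for name, values in normalized.items():
    print(name, [round(v, 1) for v in values])
```

After scaling, both chips have the same average brightness, so the remaining differences between them are not due to overall scan intensity. This only corrects a chip-wide systematic effect; it does nothing for intensity-dependent bias, which is where LOESS is used.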
RMA
• Robust Multichip Average
• Bolstad, B.M., Irizarry R. A., Astrand, M., and Speed,
T.P. (2003), A Comparison of Normalization Methods
for High Density Oligonucleotide Array Data Based
on Bias and Variance. Bioinformatics 19(2):185-193
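The cited paper compares normalization methods, including quantile normalization, which is the normalization step used within RMA. The sketch below shows only that step, under simplifying assumptions: ties are not averaged, and the background correction and probe-set summarization parts of RMA are omitted. The function name and data are hypothetical.

```python
def quantile_normalize(columns):
    """Give every chip the same intensity distribution.

    columns: list of equal-length lists, one list per chip.
    Each value is replaced by the mean, across chips, of the values
    that share its rank. Ties are not treated specially (simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all chips
    rank_means = [sum(col[k] for col in sorted_cols) / len(columns)
                  for k in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        result.append(out)
    return result

# Hypothetical probe intensities for two chips
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
```

After this step every chip contains exactly the same set of values; only the assignment of values to probes differs between chips.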
Are the Treatments Different?
• Analysis of microarray data has tended to focus on
making lists of genes that are up or down regulated
between treatments
• Before making these lists, ask the question:
"Are the treatments different?"
• Use standard statistical methods to evaluate
expression profiles for each treatment (t-test or F-test)
• If there are differences, find the genes most
responsible
• If there are not significant overall differences, then
lists of genes with large fold changes may only reflect
random variability.
Statistics
• When you have variability in measurements,
you need replication and statistics to find
real differences
• It’s not just the genes with 2 fold increase,
but those with a significant p-value across
replicates
• Non-parametric (i.e. rank) or paired value
statistics may be more appropriate
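As a rough illustration of testing one gene across replicates, the sketch below computes a Welch t statistic and an exact permutation p-value (a non-parametric alternative to looking the statistic up in a t distribution). The replicate values and function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic for two groups of replicate measurements."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))

def permutation_p(a, b):
    """Exact two-sided permutation p-value on |difference of means|.

    Every way of splitting the pooled values into groups of the
    original sizes is enumerated, so this is only practical for a
    small number of replicates.
    """
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(ga) - mean(gb)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total

# Hypothetical log-scale expression values, 3 replicates per group
control = [10.0, 11.0, 12.0]
treated = [20.0, 21.0, 22.0]
print(welch_t(control, treated), permutation_p(control, treated))
```

With only 3 replicates per group there are just 20 possible splits, so the smallest two-sided permutation p-value is 2/20 = 0.1 no matter how large the difference is. That is one concrete reason replication matters.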
Multiple Comparisons
• In a microarray experiment, each gene (each
probe or probe set) is really a separate
experiment

• Yet if you treat each gene as an independent
comparison, you will always find some with
significant differences
– (the tails of a normal distribution)
False Discovery
• Statisticians call a false positive a "Type I error" or a
"False Discovery"
• The expected number of false positives is roughly the p-value
cutoff of the t-test × the number of genes in the array
– For a p-value of 0.01 × 10,000 genes
= 100 false "different" genes
– You cannot eliminate false positives, but by choosing a
more stringent p-value, you can keep them manageable
(try p=0.001)
• The False Discovery Rate (FDR) is the fraction of the genes you
call "different" that are false positives
• The number of false positives must be smaller than the number of real
differences that you find - which in turn depends on
the size of the differences and variability of the
measured expression values
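The arithmetic on this slide, plus one common way to control the false discovery rate, can be sketched as below. The Benjamini-Hochberg adjustment is not mentioned in the slides; it is added here as an assumption about what a typical analysis would use, and the p-values are hypothetical.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff by chance alone,
    assuming no gene is truly different."""
    return p_cutoff * n_genes

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(expected_false_positives(0.01, 10000))   # the slide's example
print(expected_false_positives(0.001, 10000))  # the stricter cutoff
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))  # hypothetical p-values
```

Genes whose adjusted value falls below a chosen level (for example 0.05) form a list in which roughly that fraction is expected to be false discoveries.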
SAM
• Significance Analysis of Microarrays
• Tusher, Tibshirani and Chu (2001): Significance
analysis of microarrays applied to the ionizing radiation
response. PNAS 2001 98: 5116-5121, (Apr 24).

• Excel plugin
• Free
• Permutation based
• Most published method of
microarray data analysis
Higher Level
Microarray data analysis
• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
• Discovery of common sequences in co-regulated
genes
• Meta-studies using data from multiple experiments
Types of Clustering
• Hierarchical
– Link similar genes, build up to a tree of all
• Self Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)
• Principal Component
– every gene is a dimension (vector), find a single
dimension that best represents the differences in
the data
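The hierarchical approach ("link similar genes, build up") can be sketched as below. This is a minimal version assuming average linkage and a distance of 1 minus the Pearson correlation; the slides do not specify either choice, and the gene profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between clusters is the average of (1 - correlation)
    over all pairs of member genes (average linkage, an assumption).
    """
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles across four time points
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 5.0, 2.0],
}
print(hierarchical_clusters(profiles, 2))
```

Genes g1 and g2 rise together and g3 and g4 fall together, so they end up in separate clusters even though their absolute levels differ. Recording the order of merges, rather than stopping at a fixed count, gives the full tree.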
Cluster by
color
difference
GeneSpring
SOM Clusters
Classification
• How to sort samples into two classes
based on gene expression data
– Cancer vs. normal
– Cancer sub-types
(benign vs. malignant)
– Responds well to drug vs. poor
response
(e.g. tamoxifen for breast cancer)
Support Vector Machines

Fat planes: With an infinitely thin plane the data can always
be separated correctly, but not necessarily with a fat one.
Again if a large margin separation exists, chances are good
that we found something relevant.
Large Margin Classifiers
PAM: Prediction Analysis for Microarrays
Class Prediction and Survival Analysis for Genomic Expression Data Mining
Performs sample classification from gene expression data,
via the "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002):
"Diagnosis of multiple cancer types by shrunken centroids of gene expression"
PNAS 2002 99:6567-6572 (May 14).
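The idea behind nearest shrunken centroids can be sketched in simplified form. This is not the PAM software's implementation: it soft-thresholds the raw difference between each class centroid and the overall centroid, and it omits the per-gene standardization used in the published method. All names, the `delta` value, and the data are assumptions for illustration.

```python
def soft_threshold(x, delta):
    """Move x toward zero by delta; values within delta become zero."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0

def shrunken_centroids(samples, labels, delta):
    """Class centroids shrunk toward the overall centroid (simplified)."""
    n_genes = len(samples[0])
    overall = [sum(s[g] for s in samples) / len(samples)
               for g in range(n_genes)]
    centroids = {}
    for cls in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        centroids[cls] = [
            overall[g] + soft_threshold(
                sum(m[g] for m in members) / len(members) - overall[g], delta)
            for g in range(n_genes)]
    return centroids

def classify(sample, centroids):
    """Assign the sample to the class with the nearest centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))

# Hypothetical log-scale data: gene 0 separates the classes, gene 1 is noise
samples = [[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]]
labels = ["normal", "normal", "tumor", "tumor"]
centroids = shrunken_centroids(samples, labels, delta=0.5)
print(classify([2.9, 5.0], centroids), classify([1.0, 5.3], centroids))
```

Shrinkage sets the noise gene's centroid to the same value in both classes, so it no longer influences the classification. That is how the method selects a small set of informative genes.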
BioConductor
• All of these normalization, statistical,
and clustering methods are available in
a free software package called
BioConductor.
www.bioconductor.org
• User hostile command line interface
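• Uses scripts in the 'R' statistical language
> library(affy)     # assumed: provides pm() and mm()
> library(SpikeIn)  # assumed: data package that supplies the SpikeIn dataset
> data(SpikeIn)
> pms <- pm(SpikeIn)    # perfect-match probe intensities
> mms <- mm(SpikeIn)    # mismatch probe intensities
> par(mfrow = c(1, 2))  # two plots side by side
> # the spike-in concentrations are stored in the sample names
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20,
+ 12, byrow = TRUE)
> # PM intensity vs. concentration on log-log axes, with the mean as a thick line
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> # the same plot for the MM probes
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)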
Functional Genomics
• Take a list of "interesting" genes and
find their biological relationships
– Gene lists may come from
significance/classification analysis of
microarrays, proteomics, or other
high-throughput methods
• Requires a reference set of "biological
knowledge"
Gene Ontology
• How to organize biological
knowledge?
– Biologists work on a variety of
different research organisms: yeast,
fruit fly, mouse, … human
– The same gene can have very
different functions (antennapedia)
and very different names
(sonic hedgehog…)
GO
• Biologists got together a few years ago
and developed a sensible system called
Gene Ontology (GO)
• 3 hierarchical sets of terminology
– Biological Process
– Cellular Component (location within cell)
– Molecular Function
• about 1000 categories of functions
Biological Pathways
Microarray Databases
• Large experiments may have hundreds of
individual array hybridizations
• Core lab at an institution or multiple
investigators using one machine - data
archive and validate across experiments
• Data-mining - look for similar patterns of
gene expression across different experiments
Public Databases
• Gene Expression data is an essential aspect of
annotating the genome
• Publication and data exchange for microarray
experiments
• Data mining/Meta-studies
• Common data format - XML
• MIAME (Minimum Information About a
Microarray Experiment)
GEO at the NCBI
ArrayExpress at EMBL
Gene Expression
Technologies
• cDNA (EST) libraries
• SAGE
• Microarray
• rt-PCR
• RNA-seq
The Cancer Genome Anatomy
Project
• CGAP has collected a large amount of
cDNA and related data online
• http://cgap.nci.nih.gov/
• cDNA libraries from various tissues
– search for genes
– compare expression levels
SAGE
• Serial Analysis of Gene Expression is a
technology that sequences very short
fragments of mRNA (10 or 17 bp) that have
been randomly ligated together
• The short ‘tags’ are assigned to genes and
then relative counts for each gene are
computed for cDNA libraries from various
tissues
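The counting step can be sketched as below, assuming the tags have already been extracted from the sequence reads. The tag sequences, the tag-to-gene table, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical 10-bp tags and gene assignments, for illustration only
TAG_TO_GENE = {"AAAAAAAAAA": "geneA", "CCCCCCCCCC": "geneB"}

def relative_counts(tags, tag_to_gene):
    """Return each gene's share of all tags in one library.

    Tags with no gene assignment are ignored in the counts but still
    contribute to the library total.
    """
    counts = Counter(tag_to_gene[t] for t in tags if t in tag_to_gene)
    total = len(tags)
    return {gene: n / total for gene, n in counts.items()}

# One hypothetical library: 3 tags for geneA, 1 for geneB, 1 unassigned
library = ["AAAAAAAAAA"] * 3 + ["CCCCCCCCCC"] + ["GGGGGGGGGG"]
print(relative_counts(library, TAG_TO_GENE))
```

Dividing by the library total is what allows libraries of different sizes, from different tissues, to be compared.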
SAGE Genie

• SAGE Anatomic Viewer
• SAGE Digital Gene Expression Displayer
• Digital Northern
• SAGE Experiment Viewer
Microarray
• GEO database at NCBI
• Microarray experiments
– Defined arrays
– Published results
– Also lots of inconclusive experiments
– Tools to search for specific genes
– Unreliable to search for tissue or disease in
experiment description text
RNA-seq
• Next Generation DNA sequencing
• NYU currently has one Illumina Genome
Analyzer
– generates more than 1 million RNA sequences
per sample
• Currently seeking funding for a Roche/454
– produces 100K reads of 250-400 bp
Count Transcripts
• Technology exists to accurately count
transcripts and compare samples
• “Digital Gene Expression”
• Can also identify alternate isoforms, splice
variants, etc.
