Basic Principles in Bioinformatics: Understanding Microarrays
Understanding Microarrays
Part I
Introduction to microarray technology
Illustration of typical biological questions related to microarray studies
Short presentation of methods used for the analysis of microarray data
Part II (TP)
Discussion of biological questions and how they could be answered by applying
microarray data mining
Part III
Functional classification
Biological Problem
Proteome: Proteins
Genomics Fundamentals
Transcription
PCR
Exon 2 Exon 3
cDNA/PCR product: ATGC
Exon 2 Exon 3
Introduction to Microarray Technology: Sample
Tumor Tissue
Normal Tissue
Protein
CCAGGCAAUAAAAAA
A U G AGUAAUAAAAAA
CCAGGCAATAAAAAA
CCAGGCAATAAAAAA
A T G AGTAATAAAAAA
A T G AGTAATAAAAAA
Normal Tumor
Physical support: glass slide, nylon membrane
Photolithography or printing
Microarray: Definition
Experimental design
Chip preparation
Probe design
Probe preparation
Printing
Sample preparation
cRNA/cDNA Labeling
Hybridization
Scanning
3' untranslated region
Annotation less safe
Danger of alternative polyA sites
Danger of repetitive elements
Less likely to cross-hybridize with isoforms
Little danger of alternative splicing
5' untranslated region
Close linkage to promoter
Frequently not available
INTRON
Probe Design for Custom Array
Transporters: 670
Channels: 263
Transporters: 316
Channels: 151
Contigs: 156
Positive Controls: 9
Negative Controls: 3
Controls (diff. Oligos): 9
Run multiple alignment and selection of representative genes
Run Pick70: Tm = 70, Palindrome, Uniqueness = 15 bp
RGS: 75
FGF/RGF-like: 7 236 Contigs and singlets
ADAM family: 18
Assemble contigs
Brown et al. AAPS PharmSci. 2003 Anderle et al. Pharm Res. 2003
THE EXPERIMENT : Printing I
Microspotting is done by a robot called an arrayer
THE EXPERIMENT : Printing II
Microspotting
THE EXPERIMENT : Printing III
Oligo-spotting (Photolithography)
Summary
Microarray Analysis: Data Analysis
Experimental design
Scanning and Processing images
Statistical analysis of expression values
Addressing or gridding
Assigning coordinates to each of the spots
Segmentation
Classification of pixels either as foreground or as background
Intensity extraction (for each spot)
Foreground fluorescence intensity pairs (R, G)
Background intensities
Quality measures
Fluorescence Signal to Expression Level
GTTAAGCGTTCCGATGCTACTTACC PM
GTTAAGCGTTCCCATGCTACTTACC MM
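The PM/MM pair above differs only at the central (13th) base of the 25-mer, where the MM probe carries the complementary base. A minimal sketch of that relationship follows; the function name `mismatch_probe` is hypothetical and is only meant to illustrate the slide, not any vendor tool.

```python
# Hypothetical helper: derive the mismatch (MM) probe from a perfect-match (PM) probe
# by complementing the central base, as in the PM/MM pair shown on the slide.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm: str) -> str:
    """Return the PM sequence with its middle base replaced by the complement."""
    mid = len(pm) // 2  # index 12 for a 25-mer, i.e. the 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "GTTAAGCGTTCCGATGCTACTTACC"
print(mismatch_probe(pm))  # prints GTTAAGCGTTCCCATGCTACTTACC
```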
Probes
Probes
Consensus sequence
Fluorescence Signal to Expression Level I
Example: Affymetrix
MAS 5.0:
$\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM^*_j)\}$
with $MM^*$ a version of MM that is never bigger than PM; the Tukey biweight is a type of
robust estimator.
Li and Wong model:
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2)$
$\theta_i$ is the gene expression in chip $i$, $\phi_j$ is the rate of increase of the PM response over MM
(probe-specific effect)
RMA:
$\log(PM_{ij} - BG) = a_i + b_j + \epsilon_{ij}$
Uses only PM and ignores MM, assumes an additive model (on the log scale), and estimates chip
effects $a_i$ and probe effects $b_j$ using a robust method (median polish)
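To make the median polish step concrete, here is a minimal sketch of fitting an additive chip-plus-probe model by alternating row and column median sweeps. It assumes the input is already background-corrected and log-transformed; the function name `median_polish` and the toy numbers are illustrative assumptions, not the RMA reference implementation.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + chip[i] + probe[j] by alternating median sweeps.

    table[i][j] is assumed to be the log intensity of probe j on chip i.
    """
    n_chips, n_probes = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep out the median of each row (chip effect).
        for i in range(n_chips):
            m = median(resid[i])
            chip[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(probe)
        overall += m
        probe = [p - m for p in probe]
        # Sweep out the median of each column (probe effect).
        for j in range(n_probes):
            m = median(resid[i][j] for i in range(n_chips))
            probe[j] += m
            for i in range(n_chips):
                resid[i][j] -= m
        m = median(chip)
        overall += m
        chip = [c - m for c in chip]
    return overall, chip, probe, resid

# Toy data: three chips, five probes, exactly additive on the log scale.
a = [8.0, 9.0, 10.0]
b = [-0.5, 0.0, 0.25, 1.0, -1.0]
table = [[ai + bj for bj in b] for ai in a]
overall, chip, probe, resid = median_polish(table)
expression = [overall + c for c in chip]  # one summary value per chip
print(expression)
```

Because medians are used instead of means, a single outlying probe has little influence on the per-chip summary, which is the motivation for calling the method robust.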
MAS 5 vs. RMA: A Values
MAS 5 vs. RMA: M vs. A Plot
RMA MAS 5
Data Analysis: Transformation (Coding)
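Log2 transformation vs. no transformation
Ratio (untransformed)    Log2 ratio
2                        1
1                        0
0.5                      -1
2^x = y; log2(y) = x
2^2 = 4; log2(4) = 2
On the untransformed scale, a twofold increase (2) and a twofold decrease (0.5) lie at unequal distances from 1; on the log2 scale both lie at distance 1 from 0.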
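A short sketch of the log2 transformation of expression ratios, using example values chosen for illustration (a twofold increase, no change, and a twofold decrease):

```python
import math

# Example ratios: twofold up, unchanged, twofold down.
ratios = [2.0, 1.0, 0.5]
log_ratios = [math.log2(r) for r in ratios]

for r, m in zip(ratios, log_ratios):
    print(f"ratio {r:>4} -> log2 ratio {m:+.1f}")
# Up- and down-regulation of the same fold change become symmetric around 0.
```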
Data Analysis: Normalization I
Scatter (MVA-)plots
Normalization: global
0 2 Log2 Ratio
Data Analysis: Normalization III
Methods:
Median center: median log2(CY3/CY5) = 0
Linear transformation (scatter plot of CY5 vs. CY3)
Why is this not satisfactory? There is more noise among lowly expressed genes.
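Median centering can be sketched in a few lines: compute the log2 ratio for every spot and subtract the median so that the centered median is 0. The function name `median_center` and the intensities are made up for illustration.

```python
import math
from statistics import median

def median_center(cy3, cy5):
    """Return log2(CY3/CY5) per spot, shifted so that the median log ratio is 0."""
    log_ratios = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]

# Made-up intensities for five spots on one array.
cy3 = [200.0, 400.0, 800.0, 1600.0, 100.0]
cy5 = [100.0, 100.0, 100.0, 100.0, 100.0]
print(median_center(cy3, cy5))
```

This is a global correction: every spot is shifted by the same constant, so it cannot remove a bias that depends on intensity.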
Data Analysis: Use of M vs A Plot
[Figure: M vs. A plots, with magnification, before and after loess correction]
Loess correction
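The following is a simplified sketch of intensity-dependent normalization on an M vs. A plot, assuming the usual definitions M = log2(R/G) and A = 0.5 * log2(R * G). The smoother is a single-pass local linear fit with tricube weights, written out in plain Python as a stand-in for a full loess implementation (no robustness iterations); all function names and the `frac` parameter are assumptions of this sketch.

```python
import math

def ma_values(red, green):
    """M = log2(R/G) and A = 0.5 * log2(R*G) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at every x."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(red, green, frac=0.5):
    """Subtract the fitted M-on-A trend so that M is centered at 0 for every A."""
    m, a = ma_values(red, green)
    trend = loess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)], a
```

Unlike median centering, the correction here varies with A, so a curved M vs. A cloud is straightened rather than merely shifted.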
Data Analysis: Normalization IV
[Figure: M vs. A plots for the whole array and for individual sub-arrays]
Regional variation
Spatial bias
Data Analysis: Normalization V
Use of spikes
Before normalization After normalization
Data Analysis: Low Level Analysis
Summary:
Chip has been built!
Signals have been measured!
Systematic errors have been removed!
Data Analysis: Limitations
Problems in data analysis
Limitations of traditional biological interpretations:
Methods:
1. Example
Identification of genes responsible for the differences in how patients respond to a
certain type of chemotherapy
2. Example
Identification of genes or groups of genes that explain the difference between tumor tissue and
non-tumor tissue based on the expression profiles of ~100 samples (60 tumor tissue / 40
healthy tissue)
3. Example
Length of neck
Correlation: Distance = 1 - R
Euclidean: Distance = sqrt((x1-x2)^2 + (y1-y2)^2)
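Both distance measures can be sketched for expression profiles of arbitrary length. R is taken here to be the Pearson correlation coefficient, which is an assumption since the slide does not name the coefficient; the function names are illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson_r(p, q):
    """Pearson correlation coefficient (assumes neither profile is constant)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

def correlation_distance(p, q):
    """Distance = 1 - R: 0 for perfectly correlated, 2 for anti-correlated profiles."""
    return 1 - pearson_r(p, q)

print(euclidean((0, 0), (3, 4)))  # prints 5.0
```

Note the different behavior: two profiles with the same shape but different magnitude, such as [1, 2, 3] and [2, 4, 6], are far apart in Euclidean terms but have a correlation distance of about 0.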
Sample 2 Sample 3
Sample 1 Sample 1
2- The distance measure between the new cluster and the others
Dendrogram
The dendrogram induces a linear ordering of the data points
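As an illustration of how the merge steps produce a dendrogram and a linear ordering of the data points, here is a small sketch of agglomerative clustering. It uses average linkage as the distance measure between the new cluster and the others, which is one common choice and an assumption here; the function name and sample points are made up.

```python
import math

def average_linkage(points):
    """Agglomerative clustering; returns the merge history and the leaf order."""
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices
    merges = []
    next_id = len(points)

    def dist(c1, c2):
        # Average pairwise Euclidean distance between the members of two clusters.
        pairs = [(i, j) for i in clusters[c1] for j in clusters[c2]]
        return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        candidates = ((x, y) for k, x in enumerate(ids) for y in ids[k + 1:])
        a, b = min(candidates, key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        # Concatenating the member lists keeps merged points adjacent in the order.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1

    (order,) = clusters.values()
    return merges, order

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges, order = average_linkage(points)
print(merges[0])  # first merge joins the two closest points
print(order)      # linear ordering of the data points induced by the dendrogram
```

The quadratic search over all cluster pairs is fine for a handful of points; real data sets need a more efficient implementation.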
Clustering: Defining Clusters
Unsupervised Clustering: Example
[Figure: PCA and LDA projections of samples plotted by Gene 1 and Gene 2 expression]
Supervised Methods: Learning Problems
Labels
Tissues
Microarray Data
Data Matrix
Evaluation
Subset Subset
Training Test
Predicted
Labels
LDA Predictor
Supervised Methods: Experimental Design
Group A Group B
t - Statistics
For all genes: compute the t statistic
LDA is done with the most differentially expressed gene, then with the most and the second
most differentially expressed genes, etc. (cumulative)
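The gene ranking step can be sketched as follows: compute a two-sample Student's t statistic (pooled variance) for every gene, rank genes by absolute value, and build the cumulative gene sets that would be passed to the classifier. The LDA step itself is not shown; the function names and expression values are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(group_a, group_b):
    """Student's two-sample t statistic with pooled variance.

    Assumes at least two samples per group and a non-zero pooled variance.
    """
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def rank_genes(expr_a, expr_b):
    """Rank gene names by |t|, most differentially expressed first."""
    scores = {g: abs(t_statistic(expr_a[g], expr_b[g])) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)

def cumulative_subsets(ranked):
    """Top 1 gene, top 2 genes, ... as used for the cumulative evaluation."""
    return [ranked[:k] for k in range(1, len(ranked) + 1)]

# Made-up expression values: gene -> samples in group A / group B.
expr_a = {"g1": [1.0, 2.0, 3.0], "g2": [5.0, 6.0, 7.0], "g3": [1.0, 2.0, 3.0]}
expr_b = {"g1": [1.5, 2.5, 3.5], "g2": [1.0, 2.0, 3.0], "g3": [11.0, 12.0, 13.0]}
ranked = rank_genes(expr_a, expr_b)
print(cumulative_subsets(ranked))
```

To avoid an optimistic bias, the ranking should be computed on the training subset only and the resulting gene sets then evaluated on the test subset.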
Supervised Methods: Student's t-Test and LDA II
[Figure: Effect of the number of genes selected with a Student's t-test on the LDA performance. x-axis: number of genes (cumulative), 0 to 40; y-axis: percent of correct predictions; series: training set and test set; labelled point: (12, 89)]
Summary: Part I
2. Exercise
What features would you include in a probe design program?
3. Exercise
Which methods do you think the authors applied to answer the questions described in their
abstracts?
4. Exercise
What are the principal objectives of a supervised or unsupervised learning method, respectively?
5. Exercise
What do you think are the major limitations of microarrays?
6. Exercise
When would you rather use RMA or MAS5, respectively?
7. Exercise
Why is normalization crucial for the analysis of microarray data?
8. Exercise
How can you relate microarray data and phenotypes?
Part II: Abstract A
Novel genes and functional relationships in the adult mouse gastrointestinal tract identified by microarray analysis.
Bates MD, Erwin CR, Sanford LP, Wiginton D, Bezerra JA, Schatzman LC, Jegga AG, Ley-Ebert C, Williams SS,
Steinbrecher KA, Warner BW, Cohen MB, Aronow BJ.
Division of Gastroenterology, Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati,
Ohio 45229, USA.
BACKGROUND & AIMS: A genome-level understanding of the molecular basis of segmental gene expression along the
anterior-posterior (A-P) axis of the mammalian gastrointestinal (GI) tract is lacking. We hypothesized that functional patterning
along the A-P axis of the GI tract could be defined at the molecular level by analyzing expression profiles of large numbers of
genes. METHODS: Incyte GEM1 microarrays containing 8638 complementary DNAs (cDNAs) were used to define expression
profiles in adult mouse stomach, duodenum, jejunum, ileum, cecum, proximal colon, and distal colon. Highly expressed
cDNAs were classified based on segmental expression patterns and protein function. RESULTS: 571 cDNAs were expressed
2-fold higher than reference in at least 1 GI tissue. Most of these genes displayed sharp segmental expression boundaries, the
majority of which were at anatomically defined locations. Boundaries were particularly striking for genes encoding proteins that
function in intermediary metabolism, transport, and cell-cell communication. Genes with distinctive expression profiles were
compared with mouse and human genomic sequence for promoter analysis and gene discovery. CONCLUSIONS: The
anatomically defined organs of the GI tract (stomach, small intestine, colon) can be distinguished based on a genome-level
analysis of gene expression profiles. However, distinctions between various regions of the small intestine and colon are much
less striking. We have identified novel genes not previously known to be expressed in the adult GI tract. Identification of genes
coordinately regulated along the A-P axis provides a basis for new insights and gene discovery relevant to GI development,
differentiation, function, and disease.
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA,
Bloomfield CD, Lander ES.
Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA.
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer
classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer
classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a
test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute
lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to
determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene
expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer,
independent of previous biological knowledge.
Microarray Data Analysis Workflow. Existing data (repository) (1)-> generate data (2) -> collect
& manage data (3) (Microarray data management systems) -> analyze interesting sequences
(4) -> depositing into repositories (5)
Functional Classification
Biological process
Stearoyl-coenzyme A desaturase SCOD-1 x
Carnitine palmitoyl transferase-1 CPT-1 x
Fatty acid binding protein FABP x
Phosphoenolpyruvate carboxykinase PEPCK x x
Molecular function
Affymetrix
Representative Sequence
Representative Sequence: Chosen during chip design as a sequence which is best associated with the
transcribed region being interrogated
BLAT threshold: Keep only records with match / Qsize >= 75% and score >= 0.70, where
score = (match - mismatch - gap# x 5 - gap_size x 2) / Qsize. If a record has several mapping locations with
score > 0.70, choose the highest-scoring one; if several mapping locations share the same highest score, keep
all of them.
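The BLAT threshold rule can be written down directly. The sketch below assumes each alignment record for one query sequence is a dictionary with the hypothetical keys `match`, `mismatch`, `gap_count`, `gap_size`, `q_size`, and `location`; these names are illustrative and do not correspond to a particular BLAT output parser.

```python
def blat_score(match, mismatch, gap_count, gap_size, q_size):
    """score = (match - mismatch - gap# x 5 - gap_size x 2) / Qsize"""
    return (match - mismatch - gap_count * 5 - gap_size * 2) / q_size

def best_mappings(records, min_identity=0.75, min_score=0.70):
    """Return the location(s) with the highest passing score for one query."""
    scored = []
    for r in records:
        score = blat_score(r["match"], r["mismatch"], r["gap_count"],
                           r["gap_size"], r["q_size"])
        if r["match"] / r["q_size"] >= min_identity and score >= min_score:
            scored.append((score, r["location"]))
    if not scored:
        return []
    top = max(score for score, _ in scored)
    # Ties at the highest score are all kept.
    return [loc for score, loc in scored if score == top]

# Made-up alignments of one representative sequence (Qsize = 100).
records = [
    {"match": 95, "mismatch": 5, "gap_count": 0, "gap_size": 0, "q_size": 100, "location": "chr1:1000"},
    {"match": 95, "mismatch": 5, "gap_count": 0, "gap_size": 0, "q_size": 100, "location": "chr7:5000"},
    {"match": 80, "mismatch": 2, "gap_count": 1, "gap_size": 3, "q_size": 100, "location": "chr3:200"},
]
print(best_mappings(records))  # prints ['chr1:1000', 'chr7:5000']
```

The third record passes the 75% match criterion but scores 0.67 and is therefore dropped.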
EnsMart Approach: cDNA sequence plus an additional length of downstream sequence immediately following
the most 3' exon. The individual probe sequences are mapped, by exact matching. If more than 50 % of probes
mapped, then listed as hits.
Comparison of Various Annotations
[Figure: Venn diagrams comparing the probe set annotations of NetAffx, EnsMart, and Tagger, with counts given separately for chips A and B; one diagram for Mouse MOE A and B, one for Human U133 A and B]
Quality of Probe Sets
[Figure: log-log plot of number of probe sets (y-axis) against number of UniGenes (x-axis); series: EnsMart A, EnsMart B, Tagger A, Tagger B, NetAffx A, NetAffx B]
Functional Classification
[Figure: GO term hierarchy for ABCB1, showing terms GO:X, GO:Y, and GO:Z at levels L2 to L4]
Two pragmatic purposes of ontology:
1. Facilitate communication between people and organizations
2. Improve interoperability between systems
Ontologies are structured vocabularies in the form of directed acyclic graphs (DAGs) that represent a
network in which each term may be a child of one or more than one parent.
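A directed acyclic graph of terms can be represented as a mapping from each term to its parents. The sketch below uses placeholder identifiers (GO:X, GO:Y, GO:Z, and a hypothetical GO:ROOT) rather than real GO accessions, and collects all ancestors of a term, which is the basic operation behind annotating a gene at every level of the hierarchy.

```python
# Placeholder ontology: term -> list of parent terms (not real GO accessions).
PARENTS = {
    "GO:ROOT": [],
    "GO:X": ["GO:ROOT"],
    "GO:Y": ["GO:ROOT"],
    "GO:Z": ["GO:X", "GO:Y"],  # a term may be a child of more than one parent
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("GO:Z")))  # prints ['GO:ROOT', 'GO:X', 'GO:Y']
```

Because a term can have several parents, the structure is a graph rather than a tree, and the `seen` set prevents shared ancestors from being visited twice.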
Distribution: Probe Sets per UG
[Figure: log-log plot of number of UniGenes (y-axis) against number of probe sets (x-axis); series: U133A, U133B, U133AB, U74Av2, U74Bv2, U74Cv2, U74ABCv2, U74ABCv3_NA, MOE430A, MOE430B, MOE430AB]
Functional Classification II
GenMAPP:
Gene Microarray Pathway Profiler
KEGG: Kyoto Encyclopedia of Genes and Genomes
1. Computerizing the current knowledge of genetics, biochemistry, and molecular and cellular biology in
terms of the pathway of interacting molecules or genes
2. Collection of gene catalogs for all organisms with completely sequenced genomes and selected
organisms with partial genomes (consistent annotation)
3. Catalog of chemical elements, compounds and other substances in living cells
Kanehisa et al. 2002, Nucleic Acids Research; Ogata et al. 1999, Nucleic Acids Research; http://www.genome.ad.jp/kegg/
Functional Classification II
Similar to other nuclear hormone receptors, PPAR acts as a ligand-activated transcription factor. Upon binding fatty acids or hypolipidemic drugs, PPARα
interacts with RXR and regulates the expression of target genes. These genes are involved in the catabolism of fatty acids. Conversely, PPARγ is activated by
prostaglandins, leukotrienes and anti-diabetic thiazolidinediones and affects the expression of genes involved in the storage of fatty acids. PPARβ is only
weakly activated by fatty acids, prostaglandins and leukotrienes and has no known physiologically relevant ligand. However, data from PPARβ null mice suggest
that PPARβ does play a role in fatty acid metabolism and perhaps in skin proliferation and cancer.
Genetic Network Models: Goals
ArrayInformatics dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix, Scatter, line and Normalization to LOWESS, total intensity, median
series plots and a cluster image map,. is not supporting ratio or to a user generated gene list, graphing data
XML as of yet. trends after normalization enabling examination of
data variability.
BASE dual-color cDNA/oligo, dual-color cDNA/oligo, dual-color cDNA/oligo, Affymetrix, SAGE global mean or median ratio based normalization,
Affymetrix, SAGE Affymetrix, SAGE Lowess, MDS module
Expressionist Affymetrix Affymetrix Affymetrix, dual-color cDNA/oligo standard data processing and clustering
GeneDirector dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix ImaGene and GeneSight packages
GeNet dual-color cDNA/oligo, dual-color cDNA/oligo, dual-color cDNA/oligo, Affymetrix GeneSpring package
Affymetrix Affymetrix
GeneTraffic(Multi) filters, dual-color filters, dual-color filters, dual-color cDNA/oligo, Affymetrix, Global normalization, z-score, Lowess
cDNA/oligo, Affymetrix, cDNA/oligo, Affymetrix, normalization, full and sub-grid, for Affymetrix,
alternative probe based protocol
GeneX dual-color cDNA/oligo, dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix R routines are available to manipulate the data
Affymetrix (normalization, clustering, etc.)
maxdSQL dual-color cDNA/oligo, dual-color cDNA/oligo, dual-color cDNA/oligo, Affymetrix, maxdView, Filtering based on numerical values. 2-D
Affymetrix Affymetrix expression data class which represents results from one correlation plot with overlay of cluster data,
or more hybridizations and any associated clusters of multidimensional plots.
genes. Profiles viewers.
NOMAD dual-color cDNA/oligo, dual-color cDNA/oligo, dual-color cDNA/oligo, Axon scanner outcome ScanAlyse package: global normalization
Axon scanner outcome Axon scanner outcome
PartisanarrayLIMS filters, dual-color filters, dual-color filters, dual-color cDNA/oligo, Affymetrix, global mean or median ratio based normalization
cDNA/oligo, Affymetrix, cDNA/oligo, Affymetrix,
Resolver Affymetrix, Nylon filters, Affymetrix, Nylon filters, Affymetrix, Nylon filters. Table Viewer: K-means, K- Error models with any experimental replicates
dual-color cDNA/oligo dual-color cDNA/oligo medians clustering, and SOM algorithms. performed, P-values computed and error bars for
every gene expression measurement, ANOVA.
SMD dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo ScanAlyse package: global normalization
Microarray and Data Repositories
Name | Data Type | Tissue Type | Description | Web address
GEO | Microarray/SAGE | Normal and tumor | Gene expression and hybridization array data repository | http://www.ncbi.nlm.nih.gov/geo/
RAD | Microarray/SAGE | Normal and tumor | The ultimate goal is to allow comparative analysis of experiments performed by different laboratories using different platforms and investigating different biological systems. | http://www.cbil.upenn.edu/RAD2/
ExpressDB | Microarray/SAGE | Yeast | Collection of yeast RNA expression datasets | http://arep.med.harvard.edu/cgi-bin/ExpressDByeast/EXDStart
CleanEx | Microarray/EST libraries | Normal and tumor | Gene expression and hybridization array data repository. SAGE will be added. | http://www.epd.isb-sib.ch/cleanex/
Gene Expression Database | Microarray | Tumor | Data from 60 cancer cell lines based on Affymetrix and cDNA technology | http://discover.nci.nih.gov/arraytools
SMD | Microarray | Normal and tumor | Extensive collection of cDNA microarray data | http://genome-www.stanford.edu/microarray
SAGEmap | SAGE | Normal and tumor | Data from one hundred SAGE (Serial Analysis of Gene Expression) CGAP (Cancer Genome Anatomy Project) libraries | http://www.ncbi.nlm.nih.gov/SAGE/
SAGE | SAGE | Normal and tumor | SAGE data from over 600,000 transcripts including SAGE data from human, mouse and yeast transcripts. | http://www.sagenet.org/SAGEData/sagedata.htm
UniGene | EST libraries | Normal and tumor | Collection of EST libraries from different species | http://www.ncbi.nlm.nih.gov/UniGene/
CGAP/Tissue | EST libraries | Normal and tumor | Information on CGAP and other cDNA libraries. | http://cgap.nci.nih.gov/Tissues/xProfiler
BodyMap | EST libraries | Normal and tumor | Database of expression information of human and mouse genes in various tissues and cell types. | http://bodymap.ims.u-tokyo.ac.jp
TissueInfo | EST libraries | Normal | Information on the tissue expression profile of a sequence, obtained by comparing the given sequence against the EST database. Each EST comes from a library derived from a specific tissue type. | http://icb.mssm.edu/services/tissueinfo/query
Web Resources : General Information
Interesting Books
Baldi and Hatfield, DNA Microarrays and Gene Expression, Cambridge University Press, 2002