Microarray: Yuki Juan Ntust May 26, 2003
Content
Biology background of microarray
Design of microarray
The workflow of microarray
Image analysis of microarray
Data analysis of microarray
Discussion
The central dogma of life forms
DNA
RNA
Monitoring the expression of genes
Central Dogma
DNA (replication): --ACGCGA-- paired with --TGCGCT--
RNA (transcription): --UGCGCU--
Protein (translation): --CYS-ALA--
DNA
[Diagram: DNA (replication) -> RNA (transcription) -> Protein (translation)]
Stable
Bases: A, T, G, C
Base pairs: A-T, G-C
Nucleotide
Base pair
Oligonucleotide
(http://www.nhgri.nih.gov/)
DNA Strand
Read from 5' to 3'
Antiparallel: one strand has the direction opposite to its complement
5' TACTGAA 3'
3' ATGACTT 5'
The force between the bases of a pair is hydrogen bonding. This force lets A pair specifically with T (or U in RNA) and C pair specifically with G.
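The pairing rule can be sketched in a few lines of Python. This is only an illustration using the example strand from the slide; the function names are made up for this sketch.

```python
# Watson-Crick pairing for DNA: A-T and C-G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Pair every base, keeping the same left-to-right order."""
    return "".join(DNA_PAIRS[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Complementary strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

top = "TACTGAA"                          # 5'-TACTGAA-3'
bottom_3_to_5 = complement(top)          # partner strand, read 3' to 5'
bottom_5_to_3 = reverse_complement(top)  # same partner, read 5' to 3'
print(bottom_3_to_5, bottom_5_to_3)
```

Because the strands are antiparallel, the partner strand has to be reversed before it can be read in the usual 5'-to-3' direction.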
RNA
Types
RNA (Detailed)
(http://www.nhgri.nih.gov/)
Reverse Transcription
Using reverse transcriptase, we can convert RNA into complementary DNA (cDNA).
The basic DNA detection technique, used for over 30 years, is known as the Southern blot:
1. A known strand of DNA is deposited on a solid support (e.g. nitrocellulose paper)
2. An unknown mixed bag of DNA is labelled (radioactive or fluorescent)
3. The unknown DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then the excess solution is washed off
4. If a copy of the known DNA occurs in the unknown sample, it will stick (hybridize), and the labelled DNA will be detected on photographic film
When we measure the level of an mRNA, we are monitoring the activity of a gene. Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome. A microarray delivers the equivalent of over 10,000 blotting measurements in a single experiment, which makes monitoring genome-wide activity possible.
Design of Microarray
Microarray in different contexts
The idea of microarray
Main types of array chips
Different tissues, same organism (brain v. liver)
Same tissue, same organism (tumor v. nontumor)
Same tissue, different organisms (wt v. mutant)
Time course experiments (development)
Other special designs (e.g. to detect spatial patterns)
Idea of Microarray
Cell A Cell B
Hybridization to chip
Developed by Pat Brown's lab at Stanford
PCR products of full-length genes (>100 nt)
Use PCR to amplify DNA
Robotic "pen" deposits DNA at defined coordinates
Affymetrix Microarrays
Raw image
1.28cm
50um
~10^7 oligonucleotides, half perfectly match mRNA (PM), half have one mismatch (MM)
Raw gene expression is the intensity difference: PM - MM
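A minimal sketch of the PM - MM idea, assuming a gene is represented by several probe pairs and that the raw signal is the average of the pairwise differences. The averaging step, the function name and the intensity values are assumptions for illustration, not taken from the slide.

```python
from statistics import mean

def average_difference(pm: list[float], mm: list[float]) -> float:
    """Average of PM - MM over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need the same, non-zero number of PM and MM probes")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for four probe pairs of a single gene.
pm = [1200.0, 950.0, 1100.0, 1010.0]
mm = [300.0, 280.0, 350.0, 310.0]
signal = average_difference(pm, mm)
print(signal)
```

Subtracting the mismatch intensity is meant to remove the part of the signal caused by non-specific binding.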
Agilent delivers printed 60-mer microarrays in addition to 25-mer formats. The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.
The workflow of microarray
Labeled cDNA
Hybridized Array
Scanning
Sample loading
1. Load the sample from the corner of the cover slip. This is time-consuming and easily produces bubbles.
2. Load the sample at the center of the array, then lower the cover slip smoothly. This is faster and less likely to produce bubbles than the first method.
3. Load the sample at the side of the array, then put the cover slip on. The solution attaches to the slip as soon as the slip touches it, and spreads along with the slip as we slowly lower it.
Scan
Image analysis
The Algorithms
1. Find spots: find the location of each spot on the microarray.
2. Cookie cutter algorithm:
(1) Assume the distribution of pixel intensities is a Gaussian curve
(2) Use the SD or IQR to identify the feature and background of each spot
Interquartile Range (IQR): the range between the 25% and 75% points of the distribution, covering the middle 50% of the values
Feature or cookie
Exclusion zone
Local background
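The three zones can be sketched as follows. This is a simplified illustration, not the scanner software's algorithm: it assumes fixed circular zones around a known spot centre and uses the IQR only to reject outlier pixels (the 1.5 x IQR cut-off is an assumed convention). The image and all names are hypothetical.

```python
from statistics import mean, quantiles

def iqr_filter(values, k=1.5):
    """Keep values within k * IQR of the quartiles (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

def cookie_cutter(image, cx, cy, r_feature, r_exclusion, r_background):
    """Return (feature mean, local background mean) for one spot."""
    feature, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= r_feature ** 2:
                feature.append(value)          # the "cookie"
            elif r_exclusion ** 2 < d2 <= r_background ** 2:
                background.append(value)       # local background ring
            # pixels between the two are the exclusion zone and are ignored
    return mean(iqr_filter(feature)), mean(iqr_filter(background))

# Hypothetical 7x7 image: bright spot in the centre, flat background.
size, cx, cy = 7, 3, 3
image = [[1000 if (x - cx) ** 2 + (y - cy) ** 2 <= 1 else 100
          for x in range(size)] for y in range(size)]
image[3][0] = 5000  # dust speck in the background ring, image[y][x]

spot, local_bg = cookie_cutter(image, cx, cy, 1, 2, 3)
print(spot, local_bg)
```

The dust speck falls outside the IQR fence and does not inflate the background estimate.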
Data Quality
indistinguishable
saturated
bad print
misalignment
artifact
[Figure: array rows labelled with yeast gene annotations, e.g. hypothetical proteins, questionable ORFs, histone H2B.2, coproporphyrinogen III oxidase, ubiquitin-like protein, 40S small subunit ribosomal protein S26e]
Data Normalization
Dye bias
Location bias
Intensity bias
Pin bias
Slide bias
Data Normalization
Uncalibrated: red light under-detected
Calibrated: red and green equally detected
Data Normalization
Assumptions
After Normalization
Additional Normalization
Pin dependent
Similar to the intensity-dependent fit
Compute individual lowess fits for each pin group
After pin-dependent normalization, log ratios for each pin are centered around 0
Scale variance for each pin
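The centering and scaling steps can be sketched as below. Note the substitution: instead of a lowess fit per pin group, this sketch subtracts the per-pin median, and it uses the median absolute deviation (MAD) as the spread measure. Both are simplifying assumptions, as are the function name and the data.

```python
from collections import defaultdict
from statistics import median

def normalize_by_pin(log_ratios, pins):
    """Center each pin group on 0 and bring all groups to a common spread."""
    groups = defaultdict(list)
    for m, pin in zip(log_ratios, pins):
        groups[pin].append(m)
    center = {pin: median(ms) for pin, ms in groups.items()}
    spread = {pin: median([abs(m - center[pin]) for m in ms])
              for pin, ms in groups.items()}
    target = median(list(spread.values()))  # common spread for all pins
    result = []
    for m, pin in zip(log_ratios, pins):
        scale = target / spread[pin] if spread[pin] > 0 else 1.0
        result.append((m - center[pin]) * scale)
    return result

# Hypothetical log ratios printed by two pins with different bias and spread.
log_ratios = [0.5, 1.0, 1.5, -2.0, -1.0, 0.0]
pins = [1, 1, 1, 2, 2, 2]
normalized = normalize_by_pin(log_ratios, pins)
print(normalized)
```

After the call both pin groups have median 0 and the same MAD, so spots printed by different pins become comparable.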
Additional Normalization
Dye swap
Combine relative expression levels without explicit normalization
Compute lowess fit for
log2(RR'/GG')/2 vs. (A + A')/2
where R', G' and A' come from the dye-swapped array
Normalized ratio is
log2(R/G) - c(A)
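A small worked sketch of why a dye swap helps, assuming the dye bias c is the same on both arrays. With M = log2(R/G) on the first array and M' = log2(R'/G') on the swapped array, the bias cancels in (M - M')/2 and is isolated in (M + M')/2 = log2(RR'/GG')/2. The function name and intensities are hypothetical.

```python
from math import log2

def dye_swap(r, g, r_swap, g_swap):
    """Return (bias-free log ratio, estimated dye bias) for one spot."""
    m = log2(r / g)                # sample in red, reference in green
    m_swap = log2(r_swap / g_swap) # dyes exchanged
    expression = (m - m_swap) / 2  # bias cancels
    bias = (m + m_swap) / 2        # equals log2(RR'/GG')/2
    return expression, bias

# Hypothetical spot: true ratio 4, red channel reads 2x too bright.
# Array 1: R = 4 * 2 * 100, G = 100. Swap: R' = 2 * 100, G' = 4 * 100.
expression, bias = dye_swap(800, 100, 200, 400)
print(expression, bias)
```

Here the recovered log ratio is log2(4) and the estimated bias is log2(2), matching the values used to build the example.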
Data analysis
Datasets
Class  Sno  D26528  D63874  D63880
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
Remove unreliable spots: saturated, non-uniform, background too high
Remove genes with insufficient variation: e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
Statistical filtering (e.g. p-value < 0.01)
Reasons for filtering: biological reasons; feature reduction for algorithmic reasons
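The MaxVal/MinVal rule can be tried directly on the three genes of the dataset table. A minimal sketch: the thresholds are the ones quoted in the rule, while the function name and the handling of a zero minimum are assumptions.

```python
def has_sufficient_variation(values, min_diff=500, min_ratio=5):
    """False when MaxVal - MinVal < min_diff and MaxVal / MinVal < min_ratio."""
    lo, hi = min(values), max(values)
    ratio = hi / lo if lo > 0 else float("inf")  # assumed guard for zero
    return not (hi - lo < min_diff and ratio < min_ratio)

# Expression values per gene across the seven samples in the table.
genes = {
    "D26528": [193, 129, 44, 218, 109, 106, 211],
    "D63874": [4157, 11557, 12125, 8484, 3537, 4578, 2431],
    "D63880": [556, 476, 498, 1211, 131, 94, 209],
}
kept = [name for name, values in genes.items()
        if has_sufficient_variation(values)]
print(kept)
```

D26528 is dropped because its range is 174 and its max/min ratio is just under 5; the other two genes vary enough to be kept.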
change analysis
Classification
identify
(Unsupervised)
n-fold change
2 expression
Calculate the standard deviation σ
Genes with expression more than 2σ away from the mean are differentially expressed
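A minimal sketch of the standard deviation rule, assuming it is applied to log ratios and that "away" means away from the mean over all genes. The gene names and values are hypothetical.

```python
from statistics import mean, stdev

def differentially_expressed(log_ratios: dict[str, float],
                             n_sd: float = 2.0) -> list[str]:
    """Genes whose log ratio lies more than n_sd standard deviations from the mean."""
    values = list(log_ratios.values())
    mu, sigma = mean(values), stdev(values)
    return [gene for gene, m in log_ratios.items()
            if abs(m - mu) > n_sd * sigma]

# Hypothetical log2 ratios: nine unchanged genes and one induced gene.
log_ratios = {
    "g1": 0.1, "g2": -0.1, "g3": 0.2, "g4": -0.2, "g5": 0.0,
    "g6": 0.1, "g7": -0.1, "g8": 0.0, "g9": 0.0, "g10": 3.0,
}
hits = differentially_expressed(log_ratios)
print(hits)
```

Unlike a fixed n-fold cut-off, the threshold here adapts to how noisy the experiment is.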
[Figure: expression plot with a logarithmic axis (ticks 0.1, 10, 100, 1000); labels include "72 (control)" and gene AF015450]
Classification: Multi-Class
Similar approach:
select top genes most correlated to each class
select best subset using cross-validation
build a single model separating all classes
Advanced:
build a separate model for each class vs. rest
choose the model making the strongest prediction
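The gene selection step can be sketched on the ALL/AML values from the dataset table. The slide does not say which correlation measure was used; this sketch assumes a signal-to-noise score, (mean1 - mean2) / (sd1 + sd2), which is one common choice.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b); larger magnitude = better separation."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

# (ALL samples, AML samples) for each gene, taken from the dataset table.
expression = {
    "D26528": ([193, 129, 44, 218], [109, 106, 211]),
    "D63874": ([4157, 11557, 12125, 8484], [3537, 4578, 2431]),
    "D63880": ([556, 476, 498, 1211], [131, 94, 209]),
}
ranked = sorted(expression,
                key=lambda gene: abs(signal_to_noise(*expression[gene])),
                reverse=True)
print(ranked)
```

Taking the first N entries of the ranking gives the "top genes" for the ALL vs. AML distinction; with more than two classes the same score would be computed for each class against the rest.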
Trees/Rules
Neural networks
SVM
K-nearest neighbor - robust for a small number of genes
Bayesian nets - simple, robust
Selected the top 100 genes most correlated to each class
Selected the best subset by testing subsets of 1, 2, ..., 20 genes, with leave-one-out cross-validation for each
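Leave-one-out cross-validation over growing gene subsets can be sketched as follows. The classifier here is a 1-nearest-neighbor rule, chosen only to keep the example short; the slide does not say which model was used, and the samples are hypothetical.

```python
from math import dist

def nearest_neighbor_label(sample, train_x, train_y):
    """Label of the training sample closest to `sample`."""
    best = min(range(len(train_x)), key=lambda i: dist(sample, train_x[i]))
    return train_y[best]

def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_label(samples[i], train_x, train_y) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical samples; genes are already ordered by correlation with class.
samples = [(9.0, 1.0), (8.5, 1.2), (9.2, 0.8),
           (1.0, 7.0), (1.5, 6.5), (0.8, 7.2)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

# Evaluate subsets made of the top 1 and top 2 genes.
accuracy_by_size = {
    n: leave_one_out_accuracy([s[:n] for s in samples], labels)
    for n in (1, 2)
}
print(accuracy_by_size)
```

The subset size with the best held-out accuracy would be the one kept; with only a handful of samples, leave-one-out makes the most of the available data.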
Clustering
Goals
Find natural classes in the data
Identify new classes / gene correlations
Refine existing taxonomies
Support biological analysis / discovery
Different Methods
Hierarchical
SOM clustering
SOM
Filter away genes with insufficient biological variation
Normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
Run the clustering algorithm
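The per-gene normalization step can be sketched as below. It assumes the population standard deviation is meant (the slide does not say whether population or sample SD is used); the function name and values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale one gene's expression across samples to mean 0, st. dev. 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("gene has no variation and should be filtered first")
    return [(v - mu) / sigma for v in values]

# Hypothetical expression of one gene across eight samples.
gene = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(gene)
print(z)
```

Standardizing each gene separately lets the clustering compare expression patterns rather than absolute expression levels.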
Hierarchical Clustering
The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method.
Initially, each data point forms its own cluster. The algorithm then repeatedly merges the two clusters that are the most similar, i.e. that have the shortest distance between them.
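The merging loop can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and single linkage (cluster distance = distance between the two closest members); other linkage rules are equally common. It stops at a requested number of clusters instead of building the full tree, and the points are hypothetical.

```python
from math import dist

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

# Hypothetical expression profiles measured under two conditions.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerate(points, 3)
print(clusters)
```

Recording the order and distance of each merge, rather than stopping early, would give the dendrogram usually drawn beside a clustered heat map.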
Hierarchical clustering
[Figure: Genomic Reprogramming in Response to Oxidant. Hierarchical clustering of 6218 genes over a time course (0, 10, 20, 40, 60, 120 minutes). Color scale: fold repression (>9, >6, >3), 1:1, fold induction (>3, >6, >9)]
Future directions
Algorithms optimized for small samples (the no. of samples will remain small for many tasks)
Integration with other data: biological networks, medical text, protein data
Cost-sensitive classification algorithms: error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
Discussion
Biological discovery
New and better molecular diagnostics
New molecular targets for therapy
Finding and refining biological pathways
Mutation and polymorphism detection
Recent examples
Molecular diagnosis of leukemia, breast cancer, ...
Appropriate treatment for genetic signature
Potential new drug targets
Microarray Limitations
Cross-hybridization of sequences with high identity
Chip-to-chip variation
True measure of abundance?
Do mRNA levels reflect protein levels?
Generally, microarrays do not prove new biology; they simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.
What fold change has biological relevance?
Need a cloned EST or some sequence knowledge; rare messages may go undetected
Expensive! Not every lab can afford to repeat experiments.
The real limitation is bioinformatics
Additional Information
Genomics, gene expression and DNA arrays (Nature, June 2000)
Microarray - technology review (Nature Cell Biology, Aug. 2001)
Magic of Microarray (Scientific American, Feb. 2002)
http://www.lsic.ucla.edu/ls3/tutorials/
Entrez: a retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
PubMed: the biomedical literature (Medline)
GenBank: nucleotide sequence database
Protein sequence database
Structure: three-dimensional macromolecular structures
Genome: complete genome assemblies
PopSet: population study data sets
OMIM: Online Mendelian Inheritance in Man
Taxonomy: organisms in GenBank
Books: online books
ProbeSet: gene expression and microarray datasets
3D Domains: domains from Entrez Structure
UniSTS: markers and mapping data
SNP: single nucleotide polymorphisms
CDD: conserved domains