
Microarray: Yuki Juan Ntust May 26, 2003

The document provides information about microarrays including the biology background, design, workflow, image analysis, and data analysis. It describes the central dogma of DNA replication, transcription, and translation. It explains how microarrays work by hybridizing fluorescently labeled cDNA from different cell samples to a chip containing DNA probes, then using image analysis software to quantify gene expression levels which are then normalized and analyzed. Microarrays allow monitoring expression of thousands of genes simultaneously.


Microarray

Yuki Juan NTUST May 26, 2003

Content

Biology background of microarray
Design of microarray
The workflow of microarray
Image analysis of microarray
Data analysis of microarray
Discussion

The Biology Background of Microarray

The central dogma of life forms
DNA
RNA
Monitoring the expression of genes

Central Dogma

DNA Replication
--ACGCGA--
--TGCGCT--

RNA Transcription
--UGCGCU--

Protein Translation
--Cys-Ala--

DNA

(Diagram: the central dogma. Replication copies DNA, transcription produces RNA from DNA, and translation produces protein from RNA.)

The double helix: stable
Nucleotides: A, T, G, C
Base pairs: A-T, G-C
Oligonucleotide: short DNA (tens of nucleotides, or bps)

(http://www.nhgri.nih.gov/)

DNA Strand

DNA has canonical orientation

Read from 5' to 3'
Antiparallel: one strand runs in the direction opposite to its complement
5' TACTGAA 3'
3' ATGACTT 5'

Hydrogen Bond Makes DNA Binding Specific

(Figure: hydrogen bonds between the bases of two antiparallel strands, with the 3' and 5' ends marked.)

The force between the two bases of a pair is the hydrogen bond. This force lets A pair specifically with T (or with U in RNA) and C pair specifically with G.
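The pairing rules and the antiparallel orientation can be illustrated with a short sketch. The function names below are hypothetical and chosen only for this example; the strand is the one shown on the DNA Strand slide.

```python
# Watson-Crick pairing rules for DNA (A-T, G-C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3' like the input."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def can_hybridize(probe: str, target: str) -> bool:
    """True if target pairs base-for-base with probe in antiparallel orientation."""
    return reverse_complement(probe) == target


# 5' TACTGAA 3' pairs with 3' ATGACTT 5', which reads TTCAGTA from 5' to 3'.
print(reverse_complement("TACTGAA"))        # TTCAGTA
print(can_hybridize("TACTGAA", "TTCAGTA"))  # True
```

This sketch only checks perfect complementarity. Real hybridization also tolerates partial matches, which is the source of the cross-hybridization discussed under limitations.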

RNA

(Diagram: the central dogma, with replication of DNA, transcription to RNA, and translation to protein.)

Types
messenger RNA (mRNA)
ribosomal RNA (rRNA)
transfer RNA (tRNA)

A gene is expressed by transcribing DNA into single-stranded mRNA

RNA (Detailed)

(http://www.nhgri.nih.gov/)

Reverse Transcription

(Diagram: the central dogma, with replication of DNA, transcription to RNA, and translation to protein.)

Using reverse transcriptase, we can convert RNA into cDNA.

The Southern Blot

A basic DNA detection technique, known as the Southern blot, has been used for over 30 years:

A known strand of DNA is deposited on a solid support (e.g. nitrocellulose paper)
An unknown mixed bag of DNA is labelled (radioactive or fluorescent)
The unknown DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then excess solution is washed off
If a copy of the known DNA occurs in the unknown sample, it will stick (hybridize), and the labelled DNA will be detected on photographic film

mRNA Represents Gene Function

When we measure the level of an mRNA, we are monitoring the activity of a gene. Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome. A microarray collects the equivalent of over 10,000 blots in a single experiment, which makes monitoring genome-wide activity possible.


Design of Microarray

Microarray in different contexts
The idea of microarray
Main types of array chips

mRNA Levels Compared in Many Different Contexts

Different tissues, same organism (brain v. liver)
Same tissue, same organism (tumor v. non-tumor)
Same tissue, different organisms (wt v. mutant)
Time course experiments (development)
Other special designs (e.g. to detect spatial patterns)

Idea of Microarray

(Figure: labeled cDNA of geneX from Cell A and Cell B is hybridized to the chip. The spot of geneX carries the sequence complementary to the colored cDNA, and this spot shows red after scanning.)

Over 10,000 Hybridizations Can Be Done at One Time

Several Types of Arrays

Spotted DNA arrays
  Developed by Pat Brown's lab at Stanford
  PCR products of full-length genes (>100 nt)

Affymetrix gene chips
  Photolithography technology from the computer industry allows building many 25-mers

Ink-jet microarrays from Agilent
  25-60-mers printed directly on glass slides
  Flexible, rapid, but expensive

Array Fabrication: Spotting

Use PCR to amplify DNA
Robotic "pen" deposits DNA at defined coordinates
  approximately 1-10 ng per spot
Experimentation with oligos (40, 70 bp)

This machine can make 48 microarrays simultaneously.

Array Fabrication: Photolithography

Light-activated synthesis
  synthesize oligonucleotides on glass slides
  10^7 copies per oligo in a 24 x 24 µm square

Use 20 pairs of different 25-mers per gene
  Perfect match and mismatch

Affymetrix Microarrays

(Figure: raw image of a chip, 1.28 cm across, with 50 µm features.)

~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM)
Raw gene expression is the intensity difference: PM - MM
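As a rough illustration of the PM - MM idea, the sketch below averages the difference over the probe pairs of one gene. Averaging across pairs is an assumption made here for illustration (the slide only states PM - MM), and the intensities and the function name are hypothetical.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene.

    Averaging across pairs is an illustrative assumption,
    not something stated on the slide.
    """
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs a matching MM probe")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for four probe pairs of one gene.
pm_intensities = [1200, 950, 1430, 1010]
mm_intensities = [300, 410, 380, 290]
print(average_difference(pm_intensities, mm_intensities))
```

The mismatch probe serves as a per-probe estimate of non-specific binding, so subtracting it is a form of local background correction.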

Agilent cDNA Microarrays and Oligonucleotide Microarrays

Agilent delivers printed 60-mer microarrays in addition to 25-mer formats. The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.


The Workflow of Microarray

(Workflow diagram: plate preparation leads to array fabrication; RNA extraction from the sample leads to cDNA synthesis and labeling; the labeled cDNA is hybridized to the array; the hybridized array is scanned.)

cDNA Synthesis and Direct Labeling

Cy3 and Cy5 cDNA Hybridization onto the Chip

e.g. treatment / control, normal / tumor tissue

Sample loading

1. Loading from the corner of the cover slip: time consuming and prone to producing bubbles.
2. Loading the sample at the center of the array, then placing the slip smoothly: faster, with a lower chance of producing bubbles than method 1.
3. Loading the sample at the side of the array, then placing the slip on: the solution attaches to the slip as soon as the slip touches it and spreads with the slip as we slowly lower it.

Scan

Green: down-regulated
Red: up-regulated
Yellow: equal level


Image analysis

To find a spot
Convert feature into numeric data
Image normalization

The Algorithms

1. Find spots: finds the location of each spot on the microarray.
2. Cookie cutter algorithm:
  (1) Assume the distribution of pixel intensities is Gaussian
  (2) Use SD or IQR to identify the feature and background of each spot
  (3) Calculate statistics for the pixel population

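Interquartile Range (IQR)

(Figure: pixel intensity distribution with the 25 %, 50 %, and 75 % quartiles marked. The IQR spans the 25 % to 75 % range, and a boundary for rejection is drawn on each side. Labels: K = IQR/2, 1.42 IQR.)

(Figure: a spot divided into the feature or cookie, the exclusion zone, and the local background.)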
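A minimal sketch of IQR-based pixel rejection follows. It assumes that the rejection boundaries sit 1.42 IQR below the 25 % quartile and 1.42 IQR above the 75 % quartile; the figure only gives the "1.42 IQR" label, so that placement is an assumption. The function name and pixel values are hypothetical.

```python
from statistics import quantiles

# Taken from the "1.42 IQR" label in the figure; how the factor is
# applied (beyond each quartile) is an assumption of this sketch.
REJECTION_FACTOR = 1.42


def reject_outlier_pixels(pixels):
    """Keep only pixels inside the IQR-based rejection boundaries."""
    q1, _median, q3 = quantiles(pixels, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - REJECTION_FACTOR * iqr
    high = q3 + REJECTION_FACTOR * iqr
    return [p for p in pixels if low <= p <= high]


# Hypothetical feature pixels with one dust speck (500).
feature_pixels = [10, 11, 12, 13, 14, 15, 16, 17, 500]
print(reject_outlier_pixels(feature_pixels))
```

Statistics such as the mean or median of the spot would then be computed on the retained pixels only.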

Data Quality

Irregular size or shape
Irregular placement
Low intensity
Saturation
Spot variance
Background variance

(Figure: example spots labeled indistinguishable, saturated, bad print, misalignment, artifact.)

Convert Feature Into Numeric Value

Green is the control (Ctrl) channel and red is the data (Data) channel. Column names in the scanner output: DxA - PSL (intensity), Bkgd (background), sDxA (background-corrected intensity), Ratio (sDxA): Data / Ctrl.

Spot | Green intensity | Green b.g. | Green b.g.-corrected | Red intensity | Red b.g. | Red b.g.-corrected | (R. b.g.-c)/(G. b.g.-c) | Systematic name | Gene function
A_1_1 | 59358.75 | 512.92 | 58845.83 | 50953.13 | 1779.913 | 49173.22 | 0.835628 | YAL003W | translation elongation factor eef1beta
A_1_2 | 1209.19 | 512.92 | 696.271 | 2522.345 | 1779.913 | 742.4323 | 1.066298 | YAR053W | hypothetical protein
A_1_3 | 1948.2 | 512.92 | 1435.28 | 3100.152 | 1779.913 | 1320.239 | 0.919848 | YBL078C | essential for autophagy
A_1_4 | 4940.806 | 512.92 | 4427.886 | 6670.604 | 1779.913 | 4890.691 | 1.104521 | YAL008W | protein of unknown function
A_1_5 | 1485.59 | 512.92 | 972.671 | 2916.086 | 1779.913 | 1136.173 | 1.168096 | YAR062W | putative pseudogene
A_1_6 | 32642.03 | 512.92 | 32129.11 | 42304.13 | 1779.913 | 40524.22 | 1.261293 | YBL087C | 60s large subunit ribosomal protein l23.e
A_1_7 | 6919.441 | 512.92 | 6406.521 | 8540.246 | 1779.913 | 6760.333 | 1.055227 | YAL014C |
A_1_8 | 2698.301 | 512.92 | 2185.382 | 4314.47 | 1779.913 | 2534.557 | 1.159778 | YAR068W | strong similarity to hypothetical protein yhr214w
A_1_9 | 7167.958 | 512.92 | 6655.038 | 7379.286 | 1779.913 | 5599.373 | 0.841374 | YBL100C | questionable orf
A_1_10 | 5470.062 | 512.92 | 4957.142 | 6953.799 | 1779.913 | 5173.886 | 1.043724 | YAL025C | nuclear viral propagation protein
A_1_11 | 27879.49 | 512.92 | 27366.57 | 33746.9 | 1779.913 | 31966.99 | 1.168103 | YBL002W | histone h2b.2
A_1_12 | 2589.613 | 512.92 | 2076.693 | 4385.568 | 1779.913 | 2605.655 | 1.254713 | YBL107C | hypothetical protein
A_1_13 | 6196.245 | 512.92 | 5683.326 | 8840.475 | 1779.913 | 7060.562 | 1.242329 | YDR044W | coproporphyrinogen iii oxidase
A_1_14 | 34737.1 | 512.92 | 34224.18 | 36129.62 | 1779.913 | 34349.7 | 1.003668 | YDR134C | strong similarity to flo1p, flo5p, flo9p and ylr110
A_1_15 | 34035.35 | 512.92 | 33522.43 | 27128.53 | 1779.913 | 25348.62 | 0.756169 | YDR233C | similarity to hypothetical protein ydl204w
A_1_16 | 1638.381 | 512.92 | 1125.461 | 2988.042 | 1779.913 | 1208.129 | 1.073453 | YDR048C | questionable orf
A_1_17 | 3873.718 | 512.92 | 3360.799 | 4955.141 | 1779.913 | 3175.228 | 0.944784 | YDR139C | ubiquitin-like protein
A_1_18 | 2433.625 | 512.92 | 1920.706 | 3502.406 | 1779.913 | 1722.493 | 0.896802 | YDR252W | strong similarity to egd1p and to human btf3 pro
A_1_19 | 1800.736 | 512.92 | 1287.816 | 3011.855 | 1779.913 | 1231.942 | 0.956613 | YDR053W | questionable orf
A_1_20 | 1296.689 | 512.92 | 783.77 | 2636.549 | 1779.913 | 856.6356 | 1.092968 | YDR149C | questionable orf
A_1_21 | 3453.24 | 512.92 | 2940.32 | 4968.026 | 1779.913 | 3188.113 | 1.084274 | YDR260C | hypothetical protein
A_1_22 | 10731.55 | 512.92 | 10218.63 | 9307.246 | 1779.913 | 7527.333 | 0.736629 | YDR056C | hypothetical protein
A_1_23 | 6191.309 | 512.92 | 5678.39 | 8808.398 | 1779.913 | 7028.485 | 1.23776 | YDR152W | weak similarity to c.elegans hypothetical protein
A_1_24 | 3589.998 | 512.92 | 3077.078 | 4420.744 | 1779.913 | 2640.831 | 0.858227 | YDR269C | questionable orf
A_1_25 | 27568.34 | 512.92 | 27055.42 | 20856.2 | 1779.913 | 19076.29 | 0.705082 | YGL189C | 40s small subunit ribosomal protein s26e.c7
A_1_26 | 1956.182 | 512.92 | 1443.262 | 3150.716 | 1779.913 | 1370.803 | 0.949795 | YGL261C | strong similarity to members of the srp1/tip1 fa
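The ratio column can be reproduced from the raw intensities. The sketch below uses the numbers of the first spot (YAL003W) from the table; the function name is hypothetical.

```python
def background_corrected_ratio(red, red_bg, green, green_bg):
    """(Red - red background) / (Green - green background) for one spot."""
    return (red - red_bg) / (green - green_bg)


# First spot of the table: YAL003W.
ratio = background_corrected_ratio(
    red=50953.13, red_bg=1779.913, green=59358.75, green_bg=512.92
)
print(round(ratio, 4))  # 0.8356, in line with the tabulated 0.835628
```

A spot whose background-corrected green intensity is zero or negative would need to be flagged before this division; the sketch leaves that case out.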

Data Normalization

Normalize data to correct for variances
  Dye bias
  Location bias
  Intensity bias
  Pin bias
  Slide bias

Control vs. non-control spots

(Figure: uncalibrated scan, red light under-detected; calibrated scan, red and green equally detected.)

Assumptions
  Overall mean ratio should be 1
  Most genes are not differentially expressed
  Total intensity of the two dyes is equivalent
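Under the first assumption, a simple global normalization rescales one channel so that the average ratio becomes 1. The sketch below does this on the log scale (mean log2 ratio of 0, i.e. a geometric-mean ratio of 1), which is a choice made for this example rather than something the slides specify. Names and intensities are hypothetical.

```python
import math


def global_normalize(red, green):
    """Rescale red so that the mean log2(R/G) over all spots becomes 0.

    Working on the log scale is an assumption of this sketch.
    """
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = sum(log_ratios) / len(log_ratios)
    factor = 2 ** shift
    return [r / factor for r in red]


# Hypothetical background-corrected intensities; red is systematically brighter.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
print(global_normalize(red, green))
```

A single scaling factor cannot remove biases that change with intensity or with position on the slide, which is why the later slides turn to lowess-based fits.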

Intensity Dependent Normalization

(Figure: After Normalization.)

Additional Normalization

Pin dependent
  Similar to the intensity dependent fit
  Compute individual lowess fits for each pin group
  After pin dependent normalization, log ratios for each pin are centered around 0
  Scale variance for each pin

Within slide normalization
  Uses MAD (median absolute deviation)

Dye swap
  Combine relative expression levels without explicit normalization
  Compute lowess fit for log2(RR'/GG')/2 vs. (A + A')/2
  (primes denote the dye-swapped slide; A = log2(RG)/2 is the mean log intensity)
  Normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction
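The correction log2(R/G) - c(A) can be sketched in plain Python. The fit below is a basic lowess (tricube-weighted local linear regression) without the robustness iterations of the full algorithm, so it is a simplified stand-in and not the exact procedure behind the slides. The `frac` parameter and all function names are assumptions of this example.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Basic lowess: tricube-weighted local linear fit at every x.

    Simplified stand-in: no robustness iterations.
    """
    n = len(x)
    k = max(2, math.ceil(frac * n))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize_intensity_dependent(red, green, frac=0.5):
    """Return log2(R/G) - c(A), with c(A) the lowess prediction."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # log ratio M
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # mean log intensity A
    c = lowess_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]
```

For pin dependent normalization the same function would be applied separately to the spots of each pin group. In practice a library implementation with robustness iterations would usually be preferred over this sketch.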


Data analysis

Data filtering
Fold change analysis
Classification
Clustering
Future direction

Microarray Data Classification

(Figure: microarray chips are scanned by laser into images, the images are converted into per-gene values, and data mining and analysis produce a prediction for a new sample.)

Gene | Value
D26528_at | 193
D26561_cds1_at | -70
D26561_cds2_at | 144
D26561_cds3_at | 33
D26579_at | 318
D26598_at | 1764
D26599_at | 1537
D26600_at | 1204
D28114_at | 707

Datasets

Class | Sno | D26528 | D63874 | D63880
ALL | 2 | 193 | 4157 | 556
ALL | 3 | 129 | 11557 | 476
ALL | 4 | 44 | 12125 | 498
ALL | 5 | 218 | 8484 | 1211
AML | 51 | 109 | 3537 | 131
AML | 52 | 106 | 4578 | 94
AML | 53 | 211 | 2431 | 209

The Threshold of Spots

Filtering: remove genes with insufficient variation

Remove poor spots: saturated, non-uniform, background too high
Remove genes with too little variation, e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
Statistical filtering (e.g. p-value < 0.01)
Reasons: biological, and feature reduction for the algorithms
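The variation rule can be written as a small predicate. The thresholds 500 and 5 come from the slide; treating the fold ratio as infinite when the minimum is zero or negative is an assumption added here so that the division is always defined. The function name and sample values are hypothetical.

```python
def has_insufficient_variation(values, min_diff=500, min_fold=5):
    """True if MaxVal - MinVal < min_diff and MaxVal / MinVal < min_fold."""
    lo, hi = min(values), max(values)
    # Assumption: with a non-positive minimum the fold ratio is undefined,
    # so it is treated as infinite and the gene is kept.
    fold = hi / lo if lo > 0 else float("inf")
    return (hi - lo) < min_diff and fold < min_fold


# Hypothetical expression values of two genes across samples.
genes = {
    "gene_flat": [100, 150, 300],
    "gene_variable": [100, 150, 900],
}
kept = {name: v for name, v in genes.items() if not has_insufficient_variation(v)}
print(sorted(kept))  # ['gene_variable']
```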

Microarray Data Analysis Types

Differential gene expression (Supervised)
  Fold change analysis

Classification
  identify disease
  predict outcome / select best treatment

Clustering (Unsupervised)
  find new biological classes / refine existing ones
  exploration

Differential Gene Expression

n-fold change
  n typically >= 2
  May hold no biological relevance
  Often too restrictive

2σ expression
  Calculate the standard deviation σ
  Genes with expression more than 2σ away from the mean are differentially expressed
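Both criteria are easy to express in code. In the sketch below the n-fold rule is applied symmetrically (up or down by a factor of n), and the standard-deviation rule is applied to log2 ratios around their mean; the log scale and the use of the sample standard deviation are assumptions of this example, as are the function names and values.

```python
import math
from statistics import mean, stdev


def fold_change_calls(ratios, n=2.0):
    """Flag ratios that are at least n-fold up or n-fold down."""
    return [r >= n or r <= 1.0 / n for r in ratios]


def two_sigma_calls(ratios):
    """Flag ratios whose log2 value lies more than 2 standard deviations from the mean."""
    logs = [math.log2(r) for r in ratios]
    mu, sigma = mean(logs), stdev(logs)
    return [abs(v - mu) > 2 * sigma for v in logs]


# Hypothetical expression ratios (sample / control).
print(fold_change_calls([1.0, 2.5, 0.4, 1.5]))  # [False, True, True, False]
print(two_sigma_calls([1.0] * 9 + [16.0]))
```

The fixed n-fold cutoff ignores how noisy the data are, while the 2σ rule adapts to the spread of the ratios on a given array.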

Fold Changes: Scatter Plot

(Figure: log-scale scatter plot of 72 (raw) on the vertical axis against 72 (control) on the horizontal axis.)

Fold Changes Table

Columns: Description | GenBank accession No. | Fold change at 6 h | 24 h | 48 h | 72 h

Group 1
  caspase 10, apoptosis-related cysteine protease (U60519)
  CASP8 and FADD-like apoptosis regulator (U97075)
  nucleoside diphosphate kinase type 6 (inhibitor of p53-induced apoptosis-alpha) (AF051941)
Group 2
  caspase 3, apoptosis-related cysteine protease
  CASP8 and FADD-like apoptosis regulator
Group 3
  caspase 9, apoptosis-related cysteine protease
Group 4
  caspase 4, apoptosis-related cysteine protease
Group 5
  inhibitor of apoptosis protein
  caspase 7, apoptosis-related cysteine protease
  caspase 4, apoptosis-related cysteine protease
Group 6

Accession numbers and fold changes in source order: AAF19819 U67319 U28976 5.249 2.19 2.603 Z48810 2.615 2.796 2.819 U60521 2.519 U13738 AF005775 2.301 2.272 0.471 0.355 0.376

CASP8 and FADD-like apoptosis regulator (AF015450): 6.912

Classification: Multi-Class

Similar approach:
  select top genes most correlated to each class
  select best subset using cross-validation
  build a single model separating all classes
Advanced:
  build separate model for each class vs. rest
  choose model making the strongest prediction

Popular Classification Methods

Decision Trees/Rules
  find smallest gene sets, but also false positives
Neural Nets
  work well if number of genes is reduced
SVM
  good accuracy, does its own gene selection, hard to understand
K-nearest neighbor
  robust for small number of genes
Bayesian nets
  simple, robust

Multi-class Data Example

Brain data, Pomeroy et al. 2002, Nature (415), Jan 2002
  42 examples, about 7,000 genes, 5 classes
Selected top 100 genes most correlated to each class
Selected best subset by testing 1, 2, ..., 20 gene subsets, with leave-one-out cross-validation for each
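Leave-one-out cross-validation can be sketched with a 1-nearest-neighbor classifier. The classifier choice, the function names, and the toy data are assumptions made for illustration; the slides do not say which model was evaluated on the brain data.

```python
import math


def nearest_neighbor_label(sample, train_x, train_y):
    """Label of the training sample closest to `sample` (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(sample, train_x[i]))
    return train_y[best]


def leave_one_out_accuracy(x, y):
    """Hold out each sample once, classify it from the rest, return the hit rate."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_neighbor_label(x[i], train_x, train_y) == y[i]:
            correct += 1
    return correct / len(x)


# Hypothetical samples described by two selected genes each.
samples = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels = ["A", "A", "B", "B"]
print(leave_one_out_accuracy(samples, labels))  # 1.0
```

To compare gene subsets of size 1 to 20, this accuracy would be computed once per subset and the best-scoring subset kept. For an unbiased estimate, the gene selection itself should be repeated inside each held-out fold.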

Classification: Other Applications

Combining clinical and genetic data
Outcome / treatment prediction
  Age, sex, and stage of disease are useful
  e.g. if the data are from a male, the disease is not ovarian cancer

Clustering

Goals
  Find natural classes in the data
  Identify new classes / gene correlations
  Refine existing taxonomies
  Support biological analysis / discovery
Different methods
  Hierarchical clustering, SOMs, etc.

SOM clustering

SOM: self-organizing maps
Preprocessing
  filter away genes with insufficient biological variation
  normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
Run SOM for many iterations
Plot the results
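The per-gene normalization step amounts to a z-score across samples. A minimal sketch follows; the use of the population standard deviation and the function name are assumptions of this example.

```python
from statistics import mean, pstdev


def standardize_gene(expression):
    """Scale one gene's values across samples to mean 0 and st. dev. 1."""
    mu = mean(expression)
    sd = pstdev(expression)  # population st. dev.; an assumption of this sketch
    if sd == 0:
        raise ValueError("gene has no variation and should have been filtered out")
    return [(v - mu) / sd for v in expression]


# Hypothetical expression values of one gene in five samples.
print(standardize_gene([120.0, 340.0, 200.0, 560.0, 280.0]))
```

After this step, clustering groups genes by the shape of their expression profile rather than by absolute expression level.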

SOM & K-Means by GeneSpring

Hierarchical Clustering

The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method, which works with the data in a bottom-up manner.

Initially, each data point forms its own cluster. The algorithm then works through the cluster sets by repeatedly merging the two clusters that are most similar or have the shortest distance.

The algorithm requires computing the distance or similarity matrix, which has O(N^2) complexity and thus is not very efficient.
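The bottom-up procedure can be sketched naively as below. Single linkage (shortest distance between any two members) is chosen here only for illustration, since the slide does not fix a linkage rule. This naive version recomputes distances in every round instead of caching a distance matrix, so it is slower than practical implementations. The function name and points are hypothetical.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: shortest distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical expression profiles reduced to two dimensions.
profiles = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(profiles, 3))  # [[0, 1], [2, 3], [4]]
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, yields the tree that is usually drawn next to the expression map.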

Hierarchical clustering

(Figure: "Genomic Reprogramming in Response to Oxidant". Cluster diagram of 6218 genes at 0, 10, 20, 40, 60, and 120 minutes; one-third of genome expression is transiently reprogrammed. Color scale: fold repression >9, >6, >3; 1:1; fold induction >3, >6, >9.)

Future directions

Algorithms optimized for small samples (the no. of samples will remain small for many tasks)
Integration with other data
  biological networks
  medical text
  protein data
Cost-sensitive classification algorithms
  error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.

Integrate biological knowledge when analyzing microarray data (from Cheng Li, Harvard SPH)

Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25


Microarray Potential Applications

Biological discovery
  new and better molecular diagnostics
  new molecular targets for therapy
  finding and refining biological pathways
Mutation and polymorphism detection

Recent examples
  molecular diagnosis of leukemia, breast cancer, ...
  appropriate treatment for genetic signature
  potential new drug targets

Microarray Limitations

Cross-hybridization of sequences with high identity
Chip-to-chip variation
True measure of abundance?
Do mRNA levels reflect protein levels?

Generally, microarrays do not prove new biology. They simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.

What fold change has biological relevance?
Need cloned EST or some sequence knowledge; rare messages may go undetected
Expensive! Not every lab can afford to repeat experiments.
The real limitation is bioinformatics

Additional Information

Review papers on microarray
  Genomics, gene expression and DNA arrays (Nature, June 2000)
  Microarray - technology review (Nature Cell Biology, Aug. 2001)
  Magic of Microarray (Scientific American, Feb. 2002)

Molecular biology tutorial
  http://www.lsic.ucla.edu/ls3/tutorials/

Biological data retrieval systems: Entrez
  http://www.ncbi.nlm.nih.gov/Database/index.html

1. A retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
  PubMed: the biomedical literature (Medline)
  GenBank: nucleotide sequence database
  Protein sequence database
  Structure: three-dimensional macromolecular structures
  Genome: complete genome assemblies
  PopSet: population study data sets
  OMIM: Online Mendelian Inheritance in Man
  Taxonomy: organisms in GenBank
  Books: online books
  ProbeSet: gene expression and microarray datasets
  3D Domains: domains from Entrez Structure
  UniSTS: markers and mapping data
  SNP: single nucleotide polymorphisms
  CDD: conserved domains

2. Entrez allows users to perform various searches.
