DNA Microarray
Data Analysis
Editors
Jarno Tuimala
M. Minna Laine
All rights reserved. The PDF version of this book or parts of it can be
used in Finnish universities as course material, provided that this copyright
notice is included. However, this publication may not be sold or included
as part of other publications without permission of the publisher.
© The authors and
CSC – Scientific Computing Ltd.
2003
ISBN 952-9821-89-1
http://www.csc.fi/oppaat/siru/
Printed at
Picaset Oy
Helsinki 2003
Preface
This is the first edition of the DNA microarray data analysis guidebook. Although
invented in the mid-90s, DNA microarrays are still novelties as biomedical research
tools. DNA microarrays generate large amounts of numerical data, which should
be analyzed effectively.
In this book, we hope to offer a broad view of basic theory and techniques
behind the DNA microarray data analysis. Our aim was not to be comprehensive,
but rather to cover the basics, which are unlikely to change much over the years. We
hope that especially researchers starting their data analysis can benefit from the
book.
The text emphasizes gene expression analysis. Topics such as genotyping are discussed only briefly. This book does not cover the wet-lab practices, such as sample preparation or hybridization. Rather, we start when the microarrays have been
scanned, and the resulting images analyzed. In other words, we take the files with
signal intensities, which usually generate questions such as: “How is the data nor-
malized?” or “How do I identify the genes which are upregulated?”. We provide
some simple solutions to these specific questions and many others.
Each chapter has a section on suggested reading, which introduces some of
the relevant literature. Several chapters also include data analysis examples using
GeneSpring software.
This edition of the book was written by M. Minna Laine (chapters 4, 8 and
14), Tomi Pasanen (chapter 11), Janna Saarela (chapters 2 and 3), Ilana Saarikko
(chapter 8), Teemu Toivanen (chapter 14), Martti Tolvanen (chapter 12), Jarno Tu-
imala (chapters 4, 6, 7, 8, 9, 10, 13 and 15), Mauno Vihinen (chapters 10, 11 and
12), and Garry Wong (chapters 1 and 5).
Juha Haataja and Leena Jukka are warmly acknowledged for their support
during the production of this book.
We are very interested in receiving feedback about this publication. In particular,
if you feel that some essential technique has been missed, let us know. Please send
your comments to the e-mail address [email protected].
The authors
List of Contributors
Teemu Toivanen
Centre for Biotechnology
Tykistökatu 6
20521 Turku
Finland
Contents
Preface 5
List of Contributors 6
I Introduction 14
1 Introduction 15
1.1 Why perform microarray experiments? . . . . . . . . . . . . . 15
1.2 What is a microarray? . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Microarray production . . . . . . . . . . . . . . . . . . . . . 16
1.4 Where can I obtain microarrays? . . . . . . . . . . . . . . . . 17
1.5 Extracting and labeling the RNA sample . . . . . . . . . . . . 19
1.6 RNA extraction from scarce tissue samples . . . . . . . . . . . 19
1.7 Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Scanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.9 Typical research applications of microarrays . . . . . . . . . . 21
1.10 Experimental design and controls . . . . . . . . . . . . . . . . 22
1.11 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Genotyping systems 31
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Genotype calls . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Experimental design 38
5.1 Why do we need to consider experimental design? . . . . . . . 38
5.2 Choosing and using controls . . . . . . . . . . . . . . . . . . 38
5.3 Choosing and using replicates . . . . . . . . . . . . . . . . . . 39
5.4 Choosing a technology platform . . . . . . . . . . . . . . . . 39
5.5 Gene clustering v. gene classification . . . . . . . . . . . . . . 40
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.7 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 41
6 Basic statistics 42
6.1 Why statistics are needed . . . . . . . . . . . . . . . . . . . . 42
6.2 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.2 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.3 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.4 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Simple statistics . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.1 Number of subjects . . . . . . . . . . . . . . . . . . . . . 43
6.3.2 Mean (m) . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.3 Trimmed mean . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.4 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.5 Percentile . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3.6 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3.7 Variance and the standard deviation . . . . . . . . . . . . 44
6.3.8 Coefficient of variation . . . . . . . . . . . . . . . . . . . 44
6.4 Effect statistics . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.4.1 Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.4.2 Correlation (r) . . . . . . . . . . . . . . . . . . . . . . . 45
6.4.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . 46
6.5 Frequency distributions . . . . . . . . . . . . . . . . . . . . . 47
6.5.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . 47
6.5.2 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . 49
6.5.3 Skewed distribution . . . . . . . . . . . . . . . . . . . . . 49
6.5.4 Checking the distribution of the data . . . . . . . . . . . . 50
6.6 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.6.1 Log 2-transformation . . . . . . . . . . . . . . . . . . . . 52
6.7 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.8 Missing values and imputation . . . . . . . . . . . . . . . . . 53
6.9 Statistical testing . . . . . . . . . . . . . . . . . . . . . . . . 54
6.9.1 Basics of statistical testing . . . . . . . . . . . . . . . . . 54
6.9.2 Choosing a test . . . . . . . . . . . . . . . . . . . . . . . 55
6.9.3 Threshold for p-value . . . . . . . . . . . . . . . . . . . 55
6.9.4 Hypothesis pair . . . . . . . . . . . . . . . . . . . . . . . 55
6.9.5 Calculation of test statistic and degrees of freedom . . . . 56
6.9.6 Critical values table . . . . . . . . . . . . . . . . . . . . . 57
6.9.7 Drawing conclusions . . . . . . . . . . . . . . . . . . . . 57
6.9.8 Multiple testing . . . . . . . . . . . . . . . . . . . . . . . 57
6.10 Analysis of variance . . . . . . . . . . . . . . . . . . . . . . . 58
6.10.1 Basics of ANOVA . . . . . . . . . . . . . . . . . . . . . 58
6.10.2 Completely randomized experiment . . . . . . . . . . . . 58
6.11 Statistics using GeneSpring . . . . . . . . . . . . . . . . . . . 60
6.11.1 Simple statistics . . . . . . . . . . . . . . . . . . . . . . 60
6.11.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . 60
6.11.3 Scatter plot and histogram . . . . . . . . . . . . . . . . . 60
6.11.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.11.5 Linear regression . . . . . . . . . . . . . . . . . . . . . . 61
6.11.6 One-sample t-test . . . . . . . . . . . . . . . . . . . . . . 62
6.11.7 Independent samples t-test and ANOVA . . . . . . . . . . 62
6.12 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 64
II Analysis 65
7 Preprocessing of data 66
7.1 Rationale for preprocessing . . . . . . . . . . . . . . . . . . . 66
7.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 Checking the background reading . . . . . . . . . . . . . . . . 68
7.4 Calculation of expression change . . . . . . . . . . . . . . . . 69
7.4.1 Intensity ratio . . . . . . . . . . . . . . . . . . . . . . . . 69
7.4.2 Log ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.4.3 Fold change . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.5 Handling of replicates . . . . . . . . . . . . . . . . . . . . . . 71
7.5.1 Types of replicates . . . . . . . . . . . . . . . . . . . . . 71
7.5.2 Time series . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.5.3 Case-control studies . . . . . . . . . . . . . . . . . . . . 72
7.5.4 Power analysis . . . . . . . . . . . . . . . . . . . . . . . 72
7.5.5 Averaging replicates . . . . . . . . . . . . . . . . . . . . 72
7.6 Checking the quality of replicates . . . . . . . . . . . . . . . . 72
8 Normalization 85
8.1 What is normalization? . . . . . . . . . . . . . . . . . . . . . 85
8.2 Sources of systematic bias . . . . . . . . . . . . . . . . . . . 85
8.2.1 Dye effect . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2.2 Scanner malfunction . . . . . . . . . . . . . . . . . . . . 85
8.2.3 Uneven hybridization . . . . . . . . . . . . . . . . . . . . 86
8.2.4 Printing tip . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.2.5 Plate and reporter effects . . . . . . . . . . . . . . . . . . 86
8.2.6 Batch effect and array design . . . . . . . . . . . . . . . . 87
8.2.7 Experimenter issues . . . . . . . . . . . . . . . . . . . . 87
8.2.8 What might help to track the sources of bias? . . . . . . . 87
8.3 Normalization terminology . . . . . . . . . . . . . . . . . . . 87
8.3.1 Normalization, standardization and centralization . . . . . 88
8.3.2 Per-chip and per-gene normalization . . . . . . . . . . . . 89
8.3.3 Global and local normalization . . . . . . . . . . . . . . . 89
8.4 Performing normalization . . . . . . . . . . . . . . . . . . . . 89
8.4.1 Choice of the method . . . . . . . . . . . . . . . . . . . . 89
14.3 How the data should be presented: the MAGE standard . . . . 147
14.3.1 MAGE-OM . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.3.2 MAGE-ML; an XML-translation of MAGE-OM . . . . . 147
14.3.3 MAGE-STK . . . . . . . . . . . . . . . . . . . . . . . . 148
14.4 Where and how to submit your data . . . . . . . . . . . . . . 148
14.4.1 ArrayExpress and GEO . . . . . . . . . . . . . . . . . . . 148
14.4.2 MIAMExpress . . . . . . . . . . . . . . . . . . . . . . . 148
14.4.3 GEO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
14.4.4 Other options and aspects . . . . . . . . . . . . . . . . . 149
14.5 MIAME-compliant sample attributes in GeneSpring . . . . . 150
14.6 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 150
15 Software issues 152
15.1 Data format conversion problems . . . . . . . . . . . . . . . 152
15.2 A standard file format . . . . . . . . . . . . . . . . . . . . . . 152
15.3 Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 153
15.3.1 Perl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
15.3.2 Awk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
15.3.3 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
15.4 Freeware software packages . . . . . . . . . . . . . . . . . . . 154
15.4.1 Cluster and treeview . . . . . . . . . . . . . . . . . . . . 155
15.4.2 Expression profiler . . . . . . . . . . . . . . . . . . . . . 155
15.4.3 ArrayViewer . . . . . . . . . . . . . . . . . . . . . . . . 155
15.4.4 MAExplorer . . . . . . . . . . . . . . . . . . . . . . . . 155
15.4.5 Bioconductor . . . . . . . . . . . . . . . . . . . . . . . . 155
15.5 Commercial software packages . . . . . . . . . . . . . . . . . 156
15.5.1 VisualGene . . . . . . . . . . . . . . . . . . . . . . . . . 156
15.5.2 GeneSpring . . . . . . . . . . . . . . . . . . . . . . . . . 156
15.5.3 Kensington . . . . . . . . . . . . . . . . . . . . . . . . . 156
15.5.4 J-Express . . . . . . . . . . . . . . . . . . . . . . . . . . 156
15.5.5 Expression Nti . . . . . . . . . . . . . . . . . . . . . . . 157
15.5.6 Rosetta Resolver . . . . . . . . . . . . . . . . . . . . . . 157
15.5.7 Spotfire . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Index 158
Part I
Introduction
1 Introduction
Microarray technologies as a whole provide new tools that transform the way sci-
entific experiments are carried out. The principal advantage of microarray tech-
nologies compared with traditional methods is one of scale. In place of conducting
experiments based on results from one or a few genes, microarrays allow for the
simultaneous interrogation of hundreds or thousands of genes.
1.1 Why perform microarray experiments?
The answers to this question span a wide range from the formally considered, well-constructed hypotheses with elegant supporting arguments to “I can’t think of anything else to do, so let’s do a microarray experiment”. The true motivation for
performing these experiments lies likely somewhere between the two extremes. It
is this combination of generating a scientific hypothesis (elegant or not), and at the
same time being able to produce massive amounts of data that has made research
in microarrays so attractive. Nonetheless, the production and use of microarrays is
set with high technical and instrumentation demands. Moreover, the computation
and statistical requirements for dealing with the data can be daunting, especially
to those scientists used to single experiment – single result analysis. So, for those
willing to try this new technology, microarray experiments are performed to answer
a wide range of biological questions to which the answers are to be found in the
realm of hundreds, thousands, or an entire genome of individual genes.
1.2 What is a microarray?
Microarrays are microscope slides that contain an ordered series of samples (DNA,
RNA, protein, tissue). The type of microarray depends upon the material placed
onto the slide: DNA, DNA microarray; RNA, RNA microarray; protein, protein
microarray; tissue, tissue microarray. Since the samples are arranged in an ordered
fashion, data obtained from the microarray can be traced back to any of the samples.
This means that genes on the microarray are addressable. The number of ordered
samples on a microarray can run into the hundreds of thousands. The typical microarray contains several thousand addressable genes.
The most commonly used microarray is the DNA microarray. The DNA printed
or spotted onto the slides can be chemically synthesized long oligonucleotides or
enzymatically generated PCR products. The slides contain chemically reactive
groups (typically aldehydes or primary amines) that help to stabilize the DNA on the slide.
1.3 Microarray production
Printing microarrays is not a trivial task and is both an art and a science. The
job requires considerable expertise in chemistry, engineering, programming, large
project management, and molecular biology. The aim during printing is to produce
reproducible spots with consistent morphology. Early versions of printers were
custom made with a basic design taken from a prototype version. Some were built
from laboratory robots. Current commercial microarray printers are available for
almost every size application and have made the task of printing microarrays feasi-
ble and affordable for many nonspecialized laboratories. The basic printer consists
of: a nonvibrating table surface where the raw glass slides are placed, a moving
head in the x-y-z plane that contains the pins or pens to transfer the samples onto
the array, a wash station to clean the pins/pens between samples, a drying station
for the pins/pens, a place for the samples to be printed, and a computer to con-
trol the operation. Some of these procedures can be automated such as replacing
the samples to be printed, although most complete systems are semi-automated.
Samples to be printed are concentrated and stored in microtitre plates.
The printers are operated in dust-free, temperature and humidity controlled
rooms. Some printer designs have their own self-contained environmental controls.
Printing pen designs have been adapted from ink methods and include quill, ball-
point, ink-jet, and P-ring techniques. The pens can get stuck and need to be cleaned
frequently. Multiple pens placed on a printing head can multiplex the printing
operation and speed up the process. Thousands of samples in duplicate or triplicate
are printed in a single run over perhaps a hundred or more slides; thus, printing
times of several days are common. Since printing times can be long and sample
volumes are small, sample evaporation is a major concern. As a result, hygroscopic
printing buffers, often containing DMSO, have been developed and are highly useful
to alleviate evaporation. A typical printer design is shown in Figure 1.1.
problem. These procedures generally involve either PCR amplification of the cD-
NAs made from the original RNAs, or production of more RNA from the original
RNA sample by hybridization of a T7 or T3 promoter followed by RNA synthesis
with RNA polymerase. As usual for any amplification procedure, proper controls
and interpretation of the results need to be considered.
A related issue in isolating tissues for microarray studies is the dissection of
small populations of cells or even single cells. Sophisticated instruments have been
developed for this application and many are commercially available. These laser-
assisted microdissection machines, while expensive, are nonetheless fairly straight-
forward to use and provide a convenient method for obtaining pure cell samples.
1.7 Hybridization
Conditions for hybridizing fluorescently labeled DNAs onto microarrays are re-
markably similar to hybridizations for other molecular biology applications. Gener-
ally the hybridization solution contains salt in the form of buffered saline sodium
citrate (SSC), a detergent such as sodium dodecyl sulphate (SDS), and nonspecific
DNA such as yeast tRNA, salmon sperm DNA, and/or repetitive DNA such as hu-
man Cot-1. Other nonspecific blocking reagents used in hybridization reactions
include bovine serum albumin or Denhardt’s reagent. Lastly, the hybridization so-
lution should contain the labeled cDNAs produced from the different RNA popula-
tions.
Hybridization temperatures vary depending upon the buffers used, but gen-
erally are performed at approximately 15–20 °C below the melting temperature, which is 42–45 °C for PCR products in 4X SSC and 42–50 °C for long oligos. Hybridization volumes vary widely, from 20 µl to several millilitres. For small hybridis-
ation volumes, hydrophobic cover slips are used. For larger volumes, hybridization
chambers can be used.
Hybridization chambers are necessary to keep the temperature constant and
the hybridization solution from evaporating. Hybridization chambers vary substan-
tially from the most expensive high-tech automated instruments to empty pipette
boxes with a few wet paper towels inserted. The range of solutions for providing
a thermally stable, humidified environment for a microscope slide is virtually
unlimited. Some might even consider a sauna as a potential chamber. In small
volumes, the hybridization kinetics are rapid so a few hours can yield reproducible
results, although overnight hybridizations are more common.
1.8 Scanning
Following hybridization, microarrays are washed for several minutes in decreas-
ing salt buffers and finally dried, either by centrifugation of the slide, or a rinse
in isopropanol followed by quick drying with nitrogen gas or filtered air. Fluores-
cently labeled microarrays can then be “read” with commercially available scan-
ners. Most microarray scanners are basically scanning confocal microscopes with
lasers exciting at wavelengths specific for Cy3 and Cy5, the typical dyes used
in experiments. The scanner excites the fluorescent dyes present at each spot on the
microarray and the dye then emits at a characteristic wavelength that is captured
in a photomultiplier tube. The amount of signal emitted is directly proportional
to the amount of dye at the spot on the microarray and these values are obtained
and quantitated on the scanner. A reconstruction of the signals from each location
on the microarray is then produced. For cDNA microarrays one intensity value is
generated for the Cy3 and another for the Cy5. Hence, cDNA microarrays pro-
duce two-color data. Affymetrix chips produce one-color data, because only one
mRNA sample is hybridized to every chip (see chapter 2). When both dyes are
reconstructed together, a composite image is generated. This image produces the
typical microarray picture.
of action of the drug, prediction of toxicologic properties, and new drug targets.
One of the most exciting areas of application is the diagnosis of clinically
relevant diseases. The oncology field has been especially active and to an extent
successful in using microarrays to differentiate between cancer cell types. The
ability to identify cancer cells based on gene expression represents a novel method-
ology that has real benefits. In difficult cases where a morphological or an antigen
marker is not available or reliable enough to distinguish cancer cell types, gene
expression profiling using microarrays can be extremely valuable. Programs to
predict clinical outcome and to design individual therapies based on expression
profiling results are well underway.
A very recent application of microarrays has been to perform comparative
genomic analysis. Genome projects are producing sequences on a massive level,
yet sufficient resources still do not exist to sequence every organism that
seems interesting or worthy of the effort. Therefore, microarrays have been used as
a shortcut to both characterize the genes within an organism (structural genomics)
and also to determine whether those genes are expressed in a similar way to a
reference organism (functional genomics). A good example of this is in the species
Oryza sativa (rice). Microarrays based on rice sequences can be used to hybridize
cDNAs derived from other plant species such as corn or barley. The genome sizes
in the latter are simply too large for whole genome projects, so hybridization with
microarrays to rice genes presents an agile way to address this question.
Single nucleotide polymorphism (SNP) microarrays are designed to detect the
presence of single nucleotide differences between genomic samples. SNPs occur at
frequencies of approximately 1 in 1000 bases in humans and underlie the genomic
differences between individuals. Mapping and obtaining frequencies of identified
SNPs should provide a genetic basis for identifying disease genes, predicting ef-
fects of the environment as well as responses to therapeutic agents. Minisequenc-
ing, primer extension, and differential hybridization methods have been developed
on the microarray platform with all the advantages of expression arrays: high
throughput, reproducibility, economy, and speed.
Indeed, the use of microarrays to determine whether a gene is present and
whether it goes up or down under certain conditions will continue to spawn even
more applications that now depend only upon the imagination of the microarray
researcher.
interpretable with a minimum number of confounders. Indeed, probably the only dif-
ference between good experimental design in microarray and other experiments is
that the time budgeted for data analysis seems always to be underestimated. As a
final suggestion, attention to the mundane but critical statistical and data analysis
elements of a microarray experiment will greatly increase your ratio of joy to pain
at the end of your microarray journey.
10. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative
monitoring of gene expression patterns with a complementary DNA microar-
ray. Science 270, 467-470.
2 Affymetrix Genechip system
2.1 Affymetrix technology
There are two principal differences between the Affymetrix Genechip system and
“traditional” cDNA microarrays in studying gene expression. First, instead of
hybridizing two RNAs labeled with different fluorophores competitively on one
cDNA microarray, a single RNA is hybridized on the array in the Affymetrix sys-
tem, and the comparisons are then made computationally. Second, in the Affymetrix
arrays each gene is represented as a probe set of 10–25 oligonucleotide pairs instead
of one full length or partial cDNA clone. The oligonucleotide pair (probe pair)
consists of one oligonucleotide perfectly matching the gene sequence (Perfect Match, PM) and a second oligonucleotide having a one-nucleotide mismatch in the middle of it (Mismatch, MM). Probes are designed within 500 base pairs of the 3’
end of each gene to hybridize uniquely in the same, predetermined hybridization
conditions. Some housekeeping genes are represented as three probe sets, one set
designed to the 5’ end of the gene, a second set to the middle of the gene, and the third to the 3’ end. In addition to species-specific genes, some spiked-in control probe sets
are introduced to facilitate the controlling of the hybridization. In the experiment,
biotin-labeled RNA is hybridized to the array, which is stained with phycoerythrin-
conjugated streptavidin after washing and scanned with the Gene Array Scanner. A
grid is automatically laid over the array image and the intensities of each probe pair
are used to make expression measurements with the Affymetrix Microarray Suite version 5 software.
This analysis generates qualitative and quantitative values from one gene expres-
sion experiment and provides initial data required to perform comparisons between
experiments. A qualitative value, a Detection call, indicates whether a transcript is
reliably detected (Present) or not detected (Absent) in the array. The Detection call
is determined by comparing the Detection p-value generated in the analysis against
user-defined cut-offs. A quantitative value, a signal, assigns a relative measure of
abundance to the transcript.
• If the Mismatch probe cells are generally informative across the probe set
except for a few Mismatches, an adjusted Mismatch value is used for unin-
formative Mismatches based on the bi-weight mean of the Perfect match and
the Mismatch ratio.
• Scaling Factor is close to one (using current scanner settings and Target In-
tensity 100) or at least at the same level for those probe arrays you are plan-
ning to compare with each other.
• All positive hybridization controls (BioB, BioC, BioD, CreX) are present.
• Signal ratios of 3’, middle and 5’ probe sets (GAPDH, Beta-actin) are close
to one.
2.8 Normalization
Before a comparison of two probe arrays can be made, variations between the two
experiments caused by technical and biological factors must be corrected by scal-
ing or normalization. Main sources of technical variation in an array experiment
are quality and quantity of the labeled RNA hybridized as well as differences in
reagents, stain, and chip handling. Biological variation, though irrelevant to the study
question, may arise from differences in genetic background, growth conditions,
time, sex, age, etc. Either scaling or normalization should be used to minimize this
variation.
Scaling (recommended by Affymetrix) and normalization can be made using
either data from user-selected probe sets or all probe sets. When using selected
probe sets (for example a group of housekeeping genes) for scaling/normalization,
it is important to have a priori knowledge of gene expression profiles of selected
genes. Thus in most cases less bias is introduced to the experiment when the data
of all probe sets is used for scaling/normalization. When scaling is applied, the
intensity of the probe sets from the experimental and baseline arrays is scaled to the same,
user-defined target intensity (recommendation using current scanner setting is 100).
If normalization is applied, the intensity of the probe sets from experimental array
is normalized to the intensity of the probe sets on the baseline array.
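As an illustration of the scaling idea only, each array can be brought to a common target intensity with a trimmed mean, for example in R. The data below are simulated; this sketches the general principle, not the exact Affymetrix procedure:

```
# Illustrative per-array scaling to a target intensity of 100.
# Simulated data; not the exact Affymetrix algorithm.
set.seed(1)
intensities <- matrix(rexp(3000, rate = 0.01), ncol = 3)  # 3 mock arrays

target <- 100
trimmed.means <- apply(intensities, 2, mean, trim = 0.02)
scaling.factors <- target / trimmed.means   # ideally close to one
scaled <- sweep(intensities, 2, scaling.factors, `*`)

round(scaling.factors, 3)
round(apply(scaled, 2, mean, trim = 0.02), 1)  # all arrays now at 100
```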
An additional normalization factor, Robust Normalization, is also introduced
to the data. It accounts for probe set characteristics resulting from sequence-related
factors, such as affinity of the probe set to the RNA and linearity of the hybridiza-
tion of each probe pair. More specifically, this factor corrects for the inevitable
error of using an average intensity of all the probes (or selected probes) on the
array as a normalization factor for every probe set. Robust normalization of the
probe set is calculated once the initial scaling/normalization factor is determined,
by calculating a slightly higher and a slightly lower factor than the original. A user-modified parameter, Perturbation, which can vary between 1.00 (no perturbation)
and 1.49, defines the range by which the normalization factor is adjusted up and
down. The perturbation value directly affects the subsequent p-value calculation,
since of the p-values resulting from applying the normalization factor and its two
perturbed variants, the most conservative is used to evaluate whether any change
in level is justified by the data. Increasing the perturbation value can reduce the
number of false changes, but may also block the detection of some true changes.
Values close to 0.0 indicate a probability for an increase in gene expression level in the experiment array compared to baseline, whereas values close to 1.0 indicate a likelihood for a decrease in gene expression level.
Figure 2.1: A representation of p-values in a data set. The Y-axis is the probe set signal and the X-axis is the p-value. The values γ1L, γ2L, γ1H, and γ2H describe user-adjustable cut-offs used in making the Change Call (I = Increased, MI = Marginal Increase, MD = Marginal Decrease, D = Decrease).
A Signal Log Ratio of 1.0 corresponds to a 2-fold increase in the expression level, and −1.0 to a 2-fold decrease. No change in the expression level is thus indicated by a Signal Log Ratio value of 0.
Tukey’s Biweight method also gives an estimate of the amount of variation in the
data. Confidence intervals are generated from the scale of variation of the data.
A 95% confidence interval shows a range of values, which will include the true
value 95% of the time. A small confidence interval implies that the expression data
is more exact, while large confidence intervals reflect more noise and uncertainty
in estimating the true level. Since the confidence intervals attached to Signal Log
Ratios are computed from variation between probes, they may not reflect the full
width of experimental variation.
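As a small worked example, the relation between the Signal Log Ratio and the fold change can be computed directly; the values below are hypothetical:

```
# Signal Log Ratio (SLR) to fold change: fold = 2^SLR.
# SLR 1.0 -> 2-fold increase, SLR -1.0 -> 2-fold decrease (0.5-fold),
# SLR 0 -> no change. Hypothetical values:
slr <- c(1.0, -1.0, 0, 2.5)
fold <- 2^slr
data.frame(slr, fold)
```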
3 Genotyping systems
3.1 Introduction
3.2 Methodologies
For the single base extension reaction, multiple genomic DNA-regions flanking
SNP sites are amplified by PCR using SNP-specific primers (Figure 3.1:A). Af-
ter enzymatic removal of the excess primers and nucleotides, single base extension reactions are carried out with detection primers, each containing a sequence comple-
mentary to the predefined TAG sequence spotted on the array and the SNP-specific
sequence just preceding the variation. An allele-defining nucleotide, labeled with a different fluorescent dye for each allele, is added to the detection primer in the SBE reaction.
Finally, detection primers are hybridized to the TAG sequences spotted on an array.
Genotypes are determined by comparing the signals of two possible nucleotides
labeled with different fluorescent dyes on each spot.
For the second method, allele-specific primer extension, two amino-modified
detection primers, each containing one of the variable nucleotides of the SNP as
their 3’ nucleotide, are spotted and covalently linked to chemically activated mi-
croscope slides (Figure 3.1:B). Genomic DNA flanking the SNP is amplified with
SNP-specific primers containing T7 or T3 RNA polymerase recognition sequence
in their 5’ end. Multiple SNPs can be amplified in one multiplex PCR reaction
simultaneously. By using T7 (or T3) RNA polymerase, PCR products are subse-
quently transcribed to RNA, which is then hybridized to the SNP array containing
the two detection primers for each SNP. Reverse transcriptase enzyme and fluo-
rescently labeled nucleotides are then employed to visualize the sequence-specific
extension reaction of the allele-defining detection primer(s). Up to 80 different
samples can be analyzed on one slide when as many identical subarrays of de-
tection primers are spotted on a slide, which is partitioned into small hybridization
chambers. Genotypes are determined by comparing the signals of the two detection
primers representing the SNP on the array.
Figure 3.1: Principles of the two SNP genotyping methods. A. Single base primer exten-
sion followed by TAG-array hybridization. B. Allele-specific primer extension on array.
Traffic lights at the bottom of the figures describe three possible genotypes with each
method.
3.3 Genotype calls
To make the genotype call, the signals of the two possible variants are compared. In the case of the SBE reaction, the intensities of the two different fluorophores in a spot representing the SNP are measured. In the case of the allele-specific primer extension
method, each SNP is represented by two spots, both putatively labeled (with the
same fluorophore). The intensities of the two spots are measured and compared
to each other to determine the genotype. The same methods can be used for both
comparisons. One method is to calculate the proportion of the signal of one allele out of the total signal intensity:

P = signal of allele 1 / (signal of allele 1 + signal of allele 2)
All the values are between zero and one (Figure 3.2). Values close to one represent the homozygote allele 1 genotype, and those close to zero represent the other homozygote genotype (allele 2). Those in between (close to 0.5) represent the heterozygote genotypes. A scatter plot can be formed with the values of P on the X-axis and the logarithm of the total intensity on the Y-axis. Three groups of signals should show clearly, each representing one of the possible genotypes.
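A minimal sketch of this calculation and of the scatter plot, for example in R; the allele signals below are simulated and all variable names are illustrative:

```
# Simulated two-allele signals for 300 samples (illustrative only).
set.seed(2)
genotype <- sample(c("AA", "AB", "BB"), 300, replace = TRUE)
total <- runif(300, 2000, 30000)
allele1 <- total * c(AA = 0.95, AB = 0.5, BB = 0.05)[genotype]
allele2 <- total - allele1

# Proportion of allele 1 signal out of the total signal:
P <- allele1 / (allele1 + allele2)

# P near 1: homozygote allele 1; near 0: homozygote allele 2;
# near 0.5: heterozygote. Three groups should separate clearly:
plot(P, log10(allele1 + allele2),
     xlab = "P", ylab = "log10(total intensity)")
```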
2. Guo, Z., Guilfoyle, R. A., Thiel, A. J., Wang, R., and Smith, L. M. (1994)
Direct fluorescence analysis of genetic polymorphisms by hybridization with
oligonucleotide arrays on glass supports. Nucleic Acids Res. 22, 5456-5465.
3. Pastinen, T., Raitio, M., Lindroos, K., Tainola, P., Peltonen, L., Syvanen,
A-C. (2000) A system for specific, high-throughput genotyping by allele-
specific primer extension on microarrays. Genome Res. 10, 1031-1042.
4 Overview of data analysis
We start data analysis from the results of scanned images. At this point, images
have been evaluated, bad spots have been investigated and the spots have preferably
been scored with flags indicating whether the spot was good, bad, or borderline.
This is crucial, because in the later stages of the analysis the visual inspection of
individual spots is not possible.
The next steps in the analysis are preprocessing (Chapter 7), normalization (Chap-
ter 8), and quality control (Chapters 6, 7, and 8). The goal of these analyses is to
organize results in a meaningful order: flags, controls, and experiments are pointed
out and checked. Variation due to systematic errors is removed, and data from
different chips is made comparable.
Statistics are needed at many steps during the analysis. Therefore, the whole of Chapter 6 is dedicated to the basics of statistics. Statistical tools are used for
evaluation of the raw data during the above-mentioned steps, and in the further
analysis to find significantly differentially expressed genes.
In these further analysis steps, statistically significant, quality-checked data is separated from uninteresting and untrustworthy data (Chapter 7). The next step
is to find the differentially expressed genes using statistical tools (Chapter 9), or to
group the good quality data (usually only a small fraction of the original raw data)
into meaningful clusters by e.g. clustering (Chapter 10). The goal of clustering
is to find similarly behaving genes or patterns related to time scale, time point,
developmental phase or treatment of the sample.
At this point, we have already manipulated our data quite a lot, and spent considerable time with the computer. Despite that, and depending on what we are looking
for, we may be at the very beginning of the challenging part of the data analysis.
Next we need to link the observations to biological data, to regulation of genes,
and to annotations of functions and biological processes. This part, data mining, is
described in Chapters 11, 12 and 13.
With an enormous amount of data, we need standardized systems and tools for
data management in order to publish the results in a proper and sound way, as well
as to be able to benefit from other publicly available gene expression data. These
aspects are discussed in Chapter 14. Data file manipulation and analysis tools are
introduced in Chapter 15.
Figure 4.1: Overview of data analysis methods discussed in this book. Note that not all possible orders of analysis are shown.
Method Chapter
Experimental design 5
Basic statistics 6
Background correction 7
Calculation of expression change 7
Quality checking 7
Filtering 7
Normalization 8
Single slide methods 9
t-test 9
ANOVA 6
Clustering 10
Classification 10
PCA 10
Function prediction 10
Gene regulatory networks 11
Promoter sequence analysis 12
Annotations 12, 13
Ontologies 12, 13
Article mining 13
Data management 14
MIAME standard 14
MAGE standard 14
5 Experimental design
5.1 Why do we need to consider experimental design?
Good experimental design, above the myriad of other considerations, will likely
provide you with the greatest amount of satisfaction and least amount of frustra-
tion in executing a microarray project. In the early days of microarray research, a
simple design of control v. treatment and replicates of each would suffice. How-
ever, those days are rapidly falling by the wayside, and 30–80 chip experiments in
a single design are becoming common. While the design depends primarily on
the scientific question that is being proposed (hypothesis), other considerations,
such as appropriate controls, platforms, and statistical issues merit serious thought.
This is to say nothing about costs! While many projects have similar objectives, it
would be foolish for us to design protocols in a guidebook without intimate knowl-
edge of the system and the intricacies involved in the study. Moreover, many fields
have specialized and/or traditional methodologies that require non-typical designs.
Nonetheless, there are several variables in the experimental design that are common
to most studies, and we will try to describe them here. We hope that the consider-
ation of the strengths and weaknesses of each variable on both the biological and
statistical level would help the individual investigator better design his/her own mi-
croarray study. Most of the discussion concerns expression arrays, although places
where SNP arrays are relevant are pointed out.
5.2 Choosing and using controls
The control serves as a reference for an experiment commonly termed the treat-
ment. The treatment can be chemical (drugs, toxins), biological (viruses, microbes,
transgenes) or environmental (stress, irradiation) in nature. The individual treat-
ments can be given in a time series (time course) or at different doses (dose re-
sponse). In all these cases, the control should be matched as closely as possible
genetically. This can mean that the controls are sibs or that the animal used is an
inbred strain, or a combination of the two. Controls with the same environmental
influences are usually dealt with by using litter siblings that have been raised iden-
tically. Physiological matching can be done by taking the same sex, age, and health
status.
Controls for human tissues are problematic since age, cause of death, and post-mortem interval are difficult to match. Moreover, human tissues for research purposes are often stored for unequal intervals, which greatly affects RNA quality.
Because of this potential confounder, many human studies have been done using
human cell lines rather than tissues. Another solution is to match tissue by taking
different regions from the same human tissue, one that is healthy to serve as the
control, and one that carries the disease as the treatment sample.
Controls for mouse studies have a different problem in that some transgenic
mice are crosses between xxx and yyy strains. Therefore, both transgenic and non-
transgenic littermates may have quite different genetic backgrounds. The solution
would be to backcross to one of the parental strains until both the control and trans-
genic mouse have the same genetic background. This is somewhat time consuming
since mice have a reproductive cycle of 2–3 months. Another solution would be
to make sure that the transgenic lines that are produced have a homogeneous back-
ground.
Controls for cell based experiments generally consist of identical cultures
without the physiological, physical, or chemical treatments. The controls may also
include cells derived from other sources such as the equivalent or healthy tissues.
When cells are cocultured, the controls become more complicated. Cultures of
each of the individual constituent cell populations in addition to a coculture could
be used as a control. In practice this means that if a coculture of A and B cells is
made, then the controls will be A alone, B alone, A and B together.
Because of all these potential variables that could affect the microarray results,
it often makes sense to create a design with more than one control within the study.
5.5 Gene clustering v. gene classification
It often makes more sense to identify genes and their related expression levels that
predict the experimental group they are derived from (gene classification) than to
find groups of genes that act in a particular way after treatment (gene clustering).
For classification studies, large numbers of samples for each treatment group are
needed in order to robustly find a set of genes that predicts their original treatment
group. The large number of samples is also necessary to validate the predicted clas-
sification based on training sets derived from the data. In the case of using gene
expression to predict human tumor type as a diagnostic tool, individual variation
from human subjects, and variation in the samples due to which part of the tumor
was dissected, how the tissue was obtained, and the other experimental variables
typically suggest that random variation in the profiling is unavoidable, further in-
dicating that sample sizes should be large.
In contrast, the number of genes profiled need not be large if the genes consistently predict the classification group. Unfortunately, this is not usually the case.
5.6 Conclusions
The experimentalist is likely to know the most about the scientific question and therefore has the greatest input into the experimental design. Nonetheless, con-
sultations with a statistical expert prior to experiments could help to decrease exper-
imental variation derived from the limitations listed above. Recently, two excellent
review articles have become available and are listed at the end of this chapter. In
the end, there are many factors influencing the successful outcome of a microarray
study, and attention to as many as possible will greatly aid in the final outcome.
6 Basic statistics
6.1 Why statistics are needed
One important use of statistics is to summarize a collection of data in a clear and un-
derstandable way. There are two basic methods: numerical and graphical. Graphi-
cal methods are best at identifying patterns in the data. Numerical approaches are
more precise and objective. Since numerical and graphical approaches complement
each other, both are needed.
6.2.2 Constants
Measures of subjects, i.e. variables, which can take on only one value during the
experiment are called constants. For example, if only women are included in a
study, then the variable sex is constant during the study.
6.2.3 Distribution
A distribution describes the scores a variable can take and how often each of them occurs.
A continuous variable can be described with a histogram, which is a graphical
representation of the distribution (Figure 6.5). Distributions are more thoroughly
6.2.4 Errors
Errors are inaccuracies in measurements. In laboratory analyses, there is often an innumerable number of potential sources of error, but only a few are usually of paramount importance.
Errors can be systematic (biases) or random. Systematic errors affect the mea-
surements in such a way that you get a wrong estimate of whatever you are measuring. Systematic errors affect all the subjects similarly. Random errors lack such predictability. For example, one laboratory technician can be biased, if his/her analysis
results are always consistently higher than the results acquired by other techni-
cians. If the same technician drops the tube so that the liquid is spilled on the floor,
but he/she then slurps the liquid back up and transfers it to the tube with a pipet
(a procedure known as linoleum blot), a random error has been created. If all the
tubes are always cast on the floor, that would systematically bias the results.
6.3.1 Number of subjects
Number of subjects (N) and sample size both describe the same thing: how many subjects are included in the study. Often the sample size is fixed before the
experiment is conducted. For example, if 1 000 genes are spotted on the DNA
microarray, then the sample size would be 1 000 genes.
6.3.4 Median
The median is the middle of a distribution: half the scores are above the median and
half are below. The median is less sensitive to extreme scores than the mean and
this makes it a better measure of central tendency for highly skewed distributions.
6.3.5 Percentile
The percentile is a certain cut-off, under which the specified percentage of the ob-
servations lie. The median is the 50th percentile of the distribution. Similarly, the
80th percentile is a point of the distribution under which 80% of the observations
lie.
6.3.6 Range
The range is the difference between the largest and the smallest value of the dis-
tribution. It is the simplest measure of spread, but it is very sensitive to extreme
scores, because it is based only on two values.
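All of the simple statistics above are single function calls in, for example, R; the intensity vector below is hypothetical and contains one extreme score to show how the different measures react to it:

```
# Simple statistics on a hypothetical intensity vector.
x <- c(120, 150, 160, 180, 200, 230, 260, 310, 420, 9800)

length(x)            # number of subjects (N)
mean(x)              # mean, strongly pulled up by the extreme score
mean(x, trim = 0.1)  # trimmed mean: drops 10% of scores from each end
median(x)            # middle of the distribution, robust to the outlier
quantile(x, 0.80)    # 80th percentile
diff(range(x))       # range = largest value - smallest value
var(x); sd(x)        # variance and standard deviation
sd(x) / mean(x)      # coefficient of variation
```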
6.4.2 Correlation (r)
The correlation of two variables represents the degree to which the variables are related. When two variables are perfectly linearly related, the points in the scatter plot fall on a straight line. Correlation measures only the linear relationship. Two summary measures, or correlation coefficients, Pearson’s correlation and Spearman’s rho, are most commonly used. Both of these measures range from a perfectly negative linear relationship to a perfectly positive one, or from -1 to 1.
It is not wrong to calculate the correlation between variables that are not linearly related, but it does not make much sense. If the variables are not linearly
related, the correlation does not describe their relationships effectively, and no con-
clusions can be based on the correlation coefficient only.
Correlation and scatter plot are a good example of how numerical and graphi-
cal tools effectively complement each other.
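For example, in R both correlation coefficients are computed with the same function; the two channel intensities below are simulated:

```
# Pearson and Spearman correlations of simulated channel intensities.
set.seed(3)
green <- rexp(500, rate = 1/5000)
red <- 1.1 * green + rnorm(500, sd = 500)

cor(green, red, method = "pearson")   # linear relationship
cor(green, red, method = "spearman")  # rank-based (Spearman's rho)

# The scatter plot complements the numerical summary:
plot(green, red, xlab = "Green channel intensity",
     ylab = "Red channel intensity")
```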
Figure 6.1: An example of a scatter plot. The red and green channel intensities from
a two-color DNA microarray experiment have been depicted as a scatter plot. Variables
seem to have a linear relationship, because a straight line can be drawn through the data
points. In addition, the linear correlation between these variables is 0.962.
Y = aX + b,

where b indicates where the regression line cuts the vertical axis, and a indicates the slope of the regression line, i.e., the number by which a change in X (the independent variable) must be multiplied to give the corresponding change in Y (the dependent variable). If only two standardized variables are used in the linear regression model, a is equal to the Pearson correlation.
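A minimal sketch of fitting such a regression line in R, on simulated channel intensities:

```
# Linear regression Y = aX + b with lm(); simulated data.
set.seed(4)
x <- runif(100, 0, 30000)            # green channel (independent)
y <- 0.9 * x + rnorm(100, sd = 800)  # red channel (dependent)

plot(x, y, xlab = "Green channel intensity",
     ylab = "Red channel intensity")
fit <- lm(y ~ x)
coef(fit)                 # intercept b and slope a
abline(fit, col = "red")  # draws the fitted line on the scatter plot

# Swapping the dependent variable gives a different line (cf. Figure 6.2):
coef(lm(x ~ y))
```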
Figure 6.2: The two linear regression lines fitted into two-color DNA microarray results.
The results are different if green channel intensity (green line) is used as a dependent
variable instead of red channel intensity (red line).
Figure 6.3: Normal distributions of different shapes (four panels: mean=0, sd=1; mean=5, sd=1; mean=0, sd=2.5; mean=0, sd=0.5). The mean defines the place of the distribution along the horizontal axis, and the standard deviation (sd) defines the shape of the curve. A small standard deviation means a tight distribution, and a large standard deviation a flat one.
in the tails. Normal distributions are sometimes described as bell shaped. The
shape of the normal distribution can be specified mathematically in terms of the
mean and standard deviation (Figure 6.3).
A standard normal distribution is a normal distribution with a mean of 0 and
a standard deviation of 1. Any normal distribution can be transformed to standard
normal distribution by the formula
Z = (X − m) / s,
where X is a score from the original normal distribution, m is the mean of the
original normal distribution, and s is the standard deviation of the original normal
distribution. This mathematical procedure is also called standardization.
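For example, in R the standardization can be done manually or with the built-in scale() function:

```
# Standardization Z = (X - m) / s of simulated scores.
set.seed(5)
X <- rnorm(1000, mean = 5, sd = 2.5)

Z <- (X - mean(X)) / sd(X)  # manual standardization
Z2 <- as.vector(scale(X))   # scale() does the same

c(mean = mean(Z), sd = sd(Z))  # approximately 0 and 1
```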
There are biological and historical reasons for the widespread usage of normal
distribution: Many biologically relevant variables are distributed approximately
normally. Mathematical statisticians also work comfortably with normal distri-
butions, and many kinds of statistical tests can be and are derived from normal
distributions. Some of these tests will be described in the next chapters. We have
already had a peek at one application of normal distribution, linear regression.
Figure 6.4: The mean of the t-distribution defines its place on the x-axis (four panels: a normal distribution with mean=0, sd=1, and t-distributions with df=1, df=10, and df=100). The shape of the distribution is defined by the standard deviation (s) and the degrees of freedom (df). The standard deviation defines the tightness of the distribution, whereas df defines the amount of resemblance to the normal distribution.
6.5.2 t-distribution
Figure 6.5: A typical histogram of the DNA microarray intensity values of one channel
(gmean). The distribution is highly skewed to the right. Median, mean and 68% standard
deviation are depicted as red, green, and blue vertical lines, respectively.
6.5.3 Skewed distribution
A distribution is skewed if one of its tails is longer than the other. Distributions
having a longer tail on the right are positively skewed, and distributions having a
longer tail on the left are negatively skewed. The best way to identify a skewed
distribution is to draw a histogram of the distribution (Figure 6.5). Numerically,
a skewed distribution can be identified by comparing the mean and the median of the distribution. If the mean is larger than the median, the distribution is skewed to the right. For left-skewed distributions the mean is lower than the median.
6.5.4 Checking the distribution of the data
Many statistical tests assume that the data is normally distributed, which means that the histogram drawn from the values of the variable resembles the normal distribution. Even the most basic descriptive statistics can be misleading if the distribution is highly skewed. For example, the standard deviation does not bear a meaningful interpretation if the distribution significantly deviates from normality (Figure 6.5).
Normality of the data is most easily checked from an appropriate histogram. If the distribution is, as judged by eye, approximately symmetric and does not contain more than one peak, it can be assumed to be normally distributed (Figure 6.5 and Figure 7.6). Other graphical possibilities include the density plot, which is basically just a smoothed histogram (Figure 6.6).
Figure 6.6: Some real data compared to a normal distribution. Black, normal distribu-
tion; red, non-transformed data; green, log-transformed data. The log-transformation was
performed because it often makes a highly skewed distribution more normally distributed.
The values have been standardized before plotting.
6 Basic statistics 51
A more formal way to test for the normality of the data is the normal prob-
ability plot. The sample values and the theoretical values assuming normality are
plotted against each other in the normal probability plot. If the points fall on the
line, the distribution is normal (Figure 6.7).
Both the histogram and the normal probability plot are valid methods for checking the normality of the data, but the plot can detect much smaller deviations from normality than the histogram.
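As a sketch in R, all of these checks take only a few lines; the simulated data below mimic right-skewed raw intensities:

```
# Checking the distribution: mean vs. median, histogram, density plot,
# and normal probability (Q-Q) plot, on simulated skewed intensities.
set.seed(6)
gmean <- rexp(2000, rate = 1/1000)

mean(gmean) > median(gmean)  # TRUE suggests skewness to the right

par(mfrow = c(1, 3))
hist(gmean, breaks = 50)
plot(density(gmean))                      # a smoothed histogram
qqnorm(log2(gmean)); qqline(log2(gmean))  # points on the line = normal
par(mfrow = c(1, 1))
```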
Figure 6.7: The normal probability plots. The raw data has been pushed through different
kinds of transformations, and the results are checked for normality. The standardized log-
transformed data was drawn in green in Figure 6.6. According to the plots, none of the data sets is normally distributed, because the data points do not fall on a straight line.
6.6 Transformation
Figure 6.8: A scatter plot where one probable outlier has been marked with a red dot.
6.7 Outliers
Outliers are by definition atypical or infrequent observations, data points which do
not appear to follow the characteristic distribution of the rest of the data. These
may reflect the properties of the variable, or be due to measurement errors or other
anomalies which should not be included in the analyses.
Typically, we believe that outliers represent random errors, which we would
like to control or get rid of. Outliers can have many undesirable properties: They
often attract the linear regression line, which might lead to wrong conclusions.
They can also artificially increase the correlation coefficient or decrease the value
of the legitimate correlation. The measure of spread, range, is unreliable if the data
includes outliers.
Outliers are often excluded from the data after a superficial analysis with quantitative methods. For example, observations that lie outside the range of 2 standard deviations from the mean are sometimes excluded.
Figure 6.9: An example of a box plot. Outliers are often indicated with individual marks
on the top of the boxplot.
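A minimal sketch of both approaches in R, the two-standard-deviation rule of thumb and the box plot, with one planted outlier in simulated data:

```
# Flagging outliers in simulated intensities.
set.seed(7)
x <- c(rnorm(200, mean = 10000, sd = 2000), 40000)  # one planted outlier

# Rule of thumb: values more than 2 standard deviations from the mean.
x[abs(x - mean(x)) > 2 * sd(x)]

# Box plot: points beyond the whiskers are drawn individually.
boxplot(x)$out
```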
6.8 Missing values and imputation
Responsive group
No. of obs. 20 3 4 2 1
|------|-------------------------------|
Expression 0.0 mean 5.0 10.0
Non-responsive group
No. of obs. 0 3 4 2 1
|----------------------|---------------|
Expression 0.0 5.0 mean 10.0
In the example, the missing values influence the mean very much. In a sense,
missing values draw the mean towards them. Even the median would not have
helped us here, because it would have scored 0 for the responsive group!
Missing values are usually either removed from the analysis or estimated using
the rest of the data in a process called imputation. These methods are covered in
more detail in the next chapter.
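As a simple illustration in R, removal and crude mean imputation of missing values look as follows; the data vector is hypothetical:

```
# Handling missing values (NA) in a hypothetical expression vector.
x <- c(0.5, 1.2, NA, 3.0, NA, 4.1)

x[!is.na(x)]  # removal: keep only the complete observations

imputed <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)  # mean imputation
imputed
```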
6. Compare the test statistic to the critical values table of the test statistic distri-
bution.
8. Draw conclusions.
6.9.2 Choosing a test
There are at least a couple of hundred different statistical tests available, but only a few of those are covered here in more detail. Usually the means of certain groups
are compared in the microarray experiments. For example, we can test whether the
expression of a certain gene is higher in the cancer patients than in their healthy
controls. The tests introduced here compare the means of two or several groups
with each other.
When choosing a test, there are two essential questions that need to be answered: Are there more than two groups to compare, and should we assume that
the data is normally distributed?
If two groups are compared, there are two applicable tests, the t-test and Mann-
Whitney U test. If more than two groups are compared, an analysis of variance
(ANOVA) or a Kruskal-Wallis test is used. The ANOVA procedure is covered in
more detail in section 6.10 of this book. Note that if the ANOVA procedure is
applied to two group means only, it will produce the same results as the t-tests
introduced in the next sections.
If the data is normally distributed (see the tests for this in section 6.5.4), the
t-test for two groups, or the ANOVA for multiple groups, can be used for comparisons.
If the data is not normally distributed, and each group has at least five observations,
the Mann-Whitney U test or the Kruskal-Wallis test can be applied. However,
if fewer than five observations are available per group, it is better to use the t-test or
ANOVA.
Tests that assume that the data is normally distributed are called parametric
tests. Non-parametric tests do not make the normality assumption.
A p-value is usually associated with a statistical test; it is the risk that we
reject the null hypothesis (see section 6.9.4) when it actually is true. Before testing,
a threshold for the p-value should be decided. This is a cut-off below which the
results are statistically significant, and above which they are not. Often a threshold
of 0.05 is used. This means that, on average, every 20th time we conclude by
chance alone that the difference between groups is statistically significant when it
actually isn't.
If the compared groups are large enough, even the tiniest difference can get
a significant p-value. In such cases it needs to be carefully weighed whether the
statistical significance is just that, statistical significance, or whether there is a real
biological phenomenon acting in the background.
Before applying the test to the data, a hypothesis pair should be formed. A hypothesis
pair consists of a null hypothesis (H0) and an alternative hypothesis (H1). For
the tests described here, the hypotheses are always formulated as follows:
H0: There is no difference in means between the compared groups.
H1: There is a difference in means between the compared groups.
For the one-sample t-test, where the sample mean M is compared against a hypothesized mean µ, the test statistic is

$$T = \frac{M - \mu}{s / \sqrt{n}}, \qquad df = n - 1,$$

where s is the sample standard deviation and n is the number of observations.
The means of two samples are compared with an independent samples t-test.
In order to perform the test, we need to know whether the variances of the groups
are equal. As a rule of thumb, the variances of the groups can be assumed to be
equal if the variance of the first group is not more than three times larger than the
variance of the second group. Assuming that the variances of the two groups are
not equal, the formula of the test statistic (Welch's t-test) is:
$$T = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}}},$$

where $\bar{X}_i$ and $\bar{X}_j$ are the means of the compared groups, $s_i^2$ and $s_j^2$ are the variances
of the compared groups, and $n_i$ and $n_j$ are the numbers of observations in the
compared groups.
The degrees of freedom are calculated with the formula:

$$df = \frac{\left(\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}\right)^2}{\dfrac{(s_i^2/n_i)^2}{n_i - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}},$$

where $s_i^2$ and $s_j^2$ are the variances of the compared groups, and $n_i$ and $n_j$ are the
numbers of observations in the compared groups.
If the variances of the groups are equal, the test statistic (Student's t-test) is
calculated with the formula:

$$T = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

where

$$s = \sqrt{\frac{\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)^2 + \sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2}}.$$
Degrees of freedom for the equal variances t-test are calculated as:

$$df = n_1 + n_2 - 2.$$
Formulas for the test statistics of non-parametric tests are not presented here,
but they can easily be found in any statistical textbook.
From the table, read the value at the intersection of the degrees of freedom and
the set p-value threshold divided by two (for a two-tailed test). For example, with a
p-value threshold of 0.05 and df = 9, the critical value would be 2.262.
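In practice these tests are rarely computed by hand. A minimal sketch in R, with hypothetical replicate log-ratio vectors for one gene (the data values are invented for illustration):

control   <- c(0.05, -0.10, 0.02, 0.08, -0.03)  # hypothetical replicate log ratios
treatment <- c(0.95,  1.10, 0.80, 1.05,  0.90)

t.test(treatment, control, var.equal = TRUE)    # Student's t-test (equal variances)
t.test(treatment, control)                      # Welch's t-test (the default)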
When many genes are tested simultaneously, the p-values should be corrected to
account for this multiple testing problem. One commonly used correction is the Bonferroni correction,
where the chosen significance threshold is divided by the number of comparisons, creating
a new corrected threshold against which those comparisons should be tested.
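Equivalently, each p-value can be multiplied by the number of comparisons and compared against the original threshold; this is what R's p.adjust() does. A small sketch with invented p-values:

p <- c(0.001, 0.01, 0.02, 0.04, 0.20)   # hypothetical raw p-values
p.adjust(p, method = "bonferroni")      # compare these against 0.05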
First, the sum of squares within groups, or SSerror (how much the individual
observations deviate from their group means), is calculated using the groupwise
means (Table 6.3). This reflects the various errors in the experiment. Then the SS
between groups, or SStreatment (how much the group means deviate from the
overall mean), is calculated using the overall mean (Table 6.4). It reflects the effect
of treatment on the different groups.
SStreatment has k−1 degrees of freedom, where k is the number of groups.
SSerror has N−k dfs, where N is the total number of subjects, and k is the number of
groups. Mean squares (MS) are calculated by dividing the SSs by the corresponding
dfs.
The results are summarized in an ANOVA table (Table 6.5), which reports
the sums of squares, degrees of freedom, and mean squares, and the F-statistic with
an associated p-value. The F-test statistic is calculated as

$$F = \frac{MS_{treatment}}{MS_{error}}.$$
The p-value reported in the table below has been read from the F-distribution
table of critical values with the appropriate dfs and F-statistic.

         SS   df   MS    F   p-value
Effect   24    2   12   12   <0.01
Error     6    6    1
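The same kind of table can be produced in R with aov(); a minimal sketch for one gene measured in three groups (the expression values are invented):

expr  <- c(2.1, 2.3, 1.9, 4.0, 4.4, 4.1, 6.1, 5.8, 6.3)  # hypothetical data
group <- factor(rep(c("A", "B", "C"), each = 3))

summary(aov(expr ~ group))   # prints SS, df, MS, F and the p-value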
6.11.2 Transformations
In GeneSpring, data transformations are linked to Experiment Interpretation. There
are three options to choose from: non-transformed data (ratio), log2-transformed
data (log of ratio), and fold change. When one transformation is chosen, Gene-
Spring will automatically recalculate the data values, and use the new values for
any subsequent statistical analyses (statistical group comparison, k-means cluster-
ing, etc.).
6.11.4 Correlation
The Pearson correlation between chips is automatically calculated. The values of
correlation coefficients can be viewed through Condition Inspector. Condition In-
spector is invoked when the right-hand mouse button is clicked over one chip in the
navigator bar, and Inspect is selected. From the opening window, select the Similar
Conditions tab. The correlation coefficients between the selected chip and all the
other chips are reported in the column Correlation (Figure 6.10).
Figure 6.10: The Pearson correlation in GeneSpring is found under the Similar Conditions
tab in the Condition Inspector.
Select the Lines to Graph tab, and tick the Line of Best Fit box. The linear regression
line is overlaid on the scatter plot, and a regression equation of the form y = ax + b
is displayed at the bottom of the scatter plot view. Note that the slope a equals
the correlation coefficient between the two plotted variables only when both
variables have been standardized; otherwise the two are related but not identical.
Figure 6.11: In GeneSpring the p-values for the one-sample t-test are found in the Gene
Inspector.
Figure 6.13: The result of the statistical group comparison is stored into a genelist, which
can be viewed with a Genelist Inspector.
II Analysis
7 Preprocessing of data
7.1 Rationale for preprocessing
Preprocessing includes analytical or transformational procedures that need to be
applied to the data before it is suitable for a detailed analysis.
Black-box thinking, where data is fed into a program and the result pops
out, is rapidly gaining ground. This kind of approach to statistical analysis is sim-
ply erroneous, because the results coming out of the program can be statistically
invalid. In such cases, the biological conclusions can also be wrong.
Statistical tests often have strict assumptions, which need to be fulfilled. Violation
of assumptions can lead to grossly wrong results.
We strongly recommend that the researcher, even if not performing the data
analysis personally, gets basic knowledge of the data: the results
presented by the bioinformatician or statistician are more easily interpretable if
one is at least somewhat familiar with the data.
Here we will introduce some methods for checking the data for violation of
statistical test assumptions. We also present some things to consider before and
during the data analysis. The methods are introduced in the order of applicability,
although some methods are needed in several steps.
Missing values can lead to problems in the data analysis, because they easily
interfere with computation of statistical tests and clustering. Therefore, it is worth
giving a thought to the treatment of missing values. There are a couple of options
for the treatment of missing values: They may be replaced with estimated values in
a process called imputation, or they can be deleted from the further analyses.
The default way of deleting missing data (in most of the software packages),
for example while calculating a correlation matrix, is to exclude all cases that have
missing data in at least one of the selected variables; that is, by casewise deletion of
missing data. However, if missing data are randomly distributed across cases, you
could easily end up with no "valid" cases in the data set, because almost every gene
is likely to have at least one missing observation on some chip. The most
common solution used in such instances is to use the so-called pairwise deletion,
where a statistic between each pair of variables is calculated from all cases that
have valid data on those two variables.
Another common method is the so-called mean substitution of missing data
(mean imputation, replacing all missing data in a variable by the mean of that vari-
able). Its main advantage is that it produces internally consistent sets of results.
Mean substitution artificially decreases the variation of scores, and this decrease
in individual variables is proportional to the number of missing data. Because it
substitutes missing data with artificially created average data points, mean substi-
tution may considerably change the values of correlations. Imputation is commonly
carried out for intensity ratios, but can also be done for raw data.
Different computer programs handle missing values very differently, and
drawing any consensus would be futile. At least statistical programs often offer
a possibility to define whether to use imputation, pairwise deletion or casewise
deletion.
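As an illustration, the three approaches can be sketched in base R; the matrix x below is hypothetical (genes in rows, chips in columns):

x <- matrix(c(1.2, NA, 0.8, 1.1, 0.9, 1.0, NA, 1.3), nrow = 4)  # invented data

cor(x, use = "complete.obs")           # casewise deletion
cor(x, use = "pairwise.complete.obs")  # pairwise deletion

# mean imputation: replace each NA by its column (chip) mean
x.imp <- apply(x, 2, function(col) { col[is.na(col)] <- mean(col, na.rm = TRUE); col })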
Figure 7.1: Raw and log-transformed intensity values of the red channel plotted against
its background. The upper row contains uncorrected scatter plots, the lower row
background-corrected scatter plots.
Figure 7.2: Intensity ratios (expression) plotted against the background of the channels.
Figure 7.3: An A versus M plot, where the pheasant tail is visible in the lower intensity
values.
$$\text{Intensity ratio} = \frac{Cy3}{Cy5}$$

For Affymetrix data, substitute Cy3 and Cy5 with the appropriate intensities
from the sample and control chips.
This intensity ratio is one for an unchanged expression, less than one for down-
regulated genes and larger than one for up-regulated genes. The problem with
intensity ratio is that its distribution is highly asymmetric or skewed. Up-regulated
genes can take any values from one to infinity whereas all downregulated genes are
squeezed between zero and one. Such distributions are not very useful for statistical
testing.
For values > 1: fold change = Cy3/Cy5.
For values < 1: fold change = 1/(Cy3/Cy5).
The fold change makes the distribution of the expression values more symmetric,
and both under- and over-expressed genes can take values between one and
infinity. Note that the fold change makes the expression values additive in a similar
fashion as the log-transformation.
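A small sketch of these quantities in R (cy3 and cy5 are hypothetical background-corrected intensities):

cy3 <- c(1000, 4000, 500)
cy5 <- c(1000, 1000, 2000)

ratio <- cy3 / cy5                             # 1.0, 4.0, 0.25
fold  <- ifelse(ratio >= 1, ratio, 1 / ratio)  # 1.0, 4.0, 4.0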
Replicates are a very powerful way to reduce the noisiness of the data. Noisiness (random
errors) is an undesirable feature of the data, because it potentially abolishes
interesting information. It can result from many sources, some of which are hard
to deal with.
There are at least three different kinds of replicates that are potentially useful in the
context of DNA microarray analysis. For example, if we treat a certain cell line
with a cancer drug, we can set up several culture flasks, and then harvest them as
biological replicates. For every culture flask, the isolated and labeled mRNA popu-
lation can be hybridized to several chips, making a number of technical replicates.
In addition, every hybridized chip can be quantified several times or using different
image analysis software (software replicates).
These are all valuable sources of information about the variability in the data.
In practice, it might be a good idea to make some biological replicates and some
technical replicates. Then the biological and technical variation can be taken into
account in the same experiment.
In a time series experiment, expression changes are monitored with samples taken
at certain time intervals. Although several replicates can be made per time
point, it should be considered whether these replicate chips could be put to better
use if they were added to the time series as new sampling points. That is, it
should be weighed whether high precision at every time point is more valuable
than the additional information that new sampling points (time points) would produce.
For case-control studies, where for instance cancer patients are compared with their
non-diseased controls, the individuals in both groups can be considered replicates
of that disease state. This assumes that we are interested in the differences between
those two disease states instead of inter-individual variation.
Using power analysis we can compute the number of replicates that are
needed to reliably discover the genes that are differentially expressed. For example,
we would need 11 replicates to reliably pick up the genes that are at least 1.41-fold
over-expressed at a p-value of 0.05. Similarly, 6 replicates are needed if genes with
2-fold over-expression need to be quantified using a p-value cutoff of 0.01.
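A hedged sketch of such a calculation with R's power.t.test(); the standard deviation of the log ratios (sd = 0.5) and the desired power (0.8) are assumptions here, so with other values the replicate numbers will differ from those quoted above:

power.t.test(delta = log2(1.41), sd = 0.5, sig.level = 0.05, power = 0.8)
power.t.test(delta = log2(2),    sd = 0.5, sig.level = 0.01, power = 0.8)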
If the experiment includes replicates, their quality can be checked with simple
methods, using scatter plots and pairwise correlations or hierarchical clustering
techniques, which are explained in more detail in the cluster analysis chapter.
Often the quality of replicates is checked before and after normalization. If the
correlation between replicates drops dramatically after a certain normalization, the
applicability of the normalization method should be reconsidered.
The first task is often to find out how well two replicate chips resemble each other.
In such cases the intensity or log ratios can be plotted along the axes of a scatter
plot. Three replicates can be plotted using a three-dimensional scatter plot. If
the replicates are completely similar, the data points in the scatter plot fall onto a
perfectly straight line running through the origin of the plot. If the replicates are
not exactly similar, the observations form a data cloud rather than a straight line.
The absolute deviation from a line can be better visualized if a linear regression
line is fitted to the data.
Correlation measures linear similarity quite well, but it might give misleading
information if the data cloud is not linear. Correlation coupled with a scatter plot
gives much more information. In practice, the Pearson correlation coefficients
between two replicate cDNA chips produced from the same mRNA pool (technical
replicates) are often in the range of 0.6–0.9, and the correlation between similarly
hybridized Affymetrix chips is typically over 0.95. When biological replicates are
performed, the correlation between replicate chips usually drops from these values.
For cDNA chips, a correlation of 0.7 between replicates may be considered good,
whereas for Affymetrix chips a correlation over 0.9 may be considered a good
result.
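A minimal sketch of this check in R, with two simulated replicate chips (chip1 and chip2 are hypothetical log-ratio vectors):

chip1 <- rnorm(1000)                    # simulated replicate 1
chip2 <- chip1 + rnorm(1000, sd = 0.4)  # simulated replicate 2

cor(chip1, chip2)                       # Pearson correlation
plot(chip1, chip2)
abline(lm(chip2 ~ chip1), col = "red")  # fitted regression line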
In addition, the scaling factor should be checked for the Affymetrix chips. If
the factor is larger than 50, the hybridization is probably bad.
The quality of several replicate chips is most easily checked by hierarchical clus-
tering. For example, if two replicate chips are placed closest to each other
in the dendrogram, they can be expected to be good replicates. Depending on the
similarity measure, the hierarchical clustering can be applied either after (Euclidean
distance) or before the normalization (Pearson correlation).
The quality of replicate measurements of one gene (one spot) can be assessed in
a similar fashion to replicate chips, using correlation measures. This is more
laborious than with replicate chips, but it can sometimes be essential that bad
replicate spots are found and removed from the data.
It is quite common to exclude bad replicates from further analyses. For example,
if there are four replicate chips available for the same treatment, and one seems to
deviate from every other replicate, then this most deviating one can be removed
from any further analyses. This often results in a massive loss of data, because
whole chips are removed from the analysis. Thus, it would be better to study the
quality of replicates on a gene-wise basis, after initially studying the quality of
whole chips using the tools described above.
In practice, there often are replicates that deviate from the others. These can
result from modified experimental conditions (a different mRNA batch, etc.) or from
important biological variation, if different animals or human subjects are studied.
Because of this, some caution should be exercised when removing bad replicates.
7.7 Outliers
Outliers in chip experiments can occur at several levels. There can be entire chips,
which deviate from all the other replicates. Or there can be an individual gene,
which deviates from the other replicates of the same gene. These deviations can
be caused by chip artifacts like hairs, scratches, and precipitation. Precipitation,
among other causes, can also perturb one single spot or probe.
Outliers should consist mainly of quantification errors. In practice, it is often
not easy to distinguish quantification errors from true data, especially if there
are no replicate measurements. If the expression ratio is very low (quantification
errors) or very high (spot intensity saturation), the result can be assumed to be an
artifact, and should be removed. Most of the actual outliers should be removed at the
filtering step (those that have too low intensity values), and some ways to identify
deviating observations were presented in the section on checking replicates.
In the absence of replicates, the highest and lowest 0.3% of the data (gene ex-
pression values on one chip) is often removed because, assuming normality, such
data resemble outliers. These values are outside the range of three standard de-
viations from the distribution mean. This is often equivalent to a filtering where
observations with too low or too high intensity values are excluded from further analy-
ses.
Formally, a statistical model of the data is needed for the reliable removal
of outliers. The simplest model is equality between replicates. If one replicate
deviates several standard deviations from the mean of the other replicates, it can
be considered an outlier and removed. The t-test takes the standard deviation into
account, and assigns a low significance to genes whose replicates contain outliers.
In practice, it is easiest and best to couple the removal of outliers with filtering,
for example, using the following procedure: First, genes outside the range of three
standard deviations from the chip mean are removed. In the succeeding filtering
steps, the genes whose intensity is too low are removed, if needed. Next, genes
which do not show any expression changes during the experiment are eliminated
(this can be based on either the log ratio or the standard deviation of log ratios). What
is left in the end is the good quality data. There are also some advanced statistical
methods that allow for outlier detection and removal, but they are
outside the scope of this book.
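The first step of that procedure might look like this in R (logratio is a hypothetical vector of one chip's log ratios):

logratio <- rnorm(5000)                                     # simulated chip data
keep <- abs(logratio - mean(logratio)) <= 3 * sd(logratio)  # three-sd rule
filtered <- logratio[keep]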
The idea of filtering is simple. We want to remove all the data we do not
have confidence in, and proceed with the rest of the data. The results based on the
trustworthy data are often biologically more meaningful than the results based on
very confusing or noisy data.
The first step of filtering is flagging. Flagging is performed at the image anal-
ysis step. Its idea is to mark the spots which are, by eye, judged to be "bad".
Then, in the data analysis phase, the spots which were flagged as "bad" are re-
moved from the data. For example, spots overlapped by a sizable dust particle
should be flagged as bad and removed at the data filtering step.
There are a couple of statistical measures that can be used for filtering. Some
image analysis programs give signal-to-noise or signal-to-background measure-
ments for every spot on the array. These quality measurements can be used for
filtering out bad data. Often a cut-off point of 90–95% for the signal-to-noise ratio is
used, at least on the control channel.
Signal-to-background can be calculated after the image analysis from the in-
tensity values:

$$\text{signal-to-background} = \frac{\text{spot signal on green channel}}{\text{background signal on green channel}}$$

The spots can then be filtered on the basis of this quality statistic. Signal-to-
background plots of both channels often appear very much the same as the original
data (Figure 7.4).
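A tiny sketch of such a filter in R (gmean and gbg are hypothetical foreground and background intensities of the green channel):

gmean <- c(15000, 800, 22000, 400)   # invented spot signals
gbg   <- c(500, 600, 450, 350)       # invented background signals

stb  <- gmean / gbg                  # signal-to-background per spot
keep <- stb > 2                      # assumed cut-off; tune for your data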
Figure 7.4: Raw intensities and log-transformed intensities of different channels plotted
against their signal-to-background ratios (stb).
It would also be expected that the signal-to-background ratio increases as
the intensity of the channel increases, if the background is approximately the same
in all areas of the chip (Figure 7.5). This is one hallmark of good-quality data.
Figure 7.5: The signal-to-background ratios (stb) increase as the intensity of the spot
increases.
Filtering by the standard deviation of the background intensities is also often
used. Observations which have an intensity value lower than the background
plus 2–6 times the standard deviation of the background intensities are often removed
from the data.
7.10.3 Variance
Sometimes you need to make a decision about the appropriate statistical test. For
example, there are different kinds of t-tests for variables with equal or unequal
variances. As a rule of thumb, the variances can be assumed to be equal if the
variance of the first variable is less than three times the variance of the second
variable.
[Figure 7.6 panels: histograms of raw gmean values titled "Not normally distributed" (upper row) and histograms of log2(na.gmean) titled "Approximately normally distributed" (lower row).]
Figure 7.6: Four histograms of the same data. The two upper histograms contain the raw
data, and the two lower histograms the log-transformed data. The log-transformed data is
clearly more normal-like than the non-transformed data.
Normality refers to how well your data fits the normal distribution. This
should be checked, because most statistical procedures assume that the data
is normally distributed. Even the most basic descriptive statistics can be misleading
if the distribution is highly skewed; the standard deviation does not bear a meaningful
interpretation if the distribution deviates significantly from normality.
The easiest method for checking normality and skewness of the distribution
is to draw a histogram of the intensities (Figure 7.6). For checking the skewness,
comparison of the mean and median is complementary to this graphical method.
Note that there should be enough columns in the histogram in order to make the
results reliable.
There is also a more informative way to check for normality, the quantile-
quantile plot. This should be used together with the histograms as a more diagnostic
method for observing deviations from normality.
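Both checks are one-liners in R; a minimal sketch on a hypothetical intensity vector gmean (the Shapiro-Wilk test is shown as one formal alternative, an assumption of convenience rather than the book's prescribed test):

gmean <- rexp(2000, rate = 1/5000)        # simulated, skewed intensities

hist(log2(gmean), breaks = 50)            # enough columns for a reliable picture
qqnorm(log2(gmean)); qqline(log2(gmean))  # quantile-quantile plot
shapiro.test(sample(log2(gmean), 1000))   # formal normality test on a subsample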
7.11.1 Linearity
It is also important to check the linearity of the data (log ratio). Linearity means
that in the scatter plot of channel 1 (red colour) versus channel 2 (green colour), the
relationship between the channels is linear. It is often more informative to produce
a scatter plot of the log-transformed intensities, because then the lowest intensities
are better represented in the plot (Figure 7.7). In this kind of a plot, the data points
fit a straight line, if the data is linear.
Figure 7.7: Checking the linearity of the data. Top row, red channel intensities versus
green channel intensities; bottom row, log ratio versus different channel intensities.
Another way to test linearity is to plot the log ratio versus the intensity of
one channel (Figure 7.7). This should be done for both channels independently.
In this plot the data cloud has been tilted 45° to the right, and it should be easier to
identify non-linearity. Again, the data points should fit an approximately straight
line, which in this case is horizontal.
Checking the linearity of the data helps to pick the right normalization method.
It also provides information about the reliability of the data, especially in the lower
intensity range.
Often the intensity values (especially background) vary a lot on the same chip de-
pending on the location of the spot on the chip. This effect is known as spatial
effect (Figure 7.8).
Figure 7.8: Intensities of spots and their backgrounds spotted in the form of the original
chip in order to check for spatial biases. Background intensities seem to be highly variable
in different areas of the chip. Especially the corners have an abnormally high background.
These areas might be removed from further analyses.
Spatial effects are often most pronounced in the corners of the chip, where the
hybridization solution has possibly been squeezed thinner than in the other areas of
the chip. Furthermore, corners of the chip can also dry up more easily than other
areas of the chip.
Another source of spatial effects is the regional clustering of genes with a
similar expression profile. This is often caused by bad chip design, where the
probes have not been randomized onto the chip. For example, if genes acting in a
cell cycle regulation are spotted into one corner of the chip, the expression values in
the corner are probably dependent on the cell cycle stage of the sample material. In
such cases, the spatial effect should not be removed by normalization procedures,
because it contains real biological information instead of random noise.
A similar kind of spatial effect can also be checked from the calculated
expression values (Figure 7.9). There should not be any suspicious areas on the
subarrays (assuming good chip design) where the intensity values are much higher
than everywhere else on the same chip.
Figure 7.9: The measured gene expression spotted in the form of the original chip in order
to check for spatial biases.
7.13 Normalization
Many statistical procedures assume that across the experiment, the dynamic range
(minimum and maximum), mean, and variance of the chips are equal. These should
be checked after normalization. If they are not, the results of the statistical tests
should be interpreted with caution.
7.15.4 Replicates
When importing the data from a chip where the same spot is present in multiple
copies, GeneSpring automatically calculates an average of those replicates. This
is assuming that the replicate spots have exactly the same Gene identifier in the
data file. Replicate chips are also averaged, after defining the replicates in the
Experiment -> Experiment parameters window and setting up the parameters in
Experiment -> Experiment interpretation.
For example, if we have a time series experiment (Table 7.1) with three time
points and two replicates per every time point, the parameters should be set up
as follows. Two parameters, a time point and a replicate are created. In the time
point, mark the time points suitably. The replicates are then set up using the other
parameter. Inside one time point, the replicates are marked with a running number.
Last, set up the value order for the time points. This tells GeneSpring, in which
order the time points should be displayed on the screen.
If there is missing data for some genes either when importing the data or when
setting up the replicate chips, GeneSpring only uses the existing data for the calcu-
lation of means.
7.15.6 Normality
Normality can be checked using a histogram. A histogram can be investigated
in the View -> Graph window by checking the All Samples interpretation mode,
which is automatically produced for each experiment.
7.15.7 Filtering
A filtering tool can be invoked from the menu Tools -> Filtering and statistical
analysis. A new window opens, where one genelist and one experiment should
be selected. The actual filtering tool is accessed by right-clicking on the selected
experiment and selecting Add expression percentage restriction from the opening
list. Often the bad quality data is first filtered out. After that, the not-changing
genes are removed from the dataset.
Using the scatter plot tool, define the intensity value below which you can no
longer trust your data on either the control or signal channel. For cDNA chips, this is
often around 200–1000, and for Affymetrix chips around 200. In the expression
percentage filtering, use this signal value as the minimum cut-off. Also select that it
applies to all the conditions. Create one such filter for both channels. Create a new
genelist of the results (Make list button).
In the next filtering phase we will try to find the genes that are changing and that we
can trust. Using the genelist created above, set up a new filter, where you select
genes which have expression values between 0.5 (minimum) and 2 (maximum).
These genes are not showing any expression changes and are uninteresting to us.
This filtering should also apply to the whole dataset (all conditions). After saving
the new genelist, the filtering tool can be closed.
Go to the navigator bar and right-click on the good quality gene list. From the
list, pick Venn diagram -> Left (red). Similarly, add the not-changing and all genes
genelist to the Venn diagram. Then select the All genes genelist from the navigator.
From the Venn diagram identify the region that contains the genes included in the
reliable genelist, but not in the not-changing genelist. Right-click on that area of
the diagram, and make a list of these genes.
Now you have a list of genes, which are reliable and also changing. You’re
ready to proceed with the analysis, for example to clustering or classification anal-
yses.
8 Normalization
8.1 What is normalization?
There are many sources of systematic variation in microarray experiments that af-
fect the measured gene expression levels. Normalization is the term used to de-
scribe the process of removing such variation [12]. Normalization can also be
thought of as an attempt to remove the non-biological influences on biological data.
Sources of systematic variation will affect different microarray experiments
to different extents. Thus, to compare microarrays from one array to another, we
need to try to remove the systematic variation, to bring the data from the different
experiments onto a level playing field, so to speak. One goal of normalization is
also to make possible the comparison from one microarray platform to another, but
this is clearly a much more complicated problem and is not covered in this section.
The biggest problem with the normalization process is the recognition of the
source of systematic bias. There is a strong possibility that some or even most of
the biological information will also be removed when normalizing the data. Thus
it is good to keep in mind that the amount of normalization should be minimized to
avoid losing the real biological information. In the next chapter some of the most
common sources of systematic bias will be introduced and also some methods for
recognizing the sources and dealing with the bias will be discussed.
Differences in dye (labeling) efficiencies are the most common source of bias and
are also easily identifiable. This can be seen when the intensity of one channel on the
array is much higher than the other (see Figure 8.1, B and C). The dye effect can be
corrected for by balancing the dyes, using the assumption that both channels should
be equally "bright". Extra information for dye balancing can be obtained from dye-
swap experiments. Problems will occur when the dye also has interaction effects; i.e.,
the labeling efficiency may depend on the gene sequence.
Scanners can fail in many different ways. When the laser or PMT intensity values
are wrongly adjusted, the scanner itself can cause dye effects to show up. Most
scanner malfunctions are hard to deal with in silico, and the best solution is to
fix the scanner and repeat the scanning. For example, when the lasers are misaligned,
the two channels are slightly out of register. This can cause big problems if the image
analysis software does not allow the user to align the images manually.
Sometimes patterns are seen on the slide; these can be caused by many different
factors. The most common ones are described in the next three sections.
When spatial bias is seen on the slide, the first impression is usually that it is
caused by uneven hybridization. In most cases this is also true. Uneven hybridiza-
tion can be recognized, for example, as lighter areas in the middle of the slide or on
the edges of the slide. This is usually very hard to fix numerically and it is recom-
mended to aim for developing more consistent techniques for hybridization. For
single cases, spots that are not hybridized properly can be excluded from further
analyses. Background difference can also cause bias (see Figure 8.1, E).
Slides are usually printed using more than one pen (2, 4, 8, 16, ...). If any of these
pens works differently from the others, for example if a pen picks up a hair or is
defective in some other way, the corresponding subarray can differ from the other
subarrays. Quite commonly, printing pens also wear out at different rates. One
way to see if a pen performs differently is to visualize the data by subarrays,
using colors or regression lines, so that the faulty subarray stands out. In some
cases, a printing tip error can be corrected for by applying different normalization
parameters to the subarrays.
Printing can also cause problems that are not categorized as spatial bias. Some-
times some or all spots are irregular in shape, or they may appear as rings instead
of circles. It is likely that the detection of these spots will be inaccurate. This
should be taken into account during image analysis if possible.
These often ignored but very common effects are not artificial but are caused by
biology itself. Sometimes concentration of reporters (probes) on the slide might
differ and this can cause patterns on a slide. Usually this is easy to notice if the
position of the plates (assuming that concentration is constant on a plate) is known.
Other quality control methods should also be used. Plate effects can be corrected
for by using methods as in the case of different printing tips.
More difficult to recognize and especially to distinguish from other sources of
bias is the bias caused by the biological role of reporters. If reporters are arranged
according to their biological function (as is done in many cases) this can also be
seen as patterns on a slide. It is very important to note that this effect should not
be normalized! The reporter effect needs to be taken into account when slides are
being designed. Related reporters should never be grouped together.
When large numbers of slides have been studied, it has been noted that slides from
the same print run (or batch) often cluster together, as do slides from an identical
print design but a different print run. Some of these effects might be due to mistakes
in printing. This kind of bias is very hard to notice, and the only way to prevent it
is to keep track of printing (LIMS) and to use good pre- and post-printing quality
control methods.
Comparable to the batch effect is the systematic bias caused by the experimenter. Experi-
ments done by the same experimenter often cluster together more tightly than war-
ranted by biology. A survey made at Stanford University showed that the experi-
menter was one of the largest sources of systematic bias. The best but not very
practical solution to the experimenter issue would be to let the same experimenter
do everything. Because this naturally isn't possible, consistent hybridization tech-
niques are needed, as well as methods to recognize bias caused by the experimenter.
If you have the possibility to hybridize one or several arrays where the same sam-
ple has been used both as target and control (i.e., the same sample labeled with both
dyes on the same array), or to study two replicate arrays as sample and reference,
the results of both channels should be equal: a scatter plot should show a straight
line, and optimally your data should follow a normal distribution. Studying self-self
hybridized arrays shows you clearly the sources of error, such as uneven incorpo-
ration of the dyes, that are not dependent on your sample. It can also help you in
deciding which normalization steps would minimize these errors.
A scatter plot (MA plot, see 8.5.5) may be used to illustrate the different types
of effects due to intensity-dependent variation, as shown in Figure 8.1.
There are three broad groups of normalization techniques, each of which can con-
tain multiple methods. Classically, normalization means making the data more
normally distributed. In microarray analyses this is still the goal, but the method-
ology has obscured the original idea.
Figure 8.1: Truncation at the high intensity end (A) is caused by scanner saturation.
High variation at the low intensity end (B) is from a larger channel-specific additive er-
ror. High variation at the high intensity end (C) is from a larger channel-specific multi-
plicative error. The curvature in (D) is from a channel mean background difference. The
curvature in (E) is from a slope difference. The split RI plots in (F) come from hetero-
geneity. Figure adapted and modified from the paper "Data transformation for
cDNA Microarray Data" by Cui, X., Kerr, M. K., and Churchill, G. A. (submitted to
Statistical Applications in Genetics and Molecular Biology; the manuscript web site is at
http://www.jax.org/staff/churchill/labsite/pubs/index.html).
Ideally, you would have several types of controls in different concentrations
scattered throughout the printed array. If you use Affymetrix products, the chips con-
tain controls for the experiment and for the hybridization of the correct target (per-
fect match and mismatch, Chapter 3). More about controls can be read in Chapters
2 (Experimental design) and 7 (Preprocessing of data).
There are several mathematical procedures and different algorithms (median cen-
tering, standardization, lowess smoothing) for normalization from which to choose.
Before deciding on the method, the linearity of the data should be checked (see
Chapter 7). If the data is linear, procedures such as median centering can be ap-
plied; conversely, median centering can only be applied to linear data. If the data
is non-linear, lowess smoothing or another local method should be applied instead.
Moreover, if the data is normally distributed, it can be standardized. Also, for
certain purposes, for example clustering, the spread of the data should be
standardized.
Sometimes basic normalization schemes don't perform well enough. If most of the
genes on a chip are likely to change, or there are spatial biases on the chip, more
sophisticated methods should be used. Global methods should not be used
if the data is nonlinear, if there are spatial biases on the chip, or if the number of
expressed genes varies a lot between individual chips. In such cases, local methods
should be applied instead.
There are several ways to calculate the normalized intensity ratios. Here, we present
a few of the most commonly used ones. If you want to play around with these,
please check the logarithmic rules of calculation before proceeding. For example,
the mean centering (subtract the mean or median to get a mean or median of zero)
with log-transformed data is equivalent to mean scaling (divide with a certain num-
ber to get a mean or median of one) with untransformed data. The next formulas
are presented for the log ratios. Usually the normalization is calculated using only
a control channel or a reference chip.
Calculate the mean of the log ratios for one microarray. Produce the centered data
by subtracting this mean from the log ratio of every gene.
Calculate the median of the log ratios for one microarray. Produce the centered
data by subtracting this median from the log ratio of every gene.
Remove the most deviant observations (5%) from the data for one microarray. Cal-
culate the median of the log ratios of the remaining genes. For centered data, sub-
tract this median from the log ratios of every gene.
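A minimal sketch of these three centering schemes in R, applied to a hypothetical vector m of one chip's log2 ratios:

m <- rnorm(5000, mean = 0.3)   # simulated log ratios with a systematic shift

m.mean   <- m - mean(m)        # mean centering
m.median <- m - median(m)      # median centering

dev <- abs(m - median(m))      # trimmed median: drop the most deviant 5%
m.trim <- m - median(m[dev <= quantile(dev, 0.95)])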
8.5.4 Standardization
(In the A/M coordinates used below, where M is the log ratio and A the mean log intensity, the green channel intensity can be recovered as $G = 2^{(2A-M)/2}$.)
Lowess normalization works as follows. A lowess function smoothing is ap-
plied to the data, and a curve is estimated by the sliding window method. The nor-
malized log-transformed intensity ratio is calculated by subtracting the estimated
curve from the original values. (Note: if a linear function is used for the local re-
gression, it is called lowess with a "w"; if a quadratic function is used for the local
regression, it is called loess without the "w".)
Figure 8.2 gives an example of how the lowess-regression treats real life data.
Figure 8.2: Centralization with a lowess-regression. On the left, the log2 -normalized data
(centered around zero) in an A versus M plot. On the right, lowess-normalized data.
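A sketch of the idea in base R, fitting lowess() to M as a function of A and subtracting the fitted curve (A and M are simulated here, and the window width f = 0.3 is an assumption):

A <- runif(5000, 8, 16)                        # simulated mean log intensities
M <- 0.3 * sin(A) + rnorm(5000, sd = 0.2)      # simulated curved log ratios

fit    <- lowess(A, M, f = 0.3)                # local regression curve
M.norm <- M - approx(fit$x, fit$y, xout = A)$y # subtract the estimated curve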
The linlog transformation combines a linear transformation, which suits the low
intensity spots, with a logarithmic transformation that is more appropriate for the high
intensity spots (Cui et al., submitted). The linlog transformation can precede lowess
smoothing, and in that way stabilize the variation due to the additive error dominant
at low intensity, and the multiplicative error dominant at high intensity, in microarray
data.
$$\text{green intensity} = \frac{\text{red intensity}}{\text{spiked control intensity ratio}}$$

In dye-swap experiments, where R and G are the original and R′ and G′ the dye-swap
experiment intensities of the red and green channels, respectively, the normalization
constant can be estimated as

$$c = \frac{1}{2}\left(\log_2\frac{R}{G} + \log_2\frac{R'}{G'}\right).$$
In other words, the average of the original and dye-swap chip is calculated
for every gene. Normalized values are calculated by subtracting the average of the
chips from the averaged genewise intensities.
Most of the aforementioned methods are used for per-chip normalization. Per-gene
normalization is not worthwhile if you only have a few chips, because then you
can potentially introduce errors to your data. This is because the mean, standard
deviation, and regression curves can not be effectively estimated from a very small
number of observations. As a guideline, the experiment should consist of at least
five chips, if you want to perform the per-gene normalization using mean or me-
dian centering. Even more observations (at least 10–25) are needed in order to use
standardization, regression, or other advanced normalization methods.
Furthermore, using very sophisticated normalization methods can lead to a
phenomenon called overfitting, in which case the model (e.g., a linear regression)
describes the variability of the data too well. This effectively removes biologically
relevant variation from the data, but it also introduces biases to the analysis. It is
common that the correlation of two chips (especially replicates) slightly decreases
after normalization. Whether this means that the biologically relevant information
has been removed, and some noise added, is currently unknown.
Mean and median centering do not usually influence the standard deviation
very much, but regression and other more sophisticated methods do. Usually
they increase the standard deviation, which might mean noisier data.
Therefore, the normalization procedure should be as simple as possible, while still
taking the systematic errors into account.
A couple of examples of how the normalization procedure affects the graphical rep-
resentation of the data are presented in Figure 8.3. These examples show how
the number of genes affects the results. In the image, a diminishing series of house-
keeping genes is used for normalization.
Figure 8.3: The four images represent the same data processed with per-chip mean cen-
tering. The number of genes used for the mean calculation has been varied. From top left
to bottom right: 5776 genes, 57 genes, 577 genes, and 6 genes. Note the varying slope of
the linear regression line fitted to the normalized data.
Table 8.1: An example of per-chip and per-gene mean centering. Because of rounding,
the centered results in the table do not always exactly equal the differences between the
individual observations and their means.
Uncentered data

                chip1  chip2  chip3  chip4   mean  standard dev.
gene1            2.12   2.01   4.37   2.01   2.63   1.16
gene2            2.20   2.06   4.32   2.03   2.65   1.11
gene3            2.18   1.90   4.37   1.90   2.59   1.20
gene4            2.15   1.92   4.38   1.89   2.59   1.20
gene5            2.14   2.00   4.52   1.99   2.66   1.24
gene6            1.93   2.02   4.18   2.01   2.54   1.09
gene7            2.26   1.96   4.19   1.98   2.60   1.07
gene8            2.07   2.00   4.39   2.01   2.62   1.18
gene9            2.25   2.06   4.34   2.04   2.67   1.12
gene10           1.95   1.76   3.97   1.82   2.37   1.07
mean             2.13   1.97   4.30   1.97
standard dev.    0.11   0.09   0.15   0.07
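Per-chip and per-gene mean centering of a matrix like the one above is a one-liner with R's sweep(); x is a hypothetical genes-by-chips matrix:

x <- matrix(rnorm(40, mean = 2.5), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("chip", 1:4)))

per.chip <- sweep(x, 2, colMeans(x))               # subtract each chip's mean
per.gene <- sweep(per.chip, 1, rowMeans(per.chip)) # then each gene's mean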
2. Beißbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J., Hauser,
N. C., Scheideler, M., Hoheisel, J. D., Schutz, G., Poustka, A., and Vingron,
M. (2000) Processing and quality control of DNA array hybridization data.
Bioinformatics 16, 1014-1022.
3. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A com-
parison of normalization methods for high density oligonucleotide array
data based on bias and variance. Bioinformatics 19, 185-193.
4. Brazma, A., and Vilo, J. (2000) Gene expression data analysis. FEBS Lett.
480, 17-24.
5. Cui, X., Kerr, M. K., and Churchill, G. A. Data transformations for cDNA
microarray data. Statistical Applications in Genetics and Molecular Biol-
ogy, submitted.
7. Eickhoff, B., Korn, B., Schick, M., Poustka, A., and van der Bosch, J. (1999)
Normalization of array hybridization experiments in differential gene expres-
sion analysis. Nucleic Acids Res. 27, e33.
10. Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H.,
and Herzel, H. (2000) Normalization strategies for cDNA microarrays.
Nucleic Acids Res. 28, e47.
12. Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for
cDNA microarray data. Proceedings of Photonics West 2001: Biomedical
Optics Symposium (BiOS), The International Society for Optical Engineering
(SPIE), San Jose, California. http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
13. Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001) Normalization for
cDNA microarray data, submitted.
This chapter was written by Jarno Tuimala, Ilana Saarikko, and M. Minna
Laine.
9 Finding differentially
expressed genes
After removing the bad quality data we are left with reliable data. As we have
already seen, good quality data can be further filtered so that only the genes that
show some changes in the expression during the experiment are preserved in the
dataset. Often the differentially expressed or otherwise interesting genes are stored
as simple lists of gene names, i.e., genelists. Most typically genelists are created
by driving the data through a certain filter, but they can also include lists of genes,
which have been produced by clustering analyses or other more complex methods.
Sometimes it is sufficient to get information about genes that are either under- or
overexpressed during the experiment. For example, we might be interested in genes
that have an elevated expression because of a drug treatment. Such genes are most
easily found by simple filtering. Simple filtering (by absolute expression change)
can even be used for experiments, where there are no replicates.
More sophisticated methods attach significance values to expression changes.
However, using such methods carries the risk that the most unstable probes or
mRNAs are identified as differentially expressed.
In the next four sections we will go through some of these methods.
The noise envelope is simply a method where the standard deviation (calculated
with a sliding window) is used as a cut-off for under- and over-expressed genes.
The variance in the reported gene expression level is a function of the expres-
sion level: the variance is higher at low expression values and lower when the
expression is high. Thus, the identification of the differentially expressed
genes calls for a statistical model of the variance or noise. The simplest model
could be constructed on the basis of the standard deviation of the expression values.
Such a model can be constructed relatively easily using statistical programs.
The method relies on a segmental (sliding window) calculation of the standard devia-
tion. A datapoint refers to an (x, y) pairing, where x is the absolute intensity value
of a gene from the hybridization, and y is the corresponding intensity ratio value.
Using all data points in a given sliding window of expression values, the standard
deviation of the intensity ratios is calculated. The average of the expression values
within the window is then paired with the average intensity ratio value within the
same window plus the number of standard deviations (usually 2 or 3) specified by
the experimenter. This new pair becomes a candidate upper noise envelope point.
Candidate lower noise envelope points are determined similarly.
Using these noise envelopes, the insignificant expression changes are removed:
the region between the upper and lower noise envelopes contains genes with in-
significant changes.
Figure 9.1: A scatter plot with Sapir and Churchill’s method with the cut off lines marked.
The vertical lines represent the cut offs for up- and downregulated genes. Note that the axes
of the plot are normalized log ratio (M) and average intensity (A).
The idea central to Chen’s method is that the gene expression levels are determined
by the intrinsic properties of each gene, which means that the expression levels vary
widely among genes. Therefore it is inappropriate to pool statistics on gene expres-
sion differences across the microarray. Assuming that the green and red channel
intensities are normally distributed and have similar coefficients of variation and a
similar mean, the confidence intervals for actual differential expression are calcu-
lated using the data derived from the maximum likelihood estimate of the coeffi-
cient of variation and a set of genes, which do not show any expression change on
the chip.
The method is implemented in the R language package sma. The results can be
obtained either as a genelist or in the form of a scatter plot. The scatter plot example
is given here (Figure 9.2).
Figure 9.2: A scatter plot with Chen’s method with the 95% (inner, light blue) and 99%
(outer, dark blue) cut off lines marked. The vertical lines represent the cut offs for up-
and downregulated genes. Note that the axes of the plot are normalized log ratio (M) and
average intensity (A).
Newton’s method models the measured expression levels using terms that account
for the sources of variation. The first obvious source is measurement error. The
second source of variation is due to the different genes spotted onto the microarray.
Newton’s method can be viewed as a hybrid of Chen’s and Sapir and Churchill’s
methods, because it models the variability on the slide very similarly to Chen’s
method, but the mathematical calculations are done with the EM-algorithm similar
to the one Sapir and Churchill used. Results are presented as log-odds ratios that
the gene is differentially expressed.
The method is implemented in the R language package sma. The results can
be obtained either as a genelist or in the form of a scatter plot. The scatter plot
example is given here (Figure 9.3).
Figure 9.3: A scatter plot with Newton’s method with the log-odds ratios 0, 1 and 2 cut
off lines marked. The vertical lines represent the cut offs for up- and downregulated genes.
Note that the axes of the plot are normalized log ratio (M) and average intensity (A).
Say we have found a gene that is threefold upregulated after the drug treatment.
How do we know that the result is not just an experimental error? We need to de-
termine the statistical significance of the upregulation of the gene. Significance can
only be assessed if replicate measures of the same gene were performed during the
experiment. The chips (or the expression of individual genes on them) can be compared
by the t-test (two conditions) or ANOVA (more than two conditions).
If the gene expression has been measured for control and treatment, but only treat-
ment has been replicated, the experimental error can be crudely estimated from
the variation between the replicates. The standard deviation is determined for ev-
ery gene, and the expression change is compared with the standard deviation. The
more the change exceeds the standard deviation between replicates, the more sig-
nificant the gene is. Note that this method does not allow for the calculation of
p-values.
The statistical basis for the t-test has been covered in detail elsewhere (see 6.9), but
briefly, the two-sample t-test looks at the mean and variance of the two distributions
(say, control and treatment chip log ratios), and calculates the probability that they
were sampled from the same distribution. The t-test can be applied successfully in
situations, where both the control and treatment have been repeated.
Here (Figure 9.4) we present an example of a graphical t-test result, produced
using the R language supplemented with the package sma.
[Figure 9.4: "T plots" with panels such as t vs. average A, t denominator vs. average A, and |t numerator| vs. average A; significant genes are marked with stars.]
Figure 9.4: Package sma automatically produces these four plots for t-test results. The
genes that have an expression pattern most significantly deviating from zero are high-
lighted with a colored star.
After calculating the t-test p-values for the replicated genes, the ones with the
lowest p-value (marked with a star in the images in Figure 9.4) can be saved into
a new genelist and used in further analyses, for example cluster analysis. These
are the genes that most significantly differ between two conditions, say control and
treatment mice.
2
1
0
−4 −2 0 2 4
log2(ratio)
Figure 9.5: Volcano plot, where the statistical significance from one-sample t-tests
[-log10(p)] is plotted against the normalized log ratio [log2(ratio)]. Vertical dashed lines rep-
resent a 2-fold difference between the control and the sample. The horizontal dashed line
represents the t-test p-value of 0.01, and the solid horizontal line represents p = 0.001.
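A volcano plot of this kind is easy to sketch in R; logratio and p below are invented stand-ins for the normalized log ratios and their t-test p-values:

logratio <- rnorm(2000, sd = 1.5)   # hypothetical log2 ratios
p <- runif(2000)^2                  # hypothetical p-values

plot(logratio, -log10(p), xlab = "log2(ratio)", ylab = "-log10(p)")
abline(v = c(-1, 1), lty = 2)       # 2-fold change cut-offs
abline(h = -log10(0.01), lty = 2)   # p = 0.01
abline(h = -log10(0.001))           # p = 0.001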
10 Cluster analysis of
microarray information
10.1 Basic concept of clustering
For a set of N genes to be clustered, and an NxN distance (or similarity) matrix,
the hierarchical clustering is performed as follows:
1. Assign each gene to a cluster of its own, so that the process starts with N clusters.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the distances (similarities) between the new cluster and each of the
old clusters.
4. Repeat steps 2 and 3 until all genes are merged into a single cluster (a minimal
sketch of this in R follows below).
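In R, the whole procedure is wrapped in hclust(); a minimal sketch on a hypothetical matrix of log ratios (genes in rows):

x <- matrix(rnorm(200), nrow = 40)   # 40 hypothetical genes, 5 chips

d  <- dist(x)                        # Euclidean distance matrix
hc <- hclust(d, method = "average")  # repeatedly merge the closest clusters
plot(hc)                             # draw the dendrogram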
Ideally, the clusters are chosen so that there is a minimum number of poorly fitting
genes in the clusters. The actual number of clusters is also difficult to assign, but
the square root of the number of genes is a good initial estimate. However, it
depends on the data set.
1. The genes are arbitrarily divided into K clusters, and the reference vector, i.e.,
the location of each centroid, is calculated.
2. Each gene is examined and assigned to one of the clusters depending on the
minimum distance.
3. When all genes have been assigned, the centroid locations are recalculated.
4. Steps 2 and 3 are repeated until all the genes are grouped into the final re-
quired number of clusters.
During the course of iterations, the program tries to minimize the sum, over
all groups, of the squared within-group residuals, which are the distances of the
objects to the respective group centroids. Convergence is reached when the objec-
tive function (i.e. the residual sum-of-squares) cannot be lowered any more. The
obtained groups are geometrically as compact as possible around their respective
centroids (Figure 10.4).
K-means partitioning is a so-called NP-hard problem (there is no known algo-
rithm that would be able to solve the problem in polynomial time), thus there is no
guarantee that the absolute minimum of the objective function has been reached.
Therefore, it is good practice to repeat the analysis several times using randomly
selected initial group centroids, and check whether these analyses produce compa-
rable results.
Figure 10.3: K-means algorithm. Genes are initially divided into K clusters, and the final
clusters are iterated from these.
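A minimal R sketch of this procedure, assuming expr is a gene-by-sample matrix (a hypothetical placeholder); the nstart argument repeats the run from several random sets of initial centroids and keeps the best solution:

k  <- round(sqrt(nrow(expr)))   # square-root rule of thumb from above
km <- kmeans(expr, centers=k, nstart=10)
table(km$cluster)               # sizes of the resulting clusters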
Principal component analysis (PCA) transforms the data into a smaller number of uncorrelated variables called principal components (Figure 10.5). The basic
idea in PCA is to find the components that explain the maximum amount of vari-
ance possible by n linearly transformed components. The first principal component
accounts for as much of the variability in the data as possible, and each succeeding
component accounts for as much of the remaining variability as possible.
PCA can also be applied when other information in addition to the actual ex-
pression levels is available (this applies to SOM and K-means methods as well).
The method provides insight into relationships, but no matter how interesting the
results may be, they can be very difficult if not impossible to interpret in biological
terms.
Figure 10.5: An example of principal component analysis. The two most significant
principal components have been selected as the axes of the plot.
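In R, PCA is available, e.g., through the function prcomp. A minimal sketch, again assuming a hypothetical gene-by-sample matrix expr:

pca <- prcomp(expr)
summary(pca)                 # proportion of variance per component
plot(pca$x[,1], pca$x[,2],   # the two most significant components
     xlab="PC1", ylab="PC2")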
In supervised methods, in addition to clustering information, other data is used to organize the genes. The
problem with supervised methods is that they may force the data to behave in a
certain way. After all, we cannot know what is the correct partitioning for a given
data set. The usefulness of the data for interpretation of biological processes is in
fact the true test for the correctness of clustering.
Clustering methods provide a relatively easy way to organize the cluster information. Together with visualization methods, they offer the user an intuitive way of looking at, understanding, and analyzing the data. Different clustering methods, and the same method with different premises, produce different end results, so the user has to try to find a useful result. In many articles the analysis is finished
at clustering. This is a mistake, since clustering is actually only a starting point for
more detailed and interesting studies.
For the clustering methods to work properly, the data should be well separated
meaning that clusters are easily separable. However, this is usually not the case in
microarray studies. Diffuse or interpenetrating groups cause problems in the de-
termination of cluster boundaries during the clustering, leading to different results
when using different methods and even the same method with different parameters.
Therefore, the user has to consider clusters as a working partitioning, not as the
final truth.
Single linkage is not the best approach for hierarchical clustering, because the
result can be remarkably skewed. The method is good for picking outliers that are
connected in the very last steps of the process. Complete linkage tends to produce
very tightly packed clusters. The method is very sensitive to the quality of the data. K-means is one of the simplest and fastest clustering methods. The result is dependent on the initial location of the centroids. This problem can be overcome by applying a number of initialization approaches.
10.8 Visualization
Clustering results are commonly visualized as colored matrices, where red indicates overexpression and green underexpression within each row. The color intensity represents the magnitude of deviation. It is naturally possible to use other colors, too, and it has been recommended to use a blue/yellow scheme instead to avoid problems for color-blind persons. Most visualization packages allow a free choice of the colors.
Red/green figures can be found in most microarray analysis articles. These figures provide a general overview of the data. With large data sets, it is not possible to identify or even mark individual genes in such figures. Often, more detailed figures of certain gene types or groups are provided, or the whole data set is placed in
a (web accessible) appendix. From the beginning of 2003, major journals have re-
quired the data to be submitted to data repositories (see chapter on MIAME/MGED
system).
Several programs are available for clustering and visualization of gene expression
patterns (Table 10.1). Here we present an incomplete list of programs that we
have found useful. Many programs are freely available on the Internet. Both the
GeneSpring and Kensington package available at CSC contain several options for
clustering. Despite the initial time required for learning the use of these programs,
they provide some benefits, mainly due to allowing several analyses to be done
within a single package. These programs are not as such any better than the freely available ones; however, they might have a more user-friendly interface.
Table 10.1: Freely available software for cluster analysis and visualization.
Program Description
Cluster Performs hierarchical clustering, self-organizing maps
SAM Significance Analysis of Microarrays: Supervised learning
ScanAlyze Processes fluorescent images of microarrays
TreeView Graphical results of analyses from Cluster
Expression Profiler Analysis and clustering of gene expression data
GeneCluster Self-organizing maps
J-Express Clustering and visualization
Genes that fall into the same cluster might have a similar transcription response to a
certain treatment. It is likely that some common biological function or role is acting
in the background. If the cluster has a number of genes with known and similar func-
tions, the characteristics of unknown genes in the cluster can be inferred. Function
prediction can be assisted with additional data on sequence similarity, phylogenetic
inference based on sequences, posttranslational modifications, cellular destination
signals, and so on. Taken together, a sufficiently large number of common features
can be used to make fairly accurate predictions of gene functions.
The clustering tool has four fields, which need to be filled in before running the
analysis (Figure 10.8). From top to bottom, the first box indicates the gene list to
be clustered. Initially the list is the same that was highlighted when the clustering
tool was invoked, but it can be changed from the navigator bar on the left. Next
box contains the information about the experiment to be clustered. This can also be
easily changed from the navigator bar on the left. The pull-down menu offers the
choice of three clustering algorithms (hierarchical, K-means and SOM). The setting
for the current analysis is in the box below the pull-down menu. For K-means, the number of clusters needs to be specified, as well as the number of iterations and the
desired measure of similarity.
GeneSpring has several different similarity measures, which fall into the fol-
lowing categories: correlation, confidence, and distance. The selection of the sim-
ilarity measure should be given some thought, because it significantly affects the
generated results. Pearson’s correlation emphasizes both over- and underexpressed
genes, and the Standard correlation finds especially overexpressed genes. Spear-
man’s correlation is highly similar to Pearson’s correlation except it uses ranks for
the calculation of the correlation coefficient (and is thus a nonparametric measure
of correlation). Distance measures the Euclidean distance between two gene expres-
sion profiles. It is calculated as the square root of averaged squared deviation of the
profiles. Spearman’s confidence measures the probability of getting a correlation
of S or higher by chance alone, if the true correlation is zero.
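These measures are easy to compute in R for two expression profiles. A minimal sketch with illustrative values:

x <- c(0.2, 1.5, -0.8, 2.1, 0.4)
y <- c(0.1, 1.2, -0.5, 1.8, 0.6)
cor(x, y)                       # Pearson's correlation
cor(x, y, method="spearman")    # Spearman's rank correlation
sqrt(mean((x - y)^2))           # square root of averaged squared deviation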
If there is only one measurement per gene (i.e., one chip), only the distance
measure can be used for the clustering of the genes. If there are two replicates of
118 DNA microarray data analysis
every gene, the Standard correlation can also be used. If there are three replicates,
Pearson’s correlation becomes available, and with five replicates the confidence
measures can be used.
In other words, the decision about the applied similarity measure depends on the biological question you are interested in and the number of replicates in your dataset.
After specifying the aforementioned settings, the run can be started by clicking the Start button at the bottom of the window. When the run has ended, you
can name and save the clustering result. The result appears on the main screen of
GeneSpring. Thereafter, all the results can be found from the navigator bar un-
der the folder classification. Note that the hierarchical clustering results are stored
under two different folders, Gene Trees and Experiment Trees.
After viewing the clustering results, you can get back to the original view by
selecting View->Unsplit window. Hierarchical clustering results can be dismissed,
for example, by selecting View->blocks.
If you want to compare PCA with some clustering results, you can right-click on one
clustering result in the navigator bar, and select Set as coloring scheme from the
appearing menu. A good clustering result often creates clusters that do not overlap
with each other in the PCA scatter plot.
Data mining
11 Gene regulatory
networks
11.1 What are gene regulatory networks?
11.2 Fundamentals
Figure 11.1: Graph G consisting of four nodes V1, V2, V3, V4, and three edges E(1,2), E(1,3), and E(3,4).
If there are several possible structures with respect to the experiments done so far, which
one is right? So a graph represents hypotheses based on current knowledge. By
locating central nodes or nodes of interest we can conduct additional experiments
to confirm the dependencies around nodes or get an idea which of the graphs is
the correct one. Choosing the best node to inspect more carefully is not trivial, at least not when we try to make distinctions between putative graphs and we try
to minimize experiment costs at the same time. To this category belongs also the
question: is the order of regulation correct? Manual perturbation, like the deletion
of a gene, might give a clear clue on the order of regulation. Finally, if we stick to the obtained structure, there are still questions about the strength of each regulation, and whether they are right.
But before all the above questions, we need to know how to construct such a
graph directly from data, because, in theory, any kind of an edge is possible between
any set of vertices, and so the problem is to infer the true connections based on the
data produced by the experiments. This approach is called reverse-engineering
of gene regulatory networks. It is interesting to note that, just as biochemical reactions, sequences, and three-dimensional structures have their own motifs, regulatory networks must have their own regulatory or circuit motifs built on the lower-level motifs.
To infer a regulatory network, several distinct expression samples of the genes
of discourse are needed. In a time series analysis, the expression levels are recorded
along fixed time points, while in a perturbation analysis, the expression levels are
recorded separately for each manual perturbation (over/under, deletion) of specific
genes. A fundamental difference between perturbation data and time series data is
that perturbation allows a firm inferring order of regulation while time series data
can only reveal the probable regulation direction. The problem can be overcome by
combining a few manual perturbation experiments with time series data. In both
cases, the genes are monitored at fixed points, inducing an expression matrix

{X_i[j] | 1 ≤ i ≤ n, 1 ≤ j ≤ m},

where n is the number of genes and m the number of measurements.
Table 11.1: Top, part of the table for time series expression data of genes SWI5, CLN2,
CLN3, and CLB1 [1]. The mean of the whole data was 0. Bottom, the expression levels are
discretized so that values higher than or equal to the mean get a value of 1 and values lower
than the mean get a value of 0.
gene t1 t2 t3 t4 t5 t6 t7 t8 t9
1 (SWI5) -0.41 -0.97 -1.46 0.16 0.74 0.72 1 0.77 0.3
2 (CLN3) 0.49 0.62 0.05 -0.13 0.02 0.04 -0.14 0.24 0.91
3 (CLB1) 0.6 -0.53 -1.37 1.03 1.13 1.27 1.04 1 0.07
4 (CLN2) -1.26 1.6 1.54 0.31 -0.14 -0.88 -1.7 -1.88 -1.7
1 (SWI5) 0 0 0 1 1 1 1 1 1
2 (CLN3) 1 1 1 0 1 1 0 1 1
3 (CLB1) 1 0 0 1 1 1 1 1 1
4 (CLN2) 0 1 1 1 0 0 0 0 0
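The discretization of Table 11.1 is a one-liner in R; for example, for the SWI5 row:

swi5 <- c(-0.41, -0.97, -1.46, 0.16, 0.74, 0.72, 1, 0.77, 0.3)
ifelse(swi5 >= 0, 1, 0)   # gives 0 0 0 1 1 1 1 1 1, as in the table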
Table 11.2: Conditional probabilities based on the structure of G and the bottom values of Table 11.1.

Pr(X1 = 1) = 6/9

X1   Pr(X2 = 1)   Pr(X3 = 1)
1    4/6          1
0    1            1/3

X3   Pr(X4 = 1)
1    1/7
0    1
At most n·2^k parameters are needed when each node has at most k parents, i.e., one "success" probability for each row in each table just calculated. This is a clear advantage over the O(2^n) parameters needed when the expression value of a node may depend on the expression values of all other nodes. Thus, the parameters pertain only to local interactions.
Underlying the above discussion there are some basic probability rules that are
examined next. The joint probability for random variables X and Y (not necessarily
independent) is calculated using the conditional probability:
Pr(X ∩ Y) = Pr(X, Y) = Pr(Y | X) Pr(X) = Pr(X | Y) Pr(Y),
where notation Pr(X | Y ) denotes the probability of observation X after we have
seen evidence Y , see Figure 11.2. Reorganizing the equation induces Bayes’ rule
Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y),
while generalization of it forms the chain rule
Pr(K, X, Y, Z) = Pr(Z | K, X, Y) Pr(K, X, Y)
= Pr(Z | K, X, Y) Pr(Y | K, X) Pr(K, X)
= Pr(Z | K, X, Y) Pr(Y | K, X) Pr(X | K) Pr(K).
If variables X and Y are independent, then
Pr(X ∩ Y) = Pr(X) Pr(Y),
and if variables X and Y are independent given a value of Z, then
Pr(X | Z, Y) = Pr(X | Z).
Based on our dependency graph G, the probability Pr(X1, X2, X3, X4) can be simplified as follows:
Pr(X1, X2, X3, X4) = Pr(X1) Pr(X2 | X1) Pr(X3 | X1) Pr(X4 | X3).
For example, Pr(X4 | X1, X2, X3) = Pr(X4 | X3), because given the value of X3, X4 is independent of the values of X1 and X2. So, it is enough to consider only
direct local dependencies from parents when evaluating the probabilities (for X 4 the
only parent is X 3 ). We see that in Bayesian networks the probability calculations
follow in a natural way the structure of a graph.
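The conditional probabilities of Table 11.2 can be reproduced directly from the discretized values of Table 11.1. A minimal R sketch:

x1 <- c(0,0,0,1,1,1,1,1,1)   # SWI5
x2 <- c(1,1,1,0,1,1,0,1,1)   # CLN3
x3 <- c(1,0,0,1,1,1,1,1,1)   # CLB1
x4 <- c(0,1,1,1,0,0,0,0,0)   # CLN2
mean(x1)                     # Pr(X1 = 1) = 6/9
mean(x2[x1 == 1])            # Pr(X2 = 1 | X1 = 1) = 4/6
mean(x3[x1 == 0])            # Pr(X3 = 1 | X1 = 0) = 1/3
mean(x4[x3 == 1])            # Pr(X4 = 1 | X3 = 1) = 1/7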
Figure 11.2: A Venn diagram of the probability space Ω, where the areas denote probabilities; for example, Pr(X) is the left circle area divided by the total area of the rectangle Ω. It is easy to see that Pr(X ∩ Y) = Pr(Y | X) Pr(X).
By using the structure of G (see again Figure 11.1) we can decompose our opti-
mization problem to independent subproblems:
L(θ : D) = ∏_{j=1}^m Pr(X1[j] : θ1) · ∏_{j=1}^m Pr(X2[j] | X1[j] : θ2) · ∏_{j=1}^m Pr(X3[j] | X1[j] : θ3) · ∏_{j=1}^m Pr(X4[j] | X3[j] : θ4),
or, in general, L(θ : D) = ∏_i Li(θi : D).
For the first factor,

L1(θ1 : D) = Pr(D : θ1) = ∏_{j=1}^m Pr(X1[j] : θ1) = θ1^N(X1=1) · (1 − θ1)^(m−N(X1=1)),
where N(·) denotes the number of each occurrence in data with respect to the re-
quirements inside parentheses. The estimator θ̂1 for θ1 maximizing the product is
simply the already calculated probability for Pr(X1 = 1), i.e.,

θ̂1 = N(X1 = 1) / (N(X1 = 1) + N(X1 = 0)) = 6/9.
Similarly, the conditional estimators for the parameters θ(2|X1), θ(3|X1), and θ(4|X3) are the probabilities already calculated. Together these estimators form θ̂ = {θ̂1, θ̂(2|X1), θ̂(3|X1), θ̂(4|X3)}, which is the best explanation for the observed data D restricted to the structure of G, and from which we can compute the maximal likelihood value for graph G. Note that we have dropped out terms of the form 1^x = 1 and 0^0 = 1.
Using more states for genes, like "low", "medium", and "high", follows the multinomial theory, which conveniently generalizes the binomial theory. Using the pa-
rameters, we can formulate hypothetical behaviors of genes in different situations
fixing some expression levels and observing others; moreover, updating the param-
eters is easy as new data arrives.
Here, Parents(Xn) denotes the expression values of all parents of node Vn, and θn^G denotes the parameter of node Vn in graph G. Instead of direct calculation, it is more convenient to use the logarithm of L(G, θ^G : D) when comparing different structures.
After simplifying, we have a surprisingly simple formula:

log L(G, θ^G : D) = m ∑_{i=1}^n ( I(Vi, Parents(Vi)) − H(Vi) ).
Figure 11.3: A different structure explaining conditional dependencies between the ex-
pression values of nodes V1 , V2 , V3 , V4 .
The entropy term H(Vi) can be ignored because it is the same for all network structures. The function I(Vi, Parents(Vi)) is the mutual expression information between node Vi and its
parent nodes Parents(Vi ) = {P1 , P2 , . . .} defined by
I(Vi, Parents(Vi)) = ∑_{x=0,1} ∑_{(∀k): xk=0,1} Pr(Xi = x, P1 = x1, P2 = x2, ...) lg [ Pr(Xi = x, P1 = x1, P2 = x2, ...) / ( Pr(Xi = x) ∏_k Pr(Pk = xk) ) ].
Function I(Vi , Parents(Vi )) ≥ 0 measures how much information the expression val-
ues of nodes Parents(Vi ) provide about V i . If Vi is independent of parent nodes,
then I(Vi, Parents(Vi)) has the value of zero. On the other hand, when Vi is totally predictable for given values of Parents(Vi), then I(Vi, Parents(Vi)) reduces to the entropy function H(Vi). It should be noted that in general I(X, Y) ≠ I(Y, X), so the direction of edges matters.
Is the structure presented in Figure 11.1 optimal with respect to the data? What about the structure in Figure 11.3? Because our data set is small, we do not resort to the above calculations, but calculate the parameter set θ̂ directly. The comparison of these figures with the previous ones shows that the new graph is more probable with respect to the data.
11.6 Conclusion
Our example data was complete without any missing observations of expression
values, which happens quite rarely in reality. Two of the most familiar conventions to deal with the problem of missing expression values are to omit the data rows missing some values or to substitute the most common values for the missing ones. What the best policy is, of course, is an everlasting topic of debate.
An even more difficult problem is to decide how the discretization from expression values to logical "on/off" values should be done. Can we really use a global step value like in our example, or should we use gene-specific values? On the other hand, can we use a single limit at all, given that a small variation in measurements can then toss a gene between the "on" and "off" states without any real reason?
Table 11.3: Software for Bayesian network analysis.

Bayesia                        www.bayesia.com
JavaBayes                      www-2.cs.cmu.edu/~javabayes
PowerConstructor family        www.cs.ualberta.ca/~jcheng/bnsoft.htm
Bayesian Knowledge Discoverer  kmi.open.ac.uk/projects/bkd
GeNIe                          www.sis.pitt.edu/~genie
Prevision                      www.prevision.com
Hugin                          www.hugin.com
Norsys                         www.norsys.com
Knowledge Industries Company   www.kic.com
Microsoft MSBN system          www.research.microsoft.com/msbn
One cure for the last problem might be to use fuzzy logic, where a gene has an
increasing grade of being “on” and after a fixed point it is completely or fully “on”;
similarly for “off” but reversed.
Today, there are some Internet-based servers set up for inspecting reverse-engineering results based on researchers' own data, as well as a few software packages; see Table 11.3. Our better graph was calculated using a server hosted by the Helsinki University of Technology (B-Course). In the future, some of these tools will be integrated into common gene expression software tools like GeneSpring and Kensington.
2. Pe'er, D., Regev, A., Elidan, G., Friedman, N. (2001) Inferring subnetworks from perturbed expression profiles, Bioinformatics 17, S215-S224.
3. D'haeseleer, P., Liang, S., Somogyi, R. (2000) Genetic network inference: from co-expression clustering to reverse engineering, Bioinformatics 16, 707-726.
12 Data mining for promoter sequences
12.2 Introduction
Changes in the expression levels of genes can be due to a number of factors. First,
there are such large-scale factors as chromatin accessibility: some regions of DNA
are packed too tightly to allow strand separation or even binding of various factors.
One well-known mechanism of chromatin packing is found in methylation of cy-
tidines in CpG islands to effect gene silencing via heterochromatin formation, and
another one is related to histone acetylation/methylation. Second, there are specific
transcription factors that may inhibit or activate transcription. The balance between
activating and inhibiting transcription factors can determine the expression level in
loosely packed euchromatin. Third, there are some genes that seem to be on “by
default” without much apparent regulation, the so-called maintenance or household
genes which are needed in all cell types. “Household gene”, however, does not im-
ply constant expression; in any gene the expression level will fluctuate over time
due to the general metabolic activity of the cell, especially the activity and amount
of parts in the transcription machinery itself. If maintenance genes are used for
normalization of array data it is necessary to verify the constant expression, e.g., by
RT-PCR in the system and conditions of the experiment. (See also the chapter on
130 DNA microarray data analysis
normalization.)
Data mining for gene regulation is currently feasible only in the second case
above, i.e., in search and study of transcription factor binding sites.
Most known transcription factor binding sites are located close to the tran-
scription start site (TSS), particularly in the 500 bp directly upstream (5’) of TSS.
This upstream region is often referred to as the promoter region (but sometimes
the binding sites themselves are called promoters). Other regions that activate tran-
scription (the enhancers) may occur almost anywhere downstream (3’) or upstream
from the gene, even 20 kb away from the promoter region, or in the introns. The
methods for the search of regulatory elements that we describe could be used for
potential enhancer regions as well as for the promoter regions, but commonly only
promoters are analyzed. This is because a larger search space weakens the signal-to-noise ratio. Therefore, we limit our discussion to data mining in promoter regions
only.
Coexpressed genes are often grouped by different clustering methods. It is
reasonable to present the hypothesis that some of the similar changes of expression
are due to the effect of the same or similar transcription factors – therefore it makes
sense to search for shared sequences that could be transcription factor binding sites.
Of course, depending on your clustering (cluster sizes and number), you may have
several mechanisms working within one cluster, or you may have the same mech-
anism affecting members of several clusters, but in any case, at least some clusters
may show enrichment of similar transcription binding sites.
In brief, it is presumed that coexpression implies coregulation, and looking for
shared patterns in promoter regions allows you to formulate hypotheses of regula-
tion mechanisms for focused experimental testing. However, before verification,
your results will be only sets of potential promoter sites of potentially coregulated
genes.
• the data regarding your filter or chip may have been updated after it left the
factory, so the shipped gene lists may not be as complete and up-to-date as
you can find at the manufacturer’s web site. You should definitely work with
the most recent data available as long as it refers to the same manufacturing
batch that you actually used,
• you may find several RefSeq mRNA and protein codes for a single gene
locus, corresponding to alternatively spliced forms,
• the provided EST code is not always the actual sequence on the array, be-
cause the manufacturer may have used a longer and more correct version of
the 5’-end of the corresponding mRNA.
The ultimate source for the promoter regions is in the genome data. The anno-
tations of gene locations are still incomplete, and there are no ready-made tools for
pulling out the regions preceding each gene. Currently (early 2003) we have found
that human promoters are easiest to retrieve in large scale from three data sources.
They are multi-species resources, so when more genomes become available, the
same approaches work.
For a smaller number of genes the most reliable data (regarding the place-
ment of TSS) is available in the Eukaryotic Promoter Database (EPD), which con-
tains promoter regions of 500 bp, only from genes with an experimentally verified
TSS. Organism-specific promoter databases may exist to fit your needs, consult the
genome site of your favourite organism or Michael Zhang’s lab at http://rulai.
cshl.org/software/index1.htm.
If these sources fail, or if you feel you cannot trust all of your findings, you can
always resort to using your own similarity searches to place the gene in the genome
assembly, and then pick the region preceding the gene, but this is far from straight-
forward. It may be advisable to drop data for which you cannot get unambiguous
gene mapping. Currently full genome information is in most cases not available for
all genes contained in the microarrays.
If you aim at finding your promoter sequences from the UCSC upstream data
sets, you need the corresponding RefSeq mRNA accession codes, because these
codes are used as identifiers for the promoter sequences. You can also retrieve
Ensembl genes with RefSeq codes. Figure 12.1 shows you the paths and tools for
arriving at RefSeq codes from other data.
Figure 12.1: Different data items that you may receive from your microarray manufacturer
and how you can find missing data items and the actual sequences.
• CSC also offers tools for fast BLAST searches (gepardi.csc.fi) and mass retrieval of sequences in FastA format (Seqret in the EMBOSS package).
If you want to be really certain that you know which genes you are dealing
with, you should follow several paths in the diagram to see if they all lead you to the same Locus IDs. Remember that there may be several RefSeq mRNAs per gene,
whereas Locus IDs are unique.
The data files from GeneLynx (http://www.genelynx.org/cgi-bin/
a?page=info) are likely to allow many shortcuts to obtain RefSeq mRNA codes,
but we have not tried this yet. Likewise, querying of various UCSC genome data
tables seems a promising option.
If you work with the Affymetrix Arabidopsis chips, you may want to look at
VIZARD (http://www.anm.f2s.com/research/vizard/), which includes an-
notation and upstream sequence databases for the majority of genes represented on
the Affymetrix Arabidopsis GeneChip array. The Whitehead Institute at MIT provides upstream sequence data for Neurospora crassa and several other organisms, see e.g. http://www-genome.wi.mit.edu/cgi-bin/annotation/neurospora/download_license.cgi
Yet another option for retrieving upstream sequences is found at the EnsMart ser-
vice (http://www.ensembl.org/EnsMart/) of the Ensembl project. This is lim-
ited only by the number of Ensembl genes that are annotated (currently almost
25,000), and it is very easy to use, as illustrated in Figure 12.2. The performance of
EnsMart v. UCSC upstream data searches is compared later in this chapter. There are several species to select from.
134 DNA microarray data analysis
Next, enter your list of gene identifiers in the Filter step (Figure 12.3):
Figure 12.3: EnsMart filtering. In addition to RefSeq, there are other options, too, but you should be aware that some mappings are more complete than others. Next to internal Ensembl references, RefSeq is your best choice. A long list of other filtering options is omitted from the figure.
Then, in the next phase, you should choose output as Sequences, and select your options as below to retrieve your upstream sequences in FastA format (Figure 12.4). Additional options for data compression, saving locally, etc. are not shown.
For a few microarrays (currently only two human Affymetrix chips), EnsMart
provides direct mappings, so you can skip all the previous steps for finding a con-
sistent set of gene identifiers. Instead, you can choose the genes in these chips
directly in the Filter step (Figure 12.5).
We can expect such mappings with microarray contents to become more com-
mon in the genome sites, both at Ensembl and at UCSC. Therefore, some day
retrieving the upstream sequences will be less of a problem.
Figure 12.5: Selecting genes that are represented on AFFY-HG-U133. Compare with Fig.
12.3.
Table 12.1: Search of upstream sequences from the precomputed upstream data set of
UCSC (http://genome.ucsc.edu/) and an interactive search from Ensembl data via
EnsMart (http://www.ensembl.org/EnsMart/)
UCSC data was from the November 2002 freeze of the human genome, and Ensembl data was as of 31 March 2003. Table 12.1 summarizes our findings.
There seem to be two lessons to be learned. First, there is some duplication,
so finding the real gene and promoter sequence is not unambiguous even if you
know a code in a curated collection and use the most authoritative data sources.
Second, currently it is useful to use both services, even though both data sources
are expected to become more complete in the near future as the human genome
data becomes more fully assembled and annotated. The situation may be different
for your favourite organism.
We did not make any comparison whether the sequences or genome locations
matched between the two data sets, but we would guess there are some differences.
Two final warnings regarding upstream data:
• in some cases the genome assembly may be broken close to the start of your
gene, so that you do not get a full 1000 (or whatever) nucleotides of sequence.
In Ensembl this shows as a long string of Ns in the sequence (this may be
another reason why Ensembl gives more sequences)
• the start of a RefSeq mRNA may not always correspond to a true transcrip-
tion start site. There are mRNAs or ESTs which give evidence of longer
transcript variants than some of the current RefSeqs. UCSC genome browser
only includes entries with a certain TSS in their upstream1000 data set.

Searching known patterns
Once you have found the genes and their 5’ regions you can proceed to the
analysis of patterns in the regions.
If you want to run a check of known transcription factor binding sites, the Transcription Regulatory Regions Database (TRRD, http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/trrdintro.html) is a site with lots of information, but only for on-line searching. The complete database is not available. The Transcription Factor Database (TFD) and some tools that utilize the data are available at http://www.ifti.org/.
Transfac database ver. 5.0 (from 2001) is said to be publicly available for
academic use (http://transfac.gbf.de/TRANSFAC/), but currently most of the
data and services seem to be closed. The EMBOSS program tfscan (http://www.
csc.fi/molbio/progs/emboss/Apps/tfscan.html) can be used to find Trans-
fac sequence patterns in your sequences once Transfac is set up in the EMBOSS
environment.
The most recent versions of Transfac database and its search tools have moved
into a commercial environment at http://www.gene-regulation.de/. They of-
fer PC software for promoter analysis, Transplorer (Figure 12.6). Their “Transfac
Professional” package contains search tools, too.
Whatever method you choose to search for TF binding sites, you have to bear
in mind that they are short patterns that will be found very frequently by chance
alone, so you need to confirm the statistical significance of your findings. A search
strategy which aims at finding clusters of several sites might work better in this
respect.
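A rough calculation in R shows the scale of the problem, assuming equal base frequencies and a fixed 6-bp site searched on one strand of 500-bp promoters:

(500 - 6 + 1) * (1/4)^6         # about 0.12 expected matches per promoter
100 * (500 - 6 + 1) * (1/4)^6   # about 12 chance matches in 100 promoters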
12.7 Summary
In summary, we identified a similar expression profile of certain genes using DNA microarrays and clustering tools. We hypothesized that the similar expression was due to common regulatory elements situated in the promoter regions of the genes. We retrieved the promoter sequences from the databanks, analyzed the promoter regions, and successfully identified a common element in their promoter regions.
The search can be run for all new sequences or for a specific sequence. You can also select the length of the sequence to be considered a promoter region, how long a regulatory element is searched for, and how many unknown bases are allowed. The longer the sequence, and the larger the number of unknown bases, the longer the analysis time. You have control over the probability statistics: the p-value cut-off for a significant pattern can be modified. Whether the significance is calculated relative to the sequence upstream of other genes or relative to the whole genomic sequence can also be modified. The first option is far more common.
After the analysis has completed, or you stop the search, the results are reported. They appear on the right side of the toolbox. Potential regulatory sequences, the number of genes they were detected in, and the detection p-value are reported. The best findings are reported first.
2. Roth, F. P., Hughes, J. D., Estep, P. W., Church, G. M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939-945.
3. Vilo, J., Brazma, A., Jonassen, I., Robinson, A., Ukkonen, E. (2000) Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 384-394.
This chapter was written by Martti Tolvanen, Mauno Vihinen and Jarno
Tuimala (GeneSpring examples).
13 Annotations and article mining
Sequences are usually annotated when they are submitted to databanks, e.g., Gen-
bank or EMBL. Sequences in Genbank and EMBL are computationally annotated
and might contain errors. Some sources of annotations, like SWISS-PROT and Ref-
Seq, contain curated annotations for the sequences, and the information is probably
more reliable than in Genbank. Annotations for individual genes can easily be
retrieved from the databanks using the sequence accession numbers.
Also Locuslink and Unigene, which are available at NCBI (http://www.
ncbi.nlm.nih.gov/), contain valuable information about gene functions. Lo-
cuslink presents information on official nomenclature, aliases, sequence accessions,
phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map lo-
cations, and related web sites. This information can be retrieved using common
gene names or Locuslink accession numbers. In addition, GeneCards ( http:
If sequences, or at least Genbank accession numbers for the genes of interest are
available, their annotations can be updated using BLAST. After performing a (stan-
dard) BLAST search, you can pick three to five best scoring hits, and check what
annotations they have. Likely, your query sequence has functions similar to the
best hits you find from the database.
Articles and abstracts contain masses of valuable information about the functions
of the genes, but the data as such is difficult to approach. Of course, individual
searches with gene names can be made from PubMed (http://www.ncbi.nlm.
nih.gov/Pubmed/), but this is not feasible if the number of genes or published
articles is high. Article mining makes the retrieval of gene functions easier. Using
a suitable query term, e.g., a gene name, a list of functions or common keywords
identified from the articles and abstracts can be produced.
MedMiner http://discover.nci.nih.gov/textmining/main.jsp is an ar-
ticle mining tool for biomedical information. Currently, MedMiner combines the information of the Weizmann Institute's GeneCards and PubMed. Searches about gene,
gene-gene or gene-drug functions and interactions can be made. MedMiner does
not necessarily contain information about all human genes, because the genes in the
current version of the program have been selected from a certain DNA microarray.
National Cancer Institute also offers many other text mining tools, like GO-
Miner (http://discover.nci.nih.gov/gominer/), which helps to interpret the
human -omic data (including DNA microarrays) by classifying the genes into bio-
logically coherent categories based on GO ontologies. Classifications can then be
assessed using the tool.
Note that article mining is not a substitute for getting familiar with the rele-
vant articles, but it enables one to easily infer probable gene functions from the
established literature.
142 DNA microarray data analysis
13.4.1 Annotations
GeneSpring contains a built-in annotation tool (Figure 13.1), which can be used if
Genbank accession numbers for the genes on the chip are known. If you do not
have Genbank accession numbers, don’t panic, because RefSeq and SWISS-PROT
(and many other) accession numbers can be converted to Genbank IDs using the
Ensembl database or other servers available on the Internet.
Moreover, when importing the data to GeneSpring, you have to specify which
column in your datafile contains the Genbank accession numbers. This column is
marked in GeneSpring as GenBankID.
The annotation tool GeneSpider can be accessed from the menu Annotation-
>GeneSpider. GeneSpider has four different options: you can annotate your chip
using either Genbank, Unigene or Locuslink database or Silicon Genetics’ mirror.
It is recommended to use the Silicon Genetics’ mirror, because it contains all the
combined information of three other databases. It is also faster to update annota-
tions using the mirror server.
GeneSpider contains several settings. First of all, you need to select the col-
umn containing Genbank IDs. Additionally you need to select the sources for the
annotations. Annotations from all the selected sources can be combined. Alter-
natively, only the highest priority annotations are used. Of course, the priority of
the databases can be modified. Usually, the existing annotations are overwritten,
but you can also decide not to do so. In addition to annotations, it is possible to re-
trieve sequences, but it can be very time-consuming, especially if you are analyzing
human data.
It is also possible to request GeneSpring genomes for some commercial chips,
e.g., Affymetrix, directly from Silicon Genetics. This often cuts down the hassle with annotations and Genbank IDs tremendously.
13.4.2 Ontologies
After retrieving annotations for your genes, you can build a simplified ontology.
The simplified ontology creates a number of new gene lists, which fall into three categories: molecular function, biological process, and cellular component of gene products, as specified in the GO ontology. The tool is located in Annotations-
>Build simplified ontology. The simplified ontology tool provided by GeneSpring
is, as the name indicates, a simplified version of the actual GO ontology, and it
is not updated as often as the official GO ontology is, so you should consider the
formed genelist groupings suggestive.
The Build simplified ontology tool is handy if you are interested in a certain gene
type. For example, if you are interested in transcription factors, you can use the
automatically created gene list (transcription factors) as a basis for further analyses.
You can create a similar gene list using text filtering tools, but it is much more
convenient to let GeneSpring do the work for you.
Figure 13.1: GeneSpring annotation tool GeneSpider. Using the current settings, data
from GenBank, Unigene and Locuslink will be searched, but only the Unigene annotations
will be saved. Moreover, old annotations will not be overwritten, and no sequences will be
retrieved.
14 Reporting results
14.1 Why the results should be reported
Although microarrays work best as a screening tool, sometimes the final results that
you wish to publish lean largely on the microarray data.
There are several good reasons why you should make your microarray data
publicly available. It will be beneficial for you and others. Analyzing several good-
quality, well-documented experiments together will ease the task of finding actual correlations in gene expression. Since the microarray experiments are
expensive, adding publicly available information from other microarrays into your
own findings will save money.
You should think of publishing your microarray results already from the very
beginning of the experimental work. Later on, it may be very difficult to “mine your
notebook” for all the details on how you performed your experiments and how you
handled your data.
Microarray data without detailed information about the experimental condi-
tions and analysis steps is useless, and it should always be accompanied with that
supportive information. To be able to compare your data with other researchers’ ex-
periments, all this information has to be reported in a common way, using a widely
accepted form. Such a form is called Minimum Information About a Microarray
Experiment, MIAME.
Furthermore, two major scientific journals, Nature and Cell, as well as EMBO
Journal, have already set a condition that the microarray data in the submitted paper
have to be MIAME-compliant and publicly available, before they accept the paper
for publication (see instructions for authors at http://www.cell.com and http:/
/www.nature.com, and Nature opinion article “Microarray standards at last”, Na-
ture 419, 323 (26 Sep 2002)). In a similar way, The Lancet has adopted MIAME
guidelines (http://www.mged.org/Workgroups/MIAME/miame_checklist.html).
Probably this demand will become a prerequisite for other journals as well, in the same way as publishing your sequence data in public databases, GenBank and EMBL.
The MIAME standard outlines the minimum information that should be reported about a microarray experiment to enable its unambiguous interpretation and reproduction. The latest version of the MIAME standard can be found from the MGED web site.
Figure 14.1: The MIAME organization. A schematic representation of the six components
of a microarray experiment (A. Brazma et al. 2001)
The MIAME includes a description of the following six sections (Figure 14.1):
• Array design: each array used and each element (feature, reporter, composite group) on the array, and the protocols used
Microarray experiments can be annotated based on this ontology. MGED ontology is used as a basis for
building the MAGE object model and markup language. MGED ontology is an
integral part of the MIAME standard, and they are being developed together.
14.3.1 MAGE-OM
MAGE-OM (microarray gene expression object model) is an Object Model for
the representation of microarray expression data that facilitates the exchange of
microarray information between different data systems and organizations.
MAGE-OM is a data-centric model that contains 132 classes grouped into 17
packages containing, in total, 123 attributes and 223 associations between classes.
Entities defined by the MIAME standard are organized as MAGE-OM components
into packages, such as Experiment, BioMaterial, ArrayDesign, BioSequence, Ar-
ray, and BioAssay packages, and the gene-expression data into a BioAssayData
package. The packages are used to organize classes that share a common purpose,
and the attributes and associations define further the classes and their relations.
A database created using MAGE-OM is able to store experimental data from
different types of DNA technologies such as cDNA, oligonucleotides or Affymetrix.
It is also capable of storing experiment working processes, protocols, array designs, and analysis results. MAGE-OM has been translated into an XML-based data
format, MAGE-ML, to facilitate the exchange of data.
14.3.3 MAGE-STK
The MAGE Software Toolkit is a collection of Open Source packages that im-
plement the MAGE Object Model in various programming languages. The suite
currently supports three implementations: MAGE-Perl, MAGE-Java, and MAGE-
C++. The idea is to be able to have an intermediate object layer that can then be
used to export data to MAGE-ML, to store data in a persistent data store such as a
relational database, or as input to software-analysis tools.
14.4.2 MIAMExpress
MIAMExpress (http://www.ebi.ac.uk/miamexpress/) is a MIAME-compliant
microarray data submission tool. It has a web interface that allows you in a step-by-
step manner to submit your microarray data to ArrayExpress database. First you
14 Reporting results 149
will create an account, log in to that account, and then select a submission type:
Protocol, Array design or Experiment. Protocol and Array design can be submitted
individually, but the Experiment always needs to be accompanied by details about the Protocol and Array Design to complete the submission.
In addition to MIAME attributes, you can add other qualifiers. If the microar-
ray data is being submitted to a journal, the journal name and status (submitted/in
press/accepted) should be indicated. After filling in the information about a certain protocol and array design, you next link that information to your experiment. The actual experiment results are submitted as scanned files (raw data), which can be either CEL files for Affymetrix data, or .gpr files, which contain two wavelengths in the same file, for others. If you wish, you can also add a data file corresponding
to the transformation of all the results generated from the raw datafiles in a given
experiment. In that case, the submitted protocol should describe all steps necessary
to allow a third party to recreate the transformed data file.
14.4.3 GEO
The GEO web interface works in a similar manner, although the vocabulary differs: first you log in, identify yourself, and then define your platform (array type and specifications, annotated gene list as a tab-delimited text file, organism used, etc.) using pop-up menus. Next you bring in sample data (scanned file) in either of
two standard sample data table formats, one for one- and another for two-channel
data. These tables contain columns with standard headings, which can be either re-
quired or optional, and non-standard headings, which are user-defined and always
optional. User-defined columns may contain useful information for the submit-
ter and other users, but queries on these columns are not supported within GEO.
Samples can currently be of four types: single channel, dual channel, comparative
genomic, or SAGE. Samples (yours and others in the GEO database) can be grouped in series, such as dose-response or time course series, or as a repeat samples group.
2. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygu-
nawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Garcia, L. G.,
Oezcimen, A., Rocca-Serra, P., and Sansone, S-A. (2003) ArrayExpress-a
public repository for microarray gene expression data at the EBI, Nucleic
Acids Research 31, 68-71.
3. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Om-
nibus: NCBI gene expression and hybridization array data repository, Nu-
cleic Acids Research 30, 207-10.
5. Gollub, J., Ball, C. A., Binkley, G., Demeter, J., Finkelstein, D. B., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J. C., Schroeder, M., Brown, P. O., Botstein, D., and Sherlock, G. (2003) The Stanford Microarray Database: data access and quality assessment tools, Nucleic Acids Research 31, 94-96.
6. Saal, L. H., Troein, C., Vallon-Christersson, J., Gruvberger, S., Borg, Å., and
Peterson, C. (2002) BioArray Software Environment: A Platform for Com-
prehensive Management and Analysis of Microarray Data, Genome Biology
3, software0003.1-0003.6.
7. Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, D., Senger, M., Aronow, B. J., Robinson, A., Bassett, D., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML), Genome Biology 3, research0046. http://genomebiology.com/2002/3/9/research/0046.1
15 Software issues
Traditionally, biological data, including DNA and amino acid sequences and DNA
microarray results, have been stored in a flat file format. Flat files are basically just
text files, which have been formatted in such a way that programs can read their
contents. Flat files are not the ideal solution for long term storage of large data sets.
They also cause problems when the same data needs to be analyzed using several
programs that use a different kind of file format.
The data standardization and storage in central databases (see chapter 14 for
more details) is one solution to the data format issues. However, this solution has
not yet fully matured, and most of the programs used for the DNA microarray
data analyses are not able to read the data directly from these databases or even
import the XML-format the databases commonly produce. Databases and programs
will surely learn to cooperate better over time. The standardization and storage of
the data is currently under heavy development, and many software developers are implementing their own solutions to these problems. However, the solutions are not always MIAME compliant.
Especially if freeware tools are used for the analysis, there is often a need to export the partly analyzed data from one program to another. More often than not, the datafile formats are not compatible, which means additional and tedious conversion work. Data can quite easily be converted between formats using Microsoft Excel
or any other spreadsheet program. However, the work load becomes quickly un-
bearable when the amount of data increases. Excel has some macro programming
capacities, but there exists also other, and possibly better tools, if one is interested
in programming (see section on programming). In addition, Excel does not tolerate
huge data files (more than 65 536 rows or 256 columns). Sometimes it becomes
necessary to transpose (switch the rows and columns) the data file, and if there are
more than 256 genes, Excel can’t be used.
If the data needs to be imported into several programs at the same time, it is often
easier if a “standard file” is first generated from the files written by the chip scanner.
Standard files can be produced in Excel or another spreadsheet program easily.
Briefly, the first column of the standard file should contain the gene identifiers, for
example their common names. The second column should consist of the sequence
accession numbers, Genbank accession numbers are probably the most usable ones.
The next four columns should contain the spot and background intensities of the
used colors. Depending on your analysis needs, it is also possible to use ratios, for
example normalized log ratios, instead of intensity values. The standard file should
be saved in a tab-delimited text format.
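A minimal R sketch of writing such a standard file; the data frame scandata and its column names are hypothetical placeholders for whatever your scanner software produces:

std <- data.frame(name  = scandata$gene.name,
                  acc   = scandata$genbank.acc,
                  ch1   = scandata$ch1.intensity,
                  ch1bg = scandata$ch1.background,
                  ch2   = scandata$ch2.intensity,
                  ch2bg = scandata$ch2.background)
write.table(std, "standard.txt", sep="\t", quote=FALSE, row.names=FALSE)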
15.3 Programming
When the capacities of the spreadsheet program are not sufficient, some program-
ming tools can be used instead for the datafile management purposes or for data
analyses. There are actually several programming languages that are especially
suited for the data file formatting and other text file manipulations.
15.3.1 Perl
One of the most common languages is Perl, which has very powerful and easy to
use text manipulation tools. Perl is available for free for UNIX and Linux (http://www.perl.org) and PC machines (http://www.activeperl.com). Perl is a pro-
gramming language, which means that getting to know it takes some time and ef-
fort. However, if data conversion tools are needed everyday, it would definitely be
worthwhile to befriend Perl.
To illustrate how easily text can be manipulated using Perl, we present a short
example. The next code produces a complementary DNA sequence from the origi-
nal sequence, which has been stored into the text string $sequence. Both the origi-
nal sequence and its complement are printed on the computer screen.
The program starts with a line that tells the computer where The Perl software
can be found. The next four lines contain the actual commands that manipulate
the DNA sequences. Note that every command line has to end with a semicolon.
Function tr makes a complementary sequence from the original one, and function
print outputs the result to the screen.
#!/usr/bin/perl
# Store the original DNA sequence in a text string.
$sequence="aaattcgagtaggtcaggcat";
print "Original: $sequence\n";
# tr/// replaces every base with its complement (a<->t, c<->g).
$sequence=~ tr/acgtACGT/tgcaTGCA/;
print "Complementary: $sequence\n";
You can use the pico editor on CSC's Cedar server to create a similar file and test the example yourself. Perl programs are started in Cedar with the command perl filename.
15.3.2 Awk
Awk is a standard UNIX and Linux tool, which is available on CSC’s servers.
With Awk, individual columns can be easily extracted from tab-delimited text files.
Using other standard UNIX tools, these individual columns can be saved into a new
text file. For example, the next script takes the first column from a specified datafile
and saves it into a new file.
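A one-liner matching this description might be (the file names are placeholders):

awk '{print $1}' datafile.txt > newfile.txt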
Two columns can be “awked” into a new file next to each other separated with
a space:
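One way to do this (file names again placeholders; the comma inserts the default output separator, a space):

awk '{print $1, $2}' datafile.txt > newfile.txt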
15.3.3 R
R is a free statistical analysis tool and a self-sufficient programming language. It
is available for UNIX, Linux, Macintosh and PC platforms. In R, scripts for anal-
yses and data file manipulations can be easily constructed. R has many add-on
packages for cluster analysis, self-organizing maps, and neural networks, among
others. There are also some packages available that have been specifically tailored
for DNA microarray data analysis.
Here is an example on how to read the tab-delimited datafile to R and how to
process it into a new table, which is then written out to a new file. The function
read.table reads in the specified file with the headers. The table is then saved in a variable data. Two columns are extracted from the data and saved into new variables (x and y). The new variables are used for the creation of a new table (dataout), which is then written to a new text file (filenameout.txt). Such a script can easily be automated using R, and the analyses can simultaneously be integrated with the datafile conversions.
# Read the tab-delimited file; header=T treats the first row as column names.
data<-read.table("filename.txt", header=T)
x<-data$greenintensity
y<-data$redintensity
# Bind the two columns into a new table and write it to a text file.
dataout<-cbind(x,y)
sink("filenameout.txt")
dataout
sink()
Many images included in chapters 5–8 have been produced by R using real
DNA microarray datasets.
15.4.3 ArrayViewer
The Institute for Genomic Research (TIGR) has released a couple of software pack-
ages for DNA microarray data analysis. ArrayViewer comes in two versions. One
is suited for viewing the results from one chip only, the other can cope with multiple
slides at the same time. Analysis options cover some basic filtering and normaliza-
tion methods and various clustering algorithms.
15.4.4 MAExplorer
MAExplorer can be used as a web-based tool or a desktop program. It performs
clustering, but also normalization, and has the best statistical tools among the free
programs. The program offers the possibility of using a tailored database. MAEx-
plorer is also under constant development, and new features are frequently added.
15.4.5 Bioconductor
Bioconductor is an attempt to build a DNA microarray data analysis environment on top of R (see section 15.3.3). The broad goals of the project are to provide
access to a wide range of powerful statistical and graphical methods for the analy-
sis of genomic data; facilitate the integration of biological metadata in the analysis
of experimental data: e.g., literature data from PubMed, annotation data from Lo-
cusLink; allow the rapid development of extensible, scalable, and interoperable
software; and promote high-quality documentation and reproducible research.
Many commercial software packages have been developed for DNA microarray
analysis during the last couple of years. The ones we have experience with are briefly presented here.
15.5.1 VisualGene
15.5.2 GeneSpring
GeneSpring is a general analysis tool specifically tailored for DNA microarray data
analysis. The intended group of users are the biologists, who actually perform the
experiments. The program contains various data preprocessing (filtering, normal-
ization) and clustering tools. Gene annotation can also be retrieved directly from
web-based databases. GeneSpring can be expanded with user-made scripts or pro-
grams (Java APIs).
15.5.3 Kensington
Kensington Discovery Edition offers a heavy-duty data mining tool for DNA
microarray analyses. Kensington has a visual programming language that describes
the dataflow through the different analysis steps. Kensington has more or less the
same analysis tools as GeneSpring, but it implements the database connections in a
more sensible fashion; for example, data can be retrieved from any field of a
GenBank report. Kensington can be expanded with user-made modules containing
new algorithms.
15.5.4 J-Express
J-Express is another appealing commercial tool. It has some filtering and other
preprocessing tools, but the main emphasis is on various clustering algorithms and
the visual examination of normalization results. It has nice links to web-based
databases. J-Express is also user-customizable: plug-ins that implement custom
algorithms can be written in the Java programming language. There is also a free
version of the program available.
15.5.7 Spotfire
Spotfire is a data mining tool that has been modified to suit DNA microarray
analyses. Spotfire runs in a web browser, and the different analysis tools are loaded
onto the user’s machine on an as-needed basis. Spotfire has very nice tools for the
visualization of microarray data. Because it is based on general statistical tools, it
also has some well-implemented functions for outlier detection and for assessing
the statistical significance of the results.
Index

Microarray
    obtaining, 17
    printing, 16
    RNA sample preparation, 19
    scanning, 20
    typical applications, 21
Microarray databases
    ArrayExpress, 148
    GEO, 149
    MIAMExpress, 148
missing values, 53, 66
    casewise deletion, 66
    mean substitution, 67
    pairwise deletion, 67
multiple chip methods, 104
    one-sample t-test, 106
    standard deviation, 104
    two-sample t-test, 105

N
Newton’s method, 103
noise envelope, 101
normality, 77
normalization, 81, 85, 88
    analysis of variance, 94
    dye-swap, 94
    global, 89
    housekeeping genes, 90
    linearity of data, 91
    local, 89
    lowess, 91, 93
    mean centering, 92
    median centering, 91, 92
    per-chip, 89
    per-gene, 89
    ratio statistics, 94
    spiked controls, 90, 94
    standardization, 92
    trimmed mean centering, 92
number of subjects, 43

O
oligonucleotide pairs, 25
    mismatch, 25
    perfect match, 25
ontology, 140
outliers, 52, 74
    quantification error, 74
    spot saturation, 74
    statistical modeling, 74

P
pheasant tail, 68
plot
    normal probability, 51
power analysis, 72
preprocessing, 66
Promoter data mining, 129
    pattern databases, 136
    pattern search, 137
    retrieving sequences, 130
        EnsMart, 133

R
range, 44
replicates
    averaging of, 72
    biological, 71
    case-control studies, 72
    chips
        checking quality, 73
    excluding bad replicates, 73
    handling, 71
    power analysis, 72
    software, 71
    spots
        checking quality, 73
    technical, 71
    time series, 71

S
sample size, 43
Sapir and Churchill’s method, 101
scatter plot, 44
    M versus A, 87
signal-to-background, 75
signal-to-noise, 75
single chip methods, 100
    Chen’s method, 102
    Newton’s method, 103
    noise envelope, 101
    Sapir and Churchill’s method, 101
single nucleotide polymorphism, 31
SNP, 31
    genotype calls, 32
    methods
        APEX, 31
        single base extension, 31
software
    ArrayViewer, 155
    awk, 153
    Bioconductor, 155
    Cluster, 155
    Expression Nti, 157
    Expression Profiler, 155
    GeneSpring, 156
    J-Express, 156
    Kensington, 156
    MAExplorer, 155
    Perl, 153
    R, 154
    Rosetta Resolver, 157
    Spotfire, 157
    Treeview, 155
    VisualGene, 156
spatial effects, 79
spiked controls, 90
standard deviation, 44, 77
standardization, 48, 88, 89
statistical testing, 54
    ANOVA, 55, 58
        completely randomized, 58
    Bonferroni correction, 58
    choosing the test, 55
    critical values table, 57
    hypothesis pair, 55
    Kruskal-Wallis test, 55
    one-sample t-test, 56
    p-value, 55
    t-test, 55
    test statistic, 56
    two-sample t-test, 56
systematic bias, 85
    array design, 87
    batch effect, 87
    dye effect, 85
    experimenter issues, 87
    plate effects, 86
    printing tip, 86
    reporter effects, 86
    scanner malfunction, 85
    uneven hybridization, 86
systematic variation, 85

T
time series, 38, 71

U
UPGMA, 109

V
variable, 42
    dependent, 42
    independent, 42
    qualitative, 42
    quantitative, 42
variables, 42
variance, 44, 77, 81
Volcano plot, 106

Z
Z-score, 89