Micro Array Analysis
1.1
Before any kind of microarray data can be analysed for differential expression
several steps must be taken. Raw data must be quality assessed to ensure its
integrity. Unprocessed raw data will always be subject to some form of technical
variation and thus must be preprocessed to remove as many unwanted sources of
variation as is possible, to ensure that results are of the highest attainable level
of accuracy. Ideally, the data being assayed should be preprocessed using several
different methods, the results of which should be compared to identify which
method is of the highest level of suitability. The most appropriate method should
then be used to preprocess the raw data before differential expression analysis.
1.1.1
Because of the design of these kinds of chips, the steps that need to be taken
before differential expression analysis are slightly more elaborate than for cDNA
arrays, which we will outline later in the chapter.
Background Correction
The first step is generally to background correct the intensity reading for each
spot. Background fluorescence can arise from many sources, such as non-specific
binding of labelled sample to the array surface, processing effects such as deposits
left after the wash stage or optical noise from the scanner [?]. There is always
some level of background noise: even if nothing but sterile water is labelled and
hybridised to the array, some fluorescence will still be picked up by the scanner
[?]. Different algorithms use different methods of background correction. The
popular Robust Multi-Array Analysis (RMA) algorithm, for example, uses the
convolution of signal and noise distributions [?].
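To make the idea of a signal and noise convolution more concrete, the sketch below assumes the observed intensity is the sum of an exponentially distributed signal and normally distributed noise, and returns the expected signal given the observation. It uses a simplified closed form that is often quoted for this model. The function names and the parameter values are illustrative assumptions; in practice the parameters are estimated from the data, which is not shown here.

```python
import math

def normal_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    # Standard normal cumulative distribution, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_signal(observed, mu, sigma, alpha):
    """Expected signal given the observed intensity, assuming
    observed = signal + noise, signal ~ Exponential(rate alpha),
    noise ~ Normal(mu, sigma^2). Parameters are assumed to be known."""
    a = observed - mu - alpha * sigma ** 2
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Hypothetical parameter values, for illustration only.
for obs in (50.0, 100.0, 200.0):
    print(obs, round(expected_signal(obs, mu=100.0, sigma=20.0, alpha=0.01), 2))
```

The corrected value stays positive even when the observed intensity is below the assumed background mean, which is the property that makes this kind of correction attractive before a log transformation.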
Normalization
The next stage is normalization. The purpose of this step is to adjust data for technical variation, as opposed to biological differences between the samples. There
will always be slight discrepancy between the hybridisation processes for each array and these variations tend to lead to scaling differences between the overall
fluorescence intensity levels of various arrays. For example, the quantity of RNA
in a sample, the length of time a sample spends hybridising or the
volume of a sample can all introduce significant variance. Even subtle physical
differences between arrays or between the scanners used to read arrays can have
an effect.
Put simply, normalization ensures that, when comparing expression levels of
different arrays, we are, as far as possible, comparing like with like.
Studies have shown that the normalization method used has a significant effect
on final differential expression results, so it is vital to choose an appropriate method
[?].
PM Correction
As stated previously, PM probes on the GeneChip measure both the relative
abundance of the corresponding gene and the amount of non-specific binding,
which arises when mRNA binds to a probe which is not targeting it. MM probes
are designed to give a measure for non-specific binding of their corresponding PM
probe. It then seems obvious that the MM values should be subtracted from their
corresponding PM values as a first step in the analysis process.
In reality, however, this does not work, because generally about 30% of MMs
are actually larger than their corresponding PMs [?]. This is because MM probes do
not only measure background signal: high volumes of the mRNA targeted by
the PM probes tend to bind to the MMs as well. Many of the most popular preprocessing
methods solve this problem by simply ignoring the MM probes altogether and correcting PM
values for non-specific binding using other methods.
Summarisation
We have already seen how GeneChip arrays work by using 11 different PM spots
to target 11 separate 25-base-long sections of a target gene's mRNA. The final step
in preprocessing GeneChip Data is to summarise the data from these 11 separate
probes into an expression value for the gene in question. There are a number
of different ways that this can be achieved, but the end result is always a single
expression value for each gene on each chip.
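As an illustration of one way a probeset can be collapsed into a single value per array, the sketch below applies Tukey's median polish (the Median Polish option listed later in this chapter) to a small probes-by-arrays matrix of log2 intensities. The function name, the iteration limit and the toy data are assumptions made for the example; this is a minimal sketch rather than the implementation used by the system.

```python
from statistics import median

def median_polish(log_pm, max_iter=10, tol=1e-6):
    """log_pm: list of rows (one per probe), each a list over arrays (log2 scale).
    Returns one summarised expression value per array."""
    n_rows, n_cols = len(log_pm), len(log_pm[0])
    resid = [row[:] for row in log_pm]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep out the row (probe) medians.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep out the column (array) medians.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    # Expression value per array = overall level + array effect.
    return [overall + c for c in col_eff]

# Toy probeset: three probes (rows) measured on three arrays (columns).
print(median_polish([[8, 9, 10], [9, 10, 11], [7, 8, 9]]))
```

Because medians are used instead of means, a single misbehaving probe has little influence on the summarised value.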
1.1.2
Having introduced the general pipeline followed to preprocess Affymetrix microarray data, we will outline some of the preprocessing methods implemented by this
system and describe their operation as well as justifying their inclusion.
There are a number of popular composite preprocessing algorithms. These
algorithms implement the four preprocessing steps outlined above and output
background corrected and normalized expression measures for each gene on each
array. The preprocessing methods implemented by this system are as follows.
MicroArray Suite 5.0 (MAS5)
MAS5 is an algorithm developed by Affymetrix and is described in their white
paper Statistical Algorithms Description Document (2002) [?]. This algorithm
background corrects both PM and MM probes; MMs are then converted into ideal
mismatches, where their values are always smaller than their corresponding PM
values. Remember that approximately 30% of the time MM values are greater than
their PMs. If MM < PM, then the MM value is left unchanged. A robust mean over
the log2 transformed differences between PMs and the already calculated ideal
mismatch is computed. Expression values are normalized by setting the trimmed
mean of the original signals of each chip to a prespecified value. Hence, MAS5
data is normalized after summarisation, not before, as in many other algorithms.
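The scaling step described above can be sketched as follows: the trimmed mean of the summarised signals on a chip is computed and every signal is multiplied by the factor that brings this trimmed mean to a chosen target. The target value and trim fraction used here are illustrative assumptions, as are the function names.

```python
def trimmed_mean(values, trim=0.02):
    # Mean after discarding the given fraction of values from each end.
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)

def scale_to_target(signals, target=500.0, trim=0.02):
    # Rescale one chip so that its trimmed mean equals the target value.
    factor = target / trimmed_mean(signals, trim)
    return [s * factor for s in signals]

# Toy chip with one extreme value; target and trim are chosen for illustration.
chip = [1.0, 2.0, 3.0, 4.0, 100.0]
print(scale_to_target(chip, target=30.0, trim=0.2))
```

Trimming means that a handful of extreme signals on a chip does not dominate the scaling factor.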
Probe Logarithmic Intensity Error Estimation (PLIER)
PLIER is the current recommended algorithm from Affymetrix. Affymetrix claim
that the algorithm improves on MAS5 by introducing a higher reproducibility of
signal (lower coefficient of variation) without loss of accuracy; higher sensitivity
to changes in abundance for targets near background and dynamic weighting of
the most informative probes in a dataset to determine signal [?]. In this system
the PLIER algorithm is modified to include quantile normalization as PLIER does
not normalize data by default.
Robust Multi-Array Analysis (RMA)
RMA is an academic alternative to Affymetrix's algorithms for converting probe
level data to gene expression measures. This method is distinct from Affymetrix's
methods in that it completely ignores the MM probe readings; the inventors of
the algorithm claim that the MM probes introduce more noise and, while
acknowledging that these probes do provide useful information, state that they had not, at the
time of publication of the method, found a productive way to use it [?].
The method works by adjusting for background noise on the raw intensity scale,
in a way that does not lead to negative background corrected values. The log2 transformed value of each background corrected PM probe is obtained and these values
are normalized using quantile normalization, which was developed by Bolstad et
al. (2003) [?]. Robust multi-array summarisation is then carried out on the quantile normalized values
[?].
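Quantile normalization forces every array to share the same empirical distribution of intensities. A minimal sketch of the idea is given below, assuming each array is a plain list of equal length; the function name is hypothetical and ties are broken arbitrarily, which a production implementation would handle more carefully.

```python
def quantile_normalize(columns):
    """columns: one list of intensities per array, all of equal length.
    Returns the arrays with their values replaced by reference quantiles."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this array's values, from smallest to largest.
        order = sorted(range(n), key=lambda idx: col[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two toy arrays of three probes each.
print(quantile_normalize([[5, 2, 3], [4, 1, 9]]))
```

After this step each array contains exactly the same set of values, differing only in which probe holds which value.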
GeneChip RMA (GCRMA)
GCRMA is largely based on RMA and in fact only differs in the background
correction step where it uses probe sequence information to help estimate the
background. This leads to improved accuracy in fold changes but at the expense
of marginally lower precision [?].
Other Methods Implemented
The system can also carry out a preprocessing method in which the user
manually builds the algorithm by specifying explicitly which of a selection
of available functions should be applied at each of the various stages. The options
available to the user are as follows.
Background Correction:
Mas5
RMA
RMA2
Normalization:
Constant
Contrasts
Invariant Set
Loess
Qspline
Quantiles
Robust Quantiles
PM Correction:
PM Only
Subtract MM
MAS5
Summarisation:
Average Difference
LiWong
MAS5
Median Polish
Playerout
The above options can be combined as the user desires to tailor preprocessing
to their needs. This route is not recommended for novice users.
1.1.3
The general steps followed when preprocessing cDNA data are quite similar to the
above. The main differences are that there is no need for PM correction, as there
are no MM probes on cDNA arrays, and that there is no summarisation stage, as
each gene is represented by a single probe.
Background Correction
Background fluorescence arises in cDNA arrays in virtually the same way as
previously described for oligonucleotide arrays [?]. The methods used to correct
for background noise are described below.
1.1.4
There are a large number of methods available for preprocessing of dual dye data.
The system implements the following methods.
Background Correction [?]:
Subtract
Half
Minimum
MovingMin
Edwards
NormExp
RMA
Normalize Between Arrays [?]:
Aquantile
Scale
Quantile
Gquantile
Rquantile
Tquantile
Normalize Within Arrays [?]:
Print Tip Loess
Median
Loess
Composite
Control
Robust Spline
1.1.5
The VSN method [?] has been implemented to handle preprocessing of single
channel data, such as that of Exiqon miRNA arrays. The function calibrates
for sample-to-sample variations through shifting and scaling, and transforms the
intensities to a scale where the variance is independent of the mean intensity. It
combines background correction and normalization into one single procedure. For
a matrix $x_{ki}$, with $k$ counting the probes and $i$ the arrays, the function fits a
normalization transformation
$$x_{ki} \mapsto h_i(x_{ki}) = \operatorname{glog}\left(\frac{x_{ki} - a_i}{b_i}\right) \tag{1.1}$$
where $b_i$ is the scale parameter for array $i$, $a_i$ is a background offset and glog
is the generalised logarithm as described by Rocke and Durbin (2003) [?].
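The transformation above can be sketched in a few lines. The example assumes one common form of the generalised logarithm, the inverse hyperbolic sine, and assumes the per-array parameters have already been fitted; the function names and the parameter values are illustrative only.

```python
import math

def glog(z):
    # Assumed form of the generalised logarithm: asinh(z) = log(z + sqrt(z^2 + 1)).
    return math.asinh(z)

def vsn_transform(x, a_i, b_i):
    # Shift by the background offset a_i, scale by b_i, then apply glog.
    return glog((x - a_i) / b_i)

# Hypothetical offset and scale for a single array.
for intensity in (5.0, 50.0, 5000.0):
    print(intensity, round(vsn_transform(intensity, a_i=5.0, b_i=2.0), 3))
```

For large arguments this form behaves like an ordinary logarithm, while near zero it is approximately linear, so it remains defined for intensities at or below the background offset.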
1.2
Having introduced preprocessing of both Affymetrix GeneChip and cDNA microarray data, we now introduce the concept of
quality assessment of microarray data and illustrate its importance.
Quality assessment is an important phase that applies to analysis of all types of
microarrays. Quality assessment of data ensures that the best use is made of the
1.2.1
There are a number of useful tools implemented to assess the quality of GeneChip
data. We will now proceed to outline them and their various uses, using an
example dataset.
The dataset being used to demonstrate preprocessing and quality assessment of
GeneChip microarray data is from an experiment to determine the effects of negative energy balance on the postpartum cow. The bovine version of the Affymetrix
GeneChip was used in this experiment. A set of six arrays from a negative energy
balance group are compared to a set of six control arrays in order to determine
differential gene expression.
Boxplot
A boxplot is a convenient means by which to compare the probe intensity levels
between the arrays of a dataset. The two ends of the box represent the upper
and lower quartiles. The line in the middle of the box represents the median.
Horizontal lines, connected to the box by whiskers, indicate the largest and
smallest values not considered outliers. Outliers are values that lie more than 1.5
times the interquartile range beyond the first or third quartile (the edges of the box);
they are represented by a small circle.
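The outlier rule used by the boxplot can be written down directly. The sketch below computes the quartiles and flags values lying more than 1.5 times the interquartile range beyond the box; the function name is hypothetical and quartile conventions differ slightly between software packages, so the exact fences may not match those drawn by a given plotting function.

```python
from statistics import quantiles

def boxplot_outliers(values):
    # Quartiles using the "inclusive" convention (one of several in common use).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values beyond the fences would be drawn as individual circles.
    return [v for v in values if v < lower or v > upper]

print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```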
If one or more arrays have intensity levels which are drastically different from
the rest of the arrays, this may indicate a problem with these arrays. These kinds
of problems can however sometimes be corrected by normalization. For microarray
data, these graphs are always constructed using log2 transformed probe intensity
values, as the graph would be virtually unreadable using raw values, as you can
see below, where raw values are juxtaposed with log2 transformed values.
The boxplot of log transformed intensity levels in the above example communicates some useful information. As can be seen, the fourth array from the left
has noticeably higher overall probe intensity readings than any of the other arrays. This could be an early indication of a problem with this array. We need
to perform further investigation and establish whether this discrepancy can be corrected
by normalization. The figures on the next page show boxplots of probe intensity
levels following the RMA, GCRMA, MAS5 and qPLIER preprocessing algorithms.
The above plots give an interesting picture of how different algorithms affect the
raw data in significantly different ways. We have a good indication that normalization could solve the scaling problem of our rogue array. We will, however, need
much more information in order to make an informed decision as to whether
this array should or should not be included in analysis and which preprocessing
method should be selected.
Histogram
A histogram of array intensity levels tells us quite a similar story to that of a
boxplot. It is again used to visualise the spread of data and compare and contrast
probe intensity between the arrays of the dataset. The x-axis represents probe
intensity and the y-axis indicates probe density. This plot provides us with
a slightly more detailed picture and there are a number of inferences that can be
made from these plots; a bimodal distribution in the raw data, for example, is
often indicative of an array containing a spatial artifact and an array which is
shifted to the right often has abnormally high background interference.
As you can see from the plot of our raw data below, the array NS7.CEL is
once again a problem, being shifted slightly to the right, which as stated above
could indicate high levels of background noise. This point is worth continued
investigation.
For comparison purposes I have also included an image of RMA preprocessed
values. Even with normalization the same array still has the highest overall values.
Figure 1.8: [...] histogram
Figure 1.10: PCA plot of RMA preprocessed data. Note improved overall clustering but continued non-grouping of NS7.CEL.
Figure 1.11: PCA plot of MAS5 preprocessed data. Once again the situation is
not ameliorated for NS7.CEL
Array       Value
N11S.CEL    62.04
P3S.CEL     64.53
N8S.CEL     68.94
P6S.CEL     64.25
P1S.CEL     62.50
P2S.CEL     70.25
P4S.CEL     60.89
N7S.CEL     84.85
P5S.CEL     73.18
N9S.CEL     62.75

Array       Value
P4S.CEL     0.44
N7S.CEL     0.36
P5S.CEL     0.43
N9S.CEL     0.34
N11S.CEL    0.50
P3S.CEL     0.33
N8S.CEL     0.45
P6S.CEL     0.38
P1S.CEL     0.26
P2S.CEL     0.51

Array       Value
N11S.CEL    58.61%
P3S.CEL     59.57%
N8S.CEL     57.69%
P6S.CEL     58.17%
P1S.CEL     57.89%
P2S.CEL     59.46%
P4S.CEL     57.75%
N7S.CEL     54.78%
P5S.CEL     58.16%
N9S.CEL     58.90%

None of the values in the tables above is, on its own, particular cause for concern.
It is worth noting, however, that the array we previously identified as an outlier,
NS7.CEL, does contain the highest level of background noise, which we previously suspected could have been part of the cause of its problems.
Shown below is a simpleaffy plot, which is a visual representation of some of the
data above. The plot is labelled with some explanations of how to read it.
(1.2)
$$\frac{\mathrm{SE}(g_i)}{\operatorname{med}_i\left(\mathrm{SE}(g_i)\right)} \tag{1.3}$$
NUSE and RLE plots of our original bovine dataset are shown below. You can
see that NS7.CEL is once again slightly askew in both figures.
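The NUSE values behind such a plot are the ratio of SE(g_i), the standard error of the expression estimate for probeset g on array i, to the median of that standard error across arrays. A minimal sketch follows, assuming the standard errors are already available as a probesets-by-arrays table; the function name and the numbers are made up for illustration.

```python
from statistics import median

def nuse(se):
    """se[g][i]: standard error of the expression estimate for probeset g on array i.
    Each row is divided by its median across arrays."""
    return [[s / median(row) for s in row] for row in se]

# Two toy probesets measured on three arrays.
standard_errors = [[0.1, 0.2, 0.4],
                   [1.0, 1.0, 2.0]]
for row in nuse(standard_errors):
    print(row)
```

An array whose values sit consistently above 1 across probesets has larger standard errors than its peers, which is what a NUSE boxplot is designed to reveal.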
1.2.2
There are also a number of useful tools implemented to assess the quality of cDNA
data. This subsection aims to outline these tools and their various uses.
The dataset which will be used to demonstrate preprocessing and quality assessment of cDNA microarray data is the Swirl dataset, which is one of the example
datasets packaged with Bioconductor.
To give a very brief background: this experiment was carried out using zebrafish
as a model organism to study early development in vertebrates. Swirl is a point
mutant in the BMP2 gene that affects the dorsal/ventral body axis. The main
goal of the Swirl experiment is to identify genes with altered expression in the
Swirl mutant compared to wild-type zebrafish. Each of the four arrays in the
experiment compares RNA from the mutant Swirl zebrafish to that of the normal
wild-type fish.
The following pages outline the cDNA quality assessment tools implemented in
this system.
M-A Plots
M and A are two very commonly used variables in the analysis of dual dye arrays
and understanding their meaning is a crucial concept in understanding this kind
of analysis.
A is defined by
$$A = \log_2\sqrt{\mathrm{Cy5} \times \mathrm{Cy3}} = \frac{1}{2}\left[\log_2(\mathrm{Cy5}) + \log_2(\mathrm{Cy3})\right] \tag{1.4}$$
Cy5 and Cy3 represent respectively the red and green dye intensities of a
particular spot. So A is the red and green intensities of a spot multiplied together,
square rooted and log transformed. Thus it is essentially a measure of the overall log
transformed intensity of a spot: if combined red and green intensities
are high for a particular spot, then A will also be high.
M is defined as
$$M = \log_2\frac{\mathrm{Cy5}}{\mathrm{Cy3}} = \log_2(\mathrm{Cy5}) - \log_2(\mathrm{Cy3}) \tag{1.5}$$
So M is the log transformed ratio of the red dye intensity to the green dye intensity.
It gives an indication of whether more of the red or the green dye is binding to
the array at a given spot.
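As a small worked example of the two definitions, the sketch below computes M and A for a single spot from its red (Cy5) and green (Cy3) intensities. The function name and the intensities are illustrative.

```python
import math

def ma_values(cy5, cy3):
    # M: log2 ratio of red to green; A: mean of the two log2 intensities.
    m = math.log2(cy5) - math.log2(cy3)
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))
    return m, a

# A spot that is four times brighter in red than in green.
print(ma_values(8.0, 2.0))
```

A positive M indicates more red than green signal at the spot, a negative M the reverse, and M = 0 equal signal in both channels.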
The purpose of an MA-plot is to investigate intensity bias. If a disproportionate number of spots on the plot are above or below the x-axis, it could indicate
a problem with an array. As before, these kinds of problems can sometimes be
addressed with normalization.
MA-plots can be viewed for a whole array, or for the individual print tip groups
on an array. This diagram gives us a good indication of whether normalization
within an array is needed.
Below are MA-plots of the first array, swirl.1.spot, in our example dataset.
Plots are shown for both print-tip groups and the array as a whole.
Data for the normalized plots is created using the print-tip-group loess within
array normalization technique, which is suitable for most kinds of data.
Figure 1.17: Swirl.1.spot MA-plot of Raw
The kind of intensity histogram below follows exactly the same principle as the
histogram already described for oligonucleotide arrays, save for the fact that in
this case each array is represented by both a red and a green channel.
An intensity histogram is shown for both raw and quantile normalized data.
Note that boxplots of red and green foreground and background intensity levels
can also be viewed.
1.2.3
Support for single channel platforms like Exiqon miRNA arrays in Bioconductor
is still in something of an experimental stage and can be somewhat ad-hoc; as
already stated, only the VSN preprocessing method is available. Quality control
options are slightly more limited than for other platforms, but there is still enough
available for a user to make a reasonable judgement about the integrity of a
dataset's constituent arrays.
The system implements many of the same plots as before. Available for assessment are array images of both foreground and background intensities, boxplots
of raw and preprocessed foreground and background intensities, density plots of
raw and preprocessed data and PCA and accompanying scree plots of raw and
preprocessed data. All of these plots should be interpreted in a similar manner to that
already described for other platforms.
1.3
False Discovery Rate (FDR) developed by Benjamini and Hochberg (1995) [?], or
the more stringent Bonferroni Method which controls the family-wise error rate.
These and other methods can be applied to address the problem of false positives
in microarray gene expression analysis.
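The two corrections mentioned here can be illustrated on a handful of p-values. The sketch below implements the standard Benjamini and Hochberg step-up adjustment and the Bonferroni adjustment; the function names and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    # Adjusted p-values controlling the false discovery rate.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def bonferroni(p_values):
    # Adjusted p-values controlling the family-wise error rate.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(bonferroni(raw))
```

With these toy values the Bonferroni-adjusted p-values are never smaller than the Benjamini and Hochberg ones, which reflects the more stringent nature of the family-wise correction.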
The system developed during this project uses the functions available in Bioconductor's LIMMA package to calculate differential expression of GeneChip, dual
dye and single dye data, as the same principles can be applied to all of these data
types.
Further to that, the system also implements the functionality of the more
recent PUMA package, for analysis of GeneChip data.
Note that further technical details of how these packages are integrated will
be discussed at a later point in this thesis.
1.3.1
LIMMA is an R library which is part of the Bioconductor project and is used for
the analysis of gene expression microarray data. It incorporates the use of linear
models for assessment of differential expression. LIMMA provides the ability to
analyse comparisons between many RNA targets simultaneously in complicated
designed experiments. Empirical Bayesian methods are used to provide stable
results even when the number of arrays is small.
The general procedure followed in analysis using the package is as follows.
This procedure first fits a linear model to the expression data for each probe.
The expression data should be log-ratios M for two-colour array platforms or log-expression values for one-channel platforms. The coefficients of the fitted models
describe the differences between the RNA sources hybridised to the arrays; these
coefficients are defined by the design matrix. The probe-wise fitted model results
are stored in a compact form suitable for further processing by other functions in
the Limma package.
The fitted model object is then re-orientated from the coefficients of the original design matrix to any set of contrasts of the original coefficients. The coefficients, correlation matrix and unscaled standard deviations are then re-calculated
in terms of the contrasts.
Finally, empirical Bayes shrinkage is used to compute moderated t-statistics,
moderated F-statistics and B-statistics (log-odds of differential expression) by shrinkage of the standard errors towards a common value. This method has the advantage of being able to provide a stable result even when the number of arrays in
an experiment is small [?].
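The effect of the shrinkage can be seen in a small numerical sketch. A probe's own residual variance is combined with a prior variance as a weighted average, with the respective degrees of freedom as weights, and the resulting posterior variance replaces the ordinary one in the t-statistic. In practice the prior variance and prior degrees of freedom are estimated from all probes, which is not shown; here they are supplied by hand and all names and values are hypothetical.

```python
import math

def moderated_t(coef, s2, df, unscaled_sd, prior_s2, prior_df):
    """coef: estimated coefficient (log fold change) for one probe
    s2, df: the probe's residual variance and its degrees of freedom
    unscaled_sd: unscaled standard deviation of the coefficient
    prior_s2, prior_df: prior variance and degrees of freedom (assumed given)"""
    # Posterior variance: weighted average of prior and observed variances.
    post_s2 = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    return coef / (unscaled_sd * math.sqrt(post_s2))

# A probe with a large observed variance is pulled towards the prior value.
print(moderated_t(coef=2.0, s2=3.0, df=4, unscaled_sd=0.5, prior_s2=1.0, prior_df=4))
```

With only a few arrays the probe-wise variance is poorly estimated, so borrowing information from the prior keeps the statistic from being dominated by accidentally small or large variances.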
Below are screen shots of the top ranked differentially expressed genes from
the two datasets we introduced earlier. The GeneChip data (bovine dataset) was
preprocessed using RMA; while the Swirl dataset, which is a dual dye experiment
where array image analysis was performed using Spot software, was preprocessed using background subtraction, print-tip-group loess normalization within
arrays and Quantile normalization between arrays. In both cases, the adjusted
p-values are corrected for multiple testing using the Benjamini and Hochberg
method.
These screen shots are of HTML tables output by the system. How these are
created will become clear over the subsequent chapters.
1.3.2
1.4
As already outlined, most GeneChip arrays use 11 different 25-base long probes
to target specific genes.
A problem is, however, introduced by the ever-changing nature of knowledge
of the genomic sequences of different organisms. As such knowledge has evolved, it has
become clear that the original probe to transcript mappings assigned in an array's
Chip Definition File (CDF), defined initially by the manufacturer, are in certain
cases no longer entirely accurate. In simple terms, some probes are
not targeting the sequence that they were originally thought to be targeting.
Because of this a number of groups have developed alternative probe to probeset mappings, which are defined in remapped Chip Definition Files.
This system gives the user the option of using some of the remapped CDF
packages created by the AffyProbeMiner project [?], as an alternative to the default Affymetrix CDFs. AffyProbeMiner regroups probes in the GeneChip into
new probesets according to the verified complete coding sequences available in
GenBank and RefSeq databases. This remapping has been shown to affect 20-30% of all probesets, with genes shown to be differentially expressed using the
default CDF file showing only a 50% overlap with an analysis based on the new
CDF. However, the remapped probesets are more consistent with the latest genomic
sequencing information and therefore provide a more reliable measure of a gene's
true expression level [?].
Below is a list of the top ten genes in our bovine dataset calculated using
AffyProbeMiner's remapped CDF file and the methods in the PUMA package.
Note that the ID column contains the EntrezGene gene IDs assigned by default by
AffyProbeMiner, as opposed to standard Affymetrix identifiers.