User Manual
Statistical software for the analysis of
microarray data
Julio A. Di Rienzo
Gonzalez L.A.
Tablada E.M.
Updated 09/2009
The correct citation for the manual is as follows:
Di Rienzo J.A., Gonzalez L.A., Tablada E.M. (2009). fgStatistics - User Manual.
Electronic Edition, Argentina
ISBN 978-987-05-8944-0
INDEX
Introduction
Requirements
Getting started
References
fgStatistics
Introduction
Requirements
Note: Use the DCOM and R versions suggested by the link provided with the instructions.
In some versions of Windows Vista you may experience trouble installing the rscproxy package. The
symptom is that, when you try to install it, Vista shows a message offering some options, and the
installation fails no matter which option you choose. In that case go to CRAN, find the rscproxy zip
file and download it manually. Uncompress it and move or copy the resulting folder to
C:\Program Files\R\R-2.9.0\library\. Take care to move or copy the folder that directly contains the
package files, not an enclosing folder that the unzip procedure may have created around it.
The current version of fgStatistics has been tested under Windows XP, Windows Vista and
Windows 7. fgStatistics requires several R packages, which are automatically downloaded from
the Internet when a procedure requests them.
Getting started
When fgStatistics is run, the application window is displayed as shown in Figure 1. The
illustration shows the application desktop. The main menus of the application are Data,
Statistics, and Microarray data analysis. A short description of each of the menus
follows.
File menu
As usual, the file menu contains the list of actions for handling data sets, like creating a
new table or opening an existing one, saving, saving as, closing a table, and quitting the
application (Figure 2). The default file format that fgStatistics uses to save tables is a
proprietary file format with an .FGDB extension. However, fgStatistics can read Excel
files (*.xls), text files (*.txt, *.dat), dbase files (*.dbf), InfoStat files (*.idb, *.idb2) and
R scripts (*.r). It also has the ability to export data tables to all previously mentioned
data formats.
Edit menu
The Edit menu shown in Figure 3 contains the usual links to the Cut-Copy-Paste-Undo
functions, common to almost every Windows application. Copy with column names and
Paste with column names are specialized forms of copy-paste for use on tables.
Data menu
This menu contains a list of links to procedures that apply to the currently active table
(Figure 4). These procedures handle the usual actions on a data sheet, such as
inserting, adding and deleting rows and columns, arranging rows according to
different sorting criteria, activating and deactivating cases to allow or disallow
participation in calculations, changing column names, and modifying the number of decimal
places displayed and the alignment and width of columns. It also contains links to
more specialized procedures that allow categorizing numerical variables, editing the
names of the categories of categorical variables, writing formulas and searching for cases
according to logical rules. Moreover, there are links to procedures that merge
tables side by side according to matching criteria (Merge horizontally) and that
merge tables by appending one to the other (Merge vertically).
Statistics menu
This menu has a link to a Summary statistic procedure and a link to an R interpreter
(Figure 5).
The first item allows the user to calculate common summary statistics of numerical
variables in the whole data set as well as in subsets of data defined by classification
criteria.
The second item in the Statistics menu links to an R language interpreter. The general
appearance of this window is shown below (Figure 6). It allows users who know the
language to write their own scripts.
Figure 6: R interpreter
The Microarray data analysis menu contains the links to the main procedures implemented in
fgStatistics. Figure 7 shows the items of this menu. The first two items are related to the
generation of the gene expression matrix (GEM). The steps to calculate this matrix are
platform-specific and involve methods to read the data generated by the devices used to read
or generate the image of the microarray. These items are described next. The other
items are related to the processing of the GEM and are not platform-specific. The last one
implements a generalization of a test to compare the relative expression of genes in
real-time PCR data.
When the Process Affymetrix .CEL file menu is clicked, a standard open-file window is
displayed to allow the user to look for the .CEL files containing the microarray data. Once
the .CEL files are located, the user should select the ones he/she wants to process and click
the “Open” button. Then a dialog is displayed to let the user decide whether all the selected
microarrays will be used to generate the GEM. After clicking the “Go” button, a message is
displayed to inform the user that the process can take some minutes.
This procedure is based on the “rma” function (Irizarry et al., 2003a,b) implemented in the
“affy” package. This function computes the RMA (Robust Multichip Average)
expression measure described in Irizarry et al., Biostatistics (2003). It performs background
correction, probe summarization, and quantile normalization. The result is a GEM, as
shown in Figure 10.
Figure 9: Message displayed by the Process Affymetrix .CEL file menu
This menu requires the affy package. If it is not already installed in your current
R installation, fgStatistics will try to download and install it automatically.
Figure 10: Gene expression matrix generated by the Process Affymetrix .CEL file menu
cDNA
Generating a GEM from cDNA data is not as straightforward as in the case of
oligonucleotide chips, because there is no standardized data format
and there are several steps at which the user needs to make decisions. For this
reason the menu includes several submenus, as shown in Figure 11.
Read MEV files
The images generated by a cDNA microarray have to be read by the researcher using specialized
software. Spotfinder® is free software commonly used to read those cDNA images. It
generates as output a text file with the extension .MEV. If this is your case, the submenu
Read .MEV files is the correct choice. This submenu displays a standard open-file
window to allow the user to look for the .MEV files containing the microarray data. It
lets you choose only one file at a time, because each file requires human-driven
preprocessing.
The MEV files contain several “columns”. The only ones displayed by default are
those containing the intensities (IA and IB) of the red and green channels. Which of
these columns actually contains the red channel and which the green channel depends on the
way the corresponding images were loaded in the Spotfinder software. The columns named “R”,
“C”, “MR” and “MC” are the grid references of the spots. MR and MC are the “macro”
rows and columns of the array, and R and C are the rows and columns within the grid
defined by MR and MC. These references are very important for assigning gene names to
the spots. Entries in the table equal to zero indicate that the spot could not
be read. These zeros must be set to missing values using the Search item in the Data
menu, specifying a search-replace as in the following figure. The region of the table
where the search-replace should take place must be selected before invoking
the Search procedure.
The hidden columns of the .MEV file can be shown by invoking Data >> Actions on
columns >> Hide, show, delete, arrange, rename…, as illustrated in the following
figure.
This action displays a dialog allowing the user to choose which columns to display,
delete, rename, etc., as shown in the next figure.
The difference between the red (R) and green (G) channels is expressed as the log
intensity ratio M = log_2(R/G). This statistic is a relative measure of gene expression.
Values above zero indicate that the gene is overexpressed in the red-labeled sample
relative to its expression in the green-labeled sample. However, this measure of gene
expression needs some corrections.
Figure 15: Specification of the variable selection step of the intensity correction procedure
Under the assumption that most genes remain at the same level across the
experimental conditions, except for a relatively small fraction of them, the
scatter plot of M vs A is expected to show no trend. It is known, however,
that M varies with the average log intensity A = (log_2(R) + log_2(G))/2.
This is an artifact that must be removed for each slide. The correction is slide-specific
and usually demands human decisions. The correction consists of subtracting from M its
expected value given A. Thus a key point is to calculate the expectation of M
for each observed A. This is usually done by fitting a locally weighted regression
(LOWESS). The smoothing algorithm has a parameter that is adjusted manually
according to a visual inspection of the M vs A scatter plot. When “Apply intensity
correction” is called, the user must specify which columns of the active data table
contain the red and green channels, as in Figure 15.
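The two quantities defined above, M and A, can be illustrated with a short Python sketch. It is not the fgStatistics implementation; the function name ma_values and the convention of returning None for unreadable spots are assumptions made for this example.

```python
import math


def ma_values(red, green):
    """Return the lists (M, A) for paired red/green spot intensities.

    M = log2(R/G) and A = (log2(R) + log2(G)) / 2.
    Spots with a missing or non-positive intensity yield None
    (assumed here to mean the spot could not be read).
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r is None or g is None or r <= 0 or g <= 0:
            m_vals.append(None)
            a_vals.append(None)
            continue
        log_r, log_g = math.log2(r), math.log2(g)
        m_vals.append(log_r - log_g)        # log2(R/G)
        a_vals.append((log_r + log_g) / 2)  # average log intensity
    return m_vals, a_vals


# Hypothetical intensities for three spots; the last one could not be read.
m, a = ma_values([8.0, 4.0, 0.0], [2.0, 4.0, 5.0])
```

A positive M marks a spot that is brighter in the red channel, and M = 0 marks equal intensity in both channels.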
After the user accepts the identification of the channel positions in the data table, two
pop-up windows appear (Figure 16). One displays the M vs A scatter plot, including a
smoothing curve. The other is a dialog in which the user can specify different smoothing
factors (in the open interval (0,1)).
Figure 16: Specification of the variable selection step of the intensity correction procedure
By default the smoothing factor is 0.1, which means that the prediction for a given A is
calculated from a regression line fitted to the points closest to it, which make up
10% of the total points in the data set. When the smoothing factor approaches
1, the smooth approximates a straight line. The user should try other smoothing
factors until he/she considers that the smooth captures the shape of the relationship
between M and A. Once the user has finished, the data table shows three new
columns named “A”, “M” and “Mci” (Figure 17). The last one is the one we retain for
further analysis.
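The intensity correction described above can be sketched in Python as a plain local linear regression with tricube weights. This is a simplified illustration, not the smoother used by fgStatistics: it performs no robustness iterations, it assumes missing values have already been removed, and the names lowess_fit and intensity_corrected are hypothetical.

```python
def lowess_fit(a, m, span=0.1):
    """Fit m on a by local linear regression; return the fit at each a[i].

    span is the smoothing factor: the fraction of points, in (0, 1),
    used around each target point.
    """
    n = len(a)
    k = max(2, min(n, round(span * n)))  # neighbourhood size
    fitted = []
    for x0 in a:
        # indices of the k points closest to x0 along the A axis
        nearest = sorted(range(n), key=lambda j: abs(a[j] - x0))[:k]
        h = max(abs(a[j] - x0) for j in nearest) or 1.0  # bandwidth
        # tricube weights: 1 at x0, falling to 0 at distance h
        w = [(1 - (abs(a[j] - x0) / h) ** 3) ** 3 for j in nearest]
        sw = sum(w)
        mx = sum(wj * a[j] for wj, j in zip(w, nearest)) / sw
        my = sum(wj * m[j] for wj, j in zip(w, nearest)) / sw
        sxx = sum(wj * (a[j] - mx) ** 2 for wj, j in zip(w, nearest))
        sxy = sum(wj * (a[j] - mx) * (m[j] - my) for wj, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 1e-12 else 0.0  # flat fit if degenerate
        fitted.append(my + slope * (x0 - mx))
    return fitted


def intensity_corrected(a, m, span=0.1):
    """Mci: M minus its expected value given A."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, span))]


# Hypothetical data with a purely linear trend of M on A.
a_vals = [float(i) for i in range(20)]
m_vals = [0.5 * x - 3.0 for x in a_vals]
mci = intensity_corrected(a_vals, m_vals, span=0.5)
```

In this toy example the trend is entirely removed, so every corrected value is numerically zero. With real data, a larger span gives a stiffer curve and a smaller span follows local bumps more closely.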
The user has to repeat the steps of reading the MEV file and applying the intensity
correction for each slide in the experiment.
Once all the slides have been processed, the user must merge all the Mci columns into one
data table. This can be done simply by cutting and pasting. At this step it is necessary to
associate the gene names with each row of the new table containing the Mci columns.
To assign the gene names to the rows of the Mci table, it can be useful to add the
“grid addresses” of each spot (point on the slide). If the user has a .GAL file or a list of
gene names associated with the coordinates of each position on the slide, it is easy to “copy”
this information to the Mci table using the Merge tables >> Merge horizontally
procedure in the Data menu.
Suppose that you have a table with all the Mci columns of your experiment that includes the R,
C, MR, and MC columns of a MEV file, and another table with the gene names
associated with the same R, C, MR, and MC columns, as exemplified in the following
figure.
Figure 18: Example of a table containing the Mci columns (left) and another (right) containing
the gene names. Both share the spot addresses
On the left is the table of Mci values; on the right, the table containing the gene names (ID).
Both share the same columns that identify spot positions.
Figure 19: First dialog in the invocation of the Merge horizontally procedure
To apply the Merge horizontally procedure to this example, make the left table active (by
clicking on it). Then invoke the procedure; it displays a dialog like the one shown in
Figure 19. As illustrated, choose the R, C, MR and MC column names as the
Concatenation variables. Then click the OK button.
A new dialog appears, as shown below. In its top-left corner it displays a list of the open
tables, not including the one from which the procedure was called. In this example there is
only one additional table (the one shown on the right in Figure 18).
Figure 20: Second dialog in the invocation of the Merge horizontally procedure
When the user clicks the name of that table, the names of the “concatenation variables”
are shown (bottom left), each with a check mark if that name is found in the “New_1” table.
All of them must appear checked; if not, there is a problem with the criteria used to match
the tables and you should not proceed. On the right of this dialog appear all the other
columns of the chosen table (not including those used to drive the merging procedure). In
this case that table has only the ID column.
Figure 21: Second dialog in the invocation of the Merge horizontally procedure displaying that
merging criteria are ok and that the ID variable will be pasted to the table calling the procedure
Once the user clicks the OK button, the table containing the Mci columns displays a new
column with the gene names, as the next figure illustrates.
Figure 22: Result of the invocation of the Merge horizontally procedure. The ID variable was copied
from another table according to the R, C, MR and MC merging criteria
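The logic of a horizontal merge on matching criteria can be sketched in Python as a keyed lookup. This is only an illustration of the idea: the function name merge_horizontally, the representation of tables as lists of dictionaries, and the sample gene names are all assumptions, not part of fgStatistics.

```python
def merge_horizontally(left_rows, right_rows, keys, new_columns):
    """Copy new_columns from right_rows into left_rows, matching on keys.

    Rows of the left table without a match get None in the new columns.
    """
    # index the right table by the tuple of key values (spot address)
    lookup = {tuple(row[k] for k in keys): row for row in right_rows}
    merged = []
    for row in left_rows:
        match = lookup.get(tuple(row[k] for k in keys), {})
        out = dict(row)
        for col in new_columns:
            out[col] = match.get(col)
        merged.append(out)
    return merged


# Hypothetical tables: Mci values on the left, gene names (ID) on the right.
mci_table = [
    {"R": 1, "C": 1, "MR": 1, "MC": 1, "Mci_1": 0.31},
    {"R": 1, "C": 2, "MR": 1, "MC": 1, "Mci_1": -1.20},
]
names_table = [
    {"R": 1, "C": 2, "MR": 1, "MC": 1, "ID": "geneB"},
    {"R": 1, "C": 1, "MR": 1, "MC": 1, "ID": "geneA"},
]
result = merge_horizontally(mci_table, names_table, ["R", "C", "MR", "MC"], ["ID"])
```

Because rows are matched by spot address rather than by position, the two tables do not need to be sorted in the same order.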
Gene expression matrix normalization
The matrix containing the intensity-corrected M values plus the gene identifiers is what we
call the first version of the gene expression matrix. This matrix still requires processing
before entering the expression analysis step.
Once the user clicks the OK button, a dialog appears requesting the user to choose the
chart type. The appropriate one in this case is Box-Plot (Figure 24).
The treatments compared in this experiment are the combinations of three genotypes
(WT, JM, and NM) evaluated at two times (T0 and T1). Because there are four biological
replicates, JM_T0_3 labels the third replicate of genotype JM evaluated at time T0.
As can be seen, even for replicates of the same treatment there are differences in
scale (see for example NM_T0_2 and NM_T0_3). This difference in scale must be
corrected, and its justification rests on the assumption that most genes remain at
the same level across the experimental conditions.
To equalize the scale of all microarrays in the experiment, the user chooses the
Normalize gene expression matrix menu. As in the previous case, the procedure
requests the user to identify which columns in the data matrix represent the microarrays.
Then a dialog asks the user to choose the normalization method. Additionally, he/she
can specify a transformation of the normalized data.
The normalized microarrays appear as columns added to the data table, with a
prefix that identifies the normalization applied. The box plots of the normalized microarrays
are shown in Figure 27. “Equalize all quantiles” is a commonly used normalization. However, it
can be somewhat “extreme”. The user can try others, like normal scores or simply
standardizing. When gene expression is measured by M, there is usually no need for
further transformation, as suggested in Figure 26 (identity transformation).
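To give an idea of what equalizing all quantiles does, the following Python sketch implements standard quantile normalization. It is an assumption that the “Equalize all quantiles” option corresponds to this textbook version; the function name is hypothetical, ties are broken by row order, and missing values are not handled.

```python
def quantile_normalize(columns):
    """Give every microarray (column) the same empirical distribution.

    columns: list of equal-length lists, one per microarray.
    The k-th smallest value of each column is replaced by the mean of
    the k-th smallest values across all columns.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # reference distribution shared by all arrays
    reference = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # row indices by rank
        new_col = [0.0] * n
        for rank, i in enumerate(order):
            new_col[i] = reference[rank]
        normalized.append(new_col)
    return normalized


# Two hypothetical microarrays measured on three genes.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step all arrays have identical box plots, which is why the method can be considered “extreme”: any real difference in distribution between arrays is removed along with the technical ones.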
[Box plot graphic. Vertical axis: Common scale, with ticks at 3.23, 0.82, -1.59, -4.00 and -6.41. Horizontal axis: the 24 normalized microarrays, labeled NQ_<genotype>_<time>_<replicate> for genotypes WT, JM and NM, times T0 and T1, and replicates 1 to 4.]
Figure 27: Box plots of gene expression for all normalized microarrays
In a microarray design there is usually more than one spot for every gene. These are
what we call technical replicates within the array. Before proceeding with the analysis you
have to average these replicates. This is done with the procedure “Summarize technical
replicates”. This procedure is applied to the normalized microarrays, and when calling it
the user has to specify the column containing the gene names
(because they are used to recognize the replicates of the same gene).
The specification of this procedure for the example we are using is presented in Figure
28. The result is a new data table with one gene per row, containing the average of its
technical replicates.
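Averaging technical replicates amounts to grouping rows by gene name and taking the mean of each microarray column within the group. The Python sketch below shows the idea under assumed conventions (hypothetical function name, columns stored as a dictionary of lists, None for missing values); it is not the fgStatistics code.

```python
from collections import defaultdict


def summarize_replicates(gene_names, columns):
    """Average the rows that share a gene name.

    gene_names: list with the gene name of each row.
    columns: dict mapping microarray name -> list of values (None = missing).
    Returns {gene: {microarray: mean of the observed replicates or None}}.
    """
    rows_by_gene = defaultdict(list)
    for i, name in enumerate(gene_names):
        rows_by_gene[name].append(i)
    summary = {}
    for name, rows in rows_by_gene.items():
        summary[name] = {}
        for array, values in columns.items():
            observed = [values[i] for i in rows if values[i] is not None]
            summary[name][array] = sum(observed) / len(observed) if observed else None
    return summary


# Hypothetical table: gene g1 is spotted twice on the array.
summary = summarize_replicates(
    ["g1", "g2", "g1"],
    {"NQ_WT_T0_1": [1.0, 5.0, 3.0], "NQ_WT_T0_2": [2.0, None, 4.0]},
)
```

Here the two g1 spots collapse into one row, and a gene whose only spot is missing on an array stays missing on that array.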
Figure 28: Selection variable dialog for the Summarize technical replicates procedure
Hereafter, the analysis of a gene expression matrix does not depend on the platform
used to generate it.
Sometimes, when the number of missing values is a small fraction of the data, the user
can use this procedure (with caution) to fill in the missing data. Each missing value is
filled with the corresponding entry of the gene most alike (in Euclidean distance) to the
one having the missing value. If the three most alike genes have missing data in that
entry, the procedure leaves the entry empty. When the procedure finishes, it reports the number
of missing values found and how many could be filled, and identifies them in the data table
by coloring the cells orange.
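One possible reading of this filling rule is sketched below in Python. Several details are assumptions made for the example and are not stated in this manual: distances are computed only over the entries that both genes have observed and are scaled by the number of such entries, and the three most alike genes are tried in order of increasing distance. The function name fill_missing is hypothetical.

```python
import math


def fill_missing(matrix, max_neighbours=3):
    """Fill None entries with the value of the most alike gene.

    matrix: list of gene rows (lists), with None marking missing values.
    Returns (filled copy, number of missing found, number filled).
    """
    def distance(u, v):
        # Euclidean-type distance over entries observed in both genes
        pairs = [(x, y) for x, y in zip(u, v) if x is not None and y is not None]
        if not pairs:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

    filled = [row[:] for row in matrix]
    found = n_filled = 0
    for g, row in enumerate(matrix):
        missing = [j for j, x in enumerate(row) if x is None]
        if not missing:
            continue
        # other genes ordered from most to least alike
        others = sorted((h for h in range(len(matrix)) if h != g),
                        key=lambda h: distance(row, matrix[h]))
        for j in missing:
            found += 1
            for h in others[:max_neighbours]:
                if matrix[h][j] is not None:
                    filled[g][j] = matrix[h][j]
                    n_filled += 1
                    break  # otherwise the entry is left empty
    return filled, found, n_filled


# Hypothetical matrix: the first gene lacks its third value.
matrix = [[1.0, 2.0, None], [1.0, 2.1, 3.0], [9.0, 9.0, 9.0]]
filled, found, n_filled = fill_missing(matrix)
```

The first gene is closest to the second one, so its missing entry receives the value 3.0 and the counts report one missing value found and one filled.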
Analysis
The main purpose of microarray data analysis is to discover genes that are differentially
expressed across the experimental conditions.
Probably the first step in discovering differentially expressed genes is to statistically test the
hypothesis of no change in gene expression among the “treatments” compared. This is
accomplished gene by gene using the “Gene by Gene F-test”. When the procedure is
called, the user specifies the microarrays (usually all of them) that he/she wants to
include in the analysis (Figure 29). It is important that the microarrays of the same
“treatment” be next to each other in the data table.
Figure 29: Selection variable dialog for the gene by gene F test
Once the variables have been selected, the next step is to assign the microarrays to
the different treatments. This is done by specifying the number of replicates of each
treatment (in the order in which the microarrays are listed), separated by commas. If all the
treatments are equally replicated, it is only necessary to specify that number, as shown
in Figure 30.
Once the user clicks the “GO” button, two new columns are added to the data file: one
containing the p-values, the other an estimate of the common within-treatment
variance (Figure 31).
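For a single gene, the calculation behind this test can be sketched in Python as a one-way analysis of variance. The sketch assumes the classical F statistic and uses hypothetical function names. It returns the F statistic, the within-treatment variance and the degrees of freedom; turning F into a p-value requires the upper tail of the F distribution, which is left to a statistics library.

```python
def parse_replicates(spec, n_arrays):
    """Turn a specification such as '4,4,3' or '4' into replicate counts."""
    counts = [int(x) for x in spec.split(",")]
    if len(counts) == 1:  # equal replication: one number for all treatments
        counts = counts * (n_arrays // counts[0])
    if sum(counts) != n_arrays:
        raise ValueError("replicate counts do not match the number of microarrays")
    return counts


def f_statistic(values, counts):
    """One-way ANOVA for one gene.

    values: expressions of the gene, with the microarrays of the same
    treatment next to each other; counts: replicates per treatment.
    Returns (F, within-treatment variance, (df between, df within)).
    Assumes the within-treatment variance is not zero.
    """
    groups, start = [], 0
    for c in counts:
        groups.append(values[start:start + c])
        start += c
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n - k)
    f = (ss_between / (k - 1)) / ms_within
    return f, ms_within, (k - 1, n - k)


# Hypothetical gene measured on three treatments with three replicates each.
expr = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
counts = parse_replicates("3", len(expr))
f, variance, df = f_statistic(expr, counts)
```

The grouping relies entirely on column order, which is why the microarrays of the same treatment must be adjacent in the data table.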
Adjust p-values
To avoid false positives (genes declared as differentially expressed when they are not)
there are several corrections to p-values. Those included in fgStatistics are the ones
implemented in the multtest package. Once the p-values have been calculated, they can be
adjusted (increased) using “Adjust p-values”. When the procedure is called, it requests
the user to identify the columns containing the p-values. Once the identification has
been done, a dialog requesting the user to choose the adjustment procedure is shown.
By default the Benjamini & Hochberg procedure is chosen. It is, in the experience of the
author, the most balanced procedure (false positives vs. false negatives). The output of this
procedure is the addition of new columns of adjusted p-values to the data table (one for
each adjustment method).
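As an illustration of how an adjustment increases the p-values, the following Python sketch computes Benjamini & Hochberg adjusted p-values in the usual step-up form. It is a stand-alone example with a hypothetical function name, not the multtest code that fgStatistics calls.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by rising p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        i = order[rank - 1]
        # p * m / rank, never exceeding the adjusted value of a larger p
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four values the adjusted p-values are 0.02, 0.04, 0.04 and 0.02: each raw p-value is multiplied by the number of tests divided by its rank, and the results are then forced to be non-decreasing in the raw p-value.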
Filter genes
Sometimes adjusting the p-values of the whole set of genes leads to a high rate of false
negatives (as high as 60% in some cases). For this reason it is better to filter the genes first,
retaining, for example, the 20% most relevant ones in discriminating the treatments. The
procedure is called in the same way as the “Gene by gene F test” and the output is a new
column containing the relevance of each gene. The filter implemented is RELIEF-F.
Using the Summary statistics menu, the user can calculate the upper percentile of his/her
choice (e.g. 80%) and use this value to deactivate (Search item in the Data menu) the cases with
relevance below that value. Then apply the p-value correction.
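The percentile-based deactivation step can be sketched in Python as follows. The relevance scores are taken as given (the filter itself is not implemented here), the function name is hypothetical, and the nearest-rank definition of a percentile is an assumption: the Summary statistics procedure may use a different definition and return a slightly different threshold.

```python
def relevance_filter(relevance, percentile=80):
    """Return (threshold, active flags) for a list of relevance scores.

    A gene stays active when its relevance is not below the threshold.
    """
    ranked = sorted(relevance)
    n = len(ranked)
    # nearest-rank percentile: position ceil(percentile * n / 100), 1-based
    idx = max(0, -((-percentile * n) // 100) - 1)
    threshold = ranked[idx]
    return threshold, [score >= threshold for score in relevance]


# Hypothetical relevance scores for ten genes.
scores = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
threshold, active = relevance_filter(scores, percentile=80)
```

Only the genes left active would then enter the p-value adjustment, which reduces the number of simultaneous tests.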
Group genes
One problem that often arises is to determine how many groups of genes, according to
their expression profiles, there are in the selected group of “candidate” genes. fgStatistics
implements the gDGC algorithm. gDGC (Valdano and Di Rienzo, 2007) calculates the
number of groups in a hierarchical structure by calculating the cutting point for a
dendrogram generated by a given linkage algorithm. The node at which two mean
vectors (or clusters of them) join has an associated measure that corresponds to the
distance, calculated according to the linkage algorithm, between the mean vectors or the
clusters that the node joins. The node at which all mean vectors join to form a
single cluster is the root node. In the UPGMA algorithm, if S_M and S_L are two
different clusters, the distance between them is defined as:
\[ q = q(S_M, S_L) = \frac{1}{\#(S_M)\,\#(S_L)} \sum_{\substack{y_i \in S_M \\ y_j \in S_L}} D_{ij} \]
where D_ij is the square root of the Mahalanobis distance between the mean vectors y_i and
y_j. If S_M and S_L are coincident, then q(S_M, S_M) = 0.
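The UPGMA distance between two clusters is simply the average of all the pairwise distances between their members. A minimal Python sketch follows, assuming the pairwise distances D_ij are already available in a matrix d; the function name and the sample matrix are hypothetical.

```python
def upgma_distance(cluster_m, cluster_l, d):
    """Average pairwise distance q between two clusters.

    cluster_m, cluster_l: lists of indices of mean vectors.
    d: square matrix (list of lists) with d[i][j] = D_ij.
    """
    if set(cluster_m) == set(cluster_l):
        return 0.0  # coincident clusters: q is 0 by definition
    total = sum(d[i][j] for i in cluster_m for j in cluster_l)
    return total / (len(cluster_m) * len(cluster_l))


# Hypothetical distances among three mean vectors.
d = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
q = upgma_distance([0, 1], [2], d)
```

Here the cluster formed by the first two mean vectors lies at distance (4 + 3) / 2 = 3.5 from the third one.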
The smallest value of D_ij corresponds to the pair of most similar mean vectors, and
the node that is formed is at a distance q_1 from the origin. The following distance,
q_2, is associated with the next node, which can join two different mean vectors or the
cluster previously formed and another mean vector. At the end of the clustering
algorithm, the last union is at distance q_{k-1}, which is referred to as the distance to
the root node (Figure 32). This distance can be seen as a realization of a random
variable Q. The (1-α)-quantile of its distribution under the null hypothesis of equal
population mean vectors can be used to construct a test of size α. Given Q_{1-α} as the
α-level critical value, all Q ≥ Q_{1-α} lead to the rejection of the null hypothesis.
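[Dendrogram graphic. Vertical axis: Distance, marking the node heights q1, q2 and q3, the cut-off Q1-α, and the root node at Q. Leaves are labeled A A A B C C.]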
Figure 32: Dendrogram showing the relationships among mean vectors. Cut-off criterion
obtained with the gDGC test –Q1-α– is indicated with a dotted line. At the bottom of the figure,
different letters identify groups statistically differing in the population centroids at a
significance level α.
The output is a list of genes with the corresponding group label.
This procedure requires the specification of the list of microarrays and the identification
of the column containing the gene names. The format of the variable selection dialog is
the same as the one shown in Figure 28. For operational reasons the algorithm is
restricted to a maximum of 500 genes. The output is a heatmap with an associated dialog
that allows the user to customize the graph iteratively (Figure 33).
The same technique applied to the genes can be applied to the microarrays. This is useful to
get a picture of how consistent the biological replicates are. Sometimes a microarray
has defects that are not easy to detect gene by gene, but when the array is considered
as a whole it becomes evident that it has a problem.
Ct mean refers to the average of Ct across the biological replicates. Usually you
will not have a “control” but a set of “treatments” among which you want to compare the
gene expressions. For this reason, this procedure uses each “treatment” as the control
sample, one at a time. The sample data set includes a file called RTPCRexample.FGDB
(Figure 34). The data table has one column that identifies the “treatments”, three columns
labeled T1, T2 and T3 that hold the technical replicates of the target gene, and then a
column labeled Etarget that contains the efficiency of RT replication. The next three
columns contain the Cts of the reference gene. The example contains three treatments, A, B
and C, with 4, 4 and 3 biological replicates respectively.
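For orientation, the relative expression ratio used by REST (Pfaffl et al., 2002) can be sketched in Python. Because fgStatistics implements a generalization of that test, this is only the basic ratio and not necessarily the exact quantity the program reports. The function name and the Ct values are hypothetical, and efficiencies are assumed to be expressed as amplification factors (2 meaning perfect doubling per cycle).

```python
def mean(values):
    return sum(values) / len(values)


def expression_ratio(e_target, e_ref,
                     ct_target_control, ct_target_sample,
                     ct_ref_control, ct_ref_sample):
    """Efficiency-corrected expression ratio of a sample relative to a control.

    ratio = E_target ** dCt_target / E_ref ** dCt_ref, where each dCt is
    (mean Ct of the control) - (mean Ct of the sample).
    """
    d_target = mean(ct_target_control) - mean(ct_target_sample)
    d_ref = mean(ct_ref_control) - mean(ct_ref_sample)
    return (e_target ** d_target) / (e_ref ** d_ref)


# Hypothetical Cts: the target appears two cycles earlier in the sample,
# while the reference gene does not change.
ratio = expression_ratio(
    2.0, 2.0,
    ct_target_control=[25.0, 25.0], ct_target_sample=[23.0, 23.0],
    ct_ref_control=[20.0, 20.0], ct_ref_sample=[20.0, 20.0],
)
```

With these values the ratio is 4: the target gene is four times more abundant in the sample than in the control. Swapping the roles of control and sample gives the reciprocal, 0.25.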
When the user calls the procedure, a dialog appears; it should be filled in as shown in
Figure 35. Once the user clicks OK, a new dialog of optional settings appears (Figure
36). This dialog lets the user specify the efficiencies, if they were not included in the first
dialog (Figure 35), the number of samples used to perform the permutation test, and the
way the samples are obtained. The last specification has three options (None, At random,
Paired replicates). If you choose “None”, the p-values are obtained under normality
assumptions. If you choose “At random”, the p-values are calculated by randomly
selecting combinations of biological and technical replicates, following the permutation
test paradigm. If you choose “Paired replicates”, the program performs a
permutation test, but biological replicates in the same row of the data file are chosen
together. This way of sampling is intended for the case in which the researcher obtained the
biological replicates of the reference and the target in such a way that they are correlated.
Figure 35: RT-PCR calling dialog showing the way the data in the example have to be used
Figure 36: RT-PCR calling dialog options. Target and Reference efficiencies cannot
be specified because they are read from the data
RT-PCR Comparisons
Resampling cycles=5000
Ratio DCP.Target(Control-Sample)/DCP.Reference(Control-Sample)
Ctrl\Samp   SC      SB      SA
CC                  1.45    2.23
CB          0.69            1.53
CA          0.45    0.65
P-value matrix
Resampling cycles: 5000
Ctrl\Samp   SC      SB
CB          0.1104
CA          0.0182  0.0546
References
Pfaffl M.W., Horgan G.W., Dempfle L. (2002). Relative expression software tool
(REST©) for group-wise comparison and statistical analysis of relative
expression results in real-time PCR. Nucleic Acids Research, Vol. 30,
No. 9, e36.