Omi 2014 0062
Abstract
Multi-omics research is a key ingredient of data-intensive life sciences research, permitting measurement of
biological molecules at different functional levels in the same individual. For a complete picture at the biological
systems level, appropriate statistical techniques must however be developed to integrate different omics data sets
(e.g., genomics and proteomics). We report here multivariate projection-based approaches to the analysis of genomics
and proteomics data sets, applied as a case study to kidney transplant patients
who experienced an acute rejection event (n = 20) versus non-rejecting controls (n = 20). In these data sets, we show
how these novel methodologies might serve as promising tools for dimension reduction and selection of relevant
features for different analytical frameworks. Unsupervised analyses highlighted the importance of post-transplant
time-of-rejection, while supervised analyses identified gene and protein signatures that together predicted rejection status with little time effect. The selected genes are part of biological pathways that are representative of
immune responses. Gene enrichment profiles revealed increases in innate immune responses and neutrophil
activities and a depletion of T lymphocyte related processes in rejection samples as compared to controls. In all,
this article offers candidate biomarkers for future detection and monitoring of acute kidney transplant rejection, as
well as ways forward for methodological advances to better harness multi-omics data sets.
NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, British Columbia, Canada.
2 Gunther Analytics, Vancouver, British Columbia, Canada.
3 Department of Pathology and Laboratory Medicine, 6 Medicine, 8 Computer Science, 9 Medical Genetics, 7 James Hogg Research Centre,
St. Paul's Hospital, 11 Department of Medicine, Division of Respiratory Medicine, University of British Columbia, Vancouver, British
Columbia, Canada.
4 Immunity and Infection Research Centre, Vancouver, British Columbia, Canada.
5 Immunology Laboratory, Vancouver General Hospital, Vancouver, British Columbia, Canada.
10 Institute for HEART + LUNG Health, Vancouver, British Columbia, Canada.
12 Queensland Facility for Advanced Bioinformatics and Institute for Molecular Bioscience, 13 Queensland Diamantina Institute,
Translational Research Institute, The University of Queensland, Brisbane, Australia.
Introduction
High-throughput omics approaches generate large amounts of data, and deciphering complex biological systems from these data requires significant statistical
and computational breakthroughs. Several statistical approaches have been
proposed in the literature for the integration of two or more
high-throughput data sets. These include projection-based
multivariate approaches for biological exploration and ensemble classifiers for biomarker development and medical
decision making (Gunther et al., 2012; Le Cao et al., 2007).
Ensemble classifiers combine separately developed,
platform-specific classifiers using different combination rules
(Polikar, 2006). A popular rule is majority vote where the
predicted class is simply the one that is called by the majority
of classifiers in an ensemble. Integration of information from
different biological entities in ensemble classifiers happens
after platform-specific analyses are performed. This is in
contrast to projection-based multivariate approaches that are
discussed in this article, which integrate data from different
platforms at the analysis level.
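The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration only: the function name, the tie-breaking rule, and the example calls are assumptions made for this sketch and are not taken from the cited ensemble pipeline.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class called by the largest number of classifiers.

    `predictions` holds one predicted label per platform-specific classifier.
    Ties are broken in favour of the label that appears first in the list,
    which is an arbitrary choice made for this sketch.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical calls from three platform-specific classifiers for one sample.
calls = ["AR", "NR", "AR"]
ensemble_call = majority_vote(calls)
print(ensemble_call)
```

Note that each classifier is trained separately before the vote, so no information is shared between platforms during model fitting.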
Projection-based multivariate approaches handle large data sets efficiently, including those where the number
of biological features is much larger than the number of
samples, by projecting the data into a smaller subspace while
capturing the largest sources of variation in the biological
studies. During this statistical integration process, these
approaches produce a snapshot of the data and highlight
the largest sources of variation. However, when each functional level is summarized by a
large number of biological entities, a projection of the data into a smaller subspace
might not be sufficient to extract relevant information (i.e.,
which genes and which proteins are relevant and are acting in
concert?).
In recent years, several variants of these statistical integrative approaches have been proposed to perform variable
selection and highlight the variables contributing to the largest variation in the data (Chun and Keleş, 2010; Le Cao et al., 2008,
2009; Parkhomenko et al., 2009; Waaijenborg et al., 2008).
The approaches are based on the Partial Least Squares (PLS) regression methodology, which enables the integration
of two data sets in a statistical sense: each data set is projected
into a smaller subspace so that the covariance or the correlation between both data sets is maximized. The improvement that these authors propose is to perform variable
selection, so that biological entities from both data sets that
are correlated with each other are directly extracted by the
methods. However, very few approaches have been proposed
so far to both integrate more than two data sets and select
variables. Witten and Tibshirani (2009) proposed to concatenate all data sets with an appropriate weight applied to each
of them. Recently, a promising approach based on regularized generalized Canonical Correlation Analysis (rGCCA)
was proposed by Tenenhaus and Tenenhaus (2011) as a
generalization of the PLS approaches to more than two data
sets. It maximizes the sum of the correlations computed in a pairwise
fashion between two data sets at a time, and was followed by a variant
that enables variable selection (Tenenhaus et al., 2014).
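To make the covariance-maximization idea concrete, the sketch below computes the first pair of PLS weight vectors for two small data sets. It assumes samples in rows and relies on the standard result that the unit-norm weights maximizing cov(Xu, Yv) are the leading singular vectors of X^T Y, found here by power iteration. The optional soft-thresholding step is one common device for obtaining sparse weights; it is shown as an illustration and is not claimed to reproduce the exact algorithms of the cited papers. All function names and the toy data are hypothetical.

```python
import math


def center_columns(X):
    """Subtract each column mean so that cross-products act as covariances."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]


def cross_product(X, Y):
    """Return M = X^T Y (p x q) for X (n x p) and Y (n x q)."""
    p, q = len(X[0]), len(Y[0])
    return [[sum(xr[i] * yr[j] for xr, yr in zip(X, Y)) for j in range(q)]
            for i in range(p)]


def normalize(v):
    """Scale v to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else v


def soft_threshold(v, lam):
    """Shrink weights toward zero; weights no larger than lam become 0."""
    return [math.copysign(max(abs(a) - lam, 0.0), a) for a in v]


def first_pls_weights(X, Y, lam_x=0.0, lam_y=0.0, n_iter=200):
    """Alternate u <- M v and v <- M^T u, normalizing after each update.

    With lam_x = lam_y = 0 this is power iteration on M = X^T Y, which
    converges to its leading singular vectors. Positive penalties add a
    soft-thresholding step that sets small weights exactly to zero.
    """
    M = cross_product(center_columns(X), center_columns(Y))
    p, q = len(M), len(M[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        u = normalize(soft_threshold(
            [sum(M[i][j] * v[j] for j in range(q)) for i in range(p)], lam_x))
        v = normalize(soft_threshold(
            [sum(M[i][j] * u[i] for i in range(p)) for j in range(q)], lam_y))
    return u, v


# Toy data: 5 samples, 3 features in X and 2 features in Y (hypothetical).
X = [[1.0, 0.2, 0.1], [2.0, 0.1, -0.1], [3.0, 0.3, 0.0],
     [4.0, 0.2, 0.1], [5.0, 0.1, -0.1]]
Y = [[1.1, 0.5], [1.9, 0.4], [3.2, 0.6], [3.8, 0.5], [5.1, 0.4]]

u_dense, v_dense = first_pls_weights(X, Y)
u_sparse, v_sparse = first_pls_weights(X, Y, lam_x=0.5, lam_y=0.5)
print("dense weights: ", u_dense, v_dense)
print("sparse weights:", u_sparse, v_sparse)
```

In this toy example the first feature of each data set carries almost all of the shared variation, so the penalized run keeps only those two features, which is the behaviour that variable selection is meant to deliver.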
In this article, we illustrate the usefulness and biological
relevance of selected multivariate approaches from Le Cao
et al. (2008, 2011) and Tenenhaus et al. (2014) on a clinically relevant biological example, which is an acute renal
allograft rejection study from the Biomarkers in Transplantation study. Kidney transplantation is a means to restore
AR (n = 20)
NR (n = 20)
45.74 (12.10)
49.04 (9.30)
6 (30%)
14 (70%)
8 (40%)
12 (60%)
19 (95%)
1 (5%)
0 (0%)
16 (80%)
3 (15%)
1 (5%)
13 (65%)
7 (35%)
13 (65%)
7 (35%)
Description of the genomics and proteomics experiments. Peripheral blood samples were drawn into PAXgene
GUNTHER ET AL.
Principal Component Analysis. Three principal components explained 57.2% and 31.6% of the total variance in the
genomics and proteomics data, respectively (Supplementary
Fig. S2). These rather low percentages give a first hint of the
amount of information that can be summarized from each
data source in a small number of components. With three principal components, the proteomics data seem to contain less relevant information than
the genomics data. No clear
separation between the AR and the NR samples was observed
in the PCA sample plots (Supplementary Fig. S3). Interestingly, however, most of the late samples (> 2 weeks post-transplantation) clustered separately from the corresponding
early samples.
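As a small worked illustration of how such percentages are obtained, the proportion of variance explained by the first k components is the sum of the k largest component variances divided by the total variance over all components. The sketch below uses hypothetical component variances, not values from this study, and the function name is an assumption.

```python
def cumulative_explained_variance(component_variances):
    """Cumulative share of total variance captured by the leading components.

    The total is taken over all components, not only the retained ones.
    """
    total = sum(component_variances)
    ordered = sorted(component_variances, reverse=True)
    shares, running = [], 0.0
    for var in ordered:
        running += var
        shares.append(running / total)
    return shares


# Hypothetical component variances (eigenvalues of a covariance matrix).
variances = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
shares = cumulative_explained_variance(variances)
print(f"First three components: {shares[2]:.1%} of total variance")
```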
Independent Principal Component Analysis. IPCA
separated the two groups of samples more clearly than
PCA and with fewer components (Fig. 2). The main separation
between AR and NR can be observed on the first component in
the genomics data but less so in the proteomics data. Most
importantly, the IPCA results support the earlier observation
that the later samples cluster separately from the earlier ones
within their respective group (Supplementary Fig. S3).
Supervised analysis of each separate data set
FIG. 2. IPCA sample representation. Samples were projected on the first two independent principal components for
the genomics data (a) and for the proteomics data (b).
samples by means of classification error, sensitivity, specificity, and AUC and compared with PLS-DA, which includes
all variables in the model (Table 2). For both platforms, the
sparse version of PLS-DA performed better than the non-sparse
version. In the proteomics data, the PLS-DA method produced a high CE of 43% and an AUC close to 0.5, which is
no better than a random classifier, compared with a CE of 29%
(AUC = 0.76) for sPLS-DA with 21 selected proteins. The genomics classifiers performed better than the proteomics classifiers
(AUC = 0.90 with the selected 90 genes).
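For readers less familiar with these performance measures, the sketch below shows how classification error, sensitivity, and specificity follow from the counts of correct and incorrect calls, with AR treated as the positive class. The counts used are assumptions back-calculated from the rounded rates reported in Table 2 for the genomics sPLS-DA classifier on the 14-sample test set (7 AR and 7 NR); the AUC helper uses its rank-based definition with hypothetical scores. Function names are illustrative.

```python
def classification_summary(tp, fn, tn, fp):
    """Return classification error (CE), sensitivity and specificity.

    AR (rejection) is treated as the positive class:
      tp = AR called AR, fn = AR called NR,
      tn = NR called NR, fp = NR called AR.
    """
    n = tp + fn + tn + fp
    return {
        "CE": (fn + fp) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(pos_scores, neg_scores):
    """Probability that a random AR sample outscores a random NR sample.

    Ties between an AR and an NR score count as one half.
    """
    wins = sum((p > q) + 0.5 * (p == q)
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Assumed counts: 5 of 7 AR and 7 of 7 NR test samples called correctly.
summary = classification_summary(tp=5, fn=2, tn=7, fp=0)
print({name: round(value, 2) for name, value in summary.items()})

# Hypothetical classifier scores (higher means more AR-like).
print(auc([0.9, 0.8, 0.4], [0.5, 0.2, 0.1]))
```

With these assumed counts the rounded values are a CE of 0.14, a sensitivity of 0.71, and a specificity of 1, in line with the reported genomics sPLS-DA results.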
Integration of the two data sets
Even though genomics and proteomics data were extracted from blood samples, the two data sources are de-
graphical representations of the correlations between the selected variables and the sPLS components. These plots (whose
interpretation is detailed in Gonzalez et al., 2012) help to
visualize the correlation between the two types of selected
variables [clusters of features close to the circle of radius 1,
Supplementary Fig. S6 (b)]. Our observations were twofold:
first, we did not observe a very strong correlation between
genes and proteins; second, the sample representation did not
highlight a clear separation between the two groups of patients;
rather, the first sPLS dimension seemed to separate the samples according to the time of rejection (early vs. late).
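The coordinates used in such correlation circle plots can be sketched as follows: each variable is placed at its Pearson correlation with the first and second component, so that a variable lying close to the circle of radius 1 is well represented by the two components. The component scores and the variable below are hypothetical, and the function names are assumptions made for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def circle_coordinates(variable, comp1, comp2):
    """Coordinates of one variable on the correlation circle."""
    return pearson(variable, comp1), pearson(variable, comp2)


# Hypothetical, mutually uncorrelated component scores for six samples.
comp1 = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
comp2 = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0]
# Hypothetical variable that closely follows the first component.
gene = [-1.9, -1.1, 0.1, -0.1, 0.9, 2.1]

x, y = circle_coordinates(gene, comp1, comp2)
radius = math.hypot(x, y)
print(f"x = {x:.3f}, y = {y:.3f}, distance from origin = {radius:.3f}")
```

Because the two components are uncorrelated, the distance from the origin cannot exceed 1; variables from both data sets that fall close together near the circle are the ones that are strongly and positively correlated.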
Supervised analysis of the integrated data sets with
sGCCA. Two types of design were investigated with sGCCA.
Table 2. Summary of Classification Performance for sPLS-DA and PLS-DA Classifiers Trained
on Full 26-Sample Training Set as Determined on a 14-Sample Testing Set (7 AR and 7 NR)
Platform | Classifier | Number of selected variables | Classification error rate (CE) | Sensitivity | Specificity | Area under curve
Genomics | sPLS-DA | 90 | 0.14 | 0.71 | 1 | 0.90
Genomics | PLS-DA | 27,306 (all) | 0.21 | 0.57 | 1 | 0.82
Proteomics | sPLS-DA | 21 | 0.29 | 0.57 | 0.86 | 0.76
Proteomics | PLS-DA | 133 (all) | 0.43 | 0.43 | 0.71 | 0.55
FIG. 4. sGCCA analysis with design 2. Sample representation on the genomics space (a) and the proteomics
space (b) for the first two dimensions.
Probe Set ID | Gene symbol | Regulation in AR
234312_s_at | ACSS2 | Up
228758_at | BCL6 | Up
202592_at | BLOC1S1 | Up
204495_s_at | C15orf39 | Up
1554016_a_at | C16orf57 | Up
235568_at | C19orf59 | Up
212463_at | CD59 | Up
208052_x_at | CEACAM3 | Up
210789_x_at | CEACAM3 | Up
219183_s_at | CYTH4 | Up
214017_s_at | DHX34 | Up
201536_at | DUSP3 | Up
222483_at | EFHD2 | Up
221755_at | EHBP1L1 | Up
216950_s_at | FCGR1A /// FCGR1C | Up
214511_x_at | FCGR1B | Up
205418_at | FES | Up
208749_x_at | FLOT1 | Up
210142_x_at | FLOT1 | Up
220404_at | GPR97 | Up
224807_at | GRAMD1A | Up
205936_s_at | HK3 | Up
39402_at | IL1B | Up
210184_at | ITGAX | Up
210629_x_at | LST1 | Up
211581_x_at | LST1 | Up
211582_x_at | LST1 | Up
215633_x_at | LST1 | Up
218376_s_at | MICAL1 | Up
205323_s_at | MTF1 | Up
219862_s_at | NARF | Up
233072_at | NTNG2 | Up
238327_at | ODF3B | Up
1554503_a_at | OSCAR | Up
219394_at | PGS1 | Up
219066_at | PPCDC | Up
201482_at | QSOX1 | Up
225251_at | RAB24 | Up
217762_s_at | RAB31 | Up
240862_at | RASGRP4 | Up
217728_at | S100A6 | Up
203535_at | S100A9 | Up
205241_at | SCO2 | Up
209370_s_at | SH3BP2 | Up
211250_s_at | SH3BP2 | Up
210569_s_at | SIGLEC9 | Up
220371_s_at | SLC12A9 | Up
204099_at | SMARCD3 | Up
204858_s_at | TYMP | Up
203234_at | UPP1 | Up
229743_at | ZNF438 | Up
1552497_a_at | SLAMF6 | Down
1553681_a_at | PRF1 | Down
202206_at | ARL4C | Down
205171_at | PTPN4 | Down
(continued)
Table 3. (Continued)
Probe Set ID | Gene title | Gene symbol | Regulation in AR
205291_at | interleukin 2 receptor, beta | IL2RB | Down
205758_at | CD8a molecule | CD8A | Down
210606_x_at | killer cell lectin-like receptor subfamily D, member 1 | KLRD1 | Down
210972_x_at | T cell receptor alpha locus /// T cell receptor alpha constant /// T cell receptor alpha joining 17 /// T cell receptor alpha variable 20 | TRA@ /// TRAC /// TRAJ17 /// TRAV20 | Down
211597_s_at | HOP homeobox | HOPX | Down
212656_at | Ts translation elongation factor, mitochondrial | TSFM | Down
216920_s_at | TCR gamma alternate reading frame protein /// T cell receptor gamma constant 2 | TARP /// TRGC2 | Down
219034_at | poly (ADP-ribose) polymerase family, member 16 | PARP16 | Down
221011_s_at | limb bud and heart development homolog (mouse) | LBH | Down
222482_at | single stranded DNA binding protein 3 | SSBP3 | Down
Only the probe sets most differentially expressed between AR and NR are shown (p value < 0.05, FDR correction).
Tables 5 and 6 show the overlap between genes and between proteins for all three integrative analyses. Also included for comparison are the 90 probe-sets and 21 protein
group lists returned by sPLS-DA (see Supplementary Tables
T2 and T3) when applied to the genomics and proteomics
data set separately. For the genomics data, we observed little
overlap between the lists for sPLS and the sGCCA models,
refseq
Regulation in AR
SERPING1 | NM_000062
F13A1 | NM_000129
APOA4 | NM_000482
SERPINC1 | NM_000488
F2 | NM_000506
GC | NM_000583
SERPINA5 | NM_000624
AFM | NM_001133
F13B | NM_001994
ADIPOQ | NM_004797
SERPINA4 | NM_006215
PROC | NM_000312
SHBG | NM_001040
CST3 | NM_000099
PON1 | NM_000446
CANX | NM_001024649
LBP | NM_004139
ARNTL2 | NM_020183
down
 | sPLS | sGCCA D1 | sGCCA D2 | sPLS-DA
sPLS | 33 | 5 | 0 | 0
sGCCA D1 | 5 | 46 | 3 | 5
sGCCA D2 | 0 | 3 | 41 | 24
sPLS-DA | 0 | 5 | 24 | 90
 | sPLS | sGCCA D1 | sGCCA D2 | sPLS-DA
sPLS | 38 | 35 | 23 | 8
sGCCA D1 | 35 | 64 | 43 | 11
sGCCA D2 | 23 | 43 | 60 | 17
sPLS-DA | 8 | 11 | 17 | 21
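Overlap counts of the kind tabulated above can be obtained by intersecting the feature lists selected by each method. The following is a minimal sketch with placeholder feature identifiers and a hypothetical function name; it does not use the actual selected probe sets or proteins.

```python
def overlap_matrix(selections):
    """Pairwise overlap counts between named feature lists.

    The diagonal holds the size of each list; an off-diagonal cell counts
    the features selected by both methods.
    """
    names = list(selections)
    sets = {name: set(features) for name, features in selections.items()}
    return {a: {b: len(sets[a] & sets[b]) for b in names} for a in names}


# Hypothetical selections; the identifiers are placeholders.
selections = {
    "sPLS": ["g1", "g2", "g3"],
    "sGCCA D1": ["g2", "g3", "g4", "g5"],
    "sPLS-DA": ["g3", "g5"],
}
matrix = overlap_matrix(selections)
for name, row in matrix.items():
    print(name, row)
```

The resulting matrix is symmetric by construction, which offers a quick consistency check when such tables are assembled by hand.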
FIG. 6. Gene enrichment profile of upregulated genes in sGCCA-D1. Red cells indicate genes that are enriched (known
to be highly expressed) in the corresponding tissue cell-types.
FIG. 7. Gene enrichment profile of downregulated genes in sGCCA-D1. Red cells indicate genes that are enriched
(known to be highly expressed) in the corresponding tissue cell-types.
potential decrease in complement activation as a consequence might therefore indicate reduced protective effects of
these proteins on ischemic tissue injuries.
Alternatively, downregulation of lymphocyte activities
in AR patient samples might be due to uremic conditions
brought on by kidney failure, as is implied by the increase in
patient creatinine levels. The association between uremia and
depletion of lymphocyte activities and proliferation is well
documented (Hauser et al., 2008; Kato et al., 2008; Nakai
et al., 1992).
Conclusions
References
Altman RB. (2013). Personal genomic measurements: The opportunity for information integration. Clin Pharmacol Therapeut 93, 21-23.
Beißbarth T, and Speed TP. (2004). GOstat: Find statistically
overrepresented Gene Ontologies within a group of genes.
Bioinformatics 20, 1464-1465.
Benita Y, Cao Z, Giallourakis C, Li C, Gardet A, and Xavier
RJ. (2010). Gene enrichment profiles reveal T-cell development, differentiation, and lineage-specific transcription factors including ZBTB25 as a novel NF-AT repressor. Blood
115, 5376-5384.
Bolstad B, Collin F, Brettschneider J, et al. (2005). Quality
assessment of Affymetrix GeneChip data. Bioinformatics and
Computational Biology Solutions Using R and Bioconductor,
33-47.
Buerke M, Murohara T, and Lefer AM. (1995). Cardioprotective effects of a C1 esterase inhibitor in myocardial ischemia
and reperfusion. Circulation 91, 393-402.
Chun H, and Keleş S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable
selection. J Royal Stat Soc Series B, Statistical Methodology,
72, 3-25.
De Perrot M, Liu M, Waddell TK, and Keshavjee S. (2003).
Ischemia-reperfusion-induced lung injury. Am J Resp Crit
Care Med 167, 490-511.
Eror AT, Stojadinovic A, Starnes BW, Makrides SC, Tsokos
GC, and Shea-Donohue T. (1999). Antiinflammatory effects
of soluble complement receptor type 1 promote rapid recovery of ischemia/reperfusion injury in rat small intestine.
Clin Immunol 90, 266-275.
Freue GVC, Sasaki M, Meredith A, et al. (2010). Proteomic
signatures in plasma during early acute renal allograft rejection. Mol Cell Proteomics 9, 1954-1967.
Gomez-Cabrero D, Abugessaisa I, Maier D, et al. (2014). Data
integration in the era of omics: Current and future challenges.
BMC Systems Biol 8, I1.
Gonzalez I, Le Cao KA, Davis MJ, and Dejean S. (2012). Visualising associations between paired omics data sets. BioData Mining 5, 19.
Gunther OP, Balshaw RF, Scherer A, et al. (2009). Functional
genomic analysis of peripheral blood during early acute renal
allograft rejection. Transplantation 88, 942-951.
Gunther OP, Chen V, Freue GC, et al. (2012). A computational
pipeline for the development of multi-marker bio-signature
panels and ensemble classifiers. BMC Bioinformatics 13, 326.
Harbron C, Chang KM, and South MC. (2007). RefPlus: An R
package extending the RMA algorithm. Bioinformatics 23,
2493-2494.
Hauser AB, Stinghen AEM, Kato S, et al. (2008). Characteristics and causes of immune dysfunction related to uremia and
dialysis. Periton Dialysis Intl 28, S183-S187.
Jolliffe I. (2002). Principal Component Analysis. Springer Series
in Statistics, Springer, New York.
Kato S, Chmielewski M, Honda H, et al. (2008). Aspects of
immune dysfunction in end-stage renal disease. Clin J Am
Soc Nephrol 3, 1526-1533.
Kreisel D, Sugimoto S, Tietjens J, et al. (2011a). Bcl3 prevents
acute inflammatory lung injury in mice by restraining emergency granulopoiesis. J Clin Invest 121, 265-276.
Kreisel D, Sugimoto S, Zhu J, et al. (2011b). Emergency granulopoiesis promotes neutrophil-dendritic cell encounters that
prevent mouse lung allograft acceptance. Blood 118, 6172-6182.
Le Cao KA, Besse P, and Goncalvez O. (2007). Selection of
biologically relevant genes with a wrapper stochastic algorithm. Stat App Genetics Mol Biol 6, 29.
Le Cao KA, Boitard S, and Besse P. (2011). Sparse PLS discriminant analysis: Biologically relevant feature selection and
graphical displays for multiclass problems. BMC Bioinformatics 12, 253.
Le Cao KA, Gonzalez I, and Dejean S. (2009). IntegrOmics: an
R package to unravel relationships between two omics datasets. Bioinformatics 25, 2855-2856.