
Multivariate Data Integration Using R
Computational Biology Series
About the Series:

This series aims to capture new developments in computational biology, as well as high-quality work summarizing or contributing to more established topics. Publishing a broad range of reference works, textbooks, and handbooks, the series is designed to appeal to students, researchers, and professionals in all areas of computational biology, including genomics, proteomics, and cancer computational biology, as well as interdisciplinary researchers involved in associated fields, such as bioinformatics and systems biology.

Introduction to Bioinformatics with R: A Practical Guide for Biologists


Edward Curry
Analyzing High-Dimensional Gene Expression and DNA Methylation Data with R
Hongmei Zhang
Introduction to Computational Proteomics
Golan Yona

Glycome Informatics: Methods and Applications


Kiyoko F. Aoki-Kinoshita

Computational Biology: A Statistical Mechanics Perspective


Ralf Blossey

Computational Hydrodynamics of Capsules and Biological Cells


Constantine Pozrikidis

Computational Systems Biology Approaches in Cancer Research


Inna Kuperstein, Emmanuel Barillot

Clustering in Bioinformatics and Drug Discovery


John David MacCuish, Norah E. MacCuish

Metabolomics: Practical Guide to Design and Analysis


Ron Wehrens, Reza Salek

An Introduction to Systems Biology: Design Principles of Biological Circuits


2nd Edition
Uri Alon

Computational Biology: A Statistical Mechanics Perspective


Second Edition
Ralf Blossey

Stochastic Modelling for Systems Biology


Third Edition
Darren J. Wilkinson

Computational Genomics with R


Altuna Akalin, Bora Uyar, Vedran Franke, Jonathan Ronen
An Introduction to Computational Systems Biology: Systems-level Modelling of Cellular Networks
Karthik Raman

Virus Bioinformatics
Dmitrij Frishman, Manuela Marz

Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package
Kim-Anh Lê Cao, Zoe Marie Welham
Bioinformatics
A Practical Guide to NCBI Databases and Sequence Alignments
Hamid D. Ismail
For more information about this series please visit:
https://www.routledge.com/Chapman--HallCRC-Computational-Biology-Series/book-series/CRCCBS
Multivariate Data Integration Using R
Methods and Applications with the mixOmics Package

Kim-Anh Lê Cao
Zoe Welham
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 9780367460945 (hbk)


ISBN: 9781032128078 (pbk)
ISBN: 9781003026860 (ebk)

DOI: 10.1201/9781003026860

This book has been prepared from camera-ready copy provided by the authors.
From Kim-Anh Lê Cao:
To my parents, Betty and Huy Lê Cao
To the mixOmics team and our mixOmics users,
And to my co-author Zoe Welham without whom this book would not have existed.

For Zoe Welham:


To Joy and Tamara Welham, for their continual support
Contents

Preface xv

Authors xxi

I Modern biology and multivariate analysis 1


1 Multi-omics and biological systems 3
1.1 Statistical approaches for reductionist or holistic analyses . . . . . . . . . . 3
1.2 Multi-omics and multivariate analyses . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 More than a ‘scale up’ of univariate analyses . . . . . . . . . . . . . 5
1.2.2 More than a fishing expedition . . . . . . . . . . . . . . . . . . . . 5
1.3 Shifting the analysis paradigm . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Challenges with high-throughput data . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Multi-collinearity and ill-posed problems . . . . . . . . . . . . . . . 7
1.4.3 Zero values and missing values . . . . . . . . . . . . . . . . . . . . 7
1.5 Challenges with multi-omics integration . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Data heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.2 Data size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.3 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.4 Expectations for analysis . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.5 Variety of analytical frameworks . . . . . . . . . . . . . . . . . . . 9
1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 The cycle of analysis 11


2.1 The Problem guides the analysis . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Plan in advance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 What affects statistical power? . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Identify covariates and confounders . . . . . . . . . . . . . . . . . . 13
2.2.4 Identify batch effects . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Data cleaning and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Analysis: Choose the right approach . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Exploratory statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Inferential statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.4 Univariate or multivariate modelling? . . . . . . . . . . . . . . . . . 16
2.4.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.5 Conclusion and start the cycle again . . . . . . . . . . . . . . . . . . . . . . 18


2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Key multivariate concepts and dimension reduction in mixOmics 19


3.1 Measures of dispersion and association . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Random variables and biological variation . . . . . . . . . . . . . . 19
3.1.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.5 Covariance and correlation in mixOmics context . . . . . . . . . . . 22
3.1.6 R examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Matrix factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Factorisation with components and loading vectors . . . . . . . . . 24
3.2.3 Data visualisation using components . . . . . . . . . . . . . . . . . 24
3.3 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Ridge penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Lasso penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Elastic net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.4 Visualisation of the selected variables . . . . . . . . . . . . . . . . . 26
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Choose the right method for the right question in mixOmics 29


4.1 Types of analyses and methods . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Single or multiple omics analysis? . . . . . . . . . . . . . . . . . . . 29
4.1.2 N − or P −integration? . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.3 Unsupervised or supervised analyses? . . . . . . . . . . . . . . . . . 31
4.1.4 Repeated measures analyses . . . . . . . . . . . . . . . . . . . . . . 32
4.1.5 Compositional data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Types of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Classical omics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Microbiome data: A special case . . . . . . . . . . . . . . . . . . . . 33
4.2.3 Genotype data: A special case . . . . . . . . . . . . . . . . . . . . . 33
4.2.4 Clinical variables that are categorical: A special case . . . . . . . . 33
4.3 Types of biological questions . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 A PCA type of question (one data set, unsupervised) . . . . . . . . 34
4.3.2 A PLS type of question (two data sets, regression or unsupervised) 34
4.3.3 A CCA type of question (two data sets, unsupervised) . . . . . . . 35
4.3.4 A PLS-DA type of question (one data set, classification) . . . . . . 35
4.3.5 A multiblock PLS type of question (more than two data sets,
supervised or unsupervised) . . . . . . . . . . . . . . . . . . . . . . 36
4.3.6 An N −integration type of question (several data sets, supervised) . 36
4.3.7 A P −integration type of question (several studies of the same omic
type, supervised or unsupervised) . . . . . . . . . . . . . . . . . . . 37
4.4 Exemplar data sets in mixOmics . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.A Appendix: Data transformations in mixOmics . . . . . . . . . . . . . . . . . 38
4.A.1 Multilevel decomposition . . . . . . . . . . . . . . . . . . . . . . . . 38
4.A.2 Mixed-effect model context . . . . . . . . . . . . . . . . . . . . . . 39
4.A.3 Split-up variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.A.4 Example of multilevel decomposition in mixOmics . . . . . . . . . . 40

4.B Centered log ratio transformation . . . . . . . . . . . . . . . . . . . . . . . 41


4.C Creating dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

II mixOmics under the hood 45


5 Projection to latent structures 47
5.1 PCA as a projection algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1.2 Calculating the components . . . . . . . . . . . . . . . . . . . . . . 48
5.1.3 Meaning of the loading vectors . . . . . . . . . . . . . . . . . . . . 49
5.1.4 Example using the linnerud data in mixOmics . . . . . . . . . . . 49
5.2 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 SVD algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.2 Example in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.3 Matrix approximation . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Non-linear Iterative Partial Least Squares (NIPALS) . . . . . . . . . . . . . 54
5.3.1 NIPALS pseudo algorithm . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Local regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.3 Deflation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.4 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Other matrix factorisation methods in mixOmics . . . . . . . . . . . . . . . 57
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6 Visualisation for data integration 59


6.1 Sample plots using components . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.1 Example with PCA and plotIndiv . . . . . . . . . . . . . . . . . . 59
6.1.2 Sample plot for the integration of two or more data sets . . . . . . 60
6.1.3 Representing paired coordinates using plotArrow . . . . . . . . . . 63
6.2 Variable plots using components and loading vectors . . . . . . . . . . . . . 65
6.2.1 Loading plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2.2 Correlation circle plots . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.3 Biplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2.4 Relevance networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.5 Clustered Image Maps (CIM) . . . . . . . . . . . . . . . . . . . . . 73
6.2.6 Circos plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.A Appendix: Similarity matrix in relevance networks and CIM . . . . . . . . 76
6.A.1 Pairwise variable associations for CCA . . . . . . . . . . . . . . . . 76
6.A.2 Pairwise variable associations for PLS . . . . . . . . . . . . . . . . 76
6.A.3 Constructing relevance networks and displaying CIM . . . . . . . . 77

7 Performance assessment in multivariate analyses 79


7.1 Main parameters to choose . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Performance assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.1 Training and testing: If we were rich . . . . . . . . . . . . . . . . . 80
7.2.2 Cross-validation: When we are poor . . . . . . . . . . . . . . . . . . 81
7.3 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.3.1 Evaluation measures for regression . . . . . . . . . . . . . . . . . . 82
7.3.2 Evaluation measures for classification . . . . . . . . . . . . . . . . . 83
7.3.3 Details of the tuning process . . . . . . . . . . . . . . . . . . . . . . 83
7.4 Final model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

7.4.1 Assessment of the performance . . . . . . . . . . . . . . . . . . . . 86


7.4.2 Assessment of the signature . . . . . . . . . . . . . . . . . . . . . . 86
7.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.5.1 Prediction of a continuous response . . . . . . . . . . . . . . . . . . 87
7.5.2 Prediction of a categorical response . . . . . . . . . . . . . . . . . . 88
7.5.3 Prediction is related to the number of components . . . . . . . . . 90
7.6 Summary and roadmap of analysis . . . . . . . . . . . . . . . . . . . . . . . 90

III mixOmics in action 93


8 mixOmics: Get started 95
8.1 Prepare the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.1.1 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.1.2 Filtering variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.1.3 Centering and scaling the data . . . . . . . . . . . . . . . . . . . . 96
8.1.4 Managing missing values . . . . . . . . . . . . . . . . . . . . . . . . 100
8.1.5 Managing batch effects . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.1.6 Data format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.2 Get ready with the software . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.1 R installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.2 Pre-requisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.3 mixOmics download . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.4 Load the package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3 Coding practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.1 Set the working directory . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.2 Good coding practices . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4 Upload data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4.2 Dependent variables . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4.3 Set up the outcome for supervised classification analyses . . . . . . 105
8.4.4 Check data upload . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.5 Structure of the following chapters . . . . . . . . . . . . . . . . . . . . . . . 106

9 Principal Component Analysis (PCA) 109


9.1 Why use PCA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.1.1 Biological questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.1.2 Statistical point of view . . . . . . . . . . . . . . . . . . . . . . . . 109
9.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.2.1 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.2.2 Sparse PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.3 Input arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.3.1 Center or scale the data? . . . . . . . . . . . . . . . . . . . . . . . . 112
9.3.2 Number of components (choice of dimensions) . . . . . . . . . . . . 112
9.3.3 Number of variables to select in sPCA . . . . . . . . . . . . . . . . 113
9.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.5 Case study: Multidrug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.5.3 Example: PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.5.4 Example: Sparse PCA . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.5.5 Example: Missing values imputation . . . . . . . . . . . . . . . . . 125

9.6 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129


9.6.1 Additional processing steps . . . . . . . . . . . . . . . . . . . . . . 129
9.6.2 Independent component analysis . . . . . . . . . . . . . . . . . . . 129
9.6.3 Incorporating biological information . . . . . . . . . . . . . . . . . 130
9.7 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.A Appendix: Non-linear Iterative Partial Least Squares . . . . . . . . . . . . 132
9.A.1 Solving PCA with NIPALS . . . . . . . . . . . . . . . . . . . . . . 132
9.A.2 Estimating missing values with NIPALS . . . . . . . . . . . . . . . 132
9.B Appendix: sparse PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.B.1 sparse PCA-SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.B.2 sPCA pseudo algorithm . . . . . . . . . . . . . . . . . . . . . . . . 134
9.B.3 Other sPCA methods . . . . . . . . . . . . . . . . . . . . . . . . . . 134

10 Projection to Latent Structure (PLS) 137


10.1 Why use PLS? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.1.1 Biological questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.1.2 Statistical point of view . . . . . . . . . . . . . . . . . . . . . . . . 137
10.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.2.1 Univariate PLS1 and multivariate PLS2 . . . . . . . . . . . . . . . 139
10.2.2 PLS deflation modes . . . . . . . . . . . . . . . . . . . . . . . . . . 140
10.2.3 sparse PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3 Input arguments and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3.1 The deflation mode . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3.2 The number of dimensions . . . . . . . . . . . . . . . . . . . . . . . 143
10.3.3 Number of variables to select . . . . . . . . . . . . . . . . . . . . . 143
10.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10.4.1 Graphical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10.4.2 Numerical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10.5 Case study: Liver toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.5.3 Example: PLS1 regression . . . . . . . . . . . . . . . . . . . . . . . 147
10.5.4 Example: PLS2 regression . . . . . . . . . . . . . . . . . . . . . . . 152
10.6 Take a detour: PLS2 regression for prediction . . . . . . . . . . . . . . . . . 163
10.7 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
10.7.1 Orthogonal projections to latent structures . . . . . . . . . . . . . . 165
10.7.2 Redundancy analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.7.3 Group PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.7.4 PLS path modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.7.5 Other sPLS variants . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10.8 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.A Appendix: PLS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
10.A.1 PLS Pseudo algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 169
10.A.2 Convergence of the PLS iterative algorithm . . . . . . . . . . . . . 170
10.A.3 PLS-SVD method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
10.B Appendix: sparse PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
10.B.1 sparse PLS-SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
10.B.2 sparse PLS pseudo algorithm . . . . . . . . . . . . . . . . . . . . . 171
10.C Appendix: Tuning the number of components . . . . . . . . . . . . . . . . . 172

10.C.1 In PLS1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172


10.C.2 In PLS2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

11 Canonical Correlation Analysis (CCA) 177


11.1 Why use CCA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.1.1 Biological question . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.1.2 Statistical point of view . . . . . . . . . . . . . . . . . . . . . . . . 177
11.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.2.1 CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.2.2 rCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11.3 Input arguments and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11.3.1 CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11.3.2 rCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.4.1 Graphical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.4.2 Numerical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
11.5 Case study: Nutrimouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
11.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
11.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
11.5.3 Example: CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
11.5.4 Example: rCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
11.6 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
11.7 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
11.A Appendix: CCA and variants . . . . . . . . . . . . . . . . . . . . . . . . . . 196
11.A.1 Solving classical CCA . . . . . . . . . . . . . . . . . . . . . . . . . 196
11.A.2 Regularised CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

12 PLS-Discriminant Analysis (PLS-DA) 201


12.1 Why use PLS-DA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
12.1.1 Biological question . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
12.1.2 Statistical point of view . . . . . . . . . . . . . . . . . . . . . . . . 201
12.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
12.2.1 PLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
12.2.2 sparse PLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.3 Input arguments and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.3.1 PLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.3.2 sPLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.3.3 Framework to manage overfitting . . . . . . . . . . . . . . . . . . . 205
12.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.4.1 Numerical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
12.4.2 Graphical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
12.5 Case study: SRBCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
12.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
12.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
12.5.3 Example: PLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
12.5.4 Example: sPLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
12.5.5 Take a detour: Prediction . . . . . . . . . . . . . . . . . . . . . . . 223
12.5.6 AUROC outputs complement performance evaluation . . . . . . . . 225
12.6 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
12.6.1 Microbiome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

12.6.2 Multilevel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227


12.6.3 Other related methods and packages . . . . . . . . . . . . . . . . . 228
12.7 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.A Appendix: Prediction in PLS-DA . . . . . . . . . . . . . . . . . . . . . . . 229
12.A.1 Prediction distances . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.A.2 Background area . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

13 N −data integration 233


13.1 Why use N −integration methods? . . . . . . . . . . . . . . . . . . . . . . . 233
13.1.1 Biological question . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
13.1.2 Statistical point of view and analytical challenges . . . . . . . . . . 234
13.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
13.2.1 Multiblock sPLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . . 234
13.2.2 Prediction in multiblock sPLS-DA . . . . . . . . . . . . . . . . . . 236
13.3 Input arguments and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 237
13.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
13.4.1 Graphical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
13.4.2 Numerical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
13.5 Case Study: breast.TCGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
13.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
13.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13.5.3 Parameter choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
13.5.4 Final model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
13.5.5 Sample plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
13.5.6 Variable plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.5.7 Model performance and prediction . . . . . . . . . . . . . . . . . . 251
13.6 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
13.6.1 Additional data transformation for special cases . . . . . . . . . . . 255
13.6.2 Other N −integration frameworks in mixOmics . . . . . . . . . . . . 255
13.6.3 Supervised classification analyses: concatenation and ensemble
methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
13.6.4 Unsupervised analyses: JIVE and MOFA . . . . . . . . . . . . . . . 256
13.7 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13.8 Additional resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
13.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
13.A Appendix: Generalised CCA and variants . . . . . . . . . . . . . . . . . . . 258
13.A.1 regularised GCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
13.A.2 sparse GCCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.A.3 sparse multiblock sPLS-DA . . . . . . . . . . . . . . . . . . . . . . 260

14 P −data integration 261


14.1 Why use P −integration methods? . . . . . . . . . . . . . . . . . . . . . . . 261
14.1.1 Biological question . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
14.1.2 Statistical point of view . . . . . . . . . . . . . . . . . . . . . . . . 261
14.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
14.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
14.2.2 Multi-group sPLS-DA . . . . . . . . . . . . . . . . . . . . . . . . . 263
14.3 Input arguments and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 264
14.3.1 Data input checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
14.3.2 Number of components . . . . . . . . . . . . . . . . . . . . . . . . . 265

14.3.3 Number of variables to select per component . . . . . . . . . . . . 265


14.4 Key outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
14.4.1 Graphical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
14.4.2 Numerical outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.5 Case Study: stemcells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.5.1 Load the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.5.2 Quick start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
14.5.3 Example: MINT PLS-DA . . . . . . . . . . . . . . . . . . . . . . . 268
14.5.4 Example: MINT sPLS-DA . . . . . . . . . . . . . . . . . . . . . . . 271
14.5.5 Take a detour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
14.6 Examples of application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
14.6.1 16S rRNA gene data . . . . . . . . . . . . . . . . . . . . . . . . . . 280
14.6.2 Single cell transcriptomics . . . . . . . . . . . . . . . . . . . . . . . 280
14.7 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
14.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

Glossary of terms 283

Key publications 285

Bibliography 287

Index 299
Preface

Context and scope

Modern high-throughput technologies generate information about thousands of biological molecules at different cellular levels in a biological system, leading to several types of
omics studies (e.g. transcriptomics as the study of messenger RNA molecules expressed
from the genes of an organism, or proteomics as the study of proteins expressed by a
cell, tissue, or organism). However, a reductionist approach that considers each of these
molecules individually does not fully describe an organism in its environment. Rather, we
use multivariate analysis to investigate the simultaneous and complex relationships that
occur in molecular pathways. In addition, to obtain a holistic picture of a complete biological
system, we propose to integrate multiple layers of information using recent computational
tools we have developed through the mixOmics project.
mixOmics is an international endeavour that encompasses methodological developments,
software implementation, and applications to biological and biomedical problems to address
some of the challenges of omics data integration. We have trained students and researchers in
essential statistical and data analysis skills via our numerous multi-day workshops to build
capacity in best practice statistical analysis and advance the field of computational statistics
for biology. The goal of this book is to provide guidance in applying multivariate dimension
reduction techniques for the integration of high-throughput biological data, allowing readers
to obtain new and deeper insights into biological mechanisms and biomedical problems.

Who is this book for?

This book is suitable for biologists, computational biologists, and bioinformaticians who
generate and work with high-throughput omics data. Such data include – but are not
restricted to – transcriptomics, epigenomics, proteomics, metabolomics, the microbiome,
and clinical data. Our book is dedicated to research postgraduate students and scientists at
any career stage, and can be used for teaching specialised multi-disciplinary undergraduate
and Masters’s courses. Data analysts with a basic level of R programming will benefit most
from this resource. The book is organised into three distinct parts, where each part can be
skimmed according to the level and interest of the reader. Each chapter contains different
levels of information, and the most technical chapters can be skipped during a first read.

Overview of methods in mixOmics

The mixOmics package focuses on multivariate analysis, which examines more than two variables simultaneously to integrate different types of variables (e.g. genes, proteins, metabolites). We use dimension reduction techniques applicable to a wide range of data analysis types. Our analyses can be descriptive, exploratory, or focus on modelling or prediction.

[Figure 1 (diagram): inputs range from a single omics data set to N-integration and P-integration of multiple data sets; the multivariate methods shown are PCA*, IPCA*, PLS*, rCCA, PLS-DA*, DIABLO*, and MINT*; the graphical outputs include sample plots and variable plots. Legend: supervised method; * variable selection.]

FIGURE 1: Overview of the methods implemented in the mixOmics package for the exploration and integration of multiple data sets. This book aims to guide the data analyst in constructing the research question, applying the appropriate multivariate techniques, and interpreting the resulting graphics.

Our aim is to summarise these large biological data sets to elucidate similarities between samples, between variables, and the relationship between samples and variables. The mixOmics package provides a range of methods to answer different kinds of biological questions, for example to (a brief code sketch follows this list):
• Highlight patterns pertaining to the major sources of variation in the data (e.g. Principal
Component Analysis),
• Segregate samples according to their known group and predict group membership of new
samples (e.g. Partial Least Squares Discriminant Analysis),
• Identify agreement between multiple data sets (e.g. Canonical Correlation Analysis,
Partial Least Squares regression, and other variants),
• Identify molecular signatures across multiple data sets with sparse methods that achieve
variable selection.
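To give a flavour of what these analyses look like in practice, here is a minimal sketch (condensed, and not taken verbatim from the book's case studies) that runs an unsupervised and a supervised method on two data sets shipped with the package; the complete workflows are detailed in Part III.

library(mixOmics)

# Unsupervised exploration: PCA of gene expression in the nutrimouse study
data(nutrimouse)
pca.gene <- pca(nutrimouse$gene, ncomp = 2, center = TRUE, scale = TRUE)
plotIndiv(pca.gene, group = nutrimouse$diet, legend = TRUE,
          title = 'PCA on nutrimouse gene expression')

# Supervised classification with variable selection: sparse PLS-DA on SRBCT tumours
data(srbct)
splsda.res <- splsda(X = srbct$gene, Y = srbct$class, ncomp = 2, keepX = c(50, 50))
plotIndiv(splsda.res, legend = TRUE, title = 'sPLS-DA on SRBCT')
selectVar(splsda.res, comp = 1)$name   # genes selected on the first component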

Key methodological concepts in mixOmics

Methods in mixOmics are based on matrix factorisation techniques, which offer great
flexibility in analysing and integrating multiple data sets in a holistic manner. We use
dimension reduction combined with feature selection to summarise the main characteristics
of the data and posit novel biological hypotheses.

Dimension reduction is achieved by combining all original variables into a smaller number of
artificial components that summarise patterns in the original data.
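As a small illustration of this idea (using base R and simulated data rather than mixOmics), the first principal component below is nothing more than a weighted sum of the centred variables, with the weights given by the loading vector:

set.seed(42)
X  <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)   # 20 samples, 5 variables
Xc <- scale(X, center = TRUE, scale = FALSE)       # centre each variable

pca.res  <- prcomp(Xc)
loading1 <- pca.res$rotation[, 1]   # loading vector of the first component
comp1    <- Xc %*% loading1         # component = linear combination of the variables

# identical (up to numerical precision) to the scores returned by prcomp
all.equal(as.numeric(comp1), as.numeric(pca.res$x[, 1]))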
The mixOmics package is unique in providing novel multivariate techniques that enable
feature selection to identify molecular signatures. Feature selection refers to identifying
variables that best explain, or predict, the outcome variable (e.g. group membership, or
disease status) of interest. Variables deemed irrelevant according to the specific statistical
criterion we use in the methods are not taken into account when calculating the components.
Data integration methods use data projection techniques to maximise the covariance, or the
correlation between, omics data sets. We propose two types of data integration, whether on
the same N samples, or on the same P variables (Figure 1).
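For instance, here is a minimal sketch of N-integration with two omics measured on the same mice (the nutrimouse data shipped with mixOmics): PLS builds one component per data set such that each pair of components has maximal covariance, which is reflected in their strong correlation.

library(mixOmics)
data(nutrimouse)

# Two data sets measured on the same 40 mice: gene expression and lipid concentrations
pls.res <- pls(X = nutrimouse$gene, Y = nutrimouse$lipid, ncomp = 2)

# The paired components (one per data set) are built to co-vary as much as possible
cor(pls.res$variates$X[, 1], pls.res$variates$Y[, 1])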
Finally, our methods can provide either unsupervised or supervised analyses. Unsupervised
analyses are exploratory: any information about sample group membership, or outcome,
is disregarded, and data are explored based on their variance or correlation structure.
Supervised analyses aim to segregate sample groups known a priori (e.g. disease status,
treatments) and identify variables (i.e. biomarker candidates, or molecular signatures) that
either explain or separate sample groups.
These concepts will be explained further in Part I.
To aid in interpreting analysis results, mixOmics provides insightful graphical plots designed
to highlight patterns in both the sample and variable dimensions uncovered by each method
(Figure 1).

Concepts not covered

Each mixOmics method corresponds to an underlying statistical model. However, the methods
we present radically differ from univariate formulations as they do not test one variable at a
time, or produce p-values. In that sense, multivariate methods can be considered exploratory
as they do not enable statistical inference. Our methods come in many different flavours and can also be applied for predictive purposes, as we detail in this book. ‘Classical’
univariate statistical inference methods can still be used in our analysis framework after
the identification of molecular signatures, as our methods aim to generate novel biological
hypotheses.

Who is ‘mixOmics’?

The mixOmics project has been developed between France, Australia and Canada since 2009,
when the first version of the package was submitted to the CRAN1. Our team is composed of
core members from the University of Melbourne, Australia, and the Université de Toulouse,
France. The team also includes several key contributors and collaborators.
The package implements more than nineteen multivariate and sparse methodologies for
omics data exploration, integration, and biomarker discovery for different biological settings,
amongst which thirteen were developed by our team (see our list of publications in Section
14.8). Originally, all methods were designed for omics data; however, their application is not limited to biological data. Other applications where integration is required can be considered, mostly for cases where the predictor variables are continuous.
1 The Comprehensive R Archive Network, https://www.cran.r-project.org

The package is currently available from Bioconductor2, with a development version available on GitHub3. We continue to maintain and improve the package via new methods, code
optimisation and efficient memory storage of R objects.

About this book

Part I: Modern biology and multivariate analysis introduces fundamental concepts in multivariate analysis. Multi-omics and biological systems (Chapter 1) compares and
contrasts multivariate and univariate analysis, and outlines the advantages and challenges
of multivariate analyses. The Cycle of Analysis (Chapter 2) details the necessary steps
in planning, designing and conducting multivariate analyses. Key multivariate concepts
and dimension reduction in mixOmics (Chapter 3) describes measures of dispersion and
association, and introduces key methods in mixOmics to manage large data, such as dimension
reduction using matrix factorisation and feature selection. Choose the right method for the
right question in mixOmics (Chapter 4) provides an overview of the methods available in
mixOmics and the types of biological questions these methods can answer.
Part II: mixOmics under the hood provides a deeper understanding of the statistical
concepts underlying the methods presented in Part III. Projection to Latent Structures
(PLS) (Chapter 5) illustrates the different types of algorithms used to solve Principal
Component Analysis. We detail in particular the iterative PLS algorithm that projects data
onto latent structures (components) for matrix decomposition and dimension reduction, as
this algorithm forms the basis of most of our methods. Visualisation for data integration
(Chapter 6) showcases the variety of graphical outputs offered in mixOmics to complement
each method. Performance assessment in supervised analyses (Chapter 7) describes the
techniques employed to evaluate the results of the analyses.
Part III: mixOmics in action provides detailed case studies that apply each method
in mixOmics to answer pertinent biological questions, complete with example R code and
insightful plots. We begin with mixOmics: get started (Chapter 8) to guide the novice analyst
in using the R platform for data analysis. Each subsequent chapter is dedicated to one method
implemented in mixOmics. In Principal Component Analysis (PCA) (Chapter 9) and PLS -
Discriminant Analysis (PLS-DA) (Chapter 12), we introduce different multivariate methods
for single omics analysis. The N -integration framework is introduced in PLS (Chapter 10)
and Canonical Correlation Analysis (Chapter 11) for two omics, and N −data integration
(DIABLO, Chapter 13) for multi-omics integration. P −data integration (MINT, Chapter 14)
introduces our latest developments for P -integration to combine independent omics studies.
Each of these chapters is organised as follows:
• Aim of the method,
• Research question framed biologically and statistically,
• Principles of the method,
• Input arguments and key outputs,
• Introduction of the case study,
• Quick start R command lines,
• Further options to go deeper into the analysis,
• Frequently Asked Questions,
• Technical methodological details in each Appendix.
2 https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html
3 https://github.com/mixOmicsTeam/

Additional resources related to this book

In addition to the R package, the mixOmics project includes a website with extensive tutorials at http://www.mixOmics.org. The R code of each chapter is also available on the website. Our readers can also register for our newsletter mailing list, and be part of the mixOmics community on GitHub and via our discussion forum https://mixomics-users.discourse.group/.
Authors

Dr Kim-Anh Lê Cao develops novel methods, software, and tools to interpret big biological data and answer research questions efficiently. She is committed to statistical education to instill best analytical practice, has taught numerous statistical workshops for biologists, and leads collaborative projects in medicine, fundamental biology, and microbiology.
Dr Kim-Anh Lê Cao has a mathematical engineering background and graduated with a PhD in Statistics from the Université de Toulouse, France. She is currently an Associate Professor in Statistical Genomics at the University of Melbourne. In 2019, Kim-Anh received the Australian Academy of Science’s Moran Medal for her contributions to Applied Statistics in multidisciplinary collaborations. She has contributed to leadership programs for women in STEMM, including the international Homeward Bound program, which culminated in a trip to Antarctica, and Superstars of STEM from Science & Technology Australia.

Zoe Welham completed a BSc in molecular biology and during this time developed an
interest in the analysis of big data. She completed a Master of Bioinformatics with a focus
on the statistical integration of different omics data in bowel cancer. She is currently a PhD
candidate at the Kolling Institute in Sydney where she is furthering her research into bowel
cancer with a focus on integrating microbiome data with other omics to characterise early
bowel polyps. Her research interests include bioinformatics and biostatistics across many areas of biology, as well as making scientific information accessible to the general public.

Part I

Modern biology and


multivariate analysis
1
Multi-omics and biological systems

Technological advances such as next-generation sequencing and mass spectrometry generate a wealth of diverse biological information, allowing for the monitoring of thousands of variables,
or dimensions, that describe a given sample or individual, hence the term ‘high dimensional
data’. Multi-omics variables represent molecules from different functional levels: for example,
transcriptomics for the study of transcripts, proteomics for proteins, and metabolomics
for metabolites. However, their complex nature requires an integrative, multidisciplinary
approach to analysis that is not yet fully established.
Historically, the scientific community has adopted a reductionist approach to data analysis by
characterising a very small number of genes or proteins in one experiment to assess specific
hypotheses. A holistic approach allows for a deeper understanding of biological systems by
adding two new facets to analysis (Figure 1.1): Firstly, by integrating data from different omic
functional levels, we move from clarifying a linear process (e.g. the dysregulation of one or
two genes) towards understanding the development, health, and disease of an ever-changing,
dynamic, hierarchical system. Secondly, by adopting a hypothesis-free, data-driven approach, we can build integrated and coherent models to address novel, systems-level hypotheses that can then be validated with traditional hypothesis-driven approaches.

1.1 Statistical approaches for reductionist or holistic analyses

Compared to a traditional reductionist analysis, multivariate multi-omics analysis drastically differs in its viewpoint and aims. We briefly introduce three types of analysis to illustrate
this point:
A univariate analysis is a fundamentally reductionist, hypothesis-driven approach that is
related to inferential statistics (introduced in Section 2.4). A hypothesis test is conducted
on one variable (e.g. gene expression or protein abundance) independently from the other
variables. Univariate methods make inferences about the population and measure the certainty
of this inference through test statistics and p-values. Linear models, t-tests, F-tests, or
non-parametric tests fit into a univariate analysis framework. Although interactions between
variables are not considered in univariate analyses, when one variable is manipulated in a
controlled experiment, we can often attribute the result to that particular variable. In omics
studies, where multiple variables are monitored simultaneously, it is difficult to determine
which variables influence the biology of interest.
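As a point of reference for the multivariate methods developed in this book, the sketch below (base R, simulated data) shows a typical univariate workflow: one hypothesis test per variable, followed by a multiple-testing correction.

set.seed(1)
n <- 10                                   # samples per group
p <- 200                                  # variables (e.g. genes), tested one at a time
X <- matrix(rnorm(2 * n * p), nrow = 2 * n, ncol = p)
group <- factor(rep(c("control", "treated"), each = n))

# One t-test per variable, ignoring all other variables
pvals <- apply(X, 2, function(x) t.test(x ~ group)$p.value)

# Multiple-testing correction is essential when many variables are tested
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)   # close to zero here, since the simulated data are pure noise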
A bivariate analysis considers two variables simultaneously, for example, to assess the
association between the expression levels of two genes via correlation or linear regression.
Such an analysis is often supported by visualisation through scatterplots but can quickly


RNA CS
MI
IP TO
SCR
AN
TR

PROTEIN CS
MI
EO
OT
PR

METABOLITE ICS
OM
B OL
TA
ME

MICROBE ICS
OM
EN
AG
MET
Conventional molecular
biology Single-omics Multi-omics
(Reductionist) (Hypothesis free) (Holistic)
H yp ot g
h e s is g e n e r a t i n

FIGURE 1.1: From reductionism to holism. Until recently, only a few molecules of a
given omics type were analysed and related to other omics. The advent of high-throughput
biology has ushered in an era of hypothesis-free approaches within a single type of omics
data, and across multiple omics from the same set of samples. A holistic approach is now
required to understand the different omic functional layers in a biological system and posit
novel hypotheses that can be further validated with a traditional reductionist approach.
We have omitted DNA as this data type needs to be handled differently in mixOmics, see
Section 4.2.3.

become cumbersome when dealing with thousands of variables that are considered in a
pairwise manner.
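A short sketch (base R, simulated data) of a bivariate analysis, and of why the pairwise strategy quickly becomes impractical as the number of variables P grows:

set.seed(2)
gene1 <- rnorm(30)
gene2 <- 0.6 * gene1 + rnorm(30, sd = 0.8)   # gene2 is associated with gene1

cor(gene1, gene2)      # Pearson correlation between the two genes
plot(gene1, gene2)     # scatterplot supporting the bivariate analysis

# The number of pairs to inspect grows quadratically: P * (P - 1) / 2
P <- c(10, 100, 1000, 10000)
data.frame(P = P, n.pairs = P * (P - 1) / 2)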
A multivariate analysis examines more than two variables simultaneously and potentially
thousands at a time. In omics studies, this approach can lead to computational issues and
inaccurate results, especially when the number of samples is much smaller than the number of
variables. Several computational and statistical techniques have been revisited or developed
for high-dimensional data. This book focuses on multivariate analyses and extends this to
include the integration of multi-omics data sets.

1.2 Multi-omics and multivariate analyses


The aim of omics data integration is to identify associations and patterns amongst different
variables and across different omics collected from a sample. Provided appropriate data
analysis is conducted, the integration of multiple data sources may also consolidate our
confidence in the results when consensus is observed from different experiments.

1.2.1 More than a ‘scale up’ of univariate analyses

The fundamental difference between multivariate and univariate analysis lies in the scope of
the results obtained. Multivariate analysis can unravel groups of variables that share similar
patterns in expression across different phenotypes, thus complementing each other to describe
an outcome. A univariate analysis may declare the same variables as non-significant, as a
variable’s ability to explain the phenotype may be subtle and can be masked by individual
variation, or confounders (Saccenti et al., 2014). However, with sufficiently powered data,
univariate and multivariate methods are complementary and can help make sense of the
data. For example, several multivariate and exploratory methods presented in this book can
suggest promising candidate variables that can be further validated through experiments,
reductionist approaches, and inferential statistics.

1.2.2 More than a fishing expedition

Multivariate analyses, which examine up to thousands of variables simultaneously, are often considered to be ‘fishing expeditions’. This somewhat pejorative term refers to either
conducting analyses without first specifying a testable hypothesis based on prior research,
or, conducting several different analyses on the same data to ‘fish’ for a significant result
regardless of its domain relevance. Indeed, examining a large number of variables can lead
to statistically significant results purely by chance.
However, the integration of multi-omics data, with an appropriate experimental design set
in an exploratory, rather than predictive approach, offers a tremendous opportunity for
discovering associations between omics molecules (whether genes, transcripts, proteins, or
metabolites), in normal, temporal or spatial changes, or in disease states. For example,
one of our studies identified pathways that were never previously identified as relevant to
ontogeny during the first week of human life (Lee et al., 2019). Multi-omics data integration
has deepened our understanding of gene regulatory networks by including information
from related molecules prior to validation of gene associations with a functional approach
(Gligorijević and Pržulj, 2015) and has also efficiently improved functional annotations to
proteins instead of using expensive and time-consuming experimental techniques (Ma et al.,
2013). Multi-omics can also more easily characterise the relatively small number of genes
associated with a particular disease by integrating multiple sources of information (Žitnik
et al., 2013). Finally, it has further developed precision medicine by integrating patient-
and disease-specific information with the aim to improve prognosis and clinical outcomes
(Ritchie et al., 2015).

1.3 Shifting the analysis paradigm

Despite the potential advantages of high-dimensional data, we should keep in mind that
quantity does not equal quality. Multivariate data integration is not straightforward: the
analyses cannot be reduced to a mere concatenation of variables of different types, or to an overlap of information between single data sets, as we illustrate in Figure 1.2. As such, we
must shift our traditional view of analysing data.
[Figure 1.2 (diagram): broad categories of methods for integrating transcriptomics, proteomics, and metabolomics data: matrix factorisation; Bayesian approaches (priors combined into an inferred posterior); network-based approaches; and multiple-step approaches (single-omics analyses followed by correlation/overlap of the results).]

FIGURE 1.2: Types of methods for data integration. Methods for multi-omics data
integration are still in active development, and can be broadly categorised into matrix
factorisation techniques (the focus of this book), Bayesian, network-based, and multiple-
step approaches. The latter deviates from data integration as it considers each data set
individually before combining the results.

Biological experimentation often employs univariate statistics to answer clear hypotheses about the potential causal effect of a given molecule of interest. In high-dimensional data sets, this reductionist approach may not hold because of the sheer number of molecules that are monitored, and because their interactions may themselves be of biological interest. Therefore, exploratory,
data-driven approaches are needed to extract information from noise and generate new
hypotheses and knowledge. However, the lack of a clear, causal-driven hypothesis presents a
challenging new paradigm in statistical analyses.
In univariate hypothesis testing, we report p-values to determine the significance of a
statistical test conducted on a single variable. In a multivariate setting, however, a p-
value assesses the statistical significance of a result while taking into account all variables
simultaneously. In such analyses, permutation-based tests are common to assess how far
from random a result is when the data are reshuffled, but other inference-based methods
are currently being developed in the field of multivariate analysis (Wang and Xu, 2021). In
mixOmics we do not offer such tests, but related methods propose permutation approaches
to choose the parameters in the method (see Section 10.7.5).
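To make the permutation principle concrete, here is a generic sketch (base R, simulated data, not a mixOmics function): the group labels are reshuffled many times to build a null distribution against which the observed statistic is compared.

set.seed(3)
n <- 20
x <- rnorm(n)                              # one measured variable
y <- factor(rep(c("A", "B"), each = n / 2))
obs <- diff(tapply(x, y, mean))            # observed difference in group means

perm <- replicate(1000, {
  y.shuffled <- sample(y)                  # reshuffling breaks any real association
  diff(tapply(x, y.shuffled, mean))
})

# Permutation p-value: proportion of reshuffled statistics at least as extreme
mean(abs(perm) >= abs(obs))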

1.4 Challenges with high-throughput data


There are multiple challenges associated with managing large amounts of biological data,
pertaining to specific types of data as well as statistical analysis. To make reliable, valid,
and meaningful interpretations, these challenges must be considered, ideally before data
collection.

1.4.1 Overfitting

Multivariate omics analysis assesses many molecules that individually, or in combination, can
explain the biological outcome of interest. However, these associations may be spurious, as the
large number of features can often be combined in different ways to explain the outcome well,
despite having no biological relevance. Overfitting occurs when a statistical model captures
the noise along with the underlying pattern in the data: if we apply the same statistical
model fitted on a high-dimensional data set to a similar but external study, we might obtain
different results.¹ Overfitting is a well-known issue in high-throughput biology
(Hawkins, 2004). We can assess the amount of overfitting using cross-validation or subsampling
of the data, as described in Chapter 7.
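The base R sketch below (simulated data, not the cross-validation framework of Chapter 7) illustrates the phenomenon: a model fitted to pure noise appears to explain the training samples well, yet predicts held-out samples no better than chance.

# Outcome y is generated independently of the 15 predictors
set.seed(7)
n <- 40; p <- 15
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- rnorm(n)
dat <- cbind(y = y, X)

train <- 1:20; test <- 21:40
fit <- lm(y ~ ., data = dat[train, ])

summary(fit)$r.squared                                   # high 'apparent' fit by chance alone
cor(predict(fit, newdata = dat[test, ]), dat$y[test])^2  # near zero on held-out samples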

1.4.2 Multi-collinearity and ill-posed problems

As the number of variables increases, the number of pairwise correlations also increases. Multi-
collinearity poses a problem in most statistical analyses, as these variables bring redundant
and noisy information that decreases the precision of the statistical model. Correlations
in high-throughput data sets are often spurious, especially when the number of biological
samples, or individuals N, is small compared to the number of variables P.² The ‘small N,
large P’ problem is said to be ill-posed, as standard statistical inference methods assume that N
is much greater than P in order to generalise results to the population from which the sample was
drawn. Ill-posed problems also lead to numerically unstable and inaccurate computations.
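A quick R simulation (with arbitrary dimensions) shows how easily spurious correlations arise in this setting: even when all variables are generated independently, some pairs appear strongly correlated simply because N is small and P is large.

# 10 samples, 1,000 mutually independent variables: no true association exists
set.seed(3)
N <- 10; P <- 1000
X <- matrix(rnorm(N * P), nrow = N)

cor_mat <- cor(X)
max(abs(cor_mat[upper.tri(cor_mat)]))  # typically above 0.9, purely by chance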

1.4.3 Zero values and missing values

Data sets may contain a large number of zeros, depending on the type of omics studied
and the platform that is used. This is particularly the case for microbiome, proteomics, and
metabolomics data: a large number of zeros results in zero-inflated (skewed) data, which
can impair methods that assume a normal distribution of the data. Structural zeros, or true
zeros, reflect a true absence of the variable in the biological environment while sampling
zeros, or false zeros, may not reflect reality due to experimental error, technological reasons,
or an insufficient sample size (Blasco-Moreno et al., 2019). The challenge is whether to
treat each zero as a true zero or as a missing value (coded as NA in R).
Methods that can handle missing values often assume they are ‘missing at random’, i.e. miss-
ingness is not related to a specific sample, individual, molecule, or type of omics platform.
Some methods can estimate missing values, as we present in Appendix 9.A.
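The short R sketch below (on a made-up count matrix) illustrates the kind of bookkeeping involved: inspecting the degree of zero inflation, then recoding zeros believed to be sampling zeros as NA. Which zeros qualify remains a study-specific judgement.

# A small simulated count matrix with many zeros (e.g. microbiome-like data)
set.seed(5)
counts <- matrix(rpois(6 * 8, lambda = 0.8), nrow = 6,
                 dimnames = list(paste0("sample", 1:6), paste0("taxon", 1:8)))

mean(counts == 0)                 # overall proportion of zeros

# Crudely recode all zeros as missing; in practice only presumed sampling
# (false) zeros would be recoded, based on domain knowledge
counts_na <- counts
counts_na[counts_na == 0] <- NA
colMeans(is.na(counts_na))        # resulting missingness per variable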
¹ Statistical models that overfit have low bias and high variance, meaning that they tend to be complex to fit the training data well, but do not predict well on test data (more details about the bias-variance tradeoff can be found in Friedman et al. (2001), Chapter 2).
² In our context, N can also refer to the number of cells in single cell assays, as we briefly mention in Section 14.6.

1.5 Challenges with multi-omics integration

Examining data holistically may lead to better biological understanding, but integrating
multiple omics data sets is not a trivial task and raises another series of challenges.

1.5.1 Data heterogeneity

Different omics rely on different laboratory techniques and data extraction platforms, resulting
in data sets that differ in format, complexity, dimensionality, information content, and scale,
and that may be processed using different bioinformatics tools. Data heterogeneity therefore
arises for both biological and technical reasons and is the main analytical challenge to overcome.

1.5.2 Data size

Integrating multiple omics results in a drastic increase in the number of variables. A filtering
step is often applied to remove irrelevant and noisy variables (see Section 8.1). However, the
number of variables P still remains extremely large compared to the number of samples N ,
which raises computational as well as analytical issues.
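As an example of such a filter, the short R sketch below (simulated data; thresholds are arbitrary) retains only the most variable features. Section 8.1 discusses filtering strategies in more detail.

# 20 samples and 5,000 variables; keep the 1,000 with the largest variance
set.seed(9)
X <- matrix(rnorm(20 * 5000), nrow = 20)

feature_var <- apply(X, 2, var)
keep        <- order(feature_var, decreasing = TRUE)[1:1000]
X_filtered  <- X[, keep]
dim(X_filtered)                   # 20 x 1000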

1.5.3 Platforms

The data integration field is constantly evolving due to ever-advancing technologies with new
platforms and protocols, each containing inherent technical biases and analytical challenges.
It is crucial that data analysts swiftly adapt their analysis frameworks to keep pace with
these omics-era demands. For example, single cell techniques are rapidly advancing, as are
new protocols for their multi-omics analysis.

1.5.4 Expectations for analysis

The field of data integration has no set definition. Data integration can be managed
biologically, bioinformatically, statistically, or at the interpretation step (i.e. by overlapping
biological interpretations once the statistical results are obtained). Expectations
for data integration are therefore diverse, ranging from exploration to a low- or high-level
understanding of the different omics data types. Despite recent advances in single cell sequencing,
current technologies are still limited in their ability to parse omics interactions at precise
functional levels. Thus, our expectations for data integration are limited not only by the
statistical methods but also by the technologies available to us.

1.5.5 Variety of analytical frameworks

Integrative techniques fully suited to multi-omics biological data are still in development and
continue to expand.³ Different types of techniques can be considered; they are broadly categorised
as follows (Huang et al. (2017), Figure 1.2):
• Matrix factorisation techniques, where large data sets are decomposed into smaller
sub-matrices to summarise information. These techniques use algebra and analysis to
optimise specific statistical criteria and integrate different levels of information. Methods
in mixOmics fit into this category and will be detailed in Chapter 3 and subsequent
chapters (a small numerical sketch follows this list),
• Bayesian methods, which use assumptions of prior distributions for each omics type to
find correlations between data layers and infer posterior distributions,
• Network-based approaches, which use visual and symbolic representations of biological
systems, with nodes representing molecules and edges as correlations between molecules,
if they exist. Network-based methods are mostly applied for detecting significant genes
within pathways, discovering sub-clusters, or finding co-expression network modules,
• Multiple-step approaches that first analyse each single omics data set individually
before combining the results based on their overlap (e.g. at the gene level of a molecular
signature) or correlation. This type of approach technically deviates from data integration
but is commonly used.
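To make the matrix factorisation idea concrete, the base R sketch below (simulated data, using the singular value decomposition rather than any particular mixOmics method) decomposes a data matrix into small sub-matrices and rebuilds a low-rank summary from two components.

# 15 samples and 100 variables decomposed as X = U D V'
set.seed(11)
X <- matrix(rnorm(15 * 100), nrow = 15)

s <- svd(X)
k <- 2                                            # number of components kept
X_approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

dim(s$u); dim(s$v)                # small sub-matrices summarising X
sum((X - X_approx)^2) / sum(X^2)  # proportion of variance not captured by k = 2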

1.6 Summary

Modern biological data are high dimensional; they include up to thousands of molecular
entities (e.g. genes, proteins, or epigenetic markers) per sample. Integrating these rich
data sets can potentially uncover the hierarchical and holistic mechanisms that govern
biological pathways. While classical, reductionist, univariate methods ignore these molecular
interactions, multivariate, integrative methods offer a promising alternative to obtain a
more complete picture of a biological system. Univariate and multivariate methods are thus
distinct approaches whose results overlap very little, but which have the advantage of being
complementary.
The advent of high-throughput technology has revealed a complex world of multi-omics
molecular systems that can be unraveled with appropriate integration methods. However,
multivariate methods able to manage high-dimensional and multi-omics data are yet to
be fully developed. The methods presented in this book mitigate some of these challenges
and will help to reveal patterns in omics data, thus forging new insights and directions for
understanding biological systems as a whole.

³ A comprehensive list of multi-omics methods and software is available at https://github.com/mikelove/awesome-multi-omics.
References
Abdi, H. , Chin, W.W. , Esposito Vinzi, V. , Russolillo, G. , Trinchera, L. (2013). Multi-group pls regression:
application to epidemiology. In New Perspectives in Partial Least Squares and Related Methods, pages
243–255. Springer.
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society.
Series B (Methodological), 44(2):pages 139–177.
Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction in tumour classification on basis of
microarray gene expression data. Proc. Natl. Acad. Sci. USA, 99(1):6562–6566.
Argelaguet, R. , Velten, B. , Arnol, D. , Dietrich, S. , Zenz, T. , Marioni, J. C. , Buettner, F. , Huber, W. , and
Stegle, O. (2018). Multi-omics factor analysis–a framework for unsupervised integration of multi-omics data
sets. Molecular Systems Biology, 14(6):e8124.
Arumugam, M. , Raes, J. , Pelletier, E. , Le Paslier, D. , Yamada, T. , Mende, D. R. , Fernandes, G. R. , Tap, J.
, Bruls, T. , Batto, J.-M. , et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346):174.
Baumer, B. and Udwin, D. (2015). R markdown. Wiley Interdisciplinary Reviews: Computational Statistics,
7(3):167–177.
Blasco-Moreno, A. , Pérez-Casany, M. , Puig, P. , Morante, M. , and Castells, E. (2019). What does a zero
mean? Understanding false, random and structural zeros in ecology. Methods in Ecology and Evolution,
10(7):949–959.
Bodein, A. , Chapleur, O. , Cao, K.-A. L. , and Droit, A. (2020). timeOmics: Time-Course Multi-Omics Data
Integration. R package version 1.0.0.
Bodein, A. , Chapleur, O. , Droit, A. , and Lê Cao, K.-A. (2019). A generic multivariate framework for the
integration of microbiome longitudinal studies with other data types. Frontiers in Genetics, 10.
Boulesteix, A.-L. (2004). Pls dimension reduction for classification with microarray data. Statistical Applications
in Genetics And Molecular Biology, 3(1):1–30.
Boulesteix, A. and Strimmer, K. (2005). Predicting transcription factor activities from combined analysis of
microarray and chip data: a partial least squares approach. Theoretical Biology and Medical Modelling, 2(23).
Boulesteix, A. and Strimmer, K. (2007). Partial least squares: a versatile tool for the analysis of high-
dimensional genomic data. Briefings in Bioinformatics, 8(1):32.
Bushel, P. R. , Wolfinger, R. D. , and Gibson, G. (2007). Simultaneous clustering of gene expression data with
clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology, 1(1):15.
Butler, A. , Hoffman, P. , Smibert, P. , Papalexi, E. , and Satija, R. (2018). Integrating single-cell transcriptomic
data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420.
Butte, A. J. , Tamayo, P. , Slonim, D. , Golub, T. R. , and Kohane, I. S. (2000). Discovering functional
relationships between RNA expression and chemotherapeutic susceptibility using relevance networks.
Proceedings of the National Academy of Sciences of the USA, 97:12182–12186.
Bylesjö, M. , Rantalainen, M. , Cloarec, O. , Nicholson, J. K. , Holmes, E. , and Trygg, J. (2006). OPLS
discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. Journal of Chemometrics:
A Journal of the Chemometrics Society, 20(8–10):341–351.
Calude, C. S. and Longo, G. (2017). The deluge of spurious correlations in big data. Foundations of Science,
22(3):595–612.
Cancer Genome Atlas Network et al. (2012). Comprehensive molecular portraits of human breast tumours.
Nature, 490(7418):61–70.
Caporaso, J. G. , Kuczynski, J. , Stombaugh, J. , Bittinger, K. , Bushman, F. D. , Costello, E. K. , Fierer, N. ,
Pena, A. G. , Goodrich, J. K. , Gordon, J. I. , et al. (2010). QIIME allows analysis of high-throughput community
sequencing data. Nature Methods, 7(5):335.
Chin, M. H. , Mason, M. J. , Xie, W. , Volinia, S. , Singer, M. , Peterson, C. , Ambartsumyan, G. , Aimiuwu, O. ,
Richter, L. , Zhang, J. , et al. (2009). Induced pluripotent stem cells and embryonic stem cells are distinguished
by gene expression signatures. Cell Stem Cell, 5(1):111–123.
Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction
and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):3–25.
Chung, D. , Chun, H. , and Keleş, S. (2020). spls: Sparse Partial Least Squares (SPLS) Regression and
Classification. R package version 2.2-3.
Chung, D. and Keles, S. (2010). Sparse Partial Least Squares Classification for high dimensional data.
Statistical Applications in Genetics and Molecular Biology, 9(1):17.
Combes, S. , González, I. , Déjean, S. , Baccini, A. , Jehl, N. , Juin, H. , Cauquil, L. , Gabinaud, B. ,
Lebas, F. , and Larzul, C. (2008). Relationships between sensorial and physicochemical measurements in meat of
rabbit from three different breeding systems using canonical correlation analysis. Meat Science,
80(3):835–841.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing , 36(3):287–314.
Csala, A. (2017). sRDA: Sparse Redundancy Analysis. R package version 1.0.0.
Csala, A. , Voorbraak, F. P. , Zwinderman, A. H. , and Hof, M. H. (2017). Sparse redundancy analysis of high-
dimensional genetic and genomic data. Bioinformatics, 33(20):3228–3234.
Csardi, G. , Nepusz, T. , et al. (2006). The igraph software package for complex network research. InterJournal,
Complex Systems, 1695(5):1–9.
De La Fuente, A. , Bing, N. , Hoeschele, I. , and Mendes, P. (2004). Discovery of meaningful associations in
genomic data using partial correlation coefficients. Bioinformatics, 20(18):3565–3574.
De Tayrac, M. , Lê, S. , Aubry, M. , Mosser, J. , and Husson, F. (2009). Simultaneous analysis of distinct omics
data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics,
10(1):1–17.
De Vries, A. and Meys, J. (2015). R for Dummies. John Wiley & Sons.
Dray, S. , Dufour, A.-B. , et al. (2007). The ade4 package: implementing the duality diagram for ecologists.
Journal of Statistical Software, 22(4):1–20.
Drier, Y. , Sheffer, M. , and Domany, E. (2013). Pathway-based personalized analysis of cancer. Proceedings
of the National Academy of Sciences, 110(16):6388–6393.
Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the 632+ bootstrap method. Journal of
the American Statistical Association, 92(438):548–560.
Eisen, M. B. , Spellman, P. T. , Brown, P. O. , and Botstein, D. (1998). Cluster analysis and display of genome-
wide expression patterns. Proceeding of the National Academy of Sciences of the USA, 95:14863–14868.
Escudié, F. , Auer, L. , Bernard, M. , Mariadassou, M. , Cauquil, L. , Vidal, K. , Maman, S. , Hernandez-Raquet,
G. , Combes, S. , and Pascal, G. (2017). Frogs: find, rapidly, OTUs with galaxy solution. Bioinformatics,
34(8):1287–1294.
Eslami, A. , Qannari, E. M. , Kohler, A. , and Bougeard, S. (2014). Algorithms for multi-group pls. Journal of
Chemometrics, 28(3):192–201.
Féraud, B. , Munaut, C. , Martin, M. , Verleysen, M. , and Govaerts, B. (2017). Combining strong sparsity and
competitive predictive power with the L-sOPLS approach for biomarker discovery in metabolomics.
Metabolomics, 13(11):130.
Fernandes, A. D. , Reid, J. N. , Macklaim, J. M. , McMurrough, T. A. , Edgell, D. R. , and Gloor, G. B. (2014).
Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16s rrna gene
sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1):1.
Fornell, C. , Barclay, D. W. , and Rhee, B.-D. (1988). A Model and Simple Iterative Algorithm For Redundancy
Analysis. Multivariate Behavioral Research, 23(3):349–360.
Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association,
84:165–175.
Friedman, J. , Hastie, T. , Höfling, H. , Tibshirani, R. , et al. (2007). Pathwise coordinate optimization. The
Annals of Applied Statistics, 1(2):302–332.
Friedman, J. , Hastie, T. , and Tibshirani, R. (2001). The Elements of Statistical Learning , volume 1. Springer,
Series in statistics, New York.
Friedman, J. , Hastie, T. , and Tibshirani, R. (2010). Regularization paths for generalized linear models via
coordinate descent. Journal of Statistical Software, 33(1):1.
Gagnon-Bartsch, J. A. and Speed, T. P. (2012). Using control genes to correct for unwanted variation in
microarray data. Biostatistics, 13(3):539–552.
Gaude, E. , Chignola, F. , Spiliotopoulos, D. , Spitaleri, A. , Ghitti, M. , García-Manteiga, J. M. , Mari, S. , and
Musco, G. (2013). muma, an R package for metabolomics univariate and multivariate statistical analysis.
Current Metabolomics, 1(2):180–189.
Gittins, R. (1985). Canonical Analysis: A Review with Applications in Ecology. Springer-Verlag.
Gligorijević, V. and Pržulj, N. (2015). Methods for biological data integration: perspectives and challenges.
Journal of the Royal Society Interface, 12(112):20150571.
Gloor, G. B. , Macklaim, J. M. , Pawlowsky-Glahn, V. , and Egozcue, J. J. (2017). Microbiome datasets are
compositional: and this is not optional. Frontiers in Microbiology, 8:2224.
Golub, G. H. and Reinsch, C. (1971). Singular value decomposition and least squares solutions. In Wilkinson,
J. H. and Reinsch, C. , editors, Linear Algebra, pages 134–151. Springer.
Golub, G. and Van Loan, C. (1996). Matrix Computations. Johns Hopkins University Press.
González, I. , Déjean, S. , Martin, P. G. , and Baccini, A. (2008). CCA: An R package to extend canonical
correlation analysis. Journal of Statistical Software, 23(12):1–14.
González, I. , Déjean, S. , Martin, P. , Gonçalves, O. , Besse, P. , and Baccini, A. (2009). Highlighting
relationships between heterogeneous biological data through graphical displays based on regularized canonical
correlation analysis. Journal of Biological Systems, 17(02):173–199.
González, I. , Lê Cao, K.-A. , Davis, M. J. , Déjean, S. , et al. (2012). Visualising associations between paired
‘omics’ data sets. BioData Mining, 5(1):19.
Günther, O. P. , Chen, V. , Freue, G. C. , Balshaw, R. F. , Tebbutt, S. J. , Hollander, Z. , Takhar, M. , McMaster,
W. R. , McManus, B. M. , Keown, P. A. , et al. (2012). A computational pipeline for the development of multi-
marker bio-signature panels and ensemble classifiers. BMC bioinformatics, 13(1):326.
Haghverdi, L. , Lun, A. T. , Morgan, M. D. , and Marioni, J. C. (2018). Batch effects in single-cell RNA-
sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 36(5):421–427.
Hardoon, D. R. and Shawe-Taylor, J. (2011). Sparse canonical correlation analysis. Machine Learning,
83(3):331–353.
Hastie, T. and Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association,
84(406):502–516.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences,
44(1):1–12.
Hervé, M. (2018). RVAideMemoire: Testing and Plotting Procedures for Biostatistics. R package version
0.9-69-3.
Hervé, M. R. , Nicolè, F. , and Lê Cao, K.-A. (2018). Multivariate analysis of multiple datasets: a practical guide
for chemical ecology. Journal of Chemical Ecology, 44(3):215–234.
Hie, B. , Bryson, B. , and Berger, B. (2019). Efficient integration of heterogeneous single-cell transcriptomes
using Scanorama. Nature Biotechnology, 37(6):685–691.
Hoerl, A. E. (1964). Ridge analysis. In Chemical engineering progress symposium series (Vol. 60, No. 67–77,
p. 329).
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for non-orthogonal problems.
Technometrics, 12:55–67.
Hornung, R. , Boulesteix, A.-L. , and Causeur, D. (2016). Combining location-and-scale batch effect adjustment
with data cleaning by latent factor adjustment. BMC Bioinformatics, 17(1):27.
Hoskuldsson, A. (1988). PLS regression methods. Journal of Chemometrics, 2(3):211–228.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28:321–377.
Huang, S. , Chaudhary, K. , and Garmire, L. X. (2017). More is better: recent progress in multi-omics data
integration methods. Frontiers in Genetics, 8:84.
Hyvärinen, A. and Oja, E. (2001). Independent component analysis: algorithms and applications. Neural
Networks, 13(4):411–430.
Indahl, U. G. , Liland, K. H. , and Næs, T. (2009). Canonical partial least squares – a unified pls approach to
classification and regression problems. Journal of Chemometrics: A Journal of the Chemometrics Society,
23(9):495–504.
Indahl, U. G. , Martens, H. , and Næs, T. (2007). From dummy regression to prior probabilities in pls-da.
Journal of Chemometrics: A Journal of the Chemometrics Society, 21(12):529–536.
James, G. , Witten, D. , Hastie, T. , and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume
112. Springer.
Johnson, W. E. , Li, C. , and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using
empirical Bayes methods. Biostatistics, 8(1):118–127.
Jolliffe, I. (2005). Principal component analysis. Wiley Online Library.
Karlis, D. , Saporta, G. , and Spinakis, A. (2003). A simple rule for the selection of principal components.
Communications in Statistics-Theory and Methods, 32(3):643–666.
Khan, J. , Wei, J. S. , Ringner, M. , Saal, L. H. , Ladanyi, M. , Westermann, F. , Berthold, F. , Schwab, M. ,
Antonescu, C. R. , Peterson, C. , et al. (2001). Classification and diagnostic prediction of cancers using gene
expression profiling and artificial neural networks. Nature Medicine, 7(6):673–679.
Kim, Y. , Kwon, S. , and Choi, H. (2012). Consistent model selection criteria on high dimensions. Journal of
Machine Learning Research, 13(Apr):1037–1057.
Kunin, V. , Engelbrektson, A. , Ochman, H. , and Hugenholtz, P. (2010). Wrinkles in the rare biosphere:
pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology,
12(1):118–123.
Lazzeroni, L. and Ray, A. (2012). The cost of large numbers of hypothesis tests on power, effect size and
sample size. Molecular Psychiatry, 17(1):108.
Lê Cao, K.-A. , Boitard, S. , and Besse, P. (2011). Sparse PLS Discriminant Analysis: biologically relevant
feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12(1):253.
Lê Cao, K.-A. , Costello, M.-E. , Chua, X.-Y. , Brazeilles, R. , and Rondeau, P. (2016). MixMC: Multivariate
insights into microbial communities. PloS One, 11(8):e0160169.
Lê Cao, K.-A. , Martin, P. G. , Robert-Granié, C. , and Besse, P. (2009). Sparse canonical methods for
biological data integration: application to a cross-platform study. BMC Bioinformatics, 10(1):34.
Lê Cao, K.-A. , Rohart, F. , McHugh, L. , Korn, O. , and Wells, C. A. (2014). YuGene: a simple approach to
scale gene expression data derived from different platforms for integrated analyses. Genomics,
103(4):239–251.
Lê Cao, K. , Rossouw, D. , Robert-Granié, C. , Besse, P. , et al. (2008). A sparse PLS for variable selection
when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(35).
Lee, A. H. , Shannon, C. P. , Amenyogbe, N. , Bennike, T. B. , Diray-Arce, J. , Idoko, O. T. , Gill, E. E. , Ben-
Othman, R. , Pomat, W. S. , Van Haren, S. D. , et al. (2019). Dynamic molecular changes during the first week
of human life follow a robust developmental trajectory. Nature Communications, 10(1):1092.
Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable
analysis. PLoS Genetics, 3(9):e161.
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. The American
Statistician, 55(3):187–193.
Leurgans, S. E. , Moyeed, R. A. , and Silverman, B. W. (1993). Canonical correlation analysis when the data
are curves. Journal of the Royal Statistical Society. Series B, 55:725–740.
Li, Z. , Safo, S. E. , and Long, Q. (2017). Incorporating biological information in sparse principal component
analysis with application to genomic data. BMC Bioinformatics, 18(1):332.
Lin, Y. , Ghazanfar, S. , Wang, K. Y. , Gagnon-Bartsch, J. A. , Lo, K. K. , Su, X. , Han, Z.-G. , Ormerod, J. T. ,
Speed, T. P. , Yang, P. , and Jyh, Y. (2019). scMerge leverages factor analysis, stable expression, and
pseudoreplication to merge multiple single-cell RNA-seq datasets. Proceedings of the National Academy of
Sciences of the United States of America, 116(20):9775.
Liquet, B. , de Micheaux, P. L. , Hejblum, B. P. , and Thiébaut, R. (2016). Group and sparse group partial least
square approaches applied in genomics context. Bioinformatics, 32(1):35–42.
Liquet, B. , Lafaye de Micheaux, P. , and Broc, C. (2017). sgPLS: Sparse Group Partial Least Square Methods.
R package version 1.7.
Liquet, B. , Lê Cao, K.-A. , Hocini, H. , and Thiébaut, R. (2012). A novel approach for biomarker selection and
the integration of repeated measures experiments from two assays. BMC Bioinformatics, 13:325.
Listgarten, J. , Kadie, C. , Schadt, E. E. , and Heckerman, D. (2010). Correction for hidden confounders in the
genetic analysis of gene expression. Proceedings of the National Academy of Sciences, 107(38):16465–16470.
Liu, C. , Srihari, S. , Lal, S. , Gautier, B. , Simpson, P. T. , Khanna, K. K. , Ragan, M. A. , and Lê Cao, K.-A.
(2016). Personalised pathway analysis reveals association between DNA repair pathway dysregulation and
chromosomal instability in sporadic breast cancer. Molecular Oncology, 10(1):179–193.
Lock, E. F. , Hoadley, K. A. , Marron, J. S. , and Nobel, A. B. (2013). Joint and individual variation explained
(jive) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523.
Lohmöller, J.-B. (2013). Latent Variable Path Modeling with Partial Least Squares. Springer Science &
Business Media.
Lonsdale, J. , Thomas, J. , Salvatore, M. , Phillips, R. , Lo, E. , Shad, S. , Hasz, R. , Walters, G. , Garcia, F. ,
Young, N. , et al. (2013). The genotype-tissue expression (GTEx) project. Nature Genetics, 45(6):580.
Lorber, A. , Wangen, L. , and Kowalski, B. (1987). A theoretical foundation for the PLS algorithm. Journal of
Chemometrics, 1(1):19–31.
Lovell, D. , Pawlowsky-Glahn, V. , Egozcue, J. J. , Marguerat, S. , and Bähler, J. (2015). Proportionality: a valid
alternative to correlation for relative data. PLoS Computational Biology, 11(3):e1004075.
Luecken, M. D. and Theis, F. J. (2019). Current best practices in single-cell RNA-seq analysis: a tutorial.
Molecular Systems Biology, 15(6).
Ma, X. , Chen, T. , and Sun, F. (2013). Integrative approaches for predicting protein function and prioritizing
genes for complex phenotypes using protein interaction networks. Briefings in Bioinformatics,
15(5):685–698.
MacKay, R. J. and Oldford, R. W. (2000). Scientific method, statistical method and the speed of light. Statistical
Science, pages 254–278.
Mardia, K. V. , Kent, J. T. , and Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
Martin, P. , Guillou, H. , Lasserre, F. , Déjean, S. , Lan, A. , Pascussi, J.-M. , San Cristobal, M. , Legrand, P. ,
Besse, P. , and Pineau, T. (2007). Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic
metabolism revealed through a nutrigenomic study. Hepatology, 54:767–777.
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 72(4):417–473.
Mizuno, H. , Ueda, K. , Kobayashi, Y. , Tsuyama, N. , Todoroki, K. , Min, J. Z. , and Toyo'oka, T. (2017). The
great importance of normalization of LC–MS data for highly-accurate non-targeted metabolomics. Biomedical
Chromatography, 31(1):e3864.
Muñoz-Romero, S. , Arenas-Garca, J. , and Gómez-Verdejo, V. (2015). Sparse and kernel OPLS feature
extraction based on eigenvalue problem solving. Pattern Recognition, 48(5):1797–1811.
Murdoch, D. and Chow, E. (1996). A graphical display of large correlation matrices. The American Statistician,
50(2):178–180.
Newman, A. M. and Cooper, J. B. (2010). Lab-specific gene expression signatures in pluripotent stem cells.
Cell Stem Cell, 7(2):258–262.
Nguyen, D. V. and Rocke, D. M. (2002a). Multi-class cancer classification via partial least squares with gene
expression profiles. Bioinformatics, 18(9):1216–1226.
Nguyen, D. V. and Rocke, D. M. (2002b). Tumor classification by partial least squares using microarray gene
expression data. Bioinformatics, 18(1):39.
Nichols, J. D. , Oli, M. K. , Kendall, W. L. , and Boomer, G. S. (2021). Opinion: A better approach for dealing
with reproducibility and replicability in science. Proceedings of the National Academy of Sciences, 118(7).
O'Connell, M. J. and Lock, E. F. (2016). R. jive for exploration of multi-source molecular data. Bioinformatics,
32(18):2877–2879.
Pan, W. , Xie, B. , and Shen, X. (2010). Incorporating predictor network in penalized regression with application
to microarray data. Biometrics, 66(2):474–484.
Parkhomenko, E. , Tritchler, D. , and Beyene, J. (2009). Sparse canonical correlation analysis with application
to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34.
Poirier, S. , Déjean, S. , Midoux, C. , Lê Cao, K.-A. , and Chapleur, O. (2020). Integrating independent microbial
studies to build predictive models of anaerobic digestion inhibition. Bioresource Technology, 316(123952).
Prasasya, R. D. , Vang, K. Z. , and Kreeger, P. K. (2012). A multivariate model of ErbB network composition
predicts ovarian cancer cell response to canertinib. Biotechnology and Bioengineering, 109(1):213–224.
Argelaguet, R. , Velten, B. , Arnol, D. , Buettner, F. , Huber, W. , and Stegle, O. (2020).
MOFA: Multi-Omics Factor Analysis (MOFA). R package version 1.4.0.
Richardson, S. , Tseng, G. , and Sun, W. (2016). Statistical methods in integrative genomics. Annual Review of
Statistics and Its Application, 3(1):181–209.
Ritchie, M. D. , de Andrade, M. , and Kuivaniemi, H. (2015). The foundation of precision medicine: integration of
electronic health records with genomics through basic, clinical, and translational research. Frontiers in genetics,
6:104.
Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of
RNA-seq data. Genome Biology, 11(3):R25.
Roemer, E. (2016). A tutorial on the use of PLS path modeling in longitudinal studies. Industrial Management &
Data Systems, 116(9):1901–1921.
Rohart, F. , Gautier, B. , Singh, A. , and Lê Cao, K.-A. (2017a). mixOmics: An r package for ‘omics feature
selection and multiple data integration. PLoS Computational Biology, 13(11):e1005752.
Rohart, F. , Matigian, N. , Eslami, A. , Stephanie, B. , and Lê Cao, K.-A. (2017b). Mint: a multivariate integrative
method to identify reproducible molecular signatures across independent experiments and platforms. BMC
Bioinformatics, 18(1):128.
Ruiz-Perez, D. , Guan, H. , Madhivanan, P. , Mathee, K. , and Narasimhan, G. (2020). So you think you can
pls-da? BMC Bioinformatics, 21(1):1–10.
Saccenti, E. , Hoefsloot, H. C. , Smilde, A. K. , Westerhuis, J. A. , and Hendriks, M. M. (2014). Reflections on
univariate and multivariate analysis of metabolomics data. Metabolomics, 10(3):0.
Saccenti, E. and Timmerman, M. E. (2016). Approaches to sample size determination for multivariate data:
Applications to pca and pls-da of omics data. Journal of Proteome Research, 15(8):2379–2393.
Saelens, W. , Cannoodt, R. , and Saeys, Y. (2018). A comprehensive evaluation of module detection methods
for gene expression data. Nature Communications, 9(1):1–12.
Saporta, G. (2006). Probabilités analyse des données et statistique. Technip.
Schafer, J. , Opgen-Rhein, R. , Zuber, V. , Ahdesmaki, M. , Silva, A. P. D. , and Strimmer., K. (2017). corpcor:
Efficient Estimation of Covariance and (Partial) Correlation. R package version 1.6.9.
Schäfer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and
implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1).
Scherf, U. , Ross, D. T. , Waltham, M. , Smith, L. H. , Lee, J. K. , Tanabe, L. , Kohn, K. W. , Reinhold, W. C. ,
Myers, T. G. , Andrews, D. T. , et al. (2000). A gene expression database for the molecular pharmacology of
cancer. Nature Genetics, 24(3):236–244.
Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix
approximation. Journal of Multivariate Analysis, 99(6):1015–1034.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3):289–310.
Sill, M. , Saadati, M. , and Benner, A. (2015). Applying stability selection to consistently estimate sparse
principal components in high-dimensional molecular data. Bioinformatics, 31(16):2683–2690.
Simon, N. , Friedman, J. , Hastie, T. , and Tibshirani, R. (2013). A sparse-group lasso. Journal of
Computational and Graphical Statistics, 22(2):231–245.
Simon, N. and Tibshirani, R. (2012). Standardization and the group lasso penalty. Statistica Sinica, 22(3):983.
Sims, A. H. , Smethurst, G. J. , Hey, Y. , Okoniewski, M. J. , Pepper, S. D. , Howell, A. , Miller, C. J. , and
Clarke, R. B. (2008). The removal of multiplicative, systematic bias allows integration of breast cancer gene
expression datasets – improving meta-analysis and prediction of prognosis. BMC medical genomics, 1(1):42.
Singh, A. , Shannon, C. P. , Gautier, B. , Rohart, F. , Vacher, M. , Tebbutt, S. J. , and Lê Cao, K.-A. (2019).
Diablo: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics,
35(17):3055–3062.
Smit, S. , van Breemen, M. J. , Hoefsloot, H. C. , Smilde, A. K. , Aerts, J. M. , and De Koster, C. G. (2007).
Assessing the statistical validity of proteomics based biomarkers. Analytica Chimica Acta, 592(2):210–217.
Smyth, G. K. (2005). Limma: linear models for microarray data. In Bioinformatics and Computational Biology
Solutions Using R and Bioconductor, pages 397–420. Springer.
Sompairac, N. , Nazarov, P. V. , Czerwinska, U. , Cantini, L. , Biton, A. , Molkenov, A. , Zhumadilov, Z. , Barillot,
E. , Radvanyi, F. , Gorban, A. , et al. (2019). Independent component analysis for unraveling the complexity of
cancer omics datasets. International Journal of Molecular Sciences, 20(18):4414.
Sørlie, T. , Perou, C. M. , Tibshirani, R. , Aas, T. , Geisler, S. , Johnsen, H. , Hastie, T. , Eisen, M. B. , Van De
Rijn, M. , Jeffrey, S. S. , et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor
subclasses with clinical implications. Proceedings of the National Academy of Sciences,
98(19):10869–10874.
Ståhle, L. and Wold, S. (1987). Partial least squares analysis with cross-validation for the two-class problem: a
Monte Carlo study. Journal of Chemometrics, 1(3):185–196.
Stein-O'Brien, G. L. , Arora, R. , Culhane, A. C. , Favorov, A. V. , Garmire, L. X. , Greene, C. S. , … and Fertig,
E. J. (2018). Enter the matrix: factorization uncovers knowledge from omics. Trends in Genetics,
34(10):790–805.
Straube, J., Gorse, A. D., PROOF Centre of Excellence Team, Huang, B. E., and Lê Cao, K. A . (2015). A
linear mixed model spline framework for analysing time course ‘omics’ data. PLoS ONE, 10(8):e0134540.
Susin, A. , Wang, Y. , Lê Cao, K.-A. , and Calle, M. L. (2020). Variable selection in microbiome compositional
data analysis. NAR Genomics and Bioinformatics, 2(2):lqaa029.
Szakács, G. , Annereau, J.-P. , Lababidi, S. , Shankavaram, U. , Arciello, A. , Bussey, K. J. , Reinhold, W. ,
Guo, Y. , Kruh, G. D. , Reimers, M. , et al. (2004). Predicting drug sensitivity and resistance: profiling ABC
transporter genes in cancer cells. Cancer Cell, 6(2):129–137.
Tan, Y. , Shi, L. , Tong, W. , Gene Hwang, G. , and Wang, C. (2004). Multi-class tumor classification by
discriminant partial least squares using microarray gene expression data and assessment of classification
models. Computational Biology and Chemistry, 28(3):235–243.
Tapp, H. S. and Kemsley, E. K. (2009). Notes on the practical utility of OPLS. TrAC Trends in Analytical
Chemistry, 28(11):1322–1327.
Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Editions Technip.
Tenenhaus, A. and Guillemot, V. (2017). RGCCA: Regularized and sparse generalized canonical correlation
analysis for multiblock data. R package version, 2(2).
Tenenhaus, A. , Philippe, C. , and Frouin, V. (2015). Kernel generalized canonical correlation analysis.
Computational Statistics & Data Analysis, 90:114–131.
Tenenhaus, A. , Philippe, C. , Guillemot, V. , Le Cao, K. A. , Grill, J. , and Frouin, V. (2014). Variable selection
for generalized canonical correlation analysis. Biostatistics, 15(3):569–583.
Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis.
Psychometrika, 76(2):257–284.
Tenenhaus, M. , Tenenhaus, A. , and Groenen, P. J. (2017). Regularized generalized canonical correlation
analysis: a framework for sequential multiblock component methods. Psychometrika, 82(3):737–777.
Thévenot, E. A. , Roux, A. , Xu, Y. , Ezan, E. , and Junot, C. (2015). Analysis of the human adult urinary
metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for
univariate and opls statistical analyses. Journal of Proteome Research, 14(8):3322–3335.
Tian, L. , Dong, X. , Freytag, S. , Lê Cao, K.-A. , Su, S. , JalalAbadi, A. , Amann-Zalcenstein, D. , Weber, T. S. ,
Seidi, A. , Jabbari, J. S. , et al. (2019). Benchmarking single cell RNA-sequencing analysis pipelines using
mixture control experiments. Nature Methods, 16(6):479–487.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society:
Series B (Methodological), 58(1):267–288.
Trygg, J. and Lundstedt, T. (2007). Chemometrics Techniques for Metabonomics, pages 171–200. Elsevier.
Trygg, J. and Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics,
16(3):119–128.
Tseng, G. C. , Ghosh, D. , and Feingold, E. (2012). Comprehensive literature review and statistical
considerations for microarray meta-analysis. Nucleic Acids Research, 40(9):3785–3799.
Umetri, A. (1996). SIMCA-P for windows, Graphical Software for Multivariate Process Modeling. Umea,
Sweden.
Välikangas, T. , Suomi, T. , and Elo, L. L. (2016). A systematic evaluation of normalization methods in
quantitative label-free proteomics. Briefings in Bioinformatics, 19(1):1–11.
van den Wollenberg, A. L. (1977). Redundancy analysis an alternative for canonical correlation analysis.
Psychometrika, 42(2):207–219.
van Gerven, M. A. , Chao, Z. C. , and Heskes, T. (2012). On the decoding of intracranial data using sparse
orthonormalized partial least squares. Journal of Neural Engineering, 9(2):026017.
Vinod, H. D. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 6:129–137.
Waaijenborg, S. , de Witt Hamer, P. C. V. , and Zwinderman, A. H. (2008). Quantifying the association between
gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in
Genetics and Molecular Biology, 7(1).
Wang, T. , Guan, W. , Lin, J. , Boutaoui, N. , Canino, G. , Luo, J. , Celedón, J. C. , and Chen, W. (2015). A
systematic study of normalization methods for Infinium 450k methylation data using whole-genome bisulfite
sequencing data. Epigenetics, 10(7):662–669.
Wang, Y. and Lê Cao, K.-A. (2019). Managing batch effects in microbiome data. Briefings in Bioinformatics,
21:1477–4054.
Wang, Y. and Lê Cao, K.-A. (2020). A multivariate method to correct for batch effects in microbiome data.
bioRxiv.
Wang, Z. and Xu, X. (2021). Testing high dimensional covariance matrices via posterior bayes factor. Journal of
Multivariate Analysis, 181:104674.
Wegelin, J. A. et al. (2000). A survey of partial least squares (pls) methods, with emphasis on the two-block
case. Technical report, University of Washington, Tech. Rep.
Weinstein, J. N. , Kohn, K. W. , Grever, M. R. , Viswanadhan, V. N. , Rubinstein, L. V. , Monks, A. P. , Scudiero,
D. A. , Welch, L. , Koutsoukos, A. D. , Chiausa, A. J. , et al. (1992). Neural computing in cancer drug
development: predicting mechanism of action. Science, 258(5081):447–451.
Weinstein, J. , Myers, T. , Buolamwini, J. , Raghavan, K. , Van Osdol, W. , Licht, J. , Viswanadhan, V. , Kohn,
K. , Rubinstein, L. , Koutsoukos, A. , et al. (1994). Predictive statistics and artificial intelligence in the us
national cancer institute's drug discovery program for cancer and aids. Stem Cells, 12(1):13–22.
Weinstein, J. N. , Myers, T. G. , O'Connor, P. M. , Friend, S. H. , Fornace Jr, A. J. , Kohn, K. W. , Fojo, T. ,
Bates, S. E. , Rubinstein, L. V. , Anderson, N. L. , Buolamwini, J. K. , van Osdol, W. W. , Monks, A. P. ,
Scudiero, D. A. , Sausville, E. A. , Zaharevitz, D. W. , Bunow, B. , Viswanadhan, V. N. , Johnson, G. S. , Wittes,
R. E. , and Paull, K. D. (1997). An information-intensive approach to the molecular pharmacology of cancer.
Science, 275:343–349.
Wells, C. A. , Mosbergen, R. , Korn, O. , Choi, J. , Seidenman, N. , Matigian, N. A. , Vitale, A. M. , and
Shepherd, J. (2013). Stemformatics: visualisation and sharing of stem cell gene expression. Stem Cell
Research, 10(3):387–395.
Westerhuis, J. A. , Hoefsloot, H. C. , Smit, S. , Vis, D. J. , Smilde, A. K. , van Velzen, E. J. , van Duijnhoven, J.
P. , and van Dorsten, F. A. (2008). Assessment of PLSDA cross validation. Metabolomics, 4(1):81–89.
Westerhuis, J. A. , van Velzen, E. J. , Hoefsloot, H. C. , and Smilde, A. K. (2010). Multivariate paired data
analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6(1):119–128.
Wilms, I. and Croux, C. (2015). Sparse canonical correlation analysis from a predictive point of view.
Biometrical Journal, 57(5):834–851.
Wilms, I. and Croux, C. (2016). Robust sparse canonical correlation analysis. BMC Systems Biology, 10(1):72.
Witten, D. and Tibshirani, R. (2020). PMA: Penalized Multivariate Analysis. R package version 1.2.1.
Witten, D. , Tibshirani, R. , Gross, S. , and Narasimhan, B. (2013). PMA: Penalized Multivariate Analysis. R
package version 1.9.
Witten, D. M. , Tibshirani, R. , and Hastie, T. (2009). A penalized matrix decomposition, with applications to
sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534.
Wold, H. (1966). Estimation of Principal Components and Related Models by Iterative Least Squares, pages
391–420. New York: Academic Press.
Wold, H. (1982). Soft modeling: the basic design and some extensions. Systems under indirect observation,
2:343.
Wold, S. , Sjöström, M. , and Eriksson, L. (2001). Pls-regression: a basic tool of chemometrics. Chemometrics
and Intelligent Laboratory Systems, 58(2):109–130.
Yao, F. , Coquery, J. , and Lê Cao, K.-A. (2012). Independent Principal Component Analysis for biologically
meaningful dimension reduction of large biological data sets. BMC Bioinformatics, 13(1):24.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.
Žitnik, M. , Janjić, V. , Larminie, C. , Zupan, B. , and Pržulj, N. (2013). Discovering disease-disease
associations by fusing systems-level molecular data. Scientific reports, 3(1):1–9.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
Zou, H. and Hastie, T. (2018). Elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. R package
version 1.1.1.
Zou, H. , Hastie, T. , and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational
and Graphical Statistics, 15(2):265–286.
