Computational Methods and Data Analysis for Metabolomics
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, UK
Edited by
Shuzhao Li
Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
This Humana imprint is published by the registered company Springer Science+Business Media, LLC part of Springer
Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
Metabolomics is the new biochemistry. It reinvigorates the old discipline with new data at a
large scale: simultaneous measurement of thousands of chemicals in biological samples.
Many of these chemicals are beyond the known metabolic intermediates. This new information
fills an important gap between the interactions of genome and environment, thus
conferring enormous potential for improving human health. The metabolomics data also
overlap significantly with the exposome, which aims to quantify all environmental exposures.
The explosive growth of metabolomics creates a large gap in training on metabolomics data
analysis. This book aims to provide a comprehensive guide for scientists, engineers, and students
who employ metabolomics in their work, with an emphasis on the understanding and
interpretation of the data.
The book is organized as follows. Chapter 1 provides an overview of the field, and the
following chapters are presented in four sections: data processing for major experimental
platforms (Chapters 2–7), databases and metabolite annotation (Chapters 8–13), major
techniques in data analysis (Chapters 14–19), and biomedical applications (Chapters
20–23).
While it is not possible to cover all the databases and software tools, we aim to have
representations of each major topic and give readers a foundation to work in this field. It is
critical to note that the scientific landscape keeps evolving and tools keep changing.
Therefore, it is more important to understand the rationale and principles than to replicate the
protocols. This book is supplemented by example data and code at GitHub
(https://metabolomics-data.github.io), which can be continuously updated by the community.
I would like to express my gratitude to the metabolomics group at Emory University,
especially Dean Jones, Tianwei Yu, Young-Mi Go, Youngja Park, Karan Uppal, Douglas
Walker, and Gary Miller. Their intellectual input and friendship made my scientific journey
truly rewarding.
Contents
Preface
Contributors
1 Overview of Experimental Methods and Study Design in Metabolomics, and Statistical and Pathway Considerations
Stephen Barnes
2 Metabolomics Data Processing Using XCMS
Xavier Domingo-Almenara and Gary Siuzdak
3 Metabolomics Data Preprocessing Using ADAP and MZmine 2
Xiuxia Du, Aleksandr Smirnov, Tomáš Pluskal, Wei Jia, and Susan Sumner
4 Metabolomics Data Processing Using OpenMS
Marc Rurik, Oliver Alka, Fabian Aicheler, and Oliver Kohlbacher
5 Analysis of NMR Metabolomics Data
Wimal Pathmasiri, Kristine Kay, Susan McRitchie, and Susan Sumner
6 Key Concepts Surrounding Studies of Stable Isotope-Resolved Metabolomics
Stephen F. Previs and Daniel P. Downes
7 Extracting Biological Insight from Untargeted Lipidomics Data
Jennifer E. Kyle
8 Overview of Tandem Mass Spectral and Metabolite Databases for Metabolite Identification in Metabolomics
Zhangtao Yi and Zheng-Jiang Zhu
9 METLIN: A Tandem Mass Spectral Library of Standards
J. Rafael Montenegro-Burke, Carlos Guijas, and Gary Siuzdak
10 Metabolomic Data Exploration and Analysis with the Human Metabolome Database
David S. Wishart
11 De Novo Molecular Formula Annotation and Structure Elucidation Using SIRIUS 4
Marcus Ludwig, Markus Fleischauer, Kai Dührkop, Martin A. Hoffmann, and Sebastian Böcker
12 Annotation of Specialized Metabolites from High-Throughput and High-Resolution Mass Spectrometry Metabolomics
Thomas Naake, Emmanuel Gaquerel, and Alisdair R. Fernie
13 Feature-Based Molecular Networking for Metabolite Annotation
Vanessa V. Phelan
14 A Bioinformatics Primer to Data Science, with Examples for Metabolomics
W. Stephen Pittard, Cecilia “Keeko” Villaveces, and Shuzhao Li
15 The Essential Toolbox of Data Science: Python, R, Git, and Docker
W. Stephen Pittard and Shuzhao Li
Index
Contributors
Abstract
Metabolomics has become a powerful tool in biological and clinical investigations. This chapter reviews the
technological basis of metabolomics and the considerations in answering biomedical questions. The workflow
of metabolomics is explained in the sequence of data processing, quality control, metabolite annotation,
statistical analysis, pathway analysis, and multi-omics integration. Reproducibility in both sample
analysis and data analysis is key to scientific progress, and recommendations are made on reporting
standards in publications. This chapter explains the technical aspects of metabolomics in the context of
systems biology and applications to human health.
Key words Metabolomics, GC-MS, LC-MS, NMR, Precision medicine, Systems medicine, Annotation,
Recommendation
1 Introduction
1.1 The Age of Omics and Precision Medicine
In biomedical science, the late 1980s saw a great change in the forms and the scale of research
data. Instead of cloning individual cDNAs that encoded the open reading frames of individual genes,
the NIH funded sequencing of the whole human genome. The data were collected without recourse to a
hypothesis; instead, the human genome project was intended to create a national (and later
international) resource of genes and genomes. As a result, the opportunity arose to place
representatives of the open reading frames of genes onto small glass chips (microarrays), thus
allowing for the “whole” transcriptome to be interrogated in a single experiment. Since then, in
studying the transcriptome, the need for selected DNA sequences has been removed, and instead,
with further engineering, direct sequencing of the RNA transcripts (RNA-Seq) has become possible.
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_1, © Springer Science+Business Media, LLC, part of Springer Nature 2020
CE-MS can be fast and work well with small volumes [14]. As is
occurring in transcriptomics and proteomics, there is a strong
interest in studying the metabolome of single cells. However, the
fluid coming from a single cell without dilution is in the nL or even
pL range. CESI-MS is suited to these volumes and since it also
generates a very sharp peak for each metabolite, it has allowed for
the analysis of over 100 metabolites from a frog embryo cell [15].
1.3 What Is the Metabolome?
The metabolome in a human biological fluid or even in the culture of a single cell is not fully
predicted by the genes in the genome of that species. Since humans can synthesize only ten of the
twenty amino acids that make up human proteins from simpler precursors, they and other forms of
life have to eat (or be fed) foods coming from other genomes. Some foods, such as fruits
and vegetables, have genomes, and hence secondary metabolic pathways, that are completely unrelated
to those in humans. In addition to amino acids, these other sources may provide critical small
molecules that are vital for life (vitamins), many of which are
4 Data Analysis
References
1. Hillenkamp F, Karas M (1990) Mass spectrometry of peptides and proteins by matrix-assisted ultraviolet laser desorption/ionization. Methods Enzymol 193:280–295
2. Want EJ, Cravatt BF, Siuzdak G (2005) The expanding role of mass spectrometry in metabolite profiling and characterization. Chembiochem 6(11):1941–1951. https://doi.org/10.1002/cbic.200500151
3. Barnes S, Benton HP, Casazza K, Cooper SJ, Cui X, Du X, Engler J, Kabarowski JH, Li S, Pathmasiri W, Prasain JK, Renfrow MB, Tiwari HK (2016) Training in metabolomics research. I. Designing the experiment, collecting and extracting samples and generating metabolomics data. J Mass Spectrom 51(7):461–475. https://doi.org/10.1002/jms.3782
4. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26(12):1367–1372. https://doi.org/10.1038/nbt.1511
5. Craig R, Cortens JC, Fenyo D, Beavis RC (2006) Using annotated peptide mass spectrum libraries for protein identification. J Proteome Res 5(8):1843–1849. https://doi.org/10.1021/pr0602085
6. Huttlin EL, Hegeman AD, Harms AC, Sussman MR (2007) Prediction of error associated with false-positive rate determination for peptide identification in large-scale proteomics experiments using a combined reverse and forward peptide sequence database strategy. J Proteome Res 6(1):392–398. https://doi.org/10.1021/pr0603194
7. Nesvizhskii AI (2007) Protein identification by tandem mass spectrometry and sequence database searching. Methods Mol Biol 367:87–119. https://doi.org/10.1385/1-59745-275-0:87
8. Anjo SI, Santa C, Manadas B (2017) SWATH-MS as a tool for biomarker discovery: from basic research to clinical applications. Proteomics 17(3–4). https://doi.org/10.1002/pmic.201600278
9. Peterson AC, Russell JD, Bailey DJ, Westphall MS, Coon JJ (2012) Parallel reaction monitoring for high resolution and high mass accuracy quantitative, targeted proteomics. Mol Cell Proteomics 11(11):1475–1488. https://doi.org/10.1074/mcp.O112.020131
10. Rinschen MM, Ivanisevic J, Giera M, Siuzdak G (2019) Identification of bioactive metabolites using activity metabolomics. Nat Rev Mol Cell Biol 20(6):353–367. https://doi.org/10.1038/s41580-019-0108-4
11. Collins FS (2004) The case for a US prospective cohort study of genes and environment. Nature 429(6990):475–477. https://doi.org/10.1038/nature02628
12. Manrai AK, Cui Y, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, Ritchie M, Schmitt C, Sarigiannis DA, Thomas DC, Wishart D, Balshaw DM, Patel CJ (2017) Informatics and data analytics to support exposome-based discovery for public health. Annu Rev Public Health 38:279–294. https://doi.org/10.1146/annurev-publhealth-082516-012737
13. Li S, Cirillo P, Hu X, Tran V, Krigbaum N, Yu S, Jones DP, Cohn B (2019) Understanding mixed environmental exposures using metabolomics via a hierarchical community network model in a cohort of California women in 1960’s. Reprod Toxicol. pii: S0890-6238(18)30603-8. https://doi.org/10.1016/j.reprotox.2019.06.013
14. Stolz A, Jooss K, Hocker O, Romer J, Schlecht J, Neususs C (2019) Recent advances in capillary electrophoresis-mass spectrometry: instrumentation, methodology and applications. Electrophoresis 40(1):79–112. https://doi.org/10.1002/elps.201800331
15. Onjiko RM, Portero EP, Moody SA, Nemes P (2017) In situ microprobe single-cell capillary electrophoresis mass spectrometry: metabolic reorganization in single differentiating cells in the live vertebrate (Xenopus laevis) embryo. Anal Chem 89(13):7069–7076. https://doi.org/10.1021/acs.analchem.7b00880
16. Members MSIB, Sansone SA, Fan T, Goodacre R, Griffin JL, Hardy NW, Kaddurah-Daouk R, Kristal BS, Lindon J, Mendes P, Morrison N, Nikolau B, Robertson D, Sumner LW, Taylor C, van der Werf M, van Ommen B, Fiehn O (2007) The metabolomics standards initiative. Nat Biotechnol 25(8):846–848. https://doi.org/10.1038/nbt0807-846b
17. Salek RM, Steinbeck C, Viant MR, Goodacre R, Dunn WB (2013) The role of reporting standards for metabolite annotation and identification in metabolomic studies. Gigascience 2(1):13. https://doi.org/10.1186/2047-217X-2-13
18. Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, Fan TW, Fiehn O, Goodacre R, Griffin JL, Hankemeier T, Hardy N, Harnly J, Higashi R, Kopka J, Lane AN, Lindon JC, Marriott P, Nicholls AW, Reily MD, Thaden JJ, Viant MR (2007) Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics 3(3):211–221. https://doi.org/10.1007/s11306-007-0082-2
19. Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, Hollender J (2014) Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 48(4):2097–2098. https://doi.org/10.1021/es5002105
20. Schrimpe-Rutledge AC, Codreanu SG, Sherrod SD, McLean JA (2016) Untargeted metabolomics strategies-challenges and emerging directions. J Am Soc Mass Spectrom 27(12):1897–1905. https://doi.org/10.1007/s13361-016-1469-y
21. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart DS, Xia J (2018) MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Res 46(W1):W486–W494. https://doi.org/10.1093/nar/gky310
22. Li S, Park Y, Duraisingham S, Strobel FH, Khan N, Soltow QA, Jones DP, Pulendran B (2013) Predicting network activity from high throughput metabolomics. PLoS Comput Biol 9(7):e1003123. https://doi.org/10.1371/journal.pcbi.1003123
23. Nakayasu ES, Nicora CD, Sims AC, Burnum-Johnson KE, Kim YM, Kyle JE, Matzke MM, Shukla AK, Chu RK, Schepmoes AA, Jacobs JM, Baric RS, Webb-Robertson BJ, Smith RD, Metz TO (2016) MPLEx: a robust and universal protocol for single-sample integrative proteomic, metabolomic, and lipidomic analyses. mSystems 1(3). https://doi.org/10.1128/mSystems.00043-16
24. Guo L, Milburn MV, Ryals JA, Lonergan SC, Mitchell MW, Wulff JE, Alexander DC, Evans AM, Bridgewater B, Miller L, Gonzalez-Garay ML, Caskey CT (2015) Plasma metabolomic profiles enhance precision medicine for volunteers of normal health. Proc Natl Acad Sci U S A 112(35):E4901–E4910. https://doi.org/10.1073/pnas.1508425112
Chapter 2
Abstract
XCMS is one of the most widely used software tools for liquid chromatography–mass spectrometry (LC-MS) data
processing, and it exists both as an R package and as a cloud-based platform known as XCMS Online. In this
chapter, we first give an overview of the nature of LC-MS data to contextualize the need for data processing
software. Next, we describe the algorithms used by XCMS and the role that the different user-defined
parameters play in the data processing. Finally, we describe the extended capabilities of XCMS Online.
Key words XCMS, Liquid chromatography, Mass spectrometry, Metabolomics, Data processing
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_2, © Springer Science+Business Media, LLC, part of Springer Nature 2020
the peak area (and thus relative concentration) between the same
ion peak across samples, allowing us to find statistically significant
dysregulated peaks for a given phenotype.
The processes that usually follow peak picking and alignment are
statistical analysis, to find dysregulated peaks in a given
phenotype, or metabolite annotation. Computational metabolite
annotation aims at providing chemical information on the observed
features [6]. This annotation usually consists of determining which
features stem from the same metabolite, determining the
nature of features (e.g., whether a feature is a protonated/
deprotonated species, an adduct, an in-source fragment, an isotope,
a dimer, etc.), and providing putative metabolite identifications.
As an example, the simplest annotation process consists of
matching the m/z of the observed features against a
database to determine their potential identity. This is considered,
however, a very weak annotation, and as we will discuss further in
this chapter, more advanced techniques based on in-source fragment
annotation can be used.
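The simplest annotation described above, matching an observed m/z against a database within a mass tolerance, can be sketched as follows. This is a minimal illustration and not the XCMS or METLIN implementation: the three-entry database, the function names, the 5 ppm default, and the restriction to [M+H]+ ions are all assumptions made for the example.

```python
# Hypothetical mini-database of neutral monoisotopic masses in Da.
# The selection of compounds is purely illustrative.
DATABASE = {
    "glucose": 180.06339,
    "citric acid": 192.02700,
    "phenylalanine": 165.07898,
}
PROTON = 1.007276  # mass of a proton in Da


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_mz(mz, tol_ppm=5.0):
    """Return (name, ppm error) for every [M+H]+ candidate within tol_ppm."""
    hits = []
    for name, mass in DATABASE.items():
        theoretical = mass + PROTON  # assumption: protonated species only
        error = ppm_error(mz, theoretical)
        if abs(error) <= tol_ppm:
            hits.append((name, round(error, 2)))
    return hits


candidates = match_mz(166.0863)
```

A match of this kind only says that a database entry has a compatible mass, which is why the text calls it a very weak annotation: isomers, adducts, and in-source fragments can all share the same m/z.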
In this chapter, we describe XCMS, a computational workflow
that integrates peak picking with alignment. It is available as an R
package and also as a cloud-based resource. The latter encompasses
a comprehensive suite of tools through an easy-to-use visual inter-
face that allows for data to be shared with the community or
collaborators and has a higher computational performance. It also
integrates tools and modules that enable systems biology analysis
and advanced metabolite annotation through its integration with
the METLIN spectral library (Fig. 1). A version of XCMS Online
for targeted metabolomics is also available, called XCMS-MRM.
Fig. 1 XCMS Online overview. Through an interactive user interface, XCMS Online allows for data sharing, data
streaming, advanced metabolite annotation, and systems biology analysis and multi-omics integration
XCMS has the option of using either the original matched filter
algorithm or the centWave algorithm. In this section, we will describe these
two algorithms.
When peaks in raw MS data are detected across multiple sam-
ples, the same peak stemming from a given ion will ideally appear
across the different samples with different intensities. Due to the
high accuracy of modern mass spectrometers such as TOF and
Orbitrap instruments, m/z values across samples tend to have var-
iations as low as 5 and 1 ppm, respectively. On the other hand, the
retention time will have larger variations across samples. Therefore,
peaks from the same ion detected across different samples need to
be aligned and grouped, that is, their area needs to be assigned to
the same row in a data matrix so they can be quantitatively com-
pared across different samples. Specifically, in XCMS, this peak
alignment is performed by a two-step process known as peak
grouping and retention time alignment. These two processes will
be described in the following sections.
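The idea of assigning peaks from different samples to the same row of a data matrix can be sketched with a simple greedy grouping. This is an illustrative simplification under stated assumptions, not the grouping algorithm used by XCMS: the function name, the tuple layout, and the fixed ppm and retention time tolerances are hypothetical.

```python
def group_peaks(peaks, samples, ppm=5.0, rt_tol=10.0):
    """Greedily group (sample, mz, rt, area) peaks into feature rows.

    Returns a matrix with one row per feature and one column per sample;
    samples without a peak for a feature get 0.0.
    """
    features = []  # each entry: {"mz": ..., "rt": ..., "areas": {sample: area}}
    for sample, mz, rt, area in sorted(peaks, key=lambda p: p[1]):
        for feat in features:
            same_mz = abs(mz - feat["mz"]) / feat["mz"] * 1e6 <= ppm
            same_rt = abs(rt - feat["rt"]) <= rt_tol
            if same_mz and same_rt and sample not in feat["areas"]:
                feat["areas"][sample] = area
                break
        else:
            # No compatible feature found: start a new row.
            features.append({"mz": mz, "rt": rt, "areas": {sample: area}})
    return [[f["areas"].get(s, 0.0) for s in samples] for f in features]


peaks = [
    ("A", 180.0634, 120.0, 1000.0),
    ("B", 180.0636, 124.0, 1500.0),  # same ion as above, RT shifted by 4 s
    ("A", 255.2320, 300.0, 700.0),
    ("B", 180.0635, 400.0, 90.0),    # same m/z but a different retention time
]
matrix = group_peaks(peaks, samples=["A", "B"])
```

The tight m/z tolerance and the looser retention time tolerance mirror the observation above that m/z varies by only a few ppm across samples while retention time drifts more.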
After peaks are detected and aligned, statistics can be used to
find features showing statistically significant changes in peak
area between phenotypes. However, after peak picking and alignment,
missing values are a frequent scenario, typically
affecting more than 80% of the detected features [9]. This means
that around 80% of the detected features will have missing values
for certain samples, that is, zero peak area or intensity. This occurs
when, for a given group of samples, the ion peak is below the
detection limit. In other cases, the peak is above the
detection limit but the peak picking algorithm has failed to detect
it because of low peak intensity or because the peak is masked by
noise or other coeluting peaks. Missing values reduce the
power of statistical tests and can lead to biased results
[9]. This is of special importance if multivariate analyses are applied.
To tackle this problem, different strategies exist, known as missing
value imputation strategies [10]. Among these, a commonly accepted approach
is known as filling peaks, where “missing” peaks are searched for again
in the raw data. XCMS uses a fill peaks strategy to “fill” the
“missing” peaks in the feature list, and this strategy is detailed
in the following sections.
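To make the notion of "features affected by missing values" concrete, the following sketch counts the fraction of rows in a feature table that have at least one missing entry. The function name and the toy table are hypothetical, and treating both zero and None as missing is an assumption of this example.

```python
def fraction_with_missing(matrix):
    """Fraction of features (rows) with a zero or None area in at least one sample."""
    def is_missing(value):
        return value is None or value == 0

    flagged = sum(1 for row in matrix if any(is_missing(v) for v in row))
    return flagged / len(matrix)


# Toy feature table: rows are features, columns are samples (peak areas).
features = [
    [1200.0, 980.0, 1100.0, 1010.0],
    [0.0, 450.0, 0.0, 520.0],
    [300.0, None, 310.0, 295.0],
    [0.0, 0.0, 75.0, 80.0],
    [50.0, 60.0, 0.0, 55.0],
]
fraction = fraction_with_missing(features)
```

In this toy table four of the five features have at least one missing value, which matches the order of magnitude quoted in the text.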
The following sections explain in detail the different steps of
the XCMS workflow (Fig. 2) which consist of (1) peak picking,
(2) peak grouping and retention time alignment, and (3) fill peaks.
3.1 Peak Picking: Operation and Algorithms
Peak picking aims at finding the chromatographic peaks stemming from molecules eluting from the
chromatographic column. Historically, peak picking has been based on filtering algorithms, which
filter the signal to clear it from noise and thus distinguish the underlying
peaks from the raw signal. There are many different peak picking
algorithms that have been designed, some of which are targeted
3.1.1 Matched Filter
Matched filter is the original peak detection algorithm of XCMS [3]. To detect
peaks, the algorithm first performs a “binning” procedure that
slices the data into bins of 0.1 m/z units (defined by the
parameter step). For each bin, the algorithm then detects any peak
that is above the signal-to-noise threshold defined by the S/N
ratio cutoff parameter (default 10). This peak detection
leverages the typical Gaussian-like shape of chromatographic peaks:
the signal in each bin is filtered with the second derivative of a
Gaussian (the Gaussian itself being essentially a normal distribution
curve), and data points that fit this model well are detected as a peak.
The width of this Gaussian is defined by the full width at half maximum (FWHM)
parameter, a value in seconds. The lower the value of this parameter,
the more likely the algorithm is to detect false positive peaks (noise detected as
a peak). To make sure that peaks are not split between two m/z bins,
the algorithm combines pairs of consecutive m/z bins. This algorithm
was originally designed for data acquired by low-resolution
instruments such as single quadrupole mass spectrometers, where
the best mass accuracy was around 0.1 Da. For current high-resolution
mass spectrometers (HRMS), the centWave algorithm (described below) is
recommended. However, if matched filter is
to be applied to HRMS data, the binning value should be significantly
lowered. The output of the algorithm is a peak list in which
the m/z, RT (with their deviations), and integrated peak intensity
(peak area) are given for each peak.
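The binning and filtering steps described above can be sketched in a few lines. This is a simplified illustration and not the XCMS code: the function names are hypothetical, the FWHM is expressed in scans rather than seconds, only the strongest peak per bin is reported, the noise level is approximated by the median of the raw trace, and the merging of consecutive m/z bins is omitted.

```python
import math
from collections import defaultdict
from statistics import median


def bin_points(points, n_scans, step=0.1):
    """Group (scan, mz, intensity) points into m/z bins of width `step`."""
    bins = defaultdict(lambda: [0.0] * n_scans)
    for scan, mz, intensity in points:
        key = math.floor(mz / step)
        bins[key][scan] = max(bins[key][scan], intensity)
    return dict(bins)


def matched_kernel(fwhm):
    """Negated second derivative of a Gaussian with the given FWHM (in scans)."""
    sigma = fwhm / 2.3548
    half = math.ceil(4 * sigma)
    return [(1 - (t / sigma) ** 2) * math.exp(-(t * t) / (2 * sigma ** 2))
            for t in range(-half, half + 1)]


def apply_filter(signal, kernel):
    """Correlate the signal with the kernel, ignoring points outside the trace."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += signal[idx] * k
        out.append(acc)
    return out


def pick_peak(signal, fwhm=6.0, snthresh=10.0):
    """Return the apex scan of the strongest peak, or None if S/N is too low."""
    filtered = apply_filter(signal, matched_kernel(fwhm))
    apex = max(range(len(filtered)), key=filtered.__getitem__)
    noise = max(median(signal), 1e-9)  # assumption: baseline approximates noise
    return apex if filtered[apex] / noise >= snthresh else None


# Synthetic data: one ion near m/z 200.03 eluting around scan 30.
n_scans = 61
points = [(s, 200.03 + 0.001 * ((s % 3) - 1),
           10.0 + 1000.0 * math.exp(-((s - 30) ** 2) / (2 * 2.548 ** 2)))
          for s in range(n_scans)]
chromatograms = bin_points(points, n_scans)
apexes = {key: pick_peak(trace) for key, trace in chromatograms.items()}
```

Because the kernel sums to approximately zero, a flat baseline produces almost no filter response, whereas a peak whose width matches the FWHM gives a strong one. This is the sense in which the filter is "matched" to the expected peak shape.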
3.1.2 centWave
Published in 2008 [12], centWave is a peak detection algorithm designed for high-resolution,
and consequently high mass accuracy, data. The algorithm consists of two steps. The first step
is a dynamic binning that finds potential areas containing
peaks, known as regions of interest (ROIs). The second step
performs the peak detection within these ROIs using a highly sensitive
wavelet filter (Fig. 3a). The algorithm looks for ROIs consisting of
data points that show low m/z deviations and that increase and then
decrease in intensity (Fig. 3a). The main parameters that control
this behavior are the “ppm” parameter, which sets the m/z span of
the ROIs, and the peak width, which is measured in seconds and
dictates how long the peak should last in chromatographic time.
Once a ROI fulfilling these requirements has been found, it is
analyzed by the wavelet filter. The wavelet used is a Mexican
hat wavelet that models the peak shape and allows for the selection
of multiple closely eluting peaks within the ROI (Fig. 3b). The
scale or height of the peak is modulated until the best fit is achieved.
If the fitting parameters are not satisfied, the ROI is rejected as
a peak. The output of this algorithm is the same as for matched
filter and consists of the m/z, RT (with the deviations of each), and the
integrated intensities.
Fig. 3 The centWave peak detection overview. First, regions of interest (ROI) are detected. For each detected
ROI, a wavelet filter is applied and peaks within the ROI are detected
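The two steps described for centWave can be illustrated with a small sketch: first collect centroids whose m/z stays within a ppm tolerance over consecutive scans (a ROI), then scan a Mexican hat wavelet over the ROI intensities at several scales. This is a loose illustration under stated assumptions, not the published algorithm: the function names, the minimum ROI length, the set of scales, and the normalization by the square root of the scale are all hypothetical choices.

```python
import math


def find_rois(scans, ppm=10.0, min_points=4):
    """Collect runs of centroids whose m/z stays within `ppm` over consecutive scans.

    `scans` is a list (one entry per scan) of lists of (mz, intensity) centroids.
    Each returned ROI is a list of (scan index, mz, intensity) tuples.
    """
    open_rois, closed = [], []
    for scan_idx, centroids in enumerate(scans):
        extended = []
        for mz, intensity in centroids:
            target = None
            for roi in open_rois:
                mean_mz = sum(p[1] for p in roi) / len(roi)
                not_yet_extended = roi[-1][0] < scan_idx
                if not_yet_extended and abs(mz - mean_mz) / mean_mz * 1e6 <= ppm:
                    target = roi
                    break
            if target is None:
                target = []  # start a new ROI
            target.append((scan_idx, mz, intensity))
            extended.append(target)
        # ROIs that received no centroid in this scan are closed.
        closed.extend(r for r in open_rois if r[-1][0] < scan_idx)
        open_rois = extended
    closed.extend(open_rois)
    return [r for r in closed if len(r) >= min_points]


def mexican_hat(t, scale):
    """Mexican hat (Ricker) wavelet evaluated at offset t for a given scale."""
    x = t / scale
    return (1 - x * x) * math.exp(-x * x / 2)


def best_wavelet_fit(intensities, scales=(1, 2, 3, 4)):
    """Return (response, scale, position) of the strongest wavelet response."""
    best = (0.0, None, None)
    n = len(intensities)
    for scale in scales:
        for center in range(n):
            response = sum(intensities[i] * mexican_hat(i - center, scale)
                           for i in range(n)) / math.sqrt(scale)
            if response > best[0]:
                best = (response, scale, center)
    return best


# Synthetic run: one real ion near m/z 300.1 plus one unrelated centroid per scan.
scans = [[(150.0 + 0.5 * s, 20.0),
          (300.1000 + 0.0001 * ((s % 3) - 1),
           1000.0 * math.exp(-((s - 10) ** 2) / (2 * 2.0 ** 2)))]
         for s in range(21)]
rois = find_rois(scans)
response, scale, position = best_wavelet_fit([p[2] for p in rois[0]])
```

The unrelated centroids never stay within the ppm tolerance from one scan to the next, so they form only single-point ROIs and are discarded, while the real ion forms one long ROI whose wavelet response peaks at its apex.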
the peaks have been aligned the “bw” parameter can be greatly
reduced. As an additional robustness filter, there is a parameter
called “minfrac.” This parameter states that for any one class
(e.g., KO for knockout, WT for wild type), at least
a certain fraction of the samples in that class must contain a peak
for the feature to be considered valid. For example, if we have 6 samples
in a KO class and 6 samples in a WT class, then for a particular m/z and RT we need
at least 3 peaks present in either class if “minfrac” is set at
50%. It should be noted that the requirement applies to “any one class” and not to both classes.
Therefore, if KO has 3 peaks and WT has 0, it is still a valid feature.
This parameter can be useful for increasing robustness when we
need to seek peaks in all samples.
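The “minfrac” rule in the example above can be written as a one-line check. The function name and the dictionary layout are hypothetical; only the rule itself (at least one class must reach the required fraction) is taken from the text.

```python
def passes_minfrac(peaks_per_class, class_sizes, minfrac=0.5):
    """True if at least one class has peaks in >= minfrac of its samples."""
    return any(peaks_per_class.get(cls, 0) / size >= minfrac
               for cls, size in class_sizes.items())


sizes = {"KO": 6, "WT": 6}
valid = passes_minfrac({"KO": 3, "WT": 0}, sizes)    # 3 of 6 KO samples: kept
invalid = passes_minfrac({"KO": 2, "WT": 2}, sizes)  # no class reaches 50%: dropped
```

Note that the second feature is rejected even though it has four peaks in total, because neither class on its own reaches the required fraction.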
3.3 Fill Peaks
As mentioned before, two main causes can lead to peaks being
missed. The first is that the peak has not been correctly detected or
aligned by the algorithm; the second is that the peak is under the detection limit.
The fill peaks step will effectively resolve and find missing peaks when
the problem is due to the algorithm (the peak exists in the data but
was not correctly detected or aligned), leveraging the information
from other samples where the peak has been detected (RT and m/z).
In the second case, where the peak is under the detection limit,
the fill peaks step will use the background noise in the area where
the peak would be expected to determine the missing peak value.
Because there may be biological reasons that a peak is missing in
one class of samples but not in the others, the user can define the number
of peaks that need to be found in a single class.
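The fill peaks idea, re-reading the raw data of a sample in the m/z and RT region where other samples showed the peak, can be sketched as follows. This is a simplified illustration and not the XCMS implementation: the function names and data layout are hypothetical, the expected region is taken as the union of the bounds seen in the other samples, and integration is a plain sum of intensities.

```python
def expected_box(detected):
    """Union of the m/z and RT bounds of a peak in the samples where it was found."""
    mz_lo = min(p["mz"][0] for p in detected)
    mz_hi = max(p["mz"][1] for p in detected)
    rt_lo = min(p["rt"][0] for p in detected)
    rt_hi = max(p["rt"][1] for p in detected)
    return (mz_lo, mz_hi), (rt_lo, rt_hi)


def fill_peak(raw_points, detected):
    """Sum raw (rt, mz, intensity) points that fall inside the expected box."""
    (mz_lo, mz_hi), (rt_lo, rt_hi) = expected_box(detected)
    return sum(intensity for rt, mz, intensity in raw_points
               if mz_lo <= mz <= mz_hi and rt_lo <= rt <= rt_hi)


# Bounds of the same feature in two samples where peak picking succeeded.
detected = [
    {"mz": (200.0990, 200.1010), "rt": (58.0, 64.0)},
    {"mz": (200.0992, 200.1012), "rt": (58.5, 64.5)},
]
# Raw data of a third sample in which the peak was not picked.
raw_sample3 = [
    (57.0, 200.1000, 5.0),
    (59.0, 200.1000, 40.0),
    (61.0, 200.1005, 90.0),
    (63.0, 200.0995, 35.0),
    (66.0, 200.1000, 4.0),
    (61.0, 201.2000, 500.0),  # different ion, outside the m/z window
]
filled_area = fill_peak(raw_sample3, detected)
```

If the peak was simply missed by the algorithm, the filled value recovers its signal; if the ion is truly below the detection limit, the same integration returns the local background, which is the behavior described in the text.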
4 XCMS R Package
5.1 Data Streaming with XCMS
Data streaming emulates streaming video platforms, where users
can watch movies directly without waiting until the movie
has been completely downloaded. In the case of LC-MS data, we
typically have to wait until all samples have been acquired and
then upload all the samples to the XCMS cloud-based system. However,
XCMS Online allows LC-MS data to be streamed [17]. By installing
a program on their instrument computer and logging in to XCMS
Online, users can stream their data directly to the cloud and start
processing the data while the remaining samples are being
acquired. The parameters for the job can be chosen beforehand.
The streaming process dramatically reduces the overall dead time
from acquisition to processing.
5.2 MISA: METLIN-Guided In-Source Annotation
XCMS Online also integrates an algorithm for advanced metabolite
annotation, known as the METLIN-guided in-source annotation
(MISA) algorithm [18]. This algorithm aims at using the information
from in-source fragments (ISF) naturally occurring in MS1
data to provide a robust peak and metabolite annotation. We
5.3 XCMS-Guided Systems Biology
A systems-level analysis can provide important insights for understanding
biochemical mechanisms. XCMS Online enables metabolomics
data to be projected onto metabolic pathways and integrated
with transcriptomics and proteomics data [19].
To project quantitative metabolic data onto metabolic networks,
metabolites first need to be identified. Metabolite identification
is a process that heavily relies on manual labor and expert
curation. XCMS Online uses an automated predictive pathway
analysis method, developed by Li et al. and known as Mummichog
[20], that bypasses metabolite identification and instead uses
biochemical information to annotate features and project them onto
metabolic pathways. Scientists can then interpret the data,
formulate biochemical mechanisms and hypotheses, and then confirm
these identifications via tandem MS analysis. More details of this
algorithm can be found in the original study [20] and Chapter 19
of this book.
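As a generic illustration of how tentatively annotated features can point to a pathway, the sketch below computes a one-sided hypergeometric over-representation p-value. It is a textbook enrichment test shown under stated assumptions, not the Mummichog implementation, which is described in the original study [20]; the function name and the numbers in the example are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(k, K, n, N):
    """P(X >= k) when n compounds are drawn from N, of which K are in the pathway.

    k: significant compounds that fall in the pathway
    K: pathway size, n: number of significant compounds, N: all compounds
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 3 of 3 significant compounds hit a 3-compound pathway
# out of a background of 10 compounds.
p_value = overrepresentation_pvalue(k=3, K=3, n=3, N=10)
```

A small p-value means that more significant compounds fall in the pathway than expected by chance, which is the kind of evidence that can then be followed up by tandem MS confirmation.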
In that sense, in XCMS Online-guided systems biology, users
can directly map their results onto metabolic networks without
transferring data among different applications. Users can
upload gene and protein data to overlay them on the pathways.
Results are shown as a table and also as an interactive Pathway
Cloud plot. The Pathway Cloud plot shows dysregulated pathways,
ordering them by percentage overlap with other omics data and by
statistical significance. To date, there are over 7600 metabolic
models available for pathway analysis from BioCyc v19.5–20.0.
6 Conclusions
Acknowledgments
References
1. Patti GJ, Yanes O, Siuzdak G (2012) Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol 13(4):263–269. https://doi.org/10.1038/nrm3314
2. Vinaixa M, Samino S, Saez I, Duran J, Guinovart JJ, Yanes O (2012) A guideline to univariate statistical analysis for LC/MS-based untargeted metabolomics-derived data. Metabolites 2(4):775–795. https://doi.org/10.3390/metabo2040775
3. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787. https://doi.org/10.1021/ac051437y
4. Mikko K, Miettinen J, Oresic M (2006) MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22(5):634–636. https://doi.org/10.1093/bioinformatics/btk039
5. Tomás P, Castillo S, Villar-Briones A, Oresic M (2010) MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11:395. https://doi.org/10.1186/1471-2105-11-395
6. Domingo-Almenara X, Montenegro-Burke JR, Benton PH, Siuzdak G (2018) Annotation: a computational solution for streamlining metabolomics analysis. Anal Chem 90(1):480–489. https://doi.org/10.1021/acs.analchem.7b03929
7. Mahieu NG, Patti GJ (2017) Systems-level annotation of a metabolomics data set reduces 25 000 features to fewer than 1000 unique metabolites. Anal Chem 89(19):10397–10406. https://doi.org/10.1021/acs.analchem.7b02380
8. Lin W, Xing X, Chen L, Yang L, Su X, Rabitz H, Lu W, Rabinowitz JD (2019) Peak annotation and verification engine for untargeted LC–MS metabolomics. Anal Chem 91(3):1838–1846. https://doi.org/10.1021/acs.analchem.8b03132
9. Trinh DK, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K et al (2018) Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14(10). https://doi.org/10.1007/s11306-018-1420-2
10. Runmin W, Wang J, Su M, Jia E, Chen S, Chen T, Ni Y (2018) Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 8. https://doi.org/10.1038/s41598-017-19120-0
11. Vereyken L, Dillen L, Vreeken RJ, Cuyckens F (2019) High-resolution mass spectrometry quantification: impact of differences in data processing of centroid and continuum data. J Am Soc Mass Spectrom 30(2):203–212. https://doi.org/10.1007/s13361-018-2101-0
12. Tautenhahn R, Böttcher C, Neumann S (2008) Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 9:504. https://doi.org/10.1186/1471-2105-9-504
13. Prince JT, Marcotte EM (2006) Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Anal Chem 78(17):6140–6152. https://doi.org/10.1021/ac0605344
14. Benton HP, Wong DM, Trauger SA, Siuzdak G (2008) XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal Chem 80(16):6382–6389. https://doi.org/10.1021/ac800795f
15. Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G (2012) XCMS online: a web-based platform to process untargeted metabolomic data. Anal Chem 84(11):5035–5039. https://doi.org/10.1021/ac300698c
16. Domingo-Almenara X, Montenegro-Burke JR, Ivanisevic J, Thomas A, Sidibé J, Teav T, Guijas C et al (2018) XCMS-MRM and METLIN-MRM: a cloud library and public resource for targeted analysis of small molecules. Nat Methods 15(9):681–684. https://doi.org/10.1038/s41592-018-0110-3
17. Montenegro-Burke JR, Aisporna AE, Benton HP, Rinehart D, Fang M, Huan T, Warth B et al (2017) Data streaming for metabolomics: accelerating data processing and analysis from days to minutes. Anal Chem 89(2):1254–1259. https://doi.org/10.1021/acs.analchem.6b03890
18. Domingo-Almenara X, Montenegro-Burke JR, Guijas C, Majumder EL-W, Benton HP, Siuzdak G (2019) Autonomous METLIN-guided in-source fragment annotation for untargeted metabolomics. Anal Chem 91(5):3246–3253. https://doi.org/10.1021/acs.analchem.8b03126
19. Tao H, Forsberg EM, Rinehart D, Johnson throughput metabolomics. PLoS Comput Biol
CH, Ivanisevic J, Paul Benton H, Fang M 9(7). https://doi.org/10.1371/journal.pcbi.
et al (2017) Systems biology guided by 1003123
XCMS online metabolomics. Nat Methods 14 21. Vinzenz L, Picotti P, Domon B, Aebersold R
(5):461–462. https://doi.org/10.1038/ (2008) Selected reaction monitoring for quan-
nmeth.4260 titative proteomics: a tutorial. Mol Syst Biol
20. Li S, Park Y, Duraisingham S, Strobel FH, 4:222. https://doi.org/10.1038/msb.2008.
Khan N, Soltow QA, Jones DP, Pulendran B 61
(2013) Predicting network activity from high
Chapter 3
Abstract
The informatics pipeline for making sense of untargeted LC–MS or GC–MS data starts with preprocessing
the raw data. Results from data preprocessing undergo statistical analysis and are subsequently mapped to
metabolic pathways to place untargeted metabolomics data in the biological context. ADAP is a suite of
computational algorithms that has been developed specifically for preprocessing LC–MS and GC–MS data.
It consists of two separate computational workflows that extract compound-relevant information from raw
LC–MS and GC–MS data, respectively. Computational steps include construction of extracted ion chro-
matograms, detection of chromatographic peaks, spectral deconvolution, and alignment. The two work-
flows have been incorporated into the cross-platform and graphical MZmine 2 framework and ADAP-
specific graphical user interfaces have been developed for using ADAP with ease. This chapter summarizes
the algorithmic principles underlying key steps in the two workflows and illustrates how to apply ADAP to
preprocess LC–MS and GC–MS data.
Key words ADAP, MZmine 2, Metabolomics, LC–MS, GC–MS, Data preprocessing, Peak picking,
Alignment, Spectral deconvolution, Visualization
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_3, © Springer Science+Business Media, LLC, part of Springer Nature 2020
26 Xiuxia Du et al.
system (Fig. 1). As the first step of this informatics pipeline, data
preprocessing is critical for the success of a metabolomics study
because preprocessing errors can propagate downstream into spu-
rious or missing compound identifications and cause misinterpreta-
tion of the metabolome. Data preprocessing workflows (Fig. 2) in
open-source software tools generally consist of four sequential
steps after masses have been detected from profile mass spectra
(i.e., converting mass spectra from profile to centroid format).
These four steps are construction of extracted ion chromatograms
(EIC), detection of chromatographic peaks from EICs, peak
grouping for LC–MS data or spectral deconvolution for GC–MS
data, and alignment. ADAP (Automated Data Analysis Pipeline) is
one such open-source workflow that has been developed for pre-
processing both GC–MS and LC–MS data and incorporated into
the MZmine 2 framework [1].
In the sections below, we first briefly describe the evolution of
ADAP and subsequently focus on describing how to carry out
LC–MS and GC–MS preprocessing workflows using primarily
ADAP modules in the MZmine 2 framework. These modules
include EIC construction, chromatographic peak detection, spec-
tral deconvolution, and alignment. To facilitate describing how to
specify relevant parameters, we provide a brief summary of the
algorithmic principles underlying these ADAP modules. To make
the workflow complete and this chapter self-contained, we provide
information on how to carry out other essential tasks using
non-ADAP modules in MZmine 2. These tasks include: (1) import
raw LC–MS and GC–MS data files into MZmine 2 and inspect
them, (2) detect masses from profile mass spectra (step 1 in Fig. 2),
(3) group chromatographic peaks (step 4 in Fig. 2) detected in LC–
MS data, and (4) export preprocessing results for downstream
statistical analysis, metabolite identification, and -omics data
integration.
Fig. 1 Informatics pipeline for making sense of untargeted GC–MS and LC–MS
metabolomics data
[Fig. 2 workflow steps: (1) detect masses from mass spectra; (2) construct EICs; (3) detect chromatographic peaks; (4) deconvolution (GC-MS); (5) alignment / correspondence; (6) database search]
Fig. 2 General computational workflows for preprocessing GC–MS and LC–MS data in existing open-source
software tools. EIC stands for extracted ion chromatogram
Metabolomics Data Preprocessing Using ADAP and MZmine 2 27
2 Evolution of ADAP
3 Install MZmine 2
4.1 Import and Inspection of Raw Data Files
Raw data files are imported into MZmine 2 using the Raw data methods drop-down menu as shown in Fig. 3.
Acceptable file formats include mzXML and netCDF. One of the greatest strengths of MZmine 2 lies in the
rich built-in visualization functions that allow users to inspect the raw data, which greatly
Fig. 3 Import raw data into MZmine 2. (a) Raw data import is in the Raw
data methods drop-down menu. (b) Imported raw data files are listed under
Raw data files on the left
Fig. 4 Capabilities of MZmine 2 that allow users to inspect the raw spectra. (a)
List of spectra in a raw data file and the spectra metadata. (b) Double-clicking
any spectrum opens a separate window displaying it
4.2 Detect Masses from Profile Mass Spectra
The mass detection step detects mass centroids from profile mass
spectra. MZmine 2 provides five centroiding methods: Centroid,
Exact mass, Local maxima, Recursive threshold, and Wavelet
transform. The Centroid mass detector is
for spectra that have already been centroided, and the other four detectors
are for profile mass spectra only. The Exact mass detector is
suitable for high-resolution MS data, such as that provided by FTMS
instruments. The Local maxima mass detector simply detects all
local maxima within a spectrum, except those signals below the
specified noise level. The Recursive threshold mass detector is
suitable for data that has too much noise for the Exact mass
detector to be used. The Wavelet transform mass detector is
suitable for both high-resolution and low-resolution data. It uses
the Ricker wavelet (also called the Mexican Hat wavelet) and carries out
a continuous wavelet transform (CWT) of the continuous profile
spectra.
This Wavelet transform mass detector provides a sensitive
and robust way to detect masses (Fig. 7) and we describe it in more
detail herein. It requires users to set three parameters: noise level,
scale level, and wavelet window size. Noise level specifies the mini-
mum intensity level for a data point to be considered part of a
spectrum. All data points below this intensity level are ignored.
Scale level is the scale factor that either dilates or compresses the
wavelet signal. When it is small (e.g., below 10), the Ricker wavelet
is more contracted, which in turn results in more noise peaks being
detected. Wavelet window size (%) is the size of the window
used to calculate the wavelet signal. When the window is
small, more noise peaks can be detected. Among the three parameters,
scale level in particular can have a large impact on mass
detection.
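The roles of the three parameters can be illustrated with a minimal sketch of CWT-based centroiding. This is not the MZmine 2 implementation: the function names, the expression of the window as a multiple of the scale, and the rule of reporting every positive local maximum of the coefficients are assumptions made for illustration only.

```python
import math

def ricker(t, scale):
    """Ricker (Mexican Hat) wavelet evaluated at offset t, in data points."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def cwt_at_scale(intensity, scale, half_width):
    """Inner product of the signal with a Ricker wavelet centred on each point."""
    n = len(intensity)
    coeffs = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        coeffs.append(sum(intensity[k] * ricker(k - i, scale) for k in range(lo, hi)))
    return coeffs

def detect_centroids(mz, intensity, noise_level, scale, window=5.0):
    """Return (m/z, intensity) pairs at positive local maxima of the CWT.

    `window` is a hypothetical parameter: the wavelet is truncated at
    window * scale data points on each side of its centre.
    """
    # Data points below the noise level are ignored.
    y = [v if v >= noise_level else 0.0 for v in intensity]
    half_width = max(1, int(window * scale))
    c = cwt_at_scale(y, scale, half_width)
    return [(mz[i], y[i]) for i in range(1, len(c) - 1)
            if c[i] > 0 and c[i] > c[i - 1] and c[i] >= c[i + 1]]
```

In this sketch a smaller `scale` narrows the wavelet so that narrow spikes produce their own maxima, while a larger `scale` smooths them away, which mirrors the trade-off described above.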
When the scale level is small, a significant number of very
narrow noise peaks can be detected. They are passed to the
subsequent EIC construction and can form false EIC peaks. As
the scale level increases, the number of detected noise peaks
decreases. However, a larger scale level could cause a noticeable
shift in the centroid m/z values. Figure 8c, d depicts the m/z values
Fig. 5 Inspect BPCs. (a, b) Display BPCs by using the TIC/XIC visual-
izer in the visualization drop-down menu. (c) BPCs of 12 data files
Fig. 7 Mass detection in MZmine 2. (a) The Mass detection method can be accessed via the Raw data
methods drop-down menu. (b) Wavelet transform is one of the mass detection methods. Clicking the
button indicated by the red arrow opens the window in (c) for specifying parameters. (c) User-defined
parameters for the wavelet transform method. Checking Show preview opens the preview
window. The effect of parameter changes is displayed almost immediately, which greatly facilitates specifying
parameters. The inset shows the profile mass peak in blue and the detected mass centroid in red
detected from consecutive scans when scale levels are set at 5 and
15, respectively. Compared to the m/z values detected at scale level
equal to 5, most of the m/z values detected at scale level 15 are
larger. When the final representative m/z for a chromatographic
peak is calculated as the weighted average of all of the centroid m/z
values along the EIC as shown in Fig. 8b, the difference in the final
representative m/z values between using scale level = 5 and scale
level = 15 is 19 ppm. This difference in the mass values is big
enough to cause different compounds to be eventually identified.
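The arithmetic behind such a comparison can be sketched as follows. The use of intensity as the weight and the function names are assumptions for illustration; the values in the usage note are made up and are not the data behind Fig. 8.

```python
def weighted_mz(mz_values, intensities):
    """Intensity-weighted average of the centroid m/z values along an EIC.

    Weighting by intensity is an assumption; the text only states that a
    weighted average is used.
    """
    total = sum(intensities)
    return sum(m * i for m, i in zip(mz_values, intensities)) / total

def ppm_difference(mz_a, mz_b):
    """Relative difference between two m/z values in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6
```

For example, `ppm_difference(566.34, 566.33)` is a little under 18 ppm, so at this m/z a shift of roughly 0.01 in the representative mass is already comparable to the 19 ppm difference reported above.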
Regardless of which of the mass detectors is used, the results of
mass detection for a particular profile mass spectrum can be
accessed by clicking masses under the profile mass spectrum
(Fig. 9). It is relevant to note that mass detection can also be carried
out by using msConvert that is part of ProteoWizard [8]. msConvert
detects masses by either using a CWT-based method or calling
[Fig. 8 plot area: (a) intensity vs. m/z; (b) intensity vs. rt; (c, d) m/z vs. rt]
Fig. 8 Differences in the resulting mass values caused by different scale levels when using the wavelet
transform-based mass detection in MZmine 2. (a) One of the consecutive mass spectra from which the mass
indicated by red arrow is to be detected. (b) The EIC of the mass. The mass values of the blue dots along the
elution profile are depicted in (c) and (d) with scale level being 5 and 15, respectively
Fig. 9 Examine mass detection results for a particular profile mass spectrum.
Vertical lines in green indicate mass values that have been detected
4.3 Construct EICs by ADAP
In untargeted metabolomics, the masses of ion species that have
been detected by a mass analyzer are unknown prior to data preprocessing.
It is up to the EIC construction step to determine them.
With mass centroids detected from profile mass spectra, construc-
tion of EICs can begin. Figure 10 shows how to carry out this step
using ADAP. ADAP examines all of the data points in the entire
data file and works from the largest intensity data point down to the
smallest. As a result, a list of ions is produced that have been
detected by the mass analyzer over a continuous retention time
period. This approach in constructing EICs is in contrast to the
EIC construction process in other open-source software tools such
as XCMS where EICs are built chronologically in retention time.
The advantage of starting an EIC from the highest intensity point
among all of the data points belonging to this EIC is that the
reference mass for the EIC has the highest possible mass measure-
ment accuracy. This is particularly important for TOF-type mass
analyzers whose mass measurement accuracy tends to be higher for
more intense signals.
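The intensity-ordered strategy described above can be sketched as follows. This is a simplified illustration rather than the ADAP Chromatogram builder itself: the data layout, the fixed m/z tolerance, and the `min_consecutive` filter are assumptions.

```python
def longest_run(scans):
    """Length of the longest run of consecutive scan indices in a sorted list."""
    best = run = 1
    for prev, cur in zip(scans, scans[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

def build_eics(points, mz_tol, min_consecutive=3):
    """Group (scan_index, mz, intensity) points into EICs.

    Points are visited from the most intense to the least intense, so the
    first point that opens an EIC (its reference m/z) is always the most
    intense point of that EIC.
    """
    eics = []  # each EIC: {"ref_mz": float, "points": {scan_index: (mz, intensity)}}
    for scan, mz, inten in sorted(points, key=lambda p: p[2], reverse=True):
        for eic in eics:
            if abs(mz - eic["ref_mz"]) <= mz_tol and scan not in eic["points"]:
                eic["points"][scan] = (mz, inten)
                break
        else:
            # No existing EIC matches: this point starts a new one.
            eics.append({"ref_mz": mz, "points": {scan: (mz, inten)}})
    # Keep only EICs observed over enough consecutive scans.
    return [e for e in eics if longest_run(sorted(e["points"])) >= min_consecutive]
```

Because the reference m/z comes from the most intense point, the tolerance window is centred on the measurement that is expected to be the most accurate, which is the design choice the text motivates for TOF-type analyzers.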
4.4 Detect Chromatographic Peaks by ADAP
After EICs have been constructed, ADAP detects chromatographic
peaks from each of these EICs using the continuous wavelet transform
(CWT) that is similar to what the wavelet transform mass
detector uses. Specifically, wavelet coefficients are first calculated as
the inner product between the EIC and the Ricker wavelets at
different wavelet scales and locations. Subsequently, peak location
Fig. 10 Construction of EICs in ADAP. (a) EIC construction is achieved by using the ADAP Chromatogram
builder method. (b) Parameters relevant to this method. (c) A list of EICs is produced for each data file
Fig. 11 Examine EICs. (a) Select a particular EIC. (b) Double-clicking the selected EIC opens this window. Selecting
Chromatogram and clicking Show opens the EIC in (c) for visual examination
Fig. 13 Alignment. (a) Alignment methods can be accessed through the Peak
list methods drop-down menu. (b) It is strongly recommended to use the
preview function for specifying parameters. (c) After alignment, an Aligned
peak list is produced and can be exported
Fig. 14 Visualization and export of the aligned peak list. (a) Double clicking the
Aligned peak list opens up the list of peaks for visual examination. (b)
The aligned peak list can be exported in .csv, MetaboAnalyst, or other format
5.1 Spectral Deconvolution
The most recent version of the ADAP-GC spectral deconvolution
algorithm is 3.2 [5]. The algorithm starts with automated determination
of deconvolution windows. For each deconvolution window,
a sequence of four computational steps is carried out
including: (1) two rounds of hierarchical clustering for estimating
the number of compounds in the window, (2) selection of the
sharpest and unique chromatographic peaks as the model peaks,
(3) construction of pure mass spectrum for each compound, and
(4) correction of splitting issues. Figure 15 shows how to access
ADAP-GC 3.2 in MZmine 2 and lists the user-defined parameters.
Similar to ADAP peak detection described earlier, it is strongly
recommended that users use the Show preview function to make
informed decisions about the parameters (Fig. 15b). After spectral
deconvolution completes, a list of pure mass spectra is produced for
each data file (Fig. 16).
5.2 Alignment
GC–MS samples are aligned by finding the same compounds across
the data files based on spectral similarity and retention time prox-
imity. Specifically, a score is calculated as follows to measure the
likelihood that two spectra, c1 and c2, correspond to the same
compound:
\[ \mathrm{Score}(c_1, c_2) = w\, S_{\mathrm{time}}(c_1, c_2) + (1 - w)\, S_{\mathrm{spec}}(c_1, c_2), \tag{1} \]
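Equation 1 can be sketched in code as a weighted sum of a retention time term and a spectral term. The forms chosen here for the two terms are assumptions, since their definitions are not given in this excerpt: the spectral term is taken to be a cosine similarity, and the time term is taken to fall linearly to zero at a hypothetical `max_shift`.

```python
import math

def spectral_similarity(spec_a, spec_b):
    """Cosine similarity of two spectra given as {m/z: intensity} dicts (assumed S_spec)."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[m] * spec_b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def time_similarity(rt_a, rt_b, max_shift):
    """Retention time proximity in [0, 1] (assumed S_time)."""
    return max(0.0, 1.0 - abs(rt_a - rt_b) / max_shift)

def alignment_score(rt_a, spec_a, rt_b, spec_b, w=0.1, max_shift=0.5):
    """Score(c1, c2) = w * S_time + (1 - w) * S_spec, as in Eq. 1."""
    s_time = time_similarity(rt_a, rt_b, max_shift)
    s_spec = spectral_similarity(spec_a, spec_b)
    return w * s_time + (1.0 - w) * s_spec
```

With both terms bounded between 0 and 1, the weight `w` directly controls how much a retention time shift can offset a good spectral match; the default of 0.1 is illustrative only.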
Fig. 16 Examine spectral deconvolution results. (a) Expand the list of mass spectra that has been produced for
each data file. (b) Each mass spectrum can be examined by double clicking it to open up a window. Select the
data file and Mass spectrum and click Show to open the window in (c). The constructed pure spectrum is
shown in green in the context of the raw spectrum shown in blue
Fig. 17 GC alignment. (a) The Alignment method can be accessed in the drop-down menu of Peak
list methods → Alignment → ADAP Aligner. (b) Specify parameters whose meaning can be
found in the Help file. (c) The aligner produces the Aligned peak list
Fig. 18 Export GC–MS spectra. (a) Select the Aligned peak list and
specify the export file name. (b) An example of the exported spectra produced by
the spectral deconvolution. (c) Exported GC–MS spectra are stored in an .msp
file
5.3 Export of GC–MS Preprocessing Results
The pure mass spectra that the spectral deconvolution step has
constructed can be exported in .msp or .mgf format for matching
the spectra against spectral libraries for compound identification or
annotation. Figure 18 shows the procedure. The resulting .msp file
can be directly imported to the NIST MS Search software tool for
compound identification or annotation.
6 Conclusions
Acknowledgements
References
1. Pluskal T, Castillo S, Villar-Briones A, Oresic M (2010) MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinf 11:395
2. Jiang W, Qiu Y, Ni Y, Su M, Jia W, Du X (2010) An automated data analysis pipeline for GC-TOF-MS metabonomics studies. J Proteome Res 9(11):5974–5981
3. Ni Y, Qiu Y, Jiang W, Suttlemyre K, Su M, Zhang W, Jia W, Du X (2012) ADAP-GC 2.0: deconvolution of coeluting metabolites from GC/TOF-MS data for metabolomics studies. Anal Chem 84(15):6619–6629
Abstract
This chapter describes the open-source tool suite OpenMS. OpenMS contains more than 180 tools which
can be combined to build complex and flexible data-processing workflows. The broad range of functionality
and the interoperability of these tools enable complex, complete, and reproducible data analysis workflows
in computational proteomics and metabolomics. We introduce the key concepts of OpenMS and illustrate
its capabilities with a complete workflow for the analysis of untargeted metabolomics data, including
metabolite quantification and identification.
1 Introduction
1.1 OpenMS
Mass spectrometry workflows can be highly complex, due to many
different separation techniques, acquisition strategies, quantification,
and identification methods. As a solution, we present OpenMS, an
open-source C++ library for LC-MS data management that offers a
wide range of algorithms for proteomics and metabolomics data
analysis. It is available under the permissive 3-clause BSD license
for Windows, macOS, and Linux (www.openms.de) [1, 2]. OpenMS
is composed of a set of individual tools that can be combined in
a highly flexible manner allowing its application from standard opera-
tions to newly developed methods. The software is interesting
for both users and developers since the library can be used as a
foundation for tool and algorithm development (see Note 1). Fast
scripting and prototyping is supported via Python bindings (pyO-
penMS) [3]. Command line tools are available for high-throughput
processing, which is especially useful on cluster or cloud infrastruc-
tures. Additionally, OpenMS is integrated into the workflow engine
Konstanz Information Miner (KNIME), allowing for the visual
construction of data analysis workflows [4, 5] (see Note 2). In the
following, we will present the application of OpenMS and KNIME
for the quantification and identification of metabolomics data.
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_4, © Springer Science+Business Media, LLC, part of Springer Nature 2020
50 Marc Rurik et al.
1.2 Data Preparation
A multitude of vendor-specific, proprietary file formats exist for
LC-MS data. In order to process this data with OpenMS it needs
to be converted to the open HUPO-PSI standard format mzML
[6]. Data conversion from the vendor-specific formats to mzML
can be performed using the free software MSConvertGUI included
in ProteoWizard [7]. Additionally, if the data was acquired in
profile mode, centroiding is necessary as almost all OpenMS algo-
rithms require data to be in centroid mode. This step can be
performed during the data conversion by using the peakPicking
filter provided by ProteoWizard. Alternatively, the OpenMS tool
PeakPickerHiRes can be used as the first step of the analysis
pipeline.
1.3 Data Visualization Using TOPPView
Data visualization is a common starting point for many studies. It
allows additional insight into the data, can act as a quality control
step and help with the optimization of the tools’ parameter values.
In addition, visualization can be used to inspect intermediate or
final results throughout the data analysis process. To this end,
OpenMS provides TOPPView [8], a graphical application for the
interactive visualization of raw data and analysis results. It supports
the visualization of individual spectra and entire LC-MS maps in
2D and 3D representations. Results of different OpenMS tools
such as the location and extent of detected features (see Subheading
2.1) can be inspected in conjunction with the raw data. An example
for this is shown in Fig. 1.
Fig. 1 LC-MS sample visualized in TOPPView in 2D and 3D with color-coded intensities. Two features that
were detected with FeatureFinderMetabo (see Subheading 2.1) are shown as rectangles representing their
location and extent
Metabolomics Data Processing Using OpenMS 51
Fig. 2 A short KNIME workflow that detects and quantifies metabolites in a single LC-MS sample and exports
the results to an Excel file. Note that some nodes process entire files (square ports) while other nodes process
Excel-like tables (triangle ports)
Fig. 3 FeatureFinderMetabo feature detection process: (a) Mass trace detection starts at the peaks of highest
intensity (marked by a blue square) and extends in both directions. (b) Split the mass trace at the local
minimum if it contains more than one compound. (c) Test whether co-eluting mass traces are isotopic traces
of the same compound. (d) Assemble highest scoring feature hypotheses. The final features are characterized
by centroid m/z, retention time, charge, and intensity. A visualization of a feature in TOPPView is shown in
Fig. 1
2.2 Adduct Grouping
The previous feature finding step aims to detect and group all
isotopic traces of a compound ion into one feature represented by
centroid m/z, retention time, and intensity. However, a compound
might be detected multiple times in the same sample, with different
adduct species or common charge neutral modifications.
2.3 Map Alignment
After feature detection, an optional but oftentimes appropriate
processing step is to perform chromatographic alignment of the
samples with each other. Here, the goal is to compensate for
commonly occurring constant or linear chromatography-related
distortions between samples.
This can be done with the MapAlignerPoseClustering tool,
which reduces feature elution time differences between sample
maps. To achieve this, the method aims to find affine retention
time transformations between samples that minimize the overall
time shifts of corresponding features. As a consequence,
subsequent aggregation of respective compound ion features
found in multiple measurements is more straightforward. To search
the solution space of possible time shifts and scaling distortions
(i.e., compression or extension of a sample's chromatography) efficiently,
pose clustering is performed [11]. The method aligns the
features of multiple maps by using one as the reference to which the
other maps are aligned, akin to matching star charts in astronomy.
It determines the optimal linear transformation for each pair of
maps and transforms the feature retention times accordingly (see
Fig. 4a, b).
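The linear transformation step can be illustrated with a minimal sketch. It does not implement pose clustering: it assumes that pairs of corresponding features between a sample map and the reference map are already known, and it estimates the scaling and shift by ordinary least squares, which is a substitute for the method used by MapAlignerPoseClustering. The function names are hypothetical.

```python
def fit_affine(rt_sample, rt_reference):
    """Least-squares fit of rt_reference ≈ slope * rt_sample + intercept."""
    n = len(rt_sample)
    mean_x = sum(rt_sample) / n
    mean_y = sum(rt_reference) / n
    sxx = sum((x - mean_x) ** 2 for x in rt_sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rt_sample, rt_reference))
    slope = sxy / sxx          # compression or extension of the chromatography
    intercept = mean_y - slope * mean_x  # constant time shift
    return slope, intercept

def apply_affine(rts, slope, intercept):
    """Map retention times of a sample onto the reference time axis."""
    return [slope * rt + intercept for rt in rts]
```

After the transformation, corresponding features in different maps have similar retention times, so the later linking step can use a narrower retention time tolerance.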
Fig. 4 Map alignment and feature linking are used to combine information across
multiple samples. (a) Features have been quantified in all samples individually
using FeatureFinderMetabo. (b) MapAlignerPoseClustering applies a linear
transformation to address potential retention time shifts. (c) The retention
time-aligned features can then be grouped using FeatureLinkerUnlabeledQT
(see Subheading 2.4)
3.1 Accurate Mass Search
The foundation of the AccurateMassSearch (AMS) tool is a database
of known metabolites and their exact neutral masses. OpenMS
provides access to the Human Metabolome Database (HMDB)
[13], but generally any compound database can be used. In addi-
tion to such a database, a list of potential adducts has to be consid-
ered to arrive at the neutral masses of the compounds. OpenMS
provides two adduct lists for positive and negative polarity, which
can be edited depending on the experimental setup, e.g., which
buffers were used. Ideally this step should use the same set of
adducts as during adduct grouping.
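The principle of an accurate mass search can be sketched as follows. This is not the AccurateMassSearch implementation: the function name, the two-entry adduct table (singly charged positive adducts only), the ppm tolerance, and the tiny example database are assumptions for illustration.

```python
# Mass shifts (Da) added to the neutral mass by two common singly charged
# positive-mode adducts; an illustrative subset, not the OpenMS adduct list.
ADDUCT_SHIFTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}

def accurate_mass_search(mz, database, adducts=ADDUCT_SHIFTS, tol_ppm=5.0):
    """Return (name, adduct, error_ppm) for every database entry that matches.

    For each candidate adduct the observed m/z is converted to a neutral
    mass, which is then compared with the neutral masses in the database.
    """
    hits = []
    for adduct, shift in adducts.items():
        neutral = mz - shift
        for name, mass in database.items():
            error_ppm = (neutral - mass) / mass * 1e6
            if abs(error_ppm) <= tol_ppm:
                hits.append((name, adduct, error_ppm))
    return hits
```

For instance, with a hypothetical database `{"glucose": 180.06339, "alanine": 89.04768}`, an observed m/z of 181.070666 matches glucose as `[M+H]+`. In a real database several compounds usually share a neutral mass within the tolerance, so more than one hit per feature is expected.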
3.2 Spectral Library Search
While accurate mass search can be used to obtain a general idea
which compounds may be present in the sample, it will often
provide multiple putative identifications. To arrive at more confi-
dent identifications, it is necessary to fragment the compounds and
Fig. 6 KNIME workflow to identify metabolites by searching all tandem mass spectra against a spectral library,
in this case MassBank
where $I_i$ and $I'_i$ are the intensities of all n matched peaks. MetaboliteSpectralMatcher
can use any public or in-house spectral library
in the mzML format. By default, OpenMS provides MassBank, the
most comprehensive publicly available spectral library [15].
3.3 De Novo Identification
For de novo identification, SIRIUS [16, 17] and CSI:FingerID
[18] are integrated in the OpenMS tool SiriusAdapter (see Fig. 7).
A preprocessing step is performed using the OpenMS framework to
optimize and convert the input data to a SIRIUS compatible for-
mat. SiriusAdapter can then perform identification of features at
MS2 level, i.e., molecular formulas are assigned de novo. Subse-
quently, the feature can be searched in a molecular structure data-
base. SiriusAdapter uses information from both MS and MS/MS
spectra (mzML) and an optional featureXML. The tool provides
different modes, depending on the input and output configuration.
Preprocessing can be applied if additional feature information is
present, which is used to map all MS2 spectra to their
corresponding features. SIRIUS will internally merge and jointly
process all MS2 spectra allocated to the same feature. To reduce the
feature space further, a mass trace filter (number of isotopic traces)
can be applied. Additionally, adduct information can be provided
using a featureXML processed by the MetaboliteAdductDecharger
or AccurateMassSearch. Depending on the workflow, SiriusAdapter
can be used either for preprocessing only (output port SIRIUS.ms)
or full data processing (output port SIRIUS.mzTab), see Fig. 7.
Fig. 7 Workflow for metabolite quantification using FeatureFinderMetabo and de novo identification using
SiriusAdapter
5 Notes
Fig. 8 Basic OpenMS workflow for metabolite quantification and identification in KNIME
References
1. Röst HL, Sachsenberg T, Aiche S, Bielow C, Weisser H, Aicheler F, Andreotti S, Ehrlich H-c, Gutenbrunner P, Kenar E, Liang X, Nahnsen S, Nilse L, Pfeuffer J, Rosenberger G, Rurik M, Schmitt U, Veit J, Walzer M, Wojnar D, Wolski WE, Schilling O, Choudhary JS, Malmström L, Aebersold R, Reinert K, Kohlbacher O (2016) OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 13:741–748
2. Pfeuffer J, Sachsenberg T, Alka O, Walzer M, Fillbrunn A, Nilse L, Schilling O, Reinert K, Kohlbacher O (2017) OpenMS - a platform for reproducible analysis of mass spectrometry data. J Biotechnol 261(February):142–148
3. Röst HL, Schmitt U, Aebersold R, Malmström L (2014) pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14(1):74–77
4. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2007) KNIME: the Konstanz Information Miner. In: Studies in classification, data analysis, and knowledge organization (GfKL 2007). Springer, Berlin
5. Fillbrunn A, Dietz C, Pfeuffer J, Rahn R, Landrum GA, Berthold MR (2017) KNIME for reproducible cross-domain analysis of life science data. J Biotechnol 261(February):149–156
6. Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Römpp A, Neumann S, Pizarro AD et al (2011) mzML—a community standard for mass spectrometry data. Mol Cell Proteomics 10(1):R110.000133
7. Kessner D, Chambers M, Burke R, Agus D, Mallick P (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24(21):2534–2536
8. Sturm M, Kohlbacher O (2009) TOPPView: an open-source viewer for mass spectrometry data. J Proteome Res 8(7):3760–3763
9. Kenar E, Franken H, Forcisi S, Wörmann K, Häring H-U, Lehmann R, Schmitt-Kopplin P, Zell A, Kohlbacher O (2014) Automated label-free quantification of metabolites from liquid chromatography–mass spectrometry data. Mol Cell Proteomics 13:348–359
10. Bielow C, Ruzek S, Huber CG, Reinert K (2010) Optimal decharging and clustering of charge ladders generated in ESI-MS. J Proteome Res 9(5):2688–2695
11. Lange E, Gropl C, Schulz-Trieglaff O, Leinenbach A, Huber C, Reinert K (2007) A geometric approach for the alignment of liquid chromatography mass spectrometry data. Bioinformatics 23(13):i273–i281
12. Weisser H, Nahnsen S, Grossmann J, Nilse L, Quandt A, Brauer H, Sturm M, Kenar E, Kohlbacher O, Aebersold R, Malmström L (2013) An automated pipeline for high-throughput label-free quantitative proteomics. J Proteome Res 12:1628–1644
13. Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K, Vázquez-Fresno R, Sajed T, Johnson D, Li C, Karu N, Sayeeda Z, Lo E, Assempour N, Berjanskii M, Singhal S, Arndt D, Liang Y, Badran H, Grant J, Serra-Cayuela A, Liu Y, Mandal R, Neveu V, Pon A, Knox C, Wilson M, Manach C, Scalbert A (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46(D1):D608–D617
14. Fenyö D, Beavis RC (2003) A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal Chem 75(4):768–774
15. Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y, Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T, Matsuda F, Sawada Y, Hirai MY, Nakanishi H, Ikeda K, Akimoto N, Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D, Neumann S, Iida T, Tanaka K, Funatsu K, Matsuura F, Soga T, Taguchi R, Saito K, Nishioka T (2010) MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom 45(7):703–714
16. Böcker S, Letzel MC, Lipták Z, Pervukhin A (2009) SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25(2):218–224
17. Böcker S, Dührkop K (2016) Fragmentation trees reloaded. J Cheminform 8(1):1–26
18. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S (2015) Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci 112:12580–12585
Chapter 5
Abstract
In this chapter, we summarize data preprocessing and data analysis strategies used for analysis of NMR data
for metabolomics studies. Metabolomics consists of the analysis of the low molecular weight compounds in
cells, tissues, or biological fluids, and has been used to reveal biomarkers for early disease detection and
diagnosis, to monitor interventions, and to provide information on pathway perturbations to inform
mechanisms and identify targets. Metabolic profiling (also termed metabotyping) involves the analysis
of hundreds to thousands of molecules using mainly state-of-the-art mass spectrometry (MS) and nuclear
magnetic resonance (NMR) spectroscopy technologies. While NMR is less sensitive than mass spectrometry,
it provides a wealth of complex and information-rich metabolite data. NMR data, together with
conventional statistics, modeling methods, and bioinformatics tools, reveal biomarker and
mechanistic information. A typical NMR spectrum, with up to 64k data points, of a complex biological
fluid or an extract of cells and tissues consists of thousands of sharp signals that are mainly derived from
small molecules. In addition, a number of advanced NMR spectroscopic methods are available for
extracting information on high molecular weight compounds such as lipids or lipoproteins. Numerous
data preprocessing, data reduction, and analysis methods have been developed and continue to evolve in
the field of NMR metabolomics. Our goal is to provide an extensive summary of NMR data preprocessing
and analysis strategies, with examples of open-source and commercially available analysis software and
bioinformatics tools.
Key words NMR, Metabolomics, Metabotyping, Quality control, Preprocessing, Multivariate data
analysis
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_5, © Springer Science+Business Media, LLC, part of Springer Nature 2020
62 Wimal Pathmasiri et al.
Table 1
NMR frequencies and natural abundance of some useful nuclei in an 11.7 T magnetic field strength
(values adopted from Bruker NMR frequency table)
1.2 Applications of Metabolomics Analysis Methods
Cells, tissues, and biological fluids are rich in low molecular weight
metabolites, and metabolomics involves the analysis of these
metabolites [2–5]. Common biological samples used in metabolomics
include serum, plasma, saliva, urine, cerebrospinal fluid, feces, and
exhaled breath. Unlike other omics technologies (genomics,
transcriptomics, and proteomics), metabolomics gives the most
functional information about the system, since metabolites are the
final end products of genomic, transcriptomic, and/or proteomic
perturbations [6] within a complex network of biological pathways.
Metabolomics studies are designed to reveal the pattern of signals that
increase or decrease as a function of health and wellness, or response
to treatment. Once the signals that associate with health and disease
states have been determined, metabolites can be identified and mapped
to biochemical pathways to provide insights into biological mechanisms.
In the past decade, advances in technology have enabled the
application of metabolomics in diverse research areas that cover
basic, biomedical, and clinical sciences, measuring these metabolites
in readily accessible biological fluids, cells, and tissues and
correlating them with responses at the cellular or organ level. The
terms metabolomics, metabonomics, metabolic profiling, metabolic
phenotyping, and metabotyping are used interchangeably by the research
community to refer to essentially the same techniques and procedures
[7, 8]. The technologies that can be used to detect and measure these
metabolites are very diverse. NMR spectroscopy and mass spectrometry
(often coupled with chromatography) are the two main analytical
platforms in metabotyping, whereas other analytical techniques such as
liquid chromatography coupled with electrochemical detection (for
detecting neurotransmitters [9]) or capillary electrophoresis methods
[10] can be used in targeted analysis. Advances in high-throughput,
high-resolution analytical technologies and data analysis platforms
have allowed for metabotyping in large-scale population-based
Metabolome Wide Association Studies (MWAS) and metabotype quantitative
trait locus (mQTL) mapping studies using both NMR spectroscopy [11–13]
and mass spectrometry [14–16]. These methods are more useful than
Genome Wide Association Studies (GWAS) and QTL mapping for discovering
biomarkers of disease risk because metabolic perturbations are the
most downstream products of all of the other omics activities.
Furthermore, integrating MWAS and mQTL with GWAS and QTL mapping
studies improves biomarker identification.
2 NMR Metabolomics
Fig. 1 NMR signal is measured as a FID and then Fourier transformed to obtain the frequency domain
spectrum (example shown here is for a 1H NMR spectrum of a cholesterol derivative)
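The transformation described in the caption of Fig. 1 can be illustrated with a minimal Python sketch. It assumes NumPy and a synthetic FID built from two decaying resonances; the function names, the 2000 Hz spectral width, the resonance offsets, and the decay constant are hypothetical values chosen for illustration and are not taken from the chapter.

```python
import numpy as np

def simulate_fid(freqs_hz, t2_s, n_points, dwell_s):
    """Synthetic FID: a sum of exponentially decaying complex sinusoids."""
    t = np.arange(n_points) * dwell_s
    fid = np.zeros(n_points, dtype=complex)
    for f in freqs_hz:
        fid += np.exp(2j * np.pi * f * t) * np.exp(-t / t2_s)
    return fid

def fid_to_spectrum(fid, dwell_s, zero_fill_to=None):
    """Optionally zero-fill, Fourier transform, and return (Hz axis, real spectrum)."""
    n = zero_fill_to or fid.size
    spec = np.fft.fftshift(np.fft.fft(fid, n=n))        # fft pads with zeros when n > fid.size
    freqs = np.fft.fftshift(np.fft.fftfreq(n, d=dwell_s))
    return freqs, spec.real

# Hypothetical acquisition: 2000 Hz spectral width, 4096 complex points
dwell = 1.0 / 2000.0
fid = simulate_fid([-300.0, 450.0], t2_s=0.5, n_points=4096, dwell_s=dwell)
freqs, spectrum = fid_to_spectrum(fid, dwell, zero_fill_to=8192)
```

In this sketch the two resonances appear as absorption-mode peaks at their offset frequencies. Real data additionally require apodization, phase correction, and chemical shift referencing before the frequency axis can be read in ppm.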
2.3 NMR Metabolomics Workflow
An example workflow for broad spectrum (or untargeted) metabolomics
data using NMR is shown in Fig. 2. This workflow includes study design,
sample collection, sample selection, sample randomization, sample
preparation, data acquisition, data preprocessing, and data analysis.
2.3.1 Study Design
Metabolomics investigations have revealed that genetics (e.g., race,
gender, polymorphisms), nutrition (e.g., diet, nutrients, weight, gut
microbiome), mental health
(e.g., stress, depression, cognition, behaviors), and exposures (e.g.,
tobacco use, pollution, drugs, medications, personal care products)
Fig. 2 Schematic diagram explaining the NMR metabolomics workflow. One advantage of NMR metabolomics
is that both untargeted and targeted data analysis can be performed using the same NMR spectrum. In this
workflow, metabolites are identified after using statistics and modeling approaches to reveal signals that are
important to defining the study phenotype. Quality Controls (underlined) are included at each step of the
pipeline
2.3.2 Sample Selection and Randomization
The number of samples needed for the study will depend on the
variables mentioned above, as well as the strength of the phenotype.
More samples are needed when the study involves weaker phenotypes
(e.g., control vs. mild anxiety), whereas a smaller number of samples
is sufficient to extract the desired information when phenotypes are
stronger (e.g., healthy vs. cancer cases). For a validation or
confirmation study, the number of samples should be determined by power
calculations, such as multivariate power calculations [38], so that
statistically significant results can be achieved. Selection of
biospecimens derived from a different cohort is preferable for such
validation studies. Samples must be randomized for both the sample
preparation and the analysis sequence in the data acquisition in order
to avoid any analytical bias.
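The randomization step can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the function name, the sample identifiers, and the seed are hypothetical, and a fixed seed is used only so that the run lists can be regenerated and documented.

```python
import random

def randomize_run_orders(sample_ids, seed):
    """Return independent random orders for sample preparation and data acquisition."""
    rng = random.Random(seed)          # seeded generator, so the orders are reproducible
    prep_order = list(sample_ids)
    rng.shuffle(prep_order)
    acq_order = list(sample_ids)
    rng.shuffle(acq_order)             # second, independent shuffle for the acquisition queue
    return prep_order, acq_order

# Hypothetical study with 12 samples
samples = [f"S{i:03d}" for i in range(1, 13)]
prep_order, acq_order = randomize_run_orders(samples, seed=2020)
```

In practice, pooled QC samples would typically be interleaved at regular intervals after the study samples have been shuffled, and larger studies may randomize within batches.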
2.3.3 Biospecimen Availability
It is also necessary to consider other factors such as the type of
anticoagulant (or preservative, if any), storage condition, and number
of freeze–thaw cycles. Biospecimens that have undergone different
collection procedures, a different number of freeze–thaw cycles, or
different storage temperatures (−20 °C, −80 °C) can give rise to
different profiles. Ideally, samples with no freeze–thaw cycles that
have been stored at −80 °C are best suited for metabolomics analysis.
This can be achieved by preparing multiple aliquots at the sample
collection stage, prior to freezing. The most important thing is to
keep all conditions of the samples the same within a particular
metabolomics study. The type of anticoagulant used for samples can
affect the NMR spectrum. Serum is best suited for NMR metabolomics,
while heparinized plasma (preferable over EDTA or citrated plasma) can
also be used. EDTA (and its Ca2+ and Mg2+ complexes) or citrate in the
biospecimens gives rise to signals that obscure those of some
endogenous metabolites, and such peaks must therefore be removed from
further analysis. As a result, valuable information is lost if EDTA or
citrate is used as an anticoagulant. However, it was found that useful
biochemical information could still be effectively recovered from
samples collected with these anticoagulants [39]. Once a
Analysis of NMR Metabolomics Data 69
2.3.4 Biological Sample Types
Human subject metabolomics investigations using NMR have generally
used urine, serum, or plasma, while saliva and stool samples are
becoming more commonly utilized. In addition to these biological
matrices, a wide variety of other sample types such as cerebrospinal
fluid, seminal fluid, bronchoalveolar lavage fluid, amniotic fluid, and
extracts of organ tissue have been used in NMR metabolomics. For
studies involving animal models, any available biological matrix (e.g.,
urine, serum or plasma, feces, tissue, organ, and other biological
samples) can be used.
2.3.5 Sample Collection and Storage
Standard established protocols are needed for the collection and
storage [40] of biological samples. Consistency in sample collection
and storage is an important consideration [41]. For biomarker discovery
and validation, analyzing combined sample sets derived from
longitudinal or multicenter, multicohort population studies is
important for increasing statistical power, but it is challenging
because of inconsistencies in sample collection across cohorts and in
storage conditions across biobanks. There are ongoing efforts to
integrate metabolomics data obtained from samples in multiple biobanks
[42]. While it is not always possible to obtain biospecimens from
different cohorts that have been collected and stored in an identical
manner, it is important to have access to the protocols for the
collection and handling of the samples in order to understand
differences that may arise in the profiles between cohorts.
2.3.7 Data Acquisition
The data acquisition methods selected depend on the study question.
Typically, NMR metabolomics uses 1D 1H NMR spectroscopic methods
because of the high sensitivity (and simplicity) of 1H, in contrast to
the very low sensitivity of 13C or 15N, which results from their very
low natural abundance. Serum and plasma samples contain macromolecules
such as proteins, lipids, and lipoproteins. Atoms in these
macromolecules relax faster, and the signals deriving from them are
therefore broadened, making it harder to analyze small molecule
metabolites.
A number of different NMR pulse sequences are commonly used in 1H NMR
metabotyping of biological samples to address these issues and improve
the spectral quality of the small molecules. The four most commonly
used sequences in routine metabolomics data acquisition from blood
samples are NOESY presaturation (presat), Carr-Purcell-Meiboom-Gill
(CPMG), J-resolved, and diffusion-edited methods [43].
1. NOESY presat: The NOESY presat pulse sequence is the first
increment of a NOESY pulse sequence [24] (RD-gz(1)-90°-t-90°-tm-gz(2)-90°-ACQ,
where RD is the relaxation delay, t is a short delay between pulses,
tm is the mixing time, 90° is the radio frequency pulse, gz(1) and
gz(2) are the pulsed field gradients, and ACQ is the data acquisition
time).
2. Carr-Purcell-Meiboom-Gill (CPMG) spin-echo sequence: The
CPMG spin-echo sequence uses a CPMG presat pulse
Fig. 3 Schematic diagram showing preparation of QC samples. (a) Preparation of external pooled QC reference
samples. Biological fluids (serum, plasma, or urine) can be collected from volunteers to make a bulk pooled
QC reference material, which is aliquoted and stored in desired volumes at −80 °C. (b) Preparation of total
study pooled QC samples. For small studies, an equal amount from each sample in a study or a randomly
selected subset of samples is transferred into a container for pooling. Contents are thoroughly mixed and
aliquots of the pooled QC are prepared. For large studies, study pooled QC samples can be prepared by
randomly selecting a subset of samples for pooling, and aliquots of these study pooled samples are used for
each batch of samples. In addition, batch pooled samples can be prepared by pooling equal amounts
from all samples within the batch. (c) Preparation of both phenotypic and combined pooled QC samples using
the study biospecimens
3.1 QC of Data Acquisition
NMR spectrometer performance and field homogeneity are evaluated
routinely using NMR reference standards (e.g., by optimizing for best
line shape and line width) to ensure high-quality data acquisition. In
addition, NMR spectra are visually inspected for
[Figure 4: two panels, "Poor water suppression" and "Good water suppression", each with the residual water signal labeled; x-axis: Chemical Shift (ppm), 8.5 to 0.5; y-axis: Normalized Intensity, 0 to 0.9]
Fig. 4 Effect of water suppression on the quality of the NMR spectrum of serum. Signal intensities of
metabolites are very low in the spectrum with the poor water suppression
3.2 Analytical Reproducibility
The quality of normalized spectral data is evaluated using the pooled
QC samples. The QC pools are technical replicates, and there should be
minimal variation among them. An initial evaluation can be carried out
using an unsupervised method such as principal component analysis
(PCA); other unsupervised methods such as intraclass correlation [59]
or hierarchical clustering trees [33] can also be used to assess
analytical reproducibility. In PCA, a tight clustering of pooled QC
samples together with dispersion of study samples [60, 61] indicates
that the analytical variation is minimal compared to the biological
variation between study samples. If the QC samples were created from
the study samples, then the QC samples in the PCA plot should be
centered among the sample group from which the QC was created. If
external pools were used, then the location of the QC pools in
relation to the study samples does not provide any additional
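The PCA-based check of QC clustering described in this section can be sketched as follows. This is a minimal illustration, assuming NumPy, simulated data, and a simple spread ratio as the summary statistic; the function names, the simulated noise levels, and the ratio itself are hypothetical choices and are not part of any cited method.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of mean-centered X on its first principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def qc_to_study_spread(scores, is_qc):
    """Ratio of QC spread to study-sample spread in score space (smaller means tighter QCs)."""
    def spread(block):
        # root-mean-square distance of the points from their own centroid
        return np.sqrt(((block - block.mean(axis=0)) ** 2).sum(axis=1).mean())
    return spread(scores[is_qc]) / spread(scores[~is_qc])

# Simulated data: 30 study samples, plus 6 technical replicates of their pool
rng = np.random.default_rng(0)
study = rng.normal(0.0, 1.0, size=(30, 50))
pool = study.mean(axis=0)
qc = pool + rng.normal(0.0, 0.05, size=(6, 50))   # small analytical variation only
X = np.vstack([study, qc])
is_qc = np.array([False] * 30 + [True] * 6)
ratio = qc_to_study_spread(pca_scores(X), is_qc)
```

A ratio well below 1 corresponds to the tight QC cluster described above. What counts as acceptable is study-specific, and the scores plot itself should still be inspected.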
4.1 NMR Data
The positions of NMR signals of observed atoms (e.g., 1H) within a
particular molecule are measured as chemical shifts in parts per
million (ppm). The 1H NMR spectrum of alanine is shown in Fig. 5. The
alanine molecule has three carbon atoms, two of which (the methyl
carbon and C-2) carry attached protons (Fig. 5). Accordingly, the 1H
NMR spectrum of alanine in the solvent D2O has two distinct signals:
one for the methyl group (signal A), and one for the H attached to the
C-2 carbon (signal B). 1H atoms in the NH2 group and OH group usually
exchange with deuterium in the solvent and are not observable on the
NMR time scale (millisecond). The 1H signal splits into coupling
patterns that are dependent on the neighboring atoms
Fig. 5 NMR spectrum of alanine showing chemical shifts and J-coupling of CH3 and CH proton groups. The
chemical structure of alanine is shown (in the box) above the NMR spectrum. Protons in the methyl group
(A) are split into 2 lines by a neighboring proton (B) and the CH is split into 4 lines by neighboring methyl
protons (A). Proportion of peak intensities of signal A:B is 3:1
Fig. 6 A 700 MHz 1H NMR spectrum of a pooled urine sample with example identified metabolites.
4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) is used as an internal standard for chemical shift
referencing and relative concentration determination
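Relative concentration determination against an internal standard such as DSS can be sketched as a ratio of proton-normalized peak areas. The example below is a minimal illustration that assumes fully relaxed spectra and well-resolved, correctly integrated peaks; the function name and all numerical values are hypothetical. It uses the fact that the DSS trimethylsilyl singlet arises from nine equivalent protons and the alanine methyl doublet from three.

```python
def concentration_from_internal_standard(area_met, n_h_met, area_std, n_h_std, conc_std):
    """Concentration from peak areas normalized by the number of contributing protons."""
    return conc_std * (area_met / n_h_met) / (area_std / n_h_std)

# Hypothetical integrals: alanine CH3 doublet (3 H) against the DSS singlet (9 H) at 0.5 mM
alanine_mM = concentration_from_internal_standard(
    area_met=1.2, n_h_met=3, area_std=9.0, n_h_std=9, conc_std=0.5
)
```

With these illustrative numbers the estimate is 0.2 mM. Overlapping signals or incomplete relaxation would bias the result, which is one reason such values are usually treated as relative or semiquantitative.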
4.2 Data Preprocessing
As described earlier, NMR data are recorded as FIDs and are then
Fourier transformed into spectra; this is performed at the instrument
level. Before the Fourier transformation step of the
4.2.1 Phenotypic Anchoring
After phase and baseline correction and chemical shift referencing,
NMR data are transferred to a data format accessible for further
statistical analysis and modeling. An NMR spectrum is a data matrix
consisting of chemical shift and peak intensity information. A single
spectrum usually contains up to 65,536 data points. There are two
methods for handling NMR data: (1) using raw data (full resolution) or
binned data (data reduction); (2) peak deconvolution, identification,
and relative quantification of metabolites. Full-resolution data
analysis is computationally intensive and generally requires
high-performance computing clusters. When such resources are available,
the raw data can be used directly for modeling. Otherwise, the data
must be reduced using a process called binning. Both approaches allow
high-throughput data analysis. Some artifacts and unwanted spectral
regions must be removed prior to further analysis of either raw or
processed spectral data. These include the residual water signal,
internal standards, and chemicals or medications applied as treatments
in the study, together with their metabolites. The NMR peaks in the
spectra can be deconvoluted into metabolites and quantified. Despite
efforts toward automation of metabolite identification, fitting, and
quantification of metabolite levels in 1D 1H NMR spectra, this process
is hampered by signal overlap, the high dynamic range of metabolite
concentrations, small differences in chemical shift between spectra
for the same metabolite, complex splitting patterns (multiplicity) of
the peaks of a metabolite, and the level of noise. Another disadvantage
of automation is that unknown metabolites cannot be identified, since
only the peaks of known metabolites can be modeled in automated
fitting.
[Figure 7: two stacked panels, "Before" and "After"; x-axis ticks from 3 to 2.3; y-axis ticks from 0 to 700000]
Fig. 7 NMR spectra of urine samples aligned by CluPA algorithm using NMRProcFlow software
4.2.3 NMR Binning
Binning or bucketing implements data reduction of NMR metabolomics
data. In the conventional binning approach, the spectrum is divided
into evenly spaced windows called bins or buckets, and the peak area in
each bin is integrated. The bucket width is typically 0.01 or 0.04 ppm
(some studies use a bin width of 0.001 ppm). Binning can also reduce,
to some extent, the effect of variations in peak positions between
spectra. The main disadvantage of conventional binning is that parts of
a metabolite peak can fall into neighboring bins or buckets. Several
methods have been proposed to overcome this: intelligent bucketing
[69, 70], adaptive binning [71], and optimized bucketing [72]. In
intelligent bucketing, the bin boundaries are loosened by a specified
percentage (e.g., 50% looseness), allowing peaks to fall within the
same bin. Adaptive binning corrects for variation in chemical shifts
through the use of a wavelet transform. The optimized bucketing
algorithm generates an average spectrum of all data and defines bin
boundaries using that spectrum. The NMR binning of projections of 2D
J-resolved spectra can be performed by using the JBA algorithm [73]
implemented in the R-based MWASTools [74] software.
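Conventional (uniform) binning can be illustrated with a short Python sketch. It assumes NumPy and an evenly sampled spectrum, so that the sum of intensities in a bin is proportional to the integrated area; the function name, the 0 to 10 ppm range, and the simulated peak are hypothetical, and none of the refined methods cited above is implemented here.

```python
import numpy as np

def uniform_binning(ppm, intensity, bin_width=0.04, ppm_min=0.0, ppm_max=10.0):
    """Sum intensities in evenly spaced chemical-shift bins; return (bin centers, bin sums)."""
    n_bins = int(round((ppm_max - ppm_min) / bin_width))
    edges = np.linspace(ppm_min, ppm_max, n_bins + 1)
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, sums

# Simulated spectrum with a single narrow peak at 1.46 ppm
ppm = np.linspace(10.0, 0.0, 20001)
intensity = np.exp(-0.5 * ((ppm - 1.46) / 0.005) ** 2)
centers, binned = uniform_binning(ppm, intensity, bin_width=0.04)
```

A peak sitting on a bin edge (for example at 1.48 ppm with these settings) would be divided between two neighboring bins, which is the disadvantage that the intelligent, adaptive, and optimized methods aim to address.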
4.2.4 NMR Data Normalization
Data normalization is one of the most important steps in metabolomics
analysis; the goal is to remove variation in the total amount of
material between samples. It is especially important for urine
samples, whose concentrations can differ by orders of magnitude.
Similarly, it is important to normalize data when analyzing cell or
tissue extracts, where it can be difficult to control analyte levels
because of factors such as sampling or extraction efficiency. Such
changes would be disproportionate to the relevant biological or
biochemical variation attributed to metabotype and would hinder the
extraction of valuable biomarker information. Despite the many
normalization methods in the literature, no single approach is optimal
for every study. The data must be processed appropriately according to
the study and the type of samples and data. Normalization is typically
performed row-wise (on observations). Commonly used methods include
constant sum normalization, probabilistic quotient normalization,
cubic spline normalization, normalizing to a standard peak, and
normalizing to the total volume collected, specific gravity, or the
total mass of sample used. In constant sum normalization, each integral
(or intensity) is divided by the sum of all integrals or intensities
and typically multiplied by a constant such as 1000. Probabilistic
quotient normalization (PQN) [75] is based on the calculation of a most
probable dilution factor from the distribution of the quotients of the
amplitudes of a test spectrum by those of a reference spectrum. PQN was
shown to be more robust and more accurate than the widespread total sum
normalization, using experimental spectra of a complete metabolomics
study as well as simulated spectra [75]. When normalizing to a standard
metabolite, the metabolite should be chosen carefully, and it should
not vary biologically with metabotype. For example, although
normalization to the creatinine peak in urine samples is common in
clinical studies, creatinine would not be a suitable choice where it is
expected to change in patients (e.g., in kidney disease).
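The two normalization methods described in most detail above can be sketched in Python as follows. This is a minimal illustration, assuming NumPy and a samples-by-variables matrix of non-negative bin integrals; the function names are hypothetical, and the use of the median spectrum as the PQN reference is a common choice assumed here rather than one prescribed by the text.

```python
import numpy as np

def constant_sum_normalize(X, total=1000.0):
    """Scale each row (spectrum) so that its intensities sum to `total`."""
    return X / X.sum(axis=1, keepdims=True) * total

def pqn_normalize(X, reference=None):
    """Probabilistic quotient normalization of a samples-by-variables matrix."""
    Xn = constant_sum_normalize(X)                 # initial integral normalization
    if reference is None:
        reference = np.median(Xn, axis=0)          # assumed reference: median spectrum
    keep = reference > 0                           # avoid division by zero
    quotients = Xn[:, keep] / reference[keep]
    dilution = np.median(quotients, axis=1, keepdims=True)  # most probable dilution factor
    return Xn / dilution, dilution.ravel()

# Simulated data: one profile at five dilutions, with one strongly changed metabolite
rng = np.random.default_rng(1)
base = rng.uniform(1.0, 10.0, size=40)
X = np.outer([1.0, 0.5, 2.0, 4.0, 3.0], base)
X[4, 0] *= 20.0                                    # large change in a single variable
X_sum = constant_sum_normalize(X)
X_pqn, factors = pqn_normalize(X)
```

In this simulated case, constant sum normalization lowers every unchanged variable of the last sample because one large signal inflates its total, whereas PQN restores them to the common profile.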
4.2.5 Centering, Scaling and Transformation of Data
Appropriate data pretreatment can include centering, scaling, or
transformation, methods that are performed column-wise (on variables)
in a table of data [76]. Centering (mean centering) helps to adjust for
the difference in offset between high and low abundance metabolites in
order to emphasize the relevant variation (covariance) between
biological samples. Centering is done by subtracting the mean of each
column (variable) from every value in that column, so that each column
has a mean of zero. Scaling of data is a method that adjusts for the
differences in magnitude between metabolites by converting the data
into concentration changes relative to a scaling factor [76].
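Column-wise centering and scaling can be sketched in Python as follows. The text does not name specific scaling factors, so the two shown here, unit-variance scaling (autoscaling) and Pareto scaling, are assumptions chosen because they are widely used; the function names and simulated data are hypothetical.

```python
import numpy as np

def mean_center(X):
    """Subtract each column (variable) mean so that every column has a mean of zero."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Unit-variance scaling: center, then divide each column by its standard deviation."""
    return mean_center(X) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Pareto scaling: center, then divide each column by the square root of its standard deviation."""
    return mean_center(X) / np.sqrt(X.std(axis=0, ddof=1))

# Simulated intensity table: 20 samples (rows) by 5 variables (columns)
rng = np.random.default_rng(1)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(20, 5))
X_centered = mean_center(X)
X_auto = autoscale(X)
X_pareto = pareto_scale(X)
```

Autoscaling gives every variable equal weight, which can inflate noise in low-intensity regions, whereas Pareto scaling is a compromise that keeps part of the original intensity structure. Columns with zero variance should be removed before either scaling is applied.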
Data analysis without this preprocessing step could lead to loss
of information arising from only slight changes in metabolite levels
4.3 Data Analysis Methods
As outlined earlier, NMR metabolomics data can be analyzed in either
raw or preprocessed form. A vast number of statistical analysis and
modeling approaches are available in the literature, and the number is
rapidly increasing. Most of these analyses are performed using Matlab
(or Mathematica), R, Java, or Python code. Some methods integrate the
whole NMR metabolomics pipeline (from importing, preprocessing,
analyzing, and graphical visualization to identifying biomarker
metabolites), while others have been developed to address specific data
processing, modeling, or identification needs.
4.3.1 Multivariate Data Analysis
Metabolomics analysis involves data in high-dimensional data matrices:
typically, samples in rows and variables in columns. The use of
multivariate data analysis methods [78] is common in analyzing
metabolomics data [79]. Multivariate data analysis methods are
projection-based and rely on eigenvalue decomposition, and a number of
variations have been reported in the literature [78, 80, 81]. The
nonsupervised multivariate method, principal component analysis
1H-15N HSQC spectra. The smart tagging is performed on the biological
sample, 1H-15N HSQC spectra are recorded on the samples, and the 15N
and 1H chemical shifts observed for cross peaks in the spectra are
compared to those in the library.
1D 13C spectroscopy (proton decoupled) has many advantages for a
metabolomics study, including a large spectral dispersion, narrow
singlets at natural abundance, and a direct measure of the backbone
structures of metabolites compared to 1H NMR spectra. However, it
suffers from very low sensitivity because of the low natural abundance
of the 13C isotope (~1.1%) combined with a lower gyromagnetic ratio γ
(about one quarter of that of 1H; sensitivity scales with γ³ and is
therefore decreased by a factor of about 64), unless the metabolomics
experiments are performed with 13C enrichment. Another disadvantage of
this experiment is the inability to detect quaternary carbon atoms
(carbon atoms with no protons attached). On the other hand, an
INADEQUATE (incredible natural abundance double quantum transfer
experiment) spectrum can provide the complete carbon skeleton structure
of a metabolite. This type of experiment is very helpful when 13C
isotope enrichment is performed prior to analysis. Clendinen and
coworkers [96] developed a semiautomated algorithm, INADEQUATE Network
Analysis (INETA), with which untargeted analysis can be performed and
metabolites can be identified by using INADEQUATE spectral data and an
in silico database constructed from the BMRB database.
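The arithmetic behind the sensitivity penalty quoted above can be checked with a few lines of Python. The sketch uses only the approximate values given in the text (a gyromagnetic ratio of one quarter of that of 1H and a natural abundance of about 1.1%); the variable names are hypothetical, and other contributions to sensitivity, such as relaxation behavior or the nuclear Overhauser effect, are ignored.

```python
gamma_ratio = 0.25           # gamma(13C) / gamma(1H), approximate value quoted in the text
natural_abundance = 0.011    # 13C natural abundance (~1.1%)

# Sensitivity scales with gamma cubed, so the gamma-related penalty is (1 / 0.25)**3
gamma_penalty = 1.0 / gamma_ratio ** 3

# At natural abundance, only about 1.1% of the carbon atoms contribute to the signal
combined_penalty = gamma_penalty / natural_abundance
```

Under these assumptions the gamma term alone gives the factor of 64, and including natural abundance lowers the sensitivity by a factor of roughly 5800 relative to 1H, which is why 13C enrichment is attractive for these experiments.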
There are efforts to deconvolute the signals in NMR spectra into
metabolites for identification. Another advantage of NMR spectroscopy
is that the same spectrum can be used for both untargeted and
semitargeted analysis. Since NMR spectroscopy is a quantitative
technique, the use of an internal standard of known concentration
allows for quantitation of metabolites [97]. This is a relative or
semiquantitative method because no calibration curve is employed to
obtain absolute concentrations. For example, Chenomx is a commercial
package that contains a library of metabolites modeled to account for
chemical shift variation due to pH. The spectra should be acquired so
as to allow for the total relaxation of the 1H atoms of the metabolites
in the biological sample, in order to permit identification and
semiquantification of metabolites. This method can be applied to
profile a selected set of metabolites in a batch profiling mode. The
experience of the analyst, or prior knowledge of metabolite signals,
plays a critical role in successful identification and
semiquantification with this method. On the other hand, Röhnisch et al.
[98] have recently reported an automated quantification algorithm
(AQuA) for targeted quantification of metabolites that can account for
signal overlaps of metabolites in complex
4.3.4 Workflows and Web-Based Analysis Programs
There are a number of open source workflows and web-based programs
available for NMR metabolomics analysis. In addition, standalone
programs and web-based servers have been proposed in the literature for
metabolomics data analysis. Some of these are briefly described in this
section with relevant literature and links.
Automics [104] is an integrated metabolomics analysis platform that has
functionality for each step in an NMR metabolomics workflow, including
processing spectra, nonsupervised and supervised multivariate data
analysis, and graphical visualization. KIMBLE (KNIME-based Integrated
MetaBoLomics Environment) [105] is an NMR metabolomics workflow based on
the KNIME workflow management system. It is a platform that is
integrated with the algorithms and software tools necessary for
untargeted and targeted analysis. It is self-documenting and combines
data, algorithms, and software in one file [105], allowing for reuse of
the exact workflow parameters in future. Even the workflow for a single
project can be saved with all tools and parameters as a single virtual
machine. These capabilities make KIMBLE a robust platform for running
the NMR metabolomics workflow reproducibly.
MetaboAnalyst is a web-based metabolomics platform [106–109] for
uploading metabolomics data (both NMR and MS) and for data analysis,
including pathway analysis. It has a number of modules with various
preprocessing, statistical, multivariate analysis, identification,
pathway analysis, and integrated omics data analysis tools. A
standalone package for local installation is available on the
MetaboAnalyst web page. Metabolomics Univariate and Multivariate
Statistical Analysis (MUMA) [110] is an R-based pipeline that guides
the user through data analysis (univariate and multivariate) and
interpretation. It also provides additional tools specifically designed
to help the user in the interpretation of NMR data, such as STOCSY and
RANSY.
MVAPACK [111] is an open source package written in the GNU Octave
programming language to process metabolomics data from FIDs to
modeling. It has a collection of tools including those needed for
traditional NMR processing (apodization, zero-filling, Fourier
transformation, manual and automatic phase correction, region of
interest selection, and peak picking, integration, and referencing),
data preprocessing (alignment, binning, normalization, and scaling),
multivariate data analysis (nonsupervised and supervised), and model
validation. NMRProcFlow [112] (https://nmrprocflow.org/) is a web-based
workflow to process 1D NMR spectra interactively with visualization
tools. The workflow can be accessed through the web or can be installed
locally as a virtual machine. It allows the user to import spectra and
perform baseline and phase correction, chemical shift calibration, peak
alignment, NMR binning (after removing unwanted regions including
baseline noise), and merging of metadata. The workflow process can be
saved and applied in the same way to other datasets.
A cloud-based computing framework for NMR metabolomics analysis using
Hadoop streaming with Matlab [113] has been reported, indicating the
possibility of using cloud-based computing power for NMR metabolomics
applications. Pathomx [114] is another workflow-based tool, developed
using Python, for processing, analyzing, visualizing, and exporting
metabolomics data. The R-based package speaq 2.0 [115] contains a
workflow that includes peak alignment using the CluPA algorithm, peak
picking, data imputation, and peak table generation. The output that
speaq produces is compatible with other R-based statistical and
multivariate data analysis packages.
Workflow4Metabolomics (W4M) [116] is an open source, Galaxy-based web
platform (https://workflow4metabolomics.org/) that has algorithms for
data preprocessing, statistical analysis, and annotation. It is a
virtual research environment built on a high-performance computing
environment (660 cores and 100 TB). It has modules for analyzing both
MS and NMR data. Users can access the web version, or the whole
framework and computing tools can be installed locally as a virtual
machine. ASICS (Automatic Statistical Identification in Complex
Spectra) [117] is another R-based software package that comprises a
workflow with algorithms developed for automated analysis of NMR
spectra. The workflow includes raw NMR data import, baseline
processing, peak alignment, NMR binning, and metabolite identification
and quantification. In addition, the ASICS workflow includes tools for
multivariate data analysis (PCA and OPLS-DA), identification of NMR
bins or metabolites that discriminate phenotypes, and some other
statistical analyses.
4.3.6 Metabolic Pathway Analysis
Metabotyping studies can result in the identification of metabolites
whose levels have changed in the biological samples. The next step is
to interpret these findings and answer the biological questions
mechanistically by exploring the metabolic pathways that have been
perturbed due to the disease
4.3.7 Metabolomics Data Repositories
It is important to make the data and metadata of metabolomics studies,
including experimental data, available to the research community for
several reasons. Public availability lets a wider community of
researchers and bioinformaticians obtain data for testing their
software tools. The availability of standardized experimental protocols
also facilitates the reproducibility of metabolomics studies.
Therefore, establishing metabolomics data repositories and data
coordinating centers is important; the Metabolomics Workbench [21] in
the USA and MetaboLights in Europe are two such data repositories. The
Metabolomics Workbench serves as a national and international
repository for metabolomics data and metadata and provides analysis
tools and access to metabolite standards, protocols, tutorials,
training, and more. MetaboLights [122] is a database of metabolomics
experiments, experimental data, and metadata for a variety of species
and techniques; it covers metabolite structures and their reference
spectra as well as information on their biological roles, locations,
and concentrations.
5 Summary
References
(2010) Opening up the "Black Box": metabolic phenotyping and metabolome-wide association studies in epidemiology. J Clin Epidemiol 63(9):970–979. https://doi.org/10.1016/j.jclinepi.2009.10.001
12. Hedjazi L, Gauguier D, Zalloua PA, Nicholson JK, Dumas ME, Cazier JB (2015) mQTL.NMR: an integrated suite for genetic mapping of quantitative variations of (1)H NMR-based metabolic profiles. Anal Chem 87(8):4377–4384. https://doi.org/10.1021/acs.analchem.5b00145
13. Cazier JB, Kaisaki PJ, Argoud K, Blaise BJ, Veselkov K, Ebbels TM, Tsang T, Wang Y, Bihoreau MT, Mitchell SC, Holmes EC, Lindon JC, Scott J, Nicholson JK, Dumas ME, Gauguier D (2012) Untargeted metabolome quantitative trait locus mapping associates variation in urine glycerate to mutant glycerate kinase. J Proteome Res 11(2):631–642. https://doi.org/10.1021/pr200566t
14. Gibson G, Gieger C, Geistlinger L, Altmaier E, Hrabé de Angelis M, Kronenberg F, Meitinger T, Mewes H-W, Wichmann HE, Weinberger KM, Adamski J, Illig T, Suhre K (2008) Genetics meets metabolomics: a genome-wide association study of metabolite profiles in human serum. PLoS Genet 4(11):e1000282. https://doi.org/10.1371/journal.pgen.1000282
15. Sekula P, Goek ON, Quaye L, Barrios C, Levey AS, Romisch-Margl W, Menni C, Yet I, Gieger C, Inker LA, Adamski J, Gronwald W, Illig T, Dettmer K, Krumsiek J, Oefner PJ, Valdes AM, Meisinger C, Coresh J, Spector TD, Mohney RP, Suhre K, Kastenmuller G, Kottgen A (2016) A metabolome-wide association study of kidney function and disease in the general population. J Am Soc Nephrol 27(4):1175–1188. https://doi.org/10.1681/ASN.2014111099
16. Kraus WE, Muoio DM, Stevens R, Craig D, Bain JR, Grass E, Haynes C, Kwee L, Qin X, Slentz DH, Krupp D, Muehlbauer M, Hauser ER, Gregory SG, Newgard CB, Shah SH (2015) Metabolomic quantitative trait loci (mQTL) mapping implicates the ubiquitin proteasome system in cardiovascular disease pathogenesis. PLoS Genet 11(11):e1005553. https://doi.org/10.1371/journal.pgen.1005553
17. MRC-NIHR National Phenome Center. https://www.imperial.ac.uk/phenome-centre. Accessed February 2019
18. Clinical Phenotyping Centre. http://www.imperial.ac.uk/clinical-phenotyping-centre/. Accessed February 2019
19. Phenome Center Birmingham. https://www.birmingham.ac.uk/research/activity/phenome-centre/index.aspx. Accessed February 2019
20. Australian National Phenome Center. https://www.wahtn.org/enabling-platforms/australian-national-phenome-centre/. Accessed February 2019
21. Sud M, Fahy E, Cotter D, Azam K, Vadivelu I, Burant C, Edison A, Fiehn O, Higashi R, Nair KS, Sumner S, Subramaniam S (2016) Metabolomics Workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res 44(D1):D463–D470. https://doi.org/10.1093/nar/gkv1042
22. Markley JL, Bruschweiler R, Edison AS, Eghbalnia HR, Powers R, Raftery D, Wishart DS (2017) The future of NMR-based metabolomics. Curr Opin Biotechnol 43:34–40. https://doi.org/10.1016/j.copbio.2016.08.001
23. Ludwig C, Easton JM, Lodi A, Tiziani S, Manzoor SE, Southam AD, Byrne JJ, Bishop LM, He S, Arvanitis TN, Günther UL, Viant MR (2011) Birmingham metabolite library: a publicly accessible database of 1-D 1H and 2-D 1H J-resolved NMR spectra of authentic metabolite standards (BML-NMR). Metabolomics 8(1):8–18. https://doi.org/10.1007/s11306-011-0347-7
24. Wishart DS (2008) Quantitative metabolomics using NMR. TrAC Trends Anal Chem 27(3):228–237. https://doi.org/10.1016/j.trac.2007.12.001
25. Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, Cheng D, Jewell K, Arndt D, Sawhney S, Fung C, Nikolai L, Lewis M, Coutouly MA, Forsythe I, Tang P, Shrivastava S, Jeroncic K, Stothard P, Amegbey G, Block D, Hau DD, Wagner J, Miniaci J, Clements M, Gebremedhin M, Guo N, Zhang Y, Duggan GE, Macinnis GD, Weljie AM, Dowlatabadi R, Bamforth F, Clive D, Greiner R, Li L, Marrie T, Sykes BD, Vogel HJ, Querengesser L (2007) HMDB: the human metabolome database. Nucleic Acids Res 35(Database issue):D521–D526. https://doi.org/10.1093/nar/gkl923
26. Kuhn S, Schlorer NE (2015) Facilitating quality control for spectra assignments of small organic molecules: nmrshiftdb2—a free in-house NMR database with integrated LIMS for academic service laboratories. Magn Reson Chem 53(8):582–589. https://doi.org/10.1002/mrc.4263
27. Laine JE, Bailey KA, Olshan AF, Smeester L, Drobna Z, Styblo M, Douillet C, Garcia-Vargas G, Rubio-Andrade M, Pathmasiri W, McRitchie S, Sumner SJ, Fry RC (2017) Neonatal metabolomic profiles related to prenatal arsenic exposure. Environ Sci Technol 51(1):625–633. https://doi.org/10.1021/acs.est.6b04374
28. Szabo DT, Pathmasiri W, Sumner S, Birnbaum LS (2017) Serum metabolomic profiles in neonatal mice following oral brominated flame retardant exposures to hexabromocyclododecane (HBCD) alpha, gamma, and commercial mixture. Environ Health Perspect 125(4):651–659. https://doi.org/10.1289/EHP242
29. Fan TW, Lane AN (2011) NMR-based stable isotope resolved metabolomics in systems biochemistry. J Biomol NMR 49(3–4):267–280. https://doi.org/10.1007/s10858-011-9484-6
30. Creek DJ, Chokkathukalam A, Jankevics A, Burgess KE, Breitling R, Barrett MP (2012) Stable isotope-assisted metabolomics for network-wide metabolic pathway elucidation. Anal Chem 84(20):8442–8447. https://doi.org/10.1021/ac3018795
31. Zamboni N, Fendt S-M, Rühl M, Sauer U (2009) 13C-based metabolic flux analysis. Nat Protoc 4(6):878–892. https://doi.org/10.1038/nprot.2009.58
32. Beckonert O, Keun HC, Ebbels TMD, Bundy J, Holmes E, Lindon JC, Nicholson JK (2007) Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts. Nat Protoc 2(11):2692–2703. https://doi.org/10.1038/nprot.2007.376
33. Dumas M-E, Maibaum EC, Teague C, Ueshima H, Zhou B, Lindon JC, Nicholson JK, Stamler J, Elliott P, Queenie HE (2006) Assessment of analytical reproducibility of 1H NMR spectroscopy based metabonomics for large-scale epidemiological research: the INTERMAP study. Anal Chem 78:2199–2208
34. Karaman I, Ferreira DL, Boulange CL, Kaluarachchi MR, Herrington D, Dona AC, Castagne R, Moayyeri A, Lehne B, Loh M, de Vries PS, Dehghan A, Franco OH, Hofman A, Evangelou E, Tzoulaki I, Elliott P, Lindon JC, Ebbels TM (2016) Workflow for integrated processing of multicohort untargeted (1)H NMR metabolomics data in large-scale metabolic epidemiology. J Proteome Res 15(12):4188–4194. https://doi.org/10.1021/acs.jproteome.6b00125
35. Bornet A, Maucourt M, Deborde C, Jacob D, Milani J, Vuichoud B, Ji X, Dumez JN, Moing A, Bodenhausen G, Jannin S, Giraudeau P (2016) Highly repeatable dissolution dynamic nuclear polarization for heteronuclear NMR metabolomics. Anal Chem 88(12):6179–6183. https://doi.org/10.1021/acs.analchem.6b01094
36. Dumez JN, Milani J, Vuichoud B, Bornet A, Lalande-Martin J, Tea I, Yon M, Maucourt M, Deborde C, Moing A, Frydman L, Bodenhausen G, Jannin S, Giraudeau P (2015) Hyperpolarized NMR of plant and cancer cell extracts at natural abundance. Analyst 140(17):5860–5863. https://doi.org/10.1039/c5an01203a
37. Johnson CH, Patterson AD, Idle JR, Gonzalez FJ (2012) Xenobiotic metabolomics: major impact on the metabolome. Annu Rev Pharmacol Toxicol 52:37–56. https://doi.org/10.1146/annurev-pharmtox-010611-134748
38. Blaise BJ, Correia G, Tin A, Young JH, Vergnaud AC, Lewis M, Pearce JT, Elliott P, Nicholson JK, Holmes E, Ebbels TM (2016) Power analysis and sample size determination in metabolic phenotyping. Anal Chem 88(10):5179–5188. https://doi.org/10.1021/acs.analchem.6b00188
39. Barton RH, Waterman D, Bonner FW, Holmes E, Clarke R, Procardis C, Nicholson JK, Lindon JC (2010) The influence of EDTA and citrate anticoagulant addition to human plasma on information recovery from NMR-based metabolic profiling studies. Mol BioSyst 6(1):215–224. https://doi.org/10.1039/b907021d
40. Bernini P, Bertini I, Luchinat C, Nincheri P, Staderini S, Turano P (2011) Standard operating procedures for pre-analytical handling of blood and urine for metabolomic studies and biobanks. J Biomol NMR 49(3–4):231–243. https://doi.org/10.1007/s10858-011-9489-1
41. Haid M, Muschet C, Wahl S, Romisch-Margl W, Prehn C, Moller G, Adamski J (2018) Long-term stability of human plasma metabolites during storage at -80 degrees C. J Proteome Res 17(1):203–211. https://doi.org/10.1021/acs.jproteome.7b00518
42. Dane AD, Hendriks MM, Reijmers TH, Harms AC, Troost J, Vreeken RJ, Boomsma DI, van Duijn CM, Slagboom EP, Hankemeier T (2014) Integrating metabolomics profiling measurements across multiple biobanks. Anal Chem 86(9):4110–4114. https://doi.org/10.1021/ac404191a
43. Dona AC, Jimenez B, Schafer H, Humpfer E, Spraul M, Lewis MR, Pearce JT, Holmes E, Lindon JC, Nicholson JK (2014) Precision high-throughput proton NMR spectroscopy of human urine, serum, and plasma for large-scale metabolic phenotyping. Anal Chem 86(19):9887–9894. https://doi.org/10.1021/ac5025039
44. Beckonert O, Coen M, Keun HC, Wang Y, Ebbels TMD, Holmes E, Lindon JC, Nicholson JK (2010) High-resolution magic-angle-spinning NMR spectroscopy for metabolic profiling of intact tissues. Nat Protoc 5(6):1019–1032. https://doi.org/10.1038/nprot.2010.45
45. Wong A, Jimenez B, Li X, Holmes E, Nicholson JK, Lindon JC, Sakellariou D (2012) Evaluation of high resolution magic-angle coil spinning NMR spectroscopy for metabolic profiling of nanoliter tissue biopsies. Anal Chem 84(8):3843–3848. https://doi.org/10.1021/ac300153k
46. Gillies RJ, Morse DL (2005) In vivo magnetic resonance spectroscopy in cancer. Annu Rev Biomed Eng 7:287–326. https://doi.org/10.1146/annurev.bioeng.7.060804.100411
47. Stewart DA, Winnike JH, McRitchie SL, Clark RF, Pathmasiri WW, Sumner SJ (2016) Metabolomics analysis of hormone-responsive and triple-negative breast cancer cell responses to paclitaxel identify key metabolic differences. J Proteome Res 15(9):3225–3240. https://doi.org/10.1021/acs.jproteome.6b00430
48. Livanos AE, Greiner TU, Vangay P, Pathmasiri W, Stewart D, McRitchie S, Li H, Chung J, Sohn J, Kim S, Gao Z, Barber C, Kim J, Ng S, Rogers AB, Sumner S, Zhang XS, Cadwell K, Knights D, Alekseyenko A, Backhed F, Blaser MJ (2016) Antibiotic-mediated gut microbiome perturbation accelerates development of type 1 diabetes in mice. Nat Microbiol 1(11):16140. https://doi.org/10.1038/nmicrobiol.2016.140
49. Loeser RF, Pathmasiri W, Sumner SJ, McRitchie S, Beavers D, Saxena P, Nicklas BJ, Jordan J, Guermazi A, Hunter DJ, Messier SP (2016) Association of urinary metabolites with radiographic progression of knee osteoarthritis in overweight and obese adults: an exploratory study. Osteoarthr Cartil 24(8):1479–1486. https://doi.org/10.1016/j.joca.2016.03.011
50. Psychogios N, Hau DD, Peng J, Guo AC, Mandal R, Bouatra S, Sinelnikov I, Krishnamurthy R, Eisner R, Gautam B, Young N, Xia J, Knox C, Dong E, Huang P, Hollander Z, Pedersen TL, Smith SR, Bamforth F, Greiner R, McManus B, Newman JW, Goodfriend T, Wishart DS (2011) The human serum metabolome. PLoS One 6(2):e16957. https://doi.org/10.1371/journal.pone.0016957
51. Smilowitz JT, O’Sullivan A, Barile D, German JB, Lonnerdal B, Slupsky CM (2013) The human milk metabolome reveals diverse oligosaccharide profiles. J Nutr 143(11):1709–1718. https://doi.org/10.3945/jn.113.178772
52. Rodriguez-Martinez A, Posma JM, Ayala R, Harvey N, Jimenez B, Neves AL, Lindon JC, Sonomura K, Sato TA, Matsuda F, Zalloua P, Gauguier D, Nicholson JK, Dumas ME (2017) J-resolved (1)H NMR 1D-projections for large-scale metabolic phenotyping studies: application to blood plasma analysis. Anal Chem 89(21):11405–11412. https://doi.org/10.1021/acs.analchem.7b02374
53. Fonville JM, Maher AD, Coen M, Holmes E, Lindon JC, Nicholson JK (2010) Evaluation of full-resolution J-resolved 1H NMR projections of biofluids for metabonomics information retrieval and biomarker identification. Anal Chem 82:1811–1821
54. Liu M, Tang H, Nicholson JK, Lindon JC (2002) Use of 1H NMR-determined diffusion coefficients to characterize lipoprotein fractions in human blood plasma. Magn Reson Chem 40(13):S83–S88. https://doi.org/10.1002/mrc.1121
55. Chylla RA, Hu K, Ellinger JJ, Markley JL (2011) Deconvolution of two-dimensional NMR spectra by fast maximum likelihood reconstruction: application to quantitative metabolomics. Anal Chem 83(12):4871–4880. https://doi.org/10.1021/ac200536b
56. Phinney KW, Ballihaut G, Bedner M, Benford BS, Camara JE, Christopher SJ, Davis WC, Dodder NG, Eppe G, Lang BE, Long SE, Lowenthal MS, McGaw EA, Murphy KE, Nelson BC, Prendergast JL, Reiner JL, Rimmer CA, Sander LC, Schantz MM, Sharpless KE, Sniegoski LT, Tai SS, Thomas JB, Vetter TW, Welch MJ, Wise SA, Wood LJ, Guthrie WF, Hagwood CR, Leigh SD, Yen JH, Zhang NF, Chaudhary-Webb M, Chen H, Fazili Z, LaVoie DJ, McCoy LF, Momin SS, Paladugula N, Pendergrast EC, Pfeiffer CM, Powers CD, Rabinowitz D, Rybak ME, Schleicher RL, Toombs BM, Xu M, Zhang M, Castle AL (2013) Development of a Standard Reference Material for metabolomics research. Anal Chem 85
94. Ye T, Mo H, Shanaiah N, Nagana Gowda GA, Zhang S, Raftery D (2009) Chemoselective 15N tag for sensitive and high-resolution nuclear magnetic resonance profiling of the carboxyl-containing metabolome. Anal Chem 81:4882–4888
95. Tayyari F, Gowda GA, Gu H, Raftery D (2013) 15N-cholamine—a smart isotope tag for combining NMR- and MS-based metabolite profiling. Anal Chem 85(18):8715–8721. https://doi.org/10.1021/ac401712a
96. Clendinen CS, Pasquel C, Ajredini R, Edison AS (2015) (13)C NMR metabolomics: INADEQUATE network analysis. Anal Chem 87(11):5698–5706. https://doi.org/10.1021/acs.analchem.5b00867
97. Weljie AM, Newton J, Mercier P, Carlson E, Slupsky CM (2006) Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal Chem 78(13):4430–4442. https://doi.org/10.1021/ac060209g
98. Rohnisch HE, Eriksson J, Mullner E, Agback P, Sandstrom C, Moazzami AA (2018) AQuA: an automated quantification algorithm for high-throughput NMR-based metabolomics and its application in human plasma. Anal Chem 90(3):2095–2102. https://doi.org/10.1021/acs.analchem.7b04324
99. Hao J, Astle W, De Iorio M, Ebbels TM (2012) BATMAN—an R package for the automated quantification of metabolites from nuclear magnetic resonance spectra using a Bayesian model. Bioinformatics 28(15):2088–2090. https://doi.org/10.1093/bioinformatics/bts308
100. Hao J, Liebeke M, Astle W, De Iorio M, Bundy JG, Ebbels TM (2014) Bayesian deconvolution and quantification of metabolites in complex 1D NMR spectra using BATMAN. Nat Protoc 9(6):1416–1427. https://doi.org/10.1038/nprot.2014.090
101. Liebeke M, Hao J, Ebbels TM, Bundy JG (2013) Combining spectral ordering with peak fitting for one-dimensional NMR quantitative metabolomics. Anal Chem 85(9):4605–4612. https://doi.org/10.1021/ac400237w
102. Ravanbakhsh S, Liu P, Bjorndahl TC, Mandal R, Grant JR, Wilson M, Eisner R, Sinelnikov I, Hu X, Luchinat C, Greiner R, Wishart DS (2015) Accurate, fully-automated NMR spectral profiling for metabolomics. PLoS One 10(5):e0124219. https://doi.org/10.1371/journal.pone.0124219
103. Lewis IA, Schommer SC, Markley JL (2009) rNMR: open source software for identifying and quantifying metabolites in NMR spectra. Magn Reson Chem 47(Suppl 1):S123–S126. https://doi.org/10.1002/mrc.2526
104. Wang T, Shao K, Chu Q, Ren Y, Mu Y, Qu L, He J, Jin C, Xia B (2009) Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis. BMC Bioinformatics 10:83. https://doi.org/10.1186/1471-2105-10-83
105. Verhoeven A, Giera M, Mayboroda OA (2018) KIMBLE: a versatile visual NMR metabolomics workbench in KNIME. Anal Chim Acta 1044:66–76
106. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart DS, Xia J (2018) MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Res 46(W1):W486–W494. https://doi.org/10.1093/nar/gky310
107. Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37(Web Server issue):W652–W660. https://doi.org/10.1093/nar/gkp356
108. Xia J, Wishart DS (2011) Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nat Protoc 6(6):743–760. https://doi.org/10.1038/nprot.2011.319
109. Metaboanalyst. https://www.metaboanalyst.ca/MetaboAnalyst/faces/home.xhtml. Accessed February 2019
110. Gaude E, Chignola F, Spiliotopoulos D, Spitaleri A, Ghitti M, Garcia-Manteiga M, Mari S, Musco G (2013) muma, An R package for metabolomics univariate and multivariate statistical analysis. Curr Metabolomics 1(2):180–189. https://doi.org/10.2174/2213235x11301020005
111. Worley B, Powers R (2014) MVAPACK: a complete data handling package for NMR metabolomics. ACS Chem Biol 9(5):1138–1144. https://doi.org/10.1021/cb4008937
112. Jacob D, Deborde C, Lefebvre M, Maucourt M, Moing A (2017) NMRProcFlow: a graphical and interactive tool dedicated to 1D spectra processing for NMR-based metabolomics. Metabolomics 13(4):36. https://doi.org/10.1007/s11306-017-1178-y
113. Gunaratna K, Anderson P, Ranabahu A, Sheth A (2010) A study in hadoop streaming with matlab for NMR data processing. Paper presented at the 2010 IEEE second international conference on cloud computing technology and science.
114. Fitzpatrick MA, McGrath CM, Young SP (2014) Pathomx: an interactive workflow-based tool for the analysis of metabolomic data. BMC Bioinformatics 15(1):396
115. Beirnaert C, Meysman P, Vu TN, Hermans N, Apers S, Pieters L, Covaci A, Laukens K (2018) speaq 2.0: a complete workflow for high-throughput 1D NMR spectra processing and quantification. PLoS Comput Biol 14(3):e1006018. https://doi.org/10.1371/journal.pcbi.1006018
116. Giacomoni F, Le Corguille G, Monsoor M, Landi M, Pericard P, Petera M, Duperier C, Tremblay-Franco M, Martin JF, Jacob D, Goulitquer S, Thevenot EA, Caron C (2015) Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. Bioinformatics 31(9):1493–1495. https://doi.org/10.1093/bioinformatics/btu813
117. Lefort G, Liaubet L, Canlet C, Tardivel P, Pere MC, Quesnel H, Paris A, Iannuccelli N, Vialaneix N, Servien R (2019) ASICS: an R package for a whole analysis workflow of 1D 1H NMR spectra. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz248
118. Chadeau-Hyam M, Ebbels TMD, Brown IJ, Chan Q, Stamler J, Huang CC, Daviglus ML, Ueshima H, Zhao L, Holmes E, Nicholson JK, Elliott P, Iorio MD (2010) Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification. J Proteome Res 9:4620–4627
119. Castagne R, Boulange CL, Karaman I, Campanella G, Santos Ferreira DL, Kaluarachchi MR, Lehne B, Moayyeri A, Lewis MR, Spagou K, Dona AC, Evangelos V, Tracy R, Greenland P, Lindon JC, Herrington D, Ebbels TMD, Elliott P, Tzoulaki I, Chadeau-Hyam M (2017) Improving visualization and interpretation of metabolome-wide association studies: an application in a population-based cohort using untargeted (1)H NMR metabolic profiling. J Proteome Res 16(10):3623–3633. https://doi.org/10.1021/acs.jproteome.7b00344
120. Karnovsky A, Weymouth T, Hull T, Tarcea VG, Scardoni G, Laudanna C, Sartor MA, Stringer KA, Jagadish HV, Burant C, Athey B, Omenn GS (2012) Metscape 2 bioinformatics tool for the analysis and visualization of metabolomics and gene expression data. Bioinformatics 28(3):373–380. https://doi.org/10.1093/bioinformatics/btr661
121. Kamburov A, Cavill R, Ebbels TMD, Herwig R, Keun HC (2011) Integrated pathway-level analysis of transcriptomics and metabolomics data with IMPaLA. Bioinformatics 27(20):2917–2918. https://doi.org/10.1093/bioinformatics/btr499
122. Haug K, Salek RM, Conesa P, Hastings J, de Matos P, Rijnbeek M, Mahendraker T, Williams M, Neumann S, Rocca-Serra P, Maguire E, Gonzalez-Beltran A, Sansone SA, Griffin JL, Steinbeck C (2013) MetaboLights—an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Res 41(Database issue):D781–D786. https://doi.org/10.1093/nar/gks1004
Chapter 6
Abstract
“Omics”-based analyses are widely used in numerous areas of research; advances in instrumentation (both
hardware and software) allow investigators to collect a wealth of data and therein characterize metabolic
systems. Although analyses generally examine differences in absolute or relative (fold-) changes in concen-
trations, the ability to extract mechanistic insight would benefit from the use of isotopic tracers. Herein, we
discuss important concepts that should be considered when stable isotope tracers are used to capture
biochemical flux. Special attention is placed on in vivo systems; however, many of the general ideas have
immediate impact on studies in cellular models or isolated-perfused tissues. While it is somewhat trivial to
administer labeled precursor molecules and measure the enrichment of downstream products, the ability to
make correct interpretations can be challenging. We will outline several critical factors that may influence
choices when developing and/or applying a stable isotope tracer method. For example, is there a “best”
tracer for a given study? How do I administer a tracer? When do I collect my sample(s)? While these
questions may seem straightforward, we will present scenarios that can have dramatic effects on conclusions
surrounding apparent rates of metabolic activity.
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_6, © Springer Science+Business Media, LLC, part of Springer Nature 2020
100 Stephen F. Previs and Daniel P. Downes
Fig. 1 Metabolic flux can change independent of pool size. Sprague-Dawley rats (n = 15, 253 ± 12 g,
mean ± SEM) were fed a standard rodent diet on a 12-h light–dark cycle (dark between 6 PM and 6 AM). An
intraperitoneal bolus of 2H-water (15 μL 99% 2H-water per gram body weight) was given to a subgroup of
5 rats at 6 AM and samples of plasma and mixed leg muscle were collected at 10 AM; a second group was
then given a bolus of 2H-water and samples were collected at 2 PM; a final group was given a bolus of 2H-water at
2 PM and samples were collected at 6 PM. Protein synthesis was estimated from measurements of the
2H-labeling of plasma water and alanine (isolated from total muscle protein, n = 5 per group, Eq. 5) [7]. Although
muscle mass did not change, different metabolic flux (protein synthesis) was observed at different times after
the feeding cycle ended (Asterisk represents p < 0.05 using a 2-tailed t-test, assuming equal variance; this
study was completed using an IACUC approved protocol)
2.1 What Is the “Best” Tracer?
This question is obviously broad, so we will try to provide an
answer in the context of studying lipid flux, more specifically
triglyceride synthesis. Perhaps a first line of thinking is to use a labeled
fatty acid; one is then immediately faced with subsequent questions
around which fatty acid to use and how to formulate the tracer. When
thinking about “which fatty acid,” one may consider 13C- vs
2H-labeled forms, as this could certainly be important from an analytical
perspective, but the question is more complicated in the eyes of
a biologist since palmitate and oleate may not be handled in the
same way [21, 22]. Therefore, choosing one fatty acid over another
could lead to different outcomes in the flux calculations under
conditions where fatty acids are not utilized in proportion to their
availability [6, 21, 22].
A less biased tracer might be labeled glycerol or glucose; these
precursors are expected to provide the backbone on which fatty
acids are esterified [23]. Again, the choice of how to label the
2.2 How Do I Administer the Tracer?
To build on the examples that are outlined above, we need to
appreciate other challenges of the experimental design; namely,
the tracer needs to enter the system that is under investigation. If
we imagine in vivo studies, we are immediately presented with
several options and challenges. Although most of the tracers
noted earlier are water soluble, it is not easy to simply dissolve a
labeled fatty acid in saline prior to administration. Fatty acid tracers
are typically bound to albumin or solubilized with Intralipid [29]
when an intravascular route of administration is used, or they can
Key Concepts Surrounding Studies of Stable Isotope-Resolved Metabolomics 103
Fig. 2 Theoretical labeling profiles for different routes of tracer administration in vivo. If we consider a simple
one compartment model (Panel a), we can expect distinct labeling profiles depending on how a tracer is
administered (Panel b). For simplicity, we have assumed that one would target ~10% enrichment of a
circulating pool. In the case where one contrasts the intravenous (IV) vs intraperitoneal (IP) or oral bolus
methods, we have assumed that the same dose of tracer is administered. For purposes of illustration, we
expect that the maximum labeling achieved with the IP or oral bolus will never reach that when the same dose
of tracer is given as an IV bolus; when using the IP or oral bolus the temporal rise to maximum enrichment is
impacted by the balance between absorption and metabolism kinetics. A constant infusion will be associated
with a rise to steady-state labeling, whereas a primed-infusion could lead to an “immediate” steady state if
the system is well described
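The qualitative profiles described for Fig. 2 can be sketched numerically. The following is a minimal illustration, assuming a single well-mixed compartment with first-order elimination and, for the IP or oral route, first-order absorption; the function names, the rate constants (0.5 and 2.0 per hour), and the 10% target enrichment are hypothetical values chosen for illustration, not parameters taken from the figure.

```python
import math

def iv_bolus(t, peak, k_elim):
    """Enrichment after an IV bolus: instantaneous mixing, first-order loss."""
    return peak * math.exp(-k_elim * t)

def absorbed_bolus(t, peak, k_elim, k_abs):
    """Enrichment after an IP or oral bolus of the same dose.

    First-order absorption feeding first-order elimination (k_abs != k_elim).
    """
    scale = peak * k_abs / (k_abs - k_elim)
    return scale * (math.exp(-k_elim * t) - math.exp(-k_abs * t))

def infusion(t, plateau, k_elim, prime=0.0):
    """Enrichment during a constant infusion, optionally preceded by a prime.

    `prime` is the enrichment produced at t = 0 by the priming dose.
    """
    return plateau * (1.0 - math.exp(-k_elim * t)) + prime * math.exp(-k_elim * t)

# Hypothetical parameters: ~10% target enrichment, rate constants in 1/h.
TARGET, K_ELIM, K_ABS = 10.0, 0.5, 2.0

print("t(h)     IV  IP/oral  infusion  primed")
for t in (0.0, 0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"{t:4.1f}  {iv_bolus(t, TARGET, K_ELIM):5.2f}  "
          f"{absorbed_bolus(t, TARGET, K_ELIM, K_ABS):7.2f}  "
          f"{infusion(t, TARGET, K_ELIM):8.2f}  "
          f"{infusion(t, TARGET, K_ELIM, prime=TARGET):6.2f}")
```

Under these assumptions the absorbed bolus never reaches the IV peak, the constant infusion rises toward the plateau, and a prime equal to the plateau holds the enrichment constant from the start. A mismatched prime would instead relax toward the plateau at the elimination rate.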
2.3 How Can I Get a Metabolic Rate from My Tracer Data?
An important factor to consider when deciding on the approach for
administering a tracer is the type of kinetic data that one intends to
obtain. For example, studying the overall kinetics of a product may
require a different strategy as compared to dissecting the source(s)
of that product. We will consider the kinetics of triglyceride in
adipose tissue as a model problem; however, we will first review
some general concepts concerning a simpler system and then
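\[
\text{rate of appearance} = \frac{F \cdot D \cdot [\text{glucose}] \cdot k_{abs} \cdot k_{dil}}{k_{abs} \cdot c_0^{dil} - c_0^{abs} \cdot k_{dil}} \tag{3}
\]
where $c_0^{abs}$ and $c_0^{dil}$ are the back-calculated concentrations of the tracer at time 0 for the
absorption and dilution (elimination) curves, respectively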
[33]. When tracers are given via an IP or oral bolus one can estimate
the metabolic rate using different approaches. For example, van
Dijk et al. proposed an elegant application of this logic in studies of
glucose kinetics in rodent models, in which they considered the
entire labeling profile [33]. We performed comparable studies but
only modeled the tail of the tracer dilution [32, 37].
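One way to read Eq. 3 is as the absorbed dose divided by the area under a bi-exponential tracer curve, multiplied by the glucose concentration. The sketch below follows that reading and should be taken as an illustration only: it assumes that F is the fraction of the dose that is absorbed, that D is the tracer dose, and that the tracer curve has the form c(t) = c0_dil·exp(−k_dil·t) − c0_abs·exp(−k_abs·t). The function name, argument names, and numbers are hypothetical, and units must be kept consistent by the user.

```python
def rate_of_appearance_bolus(f_abs, dose, glucose, k_abs, k_dil, c0_abs, c0_dil):
    """Rate of appearance from a fitted IP/oral bolus curve (one reading of Eq. 3).

    f_abs   : assumed fraction of the dose absorbed (F)
    dose    : assumed tracer dose (D)
    glucose : concentration of the unlabeled analyte
    k_abs, k_dil   : fitted absorption and dilution rate constants
    c0_abs, c0_dil : back-calculated tracer concentrations at time 0
    """
    denominator = k_abs * c0_dil - c0_abs * k_dil
    if denominator <= 0:
        raise ValueError("fitted parameters give a non-positive tracer AUC")
    return f_abs * dose * glucose * k_abs * k_dil / denominator

# Hypothetical fitted values, for illustration only.
params = dict(f_abs=1.0, dose=100.0, glucose=5.0,
              k_abs=3.0, k_dil=0.6, c0_abs=2.0, c0_dil=2.0)
ra = rate_of_appearance_bolus(**params)

# Cross-check: the same number from dose / AUC of the bi-exponential curve.
auc = params["c0_dil"] / params["k_dil"] - params["c0_abs"] / params["k_abs"]
ra_from_auc = params["f_abs"] * params["dose"] / auc * params["glucose"]
print(ra, ra_from_auc)
```

The cross-check makes the design choice explicit: the denominator of Eq. 3 is the tracer AUC multiplied by k_abs·k_dil, so a poor fit of either exponential term propagates directly into the estimated rate.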
When either a constant infusion (triangles) or a primed-
infusion (squares) is used, we can estimate the turnover from the
steady-state enrichment (i.e., the dilution rate of the tracer) accord-
ing to the equation:
\[
\text{rate of appearance} = \text{tracer infusion rate} \times \left( \frac{\text{enrichment of the infusate}}{\text{enrichment of the analyte pool}} - 1 \right) \tag{4}
\]
where the units that define the “rate of appearance” are the same as
those used to define the “tracer infusion rate.” We should note that
this approach is often used in studies of circulating analytes; there-
fore, the “enrichment of the analyte pool” represents the enrich-
ment of some endogenous molecule in the plasma. In theory, the
use of a primed-infusion allows one to reach a steady-state enrich-
ment of the analyte pool in a shorter amount of time as compared
to when a constant infusion is used. That said, there are examples
where achieving the correct balance between priming doses and
infusion rates can require some development; otherwise, investigators
run the risk of drawing misleading conclusions [35, 38, 39].
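As a worked example of the steady-state dilution calculation in Eq. 4, the short sketch below multiplies the tracer infusion rate by the ratio of infusate to analyte-pool enrichment minus one. The function name and the numerical values are hypothetical; the result carries the same units as the infusion rate, as noted in the text.

```python
def rate_of_appearance_infusion(infusion_rate, infusate_enrichment, pool_enrichment):
    """Steady-state rate of appearance from a constant or primed infusion (Eq. 4)."""
    if not 0.0 < pool_enrichment <= infusate_enrichment:
        raise ValueError("pool enrichment must be positive and no larger than "
                         "the infusate enrichment")
    return infusion_rate * (infusate_enrichment / pool_enrichment - 1.0)

# Hypothetical example: tracer infused at 1.0 (e.g., umol/kg/min) from a 99%
# enriched infusate, with the plasma pool settling at 5% enrichment.
ra = rate_of_appearance_infusion(1.0, 99.0, 5.0)
print(f"rate of appearance = {ra:.1f} (same units as the infusion rate)")
```

Because the pool enrichment sits in the denominator, a sample taken before the plateau is reached (a lower enrichment than the true steady state) would overestimate the rate of appearance, which is one reason the balance between prime and infusion matters.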
The types of scenarios that were just described have been widely
used to quantify the kinetics in cases where a pool turns over with
considerable frequency over the duration of an experiment (circu-
lating glucose is an excellent model problem). However, there are
other cases where a pool may experience very limited synthesis
(e.g., mixed muscle protein, Fig. 1). We will consider how tracers
can be used to estimate kinetics in these cases as well. This next
scenario emphasizes a key point: we may not be able to
directly administer a labeled form of the product. Consequently,
one will typically rely on the use of a precursor-product labeling
Fig. 3 Theoretical precursor and product labeling profiles for different routes of tracer administration in vivo.
Panel a considers a case where two subjects are given the same IV bolus of a labeled precursor (e.g., [U-13C6]
glucose) and the temporal dilution is identical between the subjects (dashed line); however, the product (e.g.,
triglyceride) is turning over at ~0.04 or 0.80 pool/h (solid or open symbols, respectively). Panel b considers the
same scenario, except that the precursor is given as a primed-infusion (dashed line); again, the product is
turning over at ~0.04 or 0.80 pool/h (solid or open symbols, respectively)
2.4 How Many Samples Do I Need and When Should They Be Collected?
Building on the example shown in Fig. 3 raises obvious questions
around how the number of samples and the timing of their collection
will impact our conclusions [46]. In our experience, even if we
could collect multiple samples the use of a single tracer bolus is not
practical for studying a problem such as triglyceride flux in adipose
tissue. The pool size is generally quite large, and the turnover is
relatively slow. We should note that the time scale used here is
exaggerated even for rodent studies [5, 6]. This does not mean
that the logic surrounding a bolus administration of a tracer is
flawed; in fact, this approach is used in studies of plasma lipoprotein
kinetics [23, 47]. However, if a single bolus is administered, we are
somewhat obligated to collect multiple samples to determine if we
are on the up-slope or the down-slope of a product labeling curve.
Note that for the rapidly turning over product in Fig. 3a, the
enrichment is comparable at ~1 h and ~5 h. If we had only collected
a sample at ~5 h, we could draw appropriate conclusions regarding
the relative differences between the kinetics of the two product
pools. The open circles clearly demonstrate substantially more
enrichment as compared to the closed circles, consistent with the
faster relative turnover (Fig. 3a). Unfortunately, we would be far
from an accurate estimate of the true turnover. For example, the
open circles are on the descending phase of the enrichment curve
by 5 h, and we would therefore grossly underestimate the synthesis
in that condition.
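The up-slope and down-slope behavior after a bolus can be sketched with a simple precursor-product model. This is a minimal illustration, assuming that the precursor enrichment decays as a single exponential and that the product is a single pool; the precursor decay constant is a hypothetical value and is not taken from Fig. 3. Only the two product turnover rates (~0.04 and ~0.80 pool/h) come from the text.

```python
import math

def product_after_bolus(t_h, k_product, k_precursor, precursor_0=1.0):
    """Product enrichment after a precursor bolus (single-pool model).

    Assumes precursor(t) = precursor_0 * exp(-k_precursor * t) and
    dP/dt = k_product * (precursor - P), with P(0) = 0.
    """
    scale = k_product * precursor_0 / (k_product - k_precursor)
    return scale * (math.exp(-k_precursor * t_h) - math.exp(-k_product * t_h))

def time_of_peak(k_product, k_precursor):
    """Time at which the product enrichment stops rising and starts falling."""
    return math.log(k_product / k_precursor) / (k_product - k_precursor)

K_PRECURSOR = 0.2  # hypothetical precursor decay constant (1/h), illustration only

for label, k in (("slow (0.04 pool/h)", 0.04), ("fast (0.80 pool/h)", 0.80)):
    peak = time_of_peak(k, K_PRECURSOR)
    at_1h = product_after_bolus(1.0, k, K_PRECURSOR)
    at_5h = product_after_bolus(5.0, k, K_PRECURSOR)
    print(f"{label}: peak at {peak:.1f} h, "
          f"relative enrichment {at_1h:.2f} at 1 h and {at_5h:.2f} at 5 h")
```

Under these assumptions the fast pool has already passed its peak by 5 h while the slow pool is still rising, so a single 5 h sample ranks the two pools correctly but cannot reveal which side of the curve either one is on.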
Those conclusions are in strong contrast with the example in
Fig. 3b, where the enrichment continues to increase until it reaches
a plateau, making it possible to consider using a single time point to
estimate the kinetics [42]. Such a statement would be appealing to
the analytical lab since it would mean fewer samples to process and
less data to acquire, which implies a level of resource sparing.
However, it also has the potential drawback that the longer we
wait to collect our samples the less accurate will be our estimation of
the true differences between the two product pools [42]. Although
it is tempting to think that a single data point can be used to infer
the kinetics (e.g., Eq. 5), several factors must align to ensure the
validity of that logic. Given the complexity of some biological
problems it is perhaps better to consider more extensive sampling
schemes. At the very least, pilot studies should evaluate critical
assumptions noted here. Those preliminary data may suggest rea-
sonable options to simplify the experimental design in follow up
studies.
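The cost of sampling late during a primed-infusion can be made concrete with a short calculation. The sketch below assumes that the precursor enrichment is constant and that each product follows a single-exponential rise to plateau; the function name is hypothetical. It compares the observed ratio of product enrichments with the true ratio of the two turnover rates used for Fig. 3.

```python
import math

def product_enrichment(t_h, k_per_h, precursor=10.0):
    """Product enrichment (%) when the precursor is held at a constant enrichment."""
    return precursor * (1.0 - math.exp(-k_per_h * t_h))

K_SLOW, K_FAST = 0.04, 0.80  # pool/h, the two turnover rates used for Fig. 3
true_ratio = K_FAST / K_SLOW

for t in (0.5, 1, 2, 6, 12, 24):
    observed = product_enrichment(t, K_FAST) / product_enrichment(t, K_SLOW)
    print(f"t = {t:>4} h  fast/slow enrichment ratio = {observed:5.1f}  "
          f"(true ratio of rates = {true_ratio:.0f})")
```

The observed ratio shrinks toward 1 as both products approach the same plateau, which is why a late single sample understates the true difference between the pools.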
2.5 How Do We Know Whether the Intracellular Dilution of a Precursor Is Creating a Problem and How Can We Adjust the Calculations of Metabolic Flux?
exemplified if we build off Fig. 3b. We will add some complexity to
reflect true events surrounding metabolic biochemistry and integrative
physiology.
The example in Fig. 3b shows a case where our substrate (e.g.,
13C-glucose) is the only precursor for the glycerol backbone in
triglyceride; therefore, we expect that we will eventually reach a
steady-state enrichment of triglyceride-glycerol and that its enrichment
will equal that of the precursor. In the example, the labeled
glucose precursor pool is ~10% enriched; consequently, the product
should reach the same level if there are no other sources of the
glycerol backbone. However, several studies have demonstrated the
existence of alternative sources of triglyceride-glycerol [5, 48–
51]. For example, although the adipose tissue is thought to have
little or no glycerokinase, it has been hypothesized that glycero-
neogenesis could occur (i.e., the conversion of lactate, pyruvate,
etc. to triose-phosphates) [52]. Let us further evolve the reaction
sequence in our metabolic pathway to see what happens if glucose is
not the only source of triglyceride-glycerol.
Figure 4a outlines a scenario that is supported by experimental
data. If we assume that 13C-glucose is given as a primed-infusion,
and that the precursor is ~10% enriched we expect that the enrich-
ment of triglyceride-glycerol will reach that of the precursor
(as shown in Fig. 4c, open triangles). However, if we found com-
parable precursor enrichment in two conditions but different prod-
uct enrichment (i.e., we collected samples from the respective
groups at any of the time points noted in Fig. 4c) can we confi-
dently state that the metabolic flux of the product is different
between the respective groups? The answer is “no.” The
examples represented in Fig. 4a, b demonstrate a case where two
groups are maintained at the same glucose enrichment (~10%) and
the product enrichment differs approximately twofold between the
respective curves; however, we do not know whether some factor other than
triglyceride synthesis is altered. In fact, the simulation shown here
reflects differences in the precursor dilution and not in the
triglyceride kinetics. The triangles (Fig. 4c) represent the enrichment of
triglyceride-glycerol; in each case the glucose is ~10% enriched and
the half-life of triglyceride is ~3.4 h (or ~0.2 pool/h). However,
the open triangles reflect the outcome if 13C-glucose was the sole
precursor (Fig. 4a), whereas the solid triangles reflect the outcome
if there was another source of triglyceride-glycerol (lactate,
pyruvate, etc.) that enters the cell “cold” (Fig. 4b). Since this second
source of the glycerol backbone is not labeled, the pathway is
invisible. This underscores a key assumption: the labeled
precursor that we administer may not be the true precursor. For
example, if we collected a single sample at ~6 h and we applied Eq. 5
we would conclude that the turnover was ~0.11 vs ~0.06 pools/h,
open vs. solid triangles, respectively. This is in strong contrast to the
true value of ~0.20 pools/h in both cases.
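The single-point error can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that Eq. 5 takes the single-point linear form (product enrichment divided by the product of precursor enrichment and time) and that the product follows a single-exponential rise to the enrichment of its true precursor. The function names are hypothetical; the turnover (~0.20 pool/h), the glucose enrichment (~10%), the 6 h sampling time, and the twofold dilution are the values given for Fig. 4.

```python
import math

TRUE_K = 0.20    # pool/h, the turnover used for both curves in Fig. 4c
GLUCOSE = 10.0   # % enrichment of the infused glucose
SAMPLE_T = 6.0   # h, time of the single sample

def product_enrichment(t_h, k_per_h, true_precursor):
    """Single-exponential rise of the product toward the true precursor enrichment."""
    return true_precursor * (1.0 - math.exp(-k_per_h * t_h))

def single_point_estimate(product, assumed_precursor, t_h):
    """Assumed linear form of Eq. 5: product / (precursor x time)."""
    return product / (assumed_precursor * t_h)

# dilution = fraction of the true precursor pool that derives from labeled glucose
for label, dilution in (("glucose only", 1.0), ("glucose + cold source", 0.5)):
    product = product_enrichment(SAMPLE_T, TRUE_K, GLUCOSE * dilution)
    apparent = single_point_estimate(product, GLUCOSE, SAMPLE_T)
    print(f"{label}: product = {product:.2f}%, apparent turnover = "
          f"{apparent:.3f} pool/h (true = {TRUE_K:.2f})")
```

Under these assumptions the apparent values fall close to the ~0.11 and ~0.06 pools/h quoted above. Both are below the true rate because the 6 h sample lies beyond the linear phase, and the second is halved again because glucose overstates the true precursor labeling.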
Fig. 4 Product labeling can be impacted by biochemical rates and sources of substrates. One can consider
using a primed-infusion of [U-13C6]glucose in studies of triglyceride kinetics. For simplicity, we assume that
13C- will only be incorporated into the glycerol-backbone (in reality, 13C- can be incorporated into the fatty
acids as well). If glucose is the only carbon source (Panel a), the 13C- enrichment of triglyceride-glycerol will
reach a steady-state (Panel c, open triangles), which is equal to that of the 13C-glucose being infused (Panel c,
dashed line). In theory, one could estimate triglyceride turnover by collecting a single sample during the
pseudolinear phase of the triglyceride labeling and comparing that against the 13C-enrichment of the infused
glucose (see Eq. 5). In contrast, if glucose and pyruvate, for example, each made similar contributions to the
triose-phosphate (and glycerol-phosphate) pool (Panel b), we could observe very different absolute changes in
the triglyceride enrichment (Panel c, solid triangles). In this case, we would draw erroneous conclusions by
comparing the enrichment of the product with that of the infused glucose (see Eq. 5). The true precursor (i.e.,
glycerol-3-phosphate) receives equal input of carbon from labeled (e.g., glucose) and unlabeled (e.g.,
pyruvate) sources. If we sampled at several time points (e.g., 2, 6, 12, and 24 h) and applied Eq. 6 we
would determine that the triglyceride turnover is the same between the groups (i.e., the correct conclusion).
Note that the correct stoichiometry to generate the curve represented by the closed triangles in Panel c is 1
glucose and 2 pyruvate
was different between the groups (as we just noted). However, this
would assume that the 13C-enrichment of glucose reflects the true
precursor labeling for both groups. On the other hand, if we were
able to collect several samples across a reasonable time range (e.g.,
~2, 6, 12, and 24 h) we would immediately recognize that the
fractional turnover of the triglyceride is comparable between the
two groups, that is, if we fit the enrichment curves using Eq. 6 we
would see that it takes the same amount of time to reach a plateau in
both groups (Fig. 4c). We would also note that the plateau labeling
is different between the groups, which could be explained by the
entry of “cold” carbon; the dilution of our labeled glucose precur-
sor could be “proved” by measuring the enrichment of an interme-
diate at or beyond the mixing point. That is, if we sampled the
triose-phosphates or glycerol-3-phosphate we would see that the
enrichment is 50% lower than that of the glucose. Future studies
could then be designed with this new knowledge in mind. For
example, we could sample the triose-phosphates or glycerol-3-phosphate
enrichment and set that value as the true precursor
labeling. We could then consider using a single time point (e.g.,
measure triglyceride-glycerol at a 6 h time point and apply Eq. 5) and
thereby draw reliable conclusions regarding triglyceride kinetics
[42]. This could make some aspects of our studies simpler since
we could perhaps then collect and measure fewer samples, and the
experimental time might be shortened (e.g., from 24 h to 6 h).
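The multi-point analysis can be sketched as a small curve fit. The example below assumes that Eq. 6 describes a single-exponential rise to plateau, E(t) = plateau × (1 − e^(−kt)), and uses noise-free synthetic data generated from the values given for Fig. 4. The function name and the grid-search approach are illustrative choices, not the authors' fitting procedure.

```python
import math

def fit_rise_to_plateau(times_h, enrichments):
    """Least-squares fit of E(t) = plateau * (1 - exp(-k t)) by a grid search over k."""
    best = None
    for step in range(1, 2001):
        k = step * 0.001  # candidate k from 0.001 to 2.000 pool/h
        shape = [1.0 - math.exp(-k * t) for t in times_h]
        # For a fixed k the best plateau has a closed form (linear least squares).
        plateau = (sum(e * s for e, s in zip(enrichments, shape))
                   / sum(s * s for s in shape))
        sse = sum((e - plateau * s) ** 2 for e, s in zip(enrichments, shape))
        if best is None or sse < best[0]:
            best = (sse, k, plateau)
    return best[1], best[2]

times = [2.0, 6.0, 12.0, 24.0]  # h, the sampling times suggested in the text
glucose = 10.0                  # % enrichment of the infused glucose

# Synthetic, noise-free data: same turnover (0.2 pool/h), different plateaus.
groups = {
    "glucose only": [10.0 * (1.0 - math.exp(-0.2 * t)) for t in times],
    "glucose + cold source": [5.0 * (1.0 - math.exp(-0.2 * t)) for t in times],
}

for label, data in groups.items():
    k, plateau = fit_rise_to_plateau(times, data)
    print(f"{label}: k = {k:.2f} pool/h, plateau = {plateau:.1f}%, "
          f"plateau/glucose = {plateau / glucose:.2f}")
```

In this sketch both groups return the same fractional turnover, and the ratio of plateau to glucose enrichment reports the dilution of the precursor. Real data carry measurement error, so a weighted or nonlinear fit from a statistics package would normally be preferred.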
A critical take-home message is that although we may observe
the incorporation of a labeled precursor into a product we should
question the potential of any dilution in the steps between the
precursor and the product. The example shown here might seem
extreme since there are several reactions between glucose and
triglyceride-glycerol and readers may think that if we use a more
direct precursor that this problem would not arise. That is not
entirely the case: studies of protein synthesis often administer a
labeled amino acid and measure its incorporation into a protein(s)
[35, 53]. Readers may recognize that there are fewer steps in
that precursor → product relationship, that is, free amino
acids → tRNA-bound amino acids → protein-bound amino acids.
However, another problem is observed: the enrichment of the
amino acid that is administered can undergo substantial intracellu-
lar dilution since cells continuously degrade proteins and recycle
amino acids. In fact, the enrichment of free amino acids has been
shown to experience ~40% dilution upon entry into a cell [54].
This raises a question regarding the novelty of using 2H- or
18O-water to study metabolic flux [24]. Briefly, water readily moves
across membranes and its enrichment is virtually identical in all
body compartments shortly after it is administered [55, 56]. For
example, we have given ~5–20 μL of 2H-water per gram body
weight to rodents (via an intraperitoneal bolus) and observed
90% distribution within ~20 min (unpublished observations).
Since labeled water rapidly enters cells we should expect a homoge-
neous labeling. However, we then need to assume that the entry of
3.2 Cross Talk of Tracers Between Tissues
If we aim to study metabolic activity in vivo we must recognize that
our primary tracer can generate other labeled precursors. Not only
can this occur in the form of different mass or positional isomers of
the original precursor (shown in Fig. 5) but one can also expect new
sources of label altogether. An example of secondary tracers can be
Key Concepts Surrounding Studies of Stable Isotope-Resolved Metabolomics 113
Fig. 5 Impact of side compartments on isotope scrambling and potential errors estimating biochemical flux.
The biochemical model outlines potential reactions in the conversion of [2H5]glycerol to triglyceride-glycerol
(Panel a). Although there are two steps in the pathway (i.e., glycerol → glycerol-3-phosphate → triglyceride),
the pool of glycerol-3-phosphate is also linked to the pool of triose-phosphates. Consequently, the relative
rates of flux (i.e., glycerol-3-phosphate → triglyceride vs glycerol-3-phosphate ↔ triose-phosphates) will
impact the mass isotopomer distribution profile of triglyceride-glycerol (Panel a). Hepatocytes were isolated
from rats and incubated immediately with ~150 μM [2H5]glycerol (M5, >95% enriched); triglycerides were
isolated at different times and the appearance of various isotopically labeled species (M1 → M5) was
measured (Panel b, unpublished data from previous studies [76]). Note that solid and open squares represent
2H and 1H, respectively
Fig. 6 Integrative physiology and tissue-induced tracer cross talk. The interpretation of tracer studies requires
special attention when there is tissue cross talk. Although labeled-glucose can be used to quantify triglyceride
synthesis, one should exercise caution when choosing the tracers if the goal is to sort out the contribution of
specific pathways. As shown in Fig. 4, sampling of glycerol-phosphate (or triose-phosphates) may be
necessary to study true rates of triglyceride turnover. One could infer that comparing the enrichment of
glycerol-phosphate with that of glucose could also yield information on the source(s) of triglyceride-glycerol.
Panel A demonstrates that it is possible to “recycle” label from one tissue to another. In this case, the infusion
of [U-13C6]glucose can lead to the generation of 13C-pyruvate in an adipocyte (via glycolytic flux/transfer of
13C- from the red blood cell, RBC). The solid red circles represent 13C-lactate/pyruvate that traverses the
Krebs cycle prior to conversion into triglyceride-glycerol; those reactions would lead to the generation of
different mass isotopomers. In contrast, Panel B demonstrates that [6,6-2H2]glucose should not be impacted
by this type of tissue cross talk; 2H should only appear in triglyceride-glycerol if there is direct conversion of
glucose. The 2H-labeling of pyruvate could be affected by keto-enol tautomerism, transamination and/or by
traversing the TCA cycle before it moves to glycerol-3-phosphate
4 Notes
was ~10% enriched and we noted that our product, that is,
triglyceride-glycerol, could reach 10% enrichment if glucose was
the sole substrate being used in the synthesis. In cases where we aim
to measure the breakdown, can we extract information from the
decrease in enrichment? For example, if one thinks about classical
pulse-chase experiments, intuition might lead one to consider the
decrease in enrichment being caused by a clearance or degradation
process. In fact, this is not true [13]. The decrease in enrichment
that one observes during a “chase,” that is, once the precursor
input has been stopped, will reflect the dilution of a tracer (or,
stated another way, the synthesis of that product from a cold
source). There seems to be some confusion in the literature regard-
ing these matters, and this is certainly one area where stable
isotope-based experiments can deviate from classical radiolabeled
tracer studies [13].
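The distinction between degradation and dilution during a chase can be illustrated with a minimal sketch, assuming a single product pool with first-order degradation that removes labeled and unlabeled molecules alike. The function name, pool sizes, and rate constant are hypothetical. The two cases differ only in whether new unlabeled product is synthesized during the chase.

```python
import math

def chase(t_h, k_per_h, labeled_0, unlabeled_0, synthesis_on):
    """Labeled amount, total pool, and enrichment (%) during a chase.

    Degradation is first order and does not discriminate between labeled and
    unlabeled molecules. When synthesis_on is True, new unlabeled product
    replaces what is degraded, so the pool size stays constant; when False,
    the pool simply shrinks.
    """
    decay = math.exp(-k_per_h * t_h)
    labeled = labeled_0 * decay
    total_0 = labeled_0 + unlabeled_0
    total = total_0 if synthesis_on else total_0 * decay
    return labeled, total, 100.0 * labeled / total

for t in (0, 2, 6, 12):
    _, _, with_synthesis = chase(t, 0.2, 10.0, 90.0, synthesis_on=True)
    _, _, without_synthesis = chase(t, 0.2, 10.0, 90.0, synthesis_on=False)
    print(f"t = {t:>2} h  enrichment with synthesis: {with_synthesis:5.2f}%  "
          f"without synthesis: {without_synthesis:5.2f}%")
```

In both cases the same amount of labeled product is degraded, yet the enrichment falls only when unlabeled product is being made, which is the sense in which the decline reports synthesis from a cold source.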
In our experience it is not possible to give strict conditions for
designing a tracer study, as the exact details will depend on many
variables. For example, we would likely design different protocols if
we were interested in studying triglyceride synthesis in adipose
tissue vs liver in a mouse model or if we were interested in triglyc-
eride synthesis and secretion in plasma in a rodent model vs a
human subject. This is in strong contrast to purely analytical meth-
ods, that is, lipidomic analyses that might be used to measure the
abundance of a triglyceride. In those cases, the same (or similar
enough) approach can be used to quantify the abundance and
isotopic labeling of triglycerides regardless of their source. Classical
methods (e.g., the Folch or Bligh and Dyer extraction) can be used
to isolate triglycerides from plasma, liver or adipose tissue samples
(regardless of the model). Therefore, when one aims to quantify the
abundance and/or isotope labeling of an analyte it is possible to
devise more concrete steps or rules to ensure consistency of the
data.
Hopefully we have provided some general guidelines on which
to begin designing tracer experiments. Investigators should appre-
ciate the fact that protocols which are valid in one type of study may
require substantial editing for use in another type of study, even if it
appears that we have the same goal in the respective studies. In our
experience tracer studies require that one respect a handful of rules
but there is also an opportunity for creativity. Everyone will recog-
nize a basic rule learned in elementary school, that is, mixing blue
and yellow yields green; however, artists regularly adjust the mix-
ture to achieve the “best” green for a given painting. In our
opinion, the effective application of tracer methods requires that
one balance some strict principles against some imagination since
biological problems vary in their complexity, including the con-
straints that are imposed when examining different models.
Acknowledgments
References
1. Samuel VT, Liu ZX, Qu X, Elder BD, Bilz S, Befroy D, Romanelli AJ, Shulman GI (2004) Mechanism of hepatic insulin resistance in non-alcoholic fatty liver disease. J Biol Chem 279(31):32345–32353
2. Erion DM, Shulman GI (2010) Diacylglycerol-mediated insulin resistance. Nat Med 16(4):400–402
3. Coen PM, Goodpaster BH (2012) Role of intramyocellular lipids in human health. Trends Endocrin Metab 23(8):391–398
4. Turner SM, Hellerstein MK (2005) Emerging applications of kinetic biomarkers in preclinical and clinical drug development. Curr Opin Drug Discov Devel 8(1):115–126
5. Bederman IR, Foy S, Chandramouli V, Alexander JC, Previs SF (2009) Triglyceride synthesis in epididymal adipose tissue contribution of glucose and non-glucose carbon sources. J Biol Chem 284(10):6101–6108
6. Brunengraber DZ, McCabe BJ, Kasumov T, Alexander JC, Chandramouli V, Previs SF (2003) Influence of diet on the modeling of adipose tissue triglycerides during growth. Am J Physiol Endocrinol Metab 285(4):E917–E925
7. Gasier HG, Fluckey JD, Previs SF (2010) The application of 2H2O to measure skeletal muscle protein synthesis. Nutr Metab 7:31
8. Previs SF, Fatica R, Chandramouli V, Alexander JC, Brunengraber H, Landau BR (2004) Quantifying rates of protein synthesis in humans by use of 2H2O: application to patients with end-stage renal disease. Am J Physiol Endocrinol Metab 286(4):E665–E672
9. Wilkinson DJ, Franchi MV, Brook MS, Narici MV, Williams JP, Mitchell WK, Szewczyk NJ, Greenhaff PL, Atherton PJ, Smith K (2014) A validation of the application of D2O stable isotope tracer techniques for monitoring day-to-day changes in muscle protein subfraction synthesis in humans. Am J Physiol Endocrinol Metab 306(5):E571–E579
10. Miller BF, Wolff CA, Peelor FF III, Shipman PD, Hamilton KL (2015) Modeling the contribution of individual proteins to mixed skeletal muscle protein synthetic rates over increasing periods of label incorporation. J Appl Physiol 118(6):655–661
11. Rachdaoui N, Austin L, Kramer E, Previs MJ, Anderson VE, Kasumov T, Previs SF (2009) Measuring proteome dynamics in vivo. Mol Cell Proteomics 8(12):2653–2663
12. Busch R, Kim YK, Neese RA, Schade-Serin V, Collins M, Awada M, Gardner JL, Beysen C, Marino ME, Misell LM et al (2006) Measurement of protein turnover rates by heavy water labeling of nonessential amino acids. Biochim Biophys Acta 1760(5):730–744
13. Daurio NA, Wang SP, Chen Y, Zhou H, McLaren DG, Roddy TP, Johns DG, Milot D, Kasumov T, Erion MD et al (2017) Enhancing studies of pharmacodynamic mechanisms via measurements of metabolic flux: fundamental concepts and guiding principles for using stable isotope tracers. J Pharmacol Exp Ther 363(1):80–91
14. Samarel AM (1991) In vivo measurements of protein turnover during muscle growth and atrophy. FASEB J 5(7):2020–2028
15. DeFronzo RA, Ferrannini E (1987) Regulation of hepatic glucose metabolism in humans. Diabetes Metab Rev 3(2):415–459
16. DeFronzo RA, Ferrannini E, Hendler R, Wahren J, Felig P (1978) Influence of hyperinsulinemia, hyperglycemia, and the route of glucose administration on splanchnic glucose exchange. Proc Natl Acad Sci U S A 75(10):5173–5177
17. Previs SF, Brunengraber DZ, Brunengraber H (2009) Is there glucose production outside of the liver and kidney? Annu Rev Nutr 29:43–57
18. Hundal RS, Krssak M, Dufour S, Laurent D, Lebon V, Chandramouli V, Inzucchi SE, Schumann WC, Petersen KF, Landau BR et al (2000) Mechanism by which metformin reduces glucose production in type 2 diabetes. Diabetes 49(12):2063–2069
19. Shulman GI, Landau BR (1992) Pathways of glycogen repletion. Physiol Rev 72(4):1019–1035
20. Chung ST, Chacko SK, Sunehag AL, Haymond MW (2015) Measurements of gluconeogenesis and glycogenolysis: a methodological review. Diabetes 64(12):3996–4010
21. Bessesen DH, Vensor SH, Jackman MR (2000) Trafficking of dietary oleic, linolenic, and stearic acids in fasted or fed lean rats. Am J Physiol Endocrinol Metab 278(6):E1124–E1132
22. Romanski SA, Nelson RM, Jensen MD (2000) Meal fatty acid uptake in human adipose tissue: technical and experimental design issues. Am J Physiol Endocrinol Metab 279(2):E447–E454
23. Patterson BW, Mittendorfer B, Elias N, Satyanarayana R, Klein S (2002) Use of stable isotopically labeled tracers to measure very low density lipoprotein-triglyceride turnover. J Lipid Res 43(2):223–233
24. Previs SF, McLaren DG, Wang SP, Stout SJ, Zhou H, Herath K, Shah V, Miller PL, Wilsie L, Castro-Perez J et al (2014) New methodologies for studying lipid synthesis and turnover: looking backwards to enable moving forwards. Biochim Biophys Acta 1842(3):402–413
25. Previs SF, Kelley DE (2015) Tracer-based assessments of hepatic anaplerotic and TCA cycle flux: practicality, stoichiometry, and hidden assumptions. Am J Physiol Endocrinol Metab 309(8):E727–E735
26. Befroy DE, Perry RJ, Jain N, Dufour S, Cline GW, Trimmer JK, Brosnan J, Rothman DL, Petersen KF, Shulman GI (2014) Direct assessment of hepatic mitochondrial oxidative and anaplerotic fluxes in humans using dynamic 13C magnetic resonance spectroscopy. Nat Med 20(1):98–102
27. Hellerstein MK, Christiansen M, Kaempfer S, Kletke C, Wu K, Reid JS, Mulligan K, Hellerstein NS, Shackleton CHL (1991) Measurement of de novo hepatic lipogenesis in humans using stable isotopes. J Clin Investig 87(5):1841–1852
28. Beysen C, Ruddy M, Stoch A, Mixson L, Rosko K, Riiff T, Turner SM, Hellerstein MK, Murphy EJ (2018) Dose-dependent quantitative effects of acute fructose administration on hepatic de novo lipogenesis in healthy humans. Am J Physiol Endocrinol Metab 315(1):E126–E132
29. McLaren DG, He T, Wang SP, Mendoza V, Rosa R, Gagen K, Bhat G, Herath K, Miller PL, Stribling S et al (2011) The use of stable-isotopically labeled oleic acid to interrogate lipid assembly in vivo: assessing pharmacological effects in preclinical species. J Lipid Res 52(6):1150–1161
30. Barrows BR, Timlin MT, Parks EJ (2005) Spillover of dietary fatty acids and use of serum nonesterified fatty acids for the synthesis of VLDL-triacylglycerol under two different feeding regimens. Diabetes 54(9):2668–2673
31. Verhoeven NM, Schor DSM, Previs SF, Brunengraber H, Jakobs C (1997) Stable isotope studies of phytanic acid alpha-oxidation: in vivo production of formic acid. Eur J Pediatr 156:S83–S87
32. Wang SP, Zhou D, Yao Z, Satapati S, Chen Y, Daurio NA, Petrov A, Shen X, Metzger D, Yin W et al (2016) Quantifying rates of glucose production in vivo following an intraperitoneal tracer bolus. Am J Physiol Endocrinol Metab 311(6):E911–E921
33. van Dijk TH, Laskewitz AJ, Grefhorst A, Boer TS, Bloks VW, Kuipers F, Groen AK, Reijngoud DJ (2013) A novel approach to monitor glucose metabolism using stable isotopically labelled glucose in longitudinal studies in mice. Lab Anim 47(2):79–88
34. Sun RC, Fan TW, Deng P, Higashi RM, Lane AN, Le AT, Scott TL, Sun Q, Warmoes MO, Yang Y (2017) Noninvasive liquid diet delivery of stable isotopes into mouse models for deep metabolic network tracing. Nat Commun 8(1):1646
35. Wolfe RR, Chinkes DL (2005) Isotope tracers in metabolic research: principles and practice of kinetic analyses. Wiley-Liss, Hoboken, NJ
36. Shipley RA, Clark RE (1972) Tracer methods for in vivo kinetics. Theory and applications. Academic, New York
37. Wang SP, Satapati S, Daurio NA, Kelley DE, Previs SF (2017) Reply to letter to the editor: “The art of quantifying glucose metabolism”. Am J Physiol Endocrinol Metab 313(2):E259–E261
38. Matthews DE, Downey RS (1984) Measurement of urea kinetics in humans: a validation of stable isotope tracer methods. Am J Phys 246(6 Pt 1):E519–E527
39. Ostlund RE Jr, Matthews DE (1993) [13C]cholesterol as a tracer for studies of cholesterol metabolism in humans. J Lipid Res 34(10):1825–1831
40. Zhou H, Wang SP, Herath K, Kasumov T, Sadygov RG, Previs SF, Kelley DE (2015) Tracer-based estimates of protein flux in cases of incomplete product renewal: evidence and implications of heterogeneity in collagen turnover. Am J Physiol Endocrinol Metab 309(2):E115–E121
41. Foster DM, Barrett PH, Toffolo G, Beltz WF, Cobelli C (1993) Estimating the fractional synthetic rate of plasma apolipoproteins and lipids from stable isotope data. J Lipid Res 34(12):2193–2205
42. Daurio NA, Wang Y, Chen Y, Zhou H, Carballo-Jane E, Mane J, Rodriguez CG, Zafian P, Houghton A, Addona G et al (2019) Spatial and temporal studies of metabolic activity: contrasting biochemical kinetics in tissues and pathways during fasted and fed states. Am J Physiol Endocrinol Metab 316(6):E1105–E1117
43. Bederman IR, Dufner DA, Alexander JC, Previs SF (2006) Novel application of the “doubly labeled” water method: measuring CO2 production and the tissue-specific dynamics of lipid and protein in vivo. Am J Physiol Endocrinol Metab 290(5):E1048–E1056
44. Steele R (1971) Tracer probes in steady-state systems. Charles C Thomas, Springfield, Illinois
45. Frayn KN, Coppack SW, Fielding BA, Humphreys SM (1995) Coordinated regulation of hormone-sensitive lipase and lipoprotein lipase in human adipose tissue in vivo: implications for the control of fat storage and fat mobilization. Adv Enzym Regul 35:163–178
46. Previs SF, Herath K, Castro-Perez J, Mahsut A, Zhou H, McLaren DG, Shah V, Rohm RJ, Stout SJ, Zhong W et al (2015) Effect of error propagation in stable isotope tracer studies: an approach for estimating impact on apparent biochemical flux. Methods Enzymol 561:331–358
47. Melish J, Le NA, Ginsberg H, Steinberg D, Brown WV (1980) Dissociation of apoprotein B and triglyceride production in very-low-density lipoproteins. Am J Phys 239(5):E354–E362
48. Chen JL, Peacock E, Samady W, Turner SM, Neese RA, Hellerstein MK, Murphy EJ (2005) Physiologic and pharmacologic factors influencing glyceroneogenic contribution to triacylglyceride glycerol measured by mass isotopomer distribution analysis. J Biol Chem 280(27):25396–25402
49. Nye CK, Hanson RW, Kalhan SC (2008) Glyceroneogenesis is the dominant pathway for triglyceride glycerol synthesis in vivo in the rat. J Biol Chem 283(41):27565–27574
50. Botion LM, Brito MN, Brito NA, Brito SRC, Kettelhut IC, Migliorini RH (1998) Glucose contribution to in vivo synthesis of glyceride-glycerol and fatty acids in rats adapted to a high-protein, carbohydrate-free diet. Metabolism 47(10):1217–1221
51. Botion LM, Kettelhut IC, Migliorini RH (1995) Increased adipose-tissue glyceroneogenesis in rats adapted to a high-protein, carbohydrate-free diet. Horm Metab Res 27(7):310–313
52. Ballard FJ, Hanson RW, Leveille GA (1967) Phosphoenolpyruvate carboxykinase and the synthesis of glyceride-glycerol from pyruvate in adipose tissue. J Biol Chem 242(11):2746–2750
53. Waterlow JC (2006) Protein turnover. CABI, Oxfordshire
54. Lichtenstein AH, Cohn JS, Hachey DL, Millar JS, Ordovas JM, Schaefer EJ (1990) Comparison of deuterated leucine, valine, and lysine in the measurement of human apolipoprotein A-I and B-100 kinetics. J Lipid Res 31(9):1693–1701
55. McCabe BJ, Bederman IR, Croniger CM, Millward CA, Norment CJ, Previs SF (2006) Reproducibility of gas chromatography-mass spectrometry measurements of 2H labeling of water: application for measuring body composition in mice. Anal Biochem 350(2):171–176
56. Annegers J (1954) Total body water in rats and in mice. Proc Soc Exp Biol Med 87(2):454–456
57. Brook MS, Wilkinson DJ, Atherton PJ, Smith K (2017) Recent developments in deuterium oxide tracer approaches to measure rates of substrate turnover: implications for protein, lipid, and nucleic acid research. Curr Opin Clin Nutr Metab Care 20(5):375–381
58. Strawford A, Antelo F, Christiansen M, Hellerstein MK (2004) Adipose tissue triglyceride turnover, de novo lipogenesis, and cell proliferation in humans measured with 2H2O. Am J Physiol Endocrinol Metab 286(4):E577–E588
59. Krebs HA, Hems R, Weidemann MJ, Speake RN (1966) The fate of isotopic carbon in kidney cortex synthesizing glucose from lactate. Biochem J 101(1):242–249
60. Kowalski GM, De Souza DP, Burch ML, Hamley S, Kloehn J, Selathurai A, Tull D, O’Callaghan S, McConville MJ, Bruce CR (2015) Application of dynamic metabolomics to examine in vivo skeletal muscle glucose metabolism in the chronically high-fat fed mouse. Biochem Biophys Res Commun 462(1):27–32
61. Kowalski GM, De Souza DP, Risis S, Burch ML, Hamley S, Kloehn J, Selathurai A, Lee-Young RS, Tull D, O’Callaghan S et al (2015) In vivo cardiac glucose metabolism in the high-fat fed mouse: comparison of euglycemic-hyperinsulinemic clamp derived measures of glucose uptake with a dynamic metabolomic flux profiling approach. Biochem Biophys Res Commun 463(4):818–824
62. Landau BR, Wahren J, Ekberg K, Previs SF, Yang DW, Brunengraber H (1998) Limitations in estimating gluconeogenesis and Cori cycling from mass isotopomer distributions using [U-13C6]glucose. Am J Physiol 37(5):E954–E961
63. Katz J, Chaikoff IL (1955) Synthesis via the Krebs’ cycle in the utilization of acetate by rat liver slices. Biochim Biophys Acta 18(1):87–101
64. Sidossis LS, Coggan AR, Gastaldelli A, Wolfe RR (1995) Pathway of free fatty acid oxidation in human subjects. Implications for tracer studies. J Clin Invest 95(1):278–284
65. Sidossis LS, Coggan AR, Gastaldelli A, Wolfe RR (1995) A new correction factor for use in tracer estimations of plasma fatty acid oxidation. Am J Phys 269(4 Pt 1):E649–E656
66. Wolfe RR, Jahoor F (1990) Recovery of labeled CO2 during the infusion of C-1- vs C-2-labeled acetate: implications for tracer studies of substrate oxidation. Am J Clin Nutr 51(2):248–252
67. Toth MJ, MacCoss MJ, Poehlman ET, Matthews DE (2001) Recovery of 13CO2 from infused [1-13C]leucine and [1,2-13C2]leucine in healthy humans. Am J Physiol Endocrinol Metab 281(2):E233–E241
68. Beysen C, Murphy EJ, McLaughlin T, Riiff T, Lamendola C, Turner HC, Awada M, Turner SM, Reaven G, Hellerstein MK (2007) Whole-body glycolysis measured by the deuterated-glucose disposal test correlates highly with insulin resistance in vivo. Diabetes Care 30(5):1143–1149
69. Raman A, Blanc S, Adams A, Schoeller DA (2004) Validation of deuterium-labeled fatty acids for the measurement of dietary fat oxidation during physical activity. J Lipid Res 45(12):2339–2344
70. Votruba SB, Zeddun SM, Schoeller DA (2001) Validation of deuterium labeled fatty acids for the measurement of dietary fat oxidation: a method for measuring fat-oxidation in free-living subjects. Int J Obes Relat Metab Disord 25(8):1240–1245
71. Landau BR, Wahren J (1992) Nonproductive exchanges: the use of isotopes gone astray. Metabolism 41(5):457–459
72. Ramakrishnan R (2006) Studying apolipoprotein turnover with stable isotope tracers: correct analysis is by modeling enrichments. J Lipid Res 47(12):2738–2753
73. Cobelli C, Toffolo G, Foster DM (1992) Tracer-to-tracee ratio for analysis of stable isotope tracer data: link with radioactive kinetic formalism. Am J Phys 262(6 Pt 1):E968–E975
74. Chinkes DL, Aarsland A, Rosenblatt J, Wolfe RR (1996) Comparison of mass isotopomer dilution methods used to compute VLDL production in vivo. Am J Phys 271(2 Pt 1):E373–E383
75. Kharroubi AT, Masterson TM, Aldaghlas TA, Kennedy KA, Kelleher JK (1992) Isotopomer spectral analysis of triglyceride fatty acid synthesis in 3T3-L1 cells. Am J Phys 263(4 Pt 1):E667–E675
76. Previs SF, Hallowell PT, Neimanis KD, David F, Brunengraber H (1998) Limitations of the mass isotopomer distribution analysis of glucose to study gluconeogenesis. Heterogeneity of glucose labeling in incubated hepatocytes. J Biol Chem 273(27):16853–16859
Chapter 7
Abstract
Lipidomics data generated using untargeted mass spectrometry techniques can offer great biological insight
to metabolic status and disease diagnoses. As the community’s ability to conduct large-scale studies with
deep coverage of the lipidome expands, approaches to analyzing untargeted data and extracting biological
insight are needed. Currently, the function of most individual lipids is not known; however, meaningful
biological information can be extracted. Here, I will describe a step-by-step approach to identify patterns
and trends in untargeted mass spectrometry lipidomics data to assist users in extracting information leading
to a greater understanding of biological systems.
Key words Lipidomics, Lipidome, Untargeted, Mass spectrometry, LipidMaps, Blood plasma
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_7, © Springer Science+Business Media, LLC, part of Springer Nature 2020
122 Jennifer E. Kyle
roles of the lipid components (e.g., fatty acids) or the type of lipids
(e.g., triglyceride) identified.
Multiple review papers covering sample collection and storage
procedures, lipid extraction from biological matrices, mass spectrometry
technologies and methods, appropriate QCs, and software are
available [2, 12–16]; these topics will not be covered here. In this
chapter, I will provide an approach to exploring lipidomics data at the
global level (i.e., all lipids identified in the biological study) to help
guide the understanding and interpretation of the lipidome. The
integration of other “omics” data, such as proteomics and transcriptomics,
can aid in the interpretation of lipidomics data, as it provides
mechanistic support for lipidomics observations.
Fig. 1 Classification information and specific chain details can be gathered from the lipid common name. For
example, in PC(16:0_20:4), the PC reveals the type of lipid (i.e., classification scheme), while the fatty acids (FAs)
can be parsed at the fatty acid level (16:0 and 20:4), by the FA characteristics (e.g., 16 carbons long (C16)), and also
by the total number of carbons and double bonds of the chains (i.e., 36 carbons (C36) and 4 double bonds
(DB 4)). The FAs themselves include a saturated FA (SFA) and a polyunsaturated FA (PUFA), and the chain lengths
show that both are long chain FAs (LCFAs). A second example, Cer(d18:0/24:1), can be similarly parsed.
MUFA monounsaturated FA, VLCFA very long chain FA
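The parsing described in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustration, not the author's tooling: the function names are hypothetical, the chain-length cutoffs (13 to 21 carbons for LCFA, 22 or more for VLCFA) are an assumed convention that varies between sources, and only simple names of the form shown in the caption are handled.

```python
import re

def chain_length_class(carbons):
    # Assumed cutoffs for illustration only; published conventions differ.
    if carbons >= 22:
        return "VLCFA"
    if carbons >= 13:
        return "LCFA"
    return "shorter chain"

def saturation_class(double_bonds):
    # SFA: no double bonds, MUFA: one, PUFA: two or more.
    if double_bonds == 0:
        return "SFA"
    return "MUFA" if double_bonds == 1 else "PUFA"

def parse_lipid(name):
    """Split a common name such as 'PC(16:0_20:4)' into class and chain details."""
    outer = re.fullmatch(r"([A-Za-z]+)\((.+)\)", name)
    if outer is None:
        raise ValueError(f"unrecognized lipid name: {name}")
    lipid_class, chain_text = outer.groups()
    chains = []
    # Chains are separated by '_' (position unknown) or '/' (position known).
    for part in re.split(r"[_/]", chain_text):
        inner = re.fullmatch(r"([A-Za-z]*)-?(\d+):(\d+)", part)
        if inner is None:
            raise ValueError(f"unrecognized chain: {part}")
        carbons = int(inner.group(2))
        double_bonds = int(inner.group(3))
        chains.append({
            # A prefix such as 'd' marks a sphingoid base rather than a fatty acid;
            # the labels below are applied to it only for convenience.
            "prefix": inner.group(1),
            "carbons": carbons,
            "double_bonds": double_bonds,
            "saturation": saturation_class(double_bonds),
            "length_class": chain_length_class(carbons),
        })
    return {
        "lipid_class": lipid_class,
        "chains": chains,
        "total_carbons": sum(c["carbons"] for c in chains),
        "total_double_bonds": sum(c["double_bonds"] for c in chains),
    }

pc = parse_lipid("PC(16:0_20:4)")
cer = parse_lipid("Cer(d18:0/24:1)")
```

Under these assumptions, the PC example yields the totals quoted in the caption (C36, DB 4), and the 24:1 chain of the ceramide is labeled as a MUFA and a VLCFA.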
Using the following approach, you will see that lipids tend to have
patterns based on their classification, at both the inter- and
intrasubclass levels, as well as on chain composition or chain characteristics.
Here I present my method using two lipidomics studies: cell-sorted
human lung tissue [27] and human plasma from donors with Ebola
virus disease [28]. This approach applies to lipidomics studies where
confident identifications have been made and statistical analysis has
provided results on differences (e.g., log2 fold change or Zscore)
between groups (e.g., controls and diseased samples). This
approach involves:
1. Understanding the lipids within the sample biological matrix
(Subheading 4.1).
2. Identifying signatures and patterns in lipidomics data (Sub-
heading 4.2).
3. Integrating ‘omics data (Subheading 4.3).
4.1 Understanding the Lipids Within the Sample Biological Matrix
Understanding how lipids influence membranes and where lipids
are localized at the cellular level or within biofluids will aid in the
interpretation of the lipidome. Several excellent reviews on the
cellular lipid subclass distribution and fatty acid characteristics are
available [4, 11, 29, 30]. These reviews detail the influence of the
phospholipid head groups and fatty acid composition on membrane
composition, fluidity, width, and charge. For example, the
small head group of PE lipids induces a negative membrane curvature,
and unsaturated lipids lead to more fluid membranes. Very few
lipid classes in human or mammalian systems are organelle specific.
Exceptions include cardiolipin, which is located mostly in the inner
mitochondrial membrane, and bis(monoacylglycero)phosphate
(BMP; also known as monoacylglycerophosphomonoradylglycerol
(LBPA)), which is located in late endosomes and lysosomes.
Less information is available on the localization of lipids in
biological fluids. Biofluids are particularly complicated due to
both endogenous and exogenous sources of lipids. Lipids have
been identified in many biofluids [31]; however, given their hydro-
phobic nature they are, for the most part, present in low abun-
dance. In plasma and CSF, lipids are primarily associated with
lipoproteins [32–34]. In CSF, lipids are also associated with
brain-derived nanoparticles [35]. In urine, the presence of lipids
has been suggested to be due to urea facilitating the dissolution of
the lipids in urine [36].
Most lipidomics studies using biofluids are conducted using
blood plasma, where lipids are abundant. Lipids in blood plasma are
associated with lipoprotein membranes (e.g., PC and sphingomyelin
(SM)), lipoprotein cargo (e.g., triglycerides (TGs) and cholesterol
esters), and proteins (e.g., lyso-PC). A handful of publications
Analyzing Untargeted Lipidomics 125
4.2 Identifying Signatures and Patterns in Lipidomics Data
To understand the lipidome, one must identify the trends, patterns,
and unique signatures. Organizing the lipid identifications with
associated abundance values in the following manner will allow
for observations to be gathered at the global level as well as at the
inter- and intrasubclass levels. Perform steps 1–4 with both the global
results (i.e., results from all of the identified lipids) and the
results from your query of interest (e.g., statistically significant
lipids).
Step 1. Organize the statistical results file based on the LipidMaps
classification scheme. This will highlight global trends as
well as trends across lipid categories and subclasses
(Fig. 3a).
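As a minimal sketch of Step 1, the statistical results can be sorted by category, then subclass, then total chain carbons and double bonds, which is the ordering described for Fig. 3a. The rows below are hypothetical values invented for illustration, the field layout is an assumption, and the alphabetical ordering of categories and subclasses stands in for whatever order you prefer to display.

```python
# Hypothetical result rows:
# (name, category, subclass, total_carbons, total_double_bonds, log2_fold_change)
rows = [
    ("TG(54:3)", "Glycerolipid", "TG", 54, 3, 0.8),
    ("PC(36:4)", "Glycerophospholipid", "PC", 36, 4, -0.4),
    ("DG(34:1)", "Glycerolipid", "DG", 34, 1, 1.2),
    ("TG(48:0)", "Glycerolipid", "TG", 48, 0, 1.5),
    ("PC(32:0)", "Glycerophospholipid", "PC", 32, 0, 0.3),
]

def organize(result_rows):
    """Sort by category, subclass, total chain carbons, then double bonds."""
    return sorted(result_rows, key=lambda r: (r[1], r[2], r[3], r[4]))

ordered = organize(rows)
for name, category, subclass, carbons, double_bonds, change in ordered:
    print(f"{category:22s} {subclass:4s} {name:10s} {change:+.1f}")
```

Once the rows are grouped this way, runs of positive or negative values within a subclass, or shifts with chain length and unsaturation, become easier to spot by eye or in a heat map.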
Fig. 3 Organization of the lipidome of human Ebola virus disease. (a) The lipids are organized by category
(glycerolipids and glycerophospholipids), subclass, and then intrasubclass by the total number of chain carbons
and double bonds. (b) Detailed intrasubclass organization of TGs. MG monoacylglyceride, DG diacylglyceride, TG
triacylglyceride, LPC monoacylglycerophosphocholine, oxPC oxidized diacylglycerophosphocholine, PC
diacylglycerophosphocholine, LPE monoacylglycerophosphoethanolamine, PE diacylglycerophosphoethanolamine,
PEP plasmalogen PE, PG diacylglycerophosphoglycerol or bis(monoacylglycerol)phosphate, LPI
monoacylglycerophosphoinositol, PI diacylglycerophosphoinositol, PS diacylglycerophosphoserine. Data used to
generate the figure are from Fig. 2 and the original Supplemental Table S2 of publication [28]
Fig. 4 Intrasubclass differences based on the fatty acid composition. PI lipids containing the fatty acid 20:4
were high (positive Zscore) in the endothelial cells of lung tissues of three 20-month-old donors [27], whereas
the rest of the PIs were low (negative Zscore). Data used to generate this figure are from the original
Supplemental Table S4 of publication [27]
4.3 Integrating ‘Omics Data
The interaction of lipids with other cellular components, including
proteins, genes, and metabolites, can enable a greater understanding
of the mechanisms behind the observations in your lipidomics
data generated in steps 1–4 above. The LipidMaps protein database
contains thousands of entries across ten model organisms, with
protein and gene identifiers for proteins and genes associated
with lipid metabolism and homeostasis. For humans, there
are 2273 entries with over 1100 unique Entrez Gene identifiers and gene
symbols, UniProt identifications, and protein entries. To identify
proteins or genes that may have a role in the lipid signatures noted
in Subheading 4.2, perform the following steps with your associated
proteomics (and/or transcriptomics) results.
Table 1
Lipid category and subclass enrichment in Ebola virus disease [28], using Fisher’s exact test with
p-value < 0.05. EVD: increased in fatalities vs controls

Test performed | Classifier | Count query | Count universe | % query | % universe | p-value | FDR q-value | Fold change
Category | Glycerolipid | 43/115 | 94/379 | 37.4 | 24.8 | 0.012 | 0.037 | 1.5
Sub class | DG( | 16/115 | 19/379 | 13.9 | 5.0 | 0.003 | 0.008 | 2.8
Sub class | PE( | 20/115 | 27/379 | 17.4 | 7.1 | 0.002 | 0.008 | 2.4
Sub class | PG( | 17/115 | 20/379 | 14.8 | 5.3 | 0.002 | 0.008 | 2.8
Total chain carbon by all | Glycerophospholipid with a total number of chain carbon of 36 | 18/51 | 42/212 | 35.3 | 19.8 | 0.025 | 0.225 | 1.8
Total chain carbon by all | PC with a total number of chain carbon of 32 | 1/1 | 3/83 | 100.0 | 3.6 | 0.048 | 1.000 | 27.7
Specific chains by all | PC( with the chain 16:0 | 2/2 | 15/110 | 100.0 | 13.6 | 0.022 | 1.000 | 7.3
Total number of DB by all | SM(d with a total number of chain unsaturation of 0 | 2/3 | 2/28 | 66.7 | 7.1 | 0.037 | 0.074 | 9.3
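The fold change column in Table 1 is the ratio of the query percentage to the universe percentage, and the p-values come from Fisher’s exact test. The sketch below recomputes the fold change for the Glycerolipid row and shows one common way to set up a one-sided Fisher’s exact test from the same counts. The 2 x 2 layout used here (query lipids versus the remaining universe lipids) and the one-sided alternative are assumptions; the exact construction used to produce Table 1 is not described in this section, so the resulting p-value should not be expected to reproduce the tabulated one.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]]."""
    row1 = a + b          # size of the first group (here: the query list)
    col1 = a + c          # all items carrying the classifier
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)

# Glycerolipid row of Table 1: 43 of 115 query lipids, 94 of 379 universe lipids.
query_hits, query_size = 43, 115
universe_hits, universe_size = 94, 379

fold_change = (query_hits / query_size) / (universe_hits / universe_size)

# Assumed layout: query lipids versus universe lipids that are not in the query.
a = query_hits
b = query_size - query_hits
c = universe_hits - query_hits
d = (universe_size - query_size) - c
p_value = fisher_greater(a, b, c, d)

print(f"fold change = {fold_change:.1f}, one-sided p = {p_value:.2g}")
```

The exhaustive sum over the hypergeometric tail is adequate for lists of a few hundred lipids; for larger problems, or for a two-sided test, a statistics library would be the more practical choice.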
Fig. 5 Lipid enrichment network result produced using Lipid Mini-On. The network shows statistically
significant ontology terms (diamond shapes) and the individual lipids associated with those terms that were
found to be enriched in human Ebola virus disease fatalities versus controls. The network was produced from
an enrichment test using Fisher’s exact test with a p-value cutoff of <0.05. The query list contained lipids
that were statistically elevated in fatalities versus the controls. Data used for this analysis are from
the original Supplemental Table S2 of publication [28]
Fig. 6 Identifying lipid-associated proteins in proteomics data. Proteomics data from a human lung study [27]
were compared against the LipidMaps protein database using the protein_entry identifier. Matching
protein_entry identifiers were highlighted in green. The gene_name was also tracked for the protein (gene)
descriptor. The proteomics data show lipid-associated proteins identified in three donors (D1, D2, and D3)
of sorted endothelial cells (END), epithelial cells (EPI), mesenchymal cells (MES), and immune cells (MIC) from
the lung tissue. Proteins that are higher in the associated cell type are in red, and those that are lower are in
blue. Black cells indicate that the protein was not identified in that cell type for the associated donor.
Proteomics data presented in this figure are from the original Supplemental Table S2 of
publication [27]
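The comparison described in the Fig. 6 caption amounts to a lookup of each proteomics row in the LipidMaps protein database by its protein_entry identifier, carrying the gene_name along for matched rows. A minimal sketch follows; the identifiers and abundance values are placeholders invented for illustration, and the in-memory dictionary is only a stand-in for a downloaded copy of the database.

```python
# Hypothetical stand-in for the LipidMaps protein database: protein_entry -> gene_name.
lipidmaps_proteins = {
    "PROT_A": "GENE_A",
    "PROT_B": "GENE_B",
    "PROT_C": "GENE_C",
}

# Hypothetical proteomics results, one row per identified protein.
proteomics = [
    {"protein_entry": "PROT_B", "abundance": {"END": 1.2, "MIC": -0.7}},
    {"protein_entry": "PROT_X", "abundance": {"END": 0.1, "MIC": 0.4}},
    {"protein_entry": "PROT_A", "abundance": {"END": -0.9, "MIC": 1.6}},
]

def lipid_associated(rows, database):
    """Keep rows whose protein_entry is in the database and attach the gene_name."""
    matches = []
    for row in rows:
        gene = database.get(row["protein_entry"])
        if gene is not None:
            matches.append({**row, "gene_name": gene})
    return matches

hits = lipid_associated(proteomics, lipidmaps_proteins)
for hit in hits:
    print(hit["protein_entry"], hit["gene_name"], hit["abundance"])
```

The same lookup works for transcriptomics results if a gene identifier is used as the key instead of protein_entry.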
Fig. 7 Integrating lipidomics, proteomics, and transcriptomics data to understand the lung lipidome.
Lipidomics analysis of cell-sorted lung tissue from three 20-month-old female donors revealed intrasubclass trends for
PG and TG lipids [27]. For the PGs, lipids with shorter retention times were elevated in immune (MIC) cells and were
found to be BMP lipids [27], which are known to be associated with late endosomes and lysosomes.
Proteomics supported the presence of BMPs in MIC cells, as lysosomal acid lipase, a
protein associated with lysosomes, was only identified in MIC cells. Protein abundance is LFQ intensity. For
TGs, organization of the lipids as shown in Subheading 4.2 step 2 revealed that TGs with shorter total chain
carbons and no or a low number of total double bonds showed elevated levels in mesenchymal cells (MES),
whereas TGs with longer-chained, more polyunsaturated fatty acids showed elevated levels in the MIC cells. Examining
transcriptomics data revealed that lipoprotein lipase, which hydrolyzes TGs, was most highly expressed in
MES cells, and perilipin-2, which is associated with lipid droplets, was most highly expressed in MIC cells. Transcript
values are log2 abundances. END endothelial cells, EPI epithelial cells, D1, D2, and D3 three lung tissue donors.
Lipidomics data presented in this figure are modified from the original Supplemental Table S4 of
publication [27]. Protein and transcript graphs were modified from supplemental Figs. S1 and S2 of
publication [27]
5 Summary
6 Data Deposition
Acknowledgments
I would like to thank Geremy Clair and Ernesto S. Nakayasu for their
comments and careful review of the manuscript. This work was
supported by an administrative supplement to grant
U19AI106772, provided by the National Institute of Allergy and
Infectious Diseases (NIAID), National Institutes of Health (NIH)
(USA). Research conducted on the lung samples was supported by
grant HL122703 from the National Heart, Lung, and Blood Institute of
NIH.
References
1. Gross RW, Han X (2011) Lipidomics at the interface of structure and function in systems biology. Chem Biol 18(3):284–291. https://doi.org/10.1016/j.chembiol.2011.01.014
2. Rustam YH, Reid GE (2018) Analytical challenges and recent advances in mass spectrometry based lipidomics. Anal Chem 90(1):374–397. https://doi.org/10.1021/acs.analchem.7b04836
3. Agmon E, Stockwell BR (2017) Lipid homeostasis and regulated cell death. Curr Opin Chem Biol 39:83–89. https://doi.org/10.1016/j.cbpa.2017.06.002
4. Holthuis JC, Menon AK (2014) Lipid landscapes and pipelines in membrane homeostasis. Nature 510(7503):48–57. https://doi.org/10.1038/nature13474
5. Hilvo M, Denkert C, Lehtinen L, Muller B, Brockmoller S, Seppanen-Laakso T, Budczies J, Bucher E, Yetukuri L, Castillo S, Berg E, Nygren H, Sysi-Aho M, Griffin JL, Fiehn O, Loibl S, Richter-Ehrenstein C, Radke C, Hyotylainen T, Kallioniemi O, Iljin K, Oresic M (2011) Novel theranostic opportunities offered by characterization of altered membrane lipid metabolism in breast cancer progression. Cancer Res 71(9):3236–3245. https://doi.org/10.1158/0008-5472.can-10-3894
6. Lydic TA, Goo YH (2018) Lipidomics unveils the complexity of the lipidome in metabolic diseases. Clin Transl Med 7(1):4. https://doi.org/10.1186/s40169-018-0182-9
7. Zhao YY, Miao H, Cheng XL, Wei F (2015) Lipidomics: novel insight into the biochemical mechanism of lipid metabolism and dysregulation-associated disease. Chem Biol Interact 240:220–238. https://doi.org/10.1016/j.cbi.2015.09.005
8. Lamari F, Mochel F, Saudubray JM (2015) An overview of inborn errors of complex lipid biosynthesis and remodelling. J Inherit Metab Dis 38(1):3–18. https://doi.org/10.1007/s10545-014-9764-x
9. Dautel SE, Kyle JE, Clair G, Sontag RL, Weitz KK, Shukla AK, Nguyen SN, Kim YM, Zink EM, Luders T, Frevert CW, Gharib SA, Laskin J, Carson JP, Metz TO, Corley RA, Ansong C (2017) Lipidomics reveals dramatic lipid compositional changes in the maturing postnatal lung. Sci Rep 7:40555. https://doi.org/10.1038/srep40555
10. Quehenberger O, Armando AM, Brown AH, Milne SB, Myers DS, Merrill AH, Bandyopadhyay S, Jones KN, Kelly S, Shaner RL, Sullards CM, Wang E, Murphy RC, Barkley RM, Leiker TJ, Raetz CR, Guan Z, Laird GM, Six DA, Russell DW, McDonald JG, Subramaniam S, Fahy E, Dennis EA (2010) Lipidomics reveals a remarkable diversity of lipids in human plasma. J Lipid Res 51(11):3299–3305. https://doi.org/10.1194/jlr.M009449
11. van Meer G, de Kroon AI (2011) Lipid map of the mammalian cell. J Cell Sci 124(Pt 1):5–8. https://doi.org/10.1242/jcs.071233
12. Hu T, Zhang JL (2018) Mass-spectrometry-based lipidomics. J Sep Sci 41(1):351–372. https://doi.org/10.1002/jssc.201700709
13. Hyotylainen T, Oresic M (2015) Optimizing the lipidomics workflow for clinical studies—practical considerations. Anal Bioanal Chem 407(17):4973–4993. https://doi.org/10.1007/s00216-015-8633-2
14. Hyotylainen T, Oresic M (2016) Bioanalytical techniques in nontargeted clinical lipidomics. Bioanalysis 8(4):351–364. https://doi.org/10.4155/bio.15.244
15. Kyle JE, Crowell KL, Casey CP, Fujimoto GM, Kim S, Dautel SE, Smith RD, Payne SH, Metz TO (2017) LIQUID: an-open source software for identifying lipids in LC-MS/MS-based lipidomics data. Bioinformatics 33(11):1744–1746. https://doi.org/10.1093/bioinformatics/btx046
16. Misra BB, Mohapatra S (2019) Tools and resources for metabolomics research community: a 2017-2018 update. Electrophoresis 40(2):227–246. https://doi.org/10.1002/elps.201800428
17. Fahy E, Subramaniam S, Murphy RC, Nishijima M, Raetz CR, Shimizu T, Spener F, van Meer G, Wakelam MJ, Dennis EA (2009) Update of the LIPID MAPS comprehensive classification system for lipids. J Lipid Res 50(Suppl):S9–S14. https://doi.org/10.1194/jlr.R800095-JLR200
18. Fahy E, Subramaniam S, Brown HA, Glass CK, Merrill AH Jr, Murphy RC, Raetz CR, Russell DW, Seyama Y, Shaw W, Shimizu T, Spener F, van Meer G, VanNieuwenhze MS, White SH, Witztum JL, Dennis EA (2005) A comprehensive classification system for lipids. J Lipid Res 46(5):839–861. https://doi.org/10.1194/jlr.E400004-JLR200
19. Folch J, Lees M, Sloane Stanley GH (1957) A simple method for the isolation and purification of total lipides from animal tissues. J Biol Chem 226(1):497–509
20. Bligh EG, Dyer WJ (1959) A rapid method of total lipid extraction and purification. Can J Biochem Physiol 37(8):911–917. https://doi.org/10.1139/o59-099
21. Matyash V, Liebisch G, Kurzchalia TV, Shevchenko A, Schwudke D (2008) Lipid extraction by methyl-tert-butyl ether for high-throughput lipidomics. J Lipid Res 49(5):1137–1146. https://doi.org/10.1194/jlr.D700041-JLR200
22. Liebisch G, Vizcaino JA, Kofeler H, Trotzmuller M, Griffiths WJ, Schmitz G, Spener F, Wakelam MJ (2013) Shorthand notation for lipid structures derived from mass spectrometry. J Lipid Res 54(6):1523–1530. https://doi.org/10.1194/jlr.M033506
23. Han X (2016) Lipidomics for studying metabolism. Nat Rev Endocrinol 12(11):668–679. https://doi.org/10.1038/nrendo.2016.98
24. Grosch S, Schiffmann S, Geisslinger G (2012) Chain length-specific properties of ceramides. Prog Lipid Res 51(1):50–62. https://doi.org/10.1016/j.plipres.2011.11.001
25. Veldhuizen R, Nag K, Orgeig S, Possmayer F (1998) The role of lipids in pulmonary surfactant. Biochim Biophys Acta 1408(2–3):90–108
26. Clair G, Reehl S, Stratton KG, Monroe ME, Tfaily MM, Ansong C, Kyle JE (2019) Lipid Mini-On: mining and ontology tool for enrichment analysis of lipidomic data. Bioinformatics 35(2):4507–4508. https://doi.org/10.1093/bioinformatics/btz250
27. Kyle JE, Clair G, Bandyopadhyay G, Misra RS, Zink EM, Bloodsworth KJ, Shukla AK, Du Y, Lillis J, Myers JR (2018) Cell type-resolved human lung lipidome reveals cellular cooperation in lung function. Sci Rep 8(1):13455. https://doi.org/10.1038/s41598-018-31640-x
28. Kyle JE, Burnum-Johnson KE (2019) Plasma lipidome reveals critical illness and recovery from human Ebola virus disease. Proc Natl Acad Sci U S A 116(9):3919–3928. https://doi.org/10.1073/pnas.1815356116
29. Harayama T, Riezman H (2018) Understanding the diversity of membrane lipid composition. Nat Rev Mol Cell Biol 19(5):281–296. https://doi.org/10.1038/nrm.2017.138
30. van Meer G, Voelker DR, Feigenson GW (2008) Membrane lipids: where they are and how they behave. Nat Rev Mol Cell Biol 9(2):112–124. https://doi.org/10.1038/nrm2330
Abstract
Liquid chromatography–mass spectrometry (LC-MS) is one of the most popular technologies in metabo-
lomics. The large-scale and unambiguous identification of metabolite structures remains a challenging task
in LC-MS based metabolomics. Tandem mass spectral databases provide experimental and in silico MS/MS
spectra to facilitate the identification of both known and unknown metabolites, which has become a gold
standard method in metabolomics. In addition, metabolite knowledge databases offer valuable biological
and pathway information of metabolites. In this chapter, we have briefly reviewed the most common and
important tandem mass spectral and metabolite databases, and illustrated how they could be used for
metabolite identification.
Key words Metabolite identification, Metabolomics, Tandem mass spectrum, Metabolite database
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_8, © Springer Science+Business Media, LLC, part of Springer Nature 2020
140 Zhangtao Yi and Zheng-Jiang Zhu
Table 1
Statistics of tandem mass spectral databases
become the “gold” standard method [1]. This method requires the
availability of standard MS/MS spectra. In past decades, many
efforts have been made to expand the existing tandem mass spectral
databases such as METLIN [3], NIST [4], and MassBank [5]
(Table 1). Clearly, all mass spectral databases are hindered by the
lack of chemical standards for many cellular metabolites, and suffer
from having uncharacterized spectral variations across different
LC-MS instruments, acquisition conditions, and laboratories.
Other efforts have also been made to theoretically predict the
MS/MS spectra in silico [6, 7]. Limited by the high diversity of
chemical structures of metabolites and the relatively small size of
training datasets, the accuracy of the in silico prediction approach
still requires substantial improvement. Molecular and metabolic
pathways can also be utilized for metabolite identification by
mapping dysregulated metabolic features onto the metabolic network,
as in Mummichog [8] and PIUMet [9]. In this approach,
metabolic pathway databases are generally required.
Many metabolomics-related databases have been constructed
and are either freely accessible or commercially available to provide
chemical structures, physicochemical properties, spectral information,
biological functions, and pathway information. In this chapter,
we provide an overview of current databases for
metabolomics, especially the supportive databases for metabolite
identification (Table 1), including tandem mass spectral databases
such as NIST, METLIN, MassBank, and mzCloud; metabolite knowledge
databases such as HMDB; and metabolic pathway databases
such as KEGG [10], MetaCyc [11], and Reactome [12]. However,
common chemical databases such as PubChem and ChemSpider
are not covered in this chapter.
Tandem Mass Spectral and Metabolite Databases 141
2 METLIN
NIST mass spectral libraries are developed by the NIST mass spec-
trometry data center led by Dr. Stephen E. Stein (Gaithersburg,
MD, USA). The NIST spectral libraries aim to facilitate chemical
compound identification in GC- and LC-MS by providing standard
mass spectra. These libraries were originally developed as GC-MS
5 mzCloud
6 HMDB
Acknowledgments
References
processing of tandem mass spectra using formula annotation. J Mass Spectrom 48:89–99
16. Feunang YD, Eisner R, Knox C, Chepelev L, Hastings J, Owen G, Fahy E, Steinbeck C, Subramanian S, Bolton E (2016) ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Chem 8:61
17. Wohlgemuth G, Mehta SS, Mejia RF, Neumann S, Pedrosa D, Pluskal T, Schymanski EL, Willighagen EL, Wilson M, Wishart DS et al (2016) SPLASH, a hashed identifier for mass spectra. Nat Biotechnol 34:1099–1101
18. Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K, Vázquez-Fresno R, Sajed T, Johnson D, Li C, Karu N et al (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46:D608–D617
19. Allen F, Greiner R, Wishart D (2014) Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11:98–110
20. Sud M, Fahy E, Cotter D, Brown A, Dennis EA, Glass CK, Merrill AH Jr, Murphy RC, Raetz CR, Russell DW (2006) LMSD: lipid maps structure database. Nucleic Acids Res 35:D527–D532
21. Fahy E, Sud M, Cotter D, Subramaniam S (2007) LIPID MAPS online tools for lipid research. Nucleic Acids Res 35:W606–W612
22. Fahy E, Subramaniam S, Murphy RC, Nishijima M, Raetz CR, Shimizu T, Spener F, van Meer G, Wakelam MJ, Dennis EA (2009) Update of the LIPID MAPS comprehensive classification system for lipids. J Lipid Res 50:S9–S14
23. Kind T, Liu KH, Lee DY, DeFelice B, Meissen JK, Fiehn O (2013) LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat Methods 10:755–758
24. Fiehn O, Barupal DK, Kind T (2011) Extending biochemical databases by metabolomic surveys. J Biol Chem 286:23637–23643
25. Shen X, Wang R, Xiong X, Yin Y, Cai Y, Ma Z, Liu N, Zhu Z-J (2019) Metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. Nat Commun 10:1516
26. Jeffryes JG, Colastani RL, Elbadawi-Sidhu M, Kind T, Niehaus TD, Broadbelt LJ, Hanson AD, Fiehn O, Tyo KE, Henry CS (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. J Chem 7:44
27. Huan T, Tang C, Li R, Shi Y, Lin G, Li L (2015) MyCompoundID MS/MS search: metabolite identification using a library of predicted fragment-ion-spectra of 383,830 possible human metabolites. Anal Chem 87:10619–10626
Chapter 9
Abstract
Untargeted mass spectrometry metabolomics studies rely on accurate databases for the identification of
metabolic features. Leveraging unique fragmentation patterns as well as characteristic dissociation routes
allows for structural information to be gained for specific metabolites and molecular classes, respectively.
Here we describe the evolution of METLIN as a resource for small molecule analysis as well as the tools
(e.g., Fragment Similarity Search and Neutral Loss Search) used to query the database and their workflows
for the identification of molecular entities. Additionally, we will discuss the functionalities of isoMETLIN, a
database of isotopic metabolites, and the latest addition to the METLIN family, METLIN-MRM, which
facilitates the analysis of quantitative mass spectrometry data generated with triple quadrupole
instrumentation.
Key words Untargeted metabolomics, Spectral database, MS/MS spectra, Metabolite identification
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_9, © Springer Science+Business Media, LLC, part of Springer Nature 2020
150 J. Rafael Montenegro-Burke et al.
2.2 Identification of Known Metabolites: MS/MS Spectrum Match Search
In order to reduce the list of putative identifications obtained from
m/z value searching, experimental MS/MS spectra are compared
against spectral libraries in terms of fragmentation patterns (m/z of
fragments and their intensities) (Fig. 1). MS/MS Spectrum Match
Search is a tool that allows the autonomous identification of
metabolites. Here, users upload the fragmentation profile as a table of
m/z values and intensities, and enter the mass of the precursor with
a specific tolerance, collision energy, and polarity. This tool then
searches, compares, and scores the similarity of the experimental
spectra with the reference spectra in the library for all metabolites
within the selected mass tolerance, relying on a modified X-Rank
similarity algorithm [15]. Without a doubt the ~600,000 molecular
standards with MS/MS spectra are METLIN’s most valuable
contribution to metabolomics research. The fragmentation spectra
for these ~600,000 molecules have been acquired at four collision
energies (0, 10, 20, and 40 V) in both positive- and negative-ion
mode. While low collision energy spectra were previously not
extensively used in the identification process, they have been
Fig. 1 METLIN search functions for small molecule identification. (a) Simple and Batch Search allow users to
search small molecules against a database of 1 million compounds based on both m/z values and neutral
masses within a selected mass tolerance. Advanced Search allows searches based on metabolite information
such as molecular formula, metabolite names, SMILES and KEGG, and CAS and MID numbers. (b) With the
MS/MS Spectrum Match Search, experimental and library MS/MS spectra can be searched, matched, and
scored in an automatic way. (c) Fragment Similarity Search and Neutral Loss Search aid the identification of
metabolites or chemical structures by searching m/z values of the fragments or neutral losses, respectively,
regardless of the precursor mass. Analytical Chemistry 2018, 90, 3156–3164. (Figure 1, with permission from
ACS and RightsLink)
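To make the idea of searching, matching, and scoring spectra in panel (b) concrete, the sketch below filters a small library by precursor m/z tolerance and ranks the remaining candidates by a plain cosine similarity. This is not the modified X-Rank algorithm used by METLIN; cosine similarity is swapped in here only because it is compact and widely known. All peak lists, candidate names, and tolerances are hypothetical.

```python
from math import sqrt

def cosine_score(query, reference, tol=0.01):
    """Cosine similarity between two peak lists [(mz, intensity), ...].

    Each query peak is paired with the closest unused reference peak
    whose m/z differs by at most tol.
    """
    used = set()
    dot = 0.0
    for q_mz, q_int in query:
        best = None
        for i, (r_mz, _) in enumerate(reference):
            if i in used or abs(q_mz - r_mz) > tol:
                continue
            if best is None or abs(q_mz - r_mz) < abs(q_mz - reference[best][0]):
                best = i
        if best is not None:
            used.add(best)
            dot += q_int * reference[best][1]
    norm_q = sqrt(sum(i * i for _, i in query))
    norm_r = sqrt(sum(i * i for _, i in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)

def rank_library(query, precursor_mz, library, precursor_tol=0.01):
    """Score only library entries whose precursor m/z is within tolerance."""
    hits = [(name, cosine_score(query, peaks))
            for name, lib_mz, peaks in library
            if abs(lib_mz - precursor_mz) <= precursor_tol]
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical library entries: (name, precursor m/z, peak list).
library = [
    ("candidate_1", 200.10, [(100.05, 100.0), (150.07, 50.0)]),
    ("candidate_2", 200.10, [(90.00, 100.0), (150.07, 20.0)]),
    ("candidate_3", 300.20, [(100.05, 100.0), (150.07, 50.0)]),
]
query_peaks = [(100.05, 100.0), (150.07, 50.0)]
ranked = rank_library(query_peaks, 200.10, library)
```

In this toy case candidate_3 is never scored because its precursor falls outside the tolerance, which mirrors the role of the precursor mass filter described for the tool.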
2.3 Identification of Unknown Metabolites
Interestingly, spectral libraries serve a dual purpose in the
identification/characterization of metabolic features. As mentioned above,
the main use of these libraries is to compare experimental spectra
with reference spectra for the purpose of identification (up to level
2 according to the Metabolomics Standards Initiative)
[2, 17]. However, given the large number of metabolites and the
broad range of chemistries, no library is complete despite considerable
efforts dedicated to increasing their populations. Furthermore,
the number of metabolites in nature is still a subject of debate, but
estimates in the million range are not exaggerated. Such numbers
dwarf the ~20,000 genes and proteins (without taking
METLIN: A Metabolite Mass Spectral Database 153
2.3.1 Fragment Similarity and Neutral Loss Search
The Fragment Similarity Search algorithm was originally implemented
in METLIN to find chemical similarities between the
unknown features of interest and the known molecules available in the
library, based on the experimental MS/MS spectra acquired by the user and
the over 4 million high-resolution MS/MS spectra in the METLIN
library. Specifically, the Fragment Similarity Search algorithm relies
on a shared peak count method and facilitates the identification of
the molecule of interest or molecular class by prioritizing molecules
with a larger chemical fragment overlap [4].
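A shared peak count can be illustrated with a short sketch: count how many query fragment m/z values have a partner in a library spectrum within a tolerance, ignoring the precursor mass, and rank library entries by that count. The fragment values, candidate names, and the 0.01 tolerance are hypothetical, and this is a generic version of the idea rather than METLIN's implementation.

```python
def shared_peak_count(query_mz, reference_mz, tol=0.01):
    """Number of query fragments matched one-to-one to reference fragments."""
    remaining = sorted(reference_mz)
    shared = 0
    for mz in sorted(query_mz):
        for i, ref in enumerate(remaining):
            if abs(mz - ref) <= tol:
                shared += 1
                del remaining[i]  # each reference fragment is used at most once
                break
    return shared

def rank_by_shared_peaks(query_mz, library, tol=0.01):
    """Rank library entries by fragment overlap, regardless of precursor mass."""
    scored = [(name, shared_peak_count(query_mz, fragments, tol))
              for name, fragments in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical fragment lists (m/z values only).
fragment_library = {
    "candidate_A": [56.05, 74.02, 102.05, 149.03, 166.05],
    "candidate_B": [56.05, 88.04],
}
unknown_fragments = [56.05, 74.02, 102.05, 149.03]
fragment_ranking = rank_by_shared_peaks(unknown_fragments, fragment_library)
```

Because intensities are ignored, the count rewards structural overlap even when the unknown and the library compound fragment with different relative abundances.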
The Neutral Loss Search algorithm was designed as a complementary
tool to Fragment Similarity Search. In Fragment
Similarity Search, shared fragments (i.e., ions with the same m/z
value) provide structural information on the ion of interest, but similar
structures with different molecular formulas, ergo different masses
and m/z values, would generate fragments with different masses,
and no similarity would be detected. In Neutral Loss
Search, however, mass differences between precursor ions and their
fragment ions can provide structural information based on common
“leaving groups” in their respective dissociation routes. Both of these
tools leverage the vast number of carefully curated MS/MS spectra
from a wide range of small molecular entities in the METLIN
library to facilitate the identification of not only known molecules
with no available MS/MS spectra, but also of unknown molecules
which have not been previously described in any form.
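The neutral loss idea reduces to simple arithmetic: subtract each fragment m/z from the precursor m/z, then look for library entries that show the same difference. The sketch below uses the rounded precursor and fragment values quoted in the Fig. 3 caption (295.10 and 166.05) for the unknown; note that these two-decimal values give a difference of 129.05, whereas the caption reports 129.04, presumably because it was computed from unrounded masses. The library entries, function names, and the 0.02 tolerance are hypothetical.

```python
def neutral_losses(precursor_mz, fragment_mzs):
    """Mass differences between a precursor ion and each of its fragment ions."""
    return [precursor_mz - f for f in fragment_mzs if f < precursor_mz]

def has_neutral_loss(precursor_mz, fragment_mzs, target_loss, tol=0.02):
    """True if any precursor-minus-fragment difference is within tol of the target."""
    return any(abs(loss - target_loss) <= tol
               for loss in neutral_losses(precursor_mz, fragment_mzs))

def search_by_neutral_loss(target_loss, library, tol=0.02):
    """Names of library entries showing the target neutral loss."""
    return [name for name, (precursor, fragments) in library.items()
            if has_neutral_loss(precursor, fragments, target_loss, tol)]

# Values quoted in the Fig. 3 caption for the unknown feature.
loss = neutral_losses(295.10, [166.05])[0]

# Hypothetical library entries: name -> (precursor m/z, fragment m/z values).
loss_library = {
    "entry_1": (276.12, [147.08, 130.05]),
    "entry_2": (180.06, [163.06]),
}
loss_hits = search_by_neutral_loss(129.04, loss_library)
print(f"neutral loss = {loss:.2f}, matching entries = {loss_hits}")
```

Unlike the shared peak count, this search can connect two compounds whose fragments all differ in mass, as long as they lose the same group.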
Here, we provide an example using these tools to identify
unknown metabolites detected in a murine macrophage cell line
(RAW264.7). The analysis was performed using an I-class UPLC
system coupled to a Synapt G2-Si mass spectrometer (Waters Corp.,
Milford, MA). After the unsuccessful identification of a metabolic
feature using the accurate mass and MS/MS spectra matching procedures
described in Subheadings 2.1 and 2.2, respectively, Fragment
Similarity Search and Neutral Loss Search tools were employed to
gain structural information and therefore clues to molecular iden-
tity in the following steps:
154 J. Rafael Montenegro-Burke et al.
Fig. 2 Fragment Similarity Search facilitates the identification of unknown metabolites where no MS/MS
spectral data are available. The fragments of an unknown metabolite were searched against METLIN and all of
the four fragments were found to match with methionine sulfoxide. The comparison between experimental
and library MS/MS spectra implies high structural similarities
3 METLIN Family
Fig. 3 Neutral Loss Search aids the identification of unknown metabolites based on mass differences between
precursor ions and their fragment ions (i.e., common “leaving groups”). A Neutral Loss Search of 129.04
(295.10 − 166.05 = 129.04) yielded 168 results, where ~70% contain a glutamic acid moiety. Based on the
masses of Met sulfoxide and Glu, a dipeptide is a likely option. Such a peptide is not available in the library;
however, a similar compound Glu Met contains MS/MS spectra for comparison. Several fragments, which
contain the thiol moiety, have a mass difference of 15.99 (monoisotopic mass of oxygen atom) corresponding
to the oxidation of sulfur in methionine
Fig. 4 isoMETLIN Simple Search menu allows searching the m/z of isotopologs within a selected mass
tolerance. Type of isotopic labeling, ion charge, and ion adducts can also be selected. This search renders a
list of all possible metabolites taking all possible isotopic combinations into account. Additionally, MS/MS
spectra can be accessed when available
Fig. 5 Systematic generation of MS/MS spectra of uniformly labeled metabolites in isoMETLIN. Pichia pastoris
was grown in unlabeled and 13C-labeled glucose, producing uniformly labeled metabolites after several
generations. Unlabeled and labeled metabolite extracts were analyzed by high-resolution untargeted
metabolomics to generate pairs of unlabeled and labeled putative metabolites. Finally, the MS/MS spectra of
identified pairs were incorporated into isoMETLIN after careful curation
3.1.2 Identification of Metabolites Using Isotopes
An additional application of fully labeled MS/MS spectra is the gain
of structural information for metabolite identification. By leveraging
the mass differences between analogous fragments of isotopologs,
the number of C atoms in each fragment can be
Fig. 6 (a) The MS/MS of an unknown metabolite is matched with the MS/MS of the Pichia pastoris extract
unlabeled compound. To consider the Pichia pastoris pair of compounds for identification, the retention time of
the unknown metabolite should match under the same analytical conditions. The number of carbons of each
fragment can be used to determine the structures of both parent and fragment ions. In this example,
pseudouridine was identified. (b) Proposed parallel analysis of microorganisms labeled with other stable
isotopes. MS/MS spectra of fully labeled known metabolites can be incorporated into isoMETLIN, whereas
MS/MS spectra of unknown metabolites can be used for their identification as described in (a)
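The carbon counting described for Fig. 6 follows from the mass difference between 13C and 12C (about 1.003355 Da). The Python sketch below estimates the number of carbons in a fragment from the m/z shift between an unlabeled ion and its uniformly 13C-labeled analog. The function name and the example m/z pair are hypothetical, and the sketch assumes complete labeling and a known charge state.

```python
C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C (Da)


def carbon_count(unlabeled_mz, labeled_mz, charge=1):
    """Estimate the number of C atoms in an ion from its 13C-labeled partner."""
    shift = (labeled_mz - unlabeled_mz) * abs(charge)
    n = round(shift / C13_C12_DELTA)
    # Mass error that remains after assuming n carbons (Da).
    residual = shift - n * C13_C12_DELTA
    return n, residual


# Hypothetical fragment pair: an unlabeled ion and its labeled analog.
n_carbons, residual = carbon_count(125.0346, 130.0514)
```

A residual that is large relative to the instrument's mass accuracy would suggest that the two ions are not a true unlabeled and labeled pair.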
4 METLIN Population
5 Conclusion
Acknowledgments
References
1. Smith CA, Maille GO, Want EJ, Qin C, Trauger SA, Brandon TR,
Custodio DE, Abagyan R, Siuzdak G (2005) METLIN: a metabolite mass
spectral database. Ther Drug Monit 27(6):747–751.
https://doi.org/10.1097/01.ftd.0000179845.53213.39
2. Vinaixa M, Schymanski EL, Neumann S, Navarro M, Salek RM,
Yanes O (2016) Mass spectral databases for LC/MS- and GC/MS-based
metabolomics: state of the field and future prospects. TrAC Trends
Anal Chem 78:23–35. https://doi.org/10.1016/j.trac.2015.09.005
3. Tautenhahn R, Cho K, Uritboonthai W, Zhu Z, Patti GJ, Siuzdak G
(2012) An accelerated workflow for untargeted metabolomics using the
METLIN database. Nat Biotechnol 30:826.
https://doi.org/10.1038/nbt.2348
4. Benton HP, Wong DM, Trauger SA, Siuzdak G (2008) XCMS2:
processing tandem mass spectrometry data for metabolite
identification and structural characterization. Anal Chem
80:6382–6389. https://doi.org/10.1021/ac800795f
5. Benton HP, Ivanisevic J, Mahieu NG, Kurczy ME, Johnson CH,
Franco L, Rinehart D, Valentine E, Gowda H, Ubhi BK, Tautenhahn R,
Gieschen A, Fields MW, Patti GJ, Siuzdak G (2015) Autonomous
metabolomics for rapid metabolite identification in global profiling.
Anal Chem 87(2):884–891. https://doi.org/10.1021/ac5025649
6. Montenegro-Burke JR, Phommavongsay T, Aisporna AE, Huan T,
Rinehart D, Forsberg E, Poole FL, Thorgersen MP, Adams MWW,
Krantz G, Fields MW, Northen TR, Robbins PD, Niedernhofer LJ,
Lairson L, Benton HP, Siuzdak G (2016) Smartphone analytics:
mobilizing the lab into the cloud for omic-scale analyses. Anal Chem
88(19):9753–9758. https://doi.org/10.1021/acs.analchem.6b02676
7. Fiehn O, Barupal DK, Kind T (2011) Extending biochemical
databases by metabolomic surveys. J Biol Chem 286(27):23637–23643.
https://doi.org/10.1074/jbc.R110.173617
8. Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y,
Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T,
Matsuda F, Sawada Y, Hirai MY, Nakanishi H, Ikeda K, Akimoto N,
Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D,
Neumann S, Iida T, Tanaka K, Funatsu K, Matsuura F, Soga T,
Taguchi R, Saito K, Nishioka T (2010) MassBank: a public repository
for sharing mass spectral data for life sciences. J Mass Spectrom
45(7):703–714. https://doi.org/10.1002/jms.1777
9. Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, Cheng D,
Jewell K, Arndt D, Sawhney S, Fung C, Nikolai L, Lewis M,
Coutouly M-A, Forsythe I, Tang P, Shrivastava S, Jeroncic K,
Stothard P, Amegbey G, Block D, Hau DD, Wagner J, Miniaci J,
Clements M, Gebremedhin M, Guo N, Zhang Y, Duggan GE, MacInnis GD,
Weljie AM, Dowlatabadi R, Bamforth F, Clive D, Greiner R, Li L,
Marrie T, Sykes BD, Vogel HJ, Querengesser L (2007) HMDB: the human
metabolome database. Nucleic Acids Res 35(suppl_1):D521–D526.
https://doi.org/10.1093/nar/gkl923
10. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y,
Nguyen DD, Watrous J, Kapono CA, Luzzatto-Knaan T, Porto C,
Bouslimani A, Melnik AV, Meehan MJ, Liu W-T, Crüsemann M,
Boudreau PD, Esquenazi E, Sandoval-Calderón M, Kersten RD, Pace LA,
Quinn RA, Duncan KR, Hsu C-C, Floros DJ, Gavilan RG, Kleigrewe K,
Northen T, Dutton RJ, Parrot D, Carlson EE, Aigle B, Michelsen CF,
Jelsbak L, Sohlenkamp C, Pevzner P, Edlund A, McLean J, Piel J,
Murphy BT, Gerwick L, Liaw C-C, Yang Y-L, Humpf H-U, Maansson M,
Keyzers RA, Sims AC, Johnson AR, Sidebottom AM, Sedio BE,
Klitgaard A, Larson CB, Boya PCA, Torres-Mendoza D, Gonzalez DJ,
Silva DB, Marques LM, Demarque DP, Pociute E, O’Neill EC, Briand E,
Helfrich EJN, Granatosky EA, Glukhov E, Ryffel F, Houson H,
Mohimani H, Kharbush JJ, Zeng Y, Vorholt JA, Kurita KL,
Charusanti P, McPhail KL, Nielsen KF, Vuong L, Elfeki M, Traxler MF,
Engene N, Koyama N, Vining OB, Baric R, Silva RR, Mascuch SJ,
Tomasi S, Jenkins S, Macherla V, Hoffman T, Agarwal V, Williams PG,
Dai J, Neupane R, Gurr J, Rodríguez AMC, Lamsa A, Zhang C,
Dorrestein K, Duggan BM, Almaliti J, Allard P-M, Phapale P,
Nothias L-F, Alexandrov T, Litaudon M, Wolfender J-L, Kyle JE,
Metz TO, Peryea T, Nguyen D-T, VanLeer D, Shinn P, Jadhav A,
Müller R, Waters KM, Shi W, Liu X, Zhang L, Knight R, Jensen PR,
Palsson BØ, Pogliano K, Linington RG, Gutiérrez M, Lopes NP,
Gerwick WH, Moore BS, Dorrestein PC, Bandeira N (2016) Sharing and
community curation of mass spectrometry data with global natural
products social molecular networking. Nat Biotechnol 34:828.
https://doi.org/10.1038/nbt.3597
Abstract
The Human Metabolome Database (HMDB) is a comprehensive, online, digital database designed to
support the analysis and interpretation of metabolomic data acquired from human and/or mammalian
metabolomic studies. This chapter covers three methods or protocols pertinent to using the HMDB:
(1) understanding the general layout of the HMDB; (2) exploring the contents of a typical HMDB
“MetaboCard”; and (3) an example of how HMDB can be used in a metabolomics study on human
glioblastoma.
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_10, © Springer Science+Business Media, LLC, part of Springer Nature 2020
166 David S. Wishart
2 Materials
3 Methods
Three methods are covered here: (1) exploring the general layout of
the HMDB; (2) exploring the contents of a typical HMDB
“MetaboCard”; and (3) an example of how HMDB can be used in a
metabolomics study on human glioblastoma. All activities can be
done with almost any internet-compatible device, although a larger
screen (18 cm or larger) is preferable.
3.1 Exploring the HMDB Layout
1. Open a preferred web browser and enter the following URL
(web address): http://www.hmdb.ca.
2. The HMDB home page should be visible (Fig. 1).
3. On the top right is a Search box with a pull-down menu (with
options for metabolites, diseases, pathways, proteins, reactions)
and a blue search button. Users may enter text into the search
box to search any of the categories in the pull-down menu.
Enter the name “Alanine” into the search box and press the
search button (or press the return key). A “Search Results”
page should appear listing more than 100 matches to Alanine
with the most likely match appearing at the top of the list
(Fig. 2). See Note 1 for a more detailed explanation of
what is shown on this page.
4. Press the “back” button on your browser or the “HMDB” icon
(top left of the Search Results page) to return to the HMDB
home page.
Fig. 2 A screenshot of the “Search Results” page for a search using Alanine as a text entry
Fig. 3 A screenshot of the HMDB home page with the “Browse” menu displayed
Fig. 4 A screenshot of the HMDB home page with the “Search” menu displayed
3.2 Exploring the Content of the HMDB MetaboCard
1. Go to the HMDB home page (http://www.hmdb.ca) and use
your mouse or track pad to select “Browse” on the HMDB
home page menu.
2. Select “Metabolites” from the pull-down menu (it is at the top
of the pull-down list).
3. Scroll down the page, move the mouse pointer to the name of
the first compound in the table (1-methylhistidine) and click
on it. Users may alternately click on the tan-colored HMDB
identifier (HMDB0000001) as this will yield the same result.
4. A “MetaboCard” for 1-Methylhistidine should appear that
shows the Record Information along with some of the Metab-
olite identification data (Fig. 5). Additional details about the
MetaboCard concept are given in Note 5.
5. Use a mouse or computer track pad to scroll down the page to
view more of the Metabolite Identification data (see Note 6).
6. Use a mouse or computer track pad to scroll down the page to
view more of the Chemical Taxonomy data (Fig. 6). Additional
information about this section is given in Note 7.
7. Use a mouse or computer track pad to scroll down the page to
view more of the Ontology data (Fig. 7). Additional informa-
tion about this section is given in Note 8.
8. Use a mouse or computer track pad to scroll down the page to
view more of the Physical Property data (Fig. 8).
9. Use a mouse or computer track pad to scroll down the page to
view more of the Spectral data and Biological Property data
(Fig. 9). Additional information about these sections is given in
Note 9.
10. Use a mouse or computer track pad to scroll down the page to
view data on the Normal Concentrations, Abnormal
Human Metabolome Database 171
Fig. 6 A screenshot of the Chemical Taxonomy data field from the 1-Methylhistidine MetaboCard
Fig. 7 A screenshot of the Ontology data field from the 1-Methylhistidine MetaboCard
Fig. 8 A screenshot of the Physical Property data field from the 1-Methylhistidine MetaboCard
Fig. 9 A screenshot of the Spectral data and Biological Property data fields from the 1-Methylhistidine
MetaboCard
3.3 Using the HMDB for Disease Studies (Glioblastoma)
1. Go to the HMDB home page (http://www.hmdb.ca) and use
your mouse or track pad to select “Search” on the HMDB
home page menu.
2. The drop-down menu will display several search options
including structure, sequence, text, spectral and molecular
weight searches. For this example select the “LC-MS Search”
option (Fig. 13).
3. As explained in Note 12, the following m/z values were found
to be significantly altered in cerebrospinal fluid samples
between patients with glioblastoma (brain cancer) and healthy
controls: 149.0444, 117.0183, 119.0339, 91.0388,
147.0763, 193.0341, 308.0911. Enter these numbers into
the “Query Masses” box as shown in Fig. 14.
Fig. 10 A screenshot of the Normal and Abnormal Concentrations data fields from the 1-Methylhistidine
MetaboCard
Fig. 11 A screenshot of the External Links and References data fields from the 1-Methylhistidine MetaboCard
Fig. 12 A screenshot of the Enzymes and Transporters page from the 1-Methylhistidine MetaboCard
Fig. 14 A screenshot of the “LC-MS Search” page with the 149.0444, 117.0183, 119.0339, 91.0388,
147.0763, 193.0341, 308.0911 m/z values entered into the “Query Masses” box
Fig. 15 A screenshot showing the results of the “LC-MS Search” using the seven entered masses or m/z
values
4 Notes
1. The Search Results page has several components. At the top left
of the page are “change page” or page selector icons that allow
users to quickly select and display different result pages con-
sisting of 25 metabolites per page. Users can select a page by
number or they may select the next page or the last page of
results. Below the page selector panel is a greyed-in area for
filtering or limiting the displayed results. Users can filter by
metabolite status (detected, expected, predicted) or by biospecimen
type (blood, urine, saliva, etc.), which refers to the biofluid
or tissue location(s) where the metabolite is found. Using a
mouse or track-pad, users can click on one of the desired check
boxes after which they can press on the “Apply Filter” button
on the far right. Applying the filter generates a new “Search
Results” page with the results filtered according to the user
selection. Below the filter selection box is a scrollable table that
displays the hits from the search. On the left side is a tan-colored
button that displays the compound’s HMDB identifier. Click-
ing on this button will display the compound’s corresponding
MetaboCard. The number below each MetaboCard button is
the CAS (Chemical Abstracts Service) identifier, if available. To
the right of the MetaboCard button (in the same metabolite
row) is the common name and the IUPAC name of the com-
pound along with the text in which the word “Alanine” has
matched to the text in the corresponding MetaboCard (high-
lighted in yellow). The biofluid location in which the metabo-
lite has been found is indicated in the dark grey boxes at the top
of each metabolite row. On the far right of each metabolite row
is a 2D structure of the metabolite drawn in an electronically
neutral format.
2. The HMDB has 14 different browsing options. The most
frequently chosen option is the “Metabolite” category, which
is listed at the top of the pull-down menu. When selected, the
metabolite browse option allows users to interactively view,
scroll through and filter metabolite data. The data is presented
in a structured table that provides synoptic information (name,
structure, chemical formula, biofluid source) about the
110,000+ metabolites in the HMDB. The filtering option
allows users to select metabolites based on their status (quanti-
fied, expected, predicted, etc.), biospecimen type (urine,
blood, etc.), general origin (microbial, food, endogenous,
etc.), and subcellular location. Each metabolite is linked to a
specific “MetaboCard,” which is described in more detail in
Subheading 3.2. Selecting the “Diseases” browsing option
allows users to view a structured, scrollable table that lists
(age and gender-specific) metabolite concentration data for
13. The table that displays the mass hits includes 7 subtables with a
subtable of mass matches for each of the 7 query masses. The
first subtable shows matches for the m/z 149.0444 query, the
second subtable shows mass matches for the m/z 117.0183
query, and so on. In some cases up to 10–12 metabolites with
identical masses or m/z values (isobaric compounds) will be
displayed. Also shown in each table or subtable is the name of
the compound, the molecular formula, the monoisotopic mass
of the parent compound, the adduct type (selected by the
user), the adduct mass, and the difference between the query
mass and the matching mass (given as Delta, in parts per
million or ppm). One of the challenges in working with only
m/z data (or only LC-MS data) is that multiple candidates or
multiple metabolites can match to a given m/z value. Usually
additional information, such as retention time, expected abun-
dance in the given biofluid, additional MS/MS spectra or
selected information regarding biological context must be
used to decide which unique compound(s) are most likely
matching to the observed MS spectra.
14. Reading through the MetaboCard for D-2-hydroxyglutarate
will reveal that it has been implicated as an oncometabolite as
well as a key metabolite in a rare inborn error of metabolism called
D-2-hydroxyglutaric aciduria. Scrolling down the MetaboCard
will show more details regarding D-2-hydroxyglutarate’s struc-
ture, synonyms, chemical classifications, concentrations in dif-
ferent biofluids, pathways, and the enzymes/transporters that
bind it or act on it.
15. While multiple “hits” are possible with the mass list provided in
this example, the results of this particular LC-MS study on
glioblastoma CSF should ideally show that D-2-hydroxygluta-
rate, fumarate, succinate, and lactate have increased in concen-
tration in cancer patients, while glutamine, isocitrate, and
glutathione have decreased in cancer patients. The metabolites
that increased in concentration are known oncometabolites
and appear to catalyze further mutations and epigenetic
changes (D-2-hydroxyglutarate, fumarate) or contribute to
immunosuppression, metastasis, inflammation, and other
well-known hallmarks of cancer (succinate and lactate). This
simple example was chosen for didactic purposes. Identifying
compounds using only m/z data is often risky and leads to
multiple mass-matching redundancies. If MS/MS data were
available it would have been possible to search for similar
MS/MS spectra via HMDB’s “MS/MS Search” option.
Finding matching MS/MS spectra (either observed or pre-
dicted) to any query MS/MS spectrum provides important
supporting evidence regarding the actual identity of the com-
pound or compounds of interest.
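The Delta value described in Note 13 can be checked offline with a few lines of Python. The sketch below computes the monoisotopic mass of each metabolite named in Note 15, adds a proton, and reports the difference to the query m/z in ppm. Several points are assumptions made here rather than statements from the HMDB: the [M+H]+ adduct, the pairing of each query mass with a metabolite (inferred by mass), the signed definition of the ppm difference, and all function names.

```python
import re

# Monoisotopic atomic masses (Da) and the proton mass.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "S": 31.9720707}
PROTON = 1.00727646


def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral formula such as 'C5H8O5'."""
    return sum(MONO[el] * int(n or 1)
               for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


def delta_ppm(query_mz, theoretical_mz):
    """Signed difference between query and theoretical m/z in ppm."""
    return (query_mz - theoretical_mz) / theoretical_mz * 1e6


# Query m/z -> (metabolite, neutral formula); pairing assumed, [M+H]+ assumed.
CANDIDATES = {
    149.0444: ("D-2-hydroxyglutarate", "C5H8O5"),
    117.0183: ("fumarate", "C4H4O4"),
    119.0339: ("succinate", "C4H6O4"),
    91.0388: ("lactate", "C3H6O3"),
    147.0763: ("glutamine", "C5H10N2O3"),
    193.0341: ("isocitrate", "C6H8O7"),
    308.0911: ("glutathione", "C10H17N3O6S"),
}

deltas = {
    name: delta_ppm(mz, monoisotopic_mass(formula) + PROTON)
    for mz, (name, formula) in CANDIDATES.items()
}
```

Under these assumptions every query mass falls within a few ppm of its candidate, but, as Note 13 cautions, other isobaric compounds would match equally well on mass alone.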
Acknowledgments
References
1. Wishart DS (2007) Proteomics and the human with a comprehensive, computable taxonomy.
metabolome project. Expert Rev Proteomics J Cheminform 8:61
4:333–335 11. Weininger D (1988) SMILES, a chemical lan-
2. Mandal R, Guo AC, Chaudhary KK, Liu P, guage and information system. 1. Introduction
Yallou FS, Dong E et al (2012) Multi-platform to methodology and encoding rules. J. Chem
characterization of the human cerebrospinal Inf Comput Sci 28:31–36
fluid metabolome: a comprehensive and quan- 12. Heller S, McNaught A, Stein S,
titative update. Genome Med 4:38 Tchekhovskoi D, Pletnev I (2013) InChI—
3. Psychogios N, Hau DD, Peng J, Guo AC, the worldwide chemical structure identifier
Mandal R, Bouatra S et al (2011) The human standard. J Chem 5:7
serum metabolome. PLoS One 6:e16957 13. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G,
4. Bouatra S, Aziat F, Mandal R, Guo AC, Wilson Gindulyte A et al (2016) PubChem substance
MR, Knox C et al (2013) The human urine and compound databases. Nucleic Acids Res
metabolome. PLoS One 8:e73076 44(Database issue):D1202–D1213
5. Dame ZT, Aziat F, Mandal R, 14. Sud M, Fahy E, Cotter D, Brown A, Dennis
Krishnamurthy R, Bouatra S, Borzouie S et al EA, Glass CK et al (2007) LMSD: LIPID
(2015) The human saliva metabolome. Meta- MAPS structure database. Nucleic Acids Res
bolomics 11:1864–1883 35(Database issue):D527–D532
6. Karu N, Deng L, Slae M, Guo AC, Sajed T, 15. Hastings J, Owen G, Dekker A, Ennis M,
Huynh H et al (2018) A review on human fecal Kale N, Muthukrishnan V et al (2016) ChEBI
metabolomics: methods, applications and the in 2016: improved services and an expanding
human fecal metabolome database. Anal Chim collection of metabolites. Nucleic Acids Res 44
Acta 1030:1–24 (Database issue):D1214–D1219
7. Wishart DS, Tzur D, Knox C, Eisner R, Guo 16. Ashburner M, Ball CA, Blake JA, Botstein D,
AC, Young N et al (2007) HMDB: the human Butler H, Cherry JM et al (2000) Gene ontol-
Metabolome database. Nucleic Acids Res 35 ogy: tool for the unification of biology. The
(Database issue):D521–D526 gene ontology consortium. Nat Genet
8. Wishart DS, Feunang YD, Marcu A, Guo AC, 25:25–29
Liang K, Vázquez-Fresno R et al (2018) 17. Kanehisa M, Furumichi M, Tanabe M, Sato Y,
HMDB 4.0: the human metabolome database Morishima K (2017) KEGG: new perspectives
for 2018. Nucleic Acids Res 46(D1): on genomes, pathways, diseases and drugs.
D608–D617 Nucleic Acids Res 45(Database issue):
9. Djoumbou-Feunang Y, Fiamoncini J, Gil-de- D353–D361
la-Fuente A, Greiner R, Manach C, Wishart DS 18. Jewison T, Su Y, Disfany FM, Liang Y, Knox C,
(2019) BioTransformer: a comprehensive Maciejewski A et al (2014) SMPDB 2.0: big
computational tool for small molecule metab- improvements to the small molecule pathway
olism prediction and metabolite identification. database. Nucleic Acids Res 42(Database
J Cheminform 11:2 issue):D478–D484
10. Djoumbou Feunang Y, Eisner R, Knox C, 19. Amberger JS, Bocchini CA, Scott AF, Hamosh
Chepelev L, Hastings J, Owen G et al (2016) A (2019) OMIM.org: leveraging knowledge
ClassyFire: automated chemical classification across phenotype-gene relationships. Nucleic
Acids Res 47(Database):D1038–D1043
Chapter 11
Abstract
SIRIUS 4 is the best-in-class computational tool for metabolite identification from high-resolution tandem
mass spectrometry data. It offers de novo molecular formula annotation with outstanding accuracy. When
searching fragmentation spectra in a structure database, it reaches over 70% correct identifications. A
predicted fingerprint, which indicates the presence or absence of thousands of molecular properties, helps
to deduce information about the compound of interest even if it is not contained in any structure database.
Here, we present best practices and describe how to leverage the full potential of SIRIUS 4, how to
incorporate it into your own workflow, and how it adds value to the analysis of mass spectrometry data
beyond spectral library search.
Key words Metabolomics, LC–MS/MS, Annotation, Molecular formula, Structure prediction, SIRIUS,
Metabolite identification
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_11, © Springer Science+Business Media, LLC, part of Springer Nature 2020
186 Marcus Ludwig et al.
3 Preprocessing
• MS/MS spectra with low total intensity or very few signal peaks
should be rejected. Usually it is difficult to confidently identify
the corresponding compounds.
It is usually not necessary to preprocess fragmentation spectra
by removing “noise peaks” or recalibrating masses; such preproces-
sing can substantially worsen results, as signal peaks may be
removed or masses shifted in the wrong direction. SIRIUS can
decide for itself which of the peaks in the spectrum are noise, but it
cannot recover the masses of accidentally removed signal peaks.
Therefore, be cautious when using intensity thresholds. If the data is
noisy and necessitates “noise peak” removal, use a low intensity
threshold to remove as few signal peaks as possible. Furthermore,
we recommend using a low MS1 intensity threshold and
not-too-restrictive parameters for feature detection. A high number of
spurious features might pose a problem for MS1-only analysis.
But here, we concentrate on metabolite identification based on
fragmentation spectra, and spurious features can easily be recog-
nized because these will not produce significant signal peaks within
the fragmentation spectrum. Using liberal parameters will help to
detect more low intensity isotope peaks and include them into the
compound’s isotope pattern.
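If noise removal cannot be avoided, a relative intensity threshold is one simple way to apply it. The Python sketch below is an illustration under stated assumptions: the function name, the 0.5% and 5% cutoffs, and the peak list are hypothetical and are not SIRIUS defaults. It shows how a liberal threshold retains a weak signal peak that a stricter one discards.

```python
def filter_peaks(peaks, rel_threshold=0.005):
    """Keep peaks whose intensity is at least rel_threshold x base peak."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    return [(mz, i) for mz, i in peaks if i >= rel_threshold * base]


# Hypothetical (m/z, intensity) pairs, for illustration only.
spectrum = [(91.054, 1200.0), (119.049, 80.0),
            (137.060, 4.0), (165.055, 10000.0)]

kept_low = filter_peaks(spectrum, rel_threshold=0.005)   # cutoff at 50
kept_high = filter_peaks(spectrum, rel_threshold=0.05)   # cutoff at 500
```

With the lower cutoff the peak at m/z 119.049 survives; with the higher cutoff it is lost and cannot be recovered downstream.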
Instrumental setup has a huge impact on spectrum quality and
some setups might be more suitable for structure elucidation with
computational tools. See Tip 1 for more information.
for all fragments. Selecting only the monoisotopic peak for frag-
mentation makes it easier to interpret the fragmentation spectrum.
SIRIUS provides an option to account for isotopes in the fragmen-
tation spectrum, but this assumes that the isolation window is
broad and isotope patterns of fragments are undisturbed. Unfortu-
nately, filtering is imperfect in practice: An isolation window of
width, say, 3 Da may select 100% of the monoisotopic peak, 80%
of the first and 50% of the second isotope peak. This will distort the
isotope patterns of fragments in a non-trivial way. At present,
SIRIUS cannot deal with distorted fragment isotope patterns.
Compound identification benefits from choosing an instru-
mental setup which minimizes chimeric spectra, and favors peak-
rich and low noise fragmentation spectra.
4 Metabolite Identification
4.1 Molecular Formula Annotation
SIRIUS finds the most likely molecular formula by considering all
possible molecular formulas, and is able to annotate biomolecules
with a molecular formula missing from any database. Necessary
parameters for SIRIUS are:
Elements: Set of considered elements. Some elements can be
auto-detected if an isotope pattern is given (see Tip 2).
ppm: Allowed mass deviation in ppm. This is the maximum value a
molecular formula explanation is allowed to deviate from the peaks’
measured mass. Molecular formulas with a
Fig. 1 The SIRIUS Overview tab displays the spectrum and fragmentation trees of the top molecular formula
candidates. The best candidate C24H38O3 is selected; the corresponding explained spectrum and fragmenta-
tion tree are shown. The left panel contains a searchable list of all compounds; selected compounds are
highlighted. The data and results of the first selected compound are displayed in all the views to the right of
the compound list. The upper panel provides functionalities to import spectra, save and load workspaces,
export result tables, start computations, and display their status in the jobs panel. The SIRIUS overview tab
displays various scores for each molecular formula candidate and can be sorted accordingly
4.1.1 Judging Results
Molecular formula annotation results are displayed in the SIRIUS
Overview tab (see Fig. 1). Candidates are ranked by the sum of
isotope pattern and fragmentation tree score (see Tip 2 on isotopes
and Tip 3 on fragmentation trees). Colored bars for each score ease
comparison between candidates. Each candidate molecular formula
has an adduct. At this stage, this is an ion type; after structure
database search with CSI:FingerID this adduct corresponds to an
adduct type (compare Figs. 1 and 3 and see Tip 4).
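The ranking rule stated above (overall score equals isotope score plus tree score) can be sketched in Python as follows. The class, the candidate formulas, and all score values are invented for illustration and are not SIRIUS output.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    formula: str
    isotope_score: float
    tree_score: float

    @property
    def score(self):
        # Overall score: sum of isotope pattern and fragmentation tree score.
        return self.isotope_score + self.tree_score


def rank(candidates):
    """Sort candidates by overall score, best first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Hypothetical candidates and scores, for illustration only.
candidates = [
    Candidate("C8H13N3O", 1.2, 25.0),
    Candidate("C9H11NO2", 3.8, 27.5),
    Candidate("C10H15O2", 0.4, 26.5),
]
ranked = rank(candidates)
```

Keeping the two component scores separate, as the Overview tab does, makes it easy to see whether a candidate ranks highly because of its isotope pattern, its fragmentation tree, or both.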
The displayed attributes are:
Score: Overall score by which candidates are ranked. This is the
sum of isotope and tree score.
Isotope score: Similarity score comparing the measured isotope
pattern with the theoretical pattern for each candidate molecular
formula. Usually, a score close to zero or low in comparison to the
remaining candidates indicates an incorrect molecular formula, or at
least an annotation of low confidence. Besides being the incorrect
candidate, this might indicate improper data quality such as high
intensity deviation or a low number of detected isotope peaks. The
scored isotope
Fig. 2 Example of a fragmentation tree computed from a fragmentation graph in (a), given the spectrum in
(b). The molecular formula of the neutral precursor is assumed to be C9H12NO2. Molecular formulas are
computed for all fragment peaks and serve as the nodes of the graph; nodes with the same color indicate
molecular formulas corresponding to the same peaks. Nodes are connected by edges if one node is a
subformula of another, thereby creating the fragmentation graph. A fragmentation tree is a connected
subgraph which explains each color (peak) at most once and has no cycles. The best-scoring fragmentation
tree, corresponding to a Maximum A Posteriori estimator, is computed by combinatorial optimization. The
optimal fragmentation tree is indicated by solid lines; nodes which are not used are grayed out. These
computations are repeated for each molecular formula candidate explaining the precursor mass, and the best
such fragmentation tree is reported
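The construction of the fragmentation graph in Fig. 2 rests on a subformula test: an edge connects two nodes if one molecular formula can be obtained from the other by losing atoms. The Python sketch below implements only this test and the resulting edge list. It does not compute the optimal fragmentation tree, and the function names and node formulas are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Turn a formula string such as 'C9H9O2' into element counts."""
    counts = Counter()
    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[el] += int(n or 1)
    return counts


def is_subformula(child, parent):
    """True if child is a proper subformula of parent."""
    c, p = parse_formula(child), parse_formula(parent)
    return c != p and all(p[el] >= n for el, n in c.items())


def fragmentation_graph_edges(formulas):
    """Directed edges (parent, child) between all subformula pairs."""
    return [(p, c) for p in formulas for c in formulas
            if is_subformula(c, p)]


# Hypothetical node formulas, for illustration only.
nodes = ["C9H12NO2", "C9H9O2", "C8H10N", "C7H7"]
edges = fragmentation_graph_edges(nodes)
```

Here no edge joins C9H9O2 and C8H10N, because neither is a subformula of the other; selecting the best-scoring tree within such a graph is the combinatorial optimization step described in the caption.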
Fig. 3 Additional candidates are added to the SIRIUS Overview tab after searching with CSI:FingerID in a
structure database considering adduct types [M + H]+, [M + NH4]+ and [M − H2O + H]+. Molecular formulas
C24H40O4 and C24H38O3 differ by an in-source loss of H2O and are not distinguishable by MS/MS since in both
cases, the ion [C24H38O3 + H]+ is fragmented; hence, both have identical scores. (The same holds for the pairs
C22H33N2O2 vs. C22H36N3O2 and C18H39N4O2P vs. C18H36N3O2P). Displayed is the resolved fragmentation tree
for [C24H40O4 − H2O + H]+, where an H2O loss has been added to its top
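The equivalence stated in the caption of Fig. 3 can be verified by simple atom bookkeeping. The Python sketch below applies an adduct to a neutral formula and returns the elemental composition of the resulting ion. The adduct table and function names are assumptions made for this illustration and do not reflect how SIRIUS represents adducts internally.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Turn a formula string such as 'C24H40O4' into element counts."""
    counts = Counter()
    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[el] += int(n or 1)
    return counts


# Adduct type -> (atoms gained, atoms lost); charge is not tracked here.
ADDUCTS = {
    "[M+H]+": (Counter({"H": 1}), Counter()),
    "[M+NH4]+": (Counter({"N": 1, "H": 4}), Counter()),
    "[M-H2O+H]+": (Counter({"H": 1}), Counter({"H": 2, "O": 1})),
}


def ion_composition(formula, adduct):
    """Elemental composition of the ion formed from formula and adduct."""
    gained, lost = ADDUCTS[adduct]
    atoms = parse_formula(formula) + gained
    atoms.subtract(lost)
    return {el: n for el, n in atoms.items() if n > 0}


ion_a = ion_composition("C24H40O4", "[M-H2O+H]+")
ion_b = ion_composition("C24H38O3", "[M+H]+")
```

Both calls yield C24H39O3, which is why the two candidates receive the same fragmentation tree score and cannot be told apart from the MS/MS spectrum alone.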
1
ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf
Fig. 4 The CSI:FingerID Details tab displays structure candidates for a selected molecular formula. The
highlighted molecular property, which is predicted to be present in the query, is contained in the top 2 hits.
Candidates are sorted by their score which is displayed on the right-hand side. Numbers in percent indicate
the Tanimoto similarity between the predicted fingerprint and the fingerprint of each candidate. Candidates
can be filtered by database, SMARTS string and XlogP value
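The Tanimoto similarity mentioned in the caption of Fig. 4 is the size of the intersection of two fingerprints divided by the size of their union. A minimal Python sketch follows; it assumes binary fingerprints represented as sets of the indices of the properties that are present. The indices are hypothetical, and how CSI:FingerID converts its predicted probabilities into the displayed percentage is not covered here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints.

    Each fingerprint is an iterable of indices of properties that are present.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    # Two empty fingerprints are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical property indices, for illustration only.
predicted = {1, 4, 7, 9}
candidate_1 = {1, 4, 7, 8}
candidate_2 = {1, 4, 7, 9}

sim_1 = tanimoto(predicted, candidate_1)
sim_2 = tanimoto(predicted, candidate_2)
```

Multiplying the result by 100 gives a percentage; a value of 100% means the two fingerprints agree on every property.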
4.2 Searching in Structure Databases
After the molecular formula has been identified, the compound is
searched in a structure database. Firstly, a molecular fingerprint of
the query (see Tip 5) is predicted from the spectrum and
fragmentation tree. Next, this predicted fingerprint is compared to (and
Metabolite Identification Using SIRIUS 197
4.2.1 Judging Results
Users should check if the best structure candidate agrees with the
best molecular formula candidate. Sometimes, CSI:FingerID deci-
des that, based on its machine learning model and the given candi-
date structures, a structure with a different molecular formula
better agrees with the data. Users should verify if the selected
structure database does not contain any structures for the best-
2
https://www.ncbi.nlm.nih.gov/pubmed
4.3 Beyond Structure Database Search
It is understood that certain query biomolecules are not contained
in any structure database. But even for such difficult instances,
SIRIUS and CSI:FingerID can assist in structural elucidation.
Recall that the SIRIUS molecular formula annotation step
(Subheading 4.1) is done de novo. Hence, molecular formulas can be
determined even for “novel compounds” absent from any structure
database. Even if a structure is not contained in the structure
databases, CSI:FingerID may find a very similar structure.
Furthermore, CSI:FingerID allows the user to search in custom databases,
which may contain hypothetical structures, to identify “novel
compounds.”
But one key feature sets CSI:FingerID apart from other
computational tools for structure elucidation: predicting the
molecular fingerprint of the query compound does not require
any molecular structure database! The fingerprint is predicted
from the fragmentation spectrum and tree, and contains information
about thousands of molecular properties. From that, we may draw
conclusions about what kinds of substructures the query compound
contains; and this information may be sufficient to decide whether
the examined compound is worth investigating further.
4.3.1 Judging Results
The predicted fingerprint is displayed in the Predicted Fingerprint
tab, see Fig. 5. Most molecular properties are described by
SMARTS (SMiles ARbitrary Target Specification) strings (see footnote 3).
SMARTS allows a flexible encoding of substructures; for example,
a property might be described as “a methyl group bound to a
hetero atom.” Since SMARTS strings are usually hard to visualize,
SIRIUS displays a set of example structures from the training data
that have a particular molecular property.
A posterior probability is predicted for every molecular property.
Estimates close to 1 indicate that the property is likely
present in the query compound, whereas estimates close to 0
indicate that it is not. But be careful: Since CSI:FingerID predicts thousands
of properties, even some “rather certain predictions” must be
wrong. A 98% chance of being present also corresponds to a 2%
chance of being absent; if 1000 molecular properties are predicted
at this level of certainty, then about 20 of these predictions are
expected to be wrong. Also be
3 http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
Fig. 5 The Predicted Fingerprint tab displays a predicted molecular fingerprint for each molecular formula
candidate. The molecular fingerprint is predicted independently of any database. It can help deduce
structural information on the compound even if the compound is not present in any structure database.
Highlighted is a property mainly consisting of a ring. Training examples are displayed at the bottom. As shown,
the oxygen is not a mandatory part of the substructure. The posterior probability of each property is also
visualized as a color bar, to allow the user to swiftly distinguish properties predicted to be present or absent.
Green bars going to the right encode presence, red bars going to the left encode absence
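The caveat about “rather certain predictions” in Subheading 4.3.1 is simple arithmetic, sketched here in R. The numbers are the ones used in the text (1000 properties at a posterior probability of 0.98); treating the individual predictions as independent is an additional assumption made only for the last line.

```r
posterior    <- 0.98   # predicted probability that a property is present
n_properties <- 1000   # number of properties predicted at this certainty

# Expected number of wrong "present" calls: each one is wrong with
# probability 1 - posterior
expected_wrong <- n_properties * (1 - posterior)
expected_wrong

# Assumption: predictions are independent. Then the chance that at least
# one of the 1000 predictions is wrong is practically 1.
p_at_least_one_wrong <- 1 - posterior^n_properties
p_at_least_one_wrong
```

The same calculation works for a whole fingerprint: summing 1 - p over all properties called present gives the expected number of false calls among them.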
5.1 The SIRIUS Project-Space
The SIRIUS project-space is a standardized directory structure that
is organized in three hierarchy levels, namely the project level, the
compound level, and the method level (see Fig. 6 for details).
On the project level, each compound corresponds to one
sub-directory (compound level) storing the input data, parameters,
and results of the different analysis methods. These data are
continuously written to the project-space, so that it represents the actual
progress of a SIRIUS analysis. Further, the .progress file gives an
overview of the progress of the ongoing analysis. On the compound
level, each method provided by SIRIUS stores its results in its
own sub-directory (method level). This allows the user to redo one
analysis step without having to recompute the intermediate results
it depends on. Further, SIRIUS is able to transfer intermediate
results to a new project-space, so different parameters can easily
be evaluated without having to recompute intermediate results.
Since a project-space can be imported into the GUI, the user is
able to judge intermediate results using the GUI before executing
further analysis steps. Project-spaces can be read and written as an
uncompressed directory or, when using the .sirius file extension,
as a compressed zip archive.
In addition to the method level results, the project-space contains
summaries of these results on the project level and the compound
level. These summaries are in csv format to provide easy
access to the results for further downstream analysis, data sharing,
and data visualization. The summaries are not imported into SIRIUS
but are (re-)created based on the actual results every time a
project-space is exported.
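Because the project-space is an ordinary directory tree with csv summaries, it can be inspected with a few lines of R. The sketch below builds a small mock directory first so that it runs on its own; all directory and file names in it (compound_1, method_A, summary.csv, the csv columns) are hypothetical placeholders and not the names SIRIUS actually writes.

```r
# Build a mock three-level directory (project / compound / method).
# All names are hypothetical placeholders, not real SIRIUS output names.
project <- file.path(tempdir(), "mock_project_space")
unlink(project, recursive = TRUE)
for (compound in c("compound_1", "compound_2")) {
  for (method in c("method_A", "method_B")) {
    dir.create(file.path(project, compound, method), recursive = TRUE)
  }
}
writeLines(
  c("id,score", "compound_1,0.9", "compound_2,0.7"),
  file.path(project, "summary.csv")
)

# Project level: one sub-directory per compound
compounds <- list.dirs(project, full.names = FALSE, recursive = FALSE)

# Compound level: one sub-directory per method applied to that compound
methods_per_compound <- lapply(
  setNames(compounds, compounds),
  function(cmp) list.dirs(file.path(project, cmp),
                          full.names = FALSE, recursive = FALSE)
)

# Project-level summary in csv format, ready for downstream analysis
project_summary <- read.csv(file.path(project, "summary.csv"))
project_summary
```

For a real project-space, only the path in project needs to change; the summary file names should be taken from the exported directory itself.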
Fig. 6 The SIRIUS project-space is a standardized directory structure that stores results, summarized results,
input data, parameters, and version information of a SIRIUS analysis. It is organized on three levels, namely
the project level (dashed line), the compound level (dashed-and-dotted line), and the method level (solid line).
The project level contains sub-directories (blue) for each compound, summaries (green) about the whole
dataset, and additional information (red) about the version of SIRIUS that created the output. The compound
level contains a sub-directory for each method that was applied to the compound as well as the summaries of
these methods' results. Further, it contains additional information, such as the input data and the parameters
used for the computations. On the method level, SIRIUS stores the results of a specific method for a given
compound (grey)
5.2 Standardized Project-Space Summary with mzTab-M
The project-space is a SIRIUS-specific format that allows the user
to access all results and analysis details, but may not be optimal for
sharing these data with third-party tools or data archives. For this
purpose, SIRIUS provides an analysis report (report.mztab) in
the standardized mzTab-M format [14]. All results summarized in
this report are linked to the results in the corresponding SIRIUS
project-space, allowing the user to share summarized results using
mzTab-M without losing the connection to the detailed results
provided in the project-space. Furthermore, SIRIUS passes
meta-information such as scan numbers and identifiers of the input data
into this analysis report. This allows for an easy combination of the
SIRIUS results with the results of other analyses such as MS1-based
quantification.
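Since mzTab-M is a tab-separated text format in which every line starts with a section prefix (for example MTD for metadata, SMH for the small molecule header, and SML for small molecule rows), such a report can be split into its sections in base R. The following is a minimal sketch on mock lines; the mock rows are heavily shortened and do not contain all columns a valid mzTab-M file requires, and the helper split_mztab() is illustrative, not part of SIRIUS.

```r
# Shortened mock lines; a real report.mztab has many more columns and rows
mztab_lines <- c(
  "MTD\tmzTab-version\t2.0.0-M",
  "MTD\tmzTab-ID\tmock_report",
  "SMH\tSML_ID\tchemical_formula",
  "SML\t1\tC24H40O4",
  "SML\t2\tC24H38O3"
)

# Illustrative helper: group the tab-separated fields by their line prefix
split_mztab <- function(lines) {
  lines  <- lines[nzchar(lines)]                  # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  prefix <- vapply(fields, `[`, character(1), 1)  # first field = section
  split(lapply(fields, `[`, -1), prefix)          # remaining fields
}

sections <- split_mztab(mztab_lines)

# Small molecule table: SML rows, with column names taken from the SMH line
sml <- as.data.frame(do.call(rbind, sections$SML), stringsAsFactors = FALSE)
names(sml) <- sections$SMH[[1]]
sml
```

For a real report, mztab_lines would be obtained with readLines("report.mztab"); dedicated mzTab-M readers are preferable when available.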
6 Custom Databases
Users may define their own structure databases to search in. These
“custom databases” can be created via GUI and CLI. In the GUI,
the Databases button opens a dialogue listing existing databases.
New ones can be created with one click. Structures are imported
by inserting structure descriptors (InChI or SMILES) into the
import field, one structure per line. Custom databases are useful
when the user has a limited set of structures of interest. When
screening for pollutants or drugs, a list of suspected structures can
be collected in advance.
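A suspect list in the “one structure per line” layout can be prepared with a short R sketch like the one below. The three SMILES strings are well-known examples chosen for illustration; the file name suspect_list.txt is a placeholder, and the resulting lines would be pasted into the import field described above.

```r
# Illustrative suspect list: names are only used for bookkeeping in R
suspects <- c(
  ethanol  = "CCO",
  caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
  aspirin  = "CC(=O)OC1=CC=CC=C1C(=O)O"
)

# Remove exact duplicate descriptors before import
suspects <- suspects[!duplicated(suspects)]

# Write one SMILES per line (placeholder file name)
import_file <- file.path(tempdir(), "suspect_list.txt")
writeLines(unname(suspects), import_file)

readLines(import_file)
```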
When searching with CSI:FingerID it does not matter if the
structures in the database are known biomolecules or if these are
hypothetical structures, which have not yet been discovered in any
organism. Clearly, it is not reasonable to search in an arbitrarily
large database. Databases of hypothetical structures have to be
compiled with care to avoid combinatorial explosion. Available
tools are BioTransformer [6] and the in silico generated MINE
databases [17]. Currently, there exist MINE extensions for EcoCyc
[19], YMDB [28], and KEGG [18]. But in principle, any existing
structure database can be extended by such methods. Say, you are
interested in finding new bile acids. A database of hypothetical bile
acids can be created by applying biotransformations to known bile
acids. This new database can then be searched with CSI:FingerID
to find new bile acids synthesized by the investigated organisms.
7 Conclusion
References
1. Allen F, Greiner R, Wishart D (2015) Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11(1):98–110. https://doi.org/10.1007/s11306-014-0676-4
2. Böcker S (2017) Searching molecular structure databases using tandem MS data: are we there yet? Curr Opin Chem Biol 36:1–6. https://doi.org/10.1016/j.cbpa.2016.12.010. https://authors.elsevier.com/a/1UF-u4sz6LvFfY
3. Böcker S, Dührkop K (2016) Fragmentation trees reloaded. J Cheminform 8:5. https://doi.org/10.1186/s13321-016-0116-8. http://www.jcheminf.com/content/8/1/5
4. Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, Holland TA, Keseler IM, Kothari A, Kubo A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Subhraveti P, Weaver DS, Weerasinghe D, Zhang P, Karp PD (2014) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 42(D1):D459–D471. https://doi.org/10.1093/nar/gkt1103. http://nar.oxfordjournals.org/content/42/D1/D459.abstract
5. da Silva RR, Dorrestein PC, Quinn RA (2015) Illuminating the dark matter in metabolomics. Proc Natl Acad Sci U S A 112(41):12549–12550. https://doi.org/10.1073/pnas.1516878112
6. Djoumbou-Feunang Y, Fiamoncini J, Gil-de-la Fuente A, Greiner R, Manach C, Wishart DS (2019) BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J Cheminf 11(1):2
7. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S (2015) Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci U S A 112(41):12580–12585. https://doi.org/10.1073/pnas.1509788112
8. Dührkop K, Lataretu MA, White WTJ, Böcker S (2018) Heuristic algorithms for the maximum colorful subtree problem. In: Proceedings of workshop on algorithms in bioinformatics (WABI 2018). Leibniz international proceedings in informatics (LIPIcs), vol 113. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, pp 23:1–23:14. https://doi.org/10.4230/LIPIcs.WABI.2018.23. http://drops.dagstuhl.de/opus/volltexte/2018/9325
9. Dührkop K, Fleischauer M, Ludwig M, Aksenov AA, Melnik AV, Meusel M, Dorrestein PC, Rousu J, Böcker S (2019) Sirius 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. https://doi.org/10.1038/s41592-019-0344-8
10. Fonger GC, Hakkinen P, Jordan S, Publicker S (2014) The National Library of Medicine’s (NLM) Hazardous Substances Data Bank (HSDB): background, recent enhancements and future plans. Toxicology 325:209–216. https://doi.org/10.1016/j.tox.2014.09.003
11. Gu J, Gui Y, Chen L, Yuan G, Lu HZ, Xu X (2013) Use of natural products as chemical library for drug discovery and network pharmacology. PLoS One 8(4):1–10
12. Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C (2016) ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res 44(D1):D1214–D1219. https://doi.org/10.1093/nar/gkv1031. http://europepmc.org/articles/PMC4702775
13. Heinonen M, Shen H, Zamboni N, Rousu J (2012) Metabolite identification and molecular fingerprint prediction via machine learning. Bioinformatics 28(18):2333–2341. https://doi.org/10.1093/bioinformatics/bts437
14. Hoffmann N, Rein J, Sachsenberg TT, Hartler J, Haug K, Mayer G, Alka O, Dayalan S, Pearce JTM, Rocca-Serra P et al (2019) mzTab-M: a data standard for sharing quantitative results in mass spectrometry metabolomics. Anal Chem 91(5):3302–3310. https://doi.org/10.1021/acs.analchem.8b04310
15. Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y, Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T, Matsuda F, Sawada Y, Hirai MY, Nakanishi H, Ikeda K, Akimoto N, Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D, Neumann S, Iida T, Tanaka K, Funatsu K, Matsuura F, Soga T, Taguchi R, Saito K, Nishioka T (2010) MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom 45(7):703–714. https://doi.org/10.1002/jms.1777
16. Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52(7):1757–1768
17. Jeffryes JG, Colastani RL, Elbadawi-Sidhu M, Kind T, Niehaus TD, Broadbelt LJ, Hanson AD, Fiehn O, Tyo KEJ, Henry CS (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for
Biotechnology in agriculture and forestry, vol 57. Springer, Berlin, pp 165–181
35. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E (2003) The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci 43:493–500
36. Tautenhahn R, Cho K, Uritboonthai W, Zhu Z, Patti GJ, Siuzdak G (2012) An accelerated workflow for untargeted metabolomics using the METLIN database. Nat Biotechnol 30(9):826–828. https://doi.org/10.1038/nbt.2348
37. Tsugawa H, Kind T, Nakabayashi R, Yukihira D, Tanaka W, Cajka T, Saito K, Fiehn O, Arita M (2016) Hydrogen rearrangement rules: computational ms/ms fragmentation and structure elucidation using MS-FINDER software. Anal Chem 88(16):7946–7958. https://doi.org/10.1021/acs.analchem.6b00770
38. Wang R, Fu Y, Lai L (1997) A new atom-additive method for calculating partition coefficients. J Chem Inf Comput Sci 37(3):615–621. https://doi.org/10.1021/ci960169p
39. Wang R, Gao Y, Lai L (2000) Calculating partition coefficient by atom-additive method. Perspect Drug Discov Des 19(1):47–66. https://doi.org/10.1023/A:1008763405023
40. Wang Y, Kora G, Bowen BP, Pan C (2014) MIDAS: a database-searching algorithm for metabolite identification in metabolomics. Anal Chem 86(19):9496–9503. https://doi.org/10.1021/ac5014783
41. Wang M et al (2016) Sharing and community curation of mass spectrometry data with Global Natural Products Social molecular networking. Nat Biotechnol 34(8):828–837. https://doi.org/10.1038/nbt.3597
42. Weber RJM, Li E, Bruty J, He S, Viant MR (2012) MaConDa: a publicly accessible mass spectrometry contaminants database. Bioinformatics 28(21):2856–2857. https://doi.org/10.1093/bioinformatics/bts527
43. Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C (2017) The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminf 9(1):33. http://dx.doi.org/10.1186/s13321-017-0220-4
44. Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K, Vázquez-Fresno R, Sajed T, Johnson D, Li C, Karu N, Sayeeda Z, Lo E, Assempour N, Berjanskii M, Singhal S, Arndt D, Liang Y, Badran H, Grant J, Serra-Cayuela A, Liu Y, Mandal R, Neveu V, Pon A, Knox C, Wilson M, Manach C, Scalbert A (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46(D1):D608–D617. http://dx.doi.org/10.1093/nar/gkx1089
45. Wolf S, Schmidt S, Müller-Hannemann M, Neumann S (2010) In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinf 11:148. https://doi.org/10.1186/1471-2105-11-148
Chapter 12
Abstract
High-throughput mass spectrometry (MS) metabolomics profiling of highly complex samples allows the
comprehensive detection of hundreds to thousands of metabolites under a given condition and point in
time and produces information-rich data sets on known and unknown metabolites. One of the main
challenges is the identification and annotation of metabolites from these complex data sets since the number
of authentic standards available for specialized metabolites is far lower than the number of
mass spectral features. Previously, we reported two novel tools, MetNet and MetCirc, for putative
annotation and structural prediction of unknown metabolites using known metabolites as baits. MetNet
employs differences between m/z values of MS1 features, which correspond to metabolic transformations,
and statistical associations, while MetCirc uses MS/MS features as input and calculates similarity scores
of aligned spectra between features to guide the annotation of metabolites. Here, we showcase the use of
MetNet and MetCirc to putatively annotate metabolites and provide detailed instructions as to how
those can be used. While our case studies are from plants, the tools find equal utility in studies on bacterial,
fungal, or mammalian xenobiotic samples.
Key words Annotation, Plant metabolite, Specialized metabolite, Unknown metabolite, Metabolic
modification, Molecular networking
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_12, © Springer Science+Business Media, LLC, part of Springer Nature 2020
210 Thomas Naake et al.
2 Materials
2.2 Input Data
1. m × n peak table of m MS1 features containing m/z values,
retention time, and intensity values for n samples (MetNet, see
Notes 3–5). A peak table can be acquired by following the
protocol of Shimizu et al. [34] using an Orbitrap mass
spectrometer or similar and running the below-mentioned
xcms/CAMERA script.
2. MS2 data from DDA or DIA mass spectrometry formatted as a
peak table or as a .msp file (MetCirc, see Note 4). A peak table
can be acquired according to Li et al. [28] using an Orbitrap or
qTOF mass spectrometer.
3 Methods
library(MetNet)
3. Load the peak table into the R session. This step will differ
depending on how the peak table is stored. If the peak table
is stored in tabular format in a file (here in the peaklist.txt file),
load the file by reading it into the R session.
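A minimal sketch of this loading step in R is given below. It assumes a tab-separated text file with a header row and one MS1 feature per row; the separator, the column names (mz, rt, sample_1, sample_2) and the mock values are assumptions for illustration, so the column names MetNet expects should be checked in the package documentation. A mock file is written first so that the sketch runs on its own.

```r
# Mock peak table written to a temporary file so the sketch is
# self-contained; in practice, set peak_file to your own peaklist.txt
peak_file <- file.path(tempdir(), "peaklist.txt")
mock <- data.frame(
  mz       = c(300.1000, 462.1528),
  rt       = c(120.5, 340.2),
  sample_1 = c(15000, 2300),
  sample_2 = c(14200, 2800)
)
write.table(mock, peak_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Assumption: tab-separated text with a header row, features in rows
peaklist <- read.table(peak_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
str(peaklist)
```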
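# xcms/CAMERA script for generating a peak table from raw LC-MS files.
# xcmsSet() is called without a file list, so it processes the raw data
# files found in the current working directory.
library(xcms)
library(CAMERA)
# xcms: peak detection, grouping, retention time correction,
# regrouping, and filling in of missing peaks
xset <- xcmsSet(method = "centWave", ppm = 30, snthresh = 10, peakwidth = c(5, 20))
xset2 <- group(xset, method = "density", minfrac = 0.5, minsamp = 2, bw = 2, mzwid = 0.025)
xset3 <- retcor(xset2, family = "s", plottype = "m", missing = 1, extra = 1, span = 1)
xset4 <- group(xset3, method = "density", bw = 2, mzwid = 0.025, minfrac = 0.5, minsamp = 2)
xset5 <- fillPeaks(xset4, method = "chrom")
# CAMERA: grouping of co-eluting features, isotope and adduct annotation
an <- xsAnnotate(xset5)
anF <- groupFWHM(an, perfwhm = 0.6)
anI <- findIsotopes(anF, mzabs = 0.01)
anIC <- groupCorr(anI, cor_eic_th = 0.75, graphMethod = "lpc")
anFA <- findAdducts(anIC, polarity = "positive")
# Export the annotated peak table
peaklist <- getPeaklist(anFA)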
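library(igraph)
# Build an undirected network from the adjacency matrix cons_adj
net <- graph_from_adjacency_matrix(cons_adj, mode = "undirected", diag = FALSE)
# Determine the connected components of the network
net_comp <- components(net)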
Plant Metabolite Annotation 217
Fig. 1 Typical result of the MetNet pipeline. Based on a Nicotiana MS2 dataset, MetNet created a network
considering the m/z shifts corresponding to hydroxylation, malonylation, rhamnosylation, and glucosylation
and statistical associations using Pearson and Spearman correlation as well as CLR and ARACNE as statistical
models (see also [22] for further details). For the whole dataset, 2,417,704 and 3924 links between metabolic
features were determined for statistical and structural associations (intersection of 2026 links). Displayed are
network components with 10 or more members. Network #13 is enriched in diterpene glycosides (DTG) but
also contains unknown metabolites collected in the dataset. Based on m/z differences to known DTG, these
unknowns can be putatively annotated. For network #13, 704 and 87 links between metabolic features were
determined for statistical and structural annotations (intersection of 71 links)
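The structural part of such a network rests on a simple idea: two features are linked if their m/z difference matches the mass shift of a known transformation. The base R sketch below illustrates this for the four transformations named in the caption. It is not the MetNet implementation; the helper structural_adjacency(), the 10 ppm tolerance, and the toy m/z values are assumptions made for illustration.

```r
# Monoisotopic mass shifts (Da) of the transformations named above
transformations <- c(
  hydroxylation  = 15.9949,   # + O
  malonylation   = 86.0004,   # + C3H2O3
  rhamnosylation = 146.0579,  # + C6H10O4
  glucosylation  = 162.0528   # + C6H10O5
)

# Illustrative helper (not part of MetNet): link two features if their
# m/z difference matches any shift within a ppm tolerance
structural_adjacency <- function(mz, shifts, ppm = 10) {
  n <- length(mz)
  adj <- matrix(0L, n, n, dimnames = list(names(mz), names(mz)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next
      delta <- abs(mz[i] - mz[j])
      tol <- max(mz[i], mz[j]) * ppm * 1e-6
      if (any(abs(delta - shifts) <= tol)) adj[i, j] <- 1L
    }
  }
  adj
}

# Toy features: f2 = f1 + glucosylation, f3 = f1 + malonylation,
# f4 is unrelated
mz <- c(f1 = 300.1000, f2 = 462.1528, f3 = 386.1004, f4 = 350.2000)
adj <- structural_adjacency(mz, transformations)
adj
```

In a consensus approach as described in the caption, only those structural links would be kept that are also supported by a statistical association between the intensities of the two features.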
library(MetCirc)
3. Load MS2 data and convert them to the MSP class in either of
two ways:
(a) Load a data frame (here msms.txt) with the column “id”
(unique identifiers for MS/MS features) and the columns
“mz” and “intensity” comprising the fragment ions and
their intensities (see Notes 16–19). The data frame is
loaded into the R session by entering the load command
and hitting Enter.
(b) Load a file in .msp format, a typical data format
for storing MS/MS libraries. Required properties of such
a file are the name of the metabolite (row entry
“NAME:”), the m/z value of the precursor ion
(“PRECURSORMZ” or “EXACTMASS:”), the number of
peaks of the feature (“Num Peaks:”), and information
on fragments and peak areas (see Note 20). To load the
file in .msp format and convert it to an MSP class object,
enter the corresponding command in the R session and hit Enter.
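To make the required entries of a .msp file concrete, the following base R sketch parses a small mock library into a list of spectra. It is only an illustration of the file layout described in (b): the helper parse_msp(), the mock entries, and the assumption that each peak line holds an m/z value and an intensity separated by whitespace are not taken from MetCirc, whose own conversion functions should be used in the actual workflow.

```r
# Mock .msp content with the required entries (values are made up)
msp_text <- c(
  "NAME: feature_1",
  "PRECURSORMZ: 498.26",
  "Num Peaks: 2",
  "100.05 1500",
  "250.10 800",
  "",
  "NAME: feature_2",
  "PRECURSORMZ: 352.18",
  "Num Peaks: 1",
  "120.08 300"
)

# Illustrative helper (not part of MetCirc): one list element per entry
parse_msp <- function(lines) {
  lines  <- trimws(lines)
  lines  <- lines[nzchar(lines)]
  starts <- grep("^NAME:", lines, ignore.case = TRUE)
  ends   <- c(starts[-1] - 1L, length(lines))
  lapply(seq_along(starts), function(k) {
    block <- lines[starts[k]:ends[k]]
    field <- function(key) {
      hit <- grep(paste0("^", key, ":"), block, ignore.case = TRUE, value = TRUE)
      if (length(hit) == 0L) NA_character_ else trimws(sub("^[^:]*:", "", hit[1]))
    }
    n_peaks    <- as.integer(field("Num Peaks"))
    peak_start <- grep("^Num Peaks:", block, ignore.case = TRUE) + 1L
    peak_lines <- block[seq(peak_start, length.out = n_peaks)]
    peaks <- do.call(rbind, lapply(strsplit(peak_lines, "[ \t]+"), as.numeric))
    list(
      name         = field("NAME"),
      precursor_mz = as.numeric(field("PRECURSORMZ")),
      peaks        = data.frame(mz = peaks[, 1], intensity = peaks[, 2])
    )
  })
}

spectra <- parse_msp(msp_text)
spectra[[1]]
```

For a file on disk, msp_text would be replaced by the result of readLines() on the .msp file.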
[Fig. 2 image labels: panel A, dendrogram (axis: Height; clusters #1 to #5); Shiny interface, Species 1 to Species 4, all MS/MS similarity edges, visualized similarity edges for a given threshold (here = 1, identical MS/MS), selected MS/MS (here m/z 498.26 / RT 596 s); panel C, Cluster #2: N. attenuata, N. obtusifolia, N. clevelandii, N. quadrivalvis, N. x obtusiata 10/27, N. x obtusiata 57/126; MS/MS similarity = 1; nicotianoside X]
Fig. 2 Typical result of MetCirc pipeline. (a) Hierarchical clustering of MS/MS features based on distance
between similarity scores prior to MetCirc analysis. Clustering will extract clusters containing highly similar
MS/MS features (with high similarity scores). (b) Interface of the MetCirc Shiny application. Within the Shiny
application MS/MS features can be navigated, reordered, annotated, and links between MS2 features
thresholded and selected based on the type of links. (c) Exemplary MetCirc output for diterpene glycosides
(DTG) for six Nicotiana species. For all panels, MS2 spectra were collected from leaves of six Nicotiana
species and used for processing through the MetCirc pipeline. MetCirc visualization is built for Cluster #2
enriched in DTGs. From this compound class, nicotianoside X previously identified in Nicotiana attenuata is
selected in the interface. Similarity threshold is fixed to 1 (cross-species MS2 identity) which readily allows for
the dereplication (reidentification) of nicotianoside X in three of the five other Nicotiana species tested
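The clustering shown in panel (a) can be approximated in base R by converting pairwise similarity scores into distances and applying hierarchical clustering. The sketch below uses a made-up similarity matrix for four MS/MS features; the choice of 1 - similarity as distance, average linkage, and two clusters are assumptions for illustration and not necessarily the settings used for the figure.

```r
# Made-up pairwise MS/MS similarity scores (1 = identical spectra)
feature_ids <- paste0("feat", 1:4)
sim <- matrix(c(
  1.00, 0.95, 0.10, 0.05,
  0.95, 1.00, 0.15, 0.10,
  0.10, 0.15, 1.00, 0.90,
  0.05, 0.10, 0.90, 1.00
), nrow = 4, byrow = TRUE, dimnames = list(feature_ids, feature_ids))

# Assumption: distance = 1 - similarity, average linkage
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the dendrogram into two clusters of highly similar features
clusters <- cutree(hc, k = 2)
clusters

# Features of one cluster, e.g., to restrict the subsequent analysis
names(clusters)[clusters == clusters[["feat1"]]]
```

plot(hc) draws the corresponding dendrogram with the merge height on the vertical axis, as in panel (a).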
selectedFeatures$msp
4 Notes
Acknowledgments
References
1. Dixon RA, Strack D (2003) Phytochemistry meets genome analysis, and beyond. Phytochemistry 62:815–816
2. Rai A, Saito K, Yamazaki M (2017) Integrated omics analysis of specialized metabolism in medicinal plants. Plant J 90:764–787
3. Fernie AR, Trethewey RN, Krotzky AJ et al (2004) Metabolite profiling: from diagnostics to systems biology. Nat Rev Mol Cell Biol 5:763–769
4. Alseekh S, Fernie AR (2018) Metabolomics 20 years on: what have we learned and what hurdles remain? Plant J 94:933–942
5. Ziegler J, Facchini PJ (2008) Alkaloid biosynthesis: metabolism and trafficking. Annu Rev Plant Biol 59:735–769
6. Wink M (2004) Phytochemical diversity of secondary metabolites. In: Goodman RM (ed) Encyclopedia of plant and crop science. Marcel Dekker, New York, pp 915–919
7. Wink M (2015) Modes of action of herbal medicines and plant secondary metabolites. Medicines (Basel) 2:251–286
8. Tohge T, Alseekh S, Fernie AR (2014) On the regulation and function of secondary metabolism during fruit development and ripening. J Exp Bot 65:4599–4611
9. Van Der Hooft JJJ, Wandy J, Young F et al (2017) Unsupervised discovery and comparison of structural families across multiple samples in untargeted metabolomics. Anal Chem 89:7569–7577
10. Perez De Souza L, Naake T, Tohge T et al (2017) From chromatogram to analyte to metabolite. How to pick horses for courses from the massive web resources for mass spectral plant metabolomics. Gigascience 6:1–20
11. Kopka J, Schauer N, Krueger S et al (2005) GMD@CSB.DB: the Golm metabolome database. Bioinformatics 21:1635–1638
12. D’auria JC, Gershenzon J (2005) The secondary metabolism of Arabidopsis thaliana: growing like a weed. Curr Opin Plant Biol 8:308–316
13. Li X, Svedin E, Mo HP et al (2014) Exploiting natural variation of secondary metabolism identifies a gene controlling the glycosylation diversity of dihydroxybenzoic acids in Arabidopsis thaliana. Genetics 198:1267
14. Sweetlove LJ, Fernie AR (2013) The spatial organization of metabolism within the plant cell. Annu Rev Plant Biol 64:723–746
15. Kuhl C, Tautenhahn R, Bottcher C et al (2012) CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem 84:283–289
16. Alonso A, Julia A, Beltran A et al (2011) AStream: an R package for annotating LC/MS metabolomic data. Bioinformatics 27:1339–1340
17. Uppal K, Walker DI, Jones DP (2017) xMSannotator: an R package for network-based annotation of high-resolution metabolomics data. Anal Chem 89:1063–1067
18. Qiu F, Fine DD, Wherritt DJ et al (2016) PlantMAT: a metabolomics tool for predicting the specialized metabolic potential of a system and for large-scale metabolite identifications. Anal Chem 88:11373–11383
19. Li SZ, Park Y, Duraisingham S et al (2013) Predicting network activity from high throughput metabolomics. PLoS Comput Biol 9:e1003123
20. Van Der Hooft JJ, Wandy J, Barrett MP et al (2016) Topic modeling for untargeted substructure exploration in metabolomics. Proc Natl Acad Sci U S A 113:13738–13743
21. Treutler H, Tsugawa H, Porzel A et al (2016) Discovering regulated metabolite families in untargeted metabolomics studies. Anal Chem 88:8082–8090
22. Naake T, Fernie AR (2019) MetNet: metabolite network prediction from high-resolution mass spectrometry data in R aiding metabolite annotation. Anal Chem 91:1768–1772
23. Naake T, Gaquerel E (2017) MetCirc: navigating mass spectral similarity in high-resolution MS/MS metabolomics data. Bioinformatics 33:2419–2420
24. Breitling R, Ritchie S, Goodenowe D et al (2006) Ab initio prediction of metabolic networks using Fourier transform mass spectrometry data. Metabolomics 2:155–164
25. Steuer R (2006) Review: on the analysis and interpretation of correlations in metabolomic data. Brief Bioinform 7:151–158
26. Morreel K, Saeys Y, Dima O et al (2014) Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. Plant Cell 26:929–945
27. Gaquerel E, Kuhl C, Neumann S (2013) Computational annotation of plant metabolomics profiles via a novel network-assisted approach. Metabolomics 9:904–918
28. Li D, Baldwin IT, Gaquerel E (2015) Navigating natural variation in herbivory-induced secondary metabolism in coyote tobacco populations using MS/MS structural analysis. Proc Natl Acad Sci U S A 112:E4147–E4155
29. Watrous J, Roach P, Alexandrov T et al (2012) Mass spectral molecular networking of living microbial colonies. Proc Natl Acad Sci U S A 109:E1743–E1752
30. Gaquerel E, Heiling S, Schoettner M et al (2010) Development and validation of a liquid chromatography-electrospray ionization-time-of-flight mass spectrometry method for induced changes in Nicotiana attenuata leaves during simulated herbivory. J Agric Food Chem 58:9418–9427
31. Li DP, Heiling S, Baldwin IT et al (2016) Illuminating a plant’s tissue-specific metabolic diversity using computational metabolomics and information theory. Proc Natl Acad Sci U S A 113:E7610–E7618
32. Heiling S, Khanal S, Barsch A et al (2016) Using the knowns to discover the unknowns: MS-based dereplication uncovers structural diversity in 17-hydroxygeranyllinalool diterpene glycoside production in the Solanaceae. Plant J 85:561–577
33. Heiling S, Schuman MC, Schoettner M et al (2010) Jasmonate and ppHsystemin regulate key malonylation steps in the biosynthesis of 17-hydroxygeranyllinalool diterpene glycosides, an abundant and effective direct defense against herbivores in Nicotiana attenuata. Plant Cell 22:273–292
34. Shimizu T, Watanabe M, Fernie AR et al (2018) Targeted LC-MS analysis for plant secondary metabolites. Methods Mol Biol 1778:171–181
35. Smith CA, Want EJ, O’maille G et al (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78:779–787
36. Patti GJ, Tautenhahn R, Siuzdak G (2012) Meta-analysis of untargeted metabolomic data from multiple profiling experiments. Nat Protoc 7:508–516
37. Marbach D, Costello JC, Kuffner R et al (2012) Wisdom of crowds for robust gene network inference. Nat Methods 9:796
38. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc B Met 58:267–288
39. Breiman L (2001) Random forests. Mach Learn 45:5–32
40. Faith JJ, Hayete B, Thaden JT et al (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5:e8
41. Margolin AA, Nemenman I, Basso K et al (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7:S7
42. Scutari M (2010) Learning Bayesian networks with the bnlearn R package. J Stat Softw 35:1–22
43. Wolfender JL, Nuzillard JM, Van Der Hooft JJJ et al (2019) Accelerating metabolite identification in natural product research: toward an ideal combination of LC-HRMS/MS and NMR profiling, in silico databases and chemometrics. Anal Chem 91(1):704–742
Chapter 13
Abstract
The Global Natural Product Social Molecular Networking (GNPS) platform leverages tandem mass
spectrometry (MS/MS) data for annotation of compounds. Molecular networks aid in the visualization
of the chemical space within a metabolomics experiment. Recently, molecular networking has been
combined with feature detection methods to yield Feature-Based Molecular Networking (FBMN).
FBMN allows for the discrimination of isomers within the molecular network, incorporation of quantitative
information generated by the feature detection tools into visualization of the molecular network, and
compatibility with forthcoming in silico annotation tools. This chapter provides step-by-step methods for
generating a molecular network to annotate microbial natural products using the Global Natural Product
Social Molecular Networking (GNPS) Feature-Based Molecular Networking (FBMN) workflow.
Key words Molecular networking, Secondary metabolism, Feature annotation, Specialized metabo-
lites, Natural Products, GNPS
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_13, © Springer Science+Business Media, LLC, part of Springer Nature 2020
228 Vanessa V. Phelan
2 Materials
2.1 Data
The raw data (Bruker .d file format and centroided mzXML file
format) and the metadata table used for the step-by-step
instructions below, a batch file containing the settings for processing the
raw data in MZmine2 [13], and all resulting files from the
MZmine2 and GNPS processing workflows can be downloaded
free of charge at https://massive.ucsd.edu or accessed via the
GNPS menu option “MassIVE Datasets” using accession number
MSV000083500 (see Note 1).
2.2 MZmine2
The current version of MZmine2 is available free of charge from the
project web page http://mzmine.github.io.
2.4 Cytoscape
The current version of Cytoscape is available free of charge from the
project web page https://cytoscape.org.
3 Methods
3.1 Create a Metadata File
The metadata file makes it possible to incorporate sample property data into the visualization
and analysis of FBMN in Cytoscape. The metadata file is a tab-separated text file that must be
generated using a text editor (see Note 2).
In order for the metadata file to be processed properly in GNPS, the format of the file must be
correct (Fig. 1). The first column must be titled “filename” in all lowercase. The file names in
the “filename” column must be the full file names. Each file name should be unique, identical
to the name used for the feature finding process described in Subheading 3.2, and free of any
path information. The columns containing metadata must be prefixed by “ATTRIBUTE_” (see
Note 3). The “ATTRIBUTE_” columns should contain simple descriptors. For example, the
“ATTRIBUTE_Strain” column in Fig. 1 contains two descriptors called “GNPSGROUPS”: WT and
rhlR, which denote the two strains of P. aeruginosa being compared.
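As an illustration of the formatting rules above, the following minimal Python sketch builds and checks a metadata table for the six sample files shown in Fig. 1. The helper names `check_metadata` and `to_tsv` are assumptions made for this example and are not part of GNPS.

```python
import csv
import io

# Sample files and strains as listed in Fig. 1.
rows = [
    ("9177.mzXML", "rhlR"),
    ("9178.mzXML", "WT"),
    ("9208.mzXML", "rhlR"),
    ("9209.mzXML", "WT"),
    ("9238.mzXML", "rhlR"),
    ("9239.mzXML", "WT"),
]
header = ["filename", "ATTRIBUTE_Strain"]


def check_metadata(header, rows):
    """Return a list of problems found; an empty list means the table passed."""
    problems = []
    if header[0] != "filename":
        problems.append('first column must be titled "filename" (lowercase)')
    for name in header[1:]:
        if not name.startswith("ATTRIBUTE_"):
            problems.append(f'column "{name}" lacks the ATTRIBUTE_ prefix')
    names = [row[0] for row in rows]
    if len(set(names)) != len(names):
        problems.append("file names are not unique")
    for name in names:
        if "/" in name or "\\" in name:
            problems.append(f'"{name}" contains path information')
    return problems


def to_tsv(header, rows):
    """Serialize the table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


problems = check_metadata(header, rows)
tsv_text = to_tsv(header, rows)
print(problems if problems else tsv_text)
```

The printed text can be saved as a plain text file and supplied as the sample metadata table.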
3.2 Perform FBMN Compatible Feature Finding
1. Start MZmine2 (see Note 4). The MZmine2 settings used for the sample data are outlined in
Table 1.
2. Import raw data files: Select the files to be analyzed under the menu option “Raw data
methods → Raw data import” (see Note 5).
3. In MZmine2, a sequence of steps must be performed to correctly process the raw data for both
feature finding and export to GNPS for molecular networking (see Note 6). The MZmine2
settings used for the sample data are outlined in Table 1 (see Note 7).
(a) Mass detection:
• Select all raw data files. Perform mass detection on MS level 1 under the menu option
“Raw data methods → Mass detection → Set filter: MS level 1” (see Note 8).
• Select all raw data files. Perform mass detection on MS level 2 under the menu option
“Raw data methods → Mass detection → Set filter: MS level 2” (see Note 9).
(b) Build chromatograms: Select all raw data files. Build the chromatograms by choosing and
applying
Feature Based Molecular Networking 231
     A            B                  C    D
1    filename     ATTRIBUTE_Strain
2    9177.mzXML   rhlR
3    9178.mzXML   WT
4    9208.mzXML   rhlR
5    9209.mzXML   WT
6    9238.mzXML   rhlR
7    9239.mzXML   WT
8
Fig. 1 Correct format for the metadata table file. The metadata file describes
sample properties which will allow greater flexibility for analysis and
visualization of the molecular network
Table 1
MZmine2 parameter settings used for sample dataset

Parameter                             Setting

Mass detection (MS1)
  Scans                               MS level: 1; any polarity; any spectrum type
  Mass detector                       Centroid; noise level 1.0E3
  Mass list name                      Masses

Mass detection (MS2)
  Scans                               MS level: 2; any polarity; any spectrum type
  Mass detector                       Centroid; noise level 1.0E2
  Mass list name                      Masses

Chromatogram builder
  Scans                               MS level: 1; any polarity; any spectrum type
  Mass list                           Masses
  Min time span (min)                 0.05
  Min height                          3.0E3
  m/z tolerance                       0.01 m/z or 20.0 ppm
  Suffix                              Chromatograms

Chromatogram deconvolution
  Suffix                              Deconvoluted
  Algorithm                           Baseline cut-off; min peak height 1.0E3;
                                      peak duration range (min) 0.00–2.00;
                                      baseline level 5.0E3
  m/z center calculation              MEDIAN

Isotopic peaks grouper
  Name suffix                         Deisotoped
  m/z tolerance                       0.01 m/z or 20.0 ppm
  Retention time tolerance            0.25 absolute (min)
  Monotonic shape                     Checked
  Maximum charge                      4
  Representative isotope              Most intense

Join aligner
  Peak list name                      Aligned peak list
  m/z tolerance                       0.01 m/z or 20.0 ppm
  Weight for m/z                      0.8
  Retention time tolerance            0.5 absolute (min)
  Weight for RT                       0.2
  Require same charge state           Checked

Peak finder (multithreaded)
  Name suffix                         Gap-filled
  Intensity tolerance                 10.0%
  m/z tolerance                       0.01 m/z or 20.0 ppm
  Retention time tolerance            0.5 absolute (min)

Peak list rows filter
  Name suffix                         Filtered
  Minimum peaks in a row              3
  Peak duration range                 Checked; 0.00–2.0
  Keep or remove rows                 Keep rows that match all criteria
  Keep only peaks with MS2 scan       Checked
  (GNPS)
  Reset the peak number ID            Checked

Duplicate peak filter
  Name suffix                         Filtered
  Filter mode                         OLD AVERAGE
  m/z tolerance                       0.005 m/z or 10.0 ppm
  RT tolerance                        0.5 absolute (min)

Export for/submit to GNPS
  Filename                            FBMN example
  Mass list                           Masses
  Filter rows                         ONLY WITH MS2
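Several steps in Table 1 give the m/z tolerance as a pair, “0.01 m/z or 20.0 ppm.” As we understand the MZmine2 documentation, the larger of the absolute window and the ppm-derived window is applied at a given m/z. The short Python sketch below illustrates that reading; the function names are hypothetical and the code is not taken from MZmine2.

```python
def mz_tolerance(mz, abs_tol=0.01, ppm_tol=20.0):
    """Return the tolerance in m/z units: the larger of the absolute
    tolerance and the ppm tolerance evaluated at this m/z (assumed rule)."""
    return max(abs_tol, mz * ppm_tol / 1e6)


def within_tolerance(mz_a, mz_b, abs_tol=0.01, ppm_tol=20.0):
    """True if two m/z values would be treated as the same ion."""
    return abs(mz_a - mz_b) <= mz_tolerance(mz_a, abs_tol, ppm_tol)


for mz in (200.0, 500.0, 527.3197, 1000.0):
    print(f"m/z {mz:9.4f}: tolerance {mz_tolerance(mz):.4f}")
```

Under this assumption, the absolute value governs below m/z 500 and the ppm value governs above it.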
Table 2
GNPS FBMN workflow parameter settings used for sample dataset

Parameter                                           Setting

Workflow selection
  Title                                             FBMN example PA14 rhlR cosine 0.70

File selection
  MS2 MGF file                                      FBMN example.mgf
  Peak area quantification table                    FBMN example_quant.csv
  Sample metadata table                             FBMN example metadata.txt

Basic options
  Quantification table source                       MZmine2
  Precursor ion mass tolerance                      0.05 Da
  Fragment ion mass tolerance                       0.1 Da

Advanced network options
  Min pairs Cos                                     0.70
  Network TopK                                      10
  Minimum matched fragment ions                     4
  Maximum connected component size                  100
  Maximum shift between precursors                  500 Da

Advanced library search options
  Library search min matched peaks                  4
  Search analogs                                    Don’t search
  Top results to report per query                   1
  Score threshold                                   0.7
  Maximum analog search mass difference             100.0 Da (default value)

Advanced filtering options
  Minimum peak intensity                            0.0
  Filter precursor window                           Filter
  Filter peaks in 50 Da window                      Filter
  Filter library                                    Filter

Advanced quantification options
  Normalization per file                            No norm
  Aggregation method for peak abundances per group  Mean

Advanced external tools
  Run dereplicator                                  Don’t run

Advanced extras
  Supplementary pairs
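To make the “Min pairs Cos” and “Minimum matched fragment ions” settings in Table 2 more concrete, the Python sketch below scores two MS/MS spectra with a plain cosine on matched fragment peaks. It is a simplified illustration and not the GNPS implementation: the square-root intensity scaling, the greedy peak pairing, and the example peak lists are all assumptions, and fragments shifted by the precursor mass difference are not considered here.

```python
import math


def cosine_score(spec_a, spec_b, fragment_tol=0.1):
    """Return (score, n_matched) for two spectra given as lists of
    (m/z, intensity) pairs. Simplified; see the assumptions in the text."""

    def normalize(spec):
        # Square-root scaling, then scale to unit vector length.
        scaled = [(mz, math.sqrt(inten)) for mz, inten in spec]
        norm = math.sqrt(sum(v * v for _, v in scaled))
        return [(mz, v / norm) for mz, v in scaled]

    a, b = normalize(spec_a), normalize(spec_b)
    # All candidate pairs within the fragment tolerance, best product first.
    candidates = sorted(
        (
            (va * vb, i, j)
            for i, (mza, va) in enumerate(a)
            for j, (mzb, vb) in enumerate(b)
            if abs(mza - mzb) <= fragment_tol
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    score, n_matched = 0.0, 0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue  # each peak may be paired only once
        used_a.add(i)
        used_b.add(j)
        score += product
        n_matched += 1
    return score, n_matched


# Made-up peak lists for illustration only.
spectrum_1 = [(103.05, 200.0), (169.12, 1500.0), (339.25, 800.0), (357.26, 400.0)]
spectrum_2 = [(103.06, 180.0), (169.12, 1400.0), (339.24, 900.0), (385.30, 300.0)]

score, n_matched = cosine_score(spectrum_1, spectrum_2)
# An edge would be kept only if both thresholds from Table 2 are met.
keep_edge = score >= 0.70 and n_matched >= 4
print(f"cosine = {score:.3f}, matched peaks = {n_matched}, keep edge = {keep_edge}")
```

In this invented case the two spectra are similar but share only three fragment peaks, so the pair would not pass a minimum of four matched fragment ions.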
3.4 Visualize the FBMN in Cytoscape
1. Locate and open the Cytoscape file downloaded from GNPS. In the resulting molecular network,
each node represents the MS/MS spectrum from an MS1 feature (see Note 22). All metadata
associated with the network are shown in the “Table Panel,” including the “GNPSGROUPS”
descriptors of the “ATTRIBUTE_Strain” column of the metadata table. Additionally, the
“Table Panel” lists information about matches to the GNPS libraries.
2. Different parameters can be set to enhance the visualization of data features within the
molecular network by modifying the settings in the “Node,” “Edge,” or “Network” tabs within
the “Style” tab of the “Control Panel” window. The default settings for the automatically
generated Cytoscape file are node color: blue; edge color: white; background color: gray;
node label: Compound name (see Note 23). To recapitulate the molecular network in Fig. 2a,
apply the following steps:
(a) To highlight the similarity between connected nodes, load the cosine settings. To do
this, select the “Edge” tab within the “Style” tab of the “Control Panel.” For the “Width”
setting, select cosine for the “Column” and Continuous Mapping for the “Mapping Type.”
Double-click the “Current Mapping” box and set the desired minimum (0) and maximum (30)
values. Within the molecular network, thicker edges indicate a stronger relationship
between the MS2 spectra represented by the nodes.
3.5 Use the FBMN to Perform Initial Analysis
1. The visualization of experimental data can be used to guide secondary metabolite annotation
and analysis.
(a) Differential abundance (see Note 25): The pie chart visualization can be used to quickly
identify features that have differential abundance between sample groups (Fig. 2). The
“cluster index” column of the “Node Table” panel of the “Table Panel” in Cytoscape
corresponds to the feature number in the Peak Area Quantification Table generated by the
MZmine2 workflow. Therefore, the abundance differences displayed by the pie chart in the
molecular network can be validated simply by finding the feature number in the Peak Area
Quantification Table for each sample and averaging the abundance values for each group
(see Note 26).
(b) Annotation: In the molecular network, all nodes with matches to the GNPS libraries are
indicated by a black border. Click on a node with a black border. The annotation of that
node will be listed within the “Compound_Name” column of the “Node Table” panel of the
“Table Panel.”
(c) Propagating an annotation: Locate the node annotated “Rhamnolipid Rha-C10-C10”
(m/z 527.3197). As molecular networking compares the similarity of
4 Notes
17. Using a descriptive title that encompasses the settings used for
FBMN is extremely useful. One of the advantages of the GNPS
platform is that a user can perform iterative analyses of the same
data set concurrently in order to identify the optimal
settings for the analysis of their dataset.
18. If the files are too large for the Drag and Drop feature of the
Select Input Files interface, the files can be uploaded to GNPS
via an FTP server. Directions for uploading files via FTP are
available within the GNPS documentation.
19. GNPS annotates MS/MS spectra by comparing each spectrum
to MS/MS libraries. The list of available MS/MS libraries can
be accessed via the GNPS website.
20. The appropriate “Precursor Ion Mass Tolerance” and “Frag-
ment Ion Mass Tolerance” settings are dictated by the mass
accuracy and mass resolution of the mass spectrometer used for
MS/MS data collection. In addition to Basic Options, the
FBMN workflow can be adjusted by modifying the various
Advanced Options. Learn more about these settings on the
GNPS documentation pages.
21. In addition to visualizing the molecular network in Cytoscape,
the network can be viewed and analyzed via the in-browser
visualization. For further information about the different types
of analyses that can be performed in GNPS, access the GNPS
documentation.
22. It is important to note that during the feature finding proces-
sing in MZmine2, only the features that have a corresponding
MS/MS spectrum are retained for molecular networking anal-
ysis. If a feature does not have an associated MS/MS spectrum,
it will not be in the molecular network. Whether a feature has a
corresponding MS/MS spectrum is dependent upon the set-
tings used during data acquisition as well as during data pro-
cessing in MZmine2.
23. Tutorials for how to use Cytoscape, including a manual and
videos, are available on the Cytoscape website.
24. If the visualization adjustments of the molecular network in
Cytoscape do not automatically change, they can be shown by
utilizing the “Show Graphic Details” option within the
“View” menu.
25. Cytoscape has many different settings that can be used to
differentiate groups. Node size, shape, color, etc. can be
adjusted to reflect different attributes of the samples, such
as origin and bioactivity.
26. FBMN is a tool to aid in the analysis of MS data. The settings
used in the FBMN workflow, including those for feature
finding and generating the molecular network, will influence
the final molecular network output. It is very important to
validate the observations from FBMN with the raw data.
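The manual check described in Subheading 3.5, step 1(a) can also be scripted. The Python sketch below averages the peak areas of one feature for each strain. The column names (“row ID” and “<filename> Peak area”) reflect our understanding of the MZmine2 export and should be checked against the actual file; the two data rows are invented for illustration.

```python
import csv
import io
from statistics import mean

# Invented excerpt of a peak area quantification table (assumed layout).
quant_csv = """row ID,row m/z,row retention time,9177.mzXML Peak area,9178.mzXML Peak area,9208.mzXML Peak area,9209.mzXML Peak area
1,527.3197,5.21,0.0,120000.0,1500.0,98000.0
2,271.0603,2.87,45000.0,43000.0,47000.0,41000.0
"""

# Sample-to-group assignment taken from the metadata table (Fig. 1).
groups = {
    "9177.mzXML": "rhlR",
    "9178.mzXML": "WT",
    "9208.mzXML": "rhlR",
    "9209.mzXML": "WT",
}


def group_means(row, groups, suffix=" Peak area"):
    """Average the peak areas of one feature (one table row) per group."""
    values = {}
    for filename, group in groups.items():
        values.setdefault(group, []).append(float(row[filename + suffix]))
    return {group: mean(areas) for group, areas in values.items()}


table = {row["row ID"]: row for row in csv.DictReader(io.StringIO(quant_csv))}
feature_id = "1"  # the "cluster index" of the node of interest in Cytoscape
means = group_means(table[feature_id], groups)
print(means)
```

The resulting group means can then be compared with the proportions shown in the pie chart of the corresponding node.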
Acknowledgments
References
1. Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, Fan TW, Fiehn O, Goodacre R,
Griffin JL, Hankemeier T, Hardy N, Harnly J, Higashi R, Kopka J, Lane AN, Lindon JC, Marriott P,
Nicholls AW, Reily MD, Thaden JJ, Viant MR (2007) Proposed minimum reporting standards for
chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative
(MSI). Metabolomics 3(3):211–221. https://doi.org/10.1007/s11306-007-0082-2
2. Garg N, Luzzatto-Knaan T, Melnik AV, Caraballo-Rodriguez AM, Floros DJ, Petras D, Gregor R,
Dorrestein PC, Phelan VV (2017) Natural products as mediators of disease. Nat Prod Rep
34(2):194–219. https://doi.org/10.1039/c6np00063k
3. Newman DJ, Cragg GM (2016) Natural products as sources of new drugs from 1981 to 2014. J Nat
Prod 79(3):629–661. https://doi.org/10.1021/acs.jnatprod.5b01055
4. Phelan VV, Liu WT, Pogliano K, Dorrestein PC (2011) Microbial metabolic exchange—the
chemotype-to-phenotype link. Nat Chem Biol 8(1):26–35. https://doi.org/10.1038/nchembio.739
5. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y, Nguyen DD, Watrous J, Kapono CA,
Luzzatto-Knaan T, Porto C, Bouslimani A, Melnik AV, Meehan MJ, Liu WT, Crusemann M,
Boudreau PD, Esquenazi E, Sandoval-Calderon M, Kersten RD, Pace LA, Quinn RA, Duncan KR,
Hsu CC, Floros DJ, Gavilan RG, Kleigrewe K, Northen T, Dutton RJ, Parrot D, Carlson EE,
Aigle B, Michelsen CF, Jelsbak L, Sohlenkamp C, Pevzner P, Edlund A, McLean J, Piel J,
Murphy BT, Gerwick L, Liaw CC, Yang YL, Humpf HU, Maansson M, Keyzers RA, Sims AC,
Johnson AR, Sidebottom AM, Sedio BE, Klitgaard A, Larson CB, CAB P, Torres-Mendoza D,
Gonzalez DJ, Silva DB, Marques LM, Demarque DP, Pociute E, O’Neill EC, Briand E,
Helfrich EJN, Granatosky EA, Glukhov E, Ryffel F, Houson H, Mohimani H, Kharbush JJ, Zeng Y,
Vorholt JA, Kurita KL, Charusanti P, McPhail KL, Nielsen KF, Vuong L, Elfeki M, Traxler MF,
Engene N, Koyama N, Vining OB, Baric R, Silva RR, Mascuch SJ, Tomasi S, Jenkins S,
Macherla V, Hoffman T, Agarwal V, Williams PG, Dai J, Neupane R, Gurr J, Rodriguez AMC,
Lamsa A, Zhang C, Dorrestein K, Duggan BM, Almaliti J, Allard PM, Phapale P, Nothias LF,
Alexandrov T, Litaudon M, Wolfender JL, Kyle JE, Metz TO, Peryea T, Nguyen DT, VanLeer D,
Shinn P, Jadhav A, Muller R, Waters KM, Shi W, Liu X, Zhang L, Knight R, Jensen PR,
Palsson BO, Pogliano K, Linington RG, Gutierrez M, Lopes NP, Gerwick WH, Moore BS,
Dorrestein PC, Bandeira N (2016) Sharing and community curation of mass spectrometry data
with global natural products social molecular networking. Nat Biotechnol 34(8):828–837.
https://doi.org/10.1038/nbt.3597
6. Nguyen DD, Wu CH, Moree WJ, Lamsa A, Medema MH, Zhao X, Gavilan RG, Aparicio M, Atencio L,
Jackson C, Ballesteros J, Sanchez J, Watrous JD, Phelan VV, van de Wiel C, Kersten RD,
Mehnaz S, De Mot R, Shank EA, Charusanti P, Nagarajan H, Duggan BM, Moore BS, Bandeira N,
Palsson BO, Pogliano K, Gutierrez M, Dorrestein PC (2013) MS/MS networking guided analysis of
molecule and gene cluster families. Proc Natl Acad Sci U S A 110(28):E2611–E2620.
https://doi.org/10.1073/pnas.1303471110
7. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B,
Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular
interaction networks. Genome
Abstract
With the increasing importance of big data in biomedicine, skills in data science are a foundation for
individual career development and for the progress of science. This chapter is a practical guide to working
with high-throughput biomedical data. It covers how to understand and set up the computing environment,
how to start a research project with proper and effective data management, and how to perform common
bioinformatics tasks such as data wrangling, quality control, statistical analysis, and visualization, with
examples on metabolomics data. Concepts and tools related to coding and scripting are discussed. Version
control, knitr, and Jupyter notebooks are important to project management, collaboration, and research
reproducibility. Overall, this chapter describes a core set of skills for working in bioinformatics, and can
serve as a reference text for a graduate-level course that interfaces with data science.
Key words Bioinformatics, Metabolomics, Data science, Quality control, Data management, Cloud
computing, Scripting, Data visualization
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_14, © Springer Science+Business Media, LLC, part of Springer Nature 2020
246 W. Stephen Pittard et al.
2.3 UNIX/Linux
The terms UNIX and Linux are often used interchangeably, but technically that is not accurate.
The term UNIX originally applied to a proprietary operating system developed at Bell Labs in the
early 1970s, which was originally intended for internal use. As the system grew in utility and
popularity, licenses were offered to the University of California, Berkeley for further
development, which resulted in BSD, the first major branch of the AT&T code base. Organizations
such as IBM and HP developed their own “flavors” of UNIX, some based on AT&T or BSD code, and
offered these distributions under various business models. The GNU project, created by Richard
Stallman, became the leader of the free software movement. In 1991, Linus Torvalds released the
first Linux kernel. Subsequently, Linux became the dominant system for running web servers and
led to the boom of an open source software ecosystem. The name Linux is commonly used
generically, though there are a number of specific distributions (abbreviated as “distros”) such
as Ubuntu, Debian, Gentoo, and Fedora, as well as commercially supported distros such as Red Hat
and SUSE.
While Linux is freely available, vendors can offer “for cost” support models to help individuals
and organizations become productive. The organizations behind the respective distros may also
offer enthusiastic support free of charge simply to encourage adoption of those particular
distributions. As an example, the Ubuntu distribution has a large number of “baked in” tools for
the processing of genomic data, which is appealing to researchers within that domain.
2.4 GUI vs. the Command Line
The convenience and intuitive nature of a GUI (graphical user interface) are attractive: a GUI
simplifies the management of files, folders, and computer settings by offering a visual
interface for these common actions without requiring in-depth training or orientation as a
prerequisite to productivity. Moreover, there are a number of desktop applications written
specifically to leverage the capabilities of these operating systems, which can further simplify
the manipulation of data. For example, there are commercial packages for the analysis of genomic
data that run natively on both Windows and Mac OS. However, in order for a task to be available
in a GUI, the exact task has to be defined and coded into the GUI design. This greatly limits
what one can do within a GUI. A typical task in bioinformatics is converting data formats, which
relies mostly on text manipulation tools such as sed or awk, or on a script written in Python or
R to first preprocess the data. Without such knowledge, the researcher will be at a disadvantage
and reliant upon others to accomplish what is a basic yet essential task.
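As a small illustration of such a format conversion, the Python sketch below rewrites comma-separated text as tab-separated text using only the standard library. The function name and the three-line example table are made up for this example.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text.

    The csv module handles quoted fields, so a comma inside quotes
    is kept as part of the field rather than treated as a separator.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


example = 'feature,name,intensity\nF1,"citrate, isocitrate",1523.4\nF2,alanine,880.1\n'
print(csv_to_tsv(example), end="")
```

Using a parser rather than a plain search-and-replace of commas avoids splitting fields that legitimately contain the separator.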
The phrase “command line” refers to the use of typed commands at a shell (e.g., Bash, ksh)
prompt that in turn invoke specific programs available within a given computational
environment. Microsoft Windows has a command prompt. A shell is native to Linux and to Mac OS,
which runs on top of a variant of the UNIX operating system. Bioinformatics often involves using
the command line to execute and connect different programs, which makes it possible to perform
and automate complex tasks.
2.5 The Shell
Learning how to interact with the UNIX operating system does require a commitment on the user’s
part, though facility with it can be acquired in stages over time and in response to the demands
of a given project. The basic component is the shell, which is the interpreter for any command a
user might enter at the command line prompt. Think of a shell as an interpreter of interactive
input, although scripts containing an arbitrary number of shell commands, including programming
constructs, can also be developed for execution within the shell context. As an example, it is
possible to create a number of shell scripts written in Bash for the maintenance of data over
time. Such scripts are typically designed and implemented for system administration tasks,
though Linux offers powerful text manipulation tools that can be helpful in the management of
scientific data. However, many researchers choose to develop scripts in Python or R, with the
benefit of packages written specifically to work with high-throughput data. Figure 1 is an
example of using the shell on Apple Mac OS. It is outside the
Bioinformatics Primer for Data Science 251
3 Data Management
3.1 Organizing Your Projects
Independent of the operating system you choose, one of the primary uses of a GUI is to manage
files and folders on your local system, or attached to it in cases where you are using a shared
filesystem or a form of network attached storage. GUIs provide an easy way to create, rename,
move, and remove folders to reflect the needs of your work. All of this can be accomplished
using the command line/shell, although doing so efficiently requires facility with the specific
command names. One area in which the shell is quite helpful is renaming or manipulating a large
number of files according to a general pattern, in which case a Bash script can be written.
However, generally speaking it is fine to use a GUI to manage your project organization. The key
is to choose a setup and naming convention that will still make sense after the passage of time,
so that the structure is understandable when you return to it. Also consider that you might
later involve a collaborator, in which case you will need to explain the structure if it is not
already evident. While organizing your project is a highly subjective activity, here are some
tips that will help you:
1. Do not stockpile code, data, and spreadsheets into a single folder. Create meaningful sub
folders up front to contain highly related information, even if you do not yet have a
significant amount of information to manage.
For example, all sequencing data should be in a folder and perhaps sub folders, depending on the
experimental design of your project. While this seems obvious, it is very easy to initiate a
project by creating one folder with some basic data, perhaps
Fig. 2 File directories as examples for data management. (a) A project folder on a remote Linux server,
mounted to MacOS desktop via sshfs. This is an automated directory structure on this server, so that scripts
are run periodically to gather project statistics and generate reports. (b) A project folder with a focus on DNA
sequencing. (c) A manuscript folder, where data are organized by figures
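One way to follow the advice above on separating code, data, and results is to create the folder skeleton with a short script at the start of a project, so that every project begins with the same layout. The Python sketch below does this with the standard library; the folder names are only a suggestion and the function name is hypothetical.

```python
from pathlib import Path
import tempfile

# Suggested sub folders; adapt the names to the project at hand.
SUBFOLDERS = ["data/raw", "data/processed", "scripts", "results/figures", "docs"]


def create_project(root, subfolders=SUBFOLDERS):
    """Create the project root and its sub folders; existing ones are kept."""
    root = Path(root)
    for sub in subfolders:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Report every folder that now exists below the project root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project(Path(tmp) / "my_project")
    print(created)
```

Because existing folders are left untouched, the script can be rerun safely when a new sub folder is added to the list.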
4.1 Before You Start Coding
Software development usually requires specialized tools and workflows. Most bioinformatics data
analysis does not require software development but scripting. Bioinformatics is often a lengthy
process of many steps, each step accomplishing a specific task. It is critical to identify what
tasks one needs to complete in order to accomplish the overall goal. One has to know what the
question is before choosing the right tool. Scripting is usually used to wrangle data and to
“glue” different steps together. Interactive terminals and notebook tools (e.g., knitr and
Jupyter notebooks) are often preferred over complex IDEs (integrated development environments).
But it is still a good idea to keep the scripts under version control. Should a different
version produce a different result, one needs to track down the source of the change.
Even without involving serious software development, a few practical tips are good to have. Do
not repeat yourself: if a piece of code accomplishes a common task, make it reusable as a
function,
4.2 What Languages to Use
It is easy to incite hot debates when choosing a programming language. In the field of
bioinformatics, Python has already become the most popular language (replacing Perl). Java has
wide applications in enterprise software development, and C/C++ is often preferred for
performance. JavaScript is now gaining prominence due to its central role in web development.
With many mature libraries and a clean, elegant design, Python enables rapid scripting and
development without sacrificing much performance, since many libraries are implemented in C.
Also highly popular, R is irreplaceable in the field of bioinformatics. R is designed as a
statistical programming environment, centered on data analysis. R is preferred by statisticians,
and many statistical tools were first published in R. The successful Bioconductor project
further boosted its popularity. Both Python and R are explained in detail in Chapter 15.
The most practical choice, however, depends on several factors of a project. It takes time to
gain proficiency in a language; one has to weigh the time investment when learning a new tool.
Domain-specific applications are an important consideration. To work with relational databases,
knowledge of Structured Query Language (SQL) is required to select and manipulate data from the
database. One may also consider embedding SQL capability within software code and pipelines.
Sometimes, the availability of one function determines the design of a project. If a commercial
license is required, the cost and maintenance need to be planned carefully. Moving a piece of
software from a laptop to the cloud could require a new license.
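As a minimal illustration of embedding SQL in a script, the Python sketch below uses the sqlite3 module from the standard library with an in-memory database. The table name, column names, and values are invented for this example.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (feature_id TEXT, sample TEXT, intensity REAL)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [
        ("F1", "control_1", 1200.0),
        ("F1", "treated_1", 5400.0),
        ("F2", "control_1", 300.0),
        ("F2", "treated_1", 310.0),
    ],
)

# Select the features whose mean intensity exceeds a threshold.
query = """
    SELECT feature_id, AVG(intensity) AS mean_intensity
    FROM features
    GROUP BY feature_id
    HAVING AVG(intensity) > ?
    ORDER BY feature_id
"""
result = conn.execute(query, (1000.0,)).fetchall()
conn.close()
print(result)
```

Passing the threshold as a parameter (the “?” placeholder) rather than pasting it into the query string keeps the SQL reusable and avoids quoting problems.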
5.1 Performing Quality Control (QC) and Quality Assurance (QA)
Before spending 2 years analyzing your data, it is critical to know that the data are not
garbage due to some malfunction of an instrument. For high-throughput data, abnormality is easy
to spot by examining two things: the signal intensity acquired on a sample (Figs. 3 and 4), and
how the measurements in a sample correlate with those in other samples (Fig. 5). If an
instrument fails during the analysis of a sample, the signal intensity will not be consistent
for this sample, often resulting in low signal intensity. For samples of the same type of
biological matrix (e.g., human serum), the basic pattern of intensity should be the same,
therefore manifesting high correlation coefficients between these samples.
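Both checks can be scripted in a few lines. The Python sketch below computes the mean intensity of each sample and the pair-wise Pearson correlation coefficients for a small invented table of log2 intensities, in which sample “s3” mimics a failed injection. The sample names, the values, and the 0.9 cutoff are assumptions for illustration and not recommended thresholds.

```python
import math
from itertools import combinations
from statistics import mean

# Invented log2 intensities: one list of feature values per sample.
samples = {
    "s1": [20.1, 18.3, 15.2, 22.4, 17.0, 19.5],
    "s2": [20.3, 18.0, 15.5, 22.1, 17.2, 19.3],
    "s3": [12.0, 13.5, 12.8, 12.2, 13.9, 12.4],  # mimics a failed run
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


mean_intensity = {name: mean(values) for name, values in samples.items()}
correlations = {
    (a, b): pearson(samples[a], samples[b]) for a, b in combinations(samples, 2)
}

for name, value in mean_intensity.items():
    print(f"{name}: mean log2 intensity {value:.2f}")
for (a, b), r in correlations.items():
    flag = "" if r >= 0.9 else "  <- inspect"
    print(f"{a} vs {b}: r = {r:.3f}{flag}")
```

In this toy table the third sample stands out on both criteria: its mean intensity is much lower and it correlates poorly with the other two samples.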
Fig. 3 Bar plot of average intensity of all replicates in the moDC dataset,
inspected for quality control
Fig. 5 Using correlation between samples to examine data quality. (a) Scatter plot of all features between two
technical replicates. Each dot is a feature. The high number of values close to 10 (forming two lines) reflects
imputed values in the data. This scatter plot is typical for metabolomics and transcriptomics data, where
reproducibility is better for features of higher intensities. (b) Pair-wise Pearson correlation coefficients for all
samples plotted on a heatmap, where the color is scaled by the coefficient value
Fig. 6 Using PCA plot to examine grouping patterns of biological samples. Each
dot represents a sample. One of the 0 h samples groups with the mock 6 h
samples. This raises a flag for further investigation
Fig. 7 The MA plot of two replicates or samples should have a flat trendline,
otherwise it indicates a bias in data distribution. The straight lines of features are
from imputed values, similar to those in Fig. 5a
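The quantities plotted in an MA plot are simple to compute. In the usual definition, M is the difference and A is the average of the log2 intensities of a feature in the two samples. The Python sketch below computes both for a few invented intensity pairs; the function name is hypothetical.

```python
import math


def ma_values(x, y):
    """Return lists M (log2 ratio) and A (mean log2 intensity) for two
    samples given as equally long lists of positive raw intensities."""
    m_values, a_values = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m_values.append(lx - ly)
        a_values.append((lx + ly) / 2)
    return m_values, a_values


# Invented raw intensities for two replicates.
replicate_1 = [1024.0, 4096.0, 65536.0, 2048.0]
replicate_2 = [1024.0, 2048.0, 65536.0, 8192.0]

M, A = ma_values(replicate_1, replicate_2)
for m, a in zip(M, A):
    print(f"A = {a:5.2f}, M = {m:+.2f}")
```

A flat trendline corresponds to M values that scatter around zero across the whole range of A.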
5.2 Data Transformation, Scaling and Normalization
For high-throughput data, it is often useful to filter the data for missing values and low
intensities, so that data of higher quality are used for subsequent analysis. Many statistical
methods do not cope with missing data, and imputation needs to be performed. It is common in
metabolomics data to replace missing values by limit
5.3 Common Statistical Analyses
The field of metabolomics has a good share of complex statistics. Multivariate methods are often
used to project data into latent spaces, and PLS-DA (partial least squares discriminant
analysis) is among the popular methods to compare biological classes. Principal component
analysis (PCA) is not directly used for statistical analysis, but is often used for dimension
reduction and visualization. Due to its unsupervised nature, PCA is also used for exploration
and quality control. In Fig. 6, one baseline sample is not grouped with the other baseline
samples, but with a treatment group. This figure raises a red flag on that particular sample and
demands further investigation of the cause.
Less complex but often preferred are the univariate statistics. If the data are known to follow
a normal distribution, ANOVA (analysis of variance, three or more groups) or the t-test (two
groups) can be used. These tests can be further divided into categories for independent samples
or related samples. If the distribution of the data is not normal or is unknown, nonparametric
methods such as the Mann–Whitney U test can be used. For continuous outcome, linear regression is
Fig. 8 Visualization of results from a statistical test. (a) Volcano plot of all features. The Y-axis indicates the
significance of a feature in the statistical test, and the X-axis shows the direction and magnitude of the
difference between two groups. (b) A heatmap of the significant features (FDR < 0.05 here) from the statistical
test. Each column represents a sample and each cell of the heatmap represents a feature in one sample. The six
samples are separated into two biological groups, matched to the experimental design. (c) Boxplot to visualize
the mean level and standard deviation of one feature in the two biological groups
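The values behind panels (a) and (b) can be derived from a table of group means and p-values. The Python sketch below computes volcano plot coordinates and Benjamini-Hochberg adjusted p-values, which is one common way of estimating FDR; the chapter does not state which FDR method was used, and the feature names and numbers are invented.

```python
import math

# Invented results: feature -> (mean of group 1, mean of group 2, p-value).
results = {
    "F1": (1000.0, 8000.0, 0.001),
    "F2": (2000.0, 500.0, 0.008),
    "F3": (1500.0, 1800.0, 0.039),
    "F4": (900.0, 700.0, 0.041),
    "F5": (1200.0, 1250.0, 0.60),
}


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [p for _, _, p in results.values()]
fdr = dict(zip(results, benjamini_hochberg(p_values)))

volcano = {
    name: (math.log2(g2 / g1), -math.log10(p))  # (x, y) coordinates
    for name, (g1, g2, p) in results.items()
}
significant = [name for name in results if fdr[name] < 0.05]
print(significant)
```

Note that features F3 and F4 have raw p-values below 0.05 but do not pass the adjusted cutoff, which is why the heatmap of significant features is usually built from adjusted values.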
5.4 Visualization
A large portion of scientific findings are communicated via figures. For data-intensive
publications, the visual presentation poses additional challenges. A good example is the rise of
Circos plots. Circos was initially developed for genome projects, but has since been applied to
many domains to illustrate the relationships between complex data [5]. One can condense a great
amount of data, in many different types of plots such as line plots, histograms, tiles, and
heatmaps, into one figure. This type of figure has to be
6 Notes
Acknowledgments
References
Abstract
The daily work in data science involves a set of essential tools: the programming languages Python and R,
the version control tool Git, and the virtualization tool Docker. Proficiency in at least one programming
language is required for data science. R is tied to a computing environment that focuses on statistics, in
which many new algorithms in genomics and biomedicine are first published. Python has roots in system
administration, and is a superb language for general programming. Version control is critical to managing
complex projects, even if software development is not involved. The Docker container is becoming a key tool
for deployment, portability, and reproducibility. This chapter provides a self-contained practical guide to
these topics so that readers can use it as a reference and to plan their training.
Key words Bioinformatics, Data science, Python, R, Git, Docker, Version control, Virtualization
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_15, © Springer Science+Business Media, LLC, part of Springer Nature 2020
266 W. Stephen Pittard and Shuzhao Li
2.1 Installation and Relevant Packages
Python is included by default in Apple Mac OS and Linux distributions, as it also serves system
administrative functions. Thus, besides the Python package manager pip, Python libraries can be
installed via the software manager of the operating system. For scientific computing, several
libraries are indispensable and are called the “SciPy stack” (as presented at
https://www.scipy.org), including NumPy, SciPy, Matplotlib, and pandas. The basic numerical
array data structure is provided by NumPy, and pandas provides major modern data structures
similar to those in R. Matplotlib provides comprehensive plotting functions. The name Matplotlib
reflects its initial similarity to Matlab, a popular commercial tool for data analysis and
visualization, but the library has since been rewritten for object-oriented programming.
Matplotlib is the basis of other Python plotting tools, including seaborn. In addition,
comprehensive methods of machine learning are implemented in the scikit-learn library.
Custom installation of Python on Mac OS and Linux is often necessary to upgrade to newer
versions, or to accommodate multiple versions. Python can be downloaded from its official site,
https://
Essential Toolbox of Data Science
Fig. 1 Common Python libraries for scientific computing that are shipped with the Anaconda distribution. The
logos may be trademarked by individual parties. The tool stack is compatible with all three major operating
systems
2.2 Interactive Computing with Python SciPy Stack
Once Anaconda (or another Python distribution with the above
libraries) is installed, we can start working on data, and the
possibilities are almost endless. Invoke the interactive environment by
typing “python” in the command line shell and pressing the ENTER key.
This interactive environment is marked by “>>>” at the start of every
line, and the code in the current line is executed every time the
ENTER key is pressed:
>>> import numpy as np
>>> n = 5
>>> a = [2, 3, 4, 5, 6]
>>> a2 = np.array(a)
>>>
>>> type(n)
<class 'int'>
>>> type(a)
<class 'list'>
>>> type(a2)
<class 'numpy.ndarray'>
>>>
>>> # help() shows the supporting documentation for an object; press "q" to quit
>>> help(a2)
>>> mydata.mean(0)[:8]
mz            479.377606
time          165.949377
mz.min        479.327571
mz.max        479.427072
p_0hr_01_1     16.091406
p_0hr_01_3     16.144007
p_0hr_01_5     16.290922
p_0hr_02_1     16.295431
dtype: float64
2.3 Concepts in Data Structure
For simplicity, we take a subset of the data (first 4 rows and
6 columns) for this section.
> install.packages("actuar")
Installing package into ‘/Users/esteban/r_packages’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://mirrors.nics.utk.edu/cran/bin/macosx/el-capitan/contrib/3.5/actuar_2.3-1.tgz'
Content type 'application/x-gzip' length 2044668 bytes (1.9 MB)
==================================================
downloaded 1.9 MB
> library(actuar)
3.2 Getting Help
A good starting place for R assistance is the “Getting Help” page at
https://www.r-project.org/help.html, which presents FAQs and
steps on how to walk through demonstrations of various functions
and packages. R also contains the browseVignettes() command,
which provides a web listing of all available vignettes (guided
tours) associated with a given package. Assistance is also available
via Stack Overflow (http://stackoverflow.com), with
bioinformatics-specific support available at the BioStars forum
(https://www.biostars.org/). Please read the respective terms of use
for both sites prior to the submission of questions.
3.3 Variables and Data Structures
Variables are containers for data involving four primary types:
numeric, character, logical, and factor. Variables are assigned
using the symbol “<-” or “=”; however, the former is strongly
preferred. R is largely an interactive environment that retains
variables and data structures during and, optionally, across sessions.
The interpreter accepts input after the “>” character.
3.3.1 Vectors
Vectors are containers for homogeneous data that in turn can be
assembled into matrices. Vectors can be explicitly created using the
“c” function or be returned from other R functions.
> weight <- c(117, 165, 139, 142, 126, 151, 120, 166)
> weight/100
[1] 1.17 1.65 1.39 1.42 1.26 1.51 1.20 1.66
> mean(weight)
[1] 140.75
3.3.3 Lists
Lists provide a way to store heterogeneous data within a single
structure. Newcomers to R usually do not create lists except when
(1) writing a function that needs to return heterogeneous information
or (2) as a precursor to creating a data frame. Lists are more
commonly encountered in results returned by common statistical
modeling activities such as regression, decision trees, and random
forests.
> mylm <- lm(mpg ~ wt, data = mtcars)
> names(mylm)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> mylm$rank
[1] 2
> mylm$call
lm(formula = mpg ~ wt, data = mtcars)
> mylm$coefficients
(Intercept)          wt
  37.285126   -5.344472
3.3.4 Data Frames
Data frames are tightly coupled collections of variables organized
into a rectangular format. Data frames can be constructed from
vectors, lists, matrices or by reading in CSV files, or from information
being returned by APIs (Application Programming Interfaces).
Best of all, data frames can hold different data types across
multiple columns. There are a number of data frames built into R
that can be used to illustrate how to work with them. The data()
function can be used to load a built-in data frame.
# Get the first three rows and only columns one and four
> mtcars[1:3,c(1,4)]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93
# Find all rows wherein the mpg is >= 30 and then select columns 2 through 6
> mtcars[mtcars$mpg >= 30, 2:6]
3.4 Functions
Functions are a very important part of the R language, especially
given that all packages are composed of predefined functions to
assist with a large variety of tasks. Users interact with R
almost entirely through functions, so consider writing one to
avoid repeating the interactive entry of commands. However,
before writing new code it is wise to first determine whether an
existing function already performs the desired actions. A
significant idea in software development is “Don’t
Repeat Yourself,” commonly abbreviated as “DRY.” Functions are
created using the function() directive and are stored as R objects
just like any other variable. Functions allow easy reuse of code, which
can optionally be bundled into a package for distribution
on CRAN. Information on a preexisting function can be obtained
by placing the “?” character before the function name of interest.
> ?mean
mean                   package:base                   R Documentation
Arithmetic Mean
Description:
Usage:
mean(x, ...)
> grep("lm",methods(print),value=TRUE)
[1] "print.glm"         "print.lm"          "print.summary.glm"
[4] "print.summary.lm"
> print(mylm)
Call:
lm(formula = mpg ~ ., data = mtcars)
Coefficients:
(Intercept)          cyl         disp           hp         drat           wt
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530
       qsec           vs           am         gear         carb
    0.82104      0.31776      2.52023      0.65541     -0.19942
3.6 Graphics
The R language provides a powerful environment for the visualization
of scientific data. It provides publication quality graphics,
which are fully programmable and reproducible. There are a number
of useful output types (PDF, JPEG, PNG, SVG) in addition to
the default high resolution on-screen graphics capability. R Graphics
can be confusing given that there are three primary user-focused
graphics systems: Base, Lattice, and ggplot2.
3.6.1 Base Graphics
Base graphics is the default display package included in every R
installation. It has both high and low level routines, which provide
flexibility in developing customized plots. Base graphics is well
documented, and a large body of examples and answers can be found
through a web search.
4.1 Git
Git is an open source tool for managing revisions to a set of files that
change over time, that is, version control. Many tools
for version control have been developed in the past few decades,
but Git has become arguably the most popular. It was originally
developed in 2005 by Linus Torvalds to manage the development
of the Linux kernel. While Git can work with a general set of text
files, it is more commonly employed as part of collaborative software
development projects to track modifications to a “repository”
(a collection of files within a folder structure). Even if there is no
intent for public distribution, the developer benefits from the
ability to revert to previous versions or create new “branches” for
novel development without impacting the reference branch. Git
follows a distributed model that allows developers to “clone”
repositories, along with any associated change history, and submit
modifications for subsequent (re)integration into the reference copy,
commonly known as the “master branch.” Git can assist in the
detection and resolution of conflicts wherein changes to a file are
made by multiple people.
According to the 2018 Stack Overflow Developer Survey,
87.2% of respondents who use version control in their projects
use Git. It is ideal for managing large, distributed projects involving
hundreds of participants, though it is also appropriate for the
laboratory-based informaticist developing code in support of a
publication. Git is language agnostic in that it views source code
as simple text files to be monitored for changes; it works the same
way regardless of the programming language used.
4.2 Availability and Installation
Operating system specific installers are available at
https://git-scm.com/downloads. The clients provide a built-in GUI tool,
although the method of interaction described in this text relates to the
command line client, which is available via the Terminal
application in Apple Mac OS and Linux. The Windows install also
comes with a git shell that can be launched as a standard
application.
4.3 A Common Git Workflow
There are formal methodologies for software development projects,
although use of Git does not involve or require the adoption of any
specific approach. A common generic development workflow using
git involves the following steps:
1. Create a repository using “git init” or “git clone.”
2. Create/modify files in the working repository using a text
editor.
3. Add file(s) to be tracked using “git add.”
4. Use “git status” frequently to see the current state of files and
commits.
5. When the tracked files represent a logical point of progress then
commit changes using “git commit” or “git commit -a” for
subsequent commits.
Repeat steps 2–5 until the project is complete or at a logical
stopping point. There are many possible variations to this workflow
though it represents a useful starting point.
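The steps above can also be scripted. The following is a minimal sketch, not part of the original walkthrough, that drives the same git commands from Python in a throwaway folder. The helper names (git, run_workflow), the file content, the commit message, and the identity values are hypothetical choices made for illustration; the identity is passed per command so that no global Git configuration is changed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its standard output."""
    # Hypothetical identity, supplied per command instead of via git config.
    cmd = ["git", "-c", "user.name=John Doe",
           "-c", "user.email=jdoe@example.org", *args]
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout


def run_workflow():
    """Walk through steps 1-5 once and return the one-line commit log."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")                                  # step 1
        (repo / "Readme.md").write_text("# MyProject\n")   # step 2
        git(repo, "add", "Readme.md")                      # step 3
        print(git(repo, "status", "--short"))              # step 4
        git(repo, "commit", "-m", "Add Readme")            # step 5
        return git(repo, "log", "--oneline")


if shutil.which("git") is None:
    log = None
    print("git is not installed; skipping the walkthrough")
else:
    log = run_workflow()
    print(log)
```

In interactive use the same commands are simply typed at the shell prompt; the script form is only convenient when the workflow has to be repeated or tested.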
4.4 Key Concepts
The following represent essential ideas for becoming productive
with Git. Each concept will be considered in detail.
Repository: A folder that has been placed under git control.
Origin: Every repository has an “origin,” which could be local
(on a hard drive) or a remote hosting service such as GitHub or
GitLab.
Branch: Each repository has a reference or “master” branch
from which copies can be made for purposes of experimentation,
testing, and bug fixing without impacting the availability of the
master branch.
Tracking: The activity of monitoring files for changes.
Commit: The activity of saving or taking a snapshot of existing
modifications. Each project will usually experience multiple com-
mits over time.
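To make the idea of a snapshot more concrete, the sketch below reproduces how Git derives the identifier of a stored file (a “blob”) from its content. It assumes the default SHA-1 object format; the function name git_blob_id is hypothetical. Commit identifiers, such as the 40-character strings shown by “git log,” are computed in an analogous way from the commit's own content.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID that Git assigns to a file with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the data.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
readme_id = git_blob_id(b"# MyProject\n")
print(empty_id)
print(readme_id)
```

Because the identifier depends only on the content, two files with identical content share one stored object, and any change to a file yields a new identifier.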
4.5 Getting Started
Git requires some basic identity information such as an e-mail address,
although this does not have to be an actual working address.
Consider, though, that Git stores this information in the
metadata of the repository, so if the repository is distributed in the
future, end users will have appropriate contact information. Launch a
4.6 Create a Repository
Git repositories can be created from existing folders or new ones.
The following commands demonstrate the creation of a new folder
and repository. Note that there are no files to manage just yet.
4.7 Adding Content
Next, create some files using a text editor (e.g., vi, emacs).
/home/ubuntu/MyProject $ vi Readme.md
/home/ubuntu/MyProject $ vi regression.R
/home/ubuntu/MyProject $ cat Readme.md
# MyProject
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be
committed)
Readme.md
regression.R
The results of the “status” command indicate that the two new
files are not currently being tracked. It also provides information on
how to initiate tracking which means that git will monitor the files
for changes. After adding the file, enter the “git status” command
which will show that the files are now “staged” for a “commit.”
Additional changes can be made and git will track them.
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
4.8 Making a Commit
Making a commit in git will take a “snapshot” of all changes and
register them. This is accomplished using the “git commit” command.
For an active development project, commits should be made
frequently. Remember that a benefit of the git system is that any
commit can be reverted should it result in a software error.
This will invoke the default editor associated with the current
user account. For Ubuntu this will be “nano.” The following screen
illustrates a typical commit message which should be informative.
Next, follow this up with a “git status.” Git indicates that all
changes have been noted and committed to the master branch. At
this point, more files can be created or new edits can be made to the
existing files followed by more commits.
4.9 Inspecting Changes to Files
This example demonstrates what happens when modifications are
made to a file that is being tracked but the changes have yet
to be committed. Such changes can be inspected using the “git diff”
command. At this point, neither of the two files in the repository has
been modified since the commit.
data(mtcars)
mylm <- lm(mpg~.,data=mtcars)
summary(mylm)
modified: regression.R
Now check the status and commit event log to confirm that the
master branch is up to date.
commit f025b99c53def7b6216d92d8113982d387610615
Author: John Doe <[email protected]>
Date: Tue Jun 18 19:35:24 2019 +0000
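The kind of line-by-line comparison that “git diff” reports can be sketched with the difflib module of the Python standard library. This is only an illustration of the output format, not how Git is implemented. The “after” lines are the contents of regression.R shown above; the “before” lines are a hypothetical earlier version assumed for this example.

```python
import difflib

# Hypothetical earlier version of regression.R (an assumption for illustration).
before = [
    "data(mtcars)\n",
    "mylm <- lm(mpg~wt,data=mtcars)\n",
    "summary(mylm)\n",
]
# Version of regression.R shown in this section.
after = [
    "data(mtcars)\n",
    "mylm <- lm(mpg~.,data=mtcars)\n",
    "summary(mylm)\n",
]

diff = list(difflib.unified_diff(before, after,
                                 fromfile="a/regression.R",
                                 tofile="b/regression.R"))
print("".join(diff))
```

Lines starting with “-” were removed and lines starting with “+” were added, which is the same convention used in the output of “git diff.”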
4.10 Using Branches
An attractive feature of Git is the ability to create copies of the
master branch into arbitrarily named experimental branches without
impacting the master branch code. Think of any new branches
as being derivative safe copies of previously committed code which
can be managed using the usual git command set. Changes to a
derivative branch can, if desired, be integrated back into the master
branch or remain in a separate branch for more testing. The “git
branch” command indicates any existing branches and places an
asterisk next to the currently active branch which in this case is the
“master” branch.
The only thing that has changed is the current branch. The
underlying folder contents are the same. Create a new file called
“logistic_regression.R” with the following contents and commit
the change with a commit message of “Added a file for logistic
regression.”
/home/ubuntu/tester $ ls
Remember that the new file has been added within the experi-
mental branch and will NOT impact the master branch. To verify
this, switch back to the master branch by using the “git checkout
master” command.
/home/ubuntu/tester $ ls
Readme.md regression.R
Observe that the master branch does not include the
logistic_regression.R file since it exists only within the experiment
branch. It is possible to merge the changes made in the experiment
branch into the master branch by using the “git merge” command.
/home/ubuntu/tester $ ls
Readme.md logistic_regression.R regression.R
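The behavior seen in this section can be summarized with a toy model in which each branch name simply points to a snapshot of the folder. The class ToyRepo below is hypothetical and heavily simplified: it tracks file names only, ignores file contents and history, and treats a merge as the union of two snapshots, so conflicts are not modeled.

```python
class ToyRepo:
    """A toy model of Git branches: each branch name points to one snapshot."""

    def __init__(self, files):
        # A snapshot is an immutable set of file names.
        self.branches = {"master": frozenset(files)}
        self.current = "master"

    def branch(self, name):
        # A new branch starts as another name for the current snapshot.
        self.branches[name] = self.branches[self.current]

    def checkout(self, name):
        self.current = name

    def commit(self, new_file):
        # Committing moves only the current branch to a new snapshot.
        self.branches[self.current] = self.branches[self.current] | {new_file}

    def merge(self, other):
        # Simplified merge: union of the two snapshots, no conflict handling.
        self.branches[self.current] = (
            self.branches[self.current] | self.branches[other]
        )

    def ls(self):
        return sorted(self.branches[self.current])


repo = ToyRepo(["Readme.md", "regression.R"])
repo.branch("experiment")
repo.checkout("experiment")
repo.commit("logistic_regression.R")
repo.checkout("master")
before_merge = repo.ls()   # the new file is not visible on master yet
repo.merge("experiment")
after_merge = repo.ls()    # after the merge, master contains the new file
print(before_merge)
print(after_merge)
```

The two printed listings correspond to the two “ls” results shown above, before and after the merge.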
4.11 Sharing Repositories
Using git is an effective way to manage any number of development
projects in a way that preserves changes made during the project
lifecycle, which enhances reproducibility and serves as documentation
of the project’s development trajectory. The examples presented
thus far use a locally created repository, which could optionally be
shared with other users. However, it is
common for developers to share projects using a hosting service
such as GitHub (https://github.com/), which supports over 36 million
developers and 100 million repositories. GitHub offers
a free hosting plan for unlimited public and private repositories,
issue and bug tracking, and project management tools. Proceed to
https://github.com/join?source=header-home to create an
account and select the free plan.
Many newcomers to Git are already familiar with GitHub,
having been directed to download a repository by a publication or
a colleague. Knowledge of git commands is usually not essential for
simply using software in a repository, although participating in
development efforts does require such knowledge. Log in to the
previously created GitHub account and create a new repository by
clicking the “+” button in the upper right corner of the GitHub
screen (Fig. 3):
After selecting “New Repository” a screen will be displayed
that prompts for information including the desired repository
name, an optional description, whether the repository should be
public (leave this selected), and whether to initialize it with a
README file (leave selected). After providing this information,
click the green “Create repository” at the bottom of the screen
(Fig. 4):
The next screen will display a summary of the repository from
which the green “Clone or Download” button can be clicked. After
clicking this button, in the pop-up box the URL for the repository
will be shown. Click the clipboard icon to copy the URL.
4.12 Cloning the Repository Locally
As Git is decentralized, a repository can be cloned by any number of
parties interested in contributing to a project. To clone the repository
locally, open a Terminal and enter the following git command
which will download the repository into the current folder.
/home/ubuntu $ ls -a
. .. .git README.md
/home/ubuntu $ cd mycoolrepo/
/home/ubuntu/mycoolrepo $ ls
README.md
/home/ubuntu $ cd mycoolrepo/
/home/ubuntu/mycoolrepo $ git remote -v
origin https://github.com/wspem/mycoolrepo.git (fetch)
origin https://github.com/wspem/mycoolrepo.git (push)
Untracked files:
(use "git add <file>..." to include in what will be
committed)
plotter.R
This is a familiar result based on the earlier work with the local
MyProject repository. At this point add and commit the file and
then check the status.
Push up the changes. This will require knowing the user id and
password for the Github account created earlier.
Refer back to the web page associated with the Github repository,
which in this example is https://github.com/wspem/mycoolrepo. Note
that the file plotter.R that was added locally is
now present in the Github repository (Fig. 5).
4.13 Next Steps
This walkthrough has provided an overview of Git and Github
basics although there remains a significant degree of capability
that cannot be covered due to space constraints. Understanding
how to manage software conflicts and work with repositories
owned by other developers is also important information. In any
case, the material presented in this section represents a solid basis
upon which laboratory-based developers can efficiently build repro-
ducible software projects.
5.1 Installing Docker
Free installers for Apple Mac OS and Microsoft Windows are
available from Docker Hub, which requires the creation of an
account. The install provides both the Docker daemon and the client
command line interface (CLI).
Microsoft Windows: https://docs.docker.com/docker-for-windows/install/
5.2 Key Docker Concepts
From an end user point of view, the terminology for Docker
involves a basic knowledge of the following ideas:
Client/CLI: This is the CLI (command line interface) that
allows the user to issue various docker commands. The client
usually runs on the same host as the Docker daemon process. The
CLI offers a suite of commands for managing containers, images,
volumes, and networking between containers.
Docker Process/Daemon: This is the persistent process that
handles client requests, checks with the registry, and manages
containers. This usually runs on the same host as the Docker client.
Container: A packaging construct designed to insulate docker
images from the environment in which they will execute. A container
can be thought of as an actively running image with which
the end user can interact.
Image: A collection of bundled dependencies necessary to
execute an application. Base images are those not reliant upon
other images. Examples are operating systems (e.g., Ubuntu,
Debian). Child images are those created on top of base images.
Docker Registry: A registry is a collection of images that can
be searched and deployed. Consider it as a reference point for
locating and initiating containers. An example is Docker Hub
(https://www.docker.com/products/docker-hub).
5.3 Finding Useful Images
There are a number of prebuilt images relating to bioinformatics
and supporting languages including Python and R. For example,
the Rocker project (https://www.rocker-project.org/) offers Docker
containers for the R environment as well as instructions on how to
run the images. Image discovery is straightforward using the
search capability that is part of the default installation of Docker.
The “search” command is used to locate images of interest. Note
also that there is a web-based search tool located at
https://hub.docker.com/search?q=rocker&type=image that offers an
intuitive user interface (Fig. 9).
The “docker search” command has a number of options that
can help filter output, though the basic format given above is typical.
The NAME and DESCRIPTION fields document the image label
and any supplied information. The STARS field is a community
rating of the image, by which the output is sorted. Once the
desired image is found, it is simple to execute it. Below is an example
to run “rocker/r-base.”
>
5.4 Managing Containers and Images
A notable side effect of running containers is that any associated
image might require additional image layers that contribute to the
overall download size. Images remain on the local hard drive unless
explicitly removed. The Docker client includes commands to manage
containers and images as well as tools to create them. Start the
following container using the ubuntu:latest image.
Fig. 15 Using the ps command to verify the existence of the container and image
Fig. 16 Using the ps command to verify the existence of image but not the container
d134adfb783db68defc8894b3c
Status: Downloaded newer image for ubuntu:latest
root@ee1ef120852d:/#
root@ee1ef120852d:/# exit
exit
5.5 A More Practical Use Case
As seen earlier, the rocker/r-base image offers command line
access to the R environment, and the Rocker project offers further
images in support of the R language (Fig. 17).
The rocker/rstudio image in particular provides R language
support along with the RStudio IDE, which in this case is a
web-based service accessible locally via a browser. Because of this,
some options need to be supplied to map the
RStudio port (typically 8787) to a local system port, which can also
be 8787 assuming no other local service is using it. The RStudio
IDE also requires the specification of a password at run time. The
The “-e” option will set the PASSWORD within the container.
Note that this command will be running in attached mode which
means that the terminal window in which it was entered will be
blocked/engaged until the container is terminated. This is not
necessarily a problem since the point of user interaction will be
the RStudio login panel. In order to run rocker/rstudio as a
detached container, include the “-d” option which will leave the
terminal free for input.
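As a compact summary of the options just described, the sketch below assembles the corresponding “docker run” argument list in Python without executing it. The function name rstudio_run_command and its parameters are hypothetical; the options “-e,” “-p,” and “-d,” the image name rocker/rstudio, and the port 8787 are those discussed in the text.

```python
import shlex


def rstudio_run_command(password, host_port=8787, detached=False):
    """Assemble (but do not execute) a 'docker run' command for rocker/rstudio."""
    cmd = ["docker", "run"]
    if detached:
        cmd.append("-d")                    # leave the terminal free for input
    cmd += ["-e", f"PASSWORD={password}"]   # set PASSWORD inside the container
    cmd += ["-p", f"{host_port}:8787"]      # map host port to the RStudio port
    cmd.append("rocker/rstudio")
    return cmd


attached = rstudio_run_command("testpass")
detached = rstudio_run_command("testpass", detached=True)
print(shlex.join(attached))
print(shlex.join(detached))
```

Keeping the command as a list of arguments makes it easy to pass to subprocess.run later, should one wish to launch the container from a script.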
Next, launch a web browser and provide a URL of
http://localhost:8787. This will bring up the RStudio login page. Provide
a user name of “rstudio” and a password of “testpass”
(Fig. 18).
After clicking the “Sign In” button, the RStudio IDE will be
presented for engagement. At this point the user can install any
additional packages and create code as with any instance of R
(Fig. 19).
5.6 Sharing Data
Remember that the container is running under the supervision of
the Docker process and the containerized R environment will in no
way impact any locally existing versions of R. However, a problem
with this specific example is that there is nowhere to save code and
have it persist except within the container itself. For example, a user
might have existing R code in a local folder (e.g.,
metabolomics_r_code) intended for use with the container. Or, in working
with the container, the user might want to save code to the local
$ cd ~
$ cd metabolomics_r_code/
$ pwd
/Users/esteban/metabolomics_r_code
$ ls
linear_regression.R logistic_regression.R support_vector_machine.R
-v /Users/esteban/metabolomics_r_code:/home/rstudio/metabolomics_r_code
{{.Description}}'
bioconductor/release_base release base container
bioconductor/release_core release core container
bioconductor/devel_core2 Automated Build Bioconductor Develement Core...
bioconductor/devel_base2 Automated Build Bioconductor Develement Base...
bioconductor/devel_mscore2 Automated Build Bioconductor Develement msco...
$
$ docker search xcms --limit=5 --format '{{.Name}} \t {{.Description}}'
yufree/xcmsrocker Rocker image for metabolomics study
payamemami/xcms-container
yufree/xcms Non-Target Data Analysis Environment with CR. . .
pcm32/xcms-camera
wilsontom/xcms-dockerdev
jupyter/pyspark-notebook
jupyter/minimal-notebook
jupyter/base-notebook
jupyterhub/singleuser
jupyter/nbviewer
Launching Jupyter.
Based on the previous search for Jupyter notebooks, there is a
jupyter/minimal-notebook image. The command will start
the image in a local container and associate the local folder
/Users/esteban/notebooks with the /home/jovyan folder
inside the container. Also note that the container provides specific
URL information to access the notebook (Fig. 23).
http://127.0.0.1:8888/?token=664e4773930180eacef53877390f30a39d758595ab0b61dc
5.8 Summary
This introduction has provided basic information on how to identify
and run Docker containers, which represent a lightweight alternative
to virtualization techniques involving the use of hypervisor
software. Docker containers are particularly useful for packaging
applications involving multiple languages or run times as well as those
delivered via the web such as RStudio, Jupyter Notebooks, and Galaxy. The
Docker mechanism neatly packages these tools, and supporting
components, into images on behalf of the end user who does not
need to install or compile code. All that is required is installation of
the Docker package for the local host. Docker also offers the ability
Acknowledgments
Abstract
In recent years, mass spectrometry (MS)-based metabolomics has been extensively applied to characterize
biochemical mechanisms, and study physiological processes and phenotypic changes associated with dis-
ease. Metabolomics has also been important for identifying biomarkers of interest suitable for clinical
diagnosis. For the purpose of predictive modeling, in this chapter, we will review various supervised learning
algorithms such as random forest (RF), support vector machine (SVM), and partial least squares-
discriminant analysis (PLS-DA). In addition, we will also review feature selection methods for identifying
the best combination of metabolites for an accurate predictive model. We conclude with best practices for
reproducibility by including internal and external replication, reporting metrics to assess performance, and
providing guidelines to avoid overfitting and to deal with imbalanced classes. An analysis of an example data
set illustrates the use of different machine learning methods and performance metrics.
Key words Metabolomics, Mass spectrometry, Supervised learning, Performance Metrics, Predictive
Modeling
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_16, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Tusharkanti Ghosh et al.
2 Methods
2.1 Missing Values
Supervised learning methods require complete data; however,
untargeted metabolomic data are prone to missing values, where
the data matrix contains zeros in one or more entries. Some studies
Predictive Modeling of Metabolomics Data
2.2 Classification Methods: An Early Look
It is important to note that what we now call supervised learning
dates back to over 80 years ago, when Sir R. A. Fisher introduced
the use of linear discriminant analysis (LDA) [29]. This was a
generative model in which the features conditional on class label
were modeled as a multivariate normal distribution with a mean
vector that depended on group, and a common covariance matrix.
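In symbols, the model just described can be written as follows. The notation is introduced here for illustration and is not taken from the original text: x is the feature vector, y the class label with K classes, mu_k the class mean vector, Sigma the common covariance matrix, and pi_k the prior probability of class k.

```latex
x \mid y = k \;\sim\; \mathcal{N}(\mu_k, \Sigma), \qquad k = 1, \dots, K

\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k
```

An observation is assigned to the class with the largest value of the discriminant function delta_k(x). Because the covariance matrix is shared across classes, delta_k(x) is linear in x, which is where the name of the method comes from.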
2.3 Decision Tree
A Decision Tree (DT) is a supervised machine learning model that
outputs a hierarchical structure to classify subjects [32]. It is a
nonlinear classifier which is mainly used for classifying nonlinearly
separable data. The objective of a decision tree is to develop a model
that predicts the value of a response variable based on several
predictor variables. Figure 1 shows an example of a hypothetical
DT, which divides the data into two categories based on two input
variables. DTs used in data mining can be classified into two groups:
• Classification tree: The predicted outcome is a categorical variable,
representing two or more classes to which the observation belongs.
• Regression tree: The predicted value is a continuous variable.
DT is also known as Classification and Regression Trees
(CART), which was first introduced in the machine learning litera-
ture [33]. The main difference between classification and regres-
sion trees is the criteria on which the split-point decision is made.
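The difference in split criteria can be illustrated with a small sketch. It assumes the criteria commonly used in CART, namely Gini impurity for classification and variance (mean squared deviation) for regression; the function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Mean squared deviation from the mean, used here for regression trees."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_split(x, y, impurity):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    best = (None, float("inf"))
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical intensities of one metabolite and two kinds of response.
metabolite = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
sex = ["F", "F", "F", "M", "M", "M"]        # categorical response
age = [20.0, 22.0, 21.0, 60.0, 62.0, 61.0]  # continuous response

class_split = best_split(metabolite, sex, gini)
regr_split = best_split(metabolite, age, variance)
print(class_split)
print(regr_split)
```

The search over candidate thresholds is identical in both cases; only the impurity function passed to best_split changes, which is the distinction between the two tree types.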
Fig. 1 A simple decision tree that splits the data into two gender groups based on two metabolites
2.4 Random Forest
A Random Forest (RF) is a reliable classifier that is relatively robust
to overfitting. It constructs an ensemble of DTs, that is, an
aggregation of tree-structured predictors [34]. In RF, each tree is
independently constructed using a bootstrap sample of the original
data (the “bagged sample”), which serves as the training data for
that tree. The data that were not drawn in the
bootstrap are referred to as the out-of-bag sample. Since these data
were not used in building the tree, they can serve as a test data set
to evaluate classification accuracy in an unbiased
manner, by calculating the “out-of-bag error” [35]. A measure of
variable importance is also computed by comparing
the results from the original and
randomly permuted versions of the data set. A separate cross-validation
is often not needed to estimate the prediction error, since the
out-of-bag samples already provide such an estimate.
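The bootstrap and out-of-bag samples can be sketched with the Python standard library alone. The function name bootstrap_split, the number of samples, the number of trees, and the random seed below are arbitrary choices made for illustration; no tree is actually fitted.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample of row indices; return (in_bag, out_of_bag)."""
    # Sampling with replacement: some rows appear several times, others never.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
n_samples = 1000
oob_fractions = []
for _ in range(100):  # one bootstrap sample per tree
    in_bag, out_of_bag = bootstrap_split(n_samples, rng)
    oob_fractions.append(len(out_of_bag) / n_samples)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(round(mean_oob, 3))
```

On average a little over one third of the rows are left out of each bootstrap sample (the expected fraction approaches 1/e, about 0.37, for large data sets), so every tree has a sizeable set of unseen rows on which its predictions can be checked.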
RF has become popular as a biomarker detection tool in various
metabolomics studies [36, 37]. RF can deal with
missing data [34, 38] and overfitting issues [39, 40]. In addition,
it can also handle high-dimensional data sets without requiring
prior feature elimination [41].
2.5 Support Vector Machines
Support Vector Machines (SVM) have been previously used in the
analysis of several omics studies, particularly gene expression data
[42–44]. A simple illustration of an SVM is shown in Fig. 2. The main
characteristics that define the concept of SVMs are (a) the criteria
they use to categorize nonlinear relationships; (b) the training
sets that are necessary to optimize the linear classifier; (c) the use of
kernel machines to transform the variables into a higher order
nonlinear space where linear separability holds; and (d) utility in terms
of performance and efficiency for high dimensional data sets.
A major drawback of SVM is its restriction to binary classification
problems: it discriminates between two classes, where the data
points lie in n-dimensional space and n corresponds to the number of
metabolites in our context. A hyperplane is constructed that separates
the data points from the two classes. The hyperplane coefficients are
determined based on the variable (metabolite) importance for
discriminating between two classes.
SVM can yield a hyperplane of p-1 dimension in p dimensional
space. The main purpose of SVM is to optimize the largest margin.
In practice, a separation often does not exist as the data points
cannot always be linearly separated. In such nonlinear cases, a
kernel substitution is adopted to map the data to a higher order
dimension. The maximum-margin hyperplane was the original
algorithm developed as a linear classifier [45]. An extension to
create nonlinear classifiers was proposed by applying the kernel
trick to maximum-margin hyperplanes [46]. The advantage of
using the kernel trick is that it can substitute the linear kernel
with other robust kernels, such as the Gaussian kernel [47]. Also
in the family of nonlinear supervised learners are deep neural net-
works (DNN), which construct a nonlinear function from input
variables to outcome variables using a combination of convolution
filters and hidden layers [48].
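To make the kernel substitution concrete, the following Python sketch evaluates a kernelized SVM decision function on a one-dimensional toy problem that is not linearly separable (the middle point belongs to one class, the two outer points to the other). It is a minimal illustration under stated assumptions: the function names are hypothetical, and the support vectors, coefficients, and bias are hand-picked for the example rather than obtained by fitting an SVM.

```python
import math


def linear_kernel(x, z):
    """Ordinary dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coefs, bias, kernel):
    """f(x) = sum_i coef_i * K(s_i, x) + bias; the sign gives the class.

    Each coef_i stands for the product of a dual weight and the class
    label (+1 or -1) of support vector s_i.
    """
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, coefs)) + bias


# Hand-picked (not fitted) values: class +1 at x = 0, class -1 at x = -2, 2.
support_vectors = [[-2.0], [0.0], [2.0]]
coefs = [-1.0, 2.0, -1.0]
bias = 0.0

for point in ([-2.0], [0.0], [2.0]):
    value = decision_value(point, support_vectors, coefs, bias, gaussian_kernel)
    print(point, "+1" if value > 0 else "-1")
```

With the Gaussian kernel the sign of the decision value changes twice along the line, which no single linear threshold on x can achieve.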
3.1 Feature Selection
Feature Selection (FS) is an important step in successful data
mining procedures [63], such as SVMs [64, 65] and Naïve Bayes [66],
to enhance performance and improve computational efficiency.
However, FS is not strictly necessary for some supervised
algorithms, such as SVM, because they rely on regularization, which
is the process of adding information to prevent overfitting in order
to enhance the predictive accuracy and interpretability of the
supervised learning model. The purpose of feature selection is similar to
model selection [67], which tries to find a compromise between
high predictive accuracy and a model with few predictors. The
inclusion of uninformative input features in a supervised model may
lead to overfitting. Hence, it is reasonable to ignore input features
with negligible or no effect on the output. In the illustrative
example later in this chapter, the objective is to infer the
relationship between gender and the metabolite features.
However, if the sample identifier or any other redundant column
is included as one of the input features, it may cause overfitting. FS
3.3 Metrics for Evaluation
There are several potential metrics by which one can evaluate a
prediction model. The most common metric used in practice is the
classification accuracy, meaning the proportion of predictions from
the model that are correct based on the gold standard label. An
alternative is based on the true and false positive rates. Assume
that we have two groups, disease and control, and that higher values
of the model output correspond to a greater probability of having
disease. Let the model output be $Y$ and the group label be $D$,
where $D = 0$ means control and $D = 1$ means diseased. The false
positive rate at a cutoff $c$ is $FP(c) = P(Y > c \mid D = 0)$.
Similarly, the true positive rate is $TP(c) = P(Y > c \mid D = 1)$.
The true and false positive rates can then be summarized by the
receiver operating characteristic (ROC) curve, which plots $TP(c)$
against $FP(c)$ for all possible cutoff values $c$. The ROC curve
shows the tradeoff between increasing the true positive rate and
increasing the false positive rate. The area under the ROC curve
(AUC) summarizes how well the model can distinguish between the two
diagnostic groups (diseased/control). Other commonly used
metrics are defined in terms of TP(c) and FP(c) as below:
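As a minimal sketch of the definitions of FP(c), TP(c), and the AUC given in this subsection, and not a reproduction of the chapter's own list of metrics, the following Python code computes an empirical ROC curve and its area by the trapezoidal rule. The function names and the four toy scores are assumptions made for illustration, and the code assumes that both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Empirical (FP(c), TP(c)) pairs for every cutoff c, using Y > c.

    labels use D = 1 for diseased and D = 0 for control.
    """
    n_pos = sum(1 for d in labels if d == 1)
    n_neg = len(labels) - n_pos
    # A cutoff below every score gives the point (1, 1); the largest
    # score as cutoff gives the point (0, 0).
    cutoffs = [float("-inf")] + sorted(set(scores))
    points = []
    for c in cutoffs:
        tp = sum(1 for y, d in zip(scores, labels) if y > c and d == 1)
        fp = sum(1 for y, d in zip(scores, labels) if y > c and d == 0)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy model outputs Y and group labels D (illustrative values only).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An AUC of 1 corresponds to perfect separation of the two groups, whereas 0.5 corresponds to a model that ranks cases no better than chance.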
3.4 Imbalanced Classes
In many data sets, there are unequal numbers of cases in each class.
In this situation, the classifier tends to perform better on the
larger (majority) class than on the smaller (minority) class. Often,
however, the research question focuses on discriminating the
minority class from the majority class, while the size of the
minority class is limited by the difficulty, expense, or time of
obtaining the rarer type of sample. This unequal distribution between
classes of a data set is referred to as the imbalanced class problem
[75].
In such cases, the main interest lies in the correct classification
of the “minority class” [76]. Classes with fewer samples or no
sample have a low prior probability and low error cost [77]. The
relation between the distribution of samples in the training set and
the costs of misclassification can be controlled by setting a prior
probability for each class.
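The idea of adjusting class priors can be illustrated with a short Python sketch. It computes the empirical prior of each class and a set of class weights inversely proportional to class frequency, a common heuristic that is assumed here for illustration and is not taken from the chapter. The function names and the 90/10 class split are hypothetical.

```python
from collections import Counter


def empirical_priors(labels):
    """Prior probability of each class as its share of the samples."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.items()}


def balanced_weights(labels):
    """Class weights inversely proportional to class frequency.

    weight = n_samples / (n_classes * class_count), so that every class
    contributes the same total weight to the training criterion.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical imbalanced data set: 90 controls and 10 diseased samples.
labels = ["control"] * 90 + ["disease"] * 10
print(empirical_priors(labels))
print(balanced_weights(labels))
```

With these weights, a misclassified minority sample costs ten times as much as a misclassified majority sample in this toy setting, which counteracts the low prior of the minority class.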
3.5 TRIPOD Guidelines
A recent scientific initiative has focused on developing more
reproducible approaches to the building, evaluation, and validation
of prediction models. One document resulting from this effort is the
TRIPOD (Transparent reporting of a multivariable prediction model
for individual prognosis or diagnosis) statement, which provides
recommendations for fair reporting of studies developing,
validating, or updating a prediction model. It consists of a 22-item
checklist detailing vital information that must be incorporated in a
prediction model study report [86]. For our example analysis below,
we provide supporting information to illustrate how our analysis
satisfies the 22-item TRIPOD checklist (Appendix 1).
4 Illustrative Example
4.2 Training and Test Sets
We split the data (131 samples) into 70% (93 samples) training and
30% (38 samples) test (evaluation) data. For the training data, we
use fivefold CV, in which the 93 training samples are split into five
subsets (folds). In each round, four folds are used to train the
model and the remaining fold is held out as a test set; the
algorithms are trained in this way against each of the folds, and the
metrics for the training dataset are averaged over the five folds.
The test dataset (n = 38 samples) is used to provide an unbiased
evaluation of the best model fit on the training dataset. The test
dataset can be regarded as an external dataset which provides the
gold standard used to evaluate the models, using ROC curves and other
metrics for evaluation. For model validation, we predicted the class
labels of the test data using the trained models for all three
classifiers.
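The splitting scheme described above can be sketched in Python as follows. The sketch reproduces only the sample counts stated in the text (131 samples, 93 for training, 38 for testing, five folds); the function names are hypothetical, and the purely random, unstratified assignment is an assumption, since the text does not say how samples were allocated.

```python
import random


def train_test_split(n_samples, n_train, seed=0):
    """Randomly assign sample indices to a training and a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def k_folds(indices, k=5):
    """Yield (fit, holdout) index lists so each fold is held out once."""
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, holdout


train_idx, test_idx = train_test_split(131, 93)
print(len(train_idx), len(test_idx))

for fit_idx, holdout_idx in k_folds(train_idx, k=5):
    # A model would be trained on fit_idx and scored on holdout_idx here;
    # the five scores are then averaged.
    print(len(fit_idx), len(holdout_idx))
```

The test indices are never passed to `k_folds`, which keeps the final evaluation independent of model fitting.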
4.3 Feature Ranking and Variable Importance
In this section, we implemented different predictive models using
metabolite abundances as the predictor variables and Gender
(Male/Female) as the response, based on the training dataset. We
then computed the Variable Importance Score, which is a measure of
the relevance of each metabolite to gender (see Subheading 3.1).
These scores are nonparametric in nature and range between 0 and
100. They are subsequently used to rank all the features by their
relevance to the classification of our response variable, that is,
Gender. Metabolites with high values are considered to be more
relevant features in the classification problem.
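One way to obtain scores between 0 and 100 and a ranking from them is sketched below in Python. The min-max rescaling is an assumption made for illustration, since the text does not state how the scores were scaled, and the function names, metabolite names, and raw importance values are hypothetical.

```python
def scale_importance(raw_scores):
    """Rescale raw importance values to the range 0 to 100 (min-max)."""
    low = min(raw_scores.values())
    high = max(raw_scores.values())
    span = high - low
    if span == 0:
        # All features equally important: no basis for ranking.
        return {name: 0.0 for name in raw_scores}
    return {name: 100.0 * (v - low) / span for name, v in raw_scores.items()}


def top_features(scaled_scores, k=5):
    """Names of the k features with the highest scaled importance."""
    return sorted(scaled_scores, key=scaled_scores.get, reverse=True)[:k]


# Hypothetical raw importance values for three metabolites.
raw = {"met_a": 2.0, "met_b": 5.0, "met_c": 8.0}
scaled = scale_importance(raw)
print(scaled)
print(top_features(scaled, k=2))
```

Because the rescaling is monotone, it changes only the units of the scores and leaves the ranking of the metabolites unchanged.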
For each of the three classifiers, the top five of the 2999
metabolites were identified as feature metabolites in the training
set with fivefold cross-validation (Fig. 3a–c). Among them, C39 H79
N7 O + 7.3314843, N-palmitoyl-D-sphingosyl-1-(2-aminoethyl)phosphonate,
and C43 H86 N2 O2 are considered to be significant metabolite
features based on the RF and SVM classifiers. However, zeta-Carotene,
an unannotated metabolite (mass: 2520.6355 and retention time:
1.5409486), 5-Hydroxyisourate +4.668069, C13 H28 N2 O4, and
Tyrosine∗ +2.3151746 are identified as good predictors based on
PLS-DA.
4.4 Model Validation
In this section, we evaluated the performance of all three
classifiers on the 30% test data (38 samples) using the trained
models. Here, we present ROC curves of the testing data for all the
predictive models, which are used to assess the diagnostic potential
of a classifier in this clinical metabolomics application. From the
ROC curves, the three methods perform similarly (Fig. 4). Table 1
shows the performance metrics of the testing data evaluated for all
the classifiers. In this testing dataset, we use AUC as our metric to
choose the best performing classifier. Based on this metric, RF has a
small advantage over the other methods (0.87 versus 0.86), but with
other metrics the other methods have a small advantage. In addition,
we also computed the Variable Importance Score on the test dataset.
The top five metabolites for all three classifiers using the test
dataset were exactly the same as those selected using the training
dataset with fivefold CV in the previous section.
Fig. 3 Metabolite relevant feature ranking bar plots (top five metabolites) using Variable Importance Scores
ranging from 0 to 100. (a) Random Forest, (b) Support Vector Machine (SVM), and (c) Partial Least Square-
Discriminant Analysis (PLS-DA) for the training dataset
Fig. 4 ROC curves of the testing dataset obtained from three classification
algorithms (RF, SVM, and PLS-DA). Sensitivity (vertical axis, 0.0 to 1.0) is
plotted against specificity (horizontal axis, decreasing from 1.0 to 0.0)
Table 1
Metrics (area under curve (AUC), sensitivity (SENS), specificity (SPEC), precision (PREC), recall (REC))
to evaluate the performance of classification on testing dataset
5 Summary
References
1. Maniscalco M, Fuschillo S, Paris D, Cutignano A, Sanduzzi A, Motta A (2019) Clinical metabolomics of exhaled breath condensate in chronic respiratory diseases. Adv Clin Chem 88:121–149. https://doi.org/10.1016/bs.acc.2018.10.002
2. Pujos-Guillot E, Petera M, Jacquemin J, Centeno D, Lyan B, Montoliu I, Madej D, Pietruszka B, Fabbri C, Santoro A, Brzozowska A, Franceschi C, Comte B (2018) Identification of pre-frailty sub-phenotypes in elderly using metabolomics. Front Physiol 9:1903. https://doi.org/10.3389/fphys.2018.01903
3. Sarode GV, Kim K, Kieffer DA, Shibata NM, Litwin T, Czlonkowska A, Medici V (2019) Metabolomics profiles of patients with Wilson disease reveal a distinct metabolic signature. Metabolomics 15(3):43. https://doi.org/10.1007/s11306-019-1505-6
4. Wang X, Zhang A, Sun H (2013) Power of metabolomics in diagnosis and biomarker discovery of hepatocellular carcinoma. Hepatology 57(5):2072–2077
5. Caesar LK, Kellogg JJ, Kvalheim OM, Cech NB (2019) Opportunities and limitations for untargeted mass spectrometry metabolomics to identify biologically active constituents in complex natural product mixtures. J Nat Prod 82:469. https://doi.org/10.1021/acs.jnatprod.9b00176
6. Liu LL, Lin Y, Chen W, Tong ML, Luo X, Lin LR, Zhang HL, Yan JH, Niu JJ, Yang TC (2019) Metabolite profiles of the cerebrospinal fluid in neurosyphilis patients determined by untargeted metabolomics analysis. Front Neurosci 13:150. https://doi.org/10.3389/fnins.2019.00150
7. Sanchez-Arcos C, Kai M, Svatos A, Gershenzon J, Kunert G (2019) Untargeted metabolomics approach reveals differences in host plant chemistry before and after infestation with different pea aphid host races. Front Plant Sci 10:188. https://doi.org/10.3389/fpls.2019.00188
8. Wang R, Yin Y, Zhu ZJ (2019) Advancing untargeted metabolomics using data-independent acquisition mass spectrometry technology. Anal Bioanal Chem 411:4349. https://doi.org/10.1007/s00216-019-01709-1
9. Allwood JW, Xu Y, Martinez-Martin P, Palau R, Cowan A, Goodacre R, Marshall A, Stewart D, Howarth C (2019) Rapid UHPLC-MS metabolite profiling and phenotypic assays reveal genotypic impacts of nitrogen supplementation in oats. Metabolomics 15(3):42. https://doi.org/10.1007/s11306-019-1501-x
10. Fang J, Zhao H, Zhang Y, Wong M, He Y, Sun Q, Xu S, Cai Z (2019) Evaluation of gas chromatography-atmospheric pressure chemical ionization tandem mass spectrometry as an alternative to gas chromatography tandem mass spectrometry for the determination of polychlorinated biphenyls and polybrominated diphenyl ethers. Chemosphere 225:288–294. https://doi.org/10.1016/j.chemosphere.2019.03.011
11. Lohr KE, Camp EF, Kuzhiumparambil U, Lutz A, Leggat W, Patterson JT, Suggett DJ (2019) Resolving coral photoacclimation dynamics through coupled photophysiological and metabolomic profiling. J Exp Biol 222:jeb195982. https://doi.org/10.1242/jeb.195982
12. Baumeister TUH, Ueberschaar N, Schmidt-Heck W, Mohr JF, Deicke M, Wichard T, Guthke R, Pohnert G (2018) DeltaMS: a tool to track isotopologues in GC- and LC-MS data. Metabolomics 14(4):41. https://doi.org/10.1007/s11306-018-1336-x
13. Gilmore IS, Heiles S, Pieterse CL (2019) Metabolic imaging at the single-cell scale: recent advances in mass spectrometry imaging. Annu Rev Anal Chem (Palo Alto Calif) 12:201. https://doi.org/10.1146/annurev-anchem-061318-115516
14. Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmuller G, Krumsiek J (2018) Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14(10):128. https://doi.org/10.1007/s11306-018-1420-2
15. Liggi S, Hinz C, Hall Z, Santoru ML, Poddighe S, Fjeldsted J, Atzori L, Griffin JL (2018) KniMet: a pipeline for the processing of chromatography-mass spectrometry metabolomics data. Metabolomics 14(4):52. https://doi.org/10.1007/s11306-018-1349-5
16. Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(1):57
17. Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147
18. Steyerberg EW, van Veen M (2007) Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol 60(9):979
19. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787. https://doi.org/10.1021/ac051437y
20. Wei R, Wang J, Su M, Jia E, Chen S, Chen T, Ni Y (2018) Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 8(1):663. https://doi.org/10.1038/s41598-017-19120-0
21. Zhan X, Patterson AD, Ghosh D (2015) Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics 16:77. https://doi.org/10.1186/s12859-015-0506-3
22. Gromski PS, Xu Y, Kotze HL, Correa E, Ellis DI, Armitage EG, Turner ML, Goodacre R (2014) Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites 4(2):433–452. https://doi.org/10.3390/metabo4020433
23. Kumar N, Hoque MA, Shahjaman M, Islam SM, Mollah MN (2017) Metabolomic biomarker identification in presence of outliers and missing values. Biomed Res Int 2017:2437608. https://doi.org/10.1155/2017/2437608
24. Sun X, Langer B, Weckwerth W (2015) Challenges of inversely estimating Jacobian from metabolomics data. Front Bioeng Biotechnol 3:188. https://doi.org/10.3389/fbioe.2015.00188
25. Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14(12):153. https://doi.org/10.1007/s11306-018-1451-8
26. Di Guida R, Engel J, Allwood JW, Weber RJM, Jones MR, Sommer U, Viant MR, Dunn WB (2016) Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics 12(5):93. https://doi.org/10.1007/s11306-016-1030-9
27. Chen MX, Wang SY, Kuo CH, Tsai IL (2019) Metabolome analysis for investigating host-gut microbiota interactions. J Formos Med Assoc 118(Suppl 1):S10–S22. https://doi.org/10.1016/j.jfma.2018.09.007
28. Shen X, Zhu ZJ (2019) MetFlow: an interactive and integrated workflow for metabolomics data cleaning and differential metabolite discovery. Bioinformatics 35:2870. https://doi.org/10.1093/bioinformatics/bty1066
29. McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley-Interscience, Hoboken
30. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 1. Citeseer, pp 41–48
31. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267
32. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
33. Breiman L (2017) Classification and regression trees. Routledge, Boca Raton
34. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
35. Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recogn Lett 27(4):294–300
36. Chen T, Cao Y, Zhang Y, Liu J, Bao Y, Wang C, Jia W, Zhao A (2013) Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid Based Complement Alternat Med 2013:298183
37. Scott I, Lin W, Liakata M, Wood J, Vermeer CP, Allaway D, Ward J, Draper J, Beale M, Corol D (2013) Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Anal Chim Acta 801:22–33
38. Ho TK (1998) Nearest neighbors in random subspaces. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, pp 640–648
39. Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13(Apr):1063–1095
40. Hapfelmeier A, Hothorn T, Ulm K, Strobl C (2014) A new variable importance measure for random forests with missing data. Stat Comput 24(1):21–34
41. Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1):213
42. Maker AV, Hu V, Kadkol SS, Hong L, Brugge W, Winter J, Yeo CJ, Hackert T, Buchler M, Lawlor RT, Salvia R, Scarpa A, Bassi C, Green S (2019) Cyst fluid biosignature to predict Intraductal papillary mucinous neoplasms of the pancreas with high malignant potential. J Am Coll Surg 228:721. https://doi.org/10.1016/j.jamcollsurg.2019.02.040
43. Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I, Borisov N (2018) FLOating-window projective separator (FloWPS): a data trimming tool for support vector machines (SVM) to improve robustness of the classifier. Front Genet 9:717. https://doi.org/10.3389/fgene.2018.00717
44. Yerukala Sathipati S, Ho SY (2018) Identifying a miRNA signature for predicting the stage of breast cancer. Sci Rep 8(1):16138. https://doi.org/10.1038/s41598-018-34604-3
45. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
46. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
47. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
48. Ripley BD (1994) Flexible non-linear approaches to classification. In: From statistics to neural networks. Springer, Berlin, pp 105–126
49. Contreras-Jodar A, Nayan NH, Hamzaoui S, Caja G, Salama AAK (2019) Heat stress modifies the lactational performances and the urinary metabolomic profile related to gastrointestinal microbiota of dairy goats. PLoS One 14(2):e0202457. https://doi.org/10.1371/journal.pone.0202457
50. Park HG, Jang KS, Park HM, Song WS, Jeong YY, Ahn DH, Kim SM, Yang YH, Kim YG (2019) MALDI-TOF MS-based total serum protein fingerprinting for liver cancer diagnosis. Analyst 144:2231. https://doi.org/10.1039/c8an02241k
51. Quiros-Guerrero L, Albertazzi F, Araya-Valverde E, Romero RM, Villalobos H, Poveda L, Chavarria M, Tamayo-Castillo G (2019) Phenolic variation among Chamaecrista nictitans subspecies and varieties revealed through UPLC-ESI()-MS/MS chemical fingerprinting. Metabolomics 15(2):14. https://doi.org/10.1007/s11306-019-1475-8
52. Wang J, Yan D, Zhao A, Hou X, Zheng X, Chen P, Bao Y, Jia W, Hu C, Zhang ZL, Jia W (2019) Discovery of potential biomarkers for osteoporosis using LC-MS/MS metabolomic methods. Osteoporos Int 30:1491. https://doi.org/10.1007/s00198-019-04892-0
53. Grissa D, Petera M, Brandolini M, Napoli A, Comte B, Pujos-Guillot E (2016) Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data. Front Mol Biosci 3:30. https://doi.org/10.3389/fmolb.2016.00030
54. Bayci AWL, Baker DA, Somerset AE, Turkoglu O, Hothem Z, Callahan RE, Mandal R, Han B, Bjorndahl T, Wishart D, Bahado-Singh R, Graham SF, Keidan R (2018) Metabolomic identification of diagnostic serum-based biomarkers for advanced stage melanoma. Metabolomics 14(8):105. https://doi.org/10.1007/s11306-018-1398-9
55. Catav SS, Elgin ES, Dag C, Stark JL, Kucukakyuz K (2018) NMR-based metabolomics reveals that plant-derived smoke stimulates root growth via affecting carbohydrate and energy metabolism in maize. Metabolomics 14(11):143. https://doi.org/10.1007/s11306-018-1440-y
56. Guo JG, Guo XM, Wang XR, Tian JZ, Bi HS (2019) Metabolic profile analysis of free amino acids in experimental autoimmune uveoretinitis rat plasma. Int J Ophthalmol 12(1):16–24. https://doi.org/10.18240/ijo.2019.01.03
57. Rodrigues-Neto JC, Correia MV, Souto AL, Ribeiro JAA, Vieira LR, Souza MT Jr, Rodrigues CM, Abdelnur PV (2018) Metabolic fingerprinting analysis of oil palm reveals a set of differentially expressed metabolites in fatal yellowing symptomatic and non-symptomatic plants. Metabolomics 14(10):142. https://doi.org/10.1007/s11306-018-1436-7
58. Wong M, Lodge JK (2012) A metabolomic investigation of the effects of vitamin E supplementation in humans. Nutr Metab (Lond) 9(1):110. https://doi.org/10.1186/1743-7075-9-110
59. Li Y, Chen M, Liu C, Xia Y, Xu B, Hu Y, Chen T, Shen M, Tang W (2018) Metabolic changes associated with papillary thyroid carcinoma: a nuclear magnetic resonance-based metabolomics study. Int J Mol Med 41(5):3006–3014. https://doi.org/10.3892/ijmm.2018.3494
60. Rezig L, Servadio A, Torregrossa L, Miccoli P, Basolo F, Shintu L, Caldarelli S (2018) Diagnosis of post-surgical fine-needle aspiration biopsies of thyroid lesions with indeterminate cytology using HRMAS NMR-based metabolomics. Metabolomics 14(10):141. https://doi.org/10.1007/s11306-018-1437-6
61. Westerhuis JA, van Velzen EJ, Hoefsloot HC, Smilde AK (2010) Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics 6(1):119–128
62. Liquet B, Le Cao KA, Hocini H, Thiebaut R (2012) A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics 13:325. https://doi.org/10.1186/1471-2105-13-325
63. Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer Science & Business Media, Norwell
64. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
65. Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero-norm with linear models and kernel methods. J Mach Learn Res 3(Mar):1439–1461
66. Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: ICML 1999, pp 258–267
67. Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3):345–370
68. Guan W, Zhou M, Hampton CY, Benigno BB, Walker LD, Gray A, McDonald JF, Fernández FM (2009) Ovarian cancer detection from metabolomic liquid chromatography/mass spectrometry data by support vector machines. BMC Bioinformatics 10(1):259
69. Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines
70. Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York
71. Behnamian A, Millard K, Banks SN, White L, Richardson M, Pasher J (2017) A systematic approach for variable selection with random forests: achieving stable variable importance values. IEEE Geosci Remote Sens Lett 14(11):1988–1992
72. Van Calster B, Vickers AJ (2015) Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making 35(2):162–169
73. Agresti A (2002) Categorical data analysis. Wiley, New York
74. Huang Y, Sullivan Pepe M, Feng Z (2007) Evaluating the predictiveness of a continuous marker. Biometrics 63(4):1181–1188
75. Holder LB, Haque MM, Skinner MK (2017) Machine learning for epigenetics and future medical applications. Epigenetics 12(7):505–514. https://doi.org/10.1080/15592294.2017.1329068
76. Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data, vol 110. University of California, Berkeley, pp 1–12
77. Breiman L, Friedman J, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall, New York
78. Japkowicz N (2000) Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets. Menlo Park, CA, pp 10–15
79. Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets II, pp 2–1
80. Ling CX, Li C (1998) Data mining for direct marketing: problems and solutions. In: KDD 1998, pp 73–79
81. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
82. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML 1997. Citeseer, pp 179–186
83. Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: KDD 1999, pp 155–164
84. Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41
85. Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on learning from imbalanced datasets II. Citeseer, pp 1–8
86. Collins GS, Reitsma JB, Altman DG, Moons KG (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 13(1):1
87. Cruickshank-Quinn CI, Jacobson S, Hughes G, Powell RL, Petrache I, Kechris K, Bowler R, Reisdorph N (2018) Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep 8(1):17132
88. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, Crapo JD (2010) Genetic epidemiology of COPD (COPDGene) study design. COPD 7(1):32–43. https://doi.org/10.3109/15412550903499522
89. Andersen SL, Briggs FBS, Winnike JH, Natanzon Y, Maichle S, Knagge KJ, Newby LK, Gregory SG (2019) Metabolome-based signature of disease pathology in MS. Mult Scler Relat Disord 31:12–21. https://doi.org/10.1016/j.msard.2019.03.006
90. Lee HS, Seo C, Hwang YH, Shin TH, Park HJ, Kim Y, Ji M, Min J, Choi S, Kim H, Park AK, Yee ST, Lee G, Paik MJ (2019) Metabolomic approaches to polyamines including acetylated derivatives in lung tissue of mice with asthma. Metabolomics 15(1):8. https://doi.org/10.1007/s11306-018-1470-5
91. Long NP, Yoon SJ, Anh NH, Nghi TD, Lim DK, Hong YJ, Hong SS, Kwon SW (2018) A systematic review on metabolomics-based diagnostic biomarker discovery and validation in pancreatic cancer. Metabolomics 14(8):109. https://doi.org/10.1007/s11306-018-1404-2
92. Regan EA, Hersh CP, Castaldi PJ, DeMeo DL, Silverman EK, Crapo JD, Bowler RP (2019) Omics and the search for blood biomarkers in COPD: insights from COPDGene. Am J Respir Cell Mol Biol 61:143. https://doi.org/10.1165/rcmb.2018-0245PS
93. Thévenot EA (2016) ropls: PCA, PLS (-DA) and OPLS (-DA) for multivariate analysis and feature selection of omics data
94. Rinaudo P, Boudah S, Junot C, Thévenot EA (2016) Biosigner: a new method for the discovery of significant molecular signatures from omics data. Front Mol Biosci 3:26
95. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T, Ozturk A, Zararsiz MG, klaR M, biocViews Sequencing, R (2014) Package ‘MLSeq’
96. Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37(suppl_2):W652–W660
97. Luan H, Ji F, Chen Y, Cai Z (2018) statTarget: a streamlined tool for signal drift correction and interpretations of quantitative mass spectrometry-based omics data. Anal Chim Acta 1036:66–72
98. Determan Jr CE, Determan Jr MCE (2015) Package ‘OmicsMarkeR’
99. Rohart F, Gautier B, Singh A, Le Cao K-A (2017) mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol 13(11):e1005752
100. Al-Akwaa FM, Yunits B, Huang S, Alhajaji H, Garmire LX (2018) Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data. GigaScience 7(12):giy136
101. Gift N, Gormley IC, Brennan L, Gormley MC (2010) Package ‘MetabolAnalyze’
102. Gaude E, Chignola F, Spiliotopoulos D, Spitaleri A, Ghitti M, Garcìa-Manteiga JM, Mari S, Musco G (2013) Muma, an R package for metabolomics univariate and multivariate statistical analysis. Curr Metabol 1(2):180–189
103. Palla P (2015) Information management and multivariate analysis techniques for metabolomics data. Università degli Studi di Cagliari
104. Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
Chapter 17
Abstract
MetaboAnalyst (www.metaboanalyst.ca) is an easy-to-use, comprehensive web-based tool, freely available
for metabolomics data processing, statistical analysis, functional interpretation, as well as integration with
other omics data. This chapter first provides an introductory overview of the current MetaboAnalyst
(version 4.0) with regard to its underlying design concepts and user interface structure. Subsequent
sections describe three common metabolomics data analysis workflows covering targeted metabolomics,
untargeted metabolomics, and multi-omics data integration.
Key words Web server, Multivariate statistics, Enrichment analysis, Metabolic pathway analysis,
Multi-omics integration
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_17, © Springer Science+Business Media, LLC, part of Springer Nature 2020
338 Jasmine Chong and Jianguo Xia
1.1 Understanding the MetaboAnalyst Layout
1. To begin, open a preferred web browser and enter the MetaboAnalyst web address:
https://www.metaboanalyst.ca/.
2. At the top of the home page, click the “>>click here to start<<” hyperlink to enter the
“Module Overview” page.
3. In the center of the “Module Overview” page is a circular “clock” showing 12 different
modules currently available in MetaboAnalyst 4.0 (Fig. 1). All modules follow the same
general flow: data upload, data processing and data analysis. To briefly demonstrate this,
select the “Statistical Analysis” button on the top left of the circular panel.
4. The Statistical Analysis data upload page should now be visible (Fig. 2). On the left-hand
side is the navigation tree, which guides users through all analysis steps of the selected
module. This navigation tree is specific for each module. Note the “Upload” hyperlink is
highlighted in blue, representing the current step. On the right-hand side is the R Command
History panel, which displays all underlying R commands as they occur in real-time. The aim
of this panel is to improve the reproducibility and transparency of the MetaboAnalyst web
server. These R commands can be used directly by the companion MetaboAnalystR package to
reproduce one’s results [5, 6].
Metabolomics Data Analysis Using MetaboAnalyst 4.0 339
Fig. 2 A screenshot of the Statistical Analysis data upload page within MetaboAnalyst
Fig. 3 A screenshot of the heatmap generated in the Statistical Analysis module within MetaboAnalyst. To the
left of the image is the navigation tree and to the right is the R Command History panel
5. Select any example dataset listed at the bottom of the page and
click “Submit.” For the purposes of this section, we will speed
through the next few pages. A detailed guide to data processing
and statistical analysis is provided below in Subheading 2.
6. The next page is the Data Integrity Check; click “Skip” at the bottom of the page. Note
that the R commands on the right are updated after each user action. The Normalization page
follows: keep all options set to “None”, press “Normalize” (bottom left), and then “Proceed”
(bottom right). Next is the “Analysis View” page, which shows all the analysis options
available in MetaboAnalyst. Click “Heatmaps” under the “Cluster Analysis” subheading. A
heatmap should now be visible in the center of the screen (Fig. 3). Take some time to adjust
the parameters at the top of the page to update the heatmap, such as switching the “View
Mode” from “Overview” to “Detail View.”
7. To download this plot (and any other plot in MetaboAnalyst)
as a high-resolution image, click the paint palette icon at the
top right of the page. The “Graphics Center” dialog will now
appear, where users can adjust the format, resolution, and size
of their image to download. Click “Submit” and the download
link will now appear at the bottom of the dialog. Right click the
hyperlink and select “Save link as. . .” to download the image.
2.1 Data Upload and Processing
Metabolomics data contain both biologically relevant variation, that is, changes in the
metabolome induced by different experimental conditions, and unwanted variation introduced
during sample preparation, instrumentation, and even data preprocessing (e.g., peak picking,
grouping, and alignment). The aim of data processing is to enhance biologically relevant
signals while reducing unrelated variation in the data [7]. Caution must be applied at this
step to select the most appropriate methods, as the choice can greatly influence the
downstream interpretation of results [8, 9]. Researchers should keep in mind the aim of their
study, the structure of their data, and the statistics they wish to use [7–9]. Methods for
data processing within MetaboAnalyst include missing value imputation, filtering, scaling,
centering, and transformations. In this section we demonstrate how users can process their
data using MetaboAnalyst 4.0.
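To make the processing steps named above concrete, the following is a minimal Python sketch of imputation, log transformation, and autoscaling for a single feature. It is an illustration under stated assumptions, not the MetaboAnalyst implementation: the function names are hypothetical, half-minimum imputation is only one common choice, and the input values are invented.

```python
import math
from statistics import mean, stdev

def impute_half_min(values):
    """Replace missing values (None) with half of the smallest observed value.

    Assumes all observed values are positive.
    """
    observed = [v for v in values if v is not None]
    fill = min(observed) / 2.0
    return [fill if v is None else v for v in values]

def log_transform(values):
    """Log2-transform positive values to reduce right skew."""
    return [math.log2(v) for v in values]

def autoscale(values):
    """Mean-center, then divide by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical concentrations of one metabolite across six samples
raw = [12.0, None, 48.0, 24.0, 6.0, 96.0]
processed = autoscale(log_transform(impute_half_min(raw)))
```

After these steps the feature has mean 0 and unit standard deviation, which puts metabolites measured on very different concentration scales on a comparable footing for multivariate analysis.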
Fig. 4 A screenshot of a volcano plot created using an example dataset within MetaboAnalyst highlighting
glucose in a boxplot
2.2 Univariate Analysis to Identify Significant Features
6. Click “Volcano plot” to view the volcano plot of the data using default parameters. Points
highlighted in red are significant based on the default p-value cutoff of 0.1 and fold-change
threshold of 2.0. Click on the single red point to view a boxplot showing the concentrations
of the selected feature within each group (Fig. 4). Glucose shows a greater concentration in
cachexic patients as compared to controls.
7. There are two icons on the top-right of the plot. Click on the mini table icon to enter
the “Feature Details View” of the detailed result table from volcano analysis. This table
shows compound names, fold-changes and p-values of statistically significant features. Click
on any feature name to view boxplot summaries of both the original and normalized
concentrations.
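The selection rule behind a volcano plot can be sketched in a few lines of Python. This is a simplified illustration, assuming that fold changes and p-values have already been computed; the function name, the feature names, and the numbers are hypothetical, and whether the thresholds are applied inclusively is an assumption.

```python
import math

def volcano_hits(features, fc_threshold=2.0, p_cutoff=0.1):
    """Return names of features passing both the fold-change and p-value filters.

    features: iterable of (name, fold_change, p_value) tuples, where
    fold_change is a ratio of group means and must be positive.
    """
    log_fc_limit = math.log2(fc_threshold)
    hits = []
    for name, fold_change, p_value in features:
        # A 2-fold increase and a 2-fold decrease are treated symmetrically
        if abs(math.log2(fold_change)) >= log_fc_limit and p_value < p_cutoff:
            hits.append(name)
    return hits

# Hypothetical values, not taken from the example dataset
example = [
    ("metabolite_A", 2.6, 0.004),  # large change, significant
    ("metabolite_B", 0.4, 0.030),  # 2.5-fold decrease, significant
    ("metabolite_C", 1.3, 0.001),  # significant, but small change
    ("metabolite_D", 3.1, 0.400),  # large change, not significant
]
selected = volcano_hits(example)
```

Only features that pass both filters are selected, which is why a single point can be highlighted even when several features differ between groups by one criterion alone.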
Fig. 5 A screenshot of a 3D PCA score and loading plot. Hypoxanthine is highlighted with a boxplot
Fig. 6 A screenshot of a PLS-DA VIP scores plot highlighting the top 15 most important features
2.4.1 Metabolite Set Enrichment Analysis
1. Return to the MetaboAnalyst home page by clicking the home icon on the top of the
navigation tree. Click “>>click here to start<<” to enter the “Module Overview” page and
select the “Enrichment Analysis” button to enter the data upload page.
2. On the Enrichment Analysis upload page, click the “A concentration table” panel and select
the example data named “Data 1”. This is the same data used previously for statistical
analysis. Click “Submit” to upload the data.
3. The “Compound Name/ID Standardization” page should appear. The aim of compound name
standardization is to match compounds from the user’s data to those curated in the
MetaboAnalyst knowledgebase. Compounds without
Fig. 7 A screenshot of the interaction network of the SMPDB pathway metabolite set created using the
Enrichment Analysis module of MetaboAnalyst
2.4.2 Metabolic Pathway Analysis
1. Return to the MetaboAnalyst home page by selecting the home icon on the top of the
navigation tree. Click “>>click here to start<<” to enter the “Module Overview” page and
select the “Pathway Analysis” button to enter the data upload page.
2. On the Pathway Analysis upload page, click the “A concentration table” panel and select
the empty box next to “Use the example data.” This is the same data used previously for
enrichment analysis. Click “Submit” to upload the data.
3. The compound standardization, data integrity check, and normalization are identical to
steps 3–5 in Section 2.4.1 above.
4. The next page shows all the options for metabolic pathway
analysis. Keep the default parameters for the pathway analysis
algorithm and KEGG Version. As the example samples were
obtained from humans, select “Homo sapiens (KEGG)” and
click “Submit” at the bottom of the page to perform pathway
analysis.
5. The Pathway Analysis results are displayed on the following page. The “Overview of Pathway
Analysis” plot on the left of the screen provides a graphical summary of the pathway
analysis. In the plot, all matched pathways are represented as circles. The color and size of
each circle correspond to its p-value and pathway impact value, respectively. From this plot,
Glycine, serine and threonine metabolism was the most enriched pathway (smallest p-value) and
had a pathway impact of 0.612. Click on any circle and its corresponding pathway plot will
appear to the right (Fig. 8). Matched compounds in the pathway plot are colored in red,
orange, or yellow based on their corresponding p-values. Click on any of these matched
compounds and a popup will appear showing a boxplot of the concentrations of the selected
metabolite between the two
Fig. 8 A screenshot of the graphical overview of Pathway Analysis results in MetaboAnalyst, highlighting
Glycine, serine and threonine metabolism
Fig. 9 A screenshot of the pathway activity profile plot in MetaboAnalyst summarizing the results of the
mummichog algorithm in the MS Peaks to Paths module
Fig. 10 A screenshot of the pathway summary plot integrating mummichog (y-axis) and GSEA (x-axis) p-
values. The size and color of the circles correspond to the transformed combined p-values
4.1 Joint Pathway Analysis
1. From the MetaboAnalyst home page, click “>>click here to start<<” to enter the “Module
Overview” page. Select the “Joint Pathway Analysis” button on the middle-right of the
circular panel to enter the Data Upload page.
2. On the “Joint Upload” page, click the checkbox next to “Use
our example data” to use the example data. A list of genes and
metabolites will show in the corresponding text areas and the
organism will be set to Homo sapiens. This example data con-
tains a subset of transcriptome and metabolome data from a
study of intrahepatic cholangiocarcinoma [27]. Click “Submit”
to continue.
3. The “Compound/Gene Mapping” page is similar to the
“Compound Name Standardization” page from the enrich-
ment/pathway analysis modules. The page contains two tabs,
showing the results of the compound and gene name mapping,
respectively. Compounds or genes with no matches will be
highlighted in red. Scroll down and press “Submit.”
4. The “Analysis Parameters” page allows users to specify para-
meters for enrichment analysis, topology analysis, underlying
pathway databases, and integration method. There are three
types of pathways—the gene–metabolite pathways, gene-
centric pathways, and metabolite-centric pathways. Please
note, many KEGG pathways contain only genes/proteins and
are only available when gene-centric pathways are selected. By
default genes and metabolites are merged into a single list
which is used to perform enrichment analysis against pathways
containing both genes and metabolites. Keep all default para-
meters and press “Submit.”
5. The result from Joint Pathway Analysis follows the same design
as the Pathway Analysis result (Fig. 11). Briefly, the page is
separated into two, with the top half containing the pathway
visualization and the bottom half showing a detailed results
table (refer to steps 5–6 in Section 2.4.2 for further details).
The “Overview of Pathway Analysis” plot displays all matched
pathways as circles, with the color and size of each circle
corresponding to its p-value and pathway impact value,
Fig. 11 A screenshot of the graphical overview of the Joint Pathway Analysis results in MetaboAnalyst,
highlighting Arginine and proline metabolism
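The enrichment step described in step 4 above, in which genes and metabolites are merged into a single list and tested against each pathway, can be illustrated with an over-representation test. The sketch below assumes a hypergeometric test; the statistic actually applied by MetaboAnalyst depends on the parameters chosen on the “Analysis Parameters” page, and the function name and numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(universe, pathway_size, query_size, hits):
    """Probability of observing at least `hits` pathway members when
    `query_size` items are drawn without replacement from `universe`
    items, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, query_size)
    total = comb(universe, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, query_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical merged list: 50 genes and metabolites out of a universe
# of 1000, with 5 of them falling in a pathway of 20 members
p_value = hypergeom_pvalue(universe=1000, pathway_size=20, query_size=50, hits=5)
```

A small p-value indicates that the pathway contains more members of the merged list than expected by chance. Pathway impact, the second axis of the overview plot, comes from topology analysis and is not covered by this sketch.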
Fig. 12 A screenshot of the metabolite-gene-disease interaction network created using example data within
the Network Explorer module of MetaboAnalyst
5 Conclusion
Acknowledgement
References
1. Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for
metabolomic data analysis and interpretation. Nucleic Acids Res 37:W652–W660
2. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart DS, Xia J (2018)
MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic
Acids Res 46:W486–W494. https://doi.org/10.1093/nar/gky310
3. Xia J, Mandal R, Sinelnikov IV, Broadhurst D, Wishart DS (2012) MetaboAnalyst 2.0—a
comprehensive server for metabolomic data analysis. Nucleic Acids Res 40:W127–W133
4. Xia J, Sinelnikov IV, Han B, Wishart DS (2015) MetaboAnalyst 3.0—making metabolomics
more meaningful. Nucleic Acids Res 43:W251–W257
5. Chong J, Xia J (2018) MetaboAnalystR: an R package for flexible and reproducible analysis
of metabolomics data. Bioinformatics 34:4313–4314.
https://doi.org/10.1093/bioinformatics/bty528
6. Chong J, Yamamoto M, Xia J (2019) MetaboAnalystR 2.0: from raw spectra to biological
insights. Meta 9:57
7. van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJC (2006)
Centering, scaling, and transformations: improving the biological information content of
metabolomics data. BMC Genomics 7:142. https://doi.org/10.1186/1471-2164-7-142
Abstract
Interpretation of metabolomics data in the context of biological pathways is important to gain knowledge
about underlying metabolic processes. In this chapter we present methods to analyze genome-scale models
(GSMs) and metabolomics data together. This includes reading and mining of GSMs using the SBTab
format to retrieve information on genes, reactions, and metabolites. Furthermore, the chapter showcases
the generation of metabolic pathway maps using the Escher tool, which can be used for data visualization.
Lastly, approaches to constrain flux balance analysis (FBA) by metabolomics data are presented.
1 Introduction
1.1 Genome-Scale Models
Similar to how an insect struggling in a spider’s web sends vibrations across the threads,
any originally localized disturbance can result in effects “reverberating” in a complex
pattern across the entire metabolic network. To improve understanding of the complex
metabolic networks present in nature, genome-scale modeling has emerged in recent years as a
means of representing a biological system in a format that is informative and usable for
experiments. A genome-scale model (GSM, or GSMN for genome-scale metabolic network) is a
digital reconstruction that represents the global metabolic processes that occur within an
organism [1, 2]. Such a model can be constrained and optimized to facilitate both qualitative
and quantitative investigation of the biological system of interest, usually in the form of
in silico simulations. GSMs help to expedite research by providing a comprehensive knowledge
base of the organism’s metabolism and by reducing time and financial costs.
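As a point of reference, the constrained optimization mentioned above is commonly written as the flux balance analysis (FBA) problem. The symbols below are standard notation and are assumptions in the sense that they do not appear in this chapter: S is the stoichiometric matrix, v the vector of reaction fluxes, c the objective weights (for example, selecting the biomass reaction), and the bounds limit each flux.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\mathrm{lb}}_{j} \le v_{j} \le v^{\mathrm{ub}}_{j} \quad \text{for every reaction } j
\end{aligned}
```

In this form, omics data can enter the problem in two places: as tighter bounds on individual fluxes (constraints) or as additional terms in the objective.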
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_18, © Springer Science+Business Media, LLC, part of Springer Nature 2020
362 Jake P. N. Hattwell et al.
1.2 Caenorhabditis elegans GSM
The small nematode Caenorhabditis elegans was the first multicellular organism with a
completely sequenced genome [13]. Several GSMs have been published for C. elegans [14–17].
Recently, these different models have been merged into a consensus GSMN christened WormJam
(from “worm jamboree,” i.e., curation meetings of the WormJam community) [18, 19]. This model
currently represents one of the best curated models for C. elegans, but work is still
constantly being undertaken to improve it further. The WormJam model will be used as the
example model throughout this chapter.
Genome Scale Metabolic Networks 363
Fig. 1 Integration of -omics data with flux balance analysis. This illustration shows different ways that -omics
data can be incorporated together with a GSM for FBA. The methods are sorted with respect to the different
omics data sources and whether the data are used to apply constraints to the model (left) or to inform the
objective function that will be optimized (right)
2 Materials
3 Methods
3.1 Reading and Searching for Information in SBTab Files
Metabolic models are typically stored in the SBML format, an XML derivative customized for
use in systems biology. However, this format is hardly human readable and contains a lot of
information intended to aid the computer in processing the file. In contrast to SBML files,
the Systems Biology Table (SBTab) format allows humans to work directly with the model. Both
formats are interconvertible, and free web tools for conversion are available
(https://www.sbtab.net/sbtab/default/converter.html). The SBTab format comprises a series of
separate files that individually contain the reactions, metabolites, genes, and other
information present in the GSM. Because the data in the individual files are related to each
other, the same identifiers are used across all files. This makes it possible to
cross-reference and search for information (e.g., isolate metabolites from a single reaction
and retrieve information on them from the compound table).
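The cross-referencing idea can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the chapter's own workflow: the two tables are tiny in-memory stand-ins, the leading “!!SBTab” declaration line and the “!Name” column are assumptions about the file layout, and only the “!ID” and “!ReactionFormula” column names and the reaction formula itself come from the chapter.

```python
import csv
import io
import re

# Hypothetical stand-ins for a reaction table and a compound table
REACTIONS_TSV = (
    "!!SBTab TableType='Reaction'\n"
    "!ID\t!ReactionFormula\n"
    "R_demo\tM_coa_m + M_nad_m + M_pyr_m <=> M_co2_m + M_nadh_m + M_accoa_m\n"
)
COMPOUNDS_TSV = (
    "!!SBTab TableType='Compound'\n"
    "!ID\t!Name\n"
    "M_pyr_m\tpyruvate\n"
    "M_coa_m\tcoenzyme A\n"
    "M_h2o_m\twater\n"
)

def read_sbtab(text):
    """Parse one table into a list of dicts, skipping the '!!' declaration line."""
    lines = [line for line in text.splitlines() if not line.startswith("!!")]
    return list(csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t"))

reactions = read_sbtab(REACTIONS_TSV)
compounds = {row["!ID"]: row for row in read_sbtab(COMPOUNDS_TSV)}

# Isolate the metabolite identifiers of one reaction, then look them up
formula = reactions[0]["!ReactionFormula"]
metabolite_ids = re.findall(r"M_\w+", formula)
known = [compounds[m]["!Name"] for m in metabolite_ids if m in compounds]
```

Because both tables share the same identifiers, a dictionary keyed on “!ID” is enough to join them; identifiers that are absent from the compound table are simply skipped.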
We use an example model, a truncated version of the WormJam model containing only
information related to the tricarboxylic acid (TCA) cycle, pentose phosphate pathway (PPP),
and glycolysis pathway. This minimal example contains the following SBTab files:
– compartments-SBTab.tsv contains the information on all compartments involved in the model,
– compounds-SBTab.tsv contains the chemical information of all metabolites,
– genes-SBTab.tsv contains all related metabolic genes,
– pathways-SBTab.tsv contains the information on pathways and mapping to external pathway
databases (e.g., KEGG),
Fig. 2 (a) The three most common classes of entities in genome-scale models, and some of the information
they contain. In this representation, Genes 1 and 2 are metabolic genes required for the catalysis of the
reaction, which converts A and B into C. (b) Example tables from SBTab format
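3.1.1 Reading of SBTab Files
1. All SBTab files have to be stored in the same folder. All files are read at the same time
by the function below, which reads all files having a file extension of “-SBTab.tsv” and
creates a data frame (called a tibble) for each file.
# Only the tail of this function survives in the text; the function name, the
# file listing, and the read_tsv() call are assumed reconstructions.
library(tidyverse)
loadSBTab <- function(folder) {
  sbtab_files <- list.files(folder, pattern = "-SBTab\\.tsv$", full.names = TRUE)
  # skip = 1 assumes each file starts with a one-line SBTab declaration
  sbtab_list <- map(sbtab_files, read_tsv, skip = 1)
  # correct names
  sbtab_names <- str_replace_all(basename(sbtab_files), "-SBTab.tsv", "")
  names(sbtab_list) <- paste0(sbtab_names, "_table")
  # create one object per table (e.g., reactions_table) in the calling environment
  for (i in seq_along(sbtab_list)) {
    assign(names(sbtab_list)[i], sbtab_list[[i]], envir = parent.frame())
  }
}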
# source
source("R/loadSBTab.R")
> reactions_table$`!ReactionFormula`[20]
[1] "M_coa_m + M_nad_m + M_pyr_m <=> M_co2_m + M_nadh_m + M_accoa_m"
metabolite_counts %>%
  filter(!`!ID` %in% hub_metabolites)
> reactions_table$`!GeneAssociation`[20]
[1] "(WBGene00010794 and WBGene00007824 and WBGene00009082 and WBGene00015413 and WBGene00011510)"
reactions_table$`!GeneAssociation`[20] %>%
  map(.f = ~str_extract_all(.x, "WBGene\\d+")) %>%
  unlist()
filtered_reactions_table
3.1.2 Interaction and ID Mapping Between Tables
1. So far only entries from the reaction table have been isolated, without additionally
retrieving data from the compound or gene table. In order to obtain this linked information
(e.g., gene symbols for each gene related to pyruvate synthase), we can isolate all WormBase
identifiers and query the gene table. First the gene identifiers are isolated and stored in
a vector. This vector can be then used to filter the gene table using the filter function.
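The same two steps, isolating the identifiers and then filtering the gene table, can be sketched in Python. The gene association string is the one shown earlier in this section; the gene table rows, the “!Symbol” column, and the symbols themselves are hypothetical placeholders rather than WormBase data.

```python
import re

gene_association = (
    "(WBGene00010794 and WBGene00007824 and WBGene00009082 and "
    "WBGene00015413 and WBGene00011510)"
)

# Hypothetical gene table rows; symbols are placeholders
genes_table = [
    {"!ID": "WBGene00010794", "!Symbol": "symbol_1"},
    {"!ID": "WBGene00007824", "!Symbol": "symbol_2"},
    {"!ID": "WBGene99999999", "!Symbol": "symbol_3"},
]

# Step 1: isolate all WormBase gene identifiers from the association rule
gene_ids = re.findall(r"WBGene\d+", gene_association)

# Step 2: keep only the gene table rows whose identifier is in that list
wanted = set(gene_ids)
filtered_genes = [row for row in genes_table if row["!ID"] in wanted]
```

The regular expression ignores the logical operators and parentheses, so it works equally for “and” and “or” rules; it does not preserve the logic of the rule itself.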
3.1.3 Generation of a Suspect List for Pathway Screening
1. The complete information for all metabolites in the reactions can be retrieved from the
previous script (see Note 2). Based on these data, a suspect list for inspection or
annotation of metabolomics data can be generated. First all metabolites are isolated from the
reactions and their chemical data is retrieved. The list of metabolites might contain certain
molecules that cannot be measured by the relevant analytical technique (e.g., protons or
water in the case of mass spectrometry). These compounds and others that represent only
generic compounds with no explicit chemical structure are removed from the list. Afterward,
only unique entries are selected using the distinct() function.
tibble(`!ID` = .) %>%
left_join(., compounds_table, by = c("!ID")) %>%
filter(!.$`!ID` %in% metabolite_exclusion) %>%
select(contains("neutral")) %>%
distinct()
2. One package that allows for the calculation of m/z values from
neutral masses or formulas is masstrixR (Witting and Schmitt-
Kopplin, under review). Data has to be supplied in a specific
format. Therefore a new data frame called suspect_List is
created.
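The underlying arithmetic of turning neutral masses into m/z values for a suspect list can be sketched as follows. This is a stand-alone illustration and does not use or mirror the masstrixR interface: the adduct table, the function name, and the layout of the resulting list are assumptions, and the mass shifts are approximate values for singly charged ions.

```python
# Approximate mass shifts in Da for singly charged adducts
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,    # gain of a proton
    "[M+Na]+": 22.989218,  # gain of a sodium cation
    "[M-H]-": -1.007276,   # loss of a proton
}

def adduct_mz(neutral_mass, adduct):
    """m/z of a singly charged adduct from a monoisotopic neutral mass."""
    return neutral_mass + ADDUCT_SHIFT[adduct]

# Hypothetical input: pyruvic acid, C3H4O3, monoisotopic mass 88.016044 Da
neutral_masses = {"pyruvic acid": 88.016044}

suspect_list = [
    {"name": name, "adduct": adduct, "mz": round(adduct_mz(mass, adduct), 4)}
    for name, mass in neutral_masses.items()
    for adduct in ADDUCT_SHIFT
]
```

Each metabolite yields one row per adduct, so the list can be matched against measured features in either ionization mode.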
Fig. 3 (a) The Escher workflow. Starting with a blank window, reactions are added to the map in a chainlike
fashion, creating a pathway. (b) Overlaying an example table of metabolites onto the generated pathway map.
The Escher Builder window allows for customization of the overlay, through the settings pane
3.2 Generation of Escher Pathway Map for Visualization of Metabolic Fluxes
Escher is a versatile tool for the generation of biological pathway maps that can be used to
visualize metabolic systems and give context to omics data [23]. Several template Escher maps
are available for use on the Escher website; however, for organism-specific pathways or
figure generation, a custom map may be desired. This section details the generation of an
Escher pathway map from scratch (Fig. 3).
3.2.1 Installation of COBRApy and Escher
1. To install the COBRApy and Escher modules, after installing Python 3.6+, open a command
prompt window, and enter the following commands:
3.2.2 Generation of Escher Pathway Map
1. A map can be generated using the SBML file of the model, Python 3, and the COBRApy and
Escher packages. In Python, enter the following commands.
When a file name is presented (nonitalicized), replace this with the path to the
corresponding file from the Demo Scripts in Subheading 2.
Lines beginning with a # represent comments to explain the code and do not need to be
entered.
# import dependencies
import cobra
from escher import Builder
# read the SBML model, create an Escher Builder object, and open
# the Escher server
model = cobra.io.read_sbml_model("WormJam_Toy_Network.xml")
escher_model = Builder(model=model)
escher_model.display_in_browser()
6. Next the data is merged with the table containing the mapping of the measured and model
metabolites. Since not all metabolites that have been measured so far are in the metabolic
model, only metabolites that have a model ID are kept.
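The merge in step 6 amounts to an inner join between the measured data and the mapping table. The Python sketch below illustrates this with invented values; the metabolite names, the model identifiers, and the two-column CSV layout with an identifier column followed by a value column are assumptions, not the chapter's files.

```python
import csv
import io

# Hypothetical measured values keyed by experimental metabolite name
measured = {"Pyruvate": 1.8, "Citrate": 0.6, "Unknown_412": 2.3}
# Hypothetical mapping from experimental names to model identifiers
mapping = {"Pyruvate": "M_pyr_c", "Citrate": "M_cit_m"}

# Inner join: metabolites without a model ID are dropped
merged = [
    (mapping[name], value)
    for name, value in measured.items()
    if name in mapping
]

# Write the result as CSV text with a header row
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "value"])
writer.writerows(merged)
csv_text = buffer.getvalue()
```

Dropping unmapped metabolites keeps the overlay consistent with the map, at the cost of hiding measured compounds that the model does not yet contain; it can be worth logging those separately.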
3.3.2 Mapping and Interpretation of Data
The resulting CSV can then be loaded into Escher through the map JSON file created in
Subheading 3.2.2. An example map of the TCA cycle, PPP, and glycolysis pathway is available
in the Demo Scripts in Subheading 2.
1. To overlay data onto the pathway maps, we begin by importing dependencies.
import cobra
from escher import Builder
3.4 Integrating Endo- and Exometabolomic Data with FBA
3.4.1 Using Endometabolome Data to Constrain a Flux Balance Analysis
To incorporate data from the endometabolome into the model, time series metabolomic data of
the cells of interest are required. MetabFBA is a method that integrates intracellular
metabolomic changes with the objective function of the FBA problem, requiring the model to
allow for the production or consumption of metabolites within the system between time points
[20]. The central logic of MetabFBA is shown in Fig. 4.
Consider an example metabolite, M (Fig. 4b). Its concentration drops during Timeframe f1,
indicating that the demand for M by reactions is greater than the generation of M by
reactions. Conversely, in Timeframe f2 the concentration of M increases; thus, the supply
must have exceeded the demand. The changes in concentration of metabolites across a timeframe
provide information on the changes that are occurring within the system of interest. The
implication that these changes are a result of net production or consumption of metabolites
over time is used by MetabFBA to augment the objective function, adding the demand and supply
of changing metabolites as variables to be maximized. In this section we present a basic
Python implementation of MetabFBA.
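The supply and demand reasoning above can be sketched as a small classification routine. This is a simplified illustration and not part of the MetabFBA implementation: a cutoff on the Welch t statistic stands in for a formal significance test, the cutoff value is arbitrary, and the function names and replicate concentrations are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(before, after):
    """Welch t statistic for replicate concentrations at two time points."""
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se

def classify(before, after, t_cutoff=2.5):
    """Label the change of one metabolite across one timeframe."""
    t = welch_t(before, after)
    if t >= t_cutoff:
        return "up"        # supply exceeded demand in this timeframe
    if t <= -t_cutoff:
        return "down"      # demand exceeded supply in this timeframe
    return "unchanged"

# Hypothetical replicate concentrations of metabolite M at three time points
t0 = [10.1, 9.8, 10.3]
t1 = [6.9, 7.2, 7.0]
t2 = [9.5, 9.9, 9.7]

timeframe_f1 = classify(t0, t1)
timeframe_f2 = classify(t1, t2)
```

Metabolites labeled “up” or “down” in a timeframe are the ones whose net production or consumption would be added to the objective; unchanged metabolites leave the objective as it is.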
1. The experimental metabolomics data must first be mapped to
the model before any further analysis can occur. A series of t-
tests are used to assess if the concentrations of metabolites have
changed significantly; thus, replicates are needed. To begin, we
generate a table of metabolites and their concentrations at
various time points. An example of this can be seen in the file
“MetabolomicsData/metabolomics_table.tsv”. Additionally, a
table that allows for the conversion of experimental metabolite
identifiers in the dataset with the identifiers used within the
model must also exist (e.g., “MetabolomicsData/metabo-
model-mapping.tsv”).
2. The metabolite data table is then converted to a “Differences”
table (example: “Metabolomicsdata/metabo-diffs.tsv”), in
which each column represents the difference in concentrations
for each timeframe (the time between adjacent time points),
and each row is a different metabolite. To do this, we perform
t-tests for each combination of metabolite and timeframe in to
Fig. 4 Overview of MetabFBA approach: (a) Schematic overview of the MetabFBA approach, where t1 and t2
are two consecutive time points (t2 > t1). (b) Schematic representation of the underlying assumption that
metabolite level changes between time points correspond to sustained differences in fluxes within the
timeframe
import csv
import cobra
import cobra.flux_analysis
from cobra.core import Reaction

mapping_file = "PATH_TO_MAPPING_TABLE"
metabolite_file = "PATH_TO_DIFFERENCES_TABLE"
model_file = "PATH_TO_MODEL_FILE"

def integrate_metabolomics(model, group):
    # NOTE: metabolite_data, up_mets_codes, down_mets_codes, and
    # addMetabolomicsReaction() come from parts of the script that are
    # not shown in this excerpt
    metabo_diff_values = [row[metabolite_data[0].index(group)]
                          for row in metabolite_data][1:]
    # arbitrary threshold
    threshold = 2.5
    if len(up_mets_codes) > 0:
        up_mets_react = addMetabolomicsReaction(model, up_mets_codes, "up_metabolites")
        model.metabolites.up_metabolites_c.compartment = 'c'
        model.add_boundary(model.metabolites.up_metabolites_c,
                           type='sink', lb=0, ub=threshold)
    if len(down_mets_codes) > 0:
        down_mets_react = addMetabolomicsReaction(model, down_mets_codes, "down_metabolites")
        model.metabolites.down_metabolites_c.compartment = 'c'
        model.add_boundary(model.metabolites.down_metabolites_c,
                           type='sink', lb=-1 * threshold, ub=0)
    # extend the biomass objective only with the pseudo-reactions that exist
    if len(up_mets_codes) > 0 and len(down_mets_codes) > 0:
        model.objective = (model.reactions.BIO0100.flux_expression
                           + model.reactions.up_metabolites.flux_expression
                           - model.reactions.down_metabolites.flux_expression)
    elif len(up_mets_codes) > 0:
        model.objective = (model.reactions.BIO0100.flux_expression
                           + model.reactions.up_metabolites.flux_expression)
    elif len(down_mets_codes) > 0:
        model.objective = (model.reactions.BIO0100.flux_expression
                           - model.reactions.down_metabolites.flux_expression)
    return model

model = cobra.io.read_sbml_model(model_file)
model = integrate_metabolomics(model, "INSERT_COLUMN_NAME")
solution = cobra.flux_analysis.pfba(model)
print(solution.fluxes["BIO0100"])
3.4.2 Using Exometabolome Data to Constrain a Flux Balance Analysis
The exometabolome can also be used to constrain an FBA. The logic of the approach is related
to the use of endometabolomic data, but rather than modifying the objective function, the
data are used to constrain the FBA directly through the exchange fluxes between the cells and
the external medium.
Extracellular metabolomic data can be obtained through the analysis of spent medium of cell
culture. By analyzing the changes in metabolite concentration over time and the quantity of
cells in the culture, it is then possible to constrain an FBA experiment. Differences in the
exometabolome over time should be a result of transport of metabolites in and out of cells,
which can then be used to place qualitative and/or quantitative constraints on the exchange
reactions. The limits of detection of a specific metabolite and the concentration of that
metabolite in the media can be used to place qualitative constraints on the respective
exchange reaction. Furthering the use of extracellular metabolomic data, the absolute
differences in metabolite concentrations can be used to constrain exchange reactions in a
quantitative fashion.
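A quantitative constraint of this kind can be illustrated with a short calculation. The sketch below is a simplification under stated assumptions and is not the MetaboTools procedure: it treats biomass as constant over the interval, uses an arbitrary relative tolerance, and the function names and numbers are hypothetical.

```python
def exchange_flux(conc_start, conc_end, hours, biomass_gdw, volume_l=1.0):
    """Average exchange flux in mmol per gram dry weight per hour.

    Concentrations are in mmol/L of medium. Negative values mean uptake
    and positive values mean secretion, following the usual sign
    convention for exchange reactions.
    """
    return (conc_end - conc_start) * volume_l / (biomass_gdw * hours)

def exchange_bounds(flux, tolerance=0.2):
    """Lower and upper bound allowing a relative tolerance around the estimate."""
    low, high = flux * (1 - tolerance), flux * (1 + tolerance)
    return (min(low, high), max(low, high))

# Hypothetical example: glucose falls from 10 to 6 mmol/L over 24 h
# in 1 L of medium containing 0.5 g dry weight of cells
glucose_flux = exchange_flux(10.0, 6.0, hours=24, biomass_gdw=0.5)
glucose_bounds = exchange_bounds(glucose_flux)
```

In COBRApy, a pair of bounds like this could be assigned to the `bounds` attribute of the corresponding exchange reaction. A qualitative constraint would instead only fix the sign, for example by setting the upper bound of an exchange reaction to zero when a metabolite is consumed but never secreted.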
This method to constrain an FBA with exometabolomic data
has been incorporated into the MetaboTools toolbox [22], and the
MetaboTools protocols article provides an excellent step-by-step
description of the workflow and also contains tutorials of two use
cases. Thus, the present chapter will not provide detailed instruc-
tions on the use of MetaboTools, as any description by us would
merely duplicate already existing resources. Instead, the reader is
referred to the detailed description of the protocol in the Metabo-
Tools paper [22]. MetaboTools is part of the COBRA toolbox
(https://github.com/opencobra/cobratoolbox), and detailed
descriptions on the installation and use of the COBRA toolbox
are provided in a previous protocol paper [24]. An applied example
of the MetaboTools technique is the use of extracellular metabo-
lomic data to constrain the human genome-scale model in order to
study the metabolic profiles of cancer cells [21, 25].
4 Notes
References
1. Palsson BØ (2015) Systems biology: constraint-based reconstruction and analysis.
Cambridge University Press, Cambridge
2. Thiele I, Palsson BØ (2010) A protocol for generating a high-quality genome-scale
metabolic reconstruction. Nat Protoc 5:93
3. Lee D-S (2010) Interconnectivity of human cellular metabolism and disease prevalence. J
Stat Mech 2010(12):14
4. Bergdahl B, Sonnenschein N, Machado D, Herrgård M, Förster J (2015) Genome-scale models.
In: Villadsen J (ed) Fundamental bioengineering, 1st edn. John Wiley & Sons, Inc., Hoboken,
New Jersey. https://doi.org/10.1002/9783527697441.ch06
5. Tian M, Kumar P, STP G, Reed JL (2017) Metabolic modeling for design of cell factories.
In: Nielsen J, Hohmann S (eds) Systems biology. John Wiley & Sons, Inc., Hoboken, New
Jersey. https://doi.org/10.1002/9783527696130.ch3
6. Brunk E et al (2018) Recon3D enables a three-dimensional view of gene variation in human
metabolism. Nat Biotechnol 36:272
7. de Oliveira Dal’Molin CG et al (2010) AraGEM, a genome-scale reconstruction of the
primary metabolic network in Arabidopsis. Plant Physiol 152(2):579–589
8. Sigurdsson MI et al (2010) A detailed genome-wide reconstruction of mouse metabolism
based on human recon 1. BMC Syst Biol 4(1):140
9. Ebrahim A et al (2015) BiGG models: a platform for integrating, standardizing and sharing
genome-scale models. Nucleic Acids Res 44(D1):D515–D522
10. Edwards JS, Palsson BO (1999) Systems properties of the Haemophilus influenzae Rd
metabolic genotype. J Biol Chem 274(25):17410–17416
11. Krauss M et al (2012) Integrating cellular metabolism into a multiscale whole-body
model. PLoS Comput Biol 8(10):e1002750
12. Yilmaz LS, Walhout AJM (2017) Metabolic network modeling with model organisms. Curr Opin
Chem Biol 36:32–39
13. C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a
platform for investigating biology. Science 282(5396):2012–2018
14. Büchel F et al (2013) Path2Models: large-scale generation of computational models from
biochemical pathway maps. BMC Syst Biol 7(1):116
15. Gebauer J et al (2016) A genome-scale database and reconstruction of Caenorhabditis
elegans metabolism. Cell Syst 2(5):312–322
16. Yilmaz LS, Walhout AJ (2016) A Caenorhabditis elegans genome-scale metabolic network
model. Cell Syst 2(5):297–311
17. Ma L et al (2017) Systems biology analysis using a genome-scale metabolic model shows
that phosphine triggers global metabolic suppression in a resistant strain of C. elegans.
bioRxiv
18. Witting M et al (2018) Modeling meets metabolomics—the WormJam consensus model as
basis for metabolic studies in the model
386 Jake P. N. Hattwell et al.
Abstract
Recent advances in analytical techniques, particularly LC-MS, generate increasingly large and complex
metabolomics datasets. Pathway analysis tools help place the experimental observations into relevant
biological or disease context. This chapter provides an overview of the general concepts and common
tools for pathway analysis, including Mummichog for untargeted metabolomics. Examples of pathway
mapping, MetScape, and Mummichog are explained. This serves as both a practical tutorial and a timely
survey of pathway analysis for label-free metabolomics data.
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_19, © Springer Science+Business Media, LLC, part of Springer Nature 2020
388 Alla Karnovsky and Shuzhao Li
statistical power can also identify patterns that are missed at indi-
vidual gene level. Later, as metabolomics data became available,
similar methods were applied to perform pathway analysis on
metabolites.
2 Materials
5 Pathway Visualization
Fig. 2 Screenshot of MetScape within the Cytoscape framework. Using our example data, a metabolite
network is formed by connecting known metabolic reactions that contain the input metabolites. The full
network is shown in the thumbnail and the metabolites of the tryptophan pathway are shown in the center of the
window
Fig. 3 Example output from Mummichog (version 2.1). (a) The summary report contains a replot of user input
data as two Manhattan plots. Significant pathways are shown as a table and a bar plot. Not included but
scrollable in the screenshot are the module analysis and list of metabolites of interest. (b) Example of a significant
network module. (c) Mummichog computes an empirical p-value by comparing the pathway FET p-values (red)
against those from permutation data (blue). Details of the algorithm are given in Li et al. (2013) [31]
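The comparison described in panel (c) of Fig. 3 can be illustrated with a minimal sketch. It is not Mummichog's implementation: it works directly on metabolite identifiers rather than on m/z features, it draws random metabolite lists of the same size as the significant list, and the function names and toy identifiers are hypothetical. It only shows the general idea of scoring a pathway with a one-sided Fisher's exact test (FET) and then asking how often permuted input gives an equally small FET p-value.

```python
import random
from math import comb


def fisher_right_tail(k, n_sig, K, N):
    """One-sided FET p-value P(X >= k) under the hypergeometric null.

    N: metabolites in the reference set, K: metabolites in the pathway,
    n_sig: size of the significant list, k: overlap between the two.
    """
    upper = min(K, n_sig)
    total = comb(N, n_sig)
    tail = sum(comb(K, i) * comb(N - K, n_sig - i) for i in range(k, upper + 1))
    return tail / total


def empirical_p(pathway, significant, universe, n_perm=1000, seed=0):
    """Return (observed FET p-value, empirical p-value from permutations)."""
    rng = random.Random(seed)
    universe = sorted(universe)  # fixed order so results are reproducible
    N, K, n = len(universe), len(pathway), len(significant)
    observed = fisher_right_tail(len(pathway & significant), n, K, N)
    as_extreme = 0
    for _ in range(n_perm):
        permuted = set(rng.sample(universe, n))  # random list, same size
        p = fisher_right_tail(len(pathway & permuted), n, K, N)
        if p <= observed:
            as_extreme += 1
    # add-one correction keeps the empirical p-value above zero
    return observed, (as_extreme + 1) / (n_perm + 1)


# Toy data: 200 hypothetical metabolites, a 10-member pathway,
# and a 20-member significant list that overlaps the pathway by 6.
universe = ["M%d" % i for i in range(200)]
pathway = set(universe[:10])
significant = set(universe[:6]) | set(universe[100:114])
fet_p, emp_p = empirical_p(pathway, significant, universe)
print(fet_p, emp_p)
```

The add-one correction in the last line of `empirical_p` is a common convention for permutation tests and is an assumption of this sketch, not a detail taken from the Mummichog algorithm.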
8 Notes
Acknowledgments
References
1. Roberts LD, Souza AL, Gerszten RE, Clish CB (2012) Targeted metabolomics. Curr Protoc Mol Biol. Chapter 30:Unit 30.32.31-24. https://doi.org/10.1002/0471142727.mb3002s98
2. Mahieu NG, Patti GJ (2017) Systems-level annotation of a metabolomics data set reduces 25000 features to fewer than 1000 unique metabolites. Anal Chem 89(19):10397–10406. https://doi.org/10.1021/acs.analchem.7b02380
3. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(database issue):D354–D357
4. Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Pujar A, Shearer AG, Travers M, Weerasinghe D, Zhang P, Karp PD (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 40(Database issue):D742–D753. https://doi.org/10.1093/nar/gkr1014
5. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R, Palsson BO (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc Natl Acad Sci U S A 104(6):1777–1782. https://doi.org/10.1073/pnas.0610772104
6. Ma H, Sorokin A, Mazein A, Selkov A, Selkov E, Demin O, Goryanin I (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol 3:135. https://doi.org/10.1038/msb4100177
7. Thiele I, Swainston N, Fleming RM, Hoppe A, Sahoo S, Aurich MK, Haraldsdottir H, Mo ML, Rolfsson O, Stobbe MD, Thorleifsson SG, Agren R, Bolling C, Bordel S, Chavali AK, Dobson P, Dunn WB, Endler L, Hala D, Hucka M, Hull D, Jameson D, Jamshidi N, Jonsson JJ, Juty N, Keating S, Nookaew I, Le Novere N, Malys N, Mazein A, Papin JA, Price ND, Selkov E Sr, Sigurdsson MI, Simeonidis E, Sonnenschein N, Smallbone K, Sorokin A, van Beek JH, Weichart D, Goryanin I, Nielsen J, Westerhoff HV, Kell DB, Mendes P, Palsson BO (2013) A community-driven global reconstruction of human metabolism. Nat Biotechnol 31(5):419–425. https://doi.org/10.1038/nbt.2488
8. Sigurdsson MI, Jamshidi N, Steingrimsson E, Thiele I, Palsson BO (2010) A detailed genome-wide reconstruction of mouse metabolism based on human Recon 1. BMC Syst Biol 4:140. https://doi.org/10.1186/1752-0509-4-140
9. Hao T, Ma HW, Zhao XM, Goryanin I (2010) Compartmentalization of the Edinburgh human metabolic network. BMC Bioinformatics 11:393. https://doi.org/10.1186/1471-2105-11-393
10. Jewison T, Su Y, Disfany FM, Liang Y, Knox C, Maciejewski A, Poelzer J, Huynh J, Zhou Y, Arndt D, Djoumbou Y, Liu Y, Deng L, Guo AC, Han B, Pon A, Wilson M, Rafatnia S, Liu P, Wishart DS (2014) SMPDB 2.0: big improvements to the small molecule pathway database. Nucleic Acids Res 42(Database issue):D478–D484. https://doi.org/10.1093/nar/gkt1067
11. Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA (2007) The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8(9):R183. https://doi.org/10.1186/gb-2007-8-9-r183
12. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. https://doi.org/10.1073/pnas.0506580102
13. Khatri P, Sirota M, Butte AJ (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8
Pathway Analysis 399
32. Huan T, Forsberg EM, Rinehart D, Johnson CH, Ivanisevic J, Benton HP, Fang M, Aisporna A, Hilmers B, Poole FL, Thorgersen MP, Adams MWW, Krantz G, Fields MW, Robbins PD, Niedernhofer LJ, Ideker T, Majumder EL, Wall JD, Rattray NJW, Goodacre R, Lairson LL, Siuzdak G (2017) Systems biology guided by XCMS online metabolomics. Nat Methods 14(5):461–462. https://doi.org/10.1038/nmeth.4260
33. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart DS, Xia J (2018) MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Res 46(W1):W486–W494. https://doi.org/10.1093/nar/gky310
34. Pirhaji L, Milani P, Leidl M, Curran T, Avila-Pacheco J, Clish CB, White FM, Saghatelian A, Fraenkel E (2016) Revealing disease-associated pathways by network integration of untargeted metabolomics. Nat Methods 13(9):770–776. https://doi.org/10.1038/nmeth.3940
35. Barupal DK, Haldiya PK, Wohlgemuth G, Kind T, Kothari SL, Pinkerton KE, Fiehn O (2012) MetaMapp: mapping and visualizing metabolomic data by integrating information from biochemical pathways and chemical and mass spectral similarity. BMC Bioinformatics 13:99. https://doi.org/10.1186/1471-2105-13-99
Chapter 20
Abstract
Metabolomics has been increasingly applied to study renal and related cardiometabolic diseases, including
diabetes and cardiovascular diseases. These studies span cross-sectional studies correlating metabolites with
specific phenotypes, longitudinal studies to identify metabolite predictors of future disease, and physio-
logic/interventional studies to probe underlying causal relationships. This chapter provides a description of
how metabolomic profiling is being used in these contexts, with an emphasis on study design considerations
as a practical guide for investigators who are new to this area. Research in kidney diseases is highlighted to
illustrate key principles. The chapter concludes by discussing the future potential of metabolomics in the
study of renal and cardiometabolic diseases.
Key words Metabolomics, Study design, Cardiometabolic disease, Renal disease, Cardiovascular
disease, Biomarker, Replication
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_20, © Springer Science+Business Media, LLC, part of Springer Nature 2020
402 Casey M. Rebholz and Eugene P. Rhee
1.2 Opportunity for Metabolomics to Improve Biomarker Research
There is a substantial need for the discovery of new biomarkers of
renal and other cardiometabolic diseases to overcome existing
limitations. As outlined in more detail elsewhere in this book, current
metabolomic platforms can detect >1000 known metabolites and
often many more unknown ion features, many of which will be
characterized and added to the growing list of known metabolites
over time [13–15]. Alterations in metabolites could yield new
insights into disrupted metabolism characteristic of prevalent dis-
ease states as well as future disease prognosis, and ideally, provide
mechanistic insight and outline new treatment targets as well.
Further, by integrating both endogenous metabolism and exoge-
nous inputs, such as diet, medications, and the microbiome,
Metabolomics for Renal and Cardiometabolic Diseases 403
2.1 Designing a Population or Clinical Study
Different study designs have their own strengths and weaknesses,
and their suitability depends on the study question of interest as
well as practical constraints such as sample size and available
resources. Cross-sectional, cohort, and case-control studies are types
of epidemiologic, observational study designs that allow for the
description of free-living populations with respect to their expo-
sures (including metabolites) and disease outcomes.
Cross-sectional metabolomics studies assess the metabolome
and the outcome of interest at the same point in time, allowing
for the metabolomic characterization of a prevalent disease state.
Cross-sectional studies can compare metabolite levels in individuals
with or without a disease—for example, coronary disease, type
2 diabetes, CKD—or can test the association (or correlation) of
metabolite levels with related continuous traits. For example, one
study measured ~500 blood metabolites in 1735 participants of the
KORA F4 study, followed by replication in 1164 individuals in the
TwinsUK registry. This study demonstrated the substantial impact
of kidney function—in KORA F4, 112 metabolites were signifi-
cantly associated with eGFR, 54 of which replicated in TwinsUK
[16]. For select metabolites most strongly correlated with eGFR,
the authors then examined their association with measured GFR
among 200 participants of the AASK study, highlighting
C-mannosyltryptophan and pseudouridine as alternative or complementary
markers of kidney function. Cross-sectional studies have
also identified numerous blood metabolites that correlate with
other cardiometabolic traits such as body mass index and insulin
sensitivity, including branched chain amino acids and lipids
enriched for saturated fatty acids [17, 18]. Thus, cross-sectional
studies have been useful in highlighting metabolite perturbations
that coincide with abnormal pathophysiologic conditions.
Although these studies have the potential to identify novel disease
biomarkers and risk factors, the fact that metabolites and outcomes
are being assessed at the same time precludes conclusions about the
direction of causality.
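The discovery-then-replication pattern used in the KORA F4 and TwinsUK example can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the cited study: the Bonferroni thresholds, the requirement that the effect direction agree, the function name, and the toy metabolite labels are all hypothetical choices.

```python
def discover_and_replicate(discovery, replication, alpha=0.05):
    """Two-stage screen over {metabolite: (effect_estimate, p_value)} dicts.

    Stage 1: Bonferroni-corrected significance in the discovery cohort.
    Stage 2: Bonferroni-corrected significance in the replication cohort,
             with the same direction of effect (an assumption of this sketch).
    """
    threshold = alpha / len(discovery)
    hits = [m for m, (_, p) in discovery.items() if p < threshold]

    replicated = []
    if hits:
        rep_threshold = alpha / len(hits)
        for m in hits:
            if m not in replication:
                continue  # metabolite not measured in the second cohort
            effect_d = discovery[m][0]
            effect_r, p_r = replication[m]
            if p_r < rep_threshold and effect_d * effect_r > 0:
                replicated.append(m)
    return hits, replicated


# Toy association results with eGFR (hypothetical values).
discovery = {"A": (-0.30, 1e-6), "B": (0.20, 0.001), "C": (0.10, 0.4), "D": (-0.25, 0.02)}
replication = {"A": (-0.20, 0.001), "B": (-0.10, 0.01)}
hits, replicated = discover_and_replicate(discovery, replication)
print(hits, replicated)
```

Correcting the replication threshold only for the number of discovery hits, rather than for all metabolites tested, is one common convention; other studies use a false discovery rate instead.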
Unlike cross-sectional analyses, longitudinal studies allow for
the establishment of temporality, whereby metabolomic profiling
can be conducted at a baseline time point and the outcome events
occur subsequently such that the exposure (metabolite alteration)
precedes disease onset. These types of studies are useful for
identifying early (preclinical) markers of disease risk that could be used to
risk stratify the population, and potentially to identify causal disease
pathways and therapeutic targets. Longitudinal studies can have
different designs, including cohort and case-control studies. A cohort
study considers all subjects available in a given study population,
some of whom have the outcome of interest (cases). Cohort studies
have the advantage of having a well-characterized underlying study
population and allow for multiple study questions to be addressed.
For example, metabolomics data generated at a baseline time point
in a large cohort can be tested for association with any longitudinal
outcome that has been rigorously assessed over time, such as new
onset kidney disease, diabetes, or cardiovascular disease. For meta-
bolites that emerge as potential risk markers, normal reference
ranges can be determined across all individuals in the cohort and
the correlation with various demographic factors and comorbidities
can be explored. Extensive profiling of a given cohort also permits
other avenues of discovery, for example conducting a genome-wide
association study of metabolite levels if genotyping data has also
been generated [19], and ultimately even pursuing Mendelian
randomization analyses [20].
Assaying all samples in a given cohort study is not always
feasible, considering the cost of metabolomic profiling and the
value of nonrenewable biological specimens, and alternatives are
required to answer specific questions of interest more efficiently. In
a case-control study, one compares individuals who have the out-
come of interest (cases) with individuals who do not but are other-
wise similar (controls). Often, cases and controls are selected from a
larger cohort, or in other words are “nested” within the cohort
study. For example, one study utilized a nested case-control study
design within the Chronic Renal Insufficiency Cohort, consisting
of 200 cases with rapid progression of kidney disease (defined as
GFR loss >3 ml/min per year) and 200 controls with stable kidney
function over time who were matched to cases on baseline eGFR
and proteinuria [21]—this subset of 400 individuals was selected
from the nearly 4000 recruited into the parent cohort. This study
identified a panel of three metabolites (arginine, methionine, and
threonine) for which lower levels were strongly and independently
associated with rapid kidney disease progression, were not asso-
ciated with baseline eGFR, and might serve as markers of renal
metabolic capacity.
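Selecting matched controls for a nested case-control study can be sketched as a greedy nearest-neighbor search. The sketch below is only an illustration of the idea of matching on baseline eGFR and proteinuria; it is not the procedure used in the cited study, and the function name, tolerance values, units, and participant records are hypothetical.

```python
def match_controls(cases, candidates, egfr_tol=5.0, prot_tol=0.5):
    """Greedy 1:1 matching of controls to cases.

    cases, candidates: {participant_id: (baseline_egfr, proteinuria)}.
    A candidate is eligible when both covariates fall within the given
    tolerances; among eligible candidates the closest one is chosen and
    removed from the pool, so each control is used at most once.
    """
    available = dict(candidates)
    pairs = {}
    for case_id, (egfr, prot) in cases.items():
        best_id, best_dist = None, None
        for cand_id, (c_egfr, c_prot) in available.items():
            d_egfr = abs(c_egfr - egfr)
            d_prot = abs(c_prot - prot)
            if d_egfr > egfr_tol or d_prot > prot_tol:
                continue
            # scale each difference by its tolerance so the two are comparable
            dist = d_egfr / egfr_tol + d_prot / prot_tol
            if best_dist is None or dist < best_dist:
                best_id, best_dist = cand_id, dist
        if best_id is not None:
            pairs[case_id] = best_id
            del available[best_id]
    return pairs


# Hypothetical participants: eGFR in ml/min/1.73 m2, proteinuria in g/day.
cases = {"case1": (45.0, 0.8), "case2": (30.0, 2.0)}
candidates = {"ctrl1": (44.0, 0.9), "ctrl2": (31.0, 1.8), "ctrl3": (80.0, 0.1)}
pairs = match_controls(cases, candidates)
print(pairs)
```

Greedy matching depends on the order in which cases are processed; optimal matching algorithms avoid this at the cost of more computation, and either choice should be fixed before metabolomic profiling begins.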
2.3 Controlling for Confounding Factors
Measuring and adjusting for potential confounding factors is a
critical consideration in metabolomics studies of renal and
cardiometabolic disease. Body mass index and diabetes status have
an enormous influence on the metabolome, in large part because
insulin action and sensitivity modulate blood levels of sugars,
amino acids, and lipids [32, 33]. Kidney function also has an
enormous influence on the metabolome, as highlighted by both
observational and physiologic studies described above. Age,
2.4 Specimen Type, Storage, and Handling
The nature of samples used for metabolomics analysis requires
careful consideration. For large cohort studies that have completed
enrollment, samples will have already been collected with
corresponding demographic and clinical data. Access to these sam-
ples often requires direct collaboration with investigators oversee-
ing the cohort studies, or in some cases, can be obtained through
public channels (e.g., the National Institute of Diabetes and Diges-
tive and Kidney Diseases Central Repository or the National Heart,
Lung, and Blood Institute Biologic Specimen and Data Repository
Information Coordinating Center). In addition to large observa-
tional cohorts, biospecimens are also often collected for rando-
mized clinical trials of different pharmacologic or lifestyle
interventions, although access to these specimens may be restricted
in use. Finally, prospective collection by the investigator or a group
of collaborators is an attractive approach to sample accrual when
appropriate, for example for physiologic or interventional studies or
when rare or unusual study populations are required.
Whether samples have already been collected or will be col-
lected prospectively, it is important to recognize that procedures for
sample collection, processing, and storage can all impact the levels
of metabolites in a given biological specimen. First, all sampling
2.6 Data Analysis and Computational Cost of Metabolomics Research
As metabolomics methods develop further, the analytic cost of
measuring the metabolome decreases, but the computational cost
of big data will remain, especially as the ability to run larger batches
and identify more metabolites increases. Key advances that have
allowed for more efficient measurement of the metabolome include
the use of robotics, automated data processing software for metab-
olite identification and quantification, and improvement in quality
control procedures [54]. As such, it is now possible to conduct
metabolomics in large-scale epidemiologic studies, yielding an
increasingly large amount of data in terms of sample size and
identified metabolites.
The most productive metabolomics projects providing mean-
ingful and translatable findings will be led by a team of scientists,
each offering their own expertise in basic science, physiology, clini-
cal medicine, bioinformatics, statistics, and epidemiology. This type
of team science requires support from home institutions, which must
recognize the need for infrastructure and for the career development of
young scientists. There is a need to train early career researchers
to be able to meaningfully analyze metabolomics data and to
appreciate both the physiological significance of detectable meta-
bolites as well as how to effectively use statistical techniques. Chap-
ters 14–16 in this book provide a foundation of data science and
statistics. Other chapters give more specific treatment of topics in
metabolomics research. Readers are also encouraged to visit the
supplemental website (https://metabolomics-data.github.io) for
more training materials.
3 Summary
References
1. Thygesen K, Alpert JS, Jaffe AS, Chaitman BR, Bax JJ, Morrow DA, White HD, Executive Group on behalf of the Joint European Society of Cardiology/American College of Cardiology/American Heart Association/World Heart Federation Task Force for the Universal Definition of Myocardial I (2018) Fourth universal definition of myocardial infarction (2018). Glob Heart 13(4):305–338. https://doi.org/10.1016/j.gheart.2018.08.004
2. Anderson KM, Odell PM, Wilson PW, Kannel WB (1991) Cardiovascular disease risk profiles. Am Heart J 121(1 Pt 2):293–298
3. Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB, Gibbons R, Greenland P, Lackland DT, Levy D, O’Donnell CJ, Robinson JG, Schwartz JS, Shero ST, Smith SC Jr, Sorlie P, Stone NJ, Wilson PW, Jordan HS, Nevo L, Wnek J, Anderson JL, Halperin JL, Albert NM, Bozkurt B, Brindis RG, Curtis LH, De Mets D, Hochman JS, Kovacs RJ, Ohman EM, Pressler SJ, Sellke FW, Shen WK, Smith SC Jr, Tomaselli GF, American College of Cardiology/American Heart Association Task Force on Practice G (2014) 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association task force on practice guidelines. Circulation 129(25 Suppl 2):S49–S73. https://doi.org/10.1161/01.cir.0000437741.48606.98
4. Kidney Disease: Improving Global Outcomes (KDIGO) (2013) KDIGO clinical practice guideline for the evaluation and management of chronic kidney disease. Kidney Int 3(1):1–150
5. Inker LA, Schmid CH, Tighiouart H, Eckfeldt JH, Feldman HI, Greene T, Kusek JW, Manzi J, Van Lente F, Zhang YL, Coresh J, Levey AS, Chronic Kidney Disease Epidemiology Collaboration Investigators (2012) Estimating glomerular filtration rate from serum creatinine and cystatin C. N Engl J Med 367(1):20–29. https://doi.org/10.1056/NEJMoa1114248
6. Levey AS, Inker LA, Matsushita K, Greene T, Willis K, Lewis E, De Zeeuw D, Cheung AK, Coresh J (2014) GFR decline as an endpoint for clinical trials in CKD: a scientific workshop sponsored by the National Kidney Foundation and the US Food and Drug Administration. Am J Kidney Dis 64(6):821–835
7. Stevens LA, Coresh J, Greene T, Levey AS (2006) Assessing kidney function--measured and estimated glomerular filtration rate. N Engl J Med 354(23):2473–2483. https://doi.org/10.1056/NEJMra054415
8. Stevens LA, Schmid CH, Greene T, Li L, Beck GJ, Joffe MM, Froissart M, Kusek JW, Zhang YL, Coresh J, Levey AS (2009) Factors other than glomerular filtration rate affect serum cystatin C levels. Kidney Int 75(6):652–660. https://doi.org/10.1038/ki.2008.638
9. HJL H, Greene T, Tighiouart H, Gansevoort RT, Coresh J, Simon AL, Chan TM, Hou FF, Lewis JB, Locatelli F, Praga M, Schena FP, Levey AS, Inker LA, Chronic Kidney Disease Epidemiology C (2019) Change in albuminuria as a surrogate endpoint for progression of kidney disease: a meta-analysis of treatment effects in randomised clinical trials. Lancet Diabetes Endocrinol 7(2):128–139. https://doi.org/10.1016/S2213-8587(18)30314-0
10. Coresh J, Heerspink HJL, Sang Y, Matsushita K, Arnlov J, Astor BC, Black C, Brunskill NJ, Carrero JJ, Feldman HI, Fox CS, Inker LA, Ishani A, Ito S, Jassal S, Konta T, Polkinghorne K, Romundstad S, Solbu MD, Stempniewicz N, Stengel B, Tonelli M, Umesawa M, Waikar SS, Wen CP, Wetzels JFM, Woodward M, Grams ME, Kovesdy CP, Levey AS, Gansevoort RT, Chronic Kidney Disease Prognosis C, Chronic Kidney Disease Epidemiology C (2019) Change in albuminuria and subsequent risk of end-stage kidney disease: an individual participant-level consortium meta-analysis of observational studies. Lancet Diabetes Endocrinol 7(2):115–127. https://doi.org/10.1016/S2213-8587(18)30313-9
11. Inker LA, Levey AS, Pandya K, Stoycheff N, Okparavero A, Greene T, Chronic Kidney Disease Epidemiology C (2014) Early change in proteinuria as a surrogate end point for kidney disease progression: an individual patient meta-analysis. Am J Kidney Dis 64(1):74–85. https://doi.org/10.1053/j.ajkd.2014.02.020
12. Waikar SS, Rebholz CM, Zheng Z, Hurwitz S, Hsu CY, Feldman HI, Xie D, Liu KD, Mifflin TE, Eckfeldt JH, Kimmel PL, Vasan RS, Bonventre JV, Inker LA, Coresh J, Chronic Kidney
Metabolite profiling of blood from individuals undergoing planned myocardial infarction reveals early markers of myocardial injury. J Clin Investig 118(10):3503–3512. https://doi.org/10.1172/JCI35111
28. Rhee EP, Clish CB, Ghorbani A, Larson MG, Elmariah S, McCabe E, Yang Q, Cheng S, Pierce K, Deik A, Souza AL, Farrell L, Domos C, Yeh RW, Palacios I, Rosenfield K, Vasan RS, Florez JC, Wang TJ, Fox CS, Gerszten RE (2013) A combined epidemiologic and metabolomic approach improves CKD prediction. J Am Soc Nephrol 24(8):1330–1338. https://doi.org/10.1681/ASN.2012101006
29. Curtin F, Schulz P (1998) Multiple correlations and Bonferroni’s correction. Biol Psychiatry 44(8):775–777
30. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate - a practical and powerful approach to multiple testing. J R Stat Soc B 57(1):289–300
31. Niewczas MA, Sirich TL, Mathew AV, Skupien J, Mohney RP, Warram JH, Smiles A, Huang X, Walker W, Byun J, Karoly ED, Kensicki EM, Berry GT, Bonventre JV, Pennathur S, Meyer TW, Krolewski AS (2014) Uremic solutes and risk of end-stage renal disease in type 2 diabetes: metabolomic study. Kidney Int 85(5):1214–1224. https://doi.org/10.1038/ki.2013.497
32. Warren B, Rebholz CM, Sang Y, Lee AK, Coresh J, Selvin E, Grams ME (2018) Diabetes and trajectories of estimated glomerular filtration rate: a prospective cohort analysis of the atherosclerosis risk in communities study. Diabetes Care 41(8):1646–1653. https://doi.org/10.2337/dc18-0277
33. Bell EK, Gao L, Judd S, Glasser SP, McClellan W, Gutierrez OM, Safford M, Lackland DT, Warnock DG, Muntner P (2012) Blood pressure indexes and end-stage renal disease risk in adults with chronic kidney disease. Am J Hypertens 25(7):789–796. https://doi.org/10.1038/ajh.2012.48
34. Yin X, Subramanian S, Willinger CM, Chen G, Juhasz P, Courchesne P, Chen BH, Li X, Hwang SJ, Fox CS, O’Donnell CJ, Muntendam P, Fuster V, Bobeldijk-Pastorova I, Sookoian SC, Pirola CJ, Gordon N, Adourian A, Larson MG, Levy D (2016) Metabolite signatures of metabolic risk factors and their longitudinal changes. J Clin Endocrinol Metab 101(4):1779–1789. https://doi.org/10.1210/jc.2015-2555
35. Bhupathiraju SN, Guasch-Ferre M, Gadgil MD, Newgard CB, Bain JR, Muehlbauer MJ, Ilkayeva OR, Scholtens DM, Hu FB, Kanaya AM, Kandula NR (2018) Dietary patterns among Asian Indians living in the United States have distinct Metabolomic profiles that are associated with Cardiometabolic risk. J Nutr 148(7):1150–1159. https://doi.org/10.1093/jn/nxy074
36. Ramezani A, Raj DS (2014) The gut microbiome, kidney disease, and targeted interventions. J Am Soc Nephrol 25(4):657–670. https://doi.org/10.1681/ASN.2013080905
37. Coresh J, Inker LA, Sang Y, Chen J, Shafi T, Post WS, Shlipak MG, Ford L, Goodman K, Perichon R, Greene T, Levey AS (2018) Metabolomic profiling to improve glomerular filtration rate estimation: a proof-of-concept study. Nephrol Dial Transplant 34(5):825–833. https://doi.org/10.1093/ndt/gfy094
38. Titan SM, Venturini G, Padilha K, Tavares G, Zatz R, Bensenor I, Lotufo PA, Rhee EP, Thadhani RI, Pereira AC (2019) Metabolites related to eGFR: evaluation of candidate molecules for GFR estimation using untargeted metabolomics. Clin Chim Acta 489:242–248. https://doi.org/10.1016/j.cca.2018.08.037
39. Goek ON, Doring A, Gieger C, Heier M, Koenig W, Prehn C, Romisch-Margl W, Wang-Sattler R, Illig T, Suhre K, Sekula P, Zhai G, Adamski J, Kottgen A, Meisinger C (2012) Serum metabolite concentrations and decreased GFR in the general population. Am J Kidney Dis 60(2):197–206. https://doi.org/10.1053/j.ajkd.2012.01.014
40. Ng DP, Salim A, Liu Y, Zou L, Xu FG, Huang S, Leong H, Ong CN (2012) A metabolomic study of low estimated GFR in non-proteinuric type 2 diabetes mellitus. Diabetologia 55(2):499–508. https://doi.org/10.1007/s00125-011-2339-6
41. Luo S, Coresh J, Tin A, Rebholz CM, Appel LJ, Chen J, Vasan RS, Anderson AH, Feldman HI, Kimmel PL, Waikar SS, Kottgen A, Evans AM, Levey AS, Inker LA, Sarnak MJ, Grams ME, Chronic Kidney Disease Biomarkers Consortium I (2019) Serum Metabolomic alterations associated with proteinuria in CKD. Clin J Am Soc Nephrol 14(3):342–353. https://doi.org/10.2215/CJN.10010818
42. Guo L, Milburn MV, Ryals JA, Lonergan SC, Mitchell MW, Wulff JE, Alexander DC, Evans AM, Bridgewater B, Miller L, Gonzalez-Garay ML, Caskey CT (2015) Plasma metabolomic profiles enhance precision medicine for volunteers of normal health. Proc Natl Acad Sci U S A 112(35):E4901–E4910. https://doi.org/10.1073/pnas.1508425112
43. Krebs-Smith SM, Pannucci TE, Subar AF, Kirkpatrick SI, Lerman JL, Tooze JA, Wilson MM, Reedy J (2018) Update of the healthy eating index: HEI-2015. J Acad Nutr Diet
Abstract
Rapid advancements in metabolomics technologies have allowed for application of liquid chromatography
mass spectrometry (LCMS)-based metabolomics to investigate a wide range of biological questions. In
addition to an important role in studies of cellular biochemistry and biomarker discovery, an exciting
application of metabolomics is the elucidation of mechanisms of drug action (Creek et al., Antimicrob
Agents Chemother 60:6650–6663, 2016; Allman et al., Antimicrob Agents Chemother 60:6635–6649,
2016). Although it is a very useful technique, challenges in raw data processing, extracting useful informa-
tion out of large noisy datasets, and identifying metabolites with confidence, have meant that metabolomics
is still perceived as a highly specialized technology. As a result, metabolomics has not yet achieved the
same extent of uptake in laboratories around the world as genomics or transcriptomics. With a view to
bringing metabolomics within reach of the nonspecialist scientist, here we describe a routine workflow with
IDEOM, a graphical user interface within Microsoft Excel, a program familiar to almost all researchers.
IDEOM consists of custom-built algorithms that allow LCMS data processing, automatic noise
filtering and identification of metabolite features (Creek et al., Bioinformatics 28:1048–1049, 2012). Its
automated interface incorporates advanced LCMS data processing tools, mzMatch and XCMS, and
requires R for complete functionality. IDEOM is freely available for all researchers and this chapter will
focus on describing the IDEOM workflow for the nonspecialist researcher in the context of studies
designed to elucidate mechanisms of drug action.
Key words Metabolomics, IDEOM, Drug mechanism, Mode of action, Microsoft Excel, LCMS,
Data processing
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_21, © Springer Science+Business Media, LLC, part of Springer Nature 2020
420 Anubhav Srivastava and Darren J. Creek
2.1 Cell Culture and Drug Incubations for Metabolomics Analysis
Trophozoite-stage asexual P. falciparum (3D7) parasites (causative
agents of malaria) were cultured using a previously defined method
[13], using human RBCs (obtained from the Australian Red Cross
Blood Service) at 7–8% parasitemia (iRBC) and 3% hematocrit in
RPMI 1640 with hypoxanthine and 0.5% Albumax (Gibco) at
37 °C under a specialized gas mix (95% N2, 4% CO2, 1% O2). For
drug-induced metabolic perturbation experiments, 200 μL iRBCs
were incubated with 1 μM of test compounds for 5 h in 96-well
plates in quadruplicate. This short incubation with test compounds
was intended to induce metabolic perturbations in the malaria
parasite without causing significant mortality in the cell population,
as P. falciparum in vitro activity assays are usually performed using a
48–72-h incubation.
2.2 Metabolite Extraction for Metabolomics Analysis
Following sublethal drug incubation, culture medium was removed
and the metabolism of cells was quenched by resuspending the cells
in ice-cold PBS. All the subsequent steps were performed on ice.
Cells were washed by centrifugation at 1000 × g for 5 min and PBS
was removed. Metabolite extraction was performed by adding
135 μL methanol (containing internal standards) followed by
rapid mixing. Samples were then gently mixed for 1 h at 4 °C and
then centrifuged at 3000 × g for 5 min to precipitate proteins and
other insoluble molecules. The supernatant, which contained the
2.3 LC-MS Based Metabolomics Analysis
The method of choice for this study was LC-MS, using hydrophilic
interaction liquid chromatography (HILIC) and high-resolution
(Orbitrap) mass spectrometry (MS). Briefly, samples (10 μL) were
injected onto a Dionex RSLC U3000 LC system (Thermo) fitted
with a ZIC-pHILIC column (5 μm, 4.6 by 150 mm; Merck).
20 mM ammonium carbonate (A) and acetonitrile (B) were used
as the mobile phases with a 30 min gradient starting from 80% B to
40% B over 20 min, followed by washing at 5% B for 3 min and
reequilibration at 80% B. Mass spectrometry utilized a Q-Exactive
MS (Thermo) with a heated electrospray source operating in posi-
tive and negative modes (rapid switching) and a mass resolution of
35,000 from m/z 85 to 1050. Before each batch of samples, blanks
and mixtures of authentic standards (234 metabolites) were ana-
lyzed, and 2 pooled extracts were analyzed in data-dependent
tandem mass spectrometry (MS/MS) mode to facilitate down-
stream metabolite identification where necessary. Pooled QC sam-
ples were analyzed periodically throughout each batch.
2.4 LC-MS Data Analysis Using IDEOM
Data were analyzed using the IDEOM pipeline as described in this
chapter (see Subheading 3). Briefly, raw files were converted to
mzXML with msconvert [14], LC-MS peak signals were extracted
with the centWave algorithm in XCMS [15], samples were aligned
and artifacts were filtered with mzMatch [2], and additional data
filtering and feature identification were performed based on accurate
mass and retention time [16]. The parameters used in automated
IDEOM filtering for this study are shown in Fig. 1, and the
annotated peak metadata and chromatogram images allowed further
manual data filtering to remove features that were not reliably
detected across replicates and those with very low-quality
chromatographic peaks. This reduced the 15,000
detected LC-MS peaks, which included many noise and electrospray
artifacts, to a manageable 460 reproducible peaks with reasonable
putative identifications. Metabolite identification (level 1 confidence
according to the Metabolomics Standards Initiative (MSI)
[17]) corresponded to a confidence score of 10 in IDEOM, based on
accurate mass and retention time for metabolites that were present
in the standard mixture. Other features were putatively annotated
(MSI level 2) based on accurate mass and predicted retention time
using the IDEOM database. Metabolite abundance was determined
by peak height and normalized to the average for untreated
samples from the same batch. Univariate statistical analyses utilized
Welch’s t test (α = 0.05) and Pearson’s correlation (Microsoft
IDEOM for Drug Mechanism 423
Fig. 1 Parameters used for automated IDEOM filtering and feature identification from raw metabolomics data
in the atovaquone mode of action study [7]
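The normalization and univariate test described in Subheading 2.4 can be illustrated with a minimal sketch. It is not the IDEOM implementation: the function names and peak heights below are hypothetical, and only the t statistic and the Welch-Satterthwaite degrees of freedom are computed, since a p-value additionally needs a t-distribution function.

```python
from math import sqrt
from statistics import mean, variance


def normalize_to_untreated(heights, untreated_heights):
    """Express peak heights relative to the mean of the untreated samples."""
    reference = mean(untreated_heights)
    return [h / reference for h in heights]


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical peak heights for one metabolite (four replicates per group).
untreated = [100.0, 110.0, 90.0, 100.0]
treated = [210.0, 190.0, 220.0, 180.0]

untreated_norm = normalize_to_untreated(untreated, untreated)
treated_norm = normalize_to_untreated(treated, untreated)
t_stat, dof = welch_t(treated_norm, untreated_norm)
```

Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same unequal-variance test and also returns the p-value to compare against α = 0.05.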
2.5 Mode of Action of Atovaquone Confirmed by IDEOM-Based Analysis
Although 100 compounds were tested in this study, for the purposes
of this chapter we focused on the mode of action of a known
antimalarial drug, atovaquone. The comparison sheet (see IDEOM
processing) showed the metabolite intensities of drug-treated samples
expressed relative to the “control” group, which was DMSO
(vehicle control) in this study (the IDEOM file with the entire data set can
be downloaded from the supplemental material for this study [7]).
As expected [19], specific changes in the levels of metabolites
involved in pyrimidine biosynthesis were observed in metabolite
extracts from parasites treated with atovaquone. The observed
increase in levels of N-carbamoyl-L-aspartate and dihydroorotate
in atovaquone-treated cultures, and the decrease in downstream
Fig. 2 Top panel: Snapshot of the comparison sheet highlighting the metabolites involved in pyrimidine
biosynthesis in the atovaquone mode of action study [7]. Bottom panel: Schematic representation of the
pyrimidine biosynthesis pathway in malaria parasite and the step inhibited by atovaquone (ATV: Atovaquone,
DHODH: dihydroorotate dehydrogenase)
3 IDEOM Workflow
Fig. 3 Screenshot of the IDEOM template as it appears when the file is first opened. Areas circled in red
indicate the location of I: “Save As” button, II: “Enable Content” button to allow macros to run, III: Button for
installation of the necessary R packages, IV: Adjustable parameters for IDEOM, mzMatch and XCMS functions,
V: Cells to enter names of the internal standards, VI: worksheets containing metabolite database, retention
time and fragment information. See information in the “Getting Started” section for more details
3.2 Getting Started
Figure 3 gives a screenshot of the IDEOM template as it appears
when the file is first opened.
1. Open the IDEOM.xlsb file in MS Excel and “Save As” with
your study name (e.g., mystudy1.xlsb).
2. Enable Macros by clicking the “Enable content” button in the
“Security Warning” ribbon.
3.4 Processing Steps
3.4.1 Raw Data Processing
The raw data processing steps can be executed by clicking the
corresponding blue buttons at the top-left of the settings sheet
(Fig. 4).
1. Manually sort files into folders according to study group.
(a) Create a working directory which is not within “My Docu-
ments” or “Program Files,” create folders for each group
of replicates in the study (control, treatment, blank, QC,
etc.) and move relevant raw files into these folders.
(b) This step may be skipped if the user wishes to process all
files without grouping.
(c) Avoid spaces in folder names and filenames. It is recommended
to replace spaces with underscores. For example,
use “IDEOM_trial” rather than “IDEOM trial.”
2. Convert RAW to mzXML files and split polarity.
(a) Run step (b) (click blue button) if dealing with raw files.
(Skip this step if you already have mzXML files.)
(b) This step uses msconvert (or ReAdW), through R, to
convert raw LC-MS files to the .mzXML format. msconvert
is recommended.
(c) The location of msconvert.exe (or ReAdW.exe) on your
computer needs to be assigned, and can be stored in cell
E43.
(d) msconvert.exe needs to be in the same folder as the Pwiz
dll files (keep all Pwiz files together). (ReAdW requires
zlib1.dll to be in your Windows directory to work
correctly.)
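For readers who want to see what the file preparation and conversion steps above amount to outside Excel, the following is a rough sketch only. IDEOM calls msconvert through R, and its exact call is not reproduced here. The paths and the helper names `safe_name` and `msconvert_command` are hypothetical. The `--mzXML`, `--filter`, and `-o` options belong to ProteoWizard msconvert, and the sketch assumes the `polarity` filter is available in the installed version. The commands are built but not executed.

```python
from pathlib import Path


def safe_name(name):
    """Replace spaces with underscores, as recommended for folders and files."""
    return name.replace(" ", "_")


def msconvert_command(msconvert_exe, raw_file, out_dir, polarity):
    """Build (but do not run) an msconvert call that keeps one polarity."""
    return [
        str(msconvert_exe),
        str(raw_file),
        "--mzXML",
        "--filter", f"polarity {polarity}",
        "-o", str(out_dir),
    ]


# Hypothetical locations; adjust to the local installation and study layout.
exe = Path("C:/pwiz/msconvert.exe")
raw = Path("C:/IDEOM_trial/control") / safe_name("control 01.raw")

# One command per polarity, each writing to its own output folder.
commands = [
    msconvert_command(exe, raw, raw.parent / mode, mode)
    for mode in ("positive", "negative")
]
```

Each list in `commands` could be passed to `subprocess.run` once the paths point at real files. Keeping the two polarities in separate folders is a design choice of this sketch, not a documented IDEOM requirement.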
Table 1
Description of worksheets in the IDEOM file
Sheet | Description
Settings | Home page and the starting point for all analysis. It contains many basic settings, help documentation and the main macro buttons for processing data. Default settings are suitable for 4.6 mm ZIC-pHILIC chromatography (with ammonium carbonate/ACN, 0.3 mL/min) coupled to the Q-Exactive Orbitrap
MZmatch | Blank sheet to allow import of the peak list from mzMatch (or other tools)
Alldata | Information about every peak set in the peak list is written to this sheet
Rejected | Peak sets with putative identification, but confidence below 5, are copied to this sheet. Most of these peaks are noise or artifacts, but users may wish to scan them manually for false rejections
Identification | All identified metabolites with confidence of 5 or above are copied to this sheet
allBasepeaks | All base peaks (from mzMatch related peaks function) are copied to this sheet unless the peaks are not significant in any group (less than blank)
Comparison | Comparative peak intensity data is placed here after running the “compare all sets” macro. Many functions for evaluation and visualization are available on this sheet
DB | This sheet contains the full metabolite database and associated metadata. If required, additional metabolites may be added to the bottom of the list, or additional property columns to the right of existing columns. Additional information may also be added to existing columns, but do not insert new columns between existing data
RTcalculator and fragments | These contain important tables of information required for the IDEOM macros. Advanced users may wish to update these tables with instrument-specific values
Targeted | This sheet allows a targeted analysis of specific metabolites
Tools | This sheet demonstrates the utilization of IDEOM’s user-defined Excel functions, suitable for manual calculations such as mass determinations
Method and samples | Extra sheets to allow uploading of experiment metadata
Fig. 4 Screenshot from the settings sheet with the raw data processing steps (blue buttons) and IDEOM data
processing steps (green buttons)
Table 2
List of automated mzMatch functions
3.5 IDEOM Processing
The main IDEOM data processing steps and associated parameters
are found on the settings sheet (Fig. 4). While most IDEOM
processing is automated, some study-specific input is required
from the user. Optimal results are achieved by clicking steps 1–9
(green buttons) and following the on-screen prompts. Steps 1 and
2 can be run in any order, but must both be completed before
running the main processing macro (step 3).
1. Import MZmatch data and enter grouping info.
(a) Use this function to import data from the mzMatchoutput.txt
file produced by mzMatch. (If you have already
entered data manually, or by the “import example data” or
“import Mzmine data” buttons, you may press cancel at
the “import file” screen to skip this step.)
(b) The second part of this function asks the user to enter
grouping information. “Autofill” can be used if the prefix
of the sample names refers to the grouping, otherwise
manually select groups using the “add” buttons.
(c) Set-Type needs to be selected for each group using the
drop-down lists (Table 3). Always set one group as “Treat-
ment” and another as “Control” to allow statistical
comparisons.
(d) If there are more than 15 sample groups, there is a second
tab with space for 15 more groups. IDEOM currently
supports a maximum of 30 groups (Fig. 5).
(e) The third part of this function plots average sample inten-
sities to allow a quick check of whether the data is consis-
tent. Internal (external) standards will also be plotted if
details have been entered in cells U2-AD2 of the settings
sheet.
(f) The fourth part gives the option of normalizing the data
either by TIC, median, or user-defined values (column R
on settings sheet). Normalization is not routinely recom-
mended for LCMS data due to nonlinear responses and
the unpredictability of ion-suppression.
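The normalization options mentioned in step 1(f) can be sketched as follows. This is an illustration under stated assumptions rather than IDEOM's macro: the "TIC" is approximated here by the sum of the listed peak intensities per sample, the function name `normalize_samples` is hypothetical, and the rescaling to the average scale factor is a choice made only to keep values on their original order of magnitude.

```python
from statistics import median


def normalize_samples(intensities, method="tic", factors=None):
    """Scale each sample so that all samples share a common TIC or median.

    intensities maps sample name -> list of peak intensities (same peak order).
    factors maps sample name -> user-defined scale value (method="user").
    """
    if method == "tic":
        scale = {s: sum(v) for s, v in intensities.items()}
    elif method == "median":
        scale = {s: median(v) for s, v in intensities.items()}
    elif method == "user":
        scale = dict(factors)
    else:
        raise ValueError(f"unknown method: {method}")
    target = sum(scale.values()) / len(scale)  # keep the overall magnitude
    return {s: [x * target / scale[s] for x in v] for s, v in intensities.items()}


# Hypothetical data: the second sample is a twofold dilution of the first.
data = {"ctrl_1": [100.0, 200.0, 700.0], "ctrl_2": [50.0, 100.0, 350.0]}
by_tic = normalize_samples(data, "tic")
by_median = normalize_samples(data, "median")
```

As the text cautions, such scaling assumes a linear response and is therefore not routinely recommended for LC-MS data.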
Table 3
Description of set types (groups) in the settings sheet
Set type (group) | Description
Blank | Solvent blank(s), used as the background reference for the groups significance filter
Control | Base group for intensity comparisons. Only one control can be set at any time. If you wish to compare results to multiple controls please use “Save As” to get different Excel files corresponding to each control when you run the “Compare All Sets” macro
Treatment | Initial 2-sample intensity comparison, and used for the QC: RSD if no QC is present. Only one treatment can be set at any time
Sample | Any real sample groups that are not selected as the initial “control” or “treatment” group should be set as “Sample.” Undefined groups are assumed to be “sample”
QC | This group is used for the RSD filter and is included in graphs of individual samples, but not in the comparison of results
Standards and Exclude | These groups are excluded from all functions, but the peak intensity data is retained in the data matrix on all sheets
Fig. 5 Screenshot of the window where group information can be entered while
importing the mzMatch data. Refer to Table 3 for description of Group Types
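How the “Autofill” option derives groups from sample-name prefixes is not specified in the text. One plausible reading, offered purely as an assumption, is that the prefix is whatever precedes a trailing replicate number. The sketch below (with the hypothetical helper `autofill_groups`) implements that reading and enforces the 30-group limit mentioned above.

```python
import re
from collections import defaultdict


def autofill_groups(sample_names, max_groups=30):
    """Group samples by the prefix that precedes a trailing replicate number."""
    groups = defaultdict(list)
    for name in sample_names:
        stem = name.rsplit(".", 1)[0]              # drop the file extension
        prefix = re.sub(r"[_\-]?\d+$", "", stem)   # drop the replicate suffix
        groups[prefix].append(name)
    if len(groups) > max_groups:
        raise ValueError(f"{len(groups)} groups exceed the limit of {max_groups}")
    return dict(groups)


# Hypothetical file names following a prefix_replicate convention.
names = ["control_1.mzXML", "control_2.mzXML", "ATV_1.mzXML",
         "ATV_2.mzXML", "blank_1.mzXML"]
groups = autofill_groups(names)
```

Each resulting group would still need a set type from Table 3 (for example, “Control” for `control` and “Treatment” for `ATV`).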
Table 4
List of RSD parameters for each group of replicates
RSD parameter | Description
Strict | If any group has RSD larger than this setting, the metabolite is considered unreliable and given a low confidence score (0.4)
Technical | If the QC group (or Treatment group in the absence of a QC group) has RSD larger than this setting, the metabolite is given a low confidence score (0.4)
Generous | No specific RSD filter is applied, although if no group has an RSD below the filter threshold then the peak is not significant and has confidence level of 0
Exclusions | Groups with mean intensity below the LOQ threshold are excluded from this test (as they often have higher RSD)
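The rules in Table 4 can be read as a small decision procedure. The sketch below follows the table literally. The function name `rsd_flag`, the threshold of 0.35, the LOQ of 50, and the example intensities are hypothetical, and the way IDEOM combines this flag with other confidence terms is not modeled.

```python
from statistics import mean, stdev


def rsd(values):
    """Relative standard deviation (SD / mean) of replicate intensities."""
    return stdev(values) / mean(values)


def rsd_flag(groups, mode, threshold, loq):
    """Return 0.4 (low confidence), 0 (not significant), or None (passes).

    groups maps group name -> replicate intensities for one peak.
    """
    # Exclusions: groups with mean intensity below the LOQ are not tested.
    tested = {g: rsd(v) for g, v in groups.items() if mean(v) >= loq}
    if mode == "strict":
        return 0.4 if any(r > threshold for r in tested.values()) else None
    if mode == "technical":
        ref = "QC" if "QC" in tested else "Treatment"
        return 0.4 if tested.get(ref, 0.0) > threshold else None
    if mode == "generous":
        return 0 if not any(r < threshold for r in tested.values()) else None
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical peak: reproducible in Control and QC, variable in Treatment.
peak = {
    "Control": [100, 105, 95, 100],
    "Treatment": [200, 400, 100, 300],
    "QC": [150, 152, 148, 150],
    "Blank": [1, 5, 2, 9],
}
strict = rsd_flag(peak, "strict", threshold=0.35, loq=50)
technical = rsd_flag(peak, "technical", threshold=0.35, loq=50)
generous = rsd_flag(peak, "generous", threshold=0.35, loq=50)
```

With these values the strict setting penalizes the peak because of the variable Treatment group, whereas the technical and generous settings let it pass.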
Table 5
Description of columns in Identification, Rejected, All data, and all Base peaks sheets
Column | Description
A | Neutral exact mass (from mzMatch)
B | Retention time (from mzMatch) in minutes
C | Formula from DB with closest match to mass (if within ppm window)
D | Number of isomers in DB with this exact formula
E | Metabolite name: Best match from DB for this mass and RT (bold type if there is a standard RT for this metabolite in the DB)
F | Confidence level (arbitrary out of 10) according to parameters on “settings” sheet
G | Records whether the metabolite is in a “preferred database” (from DB)
H | Map: The general area of metabolism for this metabolite (usually from KEGG). Note: Columns G and H can be changed by choosing a different header in cell G1 or H1
I | Mass error (in ppm) from nearest match in DB
J | RT error relative to authentic standard (name is bold) or predicted RT as % of RT (NB: Where no predicted RT exists in database the cell is colored red)
K | Alt ppm: Mass error for the next closest mass in the DB (if within ppm window). If font is red it is a double-charged match; if background is colored it is an adduct match. See settings D36:F38 for adduct colors
L | Groups: Records which sample groups have significant peaks detected for this metabolite (peaks > blanks, and RSD < RSDfilter)
M | BP: Basepeak (if identified) for that peak (from mzMatch: Largest coeluting peak with correlated peak shape and intensity trend across samples). NB: This does not update when other identifications are changed to a different isomer
N | Mzdiff: Mass difference between this peak and the basepeak
O | Relationship: Relationship to the basepeak (according to mzMatch)
P | Addfrag: Common adduct, isotope, fragment or neutral-loss, based on filters and formulas on the “fragments” sheet. Possible complex adducts of larger coeluting peaks (in duplicatepeaks window) are also annotated here. Note: a “complex adduct” may sometimes be the parent of two fragment ions
Q | % error of detected 13C-isotope intensity from the theoretical 13C-isotope intensity (note: This relies on filtered peaks, and does not go back to the raw data)
R | Related peaks: A list of common related peaks (isotopes, multicharged, adducts, fragments) that the macro has detected for this metabolite
S | RSD (relative standard deviation) for QC samples (or for treatment group if no QC is assigned)
T | Maximum RSD for all included sample groups
U | Maximum intensity from all included samples
V | Relation id (from mzMatch)
W | Peak intensity ratio for mean of “treatment” group vs. mean of “control” group
X | P-value for unpaired t-test between “treatments” and “controls”
Y | Adduct of formula match to mass (H, Na, double-charge, etc.)
Z | Polarity (in combined files the first-named polarity is that with the biggest peak. All sample intensities in combined file are taken from the polarity with the biggest peak for each metabolite)
AA | Number of detected peaks in included groups
Next # | Peak intensities from all samples present on the MZmatch sheet
Extra columns | Other functions such as “compare with medium,” “compare 2 other groups,” and “isotope search” will add extra columns. Users may add additional columns to the right without affecting macro performance
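Several of the numeric columns in Table 5 follow from simple formulas. The sketch below shows one way to compute the mass error (column I), the RT error (column J), and the 13C-isotope error (column Q). The function names are hypothetical, the RT error is assumed to be expressed relative to the expected RT, and the theoretical isotope ratio considers carbon only (natural 13C abundance of about 1.07%), which may differ from IDEOM's own calculation.

```python
def ppm_error(measured_mass, theoretical_mass):
    """Mass error in parts per million."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def rt_error_percent(measured_rt, expected_rt):
    """Retention time error as a percentage of the expected RT."""
    return (measured_rt - expected_rt) / expected_rt * 100.0


def c13_isotope_error_percent(m0_intensity, m1_intensity, n_carbons):
    """Percent error of the observed 13C peak versus the theoretical one.

    The theoretical M+1/M ratio is approximated from the natural 13C
    abundance alone, ignoring contributions from other elements.
    """
    theoretical_ratio = n_carbons * 0.0107 / (1 - 0.0107)
    observed_ratio = m1_intensity / m0_intensity
    return (observed_ratio - theoretical_ratio) / theoretical_ratio * 100.0


# Hypothetical values for a five-carbon metabolite.
mass_err = ppm_error(200.0004, 200.0000)
rt_err = rt_error_percent(8.4, 8.0)
iso_err = c13_isotope_error_percent(1_000_000, 54_079, n_carbons=5)
```

A large isotope error suggests that the assigned formula has the wrong number of carbons or that the isotope peak was affected by filtering, which is why the text recommends checking column Q during manual curation.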
Table 6
Description of columns in comparisons sheet
Column | Description
A | Neutral exact mass (from mzMatch)
B | Retention time (from mzMatch) in minutes
C | Formula from DB with closest match to mass (if within ppm window)
D | Number of isomers in DB with this exact formula
E | Metabolite name: Best match from DB for this mass and RT (bold type if there is a standard RT for this metabolite in the DB)
F | Confidence level (arbitrary out of 10) according to parameters on “settings” sheet
G | Map: The general area of metabolism for this metabolite (usually from KEGG)
H | Pathway: List of biochemical pathways for this metabolite (usually from KEGG)
I | Max intensity: Maximum LCMS intensity in any sample for that metabolite. Note: Columns G, H, and I can be changed by choosing a different header in cell G1, H1, or I1
J | Mean intensity of each included group relative to the “control” group (as set when the “comparison” macro was run). Significant (t-test) values are in bold
Next # | P-values for unpaired t-test between each included group and the control
Next # | Mean intensity for each included group
Next # | Standard deviation for each included group
Next # | Relative standard deviation for each included group
Next # | Fisher ratio for each included group, relative to the control group
Additional | Additional columns are added every time you sort by correlation of intensity trends relative to a specific metabolite. You may add your own additional columns to the right of existing data without affecting performance. Please do not insert columns between existing data
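The per-group summary columns listed in Table 6 can be reproduced for a single metabolite with a few lines of code. This is a sketch under assumptions: the helper name `compare_to_control` and the intensities are hypothetical, and the Fisher ratio is given in its common two-class form, (difference of means)² / (sum of variances), which may not match IDEOM's exact definition.

```python
from statistics import mean, stdev, variance


def compare_to_control(group, control):
    """Summarize one group against the control for a single metabolite."""
    m, s = mean(group), stdev(group)
    return {
        "relative_intensity": m / mean(control),
        "mean": m,
        "sd": s,
        "rsd": s / m,
        # Common two-class Fisher ratio (assumed form, see lead-in).
        "fisher_ratio": (m - mean(control)) ** 2
        / (variance(group) + variance(control)),
    }


# Hypothetical peak intensities (four replicates per group).
control = [100.0, 110.0, 90.0, 100.0]
treated = [210.0, 190.0, 220.0, 180.0]
summary = compare_to_control(treated, control)
```

A relative intensity of 2 with a high Fisher ratio indicates a change that is large compared with the replicate scatter, which is the kind of feature that the comparison sheet is designed to surface.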
Table 7
List of shortcuts obtained by double-clicking in columns of Identification, Rejected and all Base
peaks sheets
Table 8
List of shortcuts obtained by double-clicking in columns of Comparison sheet
recalibrate masses. If the curve is not a good fit, but you see a
trend, consider manual recalibration efforts. After calibration,
check the new plot of mass errors, and set a new ppm window
to remove outliers (false-identifications). In some cases, it is
worth checking the rejected peaks (bottom of “rejected” list)
for alternative identifications by clicking the “altppm” column.
If significant adjustment was made (e.g., >2 ppm) consider
rerunning analysis from step 3 (with the recalibrated data) to
improve identification.
6. Manually check related peaks and isomers (optional).
For thorough analysis, check all the information supplied
for each peak in the “identification” sheet. Another common
approach is to skip this step initially, and return later to double-
check specific metabolites of interest. While it is always a good
idea to return to raw data for confirmation of specific metabo-
lites, the identification sheet allows rapid access to a large
amount of meta-information to simplify the process of manual
data curation and metabolite identification.
(a) Isomers: Sort by the Formula column and remove dupli-
cates that appear to be due to poor chromatography (click
a cell in the RT column (B:B) for a plot of relative inten-
sities and retention times of all isomeric peaks).
(b) Click the light-blue “add chromatograms” button to add
cropped peak chromatogram images for each putative
metabolite (click or hover in column A to view).
(c) Related Peaks: Click the orange “Sort By Relation ID”
button and look in columns M:P for ESI artifacts that
were not filtered by the common adducts search (e.g.,
metabolite-specific fragments). Double-click in the mass
(column A) to see all coeluting peaks—red masses are
“related” according to mzMatch.
(d) Also look at “C isotope error” (column Q) to see if the
main isotope intensity does not match the formula,
Related Peaks (column R) to see if all the fragments and
isotopes are possible from the identified structure, and
“adduct” (column Y) to check whether the identified
adduct (e.g., 2+, Na) is likely. (The “charge” column at
the end may help with this if a 13C isotope was present, but it is
sometimes not correct in noisy data.)
(e) Alternative identification: Easily change identification by
clicking the metabolite name (column E), then select from
the isomers in the dropdown list. Further information
about these isomers can be obtained by double-clicking
the isomer number (column D). Alternative formulas can
be investigated for individual masses by double-clicking
the formula (column C).
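Step 6(d) notes that a charge estimate is possible when a 13C isotope peak is present. The underlying reasoning is that isotope peaks are spaced by about 1.003355/z in m/z. The sketch below illustrates that principle only; how IDEOM fills its “charge” column is not described in the text, and the function name, tolerance, and maximum charge are hypothetical.

```python
C13_MINUS_C12 = 1.003355  # Da, mass difference between 13C and 12C


def charge_from_isotope(mono_mz, isotope_mz, tolerance=0.005, max_charge=4):
    """Infer the charge state from the m/z spacing of the 13C isotope peak.

    Returns the charge (1..max_charge), or None if no charge fits the spacing.
    """
    spacing = isotope_mz - mono_mz
    for z in range(1, max_charge + 1):
        if abs(spacing - C13_MINUS_C12 / z) <= tolerance:
            return z
    return None


# Hypothetical peak pairs.
single = charge_from_isotope(300.0000, 301.0034)   # spacing ~1.003
double = charge_from_isotope(300.0000, 300.5017)   # spacing ~0.502
unknown = charge_from_isotope(300.0000, 300.8000)  # no consistent charge
```

In noisy data an unrelated coeluting peak can fall within the tolerance, which is consistent with the caution that the charge column is sometimes not correct.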
Table 9
Additional functions are provided which may be used in any cell just like the native Excel functions
4.1 Targeted Analysis
Targeted analysis for specific metabolites can be undertaken from
the “Targeted” sheet by following the process indicated by the
buttons at the top of the page.
1. Enter the Metabolite names in Column A.
2. Click step 2 to upload information for each metabolite from
the metabolite database (DB sheet). For unique metabolites
and/or masses enter the RT, formula and/or mass directly.
3. This step uses msconvert/mzMatch (through R) to process
either raw, mzXML or peakML files and filters the results
according to the specified masses. Parameters are taken from
the Settings sheet.
4. The results from step 3 are searched to return results for the
specific metabolites to the targeted page. Alternatively, any text
file from mzMatch, or the data on the mzMatch sheet of
IDEOM, can be used for this step. If multiple peaks are present
it first takes the most intense peak within the RT window
setting for standard RT (settings sheet; cell E23), then the
most intense peak within the RT window setting for calculated
RT (settings sheet; cell E24). Metabolites with no expected RT
are assigned the peak with the largest intensity.
5. The targeted analysis is commonly used to obtain retention
times for authentic standards, and hence there is an option to
export these results to the RT calculator which is used in the
regular untargeted IDEOM processing method.
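The peak selection order described in step 4 can be expressed as a short function. This is an illustrative sketch, not the IDEOM macro: the name `pick_target_peak` is hypothetical, the RT windows are assumed to be fractions of the expected RT, and the default window values stand in for whatever is stored in cells E23 and E24.

```python
def pick_target_peak(peaks, standard_rt=None, calculated_rt=None,
                     standard_window=0.05, calculated_window=0.5):
    """Choose one peak for a target following the order described in step 4.

    peaks is a list of (rt_minutes, intensity) tuples that already match the
    target mass. Returns the chosen tuple, or None if nothing qualifies.
    """
    def best_within(expected, window):
        hits = [p for p in peaks if abs(p[0] - expected) <= window * expected]
        return max(hits, key=lambda p: p[1]) if hits else None

    # 1. Most intense peak within the window around the standard RT.
    if standard_rt is not None:
        peak = best_within(standard_rt, standard_window)
        if peak:
            return peak
    # 2. Otherwise, most intense peak within the calculated-RT window.
    if calculated_rt is not None:
        peak = best_within(calculated_rt, calculated_window)
        if peak:
            return peak
    # 3. With no expected RT at all, take the most intense peak.
    if standard_rt is None and calculated_rt is None and peaks:
        return max(peaks, key=lambda p: p[1])
    return None


# Hypothetical candidate peaks for one target mass.
candidates = [(5.0, 1e5), (7.9, 4e4), (8.1, 9e4), (12.0, 2e5)]
with_standard = pick_target_peak(candidates, standard_rt=8.0)
with_predicted = pick_target_peak(candidates, calculated_rt=11.0)
without_rt = pick_target_peak(candidates)
```

The narrower default for the standard RT reflects the expectation that measured standards are more reliable than predicted retention times.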
4.2 Isotope Search
This tool provides an untargeted search for labeled metabolites in
data that was obtained using stable-isotope labeled precursors. Run
this from the “tools” menu in “comparison,” “allbasepeaks” or
“Identification” sheets to search for isotopes of putatively identified
compounds. The isotope search macro looks within the RT win-
dow for related peaks. The isotope search result is the relative
abundance, that is, the ratio of the isotope peak to the unlabeled
peak in each specified sample.
The result is shaded yellow if the relative abundance is more
than 10% greater than the expected abundance of the natural
isotope of the unlabeled metabolite. This macro will only find
isotopes if they were imported to IDEOM from mzMatch. In
some cases, isotopes are missed because the peaks were not detected
by XCMS. To conduct a more comprehensive analysis of isotopes in
raw data use the “Targeted Isotopes” macro in the “export” menu.
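The quantity reported by the isotope search, and the rule for highlighting it, can be sketched as below. The function names are hypothetical, and the “more than 10% greater” criterion is interpreted here as an absolute difference of 0.10 in relative abundance. A relative interpretation is equally plausible from the text.

```python
def isotope_relative_abundance(isotope_intensity, unlabeled_intensity):
    """Ratio of the isotope peak to the unlabeled peak in one sample."""
    return isotope_intensity / unlabeled_intensity


def is_enriched(relative_abundance, expected_natural_abundance, margin=0.10):
    """True when the ratio exceeds natural abundance by more than the margin.

    The margin is treated as an absolute difference in relative abundance
    (an assumption; see the lead-in).
    """
    return relative_abundance > expected_natural_abundance + margin


# Hypothetical five-carbon metabolite: natural M+1/M ratio of about 0.054.
natural = 0.054
labeled_sample = isotope_relative_abundance(3.0e5, 1.0e6)
unlabeled_sample = isotope_relative_abundance(6.0e4, 1.0e6)
flags = (is_enriched(labeled_sample, natural),
         is_enriched(unlabeled_sample, natural))
```

Only the first sample would be highlighted, since its isotope peak is far above what natural 13C abundance alone would produce.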
References
1. Team RC (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
2. Scheltema RA, Jankevics A, Jansen RC, Swertz MA, Breitling R (2011) PeakML/mzMatch: a file format, Java library, R library, and tool-chain for mass spectrometry data analysis. Anal Chem 83:2786–2793
3. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78:779–787
4. Moffat JG, Vincent F, Lee JA, Eder J, Prunotto M (2017) Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nat Rev Drug Discov 16:531–543
5. Gamo FJ, Sanz LM, Vidal J, de Cozar C, Alvarez E, Lavandera JL, Vanderwall DE, Green DV, Kumar V, Hasan S et al (2010) Thousands of chemical starting points for antimalarial lead identification. Nature 465:305–310
6. Hovlid ML, Winzeler EA (2016) Phenotypic screens in antimalarial drug discovery. Trends Parasitol 32:697–707
7. Creek DJ, Chua HH, Cobbold SA, Nijagal B, Macrae JI, Dickerman BK, Gilson PR, Ralph SA, McConville MJ (2016) Metabolomics-based screening of the malaria box reveals both novel and established mechanisms of action. Antimicrob Agents Chemother 60(11):6650–6663
8. Allman EL, Painter HJ, Samra J, Carrasquilla M, Llinas M (2016) Metabolomic profiling of the malaria box reveals antimalarial target pathways. Antimicrob Agents Chemother 60:6635–6649
9. Kwon YK, Lu W, Melamud E, Khanam N, Bognar A, Rabinowitz JD (2008) A domino effect in antifolate drug action in Escherichia coli. Nat Chem Biol 4:602–608
10. Vincent IM, Creek DJ, Burgess K, Woods DJ, Burchmore RJ, Barrett MP (2012) Untargeted metabolomics reveals a lack of synergy between nifurtimox and eflornithine against Trypanosoma brucei. PLoS Negl Trop Dis 6:e1618
11. Zampieri M, Szappanos B, Buchieri MV, Trauner A, Piazza I, Picotti P, Gagneux S, Borrell S, Gicquel B, Lelievre J et al (2018) High-throughput metabolomic analysis predicts mode of action of uncharacterized antimicrobial compounds. Sci Transl Med 10:eaal3973
12. Spangenberg T, Burrows JN, Kowalczyk P, McDonald S, Wells TN, Willis P (2013) The open access malaria box: a drug discovery catalyst for neglected diseases. PLoS One 8:e62906
13. Trager W, Jensen J (1976) Human malaria parasites in continuous culture. Science 193:673–675
14. Chambers MC, Maclean B, Burke R, Amodei D, Ruderman DL, Neumann S, Gatto L, Fischer B, Pratt B, Egertson J et al (2012) A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol 30:918–920
15. Tautenhahn R, Bottcher C, Neumann S (2008) Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 9:504
16. Creek DJ, Jankevics A, Burgess KE, Breitling R, Barrett MP (2012) IDEOM: an excel interface for analysis of LC-MS-based metabolomics data. Bioinformatics 28:1048–1049
17. Sansone SA, Fan T, Goodacre R, Griffin JL, Hardy NW, Kaddurah-Daouk R, Kristal BS, Lindon J, Mendes P, Morrison N et al (2007) The metabolomics standards initiative. Nat Biotechnol 25:846–848
18. De Livera AM, Dias DA, De Souza D, Rupasinghe T, Pyke J, Tull D, Roessner U, McConville M, Speed TP (2012) Normalizing and integrating metabolomics data. Anal Chem 84:10768–10776
19. Biagini GA, Fisher N, Shone AE, Mubaraki MA, Srivastava A, Hill A, Antoine T, Warman AJ, Davies J, Pidathala C et al (2012) Generation of quinolone antimalarials targeting the Plasmodium falciparum mitochondrial respiratory chain for the treatment and prophylaxis of malaria. Proc Natl Acad Sci U S A 109:8298–8303
20. Ganesan SM, Morrisey JM, Ke H, Painter HJ, Laroiya K, Phillips MA, Rathod PK, Mather MW, Vaidya AB (2011) Yeast dihydroorotate dehydrogenase as a new selectable marker for Plasmodium falciparum transfection. Mol Biochem Parasitol 177:29–34
21. Cobbold SA, Chua HH, Nijagal B, Creek DJ, Ralph SA, McConville MJ (2016) Metabolic dysregulation induced in Plasmodium falciparum by dihydroartemisinin and other front-line antimalarial drugs. J Infect Dis 213:276–286
22. Creek DJ, Jankevics A, Breitling R, Watson DG, Barrett MP, Burgess KE (2011) Toward global metabolomics analysis with hydrophilic interaction liquid chromatography-mass spectrometry: improved metabolite identification by retention time prediction. Anal Chem 83:8703–8710
23. Kind T, Fiehn O (2007) Seven Golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics 8:105
Chapter 22
Abstract
The exposome is the cumulative measure of environmental influences and associated biological responses
across the life span, with critical relevance for understanding how exposures can impact human health.
Metabolomics analysis of biological samples offers unique advantages for examining the exposome. Simul-
taneous analysis of external exposures, biological responses, and host susceptibility at a systems level can
help establish links between external exposures and health outcomes. As metabolomics technologies
continue to evolve for the study of the exposome, metabolomics ultimately will help provide valuable
insights for exposure risk assessment, and disease prevention and management. Here, we discuss recent
advances in metabolomics, and describe data processing protocols that can enable analysis of the exposome.
This chapter focuses on using liquid chromatography–mass spectrometry (LC-MS)-based untargeted
metabolomics for analysis of the exposome, including (1) preprocessing of untargeted metabolomics
data, (2) identification of exposure chemicals and their metabolites, and (3) methods to establish associa-
tions between exposures and diseases.
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_22, © Springer Science+Business Media, LLC, part of Springer Nature 2020
448 Yuping Cai et al.
studies over the past 50 years found that heritability across all
human traits is 49% [10]. These results indicate that nongenetic
influences such as environmental and lifestyle risk factors could be
related to different phenotypes and underlie the etiology of many
diseases.
The importance of the environment as a key determinant of
health, both separate from and in interaction with genetics, has
motivated large-scale collaborative efforts to develop methods for
comprehensively measuring environmental exposures [11–13].
While particulate matter (ambient particles and household
smoke), smoking, nutrition (diets high in cholesterol, salt, and
sugar), and occupational exposures are linked to about 50% of
deaths globally, there are many diseases for which the exposure
risk factors remain unknown [14]. In addition, the nature of these
risks is different for each person due to individual susceptibility,
which includes host factors such as differences in metabolism,
microbiome and endogenous hormones, and social determinants
of health like socioeconomic status or neighborhood. All of these
factors fluctuate over the life course, with susceptibility increasing
during critical windows, such as in pregnancy or times of ill health
[15], making it difficult to attribute a specific exposure as causal of a
particular outcome. The exposome, or “the cumulative measure of
environmental influences and associated responses throughout the
lifespan,” encompasses these diverse and changing external and
internal exposures [16]. The exposome concept was developed in
part to address the need for more comprehensive exposure assess-
ment methods, and to provide a framework for the measurement
and analysis of an individual’s exposure portfolio over an entire
lifetime [16–19].
Developing methods for measuring the exposome remains a
challenge given the difficulty of cataloguing an individual’s expo-
some at even a single point in time. High-throughput methods
enable a systems-level evaluation of genes (genomics), gene expres-
sion (transcriptomics), proteins (proteomics), and metabolites
(metabolomics), offering a comprehensive view of the biological
changes that can occur in an individual after a specific exposure.
Multi-omics approaches which combine these data can further help
identify downstream chemical changes that contribute to an expo-
sure phenotype or exposotype, or the accrued biological changes
within a system that has undergone an exposure event (Fig. 1).
Exposures can thus be linked to disease end points through
identification of associated biomolecular changes, or biomarker
measurements [20]. Compared to other -omics technologies,
metabolomics is advantageous for studying the exposome because
the metabolome profiles both the exposure and the biological
response. Peaks from environmental exposures are routinely
observed in untargeted mass spectrometry data. Indeed, the
increased incorporation of metabolomics into exposure research
Metabolomics for Environmental Health and Exposome Research 449
Fig. 1 Exposotype, the exposure phenotype. (a) The environmental health paradigm: exposures modify
biological molecules and lead to disease. (b) Possible actions of environmental exposures on DNA, proteins,
and metabolites. Please note that metabolites are not in the original Central Dogma of Molecular Biology, but
informative of biological functions at both cellular and organismal levels. (Figure reproduced from [19])
2 Methods
2.1 Metabolomic Technologies for Exposome Research
2.1.1 High-Resolution Metabolomics (HRM)
Choosing the most suitable analytical platform is a key foundation
for exposome research. Developments in high-resolution metabolomics
(HRM) have accelerated exposome research due to their
improved ability to profile a broad range of endogenous metabolites
and environmental chemicals simultaneously. Modern time-of-flight
(ToF) and Fourier-transform (FT) mass spectrometers confer
high resolution, high accuracy, and provide the most sensitive
analysis for untargeted metabolomics methodologies [22]. Benefiting
from a high scan speed, ToF mass spectrometers are the most
commonly used instruments for high-throughput metabolite
profiling. FT mass spectrometers such as the Orbitrap™, however,
have high mass resolution and mass accuracy which are not affected
by low ion abundance. Therefore, they are superior for metabolite
quantification, especially for environmental chemicals with a low
abundance and a wide dynamic range in biological samples
[33]. The data output from these HRM platforms contains a wealth
of information regarding enriched metabolic pathways which can
facilitate the discovery of relationships between environmental
exposures and adverse health outcomes.
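The mass accuracy and resolution discussed above are usually expressed as a mass error in parts per million (ppm) and as a resolving power (m/Δm). The following minimal sketch shows the arithmetic only. The function names, the 5 ppm default tolerance, and the example m/z values are illustrative assumptions and do not come from the chapter or from any instrument software.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of a measured peak, in parts per million (ppm)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def matches(observed_mz: float, theoretical_mz: float, tol_ppm: float = 5.0) -> bool:
    """True if the peak lies within +/- tol_ppm of the theoretical m/z.

    The 5 ppm default is an illustrative assumption, not a value from the text.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def resolving_power(mz: float, fwhm: float) -> float:
    """Resolving power m / delta-m, with delta-m taken as the peak width (FWHM)."""
    return mz / fwhm


if __name__ == "__main__":
    # Hypothetical numbers chosen only to make the arithmetic easy to follow.
    print(ppm_error(300.0009, 300.0000))   # small positive error, in ppm
    print(matches(300.0009, 300.0000))     # inside the assumed 5 ppm window
    print(matches(300.0030, 300.0000))     # outside the assumed 5 ppm window
    print(resolving_power(200.0, 0.002))   # peak 0.002 m/z wide at m/z 200
```

A tighter ppm window reduces the number of candidate formulas per peak, which is one practical reason high mass accuracy matters when endogenous metabolites and environmental chemicals are profiled together.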
452 Yuping Cai et al.
2.1.2 Measurement of Low Level of Exposures
The wide dynamic range of chemical compounds present in human samples is a major challenge for accurate quantification, which is essential for distinguishing biological differences between groups. Typically, blood concentrations of environmental chemicals (fmol to μmol) are 1000 times lower than those of endogenous metabolites (nmol to mmol) and of metabolites from nutritional or pharmaceutical drug sources [34]. Biotransformed environmental chemicals produced through endogenous enzymatic reactions may also be present at very low levels. Electrospray ionization (ESI) coupled with LC is the most commonly applied technique in MS-based metabolomics research [35]. However, quantifying low-abundance environmental chemicals such as polychlorinated biphenyls [36], polycyclic aromatic hydrocarbons [37], and polybrominated diphenyl ethers [38, 39] requires other ionization sources because of their hydrophobic chemical structures. Gas chromatography (GC) with electron impact (EI) ionization is routinely used for the analysis of these common environmental pollutants [40–42]. LC coupled with soft ionization approaches such as atmospheric pressure chemical ionization (APCI) and atmospheric pressure photoionization (APPI) can also readily measure these low-level hydrophobic chemicals [43]. In addition, APPI uses a charge carrier (acetone or toluene), which increases sensitivity. Ion mobility LC-MS analysis has shown promise for the characterization of structurally similar environmental chemicals, owing to increased resolving power and improved signal-to-noise ratio [43, 44].
Table 1
Databases of environmental metabolites for exposome research
Database | Number of exposures or metabolites | Content | URL
METLIN Exposome database | >700,000 | Chemical information, MS/MS spectra | https://metlin.scripps.edu/
Comparative Toxicogenomics Database (CTD) | 130,796 | Chemical–gene/protein interactions, chemical/gene–disease relationships | http://ctdbase.org/
DrugBank | 11,926 | Drugs, drug targets, MS/MS spectra | https://www.drugbank.ca/
Hazardous Substances Data Bank (HSDB) | 6016 | Chemical information, environmental fate, toxicity | https://toxnet.nlm.nih.gov/newtoxnet/hsdb.htm
Toxin Exposome Database (T3DB) | 3678 | Both toxin and toxin target, gene expression data, MS/MS spectra | http://www.t3db.ca/
Exposome-Explorer | 876 | Dietary and pollutant biomarkers, with concentration and correlation values | http://exposome-explorer.iarc.fr/
2.2.3 Analyzing Associations Between the Exposome and Health Outcomes
Across the life course, an individual will encounter a myriad of environmental exposures that can influence human health, either individually or synergistically. A notable challenge in implementing the exposome is understanding the direct relationship between one or more exposures and a disease outcome, and identifying the factors that link environmental chemicals to adverse health outcomes. From a data analysis perspective, the core data processing step for the exposome is establishing the relationship between the exposure and the health outcome.
Conventional biomonitoring of environmental exposures tar-
gets the measurement of a limited number of chemicals in biospeci-
mens such as blood and urine. Establishing the effects of an
exposure on human health is therefore restricted to a relatively
small exposure panel. Traditional environmental epidemiology
reflects this biomonitoring approach, as it assesses the association
between a single exposure and a single response. However, these
methods can result in biased and fragmentary pictures of environ-
ment–health relationships, since environmental exposures often
occur as mixtures and change frequently [29]. The exposome
more realistically represents the many exposures an individual will
encounter throughout life, but the enormous quantity of data
collected in exposome research presents challenges for analysis
and interpretation.
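One routine consequence of this data volume is that many exposure–outcome tests are run at once, so raw p-values need adjusting before they are interpreted. The chapter does not prescribe a particular correction here. The sketch below uses the Benjamini–Hochberg false discovery rate procedure as one common choice, and both the function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Each raw p-value is scaled by m / rank (rank 1 = smallest p-value), and a
    running minimum taken from the largest rank downwards keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from four exposure-outcome tests.
    raw = [0.5, 0.001, 0.03, 0.02]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:<6} adjusted = {q:.3f}")
```

In practice the raw p-values would come from a per-feature model (for example, a regression of the outcome on each exposure with covariates), which is outside the scope of this sketch.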
To analyze the thousands of features typical of LC-MS experi-
ments, metabolomics-based exposome research often employs
agnostic or hypothesis-free statistical methods, examining data for
patterns and correlations between exposures/metabolites and out-
comes [30]. These findings can later be validated through targeted
methods to describe potential mechanisms. Proponents of agnostic
methods see such techniques as allowing previously unknown rela-
tionships to emerge given the complexity of exposures and out-
comes, while critics warn such approaches have high chances of
Table 2
Common informatics tools freely available as R packages for analyzing associations between the
exposome and health outcomes
sharing for multiple data types is another critical aspect for consideration. At present, domain-specific data repositories are available for data sharing, including repositories for EWAS data sets (https://nhanes.hms.harvard.edu/transmart/datasetExplorer/index), metabolomics data (Metabolomics Workbench, https://www.metabolomicsworkbench.org; MetaboLights, https://www.ebi.ac.uk/metabolights/), and genomics data (Gene Expression Omnibus, https://www.ncbi.nlm.nih.gov/geo/). Future directions for multi-omics data sharing could include platforms integrating datasets on specific health outcomes.
References
1. Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, Sato H, Sato H, Hori M, Nakamura Y, Tanaka T (2002) Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet 32(4):650–654
2. Collins FS, Lander ES, Rogers J, Waterston RH, Conso IHGS (2004) Finishing the euchromatic sequence of the human genome. Nature 431(7011):931–945
3. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, McMahon A, Morales J, Mountjoy E, Sollis E, Suveges D, Vrousgou O, Whetzel PL, Amode R, Guillen JA, Riat HS, Trevanion SJ, Hall P, Junkins H, Flicek P, Burdett T, Hindorff LA, Cunningham F, Parkinson H (2019) The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 47(D1):D1005–D1012
4. Eyre S, Worthington J (2014) Take your PICS: moving from GWAS to immune function. Immunity 41(6):883–885
5. Cuzick J, Brentnall A, Dowsett M (2017) SNPs for breast cancer risk assessment. Oncotarget 8(59):99211–99212
6. Yao L, Tak YG, Berman BP, Farnham PJ (2014) Functional annotation of colon cancer risk SNPs. Nat Commun 5:5114
7. Need AC, Ge D, Weale ME, Maia J, Feng S, Heinzen EL, Shianna KV, Yoon W, Kasperaviciute D, Gennarelli M, Strittmatter WJ, Bonvicini C, Rossi G, Jayathilake K, Cola PA, McEvoy JP, Keefe RS, Fisher EM, St Jean PL, Giegling I, Hartmann AM, Moller HJ, Ruppert A, Fraser G, Crombie C, Middleton LT, St Clair D, Roses AD, Muglia P, Francks C, Rujescu D, Meltzer HY, Goldstein DB (2009) A genome-wide investigation of SNPs and CNVs in schizophrenia. PLoS Genet 5(2):e1000373
8. Reddy MVPL, Wang H, Liu S, Bode B, Reed JC, Steed RD, Anderson SW, Steed L, Hopkins D, She JX (2011) Association between type 1 diabetes and GWAS SNPs in the southeast US Caucasian population. Genes Immun 12(3):208–212
9. Willer CJ, Bonnycastle LL, Conneely KN, Duren WL, Jackson AU, Scott LJ, Narisu N, Chines PS, Skol A, Stringham HM, Petrie J, Erdos MR, Swift AJ, Enloe ST, Sprau AG, Smith E, Tong M, Doheny KF, Pugh EW, Watanabe RM, Buchanan TA, Valle TT, Bergman RN, Tuomilehto J, Mohlke KL, Collins FS, Boehnke M (2007) Screening of 134 single nucleotide polymorphisms (SNPs) previously associated with type 2 diabetes replicates association with 12 SNPs in nine genes. Diabetes 56(1):256–264
10. Polderman TJC, Benyamin B, de Leeuw CA, Sullivan PF, van Bochoven A, Visscher PM, Posthuma D (2015) Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat Genet 47(7):702
11. Vrijheid M, Slama R, Robinson O, Chatzi L, Coen M, van den Hazel P, Thomsen C, Wright J, Athersuch TJ, Avellana N, Basagana X, Brochot C, Bucchini L, Bustamante M, Carracedo A, Casas M, Estivill X, Fairley L, van Gent D, Gonzalez JR, Granum B, Grazuleviciene R, Gutzkow KB, Julvez J, Keun HC, Kogevinas M, McEachan RRC, Meltzer HM, Sabido E, Schwarze PE, Siroux V, Sunyer J, Want EJ, Zeman F, Nieuwenhuijsen MJ (2014) The human early-life exposome (HELIX): project rationale and design. Environ Health Perspect 122(6):535–544
12. Kawamoto T, Nitta H, Murata K, Toda E, Tsukamoto N, Hasegawa M, Yamagata Z, Kayama F, Kishi R, Ohya Y, Saito H, Sago H, Okuyama M, Ogata T, Yokoya S, Koresawa Y, Shibata Y, Nakayama S, Michikawa T, Takeuchi A, Satoh H, Ch WGER (2014) Rationale and study design of the Japan environment and children’s study (JECS). BMC Public Health 14:25
13. Vineis P, Chadeau-Hyam M, Gmuender H, Gulliver J, Herceg Z, Kleinjans J, Kogevinas M, Kyrtopoulos S, Nieuwenhuijsen M, Phillips DH, Probst-Hensch N, Scalbert A, Vermeulen R, Wild CP, Consortium E (2017) The exposome in practice: design of the EXPOsOMICS project. Int J Hyg Environ Health 220(2):142–151
14. Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A (2014) The blood exposome and its role in discovering causes of disease. Environ Health Perspect 122(8):769–774
15. Wild CP (2012) The exposome: from concept to utility. Int J Epidemiol 41(1):24–32
16. Miller GW, Jones DP (2014) The nature of nurture: refining the definition of the exposome. Toxicol Sci 137(1):1
17. Wild CP (2005) Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomark Prev 14(8):1847–1850
18. Louis GMB, Sundaram R (2012) Exposome: time for transformative research. Stat Med 31(22):2569–2575
19. Rattray NJW, Deziel NC, Wallach JD, Khan SA, Vasiliou V, Ioannidis JPA, Johnson CH (2018) Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics 12(1):4
20. Steinberg CEW, Sturzenbaum SR, Menzel R (2008) Genes and environment - striking the fine balance between sophisticated biomonitoring and true functional environmental genomics. Sci Total Environ 400(1–3):142–161
21. Nicholson JK, Lindon JC, Holmes E (1999) Metabonomics: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 29(11):1181–1189
22. Patti GJ, Yanes O, Siuzdak G (2012) Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol 13(4):263–269
23. Ellis JK, Athersuch TJ, Thomas LDK, Teichert F, Perez-Trujillo M, Svendsen C, Spurgeon DJ, Singh R, Jarup L, Bundy JG, Keun HC (2012) Metabolic profiling detects early effects of environmental and lifestyle exposure to cadmium in a human population. BMC Med 10:61
24. Maitre L, Villanueva CM, Lewis MR, Ibarluzea J, Santa-Marina L, Vrijheid M, Sunyer J, Coen M, Toledano MB (2016) Maternal urinary metabolic signatures of fetal growth and associated clinical and environmental factors in the INMA study. BMC Med 14:177
25. Baker MG, Simpson CD, Lin YS, Shireman LM, Seixas N (2017) The use of metabolomics to identify biological signatures of manganese exposure. Ann Work Expo Health 61(4):406–415
26. Johnson CH, Athersuch TJ, Collman GW, Dhungana S, Grant DF, Jones DP, Patel CJ, Vasiliou V (2017) Yale school of public health symposium on lifetime exposures and human health: the exposome; summary and future reflections. Hum Genomics 11:32
27. Ivanisevic J, Zhu ZJ, Plate L, Tautenhahn R, Chen S, O’Brien PJ, Johnson CH, Marletta MA, Patti GJ, Siuzdak G (2013) Toward Omic scale metabolite profiling: a dual separation-mass spectrometry approach for coverage of lipid and central carbon metabolism. Anal Chem 85(14):6876–6884
28. Buck Louis GM, Smarr MM, Patel CJ (2017) The Exposome research paradigm: an opportunity to understand the environmental basis for human health and disease. Curr Environ Health Rep 4(1):89–98
29. Stingone JA, Louis GMB, Nakayama SF, Vermeulen RCH, Kwok RK, Cui YX, Balshaw DM, Teitelbaum SL (2017) Toward greater implementation of the Exposome research paradigm within environmental epidemiology. Annu Rev Public Health 38(38):315–327
30. Robinson O, Basagana X, Agier L, de Castro M, Hernandez-Ferrer C, Gonzalez JR, Grimalt JO, Nieuwenhuijsen M, Sunyer J, Slama R, Vrijheid M (2015) The pregnancy Exposome: multiple environmental exposures in the INMA-Sabadell birth cohort. Environ Sci Technol 49(17):10632–10641
31. Chung MK, Kannan K, Louis GM, Patel CJ (2018) Toward capturing the Exposome: exposure biomarker variability and Coexposure patterns in the shared environment. Environ Sci Technol 52(15):8801–8810
32. Rappaport SM (2016) Genetic factors are not the major causes of chronic diseases. PLoS One 11(4):e0154387
33. Go YM, Walker DI, Liang YL, Uppal K, Soltow QA, Tran V, Strobel F, Quyyumi AA, Ziegler TR, Pennell KD, Miller GW, Jones DP (2015) Reference standardization for mass spectrometry and high-resolution metabolomics applications to Exposome research. Toxicol Sci 148(2):531–543
34. Dennis KK, Marder E, Balshaw DM, Cui YX, Lynes MA, Patti GJ, Rappaport SM, Shaughnessy DT, Vrijheid M, Barr DB (2017) Biomonitoring in the era of the Exposome. Environ Health Perspect 125(4):502–510
35. Lei ZT, Huhman DV, Sumner LW (2011) Mass spectrometry strategies in metabolomics. J Biol Chem 286(29):25435–25442
36. Ulbrich B, Stahlmann R (2004) Developmental toxicity of polychlorinated biphenyls (PCBs): a systematic review of experimental data. Arch Toxicol 78(5):252–268
37. Balcioglu EB (2016) Potential effects of polycyclic aromatic hydrocarbons (PAHs) in marine foods on human health: a critical review. Toxin Rev 35(3–4):98–105
38. Frederiksen M, Vorkamp K, Thomsen M, Knudsen LE (2009) Human internal and external exposure to PBDEs – a review of levels and sources. Int J Hyg Environ Health 212(2):109–134
39. Herbstman JB, Sjodin A, Kurzon M, Lederman SA, Jones RS, Rauh V, Needham LL, Tang D, Niedzwiecki M, Wang RY, Perera F (2010) Prenatal exposure to PBDEs and neurodevelopment. Environ Health Perspect 118(5):712–719
40. Pleil JD, Stiegel MA, Sobus JR, Tabucchi S, Ghio AJ, Madden MC (2010) Cumulative exposure assessment for trace-level polycyclic aromatic hydrocarbons (PAHs) using human blood and plasma analysis. J Chromatogr B Analyt Technol Biomed Life Sci 878(21):1753–1760
41. Marek RF, Thorne PS, Wang K, DeWall J, Hornbuckle KC (2013) PCBs and OH-PCBs in serum from children and mothers in urban and rural US communities. Environ Sci Technol 47:3353–3361
42. Awad AM, Martinez A, Marek RF, Hornbuckle KC (2016) Occurrence and distribution of two hydroxylated polychlorinated biphenyl congeners in Chicago air. Environ Sci Technol Lett 3(2):47–51
43. Zheng XY, Dupuis KT, Aly NA, Zhou YX, Smith FB, Tang KQ, Smith RD, Baker ES (2018) Utilizing ion mobility spectrometry and mass spectrometry for the analysis of polycyclic aromatic hydrocarbons, polychlorinated biphenyls, polybrominated diphenyl ethers and their metabolites. Anal Chim Acta 1037:265–273
44. Marquez-Sillero I, Aguilera-Herrador E, Cardenas S, Valcarcel M (2011) Ion-mobility spectrometry for environmental analysis. TrAC Trends Anal Chem 30(5):677–690
45. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787
46. Katajamaa M, Miettinen J, Oresic M (2006) MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22(5):634–636
47. Lommen A (2009) MetAlign: Interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data Preprocessing. Anal Chem 81(8):3079–3086
48. Misra BB, Mohapatra S (2019) Tools and resources for metabolomics research community: a 2017–2018 update. Electrophoresis 40(2):227–246
49. Chadeau-Hyam M, Athersuch TJ, Keun HC, De Iorio M, Ebbels TMD, Jenab M, Sacerdote C, Bruce SJ, Holmes E, Vineis P (2011) Meeting-in-the-middle using metabolic profiling - a strategy for the identification of intermediate biomarkers in cohort studies. Biomarkers 16(1):83–88
50. MacPherson S, Arbuckle TE, Fisher M (2018) Adjusting urinary chemical biomarkers for hydration status during pregnancy. J Expo Sci Environ Epidemiol 28(5):481–493
51. Dieterle F, Ross A, Schlotterbeck G, Senn H (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in H-1 NMR metabonomics. Anal Chem 78(13):4281–4290
52. Gagnebin Y, Tonoli D, Lescuyer P, Ponte B, de Seigneux S, Martin PY, Schappler J, Boccard J, Rudaz S (2017) Metabolomic analysis of urine samples by UHPLC-QTOF-MS: impact of normalization strategies. Anal Chim Acta 955:27–35
53. Johnson CH, Ivanisevic J, Benton HP, Siuzdak G (2015) Bioinformatics: the next frontier of metabolomics. Anal Chem 87(1):147–156
54. Uppal K, Walker DI, Jones DP (2017) xMSannotator: an R package for network-based annotation of high-resolution metabolomics data. Anal Chem 89(2):1063–1067
55. Warth B, Spangler S, Fang M, Johnson CH, Forsberg EM, Granados A, Martin RL, Domingo-Almenara X, Huan T, Rinehart D, Montenegro-Burke JR, Hilmers B, Aisporna A, Hoang LT, Uritboonthai W, Benton HP, Richardson SD, Williams AJ, Siuzdak G (2017) Exposome-scale investigations guided by global metabolomics, pathway analysis, and cognitive computing. Anal Chem 89(21):11505–11513
56. Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G (2012) XCMS online: a web-based platform to process untargeted Metabolomic data. Anal Chem 84(11):5035–5039
57. Wishart D, Arndt D, Pon A, Sajed T, Guo AC, Djoumbou Y, Knox C, Wilson M, Liang YJ, Grant J, Liu YF, Goldansaz SA, Rappaport SM (2015) T3DB: the toxic exposome database. Nucleic Acids Res 43(D1):D928–D934
58. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu YF, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46(D1):D1074–D1082
59. Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ (2019) The comparative toxicogenomics database: update 2019. Nucleic Acids Res 47(D1):D948–D954
60. Jordan S, Fonger G, Hazard G (2017) Hazardous substances data bank: recent features and enhancements. Abstr Am Chem Soc 254
61. Neveu V, Moussy A, Rouaix H, Wedekind R, Pon A, Knox C, Wishart DS, Scalbert A (2017) Exposome-explorer: a manually-curated
factors: persistent pollutants and nutrients correlated with serum lipid levels. Int J Epidemiol 41(3):828–843
75. Manrai AK, Cui YX, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, Ritchie M, Schmitt C, Sarigiannis DA, Thomas DC, Wishart D, Balshaw DM, Patel CJ (2017) Informatics and data analytics to support exposome-based discovery for public health. Annu Rev Public Health 38(38):279–294
76. Sun ZC, Tao YB, Li S, Ferguson KK, Meeker JD, Park SK, Batterman SA, Mukherjee B (2013) Statistical strategies for constructing health risk models with multiple pollutants and their interactions: possible choices and comparisons. Environ Health 12:85
77. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Series B Stat Methodol 58(1):267–288
78. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net (vol B 67, pg 301, 2005). J R Stat Soc Series B Stat Methodol 67:768–768
79. Agier L, Portengen L, Chadeau-Hyam M, Basagana X, Giorgis-Allemand L, Siroux V, Robinson O, Vlaanderen J, Gonzalez JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R, Vermeulen R (2016) A systematic comparison of linear regression-based statistical methods to assess Exposome-health associations. Environ Health Perspect 124(12):1848–1856
80. Bottolo L, Richardson S (2010) Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal 5(3):583–618
81. Liquet B, Bottolo L, Campanella G, Richardson S, Chadeau-Hyam M (2016) R2GUESS: a graphics processing unit-based R package for Bayesian variable selection regression of multivariate responses. J Stat Softw 69(2):1–32
82. Jiang C, Wang X, Li XY, Inlora J, Wang T, Liu Q, Snyder M (2018) Dynamic human environmental Exposome revealed by longitudinal personal monitoring. Cell 175(1):277
83. Wang XH, Eijkemans MJC, Wallinga J, Biesbroek G, Trzcinski K, Sanders EAM, Bogaert D (2012) Multivariate approach for studying interactions between environmental variables and microbial communities. PLoS One 7(11):e50267
84. Jain P, Vineis P, Liquet B, Vlaanderen J, Bodinier B, van Veldhoven K, Kogevinas M, Athersuch TJ, Font-Ribera L, Villanueva CM, Vermeulen R, Chadeau-Hyam M (2018) A multivariate approach to investigate the combined biological effects of multiple exposures. J Epidemiol Community Health 72(7):564–571
85. Roede JR, Uppal K, Park Y, Tran V, Jones DP (2014) Transcriptome-metabolome wide association study (TMWAS) of maneb and paraquat neurotoxicity reveals network level interactions in toxicologic mechanism. Toxicol Rep 1:435–444
86. Guida F, Sandanger TM, Castagne R, Campanella G, Polidoro S, Palli D, Krogh V, Tumino R, Sacerdote C, Panico S, Severi G, Kyrtopoulos SA, Georgiadis P, Vermeulen RCH, Lund E, Vineis P, Chadeau-Hyam M (2015) Dynamics of smoking-induced genome-wide methylation changes with time since smoking cessation. Hum Mol Genet 24(8):2349–2359
87. Mahieu NG, Patti GJ (2017) Systems-level annotation of a metabolomics data set reduces 25 000 features to fewer than 1000 unique metabolites. Anal Chem 89(19):10397–10406
88. Geng DW, Jogsten IE, Dunstan J, Hagberg J, Wang T, Ruzzin J, Rabasa-Lhoret R, van Bavel B (2016) Gas chromatography/atmospheric pressure chemical ionization/mass spectrometry for the analysis of organochlorine pesticides and polychlorinated biphenyls in human serum. J Chromatogr A 1453:88–98
89. Zhao S, Luo X, Li L (2016) Chemical isotope Labeling LC-MS for high coverage and quantitative profiling of the hydroxyl submetabolome in metabolomics. Anal Chem 88(21):10617–10623
90. Treutler H, Tsugawa H, Porzel A, Gorzolka K, Tissier A, Neumann S, Balcke GU (2016) Discovering regulated metabolite families in untargeted metabolomics studies. Anal Chem 88(16):8082–8090
91. Depke T, Franke R, Bronstrup M (2017) Clustering of MS2 spectra using unsupervised methods to aid the identification of secondary metabolites from Pseudomonas aeruginosa. J Chromatogr B: Anal Technol Biomed Life Sci 1071:19–28
92. van der Hooft JJJ, Wandy J, Barrett MP, Burgess KEV, Rogers S (2016) Topic modeling for untargeted substructure exploration in metabolomics. Proc Natl Acad Sci U S A 113(48):13738–13743
93. van der Hooft JJJ, Wandy J, Young F, Padmanabhan S, Gerasimidis K, Burgess KEV, Barrett MP, Rogers S (2017) Unsupervised discovery and comparison of structural families across multiple samples in untargeted metabolomics. Anal Chem 89(14):7569–7577
94. Lu YF, Goldstein DB, Angrist M, Cavalleri G (2014) Personalized medicine and human genetic diversity. Cold Spring Harb Perspect Med 4(9):a008581
95. Juarez PD, Matthews-Juarez P, Hood DB, Im W, Levine RS, Kilbourne BJ, Langston MA, Al-Hamdan MZ, Crosson WL, Estes MG, Estes SM, Agboto VK, Robinson P, Wilson S, Lichtveld MY (2014) The public health exposome: a population-based, exposure science approach to health disparities research. Int J Environ Res Public Health 11(12):12866–12895
Chapter 23
Abstract
The network-based approach is rapidly emerging as a promising strategy to integrate and interpret different -omics datasets, including metabolomics. The first section of this chapter introduces current progress and the main concepts in multi-omics integration. The second section provides an overview of the public resources available for creating biological networks. The third section describes three common application scenarios: subnetwork identification, network-based enrichment analysis, and systems metabolomics. The fourth section introduces the concept of hierarchical community network analysis. The fifth section discusses different tools for network visualization. The chapter ends with a future perspective on multi-omics integration.
Key words Systems biology, Multi-omics integration, Data integration, Network enrichment analysis,
Hierarchical community networks, Network visualization
1 Introduction
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3_23, © Springer Science+Business Media, LLC, part of Springer Nature 2020
470 Guangyan Zhou et al.
3 Network-Based Integration
Table 1
Example large-scale multi-omics projects
Fig. 1 The overall workflow of the knowledge-driven network-based approach. Experimental data obtained from single-omics analysis are mapped onto existing molecular interaction or pathway databases to build a multi-omics network. The resulting network can be used to identify context-specific subnetworks, to perform network enrichment analysis, and for compound identification
3.2 Network-Based Enrichment Analysis
Enrichment analysis aims to identify functions that are significantly changed by comparing experimental data with known pathways or other functionally related gene sets. A wide range of methods have been developed. Among them, overrepresentation analysis (ORA) and gene set enrichment analysis (GSEA) are two widely used methods for this task [61]. Although very powerful, these approaches ignore the topological properties of molecules within pathways or networks, which are essential for understanding systems behaviour. In addition, they rely on direct overlap between input molecules and functions, so molecules with missing pathway or gene set annotation are not considered.
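To make the overlap-based nature of ORA concrete, the sketch below computes an overrepresentation p-value with the hypergeometric test that is commonly used for this purpose. The chapter does not give a formula, so the function name, argument names, and example counts are assumptions introduced for illustration only.

```python
from math import comb


def ora_pvalue(n_universe: int, n_pathway: int, n_selected: int, n_overlap: int) -> float:
    """Upper-tail hypergeometric p-value for overrepresentation analysis.

    n_universe: all annotated molecules that could have been selected
    n_pathway:  molecules in the universe that belong to the pathway
    n_selected: molecules flagged as significant in the experiment
    n_overlap:  flagged molecules that fall in the pathway

    Returns P(X >= n_overlap) when n_selected molecules are drawn at random
    without replacement from the universe.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 10 annotated molecules, 4 in the pathway,
    # 3 flagged as significant, 2 of which fall in the pathway.
    print(ora_pvalue(n_universe=10, n_pathway=4, n_selected=3, n_overlap=2))
```

Only the four counts enter the calculation. Where the molecules sit in the pathway and how they are connected plays no role, and molecules absent from the annotated universe cannot contribute at all.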
Network-based enrichment methods have been developed to overcome some of these issues. These approaches take advantage of graph-based statistics to exploit the connectivity information in biological pathways or molecular interaction networks. A typical approach consists of two main steps: (1) mapping the experimental data onto the pathway or network; and (2) using “topology-aware”
Fig. 2 An example mummichog result as implemented in MetaboAnalyst. The mummichog algorithm directly maps peaks as putative metabolites onto metabolic pathways and performs enrichment analysis to identify perturbed pathways (shown in the red circle)
5 Network Visualization
6 Future Prospects
Fig. 4 An example multi-omics network generated using OmicsNet. Transcription factors are colored in purple. Genes/proteins are colored by their expression value using a green-red gradient. Grey nodes refer to predicted interacting partners. Metabolites are colored in yellow. A shortest path between a transcription factor and a metabolite is highlighted in blue
7 Notes
Acknowledgments
References
1. Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biol 18(1):83
2. Coleman WB (2017) Next-generation breast cancer omics. Am J Pathol 187(10):2130–2132
3. Mach N, Ramayo-Caldas Y, Clark A, Moroldo M, Robert C, Barrey E et al (2017) Understanding the response to endurance exercise using a systems biology approach: combining blood metabolomics, transcriptomics and miRNomics in horses. BMC Genomics 18(1):187
4. Villar M, Ayllon N, Alberdi P, Moreno A, Moreno M, Tobes R et al (2015) Integrated metabolomics, transcriptomics and proteomics identifies metabolic pathways affected by Anaplasma phagocytophilum infection in tick cells. Mol Cell Proteomics 14(12):3154–3172
5. Rinschen MM, Ivanisevic J, Giera M, Siuzdak G (2019) Identification of bioactive metabolites using activity metabolomics. Nat Rev Mol Cell Biol 20:353–367
6. Yan J, Risacher SL, Shen L, Saykin AJ (2018) Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Brief Bioinform 19(6):1370–1381
7. Casci T (2012) Bioinformatics: next-generation omics. Nat Rev Genet 13(6):378
8. Rattray NJ, Deziel NC, Wallach JD, Khan SA, Vasiliou V, Ioannidis JP et al (2018) Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics 12(1):4
9. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D (2015) Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet 16(2):85
10. Chong J, Xia J (2017) Computational approaches for integrative analysis of the metabolome and microbiome. Metabolites 7(4):E62
11. Gligorijevic V, Przulj N (2015) Methods for biological data integration: perspectives and challenges. J R Soc Interface 12(112):20150571
12. Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC (2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Brief Bioinform 17(4):628–641
13. Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G et al (2016) Methods for the integration of multi-omics data:
miRNA–gene interactions. Nucleic Acids Res 46(Database issue):D239–D245
39. Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6(10):813–823
40. Tomczak K, Czerwińska P, Wiznerowicz M (2015) The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 19(1A):A68
41. The Integrative HMP (iHMP) Research Network Consortium (2014) The Integrative human microbiome project: dynamic analysis of microbiome-host omics profiles during periods of human health and disease. Cell Host Microbe 16(3):276–289
42. Laakso M, Kuusisto J, Stančáková A, Kuulasmaa T, Pajukanta P, Lusis AJ et al (2017) The metabolic syndrome in men study: a resource for studies of metabolic and cardiovascular diseases. J Lipid Res 58(3):481–493
43. Tadaka S, Saigusa D, Motoike IN, Inoue J, Aoki Y, Shirota M et al (2017) jMorp: Japanese multi Omics reference panel. Nucleic Acids Res 46(D1):D551–D557
44. Nica AC, Parts L, Glass D, Nisbet J, Barrett A, Sekowska M et al (2011) The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genet 7(2):e1002003
45. Perez-Riverol Y, Bai M, da Veiga Leprevost F, Squizzato S, Park YM, Haug K et al (2017) Discovering and linking public omics data sets using the omics discovery index. Nat Biotechnol 35(5):406–409
46. Yugi K, Kubota H, Hatano A, Kuroda S (2016) Trans-omics: how to reconstruct biochemical networks across multiple ‘omic’ layers. Trends Biotechnol 34(4):276–290
47. Zhou G, Xia J (2018) OmicsNet: a web-based tool for creation and visual analysis of biological networks in 3D space. Nucleic Acids Res 46(W1):W514–W522
48. Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M et al (2015) Pathway and network analysis of cancer genomes. Nat Methods 12(7):615–621
49. Akhmedov M, Kedaigle A, Chong RE, Montemanni R, Bertoni F, Fraenkel E et al (2017) PCSF: an R-package for network-based interpretation of high-throughput data. PLoS Comput Biol 13(7):e1005694
50. Tuncbag N, Gosline SJ, Kedaigle A, Soltis AR, Gitter A, Fraenkel E (2016) Network-based interpretation of diverse high-throughput datasets through the omics integrator software package. PLoS Comput Biol 12(4):e1004879
51. Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(suppl_1):S233–S240
52. Khurana V, Peng J, Chung CY, Auluck PK, Fanning S, Tardiff DF et al (2017) Genome-scale networks link neurodegenerative disease genes to α-synuclein through specific molecular pathways. Cell Syst 4(2):157–170.e14
53. Sychev ZE, Hu A, DiMaio TA, Gitter A, Camp ND, Noble WS et al (2017) Integrated systems biology analysis of KSHV latent infection reveals viral induction and reliance on peroxisome mediated lipid metabolism. PLoS Pathog 13(3):e1006256
54. Beisser D, Klau GW, Dandekar T, Müller T, Dittrich MT (2010) BioNet: an R-package for the functional analysis of biological networks. Bioinformatics 26(8):1129–1130
55. Alcaraz N, List M, Dissing-Hansen M, Rehmsmeier M, Tan Q, Mollenhauer J et al (2016) Robust de novo pathway enrichment with KeyPathwayMiner 5. F1000Res 5:1531
56. Anvar MS, Minuchehr Z, Shahlaei M, Kheitan S (2018) Gastric cancer biomarkers; a systems biology approach. Biochem Biophys Rep 13:141–146
57. Jha AK, Huang SC-C, Sergushichev A, Lampropoulou V, Ivanova Y, Loginicheva E et al (2015) Network integration of parallel metabolic and transcriptional data reveals metabolic modules that regulate macrophage polarization. Immunity 42(3):419–430
58. Chen X, Liu M-X, Yan G-Y (2012) Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst 8(7):1970–1978
59. Liu Y, Zeng X, He Z, Zou Q (2017) Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans Comput Biol Bioinform 14(4):905–915
60. Chen X, You Z-H, Yan G-Y, Gong D-W (2016) IRWRLDA: improved random walk with restart for lncRNA-disease association prediction. Oncotarget 7(36):57919
61. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550
62. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS et al (2009) A novel signaling pathway impact analysis. Bioinformatics 25(1):75–82
63. Alexeyenko A, Lee W, Pernemalm M, Guegan J, Dessen P, Lazar V et al (2012) Network enrichment analysis: extension of gene-set enrichment analysis to gene networks. BMC Bioinformatics 13(1):226
64. Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A (2012) EnrichNet: network-based gene set enrichment analysis. Bioinformatics 28(18):i451–i457
65. Dettmer K, Aronov PA, Hammock BD (2007) Mass spectrometry-based metabolomics. Mass Spectrom Rev 26(1):51–78
66. da Silva RR, Dorrestein PC, Quinn RA (2015) Illuminating the dark matter in metabolomics. Proc Natl Acad Sci U S A 112(41):12549–12550
67. Albert R, Jeong H, Barabasi AL (2000) Error and attack tolerance of complex networks. Nature 406(6794):378–382
68. Li S, Park Y, Duraisingham S, Strobel FH, Khan N, Soltow QA et al (2013) Predicting network activity from high throughput metabolomics. PLoS Comput Biol 9(7):e1003123
69. Xu X, Araki K, Li S, Han JH, Ye L, Tan WG et al (2014) Autophagy is essential for effector CD8(+) T cell survival and memory formation. Nat Immunol 15(12):1152–1161
70. Li S, Todor A, Luo R (2016) Blood transcriptomics and metabolomics for personalized medicine. Comput Struct Biotechnol J 14:1–7
71. Stewart CJ, Embleton ND, Marrs ECL, Smith DP, Fofanova T, Nelson A et al (2017) Longitudinal development of the gut microbiome and metabolome in preterm neonates with late onset sepsis and healthy controls. Microbiome 5(1):75
72. Huan T, Forsberg EM, Rinehart D, Johnson CH, Ivanisevic J, Benton HP et al (2017) Systems biology guided by XCMS online metabolomics. Nat Methods 14(5):461–462
73. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G et al (2018) MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Res 46(W1):W486–W494
74. Pirhaji L, Milani P, Leidl M, Curran T, Avila-Pacheco J, Clish CB et al (2016) Revealing disease-associated pathways by network integration of untargeted metabolomics. Nat Methods 13(9):770
75. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017) mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol 13(11):e1005752
76. Li S, Sullivan NL, Rouphael N, Yu T, Banton S, Maddur MS et al (2017) Metabolic phenotypes of response to vaccination in humans. Cell 169(5):862–877.e17
77. Gardinassi LG, Arévalo-Herrera M, Herrera S, Cordy RJ, Tran V, Smith MR et al (2018) Integrative metabolomics and transcriptomics signatures of clinical tolerance to Plasmodium vivax reveal activation of innate cell immunity and T cell signaling. Redox Biol 17:158–170
78. Pavlopoulos GA, Malliarakis D, Papanikolaou N, Theodosiou T, Enright AJ, Iliopoulos I (2015) Visualizing genome and systems biology: technologies, tools, implementation techniques and trends, past, present and future. Gigascience 4(1):38
79. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
80. Alcaraz N, Pauling J, Batra R, Barbosa E, Junge A, Christensen AG et al (2014) KeyPathwayMiner 4.0: condition-specific pathway analysis by combining multiple omics studies and networks with Cytoscape. BMC Syst Biol 8(1):99
81. Kutmon M, van Iersel MP, Bohler A, Kelder T, Nunes N, Pico AR et al (2015) PathVisio 3: an extendable pathway analysis toolbox. PLoS Comput Biol 11(2):e1004085
82. Luo W, Pant G, Bhavnasi YK, Blanchard SG Jr, Brouwer C (2017) Pathview web: user friendly pathway visualization and data integration. Nucleic Acids Res 45(W1):W501–W508
83. Garcia-Alcalde F, Garcia-Lopez F, Dopazo J, Conesa A (2010) Paintomics: a web based tool for the joint visualization of transcriptomics and metabolomics data. Bioinformatics 27(1):137–139
84. Kuo T-C, Tian T-F, Tseng YJ (2013) 3Omics: a web-based systems biology tool for analysis, integration and visualization of human transcriptomic, proteomic and metabolomic data. BMC Syst Biol 7(1):64
85. Sommer B, Baaden M, Krone M, Woods A (2018) From virtual reality to immersive analytics in Bioinformatics. J Integr Bioinform 15(2):20180043
86. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC et al (2018) Multi-Omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 14(6):e8124
87. Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal, Complex Systems 1695(5):1–9
88. Hagberg A, Swart P, S Chult D (2008) Exploring network structure, dynamics, and function using NetworkX (No. LA-UR-08-05495; LA-
Network-Based Multi-omics Integration 487
Shuzhao Li (ed.), Computational Methods and Data Analysis for Metabolomics, Methods in Molecular Biology, vol. 2104,
https://doi.org/10.1007/978-1-0716-0239-3, © Springer Science+Business Media, LLC, part of Springer Nature 2020
COMPUTATIONAL METHODS AND DATA ANALYSIS FOR METABOLOMICS
490 Index
Human metabolome database (HMDB) ........ 6, 56, 58, 84, 140, 141, 144, 145, 150, 165–170, 173, 175, 176, 178–183, 199, 390, 437, 453, 455
I
IDEOM ........ 419–444
Integration ........ 9, 12, 22, 25, 87, 137, 197, 202, 280, 351, 354, 361–385, 395, 396, 411, 453, 459, 470, 472–478, 482
Interpretation ........ v, 8, 87, 88, 102, 108, 112, 114, 121–124, 133, 146, 221, 330, 337–358, 378, 421, 456, 469, 477–479
Isotope ........ 8, 12, 13, 52, 53, 62, 66, 71, 84, 85, 101–103, 113, 115, 116, 156–158, 186–189, 191–193, 210, 221, 231, 232, 240, 257, 388, 429, 434, 435, 438–440, 443, 444, 453, 460, 479
L
LipidMaps ........ 122, 123, 125, 128–131, 133, 145, 437, 453, 455
Lipidome ........ 122, 124–134
Lipidomics ........ 5, 9, 116, 121–134
Liquid chromatography (LC) ........ 2, 4, 13, 25, 210, 322, 452
Liquid chromatography-mass spectrometry (LC-MS) ........ 210, 227, 419–444, 450
Liquid-chromatography tandem mass spectrometry (LC-MS/MS) ........ 13, 140, 145, 179, 181, 185, 187, 227, 229
M
Mass spectrometry (MS) ........ 3, 5–9, 13, 14, 21, 25, 30, 49, 63, 66, 87, 88, 115, 121, 122, 134, 143, 145, 149, 160, 161, 166, 204, 209–223, 257, 337, 372, 393, 422, 448, 477
MetaboAnalyst ........ 8, 41, 87, 261, 337–358, 392, 395, 397, 477, 478, 481
Metabolic activity ........ 100, 112, 115
Metabolic modification ........ 161
Metabolic networks ........ 21, 66, 89, 351, 361–385, 470
Metabolite databases ........ 6, 139–147, 149, 425, 427, 433, 444, 453–455
Metabolite identification ........ 7, 8, 21, 26, 56–58, 71, 72, 76, 78, 83–86, 88, 139–147, 149–155, 157, 158, 161, 170, 180, 185–201, 204, 211, 350, 395, 412, 413, 419, 422, 427, 428, 440, 451, 453–456, 477
Metabolomics ........ v, 2, 11, 25, 49, 61, 139, 149, 165, 185, 209, 227, 246, 265, 313, 337, 362, 388, 402, 419, 448, 469
Metabotyping ........ 62–65, 68–70, 73, 80–82, 88, 90
METLIN ........ 6, 12, 20, 21, 140, 141, 149–161, 453–455
MetScape ........ 89, 390, 393, 394, 397
Microsoft Excel ........ 239, 419, 422, 424
Mode of action ........ 420, 421, 423
Molecular formula ........ 7, 57, 139, 151, 153, 155, 183, 185–204
Molecular networking ........ 6, 211, 227–241
MS/MS spectra ........ 2, 3, 6, 20, 21, 57, 140–144, 150–159, 161, 183, 187, 188, 228, 229, 237, 239, 241, 454–456, 460
Multi-omics integration ........ 19, 353–358, 396, 459, 469–483
Multivariate data analysis ........ 14, 81–83, 86–88, 343–346
Multivariate statistics ........ 8, 38, 87, 344, 357, 482
Mummichog ........ 8, 21, 140, 146, 211, 350–353, 389, 390, 395–397, 456, 477, 478
MZmine ........ 6, 11, 25–47, 187, 229, 230, 232, 235, 236, 238–241, 256, 430, 452
N
Natural products ........ 143, 210, 228, 455
Network enrichment analysis ........ 472, 474, 476
Network visualization ........ 357, 393, 480, 481
Nuclear magnetic resonance (NMR) ........ 3, 61, 144, 166, 337, 450
O
OpenMS ........ 6, 49–59, 187, 256
P
Pathway ........ 3, 21, 63, 99, 130, 140, 150, 166, 210, 228, 261, 338, 362, 387, 424, 450, 471, 505
Pathway analysis ........ 5, 8, 21, 87–89, 346, 348, 349, 352, 354–357, 363, 387–397, 481
Peak ........ 4, 11, 26, 50, 65, 104, 139, 153, 179, 187, 210, 231, 251, 315, 338, 393, 412, 420, 448, 477
Performance metrics ........ 324, 325
Plant ........ 5, 6, 90, 209–223, 392, 471
Plasma ........ 63, 65, 68–70, 72, 73, 100, 104, 105, 108, 112, 116, 122, 124–127, 257, 322, 325, 405–407, 411, 479
Precision medicine ........ 1, 2, 9, 473
Predictive modeling ........ 38, 265, 279, 313–331
Preprocessing ........ 5, 6, 25–47, 57, 67, 76, 78–81, 87, 187–189, 191, 320, 341, 358, 432, 451–453
Processing ........ v, 5, 11–22, 26, 28, 35, 49–59, 66, 79, 86–88, 149, 187, 210, 219, 229, 239, 240, 249, 256, 262, 307, 337, 338, 340–343, 365, 394, 395, 410, 411, 413, 419, 420, 423, 426–428, 430–442, 444, 451, 452, 454, 456, 459, 460
Python ........ 49, 51, 79, 81, 87, 246, 250, 254–256, 261, 265–311, 364, 365, 375, 379, 380, 384, 395
Systems biology ........ 12, 19, 21, 66, 337, 365, 470, 478, 481
Systems medicine ........ 8, 9
T
Tandem mass spectrum ........ 139
U
Unknown metabolite ........ 84, 150–156, 158, 210, 212, 216, 217, 408
Untargeted ........ 8, 11, 13, 20, 25–27, 35, 47, 52–26, 58, 64, 67, 69, 85, 86, 121–133, 141, 151, 157, 161, 186, 314, 315, 322, 338, 350–353, 358, 387–398, 412, 419, 444, 448, 450, 451, 453, 459, 460, 479
Untargeted metabolomics ........ 11, 13, 25, 27, 35, 69, 141, 151, 157, 314, 338, 350–353, 358, 387–398, 419, 450, 451, 453, 459, 460, 479
Q