Microbial Systems Biology - Methods and Protocols (2021)

Methods in

Molecular Biology 2349

Ali Navid Editor

Microbial
Systems Biology
Methods and Protocols
Second Edition
METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, UK

For further volumes:
http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and
methodologies in the critically acclaimed Methods in Molecular Biology series. The series was
the first to introduce the step-by-step protocols approach that has become the standard in all
biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-
step fashion, opening with an introductory overview, a list of the materials and reagents
needed to complete the experiment, and followed by a detailed procedure that is supported
with a helpful notes section offering tips and tricks of the trade as well as troubleshooting
advice. These hallmark features were introduced by series editor Dr. John Walker and
constitute the key ingredient in each and every volume of the Methods in Molecular Biology
series. Tested and trusted, comprehensive and reliable, all protocols from the series are
indexed in PubMed.
Microbial Systems Biology

Methods and Protocols

Second Edition

Edited by

Ali Navid
Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)


Methods in Molecular Biology
ISBN 978-1-0716-1584-3 ISBN 978-1-0716-1585-0 (eBook)
https://doi.org/10.1007/978-1-0716-1585-0

© Springer Science+Business Media, LLC, part of Springer Nature 2022


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been
made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer
Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Dedication

Dedicated to the memory of Dr. Marc Griesemer

Preface

Although we have advanced the field of microbial systems biology significantly during the
last decade, our foray into analyzing exciting and ever more complex systems such as the gut
microbiome or autotroph-bacterial interactions requires examining all the data at our
disposal. Systems biology analyses today usually use multiple types of omics data for each
study. This is a welcome byproduct of the revolutionary advances in technologies
that allow rapid and relatively cheap collection of system-level data.
Combining and analyzing these data requires increasingly complex computational
models and state-of-the-art bioinformatics tools. Fortunately, the field of computational
biology has kept up with technological advances. New and exciting tools are being
developed that allow for detailed analyses of disparate types of omics data. The results of
these studies can then be used to quickly generate system-level models and conduct analyses
that at times can only be achieved in silico.
In this edition of the book I have tried to introduce the reader to powerful and easy-to-use
computational biology databases and tools (e.g., MetaFlux, KBase, COBRA Toolbox)
that can significantly help researchers examine their system-level data. There are also chapters
that deal with annotating genomes and using the collected information to conduct
network analyses of the system. The book also introduces the reader to a number of
specialized analytic tools (e.g., NanoSIP and PAMMS) that can significantly improve and
add to the data available for systems biology studies. It is my hope that the information
provided by the authors will inspire researchers to explore innovative analytical methods,
examine their problems from multiple angles, and generate novel hypotheses that will
lead to new scientific discoveries.
Unlike the first edition, which I edited 8 years ago, this edition took significantly
longer than expected, largely because of last-minute withdrawals of contributions
caused by uncontrollable events. I want to express my gratitude to Dr. John Walker for his
guidance and support during this process. I want to thank the authors for their contributions
and patience during this prolonged process, especially Drs. Benjamin Stewart, Peter Karp,
and their coauthors, who waited the longest for publication of their chapters. Finally, I
want to acknowledge current and former staff from DOE's national laboratories, especially
my colleagues at Lawrence Livermore National Laboratory, for their support and contributions.

Livermore, CA, USA Ali Navid

Contents

Dedication
Preface
Contributors

1 Parallel Accelerator and Molecular Mass Spectrometry Measurement of Carbon-14-Labeled Analytes
Benjamin J. Stewart and Ted J. Ognibene
2 Fast Sampling of the Cellular Metabolome
Walter M. van Gulik, Andre B. Canelas, Hilal Taymaz-Nikerel,
Rutger D. Douma, Lodewijk P. de Jonge, and Joseph J. Heijnen
3 Investigation of Protein–Lipid Interactions Using Native Mass Spectrometry
Xiao Cong, John W. Patrick, Yang Liu, Xiaowen Liang, Wen Liu,
and Arthur Laganowsky
4 Western Blot Processing Optimization: The Perfect Blot
Russ Yukhananov, David P. Chimento, and Laura A. Marlow
5 FISHing on a Budget
Gable M. Wadsworth and Harold D. Kim
6 NanoSIP: NanoSIMS Applications for Microbial Biology
Jennifer Pett-Ridge and Peter K. Weber
7 Construction of Metatranscriptomic Libraries for 5′ End Sequencing of rRNAs for Microbiome Research
Marja Tiirola and Anita Mäki
8 Computational Approaches for Designing Highly Specific and Efficient sgRNAs
Jaspreet Kaur Dhanjal, Dhvani Vora, Navaneethan Radhakrishnan,
and Durai Sundar
9 Complex Network Analysis in Microbial Systems: Theory and Examples
André Voigt and Eivind Almaas
10 Prokaryotic Genome Annotation
Jeffrey A. Kimbrel, Brendan M. Jeffrey, and Christopher S. Ward
11 Functional Annotation from Structural Homology
Brent W. Segelke
12 Metabolic Modeling with MetaFlux
Mario Latendresse, Wai Kit Ong, and Peter D. Karp
13 Application of the Metabolic Modeling Pipeline in KBase to Categorize Reactions, Predict Essential Genes, and Predict Pathways in an Isolate Genome
Benjamin H. Allen, Nidhi Gupta, Janaka N. Edirisinghe, José P. Faria,
and Christopher S. Henry
14 Curating COBRA Models of Microbial Metabolism
Ali Navid
15 A Beginner's Guide to the COBRA Toolbox
Ali Navid
16 Rules of Engagement: A Guide to Developing Agent-Based Models
Marc Griesemer and Suzanne S. Sindi

Index
Contributors

BENJAMIN H. ALLEN • Oak Ridge National Laboratory, Oak Ridge, TN, USA
EIVIND ALMAAS • Department of Biotechnology, Norwegian University of Science &
Technology, NTNU, Trondheim, Norway
ANDRE B. CANELAS • Department of Biotechnology, Delft University of Technology, Delft, The
Netherlands
DAVID P. CHIMENTO • Rockland Immunochemicals Inc., Limerick, PA, USA
XIAO CONG • Center for Infectious and Inflammatory Diseases, Institute of Biosciences and
Technology, Texas A&M Health Science Center, Houston, TX, USA
LODEWIJK P. DE JONGE • Department of Biotechnology, Delft University of Technology, Delft,
The Netherlands
JASPREET KAUR DHANJAL • Department of Biochemical Engineering and Biotechnology, DBT-
AIST International Laboratory for Advanced Biomedicine (DAILAB), Indian Institute
of Technology Delhi, Hauz Khas, New Delhi, India
RUTGER D. DOUMA • Department of Biotechnology, Delft University of Technology, Delft, The
Netherlands
JANAKA N. EDIRISINGHE • Argonne National Laboratory, Lemont, IL, USA
JOSÉ P. FARIA • Argonne National Laboratory, Lemont, IL, USA
MARC GRIESEMER • Controls and Data Systems Division, SLAC National Accelerator
Laboratory, Menlo Park, CA, USA
NIDHI GUPTA • Argonne National Laboratory, Lemont, IL, USA
JOSEPH J. HEIJNEN • Department of Biotechnology, Delft University of Technology, Delft, The
Netherlands
CHRISTOPHER S. HENRY • Argonne National Laboratory, Lemont, IL, USA
BRENDAN M. JEFFREY • Bioinformatics and Computational Biosciences Branch, Rocky
Mountain Laboratories, National Institute of Allergy and Infectious Diseases, National
Institutes of Health, Hamilton, MT, USA
PETER D. KARP • SRI International, Menlo Park, CA, USA
HAROLD D. KIM • School of Physics, Georgia Institute of Technology, Atlanta, GA, USA
JEFFREY A. KIMBREL • Biosciences and Biotechnology Division, Lawrence Livermore National
Laboratory, Livermore, CA, USA
ARTHUR LAGANOWSKY • Center for Infectious and Inflammatory Diseases, Institute of
Biosciences and Technology, Texas A&M Health Science Center, Houston, TX, USA;
Department of Chemistry, Texas A&M University, College Station, TX, USA
MARIO LATENDRESSE • SRI International, Menlo Park, CA, USA
XIAOWEN LIANG • Center for Infectious and Inflammatory Diseases, Institute of Biosciences
and Technology, Texas A&M Health Science Center, Houston, TX, USA
WEN LIU • Center for Infectious and Inflammatory Diseases, Institute of Biosciences and
Technology, Texas A&M Health Science Center, Houston, TX, USA
YANG LIU • Center for Infectious and Inflammatory Diseases, Institute of Biosciences and
Technology, Texas A&M Health Science Center, Houston, TX, USA; Department of
Chemistry, Texas A&M University, College Station, TX, USA
ANITA MÄKI • Department of Biological and Environmental Science, University of Jyväskylä,
Jyväskylä, Finland
LAURA A. MARLOW • Mayo Clinic, Jacksonville, FL, USA
ALI NAVID • Biosciences and Biotechnology Division, Physical and Life Sciences Directorate,
Lawrence Livermore National Laboratory, Livermore, CA, USA
TED J. OGNIBENE • Center for Accelerator Mass Spectrometry, Lawrence Livermore National
Laboratory, Livermore, CA, USA
WAI KIT ONG • SRI International, Menlo Park, CA, USA
JOHN W. PATRICK • Department of Chemistry, Texas A&M University, College Station, TX,
USA
JENNIFER PETT-RIDGE • Lawrence Livermore National Lab, Physical and Life Science
Directorate, Livermore, CA, USA
NAVANEETHAN RADHAKRISHNAN • Department of Biochemical Engineering and
Biotechnology, DBT-AIST International Laboratory for Advanced Biomedicine
(DAILAB), Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
BRENT W. SEGELKE • Physical and Life Sciences, Biosciences and Biotechnology Division,
Lawrence Livermore National Laboratory, Livermore, CA, USA
SUZANNE S. SINDI • Department of Applied Mathematics, University of California, Merced,
Merced, CA, USA
BENJAMIN J. STEWART • Biosciences and Biotechnology Division, Lawrence Livermore
National Laboratory, Livermore, CA, USA
DURAI SUNDAR • Department of Biochemical Engineering and Biotechnology, DBT-AIST
International Laboratory for Advanced Biomedicine (DAILAB), Indian Institute of
Technology Delhi, Hauz Khas, New Delhi, India
HILAL TAYMAZ-NIKEREL • Department of Biotechnology, Delft University of Technology, Delft,
The Netherlands
MARJA TIIROLA • Department of Biological and Environmental Science, University of
Jyväskylä, Jyväskylä, Finland
WALTER M. VAN GULIK • Department of Biotechnology, Delft University of Technology, Delft,
The Netherlands
ANDRÉ VOIGT • Department of Biotechnology, Norwegian University of Science &
Technology, NTNU, Trondheim, Norway
DHVANI VORA • Department of Biochemical Engineering and Biotechnology, DBT-AIST
International Laboratory for Advanced Biomedicine (DAILAB), Indian Institute of
Technology Delhi, Hauz Khas, New Delhi, India
GABLE M. WADSWORTH • School of Physics, Georgia Institute of Technology, Atlanta, GA,
USA
CHRISTOPHER S. WARD • Biosciences and Biotechnology Division, Lawrence Livermore
National Laboratory, Livermore, CA, USA; Department of Biological Sciences, Bowling
Green State University, Bowling Green, OH, USA
PETER K. WEBER • Lawrence Livermore National Lab, Physical and Life Science Directorate,
Livermore, CA, USA
RUSS YUKHANANOV • Precision Biosystems LLC, Mansfield, MA, USA
Chapter 1

Parallel Accelerator and Molecular Mass Spectrometry
Measurement of Carbon-14-Labeled Analytes
Benjamin J. Stewart and Ted J. Ognibene

Abstract
Parallel accelerator and molecular mass spectrometry (PAMMS) is a powerful analytical technique capable
of simultaneous quantitation of carbon-14 tracer and structural characterization of 14C-labeled
biomolecules. Here we describe the use of PAMMS for the analysis of biological molecules separated by high-
performance liquid chromatography. This protocol is intended to serve as a guide for researchers who need
to perform PAMMS experiments using instrumentation available at resource centers such as the National
User Resource for Biological Accelerator Mass Spectrometry at Lawrence Livermore National Laboratory.

Key words Accelerator mass spectrometry, Radiocarbon, Liquid sample interface, Isotope ratio mass
spectrometry

1 Introduction

Accelerator mass spectrometry (AMS) is currently the most sensitive
analytical method available for the measurement of rare, long-lived
isotopes. AMS is a form of isotope ratio mass spectrometry.
The most common isotope used for measurement of drugs and
biomolecules is carbon-14. Although some AMS instruments are
capable of measuring isotopes of elements other than carbon, this
discussion will be limited to measurement of carbon isotopes.
Unlike liquid scintillation counting (LSC), which relies on nuclear
decay events to quantify radioactive material, AMS counts atoms
independent of decay. AMS has been used in combination with
other analytical methods to characterize important parameters in
microbial systems, ranging from tracing specific metabolites in a
single organism to quantitation of community-scale metabolism.
Some applications of AMS in microbial systems biology published
to date include identification of biosynthesis precursors [1], quan-
titation of metabolite redox states [2], measurement of metabolite
fluxes [3], detoxification of reactive metabolites [4], determination
of soil respiration in terrestrial ecosystems [5], and measurement of
plastics biodegradation [6]. Additional applications are possible
and are limited only by the creativity of the researcher and the
technical constraints of performing experiments, as discussed
below.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_1, © Springer Science+Business Media, LLC, part of Springer Nature 2022

AMS is a quantitative method, providing good linearity over six
orders of magnitude down to the attomole ($10^{-18}$ mol) level [7–9].
However, AMS functions solely as a carbon isotope detector
and does not provide any structural information about the analytes.
For this reason, Lawrence Livermore National Laboratory (LLNL)
developed the AMS liquid sample interface [10, 11] and coupled it
with a QTOF mass spectrometer to enable real-time analysis of
samples separated by HPLC. Using the nomenclature recom-
mended by Sacks [12], this combined analytical method is referred
to as parallel accelerator and molecular mass spectrometry
(PAMMS). PAMMS is useful in drug and toxicant metabolism
experiments, basic metabolism, reactive chemical adduct detection,
and other microtracer experiments. The basic PAMMS analysis
procedure consists of sample collection from radioisotope labeling
experiments, preparation of samples for analysis, analyte separation
by high-performance liquid chromatography, stable carbon and
radiocarbon analysis by AMS, analyte characterization by molecular
mass spectrometry, and data analysis. This protocol is composed of
two basic parts:
1. Designing and performing isotope labeling experiments.
2. Measuring samples by PAMMS and interpreting the results.
The first part of the protocol, designing and performing label-
ing experiments, can be done at the end user’s laboratory facilities.
The second part, performing PAMMS measurements, can only be
performed at facilities possessing PAMMS instrumentation.
PAMMS instrumentation requires significant financial and techni-
cal investments beyond the level of affordability for most labora-
tories. Fortunately, organizations such as LLNL’s National
User Resource for Biological Accelerator Mass Spectrometry
make PAMMS available to researchers throughout the world.
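
To make the contrast between decay counting (LSC) and atom counting (AMS) concrete, the following Python sketch estimates how many decay events per minute an attomole of carbon-14 would produce. The half-life of 5730 years, the constant names, and the function name are assumptions introduced for illustration; they are not taken from this protocol.

```python
import math

AVOGADRO = 6.02214076e23            # atoms per mole
C14_HALF_LIFE_YEARS = 5730.0        # assumed half-life of carbon-14
MINUTES_PER_YEAR = 365.25 * 24 * 60


def c14_decays_per_minute(moles_c14: float) -> float:
    """Expected decay rate (decays per minute) for a given amount of 14C."""
    atoms = moles_c14 * AVOGADRO
    # First-order decay: rate = lambda * N, with lambda = ln(2) / half-life.
    decay_constant = math.log(2) / (C14_HALF_LIFE_YEARS * MINUTES_PER_YEAR)
    return atoms * decay_constant


if __name__ == "__main__":
    amount = 1e-18  # one attomole
    atoms = amount * AVOGADRO
    rate = c14_decays_per_minute(amount)
    print(f"1 amol of 14C: {atoms:.3g} atoms, {rate:.3g} decays per minute")
```

Under these assumptions, an attomole of carbon-14 contains roughly 600,000 atoms but yields well under one decay per hour, which illustrates why a method that counts atoms can measure samples that are impractical to quantify by decay counting.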

2 Experimental Design, Laboratory Requirements, and Materials

2.1 Deciding to Use PAMMS

PAMMS has unique capabilities and limitations that must be understood
before deciding on the most suitable analytical method for a
desired experiment. For many researchers, analysis cost is an important
consideration. Access to PAMMS through the National User
Resource for Biological Accelerator Mass Spectrometry is free for
nearly all investigators. With this in mind, the central questions a
researcher needs to answer are:
1. What information is needed?
2. Which analytical method(s) provide the needed information?
3. If more than one analytical method is suitable, which method
provides the best results at the most economical total cost?
PAMMS is a powerful tool that can be used to fill gaps in the
capabilities of other methods. Because PAMMS consists of a coupled
analytical method, the benefits and limitations of each individual
method will be considered first individually and then together.
The flowchart shown in Fig. 1 can be used to determine if PAMMS
is the appropriate analytical method for a desired experiment.

Fig. 1 Decision flowchart for selecting the use of PAMMS or other analytical methods for sample
measurement.

1. AMS measurements:
AMS is suitable for the measurement of carbon-14, with
sensitivity down to the mid-zeptomole ($10^{-21}$ mol) range.
Measurements of stable carbon are sensitive down to the
mid-nanogram range. Many types of biological experiment
make use of isotope tracers. A subset of these experiments can
be performed using carbon-14 and AMS measurements. AMS
may be the most appropriate analytical method if a carbon-14
tracer is needed and the activity of samples cannot be measured
by LSC.
2. HPLC separations and mass spectrometry measurements:
Many experiments benefit from AMS but do not need
HPLC separation of analytes or molecular mass spectrometry.
For example, if the goal of an experiment simply requires
measurement of compound uptake by a particular cell type,
there is no need for HPLC separation, and the isolated cells can
be measured by AMS directly. Alternatively, it may be necessary
to identify specific metabolites from cell extracts, in which case
HPLC separation is required. Similar considerations should be
used to determine if molecular mass spectrometry is necessary
and useful for answering a given scientific question. If both
separation and molecular mass spectrometry are required,
PAMMS may be the most suitable analytical method for the
experiment. A simplified block diagram of the PAMMS system
is shown in Fig. 2.

Fig. 2 Block diagram of the parallel accelerator mass spectrometer-molecular
mass spectrometer (PAMMS) system. HPLC effluent is split to the molecular
mass spectrometer and the AMS liquid sample interface for simultaneous
measurement on both instruments
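
The selection criteria described in this section can be summarized as a small decision function. The Python sketch below is only one possible reading of the text and is not a reproduction of the flowchart in Fig. 1; the function name, the argument names, and the returned labels are hypothetical, and the rule that LSC is preferred whenever it can measure the sample is an assumption.

```python
def suggest_method(needs_c14_tracer: bool,
                   measurable_by_lsc: bool,
                   needs_hplc_separation: bool,
                   needs_molecular_ms: bool) -> str:
    """Suggest an analytical method from four yes/no answers.

    Hypothetical simplification of the criteria discussed in the text.
    """
    if not needs_c14_tracer:
        # Without a carbon-14 tracer, AMS offers no advantage here.
        return "other analytical method"
    if measurable_by_lsc:
        # Assumption: if the activity is high enough for LSC, use LSC.
        return "LSC"
    if needs_hplc_separation and needs_molecular_ms:
        # Both separation and structural information are required.
        return "PAMMS"
    # Tracer needed, activity too low for LSC, no coupled MS required.
    return "AMS"


if __name__ == "__main__":
    print(suggest_method(True, False, True, True))
    print(suggest_method(True, False, False, False))
```

A real decision would also weigh total cost and instrument access, which this sketch leaves out.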

2.2 Designing an Experiment

Although each PAMMS experiment is unique and must be tailored
to the specific scientific question of interest, some basic principles
can be applied to all PAMMS experiments to ensure that results are
correct and useful. Investigators should answer the following questions:
1. What quantity of experimental compound is required for the
experimental system? In this step, ignore the isotope label and
estimate the amount of compound needed to produce the
desired biological effect (if any).
(a) Define the type of experiment to be performed. Use
known or estimated parameters such as mode of administration,
bioavailability, cell doubling time, etc. as appropriate
to the specific experiment. If multiple dosing or
collection events will occur, consider the appropriate
time intervals. The list below is not exhaustive, but some
of the most common experiment types include:
• Animal or human dosing of a labeled compound for
metabolism or pharmacokinetics characterization.
• Cell culture experiment for determination of compound
uptake, metabolism, or subcellular localization.
• Animal or cell culture experiment for detection of protein
or DNA adducts formed by a reactive compound
or reactive metabolites.

After completion of this step, you should have an estimate of
the total chemical quantity of the compound of interest that will be
needed to perform each experiment.
2. What is the activity of the labeled compound? Is the compound
available through commercial suppliers, or will a custom synthesis
be necessary? It is important to characterize the chemical
purity, radiopurity, and specific activity of the labeled
compound.
3. What is the minimum amount of label that must be administered
to enable robust detection and measurement?
4. Can the compound be measured by molecular mass spectrometry
at the level at which it is present?
5. Are HPLC and molecular mass spectrometry methods for the
target compound(s) available?
6. What is the minimum target activity or Fraction Modern to be
achieved in the sample? For example, if the target Fraction
Modern of labeled glucose is 10 Modern, the required sample
activity can be calculated as follows:

\[
\begin{aligned}
&\text{Glucose molecular weight} = 180.1559\\
&\text{Glucose molecular formula: } \mathrm{C_6H_{12}O_6}\\
&\text{Glucose percent carbon by mass} = (6 \times 12.0107)/180.1559 \approx 40\%\\
&1\ \text{Fraction Modern} = 6.11\ \text{fCi/mg C, so } 10\ \text{Fraction Modern glucose} = 61.1\ \text{fCi/mg C}\\
&(61.1\ \text{fCi/mg C})\,(0.40\ \text{mg C/mg glucose})\,(180.1559\ \text{mg glucose/mmol glucose}) \approx 4.4\ \text{pCi/mmol } {}^{14}\text{C-glucose}
\end{aligned}
\]

Any anticipated dilution effects must be considered as they will
impact the measured isotope ratio. Particularly in time course
isotope tracer experiments, it is essential to determine the correct
dose of the labeled material to enable useful measurements at all
time points. In the case that a biological effect is not expected, the
dose can be calculated based on the level of carbon-14 required to
detect a signal and the total chemical quantity of the compound
required for molecular mass spectrometry. If observation of a
biological effect is intended, dose should be calculated with consideration
of the dose required to elicit the desired effect, the
detection range of the molecular mass spectrometry method, and
the quantity of carbon-14 required for accurate AMS
measurement.
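
The glucose calculation in this section generalizes to other compounds. The Python sketch below is a minimal illustration that reuses the conversion factor quoted in the text (1 Fraction Modern = 6.11 fCi/mg C) and the atomic weight of carbon from the worked example; the function and parameter names are hypothetical.

```python
FCI_PER_MG_C_PER_MODERN = 6.11   # from the text: 1 Fraction Modern = 6.11 fCi/mg C
CARBON_ATOMIC_WEIGHT = 12.0107   # g/mol, as used in the glucose example


def target_activity_pci_per_mmol(fraction_modern: float,
                                 molecular_weight: float,
                                 carbon_atoms: int) -> float:
    """Activity (pCi/mmol compound) needed to reach a target Fraction Modern."""
    # Mass fraction of carbon in the compound (mg C per mg compound).
    carbon_mass_fraction = carbon_atoms * CARBON_ATOMIC_WEIGHT / molecular_weight
    # Activity per mg of compound at the target Fraction Modern.
    fci_per_mg_compound = (fraction_modern * FCI_PER_MG_C_PER_MODERN
                           * carbon_mass_fraction)
    # Molecular weight in g/mol is numerically equal to mg/mmol.
    fci_per_mmol = fci_per_mg_compound * molecular_weight
    return fci_per_mmol / 1000.0  # convert fCi to pCi


if __name__ == "__main__":
    glucose = target_activity_pci_per_mmol(10.0, 180.1559, 6)
    print(f"Glucose at 10 Modern: {glucose:.1f} pCi/mmol")
```

For glucose at 10 Modern this gives about 4.4 pCi/mmol, matching the worked example. The sketch does not account for dilution of the label in the experimental system, which must be estimated separately.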

2.3 Laboratory Requirements

Laboratories in which biological AMS experiments and sample
preparation procedures are performed must be clean and free of
carbon-14 contamination that could compromise experimental
results. Work areas that have previously been used for work with
radiochemicals are generally unsuitable for use because even fCi
quantities of 14C can contaminate experimental samples [13]. Any
laboratory space intended for use with AMS experiments should
first be checked to verify the absence of carbon-14 contamination.
In addition, regular monitoring during ongoing experiments is
essential to ensure that work spaces and laboratory equipment
remain free of contamination. Due to the high sensitivity of AMS,
contamination frequently cannot be detected by liquid scintillation
counting and must be identified using specific survey techniques as
described elsewhere [13].
The use of disposable materials (test tubes, pipet tips, etc.)
where possible is strongly encouraged to avoid the possibility of
spreading contamination between samples. Chemicals used in sam-
ple generation and preparation should be well-characterized to be
free of interfering radioisotopes. Addition of excess carbon-
containing compounds to samples should also be scrupulously
avoided, since AMS measures the isotope ratio of radiocarbon to
stable carbon. If addition of carbon is necessary at any processing
step, a careful inventory of added carbon must be maintained to
ensure that AMS measurements can provide meaningful informa-
tion. Separate laboratory spaces need to be used for preparation of
radiochemicals (e.g., diluting high-activity solutions containing
milliCuries to dosing solutions containing picoCuries) and for
performing experiments with lower quantities of radiochemicals.
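
Because AMS reports a ratio of radiocarbon to stable carbon, any carbon added during processing dilutes the measured ratio, and the carbon inventory is what allows the original ratio to be recovered. The Python sketch below shows a simple two-component mass balance under stated assumptions: the carbon masses of the sample and of the added material are known, the isotope ratio of the added carbon is known (zero by default), and all ratios are expressed in the same unit, such as Fraction Modern. The function and parameter names are hypothetical.

```python
def sample_isotope_ratio(measured_ratio: float,
                         sample_carbon_mg: float,
                         added_carbon_mg: float,
                         added_carbon_ratio: float = 0.0) -> float:
    """Recover the isotope ratio of the sample carbon before carbon was added.

    Mass balance: measured * (sample + added) = sample_ratio * sample
                                                + added_ratio * added
    """
    if sample_carbon_mg <= 0:
        raise ValueError("sample_carbon_mg must be positive")
    if added_carbon_mg < 0:
        raise ValueError("added_carbon_mg cannot be negative")
    total_carbon_mg = sample_carbon_mg + added_carbon_mg
    return ((measured_ratio * total_carbon_mg
             - added_carbon_ratio * added_carbon_mg) / sample_carbon_mg)


if __name__ == "__main__":
    # Example values are illustrative only: 1 mg of sample carbon mixed
    # with 9 mg of added carbon that carries no radiocarbon.
    print(sample_isotope_ratio(1.0, 1.0, 9.0))
```

Carbon that is added but not inventoried cannot be corrected for in this way, which is why uninventoried additions compromise the measurement.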
2.4 Labeled Analyte

Prepare analytes to be measured by HPLC-MS/MS. Dosing with
the correct amount of radiochemical is essential. Radiochemical
activity levels, as well as radiopurity and chemical purity, should
always be verified before use. Not all classes of compound are
amenable to PAMMS analysis. Samples with significant volatility
at room temperature will be lost in the drying oven. Some com-
pounds also undergo limited thermal degradation at temperatures
present in the drying oven. The primary source of error in AMS
sample preparation is the inadvertent addition of radiocarbon or
stable carbon from sources that remain uninventoried in the final
mass balance. Radiocarbon contamination can originate from labo-
ratory equipment, reagents such as solvents, and reuse of glassware
or pipettes. Miscalculations can also result in incorrect dosing.

2.5 Mobile Phase Preparation

In general, analytes and mobile phases that are amenable to measurement
by LC-MS electrospray ionization are also suitable for
PAMMS analysis. All reagents should be LC-MS grade if available.
Mobile phases should be prepared fresh in bottles not used for
other laboratory procedures. When cleaning dedicated LC-MS
glassware, avoid the use of detergents as they can contaminate the
mass spectrometer. Reagents containing nonvolatile salts such as
sodium chloride and phosphates should be avoided. These salts will
not evaporate in the drying oven and will precipitate in the com-
bustion oven, clogging the aperture and breaking the wire.

3 Methods

As indicated in the experimental design section, details of each
experiment are unique. In all experiments, it is important that the
amount of carbon-14 label and total investigational compound
used will produce a signal that is measurable by PAMMS. It is also
essential to avoid use of an excessive amount of labeling material,
which will exceed the dynamic range of the instrument, making it
impossible to derive useful quantitative information from the
experiment. When performing experiments, samples should be
collected using disposable materials such as cell culture flasks, test
tubes, and pipette tips to avoid the possibility of contaminating
laboratory equipment and workspaces. Separate sets of pipettes
should be used for dilution of stock radiochemicals and for sample
preparation for AMS, since the total radiochemical activity used in
these different stages of the experiment will differ by several orders
of magnitude. This large difference in activity increases the likelihood
of spreading contamination if the same set of pipettes is used
for both high- and low-level material handling. A typical experiment
consists of the following steps:
1. Calculate the quantity of radiochemical required for the experiment.
Measure the activity of the radiochemical stock material
by liquid scintillation counting. Perform any additional analyses
necessary to confirm chemical and radiologic purity of the
dosing material. This work is performed in a standard radiochemical
laboratory where radiochemicals are routinely used as
supplied by the manufacturer (typically milliCurie range total
activity).
2. Make dilutions of the radiochemical stock to give the total
radiologic activity needed for the experiment as calculated in
step 1.
3. Transport the diluted radiochemical dosing material to the
appropriate low-level radiochemical laboratory to perform the
experiment (total activity typically in picoCuries to nanoCuries
for cell culture experiments or low microCuries for animal or
human dosing experiments). Human dosing experiments typically
range from 10 nCi/person to 50 μCi/person [7].
4. Prepare appropriate control samples (pre-dose or undosed) to
enable carbon-14 background measurements.
5. Dose the experimental system with carbon-14 tracer material at
the desired level based on calculations and LSC data.
6. Collect sample material for analysis at the appropriate time
intervals or endpoints, as required by the experimental design.
Perform sample extractions and cleanup as appropriate to pre-
pare samples for PAMMS analysis. Be sure to account for all
carbon added (if any) during the sample preparation process.
7. If an experiment has been performed in humans or animals,
samples should always be measured by LSC prior to measure-
ment by PAMMS to ensure that the activity of the samples is
within the measurement range of the PAMMS system. This is
because the doses used in human and animal studies are typi-
cally too high to be measured by AMS, and substantial dilution
or biological partitioning of the dosing material is expected in
these studies. Dilution is typically much smaller in cell culture
experiments, frequently eliminating the need for measurement
of samples by LSC prior to PAMMS measurement.
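
The activity bookkeeping in steps 1 to 3 can be sketched as a short calculation. The following is only an illustration: the stock concentration (1 mCi/mL), the dosing concentration (10 nCi/mL), the limit of 100-fold per dilution step, and all function names are assumptions chosen for the example, not values prescribed by this protocol.

```python
# Illustrative sketch: planning the dilution of a carbon-14 stock solution.
# All names and numerical values are hypothetical examples.

CI_PER_UNIT = {"Ci": 1.0, "mCi": 1e-3, "uCi": 1e-6, "nCi": 1e-9, "pCi": 1e-12}


def to_curies(value, unit):
    """Convert an activity given in Ci, mCi, uCi, nCi, or pCi to curies."""
    return value * CI_PER_UNIT[unit]


def dilution_factor(stock_ci_per_ml, target_ci_per_ml):
    """Overall fold-dilution needed to go from stock to dosing concentration."""
    if target_ci_per_ml <= 0 or stock_ci_per_ml < target_ci_per_ml:
        raise ValueError("target must be positive and not exceed the stock")
    return stock_ci_per_ml / target_ci_per_ml


def serial_steps(total_factor, max_step=100.0):
    """Split a large dilution into equal steps no larger than max_step each."""
    n = 1
    while total_factor ** (1.0 / n) > max_step:
        n += 1
    return n, total_factor ** (1.0 / n)


stock = to_curies(1.0, "mCi")     # assumed stock concentration per mL
target = to_curies(10.0, "nCi")   # assumed dosing concentration per mL
factor = dilution_factor(stock, target)
n_steps, per_step = serial_steps(factor)
print(f"overall dilution {factor:.0f}-fold in {n_steps} steps of {per_step:.1f}-fold")
```

The diluted material should still be checked by LSC, as described in the steps above; the calculation only provides the planned values.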

3.1 Making PAMMS Measurements

After samples have been prepared for PAMMS measurement and it
has been determined that the activity is within the range of AMS
detection, samples can be measured by PAMMS as follows:
1. Set up the molecular mass spectrometer. This step includes
calibration, lockmass setup, ionization mode selection, and
other instrument-specific parameters.
2. Set up the HPLC system and load the appropriate separation
method. Prepare fresh mobile phase(s) as needed for the
separation method. Equilibrate the column with the starting
mobile phase composition.
3. Tune the accelerator mass spectrometer for use with gas
targets.
4. Turn on the liquid sample interface. When the system has
stabilized at the operating temperature, measurements can
be made.
5. Measure standards and samples on the PAMMS system.
6. After data acquisition has been completed, put the PAMMS
system into standby mode.

3.2 Interpreting the Data

PAMMS data consist of data collected from three measurement
instruments: HPLC, molecular mass spectrometer, and AMS.
HPLC and molecular mass spectrometer data sets are collected
together and can be analyzed using the same data analysis software
package, whereas the AMS results must be analyzed separately and
then linked to the HPLC and molecular mass spectrometry
data. The best way to do this is to plot the time and intensity of each
signal together using a plotting package. To overlay peaks, it is
necessary to correct the time scale of each instrument for which
data have been collected. Data merging and analysis can be performed
as follows:
1. Determine the time delay between instruments. Each analyte
has a specific column retention time but reaches the molecular
mass spectrometer and the AMS detector at different times due
to differences in flow path lengths. In the case that the analyte
of interest is detected by the HPLC detector, align the spectra
by subtracting the time difference between detection by the
HPLC detector and detection by the molecular mass spectrom-
eter from the molecular mass spectrometer detection time
scale. Perform the same procedure to align the AMS spectrum.
In the case that the analyte of interest is invisible to the HPLC
detector but visible to the molecular mass spectrometer, it is
only necessary to align the molecular mass spectrometer chro-
matogram with the AMS detector chromatogram. Depending
on the complexity of the molecular mass spectrometer total ion
chromatogram, it may be more useful to use the extracted ion
chromatogram rather than the total ion chromatogram for
alignment.
2. Quantify HPLC peaks using standard analysis methods.
3. Verify identity of analytes using molecular mass spectra and
MS/MS spectra. Quantify analytes using standard analysis
methods for molecular mass spectrometry.
4. Using liquid sample AMS analysis software, bin and integrate
carbon-14 and stable carbon peaks for each analyte of interest
to calculate carbon-14 and stable carbon content.
5. Combine the PAMMS data into a report that answers the question
posed by the original experiment, including appropriate unit
conversions. PAMMS can be used to generate isotope ratios or
total carbon-14 activity levels. The
final units calculated must be consistent with the original
biological question.
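
The time alignment in step 1 and the peak integration in step 4 can be illustrated with a small sketch. It assumes that each trace contains a single, well-resolved peak of the same analyte, so that the delay between detectors can be taken as the difference in apex times. The traces, function names, and integration window are invented for the example and do not represent the liquid sample AMS analysis software.

```python
# Illustrative sketch: align detector traces on a common time axis.
# Traces are lists of (time_s, intensity) pairs; all data below are made up.


def apex_time(trace):
    """Return the time of the most intense point in a trace."""
    return max(trace, key=lambda point: point[1])[0]


def shift_trace(trace, delay_s):
    """Subtract a fixed delay from every time point of a trace."""
    return [(t - delay_s, y) for t, y in trace]


def align_to_reference(reference, other):
    """Shift `other` so that its peak apex coincides with the reference apex."""
    delay = apex_time(other) - apex_time(reference)
    return shift_trace(other, delay), delay


def peak_area(trace, t_start, t_end):
    """Trapezoidal integral of the intensity between t_start and t_end."""
    pts = [(t, y) for t, y in trace if t_start <= t <= t_end]
    return sum((t2 - t1) * (y1 + y2) / 2.0
               for (t1, y1), (t2, y2) in zip(pts, pts[1:]))


hplc = [(58, 0), (59, 2), (60, 10), (61, 2), (62, 0)]    # HPLC detector
ms = [(64, 0), (65, 30), (66, 150), (67, 30), (68, 0)]   # molecular MS
ams = [(78, 0), (79, 5), (80, 25), (81, 5), (82, 0)]     # AMS carbon-14 signal

ms_aligned, ms_delay = align_to_reference(hplc, ms)
ams_aligned, ams_delay = align_to_reference(hplc, ams)
c14_area = peak_area(ams_aligned, 58, 62)
print(ms_delay, ams_delay, c14_area)
```

For complex samples, the delay is better determined once with a standard and then applied as a fixed offset, rather than re-estimated from each chromatogram.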

Acknowledgments

This work was performed under the auspices of the
U.S. Department of Energy by Lawrence Livermore National Laboratory
(LLNL) under Contract DE-AC52-07NA27344. This
work was performed at the LLNL Research Resource for Biomedi-
cal Accelerator Mass Spectrometry with support from the National
Institutes of Health, National Institute of General Medical Sciences
under Grant P41GM103483. Release Number: LLNL-BOOK-
734943.

References

1. Sporty J, Lin SJ, Kato M, Ognibene T, Stewart B, Turteltaub K et al (2009)
Quantitation of NAD+ biosynthesis from the salvage pathway in
Saccharomyces cerevisiae. Yeast 26:363–369
2. Sporty JL, Kabir MM, Turteltaub KW, Ognibene T, Lin SJ, Bench G (2008)
Single sample extraction protocol for the quantification of NAD and NADH
redox states in Saccharomyces cerevisiae. J Sep Sci 31:3202–3211
3. Stewart BJ, Navid A, Turteltaub KW, Bench G (2010) Yeast dynamic
metabolic flux measurement in nutrient-rich media by HPLC and
accelerator mass spectrometry. Anal Chem 82:9812–9817
4. Stewart BJ, Navid A, Kulp KS, Knaack JL, Bench G (2013) D-Lactate
production as a function of glucose metabolism in Saccharomyces
cerevisiae. Yeast 30:81–91
5. Koarashi J, Iida T, Moriizumi J, Asano T (2004) Evaluation of 14C
abundance in soil respiration using accelerator mass spectrometry.
J Environ Radioact 75:117–132
6. Kunioka M, Ninomiya F, Funabashi M (2009) Biodegradation of
poly(butylene succinate) powder in a controlled compost at 58 degrees C
evaluated by naturally-occurring carbon 14 amounts in evolved CO(2)
based on the ISO 14855-2 method. Int J Mol Sci 10:4267–4283
7. Brown K, Dingley KH, Turteltaub KW (2005) Accelerator mass
spectrometry for biomedical research. Methods Enzymol 402:423–443
8. Brown K, Tompkins EM, White IN (2006) Applications of accelerator mass
spectrometry for pharmacological and toxicological research.
Mass Spectrom Rev 25:127–145
9. Turteltaub KW, Vogel JS (2000) Bioanalytical applications of accelerator
mass spectrometry for pharmaceutical research. Curr Pharm Design
6:991–1007
10. Thomas AT, Ognibene T, Daley P, Turteltaub K, Radousky H, Bench G (2011)
Ultrahigh efficiency moving wire combustion interface for online
coupling of high-performance liquid chromatography (HPLC).
Anal Chem 83:9413–9417
11. Thomas AT, Stewart BJ, Ognibene TJ, Turteltaub KW, Bench G (2013)
Directly coupled high-performance liquid chromatography-accelerator
mass spectrometry measurement of chemically modified protein and
peptides. Anal Chem 85:3644–3650
12. Sacks GL, Derry LA, Brenna JT (2006) Elemental speciation by parallel
elemental and molecular mass spectrometry and peak profile matching.
Anal Chem 78:8445–8455
13. Buchholz BA, Freeman SPHT, Haack KW, Vogel JS (2000) Tips and traps in
the 14C bio-AMS preparation laboratory. Nucl Instrum Meth Phys Res B
172:404–408
Chapter 2

Fast Sampling of the Cellular Metabolome


Walter M. van Gulik, Andre B. Canelas, Hilal Taymaz-Nikerel,
Rutger D. Douma, Lodewijk P. de Jonge, and Joseph J. Heijnen

Abstract
Obtaining meaningful snapshots of the metabolome of microorganisms requires rapid sampling and
immediate quenching of all metabolic activity, to prevent any changes in metabolite levels after sampling.
Furthermore, a suitable extraction method is required ensuring complete extraction of metabolites from
the cells and inactivation of enzymatic activity, with minimal degradation of labile compounds. Finally, a
sensitive, high-throughput analysis platform is needed to quantify a large number of metabolites in a small
amount of sample. An issue which has often been overlooked in microbial metabolomics is the fact that
many intracellular metabolites are also present in significant amounts outside the cells and may interfere
with the quantification of the endometabolome. Attempts to remove the extracellular metabolites with
dedicated quenching methods often induce release of intracellular metabolites into the quenching solution.
For eukaryotic microorganisms, this release can be minimized by adaptation of the quenching method. For
prokaryotic cells, this has not yet been accomplished, so the application of a differential method whereby
metabolites are measured in the culture supernatant as well as in total broth samples, to calculate the
intracellular levels by subtraction, seems to be the most suitable approach. Here we present an overview of
different sampling, quenching, and extraction methods developed for microbial metabolomics, described in
the literature. Detailed protocols are provided for rapid sampling, quenching, and extraction, for measure-
ment of metabolites in total broth samples, washed cell samples, and supernatant, to be applied for
quantitative metabolomics of both eukaryotic and prokaryotic microorganisms.

Key words Fast sampling, Quenching, Microbial metabolomics, Endometabolome, Exometabolome,
Isotope dilution mass spectrometry

1 Introduction

To obtain a systems biology understanding of the behavior of the
complex machinery of microbial metabolism and its regulation, the
cells must be studied quantitatively at all hierarchical
levels (e.g., genome, transcriptome, proteome, fluxome, and metabolome),
with particular attention to the interactions between them. For some of
these levels, the techniques for high-throughput analysis have
developed faster than for others. In particular, whole-genome

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_2, © Springer Science+Business Media, LLC, part of Springer Nature 2022

sequencing and genome-wide transcriptome analysis have become
common practice, while methods for the quantification of intracellular
fluxes through metabolite balancing or based on stable isotope
(e.g., 13C) labeling have been well established [1]. In contrast to
this, quantitative proteome analysis is still far from being a routine
technique, although significant progress has been made during the
past years [2]. Whole-proteome measurement is not only hampered
because the abundance of individual proteins may differ by a factor
of a million, but also by the fact that many proteins are subject to
posttranslational modifications. It has, for example, been estimated
that a single posttranslational modification (N-terminal methionine
cleavage, or NME) alters roughly half of the proteins in Escherichia
coli [3].
Whole-metabolome measurement is hampered by large differ-
ences in abundance, structure, and properties of the individual
metabolites. Nevertheless, targeted metabolome measurements in
microorganisms, mammalian tissues, and plants have already been
carried out for more than half a century (see ref. 4 and references
therein).

1.1 Method Development

1.1.1 Methods for Rapid Sampling and Quenching

It is well known that many metabolites, especially the intermediates
of the central metabolic pathways and connected cofactors like ATP
and NADH, have turnover times in the order of seconds, as can be
calculated from their in vivo pool sizes and conversion rates. This
implies that a proper snapshot of the intracellular metabolite levels
can only be obtained if sampling and subsequent arrest of metabolic
activity are sufficiently fast, that is, significantly faster than the
turnover times of the metabolite pools.
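
The turnover time mentioned above is simply the pool size divided by the flux through the pool. The sketch below illustrates the calculation. The pool size, the flux, and the tenfold safety margin are placeholder assumptions for the example, not measured values or a recommended criterion.

```python
# Illustrative sketch: turnover time of a metabolite pool.
# The pool size and flux below are placeholder values, not measurements.


def turnover_time_s(pool_umol_per_gdw, flux_umol_per_gdw_per_s):
    """Turnover time (s) = pool size / flux through the pool."""
    if flux_umol_per_gdw_per_s <= 0:
        raise ValueError("flux must be positive")
    return pool_umol_per_gdw / flux_umol_per_gdw_per_s


def sampling_is_fast_enough(sampling_time_s, turnover_s, safety_factor=10.0):
    """True if sampling plus quenching is at least safety_factor times faster."""
    return sampling_time_s * safety_factor <= turnover_s


tau = turnover_time_s(pool_umol_per_gdw=8.0, flux_umol_per_gdw_per_s=4.0)
ok = sampling_is_fast_enough(sampling_time_s=0.1, turnover_s=tau)
print(tau, ok)
```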
Biochemists have been aware of this for many decades as can be
inferred from publications from the early 1960s and 1970s, e.g. on
the quenching and extraction of rat liver tissue [5–8]. Some of the
early works on metabolite measurements in microbial cells already
emphasized the importance of arresting all metabolic activity as fast
as possible. With the aim to measure the ATP levels in fermentor
cultures of E. coli under different growth conditions, Cole et al. [9]
took 2 mL broth samples from the fermentor directly into ice-cold
perchloric acid, achieving simultaneous quenching and extraction.
Although the disadvantage of this procedure was that it resulted in
a dilution of the sample, this was not a problem in this case because
the authors used a sensitive luciferase-based assay for the measure-
ment of the ATP level. Another key disadvantage of combining
quenching with extraction of complete culture samples is that no
distinction can be made between the metabolites present in the cells
and in the supernatant. Partly for this reason, but also because of
the relative insensitivity of most of the (in the past mainly enzyme
based) metabolite assays, often a separation step has been applied,
i.e., filtration or centrifugation, followed by resuspension in a small
volume of medium prior to quenching, for example, in cold
perchloric acid (PCA). However, during the delay caused by the
concentration procedure, metabolic conversion processes can still
proceed, resulting in significant changes in metabolite levels. To
avoid this, a rapid filtration method was developed, whereby the
filter cake was washed with a cold (−40 °C) 50/50 v/v methanol/water
solution to quench metabolic activity directly after filtration
[10]. Later, de Koning and van Dam [11] proposed a method
where sampling from yeast cultures was directly performed into a
cold (−40 °C) methanol/water mixture (60/40 v/v) without
prior filtration. Subsequent separation of cells and supernatant
was accomplished by cold centrifugation, thereby including a cold
washing step to remove extracellular compounds. This quenching
method is currently the most widely applied procedure for eukary-
otic cells and in principle allows the measurement of intracellular
metabolites without interference of compounds present in the
cultivation medium. A schematic representation of this procedure
is shown in Fig. 1a.
It has been reported, however, that in case of prokaryotic cells,
the application of the cold methanol quenching method results in
significant leakage of metabolites into the quenching solution
[13, 14]. To quantify metabolite leakage during cold methanol
quenching of E. coli, Taymaz-Nikerel et al. [15] applied a proce-
dure developed by Canelas et al. [16], whereby metabolite mea-
surements are carried out in all different sample fractions and a mol
balance approach is used to track down the fate of the metabolites
(see Fig. 2). From the results for a few different metabolites, shown
in Fig. 3, it can be seen that after quenching and subsequent
washing of E. coli cells, the major part of the intracellular metabolites
is recovered in the quenching and washing solutions. Therefore,
a differential method, whereby metabolite measurements are
performed in total broth samples as well as in the supernatant, to

Fig. 1 Schematic overview of two different sampling procedures, left panel: rapid sampling and conventional
cold methanol quenching combined with cold centrifugation and centrifugation-based washing; right panel:
rapid sampling and cold methanol quenching combined with cold filtration and filtration-based washing
(Figure from Douma et al. [12])

Fig. 2 Measurements carried out in different sample fractions to enable a mass balance-based approach for
quantification of metabolite leakage during quenching (Figure from Canelas et al. [16])

obtain the intracellular amounts by subtraction, has been developed
and successfully applied for metabolome measurements in
E. coli [15] (see Fig. 4).
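
The subtraction underlying the differential method can be written out as follows. This is a minimal sketch: it assumes that the volume occupied by the cells is negligible, so that broth and supernatant concentrations can be subtracted directly, and that the two measurements are independent when their standard errors are combined. The function names and numbers are illustrative only.

```python
# Illustrative sketch of the differential method: intracellular amount
# = (total broth - supernatant) / biomass concentration.
# Assumes the cell volume fraction is negligible; values are made up.
import math


def intracellular_umol_per_gdw(broth_umol_per_l, supernatant_umol_per_l,
                               biomass_gdw_per_l):
    """Intracellular metabolite amount per gram dry weight by subtraction."""
    if biomass_gdw_per_l <= 0:
        raise ValueError("biomass concentration must be positive")
    return (broth_umol_per_l - supernatant_umol_per_l) / biomass_gdw_per_l


def standard_error(se_broth, se_supernatant, biomass_gdw_per_l):
    """Standard error of the difference, assuming independent measurements."""
    return math.sqrt(se_broth ** 2 + se_supernatant ** 2) / biomass_gdw_per_l


amount = intracellular_umol_per_gdw(12.0, 4.0, 8.0)
error = standard_error(0.3, 0.4, 8.0)
print(f"{amount:.3f} +/- {error:.4f} umol/gDW")
```

Because the result is a difference of two measurements, its relative error grows quickly when the extracellular amount approaches the total broth amount.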

1.1.2 Fast Sampling Devices

Probably the first attempt at rapid sampling from a laboratory-scale
bioreactor was reported in 1969 by Harrison and Maitra
[17]. Sampling was performed via a port in the base plate of the
reactor. To remove the broth from the dead volume of the sampling
port prior to the withdrawal of the sample, 5 mL of culture was
allowed to flow to waste shortly before sampling. The authors
measured the sampling time and the subsequent time required to
fully mix the sample with the quenching solution, by sampling
9 mL of a 10 M alkali solution into 1 mL of concentrated HCl in
a test tube to which a Thymol Blue indicator was added. The
sampling procedure was recorded with a cine-camera at
67 frames/s. In this way, they determined that the maximum
time interval between the removal of the sample from the culture
vessel and coming into contact with the quenching solution in the
sample tube was approximately 0.1 s. Subsequent mixing with the
quenching solution took about 0.08 s. This rapid sampling method
was applied to measure the levels of the adenine nucleotides and
some intermediates of central metabolism in chemostat cultures of
Klebsiella aerogenes under different oxygen supply conditions and
as response to substrate pulses.
With the aim to avoid the contamination of the sample with the
contents of the dead volume of the sample valve, Iversen [18]
constructed a rapid sampling valve wherein the remaining broth
was removed from the dead space of the valve after each sampling

Fig. 3 Examples of results of the balancing approach for quantification of metabolite leakage during the cold
methanol quenching procedure: (F2) amount measured in the filtrate, ICcal (= B − F2) calculated amount in the
cell pellet, (WS) measured amount in the washing solution, (QS) measured amount in the quenching solution,
(IC) measured amount in the biomass pellet. Bars represent the averages, with their standard errors, of four
replicate samples taken from two independent chemostat experiments, analyzed in duplicate (Figure from
Taymaz-Nikerel et al. [15])

Fig. 4 Workflow of the differential method for intracellular metabolite quantification (Figure from Taymaz-
Nikerel et al. [15])

by flushing the valve with a disinfectant, sterile water, or sterile air.
To minimize the part of the sample going to waste, and thus
avoid too large a decrease of the culture volume as a result of
sampling, Theobald et al. [19] developed a rapid sampling system
with a minimal dead volume. This system consisted of a hypoder-
mic needle inserted into the bioreactor via a silicone membrane,
and a sterilizable miniature valve coupled to an HPLC capillary with
an internal diameter of 0.7 mm. With this system, the dead volume
was only 200 μL and could be neglected compared to the total
sample volume of approximately 5 mL. The sampling speed was
increased by using evacuated sampling tubes. The system allowed
concurrent sampling and quenching in less than 0.5 s with a maxi-
mum sampling frequency in the order of one sample per 5 s, under
aseptic conditions. The system was applied for measuring the
in vivo time profiles of the adenine nucleotides in glucose-limited
chemostat cultures of yeast during transition from glucose limita-
tion to glucose excess.
A limitation of the above-mentioned sampling systems is that
they are manually operated, which is relatively laborious, and that
the variation in sample volume depends on the skills of the opera-
tor. For these reasons, Larsson and Törnkvist [20] developed a
sampling system operated via electrically controlled valves. With
this system, samples could be withdrawn within 0.15 s. In between
sampling, the remaining liquid was removed from the system by
under-pressure. This system was applied for the measurement of
the residual glucose concentration in glucose-limited fed-batch
fermentations, whereby the samples were quenched and extracted
in perchloric acid.
Another example of an electrically operated rapid sampling
system has been published by Lange et al. [21], see Fig. 5. The
system consisted of a sampling port with an internal diameter of
1 mm, connected to a tube adapter. Sampling was started by
removing the dead volume contents by flushing to waste. Subse-
quently, the sample tube was evacuated, and directly thereafter, the
sample was withdrawn from the bioreactor. The liquid flows and
evacuation of the sample tube were controlled by electromagnetic
pinch valves operated by a timer, allowing the sample volume to be
precisely adjusted, i.e., with a standard deviation of less than 2%.
The total inner volume of the sample system was approximately
100 μL, of which 50 μL could not be flushed before sampling and
should be considered as the dead volume. The authors reported
that with this system samples of 1 mL could be withdrawn from a
bioreactor, operated at an overpressure of 0.3 bar, in 0.7 s. The
residence time of the sample in the system was below 100 ms. The
mixing time with the quenching liquid was assumed to be equal to
the value measured by Harrison and Maitra [17] which was 80 ms.
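
The figures quoted for these devices can be compared with two simple quantities: the fraction of the sample that originates from the dead volume, and the time elapsed between withdrawal and complete mixing with the quenching solution. The sketch below uses the numbers given in the text; the function names are hypothetical, and summing residence and mixing times is an assumption made for the illustration.

```python
# Illustrative sketch: dead volume fraction and time to quench,
# using the values quoted in the text for two sampling systems.


def dead_volume_fraction(dead_volume_ul, sample_volume_ml):
    """Fraction of the sample that stems from the unflushed dead volume."""
    return dead_volume_ul / (sample_volume_ml * 1000.0)


def time_to_quench_ms(residence_ms, mixing_ms):
    """Upper estimate of the delay between withdrawal and full quenching."""
    return residence_ms + mixing_ms


theobald_fraction = dead_volume_fraction(200.0, 5.0)   # 200 uL in 5 mL
lange_fraction = dead_volume_fraction(50.0, 1.0)       # 50 uL in 1 mL
lange_delay_ms = time_to_quench_ms(100.0, 80.0)        # residence + mixing
print(theobald_fraction, lange_fraction, lange_delay_ms)
```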

Fig. 5 Schematic representation of the rapid sampling system developed by
Lange et al. [21]. The operation procedure is described in the text

Even with this system, the sampling frequency could not be
increased much above one sample per 5 s, because of the many
manual operations that had to be performed. Therefore, Schaefer
et al. [22] developed a completely automated sampling device,
whereby the sampling tubes were fixed in transport racks which
were moved by a step engine underneath a continuous jet of
sample, with a flow rate of 3.3 mL/s, from a stirred tank bioreactor.
In this way, the sampling tubes containing the quenching solution
could be filled within 220 ms, resulting in a sampling rate of
approximately 4.5 samples per second. This automated rapid sam-
pling device was applied for investigation of the intracellular metab-
olite dynamics of glycolysis in Escherichia coli after rapid glucose
addition to a glucose-limited steady state culture.
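
The relation between jet flow rate, fill time, and sampling rate for such an automated device follows from simple arithmetic, sketched below with the values quoted above. The function names are hypothetical, and the calculation assumes that tubes are filled back to back without any gap between them.

```python
# Illustrative sketch: sample volume and sampling rate for a continuous jet.
# Values are those quoted in the text; back-to-back filling is assumed.


def fill_volume_ml(flow_ml_per_s, fill_time_s):
    """Volume collected in one tube while it passes under the jet."""
    return flow_ml_per_s * fill_time_s


def samples_per_second(fill_time_s):
    """Sampling rate if each tube takes fill_time_s and no time is lost."""
    return 1.0 / fill_time_s


volume = fill_volume_ml(3.3, 0.220)
rate = samples_per_second(0.220)
print(f"{volume:.2f} mL per tube, {rate:.1f} samples per second")
```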
A completely different approach to increase the sampling fre-
quency was developed by Weuster-Botz et al. [23]. The basic idea
was to perform sampling, inactivation of metabolic activity and
extraction of intracellular metabolites in a continuous way in a
tube, connected to a well-controlled bioreactor. In this way, the
highly dynamic metabolite patterns resulting from a sudden distur-
bance of the culture in the reactor were fixed at a certain position in
the sampling tube. The system consisted of a custom-made sam-
pling probe with an inlet of 4 mm diameter, which contained a
second inlet for continuous supply of quenching/extraction solu-
tion and an outlet of 8 mm diameter connected to the sampling
tube. Cold (−40 °C) perchloric acid was used as quenching/extraction
solution, which was mixed with the sample 3 mm from the
entrance of the sample probe. The sampling tube was made from
polyethylene with an internal diameter of 8 mm and a total length
of 100 m and was wound up to a coil with a diameter of 0.5 m.
Before sampling, the tube was completely filled with water, to
ensure a constant pressure-driven flow of sample through the
tube. After perturbing the bioreactor culture, continuous sampling
was started, and the system was operated in such a way that the
complete tube was filled within 200 s. Subsequently, the complete
sampling coil was disconnected and immediately frozen at −80 °C.
To obtain single samples at different reaction times, the frozen tube
was cut into parts with lengths of 0.33 m. In this way, each piece
contained an amount of sample representing a time period of
0.64 s. It was demonstrated by the authors that the system could
be successfully applied to capture the short-time dynamics, on a
sub-second scale, of some glycolytic intermediates of chemostat-cultivated
Zymomonas mobilis as a response to a glucose pulse.
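
The way in which position along the tube translates into reaction time can be sketched with an ideal plug-flow estimate. The calculation below uses the tube dimensions and fill time quoted above and assumes a constant flow with no axial mixing. Under these assumptions, a 0.33 m piece corresponds to roughly 0.66 s, which is close to but not identical with the 0.64 s per piece quoted above. The function names are hypothetical.

```python
# Illustrative sketch: ideal plug-flow estimate for the sampling tube.
# Dimensions and fill time are taken from the text; constant flow is assumed.
import math


def tube_volume_ml(inner_diameter_mm, length_m):
    """Internal volume of a cylindrical tube in mL (1 cm^3 = 1 mL)."""
    radius_cm = inner_diameter_mm / 10.0 / 2.0
    length_cm = length_m * 100.0
    return math.pi * radius_cm ** 2 * length_cm


def time_per_piece_s(piece_length_m, tube_length_m, fill_time_s):
    """Time window represented by one piece of tube under plug flow."""
    return piece_length_m / tube_length_m * fill_time_s


volume_ml = tube_volume_ml(8.0, 100.0)
flow_ml_per_s = volume_ml / 200.0
window_s = time_per_piece_s(0.33, 100.0, 200.0)
n_pieces = int(100.0 / 0.33)
print(f"{volume_ml:.0f} mL, {flow_ml_per_s:.1f} mL/s, "
      f"{window_s:.2f} s per piece, {n_pieces} pieces")
```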
More recently, a different approach for integrated sampling and
extraction from a bioreactor culture has been proposed by Schaub
et al. [24]. Fast heating of the sample was used as the procedure for
simultaneous quenching and extraction. This was achieved by using
a helical coil heat exchanger, which allowed continuous withdrawal
of sample from a bioreactor followed by rapid heating to 95 °C.
The helical geometry was chosen to enhance radial mixing. The
residence time of the sample in the device before heating was
200 ms. Thereafter, the sample was heated at 95 °C for 2.5 s,
which appeared sufficient for complete metabolite extraction.
After extraction, the cells were removed by filtration. This sampling
device allowed withdrawing five samples of 0.7 mL/s.
A dedicated sampling device, the BioScope, has been developed
to carry out pulse response experiments outside the bioreactor
[25, 26]. This device, essentially a mini plug flow reactor that
can be coupled to any bioreactor, combines experimentation and
sampling. The device has been successfully applied to
elucidate short-term metabolite dynamics in different microorganisms
[27–30].
To allow fast sampling from fungal cultures, Lameiras et al.
[31] constructed a system with which a sample could be withdrawn
from an external broth loop connected to a 7-L bench scale biore-
actor with a working volume of 4.5 L. Using a fast peristaltic pump,
the fungal broth was pumped from the reactor via the sampling
device and back into the reactor with a flow rate of 40 mL/s. The
internal diameter of the broth loop was 8 mm, which made sure
that the formation of fungal pellets would not lead to blockage of
the sampling device. An additional advantage of this device is that,
due to the continuous flushing with broth, it has no dead volume
and that the amount of broth withdrawn from the reactor is limited
to the sample itself. The operating principle of the sampling device
is shown schematically in Fig. 6.

Fig. 6 Operation principle of a rapid sampling device for fungal cultivations (Figure from Lameiras et al. [31])

1.1.3 Methods for Fast Quenching of Metabolism

In the literature on fast sampling methods discussed above, all
methods have been applied with certain quenching procedures. It
should be realized, however, that many different combinations of
sampling and quenching methods are possible. Among the different
quenching protocols developed for microorganisms, two main
groups can be recognized, namely procedures which allow separa-
tion of cells and supernatant and procedures which do not. Meth-
ods wherein quenching and metabolite extraction are combined
belong to this last category. Clearly, the methods which do not
allow separation of the cells after quenching are only suitable to
measure compounds of which the amount present in the superna-
tant is negligible compared to the intracellular amount. Published
data for S. cerevisiae and E. coli [15, 31] show that most metabolic
intermediates are also present, in trace amounts, in the extracellular
medium. However, because the volume fraction of
medium is much larger than the volume fraction of cells (roughly
a factor of 100 in laboratory cultivations), these trace amounts may be
sufficient to cause gross overestimation of intracellular pools if they
are not removed or properly taken into account.
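
The size of this overestimation can be illustrated with a short calculation. If the extracellular concentration is a fraction r of the intracellular concentration and the medium volume is V times the cell volume, the apparent intracellular amount is (1 + rV) times the true amount. The value r = 0.01 below is a hypothetical example; the factor of 100 for the volume ratio is the rough figure given above.

```python
# Illustrative sketch: overestimation of an intracellular pool when the
# extracellular metabolite is not removed or accounted for.


def apparent_over_true(conc_ratio_extra_to_intra, volume_ratio_medium_to_cells):
    """Ratio of apparent to true intracellular amount in a whole-broth sample."""
    return 1.0 + conc_ratio_extra_to_intra * volume_ratio_medium_to_cells


# Hypothetical case: extracellular concentration is 1% of the intracellular
# concentration, and the medium volume is 100 times the cell volume.
overestimate = apparent_over_true(0.01, 100.0)
print(overestimate)
```

In this example the trace extracellular amount equals the intracellular amount, so the pool would be overestimated by a factor of about two.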
Separation of cells and surrounding culture medium can be
achieved by either filtration or centrifugation. These procedures
should be carried out rapidly enough, and preferably at low tem-
perature, to avoid continuation of metabolic activity, otherwise the
measured metabolite levels are not representative for the applied
cultivation conditions [4].
De Koning and van Dam [11] were the first to combine fast
quenching and subsequent separation of cells and supernatant in
one procedure, applying cold (−40 °C) 60% aqueous
methanol as quenching solution. Their method was inspired by
the sampling procedure published by Saez and Lagunas [10], who
applied filtration to separate the cells, followed by washing
of the filter cake with cold (−40 °C) 50% methanol before
freezing the cells in liquid nitrogen.
De Koning and van Dam applied their method for yeast,
whereby they directly sprayed 15 mL of culture broth into 60 mL
of cold 60% aqueous methanol solution. Separation of cells and
surrounding liquid was achieved by cold centrifugation (5 min at
9000 × g at −20 °C). The obtained cell pellet was subsequently
extracted to release the metabolites.
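
The composition of the mixture obtained after spraying broth into the quenching solution follows from a simple volume balance, sketched below for the volumes quoted above. The calculation assumes additive volumes and treats the broth as methanol-free; the function names are hypothetical.

```python
# Illustrative sketch: methanol content and broth dilution after quenching.
# Volumes are those quoted in the text; additive volumes are assumed.


def final_methanol_fraction(broth_ml, quench_ml, quench_methanol_fraction):
    """Volume fraction of methanol after mixing broth and quenching solution."""
    return quench_ml * quench_methanol_fraction / (broth_ml + quench_ml)


def broth_dilution_factor(broth_ml, quench_ml):
    """Fold-dilution of the broth by the quenching solution."""
    return (broth_ml + quench_ml) / broth_ml


methanol = final_methanol_fraction(15.0, 60.0, 0.60)
dilution = broth_dilution_factor(15.0, 60.0)
print(f"{methanol:.0%} v/v methanol, {dilution:.0f}-fold dilution of the broth")
```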
This method in principle allows determining in vivo intracellu-
lar metabolite levels without interference of extracellular metabo-
lites present. However, an important requirement for this method
to be applicable is that the cells remain intact and that no metabolite
release into the cold methanol solution occurs. This was checked by
de Koning and van Dam [11], who verified whether the metabo-
lites which were present in significant amounts in the cells, could
also be detected in the culture supernatant and in the supernatants
obtained after cold methanol quenching. From the obtained
results, the authors concluded that no significant metabolite leak-
age occurred of the measured intracellular metabolites (glycolytic
intermediates, pyruvate, NAD, NADH, and ATP). Later on, other
workers have verified whether metabolite leakage occurred during
application of the cold methanol quenching method for different
microorganisms. The published results appeared contradictory (see
Canelas et al. [16] and references therein) most probably because of
differences in sensitivity of the applied analytical procedures.
Recently, it has been shown that, especially in case of bacteria,
cold methanol quenching induces extensive metabolite leakage,
possibly due to the so-called cold shock phenomenon [13–15].
There are indications that also during quenching of yeast
culture samples, metabolite leakage occurs, although the current
literature is not consistent on this issue [16, 32]. Nevertheless, the
cold methanol method is still the most widely used method for
rapid quenching of microbial cultures [21, 31, 34, 35]. However,
before applying the method, it should be verified for each particular
microorganism whether metabolite leakage occurs, and if so, how
this can be minimized or avoided [15, 31].

1.1.4 Extraction of Metabolites from Quenched Cell Samples

The next step in the procedure is the extraction of the metabolites
from the quenched sample. Ideally, the applied extraction procedure
should result in unbiased and complete extraction of all metabolites
from the cells, should not lead to conversion and/or
degradation of metabolites during extraction and subsequent sam-
ple processing, and should be compatible with the analysis methods
to be applied. Extraction can be achieved using high temperature,
extreme pH, organic solvents, mechanical stress, or combinations
of these. Well-known methods which have been employed since the
1950s are extraction in perchloric acid [36, 37], hot water [38, 39],
and boiling ethanol/water [40, 41]. More recently, the tendency
has been to apply milder extraction methods, to prevent degrada-
tion of metabolites as much as possible. In these methods, extrac-
tion is carried out at low temperatures, sometimes combined with
repeated freezing and thawing to disintegrate the cells. Examples
are cold chloroform/methanol extraction [11], freeze-thawing in
methanol [42], and cold acetonitrile-methanol extraction [43]. A
quantitative evaluation of different extraction methods for applica-
tion to metabolome analysis of yeast has been published by Canelas
et al. [44]. In this study, the addition of 13C-labeled internal
standards at different stages of sample processing has been applied
to determine the metabolite recoveries. Canelas et al. concluded
that the boiling ethanol/water and chloroform/methanol extrac-
tion methods performed best, in terms of efficacy and metabolite
recoveries. Application of methods which do not ensure complete
enzyme inactivation, e.g., freeze-thawing in methanol, significantly
affected the outcome of the metabolome measurements, due to
enzymatic conversion of metabolites in the samples. Metabolite
recoveries upon extraction of yeast cells with acidic acetonitrile-
methanol appeared low for larger and more polar metabolites (see
Fig. 7).
Fig. 7 Overall process recoveries for 44 metabolites analyzed in yeast, in order of increasing molecular weight,
for each of the extraction methods, under two growth conditions, chemostat and batch cultivation. Data are
averages and standard deviations of duplicate samples each analyzed twice. Legend: ∇, chemostat; Δ, batch
(Figure from Canelas et al. [44])
Fast Sampling of the Cellular Metabolome 23

1.1.5 Analytical Procedures

Finally, high-throughput analysis methods are required for selective
and precise quantification of a large variety of metabolites. In the
past, enzyme-based methods were used almost exclusively [45]. These
have the advantage that they are very specific for a particular
metabolite, but the disadvantages that a different assay is required
for each metabolite and that some of the enzymes needed might not be
commercially available. With the improvement of GC and HPLC
techniques, these have therefore increasingly been used. During the
last decade, sensitive high-throughput mass spectrometry-based
methods (mainly GC-MS and LC-MS/MS) have enabled the measurement of
large numbers of different metabolites in a small amount of sample.
Especially with the application of U-13C-labeled internal standards,
which makes isotope dilution mass spectrometry (IDMS) possible, the
precision of MS-based metabolome measurements has increased
significantly [46, 48].

2 Materials

2.1 Cold Methanol Quenching Combined with Cold Centrifugation

1. Rapid sampling setup (see Note 1), e.g., the system published
by Lange et al. (for a complete description, see ref. 21).
2. Cryostat, filled with a suitable cryo liquid (e.g., ethylene glycol)
and capable of reaching a temperature of −40 °C.
3. 60% (v/v) methanol/water mixture.
4. Appropriate test tubes (e.g., polypropylene (PP) tubes of
14 mL, 17 mm diameter) with caps.
5. Cooled laboratory centrifuge capable of reaching a temperature
of at least −20 °C.
6. A −40 °C freezer to pre-cool the centrifuge rotor.
Precautions:
Methanol and ethylene glycol (the most commonly used cooling
fluid) are toxic substances. Always wear (impermeable) gloves
and safety goggles when manipulating the samples, and avoid
contaminating surfaces and equipment.

2.2 Additional Materials for Cold Methanol Quenching Combined with Cold Filtration

1. For fungal cultures: glass fiber filters (e.g., type A/E, Pall
Corporation, East Hills, NY, USA, 47 mm diameter, 1 μm
pore size). For yeast and bacterial cultures: hydrophilic
polyethersulfone (PES) membrane filter with a pore size of
0.2–0.45 μm (e.g., Supor, Pall, USA).
2. Peristaltic pump capable of reaching a flow rate of at least
300 mL/min.
3. Filtration setup with vacuum pump.
4. Balance.
5. Water bath at 70 °C.
24 Walter M. van Gulik et al.

6. 50-mL test tubes with screw cap.
7. Syringe filters with a pore size of 0.2 μm (e.g., FP30/0.2
CA-S; Whatman, Maidstone, England).
Note: The bioreactor from which the samples are taken should
be equipped with a sampling port connected to tubing which runs
through the peristaltic pump (see Fig. 1, right panel).

2.3 Rapid Sampling of Culture Filtrate

1. Plastic syringes with a volume of 10, 30, or 60 mL (depending
on the sample volume required).
2. Stainless steel beads with a diameter of 4 mm.
3. Syringe filters with a pore size of 0.45 μm, e.g., Millex HV
(Millipore, Cork, Ireland).

2.4 Extraction

1. 75% (v/v) ethanol/water mixture.
2. If isotope dilution mass spectrometry (IDMS) is used for
metabolome analysis (see Note 2): U-13C-labeled cell extract
containing all metabolites which have to be measured, in sufficient
amounts (see ref. 45).
3. Vacuum evaporation system (e.g., RapidVap (Labconco
Corporation, Kansas City, MO)).
4. 0.2-μm Durapore PVDF centrifuge filters.

3 Methods

It has been found that in microbial cultivations, a large part of the
cellular metabolome is also present in the cultivation medium. This
is partly a result of cell lysis, but presumably also due to the
structure of the cell membranes and the transport proteins located
in them, which permit metabolites to diffuse into the medium.
Metabolome analysis of microbial cultures may therefore include,
apart from the measurement of the intracellular metabolite levels
(the endometabolome), also the measurement of the extracellular
levels (the exometabolome). Below we will present fast sampling
methods for both the endo- and the exometabolome.

3.1 Rapid Sampling for Endometabolome Analysis: Cold Centrifugation Method

This protocol is typically suited for rapid sampling of
microorganisms which show negligible leakage of metabolites into the
quenching solution. Because this method includes a centrifugation and
a washing step, the metabolites present in the cultivation medium are
removed. This allows proper quantification of the intracellular
metabolites without interference of the exometabolome. It should be
noted that although the concentrations of metabolites in the medium
are usually much lower than within the cells, the amount of
extracellular metabolites in a broth sample can still be significant
compared to the intracellular amount, because in most laboratory
cultivations, the volume of the supernatant is roughly two orders of
magnitude larger than the volume of cells. In the protocol below, a
sample volume of 1 mL is assumed. The method is, however, easily
scalable to smaller or larger sample volumes. See Fig. 1, left panel,
for a schematic overview of the procedure.
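The effect of the volume ratio on the extracellular share can be illustrated with a small calculation. The sketch below is illustrative only: the function name and the example numbers (a 1000-fold concentration difference and a 100-fold volume ratio) are assumptions chosen for the example, not values from the protocol.

```python
def extracellular_fraction(conc_ratio_in_over_out, vol_ratio_out_over_in):
    """Fraction of a metabolite in a broth sample that is outside the cells.

    conc_ratio_in_over_out: intracellular / extracellular concentration
    vol_ratio_out_over_in:  supernatant volume / total cell volume
    """
    # Amounts in arbitrary units: concentration times volume, with the
    # extracellular concentration and the cell volume both set to 1.
    intracellular = conc_ratio_in_over_out * 1.0
    extracellular = 1.0 * vol_ratio_out_over_in
    return extracellular / (intracellular + extracellular)


# Hypothetical example: concentration 1000 times higher inside the cells,
# supernatant volume 100 times the cell volume.
fraction = extracellular_fraction(1000.0, 100.0)
print(f"{fraction:.1%} of the metabolite in the broth sample is extracellular")
```

Even with a much lower concentration outside the cells, a noticeable share of the total amount in the sample is extracellular, which is why the washing step matters.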

3.1.1 Preparation

It is advisable to carry out the following preparatory steps the day
before the sampling is performed:
1. For n samples, prepare:
– n test tubes containing 5 mL of 60% v/v MeOH, for sampling.
Number and weigh them. Close all tubes with caps
and store at −40 °C.
– n test tubes containing 5 mL of 60% v/v MeOH for the
washing step. Close tubes and store also at −40 °C.
– n test tubes containing 5 mL of 75% v/v EtOH (68% m/m)
for the extraction step. Close tubes and store in the fridge.
2. Set the temperature of the centrifuge to −20 °C and put the
appropriate centrifuge rotor in a −40 °C freezer. Turn on the
cryostat and set the temperature to −40 °C.
3. Connect the rapid sampling setup to the bioreactor to be
sampled.
The next steps are best performed on the same day the sampling
is performed:
1. If IDMS analysis is used (see Note 2): Let the frozen
13C-labeled extract thaw in the fridge. Make sure that you use
the same uniform solution for all samples and standards. Keep
the vial containing the 13C extract closed and cold, e.g., on ice.
2. Place the tubes containing 60% (v/v) methanol, required for
sampling and washing of the cell pellet, in the cryostat at
−40 °C.
3. Adjust the timer controlling the electronic valve(s) of the rapid
sampling system such that the weight of the sample taken
equals 1.00 ± 0.05 g.
4. Calibrate the pipette required for 13C extract additions
(typically 100 μL).
5. Switch on a suitable water bath and let it reach a temperature of
95 °C before sampling is started.
6. Place the tubes containing the 75% ethanol next to the water
bath and allow them to warm up to room temperature.

3.1.2 Sampling

1. Withdraw 1.0 mL of broth into a sampling tube (containing
5 mL of 60% methanol at −40 °C) using the rapid sampling
device, and mix directly after sampling by vortexing. Close with a
cap and place the tube back in the cryostat. Repeat until the
required number of samples has been taken.
2. Weigh each tube for exact sample amount determination
(by subtracting the weight of the tube containing the 5 mL
of 60% methanol determined the day before) and put back in
the cryostat. Make sure the cryostat fluid (e.g., ethylene glycol)
is effectively wiped from the walls of the tube, as it can affect the
weighing and lead to overestimation of the sample weight
(especially with small sample volumes); see Note 3. The weighing
procedure should be expeditious to prevent warming up of the
sample (see Note 1).
3. Remove the centrifuge rotor from the −40 °C freezer and put it
back into the cooled centrifuge. Centrifuge the quenched sample,
e.g., at 2000 × g for 5 min. Centrifugation conditions
should ensure a stable pellet which can still be resuspended. It
might be required to adapt the centrifugation speed for a
particular microorganism and/or cultivation condition.
4. Decant and discard the supernatant and resuspend the cell
pellet immediately by adding 5 mL of 60% (v/v) methanol at
−40 °C and rapid vortexing.
5. Centrifuge again, decant, discard the supernatant, and place
the tube back in the cryostat.
6. If IDMS is used for metabolite quantification: add 13C extract
(typically 100 μL) to each washed cell pellet (see Note 2).
Note: From sampling to decanting, the samples should be
exposed to methanol for as short a time as possible to minimize
leakage of metabolites from the cells into the quenching solution
(see Note 4).
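The weighing step amounts to a simple difference calculation, which can be sketched as follows. This is a minimal illustration only: the function names, the tube labels, and the example weights are hypothetical; only the target of 1.00 g with a tolerance of 0.05 g is taken from the preparation steps.

```python
def sample_mass(tube_after_g, tube_before_g):
    """Mass of broth sampled (g), by difference of the tube weights."""
    return tube_after_g - tube_before_g


def within_target(mass_g, target_g=1.00, tolerance_g=0.05):
    """True if the sample mass lies within target +/- tolerance.

    A tiny margin is added to absorb floating-point rounding.
    """
    return abs(mass_g - target_g) <= tolerance_g + 1e-9


# Hypothetical weights (g) of tube + 5 mL of 60% methanol,
# before (day before sampling) and after sampling.
tubes = {"S1": (11.482, 12.497), "S2": (11.510, 12.590)}

for name, (before, after) in tubes.items():
    mass = sample_mass(after, before)
    status = "ok" if within_target(mass) else "outside target range"
    print(f"{name}: {mass:.3f} g ({status})")
```

Samples outside the target range need not be discarded, since the measured mass is used in the quantification; the check mainly indicates whether the sampling timer needs readjustment.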

3.1.3 Extraction of the Cell Pellets

Boiling ethanol/water extraction is applied to release the
metabolites from the cell pellets (see Note 5). During this
procedure, each tube with the extraction solution (75% v/v EtOH) is
heated (e.g., 4 min for a volume of 5 mL) to reach a temperature of
95 °C. Thereafter, the hot ethanol solution is transferred to the
tubes containing the cell pellets. After resuspension of the cells in
the hot ethanol solution, they are kept at 95 °C for a period of
3 min. This procedure effectively releases all metabolites from the
cells and, at the same time, results in denaturation of the enzymes
present, which prevents further (enzyme-catalyzed) conversion of
metabolites in the samples (see Note 5).
1. Remove the required number of tubes containing 5 mL of 75%
v/v EtOH from the fridge.
2. Put the tubes containing 5 mL of 75% v/v EtOH in the 95 °C
water bath to heat up at convenient time intervals (e.g.,
30 s).
3. After 4 min, transfer (by pouring) the hot ethanol solution of
the first tube to the tube containing the first cell pellet, rapidly
resuspend the cells by vortexing, and put the tube back in the hot
bath. Make sure that the cell pellets are fully resuspended; firm
pellets require longer vortexing. Repeat this procedure for the other
tubes at intervals of 30 s.
4. After 3 min, transfer the first ethanol extract to the −40 °C
cryostat to cool down. Repeat subsequently at intervals of
30 s, until all cell pellets have been extracted.
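Because the tubes are processed in a staggered fashion, it can help to write out the timing beforehand. The following sketch generates such a schedule from the intervals given above (30 s stagger, 4 min heating, 3 min extraction); the function name and the output format are assumptions made for illustration.

```python
def extraction_schedule(n_tubes, stagger_s=30, preheat_s=240, extract_s=180):
    """Return a list of (tube, t_in_bath, t_pour, t_to_cryostat) tuples.

    All times are in seconds from the moment the first ethanol tube is
    placed in the 95 C water bath.
    """
    schedule = []
    for i in range(n_tubes):
        t_in = i * stagger_s           # ethanol tube goes into the water bath
        t_pour = t_in + preheat_s      # hot ethanol is poured onto the pellet
        t_out = t_pour + extract_s     # extract is moved to the cold cryostat
        schedule.append((i + 1, t_in, t_pour, t_out))
    return schedule


for tube, t_in, t_pour, t_out in extraction_schedule(4):
    print(f"tube {tube}: heat at {t_in} s, pour at {t_pour} s, cool at {t_out} s")
```

With a 30 s stagger, pouring of the ninth tube (at 480 s) coincides with cooling of the third (also at 480 s), so for larger series a second experimenter or a longer stagger may be worth considering.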

3.1.4 Further Sample Processing

In the protocol below, it is assumed that a Labconco RapidVap is
used for the sample drying.
1. Turn on the cold trap of the RapidVap. Make sure the cold trap
is empty. It will take 10–20 min to be ready.
2. Evaporate the ethanol/water mixture until the samples are dry.
Set the speed of the RapidVap to 90%, and apply full vacuum.
3. 5 min after the start, switch on the heating and set to 30 °C.
4. 25 min after the start, decrease the vacuum to 5 mbar.
5. Stop the RapidVap 110 min after the start and check if the
samples are completely dry. If not, continue until dry.
6. Resuspend the dried sediment in 500 μL MilliQ water.
7. Mix thoroughly by vortexing and transfer to Eppendorf tubes.
8. Centrifuge at 15,000 × g for 5 min at 1 °C. (If the supernatant
is still turbid, transfer supernatant to clean Eppendorf tubes
and centrifuge again.)
9. Transfer the supernatants to (labeled) 0.2-μm Durapore PVDF
centrifuge filters.
10. Filter by centrifuging again at 15,000 × g for 5 min at 1 °C.
11. Transfer supernatant to screw-cap sample vials and store at
−80 °C until analysis.

3.2 Rapid Sampling for Endometabolome Analysis: Cold Filtration Method

For quantification of intracellular metabolites which are present in
the cells in very low amounts compared to their presence in the
cultivation medium, the washing efficiency of the cold centrifugation
method may not be sufficient. Therefore, a method was developed
whereby cold methanol quenching is combined with a cold filtration
step for virtually complete removal of the exometabolome [12]. See
Fig. 1, right panel, for a schematic overview of the method. This
procedure is especially useful to quantify intracellular amounts of
substrates and secreted (by)products. In the following protocol, it
is assumed that 60% aqueous methanol is a suitable quenching and
washing liquid (see Note 6), that boiling ethanol/water is a suitable
extraction method (see Note 5), and that samples with a volume of
10 mL are required. A different quenching liquid and an adjusted
sample volume can be used provided that the temperature after
sampling is kept at −20 °C or lower, to prevent enzymatic conversion
of metabolites. As timing during this method is critical, the
sampling is best carried out with two experimenters. Although the
description of sampling and extraction is divided over two sections,
both should be carried out quickly and smoothly in one go. The entire
procedure, from sampling until submerging the filter containing the
cells into the 75% ethanol, should be carried out fast enough to
prevent the sample from warming up.

3.2.1 Preparation

For n samples, prepare the following the day before the sampling is
carried out:
1. 3 × n tubes with 50 mL of 60% methanol. Cap and cool down
to −40 °C.
2. n tubes with 30 mL of 75% ethanol. Cap and heat them up in a
70 °C water bath before the sampling starts. (70 °C is just
below the boiling point of this mixture.)
The next steps are best performed on the same day the sampling
is performed:
1. Place the vacuum filtration unit on the balance (see Fig. 1, right
panel). Connect the tubing to the vacuum pump without
strain, such that it does not affect the weight of the filtration
unit during sampling.
2. Calibrate the pipette required for 13C extract additions
(typically 100 μL).
3. If IDMS is applied for metabolite quantification (see Note 2):
Let the frozen 13C-labeled extract thaw in the fridge. Make
sure that you use the same uniform solution for all samples and
standards. Keep the vial containing the 13C extract closed and
cold on ice.

3.2.2 Sampling

1. Place a filter on the filter support disc and clamp the filtration
beaker.
2. Open a tube with 75% ethanol at 70 °C (required for extraction
in a few minutes) and keep it in the 70 °C water bath.
3. Get three tubes with 50 mL of 60% methanol at −40 °C from
the freezer/cryostat. Pour one of them out into the filtration
beaker and leave the other two next to the sampling setup,
ready to grab, for washing the cell cake.
4. Tare the balance.
5. Switch on the peristaltic pump and flush the dead volume of the
sampling tubing into a waste tube. Without switching off the
pump, direct the flow/spray into the cold 60% methanol in the
filtration beaker. The spray must directly contact the cold 60%
methanol, so avoid hitting the wall of the filtration beaker.
Switch off the pump after approximately 10 g (= 10 mL) of
broth has been sampled.
6. Read the exact sample weight from the balance. (The second
experimenter has time to write down the weight.)
7. Start the vacuum pump. Open the second 60% methanol tube
while the broth/methanol suspension is filtered and pour it out
into the beaker only after the filter cake falls dry. Repeat with
the third 60% methanol tube and turn off the vacuum pump
after the filter cake falls dry.

3.2.3 Extraction of the Cell Cakes

1. Remove the filtration beaker, lift up the filter with cell cake
using tweezers, pipette 100 μL of 13C extract (0 °C) on top of
the washed cell cake, and immediately submerge the cell cake in
the 75% ethanol tube at 70 °C.
2. Cap the tube and vigorously shake it by hand for 5 s (glass fiber
filters will disintegrate at this point) and then place in a 95 °C
water bath for 3 min (open the cap slightly to prevent
pressurization).
3. Remove the tube from the water bath and cool it on ice. Recap
the tube.
4. If desired, the sample can now be stored at −80 °C until further
processing. If not, continue with Subheading 3.2.4, step 1.
5. Clean the filtration setup for the next sample.

3.2.4 Further Sample Processing

In the protocol below, it is assumed that a Labconco RapidVap is
used for the sample drying.
1. Centrifuge the extracted samples for 8 min at 4 °C and
4400 × g.
2. Filter the supernatant using a 0.2-μm filter to remove glass
fibers from the solution.
3. Evaporate the thus obtained extract to dryness using the
RapidVap. Alternatively, if problems occur with resuspension
of the dry residue, the extract can be concentrated instead of
evaporated completely to dryness, e.g., to a final volume of
300–500 μL. The drying/concentration step requires about
2 h (depending on the number of tubes processed at the
same time). See Subheading 3.1.4 for the steps preparing the
RapidVap for use. Start at a slow speed (30%) and increase as
more and more water and ethanol evaporates. Set the heat to
30 °C. Do not apply full vacuum at once, but start at 200 mbar
and decrease the pressure in steps of 20 mbar every 20 s until
full vacuum.
4. Resuspend the residue in 500 μL MilliQ water (or fill up to
500 μL, if the extract is not evaporated to dryness).
5. Mix thoroughly by vortexing and transfer to Eppendorf tubes.
6. Centrifuge at 15,000 × g for 5 min at 1 °C. (If supernatant is
still turbid, transfer supernatant to clean Eppendorf tubes and
centrifuge again.)
7. Transfer the supernatants to (labeled) 0.2-μm Durapore PVDF
centrifuge filters.
8. Filter by centrifuging again at 15,000 × g for 5 min at 1 °C.
9. Transfer supernatant to screw-cap sample vials and store at
−80 °C until analysis.

3.3 Rapid Sampling for Exometabolome Analysis

With this procedure, samples from a culture of microorganisms are
quickly cooled down to a temperature close to 0 °C. The purpose is
to minimize metabolic activity as much as possible while avoiding
freezing the sample, as this may lead to cell damage. The cooling of
the sample is accomplished by direct contact with pre-cooled steel
beads which are placed in a syringe. Directly thereafter the sample is
pressed through a filter to obtain a supernatant sample. The
amount of beads needed to cool down the sample to a temperature
slightly above 0 °C can be calculated from the heat capacities of
stainless steel and water, the required sample volume, the initial
sample temperature, and the initial temperature of the stainless steel
beads (see ref. 49). Note that if the cells are susceptible to cold shock
(i.e., sudden cooling will result in release of metabolites from the
cells), the cooling step should be omitted. The protocol below is
designed for the withdrawal of 2 mL of sample with an initial
temperature of 30 °C.
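The heat balance behind the bead calculation can be sketched as follows. This is a simplified illustration, not the calculation of ref. 49: it assumes that the broth has the density and heat capacity of water, that stainless steel has a specific heat of about 0.50 J/(g K), that no heat is exchanged with the syringe or the surroundings, and that no ice is formed. The function names are hypothetical.

```python
C_WATER = 4.18  # J/(g K), specific heat of water (broth approximated as water)
C_STEEL = 0.50  # J/(g K), approximate value for stainless steel (assumption)


def final_temperature(m_beads_g, t_beads_c, m_sample_g, t_sample_c):
    """Equilibrium temperature (deg C) of beads plus sample.

    Heat released by the sample equals heat taken up by the beads.
    """
    heat_cap_beads = m_beads_g * C_STEEL
    heat_cap_sample = m_sample_g * C_WATER
    return (heat_cap_beads * t_beads_c + heat_cap_sample * t_sample_c) / (
        heat_cap_beads + heat_cap_sample
    )


def bead_mass_for_target(t_target_c, t_beads_c, m_sample_g, t_sample_c):
    """Bead mass (g) needed to cool the sample to t_target_c."""
    heat_to_remove = m_sample_g * C_WATER * (t_sample_c - t_target_c)
    return heat_to_remove / (C_STEEL * (t_target_c - t_beads_c))


# Conditions of the protocol: 25 g of beads at -20 C, 2 mL (about 2 g) at 30 C.
t_final = final_temperature(25.0, -20.0, 2.0, 30.0)
print(f"estimated final temperature: {t_final:.2f} C")
```

Under these assumptions, the 25 g of beads used in the protocol brings a 2 mL sample of 30 °C to just above 0 °C. For other sample volumes or temperatures, `bead_mass_for_target` gives a first estimate that should be checked experimentally.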

3.3.1 Preparation

1. Fill the required number of syringes with 25 g of stainless steel
beads each. Close the syringes with their plungers and the
syringe outlets with parafilm and put them overnight in a
freezer at −20 °C.

3.3.2 Sampling

1. Take the required number of syringes filled with cold beads
from the freezer, remove the parafilm, and connect the filters to
the syringes. Keep them in a Styrofoam box filled with cooling
elements of −20 °C until sampling, to prevent them from
warming up.
2. Sample 2 mL of broth from the bioreactor into a syringe and
filter immediately, while collecting the supernatant in a
sample vial.
3. Store the sample at −80 °C until analysis. Alternatively, if
compounds should be quantified which may be susceptible to
enzymatic conversion, it is advisable to destroy possible
enzymes present by boiling ethanol extraction as described in
Subheading 3.4.3.

3.4 Differential Method

This method is to be preferred if cold methanol quenching results
in significant leakage of metabolites from the cells, as might be the
case for prokaryotic organisms [13–15]. To be sufficiently accurate,
it is essential to combine this method with IDMS for metabolite
quantification (see Notes 2 and 7).
With the differential method, each measurement requires two
samples: a total broth sample and a filtrate sample. However, in
particular cases, depending on the experimental design, the
extracellular metabolite levels may be assumed to be in pseudo steady
state, which means that they do not change significantly during the
time a series of samples is taken. Then only a few samples are
required to quantify the extracellular metabolite levels (see ref.
30). For the protocol below, it is assumed that 1 mL of sample is
required for measuring the metabolites in total broth and 2 mL is
required for measurement in the culture filtrate. Clearly, these
amounts may differ from case to case and depend on the sensitivity
of the analysis method applied. It must be taken into
account that application of this protocol results in a sixfold
dilution of the sample, while the conventional boiling ethanol/
water protocol for cell extraction results in a twofold concentration.
For further comments on the applicability of the differential
method, see Note 7.
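The dilution and concentration factors mentioned above follow from the volumes used in the protocols, as the sketch below shows. It assumes that volumes are additive on mixing (which is only approximately true for methanol/water) and that the dried extract is taken up in 500 μL; the function name is hypothetical.

```python
def overall_factor(sample_ml, quench_ml, aliquot_ml, final_ml):
    """Final volume divided by the volume of original sample it represents.

    Values above 1 indicate dilution, values below 1 concentration.
    """
    # Volume of original sample contained in the aliquot that is extracted.
    sample_in_aliquot = aliquot_ml * sample_ml / (sample_ml + quench_ml)
    return final_ml / sample_in_aliquot


# Differential method: 1 mL of sample in 5 mL of 60% methanol, 0.5 mL of the
# mixture extracted, dried residue taken up in 0.5 mL of water.
differential = overall_factor(1.0, 5.0, 0.5, 0.5)

# Cold centrifugation method: all cells from 1 mL of broth are extracted and
# end up in 0.5 mL, so no aliquot is taken (quench volume set to 0 here).
centrifugation = overall_factor(1.0, 0.0, 1.0, 0.5)

print(f"differential method: factor {differential:.1f} (dilution)")
print(f"centrifugation method: factor {centrifugation:.1f} (concentration)")
```

The two methods therefore differ by roughly a factor of 12 in the final metabolite concentration offered to the analysis, which should be considered when judging whether the sensitivity of the analytical method is sufficient.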

3.4.1 Preparation

It is most convenient to carry out the following preparatory steps the
day before the sampling is performed:
1. For n samples, prepare:
– n tubes containing 5 mL of 60% v/v MeOH for sampling.
Number and weigh them. Store at −40 °C.
– n tubes containing 5 mL of 75% v/v EtOH for the extraction
step. Store in the fridge.
Only if rapid cooling of the sample is required and the
microorganisms are not susceptible to cold shock (see refs. 13,
15):
– n syringes filled with the proper amount of stainless steel
beads (see protocol for exometabolome sampling). Close the
syringes with their plungers, cover the syringe outlets with a
layer of parafilm (to prevent formation of ice) and leave
them overnight in a freezer at −20 °C.
2. Turn on the cryostat and set the temperature to −40 °C.
3. Connect the rapid sampling setup to the bioreactor from which
the total broth samples should be withdrawn. Make sure that
the bioreactor contains a second sampling port for withdrawal
of broth to obtain the filtrate samples.
The next steps are best performed on the same day the sampling
is performed:
1. Let the frozen 13C-labeled extract thaw in the fridge. Make
sure that you use the same uniform solution for all samples and
standards (see Note 2). Keep the vials containing the 13C
extract closed and cold on ice.
2. Place the tubes containing 60% (v/v) methanol, required for
sampling, in the cryostat at −40 °C.
3. Adjust the timer of the rapid sampling system such that the
weight of the sample taken equals the desired amount, in this
case: 1.0 ± 0.05 g.
4. Calibrate the pipette required for 13C extract additions
(typically 100 μL).
5. Switch on a suitable water bath and let it reach a temperature of
95 °C before sampling is started.
6. Only if rapid cooling of the sample is required: Take the
required number of syringes filled with cold beads from the
freezer, remove the parafilm and connect the filters to the
syringes. Keep them in a Styrofoam box filled with cooling
elements of −20 °C until sampling, to prevent them from
heating up.

3.4.2 Sampling

1. Withdraw 1.0 mL of broth into a sampling tube (containing
5 mL of 60% methanol at −40 °C) using the rapid sampling
device, mix directly after sampling by vortexing, and place the
tube back in the cryostat.
2. Withdraw approximately 2 mL of broth from the bioreactor
into a syringe.
3. Filter the sample immediately thereafter by pressing the sample
through the filter and collect the supernatant in a sample vial.
Pipette 1 mL of filtrate into a sampling tube (containing 5 mL of
60% methanol at −40 °C), mix thoroughly by vortexing, and
place the tube back in the cryostat.
4. Repeat steps 1–3 for the number of measurements required.
5. Weigh each tube for exact sample amount determination
(by subtracting the weight of the tube containing the 5 mL
of 60% methanol determined the day before) and put back in
the cryostat. Make sure the cryostat fluid (e.g., ethylene glycol)
is effectively wiped from the walls of the tube, as it can affect the
weighing and lead to overestimation of the sample weight
(especially with small sample volumes); see Note 3. The weighing
procedure should be expeditious to prevent warming up of
the sample.

3.4.3 Extraction of the Quenched Total Broth and Filtrate Samples

In this protocol not only the total broth sample but also the filtrate
sample is extracted in hot ethanol, to denature all possible
enzymes present (see Note 5). Even the presence of minimal
amounts of enzymes would lead to distortion of metabolite profiles
later on in the sample processing, which must be avoided.
1. Transfer from each quenched broth sample 500 μL into an
empty tube and keep them in the cryostat at −40 °C until
extraction. Be sure to completely mix the quenched samples
by vortexing before the transfer.
2. Repeat this procedure for the quenched filtrate samples.
3. Add the U-13C internal standard mix (typically 100 μL).
4. Apply the same procedure as described for extraction of the cell
pellets (see Subheading 3.1.3).

3.4.4 Further Processing of the Total Broth and Filtrate Samples

Apply the same procedure for sample drying and cleanup as
described for the cell pellets (see Subheading 3.1.4).

3.4.5 Determination of the Intracellular Metabolite Levels for the Differential Method

After quantification of the metabolites in the total broth and filtrate
samples, the intracellular amounts can be calculated by subtraction.
Proper quantification of the real amounts of sample taken, which
was performed by weighing in the case of the total broth samples and
by accurate pipetting in the case of the filtrate samples (weighing
can be used here in addition), will increase the accuracy of the final
result. The most convenient way of expressing the metabolite levels,
both in total broth and in the filtrate, is per amount of biomass
present in the bioreactor, e.g., in μmol per gram of biomass dry
weight. Subtraction of metabolite levels in the filtrate from the
total broth levels then directly yields intracellular levels (see
Note 7).
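The subtraction can be written out as a short calculation. The sketch below is illustrative only: the function names, the metabolite names, and all numbers are hypothetical, and it assumes that both the total broth and the filtrate levels have already been expressed per litre of broth.

```python
def per_gram_dry_weight(level_umol_per_l, biomass_g_dw_per_l):
    """Convert a level in umol per litre of broth to umol per g dry weight."""
    return level_umol_per_l / biomass_g_dw_per_l


def intracellular_levels(total_broth, filtrate, biomass_g_dw_per_l):
    """Subtract filtrate levels from total broth levels per metabolite.

    total_broth, filtrate: dicts of metabolite -> umol per litre of broth.
    Returns a dict of metabolite -> umol per g biomass dry weight.
    """
    result = {}
    for metabolite, total in total_broth.items():
        extracellular = filtrate.get(metabolite, 0.0)
        result[metabolite] = per_gram_dry_weight(
            total, biomass_g_dw_per_l
        ) - per_gram_dry_weight(extracellular, biomass_g_dw_per_l)
    return result


# Hypothetical measurements for a culture with 4.0 g dry weight per litre.
total = {"G6P": 12.0, "pyruvate": 30.0}
filt = {"G6P": 0.5, "pyruvate": 24.0}
levels = intracellular_levels(total, filt, biomass_g_dw_per_l=4.0)
print(levels)
```

In the hypothetical pyruvate example, most of the total amount is extracellular, so the intracellular level is a small difference between two large numbers and inherits the measurement error of both. This is why a precise quantification method such as IDMS is essential for the differential method.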

3.5 Principles of Metabolite Quantification Using Isotope Dilution Mass Spectrometry

A detailed description of how to apply isotope dilution mass
spectrometry will not be given here. Different methods for the analysis
of different groups of metabolites have been published previously
[47, 48]. The principle of the method is that metabolites are
quantified by mass spectrometry, whereby for each individual
metabolite a chemically identical, fully 13C-labeled analog is
added as internal standard. Each metabolite is then quantified
relative to the amount of its fully 13C-labeled analog present. This
procedure effectively corrects for non-idealities in the subsequent
MS-based quantification, such as sample matrix effects,
nonlinearity resulting from competition in the ESI interface (in case of
LC-MS analysis), incomplete derivatization (in case of GC-MS
analysis), machine drift, etc.
To allow quantification, the 13C-labeled internal standard mix
is therefore added to the samples, as well as to a series of dilutions of
a conventional standard mix. If the 13C-labeled internal standard
mix is added to the samples before the extraction procedure, partial
degradation of metabolites, as well as partial losses of sample, e.g.,
by transferring them to different tubes, are also corrected for.
Unfortunately, for most metabolites fully 13C-labeled analogs are
not commercially available. The only way to obtain 13C-labeled
analogs for all metabolites to be measured is therefore to carry
out a cultivation on a medium containing a fully 13C-labeled carbon
source, e.g., 100% U-13C-labeled glucose [47]. Extraction of the
cells will then yield a U-13C-labeled metabolite mixture which can
be used as internal standard.
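The quantification principle can be illustrated with a minimal calculation. The sketch below is not the published IDMS procedure [47, 48]; it assumes a linear calibration of the 12C/13C peak area ratio against the concentration of the unlabeled standard, and all peak areas and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Calibration standards: known concentration of the unlabeled metabolite
# (umol/L), each spiked with the same amount of U-13C extract.
std_conc = [1.0, 2.0, 5.0, 10.0]
std_area_12c = [980.0, 2050.0, 4900.0, 10100.0]  # hypothetical peak areas
std_area_13c = [4900.0, 5125.0, 4900.0, 5050.0]  # hypothetical peak areas
std_ratio = [a12 / a13 for a12, a13 in zip(std_area_12c, std_area_13c)]

slope, intercept = fit_line(std_conc, std_ratio)


def quantify(area_12c, area_13c):
    """Concentration of the unlabeled metabolite from its 12C/13C area ratio."""
    return (area_12c / area_13c - intercept) / slope


# Two hypothetical measurements of the same sample; in the second one the
# signal is suppressed twofold for both the 12C and the 13C form.
print(quantify(3000.0, 5000.0), quantify(1500.0, 2500.0))
```

Because signal suppression affects the labeled and unlabeled forms of a metabolite equally, the area ratio, and with it the calculated concentration, is unchanged. This is the basis of the correction for matrix effects and drift described above.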

4 Notes

1. The necessity of fast sampling and quenching.
To obtain a proper quantitative snapshot of the microbial
metabolome, fast sampling is essential. It should be realized
that as soon as a sample is withdrawn from a culture, the
conditions to which the cells are exposed will change, e.g.,
with respect to temperature, substrate and oxygen availability,
carbon dioxide pressure, and pH. For example, in a sample
taken from a high-density aerobic batch cultivation, the avail-
able dissolved oxygen might be depleted within 1 or 2 s. The
same holds for the substrate concentration in a sample with-
drawn from a substrate-limited chemostat culture. Changes in
the environment of the cells will result in changes in metabolic
rates. Because the vast majority of metabolites have turnover
times of seconds or less, this will result in changes in the
metabolome. Therefore, to prevent these changes, the time
between withdrawal of the sample and quenching of all meta-
bolic activity should preferably be less than a second.
2. Use of 13C extract as internal standard.
For a proper quantification of metabolites with IDMS, the
amount of U-13C extract added should be such that, after the
addition, the concentrations of the U-13C-labeled analogs in
the sample are in the same range as the metabolites which have
to be quantified. This should be taken into account in the
preparation of the 13C extract, i.e., in the final concentration
of the extract. In some cases, it might be required to either
dilute the 13C extract or the samples to achieve this.
It is important that the same 13C extract, i.e., from the
same batch, is used for all samples and standards of the same
series. Furthermore, repeated freezing and thawing of the
extract may result in partial degradation of metabolites,
whereby the extent of degradation is metabolite-specific.
Ideally, the addition of 13C extract to the samples and standards
should be performed on the same day, from a single pool of
extract. If this is not practical (e.g., when samples need to be
taken on separate days during cultivation, to observe long-term
trends in metabolite levels), the 13C extract to be used should
be distributed over several vials and frozen together. A single
aliquot can then be thawed on the day it is needed. Note that,
due to the concentration step, cell extracts are often rather
viscous solutions, so be careful when the 13C extract is added
to the samples by pipetting. It should be stressed here that the
precision of the end results depends on the accuracy of the 13C
extract addition, because they are calculated based on that
value. If the precision of the pipettes used is found to be
insufficient, positive displacement pipettes may provide a better
alternative, because they are more suitable for working with
viscous solutions than air/piston pipettes.
3. Determination of the exact sample amount by weighing.
No matter which sampling device is used to rapidly with-
draw samples from a bioreactor, the exact amount withdrawn
will vary between certain limits. For an accurate quantification
of metabolites, it is therefore essential to determine the exact
amounts of sample withdrawn by weighing the tubes contain-
ing the cold aqueous methanol solution before and after sam-
pling. Because cold methanol is very hygroscopic, water vapor
will quickly condense within the tube, affecting the weight of
the tubes. Therefore, all tubes containing cold aqueous methanol
should be kept closed and opened only briefly for sampling.
4. Occurrence of leakage of metabolites into the quenching solution.
In the protocols above, it is assumed that 60% methanol is a
suitable quenching liquid. However, some authors have
reported that this quenching solution may give rise to metabolite
leakage, even for eukaryotic cells [15, 31]. The suitability of
a quenching liquid should therefore be validated beforehand,
preferably in a quantitative way [15, 31]. Briefly, this involves
comparing the intracellular amounts measured with the
differential method to the intracellular amounts measured after
using either the centrifugation or the filtration method. If, in
addition, the metabolites in the quenching and washing solutions
are quantified, the full mass balances can be calculated and the
extent of leakage quantified for each metabolite, as has been
described extensively in Canelas et al. [16] (see Fig. 2). If leakage
is detected, this mass balance approach can be used to test
systematically the effect of changes in the properties of the
quenching solution (temperature, concentration of solvent, ionic
strength, etc.) and/or the cell separation method (centrifugation
or filtration) and to optimize the whole quenching procedure to
prevent the occurrence of leakage.
5. Metabolite extraction.
In the protocols described above, boiling in hot aqueous
ethanol has been applied to extract the metabolites from the
cell samples. This method has several advantages compared to
other published extraction procedures. Because the extraction
is carried out at a high temperature, all enzymes present are
denatured, preventing further enzymatic (inter)conversions
of metabolites in the cell extract. Furthermore, the extractant
(ethanol/water) is nontoxic and easily removed by vacuum
evaporation.
It has been shown that methods for which complete inactivation
of all enzymatic activity is not guaranteed, e.g., freeze thawing in
methanol (FTM), lead to significant changes in metabolite
concentrations [44]. Nevertheless, FTM extraction has been applied
in several published studies (see ref. 50 and references therein).
6. Quenching in cold aqueous methanol.
At present, cold methanol quenching is widely considered the
most suitable procedure that allows removal of the compounds
present in the cultivation medium. For the method to be applicable
to a given organism, the cells should remain intact during the
quenching procedure and metabolite loss into the quenching solution
should be negligible; otherwise, no meaningful measurements will be
obtained. Removal of extracellular compounds is important for
several reasons. First of all, thanks to the increased sensitivity of
(mainly MS-based) analytical instruments, it has become evident
that many metabolic intermediates are also present in the culti-
vation medium. Although the concentrations in the medium
are in most cases at least two orders of magnitude lower than
the intracellular concentrations, the medium volume is so
much larger than the total cell volume (roughly two orders of
magnitude in most laboratory cultivations) that the total
amounts of these metabolites which are dissolved in the
medium can still be very significant. This implies that if the
cells are not separated from the surrounding medium prior to
metabolite extraction, the resulting metabolome measurements
are not representative of the intracellular levels.
Another reason to remove extracellular compounds before
extraction is that certain medium constituents, e.g., sulfate,
phosphate, and chloride, may interfere with the analysis
method applied for metabolite quantification. For example,
when LC-ESI-MS/MS is applied, whereby anion exchange
chromatography is used for the LC separation, these
compounds may elute together with the metabolites of interest,
thereby not only affecting their retention times but also
decreasing the sensitivity of the MS due to competition in the
ESI interface (see ref. 44). Clearly, possible interference of
sample constituents with the analysis method to be applied
should be a point of attention.
7. Applicability of the differential method.
To be able to apply the differential method for quantifica-
tion of the endometabolome, whereby the intracellular metab-
olite levels are determined by subtraction of metabolites
quantified in total broth and in culture filtrate, some boundary
conditions have to be fulfilled. First of all, the intracellular level
of a metabolite can only be quantified with reasonable accuracy
if the amount present inside the cells is significant compared to
the amount present outside, i.e., in the culture medium (both
expressed in amount per amount of biomass present). It should
be clear that if for example more than 90% of a metabolite is
present outside the cells, quantification with the differential
method will not produce reliable results. Furthermore, a
proper determination of the measurement errors is required
to be able to calculate the error in the end result. To minimize
these errors, it is strongly advised to apply IDMS for the
quantification of the metabolite levels in total broth and
supernatant.
Another issue connected with the application of the differ-
ential method is that all components present in the cultivation
medium are also present in the samples. If some of these
interfere with the analysis (see Note 6), the samples should be
diluted before analysis. For this reason, we applied a dilution
step in the described protocol for the differential method. Clearly,
the analysis procedure must then be sensitive enough to quantify
the metabolites in the diluted samples. It should be stressed that
this issue is specific to the applied analysis method as well as to
the composition of the cultivation medium and possible by-product
formation by the cells, and should be verified beforehand.
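
The first boundary condition in Note 7 can be illustrated numerically. The following is a minimal sketch, assuming that the total broth and culture filtrate measurements carry independent errors that add in quadrature; the function name and the example values are hypothetical and are not taken from the protocol.

```python
import math


def differential_intracellular(total_broth, filtrate, sd_total, sd_filtrate):
    """Intracellular amount by the differential method.

    All amounts are expressed per amount of biomass (e.g., umol/gDW).
    The two measurement errors are assumed to be independent, so their
    variances add (an assumption of this sketch).
    """
    intracellular = total_broth - filtrate
    sd = math.sqrt(sd_total ** 2 + sd_filtrate ** 2)
    return intracellular, sd


# Hypothetical metabolite with 90% of the total amount outside the cells;
# each measurement carries a 3% relative error.
amount, sd = differential_intracellular(10.0, 9.0, 0.30, 0.27)
print(f"intracellular = {amount:.2f} +/- {sd:.2f} "
      f"({100 * sd / amount:.0f}% relative error)")
```

In this hypothetical case, two measurements that are each accurate to 3% give an intracellular estimate with a relative error of roughly 40%, which is why the differential method becomes unreliable when most of a metabolite is extracellular.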

References

1. Tang YJ, Martin HG, Myers S, Rodriguez S, Baidoo EE, Keasling JD (2009) Advances in analysis of microbial metabolic fluxes via 13C isotopic labeling. Mass Spectrom Rev 28:362–375
2. Gstaiger M, Aebersold R (2009) Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nat Rev Genet 10:617–627
3. Gupta N, Tanner S, Jaitly N, Adkins JN, Lipton M, Edwards R, Romine M, Osterman A, Bafna V, Smith RD, Pevzner PA (2007) Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res 17(9):1362–1377
4. Gancedo JM, Gancedo C (1973) Concentrations of intermediary metabolites in yeast. Biochimie 55:205–211
5. Wollenberger A, Ristau O, Schoffa G (1960) Eine einfache Technik der extrem schnellen Abkühlung größerer Gewebestücke. Pflügers Arch 270:399–412
6. Williams DH, Lund P, Krebs HA (1967) Redox state of free nicotinamide-adenine dinucleotide in cytoplasm and mitochondria of rat liver. Biochem J 103:514–527
7. Veech RL, Eggleston LV, Krebs HA (1969) Redox state of free nicotinamide-adenine dinucleotide phosphate in cytoplasm of rat liver. Biochem J 115:609–619
8. Faupel RP, Seitz HJ, Tarnowski W, Thiemann V, Weiss C (1972) Problem of tissue sampling from experimental-animals with respect to freezing technique, anoxia, stress and narcosis – new method for sampling rat-liver tissue and physiological values of glycolytic intermediates and related compounds. Arch Biochem Biophys 148:509–522
9. Cole HA, Wimpenny JW, Hughes DE (1967) ATP pool in Escherichia coli. I. Measurement of pool using a modified luciferase assay. Biochim Biophys Acta 143:445–453
10. Saez MJ, Lagunas R (1976) Determination of intermediary metabolites in yeast – critical-examination of effect of sampling conditions and recommendations for obtaining true levels. Mol Cell Biochem 13:73–78
11. De Koning W, Van Dam K (1992) A method for the determination of changes of glycolytic metabolites in yeast on a subsecond time scale using extraction at neutral pH. Anal Biochem 204:118–123
12. Douma RD, de Jonge LP, Jonker CT, Seifar RM, Heijnen JJ, Van Gulik WM (2010) Intracellular metabolite determination in the presence of extracellular abundance: application to the penicillin biosynthesis pathway in Penicillium chrysogenum. Biotechnol Bioeng 107(1):105–115
13. Wittmann C, Krömer JO, Kiefer P, Binz T, Heinzle E (2004) Impact of the cold shock phenomenon on quantification of intracellular metabolites in bacteria. Anal Biochem 327:135–139
14. Bolten CJ, Kiefer P, Letisse F, Portais JC, Wittmann C (2007) Sampling for metabolome analysis of microorganisms. Anal Chem 79:3843–3849
15. Taymaz-Nikerel H, De Mey M, Ras C, Ten Pierick A, Seifar RM, Van Dam JC, Heijnen JJ, van Gulik WM (2009) Development and application of a differential method for reliable metabolome analysis in Escherichia coli. Anal Biochem 386:9–19
16. Canelas AB, Ras C, Ten Pierick A, Van Dam JC, Heijnen JJ, Van Gulik WM (2008) Leakage-free rapid quenching technique for yeast metabolomics. Metabolomics 4:226–239
17. Harrison DE, Maitra PK (1969) Control of respiration and metabolism in growing Klebsiella aerogenes: role of adenine nucleotides. Biochem J 112:647–656
18. Iversen JJL (1981) A rapid sampling valve with minimal dead space for laboratory scale fermentors. Biotechnol Bioeng 23:437–440
19. Theobald U, Mailinger W, Reuss M, Rizzi M (1993) In-vivo analysis of glucose-induced fast changes in yeast adenine nucleotide pool applying a rapid sampling technique. Anal Biochem 214:31–37
20. Larsson G, Törnkvist M (1996) Rapid sampling, cell inactivation and evaluation of low extracellular glucose concentrations during fedbatch cultivation. J Biotechnol 49(1–3):69–82
21. Lange HC, Eman M, Van Zuijlen G, Visser D, Van Dam JC, Frank J, Teixeira de Mattos MJ, Heijnen JJ (2001) Improved rapid sampling for in vivo kinetics of intracellular metabolites in Saccharomyces cerevisiae. Biotechnol Bioeng 75(4):406–415
22. Schaefer U, Boos W, Takors R, Weuster-Botz D (1999) Automated sampling device for monitoring intracellular metabolite dynamics. Anal Biochem 270(1):88–96
23. Weuster-Botz D (1997) Sampling tube device for monitoring intracellular metabolite dynamics. Anal Biochem 246(2):225–233
24. Schaub J, Schiesling C, Reuss M, Dauner M (2006) Integrated sampling procedure for metabolome analysis. Biotechnol Prog 22(5):1434–1442
25. Visser D, Van Zuijlen GA, Van Dam JC, Oudshoorn A, Eman MR, Ras C, Van Gulik WM, Frank J, Van Dedem GWK, Heijnen JJ (2002) Rapid sampling for analysis of in vivo kinetics using the bioscope: a system for continuous pulse experiments. Biotechnol Bioeng 79:674–681
26. Mashego MR, Van Gulik WM, Vinke JL, Visser D, Heijnen JJ (2006) In vivo kinetics with rapid perturbation experiments in Saccharomyces cerevisiae using a second-generation bioscope. Metab Eng 8:370–383
27. Nasution U, Van Gulik WM, Pröll A, Van Winden WA, Heijnen JJ (2006) Generating short term kinetic responses of primary metabolism of Penicillium chrysogenum through glucose perturbation in the bioscope mini reactor. Metab Eng 8:395–405
28. Visser D, Van Zuylen GA, Van Dam JC, Eman MR, Pröll A, Ras C, Wu L, Van Gulik WM, Heijnen JJ (2004) Analysis of in vivo kinetics of glycolysis in aerobic Saccharomyces cerevisiae by application of glucose and ethanol pulses. Biotechnol Bioeng 88:157–167
29. Mashego MR, Van Gulik WM, Heijnen JJ (2007) Metabolome dynamic responses of Saccharomyces cerevisiae to simultaneous rapid perturbations in external electron acceptor and electron donor. FEMS Yeast Res 7:48–66
30. De Mey M, Taymaz-Nikerel H, Baart G, Waegeman H, Maertens J, Heijnen JJ, Van Gulik WM (2010) Catching prompt metabolite dynamics in Escherichia coli with the bioscope at oxygen rich conditions. Metab Eng 12:477–487
31. Lameiras F, Heijnen JJ, Van Gulik WM (2015) Development of tools for quantitative intracellular metabolomics of Aspergillus niger chemostat cultures. Metabolomics 11:1253–1264
32. Villas-Bôas SG, Bruheim P (2007) Cold glycerol–saline: the promising quenching solution for accurate intracellular metabolite analysis of microbial cells. Anal Biochem 370:87–97
33. Bolten CJ, Wittmann C (2008) Appropriate sampling for intracellular amino acid analysis in five phylogenetically different yeasts. Biotechnol Lett 30:1993–2000
34. Oldiges M, Takors R (2005) Applying metabolic profiling techniques for stimulus-response experiments: chances and pitfalls. Technol Transfer Biotechnol 92:173–196
35. Mashego MR, Rumbold K, De Mey M, Vandamme E, Soetaert W, Heijnen JJ (2007) Microbial metabolomics: past, present and future methodologies. Biotechnol Lett 29:1–16
36. Hancock R (1958) The intracellular amino acids of Staphylococcus aureus: release and analysis. Biochim Biophys Acta 28:402–412
37. Hommes FA (1964) Oscillatory reductions of pyridine nucleotides during anaerobic glycolysis in brewers yeast. Arch Biochem Biophys 108:36–46
38. Gale EF (1947) J Gen Microbiol 1:53–76
39. Work E (1949) Biochim Biophys Acta 3:400–411
40. Fuerst R, Wagner RP (1957) An analysis of the free intracellular amino acids of certain strains of Neurospora. Arch Biochem Biophys 70:311–326
41. Bent KJ, Morton AG (1964) Amino acid composition of fungi during development in submerged culture. Biochem J 92:260–269
42. Maharjan RP, Ferenci T (2003) Global metabolite analysis: the influence of extraction methodology on metabolome profiles of Escherichia coli. Anal Biochem 313:145–154
43. Rabinowitz JD, Kimball E (2007) Acidic acetonitrile for cellular metabolome extraction from Escherichia coli. Anal Chem 79:6167–6173
44. Canelas AB, Ten Pierick A, Ras C, Seifar RM, Van Dam JC, Van Gulik WM, Heijnen JJ (2009) Quantitative evaluation of intracellular metabolite extraction techniques for yeast metabolomics. Anal Chem 81:7379–7389
45. Bergmeyer HU (1983) Methods of enzymatic analysis. VCH Publishers, Weinheim
46. Seifar RM, Ras C, Van Dam JC, Van Gulik WM, Heijnen JJ, Van Winden WA (2009) Simultaneous quantification of free nucleotides in complex biological samples using ion pair reversed phase liquid chromatography isotope dilution tandem mass spectrometry. Anal Biochem 388:213–219
47. Wu L, Mashego MR, Van Dam JC, Pröll A, Vinke JL, Ras C, Van Winden WA, Van Gulik WM, Heijnen JJ (2005) Quantitative analysis of the microbial metabolome by isotope dilution mass spectrometry using uniformly 13C-labeled cell extracts as internal standards. Anal Biochem 336:164–171
48. Maleki Seifar R, Zhao Z, Van Dam JC, Van Winden WA, Van Gulik WM, Heijnen JJ (2008) Quantitative analysis of metabolites in complex biological samples using ion-pair reversed-phase liquid chromatography-isotope dilution tandem mass spectrometry. J Chrom A 1187:103–110
49. Mashego MR, Van Gulik WM, Vinke JL, Heijnen JJ (2003) Critical evaluation of sampling techniques for residual glucose determination in carbon-limited chemostat culture of Saccharomyces cerevisiae. Biotechnol Bioeng 83:395–399
50. Van Dam JC, Eman MR, Frank J, Lange HC, Van Dedem GWK, Heijnen JJ (2002) Analysis of glycolytic intermediates in Saccharomyces cerevisiae using anion exchange chromatography and electrospray ionization with tandem mass spectrometric detection. Anal Chim Acta 460:209–218

Chapter 3

Investigation of Protein–Lipid Interactions Using Native Mass Spectrometry
Xiao Cong, John W. Patrick, Yang Liu, Xiaowen Liang, Wen Liu,
and Arthur Laganowsky

Abstract
Integral membrane proteins are embedded in biological membranes where various lipids modulate their
structure and function. There is a critical need to elucidate how these lipids participate in the physiological
and pathological processes associated with membrane protein dysfunction. Native mass spectrometry
(MS), combined with ion mobility spectrometry (IM), is emerging as a powerful tool to probe
membrane protein complexes and their interactions with ligands, lipids, and other small molecules. Unlike
other biophysical approaches, native IM-MS can resolve individual ligand/lipid binding events. We have
developed a novel method using native MS, coupled with a temperature-control apparatus, to determine
the thermodynamic parameters of individual ligand or lipid binding events to proteins. This approach has
been validated using several soluble protein–ligand systems wherein MS results are compared with those
acquired from conventional biophysical techniques, such as isothermal titration calorimetry (ITC) and
surface plasmon resonance (SPR). Using these principles, it is possible to elucidate the thermodynamics of
individual lipid binding to integral membrane proteins. Herein, we use the ammonia channel (AmtB) from
Escherichia coli as a model membrane protein. Remarkably, distinct thermodynamic signatures for AmtB
binding to lipids with different headgroups and acyl chain configurations are observed. Additionally, using a
mutant form of AmtB that abolishes a specific lipid binding site, distinct changes have been discovered in
the thermodynamic signatures compared with the wild-type, implying that these signatures can identify key
residues involved in specific lipid binding and potentially differentiate between specific lipid binding sites.
This chapter provides procedures and findings associated with temperature-controlled native MS as a novel
approach to interrogate membrane proteins and their interactions with lipids and other molecules.

Key words Membrane protein, Lipid, Native mass spectrometry, Ion mobility-mass spectrometry,
Temperature control, Binding thermodynamics

1 Introduction

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_3, © Springer Science+Business Media, LLC, part of Springer Nature 2022

Membrane proteins are among the most important pharmaceutical
drug targets, with a staggering 60% of drugs on the market targeting
integral membrane proteins [1, 2]. The membrane environment
comprises a chemically diverse and dynamic landscape of
lipid molecules. Alongside the complexity of the biological
membrane is the growing realization of the important roles that
lipid molecules play in modulating the structure and function of
membrane proteins and how these protein–lipid interactions participate
in physiological processes such as molecular recognition,
channel regulation, and cell signaling [3–9]. Owing to technical
limitations, protein–lipid interactions in biological membranes
remain poorly understood at the molecular level.
Membrane protein–ligand/lipid interactions are governed by
their binding thermodynamics, specifically the change in Gibbs free
energy upon binding (ΔG) with contributions from enthalpy (ΔH)
and entropy (TΔS) [10]. Under constant pressure and standard
conditions, ΔG of binding can be calculated from the equilibrium
association (KA) or dissociation constant (KD = 1/KA), using the
equation: ΔG = −RT ln KA [11, 12]. Binding processes that occur
spontaneously are characterized by a negative ΔG, while at equilibrium
the free energy change is equal to zero. Under these conditions, KA or KD
values represent the binding affinity of the ligand for the protein.
Tight binding between a protein and ligand can be expressed as a
larger KA, a smaller KD, or a more negative ΔG value. Binding
enthalpy (ΔH) reflects the energy change of the system upon
binding as a result of the formation or disruption of non-covalent
interactions such as hydrogen bonds and van der Waals interactions
[13–15]. Binding entropy (ΔS) indicates the change in the
disorder or degrees of freedom of the system [13–15]. The release of
previously ordered solvent molecules from the binding surface
results in a favorable entropic contribution, whereas the
binding-induced conformational ordering of proteins contributes
unfavorably to the binding entropy. Taken together, binding ther-
modynamics provides a quantitative description of the binding
energetics and insights into molecular forces that drive binding
reactions. Therefore, the selection of potential drug candidates
and lead optimization are intimately tied to protein–lipid/ligand
binding thermodynamics [15, 16].
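
The relationships in this paragraph can be made concrete with a short calculation. The following is a minimal sketch, assuming a 1 M standard state for KD; the function names, the example KD, and the example ΔH are hypothetical and serve only to illustrate ΔG = −RT ln KA = RT ln KD and ΔG = ΔH − TΔS.

```python
import math

R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Return the binding free energy in kJ/mol from KD (in M).

    Uses dG = -RT ln KA = RT ln KD, with KD taken relative to a 1 M
    standard state (an assumption of this sketch).
    """
    if kd_molar <= 0:
        raise ValueError("KD must be positive")
    return R * temperature_k * math.log(kd_molar) / 1000.0


def entropic_term(delta_g, delta_h):
    """Return -T*dS (kJ/mol), rearranged from dG = dH - T*dS."""
    return delta_g - delta_h


# Hypothetical ligand: KD = 1 uM at 25 degC with an assumed dH of -50 kJ/mol.
dg = delta_g_from_kd(1e-6)
minus_tds = entropic_term(dg, -50.0)
print(f"dG = {dg:.1f} kJ/mol, -TdS = {minus_tds:.1f} kJ/mol")
```

A smaller KD gives a more negative ΔG, in line with the description of tight binding above; in this hypothetical example the favorable enthalpy is partly offset by an unfavorable entropic term.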
Currently, various biophysical techniques are available for
detecting protein–ligand binding, such as fluorescence spectros-
copy, nuclear magnetic resonance (NMR), isothermal titration cal-
orimetry (ITC), and surface plasmon resonance (SPR) [17]. Over
the past decade, native MS has emerged as a powerful tool to study
intact proteins/protein complexes and protein–ligand interactions
in their native-like states. Native MS differs from traditional
non-native MS methods, in which acidified organic solvents are
commonly employed to ionize analytes and/or chemical perturbation
of proteins prior to analysis is relied on for binding detection [18–
20]. As a rapid and sensitive technique, native MS is well suited for
investigating protein–ligand binding owing to its ability to preserve
non-covalent interactions in the gas phase and resolve individual
binding events. In contrast, conventional biophysical techniques
report on an ensemble of apo and ligand-bound states of the
protein, making the determination of individual energetic contributions
difficult. Current methods employed for thermodynamic
analysis suffer from drawbacks such as the requirement for surface
immobilization/labeling, which might disturb ligand binding, the
need for large amounts of sample, and difficulty in elucidating
binding stoichiometry [17]. These limitations prompted the development
of native MS-based methods to interrogate individual
ligand/lipid binding events, obtain useful information about
cooperativity/allostery in ligand/lipid binding, and possibly detect
conformational changes influenced by ligands/lipids when combined
with ion mobility spectrometry (IMS).

2 Materials

2.1 Plasmid Construction, Protein Expression, and Purification

1. Plasmid for expression of target soluble and membrane proteins.
Here, we use a pET15-based vector that expresses the E. coli
ammonium channel (AmtB) as a TEV protease-cleavable
N-terminal fusion to pelB (signal sequence), 10× His-tag,
and maltose binding protein (MBP) (see ref. 3). The soluble
nitrogen regulatory protein, GlnK, from E. coli was expressed
as a TEV-cleavable N-terminal fusion to Strep-tag II (see ref.
21). MBP was expressed from pET28b with a thrombin-cleavable
N-terminal 6× His-tag (see ref. 21).
2. Tobacco etch virus (TEV) protease purchased or produced
in-house (see ref. 22).
3. Thrombin.
4. E. coli Rosetta 2(DE3) pLysS.
5. E. coli ArcticExpress (DE3) RIL.
6. E. coli BL21 (DE3) Gold.
7. E. coli OverExpress C43(DE3).
8. Terrific broth (TB) medium.
9. Lysogeny broth (LB) Miller medium: 5 g yeast extract, 10 g
peptone from casein, and 10 g sodium chloride per liter.
10. Chloramphenicol.
11. Kanamycin.
12. Isopropyl 1-thio-β-D-galactopyranoside (IPTG).
13. BioSpectrometer® basic.
14. M-110P microfluidizer.
15. Optima™ XPN ultracentrifuge.
16. TBS buffer: 20 mM tris(hydroxymethyl)methylamine (Tris),
pH = 7.4 at room temperature, 150 mM sodium chloride.
17. Resuspension buffer: 50 mM Tris, pH = 7.4 at room temperature,
300 mM sodium chloride.
18. ÄKTApure and ÄKTAxpress chromatography system.
19. NHA buffer: 50 mM Tris, pH = 7.4 at room temperature,
300 mM sodium chloride, 20 mM imidazole, and 10%
glycerol.
20. NHB buffer: 50 mM Tris, pH = 7.4 at room temperature,
300 mM sodium chloride, 500 mM imidazole, and 10%
glycerol.
21. Gel filtration (GF) buffer: 50 mM Tris, pH = 7.4 at room
temperature, 100 mM sodium chloride, and 10% glycerol.
22. d-Desthiobiotin.
23. Hen egg white lysozyme.
24. cOmplete™ Protease Inhibitor Cocktail.
25. 2-Mercaptoethanol (BME).
26. n-Octyl-β-D-glucopyranoside (OG).
27. n-dodecyl-β-D-maltopyranoside (DDM).
28. Tetraethylene glycol monooctyl ether (C8E4) detergent.
29. Amicon Ultra Centrifugal Filter Units.
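
The buffer recipes in this list can be converted into weighed amounts with a few lines of code. The following is a minimal sketch, assuming that Tris is weighed as Tris base and using standard molar masses; the dictionary and function names are hypothetical. Glycerol is left out because the list does not state whether the 10% is v/v or w/v, and pH adjustment is not covered.

```python
# Molar masses in g/mol (standard values; Tris is assumed to be Tris base).
MOLAR_MASS = {"Tris": 121.14, "NaCl": 58.44, "imidazole": 68.08}

# Concentrations in mM, as listed for the NHA and NHB buffers.
NHA = {"Tris": 50, "NaCl": 300, "imidazole": 20}
NHB = {"Tris": 50, "NaCl": 300, "imidazole": 500}


def grams_needed(recipe_mm, volume_l):
    """Return the mass (g) of each solid needed for volume_l liters of buffer."""
    return {
        name: conc_mm / 1000.0 * MOLAR_MASS[name] * volume_l
        for name, conc_mm in recipe_mm.items()
    }


for label, recipe in (("NHA", NHA), ("NHB", NHB)):
    masses = grams_needed(recipe, volume_l=1.0)
    print(label, {name: round(mass, 2) for name, mass in masses.items()})
```

The same function can be reused for the TBS, resuspension, and GF buffers by supplying the corresponding concentrations.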

2.2 MS Experiments

1. Temperature-controlled centrifuge.
2. Centrifugal buffer exchange device (Micro Bio-Spin 6, Bio-Rad).
3. Ammonium acetate (AA) buffer: 200 mM AA, pH = 7.3 at
room temperature, adjusted with ammonium hydroxide.
4. Synapt G1 High-Definition Mass Spectrometer (HDMS) with
a 32 k radio frequency (RF) generator (Waters Corporation)
(Note: other instruments can also be used but will require
tuning of instrument parameters so that resolved mass spectra
are obtained while the native-like protein state is preserved,
as discussed further in Subheading 4).
5. Borosilicate glass tubing with filament (Sutter Instrument).
6. P-97 Flaming/Brown-type microcapillary puller (Sutter
Instrument).
7. High vacuum gold sputter coater (ACE200, Leica).
8. Home-made temperature-control apparatus setup (Fig. 1)
[21]:
(a) Fix two 120 mm central processing unit (CPU) fans to the
source chamber of the mass spectrometer to direct the
external air into the chamber passing over a heat sink
(40 × 40 × 14.5 mm, Advanced Thermal Solutions)
attached to a Peltier thermoelectric chip
(40 × 40 × 3.5 mm, 12 V, 5 A, Adafruit Industries).

Fig. 1 Schematic and photographs of the temperature control apparatus fixed to
the source chamber of the mass spectrometer. (a, b) Shown are the heatsink
and thermoelectric chip along with the direction of airflow (yellow and blue
arrows in (a)) through the use of two central processing unit (CPU) fans, and the
temperature probe (T-type thermocouple) to monitor Tair to get a desired Tsample.
(c) A plot of Tsample when moving the nano-electrospray stage into and out of the
source chamber. (d) A calibration curve example for Tsample as a function of Tair.
Figure adapted with permission from ref. 21 (Copyright 2016 American Chemical
Society)

(b) Power the thermoelectric chip with a direct current
(DC) power supply and adjust the applied voltage to
heat the incoming external airflow to a desired
temperature.
(c) Connect a small T-type thermocouple (0.23 mm
O.D., IT-24P, Physitemp Instruments) to a USB-TC01
thermocouple measurement device (National Instruments)
to digitally record the temperature. The small
thermocouple can be inserted directly into ~2 μL of sample
solution within the gold-coated capillary used for
nano-electrospray ionization (nESI) in order to monitor
sample temperature (Tsample). Tsample reaches equilibrium
within ~40 s after the nESI stage is moved
into the source chamber, and the temperature can be held
within 0.3 °C (Fig. 1c).
(d) Generate a calibration curve for Tsample as a function of
source air temperature (Tair). This makes it possible to set a
desired Tsample by monitoring Tair alone, which avoids potential
cross contamination from re-inserting the delicate
thermocouple probe into the next sample (Fig. 1d).
(e) Insert an insulating barrier into the source chamber to
direct airflow and prevent heat transfer from the source,
which is heated in our experiments. Generally, the capillary
used for nESI is about 2.5 cm long with an inner
diameter (I.D.) of ~1 mm at the base and ~0.04 mm at the tip. The
liquid volume in our experiments is usually ~2 μL, resulting
in a liquid column inside the capillary on the order of
~0.25 cm in length. As the whole capillary is incubated
in the source chamber of the mass spectrometer at a set
temperature, and given that the liquid column is short, the
temperature gradient along it can be considered
negligible.
9. Ligands:
(a) D-(+)-Maltose.
(b) Maltotriose.
(c) N,N′,N″-Triacetylchitotriose (NAG3).
(d) Adenosine 5′-diphosphate (ADP) (Sigma-Aldrich).
10. Lipid solutions:
(a) 1-Palmitoyl-2-oleoyl-sn-glycero-3-phospho-(1′-rac-glycerol)
(POPG).
(b) 1-Palmitoyl-2-oleoyl-sn-glycero-3-phosphoethanolamine
(POPE).
(c) 1-Palmitoyl-2-oleoyl-sn-glycero-3-phosphate (POPA).
(d) 1-Palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine
(POPS).
(e) 1,1′,2,2′-Tetraoleoyl-cardiolipin (TOCDL).
(f) 1,2-Dilauroyl-sn-glycero-3-phospho-(1′-rac-glycerol)
(DLPG).
(g) 1,2-Dimyristoyl-sn-glycero-3-phospho-(1′-rac-glycerol)
(DMPG).
(h) 1,2-Dipalmitoyl-sn-glycero-3-phospho-(1′-rac-glycerol)
(DPPG) (Avanti Polar Lipids Inc.).
11. Biacore 3000 optical biosensor with research-grade CM5 sen-
sor chips (GE Healthcare/Biacore AB).
12. PBS buffer: 10 mM sodium phosphate, pH = 7.4, 150 mM
sodium chloride.
13. Amine-coupling kit containing N-(3-dimethylaminopropyl)-
N′-ethylcarbodiimide hydrochloride (EDC), N-hydroxysuccinimide
(NHS), and ethanolamine (pH 8.5) (GE Healthcare/
Biacore AB).
14. MicroCal VP-ITC instrument.
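
The calibration described in item 8(d) of this list can be sketched in code. The example below is a minimal illustration, assuming that Tsample depends linearly on Tair over the working range; the calibration points and function names are hypothetical and are not measured values from the chapter.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def t_air_for(t_sample_target, slope, intercept):
    """Invert the calibration to find the Tair that gives the target Tsample."""
    return (t_sample_target - intercept) / slope


# Hypothetical calibration points (degC): Tair read by the source probe and
# Tsample read by the thermocouple inserted into the capillary.
t_air = [20.0, 25.0, 30.0, 35.0, 40.0]
t_sample = [19.0, 23.5, 28.0, 32.5, 37.0]

slope, intercept = fit_line(t_air, t_sample)
print(f"Tsample = {slope:.3f} * Tair + {intercept:.3f}")
print(f"Tair needed for Tsample = 25 degC: {t_air_for(25.0, slope, intercept):.2f}")
```

Once the fit is available, only Tair needs to be monitored during an experiment, so the thermocouple does not have to be re-inserted into each new sample. If the measured curve is visibly nonlinear, a different model would be needed.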

2.3 Data Analysis Software

1. MassLynx™ V4.1 software (Waters Corporation).
2. Protein unfolding for ligand stabilization and ranking (PULSAR)
software (http://pulsar.chem.ox.ac.uk/) [23].
3. Universal deconvolution of mass and ion mobility spectra
(UniDec) software (http://unidec.chem.ox.ac.uk/) [24].
4. Self-coded protein–ligand/lipid binding analysis program
using Python, based on equations described in Subheading 3.

3 Methods

3.1 Soluble Protein Expression

1. Transform the MBP and GlnK expression plasmids into
E. coli Rosetta 2(DE3) pLysS and E. coli ArcticExpress (DE3)
RIL, respectively.
2. Grow colonies overnight at 37 °C in 50 mL of LB medium
supplemented with either chloramphenicol (45 μg/mL) and
kanamycin (50 μg/mL) for MBP or kanamycin (50 μg/mL) for GlnK.
3. Inoculate 1 L of LB with 7 mL of overnight culture and grow at
37 °C until the OD600 reaches 0.8.
4. Chill the cell cultures on ice prior to adding isopropyl
1-thio-β-D-galactopyranoside (IPTG) to a final concentration
of 1 mM and 0.1 mM for MBP and GlnK, respectively.
5. Grow the cells for 24 h at 20 °C.
6. Harvest the cells by centrifugation at 5000 × g for 10 min,
wash once with TBS buffer, re-pellet, and store the pellets at
−80 °C.
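
The induction step above specifies final IPTG concentrations only. As a minimal sketch of the dilution arithmetic (C1·V1 = C2·V2), the following assumes a 1 M IPTG stock, which is a hypothetical value not given in the protocol, and neglects the small volume added to the culture; the function name is also hypothetical.

```python
def stock_volume_ml(final_mm, culture_volume_ml, stock_mm=1000.0):
    """Return the stock volume (mL) needed to reach final_mm in the culture.

    Uses C1 * V1 = C2 * V2 and neglects the volume of stock added.
    The default stock concentration (1 M) is an assumption.
    """
    if stock_mm <= final_mm:
        raise ValueError("stock must be more concentrated than the target")
    return final_mm * culture_volume_ml / stock_mm


# 1 L cultures induced at 1 mM (MBP) and 0.1 mM (GlnK).
for protein, final_mm in (("MBP", 1.0), ("GlnK", 0.1)):
    volume = stock_volume_ml(final_mm, culture_volume_ml=1000.0)
    print(f"{protein}: add {volume:.2f} mL of stock per liter of culture")
```

With a different stock concentration, only the `stock_mm` argument needs to change.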

3.2 Soluble Protein Purification

1. Thaw the cell pellets and resuspend them in NHA buffer.
Unless otherwise stated, all purification steps are carried
out at 4 °C.
2. Lyse the cells with 1–3 passes through an M-110P microfluidizer
at 20,000 psi.
3. Clarify the cell lysate by centrifugation (30 min at 30,000 × g).
4. Collect the supernatant containing recombinant 6×
His-tagged MBP or Strep-tag II-tagged GlnK.
5. Filter the supernatant with a syringe filtration device (0.45 μm).
6. Load 6× His-tagged MBP onto a HisTrap 5 mL column and elute
with NHB buffer. Load Strep-tag II-tagged GlnK onto a StrepTrap
HP 5 mL column and elute it with TBS buffer containing
2.5 mM D-desthiobiotin. Store the peak fractions of 6×
His-tagged MBP and Strep-tag II-tagged GlnK.
7. Inject the 6× His-tagged MBP peak fraction onto a HiPrep 26/10
desalting column equilibrated in NHA buffer with imidazole
omitted.
8. Digest the purified 6× His-tagged MBP with thrombin
(3 units of thrombin per mg of protein) overnight at room
temperature.
9. Pass digested MBP over a 5 mL HisTrap HP column to remove
tagged (undigested) protein. Concentrate the untagged protein
using a centrifugal concentrator.
10. Load the concentrated protein onto a HiLoad 16/600 Superdex
75 pg column equilibrated in GF buffer.
11. Pool and concentrate the peak fractions containing MBP or
Strep-tag II-tagged GlnK.
12. Aliquot and flash-freeze the protein. Store at −80 °C.
13. For hen egg white lysozyme, directly dissolve the protein in
TBS buffer on ice prior to use.
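
The centrifugation steps above are specified as relative centrifugal force (e.g., 30,000 g), whereas many centrifuges are set in rpm. The following is a minimal sketch of the standard conversion RCF = 1.118 × 10⁻⁵ × r × rpm², with r in cm; the rotor radius used here is a hypothetical value and must be replaced by the radius of the rotor actually used.

```python
import math

RCF_FACTOR = 1.118e-5  # RCF = 1.118e-5 * radius_cm * rpm**2


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) at a given speed and rotor radius."""
    return RCF_FACTOR * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (RCF_FACTOR * radius_cm))


ROTOR_RADIUS_CM = 10.0  # hypothetical rotor radius; check the rotor manual

for rcf in (5000, 30000):
    rpm = rpm_from_rcf(rcf, ROTOR_RADIUS_CM)
    print(f"{rcf} x g at r = {ROTOR_RADIUS_CM} cm: about {rpm:.0f} rpm")
```

Because the required speed scales with the inverse square root of the radius, the same g-force corresponds to noticeably different rpm settings on different rotors.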

3.3 Membrane Protein Expression

1. Transform the MBP-AmtB plasmid into E. coli BL21 (DE3)
Gold cells and the MBP-AmtB(N72A/N79A) plasmid into E. coli
OverExpress C43(DE3) cells.
2. Inoculate several colonies into 50 mL LB Miller medium and
grow overnight at 37 °C.
3. Inoculate each liter of LB in 2 L shaker flasks with 7 mL of
overnight culture and grow at 37 °C until the OD600 reaches
between 0.6 and 0.8.
4. Add IPTG to the culture at a final concentration of 0.5 mM
and grow for 3 h at 37 °C.
5. Collect the cells by centrifugation at 5000 × g for 10 min at
4 °C, wash them once with TBS buffer, re-pellet them, and
store the pellets at −80 °C.
Investigation of Protein–Lipid Interactions Using Native Mass Spectrometry 49

3.4 Membrane Protein Purification

1. Thaw the MBP–AmtB cell pellets and resuspend them in resuspension buffer supplemented with a cOmplete™ Protease Inhibitor Cocktail tablet and 5 mM BME. The protocol for AmtB has been optimized for native MS studies; other membrane proteins may require a detergent screening approach to remove co-purified lipids (see Note 1).
2. Lyse the cells with 4–5 passes through an M-110P microflui-
dizer at 20,000 psi.
3. Clarify the cell lysate by centrifugation at 20,000 × g for 25 min at 4 °C.
4. Pellet the membranes by centrifugation at 100,000 × g for 2 h at 4 °C and resuspend them in 20 mM Tris buffer (pH 7.4 at room temperature) containing 100 mM sodium chloride, 10% glycerol, and 5 mM BME.
5. Homogenize the membrane suspension using a Potter-Elvehjem Teflon pestle and glass tube, or a similar glass tissue grinder/homogenizer.
6. Add 200 mM OG to the membrane suspension for membrane protein extraction and incubate with gentle agitation overnight at 4 °C.
7. Clarify the extract by centrifugation at 20,000 × g for 25 min at 4 °C.
8. Filter the supernatant before loading onto a 5 mL HisTrap-HP
column equilibrated in NHA buffer supplemented with
0.025% DDM.
9. After loading, wash the column initially with 40–50 mL of
NHA buffer supplemented with 1% OG before exchanging
the proteins into several column volumes of NHA buffer sup-
plemented with 0.025% DDM until a steady baseline is
reached.
10. Elute the membrane protein fusions with a linear gradient to 100% NHB buffer supplemented with 0.025% DDM over two column volumes.
11. Pool the peak fractions and add BME to a final concentration of 5 mM and His-tagged TEV protease (10 units of TEV protease per mg of protein). Incubate overnight at 4 °C.
12. Filter the sample and pass over a 5 mL HisTrap-HP column
equilibrated in NHA buffer supplemented with 0.025% DDM.
13. Collect the flow-through containing the untagged membrane
protein and concentrate it using a 100 kDa MWCO centrifugal
concentrator.
14. Use the purified protein immediately, or flash-freeze it in liquid nitrogen and store at −80 °C.

3.5 Ligand or Lipid Solution Preparation

1. Prepare stock solutions by dissolving each ligand (powder form) in AA buffer to a final concentration of 0.1–1 mM.
2. For each ligand, perform a serial dilution to make sets of
solutions containing different concentrations of ligands for
titration experiments.
3. Lipids, received as concentrated stocks in chloroform, can be
dried under a stream of nitrogen gas, followed by vacuum
desiccation overnight at room temperature to remove residual
solvent.
4. Prepare stock solutions of lipids by resuspending dried lipid
films in water supplemented with 0.5% C8E4 and 5 mM BME
to a final concentration of 10 mM.
5. Determine the concentration of stock solutions using a phos-
phorus assay [25, 26].
6. Perform a serial dilution of each lipid stock in AA buffer sup-
plemented with 0.5% C8E4 to make sets of solutions containing
different concentrations of lipids for titration experiments.
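
The serial dilutions described above can be planned with a few lines of Python. This is only a minimal sketch: the function names, the two-fold dilution factor, the eight-point series, and the 1 mM stock are illustrative assumptions, not values prescribed by the protocol. The second helper reflects the equal-volume mixing of ligand and protein solutions used in the titration experiments, which halves each concentration.

```python
def serial_dilution(stock_conc, dilution_factor, n_points):
    """Return the concentrations of a serial dilution, starting with the stock.

    All concentrations share the unit of stock_conc.
    """
    if stock_conc <= 0 or dilution_factor <= 1 or n_points < 1:
        raise ValueError("need stock_conc > 0, dilution_factor > 1, n_points >= 1")
    return [stock_conc / dilution_factor ** k for k in range(n_points)]


def after_equal_volume_mixing(conc):
    """Concentration after mixing with an equal volume of another solution."""
    return conc / 2.0


# Hypothetical example: 1 mM (1000 uM) ligand stock, two-fold steps, 8 points.
ligand_series_uM = serial_dilution(1000.0, 2.0, 8)

# Final ligand concentrations once each solution is mixed 1:1 with protein.
final_ligand_uM = [after_equal_volume_mixing(c) for c in ligand_series_uM]

for stock, final in zip(ligand_series_uM, final_ligand_uM):
    print(f"dilution: {stock:9.3f} uM -> in mixture: {final:9.3f} uM")
```

The same halving applies to the protein, so the concentrations entered into the binding model should be the post-mixing values.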

3.6 Protein–Ligand/Lipid Titration Experiments Using Native MS

1. Buffer exchange the purified soluble protein and the membrane protein into AA buffer and AA buffer supplemented with 0.5% C8E4, respectively, using a Micro Bio-Spin 6 centrifugal buffer exchange device.
2. Determine the soluble and membrane protein concentrations with the DC Protein Assay kit using bovine serum albumin as the standard, and then dilute the protein sample to a concentration of 0.1–1 μM.
3. Load 3 μL of the diluted protein solution into a gold-coated capillary tip for MS analysis.
4. Tune the mass spectrometer parameters to achieve native-like conditions for soluble or membrane proteins (see refs. 27, 28, Note 2, and Subheading 4 for further discussion). For the soluble proteins used in this chapter, the instrument is set to a source pressure of 6.2–6.4 mbar, capillary voltage of 1.4 kV, sampling cone voltage of 20 V, extractor cone voltage of 3.0 V, trap collision voltage of 20 V, collision gas (argon) flow rate of 4 mL/min (3.6 × 10⁻² mbar), and T-wave settings (velocity/height) for trap, IMS, and transfer of 100 m s⁻¹/0.5 V, 100 m s⁻¹/4.0 V, and 100 m s⁻¹/3.0 V, respectively. The source temperature (50 °C) and trap bias (22 V) are optimized for soluble proteins. For the model membrane protein–lipid binding systems, instrument parameters are tuned to maximize ion intensity while preserving the native-like state of AmtB, monitored by the drift time of protein ions using IM coupled with MS. The instrument is set to a capillary voltage of 1.7 kV, sampling cone voltage of 200 V, extractor cone voltage of 10 V, and argon flow rate of 7 mL/min (5.2 × 10⁻² mbar). The T-wave settings for trap (300 m s⁻¹/2.0 V), IMS (300 m s⁻¹/20 V), and transfer (100 m s⁻¹/10 V), the source temperature (90 °C), and the trap bias (35 V) are also optimized.
5. Equilibrium analysis: choose one ligand/lipid solution from the serial dilution sets and mix 1.2 μL of it with 1.2 μL of diluted protein solution. Load the protein–ligand/lipid mixture into a gold-coated capillary tip and incubate it in the source chamber of the mass spectrometer set at Tsample = 25 °C. Collect mass spectra every 5 min during the incubation until the spectrum no longer changes, which indicates that equilibrium has been reached. Notably, for the protein–ligand/lipid binding systems used in this chapter, only 5–10 min of incubation is needed, as we do not observe a change in the mole fractions of apo or ligand-bound proteins with longer incubation times (Fig. 2) [21].
6. Titration experiments (Fig. 3): mix 1.2 μL of diluted protein solution with 1.2 μL of each ligand/lipid solution in the serial dilution sets. Load each protein–ligand/lipid mixture into a gold-coated capillary tip, incubate it in the source chamber of the mass spectrometer for 5–10 min at each Tsample (such as 25, 29, 33, and 37 °C), and record mass spectra at each temperature.

Fig. 2 Evaluating the time for samples to reach equilibrium. Mole fraction of membrane protein–lipid complexes incubated for 5 min or 24 h. Shown are data for 1.5 μM AmtB mixed with 10 μM POPG at Tsample = 25 °C for 5 min in the mass spectrometer or 24 h in a thermal cycler set at 25 °C. Reported are the mean and standard deviation from repeated measurements using different gold-coated capillary tips (n = 3). Figure adapted with permission from ref. 21 (Copyright 2016 American Chemical Society)

Fig. 3 Scheme of the titration for binding thermodynamics measurement: protein
is mixed with a given ligand concentration followed by incubation and acquisition
of mass spectra at a desired Tsample. This procedure is repeated for either a
different ligand concentration or temperature. After a series of mass spectra are
recorded, the spectra are deconvoluted to obtain the mole fractions of free and
bound states and then globally fit to a binding model. The resulting equilibrium
association constants are used to determine binding thermodynamics through
van’t Hoff analysis [29]. Figure adapted with permission from ref. 21 (Copyright
2016 American Chemical Society)

3.7 Native MS Data Analysis

1. Import the MS raw file series into PULSAR or UniDec software to convert the RAW file format to a text format containing MS and IM-MS data [23].
2. Open each MS text file in UniDec and deconvolute each mass spectrum [24]. Input the “charge range” and “mass range” based on the charge state and mass of the desired sample. Optimize the fitting parameters, such as “sample mass every (Da)” and “Peak FWHM (full width at half maximum),” until R² reaches its highest value. Output the intensities of the protein (P) and protein–ligand/lipid (PLn, n = number of ligands/lipids bound) species to a text file.
3. Divide the intensity of each species by the sum intensity of all
the species to convert to the mole fraction of each species.
4. Because the mole fractions of P and PLn depend on the apparent equilibrium association constant (KA), the following sequential binding model can be applied to soluble protein–ligand systems to deduce KA [21]. For protein binding to one ligand:

\[ P + L \rightleftharpoons PL, \qquad K_A = \frac{[PL]}{[P][L]} \tag{1} \]

and binding to multiple ligands:

\[ PL_{n-1} + L \rightleftharpoons PL_n, \qquad K_{An} = \frac{[PL_n]}{[PL_{n-1}][L]} \tag{2} \]

where KAn is the equilibrium association constant for the nth ligand binding to the species PLn−1. [P]total represents the total protein concentration added:

\[ [P]_{\mathrm{total}} = [P] + \sum_{i=1}^{n} [PL_i] = [P] + \sum_{i=1}^{n} [P][L]^{i} \prod_{j=1}^{i} K_{Aj} \tag{3} \]

Equations 2 and 3 can be combined to calculate the mole fraction (Fn) of PLn:

\[ F_n = \frac{[PL_n]}{[P]_{\mathrm{total}}} = \frac{[L]^{n} \prod_{j=1}^{n} K_{Aj}}{1 + \sum_{i=1}^{n} [L]^{i} \prod_{j=1}^{i} K_{Aj}} \tag{4} \]

where [L] is the free ligand concentration at equilibrium, which can be calculated as follows:

\[ [L] = [L]_{\mathrm{total}} - [P]_{\mathrm{total}} \sum_{i=1}^{n} i F_i \tag{5} \]

To obtain KAn, globally fit the mole fractions as a function of [L] at a given temperature to the sequential binding model through minimization of the pseudo-χ² function:

\[ \chi^2 = \sum_{i=0}^{n} \sum_{j=1}^{d} \left( F_{i,j,\mathrm{exp}} - F_{i,j,\mathrm{calc}} \right)^2 \tag{6} \]

where n is the number of bound ligands and d is the number of experimental mole fraction data points. Results of GlnK–ADP binding thermodynamics are shown in Fig. 4.
5. For membrane protein–lipid binding systems, account for the fact that lipids readily self-associate, as is evident in some mass spectra, which alters the lipid concentration available for binding (Fig. 5). For lipid self-assembly:

\[ L + L \rightleftharpoons L_2, \qquad K_{AGG1} = \frac{[L_2]}{[L]_{\mathrm{avail}}^{2}} \tag{7} \]

\[ \ldots \quad L_{n-1} + L \rightleftharpoons L_n, \qquad K_{AGG(n-1)} = \frac{[L_n]}{[L_{n-1}][L]_{\mathrm{avail}}} \tag{8} \]

where [L]avail is the concentration of lipids “free” to bind proteins and KAGG(n−1) is the equilibrium association constant for the lipid aggregate (Ln−1) to bind another lipid to form Ln. Thus,

\[ [L]_{\mathrm{total}} = [L]_{\mathrm{avail}} + \sum_{i=2}^{n} i\,[L]_{\mathrm{avail}}^{i} \prod_{j=2}^{i} K_{AGG(j-1)} \tag{9} \]

Fig. 4 Soluble protein–ligand binding thermodynamics determined by native MS. Shown are (a) stacked representative mass spectra of GlnK titrated with serial ADP concentrations, (b) stacked plots of mole fraction of apo GlnK and GlnK(ADP)1–3 as a function of free ADP concentration collected at 25, 29, 33, and 37 °C, (c) van’t Hoff plots, and (d) binding thermodynamic parameters for binding of one to three ADP molecules. Shown in (c) and (d) are the mean and standard deviation (n = 3) (Figure adapted with permission from ref. 21. Copyright 2016 American Chemical Society)

Although we observe lipid aggregates in the mass spectrum (Fig. 5a), it is difficult to quantify these species owing to a mixture of states, including those bound to adducts such as C8E4 or other detergents that are ubiquitous in many membrane protein systems. With these factors in mind, we assume that KAGG is equal for each binding step and that n = 2 (i.e., that L2 is the most abundant lipid aggregate species), which simplifies Eq. 9 to:

\[ [L]_{\mathrm{total}} = [L]_{\mathrm{avail}} + 2 K_{AGG} [L]_{\mathrm{avail}}^{2} \tag{10} \]

[L]avail can be solved as:

\[ [L]_{\mathrm{avail}} = \frac{-1 + \sqrt{1 + 8 K_{AGG} [L]_{\mathrm{total}}}}{4 K_{AGG}} \tag{11} \]
Incorporating this lipid self-association model introduces
one additional fitting parameter KAGG and essentially divides
the total lipid into a fraction “free” to bind and a fraction that
cannot bind. In comparison to the sequential binding model

Fig. 5 Membrane protein–lipid binding data fit to different binding models. (a) Representative mass spectrum showing POPE self-assembly and POPE–detergent complexes. (b, c) Plots of mole fraction AmtB(POPE)0–5 determined from the titration series of POPE (dots) and resulting fit (solid lines) using a (b) sequential binding model (R² = 0.96, χ² = 0.112) or (c) modified sequential lipid-binding model (R² = 0.99, χ² = 0.019). (d) Plot of available POPE concentrations calculated with Eq. 11 as a function of the total POPE concentrations titrated. The KAGG applied is extracted from the fit in (c) using the modified sequential lipid-binding model, which in this example equals 7.26 μM (Figure adapted with permission from ref. 21. Copyright 2016 American Chemical Society)

for soluble proteins, the lipid binding model resulted in a better fit that is statistically justified (F-test, p < 0.001) (Fig. 5b, c). Therefore, this model can be applied to determine KAn for each membrane protein–lipid binding event. Notably, if applying this model results in a poorer fit to the data, it is not used. Although more sophisticated models could be considered, such as those including protein–detergent or lipid–detergent interactions, we have observed similar KAn values (within experimental error) with our lipid binding model when varying either the protein or detergent concentration. Therefore, we opt for the simplest model that describes the experimental data, especially for a relative comparison of the binding thermodynamics of different lipids to the same membrane protein.

6. Plot the natural logarithm of KAn as a function of the reciprocal of temperature to determine the enthalpy change (ΔH), entropy change (TΔS), and change in Gibbs free energy (ΔG) upon binding using the van’t Hoff equation [29]:

\[ \ln K_A = -\frac{\Delta G}{RT} = \frac{T\Delta S - \Delta H}{RT} = -\frac{\Delta H}{R} \cdot \frac{1}{T} + \frac{\Delta S}{R} \tag{12} \]

Note that the model membrane protein–lipid binding systems used in this chapter can be fitted with the linear van’t Hoff equation, whereas some other systems, especially protein–protein and other interactions, require nonlinear van’t Hoff equations [30] (also see Note 3).
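
The calculations in this subheading can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the authors' analysis code: the function names are hypothetical, concentrations are assumed to be in mol/L with association constants in L/mol, and the free ligand concentration of Eq. 5 is found here by simple bisection. A global fit would wrap `mole_fractions` in an optimizer of choice that minimizes the pseudo-χ² of Eq. 6.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def mole_fractions(free_ligand_conc, ka):
    """Eq. 4: mole fractions F_0..F_n for free ligand [L] and stepwise KA values."""
    terms = [1.0]  # i = 0 term, the apo protein
    for k in ka:
        # Builds [L]^i * prod_{j<=i} K_Aj incrementally.
        terms.append(terms[-1] * k * free_ligand_conc)
    total = sum(terms)
    return [t / total for t in terms]


def free_ligand(l_total, p_total, ka, rel_tol=1e-10):
    """Eq. 5: solve [L] = [L]total - [P]total * sum(i * F_i) by bisection."""
    def residual(l_free):
        f = mole_fractions(l_free, ka)
        bound = p_total * sum(i * fi for i, fi in enumerate(f))
        return l_free + bound - l_total  # increases monotonically with l_free

    lo, hi = 0.0, l_total
    while hi - lo > rel_tol * l_total:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def available_lipid(l_total, k_agg):
    """Eq. 11: lipid available for binding when only dimers (L2) are considered."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * k_agg * l_total)) / (4.0 * k_agg)


def vant_hoff(temps_kelvin, ka_values):
    """Eq. 12: least-squares line through ln KA versus 1/T.

    Returns (delta_H in kJ/mol, delta_S in kJ/mol/K).
    """
    x = [1.0 / t for t in temps_kelvin]
    y = [math.log(k) for k in ka_values]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx                   # equals -delta_H / R
    intercept = y_mean - slope * x_mean  # equals delta_S / R
    return -slope * R, intercept * R


# Hypothetical example: 1 uM protein, 10 uM ligand, two binding steps.
ka_example = [1.0e6, 5.0e5]
l_free = free_ligand(1.0e-5, 1.0e-6, ka_example)
print("free ligand (M):", l_free)
print("mole fractions:", mole_fractions(l_free, ka_example))
```

For membrane protein–lipid systems, `available_lipid` would be applied to the total lipid concentration first, and its result used in place of the total ligand concentration.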

3.8 Validation of MS Method by SPR and ITC Measurements

1. Native MS-based methods can be validated by comparing the results obtained for soluble protein–ligand systems with those from other conventional biophysical techniques, such as SPR and ITC.
2. SPR experiments are performed using a Biacore 3000 optical biosensor with research-grade CM5 sensor chips as follows. Prepare sensor surfaces at 25 °C using an amine-coupling kit at a flow rate of 5 μL/min in PBS buffer. For the lysozyme system, activate the flow cell with 35 μL of a 1:1 mixture of 0.4 M EDC and 0.1 M NHS, inject 5 μg/mL lysozyme (in 10 mM sodium acetate, pH 5.5) until the surface density reaches ~2000 RU, and finally deactivate the lysozyme-coupled surface with 1.0 M ethanolamine (pH 8.5) for 5 min. For the MBP system, after 10 min of activation, inject a solution of MBP (50 μg/mL in 10 mM sodium acetate, pH 4.0), then deactivate the surface with 10 injections (1 min at 50 μL/min) of AA buffer and five injections of 100 μM maltotriose (in AA buffer) until the MBP surface is stable. This modification is applied in order to immobilize sufficient MBP onto the sensor surface. Prepare two surfaces (with slight differences in density) for each protein and use one flow cell without coupled protein as a reference to subtract systematic noise. Collect SPR response data for the soluble molecules binding to the protein surfaces at 25, 29, 33, and 37 °C. At each temperature, inject samples prepared by serial dilution (factor of 3, in duplicate) in AA running buffer over the protein and reference surfaces for 30 s at a flow rate of 50 μL/min. Evaluate the data by calculating the binding response (Req, the average response over 5 s during the steady-state phase) for each injection and plotting it as a function of ligand concentration (A). KD can be obtained by fitting a single-site binding curve to the data using the equation Req = Rmax × A/(KD + A), where Rmax is the maximum response when the active surface is saturated. Perform van’t Hoff analysis to determine the binding thermodynamic constants for each compound from the KD values measured by SPR at different temperatures.

3. ITC experiments can be carried out using the MicroCal VP-ITC instrument at one specific temperature (29 °C here) in AA buffer as follows. For each experiment, inject 10 μL (5 μL for the first injection) of either 500 μM maltose or 450 μM maltotriose into 40 μM MBP, or of 2 mM NAG3 into 200 μM lysozyme. Space the injections (at a speed of 0.5 μL/s) at 300 s intervals and set the stirring speed to 300 rpm. Subtract the background data of ligands injected into AA buffer from the ITC data of the protein–ligand systems. Integrate the titration data and fit to a one-site binding model using MicroCal Origin software to obtain the thermodynamic binding parameters.
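
The single-site fit used for the SPR data, Req = Rmax × A/(KD + A), can be illustrated with a short standard-library Python sketch. The function names and the synthetic data are hypothetical, and the search strategy is an assumption chosen for simplicity: for any trial KD the best Rmax has a closed-form least-squares solution, so only KD is searched, here by golden-section search on log10(KD) over a window around the sampled concentrations. The sketch assumes the sum of squared residuals has a single minimum in that window; dedicated evaluation software would normally be used in practice.

```python
import math


def best_rmax(conc, resp, kd):
    """Closed-form least-squares Rmax for a fixed KD."""
    g = [a / (kd + a) for a in conc]
    return sum(r * gi for r, gi in zip(resp, g)) / sum(gi * gi for gi in g)


def sum_squared_error(conc, resp, kd):
    """Sum of squared residuals for a given KD with its optimal Rmax."""
    rmax = best_rmax(conc, resp, kd)
    return sum((r - rmax * a / (kd + a)) ** 2 for a, r in zip(conc, resp))


def fit_single_site(conc, resp, iterations=200):
    """Fit Req = Rmax * A / (KD + A); returns (KD, Rmax).

    KD is searched between 0.01x the lowest and 100x the highest concentration.
    """
    lo = math.log10(min(conc)) - 2.0
    hi = math.log10(max(conc)) + 2.0
    phi = (math.sqrt(5.0) - 1.0) / 2.0  # golden-section ratio
    for _ in range(iterations):
        x1 = hi - phi * (hi - lo)
        x2 = lo + phi * (hi - lo)
        if sum_squared_error(conc, resp, 10.0 ** x1) < sum_squared_error(conc, resp, 10.0 ** x2):
            hi = x2
        else:
            lo = x1
    kd = 10.0 ** (0.5 * (lo + hi))
    return kd, best_rmax(conc, resp, kd)


# Synthetic, noise-free example: three-fold dilution series from 100 uM,
# generated with KD = 5 uM and Rmax = 120 RU (illustrative values only).
conc_uM = [100.0 / 3.0 ** k for k in range(8)]
resp_RU = [120.0 * a / (5.0 + a) for a in conc_uM]
kd_fit, rmax_fit = fit_single_site(conc_uM, resp_RU)
print(f"KD = {kd_fit:.3f} uM, Rmax = {rmax_fit:.2f} RU")
```

Repeating the fit at each temperature gives the KD values needed for the van’t Hoff analysis, using KA = 1/KD.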

4 Notes

1. Detergent Screening for Membrane Proteins.
It is often challenging to investigate membrane protein–lipid/ligand complexes, primarily due to the difficulty in their expression, extraction from the membrane, and purification. Detergent screening is typically required when isolating proteins from the membrane to achieve a purified sample that is de-lipidated and stable for IM-MS analyses [28]. A variety of
detergents have been screened for their ability to both permit
protein ionization into the gas phase and preserve the stability
of the membrane protein complexes [28, 31]. Polyethylene
glycol type (e.g., C8E4) and amine oxide type (e.g., lauryldi-
methylamine-N-oxide, LDAO) detergents demonstrate charge
reducing behavior for many protein ions while saccharide
detergents (e.g., n-octyl-β-D-glucoside, OG) lead to charge
states closer to the Rayleigh limit of the droplet produced
during ESI [31]. Charge reduction is favorable as lower protein
charge states yield lower surface charge densities, which can
increase the probability of maintaining native-like conforma-
tions for native IM-MS studies. Some detergents, while ubiq-
uitous for crystallographic and solution-phase structural
studies (e.g., n-dodecyl-β-D-maltopyranoside, DDM), are not
ideal for MS owing to higher collision energy requirements to
release proteins from these solubilizing detergent micelles. It
should be noted that these general findings are not universal
rules that can be easily and directly applied to all membrane
proteins. Detergent screening and instrument optimization
(discussed in Note 2) are required for each protein of interest
in order to obtain highly resolved mass spectra of membrane
proteins devoid of detergent (and endogenous lipid) for inter-
rogating both their structure and ligand/lipid binding events.

2. MS Instrument Tuning.
It has been known for some time that instrument tuning
can have dramatic effects on the mass spectra obtained for
protein ions [27, 28]. Acquiring data for the same protein of
interest using different types of mass spectrometers can often-
times yield marked differences in their mass spectra. It is imper-
ative that the parameters of each instrument be tuned to obtain
resolved spectra that demonstrate good signal-to-noise ratios.
Adequate tuning ensures that proteins retain their folded,
native-like state, even after removal of detergent and solvent
within the mass spectrometer. These parameters include, but
are not limited to, source temperature, source pressure, capil-
lary voltage, sampling cone voltage, trap and transfer cell colli-
sion voltage, trap and IMS cell bias potentials, as well as other
parameters unique to each mass spectrometer’s function [32].
To assess the extent of protein folding/unfolding in the gas
phase, the collision cross section (CCS) or arrival time distri-
bution of protein ions obtained from IM coupled with MS can
be plotted as a function of collision voltage (or internal energy
applied), which reveals a protein ion’s unfolding pathway from
a native-like conformation at low E, through intermediates, to
a partially unfolded structure (Fig. 6a, b) [3]. Additionally, the
mole fraction of apo and ligand/lipid-bound protein species
can be plotted against the collision voltage/energy applied to
determine the energy range within which the complex does not
undergo dissociation in the mass spectrometer (Fig. 6c).

Fig. 6 Optimization of native mass spectrometry (MS) settings for membrane proteins. (a) Experimental and (b) modeled gas phase unfolding plots of the 15+ ion of AmtB (fitting χ² = 2.96). Extraction regions for native-like (green) and the first (blue), second (gray), and third (purple) partially unfolded states are shown. (c) Plots of the mole fraction of apo and lipid-bound AmtB as a function of collision voltage for 1.5 μM AmtB mixed with 16.7 μM POPE. Based on these results, a collision voltage of 60 V is selected for native MS experiments where the native-like state of AmtB is preserved (Figure adapted with permission from ref. 21. Copyright 2016 American Chemical Society)

It should be noted that the KD values, as well as other
thermodynamic parameters for membrane protein–lipid inter-
actions, are acquired based on the experimental conditions
wherein the detergent type, detergent concentration, instru-
ment type, and instrument parameters can be observed to
influence the protein–lipid interactions. In this work, the bind-
ing between AmtB and lipids is likely a result of replacement of
detergent molecules with lipids. These events include protein–
detergent dissociation, lipid–detergent dissociation, and pro-
tein–lipid association. Taken together, the thermodynamic
parameters obtained from the methods described in this work
should be described as “apparent” instead of “intrinsic” values.
Notably, we observed that similar thermodynamic parameters
were obtained using MS as compared to traditional ITC and
SPR methods for several soluble protein–ligand binding sys-
tems. However, conventional biophysical techniques as
described above cannot at present be used to deduce individual
lipid binding events to membrane proteins. Attention should
be paid when comparing the values obtained from this method
with other conventional biophysical methods such as ITC and
SPR, although a relative comparison of different lipids binding
to the same protein should be applicable and meaningful. We
emphasize that MS serves as a novel complementary tool in the
study of membrane protein–lipid interactions and not a direct
replacement for these techniques.
3. Thermodynamic View of Protein–Ligand/Lipid Interactions.
Prior to the investigation of membrane protein–lipid interactions, it was first necessary to validate the previously described MS method using soluble protein–ligand binding systems (Table 1) [21]. Remarkably, the ΔG (or KD) values obtained from native MS, SPR, and ITC are similar to those reported previously [33–37]. However, the contributions from enthalpy and entropy show some discrepancies among these three biophysical techniques. Thermodynamic values obtained using native MS and ITC are in good agreement for maltose binding protein (MBP). For MBP binding maltotriose, the thermodynamic parameters obtained using native MS and SPR are in good agreement, whereas ITC yields different values. The thermodynamic values for lysozyme binding NAG3 obtained using native MS, SPR, and ITC are all in disagreement. These discrepancies can be rationalized by the different experimental conditions used in each approach. More specifically, the concentrations of protein and ligand required for ITC measurements are ~20 and ~200 times the KD, respectively, which can lead to aggregation of the analyte. In fact, heavy precipitation was observed in ITC

Table 1
Thermodynamic parameters for soluble protein–ligand binding

Parameter              Method   MBP–maltose     MBP–maltotriose   Lysozyme–NAG3
ΔH (kJ/mol)            MS       −14.1 ± 0.89    −26.5 ± 1.30      −28.4 ± 1.43
                       SPR      –               −23.9 ± 1.21      −43.2 ± 2.49
                       ITC      −7.69 ± 0.22    −11.9 ± 0.70      −35.8 ± 0.65
TΔS (298 K, kJ/mol)    MS       20.8 ± 0.87     10.5 ± 1.26       0.01 ± 1.40
                       SPR      –               13.1 ± 1.19       −15.6 ± 2.45
                       ITC      25.2            22.9              −7.70
ΔG (298 K, kJ/mol)     MS       −34.9 ± 1.76    −37.0 ± 2.56      −28.4 ± 2.83
                       SPR      –               −37.0 ± 2.40      −27.6 ± 4.94
                       ITC      −32.9 ± 0.22    −34.8 ± 0.70      −28.1 ± 0.65

experiments for GlnK during the initial titration of ADP in both ammonium acetate and PBS buffer, diminishing the utility of this technique for obtaining useful thermodynamic parameters. In contrast to ITC, the protein concentration used for native MS experiments is on the order of 0.1–1.0 μM and no more than the KD of a given analyte. In SPR experiments, the protein is immobilized on the sensor surface, which may introduce conformational changes and reduce the entropic contribution of the analyte. Although the absolute ΔH and ΔS values obtained from different biophysical techniques show some discrepancies, relative comparisons of these thermodynamic parameters using a given technique can provide insight into molecular interactions. In this regard, the binding thermodynamics for a membrane protein binding lipid(s) provide a unique opportunity to characterize the influence of lipid physical characteristics, such as different headgroups or acyl chain lengths, on binding.
Fundamental questions regarding membrane protein–lipid binding can be addressed using temperature-controlled MS, such as (a) what is the effect of headgroup or acyl chain length on lipid binding thermodynamics, and (b) are the binding thermodynamics representative of specific binding modes for different lipids? (Fig. 7).
(a) To address the first question, van’t Hoff analyses were performed for (1) five phospholipids with the same acyl chains but different headgroups or (2) PG lipids with various acyl chain lengths. Interestingly, the five phospholipids with various headgroups show similar ΔG values but significantly different (p < 0.01, one-way ANOVA) thermodynamic parameters, ΔH and TΔS (Fig. 7c). POPS and POPG have greater enthalpy compared to the other three lipids, implying that favorable bonds are formed

Fig. 7 Native MS reveals thermodynamic signatures of individual lipid binding events to AmtB. (a) Representative mass spectrum in the series of AmtB titrated with POPA. (b) Plots of mole fraction for AmtB and AmtB(POPA)1–5 determined from a titration series of POPA (dots) and resulting fit (R² = 0.99) from a sequential lipid-binding model (solid lines) collected at 29 °C. (c) Binding thermodynamics for POPE, POPG, POPS, POPA, and TOCDL to AmtB determined through van’t Hoff analysis for binding of the first, second, and third lipid (labeled as 1–3). Shown above are the headgroup structures (p < 0.01 for ΔH and –TΔS; p > 0.9 for ΔG, one-way ANOVA, n = 3). (d) Thermodynamics of AmtB binding PG lipids with increasing acyl chain length: DLPG, DMPG, and DPPG. Trend lines are plotted for binding the first, second, and third lipid (p < 0.01 for ΔH and –TΔS; p > 0.9 for ΔG, one-way ANOVA, n = 3). (e) Thermodynamics of the AmtB double mutant (N72A/N79A) binding the first, second, and third POPG or POPE molecule. Reported in (c–e) are the mean and standard deviation (n = 3) (Figure adapted with permission from ref. 21. Copyright 2016 American Chemical Society)

between their headgroups and the protein. Binding of POPG to AmtB is entropically unfavorable, which is in agreement with the crystal structure of AmtB in complex with PG [3], where a conformational change is observed upon PG binding. A comparable entropic penalty is observed for POPS, suggesting this lipid may bind in a similar fashion. In contrast, POPA1–3 binding is entropically and enthalpically favorable, implying a different binding pathway, driven by hydrophobic and van der Waals interactions. Interestingly, an entropic penalty is observed with each additional lipid-binding event, possibly due to structuring of the membrane protein (reducing disorder). In addition, for each replicate, we observe a trend of increasing KD values for sequential lipid-binding events that is in line with negative cooperativity, which might be driven by the entropic penalty. Native MS experiments using PG with varying tail lengths have also been performed (Fig. 7d). Intriguingly, the results reveal enthalpy–entropy compensation [15, 38], with the thermodynamic signature of binding becoming more hydrophobic in nature with increasing chain length, consistent with the physicochemical properties of lipids [39].
(b) To address the second question, experiments were performed for POPG binding to AmtB(N72A/N79A), a double mutant of AmtB engineered to remove a specific PG-binding site that is mediated by two asparagine residues [3]. The thermodynamic signature for POPG binding to the AmtB mutant was significantly different (p < 0.01, one-way ANOVA) from that of the wild type, exhibiting favorable entropy and reduced enthalpy, in accord with a reduction in the bonds formed as a result of the mutations introduced into the protein (Fig. 7e). POPG may bind either to the same site in a different binding mode or to a unique site on the AmtB(N72A/N79A) mutant. In contrast, the thermodynamic signatures change only subtly between POPE binding to the mutant and to wild-type AmtB (p > 0.1, one-way ANOVA). In short, these results demonstrate that membrane protein–lipid interactions exhibit unique thermodynamic signatures. These signatures demonstrate the utility of our approach for understanding the thermodynamic contributions of specific residues in binding lipids or other molecules.
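
Because Note 3 compares ΔG (or KD), ΔH, and TΔS across techniques, it can help to see how these quantities relate numerically. The following Python sketch is illustrative only: the function names and the input values are hypothetical, KD is assumed to be expressed in mol/L relative to a 1 M standard state, and the temperature defaults to the 298 K used for the tabulated values.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def delta_g_from_kd(kd_molar, temp_kelvin=298.0):
    """Binding free energy from the dissociation constant: dG = RT ln(KD)."""
    return R_KJ * temp_kelvin * math.log(kd_molar)


def kd_from_delta_g(delta_g_kj, temp_kelvin=298.0):
    """Dissociation constant (mol/L) from the binding free energy in kJ/mol."""
    return math.exp(delta_g_kj / (R_KJ * temp_kelvin))


def delta_g_from_enthalpy_entropy(delta_h_kj, t_delta_s_kj):
    """dG = dH - T*dS, with both terms in kJ/mol."""
    return delta_h_kj - t_delta_s_kj


# Hypothetical signature: dH = -30 kJ/mol and T*dS = +5 kJ/mol at 298 K.
dg = delta_g_from_enthalpy_entropy(-30.0, 5.0)
print(f"dG = {dg:.1f} kJ/mol, KD = {kd_from_delta_g(dg) * 1e6:.2f} uM")
```

Two lipids with the same ΔG, and hence the same KD, can therefore differ substantially in how ΔH and TΔS contribute, which is the basis for comparing thermodynamic signatures.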

Acknowledgments

We thank Dr. David H. Russell (Texas A&M University), Dr. David Clemmer (Indiana University), and Dr. Magnus Höök (Texas A&M Health Science Center) for useful discussions. This work was funded by new faculty startup funds from Texas A&M University and the National Institutes of Health (grant DP2GM123486).

References
1. Yildirim MA, Goh KI, Cusick ME, Barabasi AL, Vidal M (2007) Drug-target network. Nat Biotechnol 25(10):1119–1126
2. Overington JP, Al-Lazikani B, Hopkins AL (2006) How many drug targets are there? Nat Rev Drug Discov 5(12):993–996
3. Laganowsky A, Reading E, Allison TM, Ulmschneider MB, Degiacomi MT, Baldwin AJ, Robinson CV (2014) Membrane proteins bind lipids selectively to modulate their structure and function. Nature 510(7503):172–175
4. Lee AG (2011) Biological membranes: the importance of molecular detail. Trends Biochem Sci 36(9):493–500
5. Contreras FX, Ernst AM, Wieland F, Brugger B (2011) Specificity of intramembrane protein-lipid interactions. Cold Spring Harb Perspect Biol 3(6):a004705
6. Bogdanov M, Dowhan W, Vitrac H (2014) Lipids and topological rules governing membrane protein assembly. Biochim Biophys Acta 1843(8):1475–1488
7. Singer SJ, Nicolson GL (1972) The fluid mosaic model of the structure of cell membranes. Science 175(4023):720–731
8. Poveda JA, Giudici AM, Renart ML, Molina ML, Montoya E, Fernandez-Carvajal A, Fernandez-Ballester G, Encinar JA, Gonzalez-Ros JM (2014) Lipid modulation of ion channels through specific binding sites. Biochim Biophys Acta 1838(6):1560–1567
9. Jiang QX, Gonen T (2012) The influence of lipids on voltage-gated ion channels. Curr Opin Struct Biol 22(4):529–536
10. Eisenberg DS, Crothers DM (1979) Physical chemistry: with applications to the life sciences. Benjamin/Cummings Publishing Company, Menlo Park, CA
11. Gilson MK, Zhou HX (2007) Calculation of protein-ligand binding affinities. Annu Rev Biophys Biomol Struct 36:21–42
16. Freire E (2006) Overcoming HIV-1 resistance to protease inhibitors. Drug Discov Today Dis Mech 3(2):281–286
17. Vuignier K, Schappler J, Veuthey JL, Carrupt PA, Martel S (2010) Drug-protein binding: a critical review of analytical tools. Anal Bioanal Chem 398(1):53–66
18. Loo JA (1997) Studying noncovalent protein complexes by electrospray ionization mass spectrometry. Mass Spectrom Rev 16(1):1–23
19. Gulbakan B, Barylyuk K, Zenobi R (2015) Determination of thermodynamic and kinetic properties of biomolecules by mass spectrometry. Curr Opin Biotechnol 31:65–72
20. Pacholarz KJ, Garlish RA, Taylor RJ, Barran PE (2012) Mass spectrometry based tools to investigate protein-ligand interactions for drug discovery. Chem Soc Rev 41(11):4335–4355
21. Cong X, Liu Y, Liu W, Liang X, Russell DH, Laganowsky A (2016) Determining membrane protein-lipid binding thermodynamics using native mass spectrometry. J Am Chem Soc 138(13):4346–4349
22. Laganowsky A, Benesch JL, Landau M, Ding L, Sawaya MR, Cascio D, Huang Q, Robinson CV, Horwitz J, Eisenberg D (2010) Crystal structures of truncated alphaA and alphaB crystallins reveal structural mechanisms of polydispersity important for eye lens function. Protein Sci 19(5):1031–1043
23. Allison TM, Reading E, Liko I, Baldwin AJ, Laganowsky A, Robinson CV (2015) Quantifying the stabilizing effects of protein-ligand interactions in the gas phase. Nat Commun 6:8551
24. Marty MT, Baldwin AJ, Marklund EG, Hochberg GK, Benesch JL, Robinson CV (2015) Bayesian deconvolution of mass and ion mobility spectra: from binary interactions to polydisperse ensembles. Anal Chem 87(8):4370–4376
12. Gibbs JW (1873) A method of geometrical 25. Subbarow CHF (1925) The colorimetric
representation of the thermodynamic proper- determination of phosphorus. J Biol Chem
ties of substances by means of surfaces. Trans 66:26
Conn Acad Arts Sci 2:23 26. Chen PS, Toribara TY, Warner H (1956)
13. Keserü G, Swinney DC, Mannhold R, Microdetermination of phosphorus. Anal
Kubinyi H, Folkers G (2015) Thermodynamics Chem 28(11):3
and kinetics of drug binding. Wiley-VCH, 27. Ruotolo BT, Benesch JL, Sandercock AM,
Weinheim Hyung SJ, Robinson CV (2008) Ion
14. Raffa RB (2001) Drug-receptor thermody- mobility-mass spectrometry analysis of large
namics: introduction and applications. Wiley, protein complexes. Nat Protoc 3
Chichester (7):1139–1152
15. Klebe G (2015) Applying thermodynamic 28. Laganowsky A, Reading E, Hopper JT, Robin-
profiling in lead finding and optimization. Nat son CV (2013) Mass spectrometry of intact
Rev Drug Discov 14(2):95–110
64 Xiao Cong et al.

membrane protein complexes. Nat Protoc 8 and 2-oxoglutarate. J Biol Chem 285
(4):639–651 (40):31037–31045
29. van’t Hoff MJH (1884) Etudes de dynamique 35. Telmer PG, Shilton BH (2003) Insights into
chimique. Recl Trav Chim Pays-Bas 3 the conformational equilibria of maltose-
(10):333–336 binding protein by analysis of high affinity
30. Prabhu NV, Sharp KA (2005) Heat capacity in mutants. J Biol Chem 278(36):34555–34567
proteins. Annu Rev Phys Chem 56:521–548 36. Maple HJ, Garlish RA, Rigau-Roca L, Porter J,
31. Reading E, Liko I, Allison TM, Benesch JL, Whitcombe I, Prosser CE, Kennedy J, Henry
Laganowsky A, Robinson CV (2015) The role AJ, Taylor RJ, Crump MP, Crosby J (2012)
of the detergent micelle in preserving the struc- Automated protein-ligand interaction screen-
ture of membrane proteins in the gas phase. ing by mass spectrometry. J Med Chem 55
Angew Chem Int Ed Engl 54(15):4577–4581 (2):837–851
32. Liu Y, Cong X, Liu W, Laganowsky A (2017) 37. Maple HJ, Scheibner O, Baumert M, Allen M,
Characterization of membrane protein-lipid Taylor RJ, Garlish RA, Bromirski M, Burnley
interactions by mass spectrometry ion mobility RJ (2014) Application of the Exactive Plus
mass spectrometry. J Am Soc Mass Spectrom EMR for automated protein-ligand screening
28(4):579–586 by non-covalent mass spectrometry. Rapid
33. Daneshfar R, Kitova EN, Klassen JS (2004) Commun Mass Spectrom 28(13):1561–1568
Determination of protein-ligand association 38. Dunitz JD (1995) Win some, lose some:
thermochemistry using variable-temperature enthalpy-entropy compensation in weak inter-
nanoelectrospray mass spectrometry. J Am molecular interactions. Chem Biol 2
Chem Soc 126(15):4786–4787 (11):709–712
34. Radchenko MV, Thornton J, Merrick M 39. Marsh D (2013) Handbook of lipid bilayers,
(2010) Control of AmtB-GlnK complex for- 2nd edn. Taylor & Francis Group, Boca Raton,
mation by intracellular levels of ATP, ADP, FL, p 387
Chapter 4

Western Blot Processing Optimization: The Perfect Blot


Russ Yukhananov, David P. Chimento, and Laura A. Marlow

Abstract
Western blot processing is a well-established procedure that includes protein extraction from tissues and
cells, gel electrophoresis separation, transfer to a membrane, and immunodetection with specific antibodies.
Here, we show that optimization of washing helps to maximize the specific interactions of antigens and
antibodies. Performing all washing steps at 4 °C increases the signal-to-noise ratio and reduces
nonspecific signals.

Keywords Western blot, Washing temperature, Antibody specificity, Immunodetection, Electrophoresis,
Automation

1 Introduction

Protein analysis is a critical step for understanding microbiome composition, for diagnostics,
and for the development of new medications. There are multiple methods to analyze proteins in
cells, tissues, and biological fluids. Some of them involve measuring the physical and chemical
properties of the proteins, their sequence, or their properties based on interaction with other
molecules. The Western blot uses protein-specific antibodies to detect proteins, in combination
with gel-based protein separation to make protein detection more reliable and specific. The
traditional Western blot (WB) remains an important method to detect and quantify proteins in
tissue, cells, and blood. The WB is in many cases the method of choice in microbiology, where
it can be used to investigate protein concentrations, enzyme activity, cellular localization,
and protein–protein interactions, or to monitor posttranslational modifications such as
cleavage, phosphorylation [1], ubiquitinylation [2], glycosylation [3], methylation [4], and
SUMOylation [5].
The critical step for the success of immunodetection of a protein is the availability of
specific antibodies. A specific antibody is a requirement, but it is not the only factor
contributing to the quality of results in the WB. This chapter focuses on details of the
Western blot procedure that are often neglected in many laboratories, particularly the
immunodetection of proteins on a solid support. The original method of measuring proteins and
peptides by using specific antibodies, radioimmunoassay (RIA), was developed to measure the
concentration of soluble proteins [6]. RIA is a very efficient method to measure peptides;
however, it has a significant limitation, since the majority of proteins are not easily
obtained in soluble form. The WB emerged as a method for protein detection when it was shown
that proteins can be analyzed while attached to a solid support [7, 8]. The technique became
the preferred method for protein detection and analysis and still cannot be replaced by other,
more stringent and quantitative methods such as mass spectrometry, Raman and other
spectroscopic methods, and physical applications that require the soluble form of proteins.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_4, © Springer Science+Business Media, LLC, part of Springer Nature 2022

In recent years, significant efforts have been directed toward increasing the reliability,
precision, and quantification of proteins using the WB. One of the problems is the complicated
kinetics of the interaction between reagents in solution and proteins attached to a solid
support.
The common steps of protein detection by Western blot
(Fig. 1) are: (1) protein extraction from tissue or cells, (2) separa-
tion based upon molecular weight, (3) transfer to solid support,
and (4) hybridization with antibodies. These steps can be done
manually or automatically, but this does not change the essence of
the procedure, which requires transferring proteins from the three-
dimensional space of the cell to the two-dimensional support. The
transition to two dimensions in most cases involves separating
proteins based on their molecular weight or electrical charge.
Most often, the separation is done based upon the molecular
weight, which allows a more uniform protein transfer to solid
support.
In order to obtain reliable and reproducible results, it is imperative to optimize all the
steps of protein detection, taking into account that the kinetics of the antigen–antibody
interaction on a solid support differ from those of the same interaction in solution.
There are many publications describing extraction, electropho-
resis, and protein transfer to the membrane. But the crucial step of
immunodetection has attracted much less attention and has not
been carefully documented.

2 Materials

All solutions have been prepared using purified water and analytical
grade reagents (see Note 1).
Fig. 1 The Western blot workflow

NP-40 lysis buffer:
150 mM sodium chloride.
1.0% NP-40.
50 mM Tris, pH 8.0.
RIPA buffer (radioimmunoprecipitation assay buffer):
150 mM sodium chloride.
1.0% NP-40 or Triton X-100.
50 mM Tris, pH 8.0.
0.1% SDS (sodium dodecyl sulfate).
0.5% sodium deoxycholate.
The 10% sodium deoxycholate stock solution (5 g in 50 ml) must be protected from light.
Laemmli 2× buffer (loading buffer):
4% SDS.
10% 2-mercaptoethanol.
20% glycerol.
0.004% bromophenol blue.
0.125 M Tris–HCl.
Adjust the pH to 6.8 using HCl (see Note 2).
Transfer buffer:
25 mM Tris, pH 7.5.
200 mM glycine.
20% methanol.
Adjust the pH to 8.3.
1× running buffer (Tris-glycine/SDS):
25 mM Tris–HCl.
200 mM glycine.
0.1% (w/v) SDS.
Adjust the pH to 8.3.
Tris–glycine 10× or 1× stock solution is commercially available from multiple vendors. For
proteins larger than 80 kDa, we recommend that SDS be included at a final concentration of 0.1%.
PBST buffer:
100 mM sodium phosphate dibasic.
20 mM potassium phosphate monobasic.
27 mM KCl.
1.37 M NaCl.
0.1–0.5% Tween-20.
Check the pH and adjust to pH 7.4 ± 0.2 (see Note 2).
Blocking buffer:
3% milk or 3% goat serum is added to PBST buffer, mixed well, and filtered. Failure to filter
can lead to spotting, where tiny dark grains contaminate the blot during detection.
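
The recipes in this section are given as molar concentrations, so preparing them means converting each concentration into a mass to weigh out. A minimal sketch of that arithmetic follows. The molar masses are standard reference values that are not stated in this chapter, and the function name `grams_needed` is purely illustrative.

```python
# Approximate molar masses in g/mol. These are standard reference values, not
# taken from this chapter; confirm them against the reagent label before use.
MOLAR_MASS_G_PER_MOL = {
    "Tris base": 121.14,
    "glycine": 75.07,
    "sodium chloride": 58.44,
}


def grams_needed(compound: str, millimolar: float, volume_l: float) -> float:
    """Return the mass in grams that gives `millimolar` mM in `volume_l` liters."""
    if millimolar < 0 or volume_l <= 0:
        raise ValueError("concentration must be >= 0 and volume must be > 0")
    return MOLAR_MASS_G_PER_MOL[compound] * (millimolar / 1000.0) * volume_l


if __name__ == "__main__":
    # Transfer buffer, 1 L: 25 mM Tris and 200 mM glycine (methanol added separately).
    for compound, millimolar in [("Tris base", 25), ("glycine", 200)]:
        mass = grams_needed(compound, millimolar, 1.0)
        print(f"{compound}: {mass:.1f} g per liter")
```

The same function applies to any component whose molar mass is added to the dictionary; percentage components such as SDS or methanol are not covered by this sketch.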

The antibodies were obtained from commercial sources: rabbit anti-human IL-2, mouse
anti-alpha tubulin, rabbit anti-PARP antibody, anti-rabbit horseradish peroxidase-linked
secondary antibodies, DyLight™ 549 conjugated anti-mouse IgG, and DyLight™ 649 conjugated
anti-rabbit IgG (Rockland Immunochemicals, Limerick, PA).

3 Methods

The Western blot consists of the following steps (Fig. 1, permission is granted by Precision
Biosystems):
1. Sample preparation.
2. Electrophoresis.
3. Transfer to membrane.
4. Blocking.
5. Washing after blocking.
6. Incubation of primary antibodies.
7. Washing after primary antibodies.
8. Incubation of secondary antibodies.
9. Washing after secondary antibodies.
10. Detection and analysis.

3.1 Sample Preparation

The sample preparation protocol recorded here is for HeLa cells or the clear cell renal
carcinoma (ccRCC) cell line. However, similar procedures can be used for a variety of cell
types.
Cell culture dishes were placed on ice and washed with ice-cold PBS. The PBS buffer was
removed by aspiration, and ice-cold lysis buffer (about 1 ml per ten million cells) was added.
Adherent cells were scraped off the dish using a cold plastic cell scraper, then gently
transferred into a pre-cooled microcentrifuge tube and agitated for 15 min at 4 °C. Lysates
were centrifuged for 10 min at 3000 × g (12,000 rpm) using a refrigerated centrifuge; then the
supernatant was gently collected and the pellet discarded. A small volume of lysate was
removed to measure the protein concentration using a Bradford assay (see Note 2).
The lysate was aliquoted in 0.5 ml volumes and stored at −80 °C (see Note 1).
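
The rotor speed that corresponds to a given relative centrifugal force depends on the rotor radius, so an rpm value applies only to a particular rotor. The sketch below uses the standard conversion RCF = 1.118 × 10⁻⁵ × r × rpm², with r in centimeters. The 8 cm radius is a hypothetical example and is not a value from this protocol; the radius of the rotor actually in use should be substituted.

```python
import math

# Standard conversion constant for radius in cm and speed in rpm.
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (multiples of g) at a given speed and radius."""
    if rpm < 0 or radius_cm <= 0:
        raise ValueError("rpm must be >= 0 and radius must be > 0")
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf: float, radius_cm: float) -> float:
    """Rotor speed in rpm needed to reach a given relative centrifugal force."""
    if rcf < 0 or radius_cm <= 0:
        raise ValueError("rcf must be >= 0 and radius must be > 0")
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


if __name__ == "__main__":
    radius_cm = 8.0  # hypothetical rotor radius, for illustration only
    print(f"3000 x g requires about {rpm_from_rcf(3000, radius_cm):.0f} rpm")
    print(f"12,000 rpm produces about {rcf_from_rpm(12000, radius_cm):.0f} x g")
```

Setting the centrifuge by relative centrifugal force rather than rpm, where the instrument allows it, avoids the dependence on rotor geometry.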

3.2 Preparation of Samples for Loading into Gels

Antibodies typically recognize a small portion of the protein of interest (referred to as the
epitope), and this domain may reside within the three-dimensional conformation of the protein.
To enable access of the antibody to this portion, it is necessary to unfold the protein, i.e.,
denature it. To denature the protein, a loading buffer with the anionic denaturing detergent
SDS is used, and the mixture is boiled at 95–100 °C for 5 min. Heating at 70 °C for 5–10 min
is also acceptable and may be preferable when studying membrane proteins, which tend to
aggregate when boiled; the aggregates may not enter the gel efficiently. SDS binds to the
polypeptide and confers a negative charge in proportion to its length, so the denatured
proteins become negatively charged with almost equal charge densities per unit length. In
denaturing SDS-PAGE separations, therefore, migration is determined not by a protein’s
intrinsic electrical charge, but by its molecular weight.
It might be necessary to reduce disulfide bridges in proteins by using β-mercaptoethanol or
dithiothreitol (DTT). This allows proteins to adopt a configuration suitable for separation by
size. Glycerol is added to the loading buffer to increase the density of the sample, so that
the loaded sample sinks to the bottom of the well, reducing uneven gel loading. To enable
visualization of protein migration, a small anionic dye molecule (often bromophenol blue) is
included in the loading buffer. Since the dye is anionic and small, it migrates faster than any
protein and provides an indicator to monitor the progress of the separation. During sample
treatment, the sample should be mixed by vortexing before and after the heating step for best
resolution, and briefly centrifuged prior to loading into the gel.
The standard loading buffer, called the 2× Laemmli buffer, was first described in 1970 by
Laemmli [9]. It can also be made at higher concentrations, 4× and 6×, to minimize dilution of
the samples. The 2× buffer is mixed in a 1:1 ratio with the sample.
In this example, 10 μl of the sample, or about 15–20 μg of protein, was diluted 1:1 with 2×
loading buffer, boiled at 95 °C for 5 min, cooled on ice, centrifuged, and loaded on a precast
gel. Equal amounts of protein were loaded into each well of the SDS-PAGE gel, along with
molecular weight markers. HeLa cell extract containing 15 μg of total protein was loaded,
along with 40 ng of purified protein (hIL-2). A precast 4–12% gradient gel was used.
Electrophoresis lasted for 1–2 h at 100 V, until the blue dye reached the bottom of the gel
(see Note 3).
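
The loading volumes follow from the lysate concentration and the strength of the loading buffer: an n× buffer is brought to 1× when one part buffer is mixed with (n − 1) parts sample. The sketch below illustrates this under the assumption of a lysate at 1.5 μg/μl, a hypothetical concentration chosen so that 15 μg corresponds to the 10 μl used in the example above. The function name `loading_mix` is illustrative.

```python
def loading_mix(conc_ug_per_ul: float, target_ug: float, buffer_strength: float = 2.0):
    """Return (sample_ul, buffer_ul, total_ul) for one gel lane.

    An n-fold concentrated loading buffer reaches 1x when one volume of buffer
    is combined with (n - 1) volumes of sample.
    """
    if conc_ug_per_ul <= 0 or target_ug <= 0:
        raise ValueError("concentration and target mass must be > 0")
    if buffer_strength <= 1:
        raise ValueError("loading buffer must be more concentrated than 1x")
    sample_ul = target_ug / conc_ug_per_ul
    buffer_ul = sample_ul / (buffer_strength - 1)
    return sample_ul, buffer_ul, sample_ul + buffer_ul


if __name__ == "__main__":
    # Hypothetical lysate at 1.5 ug/ul, loading 15 ug per lane.
    for strength in (2, 4, 6):
        sample, buffer, total = loading_mix(1.5, 15, strength)
        print(f"{strength}x buffer: {sample:.1f} ul sample + {buffer:.1f} ul buffer "
              f"= {total:.1f} ul")
```

A more concentrated buffer reduces the total volume per lane, which matters when the well capacity of the precast gel is limiting.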

3.3 Transferring the Protein from the Gel to the Membrane

There are two main types of protein transfer: wet transfer and semi-dry transfer. Traditional
wet transfer is still the most common method for protein transfer from gel to membrane and
offers high efficiency, but at a cost of time and effort. Semi-dry transfer is faster than wet
transfer [10]; semi-dry systems can efficiently transfer proteins to membranes in 7 min. In
our hands, the quality of the transfer was similar between the new semi-dry blotters and the
traditional wet transfer method. Transfer can be done through a stack of capillary paper or
using an electroblotting technique. Electroblotting is more consistent and faster, but
otherwise there is no advantage to using it over capillary transfer, which is gentler and, if
done properly, produces excellent results. The sponges and filter paper were presoaked in
transfer buffer. The gel was gently removed from the assembly and equilibrated in the transfer
buffer for 15–30 min. On a flat surface, all blotting sandwich components were assembled in a
cassette, from the cathode side (bottom) to the anode side (top): sponge, filter paper, gel,
membrane, filter paper, and sponge, as shown in Fig. 2.

Fig. 2 Protein transfer assembly

A pad (sponge) and one piece of filter paper, both completely wet, were placed first, followed
by the gel and the membrane. Tweezers should be used to handle the membrane, and the entire
membrane should be wet. All air bubbles between the gel and membrane should be carefully
removed by rolling a pipette over the stack, to avoid incomplete transfer and a spotty, uneven
blot. To complete the sandwich, one more piece of filter paper is added and then the last pad
(sponge), again making sure both are thoroughly soaked in transfer buffer. A 15-ml tube or a
serological pipette should then be rolled up and down, and side to side, over the completed
sandwich to remove any bubbles. The assembly was inserted in the transfer tank containing the
transfer buffer so that the gel faces the cathode and the membrane faces the anode;
SDS-coated proteins are negatively charged and therefore migrate out of the gel toward the
anode. Transfer was done at a constant current of 10 mA at 4 °C for 1 h. The time and voltage
of transfer may require some optimization depending on the source of proteins. After transfer,
the blot was washed in distilled water for 2 min and incubated with Ponceau S stain for 5 min
to verify the protein transfer. Blots were then washed three times for 5 min each with
distilled water. The membrane can be either nitrocellulose or PVDF. PVDF should be activated
with methanol for 1 min and rinsed with the transfer buffer before preparing the stack. The
PVDF membrane should be dried after transfer to ensure the proteins are attached, and then
reactivated in methanol.

3.4 Blocking

All steps of immunodetection (blocking, washing, and antibody incubation) were performed
using the BlotCycler™, an automated Western blot processor, either at room temperature (RT)
or at 4 °C (see Note 4).
Membranes were blocked for 30 min at RT or for 90 min at 4 °C using the blocking buffer and
were washed once with PBST after blocking. Washing after blocking helped reduce the
background signal by removing unattached microprecipitate from the blocking buffer.

3.5 Incubation of Primary Antibodies

The membranes were incubated with the primary antibodies, rabbit anti-IL-2 and mouse
anti-α-tubulin, for 2 h at RT or for 10 h at 4 °C with constant shaking.
Western blots with a cell extract from ccRCC were incubated with the rabbit polyclonal
anti-PARP antibody for 8 h at 4 °C with constant shaking.

3.6 Washing the Membrane After Primary Antibody Incubation

The membranes were washed with PBST or TBST buffer three times at RT or five times at 4 °C,
using 20 ml of washing buffer per wash, for 5–15 min each with constant shaking.

3.7 The Secondary Antibody Incubation

The membranes were incubated with fluorescently labeled or horseradish peroxidase
(HRP)-conjugated anti-rabbit IgG secondary antibodies for 1 h at RT or for 3 h at 4 °C, with
constant shaking.

3.8 Washing After Secondary Antibodies

Membranes were washed four times at RT or six times at 4 °C with PBST or TBST.
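
The two immunodetection schedules described in Subheadings 3.4 to 3.8 can be written down as data, which makes the difference in total processing time easy to compare. This is a minimal sketch: the class and field names are illustrative, and it assumes that every wash (after blocking, primary, and secondary antibody) has the same length, whereas the text gives a 5–15 min wash length only for the washes after the primary antibody.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Incubation times in minutes and wash counts for one temperature."""
    block_min: int
    primary_min: int
    washes_after_primary: int
    secondary_min: int
    washes_after_secondary: int
    washes_after_block: int = 1  # membranes were washed once after blocking


# Values taken from Subheadings 3.4 to 3.8.
ROOM_TEMPERATURE = Schedule(30, 120, 3, 60, 4)
COLD_4C = Schedule(90, 600, 5, 180, 6)


def total_minutes(schedule: Schedule, wash_min: int) -> int:
    """Total hands-off time, assuming all washes last `wash_min` minutes."""
    washes = (schedule.washes_after_block
              + schedule.washes_after_primary
              + schedule.washes_after_secondary)
    incubations = schedule.block_min + schedule.primary_min + schedule.secondary_min
    return incubations + washes * wash_min


if __name__ == "__main__":
    for name, schedule in [("RT", ROOM_TEMPERATURE), ("4 C", COLD_4C)]:
        shortest = total_minutes(schedule, 5)
        longest = total_minutes(schedule, 15)
        print(f"{name}: {shortest / 60:.1f} to {longest / 60:.1f} h")
```

Under these assumptions the 4 °C schedule takes several times longer than the RT schedule, which is one practical reason to run it on an automated processor.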

3.9 Detection and Data Analysis

The fluorescent and chemiluminescent signals were measured using an imager. Data for each
DyLight™ fluorophore were collected independently at the following excitation/emission
wavelengths: 530/605 nm for DyLight™ 549 and 625/695 nm for DyLight™ 649. The fluorescence
intensities of different blots were normalized using a molecular weight standard. Raw data
were collected as arbitrary light units, averaged, and quantified by densitometry using
ImageJ 1.40g, which is freely provided by the National Institutes of Health (Bethesda, MD).
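
One way to carry out the normalization described above is to divide each band intensity by the intensity of the molecular weight standard on the same blot and then express the results relative to the strongest band. The sketch below is an assumption about how such a calculation could be done, not the authors' exact procedure, and the intensities in the example are invented placeholders rather than measurements from this chapter.

```python
def normalize_to_standard(band_intensities: dict, standard_intensity: float) -> dict:
    """Divide every band intensity on one blot by that blot's standard intensity."""
    if standard_intensity <= 0:
        raise ValueError("standard intensity must be > 0")
    return {band: value / standard_intensity
            for band, value in band_intensities.items()}


def relative_percent(normalized: dict) -> dict:
    """Rescale normalized values so that the strongest band equals 100."""
    top = max(normalized.values())
    if top <= 0:
        raise ValueError("at least one band must have a positive intensity")
    return {band: 100.0 * value / top for band, value in normalized.items()}


if __name__ == "__main__":
    # Placeholder intensities in arbitrary light units (not real data).
    blot = {"IL-2": 500.0, "tubulin": 250.0}
    normalized = normalize_to_standard(blot, standard_intensity=1000.0)
    for band, percent in relative_percent(normalized).items():
        print(f"{band}: {percent:.0f}% of the strongest band")
```

Normalizing each blot to its own standard before comparing blots removes differences in exposure and imaging between membranes.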

4 Notes

1. Sample preparation.
The lysis procedure disrupts the cell membrane and solubilizes proteins so they can migrate
individually through the separating gel. There are many recipes for lysis buffers; the main
differences among them lie in their ability to solubilize proteins. Buffers containing SDS and
other ionic detergents are considered to be the harshest and therefore most likely to yield
high protein levels. The main considerations when choosing a lysis buffer are the subcellular
localization of the protein of interest and whether the protein is to be detected in its
native or denatured form.

Protein location                     Lysis buffer recommended
Whole cell                           NP-40 or RIPA
Cytoplasmic (cytoskeletal bound)     Tris–Triton
Membrane bound                       NP-40 or RIPA
Nuclear                              RIPA
Mitochondria                         RIPA
Microbial cell                       RIPA

As soon as lysis occurs, proteolysis, dephosphorylation, and denaturation begin. To prevent
protein degradation, one or several enzyme inhibitors should be used. Cocktails of inhibitors
can be prepared in the laboratory or purchased ready to use from various suppliers (see
Table 1). The sodium orthovanadate (Na3VO4) solution used for tyrosine phosphatase inhibition
should be prepared in a fume hood:
(a) Prepare a 100 mM solution in double-distilled water. Set the pH to 9.0 with HCl.
(b) Boil until colorless. Minimize volume change due to evaporation by covering loosely.
(c) Cool to room temperature. Set the pH to 9.0 again.
(d) Boil again until colorless.
(e) Repeat this cycle until the solution remains at pH 9.0 after boiling and cooling.
(f) Bring up to the initial volume with water.
(g) Store in aliquots at −20 °C. Discard if samples turn yellow.
Lysates from nonadherent cells, such as bacterial cultures, and from tissue should be prepared
differently than lysates from adherent cells. The cells should be centrifuged in a
microcentrifuge tube at 4 °C. The centrifugation force and time may have to be varied
depending on the cell type. A guideline is 10 min at 3000 × g, but this must be determined for
each experiment. The supernatant should be discarded and lysis buffer added to the pellet.

Table 1
Protease and phosphatase inhibitors

Inhibitor                             Protease/phosphatase inhibited               Final concentration in lysis buffer  Stock
Aprotinin                             Trypsin, chymotrypsin, plasmin               2 μg/ml                              10 mg/ml in water. Do not re-use aliquots
Leupeptin                             Lysosomal                                    5–10 μg/ml                           Dilute in water. Do not re-use once defrosted
Pepstatin A                           Aspartic proteases                           1 μg/ml                              1 mM in methanol
PMSF (phenylmethylsulfonyl fluoride)  Serine, cysteine proteases                   1 mM                                 Dilute in ethanol. Can re-use the aliquot
EDTA                                  Metalloproteases that require Mg++ and Mn++  5 mM                                 0.5 M in water. Adjust the pH to 8.0
EGTA                                  Metalloproteases that require Ca++           1 mM                                 0.5 M in water. Adjust the pH to 8.0
Sodium fluoride (NaF)                 Serine/threonine phosphatases                5–10 mM                              Dilute in water. Do not re-use once defrosted
Sodium orthovanadate                  Tyrosine phosphatases                        1 mM                                 Dilute in water. Do not re-use once defrosted
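
Where Table 1 gives both a stock and a final concentration, the volume of stock to add follows from C1 × V1 = C2 × V2. The sketch below works through aprotinin (10 mg/ml stock, 2 μg/ml final) and EDTA (0.5 M stock, 5 mM final); the 10 ml batch of lysis buffer is a hypothetical volume chosen for illustration, and the function name is illustrative.

```python
def stock_volume_ul(stock_conc: float, final_conc: float, final_volume_ml: float) -> float:
    """Microliters of stock needed so that C1 * V1 = C2 * V2.

    `stock_conc` and `final_conc` must be expressed in the same unit.
    """
    if stock_conc <= 0 or final_conc < 0 or final_volume_ml <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_conc > stock_conc:
        raise ValueError("final concentration cannot exceed the stock concentration")
    return final_conc / stock_conc * final_volume_ml * 1000.0


if __name__ == "__main__":
    batch_ml = 10.0  # hypothetical volume of lysis buffer

    # Aprotinin: 10 mg/ml stock = 10,000 ug/ml; final 2 ug/ml.
    aprotinin_ul = stock_volume_ul(10_000, 2, batch_ml)
    # EDTA: 0.5 M stock = 500 mM; final 5 mM.
    edta_ul = stock_volume_ul(500, 5, batch_ml)

    print(f"aprotinin: {aprotinin_ul:.1f} ul of stock per {batch_ml:.0f} ml")
    print(f"EDTA: {edta_ul:.1f} ul of stock per {batch_ml:.0f} ml")
```

Converting both concentrations to a common unit before the call, as done in the comments above, is the step most often missed in this calculation.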

When working with tissue, it should be dissected with a sterile tool, preferably on ice, and
as quickly as possible in order to prevent degradation by proteases. The tissue can be placed
in round-bottom microfuge tubes and kept on ice prior to homogenization. The samples can be
immersed in liquid nitrogen to “snap freeze” them and stored at −80 °C for later use.
The preparation of samples from microbial cells may require intensive homogenization, which
can be done manually or with mechanical devices for faster and more intensive disruption of
the cell wall. For about 5 mg of cell material, 300 μl of lysis buffer should be added rapidly
to the tube (the buffer with inhibitors should be ice-cold prior to homogenization), and the
sample homogenized with an electric homogenizer; the blade can be rinsed twice with another
300 μl of lysis buffer each time. Volumes of lysis buffer must be determined in relation to
the amount of tissue present. The protein extract should not be too dilute, in order to avoid
the loss of protein and to prevent a large volume of sample from being loaded onto gels. The
minimum concentration should be 0.1 mg/ml, while the optimal concentration is 1–5 mg/ml. Tubes
with protein extract should be centrifuged for at least 20 min at 3000 × g at 4 °C, gently
removed from the centrifuge, and placed on ice. The supernatant should be aspirated and placed
in a fresh tube kept on ice; the pellet can be discarded. The use of an electric homogenizer
was usually sufficient, but successful sample preparation was also achieved using the bead
technique.
An alternative for protein extraction from cells is to aspirate the media and wash with PBS.
It may be preferable not to use PBST, to avoid any chance of protein solubilization. Usually
1–4 ml of cold PBS is added and the cells are scraped. The cell suspension is collected and
centrifuged in a microtube to pellet the cells. The lysis buffer (100–200 μl) is then added,
and the cells can be frozen at −80 °C in a pre-frozen rack.
2. Determination of protein concentration.
Protein concentration can be measured using either copper chelation or dye-based protein
assays. Copper-based methods include the Lowry and bicinchoninic acid (BCA) assays, while
Bradford is the best-known dye-based assay. Each method has both advantages and
disadvantages. For example, the BCA method is compatible with most detergents and responds
uniformly to most proteins, but it does not reach an endpoint and is highly dependent on the
incubation time, thus requiring all samples and standards to be incubated under identical
conditions.
In contrast, the Lowry method is an endpoint assay and is less sensitive to incubation time
and temperature. However, the reagent can precipitate in the presence of detergents.
Finally, the advantages of the Coomassie dye-based assay (Bradford assay) are the short
incubation time and compatibility with most salts, solvents, buffers, thiols, reducing
substances, and metal chelating agents encountered in protein samples. However, the
disadvantages of the method include incompatibility with surfactants and protein-to-protein
variation due to the method’s sensitivity to the amino acid composition of proteins. This is
not a problem if similar samples are compared. Once the concentration of each sample is
determined, samples can be stored at −80 °C or used immediately for gel electrophoresis.
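
Whichever assay is used, the sample concentration is read from a standard curve. The sketch below fits a straight line to a set of protein standards by ordinary least squares and inverts it for an unknown. It assumes a linear response over the range used, which should be checked for the assay at hand, and the standard values in the example are invented placeholders, not data from this chapter.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) of y = slope*x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration(absorbance, slope, intercept, dilution_factor=1.0):
    """Invert the standard curve and correct for any dilution of the sample."""
    return (absorbance - intercept) / slope * dilution_factor


if __name__ == "__main__":
    # Placeholder standards: concentration in mg/ml and absorbance readings.
    standards_mg_per_ml = [0.0, 0.25, 0.5, 1.0]
    absorbances = [0.05, 0.175, 0.30, 0.55]
    slope, intercept = fit_line(standards_mg_per_ml, absorbances)
    # Unknown lysate measured after a hypothetical 10-fold dilution.
    print(f"{concentration(0.30, slope, intercept, dilution_factor=10):.2f} mg/ml")
```

Readings that fall outside the range of the standards should be repeated at a different dilution rather than extrapolated.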
3. Native and non-reduced samples.
An antibody may recognize an epitope made up of non-contiguous amino acids. Although the amino
acids of the epitope are separated from one another in the primary sequence, they are close to
each other in the folded three-dimensional structure of the protein, and the antibody only
recognizes the epitope as it exists on the surface of the folded structure. Under these
circumstances, it is necessary to carry out the Western blot procedure under non-denaturing
conditions, e.g., leaving SDS out of the sample and migration buffers and not heating the
samples. Certain antibodies only recognize a protein in its non-reduced form, i.e., in an
oxidized form (particularly on cysteine residues). In that situation, the reducing agents
β-mercaptoethanol and DTT must be left out of the loading buffer and migration buffer
(non-reducing conditions). To select the most appropriate procedure, see Table 2.
Gel electrophoresis time and voltage may also require optimization. The gel percentage
required depends on the size of the protein of interest:

MW range, kDa     Acrylamide %
2–40              20
12–80             15
10–100            12.5
15–150            10
40–500            8

Gradient gels can also be used if there are both low molec-
ular weight (LMW) and high molecular weight (HMW) pro-
teins that need to be measured in the same sample.
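
The table above can be used as a lookup when several proteins must be resolved on one gel. The sketch below encodes the ranges exactly as listed and returns the acrylamide percentages whose range covers every protein of interest; an empty result suggests that a gradient gel may be the better choice. The function names are illustrative.

```python
# (low kDa, high kDa, acrylamide %) as listed in the table above.
GEL_RANGES = [
    (2, 40, 20.0),
    (12, 80, 15.0),
    (10, 100, 12.5),
    (15, 150, 10.0),
    (40, 500, 8.0),
]


def suitable_gels(mw_kda: float) -> list:
    """Acrylamide percentages whose listed range includes the given size."""
    return [percent for low, high, percent in GEL_RANGES if low <= mw_kda <= high]


def gels_for_all(mws_kda) -> list:
    """Percentages suitable for every size; empty if no single percentage fits."""
    sizes = list(mws_kda)
    return [percent for low, high, percent in GEL_RANGES
            if all(low <= mw <= high for mw in sizes)]


if __name__ == "__main__":
    # Uncleaved PARP and its cleavage fragments (113, 89, and 24 kDa).
    parp_sizes = [113, 89, 24]
    common = gels_for_all(parp_sizes)
    if common:
        print("single-percentage options:", common)
    else:
        print("no single percentage fits; consider a gradient gel")
```

Because the listed ranges overlap, most proteins have more than one acceptable percentage, and the final choice can be made on the resolution needed around the band of interest.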
The SDS-PAGE system should include a tank, a lid with power cables, an electrode assembly, a
cell buffer dam, casting stands, casting frames, combs, and glass plates. An SDS-PAGE gel
consists of a stacking gel and a separating gel. The stacking gel (5% acrylamide) is poured on
top of the separating gel after it has solidified, and a gel comb is inserted in the stacking
gel.
4. Immunodetection.
The reaction between antibodies in solution and a protein attached to a solid support can
introduce bias and errors if not performed properly. There are many variations in how to
perform these procedures, and they can produce dramatically different results. In general, the
procedure should be optimized for each combination of antibodies and antigen. Each step of
immunodetection is important in achieving reliable and reproducible results, in particular the
washing and hybridization steps. If it is impossible to optimize each step for the particular
antibodies, a general guideline that works well with most antibody–antigen combinations can
be followed.

Table 2
Electrophoresis conditions

Protein state        Gel condition                  Loading buffer                             Migration buffer
Reduced, denatured   Reducing and denaturing        With β-mercaptoethanol or DTT, with SDS    With SDS
Reduced, native      Reducing and non-denaturing    With β-mercaptoethanol or DTT, no SDS      No SDS
Oxidized, denatured  Non-reducing and denaturing    No β-mercaptoethanol or DTT, with SDS      With SDS
Oxidized, native     Non-reducing and native        No β-mercaptoethanol or DTT, no SDS        No SDS
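
Table 2 reduces to two independent choices: whether the protein must be reduced and whether it must be denatured. A minimal sketch of that logic follows; the function and key names are illustrative, and the rule for heating is taken from the statement in Note 3 that samples are not heated under non-denaturing conditions.

```python
def electrophoresis_conditions(reduced: bool, denatured: bool) -> dict:
    """Buffer composition for a given protein state, following Table 2."""
    return {
        # Reducing agent (beta-mercaptoethanol or DTT) in the loading buffer.
        "reducing_agent_in_loading_buffer": reduced,
        # SDS is present in both buffers only under denaturing conditions.
        "sds_in_loading_buffer": denatured,
        "sds_in_migration_buffer": denatured,
        # Note 3: samples are not heated under non-denaturing conditions.
        "heat_sample": denatured,
    }


if __name__ == "__main__":
    # Antibody that recognizes only the oxidized, folded (native) protein.
    for setting, value in electrophoresis_conditions(reduced=False, denatured=False).items():
        print(f"{setting}: {'yes' if value else 'no'}")
```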

There are two critical variables: the duration of incubation and the temperature. We have
found that washing done at 4 °C yields far better results for most antibodies. Washing and
hybridization, which require precise timing and consistent membrane agitation, are best
performed using an automated system that allows precise control of time and solution volume.
Overall, we suggest that blocking be performed either at RT or at 4 °C; the duration does not
seem to be critical, and there is no need to block for more than 2 h, even at 4 °C. We found
that at least one washing step after blocking helped reduce the background by removing loosely
attached blocking reagents and any precipitate formed during blocking.
Antibody incubation can be performed at either RT or 4 °C. Shaking or rocking was not
essential, but shaking helped to reduce the time of incubation. In our experience, shaking or
a combination of shaking and rocking worked best, since the process mostly depends upon
lateral diffusion. Shaking or a combination of shaking and rocking, compared to rocking alone,
helped to reduce the background by preventing formation of a precipitate.
It is important to wash the membrane consistently after the primary and secondary antibodies
in order to generate reproducible results. The timing of washing is often neglected and not
controlled accurately. Washing can be performed at either 4 °C or RT, but washing at 4 °C
seemed preferable since it avoided excessive dissociation of the antigen–antibody complex.
Washing at 4 °C usually produced a higher signal and less background noise (see Fig. 3) and
helped eliminate nonspecific bands (Fig. 4), though this may not always be the case and should
be tested for individual antibodies.
Washing time varies between 3 and 30 min, though in our

[Fig. 3 image: blot panels (a) and (b); panel (c), bar chart of relative intensity (y-axis, 0–120) for human IL-2 and tubulin at 4 °C and RT]

Fig. 3 Western blot processing at low temperature increased staining intensity. The blots were processed as
described in Subheading 3. In each blot, the first lane contained a HeLa whole-cell lysate, the second lane
contained recombinant human IL-2, and the third lane a mixture of HeLa whole-cell lysate and hIL-2.
Following SDS-PAGE, proteins were transferred from the gel onto the nitrocellulose membrane as
described in Methods. Membranes were blocked for 30 min at RT or for 90 min at 4 °C using blocking
reagent and then incubated with the rabbit anti-IL-2 and mouse anti-α-tubulin primary antibodies for 2 h at RT
(a) or for 10 h at 4 °C (b), followed by incubation with DyLight™ 549 conjugated anti-mouse IgG and
DyLight™ 649 conjugated anti-rabbit IgG secondary antibodies for 1 h at RT (a) or 3 h at 4 °C (b). Data for
each DyLight™ fluorophore were collected independently at excitation/emission wavelengths of 530/605 nm for
DyLight™ 549 and 625/695 nm for DyLight™ 649. (a) All steps of immunodetection (blocking, primary
and secondary antibody incubation, all washing steps) were done at RT. (b) All steps of immunodetection
(blocking, primary and secondary antibody incubation, all washing steps) were done at 4 °C. (c) Comparison of
fluorescence intensity between blots processed at RT and at 4 °C. The fluorescence intensities of
different blots were normalized using the MW standard
Western Blot Processing Optimization: The Perfect Blot 79

Fig. 4 Western blots with cell extract from the clear cell renal carcinoma (ccRCC)
cell line were incubated with the rabbit polyclonal anti-PARP antibody using the
traditional manual procedure with all washing at RT (a) or with all processing
done at 4 °C (b). The membranes were prepared as described in Methods. The
bands were visualized using a chemiluminescence substrate. (a) Incubation with
the primary antibody was carried out at 4 °C while the rest of the procedure was
performed at room temperature. (b) The entire blot processing was done at 4 °C.
When the blot was processed manually at RT (a), multiple nonspecific bands
were observed. In contrast, the blot processed at 4 °C (b) displayed only the
bands representing the uncleaved PARP protein (113 kDa) as well as the
C-terminal and N-terminal cleaved PARP fragments (89 and 24 kDa, respectively).
The results suggest the importance of blocking and washing steps to be
done at 4 °C, especially for monoclonal antibodies

practice, we did not see any advantage to washing for more than
5 min. Background reduction depended more on the
number of washing steps, but in most cases no change could be
detected after eight consecutive washes, whether they were done
at 4 °C or at RT. Further increases in the number of washes may
lead to signal degradation.
Secondary antibody incubation usually required a shorter
incubation time since primary antibodies attached to proteins
are more accessible. Washing after secondary antibodies followed
the same guidelines as washing after primary antibodies.
Since the concentration of secondary antibodies is rarely optimized
and they are often added in excess, we recommend performing at
least five consecutive washes after incubating with the secondary
antibodies.

Acknowledgments

The experiments performed were in part supported by Precision
Biosystems, LLC and Rockland Immunochemicals Inc.

References

1. Wu WC, Walaas SI, Nairn AC, Greengard P (1982) Calcium/phospholipid regulates phosphorylation of a Mr “87k” substrate protein in brain synaptosomes. Proc Natl Acad Sci U S A 79:5249–5253
2. Conte A, Sigismund S (2017) Methods to investigate EGFR ubiquitination. Methods Mol Biol 1652:81–100. https://doi.org/10.1007/978-1-4939-7219-7_5
3. Pere-Brissaud A, Blanchet X, Delourme D et al (2015) Expression of SERPINA3s in cattle: focus on bovSERPINA3-7 reveals specific involvement in skeletal muscle. Open Biol 5:150071
4. Voelkel T, Andresen C, Unger A et al (2013) Lysine methyltransferase Smyd2 regulates Hsp90-mediated protection of the sarcomeric titin springs and cardiac function. Biochim Biophys Acta, Mol Cell Res 1833:812–822
5. Sarge KD, Park-Sarge O-K (2009) Detection of proteins sumoylated in vivo and in vitro. Methods Mol Biol 590:265–277
6. Yalow RS, Glick SM, Roth J, Berson SA (1964) Radioimmunoassay of human plasma ACTH. J Clin Endocrinol Metab 24:1219–1225
7. Towbin H, Staehelin T, Gordon J (1979) Electrophoretic transfer of proteins from polyacrylamide gels to nitrocellulose sheets: procedure and some applications. Proc Natl Acad Sci 76:4350–4354. https://doi.org/10.1073/pnas.76.9.4350
8. Burnette WN (1981) “Western blotting”: electrophoretic transfer of proteins from sodium dodecyl sulfate–polyacrylamide gels to unmodified nitrocellulose and radiographic detection with antibody and radioiodinated protein A. Anal Biochem 112:195–203
9. Laemmli UK (1970) Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227:680–685
10. Wisdom GB (1994) Protein blotting. Methods Mol Biol 32:207–213
Chapter 5

FISHing on a Budget
Gable M. Wadsworth and Harold D. Kim

Abstract
Sensitive quantification of RNA transcripts via fluorescence in situ hybridization (FISH) is a ubiquitous part
of understanding quantitative gene expression in single cells. Many techniques exist to identify and localize
transcripts inside the cell, but often they are costly and labor intensive. Here we present a method to use a
singly labeled short DNA oligo probe to perform FISH in yeast cells. This method is effective for highly
constrained FISH applications where the target length is limited (<200 nucleotides). This method can
quantify different RNA isoforms or enable the use of fluorescence resonance energy transfer (FRET) to
detect co-transcription of neighboring sequence blocks. Since this method relies on a single probe, it is also
more cost-effective than a multiple probe labeling strategy.

Key words Single molecule, RNA FISH, In situ hybridization, Isoform profiling, FRET

1 Introduction

Measurement of RNA transcripts at the single-cell level is an important
tool to understand the heterogeneity of gene expression in cell
populations. There are a large number of variations on the techniques
to accomplish this task, and choosing the right technique is a
challenging problem. Ensemble techniques such as qPCR or
RNA-seq involve extracting RNA from the cell, while most
single-cell and single-molecule approaches to measuring transcript
levels are based on fluorescence in situ hybridization (FISH). While
it is possible to perform single-cell RNA-seq [1], isolating single
cells and generating libraries for amplification is a significant effort
compared to RNA FISH. In addition, many labs are limited in the
equipment available to perform these measurements, which could
include a flow cytometer and sequencing services. Whole-cell
sequencing alone can cost thousands of dollars, which makes
RNA FISH an attractive option: a widefield microscope is a
fairly common instrument with a low barrier to entry compared to these
other options, and it is very flexible in application. RNA FISH is a
robust technique to count single RNA molecules in cells

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_5, © Springer Science+Business Media, LLC, part of Springer Nature 2022


[2, 3]. This technique has been used in a variety of circumstances to
obtain single-cell expression profiles of RNA transcripts while
maintaining spatial knowledge of their location. However, conven-
tional FISH methods suffer from the requirement to use multiple
probes to obtain a bright punctate spot. This requirement prevents
the detection of transcripts that are shorter than ~200 nucleotides
or highly similar transcripts that can only be uniquely identified by a
similarly short sequence. To achieve detection of RNA molecules
that are short or highly similar, it is necessary to understand the
performance of FISH using a single probe [4]. The floor for the
performance of a FISH method is the detection of ~60% of the
transcripts with a single fluorophore. By combining a single site for
probe annealing with an amplification strategy such as hybridiza-
tion chain reaction [5] or a secondary labeling strategy such as the
one used in MERFISH [6], performance with a single-probe bind-
ing site can be similar to conventional FISH methods while reduc-
ing cost per reaction by ~10 times compared to custom commercial
probe sets (Table 1). Here we describe a FISH protocol to use a
singly labeled DNA oligo probe to detect RNA molecules in Saccharomyces
cerevisiae. We also discuss several probe design strategies
that can be easily incorporated to increase specificity or signal, or to
reduce cost.

2 Materials

2.1 Equipment
1. Autoclave.
2. Centrifuge.
3. Microcentrifuge.
4. Incubator.
5. Spectrophotometer.
6. Water purifier.
7. Pipettors.
8. MATLAB software.
9. Widefield microscope equipped with:
   • A halogen lamp, DIC prism.
   • Filter wheel.
   • Filters.
   • Dichroics.
   • 60–100× high-NA (1.4) objective.
   • Translation stage.
10. EMCCD camera.
11. Laser illumination (50 mW power output).

Table 1
Cost of FISH probes

                     Multiple probe   Single probe   HCR amplifier   Competitor probe
Cost                 $708.75          $158.44        +$70.20         +$12.22
Cost per reaction    $1.77–3.54       $0.08–0.16     +$0.02–0.04     +$0.03–0.06

The cost for a multiple probe set is the current cost of a custom Stellaris® FISH probe set with Quasar670. The other
costs are based on a Cy5-labeled 72 nucleotide toehold probe purified using HPLC and two unlabeled sequences all
ordered from Integrated DNA Technologies (IDT). Cost for custom oligos scales with the number of nucleotides, the
scale of the reaction, the purification, and the modification. The cost per reaction is based on the working concentration
which yields a range of 800–1600 reactions per 100 μL of 100 μM probe stock solution. The probes from IDT are
purified with HPLC

12. Lenses, mirrors, and optical mounts to form the illumination
and excitation paths.

2.2 Reagents
All reagents are prepared with RNase-free water.
1. Light-duty tissue wipers.
2. Nitrile gloves.

2.2.1 Cell Culture of Saccharomyces cerevisiae
1. Saccharomyces cerevisiae strains.
2. SD Complete media for plates and liquid culture [7].
3. Pyrex bottle.
4. Culture flask.
5. Petri dish.
6. Plastic cuvettes.

2.2.2 Fixation, Spheroplasting, and Permeabilization
1. Pipette tips.
2. Methanol 99% ACS Spectrophotometric grade.
3. Falcon tube 50 mL.
4. Microcentrifuge tube.
5. Ethanol.
6. Buffer B: 1.2 M sorbitol, 0.1 M potassium phosphate (dibasic),
and RNase-free water.
7. Spheroplasting Buffer: 2 mM Vanadyl ribonucleoside complex
and Buffer B.
8. 5 units/μL zymolyase.

2.2.3 Washing, Hybridization, and Mounting
1. Wash Buffer: 10% formamide (v/v), 2× SSC, and RNase-free
water.
2. Hybridization Buffer: 10% dextran sulfate (w/v), 1 mg/mL
Escherichia coli tRNA, 2 mM vanadyl ribonucleoside complex,
0.2 mg/mL BSA, 2× SSC, and RNase-free water.
3. Imaging Buffer: 1 mM 6-hydroxy-2,5,7,8-tetramethylchroman-2-carboxylic
acid (Trolox), 2× SSC, 100 mM Tris–HCl
pH 8, 5 mM protocatechuic acid (PCA), and 10 nM protocatechuate-3,4-dioxygenase
(PCD).
4. Low Auto Fluorescence Immersion Oil.
5. Aluminum foil.
6. 5-min Epoxy.
7. Glass slide.
8. 8 mm square #1.5 coverslip.

3 Methods

3.1 Choosing a FISH Method
In addition to cost, there are three important considerations for
choosing which FISH method to use.
1. Is absolute copy number important?
2. How long is the sequence of interest?
3. How sensitive is the equipment used in data acquisition?
In many experiments, the absolute number of transcripts is less
important than the change in the number of transcripts due to
some perturbation. In these cases, a systematic error such as the
error introduced by inactive fluorophores that are inherently pres-
ent in a population of labeled DNA oligos does not change the
relationship between transcript levels when comparing two strains.
In cases where absolute copy number is important, it is necessary to
use a multiple probe labeling strategy (Fig. 1a). The single-fluorophore
strategy (Fig. 1b) can be extended to this case by
incorporating a hybridization chain reaction (HCR) design [5]
(Fig. 1c), which allows for amplification to reduce false negatives.
For short sequences or for discrimination between short features of
a long sequence, the length of the available sequence requires a
short probe, and specificity can be enhanced using a hairpin probe
design (Fig. 1d). In cases such as alternate sites of initiation, a pair of
probes with FRET is another option (Fig. 1e). In order to perform
these experiments, sensitive equipment is necessary. In contrast to
multiple probe FISH experiments which are typically performed
with a widefield microscope, cooled CCD camera, and
epi-fluorescence geometry, the single-probe design requires a sys-
tem capable of sensitive single-molecule detection. The minimum
set of equipment to accomplish single-fluorophore detection is a
widefield microscope capable of TIR/HILO illumination equipped
with an EMCCD camera. We expect that light-sheet
or spinning-disk confocal microscopy combined with an sCMOS camera would
work at least equally well. However, because of photobleaching and long
acquisition times, most point-scanning confocal systems are not
desirable.

Fig. 1 Schematic of FISH probe design. (a) For conventional FISH methods, 20–50 unique end labeled probes
are used. These are typically labeled as a set of pooled oligos in a single reaction so that they are not able to
be used independently. (b) This design uses a singly labeled DNA oligo probe to quantify RNA. (c) Two hairpin
probes are designed to be metastable without the RNA input. When the RNA is present, HCR (hybridization
chain reaction) provides enzyme-free signal amplification. (d) A hairpin probe is designed to be metastable in
the hybridization condition. This probe targets 26 nucleotides of the RNA, but 16 of them are masked to
improve specificity. (e) To utilize fluorescence resonance energy transfer (FRET) with FISH, the probes should
be located with the labeled 5′ and 3′ ends approximately six nucleotides apart. (f) The RNA is detected with an
unlabeled probe and then is hybridized by a secondary probe. This allows the use of one labeled oligo for many
detection schemes to reduce cost

3.2 Probe Design
The power of a single-fluorophore probe method is that it enables
specific detection of short regions of RNA transcripts. If the
sequence of interest is unconstrained, then the choice of probe
target location should depend on a combination of homology
with other genomic sequences, RNA-DNA melting temperature,
probe secondary structure, and target secondary structure. Ideally,
the melting temperature should be around 60 °C, and the sequence
should have less than 60% correspondence to any other genomic
sequence according to BLAST. Secondary structure can be predicted
by using Mfold [8] (see Note 1). If the secondary structure
is unavoidable, the probe and the target should have at least
seven consecutive nucleotides that are fully accessible for toehold-mediated
strand displacement [9].
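As a minimal sketch of the first bookkeeping step in probe design, the DNA probe for a chosen RNA target window is its reverse complement. The function names and the example sequence below are hypothetical and serve only as an illustration; melting temperature, BLAST homology, and secondary structure still have to be checked with dedicated tools as described above.

```python
# Hypothetical helpers (not from the protocol): derive a DNA probe from an
# RNA target window and report its GC fraction as a quick sanity check.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def dna_probe_for_rna(target_rna: str) -> str:
    """Return the DNA oligo (5'->3') that base-pairs with target_rna (5'->3')."""
    target = target_rna.upper()
    # Reverse the target, then complement each base (RNA U pairs with DNA A).
    return "".join(COMPLEMENT[base] for base in reversed(target))


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up 20-nt RNA window, used only to exercise the functions.
example_target = "AUGGCUAACGUUCGAUCCGA"
probe = dna_probe_for_rna(example_target)
print(probe, gc_fraction(probe))
```

The probe returned this way is only a starting point; the selection criteria in the paragraph above decide whether the window is usable.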
If a FRET-based design is desired, then the two probes should
target sequences separated by two nucleotides, and the labeling
should be done so that the pair of dye molecules are on the
proximate 3′ and 5′ ends of the two probes.
A single-fluorophore probe can also be designed as a hairpin
with a 10-nt toehold and a sequestered 10-nt loop, which can
function as a toehold for a secondary probe to provide signal
amplification via HCR. These probes are designed to be metastable
without the cognate RNA to initiate branch migration. The metastability
depends on the ratio of toehold length to stem length. This
design allows for increased specificity due to the inability of nonspecific
sequences to carry out strand displacement [10].
When designing probes, if a negative control strain lacking the
target RNA is not available, an unlabeled competitor probe with
identical sequence can be used to test the probe specificity. This
competitor probe when mixed with the labeled probe at a 10:1 or
100:1 ratio should saturate most available target sites. Hence, any
remaining spots at a 100:1 mixture should be considered
nonspecific.
To optimize cost, a universal probe can be designed as a secondary
label (Fig. 1f). In this scheme, the RNA is targeted by an
unlabeled DNA oligo, which includes an additional sequence that is
complementary to the universal probe. This design reduces cost
because often the fluorophore is the most expensive part of the
probe design and requires more expensive purification.

3.3 Cell Culture, Fixation, and Spheroplasting
1. At the end of the day, inoculate cells into SD liquid media in a
culture flask from cells actively growing on an SD complete
plate.
2. After overnight growth, measure the optical density of cells
using a spectrophotometer at 600 nm by pipetting 1 mL of
cell culture into a cuvette.
3. Upon reaching 0.6 OD600, decant cell culture into 50-mL
Falcon tubes and pellet at 671 × g.
4. Resuspend the pellet in 10 mL of ice-cold (4 °C) methanol for
10 min.
5. Pellet and aspirate the cells.
6. Resuspend in ice-cold Buffer B twice and aspirate.
7. Prepare fresh spheroplasting buffer.
8. Resuspend cells in 1 mL of spheroplasting buffer containing
2 μL of zymolyase and pipette to mix gently.
9. Measure initial cell OD600 by adding 100 μL of cell solution to
900 μL of deionized water and let stand for 1 min. Spheroplasted
cells should undergo lysis in this condition. Incubate
the remaining cell solution for 30 min and check cell OD600
once more.
10. Once cell OD600 drops by at least 30% from the initial measurement,
pellet the cells and aspirate. Cells should be treated
gently in all further steps with no more than 268 × g centrifugation
(see Note 2).
11. Resuspend in ice-cold Buffer B twice and aspirate.
12. Resuspend in 70% ethanol and incubate at 4 °C for 4 h to
overnight.
At this point, cells can be stored at 4 °C in 70% ethanol for up
to 2 weeks with no change in the quality of FISH.
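The OD600 criterion in steps 9 and 10 is a simple fractional-drop check. The sketch below is illustrative only: the function names and the example readings are hypothetical, and the 30% threshold is the one stated in step 10.

```python
def od_drop_fraction(initial_od: float, current_od: float) -> float:
    """Fractional decrease in OD600 relative to the initial reading."""
    if initial_od <= 0:
        raise ValueError("initial OD600 must be positive")
    return (initial_od - current_od) / initial_od


def spheroplasting_sufficient(initial_od: float, current_od: float,
                              min_drop: float = 0.30) -> bool:
    """True once OD600 has dropped by at least min_drop (30% in step 10)."""
    return od_drop_fraction(initial_od, current_od) >= min_drop


# Invented readings of the diluted sample before and after the 30 min
# incubation; they are placeholders, not measurements.
print(spheroplasting_sufficient(0.50, 0.32))
print(spheroplasting_sufficient(0.50, 0.40))
```

If the check fails after 30 min, the incubation can be extended and the reading repeated before proceeding to step 10.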

3.4 Hybridization, Washing
1. Prepare fresh Wash Buffer (see Note 3).
2. Pellet cells and wash twice with 1 mL of Wash Buffer.
3. Wrap microcentrifuge tubes in aluminum foil to prevent
photobleaching of probes or store tubes in a dark box.
4. Prepare probe solution through serial dilution of probes in
10 mM Tris–HCl to 1 μM. Then, mix by pipetting the diluted
probe solution with cell pellet and the appropriate amount of
hybridization buffer to achieve a working concentration determined
by titration (see Note 4).
5. Incubate the cell solution at 30 °C overnight in the dark.
6. Pellet and incubate cells for 10 min in 1 mL of Wash Buffer
containing 1 mg/mL DAPI (optional).
7. Pellet and wash cells with 1 mL of Wash Buffer.
At this point, cells can be stored at 4 °C for up to 2 weeks with
no degradation in signal.
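The dilution in step 4 follows the usual relation C1·V1 = C2·V2. The sketch below is a hedged illustration: the function name is hypothetical, and the 50 μL hybridization volume and 10 nM working concentration are assumptions, since the protocol leaves the volume unstated and the concentration to titration.

```python
def probe_volume_ul(stock_nM: float, final_nM: float, final_volume_ul: float) -> float:
    """Volume of diluted probe stock needed to reach final_nM in final_volume_ul."""
    if final_nM > stock_nM:
        raise ValueError("final concentration cannot exceed the stock concentration")
    # C1 * V1 = C2 * V2  ->  V1 = C2 * V2 / C1
    return final_nM * final_volume_ul / stock_nM


# 1 uM (1000 nM) diluted probe from step 4; the 10 nM target and the 50 uL
# reaction volume are assumed values for illustration only.
volume = probe_volume_ul(stock_nM=1000.0, final_nM=10.0, final_volume_ul=50.0)
print(f"add {volume:.2f} uL of 1 uM probe, then hybridization buffer to 50 uL")
```

The same function can be reused across the titration series in Note 4 by changing `final_nM`.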

3.5 Mounting
1. Prepare clean slides and coverslips (see Note 5).
2. Mix 3 μL of concentrated cells with 3 μL of Imaging Buffer on
the slide by gentle pipetting.
3. Place a coverslip on the cells by wetting the edge with the cell
solution and gently lowering to avoid trapping any bubbles
underneath.
4. Use a tissue to gently remove any excess liquid from the edges
of the coverslip and create a monolayer of cells. Do not apply
pressure to the cells through the coverslip.
5. Seal the edges of the coverslip using 5-min epoxy (see Note 6).

3.6 Image Acquisition and Analysis
Full-resolution images are acquired at 100 ms exposure time. EM
gain should be chosen to use the full dynamic range of the camera
without saturating FISH spots across all samples in the experiment.
For the camera listed, an EM gain of 200 is adequate. To detect
single fluorophores, it is adequate to have 0.5 mW of laser power at
the sample plane, although dim excitation can be compensated by
longer acquisition time and higher gain. Laser power should be
chosen so that the photobleaching lifetime of the fluorophores is much
longer than the acquisition time. Images are typically acquired with
10–25 mW of power at the back focal plane when using 100 ms
exposure time. The magnification of the emission pathway should
be configured so that the images are properly sampled based on the
Nyquist bandwidth, which is accomplished at the following conditions
for a widefield microscope, where NA = n sin θ, n is the
refractive index, and θ is the half-aperture angle. Size_xy is the pixel
size and Size_z represents the optimal step size between focal planes.
\[ \mathrm{Size}_{xy} = \frac{\lambda_{em}}{4 n \sin\theta} \]
\[ \mathrm{Size}_{z} = \frac{\lambda_{em}}{2 n (1 - \cos\theta)} \]
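The two sampling conditions can be evaluated numerically. The following is a minimal sketch rather than part of the protocol: the function name is hypothetical, and the example values (670 nm emission for a far-red dye and n = 1.515 for immersion oil) are assumptions chosen for illustration, combined with the NA 1.4 objective from the equipment list.

```python
import math


def nyquist_sampling_nm(emission_nm: float, na: float, n: float):
    """Return (Size_xy, Size_z) in nm for a widefield microscope.

    Uses NA = n * sin(theta), Size_xy = lambda / (4 n sin(theta)) and
    Size_z = lambda / (2 n (1 - cos(theta))).
    """
    sin_theta = na / n
    if not 0.0 < sin_theta < 1.0:
        raise ValueError("NA must be positive and smaller than the refractive index")
    cos_theta = math.sqrt(1.0 - sin_theta ** 2)
    size_xy = emission_nm / (4.0 * n * sin_theta)
    size_z = emission_nm / (2.0 * n * (1.0 - cos_theta))
    return size_xy, size_z


# Assumed example values: 670 nm emission, NA 1.4, oil with n = 1.515.
size_xy, size_z = nyquist_sampling_nm(670.0, 1.4, 1.515)
print(f"pixel size <= {size_xy:.0f} nm, z step <= {size_z:.0f} nm")
```

Dividing the physical pixel size of the camera by `size_xy` gives the minimum total magnification needed in the emission pathway.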
Image analysis can be performed using the Image Processing
Toolbox in MATLAB. The image noise is smoothed by applying a
Gaussian filter that is matched to the width of the point spread
function of a single spot, and then background is approximated by a
wide Gaussian filter and subtracted. Peaks are located in the image,
and a three-dimensional Gaussian profile is fit to the peaks. This can
be accomplished using the FISH-quant program in MATLAB [11].
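The smoothing, background subtraction, and peak location steps can be sketched in a few lines. The code below is an illustrative two-dimensional version in plain Python, not the FISH-quant implementation: all function names, the filter widths, and the threshold are assumptions, and the three-dimensional Gaussian fit is omitted.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Filter every row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + j - radius, 0), width - 1)]
                 for j, k in enumerate(kernel))
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable 2D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def find_spots(image, psf_sigma=1.0, background_sigma=5.0, threshold=0.0):
    """Return (row, col) of strict local maxima above threshold.

    The image is smoothed with a narrow Gaussian (assumed PSF width) and a
    wide Gaussian estimate of the background is subtracted before peaks
    are located in the 3x3 neighborhood of each interior pixel.
    """
    smooth = gaussian_blur(image, psf_sigma)
    background = gaussian_blur(image, background_sigma)
    diff = [[s - b for s, b in zip(srow, brow)]
            for srow, brow in zip(smooth, background)]
    spots = []
    for y in range(1, len(diff) - 1):
        for x in range(1, len(diff[0]) - 1):
            value = diff[y][x]
            if value <= threshold:
                continue
            neighbors = [diff[y + dy][x + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                         if (dy, dx) != (0, 0)]
            if all(value > nb for nb in neighbors):
                spots.append((y, x))
    return spots


# Synthetic test image: flat background with one bright pixel.
image = [[10.0] * 21 for _ in range(21)]
image[10][10] = 100.0
print(find_spots(image, threshold=1.0))
```

For real image stacks an array library or FISH-quant itself would be the practical choice; the pure-Python form is used here only to keep the example self-contained.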

4 Notes

1. In principle, the annealing condition should destabilize secondary
structure via increased temperature and formamide
concentration; formamide lowers the apparent melting temperature
by about 1 °C per % (v/v). However, the best practice is to avoid secondary
structure, as well as any protein binding sites, if possible.
This has implications for designing probes with a toehold-hairpin
structure because the stem must have adequate length,
12 nucleotides, to remain metastable in the probe annealing
condition.
2. The success of spheroplasting is critical to probe entry into the
cell. However, cells become very fragile after the digestion of
the cell wall and should be handled gently in all further steps. If
there is a low efficiency of hybridization or a high variability in a
population where it is unexpected, the population of cells may
be insufficiently converted to spheroplasts. Besides using a
spectrophotometer, spheroplasting can be checked via inspec-
tion using DIC. Spheroplasts should be more spherical. There
should be significant debris that resemble the cell wall and there
should be reduced contrast at the cell boundary. Furthermore,
cells should display lysis when mixed with deionized water on a
coverslip.
3. Quality of the Wash Buffer is critical to prevent false positives.
Formamide is a highly reactive chemical, which will convert to
formic acid in the presence of water. Condensation of water on
the inside of the bottle through repeated openings without
warming will result in poor quality washes and high
non-specific signal.

4. The appropriate working concentration should be determined
by a titration in the range between 1 and 100 nM, where the
ratio of the number of detected spots to the number of false
positives is maximized. This titration involves measuring the
number of spots after hybridization for each concentration.
The number of spots detected should saturate at some point
in the concentration series. This should be compared to a
negative control, which should demonstrate no spots at any
concentration.
5. At a minimum, slides and coverslips should be free of dust and
other particles before mounting cells. This is critical to prevent
bubbles in the solution under the coverslip. Any bubbles will
cause rapid photobleaching in the sample. If any issues
with photobleaching arise, it is often due to either bubbles or
quality of the PCA or PCD, which will degrade in activity if left
at room temperature.
6. Many labs use nail polish to seal the edges of the slides. The
solvent in the nail polish can be highly auto-fluorescent and is
typically ethyl acetate-based, meaning that it will wick into the
solution under the coverslip. This degrades the quality of the
performance of the imaging buffer. These undesirable features
are not observed for epoxy.
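The titration described in Note 4 amounts to picking the concentration with the best ratio of detected spots to false positives. The sketch below is a hedged illustration: the function name is hypothetical, the pseudocount of 1 is an assumption added to avoid division by zero when the negative control shows no spots, and the counts are invented placeholders, not measurements.

```python
def best_probe_concentration(titration: dict) -> float:
    """Pick the concentration (nM) that maximizes detected / false-positive spots.

    titration maps concentration -> (spots in sample, spots in negative control).
    A pseudocount of 1 is added to the false positives (an assumption made
    here) so that a clean negative control does not cause a division by zero.
    """
    def ratio(concentration):
        detected, false_positives = titration[concentration]
        return detected / (false_positives + 1)

    return max(titration, key=ratio)


# Invented counts across the 1-100 nM range, for illustration only.
example = {1: (120, 0), 10: (900, 2), 100: (1000, 40)}
print(best_probe_concentration(example))
```

In this made-up series the detected spots saturate between 10 and 100 nM while false positives keep rising, which is the behavior Note 4 describes.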

Acknowledgments

This work was supported by Georgia Institute of Technology
startup funds, GAANN Molecular Biophysics and Biotechnology
Fellowship, and the National Institutes of Health grant
(R01-GM112882). The authors declare no conflicts of interest
or competing interests.

References
1. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190):1396–1401
2. Femino AM, Fay FS, Fogarty K, Singer RH (1998) Visualization of single RNA transcripts in situ. Science 280(5363):585–590
3. Raj A, Van Den Bogaard P, Rifkin SA, Van Oudenaarden A, Tyagi S (2008) Imaging individual mRNA molecules using multiple singly labeled probes. Nat Methods 5(10):877
4. Wadsworth GM, Parikh RY, Choy JS, Kim HD (2017) mRNA detection in budding yeast with single fluorophores. Nucleic Acids Res 45(15):e141–e141
5. Choi HM, Chang JY, Trinh LA, Padilla JE, Fraser SE, Pierce NA (2010) Programmable in situ amplification for multiplexed imaging of mRNA expression. Nat Biotechnol 28(11):1208
6. Moffitt JR, Zhuang X (2016) RNA imaging with multiplexed error-robust fluorescence in situ hybridization (MERFISH). Methods Enzymol 572:1–49
7. Cold Spring Harbor Protocols (2015) Synthetic defined (SD) medium. Cold Spring Harbor Protoc 2015:pdb.rec085639. https://doi.org/10.1101/pdb.rec085639
8. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
9. Broadwater DB, Altman RB, Blanchard SC, Kim HD (2018) ERASE: a novel surface reconditioning strategy for single-molecule experiments. Nucleic Acids Res 47(3):e14–e14
10. Broadwater DB Jr, Kim HD (2016) The effect of basepair mismatch on DNA strand displacement. Biophys J 110(7):1476–1484
11. Mueller F, Senecal A, Tantale K, Marie-Nelly H, Ly N, Collin O, Basyuk E, Bertrand E, Darzacq X, Zimmer C (2013) FISH-quant: automatic counting of transcripts in 3D FISH images. Nat Methods 10(4):277
Chapter 6

NanoSIP: NanoSIMS Applications for Microbial Biology


Jennifer Pett-Ridge and Peter K. Weber

Abstract
High-resolution imaging with secondary ion mass spectrometry (nanoSIMS) has become a standard
method in systems biology and environmental biogeochemistry and is broadly used to decipher ecophysio-
logical traits of environmental microorganisms, metabolic processes in plant and animal tissues, and cross-
kingdom symbioses. When combined with stable isotope-labeling—an approach we refer to as nanoSIP—
nanoSIMS imaging offers a distinctive means to quantify net assimilation rates and stoichiometry of
individual cell-sized particles in both low- and high-complexity environments. While the majority of
nanoSIP studies in environmental and microbial biology have focused on nitrogen and carbon metabolism
(using 15N and 13C tracers), multiple advances have pushed the capabilities of this approach in the past
decade. The development of a high-brightness oxygen ion source has enabled high-resolution metal
analyses that are easier to perform, allowing quantification of metal distribution in cells and environmental
particles. New preparation methods, tools for automated data extraction from large data sets, and analytical
approaches that push the limits of sensitivity and spatial resolution have allowed for more robust characteri-
zation of populations ranging from marine archaea to fungi and viruses. NanoSIMS studies continue to be
enhanced by correlation with orthogonal imaging and ‘omics approaches; when linked to molecular
visualization methods, such as in situ hybridization and antibody labeling, these techniques enable in situ
function to be linked to microbial identity and gene expression. Here we present an updated description of
the primary materials, methods, and calculations used for nanoSIP, with an emphasis on recent advances in
nanoSIMS applications, key methodological steps, and potential pitfalls.

Key words NanoSIMS, Isotope assimilation, Metal imaging, Single-cell biology, Sample preparation,
SEM, TEM, FIB, FISH, O ion source

1 Introduction

Understanding biological exchanges at the single-cell scale, especially
in complex systems, is one of the grand challenges of micro-
bial ecology and systems biology. This challenge includes
characterizing cell–cell interactions, linking phylogenetic identity
to ecophysiology for uncultured organisms, and quantifying rates
of elemental transfers within and between cells and their surround-
ing matrix. Recent advances in ‘omics techniques have enabled
unprecedented access to gene transcripts, metabolites, and pro-
teins, but rarely at the level of individual cells or mineral particles.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_6, © Springer Science+Business Media, LLC, part of Springer Nature 2022


These data have also enabled insights into the genomic potential of
uncultured organisms that exist in complex systems; however,
quantitative measures of metabolic functions of these organisms
and within-population variability remain largely untested. Isotope
tracing techniques are unique in their ability to identify in situ
ecophysiology of microorganisms and biogeochemical exchanges,
making them some of the most powerful techniques in microbial
ecology [1–6]. Among these approaches, the development of high-
spatial resolution secondary ion mass spectrometry [7], specifically
with a CAMECA NanoSIMS 50 and later the 50L, has opened up
new capabilities for taking on the challenge of single-cell scale
isotope imaging and has become a standard method for assessing
in situ metabolic activity.
Nanoscale secondary ion mass spectrometry (nanoSIMS) [7] is
a quantitative imaging technique where a high-energy primary ion
beam is used to sputter small volumes of sample surface material,
generating secondary ions that are used to create atomic or molec-
ular ion maps. Its high lateral resolution (~50 nm) and parts per
million to high parts per billion detection limit enables in situ
characterization of isotope enrichment and elemental composition
at the single-cell level. The NanoSIMS 50 and 50L (CAMECA,
Gennevilliers, France) can image 5–7 elements or isotopes simulta-
neously; additional species can be imaged using a magnetic peak
switching approach [8]. These characteristics enable mapping of
trace element and isotopic variations with submicron-scale resolu-
tion, including in subregions of individual cells (Figs. 1, 2, and 3).
Measurement precision in submicron regions is typically 1% for
isotope ratios; higher precision can be achieved in larger volumes.
As such, nanoSIP studies typically involve isotope or rare element
labeling, although microscale imaging of naturally occurring ele-
mental or isotope fractionation patterns is possible [9, 10].
NanoSIMS was first intensively applied to meteoritic material
[11], and in the early 2000s to biological materials ranging from
cell membranes to bacteria, eukaryote symbionts, archaea, cyano-
bacteria, spores, biominerals, and soils [12–28]. Interest in nano-
SIMS applications for microbial ecology, cell biology, and
environmental science has grown quickly between that period and
the present, with multiple CAMECA nanoSIMS instruments in use
specifically for these applications. Today, nanoSIMS analysis is a
well-accepted technique and has been discussed in over 1000 pub-
lications from many disciplines. Multiple literature reviews have
been published that focus on applications including soils [28, 29],
biofilms [30], marine ecology [31], cell metabolism [32], plant
elemental distribution [33], the combination of nanoSIP with
fluorescent in situ hybridization (FISH) [34], general biological
applications [35–37], and cell membranes [38]. An updated list of
nanoSIMS literature in environmental biology and cell biology may
be found at https://www.cameca.com/products/sims/nanosims
NanoSIP: NanoSIMS Applications for Microbial Biology 93

[Fig. 1 image panels: δ15N, δ15N, and TEM; δ15N color-scale labels (‰): 0, 360, 1120, 3080, 5810, 8530]

Fig. 1 Correlated nanoSIMS nitrogen isotopic composition and TEM images of a Trichodesmium thin-section
incubated for 8 h with 13C-HCO3 and 15N-N2. The cyanobacterial filament was resin embedded, ultramicro-
tomed into 200-nm-thick sections, imaged by TEM, and then analyzed by nanoSIMS. The nitrogen isotope data
are shown as deviations from the natural abundance value in parts per thousand, as indicated in the legend
(δ15N). Areas of 15N enrichment indicate localization of newly fixed nitrogen, which is accumulated in
cyanophycin granules (arrows) apparent in the TEM image (Finzi-Hart, Pett-Ridge et al. PNAS 2008)

Fig. 2 Thin section isotope imaging illustrates how newly acquired C and N are allocated to regions of active growth
or maintenance. Correlated TEM and nanoSIMS images of a filamentous cyanobacterium, Anabaena sp. SSM-00
(larger cells) infected by an epibiont (Rhizobium sp. WH2K) that attaches to the Anabaena heterocyst, the site of N
fixation. The δ13C and δ15N images show that newly acquired 13C and 15N fixed by Anabaena are used by the
epibiont, in addition to being allocated for active growth or maintenance in the Anabaena. Scale bar is 2 μm (Image
by Jennifer Pett-Ridge, in collaboration with A. Spormann and W.O. Ng, Stanford University)
94 Jennifer Pett-Ridge and Peter K. Weber

Fig. 3 TEM and nanoSIMS images illustrating the potential for analysis of subcellular elemental distribution in
resin-embedded and microtome-sectioned cells. Top row, left to right: TEM of ultramicrotome section of
mouse brain tissue, (a) a glial cell nucleus, (b) a blood vessel, and (c) myelinated axons are indicated; 12C14N
ion image; 31P ion image of the same region (In collaboration with B. Anderson, SUNY Stony Brook). Bottom
row, left to right: nanoSIMS secondary ion images showing the distributions of N (measured as CN−) and P in
sectioned non-Hodgkin’s lymphoma cells (Raji) (Image by Brenda Anderson and Peter Weber, in collaboration
with G. L. DeNardo, University of California, Davis)

[39]. In this updated version of our 2012 chapter [40], we discuss
advances in nanoSIMS analysis techniques, new applications, and
methodologies that are becoming standardized.

1.1 Recent Developments in NanoSIMS Systems Biology Research

With nanoSIP, metabolic activities of single microbial and eukaryotic
cells and their symbionts can be tracked by imaging natural
isotopic and elemental composition or isotope distribution after
a stable isotope tracing experiment [40]. Most nanoSIP environmental
microbiology studies have targeted nitrogen and carbon
metabolism (using 15N and 13C enriched tracers) (e.g., [31, 41–43]),
but a growing number discuss patterns of sulfur, phosphorus,
and metals (e.g., [44–49]) or use D2O as a means to track

active cells [50, 51]. While many of the earliest nanoSIP microbiol-
ogy studies were focused on aquatic bacterial and archaeal commu-
nities [18–20, 52], and 13C and 15N fixation in diazotroph cultures
such as Trichodesmium spp. [23] (Fig. 1) and Anabaena oscillar-
ioides (Fig. 2) [25], recent years have brought a large expansion in
the types of microbial study systems. These include methane pro-
ducers and consumers in aquatic and industrial waste treatment
systems [18, 53–58], many types of symbionts [43, 59–61], and
taxa found in the human gut microbiome [62] and insect gut
[63, 64]. NanoSIMS imaging has proved particularly useful for
studies of elemental exchanges between symbionts and has been
applied in sponges and corals [65–70], algal–bacterial interactions
[71–73], ant–plant–fungus interactions [74], and microbial mat
studies of multifunctional group interactions [75].
In the past decade, nanoSIP approaches have been used to
support a systems-level understanding in a substantially expanded
pool of study systems, including plants, fungi, soils, and viruses. In
plants, elemental distributions of Zn, Cd, Fe, Mg, K, Cu, As, Si,
and U have been mapped at the cellular and subcellular scale as a
means to understand patterns of hyperaccumulation, toxicity, and
metabolism (reviewed in Nunez et al. [37]). Transfers of carbon,
nutrients, and water between plant roots and mycorrhizal fungi,
first imaged by Nuccio et al. in 2013 [76], are particularly well
suited to nanoSIMS analyses, as these exchanges occur across a
microscale interface [76–84]. In soil, nanoSIMS imaging has the
potential to measure biogeochemical exchanges between diverse
phases, including bacteria, fungi, minerals, organic matter, and
phages, although the extreme spatial complexity demands a large
number of analyses to provide statistically robust conclusions. Since
Hermann et al.’s early perspective article [28], many dozens of
nanoSIP studies have explored soils, including the fate of isotopi-
cally enriched plant amendments to soil [29, 85–88], so-called
rock-eating microbes that weather primary minerals [85], the
incorporation of microbial necromass into soil organic matter
[87], and soil clay minerals that exhibit antibacterial properties
[89, 90]. Creative applications, including nanoSIMS analysis of μl
quantities of soil porewater [88] and cells separated from the soil
matrix via Nycodenz gradients [46, 91] can help to deconvolve the
isotope enrichment or elemental stoichiometry of distinct soil
pools. Viral and phage particles are a frontier for nanoSIMS imag-
ing, since their size is at the current outer limit of technical feasibil-
ity [92–94]. Novel approaches, such as low-energy ion
implantation (see below), may help to increase sensitivity for such
tiny particles, which are so thin that much of the sample could be
sputtered away in the initial moments of an analysis, when the
sputter rate can be 100 times higher than the equilibrium rate [94].
Environmental systems biology studies using nanoSIP have
also expanded in breadth in the past decade, and now reach far

beyond queries of C and N fixation and elemental distribution.
Using imaging of time-resolved isotope tracing studies, Stuart
et al., Hong-Hermesdorf, Miethke et al., and Finzi, Pett-Ridge
et al. all illustrated that cells can hold resources in temporary
storage molecules (respectively—extracellular polymeric sub-
stances, acidocalcisomes, cyanophycin) until needed for later use
[23, 48, 95–97]. NanoSIMS analyses have also been used to char-
acterize the ecophysiology of novel uncultivated organisms
[98, 99], and the cell-to-cell variability of growth and fixation
rates within populations [44, 100–103]. As these studies illustrate,
individual cells can have widely different assimilation patterns even
within highly clonal and synchronized populations. In recent years,
studies of cellular metal uptake and intracellular distribution have
proliferated [47, 48, 104] (see also citations in Nunez et al. [37]), in
part due to advances in O− primary beam sources (see below).
Many such studies have explored Fe metabolism, and the spatial
localization of organisms using Fe as an electron donor or acceptor
[105–107].
Multiple innovations have advanced the use of nanoSIP for
systems biology applications. These include more accurate isotope
assimilation enrichment calculations [41, 108, 186], automated
particle analysis software (and thus more highly replicated studies)
[41, 109], and use of various forms of spatial statistics, where
phenotypically similar features are grouped based on their nano-
SIMS chemical and isotopic fingerprints [49, 110]. Below, we
discuss these and several other notable methodological advances,
including the new negative oxygen ion source, low energy ion
implantation, improvements in sample preparation and combined
imaging and correlated analyses.

1.2 Recent NanoSIMS Instrumentation Innovations

The most notable technical advance for nanoSIMS in the past
decade was the development of a high-brightness negative oxygen
ion source, which enables positive secondary ion imaging with
50 nm resolution. In the CAMECA nanoSIMS instruments, imaging
resolution is determined by the analysis ion beam spot size, and
originally, only the micro Cs+ ion source had sufficient brightness to
achieve a 50 nm spot size; the lower brightness of the duoplasmatron
source (used to generate O− ions) allowed 100 nm resolution
at best, with lower stability and reliability. As such, many researchers
prioritized Cs+ analyses for electronegative elements like C and N
over O− analyses for metals.
In 2013, Oregon Physics (Hillsboro, OR, USA) produced a
high-brightness O− source called the Hyperion II, which generates
ions from oxygen gas using a radiofrequency (RF) inductively coupled
plasma (ICP) [111, 112]. The Oregon Physics system substantially
suppresses electron extraction while producing a high-brightness
O− beam. Based on tests with Lawrence Livermore
National Laboratory’s NanoSIMS 50, the Hyperion II was

Fig. 4 (i) Zebrafish embryo retina sections for wild-type (A, C, E, & G) and copper-deficient Calgw71 embryos
(B, D, F, H). Left to right orientation is from inner to outer retina. A, B: Anatomical nuclear staining for
reference. NanoSIMS images include: C, D: copper (Cu); E, F: phosphorous (P); G, H: overlay of copper and
phosphorous images. GCL: ganglion cell layer; IPL: inner plexiform layer; INL: inner nuclear layer; OPL: outer
plexiform layer; ONL: outer nuclear layer; RPE: retinal pigmented epithelium. Scale bar 25 μm. The NanoSIMS
copper ion image (D) for the copper-deficient embryos shows reduced copper in megamitochondria relative to
the wild-type in ONL, but elevated relative to other organs (not shown). These images provide evidence for
copper prioritization for vision. (ii) Standard curve for copper generated by nanoSIMS analysis of matrix-
matched standards plotted against bulk copper concentrations determined by liquid-phase ICP-MS. N  3
measurements per point. Error bars represent standard deviations (Akerman et al. Metallomics 2018)

modified to achieve ~50 nm spatial resolution and allow imaging of
low micromolar concentrations of metals in biological materials
with ~250 nm resolution (Fig. 4). Furthermore, the Hyperion II
output is very stable (<1% drift over 24 h) with a low maintenance
requirement (every 1–3 years). As a result, it is now the preferred
O− source for the CAMECA NanoSIMS 50L [113]. The superior
performance of this new source has the potential to attract more
researchers to trace metal analysis in biological systems. In our own
research, it has enabled higher throughput, as well as spatial reso-
lution sufficient for subcellular imaging [47].
Another notable technological advance is the new low-energy
ion implantation capability for the CAMECA nanoSIMS instruments
(extreme low impact energy, or EXLIE), which opens the
potential for analysis of smaller and thinner samples. In dynamic
SIMS instruments like the NanoSIMS 50, the analysis ion beam can
enhance the yield of desired secondary ions, with a Cs+ beam used
for negative secondary ions, and an O− ion beam for positive
secondary ions. However, this enhancement effect is weak at the
sample surface because ions from the analysis beam are implanted
some tens of nanometers below the surface. This is problematic for
samples that are only tens of nanometers thick, such as viruses. To
overcome this challenge, Cabin-Flaman et al. demonstrated that
initial Cs0 deposition resulted in a ~10× increase in ion yield at the

surface, allowing them to image strands of combed DNA
[114, 115]. In response, CAMECA has released a hardware and
software package that allows the operator to reduce the analysis
beam energy to only a couple of hundred volts so that the beam
effectively coats the sample locally. In our experience, this system
works well, but low energy Cs+ deposition is incompatible with
gold-coated samples (presumably the Cs interacts with the gold,
not the biological sample). Further use of EXLIE should enhance
quantitative analysis of extremely thin particles such as phage and
viruses, DNA and RNA, cell membranes and lipid rafts, exudates,
and other small and thin biological materials.

1.3 Moving Toward Standardized Methods

While nanoSIP has become widely used by the fields of systems
biology and microbial ecology, the application of high spatial resolution
SIMS to biology is still limited to a couple of dozen labs and
user facilities, each with its own protocols for analysis, standardiza-
tion, and data treatment. A number of important issues are still not
codified in the literature and not widely reported in nanoSIMS-
based publications:
1. Standards to demonstrate proper operation and tuning of the
instrument and for quantification of isotopic ratios and ele-
mental concentrations.
2. Effective mass resolving power (see Subheading 3.4.1) and
demonstration of negligible collection of isobaric interferences.
3. Pre-analysis ion implantation and sputtering equilibrium.
4. Demonstration of sample performance (charging, flatness,
orientation).
5. Data extraction protocols, including defining regions of
interest.
6. Effects of sample preparation.
As the systems biology community continues to elaborate on
the nanoSIP approach, it will serve the community to have a more
standardized approach. In the method description that follows, we
describe a series of protocols that could serve as a basis for
standardization.
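
To make item 2 above more concrete, the following sketch estimates the nominal mass resolving power (M/ΔM) needed to separate two species that share a nominal mass. It uses the simple ratio of mean mass to mass difference, which is an assumption on our part and is not necessarily the corrected definition referred to in Subheading 3.4.1. The rounded atomic masses and the function name are illustrative.

```python
# Approximate atomic masses in unified atomic mass units (rounded values).
MASS = {
    "1H": 1.007825,
    "12C": 12.000000,
    "13C": 13.003355,
    "14N": 14.003074,
    "15N": 15.000109,
}


def required_mrp(species_a, species_b):
    """Nominal M/dM needed to separate two species given as isotope lists."""
    mass_a = sum(MASS[isotope] for isotope in species_a)
    mass_b = sum(MASS[isotope] for isotope in species_b)
    if mass_a == mass_b:
        raise ValueError("species have identical mass and cannot be resolved")
    mean_mass = (mass_a + mass_b) / 2.0
    return mean_mass / abs(mass_a - mass_b)


if __name__ == "__main__":
    # 12C15N versus 13C14N, both at nominal mass 27
    print(round(required_mrp(["12C", "15N"], ["13C", "14N"])))
    # 13C versus 12C1H, both at nominal mass 13
    print(round(required_mrp(["13C"], ["12C", "1H"])))
```

Reporting the interference pairs considered and the resolving power achieved for each would be one simple way to document this item.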

2 Materials

2.1 Sample Selection and Experimental Design

1. Cultures, co-cultures, natural communities from soil, water or
sediment, tissues.
2. Treatments and controls, harvests from a temporal series
(if desired).
3. Final preparation must fit within a 50 mm circle and be vacuum
compatible.

2.2 Incubations for Stable Isotope Tracing in Cultures and Microbial Communities

1. Substrates labeled with stable isotopes. These can be purchased
from companies such as Cambridge Isotopes, Isotec-Sigma, or
JPT Peptide Technologies. Substrates may also be grown (e.g.,
13C and 15N plant litter) or purified in house [96].
2. To label cultures with gasses or gas-exchangeable compounds:
sealed vials, gas bags, or environmental chambers. For gas
injection: gas-tight syringe, gas tank regulator, and
extraction port.
3. Any inert container can be used for labeling experiments with
nongaseous compounds. Field labeling is also possible if a
portion of the system can be at least partially sealed off.

2.3 Options for Sample Preparation and Pre-analysis Characterization

1. Fixation: glutaraldehyde, paraformaldehyde, formaldehyde,
ethanol, fast freezing, and high-pressure freezing.
2. Embedding: epoxy, acrylic, elemental sulfur, sucrose, OCT,
paraffin.
3. Cutting: cryostat, ultramicrotome, razor blade, focused ion
beam (FIB) [116].
4. Sample support and mounting: Si wafers, transmission electron
microscopy (TEM) grid, filter, indium tin oxide (ITO)-coated
glass slides, vector bond, poly-L-lysine, egg white.
5. Sample mapping: epi-illumination, phase contrast, fluores-
cence, electron microscopy.
6. Coordinate encoding: light and electron microscopy systems
can be used.
7. Conductive coat: evaporator or sputter coater with carbon,
gold, iridium, or platinum.

2.4 High Spatial Resolution SIMS

1. The NanoSIMS 50 and NanoSIMS 50L (CAMECA) are state-of-the-art
instruments for isotope and elemental imaging.
These nanoSIMS instruments are a form of magnetic sector
SIMS with high spatial resolution (down to 50 nm), high mass
resolving power and transmission, and simultaneous detection.
They use a high-energy primary ion beam to interrogate the
sample (sputtering). In this process, a small volume of the
sample is impacted by the primary ion beam, breaking bonds,
and ejecting atoms and small molecules. A fraction of the
sputtered material spontaneously ionizes in proportion to the
element-specific ionization probability. The ions are extracted
by an electric field into a secondary ion mass spectrometer. The
sensitivity ranges from detecting 1 in 20 nitrogen atoms to 1 in
1,000,000 helium atoms, and mass resolving power (specific-
ity) can be up to 15,000 M/ΔM in corrected units (see Sub-
heading 3.4.1) [7, 11]. Imaging is achieved by scanning the
primary beam over the sample (in a region <50 μm²) and
reconstructing the ion images digitally.

2. Certified standards can be acquired from the U.S. National
Institute of Standards and Technology (NIST) or equivalent
agencies, though few are relevant for systems biology nanoSIP
studies. Reference standards can be generated “in house” by
characterizing samples by bulk methods and verifying high-
resolution homogeneity by replicate SIMS ana-
lyses (Fig. 4) [47]. A further option is ion implantation in
epoxy or other surrogate biological materials [117]. Tuning
samples can be cell cultures or other materials of known com-
position (e.g., NBS610 from NIST or a piece of metal) used for
instrument tuning and mass selection.
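
As an illustration of how in-house reference standards might be used for quantification, the sketch below fits a straight line through paired bulk concentrations and nanoSIMS signals and then inverts it for an unknown. The linear model, the function names, and the numbers are hypothetical and are shown only to make the idea of a standard curve (as in Fig. 4) concrete; they are not data from this chapter.

```python
def fit_calibration(bulk_concentrations, sims_signals):
    """Ordinary least-squares line: signal = slope * concentration + intercept."""
    n = len(bulk_concentrations)
    if n < 2 or n != len(sims_signals):
        raise ValueError("need at least two paired standards")
    mean_x = sum(bulk_concentrations) / n
    mean_y = sum(sims_signals) / n
    sxx = sum((x - mean_x) ** 2 for x in bulk_concentrations)
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(bulk_concentrations, sims_signals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_signal(signal, slope, intercept):
    """Invert the calibration line to estimate a concentration."""
    if slope == 0:
        raise ValueError("calibration slope is zero")
    return (signal - intercept) / slope


if __name__ == "__main__":
    # Hypothetical standards: bulk concentration versus normalized ion signal
    bulk = [1.0, 5.0, 10.0, 50.0]
    signal = [3.0, 11.0, 21.0, 101.0]
    slope, intercept = fit_calibration(bulk, signal)
    print(concentration_from_signal(41.0, slope, intercept))
```

In practice the signal would usually be normalized to a matrix ion, and unknowns should fall within the concentration range spanned by the standards.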

2.5 Data Analysis

1. NanoSIMS image analysis software (L’image, L. Nittler, Carnegie
Institution of Washington).
2. WinImage, CAMECA.
3. OpenMIMS (https://nano.bwh.harvard.edu/openmims), an
add-on for ImageJ, a freeware program available from the
U.S. National Institutes of Health (http://rsb.info.nih.gov/ij/download.html).
4. Look@NanoSIMS [118], a freeware program developed for
MATLAB (MathWorks).

3 Methods

3.1 Sample Selection and Experimental Design

A wide range of biological samples can be analyzed by nanoSIMS if
properly prepared (see Subheading 3.3). Experimental effects
should be maximized to compensate for sample heterogeneity
and analysis precision: ideally with isotope enrichment >1 atom %
or trace element concentration differences that are >2-fold. Typi-
cally, treatment samples are referenced to control samples. For
nanoSIP experiments, useful controls include the initial mate-
rial, no-heavy isotope addition controls, and time-zero isotope
addition controls. If an isotopically labeled solid substrate has
been used as an amendment (e.g., 13C plant material, necromass,
EPS [87, 96, 119–121]), it is essential to analyze some of the same
material “neat”—to understand its microscale heterogeneity. For
trace element studies, no-treatment controls are likely sufficient.
For many experiments, time course analyses aid data interpretation
[23, 95, 100–102]. Finally, while nanoSIMS analysis time is fre-
quently costly and limited, biological replicates are essential for
each timepoint and treatment and will substantially improve statis-
tical power.

3.2 Isotopic Labeling of Cultures and Microbial Communities

If an isotope label is to be tracked, the labeled substrate will depend
on the experimental goals, but can range from dinitrogen gas to
amino acids to complex biomolecules such as cellulose. Typically,
13C and/or 15N are added as tracers in nanoSIP studies because

Fig. 5 NanoSIMS ion images showing co-localization of bromine (77Br) with phosphorus (31P) in a HeLa cell,
indicating the incorporation of BrdU into DNA. The high P signal shows the location of the DNA in the nucleus.
The lack of correlation between bromine and chlorine (35Cl) indicates that the distribution of bromine is not
the result of being a trace constituent in the major halide-bearing molecules. Therefore, these results showed
that the Br accumulates in the nucleus, suggesting that the DNA-RNA hybrid was being degraded. The cells
were grown on a Si wafer, treated with BrdU, fixed and dried, and analyzed in a nanoSIMS by sputtering with
high beam current until the nucleus was reached (Image by Peter Weber, in collaboration with L. Dugan, LLNL)

they can be used without altering cellular function (Figs. 1 and 2).
Other options include 18O2- and 2H-labeled substrates and water.
Elemental labels such as F, Br, and I can also be used as tracers
[22, 122]. For example, bromine-labeled deoxyuridine (BrdU)
may be used as a DNA tag to track cellular division [24, 123, 124]
and can be used to track the fate of Br-labeled nucleic acids
(Fig. 5). Methods for introducing isotopically labeled substrates
can follow the pattern established by stable isotope probing (SIP)
[125, 126], a set of widely accepted techniques used in microbial
ecology. As a general principle, incubation experiments must last
significantly longer than the time of diffusion into the sample;
however, a balance must be struck in order to avoid cross-feeding
effects. Depending upon the research goal, each labeling experi-
ment will necessarily have minor differences, though many may
resemble the protocol below, which was used to label a freshwater
cyanobacterial culture with 13C and 15N [25] (Fig. 6).
Example Isotope Labeling Protocol: A. oscillarioides was grown in
liquid culture with standard conditions, nutrients, buffer, and trace
element amended media [25]. Exponential phase cultures were transferred
to sealed serum vials with no gas phase. The cultures were then
incubated for 24 h under a 12 h light: 12 h dark illumination regime.
At the outset of the pulse labeling, 0.07 ml of NaH13CO3 (~99 atom %
13C, 0.047 M; final enrichment of 1.7 atom % 13C-dissolved inorganic
carbon) and 0.3 ml of 15N2 (99 atom %, 0.57 mM; final enrichment of
13.6 atom % 15N2) were injected into each vial. Basic environmental
factors (irradiance, temperature, pH, starting inorganic N and C
pools) were measured during the incubation period. At multiple time
points (0 min, 15 min, 30 min, 1 h, 2 h, 4 h, 8 h, and 24 h), a vial was
destructively sampled and cells were fixed with 2% glutaraldehyde in
order to determine net uptake rates over the diel cycle.
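
The final enrichment values quoted in a protocol like this follow from a simple mass balance between the unlabeled pool and the added tracer. The sketch below shows that arithmetic under stated assumptions: the tracer amount is taken as 0.07 ml of a 0.047 M stock (assuming the quoted molarity refers to the stock solution), whereas the size and isotopic composition of the background pool are hypothetical placeholders, since they are not given here. Function names are illustrative.

```python
def mixed_atom_fraction(background_amount, background_fraction,
                        tracer_amount, tracer_fraction):
    """Heavy-isotope atom fraction after mixing a tracer into an existing pool."""
    total = background_amount + tracer_amount
    if total <= 0:
        raise ValueError("total amount must be positive")
    heavy = (background_amount * background_fraction
             + tracer_amount * tracer_fraction)
    return heavy / total


def tracer_amount_needed(background_amount, background_fraction,
                         tracer_fraction, target_fraction):
    """Tracer amount (same units as background) to reach a target atom fraction."""
    if not background_fraction < target_fraction < tracer_fraction:
        raise ValueError("target must lie between background and tracer values")
    return (background_amount * (target_fraction - background_fraction)
            / (tracer_fraction - target_fraction))


if __name__ == "__main__":
    tracer_umol = 0.07e-3 * 0.047 * 1e6   # 0.07 ml of 0.047 M stock, in umol
    background_umol = 500.0               # hypothetical dissolved inorganic C pool
    natural_13c = 0.011                   # approximate natural 13C atom fraction
    final = mixed_atom_fraction(background_umol, natural_13c, tracer_umol, 0.99)
    print(f"final enrichment: {100 * final:.2f} atom % 13C")
```

Running the calculation in both directions before an experiment is a quick check that the planned tracer addition reaches the intended enrichment without greatly changing the size of the substrate pool.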

[Fig. 6 image. Panels A.1–A.3: cells 19–23 (cell 22 = Het.B), with 13C APE and 15N APE color scales; scale bar 5 μm. Panel B: filament with heterocysts Het.A, Het.B, and Het.C. Panel C: APE (0–0.25) plotted against cell number (1–50) for 13C APE and 15N APE, with Het.A, Het.B, and Het.C marked.]

Fig. 6 (a) Chain of five cells from a filament of A. oscillarioides analyzed by nanoSIMS after 4 h of
incubation with H13CO3 and 15N2. Het heterocyst. Individual cells are numbered to correspond with the
numbering in part c. (a.1) Image reconstruction based on secondary electrons. (a.2) The distribution of 13C
enrichment. (a.3) The distribution of 15N enrichment. Enrichment is expressed as atom percent enrichment
(APE). (b) Post-analysis nanoSIMS secondary electron image of a filament of 50 cells of A. oscillarioides
showing three heterocysts after 4 h of incubation with H13CO3 and 15N2. The white box
indicates the area shown in the images a.1, a.2, and a.3. (c) The cell-to-cell variation in 13C (diamonds) and
15N enrichment (squares) along the same 50-cell filament. There are 1–6 independent replicate measurements
per cell. Error bars represent two standard errors (Popa et al. ISME Journal 2007)

Following a 13C and/or 15N tracer experiment (e.g., with
compounds such as 13C-substrate, 15N2, or 15NH4+), the rate of
C or N assimilation may be quantitatively determined with nano-
SIMS data. In general, exposure periods should be kept brief rela-
tive to the doubling time of microbial populations, and subsamples
should be harvested at multiple time points during the isotope
incubation in order to measure and minimize recycling and leakage,
which for N can approach 35% of newly fixed material [52]. As

nanoSIMS measures total elemental or isotopic signal and does not
discriminate between nitrogen derived from NO3−, NH4+, or
amino pools, measurements yield net uptake only, not gross processing
of a substrate. The amount of C or N lost from a cell due to
secondary metabolite production, denitrification, leakage, or sam-
ple preparation effects cannot be precisely measured with nano-
SIMS analysis. Therefore, we define assimilation strictly as all new C
or N from the labeled substrate in the cell regardless of whether the
organism has utilized it for organic biosynthesis or not.
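
One common way to express net assimilation as defined above is a two-end-member mixing calculation, sketched below. It assumes a two-isotope system (such as 13C/12C or 15N/14N), an unlabeled control that represents the starting composition, and a known substrate enrichment. The function names are hypothetical, and this simple form is not necessarily the calculation used in the studies cited in this chapter.

```python
def atom_fraction(ratio):
    """Convert a heavy/light isotope ratio to a heavy-isotope atom fraction."""
    if ratio < 0:
        raise ValueError("isotope ratio cannot be negative")
    return ratio / (1.0 + ratio)


def atom_percent_excess(sample_ratio, control_ratio):
    """Atom percent excess (APE) of a sample relative to an unlabeled control."""
    return 100.0 * (atom_fraction(sample_ratio) - atom_fraction(control_ratio))


def net_fraction_from_substrate(sample_ratio, control_ratio,
                                substrate_atom_fraction):
    """Fraction of the element in a cell derived from the labeled substrate."""
    f_sample = atom_fraction(sample_ratio)
    f_control = atom_fraction(control_ratio)
    if substrate_atom_fraction <= f_control:
        raise ValueError("substrate must be enriched relative to the control")
    return (f_sample - f_control) / (substrate_atom_fraction - f_control)


if __name__ == "__main__":
    # Hypothetical 15N/14N ratios for a labeled cell and an unlabeled control
    cell_ratio, control_ratio = 0.0200, 0.0037
    print(f"APE: {atom_percent_excess(cell_ratio, control_ratio):.2f}")
    print(net_fraction_from_substrate(cell_ratio, control_ratio, 0.136))
```

Dividing this net fraction by the incubation time gives a rate estimate, which is interpretable only while recycling and leakage remain small.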

3.3 Sample Preparation and Pre-analysis Characterization

Sample preparation is critical to the success of any nanoSIP experiment
and in some cases is the most challenging step. SIMS is an
ultra-high vacuum (~10⁻¹⁰ Torr) technique, and samples must be
prepared for the vacuum chamber in a way that preserves the
molecular and elemental distribution of interest. NanoSIMS imaging
cannot be used for in vivo studies, and samples cannot be
analyzed in an aqueous phase without a cryogenic stage [127]. To
prepare samples, it is often necessary to stabilize biological components
(fixation), remove water (dehydration) and salts (derived
from growth media or sea or sediment water), mount samples on
a conductive support (Si wafer, TEM grid), and then either proceed
to an intact sample analysis or follow with embedding and sectioning.
For some nonaqueous sample types (soils, fungal hyphae), we
have found it workable to analyze unfixed samples [28, 87]. For
other samples, it is ideal to separate cells or particles from a matrix
prior to nanoSIMS analysis; in these cases, a Nycodenz gradient,
flow cytometry, or microfluidics approach can be used [46, 91, 103,
128, 129].

3.3.1 Sample Flatness and Conductivity

While ideal samples are flat with no more than nanometer-scale
variations in surface topography, in our experience, it is possible to
work with non-flat samples. The primary concern that topography
introduces is increased error in isotopic measurements, which results
from spot-to-spot variations in ion extraction conditions that can
detune the mass spectrometer. On a perfectly flat sample (e.g.,
individual spores), ~1 permil (‰) precision is possible when imaging
with electron multipliers. However, with large cells, soil particles,
or other sources of surface irregularity, often only percent-level
precision is possible. For a given sample type, it is necessary
to establish the precision of the measurement conditions by using
samples comparable to the samples of interest. In most cases,
control samples that were not exposed to isotopically labeled substrates
are the best option. In many nanoSIP studies, the goal is to
achieve isotopic enrichment of >100 permil; at these
enrichment levels, even many micrometers of surface topography
can be tolerated [63, 110].
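
Because precision and enrichment are both quoted here in permil, it can help to convert measured ratios to delta notation and compare the enrichment with the measurement precision. The sketch below does this; the reference ratio (an approximate atmospheric 15N/14N value), the function names, and the example numbers are assumptions for illustration only.

```python
AIR_15N_14N = 0.003676  # approximate 15N/14N of atmospheric N2 (assumed value)


def delta_permil(sample_ratio, reference_ratio):
    """Deviation of a sample ratio from a reference ratio, in permil."""
    if reference_ratio <= 0:
        raise ValueError("reference ratio must be positive")
    return (sample_ratio / reference_ratio - 1.0) * 1000.0


def ratio_from_delta(delta, reference_ratio):
    """Inverse of delta_permil: recover the isotope ratio from a delta value."""
    return reference_ratio * (1.0 + delta / 1000.0)


def enrichment_to_precision(delta, precision_permil):
    """How many multiples of the measurement precision the enrichment spans."""
    if precision_permil <= 0:
        raise ValueError("precision must be positive")
    return abs(delta) / precision_permil


if __name__ == "__main__":
    measured = ratio_from_delta(100.0, AIR_15N_14N)  # a 100 permil enrichment
    d = delta_permil(measured, AIR_15N_14N)
    # Compare flat-sample (~1 permil) and rough-sample (~10 permil) precision
    for precision in (1.0, 10.0):
        print(precision, enrichment_to_precision(d, precision))
```

An enrichment that spans many multiples of the precision established on unlabeled controls can be interpreted with confidence even on rough samples.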
Because SIMS instruments use an ion beam to interrogate the
sample and extract ions and electrons, sample charging is a critical

consideration. If the sample charges, the extracted secondary ions
will have the wrong energy with respect to the tuning of the
secondary mass spectrometer, resulting in a loss of mass resolving
power and potentially a shift in the mass line. This is primarily an
issue for the analysis of negative secondary ions because a significant
current of electrons is extracted while a beam of Cs+ ions is being
deposited in the sample. If the sample is nonconductive, it
will rapidly charge, ruining the analysis. As a practical matter,
sample charging can be identified when there are sample regions
that appear to have close to zero secondary electron counts. To
minimize charging during nanoSIMS analysis, samples (whether
intact or sections) are typically coated in an evaporator or sputter
coater with a 2–20 nm layer of gold or other conductor (e.g.,
carbon, iridium, and platinum). As a general rule, the more topog-
raphy the sample has, the thicker the conductive coat needs to be to
bridge topographic gaps.
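
The practical check for charging described above (regions with close to zero secondary electron counts) can be approximated with a simple screen of the secondary electron image. The sketch below flags pixels whose counts fall below a small fraction of the image median; the 5% cutoff, the nested-list image format, and the function name are hypothetical choices and not an established criterion.

```python
def low_signal_fraction(se_image, threshold_fraction=0.05):
    """Fraction of pixels with secondary electron counts near zero.

    se_image is a list of rows of counts. A pixel is flagged when its
    count is at or below threshold_fraction times the image median
    (the upper median is used when the pixel count is even).
    """
    pixels = [count for row in se_image for count in row]
    if not pixels:
        raise ValueError("image contains no pixels")
    ordered = sorted(pixels)
    median = ordered[len(ordered) // 2]
    cutoff = threshold_fraction * median
    flagged = sum(1 for count in pixels if count <= cutoff)
    return flagged / len(pixels)


if __name__ == "__main__":
    # Hypothetical 3 x 3 secondary electron image with two dark pixels
    image = [
        [100, 100, 100],
        [100, 0, 100],
        [100, 1, 100],
    ]
    print(f"{low_signal_fraction(image):.3f}")
```

A high flagged fraction suggests inspecting the image for charging before trusting ratio data, although dark pixels can also reflect holes or topographic shadowing in the sample.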
For biological samples in the absence of minerals, sample
charging is generally not a problem, even though biological mate-
rials are inherently nonconductive. After sputtering equilibrium is
reached, the sample becomes sufficiently conducting to perform
high-quality analyses. For this same reason, analyses of biological
particles can be performed on filters without having to do more
than deposit a conducting coat on the surface to enable the charge
to dissipate to ground. In fact, monolayers of cells on a conducting
substrate can be analyzed without a conductive coat because the
sample stops charging after sputtering equilibrium. Nonetheless, at
LLNL, we normally apply a conductive coating to our samples to
facilitate initial imaging.
Samples with a high mineral or salt component often present a
greater challenge. Most minerals will charge under Cs+ analysis after
the conductive coat is sputtered away. In these cases, an electron
flood gun is needed for charge compensation. While not overly
difficult, the electron flood gun does add complexity to the analysis
and secondary electron imaging cannot be performed at the
same time.
For samples that are to be analyzed intact, some will need to be
washed in deionized water to remove salts or other compounds that
could coat cells or mineral particles and interfere with ion extrac-
tion or cause charging. For cells or particles, washing on a filter is
very efficient, and nucleopore or polycarbonate filters can be used
as a sample substrate if they are flat at the micron scale. Other ideal
sample substrates include Si wafers, plastic slides, and indium-tin-
oxide (ITO)-coated glass slides. Conductive sample substrates are
preferred to insulators, which will charge as soon as the conductive
coat is sputtered away. Cell cultures grown on a solid substrate can
be gently washed with repeated immersion in deionized water.
Poly-L-lysine, vector bond, egg white, or other surface coatings
are useful to enhance adhesion to the sample substrate.

3.3.2 Fixation

Fixation of biological tissues is designed to preserve cell morphology
and immobilize analytes of interest for imaging analysis. Chemical
fixatives (glutaraldehyde, paraformaldehyde, formaldehyde,
ethanol, osmium tetroxide [130–132]) work well if proteins and
other structural molecules are the targets. For these analyses, any
fixation approach that is suitable for scanning electron microscopy
(SEM) imaging will likely work for SIMS imaging. However, more
complex methods such as low temperature methods (flash freezing
and high-pressure freezing [133–135]) are sometimes warranted to
preserve the distribution of small molecules and diffusible ions in
biological samples. It is best to avoid applying the stains typically
used in EM imaging (e.g., uranyl acetate) in cases where the ele-
mental composition of the sample is of interest. It is also important
to recognize that fixatives can cause significant isotope dilution;
several nanoSIP studies have shown a stepwise dilution of isotopic
composition after chemical fixation, FISH, and CARD-FISH pro-
tocols [34, 136, 137].
The selection of a fixation procedure is a practical matter—if no
fixation is necessary, none should be used. When needed, chemical
fixatives can be added directly to samples in solution at concentra-
tions ranging from 2% to 4%. But the effects of fixatives may be
highly sample dependent, and the SIMS community has reported
widely differing experiences. Glutaraldehyde is a very aggressive
cross-linking agent and is thought to be incompatible with other
treatments, such as FISH. Osmium tetroxide is known to cross-link
phospholipids. Fixation is not necessary for bacterial spores [26]
and potentially encysted microbes. By contrast, vegetative cells are
prone to lysis without fixation, especially during washing to remove
salts [25]. Hermann et al. [138] report only 35% of photosyntheti-
cally fixed 13C was retained as protein in symbiotic algae, following
chemical fixation in a glutaraldehyde:paraformaldehyde mixture.
In our experience at LLNL, mean nanoSIMS isotope ratios of
cyanobacteria fixed with glutaraldehyde correspond well with the
isotope enrichment measured in the same cells via IRMS [23, 25],
as long as enrichment values are less than 50 atom %.
Cryogenic methods of tissue fixation are presumed to be more
conservative but are substantially more laborious, and flash freezing
and high-pressure freezing can only be applied to sample aliquots
or very small samples [131]. In studies where significant migration
of the element of interest is likely to occur during sample prepara-
tion, low-temperature methods such as freeze-drying may be the
best solution [47, 139], but more work is needed to demonstrate
quantitative elemental distribution retention.

3.3.3 Embedding and Sectioning

In cases where the goal is to target intercellular elemental or isotopic
distribution (e.g., Figs. 1, 2, 3, and 7), embedding and sectioning
will likely be needed prior to nanoSIMS analysis. As with other
aspects of sample preparation, the embedding and sectioning
106 Jennifer Pett-Ridge and Peter K. Weber

Fig. 7 Correlated SEM and nanoSIMS micrographs showing the localization of Rubisco, labeled with 5 nm
immuno-gold in thin sections of the cyanobacterium Trichodesmium IMS 101. The immuno-gold can be
imaged by nanoSIMS, allowing stable isotope tracing to be combined with immuno-localization. Note that the gold enhances the
production of CN− ions (In collaboration with G. Sandh & B. Bergman, Stockholm University)

method should be chosen with the target ions and molecules in
mind. Key questions to consider are:
1. Will in situ hybridization or antibody labeling be performed on
the section?
2. Are diffusible ions or molecules of interest?
3. Will the embedding medium be a significant source of interfer-
ence with the target species?
If none of the above cases apply, then standard embedding
methods will likely work and have previously been used to localize
13C- and 15N-labeled structural molecules [23, 24, 27] and fragile
marine aggregates [140]. Samples can be embedded in a number of
polymers for room temperature sectioning (e.g., epoxy, acrylic,
paraffin [134]). Where larger areas need to be analyzed, histological
methods can be used [141, 47]. In situ hybridization or antibody
labeling requires the fewest modifications to standard embedding
methods for successful labeling. The fixative should minimize
cross-linking of the target (e.g., paraformaldehyde instead of glu-
taraldehyde), and the embedding medium should allow exposure of
the target molecules. For resin embedding, acrylic tends to pene-
trate samples more readily in our experience. Even better nano-
SIMS results can be achieved if the embedding medium is porous or
removed after sectioning, such as with most histological and cryo-
genic methods [47, 134, 141].

If diffusible ions and molecules are of interest, embedding
methods that employ room temperature liquids should be avoided.
FIB (focused ion beam) sectioning is likely the best option for
preserving the distribution of diffusible species because a fully dry
sample can be sectioned; however, the method requires specialized
equipment and limited sample material can be processed (TEM
sections are particularly time consuming to make by FIB section-
ing). If the samples are only destined for SIMS analysis, top-cutting
may be a more rapid option [116]. One other potential alternative
is sulfur embedding [142–144] which we have used to section
heterogeneous soil aggregates [28].
A final embedding/sectioning option is cryogenic sectioning,
which can be performed with sucrose, OCT, or similar compounds.
Cryosectioning of water-ice embedded samples is also an option,
but is challenging. Cryogenic methods will only preserve the distribution
of diffusible ions and molecules if there is no cryo-protectant
infiltration and fast freezing is employed; both are
major changes from standard protocols and not easily implemen-
ted. In particular, not using a cryo-protectant (e.g., sucrose) leaves
the frozen section brittle and very difficult to section.
Sectioning can be performed with an ultramicrotome, a stan-
dard “histological” microtome or cryostat, or even with a razor
blade, depending on the type of pre-nanoSIMS imaging that is
desired. Standard TEM-grade ultrathin sections (~100 nm) can
be analyzed by nanoSIMS; however, more data can be collected
from thicker sections (up to 500 nm) if lesser TEM image quality is
acceptable. Thicker sections are also desirable if large areas (mm²)
need to be imaged or analyzed. If transmission light imaging is
necessary during the sample mapping phase, indium tin oxide
(ITO)-coated glass slides are preferable to uncoated glass slides
because they do not charge in the SIMS. An adhesive surface
coating (e.g., poly-L-lysine) is necessary to retain cryogenic sections
during washing or staining. FIB milling can be used as an alterna-
tive to embedding and sectioning [116], particularly where the user
needs to have precise control over the location and orientation of
the section. All thin sections can be laid onto a TEM grid or directly
on a solid substrate prior to nanoSIMS analysis.
As an example of a general procedure for sample preparation,
before nanoSIMS microanalysis, the filaments of A. oscillarioides
(described above) were fixed with glutaraldehyde, filtered, washed
with Milli-Q (18 MΩ) H2O, transferred onto a silicon wafer and
dried. Since the filaments were sufficiently large, light microscopy
was used for navigation and target identification (Fig. 6).

3.3.4 Sample Mapping

Sample mapping is the final critical step prior to nanoSIMS analyses;
it can greatly enhance operator efficiency and is often essential
to interpretation of results. Most nanoSIMS instruments have the

equivalent of an epi-illumination microscope for sample navigation,
and therefore, epi-illumination micrographs provide the best reference
images for general navigation. SEM mapping (and TEM or
scanning transmission electron microscopy (STEM) for thin sec-
tions) can also positively identify targets for analysis; these images
are often comparable (though with higher resolution) to the sec-
ondary electron or ion images generated by nanoSIMS. An ideal
series of mapping images should capture the whole sample scale, as
well as individual target analysis locations, with reference points
that can be used to translate from one image scale to the next. For
target points that are difficult to find in the NanoSIMS light imag-
ing system, such as very small or complex targets, coordinate
encoding (relative to obvious fiducial points) can aid navigating
for analysis. Matrix-based coordinate transformations simplify the
translation of coordinates to the CAMECA NanoSIMS, which has a
somewhat nonintuitive coordinate system. When analyzing samples
on Si wafers, we often make faint scratch marks with a diamond-tipped
pen before the sample is deposited; these marks provide
unique reference points.
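
A matrix-based coordinate transformation of the kind described above can be sketched as follows. This is a minimal illustration, not the CAMECA software's method: it assumes three non-collinear fiducial points whose positions are known both in a reference micrograph and in stage coordinates, and fits an affine transform between the two. All function names and coordinate values are hypothetical.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b using Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("fiducial points are collinear; choose other points")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det(m) / d)
    return solution


def fit_affine(image_pts, stage_pts):
    """Fit stage = A @ image + t from three fiducial point pairs."""
    a = [[x, y, 1.0] for x, y in image_pts]
    row_x = solve3(a, [p[0] for p in stage_pts])
    row_y = solve3(a, [p[1] for p in stage_pts])
    return row_x, row_y


def to_stage(transform, point):
    """Map an image coordinate to a stage coordinate."""
    row_x, row_y = transform
    x, y = point
    return (row_x[0] * x + row_x[1] * y + row_x[2],
            row_y[0] * x + row_y[1] * y + row_y[2])


# Hypothetical fiducials (e.g., scratch-mark intersections): image pixels -> stage units.
image_fiducials = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
stage_fiducials = [(500.0, 200.0), (500.0, 300.0), (400.0, 200.0)]

transform = fit_affine(image_fiducials, stage_fiducials)
target_stage = to_stage(transform, (50.0, 50.0))
print(target_stage)
```

Three point pairs determine the transform exactly; with more fiducials, a least-squares fit would average out picking errors.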

3.4 NanoSIMS Analyses

High spatial resolution SIMS (better than 0.5 μm lateral resolution)
is necessary to characterize the isotopic and elemental composition
of individual microbial cells. The CAMECA NanoSIMS 50 and
50L are the state-of-the-art instruments for combining high lateral
resolution, high mass resolution, and high transmission and may be
used for both stable isotope and trace element analyses of microbial
samples (outlined below). These instruments have two modes of
analysis: a Cs+ primary beam to generate negative secondary ions,
or an O− primary beam to generate positive secondary ions. As a
general rule, electronegative elements (e.g., halides) are detected as
negative secondary ions, and electropositive elements (e.g., metals)
are detected as positive secondary ions. Manufacturer manuals and
standard references on SIMS can provide additional guidance on
the choice of detection polarity [117]. In some cases, an experiment
requires both electronegative and electropositive elements to be
mapped in the same sample. This is possible, but changing polarities
is a multiple-hour effort. Alternatively, at high enough concentrations,
some elements can be imaged with sufficient sensitivity
in their non-typical polarity (e.g., FeO− instead of Fe+; C+ instead
of C−; P+ instead of P−; Fig. 4) [26, 47, 117].
For any analysis, it is useful to have standard samples that are
routinely used for tuning. This allows session to session comparison
of transmission, mass resolving power (MRP), and elemental or
isotopic ratios. Standards are also important for finding the correct
species, which can be particularly challenging for higher masses.
Simple reference materials (e.g., iron) are easier to work with than
multi-element standards like the National Institute of Standards and
Technology’s NBS610, which has 500 μg/g of most elements.

[Fig. 8 panels: (A) mass scan at mass 26, shown as log and linear plots, with the 12C14N and H12C13C peaks and the flat-top peaks marked; (B) δ13C (permil, uncorrected; axis from -50 to -30) for spores and a standard]

Fig. 8 Flat-top peaks and ultimate precision. (a) Logarithmic and linear plots of a mass scan at mass 26.
12C14N− is readily resolved from H12C13C−, which is 0.007 amu heavier. 13C2− is only 0.004 amu heavier than
12C14N− and could be resolved, but typically is 4–5 orders of magnitude less abundant, and therefore is
negligible. Note that the 12C14N− peak is flat-topped, which means that a range of mass lines from the top of
the peak can be aligned with the detector and precise measurements still be achieved. (b) Measurement
precision is affected by instrument tuning and stability and sample characteristics, but the ultimate limit on
measurement precision is the number of ions collected for the minor species. Therefore, in this example, the
precision of the measurements of bacterial spores is lower than the precision for the graphite standard
because the spores have less mass, and therefore fewer 13C counts

However, there are characteristic spectra for NBS610 that can be
used for mass calibration, such as the 56Fe+ peak below a ~100×
larger 40Ca16O+ + Si2+ peak at mass 56. Setting up for carbon and
nitrogen isotope measurements can easily be done with any
biological sample (Fig. 8).
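
The counting-statistics limit noted in the caption of Fig. 8 can be illustrated with a short sketch. It assumes purely Poisson-distributed counts and ignores instrumental contributions, so it gives only the best achievable precision; the function name and the count totals are hypothetical.

```python
import math


def ratio_relative_precision(minor_counts, major_counts):
    """Best-case 1-sigma relative uncertainty of minor/major from Poisson statistics."""
    return math.sqrt(1.0 / minor_counts + 1.0 / major_counts)


# Hypothetical totals: a small spore yields far fewer minor-isotope counts
# than a bulk standard analyzed under the same conditions.
spore = ratio_relative_precision(minor_counts=1_000, major_counts=90_000)
standard = ratio_relative_precision(minor_counts=100_000, major_counts=9_000_000)

print(f"spore: {spore * 1000:.1f} permil; standard: {standard * 1000:.1f} permil")
```

Because the minor species dominates the sum under the square root, precision improves roughly tenfold for every hundredfold increase in minor-isotope counts.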

3.4.1 NanoSIMS Tuning and Estimating Mass Resolving Power

Tuning a SIMS instrument requires expert knowledge. The central
aspects of SIMS instrument tuning are primary ion beam alignment,
peak shape, mass selection, and resolving isobaric interferences,
all of which are important variables to report in a
nanoSIP article’s methods description. Here we present the basic
issues.
The alignment and focus of the primary ion beam (analysis
beam) set the location of the ion source for the secondary mass
spectrometer and determine the quality of the ion images. Grid
samples are typically used to identify and correct for sources

of distortion and calibrate the scanning scale. If high current sputtering
is used to reduce the time to achieve sputtering equilibrium,
the higher and lower current beams need to be aligned. This
alignment should be done before finalizing the tuning of the
instrument because sometimes it is better to move the lower cur-
rent beam position toward the higher current one to optimize ion
current or quality of the focus.
To obtain accurate measurements, the instrument must be
tuned and aligned to collect the ions of a species of interest to the
exclusion of other species at the same nominal mass. The secondary
ion beam for the species of interest must be tightly focused at the
detectors and separated by multiple beam diameters from adjacent masses, and
the detector must be aligned to collect effectively 100% of the
transmitted ions of interest and only those ions, with room for
small variations in the magnetic field or other potential shifts in
the mass line. The result is a peak that is flat-topped and steep-
sided (Fig. 8). A metric of the peak shape is the mass resolving
power (MRP), which is also a metric of the ability to resolve
adjacent masses. MRP is defined based on the nominal mass, M,
at which the measurement is made, and the resolvable difference in
mass, ΔM, between two adjacent species:
$$\mathrm{MRP} = \frac{M}{\Delta M}. \tag{1}$$
Because of the proportional nature of this metric, the measured
MRP of the mass spectrometer is effectively applicable across all
masses. It is important to note, however, that the resolvable differ-
ence in mass increases with mass. The CAMECA NanoSIMS soft-
ware uses the steepness of the side slopes of the mass peaks as a
measure of the mass resolving power of the secondary mass
spectrometer.
$$\mathrm{MRP} \cong \frac{R}{4 \times L_{90}}, \tag{2}$$
where R is the effective radial distance of the detector position and
L90 is the average lateral distance between the 10% and 90% height
of the peak side slope. This estimate of mass resolving power is ~1.5
times higher than the effective MRP based on the standard defini-
tion of MRP, and in our publications, we report the MRP of our
analyses based on this correction (Fig. 8). Regardless of the MRP
value reported, it is essential to be aware of all potential interfer-
ences and ensure that their contribution to the measured mass line
is negligible. Simply observing that a peak top looks flat on a
standard is not sufficient to be sure there is not a significant unre-
solved interference. Blanks and control samples are important for
checking for interferences, as are software programs that can calcu-
late potential interferences.
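
As a rough illustration of Eq. 1 and of checking for potential interferences, the sketch below estimates the MRP needed to separate two species at the same nominal mass. The isotope masses are rounded values from standard atomic mass tables, the electron mass is ignored because it cancels in the difference, and the factor of 1.5 between the CAMECA software estimate and the standard definition is the approximate value given above. Function and variable names are hypothetical.

```python
# Isotope masses in unified atomic mass units, rounded from standard tables.
MASS = {
    "1H": 1.007825,
    "11B": 11.009305,
    "12C": 12.000000,
    "13C": 13.003355,
    "14N": 14.003074,
    "15N": 15.000109,
    "16O": 15.994915,
}

CAMECA_TO_STANDARD = 1.5  # approximate ratio between the two MRP estimates


def ion_mass(isotopes):
    """Sum the isotope masses of a molecular ion."""
    return sum(MASS[iso] for iso in isotopes)


def required_mrp(ion_a, ion_b):
    """MRP = M / dM (Eq. 1), with M taken as the shared nominal mass."""
    m_a, m_b = ion_mass(ion_a), ion_mass(ion_b)
    nominal = round((m_a + m_b) / 2)
    return nominal / abs(m_a - m_b)


# Two interference pairs at mass 27.
mrp_12c15n = required_mrp(["12C", "15N"], ["11B", "16O"])
mrp_13c14n = required_mrp(["13C", "14N"], ["11B", "16O"])

for label, mrp in (("12C15N vs 11B16O", mrp_12c15n), ("13C14N vs 11B16O", mrp_13c14n)):
    print(f"{label}: MRP ~{mrp:.0f} (~{mrp * CAMECA_TO_STANDARD:.0f} on the CAMECA scale)")
```

A calculation of this kind only indicates the minimum separation required; blanks and control samples remain necessary to confirm that an interference is actually absent.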
Peak shape is an integrated function of everything from the
primary beam location and size to the gain on the detector. A
tightly focused primary beam reduces the abundance of off-axis

ions, which cause angular aberration. A well-centered primary
beam relative to the secondary ion collection lenses minimizes
potential distortion. The secondary ion beam should be aligned
relative to all the lenses, slits, and apertures in the secondary mass
spectrometer to maximize transmission and minimize distortion.
The entrance slit width is selected based on the target MRP, an
aperture slit (similar to a field aperture for the CAMECA ims series)
is used to reduce angular aberration, and an energy slit is used to
reduce chromatic aberration, along with other tuning. The detec-
tor gain and threshold must be set to exclude noise and register
>90% of the incident ions. In our experience, dimers (e.g., 12C12C
or 12C13C) result in higher gain than monomers, and the detector
voltages must be adjusted accordingly. Incorrect detector settings
or a failing detector can result in sloped peak tops. It is also
important to set detector deflector settings so that the ions strike
a region of the detector first dynode with a flat response to scan-
ning, to achieve a flat top peak. Finally, for a NanoSIMS 50 or 50L,
it is important to keep sustained count rates below
~300,000 counts/s to prevent premature aging of the electron
multipliers. Sustained high count rates can result in dead spots on
the detector first dynode and overall loss of sensitivity from carbon
deposition on the other dynodes.

3.4.2 Cs+ Analysis for Electronegative Elements and Isotope Ratios

The vast majority of systems biology studies requiring nanoSIMS
analysis are focused on electronegative elements such as H, C, O,
N, P, and S [39]. All of these elements (and their corresponding
isotopes) are analyzed with a Cs+ primary beam. Of these, combined
C and N isotope measurements are the most common and
stringent analyses at the low end of the periodic table; we discuss
their analysis in detail below.
For both carbon-13 and nitrogen-15, higher sensitivity is
achieved using a Cs+ primary beam and extracting negative secondary
ions. The rare and major isotopes are both mapped in the
sample, and the ratio of the two reveals the distribution of the
incorporated label in the sample (Figs. 1, 2, 6, and 9). Nitrogen is
typically detected as the molecular ion CN− because of the poor
yield of N− and N+ [145, 146]. Carbon isotopes can be measured
using the monomers (C−), the hydrides (CH−), the dimers (C2−),
or the CN− species (mass resolving power requirements increase with
mass). The CN− species typically have the highest ion count rate in
biological samples, but because ~12,000 MRP (~18,000 based on
the CAMECA software) is required to resolve 13C14N− from
11B16O− at mass 27, these species are typically only used when
the highest surface sensitivity is required [147].
We have found that the C2− dimers measured at mass 24 and
25 are more compatible with the CN− species (e.g., 12C2−,
13C12C−, 12C14N−, 12C15N−) because of similar secondary ion

Fig. 9 NanoSIMS images of a filamentous cyanobacterium, Anabaena sp. SSM-00 (larger cells) infected by an
epibiont (Rhizobium sp. WH2K) that attaches to the Anabaena heterocyst, the site of N fixation. (a) and (b) are
replicate filaments from the same culture, illustrating that cell-to-cell variation in isotopic enrichment may be
extremely large, even while relative enrichment patterns remain consistent (Image by Jennifer Pett-Ridge, in
collaboration with A. Spormann & W.O. Ng, Stanford University)

focusing (Fig. 10). Simply put, the maximum transmission for the
carbon dimers is better aligned with the maximum transmission for
CN− than the carbon monomers are. Physically, this means that the
optimal focusing voltage for the lens used to focus the secondary
ion beam in the entrance slit to the mass spectrometer is more
similar for C2− and CN− than for C− and CN−. Because the ions
are all detected simultaneously, only a single E0S focusing voltage
can be used, and therefore if C− and CN− are measured, the E0S
focusing voltage has to be compromised for one or both sets of
species. This compromise not only results in a loss in transmission,
but it may result in lower reproducibility of isotope ratio measurements.
Maintaining optimal focus at the entrance slit is important
to isotope ratio measurement reproducibility. The difference in E0S
focusing voltage for these species is likely due to the differences in
energy spectra resulting from C2− and CN− primarily coming from
molecule decomposition during flight, while C− is generated at the
sample [148]. We have observed that the offset between C− and
CN− varies, but we have not succeeded in making this offset

[Fig. 10 plot: ion counts per second (0 to 350,000) versus secondary ion beam focus (V; -7,100 to -6,900) for 12C14N−, 12C2−, and 12C−]
Fig. 10 Scan of the secondary ion beam focus voltage for lens E0S, showing the
relative change in detected counts. The maximum transmission for 12C14N− and
12C2− coincides here, whereas the maximum transmission for 12C− is offset.
While the 12C14N− and 12C2− scans are not always this well aligned, C− is
typically offset, resulting in reduced transmission for either C− or CN− if the two
are detected simultaneously. The difference in count rate among these species
varies from sample to sample, but in biological samples, CN− typically has a
higher count rate, and C− and C2− are similar

acceptably small. We have also observed that there is often a measurable
offset between C2− and CN−, but it is relatively small
(<10 V; Fig. 10).
The 15N/14N ratio can be directly calculated from the ratio of
the CN− ions (12C15N−/12C14N−). The 13C/12C ratio, however,
equals 12C13C−/(2 × 12C2−) based on:

$$\left(\sum_{i=12}^{13}\left[{}^{i}\mathrm{C}\right]\right)^{2} = \left[{}^{12}\mathrm{C}\right]^{2} + \left[{}^{13}\mathrm{C}\right]^{2} + 2\left[{}^{12}\mathrm{C}\right]\left[{}^{13}\mathrm{C}\right], \tag{3}$$

where [iC] is the relative abundance of the respective isotopes, and
the individual terms on the right hand side of the equation are the
expected relative abundances for the respective combinations of
species [149].
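
The two ratio calculations above can be written out as a short sketch. The factor of 2 follows from Eq. 3, because the mixed dimer can form in two ways; the function names and the count values are hypothetical and serve only to show the arithmetic.

```python
def carbon_ratio_from_dimers(counts_13c12c, counts_12c2):
    """13C/12C from dimer counts: R = 13C12C / (2 x 12C2), following Eq. 3."""
    return counts_13c12c / (2.0 * counts_12c2)


def nitrogen_ratio_from_cn(counts_12c15n, counts_12c14n):
    """15N/14N taken directly from the CN ion counts."""
    return counts_12c15n / counts_12c14n


# Hypothetical summed counts for one region of interest.
r_carbon = carbon_ratio_from_dimers(counts_13c12c=2_200, counts_12c2=100_000)
r_nitrogen = nitrogen_ratio_from_cn(counts_12c15n=370, counts_12c14n=100_000)

print(f"13C/12C = {r_carbon:.4f}, 15N/14N = {r_nitrogen:.4f}")
```

Omitting the factor of 2 would double the apparent 13C/12C ratio, so it is worth confirming which convention a given software package applies.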
Typical analytical conditions for nanoSIMS are: a ~2 pA Cs+
primary beam focused to a nominal spot size of ~100 nm, a
256 × 256 pixel raster over a 10 × 10 μm² area, a dwell time of
1 ms/pixel, the secondary mass spectrometer tuned for 5–7 secondary
ions (e.g., 12C2−, 13C12C−, 12C14N−, 12C15N−, and 31P−)
detected on electron multipliers in simultaneous collection mode,
~6500 MRP (~10,000 MRP based on the CAMECA software; see
above) to resolve isobaric interferences (e.g., 13C12C− vs. 12C2H−
at mass 25; 13C2− vs. 12C14N− at mass 26; Fig. 8;
11B16O− vs. 12C15N− at mass 27), and data collection for 10–20
serial quantitative secondary ion images (i.e., layers). For larger
areas, the analysis time must be increased in proportion to the

[Fig. 11 panels: top, 98Mo/12C ion ratio map (color scale 0 to 0.02; scale bar 5 µm; heterocysts marked); bottom, 98Mo/C (0 to 0.016) versus cell number (0 to 60)]
Fig. 11 Molybdenum distribution in Anabaena oscillarioides. Filaments were fixed in glutaraldehyde and
sputtered with an O− beam to a depth of 1 μm on a Si planchette (wafer). Data for multiple Mo isotopes were
collected to assess for isobaric interferences. Top: ion ratio map of 98Mo normalized to 12C for quantification.
A thin white line outlines each individual cell. Grey triangles indicate heterocyst cells. Bottom: data
summary for two replicate filaments. Heterocyst cells are consistently enriched in Mo, a critical nitrogenase
co-factor, suggesting active N-fixation. Mean Mo concentrations, estimated based on published relative
sensitivity factors [117], are 64 (4) μg/g in heterocysts (n = 5) and 18 (0.9) μg/g in vegetative cells
(n = 46) (Image by Jennifer Pett-Ridge)

area. Hundreds of cells may need to be analyzed in order to account
for natural variability in metabolism from one cell to another
(Figs. 11 and 12).
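
The acquisition time implied by the typical conditions above can be estimated with a few lines of arithmetic. This sketch counts only pixel dwell time and ignores pre-sputtering, tuning, and stage movement; the function names are hypothetical.

```python
def frame_time_s(pixels_per_side, dwell_ms):
    """Time to raster one square image plane, in seconds."""
    return pixels_per_side ** 2 * dwell_ms / 1000.0


def total_time_min(pixels_per_side, dwell_ms, layers):
    """Dwell time for a stack of serial image planes, in minutes."""
    return frame_time_s(pixels_per_side, dwell_ms) * layers / 60.0


def scale_for_area(time_min, base_area_um2, new_area_um2):
    """Scale analysis time in proportion to the raster area."""
    return time_min * new_area_um2 / base_area_um2


per_layer_s = frame_time_s(256, 1.0)         # one 256 x 256 plane at 1 ms/pixel
short_run_min = total_time_min(256, 1.0, 10)  # 10 layers
long_run_min = total_time_min(256, 1.0, 20)   # 20 layers
large_area_min = scale_for_area(long_run_min, 100.0, 400.0)  # 10 x 10 -> 20 x 20 um^2

print(per_layer_s, short_run_min, long_run_min, large_area_min)
```

With several hundred cells to analyze, estimates of this kind help in deciding how many fields of view fit into an instrument session.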
When possible, biological samples should be sputtered to a
depth of ~60 nm before data collection to achieve sputtering
equilibrium. The depth of analysis during a measurement is typi-
cally between 50 and 200 nm; however, whole cells may be con-
sumed to acquire sufficient counts for high precision analyses, to
average over the entire cell, or to generate a cell depth profile
(Fig. 13). The sputter rate for biological materials with a Cs+
primary beam (16 kV, normal incidence) is 1–2 nm·μm²/(pA·s) at
equilibrium [94, 150]. With a 2 pA Cs+ analysis beam and a
1 × 1 μm² raster, a 1 μm cell can be consumed in a few minutes.
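
The estimate that a 1 μm cell is consumed in a few minutes can be checked with the sketch below. It assumes the quoted sputter rate is normalized to beam current and raster area, so that the depth rate is the sputter rate multiplied by current and divided by area; that reading of the units, and the function names, are assumptions.

```python
def depth_rate_nm_per_s(sputter_rate, current_pa, raster_area_um2):
    """Depth removed per second; sputter_rate is in nm*um^2/(pA*s) (assumed units)."""
    return sputter_rate * current_pa / raster_area_um2


def time_to_depth_s(depth_nm, sputter_rate, current_pa, raster_area_um2):
    """Seconds needed to sputter to a given depth at equilibrium."""
    return depth_nm / depth_rate_nm_per_s(sputter_rate, current_pa, raster_area_um2)


# A 1 um (1000 nm) cell, 2 pA beam, 1 x 1 um^2 raster, for the quoted rate range.
fastest_s = time_to_depth_s(1000.0, sputter_rate=2.0, current_pa=2.0, raster_area_um2=1.0)
slowest_s = time_to_depth_s(1000.0, sputter_rate=1.0, current_pa=2.0, raster_area_um2=1.0)

print(f"{fastest_s / 60:.1f} to {slowest_s / 60:.1f} minutes")
```

The same functions can be used to choose a pre-sputtering time for a target depth on a larger raster, where the depth rate falls in proportion to the area.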
In addition to C and N, the distribution of electronegative
elements (e.g., H, O, S, and P) or highly abundant electropositive
elements (e.g., Fe as FeO−) can be imaged during stable isotope

Fig. 12 Representative nanoSIP images demonstrating high-throughput metabolic screening of cells filtered
from Pacifica, California seawater incubated with 13C-bicarbonate and 15N-amino acids for 6 days. 14N12C−
ion counts reflect all carbon- and nitrogen-containing particles, 13C atom percent indicates cells enriched in
13C, and 15N atom percent indicates cells enriched in 15N. The same four cells are indicated with arrows in
each panel, with letters in the first panel indicating putative metabolism: I (no enrichment; inactive cell), C1
(enrichment in only 13C; chemoautotroph), H (enrichment in only 15N; heterotroph), and C2 (enrichment in 13C,
minor enrichment in 15N; chemoautotroph). Scale bar is 11 μm (Dekas et al. Frontiers in Microbiology 2019)

analyses [87]. These can also include labeling elements such as F, I,
or Au (see Subheading 3.6, “immunolabeling”) [22, 122]. In some
cases, magnetic peak switching may be necessary to image the
distribution of all elements of interest; at LLNL, we have success-
fully analyzed up to 20 elements in a single analysis of bacterial
spores. Samples can be imaged simultaneously by secondary elec-
trons with negative secondary ions.

3.4.3 NanoSIMS Trace Element Analysis

Trace element analysis in biological samples is often used to determine
the concentration and distribution of metal cofactors and
labels. With the invention of the Hyperion II RF inductively cou-
pled plasma ion source, trace metal analysis with nanoSIMS has
become significantly easier and more attractive. The method of
analysis is similar to the stable isotope analysis method outlined
above, except that typically the trace elements of interest are metals,
which are imaged with higher sensitivity as positive secondary ions
with an O− primary beam [117]; elements such as Na, K, Al, Mg,
and Ca ionize extremely well in this mode. Whether metals such as

Fig. 13 Comparison of nanoSIMS-based characterization of sectioned versus whole Bacillus thuringiensis (Bti)
spores. (a) TEM image of a sectioned Bti spore showing its layered architecture and overall dimensions. Scale
bar is 200 nm. (b) Lateral profile across the surface of a sectioned Bti spore showing the distribution of 12C,
31P, and 35Cl. The dashed lines identify the core region based on the 31P profile. The whole spore is defined
based on the 12C profile and identified by solid lines. Profile: length 1200 nm; width 200 nm. (c) Model
representation of a sectioned spore with the highlighted rectangular region representing the location of profile
data. (d–f) NanoSIMS secondary ion images showing the distribution of 12C, 31P, and 35Cl across the sectioned
spore surface. Scale bar is 200 nm. (g) SEM image of a whole Bti spore. Scale bar is 200 nm. (h) Depth profile
of whole spore showing the distribution of 12C, 31P, and 35Cl as a function of depth in the spore. (i) Model
representation of a whole spore with the highlighted column representing the location of the profile data.
Profile diameter is 200 nm. (j–l) NanoSIMS secondary ion images showing the spatial distribution of 12C, 31P,
and 35Cl in the spore. Scale bar is 500 nm. Both profiles were acquired with the Cs+ primary ion beam
(Reprinted with permission from: Ghosal et al. Analytical Chemistry 2008. Copyright 2008 American Chemical
Society)

Mn, Fe, Cu, Mo, Cr, V, and Ni (and in the right circumstances, Zn
and As) can be detected in a given system with subcellular resolu-
tion depends on their concentration in the sample and relative
sensitivity factor (a.k.a., relative useful yield; see Subheading 3.5.2
and [117]). At LLNL, we have imaged a range of trace elements in
cells, including Mo (as a proxy for nitrogenase; Fig. 11), Mg, Si, P,
Mn, Fe, Cu, Zn, and As [47, 48, 104, 151]. The highest spatial
resolution achieved with the Hyperion II on a CAMECA NanoSIMS
in this mode is ~50 nm with a ~0.5 pA O− primary beam
[47, 113]. For very-low-concentration elements (ppb to low
ppm), a >100 pA primary beam is necessary to acquire enough

counts for imaging, with spatial resolution >250 nm (Fig. 4). The
sputter rate for biological materials with an O− primary beam is
~0.2 nm·μm²/(pA·s) [150]. For many metals, low-ppm-level cellular
concentrations can be imaged, but great care must be taken to
ensure detectors only collect the isotope or element of interest, as
opposed to isobaric interferences.

3.4.4 Standards and Controls

Standards and controls have distinct but related roles that are
important to obtaining reliable results. Standards are used to
check instrument operation, quantify absolute composition, and
provide a reference for experiments. For high-precision isotope
measurements or trace element measurements, at least two
matrix-matched standards with distinct known compositions are
necessary to ensure accurate and meaningful results
[47, 152]. Experimental controls are used to test for experimental
artifacts and the statistical significance of treatments.
Standards are not readily available for biological SIMS because
certified biological samples have not been appropriate. As a result,
standards typically need to be produced and characterized “in-
house” or borrowed from other laboratories. In cases of large
effects relative to analytical uncertainty, no-treatment controls can
sometimes take the place of standards. For elemental analyses without
concentration standards, it may be necessary for the measured
ratios of interest to be on the order of 10× higher than background
to be confident the effects are real [48, 139]. Furthermore, without
standards, correct instrument operation is hard to verify. One stop-
gap option is to always analyze the same sample at every session,
even if the absolute composition is uncertain or it is not relevant to
the biological sample (e.g., NBS610) [48, 139].
For C and N isotopic measurements, we at LLNL originally
used a well-characterized Bacillus subtilis spore preparation as a
reference standard [23]. Measurement precision, σ(internal), for
this standard is 0.4–1.4% (2σ for individual 13C/12C and 15N/14N
measurements), and replicate analyses yielded an analytical precision,
σ(std), of 2.1% (2σ for an individual measurement) (Fig. 8).
More recently, we use an in-house characterized culture of Pseudo-
monas stutzeri deposited on a Si wafer because these cells provide a
better matrix match for our typical experiments.
For high spatial resolution elemental analyses of biological
samples, absolute concentration standards are more difficult to
establish for multiple reasons. First, concentrations are typically
low and therefore prone to contamination. Second, elemental con-
centrations can vary spatially, making it difficult to relate high-
resolution analyses with bulk composition. Third, the composition
of the elemental concentration standards needs to closely match the
unknowns. Beyond these constraints, it is ideal to have multiple
concentrations in the relevant range to establish a calibration curve
to control for potential isobaric interferences.

The combination of making a homogeneous concentration standard
and matching the composition of the unknown is typically the
hardest problem. Concentration standards should be composition-
ally equivalent to the unknowns because matrix and composition
effects are well known to affect relative ion yields [117, 153–
155]. Recently, Ackerman et al. used homogenized fish tissue
mixed with dilute copper solutions to make multiple concentration
standards (Fig. 4) [47]. Repeated analyses of the material correlated
well with bulk concentration data.
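
A calibration curve built from several concentration standards can be sketched as an ordinary least-squares line relating the measured ion ratio to the bulk concentration. The numbers below are invented purely to show the calculation and are not data from any of the cited studies; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration_from_ratio(ratio, slope, intercept):
    """Invert the calibration line to estimate concentration in an unknown."""
    return (ratio - intercept) / slope


# Invented standards: bulk concentration (ug/g) and measured element/12C ion ratio.
bulk_ug_per_g = [0.0, 10.0, 50.0, 100.0]
ion_ratio = [0.0002, 0.0012, 0.0052, 0.0102]

slope, intercept = fit_line(bulk_ug_per_g, ion_ratio)
unknown = concentration_from_ratio(0.0032, slope, intercept)
print(slope, intercept, unknown)
```

A clearly nonzero intercept in such a fit is one hint that background or an unresolved isobaric interference contributes to the measured signal.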
In cases where biological standards are not available, the NIST
glass standard NBS610 is useful for mass alignment of metallic
elements and for detector gain control, but not quantification in
biological samples. NIST also produces trace element standards for
biological materials, but these are large, heterogeneous particle
samples designed for bulk analysis and are challenging for SIMS.
Reference samples normally have to be made and characterized by
the interested lab. A good but expensive alternative for elemental
quantification is to have the element of interest implanted in epoxy
or another surrogate biological material. The ion implant is then
analyzed by depth profiling and integrating over the ions collected
from the implanted species [117].

3.5 Data Processing and Image Analysis

NanoSIMS researchers have developed multiple programs that
allow nanoSIMS ion images to be displayed and processed to
extract the quantitative data (see Subheading 2.5). Data processing
should include corrections for detector dead-time and image shift
and should enable regions of interest (ROIs) to be defined. The
isotopic composition for each ROI is calculated by averaging over
all of the replicate scans. ROI definition algorithms can be used to
identify cells, partition images into uniform subregions, or define
threshold cutoffs for extracting data automatically. Notably,
Arandia-Gorostidi et al. and Dekas et al. both used auto-identification
to select many hundreds of putative cells in their analyses
[41, 109], far more than in many early nanoSIMS studies.

3.5.1 Quantifying and Reporting Isotopic Data

As discussed in Subheading 3.4.4, standards are a critical part of
ensuring good instrument performance and accurate data. For
isotopic ratios, standards should be used to calculate instrumental
mass fractionation (IMF), which can be expressed as:
$$\mathrm{IMF} = \frac{R_{\text{STD-meas}}}{R_{\text{STD-true}}}, \tag{4}$$
where RSTD-meas and RSTD-true are the measured and true
isotopic ratios for the standard, respectively. There is cause for
concern if the IMF differs from 1 by more than a few percent.
Considering the precision of nanoSIMS (>0.1%), the true isotopic
ratio in the unknown, RUNK-est, can be estimated from the ratio
measured for the unknown, RUNK-meas, and the IMF using a gain
correction:

$$R_{\text{UNK-est}} = \frac{R_{\text{UNK-meas}}}{\mathrm{IMF}}. \tag{5}$$
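
The IMF correction in Eqs. 4 and 5 can be sketched in a few lines of Python. The function names, the 5% warning threshold (one reading of "a few percent"), and the numerical ratios are hypothetical and serve only to make the arithmetic easy to follow.

```python
def instrumental_mass_fractionation(r_std_meas, r_std_true):
    """Eq. 4: ratio of measured to true isotope ratio for the standard."""
    return r_std_meas / r_std_true

def correct_unknown(r_unk_meas, imf):
    """Eq. 5: estimate the true ratio of the unknown by dividing by the IMF."""
    return r_unk_meas / imf

def imf_is_suspicious(imf, tolerance=0.05):
    """Flag an IMF that differs from 1 by more than `tolerance` (assumed 5%)."""
    return abs(imf - 1.0) > tolerance

# Hypothetical 13C/12C values for a standard and an unknown.
imf = instrumental_mass_fractionation(r_std_meas=0.0108, r_std_true=0.0110)
r_est = correct_unknown(r_unk_meas=0.0216, imf=imf)
print(f"IMF = {imf:.4f}, corrected ratio = {r_est:.4f}, "
      f"check tuning: {imf_is_suspicious(imf)}")
```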
The resulting isotopic data can be presented as ratios, delta
values, and atom percent excess (APE) (e.g., Fig. 6). For tracer
experiments, APE provides the clearest indication of the uptake of a
stable isotope tracer. APE is calculated based on the initial isotopic
ratio of the sample (or organism) at T = 0 (Ri) and the final isotopic
ratio in the sample, Rf [23]:
 
$$\mathrm{APE} = \left[\frac{R_f}{R_f + 1} - \frac{R_i}{R_i + 1}\right] \times 100\%. \tag{6}$$
Note that R is the ratio of the rare isotope to the abundant
isotope (e.g., 13C/12C) and that R/(R + 1) is the fraction, f, of the
rare isotope of element X, which can be written fX.
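
A minimal Python sketch of Eq. 6 follows, converting each ratio R to the rare-isotope fraction f = R/(R + 1) before taking the difference. The function names and the example ratios are illustrative assumptions.

```python
def rare_fraction(r):
    """Convert a rare/abundant isotope ratio R into the rare-isotope fraction f."""
    return r / (r + 1.0)

def atom_percent_excess(r_final, r_initial):
    """Eq. 6: APE in percent from the final and initial (T = 0) ratios."""
    return (rare_fraction(r_final) - rare_fraction(r_initial)) * 100.0

# Hypothetical 13C/12C ratios: near natural abundance at T = 0, enriched afterwards.
print(f"APE = {atom_percent_excess(r_final=0.25, r_initial=0.011):.2f} atom %")
```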
Data can also be presented as net incorporation of the labeled
element in the substrate if its isotopic composition and amount are
well constrained and it is uniformly available to the sampled organ-
isms. In Popa et al., we defined the term Fxnet as the net incorpora-
tion of an element (e.g., net carbon incorporation is “FCnet”)
[25]. Assuming a two-isotope system, we derived Fxnet based on a
two-component mixing model that accounts for the minor (Eq. 7)
and major isotopes (Eq. 8) of element X incorporated from the
initial biomass and the spiked pool:
$$f_{xf} = F_i \times f_{xi} + F_s \times f_{xs} \tag{7}$$
and
$$\left[1 - f_{xf}\right] = F_i \times \left[1 - f_{xi}\right] + F_s \times \left[1 - f_{xs}\right], \tag{8}$$

where Fi is the fraction of the labeled element that was initially in
the sampled organism and Fs is the fraction of the labeled element
that was taken up from the spiked pool. In Popa et al., we originally
derived Fxnet as a function of initial biomass, solving Eqs. 7 and
8 for Fs/Fi, yielding

$$F_{x\mathrm{net}} = \frac{F_s}{F_i} = \frac{R_f \left(1 - f_{xi}\right) - f_{xi}}{f_{xs} - R_f \left(1 - f_{xs}\right)}, \tag{9}$$
where fxs is the fraction of the rare isotope in the spiked pool, the
equation is corrected relative to the original (“1” in the denomi-
nator), and Fxnet is expressed as a fraction [25]. While technically
this equation applies only to the labeled element, it can be used to
estimate change in biomass assuming no change in stoichiometry.
We note that it is not necessary to quantify biomass to use this
equation.
To simplify interpretation, we here define a new parameter,
Xnet, which is net incorporation of an element as a function of
the total final pool of that element [41]:
$$X_{\mathrm{net}} = \frac{F_s}{F_s + F_i} = \frac{F_{x\mathrm{net}}}{F_{x\mathrm{net}} + 1} = \frac{f_{xf} - f_{xi}}{f_{xs} - f_{xi}} \tag{10}$$

To simplify the calculation, we add the classical formulation
after Hayes [185], expressed here to calculate net incorporation of an
element [186]. Both calculations yield the exact same value for
Xnet. As an example of applying Xnet, net incorporation of carbon
relative to total final carbon content is notated as Cnet.
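
The equivalence of the two routes to Xnet in Eq. 10 can be checked numerically. The sketch below implements Eq. 9 and the final expression of Eq. 10; the function names and the isotope fractions (natural-abundance biomass, a 99 atom % 13C substrate, and a final fraction of 0.20) are hypothetical.

```python
def fxnet(r_final, f_initial, f_spike):
    """Eq. 9: net incorporation relative to the initial pool (Fs/Fi)."""
    numerator = r_final * (1.0 - f_initial) - f_initial
    denominator = f_spike - r_final * (1.0 - f_spike)
    return numerator / denominator

def xnet(f_final, f_initial, f_spike):
    """Eq. 10: net incorporation relative to the total final pool, Fs/(Fs + Fi)."""
    return (f_final - f_initial) / (f_spike - f_initial)

# Hypothetical carbon example.
f_i, f_s, f_f = 0.011, 0.99, 0.20
r_f = f_f / (1.0 - f_f)                  # convert the final fraction back to a ratio
fx = fxnet(r_f, f_i, f_s)
cnet_direct = xnet(f_f, f_i, f_s)
cnet_via_fx = fx / (fx + 1.0)            # middle expression of Eq. 10
print(f"FCnet = {fx:.4f}, Cnet = {cnet_direct:.4f} (via FCnet: {cnet_via_fx:.4f})")
```

Both values of Cnet agree to within floating-point error, which is a convenient sanity check when implementing either form.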

3.5.2 Quantifying and Reporting Elemental Data

For biological samples, relative and absolute elemental concentrations
are typically determined based on the relative ion count rates
for the element of interest, X, compared to a uniformly distributed
major element, typically C in most biological samples. This
approach may not be valid if the element of interest is in a structure
that is low in C relative to the average matrix concentration (e.g., if
metal is sequestered in a vacuole). In rare cases, implantation of a
reference ion has been used to enable direct quantification in
biological samples [156]. To the extent SIMS is used to quantify
trace elements in biological samples, researchers tend to use matrix-
matched elemental standards.
If a matrix-matched standard for element X is available, the
concentrations of element X, [X]UNK, can readily be determined
based on proportionality using a parameter known as the relative
useful yield (RUY) [157]. This approach works because SIMS
typically yields a linear change in relative ion count rates as the
concentration of that species increases in the sample. Ideally, line-
arity is demonstrated in the relevant range using a set of standards.
Resolving isobaric interferences is an important aspect of getting a
reliable, linear response. The ion yield for the element of interest is
normalized to a reference ion. The RUY is defined as the ratio of the
concentrations of element X and the reference element—here C—
to the corresponding ion ratio measured for a standard:
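$$\mathrm{RUY}_{X:C} = \frac{[X]_{\mathrm{STD}}}{[C]_{\mathrm{STD}}} \times \left(\frac{X^{+}}{C^{+}}\right)_{\mathrm{STD}}^{-1} \tag{11}$$

where [X]STD and [C]STD are the concentrations in the standard of
element X and carbon, respectively, and (X+/C+)STD is the measured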
ion ratio, here shown as positive ions. Note that the concentrations
can be in any units, and the ion ratio can be for the measured
species (e.g., 56Fe+ and 12C+) or it can be corrected for the isotope
abundances, as long as these choices and the measured species are
consistent for the standard and the unknowns. The RUY is then
used to calculate the concentration of element X in the unknown
using:

$$[X]_{\mathrm{UNK}} = \left(\frac{X^{+}}{C^{+}}\right)_{\mathrm{UNK}} \times [C]_{\mathrm{UNK}} \times \mathrm{RUY} \tag{12}$$

Note that ideally [C]UNK = [C]STD, or else [C]UNK needs to be
determined by an independent method. In some publications, the
RUY is defined as the inverse, with the appropriate change in
Eq. 12.
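
A short Python sketch of Eqs. 11 and 12 is given here. The function names, the concentrations, and the ion ratios are hypothetical, and the unknown is assumed to have the same carbon content as the standard.

```python
def relative_useful_yield(conc_x_std, conc_c_std, ion_ratio_std):
    """Eq. 11: concentration ratio in the standard divided by its measured ion ratio."""
    return (conc_x_std / conc_c_std) / ion_ratio_std

def concentration_in_unknown(ion_ratio_unk, conc_c_unk, ruy):
    """Eq. 12: [X]UNK from the unknown's ion ratio, its carbon content, and the RUY."""
    return ion_ratio_unk * conc_c_unk * ruy

# Hypothetical matrix-matched standard: 50 ug/g of X in a matrix with 500,000 ug/g C.
ruy = relative_useful_yield(conc_x_std=50.0, conc_c_std=5.0e5, ion_ratio_std=2.0e-4)
# Unknown assumed to have the same carbon content as the standard.
x_unk = concentration_in_unknown(ion_ratio_unk=6.0e-4, conc_c_unk=5.0e5, ruy=ruy)
print(f"RUY = {ruy:.3f}, [X]UNK = {x_unk:.1f} ug/g")
```

Any concentration units can be used, provided the same units and the same measured species are used for the standard and the unknown.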

Relative sensitivity factor (RSF) is a related parameter used in
the semiconductor industry [117] that is generally not applicable as
defined, but which can be used to estimate the RUY:
$$\mathrm{RUY}_{X:\mathrm{ref}} \approx \frac{\mathrm{RSF}_{x}}{\mathrm{RSF}_{\mathrm{ref}}} \tag{13}$$
where RSFx and RSFref are for the element of interest, X, and the
reference ion, which was C above. We have used this approach and
obtained reasonable estimates of copper in Chlamydomonas reinhardtii
cells using calcium as the reference ion [48].
In the absence of a standard for absolute quantification, ele-
mental data are typically reported as ion ratios, which is indicated by
maintaining the charge symbol (e.g., 63Cu+/12C+). The mass
superscripts are removed if the ratio is corrected for isotopic
abundances.

3.5.3 Measurement Precision

Measurement precision should be determined based on replicate
measurements of the ratio of interest during the analysis by calculating
the standard error of the mean (SE). This statistic can be
compared to Poisson statistics error for a ratio, σratio, calculated
from Gaussian error propagation:
$$\sigma_{\mathrm{ratio}} = R \times \left[\left(\frac{X_{\mathrm{numerator}}^{0.5}}{X_{\mathrm{numerator}}}\right)^{2} + \left(\frac{X_{\mathrm{denominator}}^{0.5}}{X_{\mathrm{denominator}}}\right)^{2}\right]^{0.5}. \tag{14}$$

where R is the calculated ratio and X is the number of ion counts
for the numerator and denominator, respectively, which would
typically be the minor and major isotopes, respectively. Because
this calculation is based on a sum of squares, the error for the
minor isotope will dominate σratio if Xminor ≪ Xmajor (e.g.,
13C vs. 12C) and σratio can be estimated directly from Xminor and R:
$$\sigma_{\mathrm{ratio}} \approx R \times \frac{X_{\mathrm{minor}}^{0.5}}{X_{\mathrm{minor}}}. \tag{15}$$
σratio should be compared to the standard error (SE) for replicate
measurements of the ratio in the sample. If the measured SE is
significantly worse (>2 σ), then there is potential for improving the
precision of the measurement based on tuning, sample flatness, or
other factors. In practice, the precision of isotope ratio measure-
ments by ion counting is no better than ~1 permil under the best
conditions.
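
The comparison between counting statistics and the measured SE can be sketched as follows. The function names and ion counts are hypothetical, and reading ">2 σ" as "more than twice the Poisson estimate" is an assumption of this example.

```python
from math import sqrt

def sigma_ratio(r, x_numerator, x_denominator):
    """Eq. 14: Poisson counting error of a ratio from Gaussian error propagation."""
    rel_num = sqrt(x_numerator) / x_numerator
    rel_den = sqrt(x_denominator) / x_denominator
    return r * sqrt(rel_num ** 2 + rel_den ** 2)

def sigma_ratio_minor(r, x_minor):
    """Eq. 15: approximation when the minor-isotope counts dominate the error."""
    return r * sqrt(x_minor) / x_minor

def exceeds_counting_limit(se_measured, sigma, factor=2.0):
    """True if the measured SE is worse than `factor` times the Poisson estimate."""
    return se_measured > factor * sigma

# Hypothetical counts for one ROI: 1e4 minor-isotope and 1e6 major-isotope ions.
x_minor, x_major = 1.0e4, 1.0e6
r = x_minor / x_major
full = sigma_ratio(r, x_minor, x_major)
approx = sigma_ratio_minor(r, x_minor)
print(f"sigma (Eq. 14) = {full:.3e}, sigma (Eq. 15) = {approx:.3e}")
print("room to improve tuning or flatness:", exceeds_counting_limit(3.0e-4, full))
```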
In addition to considering these factors, measurement reproducibility
from sample to sample and even from subregion to subregion
within an image has to be included in the measurement
precision when two measurements are being compared, even within
the same image. For example, two cells within an image can only be
considered statistically different if the difference between the two
measurements is greater than the variability of measurements on
comparable samples. The potential exists for measured isotopic
ratios to vary across a nanoSIMS image for an isotopically homoge-
neous sample because of sample and tuning problems. This error
can formally be incorporated into the measurement precision, by
summing measurement error and the location to location variability
in quadrature:
$$\mathrm{SE} = \left[\mathrm{SE}_{\mathrm{meas}}^{2} + \mathrm{SD}_{\mathrm{tests}}^{2}\right]^{1/2} \tag{16}$$
where SDtests is the standard deviation of test measurements for
location–to–location variability. The summed errors must be
expressed in fractional units, such as permil. While this calculation
is simple, ensuring that all the sources of potential error are
included is not, and care should be taken when making inferences
from small differences in ratios, or large differences with large but
seemingly statistically significant precision estimates.
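
Eq. 16 is straightforward to implement once both terms are expressed in the same fractional units. In the sketch below, the helper that converts an absolute error to permil, the function names, and the example values are all assumptions for illustration.

```python
from math import sqrt

def to_permil(absolute_error, ratio):
    """Express an absolute error on a ratio as a fractional error in permil."""
    return 1000.0 * absolute_error / ratio

def combined_se(se_meas, sd_tests):
    """Eq. 16: sum two fractional errors (same units) in quadrature."""
    return sqrt(se_meas ** 2 + sd_tests ** 2)

# Hypothetical values: 3 permil measurement SE, 4 permil location-to-location SD.
print(f"combined SE = {combined_se(3.0, 4.0):.1f} permil")
```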
The error discussed so far is internal error, meaning that it only
accounts for the variability of a particular set of measurements. For
comparison to other measurements and absolute values, external
measurement error is estimated from standard measurements using
the sum in quadrature approach used above. Because of the poten-
tial for shifts in measured isotopic ratios relative to an absolute value
(i.e., IMF) for slightly different samples, caution must also be
exercised when using the external error estimates. With all of
these issues to consider, researchers typically focus on achieving
large relative isotopic enrichments in nanoSIP measurements.

3.6 Combination with Synergistic Techniques

Coupling nanoSIMS with other imaging or bulk characterization
methodologies provides an enormous opportunity to extend inferences
and understanding of a sample [158]. By combining
nanoSIMS analysis with approaches such as FISH, SEM, TEM, X-ray
microscopy, or immuno-methodologies, systems biologists can
explore the physiology of known and uncultured microorganisms
by simultaneously collecting functional, phylogenetic, and molecu-
lar information from individual cells or particles. While the list of
synergistic approaches discussed here is by no means exhaustive, the
following technologies have been used in combination with
nanoSIMS:
1. “Bulk analysis” (IRMS and ICP-MS): For many studies, it is
very useful to initially analyze a bulk sample mass by isotope
ratio mass spectrometry (IRMS) to ensure that some isotopic
enrichment occurred, and to determine average APE and
net-fixation values. To perform bulk IRMS analysis, small sam-
ples garnered from cultures or environmental samples may be
filtered onto pre-combusted glass fiber (GF/F) filters, dried,
and then analyzed. In our experience, absolute isotope
enrichment values of a cell concentrate measured via IRMS can
differ significantly from nanoSIMS analyses because of cell to
cell variability and surface contamination; close attention is
necessary to make quantitative comparisons [137]. Similarly, if
trace metal distribution is of interest, it is important to con-
strain the likely concentrations in individual cells or particles by
first analyzing an extract of whole cells or target molecules by
inductively coupled plasma mass spectrometry
(ICP-MS) [151].
2. Light microscopy: Light images are often very useful for naviga-
tion in the nanoSIMS CCD view, which also uses light micros-
copy. Images should be collected at multiple levels of
magnification to identify analysis targets and aid in locating
them. Post-analysis imaging can be used to confirm targets.
3. SEM: SEM imaging is a relatively fast screening tool and allows
pre-identification of particles of appropriate size and morphol-
ogy with higher resolution than light microscopy, particularly
for guiding and confirming analyses of small or complex targets
(e.g., hyphal and bacterial surfaces; filamentous vs. single cells,
amorphous vs. crystalline minerals). SEM images are also fre-
quently useful to guide post-SIMS analysis, after regions with
unique isotopic or molecular signatures have been identified. If
necessary, SEM-EDS mapping can additionally be used to
identify basic elemental distribution. Low-voltage imaging
(<5 kV) typically provides better surface characterization of
biological or soil samples. SEM images are readily correlated
to SIMS secondary electron images, although harder to corre-
late to nanoSIMS CCD images.
4. TEM, STEM, and analytical TEM [23, 48, 63, 159, 160]: EM
imaging is useful for identifying ultrastructure in thin and FIB
sections, but correlation with nanoSIMS is more challenging
than for light microscopy or SEM. Light micrographs are
typically needed to help find desired locations on transmission
electron micrographs.
5. Atomic force microscopy (AFM) [94, 161]: While it has only
rarely been used in combination with SIMS, AFM imaging
provides nanometer-scale topographic information and can be
performed in liquid under controlled conditions. A group in
Luxembourg has incorporated an AFM into a NanoSIMS 50 to
allow correlated height measurements without exposing the
sample to vacuum [162].
6. FISH, EL-FISH, and BONCAT: In 2008, several research
groups independently developed new approaches which combined
nanoSIMS analysis with in situ hybridization (EL-FISH
[22], SIMSISH [122], and HISH [20]); in each, a phylogenetic
probe is linked to a highly electronegative elemental label
(fluorine, iodine, gold, selenium, or bromine) instead of the
typical fluorophore. These approaches enable simultaneous
localization of the tag via fluorescence in situ hybridization
(FISH) [163, 164] or catalyzed reporter deposition-
fluorescence in situ hybridization (CARD-FISH) [165] and
chemical mapping in the nanoSIMS. These approaches can
help overcome problems with background autofluorescence
in FISH images because nanoSIMS is used to detect the ele-
mental tag linked to the oligonucleotide probe. The key to this
approach is to use highly electronegative elements, such as
halides, sulfur, selenium, tellurium, and noble metals, which
can be detected with very high sensitivity (1 in 20 atoms) in
concert with carbon and nitrogen isotopes (for functional
characterization). When choosing which elemental tag to
apply, care should be taken to ensure the natural background
of these elements is low in the sample (e.g., marine samples
often have high F background). To date, introducing multiple
probes simultaneously (with multiple elemental tags) has
proven difficult. It is often possible to simply correlate fluores-
cent features in FISH/CARD-FISH images with the isotope
ratios of the same locations in nanoSIMS images [34, 41, 54,
75, 137, 166]. It may be possible to use FISH-SIMS
approaches in embedded samples such as the work of Lemaire
et al. [167], where fixed samples were embedded in TissueTek®
and then cryosectioned and FISH labeled. We caution however
that the application of CARD-FISH may reduce original cell
enrichment by 60–80% for 13C and 30–60% for 15N [34, 136,
137]. Other molecular tagging methods (e.g., BONCAT) may
also be combined with nanoSIP studies, particularly for target-
ing active cells [92, 168].
7. Synchrotron imaging (e.g., STXM and NEXAFS) [87, 169–
171]: Spectroscopic techniques allow precise, quantitative
measurement of molecular and isotopic patterns in an undis-
turbed sample, at high resolution, and may be particularly
useful for imaging microbial populations in mineral matrices
such as soils and sediments. Scanning transmission X-ray
microscopy (STXM) can map organic C distribution, image
associations of organics with specific mineral types and has
been used to trace organic matter of differing origins into the
soil matrix [172, 173]. Research at LLNL shows that nano-
SIMS and STXM are quite synergistic, have similar resolution,
and together yield data on both molecular class and elemental
quantity; STXM data is based on transmission (integrates total
volume under the beam), while nanoSIMS can characterize
either surfaces or a 3D volume depending on the method of
preparation and analysis conditions. In this
approach, nanoSIMS analysis is preceded by synchrotron-based
X-ray imaging techniques such as scanning transmission X-ray
microscopy (STXM) and near edge X-ray absorption fine struc-
ture (NEXAFS) to determine mineral oxidation state or domi-
nant organic constituents. Sample specimens can be mounted
on silicon nitride (Si3N4) windows or standard TEM grids
without a chemical adhesive. Samples should be analyzed by
STXM, then coated with a thin conductive layer of gold or
iridium and imaged by SEM, and then by nanoSIMS.
8. Molecular and structural imaging (e.g., MALDI, Raman,
TOF-SIMS, X-ray tomography) [91, 174, 175]: Multiple imag-
ing techniques now have the capability to map a molecular
landscape with subcellular resolution [158]. While the majority
of these approaches do not have the spatial resolution of nano-
SIMS, the sample preparation requirements are similar enough
that a single preparation can often be imaged first for molecular
distribution, and later by nanoSIMS for elemental or isotope
distribution.
9. Antibody labeling or “immunolabeling” [176, 177]: Antibody-
labeled immuno-gold tags can also be used to target the locale
of specific proteins within a cell [131, 178]. Initial mapping
may be performed by TEM (Fig. 7) or SEM with a back scatter
detector [179] before nanoSIMS analysis.
10. Microarrays (Chip-SIP) [1, 79, 147, 180–182]: Microarrays,
while less commonly used than a decade ago, are very compati-
ble with nanoSIP studies and a creative means to measure the
isotope ratios of individual biomolecules (RNA, DNA, pep-
tides, proteins, sugars, lectins, etc.). They are typically printed
with microscopic spots of a biomolecule tethered to a surface
(often a glass slide). Our group uses Chip-SIP [1], a technique
where community RNA (extracted following an isotope tracing
experiment) is hybridized to an ITO-coated slide surface deri-
vatized with either functionalized alkylphosphonates and/or
organosilanes and printed with custom 16S rRNA gene probes
[147]. Then, nanoSIMS is used to quantify the isotope enrich-
ment in the hybridized RNA. Many 1000s of probes can be
analyzed in a single nanoSIMS session, and like all nanoSIP
studies, Chip-SIP is compatible with dual-label (i.e., 13C and
15N) experiments, unlike the traditional SIP method. Because
of the unpredictability of probe binding, it is best to design a
suite of probes for each taxon of interest. ITO slides and Si
slides can also serve as a substrate for DNA deposited and
hybridized, or combed DNA, as described by Cabin-Flaman
et al. via “combing-imaging by SIMS” (CIS) [114, 115].

4 Future Directions

Continued development of nanoSIMS instruments and related
technologies, such as sample preparation and data processing,
would broadly benefit systems biology research and expand the
potential for nanoSIP studies. The success of the CAMECA Nano-
SIMS 50 and 50L has resulted in a steady growth in the number of
instruments worldwide, and most scientists with an interesting
nanoSIP research problem and some funding can likely gain access
to a nanoSIMS through a user proposal, a collaboration, or a fee-
for-service arrangement. Other large- and small-geometry SIMS
instruments can also benefit biological applications (e.g., 7f with
hyperion; LG-SIMS; TOF-SIMS). Looking forward, SIMS instru-
mentation is also continuing to evolve, such as MS-MS ToF-SIMS
[162] and FT-ICR SIMS [183], and there are new nanoSIMS
capabilities in development that will lead to higher spatial resolution
and instrument sensitivity: in situ atomic force microscopy (AFM)
[162], brighter reactive ion sources [184], a cryogenic stage [127],
and the EXLIE extreme low implantation energy approach dis-
cussed above. These advances will particularly benefit those looking
to analyze ever-smaller particles (e.g., viruses, DNA) and do quan-
titative elemental analysis where cryo-preservation is ideal (e.g.,
subcellular trace metals).
Advances in the technologies that support nanoSIMS can also
make a big difference in the quality and throughput of nanoSIP
experiments. More studies with multi-isotope simultaneous label-
ing can help to distinguish overlaps in metabolism and activity (e.g.,
heterotrophs, autotrophs, mixotrophs [41]), and differential ele-
mental stoichiometry [46]. Sample preparation is a perennial chal-
lenge and any new methods that make it easier to prepare high-
quality samples for nanoSIMS analysis would advance the field. On
the output end, data processing can be time consuming, and
improved software and automation would be beneficial as research-
ers seek larger data sets. Finally, standardization continues to be an
area that needs more effort; the wide breadth of need and chal-
lenges of production are serious hurdles.
The nanoSIP method we describe here is a highly flexible and
adaptable approach, enabling the study of isotope and element
exchanges and transformations at the single-cell and sub-cellular
level. In microbial assemblages, it can enable identity and function
to be directly related to community structure, microgradients, and
substrates and has broad relevance for microbiome studies, both in
nature and in laboratory, human, or industrial settings. Researchers
using nanoSIP and nanoSIMS can answer basic but previously
inaccessible questions about where organisms are located within a
community and their functional roles. In many cases, these
advances in our scientific understanding require coordinated use
of multiple approaches, including sequencing and synergistic visualization
techniques. After two decades of application, NanoSIP is a
fully standard method in systems biology, microbial ecology, soils
and plant research, and cell biology. Researchers have seen the value
of this approach and are making necessary efforts to design experi-
ments and supporting analyses to take advantage of its insights into
biological function.

Acknowledgments

We thank Ian Hutcheon for his many years of mentorship, advice,
and leadership of the LLNL SIMS group. We also thank our many
colleagues and collaborators, with special thanks to Steve Blaze-
wicz, Anne Dekas, Ben Jacobsen, Xavier Mayali, Erin Nuccio,
Rhona Stuart, and Dagmar Woebken. Christina Ramon plays a
critical role in helping to prepare and organize many of the samples
we have discussed. This work was funded in part by awards from the
DOE OBER Genomic Science program and LLNL Laboratory
Directed Research and Development funding and performed
under the auspices of the U.S. Department of Energy at Lawrence
Livermore National Laboratory under Contract DE-AC52-
07NA27344.

References

1. Mayali X, Weber PK, Brodie EL, Mabery S, Hoeprich PD, Pett-Ridge J (2012) High-throughput isotopic analysis of RNA microarrays to quantify microbial resource use. ISME J 6(6):1210–1221
2. Adamczyk J, Hesselsoe M, Iversen N, Horn M, Lehner A, Nielsen PH, Schloter M, Roslev P, Wagner M (2003) The isotope array, a new tool that employs substrate-mediated labeling of rRNA for determination of microbial community structure and function. Appl Environ Microbiol 69(11):6875–6887
3. Ouverney CC, Fuhrman JA (1999) Combined microautoradiography-16S rRNA probe technique for determination of radioisotope uptake by specific microbial cell types in situ. Appl Environ Microbiol 65(4):1746–1752
4. Jehmlich N, Schmidt F, Taubert M, Seifert J, Bastida F, von Bergen M, Richnow H-H, Vogt C (2010) Protein-based stable isotope probing. Nat Protoc 5(12):1957–1966
5. Murrell JC, Whiteley AS (2011) Stable isotope probing and related technologies. ASM Press, Washington, DC, p 345
6. Koch BJ, McHugh TA, Hayer M, Schwartz E, Blazewicz SJ, Dijkstra P, van Gestel N, Marks JC, Mau RL, Morrissey EM, Pett-Ridge J, Hungate BA (2018) Estimating taxon-specific population dynamics in diverse microbial communities. Ecosphere 9(1):e02090–e02015
7. Hillion F, Daigne B, Girard F, Slodzian G (1993) A new high performance instrument: the CAMECA NanoSIMS 50. In: Benninghoven A et al (eds) Secondary ion mass spectrometry: SIMS IX, vol 254-257. John Wiley & Sons, Chichester
8. Ghosal S, Leighton TJ, Wheeler KE, Hutcheon ID, Weber PK (2010) Spatially resolved characterization of water and ion incorporation in Bacillus spores. Appl Environ Microbiol 76(10):3275–3282
9. Orphan VJ, House CH, Hinrichs K-U, McKeegan KD, DeLong EF (2001) Methane-consuming archaea revealed by directly coupled isotopic and phylogenetic analysis. Science 293(5529):484–487
10. Smart K, Kilburn M, Salter C, Smith J, Grovenor C (2007) NanoSIMS and EPMA analysis of nickel localisation in leaves of the hyperaccumulator plant Alyssum lesbiacum. Int J Mass Spectrom 260(2–3):107–114
11. Stadermann FJ, Walker RM, Zinner E (1999) Nanosims: the next generation ion probe for the microanalysis of extraterrestrial material. Meteorit Planet Sci 34:A111–A112
12. Guerquin-Kern J-L, Hillion F, Madelmont J-C, Labarre P, Papon J, Croisy A (2004) Ultra-structural cell distribution of the melanoma marker iodobenzamide: improved potentiality of SIMS imaging in life sciences. BioMed Eng. http://www.biomedical-engineering-online.com/content/3/1/10
13. Kraft ML, Fishel SF, Marxer CG, Weber PK, Hutcheon ID, Boxer SG (2006) Quantitative analysis of supported membrane composition using the NanoSIMS. Appl Surf Sci 252(19):6950–6956
14. Moreau JW, Weber PK, Martin MC, Gilbert B, Hutcheon ID, Banfield JF (2007) Extracellular proteins limit the dispersal of biogenic nanoparticles. Science 316:1600–1603
15. Peteranderl R, Lechene C (2004) Measure of carbon and nitrogen stable isotope ratios in cultured cells. J Am Soc Mass Spectrom 15(4):478–485
16. Wainwright M, Weber PK, Smith JB, Hutcheon ID, Klyce B, Wickramasinghe NC, Narlikar JV, Rajaratnam P (2004) Studies on bacteria-like particles sampled from the stratosphere. Aerobiologia 20:237–240
17. Galli Marxner C, Kraft ML, Weber PK, Hutcheon I, Boxer SG (2005) Supported membrane composition analysis by secondary ion mass spectrometry with high lateral resolution. Biophys J 88:2965–2975
18. Dekas AE, Poretsky RS, Orphan VJ (2009) Deep-sea archaea fix and share nitrogen in methane-consuming microbial consortia. Science 326(5951):422–426
19. Halm H, Musat N, Lam P, Langlois R, Musat F, Peduzzi S, Lavik G, Schubert CJ, Sinha B, LaRoche J, Kuypers MMM (2009) Co-occurrence of denitrification and nitrogen fixation in a meromictic lake, Lake Cadagno (Switzerland). Environ Microbiol 11(8):2190–2190
20. Musat N, Halm H, Winterholler B, Hoppe P, Peduzzi S, Hillion F, Horreard F, Amann R, Jørgensen BB, Kuypers MMM (2008) A single-cell view on the ecophysiology of anaerobic phototrophic bacteria. Proc Natl Acad Sci 105(46):17861–17866
21. Quintana C, Wu TD, Delatour B, Dhenain M, Guerquin-Kern JL, Croisy A (2007) Morphological and chemical studies of pathological human and mice brain at the subcellular level: correlation between light, electron, and NanoSIMS microscopies. Microsc Res Tech 70(4):281–295
22. Behrens S, Losekann T, Pett-Ridge J, Weber PK, Ng W, Stevenson BS, Hutcheon ID, Relman DA, Spormann AM (2008) Linking microbial phylogeny to metabolic activity at the single-cell level by using enhanced element labeling-catalyzed reporter deposition fluorescence in situ hybridization (EL-FISH) and NanoSIMS. Appl Environ Microbiol 74(10):3143
23. Finzi-Hart J, Pett-Ridge J, Weber P, Popa R, Fallon SJ, Gunderson T, Hutcheon I, Nealson K, Capone DG (2008) Fixation and fate of carbon and nitrogen in Trichodesmium IMS101 using nanometer resolution secondary ion mass spectrometry (NanoSIMS). PNAS 106:6345–6350
24. Lechene C, Hillion F, McMahon G, Benson D, Kleinfeld A, Kampf JP, Distel D, Luyten Y, Bonventre J, Hentschel D, Park K, Ito S, Schwartz M, Benichou G, Slodzian G (2006) High-resolution quantitative imaging of mammalian and bacterial cells using stable isotope mass spectrometry. J Biol 5(6):20
25. Popa R, Weber PK, Pett-Ridge J, Finzi JA, Fallon SJ, Hutcheon ID, Nealson KH, Capone DG (2007) Carbon and nitrogen fixation and metabolite exchange in and between individual cells of Anabaena oscillarioides. ISME J 1(4):354–360
26. Ghosal S, Fallon SJ, Leighton T, Wheeler KE, Hutcheon ID, Weber PK (2008) Imaging and 3D elemental characterization of intact bacterial spores with high-resolution secondary ion mass spectrometry (NanoSIMS) depth profile analysis. Anal Chem 80(15):5986–5992
27. Lechene CP, Luyten Y, McMahon G, Distel DL (2007) Quantitative imaging of nitrogen fixation by individual bacteria within animal cells. Science 317:1563–1566
28. Herrmann A, Ritz K, Nunan N, Clode P, Pett-Ridge J, Kilburn M, Murphy D, O’Donnell A, Stockdale E (2007) Nano-scale secondary ion mass spectrometry – a new analytical tool in biogeochemistry and soil ecology: a review article. Soil Biol Biochem 39:1835–1850
29. Mueller CW, Weber PK, Kilburn MR, Hoeschen C, Kleber M, Pett-Ridge J (2013) Advances in the analysis of biogeochemical interfaces: NanoSIMS to investigate soil microenvironments. In: Sparks D (ed) Advances in agronomy. Elsevier, Amsterdam

30. Renslow RS, Lindemann SR, Cole JK, Zhu Z, Anderton CR (2016) Quantifying element incorporation in multispecies biofilms using nanoscale secondary ion mass spectrometry image analysis. Biointerphases 11(2):02A322
31. Mayali X (2020) NanoSIMS: microscale quantification of biogeochemical activity with large-scale impacts. Annu Rev Mar Sci 12:449–467
32. Gao D, Huang X, Tao Y (2016) A critical review of NanoSIMS in analysis of microbial metabolic activities at single-cell level. Crit Rev Biotechnol 36(5):884–890
33. Zhao F-J, Moore KL, Lombi E, Zhu Y-G (2014) Imaging element distribution and speciation in plant cells. Trends Plant Sci 19(3):183–192
34. Musat N, Musat F, Weber PK, Pett-Ridge J (2016) Tracking microbial interactions with NanoSIMS. Curr Opin Biotechnol 41:114–121
35. Agüi-Gonzalez P, Jähne S, Phan NT (2019) SIMS imaging in neurobiology and cell biology. J Anal At Spectrom 34(7):1355–1368
36. Boxer SG, Kraft ML, Weber PK (2009) Advances in imaging secondary ion mass spectrometry for biological samples. Annu Rev Biophys 38:53–74
37. Nuñez J, Renslow R, Cliff JB III, Anderton CR (2018) NanoSIMS for biological applications: current practices and analyses. Biointerphases 13(3):03B301
38. Gorman BL, Kraft ML (2019) High-resolution secondary ion mass spectrometry analysis of cell membranes. ACS Publications, Washington, DC
39. CAMECA. NanoSIMS 50L: scientific publications. https://www.cameca.com/products/sims/nanosims
40. Pett-Ridge J, Weber PK (2012) NanoSIP: NanoSIMS applications for microbial biology. In: Navid A (ed) Microbial systems biology: methods and protocols. Humana, New York, NY
41. Dekas AE, Parada AE, Mayali X, Fuhrman JA, Wollard J, Weber PK, Pett-Ridge J (2019) Characterizing chemoautotrophy and heterotrophy in marine archaea and bacteria with single-cell multi-isotope nanoSIP. Front Microbiol 10:2682
42. Chadwick GL, Otero FJ, Gralnick JA, Bond DR, Orphan VJ (2019) NanoSIMS imaging reveals metabolic stratification within current-producing biofilms. Proc Natl Acad Sci 116(41):20716–20724
Turk V, Wagner M, Bright M (2018) NanoSIMS and tissue autoradiography reveal symbiont carbon fixation and organic carbon transfer to giant ciliate host. ISME J 12(3):714–727
44. Calabrese F, Voloshynovska I, Musat F, Thullner M, Schlömann M, Richnow HH, Lambrecht J, Müller S, Wick LY, Musat N (2019) Quantitation and comparison of phenotypic heterogeneity among single cells of monoclonal microbial populations. Front Microbiol 10:2814
45. Braun PD, Schulz-Vogt HN, Vogts A, Nausch M (2018) Differences in the accumulation of phosphorus between vegetative cells and heterocysts in the cyanobacterium Nodularia spumigena. Sci Rep 8(1):1–6
46. Gross A, Lin Y, Weber PK, Pett-Ridge J, Silver WL (2020) The role of soil redox conditions in microbial phosphorus cycling in humid tropical forests. Ecology 101(2):e02928
47. Ackerman CM, Weber PK, Xiao T, Thai B, Kuo TJ, Zhang E, Pett-Ridge J, Chang CJ (2018) Multimodal LA-ICP-MS and nanoSIMS imaging enables copper mapping within photoreceptor megamitochondria in a zebrafish model of Menkes disease. Metallomics 10(3):474–485
48. Hong-Hermesdorf A, Miethke M, Gallaher SD, Kropat J, Dodani SC, Barupala D, Chan J, Domaille DW, Shirasaki DI, Loo JA, Weber PK, Pett-Ridge J, Stemmler TL, Chang CJ, Merchant SS (2014) Selective sub-cellular visualization of trace metals identifies dynamic sites of Cu accumulation in Chlamydomonas. Nat Chem Biol 10:1034–1042
49. Dawson KS, Scheller S, Dillon JG, Orphan VJ (2016) Stable isotope phenotyping via cluster analysis of NanoSIMS data as a method for characterizing distinct microbial ecophysiologies and sulfur-cycling in the environment. Front Microbiol 7:774
50. Berry D, Mader E, Lee TK, Woebken D, Wang Y, Zhu D, Palatinszky M, Schintlmeister A, Schmid MC, Hanson BT (2015) Tracking heavy water (D2O) incorporation for identifying and sorting active microbial cells. Proc Natl Acad Sci 112(2):E194–E203
51. Kopf SH, McGlynn SE, Green-Saxena A, Guan Y, Newman DK, Orphan VJ (2015) Heavy water and 15N labelling with NanoSIMS analysis reveals growth rate-dependent metabolic heterogeneity in chemostats. Environ Microbiol 17(7):2542–2556
43. Volland J-M, Schintlmeister A, Zambalos H, 52. Ploug H, Musat N, Adam B, Moraru CL,
Reipert S, Mozetič P, Espada-Hinojosa S, Lavik G, Vagner T, Bergman B, Kuypers
130 Jennifer Pett-Ridge and Peter K. Weber

MMM (2010) Carbon and nitrogen fluxes 63. Carpenter KJ, Weber PK, Davisson ML, Pett-
associated with the cyanobacterium Aphani- Ridge J, Haverty MI, Keeling PJ (2013) Cor-
zomenon sp. in the Baltic Sea. ISME J 4 related SEM, FIB-SEM, TEM, and Nano-
(9):1215–1223 SIMS imaging of microbes from the hindgut
53. Scheller S, Yu H, Chadwick GL, McGlynn SE, of a lower termite: methods for in situ func-
Orphan VJ (2016) Artificial electron accep- tional and ecological studies of uncultivable
tors decouple archaeal methane oxidation microbes. Microsc Microanal 19
from sulfate reduction. Science 351 (06):1490–1501
(6274):703–707 64. Tai V, Carpenter KJ, Weber PK, Nalepa CA,
54. Dekas AE, Connon SA, Chadwick GL, Perlman SJ, Keeling PJ (2016) Genome evo-
Trembath-Reichert E, Orphan VJ (2016) lution and nitrogen-fixation in bacterial ecto-
Activity and interactions of methane seep symbionts of a protist inhabiting wood-
microorganisms assessed by parallel transcrip- feeding cockroaches. Appl Environ Microbiol.
tion and FISH-NanoSIMS analyses. ISME J https://doi.org/10.1128/AEM.00611-16
10(3):678–692 65. Ceh J, Kilburn MR, Cliff JB, Raina JB, van
55. Green-Saxena A, Dekas AE, Dalleska NF, Keulen M, Bourne DG (2013) Nutrient
Orphan VJ (2014) Nitrate-based niche differ- cycling in early coral life stages: Pocillopora
entiation by distinct sulfate-reducing bacteria damicornis larvae provide their algal symbiont
involved in the anaerobic oxidation of meth- (Symbiodinium) with nitrogen acquired from
ane. ISME J 8(1):150–163 bacterial associates. Ecol Evol 3
56. Milucka J, Kirf M, Lu L, Krupke A, Lam P, (8):2393–2400
Littmann S, Kuypers MMM, Schubert CJ 66. Pernice M, Dunn SR, Tonk L, Dove S,
(2015) Methane oxidation coupled to oxy- Domart-Coulon I, Hoppe P,
genic photosynthesis in anoxic waters. ISME Schintlmeister A, Wagner M, Meibom A
J 9(9):1991–2002 (2015) A nanoscale secondary ion mass spec-
57. Marlow JJ, Steele JA, Ziebis W, Thurber AR, trometry study of dinoflagellate functional
Levin LA, Orphan VJ (2014) Carbonate- diversity in reef-building corals. Environ
hosted methanotrophy represents an unrec- Microbiol 17(10):3570–3580
ognized methane sink in the deep sea. Nat 67. Wangpraseurt D, Pernice M, Guagliardo P,
Commun 5(1):1–12 Kilburn MR, Clode PL, Polerecky L, Kühl
58. Oswald K, Graf JS, Littmann S, Tienken D, M (2016) Light microenvironment and
Brand A, Wehrli B, Albertsen M, Daims H, single-cell gradients of carbon fixation in tis-
Wagner M, Kuypers MM (2017) Crenothrix sues of symbiont-bearing corals. ISME J 10
are major methane consumers in stratified (3):788–792
lakes. ISME J 11(9):2124–2140 68. Lema KA, Clode PL, Kilburn MR,
59. Foster RA, Kuypers MM, Vagner T, Paerl RW, Thornton R, Willis BL, Bourne DG (2016)
Musat N, Zehr JP (2011) Nitrogen fixation Imaging the uptake of nitrogen-fixing bacte-
and transfer in open ocean diatom- ria into larvae of the coral Acropora millepora.
cyanobacterial symbioses. ISME J 5 ISME J 10(7):1804–1808
(9):1484–1493 69. Kopp C, Domart-Coulon I, Barthelemy D,
60. Thompson AW, Foster RA, Krupke A, Carter Meibom A (2016) Nutritional input from
BJ, Musat N, Vaulot D, Kuypers MMM, Zehr dinoflagellate symbionts in reef-building cor-
JP (2012) Unicellular cyanobacterium symbi- als is minimal during planula larval life stage.
otic with a single-celled eukaryotic alga. Sci- Sci Adv 2(3):e1500681
ence 337(6101):1546–1550 70. Yang S-H, Tandon K, Lu C-Y, Wada N, Shih
61. Adam B, Klawonn I, Sveden JB, Bergkvist J, C-J, Hsiao SS-Y, Jane W-N, Lee T-C, Yang
Nahar N, Walve J, Littmann S, Whitehouse C-M, Liu C-T (2019) Metagenomic, phylo-
MJ, Lavik G, Kuypers MMM, Ploug H genetic, and functional characterization of
(2016) N2-fixation, ammonium release and predominant endolithic green sulfur bacteria
N-transfer to the microbial and classical food in the coral Isopora palifera. Microbiome 7
web within a plankton community. ISME J 10 (1):1–13
(2):450–459 71. Samo TJ, Kimbrel JA, Nilson DJ, Pett-Ridge-
62. Berry D, Stecher B, Schintlmeister A, J, Weber PK, Mayali X (2018) Attachment
Reichert J, Brugiroux S, Wild B, Wanek W, between heterotrophic bacteria and microal-
Richter A, Rauch I, Decker T (2013) Host- gae influences symbiotic microscale interac-
compound foraging by intestinal microbiota tions. Environ Microbiol 20(2):4385–4400
revealed by single-cell stable isotope probing. 72. de-Bashan LE, Mayali X, Bebout BM, Weber
Proc Natl Acad Sci 110(12):4720–4725 PK, Detweiler AM, Hernandez J-P, Prufert-
NanoSIP: NanoSIMS Applications for Microbial Biology 131

Bebout L, Bashan Y (2016) Establishment of mediated transfer of water and nutrients sti-
stable synthetic mutualism without mulates bacterial activity in dry and oligotro-
co-evolution between microalgae and bacteria phic environments. Nat Commun 8(1):1–9
demonstrated by mutual transfer of metabo- 83. Bougoure J, Ludwig M, Brundrett M, Cliff J,
lites (NanoSIMS isotopic imaging) and per- Clode P, Kilburn M, Grierson P (2014) High-
sistent physical association (fluorescent in situ resolution secondary ion mass spectrometry
hybridization). Algal Res 15:179–186 analysis of carbon dynamics in mycorrhizas
73. Alonso C, Musat N, Adam B, Kuypers M, formed by an obligately myco-heterotrophic
Amann R (2012) HISH–SIMS analysis of orchid. Plant Cell Environ 37(5):1223–1230
bacterial uptake of algal-derived carbon in 84. Hill PW, Broughton R, Bougoure J,
the Rı́o de la Plata estuary. Syst Appl Micro- Havelange W, Newsham KK, Grant H, Mur-
biol 35(8):541–548 phy DV, Clode P, Ramayah S, Marsden KA
74. Leroy C, Jauneau A, Martinez Y, Cabin- (2019) Angiosperm symbioses with
Flaman A, Gibouin D, Orivel J, Séjalon-Del- non-mycorrhizal fungal partners enhance N
mas N (2017) Exploring fungus–plant N acquisition from ancient organic matter in a
transfer in a tripartite ant–plant–fungus mutu- warming maritime Antarctic. Ecol Lett 22
alism. Ann Bot 120(3):417–426 (12):2111–2119
75. Lee JZ, Burow LC, Woebken D, Everroad 85. Mergelov N, Mueller CW, Prater I,
RC, Kubo MD, Spormann AM, Weber PK, Shorkunov I, Dolgikh A, Zazovskaya E,
Pett-Ridge J, Bebout BM, Hoehler TM Shishkov V, Krupskaya V, Abrosimov K, Cher-
(2014) Fermentation couples chloroflexi and kinsky A (2018) Alteration of rocks by endo-
sulfate-reducing bacteria to cyanobacteria in lithic organisms is one of the pathways for the
hypersaline microbial mats. Front Microbiol beginning of soils on Earth. Sci Rep 8
5:61 (1):1–15
76. Nuccio EE, Hodge A, Pett-Ridge J, Herman 86. Kopittke PM, Dalal RC, Hoeschen C, Li C,
DJ, Weber PK, Firestone MK (2013) An Menzies NW, Mueller CW (2020) Soil
arbuscular mycorrhizal fungus significantly organic matter is stabilized by organo-mineral
modifies the soil bacterial community and associations through two key processes: the
nitrogen cycling during litter decomposition. role of the carbon to nitrogen ratio. Geo-
Environ Microbiol 15(6):1870–1881 derma 357:113974
77. Kaiser C, Kilburn MR, Clode PL, 87. Keiluweit M, Bougoure JJ, Zeglin LH, Myr-
Fuchslueger L, Koranda M, Cliff JB, Solaiman old DD, Weber PK, Pett-Ridge J, Kleber M,
ZM, Murphy DV (2015) Exploring the trans- Nico PS (2012) Nano-scale investigation of
fer of recent plant photosynthates to soil the association of microbial nitrogen residues
microbes: mycorrhizal pathway vs direct root with iron (hydr)oxides in a forest soil
exudation. New Phytol 205(4):1537–1551 O-horizon. Geochim Cosmochim Acta
78. Kuga Y, Sakamoto N, Yurimoto H (2014) 95:213–226
Stable isotope cellular imaging reveals that 88. Keiluweit M, Bougoure JJ, Nico PS, Pett-
both live and degenerating fungal pelotons Ridge J, Weber PK, Kleber M (2015) Mineral
transfer carbon and nitrogen to orchid proto- protection of soil carbon counteracted by root
corms. New Phytol 202(2):594–605 exudates. Nat Clim Chang 5(6):588–595
79. Pett-Ridge J, Firestone MK (2017) Using sta- 89. Morrison KD, Misra R, Williams LB (2016)
ble isotopes to explore root-microbe-mineral Unearthing the antibacterial mechanism of
interactions in soil. Rhizosphere 3:244–253 medicinal clay: a geochemical approach to
80. Hestrin R, Hammer EC, Mueller CW, Leh- combating antibiotic resistance. Sci Rep
mann J (2019) Synergies between mycorrhi- 6:19043
zal fungi and soil microbial communities 90. Londono SC, Hartnett HE, Williams LB
increase plant nitrogen acquisition. Commun (2017) Antibacterial activity of aluminum in
Biol 2(1):1–9 clay from the Colombian Amazon. Environ
81. Gorka S, Dietrich M, Mayerhofer W, Sci Technol 51(4):2401–2408
Gabriel R, Wiesenbauer J, Martin V, 91. Eichorst SA, Strasser F, Woyke T,
Zheng Q, Imai B, Prommer J, Weidinger M Schintlmeister A, Wagner M, Woebken D
(2019) Rapid transfer of plant photosynthates (2015) Advancements in the application of
to soil bacteria via ectomycorrhizal hyphae NanoSIMS and Raman microspectroscopy to
and its interaction with nitrogen availability. investigate the activity of microbial cells in
Front Microbiol 10:168 soils. FEMS Microbiol Ecol 91(10):fiv106
82. Worrich A, Stryhanyuk H, Musat N, König S, 92. Pasulka AL, Thamatrakoln K, Kopf SH,
Banitz T, Centler F, Frank K, Thullner M, Guan Y, Poulos B, Moradian A, Sweredoski
Harms H, Richnow H-H (2017) Mycelium- MJ, Hess S, Sullivan MB, Bidle KD (2018)
132 Jennifer Pett-Ridge and Peter K. Weber

Interrogating marine virus-host interactions bacterial populations. PLoS Genet 13(12):


and elemental transfer with BONCAT and e1007122
nanoSIMS-based methods. Environ Micro- 102. Gangwe Nana GY, Ripoll C, Cabin-Flaman A,
biol 20(2):671–692 Gibouin D, Delaune A, Jannière L,
93. Greenwood DJ, Dos Santos MS, Huang S, Grancher G, Chagny G, Loutelier-Bourhis C,
Russell MR, Collinson LM, MacRae JI, Lentzen E (2018) Division-based, growth
West A, Jiang H, Gutierrez MG (2019) Sub- rate diversity in bacteria. Front Microbiol
cellular antibiotic visualization reveals a 9:849
dynamic drug reservoir in infected macro- 103. Zimmermann M, Escrig S, Hübschmann T,
phages. Science 364(6447):1279–1282 Kirf MK, Brand A, Inglis RF, Musat N,
94. Gates SD, Condit RC, Moussatche N, Stew- Müller S, Meibom A, Ackermann M (2015)
art BJ, Malkin AJ, Weber PK (2018) High Phenotypic heterogeneity in metabolic traits
initial sputter rate found for vaccinia virions among single cells of a rare bacterial species in
using isotopic labeling, nanoSIMS, and AFM. its natural environment quantified with a
Anal Chem 90(3):1613–1620 combination of flow cell sorting and Nano-
95. Stuart RK, Mayali X, Boaro AA, Zemla A, SIMS. Front Microbiol 6:243
Everroad RC, Nilson D, Weber PK, 104. Tsednee M, Castruita M, Salomé PA,
Lipton M, Bebout BM, Pett-Ridge J (2016) Sharma A, Lewis BE, Schmollinger SR,
Light regimes shape utilization of extracellu- Strenkert D, Holbrook K, Otegui MS, Kha-
lar organic C and N in a cyanobacterial bio- tua K (2019) Manganese co-localizes with
film. mBio 7(3):e00650–e00616 calcium and phosphorus in Chlamydomonas
96. Stuart RK, Mayali X, Lee JZ, Everroad RC, acidocalcisomes and is mobilized in
Hwang M, Bebout BM, Weber PK, Pett- manganese-deficient conditions. J Biol
Ridge J, Thelen MP (2016) Cyanobacterial Chem 294(46):17626–17641
reuse of extracellular organic carbon in micro- 105. Kessler N, Armoza-Zvuloni R, Wang S,
bial mats. ISME J 10(5):1240–1251 Basu S, Weber PK, Stuart RK, Shaked Y
97. Stuart RK, Mayali X, Thelen MP, Pett-Ridge J, (2020) Selective collection of iron-rich dust
Weber PK (2017) Measuring cyanobacterial particles by natural Trichodesmium colonies.
metabolism in biofilms with NanoSIMS iso- ISME J 14(1):91–103
tope imaging and scanning electron micros- 106. Newsome L, Lopez Adams R, Downie HF,
copy (SEM). Bioprotocol 7:e2263 Moore KL, Lloyd JR (2018) NanoSIMS
98. Probst AJ, Weinmaier T, Raymann K, imaging of extracellular electron transport
Perras A, Emerson JB, Rattei T, Wanner G, processes during microbial iron (III) reduc-
Klingl A, Berg IA, Yoshinaga M, Viehweger B, tion. FEMS Microbiol Ecol 94(8):fiy104
Hinrichs K-U, Thomas BC, Meck S, Auer- 107. Fleming E, Woyke T, Donatello R, Kuypers
bach AK, Heise M, Schintlmeister A, MM, Sczyrba A, Littmann S, Emerson D
Schmid M, Wagner M, Gribaldo S, Banfield (2018) Insights into the fundamental physi-
JF, Moissl-Eichinger C (2014) Biology of a ology of the uncultured Fe-oxidizing bacte-
widespread uncultivated archaeon that contri- rium Leptothrix ochracea. Appl Environ
butes to carbon fixation in the subsurface. Nat Microbiol 84(9). https://doi.org/10.1128/
Commun 5:5497 AEM.02239-17
99. Tveit AT, Hestnes AG, Robinson SL, 108. Stryhanyuk H, Calabrese F, Kümmel S,
Schintlmeister A, Dedysh SN, Jehmlich N, Musat F, Richnow HH, Musat N (2018) Cal-
von Bergen M, Herbold C, Wagner M, Rich- culation of single cell assimilation rates from
ter A (2019) Widespread soil bacterium that SIP-NanoSIMS-derived isotope ratios: a
oxidizes atmospheric methane. Proc Natl comprehensive approach. Front Microbiol
Acad Sci 116(17):8515–8524 9:2342
100. Sheik AR, Muller EE, Audinot J-N, Lebrun 109. Arandia-Gorostidi N, Weber PK, Alonso-
LA, Grysan P, Guignard C, Wilmes P (2016) Saez L, Moran XAG, Mayali X (2017) Ele-
In situ phenotypic heterogeneity among sin- vated temperature increases carbon and nitro-
gle cells of the filamentous bacterium Candi- gen fluxes between phytoplankton and
datus Microthrix parvicella. ISME J 10 heterotrophic bacteria through physical
(5):1274–1279 attachment. ISME J 11(3):641–650
101. Nikolic N, Schreiber F, Kiviet DJ, 110. Frisz JF, Lou K, Klitzing HA, Hanafin WP,
Bergmiller T, Littmann S, Kuypers MM, Ack- Lizunov V, Wilson RL, Carpenter KJ, Kim R,
ermann M (2017) Cell-to-cell variation and Hutcheon ID, Zimmerberg J (2013) Direct
specialization in sugar metabolism in clonal chemical evidence for sphingolipid domains
NanoSIP: NanoSIMS Applications for Microbial Biology 133

in the plasma membranes of fibroblasts. Proc glycine-derived C and N retention with soil
Natl Acad Sci 110(8):E613–E622 organo-mineral associations. Biogeochem-
111. Smith NS, Boswell RW, Tesch PP, Martin NP istry 125(3):303–313
(2017) Rf system, magnetic filter, and high 122. Li T, Wu TD, Mazeas L, Toffin L, Guerquin-
voltage isolation for an inductively coupled Kern JL, Leblon G, Bouchez T (2008) Simul-
plasma ion source, Google Patents taneous analysis of microbial identity and
112. Smith N, Tesch P, Martin N, Kinion D function using NanoSIMS. Environ Micro-
(2008) A high brightness source for nano- biol 10(3):580–588
probe secondary ion mass spectrometry. 123. Benn PA, Perle MA (1992) Chromosome
Appl Surf Sci 255(4):1606–1609 staining and banding techniques. In: Rooney
113. Malherbe J, Penen F, Isaure M-P, Frank J, DE, Czepulkowski BH (eds) Human cytoge-
Hause G, Dobritzsch D, Gontier E, Horréard netics: volume I, constitutional analysis: a
FO, Hillion FO, Schaumlöffel D (2016) A practical approach. Oxford University Press,
new radio frequency plasma oxygen primary New York, NY
ion source on nano secondary ion mass spec- 124. Latt SA (1973) Microfluorometric detection
trometry for improved lateral resolution and of deoxyribonucleic acid replication in human
detection of electropositive elements at single metaphase chromosomes. Proc Natl Acad Sci
cell level. Anal Chem 88(14):7130–7136 U S A 49:3395–3399
114. Cabin-Flaman A, Monnier AFO, Coffinier Y, 125. Manefield M, Whiteley AS, Griffiths RI, Bai-
Audinot J-N, Gibouin D, Wirtz T, ley MJ (2002) RNA stable isotope probing, a
Boukherroub R, Migeon H-N, Bensimon A, novel means of linking microbial community
Jannière L (2011) Combed single DNA function to phylogeny. Appl Environ Micro-
molecules imaged by secondary ion mass biol 68:5367–5373
spectrometry. Anal Chem 83(18):6940–6947 126. Radajewski S, Ineson P, Parekh NR, Murrell J
115. Cabin-Flaman A, Monnier A-F, Coffinier Y, (2000) Stable-isotope probing as a tool in
Audinot J-N, Gibouin D, Wirtz T, microbial ecology. Nature 403(10):646–649
Boukherroub R, Migeon H-N, Bensimon A, 127. Jensen LHS, Cheng T, Plane FOV, Escrig S,
Jannière L (2016) Combining combing and Comment A, van den Brandt B, Humbel BM,
secondary ion mass spectrometry to study Meibom A (2016) En route to ion micro-
DNA on chips using 13C and 15N labeling. probe analysis of soluble compounds at the
F1000Res 5:27429742 single cell level: the CryoNanoSIMS. In:
116. Weber PK, Graham GA, Teslich NE, European microscopy congress 2016: pro-
MoberlyChan W, Ghosal S, Leighton TJ, ceedings. Wiley, New York, NY
Wheeler KE (2010) NanoSIMS imaging of 128. Lovrić J, Malmberg P, Johansson BR,
Bacillus spores sectioned by focused ion Fletcher JS, Ewing AG (2016) Multimodal
beam. J Microsc 238:189–199 imaging of chemically fixed cells in prepara-
117. Wilson RG, Stevie FA, Magee CW (1989) tion for NanoSIMS. Anal Chem 88
Secondary ion mass spectrometry: a practical (17):8841–8848
handbook for depth profiling and bulk impu- 129. Gibbin E, Gavish A, Domart-Coulon I,
rity analysis. Wiley, New York, NY Kramarsky-Winter E, Shapiro O, Meibom A,
118. Polerecky L, Adam B, Milucka J, Musat N, Vardi A (2018) Using NanoSIMS coupled
Vagner T, Kuypers MM (2012) Look@ Nano- with microfluidics to visualize the early stages
SIMS–a tool for the analysis of nanoSIMS of coral infection by Vibrio coralliilyticus.
data in environmental microbiology. Environ BMC Microbiol 18(1):1–10
Microbiol 14(4):1009–1023 130. Nunan N, Ritz K, Crabb D, Harris K, Wu K,
119. Huang W, Hammel KE, Hao J, Thompson A, Crawford JW, Young IM (2001) Quantifica-
Timokhin VI, Hall SJ (2019) Enrichment of tion of the in situ distribution of soil bacteria
lignin-derived carbon in mineral-associated by large-scale imaging of thin sections of
soil organic matter. Environ Sci Technol 53 undisturbed soil. FEMS Microbiol Ecol 37
(13):7522–7531 (1):67–77
120. Whitman T, Zhu Z, Lehmann J (2014) Car- 131. Kuo J (2007) Electron microscopy: methods
bon mineralizability determines interactive and protocols. In: Methods in molecular biol-
effects on mineralization of pyrogenic organic ogy, 2nd edn. Humana, Totowa, NJ
matter and soil organic carbon. Environ Sci 132. Tippkötter R, Ritz K (1996) Evaluation of
Technol 48(23):13727–13734 polyester, epoxy and acrylic resins for suitabil-
121. Hatton P-J, Remusat L, Zeller B, Brewer EA, ity in preparation of soil thin sections for in
Derrien D (2015) NanoSIMS investigation of
134 Jennifer Pett-Ridge and Peter K. Weber

situ biological studies. Geoderma 69 143. Lehmann J, Liang BQ, Solomon D,


(1–2):31–57 Lerotic M, Luizao F, Kinyangi J, Schafer T,
133. Chandra S, Morrison GH (1992) Sample Wirick S, Jacobsen C (2005) Near-edge X-ray
preparation of animal tissues and cell cultures absorption fine structure (NEXAFS) spectros-
for secondary ion mass spectrometry (SIMS) copy for mapping nano-scale distribution of
microscopy. Biol Cell 74:31–42 organic carbon forms in soil: application to
134. Dykstra MJ, Reuss LE (eds) (2003) Biological black carbon particles. Global Biogeochem
electron microscopy: theory, techniques and Cycles 29:Art. No. GB1013
troubleshooting, 2nd edn. Kluwer Aca- 144. Flynn GJ, Keller LP, Jacobsen C, Wirick S
demic/Plenum Publishers, New York, NY (2004) An assessment of the amount and
135. Echlin P (1992) Low-temperature micros- types of organic matter contributed to the
copy and analysis. Springer, New York, NY Earth by interplanetary dust. Adv Space Res
33:57–66
136. Musat N, Stryhanyuk H, Bombach P,
Adrian L, Audinot J-N, Richnow HH 145. Gnaser H (1997) Formation of metastable
(2014) The effect of FISH and CARD-FISH N2- and CO-anions in sputtering. Phys Rev
on the isotopic composition of 13C- and A 56(4):R2518
15N-labeled Pseudomonas putida cells 146. McMahon G, Saint-Cyr HF, Lechene C,
measured by nanoSIMS. Syst Appl Microbiol Unkefer CJ (2006) CN secondary ions
37(4):267–276 form by recombination as demonstrated
137. Woebken D, Burow LC, Behnam F, Mayali X, using multi-isotope mass spectrometry of
Schintlmeister A, Fleming ED, Prufert- 13C- and 15N-labeled polyglycine. J Am
Bebout L, Singer SW, Lopez Cortes A, Hoeh- Soc Mass Spectrom 17(8):1181–1187
ler TM, Pett-Ridge J, Spormann AM, 147. Mayali X, Weber PK, Nuccio E, Lietard J,
Wagner M, Weber PK, Bebout BM (2015) Somoza M, Blazewicz SJ, Pett-Ridge J
Revisiting N-2 fixation in Guerrero Negro (2019) Chip-SIP: stable isotope probing ana-
intertidal microbial mats with a functional lyzed with rRNA-targeted microarrays and
single-cell approach. ISME J 9(2):485–496 NanoSIMS. In: Stable isotope probing.
138. Herrmann AM, Clode PL, Fletcher IR, Springer, New York, NY, pp 71–87
Nunan N, Stockdale EA, O’Donnel AG, 148. Gnaser H (1999) Singly-and doubly-negative
Murphy DV (2007) A novel method for the carbon clusters in sputtering: energy spectra,
study of the biophysical interface in soils using abundance distributions and unimolecular
nano-scale secondary ion mass spectrometry. fragmentation. Nucl Instrum Methods Phys
Rapid Commun Mass Spectrom 21(1):29–34 Res, Sect B 149(1–2):38–52
139. Weng N, Jiang H, Wang W-X (2017) In situ 149. Weber PK, Bacon CR, Hutcheon ID, Ingram
subcellular imaging of copper and zinc in BL, Wooden JL (2005) Ion microprobe mea-
contaminated oysters revealed by nanoscale surement of strontium isotopes in calcium
secondary ion mass spectrometry. Environ carbonate with application to salmon otoliths.
Sci Technol 51(24):14426–14435 Geochim Cosmochim Acta 69
140. Rogge A, Flintrop CM, Iversen MH, Salter I, (5):1225–1239
Fong AA, Vogts A, Waite AM (2018) Hard 150. Ghosal S, Fallon SJ, Leighton T, Wheeler K,
and soft plastic resin embedding for single- Hutcheon ID, Weber PK (2006) Analysis of
cell element uptake investigations of marine- bacterial spore permeability to water and ions
snow-associated microorganisms using nano- using NanoSecondary Ion Mass Spectrome-
scale secondary ion mass spectrometry. Lim- try (NanoSIMS). Abstr Pap Am Chem Soc
nol Oceanogr Methods 16(8):484–503 231:3
141. Fike DA, Gammon CL, Ziebis W, Orphan VJ 151. Wolfe-Simon F, Blum JS, Kulp TR, Gordon
(2008) Micron-scale mapping of sulfur GW, Hoeft SE, Pett-Ridge J, Stolz JF, Webb
cycling across the oxycline of a cyanobacterial SM, Weber PK, Davies PCW, Anbar AD,
mat: a paired nanoSIMS and CARD-FISH Oremland RS (2011) A bacterium that can
approach. ISME J 2(7):749–759 grow by using arsenic instead of phosphorus.
142. De Gregorio BT, Stroud RM, Nittler LR, Science 332(6034):1163–1166
Alexander CMOD, Kilcoyne ALD, Zega TJ 152. Hauri EH, Papineau D, Wang J, Hillion F
(2010) Isotopic anomalies in organic nano- (2016) High-precision analysis of multiple
globules from Comet 81P/Wild 2: compari- sulfur isotopes using NanoSIMS. Chem
son to Murchison nanoglobules and isotopic Geol 420:148–161
anomalies induced in terrestrial organics by 153. Chandra S, Smith DR, Morrison GH (2000)
electron irradiation. Geochim Cosmochim Subcellular imaging by dynamic SIMS ion
Acta 74(15):4454–4470 microscopy. Anal Chem 72:104A–114A
NanoSIP: NanoSIMS Applications for Microbial Biology 135

154. Guerquin-Kern JL, Wu TD, Quintana C, 165. Pernthaler A, Pernthaler J, Amann R (2002)
Croisy A (2005) Progress in analytical imag- Fluorescence in situ hybridization and cata-
ing of the cell by dynamic secondary ion mass lyzed reporter deposition for the identifica-
spectrometry (SIMS microscopy). BBA-Gen tion of marine bacteria. Appl Environ
Subjects 1724(3):228–238 Microbiol 68(6):3094–3101
155. Burns MS, File DM, Deline V, Galle P (1986) 166. Woebken D, Burow LC, Prufert-Bebout L,
Matrix effects in secondary ion mass spectro- Bebout BM, Hoehler TM, Pett-Ridge J,
metric analysis of biological tissue. Scan Elec- Spormann AM, Weber PK, Singer SW
tron Microscopy 1986(4):1277–1290 (2012) Identification of a novel cyanobacter-
156. Harris WC, Chandra S, Morrison GH (1983) ial group as active diazotrophs in a coastal
Ion implantation for quantitative ion micros- microbial mat using NanoSIMS analysis.
copy of biological soft tissue. Anal Chem 55 ISME J 6(7):1427–1439
(12):1959–1963 167. Lemaire R, Webb RI, Yuan Z (2008) Micro-
157. Phinney D (2006) Quantitative analysis of scale observations of the structure of aerobic
microstructures by secondary ion mass spec- microbial granules used for the treatment of
trometry. Microsc Microanal 12(4):352 nutrient-rich industrial wastewater. ISME J 2
158. Decelle J, Veronesi G, Gallet B, (5):528–541
Stryhanyuk H, Benettoni P, Schmidt M, 168. Hatzenpichler R, Scheller S, Tavormina PL,
Tucoulou R, Passarelli M, Bohic S, Clode P Babin BM, Tirrell DA, Orphan VJ (2014) In
(2020) Subcellular chemical imaging: new situ visualization of newly synthesized pro-
avenues in cell biology. Trends Cell Biol 30 teins in environmental microbes using amino
(3):173–188 acid tagging and click chemistry. Environ
159. Penen F, Malherbe J, Isaure M-P, Microbiol 16(8):2568–2590
Dobritzsch D, Bertalan I, Gontier E, Le 169. Bradley JP, Dai ZR, Erni R, Browning ND,
Coustumer P, Schaumlöffel D (2016) Chem- Graham GA, Weber PK, Smith JB, Hutcheon
ical bioimaging for the subcellular localization ID, Ishii H, Bajt S, Floss C, Stadermann FJ,
of trace elements by high contrast TEM, Sandford S (2005) An astronomical 2175 Å
TEM/X-EDS, and NanoSIMS. J Trace Elem feature in interplanetary dust particles. Sci-
Med Biol 37:62–68 ence 307:244–247
160. Nomaki H, LeKieffre C, Escrig S, Meibom A, 170. Remusat L, Hatton P-J, Nico PS, Zeller B,
Yagyu S, Richardson EA, Matsuzaki T, Kleber M, Derrien D (2012) NanoSIMS
Murayama M, Geslin E, Bernhard JM study of organic matter associated with soil
(2018) Innovative TEM-coupled approaches aggregates: advantages, limitations, and com-
to study foraminiferal cells. Mar Micropaleon- bination with STXM. Environ Sci Technol 46
tol 138:90–104 (7):3943–3949
161. Kraft ML, Weber PK, Longo ML, Hutcheon 171. De Samber B, De Rycke R, De Bruyne M,
ID, Boxer SG (2006) Phase separation of lipid Kienhuis M, Sandblad L, Bohic S, Cloetens P,
membranes analyzed with high-resolution Urban C, Polerecky L, Vincze L (2020)
secondary ion mass spectrometry. Science Effect of sample preparation techniques
313:1948–1951 upon single cell chemical imaging: a practical
162. Wirtz T, Fleming Y, Gysin U, Glatzel T, comparison between synchrotron radiation
Wegmann U, Meyer E, Maier U, Rychen J based X-ray fluorescence (SR-XRF) and nano-
(2013) Combined SIMS-SPM instrument for scopic secondary ion mass spectrometry
high sensitivity and high-resolution elemental (nano-SIMS). Anal Chim Acta 1106:22–32
3D analysis. Surf Interface Anal 45 172. Lehmann J, Kinyangi J, Solomon D (2007)
(1):513–516 Organic matter stabilization in soil microag-
163. Orphan VJ, House CH, Hinrichs K-U, gregates: implications from spatial heteroge-
McKeegan KD, DeLong EF (2001) neity of organic carbon contents and carbon
Methane-consuming Archaea revealed by forms. Biogeochemistry 85(1):45–57
directly coupled isotopic and phylogenetic 173. Wan J, Tyliszczak T, Tokunaga TK (2007)
analysis. Science 293:484–487 Organic carbon distribution, speciation, and
164. Amann RI, Krumholz L, Stahl DA (1990) elemental correlations within soil microaggre-
Fluorescent-oligonucleotide probing of gates: applications of STXM and NEXAFS
whole cells for determinative, phylogenetic, spectroscopy. Geochim Cosmochim Acta 71
and environmental studies in microbiology. J (22):5439–5449
Bacteriol 172(2):762–770
136 Jennifer Pett-Ridge and Peter K. Weber

174. Kopp C, Wisztorski M, Revel J, Mehiri M, response to resource availability: amino acid
Dani V, Capron L, Carette D, Fournier I, incorporation in San Francisco Bay. PLoS
Massi L, Mouajjah D (2015) MALDI-MS One 9(4):e95842
and NanoSIMS imaging techniques to study 181. Mayali X, Weber PK, Pett-Ridge J (2013)
cnidarian–dinoflagellate symbioses. Zoology Taxon-specific C:N relative use efficiency for
118(2):125–131 amino acids in an estuarine community.
175. Schlüter S, Eickhorst T, Mueller CW (2018) FEMS Microbiol Ecol 83(2):402–412
Correlative imaging reveals holistic view of 182. Bryson S, Li Z, Chavez F, Weber PK, Pett-
soil microenvironments. Environ Sci Technol Ridge J, Hettich RL, Pan C, Mayali X, Muel-
53(2):829–837 ler RS (2017) Phylogenetically conserved
176. Lin S, Henze S, Lundgren P, Bergman B, resource partitioning in the coastal microbial
Carpenter EJ (1998) Whole-cell immunolo- loop. ISME J 11(12):2781–2792
calization of nitrogenase in marine diazo- 183. Smith DF, Kiss A, Leach FE, Robinson EW,
trophic cyanobacteria, Trichodesmium spp. Paša-Tolić L, Heeren RM (2013) High mass
Appl Environ Microbiol 64(8):3052–3058 accuracy and high mass resolving power
177. Levenson RM, Borowsky AD, Angelo M FT-ICR secondary ion mass spectrometry
(2015) Immunohistochemistry and mass for biological tissue imaging. Anal Bioanal
spectrometry for highly multiplexed cellular Chem 405(18):6069–6076
molecular imaging. Lab Investig 95 184. Steele AV, Schwarzkopf A, McClelland JJ,
(4):397–405 Knuffman B (2017) High-brightness Cs
178. Singer SW, Chan CS, Hwang MH, Zemla A, focused ion beam from a cold-atomic-beam
VerBerkmoes NC, Hettich RL, Banfield JF, ion source. Nano Fut 1(1):015005
Thelen MP (2008) Characterization of cyto- 185. Hayes JM (2004) An introduction to isotopic
chrome579, an unusual cytochrome isolated calculations. Woods Hole Oceanographic
from an iron-oxidizing microbial community. Institution, Woods Hole, MA, USA.
Appl Environ Microbiol 74:4454–4462 https://www.whoi.edu/cms/files/jhayes/
179. Gerard E, Guyot F, Philippot P, Lopez-Garcia 2005/9/IsoCalcs30Sept04_5183.pdf
P (2005) Fluorescence in situ hybridization 186. Legin AA, Schintlmeister A, Jakupec MA,
coupled to ultra small immunogold detection Galanski M, Lichtscheidl I, Wagner M, Kep-
to identify prokaryotic cells using transmis- pler BK (2014) NanoSIMS combined with
sion and scanning electron microscopy. J fluorescence microscopy as a tool for subcel-
Microbiol Methods 63:20–28 lular imaging of isotopically labeled platinum-
180. Mayali X, Weber PK, Mabery S, Pett-Ridge J based anticancer drugs. Chem Sci 5(8):3135-
(2014) Phylogenetic patterns in the microbial 3143
Chapter 7

Construction of Metatranscriptomic Libraries for 5′ End Sequencing of rRNAs
for Microbiome Research

Marja Tiirola and Anita Mäki

Abstract
Metatranscriptomic sequencing enables the study of community-wide gene expression profiles of
microbial samples and provides functional insight into their up- or downregulated pathways.
However, shotgun sequencing is not the most efficient way to study the expression of ribosomal
RNA genes or to compare large numbers of samples in experimental setups. Here we describe an
efficient primer-independent method for processing and barcoding libraries for directional
sequencing of the 5′ end region of the RNA. When combined with size selection of the original
RNA, the method forms an optimal solution for the simultaneous analysis of bacterial, archaeal,
and eukaryotic rRNA diversity.

Keywords RNA, Metatranscriptomics, Microbiomes, High-throughput, Sequencing, NGS, SSU rRNA

1 Introduction

Ribosomal sequencing has become a key standard in microbiome studies since the usefulness of
the 16S rRNA gene as a phylogenetic chronometer was recognized in the 1970s [1]. However,
profiling microbial communities using 16S/18S rRNA instead of rRNA genes can provide several
benefits: it can avoid serious biases generated by the choice of specific primers and by the
copy number variation in the rRNA genes, especially within eukaryotic species [2]. RNA-based
studies also bring out living organisms without the fear of contamination with relic DNA [3].
While primer-dependent methods are limited to certain microbial domains and, within them,
biased toward well-characterized phyla, shotgun metatranscriptomics is often used for the
characterization of microbiomes. Construction of transcriptome libraries is usually performed
using random priming or fragmentation/ligation. However, for efficient aligning and clustering
of the rRNAs, directional sequencing of the 5′ or 3′ ends of the small subunit rRNA would be
most cost-efficient. Several ligation and tailing systems have been developed and evaluated
[4, 5], but many of those need a lot of RNA input. Indeed, using primer-independent RNA
sequencing, the knowledge on the diversity of life has been dramatically expanded [6].

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_7, © Springer Science+Business Media, LLC, part of Springer Nature 2022

Here we demonstrate the efficient construction of rRNA libraries for Ion Torrent sequencing,
which can be completed in one or two working days. Since rRNA lacks both the poly(A) tail and
5′ capping, the method is based on ligation to the RNA 5′ end. Although described here for the
Ion Torrent platform, the method is compatible with other sequencing chemistries when the
adapter sequences (IonA and P1) are replaced with Illumina or PacBio adapters. Sample barcoding
can be performed using the same adapter and barcoding system as utilized for amplicon
sequencing [7].
The RNA library construction starts with the ligation of the RNA oligo (M13) to the 5′ end of
the gel-extracted small subunit rRNA (16S/18S rRNA) and continues with reverse transcription
using a random hexamer primer with an overhang adapter P1 (Fig. 1). PCR amplification of the
cDNA fragments is performed with the P1 primer and an IonA primer construct tailed with
barcodes and M13. Finally, amplification products are dual size selected with magnetic bead
purification, then quantified and pooled for the sequencing library.

2 Materials

The procedure begins with pre-extracted RNA samples, as RNA extraction depends on the type of
starting matrix, such as water, soil, food, stool, or eukaryotic tissues. Prepare all solutions
using molecular grade water free of RNase contamination. Store RNA at −80 °C and keep it on ice
while setting up reactions. cDNA and amplified libraries are stable at 4 °C overnight but
should otherwise be stored at −20 °C.

2.1 Size Selection of the RNA (Optional, See Note 1)

1. Your RNA samples in water or buffer, typically 5–50 ng/μL.
2. Control RNA isolated from E. coli or other organisms.
3. Molecular marker, e.g., 100 bp size ladder.
4. Pre-cast 1% agarose gel (for RNA) system.
5. Soft cooling block, pre-cooled in a fridge.
6. Blue light table (emission at around 470 nm wavelength), sterile scalpels, and selective
glasses to protect the eyes from the light.
7. Kit for recovery of RNA fragments from agarose gel.

Fig. 1 Library preparation includes ligation of an RNA-oligo to the extracted RNA and reverse transcription of
the RNA into cDNA using a random hexamer primer with an overhang adapter P1. Libraries are amplified using
barcoded primers, dual size selected (containing purification), quantified, and pooled for next-generation
sequencing. The procedure can be done in 1 or 2 work days

2.2 RNA/RNA Ligation and Reverse Transcription

1. T4 RNA ligase and buffer.
2. Molecular grade water, RNase free.
3. Polyethylene glycol solution: 40% (w/v) PEG8000 in water.
4. Ribonuclease inhibitor.
5. Thermocycler (preferably a real-time thermocycler) and 0.2 mL PCR tubes.
6. Magnetic stand for PCR tubes or microcentrifuge tubes.
7. cDNA synthesis kit including the reverse transcriptase enzyme, buffer, and dNTPs.
8. M13-RNA oligonucleotide (5′-UGUAAAACGACGGCCAGU-3′), 10 μM.
9. P1-6N tailed random primer (5′-CCTCTCTATGGGCAGTCGGTGATNNNNNN-3′), 10 μM.

2.3 Nucleic Acid Purification and Quantification

1. Magnetic bead-based RNA/DNA purification system.
2. Fresh 80% ethanol: 80% ethanol and 20% RNase-free water (see Note 2). Prepare daily or
store at −20 °C.
3. Molecular grade water, RNase free.
4. ProNex Size-Selective Purification System (Promega), including bead solution, washing
buffer, and elution buffer.
5. An automated electrophoresis tool for DNA and RNA sample quality control.
6. A benchtop fluorometer to accurately measure DNA and RNA.

2.4 PCR Amplification

1. A ready-to-use qPCR master mix including DNA polymerase, dNTPs, and PCR buffer.
2. Reverse primer P1 (5′-CCTCTCTATGGGCAGTCGGTGAT-3′), 10 μM.
3. Set of barcoded IonA forward primers with barcodes and M13 tail
(5′-CCATCTCATCCCTGCGTGTCTCCGACTCAGX10GATGTAAAACGACGGCCAGT-3′), where X10 refers to the
10–12 bp long Ion Torrent barcode, 10 μM (see Note 3).
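
The barcoded forward primers listed above share a fixed 5′ part and a fixed 3′ part around the sample-specific barcode, so a full primer set can be assembled programmatically before ordering. The following is a minimal sketch in Python, assuming the two fixed parts exactly as printed in item 3; the function name and the two example barcodes are placeholders for illustration and are not taken from the protocol.

```python
# Fixed primer parts copied from Subheading 2.4, item 3.
IONA_WITH_KEY = "CCATCTCATCCCTGCGTGTCTCCGACTCAG"  # IonA adapter ending in the TCAG key
GAT_M13_TAIL = "GATGTAAAACGACGGCCAGT"             # GAT adapter followed by the M13 tail


def build_forward_primer(barcode: str) -> str:
    """Assemble one barcoded IonA forward primer (5'->3')."""
    barcode = barcode.upper()
    if not 10 <= len(barcode) <= 12:
        raise ValueError("barcode must be 10-12 nt long")
    if set(barcode) - set("ACGT"):
        raise ValueError("barcode may contain only A, C, G, and T")
    return IONA_WITH_KEY + barcode + GAT_M13_TAIL


# Placeholder barcodes for illustration only (assumption, not a vendor list).
example_barcodes = {"sample_01": "CTAAGGTAAC", "sample_02": "TAAGGAGAAC"}
primers = {name: build_forward_primer(bc) for name, bc in example_barcodes.items()}

for name, seq in primers.items():
    print(f"{name}\t{len(seq)} nt\t{seq}")
```

If the GAT adapter is omitted as described in Note 3, the 3′ constant would need to be changed accordingly; that variant is not shown here.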

3 Methods

It is useful to check the concentration and size distribution of the total RNA of the samples
before starting the procedure using an automated electrophoresis tool for RNA sample quality
control (see Note 4). If your RNA is intact and the ribosomal bands (16S/18S rRNA vs. 23S/28S
rRNA) can be clearly seen, you can perform the RNA size selection to increase the number of
SSU rRNA sequences in the final libraries.

3.1 RNA Size Selection (Optional)

1. Load 20 μL of your RNA onto a precast 1% agarose gel. In extra lanes, load a 100 bp size
marker and isolated RNA of E. coli to aid in selecting the right size if the sample RNA is not
visible.
2. Run the gel using an appropriate program until the rRNA bands are clearly visible
(8–10 min). While running, keep a cooled ice block on top of the gel to avoid heating of the
gel.
3. Separate the gel plates with the aid of a spatula, starting from the four corners. Cut out
the area of 16S/18S rRNA (see Note 5) with a sterile scalpel and insert the gel slices into
clean microcentrifuge tubes (Fig. 1).
4. Extract the excised RNA following the instructions of a kit for recovery of RNA fragments
from agarose gel, using an elution volume of 20 μL.
5. Optional: measure the concentration of RNA (see Note 6).

3.2 Ligation of the M13-RNA Oligo

1. Prepare the RNA ligation reaction (see Note 7). Also prepare a negative control using water
instead of the RNA. Keep the negative control as one of the samples in the following steps.

Components for a 20 μL reaction

PEG (40%)                                      10 μL
M13-RNA oligo (10 μM)                          4 μL
T4 RNA ligase buffer (10×)                     2 μL
Ribonuclease inhibitor (20 U/μL)               0.5 μL
T4 RNA ligase (10 U/μL)                        0.5 μL
RNA (total or size-selected), 10 ng or more    3 μL

2. Incubate the reaction at 37 °C for 40 min. Add 20 μL of nuclease-free water to the ligation
reaction to get 40 μL total volume (see Note 8).
3. Purification when RNAClean XP (Beckman Coulter) solution is used: Add 72 μL (1.8× vol) of
well-mixed bead solution to the 40 μL sample and mix by pipetting 15 times. After waiting for
10–20 min, transfer the tube to the magnetic stand for 5 min. Keep the tube in the rack and
discard the supernatant by pipetting, without touching the beads. Add 200–500 μL of 80%
ethanol for washing. Discard the ethanol by pipetting and repeat for a total of three washes.
After the final wash, discard the ethanol carefully without leaving any in the tube. Let the
tube air-dry for 5–10 min with the cap open to evaporate the remaining ethanol. Take the tube
from the magnetic stand and add 20 μL of nuclease-free water. Mix well by pipetting. Elution is
rapid and no incubation is needed. Transfer the tube back to the magnetic stand and let it
stand until the supernatant is clear. Transfer the supernatant to a new tube.
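
The roughly 2,000-fold excess of M13-RNA oligo over rRNA mentioned in Note 7 can be checked with simple arithmetic from the reaction composition above. The sketch below is only an estimate: it assumes a 16S rRNA length of about 1500 nt and an average nucleotide mass of about 330 g/mol, neither of which is stated in the protocol.

```python
AVG_NT_MASS = 330.0    # g/mol per RNA nucleotide (rough average; assumption)
RRNA_LENGTH_NT = 1500  # approximate 16S rRNA length in nt (assumption)


def rna_pmol(mass_ng: float, length_nt: int = RRNA_LENGTH_NT) -> float:
    """Convert an RNA mass in ng to pmol."""
    molar_mass = length_nt * AVG_NT_MASS        # g/mol
    return mass_ng * 1e-9 / molar_mass * 1e12   # ng -> g -> mol -> pmol


def oligo_pmol(volume_ul: float, conc_um: float) -> float:
    """uL multiplied by umol/L gives pmol directly."""
    return volume_ul * conc_um


rrna = rna_pmol(10.0)           # 10 ng of rRNA in the ligation reaction
oligo = oligo_pmol(4.0, 10.0)   # 4 uL of 10 uM M13-RNA oligo
molar_excess = oligo / rrna

print(f"rRNA: {rrna:.4f} pmol, oligo: {oligo:.0f} pmol, excess: {molar_excess:.0f}-fold")
```

With more input RNA, or with total RNA in which only part of the mass is SSU rRNA, the ratio changes proportionally, so the function can be rerun with the measured input.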

3.3 cDNA Synthesis Using P1-Tailed Primer

1. Prepare the cDNA synthesis reaction by mixing the following reagents for each sample (see
Note 9).

Components for a 20 μL reaction

P1-6N (10 μM)                          1 μL
dNTPs (10 mM)                          1 μL
Reverse transcriptase buffer (5×)      4 μL
Ribonuclease inhibitor (20 U/μL)       0.5 μL
Reverse transcriptase (200 U/μL)       1 μL
Cleaned ligation product               12.5 μL

2. Incubate the reaction at 25 °C for 5 min, at 45 °C for 30 min, and at 70 °C for 5 min (see
Note 10).
3. Purify the cDNA product using the same procedure as earlier (Subheading 3.2, step 3), except
use only 30 μL (1.5×) of the RNAClean bead solution (see Note 11). Elute the cDNA in
40 μL of water.
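
The bead volumes in the two cleanups (Subheading 3.2, step 3 and Subheading 3.3, step 3) follow directly from the sample volume and the bead-to-sample ratio. A small helper such as the one below, a sketch with a hypothetical function name, can be used to recompute the volume if a sample volume deviates from the protocol.

```python
def bead_volume_ul(sample_ul: float, ratio: float) -> float:
    """Bead solution volume for a given sample volume and bead:sample ratio (v/v)."""
    if sample_ul <= 0 or ratio <= 0:
        raise ValueError("sample volume and ratio must be positive")
    return round(sample_ul * ratio, 1)


ligation_cleanup = bead_volume_ul(40, 1.8)  # diluted ligation reaction, 1.8x
cdna_cleanup = bead_volume_ul(20, 1.5)      # cDNA synthesis reaction, 1.5x

print(f"Ligation cleanup: {ligation_cleanup} uL beads")
print(f"cDNA cleanup:     {cdna_cleanup} uL beads")
```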

3.4 Amplification of the cDNA

1. Prepare the PCR master mix, here Maxima SYBR Green/Fluorescein Master Mix (Thermo
Scientific).

Components for a 30 μL reaction

qPCR master mix (2×)           15 μL
P1 primer (10 μM)              1.5 μL
For each reaction, add the specific barcoded primer and the cDNA template:
IonA primer (10 μM)            1.5 μL
Template cDNA, purified        12 μL

2. For PCR amplification, use the following program: initial denaturation at 95 °C for 5 min,
30 cycles of amplification (95 °C for 15 s, 52 °C for 30 s, 72 °C for 30 s), and final
extension at 72 °C for 5 min. Check that the qPCR curves of the samples show amplification and
that the negative control amplifies at least 3–4 cycles later, if at all.
4. Add 30 μL of ProNex bead solution (1:1, v/v) for dual size selection of the PCR products
(see Note 12) and mix by pipetting ten times. Incubate for 10 min before placing the sample on
the magnetic stand. After 2 min, transfer the cleared supernatant to a clean tube and add
8.4 μL of additional ProNex bead solution (0.28:1, v/v) to the supernatant. Mix by pipetting
ten times and incubate for 10 min on the bench. Place this tube on the magnetic stand and wait
2 min for the beads to clear from the solution. While keeping the tube in the rack, discard the
supernatant by pipetting, without touching the bead layer/ring. Add 200 μL of ProNex wash
buffer and incubate for 30–60 s. Discard the wash buffer by careful pipetting to keep the
beads. Repeat the washing twice. After removing the wash buffer for the second time, air-dry
the remains of the wash buffer for 5 min. Take the tube off the magnetic stand and add 40 μL of
nuclease-free water. Mix well by pipetting or shaking on a plate mixer. After 5 min, transfer
the tube back to the magnetic stand and let it stand for 1 min until the supernatant is clear.
Transfer the eluate to a clean tube.
5. Measure the concentration of each purified product using an accurate benchtop fluorometer.
Pool the samples by combining 10 ng of each PCR product. Check the length distribution of the
pooled sample. If short fragments below 200 bp are still present, the pooled sample needs to be
purified further from short fragments using a magnetic bead-based DNA purification system
before proceeding to Ion Torrent sequencing.
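
The two bead additions of the dual size selection and the pooling volumes can be derived from the ratios and the 10 ng target given above. The following sketch assumes hypothetical sample names and fluorometer readings; the 40 μL limit reflects the elution volume in the protocol.

```python
def dual_size_selection_volumes(pcr_ul: float, first_ratio: float = 1.0,
                                extra_ratio: float = 0.28) -> tuple:
    """Return (first, additional) ProNex bead volumes in uL for one PCR product."""
    return round(pcr_ul * first_ratio, 1), round(pcr_ul * extra_ratio, 1)


def pooling_volumes(conc_ng_per_ul: dict, target_ng: float = 10.0,
                    available_ul: float = 40.0) -> dict:
    """Volume of each purified product needed to contribute target_ng to the pool."""
    volumes = {}
    for name, conc in conc_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"{name}: concentration must be positive")
        volume = target_ng / conc
        if volume > available_ul:
            raise ValueError(f"{name}: needs {volume:.1f} uL, more than was eluted")
        volumes[name] = round(volume, 2)
    return volumes


first_ul, extra_ul = dual_size_selection_volumes(30)

# Hypothetical fluorometer readings in ng/uL, for illustration only.
readings = {"sample_01": 5.0, "sample_02": 2.5, "sample_03": 8.0}
pool = pooling_volumes(readings)

print(f"ProNex: {first_ul} uL, then {extra_ul} uL")
for name, volume in pool.items():
    print(f"{name}: {volume} uL")
```

A sample that raises the volume error is too dilute to contribute 10 ng; whether to pool less of every sample or to re-amplify it is a judgment call that the sketch does not make.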

4 Notes

1. Although ribosomal RNA accounts for a large fraction of microbial RNA, the fraction of small
subunit rRNA (SSU rRNA, 16S/18S rRNA) can be increased through agarose gel size selection. As
RNA is sensitive to degradation by heat, alkaline conditions, and especially RNases, it is
useful to use commercial pre-cast agarose gels and avoid heating. Typically, size selection
increases the share of small subunit rRNA sequences from one fourth to over 50%.
2. It is extremely important to control the concentration of the 80% ethanol when using
magnetic bead purification. As ethanol is hygroscopic, the concentration may decrease during
storage.
3. IonA barcoding primers are composed of IonA, TCAG for signal standardization, a 10–12 base
IonXpress barcode, adapter GAT, and the M13 sequence. Since all IonXpress barcodes (X10) end
with C and the M13 sequence starts with TGTA, adapter GAT can optionally be omitted to shorten
the sequences. In that case, the default IonXpress barcode set has to be modified to recognize
sequences that do not have GAT after the barcode. The design of the Ion Torrent barcodes used
(without GAT) is described at http://www.future-science.com/doi/suppl/10.2144/000114380.
4. If processing eukaryotic messenger RNA or other RNAs containing a 5′-end cap, a
ligation-compatible monophosphate must be created, e.g., by using RNA 5′ pyrophosphohydrolase
(RppH).
5. It is important to cut out a large enough area for the analysis, since SSU rRNA may vary in
size by over 100 bp. Therefore, cut out a large gel band, from 2–3 mm below the visible 16S
rRNA band up to the middle of the region between the 18S and 23S rRNA bands (Fig. 2).
6. The procedure may work even if the concentration is low.
7. We use here about a 2,000-fold molar excess of M13-RNA compared to rRNA. Less M13-RNA can be
used, but this may result in a higher number of rRNA-rRNA self-ligates [8]. M13-RNA should not
self-ligate, due to the OH groups at both the 5′ and 3′ ends created in the synthesis of the
oligonucleotides.

Fig. 2 RNA samples in the E-Gel agarose electrophoresis gel before (a) and after (b) extraction
of the 16S/18S rRNA gel band

8. The ligation reaction contains a high PEG concentration, which may interfere with magnetic
bead purification. Therefore, the viscous sample has to be diluted with nuclease-free water
before continuing to the cleaning step.
9. The reverse transcriptase enzyme can be either ribonuclease H positive (RNase H+, like AMV
reverse transcriptase) or negative (RNase H−, like many M-MuLV-based and engineered enzymes).
The choice can have an effect on the amount and length distribution of the products. However,
the following steps (PCR amplification and size selection) can compensate for the differences.
10. Here we used 45 °C for reverse transcription, but the polymerization temperature can be
increased depending on the reverse transcriptase brand used.
11. It is useful to decrease the volume of bead solution from 1.8× to 1.5× (v/v) of the sample
volume to avoid binding of small fragments. This is especially important if the size selection
step is not used and 5S/5.8S rRNA is still present in the sample.
12. The first step should eliminate PCR fragments that are too long (mix at a 1:1 v/v ratio)
and the second step should eliminate fragments that are too short (mix an additional 0.28:1
v/v), resulting in an average fragment size of about 300–500 bp.
13. In addition to fragments of the desired length (300–500 bp), the graph may show a sharp
by-product (80–90 bp) in the unpurified sample, produced from the primer combinations. However,
the size selection step should remove this artefact (Fig. 3). If remains of the by-product are
seen in the pooled sample, another purification round should be performed using magnetic bead
purification or agarose gel electrophoresis and gel extraction. In addition, better separation
of the small primer dimers can be obtained if the first PCR reaction is done using M13 and P1
primers, and the IonA primer with barcode and M13 tail is used in an additional PCR round after
the ProNex size separation.

Fig. 3 Size distribution of the PCR product before (a) and after (b) the dual size selection
using ProNex (see Note 13). Electropherogram axes: sample intensity [FU] versus size [bp]

Acknowledgments

This work was supported by the Academy of Finland (projects 323063 and 311229) and by the
European Research Council (ERC) under the European Union’s Seventh Framework Programme
(FP/2007-2013, grant agreement No. 615146), both to M.T.

References
1. Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the primary
kingdoms. Proc Natl Acad Sci U S A 74:5088–5090
2. Mäki A, Salmi P, Mikkonen A, Kremp A, Tiirola M (2017) Sample preservation, DNA or RNA
extraction and data analysis for high-throughput phytoplankton community sequencing. Front
Microbiol 8:1848
3. Carini P, Marsden PJ, Leff JW, Morgan EE, Strickland MS, Fierer N (2016) Relic DNA is
abundant in soil and obscures estimates of soil microbial diversity. Nat Microbiol 2:16242
4. Machida RJ, Lin YY (2014) Four methods of preparing mRNA 5′ end libraries using the Illumina
sequencing platform. PLoS One 9(7):e101812
5. Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, Shi X, Jacques J,
Lancaster MA, Pan JQ, Regev A, Levin JZ (2018) Comprehensive comparative analysis of 5′-end
RNA-sequencing methods. Nat Methods 15(7):505–511
6. Karst SM, Dueholm MS, McIlroy SJ, Kirkegaard RH, Nielsen PH, Albertsen M (2018) Retrieval of
a million high-quality, full-length microbial 16S and 18S rRNA gene sequences without primer
bias. Nat Biotechnol 36:190–195
7. Mäki A, Rissanen AJ, Tiirola M (2016) A practical method for barcoding and size-trimming PCR
templates for amplicon sequencing. BioTechniques 60:88–90
8. Mäki A, Tiirola M (2018) Directional high-throughput sequencing of RNAs without
gene-specific primers. BioTechniques 60:219–223
Chapter 8

Computational Approaches for Designing Highly Specific and Efficient sgRNAs

Jaspreet Kaur Dhanjal, Dhvani Vora, Navaneethan Radhakrishnan, and Durai Sundar

Abstract
The easily programmable CRISPR/Cas9 system has found applications in biomedical research as well as
microbial and crop applications, due to its ability to create site-specific edits. This powerful and flexible
system has also been modified to enable inducible gene regulation, epigenome modifications and high-
throughput screens. Designing efficient and specific guides for the nuclease is a key step and also a major
challenge in effective application. This chapter describes rules for sgRNA design and important features to
consider while touching upon bioinformatics advances in predicting efficient guides. Computational tools
that suggest improved guides, depending on application, or predict off-targets have also been mentioned
and compared.

Key words CRISPR/Cas9, sgRNA, Knockin, Knockout, Target, Off-target

1 Introduction

The CRISPR/Cas9 system is the latest genome editing technology, which has been established as a
promising tool for genome engineering of unicellular and multicellular organisms [1, 2]. Due to
its simplicity and versatility compared with other existing genome editing tools, this
technology has received ample attention in academic as well as industrial setups worldwide. It
has been proven to have vast applications in fundamental research (mainly functional genomics)
[3–5], medicine (e.g., cancer characterization, gene therapy, pharmacological studies, and stem
cell technology) [6–12], agricultural biotechnology (crop improvement programs with respect to
yield improvement, stress tolerance, and disease resistance) [13–15], and industrial
biotechnology (enriched production of bio-products, metabolic control of cells, and strain
typing) [16–18].

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_8, © Springer Science+Business Media, LLC, part of Springer Nature 2022


In general, genome editing tools are engineered nucleases that can be used to introduce
double-strand breaks (DSBs) at a targeted genomic location, through which fragments can be
inserted, deleted, or replaced in the genome. The existing nuclease systems include
meganucleases, zinc finger nucleases (ZFNs), transcription activator-like effector-based
nucleases (TALENs), and, most recently, the CRISPR/Cas9 system. Since its advent, the
CRISPR/Cas9 system has become the most convenient tool of choice. The reason is that the
CRISPR/Cas9 system relies on Watson-Crick base pairing between RNA and DNA for specific
targeting, whereas the other systems are based on protein–DNA binding [1, 2].
The CRISPR/Cas9 system was first characterized in Streptococcus pyogenes as part of the
bacterial defense system against invading viral DNA. CRISPR (Clustered Regularly Interspaced
Short Palindromic Repeats) is a type of interspaced repeat found in bacterial DNA, within which
segments of exogenous viral DNA are contained. These viral segments are captured into CRISPR
spacer regions and are retained as an immune memory. These CRISPR regions in bacterial genomic
DNA, when transcribed and processed, become 20-nucleotide crRNAs (CRISPR RNAs) that complex
with a small non-coding tracrRNA (trans-activating CRISPR RNA) and the endonuclease Cas9
(CRISPR-associated nuclease 9) to form CRISPR/Cas9 complexes. Upon viral invasion, the crRNA
guides Cas9 to the complementary region in the viral DNA and binds to it through crRNA–viral
DNA complementarity. This RNA–DNA binding activates Cas9, and a DSB is made in the viral DNA.
DSBs made in the viral DNA by different CRISPR/Cas9 complexes lead to its degradation. The
multitude of sampled exogenous DNA sequences incorporated in the CRISPR regions of the
bacterial genome makes the bacterium immune to diverse viral invasions.
This RNA-guided endonuclease system is harnessed in genome engineering to perform site-specific
cleavage in the genome of target cells. The crRNA and tracrRNA are combined into a single-guide
RNA (sgRNA). The sgRNA is designed with a 20-nucleotide region that is complementary to the
sequence of the desired region in the target DNA. This sgRNA complexes with Cas9 and guides it
to the target site in the DNA, followed by sgRNA–DNA complementary binding and cleavage
(Fig. 1). It is essential that the 20-nucleotide stretch of target DNA has a protospacer
adjacent motif (PAM) immediately downstream (3′) of it for CRISPR/Cas9 activity. “NGG” (any
nucleotide followed by two guanines) is the canonical PAM for the Cas9 identified in
Streptococcus pyogenes; Cas9 nucleases from different organisms recognize different PAMs
[1, 2, 19]. A schematic representation of the mechanism of action of the CRISPR/Cas9 system is
shown in Fig. 2.
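
The targeting requirement described above, a 20-nucleotide protospacer directly followed by an NGG PAM, can be turned into a simple sequence scan. The following is a minimal sketch in Python and not one of the published design tools; the function names and the example sequence are made up for illustration, and positions on the minus strand are reported in the coordinates of the reverse-complemented sequence.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_targets(seq: str, guide_len: int = 20) -> list:
    """List protospacers that are directly followed by an NGG PAM, on both strands."""
    # The lookahead lets overlapping candidate sites be reported as well.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    hits = []
    for strand, scanned in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for match in pattern.finditer(scanned):
            start = match.start()
            hits.append({
                "strand": strand,
                "start": start,                   # 0-based, on the scanned strand
                "protospacer": match.group(1),
                "pam": match.group(2),
                # The break lies 3 bp upstream of the PAM, i.e. just before this index.
                "cut_before": start + guide_len - 3,
            })
    return hits


# Made-up example: flank + 20-nt protospacer + TGG PAM + flank.
example = "TTT" + "GACGTTAGCATCGATCGTAC" + "TGG" + "AAA"
for hit in find_targets(example):
    print(hit)
```

A scan like this only lists candidate sites; ranking them by predicted efficiency and off-target risk is the subject of the remaining sections.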

Fig. 1 Structure of CRISPR/Cas9 system. It is a two-component system. The 20-nucleotide
protospacer sequence of CRISPR RNA guides the non-specific endonuclease, Cas9, to its target
DNA within the genome. HNH and RuvC are the two nuclease domains of Cas9 that cleave the target
strand and non-target strand, respectively. This results in a double-strand break just 3 bp
upstream of the NGG PAM (marked by red triangles)

Though the guiding sequence of the sgRNA is expected to bind specifically to a unique target
site, the Cas9 endonuclease sometimes binds to unintended locations that differ from the target
site by a few nucleotides. Thus, choosing an appropriate target and designing a highly specific
sgRNA for it determine the success of any genome editing experiment. Here, we focus on the
computational techniques employed for designing highly specific sgRNAs in CRISPR/Cas9-based
experiments.

2 Possible Genetic Perturbations

A double-strand break made by a CRISPR/Cas9 complex is repaired by one of the two inherent DNA
repair mechanisms of the cell: homology-directed repair (HDR), when a homologous DNA is
present, or non-homologous end joining (NHEJ), in the absence of a template. NHEJ refers to
direct ligation of the broken ends and is error-prone; it often leads to indels. HDR uses a
homologous DNA as the template for repair. Both of these mechanisms are present in almost all
cells. In genetic engineering, NHEJ is preferred if random indels have to be created, and HDR
is preferred if an exogenous DNA fragment has to be inserted. Based on the type of CRISPR/Cas9
complex used and the DNA repair mechanism involved, broadly three types of genetic
perturbations are possible: gene knockout, gene knockin, and gene expression control
[1, 2, 20, 21].

Fig. 2 Schematic representation of mechanism of action of CRISPR/Cas9 system. Once the
CRISPR/Cas9 system binds to the target DNA, Cas endonuclease identifies the PAM on the
non-target strand, initiating the cleavage activity. Formation of a double-strand break is
followed by its repair, which uses either non-homologous end joining (NHEJ) or homologous
recombination (HR). NHEJ involves simple re-ligation of the broken ends but results in loss or
gain of nucleotides. HR requires a correct template to repair the break by synthesizing new
nucleotides between the broken edges

2.1 Gene Knockout

Gene knockout refers to disruption of the expression of a genetic locus. CRISPR/Cas9-based
knockout is widely used in functional genomics and drug screening. Many research groups have
reported successful CRISPR/Cas9 knockout screening in human cells [4, 12, 22, 23]. Beyond
mammalian systems, the CRISPR/Cas9 system is also widely used to manipulate the genomes of
microorganisms [24–27].
Successful CRISPR/Cas9-based gene knockouts have been reported in different cell types in
different organisms [22, 28]. Generation of indels in a gene by NHEJ after DSBs leads to
knockout of the gene. However, it is common in CRISPR/Cas9-based knockout experiments for the
gene to remain functional because of in-frame variants generated after NHEJ. As micro-homology
of the DNA sequence plays a significant role in the generation of in-frame variants, the
success of a CRISPR/Cas9 knockout depends on the micro-homology profile of the target gene
[29]. sgRNAs should be designed accordingly for an efficient gene knockout. The nucleotide
sequence downstream of the protospacer also plays a key role in determining knockout
efficiency [30].
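
Whether an NHEJ-derived indel shifts the reading frame depends on whether its net length change is a multiple of three. The sketch below illustrates this with made-up indel lengths; it is a simplification, since an in-frame indel can still inactivate a protein and a frameshift does not always abolish function.

```python
def classify_indel(net_length_change: int) -> str:
    """Classify an indel by its net change in coding-sequence length (bp)."""
    if net_length_change == 0:
        return "no net length change"
    return "in-frame" if net_length_change % 3 == 0 else "frameshift"


# Hypothetical net length changes observed at one target site (negative = deletion).
observed = [-1, 1, -3, -7, 6]
calls = [classify_indel(n) for n in observed]
frameshift_fraction = calls.count("frameshift") / len(calls)

for n, call in zip(observed, calls):
    print(f"{n:+d} bp: {call}")
print(f"Fraction of frameshift alleles: {frameshift_fraction:.2f}")
```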

2.2 Gene Knockin

Gene knockin refers to the introduction or substitution of a DNA sequence in a genetic locus.
CRISPR/Cas9-based gene knockin is achieved by exploiting the HDR mechanism of the cell. An
exogenous DNA fragment, which contains the sequence to be inserted or substituted and the
sequence flanking the cleaved site of genomic DNA, is introduced into the cell. Once the
CRISPR/Cas9 complex cleaves at the target site, the HDR mechanism of the cell uses this
exogenous fragment as a template and incorporates the sequence information into the DSB. In
cells, NHEJ is more frequent than HDR; however, inhibition of NHEJ is reported to increase the
frequency of HDR [31]. CRISPR/Cas9-based knockin can be used for introducing or substituting
genes, generating variants of existing genes, and correcting aberrant genes. A frameshift
mutation of the dystrophin gene in Duchenne muscular dystrophy was corrected in patient-derived
iPSCs by CRISPR/Cas9-based gene knockin [32]. Another study reported successful production of
human albumin in pigs by knocking human albumin cDNA into the swine albumin locus in zygotes
using CRISPR/Cas9 technology [33]. This approach has further led to the generation of various
disease models [34, 35]. A CRISPR/Cas9 knockin-based strategy has also been used to activate
silent biosynthetic gene clusters in streptomycetes for the production of unique metabolites
[36]. It is also being used for metabolic engineering in cyanobacteria for the production of
various compounds, paving the way for many interesting industrial applications [37].

2.3 Gene Expression Control

The CRISPR/Cas9 complex can also be used to specifically regulate the expression of desired
genes. A modified CRISPR/Cas9 system called CRISPRi (CRISPR interference) has been proposed for
gene silencing [38]. The CRISPRi system consists of an sgRNA complexed with a catalytically
dead Cas9 (dCas9), which lacks endonuclease activity. This functions as a simple RNA-guided
DNA-binding system. When bound to a gene, it inhibits transcription by interfering with
transcription elongation and with the binding of RNA polymerase and transcription factors.
Similarly, the CRISPRa (CRISPR activation) system, an sgRNA-dCas9 complex coupled with
different transcriptional activation domains, upregulates the expression of target genes when
targeted to promoter regions [39, 40].

3 sgRNA Design Rules

To improve screening libraries, and also for small-scale gene-editing experiments, design
criteria that improve sgRNA efficacy are of great value. Identifying sequence features within
and around the target site that predict sgRNA efficacy would also serve this goal. Studies in
the past few years have shed some light on the various factors that could play a role in
determining the activity of the RNA-guided nuclease Cas9. Some of the features that can be used
for designing high-efficiency sgRNAs are discussed in the following sections.

3.1 Sequence Profile of the Target DNA and sgRNA

The 3′ end of the sgRNA sequence plays a major role in determining Cas9 loading. RNAs that have
purines as the last four nucleotides of the spacer are preferred for binding by Cas9 over those
having pyrimidines. In human and murine cell lines, guanines were found to be preferred at the
−1 and −2 positions proximal to the PAM sequence for efficient sgRNAs (Fig. 1 shows the
numbering pattern used to specify the nucleotide positions in the sgRNA). Cytosine is preferred
at the −3 position, which is the DNA cleavage site of the CRISPR/Cas9 complex [41]. Thymines
are disfavored at the +4/−4 positions closest to the PAM [42]. These nucleotide preferences
might contribute to the efficiency of cleavage or to the introduction of mutations upon DNA
repair. Also, adenines are preferred from position −5 to −12, and guanines are preferred at
positions −14 to −17 [42]. Such a bias was also seen against guanine and for cytosine at
position 16.
Apart from position-specific propensity, the overall GC content of the guide RNA is also an
important factor in determining the efficiency of the system. The GC content of a stretch of
nucleic acid determines its melting temperature. Melting is an important process for DNA
interrogation by the guide. This ATP-independent process of RNA invasion starts with PAM
binding, which leads to duplex distortion facilitating local DNA melting [43, 44].
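
Some of the sequence features discussed in this section can be tabulated for a candidate spacer with a few lines of code. The sketch below is illustrative only and is not a validated scoring model: it assumes that position −1 is the spacer nucleotide adjacent to the PAM (so that it corresponds to the last character of the 5′→3′ spacer string), and the function names and example spacer are made up.

```python
def gc_content(spacer: str) -> float:
    """Fraction of G and C nucleotides in the spacer."""
    spacer = spacer.upper()
    return (spacer.count("G") + spacer.count("C")) / len(spacer)


def spacer_features(spacer: str) -> dict:
    """Report a few PAM-proximal features of a 20-nt spacer written 5'->3'."""
    spacer = spacer.upper()
    if len(spacer) != 20 or set(spacer) - set("ACGT"):
        raise ValueError("expected a 20-nt spacer made of A, C, G, and T")
    return {
        "gc_fraction": round(gc_content(spacer), 2),
        "purines_in_last_four": sum(base in "AG" for base in spacer[-4:]),
        "g_at_minus_1": spacer[-1] == "G",  # assumed PAM-adjacent position
        "g_at_minus_2": spacer[-2] == "G",
        "c_at_minus_3": spacer[-3] == "C",
    }


# Made-up spacer for illustration.
features = spacer_features("ATGCATGCATGCATGCACGG")
for name, value in features.items():
    print(f"{name}: {value}")
```

How such features should be weighted is not specified in the text; the published prediction tools learn the weights from screening data rather than applying fixed rules.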
Computational Approaches for Designing Highly Specific and Efficient sgRNAs 153

3.2 PAM and the Flanking Sequence

Recent studies have reported nucleotide preferences for active
sgRNAs/target sequences not only at positions across the
20-nucleotide target but also in the regions flanking it. The
position immediately adjacent to the PAM was found to prefer specific
nucleotides. Guanine was reported to be strongly disfavored at
position 1, suggesting that an extended PAM sequence of CGGH (where H
is A, C, or T) is optimal when using S. pyogenes Cas9 to engineer
genomic DNA in cells. The same study reported a bias for cytosine and
against thymine in the variable nucleotide of the PAM [30].
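The extended-PAM observation can be expressed as a simple pattern check. The sketch below, in Python, classifies a four-nucleotide string made of the NGG PAM plus the single nucleotide that follows it. The function name and the three labels are hypothetical and serve only to illustrate the CGGH rule described above.

```python
import re

# Any NGG PAM followed by one more nucleotide.
NGG_PLUS_ONE = re.compile(r"^[ACGT]GG[ACGT]$")
# Extended PAM reported as optimal in the text: C in the variable
# position and H (A, C, or T, i.e. not G) right after the PAM.
CGGH = re.compile(r"^CGG[ACT]$")


def classify_extended_pam(pam_plus_one: str) -> str:
    """Classify a 4-nt string: the 3-nt PAM plus the next nucleotide."""
    site = pam_plus_one.upper()
    if not NGG_PLUS_ONE.match(site):
        return "no NGG PAM"
    if CGGH.match(site):
        return "extended PAM CGGH"
    return "canonical NGG only"


if __name__ == "__main__":
    for s in ("CGGA", "TGGG", "CAGA"):
        print(s, "->", classify_extended_pam(s))
```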

3.3 Location Targeted Within the Gene

The efficacy of a guide RNA is also related to the location of the
cut site and the location of the target site within the gene. A study
carried out in 2014 showed that the activity of sgRNAs targeting
close to the C-terminus is diminished, as frameshift mutations near
the end of a protein are comparatively less likely to disrupt
expression. Certain gene-specific patterns were also observed; for
instance, the N-terminus of CD15 was a less-effective target site,
probably reflecting local chromatin accessibility [30]. Targeting
more than one site per gene would presumably help compensate for such
limitations. As expected, the activity quickly decreased as a
function of distance from the nearest CDS, and the sgRNAs targeting
the 5′ and 3′ UTRs were ineffective [30]. Although sgRNAs are
popularly designed to target the CDS, designing sgRNAs with an
expected cut site exactly at an exon–intron boundary, so as to
disrupt splicing, can be efficacious and may be particularly useful
when it is desirable to reintroduce the CDS, such as for phenotype
rescue experiments. A recent study also showed that sgRNAs targeting
the transcribed strand are less effective than those targeting the
non-transcribed strand [45].

3.4 Structural Accessibility

Structural accessibility of the RNA has been shown to play an
important role in RNA-guided sequence recognition by siRNA and miRNA,
and hence may also play a role in the CRISPR/Cas9 system. The
important determinants are secondary structure, self-folding free
energy, and the accessibility of individual nucleotides of the
sgRNAs [46–48].
To form a functional RNA-protein complex, the crRNA needs to interact
with the Cas9 protein, which is facilitated by the well-defined
structural motifs of the tracrRNA region. The important nucleotides
are at positions 21–50, which form a stable stem-loop structure. It
was also seen that sgRNAs with very high or very low G/C content tend
to be less active. It has also been reported that nucleotides at
positions 51–53 pair with nucleotides at positions 18–20 of the guide
RNA, forming an extended stem loop that decreases base accessibility
of the “seed” region (the first 10–12 nucleotides adjacent to the
PAM), which is critical for binding, activation, and target site
recognition.
154 Jaspreet Kaur Dhanjal et al.

The propensity to form secondary structures is determined by the
self-folding free energy of the sgRNA sequence. Non-functional guide
sequences are reported to have significantly larger potential for
self-folding than functional guides (ΔG = −3.1 and −1.9,
respectively), which links the structural accessibility of the sgRNA
to its functionality. The non-functional sgRNA sequences were also
predicted to form more stable duplexes with the target DNA sequence
than functional ones (ΔG = −17.2 and −15.7, respectively) [49].
A contiguous stretch of identical nucleotides, i.e., repetitive
bases, can be associated with poor efficiency of DNA oligo synthesis,
and functional sgRNAs were found to be significantly depleted of
repetitive bases. In particular, four contiguous guanines (GGGG) were
correlated with poor activity. This may be explained by the guanine
tetrad, a special secondary structure formed by repetitive guanines
that makes the sgRNA less accessible for target sequence
recognition [49].
In the “seed” region of the guide, three repetitive uracils
(UUU) could be responsible for decreased crRNA activity
[50]. There exists a strong bias against U and C at the end of
functional sgRNAs as they have a strong tendency to pair with
AAG at positions 51–53 of the crRNA, forming an extended stem
loop structure [30]. U-rich seeds are also likely to result in
decreased sgRNA abundance and increased specificity, since multiple
U’s in the sequence can induce termination of sgRNA transcription
[45, 50]. This mechanism cannot be extended to explain the bias
against thymine in the PAM, however, as this thymine is a feature of
the DNA target site and is not included in the sgRNA transcript.
Although the targeting specificity of Cas9 is believed to be tightly
controlled by the 20-nt guide sequence of the sgRNA and the presence
of a PAM adjacent to the target sequence in the genome, off-target
cleavage can still occur at DNA sequences with as many as three to
five base-pair mismatches in the PAM-distal part of the
sgRNA-guiding sequence [51, 52].
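The repeat-related observations in this section (GGGG anywhere in the guide, UUU in the seed) lend themselves to a simple screen. The Python sketch below is one possible way to flag them. The function name, the returned keys, and the default seed length of 12 nt (taken from the 10–12 nt range quoted above) are illustrative assumptions; the seed is taken as the PAM-proximal, i.e. 3′, end of the spacer.

```python
import re


def flag_repeats(spacer: str, seed_len: int = 12) -> dict:
    """Flag homopolymer features of a spacer written 5'->3'.

    The seed is assumed to be the last `seed_len` nucleotides,
    i.e. the PAM-proximal end of the spacer.
    """
    rna = spacer.upper().replace("T", "U")
    seed = rna[-seed_len:]
    # Each regex match is one maximal run of a single repeated base.
    runs = [len(m.group(0)) for m in re.finditer(r"(.)\1*", rna)]
    return {
        "has_GGGG": "GGGG" in rna,
        "seed_has_UUU": "UUU" in seed,
        "longest_run": max(runs) if runs else 0,
    }


if __name__ == "__main__":
    print(flag_repeats("ACGTACGTGGGGACGTTTTA"))
    print(flag_repeats("ACGTACGTACGTACGTACGT"))
```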

3.5 Microhomology Profile

The presence of the same short sequence of bases in different genomic
regions is called microhomology. Its pattern is related to
microhomology-mediated end-joining (MMEJ), otherwise called
alternative NHEJ (Alt-NHEJ), which is one of the DSB repair pathways
in the cell. The distinguishing property of MMEJ is the use of
microhomology of 5–25 bp during the alignment of broken ends before
joining, which causes deletions of sequences flanking the break.
Microhomology sequence profiles generally correlate with in-frame
mutations. Because CRISPR-based knockouts often generate in-frame
variants that retain functionality and thereby reduce knockout
efficiency, these profiles are used to predict sgRNA efficacy. In one
study, microhomology was used for in silico target site selection,
which improved the CRISPR-based knockout efficiency in human cell
lines by decreasing the number of in-frame mutations [29].
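To make the link between microhomology and reading frame concrete, the following Python sketch enumerates identical motifs on either side of a cut site and records whether the deletion each pair would produce is a multiple of three. It is a naive illustration only: it is not the scoring scheme of the study cited above, it counts nested motifs separately, and the function names and default length bounds (set to the 5–25 bp range quoted in the text) are assumptions.

```python
def microhomology_deletions(seq, cut, min_len=5, max_len=25):
    """List (motif, deletion_length, in_frame) for motifs found on both
    sides of a cut placed between seq[cut - 1] and seq[cut]."""
    seq = seq.upper()
    left, right = seq[:cut], seq[cut:]
    events = []
    for k in range(min_len, max_len + 1):
        for i in range(len(left) - k + 1):
            motif = left[i:i + k]
            j = right.find(motif)
            while j != -1:
                # Joining at the motif keeps one copy and removes the
                # other copy plus everything between the two.
                deletion = (cut + j) - i
                events.append((motif, deletion, deletion % 3 == 0))
                j = right.find(motif, j + 1)
    return events


def out_of_frame_fraction(events):
    """Fraction of predicted deletions that shift the reading frame."""
    if not events:
        return None
    return sum(not in_frame for _, _, in_frame in events) / len(events)


if __name__ == "__main__":
    ev = microhomology_deletions("TTGACCAAGACTT", cut=6, min_len=2, max_len=8)
    for motif, size, in_frame in ev:
        print(motif, size, "in-frame" if in_frame else "out-of-frame")
    print(out_of_frame_fraction(ev))
```

Under this simplification, a target site with a higher out-of-frame fraction would be the better knockout candidate.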

3.6 Chromatin Characteristics and Epigenetic Features

Local chromatin structure and epigenetic modifications have recently
been recognized as major factors affecting the ability of Cas9 to
find the PAM and begin to bind DNA with the seed region of the sgRNA.
There is a multitude of “seed + NGG” sites in the genome, yet <1% are
bound by Cas9, and most of the matches are in promoters, enhancers,
or genes, indicating that chromatin accessibility may be a major
determinant of Cas9 binding [50].
It has been observed that SpCas9 efficiently cleaves the target
genome regardless of CpG methylation status in either the 20-bp
target sequence or the PAM in a cell-free cleavage assay. In a study
testing the efficiency of sgRNAs targeted to a highly methylated
region of the human genome, the tested sgRNAs were able to mediate
indel mutations in endogenously methylated targets [52]. However,
another study reported that CpG methylation of DNA may impede binding
of Cas9 and of small molecules that enhance CRISPR genome editing by
promoting precise genome editing via HDR or sequence-specific gene
knockout via NHEJ [50, 53].
In the CRISPR/Cas system, it is assumed that epigenetic features
mainly affect the affinity between the sgRNA and target DNA. This may
serve as a pre-condition to trigger the NHEJ pathway to generate
indels, thus playing an important role in determining the final
cleavage effect [54]. Although epigenetic and chromatin features may
strongly influence the CRISPR/Cas system, they remain poorly
investigated. These features are also highly cell type-specific and
therefore deserve focused attention from researchers.

3.7 Experimental Conditions

Various approaches have been reported to increase the specificity of
the CRISPR/Cas9 system by minimizing off-target cleavage events.
These mainly include guide RNA truncations [55], extensions [56] at
the 5′ ends of sgRNAs, co-localizing paired nickase mutants of Cas9
[57], fusing catalytically inactive Cas9 (dCas9) to the
dimerization-dependent nuclease FokI [58, 59], and engineering
high-fidelity mutants of the Cas9 protein [60–62]. Other approaches
include controlling the duration of CRISPR activity in target cells
by delivering Cas9 and sgRNA as a ribonucleoprotein complex (sgRNP)
by electroporation or with cationic lipids [63, 64], by
spatiotemporal control [65, 66], or by adding a CRISPR–Cas9 inhibitor
in a time-dependent manner [67]. Among these methods, transient
delivery of the RNP complex has gained extensive favor, while the
other methods are yet to be widely adopted [68].

A recent study reported the effect of experimental conditions on the
efficiency of the sgRNAs used [69]. The number of cleaved sites
detected increased as a function of nuclease concentration, with
fewer sites recovered at the lowest concentration and thousands of
sites recovered at the highest concentration. Importantly, the
on-target site could be recovered even at the lowest concentration of
the nuclease (most often at 0.25 nM sgRNP). The correlation between
positional specificity and sgRNP concentration seemed to diverge in a
protospacer sequence-dependent manner. A contiguous stretch of at
least 13 bp between the crRNA and the target DNA site proximal to the
PAM is critical for interaction and subsequent cleavage. This is
called the “seed” region, where mismatches are rarely tolerated [43].
For some guides, specificity at the 3′ end was preserved at all
concentrations, in compliance with the “seed sequence” model of Cas9
sgRNA specificity; however, for other guides, this trend was less
clear [69]. Nonetheless, direct delivery of the preassembled sgRNP
resulted in the least off-target activity, whereas prolonged
expression produced an increase in off-target editing in a
time-dependent manner, although only for a subset of sites [69].
Evidently, off-target editing is multifactorial, and computational
and biochemical approaches to identify potential off-targets,
followed by detailed cell-based experiments, may be the only way to
understand specificity for a given genome-modifying application.

4 Role of Computational Tools in Designing sgRNAs

The success of all CRISPR/Cas9-based experiments depends heavily on
the efficiency and specificity of the sgRNA designed to target the
genomic location of interest. The sgRNA must possess high on-target
activity with minimal off-target action. Designing multiple sgRNA
candidates and then choosing the most competent one for use in the
experiments is a waste of time and resources. Computational methods
can effectively aid in this selection and have become imperative for
choosing the most appropriate genomic targets and designing sgRNAs
for them.
These in silico tools help in integrating the experimentally
obtained genome editing data to derive generalized sgRNA design
rules that can be applied for carrying out different types of genetic
perturbations. The successful advancements in CRISPR/Cas-based
genome editing tools are leading to great accumulation of such data
and encouraging continuous efforts for refining the existing tools
and algorithms.

Many tools are available today for identifying optimal sgRNAs that
target unique locations within the genomes of various organisms and
for predicting their on-target efficiency. Table 1 lists some of the
representative tools for computational design of sgRNAs.

5 Computational Approaches for Designing sgRNA with High On-Target Efficiency

5.1 sgRNA Design Tools for Knockout and Knockin Experiments

As discussed in the previous section, the success of CRISPR-based
gene knockout experiments largely depends on the nucleotide sequence
of the target region, its base pairing with the designed sgRNA, the
location of the cut site within the gene, the generation of indels by
NHEJ, and the chromatin structure. The microhomology sequence profile
further helps in predicting the chances of in-frame mutations, which
may reduce the efficacy of the knockout experiment by retaining
partial functionality of the gene. Apart from complete knockout of
genes, the CRISPR/Cas system can also be used to regulate their
expression by fusing a catalytically inactive Cas nuclease to
transcriptional activators or repressors. The sgRNAs designed for
transcriptional regulation target gene promoters rather than the
coding regions, and therefore the sequence features governing the
specificity of such sgRNAs differ from those used for designing
sgRNAs for knockout experiments. Owing to the insufficient data
pertaining to CRISPRi/a experiments, the sequence context of such
sgRNAs is poorly understood, and not many tools exist for CRISPRi/a
sgRNA design. CRISPR-ERA [70] and SSC [71] are two tools belonging to
this category.
The prediction tools can be broadly categorized into three groups:
(1) sequence-alignment-based tools, (2) hypothesis-driven tools, and
(3) machine learning-based tools [54]. Alignment-based tools work by
simply locating 20-nucleotide sequences followed by a PAM in the
region of interest within the genome. This is followed by a search
for other genomic sites similar to each candidate to probe its
possible off-targets. The hypothesis-driven approaches are an
extension of these alignment-based tools. After the search for
probable sgRNA binding sites using alignment methods, the efficacy
potential of each site is scored and ranked according to various
parameters listed in the design rules. Learning approaches make use
of the ability of machines to mine a given dataset for the factors
that are important for estimating the response variable. Using the
extracted features, models are then developed that can learn, grow,
and change when exposed to new data.
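The first step shared by all three categories, locating 20-nucleotide sequences followed by a PAM, can be sketched in a few lines of Python. The example below assumes the NGG PAM of S. pyogenes Cas9 and scans both strands of a region; the function names and the returned fields are hypothetical and do not correspond to any of the tools listed in Table 1.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidate_sites(region: str, spacer_len: int = 20):
    """Find every spacer + NGG site on both strands of `region`.

    A lookahead is used so that overlapping sites are all reported.
    For the "-" strand, `start` is an offset into the reverse
    complement of the region, not into the region itself.
    """
    region = region.upper()
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})([ACGT]GG))")
    hits = []
    for strand, seq in (("+", region), ("-", reverse_complement(region))):
        for m in pattern.finditer(seq):
            hits.append({
                "strand": strand,
                "start": m.start(),
                "spacer": m.group(1),
                "pam": m.group(2),
            })
    return hits


if __name__ == "__main__":
    for hit in find_candidate_sites("A" * 20 + "TGG"):
        print(hit)
```

Hypothesis-driven and learning-based tools would then score each hit; the enumeration itself is the same.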
Table 1
List of computational resources for designing sgRNAs

Tool                     URL

Alignment-based tools for sgRNA designing
CCTop                    http://crispr.cos.uni-heidelberg.de/
CasFinder                http://arep.med.harvard.edu/CasFinder/
Cas-OFFinder             http://casoffinder.snu.ac.kr/
sgRNAcas9                http://www.biootools.com/

Hypothesis-driven sgRNA design tools
Protospacer Workbench    www.protospacer.com
E-CRISP                  http://www.e-crisp.org/E-CRISP/index.html
CHOPCHOP                 https://chopchop.rc.fas.harvard.edu/
CROP-IT                  http://cheetah.bioch.virginia.edu/AdliLab/CROP-IT/homepage.html
CRISPR-ERA               http://CRISPR-ERA.stanford.edu
CRISPcut                 https://web.iitd.ac.in/crispcut/webserver/index.html

Learning-based sgRNA design tools
sgRNA Designer           http://www.broadinstitute.org/rnai/public/analysis-tools/sgrna-design
sgRNA Scorer             http://crispr.med.harvard.edu/sgRNAScorer
CRISPR multitargeter     http://www.multicrispr.net/
CRISPRscan               http://www.crisprscan.org/

Species-specific sgRNA design tools
CRISPR-P                 http://cbi.hzau.edu.cn/crispr/
flyCRISPR                http://tools.flycrispr.molbio.wisc.edu/targetFinder/
EuPaGDT                  http://grna.ctegd.uga.edu/index.html

Tools for posterior analysis of CRISPR/Cas9-based experimental data
MAGeCK-VISPR             https://bitbucket.org/liulab/mageck-vispr/
caRpools                 http://github.com/boutroslab/caRpools
CRISPResso               http://crispresso.rocks/
CRISPR-GA                http://crispr-ga.net

sgRNA databases
WGE                      http://www.sanger.ac.uk/htgt/wge/
Cas-Database             http://www.rgenome.net/cas-database/
CrisprGE                 http://crdd.osdd.net/servers/crisprge/
CRISPRdb                 http://crispr.u-psud.fr/crispr

With so many tools available for designing highly specific and
efficient sgRNAs, choosing the most appropriate tool is a major
challenge (see Note 1). The majority of the existing tools fall in
the category of alignment-based prediction tools, and most of them
take into consideration only up to three mismatches, which does not
conform
to the real scenario [72–74]. The algorithms used for exploring
off-targets with mismatches, such as Bowtie, BWA, and BatMis, also
limit the efficiency of prediction [72–76]. These alignment
algorithms are in general meant to find near-exact matches of short
sequences within a large genome, and hence they are more likely to
miss potential off-targets with a larger number of mismatches. Thus,
choosing an appropriate alignment algorithm becomes important.
Further, the presence of DNA and RNA bulges in the sgRNA–DNA duplex
also increases the tolerance for mismatches. Tools like Cas-OFFinder
[77] and COSMID [78] give an option to incorporate gaps (serving as
bulges) while searching for off-targets. However, they have their own
disadvantages: first, an inadequate alignment algorithm and, second,
a limit of one nucleotide on the bulge size, although bulges of up to
four nucleotides have already been reported [79].
Though different tools are expected to serve differently depending
upon the objective of prediction, hypothesis-driven and
learning-based approaches have, as expected, been shown to perform
better than the simple alignment-based methods [48]. This is
attributed to their algorithms, which take into consideration
sequence preferences and chromatin organization. The machine learning
methods also provide the advantage of in-depth sequence profiling,
which might be a limiting factor in the case of manual curation.
sgRNA Designer, a machine learning-based tool, was reported to
perform the best for human and mouse cell lines among 36 different
tools compared in one study [80]. It ranks and picks sgRNAs for the
desired genomic locus using rule sets made for maximizing on-target
activity and minimizing off-target activity. Tsai et al. in 2015
developed a protocol called GUIDE-seq for the unbiased genome-wide
detection of off-targets [81]. The authors also compared their
experimental results with the predictions of two in silico tools,
namely MIT CRISPR and E-CRISP. Yet another study compared the
performance of COSMID with several other prediction tools [82]. These
comparative studies show wide disparity between the predicted and the
experimentally obtained off-target sites [81, 82]. This may be
explained by the fact that sites predicted purely on the basis of
mismatches fail to capture the intrinsic off-target mechanisms.
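The mismatch-only strategy criticized above is easy to state in code, which also makes its limits visible. The Python sketch below performs a brute-force scan of the forward strand for NGG-adjacent sites within a given number of mismatches of a spacer. It is an illustration under stated assumptions: the function names are hypothetical, no bulges are considered, only one strand is scanned, and real tools rely on indexed aligners rather than a linear scan.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_off_targets(spacer: str, genome: str, max_mismatches: int = 3):
    """Return (position, site, pam, mismatches) for every forward-strand
    site followed by an NGG PAM that differs from `spacer` by at most
    `max_mismatches` substitutions. Gaps (bulges) are not modeled."""
    spacer, genome = spacer.upper(), genome.upper()
    n = len(spacer)
    hits = []
    for i in range(len(genome) - n - 2):
        site = genome[i:i + n]
        pam = genome[i + n:i + n + 3]
        if pam[1:] != "GG":
            continue
        mm = hamming(spacer, site)
        if mm <= max_mismatches:
            hits.append((i, site, pam, mm))
    return hits


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"
    toy_genome = guide + "AGG" + "ACGTACGTACGTACGTACGA" + "TGG"
    for hit in mismatch_off_targets(guide, toy_genome):
        print(hit)
```

Raising `max_mismatches` or adding gapped comparison quickly increases the search cost, which is one reason alignment-based tools cap the number of mismatches they consider.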
Furthermore, factors such as sgRNA structure and chromatin
accessibility, which also influence Cas9 binding and cleavage, are
poorly explored by these tools. Some progress has been made by
CROP-IT, an off-target prediction and identification tool, in
predicting off-target binding and cleavage sites by mining the
available experimental datasets for Cas9 binding and cleavage [83].
We have recently developed a tool called CRISPcut for the prediction
of the most optimal sgRNA binding sites in humans [84]. Along with
mismatch tolerance, the tool also takes into account the possibility
of nucleotide bulges within the sgRNA–target duplex. The tool also
captures the differences in chromatin organization of various cell
types found in the human body, with the aim of designing sgRNAs for
DNA regions that are accessible to the CRISPR/Cas system for binding.
Data from DNaseI hypersensitivity assays for around 100 human cell
types, available under the ENCODE project, were incorporated in
CRISPcut. We compared the performance of CRISPcut with other commonly
used alignment- and hypothesis-based prediction servers, such as the
Optimized CRISPR Design-MIT server, E-CRISP Design, CCTop, CHOPCHOP,
and ZiFit, using the ten sgRNAs analyzed in the GUIDE-seq study.
CRISPcut was able to cover most of the GUIDE-seq-identified
off-targets. Among all the listed tools, the predictions from CCTop
were closest to those of CRISPcut. This is because the off-targets
identified by GUIDE-seq differed from their respective target DNA by
at most five mismatches. However, CCTop employs Bowtie-1 for the
screening of off-targets, which makes it incapable of carrying out
gapped alignments or predicting off-targets with insertions and
deletions. Thus, CRISPcut was able to cover the maximum number of
experimentally identified off-targets. CRISPcut provides an
easy-to-use interface for the design of sgRNAs. Figure 3 gives a
step-by-step protocol to design sgRNAs for a specific gene using
CRISPcut.
See Notes 2 and 3 for a glimpse of the current challenges associated
with this category of sgRNA design tools.

5.2 Species-Specific Tools for Aiding in sgRNA Designing

Some of the sgRNA design tools have been developed for very specific
genomes. For example, CRISPR-P has been developed for plants [85],
flyCRISPR for Drosophila [86], and EuPaGDT for pathogens [87]. These
tools should be given priority when working with genomes of the
corresponding species. However, the precise performance of these
on-target sgRNA design tools requires a thorough assessment, because
most of the design rules incorporated in them have been derived from
experiments done in human and mouse cells.
A recent effort by MacPherson et al. to compare various features of
different sgRNA design tools is now available to users as an
Excel-based program called CRISPR Software Matchmaker, which helps in
selecting the most appropriate tool for their needs [88].

Fig. 3 Designing sgRNAs using CRISPcut. CRISPcut is an easy-to-use tool for designing sgRNAs for human
targets. The major steps involved in the process have been illustrated here

5.3 Tools for Posterior Analysis of Data Obtained from CRISPR/Cas9-Based Experiments

These tools mainly deal with the data obtained from high-throughput
screening using the CRISPR/Cas system and genome-wide indel profiling
after the experiments. The CRISPR/Cas9 system is being widely used
across the globe in high-throughput screens for (1) detection of
genes involved in a biological process, (2) correlation of gene
function to specific phenotypes, and (3) uncovering the genetic cause
of many diseases. These experiments mainly involve the generation of
a large pool of cells carrying different mutations. These cells can
be used for either knockout of existing genes or addition of new
genes to map their function to resulting phenotypes.
The role of computational tools here is to aid in library design so
as to maximize on-target sgRNA scores and minimize off-target sites,
improving on the existing libraries for CRISPR screening. Further,
these tools help in the identification of significantly selected
sgRNAs or genes. MAGeCK [89] and caRpools [90] are tools developed
for identifying such sgRNAs in both positive and negative selection
screens. These tools make use of statistical tests to analyze the
high-throughput sequencing data obtained after CRISPR/Cas9-based
screenings.
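The idea of identifying positively or negatively selected sgRNAs from sequencing counts can be illustrated with a deliberately simple calculation. The Python sketch below normalizes read counts to counts per million, computes a log2 fold change per sgRNA, and summarizes each gene by the median of its guides. This is not the statistical model used by MAGeCK or caRpools; the function names, the pseudocount, and the toy counts are assumptions made for illustration, and no significance test is performed.

```python
import math
from statistics import median


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Per-sgRNA log2(treated / control) after counts-per-million scaling.

    `control` and `treated` map sgRNA id -> raw read count.
    """
    c_total = sum(control.values())
    t_total = sum(treated.values())
    lfc = {}
    for guide, c in control.items():
        t = treated.get(guide, 0)
        c_norm = c / c_total * 1e6 + pseudocount
        t_norm = t / t_total * 1e6 + pseudocount
        lfc[guide] = math.log2(t_norm / c_norm)
    return lfc


def gene_scores(lfc, guide_to_gene):
    """Median log2 fold change of the sgRNAs targeting each gene."""
    per_gene = {}
    for guide, value in lfc.items():
        per_gene.setdefault(guide_to_gene[guide], []).append(value)
    return {gene: median(values) for gene, values in per_gene.items()}


if __name__ == "__main__":
    control = {"g1": 100, "g2": 100, "g3": 100, "g4": 100}
    treated = {"g1": 25, "g2": 25, "g3": 175, "g4": 175}
    genes = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
    print(gene_scores(log2_fold_changes(control, treated), genes))
```

In this toy example gene A is depleted (negative selection) and gene B is enriched (positive selection); dedicated tools add variance modeling and multiple-testing control on top of such summaries.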
CRISPR Genome Analyzer (CRISPR-GA) [91] is another tool
that helps in assessing the quality of the data obtained subsequent
to the screening experiment. Using the sequencing data generated
post experiment, it quantifies and characterizes the indels and HDR
at the target site. CRISPResso is yet another similar tool that
evaluates the outcomes of CRISPR/Cas9-based experiments at
the targeted site [92].

5.4 CRISPR/Cas9 Genome Editing Databases

Some of the available resources catalog the sgRNAs that can be used
for editing specific genes and are accessible to the users. Some
examples representing this category are listed in Table 1.

6 Notes

1. A comprehensive and systematic assessment of the performance of
all the computational CRISPR/Cas resources using well-curated
experimental data is required for different species and their
diverse cell types. A thorough comparison will help in judging the
usefulness of various tools in different scenarios.
2. In comparison to the data available to understand the on-target
genome-editing efficiency, the volume of information related to the
off-target sites is quite limited. The robust training of in silico
models depends to a very large extent upon the off-targets
identified using the next-generation sequencing data obtained after
the CRISPR/Cas9-based experiments. Today, nearly all of the existing
tools simply use sequence alignment with mismatch tolerance to
enumerate possible off-target sites, which fails to capture the real
mechanism of off-target activity. There are still many sequence and
structural features of sgRNA and target sequences that need to be
explored. Also, the organization of chromatin inside different cell
types strongly affects the outcomes of these CRISPR/Cas experiments
and has not been accounted for in most of the tools so far.
Therefore, the prediction accuracy of the available computational
tools is not up to the mark.
3. Improvement and refinement of the computational models thus
require “unbiased” genome-wide nuclease activity profiles for a large
set of sgRNAs targeting a varied range of locations within the
genomes of different organisms. This will allow the evolution of
predictive models that accurately rank the relative cleavage rate of
sites predicted as off-targets based on similarity with the target
region for any designed sgRNA.

References

1. Cong L, Ran FA, Cox D, Lin S, Barretto R, Habib N et al (2013) Multiplex genome engineering using CRISPR/Cas systems. Science 339(6121):819–823
2. Mali P, Yang L, Esvelt KM, Aach J, Guell M, DiCarlo JE et al (2013) RNA-guided human genome engineering via Cas9. Science 339(6121):823–826
3. Malina A, Mills JR, Cencic R, Yan Y, Fraser J, Schippers LM et al (2013) Repurposing CRISPR/Cas9 for in situ functional assays. Genes Dev 27(23):2602–2614
4. Zhou Y, Zhu S, Cai C, Yuan P, Li C, Huang Y et al (2014) High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature 509(7501):487–491
5. Dhanjal JK, Radhakrishnan N, Sundar D (2017) Identifying synthetic lethal targets using CRISPR/Cas9 system. Methods 131:66–73
6. Qiu XY, Zhu LY, Zhu CS, Ma JX, Hou T, Wu XM et al (2018) High-effective and low-cost microRNA detection with CRISPR-Cas9. ACS Synth Biol 7(3):807–813
7. Sergiu C, Diana G, Amin H, Ioana BN (2018) Restoring the p53 ‘guardian’ phenotype in p53-deficient tumor cells with CRISPR/Cas9. Trends Biotechnol 36(7):653–660
8. Rauscher B, Heigwer F, Henkel L, Hielscher T, Voloshanenko O, Boutros M (2018) Toward an integrated map of genetic interactions in cancer cells. Mol Syst Biol 14(2):e7656
9. Pham HT, Mesplede T (2018) The latest evidence for possible HIV-1 curative strategies. Drugs Context 7:212522
10. Uppada V, Gokara M, Rasineni GK (2018) Diagnosis and therapy with CRISPR advanced CRISPR based tools for point of care diagnostics and early therapies. Gene 656:22–22
11. Shi J, Wang E, Milazzo JP, Wang Z, Kinney JB, Vakoc CR (2015) Discovery of cancer drug targets by CRISPR-Cas9 screening of protein domains. Nat Biotechnol 33(6):661–667
12. Chen Y, Cao J, Xiong M, Petersen AJ, Dong Y, Tao Y et al (2015) Engineering human stem cell lines with inducible gene knockout using CRISPR/Cas9. Cell Stem Cell 17(2):233–244
13. Belhaj K, Chaparro-Garcia A, Kamoun S, Patron NJ, Nekrasov V (2015) Editing plant genomes with CRISPR/Cas9. Curr Opin Biotechnol 32:76–84
14. Lowder LG, Zhang D, Baltes NJ, Paul JW III, Tang X, Zheng X et al (2015) A CRISPR/Cas9 toolbox for multiplexed plant genome editing and transcriptional regulation. Plant Physiol 169(2):971–985
15. Khatodia S, Bhatotia K, Passricha N, Khurana SM, Tuteja N (2016) The CRISPR/Cas genome-editing tool: application in improvement of crops. Front Plant Sci 7:506
16. Choi KR, Lee SY (2016) CRISPR technologies for bacterial systems: current achievements and future directions. Biotechnol Adv 34(7):1180–1209
17. Estrela R, Cate JH (2016) Energy biotechnology in the CRISPR-Cas9 era. Curr Opin Biotechnol 38:79–84
18. Donohoue PD, Barrangou R, May AP (2018) Advances in industrial biotechnology using CRISPR-Cas systems. Trends Biotechnol 36(2):134–146
19. Deltcheva E, Chylinski K, Sharma CM, Gonzales K, Chao Y, Pirzada ZA et al (2011) CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471(7340):602–607
20. Hawkins JS, Wong S, Peters JM, Almeida R, Qi LS (2015) Targeted transcriptional repression in bacteria using CRISPR interference (CRISPRi). Methods Mol Biol 1311:349–362
21. Ran FA, Hsu PD, Wright J, Agarwala V, Scott DA, Zhang F (2013) Genome engineering using the CRISPR-Cas9 system. Nat Protoc 8(11):2281–2308
22. Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelson T et al (2014) Genome-scale CRISPR-Cas9 knockout screening in human cells. Science 343(6166):84–87
23. Chu HW, Rios C, Huang C, Wesolowska-Andersen A, Burchard EG, O’Connor BP et al (2015) CRISPR-Cas9-mediated gene knockout in primary human airway epithelial cells reveals a proinflammatory role for MUC18. Gene Ther 22(10):822–829
24. Feng X, Zhao D, Zhang X, Ding X, Bi C (2018) CRISPR/Cas9 assisted multiplex genome editing technique in Escherichia coli. Biotechnol J 13(9):1700604
25. de Vries ARG, de Groot PA, van den Broek M, Daran J-MG (2017) CRISPR-Cas9 mediated gene deletions in lager yeast Saccharomyces pastorianus. Microb Cell Fact 16(1):222
26. Serif M, Dubois G, Finoux A-L, Teste M-A, Jallet D, Daboussi F (2018) One-step generation of multiple gene knock-outs in the diatom Phaeodactylum tricornutum by DNA-free genome editing. Nat Commun 9(1):3924
27. Hegde S, Nilyanimit P, Kozlova E, Narra HP, Sahni SK, Hughes GL (2018) CRISPR/Cas9-mediated gene deletion of the ompA gene in an enterobacter gut symbiont impairs biofilm formation and reduces gut colonization of Aedes aegypti mosquitoes. bioRxiv 13(12):e0007883
28. Shen Z, Zhang X, Chai Y, Zhu Z, Yi P, Feng G et al (2014) Conditional knockouts generated by engineered CRISPR-Cas9 endonuclease reveal the roles of coronin in C. elegans neural development. Dev Cell 30(5):625–636
29. Bae S, Kweon J, Kim HS, Kim JS (2014) Microhomology-based choice of Cas9 nuclease target sites. Nat Methods 11(7):705–706
30. Doench JG, Hartenian E, Graham DB, Tothova Z, Hegde M, Smith I et al (2014) Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat Biotechnol 32(12):1262–1267
31. Maruyama T, Dougan SK, Truttmann MC, Bilate AM, Ingram JR, Ploegh HL (2015) Increasing the efficiency of precise genome editing with CRISPR-Cas9 by inhibition of nonhomologous end joining. Nat Biotechnol 33(5):538–542
32. Li HL, Fujimoto N, Sasakawa N, Shirai S, Ohkame T, Sakuma T et al (2015) Precise correction of the dystrophin gene in duchenne muscular dystrophy patient induced pluripotent stem cells by TALEN and CRISPR-Cas9. Stem Cell Rep 4(1):143–154
33. Peng J, Wang Y, Jiang J, Zhou X, Song L, Wang L et al (2015) Production of human albumin in pigs through CRISPR/Cas9-mediated knockin of human cDNA into swine albumin locus in the zygotes. Sci Rep 5:16705
34. Platt RJ, Chen S, Zhou Y, Yim MJ, Swiech L, Kempton HR et al (2014) CRISPR-Cas9 knockin mice for genome editing and cancer modeling. Cell 159(2):440–455
35. Dow LE (2015) Modeling disease in vivo with CRISPR/Cas9. Trends Mol Med 21(10):609–621
36. Zhang MM, Wong FT, Wang Y, Luo S, Lim YH, Heng E et al (2017) CRISPR–Cas9 strategy for activation of silent Streptomyces biosynthetic gene clusters. Nat Chem Biol 13(6):607
37. Behler J, Vijay D, Hess WR, Akhtar MK (2018) CRISPR-based technologies for metabolic engineering in cyanobacteria. Trends Biotechnol 36(10):996–1010
38. Qi LS, Larson MH, Gilbert LA, Doudna JA, Weissman JS, Arkin AP et al (2013) Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152(5):1173–1183
39. Cheng AW, Wang H, Yang H, Shi L, Katz Y, Theunissen TW et al (2013) Multiplexed activation of endogenous genes by CRISPR-on, an RNA-guided transcriptional activator system. Cell Res 23(10):1163–1171
40. Chuai GH, Wang QL, Liu Q (2017) In silico meets in vivo: towards computational CRISPR-based sgRNA design. Trends Biotechnol 35(1):12–21
41. Cong L, Ran FA, Cox D, Lin S, Barretto R, Habib N et al (2013) Multiplex genome engineering using CRISPR/Cas systems. Science 339(6121):819–823
42. Xu H, Xiao T, Chen CH, Li W, Meyer CA, Wu Q et al (2015) Sequence determinants of improved CRISPR sgRNA design. Genome Res 25(8):1147–1157
43. Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna JA, Charpentier E (2012) A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337(6096):816–821
44. Mekler V, Minakhin L, Severinov K (2017) Mechanism of duplex DNA destabilization by RNA-guided Cas9 nuclease during target interrogation. Proc Natl Acad Sci U S A 114(21):5443–5448
45. Tim Wang JJW, Sabatini DM, Lander ES (2014) Genetic screens in human cells using the CRISPR-Cas9 system. Science 343(6166):80–84
46. Wang X, Wang X, Varma RK, Beauchamp L, Magdaleno S, Sendera TJ (2009) Selection of hyperfunctional siRNAs with improved

potency and specificity. Nucleic Acids Res 37 60. Kleinstiver BP, Pattanayak V, Prew MS, Tsai
(22):e152 SQ, Nguyen NT, Zheng Z et al (2016) High-
47. Long D, Lee R, Williams P, Chan CY, fidelity CRISPR-Cas9 nucleases with no
Ambros V, Ding Y (2007) Potent effect of detectable genome-wide off-target effects.
target structure on microRNA function. Nat Nature 529(7587):490–495
Struct Mol Biol 14(4):287–294 61. Slaymaker IM, Gao L, Zetsche B, Scott DA,
48. Robins H, Li Y, Padgett RW (2005) Incorpor- Yan WX, Zhang F (2016) Rationally engi-
ating structure to predict microRNA targets. neered Cas9 nucleases with improved specific-
Proc Natl Acad Sci U S A 102(11):4006–4009 ity. Science 351(6268):84–88
49. Wong N, Liu W, Wang X (2015) WU-CRISPR: 62. Chen JS, Dagdas YS, Kleinstiver BP, Welch
characteristics of functional guide RNAs for MM, Sousa AA, Harrington LB et al (2017)
the CRISPR/Cas9 system. Genome Biol Enhanced proofreading governs CRISPR-Cas9
16:218 targeting accuracy. Nature 550
50. Wu X, Scott DA, Kriz AJ, Chiu AC, Hsu PD, (7676):407–410
Dadon DB et al (2014) Genome-wide binding 63. Zuris JA, Thompson DB, Shu Y, Guilinger JP,
of the CRISPR endonuclease Cas9 in mamma- Bessen JL, Hu JH et al (2015) Cationic lipid-
lian cells. Nat Biotechnol 32(7):670–676 mediated delivery of proteins enables efficient
51. Fu Y, Foden JA, Khayter C, Maeder ML, protein-based genome editing in vitro and
Reyon D, Joung JK et al (2013) High- in vivo. Nat Biotechnol 33(1):73–80
frequency off-target mutagenesis induced by 64. Kim S, Kim D, Cho SW, Kim J, Kim JS (2014)
CRISPR-Cas nucleases in human cells. Nat Highly efficient RNA-guided genome editing
Biotechnol 31(9):822–826 in human cells via delivery of purified Cas9
52. Hsu PD, Scott DA, Weinstein JA, Ran FA, ribonucleoproteins. Genome Res 24
Konermann S, Agarwala V et al (2013) DNA (6):1012–1019
targeting specificity of RNA-guided Cas9 65. Zetsche B, Volz SE, Zhang F (2015) A split-
nucleases. Nat Biotechnol 31(9):827–832 Cas9 architecture for inducible genome editing
53. Yu C, Liu Y, Ma T, Liu K, Xu S, Zhang Y et al and transcription modulation. Nat Biotechnol
(2015) Small molecules enhance CRISPR 33(2):139–142
genome editing in pluripotent stem cells. Cell 66. Petris G, Casini A, Montagna C, Lorenzin F,
Stem Cell 16(2):142–147 Prandi D, Romanel A et al (2017) Hit and go
54. G-h C, Wang Q-L, Liu Q (2017) In silico CAS9 delivered through a lentiviral based self-
meets in vivo: towards computational limiting circuit. Nat Commun 8:15334
CRISPR-based sgRNA design. Trends Bio- 67. Shin J, Jiang F, Liu J-J, Bray NL, Rauch BJ,
technol 35(1):12–21 Baik SH, Nogales E, Bondy-Denomy J, Corn
55. Fu Y, Sander JD, Reyon D, Cascio VM, Joung JE, Doudna JA (2017) Disabling Cas9 by an
JK (2014) Improving CRISPR-Cas nuclease anti-CRISPR DNA mimic. Sci Adv 3(7):
specificity using truncated guide RNAs. Nat e1701620
Biotechnol 32(3):279–284 68. Ryan DE, Taussig D, Steinfeld I, Phadnis SM,
56. Daesik Kim SB, Park J, Kim E, Kim S, Yu HR, Lunstad BD, Singh M et al (2017) Improving
Hwang J, Kim J-I, Kim J-S (2015) Digenome- CRISPR-Cas specificity with chemical modifi-
seq: genome-wide profiling of CRISPR-Cas9 cations in single-guide RNAs. Nucleic Acids
off-target effects in human cells. Nat Methods Res 46(2):792–803
12:237–242 69. Cameron P, Fuller CK, Donohoue PD, Jones
57. Ran FA, Hsu PD, Lin CY, Gootenberg JS, BN, Thompson MS, Carter MM et al (2017)
Konermann S, Trevino AE et al (2013) Double Mapping the genomic landscape of CRISPR-
nicking by RNA-guided CRISPR Cas9 for Cas9 cleavage. Nat Methods 14(6):600–606
enhanced genome editing specificity. Cell 154 70. Liu H, Wei Z, Dominguez A, Li Y, Wang X, Qi
(6):1380–1389 LS (2015) CRISPR-ERA: a comprehensive
58. Tsai SQ, Wyvekens N, Khayter C, Foden JA, design tool for CRISPR-mediated gene edit-
Thapar V, Reyon D et al (2014) Dimeric ing, repression and activation. Bioinformatics
CRISPR RNA-guided FokI nucleases for 31(22):3676–3678
highly specific genome editing. Nat Biotechnol 71. Xu H, Xiao T, Chen C-H, Li W, Meyer CA, Wu
32(6):569–576 Q et al (2015) Sequence determinants of
59. Guilinger JP, Thompson DB, Liu DR (2014) improved CRISPR sgRNA design. Genome
Fusion of catalytically inactive Cas9 to FokI Res 25(8):1147–1157
nuclease improves the specificity of genome 72. Ma M, Ye AY, Zheng W, Kong L (2013) A
modification. Nat Biotechnol 32(6):577–582 guide RNA sequence design platform for the
166 Jaspreet Kaur Dhanjal et al.

CRISPR/Cas9 system for model organism 83. Singh R, Kuscu C, Quinlan A, Qi Y, Adli M
genomes. Biomed Res Int 2013:270805 (2015) Cas9-chromatin binding information
73. Montague TG, Cruz JM, Gagnon JA, Church enables more accurate CRISPR off-target pre-
GM, Valen E (2014) CHOPCHOP: a diction. Nucleic Acids Res 43(18):e118
CRISPR/Cas9 and TALEN web tool for 84. Dhanjal JK, Radhakrishnan N, Sundar D
genome editing. Nucleic Acids Res 42(Web (2018) CRISPcut: a novel tool for designing
Server issue):W401–W407 optimal sgRNAs for CRISPR/Cas9 based
74. O’Brien A, Bailey TL (2014) GT-scan: identi- experiments in human cells. Genomics 111
fying unique genomic targets. Bioinformatics (4):560–566
30(18):2673–2675 85. Lei Y, Lu L, Liu HY, Li S, Xing F, Chen LL
75. Heigwer F, Kerr G, Boutros M (2014) (2014) CRISPR-P: a web tool for synthetic
E-CRISP: fast CRISPR target site identifica- single-guide RNA design of CRISPR-system
tion. Nat Methods 11(2):122 in plants. Mol Plant 7(9):1494-1496
76. Lei Y, Lu L, Liu HY, Li S, Xing F, Chen LL 86. Gratz SJ, Ukken FP, Rubinstein CD, Thiede G,
(2014) CRISPR-P: a web tool for synthetic Donohue LK, Cummings AM et al (2014)
single-guide RNA design of CRISPR-system Highly specific and efficient CRISPR/Cas9-
in plants. Mol Plant 7(9):1494–1496 catalyzed homology-directed repair in Dro-
77. Bae S, Park J, Kim JS (2014) Cas-OFFinder: a sophila. Genetics 196(4):961–971
fast and versatile algorithm that searches for 87. Peng D, Tarleton R (2015) EuPaGDT: a web
potential off-target sites of Cas9 RNA-guided tool tailored to design CRISPR guide RNAs
endonucleases. Bioinformatics 30 for eukaryotic pathogens. Microb Genomics 1
(10):1473–1475 (4):e000033
78. Cradick TJ, Qiu P, Lee CM, Fine EJ, Bao G 88. MacPherson CR, Scherf A (2015) Flexible
(2014) COSMID: a web-based tool for identi- guide-RNA design for CRISPR applications
fying and validating CRISPR/Cas off-target using protospacer workbench. Nat Biotechnol
sites. Mol Ther Nucl Acids 3(12):e214 33(8):805
79. Lin Y, Cradick TJ, Brown MT, Deshmukh H, 89. Li W, Xu H, Xiao T, Cong L, Love MI, Zhang
Ranjan P, Sarode N et al (2014) CRISPR/Cas9 F et al (2014) MAGeCK enables robust identi-
systems have off-target activity with insertions fication of essential genes from genome-scale
or deletions between target DNA and guide CRISPR/Cas9 knockout screens. Genome
RNA sequences. Nucleic Acids Res 42 Biol 15(12):554
(11):7473–7485 90. Winter J, Breinig M, Heigwer F,
80. Doench JG, Fusi N, Sullender M, Hegde M, Brügemann D, Leible S, Pelz O et al (2015)
Vaimberg EW, Donovan KF et al (2016) Opti- caRpools: an R package for exploratory data
mized sgRNA design to maximize activity and analysis and documentation of pooled
minimize off-target effects of CRISPR-Cas9. CRISPR/Cas9 screens. Bioinformatics 32
Nat Biotechnol 34(2):184 (4):632–634
81. Tsai SQ, Zheng Z, Nguyen NT, Liebers M, 91. Güell M, Yang L, Church GM (2014) Genome
Topkar VV, Thapar V et al (2015) GUIDE- editing assessment using CRISPR genome ana-
seq enables genome-wide profiling of lyzer (CRISPR-GA). Bioinformatics 30
off-target cleavage by CRISPR-Cas nucleases. (20):2968–2970
Nat Biotechnol 33(2):187–197 92. Pinello L, Canver M, Hoban M (2015) Cris-
82. Lee CM, Cradick TJ, Fine EJ, Bao G (2016) presso: sequencing analysis toolbox for crispr-
Nuclease target site selection for maximizing cas9 genome editing. bioRxiv. https://doi.
on-target activity and minimizing off-target org/10.1101/031203
effects in genome editing. Mol Ther 24
(3):475–487
Chapter 9

Complex Network Analysis in Microbial Systems: Theory and Examples
André Voigt and Eivind Almaas

Abstract
A central driver for the field of systems biology is to develop an understanding of how interactions between
components affect the functioning of a system as a whole. Network analysis is an approach that is uniquely
suited to uncover patterns and organizing principles in a wide variety of complex systems. In this chapter,
we will give a detailed description of basic concepts for characterizing empirical networks, frequently used
random network models, and how to compute properties of networks using Python packages. We will
demonstrate the application of network analysis by investigating several biological networks.

Keywords Complex networks, Systems biology, Protein interactions, Metabolism

1 Introduction

Systems biology is a recent field in the biological sciences with the
working paradigm of combining typically large-scale experimental
data with mathematical and computational modeling to generate a
predictive understanding of how a biological unit’s different parts
are integrated [1, 2]. Such an approach naturally gives rise to a
strong focus on our ability to quantify and understand complex
interactions in biological systems, using both massively parallel
measurement techniques, as typified by the various ‘omics
approaches, and mathematical modeling and computer simulations
[3]. If we consider biological systems as constructed from individ-
ual building blocks (nodes) that are connected through a variety of
interactions (links), we are essentially thinking of them as
networks [4].
A network, or a graph, consists of nodes (vertices) that are
connected by links (edges), where the links reflect a relationship
or an interaction between the nodes. In a biological system, the
links may represent a physical interaction such as the ability of two
proteins to bind, or a non-physical interaction such as a correlation
in gene expression between two genes [5]. The flexibility in

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_9, © Springer Science+Business Media, LLC, part of Springer Nature 2022


representing various systems as networks makes network analysis a
general and very powerful tool for inspecting and analyzing a large
number of biological interactions simultaneously. Network analysis
has been applied to a wide range of length scales in biology, from
systems of molecular interactions, such as the protein interaction
network, to the interactions between species, as represented by
food webs [6–8]. Note that, while the same suite of analysis tools
can be used to dissect and investigate the networks resulting from a
wide range of biological systems, the interpretations of the network
properties must be grounded in (1) the particular biological system
being analyzed and (2) the particular choices made for what a
link and a node should represent [5].
In this chapter, we will describe in detail the properties of some
biological networks and discuss generic network measures and tools
that can provide biological insight into the workings of an organism.
As the network representation of a system is amenable to computer
analysis, we are not limited to the study of small networks consisting
of only tens or a few hundred nodes, and we will see that the
statistical properties of large-scale networks may provide deep
biological understanding [4]. Since a network representation is a
useful tool not only in biology, it is possible to discover properties
that are generic to systems as different as the World Wide Web,
social networks, postal delivery routes, and cellular metabolism
[6, 8].
In this chapter, we will also discuss several models for random
networks: the Erdős–Rényi model [9], the Barabási–Albert model
[10], and the configuration model [11, 12]. We will demonstrate
how the Python package NetworkX [13] can be put to use for
calculating network properties, as well as generating networks
according to a particular model. Additionally, we will describe a
widely used R package for weighted gene co-expression network
analysis, WGCNA [14]. Finally, we will briefly describe two widely
used software tools for the visualization and analysis of networks:
Cytoscape [15] and Pajek [16].

2 Tools

2.1 Pajek

Pajek (meaning “spider” in Slovenian) is a program for the analysis
and visualization of very large networks [16] on the Windows
operating system, and it can be freely downloaded (with manual)
from http://mrvar.fdv.uni-lj.si/pajek/.
This software was developed by Vladimir Batagelj and Andrej
Mrvar in 1996 and implemented in Delphi (Pascal), and has since
been further extended [16]. In addition to directed, undirected,
and mixed networks, Pajek can work with multi-related networks,
two-mode networks (bipartite graphs), and temporal networks
(dynamic graphs that change over time).

Further information about the formatting of input files for
Pajek is available at: http://mrvar.fdv.uni-lj.si/pajek/pajekman.pdf.
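As an illustration of what such an input file can look like, the following is a minimal sketch that writes an undirected edge list in the basic Pajek .net layout (a "*Vertices" block with numbered, quoted labels, followed by an "*Edges" block). The function name to_pajek_net and the example edge list are our own assumptions, and the layout reflects only the simplest case; the Pajek manual remains the authoritative description of the format.

```python
def to_pajek_net(edges):
    """Return a minimal Pajek .net description of an undirected edge list.

    Hypothetical helper for illustration; only the simplest form of the
    format (vertex numbers, quoted labels, unweighted edges) is produced.
    """
    # Pajek refers to vertices by consecutive integers starting at 1
    labels = []
    index = {}
    for u, v in edges:
        for node in (u, v):
            if node not in index:
                index[node] = len(labels) + 1
                labels.append(node)
    lines = ["*Vertices {}".format(len(labels))]
    lines += ['{} "{}"'.format(index[name], name) for name in labels]
    lines.append("*Edges")  # undirected links; "*Arcs" is used for directed links
    lines += ["{} {}".format(index[u], index[v]) for u, v in edges]
    return "\n".join(lines) + "\n"


# Example edge list (assumed for illustration)
example_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
pajek_text = to_pajek_net(example_edges)
print(pajek_text)
```

The returned string can be written to a file with the extension .net and opened in Pajek.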

2.2 NetworkX

NetworkX [17], first released in 2005, is a free and open source
Python library that allows for the generation, manipulation, and
analysis of a variety of different types of networks. The main network
types supported are ordinary undirected graphs, directed
graphs, and multigraphs (both directed and undirected), which
differ from ordinary graphs in that a given pair of nodes may be
connected by any number of edges, with potentially different edge
attributes. NetworkX can generate networks according to a variety
of predefined models, including the Erdős–Rényi, Barabási–Albert, and
configuration models. It also contains standard functions to
compute many different network attributes, such as clustering,
shortest paths, centrality, and k-cores. As a library for the Python
programming language, it does not provide a graphical user inter-
face for user input. Instead, it is accessed through the Python/
IPython command line interface or through user-written programs
and scripts. This chapter provides several examples of how to per-
form different aspects of network analysis using the built-in func-
tions in NetworkX. Note that, while user input is entirely through
the command line, NetworkX also provides graphical visualization
of networks through integration with Matplotlib, another free/
open source Python library for 2D plotting.

2.3 Cytoscape

Cytoscape is a general open source platform for complex network
analysis and visualization [15]. It was initially developed
as a bioinformatics program for visualizing molecular interaction
networks and biological pathways and integrating these networks
with annotations, gene expression profiles, and other state data.
The Cytoscape core contains a basic set of features for data integration
and visualization. There exist many plug-ins for network and
molecular profiling analyses, network layouts, additional file format
support, scripting, and connection with databases. Plug-ins are
developed using the Cytoscape open API, which is based on Java™
(http://www.java.com) technology, and plug-in development by
the community is encouraged. Most of the plug-ins are freely
available at http://www.cytoscape.org/.

2.4 WGCNA

WGCNA (“Weighted Gene Co-expression Network Analysis”)
[14] is a popular R package used for the study of weighted gene
co-expression networks, available through the Comprehensive R
Archive Network (CRAN). Released in 2008, its main purpose is
to identify modules of highly co-expressed genes and to relate
these modules to pre-defined traits and processes. The
WGCNA website (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html)
contains extensive documentation that provides examples of
use of WGCNA on various types of data.

3 Biological Networks

A biological network is a mathematical representation of only a
limited set of properties of a biological system. However, if the
properties have been selected with some care, the resulting network
(and its analysis) may give important biological insights. While the
choice of what should constitute a node and a link in a biological
network model is only limited by the imagination of the scientist, as
soon as this decision has been made, a host of methods and algorithms
are available to dissect the properties of the network model
(see, e.g., [18], a textbook that provides a comprehensive
discussion).
In a typical network representation, nodes correspond to the
chosen biological elements, and a link between two nodes reflects
the presence of either a direct or an indirect relation between them.
The number of links connected to a node is known as the degree, or
connectivity, of that node. In directed networks, i.e., networks
where the links represent a directional relationship (e.g., molecule
A activates molecule B but not the converse), it is possible to
differentiate between a node’s in-degree and its out-degree. Note
that the sum of a node’s in-degree and out-degree is the same as the
degree of the node when the network is considered undirected
[18]. In visualizations, the directedness of a link is typically indi-
cated by drawing the link as an arrow instead of a line.
How do we represent a network mathematically? By focusing
on the interactions between the nodes, we define a network’s
adjacency matrix [18] as $A = [a_{ij}]$, where each $a_{ij}$ is computed
from Eq. 1,

$$
a_{ij} =
\begin{cases}
0, & (i, j) \notin E \\
1, & (i, j) \in E
\end{cases}
\tag{1}
$$

Here, $E$ represents all the links existing in the network $G$. We
interpret the adjacency matrix as follows: when $a_{ij}$ is non-zero,
there is a direct link between nodes $i$ and $j$. When the adjacency
matrix $A$ is symmetric, i.e., $a_{ij} = a_{ji}$, the links are not
directed. When $A$ is asymmetric, we assign a direction to the
links, and a non-zero $a_{ij}$ would typically be drawn as an arrow
pointing from node $i$ to node $j$. We can calculate the degree of a
node [18] from the adjacency matrix as

$$
k_i = \sum_j a_{ij}
\tag{2}
$$

Example 1: Creating a Network from an Adjacency Matrix and Computing Node Degree Using NetworkX:

import networkx as nx

# Symmetric adjacency matrix, corresponding to an undirected network:
# node 1 connects to nodes 2 and 3, nodes 2 and 3 are not connected to each other
adjMat = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

G = nx.Graph()
G.add_nodes_from(range(1, len(adjMat) + 1))  # Labels the nodes 1..N, so isolated nodes are kept
for i in range(len(adjMat) - 1):
    for j in range(i + 1, len(adjMat)):
        if adjMat[i][j] == 1:
            G.add_edge(i + 1, j + 1)  # Shifts 0-based matrix indices to 1-based node labels
nx.degree(G, 1)  # Returns degree for node 1, equal to 2
nx.degree(G, 2)  # Returns degree for node 2, equal to 1
nx.degree(G, 3)  # Returns degree for node 3, equal to 1
dict(nx.degree(G))  # Returns dict containing node degrees, {1: 2, 2: 1, 3: 1}

Note that the index j in Eq. 2 runs from 1 to N (all the nodes in the
network). For an undirected network, the resulting value for ki is
the same whether the sum in Eq. 2 is taken over the first or the
second index. However, for a directed network, the standard convention
is that one calculates the in-degree by a summation over the
first index and the out-degree by a summation over the second
index. We can consider node degree a local network property, in
that it only characterizes a single node and its immediate neighbor-
hood. In contrast, the degree distribution is a global network prop-
erty, which includes the number of neighbors of all the nodes in the
network. We define the degree distribution P(k) as:
$$
P(k) = \frac{n_k}{N},
\tag{3}
$$
where nk is the number of nodes in the network with k nearest
neighbors. Thus, we may interpret P(k) as the probability that a
randomly selected node has exactly degree k. A description for how
one determines the degree distribution is specified in Example 2. A
related measure is that of the cumulative degree distribution CumP
(k), which represents the fraction of nodes with degree greater than
or equal to k [18]. Example 3 describes this function.
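The index convention for directed networks can be made concrete with a small sketch in plain Python. The three-node matrix below is a hypothetical example of our own, with a non-zero entry a[i][j] read as a link pointing from node i to node j; summing over the second index then gives the out-degree and summing over the first index gives the in-degree.

```python
# Hypothetical directed network on nodes 0, 1, 2 with links 0->1, 0->2 and 2->1.
# Entry a[i][j] = 1 encodes a link pointing from node i to node j.
a = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
n = len(a)

# Out-degree: sum over the second index j (row sums)
out_degree = [sum(a[i][j] for j in range(n)) for i in range(n)]
# In-degree: sum over the first index i (column sums)
in_degree = [sum(a[i][j] for i in range(n)) for j in range(n)]
# With no reciprocal links, in-degree + out-degree equals the undirected degree
total_degree = [in_degree[i] + out_degree[i] for i in range(n)]

print(out_degree)    # [2, 0, 1]
print(in_degree)     # [0, 2, 1]
print(total_degree)  # [2, 2, 2]
```

In NetworkX, the corresponding quantities for a directed graph are available through the in_degree and out_degree methods of nx.DiGraph.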

Example 2: Creating a Simple Network and Computing Its Degree Distribution in NetworkX:

import networkx as nx

G = nx.Graph()
G.add_edge('a', 'b')  # Creates nodes a and b, and links them by an edge

# Creates links between specified nodes, creating nodes where they do not exist
G.add_edges_from([('a', 'c'), ('a', 'd'), ('a', 'e'), ('b', 'c')])

max_deg = max(deg for _, deg in G.degree())
# Initializes a list to contain the degree distribution (number of nodes per degree)
degree_distribution = [0] * (max_deg + 1)

for node in G:
    degree_distribution[G.degree(node)] += 1

# Saves the network as a file in the edge list format (each line represents a pair of nodes)
nx.write_edgelist(G, "example_network.txt")

In this example, degree_distribution assumes the value [0, 2, 2,
0, 1]. This corresponds to two nodes of degree 1 (“d” and “e”),
two nodes of degree 2 (“b” and “c”), and one node of degree
4 (“a”). Dividing each count by the number of nodes, N = 5, gives
P(k) as defined in Eq. 3. We also store the created network in a text
file, for use in the following examples.
Example 3: Cumulative Degree Distribution:

cumul_deg_dist = [0] * (max_deg + 1)  # Creates a list to hold the values

# Accumulates the node counts from the largest degree down to degree 0
running_total = 0
for i in reversed(range(max_deg + 1)):
    running_total += degree_distribution[i]
    cumul_deg_dist[i] = running_total

Table 1
Degree distribution and cumulative degree distribution for the network in Fig. 1

k        0      1      2      3      4      5      6      7
P(k)     0.000  0.125  0.125  0.375  0.125  0.250  0.000  0.000
CumP(k)  1.000  1.000  0.875  0.750  0.375  0.250  0.000  0.000

The maximal degree of a single node is N − 1 when there are N nodes

Fig. 1 Example network consisting of 8 nodes and 13 links

Table 2
Clustering coefficient, closeness centrality, and betweenness centrality of the example network in
Fig. 1

Node number, i   1      2      3      4      5      6      7      8
k_i              3      2      1      5      5      4      3      3
C_i              0.667  1.000  0.000  0.300  0.500  0.667  1.000  1.000
Closeness_i      0.714  0.595  0.524  0.857  0.857  0.786  0.667  0.667
Betweenness_i    1.167  1.000  1.000  1.233  1.767  1.367  1.000  1.000

Continuing from Example 2, cumul_deg_dist assumes the
value [5, 5, 3, 1, 1], i.e., the number of nodes with degree greater
than or equal to each value of k.
In Table 1, we present the results for P(k) and CumP(k) as
calculated from the example network shown in Fig. 1.
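The entries of Table 1 can be recomputed from the node degrees listed in Table 2. The following is a minimal sketch in plain Python; the variable names are our own, and the range of k is taken to run up to N − 1 as in the table.

```python
from collections import Counter

# Node degrees k_i of the eight-node example network, as listed in Table 2
degrees = [3, 2, 1, 5, 5, 4, 3, 3]
N = len(degrees)
max_k = N - 1  # largest degree a node can have in a simple network of N nodes

counts = Counter(degrees)  # n_k: number of nodes with degree k
P = [counts[k] / N for k in range(max_k + 1)]

# CumP(k): fraction of nodes with degree >= k, accumulated from the largest k down
cum_P = [0.0] * (max_k + 1)
running = 0.0
for k in range(max_k, -1, -1):
    running += P[k]
    cum_P[k] = running

print(P)      # [0.0, 0.125, 0.125, 0.375, 0.125, 0.25, 0.0, 0.0]
print(cum_P)  # [1.0, 1.0, 0.875, 0.75, 0.375, 0.25, 0.0, 0.0]
```

The sum of the degrees is 26, twice the 13 links of the network, which is a quick consistency check on the degree list.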
A node’s degree, and the properties derived from it that we have
just discussed, focus on pairwise connectivity in a network. Much
insight can also be gained by studying the occurrence of connected
triplets of nodes [19]. In Fig. 1, nodes 1, 2, and 4 are all directly
connected and form such a triplet, or cluster. If we were studying a
friendship network where the nodes represent people and links
indicate the existence of friendship between two people, the
clustering of a node would describe the propensity of the friend of
my friend to also be my friend [19]. We measure the clustering among
the nearest neighbors of a node i by calculating the clustering
coefficient Ci as:
$$
C_i = \frac{\text{number of actual connections between the neighbors}}{\text{number of possible connections between the neighbors}}
\tag{4}
$$
Using the adjacency matrix, we can restate this expression as:
$$
C_i = \frac{1}{k_i (k_i - 1)} \sum_{j, k} a_{ij} a_{jk} a_{ki}.
\tag{5}
$$
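As a sketch of how the adjacency-matrix form of the clustering coefficient can be evaluated, the code below reads Eq. 5 as a sum over pairs of neighbors j and k of node i, which is the reading consistent with Eq. 4. The function name clustering_coefficients, the convention of assigning 0 to nodes with fewer than two neighbors, and the five-node test network (the one built in Example 2, with nodes a to e mapped to indices 0 to 4) are our assumptions.

```python
def clustering_coefficients(a):
    """Clustering coefficient of every node from a symmetric 0/1 adjacency matrix."""
    n = len(a)
    coefficients = []
    for i in range(n):
        k_i = sum(a[i])
        if k_i < 2:
            # Fewer than two neighbors: no pair to connect, C_i is set to 0
            coefficients.append(0.0)
            continue
        # Each linked pair of neighbors is counted twice, as (j, k) and as (k, j),
        # which matches the k_i * (k_i - 1) ordered pairs in the denominator
        closed = sum(a[i][j] * a[j][k] * a[k][i] for j in range(n) for k in range(n))
        coefficients.append(closed / (k_i * (k_i - 1)))
    return coefficients


# Network of Example 2, with nodes a..e mapped to indices 0..4
adjacency = [
    [0, 1, 1, 1, 1],  # a is linked to b, c, d, e
    [1, 0, 1, 0, 0],  # b is linked to a, c
    [1, 1, 0, 0, 0],  # c is linked to a, b
    [1, 0, 0, 0, 0],  # d is linked to a
    [1, 0, 0, 0, 0],  # e is linked to a
]
C = clustering_coefficients(adjacency)
average_C = sum(C) / len(C)  # average clustering coefficient of the network
print(C)
print(average_C)
```

Node a has four neighbors, of which only the pair (b, c) is linked, so one of six possible connections is realized.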
Example 4 describes how we may implement the calculation of
the clustering coefficient for all nodes in a network, and in Table 2
we show the result when applied to the network in Fig. 1. By

Fig. 2 Protein-protein interactions in the species Mus musculus from BioGrid-2.0.61 [20]. This
network contains a total of 1407 nodes and 1579 links and consists of 176 separate connected subnetworks,
of which the largest has 766 nodes

Fig. 3 Plot of the degree distribution for the protein-protein interaction network of the species Mus musculus

averaging the clustering coefficient over all the nodes in the net-
work, we obtain the global measure called the average clustering
coefficient [19]:
$$
\langle C \rangle = \frac{1}{N} \sum_i C_i.
\tag{6}
$$

Example 4: Clustering Coefficient:

import networkx as nx

G = nx.read_edgelist("example_network.txt")
# Clustering coefficient of node a, equal to 0.1666666...
cluster_coef_a = nx.clustering(G, 'a')
# {'a': 0.16666666666666666, 'b': 1.0, 'c': 1.0, 'd': 0.0, 'e': 0.0}
cluster_coef = nx.clustering(G)
avg_clustering = nx.average_clustering(G)  # 0.43333333...

So far, we have described a few very simple tools to characterize
the properties of a network. In the following section, we will discuss
some important biological networks in more detail, and we will
describe additional tools to characterize these networks.

3.1 Protein–Protein Interaction (PPI) Networks

PPIs are of great importance for a multitude of processes in a cell.
Understanding the details of PPIs is relevant for understanding
diseases and for identifying new therapeutic methods. In PPI
networks, nodes correspond to proteins and (undirected) links
represent an interaction between a pair of proteins. An example of
a PPI network is shown in Fig. 2, and the network’s degree distribution
is plotted in Fig. 3.
When studying the depiction of the network (Fig. 2), one
feature is particularly striking: It is not possible to start from one
node and reach all the other nodes by only hopping along the links.
The network is broken into many components [18]. Note that a
component consists of all the nodes that can be reached from one
another by only following the links. The network in Fig. 2 contains
one giant component, containing 54.4% of all the nodes. The remaining
175 components contain significantly fewer nodes. The component
size distribution, CSD(n), is the chance that a randomly selected
component contains n nodes and is defined as

$$
\mathrm{CSD}(n) = \frac{\text{number of components with size } n}{\text{total number of components}}.
\tag{7}
$$
We identify both the total number of components and their size
by using a network “burning” algorithm (see Example 5). This is
based on the simple idea of recursively following links to visit
(“burn”) nodes, until no new nodes are discoverable. Then all the
nodes in the current component have been detected.
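The burning idea can be sketched in a few lines of plain Python. The version below is an illustration under our own assumptions: it uses an explicit list of nodes still to be visited instead of recursion (which avoids recursion-depth limits on large networks), the function name burn_components is hypothetical, and the eight-node edge list is a small test case of our choosing.

```python
from collections import Counter


def burn_components(adjacency):
    """Return the connected components of a network given as {node: set of neighbors}."""
    burned = set()
    components = []
    for start in adjacency:
        if start in burned:
            continue
        # Ignite a new component and let the fire spread along the links
        component = {start}
        burned.add(start)
        frontier = [start]
        while frontier:
            node = frontier.pop()
            for neighbor in adjacency[node]:
                if neighbor not in burned:
                    burned.add(neighbor)
                    component.add(neighbor)
                    frontier.append(neighbor)
        # No new nodes are discoverable: the component is complete
        components.append(component)
    return components


# Small test network (assumed): a five-node component and a separate triangle
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"),
         ("f", "g"), ("f", "h"), ("g", "h")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

components = burn_components(adjacency)
sizes = sorted((len(c) for c in components), reverse=True)
# Component size distribution: fraction of components that have size n
csd = {n: count / len(components) for n, count in Counter(sizes).items()}
print(sizes)  # [5, 3]
print(csd)
```

Every node is burned exactly once and every link is inspected from both of its ends, so the running time grows linearly with the number of nodes and links.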

Example 5: Decomposing Components and Computing Size:

import networkx as nx

G = nx.read_edgelist("example_network.txt")
G.add_edges_from([('f', 'g'), ('f', 'h'), ('g', 'h')])
# Returns a list with the set of nodes in each component:
# [{'a', 'b', 'c', 'd', 'e'}, {'f', 'g', 'h'}]
components = list(nx.connected_components(G))
comp_size = [len(i) for i in components]  # [5, 3]

# Builds a list of graph objects that can be analyzed further
components = [G.subgraph(c).copy() for c in nx.connected_components(G)]
comp_clustering_1 = nx.average_clustering(components[0])  # 0.4333333...

When considering the properties of the giant component in
Fig. 2, we notice that while two nodes may have the same number
of nearest neighbors, their overall placement in the network may
differ dramatically. One may be located on the outskirts of the giant
component, while the other may be close to the center of the
component. Aside from qualitative observations, we can calculate
the centrality of a node to quantify the difference in placement in
the larger network [18]. Here, we will discuss two centrality measures,
the closeness centrality and the betweenness centrality
[18]. Although the two centrality measures are based on a calculation
of the shortest paths in the giant component, they identify
different types of nodes as central.
Closeness Centrality is based on a calculation of the average
shortest path from each node to every other node. Since the shortest
path between two nodes residing in different components is
infinite, we will only focus on nodes in the giant component in our
calculation. We define closeness centrality as:

$$
\mathrm{Closeness}_i = \frac{1}{n} \sum_{j \neq i} \frac{1}{\mathrm{dist}_{ij}},
\tag{8}
$$

where distij indicates the length of the shortest path between
nodes i and j, and n is the number of other nodes j included in the
sum. We may calculate the shortest path between two
nodes by using the functions presented in Example 6. In the
definition in Eq. 8, an infinite distance between two nodes (i.e.,
two nodes in different components) will not contribute to the
closeness score, since its inverse is zero. Also, this definition shows
that a node with short paths to other nodes gets a larger closeness
centrality score than a node that is, in a sense, “far” from the other
nodes. Consequently, the intuitive interpretation of the closeness
centrality score is that nodes that are localized in parts of the
network with highest density
will receive the largest closeness centrality values [18].
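The following plain-Python sketch evaluates the closeness score as a sum of inverse shortest-path lengths, which is how we read Eq. 8 given that infinite distances do not contribute. We assume that n is the number of other nodes in the network (N − 1); this choice reproduces the closeness values listed in Table 2 for the network of Fig. 1 when checked by hand. The function names and the five-node test network (the one built in Example 2) are our own.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path lengths from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(adjacency, i):
    """Sum of inverse distances from node i, divided by the number of other nodes."""
    dist = bfs_distances(adjacency, i)
    n = len(adjacency) - 1  # assumption: n counts all nodes other than i
    # Unreachable nodes never appear in dist, so they contribute nothing to the sum
    return sum(1.0 / d for node, d in dist.items() if node != i) / n


# Network of Example 2
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

closeness_scores = {node: closeness(adjacency, node) for node in adjacency}
print(closeness_scores["a"])  # 1.0
print(closeness_scores["b"])  # 0.75
print(closeness_scores["d"])  # 0.625
```

Note that the NetworkX function nx.closeness_centrality uses a different definition, the reciprocal of the average shortest-path length, so its values (for example 0.6666... for node b) differ from those of this sketch; nx.harmonic_centrality returns the unnormalized sum of inverse distances.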
Betweenness Centrality. This centrality measure is based on the
number of shortest paths that go through a particular node
[18, 21]. Using the calculation of the shortest path between all
node pairs in the network (there are N(N − 1)/2 different node pairs
when N is the number of nodes), we keep track, for every node, of
how many of the shortest paths go through it. We describe a
script for finding the betweenness centrality of each node in
Example 6 [21]. The betweenness centrality b_k shows the importance of
node k according to the number of shortest paths to node j that
pass through node k. To calculate the betweenness over all paths, a
new score is added to b_k for each node j, and the entire calculation is
repeated for each of the N nodes. The final b_k score is the
betweenness of node k [21]. We normalize the b_k score of each node by
dividing by 2(size of the component including k) − 1, the smallest
possible number of shortest paths that may pass through a node: at
every node (N − 1) shortest paths originate, and at every node
(N − 1) shortest paths terminate. Note that in a directed network,
the shortest path from node A to node B and the one from node B back
to node A are often not the same.
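The path-counting idea can be sketched without any library, as below. This is a simplified illustration under stated assumptions rather than the procedure of ref. [21]: only interior nodes of a path are counted (endpoints are excluded), a pair of nodes with several equally short paths shares its single unit of score among them, and the final normalization by (N − 1)(N − 2)/2 is one common choice. The edge list and variable names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected example network
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def bfs_counts(source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                sigma[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                # every shortest path to node extends to a shortest path to neighbor
                sigma[neighbor] += sigma[node]
    return dist, sigma

info = {node: bfs_counts(node) for node in adjacency}
betweenness = {node: 0.0 for node in adjacency}

for s, t in combinations(sorted(adjacency), 2):
    dist_s, sigma_s = info[s]
    dist_t, sigma_t = info[t]
    if t not in dist_s:
        continue  # s and t are in different components
    for v in adjacency:
        if v in (s, t) or v not in dist_s or v not in dist_t:
            continue
        if dist_s[v] + dist_t[v] == dist_s[t]:
            # fraction of the shortest s-t paths that run through v
            betweenness[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]

n = len(adjacency)
normalized = {v: b / ((n - 1) * (n - 2) / 2) for v, b in betweenness.items()}
```

For this star-like network, only node a lies in the interior of any shortest path, so it is the only node with a nonzero score.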
Example 6: Shortest Path, Closeness Centrality,
and Betweenness Centrality:

import networkx as nx

G = nx.read_edgelist("example_network.txt")
nx.shortest_path_length(G, 'b', 'd') #2

distances_to_b = {}
for node in G:
    distances_to_b[node] = nx.shortest_path_length(G, 'b', node)
distances_to_b #{'a': 1, 'b': 0, 'c': 1, 'd': 2, 'e': 2}

#In NetworkX 2.0 and later, the all-pairs call returns an iterator of
#(node, distances) pairs, so we convert it to a dictionary
all_distances = dict(nx.shortest_path_length(G))
all_distances['a']['b'] #1
all_distances['b']['e'] #2

#nx.shortest_path() may be called with arguments similar to
#nx.shortest_path_length()
nx.shortest_path(G, 'b', 'e') #['b', 'a', 'e']

closeness_centrality_b = nx.closeness_centrality(G, 'b') #0.6666...
closeness_centrality = nx.closeness_centrality(G)
#{'a': 1.0, 'b': 0.6666..., 'c': 0.6666..., 'd': 0.5714286, 'e': 0.5714286}
bet_cent = nx.betweenness_centrality(G)
#{'a': 0.83333..., 'b': 0.0, 'c': 0.0, 'd': 0.0, 'e': 0.0}
edge_bet_cent = nx.edge_betweenness_centrality(G)
#{('a', 'b'): 0.3, ('a', 'c'): 0.3, ('a', 'd'): 0.4, ('a', 'e'): 0.4, ('c', 'b'): 0.1}

Note: The node at the end of each path is counted as a node on
the path.

Table 3
Properties of the protein interaction network of Mus musculus compared with four random network
models

Network     N      M      ⟨k⟩   ⟨1/L⟩(a)   ⟨C⟩(b)   ⟨CC⟩(c)   ⟨BC⟩(d)   LCS(e)
PIN(f)      1407   1579   2.2   0.002      0.033    0.047     3.212     766
ER(g)       1407   1579   2.2   0.001      0.001    0.095     4.271     1209
BA(h)       1407   2811   4.0   0.003      0.028    0.259     2.556     1407
Conf.(i)    1407   1579   2.2   0.001      0.008    0.128     2.519     1066

ER is short for the Erdös–Rényi model, BA is Barabási–Albert model, and Conf. is the configuration model (see sect. 4)
(a) L: Shortest path
(b) C: Clustering coefficient
(c) CC: Closeness centrality
(d) BC: Betweenness centrality
(e) LCS: Largest component size of network
(f) Protein–protein interaction network in Mus musculus
(g) Erdös–Rényi model
(h) Barabási–Albert model
(i) Configuration model

In order to illustrate the clustering, closeness centrality, and
betweenness centrality measures, we have created a cartoon
network (see Fig. 3) on which we will demonstrate these ideas. The test
network consists of eight nodes connected by 13 links, giving rise to
an average degree of $\langle k \rangle = \frac{2M}{N} = \frac{2 \cdot 13}{8} = 3.25$.
Note that this does
not imply that the majority of the nodes necessarily have their
degree at this value. On visual inspection, we immediately
determine that only nodes 1, 7, and 8 have a degree of 3. Table 1 shows
the degree distribution and the cumulative degree distribution for
this network. In Table 2, we show the degree, the clustering, the
closeness centrality, and the betweenness centrality for each of the
nodes. In this table, node 4 has the largest betweenness centrality
and nodes 4 and 5 have the largest closeness centrality. This result
should not be surprising, given the particular nature of the example
network and the definitions of betweenness and closeness centrality.
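The relation between the number of links and the average degree can be checked with a few lines of plain Python. The sketch below uses a hypothetical network with eight nodes and 13 links; it is not the network of Fig. 3, whose link list is not reproduced here. The cumulative distribution is assumed to mean the fraction of nodes with degree k or larger.

```python
from collections import Counter

# Hypothetical undirected network with 8 nodes and 13 links (not the Fig. 3 network)
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5),
         (4, 5), (4, 6), (5, 6), (5, 7), (6, 7), (7, 8)]

# Each link adds one to the degree of both of its end nodes
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n_nodes = len(degree)
n_links = len(edges)
average_degree = 2 * n_links / n_nodes  # <k> = 2M/N

# Degree distribution: fraction of nodes having each degree value
counts = Counter(degree.values())
degree_distribution = {k: counts[k] / n_nodes for k in sorted(counts)}

# Cumulative distribution (assumed definition): fraction of nodes with degree >= k
cumulative_distribution = {
    k: sum(p for kk, p in degree_distribution.items() if kk >= k)
    for k in degree_distribution
}
```

Here the average degree is 3.25 although every individual node necessarily has an integer degree, which illustrates why the average alone says little about how the degrees are distributed.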
We have applied the analysis tools just described to the PIN of
M. musculus [13], and we report some of the global network
characteristics in Table 3. This table also contains results for the
four random graph models, which we will discuss in detail in sect. 4,
with parameters chosen for these models to provide a relevant
comparison to the M. musculus PIN. For now, it suffices to observe
that while many of the models are able to capture aspects of the
statistical properties of the M. musculus PIN, none of the presented
models are fully able to explain the global properties of this PIN.
This marks an important issue in the study of complex networks:
the difference between characterization and analysis of existing
networks and the design of accurate models from basic principles.
We will discuss the differences between the PIN and these models
in sect. 4.

Fig. 4 The metabolic network of Yersinia pestis [22]. This network has a total of 841 nodes and 2810 links and
consists of 4 connected components. The largest connected component has 835 nodes

Fig. 5 Degree distribution of the metabolic network of Yersinia pestis

Table 4
Comparing metabolic network in Yersinia pestis (YP) [22] with four random network models

Network         N     M      ⟨k⟩   ⟨1/L⟩   ⟨C⟩     ⟨CC⟩    ⟨BC⟩    ⟨LCS⟩
YP metabolism   841   2810   6.7   0.008   0.242   0.373   1.911   835
ER              841   2810   6.7   0.008   0.008   0.283   2.379   841
BA              841   1679   4.0   0.005   0.043   0.278   2.426   841
Conf.           841   2810   6.7   0.007   0.112   0.341   2.060   841
DD              841   2394   5.7   0.007   0.299   0.143   2.614   643

ER is Erdös–Rényi model, BA is Barabási–Albert model, Conf. is configuration model

3.2 Metabolic Networks

Metabolism is the set of chemical reactions in a cell that produces
the building blocks and supplies the energy necessary for the
cell to live and replicate. Here, we will consider cellular metabolism
as a network consisting of metabolites (nodes) that are linked by
reactions [13], and we will use the network concepts discussed
above for its analysis. In order to represent the cellular metabolism
as a network, it is necessary to decide what properties a node and a
link should represent. The most common choice is for a metabolite
to be represented by a node, and for a link between two nodes
to signify that the two metabolites are listed together as a
substrate–product pair in a chemical reaction [5, 13]. For instance,
using this scheme, the two reactions
A + B → C and B → D
are represented as the four nodes A, B, C, and D with the three
connections A–C, B–C, and B–D. In this common representation,
the directionality of the reactions is often not included, resulting in
an undirected network. Other metabolic representations are dis-
cussed elsewhere [5]. As an example network for our analysis, we
use the genome-scale metabolic reconstruction of the bacterium
Yersinia pestis (YP) [22], the causative agent of bubonic plague.
Fig. 4 shows a representation of the structure of this metabolic
network, and Fig. 5 its degree distribution. In Table 4, we present
the global statistics for this network. Note that the average degree,
the average clustering, and the average closeness centrality scores
are much larger in the YP metabolic network compared to the
M. musculus PIN. In agreement with the visual representation of
the networks, this suggests that many more nodes in the YP meta-
bolic network are in highly clustered regions of the network.
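The substrate–product representation described in this section can be sketched in a few lines of plain Python. The snippet below uses the two reactions A + B → C and B → D from the text; the data layout (a list of substrate and product lists) and the variable names are assumptions made for illustration only.

```python
# Each reaction is written as (list of substrates, list of products)
reactions = [
    (["A", "B"], ["C"]),  # A + B -> C
    (["B"], ["D"]),       # B -> D
]

# Every substrate-product pair within a reaction becomes one undirected link;
# a frozenset ignores direction and the set removes repeated pairs
links = set()
for substrates, products in reactions:
    for substrate in substrates:
        for product in products:
            if substrate != product:
                links.add(frozenset((substrate, product)))

nodes = sorted({metabolite
                for substrates, products in reactions
                for metabolite in substrates + products})
link_list = sorted(tuple(sorted(link)) for link in links)
```

Note that two substrates of the same reaction (here A and B) are not linked to each other under this scheme; only substrate–product pairs are.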

3.3 Gene Co-expression Networks

Gene expression is the process by which the information contained
in a gene manifests itself in a functional gene product. This process
is critical in linking the emergence of phenotype (physical traits) to
genotype (hereditary information). The significance of this process
becomes readily apparent by looking at the development of com-
plex multicellular organisms from single stem cells. Such an organ-
ism grows through duplication of these stem cells—sequence
mutation in this process being the exception rather than the
norm—and as a result, the genetic information contained in each
cell remains the same. The genetic content of an organism’s cells
being the same, the differentiation of cells into widely different and
highly specialized tissue types is instead the consequence of differ-
ences in how this information is expressed. Studying how certain
genes may be silenced, or activated, depending on functional dif-
ferences in cells or changes in environmental conditions is therefore
key to understanding their role.
The importance of understanding gene expression has led to
the development of the field of gene co-expression networks, by
which we seek to understand gene roles by looking at how their
expression depends on each other. In such a network, nodes repre-
sent specific genes, while a link represents a similarity between the
expression profile of a pair of genes—typically, the correlation of
mRNA or protein expression values.
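A minimal sketch of how such a network might be built from expression data is given below. The expression values are made up, the use of the absolute Pearson correlation as the similarity measure and the cutoff of 0.9 are assumptions for illustration, and real analyses typically involve many more samples as well as a statistically motivated threshold.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression profiles: one value per sample for each gene
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "gene3": [5.0, 4.1, 2.9, 2.2, 0.8],
    "gene4": [1.0, 3.0, 1.0, 3.0, 1.0],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)

threshold = 0.9  # arbitrary cutoff chosen for this illustration

# Link two genes when their profiles are strongly (anti-)correlated
links = []
for gene_a, gene_b in combinations(sorted(expression), 2):
    r = pearson(expression[gene_a], expression[gene_b])
    if abs(r) >= threshold:
        links.append((gene_a, gene_b))
```

With these values, gene1, gene2, and gene3 end up mutually linked, while gene4, whose profile is essentially uncorrelated with the others, remains an isolated node.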
While the base paradigm of systems biology is to provide a big
picture analysis of a given biological system and the interactions
between elements, biological networks often contain smaller
sub-networks whose nodes are significantly more densely connected to
each other than they are to the rest of the network. In network
terminology, these sub-networks are commonly referred to as com-
munities (or interchangeably, modules). As these communities often
reflect biologically important relationships, properly identifying
them is of great interest to systems biologists. Note that unlike
components, which, as mentioned earlier, are entirely distinct struc-
tures which do not connect to nodes outside them, different com-
munities in a given network may connect to each other through
intermediary edges. The term community structure denotes the
extent to which a network consists of communities of nodes
which are densely connected internally, but less so toward nodes
in other communities. A common measure to quantify the commu-
nity structure of a network is the modularity [23],
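$$ Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \qquad (9) $$
where A is the adjacency matrix, $k_i = \sum_j A_{ij}$ and $k_j = \sum_i A_{ij}$ are the
degrees of nodes i and j, respectively, $m = \frac{1}{2} \sum_{ij} A_{ij}$ is the number of
links, and $\delta(c_i, c_j) = 1$ if i and j belong to the same community
($c_i = c_j$) and 0 otherwise ($c_i \neq c_j$).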
As the modularity Q is defined from an existing community
decomposition, Eq. 9 merely tells us the quality of a given decomposition,
but not how to identify communities in the first place. Generally,
one wishes to divide the network such that Q is maximized.
Hypothetically, one could identify the optimal module decomposition by
simply trying all possible module decompositions, and retaining the
one with the highest modularity. However, the computational cost
associated with this approach for even modestly sized networks
forces us to use heuristic algorithms to propose solutions. While a
variety of such methods exist, we present the Louvain method [17]
in Example 7. This method performs module decomposition by a
repeated two-step process: first, we iteratively reassign nodes to
neighboring communities if this increases Q (with each node
initially starting in a separate community); second, we iteratively merge
neighboring communities if this further increases Q. These two
steps are again repeated until Q reaches a maximum steady-state
value.

Fig. 6 Community structure of the co-expression network of transcription factors
in human and chimpanzee brains
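To make the modularity measure concrete, the sketch below evaluates Q for a given partition of a small network in plain Python. It takes m to be the number of links, so that the sum of all adjacency-matrix entries equals 2m; the edge list, the partition, and the function name are hypothetical and serve only as an illustration of the formula, not as a community detection method.

```python
# Hypothetical undirected network (no repeated links) and a candidate partition
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"),
         ("e", "f"), ("e", "g"), ("e", "h"), ("g", "h")]
partition = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1, "g": 1, "h": 1}

def modularity(edge_list, community_of):
    """Evaluate the modularity Q of a partition by direct summation over node pairs."""
    nodes = sorted(community_of)
    adjacency = {(u, v): 0 for u in nodes for v in nodes}
    for u, v in edge_list:
        adjacency[(u, v)] = 1
        adjacency[(v, u)] = 1
    degree = {u: sum(adjacency[(u, v)] for v in nodes) for u in nodes}
    m = len(edge_list)  # number of links (assumed convention)
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community_of[i] == community_of[j]:  # the delta function
                q += adjacency[(i, j)] - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)

q_two_communities = modularity(edges, partition)
q_single_community = modularity(edges, {node: 0 for node in partition})
```

Placing all nodes in one community always gives Q = 0 under this definition, so a positive value for the two-community split indicates more within-community links than expected from the node degrees alone.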

Example 7: Community Detection:

import networkx as nx
import community #requires https://github.com/taynaud/python-louvain/

G = nx.read_edgelist('example_network.txt')
G.edges() #[('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'e'), ('b', 'c')]
G.add_edges_from([('e', 'f'), ('e', 'g'), ('e', 'h'), ('g', 'h')])
communities = community.best_partition(G)
#e.g., {'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 1, 'f': 1, 'g': 1, 'h': 1}
#(the numbering of the communities may differ between runs)

Fig. 6 shows the results of community decomposition on a
gene co-expression network for transcription factors in human
and chimpanzee brains. The algorithm identifies three separate
communities, with Q = 0.359. While a human viewer might be
inclined to see the two communities on the left as a single unit,
merging them actually reduces the modularity slightly, to
Q = 0.353, indicating that the decomposition shown in Fig. 6 is
in fact the better one as measured by modularity.

3.4 Sources of Datasets

We have downloaded the protein interaction network of Mus musculus
from the Biological General Repository for Interaction Datasets
(BioGRID) database [20] (http://www.thebiogrid.org).
Currently, it has more than 330,000 protein and genetic interactions
from major model organism species.

Our metabolic network example is for Yersinia pestis [22],
available for download at https://www.ntnu.edu/almaaslab.
The transcription factor co-expression network is presented
in [24].

Fig. 7 Average degree distribution for 100 networks generated using the ER
model (G(N,M)). N = 10,000 and M = 2,499,750

4 Random Graph and Network Models

Here we introduce and describe the four models for generating
networks that we used in the previous sections as reference systems
to our real biological networks. As we previously have alluded to, it
is of great interest to be able to devise tractable models for the
networks: the models may generate important insights regarding
key mechanisms responsible for the statistical properties of a net-
work [6, 8, 18]. The four models we will describe have been central
to developing our current understanding of biological networks’
properties. Note that there are some general similarities between
the models: the ER and Conf models are both based on a static
description of a network, where a fixed number of nodes are
connected through a random process. In contrast, the BA model
is based on growing networks, or iteratively adding nodes to a
starting cluster of nodes, using a combination of rules and random
decisions. Thus, it is relatively easy to get the exact same number of
nodes and links in the ER and Conf models as in the empirical
network data for the PIN or the YP metabolism.

4.1 Erdös–Rényi (ER) Model

What is often called the ER model comes in two flavors with
somewhat different properties. One approach is based on using a
fixed number of links to connect randomly chosen nodes (the
G(N,M) model [9]), and the other is based on randomly choosing
whether to connect two nodes while going through all the possible node
pairings (the G(N,p) model [25]). If the parameters are properly
selected, the difference between the resulting network generation
methods is small for large networks. By choosing $p = \frac{2M}{N(N-1)}$, a
network produced from the G(N,p) approach will on average have
the same number of links as that from the G(N,M) approach when
the number of nodes N is large [18]. Note that the degree
distribution from the ER model is approximately a binomial distribution
that approaches a Poisson distribution when the number of nodes
N is large.
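The correspondence between the two flavors can be checked numerically. The sketch below, written in plain Python, chooses p from a target number of links M and then counts the links obtained in repeated G(N,p) draws; the network size, the number of repetitions, the random seed, and the function name are arbitrary assumptions made for illustration.

```python
import random

def gnp_link_count(n, p, rng):
    """Count the links in one G(N,p) realization without storing the graph."""
    count = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if rng.random() < p:  # each node pair is linked independently
                count += 1
    return count

n_nodes = 200
m_links = 1000  # target number of links, as in a G(N,M) network

# Probability that reproduces M links on average: p = 2M / (N(N - 1))
p = 2 * m_links / (n_nodes * (n_nodes - 1))

rng = random.Random(1)  # fixed seed so that the run is repeatable
samples = [gnp_link_count(n_nodes, p, rng) for _ in range(50)]
mean_links = sum(samples) / len(samples)
```

Individual realizations scatter around M, but their mean stays close to the target, whereas a G(N,M) network contains exactly M links by construction.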

4.1.1 G(N,M)

Erdös and Rényi introduced the G(N,M) model [9, 26]. This
model chooses, randomly and with equal probability, a graph from
the collection of all graphs with N nodes and M links. Starting with a
collection of N nodes that contain no links, we add M links in a
step-wise process. In each iteration, a pair of nodes among the
N nodes is selected to be connected by a link. This is repeated
M times [18]. We have referred to this model as the ER model in
this chapter. In Fig. 9a, we show an example network resulting from
this model with 200 nodes and 1000 links, and a degree
distribution plot of this model is shown in Fig. 7.

4.1.2 G(N,p)

The G(N,p) model was proposed by Edgar Gilbert in 1959 [25]. In
this model, each pair of nodes, of which there are $\binom{N}{2}$ different
combinations, is connected with a probability p. Note that the
process of connecting a particular pair is independent of the
outcome of this decision in other pairs. The resulting graph has
N nodes, and the probability of obtaining any one particular graph
with M links is given by $p^M (1-p)^{\binom{N}{2}-M}$ [6]. This model is very
similar to the ER model (and
the terms are sometimes used interchangeably, for instance by
NetworkX); the difference is that while an ER/G(N,M)
network will always contain the specified number of edges M, the
number of edges in a given G(N,p) network varies stochastically
around the expected value.
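How the number of links varies in G(N,p) can be made explicit with a short calculation. A specific graph with M links has probability p^M (1 − p)^(C − M), where C = N(N − 1)/2 is the number of node pairs; multiplying by the number of distinct graphs with M links gives a binomial distribution for the link count. The sketch below evaluates this in plain Python (math.comb requires Python 3.8 or later); the parameter values and function names are assumptions for illustration.

```python
from math import comb

def prob_specific_graph(n, p, m):
    """Probability of one particular graph with m links in G(N,p)."""
    pairs = comb(n, 2)
    return p ** m * (1 - p) ** (pairs - m)

def prob_link_count(n, p, m):
    """Probability that a G(N,p) graph has exactly m links (binomial)."""
    pairs = comb(n, 2)
    # number of distinct graphs with m links times the probability of each
    return comb(pairs, m) * prob_specific_graph(n, p, m)

n_nodes = 10
pairs = comb(n_nodes, 2)  # 45 possible links
p = 20 / pairs            # chosen so that 20 links are expected

distribution = [prob_link_count(n_nodes, p, m) for m in range(pairs + 1)]
total_probability = sum(distribution)
expected_links = sum(m * q for m, q in enumerate(distribution))
```

The distribution sums to one and its mean equals p times the number of node pairs, which is the expected value around which the link count of individual G(N,p) networks fluctuates.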

Example 8: Erdös–Rényi and G(N, p) Models:

import networkx as nx
import random

n = 10 #number of nodes

#Erdos-Renyi model

m = 20 #number of edges
G = nx.Graph()
G.add_nodes_from(range(n))

while G.number_of_edges() < m:
    node1 = random.randint(0, n-1)
    node2 = random.randint(0, n-1)
    #in case of self-loops or already existing edges, draw a new pair,
    #so that the network ends up with exactly m edges
    if (node1 != node2) and not G.has_edge(node1, node2):
        G.add_edge(node1, node2)

#G(N, p) model

p = float(m)/(float(n*(n-1))/2) #translate desired number of edges to probability of drawing an edge
G = nx.Graph()
G.add_nodes_from(range(n))

for node1 in range(n-1):
    for node2 in range(node1+1, n):
        if (random.random() < p):
            G.add_edge(node1, node2)

G = nx.erdos_renyi_graph(n, p) #Equivalent use of the built-in function - note that inputs are similar to the Gilbert model rather than Erdos-Renyi

Fig. 8 Average degree distribution of 100 networks generated using the BA
model with N = 10,000, n0 = 3, and m_step = 2

Fig. 9 Visual comparison of the three network models: (a) Erdös–Rényi (ER), (b) Barabási–Albert (BA), and
(c) configuration (conf) using the degree distribution from Fig. 3. Individual node degrees are indicated with
color code: Black: degree 0–8, Gray: 9–16, White: >16

4.2 Barabási–Albert (BA) Model

This model was introduced by Barabási and Albert [10] with the
aim of generating random networks with a degree distribution that
was a power law. This was motivated by the observation that many
networks are characterized by a long-tailed, or scale-free, degree
distribution [6, 10]. This is in clear contrast to the properties of the
canonical ER model. The BA model implements the principle of
preferential attachment to grow [10]: the probability of a new node
to connect to existing nodes is proportional to the degree of the
nodes. Thus, nodes with many neighbors have an increased chance
of getting new neighbors. We state this probability mathematically
as
$$ \Pi_i = \frac{k_i}{\sum_{j=1}^{N} k_j}. \qquad (10) $$
The networks resulting from the BA model have power-law
degree distributions $P(k) \propto k^{-\alpha}$, with α = 3. Note that this exponent
is independent of the parameters in the model [10]. Example 9
describes how to generate networks using the BA prescription.
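The attachment probability can be illustrated on a tiny network in plain Python. The sketch below computes the probability for each node as its degree divided by the sum of all degrees, and compares it with the frequencies obtained by picking a random end of a random link, a sampling trick that selects nodes in proportion to their degree. The four-node network, the number of draws, and the seed are hypothetical choices made for illustration.

```python
import random
from collections import Counter

# Hypothetical existing network that a new node is about to attach to
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Preferential attachment: probability proportional to the degree of the node
total_degree = sum(degree.values())
attachment_probability = {node: k / total_degree for node, k in degree.items()}

# Sampling check: a node of degree k appears as an end point of k links,
# so a random end of a random link is drawn with probability k / total_degree
rng = random.Random(0)
n_draws = 20000
draws = Counter(rng.choice(rng.choice(edges)) for _ in range(n_draws))
empirical_frequency = {node: draws[node] / n_draws for node in degree}
```

Node 0, which has three of the eight link ends, is expected to receive a new link three times as often as node 3, which has only one.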

Example 9: Barabási–Albert (BA) Model:

import networkx as nx
import random

G = nx.Graph()

n0 = 10 #initial number of nodes
n = 100 #final number of nodes
k = 3 #number of links added with each new node

#Start from a clique of n0 nodes
for i in range(n0):
    G.add_node(i)
for node1 in range(n0-1):
    for node2 in range(node1+1, n0):
        G.add_edge(node1, node2)

for node1 in range(n0, n):
    G.add_node(node1)
    for i in range(k):
        #Selecting a random end of a random edge is equivalent to drawing
        #a node with probability proportional to its degree
        node2 = list(G.edges())[random.randint(0, G.number_of_edges()-1)][random.randint(0, 1)]
        #Avoids self-loops and linking the same two nodes twice
        while (node2 == node1) or G.has_edge(node1, node2):
            node2 = list(G.edges())[random.randint(0, G.number_of_edges()-1)][random.randint(0, 1)]
        G.add_edge(node1, node2)

G = nx.barabasi_albert_graph(n, k) #Almost equivalent built-in generator. Differs from the original BA model in its initialization: the built-in starts from a small seed graph (k unconnected nodes in older NetworkX versions), rather than an n0-clique

Here, n0 is the initial number of nodes, and the algorithm adds
a node in each step with m_step links (k in Example 9) between this
node and previous nodes (these nodes are selected at random, with
probability proportional to their degree). A graph
generated with this algorithm that has 200 nodes and 998 links
(n0 = 3 and m_step = 5) is presented in Fig. 9b. A degree
distribution of this model is shown in Fig. 8. Note that variations of the
preferential attachment model exist, for which several of the
degree-distribution properties may be tuned [27–31].

4.3 Configuration Model

The configuration model makes it possible to generate a random
graph with prescribed number of nodes, links, and degree sequence
[8, 11, 12]. We begin with a sequence of degrees k_i of N nodes
according to P_k. Note that the sum of degrees has to be an even
number.
The algorithm will match pairs of nodes by selecting pairs of
link-stubs according to the degree sequence. The constraints are that
(1) the two stubs to be connected to form a link have to be from
two different nodes, and (2) two nodes already connected by a link
cannot be connected by another pair of stubs [12].
A cautious reader may notice that these constraints are not
compatible with every given degree sequence, even if the sum of
degrees is even. Imagine, for instance, that one wants to create a
network of 100 nodes with the following degree sequence:
k_i = 1 for 0 ≤ i ≤ 97
k_i = 99 for i ∈ [98, 99]
The requirements for i ∈ [98, 99] would imply that both of
these two nodes connect to all other nodes (i ≤ 97) in the network,
as well as to each other. However, since connections go both ways,
all other nodes must also connect to both nodes 98 and 99, yielding
k_i ≥ 2 for i ≤ 97. This is obviously incompatible with the desired
degree sequence, even if ∑k_i = 98 + 2 × 99 = 296, which is even.
These issues may be redressed by approximating the configuration
model, that is, by removing any self-loops and merging duplicated edges.
For many degree sequences (in particular those corresponding to
sparse networks), such modifications have a minimal impact.
Alternately, one may ensure that the degree distribution is valid by taking
it from a preexisting network. Indeed, one common use of the
configuration model is to randomly re-wire networks without
modifying the degree distribution, in order to determine to which
extent other measures of network structure (for instance, clustering
or average path length) are determined by the network's degree
distribution.
In Fig. 9c, we show an example network resulting from this
model containing 200 nodes and 59 links.
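Whether a degree sequence can be realized at all by a network without self-loops or duplicated links can be tested before any stubs are matched. The sketch below applies the Erdős–Gallai condition, a standard graph-theoretical criterion that is not part of this chapter's scripts and is brought in here only as an illustration; the function name is hypothetical. It is applied to the 100-node sequence discussed above.

```python
def is_graphical(degree_sequence):
    """Erdos-Gallai test: can the sequence be realized by a simple graph?"""
    degrees = sorted(degree_sequence, reverse=True)
    if any(d < 0 for d in degrees) or sum(degrees) % 2 != 0:
        return False  # negative degrees or an odd number of stubs
    n = len(degrees)
    for k in range(1, n + 1):
        # The k largest degrees can be served by at most k(k - 1) stubs
        # among themselves plus min(d, k) stubs from each remaining node
        left = sum(degrees[:k])
        right = k * (k - 1) + sum(min(d, k) for d in degrees[k:])
        if left > right:
            return False
    return True

# The sequence from the text: 98 nodes of degree 1 and two nodes of degree 99
problem_sequence = [1] * 98 + [99] * 2
has_even_sum = sum(problem_sequence) % 2 == 0
problem_is_graphical = is_graphical(problem_sequence)

# A triangle, for comparison, is trivially realizable
triangle_is_graphical = is_graphical([2, 2, 2])
```

The test confirms that an even degree sum is necessary but not sufficient: the sequence above has an even sum of 296 and is nevertheless rejected.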

Example 10: Configuration Model:

import networkx as nx
import random

G = nx.Graph()
n = 100
deg_dist_source_net = nx.barabasi_albert_graph(n, 3)
#Create valid degree sequence (NetworkX 2.x syntax)
degree_sequence = [d for node, d in deg_dist_source_net.degree()]
random.shuffle(degree_sequence) #Not strictly necessary, simply randomizes node names

stubs = [] #Initialize pool of stubs
for i in range(len(degree_sequence)):
    for j in range(degree_sequence[i]):
        stubs.append(i)
for i in range(len(degree_sequence)):
    G.add_node(i)
nb_of_edges = len(stubs)//2

#Loop will stall if given invalid degree sequence (and may also stall
#if the last remaining stubs cannot be validly paired)
for i in range(nb_of_edges):
    node1 = random.choice(stubs) #Selects one random stub from pool
    node2 = random.choice(stubs) #Selects second random stub from pool
    while ((node2 == node1) or G.has_edge(node1, node2)):
        node2 = random.choice(stubs) #ensures no self-loops or duplicated edges
    G.add_edge(node1, node2) #Adds edge for chosen stubs
    stubs.remove(node1) #Removes one stub from pool
    stubs.remove(node2) #Removes one stub from pool

G = nx.configuration_model(degree_sequence) #Creates multigraph (potential for self-loops and duplicated edges) using built-in function
G = nx.Graph(G) #Removes duplicated edges
G.remove_edges_from(list(nx.selfloop_edges(G))) #Removes self-loops
#Resulting G is approximately equivalent to a configuration model
#network for the given degree sequence

References
1. Kitano H (2002) Systems biology: a brief overview. Science 295:1662–1664
2. Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genom Hum G 2:343–372
3. Bruggeman FJ, Westerhoff HV (2007) The nature of systems biology. Trends Microbiol 15:45–50
4. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5:101–113
5. Almaas E (2007) Biological impacts and context of network theory. J Exp Biol 210:1548–1558
6. Albert R, Barabasi AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
7. Dorogovtsev SN, Mendes JFF (2002) Evolution of networks. Adv Phys 51:1079–1187
8. Newman MEJ (2003) The structure and function of complex networks. Siam Rev 45:167–256
9. Erdős P, Rényi A (1959) On random graphs. I. Publ Math 6:290–297
10. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
11. Molloy M, Reed B (1995) A critical-point for random graphs with a given degree sequence. Random Struct Algor 6:161–179
12. Newman MEJ, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Phys Rev E 2001:6402
13. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL (2000) The large-scale organization of metabolic networks. Nature 407:651–654
14. Hagberg AA, Schult DA, Swart PJ (2008) Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux G, Vaught T, Millman J (eds) Proceedings of the 7th Python in science conference (SciPy2008). SciPy, Pasadena, CA, pp 11–15
15. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504
16. Batagelj V, Mrvar A (2002) Pajek – analysis and visualization of large networks. Graph Drawing 2265:477–478
17. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre R (2008) Fast unfolding of communities in large networks. J Stat Mech Theor Exp 10:P10008
18. Newman M (2010) Networks: an introduction. Oxford University Press, New York, NY
19. Watts DJ, Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393:440–442
20. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34:D535–D539
21. Newman MEJ (2001) Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys Rev E 2001:6401
22. Navid A, Almaas E (2009) Genome-scale reconstruction of the metabolic network in Yersinia pestis, strain 91001. Mol Biosyst 5:368–375
23. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci U S A 103(23):8577–8582
24. Nowick K, Gernat T, Almaas E, Stubbs L (2009) Differences in human and chimpanzee gene expression patterns define an evolving network of transcription factors in brain. Proc Natl Acad Sci 106(52):22358–22363
25. Gilbert EN (1959) Random graphs. Ann Math Stat 30:1141–1144
26. Erdos P, Renyi A (1960) On the evolution of random graphs. B Int Stat Inst 38:343–347
27. Krapivsky PL, Redner S, Leyvraz F (2000) Connectivity of growing random networks. Phys Rev Lett 85:4629–4632
28. Dorogovtsev SN, Mendes JFF, Samukhin AN (2000) Structure of growing networks with preferential linking. Phys Rev Lett 85:4633–4636
29. Albert R, Barabasi AL (2000) Topology of evolving networks: local events and universality. Phys Rev Lett 85:5234–5237
30. Amaral LAN, Scala A, Barthelemy M, Stanley HE (2000) Classes of small-world networks. Proc Natl Acad Sci U S A 97:11149–11152
31. Dorogovtsev SN, Mendes JFF (2000) Evolution of networks with aging of sites. Phys Rev E 62:1842–1845

Chapter 10

Prokaryotic Genome Annotation


Jeffrey A. Kimbrel, Brendan M. Jeffrey, and Christopher S. Ward

Abstract
In the last decade, the high-throughput and relatively low cost of short-read sequencing technologies have
revolutionized prokaryotic genomics. This has led to an exponential increase in the number of bacterial and
archaeal genome sequences available, as well as a corresponding increase in the number of genome assembly
and annotation tools developed. Together, these hardware and software technologies have given scientists unprecedented
options to study their chosen microbial systems without the need for large teams of bioinformaticists or
supercomputing facilities. While these analysis tools largely fall into only a few categories, each may have
different requirements, caveats and file formats, and some may be rarely updated or even abandoned. And
so, despite the apparent ease in sequencing and analyzing a prokaryotic genome, it is no wonder that the
budding genomicist may quickly become overwhelmed. Here, we aim to provide the reader with an
overview of genome annotation and its most important considerations, as well as an easy-to-follow protocol
to get started with annotating a prokaryotic genome.

Key words Genome annotation, Prokaryote sequencing, Gene prediction, Structural annotation,
Functional annotation

1 Introduction

Genome annotation is the process of identifying and assigning
function to genomic features in order to generate a blueprint for
the potential roles and capabilities of an organism. Today, this
process is almost entirely an automated, hands-free endeavor. This
is possible due to the wealth of genomic data that is already
available, and an annotation pipeline can be as simple as identifying
orthologous genes from a closely related sister genome and
duplicating its gene annotations to the newly sequenced organism.
While this can be a rapid way to annotate a genome, errors can
easily propagate from genome to genome without experimental
validation, or without incorporating updated gene information
that was still unknown when the sister genome was last annotated.
The current pace of genome sequencing, however, makes it nearly
impossible to manually curate or experimentally validate
annotations in a new genome. Therefore, to avoid such pitfalls,
genomicists rely on an increasing number of general or specialized
databases for assigning functions to genes and genomes.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_10, © Springer Science+Business Media, LLC, part of Springer Nature 2022

Annotation can be separated into two main objectives: struc-
tural and functional predictions. Structural genome annotation
(not to be confused with protein structural annotation) is the
identification of the genomic landmarks such as the start and stop
coordinates of genes coding for proteins and transfer/ribosomal
RNAs. Functional annotation next takes the structural annotations,
extracts the sequence information, and assigns a category (meta-
bolic, enzymatic, signaling, transport, etc.). This usually involves
searching the gene sequences against a database designed for either
broad or specific functional annotation objectives.
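As a deliberately simplified illustration of what structural annotation produces, the sketch below scans the forward strand of a short made-up DNA sequence for stretches that begin with an ATG start codon and end at the next in-frame stop codon, and reports their coordinates. This is only a toy under stated assumptions (0-based coordinates, forward strand only, ATG as the sole start codon, an arbitrary minimum length); the function name is hypothetical, and dedicated gene prediction tools use far more elaborate models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_forward_orfs(sequence, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Only the three forward reading frames are scanned.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for position in range(frame, len(sequence) - 2, 3):
            codon = sequence[position:position + 3]
            if start is None and codon == "ATG":
                start = position  # first start codon opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                if (position - start) // 3 >= min_codons:
                    orfs.append((start, position + 3, frame))
                start = None  # look for the next candidate in this frame
    return orfs

toy_sequence = "CCATGAAACCCGGGTAACC"  # made-up sequence for illustration
predicted_orfs = find_forward_orfs(toy_sequence)
```

The resulting coordinates are the kind of structural annotation that a functional annotation step would then take as input, extracting the corresponding sequence and comparing it against a database.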
Although the quality of functional annotation tools and data-
bases has greatly increased over the past few years, even the best
annotated genomes still suffer from a lack of completion due to
“orphan enzymes” (proteins with no known activity; [1]). There
are a few reasons for this, including less hand-curation and more
high-throughput computational annotation, as well as the simple
fact that not every gene/protein has a known functional role
in vivo. To the first point, certain sacrifices are necessarily made as
bioinformaticists try to develop software with the best compromise
between producing robust annotations yet giving useful results for
a wide taxonomic range of prokaryotes. Annotation pipelines are
often benchmarked against “gold standard” genomes such as
Escherichia coli strain K12, resulting in more complete genome
annotations for organisms more taxonomically similar to E. coli
and other Proteobacteria [2]. Further, there has been a steady
increase over the last few decades in the discovery of genes that
are conserved among diverse taxa (many validated with expression
data) yet without a known function. In fact, only recently has the
rate of assigning function to these unknown coding sequences
begun to outpace the discovery of new domains of unknown func-
tion (DUFs [3]).
With the increase in shotgun metagenomics and single-cell
sorting, modern prokaryotic genome sequencing has moved
beyond bacterial and archaeal isolates to also include members
from complex microbial communities. Genomes of individual com-
munity members within complex consortia can be obtained by
mechanically separating single cells using flow cytometry. Once
sorted, whole-genome amplification (WGA) and sequencing are
performed to obtain single amplified genomes (SAGs [4]). Alter-
natively, a metagenomic approach can be employed by sequencing
and assembling DNA from entire communities, with sequence
fragments from individual community members subsequently com-
putationally binned into metagenome-assembled genomes
(MAGs). Although there are many differences in the initial steps
of sequencing and assembling these very different sample types into
Prokaryotic Genome Annotation 195

raw genome sequences (see [5]), the subsequent annotation process


is largely similar whether working with an isolate genome, a SAG or
a MAG (see Note 1).
There are many reasons that scientists choose to have a genome
sequenced, from identifying the gene(s) responsible for a novel
function to building metabolic models entirely in silico. There is
an ever-growing list of specialized databases and classification
schemes for a broad range of functional genes. Even previously
sequenced prokaryotes or those sequenced and annotated by core
facilities will often need to be supplemented with further annota-
tions relevant to the researcher’s priorities and goals. Our aim with
this protocol is to introduce the concepts and challenges that
underlie prokaryotic structural and functional annotation, as these
are often largely hidden within the “black box” of software and
algorithms.

2 Materials

2.1 A High-Quality Genome

The success of genome annotation is dependent on a variety of
factors, but perhaps the most important is the quality of the initial
set of assembled genome sequences, or contigs. While genome
assembly quality is largely beyond the scope of this protocol (see
[6]), it is nonetheless an important and necessary step that must be
considered before attempting annotation. Genome quality is
assessed through metrics such as count and length distribution of
contigs, with better assemblies being those with most of the
genome contained in only a few, long contig sequences. These
metrics can vary greatly depending on the sequencing technology
used and the genomic content of the organism (e.g., long repetitive
regions are more difficult to assemble). The large increase in
sequenced genomes in the last decade has been driven by the use
of various “short read” technologies, particularly the Illumina
HiSeq and MiSeq platforms. Although these technologies have increased
the throughput of genome sequencing, assembling short reads into
genomes has resulted in high genome fragmentation making it
difficult, if not impossible, to move beyond draft genomes
[7]. A continued shift toward long-read sequencing would
bring high-throughput, finished genomes closer to reality [8].
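
The contig count and length distribution described above can be summarized with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the choice of N50 as the single summary of the length distribution is an assumption rather than part of this protocol. Dedicated tools such as QUAST (Table 1) report these and many more metrics.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths):
    """Summarize contig lengths: count, total size, longest contig, and N50."""
    if not lengths:
        return {"contigs": 0, "total_bp": 0, "longest_bp": 0, "n50_bp": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: length of the contig at which the cumulative sum of contig
    # lengths (longest first) reaches half of the assembly size.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest_bp": ordered[0],
        "n50_bp": n50,
    }
```

To use it on a contigs file, open the file and pass the handle to read_fasta_lengths, then pass the resulting list to assembly_stats. Fewer contigs and a larger N50 generally indicate a less fragmented assembly.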
As the sequencing of complex populations to obtain genomes
of individual community members (e.g., MAG and SAG genome
projects) is becoming more commonplace, ensuring that the com-
plete genome is represented in the sequence data with minimal
contamination has become an increasing concern. Contamination
in MAGs can occur from the mis-assembly of pooled short reads
from hundreds or thousands of organisms or from the incorrect
binning of contigs from closely related organisms or containing
laterally transferred genes. In SAGs, the necessary amplification of

Table 1
Genome quality and taxonomy assessment software

Program      Objective                  Reference
CheckM       Completion/contamination   [10]
proGenomes   Taxonomy                   [13]
AMPHORA      Phylogenomics              [11]
QUAST        Assembly quality control   [14]

low-abundance DNA in a nonspecific manner can result in much of


the DNA sample originating from a contaminant [9]. Completion
and contamination are often measured through the identification of
single-copy marker genes conserved in most prokaryotes; a complete
genome contains every gene in the marker set, while an incomplete
genome is missing some of them. Similarly, a contaminated genome has
multiple copies of some marker genes, whereas an uncontaminated genome
has each only once. Databases cataloging these marker genes are
somewhat variable, and the choice of marker gene set influences
genome quality confidence. A popular solution is to use lineage-specific
marker genes, which allow a more accurate determination of
genome quality, particularly for MAGs or SAGs from complex
microbial populations [10, 11]. Taking all of this into account,
standards have been recommended for genome quality assessment
based on the Minimum Information about a Genome Sequence
(MIGS) for isolates [12] by the Genomic Standards Consortium
(GSC), and Minimum Information about a Metagenome-
Assembled Genome (MIMAG) or Single Amplified Genome
(MISAG [5]). Some quality control approaches are listed in
Table 1.
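
The marker-gene logic described above can be illustrated with a short Python sketch. It is deliberately simplified and rests on stated assumptions: completeness is taken as the percentage of expected markers found at least once, and contamination as the number of extra marker copies relative to the size of the marker set. The function name and inputs are hypothetical, and tools such as CheckM use more elaborate, lineage-specific marker sets, so their numbers will differ.

```python
from collections import Counter


def marker_quality(marker_hits, marker_set):
    """Estimate completeness and contamination (both in percent).

    marker_hits: iterable of marker names, one entry per copy found in the genome.
    marker_set:  collection of marker names expected exactly once per genome.
    """
    expected = set(marker_set)
    if not expected:
        raise ValueError("marker_set must not be empty")
    # Count only hits that belong to the expected marker set.
    counts = Counter(name for name in marker_hits if name in expected)
    found = sum(1 for name in expected if counts[name] >= 1)
    extra_copies = sum(counts[name] - 1 for name in expected if counts[name] > 1)
    completeness = 100.0 * found / len(expected)
    contamination = 100.0 * extra_copies / len(expected)
    return completeness, contamination
```

For example, a genome with three of four expected markers, one of which is present twice, would be scored as 75% complete and 25% contaminated under these simplified definitions.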
The pipeline detailed here begins simply with a FASTA-
formatted nucleotide sequence of a prokaryotic genome (see Note
2). Here, we treat genome contigs similarly whether from an
isolated organism (complete or draft assembly) or from a SAG or
MAG [5]. As an example, in this protocol we use Marinobacter
sp. PT19DW, isolated from cultures with the microalga
Phaeodactylum tricornutum. This genome is publicly available at
the Joint Genome Institute’s Integrated Microbial Genomes website
(https://img.jgi.doe.gov) by searching for taxon ID
2751186041, and the genome FASTA file can be downloaded by
following the Genome Portal links. The contigs file is titled
“150652.assembled.fna.”

2.2 Computational Requirements

The software tools detailed in this protocol run on web servers and thus
can be used from common hardware such as a laptop. That said, many
annotation programs and databases can also be run at the command
line, often expanding the options and capabilities available.

Table 2
Genome visualization software

Visualization program        Link
IGV viewer                   http://software.broadinstitute.org/software/igv/
NCBI Genome Workbench        http://www.ncbi.nlm.nih.gov/tools/gbench/
Artemis/ACT                  http://www.sanger.ac.uk/science/tools/artemis
JBrowse                      http://jbrowse.org
Apollo, plugin for JBrowse   http://genomearchitect.github.io
MGcV [17]                    http://mgcv.cmbi.ru.nl
CLC Sequence Viewer          http://www.qiagenbioinformatics.com/products/clc-sequence-viewer/
CLC Workbench                http://www.qiagenbioinformatics.com/products/clc-main-workbench/
Geneious                     http://www.geneious.com

Software package managers utilizing Docker (e.g., BioContainers


[15]) and Conda (e.g., BioConda [16]) have simplified the process
of installing and executing many different software packages.
Although not required here, gaining experience in coding (e.g.,
Python, Perl, or R) is beneficial if not necessary for the modern
genomicist. Visualization tools can also be useful to view the gene
annotations in their genomic context (Table 2).

3 Methods

To provide a convenient overview of the annotation process, we


begin with an automated annotation pipeline that combines both
structural and functional annotation into a single process (Sub-
heading 3.1). Next, we break down the automated pipeline into
each of the major steps beginning with a review of the software and
algorithms for structural annotation (no standalone structural pro-
tocol is provided) (Subheading 3.2). The final two sections detail
methods for taking predicted coding sequences from Subheading
3.1 to obtain further general (Subheading 3.3) or specific (Sub-
heading 3.4) functional annotations using a few of the many online
and downloadable annotation databases.

3.1 All-in-One Tools

Automated, or “all-in-one,” annotation tools combine both structural
and functional annotation in one convenient step, producing a
high-quality draft genome. There are several automated annotation
tools available either as online web servers or as downloadable,
command-line-driven software (Table 3). Some tools, such as the
Prokaryotic Genome Annotation Pipeline (PGAP) and Integrated
Microbial Genomes/Genomes OnLine Database (IMG/GOLD),

Table 3
Automated genome annotation software

Program    Link                                                     Reference
PGAP       http://www.ncbi.nlm.nih.gov/genome/annotation_prok/      [18]
IMG/GOLD   http://gold.jgi.doe.gov                                  [19]
RAST       http://rast.nmpdr.org                                    [20]
Prokka     http://www.vicbioinformatics.com/software.prokka.shtml   [21]
BASys      http://www.basys.ca                                      [22]
Genix      http://github.com/biopro/genix                           [23]
DeNoGAP    http://sourceforge.net/projects/denogap/                 [24]

are part of large genome repositories, and submissions will
eventually be made public. These pipelines are beneficial
in that the resulting genomes already conform to the submission
standards of those repositories, but they require more upfront sample
information and metadata before a submission will be accepted.
Additionally, it may be more difficult to hand-curate or incorporate
annotation results from other pipelines.
The RAST (Rapid Annotation using Subsystem Technology)
online server will structurally and functionally annotate a prokary-
otic genome into a series of curated subsystems of functional roles
and provide online pathway visualizations and even metabolic
reconstructions.
1. Navigate your browser to https://rast.nmpdr.org/?page=Register
and register for an account.
2. Upload a genome in either FASTA or Genbank format. For this
example, we will start with a FASTA-formatted file with no
structural annotation (Fig. 1a). This can be either the example
Marinobacter contigs file or another sequence file of your
choice.
3. After uploading, the RAST server will analyze your FASTA file
and present various summary statistics. If your genome con-
tains scaffolded contigs (contig sequences merged and sepa-
rated by a string of “N”), RAST will split those scaffolds into
individual contigs. Summary statistics will be displayed for both
the uploaded FASTA file and the split scaffolds, if necessary.
4. Next, the RAST server will ask for a taxonomy string for your
organism (Fig. 1b). If possible, use the NCBI taxonomy ID
(at https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/),
which will allow RAST to fill in the needed taxonomy
information. For our Marinobacter example genome, we enter
“2742” in the Taxonomy ID box, and then manually correct the
remaining information including the genus and strain.

Fig. 1 Rapid annotation using subsystem technology (RAST) genome upload interface. (a) Under “choose file,”
select a FASTA file of DNA contig sequences. (b) Complete the genome information by submitting the NCBI
taxonomy identifier, as well as the genus, species, and strain

5. Select final annotation parameters for the RAST pipeline. The


default options should yield good results for the typical
genome. Additional parameters such as “automatically fix
errors” and “backfill gaps” will attempt to fix and alert the
user to potential issues. Selecting “finish the upload” will
enter the genome into the RAST queue, and you will receive
a job identifier and a link to monitor the status. If you already
have gene calls and are submitting a Genbank file, RAST will
re-call ORFs (i.e., discard your gene calls) unless you specify
otherwise.

6. The RAST annotation process can take 12–24 h. On the “Your


Jobs” section of the RAST website, the statuses of all current
jobs are displayed, as well as overall server load.
7. You will receive an e-mail when your job has finished running.
Follow the link in the e-mail to the RAST site. Once there, first
confirm that all RAST processes completed successfully (all
boxes are green). To view the annotation files, select “view
details” for the job. Here, you can download files including
Genbank and a table of annotations (either Excel or
tab-delimited). Further, you can browse the genome in the
SEED Viewer, which displays subsystem category statistics
for the ~50% of genes assigned a function (Fig. 2a).
Additionally, genome comparisons can be
made with other organisms, and genome annotations can be
visualized onto KEGG metabolic pathways (Fig. 2b).
8. Finally, download the “Amino-acid FASTA file,” as this FASTA
file will be used in Subheadings 3.3 and 3.4.
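
Step 3 above notes that RAST splits scaffolds into individual contigs at strings of “N.” The idea can be sketched in Python as follows. This is an illustration only, not RAST's implementation: the function name is hypothetical, and the minimum gap length (`min_gap`) is an assumed parameter because the threshold RAST applies is not stated here.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap 'N' characters.

    min_gap is a hypothetical parameter; shorter runs of N are kept in place.
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces produced by gaps at the very start or end.
    return [piece for piece in gap.split(sequence) if piece]
```

Comparing the number and lengths of the pieces returned with the original scaffold gives the same kind of before-and-after summary that RAST displays after upload.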

3.2 Gene Prediction (Structural)

The first step in the typical annotation process (including RAST
above) is identifying the structural components of a genome. These
are features with defined genomic coordinates such as genes for
protein coding sequences (CDSs) and transfer and ribosomal RNA
(tRNA and rRNA, respectively). Several popular and highly cited
structural annotation tools are available depending on the type of
structural features identified (Table 4). Prodigal, Glimmer, and the
GeneMark suite all detect open reading frames (ORFs)
corresponding to a protein CDS (detailed in Subheading 3.2.1).
Other structural annotation tools instead discover so-called non-coding
RNA features including tRNA and rRNA (detailed in Subheading
3.2.2). As structural annotation is often included in
automated pipelines, in lieu of a protocol, we give a review of the
methods and challenges these annotation tools encounter.

3.2.1 Protein Coding Sequences

Prokaryotic protein coding sequences (CDSs) typically follow a
common scheme which gene-calling algorithms utilize. The structure
of a gene is tied to aspects of the central dogma, that is, DNA is
transcribed into messenger RNA (mRNA) which is translated into
amino acids/proteins. The molecular machinery catalyzing the
central dogma relies on landmarks that are encoded in the genome.
Computational pipelines for structural annotation similarly work in
part by identifying these genomic landmarks, followed by further
refinement through algorithmic or heuristic approaches. Genomic
landmarks involved in transcription include promoter regions, tran-
scription factor binding sites, transcription termination sites, and
genes transcribed together in polycistronic operons. Although
transcription is the first step in producing a protein, gene calling
algorithms typically do not utilize these transcriptional landmarks

Fig. 2 Online results from the RAST automated annotation of Marinobacter sp. PT19DW. (a) Subsystem
coverage shows the proportion of coding sequences with a predicted functional annotation; the distribution of those
annotations across subsystem categories is displayed in the accompanying pie chart. (b) KEGG metabolic pathway detail of the TCA cycle.
Boxes represent functional categories that were identified (filled) or not identified (unfilled) in the Marinobacter
sp. PT19DW genome

Table 4
Structural annotation software

Program       Objective   Reference
Prodigal      ORF         [25]
Glimmer       ORF         [26]
GeneMarkS     ORF         [27]
tRNAscan-SE   tRNA        [28]
Rfam          RNA         [29]
RNAmmer       rRNA        [30]

in structural annotation. These features, however, can be used in


refining gene models (discussed in Subheading 3.2.3) or in func-
tionally annotating a gene product through the promoter region.
Conversely, genes that overlap in operons introduce additional
complexity that can actually reduce the performance of gene calling
algorithms.
Many of the genomic features utilized for structural annotation
come from the translation process. Algorithmic classification of
protein coding genes begins with the identification (or “calling”)
of ORFs. Gene calling programs first identify candidate
ORF regions—DNA sequences beginning with a start codon and
proceeding downstream until an in-frame stop codon is reached,
resulting in many possible and overlapping ORFs. Most prokaryotes
use a standard genetic code (NCBI translation table 11)
where ATG is the typical start codon, and TAA, TGA, and TAG are
the typical stop codons. In some organisms, alternative start codons
such as TTG and GTG are common and are often allowed in gene-
calling algorithms. Candidate ORFs that are in the same frame and
overlap will share the same stop codon but may have different start
codons. An additional translational landmark, the ribosome bind-
ing site (RBS), can be identified computationally and help to differ-
entiate between multiple potential start codons. Until recently, it
was thought that each bacterial gene has its own RBS, even when
encoded in an operon. Many archaea, however, were thought to
have a more complex translation initiation mechanism resulting in
so-called leaderless initiation elements without an RBS [31]. Recent
evidence suggests that leaderless sequences are common in bacteria,
especially in Actinobacteria and Deinococcus-Thermus [32]. The
incorporation of leaderless translation initiation is considered in the
recent gene calling program GeneMarkS-2 [33].
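
The candidate ORF search described above can be sketched in Python. This is a simplified illustration and not the algorithm of any particular gene caller: it scans all six reading frames, accepts ATG, GTG, and TTG as start codons, and reports every start that shares a downstream in-frame stop. The function names and the `min_codons` length filter are assumptions made for the example, and RBS detection and leaderless initiation are not modeled.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_orfs(seq, min_codons=30):
    """List candidate ORFs as (strand, start, end, sequence) tuples.

    Coordinates are 0-based and end-exclusive on the strand that was scanned,
    so '-' strand coordinates refer to the reverse-complemented sequence.
    min_codons counts codons from the start codon through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            starts = []  # start codons seen since the previous in-frame stop
            for pos in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if codon in STOP_CODONS:
                    end = pos + 3
                    # Every pending start shares this stop codon.
                    for start in starts:
                        if (end - start) // 3 >= min_codons:
                            orfs.append((strand, start, end, strand_seq[start:end]))
                    starts = []
                elif codon in START_CODONS:
                    starts.append(pos)
    return orfs
```

Because each stop codon is paired with all upstream in-frame starts, the output contains the overlapping candidates that a gene caller must then score and reduce to a single gene model per locus.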
Although the identification of candidate ORFs is a relatively
straightforward process, the differences between gene calling algo-
rithms are largely in how they score and ultimately classify ORFs as
a true CDS. A naive approach could be to take only the longest,

non-overlapping ORFs as true genes; however, this would result in


many false positives. Gene calling algorithms instead usually take
one of two different approaches: either through evidence-based or
ab initio methods [34].
Evidence-based, or extrinsic, methods compare the unknown
ORF sequence against a database of genes or genomes, scoring
ORFs that match the database above defined thresholds. This is a
valid approach when annotating a genome with close relatives in the
database; however, errors can easily propagate across genomes and
novel or highly diverged genes can be missed [35].
On the other hand, ab initio, or intrinsic, gene callers rely on
compositional features to distinguish a protein CDS from
non-coding ORFs. For example, Glimmer 3 first identifies the
longest ORFs in the genome, as a long sequence without a stop
codon is likely to be a true CDS. It then uses the composition of the
long ORFs as a training set to build gene models specific to that
genome. These gene models are then used to score the remaining
putative ORFs, also considering RBS locations, for their likelihood
of being true protein coding genes. A potential problem with this
method is that the gene models are specific to most but not all
genes within a genome, and genes with different compositions
(such as those horizontally acquired) can be missed.
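
The train-then-score idea behind ab initio gene callers can be illustrated with a much simpler model than the ones these programs actually use. The sketch below swaps in plain codon-usage frequencies, which is an assumption made for brevity and is not how Glimmer builds its gene models. Function names, the pseudocount, and the uniform background are likewise hypothetical choices.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons(seq):
    """Split a sequence into consecutive, complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(training_orfs, pseudocount=1.0):
    """Estimate codon log-probabilities from a training set of long ORFs."""
    counts = Counter()
    valid = set(ALL_CODONS)
    for orf in training_orfs:
        counts.update(c for c in codons(orf) if c in valid)
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: math.log((counts[c] + pseudocount) / total) for c in ALL_CODONS}


def score_orf(orf, model):
    """Mean per-codon log-odds of an ORF versus a uniform codon background."""
    background = math.log(1.0 / len(ALL_CODONS))
    scored = [model[c] - background for c in codons(orf) if c in model]
    return sum(scored) / len(scored) if scored else 0.0
```

An ORF whose codon usage resembles the training set scores higher than one with atypical composition, which also shows why horizontally acquired genes can be missed by a genome-specific model.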
There are several popular gene callers that are available both
online and with a command line interface, including Prodigal [25],
Glimmer [26], and GeneMarkS [27]. Several important considera-
tions must be factored when choosing and using a gene calling
program, particularly when annotating a draft genome, MAG or
SAG. The first of these is how the gene caller handles genes that end
abruptly at contig borders, and whether it should include these
“partial” coding sequences. A similar concern is for scaffolded
genomes, and whether the gene caller should consider ORFs that
span regions of padded ambiguous bases as candidates. Also, these
gene calling algorithms are designed for prokaryotes without
introns and will not function properly on eukaryotic or intron-containing
genes/genomes. Finally, consider whether the gene caller should treat
the sequence as circular for finished genomes or plasmids, allowing
ORFs that span the ends of a contig to be found.

3.2.2 Non-coding RNA

Protein-coding sequences are not the only genomic feature that can
be identified in the structural annotation process. There are many
genes that are transcribed into RNA products but do not undergo
translation into an amino acid sequence; instead, the RNA
transcripts themselves are functionally active. These non-coding
RNAs (ncRNAs) range from the well-conserved rRNAs and tRNAs
to the more variable and less well-studied small regulatory RNAs
(sRNAs) and clustered regularly interspaced short palindromic
repeats (CRISPR) transcripts. Even sequences in the untranslated

region of an mRNA, such as riboswitches, may also be under the


umbrella of ncRNAs in that they have secondary structure that can
directly regulate the translation process [36].
The ncRNAs with extensive secondary structure are perhaps
the most straightforward to computationally annotate by using an
evidence-based approach due to conserved homology [37]. The
Rfam database is a collection of RNA models built from alignments of
known, or “seed,” RNA transcript sequences [38]; an unknown
sequence is searched against these models using programs like Infernal
[39]. Additionally, there are algorithms specifically for identifying
tRNAs, such as tRNAscan-SE, which identifies not only
the tRNA gene but also its amino acid through determination of the
anticodon.
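
The link between an anticodon and its amino acid can be shown with a short Python sketch. It assumes strict Watson-Crick pairing between anticodon and codon and the standard codon assignments. Wobble pairing and modified bases, which real tRNA prediction tools must account for, are ignored. The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard codon assignments in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon_to_amino_acid(anticodon):
    """Return the one-letter amino acid for a tRNA anticodon written 5'->3'.

    Accepts RNA (U) or DNA (T) letters. Assumes exact complementarity,
    so wobble pairing and base modifications are not considered.
    """
    anticodon = anticodon.upper().replace("U", "T")
    if len(anticodon) != 3 or set(anticodon) - set("ACGT"):
        raise ValueError("anticodon must be three unambiguous nucleotides")
    # The codon read by the tRNA is the reverse complement of the anticodon.
    codon = anticodon.translate(COMPLEMENT)[::-1]
    return CODON_TABLE[codon]
```

For instance, the anticodon CAU pairs with the codon AUG and therefore maps to methionine under these assumptions.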
ncRNA transcripts that have little sequence similarity or known
motifs pose a more difficult task for identification and prediction
[37]. These can include species- or genome-specific small regu-
latory RNA transcripts (sRNAs) that function by binding and
affecting transcription/translation of other RNA transcripts
[40]. A detailed summary of the prediction programs for sRNAs
is beyond the scope of this protocol; however, [41] provides a
detailed review for those interested in learning more about the
topic.
Another ncRNA that has found biotechnology applications in
recent years is CRISPR. Although CRISPRs are now widely con-
sidered gene editing tools, they originated in prokaryotes as an
adaptive immune system against foreign DNA such as bacterio-
phage [42]. CRISPR sequences can be identified by Rfam, and
searches can be included in RAST and IMG annotation protocols.

3.2.3 Refinement of Gene Calls

Prokaryote genome sequencing and transcriptome quantification
using RNA-seq are becoming routine, yet there is a need for new
prokaryote annotation platforms that will integrate all available
information to produce a more robust annotation of bacterial
genomes. Several software solutions have been developed to refine
the start/stop coordinates, and to validate the expression of pre-
dicted ORFs (Table 5).
Ab initio gene finders discussed above predict genes based on
the presence of ORFs in the genome sequence, typically using
statistical or machine learning techniques such as Hidden Markov
Models and require training data to evaluate the probability of the
predicted gene model. In contrast to ab initio methods, evidence-
based gene model prediction uses additional external information
to identify genes and to further predict genome and gene structure
models such as operon structure. The source of external evidence
can include expressed sequence tag (EST) libraries, messenger
RNA (from RNA-seq studies), and protein sequences. The addition
of these methods using additional external information can aid in

Table 5
Structural annotation refinement software

Program    Omics data      Reference
EuGene-P   Transcriptome   [43, 44]
GIIRA      Transcriptome   [45]
Cufflinks  Transcriptome   [46]
Braker2    Proteome        http://bioinf.uni-greifswald.de/bioinf/braker/
iPtgxDB    Proteome        [47]
MAGI       Metabolome      [48]

identifying not only CDSs, but also transcription
start sites (TSSs), non-coding (nc) transcribed genes, regulatory
sequences, and 5′ and 3′ UTRs [43].
Sallet et al. developed EuGene-P [43], a tool adapted from the
eukaryotic gene finder EuGene [49], which takes advantage of
statistical properties of sequences, existing annotations in data-
bases, similarities to proteins, and, importantly, oriented RNA-seq
data to produce annotations that include previously unpredicted
gene structures such as 5′ and 3′ UTRs and ncRNA genes. Building
on their EuGene-P tool, Sallet et al. developed EuGene-PP, a
next-generation automated pipeline for prokaryotic genomes [44].
Although transcriptome data has been shown to be useful in
refining structural annotations, attempts to validate gene calls on
45 bacterial replicons using peptide data yielded mixed results
[50]. Tripp et al. conclude that any of the tested algorithms can
make useful gene calls, but peptide data alone is insufficient to
completely validate any approach, and further that comparative
genomics should only be done against genomes called with the
same algorithm.

3.3 Functional Prediction (General)

Although many automated tools can produce high-quality genome
annotations, combining multiple functional annotation tools can
significantly increase the completeness [2]. Therefore, in this protocol,
we will supplement the RAST annotation with another
“general” functional annotation tool. The goal of general func-
tional annotation is to classify well-conserved genes, or those
genes in well-studied classes, such as glycolysis, amino acid and
nucleotide metabolism, and nutrient cycling. One large collection
of these genes is the Kyoto Encyclopedia of Genes and Genomes
(KEGG [51]). The KEGG database is arranged into an ontology of
proteins, the reactions they can perform, and the pathways that
these reactions fit into. The KEGG Automated Annotation Server
(KAAS [52]) is a web service to annotate protein coding sequences.

1. Navigate your web browser to the KAAS web server at
https://www.genome.jp/tools/kaas/.
2. Select the BBH method under “Complete or Draft Genome.”
Select this even if working with metagenome-assembled gen-
omes (MAGs).
3. Under “Search program,” leave the default as BLAST. This is
the methodology used to determine the degree of similarity of
our unknown sequences with those in the database. The use of
GHOSTX or GHOSTZ may give faster yet less accurate results.
4. Under “Query sequences,” upload the FASTA file of amino
acid sequences obtained from RAST annotation or provide
your own amino acid FASTA file.
5. Add a query name and e-mail address. KAAS will send an e-mail
with a link to submit and begin the annotation process, so
provide a valid e-mail address.
6. Under the GENES dataset, click “for Prokaryotes.” This will
fill in a list of selected prokaryotic organisms that will be
searched against (Fig. 3a).
7. Select BBH for the “Assignment method.”
8. Select “Compute.” You will receive an e-mail with a link that
will finalize the submission process. Your job will not start until
you select that link in your e-mail.
9. When the KAAS annotation has completed, you will receive
another e-mail with a link to the results. These results are not
permanent and should be downloaded or saved within 7 days.
10. Click the link of your job ID under “ID” and record your run
parameters by saving the text there (Fig. 3b). This is necessary
to save these search parameters in case the “for Prokaryotes”
genome defaults change.
11. Right click on “text” and save the results to a file (Fig. 3b).
12. To view your results in the context of the KEGG pathways,
navigate to the KEGG pathway list at
https://www.genome.jp/kegg/pathway.html.
13. Within a KEGG map, click the “User data mapping” link, and
paste in the second column from the KEGG text output from
KAAS. This will highlight the identified genes on the map
(example is the TCA cycle map00020; Fig. 4).
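
Extracting the second column for step 13 can be done in a spreadsheet, or with a few lines of Python as sketched below. The sketch assumes that the KAAS text output is tab-delimited with the query gene identifier in the first column and the KEGG orthology (KO) identifier in the second, and that genes without an assignment have no second column. The function names are hypothetical.

```python
def parse_kaas(lines):
    """Map query gene IDs to KO identifiers from KAAS tab-delimited text.

    Lines without a second column (no KO assigned) are skipped.
    """
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            assignments[fields[0].strip()] = fields[1].strip()
    return assignments


def ko_list_for_mapping(assignments):
    """Return the unique KO identifiers, sorted, one per line."""
    return "\n".join(sorted(set(assignments.values())))
```

Opening the saved results file, passing the handle to parse_kaas, and printing the output of ko_list_for_mapping gives a list that can be pasted into the “User data mapping” box.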
Although largely similar, the RAST and KAAS annotations mapped onto
the TCA cycle pathway show some slight differences. In particular,
KAAS alone identified a
pyruvate synthase (EC:1.2.7.1) to catalyze the interconversion
between acetyl-CoA and pyruvate (Fig. 4). KAAS also identified
an alternate malate dehydrogenase (EC:1.1.5.4) using quinone,
leading to a complete TCA cycle. These particular examples are

Fig. 3 Kyoto Encyclopedia of Genes and Genomes Automated Annotation Server (KEGG/KAAS) web server for
general functional annotation. (a) Upload the FASTA protein file obtained from the RAST annotation and select
“for Prokaryotes” for the representative set. Submit for annotation, replying to the confirmation e-mails. (b)
After receiving an e-mail informing that your KAAS run is complete, note the run information (by clicking the
Job ID link), and download the text file format with your KEGG orthology annotations

not meant to proclaim one annotation tool “better” than another;
rather, they stress that the resulting annotations are dependent on
the annotation system used. Agreement between different annotation
methodologies can help validate potential gene functions,
whereas disagreement leads to differences in pathway representation
that must be resolved by the genomicist.
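
A comparison like the one above can be automated once each tool's annotations are reduced to a common vocabulary. The Python sketch below assumes EC numbers have already been extracted from both outputs. How they are extracted depends on each tool's file format and is not shown, and the function name and result keys are hypothetical.

```python
def compare_annotations(ec_a, ec_b):
    """Compare two collections of EC numbers from different annotation tools.

    Returns the EC numbers shared by both and those unique to each input.
    """
    a, b = set(ec_a), set(ec_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }
```

EC numbers reported by only one tool, such as the pyruvate synthase (EC:1.2.7.1) found by KAAS alone in the example above, are the ones that warrant manual inspection.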

3.4 Functional Prediction (Specific)

In addition to annotating for general metabolic capabilities of an
organism, many smaller curated databases exist that are tailored for
certain classes or capabilities. It is impossible to compile an exhaus-
tive list of these resources (only a small subset is shown in Table 6),
nor is it usually necessary to search more than just a few of these
databases against a genome. The choice of which databases to use
depends on your biological question.

Fig. 4 KAAS annotation analysis in the online pathway browser for the TCA cycle. As in Fig. 2, boxes represent
functional categories that were identified (filled) or not identified (unfilled). Differences between the RAST and
KAAS annotations are circled

Table 6
List of some specific functional annotation software

Database      Description                             Link                                          Reference
antiSMASH     Antibiotics and secondary metabolites   http://antismash.secondarymetabolites.org     [53]
dbCAN         Carbohydrate-active enzymes             http://bcb.unl.edu/dbCAN2/blast.php           [54]
transportDB   Transporter and substrate prediction    http://www.membranetransport.org              [55]
VFDB          Virulence factor database               http://www.mgc.ac.cn/cgi-bin/VFs/main.htm     [56]
GeneDB        Annotation database for pathogens       http://www.genedb.org/                        [57]

3.4.1 CAZy/dbCAN Protocol

The Carbohydrate-active enzymes (CAZy [58]) database is a large
catalog of prokaryotic and eukaryotic genes involved in the synthesis
and deconstruction of many classes of carbohydrates. Enzymes
are categorized into five major classes: glycoside hydrolases
(GH), glycosyltransferases (GT), polysaccharide lyases (PL),
carbohydrate esterases (CE), and those with auxiliary activities (AA).
Associated carbohydrate-binding modules (CBM) are also cataloged.
The database for automated
carbohydrate-active enzyme annotation (dbCAN [54]) is a
regularly updated collection of Hidden Markov Models (HMMs)
for classifying unknown protein sequences into CAZy classes. A
survey of the carbohydrate-active proteins encoded in a genome
can give many insights, for example, to predict the ability to
degrade polymeric carbon substrates [59] or the types of exopoly-
saccharides or glycosylated molecules that a bacterium may
produce [60].
1. Navigate your web browser to http://bcb.unl.edu/dbCAN2/blast.php.
2. Enter your e-mail address in the field to be notified of the
results.
3. The input for dbCAN is a FASTA file of amino acid sequences;
provide the file obtained from the RAST automated annotation.
4. Check the box for “I’m not a robot” and click the “submit”
button.
5. After a few moments, you will receive an e-mail with a link to
the results page. This page can be explored in table form, or
graphically to inspect the domain architecture for each gene
result (Fig. 5).
6. The subject column lists the CAZy class that was found in the
protein sequence (Fig. 5a). One protein can have multiple
CAZy classes. The subject links to the dbCAN entry for the
listed CAZy class, with further links to the CAZy database.
7. The confidence of a result is measured by both the “E-value”
and the “Covered fraction” (Fig. 5a). The E-value (or Expect
value) is the number of hits with an equal or better score that
would be expected by chance when searching a list of sequences,
with more chance hits expected as the list size increases; smaller
values therefore indicate more significant hits. dbCAN uses an
E-value cutoff of 1e-5 if the protein is 80 amino acids or longer,
and 1e-3 if less than 80 amino acids. The covered fraction shows
the portion of the CAZy model that is found in your query protein
and is depicted in the “dbCAN domain architecture” window
(Fig. 5b). Values near 1 indicate that the model was mostly or
entirely found within the protein.
8. Results can be downloaded in text format with “Download
original output” giving the full HMMER search results, and
“Download parsed output” giving tabular output that can be
loaded into spreadsheet software.
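
Once the parsed output has been downloaded, the two confidence metrics described in step 7 can also be applied programmatically. The following is a minimal sketch in Python, not the dbCAN implementation: the column names (gene_id, family, evalue, protein_length, covered_fraction), the example rows, and the 0.8 covered-fraction threshold are all assumptions made for illustration and should be adapted to the header of the file actually downloaded.

```python
import csv
import io


def evalue_cutoff(protein_length):
    """E-value cutoff as described in step 7: 1e-5 for proteins of
    80 amino acids or longer, 1e-3 for shorter proteins."""
    return 1e-5 if protein_length >= 80 else 1e-3


def filter_hits(rows, min_covered_fraction=0.0):
    """Keep rows that pass the length-dependent E-value cutoff and an
    optional (hypothetical) minimum covered fraction."""
    kept = []
    for row in rows:
        evalue = float(row["evalue"])
        length = int(row["protein_length"])
        covered = float(row["covered_fraction"])
        if evalue < evalue_cutoff(length) and covered >= min_covered_fraction:
            kept.append(row)
    return kept


# Made-up rows standing in for a downloaded tab-separated table.
EXAMPLE = "\n".join(
    "\t".join(fields)
    for fields in [
        ["gene_id", "family", "evalue", "protein_length", "covered_fraction"],
        ["gene_1", "GH13", "1e-40", "450", "0.95"],
        ["gene_2", "GT2", "1e-4", "300", "0.60"],
        ["gene_3", "CBM50", "5e-4", "60", "0.90"],
        ["gene_4", "CE4", "1e-20", "200", "0.35"],
    ]
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t"))
strong_hits = filter_hits(rows, min_covered_fraction=0.8)
kept_ids = [row["gene_id"] for row in strong_hits]
```

In this toy table, gene_2 fails the E-value cutoff for a protein of its length, and gene_4 passes the E-value cutoff but covers only about a third of the model, so only gene_1 and gene_3 are retained.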

Fig. 5 Database for automated carbohydrate-active enzyme annotation (dbCAN)-specific functional annotation
for carbohydrate-active enzymes. (a) Tabular output giving the gene name and the CAZy functional class, with
confidence metrics including E-value and covered fraction. This data can be downloaded using the “Download”
links. (b) Graphical representation of the domain architecture

3.4.2 antiSMASH

Some annotation tools require different input files, as they may use
other genomic information beyond the primary amino acid
sequence. The antibiotics and secondary metabolites (antiSMASH)
server searches for gene clusters involved in the production of
secondary metabolites and antibiotics and therefore utilizes the
genome architecture of genes including synteny and operon
structure. This information is not available in a gene or protein FASTA
file, but it is in the Genbank file obtained from the RAST
annotation.
1. Navigate your web browser to antismash.secondarymetabolites.org.
2. Enter your e-mail address to receive a link to the results when
available.
3. Under “Data input,” click the “Upload file” button, and then
Browse to find the Genbank file (.gbk suffix) downloaded from
the RAST annotation (Fig. 6a).
4. The default options can be used; however, be sure to browse
the “Help” section for more detail on the individual options.
5. Click “Submit,” and a new webpage will load with information
on the status of your submission. A status box along the left-
hand side displays the current server workload, including the
current number of jobs that are being processed or queued.
6. After a few minutes, you will receive an e-mail with a link to the
results. These results will remain online for 1 month. On the
top right corner of the results page, a downward facing arrow
gives options for downloading the results locally, including in
Excel or Genbank format. “Download all results” will download
all of the data, including the HTML output.
7. antiSMASH identified three clusters in the Marinobacter
genome: an ectoine production cluster, a type I polyketide
synthase/non-ribosomal peptide synthetase (PKS/NRPS) cluster, and a
cluster for a bacteriocin antimicrobial compound (Fig. 6b). The
clusters can be further inspected by clicking on the cluster number to
obtain further information (Fig. 6c). For this Marinobacter
genome, the PKS/NRPS cluster even includes a rough predicted
structure of the synthesized compound.

Fig. 6 Antibiotics and secondary metabolites (antiSMASH) server for specific functional annotation. (a) Upload
the Genbank file obtained from the RAST annotation. The Genbank file preserves gene synteny, allowing
antiSMASH to more accurately identify gene clusters. (b) Results are shown in the web browser, and each
cluster can be observed in more detail by clicking the cluster name. (c) Further information on Cluster 2 for a
type I polyketide synthase/non-ribosomal peptide synthetase cluster, and a predicted structure of the
produced compound
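
To illustrate why the Genbank file carries information that a protein FASTA file does not, the sketch below reads coding sequence (CDS) features in file order and reports the gaps between neighboring genes. It is a deliberately simplified reader written for this example, not the parser used by antiSMASH: it assumes single-line locations, a locus_tag qualifier on every CDS, and the made-up feature table shown in EXAMPLE. For real files an established parser (for example, Biopython's SeqIO with the "genbank" format) is the safer choice.

```python
import re

# Feature keys start after five spaces in the Genbank feature table.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)")
LOCUS_TAG = re.compile(r'^\s+/locus_tag="([^"]+)"')


def cds_order(genbank_text):
    """Return (locus_tag, start, end, strand) tuples for CDS features,
    in the order they appear in the file (simple locations only)."""
    genes = []
    current = None
    for line in genbank_text.splitlines():
        feature = FEATURE_LINE.match(line)
        if feature:
            key, location = feature.groups()
            if key == "CDS":
                coords = [int(n) for n in re.findall(r"\d+", location)]
                strand = "-" if location.startswith("complement") else "+"
                current = (min(coords), max(coords), strand)
            else:
                current = None  # qualifiers of other features are ignored
            continue
        tag = LOCUS_TAG.match(line)
        if tag and current is not None:
            genes.append((tag.group(1),) + current)
            current = None
    return genes


def intergenic_gaps(genes):
    """Number of bases between each pair of neighboring CDS features."""
    return [nxt[1] - cur[2] - 1 for cur, nxt in zip(genes, genes[1:])]


# Hypothetical feature table excerpt used only for this illustration.
EXAMPLE = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     gene            100..900",
    '                     /locus_tag="demo_0001"',
    "     CDS             100..900",
    '                     /locus_tag="demo_0001"',
    '                     /product="hypothetical protein"',
    "     CDS             complement(950..2000)",
    '                     /locus_tag="demo_0002"',
    "     CDS             2100..3300",
    '                     /locus_tag="demo_0003"',
])

genes = cds_order(EXAMPLE)
gaps = intergenic_gaps(genes)
```

Gene order, strand, and spacing of this kind are what allow a cluster-finding tool to reason about neighboring genes; none of it can be recovered from an unordered list of protein sequences.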

4 Notes

1. MAGs: Although we suggest this workflow in working with
MAGs, the approaches detailed here are not intended or
recommended for annotating unbinned contigs from large
shotgun metagenomic studies.
2. File formats: There are currently several file formats to hold
genome data, and these formats unfortunately are not always
easily converted among one another. Some formats, such as
GFF3, are fairly concise and can be viewed easily in a spread-
sheet application; however, they only contain some annotation
features and do not contain actual sequence information. On
the other end of the spectrum are the “all-inclusive” formats
such as Genbank which can hold all of the sequences and
annotations for an entire genome. The Genbank format, how-
ever, is less human-readable and relies on very specific format-
ting that can easily be accidentally altered by a user. In addition
to these standards, many annotation tools typically output
results in non-standard formats (e.g., simple data tables).
Non-standard formats can be useful for browsing results; however,
data from multiple tools must routinely be combined for
whole-genome analyses such as building metabolic models, and
this is still a non-trivial task in most cases.
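
As an illustration of how concise the GFF3 format is, the sketch below splits each feature line into its nine tab-separated columns and decodes the attribute column into a dictionary. It is a minimal reader under simplifying assumptions (no embedded sequence section, one value per attribute), and the example line, including the source name and the identifier demo_0001, is made up.

```python
from urllib.parse import unquote

# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Turn one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # values are URL-escaped
    record["attributes"] = attributes
    return record


def parse_gff3(text):
    """Parse all feature lines, skipping blank lines and '#' lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        records.append(parse_gff3_line(line))
    return records


# Hypothetical example content.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "contig_1\tRAST\tCDS\t100\t900\t.\t+\t0\t"
    "ID=demo_0001;product=hypothetical%20protein",
])

records = parse_gff3(EXAMPLE)
first = records[0]
```

Each record holds coordinates and annotation text only, which is why a GFF3 table generally has to be paired with a separate sequence file before it can be combined with the output of other tools.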

Acknowledgments

This work was performed under the auspices of the
U.S. Department of Energy at Lawrence Livermore National
Laboratory under Contract DE-AC52-07NA27344 and supported by
the Genome Sciences Program of the Office of Biological and
Environmental Research under the LLNL Biofuels SFA, FWP
SCW1039.

References
1. Sorokina M, Stam M, Médigue C et al (2014) Profiling the orphan enzymes. Biol Direct 9:10
2. Griesemer M, Kimbrel JA, Zhou CE et al (2018) Combining multiple functional annotation tools increases coverage of metabolic annotation. BMC Genomics 19:948
3. Baric RS, Crosson S, Damania B et al (2016) Next-generation high-throughput functional annotation of microbial genomes. MBio 7:e01245-16
4. Stepanauskas R (2012) Single cell genomics: an individual look at microbes. Curr Opin Microbiol 15:613–620
5. Bowers RM, Kyrpides NC, Stepanauskas R et al (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725–731
6. Forouzan E, Maleki MSM, Karkhane AA et al (2017) Evaluation of nine popular de novo assemblers in microbial genome assembly. J Microbiol Methods 143:32–37
7. Klassen JL, Currie CR (2012) Gene fragmentation in bacterial draft genomes: extent consequences and mitigation. BMC Genomics 13:14
8. Sohn J, Nam J-W (2016) The present and future of de novo whole-genome assembly. Brief Bioinformatics 2016:bbw096
9. Bowers RM, Clum A, Tice H et al (2015) Impact of library preparation protocols and template quantity on the metagenomic reconstruction of a mock microbial community. BMC Genomics 16:856
10. Parks DH, Imelfort M, Skennerton CT et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055
11. Wu M, Eisen JA (2008) A simple, fast, and accurate method of phylogenomic inference. Genome Biol 9:R151
12. Chain PSG, Grafham DV, Fulton RS et al (2009) Genome project standards in a new era of sequencing. Science 326:236–237
13. Mende DR, Letunic I, Huerta-Cepas J et al (2017) proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes. Nucleic Acids Res 45:D529–D534
14. Gurevich A, Saveliev V, Vyahhi N et al (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075
15. da Veiga Leprevost F, Grüning BA, Alves AS et al (2017) BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 33:2580–2582
16. Grüning B, Dale R, Sjödin A et al (2017) Bioconda: a sustainable and comprehensive software distribution for the life sciences. Nat Methods 15(7):475–476
17. Overmars L, Kerkhoven R, Siezen RJ et al (2013) MGcV: the microbial genomic context viewer for comparative genome analysis. BMC Genomics 14:209
18. Tatusova T, DiCuccio M, Badretdin A et al (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614–6624
19. Chen IA, Markowitz VM, Chu K et al (2017) IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res 45:D507–D516
20. Aziz RK, Bartels D, Best AA et al (2008) The RAST server: rapid annotations using subsystems technology. BMC Genomics 9:75
21. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069
22. Van DGH, Stothard P, Shrivastava S et al (2005) BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res 33:W455–W459
23. Kremer FS, Eslabão MR, Dellagostin OA et al (2016) Genix: a new online automated pipeline for bacterial genome annotation. FEMS Microbiol Lett 363(23):fnw263
24. Thakur S, Guttman DS (2016) A de-novo genome analysis pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies. BMC Bioinformatics 17:260
25. Hyatt D, Chen GL, Locascio PF et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119
26. Delcher AL, Bratke KA, Powers EC et al (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23:673–679
27. Besemer J, Lomsadze A, Borodovsky M (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618
28. Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955–964
29. Kalvari I, Argasinska J, Quinones-Olvera N et al (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46:D335–D342
30. Lagesen K, Hallin P, Rødland EA et al (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100–3108
31. Moll I, Grill S, Gualerzi CO et al (2002) Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control. Mol Microbiol 43:239–246
32. Zheng X, Hu GQ, She ZS et al (2011) Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics 12:361
33. Lomsadze A, Gemayel K, Tang S et al (2017) Improved prokaryotic gene prediction yields insights into transcription and translation mechanisms on whole genome scale. https://doi.org/10.1101/193490
34. Borodovsky M, Rudd KE, Koonin EV (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res 22:4756–4767
35. Richardson EJ, Watson M (2012) The automatic annotation of bacterial genomes. Brief Bioinform 14:1–12
36. Sherwood AV, Henkin TM (2016) Riboswitch-mediated gene regulation: novel RNA architectures dictate gene expression responses. Annu Rev Microbiol 70:361–374
37. Backofen R, Amman F, Costa F et al (2014) Bioinformatics of prokaryotic RNAs. RNA Biol 11:470–483
38. Kalvari I, Argasinska J, Quinones-Olvera N et al (2017) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46:D335–D342
39. Nawrocki EP, Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935
40. Bobrovskyy M, Vanderpool CK (2013) Regulation of bacterial metabolism by small RNAs using diverse mechanisms. Annu Rev Genet 47:209–232
41. Pain A, Ott A, Amine H et al (2015) An assessment of bacterial small RNA target prediction programs. RNA Biol 12:509–513
42. Modell JW, Jiang W, Marraffini LA (2017) CRISPR-Cas systems exploit viral DNA injection to establish and maintain adaptive immunity. Nature 544:101–104
43. Sallet E, Roux B, Sauviac L et al (2013) Next-generation annotation of prokaryotic genomes with EuGene-P: application to Sinorhizobium meliloti 2011. DNA Res 20:339–354
44. Sallet E, Gouzy J, Schiex T (2014) EuGene-PP: a next-generation automated annotation pipeline for prokaryotic genomes. Bioinformatics 30:2659–2661
45. Zickmann F, Lindner MS, Renard BY (2014) GIIRA–RNA-Seq driven gene finding incorporating ambiguous reads. Bioinformatics 30:606–613
46. Roberts A, Pimentel H, Trapnell C et al (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27:2325–2329
47. Omasits U, Varadarajan AR, Schmid M et al (2017) An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics. Genome Res 27:2083–2095
48. Erbilgin O, Ruebel O, Louie KB et al (2017) MAGI: a Bayesian-like method for metabolite annotation, and gene integration. ACS Chem Biol 14(4):704–714
49. Schiex T, Moisan A, Rouzé P (2001) Eugène: an eukaryotic gene finder that combines several sources of evidence. In: Computational biology. Springer, Berlin, pp 111–125
50. Tripp HJ, Sutton G, White O et al (2015) Toward a standard in structural genome annotation for prokaryotes. Stand Genomic Sci 10:45
51. Kanehisa M, Furumichi M, Tanabe M et al (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 45:D353–D361
52. Moriya Y, Itoh M, Okuda S et al (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35:W182–W185
53. Weber T, Blin K, Duddela S et al (2015) antiSMASH 3.0-a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res 43:W237–W243
54. Yin Y, Mao X, Yang J et al (2012) dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res 40:W445–W451
55. Elbourne LD, Tetu SG, Hassan KA et al (2017) TransportDB 2.0: a database for exploring membrane transporters in sequenced genomes from all domains of life. Nucleic Acids Res 45:D320–D324
56. Chen L (2004) VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res 33:D325–D328
57. Logan-Klumpler FJ, Silva ND, Boehme U et al (2011) GeneDB–an annotation database for pathogens. Nucleic Acids Res 40:D98–D108
58. Lombard V, Ramulu HG, Drula E et al (2013) The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495
59. Berlemont R, Martiny AC (2015) Genomic potential for polysaccharide deconstruction in bacteria. Appl Environ Microbiol 81:1513–1519
60. Sánchez-Rodríguez A, Tytgat HL, Winderickx J et al (2014) A network-based approach to identify substrate classes of bacterial glycosyltransferases. BMC Genomics 15:349

Chapter 11

Functional Annotation from Structural Homology


Brent W. Segelke

Abstract
With the nexus of supercomputing and the biotech revolution, it seems an era of predictive biology
through systems biology may be at hand. Modern omics capabilities enable examination of the state of a
biological system in exquisite detail. The genome, transcriptome, proteome, and metabolome may all be
largely knowable, at least for some model systems, providing a basis for modeling and simulation of
molecular mechanisms, or pathways, that could capture a biological system’s emergent properties.
However, there are significant challenges remaining that impede the realization of this vision, perhaps the
most significant being the missing functional annotation of genes and gene products. For even the most
well-studied organisms, as many as a third of the called genes in a given genome are not annotated, and
more than half of the existing annotations may be tenuous. Homology inferred from sequence similarity is
the basis for much of genome annotation. Homology inferred from structural similarity could be a powerful
complement to sequence-based annotation methods. Structural biology or structural informatics can be used
to assign molecular function and may have increasing utility with the rapid growth of gene sequence
databases and emerging methods for structure determination, like structure prediction based on
coevolution. Here we describe tools and provide example cases using structural similarity at the level of
quaternary structure, domain content, domain topology, and small 3D motifs to infer homology and posit
function. Ultimately, annotation by similarity, be it 3D structure homology or more classically primary
sequence homology, must be founded on accurate annotation of one ortholog in the group. Understanding
every function encoded by a genome remains a major challenge to life science.

Keywords Gene annotation, Structural homology, Molecular evolution, Structural superposition,
3D motif, Structure function

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_11, © Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Life science is in the midst of a revolution sparked by genomics and
fueled by rapid innovation in transcriptomics, proteomics,
microscopy, and genetic engineering, as well as continued innovation in
genomics and advanced computing. Researchers are awash with
data that give a comprehensive gene listing from many organisms
[1], gene transcription profiles [2], protein expression profiles
[3, 4], and even metabolome information [5]. It may soon be
possible to probe simultaneously the epigenetic state [6], gene
expression, proteomic profile, and metabolic states of individual
cells, even for each of many cells within a complex tissue [7]. It is
intriguing to think we may soon be able to build system models
informed by this wealth of data and simulate complex cellular and
multicellular behavior of biological systems based on their complex
orchestration of molecular mechanisms. Unfortunately, for the
foreseeable future, attempts at system modeling at the molecular
level for cellular systems will be plagued with the annotation
problem [8].
To build detailed models of cellular systems, a comprehensive
accounting is needed for the molecules present, which may be
within reach, but a comprehensive accounting of molecular
function is also needed. This means that a complete and accurate
annotation of genomes is essential. Unfortunately, genome
annotation, even for the most well-studied species, is far from complete
[9]. In fact, the annotation problem is a serious impediment in
many areas of the life sciences. Nearly all proteomic, transcriptomic,
or metagenomic surveys return lists of implicated genes, a
significant portion of which have unknown function. Even the genome of
the minimal organism, engineered to have only genes essential for
replication, is missing functional annotation for ~30% of the genes
it encodes [10]. It is estimated that 40% of newly sequenced genes
from genome centers and metagenomics projects have no known
function [9]. Furthermore, the rate of discovery of new genes, even
new gene sequence families, outstrips the rate of functional
assignment for genes by orders of magnitude [8]. To make matters worse,
existing annotations may be vague, incomplete, or even wrong.
There are protein families that have been studied for decades for
which the functions of family members are still in dispute [11].
By necessity, since the rate of production for gene sequence
data outstrips the rate of new sequence family functional
assignments, the overwhelming majority of genes are currently annotated
automatically and annotations are inferred from sequence
homology [12, 13]. This means that newly sequenced genes (and their
encoded products) are generally assigned function based on homology
to genes with existing annotation. Homology is inferred from sequence
similarity, so a gene lends its annotation to a newly discovered
gene, which may in turn lend its annotation to the next newly
sequenced gene with similarity, and so on. As a result, a gene
annotation could derive from a series of functional inferences
such that the newly annotated gene is not identifiably similar by
sequence to the originally annotated gene.
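
The drift described in this paragraph can be made concrete with a toy calculation. The sketch below is purely illustrative: the three pre-aligned sequences are made up, and the 60% identity threshold for transferring an annotation is an arbitrary assumption, not a value used by any particular annotation pipeline.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned,
    equal-length sequences (no gaps handled in this sketch)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def can_transfer(a, b, threshold=60.0):
    """Hypothetical rule: transfer an annotation when identity is at
    or above the threshold."""
    return percent_identity(a, b) >= threshold


# Made-up sequences: B differs from A at three positions, and C
# differs from B at three other positions.
GENE_A = "MKTAYIAKQR"  # experimentally characterized
GENE_B = "MKSAYLAKQK"  # annotated from A
GENE_C = "MRSAFLARQK"  # annotated from B

a_to_b = percent_identity(GENE_A, GENE_B)
b_to_c = percent_identity(GENE_B, GENE_C)
a_to_c = percent_identity(GENE_A, GENE_C)
```

Each transfer step looks well supported (70% identity), yet the gene at the end of the chain shares only 40% identity with the gene whose function was actually characterized and would not have received the annotation by direct comparison.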
Gene annotation by structural similarity could be a powerful
complement to current annotation methods. Similarity can be
deduced by structural comparison independent of sequence
similarity. Furthermore, entries in the structural database are much
more likely to have a direct link to primary literature, which gives
greater certainty to the annotation. Identifying homology through
structure could have an outsized impact on the annotation problem
since biomacromolecular 3D structures evolve much more slowly
than primary sequence; therefore, structural similarity can be used
to identify much deeper homology, and annotation assigned to a
gene with known structure could be relevant to a much larger set of
sequences. In addition, sequence features (e.g., highly conserved
residues within a sequence family) can be mapped onto 3D
structure and examined for functional relevance. There are several tiers
of structural similarity that might be exploited to discover
homology.

1.1 Inferring Homology from Structural Similarity

Structure has long been used for taxonomic classification and
inferring relatedness of species. Gross anatomy of facial structure, for
example, is indicative of the relatedness of evolutionarily connected
species (Fig. 1). Just by appearance, we can intuit the similarity of
primate faces: mirror symmetrical across the median plane of the
face; two eyes above a nose with two nostrils; two ears, one on each
side of the head; and a mouth below the nose (Fig. 1a). Knowing
the function of our own eyes, nose, mouth, and ears allows us to
reasonably infer the function of these anatomical structures for
related species. Looking beyond the immediately apparent outward
appearance, an interrelatedness of more distant species can be
discovered. A close look at the skeletal anatomy of fore limbs reveals
that nature has reused the same basic structure for various but
related functions (Fig. 1b). From humans, to reptiles, to birds,
the skeletal anatomy of the fore limb is strikingly similar: one
large bone connecting the limb to the torso, then connected to
two smaller bones, which are in turn connected to a matrix of even
smaller bones (carpal bones), then to a fan of digits. Structural
analysis makes it clear: there is more relatedness between distant
species than first meets the eye.
Just as gross anatomical structure homology results from
evolutionary links between species, biomacromolecular structure
homology evolves from a common origin of molecules within
molecular families and can reveal homologous macromolecular
function. Macromolecular structure homology can reveal
evolutionary links over much greater evolutionary distances. From the
domain architecture of single domain phospholipase A2 (PLA2), for
example, a close relationship is apparent between vertebrates
(human (pdb 6g5j, [14]), cow (pdb 1mkt, [15]), and cobra (pdb
1a3f, [16]), Fig. 1c). A close relationship between fungus (pdb
4aup, [18]) and bacteria (pdb 1lwb, [19]) PLA2 molecules is also
readily apparent. There is some apparent relationship between
insect PLA2 and vertebrate PLA2, but no obvious relationship among
vertebrate, insect, fungus, and bacteria at the domain architecture
level. Structure-based alignment, however, reveals a common
subdomain.
Upon close inspection, facilitated by structure-based
superposition, a core subdomain common to all of the extracellular
phospholipase A2s can be identified. The core subdomain is made
up of two antiparallel α-helices. The two core helices are connected
with a variety of loops or subdomains joining the C-terminal end of
the first helix to the N-terminal end of the second helix. For
bacteria and fungus, the core helices are connected by short
loop or coil segments, whereas the insect core helices are connected
by a longer coil section and vertebrate core helices are connected by
a much longer subdomain that includes a two-stranded β-sheet,
called the β-wing. All of the depicted PLA2 enzymes have a third helix
packed against the second (upper) core helix that, when combined with
the core helices, forms a substrate binding pocket (not shown). For
vertebrate PLA2s, the third helix is at the N-terminus of the
polypeptide chain, whereas the third helix for fungus and bacteria is at
the C-terminus; the third helix for insect PLA2 is structurally
homologous to that of bacteria and fungus, in that it is connected
by a short loop to the C-terminus of the second core helix, but
there is a long C-terminal subdomain unique to the insect PLA2.
Mapping amino acid sequence conservation onto structurally
aligned molecules reveals a deeply conserved 3D motif. There is an
absolutely conserved histidine at approximately the midpoint of the
lower helix in the core subdomain that forms the start of a highly
conserved HDXXY primary sequence motif (Fig. 1d). The tyrosine
sidechain hydroxyl donates a hydrogen bond to an absolutely
conserved aspartate residing on the upper core domain helix. This
aspartate in turn hydrogen bonds with the conserved histidine. For
those pdb entries that include solvent, the conserved histidine is
also hydrogen bonded to a conserved water molecule. Not only
are the amino acid types in those positions conserved but the
atomic position of every atom in those positions is highly
conserved. The insect PLA2 subdomain helix with the conserved
histidine terminates into a long loop region just after the conserved
aspartate and therefore does not have the highly conserved
tyrosine. Instead there is a compensatory mutation near the C-terminus
of the third helix that provides the tyrosine that hydrogen bonds
with the core aspartate residue. It is known from molecular biology
and enzymology studies that these residues form the core motif
involved in catalysis [21]. The conserved tyrosine and aspartate
stabilize the histidine, which in turn deprotonates the water that
hydrolyzes the ester bond linking the fatty acid in the A2 position to
the glycerol backbone of phospholipid [21].

Fig. 1 Structural similarity of related species, from gross anatomy to molecular
motifs. (a) The outward appearance of primates revealing common facial
features: medial symmetry; common arrangement of eyes, ears, nose, mouth.
The readily apparent similarities evince the evolutionary relationship of human,
chimpanzee, and lemur (Primate images were adapted from images in the
public domain https://commons.wikimedia.org/wiki/File:Albert_Einstein_Head.jpg,
https://commons.wikimedia.org/wiki/File:Chimpanzee_head_sketch.png,
https://commons.wikimedia.org/wiki/File:Red-tailed_Sportive_Lemur_Kirindy_Madagascar.jpg).
(b) The forelimb bone structure of human, crocodile, and bird
showing the related organization in number and relative position of bones that
make the forearm and revealing a more distant evolutionary relationship (The
forelimb comparison is adapted from images in the public domain). (c) The
molecular structures of phospholipase A2 (PLA2) enzymes from human pancreas
(PDB id: 6g5j, [14]), bovine pancreas (PDB id: 1mkt, [15]), cobra venom (PDB id:
1a3f, [16]), bee venom (PDB id: 1poc, [17]), fungus (PDB id: 4aup, [18]), and
bacteria (PDB id: 1lwb, [19]) are shown as ribbon diagrams with helices shown
as coiled ribbons, strands as arrow capped ribbons, and coiled regions as tubes
interpolated through the backbone of the proteins' polypeptide chains. The N and
C termini of each molecule are indicated; the fungus PLA2 N-terminus is hidden
from view and not labeled. The similarity of PLA2 domain
architecture is readily apparent between vertebrates, as is the similarity between
fungus and bacteria. Similarity between insect and vertebrate, or between insect
and fungus or bacteria, is not obvious. (d) All six PLA2 molecules are shown
superimposed by the common core anti-parallel helix subdomain. The core
subdomain is highlighted in gray; the remaining elements are colored white
and shown as semi-transparent (left panel). All of these PLA2 enzymes have an
antiparallel helix domain, with the upper helix in this depiction going from
N-terminal to C-terminal left to right, and have at least three helical turns in
common among all of the molecules shown. Near the center of the antiparallel
helix domain, there is a nearly absolutely conserved motif, with a
histidine-helical turn-tyrosine on the lower helix and an aspartic acid hydrogen
bonded to both the histidine and the tyrosine on the upper helix (center panel). A simple
schematic can be derived that codifies the rules making up the hallmark for this
collection of PLA2 molecules and their structural family members (right panel).
These depictions of PLA2 molecules were generated with Chimera [20]

From the observed structural homology, a simple hallmark
motif can be derived with HDXXY sequence on the lower helix
and D within hydrogen bonding distance on the upper helix.
Notably, the PLA2 structure from bee venom does not exactly
match the hallmark motif. The tyrosine residue in the HDXXY
sequence is missing, but there has been a compensatory change
such that there is a different tyrosine that hydrogen bonds to the
central motif aspartate (Fig. 1d). The subdomain and catalytic
motif common to PLA2 enzymes from bacteria to human reveal
a deep evolutionary conservation between species that last had a
common ancestor billions of years ago.
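
The sequence half of the hallmark motif can be expressed as a simple pattern search. The sketch below is an illustration only: both sequence fragments are made up, and a regular expression captures just the HDXXY spacing on one helix. It cannot test the 3D part of the hallmark (an aspartate on the opposing helix within hydrogen bonding distance), which requires a structure.

```python
import re

# H, D, any two residues, then Y: the HDXXY sequence motif.
HALLMARK = re.compile(r"HD..Y")
# Relaxed pattern for variants in which the tyrosine is supplied
# from elsewhere in the chain, as described for bee venom PLA2.
RELAXED = re.compile(r"HD")


def find_motif(sequence, pattern=HALLMARK):
    """Return 0-based start positions of non-overlapping matches."""
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Made-up fragments for illustration, not taken from any PDB entry.
CANONICAL_LIKE = "ACCFVHDNCYGE"  # contains H-D-N-C-Y
VARIANT_LIKE = "ACCFVHDNCFGE"    # tyrosine position replaced

canonical_hits = find_motif(CANONICAL_LIKE)
variant_hits = find_motif(VARIANT_LIKE)
variant_relaxed_hits = find_motif(VARIANT_LIKE, RELAXED)
```

A sequence-only search misses the variant fragment entirely, while the relaxed pattern finds the histidine-aspartate pair but matches far too permissively to be used on its own. This mirrors the point made above: the compensatory tyrosine in the bee venom enzyme is only recognizable once the residues are compared in three dimensions.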

1.2 The Many Tiers of Macromolecular Structural Homology

Macromolecular structures evolve over many different length and
time scales leading to many different tiers of structural homology. A
number of tools provided by and to the structural biology
community make it possible to explore structural similarity at several
different tier levels, all of which might be exploited to infer
functional details. Macromolecular structures that form a biological
unit to carry out a specific function range in size and complexity
from small, single-chain, single-domain proteins, like PLA2, to
massive megadalton heterocomplexes made up of many chains of
various sizes and polymer types, like the ribosome [22]. As
molecular structures evolve, parts get reused and altered. Domains may be
added or removed, by addition or deletion of chains in a complex or
by insertion or deletion of domains into existing chains. Chains can
grow or shrink by insertions and deletions of segments of various
lengths. And sequences of biopolymers that make up
macromolecular structures can drift or change with random mutations that
accumulate over evolutionary timescales.
A brief glimpse at the variations in the large family of ATP
Binding Cassette transporters (ABC transporters) provides a good
case study for the different tiers of structural similarity. ABC trans-
porters are multidomain proteins with widely divergent sequences
but identifiable by their highly conserved ATP binding domains
[23]. These macromolecular assemblies can have very simple
arrangements (in the case of ABC transporters, as simple as a
homodimer of two chains each composed of two domains [23])
or as complex as hetero complexes of many chains of different types
Functional Annotation from Structural Homology 221

(in the case of ABC transporters as many as five chains composed of


eight domains [24]). However spartan or complex, these macromolecular
assemblies can be recognized as related by similarity in domain
arrangement or by the domains they have in common (the ATP binding and
permease domains in the case of ABC transporters; ABC importers also
often associate with binding proteins [28]) (Fig. 2a). The domain
arrangement within chains of related assemblies shows the modularity of
domains and how nature reuses domains in varying arrangements (Fig. 2b).
The similarity of domain topology (the number, order, and spatial
relationship of the secondary structural elements of a domain) can be
a much stronger indicator of common origin (Fig. 2c). Superposed 3D
structures of the ATP binding domains from two different transporters,
for example, appear nearly identical even though they have only 25%
sequence identity (Fig. 2d). At the finest level of detail, individual
residues that interact with cofactors or ligands can match exactly
between macromolecules with related function even with little other
evidence of homology (the ATP binding motifs at the dimer interface
between ATP binding domains from two different ABC transporters are
nearly identical in position, amino acid type, and conformation)
(Fig. 2e).
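
A sequence identity figure such as the 25% quoted above is a simple ratio over aligned positions. The following is a minimal sketch, assuming a pairwise alignment is already in hand (for example, exported from a structure superposition). The function name, the decision to leave gapped columns out of the denominator, and the toy sequences are illustrative assumptions; programs differ in how they count gaps, so the value may not match a given tool exactly.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: made-up sequences, not taken from any PDB entry.
toy_a = "GSGKST-LLR"
toy_b = "GAGKSTALMK"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

Low identity computed this way does not rule out homology; as the text notes, structural superposition can reveal relationships that the sequence ratio alone does not.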
Starting with structures of molecules known or expected to
have a common evolutionary origin, it is often possible to find
common structural features, so known functional homology can
lead to insights into structural homology. The question is, given the
rules of homology derived from our knowledge of how macromolecules
evolve, can we infer homology and gain functional insights from
structural similarity? Here we describe tools and several example
cases that do just that.

2 Materials

2.1 The Protein Data Bank (PDB)

The PDB is the information resource that provides the basis for
structural homology searches and therefore functional annotation
via structural homology [29, 30]. The PDB is the repository of
publicly released, experimentally determined bio-macromolecular
structures and complexes. There are three partner PDB organizations
and processing centers, PDBe (Protein Data Bank in Europe),
RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein
Data Bank), and PDBj (Protein Data Bank in Japan), that form
a consolidated structure repository, the worldwide PDB (wwPDB)
[31, 32]. The Biological Magnetic Resonance Data Bank
(BMRB) is a fourth member of the wwPDB and serves as a repository
of NMR data. As of the date that this was written, there are
nearly 160,000 biological macromolecular structures entered in the
PDB [29].
222 Brent W. Segelke

Fig. 2 The many tiers of protein structural homology. (a) Comparison of the
domain organization of two functionally homologous ATP binding cassette (ABC)
transporters. The left figure shows a schematic depiction of the five
non-covalently interacting domains of the vitamin B12 import ABC transporter
in complex with its vitamin B12 periplasmic binding protein (PDB: 2qi9, [25]).
The right figure shows a schematic of the eight-domain maltose import ABC
transporter in complex with its maltose periplasmic binding protein (PDB: 2r6g,
[24]). Domains depicted with the same shape and shade are homologous—they
are thought to have a common ancestor gene. The two larger circles at the
bottom of each schematic are the ATP binding motor domains, the vertically
oriented ovals are the transmembrane permease domains, and the apical circles
depict the periplasmic binding proteins. (b) The chain composition of the given ABC
importer complexes. The ovals with adjoining line segments depict the protein
chains corresponding to the ABC transporter directly above. The three chains on
the left are from the vitamin B12 transporter and the four chains on the right are
from the maltose transporter. Note the vitamin B12 transporter is a homodimer
of heterodimers plus the periplasmic binding protein; the maltose transporter
contains a homodimer of ATP binding cassettes, but the rest is a heterooligomer.
The maltose transporter permease domains are structurally homologous but
have only weak sequence homology. (c) Topology of ATP binding cassette
domains (corresponding to the circles at the bottom of the schematics in part
a of the figure). The topology diagrams depict the secondary structural elements
and their interconnectivity corresponding to the ATP binding domains for the ABC
transporter directly above. The number, the constituents, the relative size, and

The Research Collaboratory for Structural Bioinformatics
(RCSB) PDB, the main PDB repository, is also a website that
provides an interface to the repository and links to many other
resources that make use of data in the PDB. The most important
features of the PDB web interface for structure-based function
annotation are the search utilities that help identify an entry of
interest (see Note 1), the summary information returned as a search
result, links to other annotation, and links to a list of pre-calculated
structural similarities. The PDB front page [29] provides a basic,
single field, facile search utility that will find PDB ID matches, or
matches to: author(s), macromolecule name, sequence, or ligands
contained within an entry (see Note 2). When a search is refined to a
single entry (i.e., one PDB ID) or a single entry is chosen from a list
of search matches, the site returns a webpage with several clickable
tabs related to the entry.
PDB tabs for individual PDB entries include: “Structure Sum-
mary,” “3D View,” “Annotations,” “Sequence,” and “Experi-
ment.” The Structure Summary tab provides a primary literature
citation, if available; information for each of the macromolecules in
the asymmetric unit; ligands or small molecules in the entry; exper-
imental data and experimental validation in comparison to other
entries in the PDB and other entries with similar resolution. The
Structure Summary tab also provides numerous links to other data
resources, such as UniProt [12], and downloadable files associated
with the entry. The macromolecules section on the summary page
provides links to find other PDB entries with similar sequence or
similar structure (see Note 3). The 3D View tab provides a basic web-based
molecular viewer that can also display electron density. The Annotations
tab provides annotation descriptions and links to CATH,

Fig. 2 (continued) the order of connectivity can be discerned from the topology
diagram. The order, the orientation, and interactions of β-strands making up
β-sheets can also be inferred. It is readily apparent that the ATP binding domains
from the vitamin B12 transporter and the maltose transporter are homologous.
β-Strands are depicted as rectangular arrows and α-helices are depicted as
rectangles. (d) Superposition of ATP binding domains from the vitamin B12
transporter and the maltose transporter. Shown are the ribbon diagrams of
the vitamin B12 transporter and the maltose transporter ATP binding domains
superimposed. The vitamin B12 transporter ATP binding domain is shown in a
lighter shade of gray compared to the maltose transporter homolog. From the
superposition of ribbon diagrams, it is apparent that these two domains are
highly homologous. (e) Superposition of the ATP binding region of vitamin B12 and
maltose transporters. The ATP binding motif is at the dimer interface between
ATP binding domains. The number, type, and conformations of the residues
interacting with ATP are nearly identical (Figure panel c was adapted from the
HERA diagrams [26] available from PDBSum [27]. Figure panels d and e were
composed with Chimera [20])

PFAM, and GO (CATH and PFAM are further described below).
The Sequence tab provides links to sequence database entries for
macromolecules in the PDB entry and gives a pictorial view of
secondary structure assignment above the single letter sequence
string. The Experiment tab provides extensive details about the
experiment(s) leading to the PDB entry.

2.2 BLASTp: Basic Local Alignment Search Tool Protein-Protein

BLAST is a heuristic algorithm for similarity searches [33, 34] and
is perhaps the most widely used search tool for finding database
entries that contain sequence information. BLASTp is a useful tool
for examining structural homology when used to search for protein
sequences in the PDB. Like SAS (described below), BLASTp can be
used to find PDB entries that have protein sequence similarity to a
query protein sequence, providing an entrée to structural homology
starting from protein sequence (see Note 4). As with SAS,
BLASTp is most useful for finding the proteins in the PDB most
closely related to the query sequence.

2.3 FATCAT: Flexible Structure Alignment by Chaining Aligned Fragment Pairs Allowing Twists

FATCAT is the algorithm used by RCSB (see Note 5) to precalculate
the clusters of structural similarity for the entire PDB
[35]. FATCAT is a structure-based alignment and search tool that
scores pairwise matches based primarily on the root mean square
deviation of aligned fragment pairs. A FATCAT search against
all of the chains in the PDB is computationally expensive, so FATCAT
searches are generally conducted against a representative set of
chains selected by sequence clustering of the chain sequences in
the PDB. FATCAT and other aligned fragment pair algorithms are
inherently sequence-independent comparison methods and as such
can identify homologs with very different sequences.
As a consequence of matching against representative chains for
sequence clusters, FATCAT often returns a matched set of chains
with very diverse sequences. This output can provide the basis for a
large multi-sequence structure-based alignment that has few highly or
absolutely conserved amino acids. If the highly conserved residues
cluster together in space to form a 3D motif, which can be determined
by mapping the conserved residues onto 3D structures,
these residues are often functionally important (see Note 6).
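
Whether a handful of conserved residues "cluster together in space" can be checked numerically once their coordinates are known. The sketch below is one simple way to do that, assuming one representative coordinate per residue (for example, the C-alpha atom). The function names, the 12 Å cutoff, and the toy coordinates are assumptions made for illustration; they are not values prescribed by FATCAT or by this protocol.

```python
import math
from itertools import combinations


def max_pairwise_distance(coords):
    """Largest distance between any two points (same units as the input)."""
    return max((math.dist(p, q) for p, q in combinations(coords, 2)), default=0.0)


def forms_compact_cluster(coords, cutoff=12.0):
    """True if every pair of points lies within `cutoff` (hypothetical threshold)."""
    return max_pairwise_distance(coords) <= cutoff


# Toy coordinates (angstroms) standing in for three conserved residues.
conserved_residue_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
print(forms_compact_cluster(conserved_residue_coords))
```

A stricter or looser cutoff may suit a particular family; visual inspection in a molecular viewer remains the more informative check.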
Users can obtain FATCAT results by submitting a query on the
FATCAT web service. The FATCAT web service interface [36]
requires only a PDB ID and chain ID and a choice of databases of
representative chains from precalculated sequence clusters as input.
The output is emailed to the user-designated email address.

2.4 PDBsum

PDBsum is a useful complement to the main PDB resource because
it nicely aggregates and displays structure information contained
within PDB entries; for example, topologies are much more accessible
through PDBsum than through the RCSB website
[27, 37]. PDBsum is a rich interactive website maintained by the
European Bioinformatics Institute (EBI) at the European Molecular
Biology Laboratory (EMBL). The PDBsum home page [37]
offers several search options to help find PDB entries. PDBsum will
search the PDB based on the four-character PDB entry code (PDB ID);
keyword search, which scans TITLE, HEADER, COMPND,
SOURCE, and AUTHOR metadata associated with PDB entries;
protein sequence; or UniProt ID, PFAM ID, or Ensembl ID. The
PDB ID search, keyword search, sequence-based search, and UniProt
ID search are similar to other utilities provided by the
Research Collaboratory for Structural Bioinformatics PDB, the
main PDB repository, and the National Center for Biotechnology
Information protein BLAST (when used to BLAST a sequence or
UniProt ID against the PDB). With the exception of the PDB ID
search, the search utilities may return zero to any number of
matches to the PDB. The search utilities provided on the PDBsum
homepage will return a list of clickable links to PDBsum entries for
individual PDB entries. The value added by PDBsum is the aggregated
annotation for individual PDB entries and links to structure
analysis tools and other databases.
The PDBsum result returned for an individual PDB entry is a
webpage with a varying number of clickable tabs, depending on the
features of the given PDB entry. Each tab aggregates information
about the PDB entry. PDBsum tabs include: Top page, Protein,
Metals, Prot-prot, Clefts, Pores, Tunnels, and Links. The top page
shows some of the metadata entered with the PDB entry by the
authors; some QA/QC information, such as the reported R-factor and
R-free and a link to the Ramachandran plot; and a content
list including protein chains, DNA/RNA chains, non-metal
ligands, and metals. The top page includes an additional breakdown
of information in subsections for each chain, with an enlargeable
pictorial ribbon diagram icon, a link to the UniProt entry for the
associated protein, a clickable pictorial of the PFAM domain content for
the given chain, a pictorial of the secondary structural elements in
sequence order, and clickable segments for the associated CATH
domains for each chain. The Protein tab provides a more detailed
overlay of secondary structure elements with the chain amino acid
sequence; enlargeable icon pictorials of the domain topologies for
each of the CATH domains in a protein chain; and links to
infographics displaying precalculated amino acid conservation, or to
other web services that assess the conservation of amino acids
across a family of related sequences, including ConSurf and
SAS (both described below). The Prot-prot tab provides a schematic
and table describing protein–protein interactions within the
crystallographic asymmetric unit. The Clefts, Pores, and Tunnels
tabs provide pictorial and tabulated precalculated results from
MOLE 2.0 (not described). The Links tab provides numerous
links to various structure and sequence-based informatics resources
and quality assessment utilities that can be applied to the given PDB
entry.

2.5 UniProt: The Universal Protein Resource

UniProt is a consortium of bioinformatics institutes that provides
web-based resources for finding and exploring protein sequences
[12]. The UniProt consortium is made up of the European Bioinformatics
Institute, the Swiss Institute of Bioinformatics, and the
Protein Information Resource. The UniProt resource for protein
sequences and annotation is built on the SWISS-PROT and Translated
EMBL Nucleotide Sequence Data Library (TrEMBL) databases.
SWISS-PROT entries have been manually annotated and
reviewed, whereas TrEMBL entries are automatically annotated
by annotation algorithms and are not reviewed. As of the time of
this writing, there are 300 times as many TrEMBL entries as there
are SWISS-PROT entries [13]. RCSB PDB pages and PDBsum
pages cross-reference UniProt for protein chains represented in a
given PDB entry. This provides a convenient means to check for
annotation at the protein level because a UniProt web page for a
given entry provides an assessment of the annotation status and
links to primary literature related to the given entry, along with
much other information.

2.6 PFAM: The Protein Families Database

PFAM is a database of precomputed protein families based on
protein sequence and the associated hidden Markov model that
defines the family [38]. A typical PFAM entry represents a conserved
protein segment found encoded across multiple species’
genomes. A PFAM entry does not necessarily represent a whole
protein or chain. A PFAM entry often corresponds to a protein
domain, and a protein chain may contain multiple segments
corresponding to PFAM entries. PFAM entries are classified as one of
six types: family, domain, repeat, motif, coiled-coil, or disordered.
A family is a collection of related protein regions; a domain is
similar to a CATH or SCOP domain, an independently folding
structural unit; a repeat is defined as a “short unit which is unstable
in isolation but forms a stable structure when multiple copies are
present”; a motif is defined as “a short unit found outside globular
domains”; a coiled-coil is typically a helical bundle coiled together;
and disordered regions are predicted or known to be unstructured.
The PFAM website [39] provides search utilities for finding
individual entries but PFAM entries related to protein structures
are nicely linked from the PDBsum and RCSB PDB pages for PDB
entries. Users can also find PFAM entries corresponding to a PDB
entry by entering the PDB ID in the PFAM search utility. An
individual PFAM entry web page includes: a summary page, with
several links with text annotations corresponding to the given
entry; a link to an architecture (domain organization) page; a link to
an HMM logo page; a link to a page listing known structures that have
at least one domain corresponding to the given PFAM entry; and a
link to a page that allows the user to generate a variety of sequence
alignments depending on user selected input options. The text
annotations are not generally comprehensive, particularly for large
families, and may only pertain to an archetypal member of the given
family. The InterPro annotation [40] generally has more complete
information. The domain organization (also called architecture in
PFAM) page provides a list of domain architectures, represented
graphically, that contain at least one segment corresponding to the
given PFAM entry. The HMM logo page provides a graphical
representation of the sequence conservation at each position
along a sequence alignment, the amino acids at a given position
across the family of sequences, and the relative frequency of amino
acids present at each position. Domain organization can be useful
for finding homologous proteins at the domain organization level.
The HMM logo pages can be useful for finding highly conserved
residues in a sequence that can then be examined in a 3D structure
context for functional significance. The list of structures can be
useful for finding homologous structures, which can in turn enable
a detailed examination of structural homology.

2.7 CATH: Class Architecture Topology Homologous Superfamily

CATH is a database of classified domain structures derived from
analysis of the structural domains present in the individual chains in
the PDB [41]. It is also a website that offers search and analysis
tools, enabling the use of the domain classification database
[42]. CATH classifies domains based on class, architecture, topology,
and homologous superfamily, in that order. Each class, architecture,
topology, and homologous superfamily is assigned a
number such that a superfamily is identifiable by a four-number
ID. For example, 3.30.70.100 is the identifier for a superfamily that
is in the alpha-beta class, with a two-layer sandwich architecture and
alpha-beta plaits topology, and that has 620 member domains identified
from the PDB within 35% sequence similarity. Superfamilies are
further subdivided into sequence similarity clusters at the sequence
family (35% similarity), orthologous family (60% similarity), like
domain (95% similarity), and identical domain (100% similarity)
levels. Each similarity cluster is also assigned a number, so a set of
domains with identical sequence from the PDB belonging to a
CATH superfamily can be identified by a seven number ID, PDB
ID, plus chain identifier.
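
Because the four-number superfamily ID encodes the hierarchy directly, two IDs can be compared level by level to see how closely two domains are classified. The following is a minimal sketch of that idea; the class and function names are hypothetical and are not part of any CATH software, and the second ID used in the example is arbitrary.

```python
from typing import NamedTuple, Optional


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


LEVELS = ("class", "architecture", "topology", "homologous superfamily")


def parse_cath_superfamily(cath_id: str) -> CathId:
    """Split a dotted four-number ID such as '3.30.70.100' into its levels."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return CathId(*map(int, parts))


def deepest_shared_level(a: CathId, b: CathId) -> Optional[str]:
    """Name of the deepest level at which two IDs still agree, or None."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared


first = parse_cath_superfamily("3.30.70.100")   # ID quoted in the text
second = parse_cath_superfamily("3.30.70.20")   # arbitrary ID for illustration
print(deepest_shared_level(first, second))
```

Two domains that agree down to the homologous superfamily level are the strongest candidates for the detailed structural comparison described later in this chapter.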
CATH entries can be identified with the CATH web site search
utilities, and CATH is often conveniently linked by other structure
analysis resources like PDBsum. The CATH website [42] provides
searches based on CATH ID, PDB ID, other reference IDs such as
UniProt ID, keyword, protein sequence, or by structure. The
structure search requires a coordinate file be uploaded. A CATH
search will return a high-level report on the matching superfamilies,
domains, and PDB structures matched to the query with clickable
links to further examine the query results. The PDBsum link to
CATH links directly to the superfamily match for a given domain
within a PDB entry.

For each superfamily, the CATH website provides links to: a
summary, a view of the whole superfamily superposition, the
CATH classification, functional families, and structural neighbor-
hood. The summary page provides information about the diversity
of GO terms assigned to domains in the superfamily, the diversity of
EC terms, and the number of unique species with genomes encod-
ing a domain that belongs to the superfamily. There is also a scatter
plot of the structural diversity vs. sequence diversity with the given
superfamily highlighted on the plot and summary statistics listing
the number of domains from the PDB matching the superfamily,
the number of domain clusters, etc. Each of the other links provides
interesting and potentially useful information, but superfamilies
often have very diverse functional assignments, and it can be difficult
to find a meaningful annotation that applies to a particular
structure of interest from lists of annotations applicable to the
whole superfamily. The Classification/Domains page, however, provides
an interactive cluster diagram that helps explore the structural
neighborhood of a given superfamily. The user can enter a PDB ID
to locate a specific PDB entry on the diagram and quickly discover
identical and nearest neighbor structures available in the PDB.
Knowing the nearest neighbors can be the basis for examining
close structural and functional homology.

2.8 SCOPe: Structural Classification of Proteins-Extended

SCOPe [43], like its predecessor SCOP, is a database of protein
domain classifications based on structure and evolutionary relationships
[44]. Original versions of SCOP, initiated at the Laboratory
for Molecular Biology at the UK Medical Research Council, were
based entirely on manual curation. SCOP manual curation ended in
2009. SCOPe, developed and maintained at Lawrence Berkeley
National Lab and UC Berkeley, resumed PDB domain curation
based on the SCOP classification, now based largely on automated
curation.
SCOP classification, like CATH, is hierarchical, but has seven
levels of classification: SCOP release, Class, Fold, Superfamily,
Family, Protein, and Species. SCOP superfamilies are used as the
basis for identifying structural clusters, which in turn form the basis
for identifying representative domains, which can be searched more
efficiently for structural homology. The SCOPe website provides a
search utility that can be used to find SCOP entries that correspond
to a given PDB entry, which can then be compared to related SCOP
entries at different levels in the SCOP hierarchy. Comparison to
family members in the same superfamily can help to identify func-
tional homologs.

2.9 SAS: Sequence Annotated by Structure

SAS is a tool for finding PDB entries that have sequence similarity
to a given protein sequence, providing an entrée to structural
homology starting with protein sequence [45]. PDBsum provides
a convenient link to SAS for a given PDB entry so that other PDB
entries with sequence similarity to a given entry can be identified,
enabling structural homology studies for closely related proteins.
SAS uses FASTA to search the protein sequences in the PDB against a
given protein sequence and generate a multi-sequence alignment of
matched sequences; in essence, SAS is a front end for FASTA
protein sequence searches against the PDB. SAS also organizes
the FASTA output and maps secondary structure elements onto
the primary query sequence. FASTA, like BLAST, is a heuristic
algorithm for similarity searches [46] (see Note 7).

2.10 ConSurf

ConSurf is a program that categorizes amino acid positions within
a sequence based on degree of evolutionary conservation [47]. Each
conservation category is assigned a numerical
value that can be color coded on a protein sequence or multi-sequence
alignment and/or mapped onto a 3D molecular surface.
Given a protein sequence or a PDB coordinate set, from which the
protein sequence can be extracted, ConSurf will generate a multi-sequence
alignment, cluster sequences to remove highly redundant
sequences, then calculate an estimated conservation rate, which is
then broken down into categories 1–9. The rank categories are
conveniently represented by discrete colors that can be mapped
onto sequences or 3D structures. ConSurf also outputs the multi-sequence
alignment, a dendrogram depicting the interrelatedness
of sequences in the multi-sequence alignment, script files for displaying
ConSurf results in Chimera (and other popular molecular
graphics programs), and text files. The text files contain the estimated
conservation rates for each position, the category number,
reliability statistics, and the frequency of appearance for each amino
acid type for each position along the sequence.
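
The idea of grading each alignment position on a 1–9 scale can be illustrated with a much simpler score than the one ConSurf computes. The sketch below swaps in per-column Shannon entropy, which ignores the phylogenetic relationships that ConSurf's rate estimate takes into account, and bins the result into nine equal-width grades. The function names, the equal-width binning, and the toy alignment are all assumptions for illustration and will not reproduce ConSurf output.

```python
import math
from collections import Counter


def column_entropy(column: str, gap: str = "-") -> float:
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column.upper() if r != gap]
    if not residues:
        return 0.0  # all-gap column: scored as zero entropy in this sketch
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())


def conservation_grades(alignment, n_grades: int = 9):
    """Grade each column from 1 (most variable) to n_grades (most conserved)."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # upper bound for 20 amino acid types
    grades = []
    for column in zip(*alignment):
        conserved = 1.0 - min(column_entropy("".join(column)) / max_entropy, 1.0)
        grades.append(1 + min(int(conserved * n_grades), n_grades - 1))
    return grades


# Toy alignment: made-up sequences for illustration only.
toy_alignment = ["GKST", "GKAT", "GRCT"]
print(conservation_grades(toy_alignment))
```

Grades produced this way can be written next to residue numbers and used to color a structure, in the same spirit as the ConSurf script files described above.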
Evolutionary conservation of amino acids at specific locations
in a protein structure is highly related to the functional
importance of those amino acids. To the degree that a multi-sequence
alignment can
faithfully recapitulate structural alignment of amino acid positions
in three dimensions, sequence conservation can reveal functionally
important sequence positions and corresponding amino acid types.
ConSurf is a convenient tool for sequence conservation analysis
that can be mapped onto 3D structure, and it is conveniently linked from
the PDBsum “Protein” tab by clicking “Analysis of sequence’s
residue conservation.”

2.11 Chimera

Chimera is a display interface with a large number of integrated
tools for display and analysis of molecular structure [20]. Chimera,
the successor to Midas, has been actively maintained by the
Resource for Biocomputing, Visualization, and Informatics
(RBVI) at the University of California, San Francisco for many
years and can be used for non-commercial purposes without a
license fee. It is beyond the scope of this chapter to describe all
of the features of Chimera, but it is useful to understand how to
display structures from PDB entries, how to select and display or
undisplay parts of a molecular structure, how to highlight parts of
molecular surfaces, and how to superimpose molecular structures.
A macromolecular structure can be retrieved over the internet
and displayed in Chimera by choosing the “Fetch by ID...” option
under the “File” menu. Choosing the Fetch by ID option will open a
dialog box. Selecting the PDB radio button, entering a valid PDB
ID, and clicking fetch in the dialog box will display the structure of
the molecule(s) from the PDB entry (see Note 8). Many molecules
from different PDB entries or other files can be displayed simulta-
neously, which facilitates comparison of molecules. When a macro-
molecule from a PDB entry is first displayed in Chimera, it is shown
with the default ribbon diagram and color scheme (see Note 9).
Chimera offers a diverse set of selection tools that can be used to
highlight various features from a molecular structure (see Note 10)
and a powerful structure comparison tool that can be used to
superimpose structures and to generate structure-based multise-
quence alignments (see Note 11). Chimera can also be used to
superimpose structures based on a specifically selected list of
atoms, which can be used to superimpose structures based on
small 3D motifs for example (see Note 12).
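
Superposition on a selected list of atoms is usually judged by the root mean square deviation (RMSD) between the matched atoms. The sketch below only computes that number for two coordinate lists that are assumed to be already superimposed and listed in matching order; it does not perform the fitting step that Chimera carries out. The function name and toy coordinates are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points in matching order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy motif atoms (angstroms): the second set is the first shifted by 1 along z.
motif_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
motif_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 2.0, 1.0)]
print(f"RMSD = {rmsd(motif_a, motif_b):.2f}")
```

For small motifs of a few residues, a low RMSD over side-chain atoms is a more demanding test than a low RMSD over backbone atoms alone.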

2.12 CASTp: Computed Atlas of Surface Topography of Proteins

CASTp is a web-based application for identifying and characterizing
pockets and cavities in macromolecular structures [48]. CASTp
is useful for identifying possible ligand binding sites or comparing
putative ligand binding sites from structures of homologous
macromolecules. CASTp results, in combination with sequence
conservation analysis, are useful for identifying potential catalytic
sites or ligand binding motifs. CASTp has precalculated results for
many PDB entries that can be retrieved and displayed by entering
the PDB ID. The web application has a simple display interface that
shows a volume rendering of pockets and cavities with the 3D
ribbon of the corresponding structure. The user can choose from
a short list of display options to simultaneously display amino acids
or surface patches that line pockets or cavities. CASTp lists pocket
surface area, pocket volume, and the atoms that line the pocket(s).
For each atom listed, the chain ID, residue number (or sequence
ID), amino acid type, and atom type are given. The CASTp web
interface also displays the linear sequence for the given PDB entry
and highlights the residues contributing to selected pockets. For
structures that do not have precalculated results, such as newly
determined structures that have not been deposited, the CASTp
website allows users to upload coordinate files for analysis. Results
will be emailed to a given address. CASTp results can be down-
loaded for use with molecular display programs like Chimera (see
Note 10).

2.13 RASMOT 3D PRO: Recursive Automatic Search of MOTif in 3D Structures of PROteins

RASMOT 3D PRO is a web-based application for finding homologous
motifs in protein structures [49, 50]. The input motif is a set
of 3D coordinates generated by the user and uploaded in PDB
format. Amino acids making up a motif can simply be copied out
of a PDB file downloaded from RCSB PDB or can be selected and
written out from a molecular graphics display program such as
Chimera. According to the RASMOT 3D PRO help web pages,
motifs of 3 to 30 residues can be used. Small motifs of 3–5
amino acids from discontiguous regions of the primary sequence
that are in close proximity in the 3D structure, such as a small cluster of
catalytic residues or a metal binding motif, tend to return meaningful
matches (see Note 13). There are several other input fields in the
RASMOT 3D PRO web interface, but if a user simply uploads their
motif coordinates and submits a run, the other fields are set to
default values and a search for a match is executed against a
non-redundant set of representative PDB entry chains (see Note
14). The RASMOT 3D PRO output is a ranked list of matches. Each
row of the output list represents a matched PDB entry and includes
the PDB ID, the chain ID, the amino acid sequence length (size),
the root mean squared deviation, the residue types, and the amino acid
sequence position numbers for the matched motif.
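
Copying the residues of a motif out of a PDB file can also be scripted. The sketch below keeps only the ATOM/HETATM records whose chain ID and residue number are in a chosen set, relying on the fixed-column layout of the PDB format (chain ID in column 22, residue number in columns 23–26). The function names are hypothetical, and the small formatter exists only to build toy input for the demonstration; the residues and coordinates shown are made up and do not come from a real entry.

```python
def format_atom_line(serial, atom, res_name, chain, res_seq, x, y, z):
    """Build a toy fixed-column ATOM record (demonstration input only)."""
    return ("ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}"
            .format(serial, atom, res_name, chain, res_seq, x, y, z))


def select_residues(pdb_lines, wanted):
    """Keep ATOM/HETATM lines whose (chain ID, residue number) is in `wanted`."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        chain_id = line[21]          # column 22
        res_seq = int(line[22:26])   # columns 23-26
        if (chain_id, res_seq) in wanted:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up records standing in for a downloaded coordinate file.
toy_lines = [
    format_atom_line(1, "CA", "HIS", "A", 57, 1.0, 2.0, 3.0),
    format_atom_line(2, "CA", "ASP", "A", 102, 4.0, 5.0, 6.0),
    format_atom_line(3, "CA", "SER", "B", 195, 7.0, 8.0, 9.0),
]
motif = select_residues(toy_lines, wanted={("A", 57), ("B", 195)})
print("\n".join(motif))
```

The selected lines can be saved to a text file for upload; whether the server expects additional records in the file is not covered here and should be checked against its help pages.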

3 Methods

3.1 Identifying Homology Through Common Domains or Domain Architecture

Homologous proteins or macromolecular assemblies will often
have domains or chains within the assemblies that have a much
higher similarity than the overall similarity between assemblies. Just
as ABC transporters can be identified by the presence and similarities
of their ATP binding cassettes, homologous multidomain
proteins or oligomeric complexes can be identified by their common
domain or chain. Given a structural domain or chain, PDB
entries with a homologous domain or chain can be identified by
BLASTing [33] the given domain sequence against the PDB (see
Note 15). Given a protein with a known domain architecture,
proteins with related domain architectures can be found via the
PFAM website [38, 39] under the architecture tab for a given
protein.
To identify and compare two proteins or protein complexes
that have similar domains in common:
• Retrieve the primary protein sequence for the domain of interest.
  – Navigate to the page for the PDB entry containing the domain of
    interest by entering the PDB ID in the search field.
  – Navigate to the chain of interest on the sequence tab, download
    the FASTA file, open the file in a text editing program (e.g.,
    Notepad), and copy the content of the file.
• Execute BLASTp.
  – Paste the sequence into the query sequence field on the BLASTp
    web page.
  – Choose “Protein data bank proteins (pdb)” for the search set.
  – Choose “blastp” as the program.
  – Submit the BLAST query, leaving all other parameters unchanged
    from their defaults.
• Choose a PDB accession code from the records returned by BLASTp.
• Compare the two structures (the source of the query and the match
  from BLASTp) using Chimera.
  – Fetch each of the PDB entries by ID in Chimera (option under the
    File dropdown menu).
  – Run matchmaker (see Note 11) with the model containing the chain
    that provided the query sequence chosen as the reference
    structure and the model that was chosen from the BLASTp results
    as the structure to match. Choose the best aligning pair of
    chains as the chain pairing option. The default setting for all
    other options should lead to a useful result.
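
The sequence retrieval and BLASTp submission steps above can also be scripted. The following is a minimal Python sketch and is not part of the original protocol. It assumes that the downloaded FASTA file uses headers of the form ">XXXX_2|Chains B, D|description|organism", and that the NCBI BLAST URL API accepts the parameter names shown (CMD, PROGRAM, DATABASE, QUERY) with "pdb" as the database label. Both assumptions should be checked against current RCSB and NCBI documentation. The function names are hypothetical, and the sketch makes no network request.

```python
from urllib.parse import urlencode


def chain_sequence(fasta_text, chain_id):
    """Return the sequence of the FASTA record whose header lists chain_id.

    Assumes (hypothetically) headers such as '>XXXX_2|Chains B, D|...'.
    """
    records = []
    header, seq_lines = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))

    for header, sequence in records:
        fields = header.split("|")
        if len(fields) < 2:
            continue
        # "Chains B, D" -> ["B", "D"]; "Chain A" -> ["A"]
        _, _, chain_list = fields[1].strip().partition(" ")
        chains = [c.strip() for c in chain_list.split(",")]
        if chain_id in chains:
            return sequence
    raise KeyError("chain %r not found in FASTA text" % chain_id)


def blastp_query_string(sequence):
    """Build the URL-encoded form data for a blastp search against the PDB.

    The parameter names and the 'pdb' database label are assumptions.
    """
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": "pdb",
        "QUERY": sequence,
    }
    return urlencode(params)
```

In practice, the string returned by blastp_query_string would be submitted to the BLAST service and the returned records inspected for PDB accession codes, exactly as in the manual procedure.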

The multidomain murine cluster of differentiation 1 (CD1) protein
heterodimer makes a good demonstration case for identifying homology
through common domains. At the time when the CD1 structure was first
determined [51], it was known to have a β2 microglobulin (β2m) chain
in common with MHC type I complexes [52] and was suspected of having
homologous function to MHC type I complexes. Retrieving the primary
sequence for β2 microglobulin from the 1CD1 PDB entry [51] and
BLASTing the sequence against the PDB results in a long list of
matched PDB entries, including PDB entries containing MHC type I
H-2Db.

Superposition of the CD1 and H-2Db complexes based on their β2m
domains reveals homologous domain arrangements and domain topologies
(Fig. 3a). As expected, the β2m domains from CD1 and H-2Db have
nearly identical structure. Somewhat less expected, the two complexes
have strikingly similar domain architectures for their A chains, and
the domains are nearly identically arranged relative to each other,
despite having very low sequence identity (Chimera matchmaker reports
a 20% sequence identity between the A chains of CD1 and H-2Db). The
similarity of domain architectures and arrangements is strong
evidence of a common ancestor for these two complexes.
Fig. 3 Discovering homology through shared domain(s). (a) Structure
comparison of MHC-like CD1 [51] and MHC class I H-2Db [52]. The
murine CD1 structure is compared to murine H-2Db by superposition
based on their β2m domains. It is readily apparent that CD1 and H-2Db
have quaternary homology beyond their shared β2m domain. The number
of chains is the same, as is the number, arrangement, and topology of
their domains. CD1 is colored orange (helices), blue (strands), and
dark gray (coils). H-2Db is colored yellow (helices), cyan (strands),
and white (coils). (b) The ligand binding pockets of CD1 and H-2Db.
Prominent clefts or pockets sit between the helices of the A chains
of CD1 and H-2Db. The upper panels show the solvent accessible
surfaces for these clefts, and the lower panels show the amino acids
that make up these clefts. This figure was composed with Chimera [20]
and CASTp [48]

Due to the structural similarity and strong evidence of homology, CD1
was assumed to have homologous function to H-2Db; that is, CD1 was
expected to present peptides to T cells. Although it was right to
infer that CD1 has a function related to that of MHC class I
complexes, CD1 provides a cautionary tale, warning against naively
adopting the functional annotation of the nearest homolog. Careful
comparison of structural features of the two complexes suggests that,
despite their similarity, there is something quite different about
CD1.

Comparing the gross morphology and specific chemical properties of
the ligand binding clefts of CD1 and H-2Db reveals that CD1 and H-2Db
have very different types of ligands (Fig. 3b). CASTp identifies
large pockets in the cleft between the α-helices of the A chains of
both molecules (Fig. 3b). The CD1 pocket is approximately three times
larger than the H-2Db pocket, both in estimated surface area and
volume. Mapping hydrophobicity onto the solvent accessible surface of
these molecules shows that the CD1 ligand binding pocket is much more
hydrophobic than the ligand binding pocket of H-2Db (Fig. 3b).
Finally, looking at the residues and atoms that contribute to the
solvent accessible surface of these molecules reveals that their
ligand binding domains are quite different. It is now known that CD1
is an antigen-presenting complex, like H-2Db, but CD1 presents
glycolipids to T-cell receptors, whereas H-2Db presents peptides.
Having the structures of these two molecules and being able to
compare them tells us where they are similar and where they are
dissimilar.
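
A rough numerical counterpart to comparing pocket hydrophobicity by eye is to average a hydropathy scale over the residues that line each pocket. The sketch below assumes the Kyte-Doolittle scale, which is one common choice and is not stated in the text to be the scale used for Fig. 3b. The function name is hypothetical, and any residue lists passed to it would need to come from an actual pocket analysis (e.g., the residues reported by CASTp).

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(pocket_residues):
    """Average Kyte-Doolittle hydropathy over pocket-lining residues.

    pocket_residues: iterable of one-letter amino acid codes.
    Higher values indicate a more hydrophobic pocket lining.
    """
    residues = [r.upper() for r in pocket_residues]
    if not residues:
        raise ValueError("no residues supplied")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)
```

A simple average ignores how much surface each residue actually contributes to the pocket, so it is only a first-pass check on what the surface rendering shows.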

3.2 Discovering Homology Through Domain Structure Similarity

By far the most common invocation of structural similarity is the use
of domain similarity, both in domain topology and three-dimensional
geometry, to classify proteins and to identify homologs. Indeed, both
CATH [41, 42] and SCOP [43] are databases of domains derived from
classifying structures based on their similarities. In addition,
precalculated structure similarity results, such as the results
presented on the structural similarity tab for a given PDB entry on
the RCSB website [29, 30], tend to return top scoring hits that have
high levels of domain similarity just by the nature of how proteins
fold and how the matching algorithms work. Thanks to the foundational
work of CATH, SCOP, and the all-vs-all precalculated structural
similarities generated at the PDB (see Note 3), structural biologists
can quickly identify related structures for a given structure and
investigate homologies. For newly determined structures that have not
been entered into the PDB, structure-based database searches, such as
the web service provided on the FATCAT website [36], can be performed
by uploading the coordinate file for the new structure to the
database search utility. Homology to a protein with well annotated
function can be used to infer details about function for a protein
with homologous structure.
As a demonstration case, the functional annotation of the
F. tularensis Rapid Encystment Protein 34 kDa (REP34) is reviewed
here. At the time the structure for REP34 was first determined, the
REP34 protein was implicated in inducing encystment by amoeba [53],
but the protein and the gene that encodes REP34 (orf FTN_0149) were
annotated as “conserved protein of unknown function.” FTN_0149 was
annotated as such because there was no identifiable sequence match to
a gene or protein with well-annotated function. There was also no
identifiable sequence match to proteins in the PDB using BLASTp. Once
the structure for REP34 was determined, homologs were readily
identified by structural similarity, and a function was inferred and
subsequently tested and confirmed [54].
The nearest available structural neighbors to REP34 share a core
domain, conspicuous motif, and likely molecular function (Fig. 4).
Accessing the PDB page for REP34 (PDB id: 4oko, [54]), a user can
quickly identify a putative metallopeptidase from Shewanella
denitrificans (PDB id: 3b2y, [55]) and a metallo-carboxypeptidase
from Pseudomonas aeruginosa (PDB id: 4a37, [56]) as the closest
structural neighbors. Superposition of their structures with Chimera
[20] MatchMaker reveals a common core αβα sandwich domain (Fig. 4a)
with nearly identical topology for the core domain. The putative
metallopeptidase from Shewanella denitrificans is made up entirely of
this core domain, whereas REP34 and the metallo-carboxypeptidase from
Pseudomonas aeruginosa have additional N-terminal elements. REP34 has
a small additional N-terminal helical domain, and the
metallo-carboxypeptidase from Pseudomonas aeruginosa has an
additional all β N-terminal domain made up of two anti-parallel
β-sheets sandwiched together. The existence of the large common core
domain is a strong indicator that these proteins are homologs even
though they cannot be identified as homologs by sequence similarity.
The function of REP34 and the putative metallopeptidase from
Shewanella denitrificans can be reasonably inferred from homology
to the well-annotated metallo-carboxypeptidase from Pseudomonas
aeruginosa and further supported by a highly conserved metal
binding motif (Fig. 4b). The middle four strands of the core
β-sheet are in a parallel arrangement, so their C-termini are proxi-
mal to each other. Near the C-termini of the central two strands,
there is a cluster of six residues that are conserved among all three
proteins along with several other residues that have high similarity
between the three proteins. Three of the conserved residues are
conspicuously coordinated to a metal (Fig. 4b) and the cluster of
residues resides at the base of a similarly located pocket for each of
the three proteins. The conformation of conserved residues is also
nearly identical. All of the structural evidence is consistent with
these molecules having homologous function. In fact, the metallo-
peptidase activity inferred from structural homology has been con-
firmed experimentally for REP34 [54].

3.3 Identifying Deeply Conserved 3D Motifs

A structural family identified by structural similarity can be used
to generate sequence alignments that would not be identifiable by
sequence alone. The structure-based multi-sequence alignment can then
be used to identify deeply conserved sequence and structural
elements. Deeply conserved elements that cluster together or reside
near other prominent structural features, such as large putative
ligand binding pockets, can be reliably implicated as having
functional importance. Shared deeply conserved structural elements
can be used to infer homologous function.
Fig. 4 Catalytic mechanism discovery through domain homology. (a) The
structure of a putative metallopeptidase from Shewanella
denitrificans (PDB id: 3b2y, [55]) is superimposed with the Rapid
Encystment Protein 34 kDa structure (PDB id: 4oko, [54]) and the
structure of metallo-carboxypeptidase from Pseudomonas aeruginosa
(PDB id: 4a37, [56]), and each is displayed as its corresponding
ribbon diagram in one of three different shades of gray. The common
α-β-α sandwich core domain, made up of the central eight-stranded
β-sheet, sandwiched on both sides by α-helices, is readily apparent.
In addition to the α-β-α sandwich domain, 4a37 has a nine-stranded
N-terminal β-sandwich domain (shown at the top of the figure) and
4oko has a short N-terminal α-helical domain. (b) Identifying the
catalytic motif through homology. On one face of the central β-sheet,
there is a gap between the helical segments that creates a solvent
accessible pocket near the C-terminal ends of the two central
β-strands (pockets not shown) for each of the three structures shown.
At the base of the pocket is a conspicuous metal binding motif made
up of two histidines and a glutamic acid. This is the known catalytic
metal binding motif of the metallo-carboxypeptidase from Pseudomonas
aeruginosa [56]. Near the metal binding motif are an arginine, a
glutamine, and a glutamic acid. The metal binding motif residues and
these other proximal residues implicated in the
metallo-carboxypeptidase activity are absolutely conserved in all
three structures, in amino acid type and three-dimensional position,
and have nearly identical conformations

To identify conserved 3D motifs:
• Identify the superfamily for your protein of interest.
  – Enter the PDB ID in the search field on the CATH homepage and
    click search.
  – Navigate to the matching superfamily for the given PDB ID and
    chain of interest.
  – Navigate to the CATH page for the corresponding CATH superfamily.
• Identify a set of representative diverse members of the family.
  – Navigate to the CATH Classification page for the given
    superfamily by clicking on the “Classification / Domains” link.
  – Select a number of PDB IDs from the structural neighborhood (see
    Note 17).
• Generate a structure-based alignment from the representative set of
  family members.
  – Load all of the individual models in Chimera using the Fetch by
    ID function under the File pulldown menu.
  – Use matchmaker to generate the multisequence alignment (see
    Note 11).
    · Choose the model and chain of interest as the Reference chain.
    · Select all of the remaining models as Structure(s) to match.
      Holding the ctrl key while clicking on models listed in the
      structures to match box toggles their selection.
    · Choose “Specific chain in reference structure with
      best-aligning chain in match structures” as the Chain pairing
      option.
    · Check the “After superposition, compute structure-based
      multiple sequence alignment” option. The other options can
      remain at their defaults. Click Ok to launch the alignment.
    · When the superposition calculation is complete, a “Create
      Alignment from Superposition” dialogue box may appear.
      Generally, Chimera has already made a reasonable selection of
      options for this utility, so click Ok to generate the sequence
      alignment.
• Identify the highly conserved residues.
  – Chimera will generate the multisequence alignment and display the
    alignment in a Match → Align window. For each block of aligned
    sequence there is a “consensus” row and a conservation bar chart.
    The most highly conserved amino acids are readily identifiable
    from the consensus row and conservation chart.
• Examine the highly conserved residues in their 3D structural
  context.
  – Select the most highly conserved residues by holding the shift
    key and dragging the mouse over the sequence alignment.
  – In the Chimera main display window, show selected residues by
    choosing Atoms/Bonds → show under the Actions drop down menu (see
    Note 18).
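
Once a structure-based multi-sequence alignment has been exported, the absolutely conserved positions that Chimera highlights in the consensus row can also be listed programmatically. The following minimal sketch assumes the alignment is available as equal-length strings with "-" as the gap character; the export format and the function name are assumptions, not features of Chimera.

```python
def conserved_columns(alignment, gap="-"):
    """List alignment columns where every sequence has the same residue.

    alignment: list of equal-length aligned sequences (strings).
    Returns a list of (column_index, residue) tuples, with 0-based
    column indices. Columns containing a gap are never reported.
    """
    if not alignment:
        return []
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")

    conserved = []
    for index, column in enumerate(zip(*alignment)):
        first = column[0]
        if first != gap and all(residue == first for residue in column):
            conserved.append((index, first))
    return conserved
```

The reported column indices refer to the alignment, not to residue numbers in any one structure, so they still need to be mapped back to each chain before the residues are selected and displayed in 3D.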

Proteins in the CATH superfamily 3.30.70.100 in the structural
neighborhood of the putative antibiotic biosynthesis monooxygenase
from Nitrosomonas europaea (PDB ID: 2omo, [57]) provide an
interesting example case for identifying a structural motif with
functional importance and for using this information to make
functional inferences. The 3.30.70.100 superfamily is associated with
328 unique GO terms and 126 unique EC terms, meaning the structure
family is functionally diverse and function cannot be inferred simply
from association with the superfamily.

The structural similarity link provided by the PDB for the 2omo entry
identifies a large set of PDB entries with structural similarity to
2omo. Searching CATH with the 2omo PDB ID places 2omo in the CATH
superfamily 3.30.70.100. By exploring the structural neighborhood in
CATH, a diverse set of sequences from PDB entries can be identified
as a basis for constructing a structure-based sequence alignment. In
this case, a small subset of PDB entries (14 entries from the 603
representative entries identified with some structural similarity)
have high enough sequence diversity to search for deeply conserved
features. Sequence identity between entries of this subset and 2omo
ranges from 11% to 47% (Table 1), as determined by the
structure-based pairwise alignments. Most of the PDB entries and the
proteins represented by these entries are annotated with putative
function or unknown function and have low UniProt annotation scores
(Table 1). Structure superposition reveals a highly conserved
architecture that includes an anti-parallel four-stranded β-sheet
covered on one side by three α-helical segments (Fig. 5a).
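
Percent identity from a pairwise alignment, such as the values in the identity column of Table 1, depends on which columns are counted. The sketch below assumes identity is computed over columns where both sequences have a residue, which may differ from the convention Chimera uses; the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences are ungapped.

    aligned_a, aligned_b: two rows of the same alignment (equal length).
    The choice of denominator (aligned, non-gap columns) is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)
```

Using the shorter sequence length or the full alignment length as the denominator would give lower values for the same alignment, so identities are only comparable when computed the same way.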
Table 1
PDB entries in the structural neighborhood of 2omo

ID | PDB description | UniProt description | Annotation score(a) | Identity (%)
2gff | Yersinia pestis lsrg | lsrG; (4S)-4-hydroxy-5-phosphonooxypentane-2,3-dione isomerase | 3 | 47
3qmq | E. coli autoinducer-2 modifying protein lsrg | lsrG; (4S)-4-hydroxy-5-phosphonooxypentane-2,3-dione isomerase | 2 | 43
1y0h | Structural genomics, unknown function | Putative monooxygenase | 2 | 22
3f44 | Putative monooxygenase | Putative uncharacterized protein | 1 | 21
2omo | Putative antibiotic biosynthesis monooxygenase | Domain of unknown function (duf176) | 1 | N/A
3kkf | Putative antibiotic biosynthesis monooxygenase | Putative flavoredoxin | 1 | 17
3mcs | Putative monooxygenase | Hypothetical cytosolic protein | 1 | 17
3bm7 | Putative monooxygenase | Antibiotic monooxygenase domain-containing protein | 1 | 15
1r6y | Unknown function | Probable quinol monooxygenase YgiN | 3 | 14
1q8b | Unknown function | Uncharacterized protein Yjcs | 1 | 11
1x7v | Unknown function | Antibiotic monooxygenase domain-containing protein | 1 | 19
2fb0 | Conserved protein of unknown function | Antibiotic monooxygenase domain-containing protein | 1 | 22
2bbe | Unknown function | Antibiotic biosynthesis monooxygenase family protein | 1 | 19
4dpo | Conserved protein of unknown function | Conserved protein | 1 | 16

(a) Annotation scores are taken from the UniProt entry linked from
the PDB entry

The structure-based multi-sequence alignment reveals three absolutely
conserved amino acids that are not proximal on the sequence but
cluster closely in 3D space on the structural scaffold of this fold
(Fig. 5b, c). Mapping the conserved residues onto the 3D structure
reveals that the conserved histidine is positioned between and within
hydrogen bonding distance of the two conserved glutamic acid residues
(Fig. 5c). Furthermore, one of the conserved glutamic acids and the
conserved histidine are found at the base of a pocket located on one
face of the protein between helical segments and the β-sheet (Fig. 5)
and within hydrogen bonding distance of a water molecule that appears
in a nearly identical position for many of the PDB entries in this
comparison. The absolute conservation of these residues, their nearly
identical conformations, their clustering in space, their proximity
to the base of a pocket on one face of these structures, and their
association with a conspicuous structurally conserved water are
highly suggestive that these conserved residues are functionally
important and that the set of proteins are homologs.
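
Whether conserved side chains and a water molecule sit within hydrogen bonding distance of each other can be checked directly from atomic coordinates. The sketch below is illustrative only: the 3.5 Å cutoff is a commonly used heuristic rather than a value given in the text, the function names are hypothetical, and a distance test alone ignores donor and acceptor geometry.

```python
import math
from itertools import combinations


def distance(p, q):
    """Euclidean distance between two (x, y, z) points, in angstroms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def close_pairs(atoms, cutoff=3.5):
    """Return pairs of atom labels lying within the distance cutoff.

    atoms: dict mapping a label (e.g. 'His NE2') to (x, y, z) coordinates.
    cutoff: assumed heuristic upper bound for a hydrogen bond, in angstroms.
    """
    pairs = []
    for (label_a, xyz_a), (label_b, xyz_b) in combinations(atoms.items(), 2):
        if distance(xyz_a, xyz_b) <= cutoff:
            pairs.append((label_a, label_b))
    return pairs
```

Applied to the polar atoms of the three conserved residues and the conserved water in each superimposed entry, a check of this kind indicates how consistently the same contacts recur across the family.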
Proteins that are in the same structural neighborhood with 2omo and
share the same deeply conserved E-H-E-H2O 3D motif can now be better
annotated. There is strong evidence from primary literature that the
proteins represented in PDB entries 2gff and 3qmq isomerize the AI-2
molecule involved in quorum sensing. Isomerization occurs by
hydration of a ketone on the AI-2 molecule [59]. Proteins with
structures that belong to this cluster would be better annotated as
hydrolases and putative keto-isomerases. Given the structural and
sequence similarity among 2omo, 2gff, and 3qmq, the protein in 2omo
can reasonably be annotated as a putative AI-2 modifying enzyme.

Comparing the structure from 2omo and its homologs with
well-annotated monooxygenases from the 3.30.70.100 superfamily
reveals that the 2omo homologs are likely not monooxygenases. The
3.30.70.100 superfamily has several family members that are
well-annotated monooxygenases (annotation score of 5, Table 2).
Repeating the analysis of deep conservation by structure-based
sequence alignment on a monooxygenase cluster from this superfamily
reveals that the monooxygenases do not contain the E-H-E-H2O 3D motif
of the 2omo homologs but rather have a different deeply conserved 3D
motif (Fig. 5d). Monooxygenase SnoaB (PDB entry 3kg0, [73]) is a
well-annotated monooxygenase (Table 2), and the cluster of its near
structural neighbors reveals the same structural architecture but a
deeply conserved 3D motif made of a tryptophan, phenylalanine, and
proline clustered in space and near a molecular pocket (not shown).
The tryptophan is implicated in the catalytic mechanism of SnoaB
[73]. The proteins of this cluster are clearly homologs and
reasonably annotated as monooxygenases. The proteins of the 2omo
cluster do not have the conserved motif of the monooxygenases and
should not be annotated as such.

Table 2
PDB entries in the structural neighborhood of monooxygenase 3kg0

ID | PDB description | UniProt description | Annotation score(a) | Identity (%)
1lq9 | Monooxygenase | ActVA 6 protein | 1 | 16
1iuj | Hypothetical protein | TT1380 protein | 1 | 18
3kg0 | Monooxygenase SnoB | Deoxynogalonate monooxygenase snoaB | 5 | 10
3hx9 | Heme-degrader MhuD | Heme oxygenase (mycobilin-producing) MhuD | 5 | 14

(a) UniProt annotation score. A score of 5 indicates the strongest
experimental evidence supporting the given annotation and a score of
1 indicates little or no evidence as to the annotated function.

Fig. 5 Identifying deeply conserved 3D motifs. (a) Superposition of
14 representative structures [58–70] identified as similar by FATCAT
[35] and listed on the structure similarity tab at the RCSB PDB
[29, 30] for PDB entry 2omo [57]. The structure for 3qmq was
identified by CATH [41, 42] as belonging to the same sequence cluster
as 2omo at the 35% sequence similarity level. The structure for 3qmq
is included in this structural family alignment because it is one of
the better annotated proteins in this collection [59]. (b)
Structure-based multi-sequence alignment from the multi-structure
alignment shown in (a). Each row is labeled with the PDB and chain
identifiers for entries used in the alignment. The three non-glycine
residues that are absolutely conserved for the whole multi-sequence
alignment (down an entire column for a given position in the
alignment) are indicated by capital letters in the consensus row
above the individual sequences. (c) The three absolutely conserved
non-glycine residues are displayed (as sticks with carbon atoms shown
in a lighter gray than His nitrogen atoms and Glu oxygen atoms) on
the ribbon diagrams of the superimposed structures (left panel). A
representative chain (3qmq, [59]) is displayed with the highly
conserved residues and a solvent accessible pocket (right panel). The
surface is shaded the same as the color of the atom corresponding to
the surface vertices. The conserved glutamic acid at the base of the
pocket is obscured by the section of molecular surface, but the
oxygen atoms of the side chain are contributing to the pocket solvent
accessible surface, as is the ε nitrogen from the conserved
histidine. The ε nitrogen of the conserved histidine is also within
hydrogen bonding distance of a water molecule. The absolute
conservation, proximity in space, hydrogen bonding network (both
glutamic acids are hydrogen bonded to the conserved histidine), and
positions of two of the residues at the base of a pocket strongly
imply that these residues are functionally important for this family
of proteins. (d) The canonical monooxygenase (ActVA from Streptomyces
coelicolor, [71]) is shown with three structural neighbors [72–74].
The structure-based multi-sequence alignment of this structural
cluster reveals a highly conserved motif made up of residues proximal
in space and at the bottom of a solvent accessible pocket on the
molecular surface (not shown). The conserved motif is made up of a
tryptophan and a contiguous tri-peptide
proline-glycine-phenylalanine. The conserved tryptophan is known to
be involved in monooxygenase activity. This figure was composed using
Chimera [20]

3.4 Identifying Distant Homologs with Shared 3D Motifs

Simple, highly conserved 3D motifs can be used to identify distant
homologs that do not necessarily belong to the same CATH or SCOP
families. Searching by motif similarity is a powerful means for
finding very distant homologs that do not share whole domain
architectures. These more distant homologies can still lead to
functional insights. Motifs can be identified, as in the last
section, by structure-based alignment of structurally similar
proteins that are distantly related by sequence. Putative 3D motifs,
for example, a conspicuous cluster of residues that form a hydrogen
bonding network at the base of a pocket, can also provide the basis
for a motif search. Matches to a motif search that extend structural
similarity beyond the motif used for the query provide strong
evidence of common ancestry. Proteins with metal binding sites
provide good example cases for 3D motif-based searches, since the
metal binding motif is conspicuous.
To identify distant homologs with shared small 3D motifs:

• Identify a motif (see Note 19).
• Export the coordinates for the motif.
  – Select the residues that make up the motif in Chimera (see
    Note 10).
  – Write the coordinates to a file using the “Save PDB...” option
    under the File drop down menu. Choose the “save selected atoms
    only” option and save the file.
• Search for PDB entries with matching motifs.
  – Upload the saved motif to the RASMOT-3D PRO web server [49].
  – Choose “identical residues” for the Protein Selection parameter,
    leave other parameters at their defaults, enter an email address,
    and run.
• Examine the similarity of proteins identified by motif match by
  aligning structures with the Chimera match command (see Notes 12
  and 20).
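
The coordinate export step above can also be done outside Chimera by filtering a PDB-format file for the residues of the motif. The following sketch relies on the fixed-column layout of ATOM and HETATM records (chain identifier in column 22, residue number in columns 23 to 26). The function name is hypothetical, and whether the resulting file satisfies every input requirement of the RASMOT-3D PRO server is not verified here.

```python
def extract_motif(pdb_text, residues):
    """Keep only the coordinate records that belong to the motif residues.

    pdb_text: contents of a PDB-format coordinate file.
    residues: set of (chain_id, residue_number) tuples, e.g. {("A", 120)}.
    Returns PDB-format text containing the matching ATOM/HETATM lines
    followed by an END record.
    """
    kept = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain_id = line[21]              # column 22
        residue_number = int(line[22:26])  # columns 23-26
        if (chain_id, residue_number) in residues:
            kept.append(line)
    return "\n".join(kept + ["END"]) + "\n"
```

A coordinated metal ion can be carried along by adding its chain and residue number to the set, since it appears as a HETATM record.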
The conserved domain protein encoded by the Bacillus anthracis gene
BA_5305 (UniProt A0A0F7RB18, PDB id: 4fca, [75]) provides a good
example case, as the gene, protein, and structure are unannotated
(UniProt annotation score: 1, PDB description: functionally unknown
conserved protein). The 4fca structure has four domains and a
conspicuous metal binding site in a long shallow cleft between two
domains (Fig. 6a). Using the two histidines and one glutamate as the
basis for a motif search, four representative PDB chains are
identified by RASMOT-3D PRO [50] with identical residues matching
with <0.2 Å RMSD (see Note 20). Superposition of the structures
identified with matching motifs reveals that the relative position,
conformation, and coordination of metal are nearly identical for all
four chains (Fig. 6a), even though only the first torsion for each
residue was used in the query (a feature of the RASMOT-3D PRO
algorithm).
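
For reference, the root mean squared deviation between two matched sets of N atoms is the square root of the mean squared distance between paired atoms. The sketch below computes that quantity for coordinates that have already been superimposed and listed in the same atom order; it does not perform the superposition and is not the scoring code of RASMOT-3D PRO. The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired (x, y, z) coordinates.

    Assumes the two coordinate lists are already superimposed and that
    atom i in coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("no coordinates supplied")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))
```

Because the value depends on which atoms are paired, an RMSD computed over a three-residue motif is not directly comparable to one computed over a whole domain.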
Fig. 6 Identifying homologs from motif similarity. (a) The protein
represented in PDB entry 4fca [75] is described by the depositors as
a conserved protein with unknown function. At the base of a cleft
formed by two domains of the protein (not shown), there is a
conspicuous metal binding motif made up of two histidines and a
glutamate that are all coordinated to a metal ion (left panel). The
two histidines reside on the same α-helix one helical turn apart from
each other. The glutamic acid residue resides on a helical section
packed next to the first helix in an anti-parallel direction. Using
the three-residue motif identified in the 4fca entry as the input for
RASMOT-3D PRO [49, 50] and selecting “identical residues” in the
protein selection returns a cluster of five matches, including the
query structure, with low RMSD (0.2 Å). The right panel shows the
superposition of the closest four matches: 4fgm [76], 1e1h [77], 3u9w
[78], and 4fca [75]. The motifs match nearly identically in all
torsions and relation to the coordinated metal ion even though only
the first torsion for each of the three residues of the motif is used
in the query (an implementation detail of the RASMOT-3D PRO
algorithm). (b) The full ribbon diagram is shown for each of the four
PDB entries with matching motifs that were superimposed in part (a).
Short segments of antiparallel helices that contain the motif and
three strands of a β-sheet in close proximity to the metal binding
motif are highlighted in gray. The remainder of the ribbon diagram
for each molecule is shown in white. (c) Given a

There is little or no apparent homology between 4fca and the other
three proteins with the matching metal binding motifs at the domain
architecture level (Fig. 6b), although structures in PDB entries 4fgm
and 3u9w have domain topology similarity for two of their three
domains (not shown). Upon close inspection, however, there is a small
subdomain with common architecture for all four chains (Fig. 6b). All
four structures have a common anti-parallel two-helix subdomain that
contains the metal binding motif, and all four structures have a
nearby β-sheet, the first three strands of which are nearly
superimposed based on having superimposed just the metal binding
motifs. The first two strands of this small section of β-sheet are
antiparallel to each other while the second and third strands are
parallel. Inspecting the amino acids in proximity to the metal
binding motif reveals a conserved glutamic acid on the helix with the
two conserved histidines and a highly conserved tyrosine nearby
(Fig. 6c). The conservation of structural elements beyond the query
motif and the conservation of the subdomain architecture are highly
suggestive of homology between these molecules. A small schematic can
be made to depict the “rules” for the core homologous element shared
by these proteins (Fig. 6d).

From the known functions of the newly identified distant homologs of
4fca, we can provide some annotation to this unannotated protein.
Each of the three identified homologs is an enzyme with the metal
co-factor implicated in catalysis. The conserved HEXXH sequence on
one helical section of the 3D motif is considered a hallmark for Zn
proteases [79]. The catalytic domain of botulinum toxin is in fact
known to be a Zn protease [77]. The protein represented by the 4fgm
entry is also described as a Zn aminopeptidase, although the
annotation score is low [76]. Despite containing the HEXXH motif, the
protein in the 3u9w entry is annotated as human leukotriene A4
hydrolase [78], and the annotation score indicates high confidence in
the annotation. The protein represented by 4fca could be annotated as
a metalloenzyme and hydrolase with high confidence. It is reasonable
to say that 4fca is a putative Zn metallopeptidase given the hallmark
features it has in common with known metallopeptidases.
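
The HEXXH sequence hallmark mentioned above is simple enough to scan for in any primary sequence. The following is a minimal sketch using a regular expression with a lookahead so that overlapping occurrences are also reported; the function name is hypothetical. As the 3u9w example shows, the presence of the pattern alone does not establish peptidase activity.

```python
import re

# H, E, any two residues, H. The lookahead allows overlapping matches.
HEXXH_PATTERN = re.compile(r"(?=(HE..H))")


def find_hexxh(sequence):
    """Return (position, matched_segment) for each HEXXH occurrence.

    Positions are 1-based indices of the first histidine in the
    one-letter amino acid sequence.
    """
    sequence = sequence.upper()
    return [
        (match.start() + 1, match.group(1))
        for match in HEXXH_PATTERN.finditer(sequence)
    ]
```

A sequence hit is best treated as a prompt to inspect the 3D context, i.e., whether the two histidines sit one helical turn apart and coordinate a metal together with a nearby glutamate.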

Fig. 6 (continued) matched motif, a zonal region can be explored for
other conserved elements among homologs. Shown here in addition to
the metal binding residues is a second absolutely conserved glutamic
acid between the two histidines and one amino acid toward the
C-terminus from the first histidine. Also shown is a highly conserved
tyrosine that is implicated in peptidase activity in the botulinum
toxin zinc peptidase (pdb 1e1h, [77]). (d) A schematic is shown
depicting the conserved elements of a larger motif discovered by
examining the superposition of the metal binding motif matches

4 Conclusion

Structural homology-based annotation can be quite powerful, for the
reasons described in the introduction, but it is not yet a panacea
for the grand function annotation challenge. Although structural
homology can reveal relationships between distant sequence clusters,
there is no fail-safe way to determine that the correct functional
assignment has been made. A useful set of benchmarks might include:
high coverage (>60% of the protein or >60% of a given domain); near
exact conservation (in 3D arrangement and sequence) of functional
residues; and primary literature for the chosen structure homolog
that includes direct biochemical and structural evidence of function
(e.g., substrate-bound complex of an inactivated protein, interacting
complex for binding proteins). An important caveat: there are often
compensatory mutations that recapitulate function via a residue that
might be quite distant along the sequence but proximal in space. High
sequence identity (>30%) as determined by structural alignment can
also be a strong indicator of homology.
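
The benchmarks suggested above can be recorded as an explicit checklist for each candidate homolog. The sketch below only encodes the thresholds named in the text (coverage above 60%, sequence identity above 30%); how the individual checks should be weighed against each other is left to the analyst, and the function name and dictionary keys are hypothetical.

```python
def homology_benchmarks(coverage_pct, identity_pct,
                        functional_residues_conserved,
                        has_direct_evidence):
    """Report which suggested benchmarks a candidate homolog meets.

    coverage_pct: percent of the protein (or of a given domain) covered
        by the structural alignment.
    identity_pct: percent sequence identity from the structural alignment.
    functional_residues_conserved: True if functional residues are
        conserved in 3D arrangement and sequence.
    has_direct_evidence: True if primary literature for the homolog gives
        direct biochemical and structural evidence of function.
    """
    return {
        "high coverage (>60%)": coverage_pct > 60.0,
        "functional residues conserved": bool(functional_residues_conserved),
        "direct evidence in primary literature": bool(has_direct_evidence),
        "high sequence identity (>30%)": identity_pct > 30.0,
    }
```

Keeping the individual results, rather than collapsing them into a single pass or fail value, makes it easier to see which line of evidence an inferred annotation actually rests on.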
One major advantage of analyzing function annotation
through structure homology is that there is a much greater likeli-
hood of having associated primary literature relevant to the given
annotation. Historically, a structure determination effort followed
extensive characterization of protein function. In the genomics era,
a structural genomics approach has emerged in which gene discovery
for predicted proteins of unknown function motivates a structure
determination effort. With this application of structural genomics, the
structure is expected to reveal function. Unfortunately, many data-
base entries of structures determined by structural genomics cen-
ters do not have functional annotation or simply borrow the Pfam
annotation, and Pfam descriptions can be misleading. The Pfam
descriptions often sound like functional annotations, but they are
really just the name given to the archetypal representative for a
protein sequence family. For example, the Pfam name for the
LsrG protein (pdb 2gff) is antibiotic biosynthesis monooxygenase,
which has no obvious relationship to the known function of LsrG,
recycling phospho-AI-2 in the AI-2 quorum sensing pathway. The
lack of annotation or the use of misleading annotation from Pfam
for structural genomics center depositions can complicate or con-
found structural homology-based function analysis, but it also
represents an opportunity to revisit and improve the annotations
for many of these entries.

5 Notes

1. PDB molecular structure entries are depositor- and PDB-curated
coordinate files that contain records for all experimentally
determined atoms: their atom type (e.g., carbon); x, y, and
z positions; B-factor (a quantitative field meant to capture the
measure of thermal mobility of an atom); occupancy; the amino
acid type the atom belongs to; and the amino acid sequence
number of the amino acid. PDB entries include extensive metadata
in the entry header describing the experiment(s) that led
to the deposited entry (e.g., means of expression and isolation,
crystallization conditions, data processing statistics, structure
refinement statistics, authors, and primary citations). A PDB
entry will also typically include metadata relating to quality
assessments of the coordinate set included in the entry, secondary
structure assignments, and information about ligands or
other heteroatoms that are not part of the biopolymer(s). Post
structure determination quality assessments are based on comparison
to expected chemical geometries such as bond lengths,
bond angles, and torsion angles, and on the histogram of the distribution
Functional Annotation from Structural Homology 247

of these measures across the whole coordinate set. Quality
assessment annotations may be from the authors, from the
database curators, or both.
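
The per-atom fields listed in Note 1 occupy fixed columns in the ATOM and HETATM records of the legacy PDB text format. The following is a minimal sketch of reading those fields, assuming the standard column layout. The names AtomRecord, parse_atom_line, and format_atom_line are hypothetical helpers, and the sample atom uses invented coordinates rather than values from a deposited entry.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    serial: int
    name: str         # atom name, e.g. "CA"
    res_name: str     # amino acid type the atom belongs to
    chain_id: str
    res_seq: int      # amino acid sequence number
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str      # atom type, e.g. "C" for carbon


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line by its fixed columns (0-based slices)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return AtomRecord(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


def format_atom_line(rec: AtomRecord) -> str:
    """Write a record in the same column layout (used here to build a sample)."""
    name = rec.name if len(rec.name) == 4 else " " + rec.name
    return (
        f"{'ATOM':<6}{rec.serial:>5} {name:<4} {rec.res_name:>3} "
        f"{rec.chain_id:1}{rec.res_seq:>4}    "
        f"{rec.x:8.3f}{rec.y:8.3f}{rec.z:8.3f}"
        f"{rec.occupancy:6.2f}{rec.b_factor:6.2f}{'':10}{rec.element:>2}"
    )


# Invented example atom: the alpha carbon of a histidine in chain A.
example = AtomRecord(1, "CA", "HIS", "A", 47, 12.345, -6.789, 0.5, 1.0, 15.25, "C")
SAMPLE = format_atom_line(example)
parsed = parse_atom_line(SAMPLE)
```

Header metadata and quality annotations live in other record types and are not handled by this sketch.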
2. Initial entry returns can be refined by clickable subcategories
that shorten the list of returned PDB entries. There is also an
advanced search utility that provides searches based on many
different search criteria that can be strung together with logical
AND operators.
3. The PDB recently changed their utilities for finding PDB
entries with similar sequences or structures. The new links on
the structure summary page appear to return a comprehensive
list of sequence and structure matches rather than representa-
tive matches from pre-clustered groups. The previous
Sequence Similarity tab provided a list of sequence-similar
clusters for PDB entries with sequences related to a given
query. The listed clusters were provided at a range of
percent sequence similarity cutoffs, from 100% to 30%. The
previous Structural Similarity tab provided a list of representa-
tive PDB entries for clusters of similar structures generated
with jFATCAT-rigid [35]. The representative structure of the
cluster has similar structure to the PDB entry searched and
clusters are determined by a 40% sequence identity cutoff to
all other sequences in the PDB.
4. Like the SAS link provided in PDBSum results (SAS is
described in greater detail below), BLASTp can also be used
to look for sequence similar homologs to a PDB entry by using
the sequence in a PDB entry as the query sequence and the
whole PDB as the search database. This might be accomplished
more quickly by simply clicking on the “Find similar proteins
by: Sequence” link for a given entry in the PDB, but the
BLASTp result may be more immediately pairable with other
analysis tools. The BLASTp web interface [34] has a number of
entry fields to parameterize searches; for most purposes it is
sufficient to enter the accession code matching the query sequence,
the PDB as the search set, and BLASTp as the program selection.
Although it is not obvious, the other fields are optional.
5. The PDB recently added a structure search feature in their
advanced search utility that is based on the BioZernike
method [80].
6. Over short sequences, aligned fragment pair methods can also
match coincidentally aligned fragments that have no evolution-
ary connection, so alignments that do not cover a significant
fraction of aligned chains, domains within chains, or subdo-
mains should be carefully examined or ignored.
7. FASTA and BLASTp searches against the PDB give slightly
different results and can be considered complementary for
purposes of finding structural information for closely related
proteins.
8. Also, under the “File” tab, there is an option for opening files
stored on a file system accessible to the computer running the
Chimera program. Selecting the “Open. . .” option opens a file
system navigation dialog box. Selecting a file containing molec-
ular coordinates in PDB format and clicking open will display
the structure of the molecule(s) in the selected file. The user
can also open files using the command line command “open.”
Many molecules from different PDB entries or other files can
be displayed simultaneously, which facilitates comparison of
molecules. All three options for opening PDB files are further
described in the Chimera searchable help utility under the help
tab on the menu bar.
9. When PDB files are first opened in Chimera a default display is
rendered by Chimera, typically a ribbon diagram for each of the
protein chains. Heteroatoms (e.g., ligands) other than water
are displayed along with a stick representation for amino acids
interacting with ligands. Non-interacting amino acids are not
displayed by default. Nucleotide chains are shown with ribbon
backbone and nucleotide bases shown in slab style.
10. Selection in Chimera. Chimera provides a large variety of selec-
tion tools that facilitate the selective display and analysis of
molecular regions and can be used to examine structural
homology. In general, Chimera provides for selection of
molecular elements by: picking from the graphics window,
using one or more of the options under the “Select” menu,
using the “select” command, using selection features provided
with other Chimera tools like the Model Panel, Sequence tool,
or CASTp tool. The selection options provided by Chimera are
too numerous to describe in detail here, but a small handful of
selection strategies along with simple binary logic operators are
sufficient to provide a powerful means for highlighting and
comparing structural features.
An especially useful and versatile subset of select tools
includes: graphical picking of atoms or residues; “Select” drop
down menu options “Chain,” “Zone,” “Selection mode,” and
“Invert”; use of the “Sequence” tool; and use of the CASTp
tool. The “atom specifier” is also a powerful select utility for
users familiar with the Chimera atom specifier syntax. In short,
Chimera uses a hierarchical description for molecules from
PDB entries or files starting with “model,” which corresponds
to all of the molecular contents of a given entry; “chain,” which
follows the chain identifier(s) provided in an entry and generally
corresponds to contiguous macromolecular polymers; “residue”;
and “atom.” Models are assigned a number in the
order that they are opened in the display window, starting with
zero; chains are assigned the letter code provided in the input
file or entry; residues are assigned the number provided in an
entry; and atoms are assigned the name provided in the input
pdb file.
Graphical selection in combination with zone selection
enables the local spatial expansion of selected molecular fea-
tures which in turn is helpful for: tailoring displayed features,
identifying structurally homologous residues, or examining
interactions. Graphical selection of an atom or bond is achieved
by clicking while holding the ctrl key; shift+ctrl click appends
to or subtracts from the selection; ctrl click in empty space
clears a selection. If a ribbon is displayed for a chain, a residue
within the chain can be graphically selected by ctrl+click on a
section of the ribbon. Graphical selection of an atom or atoms
followed by selection of a zone with a small zone radius (e.g.,
<1.0 Å) conveniently expands an atom selection into a residue
selection when the “Select all atoms/bonds of any residue in the
selected zone” option is chosen in the “Select Zone
Parameters” dialog window. A small zone radius selection is
also useful for identifying structurally homologous residues for
superimposed structurally homologous molecules. Graphical
selection of atoms or residues in combination with zone selec-
tion with a 3–3.5 Å zone radius can be used to find residues
that interact with selected residues or atoms, because
non-bonding interactions, such as salt bridges, h-bonds, or
van der Waals interactions, are typically in the 2.5–3.5 Å range.
Chain selection and chain selection in combination with
“Invert,” either selected model or all models, helps to rapidly
simplify the displayed content to focus on features of interest.
There is often redundant information in PDB entries so when a
model is first opened in the display window, there are multiple
copies of chains or molecular assemblies with identical or near
identical sequences—this is a consequence of how crystallogra-
phy and NMR results are entered in the PDB. By selecting
redundant chains, they can be easily undisplayed. Selecting a
chain of interest, all other chains can be quickly undisplayed or
deleted by inverting the selection. If multiple models are open,
selecting the “Invert (selected models)” option after selecting a
chain of interest selects all other chains within the same model,
which can be used to eliminate the unwanted or redundant
chains within the same model. Choosing instead the “Invert
(all models)” option can be used to eliminate all other displayed
chains in all models. Inverting the selection again returns the
selection to the initially selected chain of interest.
Select mode options help to rapidly build up or pare down
the selected molecular components. The available select mode
options are replace, append, intersect, and subtract. Replace is
the default mode. Other useful selection tools that are not
under the “Select” drop down are provided via the model
panel, the sequence tool, and the CASTp tool. The model
panel and sequence tool can be launched from the “Favorites”
pull down menu. The model panel lists open models and
displayed surfaces. Individual listed items can be highlighted
by mouse click then selected with the select button on the
model panel. The sequence tool allows the user to choose
from a list of chains for all of the chains currently open then
opens a sequence window for each of the chains chosen. A
sequence window will display the linear one-letter code
sequence for a given chain. A residue, or segment of residues,
can be selected by dragging the mouse over the given residue
or segment. Other residues can be added to the existing selection
by shift+drag over other parts of the sequence. Selections
from the sequence window are reflected in the display window
and vice versa. The CASTp tool is automatically launched when
CASTp .poc files are opened through the Chimera File→Open
utility. When a CASTp file is opened in Chimera, the
corresponding coordinate file is opened and chains are dis-
played as ribbons. A CASTp tool is also launched which has a
sortable tabular list of CASTp identified pockets and several
selectable options. Clicking on one of the listed pockets will
display the surface of that pocket over the associated ribbon of
the corresponding chain and will select all of the associated
atoms, which can be easily expanded to the associated residues
by use of zone selection with a small zone radius.
Having selected regions of interest, a variety of actions can
be applied to the selected region with the utilities provided
under the “Actions” dropdown. Actions→Atom/Bond→show
(or hide), Actions→Ribbons→show (or hide),
Actions→Surface→show (or hide), and Actions→color tend
to be frequently used for highlighting and comparing regions
of interest.
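
The zone selection logic described in Note 10 can be illustrated outside Chimera. The sketch below is not Chimera's implementation; it is a hypothetical helper, residues_in_zone, that returns every residue with at least one atom within a given radius of any selected atom (the selected atoms' own residues are included). The toy coordinates and residue numbers are invented for illustration.

```python
import math
from typing import Iterable, List, Set, Tuple

# (chain ID, residue number, atom name, (x, y, z) in angstroms)
Atom = Tuple[str, int, str, Tuple[float, float, float]]


def residues_in_zone(atoms: Iterable[Atom], selected: List[Atom],
                     radius: float) -> Set[Tuple[str, int]]:
    """Residues having any atom within `radius` of any selected atom."""
    hits: Set[Tuple[str, int]] = set()
    for chain, resnum, _name, xyz in atoms:
        if any(math.dist(xyz, sel[3]) <= radius for sel in selected):
            hits.add((chain, resnum))
    return hits


# Invented toy model: a histidine side-chain nitrogen and its neighbors.
his_ne2 = ("A", 47, "NE2", (0.0, 0.0, 0.0))
toy_atoms = [
    his_ne2,
    ("A", 47, "CA", (-3.0, 2.0, 0.0)),
    ("A", 51, "OE1", (2.8, 0.0, 0.0)),
    ("A", 51, "CA", (5.0, 3.0, 0.0)),
    ("A", 93, "OH", (6.0, 0.0, 0.0)),
]

# A 3.5 angstrom zone picks up likely interaction partners.
contacts = residues_in_zone(toy_atoms, [his_ne2], 3.5)
# A very small zone only expands the picked atom to its own residue.
own_residue = residues_in_zone(toy_atoms, [his_ne2], 0.5)
```

With these toy coordinates the 3.5 Å zone returns residues 47 and 51, while the 0.5 Å zone returns only residue 47, mirroring the two uses of zone selection described above.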
11. Matchmaker in Chimera. Chimera provides a structure com-
parison tool called MatchMaker which can be used to superim-
pose structural homologs and to produce structure-based
sequence alignments which enable detailed analysis of local
and global structural homology and investigation of the relationship
between structural and sequence homologies. MatchMaker
can produce pairwise superpositions and alignments or
multi-structure superpositions and alignments. MatchMaker is
launched from the Tools→Structure Comparison drop down.
To run MatchMaker, there must be at least two models
(or coordinate files) open. The MatchMaker tool window dis-
plays side-by-side lists of open models, one labeled Reference
Structure and one labeled Structure(s) to match. To run
MatchMaker, one reference structure should be selected and
one or more structures to match. If only two models are open,
choosing either as the reference and the other as the structure
to match and using all the default options usually gives a useful
result if the structures have regions of significant structural
homology. If more than two models are open, choosing the
radio button “Specific chain in reference structure with best-
aligning chain in match structure” is usually more useful than
the default option. This option ensures that all the structural
homologs in the structures to match align on a single reference
structure so that they can all be compared to each other in the
same frame of reference. The “After superposition, compute
structure-based multiple sequence alignment” is a very useful
option (not checked by default)—with this option invoked,
MatchMaker will launch a sequence window with the multi-
sequence alignment displayed. The resultant multisequence
alignment is interactive and can be used to select aligned resi-
dues across all of the models represented in the sequence
window. This is very useful for highlighting highly conserved
residues across the multi-sequence alignment for example.
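
Once a structure-based multiple sequence alignment has been exported, absolutely conserved positions can also be found programmatically. The following is a minimal sketch, assuming the alignment is available as equal-length strings with "-" for gaps; conserved_columns is a hypothetical helper and the three toy sequences are invented.

```python
from typing import List


def conserved_columns(alignment: List[str], gap: str = "-") -> List[int]:
    """Return 0-based indices of gap-free columns with a single residue type."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        if gap not in column and len(set(column)) == 1:
            conserved.append(index)
    return conserved


# Invented toy alignment containing an H-E-x-x-H style pattern.
toy_alignment = [
    "AHELGHT",
    "SHEFAHS",
    "-HEYGHA",
]
columns = conserved_columns(toy_alignment)
```

Here the conserved columns are the two histidines and the glutamic acid; the column indices refer to the alignment, so they still need to be mapped back to the residue numbering of each chain.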
12. Match in Chimera. The Chimera “match” command allows a
user to superimpose models based on user defined lists of
matching atoms. This enables the superposition of molecular
structures by homologous motifs, such as motifs returned by
RASMOT 3D PRO—motifs may be as small as three atoms.
Match can also be used to impose a structural alignment calcu-
lated by algorithms other than MatchMaker, although this can
be quite tedious since the matching list of atoms for each of the
models has to be entered on the command line. Match is run
from the command line, which is launched via the “Favorites”
drop down. Match is run by entering “match” atom-list-1
atom-list-2 at the command line, where atom-list-1 or 2 is
specified using the Chimera atom specifier syntax. The
sequence order of atoms in each list indicates the intended
matches. For example, models in PDB entries 1a3f and 4aup
can be superimposed based on conserved motifs by entering:
match #0:47.c,51.c,93.c@ca #1:147.a,151.a,169.a@ca. This
command matches the α-carbons in residues 47, 51, and
93 from the C chain in model 0, with the α-carbons in residues
147, 151, and 169 from the A chain in model 1. This is a superposition
based on three atoms, with model 0, chain C, residue
47 α-carbon superimposed with model 1, chain A, residue
147 α-carbon and so on pairwise in the order of atoms listed.
The command match #0:47.c,51.c,93.c #1:147.a,151.a,169.a
would also work if model 0 is from entry 1a3f and model 1 is
from 4aup because the residues listed are exactly matching
residue types so the number and sort order of atoms listed
are inferred from the residue lists and they match.
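
Because typing long paired atom lists by hand is tedious, the command string can be assembled from paired residue lists. The sketch below only builds text in the atom specifier form shown in Note 12; it does not call Chimera. The helper names atom_list and match_command are hypothetical.

```python
from typing import Sequence, Tuple

ResidueSpec = Tuple[int, str]  # (residue number, chain ID)


def atom_list(model: int, residues: Sequence[ResidueSpec], atom: str = "ca") -> str:
    """Format one atom list, e.g. '#0:47.c,51.c,93.c@ca'."""
    parts = ",".join(f"{number}.{chain}" for number, chain in residues)
    return f"#{model}:{parts}@{atom}"


def match_command(first_model: int, first: Sequence[ResidueSpec],
                  second_model: int, second: Sequence[ResidueSpec],
                  atom: str = "ca") -> str:
    """Build a 'match' command; the order of each list defines the pairing."""
    if len(first) != len(second):
        raise ValueError("both lists must contain the same number of residues")
    return (f"match {atom_list(first_model, first, atom)} "
            f"{atom_list(second_model, second, atom)}")


command = match_command(
    0, [(47, "c"), (51, "c"), (93, "c")],
    1, [(147, "a"), (151, "a"), (169, "a")],
)
```

The resulting string reproduces the three-atom example given in the note and can be pasted into the Chimera command line.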
13. The initial motif search in RASMOT 3D PRO is sequence
independent and is based on the Cα-Cβ geometries, so short
contiguous motifs, like a short segment of α-helix or β-sheet
will match to many portions of many PDB entries and will not
be very meaningful. Glycine residues do not contribute to
motif matches.
14. The default options are reasonable choices but for finding
functional homologs, selecting “identical residues” in the pro-
tein selection section, describing the types of residues to search
for with the input motif, is probably the more judicious choice.
To get a more comprehensive set of matches, a user may opt to
search against a non-redundant PDB chain set other than the
default. Choosing “non-identical” would yield the most com-
prehensive search, but could cause long execution times. It is
also useful to provide an e-mail address to receive notification
of job completion, as compute times can be long, depending
on the choice of input parameters.
15. There are many routes to identifying PDB entries that contain
domains or chains with structural similarity to a given chain. If
there is already a PDB entry with structure for a given chain or
domain of interest, the quickest way to find related structures is
to navigate to “Find similar proteins by: Sequence | Structure”
in the “Macromolecules” section of the “Structure Summary”
tab for the PDB entry. There will be a separate section for each
entity ID corresponding to each unique chain sequence. The
sequence or structure links will return a list of related PDB
entries. Unfortunately, the list does not contain a similarity
score and cannot be sorted on the basis of similarity. A BLASTp
search using the primary sequence for a given chain or domain of
interest as the query, Protein Data Bank proteins as the search
set, and BLASTp as the program returns a ranked list sorted on
sequence similarity score by default. By navigating to the UniProt
entry for a protein or chain of interest (e.g., UniProt accession
# P01887 for β2 microglobulin), a user can find a list of PDB
entries under the structure tab that match the chain by
protein sequence.
16. The CATH Classification page provides an interactive sche-
matic that is a 2D representation of clusters in the structural
neighborhood. Clustering is assigned by sequence similarity.
To identify deeply conserved features of a superfamily, diverse
representative sequences need to be compared, so select PDB
ids for superfamily members that are not clustered together
with high sequence similarity. If there are >6 superfamily
members with <35% sequence identity, their multisequence
alignment is much more likely to give you insight into deep
conservation than a smaller set of sequences or a set of
sequences that are more closely related.
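
The selection of diverse superfamily members in Note 16 can be sketched as a greedy filter on pairwise identity. This is an illustration under stated assumptions, not the CATH clustering procedure: identity is computed over alignment columns where neither sequence has a gap, the 35% cutoff follows the note, and the entry names and sequences are invented.

```python
from typing import Dict, List


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identical residues over columns that are ungapped in both."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def pick_diverse(aligned: Dict[str, str], max_identity: float = 35.0) -> List[str]:
    """Greedily keep entries that stay below the cutoff to all kept entries."""
    chosen: List[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[kept]) < max_identity for kept in chosen):
            chosen.append(name)
    return chosen


# Invented toy alignment; names are placeholders, not PDB ids.
toy = {
    "entA": "ACDEFGHIKL",
    "entB": "ACDEFGHIKV",  # 90% identical to entA
    "entC": "AWYTSGQNMP",  # 20% identical to entA
    "entD": "AWYTSGQNMR",  # 90% identical to entC
}
diverse = pick_diverse(toy)
```

A greedy pass depends on input order, so in practice the list would be ordered by whatever criterion matters most (for example, resolution or available literature) before filtering.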
17. If highly conserved residues cluster together, particularly if
they interact to form a hydrogen bonding network, or if they
reside within a pocket they are likely important for function.
CASTp [48] provides a nice complement to Chimera for iden-
tifying pockets and displaying pockets next to conserved resi-
dues within Chimera. CASTp generates Chimera input files
and Chimera has CASTp utilities that enable easy selection
of residues that make up pockets. Combined with other select
utilities in Chimera (see Note 10), one can easily identify
residues that are both highly conserved and contribute to
pockets on a molecular surface.
18. This is somewhat subjective. A motif can be identified by
multisequence alignment of large sequence families or by
structure-based multisequence alignment, as described in the
identifying deeply conserved motifs section. A motif can also
be posited by an experienced structural biologist based on, for
example, general similarity to catalytic clusters. Motifs with
histidine in hydrogen bonding networks and residing near
the base of pockets are good candidates for catalytic motifs.
19. Restricting PDB matches to those with exactly the same motif
amino acids and <0.2 Å rmsd seems to return matches with extended
similarity beyond the query motif. Less stringent matches are often
coincidental, especially for small motifs.
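
The stringency filter in Note 19 amounts to two checks: identical residue types and a small rmsd over the paired motif atoms. The sketch below assumes the two coordinate sets are already superimposed and listed in matching order; rmsd and is_stringent_match are hypothetical helpers, and the coordinates are invented.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between paired, pre-superimposed points."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def is_stringent_match(query_types: Sequence[str], hit_types: Sequence[str],
                       query_xyz: Sequence[Point], hit_xyz: Sequence[Point],
                       cutoff: float = 0.2) -> bool:
    """True when residue types are identical and rmsd is below the cutoff."""
    return list(query_types) == list(hit_types) and rmsd(query_xyz, hit_xyz) < cutoff


# Invented three-residue motif and a closely superimposed hit.
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 5.0, 0.0)]
hit = [(0.1, 0.0, 0.0), (3.8, 0.1, 0.0), (0.0, 5.0, 0.1)]
deviation = rmsd(query, hit)
accepted = is_stringent_match(["HIS", "GLU", "HIS"], ["HIS", "GLU", "HIS"], query, hit)
rejected = is_stringent_match(["HIS", "GLU", "HIS"], ["HIS", "ASP", "HIS"], query, hit)
```

In this toy case each atom is displaced by 0.1 Å, so the rmsd is about 0.1 Å and the hit passes only when the residue types also agree.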

Acknowledgments

Molecular graphics and analyses were performed with UCSF Chimera,
developed by the Resource for Biocomputing, Visualization,
and Informatics at the University of California, San Francisco, with
support from NIH P41-GM103311.
This work was performed under the auspices of the
U.S. Department of Energy by Lawrence Livermore National Lab-
oratory under Contract DE-AC52-07NA27344.

References
1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2007)
GenBank. Nucleic Acids Res 36(Suppl_1):D25–D30
2. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L,
Silverstein MC, Ma’ayan A (2018) Massive mining of publicly available
RNA-seq data from human and mouse. Nat Commun 9(1):1–10
3. Omenn GS, Lane L, Overall CM, Corrales FJ, Schwenk JM, Paik YK,
Van Eyk JE, Liu S, Snyder M, Baker MS, Deutsch EW (2018) Progress on
identifying and characterizing the human proteome: 2018 metrics from the
HUPO human proteome project. J Proteome Res 17(12):4031–4041
4. McCool EN, Lubeckyj RA, Shen X, Chen D, Kou Q, Liu X, Sun L (2018)
Deep top-down proteomics using capillary zone electrophoresis-tandem
mass spectrometry: identification of 5700 proteoforms from the
Escherichia coli proteome. Anal Chem 90(9):5529–5533
5. Feussner K, Feussner I (2019) Comprehensive LC-MS-based metabolite
fingerprinting
approach for plant and fungal-derived samples. In: High-throughput
metabolomics. Humana, New York, NY, pp 167–185
6. Lake BB, Chen S, Sos BC, Fan J, Kaeser GE, Yung YC, Duong TE, Gao D,
Chun J, Kharchenko PV, Zhang K (2018) Integrative single-cell analysis
of transcriptional and epigenetic states in the human adult brain.
Nat Biotechnol 36(1):70–80
7. Sandberg R (2014) Entering the era of single-cell transcriptomics in
biology and medicine. Nat Methods 11(1):22–24
8. DOE US (2019) Breaking the bottleneck of genomes: understanding gene
function across taxa workshop report, DOE/SC-0199. U.S. Department of
Energy Office of Science, Washington, DC.
https://genomicscience.energy.gov/genefunction/. Accessed 26 Feb 2020
9. Sivashankari S, Shanmughavel P (2006) Functional annotation of
hypothetical proteins–a review. Bioinformation 1(8):335
10. Hutchison CA, Chuang RY, Noskov VN, Assad-Garcia N, Deerinck TJ,
Ellisman MH, Gill J, Kannan K, Karas BJ, Ma L, Pelletier JF (2016)
Design and synthesis of a minimal bacterial genome. Science 351:6280
11. Richarme G, Liu C, Mihoub M, Abdallah J, Leger T, Joly N, Liebart JC,
Jurkunas UV, Nadal M, Bouloc P, Dairou J (2017) Guanine glycation repair
by DJ-1/Park7 and its bacterial homologs. Science 357(6347):208–211
12. UniProt Consortium (2018) UniProt: a worldwide hub of protein
knowledge. Nucleic Acids Res 47(D1):D506–D515
13. UniProt consortium (2020) UniProt UniProtKB/Swiss-Prot UniProt
release 2020_01. https://www.uniprot.org/statistics/Swiss-Prot.
Accessed 26 Feb 2020
14. Giordanetto F, Knerr L, Nordberg P, Pettersen D, Selmi N, Beisel HG,
de la Motte H, Månsson Å, Dahlstrom M, Broddefalk J, Saarinen G (2018)
Design of Selective sPLA2-X inhibitor
()-2-{2-[carbamoyl-6-(trifluoromethoxy)-1 H-indol-1-yl]pyridine-2-yl}
propanoic acid. ACS Med Chem Lett 9(7):600–605
15. Sekar K, Sekharudu C, Tsai MD, Sundaralingam M (1998) 1.72 Å
resolution refinement of the trigonal form of bovine pancreatic
phospholipase A2. Acta Crystallogr D Biol Crystallogr 54(3):342–346
16. Segelke BW, Nguyen D, Chee R, Xuong NH, Dennis EA (1998) Structures
of two novel crystal forms of Naja naja naja phospholipase A2 lacking
Ca2+ reveal trimeric packing. J Mol Biol 279(1):223–232
17. Scott DL, Otwinowski Z, Gelb MH, Sigler PB (1990) Crystal structure
of bee-venom phospholipase A2 in a complex with a transition-state
analogue. Science 250(4987):1563–1566
18. Cavazzini D, Meschi F, Corsini R, Bolchi A, Rossi GL, Einsle O,
Ottonello S (2013) Autoproteolytic activation of a symbiosis-regulated
truffle phospholipase A2. J Biol Chem 288(3):1533–1547
19. Matoba Y, Sugiyama M (2003) Atomic resolution structure of
prokaryotic phospholipase A2: analysis of internal motion and
implication for a catalytic mechanism. Proteins 51(3):453–469
20. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC,
Ferrin TE (2004) UCSF chimera—a visualization system for exploratory
research and analysis. J Comput Chem 25(13):1605–1612.
https://doi.org/10.1002/jcc.20084
21. Scott DL, Sigler PB (1994) Structure and catalytic mechanism of
secretory phospholipases A2. Adv Protein Chem 45:53–88
22. Noeske J, Wasserman MR, Terry DS, Altman RB, Blanchard SC, Cate JH
(2015) High-resolution structure of the Escherichia coli ribosome.
Nat Struct Mol Biol 22(4):336–341
23. Locher KP (2016) Mechanistic diversity in ATP-binding cassette (ABC)
transporters. Nat Struct Mol Biol 23(6):487
24. Oldham ML, Khare D, Quiocho FA, Davidson AL, Chen J (2007) Crystal
structure of a catalytic intermediate of the maltose transporter.
Nature 450(7169):515
25. Hvorup RN, Goetz BA, Niederer M, Hollenstein K, Perozo E, Locher KP
(2007) Asymmetry in the structure of the ABC transporter-binding protein
complex BtuCD-BtuF. Science 317(5843):1387–1390
26. Hutchinson EG, Thornton JM (1990) HERA—a program to draw schematic
diagrams of protein secondary structures. Proteins 8(3):203–212
27. Laskowski RA, Jabłońska J, Pravda L, Vařeková RS, Thornton JM (2018)
PDBsum: structural summaries of PDB entries. Protein Sci 27(1):129–134
28. Lewinson O, Livnat-Levanon N (2017) Mechanism of action of ABC
importers: conservation, divergence, and physiological adaptations.
J Mol Biol 429(5):606–619
29. RCSB (2000) Protein Data Bank. http://www.rcsb.org/. Accessed 26 Feb
2020
30. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res
28:235–242
31. wwPDB (2003) Worldwide Protein Data Bank. http://www.wwpdb.org/.
Accessed 26 Feb 2020
32. Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide
protein data bank. Nat Struct Mol Biol 10(12):980
33. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local
alignment search tool. J Mol Biol 215(3):403–410
34. NIH, National Center for Biotechnology Information, U.S. National
Library of Medicine (1990) BLAST >> blastp suite.
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins. Accessed 26 Feb
2020
35. Ye Y, Godzik A (2003) Flexible structure alignment by chaining
aligned fragment pairs allowing twists. Bioinformatics 19(Suppl 2):
ii246–ii255
36. Godzik Lab (2020) FATCAT.
http://fatcat.godziklab.org/fatcat-cgi/cgi/fatcat.pl?-func=search.
Accessed 26 Feb 2020
37. EMBL-EBI (2013) PDBsum pictorial database of 3D structures in the
protein databank.
https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=index.html.
Accessed 26 Feb 2020
38. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC,
Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL (2019)
The Pfam protein families database in 2019. Nucleic Acids Res 47(D1):
D427–D432
39. EMBL-EBI (2018) Pfam 32.0. https://pfam.xfam.org/. Accessed 26 Feb
2020
40. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D,
Bork P, Das U, Daugherty L, Duquenne L, Finn RD (2008) InterPro: the
integrative protein signature database. Nucleic Acids Res 37(Suppl 1):
D211–D215
41. Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, Orengo CA,
Sillitoe I (2017) CATH: an expanded resource to predict protein function
through structure and sequence. Nucleic Acids Res 45(D1):D289–D295
42. CATH (2020) CATH/Gene3D v4.2. https://www.cathdb.info/. Accessed
26 Feb 2020
43. Fox NK, Brenner SE, Chandonia JM (2014) SCOPe: structural
classification of proteins—extended, integrating SCOP and ASTRAL data
and classification of new structures. Nucleic Acids Res 42(D1):
D304–D309
44. Murzin AG, Brenner SE, Hubbard TJP, Chothia C (1995) SCOP: a
structural classification of proteins database for the investigation
of sequences and structures. J Mol Biol 247:536–540
45. Milburn D, Laskowski RA, Thornton JM (1998) Sequences annotated by
structure: a tool to facilitate the use of structural information in
sequence analysis. Protein Eng 11(10):855–859
46. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity
searches. Science 227(4693):1435–1441
47. Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, Ben-Tal N
(2016) ConSurf 2016: an improved methodology to estimate and visualize
evolutionary conservation in macromolecules. Nucleic Acids Res 44(W1):
W344–W350
48. Tian W, Chen C, Lei X, Zhao J, Liang J (2018) CASTp 3.0: computed
atlas of surface topography of proteins. Nucleic Acids Res 46(W1):
W363–W367
49. RASMOT-3D PRO (2009) Recursive Automatic Search of MOTif in 3D
structures of PROteins. http://biodev.cea.fr/rasmot3d/. Accessed 26 Feb
2020
50. Debret G, Martel A, Cuniasse P (2009) RASMOT-3D PRO: a 3D motif
search webserver. Nucleic Acids Res 37(Suppl 2):W459–W464
51. Zeng ZH, Castano AR, Segelke BW, Stura EA, Peterson PA, Wilson IA
(1997) Crystal structure of mouse CD1: an MHC-like fold with a large
hydrophobic binding groove. Science 277(5324):339–345
52. Fremont DH, Matsumura M, Stura EA, Peterson PA, Wilson IA (1992)
Crystal structures of two viral peptides in complex with murine MHC
class I H-2Kb. Science 257(5072):919–927
53. El-Etr SH, Margolis JJ, Monack D, Robison RA, Cohen M, Moore E,
Rasley A (2009) Francisella tularensis type a strains cause the rapid
encystment of Acanthamoeba castellanii and survive in amoebal cysts for
three weeks postinfection. Appl Environ Microbiol 75(23):7488–7500
54. Feld GK, El-Etr S, Corzett MH, Hunter MS, Belhocine K, Monack DM,
Frank M, Segelke BW, Rasley A (2014) Structure and function of REP34
implicates carboxypeptidase activity in Francisella tularensis host
cell invasion. J Biol Chem 289(44):30668–30679
55. PDB id: 3b2y, Joint Center for Structural Genomics (JCSG) (2007)
Crystal structure of metallopeptidase containing co-catalytic
metalloactive site (YP_563529.1) from Shewanella denitrificans OS217 at
1.74 Å resolution. https://doi.org/10.2210/pdb3B2Y/pdb
56. Otero A, Rodríguez de la Vega M, Tanco S, Lorenzo J, Avilés FX,
Reverter D (2012) The novel structure of a cytosolic M14
metallocarboxypeptidase (CCP) from Pseudomonas
aeruginosa: a model for mammalian CCPs. FASEB J 26(9):3754–3764
57. PDB id: 2omo, Osipiuk J, Evdokimova E, Kagan O, Savchenko A,
Edwards A, Joachimiak A, Midwest Center for Structural Genomics (MCSG)
(2007) Putative antibiotic biosynthesis monooxygenase from Nitrosomonas
europaea. DOI. https://doi.org/10.2210/pdb2OMO/pdb
58. PDB id: 2gff, de Carvalho-Kavanagh M, Schafer J, Lekin T, Toppani D,
Chain P, Lao V, Motin V, Garcia E, Segelke B (2007) Crystal structure of
Yersinia pestis LsrG. https://doi.org/10.2210/pdb2GFF/pdb
59. Marques JC, Lamosa P, Russell C, Ventura R, Maycock C,
Semmelhack MF, Miller ST, Xavier KB (2011) Processing the interspecies
quorum-sensing signal autoinducer-2 (AI-2) characterization of
phospho-(S)-4, 5-dihydroxy-2, 3-pentanedione isomerization by LsrG
protein. J Biol Chem 286(20):18331–18343
60. Lemieux MJ, Ference C, Cherney MM, Wang M, Garen C, James MN (2005)
The crystal structure of Rv0793, a hypothetical monooxygenase from
M. tuberculosis. J Struct Funct Genom 6(4):245–257
61. PDB id: 3f44, Joint Center for Structural Genomics (JCSG) (2008)
Crystal structure of putative monooxygenase (YP_193413.1) from
Lactobacillus acidophilus NCFM at 1.55 A resolution.
https://doi.org/10.2210/pdb3F44/pdb
62. PDB id: 3kkf, Joint Center for Structural Genomics (JCSG) (2009)
Crystal structure of putative antibiotic biosynthesis monooxygenase
(NP_810307.1) from Bacteroides thetaiotaomicron VPI-5482 at 1.30 Å
resolution. https://doi.org/10.2210/pdb3KKF/pdb
66. PDB id: 1q8b, Zhang R, Joachimiak A, Edwards A, Savchenko A,
Midwest Center for Structural Genomics (MCSG) (2003) Structural
genomics, protein YJCS. https://doi.org/10.2210/pdb1Q8B/pdb
67. PDB id: 1x7v, Sanders DA, Walker JR, Skarina T,
Gorodichtchenskaia E, Joachimiak A, Edwards A, Savchenko A, Midwest
Center for Structural Genomics (MCSG) (2004) Crystal structure of PA3566
from Pseudomonas aeruginosa. https://doi.org/10.2210/pdb1X7V/pdb
68. PDB id: 2fb0, Nocek B, Hatzos C, Abdullah J, Collart F, and
Joachimiak A, Midwest Center for Structural Genomics (MCSG) (2006)
Crystal structure of conserved protein of unknown function from
Bacteroides thetaiotaomicron VPI-5482 at 2.10 Å resolution, possible
oxidoreductase. https://doi.org/10.2210/pdb2FB0/pdb
69. PDB id: 2bbe, Chang C, Bigelow L, Joachimiak A, Midwest Center for
Structural Genomics (MCSG) (2005) Crystal structure of protein SO0527
from Shewanella oneidensis. https://doi.org/10.2210/pdb2BBE/pdb
70. PDB id: 4dpo, Agarwal R, Chamala S, Evans R, Gizzi A, Hillerich B,
Kar A, LaFleur J, Foti R, Siedel R, Zencheck W, Villigas G, Almo SC,
Swaminathan S, New York Structural Genomics Research Consortium
(NYSGRC) (2012) Crystal structure of a conserved protein MM_1583 from
Methanosarcina mazei Go1. https://doi.org/10.2210/pdb4DPO/pdb
71. Sciara G, Kendrew SG, Miele AE, Marsh NG, Federici L, Malatesta F,
Schimperna G, Savino C, Vallone B (2003) The structure of ActVA-Orf6, a
novel type of monooxygenase involved in actinorhodin biosynthesis.
EMBO J 22(2):205–215
63. PDB id: 3mcs, Joint Center for Structural 72. Wada, Shirouzu T, Terada M, Kamewari T,
Genomics (JCSG) (2010) Crystal structure of Park Y, Tame SY, Kuramitsu JR, Yokoyama S
putative monooxygenase (fn1347) from fuso- (2004) Crystal structure of the conserved
bacterium nucleatum subsp. Nucleatum ATCC hypothetical protein TT1380 from Thermus
25586 at 2.55 Å resolution. https://doi.org/ thermophilus HB8. Proteins 55(3):778–780
10.2210/pdb3MCS/pdb 73. Grocholski T, Koskiniemi H, Lindqvist Y,
64. PDB id: 3bm7, Joint Center for Structural M€ants€al€a P, Niemi J, Schneider G (2010) Crys-
Genomics (JCSG) (2007) Crystal structure of tal structure of the cofactor-independent
a putative antibiotic biosynthesis monooxygen- monooxygenase SnoaB from Streptomyces
ase (cc_2132) from Caulobacter crescentus nogalater: implications for the reaction mecha-
cb15 at 1.35 Å resolution. https://doi.org/ nism. Biochemistry 49(5):934–944
10.2210/pdb3BM7/pdb 74. Chim N, Iniguez A, Nguyen TQ, Goulding
65. PDB id: 1r6y, Adams MA, Jia Z, Montreal- CW (2010) Unusual diheme conformation of
Kingston Bacterial Structural Genomics Initia- the heme-degrading protein from Mycobacte-
tive (BSGI) (2003) Crystal structure of YgiN rium tuberculosis. J Mol Biol 395(3):595–608
from Escherichia coli. https://doi.org/10. 75. PDB id: 4fca, Tan K, Zhou M, Kwon K, Ander-
2210/pdb1R6Y/pdb son WF, Joachimiak A, Center for Structural
Genomics of Infectious Diseases (CSGID)
(2012) The crystal structure of a functionally
Functional Annotation from Structural Homology 257

unknown conserved protein from Bacillus noncanonical zinc protease activity. Proc Natl
anthracis str. Ames. https://doi.org/10. Acad Sci 101(18):6888–6893
2210/pdb4FCA/pdb 78. PDB id: 3u9w, Niegowski D, Thunnissen M,
76. PDB id: 4fgm, Vorobiev S, Su M, Tong T, Tholander F, Rinaldo-Matthis A, Muroya A,
Kohan E, Wang D, Everett JK, Acton TB, Haeggstrom J Z (2012) Structure of human
Montelione GT, Tong L, Hunt JF, Northeast leukotriene a4 hydrolase in complex with
Structural Genomics Consortium (NESGC) inhibitor sc57461a. https://doi.org/10.
(2012) Crystal structure of the aminopeptidase 2210/pdb3U9W/pdb
n family protein q5qty1 from Idiomarina loi- 79. Rawlings ND, Barrett AJ (1995) Evolutionary
hiensis, Northeast structural genomics consor- families of metallopeptidases. Methods Enzy-
tium target ilr60. https://doi.org/10.2210/ mol 248:183–228
pdb4FGM/pdb 80. Guzenko D, Burley SK, Duarte JM 2020 Real
77. Segelke B, Knapp M, Kadkhodayan S, time structural search of the Protein Data
Balhorn R, Rupp B (2004) Crystal structure Bank. PLoS computational biology, 16(7), p.
of Clostridium botulinum neurotoxin protease e1007970
in a product-bound state: evidence for
Chapter 12

Metabolic Modeling with MetaFlux


Mario Latendresse, Wai Kit Ong, and Peter D. Karp

Abstract
The MetaFlux software supports creating, executing, and solving quantitative metabolic flux models using
flux balance analysis (FBA). MetaFlux offers four modes of operation: (1) solving mode executes an FBA
model for an individual organism or for an organism community, (2) gene knockout mode executes an FBA
model with one or many gene knockouts, (3) development mode assists the user in creating and improving
FBA models, and (4) flux variability analysis mode generates a report of the robustness of an FBA model.
MetaFlux also solves dynamic FBA (dFBA) for both individual organisms and communities of organisms.
MetaFlux can be used in two different environments: on your local computer, which requires the installa-
tion of the Pathway Tools software, or through the web, which does not require installation of Pathway
Tools. On your local computer, MetaFlux offers all four modes of operation, whereas the web environment
provides only the solving mode.
Several visualization tools are available to analyze model solutions. The Cellular Overview tool graphi-
cally shows the reaction fluxes on an organism’s metabolic map once a model is solved. The Omics
Dashboard provides a hierarchical approach to visualizing reaction fluxes, organized by metabolic sub-
systems. For a community of organisms, plotting of accumulated biomasses and metabolites can be
performed using the Gnuplot tool.
In this chapter, we present eight methods using MetaFlux. Five solving mode methods illustrate execu-
tion of models for individual organisms and for organism communities. One method illustrates the gene
knockout mode. Two methods for the development mode illustrate steps for developing new metabolic
models.

Keywords Flux balance analysis, FBA, Solver, Gene knockout, Metabolic model, Steady-state,
Genome-scale model, Dynamic FBA, Community modeling, COBRA methods

1 Introduction

MetaFlux, a component of the Pathway Tools software [1], enables
creating and solving quantitative metabolic flux models, including
both steady-state and dynamic FBA (flux balance analysis) models.
A MetaFlux metabolic model is derived from a qualitative represen-
tation of the metabolic network of an organism that is stored in a
Pathway/Genome Database (PGDB). The PathoLogic component
of Pathway Tools creates such a PGDB from an organism’s anno-
tated genome, such as from one or more GenBank-format files.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_12, © Springer Science+Business Media, LLC, part of Springer Nature 2022


Thus, PathoLogic and MetaFlux together provide a path for creating
a metabolic flux model for an organism from its sequenced
genome. In this discussion, we assume that MetaFlux is applied to
an existing PGDB, and we use the EcoCyc PGDB for E. coli
included with the Pathway Tools installation for all the methods
presented.
MetaFlux has four modes:
1. Solving mode: execute a metabolic model for an individual
organism or a community of organisms using FBA or dynamic
FBA (dFBA) [2].
2. Knockout mode: execute a model with gene or reaction knock-
outs by selecting specific genes for knockout, or by knocking
out all genes one by one.
3. Development mode: MetaFlux assists the user in creating a
functional metabolic model by gap-filling a reaction network,
proposing additional nutrients and secretions, and determining
which biomass metabolites can be produced.
4. Flux variability analysis mode: Flux variability analysis (FVA)
[3] is performed on a given metabolic model, generating a
robustness report.
The methods presented in the following sections cover the first
three modes of MetaFlux.
MetaFlux has a graphical user interface (GUI) from which you
can solve a model, develop a model, or run gene knockouts. MetaFlux
can also be invoked via Python and Lisp APIs; for more
information, see http://brg.ai.sri.com/ptools/pythoncyc.html,
particularly the function run_fba described at
http://pythoncyc.readthedocs.io/en/latest/pythoncyc.html#pythoncyc.run_fba.
A MetaFlux metabolic model includes the following four com-
ponents: (1) a list of the metabolic reactions present in the organ-
ism, (2) a list of the nutrients that the organism consumes from its
environment, (3) a list of the biomass metabolites (the end pro-
ducts of the organism’s biosynthetic pathways), and (4) a list of the
compounds secreted by the organism as it grows (which are usually
waste products). The listed metabolic reactions (and necessary
gene-protein-reaction associations for knockout mode) are
obtained from the organism’s PGDB; the other three lists are
obtained from a file that we call an FBA file.
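As a rough illustration of how these four components fit together, the sketch below encodes a toy network and checks the steady-state condition that FBA imposes (every internal metabolite is produced exactly as fast as it is consumed). The reaction names, metabolites, and flux values are invented for illustration; this is neither the FBA file syntax nor the way MetaFlux stores a model, and it checks a given flux assignment rather than solving the optimization problem.

```python
# Toy illustration only: not MetaFlux's internal representation or file syntax.
# Each reaction maps metabolite -> stoichiometric coefficient
# (negative = consumed, positive = produced).
reactions = {
    "uptake_A":  {"A": 1},            # nutrient A enters the cell
    "A_to_B":    {"A": -1, "B": 1},   # internal conversion
    "B_to_C":    {"B": -1, "C": 1},   # internal conversion
    "biomass":   {"B": -1},           # biomass metabolite B is drained
    "secrete_C": {"C": -1},           # secretion of waste product C
}


def net_production(reactions, fluxes):
    """Return the net production rate of every metabolite for the given fluxes."""
    net = {}
    for rxn, stoich in reactions.items():
        for met, coeff in stoich.items():
            net[met] = net.get(met, 0.0) + coeff * fluxes[rxn]
    return net


def is_steady_state(reactions, fluxes, tol=1e-9):
    """True when no metabolite accumulates or is depleted."""
    return all(abs(v) <= tol for v in net_production(reactions, fluxes).values())


# Hypothetical flux assignments (arbitrary units).
balanced = {"uptake_A": 10.0, "A_to_B": 10.0, "B_to_C": 4.0,
            "biomass": 6.0, "secrete_C": 4.0}
unbalanced = dict(balanced, secrete_C=3.0)  # C would accumulate

print(is_steady_state(reactions, balanced))
print(is_steady_state(reactions, unbalanced))
```

An FBA solver searches among all flux assignments that pass such a check (and respect the nutrient and secretion bounds) for the one with the largest biomass flux.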
MetaFlux offers several advantages over other metabolic mod-
eling tools [4, 5]. First, MetaFlux models are highly accessible to
scientists in several respects. MetaFlux can be executed via both
desktop and web user interfaces, making its modeling capabilities
easy to run for non-programmers. In contrast, the COBRA toolbox [6]
requires using a programming language (e.g., Python) to run FBA models.
The MetaFlux Python and Lisp APIs provide programmatic
interfaces for those who desire them. MetaFlux models are also
highly accessible because they are represented within PGDBs, which
can be queried, inspected, and visualized using an extensive set of
web-based tools. These tools support searching for genes, reactions,
and metabolites, and provide extensive information pages for these
entities (for example, the metabolite information page lists all
reactions and pathways that produce and consume a given
metabolite). In addition, the PGDB representation enriches a
MetaFlux model with additional information that increases its
understandability, such as chemical structures, reaction atom map-
pings (which identify for each atom in each reactant compound the
corresponding atom in a product compound), regulatory informa-
tion, and linkages to protein and nucleic acid sequences for the
organism. Finally, a model’s outputs can be visualized through
several different tools that enable more rapid understanding of
model results. A highly accessible model is understood more
quickly, is easier to reuse and modify, and is more effectively
validated.
A second advantage of MetaFlux is its extensive set of software
tools for aiding the user during model development. One of the
most time-consuming aspects of model development is enabling
the model to produce all biomass metabolites from the nutrients.
Whereas most tools have a reaction gap-filler only, MetaFlux con-
tains a multiple gap-filler that proposes not only new reactions to
add to a model, but also new nutrients and secretions to add.
Further, the MetaFlux gap-filler identifies which biomass metabo-
lites cannot be produced by the model, so that the user knows
which biomass metabolites to focus their model-debugging efforts
on; other tools simply report that no solution can be found without
identifying the problematic biomass metabolites. We have studied
the accuracy of the MetaFlux gap-filler by randomly removing
reactions from an existing model and running the MetaFlux
gap-filler against those gaps [7], and by comparing manually
curated gap-filling results against MetaFlux results [8]. For the
metabolites that are not produced by the model, MetaFlux pro-
duces a blocked-reaction/metabolite report that lists, for each
non-produced metabolite, which reactions that could produce the
metabolite were blocked from proceeding because of the absence of
a reactant required by the reaction, and lists the blocking reactants.
MetaFlux also identifies unbalanced reactions.
Starting with Pathway Tools version 23.0, two new features
have been added to the installed version of MetaFlux. First, you can
now perform flux variability analysis (FVA) in MetaFlux once you
have a model that solves in FBA. FVA can be used to determine the
robustness of metabolic models under various simulated growth
conditions. This is done by solving a maximization and a minimi-
zation optimization problem on a set of reactions (all reactions are
included by default unless a list of reactions is provided in the FBA
file below the “fva-reaction-list:” keyword). Running MetaFlux in
this mode generates a report showing the reactions which are
blocked (these reactions cannot carry flux), fixed (these reactions
have identical, non-zero minimum and maximum flux values), or
free (these reactions have flux values that span a range). Running
FVA with any exchange fluxes set to zero will also allow you to
identify thermodynamically infeasible cycles. Second, you can now
export FBA models in SBML (Systems Biology Markup Language)
format. SBML is an XML-based, machine-readable format for
representing models of biochemical reaction networks and their
gene associations. Both features are accessible from the MetaFlux
Dialog Window (see Fig. 1).
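The three categories in the FVA report follow directly from the minimum and maximum flux computed for each reaction. The sketch below applies that rule to made-up flux ranges; the function name, the tolerance, and the example values are assumptions for illustration and do not reflect the format of the MetaFlux report.

```python
def classify_fva(flux_ranges, tol=1e-9):
    """Label each reaction as 'blocked', 'fixed', or 'free' from its FVA range.

    flux_ranges maps a reaction name to (minimum flux, maximum flux).
    """
    labels = {}
    for rxn, (vmin, vmax) in flux_ranges.items():
        if abs(vmin) <= tol and abs(vmax) <= tol:
            labels[rxn] = "blocked"   # cannot carry flux
        elif abs(vmax - vmin) <= tol:
            labels[rxn] = "fixed"     # identical, non-zero minimum and maximum
        else:
            labels[rxn] = "free"      # flux values span a range
    return labels


# Hypothetical minimum/maximum fluxes for three reactions.
example = {"R1": (0.0, 0.0), "R2": (2.5, 2.5), "R3": (-1.0, 4.0)}
labels = classify_fva(example)
print(labels)
```

A small tolerance is used because solver output is rarely exactly zero or exactly equal at the two extremes.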

Fig. 1 The graphical user interface of MetaFlux in Pathway Tools. The top pane contains buttons to create an
FBA template file or to select a specific input file from the local machine, a mode selector, and a button to
execute the run. The middle pane will contain a summary of the resulting run and buttons to access files
generated as a result of the run (e.g., log file and solution file). The bottom pane is used to display messages
and traces of execution

2 Materials

MetaFlux is part of Pathway Tools, which can be downloaded for
free for academic users or for a fee for commercial users. The SCIP
solver required for solving LP and MILP optimization problems is
included in MetaFlux. Pathway Tools can be used on the Linux,
Windows, and macOS platforms. To obtain a license, please go to
http://bioinformatics.ai.sri.com/ptools/. You will also need a text
editor to edit the description of the FBA models. For example, you
can use Emacs, vi, or Notepad.

3 Methods

For methods 1e and 1f, a web browser is needed to access the
website BioCyc.org, for which a free account must be created.
Procedure 1: Starting MetaFlux on your computer
1. Start Pathway Tools version 23.0 or later by clicking the desk-
top icon labeled “Pathway-Tools GUI.” The navigator window
opens, showing a list of available databases.
2. Select the database Escherichia coli K-12 substr. MG1655.
3. Open the GUI of MetaFlux using the menu command
Tools → Flux Balance Analysis.
4. You will see the window shown in Fig. 1.
As seen in Fig. 1, three panes appear in the MetaFlux GUI
window: the top, middle, and bottom panes. The top pane contains
the buttons to create an FBA template file, select a file from your
local computer containing an FBA model description to load and parse
in MetaFlux, and execute a model. The middle pane is used to show
a summary of the solution after the execution of a model and to
access the files generated that contain the complete solution of the
model and other information. For example, after running MetaFlux
in solving mode, the middle pane would show the biomass flux,
number of reactions that carry flux, buttons to display the results on
the cellular overview or omics dashboard, and buttons to access the
solution file or log file. The bottom pane is used to display messages
and traces of execution.
Procedure 2: Executing MetaFlux at BioCyc.org
1. Start a web browser; we recommend Chrome or Firefox.
2. Go to the URL BioCyc.org.
3. If you have an account, log in, and then go to step 5.
4. If you do not have an account, create one by clicking the link
“Create New Account” in the top right corner of the web page.
5. Select the organism “Escherichia coli K-12 substr. MG1655
(EcoCyc)” by clicking the button “change current database”
above the search bar.
6. Select the command Metabolism → Run Metabolic Model
from the Tools menu in the top menu bar of the web page.
You will get a web page similar to Fig. 10.
This procedure is used by methods 1e and 1f to use MetaFlux
on the web.
3.1 Solving and Knockout Modes
The methods presented in this section use the solving and knockout
modes of MetaFlux. The first four methods, 1a to 1d, use the
Pathway Tools desktop environment, whereas the last two methods,
1e and 1f, use a web browser to access MetaFlux via BioCyc.org.
Method 1a: Executing a preexisting model
Solving (executing) a model is the fundamental activity with
MetaFlux. Solving a model results in either growth or no growth of
the model. The model shows growth if all the biomass metabolites
can be produced at the ratios specified by the model, in which case
the model produces as output a growth rate for the organism and
flux rates for individual reactions within the model. For this first
method, we use an already constructed model for E. coli. This
approach will illustrate the results created by MetaFlux once a
model shows growth.
1. Apply Procedure 1, starting MetaFlux.
2. Click the button Select Input File in the top pane, which
opens a file dialog widget.
3. Choose the file ecocyc-23.0-gem-cs-glucose-tea-oxygen.fba under
the directory aic-export/pgdbs/biocyc/ecocyc/23.0/data/fba/fba-examples
of your Pathway Tools installation. Click Open to load the file.
4. The file is parsed and loaded into MetaFlux. The bottom pane
shows some statistics about the model.
5. Execute the model by clicking the button Execute in the
top pane.
Once the model is executed, the middle pane shows the bio-
mass flux (which is the cellular growth rate), the number of active
reactions (the reactions carrying non-zero flux), and a series of
buttons to view the files that were created by MetaFlux that contain
the results of the execution. The solution and log files can be
displayed by clicking the corresponding button. The files can also
be loaded into a text editor of your choice. Clicking the button
Show on Cellular Overview opens a window displaying the
metabolic map of the organism with reactions colored according
to the fluxes of the reactions of the model just solved (see Fig. 2).

Fig. 2 The Cellular Overview with reactions colored according to their fluxes once a model is solved. The
reactions are grouped in metabolic pathways, and the pathways are grouped into classes of pathways. Due to
space limitations, we show only pathways; the reactions not in pathways to the right are not shown. Based on
the mapping from fluxes to the colors shown in the legend, the cofactor biosynthesis pathways (purple) toward
the upper left clearly carry less flux than glycolysis and the TCA cycle (yellow), which are shown in the center.
The Cellular Overview is interactive: you can zoom in and out, mouse over compounds and reactions to
open tooltips describing them, and more

Clicking the button Show on Omics Dashboard opens a
browser-based tool for interactively exploring the reaction fluxes
computed from the model in a top-down fashion. Figure 3 shows
the top portion of the Dashboard, which presents the fluxes of
nutrients and secretions, the cellular growth rate, and the produc-
tion rate of ATP. Figure 4 shows the remainder of the Dashboard,
which summarizes the flux rates through other metabolic systems.
The top panel in Fig. 4 summarizes all biosynthetic pathways; the
remaining panels summarize degradation, energy metabolism, and
other pathways. The plots within a panel summarize the fluxes
within different metabolic subsystems, such as for amino acid bio-
synthesis and nucleotide biosynthesis within the Biosynthesis panel.
Each dot within a plot shows a flux for a single reaction within that
subsystem; the large dots show the averages within each subsystem.
(Note that dots sometimes overlap because of scaling issues.) The
user can drill down to a more detailed view by clicking one of the
Fig. 3 The upper portion of the Omics Dashboard shows metabolite uptake and secretion rates, the cellular
growth rate, and ATP production rate. The appearance of each histogram (plot) can be customized using the
“Options” tab

plots within a panel (for example, clicking on Glycolysis within the
Energy Metabolism panel produces the view in Fig. 5, which shows
the flux through each glycolytic reaction; clicking “Show Pathway(s)”
at the top of that window produces Fig. 6, which shows
pathway diagrams colored with flux information for the two var-
iants of glycolysis present in E. coli). Similarly, clicking the AA Syn
plot within the Biosynthesis panel would produce plots summariz-
ing the fluxes for each individual pathway of amino acid biosynthe-
sis. For more information about the Dashboard, please click the
“Help” link in the upper left corner of the web page.
The created log file ecocyc-23.0-gem-cs-glucose-tea-oxygen.log
contains much information about the model that was
loaded. It does not contain information about the solution found,

Fig. 4 The lower portion of the Omics Dashboard summarizes flux through different metabolic systems

but rather details about the model before it was executed. You can
open the log file with your favorite text editor. In the first part of
the file is a list of reactions that were not included in the model
because these reactions were found to have some issues. For exam-
ple, a reaction that is not mass balanced is not included in the
model. Each of these reactions is preceded by a key-code (e.g.,
[unbalanced]) giving the reason for not including the reaction in
the model. The full list of key-codes and their meanings is at the
beginning of the log file.
After the problematic reactions, the log file lists the reactions
included in the model. That list is divided into three sections: one
for the non-generic and non-instantiated reactions, one for all generic
reactions, and one for all instantiated reactions (that is, reactions
instantiated from the generic reactions). A generic reaction is a
reaction that has at least one class (e.g., “an amino acid”) among its
substrates. A compound class contains several compound instances
(e.g., L-tryptophan). An instantiated reaction is generated from a
generic reaction by replacing each compound class with a specific
compound instance of that class.
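The relationship between generic and instantiated reactions can be sketched as follows. This is a minimal illustration under stated assumptions: the class membership table, the transport reaction, and the function name are hypothetical, and the sketch simply substitutes the same instance for a class wherever it appears; it does not reproduce how MetaFlux selects or validates instantiations.

```python
from itertools import product

# Hypothetical class membership; real PGDB compound classes are much larger.
class_instances = {
    "an amino acid": ["L-tryptophan", "L-serine"],
}


def instantiate(reactants, products, class_instances):
    """Yield (reactants, products) with every compound class replaced by an instance.

    Compounds are (name, compartment) pairs. A given class is replaced by the
    same instance on both sides of the reaction.
    """
    classes = sorted({name for name, _ in reactants + products
                      if name in class_instances})
    for choice in product(*(class_instances[c] for c in classes)):
        sub = dict(zip(classes, choice))
        yield ([(sub.get(n, n), comp) for n, comp in reactants],
               [(sub.get(n, n), comp) for n, comp in products])


# Hypothetical generic transport reaction: an amino acid moves into the cytosol.
generic_reactants = [("an amino acid", "periplasm")]
generic_products = [("an amino acid", "cytosol")]

instantiated = list(instantiate(generic_reactants, generic_products, class_instances))
for lhs, rhs in instantiated:
    print(lhs, "->", rhs)
```

One generic reaction therefore expands into as many instantiated reactions as there are combinations of instances for its classes.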

Fig. 5 A summary of the flux through each individual glycolytic reaction

Finally, in the log file, the last section lists blocked reactions
(the reactions that cannot proceed because one of their reactants is
not present) and blocked metabolites (the metabolites that cannot
be produced because all the reactions that produce them are
blocked). A blocked reaction must have a zero flux. Blocked reac-
tions are not included in the model. Blocked reactions are deter-
mined recursively by first computing a set of basic blocked reactions:
reactions that either (a) have a reactant that is not produced by
another reaction and is not included as a nutrient or (b) have a
product that is not consumed by another reaction and is not a
secretion or biomass metabolite. The products of these basic
blocked reactions are metabolites that also cannot be produced or
used, which block more reactions.
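The recursive determination of blocked reactions described above can be sketched as a fixed-point loop. The toy network and the function name are hypothetical, and every reaction is treated as irreversible to keep the example short; the actual MetaFlux implementation may differ.

```python
def find_blocked(reactions, nutrients, outputs):
    """Return the set of blocked reactions.

    reactions maps a name to (reactants, products), both sets of metabolites;
    every reaction is treated as irreversible (a simplifying assumption).
    nutrients can be taken up; outputs are secretions or biomass metabolites.
    """
    active = dict(reactions)
    while True:
        newly_blocked = set()
        for name, (reactants, products) in active.items():
            others = [rp for n, rp in active.items() if n != name]
            produced = set().union(*(p for _, p in others))
            consumed = set().union(*(r for r, _ in others))
            # (a) a reactant that nothing else produces and that is not a nutrient
            starved = any(m not in produced and m not in nutrients for m in reactants)
            # (b) a product that nothing else consumes and that is not an output
            dead_end = any(m not in consumed and m not in outputs for m in products)
            if starved or dead_end:
                newly_blocked.add(name)
        if not newly_blocked:
            return set(reactions) - set(active)
        for name in newly_blocked:
            del active[name]  # removing these may block further reactions


# Hypothetical toy network.
toy = {
    "R1": ({"glc"}, {"A"}),
    "R2": ({"A"}, {"precursor"}),
    "R3": ({"A", "X"}, {"B"}),   # X is never produced: basic blocked reaction
    "R4": ({"B"}, {"waste"}),    # blocked in turn, because only R3 produces B
}
blocked = find_blocked(toy, nutrients={"glc"}, outputs={"precursor", "waste"})
print(sorted(blocked))
```

Here R3 is found in the first pass and R4 only in the second, which is the recursive propagation the text describes.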

Fig. 6 Fluxes through the two variants of the glycolysis pathway present in E. coli

The created solution file ecocyc-23.0-gem-cs-glucose-tea-oxygen.sol
contains a description of the solution of the
model. The first part of the file gives several statistics about the reac-
tions, metabolites, and genes/proteins included in the model. Then
comes a section that lists each biomass metabolite with its flux, along with
the flux of the biomass reaction. Similarly, there is a section for the
nutrients and secretions, followed by the list of active reactions (nonzero
flux) and the list of inactive reactions (zero flux).
Here we present ways of examining the reaction network within
a MetaFlux model at BioCyc.org.
1. Select the organism database from which the model is derived
by clicking the “change current database” button near the top
of the BioCyc.org page. Select a commonly used database such
as E. coli K-12, or type an organism name.
2. Use the Quick Search box below the top menu to search for a
metabolite (example: “pyruvate”), a reaction (example:
“4.1.2.8”), a pathway (example: “glycolysis”), or a gene
(example: “trpa”). Click on an object in the Quick Search
results page to see its information page.
3. From an object information page, click on other related objects
to navigate to their information pages. For example, from a
pathway page, click on a metabolite or gene name to see their
pages; from a reaction page, click on a metabolite name to see
its page.
Method 1b: Executing a model with gene knockouts
We run a model with one or more gene knockouts to determine
the effect of the knockout on the growth rate of an organism. A
knockout could reduce the growth rate or eliminate growth. We
could specify a few genes to knock out or try all the genes of an
organism. In this method, we will knock out all the genes, one by
one, to find which genes are essential for E. coli based on the model
described by the file ecocyc-23.0-gem-cs-glucose-tea-oxygen.fba.

1. Apply Procedure 1, starting MetaFlux.
2. Open the file ecocyc-23.0-gem-cs-glucose-tea-oxygen.fba
using a text editor.
3. Before the reactions section, add the line “knockout-genes:”
and after that add the line “metab-genes”. Save the file under
the new name: ecocyc-23.0-gem-cs-glucose-tea-oxygen-knockout.fba.

4. Select that same file, as in Method 1a, to be parsed and loaded by
MetaFlux.
5. Click the button Execute.
Solving will take a few minutes because a modified model is solved
for each metabolic gene of the EcoCyc database: the reactions
catalyzed by the product of that gene, if any, are removed from the
model when no isozyme exists for them. For each gene knocked out,
the reactions removed are listed in the bottom pane and are also
reported in the solution file. Once the execution is finished,
open the file ecocyc-23.0-gem-cs-glucose-tea-oxygen-knockout.sol.
The first line gives the biomass flux of the model
without any knockout, which establishes the baseline state of the
model before the knockouts are applied. The second line of that file
gives the number of knockouts done, the number of knockouts that
created models that grow, and the number of models that do not
grow. Afterward, a list of all knockouts is provided. For each
knockout, the line states the biomass flux, the gene knocked out,
and the corresponding reactions removed. Note that knocking out
some genes does not remove any reaction because isozymes exist
for these genes.
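The rule that a reaction is removed only when no isozyme remains can be sketched as below. The gene and reaction names are hypothetical, and the sketch ignores enzyme complexes (which require several gene products at once), so it is a simplification of real gene-protein-reaction associations rather than a description of the MetaFlux implementation.

```python
# Hypothetical gene-reaction associations: each reaction lists the genes whose
# products can catalyze it independently (isozymes).
catalysts = {
    "RXN-1": {"geneA"},            # single enzyme
    "RXN-2": {"geneA", "geneB"},   # two isozymes
    "RXN-3": {"geneC"},
}


def reactions_removed(catalysts, knocked_out):
    """Reactions that lose every catalyzing gene when `knocked_out` genes are deleted."""
    knocked_out = set(knocked_out)
    return {rxn for rxn, genes in catalysts.items()
            if genes and genes <= knocked_out}


def single_knockouts(catalysts):
    """Map every gene to the reactions removed by knocking out that gene alone."""
    all_genes = set().union(*catalysts.values())
    return {gene: reactions_removed(catalysts, {gene}) for gene in sorted(all_genes)}


removed = single_knockouts(catalysts)
for gene, rxns in removed.items():
    print(gene, sorted(rxns))
```

In this toy case, knocking out geneB removes nothing because geneA still covers RXN-2; in a full run, the modified model would then be solved for each gene to obtain its biomass flux.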
Obtaining a solution file for each gene knockout is possible.
This would be useful to analyze the set of reactions that are active,
the set of nutrients used, and the secretions produced for each gene
being knocked out. To do so, a parameter is added to generate
these files:
6. Open the file ecocyc-23.0-gem-cs-glucose-tea-oxygen-knockout.fba
using a text editor.
7. Add the line “knockout-summary-only: no” right before the
line “knockout-genes:”.
8. Save the file.
9. Reload and execute the file by clicking the button Reload &
Execute.

The execution will take longer than the previous execution,
which produced only a summary of the knockouts. The resulting
solution files have the same format as described in Method 1a.
Method 1c: Executing a Dynamic FBA Model on a commu-
nity of organisms
A community of metabolically interacting organisms can be
modeled in MetaFlux. In this method, we will execute an already-
defined community model based on two similar E. coli strains and
visualize the results. One strain is the same E. coli K-12 MG1655
strain that we used in previous methods; the other strain is a
knockout of reactions NADH-DEHYDROG-A-RXN and RXN0-5388
from the preceding strain, meaning that the second strain has
two electron transport reactions that transform NADH to NAD
removed. This method demonstrates the ability of MetaFlux to
model organism growth within a rectangular grid in which both
organisms and metabolites can diffuse among grid squares due to
Brownian collision based on Fick’s second law of diffusion. The
diffusion coefficients of the metabolites and organisms are set to
5 × 10⁻⁶ and 3 × 10⁻⁹ cm²/s, respectively. Note that grids are not required for
community modeling, and there is no restriction on
community size.
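Fick's second law, ∂c/∂t = D∇²c, can be approximated on a grid with a simple explicit finite-difference update. The sketch below is one way to do this and is not necessarily the numerical scheme used by MetaFlux. The box width of 0.1 cm, the 1 h step, and the function name are assumptions for illustration; the coefficient of 5e-6 cm²/s is the metabolite value quoted above.

```python
import math


def diffuse(grid, D_cm2_s, dx_cm, dt_s, max_r=0.2):
    """Explicit finite-difference diffusion with no-flux edges; returns a new grid.

    grid holds the amount of one substance in each box (boxes of equal size).
    """
    r_total = D_cm2_s * dt_s / dx_cm ** 2
    # The explicit scheme is only stable for small r, so split the step.
    substeps = max(1, math.ceil(r_total / max_r))
    r = r_total / substeps
    rows, cols = len(grid), len(grid[0])
    for _ in range(substeps):
        new = [row[:] for row in grid]
        for i in range(rows):
            for j in range(cols):
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        # Exchange with each existing neighbor; nothing
                        # crosses the outer edges of the grid.
                        new[i][j] += r * (grid[ni][nj] - grid[i][j])
        grid = new
    return grid


# 5 x 5 grid with 10 mmol of a metabolite placed in one corner box.
start = [[0.0] * 5 for _ in range(5)]
start[0][0] = 10.0
after = diffuse(start, D_cm2_s=5e-6, dx_cm=0.1, dt_s=3600.0)
print(round(sum(map(sum, after)), 6), after[0][0] < start[0][0])
```

Because every exchange between two neighboring boxes is equal and opposite, the total amount in the grid is conserved; only its spatial distribution changes.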
This method also demonstrates the use of dynamic FBA, in
which one or more steady-state FBA models are run across multiple
time steps. At each time step, the concentrations of nutrients in
each grid square are decreased based on the rate of nutrient uptake
by each model, and the concentrations of secreted metabolites in
each grid square are changed based on the rate of secretion by each
model. Each organism model maximizes its growth independently
from one another in each time step, and the biomass of each
organism is changed in each time step based on the growth rate
of the organism. At each step, the organism models are executed in
a random order. The random order ensures that no model has a
systematic advantage in consuming space and nutrients. A
fraction of each organism dies in each time step as specified by
parameter organism-death-rate.
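The bookkeeping performed at each dFBA time step can be sketched as follows. The function stand_in_fba is a hypothetical placeholder for the FBA solve of one organism (it simply grows at a fixed maximal rate while glucose lasts), and all parameter values are invented; the sketch also ignores the grid, diffusion, and secretions, so it illustrates only the update loop, not MetaFlux itself.

```python
import random


def stand_in_fba(glucose_mmol, biomass_gdw, dt_h, mu_max=0.5, yield_gdw_per_mmol=0.1):
    """Hypothetical placeholder for one FBA solve.

    Returns (growth rate in 1/h, glucose uptake in mmol) for this time step.
    """
    if biomass_gdw <= 0 or dt_h <= 0:
        return 0.0, 0.0
    wanted = mu_max * biomass_gdw * dt_h / yield_gdw_per_mmol
    uptake = min(wanted, glucose_mmol)  # cannot take up more than is available
    mu = uptake * yield_gdw_per_mmol / (biomass_gdw * dt_h)
    return mu, uptake


def simulate(biomass, glucose, steps=12, dt_h=1.0, death_rate=0.01, seed=0):
    """Run the time-step loop; returns (biomass history, remaining glucose)."""
    biomass = dict(biomass)  # do not modify the caller's dictionary
    rng = random.Random(seed)
    history = [dict(biomass)]
    for _ in range(steps):
        order = list(biomass)
        rng.shuffle(order)  # organisms are executed in a random order
        for org in order:
            mu, uptake = stand_in_fba(glucose, biomass[org], dt_h)
            glucose -= uptake                           # nutrient depletion
            biomass[org] += mu * biomass[org] * dt_h    # growth
            biomass[org] -= death_rate * biomass[org]   # a fraction dies
        history.append(dict(biomass))
    return history, glucose


history, glucose_left = simulate({"ECOLI_1": 0.01, "ECOLI_2": 0.01}, glucose=10.0)
print(len(history), glucose_left >= 0.0)
```

Shuffling the execution order at every step is what prevents one organism from always getting first access to the shared nutrient pool.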

Fig. 7 The content of the file two-ecoli-community-23.0.com

The file two-ecoli-community-23.0.com describes a community
of two E. coli models (see Fig. 7). This community file refers to two
individual FBA model files for the two strains:
ecocyc-23.0-gem-community-example.fba and
ecocyc-23.0-gem-community-example-no-NDH-I.fba. The community
file also specifies a physical grid space of five columns and five
rows within which the organisms grow; the sides of the grid are
5 mm long. We will run a simulation of 12 steps of 1 h each. An
initial (i.e., step 1) biomass of 0.01 gDW of each organism is added
in two corner boxes of the grid, for the first organism at box
coordinate (0, 0) and for the second organism at box coordinate
(4) (note that grid coordinates are defined as (row index, column
index) where indices start at 0). The exchange compartment of
metabolites between the two organisms is the extracellular space.
The nutrient is beta-D-glucopyranose (identifier GLC), with
10 mmol present in each box of the grid, at step 1. No oxygen is
provided initially, but at step 5, 10 mmol is provided in each box of
the grid. No acetate is available at the beginning of the simulation
(acetate can be produced by the organisms and will accumulate
accordingly).
1. Apply Procedure 1, starting MetaFlux.
2. Select the file two-ecoli-community-23.0.com under the
directory aic-export/pgdbs/biocyc/ecocyc/23.0/data/fba/
community-examples of your Pathway Tools installation.
3. Click the button Execute.
4. Execution takes a few minutes to simulate the 12 steps. The
result message should say that both organisms grow.

Fig. 8 This popup window enables selecting metabolites and organisms for the 2D plots

5. Click the option Plot: Aggregated Biomasses/Metabolites.
6. Click the button Select Organisms, Metabolites and
Options. A popup window opens (see Fig. 8). Select the
metabolites acetate[extracellular space], CO2[extracellular
space], beta-D-glucopyranose[extracellular space], and oxygen
[extracellular space]. Select, by clicking, both ECOLI_1 and
ECOLI_2. Click the button Save Selection and Close.
7. Click the button View Selected Outputs. A popup window
from Gnuplot will open.
Note that if Gnuplot is not installed, an attempt is made to
download Gnuplot from the Pathway Tools server (for Linux and
macOS platforms). If this fails, a warning message is displayed
directing the user to download Gnuplot manually. The Gnuplot
application produces two 2D plots in one popup window, one for
the accumulated biomasses for the two organisms in gDW and one
for the accumulation/depletion of the four selected metabolites in
mmol, over a period of 12 h (see Fig. 9). In the top 2D plot, the

Fig. 9 Two plots over 12 h in the entire grid, one for the accumulation of biomasses of two organisms, and one
for the accumulation of four metabolites

growth of both organisms is slow from step 1 to the beginning of
step 5 (because growth is slower without oxygen), and at step 5
the growth increases, because oxygen is added to every grid
location at that step. Ecoli_1 grows faster than Ecoli_2,
because Ecoli_2 has reduced capacity in producing NAD from
NADH. In the bottom 2D plot, starting at step 5, CO2 accumu-
lates but beta-D-glucopyranose and oxygen deplete. This bottom
plot aggregates the accumulated metabolites of both organisms,
and we cannot distinguish which organism produces more CO2 or
acetate, and which organism uses more oxygen or beta-D-glucopyr-
anose. The next steps show how to use another option that pro-
vides such details per organism.
8. Close the Gnuplot popup window, if it is still open.
9. Uncheck the option Plot: Aggregated Biomasses/Metabolites,
if it is still checked.
10. Click the option Plot: Metabolites Used/Produced per
Organism.
11. Click the button Select Organisms, Metabolites and
Options. A popup window opens. Select the metabolites
acetate[extracellular space], CO2[extracellular space], beta-
D-glucopyranose[extracellular space], and oxygen[extracellu-
lar space] for both organisms ECOLI_1 and ECOLI_2. Click
the button Save Selection and Close.
12. Click the button View Selected Outputs. A popup win-
dow from Gnuplot will open (see Fig. 10).
Figure 10 shows the accumulated production and usage of the
selected metabolites per organism. The negative values represent
total uptake of nutrients (mmol), whereas the positive values repre-
sent accumulation of secreted metabolites (mmol). Organism
Ecoli_1 uses a bit more beta-D-glucopyranose than organism
Ecoli_2, whereas the oxygen uptake is essentially equal for both
organisms. Organism Ecoli_1 produces more CO2 than organism
Ecoli_2, whereas Ecoli_2 produces more acetate than Ecoli_1.
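The sign convention and the difference between the per-organism and the aggregated views can be illustrated with a short Python sketch. The numbers below are invented for illustration and are not results of the simulation; the helper names cumulative and aggregate are hypothetical.

```python
from itertools import accumulate

def cumulative(per_step):
    """Running total of per-step exchange values (mmol)."""
    return list(accumulate(per_step))

# Illustrative per-step exchange in mmol: negative = uptake, positive = secretion.
exchange = {
    "ECOLI_1": {"GLC": [-1.0, -1.5, -2.0], "CO2": [0.5, 1.0, 1.5]},
    "ECOLI_2": {"GLC": [-0.8, -1.2, -1.6], "CO2": [0.4, 0.8, 1.2]},
}

# Per-organism view: one cumulative curve per organism and metabolite.
per_organism = {
    org: {met: cumulative(values) for met, values in mets.items()}
    for org, mets in exchange.items()
}

def aggregate(per_organism, metabolite):
    """Aggregated view: sum the cumulative curves of all organisms."""
    series = [mets[metabolite] for mets in per_organism.values()]
    return [sum(values) for values in zip(*series)]

print(per_organism["ECOLI_1"]["GLC"])
print(aggregate(per_organism, "GLC"))
```

The aggregated curve hides which organism contributed what, which is exactly the information the per-organism plot restores.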
Figure 10 shows the accumulation of biomass and metabolites
summed across all locations in the grid where the biomass and
metabolites accumulate. Visualizing the diffusion of the organisms
and the metabolites within each grid square is possible, as follows:
13. Close the Gnuplot popup window showing the 2D plots, if it
is still open.
14. Click the option Static Grids: Biomasses/Metabolites.
15. Click the button Select Organisms, Metabolites and
Options. A popup window opens.

Fig. 10 Two plots showing the accumulated production (positive values) and consumption (negative values) of
four metabolites for the two organisms of a dynamic FBA over 12 h for the entire grid

16. Select beta-D-glucopyranose under the metabolites list and
select the two organisms under the organisms list. Enter the
integers 1, 5, 10, 13 in the box labeled Steps.
17. Click the button Save Selection and Close.
18. Click the button View Selected Outputs. A window opens
displaying 2D grids.
Figure 11 shows the resulting display. There are four rows, one
for each step, and three columns, two for the organisms, and one
for metabolite beta-D-glucopyranose. The first two steps (i.e., 1 and
5) show no growth for the two organisms, consistent with the
previous 2D plots showing no growth for both organisms until
the end of step 5. Until step 5, the accumulation of the metabolite
beta-D-glucopyranose does not decrease as shown on the green
grid. At step 10, there is growth for both organisms: Ecoli_1 at
coordinate (0, 0) and Ecoli_2 at coordinate (4, 4). At step 10, there is
a slight decrease in the accumulation of beta-D-glucopyranose at
the same locations, consistent with the uptake of nutrients by
both organisms. At the end of step 12, which is the end of the
simulation after 12 hours, we can see an accumulation of biomass
for Ecoli_1 at locations (0, 0), (1), and (2), and similarly for Ecoli_2 but
in the opposite corner of the grid. The organisms diffuse in the
grid. The nutrient beta-D-glucopyranose has an accumulation of
zero in the corresponding corners of the grid.
Figure 12 shows a similar set of grids, but for four metabolites.
These grids were obtained with a method similar to that just
shown, but by selecting four metabolites instead of two organisms

Fig. 11 Accumulation of the biomass of the two E. coli models and the take-up of oxygen over the grid

and one metabolite. In these grids, we can see the accumulation of
acetate and CO2 and the depletion of oxygen and beta-D-glucopyranose.
Method 1d: Changing the state of a dynamic FBA community model
A community model is described by the size of a grid, and by
the quantity of nutrients and biomasses of the organisms available
in the grid. These quantities and biomasses can be provided for any
step of the simulation and for any location of the grid. For example,
adding oxygen is possible at a certain time and location in the grid
during the dynamic simulation. Adding new organisms in a similar
way is also possible. This method shows how to apply these
dynamic modifications to the community model described by file
two-ecoli-community-23.0.com. We will apply two modifi-
cations: the amount of oxygen provided to the organisms and the
locations of the organisms such that they will interact in the center
of the grid.

Fig. 12 Accumulation in mmol of four metabolites (acetate, oxygen, beta-D-glucopyranose, CO2) over the 5 × 5 grid

1. Apply Procedure 1, starting MetaFlux.


2. Using a text editor, open the file two-ecoli-community-23.0.com
under the directory
aic-export/pgdbs/biocyc/ecocyc/23.0/data/fba/community-examples
of your Pathway Tools installation.
3. Rename the file to two-ecoli-community-23.0-interact.com.
4. In that file, for the two “.fba” files, change the location “(0, 0)”
to “(1)” and “(4)” to “(2).”
5. For the oxygen molecule, change “:supply 10” to “:supply 1”.
6. For beta-D-glucopyranose, change “:supply 10” to “:supply 1”.
7. Save the file two-ecoli-community-23.0-interact.com.
8. Select the file two-ecoli-community-23.0-interact.com to be
loaded and parsed in MetaFlux.
9. Click the button Execute.
The amounts of oxygen and beta-D-glucopyranose were
reduced to a point where the organisms will reduce their growth
before the final step. The organisms were put close to each other
such that they are competing for oxygen and beta-D-glucopyranose.
At this point, you can display various plots of the produced and
consumed metabolites as well as the biomass of each organism, as in
Method 1c. For example, Fig. 13 shows the grid of accumulated
biomasses of ECOLI_1 and ECOLI_2 and the depletion of two
metabolites, oxygen and beta-D-glucopyranose. As can be seen, the
two organisms are competing for these two metabolites to grow.
Method 1e: Solving a model on the web
MetaFlux is also available via the web at BioCyc.org, but only
for solving models; the knockout and development modes are not
yet available on the web. A web account is needed to run metabolic
models. These models can be saved, modified, and executed as
needed. Note that there is no limit on the number of models that
can be saved. In this method, we will use the same E. coli model as
used in the previous methods.
1. Apply Procedure 2, accessing MetaFlux for the E. coli model at
BioCyc.org.
2. You should have a web page similar to Fig. 14.
3. Select the model cs_glucose_tea_oxygen (owned by BRG
SRI). You will see a web page similar to Fig. 15.
4. Click the Execute button. Results similar to Fig. 16 should
appear after approximately 20 seconds.
The results web page of the execution of the model shows the
biomass flux and the number of active reactions. The list of active

Fig. 13 On the left are the accumulations of the biomasses of two organisms in a grid of 5 × 5 at four different steps (beginning of steps 1, 5, and 10; ending of
step 12). On the right are the accumulations of two metabolites in the same grid at the same steps

Fig. 14 The main model-execution web page after the Tools menu command Metabolism → Run Metabolic
Model has been selected

Fig. 15 The page that displays the description of the public model cs_glucose_tea_oxygen

reactions with their fluxes is directly shown on the web page. The
reaction unique identifiers are clickable to further study them.
Clicking the button Show Solution File opens a new tab of the
browser and shows the same solution file as described for the earlier
methods. Similarly, the button Show Log File shows the entire log
file. Clicking the button Show Fluxes on the Cellular Overview
opens a new tab to show the Cellular Overview with the
reactions highlighted according to the fluxes of the solved model.
The Cellular Overview functionalities on the web are similar but
not identical to the functionalities of the Cellular Overview on the
desktop platform of Pathway Tools. The model can be analyzed,
modified, and rerun, which are addressed in the next method.

Fig. 16 The content of the Results tab after executing the model cs_glucose_tea_oxygen; additional reaction
fluxes appear as the user scrolls down the page

Method 1f: Analyzing, modifying, and re-executing a model on the web
In the prior method, the model was executed in exactly the
form in which it was initially provided. However, it is simple to
make a copy of that model; modify its nutrients, biomass, or
secretions; and execute the modified model to see the effect of
these modifications.
1. Do steps 1–3 of the previous method 1e.
2. Click the button Copy. You will be asked to enter a model
name of your choice for the copy you are creating. Let us call it
my_copy_cs_glucose_tea_oxygen.
3. Click the button Copy FBA Model. A copy is created under
your account. You are now the owner of that model.
4. Click the button Edit. The web page shows a summary of the
model with five tabs.
5. Click the tab Nutrients. The list of nutrients is shown (see
Fig. 17).
The Nutrients tab shows all possible metabolites that could be
used by the model as nutrients. Notice the presence of lower and
upper bound fluxes. The first metabolite, beta-D-glucopyranose, is
set to be between 0 and 10 mmol/gDW/h. We will modify this
model to grow on another source of carbon.

Fig. 17 The Nutrients tab of the newly copied model. The full list of nutrients is not shown due to space
limitations

6. In the top row, type GLYCEROL-3P (all uppercase) in the box
Enter a compound frame id to add. Hit the return key,
which signals the completion of your entry. The other fields
are automatically completed, and the new metabolite is
included as a nutrient.
7. For the metabolite GLYCEROL-3P just entered, replace
CCO-CYTOSOL by CCO-EXTRACELLULAR.
8. Modify the upper-bound flux of GLYCEROL-3P to 10 instead
of 3000.
9. In the top row again, enter FUM (all uppercase) in the box
"Enter a compound frame id to add". Hit the return key.
The fumarate metabolite also becomes a new nutrient for this
model.
10. As for GLYCEROL-3P, modify the CCO-CYTOSOL and the
upper-bound flux.
11. Set the upper-bound flux of GLC to 0.
The two new metabolites fumarate and glycerol 3-phosphate will serve as
new sources of carbon. The flux of nutrient GLC is now bounded
to 0, which makes it unusable as a nutrient. Setting its
upper-bound flux to zero forces the model to use
the new carbon sources. GLC could also have been deleted
from the nutrient list, but by setting its upper-bound flux to zero,
we keep it in the list and can use it again at a later time. We are now
ready to execute the model with the new nutrients and see the
results.

12. Click the button Execute.


As in Method 1e, the Results tab will show the biomass flux,
active reactions, and so on. The biomass flux and the set of active
reactions should differ from those obtained in Method 1e.
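The nutrient edits made in Method 1f can be summarized as a small Python sketch of the bounds table. It is a bookkeeping illustration only and does not call any BioCyc or MetaFlux API: the dictionary layout and the function names are hypothetical, the default lower bound of 0 is an assumption, and the compartment recorded for GLC is assumed rather than taken from the model.

```python
DEFAULT_UPPER_BOUND = 3000.0   # upper bound shown for a newly added nutrient

# Bounds are uptake fluxes in mmol/gDW/h.
nutrients = {
    # Compartment of GLC is an assumption made for this sketch.
    "GLC": {"compartment": "CCO-EXTRACELLULAR", "lower": 0.0, "upper": 10.0},
}

def add_nutrient(table, frame_id):
    """Add a nutrient with the defaults that the form fills in (lower bound assumed 0)."""
    table[frame_id] = {"compartment": "CCO-CYTOSOL",
                       "lower": 0.0,
                       "upper": DEFAULT_UPPER_BOUND}

def usable_nutrients(table):
    """Nutrients the model may actually take up, i.e. with a positive upper bound."""
    return sorted(name for name, entry in table.items() if entry["upper"] > 0.0)

# Steps 6-10: add the two new carbon sources and adjust compartment and bound.
for frame_id in ("GLYCEROL-3P", "FUM"):
    add_nutrient(nutrients, frame_id)
    nutrients[frame_id]["compartment"] = "CCO-EXTRACELLULAR"
    nutrients[frame_id]["upper"] = 10.0

# Step 11: keep GLC in the list but make it unusable.
nutrients["GLC"]["upper"] = 0.0

print(usable_nutrients(nutrients))
```

Keeping GLC in the table with an upper bound of zero, rather than deleting it, means a single value needs to change to restore glucose as a carbon source later.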

3.2 Creating and Completing Models

The previous methods in Methods 1 used previously created metabolic
models. In this section, we introduce the creation of new models.
Method 2a: Creating a basic FBA model specification
This method will show how to create a basic FBA specification
file (or FBA file for short) for any PGDB; here our example organ-
ism will be E. coli. This initial FBA specification will contain a very
basic biomass specification (37 compounds out of the 85 com-
pounds produced by the full model). The following method uses
the EcoCyc PGDB for which we already have a model, but it will
illustrate a simple model-creation method that should work for any
PGDB (i.e., for any organism for which a PGDB has been created).
A template FBA file will include a suggested try-biomass sec-
tion, based on the taxonomic classification of the current PGDB.
We have pre-defined a set of 27 biomass templates that are stored in
MetaCyc. The metabolite composition of each template was
defined by a curator, after reading published biomass information
found in the literature. The templates are intended to represent a
starting point for the biomass, on a per phylum level. Given the
current PGDB, a simple algorithm ascends the tree structure of the
NCBI taxonomy, until the phylum level is reached, under which
this PGDB resides. If this phylum taxon node points to a biomass
template, it is retrieved. If this search fails (in a phylum for which we
have no information available), then a very generic template is
returned, which simply consists of the amino acids, nucleotides,
and a few universal cofactors.
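The template lookup described above can be sketched as follows. The sketch assumes a toy taxonomy stored as a dictionary; the lineage shown, the template labels, and the function name biomass_template are hypothetical and do not reflect how Pathway Tools stores the NCBI taxonomy or the MetaCyc templates.

```python
# taxon -> (rank, parent taxon); a tiny hand-written excerpt used for illustration.
TAXONOMY = {
    "Escherichia coli": ("species", "Escherichia"),
    "Escherichia": ("genus", "Enterobacteriaceae"),
    "Enterobacteriaceae": ("family", "Enterobacterales"),
    "Enterobacterales": ("order", "Gammaproteobacteria"),
    "Gammaproteobacteria": ("class", "Proteobacteria"),
    "Proteobacteria": ("phylum", "Bacteria"),
    "Bacteria": ("superkingdom", None),
    "Toy organism": ("species", "Toy phylum"),   # phylum without a template
    "Toy phylum": ("phylum", "Bacteria"),
}

# Hypothetical labels standing in for curated per-phylum biomass templates.
PHYLUM_TEMPLATES = {"Proteobacteria": "proteobacteria-template"}
GENERIC_TEMPLATE = "generic-template"  # amino acids, nucleotides, universal cofactors

def biomass_template(taxon):
    """Ascend the taxonomy from `taxon` to its phylum and return that phylum's template."""
    node = taxon
    while node is not None:
        rank, parent = TAXONOMY[node]
        if rank == "phylum":
            # Fall back to the generic template if the phylum has none.
            return PHYLUM_TEMPLATES.get(node, GENERIC_TEMPLATE)
        node = parent
    return GENERIC_TEMPLATE  # no phylum found on the way to the root

print(biomass_template("Escherichia coli"))
print(biomass_template("Toy organism"))
```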
A functional FBA model will take up the specified nutrients and
process those nutrients via the metabolic reactions provided in the
PGDB to produce the biomass metabolites and the secretions.
During model development, all the four model components (nutri-
ents, reactions, secretions, and biomass metabolites) will probably
need adjustment. If all biomass metabolites are produced by the
model, we say that the model solves. If even one biomass metabo-
lite cannot be produced by the model, we say the model does not
solve. The model can fail to produce a biomass metabolite for fairly
obvious reasons, such as because of a missing nutrient, a missing
reaction, or a reaction whose specified directionality is the opposite
of what it should be. The model can also fail to produce a biomass
metabolite for less obvious reasons, such as a missing secretion
(which will block the reactions that produce the secretion, because
the secreted compound has nowhere to go).
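A rough way to see why a missing nutrient or reaction stops a model from solving is a reachability check over the reaction network, sketched below. This is a deliberate simplification and not what MetaFlux computes: FBA requires a steady-state flux distribution, so a reachability check cannot detect problems such as a missing secretion. The reaction names and the toy network are hypothetical.

```python
def producible(nutrients, reactions):
    """Metabolites reachable from the nutrients by repeatedly firing reactions.

    reactions: dict name -> (substrates, products), each a list of metabolite ids.
    """
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

def unproduced_biomass(nutrients, reactions, biomass):
    """Biomass metabolites that cannot be reached; an empty list means 'solves' here."""
    return sorted(set(biomass) - producible(nutrients, reactions))

# Toy network (hypothetical): glucose -> pyruvate -> alanine, needing ammonium.
toy_reactions = {
    "R1": (["GLC"], ["PYRUVATE"]),
    "R2": (["PYRUVATE", "AMMONIUM"], ["L-ALPHA-ALANINE"]),
}
print(unproduced_biomass({"GLC", "AMMONIUM"}, toy_reactions, ["L-ALPHA-ALANINE"]))
print(unproduced_biomass({"GLC"}, toy_reactions, ["L-ALPHA-ALANINE"]))
```

In the second call the nutrient ammonium is missing, so the single biomass metabolite is reported as unproduced, which corresponds to a model that does not solve.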

We will use the Create Template File button available from
the top pane of the MetaFlux GUI. The created FBA file is mostly a
template from which you can develop a model. We will show one
way to complete the nutrients and secretions sections of the
template.
1. Apply Procedure 1, starting MetaFlux.
2. In the top pane, click the button Create Template File.
3. Enter the file name my-first-model.fba, with a directory
of your choice. Click Ok.
4. The file my-first-model.fba is created in the directory you
chose.
Notice that we have produced an FBA file that will work with
the currently selected PGDB, namely EcoCyc. In particular, the set
of biomass metabolites introduced in that file depends on that
organism. The file produced cannot be solved, because it has no
nutrients or secreted metabolites (secretions). In fact, MetaFlux
gives an error message, pointing out the absence of nutrients and
secretions. We will modify that file to add nutrients and secretions
such that MetaFlux can accept it without giving an error message.
1. Open file my-first-model.fba with a text editor.
2. Notice the comments starting with “#” at the beginning of
the file.
3. Scroll down to the section starting with “biomass:”. Remove
that line.
4. Scroll down to the section starting with “try-biomass:”.
Change it to “biomass:”.
5. Scroll down to the section starting with “try-nutrients:”.
6. We can suggest any number of nutrients to try. Under
“try-nutrients:”, enter each of the following metabolites on a
separate line (the letters a–j must not be entered):

a. GLC[cytosol]
b. OXYGEN-MOLECULE[cytosol]
c. AMMONIUM[cytosol]
d. SULFATE[cytosol]
e. Pi[cytosol]
f. FE+2[cytosol]
g. CA+2[cytosol]
h. MG+2[cytosol]
i. MN+[cytosol]
j. NA+[cytosol]

7. Scroll down to the section starting with “try-secretions:”.



8. Under the “try-secretions:” line, add the line “all-compounds[cytosol]”.
9. Save the file under the new name my-first-model-v2.fba.
What we have done is add a specific set of nutrients to the
model and requested that MetaFlux try using all known com-
pounds (known in MetaCyc [9]) as secretions. The compartment
where the compounds are located is explicitly stated as the cytosol.
Clearly, for a real organism, the nutrients should not be in the
cytosol, but coming from outside the cell. Similarly, for the secre-
tions, they should be directed outside the cell. For example, we can
define glucose to be in the extracellular compartment in the nutrients
section by including the line GLC[CCO-EXTRACELLULAR].
However, to use the compartments other than cytosol would
require adding the necessary transport reactions, which is beyond
the scope of this guide.
The next step is to load the modified file in MetaFlux:
1. Select the file my-first-model-v2.fba, as in Method 1, to
be loaded in MetaFlux.
2. Notice that you are in development mode to further develop
the model.
3. Click Execute.
MetaFlux is in development mode because the file
my-first-model-v2.fba has nutrient candidate metabolites under the
try-nutrients section, as well as candidate metabolites under
the try-secretions section. Whenever a “try” set is specified in
the FBA file, MetaFlux will enter development mode. A try set
provides MetaFlux with a set of alternatives to try; MetaFlux will
determine whether the model will solve with any subset of the
alternatives. The solution obtained will suggest using some metabolites
as nutrients and secretions. In contrast, the “fixed” metabolites
and reactions are the parts of the model that do not change and
are considered to be known to be used by the organism. If you
accept a MetaFlux suggestion to add some metabolites as nutrients
or secretions, or to add reactions, you can add these metabolites as
fixed nutrients and secretions as shown in the following steps. Moving
metabolites or reactions from a try section to a fixed section of the
.fba file will significantly increase the speed of MetaFlux.
1. Open the file my-first-model-v2.sol with a text editor.
2. Look under the header Consumed Nutrient(s) for the list of
suggested nutrients.
3. Copy the list of suggested metabolites under the section
nutrients: in the file my-first-model.fba.

4. Look under the header Secreted Tried Secretion(s) in
the file my-first-model-v2.sol for the list of suggested
secretions.

5. Copy that list of secreted metabolites to the section
secretions: in the file my-first-model.fba.
6. Save the file my-first-model.fba.
7. Reload and execute the file my-first-model.fba.
The model described by the file my-first-model.fba
should show growth for the basic biomass of the file. You can also
see the flux values of reactions and metabolites in the file
my-first-model.sol. The button Show on Cellular Overview
displays the Cellular Overview in the navigator window of
Pathway Tools with the reactions highlighted with flux values
computed by MetaFlux.
The current model is clearly very simple, and much work needs
to be done to complete it. In particular, the biomass must be
completed, which requires literature research about the specific
compounds synthesized by the organism being modeled. Literature
research may also reveal some known nutrients and secretions. The
transport reactions are also missing, which are highly dependent on
the nutrients selected.
Method 2b: Gap-filling a model
The quality of an FBA model highly depends on the annotated
genome that was used to create the initial PGDB. The initial
qualitative metabolic reconstruction might have many missing reac-
tions and pathways. Missing reactions can prevent production of
some biomass metabolites, stopping model growth under certain
nutrient conditions. The development mode of MetaFlux can help
find these missing reactions. In this method, we will simulate the
process of finding a missing reaction using an already-growing
model for EcoCyc: we will instruct MetaFlux to ignore certain
reactions within the model (to simulate removal of some reactions)
to prevent growth and then let the development mode find the
reactions that should be added to the model to enable growth
again.
1. Apply Procedure 1, starting MetaFlux.
2. Using a text editor, open the file
ecocyc-23.0-gem-cs-glucose-tea-oxygen.fba, which is under the
directory aic-export/pgdbs/biocyc/ecocyc/23.0/data/fba/fba-examples
of your Pathway Tools installation.
3. After the line “remove-reactions:”, add the three lines
“DIHYDROFOLATEREDUCT-RXN”, “HOMOCYSMETB12-RXN”, and
“GLYOHMETRANS-RXN”.
4. Save the modified file as
ecocyc-23.0-gem-cs-glucose-tea-oxygen-gap.fba.
5. Select and load that file in MetaFlux.
6. Click the button Execute.

MetaFlux reports no growth because several biomass metabolites
under the class of tetrahydrofolate can no longer be produced
when the three reactions in the “remove-reactions:” section are not
in the reaction network of the model. This run was to confirm that
indeed some essential reactions were removed from the model. In
the next steps, we will run MetaFlux in development mode to find
the essential reactions to reestablish growth.
7. In the file ecocyc-23.0-gem-cs-glucose-tea-oxygen-gap.fba,
before the line “remove-reactions:”, insert the two
lines “try-reactions:” and “metacyc-metab-all”.
8. In that same file, change the parameter “minimize-fluxes: yes”
to “minimize-fluxes: no”.
9. Select and load that file in MetaFlux. Notice that MetaFlux is in
general development mode, because we have created a
non-empty try-reactions section.
10. Click the button Execute. It may take a few minutes for the
execution to complete.
11. The result should show growth, with a biomass flux of at least
0.001 mmol/gDW/h.
12. Open the file ecocyc-23.0-gem-cs-glucose-tea-oxygen-gap.sol.
13. Search for the header starting with “Reactions Suggested to
be Added-to.” Underneath that header, there should be one
or more reactions suggested to be added to the model.
This method demonstrates the basic technique to gap-fill a
reaction network when no growth occurs. We forced the removal
of three reactions in the previous steps 1–6. At step 7, a
try-reactions section was added to enable MetaFlux to try adding
candidate reactions from MetaCyc to complete the reaction net-
work to obtain growth. In general development mode, a minimum
set of candidate reactions was found. We changed the parameter
minimize-fluxes to “no” to reduce the computational burden of
finding this minimum set of reactions. Minimizing fluxes is useful
when solving a model, but not that useful when trying to complete
a model. The minimization of fluxes removes many futile loops of
high fluxes. The reactions suggested to be added might not exist in
the organism, because the completion does not consider the organ-
ism’s genome. An analysis should be done before deciding to
include these suggested reactions in the model. Adding reactions
requires using the reaction editor of Pathway Tools and should be
done on your own created PGDB.
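The idea of finding a minimum set of candidate reactions that restores growth can be sketched with brute-force enumeration over a toy network. This is not MetaFlux's method: the sketch replaces the solver-based search with a simple reachability test and tries candidate subsets in order of increasing size. All reaction and metabolite names are hypothetical.

```python
from itertools import combinations

def producible(nutrients, reactions):
    """Metabolites reachable from the nutrients (reachability, not flux balance)."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

def gap_fill(nutrients, model_reactions, candidates, biomass):
    """Return a smallest list of candidate reaction names that makes all biomass
    metabolites reachable, or None if no subset of the candidates does."""
    for size in range(len(candidates) + 1):
        for names in combinations(sorted(candidates), size):
            trial = dict(model_reactions)
            trial.update({name: candidates[name] for name in names})
            if set(biomass) <= producible(nutrients, trial):
                return list(names)
    return None

# Toy example: the model makes B from A, but biomass needs D.
model = {"R1": (["A"], ["B"])}
candidate_pool = {
    "X1": (["B"], ["C"]),
    "X2": (["C"], ["D"]),
    "X3": (["B"], ["D"]),   # a single reaction that closes the gap
    "X4": (["A"], ["E"]),
}
print(gap_fill({"A"}, model, candidate_pool, ["D"]))
```

As with the suggestions made by MetaFlux, a reaction returned by such a search is only a candidate; whether the organism actually carries it has to be checked separately.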
Computational imprecision may occur during the execution of
the general development mode that can cause MetaFlux to return
incorrect solutions. These imprecisions are due to arithmetic
rounding errors introduced by the linear solver. To ensure that

the reactions that you are adding can indeed create growth, execut-
ing the model in solving mode is advisable (i.e., without the
try-reactions section, once the new reactions are added, to confirm
that growth is truly obtained).

Acknowledgments

We would like to thank Markus Krummenacker for the template
biomass generation procedure.

References
1. Karp PD, Latendresse M, Paley SM, Krummenacker M, Ong QD, Billington R,
Kothari A, Weaver D, Lee T, Subhraveti P, Spaulding A, Fulcher C,
Keseler IM, Caspi R (2016) Pathway tools version 19.0 update: software
for pathway/genome informatics and systems biology. Brief Bioinform
17:877–890
2. Mahadevan R, Edwards JS, Doyle FJ (2002) Dynamic flux balance analysis
of diauxic growth in Escherichia coli. Biophys J 83:1331–1340
3. Mahadevan R, Schilling CH (2003) The effects of alternate optimal
solutions in constraint-based genome-scale metabolic models. Metab
Eng 5:264–276
4. Dandekar T, Fieselmann A, Majeed S, Ahmed Z (2014) Software
applications toward quantitative metabolic flux analysis and modeling.
Brief Bioinform 15:91–107
5. Lakshmanan M, Koh G, Chung BK, Lee DY (2014) Software applications
for flux balance analysis. Brief Bioinform 15:108–122
6. Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A,
Haraldsdóttir HS, Wachowiak J, Keating SM, Vlasov V, Magnusdóttir S,
Ng CY, Preciat G, Žagare A, Chan SHJ, Aurich MK, Clancy CM, Modamio J,
Sauls JT, Noronha A, Bordbar A, Cousins B, Assal DCE, Valcarcel LV,
Apaolaza I, Ghaderi S, Ahookhosh M, Guebila MB, Kostromins A,
Sompairac N, Le HM, Ma D, Sun Y, Wang L, Yurkovich JT, Oliveira MAP,
Vuong PT, Assal LPE, Kuperstein I, Zinovyev A, Hinton HS, Bryant WA,
Artacho FJA, Planes FJ, Stalidzans E, Maass A, Vempala S, Hucka M,
Saunders MA, Maranas CD, Lewis NE, Sauter T, Palsson BØ, Thiele I,
Fleming RMT (2019) Creation and analysis of biochemical
constraint-based models using the COBRA Toolbox v.3.0. Nat Protoc
14:639
7. Latendresse M, Karp P (2018) Evaluation of reaction gap-filling
accuracy by randomization. BMC Bioinformatics 19:53
8. Karp PD, Weaver D, Latendresse M (2018) How accurate is automated gap
filling of metabolic models? BMC Syst Biol 12:93
9. Caspi R, Billington R, Fulcher CA, Keseler IM, Kothari A,
Krummenacker M, Latendresse M, Midford PE, Ong Q, Ong WK, Paley S,
Subhraveti P, Karp PD (2018) The MetaCyc database of metabolic
pathways and enzymes. Nucleic Acids Res 46:D633–D639
Chapter 13

Application of the Metabolic Modeling Pipeline in KBase
to Categorize Reactions, Predict Essential Genes,
and Predict Pathways in an Isolate Genome
Benjamin H. Allen, Nidhi Gupta, Janaka N. Edirisinghe, José P. Faria,
and Christopher S. Henry

Abstract
The DOE Systems Biology Knowledgebase (KBase) platform offers a range of powerful tools for the
reconstruction, refinement, and analysis of genome-scale metabolic models built from microbial isolate
genomes. In this chapter, we describe and demonstrate these tools in action with an analysis of isoprene
production in the Bacillus subtilis DSM genome. Two different methods are applied to build initial
metabolic models for the DSM genome, then the models are gapfilled in three different growth conditions.
Next, flux balance analysis (FBA) and flux variability analysis (FVA) techniques are applied to both study the
growth of these models in minimal media and classify reactions within each model based on essentiality and
functionality. The models are applied with the FBA method to predict essential genes, which are then
compared to an updated list of essential genes obtained for B. subtilis 168, a very similar strain to the DSM
isolate. The models are also applied to simulate Biolog growth conditions, and these results are compared
with Biolog data collected for B. subtilis 168. Finally, the DSM metabolic models are applied to explore the
pathways and genes responsible for producing isoprene in this strain. These studies demonstrate the
accuracy and utility of models generated from the KBase pipelines, as well as exploring the tools available
for analyzing these models.

Key words Metabolic models, Draft models, Genome-scale reconstruction, Flux balance analysis,
DOE knowledgebase

1 Introduction

Metabolism is a central point of investigation for systems biologists,
biosystems engineers, and ecologists to gain insight into the pro-
ductive capacity of individual organisms and the relationships
between organisms within a community. Genome-scale metabolic
models are one widely used tool for studying metabolism within
biological systems with widespread applications, including (1) refin-
ing genome annotations by fitting with experimental data [1];

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_13, © Springer Science+Business Media, LLC, part of Springer Nature 2022


(2) predicting metabolic interactions within a microbial community
[2]; (3) identifying genetic targets to engineer and improve strain
productivity in industrial applications [3]; and (4) integration of
multi-omics data to predict active pathways in an individual cell or
community [4]. Metabolic models are particularly valuable for their
capacity to integrate data from thousands of individual functional
annotations to predict global phenotypes (e.g., growth conditions,
nutrient selection, byproduct production, gene essentiality) for an
individual cell or community. This capability means models are
particularly amenable to validating genome annotations against
experimentally determined gene essentiality data (e.g., transposon
sequencing (TN-seq)) and growth condition data (e.g., Biolog).
Metabolic model predictions are produced by running flux balance
analysis (FBA), a constraint-based simulation technique, whereby
the flux through each individual reaction in a model is predicted
such that mass balance is preserved around each intracellular
metabolite and a selected cellular objective function (typically bio-
mass production) is optimized [5]. Over the years, this core formu-
lation of FBA has been augmented by numerous additional
constraints, including constraints designed to enforce thermodynamic
feasibility of flux profiles [6] or predict the effects of cellular
regulation [7, 8]. The DOE Systems Biology Knowledgebase
(KBase) is a platform that integrates numerous tools for building
and analyzing metabolic models, along with many other bioinfor-
matics and ‘omics data pipelines [9]. In this chapter, we provide a
guide on how to use the modeling tools in KBase to (1) build a new
metabolic model; (2) gapfill a metabolic model; (3) classify model
reactions based on function and essentiality; (4) predict essential
genes; (5) predict growth conditions; (6) predict carbon and nitro-
gen limiting conditions; and (7) identify potential by-product bio-
synthesis pathways. Figure 1 shows this workflow in graphical form.
In this guide, we use Bacillus subtilis DSM as an exemplar genome
because it is of industrial interest due to its ability to produce the
valuable biofuel precursor molecule isoprene, and because it is quite
similar to Bacillus subtilis 168, for which an existing published
metabolic model [10] and extensive experimental data are readily
available [22].

1.1 Background Description of Metabolic Models

Before diving into our KBase workflow, it is useful to provide some
background information on genome-scale metabolic models.
These models are meant to be a representation of all the metabolic
pathways encoded by the genes of an organism. Thus, metabolic
models always include a stoichiometric matrix of metabolic reactions
and their associated biochemical compounds. Models also
include data specifying how each reaction depends upon the
genes encoding it, called gene-protein-reaction (GPR) associations.
The GPR associations in a model allow it to differentiate between
Application of the Metabolic Modeling Pipeline in KBase to Categorize. . . 293

Fig. 1 Metabolic modeling workflow in KBase starting with an isolate genome

cases where protein products from multiple genes form a complex
to cooperatively catalyze a reaction, and cases where protein products
from multiple genes can independently catalyze the same
reaction [11]. Finally, models typically include a biomass objective
function, which indicates the relative abundance of small molecule
building blocks (e.g., amino acids) that a cell must produce from
available nutrients in order to grow and survive.
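
As a minimal sketch of the stoichiometric matrix described above, the following Python snippet encodes a toy three-reaction network and checks the steady-state mass balance that flux balance analysis enforces on each intracellular metabolite. The reaction and metabolite names are hypothetical and are not taken from any KBase model.

```python
# Toy network (hypothetical names): uptake of A, conversion A -> B, biomass drain on B.
metabolites = ["A", "B"]
reactions = ["EX_A", "A_to_B", "bio1"]

# Stoichiometric matrix S: rows are metabolites, columns are reactions.
# Negative entries consume a metabolite; positive entries produce it.
S = [
    [1, -1, 0],   # A: produced by EX_A, consumed by A_to_B
    [0, 1, -1],   # B: produced by A_to_B, consumed by bio1
]


def net_production(S, fluxes):
    """Return the net production rate of each metabolite (S times v)."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in S]


def is_mass_balanced(S, fluxes, tol=1e-9):
    """A flux profile is at steady state when every net rate is zero."""
    return all(abs(rate) <= tol for rate in net_production(S, fluxes))


balanced = [2.0, 2.0, 2.0]     # every metabolite is made as fast as it is used
unbalanced = [2.0, 1.0, 1.0]   # A accumulates at 1 unit per unit time

print(is_mass_balanced(S, balanced))     # True
print(net_production(S, unbalanced))     # [1.0, 0.0]
```

Only flux profiles that pass this check are candidate FBA solutions; the objective function then selects among them.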

2 Materials

KBase is a knowledge creation and discovery environment designed
for both biologists and bioinformaticians. KBase integrates a variety
of data and analysis tools into an easy-to-use platform that
leverages scalable computing infrastructure to perform sophisticated
systems biology analyses. KBase is a freely available and
developer-extensible platform that enables scientists to analyze
their own data within the context of public data and share findings
across the system. Work done in KBase is saved in interactive notebooks
called Narratives, which can be shared with other researchers
to give them access to data and analysis to enhance reproducibility.
By default, all data and analysis in KBase are kept private until a user
chooses to share it with other KBase users. Anyone can create a free
account and begin using KBase by navigating to http://kbase.us in
Chrome, Safari, or Firefox browsers.

3 Methods

3.1 Model Reconstruction from Microbial Genomes

Before flux balance analysis can be used to study the metabolic
capabilities of an organism (or microbiome), a metabolic model
must first be constructed for that organism. The programs for
building these models based on genome sequencing data are
described in detail in previous publications [12, 13], so only a
brief description will be included here. KBase includes two apps
for constructing new genome-scale metabolic models for a micro-
bial genome. The first of these apps, called Build Metabolic Model,
constructs a new model from scratch based solely on the SEED-
based functional annotations found in the genome input into this
app. Only genomes annotated externally to KBase by RAST [14] or
PATRIC [15] or annotated within KBase by the Annotate Micro-
bial Assembly or Annotate Microbial Genome apps (see Fig. 2) will
have SEED annotations [16].
If a genome comes from any other source (e.g., IMG, Gen-
Bank, RefSeq), it is essential that the genome be first annotated by
one of the aforementioned mechanisms prior to building a model
of the genome using the Build Metabolic Model app (Fig. 3). This
app maps reactions to genes in the genome based on the annotated
functions of those genes. These reactions include both intracellular
and transport reactions, and full GPR associations are constructed
for each reaction in the model. A biomass producing reaction is also
constructed for the organism, with a different biomass being used
for gram-negative and gram-positive bacterial genomes. Gapfilling
of the draft model is optional, although this feature is active by
default. The gapfilling step of this process is described later in this
chapter.

Fig. 2 Annotating a Bacillus subtilis DSM genome with RAST in KBase using the Annotate Microbial Genome
app; subsequent results table shown above

Fig. 3 Building a draft metabolic model with the Bacillus subtilis DSM genome using the Build Metabolic Model
app in KBase with Carbon-D-Glucose as the gapfilling media; subsequent results table shown above

Fig. 4 Propagating a metabolic model from one genome to another in KBase

One caveat to using a tool like the Build Metabolic Model app to
construct a new model directly from annotations is that such mod-
els often require significant manual curation and refinement before
they can be truly used in a predictive capacity. A second model
reconstruction app in KBase, called Propagate Model to New
Genome (see Fig. 4), seeks to overcome this drawback by propagat-
ing an already curated and refined model published for a taxonomi-
cally similar genome to a genome of a new organism of interest.
This propagated model will share any reactions from the original
model for which corresponding orthologous genes could be found
between the original and new genome. The gene orthology rela-
tionships used by this app are generated by another app called
Compare Two Proteomes (see Fig. 5), which uses BLAST to compute
bidirectional-best-hits between the proteins in two input genomes.
If an orthologous gene pair is found, the annotated functions
assigned to the pair are ignored when propagating a model. Instead
the ortholog in the input genome inherits the reaction associations
of its homologous partner in the published model being propa-
gated. Optionally, this propagated model can also carry over any
gapfilled reactions from the original model. The propagated model
also shares reaction directionalities, full GPR rules (when
corresponding genes exist), and biomass reactions with the original
model. Refining these components of the model is the major target
for manual curation that requires significant effort, so propagating

Fig. 5 Comparing proteomes of the Bacillus subtilis DSM genome and the Bacillus subtilis 168 genome
and analyzing corresponding gene pairs using the Compare Two Proteomes app in KBase; subsequent synteny
map generated by the app shown above

this information from existing models expedites the process of
drafting the new models by transposing a plausible representation
of metabolic capacity. Often the propagation app will produce a
superior model when the organism of interest is close to an existing
published metabolic model.
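
The propagation logic described above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the implementation used by the Compare Two Proteomes or Propagate Model to New Genome apps: the gene IDs, reaction IDs, and similarity scores are hypothetical, and full GPR logic (complexes versus isozymes) is reduced to a flat gene list per reaction.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {a: b for a, b in a_best.items() if b_best.get(b) == a}


def propagate(reaction_genes, orthologs):
    """Carry over each reaction that has at least one gene with an ortholog."""
    propagated = {}
    for rxn, genes in reaction_genes.items():
        mapped = sorted(orthologs[g] for g in genes if g in orthologs)
        if mapped:
            propagated[rxn] = mapped
    return propagated


# Hypothetical similarity scores between reference (r*) and new (n*) genes.
ref_vs_new = {("r1", "n1"): 500, ("r1", "n2"): 80,
              ("r2", "n2"): 300, ("r3", "n1"): 120}
new_vs_ref = {("n1", "r1"): 495, ("n1", "r3"): 118, ("n2", "r2"): 310}

orthologs = bidirectional_best_hits(ref_vs_new, new_vs_ref)
reference_model = {"rxnA": ["r1"], "rxnB": ["r2", "r3"], "rxnC": ["r3"]}

print(orthologs)                               # {'r1': 'n1', 'r2': 'n2'}
print(propagate(reference_model, orthologs))   # {'rxnA': ['n1'], 'rxnB': ['n2']}
```

Note that the annotated functions of the genes never enter the calculation; only the orthology mapping does, which mirrors the behavior described in the text.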
Users can import new published models into KBase for this
purpose as needed, but this can be a time-consuming process (see
Note 1). Several existing published models have already been
imported into KBase and are available for immediate use. These
imported models are being gathered in a new KBase organization

page: https://narrative.kbase.us/#org/published-metabolic-models.
Check this page for existing published models before
attempting to import a published model from scratch. When a
published model is imported into KBase using the Import model
SBML from web and Integrate Imported Model into KBase Name-
space apps, a default media formulation for the model is automati-
cally generated based on the exchange flux constraints in the model
SBML file. This custom media is interoperable with any KBase
model and can be readily applied for gapfilling and flux analysis of
a propagated model. Whether or not the propagated model can
grow using this media can indicate if the propagated genome has
lost significant metabolic functionality.
Both of these model reconstruction methods were applied to
produce new genome-scale models for the exemplar genome
selected for this chapter, Bacillus subtilis DSM (see Note 2). The
subsequent models were tested using a wide range of FBA-based
methods, providing insights into how propagated and draft models
differ in performance. To build the propagated model, the Propagate
Model to New Genome app was applied to an existing published
model of B. subtilis 168 called iBsu1103V2 [10]. This model
was selected because the DSM and 168 strains of B. subtilis are
extremely similar (4161 distinct genes overlapping). Indeed, these
genomes were so exceptionally close that this example represents
the best-case scenario for how propagated models will perform.
Biochemistry data that represent prokaryotic models divide
reactions into intracellular (cytosol) and extracellular compartments.
Unlike prokaryotic cells, eukaryotic cells host organelles
that each play a unique physiological role in the cell. Therefore,
eukaryotic models are constructed as multi-compartmentalized
models that represent the metabolic processes in each organelle.
Additional transport reactions are added to facilitate the exchange
of compounds between the subcellular compartments of the meta-
bolic reconstruction, including the extracellular space and cytosolic
compartment.
The Build Metabolic Model app was also applied to build a
brand-new draft model of the DSM genome. The results from the
application of these two model reconstruction apps are shown in
Table 1.
Significant differences can be seen between the draft and pro-
pagated model. The propagated model had more reactions and
more associated genes. It also had more irreversible reactions,
reflecting its inheritance of more carefully curated reversibility con-
straints in the iBsu1103V2 model. The gapfilling differences will be
discussed later in the gapfilling section of this chapter.

3.2 Gapfilling a Metabolic Model

One of the aspects of flux balance analysis (FBA) that makes it
challenging to apply to many biological and physical systems is
that the approach is mathematically unforgiving. If a metabolic

Table 1
Statistics from the reconstruction of models for the B. subtilis DSM genome

                          Draft model   Propagated model   iBsu1103V2 model
Compounds (transported)   1264 (138)    1382 (247)         1382 (247)
Reversible reactions      725           841                841
Irreversible reactions    607           606                606
Genes                     999           1096               1098

model contains a single gap in any critical pathway, this will render
the pathway, and probably the entire model, nonfunctional. Iden-
tifying and eliminating gaps within genome-scale models can be
particularly difficult, as they are composed of thousands of reac-
tions, hundreds of pathways, and high tens of biomass components.
Furthermore, the fidelity of genome-scale models is commensurate
with the current state of sequencing, assembly, gene calling, and
functional annotation technology. A genome-scale model will
always contain gaps corresponding to those in the underlying tech-
nologies. Sometimes genes have incorrect sequences and transla-
tions due to errors in underlying sequencing; often genes will be
missing due to errors in assembly or gene calling; and most commonly,
genes are incorrectly annotated or left entirely unannotated
due to gaps in knowledge. All of these error sources lead to
many tens of gaps in a typical genome-scale model. In the case of
some reactions, the gene responsible for the reaction remains
unknown, meaning the reaction will always be missing in initial
draft models.
Fortunately, numerous tools and methods have emerged to
greatly simplify the process of finding and filling these gaps in a
genome-scale model. Optimization-based formulations were initi-
ally proposed to solve this problem by adding a minimal set of
reactions from a database such that the model would be capable
of producing biomass (or satisfying some other cellular objective
function) [17, 18]. These approaches have been improved over
time, primarily with respect to reducing the runtime of these algo-
rithms [19, 20]. Today, models can be gapfilled to eliminate the
missing reactions and enable the production of biomass in just a few
minutes. The Gapfill Metabolic Model app (see Fig. 6) in KBase
makes this process very easy. This app enables a user to select a
model for gapfilling, select a growth condition of interest, and
select a target reaction to enable flux through. By default, the target
reaction is the biomass objective function of the model, but the
method is equally applicable to any other objective function. The
growth condition is defined in a KBase Media object, which indi-
cates all the compounds that may be consumed or excreted within a

Fig. 6 Gapfilling missing reactions in a draft metabolic model of Bacillus subtilis DSM using the Gapfill
Metabolic Model app in KBase; some of the resulting additional reactions are shown in the results table below


Table 2
Comparison of gapfilling required in a variety of growth conditions

Condition               Draft model   Propagated model
Glucose minimal media   51            35
LB media                46            31
Complete media          43            30

particular environment, as well as establishing limits on the rate at
which these compounds may be consumed/excreted. Media
objects may be uploaded, selected from a public database of over
500 existing formulations, or edited from another existing formu-
lation using the Edit Media app (see Note 3).
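
To make the idea of gapfilling concrete, the sketch below finds a smallest set of database reactions that lets a toy model reach a target compound from a given media. It is a deliberately simplified stand-in: it uses a reachability check and brute-force search rather than the optimization-based formulations cited above, and it is not the algorithm used by the Gapfill Metabolic Model app. All reaction and compound names are hypothetical.

```python
from itertools import combinations


def producible(reactions, media):
    """Expand the set of available compounds until no reaction adds more."""
    available = set(media)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


def gapfill(model, database, media, target):
    """Return a smallest set of database reactions that makes target producible."""
    candidates = sorted(r for r in database if r not in model)
    for size in range(len(candidates) + 1):
        for added in combinations(candidates, size):
            trial = dict(model)
            trial.update({r: database[r] for r in added})
            if target in producible(trial, media):
                return list(added)
    return None  # no combination of database reactions closes the gap


# Each reaction maps to (substrates, products). The draft model lacks rxn2.
model = {"rxn1": (["glucose"], ["g6p"]),
         "rxn3": (["pyruvate"], ["biomass"])}
database = {"rxn2": (["g6p"], ["pyruvate"]),
            "rxn4": (["g6p"], ["ribose"]),
            "rxn5": (["ribose"], ["pyruvate"])}

print(gapfill(model, database, {"glucose"}, "biomass"))              # ['rxn2']
print(gapfill(model, database, {"glucose", "pyruvate"}, "biomass"))  # []
```

The second call illustrates why richer media tend to need less gapfilling: when the media supplies the pathway intermediate directly, no reactions have to be added.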
As a demonstration, the gapfilling tool in KBase was applied to
both our draft and our propagated model, using three increasingly
rich Media conditions: (1) glucose minimal media, (2) LB media,
and (3) complete media (Table 2). The results of this analysis show
that less gapfilling is typically needed in richer media formulations,
as incomplete biosynthesis pathways may be bypassed by simply
importing the pathway end-product.

Comparing the draft and propagated model, we see that the
propagated model required less gapfilling to permit growth on each
of our selected growth conditions.

3.3 Flux Balance Analysis and Media in KBase

Once the gapfilling step is complete, a model is ready to be analyzed
using flux balance analysis (FBA). In flux balance analysis, one flux
profile in a selected model is predicted such that a specified cellular
objective is optimized. The most common objective used is the
maximization of biomass flux. However, many other objectives are
possible, including maximization of ATP production, maximized
production of entropy, maximum product biosynthesis, or maxi-
mum carbon efficiency [5]. By default, the constraints that govern
the fluxes in FBA include bounds on each individual reaction flux
(typically defined based on thermodynamic reversibility of a reac-
tion), mass balance constraints on each metabolite, and uptake and
excretion constraints on each extracellular metabolite (defined in
KBase from the Media objects described in the gapfilling section).
Many other constraints are possible, including constraints based on
protein allocation [7], constraints enforcing thermodynamic feasi-
bility of all reactions [10], and constraints defined from transcrip-
tomic data [7, 8]. In turn, FBA produces numerous insights,
predicting active pathways during growth in a particular environ-
ment, identifying essential genes and reactions, predicting auxotro-
phy, and testing the impact of strategies to improve strain
productivity.
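
The default constraints listed above correspond to the standard linear programming form of FBA, written out here for reference. The symbols are our own notation and do not appear in the chapter: S is the stoichiometric matrix, v is the vector of reaction fluxes, c is a weight vector that selects the objective (for example, a 1 for the biomass reaction and 0 elsewhere), and the lower and upper bounds encode reaction reversibility together with the uptake and excretion limits taken from the Media object.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for every reaction } j.
\end{aligned}
```

The equality constraint is the mass balance on each metabolite; the additional constraint types mentioned in the text would appear as further rows or bounds in this same problem.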
In KBase, FBA is run using the Run Flux Balance Analysis app
(see Fig. 7). As with the KBase gapfilling app, the user must specify
the model of interest and provide a Media object that defines the
constraints on the uptake and excretion of nutrients allowed in the
model (see Note 4). It is important to understand the power and
utility of the Media object in KBase. Of course, one can use the
Media object to set limits on the uptake of nutrients (by setting a
limited maximum flux on a media compound), but one can also use
the object to: (1) prevent the excretion of a compound by setting its
minimum flux to zero, (2) force the excretion of a compound by
setting its maximum flux to a negative value, and (3) force the
uptake of a compound by setting its minimum flux to a positive
value. This is particularly useful when applied in concert with the
gapfilling app, as these strategies enable a user to ensure that a
model is capable of consuming or excreting a set of compounds
of interest. The FBA app in KBase also has many advanced parameters,
which allow users to specify sets of reactions and genes to
knock out; specify supplemental compounds to add to media; set
upper limits on carbon, nitrogen, phosphate, and sulfate uptake;
identify expression data to use in constraining fluxes; specify custom
flux bounds on any model reaction; and select an objective to
optimize.
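
The four uses of Media flux bounds described above can be summarized in a small helper. This is a sketch for illustration only: the function name is hypothetical, and the sign convention (positive flux is uptake, negative flux is excretion) is an assumption inferred from the description in the text rather than taken from the KBase Media specification.

```python
def describe_media_bounds(min_flux, max_flux):
    """Describe what a (min_flux, max_flux) pair on a media compound permits.

    Assumed sign convention: positive flux is uptake, negative is excretion.
    """
    if min_flux > max_flux:
        raise ValueError("min_flux must not exceed max_flux")
    if min_flux > 0:
        return "uptake forced"
    if max_flux < 0:
        return "excretion forced"
    uptake = "uptake allowed" if max_flux > 0 else "uptake blocked"
    excretion = "excretion allowed" if min_flux < 0 else "excretion blocked"
    return f"{uptake}, {excretion}"


print(describe_media_bounds(-100, 5))    # uptake allowed, excretion allowed
print(describe_media_bounds(0, 5))       # uptake allowed, excretion blocked
print(describe_media_bounds(-100, -1))   # excretion forced
print(describe_media_bounds(1, 5))       # uptake forced
```

Combining a forced uptake or forced excretion bound with gapfilling is what lets a user require that the gapfilled model can actually consume or secrete a compound of interest.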

Fig. 7 Running Flux Balance Analysis app in KBase on a draft metabolic model built from the Bacillus
subtilis DSM genome; (a) parameter for enabling Flux Variability Analysis; (b) parameter for simulating single
knockouts

As an example, the FBA app was applied to maximize biomass
production in our two DSM models while simulating growth in
both minimal and complete media. As part of this study, the app
was used to explore the impact of carbon-source and nitrogen
limitation on optimized biomass production rates (Table 3). One
challenge in attempting to compare growth rates between glucose
and complete media is that complete media allows for the uptake of
far more carbon than glucose minimal media. To constrain this and

Table 3
Results from FBA run in minimal and complete media, measured in mmol/g CellDryWeight h

Simulation type                                Draft model   Propagated model
Glucose minimal media                          0.71          0.73
Complete media with limited carbon             0.80          0.84
Minimal nitrogen constraint in minimal media   0.69          0.73
50% nitrogen constraint in minimal media       0.36          0.37

Fig. 8 Setting maximum uptake values in the advanced parameters of Run Flux Balance Analysis

ensure that the simulations in both conditions are carbon limited at
equal uptake values, the Max carbon uptake parameter was set to
30 mmol/g CDW h in all simulations (see Note 5).
Next, the same FBA app was used to observe the impact of
nitrogen limitation on the simulated growth rates of the DSM
models. In order to simulate the models in nitrogen-limited con-
ditions, it was first necessary to determine the minimum amount of
nitrogen the models required to grow optimally. This was accom-
plished by examining the minimum flux for NH3 uptake in the
glucose minimal media simulations. Because this is the minimum
flux for 10% of the optimal growth value, the value was multiplied
by ten (see Note 6). Next, the Max nitrogen uptake parameter was
set slightly below this value (see Fig. 8), and FBA was rerun to
verify that the objective declined. Finally, we set the nitrogen uptake
limit to 50% of the limiting value to find the impact of this limitation on
biomass production. Here a more profound impact is seen on
growth rates in all models. These results (Table 3) demonstrate how

the FBA solutions between draft models and propagated models
can be similar qualitatively (higher growth on complete media),
while providing different quantitative values for biomass
production.

3.4 Flux Variability Analysis

One of the challenges in interpreting results from flux balance
analysis is the possibility of multiple equivalently optimal solutions.
In effect, this means that the flux profile reported by a particular
flux balance analysis simulation is actually only one of many possi-
ble, equally optimal, solutions. As a result, one cannot assume that a
reaction is essential for growth just because it has a non-zero flux
during a single FBA simulation. Similarly, one cannot assume a
reaction will be inactive in a specific condition just because it has a
flux of zero in a single FBA simulation. Flux variability analysis
(FVA) offers a mechanism for better characterizing the behavior
of each reaction in a model in a particular growth condition, despite
the existence of alternative equivalent optima [21]. This is accom-
plished by fixing the primary objective of the initial FBA simulation
to be greater than or equal to some minimal value (KBase uses 10%
of the optimal objective value as a default). Next, the flux through each
individual reaction is independently minimized and maximized. In the
FBA solution table (see Fig. 9), a reaction with a negative maximum
value is essential in the reverse direction; a reaction with a positive
minimum value is essential in the forward direction; a reaction with
a maximum and minimum value both fixed at zero is said to be
blocked; and all other reactions are “variable” or “optional,” meaning
they are capable of carrying flux in a particular condition but are
not essential. FVA is run by default in all KBase FBA simulations,
and the maximum and minimum flux values and associated classi-
fications are all reported in the FBA results table generated by the
Run Flux Balance Analysis app (see Fig. 7a).
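
The classification rules above translate directly into code. The sketch below assigns a class to a reaction from its FVA minimum and maximum flux. The essential and blocked rules follow the text; the split of the remaining reactions into positive variable, negative variable, and variable uses class names that appear in the FVA results, but the exact thresholds (and the numerical tolerance) are our assumption rather than a documented KBase rule.

```python
def classify_reaction(min_flux, max_flux, tol=1e-9):
    """Classify a reaction from its FVA flux range."""
    if min_flux > tol:
        return "positive essential"   # must carry forward flux
    if max_flux < -tol:
        return "negative essential"   # must carry reverse flux
    if abs(min_flux) <= tol and abs(max_flux) <= tol:
        return "blocked"              # cannot carry any flux
    if min_flux >= -tol:
        return "positive variable"    # range is [0, positive]
    if max_flux <= tol:
        return "negative variable"    # range is [negative, 0]
    return "variable"                 # range spans zero in both directions


# Hypothetical flux ranges for illustration.
print(classify_reaction(0.5, 2.0))    # positive essential
print(classify_reaction(-3.0, -0.1))  # negative essential
print(classify_reaction(0.0, 0.0))    # blocked
print(classify_reaction(0.0, 4.0))    # positive variable
print(classify_reaction(-4.0, 0.0))   # negative variable
print(classify_reaction(-2.0, 3.0))   # variable
```

Because the ranges are computed with the objective held at or above a fraction of its optimum, "essential" here means essential for achieving at least that fraction of optimal growth.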
To demonstrate FVA in action in KBase, we revisit the FBA
solution tables from the previous analysis of the DSM genome
described in the FBA section. Recall, FBA was performed in two
media conditions: (1) complete and (2) glucose minimal media
(GMM). Also, recall these simulations were performed using both
the propagated DSM model and the newly constructed draft model.
The FVA results from these studies are shown in Table 4.
Note how models simulated in complete media always had
fewer essential reactions than when simulated in glucose minimal
media. This is because complete media contains many biologically
significant nutrients (e.g., amino acids, vitamins, nucleotides) that
the cell can simply import rather than having to synthesize these
compounds. In contrast, in minimal media, the cell is forced to
synthesize virtually all biomass components. Thus, the reactions
that are essential in minimal media but not in complete media are
primarily biosynthesis reactions. The complete media simulations
also had the fewest blocked reactions. This is because complete

Fig. 9 FBA solution table; the first reaction (bio1) is the biomass producing equation

Table 4
Results from essentiality classification of reactions using FVA

                     Draft model        Draft model     Propagated model   Propagated model
Class                (complete media)   (GMM media)     (complete media)   (GMM media)
Positive essential   210                276             260                367
Negative essential   28                 54              8                  43
Positive variable    299                181             406                239
Negative variable    61                 94              129                223
Variable             268                189             387                190
Blocked              466                538             257                385

media contains numerous nutrient sources that may be metabolized
by the cell for basic essential elements (C, S, P, N, O). In
contrast, minimal media contains only a single source of each of C, N, S, and P.

Table 5
Results from approximate functional classification of reactions using FVA

Model                             Draft model   Propagated model
Biosynthesis reactions            92            142
Catabolic degradation reactions   72            128
Nonfunctional reactions           466           257

Thus, reactions that are blocked in minimal media but variable in
complete media are primarily catabolic pathways used to break
down nutrients (see Note 7). In our results above, we see that our
propagated model of B. subtilis DSM has substantially fewer
blocked reactions and is thus likely a superior model. Following
this simple set of guidelines, we can construct a new and useful (but
only approximate) table by comparing the FVA results from mini-
mal and complete media as described above (see Table 5).
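
The comparison described above can be sketched as a simple rule applied to two sets of FVA classes, one from minimal media and one from complete media. The reaction IDs and class assignments below are hypothetical, and the rules are only the approximate guidelines given in the text: essential in minimal but not in complete media suggests biosynthesis, blocked in minimal but active in complete media suggests catabolic degradation, and blocked in complete media is treated as nonfunctional.

```python
ESSENTIAL = {"positive essential", "negative essential"}


def functional_classes(minimal, complete):
    """Approximate a functional role for each reaction from two FVA runs.

    minimal and complete map reaction IDs to the FVA class obtained in
    minimal and complete media, respectively.
    """
    roles = {}
    for rxn, in_complete in complete.items():
        in_minimal = minimal[rxn]
        if in_complete == "blocked":
            roles[rxn] = "nonfunctional"
        elif in_minimal in ESSENTIAL and in_complete not in ESSENTIAL:
            roles[rxn] = "biosynthesis"
        elif in_minimal == "blocked" and in_complete not in ESSENTIAL:
            roles[rxn] = "catabolic degradation"
        else:
            roles[rxn] = "other"
    return roles


# Hypothetical FVA classes for four reactions.
minimal = {"rxn_trp_synth": "positive essential", "rxn_maltose_deg": "blocked",
           "rxn_dead_end": "blocked", "rxn_glycolysis": "positive essential"}
complete = {"rxn_trp_synth": "positive variable", "rxn_maltose_deg": "variable",
            "rxn_dead_end": "blocked", "rxn_glycolysis": "positive essential"}

for rxn, role in functional_classes(minimal, complete).items():
    print(rxn, "->", role)
```

Counting the reactions assigned to each role gives a summary table of the same shape as the approximate functional classification.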

3.5 Predicting Essential Genes

Just as metabolic models can be used to classify model reactions
using FVA, these models can also be applied to classify model genes
as either essential or non-essential by simulating gene knockouts.
As with FVA, the Run Flux Balance Analysis app in KBase makes
this computation relatively easy to perform. Unlike FVA, this par-
ticular analysis is not always performed by default. To classify a
model’s genes, one must click on the checkbox labeled Simulate
All Single KO (see Fig. 7b). When this feature is active, the FBA will
start by maximizing the selected primary objective function (typi-
cally biomass production) as usual. Then, each individual gene
knockout is simulated for every gene in the model. In these simula-
tions, all the reactions that are exclusively associated with the
knocked-out gene (or exclusively associated with a multi-enzyme
complex involving the knocked-out gene) have their fluxes constrained
to zero. The primary objective is then re-optimized, and
the ratio of the knockout objective value to the wild-type objective
value is reported. For completely essential genes, this ratio will
be zero. However, many gene knockouts will result in a reduced
objective rather than a completely eliminated objective (see Note 8).
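
The knockout logic can be illustrated with a short sketch. Here each reaction's GPR is assumed to be stored as a list of alternative enzyme complexes, each complex being a set of genes; this representation, the gene and reaction names, and the objective values are all hypothetical. A reaction is disabled only when the knocked-out gene appears in every complex that could catalyze it.

```python
def disabled_reactions(gprs, knocked_out):
    """Return reactions that lose every catalyzing complex after a knockout."""
    disabled = []
    for rxn, complexes in gprs.items():
        # Reactions without gene associations are never disabled by a knockout.
        if complexes and all(knocked_out in genes for genes in complexes):
            disabled.append(rxn)
    return sorted(disabled)


def classify_knockout(ko_objective, wild_type_objective, tol=1e-6):
    """Classify a gene by the ratio of knockout to wild-type objective."""
    ratio = ko_objective / wild_type_objective
    if ratio <= tol:
        return "essential"
    if ratio < 1 - tol:
        return "reduced growth"
    return "no effect"


gprs = {
    "rxn_complex": [{"geneA", "geneB"}],      # both genes needed together
    "rxn_isozymes": [{"geneA"}, {"geneC"}],   # either gene is sufficient
    "rxn_single": [{"geneC"}],
    "rxn_spontaneous": [],                    # no gene association
}

print(disabled_reactions(gprs, "geneA"))   # ['rxn_complex']
print(disabled_reactions(gprs, "geneC"))   # ['rxn_single']
print(classify_knockout(0.0, 0.7))         # essential
print(classify_knockout(0.5, 0.7))         # reduced growth
```

In a full analysis, the fluxes of the disabled reactions would be fixed at zero and FBA rerun to obtain the knockout objective that is passed to the second function.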
This gene essentiality prediction capability was demonstrated in
KBase using the same DSM models. Additionally, because the
B. subtilis DSM strain was shown to be extremely similar to
B. subtilis 168 strain, the known essential genes in B. subtilis 168
[22] may be translated to the corresponding genes in the DSM
genome and subsequently used to validate the DSM model predic-
tions. Next, the Run Flux Balance Analysis app was applied to
predict essential genes in defined rich LB media and in glucose
minimal media in each of our models (see Table 6).

Table 6
Evaluating accuracy of DSM models in predicting essential genes in LB media and minimal media

                                                          Draft model       Propagated model
Correctly predicted essential genes in LB media           149/177 (84.2%)   173/199 (86.9%)
Correctly predicted nonessential genes in LB media        723/822 (88.0%)   857/897 (95.5%)
Overall accuracy in LB media                              872/999 (87.3%)   1030/1096 (94.0%)
Correctly predicted essential genes in minimal media      149/211 (70.6%)   180/223 (80.7%)
Correctly predicted nonessential genes in minimal media   643/788 (81.6%)   736/873 (84.3%)
Overall accuracy in minimal media                         792/999 (79.3%)   916/1096 (83.6%)

Overall, substantially more accurate results are obtained from
the propagated DSM model when predicting essential genes. This is
to be expected, as the 168 model on which the propagated model
was based was refined to fit a similar gene essentiality dataset
[10]. Generally, this analysis also demonstrates the effectiveness of
metabolic models (even draft models) at accurately predicting
essential genes, and the ease with which these predictions can be
made in KBase. This capability of KBase has already been applied to
support the analysis of essential genes identified in TN-seq experi-
ments performed with Klebsiella pneumoniae KPPR1 [23].

3.6 Simulating Biolog Phenotype Profiles

In addition to predicting essential genes, models are also useful for
predicting what catabolic degradation pathways are present in an
organism, and as with essentiality predictions, a simple and widespread
experimental procedure exists to facilitate the validation of
these predictions. Specifically, Biolog phenotype arrays [24] and
similar technologies test the capacity of an organism to utilize a
wide range of carbon, nitrogen, phosphate, and sulfate sources.
This type of data may then be loaded into KBase as a Phenotype
Set object (see Fig. 10). Phenotype Sets specify experimental growth
values for a set of media formulations typically in binary growth/no
growth format (although continuous growth rate measurements
are also supported). Media formulations already exist in
KBase for nearly all of the common compound conditions tested in
Biolog phenotype arrays, although the Edit Media app in KBase can
be applied as needed to create new formulations. Once a Phenotype
Set is loaded, the Simulate Growth on Phenotype Data app (see
Fig. 11) in KBase enables the rapid simulation of biomass produc-
tion in a selected metabolic model in each growth condition
included in the Phenotype Set. This app then reports the accuracy
of the model in predicting biomass production in each growth
condition, in terms of correct positives and negatives, and false
positives and negatives.
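
The accuracy summary reported by this kind of comparison can be sketched as a simple tally of binary growth calls. The function below is an illustration only, not the app's code; the condition names and growth calls are hypothetical, and the abbreviations (CP, CN, FP, FN for correct positive, correct negative, false positive, false negative) are our own labels.

```python
def score_phenotypes(observed, predicted):
    """Tally agreement between observed and simulated growth/no-growth calls."""
    counts = {"CP": 0, "CN": 0, "FP": 0, "FN": 0}
    for condition, grew in observed.items():
        simulated = predicted[condition]
        if grew and simulated:
            counts["CP"] += 1       # growth observed and predicted
        elif not grew and not simulated:
            counts["CN"] += 1       # no growth observed or predicted
        elif simulated:
            counts["FP"] += 1       # growth predicted but not observed
        else:
            counts["FN"] += 1       # growth observed but not predicted
    total = sum(counts.values())
    counts["accuracy"] = (counts["CP"] + counts["CN"]) / total
    return counts


# Hypothetical growth calls on four carbon sources.
observed = {"D-glucose": True, "maltose": True,
            "citrate": False, "L-arabinose": True}
predicted = {"D-glucose": True, "maltose": False,
             "citrate": False, "L-arabinose": True}

print(score_phenotypes(observed, predicted))
# {'CP': 2, 'CN': 1, 'FP': 0, 'FN': 1, 'accuracy': 0.75}
```

False negatives such as the hypothetical maltose case point to a missing transporter or degradation pathway in the model, which is the main diagnostic value of this analysis.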

Fig. 10 PhenotypeSet data object imported into KBase

While this app is designed to test the ability of a
model to grow in a wide range of media conditions, it is important
to note that, particularly when applied to predicting growth in
conditions like those used in Biolog phenotype arrays, it fundamen-
tally acts as a mechanism to predict the presence and completeness
of a wide range of degradation pathways (with each media testing
for a different degradation pathway). It is almost always the lack of a
degradation pathway or transporter that prevents a model from
growing in a particular Biolog media condition. It is also important
to note that one does not need to have Biolog data for an organism
of interest to gain value from this app. Simply running a standard
Biolog array, regardless of specified experimental growth rates,
produces valuable information about the transport reactions and
degradation pathways a genome is annotated to have. When using
this app to test for growth on Biolog conditions, it is very impor-
tant to first gapfill your model to be able to grow on at least one
minimal media condition, typically glucose minimal media (unless
your organism lacks a glycolysis pathway). This is because all the
Biolog media formulations are minimal, and if your organism has
auxotrophic requirements for an amino acid or vitamin, it will not
be able to grow on any of the Biolog conditions. Even if your
genome has such requirements, it is still recommended to gapfill
on minimal media just for the sake of running the Biolog simula-
tions, which then permits your model to be tested for degradation
pathways. While this will result in the incorrect gapfilling of some

Fig. 11 Simulating growth of FBA models based on phenotype data using the Simulate Growth on Phenotype
Data app in KBase; (a) Add transporters for primary nutrients associated with all phenotype conditions


Table 7
Evaluating the accuracy of our models in predicting phenotype array growth

                                           Draft model       Propagated model
Correctly predicted growth conditions      69/169 (40.8%)    135/169 (79.9%)
Correctly predicted no-growth conditions   71/73 (97.3%)     68/73 (93.2%)
Overall accuracy                           140/242 (57.9%)   203/242 (83.9%)

biosynthesis pathways in auxotrophic organisms, this is viewed as an
acceptable tradeoff for this particular study, which is meant to test
for the existence of degradation pathways rather than synthesis
pathways. Of course, one should not use this incorrectly gapfilled
model for any other studies. Lastly, if one is specifically interested in
testing the presence and completeness of degradation pathways
using this app, the fact that the lack of a transporter will disrupt
growth can negatively impact the utility of this method for this
purpose. For this reason, this app has a feature called Add all
transporters (see Fig. 11a). If selected, this feature ensures that all
transport reactions associated with the primary nutrient in each
Biolog media are added to the model prior to testing for growth
in each condition. This ensures that growth will only fail due to
missing degradation pathways.
These capabilities were demonstrated in KBase by uploading a
set of Biolog phenotype data generated for the B. subtilis 168
genome. Our draft and propagated models were then applied to
simulate this phenotype data (see Table 7).
As in all previous studies, the propagated model performed
substantially better in this analysis, which is to be expected as the
model on which this model is based was optimized to fit experi-
mental Biolog data [10]. Note, models can be gapfilled in individ-
ual conditions to correct the gaps that cause incorrect no-growth
predictions in specific media formulations [17, 19].
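The percentages in Table 7 follow directly from the raw counts. As a minimal sketch (the function and variable names are illustrative and are not part of KBase), the overall accuracy is the number of correct growth and no-growth predictions divided by the total number of conditions tested:

```python
def phenotype_accuracy(correct_growth, total_growth, correct_no_growth, total_no_growth):
    """Return (growth, no-growth, overall) accuracy as fractions."""
    growth = correct_growth / total_growth
    no_growth = correct_no_growth / total_no_growth
    overall = (correct_growth + correct_no_growth) / (total_growth + total_no_growth)
    return growth, no_growth, overall


# Counts copied from Table 7.
draft = phenotype_accuracy(69, 169, 71, 73)
propagated = phenotype_accuracy(135, 169, 68, 73)

for name, (g, n, o) in (("draft", draft), ("propagated", propagated)):
    print(f"{name}: growth {g:.1%}, no-growth {n:.1%}, overall {o:.1%}")
```

Reporting the growth and no-growth rates separately is useful here because the propagated model gains on growth conditions (135 versus 69 of 169) while losing slightly on no-growth conditions (68 versus 71 of 73).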
A previously published analysis of Klebsiella pneumoniae
KPPR1 also applied the Simulate Growth on Phenotype Data app
in a comparative fashion, simulating Biolog phenotype datasets
from two different strains of Klebsiella pneumoniae and exploring
how variations in the associated genomes led to variations in the
observed phenotype data [23].

3.7 Simulating Metabolite Biosynthesis
The Biolog analysis demonstrates the utility of metabolic models to
rapidly test for the existence and completeness of degradation
pathways. Models can be used very similarly to test for the existence
and completeness of biosynthesis pathways as well. In this case, an
export reaction can be added to a model for any compound of
interest, and this export reaction can be maximized, while one
provides a Media formulation that contains the desired starting
points for the pathway of interest. If the resulting objective func-
tion is zero, then the biosynthesis pathway is missing or incomplete.
If the objective is greater than zero, then the resulting FBA solution
provides a list of mass balanced reactions required to produce the
compound of interest.
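The FBA-based test described above requires a linear programming solver. As a rough, solver-free illustration of the underlying question (can the target be reached from the supplied starting compounds?), the sketch below uses network expansion, a qualitative reachability technique that is swapped in here in place of FBA. It ignores stoichiometry and cofactor balance, so it can report a compound as producible in cases where FBA would return a zero objective. The reaction names and the heavily condensed toy network are invented for illustration.

```python
def reachable_compounds(reactions, seeds):
    """Network expansion: repeatedly fire every reaction whose substrates are all available.

    reactions: {name: (list of substrates, list of products)}
    seeds: compounds assumed to be supplied by the medium.
    Returns the set of reachable compounds and the reactions fired, in order.
    """
    available = set(seeds)
    used = []
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            if name not in used and set(substrates) <= available:
                used.append(name)
                available |= set(products)
                changed = True
    return available, used


# Hypothetical toy network, not taken from any KBase model.
toy = {
    "r1": (["pyruvate", "g3p"], ["dxp"]),
    "r2": (["dxp"], ["dmapp"]),
    "r3": (["dmapp"], ["isoprene"]),
    "r4": (["acetyl_coa"], ["mevalonate"]),  # never fires: acetyl_coa is not supplied
}
available, used = reachable_compounds(toy, ["pyruvate", "g3p"])
print("isoprene" in available)
print(used)
```

An empty path to the target here plays the same role as a zero objective value in the FBA test, but only the FBA solution guarantees that the reported reactions are mass balanced.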
While this approach works and can be run in KBase using a
combination of the Edit Model and Run Flux Balance Analysis
apps, KBase has another app that accomplishes the same task in a
single step. This app, called Predict
Metabolite Biosynthesis Pathway (see Fig. 12), enables a user to select
a model, a set of compounds of interest, and any additional
starting molecules of interest, and reports whether the compounds
can be produced. The app also outputs a simple, streamlined pathway
for each desired compound. Unlike graph-based
pathway search algorithms, this approach identifies branched
pathways without difficulty.
We started by gapfilling our models using the Gapfill
metabolic model app to ensure that both models are capable of
producing isoprene. This resulted in the addition of two reactions
to the draft model and propagated model. Next, we applied the
Predict Metabolite Biosynthesis Pathway app to propose a
specific pathway to synthesize isoprene from central carbon metabolites.
We selected isoprene in this case because the DSM strain of
B. subtilis is known for producing isoprene, and it is also an important
biofuel target. The results from this analysis are in Table 8.

Fig. 12 Predicting pathways for isoprene biosynthesis in the draft metabolic model built from the DSM genome
using the Predict Metabolite Biosynthesis Pathway app in KBase; the subsequent results table is shown above

Table 8
Isoprene biosynthesis pathway prediction using DSM models and FBA

Isoprene production overall reaction
  Draft model: (2) NADPH + NADH + CTP + ATP + (2) H+ + pyruvate + glyceraldehyde3-phosphate → (2) NADP + NAD + CMP + ADP + (2) PPi + CO2 + (2) H2O + isoprene
  Propagated model: (3) NADPH + CTP + ATP + (2) H+ + pyruvate + glyceraldehyde3-phosphate → (3) NADP + CMP + ADP + (2) PPi + CO2 + (2) H2O + isoprene

                               Draft model   Propagated model
Gapfilled reactions involved   1             2
Overall pathway length         8             9

4 Conclusion

In this chapter we have discussed in detail tools in KBase for the
reconstruction, gapfilling, and analysis of genome-scale metabolic
models for isolate microbial genomes. Two different model recon-
struction apps, Build Metabolic Model and Propagate Model to New
Genome, were demonstrated, and the two very distinct models
resulting from these apps were rigorously compared. These com-
parisons repeatedly showed the superior performance of the pro-
pagated model, although we once again stress that the strain of
interest used as an exemplar in this chapter, Bacillus subtilis DSM, is
extremely close to the Bacillus subtilis 168 genome on which the
published model we propagated was based. Thus, this comparative
analysis of the propagated versus the draft models was expected to
favor the propagated model heavily. Researchers should investigate
whether any metabolic models exist for organisms that are taxo-
nomically similar to their organism of interest in determining their
approach to using this pipeline. More generally, we note that both
models actually performed quite well in predicting essential genes
and predicting degradation pathways. Both models produced
essentially identical pathways for producing isoprene, our com-
pound of interest when working with Bacillus subtilis DSM.
Producing metabolic models is not only one of the core fea-
tures of genomic analysis in KBase, but also offers a point of entry
into KBase’s integrated approach to multi-omics research. One of
the most popular end-to-end workflows in KBase is to import
genomic sequencing reads, perform quality control, assemble con-
tigs, annotate genetic features, and produce these metabolic models
to evaluate an organism’s metabolic capacity. This workflow allows
researchers to generate hypotheses about both how to culture microorganisms
and their potential responses to environmental perturbations.
Additionally, it can be used to strategize approaches for optimizing
an organism for industrial microbiology purposes, such as produc-
ing a particularly valuable metabolite. KBase can also be used to
model the interactions between organisms in a community, given
its capability to merge multiple single-genome models together
into integrated metabolic models of the community. Validation of
metabolic models can be performed as well, using expression data
from RNA-seq, which can be compared to metabolic models to
identify pathways where expression and flux agree or conflict. Altogether,
these powerful and modular analysis workflows offer extensive
possibilities for hypothesis generation and discovery within a single online
system.
Online Resources: The following links demonstrate running this
analysis in KBase Narratives. Each Narrative can be copied and
rerun with user data to reproduce this process.
1. Reconstruction of Bacillus subtilis DSM models: https://narrative.kbase.us/narrative/ws.39067.obj.51
2. Using flux balance analysis to classify Bacillus subtilis DSM reactions: https://narrative.kbase.us/narrative/ws.39343.obj.1
3. Predicting DSM essential genes and Biolog growth phenotypes: https://narrative.kbase.us/narrative/ws.39346.obj.1
4. Isoprene pathway prediction in Bacillus subtilis DSM genome: https://narrative.kbase.us/narrative/ws.39541.obj.1

5 Notes

1. Importing models into KBase requires effort for several reasons.
First, imported models must be linked to a genome
within the KBase platform in order for apps like model propa-
gation to work properly. This is because KBase needs to have
protein sequence information in order to do propagation. It
can take time to identify the correct existing genome in KBase
or upload a new genome for this purpose. The genome identi-
fied must have gene IDs or gene aliases that match the IDs in
the model gene reaction associations. Second, to permit model
interoperability, the reaction and compound IDs in the model
are automatically translated to ModelSEED IDs to the greatest
extent possible. This conversion process is conservative and
requires some manual intervention to maximize the mapping
of IDs (the import and integration apps permit this). Finally,
occasionally, an imported model will not immediately function
properly on import, and some tweaking will be needed to fix
this. To minimize having multiple users upload the same published
model into KBase, a public KBase organization page has
been created for imported published models here:
https://narrative.kbase.us/#org/published-metabolic-models. Any
user is welcome to join this organization and add their own
model import narratives to the organization.
2. Models constructed for isolate genomes may also be merged
together to form community metabolic models, which can be
applied in a variety of ways to predict the behavior and capabilities
of multi-species consortia. This type of analysis is
described in detail in other recent publications [13, 25, 26].
3. The proper definition of media formulation is a crucial step for
model simulation. KBase provides over 500 defined media
formulations for users to choose from as part of KBase’s public
reference data. Users can also create their own custom media formulation
using the “Edit Media” app. In this app, an existing
media formulation is required as input, and users can specify
additions and removals to the existing media to create their
custom formulation. Alternatively, users can follow KBase’s
upload media documentation to design their custom media
formulation locally and upload it to KBase.
4. It is very important to note that the automated gapfilling
procedures do not substitute for manual curation. The reac-
tions suggested by gapfilling are analogous to zip codes rather
than street addresses. The reaction points out a missing feature
in the metabolic network, but it may be that another reaction is
actually what is truly missing. Nor do gaps always indicate a
missing reaction. In some cases, gaps appear to be present
because the model biomass reaction contains a compound
that it should not (e.g., ubiquinone in an anaerobic species).
In other cases, gaps appear to be present because an auxotro-
phic dependency of the model has not been accounted for and
the compound is not present in the media formulation in which
the gapfilling was performed. Generally, it is extremely impor-
tant to always carefully review gapfilling results to ensure that
they make biological sense. The process of choosing a medium
for the gapfilling operation is a departure from many FBA apps
contained within the COBRA Toolbox [27], where nutrient
uptake and excretion constraints are built into the model itself.
The separation of these constraints into media objects makes it
easier to run the same model with many different media con-
straints and to run multiple models in identical media to sup-
port comparison.
5. When the Max carbon uptake parameter was set to 30 mmol/g
CDW h for the simulations, both models grew slightly better in
complete media. It is expected that complete media would
outperform minimal media because the model can consume an
optimal combination of carbon sources rather than being con-
strained to just glucose.
6. When examining the minimum fluxes reported by the FVA
analysis in KBase for essential reactions and nutrients, you will
generally see a reported value that is ten times lower than the
flux the reaction or nutrient has at the optimal objective value.
This is because FVA is performed with the objective value
constrained to be greater than or equal to 10% of the optimal
value rather than the full optimal value. Thus, all minimum flux
values for essential nutrients and reactions will always be simi-
larly scaled down to 10% of the flux required to support optimal
objective values. This 10% scaling is done to avoid having
reactions or nutrients falsely labeled in FVA as essential when
they are only essential for optimal objective values. For exam-
ple, ATP synthase is typically only essential for optimal biomass
or ATP production. ATP and biomass can still often be pro-
duced without the ATP synthase reaction.
7. It is important to note that blocked reactions in complete
media will always be blocked and are thus non-functional in
the model. This is because the permissive nature of complete
media (allowing models to uptake any compound from the
media for which a transporter exists) enables models to exercise
every single functional pathway. Thus, a model with fewer
blocked reactions during growth on complete media is likely
a higher quality model.
8. It is important to note that not every gene associated with an
essential reaction will turn out to be essential. Often multiple
genes are associated with a single reaction, and while in many
cases, these multiple genes encode different essential subunits
of a multi-enzyme complex, they also often encode redundant
isozymes that are each capable of catalyzing a reaction indepen-
dently should one of the isozymes be knocked out. The meta-
bolic models account for these differences and simulate the
knockouts accordingly.
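The distinction drawn in Note 8 between enzyme complexes and isozymes is usually encoded as a Boolean gene-protein-reaction (GPR) rule: subunits of a complex are joined by AND, and isozymes by OR. The sketch below is a minimal illustration of that logic and is not KBase's implementation; the nested-tuple rule format and the gene names are assumptions made for the example.

```python
def reaction_active(rule, knocked_out):
    """Evaluate a GPR rule given a set of deleted genes.

    rule is either a gene ID string or a tuple ("and" | "or", sub_rule, ...).
    """
    if isinstance(rule, str):
        return rule not in knocked_out
    operator, *parts = rule
    results = [reaction_active(part, knocked_out) for part in parts]
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")


def essential_genes(rule, genes):
    """Genes whose single deletion inactivates the reaction."""
    return [g for g in genes if not reaction_active(rule, {g})]


# Hypothetical rules: a two-subunit complex, two isozymes, and a mix of both.
complex_rule = ("and", "subA", "subB")
isozyme_rule = ("or", "isoA", "isoB")
mixed_rule = ("or", ("and", "subA", "subB"), "isoC")

print(essential_genes(complex_rule, ["subA", "subB"]))
print(essential_genes(isozyme_rule, ["isoA", "isoB"]))
print(essential_genes(mixed_rule, ["subA", "subB", "isoC"]))
```

Under these assumptions, every subunit of the complex is essential for the reaction, whereas no single isozyme is. Whether loss of the reaction is lethal then depends on whether the reaction itself is essential in the chosen medium.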

Acknowledgments

This work is supported by the Office of Biological and Environmental
Research’s Genomic Science program within the US
Department of Energy Office of Science, under award numbers
DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-
00OR22725, and DE-AC02-98CH10886.

References
1. Kumar VS, Maranas CD (2009) GrowMatch: an automated method for reconciling in silico/in vivo growth predictions. PLoS Comput Biol 5:e1000308. https://doi.org/10.1371/journal.pcbi.1000308
2. Goldford JE, Lu N, Bajić D et al (2018) Emergent simplicity in microbial community assembly. Science 361:469–474. https://doi.org/10.1126/science.aat1168
3. Pharkya P (2004) OptStrain: a computational framework for redesign of microbial production systems. Genome Res 14:2367–2376. https://doi.org/10.1101/gr.2872004
4. Monk JM, Koza A, Campodonico MA et al (2016) Multi-omics quantification of species variation of Escherichia coli links molecular features with strain phenotypes. Cell Syst 3:238–251.e12. https://doi.org/10.1016/j.cels.2016.08.013
5. Orth JD, Thiele I, Palsson BØ (2010) What is flux balance analysis? Nat Biotechnol 28:245–248. https://doi.org/10.1038/nbt.1614
6. Henry CS, Broadbelt LJ, Hatzimanikatis V (2007) Thermodynamics-based metabolic flux analysis. Biophys J 92:1792–1805. https://doi.org/10.1529/biophysj.106.093138
7. Tournier L, Goelzer A, Fromion V (2017) Optimal resource allocation enables mathematical exploration of microbial metabolic configurations. J Math Biol 75:1349–1380. https://doi.org/10.1007/s00285-017-1118-5
8. Covert MW, Palsson BØ (2002) Transcriptional regulation in constraints-based metabolic models of Escherichia coli. J Biol Chem 277:28058–28064. https://doi.org/10.1074/jbc.M201691200
9. Arkin AP, Cottingham RW, Henry CS et al (2018) KBase: the United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36:566–569. https://doi.org/10.1038/nbt.4163
10. Henry CS, Zinner JF, Cohoon MP, Stevens RL (2009) iBsu1103: a new genome-scale metabolic model of Bacillus subtilis based on SEED annotations. Genome Biol 10:R69. https://doi.org/10.1186/gb-2009-10-6-r69
11. Thiele I, Palsson BØ (2010) A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 5:93–121. https://doi.org/10.1038/nprot.2009.203
12. Henry CS, DeJongh M, Best AA et al (2010) High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol 28:977–982. https://doi.org/10.1038/nbt.1672
13. Faria JP, Khazaei T, Edirisinghe JN et al (2016) Constructing and analyzing metabolic flux models of microbial communities. In: McGenity TJ, Timmis KN, Nogales B (eds) Hydrocarbon and lipid microbiology protocols. Springer, Berlin, pp 247–273
14. Aziz RK, Bartels D, Best AA et al (2008) The RAST server: rapid annotations using subsystems technology. BMC Genomics 9:75. https://doi.org/10.1186/1471-2164-9-75
15. Wattam AR, Brettin T, Davis JJ et al (2018) Assembly, annotation, and comparative genomics in PATRIC, the all bacterial bioinformatics resource center. In: Setubal JC, Stoye J, Stadler PF (eds) Comparative genomics. Springer, New York, NY, pp 79–101
16. Overbeek R, Olson R, Pusch GD et al (2014) The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 42:D206–D214. https://doi.org/10.1093/nar/gkt1226
17. Satish Kumar V, Dasika MS, Maranas CD (2007) Optimization based automated curation of metabolic reconstructions. BMC Bioinformatics 8:212. https://doi.org/10.1186/1471-2105-8-212
18. Reed JL, Patel TR, Chen KH et al (2006) Systems approach to refining genome annotation. Proc Natl Acad Sci 103:17480–17484. https://doi.org/10.1073/pnas.0603364103
19. Dreyfuss JM, Zucker JD, Hood HM et al (2013) Reconstruction and validation of a genome-scale metabolic model for the filamentous fungus Neurospora crassa using FARM. PLoS Comput Biol 9:e1003126. https://doi.org/10.1371/journal.pcbi.1003126
20. Latendresse M (2014) Efficiently gap-filling reaction networks. BMC Bioinformatics 15:225. https://doi.org/10.1186/1471-2105-15-225
21. Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng 5:264–276
22. Koo B-M, Kritikos G, Farelli JD et al (2017) Construction and analysis of two genome-scale deletion libraries for Bacillus subtilis. Cell Syst 4:291–305.e7. https://doi.org/10.1016/j.cels.2016.12.013
23. Henry CS, Rotman E, Lathem WW et al (2017) Generation and Validation of the iKp1289 metabolic model for Klebsiella pneumoniae KPPR1. J Infect Dis 215:S37–S43. https://doi.org/10.1093/infdis/jiw465
24. Bochner BR (2001) Phenotype microarrays for high-throughput phenotypic testing and assay of gene function. Genome Res 11:1246–1255. https://doi.org/10.1101/gr.186501
25. Henry CS, Bernstein HC, Weisenhorn P et al (2016) Microbial community metabolic modeling: a community data-driven network reconstruction: community data-driven metabolic network modeling. J Cell Physiol 231:2339–2345. https://doi.org/10.1002/jcp.25428
26. Song H-S, Nelson WC, Lee J-Y et al (2018) Metabolic network modeling for computer-aided design of microbial interactions. In: Chang HN (ed) Emerging areas in bioengineering. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp 793–801
27. Heirendt L, Arreckx S, Pfau T et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nat Protoc 14:639–702. https://doi.org/10.1038/s41596-018-0098-2
Chapter 14

Curating COBRA Models of Microbial Metabolism


Ali Navid

Abstract
Constraint-based reconstruction and analysis (COBRA) methods have been used for over 20 years to
generate genome-scale models of metabolism in biological systems. The COBRA models have been utilized
to gain new insights into the biochemical conversions that occur within organisms and allow their survival
and proliferation. Using these models, computational biologists can conduct a variety of different analyses
such as examining network structures, predicting metabolic capabilities, resolving unexplained experimen-
tal observations, generating and testing new hypotheses, assessing the nutritional requirements of a
biosystem and approximating its environmental niche, identifying missing enzymatic functions in the
annotated genomes, and engineering desired metabolic capabilities in model organisms. This chapter
details the protocol for developing curated system-level COBRA models of metabolism in microbes.

Keywords Systems biology, Genome-scale models, Constraint-based analysis, FBA, Metabolic networks

In silico modeling of complex biological processes has become a key
tool for system-level analyses of organisms. COBRA methods are
among the primary methods for developing genome-scale models of
metabolism. Such models have been developed for a large number
of organisms and have been used for a variety of purposes such as:
• Studying the global organization of metabolic fluxes (e.g., [1–5]).
• Assessing the robustness of cellular metabolism to genetic and environmental perturbations (e.g., [6–10]).
• Improving genome annotations (e.g., [11, 12]).
• Studying the factors affecting microbial evolution (e.g., [13–16]).
• Metabolic engineering (e.g., [17–22]).
• Metabolic trade-offs (e.g., [23–26]).
For a review of some of the applications of genome-scale metabolic
models, see the manuscript by Oberhardt et al. [27].

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_14, © Springer Science+Business Media, LLC, part of Springer Nature 2022

The aim of this chapter is to detail the procedure for developing
curated COBRA models of metabolism in microbes. It lists the
biological data that are required for network reconstruction while
also listing a number of tools that have been developed to automat-
ically generate a draft reconstruction. The chapter discusses how to
curate the model by (a) constraining it using available biological
data models, (b) filling the gaps in the network, and (c) validating
its predictions. Tips on troubleshooting malfunctioning models are
also included.

1 Introduction

Recently there has been significant progress in development of
methods for generating draft metabolic reconstructions and
COBRA models (e.g., see Chapters 12 and 13 in this book).
While these developments significantly accelerate the COBRA
model building process, the routine for developing high-quality
system-level models of metabolism is still very labor intensive.
Automatically generated models still require extensive manual cura-
tion to attain the quality and accuracy displayed by published
models. The amount of time required for generating a curated
system-level model of cellular metabolism can range between
3 months and a year. The length of time varies depending on
(a) how many people work on the project, (b) the size of the
microbial genome and quality of the annotation, and
(c) availability of experimental information to constrain and test
the model.

1.1 Constraint-Based Modeling
Readily available genomic information has led to a paradigm shift in
microbiology where detailed analyses of isolated cellular processes
have been replaced by system-level analyses of the organism as a
whole. Typical COBRA models forgo some level of detail (such as
insights into transient behavior of metabolites and enzyme-
substrate affinity) in order to gain a broader understanding of the
overall metabolic capabilities of a cell. The most successful of
COBRA approaches is Flux Balance Analysis (FBA) [28, 29]. FBA
models are based on the knowledge of the stoichiometry of meta-
bolic reactions that can easily be extracted from the annotated
genomes. The stoichiometry is used to develop a mathematical
reconstruction of the metabolic networks. These models also
require prior knowledge of an organism’s growth phenotypes and
related nutritional needs. The data is used to constrain cellular
growth and the uptake of nutrients and export of waste materials.
The constraints also limit the cellular energy metabolism to a
narrow set of possible catabolic pathways. Finally, the knowledge
of thermodynamics of reactions is used to constrain their
directions.
FBA models optimize a cellular task (an objective function)
while calculating a feasible steady state flux pattern for metabolic
reactions that would adhere to constraints imposed on the system
by mass balance, the structure of metabolic network, as well as
nutritional characteristics of the growth medium. The conventional
objective function is growth, although other choices are possible
depending on the selected environment of the cell [23]. For an
excellent introduction to FBA and its mathematical underpinnings,
see Gottstein et al. [30].
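The steady-state condition at the heart of FBA can be written as S v = 0 with lower <= v <= upper, where S is the stoichiometric matrix and v is the vector of reaction fluxes. The sketch below checks whether a candidate flux vector satisfies these constraints for a three-reaction toy network. The network, the bounds, and the function names are invented for illustration, and no optimization is performed; an LP solver would be needed to find the feasible flux vector that maximizes the objective.

```python
def mass_balance(reactions, fluxes):
    """Return the net production rate of each metabolite (the product S v)."""
    net = {}
    for rxn, stoich in reactions.items():
        for met, coeff in stoich.items():
            net[met] = net.get(met, 0.0) + coeff * fluxes[rxn]
    return net


def is_feasible(reactions, bounds, fluxes, tol=1e-9):
    """True if the fluxes are at steady state and respect the flux bounds."""
    balanced = all(abs(x) <= tol for x in mass_balance(reactions, fluxes).values())
    in_bounds = all(
        bounds[r][0] - tol <= fluxes[r] <= bounds[r][1] + tol for r in reactions
    )
    return balanced and in_bounds


# Toy network: nutrient uptake makes A, A is converted to B, B is drained as biomass.
reactions = {
    "uptake_A": {"A": 1},
    "A_to_B": {"A": -1, "B": 1},
    "biomass": {"B": -1},
}
bounds = {"uptake_A": (0, 10), "A_to_B": (0, 1000), "biomass": (0, 1000)}

steady = {"uptake_A": 10, "A_to_B": 10, "biomass": 10}
accumulating = {"uptake_A": 10, "A_to_B": 8, "biomass": 8}  # A accumulates
over_limit = {"uptake_A": 12, "A_to_B": 12, "biomass": 12}  # exceeds the uptake bound

print(is_feasible(reactions, bounds, steady))
print(is_feasible(reactions, bounds, accumulating))
print(is_feasible(reactions, bounds, over_limit))
```

In this toy case the uptake bound, which stands in for the growth medium, caps the biomass flux at 10, which is how medium composition constrains the optimal objective value.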

2 Materials

2.1 Annotated Genome
The most important organism-specific data necessary for
developing a genome-scale model of metabolism is the annotated
genome. Genome annotations are available from a number of
different sources (see Table 1). Some annotated genomes can be
found in databases dedicated to a specific model organism (e.g.,
EcoCyc [31] for E. coli). Most of the publicly available annotated
genomes can be found in databases such as Integrated Microbial
Genomes (IMG) [32] and EntrezGene [33], which contain large
collections of annotated genomes. The annotated genome provides
the modeler with a list of all proteins that can be translated from the
genome of an organism. These include enzymes which catalyze
metabolic reactions.
In case of novel organisms or if one wants to reannotate a
genome, the reader is encouraged to read Chapter 10 in this
book for a detailed protocol on how to annotate genomes. The
U.S. Department of Energy systems biology knowledgebase
(KBase) [34] has recently implemented a number of new apps
that allow a user to import, compare, and consolidate genome
annotations from multiple sources. This tool is an excellent source
for initiating development of genome-scale models (GSMs) since
relying on a single annotation tool will usually result in missing
gene-function assignments and ultimately, incomplete metabolic
network reconstructions [35].

2.2 Software
2.2.1 Automated Metabolic Network Reconstruction and Model Development
A number of free and commercial software packages are available for
reconstructing metabolic networks and developing draft FBA models.
These include:
• AutoKEGGRec [36].
• CarveMe [37].
• KBase [34]/Model SEED [38, 39] (see Chapter 13).
• Merlin [40].
• Pathway Tools (MetaFlux) [41] (see Chapter 12).
• RAVEN [42, 43].
• SuBliMinaL [44].

Table 1
Some of the databases commonly used for the development of genome-scale metabolic models

Database      References   Link
KEGG          [53, 54]     http://www.genome.jp/kegg
IMG           [32]         http://img.jgi.doe.gov
EcoCyc        [31]         http://ecocyc.org
BioCyc        [91]         http://biocyc.org
MetaCyc       [55]         http://metacyc.org
KBase         [34]         http://kbase.us
BRENDA        [51, 52]     http://www.brenda-enzymes.org
SEED          [38, 39]     http://www.theseed.org/wiki/Home_of_the_SEED
TransportDB   [56, 57]     http://www.membranetransport.org
NCBI Entrez   [92]         https://www.ncbi.nlm.nih.gov/

2.2.2 Model Editors
Once the constrained model has been generated, common text
editing programs such as vim (www.vim.org) or GNU Emacs
(www.gnu.org/software/emacs/) can be used for editing. However,
most of the programs developed for simulating FBA models
(e.g., COBRA toolbox [45–47]) include commands for easy
manipulation of models.

2.2.3 Linear Programming Solver
For most COBRA methods, such as FBA, one needs access to a
linear programming (LP) solver. Commercial programs such as
Cplex from IBM (www.ibm.com/analytics/cplex-optimizer), Gurobi
(www.gurobi.com/products/gurobi-optimizer), and Matlab
(www.mathworks.com/products/matlab.html), as well as free programs
such as GLPK (GNU Linear Programming Kit) (www.gnu.org/software/glpk)
and PCx (www.anl.gov/tcp/pcx-optimization-problem-solver)
can be used to simulate FBA models.

2.2.4 Simulation Toolboxes
A number of useful tools have been developed for analyzing the
models. For Matlab, a popular suite of programs called the COBRA
toolbox (opencobra.github.io/cobratoolbox/stable) [45–47] has
been developed (see Chapter 15). For those who do not have access
to Matlab, Python (cobrapy, https://opencobra.github.io/cobrapy/)
and Julia (cobra.jl, https://opencobra.github.io/COBRA.jl/stable)
versions are also available. Other tools such as
Pathway Tools and MetaFlux [41] (see Chapter 12) and RAVEN
[42, 43] can also be used. A large number of system-level analyses
can be conducted using these programs, such as in silico gene
deletion analyses, flux variability analysis [48], and dynamic
FBA [49].

3 Model Development

3.1 Draft Network Reconstruction
In the early days of COBRA model development, all network
reconstructions and subsequent mathematical formulations of the
models were done manually. However, during the last decade, a
number of useful tools have been developed to significantly accelerate
model development by automating the process of metabolic
network reconstruction and constraining the models (see Subheading
2.2.1). These initial drafts are not publication-quality models
and require a significant amount of additional manual refinement.
However, today a majority of researchers use these initial draft
models as the starting point for generating their new models.
The first step in building a genome-scale model of metabolism
in an organism is to obtain its annotated genome. The process of
similarity-based genome annotation is built upon comparing the
gene sequences in a recently sequenced genome to all other anno-
tated genomes. If there is statistically significant similarity between
the sequence of a query gene and a gene with a known function in
another organism, then the former gene is assigned the same func-
tion as that of the latter gene. The process is obviously more
complicated than this simple description and interested readers
can see Chapter 10 in this book for a detailed description.
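As a highly simplified sketch of this transfer step, the code below assigns a query gene the function of its best hit when the E-value of that hit falls below a cutoff. The hit-record layout, the gene names, and the 1e-10 cutoff are assumptions chosen for illustration; real annotation pipelines apply additional criteria such as alignment coverage and sequence identity.

```python
def transfer_annotations(hits, max_evalue=1e-10):
    """Assign each query gene the function of its most significant hit.

    hits: list of (query_gene, subject_function, evalue) tuples.
    Hits with an E-value above max_evalue are ignored.
    Returns {query_gene: function}.
    """
    best = {}
    for query, function, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (function, evalue)
    return {query: function for query, (function, _) in best.items()}


# Hypothetical similarity-search results.
hits = [
    ("gene_001", "pyruvate kinase (EC 2.7.1.40)", 1e-120),
    ("gene_001", "hypothetical protein", 1e-15),
    ("gene_002", "hexokinase (EC 2.7.1.1)", 0.5),  # too weak, ignored
]
annotations = transfer_annotations(hits)
print(annotations)
```

Genes without a sufficiently significant hit, such as gene_002 here, remain unannotated, which is one source of the gaps that later curation has to address.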
As explained in Subheading 2.1, annotated genomes can be
obtained from a number of different databases (see Table 1). The
annotated genome presents the modeler with a list of metabolic
enzymes that can be present in the target organism (see Note 1).
The function of each metabolic enzyme is usually denoted by an
Enzyme Commission (EC) or Transporter Classification (TC) number.
Once the annotated genome has been obtained, one has to
ensure that it matches the requirements of the draft model genera-
tor. For example, AutoKEGGRec [36] only accepts annotated
genomes in the KEGG format. For a long time, all KBase generated
draft models were based on RAST [50] annotations. Currently
efforts are underway to allow for use of annotation from multiple
sources for the generation of models in KBase.

3.2 Manual Curation The first step in refining the draft reconstruction is to re-examine
of the Draft the annotated genome and exclude dubious annotation (and asso-
Reconstruction ciated reactions). Some enzymatic functions will be erroneously
included in the list of available functionalities due to poor annota-
3.2.1 Wrong Annotations tions. As a result, it is essential that the modeler ensures that the
most complete annotation of the genome is used for the process of
curating the model. Combining annotations from multiple sources
could provide a more complete reconstruction of metabolic path-
ways when compared to results from a single source [35]. There-
fore, the modeler should examine and include system-level
information that is deposited on other metabolic databases such
326 Ali Navid

as BRENDA [51, 52], Kyoto Encyclopedia of Genes and Genomes
(KEGG) [53, 54], MetaCyc [55], TransportDB [56, 57], and
SEED (see Notes 2 and 3).
Annotated genomes are not static datasets and scientists con-
tinually reevaluate the functions assigned to various genes. Ensur-
ing that the latest version of an annotated genome is used for a draft
reconstruction will significantly ease the task of manually curating
draft networks.

3.2.2 Generic Reactions

Some draft metabolic network reconstructions might contain
generic reactions which include nonspecific terms such as peptide,
electron acceptor, or protein. An example of such a reaction is
R00056 from KEGG:
$$\text{Dinucleotide} + \text{H}_2\text{O} \leftrightarrow 2\ \text{Mononucleotide}.$$
These reactions have to be excised from the reconstruction
because they are too general and do not refer to any specific
component of the network.
In most metabolic reconstructions, some multistep processes
such as those catalyzed by large enzyme complexes (e.g., pyruvate
dehydrogenase and α-ketoglutarate dehydrogenase) or linear pathways
(e.g., fatty acid oxidation) have been combined into single
reactions for the sake of simplicity. To eliminate redundancy, it is
necessary to remove the composite reactions from the network. However,
prior to eliminating these reactions, one must ensure that they
do not serve a role in any other pathway.

3.2.3 Non-enzymatic Reactions

A number of important cellular metabolic reactions are not catalyzed
by enzymes. Their activity might be critical because they can
connect reactions that are essential for cellular growth and survival.
Therefore, one should add to the metabolic reconstruction only
those non-enzymatic reactions that have at least one metabolite
joining them to the rest of the network. This will ensure that a large
number of dead-end metabolites do not clutter the reconstruction.
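
The connectivity rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based network representation, the function names, and all reaction and metabolite identifiers are hypothetical, and every reaction is treated as irreversible.

```python
# Toy network: reaction -> {metabolite: stoichiometric coefficient}.
# Negative coefficients are consumed, positive ones are produced.
# All identifiers are hypothetical; all reactions are assumed irreversible.
network = {
    "EX_A": {"A": 1},          # uptake of A
    "R1": {"A": -1, "B": 1},
    "R2": {"C": -1, "D": 1},
    "EX_D": {"D": -1},         # secretion of D
}


def shares_metabolite(candidate, network):
    """True if the candidate reaction uses a metabolite already in the network."""
    known = {met for stoich in network.values() for met in stoich}
    return bool(known & set(candidate))


def dead_end_metabolites(network):
    """Metabolites that are only produced or only consumed."""
    produced, consumed = set(), set()
    for stoich in network.values():
        for met, coeff in stoich.items():
            (produced if coeff > 0 else consumed).add(met)
    return sorted(produced ^ consumed)


spontaneous_B_to_C = {"B": -1, "C": 1}  # non-enzymatic step joining R1 to R2
orphan_X_to_Y = {"X": -1, "Y": 1}       # shares no metabolite with the network

print(dead_end_metabolites(network))                   # gaps before curation
print(shares_metabolite(spontaneous_B_to_C, network))  # candidate worth adding
print(shares_metabolite(orphan_X_to_Y, network))       # would only add dead ends
```

In this toy case, adding the connected spontaneous reaction removes both dead-end metabolites, whereas the orphan reaction would introduce two new ones.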

3.2.4 Stoichiometry of Reactions

A large number of reactions listed on metabolic databases do not
accurately account for H+ and H2O. However, the rate of import
and export of these compounds greatly affects the energy balance of
the system. Therefore, it is essential to ensure the stoichiometric
fidelity of all reactions in a network as well as mass and energy
conservation.
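
A minimal sketch of an elemental and charge balance check is shown below, using the ATP hydrolysis reaction as the test case. The data structures and function name are assumptions made for illustration. The formulas and charges are the protonation states commonly used for these metabolites near neutral pH and should be verified against the database used for the reconstruction.

```python
from collections import Counter

# Metabolite -> (elemental composition, charge). Commonly used protonation
# states near neutral pH; verify against the database of your reconstruction.
formulas = {
    "atp": ({"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}, -4),
    "adp": ({"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}, -3),
    "pi": ({"H": 1, "O": 4, "P": 1}, -2),
    "h2o": ({"H": 2, "O": 1}, 0),
    "h": ({"H": 1}, 1),
}


def imbalance(stoich, formulas):
    """Return the net elements and charge of a reaction; empty dict if balanced."""
    totals = Counter()
    charge = 0
    for met, coeff in stoich.items():
        elements, z = formulas[met]
        for element, count in elements.items():
            totals[element] += coeff * count
        charge += coeff * z
    result = {el: n for el, n in totals.items() if n != 0}
    if charge != 0:
        result["charge"] = charge
    return result


balanced = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
missing_proton = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1}

print(imbalance(balanced, formulas))        # nothing left over
print(imbalance(missing_proton, formulas))  # one H and one charge unaccounted for
```

A non-empty result flags a reaction whose H+ or H2O accounting needs manual correction.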

3.2.5 Reaction Directionality

An important problem with some draft reconstructions is that most
reactions are classified as reversible. For example, a model might
predict that the exact same set of enzymes can catalyze both glycol-
ysis and gluconeogenesis. Additionally, the energetics of the system
will be drastically affected since some metabolic cycles might allow
for energy-free conversion of ADP to ATP. A recent study found
Curating COBRA Models of Microbial Metabolism 327

that nearly 95% of draft models generated by ModelSEED
contained such energy-generating cycles [58]. These inaccuracies
need to be addressed, since they could erroneously lead to signifi-
cantly higher growth rate predictions. Use of organism-specific
biochemical data to set the reaction directionality is ideal. However,
such information is rarely available.
Most textbooks of biochemistry describe the prevalent direc-
tionality for composite reactions of important and well-studied
pathways (see Note 4). In addition, accurate accounting of reaction
thermodynamics can eliminate this problem. Thermodynamics
have been applied to various pathway analyses [59–61], but due
to a dearth of thermodynamic data on metabolic reactions, a
system-level implementation has not been feasible. However,
thermodynamic data are becoming increasingly available. This, along with
use of group contribution methodology for estimation of thermo-
dynamic data [62, 63] makes it possible to develop models that
comply with the laws of thermodynamics. Group contribution
theory has been used extensively to study the feasibility of various
metabolic pathways [64–66]. The method has been used to calculate
$\Delta_f G'^{\circ}$ and $\Delta_r G'^{\circ}$ for the majority of the compounds and reactions
that are included in KEGG [63, 67]. Use of this information can
greatly reduce the possibility of assigning incorrect directionality to
a reaction.
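
As a rough sketch of how such estimates could inform directionality, the standard transformed Gibbs energy of a reaction can be computed from formation energies as $\Delta_r G'^{\circ} = \sum_i s_i \, \Delta_f G'^{\circ}_i$, where $s_i$ are the stoichiometric coefficients. Everything specific below is an assumption for illustration: the metabolite names, the formation energies (made-up numbers, not database values), and the 20 kJ/mol margin. Because in vivo direction also depends on metabolite concentrations, a margin of this kind is only a heuristic.

```python
def reaction_gibbs(stoich, dfG):
    """Standard transformed Gibbs energy of reaction, sum_i s_i * dfG_i (kJ/mol)."""
    return sum(coeff * dfG[met] for met, coeff in stoich.items())


def classify_direction(drG, margin=20.0):
    """Heuristic: call a reaction irreversible only if drG is far from zero.

    The margin (kJ/mol) is an arbitrary illustrative cutoff.
    """
    if drG < -margin:
        return "forward only"
    if drG > margin:
        return "reverse only"
    return "reversible"


# Made-up formation energies in kJ/mol for hypothetical metabolites A, B, C.
dfG = {"A": -400.0, "B": -450.0, "C": -452.0}

r1 = {"A": -1, "B": 1}   # strongly favorable as written
r2 = {"B": -1, "C": 1}   # close to equilibrium

for name, rxn in [("r1", r1), ("r2", r2)]:
    drG = reaction_gibbs(rxn, dfG)
    print(name, drG, classify_direction(drG))
```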

3.2.6 Organism-Specific Reactions

Although constantly improving, the automatically generated draft
reconstructions still often fail to capture some of the organism-specific
attributes of enzymes such as subcellular localization (see
Note 5), unique requirements for cofactors, and substrate specific-
ity. All pertinent experimental data associated with metabolic
behavior of the organism under environmental conditions of inter-
est should be collected and used in order to ensure accurate model
predictions. Under ideal conditions, the modeler should also solicit
the opinion of an expert on the biology of the organism of interest.
Unfortunately, organism-specific information is often not available.
In such cases, data from a phylogenetically close organism can
be used. A detailed curation of the draft reconstruction will
substantially reduce the number of false-positive reactions and add
missing reactions that are particular to the modeled organism.

3.2.7 Intracellular Transport Reactions

If the modeled organism is eukaryotic, one must ensure that the
proper set of intracellular exchange reactions between the various
compartments are included in the reconstruction. Unfortunately,
experimental data on these processes is not readily available. There-
fore, only those transport reactions that are necessary for the proper
function of a compartmented pathway must be included in the
reconstruction. Addition of excessive transport reactions can lead
to the formation of futile cycles, which will lower the value of the model
as a predictive tool for network and flux analyses (see Note 6).

3.2.8 Gene-Protein-Reaction (GPR) Association Table

When refining the network, one must keep a record of the protein(s)
that catalyze a reaction. Often a majority of the reactions in a
network are catalyzed by single proteins that are encoded by single
genes. However, a number of other scenarios are also possible.
These include:
• One protein can catalyze more than one reaction. For example, a
5′-nucleotidase (EC 3.1.3.5) can catalyze the phosphatase reactions
that convert various 5′-nucleotides into their respective
nucleosides.
• An enzyme that catalyzes a reaction is a heteromeric complex,
and hence the products of multiple genes are
required for that reaction to proceed.
• Proteins encoded by different genes have the same functionality
(isozymes) and can catalyze the same reaction.
The GPR tables are usually included with draft reconstructions
or can be easily extracted from the models encoded in formats such
as Systems Biology Markup Language (SBML) [68, 69]. Develop-
ment of an accurate GPR table is essential for all gene-based ana-
lyses (see Note 7).
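
The three scenarios above can be captured in a simple data structure. The sketch below is one possible encoding, assumed for illustration rather than taken from any particular toolbox: each reaction maps to a list of alternative gene sets, where the genes inside a set are all required (a complex) and the sets themselves are alternatives (isozymes). All gene and reaction identifiers are hypothetical.

```python
# Reaction -> list of alternative gene sets ("OR" between sets, "AND" within a set).
# All identifiers are hypothetical.
gpr = {
    "NTD_AMP": [{"g1"}],           # one protein (g1) catalyzing two reactions
    "NTD_GMP": [{"g1"}],
    "PDH": [{"g2", "g3", "g4"}],   # heteromeric complex: all three genes needed
    "PGK": [{"g5"}, {"g6"}],       # isozymes: either gene is sufficient
}


def active_reactions(gpr, deleted_genes):
    """Reactions that still have at least one fully intact gene set."""
    deleted = set(deleted_genes)
    return {
        rxn
        for rxn, gene_sets in gpr.items()
        if any(not (genes & deleted) for genes in gene_sets)
    }


print(sorted(active_reactions(gpr, {"g1"})))        # both nucleotidase reactions lost
print(sorted(active_reactions(gpr, {"g3"})))        # the complex is disabled
print(sorted(active_reactions(gpr, {"g5"})))        # isozyme g6 compensates
print(sorted(active_reactions(gpr, {"g5", "g6"})))  # both isozymes deleted
```

In a gene-deletion simulation, the fluxes of reactions that drop out of this set would be constrained to zero before the model is solved.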

3.2.9 Non-growth Associated Energy Consumption

In order to correctly incorporate cellular energetics into a model, it
is vital to account for the amount of energy a cell uses to maintain
the status quo. This is achieved by adding an ATP hydrolysis
reaction to the network:
$$\text{ATP} + \text{H}_2\text{O} \rightarrow \text{ADP} + \text{P}_i + \text{H}^+.$$
The flux value for this reaction is independent of the growth
rate and will be constant. The value will be different for different
organisms and can be determined from growth experiments [70]
(see Note 8).

3.2.10 Growth-Associated Energy Consumption

A second ATP hydrolysis reaction should be included in the metabolic
network reconstruction to account for the energy a cell uses
to grow. This includes the energy used for the process of DNA
synthesis, as well as gene transcription and mRNA translation. Since
this energy consumption process is proportional to the rate of the
cellular growth, it is generally included as part of the biomass
reaction.

3.2.11 Organism-Specific Biomass Composition

In order to simulate bacterial growth, it is necessary to define the
constituent macromolecular components of a cell and account for
their consumption by composing a biomass reaction. For studies
that use cellular growth as the primary objective of a cell, the
biomass reaction is one of the most important elements of the
model. The ability of a model to produce the constituent compo-
nents of biomass is one of the initial means for assessing the
completeness of the network and ultimately the accuracy of the
model’s predictions (see Note 9).
Ideally, the biomass composition of an organism should be
determined experimentally; however, this type of data is not available
for most organisms. In such cases, variations of the biochemical
composition of cells reported by Neidhardt et al. [71]
have been frequently used. It is important to augment this compo-
sition with reliable organism-specific data. For some of the cellular
components (e.g., RNA, DNA, and amino acids), the fraction of
precursor metabolites can be estimated from the genome. For
example, some computational tools have been developed to esti-
mate the amino acid composition of cellular proteome (e.g.,
[72, 73]) by using the sequenced genome.
The lipid composition of the cells requires direct experimental
measurements. Because of the important role that these compounds
have in microbial interaction with the surrounding medium,
adaptation to various environmental changes, and in some cases
pathogenesis, studies of the lipid composition of medically and
economically important organisms are becoming more prevalent.
One frequently encountered problem with the biomass composition
of some published genome-scale models is that the masses
of the components do not sum to 1 g per mmol of biomass. A recent
study by Chan et al. [74] found that nearly every biomass composition of
64 published models varied from 1 g/mmol by at least 5%. This
finding is problematic since such deviations can hamper reliable
calculation of growth yields. Therefore, it is critical to ensure that
the biomass composition of a model adheres to the 1 g/mmol
definition for COBRA models. For a detailed review of
elementary concerns associated with the formulation of the biomass
objective function, see the review by Feist and Palsson [75].
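
A minimal sketch of the 1 g/mmol check follows. It assumes that biomass coefficients are given in mmol/gDW (negative for consumed precursors, positive for released byproducts) and molecular weights in g/mol. The pooled precursor names and all numbers are hypothetical, and uniform rescaling is only one simple way to restore the normalization.

```python
def biomass_weight(stoich, mol_weight):
    """Mass in grams of one mmol of biomass: net mass consumed by the reaction.

    Coefficients are in mmol/gDW and weights in g/mol (= mg/mmol),
    so the sum is in mg and is divided by 1000 to give grams.
    """
    return sum(-coeff * mol_weight[met] for met, coeff in stoich.items()) / 1000.0


def rescale_to_unit_weight(stoich, mol_weight):
    """Scale all coefficients uniformly so the biomass weighs 1 g/mmol."""
    weight = biomass_weight(stoich, mol_weight)
    return {met: coeff / weight for met, coeff in stoich.items()}


# Hypothetical pooled precursors with made-up coefficients and weights.
biomass = {"amino_acids": -6.0, "nucleotides": -1.0, "lipids": -0.1}
mol_weight = {"amino_acids": 100.0, "nucleotides": 300.0, "lipids": 700.0}

weight = biomass_weight(biomass, mol_weight)
print(f"biomass weight: {weight:.3f} g/mmol")

fixed = rescale_to_unit_weight(biomass, mol_weight)
print(f"after rescaling: {biomass_weight(fixed, mol_weight):.3f} g/mmol")
```

Uniform rescaling preserves the relative composition. When the deviation stems from a missing or mis-weighted component, correcting that component is preferable to rescaling.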

3.2.12 Nutritional Requirements

It is important to gather all available information about the unique
nutritional requirements and preferences of an organism for different
growth media. This information should be used to ensure the
appropriate nutrients and associated transport mechanisms are
included in the model. Absence of essential nutrients could lead
to erroneous editing of the metabolic network in order to confirm
that the missing metabolite can be produced in vivo. This mistake
can introduce errors into the prediction of optimal cellular growth
rate by diverting needed metabolites toward incorrectly added
anabolic pathways.

3.2.13 Extracellular Transport Reactions

The annotated genomes list proteins that are associated with different
metabolite transport mechanisms. One must make sure that
each one of these proteins is associated with the correct mode of
transportation. It is particularly important to confirm that the
amounts of energy consumed for active transport processes are
accurately formulated. Finally, it is also necessary to ensure that
means of transportation are available for all essential nutrients. If
the annotated genome does not provide a mode of transportation,
a reaction for transport of the compound into the cell should be
added to the network.

3.3 Translating the Refined Reconstruction into a Mathematical Model

3.3.1 Mathematical Representation

In order to utilize COBRA methods for analyzing the properties of
a system of interest, the curated biochemical reconstruction has to
be translated into a mathematical format. The resulting stoichiometric
matrix (S) incorporates all the information about interconversion
of metabolites and the structure of the metabolic network. Each
column in S corresponds to a metabolic reaction in the network,
while each row represents a metabolite. If there are m metabolites
and n reactions in the metabolic network, then the dimensions of
S are $m \times n$. If a metabolite i is a product in reaction j, then the
value of $S_{ij}$ is positive. On the other hand, if the metabolite is a
reactant, then the value of $S_{ij}$ is negative (see Fig. 1). For
each reaction (column), the values of S for all other rows (metabolites
that do not participate) are zero. The reaction rates (or fluxes)
for each reaction in the network are represented in a separate $n \times 1$
vector v (see Note 10). Thus, the equation for dynamic mass
balance can be written as:
$$\frac{dX}{dt} = S \cdot v$$
where X represents the metabolite concentrations.
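
The construction of S can be sketched with NumPy as follows. The three-metabolite network, the reaction names, and the helper function are hypothetical and chosen only to illustrate the row and column conventions described above.

```python
import numpy as np

# Hypothetical toy network; rows of S follow `metabolites`, columns follow
# the insertion order of `reactions`.
metabolites = ["A", "B", "C"]
reactions = {
    "uptake_A": {"A": 1},           # -> A
    "R1": {"A": -1, "B": 1},        # A -> B
    "R2": {"B": -1, "C": 2},        # B -> 2 C
    "secrete_C": {"C": -1},         # C ->
}


def build_stoichiometric_matrix(metabolites, reactions):
    """Return the m x n matrix S (products positive, reactants negative)."""
    row = {met: i for i, met in enumerate(metabolites)}
    S = np.zeros((len(metabolites), len(reactions)))
    for j, stoich in enumerate(reactions.values()):
        for met, coeff in stoich.items():
            S[row[met], j] = coeff
    return S


S = build_stoichiometric_matrix(metabolites, reactions)
print(S)

# A flux vector with one entry per reaction; dX/dt = S . v
v_balanced = np.array([1.0, 1.0, 1.0, 2.0])
v_accumulating = np.array([1.0, 1.0, 1.0, 1.0])
print(S @ v_balanced)       # no metabolite accumulates
print(S @ v_accumulating)   # C accumulates because secretion is too slow
```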

Fig. 1 Conversion of a simple metabolic network into a mathematical format, i.e., stoichiometric matrix (S)

3.3.2 Characteristics of the Medium

At this stage of the model building process, one should identify the
nutrients that are present in the growth medium of the cell. This is
done by defining an additional compartment (E) that envelops the
cell. This compartment is permeable to those metabolites that are
either imported or exported by the cell. The environment sur-
rounding the cell is characterized by the limits of the flux values
for exchange reactions in and out of E.

3.3.3 External and Internal Flux Constraints

When formulating the mathematical problem, it is necessary to
limit the interaction of the cell with its surroundings. To do this,
experimental data is used to constrain the model. The external flux
constraints make certain that there is a limit to the amounts of
metabolites that the cell can import or export.
Usually the approximate value for rates of import/export of
metabolites into/out of cells can be readily measured. On the other
hand, measurement of internal fluxes can still be difficult [76–
79]. However, when available, use of measured flux values as con-
straints on FBA models can drastically reduce the average variability
of predicted metabolic fluxes [79].

3.3.4 System Constraints

The FBA method is based on three fundamental assumptions:
1. The system is in a metabolic quasi-steady state. This assump-
tion can be justified by the fact that changes in cellular metab-
olism are generally fast in comparison to overall cellular growth
rate and environmental changes.
2. Mass is conserved. Thus, all mass that is imported into the
system is either transformed into biomass or excreted as meta-
bolic byproducts.
3. The cell has an objective and fluxes through various metabolic
reactions are patterned to optimize this cellular goal.
The steady-state assumption means that we can fix the mass
balance equation so that:
$$\frac{dX}{dt} = S \cdot v = 0.$$
Due to the scarcity of experimental measurements, FBA models contain
many more unknown reaction rates (entries of v) than linearly
independent mass balance equations. Thus, the stoichiometric
matrix is rank deficient, and the problem is highly underdetermined.
This obstacle can be overcome by using linear programming
to solve for a feasible flux vector that optimizes an
objective function. The most commonly maximized objective function
for FBA models is cellular growth (i.e., production of biomass)
(see Note 11).
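
The linear program described above can be sketched on a toy network with SciPy's `linprog`. This is only an illustration: the four-reaction network, the bounds, and the variable names are hypothetical, and a genome-scale model would normally be solved through a dedicated COBRA implementation rather than by hand.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network.
# Columns: uptake (-> A), R1 (A -> B), overflow (A ->), biomass (B ->)
S = np.array([
    [1, -1, -1,  0],   # metabolite A
    [0,  1,  0, -1],   # metabolite B
])

# Flux bounds: nutrient uptake is limited to 10 units; other fluxes are
# irreversible with a large upper bound.
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]

# linprog minimizes, so the biomass flux is maximized by minimizing its negative.
c = np.array([0, 0, 0, -1])

# Steady state: S . v = 0
res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")

growth = -res.fun
print("solver status:", res.message)
print("optimal biomass flux:", growth)
print("flux vector:", res.x)
```

In this toy case all imported A must be routed to biomass to reach the optimum, so the overflow flux is zero. In larger networks the optimal objective value is unique but the flux vector that achieves it often is not.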

3.3.5 Debugging the Network

Once the mathematical conversion has been completed, it becomes
necessary to run the model so as to ensure that it has the capability
to produce the components of biomass and make predictions that
agree with experimental observations. The most common error
that leads to discrepancies is the presence of unresolved network
gaps. As previously mentioned, annotated genomes generally con-
tain missing functionalities. Using annotations from multiple
sources can provide a more complete network reconstruction
[35]. Once the modelers have ensured that all information that
can be collected from a genome has been incorporated in the
network reconstruction, they can use a variety of tools such as the
gap-filling function in the COBRA Toolbox or the pathway hole filler
module of the Pathway Tools program [80] to ensure the model predicts
the correct phenotype.
In order to make sure that all necessary reactions are included
in the model, it is crucial to examine all experimental data and make
certain that the model has the metabolic capacity to mimic
observed bacterial behavior under various conditions. For example,
if experimental measurements show that an organism can
consume succinate as its sole carbon source but the model
predicts no growth on succinate, the modeler can set succinate as
the sole carbon source in the simulated medium and run the gap filler
to add the missing reactions to the network.

3.3.6 MEMOTE

Genome-scale models of metabolism are large and complex knowledgebases.
They include thousands of metabolites, reactions, and
gene associations. As the field of COBRA modeling grows, partic-
ularly toward analysis of complex multi-organism systems, it is
becoming clear that there is a need to develop a uniform descrip-
tion format for GSMs and also to ensure that published models pass
a set of rigorous quality control measures. To this end, a
community-maintained open-source software called Memote
(Metabolic Model Tests, https://github.com/opencobra/memote)
[81] has been developed to test submitted models for a variety
of common problems ranging from annotation issues to stoichio-
metric inconsistencies and problems with biomass composition.
Memote is a very helpful tool for curating COBRA models. It can
even be configured to automatically test models as the process of
curation progresses.
Memote offers the quality control of models that is needed for
the development of ever more complex models and simplifies the collaborations
needed for such efforts to succeed. It is now agreed by
most members of the COBRA community that all models submit-
ted for publication should include Memote generated comprehen-
sive reports verifying the quality of the submitted work.

3.4 Applications of FBA Models

The availability of annotated genomes has led to a dramatic increase
in the number of genome-scale metabolic models that are being
developed. Concurrently, the number and scope of theoretical
methods for investigating these reconstructions is also expanding.
The applications of these models are too numerous to list here.
However, a number of excellent reviews have been published that
thoroughly categorize and detail the most prominent uses of
constraint-based metabolic models. The interested reader can
examine manuscripts by Price et al. [82], Oberhardt et al. [27],
Feist and Palsson [83], Milne et al. [84], Liu et al. [85], and
Bordbar et al. [86].

4 Notes

1. The information contained in the annotated genome provides
the primary basis for the reconstruction of the cellular metabolic
network. Therefore, it is essential that the modeler uses
the most recent and accurate version of the annotated genome
for this purpose.
2. The functions of some proteins that do have an EC number are
not included in the metabolic network reconstructions. For
example, enzymes that are involved in signaling or regulatory
processes are generally excluded from draft network
reconstructions.
3. Not all of the genes in a genome are transcribed and translated
under all environmental conditions. It is the task of the model
curator to use various experimental data (such as transcrip-
tomics and proteomics data) to constrain the model.
4. As a rule of thumb, one can assume that reactions not normally
associated with energy production and involving the transfer of
a phosphate group from an ATP to an acceptor entity are
irreversible.
5. Cytoplasm, periplasm, and extracellular medium are usually the
compartments that have been used for metabolic reconstructions
of prokaryotic organisms. In the absence of credible data,
all proteins in prokaryotes should be considered cytosolic.
6. Unless detailed information regarding the energy cost of intra-
cellular transport reactions is available, all such reactions should
be assumed to proceed via free or facilitated diffusion.
7. When combining reactions for the purpose of simplifying
multi-step processes, it is important to note in the GPR table
that all of the gene(s) associated with each composite reaction
are required for the activity of the combined reaction.
8. Although it is ideal to use organism-specific energy consump-
tion values for non-growth-associated maintenance of the cell,
in the absence of experimental data, one can use energy values
from closely related organisms. For example, some models of
metabolism in enteropathogens have used the maintenance
energy value for E. coli (measured as 7.6 mmol/(gDW h)
[70] and more recently 8.39 mmol/(gDW h) [66]).
9. The makeup of the biomass reaction has a vital role for compu-
tational gene knockout simulations. If cells cannot import a
biomass precursor, then the biosynthetic pathway for the pro-
duction of that compound is critical for cellular growth and
consequently all associated genes are identified as essential. If a
component of biomass is excluded from the biomass reaction, a
number of critical genes will not be correctly identified. For
purposes of gene deletion simulations, only the presence of the
metabolite in the biomass reaction is crucial, while the frac-
tional input of the metabolites is inconsequential. The latter
values are however important for correct quantitative predic-
tion of cellular growth as well as nutrient uptake and waste
excretion.
10. The flux vector is partitioned into internal and external fluxes.
The internal fluxes are associated with biochemical transforma-
tions and occur within the cell. The external fluxes are the rates
of import of nutrients and excretion of metabolic byproducts.
11. Other objective functions such as maximization or minimiza-
tion of ATP production, redox potential, or rate of nutrient
uptake [82, 87–90] have also been used.

Acknowledgments

This work was funded in part by the DOE OBER Genomic Science
program and LLNL Laboratory Directed Research and Develop-
ment funding and performed under the auspices of the
U.S. Department of Energy at Lawrence Livermore National Lab-
oratory under Contract DE-AC52-07NA27344.

References

1. Almaas E, Kovacs B, Vicsek T, Oltvai ZN, Barabasi AL (2004) Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature 427(6977):839–843
2. Almaas E (2007) Optimal flux patterns in cellular metabolic networks. Chaos 17(2):026107
3. Almaas E, Oltvai ZN, Barabasi AL (2005) The activity reaction core and plasticity of metabolic networks. PLoS Comput Biol 1(7):e68
4. Gagneur J, Jackson DB, Casari G (2003) Hierarchical analysis of dependency in metabolic networks. Bioinformatics 19(8):1027–1034
5. Li G, Cao H, Xu Y (2018) Structural and functional analyses of microbial metabolic networks reveal novel insights into genome-scale metabolic fluxes. Brief Bioinform 20(4):1590–1603
6. Segre D, Vitkup D, Church GM (2002) Analysis of optimality in natural and perturbed metabolic networks. Proc Natl Acad Sci U S A 99(23):15112–15117
7. Deutscher D, Meilijson I, Kupiec M, Ruppin E (2006) Multiple knockout analysis of genetic robustness in the yeast metabolic network. Nat Genet 38(9):993–998
8. Jamshidi N, Palsson BO (2006) Systems biology of SNPs. Mol Syst Biol 2:38
9. Edwards JS, Palsson BO (2000) Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions. BMC Bioinformatics 1:1
10. Ho W-C, Zhang J (2016) Adaptive genetic robustness of Escherichia coli metabolic fluxes. Mol Biol Evol 33(5):1164–1176
11. Reed JL, Famili I, Thiele I, Palsson BO (2006) Towards multidimensional genome annotation. Nat Rev Genet 7(2):130–141
12. Navid A, Almaas E (2009) Genome-scale reconstruction of the metabolic network in Yersinia pestis, strain 91001. Mol BioSyst 5(4):368–375
13. Pal C, Papp B, Lercher MJ (2005) Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet 37(12):1372–1375
14. Pal C, Papp B, Lercher MJ (2005) Horizontal gene transfer depends on gene content of the host. Bioinformatics 21(Suppl 2):222–223
15. Pal C, Papp B, Lercher MJ, Csermely P, Oliver SG, Hurst LD (2006) Chance and necessity in the evolution of minimal metabolic networks. Nature 440(7084):667–670
16. Großkopf T, Consuegra J, Gaffé J, Willison JC, Lenski RE, Soyer OS et al (2016) Metabolic modelling in a dynamic evolutionary framework predicts adaptive diversification of bacteria in a long-term evolution experiment. BMC Evol Biol 16(1):163
17. Pharkya P, Burgard AP, Maranas CD (2003) Exploring the overproduction of amino acids using the bilevel optimization framework OptKnock. Biotechnol Bioeng 84(7):887–899
18. Burgard AP, Pharkya P, Maranas CD (2003) Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol Bioeng 84(6):647–657
19. Pharkya P, Burgard AP, Maranas CD (2004) OptStrain: a computational framework for redesign of microbial production systems. Genome Res 14(11):2367–2376
20. Fong SS, Burgard AP, Herring CD, Knight EM, Blattner FR, Maranas CD et al (2005) In silico design and adaptive evolution of Escherichia coli for production of lactic acid. Biotechnol Bioeng 91(5):643–648
21. Park JH, Lee KH, Kim TY, Lee SY (2007) Metabolic engineering of Escherichia coli for the production of L-valine based on transcriptome analysis and in silico gene knockout simulation. Proc Natl Acad Sci U S A 104(19):7797–7802
22. Yoshikawa K, Toya Y, Shimizu H (2017) Metabolic engineering of Synechocystis sp. PCC 6803 for enhanced ethanol production based on flux balance analysis. Bioprocess Biosyst Eng 40(5):791–796
23. Schuetz R, Kuepfer L, Sauer U (2007) Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Mol Syst Biol 3:119
24. Schuetz R, Zamboni N, Zampieri M, Heinemann M, Sauer U (2012) Multidimensional optimality of microbial metabolism. Science 336(6081):601–604
25. Navid A, Jiao Y, Wong SE, Pett-Ridge J (2019) System-level analysis of metabolic trade-offs during anaerobic photoheterotrophic growth in Rhodopseudomonas palustris. BMC Bioinformatics 20(1):233
26. Peyraud R, Cottret L, Marmiesse L, Gouzy J, Genin S (2016) A resource allocation trade-off between virulence and proliferation drives metabolic versatility in the plant pathogen Ralstonia solanacearum. PLoS Pathogens 12(10):e1005939
27. Oberhardt MA, Palsson BO, Papin JA (2009) Applications of genome-scale metabolic reconstructions. Mol Syst Biol 5:320
28. Varma A, Palsson BO (1994) Metabolic flux balancing: basic concepts, scientific and practical use. Nat Biotechnol 12(10):994–998
29. Orth JD, Thiele I, Palsson BO (2010) What is flux balance analysis? Nat Biotechnol 28(3):245–248
30. Gottstein W, Olivier BG, Bruggeman FJ, Teusink B (2016) Constraint-based stoichiometric modelling from single organisms to microbial communities. J R Soc Interface 13(124):20160627
31. Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L et al (2011) EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res 39(Database issue):D583–D590
32. Markowitz VM, Chen IM, Palaniappan K, Chu K, Szeto E, Grechkin Y et al (2010) The integrated microbial genomes system: an expanding comparative analysis resource. Nucleic Acids Res 38(Database issue):D382–D390
33. Maglott D, Ostell J, Pruitt KD, Tatusova T (2007) Entrez gene: gene-centered information at NCBI. Nucleic Acids Res 35(Database issue):D26–D31
34. Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S et al (2018) KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36(7):566
35. Griesemer M, Kimbrel JA, Zhou CE, Navid A, D’haeseleer P (2018) Combining multiple functional annotation tools increases coverage of metabolic annotation. BMC Genomics 19(1):948
36. Karlsen E, Schulz C, Almaas E (2018) Automated generation of genome-scale metabolic draft reconstructions based on KEGG. BMC Bioinformatics 19(1):467
37. Machado D, Andrejev S, Tramontano M, Patil KR (2018) Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res 46(15):7542–7553
38. DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A (2007) Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinformatics 8:139
39. Henry CS, DeJongh M, Best AA, Frybarger PM, Linsay B, Stevens RL (2010) High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol 28(9):977–982
40. Dias O, Rocha M, Ferreira EC, Rocha I (2015) Reconstructing genome-scale metabolic models with merlin. Nucleic Acids Res 43(8):3899–3910
41. Karp PD, Latendresse M, Paley SM, Krummenacker M, Ong QD, Billington R et al (2015) Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology. Brief Bioinform 17(5):877–890
42. Agren R, Liu L, Shoaie S, Vongsangnak W, Nookaew I, Nielsen J (2013) The RAVEN toolbox and its use for generating a genome-scale metabolic model for Penicillium chrysogenum. PLoS Comput Biol 9(3):e1002980
43. Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R et al (2018) RAVEN 2.0: A versatile toolbox for metabolic network reconstruction and a case study on Streptomyces coelicolor. PLoS Comput Biol 14(10):e1006541
44. Swainston N, Smallbone K, Mendes P, Kell DB, Paton NW (2011) The SuBliMinaL Toolbox: automating steps in the reconstruction of metabolic networks. J Integr Bioinform 8(2):187–203
45. Becker SA, Feist AM, Mo ML, Hannum G, Palsson BO, Herrgard MJ (2007) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat Protoc 2(3):727–738
46. Schellenberger J, Que R, Fleming RMT, Thiele I, Orth JD, Feist AM et al (2011) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nat Protoc 6(9):1290–1307
47. Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v. 3.0. Nat Protoc 2019:1
48. Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng 5(4):264–276
49. Mahadevan R, Edwards JS, Doyle FJ (2002) Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J 83(3):1331
50. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA et al (2008) The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9(1):75
51. Schomburg I, Chang A, Hofmann O, Ebeling C, Ehrentreich F, Schomburg D (2002) BRENDA: a resource for enzyme data and metabolic information. Trends Biochem Sci 27(1):54–56
52. Chang A, Scheer M, Grote A, Schomburg I, Schomburg D (2009) BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009. Nucleic Acids Res 37(Suppl 1):D588
53. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27
54. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32(suppl 1):D277
55. Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse M et al (2008) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 36(Database issue):D623–D631
56. Ren Q, Kang KH, Paulsen IT (2004) TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res 32(suppl 1):D284
57. Ren Q, Chen K, Paulsen IT (2006) TransportDB: a comprehensive database resource for cytoplasmic membrane transport systems and outer membrane channels. Nucleic Acids Res 35(suppl 1):D274
58. Fritzemeier CJ, Hartleb D, Szappanos B, Papp B, Lercher MJ (2017) Erroneous energy-generating cycles in published genome
Curating COBRA Models of Microbial Metabolism 337

scale metabolic networks: Identification and growth and metabolic by-product secretion in
removal. PLoS Comput Biol 13(4):e1005494 wild-type Escherichia coli W3110. Appl Envi-
59. Alberty RA (1998) Calculation of standard ron Microbiol 60(10):3724–3731
transformed formation properties of biochem- 71. Neidhardt FC, Curtiss R III, Ingraham J,
ical reactants and standard apparent reduction Lin E, Low K, Magasanik B et al (1996) Escher-
potentials of half reactions. Arch Biochem Bio- ichia coli and salmonella: cellular and molecular
phys 358(1):25–39 biology. Sigma-Aldrich, Washington DC
60. Alberty RA (1998) Calculation of standard 72. Tekaia F, Yeramian E, Dujon B (2002) Amino
transformed Gibbs energies and standard trans- acid composition of genomes, lifestyles of
formed enthalpies of biochemical reactants. organisms, and evolutionary trends: a global
Arch Biochem Biophys 353(1):116–130 picture with correspondence analysis. Gene
61. Kummel A, Panke S, Heinemann M (2006) 297(1–2):51–60
Systematic assignment of thermodynamic con- 73. Dumontier M, Michalickova K, Hogue C
straints in metabolic network models. BMC (2002) Species-specific protein sequence and
Bioinformatics 7:512 fold optimizations. BMC Bioinformatics 3
62. Mavrovouniotis ML (1990) Group contribu- (1):39
tions for estimating standard gibbs energies of 74. Chan SHJ, Cai J, Wang L, Simons-Senftle MN,
formation of biochemical compounds in aque- Maranas CD (2017) Standardizing biomass
ous solution. Biotechnol Bioeng 36 reactions and ensuring complete mass balance
(10):1070–1082 in genome-scale metabolic models. Bioinfor-
63. Jankowski MD, Henry CS, Broadbelt LJ, Hat- matics 33(22):3603–3609
zimanikatis V (2008) Group contribution 75. Feist AM, Palsson BO (2010) The biomass
method for thermodynamic analysis of com- objective function. Curr Opin Microbiol 13
plex metabolic networks. Biophys J 95 (3):344–349
(3):1487–1499 76. Tang YJ, Martin HG, Myers S, Rodriguez S,
64. Henry CS, Jankowski MD, Broadbelt LJ, Hat- Baidoo EEK, Keasling JD (2009) Advances in
zimanikatis V (2006) Genome-scale thermody- analysis of microbial metabolic fluxes via 13C
namic analysis of Escherichia coli metabolism. isotopic labeling. Mass Spectrom Rev 28
Biophys J 90(4):1453–1461 (2):362–375
65. Henry CS, Broadbelt LJ, Hatzimanikatis V 77. Fischer E, Zamboni N, Sauer U (2004) High-
(2007) Thermodynamics-based metabolic flux throughput metabolic flux analysis based on
analysis. Biophys J 92(5):1792–1805 gas chromatography-mass spectrometry
66. Feist AM, Henry CS, Reed JL, derived 13C constraints. Anal Biochem 325
Krummenacker M, Joyce AR, Karp PD et al (2):308–316
(2007) A genome-scale metabolic reconstruc- 78. Sauer U (2006) Metabolic networks in motion:
tion for Escherichia coli K-12 MG1655 that 13C-based flux analysis. Mol Syst Biol 2:62
accounts for 1260 ORFs and thermodynamic 79. Stewart BJ, Navid A, Turteltaub KW, Bench G
information. Mol Syst Biol 3:121 (2010) Yeast dynamic metabolic flux measure-
67. Tanaka M, Okuno Y, Yamada T, Goto S, ment in nutrient-rich media by HPLC and
Uemura S, Kanehisa M (2003) Extraction of a accelerator mass spectrometry. Anal Chem 82
thermodynamic property for biochemical reac- (23):9812–9817
tions in the metabolic pathway. Genome Infor- 80. Green ML, Karp PD (2004) A Bayesian
matics Ser 14:370–371 method for identifying missing enzymes in pre-
68. Hucka M, Finney A, Sauro HM, Bolouri H, dicted metabolic pathway databases. BMC Bio-
Doyle JC, Kitano H et al (2003) The systems informatics 5:76
biology markup language (SBML): a medium 81. Lieven C, Beber ME, Olivier BG, Bergmann
for representation and exchange of biochemical FT, Ataman M, Babaei P et al (2020) MEM-
network models. Bioinformatics 19 OTE for standardized genome-scale metabolic
(4):524–531 model testing. Nat Biotechnol 38(3):272–276
69. Hucka M, Bergmann FT, Dr€ager A, Hoops S, 82. Price ND, Reed JL, Palsson BO (2004)
Keating SM, Le Novère N et al (2018) The Genome-scale models of microbial cells: evalu-
Systems Biology Markup Language (SBML): ating the consequences of constraints. Nat Rev
language specification for level 3 version Microbiol 2(11):886–897
2 core. J Integr Bioinformatics 15 83. Feist AM, Palsson BO (2008) The growing
(1):20170081 scope of applications of genome-scale meta-
70. Varma A, Palsson BO (1994) Stoichiometric bolic reconstructions using Escherichia coli.
flux balance models quantitatively predict Nat Biotechnol 26(6):659–667
338 Ali Navid

84. Milne CB, Kim PJ, Eddy JA, Price ND (2009) stationary fluxes in metabolic networks. Eur J
Accomplishments in genome-scale in silico Biochem 271(14):2905–2922
modeling for industrial and medical biotech- 89. Oliveira AP, Nielsen J, Förster J (2005) Mod-
nology. Biotechnol J 4(12):1653–1670 eling Lactococcus lactis using a genome-scale
85. Liu L, Agren R, Bordel S, Nielsen J (2010) Use flux model. BMC Microbiol 5(1):39
of genome-scale metabolic models for under- 90. Kauffman KJ, Prakash P, Edwards JS (2003)
standing microbial physiology. FEBS Lett 584 Advances in flux balance analysis. Curr Opin
(12):2556–2564 Biotechnol 14(5):491–496
86. Bordbar A, Monk JM, King ZA, Palsson BO 91. Krummenacker M, Paley S, Mueller L, Yan T,
(2014) Constraint-based models predict meta- Karp PD (2005) Querying and computing
bolic and associated cellular functions. Nat Rev with BioCyc databases. Bioinformatics 21
Genet 15(2):107–120 (16):3454–3455
87. Knorr AL, Jain R, Srivastava R (2007) 92. Sayers EW, Beck J, Brister JR, Bolton EE,
Bayesian-based selection of metabolic objective Canese K, Comeau DC et al (2019) Database
functions. Bioinformatics 23(3):351–357 resources of the National Center for Biotech-
88. Holzhutter HG (2004) The principle of flux nology Information. Nucleic Acids Res 48
minimization and its application to estimate (D1):D9–D16
Chapter 15

A Beginner’s Guide to the COBRA Toolbox


Ali Navid

Abstract
The COBRA toolbox is one of the most popular tools for systems biology analyses using genome-scale
metabolic reconstructions. The toolbox supports many constraint-based analytical methods for
examining characteristics of metabolism in biosystems ranging in complexity from single cells to
microbial communities and, ultimately, multicellular organisms. The toolbox has a number of
variants that can be used depending on a user’s choice of programming language. Here, I provide a basic
tutorial for beginners who plan to use the original MATLAB version of the toolbox.

Key words Systems biology, Genome-scale models, Constraint-based analysis, FBA, Metabolic
networks, COBRA toolbox

1 Introduction

The ‘omics’ revolution has allowed the collection of vast quantities of
system-level biological data and transformed biology from a reductionist
science to one where holistic analysis of biosystems is the
preferred mode of study. The availability of system-level data has had a
profound impact on the field of computational biology, where a
growing community of researchers has been using constraint-based
reconstruction and analysis (COBRA) methods to simulate
the workings of biosystems and evaluate the results in order to gain
a more complete understanding of a system’s metabolic capabilities
and robustness. This knowledge can be used for a variety of
purposes [1], including studying the metabolism of pathogenic organisms
for therapeutic purposes (e.g., [2–4]), metabolic engineering
of microbes (e.g., [5–7]), and devising optimal feeding regimes for
biosystems (e.g., [8]).
In 2007, the COBRA toolbox was introduced to the research
community [9] and has since gone through multiple upgrades.
Recently, the third version of the toolbox was released [10]. The
MATLAB-based toolbox contains a large number of analytical
methods including flux balance analysis (FBA) [11], flux
variability analysis (FVA) [12], dynamic FBA (dFBA) [13],
parsimonious enzyme usage FBA (pFBA) [14], gene-essentiality analysis,
and minimization of metabolic adjustment (MOMA) analysis [15], to
name a few.

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_15, © Springer Science+Business Media, LLC, part of Springer Nature 2022

The purpose of this chapter is not to be an exhaustive review of all
the different types of analyses that one can do with the COBRA toolbox.
Instead, this is a guide for beginners with limited knowledge of
MATLAB who want to run some common analyses on models they
have obtained either from the literature or from a draft
model-building website like KBase [16]. This is accomplished by providing
the reader with example commands that can then be tailored to
their analyses of choice. Those interested in learning about some of
the latest updates to the COBRA toolbox should read the protocols
published for version 2 [17] and version 3 [10].
This chapter also does not include a detailed overview of the
fundamentals of COBRA models and the different analysis methods.
The interested reader should consult the cited papers (e.g., [11–15, 18])
for a complete explanation of these topics.

2 Materials

Although there are now open-source versions of the COBRA toolbox
(e.g., COBRApy, https://opencobra.github.io/cobrapy/) available,
the MATLAB version is still the preferred version for a large
fraction of the COBRA community. This chapter deals exclusively
with the MATLAB version of the toolbox. Accordingly, the following
software is needed:
1. MATLAB suite of programs (www.mathworks.com/products/matlab.html).
2. COBRA toolbox (https://opencobra.github.io/cobratoolbox/stable/).
3. Linear programming (LP) solvers. Although the latest release
of the COBRA toolbox comes with a number of LP solvers, users
who prefer other compatible LP solvers such as Gurobi
or CPLEX should install them on their computer.
4. A COBRA model of biochemical processes. The COBRA toolbox
download includes a number of models. For this chapter, the
included model of core metabolism in E. coli will be used as an
example.

3 Methods

3.1 Initializing the Toolbox

First, download the package from the COBRA website and install
it. The installation procedure depends on your computer’s operating
system. For a complete set of installation instructions appropriate
for your computer, visit:
https://opencobra.github.io/cobratoolbox/stable/installation.html.

Fig. 1 After initializing, the COBRA toolbox lists available LP solvers that can be used with the toolbox and the
mathematical problem that each can solve. A list of models that are included with the installation can be seen
on the left side of the figure

After installing the COBRA toolbox, go to the directory where you
installed it and initialize the toolbox using the command (see
Notes 1 and 2):

initCobraToolbox()

If you do not enter any value inside the parentheses, the default
value of true will be used and the program will check the COBRA
toolbox GitHub site for any new updates to the toolbox. If you are
not interested in checking for updates and want
to skip this step, enter false in the parentheses.
During the initialization process, the program checks for the
presence of software needed for various types of analysis. It reports
the availability of different LP solvers and the types of problems
they can solve (see Fig. 1). The default solver for LP and
mixed-integer linear programming (MILP) problems is the open-source
GNU Linear Programming Kit (GLPK), unless you have the
Gurobi solver (free for academic users, but others need to purchase a
license, https://www.gurobi.com/) installed on your computer. If
you have access to Gurobi, then that solver will be used as the default
for LP, MILP, and quadratic problems. The MATLAB solver can be
used for nonlinear programming problems. By using the command
changeCobraSolver, you can get a list of which solvers are being
used for every type of problem. If you want to change the solver
for a problem type, you can use the same command
with inputs that identify your choice of solver. For example, if you
want to use the pdco solver for solving LP problems, you can
change the solver with this command:

changeCobraSolver('pdco','LP');

3.2 Reading Models into MATLAB

In order to analyze any model, you will first need to import it into
MATLAB. Version 3 of the toolbox can accept a number of different
file types, including systems biology markup language (SBML)
(.xml), Excel sheets (.xls), and MATLAB files (.mat).
Although all of these formats have been used for publishing
and sharing models, there is a strong push to standardize
modeling conventions in the COBRA field. Recently, a
community-developed suite of tests named MEMOTE [19] was introduced
to assess the quality of genome-scale models (GSMs),
ensure easy reusability, and enhance simulation reproducibility.
The developers of MEMOTE have advocated the adoption of the
SBML Level 3 flux balance constraints package [20] as the primary
format for future published GSMs.
To import a model into the COBRA toolbox, you need to
change your directory to the one where your model file resides.
For example, to access the SBML version of a small model of core
metabolic reactions in E. coli that comes with the COBRA toolbox,
go to the directory where that .xml file resides. To get there, from
your COBRA toolbox home directory type:

cd test/models/xml

You can also go to this folder from the ‘Current Folder’ panel
on the left side of your MATLAB screen (see Fig. 1). Once in the
correct directory, you can read the file into MATLAB
either directly using the name of the file (e.g., ecoli_core_model.xml)
or by first storing the file name in a MATLAB variable. To load
the test model, one could use:

model=readCbModel('ecoli_core_model.xml');

or

modelfile='ecoli_core_model.xml';
model=readCbModel(modelfile);

If you now double click on model in the ‘Workspace’ panel, on
the right side of your MATLAB screen (see Fig. 2), you will be able
to see a list of the different arrays and matrices that form the model.

Fig. 2 Double clicking on the model name in the ‘Workspace’ panel (right side) will display a list of different
arrays and matrices that form the model knowledgebase
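
The arrays listed in the ‘Workspace’ panel can also be queried from the command line. The following is a minimal sketch, not toolbox code: it builds a hypothetical two-reaction structure named toy by hand so that it runs without the toolbox, and it assumes the usual COBRA field names (mets, rxns, S, lb, ub, and c). The same expressions should apply to a model loaded with readCbModel.

```matlab
% Toy stand-in for a COBRA model structure (hypothetical network, not E. coli).
toy.mets = {'A[c]'; 'B[c]'};          % metabolite identifiers
toy.rxns = {'R_in'; 'R_conv'};        % reaction identifiers
toy.S    = sparse([1 -1; 0 1]);       % rows = metabolites, columns = reactions
toy.lb   = [0; 0];                    % lower flux bounds
toy.ub   = [10; 1000];                % upper flux bounds
toy.c    = [0; 1];                    % objective coefficients

% Size of the network.
[nMets, nRxns] = size(toy.S);
fprintf('%d metabolites, %d reactions\n', nMets, nRxns);

% Look up a reaction by name and list the metabolites it involves.
idx = find(strcmp(toy.rxns, 'R_conv'));
involved = toy.mets(full(toy.S(:, idx)) ~= 0);
disp(involved);
```

Looking up entries by identifier with strcmp, rather than by position, keeps such queries valid after reactions or metabolites are added or removed.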

3.3 Model Manipulation

One of the main reasons for building mathematical models is to
manipulate them and see how various changes alter the behavior
of a system. In the case of COBRA models, this could involve losing
or gaining functions (i.e., deleting/blocking or adding reactions,
respectively) as well as changing the constraints of the system, such
as altering the upper or lower flux bounds for a reaction.

3.3.1 Adding and Removing Metabolites

Metabolites are usually the nodes of a metabolic network. They are
the compounds that proteins either chemically alter or transport
between compartments. The addMetabolite command is used
to enter new metabolites into a model. For example, to add
pyrophosphate to the core E. coli model (henceforth the model) using
the ppi[c] identifier, one can use the command:

model=addMetabolite(model,'ppi[c]');

The letter in the brackets following the name of the metabolite
identifies the compartment in which the compound resides. The
letter c generally designates the cytoplasm, while e is used for the
extracellular space (see Note 3).
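
Because the compartment is encoded in the identifier itself, it can be recovered with a regular expression. The sketch below uses plain MATLAB only; the list metIDs is a hypothetical example and the pattern assumes identifiers of the form name[x].

```matlab
% Extract the compartment label from identifiers of the form name[x].
metIDs = {'ppi[c]', 'asp-L[c]', 'oaa[e]'};   % hypothetical example list
comps = cell(size(metIDs));
for k = 1:numel(metIDs)
    % Capture the text between the final pair of square brackets.
    tok = regexp(metIDs{k}, '\[(\w+)\]$', 'tokens', 'once');
    comps{k} = tok{1};
end
disp(comps);   % compartment of each metabolite, in the same order
```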
However, using the above command will only add the
metabolite’s identifier to the model. No other relevant information will
be added. The best curated models include additional information
about metabolites such as compound formula, charge, and Ids
from frequently used databases like KEGG [21, 22]. The
addMetabolite command accepts a number of other inputs that allow
simultaneous addition of these details to the model knowledgebase.
For example, one can add the amino acid L-aspartate (identifier:
asp-L[c]) using the following command:

model=addMetabolite(model,'asp-L[c]','metName', ...
'L-Aspartate[c]','metFormula','C4H6NO4', ...
'ChEBIID','17053','KEGGId','C00049', 'Charge',-1);

The three dots (...) shown in the middle of commands indicate
that the command is continued on the next line.
With the addMetabolite command, other information such
as an InChI descriptor (‘InChi’) and PubChem Id (‘PubChemId’)
can also be added. If a metabolite with the same identifier already
exists in the model, the program will return an error message.
To add multiple metabolites simultaneously into the model,
use the addMultipleMetabolites command. For example, to
add L-asparagine and 2-oxosuccinamate along with some of their
associated data into the model, one can use:

model=addMultipleMetabolites(model,...
{'asn-L[c]','2oxosucc[c]'},'metNames',...
{'L-Asparagine[c]', '2-oxosuccinamate[c]'}, ...
'metFormulas',{'C4H8N2O3','C4H4NO4'}, ...
'metChEBIID',{'17196','16237'},'metKEGGID', ...
{'C00152','C02362'},'metCharges',[0,-1]);

To remove one or more metabolites from a model, use the
removeMetabolites command. For example:

model=removeMetabolites(model,{'asp-L[c]','2oxosucc[c]'});

3.3.2 Adding, Removing, or Modifying Reactions

There are multiple types of reactions in each GSM, and accordingly,
there are multiple ways of adding a new reaction to a model. A
new reaction that deals with biochemical conversions can be added
using the addReaction command. There are two ways of using
this command, and your choice will dictate what information is
entered within the parentheses following the command.

Addition Using Reaction Formula

When using this method, you simply type the chemical equation
into the addReaction command. For example, to add the
aspartate transaminase reaction, a biochemical process that involves
two metabolites from the Krebs cycle pathway, one can use:

model=addReaction(model,'ASPTA','reactionFormula',...
'akg[c] + asp-L[c] <=> glu-L[c] + oaa[c]')

If the command proceeds without any issues, you will see
that ASPTA (the name used for this reaction in GSMs from the
BiGG database [23, 24]) (see Note 4) has been added to the tail end of the
rxns array in the model. Also, the number of reactions in your
model should increase by one.
If the added reaction contains metabolites that are not already
present in the model, those metabolites will automatically be
added to the model. For example, L-aspartate (asp-L[c]) is not a
compound in the original model, so adding the ASPTA reaction
also adds asp-L[c] as a new compound to the model.
When using the reaction formula method to add reactions to
your model, the shape of the arrow that links the left- and
right-hand sides of the equation indicates whether a reaction is reversible
or irreversible. If a reaction is irreversible, then -> is used, while <=>
indicates that a reaction is reversible.
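
The arrow convention can be checked with a simple string test. This is only an illustrative sketch in plain MATLAB, not the toolbox's own parser. It assumes that the reversible arrow is written <=>, as in the ASPTA command above, and that the irreversible arrow is written ->.

```matlab
% Two formulas taken from this chapter: one reversible, one irreversible.
formulas = {'akg[c] + asp-L[c] <=> glu-L[c] + oaa[c]', 'C1E -> C2E'};

% A formula is treated as reversible when it contains the <=> arrow.
isReversible = ~cellfun(@isempty, strfind(formulas, '<=>'));

for k = 1:numel(formulas)
    if isReversible(k)
        fprintf('reversible:   %s\n', formulas{k});
    else
        fprintf('irreversible: %s\n', formulas{k});
    end
end
```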

Addition Using Separate Lists Method

You can also add a reaction to your model by using a method in
which two separate lists, one of reaction metabolites and the other
of their stoichiometric coefficients, describe the reaction. For
example, to add the ASPTA reaction to the model one can use:

model=addReaction(model,'ASPTA','metaboliteList', ...
{'akg[c]', 'asp-L[c]', 'glu-L[c]','oaa[c]'}, ...
'stoichCoeffList',[-1; -1; 1; 1], 'reversible', true);

It should be noted that the order in which the metabolites are
listed is not important. However, it is critical that the order of
stoichiometric coefficients match the order of the associated
metabolites. For example, the following would add the exact same reaction
to the model:

model=addReaction(model,'ASPTA','metaboliteList',...
{'akg[c]', 'glu-L[c]', 'asp-L[c]','oaa[c]'}, ...
'stoichCoeffList',[-1; 1; -1; 1], 'reversible', true);
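
To see why the two commands are equivalent, it helps to look at the column of the stoichiometric matrix that each pair of lists describes. The sketch below is plain MATLAB and does not call the toolbox. The fixed row order in mets is an assumption made for illustration, since the toolbox decides the actual row order.

```matlab
% Assumed row order of the four metabolites in the stoichiometric matrix.
mets = {'akg[c]'; 'asp-L[c]'; 'glu-L[c]'; 'oaa[c]'};

% The two orderings used in the commands above.
listA = {'akg[c]', 'asp-L[c]', 'glu-L[c]', 'oaa[c]'};  coefA = [-1; -1; 1; 1];
listB = {'akg[c]', 'glu-L[c]', 'asp-L[c]', 'oaa[c]'};  coefB = [-1; 1; -1; 1];

% Place each coefficient in the row that belongs to its metabolite.
[~, rowA] = ismember(listA, mets);
[~, rowB] = ismember(listB, mets);
colA = zeros(numel(mets), 1);  colA(rowA) = coefA;
colB = zeros(numel(mets), 1);  colB(rowB) = coefB;

disp(isequal(colA, colB));   % both orderings describe the same column
```

Swapping two metabolites without swapping their coefficients would give a different column, and therefore a different reaction.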

When using the lists method to add an irreversible reaction to a
model, it is important to enter a value of false for the ‘reversible’
input. Since the default value is true, when adding a reversible
reaction, you can skip entering a value for this input. Also,
regardless of which method you use to enter a new reaction, be sure to add
compartment identifiers for the reactions and metabolites that are
consistent with the original model.
Each GSM serves as a knowledgebase for the organism that is
being modeled. Thus, there is a lot of additional critical and
ancillary information associated with a reaction that can also be
added using the addReaction command. For example, the
ASPTA reaction can be added in this way:

model = addReaction(model, 'ASPTA','metaboliteList',...
{'akg[c]', 'glu-L[c]', 'asp-L[c]', 'oaa[c]'},...
'stoichCoeffList', [-1; 1; -1; 1], 'reversible', true,...
'lowerBound', -10, 'upperBound', 10,'objectiveCoef',...
0, 'subSystem','Alanine and Aspartate Metabolism',...
'geneRule','b0928','checkDuplicate', false,'printLevel',1);

While the additional information included in the above command
is not essential for adding a new reaction to the model, it
is necessary for conducting some COBRA analyses. For
example, by entering the information about which gene’s product
catalyzes the reaction (geneRule), you will be able to conduct in
silico gene knockout analysis to assess whether that gene is essential
for production of biomass, or any other biological objective that
you seek to optimize (see Note 5).
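
The logic behind a gene knockout can be illustrated by evaluating gene rules as Boolean expressions. The following is a minimal sketch in plain MATLAB, not the toolbox's implementation. The rules are taken from examples in this chapter, while the knockout list is hypothetical. A gene is treated as true unless it has been deleted, and the words and/or are replaced by logical operators.

```matlab
% Gene rules taken from examples in this chapter.
rules    = {'b0928', 'b0828 or b1767 or b2957', 'G2 and G3'};
knockout = {'b0828', 'G3'};          % hypothetical set of deleted genes

active = false(size(rules));
for r = 1:numel(rules)
    % Pad parentheses so that every token is separated by spaces.
    txt  = strrep(strrep(rules{r}, '(', ' ( '), ')', ' ) ');
    toks = strsplit(strtrim(txt), ' ');
    for t = 1:numel(toks)
        switch toks{t}
            case 'and'
                toks{t} = '&&';
            case 'or'
                toks{t} = '||';
            case {'(', ')'}
                % parentheses are passed through unchanged
            otherwise
                % a gene: true unless it has been knocked out
                if ismember(toks{t}, knockout)
                    toks{t} = 'false';
                else
                    toks{t} = 'true';
                end
        end
    end
    % Evaluate the resulting Boolean expression.
    active(r) = eval(strjoin(toks, ' '));
end
disp(active);   % true where a reaction can still be catalyzed
```

With this knockout list, the rule that joins its genes with or stays active because two of its genes remain, whereas the rule that joins them with and is switched off by the loss of a single gene.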
It is important to note that depending on the method you
choose for adding a reaction, the input fields for the other type of
addition need to be empty. Thus, the ‘reactionFormula’ input field
has to stay empty when one uses the lists method; while the ‘meta-
boliteList’ and ‘stoichCoeffList’ inputs need to remain empty when
using the formula method.

Adding Multiple Reactions

When adding multiple reactions to a model, the addMultipleReactions
command should be used to accelerate the process. For
example, the following command can be used to add two new
reactions (L-aspartate:L-glutamine amido-ligase, which synthesizes
asparagine, and L-asparagine amidohydrolase) to the E. coli model:

model=addMultipleReactions(model,{'ASNS3_1','ASNN'},...
{'asp-L[c]','atp[c]','nh4[c]','gln-L[c]','amp[c]', ...
'asn-L[c]','ppi[c]','glu-L[c]','h2o[c]'},[-1 1;-1 0; ...
-1 1; -1 0;1 0; 1 -1; 1 0; 1 0; -1 0],'rxnNames',...
{'L-aspartate:L-glutamine amido-ligase',...
'L-asparagine amidohydrolase'},'lb',[0,0],'ub', ...
[10,10],'subSystems',...
{'Alanine, aspartate and glutamate metabolism',...
'Alanine, aspartate and glutamate metabolism'},'grRules',...
{'b0677','b0828 or b1767 or b2957'});

This command greatly facilitates the addition of multiple reactions
because all of the involved metabolites are listed only once. The
participation of a metabolite in any number of reactions is
specified simply by typing the stoichiometric coefficient of that
compound for each of the newly added reactions (one row per
metabolite, one column per reaction). The stoichiometric
coefficient for a compound is set to zero if it does not have a role in
a reaction.
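
The row-and-column layout of such a coefficient matrix can be made concrete with a small example. The sketch below is plain MATLAB and uses hypothetical metabolites and reactions (A to D, R1 and R2), not those of the E. coli model. It rebuilds each reaction formula from its column.

```matlab
% Hypothetical metabolites and reactions, used only for illustration.
mets = {'A[c]', 'B[c]', 'C[c]', 'D[c]'};
rxns = {'R1', 'R2'};
% One row per metabolite, one column per reaction; 0 = not involved.
S = [-1  0;    % A is consumed by R1
     -1  1;    % B is consumed by R1 and produced by R2
      1 -1;    % C is produced by R1 and consumed by R2
      0 -1];   % D is consumed by R2

formulas = cell(size(rxns));
for j = 1:numel(rxns)
    % Negative entries are substrates, positive entries are products.
    substrates = strjoin(mets(S(:, j)' < 0), ' + ');
    products   = strjoin(mets(S(:, j)' > 0), ' + ');
    formulas{j} = sprintf('%s: %s -> %s', rxns{j}, substrates, products);
    disp(formulas{j});
end
```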
If one is interested in checking the formula for the newly added
reactions, or any other reaction in the model, the
printRxnFormula command can be used:

printRxnFormula(model,'rxnAbbrList',{'ASNS3_1','ASNN'}, ...
'metNameFlag',false,'gprFlag',true);

Metabolite names will be used in the reaction formulas if the
metNameFlag value is true.

Adding Non-metabolic Reactions

While you can add any type of reaction using the addReaction
and addMultipleReactions commands, the COBRA toolbox also
contains a number of commands for quick addition of non-metabolic
reactions. For example, demand reactions are irreversible reactions
that represent metabolite consumption processes when one is not
concerned with the subsequent fate of the compound. For such
cases, the addDemandReaction command can be used. To add a
demand reaction to the model for cytoplasmic oxaloacetate,
one can use:

model=addDemandReaction(model,'oaa[c]');

In the same vein, all GSMs include a significant number of
exchange reactions. Exchange reactions allow import or export of
metabolites across the model system boundary, which is
typically the border between the extracellular compartment
and the surrounding environment. To add an exchange reaction
to the model for oaa[e] use:

model=addExchangeRxn(model,'oaa[e]',0,0);

Exchange reactions are reversible and, unless the lower and
upper bounds are set (as shown above), the default minimum/
maximum flux constraints (-1000 and 1000) will be set for the
new reaction. By setting the lower and upper bound of a reaction to
the same value, you can fix the flux value for that reaction. In the
above case, setting the lower and upper bounds of the exchange
reaction to zero adds the reaction to the model, but it will be unable
to carry any flux. In order for the reaction to become active, its flux
constraints need to be changed.
The newly added exchange reactions will be named with an
EX_ prefix followed by the metabolite identifier (e.g., EX_oaa[e]).
The new demand reactions will be named with a DM_ prefix
followed by the identifier (e.g., DM_oaa[c]).
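
The effect of flux bounds can be illustrated without the toolbox. In the plain MATLAB sketch below, the reaction names, bounds, and flux vector are hypothetical. It flags reactions whose bounds pin the flux to a single value and checks whether a candidate flux vector respects every bound.

```matlab
% Hypothetical bounds for three exchange reactions.
rxns = {'EX_a[e]'; 'EX_b[e]'; 'EX_oaa[e]'};
lb   = [-1000; -10; 0];
ub   = [ 1000;   0; 0];

isFixed   = (lb == ub);              % flux pinned to a single value
isBlocked = isFixed & (lb == 0);     % pinned at zero: cannot carry flux

% Check whether a candidate flux vector respects every bound.
v = [25; -4; 0];
feasible = all(v >= lb & v <= ub);

disp(rxns(isBlocked));
disp(feasible);
```

Here only the third reaction is blocked, which mirrors the oaa[e] exchange added above with both bounds set to zero.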

Removing Reactions

The removeRxns command is used to delete one or more reactions
from a model. For example, to remove the demand reaction
that was added in the previous section, the command would be:

model = removeRxns(model,'DM_oaa[c]','metFlag', false);

The default value for metFlag is true, so when one or more
reactions are removed from the model, all unused metabolites will
be removed from the model as well. When metFlag is false, the
unused metabolites will remain in the model.

Modifying Reactions

The greatest advantage of computational modeling over empirical
analyses is the ease with which one can manipulate the system
variables and structures of in silico models. This allows for easy
examination of the role and significance of different system components
and processes. One can update reactions with new information
using the addReaction command. However, to avoid
reentering all the reaction-associated information every time one
wants to update a value, a number of shortcut commands have been
developed. The changeGeneAssociation command allows for
rapid update of gene associations (see Note 6). For example, to
update the gene rule for the previously added ASNS3_1 reaction,
the command would be:

model=changeGeneAssociation(model,'ASNS3_1','b0674');

To set or update the metabolic pathways that reactions are
associated with, one can use the setRxnSubSystems command. For
example:

model=setRxnSubSystems(model, 'EX_oaa[e]', 'Exchange');

To change the upper and lower bounds of a reaction, the
changeRxnBounds command can be used:

model=changeRxnBounds(model,{'RPI','ICL','ME1'},-10,'l');

It is important to note that unlike most other COBRA toolbox
commands, in the changeRxnBounds command the boundary
value input precedes the identification of the bound type. The
bound type identifiers are l for lower bound, u for upper bound,
and b for both.

3.3.3 Reordering/Cleaning a Model

Changing a model can sometimes make it unruly. Removing
metabolites or reactions could result in reactions, metabolites, or genes
that have become ‘orphans’. This means that they no longer have
any role in the model and their associated values in the S matrix are
zero. These entities can be removed using a number of different
commands.
To remove metabolites and reactions that are no longer part of
the model, the removeTrivialStoichiometry command can
be used:

updatedmodel=removeTrivialStoichiometry(model);

To remove unused genes in a model, one can use the
removeUnusedGenes command:

updatedmodel=removeUnusedGenes(model);

It is good practice to give the modified model a different
name so that one (a) retains a copy of the original model and (b) can
compare the models.
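
What ‘orphan’ means in terms of the S matrix can be shown on a small example. The sketch below is plain MATLAB with a hypothetical three-metabolite, two-reaction matrix. It does not call the cleaning commands above; it only locates the all-zero rows and columns that those commands are meant to remove.

```matlab
% Toy stoichiometric matrix after a hypothetical edit:
% row 3 and column 2 no longer contain any nonzero entries.
mets = {'A[c]'; 'B[c]'; 'C[c]'};
rxns = {'R1'; 'R2'};
S = [-1 0;
      1 0;
      0 0];

orphanMets = mets(~any(S, 2));   % metabolites that appear in no reaction
emptyRxns  = rxns(~any(S, 1));   % reactions that involve no metabolite
disp(orphanMets);
disp(emptyRxns);
```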

3.4 Creating a COBRA Model

So far, all the commands that have been discussed involve reading
and modifying existing models. However, the COBRA toolbox can
also be used to develop new models. The easiest way to do this
is to start a model using the createModel command. For
example, to create a three-reaction model consisting of
these reactions with associated information:

reaction 1 (R1): C1 + E <=> C1E    associated gene: G1          lb: -100  ub: 50
reaction 2 (R2): C1E -> C2E        associated genes: G2 and G3  lb: 0     ub: 75
reaction 3 (R3): C2E <=> C2 + E    associated genes: G4 or G5   lb: -10   ub: 100

the command would be:

newmodel=createModel({'R1','R2','R3'},{'reaction 1',...
'reaction 2','reaction 3'},{'C1 + E <=> C1E',...
'C1E -> C2E','C2E <=> C2 + E'},'lowerBoundList',...
[-100 0 -10],'upperBoundList',[50,75,100],...
'subSystemList', {'Omega subsystem','Omega subsystem',...
'Omega subsystem'},'grRuleList',{'G1','G2 and G3',...
'G4 or G5'});

Here newmodel is the MATLAB name for the model you are
developing. If you look in the workspace window you will see that
newmodel has been added. After you enter this command you will
also receive a series of messages that inform you about which
metabolites and reactions have been added to the model (see Fig. 3).
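
The stoichiometric matrix implied by these three reactions can be written out by hand. The sketch below is plain MATLAB and does not call createModel. The row order of the metabolites is chosen here for readability and may differ from the order the toolbox uses, and the flux vector v is a hypothetical example.

```matlab
% Stoichiometry implied by R1-R3 (rows and columns ordered by hand).
mets = {'C1'; 'E'; 'C1E'; 'C2E'; 'C2'};
rxns = {'R1', 'R2', 'R3'};
S = [-1  0  0;    % C1
     -1  0  1;    % E
      1 -1  0;    % C1E
      0  1 -1;    % C2E
      0  0  1];   % C2
lb = [-100;  0; -10];
ub = [  50; 75; 100];

% Hypothetical flux vector: 5 units through every reaction.
v = [5; 5; 5];
withinBounds = all(v >= lb & v <= ub);

% Net production (+) or consumption (-) of each metabolite.
netChange = S * v;
disp([mets, num2cell(netChange)]);
```

With equal flux through the three reactions, E and the two complexes are balanced, while C1 is consumed and C2 is produced. A model intended for steady-state analysis would therefore also need exchange reactions for C1 and C2.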

Fig. 3 After using the createModel command, the new model will be created as a series of arrays. The
name given to the new model will be displayed in the Workspace panel. The toolbox will also print the list of
metabolites and reactions that are added as part of the new model

3.4.1 Setting/Changing the Objective Function

Examining the workings of a biosystem using COBRA models
requires optimization of a biological objective. This necessitates
designating one or more biochemical reactions as the process(es)
whose activity will be optimized. An objective function
specifies the contribution of each reaction to the overall numerical value
of the optimized problem. Changing the model’s objective
function permits examination of metabolic differences resulting from
different environmental conditions and varying system priorities.
The system objective that is most often optimized in COBRA
studies is the production of biomass, i.e., growth [3]. However,
other objectives, such as production of some compound of interest,
can also be studied using COBRA methods. To add or change a
model’s objective function, the changeObjective command can be used.
For example, to make R2 the objective function in the
newly created model, the command would be:

newmodel=changeObjective(newmodel,'R2');

One can use the checkObjective command to check
the objective function of a model:

obf=checkObjective(newmodel)

obf (or any other name you choose) is a vector that will contain
the names of all the biochemical processes that contribute to the
model’s objective function.
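
Numerically, an objective function is a vector of coefficients with one entry per reaction. The sketch below is plain MATLAB and assumes the usual COBRA convention of storing these coefficients in a vector c. The flux vector is hypothetical. It shows how placing a 1 at R2 makes the flux through R2 the quantity being optimized.

```matlab
% Reactions of the three-reaction example, with the objective placed on R2.
rxns = {'R1'; 'R2'; 'R3'};
c = zeros(numel(rxns), 1);
c(strcmp(rxns, 'R2')) = 1;           % weight of each reaction in the objective

% Value of the objective for a hypothetical flux vector.
v = [5; 5; 5];
objectiveValue = c' * v;

% Reactions that contribute to the objective.
objectiveRxns = rxns(c ~= 0);
disp(objectiveRxns);
disp(objectiveValue);
```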

3.4.2 Writing Models COBRA models can be saved and exported in a variety of different
in Different Format formats using the writeCbModel command. As noted earlier,
SBML is the most common format for publishing GSM models.
To save the newly created model in SBML format, the command is:

writeCbModel(newmodel, ’format’,’sbml’);

Because the name of the output file is not specified in the above
command, a pop-up window will appear (see Fig. 4) that allows the
user to name the output file (see Note 7). This happens only when
saving a model in SBML. The models can also be saved in text,
Excel (xls), and MATLAB (mat) formats. For these formats, however,
a file name needs to be added as an input. For example, to
save the model in MATLAB and Excel formats, the commands
would be:

writeCbModel(newmodel, 'format','mat', 'fileName',...
    'NewModel.mat');
writeCbModel(newmodel, 'format','xls', 'fileName',...
    'NewModel.xls');

Another format that is favored by some COBRA practitioners is
the mathematical programming system (mps) format. While the
writeCbModel command does not save a model in mps, the following
two commands can:

convertCobraLP2mps(newmodel,'NewModel');
writeLPProblem(newmodel,'fileName', 'NewModel.mps');

3.5 System-Level COBRA Analyses

The COBRA toolbox can be used to conduct a large variety of
system-level analyses. Here, only the protocols for a few of the most
commonly used methods of analyzing the basic metabolic
characteristics of a system and its robustness are detailed. The reader is
encouraged to read reviews by Oberhardt et al. [25], Bordbar
et al. [26], and Fang et al. [1] to learn about some of the other
COBRA analytical methods. The manuscript introducing the third
version of the toolbox [10] is another great resource for learning
about COBRA toolbox analyses, as well as the latest updates to the
program.

3.5.1 Flux Balance Analysis

Flux balance analysis (FBA) is the most widely used COBRA
method for analyzing GSMs [11, 27]. FBA places a number of
constraints on the genome-scale reconstructions of metabolism
and uses linear programming to solve for a metabolic flux pattern
that will result in the optimum outcome for a biosystem objective.
The system objective that is commonly optimized by FBA is
biomass production, i.e., growth. However, other biological activities
such as maximizing a system’s energy production, its metabolic
efficiency, or the production of a compound of interest can also
be selected as an FBA objective function. The major constraints
that are placed on a system for FBA are limitations on the rate of
import/export of metabolites and an assumption that the system is
operating at steady state, i.e., the rates of production and
consumption of each metabolite in the system are equal to one another
and thus the overall change in the concentration of these
compounds is zero. In order for a compound to be excluded from the
steady-state constraint, there needs to be a demand, sink, or
exchange reaction associated with that metabolite.
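The two kinds of constraints described above can be checked by hand on a small example. In the standard formulation, FBA maximizes c'v subject to S*v = 0 and lb <= v <= ub, where S is the stoichiometric matrix and v the flux vector. The sketch below uses a hypothetical three-reaction chain (uptake of A, conversion of A to B, export of B) and base MATLAB only; it tests whether a candidate flux vector satisfies the constraints and does not solve the optimization itself.

```matlab
% Hypothetical toy network: R1: -> A,  R2: A -> B,  R3: B ->
% Rows of S are metabolites (A, B); columns are reactions (R1, R2, R3).
S  = [1 -1  0;
      0  1 -1];
lb = [0;    0;    0];     % lower flux bounds
ub = [10; 1000; 1000];    % upper flux bounds (uptake limited to 10)

v = [10; 10; 10];         % candidate flux distribution

% Steady state: production and consumption of every metabolite balance.
isSteady = all(abs(S*v) < 1e-9);
% Capacity constraints: every flux lies within its bounds.
inBounds = all(v >= lb & v <= ub);

% A vector that accumulates metabolite A violates the steady-state rule.
vBad = [10; 5; 5];
isSteadyBad = all(abs(S*vBad) < 1e-9);

fprintf('steady: %d, in bounds: %d, bad vector steady: %d\n', ...
    isSteady, inBounds, isSteadyBad);
```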

Fig. 4 If one is saving a model in the SBML (.xml) format using writeCbModel command, and the name of
the new model file is not included in the command, then the toolbox will provide a pop-up window that will ask
for the name of the model file. This does not happen when saving in other file formats

In the COBRA toolbox, a GSM that is sufficiently constrained and
has a valid objective function can be solved by FBA using the
optimizeCbModel command. For example, to analyze the
E. coli core metabolism model for maximum production of
biomass, the command can be:

FBAsol=optimizeCbModel(model,'max');

FBAsol (see Note 8) will be a collection of 14 values and arrays
that contain information about the outcome of the FBA simulation
(see Fig. 5). It is usually a good choice to first examine the ‘stat’
value, which indicates whether an FBA simulation successfully
calculated the optimum value of the objective function. If the stat
value is 1, then the optimum value for the objective function was
calculated. Other values indicate that the model as it stands has issues
that need to be addressed. For example, if the stat value is zero,
then the system as it is currently constrained has no feasible
solution and therefore cannot grow. This usually indicates that two or
more constraints are at odds with one another. For example, if you
have a fixed rate of ATP consumption (usually the energy needed for
non-growth maintenance of a system), and the constraints for import
of nutrients do not suffice for generating this minimal amount of
ATP, then the model as it is constrained has no feasible solution.
A stat value of 2 indicates that the model is not sufficiently
bounded and can have an unlimited value for the objective function.
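The status check described above can be wrapped in a few lines so that every simulation is inspected before its fluxes are used. The sketch below is a minimal example in base MATLAB; the FBAsol structure is filled with hypothetical values so the block runs on its own, and only the three status codes discussed in the text are handled explicitly.

```matlab
% Hypothetical solution structure standing in for the output of
% optimizeCbModel (values chosen for illustration only).
FBAsol.stat = 1;
FBAsol.f    = 0.87;

switch FBAsol.stat
    case 1
        msg = sprintf('Optimal solution found, objective value = %.2f', ...
            FBAsol.f);
    case 0
        msg = 'Infeasible: two or more constraints are at odds.';
    case 2
        msg = 'Unbounded: the model is not sufficiently constrained.';
    otherwise
        msg = 'Other solver status: inspect origStat for details.';
end
disp(msg)
```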

Fig. 5 The output of FBA simulation (FBAsol in this case) will provide: f (the optimum value of the objective
function), v (flux value for every reaction in the model), y (shadow prices, a measure of sensitivity of objective
function value to constraints [45]), w (reduced costs, a measure of sensitivity of the objective function to
changes in flux values [45]), s (slacks of the metabolites), solver (name of the solver used for FBA), stat (solver
status), origStat (the original status returned by the solver), and x (a legacy array similar to v)

Print the Reaction Flux Values

The v vector will contain the reaction rates that result in the
optimum value of the objective function. The x vector is a legacy
vector and will contain the same values as v. To access the v and x
vectors from the command line, FBAsol.v and FBAsol.x should be
used (see Note 8).
To print the calculated fluxes, or a specific subset of them, the
printFluxVector command can be used. For example, if one wants to
examine the flux values of all the reactions in the model, this
command can be used:

printFluxVector(model, FBAsol.v)

To examine only those reactions that carry a flux, that choice
has to be included as an input into the command:

nonZeroFlag = 1;
fluxes = FBAsol.v;
printFluxVector(model, fluxes, nonZeroFlag);

To print only the fluxes for the exchange reactions, another
default input value of the command has to be changed:

excFlag=1;
printFluxVector(model, fluxes, nonZeroFlag, excFlag);

Examine the Activity of Individual Model Components

After an FBA simulation, it often becomes necessary to examine
how a specific component of the model behaves when the system
operates at the optimal value of the objective function. A very useful
command in the COBRA toolbox for this type of analysis is the
surfNet command.

Fig. 6 Example output from surfNet command. This command can be used to examine the activity of
metabolites, reactions, and gene-associated reactions

The command accepts as input a list of model
components and returns information about those entities. For
example, if one is interested in examining the activity of the reaction
labeled ACKr, reactions associated with the metabolite 6pgc in the
cytoplasm (6pgc[c]), and reactions associated with the gene b0720,
this command can be used:

surfNet(model,{'ACKr','6pgc[c]','b0720'},'metNameFlag',...
    true,'flux', FBAsol.v,'nonzeroFluxFlag',false);

By setting the ‘metNameFlag’ value to true, the command
returns results that include the full name of the metabolites instead
of short acronyms or database identifiers. The result of the
command is shown in Fig. 6. By setting the value of the
‘nonzeroFluxFlag’ to false, it becomes possible to see the behavior of all the
reactions associated with a compound or gene. If the value is true,
then only those reactions that are active will be displayed. In case of
reactions, if they are in the list of examined components, their
outcome will be listed even if the flux value is zero.

3.5.2 Flux Variability Analysis

The solution provided by FBA is not unique and other metabolic
flux patterns can also provide the same maximum value of a model’s
objective function. Flux variability analysis (FVA) is a method for
examining the range of fluxes that each reaction can have while the
objective function is at its maximum value [12]. FVA allows for a
number of different system-level studies, including investigation of
alternative optimal flux patterns, analyses of flux distributions
under suboptimal growth conditions [28], and examination of
network redundancy [29]. The knowledge gained from FVA has
significant importance for successful design of various types of
metabolic engineering projects (e.g., [30–32]).
The command fluxVariability is used for FVA in COBRA
toolbox. To conduct a system-level FVA for the E. coli core model,
this command can be used:

[minFlux, maxFlux] = fluxVariability(model,'optPercentage',...
    100,'osenseStr','max','rxnNameList', model.rxns);

As with most other COBRA toolbox commands, the
fluxVariability command also accepts a number of inputs that
enhance the user’s ability to design the analysis to answer specific
questions. In the above command, only some of those inputs have
been entered. The value given to the optPercentage parameter
ensures that only those reaction fluxes that allow at least the stated
percentage of the optimum value of the objective function are
reported. The default value is 100, meaning that only flux ranges
that result in the exact maximum value of the objective function are
reported. The osenseStr value indicates the objective sense, i.e.,
whether the objective function is maximized (‘max’) or minimized
(‘min’). The former is the default value. If the ‘rxnNameList’
parameter is filled with the name of an array that includes only a
subset of the model’s reactions, then FVA is conducted only for
that subset of reactions. The default value allows for all the
reactions in the model to be examined. The minFlux and maxFlux
arrays respectively contain the minimum and maximum
flux values that each reaction can have at the optimum objective
function value.
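Once the two arrays are available, a common next step is to sort reactions by how much freedom they have at the optimum. The sketch below does this in base MATLAB for hypothetical minFlux and maxFlux values; the reaction names, the numbers, and the tolerance are illustrative assumptions rather than output from the E. coli core model.

```matlab
% Hypothetical FVA output for four reactions.
rxns    = {'R1'; 'R2'; 'R3'; 'R4'};
minFlux = [10;  0; -5; 0];
maxFlux = [10;  8;  5; 0];
tol     = 1e-6;                      % numerical tolerance (assumed)

span = maxFlux - minFlux;            % width of each allowed flux range

isVariable = span >= tol;                              % flexible at optimum
isFixed    = (span < tol) & (abs(minFlux) > tol);      % one non-zero value
isInactive = (abs(minFlux) < tol) & (abs(maxFlux) < tol);  % always zero

fprintf('Variable: %s\n', strjoin(rxns(isVariable)', ', '));
fprintf('Fixed:    %s\n', strjoin(rxns(isFixed)', ', '));
fprintf('Inactive: %s\n', strjoin(rxns(isInactive)', ', '));
```

Reactions with a wide range point to alternative optimal flux patterns, whereas fixed non-zero reactions are used identically in every optimal solution.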

3.5.3 Parsimonious Enzyme Usage FBA

As noted above, the flux pattern calculated by FBA is not a unique
solution. It is one of many flux patterns that could result in the same
maximum value of the objective function. To reduce the variability
of the fluxes, i.e., shrink the size of the multidimensional solution
space of the model, one has to constrain the model with additional
information or assumptions. In some cases, information such as
measurements of changes in gene expression or protein
levels has been used to constrain the models (e.g., [33–36]). In
other cases, thermodynamics [37] and metabolite levels [38] have
been used as constraints. Overall, it has been shown that integrating
even a handful of experimentally measured fluxes into FBA models
could drastically reduce the variability of the predicted fluxes [39].
Parsimonious enzyme usage FBA (pFBA) [14] is the most
widely used COBRA method for reducing the variability of the
flux predictions. To achieve this, pFBA uses bilevel optimization
to ensure that the model predicts the most enzymatically efficient
flux distribution for a metabolic objective. FBA is used first to find
the maximum value of the model’s objective function. Then,
assuming that the fastest growing phenotype uses the minimum
amount of enzyme necessary, the sum of the fluxes in the metabolic
network is minimized at the fixed optimum growth value. pFBA has
been shown to reproduce experimentally measured fluxes and may,
in some cases, even outperform methods constrained by gene
expression data [40].
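The second step of pFBA can be illustrated without a solver. The sketch below compares two hypothetical flux distributions that reach the same growth value and keeps the one with the smaller sum of absolute fluxes. This is only a caricature of the idea: the actual method minimizes total flux over the whole solution space with linear programming, and the vectors vA and vB are invented for illustration.

```matlab
% Two hypothetical flux distributions with the same growth flux (last row).
% vA uses a short route; vB reaches the same growth through a longer one.
vA = [10; 10; 0; 0; 1];
vB = [10;  0; 8; 8; 1];
candidates = [vA, vB];

% Both candidates achieve the same objective value.
sameGrowth = candidates(end, 1) == candidates(end, 2);

% Parsimony criterion: smallest sum of absolute fluxes wins.
totalFlux = sum(abs(candidates), 1);
[minTotal, best] = min(totalFlux);

fprintf('Total flux: A = %g, B = %g, selected candidate %d\n', ...
    totalFlux(1), totalFlux(2), best);
```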
The command for parsimonious FBA in the COBRA toolbox is
pFBA. To analyze the model with pFBA, the command can be
written as:

[GeneClasses, RxnClasses, modelIrrevFM, MinimizedFlux] = ...
    pFBA(model);

The result of the pFBA can be used to classify all the reactions
and genes in a model based on their contribution to the goal of
reaching the optimum value of the objective function. The results
of the classifications are included in two sets of arrays that, in the
above example, are labeled GeneClasses and RxnClasses. The results
can be accessed by clicking on the chosen names in the workspace
panel (see Fig. 7). The genes are divided into six categories:
1. Essentials (pFBAEssential): these are genes that are necessary
for growth in the modeled media.
2. pFBA optima: these are non-essential genes that are needed for
achieving the maximum growth rate at minimum fluxes.
3. Enzymatically less efficient (ELEGenes): if the reactions
associated with these genes are used to achieve the fastest growth
rate, then the sum of fluxes through the system will be greater
than that predicted by pFBA; i.e., there are alternative pathways
that can achieve the same growth rate but with lower combined
fluxes.
4. Metabolically less efficient (MLEGenes): using the pathways
associated with these genes will result in a suboptimal
growth rate.
5. pFBA no-flux (ZeroFluxGenes): genes that cannot carry a flux
for the modeled conditions.
6. Blocked (Blockedgenes): the metabolic network as modeled
cannot accommodate flux through the reactions associated
with these genes.

Fig. 7 List of outputs from a pFBA simulation

The version of the model used to calculate the minimum flux
through the system is the modelIrrevFM output of pFBA. In this
model, all the reversible reactions of the original model are replaced
by two irreversible reactions. The pFBA predicted minimal flux
(v or x) can be accessed from the MinimizedFlux set of arrays.

3.5.4 Perturbation Analyses

GSMs can be analyzed with COBRA methods to assess a system’s
robustness to genetic and environmental perturbations. This is
achieved by varying flux through one or more reactions and using
FBA to see how the change affects the optimum value of the
model’s objective function.

Single-Reaction Knockouts

GSMs and FBA can be used to evaluate if the activity of a single
biochemical reaction is essential for cellular growth or some other
biological objective. The command singleRxnDeletion() in
the COBRA toolbox allows one to test the effect of single-reaction
knockouts (SRKO). The command for SRKO on the E. coli model
can be:

[grRatioSRKO, grRateSRKO, grRateWT, hasEffectRxns, ...
    delrxnSRKO, fluxSolutionSRKO] = singleRxnDeletion(model,...
    'FBA');

The command can accept a third input consisting of an array of
reaction names. If the third input is added, instead of all the model
reactions being analyzed, only the deletion of the selected reactions
will be analyzed.
As with pFBA, the output from the singleRxnDeletion
command is a number of arrays. The first array (labeled here
grRatioSRKO) contains the ratio of the growth rates between the
knockout and the wild-type strains. If the value is 1, deleting that
reaction does not have any effect on the ability of the system to
grow at the fastest possible rate. A zero value means that
eliminating the activity of that reaction causes the system to stop growing.
Reactions that have values between 0 and 1 are those whose activity
is needed for achieving the fastest possible growth rate given the
constraints that are placed on the system.
The grRateSRKO array in the above example contains the
maximum growth rate for the system following the deletion of each
reaction. The third output (grRateWT) is the FBA-calculated value of
the model’s objective function prior to reaction deletions. The
hasEffectRxns array is a Boolean array that indicates whether the activity
of a reaction affects (1) or does not affect (0) the maximum value of
the model’s objective function. To identify the deleted reaction, the
user can use the delrxnSRKO array (or the rxns array of the
model). Finally, the two-dimensional array labeled
fluxSolutionSRKO contains the value of flux through every network
reaction (rows) following deletion of a reaction (column).
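The three cases described for the growth-rate ratio (0, between 0 and 1, and 1) translate directly into a small classification step. The sketch below applies it to a hypothetical ratio vector in base MATLAB; the reaction names, the values, and the tolerance are assumptions made for illustration and do not come from the E. coli core model.

```matlab
% Hypothetical knockout results for four reactions.
rxns        = {'R1'; 'R2'; 'R3'; 'R4'};
grRatioSRKO = [1; 0; 0.62; 1];
tol         = 1e-6;                  % numerical tolerance (assumed)

% Ratio of 0: the system stops growing without this reaction.
essential = rxns(grRatioSRKO < tol);
% Ratio between 0 and 1: growth continues at a reduced rate.
reduced   = rxns(grRatioSRKO >= tol & grRatioSRKO < 1 - tol);
% Ratio of 1: the deletion has no effect on the optimum.
noEffect  = rxns(grRatioSRKO >= 1 - tol);

fprintf('Essential: %s\n', strjoin(essential', ', '));
fprintf('Reduced:   %s\n', strjoin(reduced', ', '));
fprintf('No effect: %s\n', strjoin(noEffect', ', '));
```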

Single-Gene Knockouts

Every GSM contains a gene-protein-reaction table which identifies
the gene(s) that are needed for protein catalysis of each network
reaction. In the COBRA toolbox, this information can be found in
the rules and grRules arrays of the model. The information that is
contained in the former array can be used to conduct in silico gene
knockouts in order to identify those genes whose activity affects a
certain behavior of a biosystem. For example, if all reactions that are
dependent on the activity of a gene are deleted and FBA predicts
that the mutant strain of the system does not produce biomass,
then that gene’s presence is essential for the system’s growth.
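To make the link between a gene deletion and the reactions it disables more concrete, the sketch below evaluates a set of Boolean rules for a single knocked-out gene. It assumes the rules are written in the form 'x(1) & x(2)', where x(i) is true when gene i is present; the genes, reactions, and rules are hypothetical, and the loop is a simplified illustration rather than the toolbox's own routine.

```matlab
% Hypothetical genes, reactions, and gene-protein-reaction rules.
genes = {'gA'; 'gB'; 'gC'};
rxns  = {'R1'; 'R2'; 'R3'};
rules = {'x(1) & x(2)';    % R1 needs both gA and gB (enzyme complex)
         'x(2) | x(3)';    % R2 needs gB or gC (isozymes)
         ''};              % R3 has no gene association

knockedOut = 'gB';
x = ~strcmp(genes, knockedOut);      % true = gene still present

disabled = false(numel(rxns), 1);
for i = 1:numel(rules)
    if ~isempty(rules{i})
        isActive    = eval(rules{i});    % evaluate the rule using x
        disabled(i) = ~isActive;
    end
end

fprintf('Reactions disabled by deleting %s: %s\n', knockedOut, ...
    strjoin(rxns(disabled)', ', '));
```

In this toy case, R2 survives the deletion because the isozyme encoded by gC can still catalyze it, which is the kind of redundancy that single-gene knockouts cannot expose on their own.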
To conduct single-gene knockouts (SGKO) using the COBRA
toolbox, the singleGeneDeletion command is used. The com-
mand for the SGKO simulation of the model can be:

selGenes={'b0008','b0114','b0451','b0720','b0729','b1136'};
[grRatioSGKO, grRateSGKO, grRateWT, hasEffectgenes,...
    delRxnsSGKO,fluxSolutionSGKO]=singleGeneDeletion(model,...
    'FBA',selGenes);

The output from this operation is very similar to that of the
singleRxnDeletion command, except that for SGKOs, instead of
individual reactions, the results are for inactivating a subset or all of
the genes in the model. The default input for the command is
SGKO of every gene in the model. In order to conduct SGKO
for only a select group of genes, it is necessary to input a list of genes
(as shown above).

Synthetic Lethal Mutations

Many biological systems minimize deleterious outcomes that might
result from environmental and genetic perturbations through
functional redundancy and distributed robustness [41, 42].
Metabolic redundancies are manifested through the presence of
isozymes and alternate pathways. Isozymes are products of
multiple genes that have the capability to catalyze the same
biological reaction. Alternate pathways are composed of different
biochemical reactions, but they share some of their metabolic
byproducts.
When engineering a system, or treating a pathogen or a disease,
it is essential to know whether blocking a reaction or pathway can
be compensated by the activity of alternate enzymes or pathways.
This knowledge can be used to design multipronged treatments for
fast-growing complex diseases. For example, a synthetic lethal
interaction between two genes, the scenario where mutation of either
gene alone does not change a system’s phenotype but knocking out
both genes simultaneously is lethal, has been considered as a means
of treating cancer [43]. To identify synthetic lethal mutations
(SLM) in a biological system, the command doubleGeneDeletion
can be used in the COBRA toolbox:

[grRatioSLM, grRateSLM, grRateWT] = doubleGeneDeletion(model);

Depending on the size and complexity of the model, this
analysis could be time consuming because, unless otherwise stated,
the toolbox will analyze the knockout of every pair of non-essential
genes. For the small E. coli model, this involves examining 9453
double deletions. That would result in 66 times more knockout
simulations than SGKO for the same system. Similar to SRKO and
SGKO, the ratio of the growth rate following each double deletion
compared to wild type, as well as FBA-calculated growth rates are
reported in grRatioSLM and grRateSLM two-dimensional arrays.
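Synthetic lethal pairs can then be extracted from the two-dimensional ratio array. The sketch below does this in base MATLAB for a hypothetical 3 x 3 matrix. It assumes that the matrix is symmetric and that its diagonal holds the single-deletion ratios, which should be verified against the toolbox documentation before the same logic is applied to real output; gene names and values are invented.

```matlab
% Hypothetical double-deletion growth ratios for three genes.
% Assumption: entry (i,j) is the ratio after deleting genes i and j,
% and entry (i,i) is the ratio after deleting gene i alone.
genes      = {'g1'; 'g2'; 'g3'};
grRatioSLM = [1.0 0.0 1.0;
              0.0 0.9 0.9;
              1.0 0.9 1.0];
tol = 1e-6;

single = diag(grRatioSLM);                    % single-deletion ratios

% Lethal pairs, each counted once (upper triangle, diagonal excluded).
[i, j] = find(triu(grRatioSLM < tol, 1));

% Keep only pairs in which neither single deletion is lethal by itself.
keep  = single(i) > tol & single(j) > tol;
pairs = [genes(i(keep)), genes(j(keep))];

for k = 1:size(pairs, 1)
    fprintf('Synthetic lethal pair: %s + %s\n', pairs{k, 1}, pairs{k, 2});
end
```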

Robustness Analysis

SRKO, SGKO, and SLM examine how genetic mutations affect a
system’s phenotype. Similarly, examining how flux changes for each
reaction affect the maximum value of a model’s objective function
provides information about phenotypic changes resulting from
switching between different pathways and variations in rates of
import/export of metabolites. The robustnessAnalysis
command is used for this type of analysis. For example, if one is
interested in knowing how changing the rate of import/export of
CO2 would affect the growth rate for the core E. coli model, the
function can be:

[controlFlux, objFlux] = robustnessAnalysis(model,...
    'EX_co2(e)',25,true,'Biomass_Ecoli_core_w_GAM','max');

The inputs for this function are model name, name of the
control reaction, number of points to be calculated/displayed,
choice of whether a graph of the results should be displayed, the
objective function to be examined, and choice of maximizing or
minimizing the objective function.
Fig. 8 Graph displaying the result of robustnessAnalysis for import/export of CO2. The x-axis shows
the fixed value for import/export flux of CO2 and the y-axis displays the optimum value of the objective function

The robustnessAnalysis operation will provide two output
arrays and a plot. The controlFlux array will contain the set of flux
values for the chosen biochemical reaction. The range of values is
determined by the maximum and minimum amount of flux that the
reaction can carry. The objFlux array will contain the FBA-calculated
maximum values of the objective function when the flux through the
chosen reaction is set to the values reported in controlFlux. The
function will also produce a 2D plot of the values in these arrays
(see Fig. 8).
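The two output arrays can also be examined numerically instead of being read off the plot. The sketch below, in base MATLAB, locates the control flux at which the objective peaks and the range of control fluxes that still permit growth. The controlFlux and objFlux values are hypothetical and use only five points for brevity.

```matlab
% Hypothetical robustness-analysis output (five points for brevity).
controlFlux = [-20; -10; 0; 10; 20];       % fixed flux of the control reaction
objFlux     = [0; 0.4; 0.8; 0.6; 0.2];     % optimum objective at each point

% Control flux at which the objective function is highest.
[bestObj, idx] = max(objFlux);
bestControl    = controlFlux(idx);

% Range of control fluxes for which the system can still grow.
growing     = controlFlux(objFlux > 0);
growthRange = [min(growing), max(growing)];

fprintf('Peak objective %.2f at control flux %g\n', bestObj, bestControl);
fprintf('Growth possible for control flux in [%g, %g]\n', ...
    growthRange(1), growthRange(2));
```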
In order to examine how fluxes through two reactions affect
one another, as well as the maximum value of the objective func-
tion, the doubleRobustnessAnalysis command can be used.
This is a great tool for examining trade-offs between different
biological processes. For example, one might test how the rates of
production of two metabolic byproducts or flux through two path-
ways affect each other as well as the rate of growth. If one wants to
examine how the anabolic gluconeogenesis pathway (represented
by fructose-bisphosphatase, FBP) and the Krebs cycle (represented
by α-ketoglutarate dehydrogenase, AKGDH) affect one another
and the rate of growth of E. coli, the command to use can be:

[controlFlux1, controlFlux2, objFlux] = ...
    doubleRobustnessAnalysis(model,'FBP','AKGDH',25,...
    true,'Biomass_Ecoli_core_w_GAM','max');

The inputs for this function are similar to the ones for the
robustnessAnalysis function with the exception that you
need to list two control reactions instead of one. The output for
this operation consists of three arrays containing the flux values for
the control reactions and the objective function as well as a 3D
graph of the results (unless the fifth input into the function is
‘false’). The graph that is produced by the function is basic (Fig. 9a)
and requires further manipulation before it is ready for publication
(Fig. 9b). A COBRA toolbox addon has recently been developed
for conducting multi-objective flux analyses beyond 3 dimensions
[46].
In conclusion, the COBRA toolbox is as of now the most popular
tool for analyzing GSMs. The variety of different types of analyses
that can be performed using this tool is impressive and continually
expanding. This chapter has barely scratched the surface of the
variety of studies that can be conducted using the COBRA toolbox.
It is only meant to provide a minimal set of instructions for
conducting the most common types of analyses. For further insight, the
reader is encouraged to: (1) read the manuscripts introducing the
various versions of the toolbox [9, 10, 17], (2) visit the COBRA
toolbox website (https://opencobra.github.io/cobratoolbox/stable),
and (3) utilize the tutorials that accompany the installation of
the toolbox. The last resource is the most useful but requires a basic
knowledge of MATLAB.

4 Notes

1. For the COBRA toolbox commands, the courier font has been
used to distinguish them from the rest of the text.
2. When using COBRA toolbox, there are some inputs that need
to be entered in a precise manner. However, users also have a
lot of freedom in naming models, variable and arrays. In this
chapter many commands have been provided as examples
(using the E. coli core metabolism model). In the shown exam-
ple commands, those inputs that can be changed by the user are
in italics.
3. The letter in brackets designation (e.g. [c]) is not universal.
Some models might use _ followed by the letter (e.g. _c) or
parentheses (e.g. (c)).
4. Newly entered reactions can be given any name; however, it is
good practice to make sure the name matches the naming style
of the other reactions in the model. For example, when adding
a reaction to a model from KBase [16], it would be a good
idea to check the ModelSEED [44] database
(https://modelseed.org/biochem) to find the reaction’s name in
that format (e.g., ASPTA = rxn00260).
5. If you do not explicitly set the lower and upper bounds for the
reaction, the default values of −1000 and 1000 will be assigned
to it. If you enter a value other than zero for the ‘objectiveCoef’
input, then the added reaction will be optimized along with
the previously set objective function of the model. If the value
of ‘checkDuplicate’ input is set to true, the COBRA toolbox
will examine the stoichiometric matrix (S) of the model to see if
another reaction with the exact same stoichiometric coefficients
for the specified metabolites exists. If you are updating a
reaction that is already in the model, you will receive this message:
‘Warning: Reaction with the same name already exists in the
model, updating the reaction’. If you are adding a new reaction
and another reaction with the same stoichiometry exists in the
model, then you will get a message from the toolbox stating
‘Warning: Model already has the same reaction you tried to add:’
followed by the name of that reaction. The new duplicate
reaction will not be added to the model. If the ‘checkDuplicate’
value is set to false, the duplicate reaction will be added to the
model. Finally, setting the ‘printLevel’ value to 1 will result in
the formula for the added reaction being printed for your
inspection.

Fig. 9 2D graph showing the outcome of doubleRobustnessAnalysis for activity of α-ketoglutarate
dehydrogenase (AKGDH) and fructose-bisphosphatase (FBP). (a) the output from the command, (b) same
figure manipulated for improved viewing and publication

6. Some of the COBRA toolbox commands that have been
highlighted in this chapter accept additional optional inputs that
could be very useful for specific analysis scenarios. Once readers
have become familiar with the basics of the COBRA toolbox, they
are encouraged to visit the COBRA toolbox webpage
(https://opencobra.github.io/cobratoolbox/stable/index.html)
and learn more about these optional inputs and
capabilities.
7. When saving in the SBML format, there is no need to affix the
.xml extension to the end of the file name. MATLAB automatically
adds it to the chosen name.
8. FBAsol, like model, newmodel, and obf, is an arbitrary output
name that has been used in this chapter. The readers can
choose other names that could be more helpful for their
analyses.

Acknowledgments

This work was funded in part by the DOE OBER Genomic Science
program and LLNL Laboratory Directed Research and Develop-
ment funding and performed under the auspices of the
U.S. Department of Energy at Lawrence Livermore National Lab-
oratory under Contract DE-AC52-07NA27344.

References

1. Fang X, Lloyd CJ, Palsson BO (2020) Reconstructing organisms in silico: genome-scale models and their emerging applications. Nat Rev Microbiol 18:731–743
2. Chaudhury S, Abdulhameed MDM, Singh N, Tawa GJ, D’haeseleer PM, Zemla AT et al (2013) Rapid countermeasure discovery against Francisella tularensis based on a metabolic network reconstruction. PLoS One 8(5):e63369
3. Navid A, Almaas E (2009) Genome-scale reconstruction of the metabolic network in Yersinia pestis, strain 91001. Mol Biosyst 5(4):368–375
4. Navid A (2011) Applications of system-level models of metabolism for analysis of bacterial physiology and identification of new drug targets. Brief Funct Genomics 10(6):354–364
5. Burgard AP, Pharkya P, Maranas CD (2003) Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol Bioeng 84(6):647–657
6. Park JH, Lee KH, Kim TY, Lee SY (2007) Metabolic engineering of Escherichia coli for the production of L-valine based on transcriptome analysis and in silico gene knockout simulation. Proc Natl Acad Sci U S A 104(19):7797–7802
7. Yoshikawa K, Toya Y, Shimizu H (2017) Metabolic engineering of Synechocystis sp. PCC 6803 for enhanced ethanol production based on flux balance analysis. Bioprocess Biosyst Eng 40(5):791–796
8. Fouladiha H, Marashi S-A, Torkashvand F, Mahboudi F, Lewis NE, Vaziri B (2020) A metabolic network-based approach for developing feeding strategies for CHO cells to increase monoclonal antibody production. Bioprocess Biosyst Eng 43:1381–1389
9. Becker SA, Feist AM, Mo ML, Hannum G, Palsson BO, Herrgard MJ (2007) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat Protoc 2(3):727–738
10. Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v. 3.0. Nat Protoc 14:639–702
11. Orth JD, Thiele I, Palsson BO (2010) What is flux balance analysis? Nat Biotechnol 28(3):245–248
12. Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng 5(4):264–276
13. Mahadevan R, Edwards JS, Doyle FJ (2002) Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J 83(3):1331
14. Lewis NE, Hixson KK, Conrad TM, Lerman JA, Charusanti P, Polpitiya AD et al (2010) Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models. Mol Syst Biol 6(1):390
15. Segre D, Vitkup D, Church GM (2002) Analysis of optimality in natural and perturbed metabolic networks. Proc Natl Acad Sci U S A 99(23):15112–15117
16. Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S et al (2018) KBase: the United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36(7):566
17. Schellenberger J, Que R, Fleming RMT, Thiele I, Orth JD, Feist AM et al (2011) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nat Protoc 6(9):1290–1307
18. Thiele I, Palsson BO (2010) A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 5(1):93–121
19. Lieven C, Beber ME, Olivier BG, Bergmann FT, Ataman M, Babaei P et al (2020) MEMOTE for standardized genome-scale metabolic model testing. Nat Biotechnol 38(3):272–276
20. Olivier BG, Bergmann FT (2018) SBML level 3 package: flux balance constraints version 2. J Integr Bioinformatics 15(1):20170082
21. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27
22. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44(D1):D457–D462
23. Schellenberger J, Park JO, Conrad TM, Palsson BO (2010) BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics 11:213
24. King ZA, Lu J, Dräger A, Miller P, Federowicz S, Lerman JA et al (2015) BiGG models: a platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res 44(D1):D515–D522
25. Oberhardt MA, Palsson BO, Papin JA (2009) Applications of genome-scale metabolic reconstructions. Mol Syst Biol 5:320
26. Bordbar A, Monk JM, King ZA, Palsson BO (2014) Constraint-based models predict metabolic and associated cellular functions. Nat Rev Genet 15(2):107–120
27. Varma A, Palsson BO (1994) Metabolic flux balancing: basic concepts, scientific and practical use. Nat Biotechnol 12(10):994–998
28. Reed JL, Palsson BØ (2004) Genome-scale in silico models of E. coli have multiple equivalent phenotypic states: assessment of correlated reaction subsets that comprise network states. Genome Res 14(9):1797–1805
29. Thiele I, Fleming RM, Bordbar A, Schellenberger J, Palsson BO (2010) Functional characterization of alternate optimal solutions of Escherichia coli’s transcriptional and translational machinery. Biophys J 98(10):2072–2081
30. Pharkya P, Maranas CD (2006) An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems. Metab Eng 8(1):1–13
31. Bushell ME, Sequeira SIP, Khannapho C, Zhao H, Chater KF, Butler MJ et al (2006) The use of genome scale metabolic flux variability analysis for process feed formulation based on an investigation of the effects of the zwf mutation on antibiotic production in Streptomyces coelicolor. Enzyme Microb Technol 39(6):1347–1353
32. Feist AM, Zielinski DC, Orth JD, Schellenberger J, Herrgard MJ, Palsson BO (2010) Model-driven evaluation of the production potential for growth-coupled products of Escherichia coli. Metab Eng 12(3):173–186
33. Becker SA, Palsson BO (2008) Context-specific metabolic networks are consistent with experiments. PLoS Comput Biol 4(5):e1000082
34. Zur H, Ruppin E, Shlomi T (2010) iMAT: an integrative metabolic analysis tool. Bioinformatics 26(24):3140–3142
38. Soh KC, Hatzimanikatis V (2014) Constraining the flux space using thermodynamics and integration of metabolomics data. In: Krömer JO, Nielsen LK, Blank LM (eds) Metabolic flux analysis: methods and protocols. Springer, New York, NY, pp 49–63
39. Stewart B, Navid A, Turteltaub K, Bench G (2010) Yeast dynamic metabolic flux measurement in nutrient-rich media by HPLC and accelerator mass spectrometry. Anal Chem 82(23):9812–9817
40. Machado D, Herrgård M (2014) Systematic evaluation of methods for integration of transcriptomic data into constraint-based models of metabolism. PLoS Comput Biol 10(4):e1003580
41. Wagner A (2005) Distributed robustness versus redundancy as causes of mutational robustness. Bioessays 27(2):176–188
42. Whitacre J (2012) Biological robustness: paradigms, mechanisms, and systems principles. Front Genet 3:67
43. O’Neil NJ, Bailey ML, Hieter P (2017) Synthetic lethality and cancer. Nat Rev Genet 18(10):613–623
35. Chandrasekaran S, Price ND (2010) Probabi- 44. Henry CS, DeJongh M, Best AA, Frybarger
listic integrative modeling of genome-scale PM, Linsay B, Stevens RL (2010) High-
metabolic and regulatory networks in Escheri- throughput generation, optimization and anal-
chia coli and Mycobacterium tuberculosis. Proc ysis of genome-scale metabolic models. Nat
Natl Acad Sci 107(41):17845 Biotechnol 28(9):977–982
36. Navid A, Almaas E (2012) Genome-level tran- 45. Maarleveld TR, Khandelwal RA, Olivier BG,
scription data of Yersinia pestis analyzed with a Teusink B, Bruggeman FJ (2013) Basic con-
new metabolic constraint-based approach. cepts and principles of stoichiometric modeling
BMC Syst Biol 6(1):150 of metabolic networks. Biotechnol J 8
37. Henry CS, Broadbelt LJ, Hatzimanikatis V (9):997–1008
(2007) Thermodynamics-based metabolic flux 46. Griesemer M, Navid A. MOFA: Multi-
analysis. Biophys J 92(5):1792–1805 Objective Flux Analysis for the COBRA Tool-
box. bioRxiv. 2021:2021.05.20.445041
Chapter 16

Rules of Engagement: A Guide to Developing Agent-Based Models
Marc Griesemer and Suzanne S. Sindi

Abstract
Agent-based models (ABM), also called individual-based models, first appeared several decades ago with
the promise of nearly real-time simulations of active, autonomous individuals such as animals or objects.
The goal of ABMs is to represent a population of individuals (agents) interacting with one another and their
environment. Because of their flexible framework, ABMs have been widely applied to study systems in
engineering, economics, ecology, and biology. This chapter is intended to guide the users in the develop-
ment of an agent-based model by discussing conceptual issues, implementation, and pitfalls of ABMs from
first principles. As a case study, we consider an ABM of the multi-scale dynamics of cellular interactions in a
microbial community. We develop a lattice-free agent-based model of individual cells whose actions of
growth, movement, and division are influenced by both their individual processes (cell cycle) and their
contact with other cells (adhesion and contact inhibition).

Key words Agent-based models, Individual-based models, Stochastic simulations, Lattice-free

1 Introduction

Agent-based paradigms are used in fields ranging from economics
to biology [1–3]. At its core, any system composed of discrete
individuals interacting with an environment, and perhaps one
another, may be formulated using an agent-based framework.
While not an inherent requirement, agent-based models are usually
developed in contexts where agents are able to move in a two- or
three-dimensional physical environment. Typically, the agent-based
paradigm is employed to study how population-scale behaviors
(such as patterns and population distributions) emerge from the
actions of individuals. As we will discuss in this chapter, agent-based
models (ABMs) represent an important spatial modeling
framework capable of addressing challenges that traditional spatial
models cannot. However, there are trade-offs that come with
employing ABMs that must be considered when selecting computational
tools. The goal of this chapter is to empower users to

Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0_16, © Springer Science+Business Media, LLC, part of Springer Nature 2022


Fig. 1 Different modeling frameworks for studying the spatial dynamics of evolving systems. (a) Reaction-diffusion
systems are specified through partial differential equations: deterministic equations in which the densities of
species evolve in time through reactions with themselves or other species (reactions) and move in space
(diffusion). (b) Cellular automata are discrete-time simulations where agents exist on a rectangular grid and
occupy one of finitely many states (such as white = dead, black = alive). At each time point, agents decide whether to
modify their state according to the states of neighboring grid cells. (c) Agent-based models consist of
individual agents (in this case yeast cells) that grow and divide according to their own specified processes
and by interacting with their environment and other cells around them

determine when ABMs are an appropriate modeling choice and to
guide them in either using existing software for simulating ABMs
or developing their own.
We begin by comparing and distinguishing ABMs among other
types of mathematical and computational methods for studying
spatial dynamics (see Fig. 1). For convenience, we will describe
these modeling frameworks in the context of biological systems,
specifically a growing population of cells. Biology presents many
multi-scale and data availability issues that are challenges to simula-
tion and analysis. First, biological problems may consist of large
spatial ranges, from atoms to ecosystems. Second, the timescale of
biological problems may range from nanoseconds to years. Third,
information in biological systems is often scarce due to the com-
plexity of experiments (e.g., challenges determining the values of
key parameters). Finally, unlike other areas in science, such as
Newtonian mechanics, biological systems often lack fundamental
laws, meaning that established governing equations are not known
and may not exist.

1.1 Reaction-Diffusion Simulations Using Partial Differential Equations (PDEs)

Partial differential equations (PDEs) describe concentrations of species
moving in a spatial domain [4]. A common biological context
would be the movement of molecules, proteins, and nutrients
through large-scale tissues. Numerous biological problems have
been addressed with PDEs, including the analysis of pattern formation
from diffusing morphogen gradients [5–7]. While the domain
may be fixed, as we show in Fig. 1a, it is also possible for the
domain itself to change in time. Such approaches may be used
to study, for example, the growth of a tumor [8]. Commonly, the
distributions of biochemical constituents are evolved according to
reaction-diffusion equations. That is, each species tracked has an
associated diffusivity and may evolve through specified biochemical
reactions [9, 10].
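As a minimal sketch of how a reaction-diffusion equation can be evolved numerically, the following Python code advances a single species u on a one-dimensional grid with an explicit finite-difference step. The logistic reaction term r u (1 − u), the no-flux boundaries, the initial pulse, and all function and parameter names (reaction_diffusion_step, simulate, diffusivity, growth_rate, dx, dt) are illustrative assumptions and are not taken from any specific model cited here.

```python
def reaction_diffusion_step(u, diffusivity, growth_rate, dx, dt):
    """Advance u by one explicit Euler step of du/dt = D*d2u/dx2 + r*u*(1 - u)."""
    n = len(u)
    new_u = [0.0] * n
    for i in range(n):
        # No-flux boundaries: a missing neighbor is replaced by the cell's own value.
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        laplacian = (left - 2.0 * u[i] + right) / dx ** 2
        reaction = growth_rate * u[i] * (1.0 - u[i])  # assumed logistic kinetics
        new_u[i] = u[i] + dt * (diffusivity * laplacian + reaction)
    return new_u


def simulate(n_points=50, steps=200, diffusivity=1.0, growth_rate=0.5, dx=1.0, dt=0.1):
    """Evolve a single pulse placed at the center of the domain."""
    # Stability limit of the explicit scheme for the diffusion part.
    if diffusivity > 0 and dt > dx ** 2 / (2.0 * diffusivity):
        raise ValueError("explicit scheme is unstable: reduce dt or increase dx")
    u = [0.0] * n_points
    u[n_points // 2] = 1.0
    for _ in range(steps):
        u = reaction_diffusion_step(u, diffusivity, growth_rate, dx, dt)
    return u
```

Because no random numbers are involved, repeated calls to simulate() with the same arguments return identical profiles, which illustrates the deterministic character of this modeling framework.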
While PDEs have proven an extremely powerful mathematical
framework, they have several features that may limit their use for
certain biological problems. First, PDEs are deterministic; this
means that they provide a single realization of a system. While this
might be useful in some systems, when a system is stochastic (i.e.,
the outcome of a system will vary), PDEs may not be the right
choice. Second, PDEs require fully specifying a set of equations and
parameters. In many cases, the equations used are chosen for a
particular system and require significant amounts of information
(e.g., the concentration, particle density, hydrostatic pressure gra-
dients of fluids). Also, the forms of the equations have been derived
to be true in only certain conditions. Third, numerical methods for
analysis of PDEs are often developed to work for precise domains
and governing equations; this can require substantial re-writing of
code when assumptions and conditions change.

1.2 Cellular Automata (CA)

Cellular automata are a discrete-time modeling framework for
considering evolving agents on a rectangular grid (lattice). Each
element in the grid is assigned a value, usually discrete, to indicate
the state of that grid cell [11, 12]. For example, the simplest case
would be 0 (or white) for dead and 1 (or black) for alive, as
illustrated in the example in Fig. 1b. The initial state of the CA is
usually taken to be either a single cell with a given state or some
random configuration of cellular states. The dynamics of the CA are
encapsulated in the rules for updating. Given a particular state of
the system at time t = n, the grid state at the next time point
(t = n + 1) is determined by each cell updating based on the
state of the neighboring cells. For example, in the CA formulation
of the Game of Life, developed by Conway [13], each cell considers
its eight neighboring cells and updates its state according to the
following four rules:
1. Any live (black) cell with fewer than two live (black) neighbors
dies (turns white). This represents death by underpopulation.
2. Any live (black) cell with two or three live (black) neighbors
lives (stays black).
3. Any live (black) cell with more than three living (black) neigh-
bors dies (turns white). This represents death by
overpopulation.
4. Any dead (white) cell with exactly three live (black) neighbors
becomes a live cell (turns black). This represents reproduction.
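The four rules above can be sketched as a single synchronous update in Python. This is a minimal illustration, assuming a finite grid stored as a list of lists of 0 (dead) and 1 (alive) in which positions outside the grid count as dead; the function name life_step is our own.

```python
def life_step(grid):
    """Apply the four Game of Life rules once to a 2D list of 0 (dead) / 1 (alive)."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live cells among the eight neighbors (off-grid counts as dead).
            live_neighbors = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        live_neighbors += grid[rr][cc]
            if grid[r][c] == 1:
                # Rules 1-3: survive with two or three live neighbors, otherwise die.
                new_grid[r][c] = 1 if live_neighbors in (2, 3) else 0
            else:
                # Rule 4: a dead cell with exactly three live neighbors becomes alive.
                new_grid[r][c] = 1 if live_neighbors == 3 else 0
    return new_grid
```

All cells are updated from the same old grid, so the update is synchronous: new states are written to a separate grid and never influence neighbors within the same time step.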
Since CAs were first developed, there have been many generalizations
of the underlying framework. For example, CAs have been
developed to live on tori or more general topological shapes. CA
models have the advantage of fast, simple calculations, which means
they can scale to large sizes. However, the simulations have an
inherent scale associated with the grid, and as such phenomena far
below that size cannot be modeled. Moreover, any actions such as
movement and growth are discretized. In addition, the behavior of
the system is largely tied to the grid structure, which means artifacts
from boundaries may occur. Finally, due to the relatively simplistic
nature of agents, CAs may not be easily generalized to more complicated
scenarios where agents themselves occupy different
volumes of physical space.

1.3 Agent-Based Models (ABMs)

Agent-based modeling, also known as agent-based simulation
(ABS) and individual-based modeling (IBM), is (usually) spatially
lattice-free and may be either discrete or continuous in time
[14]. The modeling and analysis of ABMs take place on both the
level of individual agents and the entire population [15]. ABMs are
inherently stochastic. Agents move and grow based on probabilistic
decisions at each time point. These decisions are based on the
characteristics of the agents themselves, interactions between individual
agents and the environment, and the state of neighboring
agents. This last point is an important distinction between ABMs
and CAs; in ABMs agents are distinct from the environment,
whereas for CAs the individual and domain are intertwined
[2, 16]. Because ABMs are highly generalizable, as we will discuss
in greater detail below, ABMs have been successfully used in a
variety of biological settings such as modeling the growth of cancer
cells [17–21] and biofilms [22–24]. We note that ABMs are an
active area of research, and growing numbers of textbooks and
resources are available for the interested reader [25, 26].
Individuals in ABMs can be defined through four basic concepts
[2]. First, the agents must have a given state. In the case of
cellular populations, agents may be circles (which may model yeast
or mammalian cells), rods (which model some types of bacteria), or
another appropriate shape. In Fig. 1c, we depict the agents as circles that are
specified by a given center position and radius. Second, agents must
be able to perceive the environment (e.g., who are my neighbors?
where is the food?). Third, the agents must have general rules
governing changes in their state. For example, some populations of
cells are strongly adhesive, making them likely to stay close to their
neighbors. In addition, a common feature of most ABMs developed
to model biological cells is that the agents (in this case cells)
must reproduce through cell division. Typically, the decision to
divide is based on both the cellular state (such as size or time
since last division) and the environment of a cell (i.e., do I have
enough nutrients to carry out cell division?). Fourth, there must be
a well-defined way for the individual agents to meaningfully select
from their list of possible actions and carry out a given action. In
many cellular models, such as the example we depict in Fig. 1c,
cellular populations may become crowded so that movement of
individual agents is limited, and it is possible that no action is
possible during a given time frame. In these cases, the agent
will elect no action.
There are many advantages to ABMs [27]. First, their inherent
stochasticity allows for the study of phenomena that may not be
well captured by deterministic models. Stochasticity can be
incorporated in the processes of birth and death, growth, and
movement. In fact, this randomness plays an important role in the
emergence of patterns in communities. Second, they represent a
bottom-up, highly generalizable approach to modeling. As men-
tioned above, for many systems, detailed information may exist
(or only be understood) at the individual level. Third, ABMs
allow for emergent behavior. By characterizing individuals, and
then studying system level dynamics, emergent behavior that
could not have been predicted or, in many cases, even quantitatively
formulated may occur. Fourth, ABMs are highly generalizable to
many different contexts for both the agents and their environment.
Unlike in CAs, where all individuals of a given state are identical,
heterogeneity is an essential feature of ABMs. In Fig. 1c, individual
cells are characterized by continuous quantities: the position of
their center in 2D (x and y coordinates) and their radius. The
complexity of the individual in an ABM provides greater detail
than is possible in a CA or in a mean-field approach such as a
reaction-diffusion PDE-based model. Finally, agents can have memory,
meaning that the set of events experienced by an individual agent
or group of agents affects their actions in the future. However, if
the cells (or more generally agents) have a low amount of interaction,
ABMs may be less useful compared to CAs and differential
equations. The physicality introduced by modeling size, surface
area, and forces explicitly means ABMs are well suited for biophysical
interactions between cells.
However, ABMs are not without their challenges. First, ABMs
may depend on many unknown parameters that may not be easily
measured. For example, it is thought that cell division occurs
through a cell-size checkpoint process, but it may not be under-
stood how that process manifests at the individual level. Second,
because of many characteristics of individual agents, ABMs are
often extremely computationally intensive compared to cellular
automata and PDEs. As such, ABMs are often difficult to use on
large (e.g., tissue-scale) systems. For systems of cells ranging from
10^6 to 10^9 cells (colonies and tissues), high-performance computing
becomes a necessity.
One emerging computational method is to take a hybrid
approach combining equations and agent-based characteristics.
One could give each individual of a collective its own equation.
However, these approaches may not alleviate computational chal-
lenges unless all agents can be assumed to have similar character-
istics or there is little variability among individuals.

2 Materials

2.1 The Computer System and ABM Software

If you are developing your own ABM, you will need to consider the
computational resources and limitations you have at your disposal.
If you are using an existing ABM software framework, the individual
providers will detail the requirements. In addition, we note that
some ABM software allows not only for the simulation of the
population but also for the visualization of the resulting population
configuration.

2.1.1 ABM Software Packages

While below we describe a protocol for your own ABM implementation,
we note that there are many computational packages allowing
users to specify their own ABMs (Table 1). Common
applications and platforms for developing agent-based models
include FLAME [34], NetLogo [25], AgentCell [28], and Repast
[38]. There are also ready-made simulators for ABMs. BacSim [29]
and its update iDynoMICS [35], as well as INDSIM [36], are applications
that only need the user to define a model. The multi-agent software
platform Swarm uses collections of agents (swarms) in hierarchical
structures or populations [39]. Virtual Cell (VCell) [40, 41] provides
a platform for simulating chemical reactions using both deterministic
and stochastic methods, while including spatial effects.
CompuCell3D [33] is an open-source simulation environment
for multi-cell, single-cell-based modeling of tissues, organs, and
organisms using the Cellular Potts Model. Chaste (Cancer, Heart,
And Soft Tissue Environment) [31, 32] allows for the simulation of
models developed for those domains and more generally in physiology
and biology. MASON (Multi-Agent Simulator of Neighborhoods)
[37] is a Java-based discrete-event simulation library with a
set of 2D and 3D visualization tools. For very large simulations
that require high-performance computing (HPC), the software package
FLAME [34] focuses on high-performance computers and has an
additional GPU version, while Biocellion [30] supports the modeling
of large cellular communities of up to a billion cells on HPC.

2.1.2 Do It Yourself ABM Implementation

Grimm [42] set up a protocol, called ODD (Overview, Design
concepts, and Details), to describe agent-based systems using a common
set of standards. This protocol was developed to address an early
criticism of ABMs: that their descriptions were not easily understandable
or complete. ODD encourages users to explicitly document
the purpose of the model, the scientific questions to be
investigated, and the scales of the model. If you write your own
ABM, we suggest you follow the ODD principles in your design
and presentation.

Table 1
Agent-based modeling software

| Software name | Characteristic(s) | Operating system/language | Reference |
|---|---|---|---|
| AgentCell | Chemotaxis network in spatial cells | Java | [28] |
| BacSim | Bacterial growth; uptake of nutrients | Java | [29] |
| Biocellion | ~10^9 cell simulation | HPC | [30] |
| Chaste | Uses cellular Potts model | C++, Python scripts | [31, 32] |
| CompuCell3D | Simulation of medical therapies and tissue | Windows, MacOS, Linux | [33] |
| FLAME | Computational extended finite state machine framework | GPU, HPC | [34] |
| iDynoMICS | Microbial biofilm simulation | Java | [35] |
| INDSIM | Stochastic simulation of individual cells and events | NetLogo | [36] |
| MASON | Discrete events, generation of plots and movies | Java | [37] |
| NetLogo | ABM programming language | Windows, MacOS, Linux | [25] |
| Repast | Object-oriented; libraries for genetic algorithms and neural networks | Multi-platform, HPC | [38] |
| Swarm | Biological and social simulation; hierarchies of agents | Multi-platform; C++/Java/Objective-C | [39] |
| Virtual Cell (VCell) | Deterministic/stochastic simulation with spatial geometries, alternative web-based implementation | Windows, Linux, MacOS | [40, 41] |

Most common programming languages can accommodate
ABMs. However, a few have been specifically designed for them.
For example, NetLogo allows users to develop ABMs in 2D or 3D.
The agents are able to move and interact in a user-specified domain.
NetLogo [25] has become popular for modeling ecological or
population dynamics. For modeling cellular populations and tis-
sues, CompuCell3D [33] has emerged as a powerful ABM frame-
work. CompuCell3D has been used to evaluate medical therapies
and design engineered tissues.

2.2 Biological and/or Biophysical Parameters

Regardless of whether you design your own code from the ground up or
use a preexisting framework, the properties of your agents need to
be specified. The parameters required for an ABM are highly problem
specific. In what follows, we will go through the details of
developing parameters for an agent-based model.

Generally speaking, parameters such as lifetimes, growth,
movement, and division will need to be specified. Depending on
the system under study, it may be helpful to consult Bionumbers
(http://bionumbers.hms.harvard.edu), which contains lists of
known quantitative information for a variety of biological contexts,
or the Digital Cell Line [43], which describes specific cell lines.
Typically, the set of parameters required for an ABM is
designed along with the ABM itself. In what follows, we provide
detailed notes in our context of a growing population of dividing
cells.

3 Methods

As mentioned above, there is a variety of existing software packages,
each of which has its own input needs and output formats. However,
ABMs are relatively easy to implement, and as such, we
describe their components in case the reader should be interested in
developing their own ABM simulation framework. All ABMs
require initialization with a specific arrangement of agents, each
with its associated physical characteristics (size/shape) and rules for
behavior (growing, moving, consuming nutrients). While we detail
the methods that are required for implementation of any spatial
ABM, for expositional convenience, we will develop ABMs in the
context of the biological problem of a growing population of cells
in a lattice-free framework.
The agent units in our model are individual cells. We consider
each cell to be an elastic, sticky particle of limited compressibility
and deformability, which is capable of active movement, growth,
and division. Each cell is a circle with a center and radius, as well as
other characteristics such as its lineage and cell cycle location. As we
detail below, the action taken by a cell is stochastic and the proba-
bility of any particular action depends on the cell’s current state as
well as the environment around it. (This is in contrast to the cellular
automata framework we presented above where once the initial
condition is set, all behavior is deterministic.)
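To illustrate how a state- and environment-dependent stochastic decision might be coded, the following Python sketch assigns probabilities to the actions move, grow, divide, and no action, and then samples one of them. The specific weights (0.3, 0.4, 0.2), the crowding rule based on a maximum of six neighbors, and the names action_probabilities and choose_action are hypothetical choices made only for this example.

```python
import random


def action_probabilities(radius, division_radius, n_neighbors, max_neighbors=6):
    """Hypothetical rule: crowding suppresses all actions; only large cells may divide."""
    free_fraction = max(0.0, 1.0 - n_neighbors / max_neighbors)
    p_move = 0.3 * free_fraction
    p_grow = 0.4 * free_fraction
    p_divide = 0.2 * free_fraction if radius >= division_radius else 0.0
    # Whatever probability is left over corresponds to electing no action.
    p_none = 1.0 - (p_move + p_grow + p_divide)
    return {"move": p_move, "grow": p_grow, "divide": p_divide, "none": p_none}


def choose_action(probabilities, rng):
    """Sample one action name according to its probability."""
    actions = list(probabilities)
    weights = [probabilities[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Example: an uncrowded cell that is large enough to divide.
rng = random.Random(42)  # fixed seed so that a run can be reproduced
example_action = choose_action(action_probabilities(1.5, 1.4, 1), rng)
```

A fully surrounded cell (six neighbors under this assumed rule) has probability 1 of taking no action, which corresponds to the crowded case in which no action is possible.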
With these ideas in place, we now describe four critical compo-
nents necessary for an ABM:
1. State of an Agent:
What properties about an agent does the ABM keep track
of? What properties are fixed, and which change in response to
the simulation?
2. Actions of an Agent:
What can an agent do? How does the agent decide what
to do?
3. Population Evolution in Time:
Does the population update synchronously or asynchronously?
Is there external forcing of cellular behavior?
4. Storage and Output of Simulation:
What type of information is stored about the simulation?
When and how does output occur?

3.1 State of an Agent

In any implementation of an ABM, the quantities associated
with an agent must be specified in detail. For example, if this agent is a
cell, the state might be specified as the concentration of biochemical
constituents (either inside or outside the cell), the size of the
cell, its age, and information about the hierarchical lineage of cells it
is related to (see Notes 1–3 and 9). These quantities associated
with an agent will be updated according to the equations and
parameters that define the underlying physical system. One can
also encode different subtypes of the same agents, such as
differences in phenotypes or wild type versus mutants.
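One possible way to hold such a state in code is a small record type. The sketch below uses a Python dataclass; the field names (cell_id, age, cell_type, parent_id, lineage, nutrient) and their defaults are illustrative assumptions rather than a prescribed set, and a real model would add or remove fields to match its own questions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """State of one circular cell agent (all field names are illustrative)."""
    cell_id: int
    x: float                          # center position
    y: float
    radius: float
    age: float = 0.0                  # time since the last division
    cell_type: str = "wild_type"      # subtype, e.g., wild type versus mutant
    parent_id: Optional[int] = None   # None for a founder cell
    lineage: tuple = ()               # ids of ancestors, oldest first
    nutrient: float = 0.0             # internal concentration of a nutrient

    @property
    def area(self) -> float:
        """Area of the circle, derived from the stored radius."""
        return math.pi * self.radius ** 2


founder = Cell(cell_id=0, x=0.0, y=0.0, radius=1.0)
mutant = Cell(cell_id=1, x=2.0, y=0.0, radius=0.8, cell_type="mutant",
              parent_id=0, lineage=(0,))
```

Storing derived quantities such as the area as properties, rather than as separate fields, avoids the state becoming inconsistent when the radius is updated.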

3.2 Action of Individual Agents

Next, we must specify precisely what an agent can do, and how it
decides what to do. For example, in our framework agents may
move in a spatial domain, grow, divide, die (see Notes 2–4), and
communicate with other agents (Fig. 2). In our context of a growing
population of cells, we may allow cells to grow in volume but
with growth rates that depend on availability of nutrients and the
space not occupied by neighboring cells.
Pushing (or shoving) is another type of movement in ABM
systems. Each cell moves individually and is not stopped by other
cells blocking its path. Instead, these neighboring cells are displaced
in a direction opposite to the cell’s movement. This causes
a re-equilibration of the colony to a new configuration. In this case,
movement always occurs, and it is up to the other cells to move out
of the way. As each cell does this in turn, this could lead to
conflicting actions. Thus, it may be more efficient to calculate the

Fig. 2 Detailed illustration of an agent-based model (ABM). In this illustration, we have an off-lattice model of a
population of cells (circles). Each cell has a particular state (in this case, the center and
radius) and will modify its state by considering predefined cellular operations (such as division) and
information from neighboring cells. In this framework, a red cell (left panel) will stochastically decide to
perform one of three modifications to its state: move, grow, or divide
net movement of each cell once all cells have conducted their
movement. The algorithm works to minimize overlap through a
“shove radius.” All of the cells are accessed in a particular sequence.
This introduces a bias from the ordering; to correct for it, the
ordering can occasionally be reversed. Randomizing the
order is another possibility (see Note 6).
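A simplified version of such a shoving step can be sketched in Python as follows. This is a pairwise overlap-relaxation scheme under our own assumptions, not the algorithm of any particular package: cells are stored as mutable [x, y, radius] lists, two cells overlap when their distance is less than shove_factor times the sum of their radii, each overlapping pair is pushed apart symmetrically, and the visiting order is re-randomized on every pass using random sort keys.

```python
import math
import random


def shove(cells, rng, shove_factor=1.0, max_passes=100, tolerance=1e-9):
    """Relax overlaps among circular cells given as [x, y, radius] lists.

    Returns the number of passes performed (max_passes if overlaps remain).
    """
    for n_pass in range(1, max_passes + 1):
        # Random visiting order: one random key per cell, then sort by key.
        process_list = sorted((rng.random(), i) for i in range(len(cells)))
        moved = False
        for _, i in process_list:
            for j in range(len(cells)):
                if i == j:
                    continue
                xi, yi, ri = cells[i]
                xj, yj, rj = cells[j]
                dx, dy = xj - xi, yj - yi
                dist = math.hypot(dx, dy)
                overlap = shove_factor * (ri + rj) - dist
                if overlap > tolerance:
                    if dist == 0.0:
                        # Coincident centers: pick a random direction.
                        angle = rng.uniform(0.0, 2.0 * math.pi)
                        ux, uy = math.cos(angle), math.sin(angle)
                    else:
                        ux, uy = dx / dist, dy / dist
                    # Push both cells apart by half of the overlap each.
                    cells[i][0] -= 0.5 * overlap * ux
                    cells[i][1] -= 0.5 * overlap * uy
                    cells[j][0] += 0.5 * overlap * ux
                    cells[j][1] += 0.5 * overlap * uy
                    moved = True
        if not moved:
            return n_pass
    return max_passes
```

For example, calling shove([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]], random.Random(1)) moves the two unit cells apart along the x axis until they just touch. In dense colonies, resolving one overlap can create another, which is why the passes are repeated and capped at max_passes.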
Division or reproduction of agents depends on the problem in
question. In our ABM framework, we assume an agent divides
when it reaches twice its original size. Donachie [44] found that
DNA replication was correlated with multiples of the initial cell
size. There are also different cellular division orientation phenomena:
axial or polar. The former occurs when the daughter cell
divides near its separation with the original mother, bunching up
the cells in the process. Polar division occurs when it happens on
the opposite side of the daughter, so that the cells of the colony grow away
from each other. Often both types of division occur at the
same time within a colony, with an environmental trigger such as
temperature or an internal one such as gene regulatory proteins causing
a change between the two modes. Cell division can be symmetric,
asymmetric (e.g., 60/40 between the mother and daughter), or budding,
as in the case of yeast. The life cycle of growth and division can also be specified.
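The division rule can be sketched in Python as below. We assume here that "size" means the area of the circular cell, that the total area is conserved at division, and that the two resulting cells are placed touching one another along a random axis through the mother's center; the names ready_to_divide, divide, and mother_fraction are our own.

```python
import math
import random


def ready_to_divide(radius, birth_radius):
    """True once the cell area has at least doubled since birth (assumed size measure)."""
    return radius ** 2 >= 2.0 * birth_radius ** 2


def divide(x, y, radius, rng, mother_fraction=0.5):
    """Split a circular cell into mother and daughter circles, conserving total area.

    mother_fraction=0.5 gives symmetric division; 0.6 gives a 60/40 split.
    Returns two (x, y, radius) tuples: (mother, daughter).
    """
    if not 0.0 < mother_fraction < 1.0:
        raise ValueError("mother_fraction must lie strictly between 0 and 1")
    r_mother = radius * math.sqrt(mother_fraction)
    r_daughter = radius * math.sqrt(1.0 - mother_fraction)
    # Random division axis; the two new cells just touch along it.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    ux, uy = math.cos(angle), math.sin(angle)
    separation = r_mother + r_daughter
    # Offsets are chosen so the area-weighted centroid stays at the old center.
    shift_m = (1.0 - mother_fraction) * separation
    shift_d = mother_fraction * separation
    mother = (x - shift_m * ux, y - shift_m * uy, r_mother)
    daughter = (x + shift_d * ux, y + shift_d * uy, r_daughter)
    return mother, daughter
```

With this placement the two new cells extend beyond the outline of the original cell, so a subsequent overlap check against neighbors is needed; an axial or polar orientation rule would replace the random angle with a direction derived from the cell's lineage.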

3.3 Population Evolution in Time

The ABM is updated in time according to the actions of individual
agents and how those actions impact the decisions of other agents.
As we mention below in Notes 5–7, implementation details such as
data structure and organization choices may greatly influence the
computational efficiency of an ABM.
A simulation can start with any number of cells in any configuration
one can imagine within the geometry of the space. One
technique is to mimic laboratory experiments that inoculate
organisms. If one seeks to follow the entire evolution of a
cellular community, then starting with one cell is ideal. A target number of
cells can also be used to specify the end of the simulation.
Once the colony has started to grow, external events can also
influence its trajectory beyond cell–cell interactions. For example,
one often introduces a nutrient field consisting of compounds to be
consumed by the cells. This can occur on a reaction-diffusion-based
lattice in which the nutrients can have a heterogeneous spatial
concentration. One can also have bulk concentrations such as for
large pools of gaseous nutrients such as CO2 or O2 or nitrogen.
These values can be varied to simulate conditions of limiting or rich
media to explore the effect of these compounds on movement.
Solid media such as agar inhibit cellular movement, while liquids
allow for more movement. This can either be a parameter in the
model or one can simulate the movement effect more explicitly.
There is therefore a link between a cell colony’s morphology and its
surroundings. As the medium becomes thicker and impedes movement,
translation events become less frequent, so that the success of
individual cells in moving causes the colony to advance asymmetrically,
in a branching pattern. These discrete-event simulations
are in contrast to a more continuous movement, which
could still be hampered by the solid medium.

3.4 Storage and Output of Information

In ABMs, there is usually interest in both how the individuals
behave within a population and population-level characteristics.
As such, ABMs typically track and output both types of quantities.
As we mention below in Subheading 4, organizing file structures
and output can greatly facilitate post-processing and analyses of
these simulations (see Notes 8 and 9).
The coordinates of each individual cell’s center and its radius
are the primary data kept by the simulation, with one row per cell
per time point. Other information such as the cell type, state, birth
date can be added as needed. As cells are often object-oriented data
structures, their attributes can scale linearly with their number of
parameters.
The list of parameters needs to be recorded (see Note 10) along
with the random seed to be able to reproduce individual simula-
tions (see Note 11). The evolution of the cell community can be
given through a unique identifier that holds the lineage of the
individual cell. Subcolonies of cells could display different behavior
related to their surroundings.
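The storage scheme described above can be sketched in Python with the standard csv and json modules. The file layout, the column names, and the dotted lineage string (e.g., "0.1" for the second child of cell 0) are assumptions made for this example; the functions write_parameters and append_snapshot are hypothetical helpers.

```python
import csv
import json

FIELDNAMES = ["time", "cell_id", "lineage", "x", "y", "radius"]


def write_parameters(path, parameters, seed):
    """Record the parameter list together with the random seed of the run."""
    with open(path, "w") as handle:
        json.dump({"seed": seed, "parameters": parameters}, handle,
                  indent=2, sort_keys=True)


def append_snapshot(path, time_point, cells, write_header=False):
    """Append one row per cell for the given time point.

    Each cell is a dict with the keys cell_id, lineage, x, y, and radius.
    """
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        for cell in cells:
            writer.writerow({"time": time_point, **cell})
```

A run would call write_parameters once at the start, append_snapshot with write_header=True for the first time point, and append_snapshot at each later output interval. Keeping the seed next to the parameters means that a single small file is enough to repeat a given simulation.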

4 Notes

1. Decide on a life cycle for the cell agents. Cells have checkpoints
in their cycle, such as G1 and mitosis. More complicated life cycles
have multiple stages, but a simple focus on division and the
generation of new agents is most important.
2. Decide whether cells can die in the simulation. They can lyse,
providing nutrients for other species or freeing space for the
remaining cells. One way the cells can die is if their growth rate
falls below a certain threshold. Another possibility is that the
cell is trapped and cannot grow at all, which lengthens its cell
cycle to the point where regulatory factors induce apoptosis.
3. Give cells multiple states and programmed actions for each
state. One can consider quiescent, hypoxic, apoptotic, necrotic,
or proliferative cells. The cells can transfer to different states
probabilistically or in response to different environmental
conditions.
4. Start with two cells and follow them through the process of
growth, translation, and division to see if they behave correctly
before scaling to many cells. A significant source of error is
when cells overlap due to incorrect updating of cell coordinates
and radius or mistakes in the potential function. An accounting
of the energy values for the old and new configurations in a trial
is important in this case. One must make sure that cell births do
not cause immediate overlap unless such events are dependent
on having enough space. One solution is to have the new cells
fit inside the old cell so that no overlap occurs. This creates an
issue with the total cell area diminishing, but this may not have
much effect with fast growth and is a good first step in getting
the simulation working correctly.
5. Encapsulate agent instructions into functions and use the main
program to control the flow of each turn. For example, testing for
a growth event should be separate from applying the amount of
growth on a successful trial.
6. Create a "process list" that has two members: cell number and a
variable containing a random number. For each cell, generate a
random number between 0 and 1 and place it in that cell's element
of the process list vector. Then sort the cells by their random
number and have that ordering govern the order in which the cells
perform trials. At the end of the turn, generate new random
numbers for each cell and sort the list again. When a new cell
is born, a random number is generated for it, and its entry should
be added to the "process list" vector.
7. Use get/set methods to modify and retrieve a cell variable. This
ties the cell attribute to an individual and protects it from
unwanted editing and overwriting of values by other cell
objects.
8. Organize file I/O as separate files giving the list of the cell data
such as cell size and location. This printout of the simulation
can be written every turn, when debugging or making a video, but
often a snapshot of the output at regular intervals is sufficient to
capture the dynamics of the simulation. The cell number is also
important to record, so that the dynamics of the individual cells
can be traced. A third file keeps track of the time and number of
cells to determine growth rates and colony evolution.
9. Represent the lineage of a cell as a string of numbers connected
by dashes. This allows for easy parsing for analysis purposes.
The length of the string gives the generation of the cell. For
example, the cell C-0-0-1-2-1 is a fifth-generation cell that is
present in the subcolony of the first daughter born to the founder.
This allows one to find the ancestors and progeny of a particular
cell easily and quickly.
10. Record a seed value for the random number generator that
determines the series of random numbers that the program
generates. This allows for reproducibility of individual
simulations. By default, however, seed the random number generator
with the current time, so that no two simulations are exactly
alike.
11. Record a parameter list in a file that gives values for the growth
amount, move amount, percentages, etc. It is important to also
record the type of potential and stopping point criteria.
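Notes 6, 10, and 11 fit together naturally: the recorded seed drives the random update order, and the parameter record makes the run repeatable. The following is a minimal sketch under stated assumptions. The function names, the JSON layout, and all parameter names and values are hypothetical, and Python's standard `random` module stands in for whatever generator a given simulation uses.

```python
import json
import random
import time


def start_run(params, seed=None):
    """Create a seeded generator and a JSON record of the run settings."""
    if seed is None:
        seed = int(time.time())  # default: a different seed for every run
    record = json.dumps({"seed": seed, "parameters": params}, indent=2)
    return random.Random(seed), record


def turn_order(cell_ids, rng):
    """Process list: pair each cell with a random number, then sort by it."""
    process_list = [(rng.random(), cell_id) for cell_id in cell_ids]
    process_list.sort()
    return [cell_id for _, cell_id in process_list]


# Hypothetical parameter values, for illustration only.
params = {"growth_amount": 0.05, "move_amount": 0.1,
          "potential": "harmonic", "stop_after_turns": 1000}

rng_a, record = start_run(params, seed=12345)
order_a = [turn_order(range(5), rng_a) for _ in range(3)]  # three turns

# Re-creating the generator from the recorded seed reproduces the same orders.
saved = json.loads(record)
rng_b, _ = start_run(saved["parameters"], seed=saved["seed"])
order_b = [turn_order(range(5), rng_b) for _ in range(3)]
```

In a real simulation the `record` string would be written to a file next to the output, and a newborn cell would receive its own random number and be appended to the process list before the next sort.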

References

1. Brehm-Stecher BF, Johnson EA (2004) Single- dynamic systems, 2nd edn. Academic,
cell microbiology: tools, technologies, and New York, NY
applications. Microbiol Mol Biol Rev 15. North MJ (2014) A theoretical formalism for
68:538–559 analyzing agent-based models. Complex Adapt
2. Railsback SF, Lytinen SL, Jackson SK (2006) Syst Model 2(1):1–34
Agent-based simulation platforms: review and 16. Railsback SE (2001) Concepts from complex
development recommendations. SIMULA- adaptive systems as a framework for individual-
TION 82(9):609–623 based modeling. Ecol Model 139:47–62
3. Song H-S et al (2014) Mathematical Modeling 17. Zhang L, Athale CA, Deisboeck TS (2007)
of microbial community dynamics: a methodo- Development of a three-dimensional multiscale
logical review. PRO 2(4):711–752 agent-based tumor model: simulating gene-
4. Kolmogorov A, Petrovsky L, Piscounov N protein interaction profiles, cell phenotypes
(1937) Study of the diffusion equation with and multicellular patterns in brain cancer. J
growth of the quantity of matter and its appli- Theor Biol 244:96–107
cation to a biological problem. Moscow Univ 18. Mansury Y et al (2002) Emerging patterns in
Bull Math 1:1–25 tumor systems: simulating the dynamics of
5. Murray JD (1988) How the leopard gets its multicellular clusters with an agent-based spa-
spots. Sci Am 258(3):80–87 tial agglomeration model. J Theor Biol 219
6. Holmes EE et al (1994) Partial differential (3):343–370
equations in ecology: spatial interactions and 19. Galle J et al (2006) Individual cell-based mod-
population dynamics. Ecology 75(1):17–29 els of tumor-environment interactions. Am J
7. Baker RE, Gaffney E, Maini P (2008) Partial Pathol 169(5):1802–1811
differential equations for self-organization in 20. Drasdo D, Höhme S (2003) Individual-based
cellular and developmental biology. Nonlinear- approaches to birth and death in avascular
ity 21(11):R251 tumors. Math Comput Model 37:1163–1175
8. Ward JP, King J (1997) Mathematical model- 21. Drasdo D, Hohme S (2005) A single-cell-
ling of avascular-tumour growth. Math Med based model of tumor growth in vitro: mono-
Biol 14(1):39–69 layers and spheroids. Phys Biol 2:133–147
9. Cantrell RS, Cosner C (2004) Spatial ecology 22. Xavier JB et al (2007) Multi-scale individual
via reaction-diffusion equations. John Wiley & based model of microbial and bioconversion
Sons, Hoboken, NJ dynamics in aerobic granular sludge. Environ
10. Britton NF (1986) Reaction-diffusion equa- Sci Technol 41(18):6410–6417
tions and their applications to biology. Aca- 23. Picioreanu C, Kreft JU, Loosdrecht MCMV
demic, New York, NY (2004) Particle-based multidimensional multi-
11. Alber MS et al (2003) On cellular automaton species biofilm model. Appl Environ Microbiol
approaches to modeling biological cells. In: 70:3024–3040
Rosenthal J, Gilliam DS (eds) Mathematical 24. Modwal A, Rao S (2015) Agent-based model-
systems theory in biology, communications, ling of biofilm formation and inhibition in
computation, and finance, vol 1-39. Springer, Escherichia coli. Curr Sci 109(5):930–937
New York, NY 25. Tisue S, Wilensky U (2004) Netlogo: a simple
12. Lee Y et al (1995) A cellular automaton model environment for modeling complexity. in
for the proliferation of migrating contact- International conference on complex systems.
inhibited cells. Biophys J 69:1284 ICCS, Boston, MA
13. Conway J (1970) The game of life. Sci Am 223 26. Smaldino PE, Calanchini J, Pickett CL (2015)
(4):4 Theory development with agent-based models.
14. Zeigler BP, Praehofer H, Kim TG (2000) The- Organ Psychol Rev 5(4):300–317
ory of modeling and simulation: integrating 27. Bernard RN (1999) Using adaptive agent-
discrete event and continuous complex based simulation models to assist planners in
380 Marc Griesemer and Suzanne S. Sindi

policy development: the case of rent control, 36. Ginovart M, Lopez D, Valls J (2002) INDSIM:
Santa Fe Institute working paper an individual-based discrete simulation model
28. Emonet T et al (2005) AgentCell: a digital to study bacterial cultures. J Theor Biol
single-cell assay for bacterial chemotaxis. Bio- 214:305–319
informatics 21(11):2714–2721 37. Pérez-Rodrı́guez G et al (2015) Agent-based
29. Kreft J-U, Booth G, Wimpenny JWT (1998) spatiotemporal simulation of biomolecular sys-
BacSim: a simulator for individual-based mod- tems within the open source MASON frame-
elling of bacterial colony growth. Microbiology work. Biomed Res Int 2015:1–12
144:3275–3287 38. Collier NT, North M (2012) Parallel agent-
30. Kang S et al (2014) Biocellion: accelerating based programming with repast for high per-
computer simulation of multicellular biological formance computing. SIMULATION
models. Bioinformatics 30(21):3101–3108 2012:1–21
31. Osborne JM et al (2017) Comparing 39. Minar N et al (1996) The swarm simulation
individual-based approaches to modeling the system: a toolkit for building multi-agent simu-
self-organization of multi-cellular tissues. lations. Santa Fe Institute, Santa Fe, NM
PLoS Comput Biol 13(2):e1005387 40. Walker DC, Southgate J (2009) The virtual
32. Harvey DG et al (2015) A parallel implemen- cell—a candidate co-ordinator for ‘middle-
tation of an off-lattice individual-based model out’modelling of biological systems. Brief
of multicellular populations. Comput Phys Bioinform 10(4):450–461
Commun 192:130–137 41. Resasco DC et al (2012) Virtual Cell: compu-
33. Swat MH et al (2012) Multi-scale modeling of tational tools for modeling in cell biology.
tissues using CompuCell3D. Methods Cell Wiley Interdiscip Rev Syst Biol Med 4
Biol 110:325 (2):129–140
34. Kiran M et al (2010) FLAME: simulating large 42. Polhill JG et al (2008) Using the ODD proto-
populations of agents on parallel hardware col for describing three agent-based social sim-
architectures. In: Proceedings of the 9th inter- ulation models of land-use change. J Artif Soc
national conference on autonomous agents and Soc Simul 11(2):3
multiagent systems: volume 1. International 43. Friedman SH et al (2016) MultiCellDS: a stan-
Foundation for Autonomous Agents and Mul- dard and a community for sharing multicellular
tiagent Systems, Richland, SC data. bioRxiv 2016:090696
35. Lardon LA et al (2011) iDynoMiCS: next- 44. Donachie WD (1968) Relationship between
generation individual-based modeling of bio- cell size and time of initiation of DNA replica-
films. Environ Microbiol 13(9):2416–2434 tion. Nature 219(5158):1077
Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 2349,
https://doi.org/10.1007/978-1-0716-1585-0, © Springer Science+Business Media, LLC, part of Springer Nature 2022

Sequencing Yersinia pestis (YP) ...................................... 179, 180, 184
genome ...........................11, 162, 193–196, 204, 294
metatranscriptomic ........................................ 137–146
Z
RNA sequencing (RNA-seq)..........81, 138, 204, 316 Zinc finger nucleases (ZFNs), see Nucleases
transposon sequencing (TN-seq) .................. 292, 309
